CN116204694A - Multi-modal retrieval method based on deep learning and hash algorithm - Google Patents

Multi-modal retrieval method based on deep learning and hash algorithm

Info

Publication number
CN116204694A
Authority
CN
China
Prior art keywords
text
data
picture
retrieval
encoder
Prior art date
Legal status
Pending
Application number
CN202310126081.9A
Other languages
Chinese (zh)
Inventor
欧中洪
罗中李
宋美娜
尧思远
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202310126081.9A
Publication of CN116204694A
Legal status: Pending

Classifications

    • G06F16/953: Querying, e.g. by the use of web search engines (G06F16/95 Retrieval from the web)
    • G06F16/3344: Query execution using natural language analysis (G06F16/33 Querying of unstructured textual data)
    • G06F16/583: Retrieval of still image data using metadata automatically derived from the content
    • G06F16/9538: Presentation of query results
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a multi-modal retrieval method based on deep learning and a hash algorithm, which comprises: acquiring the multi-modal data to be retrieved; performing text semantic retrieval on the multi-modal data with a method based on deep learning and hash coding to obtain text data; performing picture retrieval on the multi-modal data with a gray-level comparison method and Elasticsearch to obtain picture data; mapping the text data and the picture data into a unified vector subspace using a dual-encoder-based model; obtaining the similarity distance between the text data and the picture data in the vector subspace by computing cosine similarity, yielding a candidate set; and computing similarity over the candidate set with a fusion-encoder-based model and ranking the results to obtain the retrieval result for the multi-modal data. The method enables efficient, high-precision multi-modal retrieval on large-scale data sets.

Description

Multi-modal retrieval method based on deep learning and hash algorithm
Technical Field
The present invention belongs to the field of data processing technology.
Background
With the continued development of the internet, the scale of data on the network keeps growing, and the rapid spread of intelligent terminal devices has caused multi-modal data on the internet to grow explosively, so that the required information can no longer be found quickly by manual means. How to retrieve the information a user needs from massive multi-modal data quickly, efficiently, and accurately is therefore a problem to be solved.
The current mainstream multi-modal retrieval model architectures are encoder-based and mainly comprise fusion-encoder-based architectures and dual-encoder-based architectures.
The main idea of the fusion-encoder-based architecture is to first convert picture data and text data into features, fuse the picture and text features, and then feed the fused features into the fusion encoder so that the model learns a function that measures cross-modal similarity; cross-modal retrieval is finally realized using the image-text similarity produced by this function. This retrieval scheme must compute the similarity between the user's input and all image-text data in the database and then sort by similarity to obtain the retrieval results. When the retrieval task is run on a large-scale data set, the scheme therefore suffers from low retrieval efficiency, high hardware resource overhead, and similar problems.
The main idea of the dual-encoder-based architecture is to encode image and text data with an image encoder and a text encoder respectively, map them into a unified vector subspace, and finally realize cross-modal retrieval by computing cosine similarity. This scheme can pre-encode the image-text data and build a vector database, so large-scale image-text retrieval tasks become feasible. However, during training the scheme cannot guarantee sufficient information interaction and information sharing between data of different modalities, and the image-text data cannot achieve high-quality semantic alignment, so it suffers from low retrieval precision and similar problems.
Aiming at the dynamic, multi-source, and multi-modal character of data on today's internet, the invention provides a multi-modal retrieval system based on deep learning and a hash algorithm that ensures both the accuracy and the speed of cross-modal retrieval, organically combines single-modal image and text retrieval with cross-modal image-text retrieval, further improves the performance of the retrieval system, and realizes efficient, high-precision multi-modal retrieval on large-scale data sets.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems in the related art to some extent.
Therefore, the invention aims to provide a multi-modal retrieval method based on deep learning and a hash algorithm, which is used to realize efficient, high-precision multi-modal retrieval tasks on large-scale data sets.
To achieve the above objective, an embodiment of a first aspect of the present invention provides a multi-modal retrieval method based on deep learning and a hash algorithm, including:
acquiring multi-modal data to be retrieved; performing text semantic retrieval on the multi-modal data with a method based on deep learning and hash coding to obtain text data; performing picture retrieval on the multi-modal data with a gray-level comparison method and Elasticsearch to obtain picture data;
mapping the text data and the picture data into a unified vector subspace using a dual-encoder-based model;
obtaining the similarity distance between the text data and the picture data in the vector subspace by computing cosine similarity, yielding a candidate set;
and computing similarity over the candidate set with a fusion-encoder-based model and ranking the results to obtain the retrieval result for the multi-modal data.
In addition, the multi-modal retrieval method based on deep learning and a hash algorithm according to the embodiment of the invention may further have the following additional technical features:
Further, in one embodiment of the present invention, the text semantic retrieval of the multi-modal data based on the method of deep learning and hash coding includes:
performing audio retrieval on the multi-modal data through speech recognition to obtain text data;
computing the Hamming-space distance similarity between the retrieved text data and all texts in the database, and forming a recall set from the L texts with the closest similarity;
and computing the distance similarity between the continuous vector obtained from the BERT pre-training model and the hash codes of the recall set, and outputting the results with the smallest distance.
Further, in one embodiment of the present invention, the picture retrieval of the multi-modal data based on the gray-level comparison method and Elasticsearch includes:
extracting each frame of the video data in the multi-modal data, mapping the frames into a unified LUV color space, and computing the absolute distance between each frame and the previous frame;
sorting all extracted frames by absolute distance, the top-ranked frames being the pictures that best represent the video content;
and converting the picture data in the multi-modal data into a matrix and realizing picture retrieval through Elasticsearch.
Further, in an embodiment of the present invention, the mapping of the text data and the picture data into a unified vector subspace using a dual-encoder-based model includes:
training a picture encoder and a text encoder with a contrastive learning method on the basis of a Multiway Transformer pre-training model; the training of the picture encoder and the text encoder using contrastive learning includes:
performing linear transformation and regularization operations in an encoder comprising a multi-head self-attention module, a visual feed-forward network, and a text feed-forward network to obtain picture vectors v_i and text vectors t_i, used to compute the image-to-text and text-to-image similarities:

s^{i2t}_{i,j} = v_i^T t_j,    s^{t2i}_{i,j} = t_i^T v_j

p^{i2t}_{i,j} = exp(s^{i2t}_{i,j} / σ) / Σ_{k=1}^{N} exp(s^{i2t}_{i,k} / σ)

p^{t2i}_{i,j} = exp(s^{t2i}_{i,j} / σ) / Σ_{k=1}^{N} exp(s^{t2i}_{i,k} / σ)

where s^{i2t}_{i,j} denotes the image-to-text similarity between the picture in the i-th image-text matching pair and the text in the j-th image-text matching pair, s^{t2i}_{i,j} denotes the text-to-image similarity between the text in the i-th matching pair and the picture in the j-th matching pair, t_i and v_j are the regularized vector representations of the text in the i-th matching pair and the picture in the j-th matching pair, σ is a temperature parameter, and p^{i2t}_{i,j} and p^{t2i}_{i,j} are the similarities after regularization and the softmax operation;
and using the image-to-text and text-to-image similarities with a cross-entropy loss function to carry out contrastive training of the model, obtaining the final picture encoder and text encoder.
Further, in one embodiment of the present invention, the fine-tuning on the basis of the Multiway Transformer pre-training model comprises:
inputting the picture characterization data into the pre-training model and training the visual feed-forward network and the multi-head self-attention module;
freezing the parameters of the resulting visual feed-forward network and multi-head self-attention module, inputting text characterization data into the pre-training model, and training the text feed-forward network with masked-language-model self-supervised learning;
and training the whole pre-training model with image-text matching pairs to obtain the final pre-training model for the subsequent construction of the dual encoder and the fusion encoder.
Further, in one embodiment of the present invention, the computing and ranking of similarity over the candidate set using a fusion-encoder-based model includes:
on the basis of the Multiway Transformer pre-training model, optimizing the model with a masking mechanism, a contrastive learning method, and a cross-entropy loss function to obtain the final fusion encoder, which is used for the ranking task in image-text cross-modal retrieval.
To achieve the above objective, an embodiment of a second aspect of the present invention provides a multi-modal retrieval device based on deep learning and a hash algorithm, including the following modules:
an acquisition module for acquiring multi-modal data to be retrieved, performing text semantic retrieval on the multi-modal data with a method based on deep learning and hash coding to obtain text data, and performing picture retrieval on the multi-modal data with a gray-level comparison method and Elasticsearch to obtain picture data;
a mapping module for mapping the text data and the picture data into a unified vector subspace using a dual-encoder-based model;
a computing module for obtaining the similarity distance between the text data and the picture data in the vector subspace by computing cosine similarity, yielding a candidate set;
and a ranking module for computing similarity over the candidate set with the fusion-encoder-based model and ranking the results to obtain the retrieval result for the multi-modal data.
To achieve the above objective, an embodiment of a third aspect of the present invention provides a computer device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the multi-modal retrieval method based on deep learning and a hash algorithm described above when executing the computer program.
To achieve the above objective, an embodiment of a fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the multi-modal retrieval method based on deep learning and a hash algorithm described above.
The multi-modal retrieval method based on deep learning and a hash algorithm provided by the embodiment of the invention uses a Multiway-Transformer-based pre-training model so that data of different modalities achieve sufficient information interaction and information sharing during encoding. On the basis of the pre-training model, the invention constructs a dual-encoder-based model to map image-text data into a unified vector subspace and realizes coarse recall by computing cosine similarity, ensuring retrieval efficiency on large-scale data sets; meanwhile, the invention constructs a fusion-encoder-based model to realize precise ranking of the k-candidate set, ensuring retrieval precision; in addition, the invention provides a large-scale single-modal text semantic retrieval scheme based on deep learning and a hash algorithm and a large-scale single-modal precise picture retrieval scheme based on gray-level comparison characterization, and organically combines them with the image-text cross-modal technique, finally realizing large-scale, efficient, high-performance multi-modal retrieval.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a schematic flow chart of a multi-modal retrieval method based on deep learning and a hash algorithm according to an embodiment of the present invention.
Fig. 2 is a text semantic retrieval model architecture diagram based on deep learning and a hash algorithm according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a cross-modal retrieval system according to an embodiment of the present invention.
Fig. 4 is a diagram of a dual-encoder-based model architecture according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a fusion encoder according to an embodiment of the present invention.
Fig. 6 is a diagram of the architecture of a multi-modal retrieval system based on deep learning and a hash algorithm according to an embodiment of the present invention.
Fig. 7 is a schematic structural diagram of a multi-modal retrieval device based on deep learning and a hash algorithm according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
The following describes the multi-modal retrieval method based on deep learning and a hash algorithm according to an embodiment of the present invention with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a multi-modal retrieval method based on deep learning and a hash algorithm according to an embodiment of the present invention.
As shown in fig. 1, the multi-modal retrieval method based on deep learning and a hash algorithm comprises the following steps:
S101: acquiring multi-modal data to be retrieved; performing text semantic retrieval on the multi-modal data with a method based on deep learning and hash coding to obtain text data; performing picture retrieval on the multi-modal data with a gray-level comparison method and Elasticsearch to obtain picture data;
S102: mapping the text data and the picture data into a unified vector subspace using a dual-encoder-based model;
S103: obtaining the similarity distance between the text data and the picture data in the vector subspace by computing cosine similarity, yielding a candidate set;
S104: and computing similarity over the candidate set with a fusion-encoder-based model and ranking the results to obtain the retrieval result for the multi-modal data.
Further, in one embodiment of the present invention, the text semantic retrieval of the multi-modal data based on the method of deep learning and hash coding includes:
performing audio retrieval on the multi-modal data through speech recognition to obtain text data;
computing the Hamming-space distance similarity between the retrieved text data and all texts in the database, and forming a recall set from the L texts with the closest similarity;
and computing the distance similarity between the continuous vector obtained from the BERT pre-training model and the hash codes of the recall set, and outputting the results with the smallest distance.
Specifically, the invention provides text semantic retrieval based on deep learning and a hash algorithm. Text content can be encoded into continuous vectors with the BERT pre-training model to extract the semantic information in the text, but doing so consumes a great deal of hardware resources and hurts retrieval efficiency. The invention therefore constructs a hash coding layer based on a scaled tanh function, integrates it into the BERT pre-training model, and replaces the continuous vectors with dense binary codes, greatly improving retrieval efficiency and reducing the required hardware resources with almost no loss of retrieval precision. The text semantic retrieval model architecture is shown in fig. 2.
To improve efficiency and reduce hardware resource consumption as much as possible without losing precision at retrieval time, the invention divides the retrieval process into two stages: candidate set generation and re-ranking. In the candidate set generation stage, the invention recalls the L closest texts by computing the Hamming-space distance between the input text and all texts in the database. In the re-ranking stage, the invention computes the distance similarity between the continuous vector obtained from BERT and the hash codes of the recalled candidates, and outputs the K results with the smallest distance.
The invention recalls the candidate set with the hash-coding-based method to guarantee retrieval speed and reduce hardware resource consumption, and then precisely ranks the recalled candidate set with the deep-learning-based method to guarantee retrieval precision, as sketched below.
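A minimal sketch of this two-stage scheme in Python, assuming NumPy, a 768-dimensional BERT [CLS] vector as input, a 64-bit code size, and an illustrative tanh scale ALPHA; the learned hash projection is stood in for by a random matrix, so this illustrates the search flow rather than the trained model:

import numpy as np

ALPHA = 10.0                          # scale of the scaled tanh (assumed value)
W = np.random.randn(768, 64) * 0.02   # hash projection; learned in the real model

def hash_code(vec):
    soft = np.tanh(ALPHA * vec @ W)       # differentiable surrogate in (-1, 1)
    return (soft > 0).astype(np.uint8)    # dense binary code used at search time

def hamming(a, b):
    return np.count_nonzero(a != b, axis=-1)

def search(q_vec, db_codes, L=100, K=10):
    # Stage 1, candidate set generation: L nearest codes in Hamming space.
    cand = np.argsort(hamming(hash_code(q_vec), db_codes))[:L]
    # Stage 2, re-ranking: compare the continuous query representation with the
    # (+1/-1)-valued candidate codes and keep the K smallest distances.
    signed = db_codes[cand].astype(np.float32) * 2.0 - 1.0
    q_soft = np.tanh(ALPHA * q_vec @ W)
    order = np.argsort(np.linalg.norm(q_soft - signed, axis=1))[:K]
    return cand[order]                    # database indices of the top-K results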
Further, in one embodiment of the present invention, the picture retrieval of the multi-modal data based on the gray-level comparison method and Elasticsearch includes:
extracting each frame of the video data in the multi-modal data, mapping the frames into a unified LUV color space, and computing the absolute distance between each frame and the previous frame;
sorting all extracted frames by absolute distance, the top-ranked frames being the pictures that best represent the video content;
and converting the picture data in the multi-modal data into a matrix and then realizing picture retrieval through Elasticsearch.
The invention adopts a gray-level-comparison-based method and Elasticsearch to realize a large-scale, efficient, precise picture retrieval task; this module is described in detail below.
First, a picture is divided into a 10×10 grid of blocks, giving 9×9 interior grid points; then, a rectangular area of 5×5 pixels is fixed around each grid point and its average gray level is computed; for each rectangular area, an array of 8 elements is computed, representing the gap between the average gray level of that area and those of the 8 surrounding areas, with the gap level encoded as -2, -1, 0, 1, or 2. A picture can thus be characterized as an 81×8 matrix.
The invention uses this picture-feature extraction method to convert each picture into an 81×8 matrix and stores the matrices in Elasticsearch to build the database for picture retrieval. When a user inputs a picture, it is first converted into an 81×8 matrix, and the search capabilities of Elasticsearch are then used to retrieve the picture precisely, as sketched below.
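A minimal sketch of the 81×8 fingerprint, assuming OpenCV for image loading; the cut points that map a gray-level gap onto the five levels -2..2 are illustrative assumptions, since the description only states that five levels are used:

import cv2
import numpy as np

LEVELS = np.array([-2, -1, 0, 1, 2], dtype=np.int8)
CUTS = [-16.0, -4.0, 4.0, 16.0]   # assumed thresholds between the five gap levels

def fingerprint(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE).astype(np.float32)
    h, w = gray.shape
    # A 10x10 grid of blocks has a 9x9 lattice of interior grid points.
    ys = [round(h * i / 10) for i in range(1, 10)]
    xs = [round(w * j / 10) for j in range(1, 10)]
    # Average gray level of a 5x5 window centred on each grid point.
    mean = np.array([[gray[y-2:y+3, x-2:x+3].mean() for x in xs] for y in ys])
    padded = np.pad(mean, 1, mode="edge")   # border points compare against themselves
    feats = np.zeros((9, 9, 8), dtype=np.int8)
    neighbours = [(-1,-1), (-1,0), (-1,1), (0,-1), (0,1), (1,-1), (1,0), (1,1)]
    for i in range(9):
        for j in range(9):
            for k, (dy, dx) in enumerate(neighbours):
                gap = mean[i, j] - padded[i+1+dy, j+1+dx]
                feats[i, j, k] = LEVELS[np.searchsorted(CUTS, gap)]
    return feats.reshape(81, 8)   # one 8-element row per grid point

The resulting matrix can then be flattened and stored in Elasticsearch for exact or near-exact matching; the indexing scheme itself is not detailed in the patent.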
The invention adopts an intelligent video frame extraction technique to realize the video retrieval task. First, each frame of a video is extracted and mapped into a unified LUV color space, and the absolute distance between each frame and the previous frame is computed; the larger the absolute distance, the more drastically the frame changes relative to the previous one. Finally, all extracted frames are sorted by the computed absolute distance, and the top-ranked frames are taken as the pictures that best represent the video content, realizing video retrieval. The procedure is sketched below.
The invention constructs the dual encoder and the fusion encoder on the basis of a Multiway Transformer pre-training model, so that data of different modalities can fully interact and share information. A dual-encoder-based model maps the picture and text data into a unified vector subspace, coarse recall is realized by computing cosine similarity, precise ranking is realized with a fusion-encoder-based model, and the retrieval results are finally returned to the user for display. The architecture of the image-text cross-modal retrieval system is shown in fig. 3.
The invention uses a Multiway-Transformer-based pre-training model, so that data of different modalities can fully interact and share information during encoding. The Multiway Transformer consists of a shared multi-head self-attention module and several feed-forward networks. The visual feed-forward network and the text feed-forward network process picture data and text data respectively and are used to realize the dual encoder, while the vision-text feed-forward network processes image-text matching pairs and realizes the fusion encoder. The architecture learns the features of data of different modalities through the shared multi-head self-attention module and aligns the features of the different modalities, making the fusion of multi-modal information tighter. A sketch of one such layer follows.
Model training proceeds as follows: 1) picture characterizations are input into the model to train the visual feed-forward network and the multi-head self-attention module; 2) the parameters of the visual feed-forward network and the multi-head self-attention module are frozen, text characterization data are input into the model, and the text feed-forward network is trained with masked-language-modeling self-supervised learning; 3) the whole model is trained with image-text matching pairs to obtain the final pre-training model for the subsequent construction of the dual encoder and the fusion encoder. The staged freezing is sketched below.
The invention constructs a dual-encoder-based model that encodes the image and text data separately, realizing the coarse-recall task of image-text cross-modal retrieval. The dual-encoder-based model architecture is shown in fig. 4. To realize this encoder, the invention fine-tunes the Multiway Transformer pre-training model and trains the picture encoder and the text encoder with a contrastive learning method.
A training batch contains N image-text matching pairs. The goal of contrastive learning is to predict the correct matches among the N × N possible pairings, of which only N image-text pairs are positive matches; the remaining N² - N pairs are negative matches. The invention performs linear transformation and regularization operations in an encoder comprising a multi-head self-attention module, a visual feed-forward network, and a text feed-forward network to obtain picture vectors v_i and text vectors t_i, which are used to compute the image-to-text and text-to-image similarities.
Further, in one embodiment of the present invention, mapping the text data and the picture data into a unified vector subspace using a dual-encoder-based model comprises:
training a picture encoder and a text encoder with a contrastive learning method on the basis of the Multiway Transformer pre-training model; training the picture encoder and the text encoder using contrastive learning includes:
performing linear transformation and regularization operations in an encoder comprising a multi-head self-attention module, a visual feed-forward network, and a text feed-forward network to obtain picture vectors v_i and text vectors t_i, used to compute the image-to-text and text-to-image similarities:

s^{i2t}_{i,j} = v_i^T t_j,    s^{t2i}_{i,j} = t_i^T v_j

p^{i2t}_{i,j} = exp(s^{i2t}_{i,j} / σ) / Σ_{k=1}^{N} exp(s^{i2t}_{i,k} / σ)

p^{t2i}_{i,j} = exp(s^{t2i}_{i,j} / σ) / Σ_{k=1}^{N} exp(s^{t2i}_{i,k} / σ)

where s^{i2t}_{i,j} denotes the image-to-text similarity between the picture in the i-th image-text matching pair and the text in the j-th image-text matching pair, s^{t2i}_{i,j} denotes the text-to-image similarity between the text in the i-th matching pair and the picture in the j-th matching pair, t_i and v_j are the regularized vector representations of the text in the i-th matching pair and the picture in the j-th matching pair, σ is a temperature parameter, and p^{i2t}_{i,j} and p^{t2i}_{i,j} are the similarities after regularization and the softmax operation;
and using the image-to-text and text-to-image similarities with a cross-entropy loss function to carry out contrastive training of the model, obtaining the final picture encoder and text encoder; the objective is sketched below.
The picture encoder and the text encoder encode the picture and text data respectively and map them into a common subspace; the similarity distance between pictures and texts in that subspace is then obtained by computing cosine similarity, realizing image-text cross-modal retrieval. Because this scheme is highly efficient, it is used for the coarse-recall task in the large-scale image-text retrieval system, as sketched below.
Further, in one embodiment of the present invention, the fine-tuning on the basis of the Multiway Transformer pre-training model comprises:
inputting the picture characterization data into the pre-training model and training the visual feed-forward network and the multi-head self-attention module;
freezing the parameters of the resulting visual feed-forward network and multi-head self-attention module, inputting text characterization data into the pre-training model, and training the text feed-forward network with masked-language-model self-supervised learning;
and training the whole pre-training model with image-text matching pairs to obtain the final pre-training model for the subsequent construction of the dual encoder and the fusion encoder.
Further, in one embodiment of the invention, computing and ranking similarity over the candidate set using a fusion-encoder-based model includes:
on the basis of the Multiway Transformer pre-training model, optimizing the model with a masking mechanism, a contrastive learning method, and a cross-entropy loss function to obtain the final fusion encoder, which is used for the ranking task in image-text cross-modal retrieval.
When a fusion-encoder-based model architecture realizes image-text cross-modal retrieval, every possible image-text pairing must be jointly encoded to obtain a similarity score before re-ranking, which finally yields the retrieval result. When the data volume is large this scheme is too slow to be practical, but its precision is higher than that of the dual-encoder-based architecture. Therefore, on the basis of the Multiway Transformer pre-training model, the invention fine-tunes the model with a masking mechanism, a contrastive learning method, and a cross-entropy loss function to obtain the final fusion encoder, which is used for the precise-ranking task in image-text cross-modal retrieval. The fusion-encoder-based model architecture is shown in fig. 5.
The above is the complete flow of the multi-modal retrieval method based on deep learning and a hash algorithm; fig. 6 shows the overall architecture of the invention.
The multi-modal retrieval method based on deep learning and a hash algorithm provided by the embodiment of the invention realizes large-scale text semantic retrieval with a method based on deep learning and hash coding and realizes audio retrieval with speech recognition; it realizes large-scale precise picture retrieval with the gray-level comparison method and Elasticsearch and realizes video retrieval with the intelligent video frame extraction technique; and it realizes large-scale image-text cross-modal retrieval with the method based on the Multiway Transformer pre-training model and the encoders, guaranteeing both retrieval precision and speed. Compared with the current mainstream retrieval techniques, the invention has the following advantages:
1) Existing deep-learning-based retrieval methods are precise but inefficient; the proposed scheme combines the BERT pre-training model with a hash encoder based on a scaled tanh function, realizing text semantic retrieval with unchanged precision and higher efficiency. Image data are high-dimensional, so picture retrieval tends to be inefficient; the proposed scheme combines the gray-level comparison method with Elasticsearch to realize precise picture retrieval with higher precision and higher speed.
2) The image-text cross-modal retrieval scheme based on the Multiway Transformer pre-training model and the encoders solves the insufficient precision, low efficiency, and similar problems of existing image-text cross-modal retrieval systems. The Multiway Transformer pre-training model lets data of different modalities fully interact and share information; the pre-training model is used to construct the dual encoder and the fusion encoder, the dual encoder then performs the coarse-recall task to obtain the k-candidate set, greatly improving retrieval efficiency, and finally the fusion encoder computes similarity over the k-candidate set to complete the precise-ranking task, guaranteeing retrieval accuracy.
To realize the above embodiments, the invention also provides a multi-modal retrieval device based on deep learning and a hash algorithm.
Fig. 7 is a schematic structural diagram of a multi-modal retrieval device based on deep learning and a hash algorithm according to an embodiment of the present invention.
As shown in fig. 7, the multi-modal retrieval device based on deep learning and a hash algorithm includes: an acquisition module 100, a mapping module 200, a computing module 300, and a ranking module 400, wherein:
the acquisition module is used for acquiring multi-modal data to be retrieved, performing text semantic retrieval on the multi-modal data with a method based on deep learning and hash coding to obtain text data, and performing picture retrieval on the multi-modal data with a gray-level comparison method and Elasticsearch to obtain picture data;
the mapping module is used for mapping the text data and the picture data into a unified vector subspace using a dual-encoder-based model;
the computing module is used for obtaining the similarity distance between the text data and the picture data in the vector subspace by computing cosine similarity, yielding a candidate set;
and the ranking module is used for computing similarity over the candidate set with the fusion-encoder-based model and ranking the results to obtain the retrieval result for the multi-modal data.
To achieve the above objective, an embodiment of a third aspect of the present invention provides a computer device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the multi-modal retrieval method based on deep learning and a hash algorithm described above when executing the computer program.
To achieve the above objective, an embodiment of a fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the multi-modal retrieval method based on deep learning and a hash algorithm described above.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (9)

1. A multi-modal retrieval method based on deep learning and a hash algorithm, characterized by comprising the following steps:
acquiring multi-modal data to be retrieved; performing text semantic retrieval on the multi-modal data with a method based on deep learning and hash coding to obtain text data; performing picture retrieval on the multi-modal data with a gray-level comparison method and Elasticsearch to obtain picture data;
mapping the text data and the picture data into a unified vector subspace using a dual-encoder-based model;
obtaining the similarity distance between the text data and the picture data in the vector subspace by computing cosine similarity, yielding a candidate set;
and computing similarity over the candidate set with a fusion-encoder-based model and ranking the results to obtain the retrieval result for the multi-modal data.
2. The method of claim 1, wherein the deep-learning-and-hash-coding-based text semantic retrieval of the multi-modal data comprises:
performing audio retrieval on the multi-modal data through speech recognition to obtain text data;
computing the Hamming-space distance similarity between the retrieved text data and all texts in the database, and forming a recall set from the L texts with the closest similarity;
and computing the distance similarity between the continuous vector obtained from the BERT pre-training model and the hash codes of the recall set, and outputting the results with the smallest distance.
3. The method of claim 1, wherein the picture retrieval of the multi-modal data based on the gray-level comparison method and Elasticsearch comprises:
extracting each frame of the video data in the multi-modal data, mapping the frames into a unified LUV color space, and computing the absolute distance between each frame and the previous frame;
sorting all extracted frames by absolute distance, the top-ranked frames being the pictures that best represent the video content;
and converting the picture data in the multi-modal data into a matrix and realizing picture retrieval through Elasticsearch.
4. The method of claim 1, wherein the mapping of the text data and the picture data into a unified vector subspace using a dual-encoder-based model comprises:
training a picture encoder and a text encoder with a contrastive learning method on the basis of a Multiway Transformer pre-training model; the training of the picture encoder and the text encoder using contrastive learning includes:
performing linear transformation and regularization operations in an encoder comprising a multi-head self-attention module, a visual feed-forward network, and a text feed-forward network to obtain picture vectors v_i and text vectors t_i, used to compute the image-to-text and text-to-image similarities:

s^{i2t}_{i,j} = v_i^T t_j,    s^{t2i}_{i,j} = t_i^T v_j

p^{i2t}_{i,j} = exp(s^{i2t}_{i,j} / σ) / Σ_{k=1}^{N} exp(s^{i2t}_{i,k} / σ)

p^{t2i}_{i,j} = exp(s^{t2i}_{i,j} / σ) / Σ_{k=1}^{N} exp(s^{t2i}_{i,k} / σ)

where s^{i2t}_{i,j} denotes the image-to-text similarity between the picture in the i-th image-text matching pair and the text in the j-th image-text matching pair, s^{t2i}_{i,j} denotes the text-to-image similarity between the text in the i-th matching pair and the picture in the j-th matching pair, t_i and v_j are the regularized vector representations of the text in the i-th matching pair and the picture in the j-th matching pair, σ is a temperature parameter, and p^{i2t}_{i,j} and p^{t2i}_{i,j} are the similarities after regularization and the softmax operation;
and using the image-to-text and text-to-image similarities with a cross-entropy loss function to carry out contrastive training of the model, obtaining the final picture encoder and text encoder.
5. The method of claim 4, wherein the fine-tuning on the basis of the Multiway Transformer pre-training model comprises:
inputting the picture characterization data into the pre-training model and training the visual feed-forward network and the multi-head self-attention module;
freezing the parameters of the resulting visual feed-forward network and multi-head self-attention module, inputting text characterization data into the pre-training model, and training the text feed-forward network with masked-language-model self-supervised learning;
and training the whole pre-training model with image-text matching pairs to obtain the final pre-training model for the subsequent construction of the dual encoder and the fusion encoder.
6. The method of claim 1, wherein the computing and ranking of similarity over the candidate set using a fusion-encoder-based model comprises:
on the basis of the Multiway Transformer pre-training model, optimizing the model with a masking mechanism, a contrastive learning method, and a cross-entropy loss function to obtain the final fusion encoder, which is used for the ranking task in image-text cross-modal retrieval.
7. A multi-modal retrieval device based on deep learning and a hash algorithm, characterized by comprising the following modules:
an acquisition module for acquiring multi-modal data to be retrieved, performing text semantic retrieval on the multi-modal data with a method based on deep learning and hash coding to obtain text data, and performing picture retrieval on the multi-modal data with a gray-level comparison method and Elasticsearch to obtain picture data;
a mapping module for mapping the text data and the picture data into a unified vector subspace using a dual-encoder-based model;
a computing module for obtaining the similarity distance between the text data and the picture data in the vector subspace by computing cosine similarity, yielding a candidate set;
and a ranking module for computing similarity over the candidate set with the fusion-encoder-based model and ranking the results to obtain the retrieval result for the multi-modal data.
8. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the multi-modal retrieval method based on deep learning and a hash algorithm as in any one of claims 1-7 when executing the computer program.
9. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the multi-modal retrieval method based on deep learning and a hash algorithm as in any one of claims 1-7.
CN202310126081.9A 2023-02-15 2023-02-15 Multi-modal retrieval method based on deep learning and hash algorithm Pending CN116204694A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310126081.9A | 2023-02-15 | 2023-02-15 | CN116204694A (en) Multi-modal retrieval method based on deep learning and hash algorithm

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310126081.9A | 2023-02-15 | 2023-02-15 | CN116204694A (en) Multi-modal retrieval method based on deep learning and hash algorithm

Publications (1)

Publication Number | Publication Date
CN116204694A (en) | 2023-06-02

Family

ID=86516835

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310126081.9A | CN116204694A (en) Multi-modal retrieval method based on deep learning and hash algorithm (Pending) | 2023-02-15 | 2023-02-15

Country Status (1)

Country Link
CN (1) CN116204694A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117056543A (en) * 2023-08-21 2023-11-14 数据空间研究院 Multi-mode patent retrieval method based on images
CN116932731A (en) * 2023-09-18 2023-10-24 上海帜讯信息技术股份有限公司 Multi-mode knowledge question-answering method and system for 5G message
CN116932731B (en) * 2023-09-18 2024-01-30 上海帜讯信息技术股份有限公司 Multi-mode knowledge question-answering method and system for 5G message
CN117173517A (en) * 2023-11-03 2023-12-05 中国科学院空天信息创新研究院 Time sequence data processing method, device, equipment and medium oriented to space-sky heterogeneous

Similar Documents

Publication Publication Date Title
CN116204694A (en) Multi-mode retrieval method based on deep learning and hash algorithm
CN110046656B (en) Multi-mode scene recognition method based on deep learning
CN114461839B (en) Multi-mode pre-training-based similar picture retrieval method and device and electronic equipment
Zhou et al. Exploiting operation importance for differentiable neural architecture search
CN114419387A (en) Cross-modal retrieval system and method based on pre-training model and recall ranking
CN113076465A (en) Universal cross-modal retrieval model based on deep hash
CN112948601B (en) Cross-modal hash retrieval method based on controlled semantic embedding
CN117688132A (en) Intelligent retrieval method and system based on big data
CN110990596A (en) Multi-mode hash retrieval method and system based on self-adaptive quantization
CN109783691A (en) A kind of video retrieval method of deep learning and Hash coding
CN115375877A (en) Three-dimensional point cloud classification method and device based on channel attention mechanism
CN112989120A (en) Video clip query system and video clip query method
CN118113815B (en) Content searching method, related device and medium
CN111090765B (en) Social image retrieval method and system based on missing multi-modal hash
CN117669693A (en) Knowledge distillation method and system based on multi-teacher multi-mode model
CN118069877A (en) Lightweight multi-mode image description generation method based on CLIP encoder
CN117610658A (en) Knowledge graph data dynamic updating method and system based on artificial intelligence
CN116595343B (en) Manifold ordering learning-based online unsupervised cross-modal retrieval method and system
CN117634459A (en) Target content generation and model training method, device, system, equipment and medium
CN111259176B (en) Cross-modal Hash retrieval method based on matrix decomposition and integrated with supervision information
CN112199531A (en) Cross-modal retrieval method and device based on Hash algorithm and neighborhood map
CN115840798A (en) Cross-modal retrieval method based on topic model
Peng et al. Temporal pyramid transformer with multimodal interaction for video question answering
CN114330239A (en) Text processing method and device, storage medium and electronic equipment
CN114548293A (en) Video-text cross-modal retrieval method based on cross-granularity self-distillation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination