CN116204694A - Multi-modal retrieval method based on deep learning and hash algorithm - Google Patents

Multi-modal retrieval method based on deep learning and hash algorithm

Info

Publication number
CN116204694A
Authority
CN
China
Prior art keywords
text
data
picture
retrieval
encoder
Prior art date
Legal status
Pending
Application number
CN202310126081.9A
Other languages
Chinese (zh)
Inventor
欧中洪
罗中李
宋美娜
尧思远
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202310126081.9A
Publication of CN116204694A
Legal status: Pending

Classifications

    • G06F16/953: Querying, e.g. by the use of web search engines (G06F16/95 Retrieval from the web)
    • G06F16/3344: Query execution using natural language analysis (G06F16/33 Querying of unstructured textual data)
    • G06F16/583: Retrieval of still image data using metadata automatically derived from the content
    • G06F16/9538: Presentation of query results
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a multi-modal retrieval method based on deep learning and a hash algorithm, which comprises: acquiring the multi-modal data to be retrieved; performing text semantic retrieval on the multi-modal data with a method based on deep learning and hash coding to obtain text data; performing picture retrieval on the multi-modal data with a gray-level comparison method and Elasticsearch to obtain picture data; mapping the text data and the picture data into a unified vector subspace using a dual-encoder-based model; obtaining the similarity distance between the text data and the picture data in the vector subspace by computing cosine similarity, yielding a candidate set; and computing similarity over the candidate set with a fusion-encoder-based model and ranking the results to obtain the retrieval result for the multi-modal data. The method enables efficient, high-precision multi-modal retrieval on large-scale data sets.

Description

Multi-modal retrieval method based on deep learning and hash algorithm
Technical Field
The present invention belongs to the field of data processing technology.
Background
With the continued development of the internet, the scale of data on the network keeps growing, and the rapid spread of intelligent terminal devices has caused multi-modal data on the internet to grow explosively, so that the required information can no longer be found quickly by manual means. How to retrieve the information a user needs from massive multi-modal data quickly, efficiently, and accurately is therefore a problem to be solved.
The current mainstream multi-modal retrieval model architectures are encoder-based and mainly comprise fusion-encoder-based architectures and dual-encoder-based architectures.
The main idea of the fusion-encoder-based architecture is to first convert picture data and text data into features, fuse the picture and text features, and then feed the fused features into the fusion encoder so that the model learns a function that measures cross-modal similarity; cross-modal retrieval is finally realized using the image-text similarity produced by this function. This retrieval scheme must compute the similarity between the user's input and all image-text data in the database and then sort by similarity to obtain the retrieval results. When the retrieval task is run on a large-scale data set, the scheme therefore suffers from low retrieval efficiency, high hardware resource overhead, and similar problems.
The main idea of the dual-encoder-based architecture is to encode image and text data with an image encoder and a text encoder respectively, map them into a unified vector subspace, and finally realize cross-modal retrieval by computing cosine similarity. This scheme can pre-encode the image-text data and build a vector database, so large-scale image-text retrieval tasks become feasible. However, during training the scheme cannot guarantee sufficient information interaction and information sharing between data of different modalities, and the image-text data cannot achieve high-quality semantic alignment, so it suffers from low retrieval precision and similar problems.
Aiming at the dynamic, multi-source, and multi-modal character of data on today's internet, the invention provides a multi-modal retrieval system based on deep learning and a hash algorithm that ensures both the accuracy and the speed of cross-modal retrieval, organically combines single-modal image and text retrieval with cross-modal image-text retrieval, further improves the performance of the retrieval system, and realizes efficient, high-precision multi-modal retrieval on large-scale data sets.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems in the related art to some extent.
Therefore, the invention aims to provide a multi-modal retrieval method based on deep learning and a hash algorithm, which is used to realize efficient, high-precision multi-modal retrieval tasks on large-scale data sets.
To achieve the above objective, an embodiment of a first aspect of the present invention provides a multi-modal retrieval method based on deep learning and a hash algorithm, including:
acquiring multi-modal data to be retrieved; performing text semantic retrieval on the multi-modal data with a method based on deep learning and hash coding to obtain text data; performing picture retrieval on the multi-modal data with a gray-level comparison method and Elasticsearch to obtain picture data;
mapping the text data and the picture data into a unified vector subspace using a dual-encoder-based model;
obtaining the similarity distance between the text data and the picture data in the vector subspace by computing cosine similarity, yielding a candidate set;
and computing similarity over the candidate set with a fusion-encoder-based model and ranking the results to obtain the retrieval result for the multi-modal data.
In addition, the multi-modal retrieval method based on deep learning and a hash algorithm according to the embodiment of the invention may further have the following additional technical features:
Further, in one embodiment of the present invention, the text semantic retrieval of the multi-modal data based on the method of deep learning and hash coding includes:
performing audio retrieval on the multi-modal data through speech recognition to obtain text data;
computing the Hamming-space distance similarity between the retrieved text data and all texts in the database, and forming a recall set from the L texts with the closest similarity;
and computing the distance similarity between the continuous vector obtained from the BERT pre-training model and the hash codes of the recall set, and outputting the results with the smallest distance.
Further, in one embodiment of the present invention, the picture retrieval of the multi-modal data based on the gray-level comparison method and Elasticsearch includes:
extracting each frame of the video data in the multi-modal data, mapping the frames into a unified LUV color space, and computing the absolute distance between each frame and the previous frame;
sorting all extracted frames by absolute distance, the top-ranked frames being the pictures that best represent the video content;
and converting the picture data in the multi-modal data into a matrix and realizing picture retrieval through Elasticsearch.
Further, in an embodiment of the present invention, the mapping of the text data and the picture data into a unified vector subspace using a dual-encoder-based model includes:
training a picture encoder and a text encoder with a contrastive learning method on the basis of a Multiway Transformer pre-training model; the training of the picture encoder and the text encoder using contrastive learning includes:
performing linear transformation and regularization operations in an encoder comprising a multi-head self-attention module, a visual feed-forward network, and a text feed-forward network to obtain picture vectors v_i and text vectors t_i, used to compute the image-to-text and text-to-image similarities:

s^{i2t}_{i,j} = v_i^T t_j,    s^{t2i}_{i,j} = t_i^T v_j

p^{i2t}_{i,j} = exp(s^{i2t}_{i,j} / σ) / Σ_{k=1}^{N} exp(s^{i2t}_{i,k} / σ)

p^{t2i}_{i,j} = exp(s^{t2i}_{i,j} / σ) / Σ_{k=1}^{N} exp(s^{t2i}_{i,k} / σ)

where s^{i2t}_{i,j} denotes the image-to-text similarity between the picture in the i-th image-text matching pair and the text in the j-th image-text matching pair, s^{t2i}_{i,j} denotes the text-to-image similarity between the text in the i-th matching pair and the picture in the j-th matching pair, t_i and v_j are the regularized vector representations of the text in the i-th matching pair and the picture in the j-th matching pair, σ is a temperature parameter, and p^{i2t}_{i,j} and p^{t2i}_{i,j} are the similarities after regularization and the softmax operation;
and using the image-to-text and text-to-image similarities with a cross-entropy loss function to carry out contrastive training of the model, obtaining the final picture encoder and text encoder.
Further, in one embodiment of the present invention, the fine-tuning on the basis of the Multiway Transformer pre-training model comprises:
inputting the picture characterization data into the pre-training model and training the visual feed-forward network and the multi-head self-attention module;
freezing the parameters of the resulting visual feed-forward network and multi-head self-attention module, inputting text characterization data into the pre-training model, and training the text feed-forward network with masked-language-model self-supervised learning;
and training the whole pre-training model with image-text matching pairs to obtain the final pre-training model for the subsequent construction of the dual encoder and the fusion encoder.
Further, in one embodiment of the present invention, the computing and ranking of similarity over the candidate set using a fusion-encoder-based model includes:
on the basis of the Multiway Transformer pre-training model, optimizing the model with a masking mechanism, a contrastive learning method, and a cross-entropy loss function to obtain the final fusion encoder, which is used for the ranking task in image-text cross-modal retrieval.
To achieve the above objective, an embodiment of a second aspect of the present invention provides a multi-modal retrieval device based on deep learning and a hash algorithm, including the following modules:
an acquisition module for acquiring multi-modal data to be retrieved, performing text semantic retrieval on the multi-modal data with a method based on deep learning and hash coding to obtain text data, and performing picture retrieval on the multi-modal data with a gray-level comparison method and Elasticsearch to obtain picture data;
a mapping module for mapping the text data and the picture data into a unified vector subspace using a dual-encoder-based model;
a computing module for obtaining the similarity distance between the text data and the picture data in the vector subspace by computing cosine similarity, yielding a candidate set;
and a ranking module for computing similarity over the candidate set with the fusion-encoder-based model and ranking the results to obtain the retrieval result for the multi-modal data.
To achieve the above objective, an embodiment of a third aspect of the present invention provides a computer device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the multi-modal retrieval method based on deep learning and a hash algorithm described above when executing the computer program.
To achieve the above objective, an embodiment of a fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the multi-modal retrieval method based on deep learning and a hash algorithm described above.
The multi-modal retrieval method based on deep learning and a hash algorithm provided by the embodiment of the invention uses a Multiway-Transformer-based pre-training model so that data of different modalities achieve sufficient information interaction and information sharing during encoding. On the basis of the pre-training model, the invention constructs a dual-encoder-based model to map image-text data into a unified vector subspace and realizes coarse recall by computing cosine similarity, ensuring retrieval efficiency on large-scale data sets; meanwhile, the invention constructs a fusion-encoder-based model to realize precise ranking of the k-candidate set, ensuring retrieval precision; in addition, the invention provides a large-scale single-modal text semantic retrieval scheme based on deep learning and a hash algorithm and a large-scale single-modal precise picture retrieval scheme based on gray-level comparison characterization, and organically combines them with the image-text cross-modal technique, finally realizing large-scale, efficient, high-performance multi-modal retrieval.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a schematic flow chart of a multi-modal retrieval method based on deep learning and a hash algorithm according to an embodiment of the present invention.
Fig. 2 is a text semantic retrieval model architecture diagram based on deep learning and a hash algorithm according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a cross-modal retrieval system according to an embodiment of the present invention.
Fig. 4 is a diagram of a dual-encoder-based model architecture according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a fusion encoder according to an embodiment of the present invention.
Fig. 6 is a diagram of the architecture of a multi-modal retrieval system based on deep learning and a hash algorithm according to an embodiment of the present invention.
Fig. 7 is a schematic structural diagram of a multi-modal retrieval device based on deep learning and a hash algorithm according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
The following describes the multi-modal retrieval method based on deep learning and a hash algorithm according to an embodiment of the present invention with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a multi-modal retrieval method based on deep learning and a hash algorithm according to an embodiment of the present invention.
As shown in fig. 1, the multi-modal retrieval method based on deep learning and a hash algorithm comprises the following steps:
S101: acquiring multi-modal data to be retrieved; performing text semantic retrieval on the multi-modal data with a method based on deep learning and hash coding to obtain text data; performing picture retrieval on the multi-modal data with a gray-level comparison method and Elasticsearch to obtain picture data;
S102: mapping the text data and the picture data into a unified vector subspace using a dual-encoder-based model;
S103: obtaining the similarity distance between the text data and the picture data in the vector subspace by computing cosine similarity, yielding a candidate set;
S104: and computing similarity over the candidate set with a fusion-encoder-based model and ranking the results to obtain the retrieval result for the multi-modal data.
Further, in one embodiment of the present invention, the text semantic retrieval of the multi-modal data based on the method of deep learning and hash coding includes:
performing audio retrieval on the multi-modal data through speech recognition to obtain text data;
computing the Hamming-space distance similarity between the retrieved text data and all texts in the database, and forming a recall set from the L texts with the closest similarity;
and computing the distance similarity between the continuous vector obtained from the BERT pre-training model and the hash codes of the recall set, and outputting the results with the smallest distance.
Specifically, the invention provides text semantic retrieval based on deep learning and a hash algorithm. Text content can be encoded into continuous vectors with the BERT pre-training model to extract the semantic information in the text, but doing so consumes a great deal of hardware resources and hurts retrieval efficiency. The invention therefore constructs a hash coding layer based on a scaled tanh function, integrates it into the BERT pre-training model, and replaces the continuous vectors with dense binary codes, greatly improving retrieval efficiency and reducing the required hardware resources with almost no loss of retrieval precision. The text semantic retrieval model architecture is shown in fig. 2.
To improve efficiency and reduce hardware resource consumption as much as possible without losing precision at retrieval time, the invention divides the retrieval process into two stages: candidate set generation and re-ranking. In the candidate set generation stage, the invention recalls the L closest texts by computing the Hamming-space distance between the input text and all texts in the database. In the re-ranking stage, the invention computes the distance similarity between the continuous vector obtained from BERT and the hash codes of the recalled candidates, and outputs the K results with the smallest distance.
The invention recalls the candidate set with the hash-coding-based method to guarantee retrieval speed and reduce hardware resource consumption, and then precisely ranks the recalled candidate set with the deep-learning-based method to guarantee retrieval precision, as sketched below.
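A minimal sketch of this two-stage scheme in Python, assuming NumPy, a 768-dimensional BERT [CLS] vector as input, a 64-bit code size, and an illustrative tanh scale ALPHA; the learned hash projection is stood in for by a random matrix, so this illustrates the search flow rather than the trained model:

import numpy as np

ALPHA = 10.0                          # scale of the scaled tanh (assumed value)
W = np.random.randn(768, 64) * 0.02   # hash projection; learned in the real model

def hash_code(vec):
    soft = np.tanh(ALPHA * vec @ W)       # differentiable surrogate in (-1, 1)
    return (soft > 0).astype(np.uint8)    # dense binary code used at search time

def hamming(a, b):
    return np.count_nonzero(a != b, axis=-1)

def search(q_vec, db_codes, L=100, K=10):
    # Stage 1, candidate set generation: L nearest codes in Hamming space.
    cand = np.argsort(hamming(hash_code(q_vec), db_codes))[:L]
    # Stage 2, re-ranking: compare the continuous query representation with the
    # (+1/-1)-valued candidate codes and keep the K smallest distances.
    signed = db_codes[cand].astype(np.float32) * 2.0 - 1.0
    q_soft = np.tanh(ALPHA * q_vec @ W)
    order = np.argsort(np.linalg.norm(q_soft - signed, axis=1))[:K]
    return cand[order]                    # database indices of the top-K results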
Further, in one embodiment of the present invention, the picture retrieval of the multi-modal data based on the gray-level comparison method and Elasticsearch includes:
extracting each frame of the video data in the multi-modal data, mapping the frames into a unified LUV color space, and computing the absolute distance between each frame and the previous frame;
sorting all extracted frames by absolute distance, the top-ranked frames being the pictures that best represent the video content;
and converting the picture data in the multi-modal data into a matrix and then realizing picture retrieval through Elasticsearch.
The invention adopts a gray-level-comparison-based method and Elasticsearch to realize a large-scale, efficient, precise picture retrieval task; this module is described in detail below.
First, a picture is divided into a 10×10 grid of blocks, giving 9×9 interior grid points; then, a rectangular area of 5×5 pixels is fixed around each grid point and its average gray level is computed; for each rectangular area, an array of 8 elements is computed, representing the gap between the average gray level of that area and those of the 8 surrounding areas, with the gap level encoded as -2, -1, 0, 1, or 2. A picture can thus be characterized as an 81×8 matrix.
The invention uses this picture-feature extraction method to convert each picture into an 81×8 matrix and stores the matrices in Elasticsearch to build the database for picture retrieval. When a user inputs a picture, it is first converted into an 81×8 matrix, and the search capabilities of Elasticsearch are then used to retrieve the picture precisely, as sketched below.
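A minimal sketch of the 81×8 fingerprint, assuming OpenCV for image loading; the cut points that map a gray-level gap onto the five levels -2..2 are illustrative assumptions, since the description only states that five levels are used:

import cv2
import numpy as np

LEVELS = np.array([-2, -1, 0, 1, 2], dtype=np.int8)
CUTS = [-16.0, -4.0, 4.0, 16.0]   # assumed thresholds between the five gap levels

def fingerprint(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE).astype(np.float32)
    h, w = gray.shape
    # A 10x10 grid of blocks has a 9x9 lattice of interior grid points.
    ys = [round(h * i / 10) for i in range(1, 10)]
    xs = [round(w * j / 10) for j in range(1, 10)]
    # Average gray level of a 5x5 window centred on each grid point.
    mean = np.array([[gray[y-2:y+3, x-2:x+3].mean() for x in xs] for y in ys])
    padded = np.pad(mean, 1, mode="edge")   # border points compare against themselves
    feats = np.zeros((9, 9, 8), dtype=np.int8)
    neighbours = [(-1,-1), (-1,0), (-1,1), (0,-1), (0,1), (1,-1), (1,0), (1,1)]
    for i in range(9):
        for j in range(9):
            for k, (dy, dx) in enumerate(neighbours):
                gap = mean[i, j] - padded[i+1+dy, j+1+dx]
                feats[i, j, k] = LEVELS[np.searchsorted(CUTS, gap)]
    return feats.reshape(81, 8)   # one 8-element row per grid point

The resulting matrix can then be flattened and stored in Elasticsearch for exact or near-exact matching; the indexing scheme itself is not detailed in the patent.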
The invention adopts an intelligent video frame extraction technique to realize the video retrieval task. First, each frame of a video is extracted and mapped into a unified LUV color space, and the absolute distance between each frame and the previous frame is computed; the larger the absolute distance, the more drastically the frame changes relative to the previous one. Finally, all extracted frames are sorted by the computed absolute distance, and the top-ranked frames are taken as the pictures that best represent the video content, realizing video retrieval. The procedure is sketched below.
The invention constructs the dual encoder and the fusion encoder on the basis of a Multiway Transformer pre-training model, so that data of different modalities can fully interact and share information. A dual-encoder-based model maps the picture and text data into a unified vector subspace, coarse recall is realized by computing cosine similarity, precise ranking is realized with a fusion-encoder-based model, and the retrieval results are finally returned to the user for display. The architecture of the image-text cross-modal retrieval system is shown in fig. 3.
The invention uses a Multiway-Transformer-based pre-training model, so that data of different modalities can fully interact and share information during encoding. The Multiway Transformer consists of a shared multi-head self-attention module and several feed-forward networks. The visual feed-forward network and the text feed-forward network process picture data and text data respectively and are used to realize the dual encoder, while the vision-text feed-forward network processes image-text matching pairs and realizes the fusion encoder. The architecture learns the features of data of different modalities through the shared multi-head self-attention module and aligns the features of the different modalities, making the fusion of multi-modal information tighter. A sketch of one such layer follows.
Model training proceeds as follows: 1) picture characterizations are input into the model to train the visual feed-forward network and the multi-head self-attention module; 2) the parameters of the visual feed-forward network and the multi-head self-attention module are frozen, text characterization data are input into the model, and the text feed-forward network is trained with masked-language-modeling self-supervised learning; 3) the whole model is trained with image-text matching pairs to obtain the final pre-training model for the subsequent construction of the dual encoder and the fusion encoder. The staged freezing is sketched below.
The invention constructs a dual-encoder-based model that encodes the image and text data separately, realizing the coarse-recall task of image-text cross-modal retrieval. The dual-encoder-based model architecture is shown in fig. 4. To realize this encoder, the invention fine-tunes the Multiway Transformer pre-training model and trains the picture encoder and the text encoder with a contrastive learning method.
A training batch contains N image-text matching pairs. The goal of contrastive learning is to predict the correct matches among the N × N possible pairings, of which only N image-text pairs are positive matches; the remaining N² - N pairs are negative matches. The invention performs linear transformation and regularization operations in an encoder comprising a multi-head self-attention module, a visual feed-forward network, and a text feed-forward network to obtain picture vectors v_i and text vectors t_i, which are used to compute the image-to-text and text-to-image similarities.
Further, in one embodiment of the present invention, mapping the text data and the picture data into a unified vector subspace using a dual-encoder-based model comprises:
training a picture encoder and a text encoder with a contrastive learning method on the basis of the Multiway Transformer pre-training model; training the picture encoder and the text encoder using contrastive learning includes:
performing linear transformation and regularization operations in an encoder comprising a multi-head self-attention module, a visual feed-forward network, and a text feed-forward network to obtain picture vectors v_i and text vectors t_i, used to compute the image-to-text and text-to-image similarities:

s^{i2t}_{i,j} = v_i^T t_j,    s^{t2i}_{i,j} = t_i^T v_j

p^{i2t}_{i,j} = exp(s^{i2t}_{i,j} / σ) / Σ_{k=1}^{N} exp(s^{i2t}_{i,k} / σ)

p^{t2i}_{i,j} = exp(s^{t2i}_{i,j} / σ) / Σ_{k=1}^{N} exp(s^{t2i}_{i,k} / σ)

where s^{i2t}_{i,j} denotes the image-to-text similarity between the picture in the i-th image-text matching pair and the text in the j-th image-text matching pair, s^{t2i}_{i,j} denotes the text-to-image similarity between the text in the i-th matching pair and the picture in the j-th matching pair, t_i and v_j are the regularized vector representations of the text in the i-th matching pair and the picture in the j-th matching pair, σ is a temperature parameter, and p^{i2t}_{i,j} and p^{t2i}_{i,j} are the similarities after regularization and the softmax operation;
and using the image-to-text and text-to-image similarities with a cross-entropy loss function to carry out contrastive training of the model, obtaining the final picture encoder and text encoder; the objective is sketched below.
The picture encoder and the text encoder encode the picture and text data respectively and map them into a common subspace; the similarity distance between pictures and texts in that subspace is then obtained by computing cosine similarity, realizing image-text cross-modal retrieval. Because this scheme is highly efficient, it is used for the coarse-recall task in the large-scale image-text retrieval system, as sketched below.
Further, in one embodiment of the present invention, the fine-tuning on the basis of the Multiway Transformer pre-training model comprises:
inputting the picture characterization data into the pre-training model and training the visual feed-forward network and the multi-head self-attention module;
freezing the parameters of the resulting visual feed-forward network and multi-head self-attention module, inputting text characterization data into the pre-training model, and training the text feed-forward network with masked-language-model self-supervised learning;
and training the whole pre-training model with image-text matching pairs to obtain the final pre-training model for the subsequent construction of the dual encoder and the fusion encoder.
Further, in one embodiment of the invention, computing and ranking similarity over the candidate set using a fusion-encoder-based model includes:
on the basis of the Multiway Transformer pre-training model, optimizing the model with a masking mechanism, a contrastive learning method, and a cross-entropy loss function to obtain the final fusion encoder, which is used for the ranking task in image-text cross-modal retrieval.
When a fusion-encoder-based model architecture realizes image-text cross-modal retrieval, every possible image-text pairing must be jointly encoded to obtain a similarity score before re-ranking, which finally yields the retrieval result. When the data volume is large this scheme is too slow to be practical, but its precision is higher than that of the dual-encoder-based architecture. Therefore, on the basis of the Multiway Transformer pre-training model, the invention fine-tunes the model with a masking mechanism, a contrastive learning method, and a cross-entropy loss function to obtain the final fusion encoder, which is used for the precise-ranking task in image-text cross-modal retrieval. The fusion-encoder-based model architecture is shown in fig. 5.
The above is the complete flow of the multi-modal retrieval method based on deep learning and a hash algorithm; fig. 6 shows the overall architecture of the invention.
The multi-modal retrieval method based on deep learning and a hash algorithm provided by the embodiment of the invention realizes large-scale text semantic retrieval with a method based on deep learning and hash coding and realizes audio retrieval with speech recognition; it realizes large-scale precise picture retrieval with the gray-level comparison method and Elasticsearch and realizes video retrieval with the intelligent video frame extraction technique; and it realizes large-scale image-text cross-modal retrieval with the method based on the Multiway Transformer pre-training model and the encoders, guaranteeing both retrieval precision and speed. Compared with the current mainstream retrieval techniques, the invention has the following advantages:
1) Existing deep-learning-based retrieval methods are precise but inefficient; the proposed scheme combines the BERT pre-training model with a hash encoder based on a scaled tanh function, realizing text semantic retrieval with unchanged precision and higher efficiency. Image data are high-dimensional, so picture retrieval tends to be inefficient; the proposed scheme combines the gray-level comparison method with Elasticsearch to realize precise picture retrieval with higher precision and higher speed.
2) The image-text cross-modal retrieval scheme based on the Multiway Transformer pre-training model and the encoders solves the insufficient precision, low efficiency, and similar problems of existing image-text cross-modal retrieval systems. The Multiway Transformer pre-training model lets data of different modalities fully interact and share information; the pre-training model is used to construct the dual encoder and the fusion encoder, the dual encoder then performs the coarse-recall task to obtain the k-candidate set, greatly improving retrieval efficiency, and finally the fusion encoder computes similarity over the k-candidate set to complete the precise-ranking task, guaranteeing retrieval accuracy.
To realize the above embodiments, the invention also provides a multi-modal retrieval device based on deep learning and a hash algorithm.
Fig. 7 is a schematic structural diagram of a multi-modal retrieval device based on deep learning and a hash algorithm according to an embodiment of the present invention.
As shown in fig. 7, the multi-modal retrieval device based on deep learning and a hash algorithm includes: an acquisition module 100, a mapping module 200, a computing module 300, and a ranking module 400, wherein:
the acquisition module is used for acquiring multi-modal data to be retrieved, performing text semantic retrieval on the multi-modal data with a method based on deep learning and hash coding to obtain text data, and performing picture retrieval on the multi-modal data with a gray-level comparison method and Elasticsearch to obtain picture data;
the mapping module is used for mapping the text data and the picture data into a unified vector subspace using a dual-encoder-based model;
the computing module is used for obtaining the similarity distance between the text data and the picture data in the vector subspace by computing cosine similarity, yielding a candidate set;
and the ranking module is used for computing similarity over the candidate set with the fusion-encoder-based model and ranking the results to obtain the retrieval result for the multi-modal data.
To achieve the above objective, an embodiment of a third aspect of the present invention provides a computer device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the multi-modal retrieval method based on deep learning and a hash algorithm described above when executing the computer program.
To achieve the above objective, an embodiment of a fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the multi-modal retrieval method based on deep learning and a hash algorithm described above.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (9)

1. A multi-modal retrieval method based on deep learning and a hash algorithm, characterized by comprising the following steps:
acquiring multi-modal data to be retrieved; performing text semantic retrieval on the multi-modal data with a method based on deep learning and hash coding to obtain text data; performing picture retrieval on the multi-modal data with a gray-level comparison method and Elasticsearch to obtain picture data;
mapping the text data and the picture data into a unified vector subspace using a dual-encoder-based model;
obtaining the similarity distance between the text data and the picture data in the vector subspace by computing cosine similarity, yielding a candidate set;
and computing similarity over the candidate set with a fusion-encoder-based model and ranking the results to obtain the retrieval result for the multi-modal data.
2. The method of claim 1, wherein the deep-learning-and-hash-coding-based text semantic retrieval of the multi-modal data comprises:
performing audio retrieval on the multi-modal data through speech recognition to obtain text data;
computing the Hamming-space distance similarity between the retrieved text data and all texts in the database, and forming a recall set from the L texts with the closest similarity;
and computing the distance similarity between the continuous vector obtained from the BERT pre-training model and the hash codes of the recall set, and outputting the results with the smallest distance.
3. The method of claim 1, wherein the picture retrieval of the multi-modal data based on the gray-level comparison method and Elasticsearch comprises:
extracting each frame of the video data in the multi-modal data, mapping the frames into a unified LUV color space, and computing the absolute distance between each frame and the previous frame;
sorting all extracted frames by absolute distance, the top-ranked frames being the pictures that best represent the video content;
and converting the picture data in the multi-modal data into a matrix and realizing picture retrieval through Elasticsearch.
4. The method of claim 1, wherein the mapping of the text data and the picture data into a unified vector subspace using a dual-encoder-based model comprises:
training a picture encoder and a text encoder with a contrastive learning method on the basis of a Multiway Transformer pre-training model; the training of the picture encoder and the text encoder using contrastive learning includes:
performing linear transformation and regularization operations in an encoder comprising a multi-head self-attention module, a visual feed-forward network, and a text feed-forward network to obtain picture vectors v_i and text vectors t_i, used to compute the image-to-text and text-to-image similarities:

s^{i2t}_{i,j} = v_i^T t_j,    s^{t2i}_{i,j} = t_i^T v_j

p^{i2t}_{i,j} = exp(s^{i2t}_{i,j} / σ) / Σ_{k=1}^{N} exp(s^{i2t}_{i,k} / σ)

p^{t2i}_{i,j} = exp(s^{t2i}_{i,j} / σ) / Σ_{k=1}^{N} exp(s^{t2i}_{i,k} / σ)

where s^{i2t}_{i,j} denotes the image-to-text similarity between the picture in the i-th image-text matching pair and the text in the j-th image-text matching pair, s^{t2i}_{i,j} denotes the text-to-image similarity between the text in the i-th matching pair and the picture in the j-th matching pair, t_i and v_j are the regularized vector representations of the text in the i-th matching pair and the picture in the j-th matching pair, σ is a temperature parameter, and p^{i2t}_{i,j} and p^{t2i}_{i,j} are the similarities after regularization and the softmax operation;
and using the image-to-text and text-to-image similarities with a cross-entropy loss function to carry out contrastive training of the model, obtaining the final picture encoder and text encoder.
5. The method of claim 4, wherein the fine-tuning on the basis of the Multiway Transformer pre-training model comprises:
inputting the picture characterization data into the pre-training model and training the visual feed-forward network and the multi-head self-attention module;
freezing the parameters of the resulting visual feed-forward network and multi-head self-attention module, inputting text characterization data into the pre-training model, and training the text feed-forward network with masked-language-model self-supervised learning;
and training the whole pre-training model with image-text matching pairs to obtain the final pre-training model for the subsequent construction of the dual encoder and the fusion encoder.
6. The method of claim 1, wherein the computing and ranking of similarity over the candidate set using a fusion-encoder-based model comprises:
on the basis of the Multiway Transformer pre-training model, optimizing the model with a masking mechanism, a contrastive learning method, and a cross-entropy loss function to obtain the final fusion encoder, which is used for the ranking task in image-text cross-modal retrieval.
7. A multi-modal retrieval device based on deep learning and a hash algorithm, characterized by comprising the following modules:
an acquisition module for acquiring multi-modal data to be retrieved, performing text semantic retrieval on the multi-modal data with a method based on deep learning and hash coding to obtain text data, and performing picture retrieval on the multi-modal data with a gray-level comparison method and Elasticsearch to obtain picture data;
a mapping module for mapping the text data and the picture data into a unified vector subspace using a dual-encoder-based model;
a computing module for obtaining the similarity distance between the text data and the picture data in the vector subspace by computing cosine similarity, yielding a candidate set;
and a ranking module for computing similarity over the candidate set with the fusion-encoder-based model and ranking the results to obtain the retrieval result for the multi-modal data.
8. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the multi-modal retrieval method based on deep learning and a hash algorithm as in any one of claims 1-7 when executing the computer program.
9. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the multi-modal retrieval method based on deep learning and a hash algorithm as in any one of claims 1-7.
CN202310126081.9A 2023-02-15 2023-02-15 Multi-modal retrieval method based on deep learning and hash algorithm Pending CN116204694A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310126081.9A | 2023-02-15 | 2023-02-15 | CN116204694A (en) Multi-modal retrieval method based on deep learning and hash algorithm

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310126081.9A | 2023-02-15 | 2023-02-15 | CN116204694A (en) Multi-modal retrieval method based on deep learning and hash algorithm

Publications (1)

Publication Number | Publication Date
CN116204694A (en) | 2023-06-02

Family

ID=86516835

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310126081.9A | CN116204694A (en) Multi-modal retrieval method based on deep learning and hash algorithm (Pending) | 2023-02-15 | 2023-02-15

Country Status (1)

Country Link
CN (1) CN116204694A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117056543A (en) * 2023-08-21 2023-11-14 数据空间研究院 Multi-mode patent retrieval method based on images
CN116932731A (en) * 2023-09-18 2023-10-24 上海帜讯信息技术股份有限公司 Multi-mode knowledge question-answering method and system for 5G message
CN116932731B (en) * 2023-09-18 2024-01-30 上海帜讯信息技术股份有限公司 Multi-mode knowledge question-answering method and system for 5G message
CN117173517A (en) * 2023-11-03 2023-12-05 中国科学院空天信息创新研究院 Time sequence data processing method, device, equipment and medium oriented to space-sky heterogeneous

Similar Documents

Publication Publication Date Title
CN116204694A (en) Multi-mode retrieval method based on deep learning and hash algorithm
CN110046656B (en) Multi-mode scene recognition method based on deep learning
CN114461839B (en) Multi-mode pre-training-based similar picture retrieval method and device and electronic equipment
Zhou et al. Exploiting operation importance for differentiable neural architecture search
CN114419387A (en) Cross-modal retrieval system and method based on pre-training model and recall ranking
CN113076465A (en) Universal cross-modal retrieval model based on deep hash
CN112948601B (en) Cross-modal hash retrieval method based on controlled semantic embedding
CN117688132A (en) Intelligent retrieval method and system based on big data
CN110990596A (en) Multi-mode hash retrieval method and system based on self-adaptive quantization
CN109783691A (en) A kind of video retrieval method of deep learning and Hash coding
CN115375877A (en) Three-dimensional point cloud classification method and device based on channel attention mechanism
CN112989120A (en) Video clip query system and video clip query method
CN118113815B (en) Content searching method, related device and medium
CN111090765B (en) Social image retrieval method and system based on missing multi-modal hash
CN117669693A (en) Knowledge distillation method and system based on multi-teacher multi-mode model
CN118069877A (en) Lightweight multi-mode image description generation method based on CLIP encoder
CN117610658A (en) Knowledge graph data dynamic updating method and system based on artificial intelligence
CN116595343B (en) Manifold ordering learning-based online unsupervised cross-modal retrieval method and system
CN117634459A (en) Target content generation and model training method, device, system, equipment and medium
CN111259176B (en) Cross-modal Hash retrieval method based on matrix decomposition and integrated with supervision information
CN112199531A (en) Cross-modal retrieval method and device based on Hash algorithm and neighborhood map
CN115840798A (en) Cross-modal retrieval method based on topic model
Peng et al. Temporal pyramid transformer with multimodal interaction for video question answering
CN114330239A (en) Text processing method and device, storage medium and electronic equipment
CN114548293A (en) Video-text cross-modal retrieval method based on cross-granularity self-distillation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination