CN112015923A - Multi-mode data retrieval method, system, terminal and storage medium - Google Patents

Multi-mode data retrieval method, system, terminal and storage medium

Info

Publication number
CN112015923A
Authority
CN
China
Prior art keywords
data
modal
retrieval
model
retrieved
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010922939.9A
Other languages
Chinese (zh)
Inventor
王硕
吴振宇
王建明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010922939.9A
Priority to PCT/CN2020/124812 (WO2021155682A1)
Publication of CN112015923A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/432Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/438Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Abstract

The invention discloses a multi-modal data retrieval method, system, terminal and storage medium. The method comprises the following steps: acquiring historical multi-modal data, wherein the historical multi-modal data at least comprises picture data and text data; training a cross-modal retrieval model according to the historical multi-modal data, the cross-modal retrieval model at least comprising a picture modal retrieval model and a text modal retrieval model; and inputting the data to be retrieved into the cross-modal retrieval model, which retrieves the data to be retrieved through the picture modal retrieval model and the text modal retrieval model respectively to obtain a candidate set of similar data files for the data to be retrieved, and ranks the candidate set by similarity to obtain the data file with the highest similarity to the data to be retrieved. The invention retrieves multi-modal data such as pictures and text effectively, improving both retrieval accuracy and retrieval efficiency.

Description

Multi-mode data retrieval method, system, terminal and storage medium
Technical Field
The present invention relates to the field of data retrieval technologies, and in particular, to a multimodal data retrieval method, system, terminal, and storage medium.
Background
With the rapid development of network technologies, multi-modal documents containing data such as text and images appear on a large scale in daily life. Data resources in these different modalities subtly broaden the channels through which people take in information.
Because of the diversity, complexity and randomness of multi-modal data, it is important to retrieve information useful to a user quickly and accurately from a large number of multi-modal documents. Traditional data retrieval methods generally rely on keywords, which must be extracted manually in advance; because keywords are coarse-grained, retrieval accuracy and efficiency are relatively poor.
Disclosure of Invention
The invention provides a multi-modal data retrieval method, system, terminal and storage medium, which can overcome the defects in the prior art to a certain extent.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a method of multimodal data retrieval, comprising:
acquiring historical multi-modal data, wherein the historical multi-modal data at least comprises picture data and text data;
training a cross-modal retrieval model according to the historical multi-modal data; the cross-modal retrieval model at least comprises a picture modal retrieval model and a text modal retrieval model;
inputting the data to be retrieved into the cross-modal retrieval model, wherein the cross-modal retrieval model retrieves the data to be retrieved through the picture modal retrieval model and the text modal retrieval model respectively to obtain a candidate set of similar data files of the data to be retrieved, and performs similarity sorting on the candidate set of similar data files to obtain a data file with the highest similarity with the data to be retrieved.
The technical scheme adopted by the embodiment of the invention also comprises the following steps: the obtaining historical multimodal data further comprises:
and constructing a multi-mode file database, wherein the multi-mode file database comprises the picture data and the text data of each data file.
The technical scheme adopted by the embodiment of the invention also comprises the following steps: before training the cross-modal retrieval model according to the historical multi-modal data, the method further comprises:
and marking the category of the data file in the multi-modal file database to generate a data sample for training a model.
The technical scheme adopted by the embodiment of the invention also comprises the following steps: the training of the cross-modal retrieval model from the historical multimodal data comprises:
the model training comprises a retrieval recall phase and a precise sequencing phase, wherein:
in the retrieval recall stage, a matching algorithm is used for roughly screening all data samples to respectively obtain at least two similar data file sets of the file to be retrieved in different modes, and then a union of the at least two similar data file sets is taken as a similar data file candidate set of the data to be retrieved;
and in the accurate sorting stage, the candidate sets of the similar data files are subjected to similarity sorting to obtain the data files with the highest similarity with the data to be retrieved.
The technical scheme adopted by the embodiment of the invention also comprises the following steps: the cross-modal retrieval model respectively retrieves the data to be retrieved through the picture modal retrieval model and the text modal retrieval model, and comprises the following steps:
judging whether the picture data of the data to be retrieved is empty, and if not, inputting the picture data into the picture modal retrieval model to obtain similar data file retrieval results in the picture modality; after the retrieval results are sorted, taking the first M retrieval results as a similar data file set S_I in the picture modality;
judging whether the text data of the data to be retrieved is empty, and if not, inputting the text data into the text modal retrieval model to obtain similar data file retrieval results in the text modality; after the retrieval results are sorted, taking the first M retrieval results as a similar data file set S_T in the text modality;
taking the union of the set S_I and the set S_T as the candidate set of similar data files of the data to be retrieved;
and sequencing the similarity of the candidate set of similar data files to obtain the data file with the highest similarity with the data to be retrieved.
The technical scheme adopted by the embodiment of the invention also comprises the following steps: the picture modal retrieval model is coded by ResNet, and the text modal retrieval model is coded by BERT.
The technical scheme adopted by the embodiment of the invention also comprises the following steps: the retrieval algorithm of the text modal retrieval model comprises BM25 or TFIDF algorithm, and the retrieval algorithm of the picture modal retrieval model comprises similarity matching by using visual features of pictures, wherein the visual features comprise color distribution, geometric shape or texture.
The embodiment of the invention adopts another technical scheme that: a multimodal data retrieval system comprising:
a data collection module: used for acquiring historical multi-modal data, wherein the historical multi-modal data at least comprises picture data and text data;
a model construction module: used for training a cross-modal retrieval model according to the historical multi-modal data; the cross-modal retrieval model at least comprises a picture modal retrieval model and a text modal retrieval model;
a data retrieval module: used for inputting the data to be retrieved into the cross-modal retrieval model, which retrieves the data to be retrieved respectively through the picture modal retrieval model and the text modal retrieval model to obtain a candidate set of similar data files of the data to be retrieved, and performs similarity ranking on the candidate set of similar data files to obtain the data file with the highest similarity to the data to be retrieved.
The embodiment of the invention adopts another technical scheme that: a terminal comprising a processor, a memory coupled to the processor, wherein,
the memory stores program instructions for implementing the above-described multimodal data retrieval method;
the processor is to execute the program instructions stored by the memory to perform the multimodal data retrieval operation.
The embodiment of the invention adopts another technical scheme that: a storage medium stores program instructions executable by a processor to perform the above-described multimodal data retrieval method.
The invention has the beneficial effects that: according to the multi-modal data retrieval method, the multi-modal data retrieval system, the multi-modal data retrieval terminal and the multi-modal data retrieval storage medium, a cross-modal retrieval model is built based on data files of different modes, the data files of different modes are directly input, and then retrieval results of corresponding modes are output, so that an end-to-end retrieval scheme is realized, multi-modal data such as pictures and texts can be well retrieved, and the retrieval accuracy and the retrieval efficiency are improved.
Drawings
FIG. 1 is a schematic flow chart diagram of a multimodal data retrieval method according to a first embodiment of the invention;
FIG. 2 is a flow chart diagram of a multimodal data retrieval method according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of the structure of a multimodal data retrieval system in accordance with an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a terminal according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a storage medium structure according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first", "second" and "third" in the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," or "third" may explicitly or implicitly include at least one of the feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise. All directional indicators (such as up, down, left, right, front, and rear … …) in the embodiments of the present invention are only used to explain the relative positional relationship between the components, the movement, and the like in a specific posture (as shown in the drawings), and if the specific posture is changed, the directional indicator is changed accordingly. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Multimodal data typically includes data in different forms, such as text, images, voice and video. For the same object, although data of different modalities are heterogeneous in their low-level features, their high-level semantics are correlated. For example, for a particular disease, the medications used and the medical imaging examinations performed are largely similar across cases; that is, the text data describing the medication and the image data from the imaging examinations are semantically related. Based on this characteristic, the embodiment of the invention trains a cross-modal retrieval model with historical multi-modal data, obtains through the cross-modal retrieval model the sets of data similar to the data to be retrieved in each modality, and then takes the union of these per-modality sets as the candidate set from which the final retrieval result is ranked.
For convenience of description, the following embodiments of the present invention are specifically described by taking two most commonly used modality data, namely, a picture and a text, as an example, and it is understood that the present invention is also applicable to retrieval of other modality data, such as voice, video, and the like.
Specifically, please refer to fig. 1, which is a flowchart illustrating a multimodal data retrieval method according to a first embodiment of the present invention. The multimodal data retrieval method of the first embodiment of the present invention includes the steps of:
s10: acquiring historical multi-modal data, and constructing a multi-modal file database based on the historical multi-modal data;
In this step, the multi-modal file database includes multi-modal data such as the picture and text of each data file. Assuming that the number of data files collected in the multi-modal file database is N, the data set contained in the database is {(I_1, T_1), (I_2, T_2), (I_3, T_3), …, (I_N, T_N)}, where (I_i, T_i) is the picture-text pair of the i-th data file.
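For illustration only, the following is a minimal sketch of how such a multi-modal file database could be held in memory; the field names, the optional category label used later in step S11, and the example records are assumptions, not part of the patent.

```python
# Hypothetical sketch of the multi-modal file database {(I_1, T_1), ..., (I_N, T_N)}.
# Field names and example records are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DataFile:
    file_id: int
    image_path: Optional[str]        # I_i: picture data (may be empty)
    text: Optional[str]              # T_i: text data (may be empty)
    category: Optional[str] = None   # filled in later by manual labeling (step S11)

# The database is simply the ordered collection of picture-text pairs.
database: List[DataFile] = [
    DataFile(1, "records/001.png", "Type 2 diabetes, metformin 500 mg twice daily"),
    DataFile(2, "records/002.png", "Chest CT shows ground-glass opacity in the right lobe"),
]
```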
S11: labeling categories of a certain number of data files in the multi-modal file database to generate data samples for training the model;
In this step, taking data files of the medical type as an example, the category labels of the data files include disease names, medication types, imaging examination types and the like; the category of each data file is labeled manually, so that data files belonging to the same category are regarded as similar.
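As a sketch of how these manual labels could be turned into Pairwise training samples (the triple format and the sampling strategy below are assumptions; the patent only states that same-category files are treated as similar):

```python
# Hypothetical construction of Pairwise training samples from category labels:
# files sharing a category are treated as similar (positive), others as negative.
import random
from itertools import combinations

def make_pairwise_samples(database, num_negatives=1):
    """Yield (anchor, positive, negative) data files for pairwise training."""
    by_category = {}
    for f in database:
        if f.category is not None:
            by_category.setdefault(f.category, []).append(f)
    for category, files in by_category.items():
        negatives = [f for f in database
                     if f.category is not None and f.category != category]
        for anchor, positive in combinations(files, 2):
            for _ in range(min(num_negatives, len(negatives))):
                yield anchor, positive, random.choice(negatives)
```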
S12: training a cross-modal retrieval model according to the data sample, and respectively retrieving the picture data and the text data of the file to be retrieved through the cross-modal retrieval model to obtain a data file with the highest similarity to the file to be retrieved;
in the step, the cross-modal retrieval model at least comprises a text modal retrieval model and a picture modal retrieval model, the embodiment of the invention adopts a Pairwise mode to train the model, and the training process comprises two stages of retrieval recall and accurate sequencing:
In the retrieval recall stage, a matching algorithm is used to roughly screen all data samples to obtain a relatively small candidate set of similar data files. The retrieval recall stage is retrieval under a single modality: the picture data of the file to be retrieved is used to retrieve similar picture data files from the picture modal retrieval model, and the text data of the file to be retrieved is used to retrieve similar text data files from the text modal retrieval model, so that similar data file sets of the file to be retrieved are obtained under the picture modality and the text modality respectively; the union of the two similar data file sets is then taken as the candidate set of similar data files of the file to be retrieved. Assuming that the size of the candidate set obtained by screening in the retrieval recall stage is K, the data file set corresponding to the candidate set is {(I_1, T_1), (I_2, T_2), …, (I_K, T_K)}.
In the accurate sorting stage, the candidate set of similar data files obtained in the retrieval recall stage is ranked by similarity to obtain the data file with the highest similarity to the file to be retrieved. The accurate sorting stage is designed based on the idea of Learning to Rank; its optimization target is the degree of matching between the text data and the picture data, and Hinge Loss is used as the loss function in the Pairwise mode:
(The hinge loss formula is present only as an image in the original publication.)
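A standard pairwise hinge loss consistent with the description above would take the following form, where s(·,·) is a similarity score between embeddings and α is the margin; this reconstruction is an assumption, not the patent's exact formula:

```latex
L \;=\; \sum_{i}\sum_{j \neq i}
  \Big[ \max\!\big(0,\; \alpha - s(IE_i, TE_i) + s(IE_i, TE_j)\big)
      \;+\; \max\!\big(0,\; \alpha - s(IE_i, TE_i) + s(IE_j, TE_i)\big) \Big]
```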
Based on the above training mode, a picture modal retrieval model and a text modal retrieval model are obtained respectively. The picture modal retrieval model uses the deep-learning pre-trained image model ResNet for encoding, and the text modal retrieval model uses BERT (Bidirectional Encoder Representations from Transformers, a deep-learning pre-trained language model) for encoding. As shown below, the embedding vectors obtained by encoding picture I_i and text T_i are IE_i and TE_i respectively:
TE_i = BERT(T_i)
IE_i = ResNet(I_i)
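A minimal sketch of this encoding step is given below, assuming the Hugging Face transformers library, torchvision's ResNet-50 and [CLS]-vector pooling; none of these concrete choices (checkpoints, model size, pooling) are specified by the patent.

```python
# Hypothetical encoders producing TE_i = BERT(T_i) and IE_i = ResNet(I_i).
import torch
from PIL import Image
from torchvision import models, transforms
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # checkpoint is an assumption
bert = BertModel.from_pretrained("bert-base-chinese").eval()
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
resnet.fc = torch.nn.Identity()          # drop the classifier, keep the 2048-d feature

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

@torch.no_grad()
def encode_text(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    return bert(**inputs).last_hidden_state[:, 0]      # [CLS] vector as TE_i

@torch.no_grad()
def encode_image(path: str) -> torch.Tensor:
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return resnet(image)                                # 2048-d vector as IE_i
```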
In the embodiment of the present invention, the search algorithm of the text modal search model includes, but is not limited to, BM25 or TFIDF algorithm, and the search algorithm of the picture modal search model includes similarity matching using simple visual features such as color distribution, geometric shape, texture, etc. of the picture.
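For the recall stage, a sketch of one possible realisation is shown below: TF-IDF with cosine similarity for the text modality and colour-histogram intersection for the picture modality. The patent names only the algorithm families (BM25/TFIDF, simple visual features), so the concrete functions, parameters and libraries are assumptions.

```python
# Hypothetical single-modality recall: TF-IDF for text, colour histograms for pictures.
import numpy as np
from PIL import Image
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def text_recall(query_text, corpus_texts, top_m=10):
    # Note: for Chinese text a word segmenter would be needed before TF-IDF.
    vectorizer = TfidfVectorizer()
    doc_vecs = vectorizer.fit_transform(corpus_texts)
    query_vec = vectorizer.transform([query_text])
    scores = cosine_similarity(query_vec, doc_vecs).ravel()
    return np.argsort(-scores)[:top_m]                  # indices of the top-M documents

def color_histogram(path, bins=8):
    img = np.asarray(Image.open(path).convert("RGB").resize((128, 128)))
    hist, _ = np.histogramdd(img.reshape(-1, 3), bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    hist = hist.ravel()
    return hist / hist.sum()

def image_recall(query_path, corpus_paths, top_m=10):
    q = color_histogram(query_path)
    scores = [np.minimum(q, color_histogram(p)).sum()   # histogram intersection
              for p in corpus_paths]
    return np.argsort(-np.asarray(scores))[:top_m]
```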
Please refer to fig. 2, which is a flowchart illustrating a multimodal data retrieval method according to a second embodiment of the present invention. The multimodal data retrieval method of the second embodiment of the present invention includes the steps of:
s20: selecting a file to be retrieved, and acquiring pictures and text data of the file to be retrieved;
s21: inputting the pictures and the text data into a trained cross-modal retrieval model;
s22: respectively retrieving the picture data and the text data through a cross-modal retrieval model to obtain a candidate set of similar data files of the file to be retrieved, and sequencing the similarity of the candidate set of similar data files to obtain a data file with the highest similarity to the file to be retrieved;
in this step, the search mode of the cross-modal search model specifically includes:
1. Judging whether the picture data of the file to be retrieved is empty; if not, inputting the picture data into the picture modal retrieval model to obtain retrieval results of similar data files in the picture modality; after the retrieval results are sorted, taking the first M retrieval results as the similar data file set S_I in the picture modality;
2. Judging whether the text data of the file to be retrieved is empty; if not, inputting the text data into the text modal retrieval model to obtain retrieval results of similar data files in the text modality; after the retrieval results are sorted, taking the first M retrieval results as the similar data file set S_T in the text modality;
3. Taking the union of the set S_I and the set S_T as the candidate set of similar data files of the file to be retrieved;
4. and carrying out similarity sorting on the candidate set of similar data files to obtain a data file retrieval result with the highest similarity with the file to be retrieved.
In the above, the value of M may be set according to actual operation.
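The following sketch ties steps 1-4 together at query time. Here image_recall, text_recall and DataFile are the hypothetical helpers sketched earlier, and match_score stands in for the trained precise-sorting model; its exact form is not assumed.

```python
# Hypothetical end-to-end query flow for steps 1-4 above.
# match_score(query, candidate) -> float is the trained precise-sorting model.
# For brevity, assumes every database file has both a picture and text stored.
def retrieve(query, database, match_score, top_m=10):
    candidates = set()                                   # indices forming S_I ∪ S_T
    if query.image_path:                                 # picture modality is not empty
        image_corpus = [f.image_path for f in database]
        candidates.update(int(i) for i in image_recall(query.image_path, image_corpus, top_m))
    if query.text:                                       # text modality is not empty
        text_corpus = [f.text or "" for f in database]
        candidates.update(int(i) for i in text_recall(query.text, text_corpus, top_m))
    # Precise sorting stage: rank the candidate set with the matching model.
    ranked = sorted(candidates, key=lambda i: match_score(query, database[i]), reverse=True)
    return database[ranked[0]] if ranked else None       # file with the highest similarity
```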
In summary, the multi-modal data retrieval method of the embodiment of the invention constructs a cross-modal retrieval model based on the data files of different modalities, and directly inputs the data files of different modalities and then outputs the retrieval result of the corresponding modality, so that an end-to-end retrieval scheme is realized, multi-modal data such as pictures and texts can be well processed, and the retrieval accuracy and the retrieval efficiency are improved.
In an optional embodiment, the result of the multi-modal data retrieval method may also be uploaded to a blockchain.
Specifically, corresponding summary information is obtained from the result of the multi-modal data retrieval method; the summary information is obtained by hashing the result, for example with the SHA-256 algorithm. Uploading the summary information to the blockchain ensures its security and its fairness and transparency to the user. The user can download the summary information from the blockchain to verify whether the result of the multi-modal data retrieval method has been tampered with. The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, each data block containing information about a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
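As an illustration of the hashing step only, the sketch below computes a SHA-256 digest of a retrieval result; the JSON serialisation and the field names are assumptions, and submitting the digest to a chain lies outside what the patent describes.

```python
# Hypothetical SHA-256 summary of a retrieval result before uploading it to the chain.
import hashlib
import json

def result_digest(result_file) -> str:
    payload = json.dumps(
        {"file_id": result_file.file_id,
         "text": result_file.text,
         "image_path": result_file.image_path},
        ensure_ascii=False, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# digest = result_digest(best_match)   # then submit `digest` to the blockchain
```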
Please refer to fig. 3, which is a schematic structural diagram of a multimodal data retrieval system according to an embodiment of the present invention. The multimodal data retrieval system 40 according to the embodiment of the present invention includes:
the data acquisition module 41: the system comprises a database, a multi-modal file database and a database server, wherein the database is used for acquiring historical multi-modal data and constructing the multi-modal file database based on the historical multi-modal data; the multi-mode file database comprises multi-mode data such as pictures, texts and the like of each data file. Assuming that the number of data files collected in the multi-modal file database is N, the data set contained in the database is { (I)1,T1),(I2,T2),(I3,T3),…,(IN,TN) In which (I)i,Ti) A picture-text pair representing the ith data file.
Model building module 42: the system is used for training a cross-modal retrieval model according to data samples in the multi-modal file database; the model training method specifically comprises the following steps: firstly, labeling the categories of a certain number of data files in a multi-modal file database to generate data samples for training a model; then, training a cross-modal retrieval model according to the data sample;
in the embodiment of the invention, a cross-modal retrieval model comprises a text modal retrieval model and a picture modal retrieval model, the embodiment of the invention adopts a Pairwise mode to train the models, and the training process comprises two stages of retrieval recall and accurate sequencing:
In the retrieval recall stage, a matching algorithm is used to roughly screen all data samples to obtain a relatively small candidate set of similar data files. The retrieval recall stage is retrieval under a single modality: the picture data of the file to be retrieved is used to retrieve similar picture data files from the picture modal retrieval model, and the text data of the file to be retrieved is used to retrieve similar text data files from the text modal retrieval model, so that similar data file sets of the file to be retrieved are obtained under the picture modality and the text modality respectively; the union of the two similar data file sets is then taken as the candidate set of similar data files of the file to be retrieved. Assuming that the size of the candidate set obtained by screening in the retrieval recall stage is K, the data file set corresponding to the candidate set is {(I_1, T_1), (I_2, T_2), …, (I_K, T_K)}.
In the accurate sorting stage, the candidate set of similar data files obtained in the retrieval recall stage is ranked by similarity to obtain the data file with the highest similarity to the file to be retrieved. The accurate sorting stage is designed based on the idea of Learning to Rank; its optimization target is the degree of matching between the text data and the picture data, and Hinge Loss is used as the loss function in the Pairwise mode:
(The same hinge loss formula as above, present only as an image in the original publication.)
Based on the above training mode, a picture modal retrieval model and a text modal retrieval model are obtained respectively. The picture modal retrieval model uses the deep-learning pre-trained image model ResNet for encoding, and the text modal retrieval model uses BERT (Bidirectional Encoder Representations from Transformers) for encoding. As shown below, the embedding vectors obtained by encoding picture I_i and text T_i are IE_i and TE_i respectively:
TE_i = BERT(T_i)
IE_i = ResNet(I_i)
The data retrieval module 43: used for retrieving the picture data and the text data respectively through the cross-modal retrieval model to obtain a candidate set of similar data files of the file to be retrieved, and for performing similarity ranking on the candidate set to obtain the data file with the highest similarity to the file to be retrieved.
the searching mode of the cross-modal searching model specifically comprises the following steps:
1. Judging whether the picture data of the file to be retrieved is empty; if not, inputting the picture data into the picture modal retrieval model to obtain retrieval results of similar data files in the picture modality; after the retrieval results are sorted, taking the first M retrieval results as the similar data file set S_I in the picture modality;
2. Judging whether the text data of the file to be retrieved is empty; if not, inputting the text data into the text modal retrieval model to obtain retrieval results of similar data files in the text modality; after the retrieval results are sorted, taking the first M retrieval results as the similar data file set S_T in the text modality;
3. Taking the union of the set S_I and the set S_T as the candidate set of similar data files of the file to be retrieved;
4. and carrying out similarity sorting on the candidate set of similar data files to obtain a data file retrieval result with the highest similarity with the file to be retrieved.
Fig. 4 is a schematic diagram of a terminal structure according to an embodiment of the present invention. The terminal 50 comprises a processor 51, a memory 52 coupled to the processor 51.
The memory 52 stores program instructions for implementing the multimodal data retrieval method described above.
The processor 51 is operative to execute program instructions stored in the memory 52 to perform multimodal data retrieval operations.
The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip having signal processing capabilities. The processor 51 may also be a general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a storage medium according to an embodiment of the invention. The storage medium of the embodiment of the present invention stores a program file 61 capable of implementing all the methods described above, wherein the program file 61 may be stored in the storage medium in the form of a software product, and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute all or part of the steps of the methods of the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a mobile hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, or terminal devices, such as a computer, a server, a mobile phone, and a tablet.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for multimodal data retrieval, comprising:
acquiring historical multi-modal data, wherein the historical multi-modal data at least comprises picture data and text data;
training a cross-modal retrieval model according to the historical multi-modal data; the cross-modal retrieval model at least comprises a picture modal retrieval model and a text modal retrieval model;
inputting the data to be retrieved into the cross-modal retrieval model, wherein the cross-modal retrieval model retrieves the data to be retrieved through the picture modal retrieval model and the text modal retrieval model respectively to obtain a candidate set of similar data files of the data to be retrieved, and performs similarity sorting on the candidate set of similar data files to obtain a data file with the highest similarity with the data to be retrieved.
2. The method of claim 1, wherein the acquiring historical multi-modal data further comprises:
and constructing a multi-mode file database, wherein the multi-mode file database comprises the picture data and the text data of each data file.
3. The method of claim 2, wherein training a cross-modal search model based on the historical multi-modal data further comprises:
and marking the category of the data file in the multi-modal file database to generate a data sample for training a model.
4. The method of claim 3, wherein training a cross-modal search model based on the historical multimodal data comprises:
the model training comprises a retrieval recall phase and a precise sequencing phase, wherein:
in the retrieval recall stage, a matching algorithm is used for roughly screening all data samples to respectively obtain at least two similar data file sets of the file to be retrieved in different modes, and then a union of the at least two similar data file sets is taken as a similar data file candidate set of the data to be retrieved;
and in the accurate sorting stage, the candidate sets of the similar data files are subjected to similarity sorting to obtain the data files with the highest similarity with the data to be retrieved.
5. The multi-modal data retrieval method of claim 4, wherein the cross-modal retrieval model respectively retrieves the data to be retrieved through the picture modal retrieval model and the text modal retrieval model, comprising:
judging whether the picture data of the data to be retrieved is empty, and if not, inputting the picture data into the picture modal retrieval model to obtain similar data file retrieval results in the picture modality; sorting the retrieval results, and taking the first M retrieval results as a similar data file set S_I in the picture modality;
judging whether the text data of the data to be retrieved is empty, and if not, inputting the text data into the text modal retrieval model to obtain similar data file retrieval results in the text modality; sorting the retrieval results, and taking the first M retrieval results as a similar data file set S_T in the text modality;
taking the union of the set S_I and the set S_T as the candidate set of similar data files of the data to be retrieved;
and sequencing the similarity of the candidate set of similar data files to obtain the data file with the highest similarity with the data to be retrieved.
6. The multi-modal data retrieval method of claim 1 wherein the picture modal retrieval model is encoded using ResNet and the text modal retrieval model is encoded using BERT.
7. The multi-modal data retrieval method of claim 1 wherein the retrieval algorithm of the text modal retrieval model comprises BM25 or TFIDF algorithm and the retrieval algorithm of the picture modal retrieval model comprises similarity matching using visual features of the picture, the visual features comprising color distribution, geometry or texture.
8. A multimodal data retrieval system, comprising:
a data collection module: used for acquiring historical multi-modal data, wherein the historical multi-modal data at least comprises picture data and text data;
a model construction module: used for training a cross-modal retrieval model according to the historical multi-modal data; the cross-modal retrieval model at least comprises a picture modal retrieval model and a text modal retrieval model;
a data retrieval module: used for inputting the data to be retrieved into the cross-modal retrieval model, which retrieves the data to be retrieved respectively through the picture modal retrieval model and the text modal retrieval model to obtain a candidate set of similar data files of the data to be retrieved, and performs similarity ranking on the candidate set of similar data files to obtain the data file with the highest similarity to the data to be retrieved.
9. A terminal, comprising a processor, a memory coupled to the processor, wherein,
the memory stores program instructions for implementing the multimodal data retrieval method of any of claims 1-7;
the processor is configured to execute the program instructions stored by the memory to perform the multimodal data retrieval method.
10. A storage medium having stored thereon program instructions executable by a processor to perform the multimodal data retrieval method of any one of claims 1 to 7.
CN202010922939.9A 2020-09-04 2020-09-04 Multi-mode data retrieval method, system, terminal and storage medium Pending CN112015923A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010922939.9A CN112015923A (en) 2020-09-04 2020-09-04 Multi-mode data retrieval method, system, terminal and storage medium
PCT/CN2020/124812 WO2021155682A1 (en) 2020-09-04 2020-10-29 Multi-modal data retrieval method and system, terminal, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010922939.9A CN112015923A (en) 2020-09-04 2020-09-04 Multi-mode data retrieval method, system, terminal and storage medium

Publications (1)

Publication Number Publication Date
CN112015923A true CN112015923A (en) 2020-12-01

Family

ID=73516848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010922939.9A Pending CN112015923A (en) 2020-09-04 2020-09-04 Multi-mode data retrieval method, system, terminal and storage medium

Country Status (2)

Country Link
CN (1) CN112015923A (en)
WO (1) WO2021155682A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112579841A (en) * 2020-12-23 2021-03-30 深圳大学 Multi-mode database establishing method, multi-mode database retrieving method and multi-mode database retrieving system
CN113590852A (en) * 2021-06-30 2021-11-02 北京百度网讯科技有限公司 Training method of multi-modal recognition model, multi-modal recognition method and device
CN113656668A (en) * 2021-08-19 2021-11-16 北京百度网讯科技有限公司 Retrieval method, management method, device, equipment and medium of multi-modal information base
CN114461839A (en) * 2022-04-12 2022-05-10 智者四海(北京)技术有限公司 Multi-mode pre-training-based similar picture retrieval method and device and electronic equipment
CN114861016A (en) * 2022-07-05 2022-08-05 人民中科(北京)智能技术有限公司 Cross-modal retrieval method and device and storage medium
WO2023168997A1 (en) * 2022-03-07 2023-09-14 腾讯科技(深圳)有限公司 Cross-modal retrieval method and related device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117648459B (en) * 2024-01-29 2024-04-26 中国海洋大学 Image-text cross-modal retrieval method and system for high-similarity marine remote sensing data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955543A (en) * 2014-05-20 2014-07-30 电子科技大学 Multimode-based clothing image retrieval method
CN110188210A (en) * 2019-05-10 2019-08-30 山东师范大学 One kind is based on figure regularization and the independent cross-module state data retrieval method of mode and system
CN110597878A (en) * 2019-09-16 2019-12-20 广东工业大学 Cross-modal retrieval method, device, equipment and medium for multi-modal data
CN111599438A (en) * 2020-04-02 2020-08-28 浙江工业大学 Real-time diet health monitoring method for diabetic patient based on multi-modal data

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110069650B (en) * 2017-10-10 2024-02-09 阿里巴巴集团控股有限公司 Searching method and processing equipment
CN109783655B (en) * 2018-12-07 2022-12-30 西安电子科技大学 Cross-modal retrieval method and device, computer equipment and storage medium
CN110163220A (en) * 2019-04-26 2019-08-23 腾讯科技(深圳)有限公司 Picture feature extracts model training method, device and computer equipment
CN111008278B (en) * 2019-11-22 2022-06-21 厦门美柚股份有限公司 Content recommendation method and device
CN111598214B (en) * 2020-04-02 2023-04-18 浙江工业大学 Cross-modal retrieval method based on graph convolution neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955543A (en) * 2014-05-20 2014-07-30 电子科技大学 Multimode-based clothing image retrieval method
CN110188210A (en) * 2019-05-10 2019-08-30 山东师范大学 One kind is based on figure regularization and the independent cross-module state data retrieval method of mode and system
CN110597878A (en) * 2019-09-16 2019-12-20 广东工业大学 Cross-modal retrieval method, device, equipment and medium for multi-modal data
CN111599438A (en) * 2020-04-02 2020-08-28 浙江工业大学 Real-time diet health monitoring method for diabetic patient based on multi-modal data

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112579841A (en) * 2020-12-23 2021-03-30 深圳大学 Multi-mode database establishing method, multi-mode database retrieving method and multi-mode database retrieving system
CN112579841B (en) * 2020-12-23 2024-01-05 深圳大学 Multi-mode database establishment method, retrieval method and system
CN113590852A (en) * 2021-06-30 2021-11-02 北京百度网讯科技有限公司 Training method of multi-modal recognition model, multi-modal recognition method and device
CN113590852B (en) * 2021-06-30 2022-07-08 北京百度网讯科技有限公司 Training method of multi-modal recognition model, multi-modal recognition method and device
CN113656668A (en) * 2021-08-19 2021-11-16 北京百度网讯科技有限公司 Retrieval method, management method, device, equipment and medium of multi-modal information base
WO2023019948A1 (en) * 2021-08-19 2023-02-23 北京百度网讯科技有限公司 Retrieval method, management method, and apparatuses for multimodal information base, device, and medium
WO2023168997A1 (en) * 2022-03-07 2023-09-14 腾讯科技(深圳)有限公司 Cross-modal retrieval method and related device
CN114461839A (en) * 2022-04-12 2022-05-10 智者四海(北京)技术有限公司 Multi-mode pre-training-based similar picture retrieval method and device and electronic equipment
CN114861016A (en) * 2022-07-05 2022-08-05 人民中科(北京)智能技术有限公司 Cross-modal retrieval method and device and storage medium

Also Published As

Publication number Publication date
WO2021155682A1 (en) 2021-08-12


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination