CN112015923A - Multi-mode data retrieval method, system, terminal and storage medium - Google Patents

Multi-mode data retrieval method, system, terminal and storage medium

Info

Publication number
CN112015923A
Authority
CN
China
Prior art keywords
data
modal
retrieval
model
retrieved
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010922939.9A
Other languages
Chinese (zh)
Inventor
王硕
吴振宇
王建明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010922939.9A
Priority to PCT/CN2020/124812 (WO2021155682A1)
Publication of CN112015923A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/432Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/438Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Abstract

The invention discloses a multi-modal data retrieval method, system, terminal and storage medium. The method comprises the following steps: acquiring historical multi-modal data, wherein the historical multi-modal data at least comprises picture data and text data; training a cross-modal retrieval model according to the historical multi-modal data, the cross-modal retrieval model at least comprising a picture modal retrieval model and a text modal retrieval model; and inputting the data to be retrieved into the cross-modal retrieval model, which retrieves the data to be retrieved through the picture modal retrieval model and the text modal retrieval model respectively to obtain a candidate set of similar data files for the data to be retrieved, and ranks the candidate set by similarity to obtain the data file with the highest similarity to the data to be retrieved. The invention retrieves multi-modal data such as pictures and text effectively, improving both retrieval accuracy and retrieval efficiency.

Description

Multi-mode data retrieval method, system, terminal and storage medium
Technical Field
The present invention relates to the field of data retrieval technologies, and in particular, to a multimodal data retrieval method, system, terminal, and storage medium.
Background
With the rapid development of network technologies, multi-modal documents containing data such as text and images appear on a large scale in daily life. Data resources in these different modalities subtly broaden the channels through which people take in information.
Because of the diversity, complexity and randomness of multi-modal data, it is important to retrieve information useful to a user quickly and accurately from a large number of multi-modal documents. Traditional data retrieval methods generally rely on keywords, which must be extracted manually in advance; because keywords are coarse-grained, retrieval accuracy and efficiency are relatively poor.
Disclosure of Invention
The invention provides a multi-modal data retrieval method, system, terminal and storage medium, which can overcome the defects in the prior art to a certain extent.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a method of multimodal data retrieval, comprising:
acquiring historical multi-modal data, wherein the historical multi-modal data at least comprises picture data and text data;
training a cross-modal retrieval model according to the historical multi-modal data; the cross-modal retrieval model at least comprises a picture modal retrieval model and a text modal retrieval model;
inputting the data to be retrieved into the cross-modal retrieval model, wherein the cross-modal retrieval model retrieves the data to be retrieved through the picture modal retrieval model and the text modal retrieval model respectively to obtain a candidate set of similar data files of the data to be retrieved, and performs similarity sorting on the candidate set of similar data files to obtain a data file with the highest similarity with the data to be retrieved.
The technical scheme adopted by the embodiment of the invention also comprises the following steps: the obtaining historical multimodal data further comprises:
and constructing a multi-mode file database, wherein the multi-mode file database comprises the picture data and the text data of each data file.
The technical scheme adopted by the embodiment of the invention also comprises the following steps: before training the cross-modal retrieval model according to the historical multi-modal data, the method further comprises:
and marking the category of the data file in the multi-modal file database to generate a data sample for training a model.
The technical scheme adopted by the embodiment of the invention also comprises the following steps: the training of the cross-modal retrieval model from the historical multimodal data comprises:
the model training comprises a retrieval recall phase and a precise sequencing phase, wherein:
in the retrieval recall stage, a matching algorithm is used for roughly screening all data samples to respectively obtain at least two similar data file sets of the file to be retrieved in different modes, and then a union of the at least two similar data file sets is taken as a similar data file candidate set of the data to be retrieved;
and in the accurate sorting stage, the candidate sets of the similar data files are subjected to similarity sorting to obtain the data files with the highest similarity with the data to be retrieved.
The technical scheme adopted by the embodiment of the invention also comprises the following steps: the cross-modal retrieval model respectively retrieves the data to be retrieved through the picture modal retrieval model and the text modal retrieval model, and comprises the following steps:
judging whether the picture data of the data to be retrieved is empty, and if not, inputting the picture data into the picture modal retrieval model to obtain similar data file retrieval results in the picture modality; after the retrieval results are sorted, taking the first M retrieval results as a similar data file set S_I in the picture modality;
judging whether the text data of the data to be retrieved is empty, and if not, inputting the text data into the text modal retrieval model to obtain similar data file retrieval results in the text modality; after the retrieval results are sorted, taking the first M retrieval results as a similar data file set S_T in the text modality;
taking the union of the set S_I and the set S_T as the candidate set of similar data files of the data to be retrieved;
and sequencing the similarity of the candidate set of similar data files to obtain the data file with the highest similarity with the data to be retrieved.
The technical scheme adopted by the embodiment of the invention also comprises the following steps: the picture modal retrieval model is coded by ResNet, and the text modal retrieval model is coded by BERT.
The technical scheme adopted by the embodiment of the invention also comprises the following steps: the retrieval algorithm of the text modal retrieval model comprises BM25 or TFIDF algorithm, and the retrieval algorithm of the picture modal retrieval model comprises similarity matching by using visual features of pictures, wherein the visual features comprise color distribution, geometric shape or texture.
The embodiment of the invention adopts another technical scheme that: a multimodal data retrieval system comprising:
a data collection module: used for acquiring historical multi-modal data, wherein the historical multi-modal data at least comprises picture data and text data;
a model construction module: used for training a cross-modal retrieval model according to the historical multi-modal data; the cross-modal retrieval model at least comprises a picture modal retrieval model and a text modal retrieval model;
a data retrieval module: used for inputting the data to be retrieved into the cross-modal retrieval model, which retrieves the data to be retrieved respectively through the picture modal retrieval model and the text modal retrieval model to obtain a candidate set of similar data files of the data to be retrieved, and performs similarity ranking on the candidate set of similar data files to obtain the data file with the highest similarity to the data to be retrieved.
The embodiment of the invention adopts another technical scheme that: a terminal comprising a processor, a memory coupled to the processor, wherein,
the memory stores program instructions for implementing the above-described multimodal data retrieval method;
the processor is to execute the program instructions stored by the memory to perform the multimodal data retrieval operation.
The embodiment of the invention adopts another technical scheme that: a storage medium stores program instructions executable by a processor to perform the above-described multimodal data retrieval method.
The invention has the beneficial effects that: according to the multi-modal data retrieval method, the multi-modal data retrieval system, the multi-modal data retrieval terminal and the multi-modal data retrieval storage medium, a cross-modal retrieval model is built based on data files of different modes, the data files of different modes are directly input, and then retrieval results of corresponding modes are output, so that an end-to-end retrieval scheme is realized, multi-modal data such as pictures and texts can be well retrieved, and the retrieval accuracy and the retrieval efficiency are improved.
Drawings
FIG. 1 is a schematic flow chart diagram of a multimodal data retrieval method according to a first embodiment of the invention;
FIG. 2 is a flow chart diagram of a multimodal data retrieval method according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of the structure of a multimodal data retrieval system in accordance with an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a terminal according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a storage medium structure according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first", "second" and "third" in the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," or "third" may explicitly or implicitly include at least one of the feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise. All directional indicators (such as up, down, left, right, front, and rear … …) in the embodiments of the present invention are only used to explain the relative positional relationship between the components, the movement, and the like in a specific posture (as shown in the drawings), and if the specific posture is changed, the directional indicator is changed accordingly. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Multimodal data typically includes data in different forms, such as text, images, voice and video. For the same object, although data of different modalities are heterogeneous in their low-level features, their high-level semantics are correlated. For example, for a particular disease, the medications used and the medical imaging examinations performed are largely similar across cases; that is, the text data describing the medication and the image data from the imaging examinations are semantically related. Based on this characteristic, the embodiment of the invention trains a cross-modal retrieval model with historical multi-modal data, obtains through the cross-modal retrieval model the sets of data similar to the data to be retrieved in each modality, and then takes the union of these per-modality sets as the candidate set from which the final retrieval result is ranked.
For convenience of description, the following embodiments of the present invention are specifically described by taking two most commonly used modality data, namely, a picture and a text, as an example, and it is understood that the present invention is also applicable to retrieval of other modality data, such as voice, video, and the like.
Specifically, please refer to fig. 1, which is a flowchart illustrating a multimodal data retrieval method according to a first embodiment of the present invention. The multimodal data retrieval method of the first embodiment of the present invention includes the steps of:
s10: acquiring historical multi-modal data, and constructing a multi-modal file database based on the historical multi-modal data;
In this step, the multi-modal file database includes multi-modal data such as the picture and text of each data file. Assuming that the number of data files collected in the multi-modal file database is N, the data set contained in the database is {(I_1, T_1), (I_2, T_2), (I_3, T_3), …, (I_N, T_N)}, where (I_i, T_i) is the picture-text pair of the i-th data file.
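For illustration only, the following is a minimal sketch of how such a multi-modal file database could be held in memory; the field names, the optional category label used later in step S11, and the example records are assumptions, not part of the patent.

```python
# Hypothetical sketch of the multi-modal file database {(I_1, T_1), ..., (I_N, T_N)}.
# Field names and example records are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DataFile:
    file_id: int
    image_path: Optional[str]        # I_i: picture data (may be empty)
    text: Optional[str]              # T_i: text data (may be empty)
    category: Optional[str] = None   # filled in later by manual labeling (step S11)

# The database is simply the ordered collection of picture-text pairs.
database: List[DataFile] = [
    DataFile(1, "records/001.png", "Type 2 diabetes, metformin 500 mg twice daily"),
    DataFile(2, "records/002.png", "Chest CT shows ground-glass opacity in the right lobe"),
]
```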
S11: labeling categories of a certain number of data files in the multi-modal file database to generate data samples for training the model;
In this step, taking data files of the medical type as an example, the category labels of the data files include disease names, medication types, imaging examination types and the like; the category of each data file is labeled manually, so that data files belonging to the same category are regarded as similar.
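As a sketch of how these manual labels could be turned into Pairwise training samples (the triple format and the sampling strategy below are assumptions; the patent only states that same-category files are treated as similar):

```python
# Hypothetical construction of Pairwise training samples from category labels:
# files sharing a category are treated as similar (positive), others as negative.
import random
from itertools import combinations

def make_pairwise_samples(database, num_negatives=1):
    """Yield (anchor, positive, negative) data files for pairwise training."""
    by_category = {}
    for f in database:
        if f.category is not None:
            by_category.setdefault(f.category, []).append(f)
    for category, files in by_category.items():
        negatives = [f for f in database
                     if f.category is not None and f.category != category]
        for anchor, positive in combinations(files, 2):
            for _ in range(min(num_negatives, len(negatives))):
                yield anchor, positive, random.choice(negatives)
```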
S12: training a cross-modal retrieval model according to the data sample, and respectively retrieving the picture data and the text data of the file to be retrieved through the cross-modal retrieval model to obtain a data file with the highest similarity to the file to be retrieved;
in the step, the cross-modal retrieval model at least comprises a text modal retrieval model and a picture modal retrieval model, the embodiment of the invention adopts a Pairwise mode to train the model, and the training process comprises two stages of retrieval recall and accurate sequencing:
In the retrieval recall stage, a matching algorithm is used to roughly screen all data samples to obtain a relatively small candidate set of similar data files. The retrieval recall stage is retrieval under a single modality: the picture data of the file to be retrieved is used to retrieve similar picture data files from the picture modal retrieval model, and the text data of the file to be retrieved is used to retrieve similar text data files from the text modal retrieval model, so that similar data file sets of the file to be retrieved are obtained under the picture modality and the text modality respectively; the union of the two similar data file sets is then taken as the candidate set of similar data files of the file to be retrieved. Assuming that the size of the candidate set obtained by screening in the retrieval recall stage is K, the data file set corresponding to the candidate set is {(I_1, T_1), (I_2, T_2), …, (I_K, T_K)}.
In the accurate sorting stage, the candidate set of similar data files obtained in the retrieval recall stage is ranked by similarity to obtain the data file with the highest similarity to the file to be retrieved. The accurate sorting stage is designed based on the idea of Learning to Rank; its optimization target is the degree of matching between the text data and the picture data, and Hinge Loss is used as the loss function in the Pairwise mode:
(The hinge loss formula is present only as an image in the original publication.)
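A standard pairwise hinge loss consistent with the description above would take the following form, where s(·,·) is a similarity score between embeddings and α is the margin; this reconstruction is an assumption, not the patent's exact formula:

```latex
L \;=\; \sum_{i}\sum_{j \neq i}
  \Big[ \max\!\big(0,\; \alpha - s(IE_i, TE_i) + s(IE_i, TE_j)\big)
      \;+\; \max\!\big(0,\; \alpha - s(IE_i, TE_i) + s(IE_j, TE_i)\big) \Big]
```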
Based on the above training mode, a picture modal retrieval model and a text modal retrieval model are obtained respectively. The picture modal retrieval model uses the deep-learning pre-trained image model ResNet for encoding, and the text modal retrieval model uses BERT (Bidirectional Encoder Representations from Transformers, a deep-learning pre-trained language model) for encoding. As shown below, the embedding vectors obtained by encoding picture I_i and text T_i are IE_i and TE_i respectively:
TE_i = BERT(T_i)
IE_i = ResNet(I_i)
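A minimal sketch of this encoding step is given below, assuming the Hugging Face transformers library, torchvision's ResNet-50 and [CLS]-vector pooling; none of these concrete choices (checkpoints, model size, pooling) are specified by the patent.

```python
# Hypothetical encoders producing TE_i = BERT(T_i) and IE_i = ResNet(I_i).
import torch
from PIL import Image
from torchvision import models, transforms
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # checkpoint is an assumption
bert = BertModel.from_pretrained("bert-base-chinese").eval()
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
resnet.fc = torch.nn.Identity()          # drop the classifier, keep the 2048-d feature

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

@torch.no_grad()
def encode_text(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    return bert(**inputs).last_hidden_state[:, 0]      # [CLS] vector as TE_i

@torch.no_grad()
def encode_image(path: str) -> torch.Tensor:
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return resnet(image)                                # 2048-d vector as IE_i
```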
In the embodiment of the present invention, the search algorithm of the text modal search model includes, but is not limited to, BM25 or TFIDF algorithm, and the search algorithm of the picture modal search model includes similarity matching using simple visual features such as color distribution, geometric shape, texture, etc. of the picture.
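For the recall stage, a sketch of one possible realisation is shown below: TF-IDF with cosine similarity for the text modality and colour-histogram intersection for the picture modality. The patent names only the algorithm families (BM25/TFIDF, simple visual features), so the concrete functions, parameters and libraries are assumptions.

```python
# Hypothetical single-modality recall: TF-IDF for text, colour histograms for pictures.
import numpy as np
from PIL import Image
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def text_recall(query_text, corpus_texts, top_m=10):
    # Note: for Chinese text a word segmenter would be needed before TF-IDF.
    vectorizer = TfidfVectorizer()
    doc_vecs = vectorizer.fit_transform(corpus_texts)
    query_vec = vectorizer.transform([query_text])
    scores = cosine_similarity(query_vec, doc_vecs).ravel()
    return np.argsort(-scores)[:top_m]                  # indices of the top-M documents

def color_histogram(path, bins=8):
    img = np.asarray(Image.open(path).convert("RGB").resize((128, 128)))
    hist, _ = np.histogramdd(img.reshape(-1, 3), bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    hist = hist.ravel()
    return hist / hist.sum()

def image_recall(query_path, corpus_paths, top_m=10):
    q = color_histogram(query_path)
    scores = [np.minimum(q, color_histogram(p)).sum()   # histogram intersection
              for p in corpus_paths]
    return np.argsort(-np.asarray(scores))[:top_m]
```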
Please refer to fig. 2, which is a flowchart illustrating a multimodal data retrieval method according to a second embodiment of the present invention. The multimodal data retrieval method of the second embodiment of the present invention includes the steps of:
s20: selecting a file to be retrieved, and acquiring pictures and text data of the file to be retrieved;
s21: inputting the pictures and the text data into a trained cross-modal retrieval model;
s22: respectively retrieving the picture data and the text data through a cross-modal retrieval model to obtain a candidate set of similar data files of the file to be retrieved, and sequencing the similarity of the candidate set of similar data files to obtain a data file with the highest similarity to the file to be retrieved;
in this step, the search mode of the cross-modal search model specifically includes:
1. Judging whether the picture data of the file to be retrieved is empty; if not, inputting the picture data into the picture modal retrieval model to obtain retrieval results of similar data files in the picture modality; after the retrieval results are sorted, taking the first M retrieval results as the similar data file set S_I in the picture modality;
2. Judging whether the text data of the file to be retrieved is empty; if not, inputting the text data into the text modal retrieval model to obtain retrieval results of similar data files in the text modality; after the retrieval results are sorted, taking the first M retrieval results as the similar data file set S_T in the text modality;
3. Taking the union of the set S_I and the set S_T as the candidate set of similar data files of the file to be retrieved;
4. and carrying out similarity sorting on the candidate set of similar data files to obtain a data file retrieval result with the highest similarity with the file to be retrieved.
In the above, the value of M may be set according to actual operation.
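The following sketch ties steps 1-4 together at query time. Here image_recall, text_recall and DataFile are the hypothetical helpers sketched earlier, and match_score stands in for the trained precise-sorting model; its exact form is not assumed.

```python
# Hypothetical end-to-end query flow for steps 1-4 above.
# match_score(query, candidate) -> float is the trained precise-sorting model.
# For brevity, assumes every database file has both a picture and text stored.
def retrieve(query, database, match_score, top_m=10):
    candidates = set()                                   # indices forming S_I ∪ S_T
    if query.image_path:                                 # picture modality is not empty
        image_corpus = [f.image_path for f in database]
        candidates.update(int(i) for i in image_recall(query.image_path, image_corpus, top_m))
    if query.text:                                       # text modality is not empty
        text_corpus = [f.text or "" for f in database]
        candidates.update(int(i) for i in text_recall(query.text, text_corpus, top_m))
    # Precise sorting stage: rank the candidate set with the matching model.
    ranked = sorted(candidates, key=lambda i: match_score(query, database[i]), reverse=True)
    return database[ranked[0]] if ranked else None       # file with the highest similarity
```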
In summary, the multi-modal data retrieval method of the embodiment of the invention constructs a cross-modal retrieval model based on the data files of different modalities, and directly inputs the data files of different modalities and then outputs the retrieval result of the corresponding modality, so that an end-to-end retrieval scheme is realized, multi-modal data such as pictures and texts can be well processed, and the retrieval accuracy and the retrieval efficiency are improved.
In an optional embodiment, the result of the multi-modal data retrieval method may also be uploaded to a blockchain.
Specifically, corresponding summary information is obtained from the result of the multi-modal data retrieval method; the summary information is obtained by hashing the result, for example with the SHA-256 algorithm. Uploading the summary information to the blockchain ensures its security and its fairness and transparency to the user. The user can download the summary information from the blockchain to verify whether the result of the multi-modal data retrieval method has been tampered with. The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, each data block containing information about a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
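As an illustration of the hashing step only, the sketch below computes a SHA-256 digest of a retrieval result; the JSON serialisation and the field names are assumptions, and submitting the digest to a chain lies outside what the patent describes.

```python
# Hypothetical SHA-256 summary of a retrieval result before uploading it to the chain.
import hashlib
import json

def result_digest(result_file) -> str:
    payload = json.dumps(
        {"file_id": result_file.file_id,
         "text": result_file.text,
         "image_path": result_file.image_path},
        ensure_ascii=False, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# digest = result_digest(best_match)   # then submit `digest` to the blockchain
```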
Please refer to fig. 3, which is a schematic structural diagram of a multimodal data retrieval system according to an embodiment of the present invention. The multimodal data retrieval system 40 according to the embodiment of the present invention includes:
the data acquisition module 41: the system comprises a database, a multi-modal file database and a database server, wherein the database is used for acquiring historical multi-modal data and constructing the multi-modal file database based on the historical multi-modal data; the multi-mode file database comprises multi-mode data such as pictures, texts and the like of each data file. Assuming that the number of data files collected in the multi-modal file database is N, the data set contained in the database is { (I)1,T1),(I2,T2),(I3,T3),…,(IN,TN) In which (I)i,Ti) A picture-text pair representing the ith data file.
Model building module 42: the system is used for training a cross-modal retrieval model according to data samples in the multi-modal file database; the model training method specifically comprises the following steps: firstly, labeling the categories of a certain number of data files in a multi-modal file database to generate data samples for training a model; then, training a cross-modal retrieval model according to the data sample;
in the embodiment of the invention, a cross-modal retrieval model comprises a text modal retrieval model and a picture modal retrieval model, the embodiment of the invention adopts a Pairwise mode to train the models, and the training process comprises two stages of retrieval recall and accurate sequencing:
In the retrieval recall stage, a matching algorithm is used to roughly screen all data samples to obtain a relatively small candidate set of similar data files. The retrieval recall stage is retrieval under a single modality: the picture data of the file to be retrieved is used to retrieve similar picture data files from the picture modal retrieval model, and the text data of the file to be retrieved is used to retrieve similar text data files from the text modal retrieval model, so that similar data file sets of the file to be retrieved are obtained under the picture modality and the text modality respectively; the union of the two similar data file sets is then taken as the candidate set of similar data files of the file to be retrieved. Assuming that the size of the candidate set obtained by screening in the retrieval recall stage is K, the data file set corresponding to the candidate set is {(I_1, T_1), (I_2, T_2), …, (I_K, T_K)}.
In the accurate sorting stage, the candidate set of similar data files obtained in the retrieval recall stage is ranked by similarity to obtain the data file with the highest similarity to the file to be retrieved. The accurate sorting stage is designed based on the idea of Learning to Rank; its optimization target is the degree of matching between the text data and the picture data, and Hinge Loss is used as the loss function in the Pairwise mode:
(The same hinge loss formula as above, present only as an image in the original publication.)
Based on the above training mode, a picture modal retrieval model and a text modal retrieval model are obtained respectively. The picture modal retrieval model uses the deep-learning pre-trained image model ResNet for encoding, and the text modal retrieval model uses BERT (Bidirectional Encoder Representations from Transformers) for encoding. As shown below, the embedding vectors obtained by encoding picture I_i and text T_i are IE_i and TE_i respectively:
TE_i = BERT(T_i)
IE_i = ResNet(I_i)
The data retrieval module 43: used for retrieving the picture data and the text data respectively through the cross-modal retrieval model to obtain a candidate set of similar data files of the file to be retrieved, and for performing similarity ranking on the candidate set to obtain the data file with the highest similarity to the file to be retrieved.
the searching mode of the cross-modal searching model specifically comprises the following steps:
1. Judging whether the picture data of the file to be retrieved is empty; if not, inputting the picture data into the picture modal retrieval model to obtain retrieval results of similar data files in the picture modality; after the retrieval results are sorted, taking the first M retrieval results as the similar data file set S_I in the picture modality;
2. Judging whether the text data of the file to be retrieved is empty; if not, inputting the text data into the text modal retrieval model to obtain retrieval results of similar data files in the text modality; after the retrieval results are sorted, taking the first M retrieval results as the similar data file set S_T in the text modality;
3. Taking the union of the set S_I and the set S_T as the candidate set of similar data files of the file to be retrieved;
4. and carrying out similarity sorting on the candidate set of similar data files to obtain a data file retrieval result with the highest similarity with the file to be retrieved.
Fig. 4 is a schematic diagram of a terminal structure according to an embodiment of the present invention. The terminal 50 comprises a processor 51, a memory 52 coupled to the processor 51.
The memory 52 stores program instructions for implementing the multimodal data retrieval method described above.
The processor 51 is operative to execute program instructions stored in the memory 52 to perform multimodal data retrieval operations.
The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip having signal processing capabilities. The processor 51 may also be a general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a storage medium according to an embodiment of the invention. The storage medium of the embodiment of the present invention stores a program file 61 capable of implementing all the methods described above, wherein the program file 61 may be stored in the storage medium in the form of a software product, and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute all or part of the steps of the methods of the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a mobile hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, or terminal devices, such as a computer, a server, a mobile phone, and a tablet.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for multimodal data retrieval, comprising:
acquiring historical multi-modal data, wherein the historical multi-modal data at least comprises picture data and text data;
training a cross-modal retrieval model according to the historical multi-modal data; the cross-modal retrieval model at least comprises a picture modal retrieval model and a text modal retrieval model;
inputting the data to be retrieved into the cross-modal retrieval model, wherein the cross-modal retrieval model retrieves the data to be retrieved through the picture modal retrieval model and the text modal retrieval model respectively to obtain a candidate set of similar data files of the data to be retrieved, and performs similarity sorting on the candidate set of similar data files to obtain a data file with the highest similarity with the data to be retrieved.
2. The method of claim 1, wherein the acquiring historical multi-modal data further comprises:
and constructing a multi-mode file database, wherein the multi-mode file database comprises the picture data and the text data of each data file.
3. The method of claim 2, wherein training a cross-modal search model based on the historical multi-modal data further comprises:
and marking the category of the data file in the multi-modal file database to generate a data sample for training a model.
4. The method of claim 3, wherein training a cross-modal search model based on the historical multimodal data comprises:
the model training comprises a retrieval recall phase and a precise sequencing phase, wherein:
in the retrieval recall stage, a matching algorithm is used for roughly screening all data samples to respectively obtain at least two similar data file sets of the file to be retrieved in different modes, and then a union of the at least two similar data file sets is taken as a similar data file candidate set of the data to be retrieved;
and in the accurate sorting stage, the candidate sets of the similar data files are subjected to similarity sorting to obtain the data files with the highest similarity with the data to be retrieved.
5. The multi-modal data retrieval method of claim 4, wherein the cross-modal retrieval model respectively retrieves the data to be retrieved through the picture modal retrieval model and the text modal retrieval model, comprising:
judging whether the picture data of the data to be retrieved is empty, and if not, inputting the picture data into the picture modal retrieval model to obtain similar data file retrieval results in the picture modality; sorting the retrieval results, and taking the first M retrieval results as a similar data file set S_I in the picture modality;
judging whether the text data of the data to be retrieved is empty, and if not, inputting the text data into the text modal retrieval model to obtain similar data file retrieval results in the text modality; sorting the retrieval results, and taking the first M retrieval results as a similar data file set S_T in the text modality;
taking the union of the set S_I and the set S_T as the candidate set of similar data files of the data to be retrieved;
and sequencing the similarity of the candidate set of similar data files to obtain the data file with the highest similarity with the data to be retrieved.
6. The multi-modal data retrieval method of claim 1 wherein the picture modal retrieval model is encoded using ResNet and the text modal retrieval model is encoded using BERT.
7. The multi-modal data retrieval method of claim 1 wherein the retrieval algorithm of the text modal retrieval model comprises BM25 or TFIDF algorithm and the retrieval algorithm of the picture modal retrieval model comprises similarity matching using visual features of the picture, the visual features comprising color distribution, geometry or texture.
8. A multimodal data retrieval system, comprising:
a data collection module: used for acquiring historical multi-modal data, wherein the historical multi-modal data at least comprises picture data and text data;
a model construction module: used for training a cross-modal retrieval model according to the historical multi-modal data; the cross-modal retrieval model at least comprises a picture modal retrieval model and a text modal retrieval model;
a data retrieval module: used for inputting the data to be retrieved into the cross-modal retrieval model, which retrieves the data to be retrieved respectively through the picture modal retrieval model and the text modal retrieval model to obtain a candidate set of similar data files of the data to be retrieved, and performs similarity ranking on the candidate set of similar data files to obtain the data file with the highest similarity to the data to be retrieved.
9. A terminal, comprising a processor, a memory coupled to the processor, wherein,
the memory stores program instructions for implementing the multimodal data retrieval method of any of claims 1-7;
the processor is configured to execute the program instructions stored by the memory to perform the multimodal data retrieval method.
10. A storage medium having stored thereon program instructions executable by a processor to perform the multimodal data retrieval method of any one of claims 1 to 7.
CN202010922939.9A 2020-09-04 2020-09-04 Multi-mode data retrieval method, system, terminal and storage medium Pending CN112015923A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010922939.9A CN112015923A (en) 2020-09-04 2020-09-04 Multi-mode data retrieval method, system, terminal and storage medium
PCT/CN2020/124812 WO2021155682A1 (en) 2020-09-04 2020-10-29 Multi-modal data retrieval method and system, terminal, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010922939.9A CN112015923A (en) 2020-09-04 2020-09-04 Multi-mode data retrieval method, system, terminal and storage medium

Publications (1)

Publication Number Publication Date
CN112015923A true CN112015923A (en) 2020-12-01

Family

ID=73516848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010922939.9A Pending CN112015923A (en) 2020-09-04 2020-09-04 Multi-mode data retrieval method, system, terminal and storage medium

Country Status (2)

Country Link
CN (1) CN112015923A (en)
WO (1) WO2021155682A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112579841A (en) * 2020-12-23 2021-03-30 深圳大学 Multi-mode database establishing method, multi-mode database retrieving method and multi-mode database retrieving system
CN113590852A (en) * 2021-06-30 2021-11-02 北京百度网讯科技有限公司 Training method of multi-modal recognition model, multi-modal recognition method and device
CN113656668A (en) * 2021-08-19 2021-11-16 北京百度网讯科技有限公司 Retrieval method, management method, device, equipment and medium of multi-modal information base
CN114461839A (en) * 2022-04-12 2022-05-10 智者四海(北京)技术有限公司 Multi-mode pre-training-based similar picture retrieval method and device and electronic equipment
CN114861016A (en) * 2022-07-05 2022-08-05 人民中科(北京)智能技术有限公司 Cross-modal retrieval method and device and storage medium
WO2023168997A1 (en) * 2022-03-07 2023-09-14 腾讯科技(深圳)有限公司 Cross-modal retrieval method and related device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117648459B (en) * 2024-01-29 2024-04-26 中国海洋大学 Image-text cross-modal retrieval method and system for high-similarity marine remote sensing data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955543A (en) * 2014-05-20 2014-07-30 电子科技大学 Multimode-based clothing image retrieval method
CN110188210A (en) * 2019-05-10 2019-08-30 山东师范大学 One kind is based on figure regularization and the independent cross-module state data retrieval method of mode and system
CN110597878A (en) * 2019-09-16 2019-12-20 广东工业大学 Cross-modal retrieval method, device, equipment and medium for multi-modal data
CN111599438A (en) * 2020-04-02 2020-08-28 浙江工业大学 Real-time diet health monitoring method for diabetic patient based on multi-modal data

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110069650B (en) * 2017-10-10 2024-02-09 阿里巴巴集团控股有限公司 Searching method and processing equipment
CN109783655B (en) * 2018-12-07 2022-12-30 西安电子科技大学 Cross-modal retrieval method and device, computer equipment and storage medium
CN110163220A (en) * 2019-04-26 2019-08-23 腾讯科技(深圳)有限公司 Picture feature extracts model training method, device and computer equipment
CN111008278B (en) * 2019-11-22 2022-06-21 厦门美柚股份有限公司 Content recommendation method and device
CN111598214B (en) * 2020-04-02 2023-04-18 浙江工业大学 Cross-modal retrieval method based on graph convolution neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955543A (en) * 2014-05-20 2014-07-30 电子科技大学 Multimode-based clothing image retrieval method
CN110188210A (en) * 2019-05-10 2019-08-30 山东师范大学 One kind is based on figure regularization and the independent cross-module state data retrieval method of mode and system
CN110597878A (en) * 2019-09-16 2019-12-20 广东工业大学 Cross-modal retrieval method, device, equipment and medium for multi-modal data
CN111599438A (en) * 2020-04-02 2020-08-28 浙江工业大学 Real-time diet health monitoring method for diabetic patient based on multi-modal data

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112579841A (en) * 2020-12-23 2021-03-30 深圳大学 Multi-mode database establishing method, multi-mode database retrieving method and multi-mode database retrieving system
CN112579841B (en) * 2020-12-23 2024-01-05 深圳大学 Multi-mode database establishment method, retrieval method and system
CN113590852A (en) * 2021-06-30 2021-11-02 北京百度网讯科技有限公司 Training method of multi-modal recognition model, multi-modal recognition method and device
CN113590852B (en) * 2021-06-30 2022-07-08 北京百度网讯科技有限公司 Training method of multi-modal recognition model, multi-modal recognition method and device
CN113656668A (en) * 2021-08-19 2021-11-16 北京百度网讯科技有限公司 Retrieval method, management method, device, equipment and medium of multi-modal information base
WO2023019948A1 (en) * 2021-08-19 2023-02-23 北京百度网讯科技有限公司 Retrieval method, management method, and apparatuses for multimodal information base, device, and medium
WO2023168997A1 (en) * 2022-03-07 2023-09-14 腾讯科技(深圳)有限公司 Cross-modal retrieval method and related device
CN114461839A (en) * 2022-04-12 2022-05-10 智者四海(北京)技术有限公司 Multi-mode pre-training-based similar picture retrieval method and device and electronic equipment
CN114861016A (en) * 2022-07-05 2022-08-05 人民中科(北京)智能技术有限公司 Cross-modal retrieval method and device and storage medium

Also Published As

Publication number Publication date
WO2021155682A1 (en) 2021-08-12


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination