CN114611618A - Cross-modal retrieval-oriented data acquisition processing method and system - Google Patents

Cross-modal retrieval-oriented data acquisition processing method and system

Info

Publication number
CN114611618A
CN114611618A
Authority
CN
China
Prior art keywords
text
data
character
features
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210260897.6A
Other languages
Chinese (zh)
Inventor
纪守领
何平
白熠阳
张旭鸿
杜天宇
蒲誉文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210260897.6A
Publication of CN114611618A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention discloses a data acquisition and processing method and system for cross-modal retrieval, comprising the following steps: performing distributed parallel acquisition of multi-modal data from a target open-source data network; cleaning special characters and invisible characters from the text-modality data, then storing the cleaned text-modality data and the image-modality data in separate message queues; extracting the features of each text and each image in the message queues with a feature extraction model to obtain text features and image features, matching and screening according to the similarity between the text features and the image features to obtain image-text combinations, and storing the combinations in a database indexed by their image features and text features; and, at retrieval time, screening the matched image-text combination according to the similarity between the uploaded data and the image-text combinations in the database and returning it as the retrieval result.

Description

Cross-modal retrieval-oriented data acquisition processing method and system
Technical Field
The invention belongs to the technical field of cross-modal information retrieval, and particularly relates to a data acquisition and processing method and system for cross-modal retrieval.
Background
With the rapid progress of artificial intelligence technologies, represented by deep learning, the industry's demand for data of various modalities, such as images and text, has grown greatly. The internet holds a great deal of valuable multi-modal open-source data that can be used to train deep learning models; for example, related image and text data from social network platforms can be used to train a multi-modal deep learning model. In cross-modal retrieval scenarios in particular, the demand for image and text data is enormous.
To meet that demand, it is crucial to build acquisition and processing channels for both modalities (images and text). Current data processing techniques for images and text mainly consider how to store them effectively. In practice, however, the originally acquired text data may contain invisible characters and display-control characters whose presence makes the subsequent indexing of the text inaccurate in cross-modal retrieval. Meanwhile, the originally acquired image-text pairs may be semantically inconsistent and also need screening during data processing.
Patent document CN108877948A discloses a multi-modal data processing method and system, including: a data acquisition network acquires multi-modal data for coronary heart disease cases; the acquisition network processes the multi-modal data with a data cleaning model matched to the preset type of the data; the acquisition network sends the processed data to a data server; the data server preprocesses the received multi-modal data to obtain fused data; and the data server mines association rules from the fused data. Because this method does not handle invisible characters and display-control characters in the raw collected data, the multi-modal data may be processed inaccurately.
Patent document CN112256786 discloses a multi-modal data processing method and apparatus, including: a terminal acquires multi-modal data; the terminal extracts features of the multi-modal data with a feature extraction algorithm; the terminal converts the features with a first conversion algorithm, which maps the multi-modal data into a specific space, to obtain first data features; the terminal transmits the first data features, the data labels and the terminal id to a server; the server converts the first data features with a second conversion algorithm corresponding to the terminal id, which maps data from different specific spaces into the same space, to obtain second data features; and the server performs multi-modal representation learning with the second data features as input and the data labels as output, training a multi-modal representation learning algorithm. Because invisible characters and display-control characters in the raw collected data are not handled, the second data features obtained during conversion may degrade the representation learning result.
Disclosure of Invention
In view of the foregoing, an object of the present invention is to provide a data acquisition and processing method and system for cross-modal retrieval that can acquire open-source image and text data and process it in depth, ultimately enabling high-quality cross-modal retrieval over image-text data.
To achieve the above object, an embodiment provides a data acquisition and processing method for cross-modal retrieval, comprising:
performing distributed parallel acquisition of multi-modal data from a target open-source data network, wherein the multi-modal data comprises text-modality data and image-modality data;
cleaning special characters and invisible characters from the text-modality data, then storing the cleaned text-modality data and the image-modality data in separate message queues;
extracting the features of each text and each image in the message queues with a feature extraction model to obtain text features and image features, matching and screening according to the similarity between the text features and the image features to obtain image-text combinations, and storing the combinations in a database indexed by their image features and text features;
and, at retrieval time, screening the matched image-text combination according to the similarity between the uploaded data and the image-text combinations in the database and returning it as the retrieval result, wherein the uploaded data comprises text data or image data.
In one embodiment, cleaning invisible characters from the text-modality data comprises:
filtering out the invisible characters in the text-modality data, wherein the invisible characters comprise the zero-width space, the zero-width joiner and the zero-width non-joiner.
In one embodiment, cleaning special characters from the text-modality data comprises:
when the special character is a homoglyph (a visually similar character), replacing the homoglyph with the original character according to the mapping between homoglyphs and the original characters they resemble recorded in the homoglyph table;
when the special character is a deletable character, applying a different cleaning rule per deletable character, including: when the deletable character is a backspace character, deleting both the backspace character and the character before it; when the deletable character is a delete character, deleting both the delete character and the character after it; and when the deletable character is a carriage return character, letting the characters after the carriage return overwrite the paragraph from its beginning.
In one embodiment, cleaning special characters from the text-modality data comprises:
when the special characters are display-order characters, recursively recovering the text bottom-up according to how the control characters change the display order, finally restoring the original text, wherein the display-order characters comprise the PDF, LRE, RLE, LRO, RLO, PDI, LRI and RLI characters, and the control sequence that changes the display order is [LRO, LRI, RLO, LRI, string 1, PDI, LRI, string 2, PDI, PDF, PDI, PDF];
restoring the original text by bottom-up recursive recovery comprises:
(a) with a non-greedy matching algorithm, matching a character sequence of the form [LRO, LRI, RLO, LRI, string 1, PDI, LRI, string 2, PDI, PDF, PDI, PDF] and replacing it with the sequence [string 2, string 1];
(b) repeating step (a) until no such sequence remains in the text-modality data;
(c) if display-order characters still remain in the text-modality data, deleting them all.
In one embodiment, matching and screening according to the similarity between the text features and the image features to obtain an image-text combination comprises:
computing the similarity between each text feature and all image features, and selecting the image feature with the maximum similarity to pair with the text feature, forming an image-text combination.
In one embodiment, screening the matched image-text combination according to the similarity between the uploaded data and the image-text combinations in the database as the retrieval result comprises:
when the uploaded data is text data, extracting its text features with the feature extraction model, computing the similarity between these text features and the text features in the database, taking the database text feature with the maximum similarity as the matched text feature, and taking the image-text combination to which the matched text feature belongs as the retrieval result;
when the uploaded data is image data, extracting its image features with the feature extraction model, computing the similarity between these image features and the image features in the database, taking the database image feature with the maximum similarity as the matched image feature, and taking the image-text combination to which the matched image feature belongs as the retrieval result.
In one embodiment, the feature extraction model is a contrastive text-image pre-training model.
In one embodiment, the similarity between a text feature and an image feature, between two text features, or between two image features is the cosine similarity or the L2 distance.
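To make the two measures concrete, a minimal NumPy sketch (illustrative only, not part of the patent) is:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Larger is more similar; vectors are compared by angle, not magnitude.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def l2_distance(u: np.ndarray, v: np.ndarray) -> float:
    # Smaller is more similar when the L2 distance is used instead.
    return float(np.linalg.norm(u - v))
```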
To achieve the above object, an embodiment of the present invention further provides a data acquisition and processing system for cross-modal retrieval, comprising:
an acquisition module for performing distributed parallel acquisition of multi-modal data from the target open-source data network, wherein the multi-modal data comprises text-modality data and image-modality data;
a cleaning module for cleaning special characters and invisible characters from the text-modality data and then storing the cleaned text-modality data and the image-modality data in separate message queues;
a feature extraction module for extracting the features of each text and each image in the message queues with the feature extraction model to obtain text features and image features;
a matching module for matching and screening according to the similarity between the text features and the image features to obtain image-text combinations, and storing the combinations in a database indexed by their image features and text features;
and a retrieval module for screening the matched image-text combination according to the similarity between the uploaded data and the image-text combinations in the database and returning it as the retrieval result, wherein the uploaded data comprises text data or image data.
Compared with the prior art, the invention has at least the following beneficial effects:
After the multi-modal data are acquired in a distributed, parallel manner, the text-modality data are deeply cleaned by handling special characters and invisible characters, so that these characters cannot distort the semantics and make the index inaccurate. The text features of the text-modality data are then associated with the image features of the image-modality data by similarity matching, so that the resulting image-text combinations capture the semantic relevance between texts and images. Storing the image features and text features of each combination directly as the database index removes the need for a separate index, making storage simpler and more direct and retrieval more convenient. In short, the method can rapidly acquire multi-modal data and effectively build a data warehouse for cross-modal retrieval; it enables automatic mining of large-scale multi-modal data, saves manual analysis cost, and is amenable to large-scale deployment.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a data acquisition and processing method for cross-modal retrieval according to an embodiment;
fig. 2 is another flowchart of the data acquisition and processing method for cross-modal retrieval according to an embodiment;
fig. 3 is a schematic structural diagram of a data acquisition and processing system for cross-modal retrieval according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
Fig. 1 and fig. 2 are flowcharts of the data acquisition and processing method for cross-modal retrieval according to an embodiment. As shown in fig. 1 and fig. 2, the method comprises the following steps:
step 1, carrying out distributed parallel acquisition on multi-mode data on a target open source data network.
In an embodiment, the multimodal data includes text modality data and image modality data. The multi-mode data are collected through the distributed system, the distributed system can realize the multi-process parallel collection of the multi-mode data in multiple nodes, and meanwhile, the attack can be defended against the interference increased on the text content level. Each process may acquire either the text modality data or the image modality data separately, or may acquire both the text modality data and the image modality data simultaneously.
In application, the distributed system provides different acquisition options according to different modal data to be acquired, for example, if the data to be acquired is text modal data, a data acquisition function of the text modal is selected, and then corresponding acquisition tasks can be allocated in the distributed system according to corresponding nodes owned by the distributed system, and the corresponding acquisition requirements are submitted to specific nodes for execution.
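The patent does not disclose a concrete crawling implementation, so the following is only a minimal sketch of multi-process parallel collection in Python; the fetch helper, target URLs and worker count are all hypothetical.

```python
from multiprocessing import Pool

import requests

def fetch(task):
    """Fetch one acquisition task of the form (modality, url)."""
    modality, url = task
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    # Text tasks return the page text; image tasks return only the link,
    # matching the later choice to store image URLs rather than pixels.
    payload = resp.text if modality == "text" else url
    return modality, payload

if __name__ == "__main__":
    tasks = [("text", "https://example.org/post/1"),    # placeholder targets
             ("image", "https://example.org/img/1.jpg")]
    with Pool(processes=4) as pool:    # one worker process per node/core
        results = pool.map(fetch, tasks)
```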
Step 2: preprocess the multi-modal data and store it in message queues.
In the embodiment, a data processing engine preprocesses the multi-modal data; when the data is found to include text-modality data, that data is deeply cleaned. Deep cleaning mainly consists of cleaning special characters and invisible characters from the text-modality data.
The cleaning process comprises the following steps:
(2-1) inspect the text-modality data, traversing each character and checking whether it may be a special character or an invisible character;
(2-2) if such a character exists, determine its type;
(2-3) when the character is an invisible character, clean it by filtering it out.
The invisible characters comprise the following: the zero-width space, Unicode U+200B; the zero-width joiner, Unicode U+200D; and the zero-width non-joiner, Unicode U+200C. These characters are deleted directly from the text-modality data, which completes the invisible-character cleaning.
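As a minimal illustration (a sketch, not the patent's own code), the three characters can be filtered with a regular expression:

```python
import re

# Zero-width characters named above: U+200B (zero-width space),
# U+200C (zero-width non-joiner), U+200D (zero-width joiner).
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d]")

def strip_invisible(text: str) -> str:
    """Delete invisible characters directly from the text."""
    return ZERO_WIDTH.sub("", text)

assert strip_invisible("cross\u200bmodal") == "crossmodal"
```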
(2-4) When the character is a special character, apply a different cleaning rule per special-character type.
In the embodiment, when the special character is a homoglyph (a visually similar character), it is replaced with the original character according to the mapping between homoglyphs and the original characters they resemble, as recorded in the homoglyph table. Note that the homoglyph table is a predefined dictionary of visual similarity, storing the mappings between original characters and the visually similar characters that imitate them.
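The patent's homoglyph table itself is not published, so the sketch below uses a toy mapping purely to illustrate the replacement step:

```python
# Illustrative homoglyph table: visually similar character -> original character.
HOMOGLYPHS = {
    "\u0430": "a",  # Cyrillic small a   -> Latin a
    "\u0435": "e",  # Cyrillic small ie  -> Latin e
    "\u03bf": "o",  # Greek small omicron -> Latin o
}

def replace_homoglyphs(text: str) -> str:
    """Replace each homoglyph with the original character it resembles."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)
```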
When the special character is a deletable character, a different cleaning rule is applied according to the meaning of each deletable character. The deletable characters comprise the following: the backspace character, Unicode U+0008; the delete character, Unicode U+007F; and the carriage return character, Unicode U+000D.
When the deletable character is a backspace character, both the backspace character and the character before it are deleted; when it is a delete character, both the delete character and the character after it are deleted; and when it is a carriage return character, the characters after the carriage return overwrite the paragraph from its beginning.
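A sketch of the three rules follows; reading the carriage-return rule as terminal-style overwriting is an interpretation of the wording above, not a statement of the patent's exact algorithm.

```python
def clean_deletable(text: str) -> str:
    """Apply the rules for BS (U+0008), DEL (U+007F) and CR (U+000D)."""
    out: list[str] = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\u0008":        # backspace: also remove the char before it
            if out:
                out.pop()
        elif ch == "\u007f":      # delete: also remove the char after it
            i += 1                # skip that following character
        elif ch == "\u000d":      # carriage return: the remaining text
            rest = clean_deletable(text[i + 1:])    # overwrites from line start
            return rest + "".join(out[len(rest):])
        else:
            out.append(ch)
        i += 1
    return "".join(out)

# "abcdef" then CR then "XY": "XY" overwrites "ab", leaving "XYcdef".
assert clean_deletable("abcdef\rXY") == "XYcdef"
```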
When the special characters are display-order characters, the text is recursively recovered bottom-up according to how the control characters change the display order, finally restoring the original text.
The display-order characters comprise the following: the PDF character, Unicode U+202C; the LRE character, Unicode U+202A; the RLE character, Unicode U+202B; the LRO character, Unicode U+202D; the RLO character, Unicode U+202E; the PDI character, Unicode U+2069; the LRI character, Unicode U+2066; and the RLI character, Unicode U+2067.
The control sequence that changes the display order is [LRO, LRI, RLO, LRI, string 1, PDI, LRI, string 2, PDI, PDF, PDI, PDF].
Restoring the original text by bottom-up recursive recovery comprises the following steps (see the sketch after these steps):
(a) with a non-greedy matching algorithm, match a character sequence of the form [LRO, LRI, RLO, LRI, string 1, PDI, LRI, string 2, PDI, PDF, PDI, PDF] and replace it with the sequence [string 2, string 1];
(b) repeat step (a) until no such sequence remains in the text-modality data;
(c) if display-order characters still remain in the text-modality data, delete them all.
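Steps (a) to (c) can be sketched with Python's non-greedy regular expressions as follows; the pattern mirrors the control sequence given above, and everything else is an assumption:

```python
import re

# Bidi control characters named above.
LRO, RLO = "\u202d", "\u202e"
LRI, PDI = "\u2066", "\u2069"
PDF = "\u202c"
BIDI = "[\u202a-\u202e\u2066-\u2069]"   # all display-order characters
S = f"[^{BIDI[1:-1]}]*?"                # non-greedy run with no bidi controls

# Step (a): match [LRO, LRI, RLO, LRI, s1, PDI, LRI, s2, PDI, PDF, PDI, PDF]
# and replace it with [s2, s1]; excluding controls from s1/s2 makes the
# innermost (bottom-most) sequence match first.
PATTERN = re.compile(
    f"{LRO}{LRI}{RLO}{LRI}({S}){PDI}{LRI}({S}){PDI}{PDF}{PDI}{PDF}"
)

def recover_display_order(text: str) -> str:
    prev = None
    while prev != text:                  # step (b): repeat until stable
        prev = text
        text = PATTERN.sub(r"\2\1", text)
    return re.sub(BIDI, "", text)        # step (c): drop leftover controls
```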
The cleaned text-modality data and the image-modality data are stored in two separate message queues, to be read by the subsequent feature extraction. Note that, to reduce storage volume, only the network link of each image is stored for the image-modality data.
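The patent leaves the queue implementation open; as one possible realization (an assumption, not the patent's design), Redis lists could serve as the two queues:

```python
import json

import redis  # assumption: any message queue would do; Redis is one choice

r = redis.Redis(host="localhost", port=6379)

def enqueue_text(text: str) -> None:
    r.rpush("queue:text", json.dumps({"text": text}))

def enqueue_image(url: str) -> None:
    # Only the image's network link is queued, to keep storage small.
    r.rpush("queue:image", json.dumps({"url": url}))
```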
Step 3: extract the features of the texts and images in the message queues with the feature extraction model.
In the embodiment, the feature extraction model is built on a deep learning model; specifically, a CLIP (Contrastive Language-Image Pre-training) model, which encodes text and images jointly, is used as the contrastive text-image pre-training model. The CLIP model consists of a text encoder and an image encoder, and it is trained by contrastive learning on a large number of mutually related images and texts from the internet, so it mines the correlation between images and texts well and can provide accurate cross-modal indexing. The embodiment therefore only needs to run feature encoding through the CLIP encoders.
In the embodiment, the data processing engine performs feature encoding by calling the CLIP model from multiple threads: each text read from the message queue is fed to the text encoder of the CLIP model to compute its text features, and each image read from the message queue is fed to the image encoder to compute its image features.
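As an illustration, the encoding step might look like the sketch below, which uses the open-source CLIP package (https://github.com/openai/CLIP); the specific checkpoint, ViT-B/32, is an assumption, since the patent does not name one.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_text(text: str) -> torch.Tensor:
    """Compute a text feature vector with the CLIP text encoder."""
    tokens = clip.tokenize([text], truncate=True).to(device)
    with torch.no_grad():
        return model.encode_text(tokens)[0]

def encode_image(path: str) -> torch.Tensor:
    """Compute an image feature vector with the CLIP image encoder."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        return model.encode_image(image)[0]
```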
Step 4: match text features with image features by similarity, then build and store image-text combinations.
In the embodiment, similarity matching is also done by the data processing engine: image-text combinations are obtained by matching and screening according to the similarity between text features and image features, and the combinations are stored in a database indexed by their image features and text features. Specifically, during matching, the similarity between each text feature and all image features is computed, and the image feature with the maximum similarity is selected to pair with the text feature, forming an image-text combination. The similarity may be the cosine similarity or the L2 distance, and the database may be an Elasticsearch database.
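The screening and storage steps could be sketched as follows; the index name and field names are assumptions, and Elasticsearch appears only because the embodiment names it as one option.

```python
import numpy as np
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # assumed local instance

def best_image_index(text_feat: np.ndarray, image_feats: np.ndarray) -> int:
    """Index of the image feature with maximum cosine similarity to the text."""
    t = text_feat / np.linalg.norm(text_feat)
    m = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    return int(np.argmax(m @ t))

def store_pair(text: str, url: str,
               t_feat: np.ndarray, i_feat: np.ndarray) -> None:
    """Store one image-text combination with its features as index fields."""
    es.index(index="image_text_pairs", document={
        "text": text, "image_url": url,
        "text_feature": t_feat.tolist(), "image_feature": i_feat.tolist(),
    })
```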
Step 5: serve retrieval requests on uploaded data.
At retrieval time, the uploaded data, which comprises text data or image data, is received; the matched image-text combination is screened according to the similarity between the uploaded data and the image-text combinations in the database, and returned as the retrieval result.
Retrieval is served through a client with a data retrieval interface. For each user query, the client first determines the modality of the query data, then calls the corresponding encoder of the CLIP model to compute the index encoding.
The data retrieval interface is split into image retrieval and text retrieval. In text retrieval, i.e., when the uploaded data is text data, the text encoder of the feature extraction model extracts the text features of the data; similarity is computed between these features and the text features in the database; the database text feature with the maximum similarity is taken as the matched text feature, and the image-text combination it belongs to is returned as the retrieval result. The similarity between two text features is the cosine similarity or the L2 distance.
In image retrieval, i.e., when the uploaded data is image data, the image encoder of the feature extraction model extracts the image features of the data; similarity is computed between these features and the image features in the database; the database image feature with the maximum similarity is taken as the matched image feature, and the image-text combination it belongs to is returned as the retrieval result. The similarity between two image features is likewise the cosine similarity or the L2 distance.
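Putting the pieces together, a text query could be served as in the brute-force, in-memory sketch below; it reuses encode_text from the CLIP sketch above and leaves the database lookup abstract (image queries are symmetric, using encode_image and the stored image features).

```python
import numpy as np

def search_by_text(query: str, db_text_feats: np.ndarray, pairs: list) -> dict:
    """Return the stored image-text pair whose text feature best matches the query."""
    q = encode_text(query).cpu().numpy()
    q = q / np.linalg.norm(q)
    m = db_text_feats / np.linalg.norm(db_text_feats, axis=1, keepdims=True)
    return pairs[int(np.argmax(m @ q))]    # maximum cosine similarity
```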
The data acquisition and processing method for cross-modal retrieval provided by the embodiment can acquire open-source image and text data and process it in depth, ultimately enabling high-quality cross-modal retrieval over image-text data.
Fig. 3 is a schematic structural diagram of a data acquisition and processing system for cross-modal retrieval according to an embodiment. As shown in fig. 3, the embodiment provides a data acquisition and processing system comprising:
an acquisition module for performing distributed parallel acquisition of multi-modal data from the target open-source data network, wherein the multi-modal data comprises text-modality data and image-modality data;
a cleaning module for cleaning special characters and invisible characters from the text-modality data and then storing the cleaned text-modality data and the image-modality data in separate message queues;
a feature extraction module for extracting the features of each text and each image in the message queues with the feature extraction model to obtain text features and image features;
a matching module for matching and screening according to the similarity between the text features and the image features to obtain image-text combinations, and storing the combinations in a database indexed by their image features and text features;
and a retrieval module for screening the matched image-text combination according to the similarity between the uploaded data and the image-text combinations in the database and returning it as the retrieval result, wherein the uploaded data comprises text data or image data.
The acquisition module collects data through the distributed system: driven by acquisition control commands, multi-modal data is acquired from the target open-source data network in a distributed, parallel manner and streamed back to the distributed system. The cleaning module, the feature extraction module and the matching module are implemented by the data processing engine, which uses stream processing to clean the text-modality data, call the feature extraction model and match features across the different channels of the message queues, and stores the matched image-text combinations in the database.
In the data processing engine, the user can select the data modality to process (e.g., image or text) and the processing mode (single-modality processing or multi-modality association processing), and thereby choose among three functions according to actual needs: feature extraction for images, feature extraction for texts, and association analysis between image and text multi-modal features. Note that, after extracting multi-modal features such as text features and image features, the data processing engine also locates them, i.e., records the positions in the web page of the text data and image data from which the features were extracted.
The retrieval module is implemented through an interactive data retrieval interface: a query request is sent through the interface; based on the request, the corresponding text features and image features are fetched from the database by their index for similarity computation, and the image-text combination with the maximum similarity is screened and returned as the search response.
Through data acquisition and processing oriented to cross-modal retrieval, the system can rapidly acquire multi-modal data and effectively build a data warehouse for cross-modal retrieval; it can automatically mine large-scale multi-modal data, saves manual analysis cost, and is amenable to large-scale deployment.
The above embodiments are intended to illustrate the technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments and do not limit the invention; any modification, addition or equivalent substitution made within the scope of the principles of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. A data acquisition and processing method for cross-modal retrieval, characterized by comprising:
performing distributed parallel acquisition of multi-modal data from a target open-source data network, wherein the multi-modal data comprises text-modality data and image-modality data;
cleaning special characters and invisible characters from the text-modality data, then storing the cleaned text-modality data and the image-modality data in separate message queues;
extracting the features of each text and each image in the message queues with a feature extraction model to obtain text features and image features, matching and screening according to the similarity between the text features and the image features to obtain image-text combinations, and storing the combinations in a database indexed by their image features and text features;
and, at retrieval time, screening the matched image-text combination according to the similarity between the uploaded data and the image-text combinations in the database and returning it as the retrieval result, wherein the uploaded data comprises text data or image data.
2. The data acquisition and processing method for cross-modal retrieval according to claim 1, wherein cleaning invisible characters from the text-modality data comprises:
filtering out the invisible characters in the text-modality data, wherein the invisible characters comprise the zero-width space, the zero-width joiner and the zero-width non-joiner.
3. The data acquisition and processing method for cross-modal retrieval according to claim 1, wherein cleaning special characters from the text-modality data comprises:
when the special character is a homoglyph, replacing the homoglyph with the original character according to the mapping between homoglyphs and the original characters they resemble recorded in the homoglyph table;
when the special character is a deletable character, applying a different cleaning rule per deletable character, including: when the deletable character is a backspace character, deleting both the backspace character and the character before it; when the deletable character is a delete character, deleting both the delete character and the character after it; and when the deletable character is a carriage return character, letting the characters after the carriage return overwrite the paragraph from its beginning.
4. The data acquisition and processing method for cross-modal retrieval according to claim 1, wherein cleaning special characters from the text-modality data comprises:
when the special characters are display-order characters, recursively recovering the text bottom-up according to how the control characters change the display order, finally restoring the original text, wherein the display-order characters comprise the PDF, LRE, RLE, LRO, RLO, PDI, LRI and RLI characters, and the control sequence that changes the display order is [LRO, LRI, RLO, LRI, string 1, PDI, LRI, string 2, PDI, PDF, PDI, PDF];
restoring the original text by bottom-up recursive recovery comprises:
(a) with a non-greedy matching algorithm, matching a character sequence of the form [LRO, LRI, RLO, LRI, string 1, PDI, LRI, string 2, PDI, PDF, PDI, PDF] and replacing it with the sequence [string 2, string 1];
(b) repeating step (a) until no such sequence remains in the text-modality data;
(c) if display-order characters still remain in the text-modality data, deleting them all.
5. The data acquisition and processing method for cross-modal retrieval according to claim 1, wherein matching and screening according to the similarity between the text features and the image features to obtain an image-text combination comprises:
computing the similarity between each text feature and all image features, and selecting the image feature with the maximum similarity to pair with the text feature, forming an image-text combination.
6. The data acquisition and processing method for cross-modal retrieval according to claim 1, wherein screening the matched image-text combination according to the similarity between the uploaded data and the image-text combinations in the database as the retrieval result comprises:
when the uploaded data is text data, extracting its text features with the feature extraction model, computing the similarity between these text features and the text features in the database, taking the database text feature with the maximum similarity as the matched text feature, and taking the image-text combination to which the matched text feature belongs as the retrieval result;
when the uploaded data is image data, extracting its image features with the feature extraction model, computing the similarity between these image features and the image features in the database, taking the database image feature with the maximum similarity as the matched image feature, and taking the image-text combination to which the matched image feature belongs as the retrieval result.
7. The data acquisition and processing method for cross-modal retrieval according to claim 1 or 6, wherein the feature extraction model is a contrastive text-image pre-training model.
8. The data acquisition and processing method for cross-modal retrieval according to claim 1, 5 or 6, wherein the similarity between a text feature and an image feature, between two text features, or between two image features is the cosine similarity or the L2 distance.
9. A data acquisition and processing system for cross-modal retrieval, characterized by comprising:
an acquisition module for performing distributed parallel acquisition of multi-modal data from the target open-source data network, wherein the multi-modal data comprises text-modality data and image-modality data;
a cleaning module for cleaning special characters and invisible characters from the text-modality data and then storing the cleaned text-modality data and the image-modality data in separate message queues;
a feature extraction module for extracting the features of each text and each image in the message queues with the feature extraction model to obtain text features and image features;
a matching module for matching and screening according to the similarity between the text features and the image features to obtain image-text combinations, and storing the combinations in a database indexed by their image features and text features;
and a retrieval module for screening the matched image-text combination according to the similarity between the uploaded data and the image-text combinations in the database and returning it as the retrieval result, wherein the uploaded data comprises text data or image data.
CN202210260897.6A, priority date 2022-03-16, filing date 2022-03-16: Cross-modal retrieval-oriented data acquisition processing method and system (Pending), published as CN114611618A

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210260897.6A CN114611618A (en) 2022-03-16 2022-03-16 Cross-modal retrieval-oriented data acquisition processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210260897.6A CN114611618A (en) 2022-03-16 2022-03-16 Cross-modal retrieval-oriented data acquisition processing method and system

Publications (1)

Publication Number Publication Date
CN114611618A 2022-06-10

Family

ID=81863038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210260897.6A Pending CN114611618A (en) 2022-03-16 2022-03-16 Cross-modal retrieval-oriented data acquisition processing method and system

Country Status (1)

Country Link
CN (1) CN114611618A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116881335A (en) * 2023-07-24 2023-10-13 郑州华商科技有限公司 Multi-mode data intelligent analysis system and method


Similar Documents

Publication Publication Date Title
CN112199375B (en) Cross-modal data processing method and device, storage medium and electronic device
CN110119786B (en) Text topic classification method and device
CN100550038C (en) Image content recognizing method and recognition system
CN111967302B (en) Video tag generation method and device and electronic equipment
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN105631051A (en) Character recognition based mobile augmented reality reading method and reading system thereof
CN109726712A (en) Character recognition method, device and storage medium, server
CN112509690B (en) Method, apparatus, device and storage medium for controlling quality
CN111753717B (en) Method, device, equipment and medium for extracting structured information of text
EP0687991B1 (en) Information processing method and apparatus and computer readable memory medium
CN112883980B (en) Data processing method and system
CN113076465A (en) Universal cross-modal retrieval model based on deep hash
CN113221735A (en) Multimodal-based scanned part paragraph structure restoration method and device and related equipment
CN113761242A (en) Big data image recognition system and method based on artificial intelligence
CN114219971A (en) Data processing method, data processing equipment and computer readable storage medium
CN114611618A (en) Cross-modal retrieval-oriented data acquisition processing method and system
CN114821613A (en) Extraction method and system of table information in PDF
CN113705164A (en) Text processing method and device, computer equipment and readable storage medium
CN111340031A (en) Equipment almanac target information extraction and identification system based on image identification and method thereof
Rahal et al. Information extraction from Arabic and Latin scanned invoices
CN116450829A (en) Medical text classification method, device, equipment and medium
CN114707017A (en) Visual question answering method and device, electronic equipment and storage medium
CN113849622A (en) Visual cognition question-answering method and system for civil aviation documents
CN111860526A (en) Image-based question judging method and device, electronic equipment and computer storage medium
CN111611981A (en) Information identification method and device and information identification neural network training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination