CN116563426A - Method, apparatus, electronic device and medium for processing multi-modal data

Info

Publication number
CN116563426A
Authority
CN
China
Prior art keywords
source
pixel blocks
feature
generating
image
Prior art date
Legal status
Pending
Application number
CN202310511915.8A
Other languages
Chinese (zh)
Inventor
郭龙腾
陈浚毅
孙佳
袁泽寰
Current Assignee
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202310511915.8A
Publication of CN116563426A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/60 Editing figures and text; Combining figures or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0499 Feedforward networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of the present disclosure relate to methods, apparatuses, electronic devices, and media for processing multi-modal data. The method includes segmenting a source image into a set of source pixel blocks and generating a sequence of masked source pixel blocks by masking one or more source pixel blocks in the set of source pixel blocks. The method also includes generating one or more target pixel blocks corresponding to the one or more source pixel blocks based on the masked sequence of source pixel blocks and the source text corresponding to the source image. The method further includes generating a multimodal model based on the one or more source pixel blocks and the one or more target pixel blocks. The method of the embodiments of the disclosure can train the multi-modal model with less training data, effectively reduce the training cost and the complexity of training the multi-modal model, and improve the stability and accuracy of the model.

Description

Method, apparatus, electronic device and medium for processing multi-modal data
Technical Field
The present disclosure relates generally to the field of machine learning, and more particularly, to a method, apparatus, electronic device, and medium for processing multi-modal data.
Background
Multimodal data refers to a dataset made up of data of multiple modalities, typically including different types of information, such as text, images, audio, video, etc. Multimodal data can facilitate a more comprehensive understanding and analysis of various things in the real world. In the field of machine learning, researchers process these data by building multimodal models to achieve more accurate prediction, classification, generation, and other tasks.
For example, in Natural Language Processing (NLP) and Computer Vision (CV) tasks, multimodal models can process text and image information simultaneously, thereby achieving better performance in tasks such as scene understanding, automatic image annotation, and image description generation. A multi-modal model trained on large-scale unlabeled data can be quickly adapted and applied to various downstream tasks, and there is substantial business demand for such models.
Disclosure of Invention
Embodiments of the present disclosure provide a method, apparatus, electronic device, and medium for processing multi-modal data, which can train a multi-modal model with less training data, reduce the complexity of training the multi-modal model, and improve the stability and accuracy of the model.
In a first aspect of the present disclosure, a method for processing multi-modal data is provided. The method includes partitioning a source image into a set of source pixel blocks, each pixel block in the set of source pixel blocks including a plurality of pixels. The method further includes generating a sequence of masked source pixel blocks by masking one or more source pixel blocks of the set of source pixel blocks. The method also includes generating one or more target pixel blocks corresponding to the one or more source pixel blocks based on the masked sequence of source pixel blocks and the source text corresponding to the source image. The method further includes generating a multimodal model based on the one or more source pixel blocks and the one or more target pixel blocks.
In a second aspect of the present disclosure, an apparatus for processing multimodal data is provided. The apparatus includes a pixel block segmentation module configured to segment a source image into a set of source pixel blocks, each pixel block in the set of source pixel blocks including a plurality of pixels. The apparatus also includes a pixel block masking module configured to generate a sequence of masked source pixel blocks by masking one or more source pixel blocks in a set of source pixel blocks. The apparatus also includes a pixel block generation module configured to generate one or more target pixel blocks corresponding to the one or more source pixel blocks based on the masked sequence of source pixel blocks and the source text corresponding to the source image. The apparatus further includes a multi-modal model generation module configured to generate a multi-modal model based on the one or more source pixel blocks and the one or more target pixel blocks.
In a third aspect of the present disclosure, an electronic device is provided. The electronic device includes one or more processors; and a storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement a method for processing multi-modal data. The method includes partitioning a source image into a set of source pixel blocks, each pixel block in the set of source pixel blocks including a plurality of pixels. The method further includes generating a sequence of masked source pixel blocks by masking one or more source pixel blocks of the set of source pixel blocks. The method also includes generating one or more target pixel blocks corresponding to the one or more source pixel blocks based on the masked sequence of source pixel blocks and the source text corresponding to the source image. The method further includes generating a multimodal model based on the one or more source pixel blocks and the one or more target pixel blocks.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements a method for processing multi-modal data. The method includes partitioning a source image into a set of source pixel blocks, each pixel block in the set of source pixel blocks including a plurality of pixels. The method further includes generating a sequence of masked source pixel blocks by masking one or more source pixel blocks of the set of source pixel blocks. The method also includes generating one or more target pixel blocks corresponding to the one or more source pixel blocks based on the masked sequence of source pixel blocks and the source text corresponding to the source image. The method further includes generating a multimodal model based on the one or more source pixel blocks and the one or more target pixel blocks.
The summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference numerals denote like or similar elements, in which:
FIG. 1 illustrates a schematic diagram of an example environment in which various embodiments of the present disclosure may be implemented;
FIG. 2 illustrates a flow chart of a method for processing multi-modal data in accordance with some embodiments of the present disclosure;
FIG. 3 illustrates a schematic diagram of a process of training a multimodal model by masking images, according to some embodiments of the disclosure;
FIG. 4 illustrates a schematic diagram of a process of training a multimodal model by masking text, according to some embodiments of the disclosure;
FIG. 5A illustrates a schematic diagram of a structure of a multimodal model, according to some embodiments of the present disclosure;
FIG. 5B illustrates a schematic diagram of a process of generating pixel block sequence embedding and word sequence embedding, according to some embodiments of the present disclosure;
FIG. 6 illustrates a schematic diagram of a structure of a routing network used in the structure of a multimodal model, according to some embodiments of the present disclosure;
FIG. 7 illustrates a schematic diagram of a process of using a multimodal model in an inference phase in accordance with some embodiments of the disclosure;
FIG. 8 illustrates a block diagram of an apparatus for processing multimodal data in accordance with some embodiments of the present disclosure; and
FIG. 9 illustrates a block diagram of a device capable of implementing various embodiments of the present disclosure.
Detailed Description
It will be appreciated that all user-related data involved in the present solution should be acquired and used only after user authorization. This means that, if personal information of the user is required in the present solution, explicit consent and authorization of the user must be obtained before such data are acquired; otherwise, no related data collection and use will be performed. It should also be understood that, when implementing the technical solution, related laws and regulations should be strictly complied with in the process of collecting, using, and storing data, and necessary technologies and measures should be taken to ensure the security of user data and the safe use of the data.
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be more thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
In describing embodiments of the present disclosure, the term "comprising" and its like should be understood to be open-ended, i.e., including, but not limited to. The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like, may refer to different or the same object unless explicitly stated otherwise. Other explicit and implicit definitions are also possible below.
As described above, the multimodal model is capable of processing and understanding data of multiple modalities, such as text, images, audio, and video, among others. These models are trained during a pre-training phase using large amounts of multi-modal data to capture shared information and inherent correlations between different modalities. Through training, the multi-modal model can learn richer and deeper knowledge, thereby improving the performance of the model on various tasks. These models can also migrate knowledge learned on one modality to another modality, thereby achieving better generalization and reasoning capabilities. In addition, the multimodal model can be applied to various cross-modal tasks, such as text-to-image generation, image-to-text generation, video understanding, multimodal question answering, and the like.
Pre-training refers to training a generic model on a large amount of unlabeled data so that it learns general features or knowledge for a certain task. The generic model is then used as the initialization model of a downstream task, and is fine-tuned or transferred with labeled data so that it is suited to the usage scenario of the downstream task. The main advantage of using a pre-trained model is that it can significantly improve the performance of downstream tasks while saving significant training time and computational resources. A pre-trained model generally has better initial performance than a randomly initialized model and has already learned some general features or knowledge, so that the training process converges faster and yields better results when fine-tuning on the downstream task.
During the pre-training of multimodal models, some conventional schemes use image-text contrastive (ITC) and image-text matching (ITM) techniques to learn the relationship between visual content and language content. However, these schemes consume significant computational resources and rely on large amounts of training data in each training iteration. In other conventional schemes, to model image data, a visual tokenizer needs to be trained to convert pixels in the image into visual tokens, and the multi-modal model is then trained using the visual tokens. However, these schemes are highly sensitive to the accuracy of the generated visual tokens; in other words, if the accuracy of the pre-trained visual tokenizer cannot be ensured, the accuracy of the trained multimodal model cannot be ensured either.
To this end, embodiments of the present disclosure propose a scheme for processing multi-modal data. The scheme trains a multimodal model using an image and text corresponding to the image (e.g., descriptive text about the image) as training data. The scheme divides an image into several pixel blocks, each pixel block containing a plurality of pixels. A portion of these pixel blocks is then masked to generate a sequence of masked pixel blocks. The scheme may utilize a multimodal model to generate target pixel blocks corresponding to the masked pixel blocks based on the sequence of masked pixel blocks and the descriptive text of the image. The task of the multimodal model is to bring the generated target pixel blocks close to the corresponding original pixel blocks before masking. Thus, the scheme trains the multimodal model based on differences between the generated target pixel blocks and the corresponding original pixel blocks. In this way, the scheme provided by the embodiments of the disclosure can efficiently utilize the existing training data by masking different pixel blocks in the same image, so that the multimodal model can be trained with less training data, reducing the training cost. In addition, by comparing differences between pixel blocks, the scheme avoids training an additional visual tokenizer, thereby reducing the complexity of training the multi-modal model and improving the stability and accuracy of the model.
FIG. 1 illustrates a schematic diagram of an example environment 100 in which various embodiments of the present disclosure may be implemented. In the example shown in FIG. 1, environment 100 includes an image 102 and text 104 describing the contents of image 102. A dim sky, a street, stop signs on both sides of the street, and a lighted street lamp are shown in image 102. Text 104 is an English sentence, "A stop sign and lighted lamp post along a street at dusk". In the example shown in FIG. 1, the image 102 is partitioned into 16 pixel blocks arranged in 4 rows and 4 columns, e.g., pixel blocks 106, 108, 110, 112, 114, 116, etc. These pixel blocks are obtained from the original image and are therefore referred to as source pixel blocks, and each pixel block includes a plurality of pixels, for example 120×200 pixels. In environment 100, a portion of these pixel blocks may be masked with any predetermined masking image (represented in FIG. 1 by pixel blocks labeled with the letter "M"); for example, pixel blocks 108, 112, 114, and 116 are masked in the example of FIG. 1, thereby generating a masked sequence of pixel blocks 118. To show more clearly which pixel blocks in the original image are masked, the masked sequence of pixel blocks 118 is still displayed in a two-dimensional form of 4 rows and 4 columns, but it may also be represented as a one-dimensional ordered sequence of pixel blocks.
As shown in FIG. 1, a multimodal model 124 is included in the environment 100. Through preprocessing 120, the masked sequence of pixel blocks 118 is processed into an embedded form that is input to the multimodal model 124. Similarly, through preprocessing 122, text 104 is processed into an embedded form that is input to the multimodal model 124. An embedding (also referred to as an embedding vector) is a continuous vector representation obtained by converting discrete features; when processing discrete features (e.g., colors, words, etc.) in images, text, and the like, these features can be converted into an embedded form for use by subsequent algorithms. Multimodal model 124 then outputs multimodal features, including image features and text features, based on the embedded forms of the masked sequence of pixel blocks 118 and text 104. Then, in the example of FIG. 1, a decoder 126 may be utilized to generate a target pixel block for each masked pixel block based on the output multi-modal features. In other words, based on the output multi-modal features, decoder 126 may reconstruct the contents of the masked pixel blocks. For example, decoder 126 may output target pixel blocks 108', 112', 114', and 116', where target pixel block 108' is a reconstruction of masked source pixel block 108 based on the image feature portion of the multi-modal features, target pixel block 112' is a reconstruction of masked source pixel block 112, target pixel block 114' is a reconstruction of masked source pixel block 114, and target pixel block 116' is a reconstruction of masked source pixel block 116. The differences between the target pixel blocks and the source pixel blocks may then be compared and back-propagated to the multimodal model 124 to optimize various parameters in the multimodal model 124; through a number of rounds of back-propagation and optimization, training of the multimodal model 124 is completed.
It should be understood that the images, text, segmentation methods, and masking methods shown in FIG. 1 are given by way of example only, and embodiments of the present disclosure should not be limited to any particular images, text, segmentation methods, or masking methods. For example, image 102 may be of any size and there may be any number of images, text 104 may be text in any language, of any length, and with any content, image 102 may be partitioned into any number of pixel blocks, any number of pixels may be included in a pixel block, and any number of the pixel blocks may be replaced with any predetermined mask image.
Fig. 2 illustrates a flow chart of a method 200 for processing multi-modal data in accordance with some embodiments of the present disclosure. As shown in fig. 2, at block 202, method 200 partitions a source image into a set of source pixel blocks. For example, in the example of fig. 1, the image 102 may be partitioned into 16 pixel blocks (e.g., pixel blocks 106, 108, 110, 112, 114, 116, etc.) of 4 rows and 4 columns, where each pixel block includes a plurality of pixels.
At block 204, method 200 generates a sequence of masked source pixel blocks by masking one or more source pixel blocks in the set of source pixel blocks. For example, in the example of FIG. 1, any predetermined pixel blocks may be used as masking pixel blocks (represented in FIG. 1 by pixel blocks labeled with the letter "M") to mask a portion of these pixel blocks, e.g., pixel blocks 108, 112, 114, and 116 are masked in the example of FIG. 1, to generate a masked sequence of pixel blocks 118.
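By way of illustration only, the segmentation of block 202 and the masking of block 204 may be sketched in Python roughly as follows; the patch size, the masking ratio, the use of a zero-valued mask image, and the tensor layout are assumptions of this sketch and are not limited by the present disclosure.
    import torch

    def patchify(image: torch.Tensor, patch: int) -> torch.Tensor:
        """Split an image of shape (C, H, W) into a sequence of flattened patch x patch pixel blocks."""
        c, h, w = image.shape
        blocks = image.unfold(1, patch, patch).unfold(2, patch, patch)   # (C, H/p, W/p, p, p)
        blocks = blocks.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
        return blocks                                                    # (N, C*p*p), N pixel blocks

    def random_mask(blocks: torch.Tensor, ratio: float = 0.25):
        """Mask a random subset of the source pixel blocks with a predetermined mask value."""
        n = blocks.shape[0]
        masked_idx = torch.randperm(n)[:int(n * ratio)]
        masked_blocks = blocks.clone()
        masked_blocks[masked_idx] = 0.0    # any predetermined mask image may be used instead of zeros
        return masked_blocks, masked_idx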
At block 206, method 200 generates one or more target pixel blocks corresponding to the one or more source pixel blocks based on the masked sequence of source pixel blocks and the source text corresponding to the source image. For example, in the example of fig. 1, the masked sequence of source pixel blocks 118 and text 104 may be pre-processed into an embedded form acceptable to the multimodal model 124, and then input to the multimodal model 124. Multimodal model 124 may output multimodal features based on the masked sequence of source pixel blocks 118 and text 104 in embedded form. The image feature portion of the output multi-modal feature may then be decoded using decoder 126 to generate target pixel blocks 108', 112', 114', and 116'. Target pixel block 108 'is a reconstruction of masked source pixel block 108 based on the image feature portion of the output multi-modal feature, target pixel block 112' is a reconstruction of masked source pixel block 112, target pixel block 114 'is a reconstruction of masked source pixel block 114, and target pixel block 116' is a reconstruction of masked source pixel block 116.
At block 208, the method 200 generates a multimodal model based on the one or more source pixel blocks and the one or more target pixel blocks. For example, in the example of fig. 1, a loss function (also referred to as a first loss function) may be constructed based on the reconstructed target pixel block and the original source pixel block to calculate the current loss. This loss may then be back-propagated to the multimodal model 124 for optimization of various parameters in the multimodal model 124. When the loss is less than a certain threshold, training of the multimodal model 124 may be completed. The trained multimodal model 124 can be used for downstream tasks, in particular, the downstream tasks utilize the multimodal model 124 to process multimodal data including image-text pairs and generate multimodal features that fuse knowledge and features of the same thing in the content of different modalities (images and text).
In this manner, method 200 may efficiently utilize existing training data by masking different blocks of pixels in the same image, thereby enabling a multimodal model to be trained with less training data and reducing training costs. In addition, by comparing differences between pixel blocks, the method 200 avoids training an additional visual tokenizer, thereby reducing the complexity of training a multimodal model and improving the stability and accuracy of the model.
When the images in the training data have a large size, training the multimodal model consumes significant computational resources. In some embodiments, to reduce the computational power requirements and the consumption of computational resources, the masked source pixel blocks may be deleted from the masked sequence of source pixel blocks while the sequence is being pre-processed, thereby generating a retained pixel block sequence. A retained pixel block sequence embedding may then be generated based on the retained pixel block sequence, and the multimodal model may generate a multimodal feature based on the retained pixel block sequence embedding. After the multi-modal feature is generated, the features of the mask pixel blocks may be inserted back into the corresponding locations in the multi-modal feature, and the target pixel blocks may then be generated based on the updated multi-modal feature. The scheme provided by these embodiments can reduce the effective area of the image that the multi-modal model needs to process, thereby reducing the amount of computation and the consumption of computational resources, and reducing the requirement on computational capability.
FIG. 3 illustrates a schematic diagram of a process 300 for training a multimodal model by masking images, according to some embodiments of the disclosure. As shown in FIG. 3, in process 300, image 102 is partitioned into a set of pixel blocks (pixel blocks 106, 108, 110, 112, 114, 116, etc.). Process 300 randomly masks a number of these pixel blocks (e.g., pixel blocks 108, 112, 114, and 116) using a predetermined mask pixel block (represented in FIG. 3 by a pixel block labeled with the letter "M") to generate a masked sequence of pixel blocks 118. In the embodiment shown in FIG. 3, process 300 deletes the four masked pixel blocks from the masked sequence of pixel blocks 118, leaving the remaining pixel blocks in the sequence, resulting in a retained pixel block sequence 302 that includes the remaining 12 pixel blocks. For example, since masked pixel block 108 is deleted, pixel block 110 immediately follows pixel block 106 in the retained pixel block sequence 302.
In the example shown in FIG. 3, process 300 generates a retained pixel block sequence embedding 304 based on the retained pixel block sequence 302. For example, process 300 may utilize techniques such as autoencoders or convolutional neural networks to generate a corresponding pixel block embedding for each pixel block in the retained pixel block sequence 302. For example, pixel block embedding 306 is an embedded representation of pixel block 106, and pixel block embedding 310 is an embedded representation of pixel block 110. The retained pixel block sequence embedding 304 is an ordered sequence consisting of all such pixel block embeddings. The retained pixel block sequence embedding 304 includes 12 pixel block embeddings corresponding to the 12 retained pixel blocks, but only 4 of them are shown in FIG. 3 for simplicity.
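As a minimal sketch of this step, assuming a simple linear projection as the pixel block embedding layer (the disclosure equally contemplates autoencoders or convolutional networks; all names below are illustrative):
    import torch
    import torch.nn as nn

    class BlockEmbedder(nn.Module):
        """Embeds only the retained (unmasked) pixel blocks, as in the retained pixel block sequence 302."""
        def __init__(self, block_dim: int, embed_dim: int):
            super().__init__()
            self.proj = nn.Linear(block_dim, embed_dim)

        def forward(self, blocks: torch.Tensor, masked_idx: torch.Tensor) -> torch.Tensor:
            keep = torch.ones(blocks.shape[0], dtype=torch.bool)
            keep[masked_idx] = False              # delete the masked source pixel blocks
            retained = blocks[keep]               # e.g. 12 of the 16 blocks remain
            return self.proj(retained)            # retained pixel block sequence embedding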
On the other hand, the process 300 generates a word sequence embedding 308 based on the text 104. For example, the process 300 may utilize techniques such as transformers to generate a corresponding word embedding for each word in the text 104. For example, word embedding 322 is an embedded representation of the word "A" and word embedding 324 is an embedded representation of the word "stop". Word sequence embedding 308 includes 12 word embeddings corresponding to the 12 words (also referred to as source words) in text 104, but only 4 of them are shown in FIG. 3 for simplicity.
As shown in FIG. 3, process 300 inputs the retained pixel block sequence embedding 304 and the word sequence embedding 308 into multimodal model 124, enabling the multimodal model to learn knowledge and features shared between the pixel block embeddings and the word embeddings, and then outputs multimodal feature 312 (also referred to as a first multimodal feature). In the example shown in FIG. 3, the multimodal feature 312 includes an image feature portion and a text feature portion. The image feature portion includes 12 pixel block features corresponding to the 12 pixel block embeddings in the retained pixel block sequence embedding 304. For example, pixel block feature 316 corresponds to pixel block embedding 306 of pixel block 106, and pixel block feature 320 corresponds to pixel block embedding 310 of pixel block 110. The text feature portion includes 12 word features, which correspond to the 12 word embeddings in word sequence embedding 308. For example, word feature 332 corresponds to word embedding 322 of the word "A" and word feature 334 corresponds to word embedding 324 of the word "stop".
To reconstruct the masked source pixel blocks, process 300 may generate a mask image feature for the mask pixel block using techniques such as a transformer, and then insert the mask image feature back into the corresponding locations in multi-modal feature 312, thereby generating an updated multi-modal feature 314 (also referred to as a second multi-modal feature). For example, process 300 may generate mask image feature 318 based on the mask pixel block. In some embodiments, since the same mask pixel block is used to mask several source pixel blocks, the mask image feature may be generated only once and then inserted at each corresponding location in the multi-modal feature 312. For example, process 300 inserts mask image feature 318 between image feature 316 and image feature 320 to represent the feature of masked pixel block 108.
After generating the updated multi-modal feature 314, process 300 may utilize decoder 126 to decode the image feature portion of the updated multi-modal feature 314 to generate target pixel blocks 108', 112', 114', and 116', where target pixel block 108' is a reconstruction of masked source pixel block 108 based on the image feature portion of multi-modal feature 314, target pixel block 112' is a reconstruction of masked source pixel block 112, target pixel block 114' is a reconstruction of masked source pixel block 114, and target pixel block 116' is a reconstruction of masked source pixel block 116. In process 300, the goal is to make the content of the generated target pixel blocks approximate the content of the original source pixel blocks by optimizing the various parameters of the multimodal model 124. For example, the stop sign does not appear in any of the pixel blocks of the retained pixel block sequence 302; however, based on the knowledge provided by the word sequence embedding 308, a stop sign should appear in the original image, so it is desirable that the stop sign appear in the generated target pixel blocks. Thus, the process 300 may construct a loss function based on the mean square error between the reconstructed target pixel blocks and the original source pixel blocks to calculate the current loss. The process 300 may then back-propagate the loss to the multimodal model 124 to optimize various parameters in the multimodal model 124. When the loss is less than a certain threshold, training of the multimodal model 124 may be completed.
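For illustration, the re-insertion of the mask image feature and the first loss function described above could be sketched as follows; the learnable mask token, the linear decoder, and the feature dimension of 768 are assumptions of this sketch.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    mask_token = nn.Parameter(torch.zeros(1, 768))     # assumed learnable mask image feature
    decoder = nn.Linear(768, 16 * 16 * 3)              # assumed decoder: one feature -> one pixel block

    def masked_image_loss(image_feats, keep_idx, masked_idx, source_blocks):
        """Insert mask features at the masked positions, decode them, and compare with the source blocks."""
        n = len(keep_idx) + len(masked_idx)
        full = torch.zeros(n, image_feats.shape[-1])
        full[keep_idx] = image_feats                   # features of the retained pixel blocks
        full[masked_idx] = mask_token                  # image part of the second multi-modal feature
        target_blocks = decoder(full[masked_idx])      # reconstruct only the masked positions
        return F.mse_loss(target_blocks, source_blocks[masked_idx])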
In the embodiment shown in fig. 3, by deleting the masked source pixel blocks from the sequence of source pixel blocks, the effective area in the image that the multimodal model needs to process can be reduced, thereby reducing the amount of computation and the consumption of computing resources, and reducing the requirement for computing power. In addition, by randomly masking different pixel blocks in the same image, existing training data can be efficiently utilized, so that a multimodal model can be trained with less training data, and training cost is reduced.
In some embodiments, to further improve the training efficiency of the multimodal model and to improve the accuracy of the multimodal model, words in the text (also referred to as source words) may be masked, and target words corresponding to the masked words may then be generated based on the source image and the masked word sequence. In these embodiments, a loss function may be constructed based on the differences between the target words and the source words, and an overall loss function of the multimodal model may be constructed based on both the loss function constructed during training with masked images and the loss function constructed during training with masked text; the various parameters of the multimodal model may then be optimized such that the value of the overall loss function is minimized.
FIG. 4 illustrates a schematic diagram of a process 400 for training a multimodal model by masking text, according to some embodiments of the disclosure. As shown in FIG. 4, process 400 partitions image 102 into a set of pixel blocks and then generates a pixel block sequence embedding 401 based on the sequence of pixel blocks. On the other hand, process 400 randomly masks words (e.g., "stop" and "lamp") in text 104 using a predetermined mask word (e.g., represented by [MASK] in FIG. 4) to generate masked text 402. In process 400, the masked text 402 may be converted into the form of a word sequence for subsequent operations. In the example shown in FIG. 4, process 400 may utilize techniques such as transformers to generate a corresponding word embedding for each word in the converted word sequence. For example, word embedding 406 is an embedded representation of the word "A" and word embedding 408 is an embedded representation of the mask word immediately following the word "A". Word sequence embedding 404 is an ordered sequence consisting of all such word embeddings. Word sequence embedding 404 includes 12 word embeddings corresponding to the 12 words (including several mask words) in the masked text 402, but only 6 of them are shown in FIG. 4 for simplicity.
As shown in FIG. 4, process 400 inputs the pixel block sequence embedding 401 and the word sequence embedding 404 into multimodal model 124, enabling the multimodal model to learn knowledge and features shared between the pixel block embeddings and the word embeddings, and then outputs multimodal feature 412 (also referred to as a third multimodal feature). In the example shown in FIG. 4, the multimodal feature 412 includes an image feature portion and a text feature portion. The image feature portion includes 16 pixel block features corresponding to the 16 pixel block embeddings in the pixel block sequence embedding 401. The text feature portion includes 12 word features, which correspond to the 12 word embeddings in word sequence embedding 404. For example, word feature 416 corresponds to word embedding 406 of the word "A", and another word feature corresponds to word embedding 408 of the mask word that replaces the word "stop".
After generating the multimodal feature 412, the process 400 may utilize the decoder 426 to decode the word features at the masked positions in the text feature portion of the multimodal feature 412 to generate target words 428' and 430', wherein the target word 428' is a reconstruction of the masked source word 428 based on the masked portions of the text features of the multimodal feature 412, and the target word 430' is a reconstruction of the masked source word 430. The process 400 may utilize a cross-entropy function to construct a loss function (also referred to as a second loss function) based on the reconstructed target words and the original source words, thereby calculating the current loss. In some embodiments, during the training phase of multimodal model 124, the process 300 of masking images shown in FIG. 3 may be performed in parallel with process 400, thereby constructing both a masked-image loss function and a masked-text loss function. The process 400 may construct an overall loss function (also referred to as a third loss function) by summing the two loss functions, and the goal of the process 400 is to minimize the value of the overall loss function by optimizing the various parameters of the multimodal model 124.
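As an illustrative sketch of the second (masked-text) loss and the third (overall) loss, assuming a linear vocabulary head and equal weighting of the two terms (neither is mandated by the disclosure):
    import torch.nn.functional as F

    def masked_text_loss(text_feats, masked_positions, target_token_ids, vocab_head):
        """Cross entropy between the predictions at the masked positions and the original source words."""
        logits = vocab_head(text_feats[masked_positions])   # (num_masked_words, vocab_size)
        return F.cross_entropy(logits, target_token_ids)

    def overall_loss(image_loss, text_loss):
        """Third loss function: sum of the masked-image loss and the masked-text loss."""
        return image_loss + text_loss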
By performing the training process of masking the image and the training process of masking the text simultaneously, the multimodal model can learn richer, deeper knowledge and features, which helps the model better understand and interpret complex real world things. In addition, this training approach enables the multi-modality model to better learn how to align and correlate the image and text information with each other, thereby improving the alignment capability of the model and exhibiting greater robustness in processing noise data in the various modalities. In addition, the training mode is helpful for accelerating the convergence speed of the model and improving the training efficiency. Compared with some traditional schemes, the training mode only needs to train the multi-modal model without depending on other models, so that the stability of the training process and the accuracy of the multi-modal model are improved.
In order to further improve the training effect of the multimodal model, the structure of the multimodal model may be optimized. The multimodal model may receive pixel block sequence embedding and word sequence embedding as inputs and then output multimodal features. In some embodiments, in the structure of the multimodal model, image features and text features may be generated using a shared self-attention network and based on pixel block sequence embedding and word sequence embedding input into the model. The image features and text features are then further optimized using the image-specific network and the text-specific network, respectively. The image features and text features are then further optimized using another shared self-attention network based on the image features and text features. The multi-modal feature is then output using a multi-layer routing network comprised of a plurality of networks.
FIG. 5A illustrates a schematic diagram of the structure of a multimodal model 500, according to some embodiments of the disclosure. As shown in FIG. 5A, multimodal model 500 receives as input a pixel block sequence embedding 502 and a word sequence embedding 504 and outputs a multimodal feature 528. FIG. 5B illustrates a schematic diagram of a process 530 of generating the pixel block sequence embedding 502 and the word sequence embedding 504, according to some embodiments of the present disclosure. In process 530, image 532 is partitioned into a set of pixel blocks, and process 530 generates a pixel block sequence embedding 536 based on the partitioned image 532. Process 530 may then add the embedding corresponding to each pixel block in pixel block sequence embedding 536 to the position embedding of that pixel block and to a modality embedding representing the image modality, so that multi-modal model 500 can capture the position and modality information of each pixel block embedding. For example, process 530 may add pixel block embedding 538 to a position embedding 540 that represents its position in the sequence of pixel blocks (e.g., the position of the first pixel block in the sequence is represented by "0", the position of the second pixel block by "1", and so on) and to a modality embedding 542 that represents its modality type (e.g., the image modality is represented by "0" and the text modality by "1"), thereby generating pixel block embedding 544. The pixel block sequence embedding 502 may be generated by performing the above operations on all of the pixel block embeddings in the pixel block sequence embedding 536. Similarly, process 530 generates a word sequence embedding 546 based on text 534. Process 530 may then add the embedding corresponding to each word in word sequence embedding 546 to the position embedding of that word and to a modality embedding representing the text modality, so that multi-modal model 500 can capture the position and modality information of each word embedding. For example, process 530 may add word embedding 548 to a position embedding 550, which represents its position in the text, and to a modality embedding 552, which represents its modality type, thereby generating word embedding 554. The word sequence embedding 504 may be generated by performing the above operations on all word embeddings in the word sequence embedding 546.
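The addition of position embeddings and modality embeddings described above may be sketched as follows; the embedding dimension, the maximum sequence length, and the use of learned position embeddings are assumptions of this illustration.
    import torch
    import torch.nn as nn

    class InputEmbedding(nn.Module):
        """Adds a position embedding and a modality embedding (0: image, 1: text) to each element embedding."""
        def __init__(self, embed_dim: int, max_len: int = 512):
            super().__init__()
            self.pos = nn.Embedding(max_len, embed_dim)
            self.modality = nn.Embedding(2, embed_dim)

        def forward(self, seq_embed: torch.Tensor, modality_id: int) -> torch.Tensor:
            n = seq_embed.shape[0]
            positions = torch.arange(n)
            modality = torch.full((n,), modality_id)
            return seq_embed + self.pos(positions) + self.modality(modality)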
In the example shown in FIG. 5A, the multimodal model 500 includes a self-attention network 506 (also referred to as a first self-attention network), an image-specific network 512, a text-specific network 514, a self-attention network 520 (also referred to as a second self-attention network), and a routing network 526. The self-attention network 506 receives as input the pixel block sequence embedding 502 and the word sequence embedding 504 and outputs an image feature 508 (also referred to as a first image intermediate feature) and a text feature 510 (also referred to as a first text intermediate feature). Here, the pixel block embeddings and the word embeddings are handled by the same self-attention network 506, so that the parameters of the self-attention network 506 can be shared when handling the pixel block sequence embedding 502 and the word sequence embedding 504, which helps to better capture and fuse the shared features and associated information between the image and text modalities. Additionally, the use of a self-attention network helps to improve the effectiveness and performance of the multimodal model 500 in extracting features from the pixel block embeddings and word embeddings.
As shown in FIG. 5A, multimodal model 500 employs image-specific network 512 to process image features 508 to generate image features 516 (also referred to as second image intermediate features) and text-specific network 514 to process text features 510 to generate text features 518 (also referred to as second text intermediate features). Here, the image feature 508 and the text feature 510 are processed separately by a private network dedicated to each modality according to different modalities. Therefore, special network structures can be designed according to the characteristics of the images and the texts, so that the characteristics of the images and the texts can be captured better, and the accuracy and the effect of the characteristic extraction are improved. In addition, this approach also allows the respective private network to be optimized separately for both modalities without fear of having a significant impact on the other modality, making the respective optimization and improvement more flexible.
As shown in FIG. 5A, the multimodal model 500 employs another shared self-attention network 520 to process image features 516 and text features 518 to generate image features 522 (also referred to as third image intermediate features) and text features 524 (also referred to as third text intermediate features). Here, a shared self-attention network is again used to process features of different modalities, so that the parameters of the self-attention network 520 can be shared when processing the image features 516 and the text features 518. In this way, the image features and text features can be further fused, strengthening the connection between them and promoting cross-modal learning. As shown in FIG. 5A, the multimodal model 500 employs a routing network 526 to process the image features 522 and the text features 524 to generate multimodal feature 528.
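A condensed, illustrative sketch of one group of layers of the structure in FIG. 5A is given below; standard multi-head attention is assumed for the shared self-attention networks, and all dimensions and module names are illustrative rather than prescriptive.
    import torch
    import torch.nn as nn

    class SharedSelfAttention(nn.Module):
        """Self-attention applied to the concatenated image and text sequences, so its parameters are shared."""
        def __init__(self, dim: int, heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, img, txt):
            x = torch.cat([img, txt], dim=1)               # (B, N_img + N_txt, dim)
            out, _ = self.attn(x, x, x)
            x = self.norm(x + out)
            return x[:, :img.shape[1]], x[:, img.shape[1]:]

    class MultiModalBlock(nn.Module):
        """Shared attention -> modality-specific networks -> shared attention, as in FIG. 5A."""
        def __init__(self, dim: int):
            super().__init__()
            self.attn1 = SharedSelfAttention(dim)
            self.image_net = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            self.text_net = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            self.attn2 = SharedSelfAttention(dim)

        def forward(self, img_embed, txt_embed):
            img, txt = self.attn1(img_embed, txt_embed)          # first image/text intermediate features
            img, txt = self.image_net(img), self.text_net(txt)   # second intermediate features
            return self.attn2(img, txt)                          # third intermediate features, fed to the routing network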
FIG. 6 illustrates a schematic diagram of the structure of the routing network 526, according to some embodiments of the present disclosure. As shown in FIG. 6, the routing network 526 receives the image features 522 and the text features 524 as inputs and outputs multi-modal feature 528, the multi-modal feature 528 including an image feature portion and a text feature portion. The routing network 526 includes a modality router 602 and a plurality of feed-forward neural networks 612-1, 612-2, …, 612-N (collectively referred to herein as 612). At modality router 602, routing network 526 adds each pixel block feature in the image features 522 to a modality embedding representing the image modality (e.g., modality embedding 604, represented by 0 in FIG. 6), generating adjusted image features 522. Similarly, at modality router 602, routing network 526 adds each word feature in the text features 524 to a modality embedding representing the text modality (e.g., modality embedding 606, represented by 1 in FIG. 6), generating adjusted text features 524. The modality router 602 then generates a respective weight vector, e.g., w1 through w8 in FIG. 6, for each pixel block feature in the image features 522 and each word feature in the text features 524 based on the adjusted image features 522 and the adjusted text features 524. After generating the weight vectors w1 through w8, the routing network 526 may assign a subset of the feed-forward neural networks 612 to each pixel block feature in the image features 522 and each word feature in the text features 524 to process that feature, and each weight vector includes the weight of each feed-forward neural network 612 for the given pixel block feature or word feature.
In some embodiments, a given feature may be input to all feed-forward neural networks 612 whose weights in the corresponding weight vector are non-zero. In other embodiments, a given feature may be input to a predetermined number of feed-forward neural networks 612 having the highest weights in the corresponding weight vector. For example, in the example shown in FIG. 6, where the pixel block feature 608 corresponds to the weight vector w1, the routing network may input the pixel block feature 608 to the 2 feed-forward neural networks 612 with the highest weights in the weight vector w1 (i.e., feed-forward neural networks 612-1 and 612-N). The routing network 526 then generates pixel block feature 614 in the multi-modal feature 528 based on the outputs of the feed-forward neural networks 612-1 and 612-N and the weight values of these two feed-forward neural networks in the weight vector w1. In some embodiments, the weights of the feed-forward neural networks 612-1 and 612-N may be recalculated based on their proportions in the weight vector w1 such that the sum of the recalculated weights is equal to 1. The outputs of the feed-forward neural networks 612-1 and 612-N are then multiplied by the corresponding recalculated weights, respectively, and the results are summed to generate the pixel block feature 614.
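By way of illustration only, a top-k variant of such a routing network could be sketched as follows; the number of feed-forward networks, the top-k value of 2, the softmax router, and all module names are assumptions of this sketch.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RoutingNetwork(nn.Module):
        """Routes each pixel block or word feature to its top-k feed-forward networks and mixes their outputs."""
        def __init__(self, dim: int, num_ffns: int = 8, top_k: int = 2):
            super().__init__()
            self.modality = nn.Embedding(2, dim)       # 0: image, 1: text, added before routing
            self.router = nn.Linear(dim, num_ffns)
            self.ffns = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                for _ in range(num_ffns))
            self.top_k = top_k

        def forward(self, feats: torch.Tensor, modality_id: int) -> torch.Tensor:
            adjusted = feats + self.modality(torch.tensor(modality_id))
            weights = F.softmax(self.router(adjusted), dim=-1)      # one weight vector per feature
            top_w, top_i = weights.topk(self.top_k, dim=-1)
            top_w = top_w / top_w.sum(dim=-1, keepdim=True)         # recalculated weights sum to 1
            out = torch.zeros_like(feats)
            for k in range(self.top_k):
                for e, ffn in enumerate(self.ffns):
                    sel = top_i[..., k] == e                        # features routed to this network
                    if sel.any():
                        out[sel] += top_w[..., k][sel].unsqueeze(-1) * ffn(feats[sel])
            return out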
In this way, the feed-forward neural networks 612 can serve both as single-modality private networks and as multi-modality fusion networks, and the routing network 526 can automatically route pixel block features or word features to the appropriate networks, thereby improving the accuracy of the multi-modal model. In addition, because different features of the same modality and features of different modalities may both be input into the same feed-forward neural network 612, the parameters of that network can be shared within a modality or across modalities, thereby facilitating knowledge sharing between inputs of the same modality (e.g., between different pixel block features in the image features 522 or between different word features in the text features 524) and between inputs of different modalities (e.g., between pixel block features in the image features 522 and word features in the text features 524), which can improve the accuracy of the multimodal model 500.
In some embodiments, on the basis of the model 500 shown in FIG. 5A, several sets of structures consisting of a shared self-attention network, an image-specific network, and a text-specific network may be stacked above the image-specific network 512 and the text-specific network 514. For example, where the total number of layers of the multimodal model 500 is N, the bottom N−L layers may include multiple sets of shared self-attention networks, image-specific networks, and text-specific networks. In some embodiments, several sets of structures consisting of a shared self-attention network and a routing network may be stacked above the routing network 526. For example, where the total number of layers of the multimodal model 500 is N, the top L layers may include multiple sets of shared self-attention networks and routing networks. By stacking these two network structures multiple times, the multimodal model 500 can learn higher-level features, thereby improving the accuracy and generalization ability of the model.
In some embodiments, on the basis of the model 500 shown in FIG. 5A, multiple sets of structures consisting of a shared self-attention network, an image-specific network, and a text-specific network may be stacked above the image-specific network 512 and the text-specific network 514. For example, where the total number of layers of the multimodal model 500 is N, the bottom N−L layers may include multiple sets of shared self-attention networks, image-specific networks, and text-specific networks. By stacking such a structure multiple times, the multimodal model 500 can learn higher-level features, thereby improving the accuracy and generalization ability of the model.
In the inference phase, the multimodal model generates multimodal features for use by downstream tasks based on the input images and text. Fig. 7 illustrates a schematic diagram of a process 700 for using a multimodal model in an inference phase in accordance with some embodiments of the disclosure. As shown in FIG. 7, a multimodal model 724 (which may be, for example, multimodal model 124 of FIG. 1) may generate a multimodal feature 730 based on image 702 and text 704, the multimodal feature 730 including an image feature portion and a text feature portion. The process 700 segments the image 702 into a set of pixel blocks and generates a pixel block sequence embedding 706 based on the pixel blocks, the pixel block sequence embedding 706 including a pixel block embedding corresponding to each pixel block in the image 702. In addition, process 700 generates word sequence embeddings 708 based on text 704, word sequence embeddings 708 including word embeddings corresponding to each word in text 704. Process 700 inputs pixel block sequence embedding 706 and word sequence embedding 708 into multimodal model 724, and multimodal model 724 then outputs multimodal features 730 that are available to downstream tasks.
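Purely as an illustrative usage sketch of the inference phase (the helper names patchify, embed_blocks, embed_words, tokenize, and multimodal_model are hypothetical placeholders for the components described above, not an actual API):
    # Hypothetical inference sketch; no masking is applied at inference time.
    blocks = patchify(image, patch=16)                     # pixel blocks of the input image
    block_embed = embed_blocks(blocks)                     # pixel block sequence embedding 706
    word_embed = embed_words(tokenize(text))               # word sequence embedding 708
    multimodal_feature = multimodal_model(block_embed, word_embed)
    image_part, text_part = multimodal_feature             # consumed by a downstream task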
The downstream task may add a custom task head on top of the multimodal model 724. The task head may be a classification layer, a regression layer, or another output layer appropriate for the particular task, and is responsible for mapping the multimodal feature 730 to outputs associated with that task. The downstream task may choose an appropriate learning rate, optimizer, and loss function to fine-tune the model so that it better adapts to the new task while retaining the knowledge that the multimodal model 724 has already learned.
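For example, a downstream classification head could be attached as sketched below; the mean-pooling strategy and the single linear layer are assumptions and not part of the disclosure.
    import torch.nn as nn

    class ClassificationHead(nn.Module):
        """Maps the pooled multi-modal feature to task-specific class logits during fine-tuning."""
        def __init__(self, dim: int, num_classes: int):
            super().__init__()
            self.classifier = nn.Linear(dim, num_classes)

        def forward(self, multimodal_feature):
            pooled = multimodal_feature.mean(dim=1)        # simple mean pooling over the sequence
            return self.classifier(pooled)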
Fig. 8 illustrates a block diagram of an apparatus 800 for processing multi-modal data in accordance with some embodiments of the present disclosure. As shown in fig. 8, the apparatus 800 includes a pixel block segmentation module 802 configured to segment a source image into a set of source pixel blocks, each pixel block of the set of source pixel blocks including a plurality of pixels. Apparatus 800 further includes a pixel block masking module 804 configured to generate a sequence of masked source pixel blocks by masking one or more source pixel blocks of a set of source pixel blocks. In addition, apparatus 800 includes a pixel block generation module 806 configured to generate one or more target pixel blocks corresponding to the one or more source pixel blocks based on the masked sequence of source pixel blocks and the source text corresponding to the source image. In addition, the apparatus 800 further comprises a multimodal model generation module 808 configured to generate a multimodal model based on the one or more source pixel blocks and the one or more target pixel blocks.
It will be appreciated that at least one of the many advantages achievable by the methods or processes described above can also be achieved by the apparatus 800 of the present disclosure. For example, by masking different blocks of pixels in the same image, existing training data can be efficiently utilized, enabling a multimodal model to be trained with less training data and reducing training costs. In addition, by comparing differences between pixel blocks, the apparatus avoids training an additional visual tokenizer, thereby reducing the complexity of training the multi-modal model and improving the stability and accuracy of the model.
Fig. 9 shows a block diagram of an electronic device 900, which device 900 may be a device or apparatus described in embodiments of the disclosure, in accordance with certain embodiments of the disclosure. As shown in fig. 9, the device 900 includes a Central Processing Unit (CPU) and/or a Graphics Processing Unit (GPU) 901, which may perform various suitable actions and processes in accordance with computer program instructions stored in a read-only memory (ROM) 902 or loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The CPU/GPU 901, ROM 902 and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904. Although not shown in fig. 9, device 900 may also include a coprocessor.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The various methods or processes described above may be performed by the CPU/GPU 901. For example, in some embodiments, the method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into RAM 903 and executed by CPU/GPU 901, one or more steps or actions in the above-described methods or processes may be performed.
In some embodiments, the methods and processes described above may be implemented as a computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., light pulses passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device, or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network interface card or network interface in each computing/processing device receives the computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure can be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages and conventional procedural programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of the computer readable program instructions, the electronic circuitry being able to execute the computer readable program instructions.
These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of devices, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two consecutive blocks may in fact be performed substantially in parallel, and they may sometimes be performed in the reverse order, depending on the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Some example implementations of the present disclosure are listed below.
Example 1. A method for processing multi-modal data, comprising:
dividing a source image into a set of source pixel blocks, each pixel block in the set of source pixel blocks comprising a plurality of pixels;
generating a sequence of masked source pixel blocks by masking one or more source pixel blocks of the set of source pixel blocks;
generating one or more target pixel blocks corresponding to the one or more source pixel blocks based on the masked sequence of source pixel blocks and source text corresponding to the source image; and
a multimodal model is generated based on the one or more source pixel blocks and the one or more target pixel blocks.
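As an illustration only, the segmentation and masking steps of Example 1 above might be sketched as follows in PyTorch; the masking ratio, tensor shapes, and the helper name mask_pixel_blocks are assumptions for the sketch rather than part of the disclosure.

```python
import torch

def mask_pixel_blocks(blocks: torch.Tensor, mask_ratio: float = 0.4):
    """Mask a random subset of source pixel blocks.

    blocks: (batch, num_blocks, block_dim) sequence of source pixel blocks.
    Returns the masked source pixel block sequence, the boolean mask, and the
    original values of the masked blocks (later used as reconstruction targets).
    """
    batch, num_blocks, _ = blocks.shape
    num_masked = int(num_blocks * mask_ratio)
    # Pick which block indices to mask for each sample in the batch.
    idx = torch.rand(batch, num_blocks).argsort(dim=1)[:, :num_masked]
    mask = torch.zeros(batch, num_blocks, dtype=torch.bool)
    mask[torch.arange(batch).unsqueeze(1), idx] = True
    masked_blocks = blocks.clone()
    masked_blocks[mask] = 0.0                  # zero out the masked source pixel blocks
    return masked_blocks, mask, blocks[mask]

# Toy batch: 2 images, each segmented into 196 pixel blocks of 16x16x3 = 768 values.
source_blocks = torch.rand(2, 196, 768)
masked_sequence, mask, masked_originals = mask_pixel_blocks(source_blocks)
```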
Example 2. The method of example 1, wherein generating the multimodal model comprises:
generating a masked sequence of source words by masking one or more source words in the source text;
generating one or more target words corresponding to the one or more source words based on the masked sequence of source words and the source image; and
the multimodal model is generated based on the one or more source pixel blocks, the one or more target pixel blocks, the one or more source words, and the one or more target words.
Example 3. The method of example 1, wherein generating the one or more target blocks of pixels corresponding to the one or more source blocks of pixels comprises:
generating a sequence of retained pixel blocks by deleting the one or more masked source pixel blocks from the sequence of masked source pixel blocks;
generating a retained pixel block sequence embedding based on the sequence of retained pixel blocks;
generating a source word sequence embedding based on the source text; and
generating a first multi-modal feature based on the retained pixel block sequence embedding and the source word sequence embedding.
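Purely as an illustration of Example 3 above, the retained blocks can be embedded together with the source words and jointly encoded; the dimensions and the stand-in Transformer encoder are assumptions for the sketch.

```python
import torch

blocks = torch.rand(1, 196, 768)                       # source pixel blocks
mask = torch.zeros(1, 196, dtype=torch.bool)
mask[:, :78] = True                                    # pretend these blocks were masked

# Delete the masked blocks: only the retained pixel blocks are embedded, which keeps
# the encoder's input sequence (and hence the encoding cost) short.
retained = blocks[~mask].reshape(1, -1, 768)           # (1, 118, 768)
pixel_proj = torch.nn.Linear(768, 512)
retained_embedding = pixel_proj(retained)              # retained pixel block sequence embedding

word_embed = torch.nn.Embedding(30522, 512)
source_word_embedding = word_embed(torch.tensor([[101, 2023, 102]]))  # source word sequence embedding

# First multi-modal feature: joint encoding of both embeddings (stand-in encoder).
encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=1)
first_multimodal_feature = encoder(
    torch.cat([retained_embedding, source_word_embedding], dim=1))    # (1, 121, 512)
```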
Example 4. The method of example 3, wherein generating the one or more target blocks of pixels corresponding to the one or more source blocks of pixels further comprises:
generating one or more mask pixel block features based on the one or more masked source pixel blocks;
generating a second multi-modal feature by inserting the one or more mask pixel block features at respective locations in the first multi-modal feature;
the one or more target pixel blocks are generated with an image decoder based on the second multi-modal feature.
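A sketch of Example 4 above under the same toy shapes: a mask feature is inserted at each masked position, and a lightweight image decoder maps the result back to pixel values. The decoder depth and dimensions are assumptions, and in a real model the mask feature would be a learned parameter.

```python
import torch
from torch import nn

batch, num_blocks, dim, block_dim, num_masked = 1, 196, 512, 768, 78
mask = torch.zeros(batch, num_blocks, dtype=torch.bool)
mask[:, :num_masked] = True

# Image part of the first multi-modal feature (toy values standing in for the encoder output).
image_part = torch.rand(batch, num_blocks - num_masked, dim)
mask_feature = torch.zeros(1, 1, dim)   # in a real model this would be a learned parameter

# Second multi-modal feature: place the mask features at the masked positions and the
# encoded retained-block features at their original positions.
second = torch.zeros(batch, num_blocks, dim)
second[~mask] = image_part.reshape(-1, dim)
second[mask] = mask_feature.expand(batch, num_masked, dim).reshape(-1, dim)

# Image decoder: maps each masked position back to the raw pixel values of a target pixel block.
image_decoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, block_dim))
target_pixel_blocks = image_decoder(second[mask])       # (78, 768)
```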
Example 5. The method of example 2, wherein generating the one or more target words corresponding to the one or more source words comprises:
Generating a source pixel block embedding based on the source image;
generating a mask word sequence embedding based on the masked source word sequence;
generating a third multi-modal feature based on the source pixel block embedding and the masking word sequence embedding; and
the one or more target words are generated based on the third multimodal feature.
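For Example 5 above, an illustrative sketch follows; the vocabulary size, the placeholder mask token id 103, and the word prediction head are assumptions for the sketch.

```python
import torch
from torch import nn

vocab_size, dim = 30522, 512
token_ids = torch.tensor([[101, 2023, 2003, 103, 102]])    # 103 stands in for a masked word
pixel_block_embedding = torch.rand(1, 196, dim)             # source pixel block embedding (toy)

word_embed = nn.Embedding(vocab_size, dim)
masked_word_embedding = word_embed(token_ids)                # mask word sequence embedding

# Third multi-modal feature: joint encoding of the full image and the masked text.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=1)
third_multimodal_feature = encoder(
    torch.cat([pixel_block_embedding, masked_word_embedding], dim=1))

# Predict the target words from the text portion of the third multi-modal feature.
word_head = nn.Linear(dim, vocab_size)
text_part = third_multimodal_feature[:, 196:]                # positions of the word sequence
target_word_logits = word_head(text_part)                    # (1, 5, vocab_size)
```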
Example 6. The method of example 2, wherein generating the multimodal model further comprises:
generating a first loss function by determining a mean square error between the one or more source pixel blocks and the one or more target pixel blocks;
generating a second loss function using a cross entropy function based on the one or more source words and the one or more target words;
generating a third loss function based on the first loss function and the second loss function; and
the multi-modal model is generated using the third loss function.
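Example 6 above combines an image reconstruction loss with a masked-word prediction loss; a minimal sketch follows, where the equal weighting of the two terms is an assumption.

```python
import torch
import torch.nn.functional as F

# Toy predictions and targets with shapes matching the earlier sketches.
target_pixel_blocks = torch.rand(78, 768, requires_grad=True)    # reconstructed masked blocks
source_pixel_blocks = torch.rand(78, 768)                        # original masked blocks
target_word_logits = torch.rand(5, 30522, requires_grad=True)    # predictions at masked words
source_word_ids = torch.randint(0, 30522, (5,))                  # original masked word ids

first_loss = F.mse_loss(target_pixel_blocks, source_pixel_blocks)    # mean square error
second_loss = F.cross_entropy(target_word_logits, source_word_ids)   # cross entropy
third_loss = first_loss + second_loss                                # combined training loss
third_loss.backward()
```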
Example 7. The method of example 3, wherein generating the first multi-modal feature comprises:
a first image intermediate feature and a first text intermediate feature are generated using a first self-attention network based on the retained pixel block sequence embedding and the source word sequence embedding.
Example 8 the method of example 7, wherein generating the first multi-modal feature further comprises:
generating a second image intermediate feature using an image-specific network based on the first image intermediate feature; and
a second text intermediate feature is generated using a text-specific network based on the first text intermediate feature.
Example 9. The method of example 8, wherein generating the first multi-modal feature further comprises:
a third image intermediate feature and a third text intermediate feature are generated using a second self-attention network based on the second image intermediate feature and the second text intermediate feature.
Example 10. The method of example 9, wherein generating the first multi-modal feature further comprises:
the first multi-modal feature is generated using a routing network based on the third image intermediate feature and the third text intermediate feature, wherein the routing network includes a plurality of feed-forward neural networks.
Example 11. The method of example 10, wherein generating the first multi-modal feature further comprises:
generating a plurality of feed-forward neural network weight vectors using the routing network based on the third image intermediate feature and the third text intermediate feature; and
The first multi-modal feature is generated based on the third image intermediate feature, the third text intermediate feature, and the plurality of feedforward neural network weight vectors.
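As an illustration of the routing network described in Examples 10 and 11 above, the sketch below weights the outputs of several feed-forward neural networks per position; the number of networks, hidden size, and softmax router are assumptions for the sketch.

```python
import torch
from torch import nn

class RoutingNetwork(nn.Module):
    """Combines several feed-forward neural networks using per-position weight vectors."""
    def __init__(self, dim: int = 512, num_ffns: int = 4, hidden: int = 2048):
        super().__init__()
        self.ffns = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_ffns)])
        self.router = nn.Linear(dim, num_ffns)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim), e.g. the concatenated third image and text intermediate features.
        weights = torch.softmax(self.router(x), dim=-1)               # feed-forward network weight vectors
        outputs = torch.stack([ffn(x) for ffn in self.ffns], dim=-1)  # (batch, seq, dim, num_ffns)
        return (outputs * weights.unsqueeze(2)).sum(dim=-1)           # weighted combination

routing = RoutingNetwork()
third_features = torch.rand(1, 199, 512)      # third image + text intermediate features (toy)
first_multimodal_feature = routing(third_features)
```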
Example 12. The method of example 1, further comprising:
acquiring an application image and an application text;
dividing the application image into a set of pixel blocks;
generating a pixel block sequence embedding and a word sequence embedding based on the set of pixel blocks and the application text; and
a multi-modal feature is generated using the multi-modal model based on the pixel block sequence embedding and the word sequence embedding.
Example 13 an apparatus for processing multimodal data, comprising:
a pixel block segmentation module configured to segment a source image into a set of source pixel blocks, each pixel block of the set of source pixel blocks comprising a plurality of pixels;
a pixel block masking module configured to generate a sequence of masked source pixel blocks by masking one or more source pixel blocks of the set of source pixel blocks;
a pixel block generation module configured to generate one or more target pixel blocks corresponding to the one or more source pixel blocks based on the masked sequence of source pixel blocks and source text corresponding to the source image; and
A multimodal model generation module configured to generate a multimodal model based on the one or more source pixel blocks and the one or more target pixel blocks.
Example 14 the apparatus of example 13, wherein generating the multimodal model comprises:
a word sequence masking module configured to generate a masked sequence of source words by masking one or more source words in the source text;
a target word generation module configured to generate one or more target words corresponding to the one or more source words based on the masked sequence of source words and the source image; and
a second model generation module configured to generate the multimodal model based on the one or more source pixel blocks, the one or more target pixel blocks, the one or more source words, and the one or more target words.
Example 15 the apparatus of example 13, wherein generating the one or more target blocks of pixels corresponding to the one or more source blocks of pixels comprises:
a masking pixel block deletion module configured to generate a sequence of retained pixel blocks by deleting the one or more masked source pixel blocks from the sequence of masked source pixel blocks;
a pixel block embedding generation module configured to generate a retained pixel block sequence embedding based on the sequence of retained pixel blocks;
a word sequence embedding generation module configured to generate a source word sequence embedding based on the source text; and
a first multi-modal feature generation module configured to generate a first multi-modal feature based on the retained pixel block sequence embedding and the source word sequence embedding.
Example 16 the apparatus of example 15, wherein generating the one or more target blocks of pixels corresponding to the one or more source blocks of pixels further comprises:
a mask pixel block feature generation module configured to generate one or more mask pixel block features based on the one or more source pixel blocks that are masked;
a second multi-modal feature generation module configured to generate a second multi-modal feature by inserting the one or more mask pixel block features at respective locations in the first multi-modal feature;
a target pixel block generation module configured to generate the one or more target pixel blocks with an image decoder based on the second multi-modal feature.
Example 17 the apparatus of example 14, wherein generating the one or more target words corresponding to the one or more source words comprises:
A pixel block embedding generation module configured to generate a source pixel block embedding based on the source image;
a mask word sequence generation module configured to generate a mask word sequence embedding based on the masked source word sequence;
a third multi-modal feature generation module configured to generate a third multi-modal feature based on the source pixel block embedding and the mask word sequence embedding; and
a target word generation module configured to generate the one or more target words based on the third multimodal feature.
Example 18 the apparatus of example 14, wherein generating the multimodal model further comprises:
a first loss function generation module configured to generate a first loss function by determining a mean square error between the one or more source pixel blocks and the one or more target pixel blocks;
a second loss function generation module configured to generate a second loss function using a cross entropy function based on the one or more source words and the one or more target words;
a third loss function generation module configured to generate a third loss function based on the first loss function and the second loss function; and
A third loss function usage module configured to generate the multimodal model using the third loss function.
Example 19 the apparatus of example 15, wherein generating the first multi-modal feature comprises:
a first intermediate feature generation module configured to generate a first image intermediate feature and a first text intermediate feature using a first self-attention network based on the retained pixel block sequence embedding and the source word sequence embedding.
Example 20 the apparatus of example 19, wherein generating the first multi-modal feature further comprises:
a second image feature generation module configured to generate a second image intermediate feature using an image-specific network based on the first image intermediate feature; and
a second text feature generation module configured to generate a second text intermediate feature using a text-specific network based on the first text intermediate feature.
Example 21 the apparatus of example 20, wherein generating the first multi-modal feature further comprises:
a third intermediate feature generation module configured to generate a third image intermediate feature and a third text intermediate feature using a second self-attention network based on the second image intermediate feature and the second text intermediate feature.
Example 22 the apparatus of example 21, wherein generating the first multi-modal feature further comprises:
a first multi-modal feature generation module configured to generate the first multi-modal feature using a routing network based on the third image intermediate feature and the third text intermediate feature, wherein the routing network includes a plurality of feed-forward neural networks.
Example 23 the apparatus of example 22, wherein generating the first multi-modal feature further comprises:
a weight vector generation module configured to generate a plurality of feed-forward neural network weight vectors using the routing network based on the third image intermediate feature and the third text intermediate feature; and
a weight vector usage module configured to generate the first multi-modal feature based on the third image intermediate feature, the third text intermediate feature, and the plurality of feedforward neural network weight vectors.
Example 24 the apparatus of example 13, further comprising:
the data acquisition module is configured to acquire an application image and an application text;
an image segmentation module configured to segment the application image into a set of pixel blocks;
an embedding generation module configured to generate a pixel block sequence embedding and a word sequence embedding based on the set of pixel blocks and the application text; and
A feature generation module configured to generate a multi-modal feature using the multi-modal model based on the pixel block sequence embedding and the word sequence embedding.
Example 25. An electronic device, comprising:
a processor; and
a memory coupled with the processor, the memory having instructions stored therein, which when executed by the processor, cause the electronic device to perform actions comprising:
dividing a source image into a set of source pixel blocks, each pixel block in the set of source pixel blocks comprising a plurality of pixels;
generating a sequence of masked source pixel blocks by masking one or more source pixel blocks of the set of source pixel blocks;
generating one or more target pixel blocks corresponding to the one or more source pixel blocks based on the masked sequence of source pixel blocks and source text corresponding to the source image; and
a multimodal model is generated based on the one or more source pixel blocks and the one or more target pixel blocks.
Example 26 the electronic device of example 25, wherein generating the multimodal model comprises:
generating a masked sequence of source words by masking one or more source words in the source text;
Generating one or more target words corresponding to the one or more source words based on the masked sequence of source words and the source image; and
the multimodal model is generated based on the one or more source pixel blocks, the one or more target pixel blocks, the one or more source words, and the one or more target words.
Example 27 the electronic device of example 25, wherein generating the one or more target blocks of pixels corresponding to the one or more source blocks of pixels comprises:
generating a sequence of retained pixel blocks by deleting the one or more masked source pixel blocks from the sequence of masked source pixel blocks;
generating a retained pixel block sequence embedding based on the sequence of retained pixel blocks;
generating a source word sequence embedding based on the source text; and
generating a first multi-modal feature based on the retained pixel block sequence embedding and the source word sequence embedding.
Example 28 the electronic device of example 27, wherein generating the one or more target blocks of pixels corresponding to the one or more source blocks of pixels further comprises:
generating one or more mask pixel block features based on the one or more masked source pixel blocks;
Generating a second multi-modal feature by inserting the one or more mask pixel block features at respective locations in the first multi-modal feature;
the one or more target pixel blocks are generated with an image decoder based on the second multi-modal feature.
Example 29 the electronic device of example 26, wherein generating the one or more target words corresponding to the one or more source words comprises:
generating a source pixel block embedding based on the source image;
generating a mask word sequence embedding based on the masked source word sequence;
generating a third multi-modal feature based on the source pixel block embedding and the masking word sequence embedding; and
the one or more target words are generated based on the third multimodal feature.
Example 30 the electronic device of example 26, wherein generating the multimodal model further comprises:
generating a first loss function by determining a mean square error between the one or more source pixel blocks and the one or more target pixel blocks;
generating a second loss function using a cross entropy function based on the one or more source words and the one or more target words;
Generating a third loss function based on the first loss function and the second loss function; and
the multi-modal model is generated using the third loss function.
Example 31 the electronic device of example 27, wherein generating the first multi-modal feature comprises:
a first image intermediate feature and a first text intermediate feature are generated using a first self-attention network based on the retained pixel block sequence embedding and the source word sequence embedding.
Example 32 the electronic device of example 31, wherein generating the first multi-modal feature further comprises:
generating a second image intermediate feature using an image-specific network based on the first image intermediate feature; and
a second text intermediate feature is generated using a text-specific network based on the first text intermediate feature.
Example 33 the electronic device of example 32, wherein generating the first multi-modal feature further comprises:
a third image intermediate feature and a third text intermediate feature are generated using a second self-attention network based on the second image intermediate feature and the second text intermediate feature.
Example 34 the electronic device of example 33, wherein generating the first multi-modal feature further comprises:
The first multi-modal feature is generated using a routing network based on the third image intermediate feature and the third text intermediate feature, wherein the routing network includes a plurality of feed-forward neural networks.
Example 35 the electronic device of example 34, wherein generating the first multi-modal feature further comprises:
generating a plurality of feed-forward neural network weight vectors using the routing network based on the third image intermediate feature and the third text intermediate feature; and
the first multi-modal feature is generated based on the third image intermediate feature, the third text intermediate feature, and the plurality of feedforward neural network weight vectors.
Example 36. The electronic device of example 25, wherein the actions further comprise:
acquiring an application image and an application text;
dividing the application image into a set of pixel blocks;
generating a pixel block sequence embedding and a word sequence embedding based on the set of pixel blocks and the application text; and
a multi-modal feature is generated using the multi-modal model based on the pixel block sequence embedding and the word sequence embedding.
Although the disclosure has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (15)

1. A method for processing multi-modal data, comprising:
dividing a source image into a set of source pixel blocks, each pixel block in the set of source pixel blocks comprising a plurality of pixels;
generating a sequence of masked source pixel blocks by masking one or more source pixel blocks of the set of source pixel blocks;
generating one or more target pixel blocks corresponding to the one or more source pixel blocks based on the masked sequence of source pixel blocks and source text corresponding to the source image; and
a multimodal model is generated based on the one or more source pixel blocks and the one or more target pixel blocks.
2. The method of claim 1, wherein generating the multimodal model comprises:
generating a masked sequence of source words by masking one or more source words in the source text;
generating one or more target words corresponding to the one or more source words based on the masked sequence of source words and the source image; and
the multimodal model is generated based on the one or more source pixel blocks, the one or more target pixel blocks, the one or more source words, and the one or more target words.
3. The method of claim 1, wherein generating the one or more target blocks of pixels corresponding to the one or more source blocks of pixels comprises:
generating a sequence of retained pixel blocks by deleting the one or more masked source pixel blocks from the sequence of masked source pixel blocks;
generating a retained pixel block sequence embedding based on the sequence of retained pixel blocks;
generating a source word sequence embedding based on the source text; and
generating a first multi-modal feature based on the retained pixel block sequence embedding and the source word sequence embedding.
4. The method of claim 3, wherein generating the one or more target blocks of pixels corresponding to the one or more source blocks of pixels further comprises:
generating one or more mask pixel block features based on the one or more masked source pixel blocks;
generating a second multi-modal feature by inserting the one or more mask pixel block features at respective locations in the first multi-modal feature;
the one or more target pixel blocks are generated with an image decoder based on the second multi-modal feature.
5. The method of claim 2, wherein generating the one or more target words corresponding to the one or more source words comprises:
Generating a source pixel block embedding based on the source image;
generating a mask word sequence embedding based on the masked source word sequence;
generating a third multi-modal feature based on the source pixel block embedding and the masking word sequence embedding; and
the one or more target words are generated based on the third multimodal feature.
6. The method of claim 2, wherein generating the multimodal model further comprises:
generating a first loss function by determining a mean square error between the one or more source pixel blocks and the one or more target pixel blocks;
generating a second loss function using a cross entropy function based on the one or more source words and the one or more target words;
generating a third loss function based on the first loss function and the second loss function; and
the multi-modal model is generated using the third loss function.
7. The method of claim 3, wherein generating the first multi-modal feature comprises:
a first image intermediate feature and a first text intermediate feature are generated using a first self-attention network based on the retained pixel block sequence embedding and the source word sequence embedding.
8. The method of claim 7, wherein generating the first multi-modal feature further comprises:
generating a second image intermediate feature using an image-specific network based on the first image intermediate feature; and
a second text intermediate feature is generated using a text-specific network based on the first text intermediate feature.
9. The method of claim 8, wherein generating the first multi-modal feature further comprises:
a third image intermediate feature and a third text intermediate feature are generated using a second self-attention network based on the second image intermediate feature and the second text intermediate feature.
10. The method of claim 9, wherein generating the first multi-modal feature further comprises:
the first multi-modal feature is generated using a routing network based on the third image intermediate feature and the third text intermediate feature, wherein the routing network includes a plurality of feed-forward neural networks.
11. The method of claim 10, wherein generating the first multi-modal feature further comprises:
generating a plurality of feed-forward neural network weight vectors using the routing network based on the third image intermediate feature and the third text intermediate feature; and
The first multi-modal feature is generated based on the third image intermediate feature, the third text intermediate feature, and the plurality of feedforward neural network weight vectors.
12. The method of claim 1, further comprising:
acquiring an application image and an application text;
dividing the application image into a set of pixel blocks;
generating a pixel block sequence embedding and a word sequence embedding based on the set of pixel blocks and the application text; and
a multi-modal feature is generated using the multi-modal model based on the pixel block sequence embedding and the word sequence embedding.
13. An apparatus for processing multi-modal data, comprising:
a pixel block segmentation module configured to segment a source image into a set of source pixel blocks, each pixel block of the set of source pixel blocks comprising a plurality of pixels;
a pixel block masking module configured to generate a sequence of masked source pixel blocks by masking one or more source pixel blocks of the set of source pixel blocks;
a pixel block generation module configured to generate one or more target pixel blocks corresponding to the one or more source pixel blocks based on the masked sequence of source pixel blocks and source text corresponding to the source image; and
A multimodal model generation module configured to generate a multimodal model based on the one or more source pixel blocks and the one or more target pixel blocks.
14. An electronic device, comprising:
a processor; and
a memory coupled with the processor, the memory having instructions stored therein, which when executed by the processor, cause the electronic device to perform the method of any of claims 1-12.
15. A computer readable storage medium having stored thereon computer executable instructions, wherein the computer executable instructions are executed by a processor to implement the method of any of claims 1 to 12.
CN202310511915.8A 2023-05-08 2023-05-08 Method, apparatus, electronic device and medium for processing multi-modal data Pending CN116563426A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310511915.8A CN116563426A (en) 2023-05-08 2023-05-08 Method, apparatus, electronic device and medium for processing multi-modal data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310511915.8A CN116563426A (en) 2023-05-08 2023-05-08 Method, apparatus, electronic device and medium for processing multi-modal data

Publications (1)

Publication Number Publication Date
CN116563426A true CN116563426A (en) 2023-08-08

Family

ID=87491024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310511915.8A Pending CN116563426A (en) 2023-05-08 2023-05-08 Method, apparatus, electronic device and medium for processing multi-modal data

Country Status (1)

Country Link
CN (1) CN116563426A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117521759A (en) * 2024-01-04 2024-02-06 支付宝(杭州)信息技术有限公司 Training method and device for large model
CN117521759B (en) * 2024-01-04 2024-04-05 支付宝(杭州)信息技术有限公司 Training method and device for large model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination