CN116663609A - Model training method, device, equipment and storage medium

Model training method, device, equipment and storage medium

Info

Publication number
CN116663609A
Authority
CN
China
Prior art keywords
mode
network
fusion
data
modal
Prior art date
Legal status
Pending
Application number
CN202310673422.4A
Other languages
Chinese (zh)
Inventor
杨志雄
杨延展
Current Assignee
Douyin Vision Co Ltd
Original Assignee
Douyin Vision Co Ltd
Priority date
Filing date
Publication date
Application filed by Douyin Vision Co Ltd
Priority to CN202310673422.4A
Publication of CN116663609A
Status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/60 Software deployment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0499 Feedforward networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The embodiments of the present disclosure provide a model training method, device, equipment, and storage medium for training a multi-modal fusion network. The method includes: acquiring multi-modal data, where the multi-modal data includes data of at least two of the image, text, and audio modalities; sequentially inputting the multi-modal data into the multi-modal fusion network and outputting a multi-modal data processing result; and training at least one of the multi-modal adapter sub-network, the modal fusion sub-network, and the target task sub-network based on the multi-modal data processing result to obtain a trained multi-modal fusion network. Because the model training method provided by the embodiments of the present disclosure trains only the sub-networks other than the pre-trained multi-modal sub-network in the multi-modal fusion network, the memory and GPU memory required for training are effectively reduced; at the same time, a pre-trained large model is reused, which greatly saves computing resources and time and thereby improves the training and deployment efficiency of the multi-modal fusion network.

Description

Model training method, device, equipment and storage medium
Technical Field
The embodiments of the present disclosure relate to the technical field of neural networks, and in particular to a model training method, device, equipment, and storage medium.
Background
Neural network models keep growing in size and contain very large numbers of parameters, and multi-modal neural network models in particular have become extremely large. Migrating such a model to a downstream application field requires a large amount of computing resources to train the large model, and even loading the model into GPU memory is difficult because of its scale. Training a large-scale multi-modal model therefore typically requires substantial computing resources and time, which affects model training and deployment efficiency.
Disclosure of Invention
The embodiments of the present disclosure provide a model training method, device, equipment, and storage medium, which can greatly save computing resources and time by training only the sub-networks other than the pre-trained multi-modal sub-network in a multi-modal fusion network, thereby improving the training and deployment efficiency of the multi-modal fusion network.
In a first aspect, an embodiment of the present disclosure provides a model training method for training a multi-modal fusion network, where the multi-modal fusion network includes a pre-trained multi-modal sub-network, a multi-modal adapter sub-network, a modal fusion sub-network, and a target task sub-network that are connected in sequence, and the method includes:
acquiring multi-modal data, where the multi-modal data includes data of at least two of the image, text, and audio modalities;
inputting the multi-modal data into the multi-modal fusion network and outputting a multi-modal data processing result; and
training at least one of the multi-modal adapter sub-network, the modal fusion sub-network, and the target task sub-network based on the multi-modal data processing result to obtain a trained multi-modal fusion network.
In a second aspect, an embodiment of the present disclosure further provides a model training apparatus for training a multi-modal fusion network, where the multi-modal fusion network includes a pre-trained multi-modal sub-network, a multi-modal adapter sub-network, a modal fusion sub-network, and a target task sub-network that are connected in sequence, and the apparatus includes:
a multi-modal data acquisition module, configured to acquire multi-modal data, where the multi-modal data includes data of at least two of the image, text, and audio modalities;
a multi-modal data processing result acquisition module, configured to input the multi-modal data into the multi-modal fusion network and output a multi-modal data processing result; and
a multi-modal fusion network training module, configured to train at least one of the multi-modal adapter sub-network, the modal fusion sub-network, and the target task sub-network based on the multi-modal data processing result to obtain a trained multi-modal fusion network.
In a third aspect, embodiments of the present disclosure further provide an electronic device, including:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the model training method as described in embodiments of the present disclosure.
In a fourth aspect, the embodiments of the present disclosure further provide a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the model training method described in the embodiments of the present disclosure.
The embodiments of the present disclosure disclose a model training method, device, equipment, and storage medium for training a multi-modal fusion network, where the multi-modal fusion network includes a pre-trained multi-modal sub-network, a multi-modal adapter sub-network, a modal fusion sub-network, and a target task sub-network that are connected in sequence, and the method includes: acquiring multi-modal data, where the multi-modal data includes data of at least two of the image, text, and audio modalities; inputting the multi-modal data into the multi-modal fusion network and outputting a multi-modal data processing result; and training at least one of the multi-modal adapter sub-network, the modal fusion sub-network, and the target task sub-network based on the multi-modal data processing result to obtain a trained multi-modal fusion network. Because the model training method provided by the embodiments of the present disclosure trains only the sub-networks other than the pre-trained multi-modal sub-network in the multi-modal fusion network, the memory and GPU memory required for training are effectively reduced; at the same time, a pre-trained large model is reused, which greatly saves computing resources and time and thereby improves the training and deployment efficiency of the multi-modal fusion network.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is a schematic flow chart of a model training method according to an embodiment of the disclosure;
FIG. 2 is a schematic diagram of a multi-modal fusion network according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a multi-modal adapter sub-network according to an embodiment of the disclosure;
FIG. 4 is a schematic diagram of a modality adapter according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a modal attention fusion module provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a context fusion module provided by an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a modal fusion sub-network according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a feed-forward layer structure provided by an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a multi-modal fusion network according to an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of a model training apparatus according to an embodiment of the present disclosure;
FIG. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "one" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
It should be appreciated that, before using the technical solutions disclosed in the embodiments of the present disclosure, the user should be informed, in an appropriate manner and in accordance with the relevant laws and regulations, of the type, scope of use, usage scenarios, and so on of the personal information involved in the present disclosure, and the user's authorization should be obtained.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly inform the user that the operation requested to be performed will require acquiring and using the user's personal information. Thus, the user can autonomously choose, according to the prompt information, whether to provide personal information to software or hardware, such as an electronic device, an application program, a server, or a storage medium, that executes the operations of the technical solution of the present disclosure.
As an optional but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user, for example, in the form of a pop-up window, in which the prompt information may be presented as text. In addition, the pop-up window may carry a selection control for the user to choose to "agree" or "disagree" to provide personal information to the electronic device.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.
It will be appreciated that the data (including but not limited to the data itself, the acquisition or use of the data) involved in the present technical solution should comply with the corresponding legal regulations and the requirements of the relevant regulations.
In order to migrate an upstream model to a downstream task, the conventional solutions are as follows:
1. Fine-tuning (full model finetune): fine-tuning the entire pre-trained multi-modal model on a new task-specific data set is the most common method in transfer learning. While effective, this approach has several limitations, such as high computational cost, the risk of catastrophic forgetting, and the need for large amounts of task-specific data.
2. Feature concatenation (feature concatenation): for multi-modal fusion, one simple strategy is to concatenate the features extracted from different modalities and then input them into a classifier. However, this approach cannot capture complex interactions between modalities and typically results in high-dimensional inputs, increasing the complexity and computational requirements of the model.
3. Multi-modal attention mechanism (Multimodal attention mechanism): this traditional and common approach, also known as a cross-modal attention mechanism, assigns weights to different modalities or their features based on their relevance to the task. While this approach can effectively model the importance of each modality, it may be limited in capturing higher-order interactions or in adapting to new tasks with small amounts of training data.
Fig. 1 is a schematic flow chart of a model training method provided by an embodiment of the present disclosure. The embodiment is applicable to the case of training a multi-modal fusion network. The method may be performed by a model training apparatus, which may be implemented in the form of software and/or hardware and, optionally, by an electronic device, where the electronic device may be a mobile terminal, a PC, a server, or the like.
The multi-modal fusion network includes a pre-trained multi-modal sub-network (Pre-trained Multi-Modal Model), a multi-modal adapter sub-network (Multi-Modal Adapter, MMA), a modal fusion sub-network (Dynamic Fusion Mechanism, DFM), and a target task sub-network (TSFT), which are connected in sequence.
As shown in fig. 1, the method includes:
s110, multi-mode data is acquired.
The multi-modal data comprises at least two modes of image data, text data and audio data. In this embodiment, the multimodal data may be paired or grouped image-text-audio data. Illustratively, the process of acquiring the multi-modal data may be: frame extraction is performed from an open source (or authorized) movie, television show, short video, etc., images, text, and audio are extracted from the frames, and aligned while duplicate multimodal data is removed. Multimodal data can be divided into training sets, validation sets, and test sets.
S120, the multi-modal data is input into the multi-modal fusion network, and a multi-modal data processing result is output.
Inputting the multi-modal data into the multi-modal fusion network can be understood as follows: the multi-modal data is first input into the pre-trained multi-modal sub-network; the data output by the pre-trained multi-modal sub-network is then input into the multi-modal adapter sub-network; the data output by the multi-modal adapter sub-network is input into the modal fusion sub-network; and finally the output of the modal fusion sub-network is input into the target task sub-network, which outputs the multi-modal data processing result.
Fig. 2 is a schematic structural diagram of a multi-modal fusion network according to an embodiment of the disclosure; the multi-modal fusion network includes a pre-trained multi-modal sub-network, a multi-modal adapter sub-network, a modal fusion sub-network, and a target task sub-network. As shown in Fig. 2, the input multi-modal data sequentially passes through the pre-trained multi-modal sub-network (Pre-trained Multi-Modal Model), the multi-modal adapter sub-network (MMA), the modal fusion sub-network (Dynamic Fusion Mechanism, DFM), and the target task sub-network (TSFT), and the multi-modal data processing result is finally output.
Optionally, the process of inputting the multi-modal data into the multi-modal fusion network and outputting the multi-modal data processing result may be as follows: feature extraction is performed on the multi-modal data based on the pre-trained multi-modal sub-network to obtain multi-modal feature data; the multi-modal feature data is adjusted based on the multi-modal adapter sub-network to obtain adjusted multi-modal feature data; the adjusted multi-modal feature data is fused based on the modal fusion sub-network to obtain fused feature data; and data processing for the target task is performed on the fused feature data based on the target task sub-network to obtain the multi-modal data processing result.
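This four-stage forward pass can be sketched in PyTorch roughly as follows. It is only a minimal illustration of the data flow, not the patent's implementation; the class name MultiModalFusionNetwork, the keyword-argument interface of the backbone, and the use of a per-modality feature dictionary are assumptions, and the sub-modules are filled in by the later sketches.

```python
import torch
import torch.nn as nn

class MultiModalFusionNetwork(nn.Module):
    def __init__(self, backbone, adapter_subnet, fusion_subnet, task_subnet):
        super().__init__()
        self.backbone = backbone              # pre-trained multi-modal sub-network (kept frozen)
        self.adapter_subnet = adapter_subnet  # multi-modal adapter sub-network (MMA)
        self.fusion_subnet = fusion_subnet    # modal fusion sub-network (DFM)
        self.task_subnet = task_subnet        # target task sub-network (TSFT)

    def forward(self, image, text, audio):
        # 1) feature extraction with the pre-trained multi-modal sub-network
        feats = self.backbone(image=image, text=text, audio=audio)  # per-modality feature dict (assumed)
        # 2) per-modality adjustment plus cross-modal interaction
        adjusted = self.adapter_subnet(feats)
        # 3) fusion of the adjusted modality features
        fused = self.fusion_subnet(adjusted)
        # 4) target-task processing and prediction
        return self.task_subnet(fused)
```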
The multi-modal feature data includes feature data corresponding to each modality, namely image feature data, text feature data, and audio feature data. The multi-modal adapter sub-network is used to adjust the feature data of each modality separately, the modal fusion sub-network is used to fuse the adjusted feature data of the modalities, and the target task sub-network is used to perform data processing for the target task on the fused feature data.
Specifically, the multi-modal adapter sub-network includes a multi-modal adapter and a cross-modal adapter (Cross-Modal Adapter); the multi-modal adapter includes at least two of an image adapter, a text adapter, and an audio adapter. The process of adjusting the multi-modal feature data based on the multi-modal adapter sub-network to obtain the adjusted multi-modal feature data may be as follows: the feature data of the corresponding modalities is adjusted separately based on the multi-modal adapter to obtain adjusted feature data of each modality; and cross-modal adjustment is performed on the adjusted feature data of each modality based on the cross-modal adapter to obtain re-adjusted feature data of each modality.
Adjusting the feature data of the corresponding modalities based on the multi-modal adapter can be understood as follows: the image adapter adjusts the image feature data, the text adapter adjusts the text feature data, and the audio adapter adjusts the audio feature data.
The cross-modal adapter includes a multi-head attention layer (Multi-head Attention Layer) and a feed-forward layer (Feed-forward Layer). Cross-modal adjustment of the adjusted feature data of each modality based on the cross-modal adapter can be understood as follows: the multi-head attention layer lets the adjusted feature data of the different modalities further interact with and supplement one another, and the result is then input into the feed-forward layer.
Fig. 3 is a schematic structural diagram of the multi-modal adapter sub-network in this embodiment. As shown in Fig. 3, the multi-modal adapter sub-network includes an image adapter, a text adapter, and an audio adapter arranged in parallel, followed by a multi-head attention layer and a feed-forward layer. The image feature data is input into the image adapter, which outputs adjusted image feature data; the text feature data is input into the text adapter, which outputs adjusted text feature data; and the audio feature data is input into the audio adapter, which outputs adjusted audio feature data. The adjusted image, text, and audio feature data are input into the multi-head attention layer to interact with and supplement one another; the resulting image, text, and audio feature data are then input into the feed-forward layer, which outputs the adjusted multi-modal feature data.
The image adapter, the text adapter, and the audio adapter each include two fully connected layers, and the input of the first fully connected layer is residually connected to the output of the second fully connected layer. The dimension of the first fully connected layer is N×d, the dimension of the second fully connected layer may be d×N, and d << N. The image adapter, the text adapter, and the audio adapter have the same structure, but their parameters are not shared, so that the adapter of each modality can fit the data of that modality better. The residual connection from the input of the first fully connected layer to the output of the second fully connected layer ensures smooth gradient flow. Fig. 4 is a schematic structural diagram of a modality adapter in this embodiment. As shown in Fig. 4, the modality adapter includes a fully connected layer with dimension N×d and a fully connected layer with dimension d×N, and the input of the first fully connected layer is residually connected to the output of the second fully connected layer.
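A minimal PyTorch sketch of such a bottleneck modality adapter is given below. The names ModalityAdapter, feature_dim, and bottleneck_dim, as well as the example values N = 768 and d = 64, are illustrative assumptions rather than values taken from the patent; only the two fully connected layers and the residual connection follow the description above.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Bottleneck adapter: N -> d -> N with a residual connection, where d << N."""

    def __init__(self, feature_dim: int, bottleneck_dim: int):
        super().__init__()
        self.down = nn.Linear(feature_dim, bottleneck_dim)  # first fully connected layer (N x d)
        self.up = nn.Linear(bottleneck_dim, feature_dim)    # second fully connected layer (d x N)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection from the adapter input to the second layer's output
        return x + self.up(self.down(x))

# Structurally identical but non-shared adapters per modality (assumed dimensions):
image_adapter = ModalityAdapter(feature_dim=768, bottleneck_dim=64)
text_adapter = ModalityAdapter(feature_dim=768, bottleneck_dim=64)
audio_adapter = ModalityAdapter(feature_dim=768, bottleneck_dim=64)
```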
The feed-forward layer includes two fully connected layers and a nonlinear activation layer (ReLU) located between the two fully connected layers.
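Under the same assumptions, the feed-forward layer and the cross-modal stage of the multi-modal adapter sub-network (multi-head attention followed by a feed-forward layer) could be sketched as follows. Concatenating the three modality token sequences before the attention step is an assumption made for illustration; the patent only states that the multi-head attention layer lets the adjusted modality features interact with and supplement one another. The sketch reuses the ModalityAdapter class from above.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Feed-forward layer: two fully connected layers with a ReLU in between."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class MultiModalAdapterSubNetwork(nn.Module):
    """Per-modality adapters, then cross-modal multi-head attention and a feed-forward layer."""

    def __init__(self, dim: int = 768, bottleneck_dim: int = 64, num_heads: int = 8):
        super().__init__()
        self.adapters = nn.ModuleDict({
            "image": ModalityAdapter(dim, bottleneck_dim),
            "text": ModalityAdapter(dim, bottleneck_dim),
            "audio": ModalityAdapter(dim, bottleneck_dim),
        })
        self.cross_attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.feed_forward = FeedForward(dim, 4 * dim)

    def forward(self, feats: dict) -> dict:
        # 1) adjust each modality with its own (non-shared) adapter
        adjusted = {m: self.adapters[m](x) for m, x in feats.items()}
        # 2) let the modalities interact: concatenate the token sequences (assumed scheme)
        lengths = {m: x.shape[1] for m, x in adjusted.items()}
        joint = torch.cat(list(adjusted.values()), dim=1)       # (batch, total_len, dim)
        joint, _ = self.cross_attention(joint, joint, joint)
        joint = self.feed_forward(joint)
        # 3) split the interacted sequence back into per-modality feature data
        out, start = {}, 0
        for m, n in lengths.items():
            out[m] = joint[:, start:start + n]
            start += n
        return out
```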
By applying a lightweight, flexible, and adaptable adapter for each modality, the multi-modal adapter sub-network enables the model to adjust the feature data of each modality according to the target task without training the whole model, and to adapt to a wide range of tasks and fields with minimal additional training.
The modal fusion sub-network may also be referred to as a dynamic fusion sub-network (Dynamic Fusion Mechanism, DFM). It is used to fuse the feature data of the multiple modalities in order to capture the features that are important to the target task and the relations between the data of different modalities.
In this embodiment, the modal fusion sub-network includes a modal attention fusion module (Modality Attention Mechanism, MAM), a context fusion module (Contextual Fusion Layer, CFL), and a feature fusion module. The process of fusing the adjusted multi-modal feature data based on the modal fusion sub-network to obtain the fused feature data may be as follows: the adjusted multi-modal feature data is fused according to attention scores based on the modal attention fusion module to obtain first intermediate fused feature data; context fusion is performed on the first intermediate fused feature data based on the context fusion module to obtain second intermediate fused feature data; and the first intermediate fused feature data and the second intermediate fused feature data are superimposed based on the feature fusion module to obtain the fused feature data.
The modal attention fusion module includes a feed-forward layer, an activation layer, and a fusion layer. The process of fusing the adjusted multi-modal feature data according to attention scores based on the modal attention fusion module to obtain the intermediate fused feature data may be as follows: initial attention scores corresponding to the adjusted feature data of each modality are determined based on the feed-forward layer; the initial attention scores are transformed based on the activation layer to obtain target attention scores; and the adjusted multi-modal feature data is weighted and summed according to the target attention scores based on the fusion layer to obtain the intermediate fused feature data.
Fig. 5 is a schematic structural diagram of the modal attention fusion module in this embodiment. As shown in Fig. 5, an initial attention score is calculated for each modality by the feed-forward layer, the initial attention scores are then transformed by the activation layer (softmax) to obtain target attention scores, and finally the fusion layer weights and sums the adjusted multi-modal feature data according to the target attention scores. Weighting and summing the adjusted multi-modal feature data according to the target attention scores based on the fusion layer can be understood as follows: each target attention score is used as the weight of the feature data of the corresponding modality in the weighted sum.
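A minimal sketch of such a modality attention fusion module is shown below. It assumes each modality is represented by a pooled feature vector of a common dimension and that the feed-forward layer produces one scalar score per modality; the names ModalityAttentionFusion and score_ffn are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    """Score each modality with a feed-forward layer, softmax the scores, then take a weighted sum."""

    def __init__(self, dim: int, hidden_dim: int = 256):
        super().__init__()
        # Feed-forward layer mapping a modality feature vector to an initial (scalar) attention score
        self.score_ffn = nn.Sequential(nn.Linear(dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))

    def forward(self, modality_feats):
        # modality_feats: list of (batch, dim) pooled features, one entry per modality
        stacked = torch.stack(modality_feats, dim=1)        # (batch, num_modalities, dim)
        init_scores = self.score_ffn(stacked)               # initial attention scores (batch, M, 1)
        target_scores = torch.softmax(init_scores, dim=1)   # activation layer: target attention scores
        fused = (target_scores * stacked).sum(dim=1)        # fusion layer: weighted sum -> (batch, dim)
        return fused                                        # first intermediate fused feature data
```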
The context fusion module is used to extract higher-order interaction information among the modalities. The context fusion module includes a self-attention layer (Self-Attention), a first superposition layer (Add), a first normalization layer (LayerNorm), a feed-forward layer, a second superposition layer (Add), and a second normalization layer (LayerNorm); the first superposition layer superimposes the input and the output of the self-attention layer, and the second superposition layer superimposes the input and the output of the feed-forward layer. Fig. 6 is a schematic structural diagram of the context fusion module in the disclosed embodiment. As shown in Fig. 6, the context fusion module includes the self-attention layer, the first superposition layer, the first normalization layer, the feed-forward layer, the second superposition layer, and the second normalization layer, and performs context fusion on the first intermediate fused feature data output by the modal attention fusion module to obtain the second intermediate fused feature data.
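This is essentially a Transformer-style block (self-attention with a residual add and LayerNorm, followed by a feed-forward layer with another add and LayerNorm). A minimal sketch under the same naming assumptions is given below; treating the first intermediate fused feature data as a token sequence of shape (batch, length, dim) is an assumption, and a single fused vector can be unsqueezed to length 1 before this block.

```python
import torch
import torch.nn as nn

class ContextFusionLayer(nn.Module):
    """Self-attention + Add & LayerNorm, then feed-forward + Add & LayerNorm."""

    def __init__(self, dim: int, num_heads: int = 8, ffn_hidden: int = None):
        super().__init__()
        ffn_hidden = ffn_hidden or 4 * dim
        self.self_attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)   # first normalization layer
        self.feed_forward = nn.Sequential(nn.Linear(dim, ffn_hidden), nn.ReLU(), nn.Linear(ffn_hidden, dim))
        self.norm2 = nn.LayerNorm(dim)   # second normalization layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim) first intermediate fused feature data (assumed layout)
        attn_out, _ = self.self_attention(x, x, x)
        x = self.norm1(x + attn_out)     # first superposition layer + first normalization layer
        ffn_out = self.feed_forward(x)
        x = self.norm2(x + ffn_out)      # second superposition layer + second normalization layer
        return x                         # second intermediate fused feature data
```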
Fig. 7 is a schematic structural diagram of the modal fusion sub-network in an embodiment of the disclosure. As shown in Fig. 7, the modal fusion sub-network includes the modal attention fusion module, the context fusion module, and the feature fusion module, where the feature fusion module superimposes the output of the modal attention fusion module and the output of the context fusion module. The modal fusion sub-network can dynamically attend to different modalities and extract the relations among the modalities according to their relevance to the target task, which makes multi-modal learning more effective and more adaptable.
In this embodiment, the target task sub-network includes a target task adapter and a prediction layer. The process of performing data processing for the target task based on the target task sub-network to obtain the multi-modal data processing result may be as follows: the fused feature data is adjusted based on the target task adapter to obtain adjusted fused feature data; and linear processing is performed on the adjusted fused feature data based on the prediction layer to obtain the multi-modal data processing result.
The target task adapter includes at least one feed-forward layer, and the prediction layer (Prediction Layer) is a linear layer. The prediction layer is set according to the target task; for example, the target task may be a classification task, a matching task, a retrieval task, and so on. The feed-forward layer includes two fully connected layers and a nonlinear activation layer located between the two fully connected layers. For example, Fig. 8 is a schematic structural diagram of a feed-forward layer in an embodiment of the present disclosure. As shown in Fig. 8, the feed-forward layer includes two fully connected layers and one nonlinear activation layer, and the nonlinear activation layer is located between the two fully connected layers.
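A minimal sketch of such a target task sub-network (one feed-forward target task adapter followed by a linear prediction layer) is shown below; the class name TargetTaskSubNetwork and the parameter num_outputs are illustrative assumptions, with num_outputs shaped by the target task (for example, the number of classes of a classification task).

```python
import torch
import torch.nn as nn

class TargetTaskSubNetwork(nn.Module):
    """Target task adapter (a feed-forward layer) followed by a linear prediction layer."""

    def __init__(self, dim: int, hidden_dim: int, num_outputs: int):
        super().__init__()
        # Target task adapter: two fully connected layers with a nonlinear activation in between
        self.task_adapter = nn.Sequential(nn.Linear(dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, dim))
        # Prediction layer: a single linear layer whose output size depends on the target task
        self.prediction = nn.Linear(dim, num_outputs)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        adjusted = self.task_adapter(fused)   # adjusted fused feature data
        return self.prediction(adjusted)      # multi-modal data processing result (e.g. logits)
```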
On the basis of the above embodiments, Fig. 9 is a schematic structural diagram of the multi-modal fusion network in an embodiment of the disclosure. As shown in Fig. 9, the multi-modal fusion network includes the pre-trained multi-modal sub-network, the multi-modal adapter sub-network, the modal fusion sub-network, and the target task sub-network. The multi-modal adapter sub-network includes the multi-modal adapter and the cross-modal adapter; the modal fusion sub-network includes the modal attention fusion module, the context fusion module, and the feature fusion module; and the target task sub-network includes the target task adapter and the prediction layer.
S130, at least one of the multi-modal adapter sub-network, the modal fusion sub-network, and the target task sub-network is trained based on the multi-modal data processing result to obtain a trained multi-modal fusion network.
The trained multi-modal fusion network can be understood as a multi-modal fusion model. Training at least one of the multi-modal adapter sub-network, the modal fusion sub-network, and the target task sub-network based on the multi-modal data processing result can be understood as follows: a loss function is determined according to the multi-modal data processing result, the pre-trained multi-modal sub-network is frozen, and the parameters of the multi-modal adapter sub-network, the modal fusion sub-network, and the target task sub-network are adjusted based on the loss function.
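The freeze-and-train step could look roughly like the following PyTorch sketch. The choice of loss (cross-entropy), optimizer (AdamW), learning rate, and the train_loader variable are assumptions for illustration; the patent only specifies that the pre-trained multi-modal sub-network is frozen while the parameters of the other sub-networks are adjusted based on the loss.

```python
import torch
import torch.nn as nn

# model is a MultiModalFusionNetwork as sketched earlier (assumed)
for p in model.backbone.parameters():
    p.requires_grad = False          # freeze the pre-trained multi-modal sub-network

trainable = [p for p in model.parameters() if p.requires_grad]   # MMA + DFM + TSFT parameters
optimizer = torch.optim.AdamW(trainable, lr=1e-4)                # assumed optimizer and learning rate
criterion = nn.CrossEntropyLoss()                                # assumed loss for a classification task

model.train()
model.backbone.eval()                # keep the frozen backbone in inference mode
for image, text, audio, label in train_loader:                   # train_loader is assumed
    output = model(image, text, audio)        # multi-modal data processing result
    loss = criterion(output, label)           # loss determined from the processing result
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```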
According to the technical solution of this embodiment, multi-modal data is acquired, where the multi-modal data includes data of at least two of the image, text, and audio modalities; the multi-modal data is input into the multi-modal fusion network, and a multi-modal data processing result is output; and at least one of the multi-modal adapter sub-network, the modal fusion sub-network, and the target task sub-network is trained based on the multi-modal data processing result to obtain a trained multi-modal fusion network. Because the model training method provided by the embodiments of the present disclosure trains only the sub-networks other than the pre-trained multi-modal sub-network in the multi-modal fusion network, the memory and GPU memory required for training are effectively reduced; at the same time, a pre-trained large model is reused, which greatly saves computing resources and time and thereby improves the training and deployment efficiency of the multi-modal fusion network.
Fig. 10 is a schematic structural diagram of a model training apparatus provided by an embodiment of the present disclosure. The model training apparatus is configured to train a multi-modal fusion network, where the multi-modal fusion network includes a pre-trained multi-modal sub-network, a multi-modal adapter sub-network, a modal fusion sub-network, and a target task sub-network that are connected in sequence. As shown in Fig. 10, the apparatus includes:
A multi-modal data acquisition module 210, configured to acquire multi-modal data; the multi-mode data comprises at least two modes of image data, text data and audio data;
the multi-mode data processing result obtaining module 220 is configured to sequentially input multi-mode data into the multi-mode fusion network, and output a multi-mode data processing result;
the multimodal fusion network training module 230 is configured to train at least one of a multimodal adaptation sub-network, a modality fusion sub-network, and a target task sub-network based on the multimodal data processing result, and obtain a trained multimodal fusion network.
Optionally, the multi-mode data processing result obtaining module 220 is further configured to:
performing feature extraction on the multi-mode data based on the pre-training multi-mode sub-network to obtain multi-mode feature data;
the multi-mode characteristic data is adjusted based on the multi-mode adaptation sub-network, and adjusted multi-mode characteristic data is obtained;
fusing the adjusted multi-mode characteristic data based on the mode fusion sub-network to obtain fused characteristic data;
and carrying out data processing of the target task on the fusion characteristic data based on the target task sub-network to obtain a multi-mode data processing result.
Optionally, the multimodal adapter sub-network includes a multimodal adapter and a cross-modality adapter; the multi-mode adapter comprises at least two mode adapters of an image adapter, a text adapter and an audio adapter; the multi-mode data processing result obtaining module 220 is further configured to:
based on the multi-mode adapter, respectively adjusting the characteristic data of the corresponding modes to obtain the characteristic data of each mode after adjustment;
and performing cross-modal adjustment on the adjusted characteristic data of each mode based on the cross-modal adapter to obtain the characteristic data of each mode after re-adjustment.
Optionally, the image adapter, the text adapter and the audio adapter all comprise two full-connection layers, and the input of the first full-connection layer is connected with the output residual error of the second full-connection layer; the cross-modality adapter includes a multi-head attention layer and a feed-forward layer.
Optionally, the modal fusion sub-network includes a modal attention fusion module, a context fusion module and a feature fusion module; the multi-mode data processing result obtaining module 220 is further configured to:
the adjusted multi-mode feature data are fused according to the attention score based on the mode attention fusion module, and first intermediate fusion feature data are obtained;
Performing context fusion on the first intermediate fusion characteristic data based on the context fusion module to obtain second intermediate fusion characteristic data;
and superposing the first intermediate fusion characteristic data and the second intermediate fusion characteristic data based on the characteristic fusion module to obtain fusion characteristic data.
Optionally, the modal attention fusion module includes: a feed-forward layer, an activation layer and a fusion layer; the multi-mode data processing result obtaining module 220 is further configured to:
determining initial attention scores corresponding to the adjusted multi-mode characteristic data respectively based on the feedforward layer;
transforming the initial attention score based on the activation layer to obtain a target attention score;
and weighting and summing the adjusted multi-mode characteristic data according to the target attention score based on the fusion layer to obtain intermediate fusion characteristic data.
Optionally, the context fusion module includes a self-attention layer, a first superposition layer, a first normalization layer, a feedforward layer, a second superposition layer, and a second normalization layer; the first superposition layer is used for superposing the input and the output of the self-attention layer; the second superimposed layer is used for superimposing the input and output of the feedforward layer.
Optionally, the target task sub-network includes a target task adapter and a prediction layer; the multi-mode data processing result obtaining module 220 is further configured to:
Adjusting the fusion characteristic data based on the target task adapter to obtain adjusted fusion characteristic data;
and carrying out linear processing on the adjusted fusion characteristic data based on the prediction layer to obtain a multi-mode data processing result.
Optionally, the target task adapter includes at least one feed-forward layer; the prediction layer is a linear layer.
Optionally, the feedforward layer includes two fully-connected layers and a nonlinear activation layer, and the nonlinear activation layer is located in the middle of the two fully-connected layers.
The model training apparatus provided by the embodiments of the present disclosure can execute the model training method provided by any embodiment of the present disclosure, and has the functional modules and beneficial effects corresponding to the executed method.
It should be noted that the units and modules included in the above apparatus are divided only according to functional logic, and the division is not limited to the above as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only for convenience of distinguishing them from each other and are not used to limit the protection scope of the embodiments of the present disclosure.
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure. Referring now to fig. 11, a schematic diagram of an electronic device (e.g., a terminal device or server in fig. 11) 500 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 11 is merely an example, and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 11, the electronic device 500 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 501, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage means 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the electronic device 500. The processing device 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
In general, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 507 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 508 including, for example, magnetic tape, hard disk, etc.; and communication means 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 11 shows an electronic device 500 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or from the storage means 508, or from the ROM 502. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 501.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The electronic device provided by the embodiment of the present disclosure and the model training method provided by the foregoing embodiment belong to the same inventive concept, and technical details not described in detail in the present embodiment may be referred to the foregoing embodiment, and the present embodiment has the same beneficial effects as the foregoing embodiment.
The embodiment of the present disclosure provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the model training method provided by the above embodiment.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients, servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol ), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), the internet (e.g., the internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire multi-modal data, where the multi-modal data includes data of at least two of the image, text, and audio modalities; sequentially input the multi-modal data into the multi-modal fusion network and output a multi-modal data processing result; and train at least one of the multi-modal adapter sub-network, the modal fusion sub-network, and the target task sub-network based on the multi-modal data processing result to obtain a trained multi-modal fusion network.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The name of the unit does not in any way constitute a limitation of the unit itself, for example the first acquisition unit may also be described as "unit acquiring at least two internet protocol addresses".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in this disclosure is not limited to technical solutions formed by the specific combinations of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in the present disclosure (but not limited thereto).
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (13)

1. A model training method for training a multi-modal fusion network, wherein the multi-modal fusion network comprises a pre-trained multi-modal sub-network, a multi-modal adapter sub-network, a modal fusion sub-network, and a target task sub-network which are connected in sequence, and the method comprises:
acquiring multi-mode data; the multi-modal data comprises at least two modes of image data, text data and audio data;
sequentially inputting the multi-mode data into the multi-mode fusion network, and outputting a multi-mode data processing result;
and training at least one of the multi-modal adapter sub-network, the modal fusion sub-network and the target task sub-network based on the multi-modal data processing result to obtain a trained multi-modal fusion network.
2. The method of claim 1, wherein inputting the multimodal data into the multimodal fusion network and outputting a multimodal data processing result comprises:
Performing feature extraction on the multi-modal data based on the pre-training multi-modal sub-network to obtain multi-modal feature data;
adjusting the multi-mode characteristic data based on the multi-mode aptamer network to obtain adjusted multi-mode characteristic data;
fusing the adjusted multi-mode characteristic data based on the mode fusion sub-network to obtain fused characteristic data;
and performing data processing of the target task on the fusion characteristic data based on the target task sub-network to obtain a multi-mode data processing result.
3. The method of claim 2, wherein the multimodal adapter sub-network comprises a multimodal adapter and a cross-modality adapter; the multi-modal adapter comprises at least two of an image adapter, a text adapter, and an audio adapter; and adjusting the multi-modal feature data based on the multi-modal adapter sub-network to obtain adjusted multi-modal feature data comprises:
based on the multi-mode adapter, respectively adjusting the characteristic data of the corresponding modes to obtain the characteristic data of each mode after adjustment;
and performing cross-modal adjustment on the adjusted characteristic data of each mode based on the cross-modal adapter to obtain the characteristic data of each mode after re-adjustment.
4. The method of claim 3, wherein the image adapter, text adapter, and audio adapter each comprise two fully connected layers, and wherein an input of a first fully connected layer is connected with an output residual of a second fully connected layer; the cross-modality adapter includes a multi-head attention layer and a feed-forward layer.
5. The method of claim 2, wherein the modality fusion subnetwork comprises a modality attention fusion module, a context fusion module, and a feature fusion module; fusing the adjusted multi-mode feature data based on the mode fusion sub-network to obtain fused feature data, wherein the method comprises the following steps:
based on the modal attention fusion module, fusing the adjusted multi-modal feature data according to attention scores to obtain first intermediate fusion feature data;
performing context fusion on the first intermediate fusion characteristic data based on the context fusion module to obtain second intermediate fusion characteristic data;
and superposing the first intermediate fusion feature data and the second intermediate fusion feature data based on the feature fusion module to obtain fusion feature data.
6. The method of claim 5, wherein the modal attention fusion module comprises a feed-forward layer, an activation layer and a fusion layer; and fusing the adjusted multi-modal feature data according to attention scores based on the modal attention fusion module, to obtain the first intermediate fusion feature data, comprises:
determining, based on the feed-forward layer, initial attention scores respectively corresponding to the adjusted multi-modal feature data;
transforming the initial attention scores based on the activation layer to obtain target attention scores;
and performing, based on the fusion layer, a weighted summation of the adjusted multi-modal feature data according to the target attention scores, to obtain the first intermediate fusion feature data.
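An illustrative reading of claim 6, assuming the adjusted features arrive as a list of same-shaped per-modality tensors and that the activation layer is a softmax over the modality axis; both choices go beyond what the claim specifies.

```python
import torch
import torch.nn as nn

class ModalAttentionFusion(nn.Module):
    """Feed-forward scoring, activation, and weighted summation (claim 6)."""

    def __init__(self, dim):
        super().__init__()
        # feed-forward layer producing one initial attention score per modality
        self.score_ffn = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))
        self.activation = nn.Softmax(dim=0)  # assumed activation over modalities

    def forward(self, modality_features):
        # modality_features: list of tensors, each of shape (batch, seq, dim)
        stacked = torch.stack(modality_features, dim=0)   # (modalities, batch, seq, dim)
        initial_scores = self.score_ffn(stacked)          # initial attention scores
        target_scores = self.activation(initial_scores)   # target attention scores
        return (target_scores * stacked).sum(dim=0)       # weighted sum over modalities
```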
7. The method of claim 5, wherein the context fusion module comprises a self-attention layer, a first superposition layer, a first normalization layer, a feed-forward layer, a second superposition layer and a second normalization layer; the first superposition layer is configured to superpose the input and the output of the self-attention layer; and the second superposition layer is configured to superpose the input and the output of the feed-forward layer.
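Claim 7 describes what is, in effect, a Transformer-style encoder block. The following sketch assumes that each normalization layer follows its superposition (residual) layer, as the claim's ordering suggests; layer widths and names are illustrative.

```python
import torch.nn as nn

class ContextFusion(nn.Module):
    """Self-attention block with two superposition and normalization layers (claim 7)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.self_attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)  # first normalization layer
        self.feed_forward = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)  # second normalization layer

    def forward(self, x):
        # x: first intermediate fusion feature data, shape (batch, seq, dim)
        attended, _ = self.self_attention(x, x, x)
        x = self.norm1(x + attended)               # first superposition layer
        x = self.norm2(x + self.feed_forward(x))   # second superposition layer
        return x
```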
8. The method of claim 2, wherein the target task sub-network comprises a target task adapter and a prediction layer; and performing data processing of the target task on the fused feature data based on the target task sub-network to obtain the multi-modal data processing result comprises:
adjusting the fused feature data based on the target task adapter to obtain adjusted fused feature data;
and performing linear processing on the adjusted fused feature data based on the prediction layer to obtain the multi-modal data processing result.
9. The method of claim 8, wherein the target task adapter comprises at least one feed-forward layer, and the prediction layer is a linear layer.
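An illustrative sketch of claims 8 and 9: a feed-forward target task adapter followed by a linear prediction layer. The hidden width and the number of outputs are assumed parameters, not claim limitations.

```python
import torch.nn as nn

class TargetTaskSubNetwork(nn.Module):
    """Target task adapter followed by a linear prediction layer (claims 8 and 9)."""

    def __init__(self, dim, num_outputs):
        super().__init__()
        # target task adapter: at least one feed-forward layer
        self.task_adapter = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # prediction layer: a linear layer
        self.prediction = nn.Linear(dim, num_outputs)

    def forward(self, fused_features):
        adjusted = self.task_adapter(fused_features)  # adjusted fused feature data
        return self.prediction(adjusted)              # multi-modal data processing result
```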
10. The method of any one of claims 4, 6, 7 and 9, wherein the feed-forward layer comprises two fully connected layers and a nonlinear activation layer, the nonlinear activation layer being located between the two fully connected layers.
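The feed-forward layer of claim 10, sketched with GELU as one possible (assumed) choice of the nonlinear activation layer.

```python
import torch.nn as nn

def feed_forward_layer(dim, hidden_dim):
    """Two fully connected layers with a nonlinear activation layer in between (claim 10)."""
    return nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
```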
11. A model training device for training a multi-modal fusion network, wherein the multi-modal fusion network comprises a pre-trained multi-modal sub-network, a multi-modal adapter sub-network, a modal fusion sub-network and a target task sub-network connected in sequence, the device comprising:
a multi-modal data acquisition module, configured to acquire multi-modal data, wherein the multi-modal data comprises data of at least two of the following modalities: image data, text data and audio data;
a multi-modal data processing result acquisition module, configured to input the multi-modal data into the multi-modal fusion network in sequence and to output a multi-modal data processing result;
and a multi-modal fusion network training module, configured to train at least one of the multi-modal adapter sub-network, the modal fusion sub-network and the target task sub-network based on the multi-modal data processing result, to obtain a trained multi-modal fusion network.
12. An electronic device, the electronic device comprising:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the model training method of any one of claims 1-10.
13. A storage medium containing computer-executable instructions which, when executed by a computer processor, perform the model training method of any one of claims 1-10.
CN202310673422.4A 2023-06-07 2023-06-07 Model training method, device, equipment and storage medium Pending CN116663609A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310673422.4A CN116663609A (en) 2023-06-07 2023-06-07 Model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310673422.4A CN116663609A (en) 2023-06-07 2023-06-07 Model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116663609A true CN116663609A (en) 2023-08-29

Family

ID=87711536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310673422.4A Pending CN116663609A (en) 2023-06-07 2023-06-07 Model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116663609A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117194985A (en) * 2023-09-18 2023-12-08 镁佳(北京)科技有限公司 Multi-mode multi-task training system and multi-mode multi-task training method
CN117194985B (en) * 2023-09-18 2024-05-10 镁佳(北京)科技有限公司 Multi-mode multi-task training system and multi-mode multi-task training method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination