CN117421577A - Multi-mode model determining method and device, electronic equipment and storage medium


Info

Publication number
CN117421577A
Authority
CN
China
Prior art keywords
mode
combination
model
branch
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210806358.8A
Other languages
Chinese (zh)
Inventor
陈泽晗
祝航程
马国俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd
Priority to CN202210806358.8A
Publication of CN117421577A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present disclosure disclose a method, an apparatus, an electronic device, and a storage medium for determining a multi-mode model. The method includes: extracting the features of the data of each mode in a first sample set through each feature extraction branch of an initial model; fusing related features, which comprise the features of the data of at least one mode, through each mode combination branch of the initial model according to the corresponding mode combination mode, and determining a prediction result according to each fused feature; training the initial model according to the prediction results, and determining a target combination mode from all mode combination modes according to the prediction results when a preset training condition is met; and retaining, in the initial model, the feature extraction branches and the mode combination branch related to the target combination mode to obtain a target model. Multi-version model training can thus be avoided, saving time and computing resources.

Description

Multi-mode model determining method and device, electronic equipment and storage medium
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular to a method and apparatus for determining a multi-mode model, an electronic device, and a storage medium.
Background
Multimodal model training can be understood as the process of training a model based on data of different modalities, where each source or form of information may be referred to as a modality. Model performance does not strictly depend on the number of modality types; if the wrong modality data is introduced, performance may even degrade because of the added interference.
In the prior art, a separate model version must be trained for each combination of modality data in order to find the best-performing version. The drawbacks of this approach include at least that separately training multiple model versions consumes large amounts of time and computing resources.
Disclosure of Invention
Embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a storage medium for determining a multi-mode model, which avoid multi-version model training and save time and computing resources.
In a first aspect, an embodiment of the present disclosure provides a method for determining a multimodal model, including:
extracting the characteristics of the data of each mode in the first sample set through each characteristic extraction branch of the initial model;
fusing related features through each mode combination branch of the initial model according to the corresponding mode combination mode, wherein the related features comprise the features of the data of at least one mode, and determining a prediction result according to each fused feature;
training the initial model according to each prediction result, and determining a target combination mode from all mode combination modes according to the prediction results when a preset training condition is met;
retaining the feature extraction branch and the mode combination branch related to the target combination mode in the initial model to obtain a target model; the feature extraction branch related to the target combination mode is used for extracting the related features of the target combination mode; and the mode combination branch related to the target combination mode is used for fusing the related features according to the target combination mode.
In a second aspect, an embodiment of the present disclosure further provides a device for determining a multimodal model, including:
the feature extraction module is used for extracting the features of the data of each mode in the first sample set through each feature extraction branch of the initial model;
the mode combination module is used for fusing related features through each mode combination branch of the initial model according to the corresponding mode combination mode, wherein the related features comprise the features of the data of at least one mode, and determining a prediction result according to each fused feature;
the modal self-adaptive selection module is used for training the initial model according to each prediction result and determining a target combination mode from all modal combination modes according to each prediction result when a preset training condition is met;
The model determining module is used for retaining the feature extraction branch and the mode combination branch related to the target combination mode in the initial model to obtain a target model; the feature extraction branch related to the target combination mode is used for extracting the related features of the target combination mode; and the mode combination branch related to the target combination mode is used for fusing the related features according to the target combination mode.
In a third aspect, embodiments of the present disclosure further provide an electronic device, including:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement a method of determining a multimodal model as described in any of the embodiments of the disclosure.
In a fourth aspect, the presently disclosed embodiments also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are for performing a method of determining a multimodal model as described in any of the presently disclosed embodiments.
According to the technical scheme of the embodiments of the present disclosure, the features of the data of each mode in the first sample set are extracted through each feature extraction branch of the initial model; related features, which comprise the features of the data of at least one mode, are fused through each mode combination branch of the initial model according to the corresponding mode combination mode, and a prediction result is determined according to each fused feature; the initial model is trained according to the prediction results, and when a preset training condition is met, a target combination mode is determined from all mode combination modes according to the prediction results; and the feature extraction branch and the mode combination branch related to the target combination mode are retained in the initial model to obtain a target model, wherein the feature extraction branch related to the target combination mode is used for extracting the related features of the target combination mode, and the mode combination branch related to the target combination mode is used for fusing the related features according to the target combination mode.
By having all mode combination branches share the extracted features, fusing those features according to different mode combination modes, and predicting from the fused features, a performance comparison of the mode combination branches under different mode combination modes can be obtained by training only one version of the initial model. The optimal mode combination mode can thus be selected adaptively, yielding a multi-mode model with optimal performance. Compared with the traditional separate training of multiple model versions, this reduces the redundant computation of repeatedly extracting features and saves training time and computing resources.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is a flowchart of a method for determining a multimodal model according to an embodiment of the disclosure;
FIG. 2 is a schematic structural diagram of an initial model in a method for determining a multi-modal model according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a multi-modal model determining apparatus according to an embodiment of the disclosure;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a", "an", and "a plurality of" in this disclosure are illustrative rather than limiting; those skilled in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
It will be appreciated that the data involved in the present technical solution (including but not limited to the data itself and the acquisition or use of the data) should comply with the applicable laws, regulations, and related requirements.
Fig. 1 is a flowchart of a method for determining a multimodal model according to an embodiment of the disclosure. The embodiment is applicable to multimodal model training with adaptive modality selection. The method may be performed by a multimodal model determining apparatus, which may be implemented in software and/or hardware and configured in an electronic device such as a computer.
As shown in fig. 1, the method for determining a multimodal model provided in this embodiment may include:
s110, extracting the characteristics of the data of each mode in the first sample set through each characteristic extraction branch of the initial model.
In an embodiment of the present disclosure, the initial model may be a pre-built multi-modal neural network model. The first set of samples may include a plurality of samples for the model task, and truth labels for each sample. Wherein each sample may contain data of multiple modalities. In training the initial model, features of each sample may first be extracted based on the initial model. Specifically, feature extraction may be performed on data of a corresponding modality in each sample through each feature extraction branch of the initial model.
Fig. 2 is a schematic structural diagram of an initial model in a method for determining a multi-modal model according to an embodiment of the disclosure. Referring to fig. 2, each sample in the first set of samples may include data for three modalities, for example, modality 1, modality 2, and modality 3; correspondingly, the initial model can comprise three feature extraction branches, namely a feature extraction branch 1, a feature extraction branch 2 and a feature extraction branch 3, and the three branches can extract the features of the data of the mode 1, the mode 2 and the mode 3 respectively to obtain the feature 1, the feature 2 and the feature 3.
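As a minimal illustrative sketch (not the patent's actual implementation), the three-branch structure of Fig. 2 could be expressed in PyTorch as follows; the encoder architecture, feature dimensions, and names are assumptions:

    import torch
    import torch.nn as nn

    class FeatureBranch(nn.Module):
        """One feature extraction branch: maps one mode's raw data to a feature vector."""
        def __init__(self, in_dim: int, feat_dim: int = 128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

    # One branch per mode; the input dimensions are placeholders.
    branches = nn.ModuleDict({
        "mode1": FeatureBranch(in_dim=40),    # e.g. audio
        "mode2": FeatureBranch(in_dim=512),   # e.g. image
        "mode3": FeatureBranch(in_dim=300),   # e.g. text
    })

    # A batch of 8 samples, each containing data of all three modes.
    sample = {name: torch.randn(8, branch.net[0].in_features)
              for name, branch in branches.items()}
    features = {name: branch(sample[name]) for name, branch in branches.items()}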
S120, fusing related features through each mode combination branch of the initial model according to the corresponding mode combination mode, wherein the related features comprise the features of the data of at least one mode, and determining a prediction result according to each fused feature.
In the embodiments of the present disclosure, a mode combination mode refers to a way of combining the features of the data of at least one mode; the features combined by a mode combination mode may be called the related features of that mode combination mode. The mode combination modes can be determined in advance according to the model task, and each mode combination branch in the initial model can be created according to a predetermined mode combination mode, with each branch corresponding to a different mode combination mode. Through the mode combination branches of the initial model, the features of the data of at least one mode can be fused in diverse ways, and a prediction result can be determined from each fused feature.
For example, referring again to Fig. 2, the features of the data of at least one mode may be selected from mode 1, mode 2 and mode 3 and combined, giving N mode combination modes (N can be at most 7 for three modes, i.e., 2^3 - 1), with each mode combination mode corresponding to one mode combination branch. Each mode combination branch may be provided with a fusion network and a classifier. The fusion network fuses the related features according to the mode combination mode corresponding to the branch and may be, for example, a fully connected (FC) layer or a squeeze-and-excitation (SE) layer. The classifier outputs a prediction result from the fused features and may be, for example, a softmax classifier. N prediction results can thus be output through the N mode combination branches.
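Continuing the sketch above, each mode combination branch can be a lightweight head over the shared features, here with a simple FC fusion layer and a linear classifier (softmax applied in the loss); the fusion design, feature size, and class count are assumptions:

    import itertools
    import torch
    import torch.nn as nn

    class CombinationBranch(nn.Module):
        """One mode combination branch: fuses the related features of its mode
        subset with an FC layer and classifies the fused feature."""
        def __init__(self, modes, feat_dim: int = 128, num_classes: int = 10):
            super().__init__()
            self.modes = tuple(modes)
            self.fusion = nn.Linear(feat_dim * len(self.modes), feat_dim)
            self.classifier = nn.Linear(feat_dim, num_classes)

        def forward(self, features: dict) -> torch.Tensor:
            related = torch.cat([features[m] for m in self.modes], dim=-1)
            return self.classifier(torch.relu(self.fusion(related)))

    # All non-empty subsets of three modes: N = 2**3 - 1 = 7 branches.
    names = ["mode1", "mode2", "mode3"]
    combos = [c for r in range(1, len(names) + 1)
              for c in itertools.combinations(names, r)]
    combo_branches = nn.ModuleList([CombinationBranch(c) for c in combos])

    features = {n: torch.randn(8, 128) for n in names}             # shared features
    predictions = [branch(features) for branch in combo_branches]  # 7 prediction tensors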
Although the initial model contains multiple mode combination branches, these branches are very lightweight network modules compared with the feature extraction branches and consume very few computing resources. Experiments show that when the feature extraction branches use a variety of commonly used networks, the computation of all mode combination branches accounts for less than 2.5% of the model's total computation. Adding multiple mode combination branches therefore adds very little computation to the initial model.
S130, training an initial model according to each prediction result, and determining a target combination mode from all mode combination modes according to each prediction result when a preset training condition is met.
In the embodiments of the present disclosure, the deviation between each prediction result and the truth label of the corresponding sample can be backpropagated to adjust the network parameters of the corresponding mode combination branch and the feature extraction networks, thereby training the initial model. The preset training condition may be, for example, that the number of trained samples reaches s% of the total number of samples (s may be, for example, 30); or that all samples have been trained; or that the deviation of each mode combination branch is smaller than a preset deviation, and so on; the possibilities are not exhaustively listed here.
If the preset training condition is currently met, the performance of the mode combination branches can be compared, for example by evaluating the prediction accuracy of each mode combination branch on a preset test sample set. The mode combination mode corresponding to the best-performing mode combination branch can then be used as the target combination mode.
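A hedged sketch of this joint training and selection, reusing the names from the sketches above; the optimizer, loss, data loaders, and stopping rule are all assumptions:

    import torch
    import torch.nn.functional as F

    opt = torch.optim.Adam(
        list(branches.parameters()) + list(combo_branches.parameters()), lr=1e-3)

    for sample, label in train_loader:   # train_loader assumed to yield (dict, labels)
        feats = {n: b(sample[n]) for n, b in branches.items()}   # extracted once, shared
        losses = [F.cross_entropy(head(feats), label) for head in combo_branches]
        opt.zero_grad()
        sum(losses).backward()   # each branch's deviation also adjusts the extractors
        opt.step()

    # Preset training condition met: compare branch accuracy on a test set.
    with torch.no_grad():
        accs = []
        for head in combo_branches:
            correct = total = 0
            for sample, label in test_loader:                    # test_loader assumed
                feats = {n: b(sample[n]) for n, b in branches.items()}
                correct += (head(feats).argmax(dim=-1) == label).sum().item()
                total += label.numel()
            accs.append(correct / total)

    target_combo = combos[accs.index(max(accs))]   # the target combination mode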
And S140, reserving a feature extraction branch and a modal combination branch which are related to the target combination mode in the initial model to obtain a target model.
The feature extraction branch related to the target combination mode is used for extracting the related features of the target combination mode; the mode combination branch related to the target combination mode, i.e., the best-performing mode combination branch, is used for fusing the related features according to the target combination mode. For example, referring to Fig. 2, if the target combination mode combines feature 1 and feature 2, then feature extraction branch 1 (which extracts feature 1) and feature extraction branch 2 (which extracts feature 2) are the feature extraction branches related to the target combination mode, and mode combination branch 2, in which feature 1 and feature 2 are fused, is the mode combination branch related to the target combination mode.
The feature extraction branches and the mode combination branch related to the target combination mode can be retained in the initial model, while the other branches are automatically discarded and no longer used, so that the target model is obtained from the initial model. When the target model performs its task, unnecessary modal data can be dropped; only valuable modal data is used, and the result is predicted according to the optimal mode combination mode, so that optimal model performance can be achieved.
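As a minimal sketch of this retention step (names reused from above; the winning combination of feature 1 and feature 2 follows the Fig. 2 example):

    import torch.nn as nn

    target_combo = ("mode1", "mode2")   # assumed winner, per the example above

    # Keep only the related feature extraction branches and combination branch;
    # all other branches are simply dropped from the target model.
    kept_branches = nn.ModuleDict({n: branches[n] for n in target_combo})
    kept_head = next(h for h in combo_branches if h.modes == target_combo)

    def target_model(sample):
        """Target model: uses only valuable mode data and the optimal combination."""
        feats = {n: b(sample[n]) for n, b in kept_branches.items()}
        return kept_head(feats)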
In the traditional multimodal training approach, a separate model version must be trained for each combination of modality data. Across the training of these different versions, the feature extraction network of each modality is retrained from scratch, wasting computation on repeated training.
In the embodiment of the disclosure, the features extracted by each feature extraction branch can be shared among the mode combination branches of the initial model, so that the performance comparison under different mode combination modes can be obtained by training one version of the initial model, and the time and the computing resources consumed by multi-version model training are saved.
Taking the mode combination modes shown in Fig. 2 as an example: with N possible mode combination modes, the conventional approach needs to train N model versions, whereas the embodiment of the disclosure only needs to train one version of the initial model, saving roughly (N-1)/N of the computing resources.
According to the above technical scheme, by having the mode combination branches share the extracted features, fusing the features according to different mode combination modes, and predicting from the fused features, training only one version of the initial model yields a performance comparison of the mode combination branches under the different mode combination modes, so that the optimal mode combination mode can be selected adaptively and a multi-mode model with optimal performance obtained. Compared with the traditional separate training of multiple model versions, this reduces the redundant computation of repeatedly extracting features and saves training time and computing resources.
In some optional implementations, after the target model is obtained, the method may further include: training the target model with a second sample set, where the second sample set contains data of the modes related to the target combination mode and has a larger sample size than the first sample set.
Since the second sample set has a larger sample size than the first sample set, the first sample set can be regarded as a small-sample dataset and the second sample set as a large-sample dataset. In some implementations, the first sample set may be obtained by sampling the second sample set; for example, the second sample set may be randomly sampled at a certain sampling rate (e.g., 20%) and the resulting subset used as the first sample set. Because all modal data in the first sample set is derived from the second sample set, the second sample set is guaranteed to contain the modal data related to the target combination mode.
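A small sketch of deriving the first sample set by sampling; the loader name is an assumption and the 20% rate matches the example above:

    import random

    second_sample_set = load_samples()        # assumed loader for the large sample set
    k = int(0.2 * len(second_sample_set))     # example 20% sampling rate
    first_sample_set = random.sample(second_sample_set, k)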
In these optional implementations, training the initial model on a small-sample dataset to evaluate the performance of each mode combination mode speeds up the determination of the optimal mode combination mode, further saving computing resources; training the target model on a large-sample dataset then further improves the target model's performance.
In some optional implementations, the mode combination branches may be created based on the following steps: determining the mode combination modes according to the selection flag of each mode, where the selection flags include a mandatory flag and a non-mandatory flag; and creating each mode combination branch according to each mode combination mode.
When a mode's selection flag is the mandatory flag, the features of that mode's data must be included in every mode combination mode; when the selection flag is the non-mandatory flag, the features of that mode's data are optionally included. The selection flags of the modes can be preset according to the specific business scenario. Determining the mode combination modes according to the selection flags may then include, for example:
case one: optional flags for each modality
When the modes of the necessary mark do not exist, if M modes can be selected, M (M is more than or equal to 1 and less than or equal to M) modes can be randomly selected from the M modes as possible mode combination modes. At this time, the possible mode combination N may be calculated based on the following formula:
Case 2: some modes have the mandatory flag.
If certain modes can be directly judged to be valuable, they can be designated as mandatory modes (i.e., given the mandatory flag) to reduce the computation needed to determine the best mode combination. If there are P (P ≥ 1) mandatory modes, then M - P non-mandatory modes remain. On the premise that every mode combination mode contains the P mandatory modes, any m modes (0 ≤ m ≤ M - P) may additionally be selected from the M - P non-mandatory modes. The number N of possible mode combination modes can then be calculated based on the following formula:

N = Σ_{m=0}^{M-P} C(M-P, m) = 2^(M-P).
in these alternative implementations, after determining the possible mode combination modes N according to the selection marks of the modes, each corresponding mode combination branch may be created and added to the initial model structure, so as to implement training of only one version of initial model, and obtain performance comparison of the mode combination branches in different mode combination modes, thereby adaptively selecting the optimal mode combination mode.
In some optional implementations, each feature extraction branch and each mode combination branch is configured with an activation state flag. Retaining the mode combination branch and the feature extraction branches related to the target combination mode in the initial model may then include: determining the activation state flags of the feature extraction branches and the mode combination branches according to the target combination mode; and retaining, through the activation state flags, the mode combination branch and the feature extraction branches related to the target combination mode in the initial model.
Each feature extraction branch and each mode combination branch in the initial model can be configured with an activation state flag, and other branches of the initial model (such as the input branches for the data of each mode) may also have activation state flags. An activation state flag can be either active or inactive: when the flag is active, the corresponding branch operates normally; when the flag is inactive, the branch stops running, i.e., data can no longer be processed through it.
In these optional implementations, the mode combination branch and the feature extraction branches related to the target combination mode can be determined according to the target combination mode, and the activation state flags of those related branches can be set to the active state. Setting the related branches to the active state keeps them running normally, while the other, unrelated branches are stopped, i.e., automatically discarded and no longer used, so that the initial model becomes the target model.
In some optional implementations, before the features of the data of each mode in the first sample set are extracted through the feature extraction branches of the initial model, the method may further include: setting the activation state flags of the feature extraction branches and the mode combination branches to the active state;
accordingly, determining the activation state flags of the feature extraction branches and the mode combination branches according to the target combination mode may include: maintaining the active state of the mode combination branch and the feature extraction branches related to the target combination mode in the initial model; and updating the activation state flags of all other branches, i.e., those other than the related mode combination branch and feature extraction branches, to the inactive state.
For example, an active state may be characterized by a True identifier and an inactive state may be characterized by a False identifier. The activation status flags of the feature extraction branches and the modality combination branches may be initially set to True during the creation of the initial model. After determining the target combination mode, the activation state flags of the mode combination branch and the feature extraction branch related to the target combination mode in the initial model can be maintained, and only the activation state flags of the irrelevant branches are updated from True to False.
In these alternative implementations, the activation status flag of each branch may be checked before processing the data via that branch; if True, the data may be processed via the branch; if False, the data is not processed by the branch, so that the retention of the mode combination branch and the feature extraction branch related to the target combination mode is realized.
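A hedged sketch of this activation state mechanism (the wrapper class and flag name are assumptions):

    import torch.nn as nn

    class FlaggedBranch(nn.Module):
        """Wraps any branch with an activation state flag; data passes through
        only while the flag is True."""
        def __init__(self, branch: nn.Module):
            super().__init__()
            self.branch = branch
            self.active = True        # all branches start in the active state

        def forward(self, x):
            if not self.active:       # inactive: the branch stops running
                return None
            return self.branch(x)

    # After the target combination mode is known, only the unrelated
    # branches are flipped from True (active) to False (inactive):
    #     for b in unrelated_branches:
    #         b.active = False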
In some optional implementations, the target model is a video processing model, and the modes include at least one of the following: an audio mode, an image mode, and a text mode.
The video processing model may include, but is not limited to, a video classification model, a video understanding model, and the like. The multi-mode video processing model can utilize the characteristics of data of multiple modes such as an audio mode, an image mode, a text mode and the like to perform video processing, and can better complete video processing tasks compared with a single mode.
In these alternative implementations, by training a version of the initial model, the best mode combination mode can be adaptively selected, so as to solve the mode selection problem of the multi-mode video processing model, and save the calculation amount and the model training time.
Fig. 3 is a schematic structural diagram of a multi-mode model determining apparatus according to an embodiment of the disclosure. The apparatus provided in this embodiment is applicable to multimodal model training with adaptive modality selection.
As shown in fig. 3, the apparatus for determining a multimodal model according to an embodiment of the disclosure may include:
a feature extraction module 310, configured to extract features of data of each modality in the first sample set through each feature extraction branch of the initial model;
the mode combination module 320 is configured to fuse related features through each mode combination branch of the initial model according to the corresponding mode combination mode, where the related features include the features of the data of at least one mode, and to determine a prediction result according to each fused feature;
the mode adaptive selection module 330 is configured to train the initial model according to each prediction result, and determine a target combination mode from the mode combination modes according to each prediction result when a preset training condition is satisfied;
the model determining module 340 is configured to retain the feature extraction branch and the mode combination branch related to the target combination mode in the initial model, so as to obtain a target model; the feature extraction branch related to the target combination mode is used for extracting the related features of the target combination mode; and the mode combination branch related to the target combination mode is used for fusing the related features according to the target combination mode.
In some alternative implementations, the determining device of the multimodal model further includes:
The target model training module is used for training the target model by using the second sample set after the target model is obtained;
the second sample set contains data of a mode related to the target combination mode, and the sample size of the second sample set is larger than that of the first sample set.
In some alternative implementations, the first sample set is a sample set obtained by sampling the second sample set.
In some alternative implementations, the determining device of the multimodal model further includes:
a branch creation module, configured to create the mode combination branches based on the following steps:
determining the mode combination modes according to the selection flag of each mode, wherein the selection flags comprise a mandatory flag and a non-mandatory flag;
and creating each mode combination branch according to each mode combination mode.
In some optional implementations, each feature extraction branch and each modality combination branch is configured with an activation status flag;
the model determination module may be used to:
determining each feature extraction branch and an activation state mark of each feature extraction branch according to the target combination mode;
and (3) retaining a mode combination branch and a feature extraction branch related to the target combination mode in the initial model through activating the state mark.
In some alternative implementations, the model determination module may be further configured to, prior to the feature extraction module extracting features of the data of each modality in the first sample set through each feature extraction branch of the initial model:
setting the activation state marks of the feature extraction branches and the mode combination branches as activation states;
accordingly, the model determination module may be configured to:
maintain the active state of the mode combination branch and the feature extraction branches related to the target combination mode in the initial model, and update the activation state flags of all other branches, i.e., those other than the related mode combination branch and feature extraction branches, to the inactive state.
In some alternative implementations, the target model is a video processing model; each modality includes at least one of: audio modality, image modality, and text modality.
The multi-mode model determining device provided by the embodiment of the disclosure can execute the multi-mode model determining method provided by any embodiment of the disclosure, and has the corresponding functional modules and beneficial effects of the executing method.
It should be noted that each unit and module included in the above apparatus are only divided according to the functional logic, but not limited to the above division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for convenience of distinguishing from each other, and are not used to limit the protection scope of the embodiments of the present disclosure.
Referring now to fig. 4, a schematic diagram of an electronic device (e.g., a terminal device or server in fig. 4) 400 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 4 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in Fig. 4, the electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 401 that may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403. Various programs and data necessary for the operation of the electronic device 400 are also stored in the RAM 403. The processing device 401, the ROM 402, and the RAM 403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
In general, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 407 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 408 including, for example, magnetic tape, hard disk, etc.; and a communication device 409. The communication means 409 may allow the electronic device 400 to communicate with other devices wirelessly or by wire to exchange data. While fig. 4 shows an electronic device 400 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communications device 409, or from storage 408, or from ROM 402. The above-described functions defined in the determination method of the multimodal model of the embodiment of the disclosure are performed when the computer program is executed by the processing device 401.
The electronic device provided by the embodiment of the present disclosure and the method for determining a multimodal model provided by the foregoing embodiment belong to the same disclosure concept, and technical details not described in detail in the present embodiment may be referred to the foregoing embodiment, and the present embodiment has the same beneficial effects as the foregoing embodiment.
The embodiment of the present disclosure provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method for determining a multimodal model provided by the above embodiment.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM) or flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, clients and servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
extract the features of the data of each mode in the first sample set through each feature extraction branch of the initial model; fuse related features, which comprise the features of the data of at least one mode, through each mode combination branch of the initial model according to the corresponding mode combination mode, and determine a prediction result according to each fused feature; train the initial model according to the prediction results, and when a preset training condition is met, determine a target combination mode from all mode combination modes according to the prediction results; and retain, in the initial model, the feature extraction branch and the mode combination branch related to the target combination mode to obtain a target model, wherein the feature extraction branch related to the target combination mode is used for extracting the related features of the target combination mode, and the mode combination branch related to the target combination mode is used for fusing the related features according to the target combination mode.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software or by means of hardware. In some cases, the names of the units and modules do not limit the units and modules themselves.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a field programmable gate array (Field Programmable Gate Array, FPGA), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a special standard product (Application Specific Standard Parts, ASSP), a System On Chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, there is provided a method of determining a multimodal model, the method comprising:
extracting the characteristics of the data of each mode in the first sample set through each characteristic extraction branch of the initial model;
fusing related features through each mode combination branch of the initial model according to the corresponding mode combination mode, wherein the related features comprise the features of the data of at least one mode, and determining a prediction result according to each fused feature;
training the initial model according to each prediction result, and determining a target combination mode from all mode combination modes according to each prediction result when a preset training condition is met;
retaining the feature extraction branch and the mode combination branch related to the target combination mode in the initial model to obtain a target model; the feature extraction branch related to the target combination mode is used for extracting the related features of the target combination mode; and the mode combination branch related to the target combination mode is used for fusing the related features according to the target combination mode.
According to one or more embodiments of the present disclosure, there is provided a method for determining a multimodal model, further comprising:
In some optional implementations, after the obtaining the target model, further includes:
training the target model by using a second sample set;
wherein the second sample set contains data of a modality related to the target combination mode, and a sample size of the second sample set is larger than a sample size of the first sample set.
According to one or more embodiments of the present disclosure, there is provided a method for determining a multimodal model, further comprising:
in some optional implementations, the first sample set is a sample set obtained by sampling the second sample set.
According to one or more embodiments of the present disclosure, there is provided a method for determining a multimodal model, further comprising:
in some alternative implementations, the modal combination branches are created based on the steps of:
determining the mode combination modes according to the selection flag of each mode, wherein the selection flags comprise a mandatory flag and a non-mandatory flag;
and creating the mode combination branches according to the mode combination modes.
According to one or more embodiments of the present disclosure, there is provided a method for determining a multimodal model, further comprising:
In some optional implementations, the feature extraction branches and the mode combination branches are configured with an activation status flag;
the retaining the mode combination branch and the feature extraction branch related to the target combination mode in the initial model comprises the following steps:
determining the activation state flags of each feature extraction branch and each mode combination branch according to the target combination mode;
and retaining, through the activation state flags, the mode combination branch and the feature extraction branches related to the target combination mode in the initial model.
According to one or more embodiments of the present disclosure, there is provided a method for determining a multimodal model, further comprising:
in some optional implementations, before extracting the features of the data of each modality in the first sample set through each feature extraction branch of the initial model, the method further includes:
setting the activation state marks of the feature extraction branches and the mode combination branches to be in an activation state;
correspondingly, determining the activation state flags of the feature extraction branches and the mode combination branches according to the target combination mode comprises:
maintaining the active state of the mode combination branch and the feature extraction branches related to the target combination mode in the initial model;
and updating the activation state flags of all other branches, i.e., those other than the related mode combination branch and feature extraction branches, to the inactive state.
According to one or more embodiments of the present disclosure, there is provided a method for determining a multimodal model, further comprising:
in some alternative implementations, the target model is a video processing model, and the modes include at least one of the following: an audio mode, an image mode, and a text mode.
According to one or more embodiments of the present disclosure, there is provided an apparatus for determining a multimodal model, the apparatus comprising:
the feature extraction module is used for extracting the features of the data of each mode in the first sample set through each feature extraction branch of the initial model;
the mode combination module is used for fusing related features through each mode combination branch of the initial model according to the corresponding mode combination mode, wherein the related features comprise the features of the data of at least one mode, and determining a prediction result according to each fused feature;
The modal self-adaptive selection module is used for training the initial model according to each prediction result and determining a target combination mode from all modal combination modes according to each prediction result when a preset training condition is met;
the model determining module is used for retaining the feature extraction branch and the mode combination branch related to the target combination mode in the initial model to obtain a target model; the feature extraction branch related to the target combination mode is used for extracting the related features of the target combination mode; and the mode combination branch related to the target combination mode is used for fusing the related features according to the target combination mode.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to the specific combinations of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, solutions formed by interchanging the above features with technical features having similar functions disclosed in this disclosure (but not limited thereto).
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (10)

1. A method for determining a multimodal model, comprising:
extracting the characteristics of the data of each mode in the first sample set through each characteristic extraction branch of the initial model;
fusing related features through each mode combination branch of the initial model according to the corresponding mode combination mode, wherein the related features comprise the features of the data of at least one mode, and determining a prediction result according to each fused feature;
training the initial model according to each prediction result, and determining a target combination mode from all mode combination modes according to each prediction result when a preset training condition is met;
the feature extraction branches and the modal combination branches related to the target combination mode in the initial model are reserved, and a target model is obtained; the feature extraction branch related to the target combination mode is used for extracting related features of the target combination mode; and the mode combination branches related to the target combination mode are used for fusing related features according to the target combination mode.
2. The method of claim 1, further comprising, after obtaining the target model:
training the target model with a second sample set;
wherein the second sample set contains data of the modalities related to the target combination mode, and the sample size of the second sample set is larger than that of the first sample set.
3. The method of claim 2, wherein the first sample set is obtained by sampling the second sample set.
4. The method of claim 1, wherein the modal combination branches are created by:
determining the modality combination modes according to a selection mark of each modality, wherein the selection marks comprise a required mark and an optional mark;
and creating the modal combination branches according to the modality combination modes.
5. The method of claim 1, wherein each feature extraction branch and each modal combination branch is configured with an activation state flag;
and wherein retaining the modal combination branch and the feature extraction branches related to the target combination mode in the initial model comprises:
determining the activation state flag of each feature extraction branch and each modal combination branch according to the target combination mode;
and retaining, through the activation state flags, the modal combination branch and the feature extraction branches related to the target combination mode in the initial model.
6. The method of claim 5, further comprising, before the extracting of the features of the data of each modality in the first sample set through each feature extraction branch of the initial model:
setting the activation state flags of all feature extraction branches and modal combination branches to an activated state;
correspondingly, determining the activation state flag of each feature extraction branch and each modal combination branch according to the target combination mode comprises:
maintaining the activated state of the modal combination branch and the feature extraction branches related to the target combination mode in the initial model;
and updating the activation state flags of all other branches to a deactivated state.
7. The method of any one of claims 1-6, wherein the target model is a video processing model, and the modalities comprise at least one of the following: an audio modality, an image modality, and a text modality.
8. A multimodal model determination apparatus, comprising:
a feature extraction module, used for extracting features of the data of each modality in a first sample set through each feature extraction branch of an initial model;
a modal combination module, used for fusing related features through each modal combination branch of the initial model according to a corresponding modality combination mode, wherein the related features comprise at least one feature of the data of each modality, and for determining a prediction result according to each fused feature;
a modal adaptive selection module, used for training the initial model according to each prediction result and, when a preset training condition is met, determining a target combination mode from all modality combination modes according to each prediction result;
a model determination module, used for retaining, in the initial model, the feature extraction branches and the modal combination branch related to the target combination mode to obtain a target model, wherein the feature extraction branches related to the target combination mode are used for extracting the related features of the target combination mode, and the modal combination branch related to the target combination mode is used for fusing the related features according to the target combination mode.
9. An electronic device, comprising:
one or more processors; and
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for determining a multimodal model according to any one of claims 1-7.
10. A storage medium containing computer-executable instructions which, when executed by a computer processor, perform the method for determining a multimodal model according to any one of claims 1-7.
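To make the branch-creation and pruning logic of claims 4 to 6 concrete, the following sketch continues the assumptions of the earlier sketch; enumerate_combination_modes and prune_by_activation are hypothetical helper names, and the activation state flag is modeled as a plain boolean attribute, which the patent does not mandate.

    # Hypothetical helpers illustrating claims 4-6; all names are invented.
    from itertools import combinations

    def enumerate_combination_modes(selection_marks):
        # selection_marks maps modality -> "required" | "optional".
        # Every combination mode contains all required modalities plus any
        # subset of the optional ones (claim 4).
        required = [m for m, mark in selection_marks.items() if mark == "required"]
        optional = [m for m, mark in selection_marks.items() if mark == "optional"]
        modes = []
        for r in range(len(optional) + 1):
            for subset in combinations(optional, r):
                mode = tuple(required) + subset
                if mode:  # skip the empty combination
                    modes.append(mode)
        return modes

    def prune_by_activation(model, target_mode):
        # Claims 5-6: every flag starts activated; keep the branches related
        # to the target combination mode activated and deactivate the rest.
        for modality, branch in model.extractors.items():
            branch.active = modality in target_mode
        for mode, head in zip(model.combination_modes, model.heads):
            head.active = (mode == target_mode)

    # Example: image and text are required, audio is optional.
    print(enumerate_combination_modes(
        {"audio": "optional", "image": "required", "text": "required"}))
    # [('image', 'text'), ('image', 'text', 'audio')]

An export step that drops every sub-module whose flag is False would then be one plausible realization of retaining the related branches through the activation state flags as recited in claim 5.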

Priority Applications (1)

Application Number: CN202210806358.8A (publication CN117421577A)
Priority Date: 2022-07-08
Filing Date: 2022-07-08
Title: Multi-mode model determining method and device, electronic equipment and storage medium


Publications (1)

Publication Number: CN117421577A
Publication Date: 2024-01-19

Family

ID=89527133

Family Applications (1)

Application Number: CN202210806358.8A
Title: Multi-mode model determining method and device, electronic equipment and storage medium
Priority Date: 2022-07-08
Filing Date: 2022-07-08
Status: Pending

Country Status (1)

Country: CN
Publication: CN117421577A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination