CN112183547A - Multi-modal data-based multi-task learning method and system - Google Patents

Multi-modal data-based multi-task learning method and system

Info

Publication number
CN112183547A
CN112183547A
Authority
CN
China
Prior art keywords
data
task
features
tasks
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011116788.4A
Other languages
Chinese (zh)
Inventor
李洁
胡瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202011116788.4A priority Critical patent/CN112183547A/en
Publication of CN112183547A publication Critical patent/CN112183547A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-task learning method and system based on multi-modal data. The method comprises: obtaining multi-modal data for completing a plurality of tasks, grouping the multi-modal data by modality type, and performing spatio-temporal alignment; extracting hard shared features of the data in each modality from the aligned multi-modal data, obtaining soft shared features related to different tasks in each modality based on the hard shared features, obtaining unshared features related to different tasks in each modality based on the soft shared features, and obtaining data features corresponding to each of the plurality of tasks based on the unshared features; executing the plurality of tasks based on the corresponding data features and calculating a loss function for each of the plurality of tasks; and calculating an overall loss function of the multi-task learning method using uncertainty loss weighting so as to train the multi-task learning method. The technical scheme of the embodiments of the invention makes full use of the complementarity among multi-modal data, completes multi-task learning quickly and accurately, and achieves a lightweight design.

Description

Multi-modal data-based multi-task learning method and system
Technical Field
The invention relates to the field of machine learning, in particular to a multi-task learning method and system based on multi-modal data.
Background
Multi-task learning (MTL) is a branch of machine learning that solves multiple learning tasks at the same time while exploiting the commonalities and differences between them. Compared with learning each task separately, multi-task learning can improve the learning efficiency and prediction accuracy of the task-specific models. In multi-task learning there is correlation between the given tasks (or at least some of them), in two respects: first, modal correlation, i.e., multiple tasks use data of the same modality; second, correlation of the feature representations, i.e., the features required by the tasks are related to some degree. Multi-task learning therefore mainly exploits the correlation between tasks to improve the performance of each task.
However, conventional multi-task learning methods generally employ a single multi-task learning technique, i.e., only one of hard parameter sharing, soft parameter sharing, multi-task loss weighting, and the like, and thus cannot fully utilize the correlation between different tasks. In addition, the tasks often take input data of the same modality, so the complementarity of data of different modalities cannot be effectively exploited.
Therefore, a multi-task learning method based on multi-modal data is needed.
Disclosure of Invention
Therefore, an object of the embodiments of the present invention is to overcome the above-mentioned drawbacks of the prior art and provide a multi-task learning method and system based on multi-modal data, which fully utilize the complementarity between multi-modal data to complete multi-task learning quickly and accurately while remaining lightweight.
The above purpose is realized by the following technical scheme:
according to a first aspect of the present invention, a multi-modal data-based multi-modal learning method is provided, which includes obtaining multi-modal data for completing a plurality of tasks, grouping the multi-modal data according to modal categories, and performing spatiotemporal alignment; extracting hard sharing features of data in each mode from the aligned multi-mode data, obtaining soft sharing features related to different tasks in each mode based on the hard sharing features, obtaining non-sharing features related to different tasks in each mode based on the soft sharing features, and obtaining data features corresponding to each task in the plurality of tasks based on the non-sharing features; executing the plurality of tasks based on the corresponding data features and calculating a loss function for each of the plurality of tasks; and calculating an overall loss function of the multi-task learning method by using uncertainty loss weighting so as to train the multi-task learning method.
In some embodiments of the present invention, the method further comprises: when a task depends on single-modality data, extracting the unshared features from the single-modality data as the data features corresponding to the task; and when the task depends on multi-modal data, performing feature fusion on the unshared features extracted from each modality of the dependent multi-modal data, the fused unshared features being the data features corresponding to the task.
In some embodiments of the invention, when the task depends on multi-modal data, feature-fusing the unshared features extracted from each modality of the dependent multi-modal data using an attention mechanism comprises: splicing the unshared features extracted from each modality of the dependent multi-modal data and transforming them to generate a feature mask; and fusing the unshared features based on the feature mask.
In some embodiments of the invention, the multi-task learning method is trained by minimizing the loss.
In some embodiments of the invention, the multi-task learning method is trained with cross entropy and/or regression losses.
In some embodiments of the present invention, the soft shared features are extracted using an L2-norm constraint or a cross-stitch network.
In some embodiments of the invention, calculating the overall loss function of the multi-task learning method using uncertainty loss weighting comprises: adjusting the loss function of each of the plurality of tasks based on the output standard deviation of that task; and summing the adjusted loss functions of all the tasks to obtain the overall loss function of the multi-task learning method.
According to a second aspect of the present invention, there is also provided a multi-task learning system based on multi-modal data, comprising: an interface module for obtaining multi-modal data for completing a plurality of tasks, grouping the multi-modal data by modality type, and performing spatio-temporal alignment; a feature extraction module for extracting hard shared features of the data in each modality from the aligned multi-modal data, obtaining soft shared features related to different tasks in each modality based on the hard shared features, obtaining unshared features related to different tasks in each modality based on the soft shared features, and obtaining data features corresponding to each of the plurality of tasks based on the unshared features; a loss calculation module for completing the plurality of tasks based on the corresponding data features, calculating a loss function for each of the plurality of tasks, and calculating the overall loss function of the multi-task learning method using uncertainty loss weighting; and a training module for training the multi-task learning method based on the overall loss function.
According to a third aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which, when executed, performs the method of the first aspect of the embodiments described above.
According to a fourth aspect of the present invention, there is provided an electronic device comprising a processor and a memory, the memory having stored therein a computer program which, when executed by the processor, implements the method as described in the first aspect of the embodiments above.
According to the technical scheme of the invention, the complementarity among multi-modal data is fully utilized, the redundancy of data across modalities is eliminated, multi-task learning is completed quickly and accurately, and a lightweight design is achieved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 illustrates a flow diagram of a method of multi-modal data-based multi-task learning, according to one embodiment of the invention;
FIG. 2 illustrates a flow diagram for extracting data features corresponding to each task from multi-modal data, according to one embodiment of the invention;
FIG. 3 illustrates a schematic diagram of extracting data features corresponding to each task from multi-modal data, according to one embodiment of the invention;
FIG. 4 shows a schematic diagram of extracting hard shared features according to one embodiment of the invention;
FIG. 5 shows a schematic diagram of extracting soft-shared features according to one embodiment of the invention;
FIG. 6 illustrates a schematic diagram of feature fusion of unshared features of a dependent multi-modal data task, according to one embodiment of the invention;
FIG. 7 illustrates a schematic diagram of an uncertainty loss weighting mechanism in accordance with one embodiment of the present invention;
FIG. 8 illustrates a structural schematic of a multi-modal data based multi-task learning system, according to one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Existing multi-task learning generally adopts a single learning technique and single-modality input data, so the correlation among tasks and the complementarity of data of different modalities cannot be fully utilized, and no multi-modal data fusion is performed. The invention exploits the characteristics of multi-task learning by combining hard parameter sharing, soft parameter sharing, multi-task loss weighting, and other methods: data features corresponding to each task are extracted from multi-modal input data, the tasks are executed, a loss function is calculated for each task, and the multi-task learning method is trained through the overall loss obtained with an uncertainty loss weighting technique.
FIG. 1 shows a flow diagram of a method for multi-modal data-based multi-task learning, in accordance with an embodiment of the invention. As shown in fig. 1, the method includes:
s110, multi-mode data used for completing a plurality of tasks are obtained, the multi-mode data are grouped according to mode types, and space-time alignment is carried out.
Data from various sources are organized into different input streams according to modality type, and the input data of the different modalities must satisfy certain spatio-temporal alignment requirements. For example, suppose a multi-task learning setup must complete three tasks: task one, image semantic segmentation; task two, 3D object detection; and task three, point cloud semantic segmentation. The input data include two modalities, image data and point cloud data, where task one depends on the image data, task two depends on both the image data and the point cloud data, and task three depends on the point cloud data. The image data and the point cloud data are first aligned in time, e.g., temporal alignment of the two modalities can be guaranteed with timestamps; the spatial coordinates of the image data and the point cloud data are then aligned. In one embodiment, a frustum corresponding to the point cloud data may be cut out based on the image data to ensure spatial alignment: the camera's intrinsic matrix, its extrinsic matrix, and the coordinate transformation between the camera and the lidar are acquired, a viewing frustum is determined from the camera, the frustum is projected into the lidar coordinate system, and only the lidar points inside the frustum are kept, thereby achieving spatial alignment of the image data and the point cloud data.
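A minimal sketch of this frustum-based spatial alignment follows, assuming numpy, a 3x3 intrinsic matrix K, a 4x4 lidar-to-camera transform T_cam_from_lidar, and an n-by-3 array of lidar points; the function name and matrix names are illustrative assumptions, not from the patent:

    import numpy as np

    def points_in_frustum(points_lidar, K, T_cam_from_lidar, img_w, img_h):
        """Keep only the lidar points whose projection falls inside the image."""
        n = points_lidar.shape[0]
        pts_h = np.hstack([points_lidar, np.ones((n, 1))])   # (n, 4) homogeneous coordinates
        pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]      # lidar frame -> camera frame
        in_front = pts_cam[:, 2] > 0                         # discard points behind the camera
        uv = (K @ pts_cam.T).T                               # perspective projection
        uv = uv[:, :2] / uv[:, 2:3]                          # normalize to pixel coordinates
        in_img = ((uv[:, 0] >= 0) & (uv[:, 0] < img_w) &
                  (uv[:, 1] >= 0) & (uv[:, 1] < img_h))
        return points_lidar[in_front & in_img]

The retained points are exactly those inside the camera's viewing frustum, which realizes the spatial alignment between the two modalities.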
S120, hard shared features of the data in each modality are extracted from the aligned multi-modal data, soft shared features related to different tasks in each modality are obtained based on the hard shared features, unshared features related to different tasks in each modality are obtained based on the soft shared features, and the data features corresponding to each of the plurality of tasks are obtained based on the unshared features.
For clarity, the two parameter sharing mechanisms are first described. Under hard parameter sharing, multiple tasks use one and the same feature extraction network, which extracts identical features for the different tasks, namely the hard shared features. Under soft parameter sharing, each task has its own feature extraction sub-network, but the sub-networks of different tasks constrain one another, so the features extracted by the different sub-networks are correlated, namely the soft shared features. Features extracted without any sharing mechanism are unshared features.
In the invention, based on the grouped and aligned data, the hard sharing and soft sharing mechanisms are applied within each modality: the hard shared, soft shared, and unshared features of each modality's data are extracted in sequence from the aligned multi-modal data, and the data features corresponding to each task are obtained from the unshared features. In this process, the degree of feature sharing between the different tasks is gradually reduced.
FIG. 2 illustrates a flow diagram for extracting data features corresponding to each task from multi-modal data, in accordance with an embodiment of the present invention. Fig. 3 shows a schematic diagram of extracting data features corresponding to each task from multi-modal data according to an embodiment of the present invention. As shown in fig. 2, step S120 includes the steps of:
and S210, extracting hard sharing characteristics aiming at the input data of the same modality.
FIG. 4 shows a schematic diagram of extracting hard shared features according to one embodiment of the invention. As shown in fig. 4, the input data of each modality is fed to a corresponding neural network that extracts its hard shared features, which can then be shared among the different tasks within that modality. For example, for image data, a 2D convolutional network may be employed to extract hard shared features for the image semantic segmentation task (task one) and the 3D object detection task (task two); for point cloud data, a PointNet network, or a voxel feature encoding (VFE) layer combined with a 3D convolutional network, may be adopted to extract hard shared features for the 3D object detection task and the point cloud semantic segmentation task.
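As a minimal sketch (PyTorch is assumed for illustration; the class name and layer sizes are invented), a hard-shared backbone for the image modality could look like this, with its single output feature map feeding every image-dependent task:

    import torch.nn as nn

    class HardSharedImageBackbone(nn.Module):
        """One backbone per modality; its parameters are hard-shared by all tasks."""
        def __init__(self, in_ch=3, out_ch=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, out_ch, 3, padding=1), nn.ReLU(),
            )

        def forward(self, img):
            # The same feature map is handed to every task head of this modality.
            return self.net(img)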
S220, soft shared features related to the different tasks within the same modality are extracted based on the hard shared features.
Based on the obtained hard shared features, soft shared features are extracted separately for the different tasks within each modality. The soft shared features related to different tasks in the same modality are not independent of one another; they constrain one another according to the correlations between the tasks.
FIG. 5 shows a schematic diagram of extracting soft shared features according to one embodiment of the invention. As shown in fig. 5, the different tasks in the same modality extract their soft shared features with their own neural networks. For example, for image data, two different convolutional sub-networks can extract the soft shared features of the image semantic segmentation task (task one) and the 3D object detection task (task two); both sub-networks take the hard shared features of the image data as input, and each outputs the soft shared features of its own task. Similarly, for point cloud data, two different PointNet sub-networks can extract the soft shared features of the 3D object detection task (task two) and the point cloud semantic segmentation task (task three); both take the hard shared features of the point cloud data as input, and each outputs the soft shared features of its own task.
In order to make soft sharing features within the same modality mutually constrained based on correlations between tasks, the neural networks employed by different tasks within a modality may be constrained. In one embodiment, the L2 norm may be used to achieve similarity of parameters in feature extraction networks for different tasks, thereby constraining the extracted soft shared features to each other. For example, the L2 distance of the parameter vectors of the feature extraction networks corresponding to different tasks within the same modality can be used as an optimization goal in the training model stage, and the goal is minimized so that the features extracted by different feature extraction networks have similarity to obtain soft shared features of different tasks. In another embodiment, a cross-stitch network may also be used to constrain network similarity, for example, a "cross-stitch" unit may be added between feature layers of two networks to enable the network to automatically learn features that need to be shared.
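The two constraint options just described can be sketched as follows (PyTorch assumed; l2_soft_sharing_penalty and CrossStitchUnit are illustrative names, the two sub-networks are assumed to have identical architectures, and the cross-stitch unit uses a simplified scalar 2x2 mixing):

    import torch
    import torch.nn as nn

    def l2_soft_sharing_penalty(net_a, net_b):
        """Squared L2 distance between the parameter vectors of two task
        sub-networks; added to the training loss with a small coefficient."""
        return sum(torch.sum((pa - pb) ** 2)
                   for pa, pb in zip(net_a.parameters(), net_b.parameters()))

    class CrossStitchUnit(nn.Module):
        """Learned mixing of two tasks' features, inserted between their feature layers."""
        def __init__(self):
            super().__init__()
            # Initialized near the identity: each task mostly keeps its own features.
            self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1], [0.1, 0.9]]))

        def forward(self, xa, xb):
            ya = self.alpha[0, 0] * xa + self.alpha[0, 1] * xb
            yb = self.alpha[1, 0] * xa + self.alpha[1, 1] * xb
            return ya, yb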
S230, unshared features related to the different tasks within the same modality are extracted based on the soft shared features.
Within each modality, unshared features related to the different tasks are extracted from the soft shared features, i.e., features matched to each task, which guarantees the task relevance of the features. At this point no feature sharing takes place between tasks: the feature extraction network of each task takes that task's soft shared features as input and outputs its task-related unshared features. For example, in the image data, two different convolutional sub-networks extract the unshared features of the image semantic segmentation task (task one) and the 3D object detection task (task two); each sub-network takes its own soft shared features as input and outputs the image unshared features related to its task. In the point cloud data, two different PointNet sub-networks extract the unshared features of the 3D object detection task (task two) and the point cloud semantic segmentation task (task three); each takes its own soft shared features as input and outputs the point cloud unshared features related to its task.
S240, the data features corresponding to each of the plurality of tasks are obtained based on the unshared features.
When a task depends on single-modality data, the unshared features extracted from that modality's data serve as the data features corresponding to the task; when a task depends on multi-modal data, the unshared features extracted from each modality of the dependent multi-modal data are feature-fused, and the fused unshared features are the data features corresponding to the task.
FIG. 6 illustrates feature fusion of the unshared features of a task that depends on multi-modal data, according to an embodiment of the invention. As shown in fig. 6, an attention mechanism may be adopted to balance the influence of the unshared features of the different modalities: the unshared features are spliced and transformed to generate a feature mask, and the mask is then used to adjust the fused features. For example, let x1 be the feature vector of the unshared feature of modality one and x2 be that of modality two. The two vectors are spliced to obtain x3 = concat(x1, x2), where concat denotes the splicing operation; the spliced vector is transformed to obtain x4 = g(x3), where g denotes the feature transformation network; a feature mask m = attention(x3) is generated from the spliced vector, where attention denotes the network used to generate the mask; and the mask adjusts the transformed vector to yield the fused feature vector x5 = m ⊙ x4, where ⊙ denotes the element-wise (dot) product.
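The fusion just described can be written as a small module; PyTorch, the linear layers, and the sigmoid mask generator are assumptions for illustration and follow the x3/x4/m/x5 notation above:

    import torch
    import torch.nn as nn

    class AttentionFusion(nn.Module):
        def __init__(self, dim1, dim2, out_dim):
            super().__init__()
            self.g = nn.Linear(dim1 + dim2, out_dim)          # feature transformation network g
            self.attention = nn.Sequential(                   # mask-generating network
                nn.Linear(dim1 + dim2, out_dim), nn.Sigmoid())

        def forward(self, x1, x2):
            x3 = torch.cat([x1, x2], dim=-1)  # x3 = concat(x1, x2)
            x4 = self.g(x3)                   # x4 = g(x3)
            m = self.attention(x3)            # m = attention(x3)
            return m * x4                     # x5 = m ⊙ x4 (element-wise product)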
S130, the plurality of tasks are executed based on the corresponding data features, and a loss function is calculated for each of the plurality of tasks.
The difference between the predicted value and the true value of a sample is called the loss, and the function used to calculate it is called the loss function. In the training phase, the parameters of the model are trained by minimizing the loss.
In one embodiment, for a task that relies only on single-modality data, cross entropy may be employed to compute the task's loss function. For example, the image semantic segmentation task (task one) relies only on image data, and a cross entropy loss can be used to measure its performance; the point cloud semantic segmentation task (task three) relies only on point cloud data and can likewise be measured with a cross entropy loss. The cross entropy loss measures the difference between probability distributions, here between the predicted distribution and the true distribution over the class to which each pixel in the image (or each point in the point cloud) belongs.
In one embodiment, for a task that depends on multi-modal data, the unshared features obtained in the different modalities are feature-fused to give the task's data features, and the loss is calculated after the task is executed on those features. For example, the 3D object detection task (task two) depends on two modalities, image and point cloud; unshared features are extracted for it in both modalities, the two sets of unshared features are fused, prediction is performed, and the loss is then calculated. Here the loss comprises two parts: a cross entropy loss for target classification (the object class of each 3D detection box) and a regression loss for localization (the position and size of each 3D detection box). The regression loss measures the difference between the predicted and true values of continuous variables and can be implemented as, e.g., a squared loss or an absolute-value loss.
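For the running three-task example, the per-task losses might be computed as in the following sketch (PyTorch's functional API assumed; smooth L1 is one possible regression loss among the squared and absolute-value losses mentioned above, and all tensor arguments are illustrative):

    import torch.nn.functional as F

    def task_losses(seg_logits, seg_gt, det_cls_logits, det_cls_gt,
                    det_box_pred, det_box_gt, pc_logits, pc_gt):
        l_seg = F.cross_entropy(seg_logits, seg_gt)            # task one: per-pixel classes
        l_det = (F.cross_entropy(det_cls_logits, det_cls_gt)   # task two: box classification
                 + F.smooth_l1_loss(det_box_pred, det_box_gt)) # task two: box regression
        l_pc = F.cross_entropy(pc_logits, pc_gt)               # task three: per-point classes
        return l_seg, l_det, l_pc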
S140, the overall loss function of the multi-task learning method is calculated using uncertainty loss weighting, so as to train the multi-task learning method.
For the loss obtained by each task, an uncertainty loss weighting mechanism is adopted to calculate the overall loss function of the multi-task learning method, so that the method can dynamically adapt to changes in the relative performance of the tasks during training. For example, task one may at some stage perform far better than task two; the adjusted weighted loss then makes task two more important, emphasizing its optimization and reducing the attention paid to task one, so as to achieve better overall performance. For the loss function L_i of task i, assume that task i has an uncertainty (i.e., output standard deviation) σ_i that satisfies:

σ_i = sqrt(exp(w_i))    (1)

where w_i is a learnable parameter, exp is the exponential function, and sqrt is the square root function. The adjusted loss function is then:

L_i' = L_i / (2σ_i²) + log(σ_i) = exp(-w_i) · L_i / 2 + w_i / 2    (2)

Finally, all the adjusted loss functions are summed to obtain the overall loss function:

L = Σ_i L_i'    (3)
by introducing the uncertainty of the task (i.e., the output standard deviation) into the loss function, the overall likelihood function for multiple tasks can be minimized.
FIG. 7 illustrates a schematic diagram of the uncertainty loss weighting mechanism according to an embodiment of the present invention. As shown in FIG. 7, the loss function of task one is adjusted through its uncertainty σ_1, the loss function of task two through its uncertainty σ_2, and the loss function of task three through its uncertainty σ_3; finally, all the adjusted loss functions are summed to obtain the overall loss function L.
Different tasks in multi-task learning can thus be balanced through this uncertainty-based loss weighting mechanism.
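Putting equations (1) to (3) together, a minimal sketch of such an uncertainty weighting module might look as follows (PyTorch assumed; the class name, one learnable w_i per task, and the zero initialization of w_i, i.e., σ_i = 1, are illustrative choices; the w_i are optimized jointly with the network weights):

    import torch
    import torch.nn as nn

    class UncertaintyWeighting(nn.Module):
        def __init__(self, num_tasks):
            super().__init__()
            self.w = nn.Parameter(torch.zeros(num_tasks))  # w_i of equation (1)

        def forward(self, losses):
            total = 0.0
            for i, l_i in enumerate(losses):
                # Equation (2): L_i' = exp(-w_i) * L_i / 2 + w_i / 2
                total = total + 0.5 * torch.exp(-self.w[i]) * l_i + 0.5 * self.w[i]
            return total  # equation (3): L = Σ_i L_i'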
In another aspect, the present invention further provides a multi-modal data-based multi-task learning system, which is used for executing the multi-modal data-based multi-task learning method.
Fig. 8 is a schematic structural diagram of a multi-modal data-based multi-task learning system according to an embodiment of the present invention. As shown in fig. 8, the system 800 includes an interface module 801, a feature extraction module 802, a loss calculation module 803, and a training module 804. Although the block diagrams depict components in a functionally separate manner, such depiction is for illustrative purposes only. The components shown in the figures may be arbitrarily combined or separated into separate software, firmware, and/or hardware components. Moreover, regardless of how such components are combined or divided, they may execute on the same computing device or multiple computing devices, which may be connected by one or more networks.
The interface module 801 is configured to obtain multi-modal data that completes a plurality of tasks, group the multi-modal data according to modal types, and perform spatio-temporal alignment. The feature extraction module 802 extracts hard shared features of data in each modality from the aligned multi-modality data as described above, obtains soft shared features related to different tasks in each modality based on the hard shared features, obtains unshared features related to different tasks in each modality based on the soft shared features, and obtains data features corresponding to each of the plurality of tasks based on the unshared features. And the loss calculation module 803 is configured to complete a plurality of tasks based on the corresponding data features, calculate a loss function of each task in the plurality of tasks, and calculate an overall loss function of the multi-task learning method by using uncertainty loss weighting. The training module 804 is used to train the multi-task learning method based on the overall loss function as described above.
In another embodiment of the present invention, a computer-readable storage medium is further provided, on which a computer program or executable instructions are stored, and when the computer program or the executable instructions are executed, the technical solution as described in the foregoing embodiments is implemented, and the implementation principle thereof is similar, and is not described herein again. In embodiments of the present invention, the computer readable storage medium may be any tangible medium that can store data and that can be read by a computing device. Examples of computer readable storage media include hard disk drives, Network Attached Storage (NAS), read-only memory, random-access memory, CD-ROMs, CD-R, CD-RWs, magnetic tapes, and other optical or non-optical data storage devices. The computer readable storage medium may also include computer readable media distributed over a network coupled computer system so that computer programs or instructions may be stored and executed in a distributed fashion.
Reference in the specification to "various embodiments," "some embodiments," "one embodiment," or "an embodiment," etc., means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases "in various embodiments," "in some embodiments," "in one embodiment," or "in an embodiment," or the like, in various places throughout this specification are not necessarily referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Thus, a particular feature, structure, or characteristic illustrated or described in connection with one embodiment may be combined, in whole or in part, with a feature, structure, or characteristic of one or more other embodiments without limitation, as long as the combination is not logical or operational.
The terms "comprises," "comprising," and "having," and similar referents in this specification, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The word "a" or "an" does not exclude a plurality. Additionally, the various elements of the drawings of the present application are merely schematic illustrations and are not drawn to scale.
Although the present invention has been described by the above embodiments, the present invention is not limited to the embodiments described herein, and various changes and modifications may be made without departing from the scope of the present invention.

Claims (10)

1. A multi-modal data-based multi-task learning method, comprising:
obtaining multi-modal data for completing a plurality of tasks, grouping the multi-modal data by modality type, and performing spatio-temporal alignment;
extracting hard shared features of the data in each modality from the aligned multi-modal data, obtaining soft shared features related to different tasks in each modality based on the hard shared features, obtaining unshared features related to different tasks in each modality based on the soft shared features, and obtaining data features corresponding to each of the plurality of tasks based on the unshared features;
executing the plurality of tasks based on the corresponding data features and calculating a loss function for each of the plurality of tasks;
and calculating an overall loss function of the multi-task learning method by using uncertainty loss weighting so as to train the multi-task learning method.
2. The method of claim 1, further comprising,
when a task depends on single-modality data, the unshared features extracted from the single-modality data are the data features corresponding to the task;
and when the task depends on multi-modal data, performing feature fusion on the unshared features extracted from each modality of the dependent multi-modal data, the fused unshared features being the data features corresponding to the task.
3. The method of claim 2, wherein, when the task depends on multi-modal data, feature-fusing the unshared features extracted from each modality of the dependent multi-modal data using an attention mechanism comprises:
splicing the unshared features extracted from each modality of the dependent multi-modal data, and transforming them to generate a feature mask;
fusing the unshared features based on the feature mask.
4. The method of claim 3, wherein the multitask learning method is trained by minimizing losses.
5. The method of claim 4, wherein the multi-task learning method is trained by cross entropy and/or regression loss.
6. The method of claim 1, wherein the soft shared features are extracted using an L2-norm constraint or a cross-stitch network.
7. The method of claim 1, wherein said calculating an overall loss function for the multi-tasking learning method using uncertainty loss weighting comprises:
adjusting the loss function of each of the plurality of tasks based on the output standard deviation of that task;
and summing the adjusted loss functions of each task to obtain the overall loss function of the multi-task learning method.
8. A multi-modal data-based multi-task learning system, comprising:
the interface module is used for acquiring multi-modal data for completing a plurality of tasks, grouping the multi-modal data by modality type and performing spatio-temporal alignment;
a feature extraction module, configured to extract a hard shared feature of data in each modality from the aligned multi-modality data, obtain a soft shared feature related to different tasks in each modality based on the hard shared feature, obtain an unshared feature related to different tasks in each modality based on the soft shared feature, and obtain a data feature corresponding to each task of the multiple tasks based on the unshared feature;
the loss calculation module is used for completing the tasks based on the corresponding data characteristics, calculating a loss function of each task in the tasks, and calculating an overall loss function of the multi-task learning method by using uncertainty loss weighting; and
a training module for training the multi-task learning method based on the global loss function.
9. A storage medium in which a computer program is stored which, when being executed by a processor, is operative to carry out the method of any one of claims 1-8.
10. An electronic device comprising a processor and a memory, the memory having stored therein a computer program which, when executed by the processor, is operable to carry out the method of any one of claims 1-8.
CN202011116788.4A 2020-10-19 2020-10-19 Multi-modal data-based multi-task learning method and system Pending CN112183547A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011116788.4A CN112183547A (en) 2020-10-19 2020-10-19 Multi-modal data-based multi-task learning method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011116788.4A CN112183547A (en) 2020-10-19 2020-10-19 Multi-modal data-based multi-task learning method and system

Publications (1)

Publication Number Publication Date
CN112183547A 2021-01-05

Family

ID=73949733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011116788.4A Pending CN112183547A (en) 2020-10-19 2020-10-19 Multi-modal data-based multi-task learning method and system

Country Status (1)

Country Link
CN (1) CN112183547A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543031A (en) * 2018-10-16 2019-03-29 华南理工大学 A kind of file classification method based on multitask confrontation study
CN110209817A (en) * 2019-05-31 2019-09-06 安徽省泰岳祥升软件有限公司 Training method and device of text processing model and text processing method
CN110766653A (en) * 2019-09-06 2020-02-07 上海联影智能医疗科技有限公司 Image segmentation method and device, computer equipment and storage medium
CN110930367A (en) * 2019-10-31 2020-03-27 上海交通大学 Multi-modal ultrasound image classification method and breast cancer diagnosis device
CN111160268A (en) * 2019-12-30 2020-05-15 北京化工大学 Multi-angle SAR target recognition method based on multi-task learning
CN111274977A (en) * 2020-01-22 2020-06-12 中能国际建筑投资集团有限公司 Multitask convolution neural network model, using method, device and storage medium
CN111246256A (en) * 2020-02-21 2020-06-05 华南理工大学 Video recommendation method based on multi-mode video content and multi-task learning
CN111696119A (en) * 2020-06-02 2020-09-22 广州柏视医疗科技有限公司 Nuclear magnetic resonance image hippocampus segmentation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
高利剑; 毛启容: "Environment-assisted multi-task hybrid sound event detection method" (环境辅助的多任务混合声音事件检测方法), Computer Science (计算机科学), no. 01 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990307A (en) * 2021-03-12 2021-06-18 中国科学院自动化研究所 Multi-spectral-band glioma image classification system based on network architecture search
CN112990307B (en) * 2021-03-12 2023-01-13 中国科学院自动化研究所 Multi-spectral-band glioma image classification system based on network architecture search
CN112966644A (en) * 2021-03-24 2021-06-15 中国科学院计算技术研究所 Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof
CN113378855A (en) * 2021-06-22 2021-09-10 北京百度网讯科技有限公司 Method for processing multitask, related device and computer program product
CN114840327A (en) * 2022-06-29 2022-08-02 阿里巴巴达摩院(杭州)科技有限公司 Multi-mode multi-task processing method, device and system
CN114840327B (en) * 2022-06-29 2022-12-02 阿里巴巴达摩院(杭州)科技有限公司 Multi-mode multi-task processing method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination