WO2024007105A1

WO2024007105A1 - Method and apparatus for continual learning of tasks

Info

Publication number: WO2024007105A1
Application number: PCT/CN2022/103595
Authority: WO
Inventors: Jun Zhu; Qian Li; Liyuan Wang; Xingxing Zhang; Yi ZHONG; Jianing Huang
Original assignee: Robert Bosch Gmbh; Tsinghua University
Priority date: 2022-07-04
Filing date: 2022-07-04
Publication date: 2024-01-11

Abstract

A computer implemented method for continual learning of a plurality of tasks, comprising: receiving input data by a plurality of continual learning sub-networks in parallel, the input data being related to one task of the plurality of tasks, the number of the plurality of continual learning sub-networks being fixed and irrespective to the number of the plurality of tasks; generating a plurality of feature representations respectively by the plurality of continual learning sub-networks based on the input data; generating a prediction related to the one task by a feature ensemble sub-network based on the plurality of feature representations; generating a continual learning loss value based on the prediction related to the one task and information related to tasks already learned until the one task; and updating learnable parameters of the plurality of continual learning sub-networks and the feature ensemble sub-network based on the continual learning loss value.

Description

METHOD AND APPARATUS FOR CONTINUAL LEARNING OF TASKS

FIELD

Aspects of the present disclosure relate generally to artificial intelligence (AI) , and more particularly, to continual learning of a plurality of tasks.

BACKGROUND

Continual learning is a branch of machine learning that targes the sequential learning of tasks with the objective of learning new problems while not forgetting previously learned tasks. The ability to incrementally learn a sequence of tasks is critical for artificial neural networks. Since the training data distribution is highly dynamic, the network needs to carefully trade-off the learning plasticity and memory stability. In general, excessive plasticity in learning new tasks leads to the catastrophic forgetting of old tasks, while excessive stability in remembering old tasks limits the learning of new tasks.

Efforts in continual learning either learn all tasks with a single model, which has to sacrifice the performance of each task to find a shared solution, or allocate a dedicated parameter subspace for each task to overcome their mutual interference, which usually lacks scalability. Recent work observed that a wider network usually suffers from less catastrophic forgetting, while different components such as batch normalization, skip connections and pooling layers play various roles. Thus, the choice of architecture for effective continual learning remains an open question. It would be desirable to improve the performance of continual learning with improved architecture.

SUMMARY

In order to improve the performance of continual learning, the disclosure proposes a novel framework for continual learning of a sequence of tasks.

According to an embodiment, there provides a computer implemented method for continual learning of a plurality of tasks. The method comprises: receiving input data by a plurality of continual learning sub-networks in parallel, the input data being related to one task of the plurality of tasks, the number of the plurality of continual learning sub-networks being fixed and irrespective to the number of the plurality of tasks; generating a plurality of feature representations respectively by the plurality of continual learning sub-networks based on the input data; generating a prediction related to the one task by a feature ensemble sub-network based on the plurality of feature representations; generating a continual learning loss value based on the prediction related to the one task and information related to tasks already learned until the one task; and updating learnable parameters of the plurality of continual learning sub-networks and the feature ensemble sub-network based on the continual learning loss value.

According to an embodiment, there provides a computer implemented method for performing a task using a trained model. The method comprises: receiving input data by a plurality of continual learning sub-networks of the model in parallel, the input data being related to the task of a plurality of tasks, the number of the plurality of continual learning sub-networks being fixed and irrespective to the number of the plurality of tasks; generating a plurality of feature representations respectively by the plurality of continual learning sub-networks based on the input data; generating a prediction related to the task by a feature ensemble sub-network of the model based on the plurality of feature representations.

According to an embodiment, there provides a computer system, which comprises one or more processors and one or more storage devices storing computer-executable instructions that, when executed, cause the one or more processors to perform the operations of the method as mentioned above as well as to perform the operations of the method according to aspects of the disclosure.

According to an embodiment, there provides one or more computer readable storage media storing computer-executable instructions that, when executed, cause one or more processors to perform the operations of the method as mentioned above as well as to perform the operations of the method according to aspects of the disclosure.

According to an embodiment, there provides a computer program product comprising computer-executable instructions that, when executed, cause one or more processors to perform the operations of the method as mentioned above as well as to perform the operations of the method according to aspects of the disclosure.

The proposed architecture with a fixed number of narrower sub-networks to learn incremental tasks in parallel can naturally reduce the generalization errors of learning plasticity and memory stability in continual learning through improving the discrepancy between tasks, the flatness of a solution and the covers of parameter spaces, which are related to the upper bound of the generalization errors. Other advantages and enhancements are explained in the description hereafter.

BRIEF DESCRIPTION OF THE DRAWINGS

The disclosed aspects will hereinafter be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.

Fig. 1 illustrates an exemplary framework for continual learning of a sequence of tasks according to aspects of the disclosure.

Fig. 2A to 2C illustrates a conceptual model of influential factors according to aspects of the disclosure.

Fig. 3A to 3E each illustrates an exemplary framework for continual learning of a sequence of tasks according to aspects of the disclosure.

Fig. 4 illustrates exemplary performances of SCL as well as CoSCLs with different number of continual learners according to aspects of the disclosure.

Fig. 5 illustrates exemplary performances of SCL as well as CoSCLs with same number of continual learners according to aspects of the disclosure.

Fig. 6 illustrates performances of respective continual learners in CoSCL according to aspects of the disclosure.

Fig. 7 illustrates an estimation of H-divergence of respective features from small learners across tasks according to aspects of the disclosure.

Fig. 8 illustrates curvature of loss landscape for learned solution according to aspects of the disclosure.

Fig. 9 illustrates an exemplary process for continual learning of a plurality of tasks according to aspects of the disclosure.

Fig. 10 illustrates an exemplary process for performing a task using a trained model according to aspects of the disclosure.

Fig. 11 illustrates an exemplary computing system according to aspects of the disclosure.

DETAILED DESCRIPTION

The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.

Various embodiments will be described in detail with reference to the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. References made to particular examples and embodiments are for illustrative purposes, and are not intended to limit the scope of the disclosure.

For sake of illustration, the input data 110 may be exampled as image data and the tasks may be exampled as image classification tasks. It is appreciated that aspects of the disclosure can be applied in various applications and scenarios in which various tasks are performed. For example, the input data may be image data, video data, graph data, gaming data, or text data. The tasks may be classification tasks such as classification based on images, videos, graphs and so on, may be image segmentation tasks, may be content generation tasks such as content generation based on graph data, text data, may be action generation tasks such as action generation based on video data, graph data, gaming data, or the like. And the prediction of the tasks may be a classification of the image data, the video data or the graph data, may be an image segmentation of the image data or the video data, may be a content or action generated based on the text data, gaming data, graph data or video data, or the like.

For example, aspects of the disclosure can be applied in an automatic driving system, an intelligent transportation system, an intelligent manufacturing system, industrial equipment system, an intelligent maintenance equipment system, a medical equipment system, a social network system, a financial network system, a website platform, or the like.

For example, the input data may be image data or video data obtained in one of an automatic driving system, an intelligent transportation system, an intelligent manufacturing system, industrial equipment system, an intelligent maintenance equipment system, a medical equipment system, or the like. It is appreciated that the image data or video data may be any form of images, such as luminance images, radar images, LiDAR images, ultrasonic images, motion images, thermal images, or the like. The plurality of tasks may be image classification tasks or image segmentation tasks. And the prediction of the tasks may be a classification or a segmented image portion related to the image data or video data. The image segmented image portion of the image segmentation tasks may be image portion related to the interested area of the image or video, such as the image portion corresponding to the pedestrian in the automatic driving system, the image portion related to the traffic condition in the intelligent transportation system, the image portion related to parameters such as an unqualified part of an product in the intelligent manufacturing system, industrial equipment system, an intelligent maintenance equipment system or the like, the image portion related to body tissue in the medical equipment system, or the like. The classification of the image classification tasks may be one of a plurality of classifications which are parameters for the operation of the systems. For example, the classifications for the automatic driving system may be classifications of two or more scenarios such as parking lot, urban road, expressway and so on. The classifications for intelligent transportation system may be classifications of light traffic, medium traffic, heavy traffic and so on. The classifications for an intelligent manufacturing system, industrial equipment system, or an intelligent maintenance equipment system may be classifications of unqualified, qualified and so on. The classifications for a medical equipment system may be classifications of normal condition, cautionary condition, bad condition, and so on.

For example, the input data may be video data obtained in one of an automatic driving system, an intelligent transportation system, an intelligent manufacturing system, industrial equipment system, an intelligent maintenance equipment system, and a medical equipment system, or the like. It is appreciated that the video data may be any form of images, such as luminance images, radar images, LiDAR images, ultrasonic images, motion images, thermal images, or the like. The plurality of tasks may be action generation tasks. The prediction of the tasks may be an action based on the video data. The action generated by the action generation tasks controls the operation of the systems. For example, the action for the automatic driving system may be an action for automatically controlling the vehicle such as braking, accelerating, parking and so on. The action for intelligent transportation system may be automatically controlling the transportation such as adjusting time length of traffic lights and so on. The action for an intelligent manufacturing system, industrial equipment system, or an intelligent maintenance equipment system may be picking out unqualified products or broken components, and so on. The action for a medical equipment system may be the guiding of physical examination process, and so on.

For example, the input data may be graph data obtained in one of a social network system, a financial network system, and a website platform, or the like. The plurality of tasks may be identification of abnormal account included in the graph data. The prediction of the tasks may be classification of the accounts. For example, the classification of the accounts may be classification of malicious account, normal account and so on.

The input data 110 may include training datasets for each of tasks 1 to T. For example, the training dataset for task 1 may include dog images and classification labels, the training dataset for task 2 may include cat images and classification labels, the training dataset for task 3 may include bird images and classification labels, and so on.

The continual learning neural network 120, which is also referred to as continual learning model 120, may be trained to performing the tasks 1 to T based on the respective training datasets for the tasks 1 to T. During the training or learning of the model 120, predictions 130 may be generated by the model 120 based on the input data 110, and an optimization objective value 140 may be generated based on the predictions 130, and the learnable parameters of the model 120 may be updated based on the objective value 140.

As illustrated in Fig. 1, a general setting of continual learning may be: a neural network 120 with parameter θ incrementally learns T tasks, and thus is referred to as a “continual learner” . The training set and test set of each task follow the same distribution

where the training set

includes N _t data-label pairs. For classification task, it may include one or more classes. After learning each task, the performance of all the tasks ever seen is evaluated on their test sets. Although training data set D _t is only available when learning task t, an ideal continual learner should behave as if training them jointly. To achieve this goal, it is critical to balance the learning plasticity of new tasks and memory stability of old tasks. Accordingly, the loss function for continual learning can be generally defined as

where L _t (·) is the task-specific loss for current learning task t, for example, it may be cross-entropy for supervised classification, and

is the loss related to the already learned tasks 1 to t-1, which is used to achieve an effective trade-off so as to allow the continual learner to learn incremental tasks without severe catastrophic forgetting.

Representative strategies for continual learning include weight regularization (e.g., Cha, S., Hsu, H., Hwang, T., Calmon, F.P., Moon, T.: Cpr: Classifier-projection regularization for continual learning. arXiv preprint arXiv: 2006.07326) , parameter isolation (e.g., Jung, S., Ahn, H., Cha, S., Moon, T.: Continual learning with node-importance based adaptive group sparse regularization. arXiv e-prints pp. arXiv–2003) and memory replay (e.g., Wang, L., Zhang, X., Yang, K., Yu, L., Li, C., Hong, L., Zhang, S., Li, Z., Zhong, Y., Zhu, J.: Memory replay with data compression for continual learning. arXiv preprint arXiv: 2202.06592) .

For example,

for weight regularization, where

denotes the learned parameters for the old tasks and I _1: t-1 indicates the “importance” of these parameters. For example, I _1: t-1 may be implemented based on Fisher information matrices F _1: t-1 of the respective old tasks, as detailed in e.g. Cha. For example,

for memory replay, where

is an approximation of D _k through storing a few old training samples or learning a generative model. For example, for parameter isolation,

are dynamically isolated as multiple task-specific subspaces

while

denotes the “free” parameters for current and future tasks. So

usually serves as a sparsity regularizer to save

It is appreciated that any continual learning strategies such as the existing continual learning strategies as exampled above and the upcoming continual learning strategies in the future may be applied in aspects of the disclosure.

The goal of continual learning is to find a solution θ of the learner 120 in a parameter space Θ that can generalize well over a set of distribution

and

Considering a bounded loss function l: y×y→ [0, c] , where y denotes a label space and c is the upper bound, such that l (y ₁, y ₂) =0 holds if and only if y ₁ = y ₂. Then, a population loss over the distribution

can be defined by

where f _θ (·) is the prediction of an input parameterized by θ. Likewise, the population loss over the distribution of old tasks is defined by

To minimize both

and

a continual learning model needs to minimize an empirical risk over the current training set D _t in a constrained parameter space, i.e.,

where

and the parameter space Θ depends on the previous experience carried by parameters or data, so as to prevent catastrophic forgetting over old tasks. Likewise,

denotes an empirical risk over the old tasks.

In practice, sequential learning of each task by minimizing the empirical loss can find multiple solutions, but provides significantly different generalizability on

and

i.e., different learning plasticity and memory stability. Several recent studies suggested that a flatter solution is more robust to catastrophic forgetting. To find such a flat solution, in an example, a robust empirical risk is defined by the worst case of the neighborhood in parameter space as

where b is the radius around θ and ∥·∥ denotes the L2 norm, likewise for the old tasks it is defined as

Then, solving the constrained robust empirical risk minimization, i.e.,

will find a near solution of a flat optimum showing better generalizability. In particular, the minimum found by the empirical loss

may also be the minimum of

only if the “radius” of its loss landscape is sufficiently wider than b. Intuitively, such a flat solution might mitigate catastrophic forgetting since it is more robust to parameter changes.

However, this approach is not sufficient. If a new task is too different from the old tasks, the parameter changes to learn it well might be much larger than the “radius” of the old minimum, resulting in catastrophic forgetting. On the other hand, staying around the old minimum is not a good solution for the new task, limiting the learning plasticity. In an aspect of the disclosure, let

and

denote the generalization errors of learning plasticity for a new task and memory stability for the old tasks respectively, the upper bounds of these two errors may be presented as follows.

Proposition 1. Let Θ be a cover of a parameter space with VC dimension d. If

are the distribution of continually learned tasks 1 to t, then for any δ∈(0, 1) with probability at least 1-δ, for every solution θ _1: t of the continually learned tasks 1 to t in parameter space Θ, i.e., θ _1: t∈ Θ:

where

is the

divergence for the distribution

and

where I (h) is the characteristic function.

is the total number of training samples over all old tasks.

It can be conjured from Proposition 1 that, the generalization errors of both learning plasticity and memory stability are directly bounded by the robust empirical risk and the distribution discrepancy between the new and old tasks, while minimizing the robust empirical risk (i.e., finding a flat solution) is directly related to mitigating the two errors. Further, by the optimal solution of robust empirical risk minimization, the generalization gaps over the new and old tasks are showed as follows.

Proposition 2. Let

denotes the optimal solution of the continually learned tasks 1 to t by robust empirical risk minimization over the new task, i.e.,

where Θ denotes a cover of a parameter space with VC dimension d. Then for any δ∈ (0, 1) with probability at least 1-δ:

where

and

is the

divergence for the distribution

and

where I (h) is the characteristic function.

It can be conjured from Proposition 1 and Proposition 1 that, the errors of learning plasticity and memory stability in continual learning both depend on three factors: (1) the discrepancy between tasks; (2) the flatness of a solution; and (3) the covers of parameter spaces.

Fig. 2A to 2C illustrates a conceptual model of the influence of the first two factors based on the above theoretical analysis. The labels “A” and “B” stands for task A and task B in the figures, it is appreciated that the learning order of task A and task B does not matter.

Fig. 2A illustrate an original solution for task A and task B conceptually, where

denotes the solution for task A having the minimum loss

denotes the solution for task B having the minimum loss

θ _A, B denotes the shared solution for task A and task B. The dashed line in Fig. 2B and Fig. 2C is the original solution in Fig. 2A. As illustrated in Fig. 2B, finding a flatter solution for tasks can decrease the generalization errors and generalization gaps of a shared solution θ _A, B. As illustrated in Fig. 2C, reducing the divergence between tasks can decrease the generalization errors and generalization gaps of a shared solution θ _A, B. It can be clearly seen that the solution for a task that is flatter or closer to the solution for other tasks can reduce the generalization errors and generalization gaps of a shared solution.

Although finding a flat minimum for continual learning draws increasing attentions in recent work, it is difficult to alleviate the other two factors by either using a shared set of parameters to learn all tasks or using isolated parameter subspace to learn each task. It would be desirable to improve the continual learning from an architecture perspective in consideration of the above mentioned three factors.

Fig. 3A illustrates an exemplary framework for continual learning of a sequence of tasks according to aspects of the disclosure.

For sake of illustration, the input data 310 may be exampled as image data and the tasks may be exampled image classification tasks. The input data 210 may include training datasets for each of tasks 1 to T. It is appreciated that aspects of the disclosure can be applied in various applications and scenarios in which various tasks are performed.

Different from learning all tasks with a single neural network, a plurality of sub-networks are employed in the architecture of Fig. 3A. For example, K sub-networks 320-1 to 320-K are employed as illustrated in Fig. 3A. the K sub-networks 320-1 to 320-K may be presented as

i = 1, ..., K. The K sub-networks 320-1 to 320-K receive the input data such as images in parallel and generate respective feature representations for the input images. In practice, the sub-networks 320-1 to 320-K are much narrower than that of a single continual learning model such as the model 120 shown in Fig. 1 in order to save parameters, so each sub-network 320 may be referred to as a small continual learner and the framework may be referred to as cooperative small continual learners (CoSCL) .

A feature ensemble sub-network 330 receives the respective feature representations generated by the sub-networks 320-1 to 320-K and obtains a prediction based on the feature representations. In an implementation, feature ensemble sub-network 330 may be implemented by a fully connected layer or fully connected network, which predicts the average of the respective feature representations. The feature ensemble sub-network 330 may be presented as

then the final prediction p is

Then the optimizable parameters are

The optimization objective or loss L _CoSCL may be determined through above equation (1) , that is,

Accordingly, the optimizable parameters

may be updated or optimized based on the loss

Fig. 3B illustrates an exemplary framework for continual learning of a sequence of tasks according to aspects of the disclosure.

Same labels are used for same or corresponding units in Fig. 3A and Fig. 3B, and the same or corresponding units in Fig. 3A and Fig. 3B are not described in detail.

In the embodiment of Fig. 3B, the predictions p ₁ …p _K made by the respective feature representations from the continual learners 320-1 to 320-K are leveraged to further shorten the divergence among the continual learners 320-1 to 320-K, so as to strengthen the advantage of feature ensemble. In an implementation, the respective prediction p _i corresponding to continual learners 320-i is

The additional cooperation of the continual learners 320-1 to 320-K is performed by penalizing the differences in predictions p ₁ …p _K related to the current task made by the respective feature representations from the continual learners 320-1 to 320-K. In an implementation, the widely-used Kullback Leibler (KL) divergence may be employed and an ensemble cooperation (EC) loss is defined as follows:

The optimization objective or loss L _CoSCL may be determined through above equation (1) and equation (9) , that is,

where γis a hyperparameter. Accordingly, the optimizable parameters

may be updated or optimized based on the loss

In an implementation, well known Adam method may be employed to update the parameters

Fig. 3C illustrates an exemplary framework for continual learning of a sequence of tasks according to aspects of the disclosure.

Same labels are used for same or corresponding units in Figs. 3A to 3C, and the same or corresponding units in Figs. 3A to 3C are not described in detail.

In the embodiment of Fig. 3C, to more effectively modulate the task relevance, a set of task-adaptive gates (TG) {g _t, i} are employed to weight the outputs of the small continual learners 320-1 to 320-K. Then the final prediction p is

Specifically, the gate output by the gate subnetwork 340 is defined as g _t, i=σ (s·α _t, i) for continual learner i to perform task t, where α _t, i is a learnable weight, s is a scale factor and σ (·) denotes the sigmoid function.

Then all the optimizable parameters are

Accordingly, the optimizable parameters

may be updated or optimized based on the loss

Fig. 3D illustrates an exemplary framework for continual learning of a sequence of tasks according to aspects of the disclosure.

Same labels are used for same or corresponding units in Figs. 3A to 3D, and the same or corresponding units in Figs. 3A to 3D are not described in detail.

In the embodiment of Fig. 3D, to more effectively modulate the task relevance, a set of task-adaptive gates (TG) {g _t, i} are employed to weight the outputs of the small continual learners 320-1 to 320-K. Then the final prediction p is

Specifically, the gate output by the gate subnetwork 340 is defined as g _t, i=σ (s·α _t, i) for continual learner i to perform task t, where α _t, i is a learnable weight, s is a scale factor and σ (·) denotes the sigmoid function. Then all the optimizable parameters are

And the predictions p ₁ …p _K made by the respective feature representations from the continual learners 320-1 to 320-K are leveraged to further shorten the divergence among the continual learners 320-1 to 320-K, so as to strengthen the advantage of feature ensemble. In an implementation, the respective prediction p _i corresponding to continual learners 320-i is

The additional cooperation of the continual learners 320-1 to 320-K is performed by penalizing the differences in predictions p ₁ …p _K related to the current task made by the respective feature representations from the continual learners 320-1 to 320-K. In an implementation, the KL divergence may be employed and an ensemble cooperation (EC) loss is defined as follows:

The optimization objective or loss L _CoSCL may be determined through above equation (1) and equation (10) , that is,

where γ is a hyperparameter. Accordingly, the optimizable parameters

may be updated or optimized based on the loss

Fig. 3E illustrates an exemplary framework for continual learning of a sequence of tasks according to aspects of the disclosure.

Same labels are used for same or corresponding units in Figs. 3A to 3E, and the same or corresponding units in Figs. 3A to 3E are not described in detail.

In the embodiment of Fig. 3E, the gate subnetwork 340 is implemented as a continual learning neural network. As illustrated, the gate sub-network receives the input date 310 and generates the plurality of gates g _t, i based on the input data 310. If the task labels are not provided in the training data set, the gate g _t, i or the above mentioned weights α _t, i of the gate can be inferred by the continual learner 340. In an implementation, the continual learner 340 includes a feature extractor and a fully-connection layer, and may be represented as

Then all the optimizable parameters are

where γ is a hyperparameter. Accordingly, the optimizable parameters

may be updated or optimized based on the loss

It is appreciated that, in another implementation, the parameters φ _g of the gate sub-network 340, along with the parameters of the continual learners 320-1 to 320-K, are included in the calculation of the continual learning loss L _CL in equation (1) .

It is appreciated that the gate subnetwork 340 implemented as a continual learner may also employed in the embodiments of Figs. 3A to 3D.

As illustrated in Figs. 3A to 3E, by designing an architecture with a fixed number of narrower sub-networks to learn incremental tasks in parallel, the two errors related to learning plasticity and memory stability can be naturally reduced through improving the discrepancy between tasks, the flatness of a solution, and the covers of parameter spaces. The discrepancy between tasks, the flatness of a solution, and the covers of parameter spaces are effectively improved through the feature ensemble, the ensemble cooperation and the task adaptive gating.

In an embodiment, the process of the structural optimization is shown in the following table 1, where EWC stands for elastic weight consolidation, which implements weight regularization stragety.

table 1

In exemplary experiments, visual classification tasks are performed based on CoSCL as illustrated in Figs. 3A to 3E. Four representative continual learning benchmarks are considered. The first two benchmarks are with CIFAR-100 dataset, which contains 100-class colored images of 500 training samples and 100 testing samples per class of the size 32 × 32. The 100 classes are randomly split (RS) into 20 incremental tasks or according to the super classes (SC) . The other two benchmarks are with larger-scale datasets and randomly split into 10 incremental tasks: CUB-200-2011 includes 200 classes and 11, 788 bird images of the size 224 × 224, and is split as 30 images per class for training while the rest for testing. Tiny-ImageNet is derived from iILSVRC-2012, consisting of 200-class natural images of the size 64 × 64.

As an implementation, a CNN architecture with 6 convolution layers and 2 fully connected layers may be for employed for the single continual learner (SCL) 120, which is taken as the baseline, and for the multiple small continual learners 320. Since CoSCL consists of multiple continual learners, a similar architecture and accordingly reduced network width (i.e., using fewer channels) are used to keep the total number of parameters comparable to the baseline, so as to make the comparison as fair as possible. Then, there is an intuitive trade-off between the number of learners and the width of sub-networks.

Fig. 4 illustrates exemplary performances of SCL as well as CoSCLs with different number of continual learners, where the total number of parameters are comparable for SCL as well as CoSCLs with different number of continual learners. The horizontal axis denotes the number of small continual learners of CoSCL, the vertical axis denotes the average accuracy of 20 tasks.

Fig. 5 illustrates the exemplary performances of SCL as well as CoSCLs with same number of continual learners. The “#” stands for the number of parameters, SCL stands for single continual learner as illustrated in Fig. 1, CE stands for classifier ensemble, FE stands for feature ensemble, TG stands for task-adaptive gates, and EC stands for ensemble cooperation. FE, FE+EC, FE+TG and FE+TG+EC are respectively illustrated in Figs. 3A to 3D. As illustrated in Fig. 4 and 5, the CoSCLs can improve the performance over the SCL with comparable number of parameters due to the fact that the CoSCLs can effectively improve the discrepancy between tasks, the flatness of a solution, and the covers of parameter spaces.

Fig. 6 illustrates the performances of respective continual learners in CoSCL. CL ID stands for the identification of continual learner 320, T ID stands for the identification of tasks, the gray level stands for the relative accuracy, the upper half and the bottom half in Fig. 6 are obtained based respectively on the training dataset CIFAR-100-SC and CIFAR-100-RS. In the illustrated example, five continual learners 320 are employ in the CoSCL to incrementally learn twenty tasks. As illustrated, the performance of respective continual learners in CoSCL varies significantly across tasks. The relative accuracy of each task is calculated by the performance of each continual learner minus their average, where an accuracy gap of about 10%to 20%exists between the best and the worst. The predictions made by the feature representations of each continual learner differ significantly across tasks and complement with each other, indicating that their solutions are highly diverse.

Fig. 7 illustrates an estimation of

divergence of the features from the small learners 320 across tasks. Specifically, a discriminator is trained to distinguish if the image features from the small learners 320 belong to a task or not, where a larger discrimination loss shown in the vertical axis of Fig. 7 indicates a smaller

divergence. As shown in Fig. 7, FE and EC can largely decrease the

-divergence while TG has a moderate benefit.

Fig. 8 illustrates curvature of loss landscape for the learned solution. T0 to T4 stands for task 0 to task 4, L stands for loss. As illustrated, FE and EC help the network converge to a flatter minimum. As illustrated in Figs. 4 to 8, cooperating multiple small continual learners with CoSCL helps to decrease the

-divergence between tasks in feature space and find a flat minimum, which can benefit learning plasticity and memory stability simultaneously.

Fig. 9 illustrates an exemplary process for continual learning of a plurality of tasks according to aspects of the disclosure. It is appreciated that the process may be implemented with computers or processors.

At block 910, input data are received by a plurality of continual learning sub-networks in parallel, the input data being related to one task of the plurality of tasks, the number of the plurality of continual learning sub-networks being fixed and irrespective to the number of the plurality of tasks.

At block 920, a plurality of feature representations are generated respectively by the plurality of continual learning sub-networks based on the input data.

At block 930, a prediction related to the one task is generated by a feature ensemble sub-network based on the plurality of feature representations.

At block 940, a continual learning loss value is generated based on the prediction related to the one task and information related to tasks already learned until the one task.

At block 950, learnable parameters of the plurality of continual learning sub-networks and the feature ensemble sub-network are updated based on the continual learning loss value.

In an embodiment, a plurality of predictions related to the one task are generated by the feature ensemble sub-network based respectively on the plurality of feature representations. An ensemble cooperation loss value is generated based on the plurality of predictions related to the one task. At block 950, the learnable parameters of the plurality of continual learning sub-networks and the feature ensemble sub-network are updated based on the continual learning loss value and the ensemble cooperation loss value.

In an embodiment, the plurality of feature representations are weighted with a plurality of gates output by a gate sub-network. At block 930, the prediction related to the one task is generated by the feature ensemble sub-network based on the plurality of weighted feature representations. At block 950, the learnable parameters of the plurality of continual learning sub-networks, the feature ensemble sub-network and the gate sub-network are updated based on the continual learning loss value.

In an embodiment, the plurality of feature representations are weighted with a plurality of gates generated by a gate sub-network. A plurality of predictions related to the one task are generated by the feature ensemble sub-network based respectively on the plurality of weighted feature representations, and an ensemble cooperation loss value is generated based on the plurality of predictions related to the one task. At block 930, the prediction related to the one task is generated by the feature ensemble sub-network based on the plurality of weighted feature representations. At block 950, the learnable parameters of the plurality of continual learning sub-networks, the feature ensemble sub-network and the gate sub-network are updated based on the continual learning loss value and the ensemble cooperation loss value.

In an embodiment, the gate sub-network comprises a set of vectors corresponding to the plurality of tasks and the plurality of continual learning sub-networks.

In an embodiment, the gate sub-network comprises a continual learning gate sub-network, wherein the continual learning gate sub-network receives the input date and generating the plurality of gates based on the input data.

In an embodiment, at block 940, the continual learning loss value is generated based on one of a weight regularization method, a memory replay method, and a parameter isolation method.

In an embodiment, the ensemble cooperation loss value represents divergence among the plurality of predictions related to the one task and corresponding respectively to plurality of continual learning sub-networks. The ensemble cooperation loss value is generated using Kullback Leibler (KL) divergence value based on the plurality of predictions related to the one task.

In an embodiment, the plurality of continual learning sub-networks have same or similar structures.

In an embodiment, the input data comprises one or more of image data, video data, graph data, gaming data, text data, the plurality of tasks comprises one or more of classification tasks, image segmentation tasks, content or action generation tasks, and the prediction comprises one or more of a classification of the image data, the video data or the graph data, an image segmentation of the image data or the video data, and content or action generated based on the text data, gaming data, graph data or video data.

In an embodiment, the input data comprises image data or video data obtained in one of an automatic driving system, an intelligent transportation system, an intelligent manufacturing system, industrial equipment system, an intelligent maintenance equipment system, and a medical equipment system, the plurality of tasks comprises image classification tasks or image segmentation tasks, the prediction comprises a classification or a segmented image portion related to the image data or video data. In an embodiment, the input data comprises video data obtained in one of an automatic driving system, an intelligent transportation system, an intelligent manufacturing system, industrial equipment system, an intelligent maintenance equipment system, and a medical equipment system, the plurality of tasks comprises action generation tasks, the prediction comprises an action based on the video data. In an embodiment, the input data comprises graph data obtained in one of a social network system, a financial network system, and a website platform, the plurality of tasks comprises identification of abnormal account included in the graph data, the prediction comprises classification of the accounts.

Fig. 10 illustrates an exemplary process for performing a task using a trained model according to aspects of the disclosure. It is appreciated that the process may be implemented with computers or processors.

At block 1010, input data are received by a plurality of continual learning sub-networks of the model in parallel, the input data being related to the task of a plurality of tasks, the number of the plurality of continual learning sub-networks being fixed and irrespective to the number of the plurality of tasks.

At block 1020, a plurality of feature representations are generated respectively by the plurality of continual learning sub-networks based on the input data.

At block 1030, a prediction related to the task is generated by a feature ensemble sub-network of the model based on the plurality of feature representations.

In an embodiment, the plurality of feature representations are weighted with a plurality of gates output by a gate sub-network. At block 1030, the prediction related to the task is generated by the feature ensemble sub-network based on the plurality of weighted feature representations.

In an embodiment, the gate sub-network comprises a continual learning gate sub-network, wherein the continual learning gate sub-network receives the input date and generates the plurality of gates based on the input data.

In an embodiment, the plurality of continual learning sub-networks having same or similar structures.

In an embodiment, the input data comprises one or more of image data, video data, graph data, gaming data, text data and the plurality of tasks comprising one or more of classification tasks, image segmentation tasks, content or action generation tasks.

Fig. 11 illustrates an exemplary computing system according to aspects of the disclosure. The computing system 1100 may comprise at least one processor 1110. The computing system 1100 may further comprise at least one storage device 1120. The storage device 1120 may store computer-executable instructions that, when executed, cause the processor 1110 to perform any operations according to the embodiments of the present disclosure as described in connection with Figs. 1-10.

The embodiments of the present disclosure may be embodied in a computer-readable medium such as non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations according to the embodiments of the present disclosure as described in connection with Figs. 1-10.

The embodiments of the present disclosure may be embodied in a computer program product comprising computer-executable instructions that, when executed, cause one or more processors to perform any operations according to the embodiments of the present disclosure as described in connection with Figs. 1-10.

It should be appreciated that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.

It should also be appreciated that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.

The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.

Claims

A computer implemented method for continual learning of a plurality of tasks, comprising:

receiving input data by a plurality of continual learning sub-networks in parallel, the input data being related to one task of the plurality of tasks, the number of the plurality of continual learning sub-networks being fixed and irrespective to the number of the plurality of tasks;

generating a plurality of feature representations respectively by the plurality of continual learning sub-networks based on the input data;

generating a prediction related to the one task by a feature ensemble sub-network based on the plurality of feature representations;

generating a continual learning loss value based on the prediction related to the one task and information related to tasks already learned until the one task; and

updating learnable parameters of the plurality of continual learning sub-networks and the feature ensemble sub-network based on the continual learning loss value.
The method of claim 1, further comprising:

generating a plurality of predictions related to the one task by the feature ensemble sub-network based respectively on the plurality of feature representations;

generating an ensemble cooperation loss value based on the plurality of predictions related to the one task;

wherein the updating learnable parameters further comprising updating the learnable parameters of the plurality of continual learning sub-networks and the feature ensemble sub-network based on the continual learning loss value and the ensemble cooperation loss value.
The method of claim 1, further comprising weighting the plurality of feature representations with a plurality of gates output by a gate sub-network;

wherein the generating a prediction related to the one task further comprising generating the prediction related to the one task by the feature ensemble sub-network based on the plurality of weighted feature representations;

wherein the updating learnable parameters further comprising updating the learnable parameters of the plurality of continual learning sub-networks, the feature ensemble sub-network and the gate sub-network based on the continual learning loss value.
The method of claim 2, further comprising weighting the plurality of feature representations with a plurality of gates generated by a gate sub-network;

wherein the generating a prediction related to the one task further comprising generating the prediction related to the one task by the feature ensemble sub-network based on the plurality of weighted feature representations;

wherein the generating a plurality of predictions related to the one task further comprising generating the plurality of predictions related to the one task by the feature ensemble sub-network based respectively on the plurality of weighted feature representations;

wherein the updating learnable parameters further comprising updating the learnable parameters of the plurality of continual learning sub-networks, the feature ensemble sub-network and the gate sub-network based on the continual learning loss value and the ensemble cooperation loss value.
The method of claim 4, wherein the gate sub-network comprising a set of vectors corresponding to the plurality of tasks and the plurality of continual learning sub-networks.
The method of claims 4, wherein the gate sub-network comprising a continual learning gate sub-network, wherein the continual learning gate sub-network receiving the input date and generating the plurality of gates based on the input data.
The method of claim 4, wherein the generating a continual learning loss value comprising generating the continual learning loss value based on one of a weight regularization method, a memory replay method, and a parameter isolation method.
The method of claim 4, wherein the ensemble cooperation loss value representing divergence among the plurality of predictions related to the one task and corresponding respectively to plurality of continual learning sub-networks.
The method of claim 8, wherein the generating an ensemble cooperation loss value comprising generating Kullback Leibler (KL) divergence value based on the plurality of predictions related to the one task.
The method of claim 4, wherein the input data comprising one or more of image data, video data, graph data, gaming data, text data, the plurality of tasks comprising one or more of classification tasks, image segmentation tasks, content or action generation tasks, and the prediction comprising one or more of a classification of the image data, the video data or the graph data, an image segmentation of the image data or the video data, and content or action generated based on the text data, gaming data, graph data or video data.
The method of claim 4, wherein

the input data comprising image data or video data obtained in one of an automatic driving system, an intelligent transportation system, an intelligent manufacturing system, industrial equipment system, an intelligent maintenance equipment system, and a medical equipment system, the plurality of tasks comprising image classification tasks or image segmentation tasks, the prediction comprising a classification or a segmented image portion related to the image data or video data; or

the input data comprising video data obtained in one of an automatic driving system, an intelligent transportation system, an intelligent manufacturing system, industrial equipment system, an intelligent maintenance equipment system, and a medical equipment system, the plurality of tasks comprising action generation tasks, the prediction comprising an action based on the video data; or

the input data comprising graph data obtained in one of a social network system, a financial network system, and a website platform, the plurality of tasks comprising identification of abnormal account included in the graph data, the prediction comprising classification of the accounts.
A computer implemented method for performing a task using a trained model, comprising:

receiving input data by a plurality of continual learning sub-networks of the model in parallel, the input data being related to the task of a plurality of tasks, the number of the plurality of continual learning sub-networks being fixed and irrespective to the number of the plurality of tasks;

generating a plurality of feature representations respectively by the plurality of continual learning sub-networks based on the input data;

generating a prediction related to the task by a feature ensemble sub-network of the model based on the plurality of feature representations.
The method of claim 12, further comprising weighting the plurality of feature representations with a plurality of gates output by a gate sub-network;

wherein the generating a prediction related to the task further comprising generating the prediction related to the task by the feature ensemble sub-network based on the plurality of weighted feature representations.
The method of one of claims 13, wherein the gate sub-network comprising a set of vectors corresponding to the plurality of tasks and the plurality of continual learning sub-networks.
The method of one of claims 13, wherein the gate sub-network comprising a continual learning gate sub-network, wherein the continual learning gate sub-network receiving the input date and generating the plurality of gates based on the input data.
The method of claim 12, wherein the plurality of continual learning sub-networks having same or similar structures.
The method of claim 12, wherein the input data comprising one or more of image data, video data, graph data, gaming data, text data and the plurality of tasks comprising one or more of classification tasks, image segmentation tasks, content or action generation tasks, and the prediction comprising one or more of a classification of the image data, the video data or the graph data, an image segmentation of the image data or the video data, content or action generated based on the text data, gaming data, graph data or video data.
A computer system, comprising:

one or more processors; and

one or more storage devices storing computer-executable instructions that, when executed, cause the one or more processors to perform the operations of the method of one of claims 1-17.
One or more computer readable storage media storing computer-executable instructions that, when executed, cause one or more processors to perform the operations of the method of one of claims 1-17.
A computer program product comprising computer-executable instructions that, when executed, cause one or more processors to perform the operations of the method of one of claims 1-17.