WO2021068180A1

WO2021068180A1 - Method and system for continual meta-learning

Info

Publication number: WO2021068180A1
Application number: PCT/CN2019/110530
Authority: WO
Inventors: Jian Tang; Kun Wu; Chengxiang YIN; Zhengping Che
Original assignee: Beijing Didi Infinity Technology And Development Co., Ltd.
Priority date: 2019-10-11
Filing date: 2019-10-11
Publication date: 2021-04-15

Abstract

Embodiments of the disclosure provide artificial intelligence systems and methods for training a continual meta-learner framework (CML) model. An exemplary artificial intelligence system includes a storage device and a processor. A plurality of tasks are received at the CML framework in a sequence, where each task comprises a training dataset and a test dataset. For each task, a fast-learning is performed by training the student network and the discriminator associated with the CML framework based on the training dataset associated with the task. One or more initial parameters associated with the student network and the discriminator are updated to generate updated initial parameters corresponding to the one or more initial parameters. A meta-update is performed to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.

Description

METHOD AND SYSTEM FOR CONTINUAL META-LEARNING

TECHNICAL FIELD

This present disclosure generally relates to systems and methods for analyzing subject behavior, and in particular, to systems and methods for analyzing human driving behavior by recognizing basic driving actions and identifying intentions and attentions of the driver.

The present disclosure relates generally to systems and methods for machine learning, and more particularly to, machine learning models using continual meta-learning techniques.

BACKGROUND

With the rapid development in the field of Artificial Intelligence (AI) , deep learning techniques have made tremendous successes on various computer vision tasks. To train a deep-learning model, a great amount of labeled data is needed. The trained deep-learning model then may be used only for a specific task (e.g., classifying different types of animals) . Moreover, deep models may suffer from the problem of “forgetting. ” That is, when a deep-learning model is first trained on one task, then trained on a second task, it may forget how to perform the first task. Hence, it is desired to improve deep learning such that a deep model may learn to handle a new task from limited training data without forgetting old knowledge.

Embodiments of the disclosure address the above problems by providing a continual meta-learner (CML) framework, which can keep learning new concepts effectively and quickly from limited labeled data without forgetting old knowledge.

SUMMARY

Embodiments of the disclosure provide artificial intelligence systems and methods for training a continual meta-learner framework (CML) framework. An exemplary artificial intelligence system includes a storage device and a processor. The storage device is configured to store training datasets and test datasets associated with a plurality of tasks. The processor is configured to train the CML framework, including a teacher network, a student network, a classifier and a discriminator. To train the CML framework, the processor is configured to receive a plurality of tasks in a sequence, where each task comprises a training dataset and a test dataset. For each task, the processor is configured to perform a fast-learning by training the student network and the discriminator associated with the CML framework based on the training dataset associated with the task. The processor is also configured to update one or more initial parameters associated with the student network and the discriminator to generate updated initial parameters corresponding to the one or more initial parameters. The processor is further configured perform a meta-update to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.

In some embodiments, the processor is configured to receive the plurality of tasks in a sequence, where each task comprises a training dataset and a test dataset. The processor is configured to perform, for each task, a fast-learning by training the student network and the discriminator associated with the CML framework based on the training dataset associated with the task. The processor is configured to update one or more initial parameters associated with the student network and the discriminator to generate updated initial parameters corresponding to the one or more initial parameters. The processor is configured to perform a meta-update to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.

In some embodiments, the processor is configured to prior to performing the fast-learning, pre-train the teacher network associated with the CML framework to generate a feature map.

In some embodiments, the processor is configured to train the student network by minimizing a sum of a first loss on a classifier and a second loss on the discriminator. The processor is configured to train the discriminator using a third loss on the discriminator.

In some embodiments, the processor is configured to calculate a first cross-entropy loss corresponding to the classifier using the training dataset. The processor is configured to calculate a first binary-entropy loss corresponding to the discriminator using the training dataset. The processor is configured to train the student network by minimizing the sum of the first cross-entropy loss corresponding to the classifier and the first binary-entropy loss corresponding to the discriminator.

In some embodiments, the processor is configured to generate an all-class matrix by the student network. The processor is configured to store the generated all-class matrix into a memory.

In some embodiments, the processor is configured to input the feature map as a real input of the discriminator. The processor is configured to retrieve the all-class matrix from the memory. The processor is configured to input the all-class matrix as a fake input of the discriminator. The processor is configured to calculate a second binary-entropy loss corresponding to the discriminator based on the real input and the fake input. The processor is configured to train the discriminator using the second binary-entropy loss.

In some embodiments, the processor is configured to calculate a cosine similarity between the feature map and each class vector associated with the all-class matrix by the classifier. The processor is configured to generate a prediction score for each class. The processor is configured to normalize a plurality of prediction scores by using a softmax function.

In some embodiments, the processor is configured to calculate a second cross-entropy loss corresponding to the classifier using the updated one or more initial parameters associated with the student network and the test dataset. The processor is configured to optimiz the one or more initial parameters associated with the student network using a one-step gradient descent based on the second cross-entropy loss. The processor is configured to calculate a third binary-entropy loss corresponding to the discriminator using the test dataset and an updated model of the student network, wherein the updated model of the student network corresponding to the updated one or more initial parameters associated with the discriminator. The processor is configured to optimize the one or more initial parameters associated with the discriminator using the one-step gradient descent based on the third binary-entropy loss.

In some embodiments, a model associated with the student network is a convolutional neural network.

In some embodiments, the discriminator is implemented by a multilayer perceptron.

In some embodiments, the one or more initial parameters associated with the student network and the discriminator are obtained by a random initialization.

Embodiments of the disclosure further provide a computer-implemented method for training a continual meta-learner framework (CML) framework. The CML framework includes a teacher network, a student network, a classifier and a discriminator. An exemplary computer-implemented method includes receiving, at a continual meta-learner (CML) framework, a plurality of tasks in a sequence, wherein each task comprises a training dataset and a test dataset. The method further includes performing, by a processor, for each task, a fast-learning by training a student network and a discriminator associated with the CML framework based on the training dataset associated with the task. The method also includes updating, by a processor, one or more initial parameters associated with the student network and the discriminator to generate updated initial parameters corresponding to the one or more initial parameters. The method yet further includes performing, by a processor, a meta-update to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.

Embodiments of the disclosure further provide a non-transitory computer-readable medium having instructions stored thereon that, when executed by a processor, causes the processor to perform a computer-implemented method for training a continual meta-learner framework (CML) framework. The CML framework includes a teacher network, a student network, a classifier and a discriminator. An exemplary computer-implemented method includes receiving, at a continual meta-learner (CML) framework, a plurality of tasks in a sequence, wherein each task comprises a training dataset and a test dataset. The method further includes performing, by a processor, for each task, a fast-learning by training a student network and a discriminator associated with the CML framework based on the training dataset associated with the task. The method also includes updating, by a processor, one or more initial parameters associated with the student network and the discriminator to generate updated initial parameters corresponding to the one or more initial parameters. The method yet further includes performing, by a processor, a meta-update to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.

Additional features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The features of the present disclosure may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities, and combinations set forth in the detailed examples discussed below.

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure is further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:

FIG. 1 illustrates a schematic diagram of an exemplary continual meta-leaner (CML) system, according to embodiments of the disclosure.

FIG. 2 illustrates a block diagram of an exemplary AI system for training a meta-learning model using CML framework, according to embodiments of the disclosure.

FIG. 3 illustrates a schematic diagram of an exemplary CML framework, according to embodiments of the disclosure.

FIG. 4 illustrates a flowchart of an exemplary method for training the continual meta-learning model, according to embodiments of the disclosure.

FIG. 5 illustrates a flowchart of an exemplary method for training the CML framework during the fast-learning phase, according to embodiments of the disclosure.

FIG. 6 illustrates a flowchart of an exemplary method for training the CML framework during the meta-update phase, according to embodiments of the disclosure

DETAILED DESCRIPTION

The following description is presented to enable any person skilled in the art to make and use the present disclosure, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present disclosure is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the claims.

A meta-learner may rapidly learn new concepts from a small dataset with only a few samples (e.g., 5 samples) for each class. Existing approaches of meta-learning include metric-based methods (i.e., exploiting the similarity between samples of different classes for meta-learning) , optimization-based approach (i.e., optimizing model parameters) , etc. However, even though these deep learning models can significantly reduce the amount of labeled training data, none of these existing solutions address continual learning or the forgetting issues., i.e., partially forgetting or even completely forgetting what they have already learned.

Continual learning focuses on how to achieve a good trade-off between learning new concepts and retaining old knowledge over a long time, which is known as the stability plasticity dilemma. Several regularization-based methods have been proposed for continual learning, imposing regularization terms to restrain the update of the model parameter. However, none of the existing continual learning approaches address the issue of learning new concepts with limited label data, which is a crucial challenge for achieving human-level Artificial Intelligence (AI) . Moreover, because the setting of the continual learning problem is quite different from that of the meta-learning problem, none of the existing solutions to the continual learning problems can be directly applied to meta-learning tasks. As such, the prior solutions cannot provide a deep learning model that handles a new task from very limited training data without forgetting old knowledge. Aspects of the present disclosure solve the above-mentioned deficiencies by providing mechanisms (e.g., methods, systems, media, etc. ) for a novel model-agnostic meta-leaner, CML, which integrates metric-based classification and a memory-based mechanism along with adversarial learning into an optimization-based meta-learning framework.

The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting. As used herein, the singular forms “a, ” “an, ” and “the” may be intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises, ” “comprising, ” “includes, ” and/or “including” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

These and other features, and characteristics of the present disclosure, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, may become more apparent upon consideration of the following description with reference to the accompanying drawing (s) , all of which form a part of this specification. It is to be expressly understood, however, that the drawing (s) are for the purpose of illustration and description only and are not intended to limit the scope of the present disclosure. It is understood that the drawings are not to scale.

The flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments in the present disclosure. It is to be expressly understood, the operations of the flowchart may or may not be implemented in order. Conversely, the operations may be implemented in inverted order, or simultaneously. Moreover, one or more other operations may be added to the flowcharts. One or more operations may be removed from the flowcharts.

Moreover, while the system and method in the present disclosure is described primarily in regard to image classification, it should also be understood that this is only one exemplary embodiment. The system or method of the present disclosure may be applied to any other kind of deep learning tasks.

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.

FIG. 1 illustrates a schematic diagram of an exemplary continual meta-leaner (CML) system 100, according to embodiments of the disclosure. Consistent with the present disclosure, CML system 100 is configured to perform continuous meta training for one or more networks using datasets from database (e.g., the training database 130) . In some embodiments, CML system 100 may include components shown in FIG. 1, including a server 110, a network 120, a training database 130, and one or more user devices 140. It is contemplated that CML system 100 may include more or less components compared to those shown in FIG. 1.

The server 110 may be configured to process information and/or data relating to meta-learning tasks. For example, the server 110 may train neural networks for visual object classification, speech recognition, text processing, and other tasks. In some embodiments, the server 110 may be a single server, or a server group. The server group may be centralized, or distributed (e.g., the server 110 may be a distributed system) . In some embodiments, the server 110 may be local or remote. For example, the server 110 may access information and/or data stored in training database 130 and/or user device (s) 140 via network 120. As another example, the server 110 may be directly connected to the training database 130, and/or the user device (s) 140 access stored information and/or data. In some embodiments, the server 110 may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof. In some embodiments, the server 110 may be implemented on a computing device having one or more components illustrated in FIG. 2 in the present disclosure.

The network 120 may facilitate exchange of information and/or data. In some embodiments, one or more components in the system 100 (e.g., the server 110, the training database 130, and the user device (s) 140) may send and/or receive information and/or data to/from other component (s) in the system 100 via the network 120. For example, the server 110 may obtain/acquire a request for training a continual meta-learning model from the user device (s) 140 via the network 120. In some embodiments, the network 120 may be any type of wired or wireless network, or combination thereof. Merely by way of example, the network 120 may include a cable network, a wireline network, an optical fiber network, a tele communications network, an intranet, an Internet, a local area network (LAN) , a wide area network (WAN) , a wireless local area network (WLAN) , a metropolitan area network (MAN) , a wide area network (WAN) , a public telephone switched network (PSTN) , a BluetoothTM network, a ZigBeeTM network, a near field communication (NFC) network, a global system for mobile communications (GSM) network, a code-division multiple access (CDMA) network, a time-division multiple access (TDMA) network, a general packet radio service (GPRS) network, an enhanced data rate for GSM evolution (EDGE) network, a wideband code division multiple access (WCDMA) network, a high speed downlink packet access (HSDPA) network, a long term evolution (LTE) network, a user datagram protocol (UDP) network, a transmission control protocol/Internet protocol (TCP/IP) network, a short message service (SMS) network, a wireless application protocol (WAP) network, a ultra wide band (UWB) network, an infrared ray, or the like, or any combination thereof.

The user device (s) 140 may be operated by one or more users to perform various functions associated with the user device (s) 140. For example, a user of the user device (s) 140 may use the user device (s) 140 to send a request for himself/herself or another user, or receive information or instructions from the server 110. In some embodiments, the term “user” and “user device” may be used interchangeably.

In some embodiments, the user device (s) 140 may include a diverse variety of device types and are not limited to any particular type of device. Examples of user device (s) 140 can include but are not limited to a laptop 140-1, a stationary computer 140-2, a tablet computer 140-3, a mobile device 140-4, or the like, or any combination thereof. In some embodiments, stationary computer 140-2 can include desktop computers, work stations, personal computers, thin clients, terminals, game consoles, personal video recorders (PVRs) , set-top boxes, or the like. In some embodiments, the mobile device 140-4 may include a smart home device, a wearable device, a smart mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof. In some embodiments, the smart home device may include a smart lighting device, a control device of an intelligent electrical apparatus, a smart monitoring device, a smart television, a smart video camera, an interphone, or the like, or any combination thereof. In some embodiments, the wearable device may include a smart bracelet, a smart footgear, a smart glass, a smart helmet, a smart watch, a smart clothing, a smart backpack, a smart accessory, or the like, or any combination thereof. In some embodiments, the smart mobile device may include a smartphone, a personal digital assistance (PDA) , a gaming device, a navigation device, a point of sale (POS) device, or the like, or any combination thereof. In some embodiments, the virtual reality device and/or the augmented reality device may include a virtual reality helmet, a virtual reality glass, a virtual reality patch, an augmented reality helmet, an augmented reality glass, an augmented reality patch, or the like, or any combination thereof. For example, the virtual reality device and/or the augmented reality device may include a Google Glass, an Oculus Rift, a Hololens, a Gear VR, etc.

In some embodiments, the server 110 may include a processing engine 112. The processing engine 112 may process information and/or data relating to the meta-learning tasks to perform one or more functions described in the present disclosure. For example, the processing engine 112 may receive a request from the user device (s) 140 to generate a trained meta-learning model 105 based on the request. In some embodiments, the processing engine 112 may include one or more processing engines (e.g., single-core processing engine (s) or multi-core processor (s) ) . Merely by way of example, the processing engine 112 may include a central processing unit (CPU) , an application-specific integrated circuit (ASIC) , an application-specific instruction-set processor (ASIP) , a graphics processing unit (GPU) , a physics processing unit (PPU) , a digital signal processor (DSP) , a field programmable gate array (FPGA) , a programmable logic device (PLD) , a controller, a microcontroller unit, a reduced instruction-set computer (RISC) , a microprocessor, or the like, or any combination thereof.

As shown in FIG. 1, server 110 may further include a deep learning training device 114, which may communicate with training database 130 to receive one or more sets of tasks 101. Each task may be different, for example, task 101-1 may be a task of classifying images of animals; task 101-2 may be a task of classifying images of fruits. Deep learning training device 114 may use training data corresponding to each task 101 that is received from training database 130 to train a model based on the CML framework(discussed in detail in connection with FIG. 3) , so that the trained meta-learning model 105 may be able to adapt to a large or infinite number of tasks. Deep learning training device 114 may be implemented with hardware specially programmed by software that performs the training process. For example, deep learning training device 114 may include a processor and a non-transitory computer-readable medium (discussed in detail in connection with FIG. 2) . The processor may conduct the training by performing instructions of a training process stored in the computer-readable medium. Deep learning training device 114 may additionally include input and output interfaces to communicate with training database 130, network 130, and/or a user interface (not shown) . The user interface may be used for selecting sets of training data, adjusting one or more parameters of the training process, selecting or modifying a framework of the learning model, and/or manually or semi-automatically providing diagnosis results associated with a sample patient description for training.

Consistent with some embodiments, deep learning training device 114 may generate the trained meta-learning model 105 through a CML framework (discussed in detail in connection with FIGs. 3-6) , which may include more than one convolutional neural network (CNN) models. Trained meta-learning model 105 may be trained using supervised and/or reinforcement learning. The architecture of a trained meta- learning model 105 includes a stack of distinct layers that transform the input into the output. As used herein, “training” a learning model refers to determining one or more parameters of at least one layer in the learning model. For example, a convolutional layer of a CNN model may include at least one filter or kernel. One or more parameters, such as kernel weights, size, shape, and structure, of the at least one filter may be determined by e.g., a backpropagation-based training process.

FIG. 2 illustrates a block diagram of an exemplary AI system 200 for training a meta-learning model using a CML framework, according to embodiments of the disclosure. Consistent with the present disclosure, AI system 200 may be an embodiment of deep learning training device 114. In some embodiments, as shown in FIG. 2, AI system 200 may include a communication interface 202, a processor 204, a memory 206, and a storage 208. In some embodiments, AI system 200 may have different modules in a single device, such as an integrated circuit (IC) chip (e.g., implemented as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA) ) , or separate devices with dedicated functions. In some embodiments, one or more components of AI system 200 may be located in a cloud, or may be alternatively in a single location (such as inside a mobile device) or distributed locations. Components of Ai system 200 may be in an integrated device, or distributed at different locations but communicate with each other through a network (not shown) . Consistent with the present disclosure, AI system 200 may be configured to train meta-learning model105 based on data received from the training database 130.

Communication interface 202 may send data to and receive data from components such as training database 130 via communication cables, a Wireless Local Area Network (WLAN) , a Wide Area Network (WAN) , wireless networks such as radio waves, a cellular network, and/or a local or short-range wireless network (e.g., Bluetooth ^TM) , or other communication methods.In some embodiments, communication interface 202 may include an integrated service digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection. As another example, communication interface 202 may include a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links can also be implemented by communication interface 202. In such an implementation, communication interface 202 can send and receive electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.

Consistent with some embodiments, communication interface 202 may receive meta-training set (s) 101, where each meta-training set 101 corresponding to a different task, which arrives in sequence. In a regular learning setting where a model (e.g., a function F (. ) ) is trained to map the given samples x to the output y, the parameters θ of the model are trained on a training dataset D _train and a testing dataset D _test. Different from the regular learning setting, in the meta-learning setting described in the present disclosure, the database 110 stores a number of meta-training sets

which contains multiple regular datasets, and each dataset

is split into D _train and D _test as those in regular machine learning. Communication interface 202 may further provide the received data to memory 206 and/or storage 208 for storage or to processor 204 for processing.

Processor 204 may include any appropriate type of general-purpose or special-purpose microprocessor, digital signal processor, or microcontroller. Processor 204 may be configured as a separate processor module dedicated to training a learning model. Alternatively, processor 204 may be configured as a shared processor module for performing other functions in addition to model training.

Memory 206 and storage 208 may include any appropriate type of mass storage provided to store any type of information that processor 204 may need to operate. Memory 206 and storage 208 may be a volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other type of storage device or tangible (i.e., non-transitory) computer-readable medium including, but not limited to, a ROM, a flash memory, a dynamic RAM, and a static RAM. Memory 206 and/or storage 208 may be configured to store one or more computer programs that may be executed by processor 204 to perform functions disclosed herein. For example, memory 206 and/or storage 208 may be configured to store program (s) that may be executed by processor 204 to train and generate the trained meta-learning model 105.

Memory 206 and/or storage 208 may be further configured to store information and data used by processor 204. In some embodiments, memory 206 and/or storage 208 may also store intermediate data such as feature maps output by layers of the learning model, and optimization loss functions, etc. Memory 206 and/or storage 208 may additionally store various learning models including their model parameters, such as a CNN model and other types of neutral network models. The various types of data may be stored permanently, removed periodically, or disregarded immediately after the data is processed.

As shown in FIG. 2, processor 204 may include multiple modules, such as a Neural Networks (NNs) processing unit 242, an updating unit 244, an optimization unit 246, and the like. These modules (and any corresponding sub-modules or sub-units) can be hardware units (e.g., portions of an integrated circuit) of processor 204 designed for use with other components or software units implemented by processor 204 through executing at least part of a program. The program may be stored on a computer-readable medium, and when executed by processor 204, it may perform one or more functions. Although FIG. 2 shows units 242-246 all within one processor 204, it is contemplated that these units may be distributed among different processors located closely or remotely with each other.

Units 242-246 are configured to train a meta-learning model using meta-training set (s) 101. FIG. 3 illustrates a schematic diagram of an exemplary CML framework 300, according to embodiments of the disclosure. Consistent with the present disclosure, CML framework 300 may include a plurality of components, such as a teacher network 310, a student network 320, a classifier 330, and a discriminator 340.

In some embodiments, for a meat-learning setting, training dataset 302

and test dataset 304

are used during the meta-training and meta-testing phases respectively, and the class labels in

are not overlapping with those of

In the present disclosure, a number of N-way, K-shot tasks are used for illustration. N-way, K-shot is a typical setting for few-shot learning, which refers to the practice of feeding a learning model of small amount of training data, contrary to the normal practice using a large amount of data. The problem of N-way classification is set up as follow: select N unseen classes, provide the model with K different instances of each of the N classes, and evaluate the model’s ability to classify new instances within the N classes. For example, in each

D _train contains K samples for each of N classes, and D _test contains samples for evaluation. In a meta-learning scenario, the goal is to train the model to be able to adapt to a distribution over tasks

During meta-training and in the K-shot learning setting, a task

is sampled from

the model is trained with K samples and feedback from the corresponding loss

from

and then tested on new samples from

At the end of meta-training, new tasks are sampled from

and meta-performance is measured by the model’s performance after learning from K samples.

For example, in a meta-learning scenario which has a 5-way, 5-shot setting, the goal is to train a model M _fine-tune to classify images with unknown labels. These images belong to classes P ₁～P ₅; each class contains 5 labeled sample images for training the model M _fine-tune and 15 labeled samples to test the trained model M _fine-tune. In addition to the labeled samples of classes P ₁～P ₅, the dataset further includes sample images belong to another 10 classes C ₁～C ₁₀, each of the class contains 30 labeled samples to assist training the meta-learning model M _meta. During the meta-training process, sample images included in classes C ₁～C ₁₀ are first used to train the meta-learning model M _meta, then sample images included in classes P ₁～P ₅ are used to fine-tune M _meta to generate the final model M _fine-tune. In this example, C ₁～C ₁₀ are the meta-training classes, and the 300 samples included in classes C ₁～C ₁₀ are

which are used to train M _meta. Similarly, classes P ₁～P ₅ are the meta-test classes, and the 100 samples included in classes P ₁～P ₅ are

which are used to train and test the M _fine-tune. Based on the 5-way, 5-shot setting, during the process of training M _meta, 5 classes are randomly selected from classes C ₁～C ₁₀, and from each randomly selected class, 20 labeled samples are selected to form a task

This task

is equivalent to a piece of training data in the regular deep learning model.

In the continual learning setting of the present disclosure, different tasks arrive in a sequence

So that when training a task

only

of

is accessible, while

of any previous task

is not available. In this way, the present disclosure forms a new continual meta-learning problem. In this problem,

and the meta-training phase are the same as those described in the aforementioned meta-learning setting. However, during the meta-testing phase of this problem, tasks arrive one by one in a sequence rather than in a batch, to ensure that the learner can quickly, effectively and continuously learn new concepts without forgetting what it has already learned.

Back to the illustration of FIG. 3. As illustrated in FIG. 3, during the meta-testing phase, tasks arrive one by one in a sequence. Every time when a new task

arrives, the CML framework 300 uses images from

quickly learn to handle the new task and then takes images from

as input and outputs the corresponding class labels. Each component of the CML will be discussed in more details as below.

The teacher network 310

may take an image x as input and extracts its features to form a feature map M 306 with a dimension of z, which are then pushed to classifier 330 (P (. ) ) and discriminator 340

In some embodiments, the teacher network 310 may be a CNN. In some embodiments, the teacher network 310 may be a Residual Network (ResNet) . ResNet inserts shortcut connections to the plain network and turns the network into its counterpart residual version. ResNets may have variable sizes, depending on the size of each layer, and the number of layers it has. Each of the layers follow the same pattern, and they perform 3 × 3 convolution with a fixed feature map dimension. For example, the ResNet used as the teacher network 310 in the present disclosure may be a ResNet18 (that is, the residual network is 18 layers deep) . Given an input image x with a size of 84 × 84, the ResNet18 yields a feature map M with a dimension of z = 512 × 1 × 1 = 512.

The student network 320

is the core component of the CML framework 300. The student network 320 may take all training images x and generates an all-class matrix V 308. Each row of V corresponds to a class vector V _l with a dimension of z, which can be considered as a representation of images of the l ^th class. In some embodiments, the student network 320 may be a CNN. For example, the CNN may include four convolutional modules, each of which contains a 3 × 3 convolutional layer followed by batch normalization, a ReLU nonlinearity and a 2 × 2 max pooling, including 64 filters in the first two convolutional layers and 128 filters in the last two convolutional layers. Given an input image x with a size of 84 × 84, this exemplary student network may yield a feature map M with a dimension of z = 512 × 1 × 1 = 512, which is the same as ResNet18.

The classifier 330 may take the feature map M 306 of an image and the all-class matrix V 308 as input and predict which class this image belongs to (the prediction 312) . Classification is a supervised learning approach in which the model is learned from the data input and uses this learning to classify new observations. In some embodiments, the classification algorithm may be a linear classifier, a nearest neighbor, a support vector machine, a decision tree, a boosted tree, neural networks, or the like.

The discriminator 340

may distinguish between the feature map of an image (belong to class l) and the corresponding class vector V _l. The discriminator 340 and student network 320 function similar to the generative network (generator) and the discriminative network (discriminator) of a generative adversarial network (GAN) .

GANs are deep neural architecture. The generator of a GAN learns to generate plausible data, while the discriminator of a GAN learns to distinguish the generator’s fake data from real data. The generator part of a GAN learns to create fake data by incorporating feedback from the discriminator. It learns to make the discriminator classify its output as real. The discriminator in a GAN is simply a classifier. It tries to distinguish real data from the data created by the generator. It could use any network architecture appropriate to the type of data it′sclassifying. In the present disclosure as described in FIG. 3, the student network 320 may be considered as a generator of GAN that is used to generate the all-class matrix V 308, while the discriminator 340 may be considered as the discriminator of a GAN and help student network 320 to generate a more representative all-class matrix via adversarial learning.

In some embodiments, units 242-246 of FIG. 2 may execute computer instructions to perform the training. For example, FIG. 4 illustrates a flowchart of an exemplary method 400 for training the continual meta-learning model, according to embodiments of the disclosure. Method 400 may include steps S402-S418 as described below. It is to be appreciated that some of the steps may be optional to perform the disclosure provided herein. Further, some of the steps may be performed simultaneously, or in a different order than shown in FIG. 2.

In step 402, communication interface 202 may receive a plurality of tasks in a sequence. In some embodiment, each task

may include a training dataset 302 (D _train) and a test dataset 304 (D _test) . In some embodiment, when a new

arrives during the meta-testing phase, the CML framework 300 may take D _train as input, and performs fast-learning. In some embodiment, during the meta-training phase, the CML framework may take a batch of tasks

as input and learns good initializations of student network 320 and discriminator 340. The entire training process consists of two phases: fast-learning and meta-update. During fast-learning, the CML framework 300 may learn from D _train of each individual task of the batch; while during meta-update, the CML framework 300 may learn from D _test across all tasks of the batch. Both fast-learning and meta-update will be described in more details in connection with FIGs. 5 and 6.

In step S404, NNs processing unit 242 may input the image x from D _train and determine a first loss function on the classifier.

In step S406, NNs processing unit 242 may input the image x from D _train and determine a second loss function on the discriminator.

In step S408, optimization unit 246 may train the student network by minimizing both the first loss function obtained in step S404 and the second loss function obtained in step S406. In some embodiment, after the student network is trained, one or more parameters corresponding to the student network model are updated by the updating unit 244.

In step S410, NNs processing unit 242 may generate the all-class matrix V for current task

In some embodiment, under a 1-shot setting, if the input image belongs to the l ^th class, the class vector V _l may be directly obtained from the input image x. In some embodiment, under a K-shot setting, the class vector V _l may be obtained by taking the mean values. As such, the feature map M and each class vector V _l can have the same dimension. In some embodiment, the all-class matrix V is then constructed by stacking the class vectors together, where V has a dimension of N × Z, where N is the number of classes and z is the dimension of each class vector V _l.

In step S412, NNs processing unit 242 may output the all-class matrix V for current task

and store them into memory 206. Each class vector only needs a small space. For example, when the dimension of the class vector V _l is 512, each class vector only needs a space of 4KB since they are 512 64-bits numbers. As such, the proposed CML framework 300 has a Iow memory footprint and can improve the efficiency of computer memory storage.

In step S414, NNs processing unit 242 may input the image x from D _test and generate the feature map M. In some embodiment, before meta-training, the teacher network 310 may be pre-trained on the D _meta-train so that the teacher network 310 may gain enough knowledge to server as a “teacher. ” In some embodiment, for example, where the teacher network is a ResNet18, after the teacher network 310 is pre-trained, the last fully-connected layer is discarded and the feature extractor is kept as the teacher network 310.

In step S416, NNs processing unit 242 may retrieve the all-class matrix V obtained in step S412 from the memory 206.

In step S418, NNs processing unit 242 may predict the class for the input image x by calculating the similarity between its feature map M and each class vector V _l.

In some embodiment, the classifier 330 may calculate the cosine similarity between M of image x and each class vector V _l as the prediction score for each class, according to equation (1) :

P (M, V) = softmax (Cos (M, v ^T) ) eq. (1)

Where Cos (. ) represents a calculation of the cosine similarity. In the present disclosure, the cosine similarity is used because it eliminates the interference resulting from different orders of magnitude corresponding to different classes. Once the cosine similarity is obtained, a softmax function may be used to normalize prediction scores.

FIG. 5 illustrates a flowchart of an exemplary method 500 for training the CML framework 300 during the fast-learning phase, according to embodiments of the disclosure. Method 500 may be implemented by CML framework 300 and particularly processor 204 or a separate processor not shown in FIG. 5. Method 500 may include steps S502-S516 as described below. It is to be appreciated that some of the steps may be optional to perform the disclosure provided herein. Further, some of the steps may be performed simultaneously, or in a different order than shown in FIG. 5. A N-way, K-shot setting is used for the training process described in FIGs. 5 and 6. That is, K-shot classification tasks use K input/output pairs from each class, for a total of N × K datapoints for N-way classification.

In step S502, communication interface 202 may receive a plurality of tasks

from a distribution over tasks p

in a sequence.

In step S504, NNs processing unit 242 may calculate the first cross-entropy loss on the classifier 330, according to equation (2) :

Where (x, y) represents an image/label pair from training dataset 302 D _train; X represents the corresponding set of images; θ _s represents randomly selected initial parameters for the student network 320;

represents the corresponding function of the student network 320 which generates the all-class matrix V 308;

represents the corresponding function of the pre-trained teacher network 310 which generates the feature map M 306.

In step S506, NNs processing unit 242 may calculate the first binary-entropy loss corresponding to the discriminator 340, according to equation (3) :

Where

generates the all-class matrix V ; R (V, l) represents a function that returns the l ^th row of V, (i.e., V _l) ;

represents the function corresponding to the discriminator 340.

The discriminator 340 described in FIG. 5 is used to distinguish each class vector V _l generated by the student network 320 from true samples generated by the teacher network 310, during the training of the student network 320. Specifically, the discriminator340 takes each class vector V _l as input, calculates the probability of the input being a true sample. The student network may be considered as a generator used to generate the all-class matrix V; while the discriminator may help the student network 320 generate a more representative all-class matrix via adversarial learning. Furthermore, the loss l _s, p may make training the student network 320 and the discriminator 340 more stable, and prevent model collapses. This is because the loss l _s, p may still improve the student network 320 even when the discriminator 340 makes a mistake; and then the improvement on the student network 320 may help train the discriminator 340. Hence, the losses l _s, p and l _s, d benefit from each other. In some embodiment, a multilayer perceptron (MLP) may be used to implement the discriminator, which contains two fully-connected layers. The first fully connected layer is followed by batch normalization and a ReLU nonlinearity; and the second fully-connected layer is followed by the sigmoid function that normalizes output.

In step S508, the optimization unit 246 may train the student network 320 model by minimizing the sum of losses l _s, p and l _s, d by gradient descent, according to equation (4) :

In some embodiment, the student network 320 is trained for each task

independently but each time starts from the same parameters θ _s, where θ _s is a randomly initialized parameter.

In step S510, the updating unit 244 may update the parameters θ _s to θ′ _i, s with gradient descent for each task

according to equation (5) :

Where α _s represents a predetermined step size hyperparameter for fast-learning;

represents the loss from task

As such, the model of the student network 320 is trained with K samples and feedback from the corresponding loss

from task

In step S512, the NNs processing unit 242 may calculate the second binary- entropy loss corresponding to the discriminator 340, according to equation (6) :

Where x represents an image from test dataset 304 (D _train) . In some embodiments, the discriminator 340 may take the feature map M 306

from the teacher network 310 as the real (i.e., true) input and the all-class matrix 308

from the student network 320 as the fake (i.e., false) input.

In step S514, the NNs processing unit 242 may train the discriminator 340 in an adversarial manner, according to equation (7) :

In step S516, the updating unit 244 may update the parameters θ _d to θ′ _i, d with gradient decent for each task

according to equation (8) :

Where α _d represents a predetermined step size hyperparameter for fast-learning;

represents the loss from task

As such, the discriminator model is trained with K samples and feedback from corresponding loss

from task

FIG. 6 illustrates a flowchart of an exemplary method 600 for training the CML framework 300 during the meta-update phase, according to embodiments of the disclosure. Method 600 may be implemented by AI system 200 and particularly processor 204 or a separate processor not shown in FIG. 6. Method 600 may include steps S602-S608 as described below. It is to be appreciated that some of the steps may be optional to perform the disclosure provided herein. Further, some of the steps may be performed simultaneously, or in a different order than shown in FIG. 6.

In step S602, NNs processing unit 242 may calculate the second cross-entropy loss on the classifier 330, according to equation (9) :

Where (x, y) represents an image/label pair from test dataset 304 (D _test) ; X represents the corresponding set of images;

represents the corresponding to the function of the teacher network 310 which generates the feature map M 306.

In step S604, the updating unit 244 may optimize the parameters θ _s of the student network 320 using a one-step gradient descent, according to equation (10) :

Where β _s represents a predetermined step size hyperparameter for meta-update;

represents the loss from task

As such, the model of the student network is tested on new samples from

so that the model is improved by considering how the test error on new data changes with respect to the parameters.

In step S606, the NNs processing unit 242 may calculate the third binary-entropy loss corresponding to the discriminator, according to equation (11 ) :

Where

represents the updated model of the student network 320.

In step S608, the updating unit 244 may optimize the parameters θ _d using a one-step gradient descent, according to equation (12) :

Where β _d represents a predetermined step size hyperparameter for meta-update;

represents the loss from task

As such, the model of the discriminator 340 is tested on new samples from

As described in FIGs. 5 and 6, the meta-update is performed on parameters θ _s and θ _d, rather than θ′ _i, s and θ′ _i, d while the losses l _s, p and l _d are computed by the updated parameters θ′ _i, s and θ′ _i, d after fast-learning. In this way, the CML framework 300 can learn good initialization for both the student network 320 and the discriminator 340 such that it can quickly learn to deal with a new task during the meta-testing phase.

Another aspect of the disclosure is directed to a non-transitory computer-readable medium storing instructions which, when executed, cause one or more processors to perform the methods, as discussed above. The computer-readable medium may include volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage devices. For example, the computer-readable medium may be the storage device or the memory module having the computer instructions stored thereon, as disclosed. In some embodiments, the computer-readable medium may be a disc or a flash drive having the computer instructions stored thereon.

It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed system and related methods. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed system and related methods.

It is intended that the specification and examples be considered as exemplary only, with a true scope being indicated by the following claims and their equivalents.

Claims

An artificial intelligence system for training a continual meta-learner framework (CML) framework, comprising:

a storage device configured to store training datasets and test datasets associated with a plurality of tasks;

a processor configured to train the CML framework, wherein the CML framework includes a teacher network, a student network, a classifier and a discriminator, wherein the processor is configured to:

receiving, at the CML framework, the plurality of tasks in a sequence, wherein each task comprises a training dataset and a test dataset;

performing, for each task, a fast-learning by training the student network and the discriminator associated with the CML framework based on the training dataset associated with the task;

updating one or more initial parameters associated with the student network and the discriminator to generate updated initial parameters corresponding to the one or more initial parameters; and

performing a meta-update to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.
The artificial intelligence system of claim 1, wherein the processor is further configured to:

prior to performing the fast-learning:

pre-training the teacher network associated with the CML framework to generate a feature map.
The artificial intelligence system of claim 2, wherein to perform the fast-learning by training the student network and the discriminator, the processor is further configured to:

training the student network by minimizing a sum of a first loss on a classifier and a second loss on the discriminator; and

training the discriminator using a third loss on the discriminator.
The artificial intelligence system of claim 3, wherein to minimize the sum of the first loss on the classifier and the second loss on the discriminator, the processor is further configured to:

calculating a first cross-entropy loss corresponding to the classifier using the training dataset;

calculating a first binary-entropy loss corresponding to the discriminator using the training dataset; and

training the student network by minimizing the sum of the first cross-entropy loss corresponding to the classifier and the first binary-entropy loss corresponding to the discriminator.
The artificial intelligence system of claim 4, wherein the processor is further configured to:

generating an all-class matrix by the student network; and

storing the generated all-class matrix into a memory.
The artificial intelligence system of claim 5, wherein to train the discriminator using a third loss on the discriminator, the processor is further configured to:

inputting the feature map as a real input of the discriminator;

retrieving the all-class matrix from the memory;

inputting the all-class matrix as a fake input of the discriminator;

calculating a second binary-entropy loss corresponding to the discriminator based on the real input and the fake input; and

training the discriminator using the second binary-entropy loss.
The artificial intelligence system of claim 6, wherein the processor is further configured to:

calculating a cosine similarity between the feature map and each class vector associated with the all-class matrix by the classifier;

generating a prediction score for each class; and

normalizing a plurality of prediction scores by using a softmax function.
The artificial intelligence system of claim 1, wherein to perform the meta-update to optimize the one or more initial parameters associated with the student network and the discriminator, the processor is further configured to:

calculating a second cross-entropy loss corresponding to the classifier using the updated one or more initial parameters associated with the student network and the test dataset;

optimizing the one or more initial parameters associated with the student network using a one-step gradient descent based on the second cross-entropy loss;

calculating a third binary-entropy loss corresponding to the discriminator using the test dataset and an updated model of the student network, wherein the updated model of the student network corresponding to the updated one or more initial parameters associated with the discriminator; and

optimizing the one or more initial parameters associated with the discriminator using the one-step gradient descent based on the third binary-entropy loss.
The artificial intelligence system of claim 1, wherein a model associated with the student network is a convolutional neural network.
The artificial intelligence system of claim 1, wherein the discriminator is implemented by a multilayer perceptron.
The artificial intelligence system of claim 1, wherein the one or more initial parameters associated with the student network and the discriminator are obtained by a random initialization.
A computer-implemented method, comprising:

receiving, at a continual meta-learner (CML) framework, a plurality of tasks in a sequence, wherein each task comprises a training dataset and a test dataset;

performing, for each task, a fast-learning by training a student network and a discriminator associated with the CML framework based on the training dataset associated with the task;

updating one or more initial parameters associated with the student network and the discriminator to generate updated initial parameters corresponding to the one or more initial parameters; and

performing a meta-update to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.
The computer-implemented method of claim 12, further comprising:

prior to performing the fast-learning:

pre-training a teacher network associated with the CML framework to generate a feature map.
The computer-implemented method of claim 13, wherein performing the fast-learning by training the student network and the discriminator comprises:

training the student network by minimizing a sum of a first loss on a classifier and a second loss on the discriminator; and

training the discriminator using a third loss on the discriminator.
The computer-implemented method of claim 14, wherein minimizing the sum of the first loss on the classifier and the second loss on the discriminator comprises:

calculating a first cross-entropy loss corresponding to the classifier using the training dataset;

calculating a first binary-entropy loss corresponding to the discriminator using the training dataset; and

training the student network by minimizing the sum of the first cross-entropy loss corresponding to the classifier and the first binary-entropy loss corresponding to the discriminator.
The computer-implemented method of claim 15, further comprising:

generating an all-class matrix by the student network; and

storing the generated all-class matrix into a memory.
The computer-implemented method of claim 16, wherein training the discriminator using a third loss on the discriminator comprises:

inputting the feature map as a real input of the discriminator;

retrieving the all-class matrix from the memory;

inputting the all-class matrix as a fake input of the discriminator;

calculating a second binary-entropy loss corresponding to the discriminator based on the real input and the fake input; and

training the discriminator using the second binary-entropy loss.
The computer-implemented method of claim 17, further comprising:

calculating a cosine similarity between the feature map and each class vector associated with the all-class matrix by the classifier;

generating a prediction score for each class; and

normalizing a plurality of prediction scores by using a softmax function.
The computer-implemented method of claim 12, wherein performing the meta-update to optimize the one or more initial parameters associated with the student network and the discriminator comprises:

calculating a second cross-entropy loss corresponding to the classifier using the updated one or more initial parameters associated with the student network and the test dataset;

optimizing the one or more initial parameters associated with the student network using a one-step gradient descent based on the second cross-entropy loss;

calculating a third binary-entropy loss corresponding to the discriminator using the test dataset and an updated model of the student network, wherein the updated model of the student network corresponding to the updated one or more initial parameters associated with the discriminator; and

optimizing the one or more initial parameters associated with the discriminator using the one-step gradient descent based on the third binary-entropy loss.
The computer-implemented method of claim 12, wherein a model associated with the student network is a convolutional neural network.
The computer-implemented method of claim 12, wherein the discriminator is implemented by a multilayer perceptron.
The computer-implemented method of claim 12, wherein the one or more initial parameters associated with the student network and the discriminator are obtained by a random initialization.
A non-transitory computer-readable medium having instructions stored thereon that, when executed by a processor, causes the processor to perform an artificial intelligence method for training a continual meta-learner (CML) framework, the CML framework includes a teacher network, a student network, a classifier and a discriminator, the artificial intelligence method comprising:

receiving, at the CML framework, a plurality of tasks in a sequence, wherein each task comprises a training dataset and a test dataset;

performing, for each task, a fast-learning by training the student network and the discriminator associated with the CML framework based on the training dataset associated with the task;

updating one or more initial parameters associated with the student network and the discriminator to generate updated initial parameters corresponding to the one or more initial parameters; and

performing a meta-update to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.