WO2021068180A1 - Method and system for continual meta-learning - Google Patents

Method and system for continual meta-learning

Info

Publication number
WO2021068180A1
Authority
WO
WIPO (PCT)
Prior art keywords
discriminator
training
student network
initial parameters
network
Prior art date
Application number
PCT/CN2019/110530
Other languages
French (fr)
Inventor
Jian Tang
Kun Wu
Chengxiang YIN
Zhengping Che
Original Assignee
Beijing Didi Infinity Technology And Development Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology And Development Co., Ltd.
Priority to PCT/CN2019/110530
Publication of WO2021068180A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Definitions

  • The present disclosure relates generally to systems and methods for machine learning, and more particularly to machine learning models using continual meta-learning techniques.
  • Deep learning techniques have made tremendous successes on various computer vision tasks.
  • To train a deep-learning model, a great amount of labeled data is needed.
  • the trained deep-learning model then may be used only for a specific task (e.g., classifying different types of animals) .
  • deep models may suffer from the problem of “forgetting. ” That is, when a deep-learning model is first trained on one task, then trained on a second task, it may forget how to perform the first task.
  • it is desired to improve deep learning such that a deep model may learn to handle a new task from limited training data without forgetting old knowledge.
  • Embodiments of the disclosure address the above problems by providing a continual meta-learner (CML) framework, which can keep learning new concepts effectively and quickly from limited labeled data without forgetting old knowledge.
  • Embodiments of the disclosure provide artificial intelligence systems and methods for training a continual meta-learner (CML) framework.
  • An exemplary artificial intelligence system includes a storage device and a processor.
  • the storage device is configured to store training datasets and test datasets associated with a plurality of tasks.
  • the processor is configured to train the CML framework, including a teacher network, a student network, a classifier and a discriminator.
  • the processor is configured to receive a plurality of tasks in a sequence, where each task comprises a training dataset and a test dataset.
  • the processor is configured to perform a fast-learning by training the student network and the discriminator associated with the CML framework based on the training dataset associated with the task.
  • the processor is also configured to update one or more initial parameters associated with the student network and the discriminator to generate updated initial parameters corresponding to the one or more initial parameters.
  • the processor is further configured to perform a meta-update to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.
  • the processor is configured to receive the plurality of tasks in a sequence, where each task comprises a training dataset and a test dataset.
  • the processor is configured to perform, for each task, a fast-learning by training the student network and the discriminator associated with the CML framework based on the training dataset associated with the task.
  • the processor is configured to update one or more initial parameters associated with the student network and the discriminator to generate updated initial parameters corresponding to the one or more initial parameters.
  • the processor is configured to perform a meta-update to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.
  • the processor is configured to, prior to performing the fast-learning, pre-train the teacher network associated with the CML framework to generate a feature map.
  • the processor is configured to train the student network by minimizing a sum of a first loss on a classifier and a second loss on the discriminator.
  • the processor is configured to train the discriminator using a third loss on the discriminator.
  • the processor is configured to calculate a first cross-entropy loss corresponding to the classifier using the training dataset.
  • the processor is configured to calculate a first binary-entropy loss corresponding to the discriminator using the training dataset.
  • the processor is configured to train the student network by minimizing the sum of the first cross-entropy loss corresponding to the classifier and the first binary-entropy loss corresponding to the discriminator.
  • the processor is configured to generate an all-class matrix by the student network.
  • the processor is configured to store the generated all-class matrix into a memory.
  • the processor is configured to input the feature map as a real input of the discriminator.
  • the processor is configured to retrieve the all-class matrix from the memory.
  • the processor is configured to input the all-class matrix as a fake input of the discriminator.
  • the processor is configured to calculate a second binary-entropy loss corresponding to the discriminator based on the real input and the fake input.
  • the processor is configured to train the discriminator using the second binary-entropy loss.
  • the processor is configured to calculate a cosine similarity between the feature map and each class vector associated with the all-class matrix by the classifier.
  • the processor is configured to generate a prediction score for each class.
  • the processor is configured to normalize a plurality of prediction scores by using a softmax function.
  • the processor is configured to calculate a second cross-entropy loss corresponding to the classifier using the updated one or more initial parameters associated with the student network and the test dataset.
  • the processor is configured to optimize the one or more initial parameters associated with the student network using a one-step gradient descent based on the second cross-entropy loss.
  • the processor is configured to calculate a third binary-entropy loss corresponding to the discriminator using the test dataset and an updated model of the student network, wherein the updated model of the student network corresponds to the updated one or more initial parameters associated with the discriminator.
  • the processor is configured to optimize the one or more initial parameters associated with the discriminator using the one-step gradient descent based on the third binary-entropy loss.
  • a model associated with the student network is a convolutional neural network.
  • the discriminator is implemented by a multilayer perceptron.
  • the one or more initial parameters associated with the student network and the discriminator are obtained by a random initialization.
  • Embodiments of the disclosure further provide a computer-implemented method for training a continual meta-learner (CML) framework.
  • the CML framework includes a teacher network, a student network, a classifier and a discriminator.
  • An exemplary computer-implemented method includes receiving, at a continual meta-learner (CML) framework, a plurality of tasks in a sequence, wherein each task comprises a training dataset and a test dataset.
  • the method further includes performing, by a processor, for each task, a fast-learning by training a student network and a discriminator associated with the CML framework based on the training dataset associated with the task.
  • the method also includes updating, by a processor, one or more initial parameters associated with the student network and the discriminator to generate updated initial parameters corresponding to the one or more initial parameters.
  • the method yet further includes performing, by a processor, a meta-update to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.
  • Embodiments of the disclosure further provide a non-transitory computer-readable medium having instructions stored thereon that, when executed by a processor, cause the processor to perform a computer-implemented method for training a continual meta-learner (CML) framework.
  • the CML framework includes a teacher network, a student network, a classifier and a discriminator.
  • An exemplary computer-implemented method includes receiving, at a continual meta-learner (CML) framework, a plurality of tasks in a sequence, wherein each task comprises a training dataset and a test dataset.
  • the method further includes performing, by a processor, for each task, a fast-learning by training a student network and a discriminator associated with the CML framework based on the training dataset associated with the task.
  • the method also includes updating, by a processor, one or more initial parameters associated with the student network and the discriminator to generate updated initial parameters corresponding to the one or more initial parameters.
  • the method yet further includes performing, by a processor, a meta-update to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.
  • FIG. 1 illustrates a schematic diagram of an exemplary continual meta-learner (CML) system, according to embodiments of the disclosure.
  • FIG. 2 illustrates a block diagram of an exemplary AI system for training a meta-learning model using CML framework, according to embodiments of the disclosure.
  • FIG. 3 illustrates a schematic diagram of an exemplary CML framework, according to embodiments of the disclosure.
  • FIG. 4 illustrates a flowchart of an exemplary method for training the continual meta-learning model, according to embodiments of the disclosure.
  • FIG. 5 illustrates a flowchart of an exemplary method for training the CML framework during the fast-learning phase, according to embodiments of the disclosure.
  • FIG. 6 illustrates a flowchart of an exemplary method for training the CML framework during the meta-update phase, according to embodiments of the disclosure.
  • a meta-learner may rapidly learn new concepts from a small dataset with only a few samples (e.g., 5 samples) for each class.
  • Existing approaches to meta-learning include metric-based methods (i.e., exploiting the similarity between samples of different classes for meta-learning) , optimization-based approaches (i.e., optimizing model parameters) , etc.
  • Continual learning focuses on how to achieve a good trade-off between learning new concepts and retaining old knowledge over a long time, which is known as the stability-plasticity dilemma.
  • Several regularization-based methods have been proposed for continual learning, imposing regularization terms to restrain the update of the model parameters.
  • none of the existing continual learning approaches address the issue of learning new concepts with limited labeled data, which is a crucial challenge for achieving human-level Artificial Intelligence (AI) .
  • none of the existing solutions to the continual learning problems can be directly applied to meta-learning tasks. As such, the prior solutions cannot provide a deep learning model that handles a new task from very limited training data without forgetting old knowledge.
  • aspects of the present disclosure solve the above-mentioned deficiencies by providing mechanisms (e.g., methods, systems, media, etc. ) for a novel model-agnostic meta-learner, CML, which integrates metric-based classification and a memory-based mechanism along with adversarial learning into an optimization-based meta-learning framework.
  • the flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments in the present disclosure. It is to be expressly understood that the operations of the flowcharts may or may not be implemented in order. Conversely, the operations may be implemented in inverted order, or simultaneously. Moreover, one or more other operations may be added to the flowcharts. One or more operations may be removed from the flowcharts.
  • While the system and method in the present disclosure are described primarily with regard to image classification, it should be understood that this is only one exemplary embodiment.
  • the system or method of the present disclosure may be applied to any other kind of deep learning tasks.
  • FIG. 1 illustrates a schematic diagram of an exemplary continual meta-learner (CML) system 100, according to embodiments of the disclosure.
  • CML system 100 is configured to perform continual meta-training for one or more networks using datasets from a database (e.g., the training database 130) .
  • CML system 100 may include components shown in FIG. 1, including a server 110, a network 120, a training database 130, and one or more user devices 140. It is contemplated that CML system 100 may include more or fewer components compared to those shown in FIG. 1.
  • the server 110 may be configured to process information and/or data relating to meta-learning tasks.
  • the server 110 may train neural networks for visual object classification, speech recognition, text processing, and other tasks.
  • the server 110 may be a single server, or a server group.
  • the server group may be centralized, or distributed (e.g., the server 110 may be a distributed system) .
  • the server 110 may be local or remote.
  • the server 110 may access information and/or data stored in training database 130 and/or user device (s) 140 via network 120.
  • the server 110 may be directly connected to the training database 130 and/or the user device (s) 140 to access stored information and/or data.
  • the server 110 may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
  • the server 110 may be implemented on a computing device having one or more components illustrated in FIG. 2 in the present disclosure.
  • the network 120 may facilitate exchange of information and/or data.
  • one or more components in the system 100 (e.g., the server 110, the training database 130, and the user device (s) 140) may send information and/or data to other component (s) in the system 100 via the network 120.
  • the server 110 may obtain/acquire a request for training a continual meta-learning model from the user device (s) 140 via the network 120.
  • the network 120 may be any type of wired or wireless network, or combination thereof.
  • the network 120 may include a cable network, a wireline network, an optical fiber network, a telecommunications network, an intranet, an Internet, a local area network (LAN) , a wide area network (WAN) , a wireless local area network (WLAN) , a metropolitan area network (MAN) , a public telephone switched network (PSTN) , a Bluetooth™ network, a ZigBee™ network, a near field communication (NFC) network, a global system for mobile communications (GSM) network, a code-division multiple access (CDMA) network, a time-division multiple access (TDMA) network, a general packet radio service (GPRS) network, an enhanced data rate for GSM evolution (EDGE) network, a wideband code division multiple access (WCDMA) network, a high speed downlink packet access (HSDPA) network, a long term evolution (LTE) network, a user datagram protocol (UDP) network, or the like, or any combination thereof.
  • the user device (s) 140 may be operated by one or more users to perform various functions associated with the user device (s) 140. For example, a user of the user device (s) 140 may use the user device (s) 140 to send a request for himself/herself or another user, or receive information or instructions from the server 110. In some embodiments, the term “user” and “user device” may be used interchangeably.
  • the user device (s) 140 may include a diverse variety of device types and are not limited to any particular type of device. Examples of user device (s) 140 can include but are not limited to a laptop 140-1, a stationary computer 140-2, a tablet computer 140-3, a mobile device 140-4, or the like, or any combination thereof.
  • stationary computer 140-2 can include desktop computers, work stations, personal computers, thin clients, terminals, game consoles, personal video recorders (PVRs) , set-top boxes, or the like.
  • the mobile device 140-4 may include a smart home device, a wearable device, a smart mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof.
  • the smart home device may include a smart lighting device, a control device of an intelligent electrical apparatus, a smart monitoring device, a smart television, a smart video camera, an interphone, or the like, or any combination thereof.
  • the wearable device may include a smart bracelet, a smart footgear, a smart glass, a smart helmet, a smart watch, a smart clothing, a smart backpack, a smart accessory, or the like, or any combination thereof.
  • the smart mobile device may include a smartphone, a personal digital assistant (PDA) , a gaming device, a navigation device, a point of sale (POS) device, or the like, or any combination thereof.
  • the virtual reality device and/or the augmented reality device may include a virtual reality helmet, a virtual reality glass, a virtual reality patch, an augmented reality helmet, an augmented reality glass, an augmented reality patch, or the like, or any combination thereof.
  • the virtual reality device and/or the augmented reality device may include a Google Glass, an Oculus Rift, a Hololens, a Gear VR, etc.
  • the server 110 may include a processing engine 112.
  • the processing engine 112 may process information and/or data relating to the meta-learning tasks to perform one or more functions described in the present disclosure. For example, the processing engine 112 may receive a request from the user device (s) 140 to generate a trained meta-learning model 105 based on the request.
  • the processing engine 112 may include one or more processing engines (e.g., single-core processing engine (s) or multi-core processor (s) ) .
  • the processing engine 112 may include a central processing unit (CPU) , an application-specific integrated circuit (ASIC) , an application-specific instruction-set processor (ASIP) , a graphics processing unit (GPU) , a physics processing unit (PPU) , a digital signal processor (DSP) , a field programmable gate array (FPGA) , a programmable logic device (PLD) , a controller, a microcontroller unit, a reduced instruction-set computer (RISC) , a microprocessor, or the like, or any combination thereof.
  • server 110 may further include a deep learning training device 114, which may communicate with training database 130 to receive one or more sets of tasks 101.
  • Each task may be different, for example, task 101-1 may be a task of classifying images of animals; task 101-2 may be a task of classifying images of fruits.
  • Deep learning training device 114 may use training data corresponding to each task 101 that is received from training database 130 to train a model based on the CML framework (discussed in detail in connection with FIG. 3) , so that the trained meta-learning model 105 may be able to adapt to a large or infinite number of tasks.
  • Deep learning training device 114 may be implemented with hardware specially programmed by software that performs the training process.
  • deep learning training device 114 may include a processor and a non-transitory computer-readable medium (discussed in detail in connection with FIG. 2) .
  • the processor may conduct the training by performing instructions of a training process stored in the computer-readable medium.
  • Deep learning training device 114 may additionally include input and output interfaces to communicate with training database 130, network 120, and/or a user interface (not shown) .
  • the user interface may be used for selecting sets of training data, adjusting one or more parameters of the training process, selecting or modifying a framework of the learning model, and/or manually or semi-automatically providing diagnosis results associated with a sample patient description for training.
  • deep learning training device 114 may generate the trained meta-learning model 105 through a CML framework (discussed in detail in connection with FIGs. 3-6) , which may include more than one convolutional neural network (CNN) model.
  • Trained meta-learning model 105 may be trained using supervised and/or reinforcement learning.
  • the architecture of a trained meta-learning model 105 includes a stack of distinct layers that transform the input into the output.
  • “training” a learning model refers to determining one or more parameters of at least one layer in the learning model.
  • a convolutional layer of a CNN model may include at least one filter or kernel.
  • One or more parameters, such as kernel weights, size, shape, and structure, of the at least one filter may be determined by e.g., a backpropagation-based training process.
  • FIG. 2 illustrates a block diagram of an exemplary AI system 200 for training a meta-learning model using a CML framework, according to embodiments of the disclosure.
  • AI system 200 may be an embodiment of deep learning training device 114.
  • AI system 200 may include a communication interface 202, a processor 204, a memory 206, and a storage 208.
  • AI system 200 may have different modules in a single device, such as an integrated circuit (IC) chip (e.g., implemented as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA) ) , or separate devices with dedicated functions.
  • one or more components of AI system 200 may be located in a cloud, or may be alternatively in a single location (such as inside a mobile device) or distributed locations. Components of AI system 200 may be in an integrated device, or distributed at different locations but communicate with each other through a network (not shown) . Consistent with the present disclosure, AI system 200 may be configured to train the meta-learning model 105 based on data received from the training database 130.
  • Communication interface 202 may send data to and receive data from components such as training database 130 via communication cables, a Wireless Local Area Network (WLAN) , a Wide Area Network (WAN) , wireless networks such as radio waves, a cellular network, and/or a local or short-range wireless network (e.g., Bluetooth TM ) , or other communication methods.
  • communication interface 202 may include an integrated service digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection.
  • communication interface 202 may include a local area network (LAN) card to provide a data communication connection to a compatible LAN.
  • Wireless links can also be implemented by communication interface 202.
  • communication interface 202 can send and receive electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
  • communication interface 202 may receive meta-training set (s) 101, where each meta-training set 101 corresponds to a different task and the sets arrive in sequence.
  • In a regular machine learning problem, a model (e.g., a function F (. ) ) with parameters θ is trained on a training dataset D train and tested on a testing dataset D test .
  • the training database 130 stores a number of meta-training sets, each of which contains multiple regular datasets, and each dataset is split into D train and D test as in regular machine learning.
  • Communication interface 202 may further provide the received data to memory 206 and/or storage 208 for storage or to processor 204 for processing.
  • Processor 204 may include any appropriate type of general-purpose or special-purpose microprocessor, digital signal processor, or microcontroller. Processor 204 may be configured as a separate processor module dedicated to training a learning model. Alternatively, processor 204 may be configured as a shared processor module for performing other functions in addition to model training.
  • Memory 206 and storage 208 may include any appropriate type of mass storage provided to store any type of information that processor 204 may need to operate.
  • Memory 206 and storage 208 may be a volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other type of storage device or tangible (i.e., non-transitory) computer-readable medium including, but not limited to, a ROM, a flash memory, a dynamic RAM, and a static RAM.
  • Memory 206 and/or storage 208 may be configured to store one or more computer programs that may be executed by processor 204 to perform functions disclosed herein.
  • memory 206 and/or storage 208 may be configured to store program (s) that may be executed by processor 204 to train and generate the trained meta-learning model 105.
  • Memory 206 and/or storage 208 may be further configured to store information and data used by processor 204.
  • memory 206 and/or storage 208 may also store intermediate data such as feature maps output by layers of the learning model, and optimization loss functions, etc.
  • Memory 206 and/or storage 208 may additionally store various learning models including their model parameters, such as a CNN model and other types of neural network models. The various types of data may be stored permanently, removed periodically, or disregarded immediately after the data is processed.
  • processor 204 may include multiple modules, such as a Neural Networks (NNs) processing unit 242, an updating unit 244, an optimization unit 246, and the like.
  • These modules can be hardware units (e.g., portions of an integrated circuit) of processor 204 designed for use with other components or software units implemented by processor 204 through executing at least part of a program.
  • the program may be stored on a computer-readable medium, and when executed by processor 204, it may perform one or more functions.
  • FIG. 2 shows units 242-246 all within one processor 204, it is contemplated that these units may be distributed among different processors located closely or remotely with each other.
  • FIG. 3 illustrates a schematic diagram of an exemplary CML framework 300, according to embodiments of the disclosure.
  • CML framework 300 may include a plurality of components, such as a teacher network 310, a student network 320, a classifier 330, and a discriminator 340.
  • training dataset 302 and test dataset 304 are used during the meta-training and meta-testing phases, respectively, and the class labels used during meta-training are not overlapping with those used during meta-testing.
  • a number of N-way, K-shot tasks are used for illustration.
  • N-way, K-shot is a typical setting for few-shot learning, which refers to the practice of feeding a learning model a small amount of training data, contrary to the normal practice of using a large amount of data.
  • the problem of N-way classification is set up as follows: select N unseen classes, provide the model with K different instances of each of the N classes, and evaluate the model's ability to classify new instances within the N classes.
  • each D train contains K samples for each of N classes, and D test contains samples for evaluation.
  • the goal is to train the model to be able to adapt to a distribution over tasks
  • during meta-training, a task is sampled from the task distribution, the model is trained with K samples and feedback from the corresponding loss on that task, and is then tested on new samples from the same task.
  • during meta-testing, new tasks are sampled from the same distribution, and meta-performance is measured by the model's performance after learning from K samples.
  • the goal is to train a model M fine-tune to classify images with unknown labels.
  • These images belong to classes P 1 ~P 5 ; each class contains 5 labeled sample images for training the model M fine-tune and 15 labeled samples to test the trained model M fine-tune .
  • the dataset further includes sample images belonging to another 10 classes C 1 ~C 10 , each of which contains 30 labeled samples to assist training the meta-learning model M meta .
  • sample images included in classes C 1 ~C 10 are first used to train the meta-learning model M meta , then sample images included in classes P 1 ~P 5 are used to fine-tune M meta to generate the final model M fine-tune .
  • C 1 ~C 10 are the meta-training classes, and the 300 samples included in classes C 1 ~C 10 are the meta-training data, which are used to train M meta .
  • classes P 1 ~P 5 are the meta-test classes, and the 100 samples included in classes P 1 ~P 5 are the meta-test data, which are used to train and test M fine-tune .
  • the teacher network 310 may take an image x as input and extract its features to form a feature map M 306 with a dimension of z, which is then pushed to the classifier 330 (P (. ) ) and the discriminator 340.
  • the teacher network 310 may be a CNN.
  • the teacher network 310 may be a Residual Network (ResNet) . ResNet inserts shortcut connections into the plain network and turns the network into its counterpart residual version. ResNets may have variable sizes, depending on the size of each layer and the number of layers. Each of the layers follows the same pattern, performing 3×3 convolution with a fixed feature map dimension.
  • the ResNet used as the teacher network 310 in the present disclosure may be a ResNet18 (that is, the residual network is 18 layers deep) .
  • the student network 320 is the core component of the CML framework 300.
  • the student network 320 may take all training images x and generate an all-class matrix V 308.
  • Each row of V corresponds to a class vector V l with a dimension of z, which can be considered as a representation of images of the l th class.
  • the student network 320 may be a CNN.
  • the CNN may include four convolutional modules, each of which contains a 3×3 convolutional layer followed by batch normalization, a ReLU nonlinearity and 2×2 max pooling, including 64 filters in the first two convolutional layers and 128 filters in the last two convolutional layers.
  • the classifier 330 may take the feature map M 306 of an image and the all-class matrix V 308 as input and predict which class this image belongs to (the prediction 312) .
  • Classification is a supervised learning approach in which the model is learned from the data input and uses this learning to classify new observations.
  • the classification algorithm may be a linear classifier, a nearest neighbor, a support vector machine, a decision tree, a boosted tree, neural networks, or the like.
  • the discriminator 340 may distinguish between the feature map of an image (belonging to class l) and the corresponding class vector V l .
  • the student network 320 and the discriminator 340 function similarly to the generative network (generator) and the discriminative network (discriminator) , respectively, of a generative adversarial network (GAN) .
  • GANs are deep neural network architectures.
  • the generator of a GAN learns to generate plausible data, while the discriminator of a GAN learns to distinguish the generator’s fake data from real data.
  • the generator part of a GAN learns to create fake data by incorporating feedback from the discriminator. It learns to make the discriminator classify its output as real.
  • the discriminator in a GAN is simply a classifier. It tries to distinguish real data from the data created by the generator. It could use any network architecture appropriate to the type of data it is classifying.
  • In the present disclosure, as described in FIG. 3, the student network 320 may be considered as the generator of a GAN that is used to generate the all-class matrix V 308, while the discriminator 340 may be considered as the discriminator of a GAN and helps the student network 320 to generate a more representative all-class matrix via adversarial learning.
  • FIG. 4 illustrates a flowchart of an exemplary method 400 for training the continual meta-learning model, according to embodiments of the disclosure.
  • Method 400 may include steps S402-S418 as described below. It is to be appreciated that some of the steps may be optional to perform the disclosure provided herein. Further, some of the steps may be performed simultaneously, or in a different order than shown in FIG. 4.
  • communication interface 202 may receive a plurality of tasks in a sequence.
  • each task may include a training dataset 302 (D train ) and a test dataset 304 (D test ) .
  • the CML framework 300 may take D train as input and perform fast-learning.
  • the CML framework may take a batch of tasks as input and learn good initializations of the student network 320 and the discriminator 340. The entire training process consists of two phases: fast-learning and meta-update.
  • During fast-learning, the CML framework 300 may learn from D train of each individual task of the batch; during meta-update, the CML framework 300 may learn from D test across all tasks of the batch. Both fast-learning and meta-update will be described in more detail in connection with FIGs. 5 and 6.
  • NNs processing unit 242 may input the image x from D train and determine a first loss function on the classifier.
  • NNs processing unit 242 may input the image x from D train and determine a second loss function on the discriminator.
  • optimization unit 246 may train the student network by minimizing both the first loss function obtained in step S404 and the second loss function obtained in step S406.
  • one or more parameters corresponding to the student network model are updated by the updating unit 244.
  • In step S410, NNs processing unit 242 may generate the all-class matrix V for the current task.
  • the class vector V l may be directly obtained from the input image x. In some embodiments, under a K-shot setting, the class vector V l may be obtained by taking the mean values over the K samples. As such, the feature map M and each class vector V l can have the same dimension. In some embodiments, the all-class matrix V is then constructed by stacking the class vectors together, where V has a dimension of N×z, N being the number of classes and z the dimension of each class vector V l .
  • NNs processing unit 242 may output the all-class matrix V for the current task and store it into memory 206.
  • Each class vector only needs a small space. For example, when the dimension of the class vector V l is 512, each class vector only needs a space of 4KB since it consists of 512 64-bit numbers. As such, the proposed CML framework 300 has a low memory footprint and can improve the efficiency of computer memory storage.
  • NNs processing unit 242 may input the image x from D test and generate the feature map M.
  • the teacher network 310 may be pre-trained on D meta-train so that the teacher network 310 may gain enough knowledge to serve as a “teacher. ”
  • when the teacher network 310 is a ResNet18, the last fully-connected layer is discarded and the feature extractor is kept as the teacher network 310.
  • NNs processing unit 242 may retrieve the all-class matrix V obtained in step S412 from the memory 206.
  • NNs processing unit 242 may predict the class for the input image x by calculating the similarity between its feature map M and each class vector V l .
  • the classifier 330 may calculate the cosine similarity between M of image x and each class vector V l as the prediction score for each class, according to equation (1) :
  • Cos (. ) represents a calculation of the cosine similarity.
  • the cosine similarity is used because it eliminates the interference resulting from different orders of magnitude corresponding to different classes.
  • a softmax function may be used to normalize prediction scores.
  • FIG. 5 illustrates a flowchart of an exemplary method 500 for training the CML framework 300 during the fast-learning phase, according to embodiments of the disclosure.
  • Method 500 may be implemented by CML framework 300 and particularly processor 204 or a separate processor not shown in FIG. 5.
  • Method 500 may include steps S502-S516 as described below. It is to be appreciated that some of the steps may be optional to perform the disclosure provided herein. Further, some of the steps may be performed simultaneously, or in a different order than shown in FIG. 5.
  • an N-way, K-shot setting is used for the training process described in FIGs. 5 and 6. That is, K-shot classification tasks use K input/output pairs from each class, for a total of N×K datapoints for N-way classification.
  • communication interface 202 may receive a plurality of tasks from a distribution over tasks p in a sequence.
  • NNs processing unit 242 may calculate the first cross-entropy loss on the classifier 330, according to equation (2) :
  • (x, y) represents an image/label pair from training dataset 302 D train ;
  • X represents the corresponding set of images;
  • ⁇ s represents randomly selected initial parameters for the student network 320;
  • NNs processing unit 242 may calculate the first binary-entropy loss corresponding to the discriminator 340, according to equation (3) :
  • R (V, l) represents a function that returns the l th row of V, (i.e., V l ) ; represents the function corresponding to the discriminator 340.
  • the discriminator 340 described in FIG. 5 is used to distinguish each class vector V l generated by the student network 320 from true samples generated by the teacher network 310 during the training of the student network 320. Specifically, the discriminator 340 takes each class vector V l as input and calculates the probability of the input being a true sample.
  • the student network may be considered as a generator used to generate the all-class matrix V, while the discriminator may help the student network 320 generate a more representative all-class matrix via adversarial learning. Furthermore, the loss l s, p may make training the student network 320 and the discriminator 340 more stable and prevent model collapse.
  • a multilayer perceptron may be used to implement the discriminator, which contains two fully-connected layers. The first fully connected layer is followed by batch normalization and a ReLU nonlinearity; and the second fully-connected layer is followed by the sigmoid function that normalizes output.
  • the optimization unit 246 may train the student network 320 model by minimizing the sum of losses l s, p and l s, d by gradient descent, according to equation (4) :
  • the student network 320 is trained for each task independently but each time starts from the same parameters ⁇ s , where ⁇ s is a randomly initialized parameter.
  • In step S510, the updating unit 244 may update the parameters θ s to θ′ i, s with gradient descent for each task, according to equation (5) :
  • ⁇ s represents a predetermined step size hyperparameter for fast-learning; represents the loss from task
  • the model of the student network 320 is trained with K samples and feedback from the corresponding loss from task
  • In step S512, the NNs processing unit 242 may calculate the second binary-entropy loss corresponding to the discriminator 340, according to equation (6) :
  • the discriminator 340 may take the feature map M 306 from the teacher network 310 as the real (i.e., true) input and the all-class matrix 308 from the student network 320 as the fake (i.e., false) input.
  • In step S514, the NNs processing unit 242 may train the discriminator 340 in an adversarial manner, according to equation (7) :
  • In step S516, the updating unit 244 may update the parameters θ d to θ′ i, d with gradient descent for each task, according to equation (8) :
  • ⁇ d represents a predetermined step size hyperparameter for fast-learning; represents the loss from task
  • the discriminator model is trained with K samples and feedback from corresponding loss from task
  • FIG. 6 illustrates a flowchart of an exemplary method 600 for training the CML framework 300 during the meta-update phase, according to embodiments of the disclosure.
  • Method 600 may be implemented by AI system 200 and particularly processor 204 or a separate processor not shown in FIG. 6.
  • Method 600 may include steps S602-S608 as described below. It is to be appreciated that some of the steps may be optional to perform the disclosure provided herein. Further, some of the steps may be performed simultaneously, or in a different order than shown in FIG. 6.
  • In step S602, NNs processing unit 242 may calculate the second cross-entropy loss on the classifier 330, according to equation (9) :
  • (x, y) represents an image/label pair from test dataset 304 (D test ) ;
  • X represents the corresponding set of images; represents the function of the student network 320, which generates the all-class matrix V 308; represents the function of the teacher network 310, which generates the feature map M 306.
  • In step S604, the updating unit 244 may optimize the parameters θ s of the student network 320 using a one-step gradient descent, according to equation (10) :
  • ⁇ s represents a predetermined step size hyperparameter for meta-update; represents the loss from task
  • In step S606, the NNs processing unit 242 may calculate the third binary-entropy loss corresponding to the discriminator, according to equation (11) :
  • In step S608, the updating unit 244 may optimize the parameters θ d using a one-step gradient descent, according to equation (12) :
  • ⁇ d represents a predetermined step size hyperparameter for meta-update; represents the loss from task
  • the model of the discriminator 340 is tested on new samples from the task so that the model is improved by considering how the test error on new data changes with respect to the parameters.
  • the meta-update is performed on the parameters θ s and θ d , rather than θ′ i, s and θ′ i, d , while the losses l s, p and l d are computed with the updated parameters θ′ i, s and θ′ i, d after fast-learning.
  • the CML framework 300 can learn good initialization for both the student network 320 and the discriminator 340 such that it can quickly learn to deal with a new task during the meta-testing phase.
  • the computer-readable medium may include volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage devices.
  • the computer-readable medium may be the storage device or the memory module having the computer instructions stored thereon, as disclosed.
  • the computer-readable medium may be a disc or a flash drive having the computer instructions stored thereon.

Abstract

Embodiments of the disclosure provide artificial intelligence systems and methods for training a continual meta-learner (CML) framework. An exemplary artificial intelligence system includes a storage device and a processor. A plurality of tasks are received at the CML framework in a sequence, where each task comprises a training dataset and a test dataset. For each task, a fast-learning is performed by training the student network and the discriminator associated with the CML framework based on the training dataset associated with the task. One or more initial parameters associated with the student network and the discriminator are updated to generate updated initial parameters corresponding to the one or more initial parameters. A meta-update is performed to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.

Description

METHOD AND SYSTEM FOR CONTINUAL META-LEARNING
TECHNICAL FIELD
The present disclosure relates generally to systems and methods for machine learning, and more particularly to machine learning models using continual meta-learning techniques.
BACKGROUND
With the rapid development in the field of Artificial Intelligence (AI) , deep learning techniques have achieved tremendous success on various computer vision tasks. To train a deep-learning model, a great amount of labeled data is needed. The trained deep-learning model may then be used only for a specific task (e.g., classifying different types of animals) . Moreover, deep models may suffer from the problem of “forgetting. ” That is, when a deep-learning model is first trained on one task and then trained on a second task, it may forget how to perform the first task. Hence, it is desired to improve deep learning such that a deep model may learn to handle a new task from limited training data without forgetting old knowledge.
Embodiments of the disclosure address the above problems by providing a continual meta-learner (CML) framework, which can keep learning new concepts effectively and quickly from limited labeled data without forgetting old knowledge.
SUMMARY
Embodiments of the disclosure provide artificial intelligence systems and methods for training a continual meta-learner (CML) framework. An exemplary artificial intelligence system includes a storage device and a processor. The storage device is configured to store training datasets and test datasets associated with a plurality of tasks. The processor is configured to train the CML framework, including a teacher network, a student network, a classifier and a discriminator. To train the CML framework, the processor is configured to receive a plurality of tasks in a sequence, where each task comprises a training dataset and a test dataset. For each task, the processor is configured to perform a fast-learning by training the student network and the discriminator associated with the CML framework based on the training dataset associated with the task. The processor is also configured to update one or more initial parameters associated with the student network and the discriminator to generate updated initial parameters corresponding to the one or more initial parameters. The processor is further configured to perform a meta-update to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.
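By way of illustration only, the two-phase training described above follows the familiar MAML-style pattern of per-task fast-learning followed by a meta-update of the shared initial parameters. The sketch below shows only that optimization pattern; the toy linear model, the randomly generated 5-way 5-shot tasks, the batch of four tasks, and the step sizes are assumptions that stand in for the CML-specific networks and losses described later.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for the student network: a single linear layer on 64-d features.
w = torch.randn(5, 64, requires_grad=True)   # initial parameters (theta)
b = torch.zeros(5, requires_grad=True)
alpha, beta = 0.01, 0.001                    # fast-learning / meta-update step sizes
meta_opt = torch.optim.SGD([w, b], lr=beta)

def forward(x, w, b):
    return x @ w.t() + b

for step in range(100):                      # each step consumes a batch of tasks
    meta_opt.zero_grad()
    for _ in range(4):                       # 4 tasks per batch (an assumption)
        # A random 5-way 5-shot task: 25 training and 25 test samples.
        x_tr, y_tr = torch.randn(25, 64), torch.randint(0, 5, (25,))
        x_te, y_te = torch.randn(25, 64), torch.randint(0, 5, (25,))
        # Fast-learning: adapt the shared initialization on D_train of this task.
        inner_loss = F.cross_entropy(forward(x_tr, w, b), y_tr)
        gw, gb = torch.autograd.grad(inner_loss, (w, b), create_graph=True)
        w_i, b_i = w - alpha * gw, b - alpha * gb        # updated (adapted) parameters
        # Meta-update contribution: test loss computed with the adapted parameters,
        # differentiated with respect to the initial parameters.
        F.cross_entropy(forward(x_te, w_i, b_i), y_te).backward()
    meta_opt.step()                          # one-step update of the initializations
```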
In some embodiments, the processor is configured to receive the plurality of tasks in a sequence, where each task comprises a training dataset and a test dataset. The processor is configured to perform, for each task, a fast-learning by training the student network and the discriminator associated with the CML framework based on the training dataset associated with the task. The processor is configured to update one or more initial parameters associated with the student network and the discriminator to generate updated initial parameters corresponding to the one or more initial parameters. The processor is configured to perform a meta-update to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.
In some embodiments, the processor is configured to, prior to performing the fast-learning, pre-train the teacher network associated with the CML framework to generate a feature map.
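As a minimal sketch of such a teacher, the detailed description below indicates that the teacher network may be a ResNet18 that is pre-trained and whose last fully-connected layer is discarded, leaving only the feature extractor. The following uses torchvision's ResNet-18 as a convenient stand-in; the 64-class head used during pre-training is a hypothetical placeholder.

```python
import torch
import torch.nn as nn
from torchvision import models

# Teacher sketch: a ResNet18 pre-trained (on the meta-training data in the
# disclosure; not shown here), with the last fully-connected layer discarded.
teacher = models.resnet18(num_classes=64)   # hypothetical pre-training head
# ... pre-train the teacher on the meta-training classes here (not shown) ...
teacher.fc = nn.Identity()                  # discard the classification head
teacher.eval()                              # the teacher stays fixed afterwards

with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)    # dummy mini-batch of images
    feature_map = teacher(images)           # shape (4, 512): a z-dim feature map M per image
```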
In some embodiments, the processor is configured to train the student network by minimizing a sum of a first loss on a classifier and a second loss on the discriminator. The processor is configured to train the discriminator using a third loss on the discriminator.
In some embodiments, the processor is configured to calculate a first cross-entropy loss corresponding to the classifier using the training dataset. The processor is configured to calculate a first binary-entropy loss corresponding to the discriminator using the training dataset. The processor is configured to train the student network by minimizing the sum of the first cross-entropy loss corresponding to the classifier and the first binary-entropy loss corresponding to the discriminator.
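The exact loss formulas (equations (2)-(4) of the detailed description) are not reproduced in this text, so the sketch below renders the two terms in a common form consistent with the surrounding description: a cross-entropy over cosine-similarity prediction scores for the classifier, plus a generator-style binary cross-entropy term that rewards the discriminator scoring the class vectors as real. The function and tensor names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def student_fast_learning_loss(feature_map, labels, all_class_matrix, discriminator):
    """Sum of the classifier loss and the discriminator-based loss on D_train.

    feature_map:      (B, z) teacher features M for the training images
    labels:           (B,)   ground-truth class indices
    all_class_matrix: (N, z) matrix V produced by the student network
    discriminator:    module mapping a z-dim vector to a real/fake probability (sigmoid output)
    """
    # First cross-entropy loss: scores are cosine similarities between M and each V_l.
    scores = F.cosine_similarity(feature_map.unsqueeze(1),        # (B, 1, z)
                                 all_class_matrix.unsqueeze(0),   # (1, N, z)
                                 dim=-1)                          # -> (B, N)
    l_sp = F.cross_entropy(scores, labels)

    # First binary-entropy loss: push the discriminator to score V_l as "real",
    # i.e. the generator-style adversarial term (an assumed formulation).
    real = torch.ones(all_class_matrix.size(0), 1)
    l_sd = F.binary_cross_entropy(discriminator(all_class_matrix), real)

    return l_sp + l_sd
```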
In some embodiments, the processor is configured to generate an all-class matrix by the student network. The processor is configured to store the generated all-class matrix into a memory.
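A minimal sketch of building the all-class matrix follows, assuming (per the detailed description) that under a K-shot setting each class vector V l is the mean of the student's z-dimensional features over the K training images of class l, and that V stacks the N class vectors into an N×z matrix. The student here is any feature extractor returning z-dimensional vectors.

```python
import torch

def build_all_class_matrix(student, images, labels, num_classes):
    """Build the all-class matrix V (N x z) from the training images of a task."""
    feats = student(images)                        # (N*K, z) student features
    z = feats.size(1)
    V = torch.zeros(num_classes, z)
    for l in range(num_classes):
        V[l] = feats[labels == l].mean(dim=0)      # class vector V_l: mean over the K shots
    return V

# The matrix is small enough to keep in memory between phases, e.g.:
# memory["current_task"] = build_all_class_matrix(student, x_train, y_train, N)
```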
In some embodiments, the processor is configured to input the feature map as a real input of the discriminator. The processor is configured to retrieve the all-class matrix from the memory. The processor is configured to input the all-class matrix as a fake input of the discriminator. The processor is configured to calculate a second binary-entropy loss corresponding to the discriminator based on the real input and the fake input. The processor is configured to train the discriminator using the second binary-entropy loss.
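Since the corresponding equations are not reproduced in this text, the sketch below assumes the usual GAN discriminator objective with the stated real/fake roles: the teacher's feature map M is the real input and the rows of the all-class matrix V, retrieved from memory, are the fake input.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(discriminator, feature_map, all_class_matrix):
    """Second binary-entropy loss: feature map M as real input, matrix V as fake input."""
    d_real = discriminator(feature_map)                    # (B, 1) probabilities
    d_fake = discriminator(all_class_matrix.detach())      # (N, 1); stop gradients to the student
    loss_real = F.binary_cross_entropy(d_real, torch.ones_like(d_real))
    loss_fake = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    return loss_real + loss_fake
```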
In some embodiments, the processor is configured to calculate a cosine similarity between the feature map and each class vector associated with the all-class matrix by the classifier. The processor is configured to generate a prediction score for each class. The processor is configured to normalize a plurality of prediction scores by using a softmax function.
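A short sketch of this prediction path, assuming a z-dimensional feature map and an N×z all-class matrix (the dummy dimensions below are illustrative): the score for class l is the cosine similarity between M and V l, and the scores are normalized with a softmax.

```python
import torch
import torch.nn.functional as F

def classify(feature_map, all_class_matrix):
    """Predict a class for one image from its feature map M and the matrix V."""
    scores = F.cosine_similarity(feature_map.unsqueeze(0), all_class_matrix, dim=-1)  # (N,)
    probs = F.softmax(scores, dim=0)          # normalized prediction scores
    return torch.argmax(probs).item(), probs

# Example with dummy tensors (z = 512 feature dimension, N = 5 classes):
M = torch.randn(512)
V = torch.randn(5, 512)
pred, probs = classify(M, V)
```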
In some embodiments, the processor is configured to calculate a second cross-entropy loss corresponding to the classifier using the updated one or more initial parameters associated with the student network and the test dataset. The processor is configured to optimize the one or more initial parameters associated with the student network using a one-step gradient descent based on the second cross-entropy loss. The processor is configured to calculate a third binary-entropy loss corresponding to the discriminator using the test dataset and an updated model of the student network, wherein the updated model of the student network corresponds to the updated one or more initial parameters associated with the discriminator. The processor is configured to optimize the one or more initial parameters associated with the discriminator using the one-step gradient descent based on the third binary-entropy loss.
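Written in the standard MAML-style notation that the description paraphrases (meta-update step sizes β s and β d, fast-learned parameters θ′ i, s and θ′ i, d), the two one-step meta-updates plausibly take the form:

```latex
\theta_s \leftarrow \theta_s - \beta_s \,\nabla_{\theta_s} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}^{\text{cls}}_{\mathcal{T}_i}\big(f_{\theta'_{i,s}}\big),
\qquad
\theta_d \leftarrow \theta_d - \beta_d \,\nabla_{\theta_d} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}^{\text{dis}}_{\mathcal{T}_i}\big(D_{\theta'_{i,d}}\big)
```

where the per-task losses are the second cross-entropy loss and the third binary-entropy loss computed on the test dataset with the fast-learned parameters; the exact form of the corresponding equations in the disclosure may differ.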
In some embodiments, a model associated with the student network is a convolutional neural network.
In some embodiments, the discriminator is implemented by a multilayer perceptron.
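For concreteness, the following sketch instantiates both modules following the detailed description: four convolutional modules of 3×3 convolution, batch normalization, ReLU and 2×2 max pooling with 64, 64, 128, 128 filters for the student, and a two-layer multilayer perceptron whose first fully-connected layer is followed by batch normalization and ReLU and whose second is followed by a sigmoid for the discriminator. The pooling/projection head, the class-vector dimension z = 512, and the hidden width 256 are assumptions.

```python
import torch.nn as nn

def conv_module(in_ch, out_ch):
    # One convolutional module: 3x3 convolution, batch norm, ReLU, 2x2 max pooling.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

# Student network: four conv modules with 64, 64, 128, 128 filters; the final
# pooling/flatten/projection to a z-dimensional class vector is assumed here.
student = nn.Sequential(
    conv_module(3, 64),
    conv_module(64, 64),
    conv_module(64, 128),
    conv_module(128, 128),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(128, 512),          # project to the class-vector space (assumed z = 512)
)

# Discriminator: two fully-connected layers; BN + ReLU after the first, sigmoid after the second.
z = 512
discriminator = nn.Sequential(
    nn.Linear(z, 256),            # hidden width 256 is an assumption
    nn.BatchNorm1d(256),
    nn.ReLU(inplace=True),
    nn.Linear(256, 1),
    nn.Sigmoid(),
)
```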
In some embodiments, the one or more initial parameters associated with the student network and the discriminator are obtained by a random initialization.
Embodiments of the disclosure further provide a computer-implemented method for training a continual meta-learner (CML) framework. The CML framework includes a teacher network, a student network, a classifier and a discriminator. An exemplary computer-implemented method includes receiving, at a continual meta-learner (CML) framework, a plurality of tasks in a sequence, wherein each task comprises a training dataset and a test dataset. The method further includes performing, by a processor, for each task, a fast-learning by training a student network and a discriminator associated with the CML framework based on the training dataset associated with the task. The method also includes updating, by a processor, one or more initial parameters associated with the student network and the discriminator to generate updated initial parameters corresponding to the one or more initial parameters. The method yet further includes performing, by a processor, a meta-update to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.
Embodiments of the disclosure further provide a non-transitory computer-readable medium having instructions stored thereon that, when executed by a processor, cause the processor to perform a computer-implemented method for training a continual meta-learner (CML) framework. The CML framework includes a teacher network, a student network, a classifier and a discriminator. An exemplary computer-implemented method includes receiving, at a continual meta-learner (CML) framework, a plurality of tasks in a sequence, wherein each task comprises a training dataset and a test dataset. The method further includes performing, by a processor, for each task, a fast-learning by training a student network and a discriminator associated with the CML framework based on the training dataset associated with the task. The method also includes updating, by a processor, one or more initial parameters associated with the student network and the discriminator to generate updated initial parameters corresponding to the one or more initial parameters. The method yet further includes performing, by a processor, a meta-update to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.
Additional features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The features of the present disclosure may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities, and combinations set forth in the detailed examples discussed below.
BRIEF DESCRIPTION OF THE DRAWINGS
The present disclosure is further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:
FIG. 1 illustrates a schematic diagram of an exemplary continual meta-learner (CML) system, according to embodiments of the disclosure.
FIG. 2 illustrates a block diagram of an exemplary AI system for training a meta-learning model using CML framework, according to embodiments of the disclosure.
FIG. 3 illustrates a schematic diagram of an exemplary CML framework, according to embodiments of the disclosure.
FIG. 4 illustrates a flowchart of an exemplary method for training the continual meta-learning model, according to embodiments of the disclosure.
FIG. 5 illustrates a flowchart of an exemplary method for training the CML framework during the fast-learning phase, according to embodiments of the disclosure.
FIG. 6 illustrates a flowchart of an exemplary method for training the CML framework during the meta-update phase, according to embodiments of the disclosure.
DETAILED DESCRIPTION
The following description is presented to enable any person skilled in the art to make and use the present disclosure, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present disclosure is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the claims.
A meta-learner may rapidly learn new concepts from a small dataset with only a few samples (e.g., 5 samples) for each class. Existing approaches to meta-learning include metric-based methods (i.e., exploiting the similarity between samples of different classes for meta-learning), optimization-based methods (i.e., optimizing model parameters), etc. However, even though these deep learning models can significantly reduce the amount of labeled training data, none of these existing solutions addresses continual learning or the forgetting issue, i.e., partially or even completely forgetting what has already been learned.
Continual learning focuses on how to achieve a good trade-off between learning new concepts and retaining old knowledge over a long time, which is known as the stability-plasticity dilemma. Several regularization-based methods have been proposed for continual learning, imposing regularization terms to restrain the update of the model parameters. However, none of the existing continual learning approaches addresses the issue of learning new concepts with limited labeled data, which is a crucial challenge for achieving human-level Artificial Intelligence (AI). Moreover, because the setting of the continual learning problem is quite different from that of the meta-learning problem, none of the existing solutions to the continual learning problem can be directly applied to meta-learning tasks. As such, the prior solutions cannot provide a deep learning model that handles a new task from very limited training data without forgetting old knowledge. Aspects of the present disclosure solve the above-mentioned deficiencies by providing mechanisms (e.g., methods, systems, media, etc.) for a novel model-agnostic meta-learner, CML, which integrates metric-based classification and a memory-based mechanism, along with adversarial learning, into an optimization-based meta-learning framework.
The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting. As used herein, the singular forms “a, ” “an, ” and “the” may be intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the  terms “comprises, ” “comprising, ” “includes, ” and/or “including” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
These and other features, and characteristics of the present disclosure, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, may become more apparent upon consideration of the following description with reference to the accompanying drawing (s) , all of which form a part of this specification. It is to be expressly understood, however, that the drawing (s) are for the purpose of illustration and description only and are not intended to limit the scope of the present disclosure. It is understood that the drawings are not to scale.
The flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments in the present disclosure. It is to be expressly understood, the operations of the flowchart may or may not be implemented in order. Conversely, the operations may be implemented in inverted order, or simultaneously. Moreover, one or more other operations may be added to the flowcharts. One or more operations may be removed from the flowcharts.
Moreover, while the system and method in the present disclosure are described primarily in regard to image classification, it should be understood that this is only one exemplary embodiment. The system or method of the present disclosure may be applied to any other kind of deep learning task.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
FIG. 1 illustrates a schematic diagram of an exemplary continual meta-learner (CML) system 100, according to embodiments of the disclosure. Consistent with the present disclosure, CML system 100 is configured to perform continual meta-training for one or more networks using datasets from a database (e.g., the training database 130). In some embodiments, CML system 100 may include the components shown in FIG. 1, including a server 110, a network 120, a training database 130, and one or more user devices 140. It is contemplated that CML system 100 may include more or fewer components than those shown in FIG. 1.
The server 110 may be configured to process information and/or data relating to meta-learning tasks. For example, the server 110 may train neural networks for visual object classification, speech recognition, text processing, and other tasks. In some embodiments, the server 110 may be a single server, or a server group. The server group may be centralized, or distributed (e.g., the server 110 may be a distributed system). In some embodiments, the server 110 may be local or remote. For example, the server 110 may access information and/or data stored in training database 130 and/or user device (s) 140 via network 120. As another example, the server 110 may be directly connected to the training database 130 and/or the user device (s) 140 to access stored information and/or data. In some embodiments, the server 110 may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof. In some embodiments, the server 110 may be implemented on a computing device having one or more components illustrated in FIG. 2 in the present disclosure.
The network 120 may facilitate exchange of information and/or data. In some embodiments, one or more components in the system 100 (e.g., the server 110, the training database 130, and the user device (s) 140) may send and/or receive information and/or data to/from other component (s) in the system 100 via the network 120. For example, the server 110 may obtain/acquire a request for training a continual meta-learning model from the user device (s) 140 via the network 120. In some embodiments, the network 120 may be any type of wired or wireless network, or combination thereof. Merely by way of example, the network 120 may include a cable network, a wireline network, an optical fiber network, a telecommunications network, an intranet, an Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public telephone switched network (PSTN), a Bluetooth™ network, a ZigBee™ network, a near field communication (NFC) network, a global system for mobile communications (GSM) network, a code-division multiple access (CDMA) network, a time-division multiple access (TDMA) network, a general packet radio service (GPRS) network, an enhanced data rate for GSM evolution (EDGE) network, a wideband code division multiple access (WCDMA) network, a high speed downlink packet access (HSDPA) network, a long term evolution (LTE) network, a user datagram protocol (UDP) network, a transmission control protocol/Internet protocol (TCP/IP) network, a short message service (SMS) network, a wireless application protocol (WAP) network, an ultra wide band (UWB) network, an infrared ray, or the like, or any combination thereof.
The user device (s) 140 may be operated by one or more users to perform various functions associated with the user device (s) 140. For example, a user of the user device (s) 140 may use the user device (s) 140 to send a request for himself/herself or another user, or receive information or instructions from the server 110. In some embodiments, the terms “user” and “user device” may be used interchangeably.
In some embodiments, the user device (s) 140 may include a diverse variety of device types and are not limited to any particular type of device. Examples of user device (s) 140 can include but are not limited to a laptop 140-1, a stationary computer 140-2, a tablet computer 140-3, a mobile device 140-4, or the like, or any combination thereof. In some embodiments, stationary computer 140-2 can include desktop computers, work stations, personal computers, thin clients, terminals, game consoles,  personal video recorders (PVRs) , set-top boxes, or the like. In some embodiments, the mobile device 140-4 may include a smart home device, a wearable device, a smart mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof. In some embodiments, the smart home device may include a smart lighting device, a control device of an intelligent electrical apparatus, a smart monitoring device, a smart television, a smart video camera, an interphone, or the like, or any combination thereof. In some embodiments, the wearable device may include a smart bracelet, a smart footgear, a smart glass, a smart helmet, a smart watch, a smart clothing, a smart backpack, a smart accessory, or the like, or any combination thereof. In some embodiments, the smart mobile device may include a smartphone, a personal digital assistance (PDA) , a gaming device, a navigation device, a point of sale (POS) device, or the like, or any combination thereof. In some embodiments, the virtual reality device and/or the augmented reality device may include a virtual reality helmet, a virtual reality glass, a virtual reality patch, an augmented reality helmet, an augmented reality glass, an augmented reality patch, or the like, or any combination thereof. For example, the virtual reality device and/or the augmented reality device may include a Google Glass, an Oculus Rift, a Hololens, a Gear VR, etc.
In some embodiments, the server 110 may include a processing engine 112. The processing engine 112 may process information and/or data relating to the meta-learning tasks to perform one or more functions described in the present disclosure. For example, the processing engine 112 may receive a request from the user device (s) 140 to generate a trained meta-learning model 105 based on the request. In some embodiments, the processing engine 112 may include one or more processing engines (e.g., single-core processing engine (s) or multi-core processor (s) ) . Merely by way of example, the processing engine 112 may include a central processing unit (CPU) , an application-specific integrated circuit (ASIC) , an application-specific instruction-set processor (ASIP) , a graphics processing unit (GPU) , a physics processing unit (PPU) , a digital signal processor (DSP) , a field programmable gate  array (FPGA) , a programmable logic device (PLD) , a controller, a microcontroller unit, a reduced instruction-set computer (RISC) , a microprocessor, or the like, or any combination thereof.
As shown in FIG. 1, server 110 may further include a deep learning training device 114, which may communicate with training database 130 to receive one or more sets of tasks 101. Each task may be different; for example, task 101-1 may be a task of classifying images of animals, while task 101-2 may be a task of classifying images of fruits. Deep learning training device 114 may use training data corresponding to each task 101 that is received from training database 130 to train a model based on the CML framework (discussed in detail in connection with FIG. 3), so that the trained meta-learning model 105 may be able to adapt to a large or infinite number of tasks. Deep learning training device 114 may be implemented with hardware specially programmed by software that performs the training process. For example, deep learning training device 114 may include a processor and a non-transitory computer-readable medium (discussed in detail in connection with FIG. 2). The processor may conduct the training by performing instructions of a training process stored in the computer-readable medium. Deep learning training device 114 may additionally include input and output interfaces to communicate with training database 130, network 120, and/or a user interface (not shown). The user interface may be used for selecting sets of training data, adjusting one or more parameters of the training process, selecting or modifying a framework of the learning model, and/or manually or semi-automatically providing labels associated with sample data for training.
Consistent with some embodiments, deep learning training device 114 may generate the trained meta-learning model 105 through a CML framework (discussed in detail in connection with FIGs. 3-6), which may include more than one convolutional neural network (CNN) model. Trained meta-learning model 105 may be trained using supervised and/or reinforcement learning. The architecture of a trained meta-learning model 105 includes a stack of distinct layers that transform the input into the output. As used herein, “training” a learning model refers to determining one or more parameters of at least one layer in the learning model. For example, a convolutional layer of a CNN model may include at least one filter or kernel. One or more parameters, such as kernel weights, size, shape, and structure, of the at least one filter may be determined by, e.g., a backpropagation-based training process.
FIG. 2 illustrates a block diagram of an exemplary AI system 200 for training a meta-learning model using a CML framework, according to embodiments of the disclosure. Consistent with the present disclosure, AI system 200 may be an embodiment of deep learning training device 114. In some embodiments, as shown in FIG. 2, AI system 200 may include a communication interface 202, a processor 204, a memory 206, and a storage 208. In some embodiments, AI system 200 may have different modules in a single device, such as an integrated circuit (IC) chip (e.g., implemented as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA)), or separate devices with dedicated functions. In some embodiments, one or more components of AI system 200 may be located in a cloud, or may alternatively be in a single location (such as inside a mobile device) or distributed locations. Components of AI system 200 may be in an integrated device, or distributed at different locations but communicate with each other through a network (not shown). Consistent with the present disclosure, AI system 200 may be configured to train meta-learning model 105 based on data received from the training database 130.
Communication interface 202 may send data to and receive data from components such as training database 130 via communication cables, a Wireless Local Area Network (WLAN), a Wide Area Network (WAN), wireless networks such as radio waves, a cellular network, and/or a local or short-range wireless network (e.g., Bluetooth™), or other communication methods. In some embodiments, communication interface 202 may include an integrated service digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection. As another example, communication interface 202 may include a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links can also be implemented by communication interface 202. In such an implementation, communication interface 202 can send and receive electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
Consistent with some embodiments, communication interface 202 may receive meta-training set (s) 101, where each meta-training set 101 corresponds to a different task, and the tasks arrive in sequence. In a regular learning setting where a model (e.g., a function F(·)) is trained to map given samples x to the output y, the parameters θ of the model are trained on a training dataset D_train and a testing dataset D_test. Different from the regular learning setting, in the meta-learning setting described in the present disclosure, the training database 130 stores a number of meta-training sets D_meta-train, each of which contains multiple regular datasets, and each dataset D_i is split into D_train and D_test as in regular machine learning. Communication interface 202 may further provide the received data to memory 206 and/or storage 208 for storage or to processor 204 for processing.
Processor 204 may include any appropriate type of general-purpose or special-purpose microprocessor, digital signal processor, or microcontroller. Processor 204 may be configured as a separate processor module dedicated to training a learning model. Alternatively, processor 204 may be configured as a shared processor module for performing other functions in addition to model training.
Memory 206 and storage 208 may include any appropriate type of mass storage provided to store any type of information that processor 204 may need to operate. Memory 206 and storage 208 may be a volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other type of storage device or tangible (i.e., non-transitory) computer-readable medium including, but not  limited to, a ROM, a flash memory, a dynamic RAM, and a static RAM. Memory 206 and/or storage 208 may be configured to store one or more computer programs that may be executed by processor 204 to perform functions disclosed herein. For example, memory 206 and/or storage 208 may be configured to store program (s) that may be executed by processor 204 to train and generate the trained meta-learning model 105.
Memory 206 and/or storage 208 may be further configured to store information and data used by processor 204. In some embodiments, memory 206 and/or storage 208 may also store intermediate data such as feature maps output by layers of the learning model, optimization loss functions, etc. Memory 206 and/or storage 208 may additionally store various learning models including their model parameters, such as a CNN model and other types of neural network models. The various types of data may be stored permanently, removed periodically, or disregarded immediately after the data is processed.
As shown in FIG. 2, processor 204 may include multiple modules, such as a Neural Networks (NNs) processing unit 242, an updating unit 244, an optimization unit 246, and the like. These modules (and any corresponding sub-modules or sub-units) can be hardware units (e.g., portions of an integrated circuit) of processor 204 designed for use with other components or software units implemented by processor 204 through executing at least part of a program. The program may be stored on a computer-readable medium, and when executed by processor 204, it may perform one or more functions. Although FIG. 2 shows units 242-246 all within one processor 204, it is contemplated that these units may be distributed among different processors located closely or remotely with each other.
Units 242-246 are configured to train a meta-learning model using meta-training set (s) 101. FIG. 3 illustrates a schematic diagram of an exemplary CML framework 300, according to embodiments of the disclosure. Consistent with the present disclosure, CML framework 300 may include a plurality of components, such  as a teacher network 310, a student network 320, a classifier 330, and a discriminator 340.
In some embodiments, for a meta-learning setting, the training dataset 302 (D_meta-train) and the test dataset 304 (D_meta-test) are used during the meta-training and meta-testing phases, respectively, and the class labels in D_meta-train do not overlap with those of D_meta-test.
In the present disclosure, a number of N-way, K-shot tasks are used for illustration. N-way, K-shot is a typical setting for few-shot learning, which refers to the practice of feeding a learning model a small amount of training data, contrary to the normal practice of using a large amount of data. The problem of N-way classification is set up as follows: select N unseen classes, provide the model with K different instances of each of the N classes, and evaluate the model's ability to classify new instances within the N classes. For example, in each dataset D_i, D_train contains K samples for each of N classes, and D_test contains samples for evaluation. In a meta-learning scenario, the goal is to train the model to be able to adapt to a distribution over tasks p(T). During meta-training and in the K-shot learning setting, a task T_i is sampled from p(T), the model is trained with K samples and feedback from the corresponding loss L_Ti from T_i, and is then tested on new samples from T_i. At the end of meta-training, new tasks are sampled from p(T), and meta-performance is measured by the model's performance after learning from K samples.
For example, in a meta-learning scenario with a 5-way, 5-shot setting, the goal is to train a model M_fine-tune to classify images with unknown labels. These images belong to classes P1~P5; each class contains 5 labeled sample images for training the model M_fine-tune and 15 labeled samples for testing the trained model M_fine-tune. In addition to the labeled samples of classes P1~P5, the dataset further includes sample images belonging to another 10 classes C1~C10, each of which contains 30 labeled samples to assist in training the meta-learning model M_meta. During the meta-training process, sample images included in classes C1~C10 are first used to train the meta-learning model M_meta, and then sample images included in classes P1~P5 are used to fine-tune M_meta to generate the final model M_fine-tune. In this example, C1~C10 are the meta-training classes, and the 300 samples included in classes C1~C10 form D_meta-train, which is used to train M_meta. Similarly, classes P1~P5 are the meta-test classes, and the 100 samples included in classes P1~P5 form D_meta-test, which is used to train and test M_fine-tune. Based on the 5-way, 5-shot setting, during the process of training M_meta, 5 classes are randomly selected from classes C1~C10, and from each randomly selected class, 20 labeled samples are selected to form a task T_i. This task T_i is equivalent to a piece of training data in a regular deep learning model.
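As an illustration of how such N-way, K-shot tasks may be formed in practice, the following minimal sketch samples one task from a pool of labeled classes. The function name sample_task, the dictionary-based data layout, and the 15-query split are assumptions for illustration only and are not part of the disclosed framework.

```python
import random

def sample_task(samples_by_class, n_way=5, k_shot=5, n_query=15):
    # Sample one N-way, K-shot task: K support samples per class form D_train,
    # and n_query samples per class form D_test.
    classes = random.sample(sorted(samples_by_class), n_way)
    d_train, d_test = [], []
    for label, cls in enumerate(classes):
        picked = random.sample(samples_by_class[cls], k_shot + n_query)
        d_train += [(x, label) for x in picked[:k_shot]]
        d_test += [(x, label) for x in picked[k_shot:]]
    return d_train, d_test

# Toy usage: 10 meta-training classes (C1~C10) with 30 samples each, as in the example above.
pool = {"C%d" % i: list(range(30)) for i in range(1, 11)}
d_train, d_test = sample_task(pool, n_way=5, k_shot=5, n_query=15)
```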
In the continual learning setting of the present disclosure, different tasks arrive in a sequence T_1, T_2, …, T_n, so that when training a task T_i, only the training dataset D_train of T_i is accessible, while the training data of any previous task T_j (j < i) is not available. In this way, the present disclosure forms a new continual meta-learning problem. In this problem, D_meta-train and the meta-training phase are the same as those described in the aforementioned meta-learning setting. However, during the meta-testing phase of this problem, tasks arrive one by one in a sequence rather than in a batch, to ensure that the learner can quickly, effectively, and continuously learn new concepts without forgetting what it has already learned.
Returning to the illustration of FIG. 3: as illustrated, during the meta-testing phase, tasks arrive one by one in a sequence. Every time a new task T_i arrives, the CML framework 300 uses images from D_train to quickly learn to handle the new task, and then takes images from D_test as input and outputs the corresponding class labels. Each component of the CML framework is discussed in more detail below.
The teacher network 310 may take an image x as input and extract its features to form a feature map M 306 with a dimension of z, which is then pushed to the classifier 330 (P(·)) and the discriminator 340. In some embodiments, the teacher network 310 may be a CNN. In some embodiments, the teacher network 310 may be a Residual Network (ResNet). A ResNet inserts shortcut connections into a plain network and turns the network into its residual counterpart. ResNets may have variable sizes, depending on the size of each layer and the number of layers. Each of the layers follows the same pattern, performing 3 × 3 convolution with a fixed feature map dimension. For example, the ResNet used as the teacher network 310 in the present disclosure may be a ResNet18 (that is, the residual network is 18 layers deep). Given an input image x with a size of 84 × 84, the ResNet18 yields a feature map M with a dimension of z = 512 × 1 × 1 = 512.
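A minimal sketch of such a teacher, assuming a PyTorch/torchvision environment, is shown below. Dropping the final fully-connected layer of a ResNet18 (pre-training is omitted here, so the weights are randomly initialized) leaves a feature extractor that maps an 84 × 84 image to a 512-dimensional feature map M.

```python
import torch
import torch.nn as nn
from torchvision import models

# Teacher backbone: ResNet18 with the last fully-connected layer removed,
# keeping only the feature extractor (global average pooling included).
backbone = models.resnet18()
teacher = nn.Sequential(*list(backbone.children())[:-1])

x = torch.randn(1, 3, 84, 84)          # one 84 x 84 RGB input image
m = teacher(x).flatten(1)              # feature map M with dimension z = 512
print(m.shape)                         # torch.Size([1, 512])
```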
The student network 320 is the core component of the CML framework 300. The student network 320 may take all training images X and generate an all-class matrix V 308. Each row of V corresponds to a class vector V_l with a dimension of z, which can be considered a representation of images of the l-th class. In some embodiments, the student network 320 may be a CNN. For example, the CNN may include four convolutional modules, each of which contains a 3 × 3 convolutional layer followed by batch normalization, a ReLU nonlinearity, and 2 × 2 max pooling, with 64 filters in the first two convolutional layers and 128 filters in the last two convolutional layers. Given an input image x with a size of 84 × 84, this exemplary student network may yield a feature map with a dimension of z = 512 × 1 × 1 = 512, the same as ResNet18.
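A sketch of such a four-module student backbone is given below. The filter counts follow the description above; the final linear projection to the z = 512 class-vector dimension is an assumption added so that the output matches the teacher's feature dimension.

```python
import torch
import torch.nn as nn

def conv_module(in_ch, out_ch):
    # One module: 3x3 convolution -> batch normalization -> ReLU -> 2x2 max pooling.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

student = nn.Sequential(
    conv_module(3, 64), conv_module(64, 64),      # 64 filters in the first two modules
    conv_module(64, 128), conv_module(128, 128),  # 128 filters in the last two modules
    nn.Flatten(),
    nn.Linear(128 * 5 * 5, 512),                  # hypothetical projection to z = 512
)

v = student(torch.randn(5, 3, 84, 84))            # e.g., 5 support images -> 5 vectors of dim 512
```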
The classifier 330 may take the feature map M 306 of an image and the all-class matrix V 308 as input and predict which class the image belongs to (the prediction 312). Classification is a supervised learning approach in which a model is learned from input data and then used to classify new observations. In some embodiments, the classification algorithm may be a linear classifier, a nearest-neighbor classifier, a support vector machine, a decision tree, a boosted tree, a neural network, or the like.
The discriminator 340 may distinguish between the feature map of an image (belonging to class l) and the corresponding class vector V_l. The discriminator 340 and the student network 320 function similarly to the generative network (generator) and the discriminative network (discriminator) of a generative adversarial network (GAN).

GANs are deep neural architectures. The generator of a GAN learns to generate plausible data, while the discriminator of a GAN learns to distinguish the generator's fake data from real data. The generator learns to create fake data by incorporating feedback from the discriminator; it learns to make the discriminator classify its output as real. The discriminator in a GAN is simply a classifier: it tries to distinguish real data from the data created by the generator, and it can use any network architecture appropriate to the type of data it is classifying. In the present disclosure as described in FIG. 3, the student network 320 may be considered the generator of a GAN that is used to generate the all-class matrix V 308, while the discriminator 340 may be considered the discriminator of a GAN and helps the student network 320 generate a more representative all-class matrix via adversarial learning.
In some embodiments, units 242-246 of FIG. 2 may execute computer instructions to perform the training. For example, FIG. 4 illustrates a flowchart of an exemplary method 400 for training the continual meta-learning model, according to embodiments of the disclosure. Method 400 may include steps S402-S418 as described below. It is to be appreciated that some of the steps may be optional to perform the disclosure provided herein. Further, some of the steps may be performed simultaneously, or in a different order than shown in FIG. 4.
In step S402, communication interface 202 may receive a plurality of tasks in a sequence. In some embodiments, each task T_i may include a training dataset 302 (D_train) and a test dataset 304 (D_test). In some embodiments, when a new task T_i arrives during the meta-testing phase, the CML framework 300 may take D_train as input and perform fast-learning. In some embodiments, during the meta-training phase, the CML framework may take a batch of tasks sampled from p(T) as input and learn good initializations of the student network 320 and the discriminator 340. The entire training process consists of two phases: fast-learning and meta-update. During fast-learning, the CML framework 300 may learn from D_train of each individual task of the batch; during meta-update, the CML framework 300 may learn from D_test across all tasks of the batch. Both fast-learning and meta-update are described in more detail in connection with FIGs. 5 and 6.
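The two-phase schedule can be summarized by the toy sketch below, which replaces the CML networks and losses with a simple least-squares regression so that the structure of fast-learning followed by meta-update is visible in isolation. All names, step sizes, and the synthetic task generator are assumptions for illustration only.

```python
import torch

torch.manual_seed(0)
theta = torch.zeros(2, requires_grad=True)     # shared initial parameters (stand-in for theta_s)
alpha, beta = 0.05, 0.01                       # fast-learning / meta-update step sizes

def task_loss(params, x, y):
    return ((x @ params - y) ** 2).mean()      # placeholder for the task loss

for step in range(200):
    meta_grad = torch.zeros_like(theta)
    for _ in range(4):                         # a batch of tasks T_i ~ p(T)
        w = torch.randn(2)                     # each task is a different regression target
        x_tr, x_te = torch.randn(10, 2), torch.randn(10, 2)
        y_tr, y_te = x_tr @ w, x_te @ w        # D_train and D_test of the task
        # Fast-learning: one gradient step from the shared initialization.
        g = torch.autograd.grad(task_loss(theta, x_tr, y_tr), theta, create_graph=True)[0]
        theta_i = theta - alpha * g            # updated (task-specific) parameters
        # Accumulate the gradient of the D_test loss w.r.t. the *initial* parameters.
        meta_grad += torch.autograd.grad(task_loss(theta_i, x_te, y_te), theta)[0]
    with torch.no_grad():
        theta -= beta * meta_grad              # meta-update of the shared initialization
```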
In step S404, NNs processing unit 242 may input the image x from D_train and determine a first loss function on the classifier.

In step S406, NNs processing unit 242 may input the image x from D_train and determine a second loss function on the discriminator.

In step S408, optimization unit 246 may train the student network by minimizing both the first loss function obtained in step S404 and the second loss function obtained in step S406. In some embodiments, after the student network is trained, one or more parameters corresponding to the student network model are updated by the updating unit 244.
In step S410, NNs processing unit 242 may generate the all-class matrix V for the current task T_i. In some embodiments, under a 1-shot setting, if the input image belongs to the l-th class, the class vector V_l may be directly obtained from the input image x. In some embodiments, under a K-shot setting, the class vector V_l may be obtained by taking the mean values over the K samples of the class. As such, the feature map M and each class vector V_l have the same dimension. In some embodiments, the all-class matrix V is then constructed by stacking the class vectors together, so that V has a dimension of N × z, where N is the number of classes and z is the dimension of each class vector V_l.
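A minimal sketch of this construction under a K-shot setting is shown below; the helper name and the 5-way, 5-shot shapes are assumptions matching the earlier example.

```python
import torch

def build_all_class_matrix(embeddings, labels, n_classes):
    # embeddings: (N*K, z) support-image features; labels: (N*K,) class indices in [0, n_classes).
    v = torch.zeros(n_classes, embeddings.shape[1])
    for l in range(n_classes):
        v[l] = embeddings[labels == l].mean(dim=0)   # class vector V_l = mean of the K embeddings
    return v                                          # all-class matrix V with shape (N, z)

emb = torch.randn(25, 512)                            # 5-way, 5-shot support embeddings, z = 512
lab = torch.arange(5).repeat_interleave(5)            # class index of each support image
V = build_all_class_matrix(emb, lab, n_classes=5)     # each 512-dim row occupies about 4 KB
```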
In step S412, NNs processing unit 242 may output the all-class matrix V for the current task T_i and store it in memory 206. Each class vector only needs a small amount of space. For example, when the dimension of the class vector V_l is 512, each class vector only needs 4 KB of space, since it consists of 512 64-bit numbers. As such, the proposed CML framework 300 has a low memory footprint and can improve the efficiency of computer memory storage.
In step S414, NNs processing unit 242 may input the image x from D_test and generate the feature map M. In some embodiments, before meta-training, the teacher network 310 may be pre-trained on D_meta-train so that the teacher network 310 gains enough knowledge to serve as a “teacher.” In some embodiments, for example, where the teacher network is a ResNet18, after the teacher network 310 is pre-trained, the last fully-connected layer is discarded and the feature extractor is kept as the teacher network 310.
In step S416, NNs processing unit 242 may retrieve the all-class matrix V obtained in step S412 from the memory 206.
In step S418, NNs processing unit 242 may predict the class for the input image x by calculating the similarity between its feature map M and each class vector V_l.

In some embodiments, the classifier 330 may calculate the cosine similarity between M of image x and each class vector V_l as the prediction score for each class, according to equation (1):

P(M, V) = softmax(Cos(M, V^T))           eq. (1)

where Cos(·) represents a calculation of the cosine similarity. In the present disclosure, the cosine similarity is used because it eliminates the interference resulting from different orders of magnitude corresponding to different classes. Once the cosine similarities are obtained, a softmax function may be used to normalize the prediction scores.
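A sketch of the prediction step in equation (1), assuming PyTorch tensors for M and V, might look as follows.

```python
import torch
import torch.nn.functional as F

def predict(m, v):
    # m: (z,) feature map M of one image; v: (N, z) all-class matrix V.
    scores = F.cosine_similarity(m.unsqueeze(0), v, dim=1)   # Cos(M, V_l) for each class l
    return F.softmax(scores, dim=0)                          # normalized prediction scores

probs = predict(torch.randn(512), torch.randn(5, 512))
predicted_class = int(probs.argmax())                        # index of the most similar class vector
```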
FIG. 5 illustrates a flowchart of an exemplary method 500 for training the CML framework 300 during the fast-learning phase, according to embodiments of the disclosure. Method 500 may be implemented by AI system 200, and particularly by processor 204 or a separate processor not shown in FIG. 5. Method 500 may include steps S502-S516 as described below. It is to be appreciated that some of the steps may be optional to perform the disclosure provided herein. Further, some of the steps may be performed simultaneously, or in a different order than shown in FIG. 5. An N-way, K-shot setting is used for the training process described in FIGs. 5 and 6. That is, K-shot classification tasks use K input/output pairs from each class, for a total of N × K datapoints for N-way classification.
In step S502, communication interface 202 may receive a plurality of tasks T_i drawn from a distribution over tasks p(T) in a sequence.
In step S504, NNs processing unit 242 may calculate the first cross-entropy loss on the classifier 330, according to equation (2):

l_s,p(θ_s) = −Σ_{(x, y) ∈ D_train} log P_y(T(x), S_θs(X))           eq. (2)

where (x, y) represents an image/label pair from the training dataset 302 (D_train); P_y denotes the predicted probability of class y; X represents the corresponding set of images; θ_s represents randomly selected initial parameters for the student network 320; S_θs represents the corresponding function of the student network 320, which generates the all-class matrix V 308; and T represents the corresponding function of the pre-trained teacher network 310, which generates the feature map M 306.
In step S506, NNs processing unit 242 may calculate the first binary-entropy loss corresponding to the discriminator 340, according to equation (3):

l_s,d(θ_s) = −Σ_{(x, y) ∈ D_train} log D(R(S_θs(X), y))           eq. (3)

where S_θs generates the all-class matrix V; R(V, l) represents a function that returns the l-th row of V (i.e., V_l); and D(·) represents the function corresponding to the discriminator 340.
The discriminator 340 described in FIG. 5 is used to distinguish each class vector V_l generated by the student network 320 from true samples generated by the teacher network 310 during the training of the student network 320. Specifically, the discriminator 340 takes each class vector V_l as input and calculates the probability of the input being a true sample. The student network may be considered a generator used to generate the all-class matrix V, while the discriminator may help the student network 320 generate a more representative all-class matrix via adversarial learning. Furthermore, the loss l_s,p may make training the student network 320 and the discriminator 340 more stable and prevent model collapse. This is because the loss l_s,p may still improve the student network 320 even when the discriminator 340 makes a mistake, and the improvement on the student network 320 may then help train the discriminator 340. Hence, the losses l_s,p and l_s,d benefit from each other. In some embodiments, a multilayer perceptron (MLP) containing two fully-connected layers may be used to implement the discriminator. The first fully-connected layer is followed by batch normalization and a ReLU nonlinearity; the second fully-connected layer is followed by a sigmoid function that normalizes the output.
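A sketch of such an MLP discriminator is shown below; the hidden width of 256 is an assumption, as the description above only specifies the layer types.

```python
import torch
import torch.nn as nn

# Two fully-connected layers: the first followed by batch normalization and ReLU,
# the second followed by a sigmoid that normalizes the output to a probability.
discriminator = nn.Sequential(
    nn.Linear(512, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Linear(256, 1),
    nn.Sigmoid(),
)

inputs = torch.randn(8, 512)          # a batch of 512-dim feature maps or class vectors
p_true = discriminator(inputs)        # probability that each input is a "true" sample
```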
In step S508, the optimization unit 246 may train the student network 320 model by minimizing the sum of the losses l_s,p and l_s,d by gradient descent, according to equation (4):

min_{θ_s}  l_s,p(θ_s) + l_s,d(θ_s)           eq. (4)
In some embodiments, the student network 320 is trained for each task T_i independently, but each time starting from the same parameters θ_s, where θ_s is a randomly initialized parameter.
In step S510, the updating unit 244 may update the parameters θ_s to θ′_i,s with gradient descent for each task T_i, according to equation (5):

θ′_i,s = θ_s − α_s ∇_θs L_Ti(θ_s)           eq. (5)
where α_s represents a predetermined step size hyperparameter for fast-learning, and L_Ti represents the loss from task T_i. As such, the model of the student network 320 is trained with K samples and feedback from the corresponding loss L_Ti from task T_i.
In step S512, the NNs processing unit 242 may calculate the second binary-entropy loss corresponding to the discriminator 340, according to equation (6):

l_d(θ_d) = −Σ_{(x, y) ∈ D_train} [ log D_θd(T(x)) + log(1 − D_θd(V_y)) ]           eq. (6)

where x represents an image from the training dataset 302 (D_train), and V_y is the class vector generated by the student network 320 for the class y of x. In some embodiments, the discriminator 340 may take the feature map M 306 produced by the teacher network 310 as the real (i.e., true) input and the class vectors of the all-class matrix 308 produced by the student network 320 as the fake (i.e., false) input.
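A sketch of this real/fake loss, assuming the discriminator is any module ending in a sigmoid, is given below; the helper name and the stand-in tensors are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def discriminator_loss(d, teacher_features, class_vectors):
    # teacher_features: feature maps M from the teacher (real inputs);
    # class_vectors: the corresponding class vectors V_l from the student (fake inputs).
    real = d(teacher_features)
    fake = d(class_vectors)
    return (F.binary_cross_entropy(real, torch.ones_like(real)) +
            F.binary_cross_entropy(fake, torch.zeros_like(fake)))

d = nn.Sequential(nn.Linear(512, 1), nn.Sigmoid())   # stand-in discriminator
loss = discriminator_loss(d, torch.randn(25, 512), torch.randn(25, 512))
```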
In step S514, the NNs processing unit 242 may train the discriminator 340 in an adversarial manner, according to equation (7):

min_{θ_d}  l_d(θ_d)           eq. (7)
In step S516, the updating unit 244 may update the parameters θ_d to θ′_i,d with gradient descent for each task T_i, according to equation (8):

θ′_i,d = θ_d − α_d ∇_θd L_Ti(θ_d)           eq. (8)
where α_d represents a predetermined step size hyperparameter for fast-learning, and L_Ti represents the loss from task T_i. As such, the discriminator model is trained with K samples and feedback from the corresponding loss L_Ti from task T_i.
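The per-task updates of equations (5) and (8) can be sketched as explicit gradient steps that leave the shared initial parameters untouched, as below. The single-matrix parameters and squared-error placeholders stand in for the actual student and discriminator losses and are assumptions for illustration.

```python
import torch

theta_s = [torch.randn(512, 512, requires_grad=True)]   # stand-in for student parameters
theta_d = [torch.randn(512, 1, requires_grad=True)]     # stand-in for discriminator parameters
alpha_s, alpha_d = 0.01, 0.01                            # fast-learning step sizes

x = torch.randn(25, 512)
student_loss = (x @ theta_s[0]).pow(2).mean()            # placeholder for l_s,p + l_s,d
disc_loss = (x @ theta_d[0]).pow(2).mean()               # placeholder for l_d

# One gradient step per task; create_graph=True keeps the graph so the later
# meta-update can differentiate through these updated parameters.
g_s = torch.autograd.grad(student_loss, theta_s, create_graph=True)
theta_prime_s = [p - alpha_s * g for p, g in zip(theta_s, g_s)]

g_d = torch.autograd.grad(disc_loss, theta_d, create_graph=True)
theta_prime_d = [p - alpha_d * g for p, g in zip(theta_d, g_d)]
```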
FIG. 6 illustrates a flowchart of an exemplary method 600 for training the CML framework 300 during the meta-update phase, according to embodiments of the disclosure. Method 600 may be implemented by AI system 200 and particularly processor 204 or a separate processor not shown in FIG. 6. Method 600 may include steps S602-S608 as described below. It is to be appreciated that some of the steps may be optional to perform the disclosure provided herein. Further, some of the steps may be performed simultaneously, or in a different order than shown in FIG. 6.
In step S602, NNs processing unit 242 may calculate the second cross-entropy loss on the classifier 330, according to equation (9):

l_s,p(θ′_i,s) = −Σ_{(x, y) ∈ D_test} log P_y(T(x), S_θ′i,s(X))           eq. (9)

where (x, y) represents an image/label pair from the test dataset 304 (D_test); X represents the corresponding set of images; S_θ′i,s represents the corresponding function of the student network 320 with the updated parameters θ′_i,s, which generates the all-class matrix V 308; and T represents the function of the teacher network 310, which generates the feature map M 306.
In step S604, the updating unit 244 may optimize the parameters θ_s of the student network 320 using a one-step gradient descent, according to equation (10):

θ_s ← θ_s − β_s ∇_θs Σ_{Ti ∼ p(T)} L_Ti(θ′_i,s)           eq. (10)

where β_s represents a predetermined step size hyperparameter for meta-update, and L_Ti represents the loss from task T_i. As such, the model of the student network is tested on new samples from T_i, so that the model is improved by considering how the test error on new data changes with respect to the parameters.
In step S606, the NNs processing unit 242 may calculate the third binary-entropy loss corresponding to the discriminator, according to equation (11):

l_d(θ′_i,d) = −Σ_{(x, y) ∈ D_test} [ log D_θ′i,d(T(x)) + log(1 − D_θ′i,d(V_y)) ]           eq. (11)

where V_y is the class vector of class y generated by S_θ′i,s, the updated model of the student network 320.
In step S608, the updating unit 244 may optimize the parameters θ_d using a one-step gradient descent, according to equation (12):

θ_d ← θ_d − β_d ∇_θd Σ_{Ti ∼ p(T)} L_Ti(θ′_i,d)           eq. (12)

where β_d represents a predetermined step size hyperparameter for meta-update, and L_Ti represents the loss from task T_i. As such, the model of the discriminator 340 is tested on new samples from T_i, so that the model is improved by considering how the test error on new data changes with respect to the parameters.
As described in FIGs. 5 and 6, the meta-update is performed on the parameters θ_s and θ_d, rather than on θ′_i,s and θ′_i,d, while the losses l_s,p and l_d are computed with the updated parameters θ′_i,s and θ′_i,d obtained after fast-learning. In this way, the CML framework 300 can learn good initializations for both the student network 320 and the discriminator 340, such that it can quickly learn to deal with a new task during the meta-testing phase.
Another aspect of the disclosure is directed to a non-transitory computer-readable medium storing instructions which, when executed, cause one or more processors to perform the methods, as discussed above. The computer-readable medium may include volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage devices. For example, the computer-readable medium may be the storage device or the memory module having the computer instructions stored thereon, as disclosed. In some embodiments, the computer-readable medium may be a disc or a flash drive having the computer instructions stored thereon.
It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed system and related methods. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed system and related methods.
It is intended that the specification and examples be considered as exemplary only, with a true scope being indicated by the following claims and their equivalents.

Claims (23)

  1. An artificial intelligence system for training a continual meta-learner (CML) framework, comprising:
    a storage device configured to store training datasets and test datasets associated with a plurality of tasks;
    a processor configured to train the CML framework, wherein the CML framework includes a teacher network, a student network, a classifier and a discriminator, wherein the processor is configured to:
    receiving, at the CML framework, the plurality of tasks in a sequence, wherein each task comprises a training dataset and a test dataset;
    performing, for each task, a fast-learning by training the student network and the discriminator associated with the CML framework based on the training dataset associated with the task;
    updating one or more initial parameters associated with the student network and the discriminator to generate updated initial parameters corresponding to the one or more initial parameters; and
    performing a meta-update to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.
  2. The artificial intelligence system of claim 1, wherein the processor is further configured to:
    prior to performing the fast-learning:
    pre-training the teacher network associated with the CML framework to generate a feature map.
  3. The artificial intelligence system of claim 2, wherein to perform the fast-learning by training the student network and the discriminator, the processor is further configured to:
    training the student network by minimizing a sum of a first loss on a classifier and a second loss on the discriminator; and
    training the discriminator using a third loss on the discriminator.
  4. The artificial intelligence system of claim 3, wherein to minimize the sum of the first loss on the classifier and the second loss on the discriminator, the processor is further configured to:
    calculating a first cross-entropy loss corresponding to the classifier using the training dataset;
    calculating a first binary-entropy loss corresponding to the discriminator using the training dataset; and
    training the student network by minimizing the sum of the first cross-entropy loss corresponding to the classifier and the first binary-entropy loss corresponding to the discriminator.
  5. The artificial intelligence system of claim 4, wherein the processor is further configured to:
    generating an all-class matrix by the student network; and
    storing the generated all-class matrix into a memory.
  6. The artificial intelligence system of claim 5, wherein to train the discriminator using a third loss on the discriminator, the processor is further configured to:
    inputting the feature map as a real input of the discriminator;
    retrieving the all-class matrix from the memory;
    inputting the all-class matrix as a fake input of the discriminator;
    calculating a second binary-entropy loss corresponding to the discriminator based on the real input and the fake input; and
    training the discriminator using the second binary-entropy loss.
  7. The artificial intelligence system of claim 6, wherein the processor is further configured to:
    calculating a cosine similarity between the feature map and each class vector associated with the all-class matrix by the classifier;
    generating a prediction score for each class; and
    normalizing a plurality of prediction scores by using a softmax function.
  8. The artificial intelligence system of claim 1, wherein to perform the meta-update to optimize the one or more initial parameters associated with the student network and the discriminator, the processor is further configured to:
    calculating a second cross-entropy loss corresponding to the classifier using the updated one or more initial parameters associated with the student network and the test dataset;
    optimizing the one or more initial parameters associated with the student network using a one-step gradient descent based on the second cross-entropy loss;
    calculating a third binary-entropy loss corresponding to the discriminator using the test dataset and an updated model of the student network, wherein the updated model of the student network corresponds to the updated one or more initial parameters associated with the discriminator; and
    optimizing the one or more initial parameters associated with the discriminator using the one-step gradient descent based on the third binary-entropy loss.
  9. The artificial intelligence system of claim 1, wherein a model associated with the student network is a convolutional neural network.
  10. The artificial intelligence system of claim 1, wherein the discriminator is implemented by a multilayer perceptron.
  11. The artificial intelligence system of claim 1, wherein the one or more initial parameters associated with the student network and the discriminator are obtained by a random initialization.
  12. A computer-implemented method, comprising:
    receiving, at a continual meta-learner (CML) framework, a plurality of tasks in a sequence, wherein each task comprises a training dataset and a test dataset;
    performing, for each task, a fast-learning by training a student network and a discriminator associated with the CML framework based on the training dataset associated with the task;
    updating one or more initial parameters associated with the student network and the discriminator to generate updated initial parameters corresponding to the one or more initial parameters; and
    performing a meta-update to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.
  13. The computer-implemented method of claim 12, further comprising:
    prior to performing the fast-learning:
    pre-training a teacher network associated with the CML framework to generate a feature map.
  14. The computer-implemented method of claim 13, wherein performing the fast-learning by training the student network and the discriminator comprises:
    training the student network by minimizing a sum of a first loss on a classifier and a second loss on the discriminator; and
    training the discriminator using a third loss on the discriminator.
  15. The computer-implemented method of claim 14, wherein minimizing the sum of the first loss on the classifier and the second loss on the discriminator comprises:
    calculating a first cross-entropy loss corresponding to the classifier using the training dataset;
    calculating a first binary-entropy loss corresponding to the discriminator using the training dataset; and
    training the student network by minimizing the sum of the first cross-entropy loss corresponding to the classifier and the first binary-entropy loss corresponding to the discriminator.
  16. The computer-implemented method of claim 15, further comprising:
    generating an all-class matrix by the student network; and
    storing the generated all-class matrix into a memory.
  17. The computer-implemented method of claim 16, wherein training the discriminator using a third loss on the discriminator comprises:
    inputting the feature map as a real input of the discriminator;
    retrieving the all-class matrix from the memory;
    inputting the all-class matrix as a fake input of the discriminator;
    calculating a second binary-entropy loss corresponding to the discriminator based on the real input and the fake input; and
    training the discriminator using the second binary-entropy loss.
  18. The computer-implemented method of claim 17, further comprising:
    calculating a cosine similarity between the feature map and each class vector associated with the all-class matrix by the classifier;
    generating a prediction score for each class; and
    normalizing a plurality of prediction scores by using a softmax function.
  19. The computer-implemented method of claim 12, wherein performing the meta-update to optimize the one or more initial parameters associated with the student network and the discriminator comprises:
    calculating a second cross-entropy loss corresponding to the classifier using the updated one or more initial parameters associated with the student network and the test dataset;
    optimizing the one or more initial parameters associated with the student network using a one-step gradient descent based on the second cross-entropy loss;
    calculating a third binary-entropy loss corresponding to the discriminator using the test dataset and an updated model of the student network, wherein the updated model of the student network corresponds to the updated one or more initial parameters associated with the discriminator; and
    optimizing the one or more initial parameters associated with the discriminator using the one-step gradient descent based on the third binary-entropy loss.
  20. The computer-implemented method of claim 12, wherein a model associated with the student network is a convolutional neural network.
  21. The computer-implemented method of claim 12, wherein the discriminator is implemented by a multilayer perceptron.
  22. The computer-implemented method of claim 12, wherein the one or more initial parameters associated with the student network and the discriminator are obtained by a random initialization.
  23. A non-transitory computer-readable medium having instructions stored thereon that, when executed by a processor, cause the processor to perform an artificial intelligence method for training a continual meta-learner (CML) framework, the CML framework including a teacher network, a student network, a classifier and a discriminator, the artificial intelligence method comprising:
    receiving, at the CML framework, a plurality of tasks in a sequence, wherein each task comprises a training dataset and a test dataset;
    performing, for each task, a fast-learning by training the student network and the discriminator associated with the CML framework based on the training dataset associated with the task;
    updating one or more initial parameters associated with the student network and the discriminator to generate updated initial parameters corresponding to the one or more initial parameters; and
    performing a meta-update to optimize the one or more initial parameters associated with the student network and the discriminator using the updated initial parameters.
PCT/CN2019/110530 2019-10-11 2019-10-11 Method and system for continual meta-learning WO2021068180A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/110530 WO2021068180A1 (en) 2019-10-11 2019-10-11 Method and system for continual meta-learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/110530 WO2021068180A1 (en) 2019-10-11 2019-10-11 Method and system for continual meta-learning

Publications (1)

Publication Number Publication Date
WO2021068180A1 true WO2021068180A1 (en) 2021-04-15

Family

ID=75436931

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/110530 WO2021068180A1 (en) 2019-10-11 2019-10-11 Method and system for continual meta-learning

Country Status (1)

Country Link
WO (1) WO2021068180A1 (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960419A (en) * 2017-05-18 2018-12-07 三星电子株式会社 For using student-teacher's transfer learning network device and method of knowledge bridge
WO2018223822A1 (en) * 2017-06-07 2018-12-13 北京深鉴智能科技有限公司 Pruning- and distillation-based convolutional neural network compression method
US20180365564A1 (en) * 2017-06-15 2018-12-20 TuSimple Method and device for training neural network
US20190101927A1 (en) * 2017-09-30 2019-04-04 TuSimple System and method for multitask processing for autonomous vehicle computation and control
CN108898168A (en) * 2018-06-19 2018-11-27 清华大学 Compression method and system for a convolutional neural network model for target detection

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780473A (en) * 2021-09-30 2021-12-10 平安科技(深圳)有限公司 Data processing method and device based on a deep model, electronic equipment and storage medium
CN113780473B (en) * 2021-09-30 2023-07-14 平安科技(深圳)有限公司 Deep model-based data processing method and device, electronic equipment and storage medium
CN114491039A (en) * 2022-01-27 2022-05-13 四川大学 Meta-learning few-sample text classification method based on gradient improvement
CN114491039B (en) * 2022-01-27 2023-10-03 四川大学 Meta-learning few-sample text classification method based on gradient improvement
CN114563130A (en) * 2022-02-28 2022-05-31 中云开源数据技术(上海)有限公司 Class imbalance fault diagnosis method for rotary machine
CN114563130B (en) * 2022-02-28 2024-04-30 中云开源数据技术(上海)有限公司 Class imbalance fault diagnosis method for rotary machinery

Similar Documents

Publication Publication Date Title
US11361225B2 (en) Neural network architecture for attention based efficient model adaptation
CN107909101B (en) Semi-supervised transfer learning character identifying method and system based on convolutional neural networks
WO2021238366A1 (en) Neural network construction method and apparatus
WO2022083536A1 (en) Neural network construction method and apparatus
US20220237944A1 (en) Methods and systems for face alignment
CN108345875B (en) Driving region detection model training method, detection method and device
WO2022042713A1 (en) Deep learning training method and apparatus for use in computing device
CN111507378A (en) Method and apparatus for training image processing model
WO2022068623A1 (en) Model training method and related device
CN113807399B (en) Neural network training method, neural network detection method and neural network training device
US11468266B2 (en) Target identification in large image data
CN113065635A (en) Model training method, image enhancement method and device
WO2021129668A1 (en) Neural network training method and device
WO2022012668A1 (en) Training set processing method and apparatus
CN113570029A (en) Method for obtaining neural network model, image processing method and device
WO2021068180A1 (en) Method and system for continual meta-learning
CN113408570A (en) Image category identification method and device based on model distillation, storage medium and terminal
Adedoja et al. Intelligent mobile plant disease diagnostic system using NASNet-mobile deep learning
CN115018039A (en) Neural network distillation method, target detection method and device
CN111738403A (en) Neural network optimization method and related equipment
US20230004816A1 (en) Method of optimizing neural network model and neural network model processing system performing the same
EP4227858A1 (en) Method for determining neural network structure and apparatus thereof
WO2022125181A1 (en) Recurrent neural network architectures based on synaptic connectivity graphs
CN113627421A (en) Image processing method, model training method and related equipment
CN111860601B (en) Method and device for predicting type of large fungi

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 19948448

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 19948448

Country of ref document: EP

Kind code of ref document: A1