WO2020202594A1 - Learning system, method and program - Google Patents

Learning system, method and program

Info

Publication number
WO2020202594A1
WO2020202594A1 (PCT/JP2019/029456)
Authority
WO
WIPO (PCT)
Prior art keywords
learning
model
data
expert
classifier
Prior art date
Application number
PCT/JP2019/029456
Other languages
French (fr)
Inventor
Devendra Dhaka
Kanishka KHANDELWAL
Riki Eto
Original Assignee
Nec Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nec Corporation filed Critical Nec Corporation
Publication of WO2020202594A1 publication Critical patent/WO2020202594A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • The present invention relates to a learning system, learning method and learning program that learn a model for classifying input data samples with time (or, equivalently, sequential-order) annotations into one of many class labels, wherein the distribution of class labels of the input data changes with time.
  • In machine learning, a classifier outputs a label representing an attribute of certain data when that data is input. It is also known that the classification criterion of such a classifier may change over time. In order to prevent temporal deterioration of the classification accuracy of such a classifier, it is necessary to create a classifier whose classification criterion is updated.
  • Patent Literature 1 discloses a creating apparatus for creating a classifier.
  • The creating apparatus disclosed in Patent Literature 1 creates a classifier whose classification accuracy is maintained without frequently collecting labeled training data.
  • Non Patent Literature 1 discloses a method of performing non-linear classification.
  • the Dirichlet Process mixtures of Generalized Linear Models (DP-GLM) produces a global model of the joint distribution through a mixture of local generalized linear models.
  • Non Patent Literature 2 discloses that an accurate variational transformation can be used to obtain a closed form approximation to the posterior distribution of the parameters thereby yielding an approximate posterior predictive model.
  • The creating apparatus disclosed in Patent Literature 1 learns the classification criterion from input data with temporal attributes and class labels at each past time instance, learns a time series change model over these classification criteria, and uses that model to perform ahead prediction of the classification criterion of a classifier.
  • However, the creating apparatus disclosed in Patent Literature 1 is limited to a single classifier.
  • The explanation therein is given using a simple logistic regression classifier, which yields a linear classification criterion.
  • A non-linear classifier such as an SVM, boosting, or a neural network can be used in place of logistic regression when the classification criterion in the input data is non-linear at some or all time instances.
  • However, implementing a non-linear classification model in their fashion is not obvious. Therefore, there is a problem that the data cannot be classified properly if a classifier model based on logistic regression is used and the boundary for classifying the data is non-linear.
  • In Non Patent Literature 1, classification with non-linearity is considered.
  • However, the method disclosed in Non Patent Literature 1 assumes a non-changing (stationary) distribution of data.
  • Consequently, the method disclosed in Non Patent Literature 1 has a problem that the accuracy of the classifier deteriorates over time.
  • A learning system for learning a model for estimating a label indicating classification of data includes: a classifier model learning unit which learns, using input data, a classifier model for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance; a classifier time series model learning unit which learns, for each expert, a time series model indicating the time series change of the classifier model of the expert; a data model parameter learning unit which learns a data model parameter for a data model indicating the distribution of data features for each expert; and an assignment parameter learning unit which learns an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  • A learning method for learning a model for estimating a label indicating classification of data includes: learning, using input data, a classifier model for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance; learning, for each expert, a time series model indicating the time series change of the classifier model of the expert; learning a data model parameter for a data model indicating the distribution of data features for each expert; and learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  • A learning program for learning a model for estimating a label indicating classification of data causes a computer to perform: a classifier model learning process of learning, using input data, a classifier model for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance; a classifier time series model learning process of learning, for each expert, a time series model indicating the time series change of the classifier model of the expert; a data model parameter learning process of learning a data model parameter for a data model indicating the distribution of data features for each expert; and an assignment parameter learning process of learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  • Fig. 1 depicts an exemplary block diagram illustrating the structure of an exemplary embodiment of a learning system according to the present invention.
  • the learning system 100 according to the present exemplary embodiment includes a learning unit 10 and a future classification unit 20.
  • the learning unit 10 includes a data acquisition unit 101, a data processing unit 102, an expert initialization unit 103, an expert learning unit 104 and an expert storage unit 105.
  • the data acquisition unit 101 acquires data used for learning by the expert learning unit 104 described later.
  • the data acquisition unit 101 receives labeled streaming data as training data.
  • the labeled training data means a combination of data for learning and a label indicating the classification of this training data.
  • the target data is data that belongs to a certain group (hereinafter, also referred to as positive data) or data that does not belong to the group (hereinafter, also referred to as negative data).
  • a label indicating positive data or a label indicating negative data may be used as a label.
  • x_{t,i} is the D-dimensional feature vector of the i-th sample at time t, and y_{t,i}, which is an element of {0, 1}, is its class label.
  • N_t is the number of training data samples collected at time t. Note that sequential data can be represented in this format by discretizing at regular intervals and considering the data falling within the same interval to have arrived at the same time.
  • Given the set of training data D, the objective is to predict a binary classifier h_t : R^D -> {0, 1} at time t (t is an element of {T+1, T+2, ...}) which can precisely classify data at time t.
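The discretization note above can be sketched concretely. This is a minimal illustration, not the patent's implementation; the function name `discretize_stream` and the fixed-width binning interval are assumptions.

```python
import numpy as np

def discretize_stream(timestamps, features, labels, interval=1.0):
    """Bin streaming samples at regular intervals so that samples falling
    within the same interval are treated as arriving at the same time t.
    Returns {t: (X_t, y_t)}, where X_t stacks the feature vectors x_{t,i}
    and y_t the class labels y_{t,i}; N_t is then len(y_t)."""
    bins = np.floor(np.asarray(timestamps, dtype=float) / interval).astype(int)
    grouped = {}
    for t, x, y in zip(bins, features, labels):
        xs, ys = grouped.setdefault(t, ([], []))
        xs.append(x)
        ys.append(y)
    return {t: (np.array(xs), np.array(ys)) for t, (xs, ys) in grouped.items()}
```

With `interval=1.0`, two samples at timestamps 0.1 and 0.4 share time index 0, while a sample at 1.2 falls into index 1.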
  • the data acquisition unit 101 may acquire training data from a storage unit (not shown) included in the learning system 100, or may acquire training data from an external storage server or the like (not shown) via a communication network.
  • the data processing unit 102 converts the acquired data into training data.
  • the data processing unit 102 converts the streaming data into feature and label vectors with time annotations. That is, the data processing unit 102 generates the set of training data D described above from the acquired data.
  • the “dynamics” means time-series change of classification criteria of a classifier model.
  • Our model, which is used in the exemplary embodiment, is based on Dirichlet Process Mixtures (DPM). It is used to identify the number of clusters/groups from D automatically and to assign an expert to each cluster to model the data distribution along with the conditional distribution of class labels given the data.
  • the experts are collectively assigned to a task of classification of the input data.
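A Dirichlet Process mixture prior over cluster/expert assignments is commonly realized with a truncated stick-breaking construction. The sketch below shows that standard construction, not the patent's exact equations; the truncation level K, the concentration parameter alpha, and the function name are illustrative assumptions.

```python
import numpy as np

def stick_breaking_weights(alpha, K, seed=None):
    """Truncated stick-breaking construction of DPM mixture weights:
    v_k ~ Beta(1, alpha); pi_k = v_k * prod_{j<k} (1 - v_j).
    Setting v_K = 1 truncates the process so the K weights sum to 1."""
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0  # truncation: last expert takes the rest of the stick
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining
```

Smaller alpha concentrates mass on few experts; larger alpha spreads it over many, which is how the number of effective clusters adapts to the data.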
  • A standard classifier model, such as logistic regression or an SVM, can be used.
  • A standard time series model, such as a Vector Autoregressive Model or a Gaussian Process, can be used.
  • In Equation 1, N([mu], [Sigma]) is the multivariate Gaussian distribution with mean [mu] and covariance matrix [Sigma], the element m_k of R^D is the mean vector of cluster k, the element R_k of R^{DxD} is its precision matrix, [Phi]_k = {m_k, R_k}, z_{t,i} is the cluster indicator for the data sample x_{t,i}, and 1 is an indicator function.
  • In Equation 2, the element g_0 of R^D is the mean vector, the element V_0 of R^{DxD} is the scale matrix, the element [beta]_0 of R is a scale parameter, and the element f_0 of R is the degrees of freedom.
  • The probability of label y_{t,i} given feature vector x_{t,i} is modelled by logistic regression, as shown in Equations 3 and 4 below.
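Equations 3 and 4 are not reproduced here, but the logistic regression model for P(y_{t,i} = 1 | x_{t,i}) has the standard sigmoid form. The sketch below assumes a weight vector in R^(D+1) with an appended bias component, matching the (D+1)-dimensional convention used for the dynamics; the function name is illustrative.

```python
import numpy as np

def label_probability(w, x):
    """P(y = 1 | x) under logistic regression: sigmoid(w . [x, 1]),
    with a bias term appended so that w lives in R^(D+1)."""
    a = np.dot(w, np.append(x, 1.0))
    return 1.0 / (1.0 + np.exp(-a))
```

With w = 0 the model is maximally uncertain (probability 0.5); a large positive activation drives the probability toward 1.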
  • VAR: Vector Autoregressive model.
  • In Equations 5, 6 and 7, the elements A_{k,1}, A_{k,2}, ..., A_{k,m} of R^{(D+1)x(D+1)} are the (D+1)x(D+1) matrices defining the dynamics, the element A_{k,0} of R^{D+1} is the bias term, and [theta] and [theta]_0 are elements of R_+.
  • The classifier parameters at time t depend linearly only on the past m values of the expert's classifier parameters. This gives the model the ability to have separate dynamics for each expert by keeping the dynamics parameters independent across the experts.
  • A_{k,m} is restricted to be a diagonal matrix.
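One deterministic step of the order-m autoregressive dynamics of Equations 5 to 7 can be sketched as follows (the noise term is omitted; the function name is an assumption for illustration).

```python
import numpy as np

def var_step(A0, A_list, past_w):
    """One step of order-m VAR dynamics: w_t = A_0 + sum_j A_j @ w_{t-j}.
    With each A_j diagonal, every component of the classifier weight
    vector evolves independently of the others."""
    w = np.asarray(A0, dtype=float).copy()
    for A_j, w_prev in zip(A_list, past_w):
        w = w + A_j @ np.asarray(w_prev, dtype=float)
    return w
```

Because the A matrices are kept separate per expert, each expert's decision boundary can drift on its own schedule, which is what lets the mixture track a changing non-linear boundary.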
  • the distribution G 0 is a product of distributions given in equations 2, 5, 6 and 7.
  • the hyperparameters in these distributions have prior probabilities as shown in Equations 10, 11 and 12 below.
  • W, A, Z, [Gamma] and [Phi] are as shown below.
  • The posterior probability P(W, A, Z, [Phi], [Theta], [Theta]_0, [Gamma], [pi]' | D) of the parameters W, A, Z, [Phi], [Theta], [Theta]_0, [Gamma], [pi]' in a case where labeled learning data D is given is obtained.
  • This posterior probability is obtained by using a so-called variational Bayes method of approximately obtaining a posterior probability.
  • the expert learning unit 104 performs variational inference to find posterior probabilities of hidden variables and parameters in our model.
  • The lower bound L(q) of the log marginal likelihood of the proposed model is expressed as shown in Equation 15.
  • The expert learning unit 104 uses the lower bound for logistic regression proposed in Non Patent Literature 2 to convert it to an exponential family distribution, as required by the variational inference procedure.
  • Non Patent Literature 2 introduces a variable [xi]_{t,i} per feature vector x_{t,i} and changes our lower bound L(q) to L(q, [xi]).
  • The variational posterior q(W, A, Z, [Phi], [Theta], [Theta]_0, [Gamma], [pi]') can be factorized using the mean field approximation, as shown in Equation 16.
  • Fig. 2 is an exemplary explanatory diagram illustrating an example of variational inference performed by the learning unit 10 according to the present exemplary embodiment.
  • The expert learning unit 104 inputs data D and hyperparameters u_0, v_0, u, v, a, b, g_0, [beta]_0, V_0 and f_0 (step S500).
  • The expert initialization unit 103 initializes each of W, A, Z, [Phi], [Theta], [Theta]_0, [Gamma], and [pi]' (step S501).
  • the expert initialization unit 103 may perform expert initialization in an arbitrary manner.
  • The expert initialization unit 103 may initialize the experts using a pre-identified set of parameters (for example, initializing the parameters to 0 or 1).
  • The processes from step S502 to step S514 are repeated until the iteration counter reaches the maximum (iter = max_iter). Further, the processes from step S503 to step S512 are repeated for the number of experts. Furthermore, the processes from step S504 to step S507 are repeated for the times 1 to T.
  • The processes from step S505 to step S506 are repeated for the number of data dimensions. Specifically, in step S506, the expert learning unit 104 updates the parameters of W using Equations 45 to 48 shown below.
  • In step S507, the expert learning unit 104 updates the parameters of [xi] using Equation 49 shown below.
  • In step S509, the expert learning unit 104 updates the parameters of A using Equations 41 to 44 shown below. Furthermore, in step S510, the expert learning unit 104 updates the parameters of [Gamma] using Equations 39 and 40 shown below.
  • In step S511, the expert learning unit 104 updates the parameters of [Phi] using Equations 28 to 34 shown below. Furthermore, in step S512, the expert learning unit 104 updates the parameters of [Theta] and [Theta]_0 using Equations 35 to 38 shown below.
  • In step S513, the expert learning unit 104 updates the parameters of Z using Equation 27 shown below. Furthermore, in step S514, the expert learning unit 104 updates the parameters of [pi]' using Equations 25 and 26 shown below.
  • In step S515, the expert learning unit 104 outputs the optimized q(W), q(A), q(Z), q([Phi]), q([Theta]), q([Theta]_0), q([Gamma]) and q([pi]').
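The loop structure of steps S502 to S515 can be sketched as a coordinate-ascent skeleton. The update functions below are placeholders standing in for Equations 25 to 49, which are not reproduced here; the grouping of updates into four callables is an illustrative assumption.

```python
def variational_inference(params, update_fns, max_iter, num_experts, T):
    """Coordinate-ascent skeleton mirroring steps S502-S515.  Each entry
    of update_fns stands in for one group of update equations and must
    return the (possibly updated) params."""
    for _ in range(max_iter):                          # S502: outer iterations
        for k in range(num_experts):                   # S503: per expert
            for t in range(T):                         # S504: per time step
                params = update_fns["W_xi"](params, k, t)   # S505-S507
            params = update_fns["A_gamma"](params, k)       # S509-S510
            params = update_fns["phi_theta"](params, k)     # S511-S512
        params = update_fns["Z_pi"](params)                 # S513-S514
    return params                                      # S515: optimized factors
```

Note that the Z and [pi]' updates sit outside the expert loop but inside the outer iteration, matching the flow described above.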
  • the expert learning unit 104 stores model data for each expert in the expert storage unit 105. That is, the expert storage unit 105 stores model data of each expert.
  • The expert storage unit 105 is realized by, for example, a magnetic disk or the like. Note that, since the model of each expert is learned individually, the expert learning unit 104 may perform normalization processing on all the learned expert models and then store them in the expert storage unit 105.
  • the learning unit 10 (more specifically, the data acquisition unit 101, the data processing unit 102, the expert initialization unit 103 and the expert learning unit 104) is implemented by a CPU of a computer operating according to a program (learning program).
  • the program may be stored in the storage unit (not shown) included in the learning system 100, with the CPU reading the program and, according to the program, operating as the learning unit 10 (more specifically, the data acquisition unit 101, the data processing unit 102, the expert initialization unit 103 and the expert learning unit 104).
  • the functions of the learning system may be provided in the form of SaaS (Software as a Service).
  • The components of the learning unit 10 may each be implemented by dedicated hardware. All or part of the components of each device may be implemented by general-purpose or dedicated circuitry, processors, or combinations thereof. They may be configured with a single chip, or configured with a plurality of chips connected via a bus. All or part of the components of each device may be implemented by a combination of the above-mentioned circuitry or the like and a program.
  • When each device is implemented by a plurality of information processing devices, circuitry, or the like, the plurality of information processing devices, circuitry, or the like may be centralized or distributed.
  • the information processing devices, circuitry, or the like may be implemented in a form in which they are connected via a communication network, such as a client-and-server system or a cloud computing system.
  • the future classification unit 20 receives a new (unlabeled) sample, and predicts its label by combining label predictions from each expert.
  • The predictions of each expert on the new sample are combined in a probabilistic fashion. That is, for combining the predictions, it is first necessary to find the weights assigned to each expert for this new sample, and then the predictions of each expert's classifier on the new sample at the time instance of the new sample.
  • The label prediction is performed in two steps. First, the distribution of classifier weights P(w_{k,T'}) for each expert k, which is an element of {1, 2, ..., K} (K is the total number of experts), and time T' > T is evaluated. The distribution of classifier weights is calculated with a sampling-cum-marginalization approach, as shown in Equation 50 below.
  • [tau](a) = (1 + [pi]a/8)^{-1/2}.
  • [omega]_{k,T',i} denotes the probability of choosing the k-th expert for classification, which is further represented as in Equations 53 and 54.
  • The probability of assigning an expert to z_{T',i} can be approximated as shown below, where N denotes the total number of samples in the labeled data set D.
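The [tau] factor above is the standard correction for approximating the expectation of a sigmoid under a Gaussian, with the argument a playing the role of the predictive variance. A minimal sketch under that reading (the function name is an assumption):

```python
import numpy as np

def approx_sigmoid_gaussian_mean(mu, var):
    """Closed-form approximation to E[sigmoid(a)] for a ~ N(mu, var):
    the integral is approximated by sigmoid(tau(var) * mu), where
    tau(var) = (1 + pi * var / 8) ** (-1/2)."""
    tau = (1.0 + np.pi * var / 8.0) ** -0.5
    return 1.0 / (1.0 + np.exp(-tau * mu))
```

Higher predictive variance shrinks tau toward 0, pulling the predicted probability toward 0.5, i.e. uncertainty about the classifier weights makes the label prediction less confident.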
  • the future classification unit 20 includes a data acquisition unit 201, a data processing unit 202, an expert identification unit 203, a classification output unit 204 and a label storage unit 205.
  • the data acquisition unit 201 receives un-labeled streaming data (hereinafter also referred to as a sample). That is, the data acquisition unit 201 receives data to be classified.
  • the data processing unit 202 converts the received streaming data into feature vectors with time annotations.
  • the method of converting streaming data into a time-annotated feature vector is the same as the method performed by the data processing unit 102, but label data is not created.
  • the expert identification unit 203 identifies parameters for each expert for the task of classification of unlabeled data.
  • the expert identification unit 203 includes an expert weighting unit 2031 and a classifier creating unit 2032.
  • the expert weighting unit 2031 calculates the weight for each expert. Specifically, the expert weighting unit 2031 calculates the weight of each expert based on the assignment parameter using Equations 53 and 54 described above.
  • The classifier creating unit 2032 calculates the future weights of the classifier using the dynamics. Specifically, the classifier creating unit 2032 determines a classifier at the time instance of a new sample using the time series model, by using Equation 50 described above. That is, the classifier creating unit 2032 predicts the classifier parameters for each expert at the time instance of the new sample using the autoregressive time series model of classifier parameters.
  • The classification output unit 204 predicts the label of the new sample for each expert using the classifier parameters obtained by the classifier creating unit 2032, and combines these label predictions using the weights obtained by the expert weighting unit 2031. Specifically, the classification output unit 204 determines the label predictions for all experts and combines them in a probabilistic fashion by using Equations 51 and 52 described above.
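The probabilistic combination performed by the classification output unit can be sketched as a weight-normalized mixture of per-expert label probabilities. The per-expert probabilities are assumed to be already computed (e.g., via Equations 51 and 52, not reproduced here), and the function name is illustrative.

```python
import numpy as np

def combine_expert_predictions(omega, per_expert_probs):
    """Final P(y = 1 | x) for a new sample: a mixture of each expert's
    label probability, weighted by the normalized expert weights omega_k."""
    omega = np.asarray(omega, dtype=float)
    omega = omega / omega.sum()  # ensure the expert weights sum to 1
    return float(omega @ np.asarray(per_expert_probs, dtype=float))
```

An expert with a larger weight for the sample dominates the final label probability, so experts whose clusters are far from the sample contribute little.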
  • the classification output unit 204 stores the determined label in the label storage unit 205. That is, the label storage unit 205 stores a label for the input streaming data.
  • the label storage unit 205 is realized by, for example, a magnetic disk or the like.
  • the future classification unit 20 (more specifically, the data acquisition unit 201, the data processing unit 202, the expert identification unit 203 (more specifically, the expert weighting unit 2031 and the classifier creating unit 2032) and the classification output unit 204) is implemented by a CPU of a computer operating according to a program (learning program, prediction program).
  • the future classification unit 20 (more specifically, the data acquisition unit 201, the data processing unit 202, the expert identification unit 203 (more specifically, the expert weighting unit 2031 and the classifier creating unit 2032) and the classification output unit 204) may each be implemented by dedicated hardware.
  • FIG. 3 depicts a flowchart illustrating an example of learning processing by the learning unit 10.
  • the data acquisition unit 101 receives labeled streaming data as learning data until time T (step S101).
  • The data processing unit 102 converts the streaming data into feature and label vectors with time annotations (step S102).
  • The expert initialization unit 103 initializes all the experts with pre-identified parameters (step S103).
  • loop process A (steps S1031 to S1032) is repeated until the termination condition is satisfied.
  • loop process B (steps S1033 to S1034) is repeated at expert level over all the pre-specified number of experts.
  • The expert learning unit 104 learns a classifier model for each expert at each time (step S1041).
  • The expert learning unit 104 learns a classifier time series model for each expert (step S1042).
  • The expert learning unit 104 learns the expert parameters for the data model (step S1043).
  • The expert learning unit 104 learns the expert assignment parameters for all data points (step S1044).
  • the expert learning unit 104 stores model data in the expert storage unit 105. That is, the expert storage unit 105 stores model data for each expert (step S105).
  • Fig. 4 depicts a flowchart illustrating an example of prediction processing by the future classification unit 20.
  • the data acquisition unit 201 receives un-labeled streaming data (step S201).
  • the data processing unit 202 converts the streaming data into feature vectors with time annotations (step S202).
  • The expert identification unit 203 (specifically, the expert weighting unit 2031) computes weights for each expert (step S2031), and the expert identification unit 203 (specifically, the classifier creating unit 2032) computes future weights of the classifier using the dynamics (step S2032).
  • the classification output unit 204 combines label predictions from all the experts (step S204).
  • the classification output unit 204 stores predicted labels for the input data in the label storage unit 205. That is, the label storage unit 205 stores predicted labels for the input data (step S205).
  • The expert learning unit 104 learns the classifier model for a mixture of classifiers (experts) at each time, and the classifier time series model for each expert. Moreover, the expert learning unit 104 learns, for each expert, a data model parameter for the data model, and learns the assignment parameter for individual samples in the input data. Therefore, it is possible to learn the dynamics of non-linear boundaries used for classification.
  • the expert weighting unit 2031 calculates the weight of each expert based on the assignment parameter, and the classifier creating unit 2032 predicts the classifier weights corresponding to the sample’s time using the classifier time series model. Then, the classification output unit 204 predicts the probability of the label of the sample for each expert, combines the probabilities of the labels of all the experts, and predicts the label of the sample. Therefore, even if the conditional distribution of labels in streaming data changes over time, it is possible to suppress the decrease in the accuracy of the classifier.
  • The expert learning unit 104 of the present exemplary embodiment may use a neural network in place of logistic regression for the classifier model, or in place of the AR process for the time series model.
  • FIG. 5 is an exemplary explanatory diagram illustrating a specific example of the learning process.
  • streaming data from time 1 to T is given as learning data.
  • The example illustrated in Fig. 5 shows streaming data having X1 and X2 as features, labeled with two classes (class 1 and class 2).
  • Fig. 5 shows an example where the decision boundaries also change.
  • Fig. 6 is an exemplary explanatory diagram illustrating a specific example of the prediction process.
  • Unlabeled streaming data from time T + 1 to T + M is given as classification target data.
  • The example illustrated in Fig. 6 shows that streaming data having X1 and X2 as features is given.
  • The future classification unit 20 refers to the learned experts stored in the expert storage unit 105, and predicts a classifier at each time. As a result, the class into which given data is classified is predicted at each time. Similar to the example illustrated in Fig. 5, the conditional distribution of labels changes over time, so Fig. 6 also shows that the decision boundary changes.
  • Fig. 7 depicts a block diagram illustrating an outline of the learning system according to the present invention.
  • The learning system 80 (for example, the learning system 100) for learning a model for estimating a label indicating classification of data according to the present invention includes: a classifier model learning unit 81 (for example, the expert learning unit 104) which learns, using input data, a classifier model (for example, Q(w_{k,t})) for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance; a classifier time series model learning unit 82 (for example, the expert learning unit 104) which learns, for each expert, a time series model (for example, Q(A_k)) indicating the time series change of the classifier model of the expert; a data model parameter learning unit 83 (for example, the expert learning unit 104) which learns a data model parameter (for example, Q([phi]_k)) for a data model indicating the distribution of data features for each expert; and an assignment parameter learning unit 84 (for example, the expert learning unit 104) which learns an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  • The learning system 80 may include: a weight calculator (for example, the expert weighting unit 2031) which calculates a weight of each expert based on the assignment parameter; a weight predictor (for example, the classifier creating unit 2032) which predicts classifier weights corresponding to a sample's time using the classifier time series model; and a label predictor (for example, the classification output unit 204) which predicts the probability of the label of the sample for each expert, combines the probabilities of the labels of all the experts, and predicts the label of the sample.
  • The classifier model learning unit 81 may model, by logistic regression, the probability of a label given data in the grouped data.
  • The classifier model learning unit 81 may determine the number of clusters from learning data using a model based on a Dirichlet Process mixture, and assign an expert to each of the determined clusters.
  • The data model parameter learning unit 83 may learn parameters based on the Normal-Wishart distribution and model data based on a multivariate normal distribution.
  • The assignment parameter learning unit 84 may model the cluster assignment based on a multinomial or categorical distribution.
  • The classifier model learning unit 81 may learn the classifier model such that a collective decision boundary approximates an underlying non-linear decision boundary at each past time instance.
  • FIG. 8 depicts a schematic block diagram illustrating a configuration of a computer according to at least one of the exemplary embodiments.
  • a computer 1000 includes a CPU 1001, a main storage device 1002, an auxiliary storage device 1003, and an interface 1004.
  • The above-described learning system is mounted on the computer 1000.
  • The operation of each of the processing units described above is stored in the auxiliary storage device 1003 in the form of a program (a learning program).
  • the CPU 1001 reads the program from the auxiliary storage device 1003, deploys the program in the main storage device 1002, and executes the above processing according to the program.
  • the auxiliary storage device 1003 is an exemplary non-transitory physical medium.
  • Other examples of non-transitory physical medium include a magnetic disc, a magneto-optical disk, a CD-ROM, a DVD-ROM, and a semiconductor memory that are connected via the interface 1004.
  • The computer 1000 to which the program is distributed may deploy the program in the main storage device 1002 to execute the processing described above.
  • the program may implement a part of the functions described above.
  • the program may implement the aforementioned functions in combination with another program stored in the auxiliary storage device 1003 in advance, that is, the program may be a differential file (differential program).
  • A learning system for learning a model for estimating a label indicating classification of data, comprising: a classifier model learning unit which learns, using input data, a classifier model for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance; a classifier time series model learning unit which learns, for each expert, a time series model indicating the time series change of the classifier model of the expert; a data model parameter learning unit which learns a data model parameter for a data model indicating the distribution of data features for each expert; and an assignment parameter learning unit which learns an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  • the learning system according to supplementary note 1, further comprising: a weight calculator which calculates a weight of each expert based on the assignment parameter; a weight predictor which predicts classifier weights corresponding to a sample’s time using the classifier time series model; and a label predictor which predicts the probability of the label of the sample for each expert, combines the probabilities of the labels of all the experts, and predicts the label of the sample.
  • (Supplementary note 4) The learning system according to any one of supplementary notes 1 to 3, wherein the classifier model learning unit determines the number of clusters from learning data using a model based on a Dirichlet Process mixture, and assigns an expert to each of the determined clusters.
  • A learning method for learning a model for estimating a label indicating classification of data, comprising: learning, using input data, a classifier model for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance; learning, for each expert, a time series model indicating the time series change of the classifier model of the expert; learning a data model parameter for a data model indicating the distribution of data features for each expert; and learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  • the learning method further comprising: calculating a weight of each expert based on the assignment parameter; predicting classifier weights corresponding to a sample’s time using a classifier time series model; predicting the probability of the label of the sample for each expert; combining the probabilities of the labels of all the experts; and predicting the label of the sample.
  • a learning program for learning a model for estimating a label indicating classification of data causes a computer to perform: a classifier model learning process of learning, using an input data, a classifier model for a mixture of classifiers referred as experts that are collectively assigned to a task of classification of the input data, at each time instance; a classifier time series model learning process of learning, for each expert, a time series model indicating time series change of the classifier model of the expert; a data model parameter learning process of learning a data model parameter for a data model indicating the distribution of data features for each expert; and an assignment parameter learning process of learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  • the learning program causes a computer to perform: a weight calculate process of calculating a weight of each expert based on the assignment parameter; a weight predicting process of predicting classifier weights corresponding to a sample’s time using a classifier time series model; and a label predicting process of predicting the probability of the label of the sample for each expert, combining the probabilities of the labels of all the experts, and predicting the label of the sample.
  • 10: learning unit; 20: future classification unit; 100: learning system; 101, 201: data acquisition unit; 102, 202: data processing unit; 103: expert initialization unit; 104: expert learning unit; 105: expert storage unit; 203: expert identification unit; 204: classification output unit; 205: label storage unit; 2031: expert weighting unit; 2032: classifier creating unit

Abstract

A classifier model learning unit 81 learns, using input data, a classifier model for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance. A classifier time series model learning unit 82 learns, for each expert, a time series model indicating time series change of the classifier model of the expert. A data model parameter learning unit 83 learns a data model parameter for a data model indicating the distribution of data features for each expert. An assignment parameter learning unit 84 learns an assignment parameter indicating the probability of assigning experts to individual samples in the input data.

Description

LEARNING SYSTEM, METHOD AND PROGRAM
The present invention relates to a learning system, learning method and learning program that learn a model for classifying input data samples bearing time (or, equivalently, sequential-order) annotations into one of many class labels, wherein the distribution of class labels of the input data changes with time.
In machine learning, there is known a classifier that, when certain data is input, outputs a label representing an attribute of that data. It is also known that the classification criterion of a classifier may change over time. In order to prevent temporal deterioration of the classification accuracy of such a classifier, it is necessary to create a classifier whose classification criterion is updated.
Patent Literature 1 discloses a creating apparatus for creating a classifier. The creating apparatus disclosed in Patent Literature 1 creates a classifier whose classification accuracy is maintained without frequently collecting labeled training data.
Non Patent Literature 1 discloses a method of performing non-linear classification. In the method disclosed in Non Patent Literature 1, given a data set of input-response pairs, the Dirichlet Process mixtures of Generalized Linear Models (DP-GLM) produces a global model of the joint distribution through a mixture of local generalized linear models.
Non Patent Literature 2 discloses that an accurate variational transformation can be used to obtain a closed form approximation to the posterior distribution of the parameters thereby yielding an approximate posterior predictive model.
Patent Literature 1: US Patent Application Publication No. 2019/0012566 A1
Non Patent Literature 1: Lauren A. Hannah, et al., "Dirichlet Process Mixtures of Generalized Linear Models", The Journal of Machine Learning Research, Volume 12, pp. 1923-1953, 2011
Non Patent Literature 2: Tommi S. Jaakkola, Michael I. Jordan, "Bayesian parameter estimation via variational methods", Statistics and Computing
The creating apparatus disclosed in Patent Literature 1 learns the classification criterion from input data with temporal attributes and class labels at each past time instance, learns a time series change model over these classification criteria, and uses it to perform ahead prediction of the classification criterion of a classifier. However, the creating apparatus disclosed in Patent Literature 1 is limited to a single classifier. The explanation therein uses a simple logistic regression classifier, which gives a linear classification criterion. It is asserted that a non-linear classifier such as an SVM, boosting, a neural network, etc. can be used in place of logistic regression when the classification criterion in the input data is non-linear at some or all time instances. However, implementing a non-linear classification model in that fashion is not obvious. Therefore, there is a problem that the data cannot be classified properly if a classifier model based on logistic regression is used and the boundary for classifying the data is non-linear.
On the other hand, the method disclosed in Non Patent Literature 1 considers classification with non-linearity. However, that method was developed on the assumption of a non-changing distribution of data. When the conditional distribution of labels in streaming data changes over time, the method disclosed in Non Patent Literature 1 therefore has a problem that the accuracy of the classifier deteriorates over time.
It is an exemplary object of the present invention to provide a learning system, learning method and learning program that can learn dynamics of non-linear boundaries used for classification.
A learning system for learning a model for estimating a label indicating classification of data, the learning system according to the present invention includes: a classifier model learning unit which learns, using input data, a classifier model for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance; a classifier time series model learning unit which learns, for each expert, a time series model indicating time series change of the classifier model of the expert; a data model parameter learning unit which learns a data model parameter for a data model indicating the distribution of data features for each expert; and an assignment parameter learning unit which learns an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
A learning method for learning a model for estimating a label indicating classification of data, the learning method according to the present invention includes: learning, using input data, a classifier model for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance; learning, for each expert, a time series model indicating time series change of the classifier model of the expert; learning a data model parameter for a data model indicating the distribution of data features for each expert; and learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
A learning program for learning a model for estimating a label indicating classification of data, the learning program according to the present invention causes a computer to perform: a classifier model learning process of learning, using input data, a classifier model for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance; a classifier time series model learning process of learning, for each expert, a time series model indicating time series change of the classifier model of the expert; a data model parameter learning process of learning a data model parameter for a data model indicating the distribution of data features for each expert; and an assignment parameter learning process of learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
According to the present invention, it is possible to learn dynamics of non-linear boundaries used for classification.
Fig. 1 depicts an exemplary block diagram illustrating the structure of the exemplary embodiment of a learning system according to the present invention. Fig. 2 depicts an exemplary explanatory diagram illustrating an example of variational inference. Fig. 3 depicts an exemplary explanatory diagram illustrating an example of the learning process by a learning unit. Fig. 4 depicts an exemplary explanatory diagram illustrating an example of the prediction process by a future classification unit. Fig. 5 depicts an exemplary explanatory diagram illustrating a specific example of the learning process. Fig. 6 depicts an exemplary explanatory diagram illustrating a specific example of the prediction process. Fig. 7 depicts a block diagram illustrating an outline of the learning system according to the present invention. Fig. 8 depicts a schematic block diagram illustrating the configuration example of the computer according to the exemplary embodiment of the present invention.
The following describes an exemplary embodiment of the present invention with reference to drawings.
Fig. 1 depicts an exemplary block diagram illustrating the structure of an exemplary embodiment of a learning system according to the present invention. The learning system 100 according to the present exemplary embodiment includes a learning unit 10 and a future classification unit 20.
The learning unit 10 includes a data acquisition unit 101, a data processing unit 102, an expert initialization unit 103, an expert learning unit 104 and an expert storage unit 105.
The data acquisition unit 101 acquires data used for learning by the expert learning unit 104 described later. In the present exemplary embodiment, the data acquisition unit 101 receives labeled streaming data as training data. Here, labeled training data means a combination of data for learning and a label indicating the classification of that data.
Hereinafter, in order to simplify the description, the case of performing binary classification will be described. That is, it is determined whether the target data is data that belongs to a certain group (hereinafter, also referred to as positive data) or data that does not belong to the group (hereinafter, also referred to as negative data). For the training data, a label indicating positive data or a label indicating negative data may be used as a label.
Further, a set of training data collected periodically at times t = 1, 2, ..., T is considered as D := {D_t}_{t=1}^{T}, where D_t := {(x_{t,i}, y_{t,i})}_{i=1}^{N_t}. Here, x_{t,i} is the D-dimensional feature vector of the i-th sample at time t, and y_{t,i}, which is an element of {0, 1}, is its class label. N_t is the number of training data collected at time t. Note that sequential data can be represented in this format by discretizing at regular intervals and considering the data falling within the same interval to have arrived at the same time.
Further, X_t := {x_{t,i}}_{i=1}^{N_t} and Y_t := {y_{t,i}}_{i=1}^{N_t} are denoted. In the exemplary embodiment, given the set of training data D, the object is to predict a binary classifier h_t : R^D -> {0, 1} at t (t an element of {T+1, T+2, ...}) that can precisely classify data at time t.
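As an illustrative sketch only (not part of the disclosed embodiment), the grouping of time-annotated samples into the set D can be expressed as follows; the function name build_training_set and the (t, x, y) record layout are assumptions of this example.

```python
from collections import defaultdict

def build_training_set(records):
    """Group time-stamped (t, x, y) records into D := {D_t}, where D_t
    collects the (feature vector, label) pairs observed at time t."""
    D = defaultdict(list)
    for t, x, y in records:
        assert y in (0, 1), "binary class labels only"
        D[t].append((x, y))
    return dict(D)

# Streaming samples already discretized to times 1 and 2
records = [
    (1, [0.2, 1.1], 0),
    (1, [1.5, 0.3], 1),
    (2, [0.1, 0.9], 0),
]
D = build_training_set(records)  # N_1 = 2, N_2 = 1
```

Discretizing sequential data to a common time index, as the text describes, reduces to choosing the key t before grouping.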
The data acquisition unit 101 may acquire training data from a storage unit (not shown) included in the learning system 100, or may acquire training data from an external storage server or the like (not shown) via a communication network.
The data processing unit 102 converts the acquired data into training data. In particular, the data processing unit 102 converts the streaming data into feature and label vectors with time annotations. That is, the data processing unit 102 generates the set of training data D described above from the acquired data.
In the exemplary embodiment, it is assumed that the feature vectors x_{t,i} (i = 1, ..., N_t; t = 1, ..., T) are generated from a finite number of stationary clusters/groups and that within each cluster a linear decision boundary (hereafter, simply a decision boundary) exists, separating the positive labels from the negative ones. Further, the decision boundary in each cluster can change with respect to time, and the dynamics of the decision boundary need not be the same among the clusters. The "dynamics" means the time-series change of the classification criteria of a classifier model.
Our model, which is used in the exemplary embodiment, is based on Dirichlet Process Mixtures (DPM). It is used to identify the number of clusters/groups from D automatically and to assign an expert to each cluster to model the data distribution along with the conditional distribution of class labels given the data. The experts are collectively assigned to a task of classification of the input data. Moreover, for each expert, the classification criterion at time t = {1, 2, ..., T} is learned using a standard classifier model (such as logistic regression, SVM, etc.), along with the temporal change of the classification criterion using a standard time series model (such as a Vector Autoregressive Model, Gaussian Process, etc.). Thus, for each expert, the past classification criteria and a time series model over them can be used to predict future classification criteria, i.e., at times t = T+1, T+2, ....
If logistic regression is used as the classifier model for each expert, then locally, within a cluster, the relationship between x_{t,i} and y_{t,i} is linear. But if the mixture contains more than one component, this relationship becomes non-linear globally. Thus, using this model, non-linear decision boundaries at future time instances can be predicted, provided the classification provided by each expert is combined in an appropriate fashion.
Here, an example of one such realization of the present invention is described. For each expert, logistic regression is used as the classifier model and a Vector Autoregressive Process as the time series model. DPM assumes there exist countably infinite clusters/groups within the data; however, it exhibits a clustering property. Thus, in practice, a finite number of experts is inferred through our model, which may provide an accurate approximation to the underlying non-linear decision boundary.
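The following sketch illustrates, under assumed toy parameters, how a mixture of locally linear logistic experts gated by Gaussian data densities yields a globally non-linear decision boundary; every name and value in it (predict_mixture, the expert dictionaries) is invented for illustration and is not the embodiment's learned model.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gaussian_density(x, mean, var):
    # Isotropic multivariate normal density with shared variance `var`
    d = len(x)
    quad = sum((xi - mi) ** 2 for xi, mi in zip(x, mean)) / var
    return math.exp(-0.5 * quad) / ((2.0 * math.pi * var) ** (d / 2.0))

def predict_mixture(x, experts):
    """P(y=1 | x): experts gated by their data density, each applying a
    linear (logistic) classifier; globally the boundary is non-linear."""
    dens = [gaussian_density(x, e["mean"], e["var"]) for e in experts]
    total = sum(dens)
    gates = [d / total for d in dens]
    probs = [
        sigmoid(e["w"][0] + sum(wi * xi for wi, xi in zip(e["w"][1:], x)))
        for e in experts
    ]
    return sum(g * p for g, p in zip(gates, probs))

experts = [
    # expert 1: active near (-2, 0), local boundary is the line x1 = 0
    {"mean": [-2.0, 0.0], "var": 1.0, "w": [0.0, 4.0, 0.0]},
    # expert 2: active near (2, 0), local boundary is the line x2 = 0
    {"mean": [2.0, 0.0], "var": 1.0, "w": [0.0, 0.0, 4.0]},
]
```

Near each cluster the dominant expert imposes its own linear criterion, so the combined boundary bends from one line to the other across the input space.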
In the following explanation, when using a Greek letter in the text, an English notation of Greek letter may be enclosed in brackets ([]). In addition, when representing an upper case Greek letter, the beginning of the word in [] is indicated by capital letters, and when representing lower case Greek letters, the beginning of the word in [] is indicated by lower case letters.
Within a cluster/group k, the distribution of x is modeled by the expert as a multivariate normal. Specifically, the distribution of x is represented by the following Equation 1. In Equation 1, N([mu], [Sigma]) is the multivariate Gaussian distribution with mean [mu] and covariance matrix [Sigma], the element m_k of R^D is the mean vector of cluster k, the element R_k of R^{DxD} is its precision matrix, [Phi] = {[Phi]_k}_{k=1}^{K} s.t. [Phi]_k := {m_k, R_k}, z_{t,i} is the cluster indicator for the data sample x_{t,i}, and 1(.) is an indicator function.
p(x_{t,i} | z_{t,i}, [Phi]) = prod_{k} N(x_{t,i} | m_k, R_k^{-1})^{1(z_{t,i} = k)}    (Equation 1)
For each expert k, the prior over [Phi] is given by Normal-Wishart distribution as shown in Equation 2 below.
p(m_k, R_k) = N(m_k | g_0, ([beta]_0 R_k)^{-1}) W(R_k | V_0, f_0)    (Equation 2)
In Equation 2, the element g_0 of R^D is the mean vector, the element V_0 of R^{DxD} is the scale matrix, the element [beta]_0 of R is a scale parameter and the element f_0 of R is the degree of freedom.
In the exemplary embodiment, within a cluster k, the probability of label y_{t,i} given feature vector x_{t,i} is modelled by logistic regression as shown in Equations 3 and 4 below.
p(y_{t,i} = 1 | x_{t,i}, z_{t,i} = k, W) = [sigma](w_{k,t,0} + w_{k,t,1:D}^T x_{t,i})    (Equation 3)
[sigma](a) := 1 / (1 + exp(-a))    (Equation 4)
In Equations 3 and 4, w_{k,t,0} and w_{k,t,1:D} are the bias term and the parameter vector for classifier h_{k,t} respectively, w_{k,t} := (w_{k,t,0}, w_{k,t,1:D}), and [sigma](.) is the sigmoid function.
Also, similar to the method described in Patent Literature 1, a simple Vector Autoregressive model (VAR) of order M is used to model the dynamics of the classifier parameters w_{k,t} for all experts, as shown in Equations 5, 6 and 7 below.
[Equations 5 to 7]
In Equations 5, 6 and 7, the elements A_{k,1}, A_{k,2}, ..., A_{k,M} of R^{(D+1)x(D+1)} are the (D+1)x(D+1) matrices defining the dynamics, the element A_{k,0} of R^{D+1} is the bias term, and [theta] and [theta]_0 are elements of R+. As in Equation 6, for each expert, the classifier parameters at time t depend linearly only on the past M values of the expert's classifier parameters. Keeping the dynamics parameters independent across the experts gives this model the ability to have separate dynamics for each expert.
However, for the sake of simplicity, A_{k,m} is restricted to be a diagonal matrix. This means the i-th component of the classifier parameter, w_{k,t,i}, depends only on its own previous values w_{k,t-1,i}, w_{k,t-2,i}, .... Since the M-th order VAR model cannot be used when t <= M, we assume that w_{k,t} for t <= M is generated from the following distribution for each expert k.
[Equation]
The model of the exemplary embodiment assumes that the dynamics and bias parameters A_{k,m} (m = 0, ..., M) are generated from a normal distribution as in Equation 7.
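A minimal sketch of forecasting classifier weights with a per-expert diagonal VAR model of order M is given below; var_forecast and its toy inputs are assumptions of this example, not the learned dynamics of the embodiment.

```python
def var_forecast(history, A, bias, steps):
    """Forecast classifier weights with a diagonal VAR(M):
    w_t[i] = bias[i] + sum_m A[m][i] * w_{t-1-m}[i],
    so component i depends only on its own past M values.

    history: past weight vectors (oldest first), at least M of them
    A:       diagonals of A_1 ... A_M, stored as plain vectors
    bias:    the bias vector A_0
    """
    hist = [list(w) for w in history]
    M, dim = len(A), len(bias)
    for _ in range(steps):
        w_next = [
            bias[i] + sum(A[m][i] * hist[-1 - m][i] for m in range(M))
            for i in range(dim)
        ]
        hist.append(w_next)
    return hist[len(history):]

# Order-1 dynamics: the first component contracts toward 2.0, the second decays
preds = var_forecast([[2.0, 4.0]], A=[[0.5, 0.5]], bias=[1.0, 0.0], steps=2)
```

Because each A matrix is diagonal, the forecast of one weight component never mixes in another component, which is exactly the simplification the text describes.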
Let [Omega]_k := {W_k, A_k, [Phi]_k} be the set of parameters corresponding to an expert k in the mixture, where W_k = {w_{k,t}}_{t=1}^{T}, A_k = {A_{k,m}}_{m=0}^{M} and [Phi]_k = {m_k, R_k}. In the model of the exemplary embodiment, a Dirichlet Process is set over [Omega]_k as shown in Equations 8 and 9 below.
[Equations 8 and 9]
The distribution G_0 is a product of the distributions given in Equations 2, 5, 6 and 7. The hyperparameters in these distributions have prior probabilities as shown in Equations 10, 11 and 12 below.
[Equations 10 to 12]
Using the DP's stick-breaking representation, the proportions [pi]_k of the countably infinite clusters within the mixture are determined from the remaining stick length by a Beta distribution, as shown in Equation 13.
[Equation 13]
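The stick-breaking construction can be sketched as follows; stick_breaking_proportions is an illustrative name, and the Beta draws are supplied directly rather than sampled.

```python
def stick_breaking_proportions(v):
    """Turn stick-breaking fractions v_k (Beta draws) into mixture
    proportions: pi_k = v_k * prod_{j<k} (1 - v_j)."""
    proportions = []
    remaining = 1.0  # length of stick still unbroken
    for vk in v:
        proportions.append(vk * remaining)
        remaining *= (1.0 - vk)
    return proportions

# Break off half, then half of the rest, then all of the rest
pi = stick_breaking_proportions([0.5, 0.5, 1.0])  # [0.5, 0.25, 0.25]
```

Truncating the sequence (as the variational inference later does) simply stops the loop after a finite number of breaks.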
Correspondingly, the multinomial probabilities of cluster indicator parameters can be written as the following Equation 14.
[Equation 14]
The joint distribution of the labeled data D, the data model parameters [Phi], the classifier parameters W and their precision parameters {[Theta], [Theta]0}, the classifier dynamics parameters A and their precision parameters [Gamma], the expert assignment variables (parameters) Z, and the component probabilities [pi]' is written as follows.
[Equation]
Hyperparameters [Theta] and [Theta]0 are defined as follows.
[Equation]
In addition, W, A, Z, [Gamma] and [Phi] are as shown below.
[Equation]
In the probabilistic model defined by the above formulae, a probability distribution p(W, A, Z, [Phi], [Theta], [Theta]0, [Gamma], [pi]’ | D) of parameters W, A, Z, [Phi], [Theta], [Theta]0, [Gamma], [pi]’ in a case where labeled learning data D is given is obtained.
However, since it is difficult to directly obtain these probability distributions, in the present exemplary embodiment an approximate distribution q(W, A, Z, [Phi], [Theta], [Theta]0, [Gamma], [pi]') of the probability distribution p(W, A, Z, [Phi], [Theta], [Theta]0, [Gamma], [pi]' | D) is obtained by using the so-called variational Bayes method of approximately obtaining a posterior probability.
The expert learning unit 104 performs variational inference to find posterior probabilities of hidden variables and parameters in our model.
The lower bound L(q) of the log marginal likelihood of the proposed model is expressed as shown in Equation 15.
[Equation 15]
In the present exemplary embodiment, the expert learning unit 104 uses the lower bound for logistic regression proposed in Non Patent Literature 2 to convert it to an exponential family distribution, as required in the variational inference procedure. Non Patent Literature 2 introduces a variable [xi]_{t,i} per feature vector x_{t,i}, which changes our lower bound L(q) to L(q, [xi]). The variational posterior q(W, A, Z, [Phi], [Theta], [Theta]0, [Gamma], [pi]') can be factorized using a mean field approximation as shown in Equation 16.
[Equation 16]
Individual variational distributions are written as shown below in Equations 17-24.
[Equations 17 to 24]
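The quadratic lower bound on the sigmoid commonly attributed to Non Patent Literature 2, [sigma](x) >= [sigma](xi) exp((x - xi)/2 - [lambda](xi)(x^2 - xi^2)) with [lambda](xi) = tanh(xi/2)/(4 xi), can be checked numerically with the following sketch; the function names are assumptions of this example, and the exact form used in the embodiment's derivation is not reproduced here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lam(xi):
    # lambda(xi) = tanh(xi/2) / (4*xi), extended by continuity to 1/8 at xi = 0
    if xi == 0.0:
        return 0.125
    return math.tanh(xi / 2.0) / (4.0 * xi)

def jj_lower_bound(x, xi):
    """Quadratic-in-x lower bound on sigmoid(x); it is tight at x = +/- xi."""
    return sigmoid(xi) * math.exp((x - xi) / 2.0 - lam(xi) * (x * x - xi * xi))

# The bound never exceeds the sigmoid and touches it at x = xi
gap = sigmoid(1.0) - jj_lower_bound(1.0, 2.0)
```

Because the bound is the exponential of a quadratic in x, it plays the role of an (unnormalized) Gaussian factor, which is what makes the conjugate variational updates tractable.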
The learning process performed by the learning unit 10 using the available data D will be specifically described below. In the present exemplary embodiment, we consider a truncated representation of the DP's stick-breaking process in the variational inference, which limits the countably infinite number of experts to a finite plurality of experts.
Fig. 2 is an exemplary explanatory diagram illustrating an example of variational inference performed by the learning unit 10 according to the present exemplary embodiment. The expert learning unit 104 inputs the data D and the hyperparameters u_0, v_0, u, v, a, b, g_0, [beta]_0, V_0 and f_0 (step S500).
The expert initialization unit 103 initializes each of W, A, Z, [Phi], [Theta], [Theta]0, [Gamma], and [pi]' (step S501). The expert initialization unit 103 may perform the expert initialization in an arbitrary manner; for example, it may initialize the experts using a pre-identified set of parameters (for example, initializing the parameters to 0 or 1).
Next, in the expert learning unit 104, the processes from step S502 to step S514 are repeated until the iterator reaches the maximum (iter < max_iter). Further, the processes from step S503 to step S512 are repeated for each expert. Furthermore, the processes from step S504 to step S507 are repeated for times 1 to T.
Furthermore, the processes from step S505 to step S506 are repeated for each data dimension. Specifically, in step S506, the expert learning unit 104 updates the parameters of W using Equations 45 to 48 shown below.
In step S507, the expert learning unit 104 updates parameters of [Xi] using Equation 49 shown below.
Thereafter, the processes from step S508 to step S510 are repeated by the order M of dynamics. Specifically, in step S509, the expert learning unit 104 updates parameters of A using Equations 41 to 44 shown below. Furthermore, in step S510, the expert learning unit 104 updates parameters of [Gamma] using Equations 39 to 40 shown below.
In step S511, the expert learning unit 104 updates parameters of [Phi] using Equations 28 to 34 shown below. Furthermore, in step S512, the expert learning unit 104 updates parameters of [Theta] and [Theta]0 using Equations 35 to 38 shown below.
In step S513, the expert learning unit 104 updates parameters of Z using Equation 27 shown below. Furthermore, in step S514, the expert learning unit 104 updates parameters of [pi]’ using Equations 25 and 26 shown below.
[Equations 25 to 49]
Then, in step S515, the expert learning unit 104 outputs optimized q(W), q(A), q(Z), q([Phi]), q([Theta]), q([Theta]0), q ([Gamma]) and q([pi]’).
The expert learning unit 104 stores model data for each expert in the expert storage unit 105. That is, the expert storage unit 105 stores the model data of each expert. The expert storage unit 105 is realized by, for example, a magnetic disk or the like. Note that, since the model of each expert is learned individually, the expert learning unit 104 may perform normalization processing on all the learned expert models and then store them in the expert storage unit 105.
The learning unit 10 (more specifically, the data acquisition unit 101, the data processing unit 102, the expert initialization unit 103 and the expert learning unit 104) is implemented by a CPU of a computer operating according to a program (learning program). For example, the program may be stored in the storage unit (not shown) included in the learning system 100, with the CPU reading the program and, according to the program, operating as the learning unit 10 (more specifically, the data acquisition unit 101, the data processing unit 102, the expert initialization unit 103 and the expert learning unit 104). The functions of the learning system may be provided in the form of SaaS (Software as a Service).
The learning unit 10 (more specifically, the data acquisition unit 101, the data processing unit 102, the expert initialization unit 103 and the expert learning unit 104) may each be implemented by dedicated hardware. All or part of the components of each device may be implemented by general-purpose or dedicated circuitry, processors, or combinations thereof. They may be configured with a single chip, or configured with a plurality of chips connected via a bus. All or part of the components of each device may be implemented by a combination of the above-mentioned circuitry or the like and program.
In the case where all or part of the components of each device is implemented by a plurality of information processing devices, circuitry, or the like, the plurality of information processing devices, circuitry, or the like may be centralized or distributed. For example, the information processing devices, circuitry, or the like may be implemented in a form in which they are connected via a communication network, such as a client-and-server system or a cloud computing system.
The future classification unit 20 receives a new (unlabeled) sample, and predicts its label by combining the label predictions from each expert. In the present exemplary embodiment, the predictions of each expert on the new sample are combined in a probabilistic fashion. That is, to combine the predictions, it is first necessary to find the weights assigned to each expert for this new sample, and then the predictions on the new sample by each expert's classifier at the time instance of the new sample.
For streaming data at time T' > T with D_{T'} := {x_{T',i}}_{i=1}^{N_{T'}} as the unlabeled data set, the label prediction is performed in two steps. First, the distribution of classifier weights P(w_{k,T'}) is evaluated for k an element of {1, 2, ..., K}, where K is the total number of experts, and time T' > T. The distribution of classifier weights is calculated with a sampling-cum-marginalization approach, as shown in Equation 50 below.
[Equation 50]
The dynamics parameters A_{k,m} for m = 0, 1, 2, ..., M are sampled i.i.d. as follows.
[Equation]
Let
[Equation]
Then Equations 51 and 52 shown below are obtained.


[Equations 51 and 52]
Here, [tau](a) := 1 / (1 + [pi] a / 8)^{1/2}, and [omega]_{k,T',i} denotes the probability of choosing the k-th expert for classification, which is further represented as Equations 53 and 54.
[Equation 53]
such that
[Equation 54]
Additionally, the probability of assigning an expert to z_{T',i}, a cluster indicator variable, can be approximated as
[Equation]
where N denotes the total number of samples in the labeled data set D.
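A common use of the quantity [tau](a) defined above is the probit-style approximation E[[sigma](a)] for a ~ N(mu, s^2) roughly equal to [sigma]([tau](s^2) mu), i.e. marginalizing a sigmoid over a Gaussian on the classifier's activation. The following sketch states that approximation; predictive_probability is an illustrative name, and it is an assumption of this example that the embodiment applies [tau] in exactly this way.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tau(a):
    # tau(a) := 1 / (1 + pi*a/8)^(1/2), as defined in the text
    return 1.0 / math.sqrt(1.0 + math.pi * a / 8.0)

def predictive_probability(mu, var):
    """Approximate E[sigmoid(a)] for a ~ N(mu, var) by sigmoid(tau(var) * mu).

    Larger predictive variance shrinks the activation toward zero, pulling
    the predicted probability toward 0.5 (maximum uncertainty)."""
    return sigmoid(tau(var) * mu)

# With zero mean the approximation is exact by symmetry: returns 0.5
p = predictive_probability(0.0, 5.0)
```

This makes the label prediction reflect uncertainty in the forecast classifier weights rather than just their mean.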
As illustrated in Fig. 1, the future classification unit 20 includes a data acquisition unit 201, a data processing unit 202, an expert identification unit 203, a classification output unit 204 and a label storage unit 205.
The data acquisition unit 201 receives un-labeled streaming data (hereinafter also referred to as a sample). That is, the data acquisition unit 201 receives data to be classified.
The data processing unit 202 converts the received streaming data into feature vectors with time annotations. The method of converting streaming data into a time-annotated feature vector is the same as the method performed by the data processing unit 102, but label data is not created.
The expert identification unit 203 identifies parameters for each expert for the task of classification of unlabeled data. The expert identification unit 203 includes an expert weighting unit 2031 and a classifier creating unit 2032.
The expert weighting unit 2031 calculates the weight for each expert. Specifically, the expert weighting unit 2031 calculates the weight of each expert based on the assignment parameter using Equations 53 and 54 described above.
The classifier creating unit 2032 calculates the future weights of the classifier using the dynamics. Specifically, the classifier creating unit 2032 determines a classifier at the time instance of a new sample using a time series model, by using Equation 50 described above. That is, the classifier creating unit 2032 predicts the classifier parameters for each expert at the time instance of the new sample using the autoregressive time-series model of classifier parameters.
The classification output unit 204 predicts the label of the new sample for each expert using the classifier parameters obtained by the classifier creating unit 2032, and combines these label predictions using the weights obtained by the expert weighting unit 2031. Specifically, the classification output unit 204 determines the label predictions for all experts and combines them in a probabilistic fashion by using Equations 51 and 52 described above.
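The probabilistic combination performed by the classification output unit 204 can be sketched as a weighted average of the experts' label probabilities; combine_predictions and the example weights are assumptions of this illustration, not the embodiment's exact computation.

```python
def combine_predictions(expert_weights, expert_probs):
    """Combine per-expert label probabilities into a single prediction.

    expert_weights: gating probabilities omega_k for the sample (sum to 1)
    expert_probs:   each expert's predicted probability that the label is 1
    Returns the mixture probability and the hard label (threshold 0.5).
    """
    assert abs(sum(expert_weights) - 1.0) < 1e-9, "weights must sum to 1"
    p = sum(w * q for w, q in zip(expert_weights, expert_probs))
    return p, int(p >= 0.5)

# Two experts: the first dominates this sample and predicts label 1
p, label = combine_predictions([0.7, 0.3], [0.9, 0.2])
```

An expert that is unlikely to be responsible for the sample contributes little to the final probability, even if its own prediction is confident.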
The classification output unit 204 stores the determined label in the label storage unit 205. That is, the label storage unit 205 stores a label for the input streaming data. The label storage unit 205 is realized by, for example, a magnetic disk or the like.
The future classification unit 20 (more specifically, the data acquisition unit 201, the data processing unit 202, the expert identification unit 203 (more specifically, the expert weighting unit 2031 and the classifier creating unit 2032) and the classification output unit 204) is implemented by a CPU of a computer operating according to a program (learning program, prediction program).
The future classification unit 20 (more specifically, the data acquisition unit 201, the data processing unit 202, the expert identification unit 203 (more specifically, the expert weighting unit 2031 and the classifier creating unit 2032) and the classification output unit 204) may each be implemented by dedicated hardware.
Next, operation of the learning system according to the present exemplary embodiment will be described. Fig. 3 depicts a flowchart illustrating an example of learning processing by the learning unit 10.
The data acquisition unit 101 receives labeled streaming data as learning data until time T (step S101). The data processing unit 102 converts the streaming data into feature and label vectors with time annotations (step S102). The expert initialization unit 103 initializes all the experts with pre-identified parameters (step S103).
Next, loop process A (steps S1031 to S1032) is repeated until the termination condition is satisfied. Furthermore, loop process B (steps S1033 to S1034) is repeated at expert level over all the pre-specified number of experts.
The expert learning unit 104 learns classifier model for each expert at each time (step S1041). The expert learning unit 104 learns classifier time series model for each expert (step S1042). The expert learning unit 104 learns expert parameters for data model (step S1043). The expert learning unit 104 learns expert assignment parameters for all data points (step S1044).
The expert learning unit 104 stores model data in the expert storage unit 105. That is, the expert storage unit 105 stores model data for each expert (step S105).
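The alternating structure of loop processes A and B can be sketched with a deliberately simplified EM-style loop. Only the data-model update (step S1043) and the assignment update (step S1044) are worked out; the classifier and time-series updates (steps S1041 and S1042) are elided, and all function and variable names are illustrative, not the patent's notation (which uses variational updates).

```python
import numpy as np

def gauss_pdf(X, mu, cov):
    """Multivariate normal density, evaluated row-wise."""
    d = X.shape[1]
    diff = X - mu
    mahal = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    return np.exp(-0.5 * mahal) / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))

def responsibilities(X, mus, covs, mix):
    """Soft expert-assignment probabilities for every sample (cf. step S1044)."""
    dens = np.column_stack([mix[k] * gauss_pdf(X, mus[k], covs[k])
                            for k in range(len(mus))])
    return dens / dens.sum(axis=1, keepdims=True)

def learn(X, n_experts=2, n_iters=20):
    """Alternate the per-expert data-model fit (loop process B) inside an
    outer convergence loop (loop process A)."""
    # Initialize expert means from evenly spaced data points.
    mus = X[np.linspace(0, len(X) - 1, n_experts).astype(int)].copy()
    covs = [np.eye(X.shape[1]) for _ in range(n_experts)]
    mix = np.full(n_experts, 1.0 / n_experts)
    for _ in range(n_iters):                            # loop process A
        Z = responsibilities(X, mus, covs, mix)
        for k in range(n_experts):                      # loop process B
            r = Z[:, k] / Z[:, k].sum()
            mus[k] = r @ X                              # weighted mean (cf. S1043)
            diff = X - mus[k]
            covs[k] = (diff * r[:, None]).T @ diff + 1e-6 * np.eye(X.shape[1])
        mix = Z.mean(axis=0)
    return mus, covs, mix, Z
```

Each pass re-estimates every expert's data model under the current soft assignments, then refreshes the assignments, mirroring the termination-checked outer loop and the per-expert inner loop of the flowchart.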
Fig. 4 depicts a flowchart illustrating an example of prediction processing by the future classification unit 20.
The data acquisition unit 201 receives un-labeled streaming data (step S201). The data processing unit 202 converts the streaming data into feature vectors with time annotations (step S202). In step S203, the expert identification unit 203 (specifically, the expert weighting unit 2031) computes weights for each expert (step S2031), and the expert identification unit 203 (specifically, the classifier creating unit 2032) computes the future weights of the classifier using the dynamics (step S2032).
The classification output unit 204 combines label predictions from all the experts (step S204). The classification output unit 204 stores predicted labels for the input data in the label storage unit 205. That is, the label storage unit 205 stores predicted labels for the input data (step S205).
As described above, according to the present exemplary embodiment, the expert learning unit 104 learns the classifier model for a mixture of classifiers (experts) at each time, and the classifier time series model for each expert. Moreover, the expert learning unit 104 learns, for each expert, a data model parameter for the data model, and learns the assignment parameter for individual samples in the input data. Therefore, it is possible to learn the dynamics of the non-linear boundaries used for classification.
Furthermore, according to the present exemplary embodiment, the expert weighting unit 2031 calculates the weight of each expert based on the assignment parameter, and the classifier creating unit 2032 predicts the classifier weights corresponding to the sample’s time using the classifier time series model. Then, the classification output unit 204 predicts the probability of the label of the sample for each expert, combines the probabilities of the labels of all the experts, and predicts the label of the sample. Therefore, even if the conditional distribution of labels in the streaming data changes over time, it is possible to suppress a decrease in the accuracy of the classifier.
Note that the expert learning unit 104 of the present exemplary embodiment may use a neural network in place of logistic regression for the classifier model, or in place of the AR process for the time series model.
Next, a specific example of the learning system of the present exemplary embodiment will be described. Fig. 5 is an exemplary explanatory diagram illustrating a specific example of the learning process. First, streaming data from time 1 to T is given as learning data. The example illustrated in Fig. 5 shows that streaming data having X1 and X2 as features is labeled with two classes (class 1 and class 2).
Next, modeling is performed on these data (feature data). For example, as illustrated in Fig. 5, the mean of each expert is calculated over time. Then, the unique classifier (linear decision boundary) of each expert is learned for each time. Since the conditional distribution of labels changes over time, Fig. 5 shows an example where the decision boundaries also change.
Fig. 6 is an exemplary explanatory diagram illustrating a specific example of the prediction process. First, unlabeled streaming data from time T + 1 to T + M is given as classification target data. The example illustrated in Fig. 6 shows that streaming data having X1 and X2 as features is given.
Next, the future classification unit 20 refers to the learned experts stored in the expert storage unit 105, and predicts a classifier at each time. As a result, the class into which given data is classified is predicted at each time. As in the example illustrated in Fig. 5, the conditional distribution of labels changes over time, so Fig. 6 also shows the decision boundary changing.
Next, an outline of the present invention will be described. Fig. 7 depicts a block diagram illustrating an outline of the learning system according to the present invention. The learning system 80 (for example, the learning system 100) according to the present invention learns a model for estimating a label indicating classification of data, and includes: a classifier model learning unit 81 (for example, the expert learning unit 104) which learns, using input data, a classifier model (for example, Q(wk,t)) for a mixture of classifiers, referred to as experts, that are collectively assigned to the task of classifying the input data, at each time instance; a classifier time series model learning unit 82 (for example, the expert learning unit 104) which learns, for each expert, a time series model (for example, Q(Ak)) indicating the time series change of that expert's classifier model; a data model parameter learning unit 83 (for example, the expert learning unit 104) which learns a data model parameter (for example, Q([phi]k)) for a data model indicating the distribution of data features for each expert; and an assignment parameter learning unit 84 (for example, the expert learning unit 104) which learns an assignment parameter (for example, Q(Z)) indicating the probability of assigning experts to individual samples in the input data.
With such a configuration, it is possible to learn dynamics of non-linear boundaries used for classification.
Further, the learning system 80 may include: a weight calculator (for example, the expert weighting unit 2031) which calculates a weight of each expert based on the assignment parameter; a weight predictor (for example, the classifier creating unit 2032) which predicts classifier weights corresponding to a sample’s time using the classifier time series model; and a label predictor (for example, the classification output unit 204) which predicts the probability of the label of the sample for each expert, combines the probabilities of the labels of all the experts, and predicts the label of the sample.
With such a configuration, even if the conditional distribution of labels in streaming data changes over time, it is possible to suppress the decrease in the accuracy of the classifier.
Further, the classifier model learning unit 81 may model the probability of the label given data in the grouped data by logistic regression.
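A minimal sketch of such a per-expert logistic model, fit here by gradient descent with each sample weighted by its responsibility toward the expert, is shown below. The actual learning in the patent is a variational update; the function and parameter names are illustrative.

```python
import numpy as np

def fit_weighted_logistic(X, y, resp, lr=0.5, n_iters=2000):
    """Fit one expert's logistic-regression classifier; `resp` down-weights
    samples that belong mostly to other experts."""
    Xb = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))        # P(y=1 | x) per sample
        grad = Xb.T @ (resp * (p - y)) / len(X)    # responsibility-weighted gradient
        w -= lr * grad
    return w

def predict_proba(X, w):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-(Xb @ w)))
```

Setting all responsibilities to one recovers ordinary logistic regression; per-expert responsibilities make each expert specialize in its own region of the feature space.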
Further, the classifier model learning unit 81 may determine the number of clusters from the learning data using a model based on Dirichlet process mixtures, and assign an expert to each of the determined clusters.
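The Dirichlet-process behaviour can be illustrated with its Chinese-restaurant-process view, in which the number of occupied "tables" after seeing the learning data plays the role of the number of clusters (and hence experts). This is an illustrative sampler, not the patent's inference procedure; `alpha` is the DP concentration parameter.

```python
import numpy as np

def crp_partition(n_samples, alpha, rng):
    """Sample a partition from the Chinese restaurant process; the number
    of occupied tables is the number of clusters."""
    counts = []
    assignments = []
    for _ in range(n_samples):
        # Join an existing cluster with probability proportional to its size,
        # or open a new one with probability proportional to alpha.
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)          # open a new cluster
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, len(counts)
```

The expected number of clusters grows only logarithmically with the number of samples, which is why the model can decide the number of experts from data rather than requiring it up front.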
Further, the data model parameter learning unit 83 may learn parameters based on the Normal-Wishart distribution and model the data based on a multivariate normal distribution.
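The conjugacy that makes this pairing convenient can be sketched with the standard Normal-Wishart posterior update for fully observed data. The patent's variational update for Q([phi]k) would use responsibility-weighted statistics instead, and the hyperparameter names below follow common textbook notation, not the patent's.

```python
import numpy as np

def normal_wishart_posterior(X, mu0, kappa0, nu0, W0_inv):
    """Standard conjugate Normal-Wishart update for a multivariate-normal
    data model with unknown mean and precision."""
    n, d = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)                 # scatter about the sample mean
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n    # precision-weighted mean
    diff = (xbar - mu0)[:, None]
    Wn_inv = W0_inv + S + (kappa0 * n / kappa_n) * (diff @ diff.T)
    return mu_n, kappa_n, nu_n, Wn_inv
```

Because the posterior stays in the Normal-Wishart family, each expert's data-model parameters can be refreshed in closed form from sufficient statistics of the samples assigned to it.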
Further, the assignment parameter learning unit 84 may model the cluster assignment based on a multinomial or categorical distribution.
Further, the classifier model learning unit 81 may learn the classifier model such that a collective decision boundary is an approximation to an underlying non-linear decision boundary at each past time instance.
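The following toy sketch shows how gating linear experts by their Gaussian data models yields a collective boundary that no single linear classifier can express. All parameters here are contrived for illustration and do not come from the patent.

```python
import numpy as np

def mixture_decision(x, mus, ws, bs, mix):
    """Blend K linear (logistic) classifiers, gated by each expert's
    spherical-Gaussian data model evaluated at x."""
    logits = np.array([-0.5 * np.sum((x - mu) ** 2) + np.log(m)
                       for mu, m in zip(mus, mix)])
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()                              # soft gating weights
    return sum(g / (1.0 + np.exp(-(w @ x + b)))     # blended class-1 probability
               for g, w, b in zip(gate, ws, bs))

# Two experts whose local linear boundaries face each other: the collective
# class-1 region is a band around x1 = 0, which no single halfspace gives.
mus = [np.array([-2.0, 0.0]), np.array([2.0, 0.0])]
ws = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
bs = [1.0, 1.0]
mix = [0.5, 0.5]
```

Near each expert's mean its own linear boundary dominates, so the piecewise-linear pieces stitch together into a non-linear collective boundary.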
Fig. 8 depicts a schematic block diagram illustrating a configuration of a computer according to at least one of the exemplary embodiments. A computer 1000 includes a CPU 1001, a main storage device 1002, an auxiliary storage device 1003, and an interface 1004.
The above-described learning system is implemented on the computer 1000. The operation of each of the processing units described above is stored in the auxiliary storage device 1003 in the form of a program (a learning program). The CPU 1001 reads the program from the auxiliary storage device 1003, deploys the program in the main storage device 1002, and executes the above processing according to the program.
Note that, in at least one of the exemplary embodiments, the auxiliary storage device 1003 is an exemplary non-transitory physical medium. Other examples of non-transitory physical media include a magnetic disk, a magneto-optical disk, a CD-ROM, a DVD-ROM, and a semiconductor memory connected via the interface 1004. In the case where the program is distributed to the computer 1000 by a communication line, the computer 1000 that has received the program may deploy the program in the main storage device 1002 and execute the processing described above.
Incidentally, the program may implement a part of the functions described above. The program may implement the aforementioned functions in combination with another program stored in the auxiliary storage device 1003 in advance, that is, the program may be a differential file (differential program).
Note that, a part of or all of the above exemplary embodiments can also be described as following supplementary notes, but is not limited to the following.
(Supplementary note 1)
A learning system for learning a model for estimating a label indicating classification of data, the learning system comprising: a classifier model learning unit which learns, using input data, a classifier model for a mixture of classifiers referred to as experts that are collectively assigned to a task of classification of the input data, at each time instance; a classifier time series model learning unit which learns, for each expert, a time series model indicating time series change of the classifier model of the expert; a data model parameter learning unit which learns a data model parameter for a data model indicating the distribution of data features for each expert; and an assignment parameter learning unit which learns an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
(Supplementary note 2)
The learning system according to supplementary note 1, further comprising: a weight calculator which calculates a weight of each expert based on the assignment parameter; a weight predictor which predicts classifier weights corresponding to a sample’s time using the classifier time series model; and a label predictor which predicts the probability of the label of the sample for each expert, combines the probabilities of the labels of all the experts, and predicts the label of the sample.
(Supplementary note 3)
The learning system according to supplementary note 1 or 2, wherein, the classifier model learning unit models the probability of the label given data in the grouped data by logistic regression.
(Supplementary note 4)
The learning system according to any one of supplementary notes 1 to 3, wherein, the classifier model learning unit determines the number of clusters from learning data using a model based on Dirichlet process mixtures, and assigns an expert to each of the determined clusters.
(Supplementary note 5)
The learning system according to any one of supplementary notes 1 to 4, wherein, the data model parameter learning unit learns parameters based on the Normal-Wishart distribution and models data based on a multivariate normal distribution.
(Supplementary note 6)
The learning system according to any one of supplementary notes 1 to 5, wherein, the assignment parameter learning unit models the cluster assignment based on a multinomial or categorical distribution.
(Supplementary note 7)
The learning system according to any one of supplementary notes 1 to 6, wherein, the classifier model learning unit learns the classifier model such that a collective decision boundary is an approximation to an underlying non-linear decision boundary at each past time instance.
(Supplementary note 8)
A learning method for learning a model for estimating a label indicating classification of data, the learning method comprising: learning, using input data, a classifier model for a mixture of classifiers referred to as experts that are collectively assigned to a task of classification of the input data, at each time instance; learning, for each expert, a time series model indicating time series change of the classifier model of the expert; learning a data model parameter for a data model indicating the distribution of data features for each expert; and learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
(Supplementary note 9)
The learning method according to supplementary note 8, further comprising: calculating a weight of each expert based on the assignment parameter; predicting classifier weights corresponding to a sample’s time using a classifier time series model; predicting the probability of the label of the sample for each expert; combining the probabilities of the labels of all the experts; and predicting the label of the sample.
(Supplementary note 10)
A learning program for learning a model for estimating a label indicating classification of data, the learning program causes a computer to perform: a classifier model learning process of learning, using input data, a classifier model for a mixture of classifiers referred to as experts that are collectively assigned to a task of classification of the input data, at each time instance; a classifier time series model learning process of learning, for each expert, a time series model indicating time series change of the classifier model of the expert; a data model parameter learning process of learning a data model parameter for a data model indicating the distribution of data features for each expert; and an assignment parameter learning process of learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
(Supplementary note 11)
The learning program according to supplementary note 10, that causes a computer to perform: a weight calculating process of calculating a weight of each expert based on the assignment parameter; a weight predicting process of predicting classifier weights corresponding to a sample’s time using a classifier time series model; and a label predicting process of predicting the probability of the label of the sample for each expert, combining the probabilities of the labels of all the experts, and predicting the label of the sample.
10 learning unit
20 future classification unit
100 learning system
101, 201 data acquisition unit
102, 202 data processing unit
103 expert initialization unit
104 expert learning unit
105 expert storage unit
203 expert identification unit
204 classification output unit
205 label storage unit
2031 expert weighting unit
2032 classifier creating unit
 

Claims (11)

  1. A learning system for learning a model for estimating a label indicating classification of data, the learning system comprising:
    a classifier model learning unit which learns, using input data, a classifier model for a mixture of classifiers referred to as experts that are collectively assigned to a task of classification of the input data, at each time instance;
    a classifier time series model learning unit which learns, for each expert, a time series model indicating time series change of the classifier model of the expert;
    a data model parameter learning unit which learns a data model parameter for a data model indicating the distribution of data features for each expert; and
    an assignment parameter learning unit which learns an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  2. The learning system according to claim 1, further comprising:
    a weight calculator which calculates a weight of each expert based on the assignment parameter;
    a weight predictor which predicts classifier weights corresponding to a sample’s time using the classifier time series model; and
    a label predictor which predicts the probability of the label of the sample for each expert, combines the probabilities of the labels of all the experts, and predicts the label of the sample.
  3. The learning system according to claim 1 or 2,
    wherein, the classifier model learning unit models the probability of the label given data in the grouped data by logistic regression.
  4. The learning system according to any one of claims 1 to 3,
    wherein, the classifier model learning unit determines the number of clusters from learning data using a model based on Dirichlet process mixtures, and assigns an expert to each of the determined clusters.
  5. The learning system according to any one of claims 1 to 4,
    wherein, the data model parameter learning unit learns parameters based on the Normal-Wishart distribution and models data based on a multivariate normal distribution.
  6. The learning system according to any one of claims 1 to 5,
    wherein, the assignment parameter learning unit models the cluster assignment based on a multinomial or categorical distribution.
  7. The learning system according to any one of claims 1 to 6,
    wherein, the classifier model learning unit learns the classifier model such that a collective decision boundary is an approximation to an underlying non-linear decision boundary at each past time instance.
  8. A learning method for learning a model for estimating a label indicating classification of data, the learning method comprising:
    learning, using input data, a classifier model for a mixture of classifiers referred to as experts that are collectively assigned to a task of classification of the input data, at each time instance;
    learning, for each expert, a time series model indicating time series change of the classifier model of the expert;
    learning a data model parameter for a data model indicating the distribution of data features for each expert; and
    learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  9. The learning method according to claim 8, further comprising:
    calculating a weight of each expert based on the assignment parameter;
    predicting classifier weights corresponding to a sample’s time using a classifier time series model;
    predicting the probability of the label of the sample for each expert;
    combining the probabilities of the labels of all the experts; and
    predicting the label of the sample.
  10. A learning program for learning a model for estimating a label indicating classification of data, the learning program causes a computer to perform:
    a classifier model learning process of learning, using input data, a classifier model for a mixture of classifiers referred to as experts that are collectively assigned to a task of classification of the input data, at each time instance;
    a classifier time series model learning process of learning, for each expert, a time series model indicating time series change of the classifier model of the expert;
    a data model parameter learning process of learning a data model parameter for a data model indicating the distribution of data features for each expert; and
    an assignment parameter learning process of learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  11. The learning program according to claim 10, that causes a computer to perform:
    a weight calculating process of calculating a weight of each expert based on the assignment parameter;
    a weight predicting process of predicting classifier weights corresponding to a sample’s time using a classifier time series model; and
    a label predicting process of predicting the probability of the label of the sample for each expert, combining the probabilities of the labels of all the experts, and predicting the label of the sample.
     
PCT/JP2019/029456 2019-04-04 2019-07-26 Learning system, method and program WO2020202594A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962829294P 2019-04-04 2019-04-04
US62/829,294 2019-04-04

Publications (1)

Publication Number Publication Date
WO2020202594A1 true WO2020202594A1 (en) 2020-10-08

Family

ID=72666502

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/029456 WO2020202594A1 (en) 2019-04-04 2019-07-26 Learning system, method and program

Country Status (1)

Country Link
WO (1) WO2020202594A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024061050A1 (en) * 2022-09-19 2024-03-28 北京数慧时空信息技术有限公司 Remote-sensing sample labeling method based on geoscientific information and active learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006127446A (en) * 2004-09-29 2006-05-18 Ricoh Co Ltd Image processing device, image processing method, program, and recording medium
JP2017107386A (en) * 2015-12-09 2017-06-15 日本電信電話株式会社 Instance selection device, classification device, method, and program
US20190041235A1 (en) * 2017-08-04 2019-02-07 Kabushiki Kaisha Toshiba Sensor control support apparatus, sensor control support method and non-transitory computer readable medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ATTAMIMI, MUHAMMAD ET AL.: "Autonomous Control of a Service Robot Based on Remote Control Data", PROCEEDINGS OF THE 31ST ANNUAL CONFERENCE OF THE ROBOTICS SOCIETY OF JAPAN [DVD-ROM], RSJ2013AC2C2-04, 4 September 2013 (2013-09-04) *
HANNAH, LAUREN A ET AL.: "Dirichlet Process Mixtures of Generalized Linear Models", JOURNAL OF MACHINE LEARNING RESEARCH, vol. 12, 2011, pages 1923 - 1953, XP055745980 *
KOIZUMI, YUMA ET AL.: "Intra-note Segmentation for Excitation-Continuous Musical Instruments based on Infinite Mixture Models nesting Hidden Markov Model", PROCEEDINGS OF 2014 SPRING MEETING OF ACOUSTICAL SOCIETY OF JAPAN [CD-ROM], 3 March 2014 (2014-03-03), pages 985 - 988, ISSN: 1880-7658 *
SASAKI, KENTARO ET AL.: "Time Series Topic Model Considering Dependence to Multiple Topics", IPSJ SIG TECHNICAL REPORTS. MPS-100, vol. 2014, no. 3, 18 September 2014 (2014-09-18), pages 1 - 6, XP055745989, Retrieved from the Internet <URL:https://ipsj.ixsq.nii.ac.jp/ej/?action=repository_uri&item_id=103127&file_id=1&file_no=1> [retrieved on 20190827] *



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19922436

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19922436

Country of ref document: EP

Kind code of ref document: A1