WO2020202594A1 - Learning system, method and program - Google Patents

Learning system, method and program

Info

Publication number
WO2020202594A1
WO2020202594A1 (PCT/JP2019/029456)
Authority
WO
WIPO (PCT)
Prior art keywords
learning
model
data
expert
classifier
Prior art date
Application number
PCT/JP2019/029456
Other languages
French (fr)
Inventor
Devendra Dhaka
Kanishka KHANDELWAL
Riki Eto
Original Assignee
Nec Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nec Corporation filed Critical Nec Corporation
Publication of WO2020202594A1 publication Critical patent/WO2020202594A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • The present invention relates to a learning system, learning method and learning program that learn a model for classifying input data samples with time (or, equivalently, sequential-order) annotations into one of many class labels, wherein the distribution of class labels of the input data changes with time.
  • In machine learning, a classifier outputs a label representing an attribute of certain data when that data is input. It is also known that the classification criterion of such a classifier may change over time. In order to prevent temporal deterioration of the classification accuracy of such a classifier, it is necessary to create a classifier whose classification criterion is updated.
  • Patent Literature 1 discloses a creating apparatus for creating a classifier.
  • The creating apparatus disclosed in Patent Literature 1 creates a classifier whose classification accuracy is maintained without frequently collecting labeled training data.
  • Non Patent Literature 1 discloses a method of performing non-linear classification.
  • the Dirichlet Process mixtures of Generalized Linear Models (DP-GLM) produces a global model of the joint distribution through a mixture of local generalized linear models.
  • Non Patent Literature 2 discloses that an accurate variational transformation can be used to obtain a closed form approximation to the posterior distribution of the parameters thereby yielding an approximate posterior predictive model.
  • The creating apparatus disclosed in Patent Literature 1 learns the classification criterion from input data with temporal attributes and class labels at each past time instance, learns a time series change model over these classification criteria, and uses that model to perform ahead prediction of the classification criterion of a classifier.
  • However, the creating apparatus disclosed in Patent Literature 1 is limited to a single classifier.
  • The explanation therein is given using a simple logistic regression classifier, which yields a linear classification criterion.
  • A non-linear classifier such as an SVM, boosting, or a neural network can be used in place of logistic regression when the classification criterion in the input data is non-linear at some or all time instances.
  • However, implementing a non-linear classification model in their fashion is not obvious. Therefore, there is a problem that the data cannot be classified properly if a classifier model based on logistic regression is used and the boundary for classifying the data is non-linear.
  • In Non Patent Literature 1, classification with non-linearity is considered.
  • However, the method disclosed in Non Patent Literature 1 assumes a non-changing (stationary) distribution of data.
  • Consequently, the method disclosed in Non Patent Literature 1 has a problem that the accuracy of the classifier deteriorates over time.
  • A learning system for learning a model for estimating a label indicating classification of data includes: a classifier model learning unit which learns, using input data, a classifier model for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance; a classifier time series model learning unit which learns, for each expert, a time series model indicating the time series change of the classifier model of the expert; a data model parameter learning unit which learns a data model parameter for a data model indicating the distribution of data features for each expert; and an assignment parameter learning unit which learns an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  • A learning method for learning a model for estimating a label indicating classification of data includes: learning, using input data, a classifier model for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance; learning, for each expert, a time series model indicating the time series change of the classifier model of the expert; learning a data model parameter for a data model indicating the distribution of data features for each expert; and learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  • A learning program for learning a model for estimating a label indicating classification of data causes a computer to perform: a classifier model learning process of learning, using input data, a classifier model for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance; a classifier time series model learning process of learning, for each expert, a time series model indicating the time series change of the classifier model of the expert; a data model parameter learning process of learning a data model parameter for a data model indicating the distribution of data features for each expert; and an assignment parameter learning process of learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  • Fig. 1 depicts an exemplary block diagram illustrating the structure of an exemplary embodiment of a learning system according to the present invention.
  • the learning system 100 according to the present exemplary embodiment includes a learning unit 10 and a future classification unit 20.
  • the learning unit 10 includes a data acquisition unit 101, a data processing unit 102, an expert initialization unit 103, an expert learning unit 104 and an expert storage unit 105.
  • the data acquisition unit 101 acquires data used for learning by the expert learning unit 104 described later.
  • the data acquisition unit 101 receives labeled streaming data as training data.
  • the labeled training data means a combination of data for learning and a label indicating the classification of this training data.
  • the target data is data that belongs to a certain group (hereinafter, also referred to as positive data) or data that does not belong to the group (hereinafter, also referred to as negative data).
  • a label indicating positive data or a label indicating negative data may be used as a label.
  • x_{t,i} is the D-dimensional feature vector of the i-th sample at time t, and y_{t,i}, which is an element of {0, 1}, is its class label.
  • N_t is the number of training data samples collected at time t. Note that sequential data can be represented in this format by discretizing at regular intervals and considering the data falling within the same interval to have arrived at the same time.
  • Given the set of training data D, the objective is to predict a binary classifier h_t : R^D -> {0, 1} at time t (t is an element of {T+1, T+2, ...}) which can precisely classify data at time t.
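The discretization note above can be sketched concretely. This is a minimal illustration, not the patent's implementation; the function name `discretize_stream` and the fixed-width binning interval are assumptions.

```python
import numpy as np

def discretize_stream(timestamps, features, labels, interval=1.0):
    """Bin streaming samples at regular intervals so that samples falling
    within the same interval are treated as arriving at the same time t.
    Returns {t: (X_t, y_t)}, where X_t stacks the feature vectors x_{t,i}
    and y_t the class labels y_{t,i}; N_t is then len(y_t)."""
    bins = np.floor(np.asarray(timestamps, dtype=float) / interval).astype(int)
    grouped = {}
    for t, x, y in zip(bins, features, labels):
        xs, ys = grouped.setdefault(t, ([], []))
        xs.append(x)
        ys.append(y)
    return {t: (np.array(xs), np.array(ys)) for t, (xs, ys) in grouped.items()}
```

With `interval=1.0`, two samples at timestamps 0.1 and 0.4 share time index 0, while a sample at 1.2 falls into index 1.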
  • the data acquisition unit 101 may acquire training data from a storage unit (not shown) included in the learning system 100, or may acquire training data from an external storage server or the like (not shown) via a communication network.
  • the data processing unit 102 converts the acquired data into training data.
  • the data processing unit 102 converts the streaming data into feature and label vectors with time annotations. That is, the data processing unit 102 generates the set of training data D described above from the acquired data.
  • the “dynamics” means time-series change of classification criteria of a classifier model.
  • Our model, which is used in the exemplary embodiment, is based on Dirichlet Process Mixtures (DPM). It is used to identify the number of clusters/groups from D automatically and to assign an expert to each cluster to model the data distribution along with the conditional distribution of class labels given the data.
  • the experts are collectively assigned to a task of classification of the input data.
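A Dirichlet Process mixture prior over cluster/expert assignments is commonly realized with a truncated stick-breaking construction. The sketch below shows that standard construction, not the patent's exact equations; the truncation level K, the concentration parameter alpha, and the function name are illustrative assumptions.

```python
import numpy as np

def stick_breaking_weights(alpha, K, seed=None):
    """Truncated stick-breaking construction of DPM mixture weights:
    v_k ~ Beta(1, alpha); pi_k = v_k * prod_{j<k} (1 - v_j).
    Setting v_K = 1 truncates the process so the K weights sum to 1."""
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0  # truncation: last expert takes the rest of the stick
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining
```

Smaller alpha concentrates mass on few experts; larger alpha spreads it over many, which is how the number of effective clusters adapts to the data.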
  • A standard classifier model, such as logistic regression or an SVM, can be used.
  • A standard time series model, such as a Vector Autoregressive Model or a Gaussian Process, can be used.
  • In Equation 1, N([mu], [Sigma]) is the multivariate Gaussian distribution with mean [mu] and covariance matrix [Sigma], the element m_k of R^D is the mean vector of cluster k, the element R_k of R^{DxD} is its precision matrix, [Phi]_k = {m_k, R_k}, z_{t,i} is the cluster indicator for the data sample x_{t,i}, and 1 is an indicator function.
  • In Equation 2, the element g_0 of R^D is the mean vector, the element V_0 of R^{DxD} is the scale matrix, the element [beta]_0 of R is a scale parameter, and the element f_0 of R is the degrees of freedom.
  • The probability of label y_{t,i} given feature vector x_{t,i} is modelled by logistic regression, as shown in Equations 3 and 4 below.
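Equations 3 and 4 are not reproduced here, but the logistic regression model for P(y_{t,i} = 1 | x_{t,i}) has the standard sigmoid form. The sketch below assumes a weight vector in R^(D+1) with an appended bias component, matching the (D+1)-dimensional convention used for the dynamics; the function name is illustrative.

```python
import numpy as np

def label_probability(w, x):
    """P(y = 1 | x) under logistic regression: sigmoid(w . [x, 1]),
    with a bias term appended so that w lives in R^(D+1)."""
    a = np.dot(w, np.append(x, 1.0))
    return 1.0 / (1.0 + np.exp(-a))
```

With w = 0 the model is maximally uncertain (probability 0.5); a large positive activation drives the probability toward 1.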
  • VAR: Vector Autoregressive model.
  • In Equations 5, 6 and 7, the elements A_{k,1}, A_{k,2}, ..., A_{k,m} of R^{(D+1)x(D+1)} are the (D+1)x(D+1) matrices defining the dynamics, the element A_{k,0} of R^{D+1} is the bias term, and [theta] and [theta]_0 are elements of R_+.
  • The classifier parameters at time t depend linearly only on the past m values of the expert's classifier parameters. This gives the model the ability to have separate dynamics for each expert by keeping the dynamics parameters independent across the experts.
  • A_{k,m} is restricted to be a diagonal matrix.
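One deterministic step of the order-m autoregressive dynamics of Equations 5 to 7 can be sketched as follows (the noise term is omitted; the function name is an assumption for illustration).

```python
import numpy as np

def var_step(A0, A_list, past_w):
    """One step of order-m VAR dynamics: w_t = A_0 + sum_j A_j @ w_{t-j}.
    With each A_j diagonal, every component of the classifier weight
    vector evolves independently of the others."""
    w = np.asarray(A0, dtype=float).copy()
    for A_j, w_prev in zip(A_list, past_w):
        w = w + A_j @ np.asarray(w_prev, dtype=float)
    return w
```

Because the A matrices are kept separate per expert, each expert's decision boundary can drift on its own schedule, which is what lets the mixture track a changing non-linear boundary.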
  • the distribution G 0 is a product of distributions given in equations 2, 5, 6 and 7.
  • the hyperparameters in these distributions have prior probabilities as shown in Equations 10, 11 and 12 below.
  • W, A, Z, [Gamma] and [Phi] are as shown below.
  • The posterior probability P(W, A, Z, [Phi], [Theta], [Theta]_0, [Gamma], [pi]' | D) of the parameters W, A, Z, [Phi], [Theta], [Theta]_0, [Gamma], [pi]' in a case where labeled learning data D is given is obtained.
  • This posterior probability is obtained by using a so-called variational Bayes method of approximately obtaining a posterior probability.
  • the expert learning unit 104 performs variational inference to find posterior probabilities of hidden variables and parameters in our model.
  • The lower bound L(q) of the log marginal likelihood of the proposed model is expressed as shown in Equation 15.
  • The expert learning unit 104 uses the lower bound for logistic regression proposed in Non Patent Literature 2 to convert it to an exponential family distribution, as required by the variational inference procedure.
  • Non Patent Literature 2 introduces a variable [xi]_{t,i} per feature vector x_{t,i} and changes our lower bound L(q) to L(q, [xi]).
  • The variational posterior q(W, A, Z, [Phi], [Theta], [Theta]_0, [Gamma], [pi]') can be factorized using the mean field approximation, as shown in Equation 16.
  • Fig. 2 is an exemplary explanatory diagram illustrating an example of variational inference performed by the learning unit 10 according to the present exemplary embodiment.
  • The expert learning unit 104 inputs data D and hyperparameters u_0, v_0, u, v, a, b, g_0, [beta]_0, V_0 and f_0 (step S500).
  • The expert initialization unit 103 initializes each of W, A, Z, [Phi], [Theta], [Theta]_0, [Gamma], and [pi]' (step S501).
  • the expert initialization unit 103 may perform expert initialization in an arbitrary manner.
  • The expert initialization unit 103 may initialize the experts using a pre-identified set of parameters (for example, initializing the parameters to 0 or 1).
  • The processes from step S502 to step S514 are repeated until the iteration counter reaches the maximum (iter = max_iter). Further, the processes from step S503 to step S512 are repeated for the number of experts. Furthermore, the processes from step S504 to step S507 are repeated for the times 1 to T.
  • The processes from step S505 to step S506 are repeated for the number of data dimensions. Specifically, in step S506, the expert learning unit 104 updates the parameters of W using Equations 45 to 48 shown below.
  • In step S507, the expert learning unit 104 updates the parameters of [xi] using Equation 49 shown below.
  • In step S509, the expert learning unit 104 updates the parameters of A using Equations 41 to 44 shown below. Furthermore, in step S510, the expert learning unit 104 updates the parameters of [Gamma] using Equations 39 and 40 shown below.
  • In step S511, the expert learning unit 104 updates the parameters of [Phi] using Equations 28 to 34 shown below. Furthermore, in step S512, the expert learning unit 104 updates the parameters of [Theta] and [Theta]_0 using Equations 35 to 38 shown below.
  • In step S513, the expert learning unit 104 updates the parameters of Z using Equation 27 shown below. Furthermore, in step S514, the expert learning unit 104 updates the parameters of [pi]' using Equations 25 and 26 shown below.
  • In step S515, the expert learning unit 104 outputs the optimized q(W), q(A), q(Z), q([Phi]), q([Theta]), q([Theta]_0), q([Gamma]) and q([pi]').
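The loop structure of steps S502 to S515 can be sketched as a coordinate-ascent skeleton. The update functions below are placeholders standing in for Equations 25 to 49, which are not reproduced here; the grouping of updates into four callables is an illustrative assumption.

```python
def variational_inference(params, update_fns, max_iter, num_experts, T):
    """Coordinate-ascent skeleton mirroring steps S502-S515.  Each entry
    of update_fns stands in for one group of update equations and must
    return the (possibly updated) params."""
    for _ in range(max_iter):                          # S502: outer iterations
        for k in range(num_experts):                   # S503: per expert
            for t in range(T):                         # S504: per time step
                params = update_fns["W_xi"](params, k, t)   # S505-S507
            params = update_fns["A_gamma"](params, k)       # S509-S510
            params = update_fns["phi_theta"](params, k)     # S511-S512
        params = update_fns["Z_pi"](params)                 # S513-S514
    return params                                      # S515: optimized factors
```

Note that the Z and [pi]' updates sit outside the expert loop but inside the outer iteration, matching the flow described above.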
  • the expert learning unit 104 stores model data for each expert in the expert storage unit 105. That is, the expert storage unit 105 stores model data of each expert.
  • The expert storage unit 105 is realized by, for example, a magnetic disk or the like. Note that, since the model of each expert is learned individually, the expert learning unit 104 may perform normalization processing on all the learned expert models and then store them in the expert storage unit 105.
  • the learning unit 10 (more specifically, the data acquisition unit 101, the data processing unit 102, the expert initialization unit 103 and the expert learning unit 104) is implemented by a CPU of a computer operating according to a program (learning program).
  • the program may be stored in the storage unit (not shown) included in the learning system 100, with the CPU reading the program and, according to the program, operating as the learning unit 10 (more specifically, the data acquisition unit 101, the data processing unit 102, the expert initialization unit 103 and the expert learning unit 104).
  • the functions of the learning system may be provided in the form of SaaS (Software as a Service).
  • The components of the learning unit 10 may each be implemented by dedicated hardware. All or part of the components of each device may be implemented by general-purpose or dedicated circuitry, processors, or combinations thereof. They may be configured with a single chip, or configured with a plurality of chips connected via a bus. All or part of the components of each device may be implemented by a combination of the above-mentioned circuitry or the like and a program.
  • When each device is implemented by a plurality of information processing devices, circuitry, or the like, the plurality of information processing devices, circuitry, or the like may be centralized or distributed.
  • the information processing devices, circuitry, or the like may be implemented in a form in which they are connected via a communication network, such as a client-and-server system or a cloud computing system.
  • the future classification unit 20 receives a new (unlabeled) sample, and predicts its label by combining label predictions from each expert.
  • The predictions of each expert on the new sample are combined in a probabilistic fashion. That is, for combining the predictions, it is first necessary to find the weights assigned to each expert for this new sample, and then the predictions of each expert's classifier on the new sample at the time instance of the new sample.
  • The label prediction is performed in two steps. First, the distribution of classifier weights P(w_{k,T'}) for each expert k, which is an element of {1, 2, ..., K} (K is the total number of experts), and time T' > T is evaluated. The distribution of classifier weights is calculated with a sampling-cum-marginalization approach, as shown in Equation 50 below.
  • [tau](a) = (1 + [pi]a/8)^{-1/2}.
  • [omega]_{k,T',i} denotes the probability of choosing the k-th expert for classification, which is further represented as in Equations 53 and 54.
  • The probability of assigning an expert to z_{T',i} can be approximated as shown below, where N denotes the total number of samples in the labeled data set D.
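The [tau] factor above is the standard correction for approximating the expectation of a sigmoid under a Gaussian, with the argument a playing the role of the predictive variance. A minimal sketch under that reading (the function name is an assumption):

```python
import numpy as np

def approx_sigmoid_gaussian_mean(mu, var):
    """Closed-form approximation to E[sigmoid(a)] for a ~ N(mu, var):
    the integral is approximated by sigmoid(tau(var) * mu), where
    tau(var) = (1 + pi * var / 8) ** (-1/2)."""
    tau = (1.0 + np.pi * var / 8.0) ** -0.5
    return 1.0 / (1.0 + np.exp(-tau * mu))
```

Higher predictive variance shrinks tau toward 0, pulling the predicted probability toward 0.5, i.e. uncertainty about the classifier weights makes the label prediction less confident.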
  • the future classification unit 20 includes a data acquisition unit 201, a data processing unit 202, an expert identification unit 203, a classification output unit 204 and a label storage unit 205.
  • the data acquisition unit 201 receives un-labeled streaming data (hereinafter also referred to as a sample). That is, the data acquisition unit 201 receives data to be classified.
  • the data processing unit 202 converts the received streaming data into feature vectors with time annotations.
  • the method of converting streaming data into a time-annotated feature vector is the same as the method performed by the data processing unit 102, but label data is not created.
  • the expert identification unit 203 identifies parameters for each expert for the task of classification of unlabeled data.
  • the expert identification unit 203 includes an expert weighting unit 2031 and a classifier creating unit 2032.
  • the expert weighting unit 2031 calculates the weight for each expert. Specifically, the expert weighting unit 2031 calculates the weight of each expert based on the assignment parameter using Equations 53 and 54 described above.
  • The classifier creating unit 2032 calculates the future weights of the classifier using the dynamics. Specifically, the classifier creating unit 2032 determines a classifier at the time instance of a new sample using the time series model, by using Equation 50 described above. That is, the classifier creating unit 2032 predicts the classifier parameters for each expert at the time instance of the new sample using the autoregressive time series model of classifier parameters.
  • The classification output unit 204 predicts the label of the new sample for each expert using the classifier parameters obtained by the classifier creating unit 2032, and combines these label predictions using the weights obtained by the expert weighting unit 2031. Specifically, the classification output unit 204 determines the label predictions for all experts and combines them in a probabilistic fashion by using Equations 51 and 52 described above.
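The probabilistic combination performed by the classification output unit can be sketched as a weight-normalized mixture of per-expert label probabilities. The per-expert probabilities are assumed to be already computed (e.g., via Equations 51 and 52, not reproduced here), and the function name is illustrative.

```python
import numpy as np

def combine_expert_predictions(omega, per_expert_probs):
    """Final P(y = 1 | x) for a new sample: a mixture of each expert's
    label probability, weighted by the normalized expert weights omega_k."""
    omega = np.asarray(omega, dtype=float)
    omega = omega / omega.sum()  # ensure the expert weights sum to 1
    return float(omega @ np.asarray(per_expert_probs, dtype=float))
```

An expert with a larger weight for the sample dominates the final label probability, so experts whose clusters are far from the sample contribute little.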
  • the classification output unit 204 stores the determined label in the label storage unit 205. That is, the label storage unit 205 stores a label for the input streaming data.
  • the label storage unit 205 is realized by, for example, a magnetic disk or the like.
  • the future classification unit 20 (more specifically, the data acquisition unit 201, the data processing unit 202, the expert identification unit 203 (more specifically, the expert weighting unit 2031 and the classifier creating unit 2032) and the classification output unit 204) is implemented by a CPU of a computer operating according to a program (learning program, prediction program).
  • the future classification unit 20 (more specifically, the data acquisition unit 201, the data processing unit 202, the expert identification unit 203 (more specifically, the expert weighting unit 2031 and the classifier creating unit 2032) and the classification output unit 204) may each be implemented by dedicated hardware.
  • FIG. 3 depicts a flowchart illustrating an example of learning processing by the learning unit 10.
  • the data acquisition unit 101 receives labeled streaming data as learning data until time T (step S101).
  • The data processing unit 102 converts the streaming data into feature and label vectors with time annotations (step S102).
  • The expert initialization unit 103 initializes all the experts with pre-identified parameters (step S103).
  • loop process A (steps S1031 to S1032) is repeated until the termination condition is satisfied.
  • loop process B (steps S1033 to S1034) is repeated at expert level over all the pre-specified number of experts.
  • The expert learning unit 104 learns a classifier model for each expert at each time (step S1041).
  • The expert learning unit 104 learns a classifier time series model for each expert (step S1042).
  • The expert learning unit 104 learns the expert parameters for the data model (step S1043).
  • The expert learning unit 104 learns the expert assignment parameters for all data points (step S1044).
  • the expert learning unit 104 stores model data in the expert storage unit 105. That is, the expert storage unit 105 stores model data for each expert (step S105).
  • Fig. 4 depicts a flowchart illustrating an example of prediction processing by the future classification unit 20.
  • the data acquisition unit 201 receives un-labeled streaming data (step S201).
  • the data processing unit 202 converts the streaming data into feature vectors with time annotations (step S202).
  • The expert identification unit 203 (specifically, the expert weighting unit 2031) computes weights for each expert (step S2031), and the expert identification unit 203 (specifically, the classifier creating unit 2032) computes future weights of the classifier using the dynamics (step S2032).
  • the classification output unit 204 combines label predictions from all the experts (step S204).
  • the classification output unit 204 stores predicted labels for the input data in the label storage unit 205. That is, the label storage unit 205 stores predicted labels for the input data (step S205).
  • The expert learning unit 104 learns the classifier model for a mixture of classifiers (experts) at each time, and the classifier time series model for each expert. Moreover, the expert learning unit 104 learns, for each expert, a data model parameter for the data model, and learns the assignment parameter for individual samples in the input data. Therefore, it is possible to learn the dynamics of non-linear boundaries used for classification.
  • the expert weighting unit 2031 calculates the weight of each expert based on the assignment parameter, and the classifier creating unit 2032 predicts the classifier weights corresponding to the sample’s time using the classifier time series model. Then, the classification output unit 204 predicts the probability of the label of the sample for each expert, combines the probabilities of the labels of all the experts, and predicts the label of the sample. Therefore, even if the conditional distribution of labels in streaming data changes over time, it is possible to suppress the decrease in the accuracy of the classifier.
  • The expert learning unit 104 of the present exemplary embodiment may use a neural network in place of logistic regression for the classifier model, or in place of the AR process for the time series model.
  • FIG. 5 is an exemplary explanatory diagram illustrating a specific example of the learning process.
  • streaming data from time 1 to T is given as learning data.
  • The example illustrated in Fig. 5 shows streaming data having X1 and X2 as features, labeled with two classes (class 1 and class 2).
  • Fig. 5 shows an example where the decision boundaries also change.
  • Fig. 6 is an exemplary explanatory diagram illustrating a specific example of the prediction process.
  • Unlabeled streaming data from time T + 1 to T + M is given as classification target data.
  • The example illustrated in Fig. 6 shows that streaming data having X1 and X2 as features is given.
  • The future classification unit 20 refers to the learned experts stored in the expert storage unit 105, and predicts a classifier at each time. As a result, the class into which given data is classified is predicted at each time. Similar to the example illustrated in Fig. 5, the conditional distribution of labels changes over time, so Fig. 6 also shows that the decision boundary changes.
  • Fig. 7 depicts a block diagram illustrating an outline of the learning system according to the present invention.
  • The learning system 80 (for example, the learning system 100) for learning a model for estimating a label indicating classification of data according to the present invention includes: a classifier model learning unit 81 (for example, the expert learning unit 104) which learns, using input data, a classifier model (for example, Q(w_{k,t})) for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance; a classifier time series model learning unit 82 (for example, the expert learning unit 104) which learns, for each expert, a time series model (for example, Q(A_k)) indicating the time series change of the classifier model of the expert; a data model parameter learning unit 83 (for example, the expert learning unit 104) which learns a data model parameter (for example, Q([phi]_k)) for a data model indicating the distribution of data features for each expert; and an assignment parameter learning unit 84 (for example, the expert learning unit 104) which learns an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  • The learning system 80 may include: a weight calculator (for example, the expert weighting unit 2031) which calculates a weight of each expert based on the assignment parameter; a weight predictor (for example, the classifier creating unit 2032) which predicts classifier weights corresponding to a sample's time using the classifier time series model; and a label predictor (for example, the classification output unit 204) which predicts the probability of the label of the sample for each expert, combines the probabilities of the labels of all the experts, and predicts the label of the sample.
  • The classifier model learning unit 81 may model, by logistic regression, the probability of a label given data in the grouped data.
  • The classifier model learning unit 81 may determine the number of clusters from learning data using a model based on a Dirichlet Process mixture, and assign an expert to each of the determined clusters.
  • The data model parameter learning unit 83 may learn parameters based on the Normal-Wishart distribution and model data based on a multivariate normal distribution.
  • The assignment parameter learning unit 84 may model the cluster assignment based on a multinomial or categorical distribution.
  • The classifier model learning unit 81 may learn the classifier model such that a collective decision boundary approximates an underlying non-linear decision boundary at each past time instance.
  • FIG. 8 depicts a schematic block diagram illustrating a configuration of a computer according to at least one of the exemplary embodiments.
  • a computer 1000 includes a CPU 1001, a main storage device 1002, an auxiliary storage device 1003, and an interface 1004.
  • The above-described learning system is mounted on the computer 1000.
  • The operation of each of the processing units described above is stored in the auxiliary storage device 1003 in the form of a program (a learning program).
  • the CPU 1001 reads the program from the auxiliary storage device 1003, deploys the program in the main storage device 1002, and executes the above processing according to the program.
  • the auxiliary storage device 1003 is an exemplary non-transitory physical medium.
  • Other examples of non-transitory physical medium include a magnetic disc, a magneto-optical disk, a CD-ROM, a DVD-ROM, and a semiconductor memory that are connected via the interface 1004.
  • The computer 1000 to which the program is distributed may deploy the program in the main storage device 1002 to execute the processing described above.
  • the program may implement a part of the functions described above.
  • the program may implement the aforementioned functions in combination with another program stored in the auxiliary storage device 1003 in advance, that is, the program may be a differential file (differential program).
  • A learning system for learning a model for estimating a label indicating classification of data, comprising: a classifier model learning unit which learns, using input data, a classifier model for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance; a classifier time series model learning unit which learns, for each expert, a time series model indicating the time series change of the classifier model of the expert; a data model parameter learning unit which learns a data model parameter for a data model indicating the distribution of data features for each expert; and an assignment parameter learning unit which learns an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  • the learning system according to supplementary note 1, further comprising: a weight calculator which calculates a weight of each expert based on the assignment parameter; a weight predictor which predicts classifier weights corresponding to a sample’s time using the classifier time series model; and a label predictor which predicts the probability of the label of the sample for each expert, combines the probabilities of the labels of all the experts, and predicts the label of the sample.
  • (Supplementary note 4) The learning system according to any one of supplementary notes 1 to 3, wherein the classifier model learning unit determines the number of clusters from learning data using a model based on a Dirichlet Process mixture, and assigns an expert to each of the determined clusters.
  • A learning method for learning a model for estimating a label indicating classification of data, comprising: learning, using input data, a classifier model for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance; learning, for each expert, a time series model indicating the time series change of the classifier model of the expert; learning a data model parameter for a data model indicating the distribution of data features for each expert; and learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  • the learning method further comprising: calculating a weight of each expert based on the assignment parameter; predicting classifier weights corresponding to a sample’s time using a classifier time series model; predicting the probability of the label of the sample for each expert; combining the probabilities of the labels of all the experts; and predicting the label of the sample.
  • a learning program for learning a model for estimating a label indicating classification of data causes a computer to perform: a classifier model learning process of learning, using an input data, a classifier model for a mixture of classifiers referred as experts that are collectively assigned to a task of classification of the input data, at each time instance; a classifier time series model learning process of learning, for each expert, a time series model indicating time series change of the classifier model of the expert; a data model parameter learning process of learning a data model parameter for a data model indicating the distribution of data features for each expert; and an assignment parameter learning process of learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  • the learning program causes a computer to perform: a weight calculate process of calculating a weight of each expert based on the assignment parameter; a weight predicting process of predicting classifier weights corresponding to a sample’s time using a classifier time series model; and a label predicting process of predicting the probability of the label of the sample for each expert, combining the probabilities of the labels of all the experts, and predicting the label of the sample.
  • 10: learning unit; 20: future classification unit; 100: learning system; 101, 201: data acquisition unit; 102, 202: data processing unit; 103: expert initialization unit; 104: expert learning unit; 105: expert storage unit; 203: expert identification unit; 204: classification output unit; 205: label storage unit; 2031: expert weighting unit; 2032: classifier creating unit

Abstract

A classifier model learning unit 81 learns, using input data, a classifier model for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance. A classifier time series model learning unit 82 learns, for each expert, a time series model indicating time series change of the classifier model of the expert. A data model parameter learning unit 83 learns a data model parameter for a data model indicating the distribution of data features for each expert. An assignment parameter learning unit 84 learns an assignment parameter indicating the probability of assigning experts to individual samples in the input data.

Description

LEARNING SYSTEM, METHOD AND PROGRAM
The present invention relates to a learning system, learning method and learning program that learn a model for classifying input data samples bearing time (or, equivalently, sequential-order) annotations into one of many class labels, wherein the distribution of class labels of the input data changes with time.
In machine learning, there is known a classifier that, when certain data is input, outputs a label representing an attribute of that data. It is also known that the classification criterion of a classifier may change over time. In order to prevent temporal deterioration of the classification accuracy of such a classifier, it is necessary to create a classifier whose classification criterion is updated.
Patent Literature 1 discloses a creating apparatus for creating a classifier. The creating apparatus disclosed in Patent Literature 1 creates a classifier whose classification accuracy is maintained without frequently collecting labeled training data.
Non Patent Literature 1 discloses a method of performing non-linear classification. In the method disclosed in Non Patent Literature 1, given a data set of input-response pairs, the Dirichlet Process mixtures of Generalized Linear Models (DP-GLM) produces a global model of the joint distribution through a mixture of local generalized linear models.
Non Patent Literature 2 discloses that an accurate variational transformation can be used to obtain a closed form approximation to the posterior distribution of the parameters thereby yielding an approximate posterior predictive model.
Patent Literature 1: US Patent Application Publication No. 2019/0012566 A1
Non Patent Literature 1: Lauren A. Hannah, et al., "Dirichlet Process Mixtures of Generalized Linear Models", The Journal of Machine Learning Research, Volume 12, pp. 1923-1953, 2011
Non Patent Literature 2: Tommi S. Jaakkola, Michael I. Jordan, "Bayesian parameter estimation via variational methods", Statistics and Computing
The creating apparatus disclosed in Patent Literature 1 learns the classification criterion from input data with temporal attributes and class labels at each past time instance, learns a time series change model over these classification criteria, and uses it to perform ahead prediction of the classification criterion of a classifier. However, the creating apparatus disclosed in Patent Literature 1 is limited to a single classifier. The explanation therein uses a simple logistic regression classifier, which gives a linear classification criterion. It is asserted that a non-linear classifier such as an SVM, boosting, a neural network, etc. can be used in place of logistic regression when the classification criterion in the input data is non-linear at some or all time instances. However, implementing a non-linear classification model in that fashion is not obvious. Therefore, there is a problem that the data cannot be classified properly if a classifier model based on logistic regression is used and the boundary for classifying the data is non-linear.
On the other hand, the method disclosed in Non Patent Literature 1 considers classification with non-linearity. However, that method was developed on the assumption of a non-changing distribution of data. When the conditional distribution of labels in streaming data changes over time, the method disclosed in Non Patent Literature 1 therefore has a problem that the accuracy of the classifier deteriorates over time.
It is an exemplary object of the present invention to provide a learning system, learning method and learning program that can learn dynamics of non-linear boundaries used for classification.
A learning system for learning a model for estimating a label indicating classification of data, the learning system according to the present invention includes: a classifier model learning unit which learns, using input data, a classifier model for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance; a classifier time series model learning unit which learns, for each expert, a time series model indicating time series change of the classifier model of the expert; a data model parameter learning unit which learns a data model parameter for a data model indicating the distribution of data features for each expert; and an assignment parameter learning unit which learns an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
A learning method for learning a model for estimating a label indicating classification of data, the learning method according to the present invention includes: learning, using input data, a classifier model for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance; learning, for each expert, a time series model indicating time series change of the classifier model of the expert; learning a data model parameter for a data model indicating the distribution of data features for each expert; and learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
A learning program for learning a model for estimating a label indicating classification of data, the learning program according to the present invention causes a computer to perform: a classifier model learning process of learning, using input data, a classifier model for a mixture of classifiers, referred to as experts, that are collectively assigned to a task of classification of the input data, at each time instance; a classifier time series model learning process of learning, for each expert, a time series model indicating time series change of the classifier model of the expert; a data model parameter learning process of learning a data model parameter for a data model indicating the distribution of data features for each expert; and an assignment parameter learning process of learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
According to the present invention, it is possible to learn dynamics of non-linear boundaries used for classification.
Fig. 1 depicts an exemplary block diagram illustrating the structure of the exemplary embodiment of a learning system according to the present invention. Fig. 2 depicts an exemplary explanatory diagram illustrating an example of variational inference. Fig. 3 depicts an exemplary explanatory diagram illustrating an example of the learning process by a learning unit. Fig. 4 depicts an exemplary explanatory diagram illustrating an example of the prediction process by a future classification unit. Fig. 5 depicts an exemplary explanatory diagram illustrating a specific example of the learning process. Fig. 6 depicts an exemplary explanatory diagram illustrating a specific example of the prediction process. Fig. 7 depicts a block diagram illustrating an outline of the learning system according to the present invention. Fig. 8 depicts a schematic block diagram illustrating the configuration example of the computer according to the exemplary embodiment of the present invention.
The following describes an exemplary embodiment of the present invention with reference to drawings.
Fig. 1 depicts an exemplary block diagram illustrating the structure of an exemplary embodiment of a learning system according to the present invention. The learning system 100 according to the present exemplary embodiment includes a learning unit 10 and a future classification unit 20.
The learning unit 10 includes a data acquisition unit 101, a data processing unit 102, an expert initialization unit 103, an expert learning unit 104 and an expert storage unit 105.
The data acquisition unit 101 acquires data used for learning by the expert learning unit 104 described later. In the present exemplary embodiment, the data acquisition unit 101 receives labeled streaming data as training data. Here, labeled training data means a combination of data for learning and a label indicating the classification of that data.
Hereinafter, in order to simplify the description, the case of performing binary classification will be described. That is, it is determined whether the target data is data that belongs to a certain group (hereinafter, also referred to as positive data) or data that does not belong to the group (hereinafter, also referred to as negative data). For the training data, a label indicating positive data or a label indicating negative data may be used as a label.
Further, a set of training data collected periodically at times t = 1, 2, ..., T is considered as D := {D_t}_{t=1}^{T}, where D_t := {(x_{t,i}, y_{t,i})}_{i=1}^{N_t}. Here, x_{t,i} is the D-dimensional feature vector of the i-th sample at time t, and y_{t,i}, which is an element of {0, 1}, is its class label. N_t is the number of training data collected at time t. Note that sequential data can be represented in this format by discretizing at regular intervals and considering the data falling within the same interval to have arrived at the same time.
Further, X_t := {x_{t,i}}_{i=1}^{N_t} and Y_t := {y_{t,i}}_{i=1}^{N_t} are denoted. In the exemplary embodiment, given the set of training data D, the object is to predict a binary classifier h_t : R^D -> {0, 1} at t (t an element of {T+1, T+2, ...}) that can precisely classify data at time t.
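As an illustrative sketch only (not part of the disclosed embodiment), the grouping of time-annotated samples into the set D can be expressed as follows; the function name build_training_set and the (t, x, y) record layout are assumptions of this example.

```python
from collections import defaultdict

def build_training_set(records):
    """Group time-stamped (t, x, y) records into D := {D_t}, where D_t
    collects the (feature vector, label) pairs observed at time t."""
    D = defaultdict(list)
    for t, x, y in records:
        assert y in (0, 1), "binary class labels only"
        D[t].append((x, y))
    return dict(D)

# Streaming samples already discretized to times 1 and 2
records = [
    (1, [0.2, 1.1], 0),
    (1, [1.5, 0.3], 1),
    (2, [0.1, 0.9], 0),
]
D = build_training_set(records)  # N_1 = 2, N_2 = 1
```

Discretizing sequential data to a common time index, as the text describes, reduces to choosing the key t before grouping.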
The data acquisition unit 101 may acquire training data from a storage unit (not shown) included in the learning system 100, or may acquire training data from an external storage server or the like (not shown) via a communication network.
The data processing unit 102 converts the acquired data into training data. In particular, the data processing unit 102 converts the streaming data into feature and label vectors with time annotations. That is, the data processing unit 102 generates the set of training data D described above from the acquired data.
In the exemplary embodiment, it is assumed that the feature vectors x_{t,i} (i = 1, ..., N_t; t = 1, ..., T) are generated from a finite number of stationary clusters/groups and that within each cluster a linear decision boundary (hereafter, simply a decision boundary) exists, separating the positive labels from the negative ones. Further, the decision boundary in each cluster can change with respect to time, and the dynamics of the decision boundary need not be the same among the clusters. The "dynamics" means the time-series change of the classification criteria of a classifier model.
Our model, which is used in the exemplary embodiment, is based on Dirichlet Process Mixtures (DPM). It is used to identify the number of clusters/groups from D automatically and to assign an expert to each cluster to model the data distribution along with the conditional distribution of class labels given the data. The experts are collectively assigned to a task of classification of the input data. Moreover, for each expert, the classification criterion at time t = {1, 2, ..., T} is learned using a standard classifier model (such as logistic regression, SVM, etc.), along with the temporal change of the classification criterion using a standard time series model (such as a Vector Autoregressive Model, Gaussian Process, etc.). Thus, for each expert, the past classification criteria and a time series model over them can be used to predict future classification criteria, i.e., at times t = T+1, T+2, ....
If logistic regression is used as the classifier model for each expert, then locally, within a cluster, the relationship between x_{t,i} and y_{t,i} is linear. But if the mixture contains more than one component, this relationship becomes non-linear globally. Thus, using this model, non-linear decision boundaries at future time instances can be predicted, provided the classification provided by each expert is combined in an appropriate fashion.
Here, an example of one such realization of the present invention is described. For each expert, logistic regression is used as the classifier model and a Vector Autoregressive Process as the time series model. DPM assumes there exist countably infinite clusters/groups within the data; however, it exhibits a clustering property. Thus, in practice, a finite number of experts is inferred through our model, which may provide an accurate approximation to the underlying non-linear decision boundary.
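The following sketch illustrates, under assumed toy parameters, how a mixture of locally linear logistic experts gated by Gaussian data densities yields a globally non-linear decision boundary; every name and value in it (predict_mixture, the expert dictionaries) is invented for illustration and is not the embodiment's learned model.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gaussian_density(x, mean, var):
    # Isotropic multivariate normal density with shared variance `var`
    d = len(x)
    quad = sum((xi - mi) ** 2 for xi, mi in zip(x, mean)) / var
    return math.exp(-0.5 * quad) / ((2.0 * math.pi * var) ** (d / 2.0))

def predict_mixture(x, experts):
    """P(y=1 | x): experts gated by their data density, each applying a
    linear (logistic) classifier; globally the boundary is non-linear."""
    dens = [gaussian_density(x, e["mean"], e["var"]) for e in experts]
    total = sum(dens)
    gates = [d / total for d in dens]
    probs = [
        sigmoid(e["w"][0] + sum(wi * xi for wi, xi in zip(e["w"][1:], x)))
        for e in experts
    ]
    return sum(g * p for g, p in zip(gates, probs))

experts = [
    # expert 1: active near (-2, 0), local boundary is the line x1 = 0
    {"mean": [-2.0, 0.0], "var": 1.0, "w": [0.0, 4.0, 0.0]},
    # expert 2: active near (2, 0), local boundary is the line x2 = 0
    {"mean": [2.0, 0.0], "var": 1.0, "w": [0.0, 0.0, 4.0]},
]
```

Near each cluster the dominant expert imposes its own linear criterion, so the combined boundary bends from one line to the other across the input space.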
In the following explanation, when using a Greek letter in the text, an English notation of Greek letter may be enclosed in brackets ([]). In addition, when representing an upper case Greek letter, the beginning of the word in [] is indicated by capital letters, and when representing lower case Greek letters, the beginning of the word in [] is indicated by lower case letters.
Within a cluster/group k, the distribution of x is modeled by the expert as a multivariate normal. Specifically, the distribution of x is represented by the following Equation 1. In Equation 1, N([mu], [Sigma]) is the multivariate Gaussian distribution with mean [mu] and covariance matrix [Sigma], the element m_k of R^D is the mean vector of cluster k, the element R_k of R^{DxD} is its precision matrix, [Phi] = {[Phi]_k}_{k=1}^{K} s.t. [Phi]_k := {m_k, R_k}, z_{t,i} is the cluster indicator for the data sample x_{t,i}, and 1(.) is an indicator function.
p(x_{t,i} | z_{t,i}, [Phi]) = prod_{k} N(x_{t,i} | m_k, R_k^{-1})^{1(z_{t,i} = k)}    (Equation 1)
For each expert k, the prior over [Phi] is given by Normal-Wishart distribution as shown in Equation 2 below.
p(m_k, R_k) = N(m_k | g_0, ([beta]_0 R_k)^{-1}) W(R_k | V_0, f_0)    (Equation 2)
In Equation 2, the element g_0 of R^D is the mean vector, the element V_0 of R^{DxD} is the scale matrix, the element [beta]_0 of R is a scale parameter and the element f_0 of R is the degree of freedom.
In the exemplary embodiment, within a cluster k, the probability of label y_{t,i} given feature vector x_{t,i} is modelled by logistic regression as shown in Equations 3 and 4 below.
p(y_{t,i} = 1 | x_{t,i}, z_{t,i} = k, W) = [sigma](w_{k,t,0} + w_{k,t,1:D}^T x_{t,i})    (Equation 3)
[sigma](a) := 1 / (1 + exp(-a))    (Equation 4)
In Equations 3 and 4, w_{k,t,0} and w_{k,t,1:D} are the bias term and the parameter vector for classifier h_{k,t} respectively, w_{k,t} := (w_{k,t,0}, w_{k,t,1:D}), and [sigma](.) is the sigmoid function.
Also, similar to the method described in Patent Literature 1, a simple Vector Autoregressive model (VAR) of order M is used to model the dynamics of the classifier parameters w_{k,t} for all experts, as shown in Equations 5, 6 and 7 below.
[Equations 5 to 7]
In Equations 5, 6 and 7, the elements A_{k,1}, A_{k,2}, ..., A_{k,M} of R^{(D+1)x(D+1)} are the (D+1)x(D+1) matrices defining the dynamics, the element A_{k,0} of R^{D+1} is the bias term, and [theta] and [theta]_0 are elements of R+. As in Equation 6, for each expert, the classifier parameters at time t depend linearly only on the past M values of the expert's classifier parameters. Keeping the dynamics parameters independent across the experts gives this model the ability to have separate dynamics for each expert.
However, for the sake of simplicity, A_{k,m} is restricted to be a diagonal matrix. This means the i-th component of the classifier parameter, w_{k,t,i}, depends only on its own previous values w_{k,t-1,i}, w_{k,t-2,i}, .... Since the M-th order VAR model cannot be used when t <= M, we assume that w_{k,t} for t <= M is generated from the following distribution for each expert k.
[Equation]
The model of the exemplary embodiment assumes that the dynamics and bias parameters A_{k,m} (m = 0, ..., M) are generated from a normal distribution as in Equation 7.
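A minimal sketch of forecasting classifier weights with a per-expert diagonal VAR model of order M is given below; var_forecast and its toy inputs are assumptions of this example, not the learned dynamics of the embodiment.

```python
def var_forecast(history, A, bias, steps):
    """Forecast classifier weights with a diagonal VAR(M):
    w_t[i] = bias[i] + sum_m A[m][i] * w_{t-1-m}[i],
    so component i depends only on its own past M values.

    history: past weight vectors (oldest first), at least M of them
    A:       diagonals of A_1 ... A_M, stored as plain vectors
    bias:    the bias vector A_0
    """
    hist = [list(w) for w in history]
    M, dim = len(A), len(bias)
    for _ in range(steps):
        w_next = [
            bias[i] + sum(A[m][i] * hist[-1 - m][i] for m in range(M))
            for i in range(dim)
        ]
        hist.append(w_next)
    return hist[len(history):]

# Order-1 dynamics: the first component contracts toward 2.0, the second decays
preds = var_forecast([[2.0, 4.0]], A=[[0.5, 0.5]], bias=[1.0, 0.0], steps=2)
```

Because each A matrix is diagonal, the forecast of one weight component never mixes in another component, which is exactly the simplification the text describes.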
Let [Omega]_k := {W_k, A_k, [Phi]_k} be the set of parameters corresponding to an expert k in the mixture, where W_k = {w_{k,t}}_{t=1}^{T}, A_k = {A_{k,m}}_{m=0}^{M} and [Phi]_k = {m_k, R_k}. In the model of the exemplary embodiment, a Dirichlet Process is set over [Omega]_k as shown in Equations 8 and 9 below.
[Equations 8 and 9]
The distribution G_0 is a product of the distributions given in Equations 2, 5, 6 and 7. The hyperparameters in these distributions have prior probabilities as shown in Equations 10, 11 and 12 below.
[Equations 10 to 12]
Using the DP's stick-breaking representation, the proportions [pi]_k of the countably infinite clusters within the mixture are determined from the remaining stick length by a Beta distribution, as shown in Equation 13.
[Equation 13]
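The stick-breaking construction can be sketched as follows; stick_breaking_proportions is an illustrative name, and the Beta draws are supplied directly rather than sampled.

```python
def stick_breaking_proportions(v):
    """Turn stick-breaking fractions v_k (Beta draws) into mixture
    proportions: pi_k = v_k * prod_{j<k} (1 - v_j)."""
    proportions = []
    remaining = 1.0  # length of stick still unbroken
    for vk in v:
        proportions.append(vk * remaining)
        remaining *= (1.0 - vk)
    return proportions

# Break off half, then half of the rest, then all of the rest
pi = stick_breaking_proportions([0.5, 0.5, 1.0])  # [0.5, 0.25, 0.25]
```

Truncating the sequence (as the variational inference later does) simply stops the loop after a finite number of breaks.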
Correspondingly, the multinomial probabilities of cluster indicator parameters can be written as the following Equation 14.
[Equation 14]
The joint distribution of the labeled data D, the data model parameters [Phi], the classifier parameters W and their precision parameters {[Theta], [Theta]0}, the classifier dynamics parameters A and their precision parameters [Gamma], the expert assignment variables (parameters) Z, and the component probabilities [pi]' is written as follows.
[Equation]
Hyperparameters [Theta] and [Theta]0 are defined as follows.
[Equation]
In addition, W, A, Z, [Gamma] and [Phi] are as shown below.
[Equation]
In the probabilistic model defined by the above formulae, a probability distribution p(W, A, Z, [Phi], [Theta], [Theta]0, [Gamma], [pi]’ | D) of parameters W, A, Z, [Phi], [Theta], [Theta]0, [Gamma], [pi]’ in a case where labeled learning data D is given is obtained.
However, since it is difficult to directly obtain these probability distributions, in the present exemplary embodiment an approximate distribution q(W, A, Z, [Phi], [Theta], [Theta]0, [Gamma], [pi]') of the probability distribution p(W, A, Z, [Phi], [Theta], [Theta]0, [Gamma], [pi]' | D) is obtained by using the so-called variational Bayes method of approximately obtaining a posterior probability.
The expert learning unit 104 performs variational inference to find posterior probabilities of hidden variables and parameters in our model.
The lower bound L(q) of the log marginal likelihood of the proposed model is expressed as shown in Equation 15.
[Equation 15]
In the present exemplary embodiment, the expert learning unit 104 uses the lower bound for logistic regression proposed in Non Patent Literature 2 to convert it to an exponential family distribution, as required in the variational inference procedure. Non Patent Literature 2 introduces a variable [xi]_{t,i} per feature vector x_{t,i}, which changes our lower bound L(q) to L(q, [xi]). The variational posterior q(W, A, Z, [Phi], [Theta], [Theta]0, [Gamma], [pi]') can be factorized using a mean field approximation as shown in Equation 16.
[Equation 16]
Individual variational distributions are written as shown below in Equations 17-24.
[Equations 17 to 24]
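The quadratic lower bound on the sigmoid commonly attributed to Non Patent Literature 2, [sigma](x) >= [sigma](xi) exp((x - xi)/2 - [lambda](xi)(x^2 - xi^2)) with [lambda](xi) = tanh(xi/2)/(4 xi), can be checked numerically with the following sketch; the function names are assumptions of this example, and the exact form used in the embodiment's derivation is not reproduced here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lam(xi):
    # lambda(xi) = tanh(xi/2) / (4*xi), extended by continuity to 1/8 at xi = 0
    if xi == 0.0:
        return 0.125
    return math.tanh(xi / 2.0) / (4.0 * xi)

def jj_lower_bound(x, xi):
    """Quadratic-in-x lower bound on sigmoid(x); it is tight at x = +/- xi."""
    return sigmoid(xi) * math.exp((x - xi) / 2.0 - lam(xi) * (x * x - xi * xi))

# The bound never exceeds the sigmoid and touches it at x = xi
gap = sigmoid(1.0) - jj_lower_bound(1.0, 2.0)
```

Because the bound is the exponential of a quadratic in x, it plays the role of an (unnormalized) Gaussian factor, which is what makes the conjugate variational updates tractable.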
The learning process performed by the learning unit 10 using the available data D will be specifically described below. In the present exemplary embodiment, we consider a truncated representation of the DP's stick-breaking process in the variational inference, which limits the countably infinite number of experts to a finite plurality of experts.
Fig. 2 is an exemplary explanatory diagram illustrating an example of variational inference performed by the learning unit 10 according to the present exemplary embodiment. The expert learning unit 104 inputs the data D and the hyperparameters u_0, v_0, u, v, a, b, g_0, [beta]_0, V_0 and f_0 (step S500).
The expert initialization unit 103 initializes each of W, A, Z, [Phi], [Theta], [Theta]0, [Gamma], and [pi]' (step S501). The expert initialization unit 103 may perform the expert initialization in an arbitrary manner; for example, it may initialize the experts using a pre-identified set of parameters (for example, initializing the parameters to 0 or 1).
Next, in the expert learning unit 104, the processes from step S502 to step S514 are repeated until the iterator reaches the maximum (iter < max_iter). Further, the processes from step S503 to step S512 are repeated for each expert. Furthermore, the processes from step S504 to step S507 are repeated for times 1 to T.
Furthermore, the processes from step S505 to step S506 are repeated for each data dimension. Specifically, in step S506, the expert learning unit 104 updates the parameters of W using Equations 45 to 48 shown below.
In step S507, the expert learning unit 104 updates parameters of [Xi] using Equation 49 shown below.
Thereafter, the processes from step S508 to step S510 are repeated by the order M of dynamics. Specifically, in step S509, the expert learning unit 104 updates parameters of A using Equations 41 to 44 shown below. Furthermore, in step S510, the expert learning unit 104 updates parameters of [Gamma] using Equations 39 to 40 shown below.
In step S511, the expert learning unit 104 updates parameters of [Phi] using Equations 28 to 34 shown below. Furthermore, in step S512, the expert learning unit 104 updates parameters of [Theta] and [Theta]0 using Equations 35 to 38 shown below.
In step S513, the expert learning unit 104 updates parameters of Z using Equation 27 shown below. Furthermore, in step S514, the expert learning unit 104 updates parameters of [pi]’ using Equations 25 and 26 shown below.
[Equations 25 to 49]
Then, in step S515, the expert learning unit 104 outputs optimized q(W), q(A), q(Z), q([Phi]), q([Theta]), q([Theta]0), q ([Gamma]) and q([pi]’).
The expert learning unit 104 stores model data for each expert in the expert storage unit 105. That is, the expert storage unit 105 stores the model data of each expert. The expert storage unit 105 is realized by, for example, a magnetic disk or the like. Note that, since the model of each expert is learned individually, the expert learning unit 104 may perform normalization processing on all the learned expert models and then store them in the expert storage unit 105.
The learning unit 10 (more specifically, the data acquisition unit 101, the data processing unit 102, the expert initialization unit 103 and the expert learning unit 104) is implemented by a CPU of a computer operating according to a program (learning program). For example, the program may be stored in the storage unit (not shown) included in the learning system 100, with the CPU reading the program and, according to the program, operating as the learning unit 10 (more specifically, the data acquisition unit 101, the data processing unit 102, the expert initialization unit 103 and the expert learning unit 104). The functions of the learning system may be provided in the form of SaaS (Software as a Service).
The learning unit 10 (more specifically, the data acquisition unit 101, the data processing unit 102, the expert initialization unit 103 and the expert learning unit 104) may each be implemented by dedicated hardware. All or part of the components of each device may be implemented by general-purpose or dedicated circuitry, processors, or combinations thereof. They may be configured with a single chip, or configured with a plurality of chips connected via a bus. All or part of the components of each device may be implemented by a combination of the above-mentioned circuitry or the like and program.
In the case where all or part of the components of each device is implemented by a plurality of information processing devices, circuitry, or the like, the plurality of information processing devices, circuitry, or the like may be centralized or distributed. For example, the information processing devices, circuitry, or the like may be implemented in a form in which they are connected via a communication network, such as a client-and-server system or a cloud computing system.
The future classification unit 20 receives a new (unlabeled) sample, and predicts its label by combining the label predictions from each expert. In the present exemplary embodiment, the predictions of each expert on the new sample are combined in a probabilistic fashion. That is, to combine the predictions, it is first necessary to find the weights assigned to each expert for this new sample, and then the predictions on the new sample by each expert's classifier at the time instance of the new sample.
For streaming data at time T' > T with D_{T'} := {x_{T',i}}_{i=1}^{N_{T'}} as the unlabeled data set, the label prediction is performed in two steps. First, the distribution of classifier weights P(w_{k,T'}) is evaluated for k an element of {1, 2, ..., K}, where K is the total number of experts, and time T' > T. The distribution of classifier weights is calculated with a sampling-cum-marginalization approach, as shown in Equation 50 below.
[Equation 50]
The dynamics parameters A_{k,m} for m = 0, 1, 2, ..., M are sampled i.i.d. as follows.
[Equation]
Let
[Equation]
Then Equations 51 and 52 shown below are obtained.


[Equations 51 and 52]
Here, [tau](a) := 1 / (1 + [pi] a / 8)^{1/2}, and [omega]_{k,T',i} denotes the probability of choosing the k-th expert for classification, which is further represented as Equations 53 and 54.
[Equation 53]
such that
[Equation 54]
Additionally, the probability of assigning an expert to z_{T',i}, a cluster indicator variable, can be approximated as
[Equation]
where N denotes the total number of samples in the labeled data set D.
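A common use of the quantity [tau](a) defined above is the probit-style approximation E[[sigma](a)] for a ~ N(mu, s^2) roughly equal to [sigma]([tau](s^2) mu), i.e. marginalizing a sigmoid over a Gaussian on the classifier's activation. The following sketch states that approximation; predictive_probability is an illustrative name, and it is an assumption of this example that the embodiment applies [tau] in exactly this way.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tau(a):
    # tau(a) := 1 / (1 + pi*a/8)^(1/2), as defined in the text
    return 1.0 / math.sqrt(1.0 + math.pi * a / 8.0)

def predictive_probability(mu, var):
    """Approximate E[sigmoid(a)] for a ~ N(mu, var) by sigmoid(tau(var) * mu).

    Larger predictive variance shrinks the activation toward zero, pulling
    the predicted probability toward 0.5 (maximum uncertainty)."""
    return sigmoid(tau(var) * mu)

# With zero mean the approximation is exact by symmetry: returns 0.5
p = predictive_probability(0.0, 5.0)
```

This makes the label prediction reflect uncertainty in the forecast classifier weights rather than just their mean.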
As illustrated in Fig. 1, the future classification unit 20 includes a data acquisition unit 201, a data processing unit 202, an expert identification unit 203, a classification output unit 204 and a label storage unit 205.
The data acquisition unit 201 receives un-labeled streaming data (hereinafter also referred to as a sample). That is, the data acquisition unit 201 receives data to be classified.
The data processing unit 202 converts the received streaming data into feature vectors with time annotations. The method of converting streaming data into a time-annotated feature vector is the same as the method performed by the data processing unit 102, but label data is not created.
The expert identification unit 203 identifies parameters for each expert for the task of classification of unlabeled data. The expert identification unit 203 includes an expert weighting unit 2031 and a classifier creating unit 2032.
The expert weighting unit 2031 calculates the weight for each expert. Specifically, the expert weighting unit 2031 calculates the weight of each expert based on the assignment parameter using Equations 53 and 54 described above.
The classifier creating unit 2032 calculates the future weights of the classifier using the dynamics. Specifically, the classifier creating unit 2032 determines a classifier at the time instance of a new sample using a time series model, by using Equation 50 described above. That is, the classifier creating unit 2032 predicts the classifier parameters for each expert at the time instance of the new sample using the autoregressive time-series model of classifier parameters.
The classification output unit 204 predicts the label of the new sample for each expert using the classifier parameters obtained by the classifier creating unit 2032, and combines these label predictions using the weights obtained by the expert weighting unit 2031. Specifically, the classification output unit 204 determines the label predictions for all experts and combines them in a probabilistic fashion by using Equations 51 and 52 described above.
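The probabilistic combination performed by the classification output unit 204 can be sketched as a weighted average of the experts' label probabilities; combine_predictions and the example weights are assumptions of this illustration, not the embodiment's exact computation.

```python
def combine_predictions(expert_weights, expert_probs):
    """Combine per-expert label probabilities into a single prediction.

    expert_weights: gating probabilities omega_k for the sample (sum to 1)
    expert_probs:   each expert's predicted probability that the label is 1
    Returns the mixture probability and the hard label (threshold 0.5).
    """
    assert abs(sum(expert_weights) - 1.0) < 1e-9, "weights must sum to 1"
    p = sum(w * q for w, q in zip(expert_weights, expert_probs))
    return p, int(p >= 0.5)

# Two experts: the first dominates this sample and predicts label 1
p, label = combine_predictions([0.7, 0.3], [0.9, 0.2])
```

An expert that is unlikely to be responsible for the sample contributes little to the final probability, even if its own prediction is confident.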
The classification output unit 204 stores the determined label in the label storage unit 205. That is, the label storage unit 205 stores a label for the input streaming data. The label storage unit 205 is realized by, for example, a magnetic disk or the like.
The future classification unit 20 (more specifically, the data acquisition unit 201, the data processing unit 202, the expert identification unit 203 (more specifically, the expert weighting unit 2031 and the classifier creating unit 2032) and the classification output unit 204) is implemented by a CPU of a computer operating according to a program (learning program, prediction program).
The future classification unit 20 (more specifically, the data acquisition unit 201, the data processing unit 202, the expert identification unit 203 (more specifically, the expert weighting unit 2031 and the classifier creating unit 2032) and the classification output unit 204) may each be implemented by dedicated hardware.
Next, operation of the learning system according to the present exemplary embodiment will be described. Fig. 3 depicts a flowchart illustrating an example of learning processing by the learning unit 10.
The data acquisition unit 101 receives labeled streaming data as learning data until time T (step S101). The data processing unit 102 converts the streaming data into feature and label vectors with time annotations (step S102). The expert initialization unit 103 initializes all the experts with pre-identified parameters (step S103).
Next, loop process A (steps S1031 to S1032) is repeated until the termination condition is satisfied. Furthermore, loop process B (steps S1033 to S1034) is repeated at expert level over all the pre-specified number of experts.
The expert learning unit 104 learns classifier model for each expert at each time (step S1041). The expert learning unit 104 learns classifier time series model for each expert (step S1042). The expert learning unit 104 learns expert parameters for data model (step S1043). The expert learning unit 104 learns expert assignment parameters for all data points (step S1044).
The expert learning unit 104 stores model data in the expert storage unit 105. That is, the expert storage unit 105 stores model data for each expert (step S105).
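The alternating structure of loop processes A and B can be sketched with a deliberately simplified EM-style loop. Only the data-model update (step S1043) and the assignment update (step S1044) are worked out; the classifier and time-series updates (steps S1041 and S1042) are elided, and all function and variable names are illustrative, not the patent's notation (which uses variational updates).

```python
import numpy as np

def gauss_pdf(X, mu, cov):
    """Multivariate normal density, evaluated row-wise."""
    d = X.shape[1]
    diff = X - mu
    mahal = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    return np.exp(-0.5 * mahal) / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))

def responsibilities(X, mus, covs, mix):
    """Soft expert-assignment probabilities for every sample (cf. step S1044)."""
    dens = np.column_stack([mix[k] * gauss_pdf(X, mus[k], covs[k])
                            for k in range(len(mus))])
    return dens / dens.sum(axis=1, keepdims=True)

def learn(X, n_experts=2, n_iters=20):
    """Alternate the per-expert data-model fit (loop process B) inside an
    outer convergence loop (loop process A)."""
    # Initialize expert means from evenly spaced data points.
    mus = X[np.linspace(0, len(X) - 1, n_experts).astype(int)].copy()
    covs = [np.eye(X.shape[1]) for _ in range(n_experts)]
    mix = np.full(n_experts, 1.0 / n_experts)
    for _ in range(n_iters):                            # loop process A
        Z = responsibilities(X, mus, covs, mix)
        for k in range(n_experts):                      # loop process B
            r = Z[:, k] / Z[:, k].sum()
            mus[k] = r @ X                              # weighted mean (cf. S1043)
            diff = X - mus[k]
            covs[k] = (diff * r[:, None]).T @ diff + 1e-6 * np.eye(X.shape[1])
        mix = Z.mean(axis=0)
    return mus, covs, mix, Z
```

Each pass re-estimates every expert's data model under the current soft assignments, then refreshes the assignments, mirroring the termination-checked outer loop and the per-expert inner loop of the flowchart.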
Fig. 4 depicts a flowchart illustrating an example of prediction processing by the future classification unit 20.
The data acquisition unit 201 receives un-labeled streaming data (step S201). The data processing unit 202 converts the streaming data into feature vectors with time annotations (step S202). In step S203, the expert identification unit 203 (specifically, the expert weighting unit 2031) computes weights for each expert (step S2031), and the expert identification unit 203 (specifically, the classifier creating unit 2032) computes the future weights of the classifier using the dynamics (step S2032).
The classification output unit 204 combines label predictions from all the experts (step S204). The classification output unit 204 stores predicted labels for the input data in the label storage unit 205. That is, the label storage unit 205 stores predicted labels for the input data (step S205).
As described above, according to the present exemplary embodiment, the expert learning unit 104 learns the classifier model for a mixture of classifiers (experts) at each time, and the classifier time series model for each expert. Moreover, the expert learning unit 104 learns, for each expert, a data model parameter for the data model, and learns the assignment parameter for individual samples in the input data. Therefore, it is possible to learn the dynamics of the non-linear boundaries used for classification.
Furthermore, according to the present exemplary embodiment, the expert weighting unit 2031 calculates the weight of each expert based on the assignment parameter, and the classifier creating unit 2032 predicts the classifier weights corresponding to the sample’s time using the classifier time series model. Then, the classification output unit 204 predicts the probability of the label of the sample for each expert, combines the probabilities of the labels of all the experts, and predicts the label of the sample. Therefore, even if the conditional distribution of labels in the streaming data changes over time, it is possible to suppress a decrease in the accuracy of the classifier.
Note that the expert learning unit 104 of the present exemplary embodiment may use a neural network in place of logistic regression for the classifier model, or in place of the AR process for the time series model.
Next, a specific example of the learning system of the present exemplary embodiment will be described. Fig. 5 is an exemplary explanatory diagram illustrating a specific example of the learning process. First, streaming data from time 1 to T is given as learning data. The example illustrated in Fig. 5 shows that streaming data having X1 and X2 as features is labeled with two classes (class 1 and class 2).
Next, modeling is performed on these data (feature data). For example, as illustrated in Fig. 5, the mean of each expert is calculated over time. Then, the unique classifier (linear decision boundary) of each expert is learned for each time. Since the conditional distribution of labels changes over time, Fig. 5 shows an example where the decision boundaries also change.
Fig. 6 is an exemplary explanatory diagram illustrating a specific example of the prediction process. First, unlabeled streaming data from time T + 1 to T + M is given as classification target data. The example illustrated in Fig. 6 shows that streaming data having X1 and X2 as features is given.
Next, the future classification unit 20 refers to the learned experts stored in the expert storage unit 105, and predicts a classifier at each time. As a result, the class into which given data is classified is predicted at each time. As in the example illustrated in Fig. 5, the conditional distribution of labels changes over time, so Fig. 6 also shows the decision boundary changing.
Next, an outline of the present invention will be described. Fig. 7 depicts a block diagram illustrating an outline of the learning system according to the present invention. The learning system 80 (for example, the learning system 100) according to the present invention learns a model for estimating a label indicating classification of data, and includes: a classifier model learning unit 81 (for example, the expert learning unit 104) which learns, using input data, a classifier model (for example, Q(wk,t)) for a mixture of classifiers, referred to as experts, that are collectively assigned to the task of classifying the input data, at each time instance; a classifier time series model learning unit 82 (for example, the expert learning unit 104) which learns, for each expert, a time series model (for example, Q(Ak)) indicating the time series change of that expert's classifier model; a data model parameter learning unit 83 (for example, the expert learning unit 104) which learns a data model parameter (for example, Q([phi]k)) for a data model indicating the distribution of data features for each expert; and an assignment parameter learning unit 84 (for example, the expert learning unit 104) which learns an assignment parameter (for example, Q(Z)) indicating the probability of assigning experts to individual samples in the input data.
With such a configuration, it is possible to learn dynamics of non-linear boundaries used for classification.
Further, the learning system 80 may include: a weight calculator (for example, the expert weighting unit 2031) which calculates a weight of each expert based on the assignment parameter; a weight predictor (for example, the classifier creating unit 2032) which predicts classifier weights corresponding to a sample’s time using the classifier time series model; and a label predictor (for example, the classification output unit 204) which predicts the probability of the label of the sample for each expert, combines the probabilities of the labels of all the experts, and predicts the label of the sample.
With such a configuration, even if the conditional distribution of labels in streaming data changes over time, it is possible to suppress the decrease in the accuracy of the classifier.
Further, the classifier model learning unit 81 may model the probability of the label given data in the grouped data by logistic regression.
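A minimal sketch of such a per-expert logistic model, fit here by gradient descent with each sample weighted by its responsibility toward the expert, is shown below. The actual learning in the patent is a variational update; the function and parameter names are illustrative.

```python
import numpy as np

def fit_weighted_logistic(X, y, resp, lr=0.5, n_iters=2000):
    """Fit one expert's logistic-regression classifier; `resp` down-weights
    samples that belong mostly to other experts."""
    Xb = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))        # P(y=1 | x) per sample
        grad = Xb.T @ (resp * (p - y)) / len(X)    # responsibility-weighted gradient
        w -= lr * grad
    return w

def predict_proba(X, w):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-(Xb @ w)))
```

Setting all responsibilities to one recovers ordinary logistic regression; per-expert responsibilities make each expert specialize in its own region of the feature space.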
Further, the classifier model learning unit 81 may determine the number of clusters from the learning data using a model based on Dirichlet process mixtures, and assign an expert to each of the determined clusters.
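The Dirichlet-process behaviour can be illustrated with its Chinese-restaurant-process view, in which the number of occupied "tables" after seeing the learning data plays the role of the number of clusters (and hence experts). This is an illustrative sampler, not the patent's inference procedure; `alpha` is the DP concentration parameter.

```python
import numpy as np

def crp_partition(n_samples, alpha, rng):
    """Sample a partition from the Chinese restaurant process; the number
    of occupied tables is the number of clusters."""
    counts = []
    assignments = []
    for _ in range(n_samples):
        # Join an existing cluster with probability proportional to its size,
        # or open a new one with probability proportional to alpha.
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)          # open a new cluster
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, len(counts)
```

The expected number of clusters grows only logarithmically with the number of samples, which is why the model can decide the number of experts from data rather than requiring it up front.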
Further, the data model parameter learning unit 83 may learn parameters based on the Normal-Wishart distribution and model the data based on a multivariate normal distribution.
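The conjugacy that makes this pairing convenient can be sketched with the standard Normal-Wishart posterior update for fully observed data. The patent's variational update for Q([phi]k) would use responsibility-weighted statistics instead, and the hyperparameter names below follow common textbook notation, not the patent's.

```python
import numpy as np

def normal_wishart_posterior(X, mu0, kappa0, nu0, W0_inv):
    """Standard conjugate Normal-Wishart update for a multivariate-normal
    data model with unknown mean and precision."""
    n, d = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)                 # scatter about the sample mean
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n    # precision-weighted mean
    diff = (xbar - mu0)[:, None]
    Wn_inv = W0_inv + S + (kappa0 * n / kappa_n) * (diff @ diff.T)
    return mu_n, kappa_n, nu_n, Wn_inv
```

Because the posterior stays in the Normal-Wishart family, each expert's data-model parameters can be refreshed in closed form from sufficient statistics of the samples assigned to it.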
Further, the assignment parameter learning unit 84 may model the cluster assignment based on a multinomial or categorical distribution.
Further, the classifier model learning unit 81 may learn the classifier model such that a collective decision boundary is an approximation to an underlying non-linear decision boundary at each past time instance.
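The following toy sketch shows how gating linear experts by their Gaussian data models yields a collective boundary that no single linear classifier can express. All parameters here are contrived for illustration and do not come from the patent.

```python
import numpy as np

def mixture_decision(x, mus, ws, bs, mix):
    """Blend K linear (logistic) classifiers, gated by each expert's
    spherical-Gaussian data model evaluated at x."""
    logits = np.array([-0.5 * np.sum((x - mu) ** 2) + np.log(m)
                       for mu, m in zip(mus, mix)])
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()                              # soft gating weights
    return sum(g / (1.0 + np.exp(-(w @ x + b)))     # blended class-1 probability
               for g, w, b in zip(gate, ws, bs))

# Two experts whose local linear boundaries face each other: the collective
# class-1 region is a band around x1 = 0, which no single halfspace gives.
mus = [np.array([-2.0, 0.0]), np.array([2.0, 0.0])]
ws = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
bs = [1.0, 1.0]
mix = [0.5, 0.5]
```

Near each expert's mean its own linear boundary dominates, so the piecewise-linear pieces stitch together into a non-linear collective boundary.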
Fig. 8 depicts a schematic block diagram illustrating a configuration of a computer according to at least one of the exemplary embodiments. A computer 1000 includes a CPU 1001, a main storage device 1002, an auxiliary storage device 1003, and an interface 1004.
The above-described learning system is implemented on the computer 1000. The operation of each of the processing units described above is stored in the auxiliary storage device 1003 in the form of a program (a learning program). The CPU 1001 reads the program from the auxiliary storage device 1003, deploys the program in the main storage device 1002, and executes the above processing according to the program.
Note that, in at least one of the exemplary embodiments, the auxiliary storage device 1003 is an exemplary non-transitory physical medium. Other examples of non-transitory physical media include a magnetic disk, a magneto-optical disk, a CD-ROM, a DVD-ROM, and a semiconductor memory connected via the interface 1004. In the case where the program is distributed to the computer 1000 by a communication line, the computer 1000 that has received the program may deploy the program in the main storage device 1002 and execute the processing described above.
Incidentally, the program may implement a part of the functions described above. The program may implement the aforementioned functions in combination with another program stored in the auxiliary storage device 1003 in advance, that is, the program may be a differential file (differential program).
Note that, a part of or all of the above exemplary embodiments can also be described as following supplementary notes, but is not limited to the following.
(Supplementary note 1)
A learning system for learning a model for estimating a label indicating classification of data, the learning system comprising: a classifier model learning unit which learns, using input data, a classifier model for a mixture of classifiers referred to as experts that are collectively assigned to a task of classification of the input data, at each time instance; a classifier time series model learning unit which learns, for each expert, a time series model indicating time series change of the classifier model of the expert; a data model parameter learning unit which learns a data model parameter for a data model indicating the distribution of data features for each expert; and an assignment parameter learning unit which learns an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
(Supplementary note 2)
The learning system according to supplementary note 1, further comprising: a weight calculator which calculates a weight of each expert based on the assignment parameter; a weight predictor which predicts classifier weights corresponding to a sample’s time using the classifier time series model; and a label predictor which predicts the probability of the label of the sample for each expert, combines the probabilities of the labels of all the experts, and predicts the label of the sample.
(Supplementary note 3)
The learning system according to supplementary note 1 or 2, wherein, the classifier model learning unit models the probability of the label given data in the grouped data by logistic regression.
(Supplementary note 4)
The learning system according to any one of supplementary notes 1 to 3, wherein, the classifier model learning unit determines the number of clusters from learning data using a model based on Dirichlet process mixtures, and assigns an expert to each of the determined clusters.
(Supplementary note 5)
The learning system according to any one of supplementary notes 1 to 4, wherein, the data model parameter learning unit learns parameters based on the Normal-Wishart distribution and models data based on a multivariate normal distribution.
(Supplementary note 6)
The learning system according to any one of supplementary notes 1 to 5, wherein, the assignment parameter learning unit models the cluster assignment based on a multinomial or categorical distribution.
(Supplementary note 7)
The learning system according to any one of supplementary notes 1 to 6, wherein, the classifier model learning unit learns the classifier model such that a collective decision boundary is an approximation to an underlying non-linear decision boundary at each past time instance.
(Supplementary note 8)
A learning method for learning a model for estimating a label indicating classification of data, the learning method comprising: learning, using input data, a classifier model for a mixture of classifiers referred to as experts that are collectively assigned to a task of classification of the input data, at each time instance; learning, for each expert, a time series model indicating time series change of the classifier model of the expert; learning a data model parameter for a data model indicating the distribution of data features for each expert; and learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
(Supplementary note 9)
The learning method according to supplementary note 8, further comprising: calculating a weight of each expert based on the assignment parameter; predicting classifier weights corresponding to a sample’s time using a classifier time series model; predicting the probability of the label of the sample for each expert; combining the probabilities of the labels of all the experts; and predicting the label of the sample.
(Supplementary note 10)
A learning program for learning a model for estimating a label indicating classification of data, the learning program causes a computer to perform: a classifier model learning process of learning, using input data, a classifier model for a mixture of classifiers referred to as experts that are collectively assigned to a task of classification of the input data, at each time instance; a classifier time series model learning process of learning, for each expert, a time series model indicating time series change of the classifier model of the expert; a data model parameter learning process of learning a data model parameter for a data model indicating the distribution of data features for each expert; and an assignment parameter learning process of learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
(Supplementary note 11)
The learning program according to supplementary note 10, that causes a computer to perform: a weight calculating process of calculating a weight of each expert based on the assignment parameter; a weight predicting process of predicting classifier weights corresponding to a sample’s time using a classifier time series model; and a label predicting process of predicting the probability of the label of the sample for each expert, combining the probabilities of the labels of all the experts, and predicting the label of the sample.
10 learning unit
20 future classification unit
100 learning system
101, 201 data acquisition unit
102, 202 data processing unit
103 expert initialization unit
104 expert learning unit
105 expert storage unit
203 expert identification unit
204 classification output unit
205 label storage unit
2031 expert weighting unit
2032 classifier creating unit
 

Claims (11)

  1. A learning system for learning a model for estimating a label indicating classification of data, the learning system comprising:
    a classifier model learning unit which learns, using input data, a classifier model for a mixture of classifiers referred to as experts that are collectively assigned to a task of classification of the input data, at each time instance;
    a classifier time series model learning unit which learns, for each expert, a time series model indicating time series change of the classifier model of the expert;
    a data model parameter learning unit which learns a data model parameter for a data model indicating the distribution of data features for each expert; and
    an assignment parameter learning unit which learns an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  2. The learning system according to claim 1, further comprising:
    a weight calculator which calculates a weight of each expert based on the assignment parameter;
    a weight predictor which predicts classifier weights corresponding to a sample’s time using the classifier time series model; and
    a label predictor which predicts the probability of the label of the sample for each expert, combines the probabilities of the labels of all the experts, and predicts the label of the sample.
  3. The learning system according to claim 1 or 2,
    wherein, the classifier model learning unit models the probability of the label given data in the grouped data by logistic regression.
  4. The learning system according to any one of claims 1 to 3,
    wherein, the classifier model learning unit determines the number of clusters from learning data using a model based on Dirichlet process mixtures, and assigns an expert to each of the determined clusters.
  5. The learning system according to any one of claims 1 to 4,
    wherein, the data model parameter learning unit learns parameters based on the Normal-Wishart distribution and models data based on a multivariate normal distribution.
  6. The learning system according to any one of claims 1 to 5,
    wherein, the assignment parameter learning unit models the cluster assignment based on a multinomial or categorical distribution.
  7. The learning system according to any one of claims 1 to 6,
    wherein, the classifier model learning unit learns the classifier model such that a collective decision boundary is an approximation to an underlying non-linear decision boundary at each past time instance.
  8. A learning method for learning a model for estimating a label indicating classification of data, the learning method comprising:
    learning, using input data, a classifier model for a mixture of classifiers referred to as experts that are collectively assigned to a task of classification of the input data, at each time instance;
    learning, for each expert, a time series model indicating time series change of the classifier model of the expert;
    learning a data model parameter for a data model indicating the distribution of data features for each expert; and
    learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  9. The learning method according to claim 8, further comprising:
    calculating a weight of each expert based on the assignment parameter;
    predicting classifier weights corresponding to a sample’s time using a classifier time series model;
    predicting the probability of the label of the sample for each expert;
    combining the probabilities of the labels of all the experts; and
    predicting the label of the sample.
  10. A learning program for learning a model for estimating a label indicating classification of data, the learning program causes a computer to perform:
    a classifier model learning process of learning, using input data, a classifier model for a mixture of classifiers referred to as experts that are collectively assigned to a task of classification of the input data, at each time instance;
    a classifier time series model learning process of learning, for each expert, a time series model indicating time series change of the classifier model of the expert;
    a data model parameter learning process of learning a data model parameter for a data model indicating the distribution of data features for each expert; and
    an assignment parameter learning process of learning an assignment parameter indicating the probability of assigning experts to individual samples in the input data.
  11. The learning program according to claim 10, that causes a computer to perform:
    a weight calculating process of calculating a weight of each expert based on the assignment parameter;
    a weight predicting process of predicting classifier weights corresponding to a sample’s time using a classifier time series model; and
    a label predicting process of predicting the probability of the label of the sample for each expert, combining the probabilities of the labels of all the experts, and predicting the label of the sample.
     
PCT/JP2019/029456 2019-04-04 2019-07-26 Learning system, method and program WO2020202594A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962829294P 2019-04-04 2019-04-04
US62/829,294 2019-04-04

Publications (1)

Publication Number Publication Date
WO2020202594A1 true WO2020202594A1 (en) 2020-10-08

Family

ID=72666502

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/029456 WO2020202594A1 (en) 2019-04-04 2019-07-26 Learning system, method and program

Country Status (1)

Country Link
WO (1) WO2020202594A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024061050A1 (en) * 2022-09-19 2024-03-28 北京数慧时空信息技术有限公司 Remote-sensing sample labeling method based on geoscientific information and active learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006127446A (en) * 2004-09-29 2006-05-18 Ricoh Co Ltd Image processing device, image processing method, program, and recording medium
JP2017107386A (en) * 2015-12-09 2017-06-15 日本電信電話株式会社 Instance selection device, classification device, method, and program
US20190041235A1 (en) * 2017-08-04 2019-02-07 Kabushiki Kaisha Toshiba Sensor control support apparatus, sensor control support method and non-transitory computer readable medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ATTAMIMI, MUHAMMAD ET AL.: "Autonomous Control of a Service Robot Based on Remote Control Data", PROCEEDINGS OF THE 31ST ANNUAL CONFERENCE OF THE ROBOTICS SOCIETY OF JAPAN [DVD-ROM], RSJ2013AC2C2-04, 4 September 2013 (2013-09-04) *
HANNAH, LAUREN A ET AL.: "Dirichlet Process Mixtures of Generalized Linear Models", JOURNAL OF MACHINE LEARNING RESEARCH, vol. 12, 2011, pages 1923 - 1953, XP055745980 *
KOIZUMI, YUMA ET AL.: "Intra-note Segmentation for Excitation-Continuous Musical Instruments based on Infinite Mixture Models nesting Hidden Markov Model", PROCEEDINGS OF 2014 SPRING MEETING OF ACOUSTICAL SOCIETY OF JAPAN [CD-ROM], 3 March 2014 (2014-03-03), pages 985 - 988, ISSN: 1880-7658 *
SASAKI, KENTARO ET AL.: "Time Series Topic Model Considering Dependence to Multiple Topics", IPSJ SIG TECHNICAL REPORTS. MPS-100, vol. 2014, no. 3, 18 September 2014 (2014-09-18), pages 1 - 6, XP055745989, Retrieved from the Internet <URL:https://ipsj.ixsq.nii.ac.jp/ej/?action=repository_uri&item_id=103127&file_id=1&file_no=1> [retrieved on 20190827] *



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19922436

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19922436

Country of ref document: EP

Kind code of ref document: A1