CA3059932A1 - Method and system for individual demand forecasting - Google Patents

Method and system for individual demand forecasting

Info

Publication number
CA3059932A1
Authority
CA
Canada
Prior art keywords
time
log
function
historical
historical data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CA3059932A
Other languages
French (fr)
Inventor
Brian Keng
Tianle CHEN
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kinaxis Inc
Original Assignee
Kinaxis Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kinaxis Inc filed Critical Kinaxis Inc
Priority to CA3059932A priority Critical patent/CA3059932A1/en
Priority to PCT/CA2020/051422 priority patent/WO2021077226A1/en
Publication of CA3059932A1 publication Critical patent/CA3059932A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0207Discounts or incentives, e.g. coupons or rebates
    • G06Q30/0224Discounts or incentives, e.g. coupons or rebates based on user history

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Game Theory and Decision Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Provided is a system and method for individual forecasting of a future event for a subject using historical data. The historical data including a plurality of historical events associated with the subject. The method including: receiving the historical data associated with the subject;
determining a random variable representing a remaining time until the future event; predicting a time to the future event using a distribution function that is determined using a recurrent neural network, the distribution function including a learned density with peaks that approximate the times of the historical events in the historical data; determining a log-likelihood function based on a probability that the random variable exceeds an amount of time remaining until a next historical event in the historical data and parameterized by the distribution function; and outputting a forecast of a time to the future event as the log-likelihood function.

Description

TECHNICAL FIELD

[0001] The following relates generally to data processing, and more specifically, to a method and system for individual demand forecasting.
BACKGROUND
[0002] Accurately forecasting individual demand for acquisition of single items is a complex technical problem with many facets. It requires predicting not only the next most likely time of acquisition, but also providing an accompanying measure of uncertainty, because there is likely inherent randomness in the acquisition, especially if it is based on an individual's behavior. Such forecasts often also coincide with sparse observations and partial information. Generally, sequential dependence is considered because future demand patterns can be heavily influenced by past behavior. Additionally, there may be strong correlation between demand patterns across substitutable acquisitions, generally requiring that acquisition behavior be jointly predicted.
SUMMARY
[0003] In an aspect, there is provided a method for individual forecasting of a future event for a subject using historical data, the historical data comprising a plurality of historical events associated with the subject, the method executed on at least one processing unit, the method comprising: receiving the historical data associated with the subject; determining a random variable representing a remaining time until the future event; predicting a time to the future event using a distribution function that is determined using a recurrent neural network, the distribution function comprising a learned density with peaks that approximate the times of the historical events in the historical data; determining a log-likelihood function based on a probability that the random variable exceeds an amount of time remaining until a next historical event in the historical data and parameterized by the distribution function; and outputting a forecast of a time to the future event as the log-likelihood function.
[0004] In a particular case of the method, a loss function for the recurrent neural network comprises a negative of the log-likelihood function.
[0005] In another case of the method, the random variable is conditioned based on inter-arrival times of the historical events in the historical data.
[0006] In yet another case of the method, the random variable is conditioned based on excess times since arrival of preceding historical events in the historical data.
[0007] In yet another case of the method, the log-likelihood function at each time is the log of the probability that the random variable is in the set of time until the next historical event when the next historical event has been observed, and the log of the survival function otherwise.
[0008] In yet another case of the method, the distribution function follows a Weibull distribution.
[0009] In yet another case of the method, the distribution function is determined as (k/λ)((s + t)/λ)^(k−1) S_W(t), where k is the shape of the Weibull distribution, λ is the scale of the Weibull distribution, t is the time-step, and S_W(t) is the survival function.
[0010] In yet another case of the method, outputting the forecast of the time to the future event as the log-likelihood function comprises determining a sum of log-likelihoods at each time-step.
[0011] In yet another case of the method, the method further comprising transforming the sum of log-likelihoods as a function of recurrent neural network parameters and historical data, and determining a minimizer of an overall observed loss of the recurrent neural network using such function.
[0012] In yet another case of the method, the method further comprising outputting derivative values of the log-likelihood function.
[0013] In another aspect, there is provided a system for individual forecasting of a future event for a subject using historical data, the historical data comprising a plurality of historical events associated with the subject, the system comprising one or more processors in communication with a data storage, the one or more processors configurable to execute: a data acquisition module to receive the historical data associated with the subject; a conditional excess module to determine a random variable representing a remaining time until the future event; a machine learning module to predict a time to the future event using a distribution function that is determined using a recurrent neural network, the distribution function comprising a learned density with peaks that approximate the times of the historical events in the historical data; and a forecasting module to determine a log-likelihood function based on a probability that the random variable exceeds an amount of time remaining until a next historical event in the historical data and parameterized by the distribution function, and to output a forecast of a time to the future event as the log-likelihood function.
[0014] In a particular case of the system, a loss function for the recurrent neural network comprises a negative of the log-likelihood function.
[0015] In another case of the system, the random variable is conditioned based on inter-arrival times of the historical events in the historical data.
[0016] In yet another case of the system, the random variable is conditioned based on excess times since arrival of preceding historical events in the historical data.
[0017] In yet another case of the system, the log-likelihood function at each time is the log of the probability that the random variable is in the set of time until the next historical event when the next historical event has been observed, and the log of the survival function otherwise.
[0018] In yet another case of the system, the distribution function follows a Weibull distribution.
[0019] In yet another case of the system, the distribution function is determined as (k/λ)((s + t)/λ)^(k−1) S_W(t), where k is the shape of the Weibull distribution, λ is the scale of the Weibull distribution, t is the time-step, and S_W(t) is the survival function.
[0020] In yet another case of the system, outputting the forecast of the time to the future event as the log-likelihood function comprises determining a sum of log-likelihoods at each time-step.
[0021] In yet another case of the system, the forecasting module further transforms the sum of log-likelihoods as a function of recurrent neural network parameters and historical data, and determines a minimizer of an overall observed loss of the recurrent neural network using such function.
[0022] In yet another case of the system, the forecasting module further outputs derivative values of the log-likelihood function.
[0023] These and other embodiments are contemplated and described herein. It will be appreciated that the foregoing summary sets out representative aspects of systems and methods to assist skilled readers in understanding the following detailed description.

[0024] The features of the invention will become more apparent in the following detailed description in which reference is made to the appended drawings wherein:
[0025] FIG. 1 is a schematic diagram of a system for individual forecasting of a future event for a subject using historical data, in accordance with an embodiment;
[0026] FIG. 2 is a flowchart of a method for individual forecasting of a future event for a subject using historical data, in accordance with an embodiment;
[0027] FIG. 3 is a plot of an example of time-since-event (tse(t)) and time-to-event (tte(t)) until an end of a training period, in accordance with the system of FIG. 1;
[0028] FIG. 4A is an example of a distributional estimate for an uncensored case with time equals 3, in accordance with the system of FIG. 1;
[0029] FIG. 4B is an example of a distributional estimate for a censored case with time equals 7, in accordance with the system of FIG. 1;
[0030] FIG. 5 is a diagram of an example recurrent neural network (RNN) computational flow, in accordance with the system of FIG. 1;
[0031] FIG. 6 is a diagram of an example Bayesian Network, in accordance with the system of FIG. 1;
[0032] FIG. 7 is a chart of a receiver operating characteristic (ROC) curve for example experiments of the system of FIG. 1;
[0033] FIG. 8A illustrates a chart of predicted densities for remaining useful life (RUL) for the example experiments of FIG. 7;
[0034] FIG. 8B illustrates a chart of predicted modes for RUL for the example experiments of FIG. 7;
[0035] FIG. 9A illustrates a histogram of errors for a comparison approach in the example experiments of FIG. 7; and
[0036] FIG. 9B illustrates a histogram of errors for the system of FIG. 1 in the example experiments of FIG. 7.

[0037] Embodiments will now be described with reference to the figures. For simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the Figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth in order to provide a thorough understanding of the embodiments described herein. However, it will be understood by those of ordinary skill in the art that the embodiments described herein may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the embodiments described herein. Also, the description is not to be considered as limiting the scope of the embodiments described herein.
[0038] Various terms used throughout the present description may be read and understood as follows, unless the context indicates otherwise: "or" as used throughout is inclusive, as though written "and/or"; singular articles and pronouns as used throughout include their plural forms, and vice versa; similarly, gendered pronouns include their counterpart pronouns so that pronouns should not be understood as limiting anything described herein to use, implementation, performance, etc. by a single gender; "exemplary" should be understood as "illustrative" or "exemplifying" and not necessarily as "preferred" over other embodiments. Further definitions for terms may be set out herein; these may apply to prior and subsequent instances of those terms, as will be understood from a reading of the present description.
[0039] Any module, unit, component, server, computer, terminal, engine or device exemplified herein that executes instructions may include or otherwise have access to computer readable media such as storage media, computer storage media, or data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by an application, module, or both. Any such computer storage media may be part of the device or accessible or connectable thereto. Further, unless the context clearly indicates otherwise, any processor or controller set out herein may be implemented as a singular processor or as a plurality of processors. The plurality of processors may be arrayed or distributed, and any processing function referred to herein may be carried out by one or by a plurality of processors, even though a single processor may be exemplified. Any method, application or module herein described may be implemented using computer readable/executable instructions that may be stored or otherwise held by such computer readable media and executed by the one or more processors.
[0040] The following relates generally to data processing, and more specifically, to a method and system for individual demand forecasting.
[0041] For the sake of clarity of illustration, the following disclosure generally refers to the implementation of the present embodiments for product demand forecasting; however, it is appreciated that the embodiments described herein can be used for any suitable application of individual event forecasting. For example, the embodiments described herein could be used to predict the time until a future occurrence of a natural event, such as an earthquake, as the subject to be forecasted. In another example, the embodiments described herein could be used to predict the time until a utility spike occurs, such as a spike in electricity consumption or a spike in internet bandwidth, as the subjects. In another example, the embodiments described herein could be used to predict component failure times for factory machinery, such as using workload history as input. In another example, the embodiments described herein could be used to predict various other consumer behavior patterns, such as predicting when a client will go on vacation.
[0042] In an illustrative example, retailers can have access to massive amounts of consumer behavior data through, for example, customer loyalty programs, purchase histories, and responses to direct marketing campaigns. These data sources can allow retailers to customize their marketing communications at the individual level through personalized content, promotions, and recommendations via channels such as email, mobile and direct mail. Accurately predicting every individual customer behavior for each product is useful in direct marketing efforts, which can lead to significant advantages for a retailer, driving increased sales, margin, and return on investment; especially for replenishable products such as regularly consumed food products (e.g. milk) and regularly replenished personal care products (e.g. soap). These products frequently drive store traffic, basket size, and customer loyalty, which are of strategic importance in a highly competitive retail environment.
[0043] In the above illustrative example, accurately forecasting individual demand for single products is a challenging and complex technical problem. This problem generally requires predicting not only the next most likely time of purchase of the product but also having an accompanying measure of uncertainty due to the inherent randomness of an individual's purchasing behavior. Additionally, this problem usually has sparse observations (for example, few observations for individual customers) and partial information (for example, purchases of related products and time since last purchase). Additionally, this problem usually has sequential dependence because future purchase patterns are generally heavily influenced by past behavior. Additionally, this problem usually has strong correlation between purchase patterns across substitutable products, which indicates that customer behavior should be jointly predicted. For example, among purchasers of items in a basket of 12 deli products that were recorded, there were 79,980 unique purchasers. The purchase history for any single product is generally sparse. For any single product in this basket, the average customer buys only between 0.12 and 0.67 items over a 1.5-year period, but aggregating over all the products in the basket indicates that these customers purchase on average 3.58 items during the same period. This is not surprising since people tend to prefer variety in their meals, even though their choice of whether to purchase a deli product can in some cases be predicted.
[0044] Some approaches for accurately forecasting individual demand for single products adapt methods from survival analysis, where a customer is defined to be "alive" if purchases were made. In a scenario with sparse purchase data, this can be useful since non-purchases can reveal information about whether a customer is likely to make a purchase in the future. In addition to modeling whether customers are "alive", the number of purchases a customer makes in a given time period can also be accounted for. Solving this maximum likelihood problem can yield optimal distributional estimates that model such behaviors. However, these models impose strict assumptions that limit their effectiveness, such as independence and stationarity. In addition, covariates are often modeled in a regression context, further restricting the hypothesis space. While these assumptions may be essential for tractability purposes, in some cases, they can be easily violated when there is a desire to model highly-correlated, high-dimensional and heterogeneous processes. One example of a survival model is a Pareto/NBD model. In this model, heterogeneity of customer purchase rates and customer dropout times are assumed to follow parametric distributions. While this may be suited for predicting a time-to-event, incorporating covariates in this context generally requires imposing a linearity assumption on one of the model parameters in order to fit a model, which is an unrealistic assumption for models with a large number of data features. While copulas may be used to model customer repurchase patterns, they cannot be sufficiently extended to predict multiple products jointly and under changing environmental conditions.
[0045] Some approaches for accurately forecasting individual demand for single products attempt to use machine learning approaches to predict arrival times; for example, Recurrent Neural Networks (RNNs). These approaches leverage the capacity of RNNs to model sequential data with complex temporal correlations as well as non-linear associations. However, these models generally do not deal explicitly with the uncertainty of random arrival times and are not able to properly exploit censored data. Other machine learning approaches have been used; for example, Random Forest models and other ensemble models have been used with binary predictions due to their scalability to wide datasets, ease of training and regularization strategies. However, such tree-based supervised learning models are not well suited to sequentially dependent problems. Recurrent Neural Networks (RNNs) are better suited to model data with complex sequential dependencies; for example, using a Long Short-Term Memory (LSTM) structure that incorporates gates to recursively update an internal state in order to make sequential path-dependent predictions. In an example, an LSTM can be trained to make point estimates for time-to-event by minimizing a distance-based metric. However, unobserved arrival times cannot be explicitly accounted for in these models. Non-arrivals can be important as they can reveal a significant amount of information for the prediction.
[0046] Embodiments of the present disclosure advantageously integrate probabilistic approaches of survival analysis with Recurrent Neural Networks to model inter-purchase times for multiple products jointly for each individual customer. In some embodiments, the output of the RNN models the distribution parameters of a "time to next purchase" random variable instead of a point estimate. In a survival analysis framework, partial information, such as time since the previous arrival, can induce a distribution on the partially observed version of the "time to next purchase" random variable. The structure of such embodiments can impose additional constraints which transform the complex censoring problem into a likelihood-maximization problem. Advantageously, the use of RNNs can remove the need for strict assumptions of past survival analysis models while still having the flexibility to take into account the censored and sequential nature of the problem. In the present disclosure, such Multivariate Arrival Times Recurrent Neural Network models may be referred to as "MAT-RNN".
[0047] The present inventors determined the efficacy of the present embodiments in example experiments. The example experiments were performed on data from a large European health and beauty retailer, several benchmark datasets, as well as a synthetic dataset. The present embodiments were determined to out-perform other approaches in predicting whether a customer made purchases in the next time period. The results of the example experiments illustrate that the present embodiments perform better than other approaches in 4 out of the 5 categories of products considered. Additionally, results on the benchmark and synthetic datasets show comparable performance increases when compared to other survival model techniques and RNNs trained on the usual squared-loss metric.
[0048] Referring now to FIG. 1, a system 100 for individual forecasting of a future event for a subject, in accordance with an embodiment, is shown. In this embodiment, the system 100 is run on a server. In further embodiments, the system 100 can be run on any other computing device; for example, a desktop computer, a laptop computer, a smartphone, a tablet computer, a point-of-sale ("PoS") device, a smartwatch, or the like.
[0049] In some embodiments, the components of the system 100 are stored by and executed on a single computer system. In other embodiments, the components of the system 100 are distributed among two or more computer systems that may be locally or globally distributed.
[0050] FIG. 1 shows various physical and logical components of an embodiment of the system 100. As shown, the system 100 has a number of physical and logical components, including a central processing unit ("CPU") 102 (comprising one or more processors), random access memory ("RAM") 104, an input interface 106, an output interface 108, a network interface 110, non-volatile storage 112, and a local bus 114 enabling CPU 102 to communicate with the other components. CPU 102 executes an operating system, and various modules, as described below in greater detail. RAM 104 provides relatively responsive volatile storage to CPU 102. The input interface 106 enables an administrator or user to provide input via an input device, for example a keyboard and mouse. The output interface 108 outputs information to output devices, such as a display and/or speakers. The network interface 110 permits communication with other systems, such as other computing devices and servers remotely located from the system 100, such as for a typical cloud-based access model. Non-volatile storage 112 stores the operating system and programs, including computer-executable instructions for implementing the operating system and modules, as well as any data used by these services. Additional stored data, as described below, can be stored in a database 116. During operation of the system 100, the operating system, the modules, and the related data may be retrieved from the non-volatile storage 112 and placed in RAM 104 to facilitate execution.
[0051] In an embodiment, the system 100 further includes a data acquisition module 117, a conditional excess module 118, a machine learning module 120, and a forecasting module 122. In some cases, the modules 117, 118, 120, 122 can be executed on the CPU 102. In further cases, some of the functions of the modules 117, 118, 120, 122 can be executed on a server, on cloud computing resources, or other devices. In some cases, some or all of the functions of any of the modules 117, 118, 120, 122 can be run on other modules.

[0052] Forecasting is the process of obtaining a future value for a subject using historical data. Machine learning techniques, as described herein, can use the historical data in order to train their models and thus produce reasonably accurate forecasts when queried.
[0053] In some embodiments, the machine learning module 120 uses a Recurrent Neural Network (RNN) to output distributional parameters which represent predictions for the remaining time to arrival. By iterating through time for each customer, the RNN can output sequential distributional estimates for the remaining time until the next purchase arrival, giving an individual demand forecast. Advantageously, the output as a distribution can allow for better decision-making ability because it can allow for the performance of a cost analysis.
[0054] In some cases, each product's inter-purchase time can be assumed to be a realization of a random variable that is distinct for each customer and each product. In some cases, each product's inter-purchase time can also be dependent on other product purchase times. The conditional excess module 118 can use a conditional excess random variable, which represents a remaining time till next arrival conditioned on observed information to date. This random variable can have a distribution that is induced by an actual inter-purchase time as well as a censoring state.
[0055] In some embodiments, the forecasting module 122 can determine a log-likelihood function based on the conditional excess random variable and the outputs of the RNN. In some cases, it is assumed that the approach of these embodiments follows a conditional independence structure where these conditional excess random variables are assumed to be independent given the internal state of the RNN. In such embodiments, the loss function can be defined to be the negative log-likelihood. The optimal RNN parameters in such embodiments can generate distributional parameters that can be advantageously used to model the observed data. Hence, the RNN outputs at the end of the training period can be used by the forecasting module 122 as best distributional estimates for a remaining time to next purchase.
[0056] In the present disclosure, a random variable representing the remaining time till next arrival conditioned on the current information is denoted as Z. In most cases, this random variable is not the true inter-arrival time, but is instead a version of it, conditioned on observing partial information. Consider an arrival process, where W_n is the time of the n-th arrival. Let W_0 = 0 at the start of a training period. Additionally, let N(t) be the number of arrivals by time t and let Y_n be the inter-arrival time of the n-th arrival, which is the difference between consecutive arrival times.

N(t) = max {n | W_n ≤ t}    (1)
Y_n = W_n − W_(n−1)

[0057] At a particular time t, the number of arrivals observed is N(t). The system 100 predicts the subsequent (i.e., the {N(t) + 1}-th) arrival and its inter-arrival time Y_(N(t)+1). Let tse(t) (time-since-event) be the amount of time that has elapsed since the last arrival or the start of the training period, whichever is smaller. This represents the censoring information that is available to the RNN at each time t. Let tte(t) (time-to-event) be the amount of time remaining until the next arrival or the end of the testing period (τ), whichever is smaller.

tse(t) = t − W_(N(t))    (2)
tte(t) = min { W_(N(t)+1) − t, τ − t }
[0058] For the purposes of illustration, consider an example of the above with 3 arrivals, where W_1 = 16, W_2 = 28, and W_3 = 32, such that Y_1 = 16, Y_2 = 12, and Y_3 = 4. Also, N(t) is a piecewise constant function which is 0 for t < 16, 1 for t ∈ [16, 28), 2 for t ∈ [28, 32), and 3 for t ≥ 32. FIG. 3 illustrates an example plot of tse(t) and tte(t) for t until τ = 40, which is the end of the training period.
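As an illustrative aside (not part of the original patent disclosure), the following is a minimal Python sketch of equations (1) and (2) applied to the arrival times of this example; the helper name and structure are editorial assumptions:

```python
import numpy as np

def tse_tte(arrival_times, t, tau):
    """Compute N(t), tse(t) and tte(t) as in equations (1) and (2)."""
    arrivals = np.asarray(sorted(arrival_times), dtype=float)
    n_t = int(np.sum(arrivals <= t))                        # N(t)
    last = arrivals[n_t - 1] if n_t > 0 else 0.0            # W_N(t), with W_0 = 0
    tse = t - last                                          # time since last arrival
    nxt = arrivals[n_t] if n_t < len(arrivals) else np.inf  # W_N(t)+1, if observed
    tte = min(nxt - t, tau - t)                             # time to event, censored at tau
    return n_t, float(tse), float(tte)

# Example from paragraph [0058]: W1 = 16, W2 = 28, W3 = 32 and tau = 40.
print(tse_tte([16, 28, 32], t=20, tau=40))  # -> (1, 4.0, 8.0)
```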
[0059] In some embodiments, the remaining time till next arrival (Z_t) can be a conditional random variable that depends only on Y_(N(t)+1), which is the inter-arrival time of the subsequent arrival. The random variable Z_t, given the observed information, can thus be defined; which is referred to as a conditional excess random variable. In these embodiments, Z_t has a distribution induced by Y_(N(t)+1) since tse(t) is fixed.

Z_t = Y_(N(t)+1) − tse(t) | Y_(N(t)+1) > tse(t)    (3)

[0060] For example, consider Z = Y − t | Y > t. This random variable Z is conditioned on the fact that Y has been observed to exceed t and the system 100 is interested in the excess value; i.e., Y − t. The distribution of Y induces a distribution on Z.
P (Z > s) = P (Y − t > s | Y > t) = P (Y > s + t) / P (Y > t)    (4)

[0061] In an embodiment, there are two cases to define the log-likelihood function. When the next arrival time is observed, the likelihood evaluation is P (Z_t ∈ [tte(t), tte(t) + 1]), since inter-arrival times are only discretely observed. However, where the time to next arrival is not observed (i.e. no more subsequent arrivals are observed by the end of training), the likelihood evaluation is instead P (Z_t > tte(t)), namely the survival function. Therefore, at each time t, the random variable Y_(N(t)+1), which has a distribution parametrized by θ_t, induces a distribution on Z_t.
Thus, the log-likelihood at each time t can be written as follows:

ℓ_t(θ_t) = log P (Z_t ∈ [tte(t), tte(t) + 1]) if uncensored
ℓ_t(θ_t) = log P (Z_t > tte(t)) otherwise    (5)
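As an illustrative aside (not part of the original patent disclosure), a minimal Python sketch of equation (5) follows: given a survival function for the conditional excess random variable Z_t, the per-time-step log-likelihood switches between the uncensored and censored cases. The function names are editorial assumptions, and the Weibull example uses the conditional excess survival function derived later in the text:

```python
import numpy as np

def step_log_likelihood(tte, uncensored, survival):
    """Log-likelihood for one time step, per equation (5).

    survival(z) should return P(Z_t > z) for the conditional excess
    random variable Z_t implied by the model parameters theta_t.
    """
    if uncensored:
        # P(Z_t in [tte, tte + 1]) = S(tte) - S(tte + 1)
        return float(np.log(survival(tte) - survival(tte + 1.0)))
    # Censored case: P(Z_t > tte) = S(tte)
    return float(np.log(survival(tte)))

# Example with a Weibull conditional excess survival function
# S_W(z) = exp(-((s + z)/lam)**k + (s/lam)**k), derived in paragraph [0066] below.
def weibull_excess_survival(z, s=2.0, k=1.5, lam=4.0):
    return np.exp(-(((s + z) / lam) ** k) + ((s / lam) ** k))

print(step_log_likelihood(3.0, True, weibull_excess_survival))   # uncensored
print(step_log_likelihood(3.0, False, weibull_excess_survival))  # censored
```

The training loss described later in the text is the negative of the sum of these per-time-step log-likelihoods over all time steps (and, in the multivariate case, over all arrival processes).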
4A and 413 6 illustrate log-likelihood visualizations for different censoring statuses. In the uncensored log-7 likelihood computation, f3 is a density function of Z3, which is the predictive distribution for the 8 remaining time till next arrival. Since the next arrival is observed to have occurred at time 6, f3 is 9 evaluated at the value 3, which is the true time to next arrival to compute the log-likelihood. In the censored case, for the predictive distribution at time 7, the next arrival was not observed and 11 hence the right tail of Z7 (i.e.?. 3) was used to compute the log-likelihood.
12 [0063] It can be generally assumed that Yõ follows distributions with differentiable density and 13 survival functions to exploit the back-propagation approach used to fit the RNN. An example is a 14 Weibull distribution parametrized by scale (A) and shape (k), whose survival function is made up of exp() and power() functions.
16 S(y) = P (Y> y) = e (6) 17 [0064] To determine Weibull likelihoods, a random variable Y- Weibull (scale = A, shape = k) 18 can be determined that has simple densities and cumulative distribution functions. Since the 19 survival function (S(x)) has the form:
S(Y) = P (Y >
21 = e-(y/A)k 22 f (y) = (k / A) (y /A)1 23 = (k / A) (y /A) S(Y) 24 [0065] The conditional excess random variable, given that it exceeds s, is W= Y-slY> S.
The definition of conditional probability in terms of some continuous random variable Xi, Xz, for 26 any measurable set Al, Az, given P (X2 E A2) > 0:
P ()CIE Ai, X2 E A2) 27 P (Xi E Ail X2 E A2) = _____ P (X2 E A2) 28 [0066] The conditional excess survival function can thus be derived as:

1 Sw (t) = P (W > t) 2 =P(Y >s+tlY>s) 3 = S (s + t)/S(s) 4 = exp {- ((s + t) / A)k + (s / A)k }
[0067] The conditional excess density function can be determined as:
6 fw (t) = f (s + t) / S(s) 7 = (k / A) ((s+ t) / A)k-1 S(s + 0 / S(s) 8 = (k / A) ((s + sw(t) 9 [0068] In an example, a Long Short Term Memory (LSTM) model can be used by the machine learning module 120 as a type of RNN structure for modeling sequential data.
In further cases, 11 other types of RNN models can be used by the machine learning module 120. At each time (t), 12 the outputs of the LSTM, which is parametrized by e, are passed through an activation function 13 so that they are valid parameters of a distribution function (et). Then, the log-likelihood is 14 determined for each time step (It) by the forecasting module 122, as described herein. FIG. 5 illustrates an example RNN computational flow with outputs (et) generated by the LSTM. Log-16 likelihoods at each time are determined as log of densities parametrized by et, evaluated at zt.
17 Where ht is the internal state of the LSTM and Xt are the covariates at each time t. In this way, 18 the machine learning module 120 can output a single prediction of expected value for each time 19 (t).
[0069] In some embodiments, the machine learning module 120 can determine loss as a 21 negative of the log-likelihood. Optimal parameters for the LSTM (0) can be determined as 22 outputs of a series of distributional estimates e, that best "explain"
the sequence of data 23 observed. In a particular case, the distribution can be a normal distribution. In the event of an 24 uncensored arrival time at time t, the weights et can be determined as those that generate a density that has a peak close to the actual arrival time. In this way, at each time step, a range of 26 values and their relatively likelihood are provided; with the output denoted by et.
27 Advantageously, with an output as a distribution, additional operations can be performed. For 28 example, determining a "best guess" expected value of the distribution.
For example, certain 29 quantities of the distribution can also be optimized; for example, it might be more costly to under-predicted to over-predict, producing a different "best guess." For example, a credibility 1 interval can be used (for example, a 90% credible interval) to determine where an output value 2 is most likely to be; which can allow for better planning and better decision making.
3 [0070] In an embodiment, the machine learning module 120 can assume a Bayesian Network, 4 .. for example similar to a Hidden-Markov model, where random variables at each time t are .. emitted from a hidden state ht. As described, ht represents an internal state of the RNN at each 6 time t and Zt is an observed time series. FIG. 6 illustrates an example of a Bayesian Network 7 where observations are independent conditioned on hidden states.
8 [0071] The forecasting module 122 can factor the joint distribution of {Z}, giving the log-9 .. likelihood for an entire time series as a sum of log-likelihoods at each time; such that the forecasting module 122 obtains a sum described below, for arbitrary events Et.
Since Et is 11 determined by the censoring status, where Et= fitte(t), tte(t)+ 1]) if uncensored and Et = {> tte(0) 12 otherwise, the forecasting module 122 can decompose the overall log-likelihood as a sum:
13 P ([Zr E Etri=lifittil=i) = 11P (Zt E Et I ht t=i 14 /({0t}) = Et/t(0t) (7) [0072] Assuming that the RNN model is parametrized by 0, there exists a function g that 16 recursively maps Xt to (Or, ht) that depends only on e. By substituting ht-1, It(0t) = 1t(gt(0)) where 17 gt depends only on 0, g, {Xj}i,t. Then since the overall log-likelihood is a sum of /Or), it can be 18 written as a function of only the RNN parameters (0) and observed data.
The structure of the 19 RNN and the back-propagation algorithm allows the determination of gradients of any order efficiently and therefore allows for the determination of 6, the minimizer of the overall observed 21 loss.
22 (et, ht) = g(ht-1, Xt I 09) (8) 23 [0073] In some embodiments, the machine learning module 120 can transform the outputs of 24 the RNN such that they are parameters of a distribution. In a particular case, the machine learning module 120 can use a Weibull distribution, which is parametrized by shape and scale 26 parameters, both of which are positive values. In example cases, the RNN
output can be 27 initialized for scale at the maximum-likelihood estimate (MLE) for a scale parameter of a Weibull 28 distribution whose shape parameter is 1; as this was determined by the present inventors to be 29 useful in preventing likelihood-evaluation errors. In example cases, a maximum shape parameter (set at 10) can be used and the RNN output can be passed for shape through a 1 sigmoid function, which is rescaled and shifted such that cr : R (0, 10) and o-*(0) = 1. In some 2 cases, for the scale parameter, an exponential function is used, which is rescaled such that it 3 maps 0 to the average inter-arrival-time.
4 [0074] In some embodiments, the system 100 can model multivariate arrivals by assuming there are p different arrival processes of interest. For the i-th waiting time of interest, is 6 defined to be the time of the n-th arrival of this type and At(t), and Yi,n is likewise defined.
7 Additionally, tse(i, t) and tte(i, t) are defined for the i-th type.
8 Z,,t = YI,N0+1 tse(i, t) I KN,(0+1 > tse(i, t) (9) 9 [0075]
Using the example of the Bayesian Network of FIG. 6, Z = . . , Zpt] and the RNN
output Or = . . . , Opt]. The log-likelihoods for each event type can be determined where /,,t 11 =
log P (Zt = tte(i, t)) or /,,t(9,,t) = log P (Z,t > tte(i, t)), recalling that the former is for the case 12 where the next arrival is observed while the latter is for the case where the no arrivals are 13 observed until the end of training.
14 [0076] Advantageously, the Bayesian Network of FIG. 6 generally requires minimal modifications as it merely requires that the emissions are conditionally independent given ht.
16 The forecasting module 122 can then determine the log-likelihood at each time as a sum, /t(et) =
17 Ei 1i,t (0i,t). Since the LSTM model is still parameterized by e, the remaining operations are the 18 same as described above. In this way, temporal dependence as well as dependence between 19 the p arrival processes can be modeled by the RNN, whose weights e can then be optimized by training data. This allows the forecasting module 122 to also model other outputs by appending 21 [Ki,t, . . . , Kp,t] to Z where KJ is some other variable of interest for process] at time t. In the retail 22 product forecasting example, Kt ,t can be other factors affecting the customer; for example, a 23 promotion. In a factory machinery example, K can be other variables that affect output; for 24 example, ambient temperature [0077] In some cases, for multi-variate purchase arrival times, masking sequences observed 26 before the first arrival of each product can be useful in preventing numerical errors encountered 27 in stochastic gradient descent. In these cases, log-likelihoods determined for time steps before 28 the earliest arrival can be masked. In the case of RNNs, each time step can have a component 29 in the loss function (for each output) and masking can be used to remove those time steps from the loss function so that they are not used in the optimization. This can ensure that RNN
31 parameters are not updated due to losses incurred during these times.

1 [0078] The forecasting module 122 can determine predictions using the fact that at each time t, 2 the estimated parameter et can be used to determine the expectation of any function of zt;
3 assuming that Zt is distributed according to et. Since the system 100 takes into account the next 4 arrival time after the end of training period (time r) , it can compute many different values of interest. As described herein, the values of interest can be derivative values of the output 6 distribution; for example, expected value, median, 90% credible range, some other cost function 7 (using a different under/over weighting of forecasts), and the like.
8 [0079] For example, the forecasting module 122 can determine a predicted probability that the 9 next arrival will occur within y time after end of training, and thus determine P (4 y). The forecasting module 122 can also determine a deferred arrival probability, which is the probability 11 that the next arrival will occur within an interval of between Vi and Vi + y2 time after end of 12 training; given that the forecasting module 122 knows it will not occur within Vi time after the end 13 of training. This can be determined by computing P (Z, E + y2] I 4 > Vi). The quantities of 14 interest may not necessarily be limited to probabilities (for example, mean and quantiles of the predictive distribution) and can be extended to generate other analytics; for example, in the 16 case of predicting product purchases, to aid in revenue analysis or forecasting that depends on 17 the subsequent purchase time.
18 [0080] Turning to FIG. 2, a flowchart for a method 200 for individual forecasting of a future 19 event for a subject, according to an embodiment, is shown. The forecast is based on historical data, for example, as stored in the database 116 or as otherwise received. The historical data 21 comprising a plurality of historical events associated with the subject.
22 [0081] At block 202, the data acquisition module 117 receives the historical data associated 23 with the subject from the input interface 106, the network interface 110, or the non-volatile 24 storage 112. At block 204, the conditional excess module 118 determines a random variable representing a remaining time until the future event. The random variable conditioned based on 26 excess times since arrival of the historical events in the historical data.
27 [0082] At block 206, the machine learning module 120 determines a distribution function that 28 predicts the time of the future event using a recurrent neural network.
The distribution function 29 comprising a learned density with peaks that approximate the times of the historical events in the historical data.
31 [0083] At block 208, the forecasting module 122 determines a log-likelihood function based on 32 a probability that the random variable exceeds an amount of time remaining until a next 1 historical event in the historical data and parameterized by the distribution function. A loss 2 function for the recurrent neural network comprising a negative of the log-likelihood function.
3 [0084] At block 210, the forecasting module 122 forecasts and outputs a time to the future 4 event for a given subject using the log-likelihood function.
[0085] Described below are three sets of example experiments conducted by the present 6 inventors to verify the functionality, efficacy, and advantages of the present embodiments. First, 7 example experiments were conducted to check model assumptions and verify that parameters 8 for Weibull inter-arrival times can be recovered by the present embodiments (using MAT-RNN) 9 on a synthetic dataset. Second, example experiments were performed to compare the performance of MAT-RNN on two open datasets to benchmark models. Third, example 11 experiments were conducted to apply MAT-RNN to predict customer purchases for a large 12 retailer and compare its performance to other approaches in the art.
13 [0086] For the example experiments, a structure used for the RNN had three stacked layers, 14 with two LSTM layers of size Wfollowed by a densely connected layer of size 2p, where p is the number of arrival processes. The densely connected layer transforms the LSTM
outputs to a 16 vector of length 2p. In MAT-RNN, the densely connected layer outputs are passed through an 17 activation layer. For squared-loss RNNs, the activation can be passed through a softplus layer 18 since time to arrivals are non-negative.
19 [0087] In the example experiments, a masking layer was applied prior to the other layers so that the RNN does not train on time indices prior to the initialization of the time series. This structure 21 is the same for other neural network based models used for benchmark comparison. The RNN
22 was trained with a learning rate of 0.001 and trained for 100 steps unless otherwise stated.
23 Gradients were component-wise clipped at 5 to prevent numerical issues.
24 [0088] The example experiments used a generated synthetic dataset, where inter-arrival times followed Weibull distributions. In the synthetic dataset, as shown in TABLE 1, a set of Weibull 26 parameters was generated for each of eight product types, from which inter-arrival times are 27 sampled. The individual product identification is referred to as SKU
(stock keeping unit).

Shape 42.48 32.35 37.68 1.99 26.59 6.91 20.57 8.04 Scale 1.15 1.09 1.06 1.01 0.97 0.88 0.62 0.78 2 [0089] Purchase times were recorded and used to train the MAT-RNN model.
In the example 3 experiments, there were 11,000 subjects. Event arrivals were observed over a period of 156 4 time steps. It was then verified that the trained model (W= 6) recovers these parameters by taking the RNN predictions at the last time step. The results indicated that relative error (i.e. ¨
6 0, where is the estimated parameter and 0 is the true parameter) is low for both scales as well 7 as shapes. TABLE 2 shows errors for estimated parameters for Weibull inter-arrival.

Mean Quantiles (x10-2) Parameter (x10-2) 0 25 50 75 100 Shape +1.02 -1.21 -0.10 +0.58 +1.02 +4.82 Scale +2.32 -3.55 -0.61 +0.61 +5.25 +11.80 [0090] The example experiments show the flexibility of the present embodiments with two open 11 dataset benchmarks. Generally, these two problems are often tackled with different models 12 since the prediction problem is different. The model of the present embodiments can, however, 13 be adapted to solve these problems since they can be modeled by a distributional approach to 14 inter-arrival times.
[0091] The first example dataset is the CDNOW dataset. For this dataset, the example 16 experiments considered a binary classification problem where the system 100 predicts if 17 purchases are made during a testing period. Predictions by the system 100 were determined as 18 the probability that the inter-arrival time occurs before end of testing period. The input data was 19 the transaction history where only purchase records are available without other covariate data.
The example experiments show that present embodiments out-perform other approaches on 21 this dataset, even with no covariates.
22 [0092] The second example dataset is based on the CMAPSS dataset. For this dataset, the 23 system 100 predicted the remaining useful lifetime, or the time to failure. Predictions were 24 determined as the mode, mean, or some other function of the inter-arrival time distribution. The training data was an uncensored time series where sensor readings and operational settings 26 were collected up until the engine fails. A customized loss function was used to evaluate 27 models.

1 [0093] The CDNOW dataset includes purchase transactions, where number of customer 2 purchases are recorded. Transaction dates, purchase counts, and transaction values were 3 available as covariates. The performance of the present embodiments, where W= 1 trained on 4 a weekly level, was compared to another approach, the Pareto/NBD model, which is a classical demand forecasting model using the lifetimes package. The CDNOW dataset is often used as 6 an example where Pareto/NBD type models do well since there's limited covariate data 7 available and there is only a single event type.
8 [0094] In the example experiments, with W= 1, there were 32 trainable parameters in the MAT-9 RNN model of the present embodiments. The training period was set at 1.5 years, from 1997-01-01 to 1998-05-31. Predictions were made for customer purchases within a month of the end 11 of training; i.e., before 1998-06-30. As illustrated in the chart of FIG. 7, the MAT-RNN model of 12 the present embodiments achieved an ROC-AUC (area under the receiver operating 13 characteristic curve) of 0.84 on the CDNOW dataset; which is substantially better when 14 compared to 0.80 that is obtained using the Pareto/NBD estimate for the "alive" probability. It can be seen that the approach of the present embodiments of integrating a survival-based 16 maximum log-likelihood approach with an RNN yielded substantially improved prediction 17 accuracy, even with a small number of weights and on a small dataset.
18 [0095] The CMAPSS dataset is a high dimensional dataset on engine performance with 26 19 sensor measurements and operational settings. In training of the model for the example experiments, the engines were run until failure. In testing of the model, data was recorded until 21 a time prior to failure. The goal was to predict the remaining useful life (RUL) for these engines.
22 A first set of engine simulations in the dataset, which has 100 uncensored time series of 23 engines, were run until failure. The maximum cycles run before failure was found to be 363.
24 Time series for each engine was segmented into sliding windows of window length 78, resulting in 323 windowed time series each of length 78. For the testing dataset, the RNN model was run 26 on a time series 78 cycles before end of observation. A custom loss function was used, where 27 over-estimation was more heavily penalized. The mean custom loss metric (MCL) is defined as 28 follows, where d is the predicted RUL subtracted by the true RUL:
-d/13 1 29 loss (d) = fe d < 0
(10) ed/10 - 1 d > 0 [0096] The performance of the MAT-RNN model of the present embodiments was compared to 31 the SQ-LOSS, which has a softplus activation and is trained on squared loss. Performance was 32 evaluated based on the mean squared loss metric (MSE) as well as the MCL. The RNN models 1 were trained with W= 64. As illustrated in FIGS. 8A and 8B, the performance of MAT-RNN was 2 .. substantial, with modes that correspond roughly to the true RUL. FIG. 8A
illustrates a chart of 3 predicted densities for RUL on C-MAPSS and True RUL. FIG. 8B illustrates a chart of predicted 4 modes for RUL on C-MAPSS and True RUL.
[0097] The example experiments determined that the MAT-RNN model of the present 6 embodiments performed better than SQ-LOSS in the metrics considered, with MAT-RNN having 7 a mean loss of 40.09 compared to SQ-LOSS of 193.36. In the RMSE metric (root-mean-8 squared-error), MAT-RNN had an error of 35.65 compared to SQ-LOSS which as 36.48. It was 9 .. advantageously determined by the present inventors that MAT-RNN is more biased towards under-estimating RUL which makes it perform much better in the custom loss metric. Also, we
11 find that from the histogram of errors illustrated in FIGS. 9A and 9B
that MAT-RNN predictions
12 are unimodal and clustered tightly around its mode. FIG. 9A illustrates a histogram of errors for
13 SQ-LOSS and FIG. 9B illustrates a histogram of errors for MAT-RNN.
14 [0098] In the example experiments, the present inventors determined the predictive performance on a real-life application for predicting purchases of a few baskets of goods sold by 16 a large European retailer. The time resolution of the dataset was on a weekly level. Training 17 data was available over roughly 1.5 years, which gave 78 weeks of training data from 2014-01-18 01 to 2015-06-30. Performance of the MAT-RNN model of the present embodiments was 19 measured on a binary classification problem of predicting whether a customer purchases the product within 4 weeks after the end of training period from 2015-06-30 to 2015-07-31.
[0099] The MAT-RNN model of the present embodiments can be used to predict various different quantities of interest; however, for the purposes of the example experiments, the comparison was of the predictive performance of the MAT-RNN model to a few benchmark models. Such comparison was with respect to whether an event will arrive within y time after the end of the training period. The benchmark models were a Squared-Loss RNN (SQ-LOSS) and a Random Forest Predictor (RNG-F). Models were trained on all customers who bought an item in the basket during the training period, and performance was evaluated on this group of customers during the testing period.
[0100] RNG-F was trained by splitting the training period into two periods. Covariates at the end of the first period were fed into the model, which was trained to predict whether subjects purchase in the second period, which was also y long. A different RNG-F model was trained for each product, but was fed covariate datasets for all products.
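A minimal sketch of this benchmark setup, assuming hypothetical column names and a scikit-learn random forest (the disclosure does not specify the implementation):

```python
# One classifier per product: covariates snapshotted at the end of the first
# sub-period, labels = "purchased this product in the second sub-period of length y".
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def train_rng_f(covariates_at_split: pd.DataFrame, purchased_in_period2: pd.Series):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(covariates_at_split, purchased_in_period2)
    return model

# hypothetical usage: a separate model per product, all fed the same covariate table
# models = {sku: train_rng_f(X_split, y_period2[sku]) for sku in skus}
```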

[0101] SQ-LOSS was trained by setting the loss function as the squared difference between the predicted time-to-arrival and the actual time-to-arrival. An activation function of softplus was applied. Predictions of SQ-LOSS were then compared to the testing period length of y. If, by the end of the training period, SQ-LOSS predicts the next time-to-arrival as s, then the prediction metric is y - s. For time periods where no actual time-to-arrival was observed (i.e., no further purchases were observed by the end of training), the loss was set to 0.
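The following is a framework-agnostic sketch (not the disclosed implementation) of the SQ-LOSS objective: a softplus-transformed prediction, squared error against the observed time-to-arrival, and zero contribution where no arrival was observed:

```python
# Masked squared loss with a softplus-activated prediction of time-to-arrival.
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def sq_loss(raw_pred, time_to_arrival, observed_mask):
    pred = softplus(np.asarray(raw_pred, dtype=float))
    err = (pred - np.asarray(time_to_arrival, dtype=float)) ** 2
    # censored periods (mask 0) contribute zero loss, per the description above
    return float(np.sum(err * observed_mask) / max(np.sum(observed_mask), 1))

# hypothetical: the third period is censored (no further arrival observed)
print(sq_loss([1.0, 2.0, 3.0], [2.0, 4.0, 0.0], observed_mask=np.array([1, 1, 0])))
```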
[0102] For each customer, at each time period, the Recency, Frequency and Monetary (RFM) metrics were determined, which are commonly used in demand modeling, at 3 different levels; namely, for all products, in-basket products and each individual product. Recency is the time since the last purchase, Frequency is the number of repeat purchases, and Monetary is the amount of money spent on all purchases to date. Included in the covariates are time-since-event (tse(t)) and indicators for whether a first purchase has occurred (pch(t)). The time-to-event (tse(t)) and the censoring status of the next arrival (unc(t)) were also determined.
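A minimal sketch of computing these RFM covariates at a reference time t for a single purchase history; the column names are assumptions for illustration:

```python
# Recency = time since last purchase, Frequency = number of repeat purchases,
# Monetary = total spend to date, all evaluated at reference time t.
import pandas as pd

def rfm_at(purchases: pd.DataFrame, t: pd.Timestamp):
    """purchases: rows with hypothetical 'date' and 'amount' columns for one customer/level."""
    past = purchases[purchases["date"] <= t]
    if past.empty:
        return {"recency": None, "frequency": 0, "monetary": 0.0}
    return {
        "recency": (t - past["date"].max()).days,   # time since last purchase
        "frequency": max(len(past) - 1, 0),          # repeat purchases only
        "monetary": float(past["amount"].sum()),     # spend to date
    }
```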
[0103] On a per-product level, the types of covariates were limited to only the RFM metrics (3 covariates) and transformations of purchase history (2 series). RFM metrics at the category and overall purchase history levels were available as well, but these account for an additional 6 covariates that were shared across the various purchase arrival processes. The total number of covariates for each product is thus 11, 6 of which are shared with other products.
[0104] Five baskets of popular replenishable products were selected for the example experiments. These were selected from products ranked by a score, where N_unique is the number of unique customers and X is the average purchases per customer:

score = X * log(N_unique)    (11)

[0105] The five selected baskets were bars, deli, floss, pads, and soda.
Their data summaries are presented in TABLE 3, where μ_overall is the average in-basket purchase count, μ_per-sku is the mean over the per-product average purchase counts, and p_others is the mean over the per-product proportion of buyers who bought another product in-basket. Also note that p_trial is the mean over the per-product proportion of trial customers (i.e., those who have made only a single purchase).

TABLE 3

basket   SKUs   customers (x1000)   μ_overall   μ_per-sku   p_others   p_trial
bars        6                  44        4.78        0.79       0.71      0.43
deli       12                  79        3.58        0.29       0.55      0.62
floss      11                 200        2.58        0.23       0.40      0.64
pads        7                 317        2.26        0.32       0.28      0.66
soda        8                 341        2.97        0.37       0.45      0.63

[0106] As shown, pads had the highest proportion of trial customers along with the smallest proportion of customers who bought another item in the basket. On the other hand, μ_per-sku was roughly median among the baskets considered. This is similar for floss as well. For these categories, it would be reasonable to expect product purchases are strongly dependent. A good joint-prediction model should separate trial purchasers from repeat purchasers who decided to stick to one product after trying another.
[0107] Performance was measured based on the ROC-AUC metric, where each of the models predicted whether customers who made in-basket purchases would make another in-basket purchase in a 4-week period after the end of a training period of 78 weeks. The RNN-based models had W = 36 and predict arrival times jointly over different products for each customer. The RNG-F model was trained with 100 trees, with covariates at week 74 and purchases between week 74 and week 78, but predicts purchases for only one product at a time. As such, a separate RNG-F model is trained for each product.
[0108] The example experiments determined how each model does for every product in the basket, and as such, there are multiple ROC-AUC metrics. TABLE 4 shows the results of the example experiments in terms of summary statistics of ROC-AUCs for each item in the basket. The results illustrate that the MAT-RNN model of the present embodiments almost always dominates in the ROC-AUC metric for every category other than bars and deli, which have the smallest numbers of customers. Even so, MAT-RNN still performs the best in terms of average ROC-AUC among products in each category other than bars.
[0109] The number of products for which ROC-AUC improved over RNG-F is substantial for the MAT-RNN model. Excluding bars, where only 2 out of 6 products saw improved performance, other categories saw ROC-AUC improvements in more than 60% of the products in-category, with soda and pads showing improvements in all products. Advantageously, the ability to model sequential data and sequential dependence separates the MAT-RNN model from RNG-F. Even though RNG-F is trained on the evaluation metric, it was determined that MAT-RNN almost always performed better in this binary classification task.
[0110] Notably, the performance difference of the MAT-RNN model of the present embodiments over SQ-LOSS and RNG-F is greatest for the pads category. This is likely due to the large amount of missing data, since customers are least likely to buy other products. It was also determined that SQ-LOSS performs poorly compared to MAT-RNN, even though these models have a similar recurrent structure and are fed the same sequential data. One possible explanation is that the lack of ground truth data has a significant impact on the ability of SQ-LOSS to learn. In cases where event arrivals are sparse or where inter-purchase periods are long, the censored nature of the data gives no ground truth to train SQ-LOSS on. Therefore, even though the recurrent structure makes it possible to model sequential dependence, the structure that the MAT-RNN model imposes on the problem makes it much easier to make predictions with censored observations. Additionally, from the results, it was determined that the MAT-RNN model performs even better for larger customers with larger sample sizes.

TABLE 4 - ROC-AUC summary statistics (Min, Q25, Q50, Q75, Max, Mean) per basket and model, and number of products with ROC-AUC improved over RNG-F (N impr.)

basket  customers (x1000)  SKUs  model    N impr.  Min     Q25     Q50     Q75     Max     Mean
bars    44                 6     RNG-F    -        0.7696  0.7986  0.8428  0.8648  0.8710  0.8304
bars    44                 6     SQ-LOSS  0        0.6608  0.7165  0.7228  0.7406  0.7550  0.7204
bars    44                 6     MAT-RNN  2        0.7588  0.7762  0.8174  0.8537  0.8783  0.8167
deli    79                 12    RNG-F    -        0.7452  0.7995  0.8389  0.9004  0.9220  0.8468
deli    79                 12    SQ-LOSS  4        0.7763  0.8047  0.8248  0.8458  0.8810  0.8259
deli    79                 12    MAT-RNN  8        0.8686  0.8823  0.8911  0.9021  0.9131  0.8919
floss   200                11    RNG-F    -        0.5537  0.6066  0.6199  0.6517  0.7683  0.6408
floss   200                11    SQ-LOSS  10       0.7298  0.7809  0.8089  0.8366  0.8739  0.8055
floss   200                11    MAT-RNN  11       0.8680  0.9016  0.9317  0.9421  0.9640  0.9214
pads    317                7     RNG-F    -        0.5851  0.6148  0.6358  0.6411  0.8234  0.6509
pads    317                7     SQ-LOSS  4        0.5650  0.6149  0.6392  0.6941  0.7154  0.6482
pads    317                7     MAT-RNN  7        0.8544  0.9160  0.9459  0.9511  0.9621  0.9281
soda    341                8     RNG-F    -        0.6959  0.7372  0.7663  0.7903  0.8300  0.7641
soda    341                8     SQ-LOSS  1        0.6844  0.7221  0.7259  0.7320  0.7612  0.7258
soda    341                8     MAT-RNN  8        0.8605  0.8669  0.8795  0.8854  0.8909  0.8768

[0111] From the example experiments, joint predictions enjoy some advantages over individual predictions, as product purchases can be modeled better through joint modeling. Generally, if the network structure is the same, then the amount of time required to train a separate model for each product scales linearly with the number of products. The number of parameters in a collection of individual models is also significantly larger than that of a joint model.
[0112] The advantages of training a joint MAT-RNN model over a collection of individual ones can be illustrated by comparing ROC-AUC performance in the soda basket, as shown in TABLE 5. The per-product individual models were given the same covariates but trained only on the purchase arrivals of that particular product. The network structure is the same, with W = 36, but the final densely connected layer outputs only a vector of size 2, since distributional parameters are required for only one product. However, since the collection of single models have different weights for their RNNs, they have approximately 8 times the number of parameters found in the joint model. As shown in TABLE 5, there is a consistent advantage of the joint model over the individually trained single models, with improvements ranging from 0.0029 to 0.1098. Potential improvements in model performance can be observed by modeling purchase arrivals jointly, even with far fewer parameters in the joint model.
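As an illustrative sketch only (the layer choices and names are assumptions, not the patented architecture), the joint model can share one recurrent stack and emit two distributional parameters per product from the final dense layer, whereas each single-product model repeats the recurrent weights and emits a size-2 output:

```python
# Joint vs. per-product arrival-time models: same recurrent width, different output size.
import torch.nn as nn

class ArrivalRNN(nn.Module):
    def __init__(self, n_features, n_products, width=36):
        super().__init__()
        self.rnn = nn.LSTM(n_features, width, batch_first=True)
        # two positive (softplus-activated) distributional parameters per product
        self.head = nn.Sequential(nn.Linear(width, 2 * n_products), nn.Softplus())

    def forward(self, x):                      # x: (batch, time, features)
        out, _ = self.rnn(x)
        return self.head(out[:, -1, :])        # parameters at the last time-step

joint = ArrivalRNN(n_features=11, n_products=8)                          # one shared recurrent stack
singles = [ArrivalRNN(n_features=11, n_products=1) for _ in range(8)]    # roughly 8x the recurrent weights
```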

TABLE 5

sku   single    joint     diff
1     0.8868    0.8897    +0.0029
2     0.8073    0.8686    +0.0614
3     0.8331    0.8605    +0.0274
4     0.8501    0.8761    +0.0260
5     0.8445    0.8829    +0.0384
6     0.8193    0.8615    +0.0422
7     0.8640    0.8909    +0.0269
8     0.7742    0.8840    +0.1098

[0113] Advantageously, the present embodiments can use a survival analysis approach with recurrent neural networks (RNNs) to forecast joint arrival times until a next event for each individual over multiple items. The present inventors advantageously recognized the technical advantages of transforming an arrival time problem into a likelihood-maximization problem with loose distributional assumptions regarding inter-arrival times. The example experiments demonstrated that not only can known parameters be recovered during fitting, but also that there are substantial improvements over other approaches.
[0114] Although the invention has been described with reference to certain specific embodiments, various modifications thereof will be apparent to those skilled in the art without departing from the spirit and scope of the invention as outlined in the claims appended hereto.

Claims (20)

1. A method for individual forecasting of a future event for a subject using historical data, the historical data comprising a plurality of historical events associated with the subject, the method executed on at least one processing unit, the method comprising:
receiving the historical data associated with the subject;
determining a random variable representing a remaining time until the future event;
predicting a time to the future event using a distribution function that is determined using a recurrent neural network, the distribution function comprising a learned density with peaks that approximate the times of the historical events in the historical data;
determining a log-likelihood function based on a probability that the random variable exceeds an amount of time remaining until a next historical event in the historical data and parameterized by the distribution function; and outputting a forecast of a time to the future event as the log-likelihood function.
2. The method of claim 1, wherein a loss function for the recurrent neural network comprises a negative of the log-likelihood function.
3. The method of claim 1, wherein the random variable is conditioned based on inter-arrival times of the historical events in the historical data.
4. The method of claim 1, wherein the random variable is conditioned based on excess times since arrival of preceding historical events in the historical data.
5. The method of claim 1, wherein the log-likelihood function at each time is the log of the probability that the random variable is in the set of time until the next historical event when the next historical event has been observed, and the log of the survival function otherwise.
6. The method of claim 5, wherein the distribution function follows a Weibull distribution.
7. The method of claim 6, wherein the distribution function is determined as (k/λ)((s+t)/λ)^(k-1) Sw(t), where k is the shape of the Weibull distribution, λ is the scale of the Weibull distribution, t is the time-step, and Sw(t) is the survival function.
8. The method of claim 1, wherein outputting the forecast of the time to the future event as the log-likelihood function comprises determining a sum of log-likelihoods at each time-step.
9. The method of claim 8, further comprising transforming the sum of log-likelihoods as a function of recurrent neural network parameters and historical data, and determining a minimizer of an overall observed loss of the recurrent neural network using such function.
10. The method of claim 1, further comprising outputting derivative values of the log-likelihood function.
11. A system for individual forecasting of a future event for a subject using historical data, the historical data comprising a plurality of historical events associated with the subject, the system comprising one or more processors in communication with a data storage, the one or more processors configurable to execute:
a data acquisition module to receive the historical data associated with the subject;
a conditional excess module to determine a random variable representing a remaining time until the future event;
a machine learning module to predict a time to the future event using a distribution function that is determined using a recurrent neural network, the distribution function comprising a learned density with peaks that approximate the times of the historical events in the historical data; and a forecasting module to determine a log-likelihood function based on a probability that the random variable exceeds an amount of time remaining until a next historical event in the historical data and parameterized by the distribution function, and to output a forecast of a time to the future event as the log-likelihood function.
12. The system of claim 11, wherein a loss function for the recurrent neural network comprises a negative of the log-likelihood function.
13. The system of claim 11, wherein the random variable is conditioned based on inter-arrival times of the historical events in the historical data.
14. The system of claim 11, wherein the random variable is conditioned based on excess times since arrival of preceding historical events in the historical data.
15. The system of claim 11, wherein the log-likelihood function at each time is the log of the probability that the random variable is in the set of time until the next historical event when the next historical event has been observed, and the log of the survival function otherwise.
16. The system of claim 15, wherein the distribution function follows a Weibull distribution.
17. The system of claim 16, wherein the distribution function is determined as (k/λ)((s+t)/λ)^(k-1) Sw(t), where k is the shape of the Weibull distribution, λ is the scale of the Weibull distribution, t is the time-step, and Sw(t) is the survival function.
18. The system of claim 11, wherein outputting the forecast of the time to the future event as the log-likelihood function comprises determining a sum of log-likelihoods at each time-step.
19. The system of claim 18, wherein the forecasting module further transforms the sum of log-likelihoods as a function of recurrent neural network parameters and historical data, and determines a minimizer of an overall observed loss of the recurrent neural network using such function.
20. The system of claim 11, wherein the forecasting module further outputs derivative values of the log-likelihood function.
CA3059932A 2019-10-24 2019-10-24 Method and system for individual demand forecasting Pending CA3059932A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CA3059932A CA3059932A1 (en) 2019-10-24 2019-10-24 Method and system for individual demand forecasting
PCT/CA2020/051422 WO2021077226A1 (en) 2019-10-24 2020-10-23 Method and system for individual demand forecasting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CA3059932A CA3059932A1 (en) 2019-10-24 2019-10-24 Method and system for individual demand forecasting

Publications (1)

Publication Number Publication Date
CA3059932A1 true CA3059932A1 (en) 2021-04-24

Family

ID=75584622

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3059932A Pending CA3059932A1 (en) 2019-10-24 2019-10-24 Method and system for individual demand forecasting

Country Status (1)

Country Link
CA (1) CA3059932A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023168523A1 (en) * 2022-03-08 2023-09-14 Kinaxis Inc. Systems and methods for probabilistic estimation in tree-based forecast models


Similar Documents

Publication Publication Date Title
Nguyen et al. Forecasting and Anomaly Detection approaches using LSTM and LSTM Autoencoder techniques with the applications in supply chain management
US20210125073A1 (en) Method and system for individual demand forecasting
Zeng et al. Online context-aware recommendation with time varying multi-armed bandit
Chen et al. Analytics for operational visibility in the retail store: The cases of censored demand and inventory record inaccuracy
CN111160968A (en) SKU-level commodity sales prediction method and device
WO2021077226A1 (en) Method and system for individual demand forecasting
CN107784390A (en) Recognition methods, device, electronic equipment and the storage medium of subscriber lifecycle
Chen et al. Distributed customer behavior prediction using multiplex data: A collaborative MK-SVM approach
Mukherjee et al. Armdn: Associative and recurrent mixture density networks for eretail demand forecasting
JPWO2016016934A1 (en) Preference analysis system
Chen et al. Multivariate arrival times with recurrent neural networks for personalized demand forecasting
Subramanian et al. Demand modeling in the presence of unobserved lost sales
US20210224351A1 (en) Method and system for optimizing an objective having discrete constraints
CN115099842A (en) Commodity sales prediction model, construction method and application thereof
US20230306505A1 (en) Extending finite rank deep kernel learning to forecasting over long time horizons
Clausen et al. Big data driven order-up-to level model: Application of machine learning
Deligiannis et al. Predicting the optimal date and time to send personalized marketing messages to repeat buyers
Yang et al. Examining multi-category cross purchases models with increasing dataset scale–An artificial neural network approach
CA3059932A1 (en) Method and system for individual demand forecasting
CA3059904A1 (en) Method and system for generating aspects associated with a future event for a subject
Xie et al. Systematic comparisons of customer base prediction accuracy: Pareto/NBD versus neural network
WO2021077227A1 (en) Method and system for generating aspects associated with a future event for a subject
US20210125031A1 (en) Method and system for generating aspects associated with a future event for a subject
Kadam et al. Design and Develop Data Analysis and Forecasting of the Sales Using Machine Learning
US20230394512A1 (en) Methods and systems for profit optimization

Legal Events

Date Code Title Description
EEER Examination request

Effective date: 20220824
