WO2020191001A1

WO2020191001A1 - Real-world network link analysis and prediction using extended probabilistic matrix factorization models with labeled nodes

Info

Publication number: WO2020191001A1
Application number: PCT/US2020/023264
Authority: WO
Inventors: Melissa J.M. TURCOTTE; Francesco Sanna PASSINO; Nicholas Andrew HEARD
Original assignee: Triad National Security, Llc
Priority date: 2019-03-18
Filing date: 2020-03-18
Publication date: 2020-09-24
Also published as: WO2020191001A8

Abstract

A practical adaptation and application of a Poisson matrix factorization (PMF) model for binary matrices to scenarios encountered in real-world computer networks is disclosed. Link prediction techniques may extend the PMF model by incorporating node-specific covariates for two sets of nodes in the PMF framework, modeling sparsity on both latent feature matrices, and/or accounting for seasonal effects to predict links. The standard PMF model may therefore be extended in three directions, which may all be implemented simultaneously to produce more accurate inference and prediction. Furthermore, the model may be extended to properly deal with binary edges, where only the existence of an edge is observed rather than an associated count along the edge.

Description

TITLE

REAL-WORLD NETWORK LINK ANALYSIS AND PREDICTION USING EXTENDED PROBABILISTIC MATRIX FACTORIZATION MODELS WITH LABELED NODES

CROSS-REFERENCE TO RELATED APPLICATION

[0001] This application claims the benefit of U.S. Provisional Patent Application No. 62/819,912 filed March 18, 2019. The subject matter of this earlier filed application is hereby incorporated by reference in its entirety.

STATEMENT OF FEDERAL RIGHTS

[0002] The United States government has rights in this invention pursuant to Contract No. 89233218CNA000001 between the United States Department of Energy and Triad National Security, LLC for the operation of Los Alamos National Laboratory.

FIELD

[0003] The present invention generally relates to link prediction techniques, and more particularly, to link prediction techniques that extend the Poisson matrix factorization (PMF) model by incorporating node-specific covariates for two sets of nodes in the PMF framework, modeling sparsity on both latent feature matrices, and/or accounting for seasonal effects to determine whether new links are anomalous.

BACKGROUND

[0004] Graphs and networks have emerged as popular mathematical structures to represent datasets that are commonly encountered in real-world applications in fields such as computer science, biology, and the social sciences. A network may be defined as a graph G = (V, E), where V is a set of nodes (e.g., users or computing systems) and E ⊆ V × V is a set of edges connecting at least some of the nodes. If a node x ∈ V interacts with a node y ∈ V, then (x, y) ∈ E. In the case of a computer network, x and y could be, for example, users or computing systems.

[0005] Link prediction is defined as the problem of predicting the presence of an edge between two nodes in the network based on observed edges and attributes of the nodes. A link is an edge, and the determination of whether a new edge is anomalous in some embodiments is based on analysis of node attributes and previously observed edges, which, in the extended PMF model, determine the probability of a link. The link prediction problem has been an active field of research, and is somewhat similar to recommender systems, especially in its static formation.

[0006] In some embodiments, links between entities in computer networks, such as user interactions with computers or system libraries and the corresponding processes that use them, may provide insights into adversary behavior. However, it should be noted that users or computing systems are just one example for certain applications in cybersecurity, and there are many more possible applications without deviating from the scope of the invention. In other words, some embodiments could be used in other application domains, in which case nodes would be something else.

[0007] Prediction and modeling of links within a computer network has relevant implications in cybersecurity. Relationships between entities within the computer network, such as user interactions with computing systems or system libraries and the corresponding processes that use them, can provide key insights into adversary behavior. Previously unobserved edges may be of particular interest since many attack behaviors, such as lateral movement, phishing, and data retrieval, create new links between such entities. Entities in some embodiments may be users or computing systems, for example. However, some embodiments may be applied to other entities without deviating from the scope of the invention.

[0008] Existing approaches for anomaly detection in cybersecurity research involve building models of normal behavioral patterns and detecting deviations. In this sense, previously observed links can be scored and predicted. However, probabilistically assigning anomaly scores to new links is a serious challenge because no prior observations exist. In other words, because there is no previous link, it is difficult to determine whether a new link is actually anomalous. Accordingly, an approach that represents rich, complex datasets and assigns probabilistic scores for new link formation over time may be beneficial.

[0009] The link prediction problem in computer networks is particularly challenging because the number of nodes involved is potentially very large and the structure of the network is inherently dynamic. For use in practical cybersecurity applications, it is typically necessary to use relatively simple and scalable techniques, given the size and dynamic nature of the networks.

[0010] Probabilistic matrix factorization techniques (e.g., classical Gaussian matrix factorization) are currently widely used in the tech industry. Poisson matrix factorization (PMF) recently emerged as a suitable model in the link prediction framework due to its flexibility and scalability. In previous work by Turcotte et al., it was shown that the hierarchical PMF model performs well in the context of cybersecurity applications. See Melissa J. M. Turcotte et al. “Poisson Factorization for Peer-Based Anomaly Detection,” 2016 IEEE Conference on Intelligence and Security Informatics (ISI), pages 208-210 (2016).

SUMMARY

[0011] Certain embodiments of the present invention may provide solutions to the problems and needs in the art that have not yet been fully identified, appreciated, or solved by conventional cybersecurity technologies. For example, some embodiments of the present invention pertain to link prediction techniques that extend the PMF model by incorporating node-specific covariates for two sets of nodes in the PMF framework, modeling sparsity on both latent feature matrices, and/or accounting for seasonal effects to improve link prediction.

[0012] In an embodiment, a computer program is embodied on a non-transitory computer-readable medium. The program is configured to cause at least one processor to observe a real-world network over time and construct a matrix for two sets of nodes based on the observation of the real-world network over time. The program is also configured to cause the at least one processor to fit an extended PMF model to the matrix for the two sets of nodes to learn posterior estimates for model parameters for predictive analytical purposes, the extended PMF model incorporating node-specific covariates for the two sets of nodes, modeling sparsity on latent feature matrices for the two sets of nodes, accounting for seasonal effects, or any combination thereof, to predict links. The program is further configured to cause the at least one processor to use the learned posterior estimates for the model parameters to make predictions for future network observations or to determine anomaly scores about links observed after the training period or previously unobserved links. Additionally, the program is configured to cause the at least one processor to output the predictions, the anomaly scores, the model parameters themselves, or any combination thereof.

[0013] In another embodiment, a computer program is embodied on a non-transitory computer-readable medium. The program is configured to cause at least one processor to observe a real-world network over time and construct a matrix for two sets of nodes based on the observation of the real-world network over time. The program is also configured to cause the at least one processor to fit an extended PMF model to the matrix for the two sets of nodes to learn posterior estimates for model parameters for predictive analytical purposes, the extended PMF model incorporating node-specific covariates for the two sets of nodes, modeling sparsity on latent feature matrices for the two sets of nodes, accounting for seasonal effects, or any combination thereof, to predict links. The program is further configured to cause the at least one processor to use the learned posterior estimates for the model parameters to make predictions for future network observations or to determine anomaly scores about links observed after the training period or previously unobserved links. Additionally, the program is configured to cause the at least one processor to output the predictions, the anomaly scores, the model parameters themselves, or any combination thereof. The extended PMF model uses a variational inference procedure for binary matrices. The variational inference procedure provides inference on marginal posterior distributions of parameters for the two sets of nodes, as well as for the covariates since this underpins a predictive distribution on which edges are likely to be observed in the future.

[0014] In yet another embodiment, a computer-implemented method includes fitting, by a computing system, an extended PMF model to a matrix for two sets of nodes based on the observation of the real-world network over time to learn posterior estimates for model parameters for predictive analytical purposes, the extended PMF model incorporating node-specific covariates for the two sets of nodes, modeling sparsity on latent feature matrices for the two sets of nodes, accounting for seasonal effects, or any combination thereof, to predict links. The computer-implemented method also includes using the learned posterior estimates for the model parameters, by the computing system, to make predictions for future network observations or to determine anomaly scores about links observed after the training period or previously unobserved links and outputting the predictions, the anomaly scores, the model parameters themselves, or any combination thereof, by the computing system.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] In order that the advantages of certain embodiments of the invention will be readily understood, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments that are illustrated in the appended drawings. While it should be understood that these drawings depict only typical embodiments of the invention and are not therefore to be considered to be limiting of its scope, the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings, in which:

[0016] FIG. 1 is a graph illustrating a full extended Poisson matrix factorization (PMF) model, according to an embodiment of the present invention.

[0017] FIG. 2 is a flowchart illustrating a process for new link prediction using an extended PMF model, according to an embodiment of the present invention.

[0018] FIG. 3 is a block diagram illustrating a computing system configured to perform new link prediction using an extended PMF model, according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

[0019] Some embodiments of the present invention pertain to an extension to, and a practical application of, a Poisson matrix factorization (PMF) model for binary matrices. PMF is extended in some embodiments to include scenarios that are commonly encountered in real life networks, such as those motivated by applications in cybersecurity. For the purposes of this disclosure, the methodology is described in the context of computer networks. However, it should be noted that the technique of some embodiments may be extended to other practical applications, such as biological or social networks, without deviating from the scope of the invention.

[0020] In particular, the extension in some embodiments explicitly includes known covariates associated with the nodes. By way of nonlimiting example, and per the above, nodes may be users or computing systems, for instance. A doubly sparse PMF with Indian Buffet Process (IBP) priors, which further refines the edge probabilities, may be used. A seasonal version of PMF may be employed to handle dynamic networks. Fast inference schemes using variational inference and Gibbs sampling may be employed. These techniques may be used individually or in combination. Such embodiments have shown improved performance over the standard PMF model and other known link prediction techniques in testing at Los Alamos National Laboratory.

[0021] In order to identify and predict links, it is typically beneficial to develop an understanding of the normal structure of the network graph, and then perform prediction procedures by associating a score with each link. The scores obtained for each edge can be used to flag the connection as anomalous if the event is probabilistically “surprising” according to the model (e.g., if the event is improbable by a predetermined threshold for that event based on past network analysis). In order to compute the scores, it is typically beneficial to use relatively computationally inexpensive and scalable techniques, given the huge amount of data and information available in most networks. PMF is a suitable model in this framework due to its flexibility and scalability. However, known covariate level information about entities is typically discarded or included after the fact.

[0022] Accordingly, some embodiments incorporate this information into the learning process by extending the PMF model to allow for covariates for both sets of nodes (e.g., users and computing systems), showing improved performance in scoring and predicting links. The model of some embodiments may also incorporate seasonality, making the model applicable to a more realistic dynamic computer network setting. In other words, some embodiments provide new link prediction techniques that extend the PMF model by incorporating node-specific covariates for two sets of nodes (e.g., a first set of nodes for users and a second set of nodes for items) in the PMF framework, modeling sparsity on both latent feature matrices for each set of nodes, and/or accounting for seasonal effects (i.e., seasonality) to predict links. The standard PMF model may therefore be extended in three directions in some embodiments, which can all be implemented simultaneously in certain embodiments to produce more accurate inference and prediction. Furthermore, the model of some embodiments may be extended to properly deal with binary edges, where only the existence of an edge is observed rather than an associated count along the edge, corresponding to the number of observed links between the two nodes during a given time period.

[0023] It is assumed herein that a computer network can be represented by a bipartite dynamic graph G_t = (U, V, E_t) observed at discrete time intervals, where U is the set of users (e.g., user accounts), V is the set of items (e.g., computing systems, also referred to as “hosts” herein), and the set E_t ⊆ U × V is the observed set of edges over a time period (t - 1, t]. Assuming a given user i and a host j, (i, j) ∈ E_t if i connects to j within the time period (t - 1, t]. From E_t, a rectangular adjacency matrix A_t can readily be obtained, with {A_t}_ij = 1_{E_t}((i, j)), where 1_S(x) denotes the indicator function: given a set S and an atom x, 1_S(x) = 1 if x ∈ S, and 0 otherwise. Given a single adjacency matrix A, or a sequence of observed adjacency matrices A_1, ..., A_T, the objective of the link prediction procedure of some embodiments is to reliably predict the structure of the subsequently observed graph A_{T+1}.
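The construction of A_t from observed connections can be sketched as follows. This is only an illustrative sketch: the event format (period index, user, host), the function name adjacency_by_period, and the example users and hosts are assumptions made here and do not come from the disclosure.

```python
from collections import defaultdict

def adjacency_by_period(events, users, hosts):
    """Return {t: A_t}, where A_t[i][j] = 1 if user i connected to host j
    during the period (t - 1, t], and 0 otherwise."""
    u_index = {u: i for i, u in enumerate(users)}
    h_index = {h: j for j, h in enumerate(hosts)}
    edges = defaultdict(set)            # E_t stored as a set of (i, j) pairs
    for t, user, host in events:        # repeated events collapse into one edge
        edges[t].add((u_index[user], h_index[host]))
    matrices = {}
    for t, e_t in edges.items():
        a_t = [[0] * len(hosts) for _ in users]
        for i, j in e_t:
            a_t[i][j] = 1               # indicator 1_{E_t}((i, j))
        matrices[t] = a_t
    return matrices

# Hypothetical events: (period index t, user, host).
events = [(1, "alice", "h1"), (1, "alice", "h1"), (1, "bob", "h2"), (2, "alice", "h2")]
A = adjacency_by_period(events, users=["alice", "bob"], hosts=["h1", "h2"])
```

Because E_t is a set, the two identical events in period 1 yield a single binary entry, which matches the binary (rather than count) view of edges used in the remainder of the description.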

[0024] POISSON MATRIX FACTORIZATION (PMF) AND DEFICIENCIES FOR NEW LINK PREDICTION

[0025] For |U| users and |V| items, let N ∈ ℕ^{|U| × |V|} be a matrix containing counts N_ij, representing the number of times the i^th user connected to the j^th item. The hierarchical Poisson factorization model (HPMF) models the counts N_ij using a Poisson link function with a rate given by the inner product between user-specific latent features and item-specific latent features:

N_ij ~ Poisson(a_i^T b_j) = Poisson(a_i1 b_j1 + ... + a_iR b_jR),

where a_i and b_j are the R-dimensional non-negative latent feature vectors of user i and item j, respectively.

[0026] Standard PMF has prior distributions on the latent features a and b. However, standard PMF does not place a prior on the parameters of those prior distributions, whereas HPMF does. This second layer of priors is what makes it hierarchical.

[0027] The specification of the model is completed in a Bayesian framework using Gamma hierarchical priors on the latent parameters:

[0028] A relevant advantage of PMF over competing models is that the likelihood only depends on the number of observed links (i.e., evaluating the likelihood is O(nnz(N)), where nnz(·) is the number of non-zero elements in the matrix, compared to O(|U| × |V|) for most statistical models for networks). Networks observed in real-world applications tend to be extremely sparse, with nnz(N) ≪ |U| × |V|. This makes the PMF scalable to graphs of enormous sizes using relatively straightforward algorithms.

[0029] The PMF model has been used as a building block for multiple extensions. For example, a social Poisson factorization (SPF) has been developed that includes the latent social influence in the recommender system. It has also been proposed to combine PMF with the standard collective matrix factorization model to tackle the problem of cold-starts and jointly model relational matrices. A collaborative topic Poisson factorization (CTPF) has been developed that adds a document topic offset to the standard PMF model to provide content-based recommendations. Note that this allows for item-specific covariate information to be incorporated into the model, but not user-specific covariate information. The approach of some embodiments in this disclosure allows for both user-specific and item-specific covariate information to be included.
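The scalability argument of paragraph [0028] can be illustrated with a short sketch. It assumes the plain Poisson likelihood for counts with rate a_i^T b_j and relies on the identity that the sum of all rates equals the inner product of the column sums of the two feature matrices, so the zero entries never need to be visited. The function name and the toy data are hypothetical.

```python
import math
import numpy as np

def poisson_loglik_sparse(rows, cols, counts, a, b):
    """Poisson factorization log-likelihood using only the non-zero counts.
    rows, cols, counts: coordinate triplets of the non-zero entries N_ij.
    a: |U| x R user features; b: |V| x R item features (all positive)."""
    rates = np.einsum("nr,nr->n", a[rows], b[cols])      # a_i^T b_j on non-zeros only
    log_fact = np.array([math.lgamma(n + 1) for n in counts])
    nonzero_part = np.sum(counts * np.log(rates) - log_fact)
    total_rate = a.sum(axis=0) @ b.sum(axis=0)           # sum of a_i^T b_j over all (i, j)
    return nonzero_part - total_rate

# Toy data (assumed for illustration only).
rng = np.random.default_rng(0)
a = rng.gamma(2.0, 0.5, size=(4, 2))
b = rng.gamma(2.0, 0.5, size=(5, 2))
N = rng.poisson(0.5, size=(4, 5))
rows, cols = np.nonzero(N)
sparse_value = poisson_loglik_sparse(rows, cols, N[rows, cols], a, b)
```

The cost is proportional to nnz(N) × R plus (|U| + |V|) × R, rather than |U| × |V| × R for a dense evaluation.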

[0030] Model selection issues have been tackled previously, where a Bayesian non-parametric model for automatic choice of the number of latent features R is developed based on the Gamma process. The Gamma process construction has also been used to jointly model the adjacency matrix and side information. Content and social trust information have previously been included in the PMF framework. It should be noted that a concept related to PMF is Poisson factor analysis (PFA), which has been further extended to hierarchical (deep) topic models. However, all of these approaches are constructed for count matrices only. As such, they have not been appropriately adapted to the case of binary matrices, nor do they provide any indication as to why one may do so.

[0031] Sparse latent features are considered in a number of different models. A generic framework for modeling dyadic data, called binary matrix factorization (BMF), has been proposed using a product of sparse matrices and a matrix of weights. The model has been generalized to a Bayesian non-parametric context with Indian buffet process (IBP) priors. It should be noted that “priors” means “prior distributions.” A prior distribution encompasses the prior knowledge about the properties of the parameters of the process. In this context, priors are chosen such that the property of conjugacy holds, which makes inference procedures analytically tractable. A non-negative matrix factorization model has also been proposed with Poisson likelihood with sparsity constraints imposed only on one of the two matrices in the decomposition, and structured stochastic mean-field variational inference is used to infer the model parameters. However, in some embodiments, sparsity is imposed on both matrices.

[0032] Dynamic extensions to PMF have also been studied. Kalman filter updates have been used to dynamically correct the rates of the Poisson distribution. A temporal version of PMF has been proposed using the two main tensor factorization algorithms: CP decomposition (also known as the CANDECOMP/PARAFAC, or canonical decomposition parallel factors decomposition) and Tucker decomposition. It has also been suggested to combine the PMF model with the Poisson process to produce dynamic recommendations. In general, despite the extensive treatments of PMF in a dynamic context, seasonality has not been explicitly accounted for. Indeed, it has been overlooked. As such, some embodiments employ a seasonal PMF model.

[0033] INCLUDING COVARIATES IN THE PMF MODEL

[0034] In cybersecurity applications, the counts associated with the links are usually extremely difficult to model due to repeated observations, beaconing behavior, and intrinsic burstiness of the events. As a result, the Poisson model for N_ij is likely not appropriate and, arguably, no parametric distribution is able to reliably capture the properties of counts of connections between computing systems. Therefore, some embodiments work directly with the adjacency matrix A, obtained by setting A_ij = 1 if N_ij > 0 and A_ij = 0 otherwise. In the standard PMF model, binary adjacency matrices are modeled using the Poisson link for convenience, despite the difference in the ranges. In some embodiments, the PMF model is modified to treat the counts N_ij as latent variables and to treat A_ij as a censored count. This type of link has been previously used, and is sometimes referred to as the Bernoulli-Poisson (BerPo) link, where Gibbs sampling is commonly used for inference. Variational inference schemes have been successfully used for PMF models with a number of different link functions and for matrices of counts. For binary matrices, Gibbs sampling or hybrid approaches (e.g., structured stochastic variational inference, which involves Gibbs sampling steps) are used. In some embodiments, a variational inference scheme for binary matrices is used instead.
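Treating A_ij as a censored Poisson count implies P(A_ij = 1) = 1 - exp(-rate) and P(A_ij = 0) = exp(-rate), where rate is the Poisson rate of the latent count N_ij. A minimal sketch of this link follows; the function names are illustrative and not taken from the disclosure.

```python
import math

def edge_probability(rate):
    """P(A_ij = 1) when A_ij = 1 if N_ij > 0 and N_ij ~ Poisson(rate)."""
    return -math.expm1(-rate)      # 1 - exp(-rate), stable for small rates

def bernoulli_poisson_loglik(a_ij, rate):
    """Log-likelihood of one binary entry under the censored-count link."""
    return math.log(edge_probability(rate)) if a_ij else -rate
```

A rate of log(2), for instance, corresponds to an edge probability of one half, and a zero entry contributes -rate to the log-likelihood regardless of how large the unobserved count could have been.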

[0035] Moreover, in many applications, users and items (or any other desired two sets of nodes) usually have associated covariates. Suppose that there are K covariates associated with each user and H covariates for each item. Let the value of the covariate k for the user i be denoted as x_ik. Similarly, let the value of the covariate h for the item j be y_jh. In cybersecurity applications, and more generically in network applications, the main interest is typically in categorical covariates, which provide known groupings or clusters of nodes. For instance, the covariate “job title” divides users into different groups according to their jobs, such as managers or scientists. Therefore, for the remainder of this disclosure, the covariates are assumed to be categorical. It is also assumed that the observed values A_ij are obtained from binary truncations of Poisson draws using the following hierarchical model:

A_ij = 1 if N_ij > 0 and 0 otherwise,
N_ij ~ Poisson(a_i^T b_j + 1_K^T (Φ ⊙ x_i y_j^T) 1_H),     (2)

[0036] where 1_n is the n-dimensional vector of ones, ⊙ is the Hadamard element-wise product, and x_i = {x_ik} and y_j = {y_jh} are the K-dimensional and H-dimensional binary vectors of covariates. In the model of Eq. (2), Φ = {φ_kh} is a K × H matrix of interaction terms for each combination of the covariates.
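As a sketch of how the covariate term enters the rate, the snippet below evaluates a_i^T b_j + 1_K^T (Φ ⊙ x_i y_j^T) 1_H for one user-item pair, with the covariates encoded as binary indicator vectors. The numerical values, the covariate labels, and the function name poisson_rate are assumptions chosen for illustration.

```python
import numpy as np

def poisson_rate(a_i, b_j, phi, x_i, y_j):
    """Rate of N_ij: latent-feature part plus covariate interaction part."""
    latent_part = a_i @ b_j
    covariate_part = np.sum(phi * np.outer(x_i, y_j))   # same value as x_i^T Phi y_j
    return latent_part + covariate_part

# Hypothetical setting: K = 2 job titles (manager, scientist),
# H = 2 locations (research lab, office).
a_i = np.array([0.5, 1.0])
b_j = np.array([0.2, 0.4])
phi = np.array([[0.8, 0.1],
                [0.3, 0.2]])
x_i = np.array([1, 0])      # user i is a manager
y_j = np.array([1, 0])      # host j is in a research lab
rate = poisson_rate(a_i, b_j, phi, x_i, y_j)
```

With one active category per node, the covariate part simply selects the single interaction term for the (manager, research lab) pair and adds it to the latent-feature inner product.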

[0037] Assume for a cybersecurity example that a covariate for the employment type “manager” is used for the users and that a covariate for the location “research lab” is used for the hosts. If the user i is a manager and the host j is located in a research laboratory, then the interaction term φ_kh for that pair of covariates expresses a correction to the rate for a manager connecting to a research laboratory. The link for the covariates is inspired by the bilinear mixed-effects models for network data. The same priors given in Eq. (1) are used for a_i and b_j and the following prior distribution completes the specification of the model:

[0038] Given an observed matrix A, inference is on the marginal posterior distributions of the parameters a_i and b_j for all the users and items, and of the interaction terms for the covariates, since this underpins the predictive distribution on which edges are likely to be observed in the future. Inference is straightforward using Gibbs sampling since the prior distributions are chosen to be conjugate, but sampling-based approaches do not scale well with the size of the network. Therefore, inference is usually performed using variational inference, which turns the problem of sampling from the posterior into an optimization task.

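[0039] In order to perform inference efficiently, a common latent variable approach is used. Given the unobserved count N_ij, a further set of latent variables Z_ijl, l = 1, ..., R + KH, is added. Z_ijl represents the contribution of the component l to the total latent count, so that N_ij = Z_ij1 + ... + Z_ij(R+KH). For l ≤ R, Z_ijl ~ Poisson(a_il b_jl).

[0040] Otherwise, l refers to a (k, h) covariate pair, and Z_ijl ~ Poisson(φ_kh x_ik y_jh), where φ_kh denotes the interaction term for that pair.

[0041] This construction ensures that N_ij has precisely the Poisson distribution specified in Eq. (2), since a sum of independent Poisson variables is itself Poisson with rate equal to the sum of the individual rates.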

[0042] GIBBS SAMPLING

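[0043] Since the prior distributions are chosen to be conjugate, the conditionals are all available analytically. First, note that the latent vector Z_ij = (Z_ij1, ..., Z_ij(R+KH)), conditional on N_ij, has a Multinomial distribution of

Z_ij | N_ij ~ Multinomial(N_ij, π_ij),

[0044] where π_ij is the probability vector proportional to (a_i1 b_j1, ..., a_iR b_jR, φ_11 x_i1 y_j1, ..., φ_KH x_iK y_jH), with φ_kh denoting the interaction term for the covariate pair (k, h). Then Z_ij and N_ij can be jointly resampled in a blocked Gibbs sampler step:

N_ij | A_ij ~ A_ij × Pois₊(λ_ij), followed by Z_ij | N_ij as above,     (3)

[0045] where λ_ij is the Poisson rate of N_ij in Eq. (2) and Pois₊(·) denotes the zero-truncated Poisson distribution. The complete conditionals for the user and item latent features are Gamma, where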

[0046] and finally,

[0047] Similar arguments give:

[0048] where l is the index corresponding to the covariate pair (k, h).
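The blocked step for the latent count and its decomposition can be sketched as follows, assuming that an observed edge (A_ij = 1) implies a zero-truncated Poisson draw for N_ij and that Z_ij is then Multinomial given N_ij. The zero-truncated sampler uses a first-arrival-time construction, which is a choice made for this sketch rather than a method stated in the disclosure; all names are hypothetical.

```python
import numpy as np

def sample_zero_truncated_poisson(rate, rng):
    """Exact draw from Pois_+(rate). The first event time t of a unit-rate
    Poisson process, conditioned to fall in [0, rate], is a truncated
    exponential; the events after t are Poisson(rate - t)."""
    u = rng.uniform()
    t = -np.log1p(-u * (-np.expm1(-rate)))     # inverse CDF, t in [0, rate)
    return 1 + rng.poisson(rate - t)

def blocked_gibbs_step(a_ij, component_rates, rng):
    """Jointly resample (N_ij, Z_ij) given the binary observation A_ij.
    component_rates holds the R + KH non-negative terms of the Poisson rate."""
    component_rates = np.asarray(component_rates, dtype=float)
    if a_ij == 0:                               # no edge: the latent count is zero
        return 0, np.zeros(component_rates.size, dtype=int)
    total = component_rates.sum()
    n_ij = sample_zero_truncated_poisson(total, rng)
    z_ij = rng.multinomial(n_ij, component_rates / total)
    return n_ij, z_ij

rng = np.random.default_rng(0)
n_ij, z_ij = blocked_gibbs_step(1, [0.5, 0.7, 0.3], rng)
```

The Gamma updates for the latent features and the interaction terms would then use the sampled Z_ij as sufficient statistics; they are omitted here because they depend on the prior parameterization of Eq. (1).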

[0049] VARIATIONAL INFERENCE

[0050] Variational inference is an optimization-based technique for approximating intractable posterior distributions, such as

[0051] with a proxy distribution

[0052] from a given family and then finding the member q*(.) of the family that minimizes the Kullback-Leibler (KL) divergence to the true posterior. Usually, the KL divergence cannot be explicitly computed. Therefore, an alternative equivalent objective, called evidence lower bound (ELBO), is maximized instead:

[0053] The proxy distribution q(.) is usually chosen to make the above approximation possible, and is in a much simpler form than the posterior distribution. The mean-field variational family is used, where the latent variables in the posterior are considered to be independent and governed by their own distribution, such that:

[0054] where each factor takes the same distributional form as the complete conditional for each of the parameters given above in the discussion of Gibbs sampling, taking advantage of the fact that the complete conditionals are in the exponential family. Note that under the approximation in Eq. (7), the ELBO given by Eq. (6) is analytically tractable.

[0055] The variational parameters

[0056] are optimized using coordinate ascent mean-field variational inference (CAVI), where each parameter is optimized while holding the others fixed. Using this algorithm, the optimal form of the variational factors is:

[0057] where v_j is an element of a “partition” of the full set of parameters v, and v_{-j} denotes the full set v excluding the parameters in the subset v_j. Importantly, the expectation is taken with respect to the variational approximation for the parameters v_{-j}, excluding the component v_j. Under the mean-field assumption,

[0058] The full variational inference algorithm is detailed in Algorithm 1 below.

(4) repeat

(5) for each entry of A such that A_ij > 0, update the rate of the truncated Poisson distribution for N_ij,

where ψ(·) is the digamma function;

(6) for each entry of A such that A_ij > 0, update the Multinomial parameters:

where l in the lower equation corresponds to a pair (k, h) of covariates, and ψ(·) is the digamma function;

(7) update the user-specific parameters:

(8) update the item-specific parameters:

(9) update the covariate-specific parameters:

(10) until convergence (in ELBO or predictive log-likelihood on a held-out dataset).

[0059] Obtaining the update equations for the variational parameters is straightforward, and similar to the variational inference algorithm for standard Poisson factorization. Note that these are all parameters for a Gamma distribution governed by a rate and a shape, referred to in Algorithm 1 with the superscripts “rte” and “shp,” respectively. These two parameters fully characterize a Gamma distribution: for shape shp and rate rte, the mean is shp/rte and the variance is shp/rte^2. Note that all the update equations in Algorithm 1 only depend on the elements of the matrix where A_ij > 0, providing computational efficiency for large sparse matrices.
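The overall shape of such a coordinate ascent loop can be sketched for a deliberately simplified model. The sketch assumes no covariates and fixed Gamma(c, d) priors on every latent feature (no hierarchical layer), so it is not Algorithm 1 itself; the function names, hyperparameter values, and toy matrix are hypothetical. It follows the same pattern of steps: truncated Poisson rate, Multinomial parameters, user parameters, item parameters.

```python
import math
import numpy as np

def digamma(x):
    """Digamma function for x > 0 (recurrence, then an asymptotic series)."""
    shift = 0.0
    while x < 6.0:                       # psi(x) = psi(x + 1) - 1 / x
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return shift + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def cavi_binary_pmf(A, R, c=1.0, d=1.0, n_iter=50, seed=0):
    """CAVI for A_ij = 1(N_ij > 0), N_ij ~ Poisson(sum_r a_ir b_jr), with
    a_ir, b_jr ~ Gamma(c, d) (shape, rate). Simplified: no covariates."""
    rng = np.random.default_rng(seed)
    n_users, n_items = A.shape
    rows, cols = np.nonzero(A)           # updates only visit entries with A_ij > 0
    psi = np.vectorize(digamma, otypes=[float])
    a_shp = c + rng.uniform(size=(n_users, R))   # random start breaks symmetry
    b_shp = c + rng.uniform(size=(n_items, R))
    a_rte = np.full(R, d)
    b_rte = np.full(R, d)
    for _ in range(n_iter):
        e_log_a = psi(a_shp) - np.log(a_rte)     # E[log a_ir] under a Gamma factor
        e_log_b = psi(b_shp) - np.log(b_rte)
        weights = np.exp(e_log_a[rows] + e_log_b[cols])   # shape (nnz, R)
        theta = weights.sum(axis=1)              # rate of q(N_ij) = Pois_+(theta)
        e_n = theta / -np.expm1(-theta)          # mean of the zero-truncated Poisson
        e_z = weights * (e_n / theta)[:, None]   # E[Z_ijr] = E[N_ij] * Multinomial prob.
        a_shp = np.full((n_users, R), c)
        np.add.at(a_shp, rows, e_z)              # shape: prior + sum_j E[Z_ijr]
        a_rte = d + (b_shp / b_rte).sum(axis=0)  # rate: prior + sum_j E[b_jr]
        b_shp = np.full((n_items, R), c)
        np.add.at(b_shp, cols, e_z)
        b_rte = d + (a_shp / a_rte).sum(axis=0)
    return a_shp, a_rte, b_shp, b_rte

A = np.kron(np.eye(2, dtype=int), np.ones((3, 3), dtype=int))   # two 3 x 3 blocks of edges
a_shp, a_rte, b_shp, b_rte = cavi_binary_pmf(A, R=2)
edge_prob = -np.expm1(-(a_shp / a_rte) @ (b_shp / b_rte).T)      # plug-in P(A_ij = 1)
```

A fixed number of iterations is used here for brevity; in practice the loop would stop on a convergence criterion, and several random starts would be compared.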

[0060] Convergence of the CAVI algorithm is determined by monitoring the change in the ELBO. As the ELBO can have many local optima, it can be highly dependent on the initial starting values. Therefore, it is generally advisable to run the algorithm multiple times using different starting points. Also, computing the ELBO on very large matrices is computationally costly. Assessing convergence may be achieved by calculating the average predictive log-likelihood on a small held-out dataset, providing a proxy to calculating the ELBO on the entire dataset.
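The held-out proxy for convergence can be sketched as follows, assuming the Bernoulli-Poisson link and plug-in estimates of the latent features (for example, the means of the Gamma proxy distributions), and omitting covariates for brevity. The relative-change stopping rule and its tolerance are illustrative choices, not values from the disclosure.

```python
import numpy as np

def heldout_avg_loglik(heldout, a_mean, b_mean):
    """Average log-likelihood over held-out (i, j, A_ij) triplets, where
    P(A_ij = 1) = 1 - exp(-a_i^T b_j)."""
    total = 0.0
    for i, j, a_ij in heldout:
        rate = float(a_mean[i] @ b_mean[j])
        total += np.log(-np.expm1(-rate)) if a_ij else -rate
    return total / len(heldout)

def has_converged(history, tol=1e-4):
    """Stop when the relative change of the monitored quantity is below tol."""
    if len(history) < 2:
        return False
    prev, curr = history[-2], history[-1]
    return abs(curr - prev) <= tol * abs(prev)
```

In use, the value returned by heldout_avg_loglik would be appended to a history list after each sweep of the algorithm, and has_converged(history) would be checked before starting the next sweep.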

[0061] LINK PREDICTION AND ANOMALY DETECTION

[0062] If a Gibbs sampler is used for inference then, given S samples from the joint posterior, an estimate can be obtained for the posterior predictive distribution of future observations

[0063] Similarly, given the optimized values of the parameters of the variational approximation to the posterior, the estimate can be equivalently obtained using Eq. (11), where the S samples are drawn from the optimized variational distribution.

[0064] Alternatively, a computationally fast way to approximate P(A_ij = 1) uses a function of the Gibbs samples, or of the parameters of the estimated variational distribution:

[0065] where, for example, each parameter is replaced by the average of its samples for the Gibbs sampler, or by the ratio of its shape to its rate (i.e., the mean of the Gamma proxy distribution) for variational inference. It should be noted that Eq. (12) gives a biased estimate of this probability, as follows from Jensen’s inequality, but has a much lower computational burden. The approximation in Eq. (12) has been successfully used for link prediction and network anomaly detection purposes in Turcotte et al. (2016). However, if available, it is in general strongly recommended to use Eq. (11), which is a standard Monte Carlo estimate of a probability.
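The difference between the two estimates can be illustrated with a small sketch. It assumes that the Monte Carlo estimate averages 1 - exp(-rate) over samples drawn from independent Gamma proxy distributions (the mean-field case, with covariates omitted), and that the fast estimate plugs the Gamma means into 1 - exp(-rate). The function name and parameter values are hypothetical.

```python
import numpy as np

def predictive_edge_probability(a_shp, a_rte, b_shp, b_rte, n_samples=5000, seed=0):
    """Return (Monte Carlo estimate, plug-in estimate) of P(A_ij = 1) for one
    user-item pair with R-dimensional Gamma(shape, rate) proxy distributions."""
    a_shp, a_rte = np.asarray(a_shp, dtype=float), np.asarray(a_rte, dtype=float)
    b_shp, b_rte = np.asarray(b_shp, dtype=float), np.asarray(b_rte, dtype=float)
    rng = np.random.default_rng(seed)
    # NumPy's gamma sampler is parameterized by (shape, scale), so scale = 1 / rate.
    a = rng.gamma(a_shp, 1.0 / a_rte, size=(n_samples, a_shp.size))
    b = rng.gamma(b_shp, 1.0 / b_rte, size=(n_samples, b_shp.size))
    rates = np.sum(a * b, axis=1)                 # one Poisson rate per sample
    monte_carlo = 1.0 - np.mean(np.exp(-rates))
    plug_in = -np.expm1(-np.sum((a_shp / a_rte) * (b_shp / b_rte)))
    return monte_carlo, plug_in

mc, plug = predictive_edge_probability([2.0, 2.0, 2.0], [2.0, 2.0, 2.0],
                                       [2.0, 2.0, 2.0], [2.0, 2.0, 2.0])
```

Because 1 - exp(-rate) is concave in the rate, the plug-in value in this independent-factor setting sits above the Monte Carlo average. For an observed pair, a low estimated probability would translate into a more surprising, and therefore more anomalous, link.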

[0066] The problem statement with respect to anomaly detection is to determine whether an observed user-item pair is normal with respect to the model parameters learned over some training period. An anomaly score can be given by the posterior predictive upper tail p-value. As A_ij is a Bernoulli random variable, this is equivalent

[0067] DOUBLY SPARSE PMF WITH IBP PRIORS

[0068] One of the main limitations of the PMF model from Eq. (2) is that the coefficients are non-negative, and they all contribute to the summation in the Poisson rate. Also, the number of latent features R must be specified in advance. The most common criteria for selection of R used in the literature, inspired by the literature on principal component analysis, are based on visual inspection of the scree plot of singular values and on the position of an elbow in the graph, or alternatively, by maximizing the predictive performance on a held-out data set over various values of the number of latent features R. In this section, the PMF model given in Eq. (2) is generalized by introducing binary coefficients D ∈ {0,1} used to switch the latent variables a_ir and b_jr on or off, allowing for a much sparser representation. It should be noted that binary variables for the covariate coefficients could also be added to simultaneously turn the covariate pairs on and off. The resulting model is an extension to Beta-Process Non-Negative Matrix Factorization (BPNNMF):

... the corresponding vectors. The model in Eq. (13) is called “doubly sparse” since it allows for sparsity on α and β simultaneously. It should be noted that in this case, the number of latent features is not restricted to a fixed value R, but is allowed to be potentially infinite. The binary indicators on the covariate coefficients are a variable selection tool that can be used to assess the impact of each covariate on the link probabilities. A suitable prior on the infinite binary matrices D^α and D^β is the IBP, which is used in the Bayesian non-parametric literature primarily for latent feature models. The IBP is the infinite limit of a Beta-Bernoulli process:

[0070] This approximation is particularly convenient for the model given in Eq. (13) for reasons that will be discussed later herein. Similarly, equivalent priors can be placed on the binary variables corresponding to the covariates:

[0071] INFERENCE VIA GIBBS SAMPLING

[0072] It should be noted that coordinate ascent mean-field variational inference cannot be trivially applied in this model since, in Eq. (16), the relevant expectation is infinite. Therefore, a Gibbs sampler is used here, where the full conditionals are only a slight modification of those given in the discussion of Gibbs sampling above. Alternatively, structured stochastic variational inference with Gibbs sampling, which is a hybrid between the two techniques, could be used. The conditional distribution for N_ij and Z_ij follows Eq. (3), except that the rate for the Poisson and probability vectors for the Multinomial will now depend on the binary variables. The conditional distribution of α_ir, conditioned on D^α and D^β, is

[0073] and similarly for β_jr. The remaining conditional distributions are identical to Eq. (4).

[0074] The probabilities in the Beta-Bernoulli prior can be considered as nuisance parameters and integrated out, obtaining the following marginal posterior for the binary indicators in the Beta-Bernoulli approximation:

[0075] and a similar equation can be obtained for the second set of binary indicators. In the full IBP setting, the infinite limit of Eq. (14) is taken, and new non-zero columns should be resampled. In the linear Gaussian model, an explicit expression has been derived for the marginal likelihood for this type of move. In this model, this step is particularly complicated: if a new non-empty column is proposed in D^α, new columns should also be proposed for D^β, α, and β. Hence, the Beta-Bernoulli approximation (or finite-dimensional IBP) is used for simplicity.
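By way of non-limiting illustration, a finite-dimensional Beta-Bernoulli prior on a binary indicator matrix can be simulated as in the sketch below. The parametrization Beta(concentration / R, 1) is the one commonly used in the IBP literature for the finite approximation and is an assumption here, since it may differ from the exact form of Eq. (14). All names are hypothetical.

```python
import random


def sample_beta_bernoulli(num_rows, num_features, concentration, seed=0):
    """Draw a binary matrix from a finite Beta-Bernoulli prior.

    Assumed parametrization (not confirmed by the text):
        pi_r ~ Beta(concentration / num_features, 1)   for each feature r
        d_ir ~ Bernoulli(pi_r)                          for each row i
    """
    rng = random.Random(seed)
    feature_probs = [
        rng.betavariate(concentration / num_features, 1.0)
        for _ in range(num_features)
    ]
    indicators = [
        [1 if rng.random() < p else 0 for p in feature_probs]
        for _ in range(num_rows)
    ]
    return feature_probs, indicators


feature_probs, indicators = sample_beta_bernoulli(
    num_rows=5, num_features=20, concentration=2.0
)
# Features that are switched off for every row do not contribute to any rate.
active_features = [
    r for r in range(20) if any(row[r] for row in indicators)
]
```

With this parametrization, most feature probabilities are close to zero when the number of features is large, so only a few columns of the indicator matrix are typically active.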

[0076] Finally, the marginal posterior distribution for F_kh is a slight modification of Eq. (5):

[0077] and, after integrating out p_c, the marginal posterior of (and similarly for

is:

[0078] DYNAMIC NETWORKS AND SEASONAL PMF

[0079] In the previous sections, it has been assumed that a single adjacency matrix A is available. Now consider that a discrete sequence of adjacency matrices A_1, ..., A_T is observed, where it is assumed that the sequence has seasonal dynamics with some known fixed seasonal period P. To include time dependence, a third index t is added to some of the parameters to denote the time at which an adjacency matrix was observed. As before, the counts N_ijt are treated as latent variables, and the sequence of observed adjacency matrices is obtained as A_ijt = 1(N_ijt > 0), where 1(·) is the indicator function. The latent counts are modeled as follows:

[0080] where g: N+ → {1, ..., P} maps the observation time t to a seasonal segment. For example, with a fixed seasonal period of a week and daily observations, g(t) = t mod 7 + 1 could correspond to each day of the week. The priors on α_ir and β_jr do not change from the previous sections, and these parameters represent a baseline level of activity, which is constant over time. On the other hand, the seasonal parameters represent corrections to the baseline rates according to the current seasonal segment. It should be noted that for some applications, it may not be expected that there is a seasonal adjustment to the rate for the interaction terms of the covariates, in which case the dependency on w_khg(t) could be dropped. For identifiability, it may be necessary to impose constraints on the seasonal adjustments. The following hierarchical priors are placed on the seasonal parameters in some embodiments:

[0081] For simplicity, in this section, the binary indicators have been dropped, but the model can be appropriately modified using the same technique presented in Eq. (13). For example, α_i in Eq. (15) could be replaced with the elementwise product of α_i and its binary indicators.

[0082] Inference in the seasonal model follows the same principles used in the previous sections. Gibbs sampling and variational inference procedures can be used, and details and equations are discussed further below.
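The mapping from observation times to seasonal segments described above can be sketched as follows, using the weekly example g(t) = t mod 7 + 1 with daily observations. The helper that groups observation times by segment is a hypothetical convenience and is not part of the disclosure.

```python
def seasonal_segment(t, period=7):
    """Map a positive integer observation time t to a segment in {1, ..., period}."""
    if t < 1:
        raise ValueError("t must be a positive integer")
    return t % period + 1


def group_times_by_segment(times, period=7):
    """Collect the observation times that share each seasonal segment."""
    groups = {segment: [] for segment in range(1, period + 1)}
    for t in times:
        groups[seasonal_segment(t, period)].append(t)
    return groups


# Two weeks of daily observations: each segment is visited exactly twice.
groups = group_times_by_segment(range(1, 15))
```

Observations that fall in the same segment, such as days 1 and 8, would share the same seasonal correction under this mapping.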

[0083] VARIATIONAL INFERENCE IN THE PMF MODEL

[0084] All of the factors in the variational approximation given in Eq. (8) are of closed form and take the same distributional form as the complete conditionals for each of the parameters. The rate of the Poisson distribution for N_ij is a sum (Σ) of terms, each of which is an individual element contributing to the total rate. In the resulting expression, k is a constant with respect to N_ij and Z_ij.

[0086] Following Eq. (8) and Eq. (16), the optimal variational distribution is obtained with the domain of Z_ij restricted to have elements summing to N_ij.

[0087] Evaluating the normalizing constants for the distribution in Eq. (17) gives the optimal variational distribution with the same form as Eq. (18) below, so that the rate of the zero-truncated Poisson is updated using the expression given in Eq. (9) above, and the update for the vector of probabilities for Z_ij (see Eq. (10) above) is given by an extension of the standard result for variational inference in the PMF model:

[0089] INFERENCE IN THE SEASONAL MODEL

[0090] The Gibbs sampler for the seasonal model in Eq. (15) follows the same guidelines followed above for the non-seasonal models. Given the unobserved counts N_ijt, latent variables Z_ijt are added, representing the contribution of each component to the total count N_ijt. The full conditional for N_ijt and Z_ijt follows Eq. (3), except that the rate for the Poisson and the probability vectors for the Multinomial will now depend on the seasonal parameters. Letting q denote a seasonal segment in {1, ..., P}, the full conditionals for the rate parameters are:

[0091] where x_k =

Similar results are available for the item-specific rate parameters.

Also:

[0092] and similarly for the conditional distribution is equivalent

to Eq. (4).

[0093] For variational inference, the mean-field variational family is again used, implying a factorization similar to Eq. (7), so that:

[0094] As in the discussion of variational inference above, each factor has the same form as the full conditional distribution for the corresponding parameter or group of parameters. Again, the variational parameters are updated using coordinate ascent variational inference (CAVI), and an algorithm similar to that detailed in Algorithm 1 is obtained, where steps 7, 8, and 9 are modified to include the time-dependent parameters. It follows that for the user-specific parameters, the update equations take the form:

[0095] and similar results can be obtained for the item-specific parameters. For the covariates:

[0096] Finally, for the time-dependent hyperpriors:

[0097] and similarly for

[0098] Again, closed form updates are available. Let

[0099] represent the elements in the sum. Then, the expectation can be evaluated in a similar manner to Eq. (16).

[0100] Hence, one can derive the update equations in a manner similar to the section above:

[0101] PMF EXTENSION

[0102] Some embodiments extend the PMF model by incorporating node-specific covariates for two sets of nodes (e.g., a first set of nodes for users and a second set of nodes for items) in the PMF framework, modeling sparsity on both latent feature matrices for each set of nodes, and/or accounting for seasonal effects (i.e., seasonality) to improve prediction of links. The standard PMF model may therefore be extended in three directions, which can be all implemented simultaneously in some embodiments to produce more accurate inference and prediction. The following equation summarizes the multiple models discussed herein:

[0103] The counts N_ijt have been considered as censored, and it has been assumed that only the binary indicator A_ijt is observed. “Censored” means that only whether two entities connected is observed, and not the number of connections, as in most PMF applications. In other words, for a given user and host over a given time period, a link either is or is not observed (i.e., the count is either positive or zero). This differs from conventional approaches, which model the observed counts directly.
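The censoring described above can be sketched as follows: a matrix of latent counts is reduced to a binary matrix that records only whether each count is positive, and the probability of observing a link follows from the Poisson probability of a non-zero count. The function names and the example counts are hypothetical.

```python
import math


def censor_counts(counts):
    """Keep only whether each count is positive, discarding its magnitude."""
    return [[1 if n > 0 else 0 for n in row] for row in counts]


def link_probability(rate):
    """P(N > 0) for N ~ Poisson(rate), i.e. 1 - exp(-rate)."""
    if rate < 0:
        raise ValueError("rate must be non-negative")
    return -math.expm1(-rate)


# Hypothetical latent counts for two users (rows) and three hosts (columns).
latent_counts = [[0, 3, 1], [12, 0, 0]]
adjacency = censor_counts(latent_counts)
```

In this sketch, a pair with 12 connections and a pair with a single connection produce the same observation, which is why the counts are treated as latent variables during inference.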

[0104] Starting from the hierarchical PMF model discussed above, which only includes the latent features, covariates have been included through the matrix of coefficients F, which contains the interaction terms between the covariates of the users and hosts, for instance. For a practical example, consider the case where the observations represent movie ratings. If it is observed that a user watched a movie, but did not rate the movie, it is unknown whether and how much the user liked the movie. Based on application of the extended PMF of some embodiments, if covariates are known about the users and movies, it can be predicted from the covariate coefficients and the latent features whether the user would have liked the movie. The covariates can improve the predictive performance, especially in cases where there is little to no known information about what other movies the user has watched. Seasonal adjustments for the coefficients are obtained through the seasonal variables. Just as F is a matrix with elements F_kh, the seasonal corrections to F form a matrix whose elements are denoted w_khg(t).

[0105] Finally, sparsity and variable selection issues are tackled using the binary random vectors D. If a binary indicator is zero, then the corresponding component does not contribute to the probability of a link. Inference schemes using variational inference and Gibbs sampling are discussed. In particular, a variational inference scheme for the Bernoulli-Poisson link is proposed. The model is summarized graphically in graph 100 of FIG. 1.
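As a minimal sketch of how binary indicators can switch components off, the latent-feature part of a Poisson rate may be computed as below. The sketch assumes that an indicator multiplies each latent feature on both sides, so that a component contributes only when both of its indicators equal one; this is an illustrative reading and the names are hypothetical.

```python
def sparse_latent_rate(alpha_i, beta_j, d_alpha_i, d_beta_j):
    """Sum over components r of (d_alpha * alpha) * (d_beta * beta).

    A component contributes only if both binary indicators are 1
    (assumed form, for illustration).
    """
    if not (len(alpha_i) == len(beta_j) == len(d_alpha_i) == len(d_beta_j)):
        raise ValueError("all inputs must have the same length")
    return sum(
        da * a * db * b
        for a, b, da, db in zip(alpha_i, beta_j, d_alpha_i, d_beta_j)
    )


# Only the first component is switched on for both the user and the item.
rate = sparse_latent_rate(
    alpha_i=[0.5, 2.0, 1.0],
    beta_j=[1.0, 3.0, 4.0],
    d_alpha_i=[1, 0, 1],
    d_beta_j=[1, 1, 0],
)
```

Here the second and third components have large feature values but do not affect the rate, because one of the two indicators is zero in each case.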

[0106] The techniques of some embodiments have been applied to a user authentication graph of the Los Alamos National Laboratory, showing improvements over competing models for link prediction purposes. Including covariates improves area under the curve (AUC) scores and average predictive log-likelihoods. Including covariates also enables prediction for cold starts, where new nodes enter the network. Using binary variables for doubly sparse latent features guarantees improvements in the log-likelihood on a held-out dataset. Seasonal corrections enable calculation of time-varying anomaly scores and produce different predictions for different time frames, yielding more accurate results.
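By way of non-limiting illustration, the contribution of covariates to a link rate, and its use for a cold start, can be sketched as below. The sketch assumes that the covariate term is a bilinear form in the user covariates, the host covariates, and a coefficient matrix, and that a new node with no learned latent features is scored from the covariate term alone. Both are simplifying assumptions, and all names and values are hypothetical.

```python
import math


def covariate_rate(x_i, y_j, coefficients):
    """Bilinear interaction: sum over k, h of coefficients[k][h] * x_i[k] * y_j[h]."""
    return sum(
        coefficients[k][h] * x_i[k] * y_j[h]
        for k in range(len(x_i))
        for h in range(len(y_j))
    )


def total_rate(alpha_i, beta_j, x_i, y_j, coefficients):
    """Latent-feature term plus covariate term; alpha_i may be None for a new node."""
    latent = 0.0
    if alpha_i is not None and beta_j is not None:
        latent = sum(a * b for a, b in zip(alpha_i, beta_j))
    return latent + covariate_rate(x_i, y_j, coefficients)


# One-hot covariates: the user belongs to group 0, the host to group 1.
coefficients = [[0.1, 0.2], [0.3, 0.4]]
known_user_rate = total_rate([1.0, 2.0], [0.5, 0.25], [1, 0], [0, 1], coefficients)
new_user_rate = total_rate(None, [0.5, 0.25], [1, 0], [0, 1], coefficients)
new_user_link_probability = 1.0 - math.exp(-new_user_rate)
```

In the sketch, the new user still receives a non-zero link probability because its group membership interacts with the group of the host.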

[0107] The complexity of some real-world networks (e.g., larger computer networks) requires adaptations to the standard PMF model in order to be practical. Often, nodes within a network have associated covariates providing prior knowledge about groupings of nodes, which can be used to improve the predictive power of the model. Also, some networks are intrinsically dynamic with strong seasonal patterns. Therefore, it is important in some cases to include the time of observation in order to predict observed links more accurately. For example, in the computer network application where the observed network is users authenticating to computers, it may be normal for a user U to connect to computing system X during the week. However, on the weekend, this behavior may be extremely abnormal. Without incorporating seasonal effects, this information would be lost. Also, by using sparse latent feature matrices, each latent feature is associated only with a subset of the nodes. This allows for a more precise assessment of the probability of a link, provides a framework for model selection, and can be used to automatically select the appropriate number of latent features.

[0108] FIG. 2 is a flowchart illustrating a process for link prediction using an extended PMF model, according to an embodiment of the present invention. The process begins with observing a real-world network over time and constructing a matrix for the two sets of nodes (e.g., one for users and another for items) based on the observations at 210 over a training period. An extended PMF model is then fit to the matrix for the two sets of nodes at 220 to learn posterior estimates for the model parameters for predictive analytical purposes. The extended PMF model incorporates node-specific covariates for the two sets of nodes in the graph, models sparsity on latent feature matrices for the two sets of nodes, accounts for seasonal effects, or any combination thereof, to predict the links.
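The construction of the matrix at 210 can be sketched as follows, assuming that the observations over the training period are available as (user, host) event pairs. The event format and all names are hypothetical and serve only as an illustration.

```python
def build_adjacency(events, users, hosts):
    """Build a binary users-by-hosts matrix from observed (user, host) events."""
    user_index = {user: i for i, user in enumerate(users)}
    host_index = {host: j for j, host in enumerate(hosts)}
    adjacency = [[0] * len(hosts) for _ in users]
    for user, host in events:
        # Repeated events for the same pair still yield a single binary link.
        adjacency[user_index[user]][host_index[host]] = 1
    return adjacency


users = ["u1", "u2"]
hosts = ["h1", "h2"]
events = [("u1", "h2"), ("u1", "h2"), ("u2", "h1")]
adjacency = build_adjacency(events, users, hosts)
```

The resulting matrix is rectangular in general, since the two sets of nodes need not have the same size.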

[0109] To learn the parameters, variational inference or a Gibbs sampler may be used. Gibbs sampling is a common Monte Carlo method for inference, and variational inference is a common optimization-based alternative. However, in some embodiments, the variational inference procedure and the Gibbs sampler have been modified to account for binary edges.
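By way of non-limiting illustration, one way a Gibbs sampler can account for binary edges is to resample the latent count for each pair and then allocate it across components. The sketch below assumes that an observed link implies a count drawn from a zero-truncated Poisson, that an unobserved link implies a count of zero, and that the count is split across components by a multinomial draw with probabilities proportional to the component rates. The names and the example rates are hypothetical.

```python
import math
import random


def sample_zero_truncated_poisson(rate, rng):
    """Inverse-CDF draw from Poisson(rate) conditioned on being at least 1.

    Intended for moderate rates; exp(-rate) underflows for very large rates.
    """
    positive_mass = -math.expm1(-rate)  # P(N >= 1) = 1 - exp(-rate)
    target = rng.random() * positive_mass
    count = 1
    pmf = rate * math.exp(-rate)  # P(N = 1)
    cdf = pmf
    while cdf < target and pmf > 0.0:
        count += 1
        pmf *= rate / count
        cdf += pmf
    return count


def resample_latent_count(observed_link, component_rates, rng):
    """Blocked update: draw the count, then split it across the components."""
    allocation = [0] * len(component_rates)
    if observed_link == 0:
        return 0, allocation
    count = sample_zero_truncated_poisson(sum(component_rates), rng)
    picks = rng.choices(range(len(component_rates)), weights=component_rates, k=count)
    for index in picks:
        allocation[index] += 1
    return count, allocation


rng = random.Random(1)
count, allocation = resample_latent_count(1, [0.2, 0.5, 0.1], rng)
```

Drawing the count and its allocation together in one step corresponds to a blocked update, after which the remaining parameters can be updated conditionally on the allocated counts.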

[0110] After these techniques have been applied, the parameters of the model have been learned. These estimates allow predictions to be made for anomaly detection, recommendations, etc. The learned posterior estimates for the model parameters are thus used to make predictions for future network observations or to determine anomaly scores about links observed after the training period or previously unobserved links at 230. The predictions or anomaly scores (and in some embodiments, the model parameters themselves) are then output at 240 for review.

[0111] In some embodiments, the link prediction results may include the top-M most likely links for recommendations, the top-M least likely links for anomaly detection, etc. For anomaly detection, the results could include the top-M (or top-a%) most anomalous links, which could be further examined by security experts for assessment of the threat to the system. In addition, the parameters of the model could be output for secondary analyses for interpretability of results, such as to identify strongly linked users or items, or to understand which covariates are important in making predictions.
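The selection of the top-M links can be sketched as follows, assuming that predicted link probabilities are held in a dictionary keyed by (user, host) pairs. The data structure, names, and values are hypothetical.

```python
import heapq


def top_m_links(link_probabilities, m, most_likely=True):
    """Return the m most likely (or least likely) links with their probabilities."""
    select = heapq.nlargest if most_likely else heapq.nsmallest
    return select(m, link_probabilities.items(), key=lambda item: item[1])


link_probabilities = {
    ("u1", "h1"): 0.90,
    ("u1", "h2"): 0.05,
    ("u2", "h1"): 0.50,
}
recommendations = top_m_links(link_probabilities, 1)
anomaly_candidates = top_m_links(link_probabilities, 2, most_likely=False)
```

For anomaly detection, the dictionary would contain only links that were actually observed after the training period, so that the least likely observed links are surfaced for review.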

[0112] FIG. 3 is a block diagram illustrating a computing system configured to perform new link prediction using an extended PMF model, according to an embodiment of the present invention. Computing system 300 includes a bus 305 or other communication mechanism for communicating information, and processor(s) 310 coupled to bus 305 for processing information. Processor(s) 310 may be any type of general or specific purpose processor, including a central processing unit (CPU), application specific integrated circuit (ASIC), field programmable gate array (FPGA), etc. Processor(s) 310 may also have multiple processing cores, and at least some of the cores may be configured to perform specific functions. Multi-parallel processing may be used in some embodiments. Computing system 300 further includes a memory 315 for storing information and instructions to be executed by processor(s) 310. Memory 315 may comprise any combination of random access memory (RAM), read only memory (ROM), flash memory, cache, static storage such as a magnetic or optical disk, or any other types of non-transitory computer-readable media or combinations thereof. Additionally, computing system 300 includes a communication device 320, such as a transceiver and antenna, to wirelessly provide access to a communications network.

[0113] Non-transitory computer-readable media may be any available media that can be accessed by processor(s) 310 and may include volatile media, non-volatile media, or both. The media may also be removable, non-removable, or both.

[0114] Processor(s) 310 are further coupled via bus 305 to a display 325, such as a Liquid Crystal Display (LCD), for displaying information to a user. A keyboard 330 and a cursor control device 335, such as a computer mouse, are further coupled to bus 305 to enable a user to interface with computing system. However, in certain embodiments, a physical keyboard and mouse may not be present, and the user may interact with the device solely through display 325 and/or a touchpad (not shown). Any type and combination of input devices may be used as a matter of design choice. In certain embodiments, no physical input device is present. For instance, the user may interact with computing system 300 remotely via another computing system in communication therewith, or computing system 300 may operate autonomously.

[0115] Memory 315 stores software modules that provide functionality when executed by processor(s) 310. The modules include an operating system 340 for computing system 300. The modules further include a module 345 that is configured to perform new link prediction using an extended PMF model by employing any of the approaches discussed herein or derivatives thereof. Computing system 300 may include one or more additional functional modules 350 that include additional functionality.

[0116] One skilled in the art will appreciate that a “system” could be embodied as a server, an embedded computing system, a personal computer, a console, a personal digital assistant (PDA), a cell phone, a tablet computing device, or any other suitable computing device, or combination of devices. Presenting the above-described functions as being performed by a “system” is not intended to limit the scope of the present invention in any way, but is intended to provide one example of many embodiments of the present invention. Indeed, methods, systems and apparatuses disclosed herein may be implemented in localized and distributed forms consistent with computing technology, including cloud computing systems.

[0117] It should be noted that some of the system features described in this specification have been presented as modules, in order to more particularly emphasize their implementation independence. For example, a module may be implemented as a hardware circuit comprising custom very large scale integration (VLSI) circuits or gate arrays, off-the-shelf semiconductors such as logic chips, transistors, or other discrete components. A module may also be implemented in programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices, graphics processing units, or the like.

[0118] A module may also be at least partially implemented in software for execution by various types of processors. An identified unit of executable code may, for instance, comprise one or more physical or logical blocks of computer instructions that may, for instance, be organized as an object, procedure, or function. Nevertheless, the executables of an identified module need not be physically located together, but may comprise disparate instructions stored in different locations which, when joined logically together, comprise the module and achieve the stated purpose for the module. Further, modules may be stored on a computer-readable medium, which may be, for instance, a hard disk drive, flash device, RAM, tape, or any other such medium used to store data.

[0119] Indeed, a module of executable code could be a single instruction, or many instructions, and may even be distributed over several different code segments, among different programs, and across several memory devices. Similarly, operational data may be identified and illustrated herein within modules, and may be embodied in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set, or may be distributed over different locations including over different storage devices, and may exist, at least partially, merely as electronic signals on a system or network.

[0120] The process steps performed in FIG. 2 may be performed by a computer program, encoding instructions for the processor(s) to perform at least the process described in FIG. 2, in accordance with embodiments of the present invention. The computer program may be embodied on a non-transitory computer-readable medium. The computer-readable medium may be, but is not limited to, a hard disk drive, a flash device, RAM, a tape, or any other such medium used to store data. The computer program may include encoded instructions for controlling the processor(s) to implement the process described in FIG. 2, which may also be stored on the computer-readable medium.

[0121] The computer program can be implemented in hardware, software, or a hybrid implementation. The computer program can be composed of modules that are in operative communication with one another, and which are designed to pass information or instructions to display. The computer program can be configured to operate on a general purpose computer, an ASIC, or any other suitable device.

[0122] It will be readily understood that the components of various embodiments of the present invention, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present invention, as represented in the attached figures, is not intended to limit the scope of the invention as claimed, but is merely representative of selected embodiments of the invention.

[0123] The features, structures, or characteristics of the invention described throughout this specification may be combined in any suitable manner in one or more embodiments. For example, reference throughout this specification to “certain embodiments,” “some embodiments,” or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases “in certain embodiments,” “in some embodiments,” “in other embodiments,” or similar language throughout this specification do not necessarily all refer to the same group of embodiments and the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

[0124] It should be noted that reference throughout this specification to features, advantages, or similar language does not imply that all of the features and advantages that may be realized with the present invention should be or are in any single embodiment of the invention. Rather, language referring to the features and advantages is understood to mean that a specific feature, advantage, or characteristic described in connection with an embodiment is included in at least one embodiment of the present invention. Thus, discussion of the features and advantages, and similar language, throughout this specification may, but do not necessarily, refer to the same embodiment.

[0125] Furthermore, the described features, advantages, and characteristics of the invention may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize that the invention can be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments of the invention.

[0126] One having ordinary skill in the art will readily understand that the invention as discussed above may be practiced with steps in a different order, and/or with hardware elements in configurations which are different than those which are disclosed. Therefore, although the invention has been described based upon these preferred embodiments, it would be apparent to those of skill in the art that certain modifications, variations, and alternative constructions could be made while remaining within the spirit and scope of the invention. In order to determine the metes and bounds of the invention, therefore, reference should be made to the appended claims.

Claims

1. A computer program embodied on a non-transitory computer-readable medium, the program configured to cause at least one processor to:

observe a real-world network over time and construct a matrix for two sets of nodes based on the observation of the real-world network over time;

fit an extended Poisson matrix factorization (PMF) model to the matrix for the two sets of nodes to learn posterior estimates for model parameters for predictive analytical purposes, the extended PMF model incorporating node-specific covariates for the two sets of nodes, modeling sparsity on latent feature matrices for the two sets of nodes, accounting for seasonal effects, or any combination thereof, to predict links;

use the learned posterior estimates for the model parameters to make predictions for future network observations or to determine anomaly scores about links observed after the training period or previously unobserved links; and

output the predictions, the anomaly scores, the model parameters themselves, or any combination thereof.

2. The computer program of claim 1, wherein the extended PMF model is a doubly sparse PMF with Indian Buffet Process (IBP) priors that further refine edge probabilities.

3. The computer program of claim 1, wherein the extended PMF model employs fast inference schemes using variational inference and Gibbs sampling either individually or in combination.

4. The computer program of claim 3, wherein for the Gibbs sampling, a latent vector conditional on counts N_ij has a multinomial distribution, so that the latent vector and the counts N_ij are jointly resampled in a blocked Gibbs sampler step, and complete conditionals for the latent features for each set of nodes are Gamma.

5. The computer program of claim 1, wherein the extended PMF model is extended to deal with binary edges, where only existence of an edge is observed during a predetermined time period.

6. The computer program of claim 1, the program further configured to cause the at least one processor to:

form at least one rectangular adjacency matrix using an indicator function; and

use a structure of the at least one rectangular adjacency matrix to predict a structure of a subsequently observed graph of the computer network, wherein

the adjacency matrix is obtained by setting A_ij = 1(N_ij > 0), where 1(·) is an indicator function providing an N-dimensional vector of ones that indicates whether counts are present, thus treating the counts N_ij as latent variables and treating A_ij as a censored count.

7. The computer program of claim 1, wherein

the extended PMF model uses a variational inference procedure for binary matrices, and

the variational inference procedure provides inference on marginal posterior distributions of parameters for the two sets of nodes, as well as for the covariates since this underpins a predictive distribution on which edges are likely to be observed in the future.

8. The computer program of claim 1, wherein the extended PMF model uses a common latent variable approach to inference.

9. The computer program of claim 1, wherein seasonality is accounted for by considering a discrete sequence of adjacency matrices A₁, ... , A_T representing observation during time periods 1 to T, including time dependence.

10. The computer program of claim 1, wherein the node-specific covariates for both sets of nodes in the PMF model, sparsity on latent feature matrices for the two sets of nodes, and seasonal effects are accounted for simultaneously in the extended PMF model to produce more accurate inference and link prediction.

11. The computer program of claim 1, wherein the output comprises at least one Internet Protocol (IP) address, at least one Media Access Control (MAC) address, or a combination thereof identifying the computing system initiating the link, the computing system receiving the link, or both.

12. The computer program of claim 1, wherein the program is further configured to cause the at least one processor to:

generate a bipartite graph representative of the computer network over a plurality of discrete time intervals, the bipartite graph comprising a set of users in the computer network, a set of hosts in the computer network, and an observed set of links between the user accounts and the hosts over a predetermined time period.

13. The computer program of claim 1, wherein the extended PMF model is implemented as follows:

where counts N_ijt have been considered as censored, only a binary indicator is observed, α and β represent latent features, covariates are included through a matrix of coefficients F that contains interaction terms between the covariates of both sets of nodes, seasonal adjustments for the coefficients of F are obtained through seasonal variables, and sparsity and variable selection issues are tackled using binary random vectors D.

14. A computer program embodied on a non-transitory computer-readable medium, the program configured to cause at least one processor to:

observe a real-world network over time and construct a matrix for two sets of nodes based on the observation of the real-world network over time;

output the predictions, the anomaly scores, the model parameters themselves, or any combination thereof, wherein

15. The computer program of claim 14, wherein

the extended PMF model employs fast inference schemes using variational inference and Gibbs sampling either individually or in combination, and

for the Gibbs sampling, a latent vector conditional on counts N_ij has a multinomial distribution, so that the latent vector and the counts N_ij are jointly resampled in a blocked Gibbs sampler step, and complete conditionals for the latent features for each set of nodes are Gamma.

16. The computer program of claim 14, wherein the node-specific covariates for both sets of nodes in the PMF model, sparsity on latent feature matrices for the two sets of nodes, and seasonal effects are accounted for simultaneously in the extended PMF model to produce more accurate inference and link prediction.

17. The computer program of claim 14, wherein the output comprises at least one Internet Protocol (IP) address, at least one Media Access Control (MAC) address, or a combination thereof identifying the computing system initiating the link, the computing system receiving the link, or both.

18. The computer program of claim 14, wherein the program is further configured to cause the at least one processor to:

19. A computer-implemented method, comprising:

fitting, by a computing system, an extended Poisson matrix factorization (PMF) model to a matrix for two sets of nodes based on the observation of the real-world network over time to learn posterior estimates for model parameters for predictive analytical purposes, the extended PMF model incorporating node-specific covariates for the two sets of nodes, modeling sparsity on latent feature matrices for the two sets of nodes, accounting for seasonal effects, or any combination thereof, to predict links;

using the learned posterior estimates for the model parameters, by the computing system, to make predictions for future network observations or to determine anomaly scores about links observed after the training period or previously unobserved links; and

outputting the predictions, the anomaly scores, the model parameters themselves, or any combination thereof, by the computing system.

20. The computer-implemented method of claim 19, wherein the node-specific covariates for both sets of nodes in the PMF model, sparsity on latent feature matrices for the two sets of nodes, and seasonal effects are accounted for simultaneously in the extended PMF model to produce more accurate inference and link prediction.