WO2022106438A1

WO2022106438A1 - Predicting the state of a system using elasticities

Info

Publication number: WO2022106438A1
Application number: PCT/EP2021/081916
Authority: WO
Inventors: Devesh Raj
Original assignee: Unilever Ip Holdings B.V.; Unilever Global Ip Limited; Conopco, Inc., D/B/A Unilever
Priority date: 2020-11-18
Filing date: 2021-11-17
Publication date: 2022-05-27

Abstract

A method of predicting a behavior of a state within a system by building a probabilistic hierarchical model comprising a Bayesian network. The method comprises determining (701) an input specification comprising a plurality of variables and sign indicators associated with a plurality of the variables. The method comprises determining a set of observed values for the plurality of variables. The method comprises determining (702) a set of candidate arcs, each candidate arc indicating a possible link from a first one of the variables to a second one of the variables to be included in a Bayesian network. The method comprises repeatedly selecting (303) a subset of the set of candidate arcs, and calculating (804), for each of the plurality of variables to which the sign indicator is assigned, an elasticity with respect to the target variable according to the fitted Bayesian network.

Description

PREDICTING THE STATE OF A SYSTEM USING ELASTICITIES

Field of the invention

The invention relates to predicting a state within a system.

Background of the invention

In development of food products, many ingredients will have to be considered that each have a different effect on the end product. Moreover, the ingredients may have a complex system of interactions, so that it may be difficult to find a composition that provides the ‘best’ result in terms of, for example, flavor, shelf life, or even overall consumer satisfaction. Such a complex system can be modeled by a number of input variables, or control variables, at least one output variable, representing the behavior and/or output of the system, internal variables representing states inside the system, and external variables representing states of external circumstances, such as temperature, that influence the system but cannot be controlled by a system operator.

For efficient control of the system, it is imperative that the relationship between the control variables and the output variables is predictable. Therefore, the system behavior can be modeled, for example by means of a Bayesian network, to simulate the system behavior and predict the output variable. However, creating an accurate Bayesian network model can be a tedious process.

A Bayesian Belief Network, or Bayesian Network (BN) for short, is a probabilistic hierarchical model that may be primarily used to represent a causal dependence (parent-child) structure among a set of system parameters of a system. A BN may be represented through a set of random variables (forming nodes of the BN) and their conditional dependencies (forming directed edges of the BN) via a directed acyclic graph (DAG). The probability of occurrence of each state of a node is known as its “belief’.

Charniak, E. (1991), “Bayesian networks without tears”, Al magazine, 12(4), 50, discloses algorithms for discrete Bayesian networks.

Cheng, J., & Greiner, R. (2001 , June), “Learning Bayesian belief network classifiers: Algorithms and system”, In Conference of the Canadian Society for Computational Studies of Intelligence (pp. 141- 151), Springer Berlin Heidelberg, discloses learning predictive classifiers based on Bayesian belief networks (BN) using datasets that have few or no continuous features, to avoid information loss in discretization. The paper discloses to discretize continuous features, using a discretization utility.

Bayesian networks provide an established way to represent causal relationships using a structure. However, it can cost a lot of resources to generate a Bayesian network and to store a Bayesian network. Moreover, it may be difficult to build a Bayesian network with sufficient predictive qualities.

Summary of the invention

It is an object of the invention to provide improved control of a system in a simpler way by predicting the behavior of a state of the system. To better address this concern, a first aspect provides a method of predicting a state of a system by building a probabilistic hierarchical model comprising a Bayesian network preferably having continuous variables to predict a state of a system, the method comprising the steps of: determining, by a processor, an input specification comprising a plurality of variables, the plurality of variables including a plurality of controllable variables, a plurality of measured variables, and at least one target variable representing the state to be predicted, the input specification further comprising sign indicators associated with a plurality of the variables, wherein the sign indicator assigned to a variable is positive to indicate that the variable has a positive correlation with the target variable, or the sign indicator is negative to indicate that the variable has a negative correlation with the target variable; determining, by the processor, a set of observed values for the plurality of variables; determining, by the processor, a set of candidate arcs, each candidate arc indicating a possible link from a first one of the variables to a second one of the variables to be included in a Bayesian network; repeatedly selecting, by the processor, a subset of the set of candidate arcs to form the arcs of a Bayesian network, fitting parameters of the Bayesian network with the selected subset of the set of candidate arcs, based on the set of observed values for the plurality of variables, and calculating, for each of the plurality of variables to which the sign indicator is assigned, an elasticity with respect to the target variable according to the fitted Bayesian network; and selecting, by the processor, one of the subsets as a final subset of the set of candidate arcs based on the sign indicators and the elasticity of each of the plurality of variables to which the sign indicator is assigned.

The elasticities and sign indicators provide a highly adequate check that helps to determine which arcs should be included in the network and which not.

Selecting a subset of the candidate arcs may comprise obtaining an intermediate subset of the candidate arcs forming an intermediate Bayesian network, by: iteratively fitting Bayesian networks with different subsets of the candidate arcs to the set of observed values; calculating a fit score of each fitted Bayesian network; comparing each sign indicator to the elasticity of each corresponding variable in each fitted Bayesian network; and selecting the intermediate subset of the candidate arcs, based on the fit score, from among the different subsets of the candidate arcs, wherein the intermediate subset is selected subject to the condition that each sign indicator matches the elasticity of each corresponding variable in the intermediate Bayesian network with the intermediate subset of the candidate arcs. The intermediate subset of the candidate arcs provides at least a promising springboard for further searching for relevant arcs, without introducing counterintuitive connections with elasticities that do not correspond to the sign indicators.

The selected intermediate subset may further satisfy the condition that the elasticity of each variable in the intermediate Bayesian network is greater than a predetermined minimum bound and smaller than a predetermined maximum bound. For example, the condition can be that the elasticity is between -2 and 2. It is noted that elasticity may be a dimensionless or unitless quantity.

The method may further comprise obtaining a plurality of first stage networks with associated fit scores by, for each first arc that is in the set of candidate arcs but not in the intermediate subset of arcs, fitting a first stage network including only the intermediate subset of arcs and the first arc and calculating the fit score associated with the first stage model; wherein the selecting the final subset of the candidate arcs is performed based on the fit scores associated with the first stage networks. This provides checks to see if the model accuracy can be improved by adding one arc to the intermediate subset of arcs, thereby contributing to an overall improvement of the finally determined Bayesian model.

The method may further comprise setting a second subset initially equal to the intermediate subset; and obtaining a plurality of second stage networks with associated fit scores by, iteratively for each second arc that is in the set of candidate arcs but not in the second subset of arcs, fitting a second stage network including only the second subset of arcs and the second arc and calculating the fit score associated with the second stage network, and if the second stage network satisfies a certain criterion, adding the second arc to the second subset of arcs before proceeding with fitting the next second stage network, and if the second stage network does not satisfy the certain criterion, removing the second arc from the set of candidate arcs before proceeding with fitting the next second stage network, wherein the selecting the final subset of the candidate arcs is performed based on the fit scores associated with the second stage models.

The set of candidate arcs may include a plurality of arcs from the controllable variables to the measured variables and a plurality of arcs from the measured variables to the target variable. Further, the set of candidate arcs may satisfy the condition that it does not include any arcs from any variable to any of the controllable variables nor from the target variable to any other variable in the plurality of variables. This way, targeted modeling can be efficiently performed by enforcing the Bayesian network to provide information about the output variable based on the controllable variables.

The plurality of variables may further include at least one external variable, wherein the set of candidate arcs includes at least one arc from the at least one external variable to at least one of the measured variables or the target variable, and wherein the set of candidate arcs does not include any arcs from any variable to an external variable. This is an efficient means to incorporate external circumstances into the model, thereby improving the overall result.

The method may further comprise identifying a target value for the target variable; determining values for the controllable variables based on the Bayesian network with the final subset of the set of candidate arcs and the target value for the target variable; and controlling the controllable variables in the system to set the controllable variables to their determined values. This way, the target variable in the system may be controlled by manipulating the controllable variables. According to another aspect, system is provided for predicting a state by building a probabilistic hierarchical model comprising a Bayesian network, the system comprising: a memory configured to store observed values and a Bayesian network; and a processor configured to perform steps of: determining an input specification comprising a plurality of variables, the plurality of variables including a plurality of controllable variables, a plurality of measured variables, and at least one target variable representing the state to be predicted, the input specification further comprising sign indicators associated with a plurality of the variables, wherein the sign indicator assigned to a variable is positive to indicate that the variable has a positive correlation with the target variable, or the sign indicator is negative to indicate that the variable has a negative correlation with the target variable; determining a set of observed values for the plurality of variables; determining a set of candidate arcs, each candidate arc indicating a possible link from a first one of the variables to a second one of the variables to be included in a Bayesian network; repeatedly selecting a subset of the set of candidate arcs to form the arcs of a Bayesian network, fitting parameters of the Bayesian network with the selected subset of the set of candidate arcs, based on the set of observed values for the plurality of variables, and calculating, for each of the plurality of variables to which the sign indicator is assigned, an elasticity with respect to the target variable according to the fitted Bayesian network; and selecting one of the subsets as a final subset of the set of candidate arcs based on the sign indicators and the elasticity of each of the plurality of variables to which the sign indicator is assigned. According to another aspect, a computer program product, preferably for building a probabilistic hierarchical model comprising a Bayesian network having continuous variables to predict a state of a system, the computer program product is provided, comprising instructions stored on a non-transitory compute readable media, the instructions being configured to cause a computer system to perform the steps of: determining an input specification comprising a plurality of variables, the plurality of variables including a plurality of controllable variables, a plurality of measured variables, and at least one target variable, the input specification further comprising sign indicators associated with a plurality of the variables, wherein the sign indicator assigned to a variable is positive to indicate that the variable has a positive correlation with the target variable, or the sign indicator is negative to indicate that the variable has a negative correlation with the target variable; determining a set of observed values for the plurality of variables; determining a set of candidate arcs, each candidate arc indicating a possible link from a first one of the variables to a second one of the variables to be included in a Bayesian network; repeatedly selecting a subset of the set of candidate arcs to form the arcs of a Bayesian network, fitting parameters of the Bayesian network with the selected subset of the set of candidate arcs, based on the set of observed values for the plurality of variables, and calculating, for each of the plurality of variables to which the sign indicator is assigned, an elasticity with respect to the target variable according to the fitted Bayesian network; and selecting one of the subsets as a final subset of the set of candidate arcs based on the sign indicators and the elasticity of each of the plurality of variables to which the sign indicator is assigned. The person skilled in the art will understand that the features described above may be combined in any way deemed useful. Moreover, modifications and variations described in respect of the method may likewise be applied to the apparatus and to the computer readable media, and vice versa.

Brief description of the drawings

In the following, aspects of the invention will be elucidated by means of examples, with reference to the drawings. The drawings are diagrammatic and may not be drawn to scale. Throughout the drawings, similar items may be marked with the same reference numerals.

Fig. 1 shows an example Bayesian belief network to model crop growth.

Fig. 2 shows a flowchart illustrating aspects of a method of building a Bayesian belief network.

Fig. 3 shows a flowchart illustrating aspects of a method of learning a network structure of a Bayesian belief network.

Fig. 4 shows a flowchart illustrating aspects of a method of learning node parameters of a Bayesian belief network.

Fig. 5 shows a block diagram of an apparatus for building a Bayesian belief network.

Fig. 6 shows a diagram illustrating modifications of the structure of a Bayesian belief network.

Fig. 7 shows a diagram illustrating a method to automate the generation of a Bayesian belief network. Fig. 8 shows a diagram illustrating a method to determine an intermediate subset of arcs.

Fig. 9 shows an example Bayesian belief network to model a cream soup.

Detailed description of the invention

Certain exemplary embodiments will be described in greater detail hereinafter. The matters disclosed in the description, such as detailed construction and elements, are provided to assist in a comprehensive understanding of the exemplary embodiments. Accordingly, it is apparent that the exemplary embodiments can be carried out without those specifically defined matters, also, well- known operations or structures are not described in detail, since they would obscure the description with unnecessary detail.

In many cases, discretization of continuous variables does not provide a satisfactory result when predicting system variables with BNs. Therefore, the present disclosure is provided to help building probabilistic hierarchical models, such as Bayesian networks, that can handle continuous variables directly, without converting continuous system variables into discretized Bayesian network variables. Surprisingly, this makes the Bayesian network more suitable and useful to predict the system’s behavior. For example, elasticities can be calculated, which is not easy to do with discrete Bayesian networks.

The limitations of Bayesian networks to discretely valued nodes may cause exponentially increasing size of probability maps when the number of possible states increases. The inventors have found that for a particular group of applications, in which it is desired to predict the change of a system parameter induced by changing another system parameter, continuous variables are better able to predict this. Unlike the prior art, which has emphasized optimizations for discrete BNs and discretization of any continuous variables, the present inventors have provided a proper solution for BNs with continuous variables. This allows a deeper analysis of the causal relationships between the variables, such as elasticities.

Surprisingly, these elasticities may even be used to improve the structure of the network. For example, by setting a sign indicator to indicate a positive or a negative relationship between a variable and the target variable, the network structure may be improved by adding or deleting arcs until the sign of the elasticity of the variable matches the sign indicator. This way, a high-quality network structure of a Bayesian network may be found more efficiently. It is preferred that in the method of present invention, the sign of the elasticity of each of the plurality of variables of the final subset of the set of candidate arcs to which the sign indicator is assigned matches the respective sign indicator.

In the following, techniques are disclosed relating to methods and systems for building a Bayesian belief network. While the techniques disclosed herein have applicability in any Bayesian belief network, an exemplary Bayesian belief network that models the relationship between genetic and environmental parameters and crop growth is presented to illustrate the embodiments. In biotechnology, a significant amount of time and effort is put into optimizing environmental parameters and genetically determined parameters of plants. However, the relationship between these parameters, profiles, and the final crop growth is not straightforward. However, the BN may help to identify important relationships.

Gaussian BN may be used for modeling systems with drivers that are inherently continuous in nature. In certain embodiments, a continuous BN may be characterized in that each driver (node) follows a Gaussian (Normal) distribution. If this fundamental assumption holds, certain analysis and modeling features may be employed that make use of this assumption.

In certain embodiments, the causal relationship of each node with its parent node(s) is represented through a Gaussian conditional probability distribution.

In certain embodiments, the joint probability distribution of a set of drivers (nodes) may be obtained through the Chain Rule, using Bayes’ Theorem.

In certain embodiments, information regarding any driver (node) may be accessible solely from its Markov Blanket. The Markov Blanket of a driver (node) may be regarded to be the node including its parent nodes and child nodes. In certain embodiments, the modeling of a system by a BN may follow phases of data input and preprocessing, BN creation, and post-BN creation.

In certain embodiments, the data input may be obtained by monitoring system variables and obtaining samples thereof.

In certain embodiments, the data may be pre-processed. For example, the raw data may be processed in an exploratory data analysis (EDA).

One or more of the following steps may be performed to prepare the data before model creation starts. First, missing values may be treated to improve data consistency. For example, a suitable replacement/imputation method may be performed by replacing a missing value with the most frequently appearing value, with an average value, or a value of a moving average in case of timedependent data.

Second, outliers may be treated using any suitable outlier treatment method known in the art per se. For example, outliers may be treated through interquartile range (IQR) method. In this method, values smaller than a value (/?! - 1.5 x IQR may be replaced with the value Q_±. Herein, Q_± denotes the 25^th percentile of the variable within the dataset. Moreover, values that are larger than a value Q₃ + 1.5 x IQR) may be replaced with the value Q₃. Herein, Q₃ denotes the 75^th percentile of the variable within the dataset. In the above equations, the value IQR may be calculated as IQR = Q₃ - Q . In other words, the IQR value denotes the 75^th percentile minus the 25^th percentile of the variable within the dataset.

In order to stabilize the data as well as use the most adequate forms of the drivers, any one or more of a number of data transformations can be applied. For example, one or more of the following data transformations may be applied, depending on the nature of the data. For example, a smoothing operation or other noise reduction formula may be applied. Alternatively, moving average transformation may be applied. Moreover, a Natural Log transformation (or any other log transform) may be applied when this is suitable for the type of variable at hand, in order to stabilize the variance of the data.

Certain embodiments comprise a step of establishing the nodes of the BN. The nodes may alternatively be referred to as variables of the system or variables of the BN. The nodes correspond to the drivers and/or target of the network. These nodes in general correspond to the data points; they may represent states that can be determined by observation. The purpose of the data transformations of the pre-processing step is to bring the data in conformity with the chosen nodes of the BN, so that sufficient data is available for the nodes, and the data of each node has an advantageous distribution, such as a Gaussian (normal) distribution.

Certain embodiments comprise a step before the actual learning of the BN structure. An example of that is a step of whitelisting and/or blacklisting of arcs. In a BN, any two nodes may be connected by a unidirectional arc. Each arc represents a relationship between the nodes, so that the probability distribution of the node to which the arc points is conditional on the value of the node from which the arc points away. When learning the network structure, a goal is to learn which arcs are to be included in the network. When there is no arc or path connecting two nodes, these nodes are essentially considered to be independent of each other. To facilitate the modeling process, arcs can be whitelisted (WL) or blacklisted (BL). In certain embodiments, certain arcs are whitelisted before the modeling procedure starts. In addition, or alternatively, certain arcs may be blacklisted before the modeling procedure starts.

Whitelisted arcs, if specified, will definitely be present in the network, whereas blacklisted arcs, if specified, will definitely be absent from the network. Arcs whitelisted in one direction only (i.e. A -> B is whitelisted but B A is not) may have the respective reverse arc automatically blacklisted. So, if A -> B is whitelisted but B A is not whitelisted, then B A may be automatically blacklisted. Arcs whitelisted in both directions (i.e. both A B and B A are whitelisted) are present in the graph, but their direction is set by the learning algorithm.

In certain embodiments, the BN contains a target node that represents a value that is considered to be the result of the values of the other nodes. Alternatively, the target node represents a value for which a prediction is desired. The whitelisting/blacklisting step may comprise blacklisting all possible arcs pointing away from the target node. This may improve the learning process and lead to better predictions. It may force the network structure to allow to predict the target node based on observations of values for the remaining nodes.

The next step may be learning a network structure. In this step, the arcs of the BN are determined. To that end, a BN network structure learning algorithm may be selected.

For example, a constraint-based algorithm may be used. Such an algorithm learns the network structure by analyzing the probabilistic relations entailed by the Markov property of Bayesian networks with conditional independence tests. Such constraint-based algorithms may be based on the Inductive Causation (IC) algorithm.

Alternatively, a score-based algorithm may be used to learn the BN network structure. Such an algorithm assigns a score to each candidate Bayesian network and tries to maximize it with a heuristic search algorithm, such as Hill-climbing, Tabu search, Simulated annealing, or one of various known genetic algorithms.

Yet alternatively, a hybrid algorithm may be used, in which both constraint-based and score-based learning algorithms are combined to obtain an optimized network structure.

In certain embodiments, Hill-Climbing (HC) methodology may be advantageously used as a network learning algorithm. This may be particularly advantageous in the case where (most of) the nodes represent continuous random variables. It may be even more advantageous in case the continuous random variables have a Gaussian distribution.

An example of the Hill-Climbing method is given below. The method may start with an initial graph structure G. The initial structure may be an empty structure. Alternatively, the initial structure may be the structure of an acyclic graph with randomly selected arcs (satisfying the whitelist and blacklist). Also, a score of the initial graph structure G may be computed. For example, the score is an indication how well the graph structure G can fit the available data. Examples of scoring methods will be disclosed hereinafter.

Next, a number of iterations may be performed. The following steps may be included in each iteration. First, a transformation explained above is performed on a randomly selected arc (adding an arc, deleting an arc, or reversing an arc). In certain embodiments, more than one transformation may be performed. Although the arc and the operation may be selected randomly, only operations that respect the conditions on the graph structure are performed. These conditions may include the condition that the graph remains acyclic, and that any whitelisted arcs are included in the network structure and any blacklisted arcs are excluded from the network structure. The transformation of the first step results in an updated graph structure G*. Second, a score of the updated graph structure G* may be computed. For example, the score is an indication how well the graph structure G* can fit the available data. Examples of scoring methods will be disclosed hereinafter. Third, if the score of the updated graph structure G* is greater than the score of the previously determined greatest score of graph G, then graph G is set to be equal to graph G* and the greatest determined score of graph G is set to be the score of graph G*.

The iteration is terminated when a suitable stopping criterion is satisfied. For example, when there is no possible transformation that would improve the greatest score, the process may stop. Alternatively, when N successive iterations do not improve the greatest score, the process may stop, where N is any positive integer value.

The above process of learning the network structure may be regarded to be an iterative process with each iteration modifying exactly one arc (through: add/delete/reverse) that increases the overall network score. In alternative implementations, more than one arc might by modified in some (or all) iterations.

Certain parameters for and variations of the above self-learned network building method may be employed. For example, a first parameter may specify the maximum number of iterations, the admissible range of the first parameter being [1 , Infinity]-, In case of Infinity, which may be the preferred value, no restriction is put on the number of iterations and the Hill-Climbing algorithm will continue until maximum network score is achieved.

In certain embodiments, the graph structure G may be reset one or more times during the Hill- Climbing. So, the transformation of the first step of some of the iterations may be replaced by a complete reset of the arcs to a new random structure that satisfies the applicable constraints (such as acyclic graph and the blacklist and whitelist). In certain embodiments, a configurable parameter indicates the number of resets that is performed. For example, the number of resets is a non-negative integer, which may preferably be in the range from 0 to 50. A suitable value may be 5. Another configurable parameter may be the number of iterations to insert/remove/reverse an arc after every random reset. This parameter may preferably be, for example, in the range from 0 to 300. A suitable value may be 100. Alternatively, the reset is performed after the score of the graph has not increased for a predetermined number of iterations. That predetermined number of iterations, which may be a positive integer, forms an alternative parameter.

In certain embodiments, another configurable parameter may specify the maximum number of parents for each node. Its admissible range is [1 , (n-1)], the default value being (n-1), where n is the total number of nodes. The parents of a particular node are the nodes from which an arc points to the particular node.

After determining a network structure, the parameters of each node may be determined. The parameters of each node may include the parameters that determine the (conditional) random distribution of each node. For example, the conditional random distribution of a node may have the form:

(Equation 1) wherein

• P_x is the value of a particular node in the network, this particular node being denoted by Node_x,

• N(ji, a²) is the Gaussian normal distribution with mean /z and standard deviation a,

• mean_x is the mean of Node_x (not taking into account the parent nodes),

• stdev_x is the standard deviation of Node_x,

• n is the number of parent nodes of Node_x, the parent nodes being denoted as Node_{i t} for i = 1,2, ...,n.

• P_{i t} for i = 1,2, ...,n, is the value of an immediate parent of Node_x, and

• di, for i = 1,2, ..., n, is the direct effect of P_t on Node_x.

For example, the parameters of a node Node_x may be considered to be the mean mean_x, the number of parent nodes n, the parent nodes Nodei (for all i from 1 to n), the direct effect d_t of each parent node (for all i from 1 to n), and the standard deviation stdev_x.

The number of parent nodes n and the parent nodes Node^ (for all i from 1 to n) themselves may be regarded to define the structure of the network, and they may be determined using the BN network structure learning algorithm, for example the Hill-Climbing algorithm.

The remaining parameters, including the mean mean_x, the direct effect d_L of each parent node (for all i from 1 to n), and the standard deviation stdev_x, may be determined for any given structure of the network. For example, these parameters may be estimated in every iteration after the transformation has been applied during the Hill-Climbing procedure, to assess the score of the network.

These remaining parameters may be fit using, for example, maximum likelihood estimation (MLE) or Bayesian parameter estimation. The inventors found that for continuous nodes, the maximum likelihood estimation method may be advantageously used. For discrete data, Bayesian parameter estimation may be more suitable. The maximum likelihood estimation method is known in the art per se. The skilled person is able to apply the maximum likelihood estimation method in view of the present disclosure.

After building the continuous BN, a number of further steps may be performed, which may include steps of model diagnosis, model outputs computation, and insights generation. In the following, a few exemplary processing tasks are described that can be performed after the BN has been created. It is possible to perform some of these steps during the iterations of the Hill-Climbing method, for example to determine the network score.

The following calculations may be performed to assess the network performance. These

The network score may be determined using a suitable expression. A suitable example of a network score is based on the Bayesian Information Criterion (BIC). However, this is not a limitation.

The network score is a goodness-of-fit statistic that measures how well the network represents the dependence structure of the data. Depending on the implementation, the network score can be both positive or negative, depending on the input data and network structure. For example, while comparing multiple differently structured networks with the same input data, the larger the score, the better is the particular network. In other implementations, it may be the other way round: the smaller the score, the better is the particular network.

In certain embodiments, the network score may be computed through the Bayesian Information criterion (BIC). Network score calculated through the Bayesian Information criterion (BIC) is a goodness-of-fit statistic that measures how well the network represents the dependence structure of the data. It helps to compare multiple networks built on the same input data, and judge which network is better. Before calculating the network score, the parameter estimation may be performed using any suitable method, such as the MLE method explained above.

In general, using the BIC, the network score NetScore_BIC may be determined as follows:

wherein

• x is the collection of data that is available for fitting the network;

• 6 is the network (the collection of arcs and the parameters of the network);

• L(0|%) is the likelihood of the network d, given the collection of data x; • d is the number of arcs in the network (in alternative implementations, d is the total number of parameters of the network); and

• n is the number of observations in the collection of data x.

It is noted that Network Score through BIC may be regarded a penalization-based score that penalizes an increase of the number of parameters in the network.

Other performance estimators may be used to assess the quality of the BN. These performance estimators may be used as an alternative for the BIC in certain embodiments. However, these performance estimators may also be used as a general indication of how much confidence one may have in the model.

For example, the Mean Absolute Percentage Error (MAPE) measures the average percentage error between the actual values of a driver and the obtained fitted values through the network. The less the MAPE, the better is the fit. It may be defined as:

Herein, y_t denotes the actually observed values for a node t, and e_t denotes the error between the predicted value by the network and the actually observed value y_t.

The importance of each arc may be assessed. This information may be used in the iterative process to find the best network structure. Moreover, it may be used to determine which parameters (nodes) have most influence on a target node.

For example, arc strength may be computed through BIC. The arc strength indicates the absolute increase in the overall network score through the removal of this arc; the arc strength can possess either positive or negative values. So, from the definition, it is evident that the smaller the numeric value of the arc strength (considering the magnitude as well as the sign), the more significant the arc is.

Arc Significance is a unique integral number assigned to an arc from the range [1 , e], wherein e is the total number of arcs in the network. The values 1 and e (>1) indicate the most and least significant arcs, respectively, according to the above-mentioned arc strength. Thus, the arc significance numbers the arcs in the network in order of their significance, in decreasing order.

Ml for Arcs: The MI(A, B) for an arc A— >B measures the amount of information obtained about the driver B through knowing the driver A. In a way, it quantifies the redundancy/importance of the relationship between the variables A and B. z-score for Arcs: z-score for an arc A— >B tests the hypothesis that whether the MI(A, B) value is zero or not. A high z-score strengthens the reliability of the significance of arc A— >B.

Pearson’s Correlation for Arcs: For an arc A— >B it computes the amount of linear correlation between driver A and driver B. The key nodes for the target may be identified through an Importance Score of each node with respect to the target. The computation of such an importance score is described below:

Step 1. Identify all paths which originate from the node A and go to the target node X.

Step 2. Take the weighted average of all the arc strengths of the arcs occurring in a path, the weights being the inverse of, or inversely proportional to, the arcs’ Significance score. This weighted average is termed as the path strength.

Step 3. Compute the Importance score of the node A as the simple average of the path strengths of all the paths from A to the target node X.

Step 4. Rank each node by assigning a Significance Score in the range [1 , (n-1)], where n is the total number of nodes in the network. The values 1 and (n-1) indicate the most and least significant node (driver), respectively, n is a positive integer and represents the number of nodes in the network, including the target node.

The direct effect of a node A on a node B may be computed from the node coefficient

, the actual values of A and the fitted values of B through the following formula:

Direct Effect

Wherein: is the coefficient of A in the conditional Gaussian distribution equation of B, cii , fitted(bi) are respectively the actual values of A and fitted values of B, d is the dimension/length of the nodes A and B, which is the number of observations for nodes A and B.

The direct effect quantifies the overall effect of the node A on the node B, as obtained from the network structure and input data. It is expected to be a non-negative quantity. The direct contribution of A on B is the percentage direct effect of A on B versus the summed direct effects of all nodes on B. If any direct effect is found to be negative, the respective contributions are shown as “Not Applicable (NA)”. The presence of negative effects can be overcome through rectifying some of the arcs and/or appropriately preprocessing the input data.

Indirect effects and indirect contributions may also be determined. For example, if a node A has one or more indirect paths to the node B then the indirect effect of A on B may be computed through the following two steps: a) Multiplying the direct effect coefficients of the arcs in a path, and b) Adding the values obtained in the previous step, across all paths.

As before, the indirect contribution of A on B may be regarded to be the percentage indirect effect of A on B. Evidently, if there is no indirect path from A to B, then the indirect effect as well as indirect contribution of A on B are zero. In certain embodiments, the predicted values may be computed by plugging in the new values for the parents of the node in the local probability distribution of the node as obtained from the fit object.

In certain other embodiments, the predicted values are computed by averaging likelihood weighting (Bayes-law) simulations performed using all the available nodes as evidence. The number of random samples which are averaged for each new observation may be controllable. In case the target variable is continuous, the prediction of the target value (target node) may be the expected value of the conditional distribution of the target.

BN may be applied, for example, for finding the dependence structure among a plurality of drivers, performing inference from a built BN structure, or obtaining joint probability of a set of drivers, considered together. Discrete BN, with discrete-valued nodes, is the commonly known structure for inference. It is relatively easy to implement, compared to the continuous counter-part. Discrete BN may be used to generate conditional probability tables (CPT), which may be sufficient for inferential activities. However, discrete BN has the following inherent limitations. For n nodes, the CPT is of size 2ⁿ, which may be unmanageable even for a moderate number of nodes. Many real-world features are continuous - which can’t be handled directly through discrete BN. Apart from inference, discrete BN may not be suitable for other purposes. The continuous BN, using the techniques disclosed herein, may be advantageously used for finding elasticities, performing simulations, performing forecasting, etc..

As mentioned above, it is known to use discrete BN as a technique to perform inferencing and finding joint/marginal probabilities.

In contrast, the potential of continuous BN, as disclosed herein, is to perform a multitude of crucial tasks through it and built an end-to-end solution, entirely driven by the continuous BN framework. This is unique by itself, considering that perhaps no other industry has used continuous BN successfully. For example, continuous BN, when applied using the techniques described herein, may provide a better understanding of causal effects among the nodes. Moreover, the techniques enable computing an elasticity of each node with respect to the target node. This can provide a way to control the target node by manipulating the values of the other nodes. Based on the importance scores, it becomes possible to find the key nodes, that have the greatest influence on the target node.

For example, the coefficient representing the direct effect of a node A on another node B provides direct information about the elasticities, thereby making the tool highly suitable for finding elasticities. For example, the mean of a normal distribution of a node may depend linearly on the value of a parent node. The elasticity may be calculated, for example, for an arc A— >B, as

wherein A is the value of node A, B is the value of node B, which depends on the value of A, and is the direct effect of A on B. Alternatively, the direct effect /? , by itself, may be regarded as a measure of the sensitivity of A with respect to B. Elasticities of nodes that are indirectly connected to the target node may be calculated, for example, by combining the direct effects of the nodes on the path from a node to the target node, for example by multiplication.

An out-of-range simulation may be performed through changing the nodes’ values. Moreover, hybrid modeling becomes possible through considering a plurality of different variables. It becomes possible to provide forecasts for the target node, based on forecasts for the other nodes. Further, it becomes possible to determine optimized values for the nodes, based on certain assumptions on the nodes.

The BN may be built and used to model biological systems, for example as an aid for diagnosis, by setting nodes corresponding to stimuli, symptoms, and/or internal properties of the body. The BN may be built to model components of an apparatus, machine, or factory, ecological systems, meteorological systems, demographic systems, and can be used for purposes of, for example, text mining, feature recognition and extraction.

A method may be provided for predicting a variable in a system comprising a plurality of nodes, each node representing a continuous probability distribution function of a certain property, the method comprising: collecting a set of observed values of certain properties; determining a blacklist of arcs, the arcs in the blacklist identifying pairs of nodes that will not have a connecting arc in the model, or a whitelist of arcs, the arcs in the whitelist identifying pairs of nodes that certainly will have a connecting arc in the model; learning a structure of the network by determining a plurality of directed arcs representing that the probability distribution function of a certain first node is conditional on the property of a certain second node, taking into account the blacklist or the whitelist and the set of observed values; and learning probability distribution function parameters of the nodes of the network, based on the structure of the network and the set of observed values.

An apparatus may be provided for building a Bayesian Belief Network that models a system, wherein the apparatus is configured to: identify a plurality of nodes of a Bayesian belief network, each node representing a random variable having a continuous random distribution; select a target node among the plurality of nodes, the target node representing a state variable to be predicted; blacklist arcs pointing away from the target node to any other node; learn a network structure by identifying arcs between pairs of nodes that explain a system behavior, excluding the blacklisted arcs; learn conditional probability distributions of the continuous random variables of the nodes, wherein the probability distribution of the continuous random variable of a first node is conditional on at least one second node if and only if an arc points to the first node; and predict the value of the random variable of the target node based on a given value of at least one other node of the network.

Some possible aspects and advantages of the techniques disclosed herein may be the following. Constraints on the model structure are imposed using domain knowledge, which forces the network structure to converge to the target variable. This feature may prevent that, if the Bayesian network is allowed to proceed unhindered, it will not converge to the target variable as desired. Further, we add a regression model on top of the hierarchical probabilistic graphical model, which enables to extract the elasticities of the node variables with regard to their effect on the target variable. This amalgamation of the regression model with the Bayesian network provides the possibility of improved control of the target. Finally, the entire process by which this system (Probabilistic Graphical Model and Regression framework) is leveraged to extract the elasticities for the nodes, predicting the target variable, provides an improved information that can be used to control or predict the target variable.

Historically, the modelling process of Bayesian network models was completely manual right from receiving the raw data to getting the final Bayesian model. In a nutshell, it was a time consuming process and each model used to take at least 3 weeks to complete depending on the number of variables.

The following are some of the steps involved in arriving at a final Bayesian model: data interpretation & data quality check of raw data, exploratory data analysis, data treatment & imputation, variable selection, creating train-test datasets, modelling phase, evaluating model diagnostics & ad-hoc requests. The modelling phase itself is a complex and long process involving several steps.

After the Bayesian network is completed, it can be used to better control the system. Specifically, the Bayesian network can be used to influence the target variable by controlling the values of the controllable variables. For example, the method may further comprise identifying a target value for the target variable. Then, the method may comprise determining values for the controllable variables based on the Bayesian network with the final subset of the set of candidate arcs. For example, values of the controllable variables may be chosen at random or using an optimization algorithm and the corresponding value of the target variable may be computed, and the values for the controllable variables resulting in the target variable closest to the target value may be selected. Alternatively, the method may start from initial values for the variables in the system, and adjust the values for the controllable variables using the calculated elasticities to bring the target variable closer to the target value. Next, the method may comprise controlling the controllable variables in the system to set the controllable variables to their determined values. This way, it is likely that the target variable will move towards its target value.

For example, a computer-implemented method of generating a Bayesian network to predict a system’s behavior may comprise the steps of identifying a plurality of variables representing a system, the plurality of variables including a plurality of controllable variables of the system, a plurality of measured variables of the system, and at least one target variable of the system; assigning sign indicators to a plurality of the variables, wherein the sign indicator assigned to a variable is positive if the variable has a positive correlation with the target variable, and the sign indicator is negative if the variable has a negative correlation with the target variable; determining a set of candidate arcs between pairs of the variables, wherein an arc from a first variable to a second variable indicates that a value of the first variable influences a value of the second variable; selecting a subset of the candidate arcs forming a Bayesian network representing the system; calculating a fit score of the Bayesian network representing the system, based on observation data for the plurality of variables; and calculating, for each of the plurality of variables to which the sign indicator is assigned, an elasticity with respect to the target variable according to the Bayesian network, wherein the subset of the candidate arcs is selected based on the sign indicators, the elasticity of each of the plurality of variables to which the sign indicator is assigned, and the fit score.

The methods disclosed herein may be implemented, for example, in software. For example, the methods may be implemented as an R library. Advantages of the methods are a reduction in model execution time. Models can be computed by the computer overnight there by saving runtime during working hours. The result may be a highly accurate model. The model may be even more accurate than a manually fitted model, even though the method needs only a minimum of m manual intervention. In general, only the master input needs to be created, which is a one-time activity.

Although Bayesian modelling is being widely used in the industry, a comprehensive automated packaged framework is not available in literature or in production. The automated framework presented herein can even outperform human modelers.

Some or all aspects of the invention may be suitable for being implemented in form of software, in particular a computer program product. The computer program product may comprise a computer program stored on a non-transitory computer-readable media. Also, the computer program may be represented by a signal, such as an optic signal or an electro-magnetic signal, carried by a transmission medium such as an optic fiber cable or the air. The computer program may partly or entirely have the form of source code, object code, or pseudo code, suitable for being executed by a computer system. For example, the code may be executable by one or more processors.

The examples and embodiments described herein serve to illustrate rather than limit the invention. The person skilled in the art will be able to design alternative embodiments without departing from the spirit and scope of the present disclosure, as defined by the appended claims and their equivalents. Reference signs placed in parentheses in the claims shall not be interpreted to limit the scope of the claims. Items described as separate entities in the claims or the description may be implemented as a single hardware or software item combining the features of the items described.

Detailed description of drawings

Fig. 1 illustrates a simplified Bayesian belief network, provided for the purpose of illustration. The network comprises a node E denoting environmental potential and a node G representing genetic potential. In more complex networks the environmental potential and the genetic potential may be dependent on a great number of further nodes that have not been illustrated in Fig. 1 for ease of explanation. For example, nodes may be included representing specific environmental features that can influence the environmental potential, such as temperature, humidity, and amount of rain. Moreover, nodes may be included representing specific genetic features that can influence genetic potential, such as frequency of occurrence of certain genes in the population of crop. Some of these nodes may be controllable, such as temperature and humidity in an indoor environment. The environmental potential E and genetic potential G may influence the condition of the vegetative organs V, in particular the reproduction organs of the plants. The condition of the vegetative organs V may influence the number of seeds N generated per plant as well as the mean weight W of the seeds generated. The number of seeds N and the seeds mean weight W may determine the crop growth C in terms of total mass of crop. More generally, a number of drivers, such as environmental potential E, genetic potential G, vegetative organs V, number N and mean weight W of seeds, may influence the crop growth C. The drivers may be represented by nodes in a Bayesian belief network. The relationships between these drivers may be found by using the techniques disclosed herein, so that the crop growth C may be predicted using given values for (some of) the drivers. Also, the most important drivers may be identified. For example, some drivers may be controllable to influence a particular target quantity. In certain cases environmental circumstances can be adapted or genetic potential can be changed by genetic treatment or cross-fertilization. The Bayesian belief network can predict the changes in crop growth C caused by such changes.

As shown in the illustrative diagram of Fig. 1 , the Gaussian BN follows a hierarchical regression structure, defined by the nodes and coefficients (direct effects) in the conditional distribution of each node. As depicted above, each node that has one or more parent nodes may have a conditional Gaussian distribution that may be obtained through running local linear regression among the node and its immediate parents, the node being the target of the local linear regression. A possible general structure of the regression equation of a node is as given in Equation 1 :

(Equation 1 reproduced) The definitions of the variables have been provided hereinabove. It is noted that, when doing the regression, the standard deviation stdev_x may be calculated as the standard deviation of residuals, which are the difference between actual and fitted values of P_x.

The values of mean_x and d_t may be determined by performing linear regression as a form of maximum likelihood estimation. For example, the linear regression may be performed for each node separately, starting with a node that only has parent nodes that do not have parent nodes themselves (such as nodes A and B in Fig. 1). Every time, the regression may be performed for a node that only has parent nodes that either do not have parent nodes themselves or for which the regression analysis has already been done. This may be repeated until all the nodes have been processed. Prediction in the BN may be performed using the hierarchical structure in a top-to-bottom manner by predicting the children at each level from its immediate parents and then propagating the predictions downwards to the next level. For example, in the network of Fig. 1 , the prediction of the target node: Crop (C) may be performed as follows:

1 . Start with root nodes: E, G and use them to predict all their children, i.e. V in this case. The root nodes are the nodes that do not have parent nodes.

2. Go to the next level and predict N, l/IZ through their immediate parents, i.e. V

3. Finally predict the target, i.e. C through the predicted values of N, l/IZ

Prediction of a node at each level may be performed through its Gaussian distribution equation, involving immediate parents and direct effects.

During prediction, if the values of all immediate parents of the target are already provided, then the network may directly use those values to predict the target and will, in certain embodiments, ignore other values.

For example - if the values of N and l/IZ are provided together with values of other nodes such as E, G, and V, then the target node C may be predicted using only the given values of N and W, while ignoring the other nodes’ values.

Generally, the Bayesian network contains a model of a system to derive information about that system.

A Bayesian network can provide information about sounds, images, electromagnetic signals, chemical compounds, biological systems, or economic data, for example.

For example, the network score may be calculated as follows for the example crop growth network shown in Fig. 1 :

‘parameters’ denotes the parameters of the BN;

LCparametersIF, G, V, N, W, C) denotes the likelihood that the parameters are correct, given the available data for E, G, V, N, W, C d denotes the number of observations in the dataset; d_node represents the degree of the node (i.e. total number of arcs: incoming + outgoing) and n is the total number of nodes in the network, and

E, G, V, N, W, and C are the nodes of the network, as illustrated in Fig. 1. Fig. 2 shows a flowchart illustrating a method of building a Bayesian belief network. The method starts in step 201 with providing a dataset. For example, the dataset may comprise example states of a system (e.g. a system of interacting food ingredients, a biological system, mechanical system, e.g. components of a car or factory, or another kind of system). In particular, the states may comprise a combination of parameter values representing certain properties. Those parameter values may be continuous in nature. The dataset may contain observed values representing states of the real world system. The observed values may be measured by a detector, such as a sensor. For example, a temperature may be sensed by a thermometer. The data generated by the thermometer may be transmitted to a computer system that stores the observed values in a dataset. The observed values may alternatively be entered into the computer.

The data may be preprocessed in step 202, for example to remove outliers, handle missing values, and the like. This is elaborated elsewhere in this disclosure. In step 203, a blacklist, and optionally a whitelist may be created. The blacklist contains arcs that cannot occur in the Bayesian belief network and that will not be added by the subsequent learning procedure 204. The whitelist contains arcs that are included in the Bayesian belief network, and that will not be removed by the subsequent learning procedure 204. This can help to incorporate a priori knowledge in the network structure. Moreover, the blacklist can contain all the arcs pointing away from the target node. In the example of Fig. 1 , the target node is the crop growth G. The target node is the node for which we would like to make a prediction based on the available data about the values of the other nodes. After determining the blacklist/whitelist in step 203, the network structure is learned in step 204. This may be performed using an iterative process, such as Hill-Climbing, as will be elucidated hereinafter with reference to Fig. 3. After the network structure has been learned, the learning node parameters are set in step 205. This step may involve learning the conditional probability distributions for each node. This may be performed, for example, using a linear regression technique that is disclosed elsewhere in this description.

Fig. 3 illustrates an example implementation of step 204 of learning the network structure. For example, the nodes of the network may be a given, and the connections between the nodes (the arcs) may be determined as part of this procedure.

The process starts in step 301 by determining an initial network structure. For example, the initial network structure is determined randomly, meaning that the absence or presence of a particular node in the network depends on some random value generator. The direction of each arc may also be determined randomly. However, the whitelisted arcs are always included, and the blacklisted arcs are never included. Further, the arcs are chosen such that the resulting network represents an acyclic graph. An acyclic graph may be obtained, for example, by removing arcs from a random network structure until the remaining arcs for an acyclic graph. In step 302, a score of the initial network structure is determined. First the network parameters are estimated using the initial networks structure as a given. Using these network parameters, a network score may be calculated. For example, the network score may be based on the Bayesian Information Criterion (BIC), which is elaborated elsewhere in this document. The network score of the initial network structure is set as the ‘maximum network structure score’, and stored in memory.

In step 303, the network structure is updated. For example, one or more arcs are added and/or one or more arcs are removed. Also, the direction of an arc may be swapped. The arcs to be added or removed may be selected randomly, for example. Alternatively, the arcs to be added or removed may be selected based on an arc strength or an arc significance, possibly in combination with a random variable. When updating the current network, it is ensured that the updated network structure represents an acyclic graph. Moreover, for example, the updating the current network structure comprises adding an arc that is not on the blacklist to the network structure, deleting an arc that is not on the whitelist from the network structure, or reversing the direction of an arc that is not on the whitelist and of which arc the reversed direction is not on the blacklist.

Next, in step 304, a network score of the updated network structure is determined. To that end, first, optimal network parameters are estimated. That is, using the current network structure as a basis, the (conditional) probability distribution for each node is estimated. For example, for each node, a mean and standard deviation of a normal distribution are estimated. Further, the mean of each node may be a linear function of the values of the parent nodes. The coefficients of this linear function may be estimated for each node that has one or more parent nodes.

After that, the ‘quality’ of the resulting network is estimated in form of a network score. This network score may be determined, for example, by the Bayesian Information Criterion (BIC). This criterion is described in greater detail elsewhere in this description.

In step 305, it is checked whether the network score of the updated network structure is larger than the previously stored maximum network structure score.

If the network score of the updated network structure is larger than the previously stored maximum network structure score, in step 306, the network score of the updated network structure is stored as the new maximum network structure score. Also, the current network structure is set equal to the updated network structure (corresponding to the maximum network structure score).

If the network score of the updated network structure is not larger than the previously stored maximum network structure score, in step 307, the current network is kept. In this case, the modifications of the updated network are discarded.

After that, in step 308, it is determined if more iterations should be made, to explore more different network structures. For example, this may be determined based on a stopping criterion, such as a maximum number of iterations or a minimal required improvement of the network score, or a minimum acceptable value of the network score, or a combination of the above. If it is determined that no further iterations are necessary, the learning process may stop in step 309. If it is determined that further iterations are necessary, in step 310 it may be decided whether the network structure should be reset. For example, this may be decided based on a stopping criterion, such as a maximum number of iterations or a minimal required improvement of the network score or a combination thereof.

If it is determined that the reset of the network structure is not necessary, the process may proceed from step 303 by implementing an incremental update to the current network.

If it is determined in step 310 that a reset of the network structure is to be carried out, the process proceeds to step 311. The current network structure and the corresponding maximum network structure score are stored as a candidate network structure. Next, the network structure is reset to a new initial network structure, which may be determined randomly in a similar way as the initial network structure was set in step 301 . Moreover, the initial network structure score may be determined and set as the ‘maximal network structure score’, similar to step 302. Next, the process proceeds from step 303.

When the process ends in step 309, the candidate network structure having the highest network structure score may be finally selected as the finally learned network structure.

Fig. 4 illustrates an example implementation of step 205 of learning the node parameters. This process of learning the node parameters may also be done in steps 302 and 304 as a preliminary step when computing the network structure score. In step 401 , a linear function of the immediate parent nodes is used to define a parameter of a conditional probability density function of a node. For example, in case of a Gaussian probability distribution, the mean may be a linear function of the immediate parent nodes, e.g.

wherein /z is the mean of the Gaussian distribution, mean_x is the mean of the Gaussian distribution without regard of the parent nodes, d_t denotes the influence of the parent node, and P_t is the value of the parent node i, wherein i=1 to n are the parent nodes, i.e. the nodes from which an arc in the Bayesian belief network points to the node for which the coefficient is being set. In step 402, the coefficients of the linear function, in the above case the coefficients mean_x and d₁, d₂, ... , d_n are the coefficients that should be fitted in order to determine the mean /z of the conditional Gaussian distribution of the node. This may be fitted using a maximum likelihood approach based on linear regression, as is disclosed in greater detail elsewhere in this disclosure.

It will be appreciated that other probability distribution functions, such as an exponential distribution, may be used instead of the Gaussian distribution. Moreover, non-linear functions may be used to compute the parameter of the probability density function. For example, instead of a linear function a quadratic function or any polynomial function could be used.

Fig. 5 shows a block diagram illustrating an apparatus 500 for building a Bayesian belief network to predict a variable in a system. The apparatus 500 may be implemented using computer hardware, which may be distributed hardware. The apparatus 500 comprises a processor 501 or a plurality of cooperating processors. The apparatus further comprises a storage 503, for example a computer memory and/or a storage disk. Computer instructions may be stored in the storage 503, in particular in a non-transitory computer-readable media. Moreover, a dataset with observations for the nodes may be stored in the storage 503. The apparatus may comprise an input device 504 for receiving a user input to control the apparatus and a display device 505 to display outputs of the apparatus. The apparatus 500 may further comprise a communication port 502 for connecting to other devices and exchange of data and control signals. For example, the communication port 502 comprises a network interface for wired or wireless communication and/or a universal serial bus (USB). The instructions in the storage 503 may contain modules implementing the steps of one of the methods set forth herein. For example, the instructions may cause the processor 501 to control receiving, via the communication port 502, observations for the nodes of the Bayesian network, and store them as a set of observed values in the storage 503. These observed values may be received, for example, from a measurement device or sensor connected to the apparatus 500 via the communication port 502. In particular, the instructions may cause the processor 501 to determine a blacklist of arcs, wherein the arcs in the blacklist identify arcs not to be included in the Bayesian belief model, the blacklist including all directed arcs from the target node to any other node of the Bayesian belief network, learn a structure of the Bayesian belief network based on a set of observed values for the nodes by determining a plurality of directed arcs of the Bayesian belief network arcing the nodes of the network to form an acyclic graph, the plurality of directed arcs not including arcs corresponding to the arcs in the blacklist, and learn parameters of conditional continuous probability distribution functions of the nodes of the network, based on the structure of the network and the set of observed values. Alternatively, the instructions may cause the processor 501 to perform any of the methods set forth herein.

Fig. 6 illustrates several examples of transformations of the network structure during the learning procedure for learning the network structure. As shown in Fig. 6, starting with a network structure 601 , in which nodes A, B, C, and D are connected by arc in a certain way, an arc may be added 622, reversed 623, or deleted 624. In the illustration, in the network 601 , an arc connects node A to node C, an arc 611 connects node B to node C, and an arc connects node C to node D. As a first example transformation, an arc 612 from node B to node D may be added 622 to the network. As a second example transformation, the direction of an arc 611 may be reversed 623, so that the arc 611 is replaced by an arc 613 that connects node C to node B. As a third example transformation, the arc 611 may be deleted 624. Such transformations can be done with arcs connecting, in principle, any first node to any second node, as long as any whitelisted arcs are not removed and any blacklisted arc are not added. Also, the direction of an arc may not be reversed if the other direction of that arc is blacklisted.

Fig. 7 illustrates a method 700 of generating a Bayesian network. In particular, the method can select the arcs to be included in the Bayesian network. In certain embodiments, the method 700 can be used as the step of learning network structure 204 in Fig.2.

The method starts in step 701 by determining a master input, on which the remainder of the process will be based. The master input may comprise a list of variables, each variable representing a state of the system to be modelled. Moreover, the master input may comprise a dataset with a plurality of observations of the variables. This dataset may be obtained by measuring and storing the states of the variables of the system over time. The states may be measured automatically, using sensors for example. Alternatively, the measured states may be entered into the system. Thus, the dataset may comprise historical data for the variables. Further, the master input may comprise an indication of the type of each variable. Examples of variable types are: controlled variables, which can be controlled by the system operator. For example, a certain voltage can be controlled or the setting of a certain lever. Measured variables are variables that can be measured, meaning that the values can be checked and put in the dataset, but cannot be directly controlled by the operator of the system. The target variable is a (measured) variable that represents the quantity that we want to predict. In principle there can be only one or more than one target variable. External variables are variables that are not influenced by the controlled variables. Possible examples of external variables are, for example, environmental temperature, precipitation.

The variables may be continuous variables, as explained hereinabove. For example, the variables may have a Gaussian normal distribution with a mean that depends on the parent node. Thus, elasticities may be computed in the way set forth hereinabove.

Further, the master input can include a sign indicator for each variable. This sign indicator indicates whether the variable has a positive or a negative correlation with the target variable. For example, the sign indicator may indicate the sign of statistical correlation coefficient between a variable and the target variable. This sign may be set based on domain knowledge, for example, or by means of statistical information. It defines the direction of each variable with respected to the target variable. The master input may be tailored with further specific information for the system to be modelled.

For example, the master input may comprise a list with an identification of each variable and the type of each variable (measure, controlled variable, etc.). Also, for time processes, the master input may contain an indication of whether a variable has a lag.

Also, the master input file may comprise a list of all allowed arcs; that is, for example, in certain cases it is known a prior that only certain variables can influence each other. This knowledge may be converted to the list of allowed arcs. Further, the master input may comprise indication of important arcs. The importance of the arcs may be taken into account when adding arcs to the network, starting with the most important arcs.

Thresholds may be included in the master input, defining hard bounds and/or soft bounds for the quantitative as well as qualitative checks that may be performed to evaluate a Bayesian network.

Further, a set of observed values for the plurality of variables representing the system may be determined. This set of observed values may be obtained, for example, by use of sensors or other detection or measurement techniques that detect certain properties of the system corresponding to states of the system.

Based on the logical relationships between controlled, measured, target, and external variables, a set of candidate arcs is constructed in step 702. Further input that may be taken into account is the optional list of all allowed arcs from the master input. Arcs that are not in the set of candidate arcs can be considered to be on a blacklist of arcs that will not be included in the Bayesian network. Typically, the set of candidate arcs includes the arcs from the controllable variables to the measured variables and the arcs from the measured variables to the target variable, arcs from one measured variable to another measured variable, and arcs from an external variable to a measured variable or the target variable.

However, the set of candidate arcs does not include the blacklisted arcs, such as arcs from any variable to any of the controllable variables, arcs from the target variable to any other variable, and arcs from any variable to an external variable.

In case certain variables have a lag, as defined in the master input, this may be taken into account in this step. For example, if a set of variables is set to belong to lag 1 , and another set of variables is set to belong to lag 2, this means that the set of candidate arcs can contain arcs from the variables of lag 1 to the variables of lag 2. However, the set of candidate arcs does not contain arcs from the variables of lag 2 to the variables of lag 1. Herein, it is assumed that lag 1 variables represent a state of the system at a first time, and lag 2 variables represent the state of the system at a second time, wherein the second time is after the first time.

In the next steps, the structure of the network may be determined in an iterative process in which different structures are tried and evaluated. After every iteration, values obtained can be checked against certain checklists. These checklists may be divided into quantitative and qualitative checks. Quantitative checks can include MAPE (mean absolute percentage error), R squared, importance score, among others. Network score may refer to a score obtained from hill climbing algorithm. Qualitative checks can include checks for the presence of isolated nodes, complete arcs, nodes without parents, nodes without children. Herein, isolated nodes are nodes that have no parent and no child; these nodes may be deleted.

In step 703, an intermediate subset of arcs is determined, forming an intermediate Bayesian network. This may involve an iterative process of testing certain combinations of arcs and their performance in a Bayesian network. The intermediate subset of arcs is selected subject to the condition that no counterintuitive arcs are included. That is, the intermediate Bayesian network does not have any variables with an elasticity that has a sign different from the sign indicator of that variable.

An example implementation of step 703 is illustrated in Fig. 8. The process shown in Fig. 8 is similar to the process of Fig. 3, and the items with the same reference numerals as the ones appearing in Fig. 3 are explained hereinabove in relation to Fig. 3, and their content is not repeated here. However, the process of Fig. 8 is modified so that it checks additionally for counterintuitive arcs. That check may be performed in step 805. Herein, counterintuitive arcs are arcs that cause any variable in the network to have an elasticity with a sign that is different from the sign indicator. Thus, for each node the elasticity of that node with respect to the target node is calculated in step 804 using the equations set forth herein, and compared to the sign indicator of that node. If, in step 805, the network has any node of which the sign of the elasticity differs from the sign indicator, the process proceeds to step 307 to keep the current network (and discard the updated network). If, in step 805, the sign of the elasticity of all of the nodes is the same as the sign indicator of each node, and as an optional condition the score has improved compared to the current network, the modified (updated) network is stored in step 306.

Additionally, in step 805, it may be checked that the total effect of each variable is within prescribed bounds, for example between -2 and +2. If the total effect is not within the prescribed range, the current network is kept in step 307. Otherwise, if the other requirements have been met, the modified network is stored in step 306. The sum of the direct and indirect effect of a variable is the total effect of a variable. It can be considered as the overall elasticity of that particular variable. It has no unit. The exemplary limits of -2 and 2 are bounds that are set based on domain knowledge. Therefore, these limits may be different in different application domains.

The step 703 results in an intermediate Bayesian model with an intermediate subset of arcs. These subset of arcs is taken from the set of candidate arcs defined in step 702. The subset of arcs may be selected using a process based on e.g. Fig. 3, as described above. It will be understood that the intermediate subset of arcs may alternatively be obtained by another selection process, as long as it is ensured that the elasticities of the variables match the sign indicators and optionally is within the prescribed bounds.

In the process shown in Fig. 7, after step 703, more iterations are done in steps 704 and 705, to find a better network structure.

In step 704, more network structures are iteratively created and tested in a first stage of iterations. Each network structure is based on the intermediate subset of arcs and one additional candidate arc. In this step, one arc at a time is picked up from the set of candidate arcs not including the intermediate set of arcs. This selected arc is added to the intermediate set of arcs. Then, the resulting set of arcs is fitted to form the Bayesian model, and the network score is computed. In the next iteration, it starts again from the intermediate set of arcs and just one other arc from the set of candidate arcs. Thus, each iteration one arc is force fitted into the model. This, way, all the arcs are tried one by one. Each iteration the quality checks are done, and if the checks do not meet certain predetermined constraints, the model is rejected. Otherwise, the model and scores are stored.

In step 705, more network structures are iteratively created and tested in a second stage of iterations. In this case, an arc is added in each iteration successively, thus gradually enlarging the number of included arcs. Arcs that fail to satisfy certain criteria are, however, removed from the model. In greater detail, this can be realized as follows.

First, define a second subset to be initially equal to the intermediate subset. This second subset will be adjusted in the subsequent iterations to try out different models. In each iteration, a particular arc is selected that is included in the set of candidate arcs, but is not yet in the second subset of arcs. Then, a Bayesian model is fitted using the second subset of arcs. If the Bayesian model satisfies certain conditions, the model is stored together with its fit score. Moreover, if the Bayesian model satisfies a certain criterion, the selected particular arc is added to the second subset of arcs before proceeding with the next iteration. On the other hand, if the second stage model does not satisfy the certain criterion, the second arc is removed from the set of candidate arcs before proceeding with fitting the next second stage model, so that the same arc is not used again in the subsequent iterations. The first stage 704 and the second stage 705 can be illustrated further using the following example. The specific numbers provided in this example are not limiting, but are provided for purpose of illustration only.

Suppose we have 100 arcs that are possible between the variables provided in the master input. However, suppose 30 arcs of these 100 arcs do not satisfy the conditions (e.g., arcs from a measured variable to a controlled variable), and are blacklisted. The remaining 70 arcs form the set of candidate arcs. Hence, using the 70 arcs in the set of candidate arcs, the iterations of step 703 are preformed to find the intermediate Bayesian model with the intermediate subset of arcs.

Suppose in the intermediate Bayesian model, 50 arcs are used from the initial 70 arcs in the set of candidate arcs. Hence, we have 20 arcs remaining that are in the set of candidate arcs, but not in the intermediate subset of arcs.

Both the first stage and the second stage start from this point.

In the first stage (step 704), one arc from the remaining 20 arcs is taken. It is force fitted(whitelisted) in the post intermediate model. The checklist parameter scores are calculated and recorded in the log. Now the next arc is taken, and force fitted in the post intermediate model and the scores are calculated. This process repeated for each of the 20 arcs but only one arc is fitted in the model in one iteration. An arc forced fitted in the previous iteration is not carried forward to the next iteration.

In the second stage, we start with the intermediate model of arcs and the remaining 20 arcs from the set of candidate arcs. The first arc from this list of 20, is force fitted in to the post intermediate model. The checklist parameter scores are calculated and recorded in the log. If the additional arc is not counter intuitive, the updated model, including the additional arc, is passed on to the next iteration. Now the next arc from the list of 20 arcs is force fitted and the same process is repeated. Herein, counter intuitive means that at least one variable has an elasticity with a sign that differs from the sign indicator. When such a thing happens, the last added arc is counter intuitive and put on the blacklist.

If the fitted Bayesian network with the additional arc does not meet certain constraints, the model it is removed (not recorded in the log) and the model with which the iteration started is passed on to the next iteration.

Hence in the second stage 705, we start with the intermediate model and add one arc at the time, check if it is intuitive, if yes then retain it in the model and then the next arc is added. If not intuitive then the arc is removed, and the next arc is added.

After the first stage and/or second stage have been completed, the log is searched in step 706 for the network with the best score. The network with the best score is the final selected Bayesian model with the optimal set of arcs.

It will be understood that either or both of step 704 and step 705 may be omitted.

Throughout the iterations, the network score or fit score may be computed by taking into consideration qualitative as well as quantitative checks. The final score of a network may be calculated by averaging a plurality of differently calculated scores, for example by taking harmonic mean. The iteration with highest score is chosen as the converged model.

After the final model is determined, optional post-processing may be performed in step 707.

In case the best model obtained from stage 1&2 is sub optimal and has measures without impact, then this stage may be triggered. A created Bayesian network model is said to be optimal if it satisfies all of a predetermined set of conditions, including quantitative conditions and qualitative conditions. On contrary, if a BN doesn’t satisfy few of the checks, it is termed as sub-optimal. Point to be noted here is that it may be acceptable (depending on the criticality) to accept a sub-optimal BN that satisfies all the key checklists (e.g. no measures without impact, no complete arcs etc.), but marginally fails to satisfy the remaining (not so critical) checklists.

A node that is a measure, but doesn’t have any children node (i.e. no impact) in the network is known as a measure without impact. If a measure has at least one child, then it has an impact on the network - this is how we can identify whether a measure has an impact or not.

For example, rectification of measures without impact may be performed.

This stage may be further sub divided into 2 sub-stages. First, the measures without impact may be tried rectifying, for example, using a structure presumed method. A structure presumed model is a BN in which the network structure (i.e. the arcs/connections) is already provided beforehand. So, while building such a BN model, no network structure learning is required, as the structure would be already known.

In case the rectifying of the first sub-stage fails, a second rectifying sub-stage may be triggered which can remove all measures without impact and regenerate the network. This way, the final model won’t have any measure without impact and it would have passed all the checklists.

Fig. 9 shows an example of a Bayesian network 900 for food development. Let us consider the composition of creamy soup powder. It can contain many ingredients such as creamer vegetable granules, binder, NaCI, flavouring ingredients like herbs, spices, yeast extract, monosodium glutamate. Also, for garnishing purposes, large particulate of mushroom, carrot, cauliflower florets, broccoli florets, whole garden peas, sweet corn can be used. Additionally, for flavour and texture, cottage cheese powder, cream powder, and/or butter powder can be added.

The creamer can provide the creamy texture and the flavour of the soup. The herbs and spices can provide various tastes such as sweet, tangy and spicy, provided in different degrees, to enhance the flavour. Also, the shelf life of the ingredients and the resulting soup powder is important to provide the intent. In this respect, a creamy binder selected from non-dairy creamer and dairy-based powder, such as cheese or butter powder, can have a great impact. The vegetable particulate also impacts the shelf life. However, spices, yeast and other ingredients generally have very little or no impact on the shelf life of the mixture.

Hence, creamy soup powder is a complex, well-balanced blend of the various ingredients. At the same time, flavour, aroma, texture, thickness and shelf life can greatly influence customer satisfaction.

Referring to Fig. 9, in an embodiment, consumer preference score is the target node 960. The amount of each ingredient, expressed in weight percentage, for example, are represented by controllable nodes 902-915. Intermediate nodes 951-958, representing properties such as flavour, aroma, texture, thickness, and shelf life, may be influenced by the controllable nodes (the weight percentage of each ingredient), and in turn these intermediate nodes 951-958 influence the consumer preference score, represented by the target node 960.

Experimentally, the available ingredients may be combined using different weight percentages. For each combination of weight percentages of the ingredients, we may assess the intermediate variables (flavour, aroma, texture, thickness, shelf life) and the target variable (consumer preference score) by performing certain measurements and/or by evaluating the product by a group of consumers. This experimental data contains the observed values for the nodes. If we have this experimental data for multiple trials, we can create a model to predict the consumer preference score from the composition of ingredients. This way we can predict the proportion of the ingredients to achieve the desired blend of the soup powder which can be quantified as consumer preference index.

In the network created, we can incorporate rules about known relationships such as a creamy binder, vegetable particulate and dairy powder can have an impact on texture and thickness. For example, if certain ingredients are known to influence a particular node (e.g. the weight percentage of cream powder influences the texture), the arc from the cream powder node to the texture node may be included in the white list.

Similarly, since it is known that certain ingredients have an influence on shelf life, the arcs pointing from the nodes of those ingredients to the node representing the shelf life can be whitelisted. Since it is known that certain other ingredients do not influence the shelf life, the arcs pointing from the nodes of these other ingredients to the node representing the shelf life can be blacklisted.

Also, based on knowledge about food production, certain nodes can be restricted as regards their value. For example, the weight percentage of cream powder can be restricted to a given range, for example the range from 1 .01% to 5.89%. Similarly, rules regarding the weightage restriction of certain the ingredients can be incorporated in the network using the blacklist. The unknown relationships can be found by learning the Bayesian network structure using the method and systems disclosed herein. Elasticities and sign indicators may be determined, for example, between any of the nodes 902-958 and the target node 960. The sign indicator may be set by an expert using domain knowledge.

Examples of variables that can be used, as illustrated in Fig. 9, are: wt% of cottage cheese 902; wt% of cream powder 903; wt% of butter powder 904; wt% of herbs 905; wt% of spices 906; wt% of mushroom flakes 907; wt% of carrot cubes 908; wt% of broccoli florets 909; wt% of whole graden peas 910; wt% of sweet corn 911 ; wt% of yeast extract 912; wt% of binder 913; wt% of NaCI 914; wt% of monosodium glumate 915; wt% of creamer 951 ; Texture (represented by a numeric value) 952; Thickness (represented by a numeric value) 953; Sweet taste (represented by a numeric value) 954; Spicy taste (represented by a numeric value) 955; Sour taste (represented by a numeric value) 956; Flavour (represented by a numeric value) 957 Shelf life (in days) 958; Consumer preference score (target variable) 960

It will be understood that the arcs (arrows) drawn in Fig. 9 are shown merely as examples. The actual arcs of the Bayesian network may be determined by learning the structure of the Bayesian network based on the set of observed values for the nodes.

Claims

1. A method of building a probabilistic hierarchical model comprising a Bayesian network having continuous variables to predict a state of a system, the method comprising the steps of: obtaining (701), by a processor (501), an input specification comprising a plurality of variables, the plurality of variables including a plurality of controllable variables, a plurality of measured variables, and at least one target variable representing the state to be predicted, the input specification further comprising sign indicators associated with a plurality of the variables, wherein the sign indicator assigned to a variable is positive to indicate that the variable has a positive correlation with the target variable, or the sign indicator is negative to indicate that the variable has a negative correlation with the target variable; obtaining, by the processor (501), a set of observed values for the plurality of variables; determining (702), by the processor (501), a set of candidate arcs, each candidate arc indicating a possible link from a first one of the variables to a second one of the variables to be included in a Bayesian network; repeatedly selecting (303), by the processor (501), a subset of the set of candidate arcs to form the arcs of a Bayesian network, fitting parameters of the Bayesian network with the selected subset of the set of candidate arcs, based on the set of observed values for the plurality of variables, and calculating (804), for each of the plurality of variables to which the sign indicator is assigned, an elasticity with regard to its effect on the target variable according to the fitted Bayesian network; and selecting (805, 706), by the processor (501), one of the subsets as a final set of arcs of the Bayesian network based on the sign indicators and the elasticity of each of the plurality of variables to which the sign indicator is assigned.

2. A method according to claim 1 , wherein the sign of the elasticity of each of the plurality of variables of the final subset of the set of candidate arcs to which the sign indicator is assigned matches the respective sign indicator.

3. A method according to claim 1 , wherein selecting a subset of the candidate arcs comprises obtaining (703), by the processor (501), an intermediate subset of the candidate arcs forming an intermediate Bayesian network, by: iteratively fitting (303) Bayesian networks with different subsets of the candidate arcs to the set of observed values; calculating (304) a fit score of each fitted Bayesian network; comparing (804) each sign indicator to the elasticity of each corresponding variable in each fitted Bayesian network; and selecting (805) the intermediate subset of the candidate arcs, based on the fit score, from among the different subsets of the candidate arcs, wherein the intermediate subset is selected satisfying the condition that each sign indicator matches the elasticity of each corresponding variable in the intermediate Bayesian network with the intermediate subset of the candidate arcs.

4. A method according to claim 3, wherein the selected intermediate subset further satisfies the condition that the elasticity of each variable in the intermediate Bayesian network is greater than a predetermined minimum bound and smaller than a predetermined maximum bound.

5. A method according to claim 3 or 4, further comprising obtaining (704), by the processor (501), a plurality of first stage networks with associated fit scores by, for each first arc that is in the set of candidate arcs but not in the intermediate subset of arcs, fitting a first stage network including only the intermediate subset of arcs and the first arc and calculating the fit score associated with the first stage network, wherein the selecting the final subset of the candidate arcs is performed based on the fit scores associated with the first stage networks.

6. A method according to any one of claims 3 to 5, further comprising setting, by the processor (501), a second subset initially equal to the intermediate subset; and obtaining (705), by the processor (501), a plurality of second stage networks with associated fit scores by, iteratively for each second arc that is in the set of candidate arcs but not in the second subset of arcs, fitting a second stage network including only the second subset of arcs and the second arc and calculating the fit score associated with the second stage network, and if the second stage network satisfies a certain criterion, adding the second arc to the second subset of arcs before proceeding with fitting the next second stage network, and if the second stage network does not satisfy the certain criterion, removing the second arc from the set of candidate arcs before proceeding with fitting the next second stage network, wherein the selecting the final subset of the candidate arcs is performed based on the fit scores associated with the second stage networks.

7. A method according to claim 1 , wherein the set of candidate arcs includes a plurality of arcs from the controllable variables to the measured variables and a plurality of arcs from the measured variables to the target variable, and wherein the set of candidate arcs does not include any arcs from any variable to any of the controllable variables nor from the target variable to any other variable in the plurality of variables.

8. A method according to claim 7, wherein the plurality of variables further includes at least one external variable, wherein the set of candidate arcs includes at least one arc from the at least one external variable to at least one of the measured variables or the target variable, and wherein the set of candidate arcs does not include any arcs from any variable to an external variable.

9. A method according to anyone of the preceding claims from 1 to 8, further comprising: identifying a target value for the target variable; determining values for the controllable variables based on the Bayesian network and the target value for the target variable; and controlling the controllable variables to set the controllable variables to their determined values.

10. A system for building a probabilistic hierarchical model comprising a Bayesian network having continuous variables to predict a state of a system, the system comprising: a memory (503) configured to store observed values and a Bayesian network; and a processor (501) configured to perform steps of: obtaining (701) an input specification comprising a plurality of variables, the plurality of variables including a plurality of controllable variables, a plurality of measured variables, and at least one target variable representing the state to be predicted, the input specification further comprising sign indicators associated with a plurality of the variables, wherein the sign indicator assigned to a variable is positive to indicate that the variable has a positive correlation with the target variable, or the sign indicator is negative to indicate that the variable has a negative correlation with the target variable; obtaining a set of observed values for the plurality of variables; determining (702) a set of candidate arcs, each candidate arc indicating a possible link from a first one of the variables to a second one of the variables to be included in a Bayesian network; repeatedly selecting (303) a subset of the set of candidate arcs to form the arcs of a Bayesian network, fitting parameters of the Bayesian network with the selected subset of the set of candidate arcs, based on the set of observed values for the plurality of variables, and calculating (804), for each of the plurality of variables to which the sign indicator is assigned, an elasticity with regard to its effect on the target variable according to the fitted Bayesian network; and selecting (805, 706) one of the subsets as a final set of arcs of the Bayesian network based on the sign indicators and the elasticity of each of the plurality of variables to which the sign indicator is assigned.

11 . A computer program product for building a probabilistic hierarchical model comprising a Bayesian network having continuous variables to predict a state of a system, the computer program product comprising instructions stored on a non-transitory compute readable media, the instructions being configured to cause a computer system to perform the steps of: obtaining (701) an input specification comprising a plurality of variables, the plurality of variables including a plurality of controllable variables, a plurality of measured variables, and at least one target variable, the input specification further comprising sign indicators associated with a plurality of the variables, wherein the sign indicator assigned to a variable is positive to indicate that the variable has a positive correlation with the target variable, or the sign indicator is negative to indicate that the variable has a negative correlation with the target variable; obtaining a set of observed values for the plurality of variables; determining (702) a set of candidate arcs, each candidate arc indicating a possible link from a first one of the variables to a second one of the variables to be included in a Bayesian network; repeatedly selecting (303) a subset of the set of candidate arcs to form the arcs of a Bayesian network, fitting parameters of the Bayesian network with the selected subset of the set of candidate arcs, based on the set of observed values for the plurality of variables, and calculating (804), for each of the plurality of variables to which the sign indicator is assigned, an elasticity with regard to its effect on the target variable according to the fitted Bayesian network; and selecting (805, 706) one of the subsets as a final set of arcs of the Bayesian network based on the sign indicators and the elasticity of each of the plurality of variables to which the sign indicator is assigned.