WO2022106437A1 - Predicting the state of a system with continuous variables - Google Patents

Predicting the state of a system with continuous variables

Info

Publication number
WO2022106437A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
arcs
node
nodes
bayesian
Prior art date
Application number
PCT/EP2021/081915
Other languages
French (fr)
Inventor
Karanjeet Singh BHAGI
Meenakshi BURRA
Aneesh CHAUDHRY
Sachit HANDA
Neelakanta Siva KALYANARAMAN
Gourav Kumar
Robert Cennydd MORGAN
Devesh Raj
Chintan Shah
Uma Parvathy Krishna SHARMA
Rahul Ranjan SRIVASTAVA
Original Assignee
Unilever Ip Holdings B.V.
Unilever Global Ip Limited
Conopco, Inc., D/B/A Unilever
Priority date
Filing date
Publication date
Application filed by Unilever Ip Holdings B.V., Unilever Global Ip Limited, Conopco, Inc., D/B/A Unilever filed Critical Unilever Ip Holdings B.V.
Publication of WO2022106437A1 publication Critical patent/WO2022106437A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • the invention relates to predicting a state within a system.
  • Such a complex system can be modeled by a number of input variables, or control variables, at least one output variable, representing the behavior and/or output of the system, internal variables representing states inside the system, and external variables representing states of external circumstances, such as temperature, that influence the system but cannot be controlled by a system operator.
  • A Bayesian Belief Network, or Bayesian Network (BN) for short, is a probabilistic hierarchical model that may be primarily used to represent a causal dependence (parent-child) structure among a set of system parameters.
  • a BN may be represented through a set of random variables (forming nodes of the BN) and their conditional dependencies (forming directed edges of the BN). The probability of occurrence of each state of a node is known as its “belief”.
  • Bayesian belief network classifiers: Algorithms and system
  • Bayesian belief networks (BN)
  • Bayesian networks provide an established way to represent causal relationships using a structure. However, it can cost a lot of resources to generate a Bayesian network and to store a Bayesian network. Moreover, it may be difficult to build a Bayesian network with sufficient predictive qualities.
  • a method of predicting a state of a system by building a probabilistic hierarchical model comprising a Bayesian network comprising a plurality of nodes, each node representing a variable, a variable corresponding to the state to be predicted being represented by a target node of the Bayesian network, the method comprising determining, by a processor, a set of observed values for the variables represented by the nodes; determining, by the processor, a blacklist of arcs, wherein the arcs in the blacklist identify arcs to be excluded from the Bayesian network, the blacklist including at least all directed arcs from the target node to any other node of the Bayesian network; learning, by the processor, a structure of the Bayesian network based on the set of observed values for the nodes by determining a plurality of directed arcs of the Bayesian network linking the nodes of the network to form an acyclic graph, the plurality of directed arcs not including arcs corresponding to the arcs in the blacklist; and learning, by the processor, parameters of conditional continuous probability distribution functions of the nodes of the network, based on the structure of the network and the set of observed values.
  • the step of learning the conditional continuous probability distributions may comprise maximum likelihood estimation of at least one parameter of the conditional continuous probability distributions. This way the conditional probability distributions may be efficiently estimated, even for continuous probability distributions.
  • the learning the parameters may comprise setting at least one parameter of the conditional continuous probability distribution function of a first node as a linear function of at least one immediate parent node of the first node. This may facilitate the modeling step and may lead to suitable estimations in many cases.
  • the method may further comprise computing at least one coefficient of the linear function using linear regression based on the set of observed values and the structure of the Bayesian belief network.
  • the linear regression model provides an efficient way to learn the model parameters.
  • conditional continuous probability distributions may be Gaussian distributions. This allows further assumptions to be made in the fitting procedure that help to make the fitted model parameters reliable and efficient to compute.
  • the learning the network structure may comprise performing a hill-climbing method comprising finding an optimum score by iteratively modifying the arcs of the network structure.
  • the score of the network structure may be determined after each modification. This may involve fitting the parameters of the conditional probability distributions. This provides an efficient way to learn the network structure.
  • the hill-climbing method may comprise setting a current network structure to an initial network structure; setting a maximum network structure score to a score of the initial network structure; and repeating the following steps until a stopping criterion is satisfied: (1) updating the current network structure to obtain an updated network structure, while ensuring that the updated network structure represents an acyclic graph, wherein the updating the current network structure comprises adding an arc that is not on the blacklist to the network structure, deleting an arc that is not on a whitelist from the network structure, or reversing the direction of an arc that is not on the whitelist and of which arc the reversed direction is not on the blacklist; (2) computing a score of the updated network structure; and (3) if the score of the updated network structure is larger than the maximum network structure score, setting the current network structure to the updated network structure and setting the maximum network structure score to the score of the updated network structure. A minimal illustrative sketch of this procedure follows below.
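  • A minimal, hedged sketch of the hill-climbing search described above is given here. It assumes arcs are represented as (from, to) tuples and that a scoring function for candidate structures is supplied by the caller (e.g. based on the BIC discussed hereinafter); all names are illustrative, not taken from the patent.

```python
# Illustrative hill-climbing structure search with blacklist/whitelist
# constraints; `score` is a user-supplied function mapping a set of arcs
# to a number (higher is better), e.g. a BIC-based network score.
import random

def is_acyclic(nodes, arcs):
    # Kahn's algorithm: the graph is acyclic iff every node can be peeled off.
    indeg = {v: 0 for v in nodes}
    for a, b in arcs:
        indeg[b] += 1
    frontier = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while frontier:
        v = frontier.pop()
        seen += 1
        for a, b in arcs:
            if a == v:
                indeg[b] -= 1
                if indeg[b] == 0:
                    frontier.append(b)
    return seen == len(nodes)

def hill_climb(nodes, score, blacklist, whitelist, max_iter=1000, seed=0):
    rng = random.Random(seed)
    current = set(whitelist)              # whitelisted arcs are always present
    best = score(current)
    for _ in range(max_iter):             # stopping criterion: iteration cap
        cand = set(current)
        a, b = rng.sample(nodes, 2)
        op = rng.choice(["add", "delete", "reverse"])
        if op == "add" and (a, b) not in blacklist:
            cand.add((a, b))
        elif op == "delete" and (a, b) not in whitelist:
            cand.discard((a, b))
        elif op == "reverse" and (a, b) in cand \
                and (a, b) not in whitelist and (b, a) not in blacklist:
            cand.discard((a, b))
            cand.add((b, a))
        if cand == current or not is_acyclic(nodes, cand):
            continue                      # reject no-ops and cyclic graphs
        s = score(cand)
        if s > best:                      # keep only score-improving moves
            current, best = cand, s
    return current, best
```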
  • the optimum score may be based on a Bayesian information criterion (BIC) or an arc strength. These were found to be good guiding criteria for finding the network structure.
  • the method may further comprise providing a dataset comprising combinations of values for the nodes; and preprocessing the dataset to obtain normalized observations corresponding to the plurality of nodes, wherein the preprocessing comprises at least one of missing values treatment, outliers treatment, and data transformation, wherein the learning the network structure and the learning the conditional probability distributions are performed based on the normalized observations corresponding to the plurality of nodes. These normalized observations may greatly improve the performance of the Bayesian belief network in terms of accuracy and convergence of the learning steps.
  • the method may comprise calculating an elasticity or sensitivity of the first node with respect to a change in at least one immediate (or indirect) parent node based on the at least one coefficient.
  • the method may further comprise determining a whitelist of arcs, wherein the arcs in the whitelist identify arcs that are to be included in the Bayesian belief network.
  • the whitelist can be used to incorporate prior knowledge in the model, for example, to aid the learning process.
  • the method may further comprise identifying a target value for the target node; determining a value for at least one of the nodes based on the Bayesian network and the target value for the target node; and controlling at least one variable in the system based on the determined value for the at least one of the nodes. This is a highly effective manner to indirectly influence one variable of the system by directly controlling another variable.
  • a system for predicting a state by building a probabilistic hierarchical model comprising a Bayesian network comprising a plurality of nodes, each node representing a variable, a variable corresponding to the state to be predicted being represented by a target node of the Bayesian network, the system comprising a memory configured to store a set of observed values for the variables represented by the nodes; and a processor configured to: determine a blacklist of arcs, wherein the arcs in the blacklist identify arcs not to be included in the Bayesian model, the blacklist including all directed arcs from the target node to any other node of the Bayesian network; learn a structure of the Bayesian network based on the set of observed values for the nodes by determining a plurality of directed arcs of the Bayesian network linking the nodes of the network to form an acyclic graph, the plurality of directed arcs not including arcs corresponding to the arcs in the blacklist;
  • a computer program product comprising instructions stored on a non-transitory computer-readable medium which, when the program is executed by a computer system, cause the computer system to perform the steps of: determining a set of observed values for a plurality of variables represented by a plurality of nodes of a Bayesian network, a variable corresponding to a state to be predicted being represented by a target node of the Bayesian network; determining a blacklist of arcs, wherein the arcs in the blacklist identify arcs to be excluded from the Bayesian network, the blacklist including at least all directed arcs from the target node to any other node of the Bayesian network; learning a structure of the Bayesian network based on the set of observed values for the nodes by determining a plurality of directed arcs of the Bayesian network linking the nodes of the network to form an acyclic graph, the plurality of directed arcs not including arcs corresponding to the arcs in the blacklist; and learning parameters of conditional continuous probability distribution functions of the nodes of the network, based on the structure of the network and the set of observed values.
  • Fig. 1 shows an example Bayesian belief network to model crop growth.
  • Fig. 2 shows a flowchart illustrating aspects of a method of building a Bayesian belief network.
  • Fig. 3 shows a flowchart illustrating aspects of a method of learning a network structure of a Bayesian belief network.
  • Fig. 4 shows a flowchart illustrating aspects of a method of learning node parameters of a Bayesian belief network.
  • Fig. 5 shows a block diagram of an apparatus for building a Bayesian belief network.
  • Fig. 6 shows a diagram illustrating modifications of the structure of a Bayesian belief network.
  • Fig. 7 shows an example Bayesian belief network to model a cream soup.
  • The techniques disclosed herein provide Bayesian networks that can handle continuous variables directly, without converting continuous system variables into discretized Bayesian network variables.
  • elasticities can be calculated, which is not easy to do with discrete Bayesian networks.
  • the limitation of Bayesian networks to discretely valued nodes may cause an exponentially increasing size of the probability tables when the number of possible states increases.
  • An exemplary Bayesian belief network that models the relationship between genetic and environmental parameters and crop growth is presented to illustrate the embodiments.
  • In biotechnology, a significant amount of time and effort is put into optimizing environmental parameters and genetically determined parameters of plants.
  • the relationship between these parameters, profiles, and the final crop growth is not straightforward.
  • the BN may help to identify important relationships.
  • Gaussian BN may be used for modeling systems with drivers that are inherently continuous in nature.
  • a continuous BN may be characterized in that each driver (node) follows a Gaussian (Normal) distribution. If this fundamental assumption holds, certain analysis and modeling features may be employed that make use of this assumption.
  • the causal relationship of each node with its parent node(s) is represented through a Gaussian conditional probability distribution.
  • the joint probability distribution of a set of drivers (nodes) may be obtained through the Chain Rule, using Bayes’ Theorem.
  • information regarding any driver may be accessible solely from its Markov Blanket.
  • the Markov Blanket of a driver may be regarded to comprise its parent nodes, its child nodes, and the other parent nodes of its child nodes. This is formalized below.
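  • In standard notation (not verbatim from the patent), the chain-rule factorization and the Markov blanket may be written as follows:

```latex
% Chain-rule factorization of the joint distribution over the nodes
% X_1,...,X_n of a BN, each node conditioned on its parents Pa(X_i):
\[
  P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
\]
% Markov blanket of a node X: its parents, its children, and the
% children's other parents:
\[
  \mathrm{MB}(X) = \mathrm{Pa}(X) \cup \mathrm{Ch}(X) \cup
    \Bigl(\bigcup_{Y \in \mathrm{Ch}(X)} \mathrm{Pa}(Y) \setminus \{X\}\Bigr)
\]
```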
  • the modeling of a system by a BN may follow phases of data input and pre-processing, BN creation, and post-BN creation.
  • the data input may be obtained by monitoring system variables and obtaining samples thereof.
  • the data may be pre-processed.
  • the raw data may be processed in an exploratory data analysis (EDA).
  • One or more of the following steps may be performed to prepare the data before model creation starts.
  • missing values may be treated to improve data consistency.
  • a suitable replacement/imputation method may be performed by replacing a missing value with the most frequently appearing value, with an average value, or a value of a moving average in case of time-dependent data.
  • outliers may be treated using any suitable outlier treatment method known in the art per se.
  • outliers may be treated through the interquartile range (IQR) method.
  • values smaller than (Q1 − 1.5 × IQR) may be replaced with the value Q1.
  • Q1 denotes the 25th percentile of the variable within the dataset.
  • values that are larger than (Q3 + 1.5 × IQR) may be replaced with the value Q3.
  • Q3 denotes the 75th percentile of the variable within the dataset.
  • the IQR value denotes the 75th percentile minus the 25th percentile (Q3 − Q1) of the variable within the dataset. A minimal sketch of this treatment follows below.
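  • A minimal sketch of this IQR capping, using only the Python standard library (the quartile convention of `statistics.quantiles` is one common choice; the patent does not prescribe one):

```python
# Replace low outliers with Q1 and high outliers with Q3, as described above.
import statistics

def iqr_cap(values):
    q1, _, q3 = statistics.quantiles(values, n=4)   # 25th and 75th percentiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [q1 if v < lo else q3 if v > hi else v for v in values]

print(iqr_cap([1.0, 2.0, 2.5, 3.0, 2.2, 50.0]))     # 50.0 is capped to Q3
```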
  • any one or more of a number of data transformations can be applied.
  • one or more of the following data transformations may be applied, depending on the nature of the data.
  • a smoothing operation or other noise reduction formula may be applied.
  • moving average transformation may be applied.
  • a Natural Log transformation (or any other log transform) may be applied when this is suitable for the type of variable at hand, in order to stabilize the variance of the data.
  • Certain embodiments comprise a step of establishing the nodes of the BN.
  • the nodes correspond to the drivers and/or target of the network. These nodes in general correspond to the data points; they may represent states that can be determined by observation.
  • the purpose of the data transformations of the pre-processing step is to bring the data in conformity with the chosen nodes of the BN, so that sufficient data is available for the nodes, and the data of each node has an advantageous distribution, such as a Gaussian (normal) distribution.
  • Certain embodiments comprise a step before the actual learning of the BN structure.
  • An example of that is a step of whitelisting and/or blacklisting of arcs.
  • In a BN, any two nodes may be connected by a unidirectional arc.
  • Each arc represents a relationship between the nodes, so that the probability distribution of the node to which the arc points is conditional on the value of the node from which the arc points away.
  • a goal is to learn which arcs are to be included in the network.
  • nodes that are not connected by an arc are essentially considered to be independent of each other.
  • arcs can be whitelisted (WL) or blacklisted (BL).
  • certain arcs are whitelisted before the modeling procedure starts.
  • certain arcs may be blacklisted before the modeling procedure starts.
  • Whitelisted arcs, if specified, will definitely be present in the network, whereas blacklisted arcs, if specified, will definitely be absent from the network.
  • Arcs whitelisted in one direction only (i.e., A → B is whitelisted but B → A is not) may have the respective reverse arc automatically blacklisted. So, if A → B is whitelisted but B → A is not whitelisted, then B → A may be automatically blacklisted.
  • Arcs whitelisted in both directions (i.e., both A → B and B → A are whitelisted) are present in the graph, but their direction is set by the learning algorithm. A sketch of this arc bookkeeping is given below.
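  • A sketch of this bookkeeping, assuming arcs are (from, to) tuples; the function name and signature are illustrative. It also folds in the target-node rule described hereinafter (all arcs pointing away from the target are blacklisted):

```python
# Derive the effective blacklist: auto-blacklist the reverse of any arc
# whitelisted in one direction only, and blacklist every arc that points
# away from the target node (see hereinafter).
def expand_blacklist(nodes, whitelist, blacklist, target):
    bl = set(blacklist)
    for a, b in whitelist:
        if (b, a) not in whitelist:
            bl.add((b, a))                # reverse direction auto-blacklisted
    bl |= {(target, n) for n in nodes if n != target}
    return bl
```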
  • the BN contains a target node that represents a value that is considered to be the result of the values of the other nodes.
  • the target node represents a value for which a prediction is desired.
  • the whitelisting/blacklisting step may comprise blacklisting all possible arcs pointing away from the target node. This may improve the learning process and lead to better predictions. It may force the network structure into a form that allows the target node to be predicted based on observations of values for the remaining nodes.
  • the next step may be learning a network structure.
  • the arcs of the BN are determined.
  • a BN network structure learning algorithm may be selected.
  • a constraint-based algorithm may be used. Such an algorithm learns the network structure by analyzing the probabilistic relations entailed by the Markov property of Bayesian networks with conditional independence tests.
  • Such constraint-based algorithms may be based on the Inductive Causation (IC) algorithm.
  • a score-based algorithm may be used to learn the BN network structure.
  • Such an algorithm assigns a score to each candidate Bayesian network and tries to maximize it with a heuristic search algorithm, such as Hill-climbing, Tabu search, Simulated annealing, or one of various known genetic algorithms.
  • a hybrid algorithm may be used, in which both constraint-based and score-based learning algorithms are combined to obtain an optimized network structure.
  • Hill-Climbing (HC) methodology may be advantageously used as a network learning algorithm. This may be particularly advantageous in the case where (most of) the nodes represent continuous random variables. It may be even more advantageous in case the continuous random variables have a Gaussian distribution.
  • An example of the Hill-Climbing method is given below.
  • the method may start with an initial graph structure G.
  • the initial structure may be an empty structure.
  • the initial structure may be the structure of an acyclic graph with randomly selected arcs (satisfying the whitelist and blacklist).
  • a score of the initial graph structure G may be computed.
  • the score is an indication of how well the graph structure G can fit the available data. Examples of scoring methods will be disclosed hereinafter.
  • a number of iterations may be performed. The following steps may be included in each iteration. First, a transformation explained above is performed on a randomly selected arc (adding an arc, deleting an arc, or reversing an arc). In certain embodiments, more than one transformation may be performed.
  • while the arc and the operation may be selected randomly, only operations that respect the conditions on the graph structure are performed. These conditions may include the condition that the graph remains acyclic, and that any whitelisted arcs are included in the network structure and any blacklisted arcs are excluded from the network structure.
  • the transformation of the first step results in an updated graph structure G*.
  • a score of the updated graph structure G* may be computed. For example, the score is an indication of how well the graph structure G* can fit the available data. Examples of scoring methods will be disclosed hereinafter.
  • third, if the score of the updated graph structure G* is greater than the greatest score determined so far, graph G is set to be equal to graph G* and the greatest determined score of graph G is set to be the score of graph G*.
  • the iteration is terminated when a suitable stopping criterion is satisfied. For example, when there is no possible transformation that would improve the greatest score, the process may stop. Alternatively, when N successive iterations do not improve the greatest score, the process may stop, where N is any positive integer value.
  • the above process of learning the network structure may be regarded to be an iterative process with each iteration modifying exactly one arc (through: add/delete/reverse) that increases the overall network score.
  • more than one arc might be modified in some (or all) iterations.
  • a first parameter may specify the maximum number of iterations, the admissible range of the first parameter being [1, ∞). In case of infinity, which may be the preferred value, no restriction is put on the number of iterations and the Hill-Climbing algorithm will continue until the maximum network score is achieved.
  • the graph structure G may be reset one or more times during the Hill-Climbing. So, the transformation of the first step of some of the iterations may be replaced by a complete reset of the arcs to a new random structure that satisfies the applicable constraints (such as acyclic graph and the blacklist and whitelist).
  • a configurable parameter indicates the number of resets that is performed.
  • the number of resets is a non-negative integer, which may preferably be in the range from 0 to 50.
  • a suitable value may be 5.
  • Another configurable parameter may be the number of iterations to insert/remove/reverse an arc after every random reset. This parameter may preferably be, for example, in the range from 0 to 300.
  • a suitable value may be 100.
  • the reset is performed after the score of the graph has not increased for a predetermined number of iterations. That predetermined number of iterations, which may be a positive integer, forms an alternative parameter.
  • another configurable parameter may specify the maximum number of parents for each node. Its admissible range is [1, (n−1)], the default value being (n−1), where n is the total number of nodes.
  • the parents of a particular node are the nodes from which an arc points to the particular node.
  • the parameters of each node may be determined.
  • the parameters of each node may include the parameters that determine the (conditional) random distribution of each node.
  • the conditional random distribution of a node may have the form:

    $P_x \sim N\left(mean_x + \sum_{i=1}^{n} d_i \cdot Node_i,\; stdev_x^2\right)$

  • wherein $P_x$ is the value of a particular node in the network, this particular node being denoted by $Node_x$,
  • $N(\mu, \sigma^2)$ is the Gaussian normal distribution with mean $\mu$ and standard deviation $\sigma$,
  • the parameters of a node $Node_x$ may be considered to be the mean $mean_x$, the number of parent nodes $n$, the parent nodes $Node_i$ (for all $i$ from 1 to $n$), the direct effect $d_i$ of each parent node (for all $i$ from 1 to $n$), and the standard deviation $stdev_x$.
  • the number of parent nodes $n$ and the parent nodes $Node_i$ (for all $i$ from 1 to $n$) themselves may be regarded to define the structure of the network, and they may be determined using the BN network structure learning algorithm, for example the Hill-Climbing algorithm.
  • the remaining parameters may be determined for any given structure of the network. For example, these parameters may be estimated in every iteration after the transformation has been applied during the Hill-Climbing procedure, to assess the score of the network.
  • For continuous nodes, the maximum likelihood estimation (MLE) method may be advantageously used. For discrete data, Bayesian parameter estimation may be more suitable.
  • the maximum likelihood estimation method is known in the art per se. The skilled person is able to apply the maximum likelihood estimation method in view of the present disclosure.
  • a number of further steps may be performed, which may include steps of model diagnosis, model outputs computation, and insights generation.
  • a few exemplary processing tasks are described that can be performed after the BN has been created. It is possible to perform some of these steps during the iterations of the Hill-Climbing method, for example to determine the network score.
  • the following calculations may be performed to assess the network performance.
  • the network score may be determined using a suitable expression.
  • a suitable example of a network score is based on the Bayesian Information Criterion (BIC). However, this is not a limitation.
  • the network score is a goodness-of-fit statistic that measures how well the network represents the dependence structure of the data.
  • the network score can be positive or negative, depending on the input data and network structure. For example, while comparing multiple differently structured networks with the same input data, the larger the score, the better is the particular network. In other implementations, it may be the other way round: the smaller the score, the better is the particular network.
  • the network score may be computed through the Bayesian Information criterion (BIC).
  • The network score calculated through the BIC helps to compare multiple networks built on the same input data, and to judge which network is better.
  • the parameter estimation may be performed using any suitable method, such as the MLE method explained above.
  • the network score $NetScore_{BIC}$ may be determined as follows:

    $NetScore_{BIC} = \log L(\theta \mid x) - \frac{d}{2} \log n$

    wherein
  • $L(\theta \mid x)$ is the likelihood of the network $\theta$, given the collection of data $x$;
  • $d$ is the number of arcs in the network (in alternative implementations, $d$ is the total number of parameters of the network);
  • $n$ is the number of observations in the collection of data $x$.
  • the Network Score through BIC may be regarded as a penalization-based score that penalizes an increase of the number of parameters in the network. A minimal computational sketch follows below.
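  • A computational sketch of this score for a linear-Gaussian BN; the `fits` mapping (each node to its fitted conditional mean function and residual standard deviation) is an illustrative assumption, not a data structure from the patent:

```python
# NetScore_BIC = log-likelihood of the data under the fitted conditional
# Gaussians, minus (d/2) * log(n), as in the formula above.
import math

def bic_score(data, fits, num_arcs):
    loglik = 0.0
    for row in data:                       # each row: {node_name: value}
        for node, (mean_fn, stdev) in fits.items():
            mu = mean_fn(row)              # conditional mean given parents
            x = row[node]
            # log density of N(mu, stdev^2) evaluated at x
            loglik += (-0.5 * math.log(2 * math.pi * stdev ** 2)
                       - (x - mu) ** 2 / (2 * stdev ** 2))
    return loglik - (num_arcs / 2.0) * math.log(len(data))
```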
  • performance estimators may be used to assess the quality of the BN. These performance estimators may be used as an alternative for the BIC in certain embodiments. However, these performance estimators may also be used as a general indication of how much confidence one may have in the model.
  • the Mean Absolute Percentage Error (MAPE) measures the average percentage error between the actual values of a driver and the fitted values obtained through the network. The lower the MAPE, the better the fit. It may be defined as:

    $MAPE = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{e_t}{y_t} \right|$

  • $y_t$ denotes the actually observed value for an observation $t$ of a node,
  • $e_t$ denotes the error between the value predicted by the network and the actually observed value $y_t$. A one-function sketch follows below.
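  • A one-function sketch of this definition (observations with a zero actual value are skipped to avoid division by zero; that guard is an implementation choice, not from the patent):

```python
# Mean Absolute Percentage Error between actual and fitted values.
def mape(actual, fitted):
    terms = [abs((y - f) / y) for y, f in zip(actual, fitted) if y != 0]
    return 100.0 * sum(terms) / len(terms)

print(mape([10.0, 20.0, 30.0], [11.0, 19.0, 33.0]))  # ~8.33 (percent)
```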
  • arc strength may be computed through BIC.
  • the arc strength indicates the absolute increase in the overall network score through the removal of this arc; the arc strength can possess either positive or negative values. So, from the definition, it is evident that the smaller the numeric value of the arc strength (considering the magnitude as well as the sign), the more significant the arc is.
  • Arc Significance is a unique integral number assigned to an arc from the range [1, e], wherein e is the total number of arcs in the network.
  • the values 1 and e (>1) indicate the most and least significant arcs, respectively, according to the above-mentioned arc strength.
  • the arc significance numbers the arcs in the network in order of their significance, in decreasing order.
  • MI for Arcs: The mutual information MI(A, B) for an arc A → B measures the amount of information obtained about the driver B through knowing the driver A. In a way, it quantifies the redundancy/importance of the relationship between the variables A and B.
  • z-score for Arcs: The z-score for an arc A → B tests the hypothesis of whether the MI(A, B) value is zero or not. A high z-score strengthens the reliability of the significance of arc A → B.
  • the key nodes for the target may be identified through an Importance Score of each node with respect to the target.
  • the computation of such an importance score is described below:
  • Step 1: Identify all paths which originate from the node A and go to the target node X.
  • Step 2: Take the weighted average of all the arc strengths of the arcs occurring in a path, the weights being the inverse of, or inversely proportional to, the arcs' Significance score. This weighted average is termed the path strength.
  • Step 3: Compute the Importance score of the node A as the simple average of the path strengths of all the paths from A to the target node X.
  • Step 4: Rank each node by assigning a Significance Score in the range [1, (n−1)], where n is the total number of nodes in the network.
  • the values 1 and (n−1) indicate the most and least significant node (driver), respectively, where n is a positive integer and represents the number of nodes in the network, including the target node. A sketch of this computation is given below.
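  • A sketch of steps 1-3, assuming `strength` and `significance` map each arc to its arc strength and significance rank; the simple depth-first path enumeration is illustrative:

```python
# Importance score of `node` with respect to `target` (steps 1-3 above).
def all_paths(arcs, src, dst, path=None):
    path = path or [src]
    if src == dst:
        yield path
        return
    for a, b in arcs:
        if a == src and b not in path:            # avoid revisiting nodes
            yield from all_paths(arcs, b, dst, path + [b])

def importance(arcs, strength, significance, node, target):
    path_strengths = []
    for p in all_paths(arcs, node, target):
        hops = list(zip(p, p[1:]))                # arcs along this path
        if not hops:
            continue
        w = [1.0 / significance[h] for h in hops] # weights ~ 1/significance
        ps = sum(wi * strength[h] for wi, h in zip(w, hops)) / sum(w)
        path_strengths.append(ps)                 # weighted-average strength
    # Step 3: simple average of the path strengths (0.0 if no path exists).
    return sum(path_strengths) / len(path_strengths) if path_strengths else 0.0
```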
  • the direct effect of a node A on a node B may be computed from the node coefficient, the actual values of A, and the fitted values of B through the following formula:

    $DirectEffect(A \to B) = \left| \beta \cdot \frac{\sum_{t=1}^{d} a_t}{\sum_{t=1}^{d} fitted(b_t)} \right|$

  • $\beta$ is the coefficient of A in the conditional Gaussian distribution equation of B,
  • $a_t$, $fitted(b_t)$ are respectively the actual values of A and fitted values of B,
  • $d$ is the dimension/length of the nodes A and B, which is the number of observations for nodes A and B.
  • the direct effect quantifies the overall effect of the node A on the node B, as obtained from the network structure and input data. It is expected to be a non-negative quantity.
  • the direct contribution of A on B is the percentage direct effect of A on B versus the summed direct effects of all nodes on B. If any direct effect is found to be negative, the respective contributions are shown as “Not Applicable (NA)”.
  • Indirect effects and indirect contributions may also be determined. For example, if a node A has one or more indirect paths to the node B, then the indirect effect of A on B may be computed through the following two steps: a) multiplying the direct effect coefficients of the arcs in a path, and b) adding the values obtained in the previous step, across all paths. A sketch of this computation follows below.
  • the indirect contribution of A on B may be regarded to be the percentage indirect effect of A on B.
  • if there is no indirect path from A to B, the indirect effect as well as the indirect contribution of A on B are zero.
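  • A sketch of the two-step indirect-effect computation, with `effects` mapping each arc to its direct effect coefficient; the path enumeration mirrors the importance-score sketch above:

```python
# Indirect effect of a on b: multiply direct-effect coefficients along each
# indirect path, then add across all such paths.
def _paths(arcs, src, dst, path=None):
    path = path or [src]
    if src == dst:
        yield path
        return
    for x, y in arcs:
        if x == src and y not in path:
            yield from _paths(arcs, y, dst, path + [y])

def indirect_effect(arcs, effects, a, b):
    total = 0.0
    for p in _paths(arcs, a, b):
        if len(p) <= 2:
            continue                     # skip the direct arc a -> b, if any
        prod = 1.0
        for hop in zip(p, p[1:]):
            prod *= effects[hop]         # step a): multiply along the path
        total += prod                    # step b): add across paths
    return total
```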
  • the predicted values may be computed by plugging in the new values for the parents of the node in the local probability distribution of the node as obtained from the fit object.
  • the predicted values are computed by averaging likelihood weighting (Bayes-law) simulations performed using all the available nodes as evidence.
  • the number of random samples which are averaged for each new observation may be controllable.
  • the prediction of the target value (target node) may be the expected value of the conditional distribution of the target.
  • BN may be applied, for example, for finding the dependence structure among a plurality of drivers, performing inference from a built BN structure, or obtaining joint probability of a set of drivers, considered together.
  • Discrete BN, with discrete-valued nodes, is the commonly known structure for inference. It is relatively easy to implement, compared to the continuous counterpart.
  • Discrete BN may be used to generate conditional probability tables (CPT), which may be sufficient for inferential activities.
  • CPT conditional probability tables
  • discrete BN has the following inherent limitations. For n (binary-valued) nodes, the joint CPT is of size $2^n$, which may be unmanageable even for a moderate number of nodes.
  • Many real-world features are continuous, which cannot be handled directly through a discrete BN. Moreover, a discrete BN may not be suitable for purposes other than inference.
  • the continuous BN, using the techniques disclosed herein, may be advantageously used for finding elasticities, performing simulations, performing forecasting, etc.
  • The potential of the continuous BN, as disclosed herein, is to perform a multitude of crucial tasks and to build an end-to-end solution entirely driven by the continuous BN framework. This is unique by itself, considering that perhaps no other industry has used continuous BN successfully.
  • continuous BN when applied using the techniques described herein, may provide a better understanding of causal effects among the nodes.
  • the techniques enable computing an elasticity of each node with respect to the target node. This can provide a way to control the target node by manipulating the values of the other nodes. Based on the importance scores, it becomes possible to find the key nodes that have the greatest influence on the target node.
  • the coefficient representing the direct effect of a node A on another node B provides direct information about the elasticities, thereby making the tool highly suitable for finding elasticities.
  • the mean of a normal distribution of a node may depend linearly on the value of a parent node.
  • the elasticity may be calculated, for example, for an arc A → B, as

    $elasticity(A \to B) = d_e \cdot \frac{A}{B}$

    wherein A is the value of node A, B is the value of node B, which depends on the value of A, and $d_e$ is the direct effect of A on B.
  • the direct effect by itself may be regarded as a measure of the sensitivity of B with respect to A.
  • Elasticities of nodes that are indirectly connected to the target node may be calculated, for example, by combining the direct effects of the nodes on the path from a node to the target node, for example by multiplication, as sketched below.
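  • A small sketch of both calculations; `direct_effect` and the per-arc `effects` values correspond to the coefficients discussed above, and the use of mean node values is an illustrative reading of "the value of node A/B":

```python
# Elasticity of B with respect to A for a direct arc A -> B, and for a chain
# of arcs connecting a node to the target indirectly.
def elasticity(direct_effect, value_a, value_b):
    return direct_effect * value_a / value_b

def path_elasticity(arcs_on_path, effects, value_src, value_dst):
    combined = 1.0
    for arc in arcs_on_path:
        combined *= effects[arc]         # combine direct effects by product
    return combined * value_src / value_dst
```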
  • An out-of-range simulation may be performed through changing the nodes’ values.
  • hybrid modeling becomes possible through considering a plurality of different variables. It becomes possible to provide forecasts for the target node, based on forecasts for the other nodes. Further, it becomes possible to determine optimized values for the nodes, based on certain assumptions on the nodes.
  • the BN may be built and used to model biological systems, for example as an aid for diagnosis, by setting nodes corresponding to stimuli, symptoms, and/or internal properties of the body.
  • the BN may be built to model components of an apparatus, machine, or factory, ecological systems, meteorological systems, demographic systems, and can be used for purposes of, for example, text mining, feature recognition and extraction.
  • a method for predicting a variable in a system comprising a plurality of nodes, each node representing a continuous probability distribution function of a certain property, the method comprising: collecting a set of observed values of certain properties; determining a blacklist of arcs, the arcs in the blacklist identifying pairs of nodes that will not have a connecting arc in the model, or a whitelist of arcs, the arcs in the whitelist identifying pairs of nodes that certainly will have a connecting arc in the model; learning a structure of the network by determining a plurality of directed arcs representing that the probability distribution function of a certain first node is conditional on the property of a certain second node, taking into account the blacklist or the whitelist and the set of observed values; and learning probability distribution function parameters of the nodes of the network, based on the structure of the network and the set of observed values.
  • An apparatus may be provided for building a Bayesian Belief Network that models a system, wherein the apparatus is configured to: identify a plurality of nodes of a Bayesian belief network, each node representing a random variable having a continuous random distribution; select a target node among the plurality of nodes, the target node representing a state variable to be predicted; blacklist arcs pointing away from the target node to any other node; learn a network structure by identifying arcs between pairs of nodes that explain a system behavior, excluding the blacklisted arcs; learn conditional probability distributions of the continuous random variables of the nodes, wherein the probability distribution of the continuous random variable of a first node is conditional on at least one second node if and only if an arc points to the first node; and predict the value of the random variable of the target node based on a given value of at least one other node of the network.
  • Constraints on the model structure are imposed using domain knowledge, which forces the network structure to converge to the target variable. Without this feature, if the Bayesian network learning is allowed to proceed unhindered, it might not converge to the target variable as desired.
  • the Bayesian network can be used to better control the system.
  • the Bayesian network can be used to influence the variable to be predicted or target node by controlling the values of the other variables or nodes.
  • the method may further comprise identifying a target value for the target node.
  • the method may comprise determining values for at least some of the remaining nodes based on the Bayesian network and the target value for the target node. For example, values of the remaining nodes may be chosen at random or using an optimization algorithm and the corresponding value of the target node may be computed, and the values for the remaining nodes resulting in the target node becoming closest to the target value may be selected.
  • the method may start from initial values for the variables in the system, and adjust certain ones of the values using the calculated elasticities to bring the target node closer to the target value.
  • the method may comprise controlling some of the variables in the system to set the variables to their determined values. This way, it is likely that the target variable will move towards its target value.
  • the computer program product may comprise a computer program stored on a non-transitory computer-readable medium.
  • the computer program may be represented by a signal, such as an optic signal or an electro-magnetic signal, carried by a transmission medium such as an optic fiber cable or the air.
  • the computer program may partly or entirely have the form of source code, object code, or pseudo code, suitable for being executed by a computer system.
  • the code may be executable by one or more processors.
  • Fig. 1 illustrates a simplified Bayesian belief network, provided for the purpose of illustration.
  • the network comprises a node E denoting environmental potential and a node G representing genetic potential.
  • the environmental potential E and genetic potential G may influence the condition of the vegetative organs V, in particular the reproduction organs of the plants.
  • the condition of the vegetative organs V may influence the number of seeds N generated per plant as well as the mean weight W of the seeds generated.
  • the number of seeds N and the seeds mean weight W may determine the crop growth C in terms of total mass of crop.
  • a number of drivers such as environmental potential E, genetic potential G, vegetative organs V, number N and mean weight W of seeds, may influence the crop growth C.
  • the drivers may be represented by nodes in a Bayesian belief network. The relationships between these drivers may be found by using the techniques disclosed herein, so that the crop growth C may be predicted using given values for (some of) the drivers. Also, the most important drivers may be identified. For example, some drivers may be controllable to influence a particular target quantity. In certain cases environmental circumstances can be adapted or genetic potential can be changed by genetic treatment or cross-fertilization.
  • the Bayesian belief network can predict the changes in crop growth C caused by such changes.
  • the Gaussian BN follows a hierarchical regression structure, defined by the nodes and coefficients (direct effects) in the conditional distribution of each node.
  • each node that has one or more parent nodes may have a conditional Gaussian distribution that may be obtained through running local linear regression among the node and its immediate parents, the node being the target of the local linear regression.
  • A possible general structure of the regression equation of a node is as given in Equation 1:

    $P_x = mean_x + d_1 \cdot Node_1 + d_2 \cdot Node_2 + \dots + d_n \cdot Node_n + \varepsilon, \quad \varepsilon \sim N(0, stdev_x^2) \quad \text{(Equation 1)}$

  • the standard deviation $stdev_x$ may be calculated as the standard deviation of the residuals, which are the difference between actual and fitted values of $P_x$.
  • the values of $mean_x$ and $d_i$ may be determined by performing linear regression as a form of maximum likelihood estimation. For example, the linear regression may be performed for each node separately, starting with a node that only has parent nodes that do not have parent nodes themselves (such as node V in Fig. 1, whose parents E and G are root nodes). Every time, the regression may be performed for a node that only has parent nodes that either do not have parent nodes themselves or for which the regression analysis has already been done. This may be repeated until all the nodes have been processed. A minimal sketch of the per-node regression follows below.
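  • A minimal sketch of the per-node regression, using numpy's least-squares solver (an implementation choice; the patent does not prescribe a library):

```python
# Fit mean_x, the direct effects d_i, and stdev_x for one node by ordinary
# least squares of the node on its immediate parents.
import numpy as np

def fit_node(data, node, parents):
    # data: dict mapping node name -> 1-D numpy array of observations
    y = data[node]
    X = np.column_stack([np.ones(len(y))] + [data[p] for p in parents])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    return {"mean": coef[0],                               # mean_x (intercept)
            "direct_effects": dict(zip(parents, coef[1:])),  # the d_i
            "stdev": float(np.std(y - fitted))}            # std. dev. of residuals
```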
  • Prediction in the BN may be performed using the hierarchical structure in a top-to-bottom manner by predicting the children at each level from its immediate parents and then propagating the predictions downwards to the next level.
  • the prediction of the target node: Crop (C) may be performed as follows:
  • First, take the given values of the root nodes E, G and use them to predict all their children, i.e. V in this case.
  • the root nodes are the nodes that do not have parent nodes.
  • if values are provided for the immediate parents of the target node, the network may directly use those values to predict the target and will, in certain embodiments, ignore other values.
  • the target node C may be predicted using only the given values of N and W, while ignoring the other nodes' values. A sketch of this top-down prediction follows below.
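  • A sketch of this top-to-bottom propagation; `params[node]` is assumed to be the dictionary produced by the fitting sketch above, and `order` a topological ordering of the nodes:

```python
# Predict all node values by walking the network from roots to leaves:
# evidence values are used directly, other nodes get their conditional mean.
def predict(order, parents, params, evidence):
    values = {}
    for node in order:                    # parents are visited before children
        if node in evidence:
            values[node] = evidence[node] # given values are used as-is
        else:
            p = params[node]
            values[node] = p["mean"] + sum(
                p["direct_effects"][q] * values[q] for q in parents[node])
    return values
```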
  • the Bayesian network contains a model of a system to derive information about that system.
  • a Bayesian network can provide information about sounds, images, electromagnetic signals, chemical compounds, biological systems, or economic data, for example.
  • the network score may be calculated as follows for the example crop growth network, using the chain-rule factorization of the likelihood:

    $\log L(\theta \mid x) = \sum_{\text{observations}} \bigl[ \log P(E) + \log P(G) + \log P(V \mid E, G) + \log P(N \mid V) + \log P(W \mid V) + \log P(C \mid N, W) \bigr]$

  • wherein E, G, V, N, W, and C are the nodes of the network, as illustrated in Fig. 1, and $NetScore_{BIC} = \log L(\theta \mid x) - \frac{d}{2} \log n$ as defined above.
  • Fig. 2 shows a flowchart illustrating a method of building a Bayesian belief network.
  • the method starts in step 201 with providing a dataset.
  • the dataset may comprise example states of a system (e.g. a system of interacting food ingredients, a biological system, mechanical system, e.g. components of a car or factory, or another kind of system).
  • the states may comprise a combination of parameter values representing certain properties. Those parameter values may be continuous in nature.
  • the dataset may contain observed values representing states of the real world system.
  • the observed values may be measured by a detector, such as a sensor.
  • a temperature may be sensed by a thermometer.
  • the data generated by the detector may be transmitted to a computer system that stores the observed values in a dataset.
  • the observed values may alternatively be entered into the computer.
  • the data may be preprocessed in step 202, for example to remove outliers, handle missing values, and the like. This is elaborated elsewhere in this disclosure.
  • a blacklist and optionally a whitelist may be created.
  • the blacklist contains arcs that cannot occur in the Bayesian belief network and that will not be added by the subsequent learning procedure 204.
  • the whitelist contains arcs that are included in the Bayesian belief network, and that will not be removed by the subsequent learning procedure 204. This can help to incorporate a priori knowledge in the network structure.
  • the blacklist can contain all the arcs pointing away from the target node. In the example of Fig. 1, the target node is the crop growth C.
  • the target node is the node for which we would like to make a prediction based on the available data about the values of the other nodes.
  • the network structure is learned in step 204. This may be performed using an iterative process, such as Hill-Climbing, as will be elucidated hereinafter with reference to Fig. 3.
  • the node parameters are learned in step 205. This step may involve learning the conditional probability distributions for each node. This may be performed, for example, using a linear regression technique that is disclosed elsewhere in this description.
  • Fig. 3 illustrates an example implementation of step 204 of learning the network structure.
  • the nodes of the network may be a given, and the connections between the nodes (the arcs) may be determined as part of this procedure.
  • the process starts in step 301 by determining an initial network structure.
  • the initial network structure is determined randomly, meaning that the absence or presence of a particular arc in the network depends on some random value generator.
  • the direction of each arc may also be determined randomly.
  • the whitelisted arcs are always included, and the blacklisted arcs are never included.
  • the arcs are chosen such that the resulting network represents an acyclic graph.
  • An acyclic graph may be obtained, for example, by removing arcs from a random network structure until the remaining arcs form an acyclic graph, or by construction as sketched below.
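  • One way to construct such a random initial DAG is sketched here: draw a random node ordering and admit only forward-pointing arcs, which is acyclic by construction (whitelisted arcs are assumed to be orientable consistently with some ordering; this is an illustrative simplification):

```python
# Random acyclic initial structure respecting the blacklist and whitelist.
import random

def random_dag(nodes, blacklist, whitelist, p=0.3, seed=0):
    rng = random.Random(seed)
    order = list(nodes)
    rng.shuffle(order)
    rank = {v: i for i, v in enumerate(order)}
    arcs = set(whitelist)                 # whitelisted arcs always included
    for a in nodes:
        for b in nodes:
            if rank[a] < rank[b] and (a, b) not in blacklist \
                    and rng.random() < p:
                arcs.add((a, b))          # only "forward" arcs: no cycles
    return arcs
```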
  • a score of the initial network structure is determined.
  • the network parameters are estimated using the initial network structure as a given.
  • a network score may be calculated.
  • the network score may be based on the Bayesian Information Criterion (BIC), which is elaborated elsewhere in this document.
  • BIC Bayesian Information Criterion
  • the network score of the initial network structure is set as the ‘maximum network structure score’, and stored in memory.
  • the network structure is updated. For example, one or more arcs are added and/or one or more arcs are removed. Also, the direction of an arc may be swapped. The arcs to be added or removed may be selected randomly, for example. Alternatively, the arcs to be added or removed may be selected based on an arc strength or an arc significance, possibly in combination with a random variable. When updating the current network, it is ensured that the updated network structure represents an acyclic graph.
  • the updating the current network structure comprises adding an arc that is not on the blacklist to the network structure, deleting an arc that is not on the whitelist from the network structure, or reversing the direction of an arc that is not on the whitelist and of which arc the reversed direction is not on the blacklist.
  • a network score of the updated network structure is determined.
  • optimal network parameters are estimated. That is, using the current network structure as a basis, the (conditional) probability distribution for each node is estimated. For example, for each node, a mean and standard deviation of a normal distribution are estimated. Further, the mean of each node may be a linear function of the values of the parent nodes. The coefficients of this linear function may be estimated for each node that has one or more parent nodes.
  • the ‘quality’ of the resulting network is estimated in the form of a network score.
  • This network score may be determined, for example, by the Bayesian Information Criterion (BIC). This criterion is described in greater detail elsewhere in this description.
  • BIC Bayesian Information Criterion
  • In step 306, the network score of the updated network structure is stored as the new maximum network structure score. Also, the current network structure is set equal to the updated network structure (corresponding to the maximum network structure score).
  • If the network score of the updated network structure is not larger than the previously stored maximum network structure score, the current network is kept in step 307. In this case, the modifications of the updated network are discarded.
  • In step 308, it is determined whether more iterations should be made, to explore more different network structures. For example, this may be determined based on a stopping criterion, such as a maximum number of iterations, a minimal required improvement of the network score, a minimum acceptable value of the network score, or a combination of the above.
  • In step 310, it may be decided whether the network structure should be reset. For example, this may be decided based on a stopping criterion, such as a maximum number of iterations or a minimal required improvement of the network score, or a combination thereof.
  • the process may proceed from step 303 by implementing an incremental update to the current network.
  • If it is determined in step 310 that a reset of the network structure is to be carried out, the process proceeds to step 311.
  • the current network structure and the corresponding maximum network structure score are stored as a candidate network structure.
  • the network structure is reset to a new initial network structure, which may be determined randomly in a similar way as the initial network structure was set in step 301 .
  • the initial network structure score may be determined and set as the ‘maximal network structure score’, similar to step 302.
  • the process proceeds from step 303.
  • Fig. 4 illustrates an example implementation of step 205 of learning the node parameters. This process of learning the node parameters may also be done in steps 302 and 304 as a preliminary step when computing the network structure score.
  • a linear function of the immediate parent nodes is used to define a parameter of a conditional probability density function of a node. For example, in case of a Gaussian probability distribution, the mean may be a linear function of the immediate parent nodes, e.g.

    $\mu = mean_x + d_1 \cdot Node_1 + d_2 \cdot Node_2 + \dots + d_n \cdot Node_n$

  • $\mu$ is the mean of the Gaussian distribution,
  • $mean_x$ is the mean of the Gaussian distribution without regard of the parent nodes,
  • $d_i$ denotes the influence of the parent node $Node_i$.
  • the coefficients of the linear function, in the above case the coefficients $mean_x$ and $d_1, d_2, \dots, d_n$, are the coefficients that should be fitted in order to determine the mean of the conditional Gaussian distribution of the node. These may be fitted using a maximum likelihood approach based on linear regression, as is disclosed in greater detail elsewhere in this disclosure.
  • Other probability distribution functions, such as an exponential distribution, may be used instead of the Gaussian distribution.
  • non-linear functions may be used to compute the parameter of the probability density function.
  • a quadratic function or any polynomial function could be used instead of a linear function.
  • Fig. 5 shows a block diagram illustrating an apparatus 500 for building a Bayesian belief network to predict a variable in a system.
  • the apparatus 500 may be implemented using computer hardware, which may be distributed hardware.
  • the apparatus 500 comprises a processor 501 or a plurality of cooperating processors.
  • the apparatus further comprises a storage 503, for example a computer memory and/or a storage disk.
  • Computer instructions may be stored in the storage 503, in particular on a non-transitory computer-readable medium.
  • a dataset with observations for the nodes may be stored in the storage 503.
  • the apparatus may comprise an input device 504 for receiving a user input to control the apparatus and a display device 505 to display outputs of the apparatus.
  • the apparatus 500 may further comprise a communication port 502 for connecting to other devices and exchange of data and control signals.
  • the communication port 502 comprises a network interface for wired or wireless communication and/or a universal serial bus (USB).
  • the instructions in the storage 503 may contain modules implementing the steps of one of the methods set forth herein.
  • the instructions may cause the processor 501 to control receiving, via the communication port 502, observations for the nodes of the Bayesian network, and store them as a set of observed values in the storage 503. These observed values may be received, for example, from a measurement device or sensor connected to the apparatus 500 via the communication port 502.
  • the instructions may cause the processor 501 to determine a blacklist of arcs, wherein the arcs in the blacklist identify arcs not to be included in the Bayesian belief model, the blacklist including all directed arcs from the target node to any other node of the Bayesian belief network, learn a structure of the Bayesian belief network based on a set of observed values for the nodes by determining a plurality of directed arcs of the Bayesian belief network linking the nodes of the network to form an acyclic graph, the plurality of directed arcs not including arcs corresponding to the arcs in the blacklist, and learn parameters of conditional continuous probability distribution functions of the nodes of the network, based on the structure of the network and the set of observed values.
  • Fig. 6 illustrates several examples of transformations of the network structure during the learning procedure for learning the network structure.
  • an arc may be added 622, reversed 623, or deleted 624.
  • an arc connects node A to node C
  • an arc 611 connects node B to node C
  • an arc connects node C to node D.
  • an arc 612 from node B to node D may be added 622 to the network.
  • the direction of an arc 611 may be reversed 623, so that the arc 611 is replaced by an arc 613 that connects node C to node B.
  • the arc 611 may be deleted 624.
  • Such transformations can be done with arcs connecting, in principle, any first node to any second node, as long as any whitelisted arcs are not removed and any blacklisted arcs are not added.
  • the direction of an arc may not be reversed if the other direction of that arc is blacklisted.
  • Fig. 7 shows an example of a Bayesian network 700 for food development.
  • creamy soup powder can contain many ingredients such as creamer, vegetable granules, binder, NaCl, and flavouring ingredients like herbs, spices, yeast extract, and monosodium glutamate.
  • large particulates of mushroom, carrot, cauliflower florets, broccoli florets, whole garden peas, and sweet corn can be used.
  • cottage cheese powder, cream powder, and/or butter powder can be added.
  • the creamer can provide the creamy texture and the flavour of the soup.
  • the herbs and spices can provide various tastes such as sweet, tangy and spicy, provided in different degrees, to enhance the flavour.
  • the shelf life of the ingredients and the resulting soup powder is important to deliver the intended product.
  • a creamy binder selected from non-dairy creamer and dairy-based powder, such as cheese or butter powder, can have a great impact.
  • the vegetable particulate also impacts the shelf life.
  • spices, yeast and other ingredients generally have very little or no impact on the shelf life of the mixture.
  • creamy soup powder is a complex, well-balanced blend of the various ingredients.
  • flavour, aroma, texture, thickness and shelf life can greatly influence customer satisfaction.
  • consumer preference score is the target node 760.
  • the amount of each ingredient, expressed in weight percentage, for example, is represented by controllable nodes 702-715.
  • Intermediate nodes 751-758 representing properties such as flavour, aroma, texture, thickness, and shelf life, may be influenced by the controllable nodes (the weight percentage of each ingredient), and in turn these intermediate nodes 751-758 influence the consumer preference score, represented by the target node 760.
  • the available ingredients may be combined using different weight percentages.
  • This experimental data contains the observed values for the nodes. If we have this experimental data for multiple trials, we can create a model to predict the consumer preference score from the composition of ingredients. This way we can predict the proportion of the ingredients to achieve the desired blend of the soup powder which can be quantified as consumer preference index.
  • we can incorporate rules about known relationships, such as that a creamy binder, vegetable particulate, and dairy powder can have an impact on texture and thickness. For example, if certain ingredients are known to influence a particular node (e.g. the weight percentage of cream powder influences the texture), the arc from the cream powder node to the texture node may be included in the whitelist.
  • the arcs pointing from the nodes of those ingredients to the node representing the shelf life can be whitelisted. Since it is known that certain other ingredients do not influence the shelf life, the arcs pointing from the nodes of these other ingredients to the node representing the shelf life can be blacklisted.
  • certain nodes can be restricted as regards their value.
  • the weight percentage of cream powder can be restricted to a given range, for example the range from 1.01% to 5.89%.
  • rules regarding the weightage restriction of certain the ingredients can be incorporated in the network using the blacklist. The unknown relationships can be found by learning the Bayesian network structure using the method and systems disclosed herein.
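Purely as an illustration, such domain rules could be written down as arc lists before structure learning starts. The node names below mirror Fig. 7, but the concrete rules and their encoding are assumptions made for this sketch, not requirements of the method:

```python
TARGET = "consumer_preference_score"

nodes = ["wt_cream_powder", "wt_binder", "wt_butter_powder", "wt_spices",
         "wt_yeast_extract", "texture", "thickness", "shelf_life", TARGET]

# Known relationships: these arcs must be present in the learned network.
whitelist = {
    ("wt_cream_powder", "texture"),      # cream powder influences texture
    ("wt_binder", "thickness"),          # binder influences thickness
    ("wt_butter_powder", "shelf_life"),  # dairy powder influences shelf life
}

# Known non-relationships: these arcs may never be added ...
blacklist = {
    ("wt_spices", "shelf_life"),         # spices do not influence shelf life
    ("wt_yeast_extract", "shelf_life"),  # nor does yeast extract
}
# ... together with every arc pointing away from the target node.
blacklist |= {(TARGET, other) for other in nodes if other != TARGET}
```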
Examples of variables that can be used, as illustrated in Fig. 7, are: wt% of cottage cheese 702; wt% of cream powder 703; wt% of butter powder 704; wt% of herbs 705; wt% of spices 706; wt% of mushroom flakes 707; wt% of carrot cubes 708; wt% of broccoli florets 709; wt% of whole garden peas 710; wt% of sweet corn 711; wt% of yeast extract 712; wt% of binder 713; wt% of NaCl 714; wt% of monosodium glutamate 715; wt% of creamer 751; Texture (represented by a numeric value) 752; Thickness (represented by a numeric value) 753; Sweet taste (represented by a numeric value) 754; Spicy taste (represented by a numeric value) 755; Sour taste (represented by a numeric value) 756;
The arcs (arrows) drawn in Fig. 7 are shown merely as examples; the actual arcs of the Bayesian network may be determined by learning the structure of the Bayesian network based on the set of observed values for the nodes.

Abstract

A method of predicting a state of a system by building a probabilistic hierarchical model is disclosed. The method comprises determining (201) a set of observed values for the variables represented by the nodes. The method comprises determining (203) a blacklist of arcs, wherein the arcs in the blacklist identify arcs to be excluded from the Bayesian network, the blacklist including at least all directed arcs from the target node to any other node of the Bayesian network. The method comprises learning (204) a structure of the Bayesian network based on the set of observed values for the nodes by determining a plurality of directed arcs of the Bayesian network linking the nodes of the network to form an acyclic graph, the plurality of directed arcs not including arcs corresponding to the arcs in the blacklist, and storing the structure of the Bayesian network in a memory (503).

Description

PREDICTING THE STATE OF A SYSTEM WITH CONTINUOUS VARIABLES
Field of the invention
The invention relates to predicting a state within a system.
Background of the invention
In development of food products, many ingredients will have to be considered that each have a different effect on the end product. Moreover, the ingredients may have a complex system of interactions, so that it may be difficult to find a composition that provides the ‘best’ result in terms of, for example, flavor, shelf life, or even overall consumer satisfaction. Such a complex system can be modeled by a number of input variables, or control variables, at least one output variable, representing the behavior and/or output of the system, internal variables representing states inside the system, and external variables representing states of external circumstances, such as temperature, that influence the system but cannot be controlled by a system operator.
For efficient control of the system, it is imperative that the relationship between the control variables and the output variables is predictable. Therefore, the system behavior can be modeled, for example by means of a Bayesian network, to simulate the system behavior and predict the output variable. However, creating an accurate Bayesian network model can be a tedious process.
A Bayesian Belief Network, or Bayesian Network (BN) for short, is a probabilistic hierarchical model that may be primarily used to represent a causal dependence (parent-child) structure among a set of system parameters. A BN may be represented through a set of random variables (forming nodes of the BN) and their conditional dependencies (forming directed edges of the BN). The probability of occurrence of each state of a node is known as its “belief”.
Charniak, E. (1991), “Bayesian networks without tears”, Al magazine, 12(4), 50, discloses algorithms for discrete Bayesian networks.
Cheng, J., & Greiner, R. (2001, June), “Learning Bayesian belief network classifiers: Algorithms and system”, In Conference of the Canadian Society for Computational Studies of Intelligence (pp. 141-151), Springer Berlin Heidelberg, discloses learning predictive classifiers based on Bayesian belief networks (BN) using datasets that have few or no continuous features, to avoid information loss in discretization. The paper discloses discretizing continuous features using a discretization utility. Bayesian networks provide an established way to represent causal relationships using a structure. However, it can cost a lot of resources to generate and to store a Bayesian network. Moreover, it may be difficult to build a Bayesian network with sufficient predictive qualities.
Summary of the invention
It would be advantageous to provide an improved technology for Bayesian belief Networks. In order to address this concern, a method is provided of predicting a state of a system by building a probabilistic hierarchical model comprising a Bayesian network comprising a plurality of nodes, each node representing a variable, a variable corresponding to the state to be predicted being represented by a target node of the Bayesian network, the method comprising determining, by a processor, a set of observed values for the variables represented by the nodes; determining, by the processor, a blacklist of arcs, wherein the arcs in the blacklist identify arcs to be excluded from the Bayesian network, the blacklist including at least all directed arcs from the target node to any other node of the Bayesian network; learning, by the processor, a structure of the Bayesian network based on the set of observed values for the nodes by determining a plurality of directed arcs of the Bayesian network linking the nodes of the network to form an acyclic graph, the plurality of directed arcs not including arcs corresponding to the arcs in the blacklist, and storing the structure of the Bayesian network in a memory; learning, by the processor, parameters of conditional continuous probability distribution functions of the nodes of the network, based on the structure of the network and the set of observed values, and storing the parameters in the memory and preferably wherein the learning parameters comprises setting, by the processor, at least one parameter of the conditional continuous probability distribution function of a first node of the Bayesian network as a linear function of at least one immediate parent node of the first node, and wherein the learning the parameters further comprises computing, by the processor, at least one coefficient of the linear function using linear regression based on the set of observed values and the structure of the Bayesian network.
In many cases, discretization of continuous variables does not provide a satisfactory result when predicting system variables with BNs. Therefore, the present method is provided to help build BNs that can handle continuous variables directly, without a discretization step. The limitation of Bayesian networks to discretely valued nodes makes them less suitable for many applications. Such limitations may be caused by the exponentially increasing size of probability maps when the number of possible states increases. The inventors have found that, for a particular group of applications in which it is desired to predict the change of the state of a system parameter induced by changing another system parameter, continuous variables are better able to predict this change. Unlike the prior art, which has emphasized optimizations for discrete BNs and discretization of any continuous variables, the present inventors have provided a proper solution for BNs with continuous variables. This allows a deeper analysis of the causal relationships between the variables, such as elasticities.
The step of learning the conditional continuous probability distributions may comprise maximum likelihood estimation of at least one parameter of the conditional continuous probability distributions. This way the conditional probability distributions may be efficiently estimated, even for continuous probability distributions.
The learning the parameters may comprise setting at least one parameter of the conditional continuous probability distribution function of a first node as a linear function of at least one immediate parent node of the first node. This may facilitate the modeling step and leads to suitable estimations in many cases.
The method may further comprise computing at least one coefficient of the linear function using linear regression based on the set of observed values and the structure of the Bayesian belief network. The linear regression model provides an efficient way to learn the model parameters.
The conditional continuous probability distributions may be Gaussian distributions. This allows to make further assumptions in the fitting procedure that help to make the fitted model parameters reliable and efficient to compute.
The learning the network structure may comprise performing a hill-climbing method comprising finding an optimum score by iteratively modifying the arcs of the network structure. The score of the network structure may be determined after each modification. This may involve fitting the parameters of the conditional probability distributions. This provides an efficient way to learn the network structure.
The hill-climbing method may comprise setting a current network structure to an initial network structure; setting a maximum network structure score to a score of the initial network structure; and repeating the following steps until a stopping criterion is satisfied: (1) updating the current network structure to obtain an updated network structure, while ensuring that the updated network structure represents an acyclic graph, wherein the updating the current network structure comprises adding an arc that is not on the blacklist to the network structure, deleting an arc that is not on a whitelist from the network structure, or reversing the direction of an arc that is not on the whitelist and of which arc the reversed direction is not on the blacklist; (2) computing a score of the updated network structure; and (3) if the score of the updated network structure is larger than the maximum network structure score, setting the current network structure to the updated network structure and setting the maximum network structure score to the score of the updated network structure. This way, the best network structure may be found relatively efficiently. The optimum score may be based on a Bayesian information criterion (BIC) or an arc strength. These were found to be good guiding criteria for finding the network structure.

The method may further comprise providing a dataset comprising combinations of values for the nodes; and preprocessing the dataset to obtain normalized observations corresponding to the plurality of nodes, wherein the preprocessing comprises at least one of missing values treatment, outliers treatment, and data transformation, wherein the learning the network structure and the learning the conditional probability distributions are performed based on the normalized observations corresponding to the plurality of nodes. These normalized observations may greatly improve the performance of the Bayesian belief network in terms of accuracy and convergence of the learning steps.

The method may comprise calculating an elasticity or sensitivity of the first node with respect to a change in at least one immediate (or indirect) parent node based on the at least one coefficient.
The method may further comprise determining a whitelist of arcs, wherein the arcs in the whitelist identify arcs that are to be included in the Bayesian belief network. The whitelist can be used to incorporate prior knowledge in the model, for example, to aid the learning process.
The method may further comprise identifying a target value for the target node; determining a value for at least one of the nodes based on the Bayesian network and the target value for the target node; and controlling at least one variable in the system based on the determined value for the at least one of the nodes. This is a highly effective manner to indirectly influence one variable of the system by directly controlling another variable.

According to another aspect of the invention, a system is provided for predicting a state by building a probabilistic hierarchical model comprising a Bayesian network comprising a plurality of nodes, each node representing a variable, a variable corresponding to the state to be predicted being represented by a target node of the Bayesian network, the system comprising a memory configured to store a set of observed values for the variables represented by the nodes; and a processor configured to: determine a blacklist of arcs, wherein the arcs in the blacklist identify arcs not to be included in the Bayesian model, the blacklist including all directed arcs from the target node to any other node of the Bayesian network; learn a structure of the Bayesian network based on the set of observed values for the nodes by determining a plurality of directed arcs of the Bayesian network linking the nodes of the network to form an acyclic graph, the plurality of directed arcs not including arcs corresponding to the arcs in the blacklist; and learn parameters of conditional continuous probability distribution functions of the nodes of the network, based on the structure of the network and the set of observed values, preferably set, as part of learning the parameters, at least one parameter of the conditional continuous probability distribution function of a first node of the Bayesian network as a linear function of at least one immediate parent node of the first node, and compute, as part of learning the parameters, at least one coefficient of the linear function using linear regression based on the set of observed values and the structure of the Bayesian network.
According to another aspect of the invention, a computer program product is provided comprising instructions stored on a non-transitory computer-readable medium, the instructions being configured to cause a computer system, when executed by the computer system, to perform the steps of: determining a set of observed values for a plurality of variables represented by a plurality of nodes of a Bayesian network, a variable corresponding to a state to be predicted being represented by a target node of the Bayesian network; determining a blacklist of arcs, wherein the arcs in the blacklist identify arcs to be excluded from the Bayesian network, the blacklist including at least all directed arcs from the target node to any other node of the Bayesian network; learning a structure of the Bayesian network based on the set of observed values for the nodes by determining a plurality of directed arcs of the Bayesian network linking the nodes of the network to form an acyclic graph, the plurality of directed arcs not including arcs corresponding to the arcs in the blacklist; and learning parameters of conditional continuous probability distribution functions of the nodes of the network, based on the structure of the network and the set of observed values, preferably wherein the learning the parameters comprises setting at least one parameter of the conditional continuous probability distribution function of a first node of the Bayesian network as a linear function of at least one immediate parent node of the first node, and wherein the learning the parameters further comprises computing at least one coefficient of the linear function using linear regression based on the set of observed values and the structure of the Bayesian network.
The person skilled in the art will understand that the features described above may be combined in any way deemed useful. Moreover, modifications and variations described in respect of the method may likewise be applied to the apparatus and to the computer program product.
Brief description of the drawings
In the following, aspects of the invention will be elucidated by means of examples, with reference to the drawings. The drawings are diagrammatic and may not be drawn to scale. Throughout the drawings, similar items may be marked with the same reference numerals.
Fig. 1 shows an example Bayesian belief network to model crop growth.
Fig. 2 shows a flowchart illustrating aspects of a method of building a Bayesian belief network.
Fig. 3 shows a flowchart illustrating aspects of a method of learning a network structure of a Bayesian belief network.
Fig. 4 shows a flowchart illustrating aspects of a method of learning node parameters of a Bayesian belief network.
Fig. 5 shows a block diagram of an apparatus for building a Bayesian belief network.
Fig. 6 shows a diagram illustrating modifications of the structure of a Bayesian belief network.
Fig. 7 shows an example Bayesian belief network to model a cream soup.
Detailed description of the invention
Certain exemplary embodiments will be described in greater detail hereinafter. The matters disclosed in the description, such as detailed construction and elements, are provided to assist in a comprehensive understanding of the exemplary embodiments. Accordingly, it is apparent that the exemplary embodiments can be carried out without those specifically defined matters. Also, well-known operations or structures are not described in detail, since they would obscure the description with unnecessary detail.
In many cases, discretization of continuous variables does not provide a satisfactory result when predicting system variables with BNs. Therefore, the present disclosure is provided to help building probabilistic hierarchical models, such as Bayesian networks, that can handle continuous variables directly, without converting continuous system variables into discretized Bayesian network variables. Surprisingly, this makes the Bayesian network more suitable and useful to predict the system’s behavior. For example, elasticities can be calculated, which is not easy to do with discrete Bayesian networks. The limitations of Bayesian networks to discretely valued nodes may cause exponentially increasing size of probability maps when the number of possible states increases. The inventors have found that for a particular group of applications, in which it is desired to predict the change of a system parameter induced by changing another system parameter, continuous variables are better able to predict this. Unlike the prior art, which has emphasized optimizations for discrete BNs and discretization of any continuous variables, the present inventors have provided a proper solution for BNs with continuous variables. This allows a deeper analysis of the causal relationships between the variables, such as elasticities.
In the following, techniques are disclosed relating to methods and systems for building a Bayesian belief network. While the techniques disclosed herein have applicability in any Bayesian belief network, an exemplary Bayesian belief network that models the relationship between genetic and environmental parameters and crop growth is presented to illustrate the embodiments. In biotechnology, a significant amount of time and effort is put into optimizing environmental parameters and genetically determined parameters of plants. However, the relationship between these parameters, profiles, and the final crop growth is not straightforward; here, the BN may help to identify the important relationships.
Gaussian BN may be used for modeling systems with drivers that are inherently continuous in nature. In certain embodiments, a continuous BN may be characterized in that each driver (node) follows a Gaussian (Normal) distribution. If this fundamental assumption holds, certain analysis and modeling features may be employed that make use of this assumption.
In certain embodiments, the causal relationship of each node with its parent node(s) is represented through a Gaussian conditional probability distribution.
In certain embodiments, the joint probability distribution of a set of drivers (nodes) may be obtained through the Chain Rule, using Bayes’ Theorem.
In certain embodiments, information regarding any driver (node) may be accessible solely from its Markov Blanket. The Markov Blanket of a driver (node) may be regarded as the node together with its parent nodes, its child nodes, and the other parent nodes of its child nodes.
In certain embodiments, the modeling of a system by a BN may follow phases of data input and pre-processing, BN creation, and post-BN creation.
In certain embodiments, the data input may be obtained by monitoring system variables and obtaining samples thereof.
In certain embodiments, the data may be pre-processed. For example, the raw data may be processed in an exploratory data analysis (EDA).
One or more of the following steps may be performed to prepare the data before model creation starts.
First, missing values may be treated to improve data consistency. For example, a suitable replacement/imputation method may be performed by replacing a missing value with the most frequently appearing value, with an average value, or a value of a moving average in case of time-dependent data.
Second, outliers may be treated using any suitable outlier treatment method known in the art per se. For example, outliers may be treated through the interquartile range (IQR) method. In this method, values smaller than (Q1 - 1.5 × IQR) may be replaced with the value Q1. Herein, Q1 denotes the 25th percentile of the variable within the dataset. Moreover, values that are larger than (Q3 + 1.5 × IQR) may be replaced with the value Q3. Herein, Q3 denotes the 75th percentile of the variable within the dataset. In the above expressions, the value IQR may be calculated as IQR = Q3 - Q1. In other words, the IQR value denotes the 75th percentile minus the 25th percentile of the variable within the dataset.
In order to stabilize the data as well as use the most adequate forms of the drivers, any one or more of a number of data transformations can be applied. For example, one or more of the following data transformations may be applied, depending on the nature of the data. For example, a smoothing operation or other noise reduction formula may be applied. Alternatively, moving average transformation may be applied. Moreover, a Natural Log transformation (or any other log transform) may be applied when this is suitable for the type of variable at hand, in order to stabilize the variance of the data.
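As a concrete illustration of these preprocessing steps, the following hedged Python sketch applies mean imputation, the IQR outlier treatment described above, and an optional natural-log transform. The function name and the pandas-based representation of the dataset are assumptions made for this example:

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, log_cols=()) -> pd.DataFrame:
    """Missing-value imputation, IQR outlier treatment, optional log transform.
    All columns are assumed to be numeric."""
    out = df.copy()
    for col in out.columns:
        # Missing values: imputed here with the column mean (one option among several).
        out[col] = out[col].fillna(out[col].mean())
        # Outliers (IQR method): values below Q1 - 1.5*IQR are replaced by Q1,
        # values above Q3 + 1.5*IQR are replaced by Q3.
        q1, q3 = out[col].quantile(0.25), out[col].quantile(0.75)
        iqr = q3 - q1
        out[col] = np.where(out[col] < q1 - 1.5 * iqr, q1, out[col])
        out[col] = np.where(out[col] > q3 + 1.5 * iqr, q3, out[col])
    for col in log_cols:
        # Natural-log transform to stabilize variance; values must be positive.
        out[col] = np.log(out[col])
    return out
```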
Certain embodiments comprise a step of establishing the nodes of the BN. The nodes correspond to the drivers and/or target of the network. These nodes in general correspond to the data points; they may represent states that can be determined by observation. The purpose of the data transformations of the pre-processing step is to bring the data in conformity with the chosen nodes of the BN, so that sufficient data is available for the nodes, and the data of each node has an advantageous distribution, such as a Gaussian (normal) distribution.
Certain embodiments comprise a step before the actual learning of the BN structure. An example of that is a step of whitelisting and/or blacklisting of arcs. In a BN, any two nodes may be connected by a unidirectional arc. Each arc represents a relationship between the nodes, so that the probability distribution of the node to which the arc points is conditional on the value of the node from which the arc points away. When learning the network structure, a goal is to learn which arcs are to be included in the network. When there is no arc or path connecting two nodes, these nodes are essentially considered to be independent of each other. To facilitate the modeling process, arcs can be whitelisted (WL) or blacklisted (BL). In certain embodiments, certain arcs are whitelisted before the modeling procedure starts. In addition, or alternatively, certain arcs may be blacklisted before the modeling procedure starts.
Whitelisted arcs, if specified, will definitely be present in the network, whereas blacklisted arcs, if specified, will definitely be absent from the network. Arcs whitelisted in one direction only (i.e. A -> B is whitelisted but B -> A is not) may have the respective reverse arc automatically blacklisted. So, if A -> B is whitelisted but B -> A is not whitelisted, then B -> A may be automatically blacklisted. Arcs whitelisted in both directions (i.e. both A -> B and B -> A are whitelisted) are present in the graph, but their direction is set by the learning algorithm.
In certain embodiments, the BN contains a target node that represents a value that is considered to be the result of the values of the other nodes. Alternatively, the target node represents a value for which a prediction is desired. The whitelisting/blacklisting step may comprise blacklisting all possible arcs pointing away from the target node. This may improve the learning process and lead to better predictions. It may force the network structure to allow to predict the target node based on observations of values for the remaining nodes.
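These conventions can be made concrete in a short sketch. The following hypothetical Python helper (names assumed, not from the disclosure) derives the implied blacklist entries: the reverse of every one-way whitelisted arc, plus every arc pointing away from the target node:

```python
def complete_blacklist(nodes, whitelist, blacklist, target):
    """Derive the implied blacklist: reverses of one-way whitelisted arcs,
    plus every arc pointing away from the target node."""
    bl = set(blacklist)
    for (a, b) in whitelist:
        if (b, a) not in whitelist:   # a -> b whitelisted one-way only ...
            bl.add((b, a))            # ... so b -> a is automatically forbidden
    bl |= {(target, n) for n in nodes if n != target}
    return bl
```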
The next step may be learning a network structure. In this step, the arcs of the BN are determined. To that end, a BN network structure learning algorithm may be selected.
For example, a constraint-based algorithm may be used. Such an algorithm learns the network structure by analyzing the probabilistic relations entailed by the Markov property of Bayesian networks with conditional independence tests. Such constraint-based algorithms may be based on the Inductive Causation (IC) algorithm.
Alternatively, a score-based algorithm may be used to learn the BN network structure. Such an algorithm assigns a score to each candidate Bayesian network and tries to maximize it with a heuristic search algorithm, such as Hill-climbing, Tabu search, Simulated annealing, or one of various known genetic algorithms.
Yet alternatively, a hybrid algorithm may be used, in which both constraint-based and score-based learning algorithms are combined to obtain an optimized network structure. In certain embodiments, Hill-Climbing (HC) methodology may be advantageously used as a network learning algorithm. This may be particularly advantageous in the case where (most of) the nodes represent continuous random variables. It may be even more advantageous in case the continuous random variables have a Gaussian distribution. An example of the Hill-Climbing method is given below.
The method may start with an initial graph structure G. The initial structure may be an empty structure. Alternatively, the initial structure may be the structure of an acyclic graph with randomly selected arcs (satisfying the whitelist and blacklist). Also, a score of the initial graph structure G may be computed. For example, the score is an indication how well the graph structure G can fit the available data. Examples of scoring methods will be disclosed hereinafter. Next, a number of iterations may be performed. The following steps may be included in each iteration. First, a transformation explained above is performed on a randomly selected arc (adding an arc, deleting an arc, or reversing an arc). In certain embodiments, more than one transformation may be performed. Although the arc and the operation may be selected randomly, only operations that respect the conditions on the graph structure are performed. These conditions may include the condition that the graph remains acyclic, and that any whitelisted arcs are included in the network structure and any blacklisted arcs are excluded from the network structure. The transformation of the first step results in an updated graph structure G*. Second, a score of the updated graph structure G* may be computed. For example, the score is an indication how well the graph structure G* can fit the available data. Examples of scoring methods will be disclosed hereinafter. Third, if the score of the updated graph structure G* is greater than the score of the previously determined greatest score of graph G, then graph G is set to be equal to graph G* and the greatest determined score of graph G is set to be the score of graph G*.
The iteration is terminated when a suitable stopping criterion is satisfied. For example, when there is no possible transformation that would improve the greatest score, the process may stop. Alternatively, when N successive iterations do not improve the greatest score, the process may stop, where N is any positive integer value.
The above process of learning the network structure may be regarded to be an iterative process with each iteration modifying exactly one arc (through: add/delete/reverse) that increases the overall network score. In alternative implementations, more than one arc might be modified in some (or all) iterations.
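For illustration, the Hill-Climbing loop described above can be sketched as follows. This is a minimal sketch under stated assumptions: it reuses the hypothetical apply_move helper from the Fig. 6 example, takes the scoring function as a parameter (for example a BIC-based score, as described below), and omits random restarts for brevity:

```python
import random

def hill_climb(nodes, score, whitelist, blacklist, max_iter=10000, patience=100):
    """Greedy structure search: accept a random add/delete/reverse move
    only if it increases the overall network score."""
    best = set(whitelist)              # initial structure: the whitelisted arcs
    best_score = score(best)
    stall = 0                          # iterations since the last improvement
    for _ in range(max_iter):
        a, b = random.sample(nodes, 2) # nodes is a list of node names
        move = random.choice(["add", "delete", "reverse"])
        candidate = apply_move(best, move, (a, b), blacklist, whitelist)
        if candidate is None:          # the move violated a constraint
            continue
        s = score(candidate)
        if s > best_score:
            best, best_score, stall = candidate, s, 0
        else:
            stall += 1
            if stall >= patience:      # stopping criterion: no recent improvement
                break
    return best, best_score
```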
Certain parameters for and variations of the above self-learned network building method may be employed. For example, a first parameter may specify the maximum number of iterations, the admissible range of the first parameter being [1, Infinity]. In case of Infinity, which may be the preferred value, no restriction is put on the number of iterations and the Hill-Climbing algorithm will continue until the maximum network score is achieved.
In certain embodiments, the graph structure G may be reset one or more times during the Hill-Climbing. So, the transformation of the first step of some of the iterations may be replaced by a complete reset of the arcs to a new random structure that satisfies the applicable constraints (such as acyclic graph and the blacklist and whitelist). In certain embodiments, a configurable parameter indicates the number of resets that is performed. For example, the number of resets is a non-negative integer, which may preferably be in the range from 0 to 50. A suitable value may be 5. Another configurable parameter may be the number of iterations to insert/remove/reverse an arc after every random reset. This parameter may preferably be, for example, in the range from 0 to 300. A suitable value may be 100. Alternatively, the reset is performed after the score of the graph has not increased for a predetermined number of iterations. That predetermined number of iterations, which may be a positive integer, forms an alternative parameter.
In certain embodiments, another configurable parameter may specify the maximum number of parents for each node. Its admissible range is [1, (n-1)], the default value being (n-1), where n is the total number of nodes. The parents of a particular node are the nodes from which an arc points to the particular node.
After determining a network structure, the parameters of each node may be determined. The parameters of each node may include the parameters that determine the (conditional) random distribution of each node. For example, the conditional random distribution of a node may have the form:

$$P_x \sim \mathcal{N}\!\left(mean_x + \sum_{i=1}^{n} d_i \, P_i ,\; stdev_x^2\right) \quad \text{(Equation 1)}$$

wherein
• $P_x$ is the value of a particular node in the network, this particular node being denoted by $Node_x$,
• $\mathcal{N}(\mu, \sigma^2)$ is the Gaussian normal distribution with mean $\mu$ and standard deviation $\sigma$,
• $mean_x$ is the mean of $Node_x$ (not taking into account the parent nodes),
• $stdev_x$ is the standard deviation of $Node_x$,
• $n$ is the number of parent nodes of $Node_x$, the parent nodes being denoted as $Node_i$, for $i = 1, 2, \ldots, n$,
• $P_i$, for $i = 1, 2, \ldots, n$, is the value of an immediate parent of $Node_x$, and
• $d_i$, for $i = 1, 2, \ldots, n$, is the direct effect of $P_i$ on $Node_x$.
For example, the parameters of a node $Node_x$ may be considered to be the mean $mean_x$, the number of parent nodes $n$, the parent nodes $Node_i$ (for all $i$ from 1 to $n$), the direct effect $d_i$ of each parent node (for all $i$ from 1 to $n$), and the standard deviation $stdev_x$. The number of parent nodes $n$ and the parent nodes $Node_i$ themselves may be regarded to define the structure of the network, and they may be determined using the BN network structure learning algorithm, for example the Hill-Climbing algorithm.

The remaining parameters, including the mean $mean_x$, the direct effect $d_i$ of each parent node (for all $i$ from 1 to $n$), and the standard deviation $stdev_x$, may be determined for any given structure of the network. For example, these parameters may be estimated in every iteration after the transformation has been applied during the Hill-Climbing procedure, to assess the score of the network.
These remaining parameters may be fit using, for example, maximum likelihood estimation (MLE) or Bayesian parameter estimation. The inventors found that for continuous nodes, the maximum likelihood estimation method may be advantageously used. For discrete data, Bayesian parameter estimation may be more suitable. The maximum likelihood estimation method is known in the art per se. The skilled person is able to apply the maximum likelihood estimation method in view of the present disclosure.
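To make the Gaussian case concrete: for a conditional Gaussian per Equation 1, maximum likelihood estimation of $mean_x$ and the $d_i$ reduces to an ordinary least-squares regression of the node on its immediate parents. The sketch below (the function name and array layout are assumptions of this example) fits the parameters of one node with NumPy:

```python
import numpy as np

def fit_node(child: np.ndarray, parents: np.ndarray):
    """Fit mean_x, the direct effects d_i, and stdev_x of Equation 1 for one node.

    child:   shape (n_obs,), observed values of the node
    parents: shape (n_obs, n_parents), observed values of its immediate parents
    """
    X = np.column_stack([np.ones(len(child)), parents])  # intercept + parent values
    coef, *_ = np.linalg.lstsq(X, child, rcond=None)     # OLS coincides with Gaussian MLE
    mean_x, d = coef[0], coef[1:]
    residuals = child - X @ coef
    stdev_x = residuals.std(ddof=0)  # ML estimate: standard deviation of the residuals
    return mean_x, d, stdev_x
```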
After building the continuous BN, a number of further steps may be performed, which may include steps of model diagnosis, model outputs computation, and insights generation. In the following, a few exemplary processing tasks are described that can be performed after the BN has been created. It is possible to perform some of these steps during the iterations of the Hill-Climbing method, for example to determine the network score.
The following calculations may be performed to assess the network performance.
The network score may be determined using a suitable expression. A suitable example of a network score is based on the Bayesian Information Criterion (BIC). However, this is not a limitation.
The network score is a goodness-of-fit statistic that measures how well the network represents the dependence structure of the data. Depending on the implementation, the network score can be positive or negative, depending on the input data and network structure. For example, while comparing multiple differently structured networks built on the same input data, the larger the score, the better the particular network. In other implementations, it may be the other way round: the smaller the score, the better the particular network. In certain embodiments, the network score may be computed through the Bayesian Information Criterion (BIC). This helps to compare multiple networks built on the same input data and judge which network is better. Before calculating the network score, the parameter estimation may be performed using any suitable method, such as the MLE method explained above.
In general, using the BIC, the network score $NetScore_{BIC}$ may be determined as follows:

$$NetScore_{BIC} = \log L(\theta \mid x) - \frac{d}{2} \log n$$

wherein
• $x$ is the collection of data that is available for fitting the network;
• $\theta$ is the network (the collection of arcs and the parameters of the network);
• $L(\theta \mid x)$ is the likelihood of the network $\theta$, given the collection of data $x$;
• $d$ is the number of arcs in the network (in alternative implementations, $d$ is the total number of parameters of the network); and
• $n$ is the number of observations in the collection of data $x$.
It is noted that the network score through BIC may be regarded as a penalization-based score that penalizes an increase of the number of parameters in the network.
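In code, the BIC expression above and the Gaussian log-likelihood it needs can be sketched as follows (a minimal sketch; the per-node decomposition and all names are assumptions of this example):

```python
import math
import numpy as np

def gaussian_loglik(residuals: np.ndarray, stdev: float) -> float:
    """Log-likelihood contribution of one node under its fitted Gaussian;
    summing over all nodes gives log L(theta | x)."""
    n = len(residuals)
    return (-0.5 * n * math.log(2.0 * math.pi * stdev ** 2)
            - float((residuals ** 2).sum()) / (2.0 * stdev ** 2))

def bic_score(loglik: float, n_arcs: int, n_obs: int) -> float:
    """Network score per the BIC expression above: the log-likelihood
    minus a penalty that grows with the number of arcs."""
    return loglik - (n_arcs / 2.0) * math.log(n_obs)
```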
Other performance estimators may be used to assess the quality of the BN. These performance estimators may be used as an alternative for the BIC in certain embodiments. However, these performance estimators may also be used as a general indication of how much confidence one may have in the model.
For example, the Mean Absolute Percentage Error (MAPE) measures the average percentage error between the actual values of a driver and the fitted values obtained through the network. The smaller the MAPE, the better the fit. It may be defined as:

$$MAPE = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{e_t}{y_t} \right|$$

Herein, $y_t$ denotes the actually observed value of a node at observation $t$, $e_t$ denotes the error between the value predicted by the network and the actually observed value $y_t$, and $n$ is the number of observations.
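A short NumPy version of this measure (names assumed; the actual values are assumed to be non-zero, since MAPE divides by them):

```python
import numpy as np

def mape(actual: np.ndarray, fitted: np.ndarray) -> float:
    """Mean absolute percentage error between observed and fitted values."""
    return 100.0 * float(np.mean(np.abs((actual - fitted) / actual)))
```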
The importance of each arc may be assessed. This information may be used in the iterative process to find the best network structure. Moreover, it may be used to determine which parameters (nodes) have most influence on a target node. For example, arc strength may be computed through BIC. The arc strength indicates the absolute increase in the overall network score through the removal of this arc; the arc strength can possess either positive or negative values. So, from the definition, it is evident that the smaller the numeric value of the arc strength (considering the magnitude as well as the sign), the more significant the arc is.
Arc Significance is a unique integral number assigned to an arc from the range [1, e], wherein e is the total number of arcs in the network. The values 1 and e (e > 1) indicate the most and the least significant arc, respectively, according to the above-mentioned arc strength. Thus, the arc significance numbers the arcs in the network in decreasing order of their significance.
MI for Arcs: The MI(A, B) for an arc A -> B measures the amount of information obtained about the driver B through knowing the driver A. In a way, it quantifies the redundancy/importance of the relationship between the variables A and B.
z-score for Arcs: The z-score for an arc A -> B tests the hypothesis of whether the MI(A, B) value is zero or not. A high z-score strengthens the reliability of the significance of arc A -> B.
Pearson’s Correlation for Arcs: For an arc A -> B, it computes the amount of linear correlation between driver A and driver B.
The key nodes for the target may be identified through an Importance Score of each node with respect to the target. The computation of such an importance score is described below:
Step 1. Identify all paths which originate from the node A and go to the target node X.
Step 2. Take the weighted average of all the arc strengths of the arcs occurring in a path, the weights being the inverse of, or inversely proportional to, the arcs’ Significance score. This weighted average is termed as the path strength.
Step 3. Compute the Importance score of the node A as the simple average of the path strengths of all the paths from A to the target node X.
Step 4. Rank each node by assigning a Significance Score in the range [1, (n-1)], where n is the total number of nodes in the network, including the target node. The values 1 and (n-1) indicate the most and the least significant node (driver), respectively.

The direct effect of a node A on a node B may be computed from the node coefficient $\beta$, the actual values of A, and the fitted values of B, through the following formula:

$$\text{Direct Effect}(A \to B) = \beta \cdot \frac{\tfrac{1}{d}\sum_{i=1}^{d} a_i}{\tfrac{1}{d}\sum_{i=1}^{d} \mathit{fitted}(b_i)}$$

wherein
• $\beta$ is the coefficient of A in the conditional Gaussian distribution equation of B,
• $a_i$ and $\mathit{fitted}(b_i)$ are respectively the actual values of A and the fitted values of B, and
• $d$ is the dimension/length of the nodes A and B, which is the number of observations for nodes A and B.
The direct effect quantifies the overall effect of the node A on the node B, as obtained from the network structure and input data. It is expected to be a non-negative quantity. The direct contribution of A on B is the percentage direct effect of A on B versus the summed direct effects of all nodes on B. If any direct effect is found to be negative, the respective contributions are shown as “Not Applicable (NA)”. The presence of negative effects can be overcome through rectifying some of the arcs and/or appropriately preprocessing the input data.
Indirect effects and indirect contributions may also be determined. For example, if a node A has one or more indirect paths to the node B, then the indirect effect of A on B may be computed through the following two steps: a) multiplying the direct effect coefficients of the arcs in a path, and b) adding the values obtained in the previous step, across all paths. As before, the indirect contribution of A on B may be regarded to be the percentage indirect effect of A on B. Evidently, if there is no indirect path from A to B, then the indirect effect as well as the indirect contribution of A on B are zero.
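These two steps translate directly into code. A hedged sketch follows; the path enumeration is assumed to be given, for example from a graph traversal:

```python
def indirect_effect(direct, paths):
    """Indirect effect of A on B: the product of direct effects along each
    indirect path, summed over all such paths.

    direct: dict mapping arc (u, v) -> direct-effect coefficient
    paths:  list of node sequences [A, ..., B] (indirect paths only)
    """
    total = 0.0
    for path in paths:
        effect = 1.0
        for u, v in zip(path, path[1:]):  # multiply coefficients along the path
            effect *= direct[(u, v)]
        total += effect                   # sum across all paths
    return total
```

For instance, with direct = {("A", "V"): 0.5, ("V", "B"): 0.4}, the single path ["A", "V", "B"] yields an indirect effect of 0.2.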
In certain embodiments, the predicted values may be computed by plugging the new values for the parents of the node into the local probability distribution of the node (Equation 1), as obtained from the fit object.
In certain other embodiments, the predicted values are computed by averaging likelihood weighting (Bayes-law) simulations performed using all the available nodes as evidence. The number of random samples which are averaged for each new observation may be controllable. In case the target variable is continuous, the prediction of the target value (target node) may be the expected value of the conditional distribution of the target.

A BN may be applied, for example, for finding the dependence structure among a plurality of drivers, performing inference from a built BN structure, or obtaining the joint probability of a set of drivers, considered together. Discrete BN, with discrete-valued nodes, is the commonly known structure for inference. It is relatively easy to implement, compared to the continuous counterpart. Discrete BN may be used to generate conditional probability tables (CPT), which may be sufficient for inferential activities. However, discrete BN has the following inherent limitations. For n binary-valued nodes, the CPT is of size 2^n, which may be unmanageable even for a moderate number of nodes. Many real-world features are continuous, which cannot be handled directly through discrete BN. Apart from inference, discrete BN may not be suitable for other purposes. The continuous BN, using the techniques disclosed herein, may be advantageously used for finding elasticities, performing simulations, performing forecasting, etc.
As mentioned above, it is known to use discrete BN as a technique to perform inferencing and finding joint/marginal probabilities.
In contrast, the potential of continuous BN, as disclosed herein, is to perform a multitude of crucial tasks and to build an end-to-end solution, entirely driven by the continuous BN framework. This is unique by itself, considering that perhaps no other industry has used continuous BN successfully.
For example, continuous BN, when applied using the techniques described herein, may provide a better understanding of causal effects among the nodes. Moreover, the techniques enable computing an elasticity of each node with respect to the target node. This can provide a way to control the target node by manipulating the values of the other nodes. Based on the importance scores, it becomes possible to find the key nodes, that have the greatest influence on the target node.
For example, the coefficient representing the direct effect of a node A on another node B provides direct information about the elasticities, thereby making the tool highly suitable for finding elasticities. For example, the mean of a normal distribution of a node may depend linearly on the value of a parent node. The elasticity may be calculated, for example, for an arc A -> B, as

$$\text{elasticity}(A \to B) = \beta \cdot \frac{A}{B}$$

wherein A is the value of node A, B is the value of node B, which depends on the value of A, and $\beta$ is the direct effect of A on B. Alternatively, the direct effect by itself may be regarded as a measure of the sensitivity of B with respect to A. Elasticities of nodes that are indirectly connected to the target node may be calculated, for example, by combining the direct effects of the nodes on the path from a node to the target node, for example by multiplication.
An out-of-range simulation may be performed through changing the nodes’ values. Moreover, hybrid modeling becomes possible through considering a plurality of different variables. It becomes possible to provide forecasts for the target node, based on forecasts for the other nodes. Further, it becomes possible to determine optimized values for the nodes, based on certain assumptions on the nodes.
The BN may be built and used to model biological systems, for example as an aid for diagnosis, by setting nodes corresponding to stimuli, symptoms, and/or internal properties of the body. The BN may be built to model components of an apparatus, machine, or factory, ecological systems, meteorological systems, demographic systems, and can be used for purposes of, for example, text mining, feature recognition and extraction.
A method may be provided for predicting a variable in a system comprising a plurality of nodes, each node representing a continuous probability distribution function of a certain property, the method comprising: collecting a set of observed values of certain properties; determining a blacklist of arcs, the arcs in the blacklist identifying pairs of nodes that will not have a connecting arc in the model, or a whitelist of arcs, the arcs in the whitelist identifying pairs of nodes that certainly will have a connecting arc in the model; learning a structure of the network by determining a plurality of directed arcs representing that the probability distribution function of a certain first node is conditional on the property of a certain second node, taking into account the blacklist or the whitelist and the set of observed values; and learning probability distribution function parameters of the nodes of the network, based on the structure of the network and the set of observed values.
An apparatus may be provided for building a Bayesian Belief Network that models a system, wherein the apparatus is configured to: identify a plurality of nodes of a Bayesian belief network, each node representing a random variable having a continuous random distribution; select a target node among the plurality of nodes, the target node representing a state variable to be predicted; blacklist arcs pointing away from the target node to any other node; learn a network structure by identifying arcs between pairs of nodes that explain a system behavior, excluding the blacklisted arcs; learn conditional probability distributions of the continuous random variables of the nodes, wherein the probability distribution of the continuous random variable of a first node is conditional on at least one second node if and only if an arc points to the first node; and predict the value of the random variable of the target node based on a given value of at least one other node of the network.
Some possible aspects and advantages of the techniques disclosed herein may be the following. Constraints on the model structure are imposed using domain knowledge, which forces the network structure to converge to the target variable. This feature may prevent the situation in which, if the learning of the Bayesian network is allowed to proceed unhindered, the network does not converge to the target variable as desired.
Further, we add a regression model on top of the hierarchical probabilistic graphical model, which enables extracting the elasticities of the node variables with regard to their effect on the target variable. This amalgamation of the regression model with the Bayesian network provides the possibility of improved control of the target. Finally, the entire process by which this system (probabilistic graphical model and regression framework) is leveraged to extract the elasticities of the nodes and predict the target variable provides improved information that can be used to control or predict the target variable.
After the Bayesian network is completed, it can be used to better control the system. Specifically, the Bayesian network can be used to influence the variable to be predicted, or target node, by controlling the values of the other variables, or nodes. For example, the method may further comprise identifying a target value for the target node. Then, the method may comprise determining values for at least some of the remaining nodes based on the Bayesian network and the target value for the target node. For example, values of the remaining nodes may be chosen at random or using an optimization algorithm, the corresponding value of the target node may be computed, and the values for the remaining nodes resulting in the target node becoming closest to the target value may be selected. Alternatively, the method may start from initial values for the variables in the system, and adjust certain ones of the values using the calculated elasticities to bring the target node closer to the target value. Next, the method may comprise controlling some of the variables in the system to set the variables to their determined values. This way, it is likely that the target variable will move towards its target value.

Some or all aspects of the invention may be suitable for being implemented in the form of software, in particular a computer program product. The computer program product may comprise a computer program stored on a non-transitory computer-readable media. Also, the computer program may be represented by a signal, such as an optic signal or an electro-magnetic signal, carried by a transmission medium such as an optic fiber cable or the air. The computer program may partly or entirely have the form of source code, object code, or pseudo code, suitable for being executed by a computer system. For example, the code may be executable by one or more processors.
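As an illustration of the random-search selection of control values described above, consider the following hedged Python sketch; the predict function, the admissible ranges, and all names are assumptions made for this example:

```python
import random

def choose_controls(predict, ranges, target_value, n_trials=1000):
    """Random search for control-node values whose predicted target
    is closest to the desired target value.

    predict: function mapping a dict of node values -> predicted target value
    ranges:  dict mapping node -> (low, high) admissible interval
    """
    best, best_gap = None, float("inf")
    for _ in range(n_trials):
        values = {n: random.uniform(lo, hi) for n, (lo, hi) in ranges.items()}
        gap = abs(predict(values) - target_value)
        if gap < best_gap:            # keep the candidate closest to the target
            best, best_gap = values, gap
    return best
```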
The examples and embodiments described herein serve to illustrate rather than limit the invention. The person skilled in the art will be able to design alternative embodiments without departing from the spirit and scope of the present disclosure, as defined by the appended claims and their equivalents. Reference signs placed in parentheses in the claims shall not be interpreted to limit the scope of the claims. Items described as separate entities in the claims or the description may be implemented as a single hardware or software item combining the features of the items described.
Detailed description of drawings
Fig. 1 illustrates a simplified Bayesian belief network, provided for the purpose of illustration. The network comprises a node E denoting environmental potential and a node G representing genetic potential. In more complex networks the environmental potential and the genetic potential may be dependent on a great number of further nodes that have not been illustrated in Fig. 1 for ease of explanation. The environmental potential E and genetic potential G may influence the condition of the vegetative organs V, in particular the reproduction organs of the plants. The condition of the vegetative organs V may influence the number of seeds N generated per plant as well as the mean weight W of the seeds generated. The number of seeds N and the seeds mean weight W may determine the crop growth C in terms of total mass of crop. More generally, a number of drivers, such as environmental potential E, genetic potential G, vegetative organs V, number N and mean weight W of seeds, may influence the crop growth C. The drivers may be represented by nodes in a Bayesian belief network. The relationships between these drivers may be found by using the techniques disclosed herein, so that the crop growth C may be predicted using given values for (some of) the drivers. Also, the most important drivers may be identified. For example, some drivers may be controllable to influence a particular target quantity. In certain cases environmental circumstances can be adapted or genetic potential can be changed by genetic treatment or cross-fertilization. The Bayesian belief network can predict the changes in crop growth C caused by such changes.
As shown in the illustrative diagram of Fig. 1, the Gaussian BN follows a hierarchical regression structure, defined by the nodes and coefficients (direct effects) in the conditional distribution of each node. As depicted above, each node that has one or more parent nodes may have a conditional Gaussian distribution that may be obtained through running a local linear regression among the node and its immediate parents, the node being the target of the local linear regression. A possible general structure of the regression equation of a node is as given in Equation 1:

$$P_x \sim \mathcal{N}\!\left(mean_x + \sum_{i=1}^{n} d_i \, P_i ,\; stdev_x^2\right) \quad \text{(Equation 1 reproduced)}$$
The definitions of the variables have been provided hereinabove. It is noted that, when doing the regression, the standard deviation $stdev_x$ may be calculated as the standard deviation of the residuals, which are the differences between the actual and fitted values of $P_x$.
The values of $mean_x$ and $d_i$ may be determined by performing linear regression as a form of maximum likelihood estimation. For example, the linear regression may be performed for each node separately, starting with a node whose parent nodes do not have parent nodes themselves (such as node V in Fig. 1, whose parents E and G are root nodes). Every time, the regression may be performed for a node that only has parent nodes that either do not have parent nodes themselves or for which the regression analysis has already been done. This may be repeated until all the nodes have been processed.
Prediction in the BN may be performed using the hierarchical structure in a top-to-bottom manner, by predicting the children at each level from their immediate parents and then propagating the predictions downwards to the next level. For example, in the network of Fig. 1, the prediction of the target node Crop (C) may be performed as follows:
1. Start with root nodes: E, G and use them to predict all their children, i.e. V in this case. The root nodes are the nodes that do not have parent nodes.
2. Go to the next level and predict N, W through their immediate parent, i.e. V.
3. Finally, predict the target, i.e. C, through the predicted values of N, W.
Prediction of a node at each level may be performed through its Gaussian distribution equation, involving immediate parents and direct effects.
During prediction, if the values of all immediate parents of the target are already provided, then the network may directly use those values to predict the target and will, in certain embodiments, ignore other values.
For example, if the values of N and W are provided together with values of other nodes such as E, G, and V, then the target node C may be predicted using only the given values of N and W, while ignoring the other nodes’ values.
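A hedged sketch of this top-down prediction scheme follows; the dict-based structure and params representations and all names are assumptions made for illustration. Nodes already present in the evidence are used as-is, matching the behaviour described above:

```python
def predict_all(structure, params, evidence):
    """Top-down prediction: each node's predicted value follows the mean
    of Equation 1, given predicted or observed values of its parents.

    structure: dict node -> list of immediate parent nodes
    params:    dict node -> (mean_x, {parent: d_i})
    evidence:  dict of known node values (these override prediction)
    """
    values = dict(evidence)
    remaining = [n for n in structure if n not in values]
    while remaining:
        progressed = False
        for node in list(remaining):
            parents = structure[node]
            if all(p in values for p in parents):   # parents resolved first
                mean_x, d = params[node]
                values[node] = mean_x + sum(d[p] * values[p] for p in parents)
                remaining.remove(node)
                progressed = True
        if not progressed:
            raise ValueError("structure is not a DAG or is incomplete")
    return values
```

For the network of Fig. 1, structure could be {"E": [], "G": [], "V": ["E", "G"], "N": ["V"], "W": ["V"], "C": ["N", "W"]}, with evidence supplying values for the root nodes E and G.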
Generally, the Bayesian network contains a model of a system to derive information about that system. A Bayesian network can provide information about sounds, images, electromagnetic signals, chemical compounds, biological systems, or economic data, for example.
For example, the network score may be calculated as follows for the example crop growth network of Fig. 1:

$$NetScore_{BIC} = \log L(\text{parameters} \mid E, G, V, N, W, C) - \frac{\log d}{2} \cdot \sum_{\text{node}=1}^{n} \frac{d_{\text{node}}}{2}$$

wherein
• ‘parameters’ denotes the parameters of the BN;
• $L(\text{parameters} \mid E, G, V, N, W, C)$ denotes the likelihood that the parameters are correct, given the available data for E, G, V, N, W, C;
• $d$ denotes the number of observations in the dataset;
• $d_{\text{node}}$ represents the degree of the node (i.e. the total number of arcs: incoming + outgoing), $n$ is the total number of nodes in the network, and $\tfrac{1}{2}\sum d_{\text{node}}$ thus equals the number of arcs; and
• E, G, V, N, W, and C are the nodes of the network, as illustrated in Fig. 1.
Fig. 2 shows a flowchart illustrating a method of building a Bayesian belief network. The method starts in step 201 with providing a dataset. For example, the dataset may comprise example states of a system (e.g. a system of interacting food ingredients, a biological system, mechanical system, e.g. components of a car or factory, or another kind of system). In particular, the states may comprise a combination of parameter values representing certain properties. Those parameter values may be continuous in nature. The dataset may contain observed values representing states of the real world system. The observed values may be measured by a detector, such as a sensor. For example, a temperature may be sensed by a thermometer. The data generated by the detector may be transmitted to a computer system that stores the observed values in a dataset. The observed values may alternatively be entered into the computer.
The data may be preprocessed in step 202, for example to remove outliers, handle missing values, and the like. This is elaborated elsewhere in this disclosure. In step 203, a blacklist, and optionally a whitelist, may be created. The blacklist contains arcs that cannot occur in the Bayesian belief network and that will not be added by the subsequent learning procedure 204. The whitelist contains arcs that are always included in the Bayesian belief network and that will not be removed by the subsequent learning procedure 204. This can help to incorporate a priori knowledge in the network structure. Moreover, the blacklist can contain all the arcs pointing away from the target node. In the example of Fig. 1, the target node is the crop growth node C. The target node is the node for which we would like to make a prediction based on the available data about the values of the other nodes. After determining the blacklist/whitelist in step 203, the network structure is learned in step 204. This may be performed using an iterative process, such as Hill-Climbing, as will be elucidated hereinafter with reference to Fig. 3. After the network structure has been learned, the node parameters are learned in step 205. This step may involve learning the conditional probability distributions for each node. This may be performed, for example, using a linear regression technique that is disclosed elsewhere in this description.
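As a small illustration of step 203, the mandatory part of the blacklist (all arcs pointing away from the target node) could be generated as follows; the helper name is hypothetical and arcs are represented as (parent, child) tuples:

```python
def target_blacklist(nodes, target):
    # exclude every directed arc from the target node to any other node
    return {(target, other) for other in nodes if other != target}
```

Domain knowledge can then be merged in, e.g. blacklist = target_blacklist(nodes, "C") | {("E", "G")}.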
Fig. 3 illustrates an example implementation of step 204 of learning the network structure. For example, the nodes of the network may be a given, and the connections between the nodes (the arcs) may be determined as part of this procedure.
The process starts in step 301 by determining an initial network structure. For example, the initial network structure is determined randomly, meaning that the absence or presence of a particular arc in the network depends on some random value generator. The direction of each arc may also be determined randomly. However, the whitelisted arcs are always included, and the blacklisted arcs are never included. Further, the arcs are chosen such that the resulting network represents an acyclic graph. An acyclic graph may be obtained, for example, by removing arcs from a random network structure until the remaining arcs form an acyclic graph.
In step 302, a score of the initial network structure is determined. First, the network parameters are estimated using the initial network structure as a given. Using these network parameters, a network score may be calculated. For example, the network score may be based on the Bayesian Information Criterion (BIC), which is elaborated elsewhere in this document. The network score of the initial network structure is set as the 'maximum network structure score' and stored in memory.
In step 303, the network structure is updated. For example, one or more arcs are added and/or one or more arcs are removed. Also, the direction of an arc may be reversed. The arcs to be added or removed may be selected randomly, for example. Alternatively, the arcs to be added or removed may be selected based on an arc strength or an arc significance, possibly in combination with a random variable. When updating the current network structure, it is ensured that the updated network structure represents an acyclic graph. For example, updating the current network structure may comprise adding an arc that is not on the blacklist, deleting an arc that is not on the whitelist, or reversing the direction of an arc that is not on the whitelist and whose reversed direction is not on the blacklist.
Next, in step 304, a network score of the updated network structure is determined. To that end, first, optimal network parameters are estimated. That is, using the current network structure as a basis, the (conditional) probability distribution for each node is estimated. For example, for each node, a mean and standard deviation of a normal distribution are estimated. Further, the mean of each node may be a linear function of the values of the parent nodes. The coefficients of this linear function may be estimated for each node that has one or more parent nodes.
After that, the 'quality' of the resulting network is estimated in the form of a network score. This network score may be determined, for example, by the Bayesian Information Criterion (BIC). This criterion is described in greater detail elsewhere in this description. In step 305, it is checked whether the network score of the updated network structure is larger than the previously stored maximum network structure score.
If the network score of the updated network structure is larger than the previously stored maximum network structure score, in step 306, the network score of the updated network structure is stored as the new maximum network structure score. Also, the current network structure is set equal to the updated network structure (corresponding to the maximum network structure score).
If the network score of the updated network structure is not larger than the previously stored maximum network structure score, in step 307, the current network is kept. In this case, the modifications of the updated network are discarded.
After that, in step 308, it is determined whether more iterations should be made, to explore additional network structures. For example, this may be determined based on a stopping criterion, such as a maximum number of iterations, a minimal required improvement of the network score, a minimum acceptable value of the network score, or a combination of the above.
If it is determined that no further iterations are necessary, the learning process may stop in step 309. If it is determined that further iterations are necessary, in step 310 it may be decided whether the network structure should be reset. For example, this may be decided based on a stopping criterion, such as a maximum number of iterations or a minimal required improvement of the network score or a combination thereof.
If it is determined that the reset of the network structure is not necessary, the process may proceed from step 303 by implementing an incremental update to the current network.
If it is determined in step 310 that a reset of the network structure is to be carried out, the process proceeds to step 311. The current network structure and the corresponding maximum network structure score are stored as a candidate network structure. Next, the network structure is reset to a new initial network structure, which may be determined randomly in a similar way as the initial network structure was set in step 301. Moreover, the initial network structure score may be determined and set as the 'maximum network structure score', similar to step 302. Next, the process proceeds from step 303.
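The hill-climbing procedure of steps 301-311 may be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: arcs are (parent, child) tuples, nodes is a list of node names, score_fn(arcs) is assumed to fit the node parameters of a candidate structure and return its network score (e.g. the BIC sketch above), and, purely for brevity, each (re)start begins from the whitelist alone rather than from a random acyclic structure as described above.

```python
import random

def has_cycle(nodes, arcs):
    """Kahn's algorithm: the graph is acyclic iff every node can be sorted."""
    indeg = {n: 0 for n in nodes}
    for _, b in arcs:
        indeg[b] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while queue:
        n = queue.pop()
        seen += 1
        for a, b in arcs:
            if a == n:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen < len(nodes)

def hill_climb(nodes, score_fn, blacklist, whitelist, iters=500, restart_every=100):
    def legal(arcs):
        return (whitelist <= arcs and blacklist.isdisjoint(arcs)
                and not has_cycle(nodes, arcs))

    def mutate(arcs):
        # add, delete, or reverse a randomly chosen arc (cf. Fig. 6)
        a, b = random.sample(nodes, 2)
        new = set(arcs)
        if (a, b) in new:
            new.discard((a, b))
            if random.random() < 0.5:
                new.add((b, a))            # reversal instead of plain deletion
        else:
            new.add((a, b))
        return new

    current = set(whitelist)
    current_score = score_fn(current)
    best = (current, current_score)
    for i in range(1, iters + 1):
        if i % restart_every == 0:         # steps 310/311: store, then reset
            if current_score > best[1]:
                best = (current, current_score)
            current = set(whitelist)
            current_score = score_fn(current)
        proposal = mutate(current)
        if legal(proposal):
            s = score_fn(proposal)
            if s > current_score:          # steps 305/306: keep improvements only
                current, current_score = proposal, s
    return best if best[1] >= current_score else (current, current_score)
```

A call such as hill_climb(nodes, score_fn, blacklist, whitelist) then returns the best-scoring candidate structure found, corresponding to the final selection of the highest-scoring candidate.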
When the process ends in step 309, the candidate network structure having the highest network structure score may be selected as the finally learned network structure.

Fig. 4 illustrates an example implementation of step 205 of learning the node parameters. This process of learning the node parameters may also be performed in steps 302 and 304 as a preliminary step when computing the network structure score. In step 401, a linear function of the immediate parent nodes is used to define a parameter of a conditional probability density function of a node. For example, in the case of a Gaussian probability distribution, the mean may be a linear function of the immediate parent nodes, e.g.
μ = meanx + d1·x1 + d2·x2 + ... + dn·xn,

wherein μ is the mean of the Gaussian distribution, meanx is the mean of the Gaussian distribution without regard to the parent nodes, di denotes the influence of the parent node i, and xi is the value of the parent node i, wherein i = 1 to n are the parent nodes, i.e. the nodes from which an arc in the Bayesian belief network points to the node for which the coefficients are being set. In step 402, the coefficients of the linear function, in the above case meanx and d1, d2, ..., dn, are fitted in order to determine the mean μ of the conditional Gaussian distribution of the node. This may be fitted using a maximum likelihood approach based on linear regression, as is disclosed in greater detail elsewhere in this disclosure.
It will be appreciated that other probability distribution functions, such as an exponential distribution, may be used instead of the Gaussian distribution. Moreover, non-linear functions may be used to compute the parameter of the probability density function. For example, instead of a linear function, a quadratic function or any other polynomial function could be used.
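As a sketch of that generalisation, a quadratic mean function can be fitted with the same least-squares machinery by augmenting the design matrix, again with hypothetical names:

```python
import numpy as np

def fit_quadratic_mean(y, parent_values):
    # mean = c0 + sum_i (a_i * x_i + b_i * x_i**2), fitted by least squares;
    # parent_values is a list of 1-D arrays, one per immediate parent
    X = np.column_stack([np.ones_like(y)]
                        + [col for x in parent_values for col in (x, x ** 2)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef
```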
Fig. 5 shows a block diagram illustrating an apparatus 500 for building a Bayesian belief network to predict a variable in a system. The apparatus 500 may be implemented using computer hardware, which may be distributed hardware. The apparatus 500 comprises a processor 501 or a plurality of cooperating processors. The apparatus further comprises a storage 503, for example a computer memory and/or a storage disk. Computer instructions may be stored in the storage 503, in particular on a non-transitory computer-readable medium. Moreover, a dataset with observations for the nodes may be stored in the storage 503. The apparatus may comprise an input device 504 for receiving a user input to control the apparatus and a display device 505 to display outputs of the apparatus. The apparatus 500 may further comprise a communication port 502 for connecting to other devices and exchanging data and control signals. For example, the communication port 502 comprises a network interface for wired or wireless communication and/or a universal serial bus (USB).

The instructions in the storage 503 may contain modules implementing the steps of one of the methods set forth herein. For example, the instructions may cause the processor 501 to control receiving, via the communication port 502, observations for the nodes of the Bayesian network, and store them as a set of observed values in the storage 503. These observed values may be received, for example, from a measurement device or sensor connected to the apparatus 500 via the communication port 502. In particular, the instructions may cause the processor 501 to determine a blacklist of arcs, wherein the arcs in the blacklist identify arcs not to be included in the Bayesian belief model, the blacklist including all directed arcs from the target node to any other node of the Bayesian belief network; learn a structure of the Bayesian belief network based on a set of observed values for the nodes by determining a plurality of directed arcs of the Bayesian belief network linking the nodes of the network to form an acyclic graph, the plurality of directed arcs not including arcs corresponding to the arcs in the blacklist; and learn parameters of conditional continuous probability distribution functions of the nodes of the network, based on the structure of the network and the set of observed values.
Fig. 6 illustrates several examples of transformations of the network structure during the learning procedure for learning the network structure. As shown in Fig. 6, starting with a network structure 601, in which nodes A, B, C, and D are connected by arcs in a certain way, an arc may be added 622, reversed 623, or deleted 624. In the illustration, in the network 601, an arc connects node A to node C, an arc 611 connects node B to node C, and an arc connects node C to node D. As a first example transformation, an arc 612 from node B to node D may be added 622 to the network. As a second example transformation, the direction of the arc 611 may be reversed 623, so that the arc 611 is replaced by an arc 613 that connects node C to node B. As a third example transformation, the arc 611 may be deleted 624. Such transformations can be done with arcs connecting, in principle, any first node to any second node, as long as whitelisted arcs are not removed and blacklisted arcs are not added. Also, the direction of an arc may not be reversed if the reversed direction of that arc is blacklisted.
Fig. 7 shows an example of a Bayesian network 700 for food development. Let us consider the composition of a creamy soup powder. It can contain many ingredients, such as creamer, vegetable granules, binder, NaCl, and flavouring ingredients like herbs, spices, yeast extract, and monosodium glutamate. Also, for garnishing purposes, large particulates of mushroom, carrot, cauliflower florets, broccoli florets, whole garden peas, and sweet corn can be used. Additionally, for flavour and texture, cottage cheese powder, cream powder, and/or butter powder can be added.
The creamer can provide the creamy texture and the flavour of the soup. The herbs and spices can provide various tastes, such as sweet, tangy and spicy, in different degrees, to enhance the flavour. Also, the shelf life of the ingredients and of the resulting soup powder is important to delivering the intended product. In this respect, a creamy binder selected from non-dairy creamer and dairy-based powder, such as cheese or butter powder, can have a great impact. The vegetable particulates also impact the shelf life. However, spices, yeast and other ingredients generally have very little or no impact on the shelf life of the mixture.
Hence, creamy soup powder is a complex, well-balanced blend of the various ingredients. At the same time, flavour, aroma, texture, thickness and shelf life can greatly influence customer satisfaction.
Referring to Fig. 7, in an embodiment, the consumer preference score is the target node 760. The amount of each ingredient, expressed for example in weight percentage, is represented by controllable nodes 702-715. Intermediate nodes 751-758, representing properties such as flavour, aroma, texture, thickness, and shelf life, may be influenced by the controllable nodes (the weight percentage of each ingredient), and in turn these intermediate nodes 751-758 influence the consumer preference score, represented by the target node 760.
Experimentally, the available ingredients may be combined using different weight percentages. For each combination of weight percentages of the ingredients, we may assess the intermediate variables (flavour, aroma, texture, thickness, shelf life) and the target variable (consumer preference score) by performing certain measurements and/or by having the product evaluated by a group of consumers. This experimental data contains the observed values for the nodes. If we have this experimental data for multiple trials, we can create a model to predict the consumer preference score from the composition of ingredients. This way we can predict the proportions of the ingredients needed to achieve the desired blend of the soup powder, which can be quantified as a consumer preference index. In the network created, we can incorporate rules about known relationships, such as the fact that a creamy binder, vegetable particulates and dairy powder can have an impact on texture and thickness. For example, if certain ingredients are known to influence a particular node (e.g. the weight percentage of cream powder influences the texture), the arc from the cream powder node to the texture node may be included in the whitelist.
Similarly, since it is known that certain ingredients have an influence on shelf life, the arcs pointing from the nodes of those ingredients to the node representing the shelf life can be whitelisted. Since it is known that certain other ingredients do not influence the shelf life, the arcs pointing from the nodes of these other ingredients to the node representing the shelf life can be blacklisted.
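For instance, such domain knowledge could be encoded along the following lines, using a few hypothetical node names in place of the full set of nodes of Fig. 7:

```python
nodes = ["cream_powder", "binder", "spices", "yeast_extract",
         "texture", "thickness", "shelf_life", "preference"]

# known influences on texture, thickness and shelf life are whitelisted
whitelist = {("cream_powder", "texture"), ("binder", "thickness"),
             ("binder", "shelf_life")}

# ingredients known not to affect shelf life are blacklisted, together with
# every arc pointing away from the target node (consumer preference)
blacklist = ({("spices", "shelf_life"), ("yeast_extract", "shelf_life")}
             | {("preference", n) for n in nodes if n != "preference"})
```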
Also, based on knowledge about food production, certain nodes can be restricted as regards their value. For example, the weight percentage of cream powder can be restricted to a given range, for example the range from 1.01% to 5.89%. Similarly, rules regarding the weight restriction of certain ingredients can be incorporated in the network using the blacklist. The unknown relationships can be found by learning the Bayesian network structure using the methods and systems disclosed herein.
Examples of variables that can be used, as illustrated in Fig. 7, are: wt% of cottage cheese 702; wt% of cream powder 703; wt% of butter powder 704; wt% of herbs 705; wt% of spices 706; wt% of mushroom flakes 707; wt% of carrot cubes 708; wt% of broccoli florets 709; wt% of whole garden peas 710; wt% of sweet corn 711; wt% of yeast extract 712; wt% of binder 713; wt% of NaCl 714; wt% of monosodium glutamate 715; wt% of creamer 751; Texture (represented by a numeric value) 752; Thickness (represented by a numeric value) 753; Sweet taste (represented by a numeric value) 754; Spicy taste (represented by a numeric value) 755; Sour taste (represented by a numeric value) 756; Flavour (represented by a numeric value) 757; Shelf life (in days) 758; Consumer preference score (target variable) 760.
It will be understood that the arcs (arrows) drawn in Fig. 7 are shown merely as examples. The actual arcs of the Bayesian network may be determined by learning the structure of the Bayesian network based on the set of observed values for the nodes.

Claims
1. A method of predicting a state of a system by building a probabilistic hierarchical model comprising a Bayesian network comprising a plurality of nodes, each node representing a variable, a variable corresponding to the state to be predicted being represented by a target node of the Bayesian network, the method comprising determining (201), by a processor (501), a set of observed values for the variables represented by the nodes; determining (203), by the processor (501), a blacklist of arcs, wherein the arcs in the blacklist identify arcs to be excluded from the Bayesian network, the blacklist including at least all directed arcs from the target node to any other node of the Bayesian network; learning (204), by the processor (501), a structure of the Bayesian network based on the set of observed values for the nodes by determining a plurality of directed arcs of the Bayesian network linking the nodes of the network to form an acyclic graph, the plurality of directed arcs not including arcs corresponding to the arcs in the blacklist, and storing the structure of the Bayesian network in a memory (503); learning (205), by the processor (501), parameters of conditional continuous probability distribution functions of the nodes of the network, based on the structure of the network and the set of observed values, and storing the parameters in the memory (503), wherein the learning (205) the parameters comprises setting, by the processor (501), at least one parameter of the conditional continuous probability distribution function of a first node of the Bayesian network as a linear function of at least one immediate parent node of the first node, and wherein the learning (205) the parameters further comprises computing (402), by the processor (501), at least one coefficient of the linear function using linear regression based on the set of observed values and the structure of the Bayesian network.
2. A method according to claim 1, wherein the learning (205) parameters of the conditional continuous probability distributions comprises estimating, by the processor (501), a maximum likelihood of at least one parameter of the conditional continuous probability distributions.
3. A method according to any of the preceding claims, wherein the conditional continuous probability distributions are Gaussian distributions.
4. A method according to any of the preceding claims, wherein the learning (204) the structure of the Bayesian network comprises finding, by the processor (501), an optimum network score by iteratively modifying the arcs of the Bayesian network, wherein the arcs on the blacklist are not included in the Bayesian network, and evaluating a network score after each iteration.
5. A method according to claim 4, wherein the finding the optimum network score comprises: setting (301), by the processor (501), a current network structure to an initial network structure; setting (302), by the processor (501), a maximum network structure score to a score of the initial network structure; repeating the following steps until a stopping criterion is satisfied: updating (303), by the processor (501), the current network structure to obtain an updated network structure, while ensuring that the updated network structure represents an acyclic graph, wherein the updating the current network structure comprises adding an arc that is not on the blacklist to the network structure, deleting an arc that is not on a whitelist from the network structure, or reversing the direction of an arc that is not on the whitelist and of which arc the reversed direction is not on the blacklist; computing (304), by the processor (501), the network score of the updated network structure; and if (305) the score of the updated network structure is larger than the maximum network structure score, setting (306), by the processor (501), the current network structure to the updated network structure and setting the maximum network structure score to the score of the updated network.
6. A method according to claim 5, further comprising determining (203), by the processor (501), the whitelist of arcs, wherein the arcs in the whitelist identify arcs that are certainly to be included in the Bayesian network.
7. A method according to claim 5 or 6, wherein the network score is based on a Bayesian information criterion (BIC) or an arc strength.
8. A method according to any of the preceding claims, further comprising: providing (201), by the processor (501), a dataset comprising combinations of values for the nodes; and preprocessing (202), by the processor (501), the dataset to obtain normalized observations corresponding to the plurality of nodes, wherein the preprocessing comprises at least one of missing values treatment, outliers treatment, and data transformation, wherein the learning the network structure (204) and the learning the conditional probability distributions (205) are performed based on the normalized observations corresponding to the plurality of nodes.
9. A method according to claim 8, wherein the normalized observations corresponding to a node have a Gaussian normal distribution.
10. A method according to any one of claims 1 to 9, further comprising calculating (260), by the processor (501), an elasticity of the first node with respect to its at least one immediate parent node based on the at least one coefficient.
11. A method according to any one of claims 1 to 4 and 7 to 10, further comprising determining (203), by the processor (501), a whitelist of arcs, wherein the arcs in the whitelist identify arcs that are certainly to be included in the Bayesian network.
12. A method according to any of the preceding claims, the method further comprising: identifying a target value for the target node; determining a value for at least one of the nodes based on the Bayesian network and the target value for the target node; and controlling at least one variable based on the determined value for the at least one of the nodes.
13. A system for predicting a state by building a probabilistic hierarchical model comprising a Bayesian network comprising a plurality of nodes, each node representing a variable, a variable corresponding to the state to be predicted being represented by a target node of the Bayesian network, the system comprising a memory (503) configured to store a set of observed values for the variables represented by the nodes; and a processor (501) configured to: determine a blacklist of arcs, wherein the arcs in the blacklist identify arcs not to be included in the Bayesian model, the blacklist including all directed arcs from the target node to any other node of the Bayesian network; learn a structure of the Bayesian network based on the set of observed values for the nodes by determining a plurality of directed arcs of the Bayesian network linking the nodes of the network to form an acyclic graph, the plurality of directed arcs not including arcs corresponding to the arcs in the blacklist; learn parameters of conditional continuous probability distribution functions of the nodes of the network, based on the structure of the network and the set of observed values; set, as part of learning the parameters, at least one parameter of the conditional continuous probability distribution function of a first node of the Bayesian network as a linear function of at least one immediate parent node of the first node, and compute, as part of learning the parameters, at least one coefficient of the linear function using linear regression based on the set of observed values and the structure of the Bayesian network.
14. A computer program product comprising instructions stored on a non-transitory computer readable media which, when the program is executed by a computer system, cause the computer system to perform the steps of: determining (201) a set of observed values for a plurality of variables represented by a plurality of nodes of a Bayesian network, a variable corresponding to a state to be predicted being represented by a target node of the Bayesian network, determining (203) a blacklist of arcs, wherein the arcs in the blacklist identify arcs to be excluded from the Bayesian network, the blacklist including at least all directed arcs from the target node to any other node of the Bayesian network; learning (204) a structure of the Bayesian network based on the set of observed values for the nodes by determining a plurality of directed arcs of the Bayesian network linking the nodes of the network to form an acyclic graph, the plurality of directed arcs not including arcs corresponding to the arcs in the blacklist; and learning (205) parameters of conditional continuous probability distribution functions of the nodes of the network, based on the structure of the network and the set of observed values, wherein the learning (205) the parameters comprises setting, by the processor (501), at least one parameter of the conditional continuous probability distribution function of a first node of the Bayesian network as a linear function of at least one immediate parent node of the first node, and wherein the learning (205) the parameters further comprises computing (402), by the processor (501), at least one coefficient of the linear function using linear regression based on the set of observed values and the structure of the Bayesian network.