EP1627324A1

EP1627324A1 - Method for determining a probability distribution present in predefined data

Info

Publication number: EP1627324A1
Application number: EP03787314A
Authority: EP
Inventors: Michael Haft; Reimar Hoffmann
Original assignee: Siemens AG
Current assignee: Panoratio Database Images GmbH
Priority date: 2002-07-24
Filing date: 2003-07-23
Publication date: 2006-02-22
Also published as: US20040249488A1; WO2004017224A2; JP2005527923A; DE10233609A1; AU2003260245A1

Description


Method for determining a probability distribution in given data

The invention relates to a method for generating a statistical model using a learning method.

The increasing traffic on the Internet enables companies that are represented on the Internet or offer services there both to reach a larger circle of customers and to collect customer-specific information. Many electronic processes are logged and user data is stored. Many companies now operate a CRM system in which they systematically record information about all customer contacts. Traffic to and access to websites is logged, as are the processes in a call center. This often results in very large amounts of data containing a wide variety of customer-specific information.

The disadvantage is that, although valuable information about customers accumulates, its often overwhelming volume means that it can only be processed with great effort.

In principle, statistical methods are used to solve this problem, in particular statistical learning methods, which, for example, have the ability to divide entered variables into classes after a training phase.

The newly created field of data mining or machine learning has made it its goal in particular to further develop such learning methods (such as clustering methods) and to apply them to problems relevant to practice.

Many data mining methods can be specifically targeted at the handling of information from the Internet. With these methods, large amounts of data are converted into valuable information, which generally significantly reduces the amount of data. Many statistical learning methods are also used in such a procedure, for example in order to be able to read statistical dependency structures or recurring patterns from the data.

However, these methods have the disadvantage that they are very complex numerically, although they provide valuable results. The disadvantages are further exacerbated by the fact that missing information, such as the age of a customer or his or her income, complicates the processing of the data and in some cases also makes the information supplied worthless. The statistically optimal handling of such missing information is still very complex.

Another method for the sensible division of information is the generation of a cluster model, e.g. with a Naive Bayesian Network. Bayesian networks are parameterized by probability tables. When these tables are optimized, a weakness arises after just a few learning steps: many zero entries appear in the tables, which produces sparse tables. Because the tables change constantly during the learning process, e.g. during the learning process for statistical cluster models, a sparse coding of the tables is very difficult to exploit. The repeated occurrence of zero entries in the probability tables leads to increased and unnecessary calculation and storage effort.

For these reasons, there is a need to make the statistical learning methods mentioned faster and more efficient. So-called EM (Expectation Maximization) learning methods are of increasing importance. To make an EM learning process concrete for the case of a naive Bayesian cluster model, the process steps are generally carried out as follows.

Here $X = \{X_k,\ k = 1, \ldots, K\}$ denotes a set of $K$ statistical variables (which can, e.g., correspond to the fields of a database). The states of the variables are identified with lowercase letters. The variable $X_1$ can assume the states $x_{1,1}, x_{1,2}, \ldots$, i.e. $X_1 \in \{x_{1,i},\ i = 1, \ldots, L_1\}$, where $L_1$ is the number of states of the variable $X_1$. An entry in a data record (a database) now consists of values for all variables, where $x^\pi \equiv (x_1^\pi, x_2^\pi, x_3^\pi, \ldots)$ denotes the $\pi$-th data record. In the $\pi$-th record the variable $X_1$ is in the state $x_1^\pi$, the variable $X_2$ in the state $x_2^\pi$, etc. The table has $M$ entries, i.e. $\{x^\pi,\ \pi = 1, \ldots, M\}$. In addition, there is a hidden variable or cluster variable, referred to here as $\Omega$; its states are $\{\omega_i,\ i = 1, \ldots, N\}$. So there are $N$ clusters.

In a statistical clustering model, $p(\Omega)$ now describes an a priori distribution; $p(\omega_i)$ is the a priori weight of the $i$-th cluster and $p(X \mid \omega_i)$ describes the structure of the $i$-th cluster, i.e. the conditional distribution of the observable quantities (contained in the database) $X = \{X_k,\ k = 1, \ldots, K\}$ in the $i$-th cluster. The a priori distribution and the conditional distributions for each cluster together parameterize a joint probability model on $X \cup \Omega$ or on $X$.

A naive Bayesian network assumes that $p(X \mid \omega_i)$ can be factored as $\prod_{k=1}^{K} p(X_k \mid \omega_i)$. In general, the aim is to determine the parameters of the model, i.e. the a priori distribution $p(\Omega)$ and the conditional probability tables $p(X \mid \omega_i)$, in such a way that the joint model reflects the entered data as well as possible. A corresponding EM learning process consists of a series of iteration steps, with an improvement of the model (in the sense of a so-called likelihood) being achieved in each iteration step. In each iteration step, new parameters $p^{new}(\ldots)$ are estimated based on the current or "old" parameters $p^{old}(\ldots)$.

Each EM step begins with the E step, in which "sufficient statistics" are determined in tables provided for this purpose. It starts with probability tables whose entries are initialized with zero values. In the course of the E step, the fields of the tables are filled with the so-called sufficient statistics $S(\Omega)$ and $S(X, \Omega)$ by supplementing the missing information (the assignment of each data point to the clusters) with expected values for each data point. Sufficient statistics are known from [1].

To calculate expected values for the cluster variable $\Omega$, the a posteriori distribution $p^{old}(\omega_i \mid x^\pi)$ must be determined. This step is also referred to as an "inference step". In the case of a Naive Bayesian Network, the a posteriori distribution for $\Omega$ is calculated according to the rule

$$p^{old}(\omega_i \mid x^\pi) = \frac{1}{Z^\pi}\, p^{old}(\omega_i) \prod_{k=1}^{K} p^{old}(x_k^\pi \mid \omega_i)$$

for each data point $x^\pi$ from the information entered, where $1/Z^\pi$ is a normalization constant.
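The inference step described above can be sketched as follows. This is a minimal illustration, not the implementation of the invention; the function name `posterior`, the parameter names `prior` and `cond`, and the nested-list table layout are assumptions made for this example.

```python
def posterior(x, prior, cond):
    """A posteriori cluster weights p(omega_i | x) for one data point.

    x     -- observed state index of each variable X_1..X_K
    prior -- prior[i] = p(omega_i)
    cond  -- cond[k][i][s] = p(X_k = s | omega_i)  (assumed layout)
    """
    weights = []
    for i, p_i in enumerate(prior):
        w = p_i
        for k, state in enumerate(x):      # product over all K variables
            w *= cond[k][i][state]
        weights.append(w)
    z = sum(weights)                        # normalization constant
    if z == 0.0:
        raise ValueError("data point has zero probability under the model")
    return [w / z for w in weights]


# Toy model: two clusters, two binary variables (made-up numbers).
prior = [0.5, 0.5]
cond = [
    [[0.9, 0.1], [0.2, 0.8]],   # p(X_1 | omega_1), p(X_1 | omega_2)
    [[0.0, 1.0], [0.5, 0.5]],   # p(X_2 | omega_1), p(X_2 | omega_2)
]
print(posterior([0, 0], prior, cond))
```

In this toy model the first cluster has a zero table entry for the observed state of the second variable, so its a posteriori weight for the data point `[0, 0]` is zero and the second cluster receives the full weight.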

The essence of this calculation consists of forming the product of the factors $p^{old}(x_k^\pi \mid \omega_i)$ over all $k = 1, \ldots, K$. This product must be formed in every E step for all clusters $i = 1, \ldots, N$ and for all data points $x^\pi$, $\pi = 1, \ldots, M$. The inference step for dependency structures other than a naive Bayesian network is similarly complex, often even more complex, and thus accounts for the essential numerical effort of EM learning.

The entries in the tables $S(\Omega)$ and $S(X, \Omega)$ change after the formation of the above product for each data point $x^\pi$, $\pi = 1, \ldots, M$, since $S(\omega_i)$ is incremented by $p^{old}(\omega_i \mid x^\pi)$ for all $i$, i.e. a sum over all $p^{old}(\omega_i \mid x^\pi)$ is formed. Similarly, $S(x, \omega_i)$, or $S(x_k, \omega_i)$ for all variables $k$ in the case of a Naive Bayesian Network, is in each case incremented by $p^{old}(\omega_i \mid x^\pi)$ for all clusters $i$. This concludes the E (Expectation) step. Based on this step, new parameters $p^{new}(\Omega)$ and $p^{new}(X \mid \Omega)$ are calculated for the statistical model, where $p(X \mid \omega_i)$ represents the structure of the $i$-th cluster or the conditional distribution of the quantities $X$ contained in the database in this $i$-th cluster.
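The accumulation of the sufficient statistics in the E step can be sketched as follows. The function name `e_step` and the table layout `cond[k][i][s]` are assumptions for illustration only, and the sketch assumes that every data point has non-zero probability under the current model.

```python
def e_step(data, prior, cond):
    """Accumulate sufficient statistics S(omega) and S(x_k, omega)."""
    n_clusters = len(prior)
    # The tables start out filled with zeros, as described in the text.
    s_omega = [0.0] * n_clusters
    s_x = [[[0.0] * len(row) for row in table] for table in cond]
    for x in data:
        # Inference step: unnormalized weights of all clusters.
        w = []
        for i in range(n_clusters):
            p = prior[i]
            for k, state in enumerate(x):
                p *= cond[k][i][state]
            w.append(p)
        z = sum(w)  # assumed > 0 for every data point
        for i in range(n_clusters):
            post = w[i] / z
            s_omega[i] += post                # S(omega_i) += p(omega_i | x)
            for k, state in enumerate(x):
                s_x[k][i][state] += post      # S(x_k, omega_i) += p(omega_i | x)
    return s_omega, s_x


prior = [0.5, 0.5]
cond = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.0, 1.0], [0.5, 0.5]],
]
s_omega, s_x = e_step([[0, 0], [0, 1]], prior, cond)
print(s_omega)
```

Because the a posteriori weights of each data point sum to one, the entries of `s_omega` sum to the number of data points.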

In the M (Maximization) step, new parameters $p^{new}(\Omega)$ and $p^{new}(X \mid \Omega)$, which are based on the already calculated sufficient statistics, are formed by optimizing a general log likelihood

$$L = \sum_{\pi=1}^{M} \log \sum_{i=1}^{N} p(x^\pi \mid \omega_i)\, p(\omega_i).$$

The M step no longer entails any significant numerical effort. For the general theory of EM learning, see also [5].
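For a naive Bayesian cluster model, the M step typically reduces to normalizing the accumulated tables. The following sketch assumes plain maximum-likelihood estimates without smoothing or priors (the text does not specify either); the name `m_step` and the table layout are illustrative assumptions.

```python
def m_step(s_omega, s_x):
    """New parameters from sufficient statistics (plain ML estimates, assumed)."""
    total = sum(s_omega)                       # equals the number of data points M
    prior_new = [s / total for s in s_omega]   # p_new(omega_i) = S(omega_i) / M
    cond_new = []
    for table in s_x:                          # one table per variable X_k
        rows = []
        for i, row in enumerate(table):
            if s_omega[i] > 0.0:
                # p_new(x_k | omega_i) = S(x_k, omega_i) / S(omega_i)
                rows.append([v / s_omega[i] for v in row])
            else:
                rows.append([0.0] * len(row))  # empty cluster: table stays at zero
        cond_new.append(rows)
    return prior_new, cond_new


s_omega = [0.9, 1.1]
s_x = [
    [[0.9, 0.0], [1.1, 0.0]],
    [[0.0, 0.9], [1.0, 0.1]],
]
prior_new, cond_new = m_step(s_omega, s_x)
print(prior_new)
```

The cost is one pass over the tables, independent of the number of data points, which is consistent with the statement that the M step is numerically cheap.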

It is therefore clear that the essential effort of the algorithm lies in the inference step, i.e. in the formation of the product $\prod_{k} p^{old}(x_k^\pi \mid \omega_i)$, and in the accumulation of the sufficient statistics. The formation of numerous zero elements in the probability tables $p^{old}(X \mid \omega_i)$ or $p^{old}(X_k \mid \omega_i)$ can, however, be exploited through clever data structures and by storing intermediate results from one EM step to the next in order to calculate the products efficiently.

A general and extensive treatment of learning processes by means of Bayesian Networks can be found in [2]; in particular, the problem of partially missing data is addressed in [3, page 19] and [4]. A disadvantage of these learning methods is that sparsely populated tables (tables with many zero entries) are processed, which causes a great deal of computing effort without yielding any additional information about the data model to be evaluated.

The invention is therefore based on the object of specifying a method in which zero entries in probability tables are used in such a way that no further unnecessary numerical or computational effort is caused as a by-product.

The object is achieved by the features of patent claim 1. Preferred developments of the invention result from the subclaims.

The invention essentially consists in the following: when inferring in a statistical model or in a clustering model, the product formed from the terms of the membership function or of the conditional probability tables is computed as usual, but as soon as the first zero occurs in the associated factors, or a weight of zero is determined for the cluster after the first steps, the further calculation of the a posteriori weight can be stopped. If, in an iterative learning process (e.g. an EM learning process), a cluster is assigned the weight zero for a certain data point, this cluster will also receive the weight zero for this data point in all further steps, and therefore no longer needs to be considered in any further learning step. This ensures a sensible elimination of the processing of irrelevant parameters and data. This has the advantage that the learning process can be carried out quickly by considering only the relevant data.

More precisely, the inventive method proceeds as follows: the formation of an overall product in an inference step as described above, which consists of factors of a posteriori distributions of membership probabilities for all entered data points, is carried out as usual, but as soon as a first predeterminable value, preferably zero or a value close to zero, occurs in the associated factors, the formation of the overall product is terminated. It can also be shown that if, in an EM learning process, a cluster is assigned a weight equal to the value chosen as described above, preferably zero, for a certain data point, this cluster will also be assigned zero weight for this data point in all further EM steps. This ensures a sensible elimination of superfluous numerical effort, for example by temporarily storing the corresponding results from one EM step to the next and processing them only for the clusters that are not weighted zero.

The advantages are that the learning process is significantly accelerated overall, not only within one EM step but also for all further steps, especially when the product is formed in the inference step, due to the termination of processing when clusters with zero weights occur.

In the method for determining a probability distribution existing in predetermined data, membership probabilities for certain classes are calculated in an iterative process only up to a predeterminable value, or a value of zero or almost zero, and the classes with membership probabilities below a selectable value are no longer used in the iterative process.

It is preferred that the predetermined data form clusters.

A suitable iterative method would be the expectation maximization method, in which a product of membership factors is also calculated.

In a further development of the method, a sequence of the factors to be calculated is selected in such a way that the factor that belongs to a rarely occurring state of a variable is processed first. The rarely occurring values can be stored in an ordered list before the formation of the product begins, so that the variables are ordered according to the frequency of occurrence of a zero in the list.

It is also advantageous to use a logarithmic representation of probability tables.

It is also advantageous to use a sparse representation of the probability tables, e.g. in the form of a list that contains only the non-zero elements.
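A sparse representation of one table row might look like the following sketch. The text speaks of a list of the non-zero elements; a dictionary keyed by state index is used here instead as an illustrative choice, and the names `to_sparse` and `lookup` are hypothetical.

```python
def to_sparse(row):
    """Keep only the non-zero entries of one table row as {state: probability}."""
    return {state: p for state, p in enumerate(row) if p != 0.0}


def lookup(sparse_row, state):
    """Entries that are not stored are implicit zeros."""
    return sparse_row.get(state, 0.0)


dense = [0.0, 0.0, 0.7, 0.0, 0.3]
sparse = to_sparse(dense)
print(sparse)             # {2: 0.7, 4: 0.3}
print(lookup(sparse, 1))  # 0.0
```

Storage then grows with the number of non-zero entries rather than with the number of states.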

Furthermore, only those clusters that have a non-zero weight are taken into account when calculating sufficient statistics.

The clusters that have a non-zero weight can be stored in a list, the data stored in the list being pointers to the corresponding clusters. The method can also be an expectation maximization learning process in which, in the event that a cluster is given an a posteriori weight of zero for a data point, this cluster receives zero weight for this data point in all further steps of the EM method, so that this cluster need not be taken into account in any further step.

The method can run only over clusters that have a non-zero weight.

The invention is explained in more detail below using exemplary embodiments.

The figures show:

Fig. 1 a diagram for carrying out the invention covered by claim 1

Fig. 2 a scheme for re-sorting variables depending on the frequency of their appearance

Fig. 3 the exclusive consideration of clusters that have received a non-zero weight

I. First embodiment in an inference step

a). Formation of an overall product with interruption at zero value

FIG. 1 shows a diagram in which the formation of an overall product 3 is carried out for each cluster ω in an inference step. As soon as the first zero 2b occurs in the associated factors 1, which can be read out, for example, from a memory, an array or a pointer list, the formation of the overall product 3 is terminated (exit). In the case of a zero value, the a posteriori weight belonging to the cluster is then set to zero. Alternatively, it can first be checked whether at least one of the factors in the product is zero. In that case, the multiplications for the formation of the overall product are only carried out if all factors are different from zero.

If, on the other hand, no zero value occurs in a factor belonging to the overall product, represented by 2a, the formation of product 3 is continued as normal: the next factor 1 is read out of the memory, array or pointer list and, subject to condition 2, used to continue forming product 3.
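The scheme of FIG. 1 can be sketched as a loop that leaves the product as soon as a factor reaches the cut-off. The function name `cluster_weight`, the parameter `eps` (the predeterminable value, zero by default) and the table layout `cond[k][i][s]` are assumptions for illustration.

```python
def cluster_weight(x, i, prior, cond, eps=0.0):
    """Unnormalized weight of cluster i for data point x, with early exit.

    eps -- factors at or below this value are treated as zero (assumed parameter)
    """
    w = prior[i]
    if w <= eps:
        return 0.0
    for k, state in enumerate(x):
        factor = cond[k][i][state]   # read out the next factor
        if factor <= eps:            # first zero: terminate the product (exit)
            return 0.0
        w *= factor
    return w


prior = [0.5, 0.5]
cond = [
    [[0.0, 1.0], [0.2, 0.8]],
    [[0.3, 0.7], [0.5, 0.5]],
]
print(cluster_weight([0, 0], 0, prior, cond))  # 0.0, second factor never read
print(cluster_weight([0, 0], 1, prior, cond))
```

For the first cluster the loop stops at the first variable, so the remaining factors are neither read nor multiplied.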

b). Advantages of interrupting the formation of the overall product when zero values occur

Since the inference step does not necessarily have to be part of an EM learning process, this optimization is also of particular importance in other detection and forecasting processes in which an inference step is required, e.g. when recognizing an optimal offer on the Internet for a customer whose information is available. On this basis, targeted marketing strategies can be generated, whereby the recognition or classification skills lead to automatic reactions that, for example, send information to a customer.

c). Selection of a suitable sequence to speed up data processing

FIG. 2 shows a preferred development of the method according to the invention, in which a clever sequence is selected in such a way that if a factor in the product is zero, represented by 2a, this factor is highly likely to appear very early, as one of the first factors in the product. The formation of the overall product 3 can thus be terminated very early. The new sequence 1a can be determined in accordance with the frequency with which the states of the variables appear in the data. For example, a factor that belongs to a very rarely occurring state of a variable is processed first. The order in which the factors are processed can thus be determined once before the start of the learning process by storing the values of the variables in a correspondingly ordered list 1a.
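One way to determine such a sequence once before learning starts is sketched below: for every data point, the variables are sorted so that the variable whose observed state is rarest in the data comes first. The function name `factor_order` and the per-data-point index lists are assumptions; the text only requires an ordered list.

```python
from collections import Counter


def factor_order(data):
    """For each data point, variable indices sorted so that rare states come first."""
    n_vars = len(data[0])
    # counts[k][s] = how often variable k is in state s in the data
    counts = [Counter(x[k] for x in data) for k in range(n_vars)]
    return [
        sorted(range(n_vars), key=lambda k: counts[k][x[k]])
        for x in data
    ]


data = [[0, 0], [0, 1], [0, 1], [1, 1]]
print(factor_order(data))  # [[1, 0], [0, 1], [0, 1], [0, 1]]
```

The product for a data point is then formed in the stored order, so that a factor likely to be zero is read as early as possible.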

d). Logarithmic representation of the tables

In order to limit the computing effort of the above-mentioned method as much as possible, a logarithmic representation of the tables is preferably used, for example to avoid underflow problems. In this representation, zero elements can, for example, be replaced by a positive value. This means that complex processing or separation of values that are almost zero and differ from one another by a very small distance is no longer necessary.
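A possible reading of this passage is sketched below: the tables hold logarithms, products become sums, and a zero entry is marked by a positive sentinel value, which cannot collide with a real entry because the logarithm of a probability is never positive. The sentinel `ZERO` and the names `to_log` and `log_weight` are assumptions, not taken from the text.

```python
import math

ZERO = 1.0   # assumed sentinel for "probability zero"; real log entries are <= 0


def to_log(row):
    """Logarithmic table row; zero entries are replaced by the sentinel."""
    return [math.log(p) if p > 0.0 else ZERO for p in row]


def log_weight(x, i, log_prior, log_cond):
    """Sum of log factors for cluster i; None if the weight is zero."""
    total = log_prior[i]
    if total == ZERO:
        return None
    for k, state in enumerate(x):
        term = log_cond[k][i][state]
        if term == ZERO:          # zero factor: stop, as in the early exit
            return None
        total += term
    return total


# 200 factors of 0.001: the direct product underflows, the log sum does not.
log_prior = to_log([1.0])
log_cond = [[to_log([1e-3, 0.999])] for _ in range(200)]
print(log_weight([0] * 200, 0, log_prior, log_cond))
```

The log sum stays a finite number where the direct product of the same 200 factors would underflow to zero in double precision.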

e). Avoiding increased summation when calculating sufficient statistics

In the event that the stochastic variables added to the learning process have a low probability of belonging to a particular cluster, many clusters will have an a posteriori weight of zero in the course of the learning process. In order to also accelerate the accumulation of sufficient statistics in the subsequent step, only those clusters that have a weight other than zero are considered in this step. To increase the performance of the inventive learning method, it is advantageous to store the non-zero clusters in a list, an array or a similar data structure that allows only the non-zero elements to be saved.
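The accumulation restricted to non-zero clusters can be sketched as follows. The helper names `nonzero` and `accumulate` and the representation as a list of `(cluster index, weight)` pairs are assumptions for illustration.

```python
def nonzero(weights):
    """List of (cluster index, weight) pairs for the non-zero weights only."""
    return [(i, w) for i, w in enumerate(weights) if w != 0.0]


def accumulate(x, pairs, s_omega, s_x):
    """Add one data point to the sufficient statistics, visiting only the
    clusters stored in `pairs`."""
    for i, w in pairs:
        s_omega[i] += w
        for k, state in enumerate(x):
            s_x[k][i][state] += w


# Three clusters, two binary variables; only cluster 2 has non-zero weight.
s_omega = [0.0, 0.0, 0.0]
s_x = [[[0.0, 0.0] for _ in range(3)] for _ in range(2)]
pairs = nonzero([0.0, 0.0, 1.0])
accumulate([1, 0], pairs, s_omega, s_x)
print(pairs, s_omega)
```

The inner loops run once per stored pair, so clusters with zero weight cost nothing during accumulation.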

II. Second embodiment in an EM learning process

a). Disregarding clusters with zero mappings for a data point

In particular, an EM learning process stores, from one step of the learning process to the next, which clusters are still permitted and which are no longer permitted owing to the occurrence of zeros in the tables. Whereas in the first exemplary embodiment clusters that are given an a posteriori weight of zero through multiplication by zero are excluded from all further calculations in order to save numerical effort, in this embodiment of the invention intermediate results regarding the cluster membership of individual data points (which clusters are already excluded and which are still permitted) are also stored, from one EM step to the next, in additionally required data structures. This makes sense because it can be shown that a cluster that has received zero weight for a data point in one EM step will also receive zero weight in all subsequent steps.
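The persistence of a zero weight can be checked with a short argument. The following is a sketch only, under assumptions the text does not state explicitly: the weight is exactly zero (not merely close to zero), the model is a naive Bayesian network with $p^{old}(\omega_i) > 0$, and the M step uses the plain normalized statistics $p^{new}(x_k \mid \omega_i) = S(x_k, \omega_i) / S(\omega_i)$ without smoothing.

```latex
% Cluster i has weight zero for data point x^\pi in the current EM step.
\begin{align*}
p^{old}(\omega_i \mid x^{\pi}) = 0
  &\;\Rightarrow\; \exists\, k:\; p^{old}(x_k^{\pi} \mid \omega_i) = 0 \\
  &\;\Rightarrow\; p^{old}(\omega_i \mid x^{\pi'}) = 0
     \quad \text{for every } \pi' \text{ with } x_k^{\pi'} = x_k^{\pi} \\
  &\;\Rightarrow\; S(x_k^{\pi}, \omega_i)
     = \sum_{\pi' :\, x_k^{\pi'} = x_k^{\pi}} p^{old}(\omega_i \mid x^{\pi'}) = 0 \\
  &\;\Rightarrow\; p^{new}(x_k^{\pi} \mid \omega_i)
     = \frac{S(x_k^{\pi}, \omega_i)}{S(\omega_i)} = 0 \\
  &\;\Rightarrow\; p^{new}(\omega_i \mid x^{\pi}) = 0 .
\end{align*}
```

If instead $S(\omega_i) = 0$, the new a priori weight $p^{new}(\omega_i)$ is itself zero and the conclusion holds as well. Repeating the argument from step to step gives the statement for all subsequent EM steps.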

FIG. 3 specifically shows the case in which a data point 4 is assigned to a cluster with a probability of almost zero, 2a. In the next step of the learning method, 5a + 1, in which the probability of this assignment of the data point is calculated again, the weight of this cluster can immediately be set to zero. Thus, a cluster that has received a weight of zero via 2a for a data point 4 in an EM step 5a need not be considered further within the current EM step 5a, and it is also no longer taken into account via 2a in any further EM step 5a + n, where n represents the number of EM steps used (not shown). The calculation of the membership of a data point in a new cluster can then be continued again via 4. A membership of a data point 4 in a cluster that is not almost zero leads to a continued calculation via 2b for the next EM step 5a + 1.

b). Storing a list of references to relevant clusters

For each data point, a list or a similar data structure can first be saved that contains references to the relevant clusters, which have been given a non-zero weight for this data point. This ensures that in all operations or procedural steps in the formation of the overall product and the accumulation of sufficient statistics, the loops then only run over the still permissible or relevant clusters.
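The per-data-point lists can be sketched as one EM step whose loops run only over the clusters still permitted for each data point. The function name `em_step`, the list `active`, and the plain maximum-likelihood M step are assumptions; the sketch also assumes that every data point keeps at least one permitted cluster.

```python
def em_step(data, prior, cond, active):
    """One EM step over the permitted clusters only.

    active[n] -- cluster indices still permitted for data point n; the list is
                 pruned in place and reused in the next EM step.
    """
    s_omega = [0.0] * len(prior)
    s_x = [[[0.0] * len(row) for row in table] for table in cond]
    for n, x in enumerate(data):
        weights = {}
        for i in active[n]:
            w = prior[i]
            for k, state in enumerate(x):
                w *= cond[k][i][state]
                if w == 0.0:
                    break                     # early exit at the first zero
            if w > 0.0:
                weights[i] = w
        active[n] = list(weights)             # drop clusters with zero weight
        z = sum(weights.values())
        for i, w in weights.items():
            s_omega[i] += w / z
            for k, state in enumerate(x):
                s_x[k][i][state] += w / z
    # M step: normalize the accumulated tables (no smoothing assumed).
    prior = [s / len(data) for s in s_omega]
    cond = [[[v / s_omega[i] if s_omega[i] else 0.0 for v in row]
             for i, row in enumerate(table)] for table in s_x]
    return prior, cond


data = [[0, 0], [0, 1]]
prior = [0.5, 0.5]
cond = [[[0.9, 0.1], [0.2, 0.8]], [[0.0, 1.0], [0.5, 0.5]]]
active = [[0, 1] for _ in data]
for _ in range(3):
    prior, cond = em_step(data, prior, cond, active)
print(active)  # [[1], [0, 1]]
```

After the first step the first data point keeps only the second cluster, and later steps never revisit the excluded cluster for that data point.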

Overall, only the permitted clusters are stored in this exemplary embodiment, but for each data point in a data record.

III. Another embodiment

A combination of the exemplary embodiments already mentioned is used here. A combination of the two exemplary embodiments enables termination at zero weights in the inference step, only the permissible clusters according to the second exemplary embodiment being taken into account in further EM steps.

This creates an overall optimized EM learning process. Since cluster models are widely used for recognition and forecasting methods, an optimization of the kind according to the invention is of particular advantage and value.

IV. Arrangement for performing the inventive method

The inventive method according to one or all exemplary embodiments can in principle be carried out with a suitable computer and memory arrangement. The computer and memory arrangement should be equipped with a computer program that executes the method steps. The computer program can also be stored on a data medium, e.g. a CD-ROM, and thus transferred to other computer systems and executed.

A further development of the computer and memory arrangement mentioned consists in the additional arrangement of an input and output unit. The input units can use sensors, detectors, input keypads or servers to feed information about the status of an observed system, such as the number of accesses to a website, into the computer arrangement, for example into the memory. The output unit would consist of hardware that stores the signals of the results of the processing according to the inventive method or displays them on a screen. An automatic, electronic reaction, for example the sending of a specific email in accordance with the evaluation according to the inventive method, is also conceivable.

V. Application example

The collection of statistics on the use of a website, or the analysis of web traffic, is known today under the keyword web mining. A cluster found through the learning process can, for example, reflect typical behavior of many Internet users. The learning process enables, for example, the recognition that all visitors from a class, i.e. those who have been assigned to the cluster found by the learning process, do not stay in a session for more than one minute and usually only retrieve one page. It is also possible to determine statistical information about the visitors who come to the analyzed website via a free-text search engine (freetext search). For example, many of these users request only one document, and they might mostly request freeware and hardware documents. The learning process can determine the assignment of visitors coming from a search engine to different clusters. Some clusters are almost completely ruled out, while another cluster can receive a relatively high weight.

The following publications are cited in this document:

[1] Sufficient, Complete, Ancillary Statistics, available on August 28, 2001 at the Internet address http://www.math.uah.edu/stat/point/point6.html

[2] B. Thiesson, C. Meek, and D. Heckerman. Accelerating EM for Large Databases. Technical Report MSR-TR-99-31, Microsoft Research, May, 1999 (Revised February, 2001), available on November 14, 2001 at the Internet address: http://www.research.microsoft.com/~heckerman/

[3] D. Heckerman, A Tutorial on Learning With Bayesian Networks, available on March 18, 2002 from the ftp address:

[4] David Maxwell Chickering and David Heckerman, available on March 18, 2002 at the Internet address: http://www.research.microsoft.com/scripts/pubs/view.asp?

[5] M.A. Tanner, Tools for Statistical Inference, Springer, New York, 1996

Claims


1. Method for determining a probability distribution (1) existing in predefined data, in which membership probabilities (2) for selectable classes are only calculated up to a predeterminable value (A) in an iterative process and the classes with membership probabilities below a selectable value (B) are no longer used in the iterative process.

2. The method according to claim 1, wherein the predeterminable value (A) is zero.

3. The method according to any one of claims 1 or 2, wherein the predetermined data form clusters.

4. The method according to claim 1, wherein the iterative method comprises an expectation maximization algorithm.

5. The method according to claim 4, wherein a product (3) is calculated from probability factors.

6. The method according to claim 5, wherein the calculation of the product is terminated as soon as a selectable value close to 0 (A) occurs in the factors belonging to the product.

7. The method according to any one of claims 4 or 5, in which a sequence of the factors to be calculated is selected such that the factor that belongs to a variable that rarely occurs in the data is processed first.

8. The method according to claim 7, wherein the rarely occurring values are stored in an ordered list (1a) prior to the formation of the product, so that the variables are ordered according to the frequency of their appearance in the list.

9. The method according to any one of claims 1 to 8, in which a logarithmic representation of probability tables is used.

10. The method according to any one of claims 1 to 9, in which a sparse representation of probability tables is used, in the form of a list that contains only the non-zero elements.

11. The method according to any one of claims 1 to 10, in which sufficient statistics are calculated.

12. The method according to claim 11, in which only those clusters which have a non-zero weight are taken into account in the calculation of sufficient statistics.

13. The method according to any one of claims 1 to 8, wherein the clusters that have a non-zero weight are stored in a list.

14. The method according to any one of claims 1 to 9, which is used in an expectation maximization learning process, in which, in the event that a cluster is given an a posteriori weight of zero for a data point, this cluster receives the weight zero in all further steps for this data point, such that this cluster no longer has to be taken into account in all further EM process steps.

15. The method according to claim 13, in which a list of references to clusters which have a non-zero weight is stored for each data point.

16. The method according to any one of claims 10 or 11, wherein the iterative method only runs over clusters that have a non-zero weight.