WO2006037747A2

WO2006037747A2 - Method for structuring a data stock that is stored on at least one storage medium

Info

Publication number: WO2006037747A2
Application number: PCT/EP2005/054891
Authority: WO
Inventors: Volker Tresp; Kai Yu; Shipeng Yu
Original assignee: Siemens Aktiengesellschaft
Priority date: 2004-10-04
Filing date: 2005-09-28
Publication date: 2006-04-13
Also published as: WO2006037747A3

Abstract

The invention relates to a non-parametric Bayesian method for analysing data records in which elements occur with a specific frequency. The introduced model retains the strength of earlier approaches, which investigate the latent factors of each data record (e.g. topics of documents), while at the same time permitting the investigation of the cluster structure of the data records, which reflects the statistical dependency of the latent factors. Compared to parametric Bayesian modelling, the non-parametric model induced by a Dirichlet process (DP) is flexible enough to reveal the data structure. Instead of relying on Markov chain Monte Carlo (MCMC) sampling, which is slow in this setting, the inventive method introduces an efficient variational inference based on a finite, high-dimensional approximation of the DP.

Description

Method for structuring a data stock that is stored on at least one storage medium

The present invention relates to a method and a computer program product for the structuring of a data stock stored on at least one storage medium.

Huge amounts of data are generated today in companies, in research projects, in administrations and on the Internet. Data mining allows the automatic evaluation of such data stocks by means of statistical methods. The goal is the detection of rules or statistical anomalies, that is, the systematic (usually automated or semi-automatic) detection and extraction of previously unknown information from large amounts of data. For this purpose, the data stocks are examined for regularities, patterns and structures, deviations, and any kind of relationships and mutual influences. The process of pattern recognition and knowledge extraction is also called Knowledge Discovery in Databases (KDD).

We consider the problem of modeling a large corpus of high-dimensional discrete data records. We assume that a data record can be modeled by latent factors that account for the occurrence of elements in the record. To make the discussion concrete, we identify data records with documents, latent factors with (latent) topics, and elements with words. PLSI [7] is one of the first attempts at a probabilistic approach to modeling text documents as compositions of latent topics. LDA [4] generalizes PLSI by treating the topic mixture parameters (i.e. a multinomial over topics) as variables drawn from a Dirichlet distribution. Its Bayesian treatment avoids overfitting, and the model generalizes to new data (the latter is problematic for PLSI). However, the parametric Dirichlet distribution can be a limitation for applications that have a richer structure. For example, consider Figure 1 (a), which shows the empirical distribution of three topics. We see that the probability that all three topics occur in one text (corresponding to the middle of the drawing) is close to zero. In contrast, a Dirichlet distribution fitted to the data (Figure 1 (b)) would predict the highest probability density for exactly this case. This is due to the limited expressiveness of a simple Dirichlet distribution.
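The limitation described above can be made tangible with a small numerical sketch. It evaluates the log-density of a three-dimensional Dirichlet distribution at the center of the simplex and at a point near an edge. The parameter values are hypothetical (the fitted parameters behind Figure 1 (b) are not given in the text); the sketch only assumes the standard Dirichlet density with all parameters greater than 1, for which the mode lies in the interior of the simplex.

```python
import math

def dirichlet_log_density(theta, lam):
    """Log-density of Dir(theta | lam) for a point theta in the open simplex."""
    log_norm = math.lgamma(sum(lam)) - sum(math.lgamma(a) for a in lam)
    return log_norm + sum((a - 1.0) * math.log(t) for a, t in zip(lam, theta))

lam = [2.0, 2.0, 2.0]             # hypothetical parameters, all greater than 1
center = [1 / 3, 1 / 3, 1 / 3]    # all three topics equally present
near_edge = [0.48, 0.48, 0.04]    # third topic almost absent

center_density = dirichlet_log_density(center, lam)
edge_density = dirichlet_log_density(near_edge, lam)
print(center_density > edge_density)
```

With these assumed parameters the density is highest where all three topics mix, which is the opposite of the ring-shaped empirical distribution described for Figure 1 (a).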

The present invention is therefore based on the object of specifying a method for identifying topics and/or topic groups in a data stock stored on at least one storage medium, which is considerably more flexible than the methods known from the prior art and thus also permits the identification of topic groups or the clustering of documents.

According to the invention, this object is achieved by a method and a computer program product having the features specified in claim 1 and claim 13. Advantageous further developments of the present invention are specified in the dependent claims.

According to the present invention, in a method for structuring a data stock stored on at least one storage medium, a Dirichlet distribution with any number of states is used as the statistical model for modeling the data stock. In an iterative process, variables of the statistical model are adapted to the data stock. The data stock is structured according to the states of the adapted statistical model.

According to a preferred embodiment of the method according to the invention, the statistical model for modeling the data stock is designed as a Dirichlet process and the iterative process is designed as a mean-field algorithm based on a finite approximation. This model for Dirichlet-extended latent semantic analysis advantageously retains the performance of earlier approaches in finding the latent topics, and also introduces additional modeling flexibility to examine document clusters. The inference uses a variational mean-field approximation based on a finite approximation of the DP (Dirichlet process).

According to a further advantageous embodiment of the present invention, the structured data and/or information about the structured data stock is stored on a storage medium. Since data about the structure can be found very quickly on the storage medium, this has the advantageous effect of ensuring fast access to the data.

Without limiting the generality of this term, the storage medium comprises volatile, permanent and semi-permanent storage media, wherein the storage can be done for example on electronic, magnetic, optical and magneto-optical media.

When the computer program product according to the invention is executed, the program sequence control device uses a Dirichlet distribution with a large number of states as the statistical model for modeling the data stock in order to structure a data stock stored on at least one storage medium. In an iterative process, variables of the statistical model are adapted to the data stock. The data stock is structured according to the states of the adapted statistical model.

The present invention is explained in more detail below using an exemplary embodiment with reference to the drawings, in which:

Figure 1 shows a 2-dimensional simplex that represents three topics (the three probabilities must sum to 1): (a) the probability distribution of topics in documents, which forms an annular distribution; dark areas represent a low density; (b) the 3-dimensional Dirichlet distribution that maximizes the likelihood of the samples,

Figure 2 shows (a) a latent semantic analysis with DP prior; (b) an equivalent representation, where $c_d$ is the indicator variable indicating which cluster document $d$ takes from the infinite set of DP-induced clusters; (c) a latent semantic analysis with a finite approximation of the DP (see Section 2),

Figure 3 shows experimental results for a toy problem: (a) the initial random document-cluster assignment $\psi_{d,l}$; (b) the document-cluster assignment after one EM step; (c) the document-cluster assignment after five EM steps; (d) the original $\beta$; (e) the estimated $\beta$; (f) the learned number of clusters versus the actual number, with mean and error interval,

Figure 4 shows (a) and (b) perplexity results on Reuters-21578 and 20 Newsgroups for DELSA, PLSI and LDA; (c) the clustering result on the 20 Newsgroups data set.

In this embodiment, a more general non-parametric Bayesian approach is developed that can be used to study not only latent topics and their probabilities, but also complex dependencies between them, which can be expressed, for example, as a complex cluster structure. The main innovation is that the parametric Dirichlet prior of LDA is replaced by a flexible non-parametric distribution $G(\cdot)$, a sample generated from a Dirichlet process (DP) in which the Dirichlet distribution of LDA serves as the base distribution. In this Dirichlet-extended model, the a posteriori distribution of the topic mixture for a new document converges to a flexible infinite mixture model in which both the mixture weights and the mixture parameters can be learned from the data. The a posteriori distribution is thus able to represent the distribution of topics more faithfully. After the learning procedure has converged, typically only a few components with non-negligible weights remain; the model is thus able to output clusters of documents in a natural way.

Non-parametric Bayesian modeling has gained considerable recognition in the learning community (e.g. [1, 11, 2, 3, 12]). A potential problem with this class of models is that inference typically relies on MCMC approximations, which can be prohibitively slow for large collections of documents. In addition to presenting a Dirichlet-extended LDA model, this exemplary embodiment therefore proposes a variational mean-field inference based on a finite approximation for non-parametric Bayesian modeling.

The embodiment is organized as follows. The first section introduces Dirichlet-extended latent semantic analysis. The second section presents inference and learning algorithms derived from the variational approximation. The third section presents experimental results on a toy data set and two document data sets. The last section draws conclusions.

1 Dirichlet-extended latent semantic analysis

We use the notation from [4] and consider a corpus $\mathcal{D}$ containing $D$ documents. Each document $d$ is a finite sequence of $N_d$ words, denoted by $w_d = \{w_{d,1}, \ldots, w_{d,N_d}\}$, where $w_{d,n}$ is a variable for the $n$-th word in $w_d$ and denotes the index of the corresponding word in the vocabulary $V$. Note that the same word can occur multiple times in the sequence $w_d$.

1.1 The proposed model

We assume that each document is a mixture of latent topics, and that the words in each document are generated by repeated sampling of topics and words, using the following distributions:

$$w_{d,n} \mid z_{d,n}; \beta \sim \mathrm{Multinomial}(w_{d,n} \mid z_{d,n}, \beta) \tag{1}$$

$$z_{d,n} \mid \theta_d \sim \mathrm{Multinomial}(z_{d,n} \mid \theta_d) \tag{2}$$

Here $w_{d,n}$ is generated given its latent topic $z_{d,n}$, which takes values in $\{1, \ldots, k\}$. $\beta$ is a $k \times |V|$ matrix of multinomial parameters with $\sum_j \beta_{i,j} = 1$, where $\beta_{z, w_{d,n}}$ specifies the probability that the word $w_{d,n}$ is generated given topic $z$. $\theta_d$ denotes the parameters of a multinomial distribution of document $d$ over the topics for $w_d$, satisfying $\theta_{d,i} \ge 0$ and $\sum_{i=1}^{k} \theta_{d,i} = 1$.
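The two sampling steps of Eqs. (1) and (2) can be sketched as follows. This is only an illustration of the generative process for one document: the function name, the tiny vocabulary of four words and all parameter values are hypothetical and not taken from the embodiment.

```python
import random

def sample_document(theta_d, beta, n_words, rng):
    """Draw word indices: z ~ Multinomial(theta_d), then w ~ Multinomial(beta[z])."""
    k = len(theta_d)
    vocab_size = len(beta[0])
    words = []
    for _ in range(n_words):
        z = rng.choices(range(k), weights=theta_d)[0]           # Eq. (2): latent topic
        w = rng.choices(range(vocab_size), weights=beta[z])[0]  # Eq. (1): word given topic
        words.append(w)
    return words

rng = random.Random(0)
# Hypothetical parameters: k = 2 topics over a vocabulary of 4 words.
beta = [[0.5, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5]]
mixed_doc = sample_document([0.9, 0.1], beta, 10, rng)
pure_doc = sample_document([1.0, 0.0], beta, 10, rng)
print(mixed_doc, pure_doc)
```

A document whose mixture puts all mass on the first topic can only contain the words that this topic supports, which is what `pure_doc` illustrates.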

In the LDA model, $\theta_d$ is generated from a $k$-dimensional Dirichlet distribution $G_0(\theta) = \mathrm{Dir}(\theta \mid \lambda)$ with parameters $\lambda \in \mathbb{R}^{k \times 1}$. In our Dirichlet-extended model, we assume that $\theta_d$ is generated from a distribution $G(\theta)$ that is itself a random sample generated from a Dirichlet process (DP):

$$G \mid G_0, \alpha_0 \sim \mathrm{DP}(G_0, \alpha_0) \tag{3}$$

where the non-negative scalar $\alpha_0$ is the precision parameter, and $G_0(\theta)$ is the base distribution, which is identical to the Dirichlet distribution in LDA. It turns out that a distribution $G(\theta)$ drawn from a DP can be written as

$$G = \sum_{l=1}^{\infty} \pi_l \, \delta_{\theta^*_l}(\cdot) \tag{4}$$

where $\sum_{l=1}^{\infty} \pi_l = 1$; $\delta_\theta$ are point mass distributions concentrated at $\theta$, and the $\theta^*_l$ are countably infinitely many variables drawn independently and identically distributed (i.i.d.) from $G_0$. The probability weights $\pi_l$ depend solely on $\alpha_0$ via a stick-breaking process, as defined in the next subsection. The generative model summarized in Figure 2 (a) is conditioned on $(k \times |V| + k + 1)$ parameters, i.e. $\beta$, $\lambda$ and $\alpha_0$.

1.2 Stick-Breaking and Dirichlet extension

The representation of a sample from the DP prior in Eq. (4) is generated by the stick-breaking process, in which infinitely many pairs $(\pi_l, \theta^*_l)$ are generated. $\theta^*_l$ is drawn independently from $G_0$, and $\pi_l$ is defined as

$$\pi_1 = B_1, \qquad \pi_l = B_l \prod_{j=1}^{l-1} (1 - B_j),$$

where the $B_l$ are drawn i.i.d. from the beta distribution $\mathrm{Beta}(1, \alpha_0)$. With a small $\alpha_0$, the first "sticks" $\pi_l$ are large, and little is left for the remaining sticks. When $\alpha_0$ is large, on the other hand, the first sticks $\pi_l$ and all subsequent sticks are small, and the $\pi_l$ are more evenly distributed. As a consequence, the base distribution determines the locations of the point masses, and $\alpha_0$ determines the distribution of the probability weights, resulting in a clustered solution if $\alpha_0$ is chosen to be small. Note that neither the locations nor the weights are fixed; they take new values whenever a new sample of $G$ is generated. Since $E(G) = G_0$, the prior initially corresponds to the prior used in LDA. If there are many documents in the training data set, the locations $\theta^*_l$ that agree with the data receive high weight. If a small $\alpha_0$ is chosen, the parameters form clusters, whereas a large $\alpha_0$ results in many representative parameters. The Dirichlet extension thus fulfills two tasks: it increases the flexibility in representing the a posteriori distribution of the mixture weights, and it favors a clustered solution that gives insight into the document corpus.
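The effect of $\alpha_0$ on the stick lengths can be explored with a short simulation. The sketch below truncates the infinite process after a fixed number of sticks, which is an assumption made purely so that the example terminates; the function name and the chosen values of $\alpha_0$ are illustrative.

```python
import random

def stick_breaking(alpha0, n_sticks, rng):
    """Return the first n_sticks weights pi_l and the stick length left over."""
    weights, remaining = [], 1.0
    for _ in range(n_sticks):
        b = rng.betavariate(1.0, alpha0)   # B_l ~ Beta(1, alpha0)
        weights.append(remaining * b)      # pi_l = B_l * prod_{j<l} (1 - B_j)
        remaining *= 1.0 - b
    return weights, remaining

rng = random.Random(0)
weights_small, rest_small = stick_breaking(0.5, 10, rng)    # small alpha0: few large sticks
weights_large, rest_large = stick_breaking(20.0, 10, rng)   # large alpha0: many small sticks
print([round(w, 3) for w in weights_small])
print([round(w, 3) for w in weights_large])
```

Averaged over many draws, the first stick has expected length $1 / (1 + \alpha_0)$, so a small $\alpha_0$ concentrates most of the probability mass on the first few components.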

The DP prior offers two advantages over the usual methods of clustering documents. First, the number of clusters need not be specified. The resulting cluster structure is constrained by the DP prior, but also adapts to the empirical observations. Second, the number of clusters is not fixed. Although $\alpha_0$ is a control parameter that can be used to tune the tendency to form clusters, the DP prior allows new clusters to be created if the current model cannot explain upcoming data very well. This is particularly suitable for our setting, because the dictionary is fixed while the number of documents can grow.

By applying the stick-breaking representation, our model obtains the equivalent representation in Figure 2 (b). An infinite number of $\theta^*_l$ are generated from the base distribution, and the new indicator variable $c_d$ indicates which $\theta^*_l$ is assigned to document $d$. If more than one document is assigned to the same $\theta^*_l$, clusters form. $\pi = \{\pi_1, \ldots, \pi_\infty\}$ is the vector of probability weights generated by the stick-breaking process.

Our model is a generalization of LDA. If $\alpha_0 \to \infty$, the model becomes identical to LDA, since the sample $G$ becomes identical to the finite Dirichlet base distribution $G_0$. In this extreme case, documents are mutually independent given $G_0$, since the $\theta_d$ are drawn i.i.d. from $G_0$. If $G_0$ itself is not sufficiently expressive, the model cannot capture the dependency between the documents. The DP extension solves this problem in an elegant way. With a moderate $\alpha_0$, the model allows $G$ to deviate from $G_0$, which permits more flexible modeling and thus the exploration of a richer data structure. Exchangeability need not hold throughout the whole collection, but only within groups of documents whose corresponding atoms $\theta^*_l$ have been drawn from $G_0$. On the other hand, the increased flexibility does not lead to overfitting, since inference and learning take place in a Bayesian setting, which averages over the number of mixture components and the states of the latent variables.

2 Inference and learning

Despite these attractive features, inference with the infinite model relies heavily on MCMC approximations, such as Gibbs samplers, which draw the $\theta_d$ directly using a Pólya urn scheme and thereby avoid the difficulty of sampling the infinite-dimensional $G$ [5]. Another possibility is to use a finite approximation, for example the truncated DP (TDP) [8] or Dirichlet-multinomial allocation (DMA) [6], from which a finite version of $G$ can then be inferred. Since sampling can be very slow in our setting, we propose an efficient variational inference based on DMA. We are aware of the recently proposed alternative based on TDP. However, the two methods are not compared here.

2.1 Dirichlet-multinomial allocation

First, we approximate the stick-breaking representation of Eq. (4) by a finite sum. The Dirichlet-multinomial allocation $\mathrm{DP}_N$ [6] has often been applied in Bayesian statistics as a finite approximation of the DP (see [6, 8]). It takes the form $G_N = \sum_{l=1}^{N} \pi_l \, \delta_{\theta^*_l}$, where $\pi = \{\pi_1, \ldots, \pi_N\}$ is an $N$-vector of probability weights drawn once from a Dirichlet prior $\mathrm{Dir}(\alpha_0 / N, \ldots, \alpha_0 / N)$, and the $\theta^*_l$, $l = 1, \ldots, N$, are drawn i.i.d. from the base distribution $G_0$. It has been shown that the DP is the limiting case of $\mathrm{DP}_N$ [6, 8, 10] and, more importantly, that $\mathrm{DP}_N$ exhibits a similar stick-breaking behaviour, which leads to a clustering effect [6]. If $N$ is sufficiently large relative to the sample size $D$, $\mathrm{DP}_N$ provides a good approximation of the DP. $\mathrm{DP}_N$ is illustrated in Figure 2 (c). The likelihood of the complete collection $\mathcal{D}$ is

$$p(\mathcal{D} \mid \alpha_0, \lambda, \beta) = \int_{\pi} \int_{\theta^*} \left[ \prod_{d=1}^{D} \sum_{c_d=1}^{N} p(w_d \mid \theta^*, c_d; \beta) \, p(c_d \mid \pi) \right] dP(\theta^*; G_0) \, dP(\pi; \alpha_0) \tag{5}$$

where $c_d$ is an indicator variable indicating which unique value $\theta^*_l$ document $w_d$ takes; $z_{d,n}$ has been integrated out for simplicity. Inference must compute the joint a posteriori distribution of the latent variables, $p(\pi, \theta^*, c, z \mid \mathcal{D}, \alpha_0, \lambda, \beta)$, which requires computing Eq. (5). Its integral, however, cannot be determined analytically.
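The finite approximation can be sampled directly, which makes its clustering effect easy to inspect. The following sketch draws the weights, the $N$ candidate topic mixtures and the cluster indicators of all documents; Dirichlet samples are obtained by normalizing gamma variates. Function names and all numeric settings are illustrative assumptions.

```python
import random

def sample_dirichlet(params, rng):
    """Draw from Dir(params) by normalizing independent Gamma(a, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [x / total for x in draws]

def sample_dma(alpha0, lam, n_components, n_docs, rng):
    """Sample pi ~ Dir(alpha0/N, ...), theta*_l ~ Dir(lam), c_d ~ Multinomial(pi)."""
    pi = sample_dirichlet([alpha0 / n_components] * n_components, rng)
    thetas = [sample_dirichlet(lam, rng) for _ in range(n_components)]
    c = rng.choices(range(n_components), weights=pi, k=n_docs)
    return pi, thetas, c

rng = random.Random(1)
# Hypothetical settings: N = 20 components, D = 50 documents, k = 3 topics.
pi, thetas, c = sample_dma(alpha0=2.0, lam=[1.0, 1.0, 1.0],
                           n_components=20, n_docs=50, rng=rng)
n_used = len(set(c))   # number of components that actually received documents
print(n_used)
```

Because each weight is drawn with a small concentration $\alpha_0 / N$, most of the mass typically falls on a few components, so `n_used` is usually well below $N$.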

2.2 Variational inference and parameter estimation

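The idea of variational mean-field inference is to propose a joint distribution $Q(\pi, \theta^*, c, z)$ that is conditioned on some free parameters, and then to approximate the a posteriori distribution with this $Q$ by minimizing the KL divergence $D_{KL}(Q \,\|\, p(\pi, \theta^*, c, z \mid \mathcal{D}, \alpha_0, \lambda, \beta))$ with respect to these free parameters. We propose a variational distribution $Q$ over the latent variables as follows:

$$Q(\pi, \theta^*, c, z \mid \eta, \gamma, \psi, \phi) = Q(\pi \mid \eta) \prod_{l=1}^{N} Q(\theta^*_l \mid \gamma_l) \prod_{d=1}^{D} Q(c_d \mid \psi_d) \prod_{d=1}^{D} \prod_{n=1}^{N_d} Q(z_{d,n} \mid \phi_{d,n}) \tag{6}$$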

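where $\eta$, $\gamma$, $\psi$, $\phi$ are variational parameters, each of which tailors the variational distribution to one latent variable. Specifically, $\eta$ specifies an $N$-dimensional Dirichlet distribution for $\pi$, $\gamma_l$ specifies a $k$-dimensional Dirichlet distribution for each distinct $\theta^*_l$, $\psi_d$ specifies an $N$-dimensional multinomial for the indicator $c_d$ of document $d$, and $\phi_{d,n}$ specifies a $k$-dimensional multinomial over latent topics for word $w_{d,n}$. It turns out that minimizing the KL divergence is equivalent to maximizing a lower bound on $\ln p(\mathcal{D} \mid \alpha_0, \lambda, \beta)$, which can be derived by applying Jensen's inequality [9]. We skip the details of the standard derivation and state the lower bound directly as

$$\mathcal{L}(\mathcal{D}) = \sum_{d=1}^{D} \sum_{n=1}^{N_d} E_Q[\ln p(w_{d,n} \mid z_{d,n}, \beta) \, p(z_{d,n} \mid \theta^*, c_d)] + E_Q[\ln p(\pi \mid \alpha_0)] + \sum_{d=1}^{D} E_Q[\ln p(c_d \mid \pi)] + \sum_{l=1}^{N} E_Q[\ln p(\theta^*_l \mid G_0)] - E_Q[\ln Q(\pi, \theta^*, c, z)] \tag{7}$$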
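For orientation, the equivalence between minimizing the KL divergence and maximizing the lower bound follows from a standard identity. In the sketch below, $h = (\pi, \theta^*, c, z)$ is shorthand for all latent variables; this shorthand is introduced here only for illustration, and the integral is understood to include sums over the discrete variables $c$ and $z$.

```latex
\begin{align*}
\ln p(\mathcal{D} \mid \alpha_0, \lambda, \beta)
  &= \ln \int Q(h)\, \frac{p(\mathcal{D}, h \mid \alpha_0, \lambda, \beta)}{Q(h)}\, dh \\
  &\ge E_Q[\ln p(\mathcal{D}, h \mid \alpha_0, \lambda, \beta)] - E_Q[\ln Q(h)]
     \qquad \text{(Jensen's inequality)} \\
  &= \ln p(\mathcal{D} \mid \alpha_0, \lambda, \beta)
     - D_{KL}\big(Q(h) \,\|\, p(h \mid \mathcal{D}, \alpha_0, \lambda, \beta)\big)
\end{align*}
```

Since the left-hand side does not depend on $Q$, raising the bound in the second line is the same as lowering the KL divergence in the third.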

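The maximization is performed by setting the partial derivatives with respect to each variational parameter to zero, which results in the following updates:

$$\phi_{d,n,i} \propto \beta_{i, w_{d,n}} \exp\left\{ \sum_{l=1}^{N} \psi_{d,l} \left[ \Psi(\gamma_{l,i}) - \Psi\Big(\sum_{j=1}^{k} \gamma_{l,j}\Big) \right] \right\} \tag{8}$$

$$\psi_{d,l} \propto \exp\left\{ \sum_{i=1}^{k} \left[ \Psi(\gamma_{l,i}) - \Psi\Big(\sum_{j=1}^{k} \gamma_{l,j}\Big) \right] \sum_{n=1}^{N_d} \phi_{d,n,i} + \Psi(\eta_l) - \Psi\Big(\sum_{j=1}^{N} \eta_j\Big) \right\} \tag{9}$$

$$\gamma_{l,i} = \sum_{d=1}^{D} \sum_{n=1}^{N_d} \psi_{d,l} \, \phi_{d,n,i} + \lambda_i \tag{10}$$

$$\eta_l = \sum_{d=1}^{D} \psi_{d,l} + \frac{\alpha_0}{N} \tag{11}$$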

where $\Psi(\cdot)$ is the digamma function, the first derivative of the log gamma function. We skip the details of the derivation of these updates, in which we repeatedly apply the expected sufficient statistics of the Dirichlet distribution given in [4]. The updates are readily interpretable. For example, $\eta_l$ in Eq. (11) is the trade-off between the empirical responses at $\theta^*_l$ and the prior given by $\alpha_0$. Finally, since the parameters are coupled, the variational inference is performed by applying Eq. (8) to Eq. (11) iteratively until convergence. The convexity of the problem guarantees a global maximum of $\mathcal{L}$.
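One sweep of the iterative scheme can be sketched in plain Python. The sketch assumes the standard mean-field coordinate updates for the finite model: word-level topic responsibilities $\phi$, document-level cluster responsibilities $\psi$, and Dirichlet parameters $\gamma$ and $\eta$ that add expected counts to their priors. The digamma function is approximated by a recurrence plus an asymptotic series, since the Python standard library does not provide one. All function names, the tiny corpus and the starting values are hypothetical.

```python
import math

def digamma(x):
    """Digamma for x > 0: shift x upwards, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def normalize_log(log_weights):
    """Exponentiate and normalize log-weights in a numerically stable way."""
    m = max(log_weights)
    exps = [math.exp(v - m) for v in log_weights]
    total = sum(exps)
    return [e / total for e in exps]

def variational_sweep(docs, beta, lam, alpha0, psi, gamma, eta):
    """One pass over all documents; returns updated (psi, gamma, eta)."""
    n_clusters, k = len(eta), len(lam)
    # Expected log parameters under the current Dirichlet factors.
    e_theta = [[digamma(g) - digamma(sum(row)) for g in row] for row in gamma]
    eta_total = sum(eta)
    e_pi = [digamma(e) - digamma(eta_total) for e in eta]
    new_gamma = [list(lam) for _ in range(n_clusters)]   # start from prior lam
    new_eta = [alpha0 / n_clusters] * n_clusters         # start from prior alpha0/N
    new_psi = []
    for d, words in enumerate(docs):
        # phi: topic responsibilities for every word of document d.
        mix = [sum(psi[d][l] * e_theta[l][i] for l in range(n_clusters))
               for i in range(k)]
        phi = [normalize_log([math.log(beta[i][w] + 1e-300) + mix[i]
                              for i in range(k)]) for w in words]
        counts = [sum(p[i] for p in phi) for i in range(k)]
        # psi: cluster responsibilities for document d.
        psi_d = normalize_log([e_pi[l] + sum(e_theta[l][i] * counts[i]
                                             for i in range(k))
                               for l in range(n_clusters)])
        new_psi.append(psi_d)
        for l in range(n_clusters):
            new_eta[l] += psi_d[l]
            for i in range(k):
                new_gamma[l][i] += psi_d[l] * counts[i]
    return new_psi, new_gamma, new_eta

# Tiny hypothetical corpus: 3 documents, 4 words, k = 2 topics, N = 3 clusters.
docs = [[0, 1, 0], [2, 3, 3], [0, 0, 1]]
beta = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
lam, alpha0 = [1.0, 1.0], 1.0
psi = [[0.6, 0.3, 0.1], [0.1, 0.6, 0.3], [0.3, 0.1, 0.6]]
gamma = [[2.0, 1.0], [1.0, 2.0], [1.5, 1.5]]
eta = [1.0, 1.0, 1.0]
for _ in range(20):
    psi, gamma, eta = variational_sweep(docs, beta, lam, alpha0, psi, gamma, eta)
print([[round(p, 2) for p in row] for row in psi])
```

The sketch recomputes $\phi$ on the fly instead of storing it, which keeps memory proportional to the number of documents rather than the number of words. A production implementation would also monitor the lower bound to decide when to stop.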

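We follow the Bayesian framework and estimate the hyperparameters $\alpha_0$, $\lambda$ and $\beta$ by iteratively maximizing the lower bound $\mathcal{L}$, first with respect to the variational parameters (as described in Eqs. (8) to (11)) and then with respect to the model parameters, holding the remaining parameters fixed in each step. This is often referred to as variational EM [9]. It is easy to derive the update for $\beta$:

$$\beta_{i,j} \propto \sum_{d=1}^{D} \sum_{n=1}^{N_d} \phi_{d,n,i} \, \delta_j(w_{d,n})$$

where $\delta_j(w_{d,n}) = 1$ if $w_{d,n} = j$, and 0 otherwise. For the remaining parameters we first write down the parts of $\mathcal{L}$ in Eq. (7) that involve $\alpha_0$ and $\lambda$:

$$\mathcal{L}_{[\alpha_0]} = \ln \Gamma(\alpha_0) - N \ln \Gamma\Big(\frac{\alpha_0}{N}\Big) + \Big(\frac{\alpha_0}{N} - 1\Big) \sum_{l=1}^{N} \left[ \Psi(\eta_l) - \Psi\Big(\sum_{j=1}^{N} \eta_j\Big) \right]$$

$$\mathcal{L}_{[\lambda]} = \sum_{l=1}^{N} \left\{ \ln \Gamma\Big(\sum_{i=1}^{k} \lambda_i\Big) - \sum_{i=1}^{k} \ln \Gamma(\lambda_i) + \sum_{i=1}^{k} (\lambda_i - 1) \left[ \Psi(\gamma_{l,i}) - \Psi\Big(\sum_{j=1}^{k} \gamma_{l,j}\Big) \right] \right\}$$

Estimates for $\alpha_0$ and $\lambda$ are then obtained by maximizing these objective functions separately with standard optimization methods, for example the Newton-Raphson method used in [4].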
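The update for $\beta$ described above amounts to accumulating, for every topic, the expected number of times each vocabulary word was assigned to it, and then normalizing each row. The sketch below assumes this standard form; the function name, the data layout (one list of $\phi_{d,n}$ vectors per document) and the example values are hypothetical.

```python
def update_beta(docs, phi, k, vocab_size):
    """Return beta with beta[i][j] proportional to sum_d sum_n phi[d][n][i] * [w_dn == j]."""
    beta = [[0.0] * vocab_size for _ in range(k)]
    for words, phi_d in zip(docs, phi):
        for w, phi_dn in zip(words, phi_d):
            for i in range(k):
                beta[i][w] += phi_dn[i]      # expected count of word w under topic i
    normalized = []
    for row in beta:
        total = sum(row)
        if total > 0.0:
            normalized.append([v / total for v in row])
        else:
            # Topic received no mass: fall back to a uniform row (an assumption).
            normalized.append([1.0 / vocab_size] * vocab_size)
    return normalized

# Two documents over a vocabulary of 3 words, k = 2 topics.
docs = [[0, 1], [2]]
phi = [[[1.0, 0.0], [0.5, 0.5]],   # document 0: responsibilities per word
       [[0.0, 1.0]]]               # document 1
new_beta = update_beta(docs, phi, k=2, vocab_size=3)
print(new_beta)
```

In practice a small smoothing constant is often added to the counts so that unseen words do not receive zero probability; this is not part of the update as stated in the text.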

3 Empirical study

3.1 Toy data

We first apply the model to a toy problem with $k = 5$ latent topics and a dictionary of 200 words. The probability that words are generated from these topics, i.e. the parameter $\beta$, is shown in Figure 3 (d). Each colored line corresponds to a topic and assigns a non-zero probability to a set of words. For each run, we generate the data with the following steps: (1) a cluster number $M$ is chosen between 5 and 12; (2) $M$ document clusters are generated, each of which is defined by a combination of topics; (3) each document $d$, $d = 1, \ldots, 100$, is generated by first randomly selecting a cluster and then generating 40 words according to the corresponding topic combination. For $\mathrm{DP}_N$ we set $N = 100$, and we examine how well the model detects the latent topics and the document cluster structure.
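The three data generation steps can be sketched as follows. The text does not specify how the topics cover the dictionary or how a cluster's topic combination is drawn, so the sketch makes two labeled assumptions: each topic is uniform over its own block of 40 consecutive words, and each cluster's topic combination is a random mixing vector. Function and variable names are illustrative.

```python
import random

def make_toy_corpus(rng, k=5, vocab_size=200, n_docs=100, doc_len=40):
    block = vocab_size // k
    # Assumption: topic i is uniform over words [i * block, (i + 1) * block).
    beta = [[1.0 / block if i * block <= w < (i + 1) * block else 0.0
             for w in range(vocab_size)] for i in range(k)]
    m = rng.randint(5, 12)                        # step (1): number of clusters M
    clusters = []
    for _ in range(m):                            # step (2): one topic combination each
        # Assumption: random mixing weights from normalized gamma variates.
        g = [rng.gammavariate(0.5, 1.0) + 1e-12 for _ in range(k)]
        total = sum(g)
        clusters.append([x / total for x in g])
    docs, labels = [], []
    for _ in range(n_docs):                       # step (3): cluster, then 40 words
        c = rng.randrange(m)
        topics = rng.choices(range(k), weights=clusters[c], k=doc_len)
        words = [rng.choices(range(vocab_size), weights=beta[z])[0] for z in topics]
        docs.append(words)
        labels.append(c)
    return beta, clusters, docs, labels

rng = random.Random(0)
beta, clusters, docs, labels = make_toy_corpus(rng)
print(len(clusters), len(docs), len(docs[0]))
```

The returned `labels` play the role of the actual cluster membership against which a learned clustering can be compared.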

In Figures 3 (a) to (c) we illustrate the clustering process for documents over the EM iterations, using a run that contains 6 document clusters. Figure 3 (a) shows the initial random assignment $\psi_{d,l}$ of each document $d$ to a cluster $l$. After one EM step, the documents begin to accumulate in a reduced number of clusters (Figure 3 (b)), and after 5 steps they converge to exactly 6 clusters (Figure 3 (c)). The learned word distribution of the topics, $\beta$, is shown in Figure 3 (e); it is very similar to the actual distribution. By varying $M$, the actual number of document clusters, we can check whether our model finds the right $M$. For each value of $M$ in the range 5 to 12, we randomly generate data for 20 trials and obtain the graph in Figure 3 (f), which shows the mean and the variance of the learned number of clusters. In 37% of the runs we obtain perfect results, and in 43% of the runs the learned number of clusters deviates from the actual value by only one. However, we also see that the model tends to find slightly fewer than $M$ clusters when $M$ is large. This may be because only 100 documents are not sufficient to learn a large number $M$ of clusters.

3.2 Document modeling

We compare the proposed model with PLSI and LDA on two text data sets. The first is a subset of the Reuters-21578 data set, which contains 3000 documents and 20334 words. The second is taken from the 20 Newsgroups data set and has 2000 documents with 8014 words. The comparison metric is perplexity, which is commonly used in language modeling. For a set of test documents $D_{test}$, it is formally defined as

$$\mathrm{Perplexity}(D_{test}) = \exp\Big( -\ln p(D_{test}) \Big/ \sum_{d} |w_d| \Big)$$
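As a quick sanity check of the perplexity measure, the sketch below computes it from per-document log-likelihoods and document lengths. It assumes that the test documents are independent, so that $\ln p(D_{test})$ is the sum of the per-document log-likelihoods; the function name and the example model (uniform over 8 words) are hypothetical.

```python
import math

def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-sum_d ln p(w_d) / sum_d |w_d|) for a set of test documents."""
    return math.exp(-sum(doc_log_likelihoods) / sum(doc_lengths))

# A model that assigns probability 1/8 to every word of a 5-word document.
uniform_value = perplexity([5 * math.log(1 / 8)], [5])
print(uniform_value)
```

For a model that is uniform over a vocabulary, the perplexity equals the vocabulary size, which gives an intuitive scale: lower perplexity means the model is, on average, less "surprised" by each test word.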

We follow the formula in [4] to calculate the perplexity for PLSI. In our algorithm, $N$ is set to the number of training documents. Figures 4 (a) and 4 (b) show the results of the comparison for different numbers $k$ of latent topics. Our model outperforms PLSI and LDA in all runs. This means that the flexibility introduced by the DP extension does not result in overfitting and leads to better generalization performance.

3.3 Clustering

In our last experiment we show that our approach is suitable for finding document clusters. We select the four categories cars, motorcycles, baseball and hockey from the 20 Newsgroups data set, with 446 documents in each category. Figure 4 (c) shows a clustering result in which we set the number of topics to $k = 5$ and found 6 document clusters. In the figure, the documents are indexed according to their actual category labels, so we can clearly see that the result is quite meaningful. Documents from one category show similar assignments to the learned clusters, and different categories can easily be distinguished from one another. The first two categories are not clearly separated from each other, since both are about vehicles and share many terms. The other two categories, baseball and hockey, are detected ideally.
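A comparison of learned clusters with category labels, as described for Figure 4 (c), can be tabulated with a few lines of code. The sketch assumes that each document is assigned to the cluster with the largest variational responsibility; the function names and the tiny example are hypothetical and do not reproduce the experiment.

```python
from collections import Counter

def hard_assignments(psi):
    """Assign each document to the cluster with the largest responsibility."""
    return [max(range(len(row)), key=row.__getitem__) for row in psi]

def contingency(labels, assignments):
    """Count how many documents of each category fall into each cluster."""
    return Counter(zip(labels, assignments))

# Three documents with responsibilities over two clusters.
psi = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
labels = ["cars", "hockey", "cars"]
assignments = hard_assignments(psi)
table = contingency(labels, assignments)
print(assignments, dict(table))
```

Categories that share vocabulary, such as cars and motorcycles in the experiment, would show up in such a table as rows whose counts are spread over the same clusters.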

4 Conclusions

In this embodiment, a model for Dirichlet-extended latent semantic analysis is proposed, which retains the capability of previous approaches to find the latent topics, but also introduces additional modeling flexibility to examine document clusters. The inference uses a variational mean-field approximation based on a finite approximation of the DP (Dirichlet process). The experiments with toy data and two text data sets show that our model can recognize both the latent semantics and a reasonable cluster structure.

Bibliography

[1] Beal, M.J., Ghahramani, Z. and C.E. Rasmussen: "The Infinite Hidden Markov Model" in Advances in Neural Information Processing Systems (NIPS) 14, 2002

[2] Blei, D., Griffiths, T.L., Jordan, M.I. and J.B. Tenenbaum: "Hierarchical topic models and the nested Chinese restaurant process" in Advances in Neural Information Processing Systems 16, MIT Press, 2004

[3] Blei, D. and M. Jordan: "Variational methods for the Dirichlet process", 2004; to appear in Proceedings of the 21st International Conference on Machine Learning

[4] Blei, D., Ng, A. and M. Jordan: "Latent Dirichlet Allocation" in Journal of Machine Learning Research, 3: 993-1022, 2003

[5] Escobar, M.D. and M. West: "Bayesian density estimation and inference using mixtures" in Journal of the American Statistical Association, 90 (430), 1995

[6] Green, P.J. and S. Richardson: "Modelling heterogeneity with and without the Dirichlet process", 2000, unpublished

[7] Hofmann, T.: "Probabilistic Latent Semantic Indexing" in Proceedings of the 22nd Annual ACM SIGIR Conference, p. 50-57, Berkeley, CA, 1999

[8] Ishwaran, H. and M. Zarepour: "Exact and approximate sum representations for the Dirichlet process" in Can. J. Statist. 30: 269-283, 2002

[9] Jordan, M.I., Ghahramani, Z., Jaakkola, T. and L.K. Saul: "An Introduction to Variational Methods for Graphical Models" in Machine Learning 37 (2): 183-233, 1999

[10] Neal, R.M.: "Markov chain sampling methods for Dirichlet process mixture models" in Journal of Computational and Graphical Statistics, 9: 249-265, 2000

[11] Rasmussen, C.E. and Z. Ghahramani: "Infinite mixtures of Gaussian process experts" in Advances in Neural Information Processing Systems 14, 2002

[12] Yu, K., Tresp, V. and S. Yu: "A Nonparametric Hierarchical Bayesian Framework for Information Filtering" in Proceedings of the 27th Annual ACM SIGIR Conference, 2004

Claims

1. A method for structuring a data stock stored on at least one storage medium, wherein

- a Dirichlet distribution with any number of states is used as the statistical model for modeling the data stock,

- variables of the statistical model are adapted to the data stock in an iterative process,

- the data stock is structured on the basis of the states of the adapted statistical model.

2. The method according to claim 1, wherein topics and/or topic groups of the data stock are identifiable on the basis of the states of the adapted statistical model.

3. The method according to at least one of claims 1 and 2, wherein the statistical model for modeling the data stock is designed as a Dirichlet process.

4. The method of claim 3, wherein the Dirichlet process is approximated by a finite statistical model.

5. The method according to at least one of claims 1 to 4, wherein in the iterative process variables and parameters of the statistical model are adapted to the data stock.

6. The method according to at least one of claims 1 to 5, wherein the iterative process is designed as a mean-field algorithm.

7. The method according to at least one of claims 1 to 6, wherein the iterative process is designed as a mean-field algorithm based on a finite approximation.

8. The method according to at least one of claims 1 to 7, wherein the iterative process is designed as a Markov chain Monte Carlo (MCMC) method.

9. The method according to at least one of claims 1 to 8, wherein the structured data and/or information about the structured data stock is stored on at least one storage medium.

10. The method according to at least one of claims 1 to 9, wherein the structured data and/or information about the structured data stock is made accessible.

11. The method according to at least one of claims 1 to 10, wherein the database comprises a text document or a set of text documents.

12. The method according to at least one of claims 1 to 11, wherein the data stock comprises a database and/or the Internet.

13. A computer program product which is loadable into a main memory of a program sequence control device and has at least one code section, upon whose execution, for structuring a data stock stored on at least one storage medium,

- a Dirichlet distribution with any number of states is used as the statistical model for modeling the data stock,

- variables of the statistical model are adapted to the data stock in an iterative process,

- the data stock is structured on the basis of the states of the adapted statistical model,

when the computer program product runs in the program sequence control device.