CN104965869A

CN104965869A - Mobile application sorting and clustering method based on heterogeneous information network

Info

Publication number: CN104965869A
Application number: CN201510312733.3A
Authority: CN
Inventors: 吴健; 白双伶; 陈亮; 邓水光; 李莹; 尹建伟; 吴朝晖
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2015-06-09
Filing date: 2015-06-09
Publication date: 2015-10-07

Abstract

The present invention discloses a mobile application sorting and clustering method based on a heterogeneous information network. A sorting result mainly reflects the importance degree of an object, and thus, when the sorting result is introduced in the clustering process, a clustering result is more meaningful; moreover, the sorting result and the clustering result are continuously regulated by adopting an iterative method and supplement each other, so that a clustering effect is integrally promoted. Conventionally, only one or two types of information are used commonly in methods capable of being used for clustering of mobile applications; the mobile application sorting and clustering method is based on the heterogeneous information network consisting of four types of information of the applications and uses more information sources, so that clustering accuracy can be essentially promoted.

Description

A mobile application ranking (sorting) and clustering method based on a heterogeneous information network

Technical field

The invention belongs to the field of application recommendation, and in particular relates to a ranking-based clustering method that exploits the characteristics of a heterogeneous information network, providing a way to cluster and rank mobile applications effectively.

Background art

With the rapid development of the mobile Internet, a huge number of applications have emerged in mobile application markets, and these varied mobile applications are gradually changing people's lives. Each mobile application is associated with its own related information, so thousands of mobile applications form a huge heterogeneous information network. This network contains a large amount of valuable information, and research on mobile application information networks is therefore of great significance. On the one hand, an in-depth analysis of how a large number of mobile applications are used helps us understand user behavior in detail and thus provide more personalized services. Personalized application recommendation, for example, recommends mobile applications to target users more accurately by mining the latent structural relations among users or among applications, thereby improving the user experience. On the other hand, analyzing mobile application data can also help companies find more effective advertising and promotion platforms.

Users usually obtain applications from an application market in one of three ways: the first is to search directly with the market's search engine; the second is to browse the market's category labels and rankings; the third is to pick from a list of applications recommended by the system. Application search mainly relies on keyword matching, and the only type of information it uses is the application name. Category labels are typically fixed and set manually in advance, so as the number of applications grows, the shortcomings of the label scheme gradually become apparent. An effective information extraction technique is therefore needed to make up for these weaknesses.

Clustering is an important method for understanding data and extracting useful information. Grouping large amounts of disordered data into different clusters facilitates analysis and research, and cluster analysis of mobile application data can serve as a preprocessing step before predictive modeling. At present, most methods that can be applied to cluster analysis of application data target homogeneous information networks, that is, they rely on a single type of application information. Because a single information source ignores the other related information, clustering accuracy is greatly limited. Building a heterogeneous mobile application network from the different types of information extracted from applications, and then clustering the applications and their related information on the basis of this network, has therefore become an urgent need in both academia and industry.

Summary of the invention

To address the above technical problem, the present invention proposes a mobile application ranking and clustering method based on a heterogeneous information network.

To solve the above technical problem, the technical solution of the present invention is as follows:

A mobile application ranking and clustering method based on a heterogeneous information network, in which the system comprises a data preprocessing module, a ranking distribution calculation module and a probability generation module, and which specifically comprises the following steps:

11) The data preprocessing module obtains mobile application information documents from a mobile application market and preprocesses them; the preprocessing comprises information filtering, word segmentation and keyword extraction;

12) A star-shaped heterogeneous network composed of four types of information is built; the star-shaped heterogeneous network is then randomly clustered, which divides it into multiple sub-networks;

13) The ranking distribution calculation module receives the sub-networks, computes the ranking distribution of the attribute nodes in each sub-network, and outputs the result;

14) The probability generation model receives the ranking distributions of the attribute nodes and uses them to compute the posterior probability of each center node in each sub-network; the posterior probabilities of the attribute nodes are then computed from neighbor relations. Finally, the clustering result is checked for convergence: if it has not converged, the sub-networks are re-partitioned according to the new probability distribution and fed back to the ranking distribution calculation module; if it has converged, it is output as the clustering result.

Further, the ranking procedure of the ranking distribution calculation module specifically comprises the following steps:

The input is the number of clusters K and the K mobile application sub-networks. The ranking distributions of the three types of attribute nodes in each sub-network are then computed separately. For objects of the AUTHOR and CATEGORY types, the transitive ranking method is used; this is an iterative process that terminates when the ranking distribution converges or the number of iterations exceeds a preset maximum. For objects of the TERM type, the count-based ranking method is used. The ranking distribution calculation finally outputs the ranking distribution of each attribute type. The objects of the AUTHOR, CATEGORY and TERM types are the extracted keywords.

Further, the input of the probability generation module comprises the number of clusters K, the K mobile application sub-networks and the ranking distributions of their attribute types. After the probability generation model is built, the EM method is used to obtain the optimal parameter values. The optimal parameter values and the ranking distributions of the attribute types are used to generate the posterior probability of each center-type node in each cluster; neighbor relations are then used to compute the posterior probability of each attribute-type node. Finally, each node is reassigned to a cluster according to the probability distribution, and the clustering result is output.

Further, building a star-shaped heterogeneous network composed of four types of information means establishing a star network G = (V, E, W), where V = {APP, AUTHOR, CATEGORY, TERM} contains the four types of information nodes of the applications. APP = {ap_1, ap_2, …, ap_n} is the set of center nodes, and AUTHOR = {au_1, au_2, …, au_n}, CATEGORY = {ca_1, ca_2, …, ca_n} and TERM = {te_1, te_2, …, te_m} are the three sets of attribute nodes. E is the set of edges connecting center nodes and attribute nodes, and W is the set of edge weights. There are three cases for the weights: first, if edge e_i connects an APP node and a node of {AUTHOR, CATEGORY}, then w_i is 1; second, if edge e_i connects an APP node and a TERM node, then w_i can be any positive integer; third, if there is no edge between two nodes, the weight is 0.
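The star network described above can be sketched in a few lines of Python. This is only an illustrative sketch: the record layout (`author`, `category`, `terms`), the function names, and the choice of the keyword occurrence count as the APP-TERM weight are assumptions made for the example, not details stated in the text.

```python
from collections import defaultdict

def build_star_network(apps):
    """Build per-type weight maps weights[type][app_id][attr_node].

    `apps` maps an app id to a dict with keys "author", "category" and
    "terms" (a list of keywords, repeats allowed). This layout is an
    assumption made for the example.
    """
    weights = {"AUTHOR": defaultdict(dict),
               "CATEGORY": defaultdict(dict),
               "TERM": defaultdict(dict)}
    for app_id, info in apps.items():
        # APP-AUTHOR and APP-CATEGORY edges always carry weight 1.
        weights["AUTHOR"][app_id][info["author"]] = 1
        weights["CATEGORY"][app_id][info["category"]] = 1
        # APP-TERM edges carry a positive integer; here (assumption) the
        # number of times the keyword occurs for this app.
        for term in info["terms"]:
            row = weights["TERM"][app_id]
            row[term] = row.get(term, 0) + 1
    return weights

def edge_weight(weights, attr_type, app_id, attr_node):
    # A missing edge is represented by weight 0.
    return weights[attr_type].get(app_id, {}).get(attr_node, 0)
```

Storing only the existing edges and returning 0 for everything else keeps the representation sparse, which matters because most APP-TERM pairs have no edge.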

Further, applying the ranking distribution calculation to the star network yields the ranking distributions of the attribute-type information. Each of the three types of information nodes has its own ranking distribution, and these are fed into the probability generation model as conditional probabilities. The ranking distribution of AUTHOR is R = {r(au_1), r(au_2), …, r(au_n)}, where r(au_i) ≥ 0, and the ranking distributions of the other two attribute types are expressed in the same way. The computation of the ranking distributions has two parts. The first part uses the transitive ranking method for the AUTHOR and CATEGORY types; it is an iterative computation:

P(AUTHOR \mid G) = (W_{AUTHOR,APP}\, \sigma^{-1}_{AUTHOR,APP}) (W_{APP,CATEGORY}\, \sigma^{-1}_{APP,CATEGORY})\, P(CATEGORY \mid G) \qquad (1)

P(CATEGORY \mid G) = (W_{CATEGORY,APP}\, \sigma^{-1}_{CATEGORY,APP}) (W_{APP,AUTHOR}\, \sigma^{-1}_{APP,AUTHOR})\, P(AUTHOR \mid G) \qquad (2)

where σ_{AUTHOR,APP}, σ_{APP,CATEGORY}, σ_{CATEGORY,APP} and σ_{APP,AUTHOR} are diagonal matrices whose diagonal entries equal the column sums of the weight matrices W_{AUTHOR,APP}, W_{APP,CATEGORY}, W_{CATEGORY,APP} and W_{APP,AUTHOR}, respectively, so that multiplying by σ^{-1} normalizes each column. The second part is the count-based ranking method for the TERM type, computed as follows:

p(te_i \mid TERM, G) = \frac{\sum_{x \in N_G(te_i)} w_{te_i,x}}{\sum_{y \in TERM} \sum_{x \in N_G(y)} w_{y,x}} \qquad (3)

where N_G(te_i) denotes the set of neighbor nodes of te_i in network G.
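Formula (3) amounts to dividing the total edge weight of one term by the total edge weight of all terms. A minimal Python sketch follows; the input layout (`{term: {app_id: weight}}`) and the function name are assumptions chosen for illustration.

```python
def count_ranking(term_weights):
    """Count-based ranking of TERM nodes, following formula (3).

    term_weights: {term: {app_id: weight}} for one network or sub-network
    (hypothetical layout). Returns {term: p(term | TERM, G)}.
    """
    # Numerator of (3): sum of edge weights to the neighbors of each term.
    totals = {t: sum(nbrs.values()) for t, nbrs in term_weights.items()}
    # Denominator of (3): the same sum taken over every term.
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {t: 0.0 for t in totals}
    return {t: s / grand_total for t, s in totals.items()}
```

For example, with `{"map": {"ap1": 2, "ap2": 1}, "gps": {"ap1": 1}}` the term `map` carries three of the four weight units and `gps` carries one.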

Further, the probability generation model takes the ranking distributions as one of its inputs and then uses the EM method to estimate the posterior probability distribution of the APP nodes over the clusters. The probability of visiting an attribute node x in a sub-network G_k is defined as:

p(x \mid G_k) = p(X \mid G_k) \times p(x \mid X, G_k) \qquad (4)

where p(X | G_k) is the probability of visiting type X in network G_k, and p(x | X, G_k) is the probability of visiting node x among the nodes of type X in network G_k. To prevent p(x | X, G_k) from being zero, global information is added to smooth it:

p'(x \mid X, G_k) = (1 - \varepsilon)\, p(x \mid X, G_k) + \varepsilon\, p(x \mid X, G) \qquad (5)

The probability of visiting a center node ap_i in a sub-network G_k is determined by its attribute nodes:

p(ap_i \mid G_k) = \prod_{x \in N_{G_k}(ap_i)} p(x \mid G_k)^{w_{ap_i,x}} = \prod_{x \in N_{G_k}(ap_i)} p'(x \mid X, G_k)^{w_{ap_i,x}}\, p(X \mid G_k)^{w_{ap_i,x}} \qquad (6)
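Formulas (4) to (6) can be illustrated with a short Python sketch. It evaluates the logarithm of formula (6) rather than the product itself, which is a numerical choice made here to avoid underflow and is not stated in the text. The argument layout, the function names and the default value of ε are likewise assumptions.

```python
import math

def smoothed(p_local, p_global, eps=0.1):
    """Formula (5): mix the sub-network estimate with the global one."""
    return (1.0 - eps) * p_local + eps * p_global

def log_p_app(neighbors, type_prob, rank_local, rank_global, eps=0.1):
    """Logarithm of formula (6) for one center node ap_i.

    neighbors:   list of (attr_type, attr_node, weight) edges of the app
    type_prob:   {attr_type: p(X | G_k)}
    rank_local:  {attr_type: {attr_node: p(x | X, G_k)}}
    rank_global: {attr_type: {attr_node: p(x | X, G)}}
    All four layouts are hypothetical. Every neighbor is assumed to have a
    positive global probability, so the logarithm is always defined.
    """
    total = 0.0
    for attr_type, node, weight in neighbors:
        p_loc = rank_local[attr_type].get(node, 0.0)
        p_glob = rank_global[attr_type][node]
        # Formula (4) with the smoothed term from formula (5).
        p_x = smoothed(p_loc, p_glob, eps) * type_prob[attr_type]
        # The exponent w_{ap_i,x} becomes a multiplier in log space.
        total += weight * math.log(p_x)
    return total
```

A node that is absent from the sub-network gets local probability 0, and the smoothing term keeps its contribution finite.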

According to Bayes' rule, the posterior probability of center node ap_i satisfies p(G_k | ap_i) ∝ p(ap_i | G_k) × p(G_k). To obtain a suitable p(G_k), the posterior probability p(G_k | ap_i) is maximized, and the EM method is used to obtain the best p(G_k). The calculation is as follows:

\log L = \sum_{ap_i \in APP} \log \left[ \sum_{k=1}^{K+1} p(ap_i \mid G_k) \times p(G_k) \right] \qquad (7)

p^{t}(G_k \mid ap_i) \propto p(ap_i \mid G_k)\, p^{t}(G_k); \quad p^{t+1}(G_k) = \sum_{ap_i \in APP} \frac{p^{t}(G_k \mid ap_i)}{|APP|} \qquad (8)

where K is the number of clusters specified by the user. After the probability distributions of all center-type nodes have been obtained, the posterior probability of each attribute node in each cluster is computed as follows:

p(G_k \mid x) = \sum_{ap_i \in N(x)} p(G_k, ap_i \mid x) = \sum_{ap_i \in N(x)} \frac{p(G_k \mid ap_i)}{|N(x)|} \qquad (9)

where x is an attribute node and N(x) is the set of center nodes that are neighbors of x. The posterior probability of an attribute node in a cluster equals the mean of the posterior probabilities of its neighbor nodes in that cluster.
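Formula (9) is a plain average over the neighboring center nodes. A minimal Python sketch follows, with a hypothetical function name and input layout.

```python
def attribute_posterior(neighbor_apps, app_posteriors):
    """Formula (9): average the cluster posteriors of the apps adjacent to x.

    neighbor_apps:  app ids in N(x), assumed non-empty
    app_posteriors: {app_id: [p(G_1 | ap), ..., p(G_K | ap)]}
    """
    n_clusters = len(next(iter(app_posteriors.values())))
    return [sum(app_posteriors[a][k] for a in neighbor_apps) / len(neighbor_apps)
            for k in range(n_clusters)]
```

If an author published two apps with posteriors `[0.8, 0.2]` and `[0.4, 0.6]`, the author's posterior is their mean, roughly `[0.6, 0.4]`.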

The beneficial effects of the present invention are as follows. A ranking result mainly reflects the importance of objects, so introducing it into the clustering process makes the clustering result more meaningful; and because an iterative method continually adjusts the ranking and clustering results so that they reinforce each other, the clustering quality improves overall. Conventional methods that can be used to cluster mobile applications usually use only one or two types of information, whereas the present invention is based on a heterogeneous information network composed of four types of application information; using more information sources fundamentally improves clustering accuracy.

Brief description of the drawings

Fig. 1 is the overall structure diagram of the present invention;

Fig. 2 is the internal flow chart of the ranking distribution calculation module of the present invention;

Fig. 3 is the internal flow chart of the probability generation module of the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to the drawings and specific embodiments.

Traditional clustering methods often ignore the other types of data related to an application when analyzing mobile application data, which limits clustering accuracy to some extent. The present invention adopts a ranking-based clustering method. The mobile application data are first preprocessed to extract four types of data: the application name, called the center type, and three attribute types, namely the application publisher, the application category and the application description. The application description is segmented into words and the TF-IDF method is used to extract keywords. These pieces of information are then combined into a star-shaped heterogeneous information network, which is represented by weight matrices. The ranking-based clustering method is then applied: the ranking distribution of each type of information is computed by the ranking methods and reflects the importance of the objects of that type. A probability generation model is built on the ranking distributions to obtain the posterior probability of each application in each cluster. Once the posterior distribution of each application over the clusters has been computed, the probability distributions of the other attribute nodes over the clusters are obtained from neighbor relations. Computing the ranking distributions and estimating the posterior probabilities are two consecutive steps that are repeated until the result converges.

The whole mobile application ranking and clustering method consists mainly of three modules: the data preprocessing module, the ranking distribution calculation module and the probability generation module.

As shown in Fig. 1, the whole process of ranking and clustering mobile applications consists of the data preprocessing module, the ranking distribution calculation module and the probability generation module, applied in that order. First, the data preprocessing module obtains mobile application information documents from a mobile application market; the preprocessing comprises information filtering, word segmentation and keyword extraction. A star-shaped heterogeneous network composed of four types of information is then built. In the initialization step the star network is randomly clustered, which divides it into multiple sub-networks. The ranking distribution calculation module receives the sub-networks, computes the ranking distribution of the attribute nodes in each sub-network, and outputs the result. The probability generation model receives the ranking distributions of the attribute nodes and uses them to compute the posterior probability of each center node in each sub-network; the posterior probabilities of the attribute nodes are then computed from neighbor relations. Finally, the clustering result is checked for convergence: if it has not converged, the sub-networks are re-partitioned according to the new probability distribution and fed back to the ranking distribution calculation module; if it has converged, it is output as the clustering result.
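The control flow of Fig. 1 can be sketched as a loop that alternates the two modules. The sketch below is a skeleton only: `rank_fn` and `posterior_fn` are hypothetical callables standing in for the ranking distribution calculation module and the probability generation module. Assigning each application to its most probable cluster and stopping when the assignment no longer changes are assumptions, since the text says only that convergence is checked.

```python
def rank_and_cluster(app_ids, k, rank_fn, posterior_fn, assign_init, max_iter=50):
    """Alternate ranking and posterior estimation until the partition is stable.

    rank_fn(clusters)      -> ranking distributions per sub-network (hypothetical)
    posterior_fn(rankings) -> {app_id: [p(G_1 | ap), ..., p(G_k | ap)]} (hypothetical)
    assign_init            -> {app_id: cluster index}, e.g. a random partition
    """
    assignment = dict(assign_init)
    for _ in range(max_iter):
        # Split the network into k sub-networks according to the assignment.
        clusters = [[a for a in app_ids if assignment[a] == c] for c in range(k)]
        rankings = rank_fn(clusters)
        posteriors = posterior_fn(rankings)
        # Re-partition: each app moves to its most probable cluster.
        new_assignment = {a: max(range(k), key=lambda c: posteriors[a][c])
                          for a in app_ids}
        if new_assignment == assignment:
            break  # converged: the partition did not change
        assignment = new_assignment
    return assignment
```

Passing the two modules in as functions keeps the loop independent of how the rankings and posteriors are actually computed.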

The data preprocessing module performs data extraction, information filtering, word segmentation and keyword extraction on the mobile application documents obtained from the mobile application market. First, the four types of data corresponding to each application are extracted; next, the extracted application descriptions are segmented into words and the TF-IDF method is used to extract the key descriptive terms of each application; finally, this information is represented with weight matrices to form a heterogeneous information network.
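The keyword extraction step can be sketched in pure Python. The text does not say which TF-IDF variant is used; the sketch assumes term frequency `count / document length` and inverse document frequency `log(N / df)`, and it assumes the descriptions have already been segmented into tokens. The function name and the `top_n` parameter are hypothetical.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_n=3):
    """Return the top-n TF-IDF keywords for each app description.

    docs: {app_id: list of tokens}, i.e. descriptions after word segmentation.
    """
    n_docs = len(docs)
    # Document frequency: in how many descriptions each token appears.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    keywords = {}
    for app_id, tokens in docs.items():
        counts = Counter(tokens)
        scores = {t: (c / len(tokens)) * math.log(n_docs / doc_freq[t])
                  for t, c in counts.items()}
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[app_id] = ranked[:top_n]
    return keywords
```

A token that occurs in every description gets an inverse document frequency of zero and therefore ranks last, which is the behavior one wants for generic words.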

Fig. 2 shows the flow of the ranking distribution calculation module. The input is the number of clusters K and the K mobile application sub-networks. The ranking distributions of the three types of attribute nodes in each sub-network are then computed separately. For objects of the AUTHOR and CATEGORY types, the transitive ranking method is used; this is an iterative process that terminates when the ranking distribution converges or the number of iterations exceeds a preset maximum. For objects of the TERM type, the count-based ranking method is used. The ranking distribution calculation finally outputs the ranking distribution of each attribute type.

The ranking distribution calculation module produces ranking distributions that reflect the importance of objects in the different clusters. It is divided into two parts according to data type: one part uses the transitive ranking method, mainly to compute the ranking distributions of the two attribute types application publisher and application category; the other part uses the count-based ranking method, mainly to compute the ranking distribution of the application keywords.
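The transitive ranking of AUTHOR and CATEGORY nodes can be sketched with plain nested lists. The sketch alternates formulas (1) and (2), treating each product of a weight matrix and an inverse diagonal matrix as a column normalization. The uniform starting distribution, the tolerance and the function names are assumptions, and every app is assumed to have at least one author and one category so that no column sum is zero.

```python
def column_normalize(matrix):
    """Divide each column by its sum (a weight matrix times sigma^-1)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    sums = [sum(matrix[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [[matrix[r][c] / sums[c] if sums[c] else 0.0 for c in range(n_cols)]
            for r in range(n_rows)]

def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def transitive_ranking(w_author_app, w_app_category, max_iter=100, tol=1e-9):
    """w_author_app is |AUTHOR| x |APP|; w_app_category is |APP| x |CATEGORY|."""
    author_from_app = column_normalize(w_author_app)
    app_from_cat = column_normalize(w_app_category)
    cat_from_app = column_normalize(transpose(w_app_category))
    app_from_author = column_normalize(transpose(w_author_app))
    n_cat = len(w_app_category[0])
    p_cat = [1.0 / n_cat] * n_cat  # uniform start (assumption)
    p_author = []
    for _ in range(max_iter):
        p_author = mat_vec(author_from_app, mat_vec(app_from_cat, p_cat))   # (1)
        new_cat = mat_vec(cat_from_app, mat_vec(app_from_author, p_author))  # (2)
        delta = max(abs(a - b) for a, b in zip(new_cat, p_cat))
        p_cat = new_cat
        if delta < tol:
            break  # ranking distribution has converged
    return p_author, p_cat
```

Because every normalized matrix has columns summing to one, both vectors remain probability distributions throughout the iteration.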

Fig. 3 shows the internal workflow of the probability generation module. The input comprises the number of clusters K, the K mobile application sub-networks and the ranking distributions of their attribute types. After the probability generation model is built, the EM method is used to obtain the optimal parameter values. The optimal parameter values and the ranking distributions of the attribute types are used to generate the posterior probability of each center-type node in each cluster; neighbor relations are then used to compute the posterior probability of each attribute-type node. Finally, each node is reassigned to a cluster according to the probability distribution, and the clustering result is output.

The probability generation module computes the posterior probability of the center type, that is, of the applications themselves, in the different clusters. The EM method is used to estimate the posterior probability of the center type; the probability distributions of the other three attribute types of the applications are then obtained from neighbor relations. Finally, the nodes are re-clustered according to the posterior probabilities and the clustering result is output.
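The EM estimation of the cluster priors can be sketched as follows, assuming the likelihoods p(ap_i | G_k) have already been computed and that every app has a positive likelihood in at least one cluster. The uniform initial prior, the tolerance and the function name are assumptions. The number of components is taken from the width of the input table, so any extra component simply appears as another column.

```python
def em_cluster_priors(likelihood, max_iter=100, tol=1e-9):
    """Estimate p(G_k) and p(G_k | ap_i) by EM.

    likelihood[i][k] = p(ap_i | G_k). Returns (priors, posteriors).
    """
    n_apps, n_clusters = len(likelihood), len(likelihood[0])
    priors = [1.0 / n_clusters] * n_clusters  # uniform start (assumption)
    posteriors = []
    for _ in range(max_iter):
        # E-step: p^t(G_k | ap_i) is proportional to p(ap_i | G_k) p^t(G_k).
        posteriors = []
        for row in likelihood:
            joint = [l * p for l, p in zip(row, priors)]
            norm = sum(joint)
            posteriors.append([j / norm for j in joint])
        # M-step: p^{t+1}(G_k) is the mean posterior over all apps.
        new_priors = [sum(post[k] for post in posteriors) / n_apps
                      for k in range(n_clusters)]
        delta = max(abs(a - b) for a, b in zip(new_priors, priors))
        priors = new_priors
        if delta < tol:
            break
    return priors, posteriors
```

In practice the likelihoods can be very small products, so an implementation would probably work from log-likelihoods and rescale each row before the E-step; that refinement is left out here to keep the correspondence with the update rule visible.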

Before ranking and clustering, the mobile application documents need to be converted into a heterogeneous information network composed of four types of information:

Star network: G = (V, E, W), where V = {APP, AUTHOR, CATEGORY, TERM} contains the four types of information nodes of the applications. APP = {ap_1, ap_2, …, ap_n} is the set of center nodes, and AUTHOR = {au_1, au_2, …, au_n}, CATEGORY = {ca_1, ca_2, …, ca_n} and TERM = {te_1, te_2, …, te_m} are the three sets of attribute nodes. E is the set of edges connecting center nodes and attribute nodes, and W is the set of edge weights. There are three cases for the weights: first, if edge e_i connects an APP node and a node of {AUTHOR, CATEGORY}, then w_i is 1; second, if edge e_i connects an APP node and a TERM node, then w_i can be any positive integer; third, if there is no edge between two nodes, the weight is 0.

Applying the ranking distribution calculation to the star network yields the ranking distributions of the attribute-type information. Each of the three types of information nodes has its own ranking distribution, and these are fed into the probability generation model as conditional probabilities. The ranking distribution of AUTHOR is R = {r(au_1), r(au_2), …, r(au_n)}, where r(au_i) ≥ 0, and the ranking distributions of the other two attribute types are expressed in the same way. The computation of the ranking distributions has two parts. The first part uses the transitive ranking method, mainly for the AUTHOR and CATEGORY types; it is an iterative computation:

P(AUTHOR \mid G) = (W_{AUTHOR,APP}\, \sigma^{-1}_{AUTHOR,APP}) (W_{APP,CATEGORY}\, \sigma^{-1}_{APP,CATEGORY})\, P(CATEGORY \mid G) \qquad (1)

P(CATEGORY \mid G) = (W_{CATEGORY,APP}\, \sigma^{-1}_{CATEGORY,APP}) (W_{APP,AUTHOR}\, \sigma^{-1}_{APP,AUTHOR})\, P(AUTHOR \mid G) \qquad (2)

where σ_{AUTHOR,APP}, σ_{APP,CATEGORY}, σ_{CATEGORY,APP} and σ_{APP,AUTHOR} are diagonal matrices whose diagonal entries equal the column sums of the weight matrices W_{AUTHOR,APP}, W_{APP,CATEGORY}, W_{CATEGORY,APP} and W_{APP,AUTHOR}, respectively, so that multiplying by σ^{-1} normalizes each column. The second part is the count-based ranking method for the TERM type, computed as follows:

p(te_i \mid TERM, G) = \frac{\sum_{x \in N_G(te_i)} w_{te_i,x}}{\sum_{y \in TERM} \sum_{x \in N_G(y)} w_{y,x}} \qquad (3)

where N_G(te_i) denotes the set of neighbor nodes of te_i in network G.

The probability generation model takes the ranking distributions as one of its inputs and then uses the EM method to estimate the posterior probability distribution of the APP nodes over the clusters. The probability of visiting an attribute node x in a sub-network G_k is defined as:

p(x \mid G_k) = p(X \mid G_k) \times p(x \mid X, G_k) \qquad (4)

where p(X | G_k) is the probability of visiting type X in network G_k, and p(x | X, G_k) is the probability of visiting node x among the nodes of type X in network G_k. To prevent p(x | X, G_k) from being zero, global information is added to smooth it:

p'(x \mid X, G_k) = (1 - \varepsilon)\, p(x \mid X, G_k) + \varepsilon\, p(x \mid X, G) \qquad (5)

The probability of visiting a center node ap_i in a sub-network G_k is determined by its attribute nodes:

p(ap_i \mid G_k) = \prod_{x \in N_{G_k}(ap_i)} p(x \mid G_k)^{w_{ap_i,x}} = \prod_{x \in N_{G_k}(ap_i)} p'(x \mid X, G_k)^{w_{ap_i,x}}\, p(X \mid G_k)^{w_{ap_i,x}} \qquad (6)

According to Bayes' rule, the posterior probability of center node ap_i satisfies p(G_k | ap_i) ∝ p(ap_i | G_k) × p(G_k). To obtain a suitable p(G_k), the posterior probability p(G_k | ap_i) is maximized, and the EM method is used to obtain the best p(G_k). The calculation is as follows:

\log L = \sum_{ap_i \in APP} \log \left[ \sum_{k=1}^{K+1} p(ap_i \mid G_k) \times p(G_k) \right] \qquad (7)

p^{t}(G_k \mid ap_i) \propto p(ap_i \mid G_k)\, p^{t}(G_k); \quad p^{t+1}(G_k) = \sum_{ap_i \in APP} \frac{p^{t}(G_k \mid ap_i)}{|APP|} \qquad (8)

where K is the number of clusters specified by the user. After the probability distributions of all center-type nodes have been obtained, we can compute the posterior probability of each attribute node in each cluster as follows:

p(G_k \mid x) = \sum_{ap_i \in N(x)} p(G_k, ap_i \mid x) = \sum_{ap_i \in N(x)} \frac{p(G_k \mid ap_i)}{|N(x)|} \qquad (9)

where x is an attribute node and N(x) is the set of center nodes that are neighbors of x. The posterior probability of an attribute node in a cluster equals the mean of the posterior probabilities of its neighbor nodes in that cluster.

The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make improvements and modifications without departing from the inventive concept, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A mobile application ranking and clustering method based on a heterogeneous information network, characterized in that the system comprises a data preprocessing module, a ranking distribution calculation module and a probability generation module, and the method specifically comprises the following steps:

2. The mobile application ranking and clustering method based on a heterogeneous information network according to claim 1, characterized in that the ranking procedure of the ranking distribution calculation module specifically comprises the following steps:

3. The mobile application ranking and clustering method based on a heterogeneous information network according to claim 2, characterized in that the input comprises the number of clusters K, the K mobile application sub-networks and the ranking distributions of their attribute types; after the probability generation model is built, the EM method is used to obtain the optimal parameter values; the optimal parameter values and the ranking distributions of the attribute types are used to generate the posterior probability of each center-type node in each cluster; neighbor relations are then used to compute the posterior probability of each attribute-type node; finally, each node is reassigned to a cluster according to the probability distribution, and the clustering result is output.

4. The mobile application ranking and clustering method based on a heterogeneous information network according to claim 1, characterized in that building a star-shaped heterogeneous network composed of four types of information means establishing a star network G = (V, E, W), where V = {APP, AUTHOR, CATEGORY, TERM} contains the four types of information nodes of the applications; APP = {ap_1, ap_2, …, ap_n} is the set of center nodes, and AUTHOR = {au_1, au_2, …, au_n}, CATEGORY = {ca_1, ca_2, …, ca_n} and TERM = {te_1, te_2, …, te_m} are the three sets of attribute nodes; E is the set of edges connecting center nodes and attribute nodes, and W is the set of edge weights; there are three cases for the weights: first, if edge e_i connects an APP node and a node of {AUTHOR, CATEGORY}, then w_i is 1; second, if edge e_i connects an APP node and a TERM node, then w_i can be any positive integer; third, if there is no edge between two nodes, the weight is 0.

5. The mobile application ranking and clustering method based on a heterogeneous information network according to claim 4, characterized in that applying the ranking distribution calculation to the star network yields the ranking distributions of the attribute-type information; each of the three types of information nodes has its own ranking distribution, and these are fed into the probability generation model as conditional probabilities; the ranking distribution of AUTHOR is R = {r(au_1), r(au_2), …, r(au_n)}, where r(au_i) ≥ 0, and the ranking distributions of the other two attribute types are expressed in the same way; the computation of the ranking distributions has two parts; the first part uses the transitive ranking algorithm for the AUTHOR and CATEGORY types, and it is an iterative computation:

P(AUTHOR \mid G) = (W_{AUTHOR,APP}\, \sigma^{-1}_{AUTHOR,APP}) (W_{APP,CATEGORY}\, \sigma^{-1}_{APP,CATEGORY})\, P(CATEGORY \mid G) \qquad (1)

P(CATEGORY \mid G) = (W_{CATEGORY,APP}\, \sigma^{-1}_{CATEGORY,APP}) (W_{APP,AUTHOR}\, \sigma^{-1}_{APP,AUTHOR})\, P(AUTHOR \mid G) \qquad (2)

where σ_{AUTHOR,APP}, σ_{APP,CATEGORY}, σ_{CATEGORY,APP} and σ_{APP,AUTHOR} are diagonal matrices whose diagonal entries equal the column sums of the weight matrices W_{AUTHOR,APP}, W_{APP,CATEGORY}, W_{CATEGORY,APP} and W_{APP,AUTHOR}, respectively; the second part is the count-based ranking algorithm for the TERM type, computed as follows:

p(te_i \mid TERM, G) = \frac{\sum_{x \in N_G(te_i)} w_{te_i,x}}{\sum_{y \in TERM} \sum_{x \in N_G(y)} w_{y,x}} \qquad (3)

where N_G(te_i) denotes the set of neighbor nodes of te_i in network G.

6. The mobile application ranking and clustering method based on a heterogeneous information network according to claim 5, characterized in that the probability generation model takes the ranking distributions as one of its inputs and then uses the EM method to estimate the posterior probability distribution of the APP nodes over the clusters; the probability of visiting an attribute node x in a sub-network G_k is defined as:

p(x \mid G_k) = p(X \mid G_k) \times p(x \mid X, G_k) \qquad (4)

p'(x \mid X, G_k) = (1 - \varepsilon)\, p(x \mid X, G_k) + \varepsilon\, p(x \mid X, G) \qquad (5)

p(ap_i \mid G_k) = \prod_{x \in N_{G_k}(ap_i)} p(x \mid G_k)^{w_{ap_i,x}} = \prod_{x \in N_{G_k}(ap_i)} p'(x \mid X, G_k)^{w_{ap_i,x}}\, p(X \mid G_k)^{w_{ap_i,x}} \qquad (6)

\log L = \sum_{ap_i \in APP} \log \left[ \sum_{k=1}^{K+1} p(ap_i \mid G_k) \times p(G_k) \right] \qquad (7)

p^{t}(G_k \mid ap_i) \propto p(ap_i \mid G_k)\, p^{t}(G_k); \quad p^{t+1}(G_k) = \sum_{ap_i \in APP} \frac{p^{t}(G_k \mid ap_i)}{|APP|} \qquad (8)

p(G_k \mid x) = \sum_{ap_i \in N(x)} p(G_k, ap_i \mid x) = \sum_{ap_i \in N(x)} \frac{p(G_k \mid ap_i)}{|N(x)|} \qquad (9)