CN108197665A - A kind of algorithm of Bayesian network structure learning based on parallel evolutionary search - Google Patents


Info

Publication number
CN108197665A
CN108197665A
Authority
CN
China
Prior art keywords
bayesian network
node
network structure
algorithm
father
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810085728.7A
Other languages
Chinese (zh)
Inventor
林小光
钟坤华
孙启龙
张矩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Institute of Green and Intelligent Technology of CAS
Original Assignee
Chongqing Institute of Green and Intelligent Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Institute of Green and Intelligent Technology of CAS filed Critical Chongqing Institute of Green and Intelligent Technology of CAS
Priority to CN201810085728.7A priority Critical patent/CN108197665A/en
Publication of CN108197665A publication Critical patent/CN108197665A/en
Pending legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 - Bayesian classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 - Selection of the most significant subset of features
    • G06F18/2111 - Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/12 - Computing arrangements based on biological models using genetic models
    • G06N3/126 - Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Physiology (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Genetics & Genomics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a Bayesian network structure learning method based on parallel evolutionary search, and belongs to the field of artificial intelligence. Based on the idea of evolutionary computation, the method parallelizes the score-and-search process to achieve efficient Bayesian network structure learning. By using MapReduce technology, the present invention combines a genetic evolutionary algorithm with the structure search process, so as to make full use of the parallel computing power of multiple servers and realize efficient, rapid learning. The present invention applies the traditional genetic algorithm to cloud computing, exploiting both the capacity of distributed computing methods to process massive data and the parallelism and global search ability of genetic algorithms, so that Bayesian network structure learning on massive data can be carried out quickly and efficiently, which current technology cannot do; this is a substantive breakthrough.

Description

A Bayesian network structure learning method based on parallel evolutionary search
Technical field
The invention belongs to the field of artificial intelligence and relates to a Bayesian network structure learning method based on parallel evolutionary search.
Background technology
A Bayesian network is a mathematical model based on probabilistic inference and an extension of the Bayes method; it is currently one of the most effective theoretical models for the representation of, and reasoning about, uncertain knowledge. The information in a Bayesian network consists of two parts: first, the Bayesian network structure, a directed acyclic graph representing conditional-independence information, in which each node represents a variable in the domain of interest and the edges between nodes represent causal relationships between them; second, the conditional probability distribution functions (or conditional probability tables).
Bayesian network structure learning means finding, given a data sample set, the network structure that best matches the training sample set. Its purpose is to obtain the logical relationships between the variables in the domain of interest. A Bayesian network structure can be obtained by a score-and-search algorithm, and the structure search is an NP-hard problem. Existing solutions include the K2 learning algorithm, the TAN learning algorithm, Bayesian scoring metrics, the conditional-likelihood scoring method, and so on.
The basic idea of score-based Bayesian network structure learning is to start from a basic network topology and modify the structure with some search algorithm, for example by adding an edge, deleting an edge, or reversing the direction of an edge; a scoring function then scores each candidate network structure, and the score decides whether that network is kept. Two questions are mainly involved: first, the choice of scoring function; second, the choice of search algorithm.
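Purely as an illustration (none of these names or data representations appear in the patent), the three classic local moves just described can be sketched in Python, assuming a network is stored as a parent map and candidate structures must stay acyclic:

```python
from itertools import permutations

def is_acyclic(adj):
    """Kahn-style check that a parent map (node -> set of parents) is a DAG."""
    indeg = {n: len(p) for n, p in adj.items()}
    children = {n: [c for c in adj if n in adj[c]] for n in adj}
    queue = [n for n, d in indeg.items() if d == 0]
    seen = 0
    while queue:
        n = queue.pop()
        seen += 1
        for c in children[n]:
            indeg[c] -= 1
            if indeg[c] == 0:
                queue.append(c)
    return seen == len(adj)

def neighbors(adj):
    """Yield neighbor DAGs of `adj` via the three moves: add, delete, reverse."""
    for u, v in permutations(list(adj), 2):
        if u in adj[v]:
            deleted = {n: set(p) for n, p in adj.items()}
            deleted[v].discard(u)
            yield deleted                      # delete edge u -> v (always acyclic)
            reversed_ = {n: set(p) for n, p in deleted.items()}
            reversed_[u].add(v)
            if is_acyclic(reversed_):
                yield reversed_                # reverse edge u -> v
        else:
            added = {n: set(p) for n, p in adj.items()}
            added[v].add(u)
            if is_acyclic(added):
                yield added                    # add edge u -> v
```

In a score-and-search loop each yielded neighbor would be scored, and the best-scoring one kept or discarded, exactly as the paragraph above describes.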
In a big-data environment the data dimensionality is high and the sample space is large. Carrying out Bayesian network structure learning with an ordinary learning algorithm makes the process cumbersome and time-consuming, and the processing capacity of a single machine is limited, so the required Bayesian network structure often cannot be obtained quickly within a reasonable time, which significantly limits subsequent analysis and decision making.
Existing mainstream Bayesian network structure learning algorithms, both domestic and international, essentially all use serial processing, while Bayesian network structure learning is an NP-hard problem. Under big-data conditions, therefore, traditional structure learning processes perform poorly in both speed and efficiency, and it is difficult to obtain the required Bayesian network structure within a reasonable time.
Summary of the invention
In view of this, the purpose of the present invention is to provide a Bayesian network structure learning method based on parallel evolutionary search which, based on the idea of evolutionary computation, parallelizes the score-and-search process to achieve efficient Bayesian network structure learning. By using MapReduce technology, the present invention combines a genetic evolutionary algorithm with the structure search process, so as to make full use of the parallel computing power of multiple servers and realize efficient, rapid learning.
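As a rough local sketch of this idea only (the patent targets a MapReduce cluster of servers; the thread-pool stand-in, the placeholder fitness function, and all names below are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

def edge_count_score(individual):
    """Placeholder fitness over a bit-encoded structure: fewer edges is better.
    The patent scores structures with a Bayesian posterior score instead."""
    return -sum(individual)

def parallel_select(population, k, score_fn=edge_count_score, workers=4):
    """MapReduce-flavoured generation step: the 'map' phase scores each encoded
    structure on a worker; the 'reduce' phase ranks them and keeps the k fittest."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scored = list(pool.map(score_fn, population))       # map: score in parallel
    ranked = sorted(zip(scored, population), key=lambda t: t[0], reverse=True)
    return [ind for _, ind in ranked[:k]]                   # reduce: select survivors
```

On a real cluster the map phase would be distributed across servers by the MapReduce framework; the structure of the computation (independent scoring, then a global selection) is what makes the genetic search parallelizable.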
To achieve the above objectives, the present invention provides the following technical solution:
A Bayesian network structure learning method based on parallel evolutionary search, comprising the following steps:
S1: Randomly sample the raw sample data using Markov chain Monte Carlo (MCMC). The random sampling procedure is optimized with the Gibbs method, which yields higher sampling efficiency in high dimensions. Because transition probabilities are present, the sampled data consist of several relatively independent data subsets carrying preliminary conditional probabilities.
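For illustration, a minimal Gibbs sampler on a bivariate normal target shows the property S1 relies on: each draw comes from an exact full conditional, so every proposal is accepted. The toy target and all names here are assumptions, not the patent's data-specific sampler:

```python
import random

def gibbs_bivariate(n_samples, rho=0.5, burn_in=100, seed=0):
    """Gibbs sampler for a bivariate normal with correlation rho.
    Each full conditional x|y and y|x is N(rho * other, 1 - rho**2),
    so every step is an exact draw (acceptance probability 1)."""
    rng = random.Random(seed)
    x = y = 0.0
    sd = (1 - rho * rho) ** 0.5
    samples = []
    for t in range(burn_in + n_samples):
        x = rng.gauss(rho * y, sd)   # draw x from p(x | y)
        y = rng.gauss(rho * x, sd)   # draw y from p(y | x)
        if t >= burn_in:
            samples.append((x, y))
    return samples
```

Because no proposal is ever rejected, the chain mixes quickly, which is the efficiency argument the patent makes for Gibbs over generic Metropolis-style MCMC.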
S2: Gene-encode each of the data subsets generated in S1 to obtain the initial population X_j, i.e., several edgeless graphs.
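The patent does not spell out the gene coding; one conventional choice, assumed here purely for illustration, is a row-major adjacency bit string, under which the edgeless graphs of the initial population encode as all-zero strings:

```python
def encode(adj, order):
    """Flatten a parent map (node -> set of parents) into a row-major
    adjacency bit string over a fixed node order: bit r*n + c is 1 iff
    there is an edge order[r] -> order[c]."""
    n = len(order)
    return [1 if order[r] in adj[order[c]] else 0
            for r in range(n) for c in range(n)]

def decode(bits, order):
    """Inverse of encode: rebuild the parent map from the bit string."""
    n = len(order)
    return {order[c]: {order[r] for r in range(n) if bits[r * n + c]}
            for c in range(n)}

def initial_individual(order):
    """An edgeless graph, as used for the initial population in step S2."""
    return [0] * (len(order) ** 2)
```

A bit-string chromosome like this is what makes the standard genetic operators (crossover, mutation) directly applicable to network structures.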
S3: Define the evaluation function using the Bayesian scoring method, i.e., the maximum-posterior-probability principle G* = arg max P(G | D), where D denotes the data set and G denotes a network structure on the Bayesian network N. For an arbitrary node x_i, if its parent node set is π_i, the evaluation function of this Bayesian network structure is defined as:

P(G, D) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!

Here i denotes the node index in the Bayesian network under evaluation; j indexes the distinct instantiations of the parent node set of x_i; q_i is the number of instantiations of π_i; r_i is the number of values x_i can take; N_ijk is the number of cases in the data set D in which the variable X_i takes the value x_ik while its parent node set Πx_i takes its j-th instantiation π_ij; N_ij = \sum_k N_ijk; and w_ij denotes the j-th instantiation of the parent node set.
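A direct rendering of one node's term of this score, computed in log space to avoid factorial overflow (the data layout, the function name, and the list-of-dicts interface are assumptions for illustration):

```python
from math import lgamma
from collections import Counter

def k2_log_score(data, child, parents, r):
    """Log of the Bayesian (K2) score term for one node:
    sum_j [ log (r-1)! - log (N_ij + r - 1)! + sum_k log N_ijk! ],
    where j ranges over observed parent configurations, r is the child's
    arity, data is a list of dicts mapping variable name -> value."""
    counts = Counter()   # N_ijk: (parent config, child value) -> count
    totals = Counter()   # N_ij:  parent config -> count
    for row in data:
        j = tuple(row[p] for p in parents)
        counts[(j, row[child])] += 1
        totals[j] += 1
    log_fact = lambda n: lgamma(n + 1)   # log n! via the gamma function
    score = 0.0
    for n_ij in totals.values():
        score += log_fact(r - 1) - log_fact(n_ij + r - 1)
    for n_ijk in counts.values():
        score += log_fact(n_ijk)
    return score
```

For example, on three samples [{'A':0,'B':0}, {'A':0,'B':0}, {'A':1,'B':1}] with child 'B' and parent 'A' (r = 2), the two parent configurations contribute (1!/3!) * 2! and (1!/2!) * 1!, so the score is log(1/6).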
S4: Define the evolution process using the hill-climbing search method. During the evolutionary operation, the variables are examined one by one starting from the initial population, the parent nodes of each node are determined, and directed edges pointing from parent node to child node are generated. For X_j, the parent node set obtained so far is π_j. Define μ as the upper-bound threshold on the number of parent nodes per variable. If |π_j| < μ, the number of parent nodes of X_j is still below the prescribed upper bound; examine the variables that precede X_j and are not yet in π_j, and select from them the X_i that maximizes the new family score; then compare V_new with V_old, and if V_new > V_old, add X_i to π_j.
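Step S4 can be sketched as greedy parent selection under the threshold μ; the scoring-callback interface, the names, and the toy score in the usage note below are assumptions:

```python
def greedy_parents(score_fn, candidates, mu):
    """Hill-climbing parent selection for one node, as in step S4.
    score_fn(parents_tuple) -> float is the family score (hypothetical
    interface); mu is the upper bound on the number of parents."""
    parents = []
    v_old = score_fn(tuple(parents))
    while len(parents) < mu:                  # stop condition of step S5
        best, v_new = None, v_old
        for x in candidates:
            if x in parents:
                continue
            v = score_fn(tuple(parents) + (x,))
            if v > v_new:                     # keep only moves with V_new > V_old
                best, v_new = x, v
        if best is None:                      # no candidate improves: stop
            break
        parents.append(best)
        v_old = v_new
    return parents
```

For example, with a toy score that rewards containing 'A' and penalizes set size, `score_fn = lambda ps: (1 if 'A' in ps else 0) - 0.1 * len(ps)`, the call `greedy_parents(score_fn, ['A', 'B', 'C'], 3)` returns `['A']`: adding 'A' raises the score, while any further addition lowers it.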
S5: Define the evolution stop condition: if |π_j| ≥ μ, stop evolving.
The beneficial effects of the present invention are:
(1) The present invention performs random sampling with the Gibbs method. Since the acceptance probability is 1, the Markov transition rule needs no acceptance test on each state and no step is rejected, so convergence is fast and the random sampling is efficient.
(2) The present invention uses a genetic algorithm. The operations in the whole evolutionary process are random, but unlike a completely random search they make effective use of historical information, so that the expected fitness of the next generation increases. Through this mechanism, generation-by-generation genetic evolution finally converges to a suitable individual.
(3) The present invention applies the traditional genetic algorithm to cloud computing, exploiting both the capacity of distributed computing methods to process massive data and the parallelism and global search ability of genetic algorithms, so that Bayesian network structure learning on massive data can be carried out quickly and efficiently, which current technology cannot do; this is a substantive breakthrough.
Description of the drawings
To make the purpose, technical solution, and advantageous effects of the present invention clearer, the present invention provides the following drawing for explanation:
Fig. 1 is a block diagram of the technique of the present invention.
Specific embodiment
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawing.
The block diagram of the technique of the present invention is shown in Fig. 1; the process flow is as follows:
S1: Randomly sample the raw sample data using Markov chain Monte Carlo (MCMC). The present invention optimizes the random sampling procedure with the Gibbs method, which can yield higher sampling efficiency in high dimensions. Because transition probabilities are present, the sampled data are composed of several relatively independent data subsets carrying preliminary conditional probabilities.
S2: Gene-encode each of the data subsets generated in the first step to obtain the initial population, i.e., several edgeless graphs.
S3: Define the evaluation function using the Bayesian scoring method, i.e., the maximum-posterior-probability principle G* = arg max P(G | D), where D denotes the data set and G denotes a network structure on the Bayesian network N. For an arbitrary node x_i, if its parent node set is π_i, the evaluation function of this Bayesian network structure is defined as:

P(G, D) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!
S4: Define the evolution process using the hill-climbing search method. During the evolutionary operation, the variables are examined one by one starting from the initial population, the parent nodes of each node are determined, and directed edges pointing from parent node to child node are generated. For X_j, the parent node set obtained so far is π_j. If |π_j| < μ (where μ is defined as the upper-bound threshold on the number of parent nodes per variable), the number of parent nodes of X_j is still below the prescribed upper bound; examine the variables that precede X_j and are not yet in π_j, and select from them the X_i that maximizes the new family score; then compare V_new with V_old, and if V_new > V_old, add X_i to π_j.
S5: Define the evolution stop condition: if |π_j| ≥ μ, stop evolving.
Finally, it is noted that the preferred embodiments above are merely illustrative of the technical solution of the present invention and not restrictive. Although the present invention has been described in detail through the preferred embodiments above, those skilled in the art should understand that various changes in form and detail can be made to it without departing from the scope defined by the claims of the present invention.

Claims (1)

1. A Bayesian network structure learning method based on parallel evolutionary search, characterized in that the method comprises the following steps:
S1: Randomly sample the raw sample data using Markov chain Monte Carlo (MCMC). The random sampling procedure is optimized with the Gibbs method, which yields higher sampling efficiency in high dimensions. Because transition probabilities are present, the sampled data consist of several relatively independent data subsets carrying preliminary conditional probabilities.
S2: Gene-encode each of the data subsets generated in S1 to obtain the initial population X_j, i.e., several edgeless graphs.
S3: Define the evaluation function using the Bayesian scoring method, i.e., the maximum-posterior-probability principle G* = arg max P(G | D), where D denotes the data set and G denotes a network structure on the Bayesian network N; N = {X_1, X_2, …, X_n}, where the value range of X_i is {x_i1, x_i2, …, x_ir_i}. For an arbitrary node x_i, if its parent node set is π_i, the evaluation function of this Bayesian network structure is defined as:

P(G, D) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!

Here i denotes the node index in the Bayesian network under evaluation; j indexes the distinct instantiations of the parent node set of x_i; q_i is the number of instantiations of π_i; r_i is the number of values x_i can take; N_ijk is the number of cases in the data set D in which the variable X_i takes the value x_ik while its parent node set Πx_i takes its j-th instantiation π_ij; N_ij = \sum_k N_ijk; and w_ij denotes the j-th instantiation of the parent node set.
S4: Define the evolution process using the hill-climbing search method. During the evolutionary operation, the variables are examined one by one starting from the initial population, the parent nodes of each node are determined, and directed edges pointing from parent node to child node are generated. For X_j, the parent node set obtained so far is π_j. Define μ as the upper-bound threshold on the number of parent nodes per variable. If |π_j| < μ, the number of parent nodes of X_j is still below the prescribed upper bound; examine the variables that precede X_j and are not yet in π_j, and select from them the X_i that maximizes the new family score; then compare V_new with V_old, and if V_new > V_old, add X_i to π_j.
S5: Define the evolution stop condition: if |π_j| ≥ μ, stop evolving.
CN201810085728.7A 2018-01-29 2018-01-29 A kind of algorithm of Bayesian network structure learning based on parallel evolutionary search Pending CN108197665A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810085728.7A CN108197665A (en) 2018-01-29 2018-01-29 A kind of algorithm of Bayesian network structure learning based on parallel evolutionary search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810085728.7A CN108197665A (en) 2018-01-29 2018-01-29 A kind of algorithm of Bayesian network structure learning based on parallel evolutionary search

Publications (1)

Publication Number Publication Date
CN108197665A true CN108197665A (en) 2018-06-22

Family

ID=62591136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810085728.7A Pending CN108197665A (en) 2018-01-29 2018-01-29 A kind of algorithm of Bayesian network structure learning based on parallel evolutionary search

Country Status (1)

Country Link
CN (1) CN108197665A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109002928A (en) * 2018-08-13 2018-12-14 中国电力科学研究院有限公司 A kind of electric load peak value prediction technique and device based on Bayesian network model
CN109697512A (en) * 2018-12-26 2019-04-30 东南大学 Personal data analysis method and computer storage medium based on Bayesian network
CN109697512B (en) * 2018-12-26 2023-10-27 东南大学 Personal data analysis method based on Bayesian network and computer storage medium
CN111008705A (en) * 2019-12-06 2020-04-14 东软集团股份有限公司 Searching method, device and equipment
CN111008705B (en) * 2019-12-06 2024-02-13 东软集团股份有限公司 Searching method, device and equipment
CN117474106A (en) * 2023-10-24 2024-01-30 江南大学 Bayesian network structure learning algorithm based on full-flow parallel genetic algorithm

Similar Documents

Publication Publication Date Title
Hassib et al. An imbalanced big data mining framework for improving optimization algorithms performance
Zhong et al. Applying big data based deep learning system to intrusion detection
Telikani et al. A survey of evolutionary computation for association rule mining
Li et al. PS–ABC: A hybrid algorithm based on particle swarm and artificial bee colony for high-dimensional optimization problems
De la Hoz et al. Feature selection by multi-objective optimisation: Application to network anomaly detection by hierarchical self-organising maps
Wu et al. Attribute weighting via differential evolution algorithm for attribute weighted naive bayes (wnb)
Laskey et al. Population markov chain monte carlo
CN108197665A (en) A kind of algorithm of Bayesian network structure learning based on parallel evolutionary search
Wu et al. Adaptive spammer detection with sparse group modeling
Gasse et al. An experimental comparison of hybrid algorithms for Bayesian network structure learning
Akojwar et al. A novel probabilistic-PSO based learning algorithm for optimization of neural networks for benchmark problems
Singla et al. Approximate lifting techniques for belief propagation
Pandey et al. Data clustering using hybrid improved cuckoo search method
Abasi et al. A text feature selection technique based on binary multi-verse optimizer for text clustering
Zhou et al. Probabilistic graphical models parameter learning with transferred prior and constraints
Liu et al. Scaling up probabilistic circuits by latent variable distillation
Qiao et al. A framework for multi-prototype based federated learning: Towards the edge intelligence
Castellana et al. The infinite contextual graph markov model
Gao et al. Raftgp: Random fast graph partitioning
Liang et al. A new hybrid ant colony optimization based on brain storm optimization for feature selection
Xue et al. Hybrid resampling and weighted majority voting for multi-class anomaly detection on imbalanced malware and network traffic data
Chu et al. A binary superior tracking artificial bee colony with dynamic Cauchy mutation for feature selection
Shafia et al. A hybrid algorithm for data clustering using honey bee algorithm, genetic algorithm and k-means method
Leng et al. An effective multi-level algorithm based on ant colony optimization for bisecting graph
Kumar et al. Result merging in meta-search engine using genetic algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180622