CN102663108B

CN102663108B - Medicine community discovery method based on a parallelized label propagation algorithm on a complex network model

Info

Publication number: CN102663108B
Application number: CN2012101111712A
Authority: CN
Inventors: 王崇骏; 刘正; 杨鸿超; 孙道平; 谢俊元
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2012-04-16
Filing date: 2012-04-16
Publication date: 2013-11-13
Anticipated expiration: 2032-04-16
Also published as: CN102663108A

Abstract

The invention provides a medicine community discovery method based on a parallelized label propagation algorithm on a complex network model. The method comprises a networking stage and a mining stage. The networking stage includes: a) preprocessing to generate a traditional Chinese medicine data set and formatting it as initial text data; b) deploying the initial text data to a Hadoop platform; c) building a traditional Chinese medicine (TCM) network in parallel; and d) ending the networking stage. The mining stage includes: a) obtaining the TCM network text file generated in step c) of the networking stage; b) deploying the TCM network text file to the Hadoop platform; c) running the parallelized label propagation algorithm to discover medicine communities; and d) ending the mining stage. The method builds a TCM network model, uses parallelization to improve the scalability and running speed of both network construction and label propagation, can effectively mine communities of medicines with similar properties from Chinese medicine compound prescriptions, and thereby assists research on the compatibility rules of traditional Chinese medicine.

Description

Medicine community discovery method based on a parallelized label propagation algorithm on a complex network model

Technical field

The present invention relates to a method for modeling traditional Chinese medicine (TCM) as a complex network, and to a technique that uses a parallel label propagation algorithm to mine medicine communities on the resulting TCM network.

Background technology

Data mining techniques can analyze Chinese medicine compound prescription data intelligently and uncover latent rules of medicine compatibility. One class of methods commonly used in Chinese medicine data mining is medicine clustering. Based on the transaction-item model, in which a compound prescription is regarded as a transaction composed of several medicines and stored in a transaction database, these methods aggregate similar medicines to find groups of medicines that are frequently prescribed together. Traditional Chinese medicine clustering algorithms based on the transaction-item model have difficulty mining medicines that are compatible only indirectly, and they often neglect uncommon medicines, which hinders in-depth study of the compatibility rules of each medicine. The present invention models Chinese medicines as a complex network and applies a community discovery algorithm to mine groups of medicines with similar medicinal properties in that network.

Research on community structure in complex network analysis has a long history and spans fields such as computer science, sociology, and the life sciences. Analyzing and revealing the community structure of a network is important for understanding the network's structure and characteristics. Community discovery in a Chinese medicine complex network has much the same purpose as traditional medicine cluster analysis based on the transaction-item model: both group medicines that are frequently prescribed together into the same class and mine medicines with similar properties, in order to study the compatibility rules of traditional Chinese medicine.

Building a Chinese medicine complex network breaks with the convention that traditional Chinese medicine data mining is always based on the transaction-item model. Applying the label propagation algorithm from complex network analysis makes it possible to mine medicine communities in depth, that is, groups of medicines that have similar properties and are prescribed together relatively frequently within the community. This overcomes two defects of traditional transaction-item clustering algorithms: the inability to find indirect compatibility and the neglect of uncommon medicines.

Recently, with the rapid growth of Chinese medicine compound prescription data, non-parallel algorithms are no longer suitable for community discovery on large-scale Chinese medicine data.

Summary of the invention

The technical problem to be solved by the present invention is to model Chinese medicines as a complex network and to apply a parallel label propagation algorithm to this model so that medicine communities can be found quickly and effectively.

To solve the above problem, the medicine community discovery method based on a parallelized label propagation algorithm on a complex network model of the present invention comprises the following steps:

1) Networking stage:

a) preprocess to generate the Chinese medicine data set, formatted as initial text data;

b) deploy the initial text data to the Hadoop platform;

c) build the Chinese medicine network, i.e. the TCM network, in parallel; the network takes medicines as nodes and connects with an edge every pair of nodes whose SC_{AB} is greater than a given threshold;

d) end.

2) Mining stage:

a) obtain the Chinese medicine network text file generated in step 1)-c;

b) deploy the above TCM network text file to the Hadoop platform;

c) run the parallelized label propagation algorithm, i.e. a label propagation algorithm parallelized with the MapReduce framework, in which each node iteratively updates its own label (that is, its community number) using information from its neighbors, to discover medicine communities;

d) end.

In step 1)-a, the preprocessing extracts the constituent medicines of every compound prescription in the Chinese medicine compound prescription data.
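The preprocessing can be pictured with a minimal Python sketch. It assumes each compound prescription is already available as a list of medicine names and that the initial text data holds one prescription per line with tab-separated names; the function name `format_prescriptions`, the separator, the sorting, and the sample medicine names are illustrative assumptions, not details given in this document.

```python
def format_prescriptions(prescriptions):
    """Return one tab-separated text line per compound prescription."""
    lines = []
    for herbs in prescriptions:
        # Strip whitespace, drop empty names, and remove duplicates.
        # Sorting is only for reproducible output (an assumption).
        names = sorted({h.strip() for h in herbs if h.strip()})
        if names:  # skip prescriptions with no usable ingredient
            lines.append("\t".join(names))
    return lines


# Hypothetical sample records, for illustration only.
records = [["ginseng", " licorice ", "ginseng"], [], ["ephedra", "cinnamon twig"]]
for line in format_prescriptions(records):
    print(line)
```

Each resulting line would then be a candidate row of the initial text data that is uploaded in step 1)-b.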

In step 1)-b, the deployment uploads the initial text data generated in step 1)-a to the distributed file system (HDFS) of the Hadoop platform.

Further, the detailed process of step 1)-c is as follows:

1) assign a unique ID to each Chinese medicine compound prescription, i.e. to each line of text data;

2) build an inverted index from each medicine to the IDs of the compound prescriptions that contain it;

3) assign each medicine a unique medicine identifier id, which includes the frequency with which the medicine occurs in compound prescriptions;

4) restore the inverted index, i.e. run the inverted index algorithm a second time to recover the Chinese medicine compound prescription text data, so that each compound prescription line is read by one Map function of this subtask;

5) each Map function reads one line of text and parses out the medicine node information;

6) judge whether the medicines contained in the compound prescription in this Map function can still be paired into a joint key-value pair <Key, Value>; if so, go to 7), otherwise go to 8);

7) create the joint key-value pair <Key, Value>;

8) the <Key, Value> pairs pass through shuffle & sort to Reduce; Reduce receives the [Value] array formed under each Key, computes the pairwise measure between medicines according to the following formula, and writes the medicine pairs whose measure exceeds the set threshold to a file saved in HDFS:

SC_{AB} = \frac{|F_{A} \cap F_{B}|}{\min\{|F_{A}|, |F_{B}|\}}

where |F_{A} \cap F_{B}| is the number of times medicines A and B are prescribed together, \min\{|F_{A}|, |F_{B}|\} is the number of occurrences of whichever of A and B is prescribed less often, and SC_{AB} is the ratio of the co-occurrence count of A and B to the smaller of their occurrence counts;

9) read the medicine-pair file generated above, i.e. the edge set of the medicine complex network, and format it as an adjacency list to store the topology of the Chinese medicine network;

10) end.
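As a rough, non-parallel illustration of what these steps compute, the sketch below builds the inverted index, evaluates SC_{AB} for every medicine pair by set intersection, and returns an adjacency list. It is a single-machine stand-in, not the MapReduce implementation; the function name `build_tcm_network`, the tab-separated input lines, and the single-letter medicine names are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_tcm_network(lines, threshold):
    """Return {medicine: [neighbors]} for pairs with SC_AB > threshold."""
    # Inverted index: medicine -> set of prescription IDs (IDs start at 1).
    index = defaultdict(set)
    for pid, line in enumerate(lines, start=1):
        for herb in line.split("\t"):
            index[herb].add(pid)

    adjacency = {}
    for a, b in combinations(sorted(index), 2):
        co_occurrences = len(index[a] & index[b])        # |F_A ∩ F_B|
        smaller_freq = min(len(index[a]), len(index[b]))  # min{|F_A|, |F_B|}
        if co_occurrences / smaller_freq > threshold:
            adjacency.setdefault(a, []).append(b)
            adjacency.setdefault(b, []).append(a)
    return adjacency


# Four hypothetical prescriptions over medicines A to D.
lines = ["A\tB\tC", "A\tB", "A\tD", "C\tD"]
print(build_tcm_network(lines, threshold=0.6))
```

In this toy data A occurs three times and B twice, always together with A, so SC_{AB} = 2 / 2 = 1 and the pair is kept, while every other co-occurring pair scores 1 / 2 = 0.5 and is dropped at a threshold of 0.6. Enumerating all pairs is quadratic in the number of medicines, which is the cost the parallel version spreads across Map and Reduce tasks.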

Further, in step 2)-c each node updates its own label using the labels of its neighbor nodes (generally taking the most frequent label; if several labels share the maximum frequency, one of them is chosen at random). The overall parallelized label propagation algorithm is iterative, and iteration terminates when the node labels are essentially stable, for example when more than 90% of the node labels no longer change. The flow of one iteration of the parallelized label propagation algorithm is as follows:

1) set a unique initial label id for each medicine node;

2) each Map function reads one line of text from HDFS and stores it in the variable Value;

3) parse the data in Value, using the temporary array element Tmp[0] to hold the node id and Tmp[1] to hold the adjacency list AdjList and Label;

4) send the node data structure;

5) judge whether Label contains only one label, i.e. whether this is the first iteration; if so, go to 6), otherwise go to 7);

6) set variable V = label 1;

7) set variable V = label 1 && label 2, where label 1 is the label from iteration t-1 and label 2 is the label from iteration t-2;

8) set variable i = 0;

9) judge whether i is less than AdjList.length; if so, go to step 10), otherwise go to step 12);

10) send <AdjList.get(i), V>;

11) increment i by 1 and go to 8);

12) the Map process ends and Hadoop performs shuffle & sort;

13) Reduce parses the [Value] array, holds the node structure in the data structure AdjLabelPA, and uses the temporary linked lists ls_{1} and ls_{2} to hold the l_{1} and l_{2} values passed over by each neighbor (if there are two labels; otherwise ls_{2} is empty);

14) find the new node label according to the following formula:

C_{x}(t) = f(C_{x_{1}}(t-1), \ldots, C_{x_{k}}(t-1), w * C_{x_{1}}(t-2), \ldots, w * C_{x_{k}}(t-2))

15) where C_{x_{k}}(t-1) denotes the label of node x_{k} at iteration t-1, and the function f returns the label passed over most frequently by the neighbor nodes;

16) update the t-1 label and the t label in AdjLabelPA to C_{x}(t-1) and C_{x}(t), respectively;

17) save the result of this iteration to the distributed file system HDFS;

18) end.
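One iteration of this update rule can be sketched in plain Python as follows. The sketch reads the formula as a weighted vote in which each neighbor's t-1 label counts 1 and each neighbor's t-2 label counts w; that reading, the default w = 0.5, the synchronous update, and the names `next_label` and `lpa_iteration` are assumptions of this example rather than details stated in the text.

```python
import random
from collections import defaultdict


def next_label(labels_t1, labels_t2, w=0.5, rng=random):
    """Pick the label with the highest weighted vote among neighbor labels."""
    votes = defaultdict(float)
    for label in labels_t1:   # neighbor labels from iteration t-1
        votes[label] += 1.0
    for label in labels_t2:   # neighbor labels from iteration t-2, weight w
        votes[label] += w
    best = max(votes.values())
    # Ties are broken at random, as the text describes.
    return rng.choice(sorted(l for l, v in votes.items() if v == best))


def lpa_iteration(adjacency, prev, prev2=None, w=0.5, rng=random):
    """Compute every node's new label from its neighbors' earlier labels."""
    new_labels = {}
    for node, neighbors in adjacency.items():
        if not neighbors:                 # isolated node keeps its label
            new_labels[node] = prev[node]
            continue
        t1 = [prev[n] for n in neighbors]
        t2 = [prev2[n] for n in neighbors] if prev2 else []  # empty on first pass
        new_labels[node] = next_label(t1, t2, w, rng)
    return new_labels


adjacency = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}
prev = {"A": 1, "B": 2, "C": 2, "D": 3}
print(lpa_iteration(adjacency, prev))
```

On the first pass `prev2` is omitted, which corresponds to the case where Label holds only one label and ls_{2} stays empty.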

The medicine community discovery method based on a parallelized label propagation algorithm on a complex network model of the present invention builds a Chinese medicine complex network model, uses parallelization to improve the scalability and running speed of network construction and of the label propagation algorithm, and can effectively mine communities of medicines with similar properties from Chinese medicine compound prescriptions, which helps the study of the compatibility rules of traditional Chinese medicine.

Description of drawings

Fig. 1 is the operational flowchart of medicine community discovery.

Fig. 2 is the flowchart of the medicine community discovery method based on a parallelized label propagation algorithm on a complex network model of the present invention.

Fig. 3 is the flowchart of generating the Chinese medicine (TCM) network.

Fig. 4 is the flowchart of mining medicine communities on the Chinese medicine (TCM) network with the label propagation algorithm (one iteration).

Embodiment

To make the technical content of the present invention easier to understand, specific embodiments are described below with reference to the accompanying drawings.

As shown in Fig. 1, core medicine mining obtains Chinese medicine compound prescription data by querying a prescription database, extracting irregular text data, and similar means; generates text data through preprocessing such as data normalization and formatting; then builds the Chinese medicine complex network in parallel on the Hadoop platform; and finally runs the parallelized label propagation algorithm on this network to mine medicine communities.

Networking the Chinese medicine compound prescription data and mining medicine communities with the parallelized label propagation algorithm are the key steps of this invention. The idea of the invention is to mine medicine communities effectively through complex network modeling and a parallelized label propagation algorithm while also improving the scalability and running speed of the algorithm.

The flowchart of the medicine community discovery method based on a parallelized label propagation algorithm on a complex network model of the present invention is shown in Fig. 2.

Step 0 is the initial state of the medicine community discovery method of the present invention;

In the networking stage (steps 1-3), step 1 obtains the initial Chinese medicine compound prescription data for networking from a database or from other irregular text data, and formats it as text data so that it can be uploaded to the distributed file system (HDFS) of the Hadoop platform;

Step 2 builds the Chinese medicine (TCM) network in parallel on the initial data set, which involves two rounds of inverted indexing and the creation of pairwise medicine joint key-value pairs;

Step 3 saves the generated TCM network to the HDFS of the Hadoop platform.

In the mining stage (steps 4-5), step 4 runs the parallelized label propagation algorithm on the TCM network generated in step 3;

Step 5 saves the mining result to HDFS;

Step 6 is the end step of the medicine community discovery method based on a parallelized label propagation algorithm on a complex network model of the present invention.

Fig. 3 is a detailed description of step 2 in Fig. 2.

Step 20 is the initial step;

Step 21 assigns a unique ID value to each Chinese medicine compound prescription, numbering from 1;

Step 22 builds the inverted index from medicines to compound prescription IDs;

Step 23 assigns an id to each medicine, numbering from 1%N, where N is the frequency with which the medicine occurs in compound prescriptions, i.e. the length of its inverted index entry;

Step 24 restores the inverted index, i.e. runs the inverted index algorithm a second time, so that each compound prescription line is read by one Map function of this subtask;

Step 25 judges whether the medicines contained in the compound prescription in this Map function can still be paired into a joint key-value pair; if so, go to step 26, otherwise go to step 27 (note that at this point the Map process of this subtask ends);

Step 26 creates the joint key-value pair <Key, Value> (where Key is less than Value);

Step 27 computes the SC_{AB} value in the Reduce function using formula (1):

SC_{AB} = \frac{|F_{A} \cap F_{B}|}{\min\{|F_{A}|, |F_{B}|\}} \quad (1)

Step 28 saves the result to HDFS;

Step 29 is the end of Fig. 3.
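The Map, shuffle, and Reduce roles in steps 25 to 27 can be mimicked in ordinary Python to show how the joint key-value pairs lead to formula (1). This is an in-memory emulation, not Hadoop code: the function names `map_pairs`, `shuffle`, and `reduce_sc` are hypothetical, and the medicine frequencies are passed in as a separate dictionary, whereas the text carries the frequency N inside each medicine id.

```python
from collections import Counter, defaultdict
from itertools import combinations


def map_pairs(medicine_ids):
    """Map: emit <Key, Value> for each medicine pair, with Key < Value."""
    for key, value in combinations(sorted(set(medicine_ids)), 2):
        yield key, value


def shuffle(pairs):
    """Group emitted values under their key, as shuffle & sort would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_sc(key, values, freq, threshold):
    """Reduce: count co-occurrences per partner and keep SC_AB > threshold."""
    edges = []
    for other, co_occurrences in sorted(Counter(values).items()):
        sc = co_occurrences / min(freq[key], freq[other])
        if sc > threshold:
            edges.append((key, other, sc))
    return edges


prescriptions = [[1, 2, 3], [1, 2], [1, 4], [3, 4]]  # hypothetical medicine ids
freq = {1: 3, 2: 2, 3: 2, 4: 2}                      # occurrences per medicine
grouped = shuffle(p for rx in prescriptions for p in map_pairs(rx))
for key in sorted(grouped):
    print(key, reduce_sc(key, grouped[key], freq, threshold=0.6))
```

Emitting each pair only once, with the smaller id as Key, means a single Reduce call sees every co-occurrence of that pair, so counting repeated entries in the [Value] array yields |F_{A} \cap F_{B}| directly.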

The overall parallelized label propagation algorithm is iterative, and iteration terminates when the node labels are essentially stable. Fig. 4 is a detailed description of one iteration of the label propagation algorithm in step 4 of Fig. 2, as follows:

Step 40 is the initial step;

Step 41 sets a unique label for each medicine node;

Step 42: each Map function reads one line of text from the Chinese medicine network file in HDFS and stores it in the variable Value;

Step 43 parses the variable Value, using the temporary array element Tmp[0] to hold the node id and Tmp[1] to hold the adjacency list AdjList and Label;

Step 44 sends the node data structure;

Step 45 judges whether Label contains only one label (the first iteration); if so, go to step 46, otherwise go to step 47;

Step 46 sets variable V = label 1;

Step 47 sets variable V = label 1 && label 2, where label 1 is the label from iteration t-1 and label 2 is the label from iteration t-2;

Step 48 sets variable i = 0;

Step 49 judges whether i is less than AdjList.length; if so, go to step 50, otherwise go to step 52;

Step 50 sends <AdjList.get(i), V>;

Step 51 increments i by 1;

Step 52 is the shuffle and sort process of the Hadoop platform;

Step 53: Reduce receives <Key, [Value]>;

Step 54: Reduce parses the [Value] array, holds the node structure in the data structure AdjLabelPA, uses the temporary linked list ls_{1} to hold the iteration t-1 labels transmitted by each neighbor node, and uses ls_{2} to hold the iteration t-2 labels transmitted by each neighbor node;

Step 55 returns, according to the function f, the label L with the highest frequency when ls_{1} and ls_{2} are considered together;

Step 56 updates the labels of iterations t-1 and t-2 in AdjLabelPA;

Step 57 saves the result to HDFS;

Step 58 is the end step of Fig. 4.

Note: the label propagation algorithm runs for multiple iterations, and iteration ends when the labels of more than 90% of the nodes in the network remain unchanged.
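The stopping rule in the note can be expressed as a small driver loop. In this sketch the per-iteration update is passed in as a callable `step`, the 90% criterion is read as "at least 90% of labels unchanged", and an iteration cap guards against oscillation; `stable_fraction`, `run_until_stable`, `min_stable`, and `max_iter` are hypothetical names, and the cap is an added safeguard not mentioned in the text.

```python
def stable_fraction(old, new):
    """Fraction of nodes whose label is the same in both labelings."""
    unchanged = sum(1 for node in old if old[node] == new[node])
    return unchanged / len(old)


def run_until_stable(step, labels, min_stable=0.9, max_iter=100):
    """Apply `step` repeatedly until enough labels stop changing."""
    for iteration in range(1, max_iter + 1):
        new_labels = step(labels)
        if stable_fraction(labels, new_labels) >= min_stable:
            return new_labels, iteration
        labels = new_labels
    return labels, max_iter


# Toy update used only to exercise the loop: every node adopts the
# smallest label currently present in the network.
def toy_step(labels):
    smallest = min(labels.values())
    return {node: smallest for node in labels}


print(run_until_stable(toy_step, {"A": 1, "B": 2, "C": 3}))
```

With a real update function, each call to `step` would correspond to one MapReduce job that reads the previous labels from HDFS and writes the new ones back.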

In summary, the present invention uses parallelization to improve the scalability and running speed of network construction and of the label propagation algorithm, so that the community discovery algorithm can run quickly and efficiently on large volumes of compound prescription data, and it can effectively mine communities of medicines with similar properties from Chinese medicine compound prescriptions, which helps the study of the compatibility rules of traditional Chinese medicine.

A person of ordinary skill in the technical field of the present invention may make various modifications and variations without departing from the spirit and scope of the present invention. The scope of protection of the present invention is therefore defined by the claims.

Claims

1. A medicine community discovery method based on a parallelized label propagation algorithm on a complex network model, characterized in that it comprises the following steps:

1) networking stage:

b) deploy the initial text data to the Hadoop platform;

c) build the Chinese medicine network in parallel; the network takes medicines as nodes and connects with an edge every pair of nodes whose SC_{AB} is greater than a given threshold;

d) end;

2) mining stage:

a) obtain the Chinese medicine network text file generated in step 1) c;

b) deploy the above Chinese medicine network text file to the Hadoop platform;

c) run the parallelized label propagation algorithm, i.e. a label propagation algorithm parallelized with the MapReduce framework, in which each node iteratively updates its own label using information from its neighbors, to discover medicine communities;

d) end;

wherein the detailed process of step 1) c is as follows:

1) assign a unique ID to each Chinese medicine compound prescription, i.e. to each line of text data;

2) build an inverted index from each medicine to the IDs of the compound prescriptions that contain it;

7) create the joint key-value pair <Key, Value>;

8) the <Key, Value> pairs pass through shuffle & sort to Reduce; Reduce receives the [Value] array formed under each Key, computes the pairwise measure between medicines according to the following formula, and writes the medicine pairs whose measure exceeds the set threshold to a file saved in the distributed file system:

SC_{AB} = \frac{|F_{A} \cap F_{B}|}{\min\{|F_{A}|, |F_{B}|\}};

10) end;

in step 2) c, the overall parallelized label propagation algorithm is iterative, and iteration terminates when the node labels are essentially stable, wherein one iteration of the parallelized label propagation algorithm is as follows:

1) set a unique initial label id for each medicine node;

2) each Map function reads one line of text from the distributed file system HDFS and stores it in the variable Value;

4) send the node data structure;

6) set variable V = label 1;

8) set variable i = 0;

9) judge whether i is less than AdjList.length; if so, go to step 10), otherwise go to step 12);

10) send <AdjList.get(i), V>;

11) increment i by 1 and go to 8);

12) the Map process ends and Hadoop performs shuffle & sort;

13) Reduce parses the [Value] array, holds the node structure in the data structure AdjLabelPA, and uses the temporary linked lists ls_{1} and ls_{2} to hold the l_{1} and l_{2} values passed over by each neighbor, if there are two labels; otherwise ls_{2} is empty;

14) find the new node label according to the following formula:

C_{x}(t) = f(C_{x_{1}}(t-1), \ldots, C_{x_{k}}(t-1), w * C_{x_{1}}(t-2), \ldots, w * C_{x_{k}}(t-2));

15) wherein

16) update the t-1 label and the t label in AdjLabelPA to C_{x}(t-1) and C_{x}(t), respectively;

17) save the result of this iteration to the distributed file system;

18) end.

2. The medicine community discovery method based on a parallelized label propagation algorithm on a complex network model according to claim 1, characterized in that the preprocessing in step 1) a extracts the constituent medicines of every compound prescription in the Chinese medicine compound prescription data.

3. The medicine community discovery method based on a parallelized label propagation algorithm on a complex network model according to claim 1, characterized in that the deployment in step 1) b uploads the initial text data generated in step 1) a to the distributed file system of the Hadoop platform.