CN102663108A

CN102663108A - Medicine corporation finding method based on parallelization label propagation algorithm for complex network model

Info

Publication number: CN102663108A
Application number: CN2012101111712A
Authority: CN
Inventors: 王崇骏; 刘正; 杨鸿超; 孙道平; 谢俊元
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2012-04-16
Filing date: 2012-04-16
Publication date: 2012-09-12
Anticipated expiration: 2032-04-16
Also published as: CN102663108B

Abstract

The invention provides a medicine corporation finding method based on a parallelization label propagation algorithm for a complex network model. The medicine corporation finding method includes a networking stage and a mining stage, the networking stage includes a) preprocessing and generating a traditional Chinese medicine data set, formatting the traditional Chinese medicine dataset into text data; b) deploying the initial text data to a Hadoop platform; c) parallelly establishing a traditional Chinese medicine (TCM for short) network; and d) completing the networking stage, and the mining stage includes a) acquiring a TCM network text file processed and generated from the step c) in the networking stage; b) deploying the TCM network text file to the Hadoop platform; c) implementing the parallelization label propagation algorithm to find out medicine corporations; and d) completing the mining stage. By the aid of the medicine corporation finding method based on the parallelization label propagation algorithm for the complex network model, a TCM network model is built, extensibility and running speed of networking and the label propagation algorithm are increased by the aid of parallelization technology, the medicine corporations with similarity in terms of complex Chinese medicines can be effectively mined, and research on compatibility regulation of traditional Chinese medicine is assisted.

Description

Medicine community (corporation) discovery method based on a parallelized label propagation algorithm for a complex network model

Technical field

The present invention relates to a method for modeling traditional Chinese medicines as a complex network (the TCM network), and to a technique that applies a parallel label propagation algorithm on this network to mine communities of Chinese medicines.

Background technology

Data mining techniques can analyze TCM compound prescription data intelligently and uncover latent rules of drug compatibility, that is, of how medicines are combined in prescriptions. One common class of application in TCM data mining is medicine clustering. Based on the transaction-item model (each compound prescription is regarded as a transaction composed of several medicines and is stored in a transaction database), it aggregates similar medicines to find groups of medicines that are frequently prescribed together. Traditional medicine clustering algorithms based on the transaction-item model have difficulty mining medicines that are only indirectly compatible, and they often ignore uncommon medicines, which hinders in-depth study of the compatibility rules of each individual medicine. The present invention models the Chinese medicine network with a complex network model and applies a community discovery algorithm on that network to mine groups of medicines with similar medicinal properties.

Community structure has long been studied in complex network analysis, across fields such as computer science, sociology and the life sciences. Analyzing and revealing the community structure of a network is important both for understanding the network's structure and for analyzing its properties. Community discovery on a TCM complex network has much the same aim as traditional medicine cluster analysis based on the transaction-item model: both aggregate medicines that are frequently prescribed together into the same class and mine medicines with similar properties, in order to study drug compatibility rules.

Building a Chinese medicine complex network on a complex network model breaks with the convention that TCM data mining is always based on the transaction-item model. Applying the label propagation algorithm from complex network analysis makes it possible to mine medicine communities in depth, that is, groups of medicines with similar properties that are prescribed together relatively frequently within the community. This overcomes two defects of traditional clustering based on transaction items: its inability to find indirect compatibility and its neglect of uncommon medicines.

Recently, with the surge in TCM compound prescription data, non-parallel algorithms are no longer suitable for community discovery on larger-scale TCM data.

Summary of the invention

The technical problem to be solved by the present invention is to model traditional Chinese medicines as a complex network and to apply a parallel label propagation algorithm on this model, so that medicine communities can be found quickly and effectively.

To solve the above problem, the medicine community discovery method of the present invention, based on a parallelized label propagation algorithm for a complex network model, comprises the following steps:

1) Networking stage:

a) preprocess the data to generate the TCM data set, and format it as initial text data;

b) deploy the initial text data to the Hadoop platform;

c) build the Chinese medicine network, i.e. the TCM network, in parallel; in this network each medicine is a node, and two nodes are joined by an edge when their SC_{AB} exceeds a given threshold;

d) end.

2) Mining stage:

a) obtain the TCM network text file generated by step 1)-c;

b) deploy this TCM network text file to the Hadoop platform;

c) run the parallelized label propagation algorithm, i.e. a label propagation algorithm parallelized with the MapReduce framework, in which each node iteratively updates its own label (its community number) from the information of its neighbors, so as to discover medicine communities;

d) end.

The preprocessing in step 1)-a extracts the medicine composition of every compound prescription in the TCM compound prescription data.

The deployment in step 1)-b uploads the initial text data generated in step 1)-a to the distributed file system (HDFS) of the Hadoop platform.

Further, the detailed process of step 1)-c is as follows:

1) assign a unique ID to each TCM compound prescription, i.e. to each line of text data;

2) build an inverted index from each medicine to the IDs of the compounds that contain it;

3) assign each medicine a unique identifier id, which includes the frequency with which the medicine occurs in the compounds;

4) restore the inverted index, i.e. run the inverted-index algorithm a second time, so that the TCM compound text data is reconstructed and each compound line is read by one Map function of this subtask;

5) each Map function reads one line of text and parses out the medicine node information;

6) determine whether the medicines of the compound in this Map function can still form another pairwise joint key-value pair <Key, Value>; if so, go to 7), otherwise go to 8);

7) create the joint key-value pair <Key, Value>;

8) the <Key, Value> pairs are sent to Reduce through shuffle && sort; Reduce receives the [Value] array formed under each identical Key, computes the measure between each pair of medicines with the following formula, and writes the medicine pairs whose measure exceeds the set threshold to a file saved in HDFS:

SC_{AB} = \frac{|F_{A} \cap F_{B}|}{\min\{|F_{A}|, |F_{B}|\}}

where |F_{A} \cap F_{B}| is the number of prescriptions in which medicines A and B occur together, \min\{|F_{A}|, |F_{B}|\} is the occurrence count of whichever of A and B is prescribed less often, and SC_{AB} is therefore the ratio of the co-occurrence count of A and B to the smaller of their individual occurrence counts;

9) read the medicine-pair file generated in 6), i.e. the edge set of the medicine complex network, and format it as an adjacency list to store the topology of the TCM network;

10) end.
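
The networking steps above can be sketched on a single machine as follows. This is a minimal illustration rather than the Hadoop implementation: the function name `build_tcm_network`, the sample prescriptions and the threshold value are hypothetical, and plain Python dictionaries stand in for the Map and Reduce phases.

```python
from collections import defaultdict
from itertools import combinations


def build_tcm_network(prescriptions, threshold):
    """Return an adjacency list {medicine: sorted neighbours} for SC_AB > threshold."""
    # Inverted index: medicine -> set of prescription IDs (IDs start at 1).
    index = defaultdict(set)
    for pid, medicines in enumerate(prescriptions, start=1):
        for m in medicines:
            index[m].add(pid)

    # Count co-occurrences of each unordered pair, smaller name first
    # (this mirrors the "Key is less than Value" ordering of the joint pairs).
    co_occurrence = defaultdict(int)
    for medicines in prescriptions:
        for a, b in combinations(sorted(set(medicines)), 2):
            co_occurrence[(a, b)] += 1

    # Keep only the pairs whose SC_AB exceeds the threshold.
    adjacency = defaultdict(set)
    for (a, b), together in co_occurrence.items():
        sc = together / min(len(index[a]), len(index[b]))
        if sc > threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return {m: sorted(ns) for m, ns in adjacency.items()}


# Hypothetical sample data: each inner list is one compound prescription.
sample = [
    ["ginseng", "licorice", "ginger"],
    ["ginseng", "licorice"],
    ["ginger", "jujube"],
    ["licorice", "jujube"],
]
network = build_tcm_network(sample, threshold=0.6)
```

In this made-up data set, ginseng and licorice co-occur twice and ginseng appears twice in total, so their SC value is 1.0 and the edge is kept; every other pair scores 0.5 and is dropped at the assumed threshold of 0.6.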

Further, in step 2)-c each node updates its own label from the labels of its neighbor nodes (normally it takes the most frequent label; if several labels share the highest frequency, one of them is chosen at random). The parallelized label propagation algorithm as a whole is iterative, and iteration terminates when the node labels are essentially stable, for example when the labels of more than 90% of the nodes no longer change. The flow of a single iteration is given here; the iterative process of the parallelized label propagation algorithm is as follows:

1) set a unique initial label id for each medicine node;

2) each Map function reads one line of text from HDFS and stores it in the variable Value;

3) parse the data in Value: the temporary array element Tmp[0] holds the node id, and Tmp[1] holds the adjacency list AdjList and Label;

4) send the node data structure;

5) determine whether Label contains only one label, i.e. whether this is the first iteration; if so, go to 6), otherwise go to 7);

6) set the variable V = label 1;

7) set the variable V = label 1 && label 2, where label 1 is the label from iteration t-1 and label 2 is the label from iteration t-2;

8) set the variable i = 0;

9) determine whether i is less than AdjList.length; if so, go to step 10, otherwise go to step 12;

10) send <AdjList.get(i), V>;

11) increment i by 1 and go to 8);

12) the Map phase ends and Hadoop performs shuffle && sort;

13) Reduce parses the [Value] array, stores the node structure in the data structure AdjLabelPA, and uses the temporary linked lists ls_1 and ls_2 to store the l_1 and l_2 values passed over by each neighbor (when two labels are present; otherwise ls_2 is empty);

14) find the new node label according to the following formula:

C_{x}(t) = f\left(C_{x_{1}}(t-1), \ldots, C_{x_{k}}(t-1), w \cdot C_{x_{1}}(t-2), \ldots, w \cdot C_{x_{k}}(t-2)\right)

15) where C_{x_{k}}(t-1) denotes the label of node x_k at iteration t-1, and the function f returns the label that occurs most frequently among those passed over by the neighbor nodes;

16) update the t-1 and t labels in AdjLabel to C_x(t-1) and C_x(t), respectively;

17) save the result of this iteration to the distributed file system HDFS;

18) end.
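
One iteration of the label update can be sketched in memory as below. This is an illustration under stated assumptions, not the patented MapReduce job: the name `lpa_iteration` is hypothetical, the weight `w = 0.5` is an assumed value (the text does not give one), and treating `w * C(t-2)` as a down-weighted vote for the older label is one possible reading of the formula. Ties are broken at random, as the description states.

```python
import random
from collections import defaultdict


def lpa_iteration(adjacency, prev, prev2=None, w=0.5, rng=None):
    """Return the new labels C_x(t) for every node.

    prev maps node -> C_x(t-1); prev2 maps node -> C_x(t-2), or is None on
    the first iteration. w (weight of the older labels) is an assumed value.
    """
    rng = rng or random.Random(0)

    # "Map": every node sends its label(s) to each node in its adjacency list.
    inbox = defaultdict(list)
    for node, neighbours in adjacency.items():
        for nb in neighbours:
            inbox[nb].append((prev[node], 1.0))
            if prev2 is not None:
                inbox[nb].append((prev2[node], w))

    # "Reduce": each node adopts the label with the largest weighted frequency.
    new_labels = {}
    for node in adjacency:
        scores = defaultdict(float)
        for label, weight in inbox[node]:
            scores[label] += weight
        if not scores:  # an isolated node keeps its current label
            new_labels[node] = prev[node]
            continue
        best = max(scores.values())
        candidates = sorted(l for l, s in scores.items() if s == best)
        new_labels[node] = rng.choice(candidates)  # random tie-break
    return new_labels


# Hypothetical four-node network: a triangle a-b-c with d attached to c.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
labels = lpa_iteration(adj, {"a": 1, "b": 1, "c": 1, "d": 2})
```

All labels are computed from the previous iteration's values before any node is updated, which matches the way a MapReduce round reads its input from HDFS and writes a new output.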

The medicine community discovery method of the present invention, based on a parallelized label propagation algorithm for a complex network model, builds a complex network model of Chinese medicines, uses parallelization to improve the scalability and running speed of both network construction and the label propagation algorithm, and can effectively mine communities of medicines with similar properties in compound prescriptions, which helps research on drug compatibility rules.

Description of drawings

Fig. 1 is the operational flowchart of medicine community discovery.

Fig. 2 is the flowchart of the medicine community discovery method of the present invention, based on a parallelized label propagation algorithm for a complex network model.

Fig. 3 is the flowchart for generating the Chinese medicine (TCM) network.

Fig. 4 is the flowchart for mining medicine communities on the Chinese medicine (TCM) network with the label propagation algorithm (a single iteration).

Embodiment

So that the technical content of the present invention may be better understood, specific embodiments are described below with reference to the accompanying drawings.

As shown in Fig. 1, core medicine mining first obtains the TCM compound prescription data through prescription database queries, extraction from irregular text data and similar means. Preprocessing such as data normalization and formatting then produces text data. The Chinese medicine complex network is next built in parallel on the Hadoop platform, and finally the parallelized label propagation algorithm is run on this network to mine medicine communities.

Building a network from the TCM compound prescription data and mining medicine communities with the parallelized label propagation algorithm are the key steps of this invention. The idea of the invention is to mine medicine communities effectively through complex network modeling and a parallelized label propagation algorithm while improving the scalability and running speed of the algorithm.

The flowchart of the medicine community discovery method of the present invention, based on a parallelized label propagation algorithm for a complex network model, is shown in Fig. 2.

Step 0 is the initial state of the medicine community discovery method of the present invention;

In the networking stage (steps 1-3), step 1 obtains the initial TCM compound prescription data for networking from a database or from other irregular text data, and formats it as text data so that it can be uploaded to the distributed file system (HDFS) of the Hadoop platform;

Step 2 builds the Chinese medicine (TCM) network in parallel on the initial data set, which involves two inverted-index passes and the creation of pairwise joint key-value pairs of medicines;

Step 3 saves the generated TCM network to HDFS on the Hadoop platform.

In the mining stage (steps 4-5), step 4 runs the parallelized label propagation algorithm on the TCM network generated in step 3;

Step 5 saves the mining result to HDFS;

Step 6 is the end step of the medicine community discovery method of the present invention, based on a parallelized label propagation algorithm for a complex network model.

Fig. 3 describes step 2 of Fig. 2 in detail.

Step 20 is the initial step;

Step 21 sets a unique ID value for each TCM compound prescription, numbering from 1;

Step 22 builds the inverted index from medicines to compound IDs;

Step 23 sets an id for each medicine, starting from the label 1%N, where N is the frequency with which the medicine occurs in the compounds, i.e. the length of its inverted index entry;

Step 24 restores the inverted index, i.e. runs the inverted-index algorithm a second time, with each compound line read by one Map function of this subtask;

Step 25 determines whether the medicines of the compound in this Map function can still form another pairwise joint key-value pair; if so, go to step 26, otherwise go to step 27 (note that at that point the Map phase of this subtask has finished);

Step 26 creates the joint key-value pair <Key, Value> (where Key is less than Value);

Step 27 computes the value of SC_{AB} in the Reduce function using formula (1):

SC_{AB} = \frac{|F_{A} \cap F_{B}|}{\min\{|F_{A}|, |F_{B}|\}} \qquad (1)

Step 28 saves the result to HDFS;

Step 29 is the end of Fig. 3.
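
As an illustrative calculation with formula (1), using made-up counts that do not come from the source: suppose medicine A occurs in 10 prescriptions, medicine B occurs in 4, and the two occur together in 3.

```latex
SC_{AB} = \frac{|F_{A} \cap F_{B}|}{\min\{|F_{A}|, |F_{B}|\}}
        = \frac{3}{\min\{10, 4\}}
        = \frac{3}{4}
        = 0.75
```

Dividing by the smaller occurrence count means that the uncommon medicine B, three quarters of whose prescriptions also contain A, still obtains a high score. With a hypothetical threshold of 0.5, the edge between A and B would be kept in the network.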

The parallelized label propagation algorithm as a whole is iterative, and iteration terminates when the node labels are essentially stable. Fig. 4 describes in detail one iteration of the label propagation algorithm in step 4 of Fig. 2, as follows:

Step 40 is the initial step;

Step 41 sets a unique label for each medicine node;

Step 42: each Map function reads one line of text from the TCM network file in HDFS and stores it in the variable Value;

Step 43 parses Value: the temporary array element Tmp[0] holds the node id, and Tmp[1] holds the adjacency list AdjList and Label;

Step 44 sends the node data structure;

Step 45 determines whether Label contains only one label (the first iteration); if so, go to step 46, otherwise go to step 47;

Step 46 sets the variable V = label 1;

Step 47 sets the variable V = label 1 && label 2, where label 1 is the label from iteration t-1 and label 2 is the label from iteration t-2;

Step 48 sets the variable i = 0;

Step 49 determines whether i is less than AdjList.length; if so, go to step 50, otherwise go to step 52;

Step 50 sends <AdjList.get(i), V>;

Step 51 increments i by 1;

Step 52 is the Shuffle and Sort process of the Hadoop platform;

Step 53: Reduce receives <Key, [Value]>;

Step 54: Reduce parses the [Value] array, stores the node structure in the data structure AdjLabelPA, and uses the temporary linked list ls_1 to store the iteration t-1 labels passed over by each neighbor node and ls_2 to store the iteration t-2 labels passed over by each neighbor node;

Step 55 uses the function f, considering ls_1 and ls_2 together, to return the most frequent label L;

Step 56 updates the labels of iterations t-1 and t-2 in AdjLabelPA;

Step 57 saves the result to HDFS;

Step 58 is the end step of Fig. 4.

Note: the label propagation algorithm runs for multiple iterations, and iteration ends when the labels of more than 90% of the nodes in the network remain unchanged.
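
The stopping rule in the note above can be sketched as a small driver loop. The names `stable_fraction` and `run_until_stable` and the `max_iterations` safety cap are hypothetical additions for illustration; only the 90% criterion comes from the text. The `step` argument is any function that maps the current labels to the next iteration's labels.

```python
def stable_fraction(old_labels, new_labels):
    """Fraction of nodes whose label did not change between two iterations."""
    if not new_labels:
        return 1.0
    unchanged = sum(1 for n in new_labels if old_labels.get(n) == new_labels[n])
    return unchanged / len(new_labels)


def run_until_stable(step, labels, min_stable=0.9, max_iterations=100):
    """Apply step(labels) until more than min_stable of the labels persist.

    Returns the final labels and the number of iterations performed.
    max_iterations is an assumed cap so the loop always terminates.
    """
    for iteration in range(1, max_iterations + 1):
        new_labels = step(labels)
        if stable_fraction(labels, new_labels) > min_stable:
            return new_labels, iteration
        labels = new_labels
    return labels, max_iterations


# Toy step function (hypothetical): every node adopts the smallest label in use.
def collapse(labels):
    smallest = min(labels.values())
    return {node: smallest for node in labels}


final, rounds = run_until_stable(collapse, {"a": 1, "b": 2, "c": 3})
```

In the toy run, the first round changes two of three labels, so iteration continues; the second round changes none, so the loop stops after two rounds.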

In summary, the present invention uses parallelization to improve the scalability and running speed of network construction and of the label propagation algorithm, so that the community discovery algorithm can run quickly and efficiently on large volumes of compound prescription data, and it can effectively mine communities of medicines with similar properties in compound prescriptions, which helps research on drug compatibility rules.

Those of ordinary skill in the art to which the present invention belongs may make various changes and modifications without departing from the spirit and scope of the invention. The scope of protection of the present invention is therefore defined by the claims.

Claims

1. A medicine community discovery method based on a parallelized label propagation algorithm for a complex network model, characterized in that it comprises the following steps:

1) Networking stage:

b) deploy the initial text data to the Hadoop platform;

c) build the Chinese medicine network in parallel; in this network each medicine is a node, and two nodes are joined by an edge when their SC_{AB} exceeds a given threshold;

d) end.

2) Mining stage:

b) deploy the above Chinese medicine network text file to the Hadoop platform;

c) run the parallelized label propagation algorithm, i.e. a label propagation algorithm parallelized with the MapReduce framework, in which each node iteratively updates its own label from the information of its neighbors, so as to discover medicine communities;

d) end.

2. The medicine community discovery method based on a parallelized label propagation algorithm for a complex network model according to claim 1, characterized in that the preprocessing in step 1)-a extracts the medicine composition of every compound prescription in the TCM compound prescription data.

3. The medicine community discovery method based on a parallelized label propagation algorithm for a complex network model according to claim 1, characterized in that the deployment in step 1)-b uploads the initial text data generated in step 1)-a to the distributed file system of the Hadoop platform.

4. The medicine community discovery method based on a parallelized label propagation algorithm for a complex network model according to claim 1, characterized in that the detailed process of step 1)-c is as follows:

7) create the joint key-value pair <Key, Value>;

8) the <Key, Value> pairs are sent to Reduce through shuffle && sort; Reduce receives the [Value] array formed under each identical Key, computes the measure between each pair of medicines with the following formula, and writes the medicine pairs whose measure exceeds the set threshold to a file saved in the distributed file system;

10) end.

5. The medicine community discovery method based on a parallelized label propagation algorithm for a complex network model according to claim 1, characterized in that the parallelized label propagation algorithm in step 2)-c as a whole is iterative, that iteration terminates when the node labels are essentially stable, and that the iterative process of the parallelized label propagation algorithm is as follows:

1) set a unique initial label id for each medicine node;

4) send the node data structure;

6) set the variable V = label 1;

8) set the variable i = 0;

9) determine whether i is less than AdjList.length; if so, go to step 10, otherwise go to step 12;

10) send <AdjList.get(i), V>;

11) increment i by 1 and go to 8);

12) the Map phase ends and Hadoop performs shuffle && sort;

14) find the new node label according to the following formula:

C_{x}(t) = f\left(C_{x_{1}}(t-1), \ldots, C_{x_{k}}(t-1), w \cdot C_{x_{1}}(t-2), \ldots, w \cdot C_{x_{k}}(t-2)\right)

15) where C_{x_{k}}(t-1) denotes the label of node x_k at iteration t-1, and the function f returns the label that occurs most frequently among those passed over by the neighbor nodes;

17) save the result of this iteration to the distributed file system;

18) end.