CN111180045B

CN111180045B - Method for mining relation between drug pairs and efficacy from prescription information

Info

Publication number: CN111180045B
Application number: CN201911165949.6A
Authority: CN
Inventors: 张引; 白宇
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2019-11-25
Filing date: 2019-11-25
Publication date: 2023-05-12
Anticipated expiration: 2039-11-25
Also published as: CN111180045A

Abstract

The invention discloses a method for mining the relation between drug pairs and efficacy from prescription information. The method comprises the following steps: 1) Authoritative prescription data are collected, comprising the efficacy and indication information of each prescription and its traditional Chinese medicine composition. 2) The data are cleaned and structured to facilitate subsequent model training and information mining. 3) A data mining model is constructed and fitted to the samples, learning parameters that are highly interpretable. 4) The interpretable parameters learned by the model are extracted and post-filtered to remove noise and retain drug pair information, thereby mining the relation between drug pairs and efficacy. The invention filters with a heuristic strategy that measures the degree of association between an efficacy and a drug pair by the efficacy prediction accuracy, which removes most invalid relations.

Description

Method for mining relation between drug pairs and efficacy from prescription information

Technical Field

The invention relates to feedforward networks and interpretability theory in neural networks, and to unsupervised training, and in particular to a method for mining the relation between drug pairs and efficacy from prescription information.

Background

A prescription is a formula of medicines prepared for treating disease. In ancient China, diseases were long treated with single drugs. Through long-term medical practice, several drugs came to be combined and decocted, and these decoctions were the earliest prescriptions. A prescription therefore embodies the relation between traditional Chinese medicines and their efficacy.

A drug pair is a combination of two different drugs, and different drug pairs have different application value. Mining the relation between drug pairs and efficacy from prescriptions can give traditional Chinese medicine experts statistical support for analyzing the roles played by different drug pairs.

Data mining, a focus of research in artificial intelligence and databases, is the non-trivial process of revealing implicit, previously unknown and potentially valuable information from large amounts of data in a database. It is a decision support process built mainly on artificial intelligence, machine learning, pattern recognition, statistics, databases and visualization: enterprise data are analyzed in a highly automated way, inductive inferences are drawn, and latent patterns are mined from the data, helping decision makers adjust market strategies, reduce risk and make correct decisions. The knowledge discovery process consists of three phases: (1) data preparation; (2) data mining; (3) expression and interpretation of results. Data mining may interact with users or knowledge bases.

Neural networks perform very well on supervised tasks, but their lack of interpretability limits their application in unsupervised settings. The invention uses a feedforward network, one of the simplest neural network structures, to preserve interpretability, and obtains information about a task B indirectly by training on a task A, thereby realizing unsupervised data mining.

The goal is to establish the correspondence between drug pairs and efficacy from authoritative prescription data. The technical difficulties include: 1. the lack of annotated supervision data; 2. how to prevent overfitting; 3. how to reject erroneous predictions of the trained model; 4. how to design the model.

Disclosure of Invention

The invention aims to use authoritative prescription data to provide an interpretable neural network, a strategy for preventing overfitting, and a method for filtering invalid relations between drug pairs and efficacy, and finally to obtain the correspondence between drug pairs and efficacy.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

a method for mining the relation between drug pairs and efficacy from prescription information comprises the following steps:

1) Collecting authoritative prescription data, and extracting prescription information through OCR and rule-based methods;

2) Extracting the text-type composition information from the prescription information to generate structured prescription composition data for each prescription sample;

3) Extracting keywords from the prescription efficacy description texts of all prescriptions and establishing a prescription efficacy vocabulary; determining the prescription efficacy vocabulary vector of each prescription sample from its efficacy description text and the prescription efficacy vocabulary, the vectors of all prescription samples forming a prescription efficacy vocabulary matrix;

4) Pairing the drug components in the structured prescription composition data of all prescription samples two by two; to prevent overfitting, counting the frequency of each drug pair over all prescription samples and selecting the high-frequency drug pairs to form a high-frequency drug pair table, which reduces the size of the input vocabulary; obtaining the high-frequency drug pair vector of each prescription sample from its structured prescription composition data and the high-frequency drug pair table, the vectors of all prescription samples forming a high-frequency drug pair matrix;

5) Designing the model structure: to keep the model interpretable, a single-layer feedforward network is chosen, whose parameters represent the degree of association between each drug pair in the vocabulary and each efficacy; training the single-layer feedforward network on the prescription efficacy vocabulary matrix obtained in step 3) and the high-frequency drug pair matrix obtained in step 4) to obtain a trained network model;

6) Extracting the weight parameters from the trained model and filtering them with a heuristic strategy, namely: over all prescriptions containing drug pair A, the accuracy with which efficacy B is predicted is calculated, and the higher the accuracy, the more likely it is that drug pair A has efficacy B. Specifically: inputting the high-frequency drug pair matrix obtained in step 4) into the trained network model to obtain the final predicted prescription efficacy vocabulary matrix; calculating, from the final predicted prescription efficacy vocabulary matrix and the actual prescription efficacy vocabulary matrix obtained in step 3), the prediction accuracy of each drug pair on each efficacy word, and ranking, so as to obtain the relation between drug pairs and efficacy.

Preferably, in step 2), a TCM prescription composition extraction tool is used to extract the structured composition of each prescription from the prescription information extracted in step 1).

Preferably, the step 3) specifically includes:

3.1 Extracting keywords from the prescription efficacy description texts of all prescription samples, wherein the keywords comprise all four-word, three-word and two-word terms, and each four-word term is split into two two-word terms; counting the occurrences of each distinct term and filtering out low-frequency terms to obtain a prescription efficacy vocabulary containing p efficacy words;

3.2 Determining the prescription efficacy vocabulary vector of each prescription sample from its prescription efficacy description text and the prescription efficacy vocabulary, the vectors of all prescription samples forming the prescription efficacy vocabulary matrix $Y = \{y_{ij}\}$, $i = 0, 1, \dots, n$; $j = 0, 1, \dots, p$, where n represents the number of prescription samples. The value of $y_{ij}$ is determined as follows: if the prescription efficacy description text of the i-th prescription sample contains the j-th efficacy word of the prescription efficacy vocabulary, $y_{ij} = 1$; otherwise $y_{ij} = 0$.
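The encoding rule in 3.2 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `encode_efficacy`, the English toy vocabulary and the toy descriptions are all assumptions made for the example, and "contains" is read as plain substring containment.

```python
def encode_efficacy(texts, vocabulary):
    """Return the multi-hot matrix Y: Y[i][j] = 1 if texts[i] contains vocabulary[j]."""
    return [[1 if word in text else 0 for word in vocabulary] for text in texts]


# Hypothetical toy data, not taken from the patent's corpus.
vocabulary = ["clear heat", "detoxify", "relieve cough"]
texts = [
    "clear heat, detoxify",
    "relieve cough, clear heat",
    "warm the middle",  # contains no vocabulary word, so its row is all zeros
]

Y = encode_efficacy(texts, vocabulary)
for row in Y:
    print(row)
```

Each row of `Y` is the efficacy vector of one prescription sample; a description that matches no vocabulary word simply yields an all-zero row.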

Preferably, the step 4) specifically includes:

4.1 A preprocessing step to prevent overfitting, which keeps the model from fitting relations between low-frequency drug pairs and efficacy and thus avoids erroneous results, specifically: pairing the drug components in the structured prescription composition data of all prescription samples two by two, counting the frequency of each drug pair over all prescription samples, and selecting the q most frequent drug pairs to form the high-frequency drug pair table; low-frequency drug pairs are discarded.

4.2 Obtaining the high-frequency drug pair vector of each prescription sample from its structured prescription composition data and the high-frequency drug pair table, the vectors of all prescription samples forming the high-frequency drug pair matrix $X = \{x_{ij}\}$, $i = 0, 1, \dots, n$; $j = 0, 1, \dots, q$, where n represents the number of prescription samples. The value of $x_{ij}$ is determined as follows: if the structured prescription composition data of the i-th prescription contains the j-th drug pair of the high-frequency drug pair table, $x_{ij} = 1$; otherwise $x_{ij} = 0$.
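As a hedged sketch of step 4.2, the snippet below encodes prescriptions against an already selected high-frequency drug pair table. The function name `encode_pairs` and the toy table are assumptions for illustration; a prescription is taken to "contain" a pair when both drugs appear in its structured composition.

```python
def encode_pairs(prescriptions, pair_table):
    """Return X with X[i][j] = 1 if prescription i contains both drugs of pair_table[j]."""
    rows = []
    for herbs in prescriptions:
        herb_set = set(herbs)
        rows.append([1 if a in herb_set and b in herb_set else 0 for a, b in pair_table])
    return rows


# Hypothetical high-frequency drug pair table (q = 2) and two toy prescriptions.
pair_table = [("cassia twig", "aconite root"), ("cassia twig", "licorice root")]
prescriptions = [
    ["dried rehmannia root", "cassia twig", "aconite root"],
    ["cassia twig", "aconite root", "licorice root"],
]

X = encode_pairs(prescriptions, pair_table)
for row in X:
    print(row)
```

Pairs that are absent from the table, such as (dried rehmannia root, cassia twig) here, leave no trace in `X`, which is how the frequency filter of 4.1 shrinks the input vocabulary.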

Preferably, the step 5) specifically comprises the following steps:

establishing a single-layer feedforward network, which preserves the interpretability of the neural network, and training it on the prescription efficacy vocabulary matrix obtained in step 3) and the high-frequency drug pair matrix obtained in step 4), wherein the training formula is:

$Y = W \cdot X + b$

wherein X represents the high-frequency drug pair matrix, Y represents the prescription efficacy vocabulary matrix, the training parameter W represents the degree of relation between drug pairs and efficacy, and b is the offset; after the first training pass, the parameters $W^{l}$ and $b^{l}$ are obtained;

Inputting the high-frequency drug pair matrix X into the network model after the first training pass gives the predicted prescription efficacy vocabulary matrix $Y^{l}$. The loss function used during training is:

$\mathrm{loss} = -Y * \log Y^{l}$

$Y^{l} = W^{l} \cdot X + b^{l}$

wherein the parameters $W'$ and $b'$ are obtained after training is finished.
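The training step can be illustrated with a small dependency-free sketch. Several choices here are assumptions rather than details given in the text: a sigmoid is applied to W·x + b (consistent with predictions lying between 0 and 1), the loss is the full binary cross-entropy rather than the abbreviated form written above, optimization is plain batch gradient descent, and the names `train`, `predict`, `epochs` and `lr` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(X, W, b):
    """For each sample x, compute sigmoid(W x + b); W has shape p x q."""
    return [[sigmoid(sum(W[j][k] * x[k] for k in range(len(x))) + b[j])
             for j in range(len(b))] for x in X]


def train(X, Y, epochs=2000, lr=0.5):
    """Fit a single-layer network by batch gradient descent (assumed optimizer)."""
    n, q, p = len(X), len(X[0]), len(Y[0])
    W = [[0.0] * q for _ in range(p)]
    b = [0.0] * p
    for _ in range(epochs):
        Y_hat = predict(X, W, b)
        for j in range(p):
            # Gradient of the mean binary cross-entropy w.r.t. the pre-activation.
            err = [Y_hat[i][j] - Y[i][j] for i in range(n)]
            for k in range(q):
                W[j][k] -= lr * sum(err[i] * X[i][k] for i in range(n)) / n
            b[j] -= lr * sum(err) / n
    return W, b


# Toy data: efficacy 0 co-occurs with drug pair 0, efficacy 1 with drug pair 1.
X = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y = [[1, 0], [0, 1], [1, 1], [0, 0]]

W, b = train(X, Y)
print([[round(v, 2) for v in row] for row in W])
```

On this toy data the learned weight linking each efficacy to its own drug pair ends up larger than the weight linking it to the other pair, which is the sense in which W can be read as a drug pair and efficacy association table.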

Preferably, step 6) adopts a heuristic strategy to filter out invalid relations between drug pairs and efficacy, specifically:

6.1 Extracting $W'$ and $b'$ from the network model trained in step 5), and inputting the high-frequency drug pair matrix X into the trained network model to obtain the final predicted prescription efficacy vocabulary matrix $Y' = \{y'_{ij}\}$, $i = 0, 1, \dots, n$; $j = 0, 1, \dots, p$, by the formula:

$Y' = W' \cdot X + b'$

Each entry of the final predicted prescription efficacy vocabulary matrix $Y'$ is a value between 0 and 1. By setting a threshold T, values in $Y'$ smaller than T are recorded as 0 and values larger than T are recorded as 1, which converts $Y'$ into a matrix $Y'' = \{y''_{ij}\}$ consisting of 0s and 1s, $i = 0, 1, \dots, n$; $j = 0, 1, \dots, p$.
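The thresholding in 6.1 amounts to an element-wise comparison, sketched below. The value T = 0.5 is a hypothetical choice, since no value of T is given, and mapping a prediction exactly equal to T to 0 is likewise an assumption because the text leaves that case open.

```python
def binarize(Y_pred, T=0.5):
    """Map each predicted value to 1 if it exceeds T, otherwise to 0."""
    return [[1 if value > T else 0 for value in row] for row in Y_pred]


# Hypothetical predictions for two prescriptions and three efficacy words.
Y_pred = [[0.91, 0.12, 0.50], [0.30, 0.77, 0.64]]
Y_bin = binarize(Y_pred, T=0.5)
print(Y_bin)
```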

6.2 Traversing each drug pair, collecting all prescription samples that contain the drug pair, and computing a score for each efficacy by the formula:

$\mathrm{score} = \mathrm{correct} / (\mathrm{correct} + \mathrm{error})$

correct is the number of correctly predicted prescriptions, counted as follows: if $y_{ij} = 1$ and $y''_{ij} = 1$, the prescription is counted toward correct; if $y_{ij} \neq y''_{ij}$ or $y_{ij} = y''_{ij} = 0$, it is not counted. error is the number of incorrectly predicted prescriptions, counted as follows: if $y_{ij} \neq y''_{ij}$, the prescription is counted toward error; otherwise it is not counted;

6.3 After the score between each drug pair and each efficacy has been calculated, the efficacies of each drug pair are sorted by score to obtain a ranked list of its most relevant efficacies; the top-ranked efficacies are taken as the efficacies related to the drug pair, giving the relation between drug pairs and efficacy.
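Steps 6.2 and 6.3 can be sketched together as follows. The function names, the toy matrices and the `top_k` parameter are assumptions for illustration, and returning a score of 0.0 when correct + error is zero is also an assumption, since the text does not define that case.

```python
def pair_scores(X, Y, Y_bin, pair_index):
    """Score every efficacy for one drug pair: correct / (correct + error)."""
    scores = []
    for j in range(len(Y[0])):
        correct = error = 0
        for i in range(len(X)):
            if X[i][pair_index] != 1:
                continue  # only prescriptions containing this drug pair are used
            if Y[i][j] == 1 and Y_bin[i][j] == 1:
                correct += 1
            elif Y[i][j] != Y_bin[i][j]:
                error += 1
            # label 0 with prediction 0 is deliberately not counted
        total = correct + error
        scores.append(correct / total if total else 0.0)
    return scores


def rank_efficacies(scores, vocabulary, top_k=5):
    """Return the top_k (efficacy, score) tuples in descending score order."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return [(vocabulary[j], scores[j]) for j in order[:top_k]]


# Hypothetical toy data: one drug pair, four prescriptions, three efficacy words.
vocabulary = ["relieve cough", "resolve phlegm", "detoxify"]
X = [[1], [1], [1], [0]]
Y = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [0, 1, 1]]
Y_bin = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 1]]

scores = pair_scores(X, Y, Y_bin, pair_index=0)
ranking = rank_efficacies(scores, vocabulary, top_k=2)
print(ranking)
```

The fourth prescription does not contain the drug pair, so its wrong prediction for "relieve cough" has no effect on the pair's scores.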

Compared with the prior art, the invention has the beneficial effects that:

1) A strategy to prevent overfitting is proposed. Overfitting is prevented mainly by controlling the size of the input and output vocabularies: the input vocabulary is the drug pair vocabulary and the output vocabulary is the efficacy vocabulary, so limiting them reduces the number of model parameters. The strategy helps avoid learning parameters that generalize poorly during unsupervised learning.

2) An interpretable neural network is designed.

$y' = Wx + b$

Each entry of the parameter W represents the association between one efficacy and one drug pair. The single-layer design also reduces the model's tendency to overfit.
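Written out per component, the formula makes the interpretability claim concrete. The following is a sketch in which the summation index k over drug pairs and the 1-based indexing are notational assumptions; q and p denote the numbers of drug pairs and efficacy words.

```latex
y'_j = \sum_{k=1}^{q} W_{jk}\, x_k + b_j, \qquad j = 1, \dots, p
\qquad\Longrightarrow\qquad
\frac{\partial y'_j}{\partial x_k} = W_{jk}
```

Because there is no hidden layer, the contribution of drug pair k to the output for efficacy j is the single weight $W_{jk}$, so the weight matrix can be read directly as a table of drug pair and efficacy associations.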

3) A method for filtering invalid relations between drug pairs and efficacy is provided. After the model is trained, the useful drug pair and efficacy relations must be picked out of the learned parameters. The proposed strategy measures the degree of association between an efficacy and a drug pair by the efficacy prediction accuracy, which removes most invalid relations.

Detailed Description

The present invention will be described in further detail with reference to specific examples.

The invention discloses a method for mining the relation between efficacy and drug pairs from prescription information, which comprises the following steps:

Step one: the source books are mainly prescription dictionaries, which contain authoritative prescriptions in a uniform format; the data volume is on the order of ten thousand prescriptions, which meets the processing requirement. The prescription dictionaries and similar sources are processed with OCR and converted into text. Rules for extracting structured prescription information are written with Python regular expressions, and the extracted prescription information is stored, for example in a MySQL database, with table fields including: prescription name, prescription composition, prescription efficacy and indications, prescription usage, prescription contraindications, etc.;
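The rule-based extraction in step one might look like the following sketch. The record layout is entirely hypothetical: the field markers `Name:`, `Composition:` and `Indications:` are invented for the example, since the actual markers used in the prescription dictionaries are not given, and storage in a database is left out.

```python
import re

# Hypothetical record layout; the real dictionaries' field markers are not specified.
RECORD_PATTERN = re.compile(
    r"Name:\s*(?P<name>.+?)\s*"
    r"Composition:\s*(?P<composition>.+?)\s*"
    r"Indications:\s*(?P<indications>.+?)\s*(?=Name:|\Z)",
    re.DOTALL,
)


def extract_records(ocr_text):
    """Return one dict per prescription record found in the OCR text."""
    return [match.groupdict() for match in RECORD_PATTERN.finditer(ocr_text)]


# Toy OCR output with two records.
ocr_text = (
    "Name: Formula A\nComposition: cassia twig, aconite root\n"
    "Indications: warms the kidney\n"
    "Name: Formula B\nComposition: ephedra, cassia twig\n"
    "Indications: relieves cough\n"
)

records = extract_records(ocr_text)
for record in records:
    print(record)
```

Each dict maps naturally onto one row of a table with the fields listed above; records that lack an indications field do not match the pattern and are skipped, which is one simple way to drop incomplete entries.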

Step two: the prescription information is given a preliminary cleaning. Some prescriptions have incomplete fields; prescriptions without efficacy and indication information are removed, leaving 15241 prescription samples in total. The text-type prescription composition description is converted into a structured composition using a TCM prescription composition extraction tool.

For example:

Text-type composition description: dried rehmannia root, 8 liang; yam and dogwood, 4 liang each; alisma, poria and tree peony bark, 3 liang each; cassia twig and processed aconite root, 1 liang each

Structured composition: dried rehmannia root, yam, dogwood, alisma, poria, tree peony bark, cassia twig, aconite root

Step three, constructing an efficacy vocabulary, and establishing a prescription efficacy vocabulary matrix:

Prescription efficacy descriptions have regular features. Frequent 4-word phrases can be cut into 2-word terms; to keep a huge vocabulary from making the model output too sparse, each 4-word phrase is split into two 2-word terms. For example, "antipruritic and insecticidal" and "heat-clearing and detoxifying" become "antipruritic", "insecticidal", "heat-clearing" and "detoxifying". The remaining efficacy terms are mostly 3-word and 2-word terms, for example: regulating the spleen and stomach, regulating the large intestine, dispelling pathogenic wind, and promoting bone union. The prescription efficacy vocabulary determines the model output, and to prevent overfitting it should be as small as possible: the frequencies of all terms are counted and the 1621 most frequent terms are kept as the prescription efficacy vocabulary, which keeps the vocabulary complete, low in noise, and cheap to build.
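The vocabulary construction can be sketched on Chinese text, where "word" corresponds to one character. Everything specific below is an assumption made for illustration: the sample descriptions, the use of commas as phrase separators, and the names `split_terms` and `build_vocabulary`; the text does not say how phrases are delimited.

```python
from collections import Counter


def split_terms(description):
    """Split a comma-separated efficacy description into 2- and 3-character terms."""
    terms = []
    for phrase in description.replace("，", ",").split(","):
        phrase = phrase.strip()
        if len(phrase) == 4:
            # A 4-character phrase becomes two 2-character terms.
            terms += [phrase[:2], phrase[2:]]
        elif len(phrase) in (2, 3):
            terms.append(phrase)
    return terms


def build_vocabulary(descriptions, size):
    """Keep the `size` most frequent terms across all descriptions."""
    counts = Counter(term for d in descriptions for term in split_terms(d))
    return [term for term, _ in counts.most_common(size)]


# Hypothetical descriptions: 清热解毒 (clear heat, detoxify), 止咳 (relieve cough),
# 止痒杀虫 (relieve itching, kill parasites).
descriptions = ["清热解毒", "清热，止咳", "止痒杀虫，清热"]
vocabulary = build_vocabulary(descriptions, size=2)
print(vocabulary)
```

In the embodiment the `size` argument would be 1621; the toy value of 2 only keeps the example readable.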

The prescription efficacy vocabulary vector of each prescription sample is determined from its prescription efficacy description text and the prescription efficacy vocabulary, and the vectors of all prescription samples form the prescription efficacy vocabulary matrix $Y = \{y_{ij}\}$ of dimension 15241 × 1621, where $i = 0, 1, \dots, 15241$; $j = 0, 1, \dots, 1621$. The value of $y_{ij}$ is determined as follows: if the prescription efficacy description text of the i-th prescription sample contains the j-th efficacy word of the prescription efficacy vocabulary, $y_{ij} = 1$; otherwise $y_{ij} = 0$.

Step four: construct the drug pair vocabulary and filter out low-frequency drug pairs according to frequency statistics to obtain the high-frequency drug pair table. For example:

Prescription 1: dried rehmannia root, cassia twig, aconite root

Prescription 2: cassia twig, aconite root, licorice root

The drug pairs that occur are: (dried rehmannia root, cassia twig), (dried rehmannia root, aconite root), (cassia twig, aconite root), (cassia twig, licorice root), (aconite root, licorice root). Among them, (cassia twig, aconite root) occurs in both prescriptions, so its frequency is 2.

The 1333 most frequent drug pairs are retained as the final high-frequency drug pair vocabulary, which keeps the model from fitting relations between low-frequency drug pairs and efficacy and thus avoids erroneous results.
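The counting in this example can be reproduced with the standard library. This is a minimal sketch: the function name `count_pairs` and the toy cutoff q = 1 are assumptions, and pairs are stored as alphabetically sorted tuples so that the order of drugs within a prescription does not matter.

```python
from collections import Counter
from itertools import combinations


def count_pairs(prescriptions):
    """Count how many prescriptions each unordered drug pair occurs in."""
    counts = Counter()
    for herbs in prescriptions:
        # Sorting makes (a, b) and (b, a) the same key; set() ignores repeats.
        counts.update(combinations(sorted(set(herbs)), 2))
    return counts


prescriptions = [
    ["dried rehmannia root", "cassia twig", "aconite root"],
    ["cassia twig", "aconite root", "licorice root"],
]

counts = count_pairs(prescriptions)
q = 1  # toy cutoff; the embodiment keeps the 1333 most frequent pairs
high_freq = [pair for pair, _ in counts.most_common(q)]
print(high_freq)
```

Five distinct pairs are found, and only (aconite root, cassia twig) reaches a frequency of 2, so it is the one pair kept when q = 1.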

From the structured prescription composition data of each prescription sample and the high-frequency drug pair table, the high-frequency drug pair vector of each prescription sample is obtained, and the vectors of all prescription samples form the high-frequency drug pair matrix $X = \{x_{ij}\}$ of dimension 15241 × 1333, where $i = 0, 1, \dots, 15241$; $j = 0, 1, \dots, 1333$. The value of $x_{ij}$ is determined as follows: if the structured prescription composition data of the i-th prescription contains the j-th drug pair of the high-frequency drug pair table, $x_{ij} = 1$; otherwise $x_{ij} = 0$.

Step five: to keep the model interpretable, a single-layer feedforward neural network is built with TensorFlow and trained on the prescription efficacy vocabulary matrix and the high-frequency drug pair matrix. The training formula is:

$Y = W \cdot X + b$

wherein X represents the high-frequency drug pair matrix, of dimension 1333; Y represents the prescription efficacy vocabulary matrix, of dimension 1621; the training parameter W represents the degree of relation between drug pairs and efficacy; and b is the offset, whose role is to keep differences in the frequency of the efficacy labels from biasing the weights. After the first training pass, the parameters $W^{l}$ and $b^{l}$ are obtained.

The high-frequency drug pair matrix X is input into the network model after the first training pass to obtain the predicted prescription efficacy vocabulary matrix $Y^{l}$. The prediction formula is:

$Y^{l} = W^{l} \cdot X + b^{l}$

The neural network is trained with the cross-entropy loss:

$\mathrm{loss} = -Y * \log Y^{l}$

wherein the parameters $W'$ and $b'$ are obtained after training is finished.

Step six: the weight parameters are extracted from the trained model and filtered with a heuristic strategy. The heuristic is: over all prescriptions containing drug pair A, the accuracy with which efficacy B is predicted is calculated; the higher the accuracy, the more likely it is that drug pair A has efficacy B. The heuristic processing uses numpy as the data processing tool.

The prescriptions are labeled using the trained model: $W'$ and $b'$ are extracted from the trained network model, and the high-frequency drug pair matrix X, of dimension 15241 × 1333 (15241 prescription samples, 1333 drug pairs), is input into the trained network model to obtain the final predicted prescription efficacy vocabulary matrix $Y' = \{y'_{ij}\}$, $i = 0, 1, \dots, 15241$; $j = 0, 1, \dots, 1621$, where 1621 is the number of efficacy words.

The calculation formula is:

$Y' = W' \cdot X + b'$

For example:

Drug pairs: [(ephedra, cassia twig), (ephedra, aconite root), (cassia twig, aconite root)]

Prediction result: (relieving cough, resolving phlegm, clearing heat)

True label: (relieving cough, resolving phlegm, detoxification)

The accuracy of each drug pair in predicting each individual efficacy is then computed:

$\mathrm{score} = \mathrm{correct} / (\mathrm{correct} + \mathrm{error})$

correct is the number of correctly predicted prescriptions. Note that when the efficacy label is 0 and the prediction is also 0, the prescription is not counted toward correct: labels of 0 are far more numerous, so counting them would bias the calculation. error is the number of incorrect predictions: a prescription is counted toward error when the efficacy label is 0 and the prediction is 1, or when the label is 1 and the prediction is 0. In the example above, correct for (ephedra, cassia twig) with respect to "relieving cough" is 1, because the prediction of "relieving cough" agrees with the true label; error for (ephedra, cassia twig) with respect to "detoxification" is 1, because "detoxification" is in the true label but was not predicted.
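The reason for leaving out the cases where both label and prediction are 0 can be seen in a small numerical sketch. The data are hypothetical: one efficacy column over ten prescriptions that contain a given drug pair, with a model that never predicts the efficacy. The helper names are assumptions, as is returning 0.0 when nothing is counted.

```python
def plain_accuracy(labels, preds):
    """Ordinary accuracy, which also rewards label 0 with prediction 0."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def score(labels, preds):
    """correct / (correct + error), ignoring label 0 with prediction 0."""
    correct = sum(l == 1 and p == 1 for l, p in zip(labels, preds))
    error = sum(l != p for l, p in zip(labels, preds))
    return correct / (correct + error) if correct + error else 0.0


# Hypothetical column: the efficacy is labeled once and never predicted.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
preds = [0] * 10

print(plain_accuracy(labels, preds))  # dominated by the nine 0/0 cases
print(score(labels, preds))
```

Ordinary accuracy is 0.9 here even though the efficacy was never predicted correctly, whereas the score is 0.0, so sparse efficacy labels cannot make an unrelated drug pair look strongly associated.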

The efficacies of each drug pair are ranked by score, and the top 5 are taken as the final efficacies of the drug pair.

Claims

1. A method for mining the relation between drug pairs and efficacy from prescription information comprises the following steps:

1) Collecting authoritative prescription information data, and extracting prescription information through an OCR tool;

4) Pairing the drug components in the structured prescription composition data of all prescription samples two by two, counting the frequency of each drug pair over all prescription samples, and selecting the high-frequency drug pairs to form a high-frequency drug pair table; obtaining the high-frequency drug pair vector of each prescription sample from its structured prescription composition data and the high-frequency drug pair table, the vectors of all prescription samples forming a high-frequency drug pair matrix;

5) Establishing a single-layer feedforward network, and training it on the prescription efficacy vocabulary matrix obtained in step 3) and the high-frequency drug pair matrix obtained in step 4) to obtain a trained network model;

6) Inputting the high-frequency drug pair matrix obtained in step 4) into the trained network model to obtain the final predicted prescription efficacy vocabulary matrix; calculating, from the final predicted prescription efficacy vocabulary matrix and the actual prescription efficacy vocabulary matrix obtained in step 3), the prediction accuracy of each drug pair on each efficacy word, and ranking, so as to obtain the relation between drug pairs and efficacy.

2. The method for mining the relation between drug pairs and efficacy from prescription information according to claim 1, wherein in step 2), a TCM prescription composition extraction tool is used to extract the structured composition of each prescription from the prescription information extracted in step 1).

3. The method for mining drug pair and efficacy relationship from prescription information according to claim 1, wherein the step 3) specifically comprises:

3.1 Extracting keywords from the prescription efficacy description texts of all prescription samples, wherein the keywords comprise all four-word, three-word and two-word terms, and each four-word term is split into two two-word terms; filtering out low-frequency terms to obtain a prescription efficacy vocabulary containing p efficacy words;

3.2 Determining the prescription efficacy vocabulary vector of each prescription sample from its prescription efficacy description text and the prescription efficacy vocabulary, the vectors of all prescription samples forming the prescription efficacy vocabulary matrix $Y = \{y_{ij}\}$, $i = 0, 1, \dots, n$; $j = 0, 1, \dots, p$, where n represents the number of prescription samples. The value of $y_{ij}$ is determined as follows: if the prescription efficacy description text of the i-th prescription sample contains the j-th efficacy word of the prescription efficacy vocabulary, $y_{ij} = 1$; otherwise $y_{ij} = 0$.

4. The method for mining drug pair and efficacy relationship from prescription information according to claim 1, wherein the step 4) is specifically:

4.1 Pairing the drug components in the structured prescription composition data of all prescription samples two by two, counting the frequency of each drug pair over all prescription samples, and selecting the q most frequent drug pairs to form the high-frequency drug pair table;

4.2 Obtaining the high-frequency drug pair vector of each prescription sample from its structured prescription composition data and the high-frequency drug pair table, the vectors of all prescription samples forming the high-frequency drug pair matrix $X = \{x_{ij}\}$, $i = 0, 1, \dots, n$; $j = 0, 1, \dots, q$, where n represents the number of prescription samples. The value of $x_{ij}$ is determined as follows: if the structured prescription composition data of the i-th prescription contains the j-th drug pair of the high-frequency drug pair table, $x_{ij} = 1$; otherwise $x_{ij} = 0$.

5. The method for mining drug pair and efficacy relationship from prescription information according to claim 1, wherein the step 5) specifically comprises:

establishing a single-layer feedforward network, and training it on the prescription efficacy vocabulary matrix obtained in step 3) and the high-frequency drug pair matrix obtained in step 4), wherein the training formula is:

$Y = W \cdot X + b$

wherein X represents the high-frequency drug pair matrix, Y represents the prescription efficacy vocabulary matrix, the training parameter W represents the degree of relation between drug pairs and efficacy, and b is the offset; after the first training pass, the parameters $W^{l}$ and $b^{l}$ are obtained;

$\mathrm{loss} = -Y * \log Y^{l}$

$Y^{l} = W^{l} \cdot X + b^{l}$

wherein the parameters $W'$ and $b'$ are obtained after training is finished.

6. The method for mining drug pair and efficacy relationship from prescription information according to claim 5, wherein said step 6) specifically comprises:

6.1 Extracting the parameters $W'$ and $b'$ obtained after training from the network model trained in step 5), and inputting the high-frequency drug pair matrix X into the trained network model to obtain the final predicted prescription efficacy vocabulary matrix $Y' = \{y'_{ij}\}$, $i = 0, 1, \dots, n$; $j = 0, 1, \dots, p$, where n represents the number of prescription samples, i denotes the i-th prescription sample, p represents the number of efficacy words in the prescription efficacy vocabulary, and j denotes the j-th efficacy word; the calculation formula is:

$Y' = W' \cdot X + b'$

Each entry of the final predicted prescription efficacy vocabulary matrix $Y'$ is a value between 0 and 1. By setting a threshold T, values in $Y'$ smaller than T are recorded as 0 and values larger than T are recorded as 1, which converts $Y'$ into a matrix $Y'' = \{y''_{ij}\}$ consisting of 0s and 1s, $i = 0, 1, \dots, n$; $j = 0, 1, \dots, p$;

$\mathrm{score} = \mathrm{correct} / (\mathrm{correct} + \mathrm{error})$

correct is the number of correctly predicted prescriptions, counted as follows: if $y_{ij} = 1$ and $y''_{ij} = 1$, the prescription is counted toward correct; if $y_{ij} \neq y''_{ij}$ or $y_{ij} = y''_{ij} = 0$, it is not counted. error is the number of incorrectly predicted prescriptions, counted as follows: if $y_{ij} \neq y''_{ij}$, the prescription is counted toward error; otherwise it is not counted;