CN113838583B

CN113838583B - Intelligent medicine curative effect evaluation method based on machine learning and application thereof

Info

Publication number: CN113838583B
Application number: CN202111135248.5A
Authority: CN
Inventors: 尚磊; 杨喆; 张玉海; 梁英; 张海悦; 王玥
Original assignee: Air Force Medical University of PLA
Current assignee: Air Force Medical University of PLA
Priority date: 2021-09-27
Filing date: 2021-09-27
Publication date: 2023-10-24
Anticipated expiration: 2041-09-27
Also published as: CN113838583A

Abstract

The invention discloses an intelligent medicine curative effect evaluation method based on machine learning and application thereof, wherein the method comprises the steps of establishing a mapping relation between medicines and corresponding target treatment diseases or symptoms; extracting potential side effects corresponding to medicines, and calculating similarity indexes among medicines; labeling the data on the medicine line to mark whether the medicine is effective or not, structuring the text data of the medicine, and extracting multi-dimensional crowd information and relevant feature vectors of the medicine and treatment; dividing the structured data into a training set and a verification set, establishing an integrated prediction model by utilizing a plurality of algorithms and different characteristic variable selection mechanisms, and selecting a scheme with optimal prediction effect; finally, the effective rates of the similar medicines are ranked according to the medicine similarity indexes, and various functions of evaluating the curative effect of the medicines are realized through the application.

Description

Intelligent medicine curative effect evaluation method based on machine learning and application thereof

Technical Field

The invention relates to the field of biological medicine and artificial intelligence, in particular to an intelligent medicine curative effect evaluation method based on machine learning and application thereof.

Background

In the field of biopharmaceuticals, efficacy, effectiveness, and efficiency (benefit) are three indicators used to evaluate drugs at different stages and in different settings. Efficacy generally refers to the magnitude of the therapeutic effect that a drug can achieve under ideal conditions during the clinical trial phase, and is the maximum expected effect of the drug. Effectiveness (curative effect) is the magnitude of the therapeutic effect achieved by the drug under actual medical and health conditions, namely the result obtained from real-world data. Efficiency (benefit) refers to whether a drug's value is commensurate with the cost paid by an individual or society; it considers not only clinical effectiveness but also cost-effectiveness and socioeconomic benefit, and is commonly used in health economics assessment.

After a drug passes phase III clinical trials and is approved for marketing, its curative effect is examined in the real world. Under real conditions, factors such as patient populations, drug doses, and frequency of use are much more complex than in randomized clinical trials, so real-world evaluation of drug efficacy is receiving increasing attention. With the development of big data technology, large-scale mining of data such as online drug reviews, case reports, drug instructions, and precautions has become feasible.

Existing real-world studies and methods of drug efficacy generally rely on a single data source, such as research reports, clinical follow-up, or a phase IV trial, and the population they can cover is limited by factors such as research funding, study scale, and selection bias. The invention uses text mining technology and ensemble machine learning algorithms to integrate data from different information sources, extracts effective feature values, and establishes a comprehensive drug efficacy evaluation system together with a decision mechanism for its application, realizing multiple functions such as drug recommendation, efficacy and side-effect evaluation, and comparison of similar drugs.

The invention can monitor and evaluate the curative effect of marketed drugs over the long term and at large scale, and can further serve as an important reference for evaluating drug effectiveness and cost-effectiveness.

Disclosure of Invention

The invention aims to provide an intelligent drug evaluation method based on machine learning and an application thereof, which combine massive internet data with hospital medical records, follow-up visits, or survey report data to obtain broader, real-time feedback on drug use, and comprehensively evaluate drug efficacy from multiple information sources. This avoids drawbacks of traditional post-marketing efficacy evaluation, such as the high cost of recruiting subjects and artificial inclusion and exclusion criteria, and evaluates the therapeutic effect and side effects of drugs under various conditions more comprehensively and efficiently.

The first aspect of the invention provides an intelligent medicine evaluation method based on machine learning, which comprises the following steps:

1) The mapping between each drug and its target diseases or symptoms is extracted from the drug instructions and the medication guides issued by the drug administration: assume drugs $D_i$, $i = 1, \dots, I$, for $I$ drugs; the corresponding target diseases or symptoms are $T_{ij}$, $j = 1, \dots, J$, for $J$ target diseases or symptoms; and the corresponding potential side effects are $S_{ik}$, $k = 1, \dots, K$, for $K$ potential side effects.

The similarity index between drugs is then calculated. Specifically, assume that drug $D_a$ has the set of target diseases or symptoms $T_a$ and that drug $D_b$ has the set of target diseases or symptoms $T_b$; the drug similarity index is $\mathrm{Sim}(D_a, D_b) = |T_a \cap T_b| / |T_a \cup T_b|$.

2) Online drug reviews, hospital medical records, and follow-up records are grouped according to the target diseases or symptoms $T_{ij}$ of the drugs in step 1), and each review and medical record is labeled as "effective" or "ineffective".

Specifically, the labeling is automatic: sentiment analysis is performed according to semantics, and each sentence is scored from -1 (negative) to 1 (positive) by VADER (Valence Aware Dictionary and sEntiment Reasoner), with 0 representing a neutral opinion. Further, manual verification may be performed after the automatic labeling.

3) Structuring the text data: a) extracting multidimensional population information such as age, gender, race, marital status, and region; b) extracting feature vectors: extracting characteristic words or phrases such as anti-inflammatory, fever, headache, cold, and cough from online drug reviews, medical records, and follow-up records to obtain feature vectors;

4) The data set converted from text data to structured data is divided into a training set and a verification set according to a certain proportion.

Specifically, the training set and the validation set may be divided in a ratio of 8:2, 7:3, or 6:4.

5) Multiple algorithms are selected as classifiers for predicting the classification problem.

Specifically, four classifiers for the binary classification problem may be selected: a) OneVsRest SVM, b) Logistic Regression, c) Random Forest, d) Bagging meta-estimator with a logistic regression base.

6) Different characteristic variable selection mechanisms are established, and a scheme with optimal prediction effect of various classifiers under different characteristic variables is selected.

Specifically, the feature variables may be selected by combining word occurrence counts (Count), term frequency-inverse document frequency (tf-idf), and the VADER score, for example:

FS-1: CountVectorizer,

FS-2: CountVectorizer + VADER score,

FS-3: CountVectorizer top 10000 features + VADER score,

FS-4: TfidfVectorizer,

FS-5: TfidfVectorizer + VADER score,

FS-6: TfidfVectorizer top 10000 features + VADER score.

Further, the scheme with the optimal prediction effect is obtained by evaluating the F1 score:

F1 score = 2 * (Recall * Precision) / (Recall + Precision),

where Recall = true positives / (true positives + false negatives) and Precision = true positives / (true positives + false positives).

A second aspect of the present invention provides an application of intelligent drug evaluation based on machine learning, the application comprising a plurality of functions: function 1) evaluating the efficacy of a given drug: the drug name is entered, and the drug effectiveness score, its ranking among similar drugs, and its side-effect ranking are obtained; function 2) searching for drugs for a given disease or symptom: the names of one or more diseases or symptoms are entered, and the effectiveness scores and rankings of the one or more corresponding drugs, together with the side-effect ranking of each drug, are obtained; function 3) ranking the effectiveness of drugs and similar drugs, and their side effects, for multidimensional populations differing in age, gender, race, marital status, region, and the like.

In the embodiment of the invention, the mapping between each drug and its target diseases or symptoms is determined from approved information such as drug instructions and medication guides, and drug effectiveness is predicted from information such as online drug reviews, medical records, and follow-up records. The process of establishing the prediction model is as follows: first, sentiment analysis is performed on each sentence with VADER to label whether the drug the sentence refers to is effective; the structured data set is then divided into a training set and a validation set, and several binary classifiers (models) are trained with different feature extraction schemes to obtain the optimal scheme. At the application level, the predictions of whether drugs are effective are applied to the mapping determined initially, and the effective rate and side effects of each drug for each of its target diseases or symptoms are calculated, as well as the effective rates of one or more similar drugs for the same target disease or symptom. Therefore, when a drug is entered at the user end, the effective rate and side effects for its target diseases or symptoms and the effective rates of similar drugs are displayed; if a disease or symptom is entered, the effective rates and side effects of the one or more corresponding drugs are displayed. In addition, by filtering on population information, the effective rate among people of a given age, gender, race, marital status, region, and so on who take the drug can be obtained.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flow chart of an intelligent drug efficacy evaluation method based on machine learning in an embodiment of the invention.

Fig. 2 is a schematic diagram of a software running structure of an intelligent drug efficacy evaluation application based on machine learning in an embodiment of the present invention.

Description of the embodiments

The following description of the embodiments of the present invention will be made with reference to the accompanying drawings, in which it is evident that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.

Embodiments of the present invention are described in detail below.

Example 1

Referring to Fig. 1, which is a flow chart of the machine-learning-based intelligent drug efficacy evaluation method according to an embodiment of the present invention, the method includes:

101. Establish a mapping between each drug and its corresponding target diseases or symptoms.

As shown in Table 1, the diseases or symptoms treatable by the drug Metronidazole include sepsis, endocarditis, meningitis, colitis, tetanus, canker sore, and so on. The target diseases or symptoms are extracted from the drug instructions or medication guide; the drug is denoted $D_i$ and its target diseases or symptoms are denoted $T_{ij}$. In this example, $D_1$ = Metronidazole and $T_1$ = {sepsis, endocarditis, meningitis, colitis, tetanus, canker sore, ...}. After the mapping between the drug and its target diseases or symptoms is established, step 102 is performed.

102. Extract the potential side effects of each drug and calculate the similarity index between drugs.

As shown in Table 1, the potential side effects of the drug Metronidazole include nausea, vomiting, loss of appetite, abdominal cramps, headache, dizziness, paresthesia, numbness of limbs, and the like. The potential side effects of the drug are extracted from the drug instructions or medication guide and denoted $S_{ik}$. In this example, $S_1$ = {nausea, vomiting, loss of appetite, abdominal cramps, headache, dizziness, paresthesia, numbness of limbs, ...}.

As shown in Table 2, for the drug $D_2$ = Ornidazole, the target diseases or symptoms are $T_2$ = {sepsis, amoebiasis, meningitis, periodontitis, endometritis, canker sore, ...}, and the potential side effects are $S_2$ = {nausea, bad breath, dizziness, drowsiness, rash, cramps, confusion, numbness of limbs, ...}.

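The similarity index of the two drugs is calculated as the number of shared target diseases or symptoms divided by the number of distinct target diseases or symptoms across both drugs: $\mathrm{Sim}(D_1, D_2) = |T_1 \cap T_2| / |T_1 \cup T_2|$, where $T_1$ and $T_2$ are the target sets of Metronidazole and Ornidazole. For the items listed above, the two drugs share sepsis, meningitis, and canker sore out of nine distinct items, giving $3/9 \approx 0.333$.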

Further, a data dictionary of the diseases or symptoms treated by drugs is designed, which contains the broader and narrower concepts (hypernyms and hyponyms) of diseases or symptoms. For example, periodontitis and canker sore are both oral infections; if the broader concept "oral infection" is used, the two drugs share three of eight distinct items, and their similarity index is 3/8 = 0.375.

Further, a similar word merging data dictionary is designed, and the data dictionary contains words which can be regarded as similar in the same level, such as { numbness of limbs }, { headache }, { dizziness }, and the like.
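The similarity calculation of step 102 can be sketched as follows. This is a minimal illustration, assuming the index is the number of shared targets divided by the number of distinct targets (an intersection-over-union form, which reproduces the 3/8 = 0.375 value stated above); the function names and the `HYPERNYMS` dictionary are hypothetical and only encode the oral-infection example from the text.

```python
def normalize(terms, hypernyms=None):
    """Map each term to its broader concept (if one is listed) and return a set."""
    hypernyms = hypernyms or {}
    return {hypernyms.get(term, term) for term in terms}


def similarity_index(targets_a, targets_b, hypernyms=None):
    """Shared targets divided by distinct targets across both drugs."""
    a = normalize(targets_a, hypernyms)
    b = normalize(targets_b, hypernyms)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Target lists as enumerated in the embodiment (the "..." items are omitted).
metronidazole = ["sepsis", "endocarditis", "meningitis",
                 "colitis", "tetanus", "canker sore"]
ornidazole = ["sepsis", "amoebiasis", "meningitis",
              "periodontitis", "endometritis", "canker sore"]

# Hypothetical data-dictionary entries for the oral-infection example.
HYPERNYMS = {"periodontitis": "oral infection", "canker sore": "oral infection"}

plain = similarity_index(metronidazole, ornidazole)              # 3 shared / 9 distinct
merged = similarity_index(metronidazole, ornidazole, HYPERNYMS)  # 3 shared / 8 distinct
```

A similar-word merging dictionary could be passed in the same way as `hypernyms`, mapping each synonym to one canonical term before the sets are compared.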

Table 1. Diseases or symptoms treated by Metronidazole and its potential side effects

Table 2. Diseases or symptoms treated by Ornidazole and its potential side effects

103. Label the online reviews, medical records, and follow-up records of each drug to mark whether the drug is effective.

Specifically, each sentence is scored from -1 (negative) to 1 (positive) with VADER (Valence Aware Dictionary and sEntiment Reasoner), with 0 representing a neutral opinion. Using the polarity_scores method of the VADER module in Python, four scores are given for each sentence: (a) a negative score, (b) a positive score, (c) a neutral score, and (d) a compound score. The compound score is obtained by summing the valence scores of the words in the sentence, adjusting them according to VADER's rules, and normalizing the result to the range -1 to 1; it is used to measure the overall positive or negative sentiment of a sentence. VADER is suited to sentiment analysis of English sentences, so the data sources for evaluating drug efficacy should be in English as far as possible; if Chinese text is collected, it can be translated into English by an automatic translator and manually checked.
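The automatic labeling of step 103 might look like the sketch below. It assumes the compound score is thresholded at ±0.05 to decide between "effective" and "ineffective"; the text fixes only the score range, so the threshold, the "neutral" bucket, and the helper names are assumptions. The `vaderSentiment` package is treated as an optional dependency.

```python
def label_from_compound(compound, threshold=0.05):
    """Turn a compound score in [-1, 1] into a label.

    The threshold value is an assumption, not taken from the text.
    """
    if compound >= threshold:
        return "effective"
    if compound <= -threshold:
        return "ineffective"
    return "neutral"  # a natural candidate for manual verification


def label_reviews(reviews, scorer):
    """Label each review with any callable that returns a compound score."""
    return [(text, label_from_compound(scorer(text))) for text in reviews]


try:
    # Optional dependency: pip install vaderSentiment
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    _analyzer = SentimentIntensityAnalyzer()

    def vader_compound(text):
        # polarity_scores returns a dict with keys 'neg', 'neu', 'pos', 'compound'.
        return _analyzer.polarity_scores(text)["compound"]
except ImportError:
    vader_compound = None  # VADER is not installed; pass another scorer instead
```

When VADER is available, `label_reviews(reviews, vader_compound)` labels a list of English review sentences; sentences that fall in the neutral band can then be routed to the manual check mentioned above.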

104. Structure the text data and extract multidimensional population information and the feature vectors related to drugs and treatment.

Population information includes, but is not limited to, age, gender, race, marital status, and region. For online drug reviews, such information may be obtained from the website's back-end database; hospital medical record management systems can also provide it; and follow-up records should be designed, before the follow-up study begins, to capture such population information as far as possible.

Words that capture important text features in online drug reviews, hospital medical records, and follow-up records are converted into vector form using word counts (CountVectorizer) and term frequency-inverse document frequency (tf-idf).

Further, the feature vectors may be built according to schemes such as:

FS-1: CountVectorizer,

FS-2: CountVectorizer + VADER score,

FS-3: CountVectorizer top 10000 features + VADER score,

FS-4: TfidfVectorizer,

FS-5: TfidfVectorizer + VADER score,

FS-6: TfidfVectorizer top 10000 features + VADER score.
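The six schemes above can be sketched with scikit-learn as follows. This is an illustration under two assumptions: that "top 10000" corresponds to the vectorizer's `max_features` limit, and that the VADER score is appended as one extra column. The function name `build_features` and the example sentences and scores are made up.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def build_features(texts, vader_scores, scheme, top_k=10000):
    """Return (fitted vectorizer, sparse feature matrix) for 'FS-1' .. 'FS-6'."""
    use_tfidf = scheme in ("FS-4", "FS-5", "FS-6")
    add_vader = scheme in ("FS-2", "FS-3", "FS-5", "FS-6")
    limit = top_k if scheme in ("FS-3", "FS-6") else None

    vectorizer_cls = TfidfVectorizer if use_tfidf else CountVectorizer
    vectorizer = vectorizer_cls(max_features=limit)  # None keeps every term
    X = vectorizer.fit_transform(texts)

    if add_vader:
        # Append the sentence-level VADER score as one additional column.
        column = csr_matrix(np.asarray(vader_scores, dtype=float).reshape(-1, 1))
        X = hstack([X, column]).tocsr()
    return vectorizer, X


texts = ["the cough stopped after two days", "headache got worse, no relief"]
scores = [0.4, -0.6]  # made-up compound scores, for illustration only

_, X1 = build_features(texts, scores, "FS-1")
_, X5 = build_features(texts, scores, "FS-5")
```

The vectorizer is returned alongside the matrix so that the same fitted vocabulary can later be applied to the validation set with `transform`.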

105. The structured data is divided into a training set and a validation set.

The data set converted into vector form in the preceding steps is divided into a training set and a validation set in a ratio of 8:2, 7:3, or 6:4.
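A minimal sketch of this split is given below, assuming a simple shuffled split with a fixed seed; the text does not say whether the split is random or stratified, so that choice and the function name `split_dataset` are assumptions.

```python
import random


def split_dataset(records, train_fraction=0.8, seed=0):
    """Shuffle the records and split them into (training set, validation set)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(10))
train, valid = split_dataset(records, train_fraction=0.7)  # a 7:3 split
```

Passing 0.8 or 0.6 as `train_fraction` gives the 8:2 and 6:4 splits mentioned above.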

106. Multiple algorithms are selected as classifiers for predicting the classification problem.

Further, four commonly used algorithms are chosen to train the classifiers: a) OneVsRest SVM, b) Logistic Regression, c) Random Forest, d) Bagging meta-estimator with a logistic regression base.
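One way the four classifiers might be instantiated with scikit-learn is sketched below. The hyperparameters (linear SVM kernel, 100 trees, 10 bagged estimators, `max_iter=1000`) are assumptions, since the text names only the algorithms, and the tiny feature matrix is made up for illustration.

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC


def make_classifiers(seed=0):
    """Return the four classifiers named in step 106, keyed by name."""
    return {
        "OneVsRest SVM": OneVsRestClassifier(LinearSVC(random_state=seed)),
        "Logistic Regression": LogisticRegression(max_iter=1000, random_state=seed),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "Bagging (logistic regression base)": BaggingClassifier(
            LogisticRegression(max_iter=1000), n_estimators=10, random_state=seed
        ),
    }


# Made-up, clearly separable data: label 1 = effective, 0 = ineffective.
X = [[i % 3, 1 + i % 2] for i in range(20)] + [[5 + i % 3, i % 2] for i in range(20)]
y = [0] * 20 + [1] * 20

fitted = {name: clf.fit(X, y) for name, clf in make_classifiers().items()}
predictions = {name: clf.predict([[6, 0]])[0] for name, clf in fitted.items()}
```

In practice the classifiers would be fitted on the sparse matrices produced by the feature schemes of step 104 rather than on dense toy lists.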

107. Different characteristic variable selection mechanisms are established, and a scheme with optimal prediction effect of various classifiers under different characteristic variables is selected.

Table 3 shows the F1 scores obtained by training each of the four classifiers of step 106 under each of the six feature variable selection schemes of step 104:

F1 score = 2 * (Recall * Precision) / (Recall + Precision),

where Recall = true positives / (true positives + false negatives) and Precision = true positives / (true positives + false positives).
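The F1 formula and the selection of the best classifier and feature scheme can be sketched in plain Python as follows. The label strings, the placeholder names "classifier A" and "classifier B", and the toy predictions are illustrative only and are not results from Table 3.

```python
def f1_score(y_true, y_pred, positive="effective"):
    """F1 = 2 * (Recall * Precision) / (Recall + Precision) for the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * (recall * precision) / (recall + precision)


y_true = ["effective", "effective", "effective", "ineffective", "ineffective"]

# Toy predictions from two placeholder (classifier, feature scheme) combinations.
toy_predictions = {
    ("classifier A", "FS-1"): ["effective", "effective", "ineffective",
                               "effective", "ineffective"],
    ("classifier B", "FS-2"): ["effective", "effective", "effective",
                               "effective", "ineffective"],
}

scores = {key: f1_score(y_true, pred) for key, pred in toy_predictions.items()}
best = max(scores, key=scores.get)  # combination with the highest F1 score
```

The same `max` over a dictionary keyed by (classifier, scheme) extends directly to the full four-by-six grid of combinations.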

Table 3. Prediction performance of the classifiers under different feature variables

108. Calculate the effective rate of each drug for its target diseases or symptoms using the optimal scheme obtained through training.

From Table 3, the optimal scheme for predicting whether a drug is effective against its target disease or symptom is Random Forest with feature scheme FS-6: TfidfVectorizer top 10000 features + VADER score, with a corresponding F1 score of 0.760. This scheme is used to predict labels for unlabeled data and to calculate the effective rate of a given drug for each of its target diseases or symptoms.
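The effective-rate calculation might be sketched as below, assuming the effective rate is the share of records predicted "effective" within each (drug, target) group; the text does not give an explicit formula, so this definition, the function name, and the sample records are assumptions.

```python
from collections import defaultdict


def effective_rates(predictions):
    """Map (drug, target) to the fraction of records predicted 'effective'.

    `predictions` is an iterable of (drug, target, predicted_label) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # (drug, target) -> [effective, total]
    for drug, target, label in predictions:
        counts[(drug, target)][1] += 1
        if label == "effective":
            counts[(drug, target)][0] += 1
    return {key: effective / total for key, (effective, total) in counts.items()}


# Made-up model outputs, for illustration only.
predicted = [
    ("metronidazole", "sepsis", "effective"),
    ("metronidazole", "sepsis", "effective"),
    ("metronidazole", "sepsis", "ineffective"),
    ("metronidazole", "sepsis", "effective"),
    ("metronidazole", "colitis", "ineffective"),
    ("metronidazole", "colitis", "effective"),
]
rates = effective_rates(predicted)
```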

109. Obtain the effective-rate ranking of similar drugs and the ranking of potential side effects according to the drug similarity index.

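The drug similarity index is $\mathrm{Sim}(D_a, D_b) = |T_a \cap T_b| / |T_a \cup T_b|$, where $T_a$ and $T_b$ are the sets of target diseases or symptoms of drugs $D_a$ and $D_b$.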

For a given drug, such as metronidazole, the similarity index between it and every other drug is calculated using the method of step 102; the five most similar drugs can be taken, and the effective rates of the drug and of these similar drugs are each calculated by the model. Using the potential side-effect feature words extracted in step 102, the side effects reported for the drug across all data sources are counted and ranked; preferably the top 10 are reported, or another number of top-ranked side effects as the case may be.
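Step 109 can be illustrated with the sketch below, which assumes that the similarity indices and effective rates have already been computed and that side effects are counted by simple substring matching of the feature words. The function names, the placeholder drugs "drug X" and "drug Y", and all numeric values are made up.

```python
from collections import Counter


def rank_similar_drugs(similarity, effective_rate, n=5):
    """Keep the n most similar drugs, then order them by effective rate."""
    closest = sorted(similarity, key=similarity.get, reverse=True)[:n]
    return sorted(closest, key=lambda d: effective_rate.get(d, 0.0), reverse=True)


def rank_side_effects(texts, side_effect_terms, n=10):
    """Count how many texts mention each side-effect term; return the top n."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for term in side_effect_terms:
            if term in lowered:
                counts[term] += 1
    return counts.most_common(n)


# Made-up similarity indices and effective rates relative to one reference drug.
similarity = {"ornidazole": 0.375, "drug X": 0.2, "drug Y": 0.1}
rates = {"ornidazole": 0.6, "drug X": 0.8, "drug Y": 0.9}
ranked = rank_similar_drugs(similarity, rates, n=2)

reviews = ["Nausea and headache after the first dose", "mild nausea", "no problems"]
side_effects = rank_side_effects(reviews, ["nausea", "headache", "dizziness"])
```

Substring matching is the simplest possible choice here; a similar-word merging dictionary could be applied to the texts first so that synonyms are counted under one term.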

Example 2

Based on the intelligent drug efficacy evaluation method described in the above embodiment, an intelligent drug efficacy evaluation application based on machine learning is developed. The back end of the application includes a database for collecting and managing the different data sources described above, the middle tier includes the intelligent drug efficacy evaluation method with support for model parameter adjustment and real-time monitoring, and the front end realizes the following functions:

Evaluating the efficacy of a given drug: the drug name is entered, and the drug effectiveness score, its ranking among similar drugs, and its side-effect ranking are obtained.

The drug effectiveness score is the effective rate of the drug obtained through the drug efficacy evaluation model.

Searching for drugs for a given disease or symptom: the names of one or more diseases or symptoms are entered, and the effectiveness scores and rankings of the one or more corresponding drugs, together with the side-effect ranking of each drug, are obtained.

When a disease or symptom is entered into the system, the mapping between drugs and target diseases or symptoms obtained in step 101 of Example 1 is used to find the corresponding drugs, and the effectiveness scores and rankings of the one or more corresponding drugs, together with the potential side-effect ranking of each drug, are calculated by the model.

Ranking the effectiveness of drugs and similar drugs, and their side effects, for multidimensional populations differing in age, gender, race, marital status, region, and the like.

The population information is used as a filtering condition when calculating drug effectiveness and when searching for similar drugs and their side-effect rankings.
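The second and third front-end functions could be sketched as simple lookups over the outputs of Example 1. The data layout (a dictionary of drug targets, a dictionary of effective rates keyed by drug and condition, and review records stored as dictionaries), the function names, and all values below are assumptions made for illustration.

```python
def drugs_for_condition(condition, drug_targets, rates):
    """Drugs mapped to a condition, ordered by effective rate (highest first)."""
    hits = [drug for drug, targets in drug_targets.items() if condition in targets]
    scored = [(drug, rates.get((drug, condition), 0.0)) for drug in hits]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def filter_population(records, **criteria):
    """Keep records whose population fields match every given criterion."""
    return [r for r in records
            if all(r.get(field) == value for field, value in criteria.items())]


# Made-up mapping, rates, and records.
drug_targets = {
    "metronidazole": ["sepsis", "colitis"],
    "ornidazole": ["sepsis", "periodontitis"],
}
rates = {("metronidazole", "sepsis"): 0.7, ("ornidazole", "sepsis"): 0.8}
records = [
    {"drug": "metronidazole", "gender": "female", "label": "effective"},
    {"drug": "metronidazole", "gender": "male", "label": "ineffective"},
]

sepsis_drugs = drugs_for_condition("sepsis", drug_targets, rates)
female_records = filter_population(records, gender="female")
```

Filtering the records first and then recomputing effective rates on the filtered subset would give the population-specific rankings described for the third function.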

Claims

1. The intelligent medicine curative effect evaluation method based on machine learning is characterized by comprising the following steps of:

1) The mapping between each drug and its target diseases or symptoms is extracted from the drug instructions and the medication guides issued by the drug administration: assume drugs $D_i$, $i = 1, \dots, I$, for $I$ drugs; the corresponding target diseases or symptoms are $T_{ij}$, $j = 1, \dots, J$, for $J$ target diseases or symptoms; the corresponding potential side effects are $S_{ik}$, $k = 1, \dots, K$, for $K$ potential side effects; calculating the drug similarity index of drug $D_i$;

2) Grouping online drug reviews, hospital medical records, and follow-up records according to the target diseases or symptoms $T_{ij}$ of the drugs in step 1), and labeling each review and medical record as effective or ineffective;

3) Structuring the text data:

a) Extracting multidimensional population information such as age, gender, race, marital status, and region,

b) Extracting feature vectors: extracting characteristic words or phrases from online drug comments, medical record lists and follow-up records to obtain characteristic vectors;

4) Dividing the data set, converted from text data into structured data, into a training set and a validation set according to a certain proportion;

5) Selecting a plurality of algorithms as classifiers for predicting the classification problems;

6) Different characteristic variable selection mechanisms are established, and a scheme with optimal prediction effect of various classifiers under different characteristic variables is selected;

7) Calculating, using the optimal scheme obtained by training in step 6), the effective rate of drug $D_i$ against the target disease or symptom $T_{ij}$; obtaining the ranking of the drug among similar drugs according to the drug similarity index calculated in step 1); and extracting and ranking the feature words in the data set that correspond to the potential side effects $S_{ik}$ of the drug in step 1).

2. The method for evaluating the efficacy of intelligent medicine according to claim 1, wherein the labeling in step 2) is automatic labeling, sentiment analysis is performed according to semantics, each sentence is scored from -1 (negative) to 1 (positive) by VADER (Valence Aware Dictionary and sEntiment Reasoner), and 0 represents a neutral opinion.

3. The method for evaluating the efficacy of intelligent medicine according to claim 2, wherein the automatic labeling in the step 2) is followed by manual verification.

4. The method for evaluating the efficacy of intelligent medicine according to claim 1, wherein the training set and the validation set in step 4) are divided in a ratio of 8:2 or 7:3.

5. The method for evaluating the efficacy of treatment of intelligent drugs according to claim 1, wherein the classifiers in the step 5) are four:

a) OneVsRest SVM,

b) Logistic Regression,

c) Random Forest,

d) Bagging meta-estimator with a logistic regression base.

6. The method for evaluating the efficacy of intelligent medicine according to claim 1, wherein the feature variable selection in step 6) is obtained by combining word occurrence counts (Count), term frequency-inverse document frequency (tf-idf), and the VADER score, for example:

FS-1: CountVectorizer,

FS-2: CountVectorizer + VADER score,

FS-3: CountVectorizer top 10000 features + VADER score,

FS-4: TfidfVectorizer,

FS-5: TfidfVectorizer + VADER score,

FS-6: TfidfVectorizer top 10000 features + VADER score.

7. The method for evaluating the efficacy of intelligent medicine according to claim 1, wherein the scheme with the optimal prediction effect in step 6) is obtained by evaluating the F1 score, where F1 score = 2 * (Recall * Precision) / (Recall + Precision); Recall = true positives / (true positives + false negatives); and Precision = true positives / (true positives + false positives).

8. The method for evaluating the efficacy of intelligent medicine according to claim 1, wherein the drug similarity index $\mathrm{Sim}(D_a, D_b)$ in step 1) is calculated as follows: assume that drug $D_a$ has the set of target diseases or symptoms $T_a$ and that drug $D_b$ has the set of target diseases or symptoms $T_b$; then $\mathrm{Sim}(D_a, D_b) = |T_a \cap T_b| / |T_a \cup T_b|$.

9. A machine learning based intelligent drug efficacy evaluation application, characterized in that a plurality of functions can be realized according to the method of any one of claims 1 to 8, comprising:

function 1) evaluating curative effect of a certain medicine, inputting medicine name, obtaining medicine effectiveness score, ranking in the same kind of medicine and ranking side effect;

function 2) searching for corresponding medicines aiming at a certain disease or symptom, inputting names of single or multiple diseases or symptoms, and obtaining single or multiple corresponding medicine effectiveness scores, ranks and ranks of side effects of each medicine;

function 3) ranking the effectiveness of drugs and similar drugs, and their side effects, for multidimensional populations differing in age, gender, race, marital status, and region.