CN113838583A - Intelligent drug efficacy evaluation method based on machine learning and application thereof

Info

Publication number: CN113838583A
Application number: CN202111135248.5A
Authority: CN
Inventors: 尚磊; 杨喆; 张玉海; 梁英; 张海悦; 王玥
Original assignee: Air Force Medical University of PLA
Current assignee: Air Force Medical University of PLA
Priority date: 2021-09-27
Filing date: 2021-09-27
Publication date: 2021-12-24
Anticipated expiration: 2041-09-27
Also published as: CN113838583B

Abstract

The invention discloses a machine-learning-based intelligent drug efficacy evaluation method and its application. The method comprises: establishing a mapping between each drug and its target diseases or symptoms; extracting the potential side effects of each drug and calculating a similarity index between drugs; labeling online drug data to mark whether the drug was effective, structuring the drug text data, and extracting multi-dimensional population information and feature vectors related to the drug and treatment; dividing the structured data into a training set and a validation set, building ensemble prediction models with several algorithms and different feature selection schemes, and selecting the scheme with the best prediction performance; and finally obtaining the effective-rate ranking of similar drugs according to the drug similarity index and delivering the drug efficacy evaluation functions through the application.

Description

Intelligent drug efficacy evaluation method based on machine learning and application thereof

Technical Field

The invention relates to the fields of biomedicine and artificial intelligence, and in particular to a machine-learning-based intelligent drug efficacy evaluation method and its application.

Background

In the biopharmaceutical field, efficacy, effectiveness, and efficiency are three indicators used to evaluate a drug at different stages and in different settings. Efficacy generally refers to the magnitude of therapeutic effect a drug can achieve under ideal conditions during clinical trials; it is the maximum expected effect of the drug. Effectiveness is the magnitude of therapeutic effect a drug achieves under actual medical and health conditions, that is, the result obtained from real-world data. Efficiency refers to whether the value of a drug is commensurate with the cost paid by an individual or society; it considers not only clinical effectiveness but also cost-benefit, and is generally used in health economics evaluation.

Once a drug has passed phase III clinical trials and been approved for marketing, its efficacy is tested in the real world. Under real-world conditions, factors such as patient population, dosage, and frequency of use are far more complex than in randomized clinical trials, so real-world evaluation of drug effectiveness is receiving growing attention. Advances in big data technology now make it possible to mine large volumes of data and extract information from online drug reviews, case reports, drug use guidelines, precautions, and similar sources.

Existing research and methods for real-world drug effectiveness draw on a single data source, for example survey reports, clinical follow-up, or phase IV trials, and the population they can cover remains limited by factors such as research funding, study scale, and selection bias. The invention uses text mining and ensemble machine learning algorithms to integrate data from different sources, extract effective features, and establish a comprehensive drug efficacy evaluation system together with a decision mechanism for its application, providing functions such as drug recommendation, efficacy and side-effect evaluation, and comparison of similar drugs.

The invention not only supports long-term, large-scale monitoring and evaluation of a drug's effectiveness after it reaches the market, but can also serve as an important reference indicator for cost-effectiveness evaluation of the drug.

Disclosure of Invention

The invention aims to provide a machine-learning-based intelligent drug evaluation method and its application, which combine massive internet data with hospital medical records, follow-up, or survey report data to obtain real-time feedback on drug use across a wider population and to evaluate drug effectiveness comprehensively from multiple information sources. This avoids the drawbacks of traditional post-marketing efficacy evaluation, such as the high cost of recruiting subjects and artificial inclusion and exclusion criteria, and evaluates the effectiveness and side effects of a drug under varied conditions more comprehensively and efficiently.

The invention provides an intelligent medicine evaluation method based on machine learning in a first aspect, which specifically comprises the following steps:

1) Extract the mapping between each drug and the diseases or symptoms it treats from the package insert and the medication guide of the drug regulatory authority. Suppose the drugs are $D_i$, $i = 1, \dots, I$; drug $D_i$ has $J$ target diseases or symptoms $T_{ij}$, $j = 1, \dots, J$, and $K$ potential side effects $S_{ik}$, $k = 1, \dots, K$.

Calculate the similarity index between drugs. Specifically, suppose drug $D_a$ has the set of target diseases or symptoms $T_a$ and drug $D_b$ has the set $T_b$; then the drug similarity index is

$$\mathrm{Sim}(D_a, D_b) = \frac{|T_a \cap T_b|}{|T_a \cup T_b|}.$$

2) Group the online drug reviews, hospital medical records, and follow-up records by the target diseases or symptoms of each drug determined in step 1), and label each review and medical record as 'effective' or 'ineffective'.

Specifically, labeling is automatic: sentiment analysis is performed on the text, and VADER (Valence Aware Dictionary and sEntiment Reasoner) scores each sentence from -1 (negative) to 1 (positive), with 0 a neutral opinion. Manual checking may follow automatic labeling.

3) Structure the text data: a) extract multi-dimensional population information such as age, gender, ethnicity, marital and reproductive status, and region; b) extract feature vectors: extract feature words or phrases such as anti-inflammatory, fever, headache, cold, and cough from the online drug reviews, medical records, and follow-up records to obtain feature vectors.

4) Convert the text data into a structured data set and divide it into a training set and a validation set in a given proportion.

Specifically, the training set and validation set may be divided in a ratio of 8:2, 7:3, or 6:4.

5) Select several algorithms as classifiers to predict the binary classification problem.

Specifically, four classifiers for the binary classification problem may be selected: a) OneVsRest SVM, b) Logistic Regression, c) Random Forest, d) Bagging meta-estimator with Logistic Regression base.

6) Establish different feature selection schemes, and select the combination of classifier and feature set with the best prediction performance.

Specifically, the feature variables may be obtained by combining term occurrence counts (Count), term frequency-inverse document frequency (tf-idf, i.e., Tfidf), and the VADER score, such as:

FS-1: CountVectorizer,

FS-2: CountVectorizer + VADER score,

FS-3: CountVectorizer top 10000 features + VADER score,

FS-4: TfidfVectorizer,

FS-5: TfidfVectorizer + VADER score,

FS-6: TfidfVectorizer top 10000 features + VADER score.

Further, the optimal prediction scheme is evaluated by F1-score,

F1 score = 2*(Recall * Precision) / (Recall + Precision)；

wherein Recall = true positive/(true positive + false negative), Precision = true positive/(true positive + false positive).
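The F1-score defined above can be computed directly from pairs of true and predicted labels. The following is a minimal sketch, assuming the labels are the strings 'effective' and 'ineffective' and that 'effective' is the positive class; the function name and the sample labels are illustrative and do not come from the patent.

```python
def f1_score_binary(y_true, y_pred, positive="effective"):
    """F1 = 2 * (Recall * Precision) / (Recall + Precision) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Illustrative labels only: 2 true positives, 1 false positive, 1 false negative.
y_true = ["effective", "effective", "effective", "ineffective", "ineffective"]
y_pred = ["effective", "effective", "ineffective", "effective", "ineffective"]
score = f1_score_binary(y_true, y_pred)
print(round(score, 3))
```

Here Precision and Recall are both 2/3, so the F1-score is also 2/3.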

The invention further provides an application of machine-learning-based intelligent drug evaluation that offers several functions: function 1) evaluating the efficacy of a given drug: the drug name is entered, and the drug's effectiveness score, its ranking among similar drugs, and its side-effect ranking are returned; function 2) finding drugs for a given disease or symptom: the names of one or more diseases or symptoms are entered, and the effectiveness score, ranking, and side-effect ranking of each corresponding drug are returned; function 3) ranking the effectiveness of a drug, its similar drugs, and their side effects for multi-dimensional populations that differ in age, gender, ethnicity, marital and reproductive status, region, and so on.

In the embodiment of the invention, the mapping between each drug and its target diseases or symptoms is determined from approved information such as package inserts and medication guides, and the effectiveness of the drug is predicted from online drug reviews, medical records, follow-up records, and similar information. Building the prediction model involves the following steps: first, sentiment analysis is performed on each statement with VADER (Valence Aware Dictionary and sEntiment Reasoner) to label whether the drug it refers to was effective; then the structured data set is divided into a training set and a validation set, and several binary classifiers (models) are trained under different feature extraction modes to obtain the best scheme. At the application level, the predictions of whether a drug was effective are applied to the mapping determined at the outset, and the effective rate and side effects of each drug for its diseases or symptoms are calculated, together with the effective rates of one or more similar drugs for the same target disease or symptom. Thus, when a drug is entered at the user end, the effective rate and side effects of the drug for its target diseases or symptoms are shown along with the effective rates of similar drugs; when a disease or symptom is entered, the effective rates of the corresponding drug or drugs and their respective side effects are shown. In addition, filtering by population information reveals the age, gender, ethnicity, marital and reproductive status, region, and so on of the people taking the drug, and the effective rate within each subgroup.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic flow chart of an intelligent drug efficacy evaluation method based on machine learning in an embodiment of the present invention.

Fig. 2 is a schematic diagram of a software operation structure of a machine learning-based intelligent drug efficacy evaluation application in an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described below in a clear and complete manner with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

The following describes embodiments of the present invention in detail.

Example 1

Referring to Fig. 1, which is a schematic flowchart of the machine-learning-based intelligent drug efficacy evaluation method according to an embodiment of the present invention, the method includes:

101. Establish a mapping between each drug and its target diseases or symptoms.

As shown in Table 1, the diseases or symptoms treated by metronidazole include septicemia, endocarditis, meningitis, colitis, tetanus, and oral ulcer. The target diseases or symptoms of each drug are extracted from its package insert or medication guide; the drug is denoted $D_i$ and its target diseases or symptoms $T_{ij}$. In this example, $D_1$ = metronidazole and $T_1$ = {septicemia, endocarditis, meningitis, colitis, tetanus, oral ulcer}. After the mapping between the drugs and their target diseases or symptoms is established, step 102 is performed.

102. Extract the potential side effects of each drug and calculate the similarity index between drugs.

As shown in Table 1, the potential side effects of metronidazole include nausea, vomiting, loss of appetite, abdominal cramps, headache, dizziness, paresthesia, and numbness of the extremities. The potential side effects of each drug are extracted from its package insert or medication guide and denoted $S_{ik}$. In this example, for metronidazole ($D_1$), $S_1$ = {nausea, vomiting, loss of appetite, abdominal cramps, headache, dizziness, paresthesia, numbness of the extremities}.

As shown in Table 2, $D_2$ = ornidazole, with target diseases or symptoms $T_2$ = {septicemia, amebiasis, meningitis, periodontitis, endometritis, oral ulcer} and potential side effects $S_2$ = {nausea, oral malodor, dizziness, drowsiness, rash, spasm, confusion, numbness of limbs}.

The similarity index of the two drugs is the number of target diseases or symptoms they share divided by the number of distinct target diseases or symptoms across both drugs. Metronidazole and ornidazole share three targets (septicemia, meningitis, oral ulcer) out of nine distinct ones:

$$\mathrm{Sim}(D_1, D_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|} = \frac{3}{9} \approx 0.333.$$

Further, a data dictionary of drug-treated diseases or symptoms is designed that records upper-level and lower-level concepts. For example, periodontitis and oral ulcer are both oral infections; if the upper-level concept "oral infection" is used, the two drugs still share three targets but the number of distinct targets falls to eight, so the similarity index is 3/8 = 0.375.

Furthermore, a synonym-merging data dictionary is designed that lists same-level terms that can be treated as equivalent, such as {numbness of the extremities} and {numbness of limbs}, or {headache} and {dizziness}.
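The similarity calculation and the two data dictionaries can be sketched as follows. This is a minimal illustration, assuming the index is the number of shared target terms divided by the number of distinct target terms of the two drugs, which is the ratio that reproduces the 3/8 = 0.375 figure above. The function names and the `merge_map` structure are hypothetical; the term sets are the ones listed for metronidazole and ornidazole.

```python
def normalize(terms, merge_map):
    """Replace each term by its merged form (upper-level concept or canonical synonym)."""
    return {merge_map.get(term, term) for term in terms}


def similarity_index(targets_a, targets_b, merge_map=None):
    """Shared target terms divided by all distinct target terms of the two drugs."""
    merge_map = merge_map or {}
    a = normalize(targets_a, merge_map)
    b = normalize(targets_b, merge_map)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


metronidazole = {"septicemia", "endocarditis", "meningitis",
                 "colitis", "tetanus", "oral ulcer"}
ornidazole = {"septicemia", "amebiasis", "meningitis",
              "periodontitis", "endometritis", "oral ulcer"}

# Hypothetical dictionary entry: both lower-level terms map to one upper-level concept.
hypernyms = {"periodontitis": "oral infection", "oral ulcer": "oral infection"}

plain = similarity_index(metronidazole, ornidazole)
merged = similarity_index(metronidazole, ornidazole, hypernyms)
print(round(plain, 3), merged)
```

Without merging, the drugs share 3 of 9 distinct terms; with the upper-level concept applied, they share 3 of 8. A synonym dictionary can be passed through the same `merge_map` argument by mapping each variant to one canonical term.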

103. Label the online reviews, medical records, and follow-up records for each drug to mark whether the drug was effective.

Specifically, VADER (Valence Aware Dictionary and sEntiment Reasoner) rates each sentence from -1 (negative) to 1 (positive), with 0 a neutral opinion. Using the VADER module in Python, the polarity_scores method returns four scores for a sentence: (a) negative, (b) positive, (c) neutral, and (d) compound. The compound score is a normalized aggregate of the sentence's word-level valence scores and measures whether the sentence is positive or negative overall. VADER is designed for sentiment analysis of English sentences, so the data sources for drug efficacy evaluation should be in English wherever possible; Chinese text, for example, can be translated into English by an automatic translator and then checked manually.
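A labeling rule on top of the sentiment score might look like the sketch below. The patent does not state how a score is turned into a label, so the threshold of 0 and the choice to treat neutral sentences as 'ineffective' are assumptions. To keep the example self-contained, `toy_scorer` is a hypothetical stand-in for a real sentiment scorer and is not VADER.

```python
import re


def label_sentence(text, compound_score, threshold=0.0):
    """Label a sentence from a sentiment score in [-1, 1].

    compound_score is any callable mapping text to a score; the threshold
    and the handling of neutral (0) scores are assumptions of this sketch.
    """
    score = compound_score(text)
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside [-1, 1]")
    return "effective" if score > threshold else "ineffective"


def toy_scorer(text):
    """Hypothetical stand-in scorer: counts a few positive and negative words."""
    positive = {"relieved", "cleared", "better"}
    negative = {"worse", "useless", "nausea"}
    words = re.findall(r"[a-z]+", text.lower())
    hits = sum(w in positive for w in words) - sum(w in negative for w in words)
    return max(-1.0, min(1.0, hits / 3))


print(label_sentence("The infection cleared and I felt better", toy_scorer))
print(label_sentence("It made the nausea worse", toy_scorer))
```

With the vaderSentiment package installed, the scorer could instead be `lambda text: SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`, and automatically assigned labels can still be checked manually afterwards.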

104. Structure the text data and extract multi-dimensional population information and feature vectors related to the drug and treatment.

The population information includes, but is not limited to, age, gender, ethnicity, marital and reproductive status, and region. For online drug reviews it can be obtained from the back-end database; for hospital records it can be obtained from the hospital's medical record management system; and follow-up surveys should be designed in advance to capture as much of this information as possible.

Words that reflect important features of the text in online drug reviews, hospital medical records, and follow-up records are converted into vectors by term counts (CountVectorizer) and term frequency-inverse document frequency (tf-idf).

Further, the feature vectors may follow rules such as:

FS-1: CountVectorizer,

FS-2: CountVectorizer + VADER score,

FS-3: CountVectorizer top 10000 features + VADER score,

FS-4: TfidfVectorizer,

FS-5: TfidfVectorizer + VADER score,

FS-6: TfidfVectorizer top 10000 features + VADER score.
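The six rules differ in three choices: counts versus tf-idf, whether the vocabulary is capped, and whether the VADER score is appended as an extra column. The sketch below uses scikit-learn and SciPy and assumes that "top 10000" corresponds to the vectorizer's `max_features` parameter; that mapping, the helper name `build_features`, and the sample texts and scores are assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def build_features(texts, vader_scores, kind="tfidf", top_k=None, use_vader=True):
    """Build one feature set: bag-of-words or tf-idf, optionally capped, plus a score column."""
    vectorizer_cls = TfidfVectorizer if kind == "tfidf" else CountVectorizer
    # max_features keeps only the top_k most frequent terms in the corpus.
    vectorizer = vectorizer_cls(max_features=top_k)
    X = vectorizer.fit_transform(texts)
    if use_vader:
        score_col = csr_matrix(np.asarray(vader_scores, dtype=float).reshape(-1, 1))
        X = hstack([X, score_col]).tocsr()
    return X, vectorizer


# Illustrative texts and sentiment scores, not taken from the patent's data.
texts = [
    "fever gone after two days",
    "headache and nausea no relief",
    "cough improved slightly",
]
scores = [0.6, -0.4, 0.3]

X_fs1, _ = build_features(texts, scores, kind="count", use_vader=False)  # FS-1 style
X_fs6, _ = build_features(texts, scores, kind="tfidf", top_k=5)          # FS-6 style, cap of 5
print(X_fs1.shape, X_fs6.shape)
```

In a real pipeline the vectorizer would normally be fitted on the training set only and then applied to the validation set with `transform`, so that validation vocabulary does not leak into training.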

105. Divide the structured data into a training set and a validation set.

The data set converted into vector form in the preceding steps is divided into a training set and a validation set in a ratio of 8:2, 7:3, or 6:4.
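A split at any of these ratios can be sketched with the standard library alone. The shuffling, the fixed seed, and the function name are assumptions of this example; the patent specifies only the ratios.

```python
import random


def split_dataset(records, train_fraction=0.8, seed=0):
    """Shuffle the records reproducibly and cut them into training and validation parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(10))
for fraction in (0.8, 0.7, 0.6):
    train, valid = split_dataset(records, fraction)
    print(fraction, len(train), len(valid))
```

When the 'effective' and 'ineffective' labels are imbalanced, a stratified split that preserves the label proportions in both parts may be preferable.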

106. Select several algorithms as classifiers to predict the binary classification problem.

Further, four commonly used algorithms are chosen to train classifiers: a) OneVsRest SVM, b) Logistic Regression, c) Random Forest, d) Bagging meta-estimator with Logistic Regression base.

107. Establish different feature selection schemes and select the combination of classifier and feature set with the best prediction performance.

Table 3 shows the F1-scores obtained by training each combination of the four classifiers in step 106 and the six feature selection rules in step 104,

F1 score = 2 * (Recall * Precision) / (Recall + Precision);

where Recall = true positive / (true positive + false negative) and Precision = true positive / (true positive + false positive). Recall denotes the recall rate and Precision denotes the precision rate.
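The search over classifiers and feature sets can be sketched with scikit-learn as below. The hyperparameters, the use of `LinearSVC` inside `OneVsRestClassifier`, the 8:2 stratified split, and the synthetic data from `make_classification` are all assumptions of this example; the two feature sets merely stand in for FS-1 to FS-6, and the scores it prints say nothing about the patent's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC


def candidate_classifiers(seed=0):
    """The four classifier families named in step 106, with illustrative settings."""
    return {
        "OneVsRest SVM": OneVsRestClassifier(LinearSVC(random_state=seed)),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "Bagging (Logistic Regression base)": BaggingClassifier(
            LogisticRegression(max_iter=1000), n_estimators=10, random_state=seed
        ),
    }


def select_best(feature_sets, y, seed=0):
    """Score every (classifier, feature set) pair by validation F1 and return the best pair."""
    scores = {}
    for fs_name, X in feature_sets.items():
        X_tr, X_va, y_tr, y_va = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y
        )
        for clf_name, clf in candidate_classifiers(seed).items():
            clf.fit(X_tr, y_tr)
            scores[(clf_name, fs_name)] = f1_score(y_va, clf.predict(X_va))
    best = max(scores, key=scores.get)
    return best, scores


# Synthetic stand-in data: 1 = effective, 0 = ineffective.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
feature_sets = {"all 20 columns": X, "first 10 columns": X[:, :10]}
best, scores = select_best(feature_sets, y)
print(best, round(scores[best], 3))
```

Selecting on a single validation split is the simplest reading of the text; cross-validation would give a more stable comparison at higher cost.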

108. Use the best scheme obtained by training to calculate each drug's effective rate for its target diseases or symptoms.

According to Table 3, the best scheme for predicting efficacy against the target disease or symptom is the random forest (RandomForest) with feature extraction mode FS-6: TfidfVectorizer top 10000 features + VADER score; the corresponding F1-score is 0.760. This scheme is used to predict labels for the unlabeled data, and the effective rate of a given drug is calculated separately for each disease or symptom it treats.
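Once every record has a predicted label, the effective rate can be aggregated per drug and per condition. The sketch below assumes the effective rate is the share of records predicted 'effective' within each (drug, condition) group; the patent does not spell out the formula, and the sample predictions are invented for illustration.

```python
from collections import defaultdict


def effective_rates(predictions):
    """Map (drug, condition) to the fraction of its records labeled 'effective'.

    predictions is an iterable of (drug, condition, label) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # (drug, condition) -> [effective, total]
    for drug, condition, label in predictions:
        counts[(drug, condition)][1] += 1
        if label == "effective":
            counts[(drug, condition)][0] += 1
    return {key: effective / total for key, (effective, total) in counts.items()}


# Invented predictions, standing in for the model output on unlabeled data.
predictions = [
    ("metronidazole", "colitis", "effective"),
    ("metronidazole", "colitis", "effective"),
    ("metronidazole", "colitis", "ineffective"),
    ("metronidazole", "colitis", "effective"),
    ("metronidazole", "meningitis", "ineffective"),
    ("metronidazole", "meningitis", "effective"),
]
rates = effective_rates(predictions)
print(rates)
```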

109. Obtain the effective-rate ranking of similar drugs according to the drug similarity index, and the ranking of potential side effects.

For a given drug, such as metronidazole, the similarity index between it and every other drug is calculated by the method in step 102; the five most similar drugs can be taken, and the model is used to calculate the effective rate of the drug and of each similar drug. Using the feature words for the drug's potential side effects extracted in step 102, the side effects reported for the drug across all data sources are counted and ranked; the top 10 may be reported, or a different number as appropriate.
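Both rankings reduce to sorting and counting. In the sketch below, the drug names `drug_b` to `drug_d`, their similarity values and effective rates, and the review texts are all hypothetical placeholders; side effects are counted by simple substring matching, which is an assumption rather than the patent's stated matching rule.

```python
from collections import Counter


def rank_similar_drugs(similarity, rates, top_n=5):
    """Take the top_n most similar drugs, then order them by effective rate."""
    nearest = sorted(similarity, key=similarity.get, reverse=True)[:top_n]
    ranked = [(drug, rates[drug]) for drug in nearest if drug in rates]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


def rank_side_effects(texts, side_effect_terms, top_n=10):
    """Count how many texts mention each side-effect term and return the most frequent."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for term in side_effect_terms:
            if term in lowered:
                counts[term] += 1
    return counts.most_common(top_n)


# Hypothetical similarity indices to the query drug and hypothetical effective rates.
similarity = {"drug_b": 0.5, "drug_c": 0.375, "drug_d": 0.1}
rates = {"drug_b": 0.6, "drug_c": 0.8, "drug_d": 0.9}
similar_ranking = rank_similar_drugs(similarity, rates, top_n=2)

texts = [
    "Nausea and headache after the first dose",
    "Mild nausea only",
    "No problems at all",
]
side_effect_ranking = rank_side_effects(texts, ["nausea", "headache", "dizziness"])
print(similar_ranking, side_effect_ranking)
```

Note that `drug_d` has the highest effective rate but is excluded because it is not among the two most similar drugs, which mirrors the order of operations in the text: similarity first, then effective rate.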

Example 2

Based on the intelligent drug efficacy evaluation method described in the above embodiment, a machine-learning-based intelligent drug efficacy evaluation application is developed. The back end of the application includes a database that collects and manages the different data sources described above; the middle tier contains the intelligent drug efficacy evaluation method, with support for model parameter tuning and real-time monitoring; and the front end implements the following functions:

1) Evaluate the efficacy of a given drug: the drug name is entered, and the drug's effectiveness score, its ranking among similar drugs, and its side-effect ranking are returned.

The drug effectiveness score is the effective rate of the drug obtained from the drug efficacy evaluation model.

2) Find drugs for a given disease or symptom: the names of one or more diseases or symptoms are entered, and the effectiveness score, ranking, and side-effect ranking of each corresponding drug are returned.

When a disease or symptom is entered, the system uses the mapping between drugs and target diseases or symptoms obtained in step 101 of Example 1 to find the corresponding drugs, and the model calculates each drug's effectiveness score, ranking, and potential side-effect ranking.

3) Rank the effectiveness of a drug, its similar drugs, and their side effects for multi-dimensional populations that differ in age, gender, ethnicity, marital and reproductive status, region, and so on.

The population information is used as a filter when calculating the drug's effectiveness and when retrieving similar drugs and their side-effect rankings.
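Using population information as a filter can be sketched as a subgroup effective rate. The record fields (`drug`, `gender`, `age_group`, `label`) and the sample records are hypothetical; the patent lists the kinds of population attributes but not a schema.

```python
def subgroup_effective_rate(records, **criteria):
    """Effective rate among records matching every criterion, or None if none match."""
    matched = [
        record for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]
    if not matched:
        return None
    return sum(record["label"] == "effective" for record in matched) / len(matched)


# Hypothetical labeled records with population attributes.
records = [
    {"drug": "metronidazole", "gender": "F", "age_group": "18-39", "label": "effective"},
    {"drug": "metronidazole", "gender": "F", "age_group": "40-64", "label": "ineffective"},
    {"drug": "metronidazole", "gender": "M", "age_group": "18-39", "label": "effective"},
    {"drug": "ornidazole", "gender": "F", "age_group": "18-39", "label": "effective"},
]

print(subgroup_effective_rate(records, drug="metronidazole", gender="F"))
print(subgroup_effective_rate(records, drug="metronidazole", age_group="65+"))
```

Returning `None` for an empty subgroup keeps "no data" distinct from an effective rate of zero, which matters when subgroups become small.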

Claims

1. An intelligent drug efficacy evaluation method based on machine learning, characterized by comprising the following steps:

1) extracting the mapping between each drug and its target diseases or symptoms: supposing the drugs are $D_i$, $i = 1, \dots, I$, drug $D_i$ has $J$ target diseases or symptoms $T_{ij}$, $j = 1, \dots, J$, and $K$ potential side effects $S_{ik}$, $k = 1, \dots, K$; calculating the similarity index between the drugs $D_i$;

2) grouping the online drug reviews, hospital medical records, and follow-up records by the target diseases or symptoms $T_{ij}$ from step 1), and labeling each review and medical record as 'effective' or 'ineffective';

3) structuring the text data:

a) extracting multi-dimensional population information such as age, gender, ethnicity, marital and reproductive status, and region,

b) extracting feature vectors: extracting feature words or phrases from the online drug reviews, medical records, and follow-up records to obtain feature vectors;

4) dividing the structured data set converted from the text data into a training set and a validation set in a given proportion;

5) selecting several algorithms as classifiers for predicting the binary classification problem;

6) establishing different feature selection schemes, and selecting the combination of classifier and feature set with the best prediction performance;

7) using the best scheme obtained by training in step 6), calculating the effective rate of drug $D_i$ for its target diseases or symptoms $T_{ij}$; obtaining the ranking of the drug's effective rate among similar drugs according to the drug similarity index calculated in step 1); and extracting the feature words for the potential side effects $S_{ik}$ of the drug from step 1) in the data set and ranking them.

2. The intelligent drug efficacy evaluation method according to claim 1, wherein the labeling in step 2) is automatic: sentiment analysis is performed on the text, and VADER (Valence Aware Dictionary and sEntiment Reasoner) scores each sentence from -1 (negative) to 1 (positive), with 0 a neutral opinion.

3. The method for evaluating the curative effect of an intelligent drug according to claim 2, wherein the labeling in step 2) is automated and then manually checked.

4. The intelligent drug efficacy evaluation method according to claim 1, wherein the training set and the validation set in step 4) are divided in a ratio of 8:2 or 7:3.

5. The method for evaluating the curative effect of an intelligent drug according to claim 1, wherein the classifiers in the step 5) are four types:

a)OneVsRest SVM,

b) Logistic Regression,

c) Random Forest,

d) Bagging meta-estimator with Logistic Regression base.

6. The intelligent drug efficacy evaluation method according to claim 1, wherein the feature variables in step 6) are obtained by combining term occurrence counts (Count), term frequency-inverse document frequency (tf-idf, Tfidf), and the VADER score, such as:

FS-1: CountVectorizer,

FS-2: CountVectorizer + VADER score,

FS-3: CountVectorizer top 10000 features + VADER score,

FS-4: TfidfVectorizer,

FS-5: TfidfVectorizer + VADER score,

FS-6: TfidfVectorizer top 10000 features + VADER score.

7. The intelligent drug efficacy evaluation method according to claim 1, wherein the best prediction scheme in step 6) is evaluated by F1-score, F1 score = 2 * (Recall * Precision) / (Recall + Precision); where Recall = true positive / (true positive + false negative) and Precision = true positive / (true positive + false positive).

8. The intelligent drug efficacy evaluation method according to claim 1, wherein the drug similarity index in step 1) is calculated as follows: supposing drug $D_a$ has the set of target diseases or symptoms $T_a$ and drug $D_b$ has the set $T_b$, then

$$\mathrm{Sim}(D_a, D_b) = \frac{|T_a \cap T_b|}{|T_a \cup T_b|}.$$

9. A machine learning based intelligent drug efficacy assessment application that can perform multiple functions according to the method of any of claims 1 to 8, comprising:

function 1) evaluating the curative effect of a certain medicine, inputting the name of the medicine, and obtaining the effectiveness score of the medicine, the ranking in the same kind of medicine and the ranking of side effects;

function 2) searching corresponding medicines for a certain disease or symptom, inputting names of single or multiple diseases or symptoms, and obtaining effectiveness scores, ranking and side effect ranking of each medicine of the single or multiple corresponding medicines;

function 3) ranking the effectiveness of the medicine, the similar medicines and the side effects thereof aiming at multi-dimensional people of different ages, sexes, ethnicities, marriage and childbirth, regions and the like.