CN108376567A

CN108376567A - A clinical drug-drug adverse reaction detection method based on a label propagation algorithm

Info

Publication number: CN108376567A
Application number: CN201810010035.1A
Authority: CN
Inventors: 张强; 魏小鹏; 燕智策; 赵腊生
Original assignee: Dalian University
Current assignee: Dalian University
Priority date: 2018-01-05
Filing date: 2018-01-05
Publication date: 2018-08-07
Anticipated expiration: 2038-01-05
Also published as: CN108376567B

Abstract

The invention relates to a clinical drug-drug adverse reaction detection method based on a label propagation algorithm. For a given drug sample set, the method reconstructs the label propagation scheme using a new similarity measure and a new label initialization, and applies it to the detection of adverse drug reactions. First, drug features are filtered with the chi-square (CHI) method, and the features carrying more information are selected. Second, a new sample similarity is constructed from the sample label similarity and the sample similarity adjusted by the Laplace operator. Third, the initialization information of unknown-label samples is established from the information of known-label samples. Finally, drugs with adverse reactions are detected by label propagation. By reconstructing the drug similarity calculation and the label propagation scheme, the invention makes the similarity between drugs more accurate and label propagation smoother, and can effectively improve the detection of adverse drug reactions at the clinical stage.

Description

Label propagation algorithm-based clinical drug-drug adverse reaction detection method

Technical Field

The invention relates to the field of drug safety detection, and in particular to a clinical drug-drug adverse reaction detection method based on a label propagation algorithm.

Background

Traditional drug safety detection methods, including frequency methods (the proportional reporting ratio (PRR), the reporting odds ratio (ROR) and the composite criteria method of the MHRA) and Bayesian methods (the Bayesian confidence propagation neural network (BCPNN) and the multi-item gamma-Poisson shrinker (MGPS)), are all used to detect adverse reactions of drugs already on the market. In practice, drugs should be tested before they reach the market, so that unsafe drugs, which can cause other diseases or even the death of the patient, are kept off the market. In recent years, with the rise of big data, big data methods have also been used in the medical field to test new drugs. The main detection methods fall into two categories: similarity-based approaches and classification-model-based approaches. The similarity-based approach assumes that similar drugs have the same effects. The classification-model-based approach treats the drug detection problem as a binary classification problem and applies traditional data mining or machine learning methods. Currently, researchers in this field tend to prefer similarity-based methods, because these methods are better able to explain the causes of adverse drug reactions and can also achieve higher detection capability than classification-model-based methods.

Although researchers have contributed a great deal to drug testing with similarity-based methods, many new drugs that reach the market still cause adverse reactions. This is because similarity-based methods cannot classify drugs accurately when the similarity between drugs is used directly (the classes overlap heavily). The Label Propagation Algorithm (LPA) is an adverse drug reaction detection algorithm built on the similarity approach. Starting from samples with known labels, it uses the similarity between samples directly to propagate labels iteratively until the label information of the samples converges, which yields the adverse reaction information of the samples under test. However, this method inherits the drawbacks of the similarity approach, and also has shortcomings in how feature information is selected from the sample data and in how the labels of unlabeled samples are initialized.

Disclosure of Invention

The invention provides a clinical drug-drug adverse reaction detection method based on a label propagation algorithm. It adjusts the label propagation algorithm in three respects, namely the features of the given data, the similarity of the data samples and the initialization of the sample labels, so as to remedy the shortcomings of the drug similarity method and of the label propagation algorithm.

The technical solution adopted by the invention is a clinical drug-drug adverse reaction detection method based on a label propagation algorithm, comprising the following steps:

step 1: filtering the drug features with the chi-square (CHI) method and selecting the features with larger information content;

step 2: constructing a new sample similarity from the sample label similarity and the sample similarity adjusted by the Laplace operator;

step 3: establishing the initialization information of unknown-label samples based on the information of known-label samples;

step 4: integrating steps 1, 2 and 3 to obtain a new label propagation algorithm, and using this algorithm to obtain the detection result for the samples to be identified.

Steps 1, 2 and 3 comprise the following specific sub-steps:

(1) the drug data set consists of two parts: a drug sample dataset and a drug label dataset. In the drug sample dataset, each drug is represented by a 1 × N binary vector, where N represents the total number of samples. In the drug label dataset, each drug is represented by a 1 × c vector, where c represents both the number of known-label samples and the number of labels per sample; the drug label dataset is denoted Y;

(2) on the drug training data set, the CHI method is used to compute a value for each sample feature, and the features with large information content are selected from all the drug data:

wherein the terms of the formula are: the frequency with which feature t_i occurs in class c_k; the degree to which features collectively appear in a given category; a, the number of samples in category c_k that contain feature t_i; b, the number of samples outside category c_k that contain feature t_i; c, the number of samples in category c_k that do not contain feature t_i; d, the number of samples outside category c_k that do not contain feature t_i; and N = a + b + c + d, the total number of samples;
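The counts a, b, c and d defined above are the cells of a 2 × 2 contingency table, so they can be illustrated with the standard chi-square statistic. The following is a minimal sketch under that assumption: it computes the textbook statistic N(ad - bc)^2 / ((a + b)(c + d)(a + c)(b + d)) and does not include the frequency and concentration terms of the invention's formula, which are not reproduced in this text. The function name `chi_square_scores` is hypothetical.

```python
import numpy as np

def chi_square_scores(X, y):
    """Standard chi-square score of each binary feature for one class c_k.

    X: (n_samples, n_features) array of 0/1 feature values.
    y: (n_samples,) array of 0/1 membership in class c_k.
    Assumption: textbook chi-square, not the patent's modified variant.
    """
    X = np.asarray(X, dtype=bool)
    y = np.asarray(y, dtype=bool)
    n = X.shape[0]
    in_class = y[:, None]
    a = (X & in_class).sum(axis=0)      # in class, contains feature
    b = (X & ~in_class).sum(axis=0)     # outside class, contains feature
    c = (~X & in_class).sum(axis=0)     # in class, lacks feature
    d = (~X & ~in_class).sum(axis=0)    # outside class, lacks feature
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    # A zero marginal means the feature is constant; score it 0.
    return np.divide(num, den, out=np.zeros(X.shape[1]), where=den > 0)
```

A feature that perfectly separates the class from the rest receives the largest score, while a feature that is independent of the class, or constant across all drugs, receives a score of 0.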

(3) computing the Laplace-adjusted sample similarity matrix A from the drug features selected in step (2):

wherein s_i and s_j represent the feature vectors of the i-th and j-th drug samples.
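The adjustment formula itself is not reproduced in this text, so the following sketch shows only one plausible reading, stated here as an assumption: additive (Laplace) smoothing applied to the Jaccard (Tanimoto) coefficient of two binary feature vectors. The function name `smoothed_tanimoto` and the smoothing constant `alpha` are hypothetical and do not come from the source.

```python
import numpy as np

def smoothed_tanimoto(X, alpha=1.0):
    """Pairwise Tanimoto similarity of binary rows with additive smoothing.

    X: (n_samples, n_features) array of 0/1 values.
    alpha: hypothetical smoothing constant (assumption, not from the source);
           alpha=0 gives the plain coefficient and requires non-empty rows.
    """
    X = np.asarray(X, dtype=float)
    inter = X @ X.T                                     # |s_i AND s_j|
    counts = X.sum(axis=1)
    union = counts[:, None] + counts[None, :] - inter   # |s_i OR s_j|
    return (inter + alpha) / (union + 2.0 * alpha)
```

With alpha > 0 the denominator is never zero, which avoids an undefined similarity for drugs whose feature vectors are all zeros; the price is that self-similarity is pulled slightly below 1.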

(4) computing the label similarity matrix C of the drugs:

wherein the terms of the formula are: the weight of the t-th label; N_p, the total number of samples; N_t, the number of occurrences of the t-th label among the sample labels; l, a 1 × n vector; the t-th label of the i-th sample's label vector; the subset of all labeled samples contained in the k-ξ neighbor set of the unknown-label sample x_j; and the average of the previous three cases, which gives the similarity between the labels of unlabeled samples.

(5) constructing the new similarity matrix S: S = TC.*C, where TC is the adjusted sample similarity and .* denotes element-wise multiplication.

(6) reconstructing the label initialization information of the unknown-label samples from the label information of the known-label samples and the similarity matrix A:

wherein P_diff represents the probability of a reaction among known-label samples whose similarity is less than 0.5, and P_sim represents the probability of a reaction among known-label samples whose similarity is greater than 0.5.

The step 4 comprises the following specific steps:

(1) iteratively normalizing the similarity matrix S with the Bregmanian Bi-Stochastication (BBS) algorithm to obtain the converged normalized matrix W.

(2) detecting the drugs with the label propagation algorithm, using the normalized matrix W from (1) and the label initialization from sub-step (6):

F = (1 - u)(I - uW)^-1 Y

wherein u is the fraction of label information that a drug obtains from other drugs, 1 - u is the fraction of its own label information that the drug retains, and I denotes the N × N identity matrix.
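The propagation rule F = (1 - u)(I - uW)^-1 Y stated in claim 3 is the fixed point of the iteration F ← uWF + (1 - u)Y. The sketch below, offered as an illustration rather than the inventors' implementation, computes both forms so that they can be compared; the function names are hypothetical, and Y0 stands for the label matrix after initialization.

```python
import numpy as np

def propagate_labels(W, Y0, u=0.97):
    """Closed form F = (1 - u) (I - u W)^-1 Y0, solved without an explicit inverse."""
    W = np.asarray(W, dtype=float)
    Y0 = np.asarray(Y0, dtype=float)
    n = W.shape[0]
    return (1.0 - u) * np.linalg.solve(np.eye(n) - u * W, Y0)

def propagate_labels_iterative(W, Y0, u=0.97, n_iter=2000):
    """Same result by iteration: each drug takes a share u of label
    information from its neighbours and keeps a share 1 - u of its own."""
    W = np.asarray(W, dtype=float)
    Y0 = np.asarray(Y0, dtype=float)
    F = Y0.copy()
    for _ in range(n_iter):
        F = u * (W @ F) + (1.0 - u) * Y0
    return F
```

Because W is normalized so that its rows sum to 1, the iteration contracts by a factor of u per step and converges for any u < 1; solving the linear system directly gives the same F in one step.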

Building on the label propagation model, the invention addresses data features, sample similarity and sample label initialization by applying, respectively, the CHI feature extraction method, the combined Laplace operator and label similarity method, and a custom initialization method for unknown-label samples. Theoretical analysis and experiments show that the improved model is better suited to detecting adverse drug reaction events.

Drawings

Fig. 1 is a flow chart illustrating a label propagation method integrating multiple modes.

Detailed Description

As shown in Fig. 1, to improve the label propagation algorithm and detect experimental drugs effectively, the method proceeds as follows. First, a drug data set is obtained, the sample features in it are filtered with the CHI feature extraction method, and the features with large information content are selected. Second, the Jaccard (Tanimoto) coefficient (TC) is improved with the Laplace adjustment and used to compute the sample similarity of the drugs; the label similarity of the drugs is computed with the label similarity method; and the drug similarity is reconstructed from the sample similarity and the label similarity. Third, the drug similarity matrix is normalized with the BBS algorithm to obtain the normalized similarity matrix. Finally, the label information of the test samples is initialized from the label information of the training drugs, labels are propagated iteratively until the label information of the samples converges, and the result is scored with an evaluation method.

The invention is described in detail below with reference to examples and figures:

The experimental data come from the FAERS DDI dataset and the Chemical structure dataset. From the FAERS DDI dataset, 645 drugs and 63,473 adverse reaction records between these drugs can be mined; these are referred to as the DDI dataset. Chemical structure data for the same 645 drugs are obtained from the Chemical structure dataset, with each drug represented by an 881-dimensional {0,1} vector. Drugs with identical chemical structure representations are preprocessed so that the data used in the experiment are all distinct. 5-fold cross-validation is performed on the preprocessed data. The specific process is as follows:

Step one: performing initial preprocessing on the acquired drug feature data. Among drugs with identical features, one is kept at random and the others are deleted; feature columns that take only one value are also deleted. This leaves 638 drugs with 616 features. In the experiment, the 638 drugs are randomly divided into 5 equal parts with a cross-validation function (yielding the corresponding feature and label matrices); in each run, one part is held out as the test set and the remaining drugs form the training set.
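The random 5-fold partition described in step one can be sketched as follows. The source does not name the cross-validation function it uses, so the generator `five_fold_indices` and its `seed` parameter are hypothetical; since 638 is not divisible by 5, the folds here have 128 or 127 drugs each.

```python
import numpy as np

def five_fold_indices(n_samples=638, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs for a random k-fold split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != k]
        )
        yield train_idx, test_idx
```

Each drug index appears in exactly one test fold, so the rows of the feature matrix and label matrix can be selected with `train_idx` and `test_idx` to obtain the matching training and test matrices.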

1. Screening all the drug features in the training drug data set with the CHI method:

wherein χ²(t_i, C_k) represents the amount of information that feature t_i carries in class C_k, and the averaged term of the formula represents the average amount of information of feature t_i over all classes. If the selection condition of the formula is satisfied, feature t_i is selected; otherwise, feature t_i is deleted.

2. For the unknown-label samples, the label initialization information is constructed as follows:

Step two: computing the sample similarity matrix A, the label similarity matrix C and the resulting new similarity matrix S:

S = TC.*C;

wherein l_i and l_j represent the feature vectors of the i-th and j-th samples. The label similarity matrix C is computed with a k-ξ nearest neighbor method, where k = 2 is the number of nearest neighbors and ξ = 0.80 is the neighbor threshold.

Step three: computing the normalized matrix W from the similarity matrix S obtained in step two:

wherein l is an n × 1 vector whose elements are all 1, and W⁺ denotes the positive part of the matrix W.
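The update formula for W is not reproduced in this text. The sketch below therefore assumes the Euclidean-distance form of Bregmanian bi-stochastication, which is consistent with the all-ones vector l and the positive part W⁺ mentioned above: it alternates a closed-form projection onto symmetric matrices whose rows sum to 1 with a clipping of negative entries to zero. The function name `bbs_normalize` and its parameters `n_iter` and `tol` are hypothetical.

```python
import numpy as np

def bbs_normalize(S, n_iter=1000, tol=1e-10):
    """Alternating-projection sketch of bi-stochastic normalization.

    Assumption: Euclidean BBS update, not confirmed by the source text.
    """
    S = np.asarray(S, dtype=float)
    W = (S + S.T) / 2.0                          # start from a symmetric matrix
    n = W.shape[0]
    for _ in range(n_iter):
        row = W.sum(axis=1, keepdims=True)       # W l
        col = W.sum(axis=0, keepdims=True)       # l^T W
        total = W.sum()                          # l^T W l
        # Projection onto {W : W = W^T, W l = l}.
        P = W + (1.0 / n + total / n**2) - row / n - col / n
        # Projection onto the non-negative matrices (positive part).
        P = np.maximum(P, 0.0)
        converged = np.abs(P - W).max() < tol
        W = P
        if converged:
            break
    return W
```

The result is symmetric and non-negative with row sums close to 1, which is the property the label propagation step relies on.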

Step four: performing label propagation with the label propagation algorithm:

wherein u is the fraction of label information that a drug obtains from other drugs, 1 - u is the fraction of its own label information that the drug retains, and I denotes the 638 × 638 identity matrix. In the experiment, the optimal value is u = 0.97, and the detection results are shown in Table 1.

Table 1. Comparison of the detection performance of the conventional method and the method of the invention

Model	AUC	AUPR
Conventional label propagation algorithm	0.8063 ± 0.0050	0.6457 ± 0.0154
Proposed label propagation algorithm	0.8119 ± 0.0054	0.6522 ± 0.0163

Following the steps above, the conventional label propagation algorithm for drug detection is compared with the label propagation algorithm that integrates the several methods. As Table 1 shows, the method of the invention achieves higher AUC and AUPR than the conventional method.

In conclusion, the LPA method provided by the invention identifies adverse reactions well on the given drug data and is robust. The drug data are processed in terms of features, similarity and initial label values; the traditional similarity calculation is improved and the label initialization is adjusted, so that labels are easier to determine and detection accuracy increases.

The above describes only the preferred embodiment of the invention, but the scope of protection of the invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art to the technical solution and inventive concept, within the technical scope disclosed by the invention, falls within the scope of protection of the invention.

Claims

1. A clinical drug-drug adverse reaction detection method based on a label propagation algorithm, characterized by comprising the following steps:

step 1: filtering the drug features with the chi-square method and selecting the features with larger information content;

2. The clinical drug-drug adverse reaction detection method based on a label propagation algorithm as claimed in claim 1, wherein the model used for the feature filtering in step 1 is as follows:

the model used to construct the new sample similarity in step 2 is as follows:

S(i,j)＝TC(i,j)^after.*C(i,j)

wherein TC(i,j)^after represents the similarity between sample i and sample j, and C(i,j) represents the similarity between the sample labels, given by:

wherein the terms of the formula are: the weight of the t-th label; N_p, the total number of samples; N_t, the number of occurrences of the t-th label among the sample labels; l, a 1 × n vector; the t-th label of the i-th sample's label vector; the subset of all labeled samples contained in the k-ξ neighbor set of the unknown-label sample x_j; and the average of the previous three cases, which gives the similarity between the labels of unlabeled samples;

the model used to initialize the label information of the unknown-label samples in step 3 is as follows:

3. The clinical drug-drug adverse reaction detection method based on a label propagation algorithm as claimed in claim 2, wherein in step 3 the labels are propagated according to F = (1 - u)(I - uW)^-1 Y to obtain the detection result.