Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, the present invention provides a method for screening false correlation signals in adverse drug reaction (ADR) report data for combined medication, comprising the following steps:
S1, a data acquisition and preprocessing step: acquiring ADR data and standardizing the adverse reaction names and drug names in the ADR data;
S2, a data splitting step: splitting reports in the preprocessed ADR data that involve multiple drugs with multiple adverse reactions, or a single drug with multiple adverse reactions, into one-to-one drug-adverse reaction combinations;
S3, a frequency statistics and analysis step: counting the frequency of occurrence of each combination in the ADR data and performing correlation analysis;
S4, establishing a simple pseudo-correlation model;
S5, establishing a pseudo-correlation signal proportional imbalance model;
S6, a signal detection step: applying an ADR signal detection method to the data before and after optimization with the simple pseudo-correlation model and the pseudo-correlation signal proportional imbalance model;
S7, an effectiveness evaluation step: evaluating the effectiveness of false association screening for combined medication, using the adverse reaction warnings in the drug package inserts as the known reference library to judge the signal detection results before and after screening.
Preferably, in the data acquisition and preprocessing step of S1, the ADR data are the ADR data of the national center.
Preferably, the data splitting step of S2 specifically includes: first separating the combined-medication data from the single-medication data in the ADR data; splitting the n adverse reactions caused by a single drug into n drug-adverse reaction combinations, one by one; and splitting the i drugs and j adverse reactions of a combined-medication report into i x j drug-adverse reaction combinations in one-to-one correspondence.
Preferably, the frequency statistics and analysis step of S3 specifically includes:
counting the occurrence frequency of each record in the split ADR data, and constructing a combined-medication data set with four attributes: drug name, adverse reaction name, frequency of occurrence under single-drug use, and degree of association between the drug and the adverse reaction. The correlation between each drug and each adverse reaction is then calculated. It is assumed that, when a drug is used alone, each adverse reaction occurs completely independently, so that the number of reports of an adverse reaction under single-drug use reflects how likely the drug is to cause that reaction. On this basis, a correlation coefficient $R_{D_1}(A_1)$ is defined between drug $D_1$ and adverse reaction $A_1$.
The correlation between drug $D_1$ and adverse reaction $A_1$ under single-drug use is then calculated with the total probability formula, wherein $P(A_1 \mid D_1)$ is the probability that adverse reaction $A_1$ occurs when drug $D_1$ is used, $P(A_1 \mid D_1) = R_{D_1}(A_1)$, and the complementary term is the probability that adverse reaction $A_1$ occurs when drug $D_1$ is not used.
Preferably, the step of establishing the simple pseudo-correlation model in S4 specifically includes:
supposing that, when the number of reports in which a drug $D_k$ causes adverse reaction $A_i$ is less than or equal to X, the combination of drug $D_k$ and adverse reaction $A_i$ is regarded as a pseudo-correlation, the simple pseudo-correlation model is obtained as
$$N_{D_k}(A_i) \le X$$
wherein $N_{D_k}(A_i)$ is the number of occurrences of adverse reaction $A_i$ caused by drug $D_k$ when administered alone.
Preferably, in the step of establishing the simple pseudo-correlation model in S4, when adverse reaction $A_1$ occurs fewer than 3 times, drug $D_1$ can be considered to produce adverse reaction $A_1$ only occasionally and the combination is disregarded; combined with the formula above, the simple pseudo-correlation model can be refined to
$$N_{D_1}(A_1) < 3$$
Preferably, the step of establishing the pseudo-correlation signal proportional imbalance model in S5 specifically includes:
S51, calculating the proportional imbalance $d_{D_1}(A_1)$ between the mapping of drug $D_1$ to adverse reaction $A_1$ and the combination with the maximum correlation coefficient, wherein $\mathrm{MAX}(R_{D_1}(A_1), R_{D_2}(A_1), R_{D_3}(A_1))$ denotes the maximum of the correlation coefficients of drugs $D_1$, $D_2$, and $D_3$ with adverse reaction $A_1$;
S52, generalizing the calculation in S51 to obtain the pseudo-correlation signal $d_k(A_1)$ for the mapping of a drug K to adverse reaction $A_1$ in combined medication, wherein $\mathrm{MAX}(R_{D_1}(A_1), R_{D_2}(A_1), \ldots)$ denotes the maximum of the correlation coefficients between the co-administered drugs and adverse reaction $A_1$, and $R_k(A_1)$ is the correlation coefficient between drug K and adverse reaction $A_1$;
S53, under combined medication, judging whether drug K and adverse reaction $A_1$ form a false correlation pair, the judgment formula being
$$N_k(A_1) = 0 \;\mid\; d_k(A_1) \ge 2$$
wherein "|" is the logical OR operation, $N_k(A_1)$ represents the number of occurrences of adverse reaction $A_1$ when drug K is taken alone, $d_k(A_1)$ represents the pseudo-correlation signal of drug K and the adverse reaction, and adverse reaction $A_1$ can be any adverse reaction caused by combined medication.
Preferably, in the signal detection step of S6, signal detection is performed with the proportional reporting ratio (PRR) algorithm: when a drug-adverse reaction combination satisfies PRR ≥ 2 and χ² ≥ 4, the combination is regarded as a positive adverse reaction signal.
Preferably, in the effectiveness evaluation step of S7, the evaluation indexes include precision, recall, and the F index.
Compared with the prior art, the invention has the advantages that:
The correlation model provided by the invention can effectively measure the correlation between co-administered drugs and adverse reactions. The pseudo-correlation measurement model built on it provides effective guidance for deleting false correlations, improves the quality of combined-medication data, and improves the accuracy and efficiency of adverse reaction signal detection, thereby providing a reference for detecting adverse reaction signals in combined medication.
In addition, the invention provides a reference for other related problems in the same field; it can be extended on this basis and applied to the technical solutions of other signal screening methods in the same field, and has broad application prospects.
The following detailed description of the embodiments of the present invention is provided in connection with the accompanying drawings to facilitate understanding of the technical solutions of the present invention.
Detailed Description
As shown in figures 1-4, the invention discloses a method for screening false correlation signals in adverse drug reaction (ADR) report data for combined medication, which comprises the following steps:
S1, a data acquisition and preprocessing step: acquiring ADR data and standardizing the adverse reaction names and drug names in the ADR data.
The ADR data here are the ADR data of the national center.
S2, a data splitting step: splitting reports in the preprocessed ADR data that involve multiple drugs with multiple adverse reactions, or a single drug with multiple adverse reactions, into one-to-one drug-adverse reaction combinations.
Specifically, the combined-medication data and the single-medication data in the ADR data are first separated; the n adverse reactions caused by a single drug are split into n drug-adverse reaction combinations, one by one; and the i drugs and j adverse reactions of a combined-medication report are split into i x j drug-adverse reaction combinations in one-to-one correspondence.
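The splitting rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `split_report` and the representation of a report as two lists are assumptions made for the example, not part of the described method.

```python
from itertools import product


def split_report(drugs, reactions):
    """Expand one report into every one-to-one (drug, reaction) pair.

    A single-drug report with n reactions yields n pairs; a combined
    report with i drugs and j reactions yields i * j pairs.
    """
    return [(drug, reaction) for drug, reaction in product(drugs, reactions)]


# Hypothetical reports used only for illustration.
single = split_report(["D1"], ["A1", "A2"])
combined = split_report(["D1", "D2"], ["A1", "A2", "A3"])

print(single)
print(len(combined))
```

Keeping the pairs as plain tuples makes them directly usable as dictionary keys in the later frequency-counting step.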
S3, a frequency statistics and analysis step: counting the frequency of occurrence of each combination in the ADR data and performing correlation analysis.
Specifically, the occurrence frequency of each record in the split ADR data is counted, including the frequency of single-medication adverse reaction combinations and the frequency of combined-medication adverse reaction combinations.
The frequency of single-medication adverse reaction combinations is counted as follows: adverse reactions caused by drug $D_1$ are recorded as $D_1 \to A_1$, $D_1 \to A_2$, with frequencies $N_{D_1}(A_1)$, $N_{D_1}(A_2)$; adverse reactions not caused by drug $D_1$ are recorded and counted analogously.
The frequency of combined-medication adverse reaction combinations is counted as follows: when drugs $D_1$ and $D_2$ used together cause adverse reactions $A_1$ and $A_2$, the frequencies are counted on the split data and recorded in the following form: the frequency with which $D_1$ causes $A_1$ is $N^{*}_{D_1}(A_1)$, the frequency with which $D_2$ causes $A_1$ is $N^{*}_{D_2}(A_1)$, and so on.
A combined-medication data set is constructed as a data table with four attributes: drug name, adverse reaction name, frequency of occurrence under single-drug use, and degree of association between the drug and the adverse reaction. The correlation between each drug and each adverse reaction is then calculated. It is assumed that, when a drug is used alone, each adverse reaction occurs completely independently, so that the number of reports of an adverse reaction under single-drug use reflects how likely the drug is to cause that reaction. On this basis, a correlation coefficient $R_{D_1}(A_1)$ is defined between drug $D_1$ and adverse reaction $A_1$.
The correlation between drug $D_1$ and adverse reaction $A_1$ under single-drug use is then calculated with the total probability formula, wherein $P(A_1 \mid D_1)$ is the probability that adverse reaction $A_1$ occurs when drug $D_1$ is used, $P(A_1 \mid D_1) = R_{D_1}(A_1)$, and the complementary term is the probability that adverse reaction $A_1$ occurs when drug $D_1$ is not used.
By the same method, the correlation coefficient between drug $D_2$ and adverse reaction $A_1$ can be calculated. Extending the formula to the general case in the single-medication adverse reaction database gives the correlation coefficient between any drug k and adverse reaction $A_i$, where N is the number of all adverse reactions in the ADR data.
S4, establishing a simple pseudo-correlation model.
Specifically, supposing that, when the number of reports in which a drug $D_k$ causes adverse reaction $A_i$ is less than or equal to X, the combination of drug $D_k$ and adverse reaction $A_i$ is regarded as a pseudo-correlation, the simple pseudo-correlation model is obtained as
$$N_{D_k}(A_i) \le X$$
wherein $N_{D_k}(A_i)$ is the number of occurrences of adverse reaction $A_i$ caused by drug $D_k$ when administered alone.
In general, when adverse reaction $A_1$ occurs fewer than 3 times, drug $D_1$ can be considered to produce adverse reaction $A_1$ only occasionally and the combination is disregarded; combined with the formula above, the simple pseudo-correlation model can be refined to
$$N_{D_1}(A_1) < 3$$
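As a rough illustration of the simple model, the sketch below flags every combined-medication pair whose single-use report count is below the threshold of 3. The function name `simple_pseudo_pairs`, the dictionary layout, and the sample counts are assumptions made for this example only.

```python
def simple_pseudo_pairs(single_freq, combined_pairs, threshold=3):
    """Return the combined-medication pairs flagged by the simple model.

    single_freq maps (drug, reaction) to the number of single-use reports;
    a pair absent from single_freq is treated as having zero reports.
    """
    return {
        pair for pair in combined_pairs
        if single_freq.get(pair, 0) < threshold
    }


# Hypothetical single-use counts, for illustration only.
single_freq = {("D1", "A1"): 12, ("D2", "A1"): 2}
combined_pairs = [("D1", "A1"), ("D2", "A1"), ("D3", "A1")]

flagged = simple_pseudo_pairs(single_freq, combined_pairs)
print(sorted(flagged))
```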
S5, establishing a pseudo-correlation signal proportional imbalance model.
Specifically, the method comprises the following steps:
S51, in a group of combined-medication data, the drug-adverse reaction combination with the maximum correlation coefficient is assumed to be truly associated, and whether another combination is falsely associated is judged by measuring the difference between its correlation coefficient and the maximum correlation coefficient.
The proportional imbalance $d_{D_1}(A_1)$ between the mapping of drug $D_1$ to adverse reaction $A_1$ and the combination with the maximum correlation coefficient is calculated, wherein $\mathrm{MAX}(R_{D_1}(A_1), R_{D_2}(A_1), R_{D_3}(A_1))$ denotes the maximum of the correlation coefficients of drugs $D_1$, $D_2$, and $D_3$ with adverse reaction $A_1$.
S52, the calculation in S51 is generalized to obtain the pseudo-correlation signal $d_k(A_1)$ for the mapping of a drug K to adverse reaction $A_1$ in combined medication, wherein $\mathrm{MAX}(R_{D_1}(A_1), R_{D_2}(A_1), \ldots)$ denotes the maximum of the correlation coefficients between the co-administered drugs and adverse reaction $A_1$, and $R_k(A_1)$ is the correlation coefficient between drug K and adverse reaction $A_1$.
S53, it is supposed that when $d_k(A_1) \ge 2$, the combination of drug K and adverse reaction $A_1$ is a false association. It is further supposed that a drug which does not produce adverse reaction $A_1$ when taken alone does not cause $A_1$; if $A_1$ nevertheless occurs when the drug is used together with other drugs, the pairing of this drug with $A_1$ is considered a false association.
Based on these hypotheses, whether drug K and adverse reaction $A_1$ form a false correlation pair under combined medication is judged by the formula
$$N_k(A_1) = 0 \;\mid\; d_k(A_1) \ge 2$$
wherein "|" is the logical OR operation, $N_k(A_1)$ represents the number of occurrences of adverse reaction $A_1$ when drug K is taken alone, $d_k(A_1)$ represents the pseudo-correlation signal of drug K and the adverse reaction, and adverse reaction $A_1$ can be any adverse reaction caused by combined medication.
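The judgment rule of S53 can be sketched as follows. Note the assumption: the text does not reproduce the formula for $d_k(A_1)$, so this sketch assumes it is the ratio of the largest correlation coefficient in the co-administered group to the coefficient of drug K, which is one reading consistent with a threshold of 2. The function names and sample values are likewise hypothetical.

```python
import math


def imbalance(drug, r_by_drug):
    """ASSUMED form of d_k(A1): max coefficient in the group / coefficient of `drug`.

    r_by_drug maps each co-administered drug to its correlation
    coefficient with one fixed adverse reaction.
    """
    r_k = r_by_drug[drug]
    if r_k == 0:
        return math.inf  # no single-use association at all
    return max(r_by_drug.values()) / r_k


def is_pseudo_correlation(drug, r_by_drug, single_use_count, threshold=2.0):
    """Apply the OR rule: N_k(A1) == 0 or d_k(A1) >= threshold."""
    return single_use_count == 0 or imbalance(drug, r_by_drug) >= threshold


# Hypothetical coefficients and single-use counts for one reaction A1.
r_a1 = {"D1": 0.4, "D2": 0.1, "D3": 0.3}
n_single = {"D1": 12, "D2": 5, "D3": 0}

flags = {d: is_pseudo_correlation(d, r_a1, n_single[d]) for d in r_a1}
print(flags)
```

In this example D1 holds the maximum coefficient and is kept, D2 is flagged by the ratio test, and D3 is flagged because it has no single-use reports of the reaction.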
S6, a signal detection step: applying an ADR signal detection method to the data before and after optimization with the simple pseudo-correlation model and the pseudo-correlation signal proportional imbalance model. For the original data and the optimized data, before and after screening respectively, signal detection is usually performed with the proportional reporting ratio (PRR) algorithm commonly used in China; when a drug-adverse reaction combination satisfies PRR ≥ 2 and χ² ≥ 4, the combination is regarded as a positive adverse reaction signal.
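A minimal sketch of the PRR criterion on a two-by-two table is given below. It assumes the usual cell layout (a: target drug with target reaction, b: target drug with other reactions, c: other drugs with target reaction, d: other drugs with other reactions) and the uncorrected Pearson chi-square; whether a continuity correction is applied is not stated in the text, so that choice is an assumption. The function names and counts are hypothetical.

```python
def prr_and_chi2(a, b, c, d):
    """Return (PRR, chi-square) for a 2x2 table; cells as described above."""
    n = a + b + c + d
    prr = (a / (a + b)) / (c / (c + d))
    # Pearson chi-square without continuity correction (an assumption).
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return prr, chi2


def is_positive_signal(a, b, c, d, prr_min=2.0, chi2_min=4.0):
    """Apply the thresholds PRR >= 2 and chi-square >= 4."""
    if c == 0 or a + b == 0 or b + d == 0:
        return False  # statistics undefined for this table in this sketch
    prr, chi2 = prr_and_chi2(a, b, c, d)
    return prr >= prr_min and chi2 >= chi2_min


# Hypothetical counts for illustration.
prr, chi2 = prr_and_chi2(20, 80, 50, 850)
print(round(prr, 2), round(chi2, 2))
print(is_positive_signal(20, 80, 50, 850))
```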
S7, an effectiveness evaluation step: evaluating the effectiveness of false association screening for combined medication, using the adverse reaction warnings in the drug package inserts as the known reference library to judge the signal detection results before and after screening. The evaluation indexes include precision, recall, and the F index.
The technical solution of the present invention is further described with reference to a specific embodiment.
First, the ADR data are analyzed and selected. A total of 1,823,144 ADR reports dated between January 1, 2010 and December 31, 2011 are obtained from the national adverse drug reaction monitoring center, of which 608,710 reports record many-to-many relations between multiple drugs and multiple adverse reactions; these data are split into one-to-one relations. After splitting, abnormal data are removed by deleting records that lack a drug name or an adverse reaction name. The western medicine data comprise 1,874,904 records. From the Jiangsu Province spontaneous adverse reaction reporting database, 16,383 adverse reaction reports are selected: 8,172 combined-medication records containing 3,264 drug combinations, and 8,212 single-medication records containing 547 drugs and 137 adverse reactions, so that combined medication accounts for 49.89 percent of the total. In this example, western medicine data are selected as the experimental subject.
The attribute fields of the adverse reaction report table in the database are shown in table 1, wherein bgbm is the code of each adverse reaction report, pzmc is the name of the drug causing the adverse reaction, and blfymc is the name of the adverse reaction in the report. The experiment needs only three fields (report code, drug name, and adverse reaction name), and these three fields are selected and recombined into a new original data set:
TABLE 1 adverse reaction report Attribute Table
Subsequently, data splitting is performed. Some records in the data set pair one drug with several adverse reactions; each such record is divided into multiple records, one per adverse reaction. For example, when drug $D_1$ used alone produces adverse reactions $A_1$ and $A_2$, the data format and mapping in the original data set before splitting are as shown in the figure. The record is split into two data records, and the structure after splitting is shown in fig. 3.
The split records, each pairing one drug with one adverse reaction, are stored as the single-medication data set. The drug name and adverse reaction name attributes of this data set are selected, a database grouping operation is performed on the two attributes to count the occurrence frequency of each drug-adverse reaction combination, and the resulting frequency is added to the single-medication data set as a new attribute. The data table after this addition is shown in table 2:
TABLE 2 Combined frequency statistical table for adverse reaction after single medication
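The grouping-and-counting operation described above can be sketched with the Python standard library. The text performs it as a database grouping operation; using `collections.Counter` on in-memory tuples is a substitution made for this example, and the records shown are hypothetical.

```python
from collections import Counter


def count_pairs(pairs):
    """Group identical (drug, reaction) pairs and count their occurrences."""
    return Counter(pairs)


# Hypothetical split single-medication records.
records = [("D1", "A1"), ("D1", "A2"), ("D1", "A1"), ("D2", "A1")]
freq = count_pairs(records)

for (drug, reaction), n in sorted(freq.items()):
    print(drug, reaction, n)
```

A `Counter` returns 0 for pairs that never occur, which matches the convention that an unseen drug-adverse reaction combination has zero single-use reports.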
When a general ADR signal mining algorithm is applied, it is assumed that $D_1$, $D_2$, and $D_3$ all produce adverse reaction $A_1$, and the interactions among $D_1$, $D_2$, and $D_3$ are disregarded; the splitting of this adverse reaction report under this assumption is shown in figure 4.
It is assumed that, when a drug is taken alone, each adverse reaction occurs completely independently, so that the number of reports of an adverse reaction under single-drug use reflects how likely the drug is to cause that reaction. The total probability formula is used to define the correlation between drug $D_1$ and adverse reaction $A_1$ under single-drug use.
In a group of combined-medication data producing a given adverse reaction, the drug-adverse reaction combination with the maximum correlation coefficient is assumed to be truly associated, so that false associations can be judged by measuring the difference between the correlation coefficients of the other combinations and the maximum correlation coefficient. If this difference is large enough to reach a certain threshold, the combination is considered a false correlation signal relative to the combination with the largest correlation coefficient, and the mapping of that drug to the adverse reaction is likely to be a false association.
The proportional imbalance coefficient $d_k(A_1)$ between the mapping of drug K to adverse reaction $A_1$ and the combination with the maximum correlation coefficient is then calculated.
When the proportional imbalance coefficient reaches a certain threshold, the association between the drug and adverse reaction $A_1$ is considered false. Experiments show that the effect is most pronounced when the threshold is equal to 2.
It is supposed that when $d_k(A_1) \ge 2$, the combination of drug K and adverse reaction $A_1$ is a false association. It is further supposed that a drug which does not produce adverse reaction $A_1$ when taken alone does not cause $A_1$; if $A_1$ nevertheless occurs when the drug is used together with other drugs, the pairing of this drug with $A_1$ is considered a false association. The false correlation judgment function is therefore constructed as follows:
$$N_k(A_1) = 0 \;\mid\; d_k(A_1) \ge 2$$
In the formula, "|" is the logical OR operation, $N_k(A_1)$ is the number of occurrences of adverse reaction $A_1$ when drug K is taken alone, and $d_k(A_1)$ is the false correlation signal of drug K and the adverse reaction. Here $A_1$ can be any adverse reaction arising under combined medication.
Then, false association deletion and data merging are carried out. First, the false associations in the combined-medication data set are removed by checking against the false association data set. Next, the three fields of drug, adverse reaction, and occurrence frequency are selected from both the combined-medication data set and the single-medication data set, and the selected records form a new data set. Finally, the drug and adverse reaction fields of the new data set are selected, the records are grouped on these two fields, and the sum of the occurrence frequencies within each group replaces the original occurrence frequency attribute, giving the optimized data set for signal mining. If the false associations are not removed, the three fields of the combined-medication and single-medication data sets are selected directly to form a new data set, giving the original, unoptimized data set.
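The deletion-and-merging procedure can be sketched as below: flagged pairs are dropped from the combined-medication counts, and the remaining counts are summed with the single-medication counts per (drug, reaction) group. The function name `build_dataset` and all sample values are assumptions made for this illustration.

```python
from collections import Counter


def build_dataset(single_freq, combined_freq, pseudo_pairs=frozenset()):
    """Merge single-use and combined-use counts, skipping flagged pairs.

    With an empty pseudo_pairs set this yields the original, unoptimized
    data set; with the flagged pairs it yields the optimized data set.
    """
    merged = Counter(single_freq)
    for pair, n in combined_freq.items():
        if pair not in pseudo_pairs:
            merged[pair] += n
    return merged


# Hypothetical counts for illustration.
single_freq = {("D1", "A1"): 5}
combined_freq = {("D1", "A1"): 2, ("D2", "A1"): 3}
pseudo_pairs = {("D2", "A1")}

original = build_dataset(single_freq, combined_freq)
optimized = build_dataset(single_freq, combined_freq, pseudo_pairs)
print(dict(original))
print(dict(optimized))
```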
Subsequently, an ADR signal detection algorithm is applied to the optimized data set for data mining. The ADR signal mining algorithm used in this embodiment is the proportional reporting ratio (PRR) algorithm commonly used in China. In this embodiment, adverse reaction signal mining is performed only on drug-adverse reaction combinations whose frequency of occurrence in the database is not less than 3, and the critical values for a positive adverse reaction signal are PRR ≥ 2 and χ² ≥ 4.
Finally, the results are analyzed. For each data set, PRR and χ² are calculated by formula from the four-fold table of the adverse reaction proportional imbalance algorithm; the drug-adverse reaction combinations satisfying PRR ≥ 2, χ² ≥ 4, and N ≥ 2 are then selected, and their drug and adverse reaction fields form the set of positive adverse drug reaction signals. Mining results are thus obtained for the two different data sets.
The results are evaluated mainly by precision, recall, and the F index, specifically:
1) Precision
Precision is defined with respect to the prediction results of the present invention and indicates how many of the samples predicted to be positive signals are true positive signals. The calculation formula is as follows:
$$\text{Precision} = \frac{TP}{TP + FP}$$
where TP is the number of positive signals predicted to be positive signals and FP is the number of negative signals predicted to be positive signals.
2) Recall
Recall is defined with respect to the positive signal samples in the data set and indicates how many of the positive signals in the samples were predicted correctly. The calculation formula is as follows:
$$\text{Recall} = \frac{TP}{TP + FN}$$
where TP is the number of original positive signals predicted to be positive signals and FN is the number of original positive signals predicted to be negative signals.
3) F index
The F index is the harmonic mean of precision and recall, and the calculation formula is as follows:
$$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
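The three indexes can be computed from the set of detected signals and the known reference library (the package-insert warnings) as sketched below. The function name `evaluate` and the sample signal sets are hypothetical and serve only to illustrate the definitions.

```python
def evaluate(detected, known):
    """Return (precision, recall, F) for detected signals against known ones."""
    tp = len(detected & known)   # known signals that were detected
    fp = len(detected - known)   # detected signals not in the library
    fn = len(known - detected)   # known signals that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_index = 2 * precision * recall / (precision + recall)
    return precision, recall, f_index


# Hypothetical signal sets for illustration.
detected = {("D1", "A1"), ("D1", "A2"), ("D2", "A1"), ("D3", "A4")}
known = {("D1", "A1"), ("D1", "A2"), ("D4", "A1")}

precision, recall, f_index = evaluate(detected, known)
print(precision, recall, f_index)
```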
The precision, recall, and F index values computed on the three data sets (the original data set, the simple-model optimized data set, and the proportional-imbalance optimized data set) are as follows:
Table 3 Evaluation of the mining results
Table 3 shows that the positive signals mined from the data set optimized with the simple pseudo-correlation model have higher precision than those mined from the original, unoptimized data set, but lower recall and F index. The main reason is that the simple pseudo-correlation model deletes every drug-adverse reaction combination that occurs fewer than 3 times in the single-medication data set, which reduces the number of samples in the optimized data and hence the total number of positive signals the algorithm finds. The lower recall shows that the number of true positive signals also fell, which indicates that the gain in precision comes from the smaller sample size. From this analysis, the simple pseudo-correlation model contributes little to eliminating false associations or improving data quality.
For the positive signals mined from the data set optimized with the proportional-imbalance pseudo-correlation model, the precision, recall, and F index are all higher than the corresponding values for the original data set, with the F index higher by 3 percentage points. The proportional-imbalance pseudo-correlation model can therefore effectively identify and remove false associations in drug combinations and effectively optimize the original data set, making it better suited to detecting positive adverse reaction signals.
In conclusion, the invention is based on Chinese ADR report data and studies the problem of screening false associations between drugs and adverse reactions in combined medication. It provides a simple pseudo-correlation model and a pseudo-correlation signal proportional imbalance model, and verifies the effectiveness of the screening with evaluation indexes. The final result is that, compared with signal mining on the original data set, the precision, recall, and F index of the positive signals mined from the data set screened with the pseudo-correlation signal proportional imbalance model improve by 2-3 percentage points. The pseudo-correlation signal proportional imbalance model can therefore effectively screen out false associations and improve the accuracy and efficiency of signal detection.
The correlation model can effectively measure the correlation between co-administered drugs and adverse reactions. The pseudo-correlation measurement model built on it provides effective guidance for deleting false correlations, improves the quality of combined-medication data, and improves the accuracy and efficiency of adverse reaction signal detection, thereby providing a reference for detecting adverse reaction signals in combined medication.
In addition, the invention also provides reference basis for other related problems in the same field, can be expanded and extended on the basis of the reference basis, is applied to the technical scheme of other signal screening methods in the same field, and has very wide application prospect.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein, and any reference signs in the claims are not intended to be construed as limiting the claim concerned.
Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single technical solution. The description is written this way for clarity only; those skilled in the art should treat the description as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments understandable to those skilled in the art.