CN109448808B

CN109448808B - Abnormal prescription screening method based on multi-view theme modeling technology

Info

Publication number: CN109448808B
Application number: CN201810992868.2A
Authority: CN
Inventors: 赵俊峰; 詹思延; 谢冰; 卓琳; 唐爽; 刘少钦
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2018-08-29
Filing date: 2018-08-29
Publication date: 2022-05-03
Anticipated expiration: 2038-08-29
Also published as: CN109448808A

Abstract

The invention discloses an abnormal prescription screening method based on multi-view topic modeling, comprising the following steps: 1) collating data from the medical system into prescription data, wherein each prescription record includes diagnosis features and medication features; 2) inputting the prescription data into an MV-LDA model for training; the MV-LDA model comprises K topics, and each topic comprises a diagnosis feature view and a medication feature view; the diagnosis feature view of topic k consists of a diagnosis feature set and a probability value corresponding to each diagnosis feature in the set, and the medication feature view consists of a medication feature set and a probability value corresponding to each medication feature in the set; 3) performing inference on the prescription data to be identified using the trained MV-LDA model to obtain a topic distribution based on the diagnosis features and a topic distribution based on the medication features; the similarity of the two topic distributions is then calculated to judge whether the prescription data to be identified is an abnormal prescription.

Description

Abnormal prescription screening method based on multi-view theme modeling technology

Technical Field

The invention belongs to the field of medical information processing, and relates to an abnormal prescription screening method based on a multi-view theme modeling technology.

Background

Existing anomaly detection algorithms in the medical field can be divided into supervised and unsupervised categories. Supervised methods typically apply machine learning to manually labeled medical data. For example, Kumar et al. used a supervised SVM method to detect recording errors in medical claim data, working on a high-quality dataset labeled with sufficient abnormal instances (Kumar M, Ghani R, Mei Z S. Data mining to predict and prevent errors in health insurance claims processing: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, July, 2010[C]). K. Heller et al. (Chandola V, Banerjee A, Kumar V. Anomaly detection: A survey[M]. ACM, 2009) assume that every instance belongs to a certain class, use an SVM to draw the boundary between the two classes of instances from the dataset, and regard any instance on the wrong side of the boundary as anomalous. However, since the high-quality labeled dataset required for supervised learning is very difficult to obtain, researchers have also proposed a series of unsupervised anomaly detection methods. Unsupervised methods typically abstract each instance to a point in a high-dimensional space and treat data points far from the other points as outliers. For example, Yamanishi et al. use an unsupervised PAD method based on a probabilistic generative model to detect abnormalities in pathological data (Yamanishi K, Takeuchi J I, Williams G, et al. On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms[J]. Data Mining and Knowledge Discovery, 2004, 8(3): 275-300); another example is the density-based LOF method proposed by M. M. Breunig et al. (Breunig M M. LOF: identifying density-based local outliers: ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA, 2000[C]).
In the medical field, however, such outliers are not necessarily abnormal data, because there are many rare diseases with low incidence; in fact, apart from some common diseases, the incidence of most diseases is very low, and outlier detection methods cannot deal with this problem. What is of greater interest is detecting instances whose features do not match one another, rather than instances that are merely rare. Contextual Anomaly Detection (CAD) is an unsupervised method that detects anomalies using the relationship between two classes of features. CAD divides the features into context features, denoted y, and indicator features, denoted x, and, assuming that most data is normal, learns a mapping function from x to y, y = f(x). If the two types of features of a piece of test data do not conform to y = f(x), the test data is considered abnormal. CAD methods have also been applied in medicine. For example, the solution of J. Hu et al. fits a regression model on an indicator attribute and a set of context features, and then uses the remaining test cases to determine outliers and identify abnormal medication cases in medical records (Hu J, Wang F, Sun J, et al. A Healthcare Utilization Analysis Framework for Hot Spotting and Contextual Anomaly Detection[J]. AMIA). However, because medical data is high-dimensional and sparse, the CAD method does not work well in the medical field, and it can only be used to detect mismatches between two types of features.

Disclosure of Invention

The invention provides an abnormal prescription detection method based on a multi-view topic model (MV-LDA). A topic model is a type of statistical model for describing the composition of unstructured text; in machine learning it is used to mine latent features, called "topics", from a collection of texts. Because topic models rely on the bag-of-words assumption, they treat all words as being of the same type, whereas the diagnoses and medications in a prescription are of two different types. The invention therefore proposes a multi-view topic model; its training process and the inference process for data are explained below.

The technical scheme of the invention is as follows:

an abnormal prescription screening method based on a multi-view theme modeling technology comprises the following steps:

1) arranging data from the medical system into normative prescription data, wherein each prescription data comprises diagnosis characteristics and medication characteristics in the prescription;

2) inputting the prescription data into an MV-LDA model, and training the MV-LDA model; the MV-LDA model comprises K topics, and each topic comprises a diagnosis feature view and a medication feature view; the diagnosis feature view of topic k consists of a diagnosis feature set and a probability value corresponding to each diagnosis feature in the set, and correspondingly, the medication feature view consists of a medication feature set and a probability value corresponding to each medication feature in the set;

3) for the data of a prescription to be identified, performing inference with the trained MV-LDA model to obtain the topic distribution of that prescription based on its diagnosis features and the topic distribution of that prescription based on its medication features; then calculating the similarity of the two topic distributions, and if the similarity is lower than a set threshold, judging the prescription to be identified to be an abnormal prescription.

Furthermore, solving of the MV-LDA model is carried out by using Gibbs sampling, and parameters in the MV-LDA model are calculated to obtain the well-trained MV-LDA model.

Further, the method for solving the MV-LDA model by using Gibbs sampling comprises the following steps: for prescription data m, when sampling the class A features in prescription data m, the probability that feature $x_a$ among the class A features is assigned topic k is:

$$P(z_i = k \mid z_{-i}, x) \propto \frac{C^{V^A K}_{x_a k} + \beta^A}{\sum_{x'} C^{V^A K}_{x' k} + V^A \beta^A} \cdot \frac{C^{MK}_{mk} + \alpha}{\sum_{k'} C^{MK}_{mk'} + K\alpha}$$

wherein C represents a count matrix, $V^A$ is the number of distinct class A features, and $x^A$ denotes the class A features; $C^{V^A K}_{x_a k}$ represents the count of $x_a$ assigned to topic k over all prescription data of the training data set, K represents the number of topics, and k represents the k-th of the K topics; $\sum_{x'} C^{V^A K}_{x' k}$ represents the total count of all class A features assigned to topic k, and $\beta^A$ is the Dirichlet prior; z is the topic assigned to feature $x_a$, and $z_{-i}$ represents the topics assigned to the remaining features; $C^{MK}_{mk}$ represents the number of features in prescription data m that are assigned topic k, $\sum_{k'} C^{MK}_{mk'}$ represents the number of all features in prescription data m, M is the total number of prescription records in the training data set, and α is the Dirichlet prior; the class A features are the diagnosis features or the medication features. The parameter values in the MV-LDA model are then obtained from the topics k assigned to the features $x_a$.

Further, the topic-feature distribution of the class A features is

$$\phi^A_{k,x} = \frac{C^{V^A K}_{x k} + \beta^A}{\sum_{x'} C^{V^A K}_{x' k} + V^A \beta^A}$$

wherein $\phi^A_{k,x}$ is the value of the topic-feature distribution of the class A features when the topic is k and the feature is x.
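
As a small worked illustration of the topic-feature distribution, the sketch below turns a matrix of topic assignment counts into probabilities. It assumes the usual smoothed-count estimate (count plus β, divided by the topic total plus V·β); the function name `topic_feature_distribution` and the toy counts are hypothetical and are not taken from the patent.

```python
import numpy as np

def topic_feature_distribution(counts, beta):
    """Smoothed topic-feature probabilities from assignment counts.

    counts[k, x] is the number of times class A feature x was assigned topic k.
    Assumption: phi[k, x] = (counts[k, x] + beta) / (sum_x counts[k, x] + V * beta).
    """
    counts = np.asarray(counts, dtype=float)
    n_features = counts.shape[1]  # V^A, the number of distinct class A features
    totals = counts.sum(axis=1, keepdims=True)
    return (counts + beta) / (totals + n_features * beta)

# Toy counts for K = 2 topics over V^A = 3 features (made-up numbers).
counts = np.array([[8, 1, 1],
                   [0, 5, 5]])
phi_a = topic_feature_distribution(counts, beta=0.1)
```

Each row of `phi_a` sums to 1, so a row can be read directly as the probability of each feature under one topic.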

Further, the similarity is calculated using KL divergence, Euclidean distance, cosine similarity, Pearson correlation or the vector dot product.

The invention uses MV-LDA to model prescriptions. Using the abstract notion of a topic as an intermediate layer, it reduces the two types of features, diagnosis and medication, from a high-dimensional word space to a low-dimensional topic space, and the two types of features are linked through the topics. A topic is a group of semantically related words with their corresponding probabilities; it captures a central theme of the corpus and is readily interpretable.

For a prescription data set, the steps for abnormal prescription detection using the present method are as follows:

1) Data preprocessing: arrange the data from the medical system into normalized prescription data, wherein each prescription record comprises the diagnosis features and the medication features in the prescription.

2) Solving the MV-LDA model: input the arranged prescription data into the model and train it according to the model training method given in step 2) of the technical scheme to obtain the trained MV-LDA model.

3) Inference: using the data inference method given in step 3) of the technical scheme and the model obtained in step 2), perform inference on each instance separately with its diagnosis features and with its medication features to obtain two instance-topic distributions; the specific inference method is described under model inference in the technical scheme.

4) Calculate the similarity of the two instance-topic distributions obtained in step 3) as the degree of normality of the prescription; the lower the similarity, the more abnormal the prescription. Then set a threshold, and judge the prescription to be abnormal if the similarity is lower than the threshold.

Compared with the prior art, the invention has the following advantages:

Based on the MV-LDA model, the invention can detect abnormal prescriptions among a large number of prescriptions. In the experiment, expert review found that 97% of the 40 prescriptions flagged under a higher threshold were abnormal. Compared with other methods, the method has higher detection accuracy, and its detection performance remains stable when the data is extremely sparse. The method can be used to detect abnormal prescriptions as well as abnormal matching relationships among other kinds of features; compared with other anomaly detection algorithms in the medical field, it has a wider range of application and better extensibility. MV-LDA can be extended to any number of views and captures the correspondence among the views, so the method can conveniently detect abnormal matching among multiple kinds of features.

Drawings

FIG. 1 is the probabilistic graphical model of MV-LDA;

FIG. 2 is an example of an MV-LDA topic;

FIG. 3 is a flowchart of the steps for anomalous prescription detection.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below, without limiting the present invention thereto.

The method completes the detection of abnormal prescriptions through four steps, namely data initialization, model training, data inference and outlier calculation, which are described in detail as follows:

1) data initialization:

prescription data is typically stored in the form of structured data, such as in a relational database.

The present invention requires that the data be converted into a format suitable for processing before the MV-LDA can be used to extract the correspondence between diagnosis and medication in the prescription. For diagnostic features, all diagnoses in each record are grouped herein into a diagnostic feature set. The medication information includes the medicine code and the corresponding cost, and the cost is regarded as the dosage of the medicine and is used as the word frequency in the medication characteristic set.

For a drug m, let its cost in a given diagnostic record be $c_m$; the invention normalizes it to an integer value $n_m$ by the following formula:

$$n_m = \mathrm{Round}\left(\lambda \cdot \frac{c_m}{\mathrm{Median}(m)}\right)$$

The Median(m) function in the above formula represents the median of the cost of drug m in the data set. Round() is a rounding function, and λ is a multiplier factor, set manually so that the drug count $n_m$ is not less than 1. After this transformation, the invention obtains the input prescription data set for training the MV-LDA model.
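
The cost normalization can be sketched as follows. This is a minimal illustration under stated assumptions: the formula is taken to be Round(λ · cost / median cost of that drug), rounding is half-up, all costs are positive, and a floor of 1 is applied as a guard in case the chosen λ is too small. The function name `normalize_costs`, the record layout and the drug codes are hypothetical.

```python
import math
from statistics import median

def normalize_costs(records, lam=2.0):
    """Convert per-record drug costs into integer pseudo-counts.

    records: list of dicts mapping a drug code to its cost in that record.
    Assumption: n_m = Round(lam * c_m / Median(m)), with costs > 0.
    """
    # Collect all observed costs per drug to compute each drug's median cost.
    costs_by_drug = {}
    for record in records:
        for drug, cost in record.items():
            costs_by_drug.setdefault(drug, []).append(cost)
    medians = {drug: median(costs) for drug, costs in costs_by_drug.items()}

    normalized = []
    for record in records:
        counts = {}
        for drug, cost in record.items():
            n = math.floor(lam * cost / medians[drug] + 0.5)  # round half up
            counts[drug] = max(1, n)  # guard (assumption): keep every count >= 1
        normalized.append(counts)
    return normalized

# Toy data with made-up drug codes and costs.
records = [{"D001": 10.0, "D002": 3.0}, {"D001": 20.0}, {"D001": 40.0, "D002": 9.0}]
pseudo_counts = normalize_costs(records, lam=2.0)
```

The resulting integers play the role of word frequencies: a drug with pseudo-count 3 would appear three times in the medication feature list of that prescription.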

2) Model training:

A multi-view topic model (MV-LDA) is applied to the training data over two types of features, A and B, denoted $x^A$ and $x^B$ (e.g., diagnosis and medication). The multi-view topic model (MV-LDA) proposed for the present method is introduced first.

MV-LDA is an extension of the LDA topic model to multiple feature views. In LDA, the features associated with each topic all belong to the same category and are interchangeable, so an LDA topic can be regarded as comprising only one view. For instances described by several kinds of features, each kind of feature can be viewed as describing the instance from a different view, and the kinds of features are associated through the instance they describe. Taking prescription data as an example, if the diagnosis information and the medication information of all prescriptions are each used as input, an MV-LDA model is obtained. The model is composed of K abstract topics (K is a preset hyper-parameter); each topic comprises two types of features, diagnosis and medication, with corresponding probability values, and the invention takes the magnitude of a probability value to indicate how well a feature matches the latent meaning of the topic. Suppose the latent meaning of a topic is "teeth": diagnoses and medications related to teeth then have a high probability under that topic. Unlike training several separate LDA models, the diagnosis and medication features are generated from the same instance-topic distribution, so one topic has two views, diagnosis and medication, and the high-probability features in both views are diagnoses or medications that match the latent meaning of the topic. If LDA were instead trained separately on the diagnosis information and the medication information, the two resulting models would carry no association between the two types of features.

The invention models two types of features, A and B. Each instance (i.e., each prescription record) is regarded as being described from two views together, the view of the class A features and the view of the class B features, hereinafter referred to as the class A view and the class B view. The probabilistic graphical model of MV-LDA is shown in FIG. 1.

As in the LDA topic model, α in the figure is the hyper-parameter of the topic distribution, β is the hyper-parameter of the word distribution under each topic, and θ represents the topic distribution of each instance. The difference is that, since different kinds of features are regarded as describing instances from different views, each topic is also described by multiple views, and the different views have different topic-feature distributions $\phi^A$ and $\phi^B$. The topic assignment variable z, the generated feature x and the hyper-parameter β all differ between views, and the features x in different views are nevertheless related to one another because they are generated from the same instance-topic distribution θ.
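
The generative view of the model can be illustrated with a short simulation: one instance-topic distribution θ is drawn per prescription, and the features of both views are then generated from that same θ, each through its own topic-feature distribution. This is a sketch under the standard LDA generative assumptions; the function name `generate_prescription`, the vocabulary sizes and the hyper-parameter values are hypothetical.

```python
import numpy as np

def generate_prescription(alpha, phi_a, phi_b, n_a, n_b, rng):
    """Sample one synthetic instance with two feature views sharing one theta."""
    n_topics = phi_a.shape[0]
    theta = rng.dirichlet(np.full(n_topics, alpha))  # shared instance-topic distribution
    z_a = rng.choice(n_topics, size=n_a, p=theta)    # topic per view-A feature
    z_b = rng.choice(n_topics, size=n_b, p=theta)    # topic per view-B feature
    x_a = np.array([rng.choice(phi_a.shape[1], p=phi_a[k]) for k in z_a])
    x_b = np.array([rng.choice(phi_b.shape[1], p=phi_b[k]) for k in z_b])
    return theta, x_a, x_b

rng = np.random.default_rng(0)
K, V_A, V_B = 3, 6, 8                              # made-up sizes
phi_a = rng.dirichlet(np.full(V_A, 0.5), size=K)   # topic-feature distributions, view A
phi_b = rng.dirichlet(np.full(V_B, 0.5), size=K)   # topic-feature distributions, view B
theta, diagnoses, drugs = generate_prescription(0.5, phi_a, phi_b, n_a=2, n_b=5, rng=rng)
```

Because `diagnoses` and `drugs` are drawn from the same `theta`, their topic content is correlated, which is the correspondence the method later exploits.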

The model has the hyper-parameters α and β; the instance-topic distributions of all instances and the topic-feature distributions under all views are the model parameters to be obtained by the invention, namely θ and φ in the probabilistic graphical model. Solving for these parameters, i.e. solving the MV-LDA model, is described below.

The multi-view topic model is solved by Gibbs sampling to compute the parameters in the model. In the solving process, a topic is first randomly assigned to every feature; the topic of each feature of each instance is then resampled and updated according to the current state.

For the MV-LDA model with two types of features, assume that the class A features are being sampled. Given the state at the previous moment, the probability that feature $x_a$ among the class A features of instance m is assigned topic k is:

$$P(z_i = k \mid z_{-i}, x) \propto \frac{C^{V^A K}_{x_a k} + \beta^A}{\sum_{x'} C^{V^A K}_{x' k} + V^A \beta^A} \cdot \frac{C^{MK}_{mk} + \alpha}{\sum_{k'} C^{MK}_{mk'} + K\alpha}$$

where C denotes a count matrix. In the first factor, $V^A$ is the number of distinct class A features, $C^{V^A K}_{x_a k}$ represents the count of $x_a$ assigned to topic k over all instances, K represents the number of topics, k represents the k-th of the K topics, $\sum_{x'} C^{V^A K}_{x' k}$ represents the total count of all class A features assigned to topic k, and $\beta^A$ is the Dirichlet prior. On the left-hand side, z is the topic assigned to feature $x_a$, and $z_{-i}$ denotes the topics assigned to the remaining features. The first factor on the right-hand side represents the proportion of $x_a$ among all class A features assigned topic k, i.e.

$$\frac{C^{V^A K}_{x_a k} + \beta^A}{\sum_{x'} C^{V^A K}_{x' k} + V^A \beta^A}$$

For the second factor, similarly, K denotes the number of topics, M denotes the number of instances, $C^{MK}_{mk}$ represents the number of features in instance m (of both classes A and B) that are assigned topic k, $\sum_{k'} C^{MK}_{mk'}$ represents the number of all features in instance m, and α is the Dirichlet prior. The second factor represents the proportion of the features of instance m that are assigned topic k, i.e.

$$\frac{C^{MK}_{mk} + \alpha}{\sum_{k'} C^{MK}_{mk'} + K\alpha}$$

After topics have been assigned to all features, the topic-feature distribution under each topic is calculated as

$$\phi^A_{k,x} = \frac{C^{V^A K}_{x k} + \beta^A}{\sum_{x'} C^{V^A K}_{x' k} + V^A \beta^A}$$

which yields the required MV-LDA model.
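
The sampling procedure described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes the standard collapsed Gibbs update for LDA (the feature being resampled is removed from the counts first), symmetric priors, and an instance-topic count matrix shared by both views. The function name `train_mv_lda`, its parameters and the toy data are hypothetical.

```python
import numpy as np

def train_mv_lda(docs_a, docs_b, v_a, v_b, n_topics,
                 alpha=0.1, beta_a=0.01, beta_b=0.01, n_iter=100, seed=0):
    """Collapsed Gibbs sampling for a two-view topic model (illustrative sketch).

    docs_a[m] / docs_b[m]: integer feature ids of instance m in view A / view B,
    with an id repeated once per unit of frequency.
    """
    rng = np.random.default_rng(seed)
    n_docs = len(docs_a)
    views = [(docs_a, v_a, beta_a), (docs_b, v_b, beta_b)]
    c_mk = np.zeros((n_docs, n_topics))                         # instance-topic counts (both views)
    c_kx = [np.zeros((n_topics, size)) for _, size, _ in views]  # topic-feature counts per view
    c_k = [np.zeros(n_topics) for _ in views]                    # per-view totals per topic

    # Step 1: randomly assign a topic to every feature.
    z = []
    for v, (docs, _, _) in enumerate(views):
        z_view = []
        for m, doc in enumerate(docs):
            topics = rng.integers(n_topics, size=len(doc))
            for x, k in zip(doc, topics):
                c_mk[m, k] += 1
                c_kx[v][k, x] += 1
                c_k[v][k] += 1
            z_view.append(topics)
        z.append(z_view)

    # Step 2: repeatedly resample the topic of each feature given all the others.
    for _ in range(n_iter):
        for v, (docs, size, beta) in enumerate(views):
            for m, doc in enumerate(docs):
                for i, x in enumerate(doc):
                    k = z[v][m][i]
                    c_mk[m, k] -= 1
                    c_kx[v][k, x] -= 1
                    c_k[v][k] -= 1
                    # First factor: share of x within topic k; second factor: share of
                    # topic k within instance m (its denominator is constant in k).
                    p = (c_kx[v][:, x] + beta) / (c_k[v] + size * beta) * (c_mk[m] + alpha)
                    k = rng.choice(n_topics, p=p / p.sum())
                    z[v][m][i] = k
                    c_mk[m, k] += 1
                    c_kx[v][k, x] += 1
                    c_k[v][k] += 1

    # Step 3: read the topic-feature and instance-topic distributions off the counts.
    phi_a, phi_b = [(c_kx[v] + beta) / (c_k[v][:, None] + size * beta)
                    for v, (_, size, beta) in enumerate(views)]
    theta = (c_mk + alpha) / (c_mk.sum(axis=1, keepdims=True) + n_topics * alpha)
    return phi_a, phi_b, theta

# Toy corpus: four instances, view A = diagnosis ids, view B = drug ids (made up).
docs_a = [[0, 1], [0], [2, 3], [3]]
docs_b = [[0, 0, 1], [1, 1], [2, 3, 3], [2]]
phi_a, phi_b, theta = train_mv_lda(docs_a, docs_b, v_a=4, v_b=4, n_topics=2, n_iter=50)
```

Sharing `c_mk` across both views is what ties the two kinds of features to the same topics; training two independent LDA models would correspond to keeping a separate `c_mk` per view.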

Before MV-LDA can be used to extract the correspondence between diagnosis and medication in prescriptions, the invention requires the data to be converted into a format suitable for processing. The diagnosis features need no further processing. The medication information comprises drug codes and the corresponding costs; the cost is regarded as the dosage of the drug and is used as the word frequency in the medication feature set.

Each record in the processed prescription data set comprises a diagnosis feature set and a medication feature set. Using this data set as training data for MV-LDA yields K topics that preserve the association between diagnosis and medication. Each topic contains two topic-feature distributions, namely the multinomial distribution of the topic over medication features and its multinomial distribution over diagnosis features. The two types of features, diagnosis and medication, correspond to the two views of the topic, so the association between them is preserved.
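
To inspect what a trained topic looks like in each view, one can list the most probable features per topic. The sketch below assumes a topic-feature matrix with one row per topic; the function name `top_features`, the example probabilities and the diagnosis names are hypothetical and chosen only for illustration.

```python
import numpy as np

def top_features(phi, vocab, n=3):
    """Return, for each topic, the n most probable (feature, probability) pairs."""
    result = []
    for row in np.asarray(phi):
        order = np.argsort(row)[::-1][:n]  # indices sorted by descending probability
        result.append([(vocab[i], float(row[i])) for i in order])
    return result

# Made-up diagnosis view of a two-topic model.
phi_diag = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.1, 0.8]])
diag_vocab = ["dental caries", "gingivitis", "hypertension"]
top_diag = top_features(phi_diag, diag_vocab, n=1)
```

Applying the same function to the medication view of the same topic index shows which drugs the model associates with those diagnoses.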

3) Data inference

Data inference refers to inferring the distribution of the test data over the topics learned during model training; the topic distribution based on class A features and the topic distribution based on class B features must be inferred separately.

During model inference, each view in MV-LDA can be regarded as a separate LDA model, and each type of feature can be used independently for inference. For example, once a model containing the two feature classes A and B has been trained on a data set, the topic-feature distribution $\phi^A$ of the class A features can be used to perform inference on an instance containing only class A features and to estimate the instance-topic distribution of that instance under the model. The inference formula is as follows:

$$P(z_i = k \mid z_{-i}, x) \propto \phi^A_{k,x_a} \cdot \frac{C^{MK}_{mk} + \alpha}{\sum_{k'} C^{MK}_{mk'} + K\alpha}$$

where $\phi^A_{k,x}$ is the probability value of the topic-feature distribution of the class A features when the topic is k and the feature is x. The factor on the right is similar to that in equation (3-1) and represents the proportion of the features in the instance that are assigned topic k.

The inference process uses only the topic-feature distribution associated with the class A features. Inference can be carried out separately from each view, resulting in several instance-topic distributions. Since these distributions describe the same instance, they should be very close. Specifically, for a prescription that requires anomaly detection, the trained model is used to perform inference separately on the diagnoses and on the medications, giving two instance-topic distributions for the prescription; the similarity of the two distributions is then compared to judge whether the prescription is normal.
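
Single-view inference can be sketched as a Gibbs sampler in which the topic-feature distribution is held fixed and only the topic assignments of the new instance are resampled. This is an illustrative sketch under that assumption; the function name `infer_theta`, its parameters and the toy distribution are hypothetical.

```python
import numpy as np

def infer_theta(doc, phi, alpha=0.1, n_iter=100, seed=0):
    """Estimate the instance-topic distribution of one instance from one view.

    doc: integer feature ids of the instance in this view.
    phi: fixed topic-feature matrix of this view, shape (n_topics, n_features).
    """
    rng = np.random.default_rng(seed)
    n_topics = phi.shape[0]
    z = rng.integers(n_topics, size=len(doc))             # random initial topics
    c_k = np.bincount(z, minlength=n_topics).astype(float)
    for _ in range(n_iter):
        for i, x in enumerate(doc):
            c_k[z[i]] -= 1                                # remove the current assignment
            p = phi[:, x] * (c_k + alpha)                 # fixed phi times topic share
            z[i] = rng.choice(n_topics, p=p / p.sum())
            c_k[z[i]] += 1
    return (c_k + alpha) / (len(doc) + n_topics * alpha)

# Toy view with two topics and two features; the instance contains only feature 0.
phi_toy = np.array([[0.999, 0.001],
                    [0.001, 0.999]])
theta_toy = infer_theta([0] * 10, phi_toy)
```

Running the function once with the diagnosis features and the diagnosis view, and once with the medication features and the medication view, gives the two distributions that are compared in the next step.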

At this step, the invention uses the diagnosis data and the medication data separately to infer two instance-topic distributions of the prescription, written here as $\theta^A$ (from diagnosis) and $\theta^B$ (from medication). Under the model assumptions, diagnosis and medication describe the same prescription from different views, and this correspondence is reflected in the topics. If the prescription is normal, the instance-topic distributions inferred from the different views should be similar; if the distributions differ greatly, the prescription is likely to be abnormal.

4) Outlier calculation

The instance-topic distributions inferred from the two types of features, diagnosis and medication, are written here as $\theta^A$ and $\theta^B$. When the prescription is normal, the components of $\theta^A$ and $\theta^B$ on each topic should be relatively similar. The similarity between the two vectors can be calculated with various vector similarity measures, namely KL divergence (KL), Euclidean distance (EUC), cosine similarity (COS), Pearson correlation (PS) and vector dot product (DOT). The outlier value is equal to the measure itself or to its inverse, depending on whether the measure takes smaller values or larger values when the two vectors are more similar.

Then, by setting a threshold, abnormal prescriptions above the threshold can be marked for experts to review.
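
The outlier calculation can be sketched as below for the five measures named above. Two points are assumptions of this sketch: the "inverse" of a similarity is implemented as its negation (a reciprocal would be another reading), and a small constant is added before taking logarithms in the KL divergence. The function name `outlier_value`, the example vectors and the threshold are hypothetical.

```python
import numpy as np

def outlier_value(theta_a, theta_b, measure="kl", eps=1e-12):
    """Larger return values mean the two topic distributions disagree more."""
    p = np.asarray(theta_a, dtype=float)
    q = np.asarray(theta_b, dtype=float)
    if measure == "kl":    # distance-like: used as is
        p_s = (p + eps) / (p + eps).sum()
        q_s = (q + eps) / (q + eps).sum()
        return float(np.sum(p_s * np.log(p_s / q_s)))
    if measure == "euc":   # distance-like: used as is
        return float(np.linalg.norm(p - q))
    if measure == "cos":   # similarity-like: negated (assumption)
        return -float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))
    if measure == "ps":    # similarity-like: negated (assumption)
        return -float(np.corrcoef(p, q)[0, 1])
    if measure == "dot":   # similarity-like: negated (assumption)
        return -float(p @ q)
    raise ValueError(f"unknown measure: {measure}")

# Made-up distributions: the two views agree in `normal` and disagree in `abnormal`.
normal = ([0.8, 0.1, 0.1], [0.7, 0.2, 0.1])
abnormal = ([0.8, 0.1, 0.1], [0.1, 0.1, 0.8])
scores = {m: (outlier_value(*normal, m), outlier_value(*abnormal, m))
          for m in ("kl", "euc", "cos", "ps", "dot")}

threshold = 0.5  # hypothetical value; in practice chosen from the data
flagged = outlier_value(*abnormal, "kl") > threshold
```

With every measure, the disagreeing pair receives the larger outlier value, so a single "above the threshold" rule can be applied regardless of which measure is chosen.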

The above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and those skilled in the art can make modifications or equivalent substitutions to the technical solutions of the present invention without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

1. An abnormal prescription screening method based on a multi-view theme modeling technology comprises the following steps:

1) arranging the medical data into prescription data, wherein each prescription data comprises diagnosis characteristics and medication characteristics in a prescription;

2) inputting the prescription data into an MV-LDA model, and training the MV-LDA model; the MV-LDA model comprises K topics, and each topic comprises a diagnosis characteristic view and a medication characteristic view; the diagnosis feature view in the subject k consists of a diagnosis feature set and a probability value corresponding to each diagnosis feature in the set, and correspondingly, the medication feature view consists of a medication feature set and a probability value corresponding to each medication feature in the set;

2. The method of claim 1, wherein the MV-LDA model is solved using Gibbs sampling, and parameters in the MV-LDA model are calculated to obtain a trained MV-LDA model.

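3. The method of claim 2, wherein the MV-LDA model is solved using Gibbs sampling as follows: for prescription data m, when sampling the class A features in prescription data m, the probability that feature $x_a$ among the class A features is assigned topic k is:

$$P(z_i = k \mid z_{-i}, x) \propto \frac{C^{V^A K}_{x_a k} + \beta^A}{\sum_{x'} C^{V^A K}_{x' k} + V^A \beta^A} \cdot \frac{C^{MK}_{mk} + \alpha}{\sum_{k'} C^{MK}_{mk'} + K\alpha}$$

wherein $C^{V^A K}_{x_a k}$ represents the count of $x_a$ assigned to topic k over all prescription data of the training data set, K represents the number of topics, and k represents the k-th of the K topics;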

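4. The method of claim 3, wherein the topic-feature distribution of the class A features is

$$\phi^A_{k,x} = \frac{C^{V^A K}_{x k} + \beta^A}{\sum_{x'} C^{V^A K}_{x' k} + V^A \beta^A}$$

wherein $\phi^A_{k,x}$ is the value of the topic-feature distribution of the class A features when the topic is k and the feature is x.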

5. The method according to claim 1, wherein the similarity is calculated using KL divergence, Euclidean distance, cosine similarity, Pearson correlation, or the vector dot product.