CN108596770B

CN108596770B - Medical insurance fraud detection device and method based on outlier analysis

Info

Publication number: CN108596770B
Application number: CN201711471001.4A
Authority: CN
Inventors: 王新军; 闫中敏; 陈志勇; 姜诚; 于杰
Original assignee: Dareway Software Co ltd
Current assignee: Dareway Software Co ltd
Priority date: 2017-12-29
Filing date: 2017-12-29
Publication date: 2022-04-01
Anticipated expiration: 2037-12-29
Also published as: CN108596770A

Abstract

The invention provides a medical insurance fraud detection device and detection method based on outlier analysis. The device comprises the following modules: a medical insurance data acquisition module; a medical insurance data preprocessing module; a similarity score calculation module; an outlier detection module; and a patient fraud detection module. The invention improves the existing outlier analysis method through data preprocessing and adjusts the similarity score of each patient to suit the medical insurance field. Similarity and outlier analysis are combined: a similarity score is first calculated for each patient, and outlier analysis is then performed on the scores. Statistics show that the distribution of the similarity scores is approximately normal, so a flexible critical value for detecting outliers is determined by combining a threshold with a confidence interval.

Description

Medical insurance fraud detection device and method based on outlier analysis

Technical Field

The invention belongs to the field of medical insurance, and particularly relates to a medical insurance fraud detection method based on outlier analysis.

Background

The National Health Care Anti-Fraud Association (NHCAA) defines medical insurance fraud as: "deliberately deceptive or fraudulent presentation by a person or organization to gain an illicit benefit to the person or organization".

Most traditional medical insurance fraud detection methods are rule-based. With the development of medical insurance business, a large amount of data has accumulated in the medical insurance field, including medical diagnosis information, diagnosis and treatment details, prescription details and digital medical files in medical insurance settlement systems. This forms medical service big data, in which a large amount of medical service knowledge and rules is hidden. Based on this big data, the invention provides a medical insurance fraud detection method based on outlier analysis.

Disclosure of Invention

The invention provides a medical insurance fraud detection device and method based on outlier analysis in order to improve the fraud detection effect and accuracy.

The invention aims to provide a medical insurance fraud detection device and method based on outlier analysis. The innovation point of the invention is that an evaluation algorithm is designed to realize fraud detection in the field of medical insurance by calculating a similarity score for each patient and utilizing a mode of combining a threshold value and a confidence interval.

In order to achieve the purpose, the invention adopts the following technical scheme:

An apparatus for detecting fraud in medical insurance based on outlier analysis, comprising: a medical insurance data acquisition module 100, which acquires the hospitalization records, medication records and treatment records of medical insurance institutions in a given region, wherein the medical insurance records comprise the basic information, medication information, disease information, diagnosis and treatment information and the like of patients; a medical insurance data preprocessing module 200, which preprocesses the original data set using a data cleaning technology and a pharmacopoeia; a similarity score calculation module 300, which calculates a similarity score for each patient through a heterogeneous network and adjusts the score according to the number of diseases that the patient's drugs can treat; an outlier detection module 400, which sets a flexible critical value by combining a fixed threshold and a confidence interval and performs an iterative outlier search; and a patient fraud detection module 500, which determines the outliers found through outlier analysis to be suspected fraudulent patients.

A medical insurance fraud detection method based on outlier analysis comprises the following steps: step S1, obtaining the actual medical record in the medical insurance in a certain area; step S2, preprocessing the data set by using data cleaning and pharmacopoeia; step S3, extracting information of patients, diseases and medicines to construct a heterogeneous network; step S4, calculating similarity scores of different patients by using the similarity; step S5, performing outlier analysis by combining threshold and confidence intervals to distinguish normal and fraudulent patients, and finally determining patients suspected of being fraudulent.

Preferably, the method for detecting fraud in medical insurance based on outlier analysis, wherein the step of obtaining the actual medical insurance medical record in a certain area comprises the following steps: step S101, acquiring a large amount of medical care related data of a medical insurance institution in a certain area, reserving useful data and removing useless data; step S102, extracting data such as patient basic information record, medication record, treatment record and the like.

Preferably, in the medical insurance fraud detection method based on outlier analysis, the preprocessing of the data set using data cleaning and the pharmacopoeia comprises the following steps: step S201, extracting the medical treatment data of the patients to be subjected to fraud detection; step S202, processing the patients' sensitive data and data with a high missing rate using a data cleaning technology, and ensuring that each patient has no fewer than three medical records; step S203, classifying the drug information by consulting the pharmacopoeia, and unifying multiple similar drugs into the same category according to the correspondence between drugs and categories.

Preferably, the method for detecting medical insurance fraud based on outlier analysis, wherein the extracting information of patients, diseases and drugs to construct a heterogeneous network comprises the following steps: step S301, analyzing the preprocessed patient data set, mainly analyzing basic information of the patient, diseased information of the patient and medication condition of the patient; step S302, extracting basic information, medicine information and disease information of the patient in the data set, and constructing a heterogeneous information network of the patient.

Preferably, in the medical insurance fraud detection method based on outlier analysis, the calculation of the similarity scores of different patients comprises the following steps: step S401, the constructed heterogeneous network is first analyzed, and the relevance of the patients is reflected by a similarity score calculated for each patient; because a raw score increases as the data volume increases, a visibility factor is introduced, expressed in the following form:

Step S402, the score is adjusted according to the number of diseases treated by the patient's drugs: if a patient treats many diseases with the same drug, this is considered abnormal and the patient's score is reduced. The similarity score between two patients can be expressed by the formula:

Preferably, in the medical insurance fraud detection method based on outlier analysis, the outlier analysis combining the threshold value and the confidence interval comprises the following steps: step S501, after the similarity scores of the patients are calculated, the scores are analyzed statistically, the distribution of the similarity scores is found to be close to a normal distribution, and the threshold is calculated using the properties of the normal distribution; step S502, a flexible critical value for detecting outliers is determined with the aid of the confidence interval of the normal distribution, so that the critical value is automatically adjusted to an appropriate value when other factors change; the critical value, i.e. the threshold value, may be expressed by a formula:

Compared with the prior art, the invention has the following beneficial effects:

(1) The present invention improves the existing outlier analysis method through data preprocessing and adjusts the similarity score of each patient to suit the medical insurance field. Similarity and outlier analysis are combined: a similarity score is calculated for each patient, and outlier analysis is then performed on the scores.

(2) The invention designs an evaluation algorithm combining a threshold value and a confidence interval for fraud detection. Through statistics, the distribution of the similarity scores can be found to be similar to the normal distribution, and a mode of combining a threshold value and a confidence interval is adopted to determine a flexible critical value for detecting outliers.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.

FIG. 1 is a schematic structural diagram of a medical insurance fraud detection apparatus based on outlier analysis according to an exemplary embodiment of the inventive concept;

FIG. 2 is a general flow diagram of a method for detecting medical insurance fraud based on outlier analysis, according to an exemplary embodiment of the inventive concept;

FIG. 3 is a flowchart of medical insurance data acquisition steps of a medical insurance fraud detection method based on outlier analysis, according to an exemplary embodiment of the inventive concept;

FIG. 4 is a flowchart of medical insurance data pre-processing steps of a medical insurance fraud detection method based on outlier analysis, according to an exemplary embodiment of the inventive concept;

FIG. 5 is a flowchart of the information extraction and heterogeneous network construction steps of a method for medical insurance fraud detection based on outlier analysis, according to an exemplary embodiment of the inventive concept;

FIG. 6 is a flowchart of similarity score calculation and adjustment steps of a method for detecting medical insurance fraud based on outlier analysis, according to an exemplary embodiment of the present inventive concept;

FIG. 7 is a flowchart of outlier analysis fraud detection steps of an outlier analysis-based medical insurance fraud detection method according to an exemplary embodiment of the inventive concept.

Detailed Description

The invention is further described with reference to the following figures and examples.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

The invention addresses a phenomenon observed in actual medical practice and reported in published literature. After the patient hospitalization records of medical insurance in a given region are analyzed, the basic information, drug information, disease information and the like of the patients are extracted, and suspected fraudulent patients are detected using similarity score calculation and flexible critical value calculation. The invention thus provides a method for fraud detection by outlier analysis, which comprises the following:

First, the two factors of similarity and outliers are considered together: a similarity score is calculated for each patient, and the patients' scores are then subjected to outlier analysis and evaluation.

Second, a flexible critical value is determined by combining threshold calculation and a confidence interval. The critical value adjusts itself as the data change, so that fraudulent patients and normal patients are distinguished.

The terms used in the invention are defined as follows:

Heterogeneous network: a heterogeneous information network can be represented by a graph G = (V, E), where V is the set of vertices and E is the set of edges. Such networks can be constructed from many interconnected, large-scale data sets in social, scientific, engineering and other domains. The medical domain can also be modeled as a medical information network whose vertices include doctors, patients, diseases, treatments, devices, etc., and whose edges describe relationships such as a patient taking a medication, a patient having a disease, or a patient visiting a doctor.

Meta-path: a meta-path is a path formed by connecting multiple vertices; it systematically reflects the associations between different vertices in a heterogeneous information network.

Threshold value: a threshold is a limit, also called a critical value. It is the lowest or highest value at which an effect can be produced, and the concept is widely used across scientific fields. A well-chosen threshold yields better results.

Confidence interval: an interval estimate of a population parameter constructed from sample statistics. In statistics, the confidence interval of a probability sample is an interval estimate for some population parameter of that sample. It shows the range around the measured value within which the true value of the parameter falls with a given probability, and thus indicates how plausible the measured value is; the "given probability" is the confidence level.
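As an illustration of the heterogeneous network and meta-path terms above, the following minimal Python sketch counts instances of a Patient-Drug-Patient meta-path on a toy network. The patient and drug identifiers, the choice of meta-path, and the function name are hypothetical and serve only to show how a meta-path links vertices of different types.

```python
from collections import defaultdict

# Hypothetical toy edges: (patient, drug) pairs meaning "patient takes drug".
takes = [("p1", "d1"), ("p1", "d2"), ("p2", "d1"), ("p3", "d2"), ("p3", "d3")]


def count_patient_drug_patient_paths(takes):
    """Count instances of the meta-path Patient -> Drug -> Patient."""
    patients_by_drug = defaultdict(set)
    for patient, drug in takes:
        patients_by_drug[drug].add(patient)

    counts = defaultdict(int)
    for patients in patients_by_drug.values():
        for a in patients:
            for b in patients:
                # a == b is included: a path from a patient back to itself.
                counts[(a, b)] += 1
    return dict(counts)


path_counts = count_patient_drug_patient_paths(takes)
```

Two patients who share no drug, such as p2 and p3 here, have no path instance between them and therefore do not appear in the result.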

Fig. 1 is a schematic structural diagram of a medical insurance fraud detection apparatus based on outlier analysis according to an exemplary embodiment of the inventive concept.

As shown in fig. 1, the medical insurance fraud detection apparatus based on outlier analysis according to an exemplary embodiment of the inventive concept includes:

the medical insurance data acquisition module 100 is used for acquiring hospitalization records, medication records and treatment records of medical insurance institutions in certain areas, wherein the medical insurance records comprise basic information, medication information, disease information, diagnosis and treatment information and the like of patients;

the medical insurance data preprocessing module 200 is used for preprocessing the data of the original data set by using a data cleaning technology and pharmacopoeia; wherein,

because the data set suffers from problems such as missing and inconsistent data (for example, missing patient basic information or drug dosages), the method uses a data cleaning technology to encrypt sensitive data, ensuring the integrity and confidentiality of the information, and to handle data with a high missing rate. Preferably, the pharmacopoeia is the Chinese Pharmacopoeia (2015 edition), which is used to map fine-grained drugs to coarse-grained drug categories, thereby solving the problem of drug information processing;

a similarity score calculating module 300 for calculating a similarity score for each patient through a heterogeneous network; wherein,

in the similarity score calculating module, calculating a similarity score for each patient by analyzing a heterogeneous network, and adjusting the similarity score by considering the number of diseases which can be treated by the medicines of the patient;

the outlier detection module 400 sets a flexible critical value by combining a fixed threshold value and a confidence interval to perform outlier iterative search; wherein,

in the outlier detection module 400, statistics show that the distribution of the similarity scores is approximately normal, so the confidence interval of the normal distribution is taken into account in the threshold calculation. A flexible critical value is calculated and compared with the similarity scores to find the outliers among them;

a patient fraud detection module 500 that determines the found outliers as suspected fraudulent patients by outlier analysis; wherein,

in the fraud detection module 500, by outlier analysis, normal patients and fraudulent patients can be distinguished according to the obtained outliers, and the found outliers are determined as suspected fraudulent patients.

Fig. 2 is a general flowchart of a method for detecting medical insurance fraud based on outlier analysis according to an exemplary embodiment of the inventive concept.

As shown in fig. 2, the medical insurance fraud detection method based on outlier analysis according to an exemplary embodiment of the inventive concept includes:

step S1, obtaining the actual medical record in the medical insurance in a certain area;

in a specific implementation, a medical record of actual medical insurance in a certain area is obtained, and a lot of available patient-related information, such as patient basic information, medicine information, disease information and the like, is extracted from the medical record for medical fraud detection.

Step S2, preprocessing the data set by using data cleaning and pharmacopoeia;

Most data sets cannot be used directly because of problems such as missing and inconsistent data, for example missing patient basic information or drug dosages. In the medical field in particular, there are too many kinds of drugs and too little data on each one, so the specific situation of each drug is difficult to analyze;

Preferably, the invention uses the Chinese Pharmacopoeia (2015 edition) to solve the problem of drug information processing. By consulting the Chinese Pharmacopoeia (2015 edition), fine-grained drugs can be grouped into coarse-grained categories according to the recorded drug types;

The invention uses a data cleaning technology to encrypt sensitive data and to delete data with a high missing rate, ensuring the integrity and confidentiality of the information.
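The preprocessing described in step S2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `DRUG_CATEGORY` table is a hypothetical stand-in for a pharmacopoeia lookup, the record field names are invented, and a salted one-way hash is used in place of the unspecified encryption of sensitive data.

```python
import hashlib
from collections import defaultdict

# Hypothetical drug-to-category table standing in for a pharmacopoeia lookup.
DRUG_CATEGORY = {
    "amoxicillin capsule": "penicillins",
    "ampicillin injection": "penicillins",
    "metformin tablet": "biguanides",
}


def preprocess(records, min_records=3, salt="demo-salt"):
    """Pseudonymize IDs, drop incomplete rows, coarsen drugs, filter sparse patients."""
    by_patient = defaultdict(list)
    for rec in records:
        # Drop rows with missing fields or drugs absent from the lookup table.
        if not rec.get("patient_id") or not rec.get("disease"):
            continue
        category = DRUG_CATEGORY.get(rec.get("drug"))
        if category is None:
            continue
        # Assumption: a salted hash replaces the raw identifier.
        pid = hashlib.sha256((salt + rec["patient_id"]).encode()).hexdigest()[:12]
        by_patient[pid].append(
            {"patient": pid, "drug_category": category, "disease": rec["disease"]}
        )
    # Keep only patients with at least `min_records` usable records.
    return {pid: rows for pid, rows in by_patient.items() if len(rows) >= min_records}


raw = [
    {"patient_id": "A", "drug": "amoxicillin capsule", "disease": "pneumonia"},
    {"patient_id": "A", "drug": "ampicillin injection", "disease": "pneumonia"},
    {"patient_id": "A", "drug": "metformin tablet", "disease": "diabetes"},
    {"patient_id": "B", "drug": "metformin tablet", "disease": "diabetes"},
    {"patient_id": "A", "drug": None, "disease": "pneumonia"},
]
clean = preprocess(raw)
```

In this example patient B is dropped for having fewer than three records, and the row with a missing drug is discarded before the count is taken.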

Step S3, extracting information of patients, diseases and medicines to construct a heterogeneous network;

The basic information, drug information and disease information of the patients are extracted from the acquired medical records, and a heterogeneous information network is established according to their mutual relations. Each item of information appears in the network as a vertex, and the edges between vertices describe relations such as a patient taking a certain drug or a drug treating a certain disease.

The patient takes a certain medicine each time because of a certain disease, and in the process, the information among the patient, the medicine and the disease is mutually linked to form a huge network which can be regarded as a heterogeneous network formed by taking the three as vertexes.
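A network of this kind, with patients, drugs and diseases as vertices, might be built from cleaned records as in the sketch below. The record fields, relation names ("takes", "has", "treats") and the `build_network` function are hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_network(records):
    """Build a typed graph: vertices are (type, name); edges are grouped by relation."""
    vertices = set()
    edges = defaultdict(set)  # relation name -> set of (source, target) pairs
    for rec in records:
        p = ("patient", rec["patient"])
        d = ("drug", rec["drug"])
        s = ("disease", rec["disease"])
        vertices.update([p, d, s])
        edges["takes"].add((p, d))   # patient takes drug
        edges["has"].add((p, s))     # patient has disease
        edges["treats"].add((d, s))  # drug was used for disease
    return vertices, dict(edges)


records = [
    {"patient": "p1", "drug": "penicillins", "disease": "pneumonia"},
    {"patient": "p2", "drug": "penicillins", "disease": "pneumonia"},
    {"patient": "p2", "drug": "biguanides", "disease": "diabetes"},
]
vertices, edges = build_network(records)
```

Storing edges as sets keyed by relation keeps repeated prescriptions from creating duplicate edges; a weighted variant would count them instead.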

Step S4, calculating similarity scores of different patients by using the similarity;

A similarity score between each patient and the other patients is calculated through the constructed heterogeneous network. From the disease perspective, the calculated scores are further adjusted for the abnormal condition in which the same drug is used to treat many diseases. The scores between the candidate patients and all patients in the reference group are then calculated, and the average value is taken as the final score reflecting the similarity between the candidate patients and normal patients.
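The averaging against a reference group described in step S4 can be illustrated as follows. The patent's own similarity formula is not reproduced in this text, so the sketch substitutes a PathSim-style normalized score over a Patient-Drug-Patient meta-path as an assumed stand-in; the drug-to-disease adjustment is omitted, and all names and data are hypothetical.

```python
from collections import Counter


def pathsim(a: Counter, b: Counter) -> float:
    """PathSim-style score (assumed measure) from per-patient drug usage counts."""
    cross = sum(a[d] * b[d] for d in a.keys() & b.keys())  # paths a -> drug -> b
    self_a = sum(v * v for v in a.values())                # paths a -> drug -> a
    self_b = sum(v * v for v in b.values())                # paths b -> drug -> b
    denom = self_a + self_b
    return 2.0 * cross / denom if denom else 0.0


def mean_similarity(candidate: Counter, reference: list) -> float:
    """Average the candidate's score against every patient in the reference group."""
    if not reference:
        return 0.0
    return sum(pathsim(candidate, r) for r in reference) / len(reference)


# Hypothetical reference group of normal patients (drug category -> usage count).
reference = [
    Counter({"penicillins": 2, "biguanides": 1}),
    Counter({"penicillins": 1, "biguanides": 1}),
]
typical = Counter({"penicillins": 2, "biguanides": 1})
unusual = Counter({"opioids": 5})

typical_score = mean_similarity(typical, reference)
unusual_score = mean_similarity(unusual, reference)
```

Dividing by the self-path counts keeps a patient with many records from scoring high merely because of data volume, which is the concern the text raises about raw scores.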

Step S5, performing outlier analysis by combining a threshold value and a confidence interval to distinguish normal and fraud patients, and finally determining suspected fraud patients;

After calculating the similarity scores of the patients, the invention applies an evaluation algorithm to detect outliers. Statistics show that the distribution of the similarity scores is approximately normal, so the threshold is combined with a confidence interval. The invention takes the confidence interval of the normal distribution, i.e., from the mean minus the standard deviation to the mean plus the standard deviation. The threshold is thus calculated flexibly to determine outliers and thereby find fraudulent patients.
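The interval of the mean minus and plus one standard deviation mentioned above can be computed as in this minimal sketch. How the fixed threshold is combined with the interval is not specified in this text, so that part is left out; the scores and the function name are hypothetical.

```python
import statistics


def one_sigma_interval(scores):
    """Return (mean - std, mean + std) using the population standard deviation."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)
    return mu - sigma, mu + sigma


# Hypothetical similarity scores; the last patient resembles nobody in the reference group.
scores = [0.80, 0.90, 0.85, 0.95, 0.20]
low, high = one_sigma_interval(scores)

# Scores below the lower bound are candidate outliers.
below = [s for s in scores if s < low]
```

Because the bounds are recomputed from the data, the cut-off moves when the score distribution changes, which is the sense in which the critical value is flexible.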

The steps of the medical insurance fraud detection method based on outlier analysis according to an exemplary embodiment of the inventive concept are specifically set forth below, as shown in fig. 3 to 7:

step S101, acquiring a large amount of medical care related data of a medical insurance institution in a certain area, reserving useful data and removing useless data;

step S102, extracting data such as patient basic information record, medication record, treatment record and the like.

Step S201, extracting hospitalizing data of a patient to be subjected to fraud detection;

step S202, processing the patients' sensitive data and data with a high missing rate using a data cleaning technology (ensuring that each patient has no fewer than three medical records);

step S203, the medicine information is processed by category by inquiring Chinese pharmacopoeia (2015 edition), and a plurality of similar medicines are unified into the same category according to the corresponding relationship between the medicines and the categories.

Step S301, analyzing the preprocessed patient data set, mainly analyzing basic information of the patient, diseased information of the patient and medication condition of the patient;

step S302, extracting basic information, medicine information and disease information of the patient in the data set, and constructing a heterogeneous information network of the patient.

Step S401, the constructed heterogeneous network is first analyzed, and the relevance of the patients is reflected by calculating a similarity score for each patient.

A raw score, however, increases as the amount of data increases, which is not reasonable. To solve this problem, the invention adjusts the score by a visibility factor, which can be expressed in the following form:

step S501, after calculating the similarity score of the patient, performing statistical analysis on the score to find that the distribution of the similarity score of the patient is close to normal distribution, and performing threshold calculation by considering the property of the normal distribution;

step S502, a flexible critical value for detecting outliers is determined by the aid of a confidence interval of normal distribution, when other factors change, the critical value can be automatically adjusted to a proper value, and the critical value, namely the threshold value, can be expressed by a formula,

step S503, iteratively calculating a critical value and comparing the critical value with the similarity score, and considering the score lower than the critical value as an outlier, and screening the outlier in the mode;

Each iteration may find some outliers, but these may be only a fraction of all outliers. The outliers found are deleted, the critical value is recalculated from the remaining data, and any new outliers are deleted in turn. This process is repeated until no new outliers are found, and the last critical value is the final value.

Step S504, the found outliers are determined to be suspected fraudulent patients.
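The iterative search of steps S503 and S504 can be sketched as below. The critical value here is assumed to be the mean minus `k` standard deviations of the remaining scores; `k` is a hypothetical parameter, with k = 1 corresponding to the one-standard-deviation interval mentioned in the text. The patent's own critical value formula is not reproduced in this text, and the example uses k = 2 because a narrow band tends to keep removing ordinary scores on a small sample.

```python
import statistics


def iterative_outliers(scores, k=1.0):
    """Repeatedly flag scores below mean - k*std until no new outlier appears."""
    remaining = dict(scores)
    outliers = {}
    critical = None
    while len(remaining) >= 2:
        values = list(remaining.values())
        critical = statistics.fmean(values) - k * statistics.pstdev(values)
        new = {pid: s for pid, s in remaining.items() if s < critical}
        if not new:
            break  # converged: the last critical value is the final one
        outliers.update(new)
        for pid in new:
            del remaining[pid]
    return outliers, critical


# Hypothetical patient scores: p6 is extreme and initially masks p7.
scores = {"p1": 0.90, "p2": 0.92, "p3": 0.88, "p4": 0.91,
          "p5": 0.89, "p6": 0.10, "p7": 0.50}
suspects, final_critical = iterative_outliers(scores, k=2.0)
```

In this example the first pass flags only p6, because its extreme score inflates the standard deviation; once it is removed, the recalculated critical value rises and p7 is flagged in the second pass, after which no further outliers are found.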

In conclusion, in the fraud detection of the patient, the invention analyzes and preprocesses the historical data record of the patient by using the basic information, the medication information and the disease information of the patient in the medical insurance data, and calculates the similarity score of the patient through the constructed heterogeneous network. And then, outlier analysis is carried out in a mode of combining a threshold value and a confidence interval, so that normal patients and fraudulent patients can be distinguished according to the obtained outliers.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Although the embodiments of the present invention have been described with reference to the accompanying drawings, the description is not intended to limit the scope of the present invention. Those skilled in the art should understand that various modifications and variations can be made on the basis of the technical solution of the present invention without inventive effort.

Claims

1. A medical insurance fraud detection method based on outlier analysis is characterized by comprising the following steps:

step S2, preprocessing the data set by using data cleaning and pharmacopoeia;

the method for performing outlier analysis by combining the threshold value and the confidence interval comprises the following steps:

step S501, after calculating the similarity score of the patient, carrying out statistical analysis on the score to find that the distribution of the similarity score of the patient is close to normal distribution, and carrying out threshold calculation by combining the property of the normal distribution;

step S502, a confidence interval of normal distribution is used for assisting in determining a flexible critical value for detecting an outlier, so that when other factors change, the critical value can be automatically adjusted to a proper value, and the critical value, namely the threshold value, is expressed by a formula:

step S503, iteratively calculating a critical value and comparing the critical value with a similarity score, wherein the score lower than the critical value is regarded as an outlier, and screening the outlier in the way,

through iterative operations, some outliers can be found, but they are only a fraction of all outliers, the outliers are deleted, and the threshold is calculated again using the remaining data, and new outliers are deleted, repeating this process until no new outliers can be found, and the last threshold is taken as the final value;

2. The method of claim 1, wherein the step of obtaining the medical record of the actual medical insurance in the area comprises the steps of:

step S102, extracting basic information record, medication record and treatment record data of the patient.

3. The method of claim 2, wherein the pre-processing of the data set using data cleaning and pharmacopoeia comprises the steps of:

step S202, processing sensitive data and data with high data loss rate of patients by using a data cleaning technology, and ensuring that not less than three medical records of each patient are obtained;

step S203, the category of the medicine information is processed by inquiring the pharmacopoeia, and a plurality of similar medicines are unified into the same category according to the corresponding relationship between the medicines and the categories.

4. The method of claim 1, wherein the step of extracting information of patients, diseases and drugs to construct a heterogeneous network comprises the steps of:

5. The method of claim 4, wherein the calculating the similarity scores of different patients using the similarities comprises the following steps: