CN111612636A - Abnormal medical insurance data detection system and method based on dual clustering algorithm - Google Patents

Publication number
CN111612636A
Authority
CN
China
Prior art keywords
medical
suspicious
medical insurance
insurance
patient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010368770.7A
Other languages
Chinese (zh)
Inventor
李晖
李瑞璨
崔立真
郭伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202010368770.7A
Publication of CN111612636A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Primary Health Care (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The disclosure provides a system and method for detecting abnormal co-occurrence medical-treatment data in medical insurance based on a dual clustering algorithm. Medical visit information and demographic information are acquired; a P-TL graph is constructed from the medical treatment records of insured persons; on the constructed P-TL graph, suspicious patient groups who frequently seek medical treatment at the same place at the same time, together with their suspicious medical records, are mined by a dual clustering algorithm; normal patients are then filtered out of each suspicious patient group: isolated patients who are not linked by edges to other patients are removed, and the remaining patients who are linked to one another by edges are regarded as a fraudulent group if their number exceeds a threshold. Normal patients who would otherwise be misjudged because of long-term regular medical visits are thereby filtered out, and medical insurance fraud is identified more accurately.

Description

Abnormal medical insurance data detection system and method based on dual clustering algorithm
Technical Field
The disclosure belongs to the field of computer technology for medical insurance, and particularly relates to a system and method for detecting abnormal co-occurrence medical-treatment insurance data based on a dual clustering algorithm.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
The medical insurance system is a social insurance system established to compensate workers for economic losses caused by the risk of disease.
With the rapid development of the medical insurance industry, a small number of lawbreakers have begun to defraud medical insurance funds for profit.
Medical insurance data can be obtained from the medical insurance system, and abnormal data can be identified by analyzing records such as medical insurance card swipes or hospitalization reimbursements. For example, when the records show that an individual's medical insurance card is repeatedly used at the same place and time, or that the same kind of medicine is repeatedly purchased at the same place and time, the records can generally be flagged as abnormal. The abnormal data is then analyzed further, corresponding feedback is issued, or a stricter data supervision scheme is established.
The inventors found in their research that current anomaly detection for medical insurance data relies mainly on simple analysis, including acquiring and judging the time and place associated with each record. It does not consider that some chronic diseases cause records to be legitimately generated at the same time and place many times, so current detection accuracy suffers from a certain error. The main cause of this inaccuracy is that the factors considered when processing and detecting medical insurance data are incomplete. The main technical problem addressed by the disclosure is therefore how to detect abnormal medical insurance data in a big-data setting when some records legitimately show the same kind of medicine being purchased at the same time and place many times.
Disclosure of Invention
In order to overcome the defects of the prior art, the disclosure provides an abnormal medical insurance data detection method based on a dual clustering algorithm. By using the dual clustering algorithm and introducing a health and medical knowledge base, the method can mine suspicious patient groups who frequently seek medical treatment at the same time and place, filter out normal patients who would be misjudged because of long-term regular medical visits, and identify medical insurance fraud more accurately.
In order to achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:
The abnormal medical insurance data detection method based on the dual clustering algorithm comprises the following steps:
collecting medical treatment record data of persons insured under medical insurance and constructing a P-TL graph, wherein the graph contains two types of nodes: P represents the set of insured persons appearing in the medical treatment records; TL represents the set of medical treatment time and medical treatment place information in the medical treatment records;
on the constructed P-TL graph, mining, by a dual clustering algorithm, suspicious patient groups who frequently seek medical treatment at the same place at the same time, together with their suspicious medical records;
filtering out normal medical records from the suspicious medical records: for each suspected fraudulent patient group in the resulting suspicious medical records, isolated patients who are not linked by edges to other patients are filtered out, and the remaining patients who are linked to one another by edges are regarded as abnormal medical data if their number exceeds a threshold.
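The filtering step can be illustrated with a minimal Python sketch. It assumes that the patient-to-patient edges are already available as pairs, and it reads "the number of people exceeds a threshold" as applying to each connected group separately; the function name `flag_fraud_groups` and the sample identifiers are hypothetical and do not come from the disclosure.

```python
from collections import defaultdict

def flag_fraud_groups(patients, edges, min_size):
    """Drop isolated patients, then keep edge-linked groups larger than min_size."""
    members = set(patients)
    neighbours = defaultdict(set)
    for a, b in edges:
        if a in members and b in members and a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen, groups = set(), []
    for start in patients:
        # Skip patients already assigned to a group and isolated patients
        # (those with no edge to any other patient).
        if start in seen or start not in neighbours:
            continue
        stack, component = [start], set()
        while stack:  # depth-first walk over one connected group
            p = stack.pop()
            if p in component:
                continue
            component.add(p)
            stack.extend(neighbours[p] - component)
        seen |= component
        if len(component) > min_size:  # "exceeds a threshold" (assumed per group)
            groups.append(sorted(component))
    return groups

# Toy data: p4 and p5 are isolated and are filtered out.
suspicious = ["p1", "p2", "p3", "p4", "p5"]
links = [("p1", "p2"), ("p2", "p3")]
flagged = flag_fraud_groups(suspicious, links, min_size=2)
print(flagged)
```

With a larger `min_size` the same linked group would no longer be flagged, which is how the threshold controls sensitivity in this sketch.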
In another aspect, the disclosure also provides an abnormal medical insurance data detection device based on the dual clustering algorithm, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above abnormal medical insurance data detection method based on the dual clustering algorithm when executing the program.
In another aspect, the disclosure also provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, performs the steps of the above abnormal medical insurance data detection method based on the dual clustering algorithm.
In another aspect, the disclosure further provides an abnormal co-occurrence medical-treatment insurance data detection system based on the dual clustering algorithm, comprising:
a medical record data preprocessing module, configured to collect medical treatment record data of persons insured under medical insurance and to construct a P-TL graph, wherein the graph contains two types of nodes: P represents the set of insured persons appearing in the medical treatment records; TL represents the set of medical treatment time and medical treatment place information in the medical treatment records;
an abnormal medical data detection module, configured to mine, on the constructed P-TL graph and by a dual clustering algorithm, suspicious patient groups who frequently seek medical treatment at the same place at the same time, together with their suspicious medical records;
and to filter out normal medical records from the suspicious medical records: for each suspected fraudulent patient group in the resulting suspicious medical records, isolated patients who are not linked by edges to other patients are filtered out, and the remaining patients who are linked to one another by edges are regarded as abnormal medical data if their number exceeds a threshold.
The above one or more technical solutions have the following beneficial effects:
To address the limited accuracy of existing detection on medical insurance record data, the technical solution of the disclosure applies data cleaning, normalization, and data encryption to the acquired medical insurance data, yielding complete medical record data suitable for subsequent processing. When abnormal data is detected, suspicious patient groups who frequently seek medical treatment at the same time and place, together with their suspicious medical records, are mined by the dual clustering algorithm. Normal medical records are then filtered out of the suspicious medical records: for each suspected fraudulent patient group in the resulting suspicious medical records, isolated patients who are not linked by edges to other patients are filtered out, and the remaining patients who are linked to one another by edges are regarded as abnormal medical data if their number exceeds a threshold. This largely avoids misjudging normal records as abnormal medical data and improves the accuracy of abnormal data detection.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure; they do not limit the disclosure.
FIG. 1 is a flowchart of a method for identifying abnormal co-occurrence medical-treatment insurance data based on a dual clustering algorithm according to an embodiment of the present disclosure;
FIG. 2 is a model diagram of detecting patient groups who frequently seek medical treatment together, based on a dual clustering algorithm, according to an embodiment of the present disclosure;
FIG. 3 is a model diagram illustrating calculation of prescription similarity between suspicious patients according to an embodiment of the present disclosure.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
Traditional medical insurance anti-fraud work relies mainly on rules: medical insurance fraud rules are first formulated, the medical treatment behaviors of insured persons are then checked against the rules, and the fraudsters and their fraudulent behaviors are determined. This approach depends heavily on expert experience, the corresponding rules can generally be formulated only after fraud has occurred, and medical insurance fraud cannot be identified quickly and efficiently.
Abnormal co-occurrence medical treatment fraud refers specifically to cases in which fraudsters obtain the medical insurance cards of multiple insured persons by some means, use the cards to purchase medicines, and then resell the medicines to defraud medical insurance funds. To reduce the cost of fraud, these fraudsters typically use multiple medical insurance cards to purchase medicines in a single fraudulent visit.
For such behavior, existing methods for identifying abnormal co-occurrence medical treatment fraud only mine suspicious patient groups who frequently seek medical treatment at the same time and place; they do not consider that some normal patients are misjudged because of long-term regular medical visits, so the detection results are not accurate enough.
The general idea proposed by the present disclosure:
Based on a dual clustering algorithm, suspicious patient groups who frequently visit the same place at the same time are mined; in addition, a health and medical knowledge base is introduced to filter out normal patients who are misjudged because of long-term regular medical visits, so that fraudulent patients are identified more accurately.
Embodiment One
This embodiment discloses an abnormal medical insurance data detection method based on a dual clustering algorithm, which comprises the following steps:
Step (1): acquire the visit information and the demographic information.
Acquire the visit information of a patient, which mainly includes disease data, medication data, and diagnosis and treatment data; acquire the demographic information of the patient, which mainly includes the patient's age, sex, personnel category, marital status, education level, occupation, residence, and the like;
the visit information may be obtained from the medical system through a communication means.
Step (2): data preprocessing.
The data set used in this technical solution comes from a medical insurance information management system and includes patients' demographic information (such as sex and age), personal numbers, medical treatment numbers, disease names, disease codes, medicine names, medical treatment times, examination items, and the like. Because medicine codes and disease codes differ between hospitals and between medical insurance institutions, and because staff errors sometimes cause erroneous or missing data, medical insurance data may suffer from inconsistency, missing values, and errors. In addition, for privacy reasons, sensitive information such as personal codes and disease codes needs to be de-identified.
First, the data is cleaned: data with a high missing rate and erroneous data are handled. Then the medicine codes and disease codes are standardized by mapping them to an international or national standard, which eliminates data inconsistency. Finally, the sensitive data is de-identified.
Sensitive data such as identity card numbers, names, and home addresses are processed with the MD5 algorithm, that is, converted into meaningless character strings, so that sensitive information is not leaked when the data is used;
in medical data, missing values cannot be filled in, so data whose missing rate is higher than a set threshold is deleted.
The disease diagnosis codes in the visit information are converted into the corresponding codes of the International Classification of Diseases, ICD-10.
The medicine codes in the visit information are converted into the corresponding medicine codes of the Chinese Pharmacopoeia (2015 edition).
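The cleaning and MD5 steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: records are plain dictionaries, the missing rate is computed per field (the text does not say whether fields or whole records are deleted), and the helper names `drop_sparse_fields`, `md5_mask`, and `mask_sensitive` are hypothetical.

```python
import hashlib

def drop_sparse_fields(records, max_missing_rate):
    """Delete every field whose share of missing values exceeds the threshold.

    Assumption: "missing" means None or an empty string, measured per field.
    """
    if not records:
        return []
    fields = {key for record in records for key in record}
    keep = set()
    for field in fields:
        missing = sum(1 for record in records if record.get(field) in (None, ""))
        if missing / len(records) <= max_missing_rate:
            keep.add(field)
    return [{k: v for k, v in record.items() if k in keep} for record in records]

def md5_mask(value):
    """Turn a sensitive value into a meaningless hexadecimal string with MD5."""
    return hashlib.md5(str(value).encode("utf-8")).hexdigest()

def mask_sensitive(records, sensitive_fields):
    """Return copies of the records with the listed fields replaced by MD5 digests."""
    masked = []
    for record in records:
        copy = dict(record)
        for field in sensitive_fields:
            if copy.get(field) not in (None, ""):
                copy[field] = md5_mask(copy[field])
        masked.append(copy)
    return masked

# Toy records: "note" is missing in two of three rows and is dropped at a 0.5 threshold.
raw = [
    {"id_number": "abc", "disease": "hypertension", "note": None},
    {"id_number": "def", "disease": "diabetes", "note": ""},
    {"id_number": "ghi", "disease": "hypertension", "note": "follow-up"},
]
cleaned = mask_sensitive(drop_sparse_fields(raw, max_missing_rate=0.5), ["id_number"])
print(cleaned[0])
```

Because the same input always yields the same digest, records belonging to one person can still be linked after masking, which the later graph construction relies on.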
The specific data processing steps include:
1) Data cleaning: in medical data, missing values cannot be filled in, so data whose missing rate is higher than a set threshold is deleted. Data that is obviously erroneous is also deleted.
2) Data normalization:
a. The disease codes and disease names of the original data set are mapped to the International Classification of Diseases, version ICD-10. The mapping is divided into the following three cases:
If a disease name in the data set exactly matches a disease name in ICD-10, the disease name of the original data set is retained and its disease code is changed to the corresponding disease code in ICD-10.
If a disease name in the data set cannot be exactly matched to a disease name in ICD-10, the disease name is first segmented into words and converted into a word vector using Word2Vec; the disease names in ICD-10 are converted into word vectors by the same algorithm, and the similarity of the two word vectors is calculated. For disease names whose similarity exceeds the threshold, the disease code and disease name in the original data set are changed to the disease code and disease name in ICD-10.
Disease names whose similarity is lower than the threshold are mapped manually.
b. The drug codes and drug names in the original data set are mapped to the Pharmacopoeia of the People's Republic of China (2015 edition). The procedure is similar to that for disease names and is not repeated here.
3) Data encryption: sensitive data such as identity card numbers, names, and home addresses are processed with the MD5 algorithm, that is, converted into meaningless character strings, so that sensitive information is not leaked when the data is used.
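The three-case mapping of disease names in step 2) can be sketched as below. The sketch does not train Word2Vec; it assumes the word vectors have already been produced and uses cosine similarity as the similarity measure, which the text does not specify. The function `map_disease`, the two-dimensional toy vectors, and the threshold value are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def map_disease(name, name_vec, icd_codes, icd_vectors, threshold):
    """Return (icd_name, icd_code, how) following the three cases in the text."""
    if name in icd_codes:  # case 1: exact name match
        return name, icd_codes[name], "exact"
    best_name, best_sim = None, -1.0
    for icd_name, vec in icd_vectors.items():  # case 2: most similar ICD-10 name
        sim = cosine(name_vec, vec)
        if sim > best_sim:
            best_name, best_sim = icd_name, sim
    if best_name is not None and best_sim > threshold:
        return best_name, icd_codes[best_name], "similar"
    return None, None, "manual"  # case 3: left for manual mapping

# Toy ICD-10 subset with made-up 2-D vectors standing in for Word2Vec output.
icd_codes = {"essential hypertension": "I10", "type 2 diabetes mellitus": "E11"}
icd_vectors = {"essential hypertension": [1.0, 0.0], "type 2 diabetes mellitus": [0.0, 1.0]}

exact = map_disease("essential hypertension", [1.0, 0.0], icd_codes, icd_vectors, 0.8)
similar = map_disease("high blood pressure", [0.9, 0.1], icd_codes, icd_vectors, 0.8)
manual = map_disease("unclear complaint", [0.7, 0.7], icd_codes, icd_vectors, 0.8)
print(exact, similar, manual)
```

The same structure would apply to drug names in item b, with the pharmacopoeia entries in place of the ICD-10 names.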
Step (3): construct a P-TL graph from the medical treatment records of insured persons.
The graph has two types of nodes: P represents the set of insured persons in the medical treatment records; TL represents the set of medical treatment time and place information in the medical treatment records, each element being represented as <medical treatment time, medical treatment place>. The graph has two types of edges e. One type connects two insured persons and is denoted e(p_i, p_j), where p_i, p_j ∈ P; its weight w(p_i, p_j) is calculated in step (5). The other type connects an insured person with a medical treatment time and place and is denoted e(p_i, tl_j), where p_i ∈ P and tl_j ∈ TL; its weight w(p_i, tl_j) depends on the medical treatment time and place of the insured person. The specific steps are as follows:
To calculate the weight w(p_i, tl_j) of edge e(p_i, tl_j), a time threshold Φ is used, which the disclosure sets to two days. Here tl_j = <t_j, l_j>, where t_j is the medical treatment time in tl_j and l_j is the medical treatment place in tl_j. Let t_i denote the medical treatment time of patient p_i.
When patient p_i seeks medical treatment at place l_j within a time interval Φ of t_j, i.e., |t_j - t_i| < Φ, the weight w(p_i, tl_j) is calculated as follows:
Figure BDA0002477475960000071
Otherwise, when patient p_i does not seek medical treatment at place l_j within a time interval Φ of t_j, the weight w(p_i, tl_j) is:
w(p_i, tl_j) = 0.
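The edge-weight rule can be sketched in Python as follows. The formula for the non-zero weight is given only as an image in the source, so this sketch uses a placeholder value of 1.0 (the parameter `cooccur_weight`), which is an assumption and not the disclosed formula. The function name `build_ptl_matrix` and the sample visits are hypothetical; the two-day threshold Φ comes from the text.

```python
from datetime import datetime, timedelta

PHI = timedelta(days=2)  # time threshold Φ, set to two days in the text

def build_ptl_matrix(visits, tl_nodes, cooccur_weight=1.0):
    """Build the patient-by-<time, place> weight matrix.

    visits:   list of (patient, visit_time, place)
    tl_nodes: list of (time, place) nodes of the TL set
    Returns (patients, M) where M[i][j] holds w(p_i, tl_j).
    """
    patients = sorted({patient for patient, _, _ in visits})
    row = {patient: i for i, patient in enumerate(patients)}
    M = [[0.0] * len(tl_nodes) for _ in patients]
    for patient, t_i, place in visits:
        for j, (t_j, l_j) in enumerate(tl_nodes):
            # Same place and |t_j - t_i| < Φ: non-zero weight; otherwise stays 0.
            if place == l_j and abs(t_j - t_i) < PHI:
                M[row[patient]][j] = cooccur_weight  # placeholder weight (assumed)
    return patients, M

visits = [
    ("p1", datetime(2020, 1, 1), "H1"),
    ("p2", datetime(2020, 1, 2), "H1"),
    ("p3", datetime(2020, 1, 10), "H1"),
]
tl_nodes = [(datetime(2020, 1, 1), "H1")]
patients, M = build_ptl_matrix(visits, tl_nodes)
print(patients, M)
```

Here p1 and p2 fall within two days of the node's time at the same place and receive the placeholder weight, while p3 keeps a weight of 0.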
Step (4): in the P-TL graph constructed in step (3), the suspicious patient groups who frequently seek medical treatment at the same place at the same time, together with their suspicious medical records, are mined by a novel dual clustering algorithm, as shown in FIG. 2. The specific steps are as follows:
(4.1) Construct a matrix M of size n × m to represent the P-TL graph, where n is the number of elements in the insured-person set P, m is the number of elements in the medical treatment time and place set TL, and M_{i,j} equals the weight w(p_i, tl_j) of edge e(p_i, tl_j) in the P-TL graph.
(4.2) The dual clustering algorithm clusters the rows and columns of the matrix at the same time; in this way, suspicious patient groups who frequently visit the same place at the same time, together with their suspicious medical records, can be mined. Let the n-dimensional vector u (Figure BDA0002477475960000081) and the m-dimensional vector v (Figure BDA0002477475960000082) denote the left and right vectors obtained by matrix decomposition of the matrix M. The outer product of the two vectors should be as close as possible to the matrix M, i.e., as given in Figure BDA0002477475960000083.
The objective function to be solved is given in Figure BDA0002477475960000084 and Figure BDA0002477475960000085, where the quantity in Figure BDA0002477475960000086 is the number of non-zero entries in the vector u, the quantity in Figure BDA0002477475960000088 is the number of non-zero entries in the vector v, and l_u and l_v limit the maximum number of non-zero entries in u and v, respectively. Minimizing the above objective function is mathematically equivalent to minimizing the function y given in Figure BDA00024774759600000812, where λ_u and λ_v are the Lagrange multipliers at the optimum of y.
In this embodiment, the above objective function is solved using the PALM algorithm, as follows:
(4.2.1) Every entry of the vector u (Figure BDA00024774759600000813) and of the vector v (Figure BDA00024774759600000814) is initialized to 1. Let u^t (Figure BDA00024774759600000815) and v^t (Figure BDA00024774759600000816) denote the vectors u and v at the t-th iteration.
(4.2.2) Using the vectors u^t (Figure BDA00024774759600000819) and v^t (Figure BDA00024774759600000820), compute the vector u^(t+1) (Figure BDA00024774759600000821).
Let the quantity in Figure BDA00024774759600000822 denote the partial derivative of y at the point given in Figure BDA00024774759600000823; it is calculated as shown in Figure BDA00024774759600000824. Let the quantity in Figure BDA00024774759600000825 denote the Lipschitz modulus of the quantity in Figure BDA00024774759600000826; it is calculated as shown in Figure BDA00024774759600000827.
Let the function in Figure BDA00024774759600000828 be an indicator function, defined as follows: when the condition in Figure BDA00024774759600000829 holds, its value is given in Figure BDA00024774759600000830; when the condition in Figure BDA00024774759600000831 holds, its value is given in Figure BDA00024774759600000832; here the quantity in Figure BDA00024774759600000833 denotes the sum of the terms in the vector of Figure BDA00024774759600000834.
Computing the vector in Figure BDA00024774759600000835 requires solving the following optimization problem:
Figure BDA00024774759600000836
Figure BDA00024774759600000837
where η_u > 1 is a constant, set to 2. The optimization problem can then be converted into:
Figure BDA0002477475960000091
This problem is mathematically equivalent to the problem in Figure BDA0002477475960000092, whose analytical solution is given in Figure BDA0002477475960000093.
It can be seen that keeping the l_u elements of the vector in Figure BDA0002477475960000094 that have the largest absolute values, and setting the others to 0, gives the optimal solution of the optimization problem in (4.2.2). For example, if l_u is 5, the entries of the vector in Figure BDA0002477475960000095 are arranged in descending order of absolute value, the largest 5 entries are kept unchanged, and the remaining entries are set to 0. The disclosure defines α as the l_u-th largest absolute value among the elements of the vector in Figure BDA0002477475960000096; the value of the quantity in Figure BDA0002477475960000097 is then defined as follows:
when the condition in Figure BDA0002477475960000098 holds, the value is given in Figure BDA0002477475960000099;
when the condition in Figure BDA00024774759600000910 holds, the value is given in Figure BDA00024774759600000911.
(4.2.3) Using the vectors u^(t+1) (Figure BDA00024774759600000912) and v^t (Figure BDA00024774759600000913), compute the vector v^(t+1) (Figure BDA00024774759600000914).
Let the quantity in Figure BDA00024774759600000915 denote the partial derivative of y at the point given in Figure BDA00024774759600000916; it is calculated as shown in Figure BDA00024774759600000917. Let the quantity in Figure BDA00024774759600000918 denote the Lipschitz modulus of the quantity in Figure BDA00024774759600000919; it is calculated as shown in Figure BDA00024774759600000920.
Let the function in Figure BDA00024774759600000921 be an indicator function, defined as follows: when the condition in Figure BDA00024774759600000922 holds, its value is given in Figure BDA00024774759600000923; when the condition in Figure BDA00024774759600000924 holds, its value is given in Figure BDA00024774759600000925; here the quantity in Figure BDA00024774759600000926 denotes the sum of the terms in the vector of Figure BDA00024774759600000927.
Computing the vector in Figure BDA00024774759600000928 requires solving the following optimization problem:
Figure BDA00024774759600000929
Figure BDA00024774759600000930
where η_v > 1 is a constant, set to 2. The optimization problem can then be converted into:
Figure BDA0002477475960000101
This problem is mathematically equivalent to the problem in Figure BDA0002477475960000102. Similarly, its analytical solution is given in Figure BDA0002477475960000103.
The disclosure defines β as the l_v-th largest absolute value among the elements of the vector in Figure BDA0002477475960000104; the value of the quantity in Figure BDA0002477475960000105 is then defined as follows:
when the condition in Figure BDA0002477475960000106 holds, the value is given in Figure BDA0002477475960000107;
when the condition in Figure BDA0002477475960000108 holds, the value is given in Figure BDA0002477475960000109.
(4.2.4) Repeat steps (4.2.2) and (4.2.3) until the result converges. For example, the calculation stops when the conditions in Figure BDA00024774759600001010 and Figure BDA00024774759600001011 are both satisfied, with the tolerance set to 0.01.
For the resulting vector u (Figure BDA00024774759600001012) and vector v (Figure BDA00024774759600001013), the rows and columns of the matrix M corresponding to their non-zero entries are clustered, respectively, to obtain a submatrix. The disclosure sets two thresholds Ψ and Y to limit the minimum numbers of rows and columns of the submatrix, which are set to 2 and 10, respectively. The row set of the submatrix is the mined suspicious patient group and contains no fewer than Ψ elements; the column set is a set of medical treatment time and place information, corresponding to the suspicious medical records of the patient group suspected of fraud, and contains no fewer than Y elements. For example, if Y were set to 1, the suspicious patient group would only need to have sought medical treatment at the same place at the same time once, which is not a sufficient basis for judging an abnormality.
(4.2.5) Step (4.2.4) yields only one suspicious patient group. To mine a further suspicious patient group, the elements of the rows of the matrix M that correspond to the already-mined patients are set to zero. For example, if the patient corresponding to the i-th row of the matrix has been mined, that row is updated as shown in Figure BDA00024774759600001014.
Step (4.2.4) is then performed on the updated matrix M to mine a new suspicious patient group and their suspicious medical records.
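The iteration in steps (4.2.1) to (4.2.4) can be illustrated with a pure-Python sketch. The exact formulas appear only as images in the source, so the sketch rests on assumptions that should be read as such: the objective is taken to be the squared Frobenius norm of M - u v^T under limits on the number of non-zero entries, the partial derivative with respect to u is taken as 2(u·||v||^2 - M v) with Lipschitz modulus 2||v||^2 (and symmetrically for v), and convergence is measured by the Euclidean norm of the change in u and v. The names `hard_threshold`, `mine_group`, and `is_suspicious` are hypothetical.

```python
def hard_threshold(x, k):
    """Keep the k entries of x with the largest absolute value; set the rest to 0."""
    order = sorted(range(len(x)), key=lambda i: (-abs(x[i]), i))
    keep = set(order[:k])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]

def mine_group(M, l_u, l_v, eta=2.0, tol=0.01, max_iter=500):
    """Sparse rank-one factorisation M ~ u v^T by alternating proximal gradient steps."""
    n, m = len(M), len(M[0])
    u, v = [1.0] * n, [1.0] * m  # (4.2.1): initialise every entry to 1
    for _ in range(max_iter):
        vv = sum(x * x for x in v)
        if vv == 0.0:
            break
        # (4.2.2): gradient step on u with step 1 / (eta * L_u), then keep l_u entries.
        Mv = [sum(M[i][j] * v[j] for j in range(m)) for i in range(n)]
        step = 1.0 / (eta * 2.0 * vv)
        u_new = hard_threshold(
            [u[i] - step * 2.0 * (u[i] * vv - Mv[i]) for i in range(n)], l_u)
        uu = sum(x * x for x in u_new)
        if uu == 0.0:
            u = u_new
            break
        # (4.2.3): gradient step on v using the updated u, then keep l_v entries.
        Mtu = [sum(M[i][j] * u_new[i] for i in range(n)) for j in range(m)]
        step = 1.0 / (eta * 2.0 * uu)
        v_new = hard_threshold(
            [v[j] - step * 2.0 * (v[j] * uu - Mtu[j]) for j in range(m)], l_v)
        du = sum((a - b) ** 2 for a, b in zip(u_new, u)) ** 0.5
        dv = sum((a - b) ** 2 for a, b in zip(v_new, v)) ** 0.5
        u, v = u_new, v_new
        if du < tol and dv < tol:  # (4.2.4): stop once both changes are small
            break
    rows = [i for i, x in enumerate(u) if x != 0.0]
    cols = [j for j, x in enumerate(v) if x != 0.0]
    return rows, cols

def is_suspicious(rows, cols, min_rows=2, min_cols=10):
    """Apply the thresholds Ψ (rows) and Y (columns) from step (4.2.4)."""
    return len(rows) >= min_rows and len(cols) >= min_cols

# Toy matrix: patients 0-2 share visits 0-1; patient 4 has one unrelated visit.
M = [
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
rows, cols = mine_group(M, l_u=3, l_v=2)
print(rows, cols, is_suspicious(rows, cols, min_cols=2))
```

The toy call lowers the column threshold to 2 only because the example matrix is small; the text sets Y to 10. To look for a further group as in step (4.2.5), the rows listed in `rows` would be zeroed in M before calling `mine_group` again.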
And (5): the similarity of the prescription from patient to patient is calculated as shown in figure 3.
As mentioned in step (3), in the P-TL diagram, the edge e (P)i,pj) Weight w (p) ofi,pj) Representative of patient piTo the patient pjThe similarity of the prescriptions between them. In step (4), a suspect patient population and their suspect medical records are mined. In this step, only calculation of the similarity of the prescription between these suspicious patients is considered, not all patients, and calculation of the similarity of the prescription only considers the suspicious medical records of the patients, not all medical records of the patients. The method comprises the following specific steps:
(5.1) Calculate the weight (AW) of each drug in the medical insurance records. Drugs that are attractive to fraudsters, such as drugs with a high reimbursement ratio, a high selling price and a wide range of uses, should be weighted more heavily. Because the fraudulent group resells the drugs, the present disclosure is concerned only with drugs that can be resold and not with other kinds of items, such as surgery, detection reagents, etc. The specific steps are as follows:
(5.1.1) If the item is a resellable drug,
AW(drug) = drug reimbursement ratio × drug selling price × total number of drugs in the data set
− (1 − drug reimbursement ratio) × drug selling price × total number of drugs in the data set.
(5.1.2) The weights of all drugs are then normalized to the interval [0, 1]. Specifically:
AW_norm(drug) = (AW(drug) − min) / (max − min),
where min is the minimum of all drug weights and max is the maximum of all drug weights.
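Step (5.1) can be illustrated with a short Python sketch. It reads the weight in step (5.1.1) as the reimbursed share of the drug's total value minus the non-reimbursed share, and it assumes the usual min-max form for the normalization in step (5.1.2). The function names, the argument names and the choice to return 0.0 when all weights are equal are assumptions made for this example.

```python
def drug_weight(reimbursement_ratio, price, total_count):
    """Weight of a resellable drug, following the reading of step (5.1.1) given above."""
    reimbursed = reimbursement_ratio * price * total_count
    not_reimbursed = (1 - reimbursement_ratio) * price * total_count
    return reimbursed - not_reimbursed


def normalize_weights(weights):
    """Min-max normalize a {drug: weight} mapping into [0, 1]."""
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:
        # Assumption: with identical weights there is no spread to scale.
        return {drug: 0.0 for drug in weights}
    return {drug: (w - lo) / (hi - lo) for drug, w in weights.items()}


raw = {"drug_a": 600.0, "drug_b": 0.0, "drug_c": 300.0}
print(drug_weight(0.8, 10.0, 100))
print(normalize_weights(raw))
```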
(5.2) Calculate the similarity s_v between different medical records. Each medical record can be expressed as
v = {time, location, diagnose, medicine, dose},
which represents the time, place, disease diagnosis, medicine and medicine dosage of the medical visit.
For two medical records
v_i = {time_i, location_i, diagnose_i, medicine_i, dose_i},
v_j = {time_j, location_j, diagnose_j, medicine_j, dose_j},
the similarity s_v between them is calculated as follows:
Figure BDA0002477475960000121
wherein the (x, y) function is defined as follows:
when x and y are the same, (x, y) = 1;
when x and y are different, (x, y) = 0.
(5.3) Calculate the similarity s_p between the prescriptions of different patients. The prescription V of each patient is a collection of medical records v and can be represented as
V = {v_1, v_2, ..., v_l},
where l is the number of medical records v contained in the prescription V.
For the prescriptions V_i and V_j of two patients, the similarity s_p is calculated as follows:
Figure BDA0002477475960000122
where
|total(V_i, V_j)| = |V_i| + |V_j| − |same(V_i, V_j)|,
and |same(V_i, V_j)| is defined as:
Figure BDA0002477475960000123
where s_v(V_{i,p}, V_{j,q}) denotes the similarity between the p-th medical record of prescription V_i and the q-th medical record of prescription V_j. A is a matrix obtained by solving the following function:
Figure BDA0002477475960000124
Figure BDA0002477475960000125
Figure BDA0002477475960000126
A_{p,q} ≥ 0,
where fre_{i,p} is the frequency of occurrence of the p-th medical record of prescription V_i, and fre_{j,q} is the frequency of occurrence of the q-th medical record of prescription V_j. The function is solved as follows:
(5.3.1) Let the matrix
Figure BDA0002477475960000131
where the number of rows of matrix A equals the number of medical records contained in prescription V_i, and the number of columns of matrix A equals the number of medical records contained in prescription V_j. The matrix L has the same numbers of rows and columns as matrix A.
(5.3.2)
Figure BDA0002477475960000132
If fre_{i,p} ≤ fre_{j,q}, then A_{p,q} = fre_{i,p}, and for any u, let L_{u,p} = 0;
If fre_{i,p} > fre_{j,q}, then A_{p,q} = fre_{j,q}, and for any v, let L_{q,v} = 0;
fre_{i,p} = fre_{i,p} − A_{p,q},
fre_{j,q} = fre_{j,q} − A_{p,q}.
(5.3.3) If any element of L equals 1, step (5.3.2) is repeated until
Figure BDA0002477475960000133
holds, at which point the procedure stops.
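The iterative procedure of steps (5.3.1) to (5.3.3) can be sketched in Python as a greedy assignment. Several details are given only as formula images in the source, so this sketch makes explicit assumptions: L starts as an all-ones matrix marking every record pair as available, each iteration picks the available pair (p, q) with the highest record similarity, and the exhausted side (row p or column q) is then marked unavailable. The names `greedy_match`, `sim`, `rem_i` and `rem_j` are hypothetical.

```python
import numpy as np


def greedy_match(sim, fre_i, fre_j):
    """Fill A greedily; sim[p, q] is the record similarity, fre_i / fre_j the record frequencies."""
    sim = np.asarray(sim, dtype=float)
    rem_i = np.asarray(fre_i, dtype=float).copy()  # remaining frequency per record of V_i
    rem_j = np.asarray(fre_j, dtype=float).copy()  # remaining frequency per record of V_j
    A = np.zeros_like(sim)
    L = np.ones_like(sim, dtype=int)  # assumption: 1 marks a pair that is still available
    while L.any():
        # Assumption: choose the available pair with the highest similarity.
        masked = np.where(L == 1, sim, -np.inf)
        p, q = np.unravel_index(np.argmax(masked), sim.shape)
        if rem_i[p] <= rem_j[q]:
            A[p, q] = rem_i[p]
            L[p, :] = 0  # record p of V_i is used up
        else:
            A[p, q] = rem_j[q]
            L[:, q] = 0  # record q of V_j is used up
        rem_i[p] -= A[p, q]
        rem_j[q] -= A[p, q]
    return A


sim = [[1.0, 0.2],
       [0.3, 0.8]]
A = greedy_match(sim, fre_i=[2, 1], fre_j=[1, 2])
print(A)
```

Each pass clears at least one row or column of L, so the loop ends after at most (rows + columns) iterations, and no row or column of A ever exceeds the frequency it started with.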
Step (6): Normal patients are filtered out of the suspicious patient groups.
The prescription similarity between suspicious patients is obtained in step (5). In the P-TL graph, the weight w(p_i, p_j) of the edge e(p_i, p_j) is set according to the similarity of the prescriptions of patient p_i and patient p_j. The present disclosure sets a threshold min_w on the prescription similarity between patients; the threshold is set to 0.35. The specific steps are as follows:
When the prescription similarity of patient p_i and patient p_j is less than the threshold min_w, the weight w(p_i, p_j) of the edge e(p_i, p_j) is set to 0, and the edge e(p_i, p_j) is regarded as absent.
When the prescription similarity of patient p_i and patient p_j is greater than or equal to the threshold min_w, the weight w(p_i, p_j) of the edge e(p_i, p_j) is set to that similarity value.
For each suspected fraudulent patient group obtained in step (4), isolated patients who are not linked to any other patient by an edge e(p_i, p_j) are filtered out; the remaining patients, who are linked to one another by edges e(p_i, p_j), are considered fraudulent if the size of the group exceeds the threshold Ψ. These fraudulent patients may be further analyzed in conjunction with demographic information.
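The filtering in step (6) amounts to thresholding the patient-to-patient edges and then keeping the connected groups that are large enough. The Python sketch below illustrates this under stated assumptions: similarities are supplied as a dictionary keyed by patient pairs, "exceeds the threshold Ψ" is read as strictly greater than Ψ, and the names `fraud_groups`, `sims`, `min_w` and `psi` are hypothetical.

```python
from collections import defaultdict


def fraud_groups(group, sims, min_w=0.35, psi=2):
    """Return the linked sub-groups of a suspicious group that exceed psi members."""
    members = set(group)
    adjacency = defaultdict(set)
    for (a, b), similarity in sims.items():
        # Edges below min_w are treated as absent.
        if similarity >= min_w and a in members and b in members:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, result = set(), []
    for start in group:
        if start in seen or start not in adjacency:
            continue  # already grouped, or an isolated patient to be filtered out
        component, stack = set(), [start]
        while stack:  # depth-first walk over the remaining edges
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) > psi:
            result.append(component)
    return result


group = ["a", "b", "c", "d", "e", "f"]
sims = {("a", "b"): 0.9, ("b", "c"): 0.4, ("c", "d"): 0.2, ("d", "e"): 0.5}
print(fraud_groups(group, sims))
```

In this toy input, patient f has no edges and is dropped as isolated, the pair d-e is linked but too small, and only a, b and c remain as a flagged group.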
The traditional method for identifying abnormal co-occurrence medical visit fraud considers only the characteristic that several medical insurance cards are frequently used at the same time. The present method uses a double clustering algorithm and also introduces a health and medical knowledge base. It not only considers the characteristic that several medical insurance cards are frequently used at the same time, mining suspicious patient groups that frequently seek medical treatment at the same place at the same time, but can also filter out normal patients who would be misjudged because of long-term regular medical visits. Medical insurance fraud is thereby identified more accurately: the identification accuracy is improved from 76% for the traditional method to 95%. The disclosure helps to identify abnormal co-occurrence medical visit fraud and to protect the medical insurance fund effectively.
Example two
The embodiment discloses abnormal medical insurance data detection equipment based on a double clustering algorithm, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the program to realize the steps of the abnormal medical insurance data detection method based on the double clustering algorithm in the first embodiment.
Example three
An object of the present embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, performs the steps of implementing the abnormal medical insurance data detection method based on the dual clustering algorithm of example one.
Example four
Referring to fig. 1, the present embodiment discloses an abnormal co-occurrence hospitalization medical insurance data detection system based on a dual clustering algorithm, which includes:
a visit information and demographic information acquisition module: acquiring visit information and demographic information;
acquiring the visit information of a patient, wherein the visit information mainly comprises: disease data, medication data, diagnosis and treatment data; acquiring demographic information of a patient, wherein the demographic information mainly comprises the age, sex, personnel category, marital, cultural level, occupation, residence and the like of the patient;
the visit information preprocessing module: data preprocessing:
sensitive data such as identity card numbers, names and home addresses are hashed with the MD5 algorithm, that is, converted into meaningless character strings, so that sensitive information is not leaked when the data are used;
in the medical data, because missing data cannot be filled in, data whose missing rate is higher than a set threshold are deleted.
According to the international disease classification standard code ICD-10, the disease diagnosis code in the diagnosis information is converted into the corresponding international disease classification standard code ICD-10.
According to the Chinese pharmacopoeia (2015 edition), the medicine codes in the diagnosis information are converted into the corresponding medicine codes in the Chinese pharmacopoeia (2015 edition).
a medical record data preprocessing module: collecting the medical record data of medical insurance participants and constructing a P-TL graph, wherein the graph contains two types of nodes: P represents the set of medical insurance participants in the medical records, and TL represents the set of medical treatment time and medical treatment place information in the medical records;
an abnormal medical data detection module: for the constructed P-TL graph, mining, through a double clustering algorithm, suspicious patient groups who frequently seek medical treatment at the same place at the same time and the suspicious medical records of those groups;
filtering out normal medical records from the suspicious medical records: for each suspected fraudulent patient group in the obtained suspicious medical records, isolated patients who are not linked to other patients by edges are filtered out, while the patient groups in the remaining suspicious medical records that are linked to each other by edges are considered abnormal medical data if the number of people exceeds a threshold.
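The two preprocessing operations of the visit information preprocessing module, hashing sensitive fields with MD5 and deleting data with a high missing rate, can be sketched in Python with the standard library. The sketch assumes that records are dictionaries, that a missing value is represented by `None`, and that "data having a missing rate higher than a set threshold" refers to whole fields; the function names, field names and the 0.5 threshold are hypothetical.

```python
import hashlib


def pseudonymize(value):
    """Replace a sensitive string with its MD5 hex digest."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def drop_sparse_fields(records, max_missing_rate=0.5):
    """Drop fields whose share of missing (None) values exceeds the threshold."""
    fields = {key for record in records for key in record}
    keep = set()
    for field in fields:
        missing = sum(1 for record in records if record.get(field) is None)
        if missing / len(records) <= max_missing_rate:
            keep.add(field)
    return [{k: v for k, v in record.items() if k in keep} for record in records]


records = [
    {"id_number": "abc", "diagnosis": "I10", "note": None},
    {"id_number": "def", "diagnosis": "E11", "note": None},
    {"id_number": "ghi", "diagnosis": None, "note": "x"},
]
cleaned = drop_sparse_fields(records, max_missing_rate=0.5)
for record in cleaned:
    record["id_number"] = pseudonymize(record["id_number"])
print(cleaned[0])
```

MD5 is a one-way hash rather than reversible encryption, so the same input always maps to the same string and records of one person can still be linked after processing.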
This embodiment of the application uses the "Biclustering-sim" model, a double clustering model, to identify fraud.
The double clustering method is used to mine suspicious patient groups that frequently seek medical treatment at the same place at the same time, together with the suspicious medical records of those co-occurring visits. Traditional double clustering methods usually require the number of clusters to be set manually and cannot guarantee the quality of the final clustering result. The double clustering method of the present disclosure does not require the number of clusters to be set in advance and ensures the quality of the mined clusters by placing constraints on the quality of the final clustering result. The disclosure also provides a method for measuring the similarity of medical prescriptions. The prescription of each patient is a complex set of medical records. Unlike an ordinary set, calculating the similarity of complex sets requires considering both the frequency of occurrence of the elements within each set and the degree of similarity between those elements. The present invention takes these factors into account and can therefore calculate the similarity of medical prescriptions better.
The steps involved in the apparatuses of the above second, third and fourth embodiments correspond to the first embodiment of the method, and the detailed description thereof can be found in the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any of the methods of the present disclosure.
Those skilled in the art will appreciate that the modules or steps of the present disclosure described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code executable by computing means, whereby the modules or steps may be stored in memory means for execution by the computing means, or separately fabricated into individual integrated circuit modules, or multiple modules or steps thereof may be fabricated into a single integrated circuit module. The present disclosure is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (10)

1. The abnormal medical insurance data detection method based on the dual clustering algorithm is characterized by comprising the following steps:
collecting the medical record data of medical insurance participants and constructing a P-TL graph, wherein the graph contains two types of nodes: P represents the set of medical insurance participants in the medical records, and TL represents the set of medical treatment time and medical treatment place information in the medical records;
for the constructed P-TL graph, mining, through a double clustering algorithm, suspicious patient groups who frequently seek medical treatment at the same place at the same time and the suspicious medical records of those groups;
filtering out normal medical records from the suspicious medical records: for each suspected fraudulent patient group in the obtained suspicious medical records, isolated patients who are not linked to other patients by edges are filtered out, while the patient groups in the remaining suspicious medical records that are linked to each other by edges are considered abnormal medical data if the number of people exceeds a threshold.
2. The abnormal medical insurance data detection method based on the dual clustering algorithm as claimed in claim 1, wherein there are two types of edges in the P-TL graph:
one is the edge connecting one medical insurance participant to another, denoted e(p_i, p_j), where p_i, p_j ∈ P;
the other is the edge between a medical insurance participant and a medical treatment time and place, denoted e(p_i, tl_j), where p_i ∈ P, tl_j ∈ TL.
3. The abnormal medical insurance data detection method based on the dual clustering algorithm as claimed in claim 2, wherein, for calculating the weight w(p_i, tl_j) of the edge e(p_i, tl_j), a time threshold φ is set, where tl_j = <t_j, l_j>, t_j denotes the medical treatment time in tl_j, l_j denotes the medical treatment place in tl_j, and t_i denotes the time at which patient p_i seeks medical treatment;
when patient p_i seeks medical treatment at place l_j within a time interval of φ from t_j, the weight w(p_i, tl_j) is calculated as follows:
Figure FDA0002477475950000021
otherwise, the weight w(p_i, tl_j) is 0.
4. The method as claimed in claim 1, wherein, before the step of clustering, a matrix M of size n × m is constructed to represent the P-TL graph, where n is the number of elements contained in the medical insurance participant set P, m is the number of elements contained in the set TL of medical treatment time and place information, and M_{i,j} equals the weight of the edge e(p_i, tl_j) in the P-TL graph.
5. The abnormal medical insurance data detection method based on the dual clustering algorithm of claim 1, wherein the dual clustering algorithm clusters rows and columns of the matrix at the same time, and mines suspicious patient groups and their suspicious medical records that frequently visit the same place at the same time.
6. The method of claim 5, wherein the PALM algorithm is applied to solve the objective function of the dual clustering algorithm to obtain a suspicious patient group; to mine a further suspicious patient group, the elements of the rows of the matrix M corresponding to the mined patients are set to zero, and the function is then solved again on the updated matrix M to obtain a new suspicious patient group and the corresponding suspicious medical records.
7. The abnormal medical insurance data detection method based on the dual clustering algorithm of claim 1, wherein, when the prescription similarity between patients is calculated, only the similarity between the prescriptions of suspicious patients is calculated, and only the suspicious medical records of those patients are considered;
in the P-TL graph, the weight w(p_i, p_j) of the edge e(p_i, p_j) is set according to the similarity of the prescriptions of patient p_i and patient p_j.
8. Abnormal medical insurance data detection equipment based on the double clustering algorithm comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, and is characterized in that the processor executes the program to realize the steps of the abnormal medical insurance data detection method based on the double clustering algorithm according to any one of claims 1 to 7.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for abnormal medical insurance data detection based on the double clustering algorithm according to any one of claims 1 to 7.
10. Abnormal co-occurrence medical insurance data detection system based on dual clustering algorithm, characterized by comprising:
a medical record data preprocessing module: collecting the medical record data of medical insurance participants and constructing a P-TL graph, wherein the graph contains two types of nodes: P represents the set of medical insurance participants in the medical records, and TL represents the set of medical treatment time and medical treatment place information in the medical records;
an abnormal medical data detection module: for the constructed P-TL graph, mining, through a double clustering algorithm, suspicious patient groups who frequently seek medical treatment at the same place at the same time and the suspicious medical records of those groups;
filtering out normal medical records from the suspicious medical records: for each suspected fraudulent patient group in the obtained suspicious medical records, isolated patients who are not linked to other patients by edges are filtered out, while the patient groups in the remaining suspicious medical records that are linked to each other by edges are considered abnormal medical data if the number of people exceeds a threshold.
CN202010368770.7A 2020-04-29 2020-04-29 Abnormal medical insurance data detection system and method based on dual clustering algorithm Pending CN111612636A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010368770.7A CN111612636A (en) 2020-04-29 2020-04-29 Abnormal medical insurance data detection system and method based on dual clustering algorithm

Publications (1)

Publication Number Publication Date
CN111612636A true CN111612636A (en) 2020-09-01

Family

ID=72202000

Country Status (1)

Country Link
CN (1) CN111612636A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596770A (en) * 2017-12-29 2018-09-28 山大地纬软件股份有限公司 Medicare fraud detection device and method based on outlier analysis
CN109636061A (en) * 2018-12-25 2019-04-16 深圳市南山区人民医院 Training method, device, equipment and the storage medium of medical insurance Fraud Prediction network
CN110322356A (en) * 2019-04-22 2019-10-11 山东大学 The medical insurance method for detecting abnormality and system of dynamic multi-mode are excavated based on HIN

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI RUICAN,ET AL.: "Biclustering-sim: A Novel Method to Identify Abnormal Co-occurrence Medical Visit Behaviors", 《IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE-BIBM》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112241423A (en) * 2020-09-30 2021-01-19 易联众信息技术股份有限公司 Method for mining homogeneous population group based on association rule algorithm
CN112835893A (en) * 2021-01-18 2021-05-25 浙江大学山东工业技术研究院 Method and system for detecting medical insurance fraud behavior based on clustering
CN112835893B (en) * 2021-01-18 2023-03-21 浙江大学山东工业技术研究院 Method and system for detecting medical insurance fraud behavior based on clustering
CN112884593A (en) * 2021-02-01 2021-06-01 浙江大学山东工业技术研究院 Medical insurance fraud and insurance behavior detection method and early warning device based on graph cluster analysis
CN114418008A (en) * 2022-01-21 2022-04-29 平安国际智慧城市科技股份有限公司 Medical treatment behavior identification method and device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200901