WO2022057057A1 - Method for detecting medicare fraud, and system and storage medium - Google Patents

Method for detecting medicare fraud, and system and storage medium

Info

Publication number
WO2022057057A1
Authority
WO
WIPO (PCT)
Prior art keywords
patient
fraud
medical
doctor
node
Prior art date
Application number
PCT/CN2020/127183
Other languages
French (fr)
Chinese (zh)
Inventor
李坚强
陈杰
胡晓楠
罗若恒
Original Assignee
深圳大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳大学
Publication of WO2022057057A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • The invention relates to the field of medical technology, and in particular to a method, a system and a storage medium for detecting medical insurance fraud.
  • Medical insurance is a social security program in China, established to compensate citizens or workers for economic losses caused by the risk of illness.
  • As medical insurance has spread, criminals have increasingly exploited universal coverage to commit medical insurance fraud, adding to national fiscal expenditure on medical and health care.
  • Unsupervised learning relies on outlier analysis to find potential anomalies in unlabeled data, but outlier detection methods are not suitable for highly skewed data such as medical insurance fraud data. Supervised learning requires a large number of labeled data points, both fraudulent and non-fraudulent examples, to make predictions; because of the lack of experts and medical fraud investigations, very few points can actually be labeled, so effective detection cannot be achieved.
  • The technical problem to be solved by the present invention is to provide a method, system and storage medium for detecting medical insurance fraud in view of the above defects of the prior art, aiming to solve the problem that prior-art medical insurance fraud detection methods cannot detect fraud effectively and therefore cannot prevent it.
  • a method of detecting health insurance fraud comprising:
  • the pre-labeled fraud samples are input into the established doctor-patient relationship neural network, the fraud prediction model is trained, and the predicted value that each patient node has fraudulent behavior is output from the trained fraud prediction model.
  • the efficiency of prediction can be improved on the premise of ensuring the accuracy of prediction, and the system can run quickly to predict more patient nodes with fraudulent behavior.
  • the patient node with the predicted value is input into the pre-established dynamic update network, and the invalid patient node is deleted, wherein the basis for judging the invalid patient node is:
  • Defining invalid patient nodes in an effective way can further improve the accuracy of prediction, ensure the validity of the data taken during prediction, and help improve the prediction rate.
  • deletion of invalid patient nodes specifically includes:
  • step of obtaining the pre-marked fraud samples includes:
  • the selected samples to be marked are marked by experts, and the samples with fraudulent behaviors in the samples to be marked are identified to obtain pre-marked fraud samples.
  • selecting part of the medical treatment records from the medical treatment records in a preset manner as the samples to be marked at least includes:
  • the entropy value of each patient is calculated by the maximum entropy selection strategy, and the maximum value of the calculated entropy values is selected as the sample to be marked;
  • the probability value of each patient is calculated by the maximum probability strategy, and the maximum value among the calculated probability values is selected as the sample to be marked.
  • the samples to be marked are selected in a random manner, which maximizes the randomness of selection and helps to improve the accuracy of prediction.
  • the described obtaining of the patient's medical records, the corresponding patient characteristics are extracted according to the obtained medical records, and the doctor-patient relationship neural network is established according to the extracted patient characteristics and the corresponding relationship between the patient and the doctor, including:
  • the patient identity information in the patient medical treatment information is anonymized, and the processed medical treatment information is converted into a medical treatment record of a data structure type.
  • patient privacy can be protected and patient information leakage can be avoided.
  • obtaining the patient's medical treatment record extracting the corresponding patient characteristics according to the obtained medical treatment record, and establishing a doctor-patient relationship neural network according to the extracted patient characteristics and the corresponding relationship between the patient and the doctor, specifically including:
  • the doctor-patient relationship neural network is established.
  • The present invention also discloses a system comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors,
  • the one or more programs including instructions for performing the method for detecting medical insurance fraud as described above.
  • the present invention also discloses a storage medium, wherein the storage medium stores a computer program, and the computer program can be executed to implement the method for detecting medical insurance fraud as described above.
  • The present invention provides a method, system and storage medium for detecting medical insurance fraud, wherein the method includes: acquiring a patient's medical record, extracting corresponding patient characteristics according to the obtained medical record, and establishing a doctor-patient relationship neural network according to the extracted patient characteristics and the corresponding relationship between patients and doctors; inputting pre-labeled fraud samples into the established doctor-patient relationship neural network, training a fraud prediction model, and outputting from the trained fraud prediction model a predicted value of fraudulent behavior for each patient node;
  • and determining, according to the output predicted value, whether the patient of the corresponding node has fraudulent behavior. Predicting whether patients have fraudulent behaviors through machine learning reduces the difficulty of predicting fraudulent behaviors and can effectively detect medical insurance fraud, which is conducive to the healthy development and wide adoption of the medical insurance system.
  • FIG. 1 is a flowchart of a preferred embodiment of the method for detecting medical insurance fraud in the present invention.
  • FIG. 2 is a flowchart of a specific embodiment of step S100 in the present invention.
  • FIG. 3 is a flow chart of a preferred embodiment of the present invention combined with a dynamic update network.
  • FIG. 4 is a flow chart showing a preferred embodiment of the fraud prediction model in relation to the dynamic update network in the present invention.
  • FIG. 5 is a flowchart of a specific embodiment of step S410 in FIG. 3 in the present invention.
  • FIG. 6 is a flow chart of a preferred embodiment of the execution process of the update algorithm in the present invention.
  • FIG. 7 is a comparison diagram of experimental results using and not using the dynamic update network in the present invention.
  • FIG. 8 is a functional principle block diagram of a preferred embodiment of the system of the present invention.
  • Medical insurance is a social security program in China, established to compensate citizens or workers for economic losses caused by the risk of illness. Premiums are paid by individuals and employers. When an insured person sees a doctor and incurs medical expenses, the medical insurance institution gives the patient a certain amount of economic compensation. By the end of 2018, the number of people participating in basic medical insurance in China had reached 1.35 billion, and the participation rate had exceeded 95%. At the same time, the medical insurance fund plays a pivotal role in people's lives. According to statistics from the Ministry of Human Resources and Social Security, China's medical expenditures increased from 1.45 trillion in 2008 to 4.10 trillion in 2015, an average annual growth rate of 16%. However, while the pressure on the medical insurance fund is increasing, criminals take advantage of universal medical insurance to commit medical insurance fraud.
  • Medical insurance fraud is a fraudulent behavior in the process of medical services for the purpose of seeking benefits.
  • the frauds here mainly include two categories: patients use some means to defraud medical insurance; patients and doctors jointly defraud medical insurance. From 2013 to 2017, the national fiscal medical and health expenditures totaled 5,950.2 billion yuan, with an average annual increase of 11.7%. While the country attaches great importance to medical and health care, the additional expenditures caused by medical insurance fraud are also increasing.
  • Unsupervised learning methods rely on outlier analysis to find potential anomalies in unlabeled data; however, outlier detection methods are not suitable for highly skewed data, such as health insurance fraud data.
  • Supervised learning requires a large number of labeled data points, including fraudulent and non-fraudulent examples, to achieve good predictive performance. However, because domain experts are scarce and medical fraud investigations are expensive, there are very few labeled points, and the labels are extremely unbalanced, as non-fraud examples are often not explicitly disclosed in the real world.
  • OCC: one-class classification
  • the present invention proposes a method for detecting medical insurance fraud by means of machine learning, thereby solving the problem that fraudulent behaviors cannot be effectively predicted in the prior art.
  • the method described in the present invention is explained in detail below.
  • FIG. 1 is a flowchart of a method for detecting medical insurance fraud in the present invention.
  • a method for detecting medical insurance fraud according to an embodiment of the present invention includes the following steps:
  • the hospital doctor enters the patient's medical information into the medical information system, and then the patient's medical record exists in the medical information system.
  • The medical treatment information includes but is not limited to the patient's identity information, insurance type, purchased drug items, purchase quantity, and date of visit. Since the recorded information can grow with the patient's treatment and the insurance types are similar, this section does not explain the medical treatment information in detail. It is understood that this only illustrates part of what the medical treatment information covers and is not used to limit the present invention.
  • The patient's visit information also includes the doctor who received the patient for the current visit. By obtaining the patient's medical record, the patient's detailed medical information and the corresponding doctor for a single visit can be obtained.
  • The invention models the patient characteristics and the patient-doctor relationships extracted from the medical treatment information and then establishes a doctor-patient relationship neural network, which strengthens the connections between patient nodes and further facilitates their classification.
  • S200 Input the pre-marked fraud samples into the established doctor-patient relationship neural network, train a fraud prediction model, and output a prediction value of fraudulent behavior of each patient node from the trained fraud prediction model.
  • the selected labeled fraud samples are input into the doctor-patient relationship neural network, and then the model is trained.
  • More patient nodes are classified and labeled according to the pre-labeled fraud samples; that is, the predicted value of fraudulent behavior is calculated for all patient nodes, and patients' fraudulent behavior is then analyzed through the predicted values. This aids the detection and supervision of medical insurance fraud and supports the healthy adoption of medical insurance.
  • the degree of the possibility of fraudulent behavior can be determined by the size of the calculated predicted value.
  • Patient nodes with larger predicted values are treated as key objects of analysis, and the limit at which behavior counts as fraudulent is set on the size of the predicted value.
  • The limit may be set by domain experts, medical insurance administrators, or the hospital; this is not described in detail here and only illustrates that fraud can be determined from the predicted value.
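As an illustration of how a predicted value can be turned into a decision, the minimal Python sketch below flags the nodes whose predicted value reaches a limit and lists them from most to least suspicious. The function name `flag_fraud`, the dictionary layout, and the default limit of 0.5 are assumptions made for this example; the text leaves the actual limit to domain experts, administrators, or the hospital.

```python
def flag_fraud(predictions, threshold=0.5):
    """Return patient IDs whose predicted value is at or above the threshold,
    ordered from the largest predicted value to the smallest.

    predictions: dict mapping a (hypothetical) patient ID to a value in [0, 1].
    threshold:   assumed cut-off; the source does not specify one.
    """
    flagged = [pid for pid, score in predictions.items() if score >= threshold]
    return sorted(flagged, key=lambda pid: predictions[pid], reverse=True)


scores = {"p1": 0.91, "p2": 0.12, "p3": 0.67}
print(flag_fraud(scores))                 # nodes at or above 0.5
print(flag_fraud(scores, threshold=0.8))  # a stricter, reviewer-chosen limit
```

Sorting the flagged nodes puts the largest predicted values first, matching the idea that they are the key objects of analysis.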
  • the step S100 specifically includes:
  • When a patient's visit information is entered in the medical information system, the system adds a corresponding medical record. The characteristics of each patient, and the doctor each patient saw at each visit, are extracted from all the medical records in the system; these two kinds of information are then processed and represented as data, which is convenient for machine learning.
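A minimal sketch of this extraction step, assuming each medical record is reduced to a (patient, doctor, amount) tuple: it builds one feature row per patient and a patient-doctor weight matrix counting visits. The record layout, the two chosen features, and the name `build_matrices` are hypothetical and serve only to show the two kinds of information in numeric form.

```python
import numpy as np

# Hypothetical visit records: (patient_id, doctor_id, amount).
records = [
    ("p1", "d1", 120.0),
    ("p1", "d2", 80.0),
    ("p2", "d1", 300.0),
]


def build_matrices(records):
    """Turn visit records into a patient feature matrix and a patient-doctor weight matrix."""
    patients = sorted({r[0] for r in records})
    doctors = sorted({r[1] for r in records})
    p_idx = {p: i for i, p in enumerate(patients)}
    d_idx = {d: j for j, d in enumerate(doctors)}

    # X: one row per patient; columns are [number of visits, total amount].
    X = np.zeros((len(patients), 2))
    # B: entry (i, j) counts the visits of patient i to doctor j.
    B = np.zeros((len(patients), len(doctors)))
    for pid, did, amount in records:
        X[p_idx[pid], 0] += 1
        X[p_idx[pid], 1] += amount
        B[p_idx[pid], d_idx[did]] += 1
    return patients, doctors, X, B


patients, doctors, X, B = build_matrices(records)
print(patients, doctors)
print(X)
print(B)
```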
  • A characteristic degree matrix is established for the patient characteristics, and an adjacency matrix is established according to the relationship between the doctor and the patient; the algorithm that forms the doctor-patient relationship neural network from the degree matrix and the adjacency matrix is demonstrated as follows. Specifically, define the patient (P) and the doctor (D).
  • any patient node is represented as:
  • to represent a convolution kernel function
  • the present invention defines the representation of the convolution operation as follows:
  • $I_n$ is the identity matrix; the degree matrix represents patient characteristic information (such as date of consultation, type of medical insurance, amount, etc.) and has its own calculation formula;
  • $A$ is the adjacency matrix (representing the weight information between the patient and the doctor); the remaining terms are the diagonal matrix composed of the eigenvalues of the Laplacian matrix, the eigenvectors of the Laplacian matrix, and the filter, which is a diagonal matrix with respect to the Laplacian matrix.
  • GCN: Graph Convolution Network
  • $H^{(0)} = X$ is the patient node information
  • $H^{(1)}$ represents the output of the first layer of the graph convolutional neural network
  • $W^{(1)}$ is the weight matrix of the first layer of the graph convolutional neural network
  • $\sigma(\cdot)$ is the sigmoid activation function.
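The layer formulas themselves are not preserved in this text, so the following NumPy sketch only illustrates one common form of a first graph-convolution layer that is consistent with the symbols listed above. It assumes the widely used propagation rule $H^{(1)} = \sigma(\hat{D}^{-1/2}(A + I_n)\hat{D}^{-1/2} H^{(0)} W^{(1)})$ with a square node-to-node adjacency matrix; the function names and this normalization are assumptions, not the patent's stated definition.

```python
import numpy as np


def normalized_adjacency(A):
    """Return D^{-1/2} (A + I_n) D^{-1/2}, an assumed symmetric normalization."""
    A_hat = A + np.eye(A.shape[0])   # add self-loops
    deg = A_hat.sum(axis=1)          # node degrees, at least 1 for non-negative A
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gcn_layer(A, H0, W1):
    """First layer: H1 = sigmoid(A_norm @ H0 @ W1)."""
    return sigmoid(normalized_adjacency(A) @ H0 @ W1)


# Toy graph with three nodes in a chain and two input features per node.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H0 = np.ones((3, 2))
W1 = np.zeros((2, 4))   # zero weights, so every output equals sigmoid(0)
H1 = gcn_layer(A, H0, W1)
print(H1.shape)
```

With sigmoid as the activation, every entry of the layer output lies strictly between 0 and 1.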
  • Training the prediction model generally requires a sufficient amount of labeled data, and the more labeled data there is, the higher the prediction accuracy. In practical applications, the labeling of fraudulent behaviors relies mainly on investigation by domain experts; the cost is undoubtedly huge, and mobilizing experts to investigate is very inefficient. Moreover, manual labeling of ever-increasing medical data is obviously not realistic. Therefore, in the present invention, the most valuable data is selected for labeling by active learning, reducing the manpower and financial resources that labeling requires in existing methods.
  • the present invention uses the method of active learning to mark fraud samples, including:
  • S220 Perform expert annotation on the selected samples to be marked, identify samples with fraudulent behaviors among the samples to be marked, and obtain pre-marked fraud samples.
  • the method of selecting samples to be marked in a preset manner in step S210 includes at least one or more of a maximum entropy strategy, a random strategy, and a maximum probability strategy.
  • Several sampling methods are briefly introduced below. The strategies listed here may be combined with one another or with other means, for example selecting the average point, mean point, or middle point of the overall sample taken. Such variants are extensions of the manners listed in this embodiment; they are not explained in detail here, and they all fall within the scope of protection of this embodiment.
  • Method 1: Calculate the entropy value of each patient through the maximum entropy selection strategy, and select the patient with the maximum calculated entropy value as the sample to be marked.
  • $n$ represents the number of different values of the random variable $x$
  • $p_i$ represents the probability that $x$ takes the value $i$.
  • the randomness of the data can be guaranteed.
  • The confidence with which a sample point belongs to a category is described by the conditional entropy. A larger conditional entropy indicates that the classification of the sample point is less clear (lower classification confidence); a smaller conditional entropy indicates that the classification is clearer (higher classification confidence).
  • the conditional entropy value is calculated by the following formula: $H(Y \mid Z) = H(Z, Y) - H(Z)$
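The entropy formula that belongs with the definitions of $n$ and $p_i$ is not preserved here; the sketch below assumes the usual Shannon entropy $H(x) = -\sum_{i=1}^{n} p_i \log p_i$ and shows how the most uncertain samples could be picked for expert labeling. The names `entropy` and `select_max_entropy`, and the use of predicted class distributions as input, are assumptions for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy -sum(p_i * log(p_i)); zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_max_entropy(class_probs, k=1):
    """Return the k sample IDs whose class distribution has the highest entropy."""
    ranked = sorted(class_probs, key=lambda sid: entropy(class_probs[sid]), reverse=True)
    return ranked[:k]


# Hypothetical predicted distributions over (non-fraud, fraud) for three patients.
class_probs = {
    "p1": [0.95, 0.05],
    "p2": [0.50, 0.50],
    "p3": [0.70, 0.30],
}
print(select_max_entropy(class_probs, k=2))
```

The patient with the 50/50 distribution is selected first because its classification is the least clear, which is the situation in which an expert label adds the most information.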
  • Method 2: Adopt a random strategy, in which part of the medical records are randomly selected as the samples to be marked. That is, a preset number of medical visit records are randomly selected from all medical visit records as samples to be marked.
  • Method 3: Calculate the probability value of each patient through the maximum probability strategy, and select the patient with the maximum calculated probability value as the sample to be marked. Specifically, the probability value of each patient is calculated, and the sample to be marked is selected according to the size of the probability value. Since calculating the probability value belongs to the prior art, no example is given here.
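Methods 2 and 3 can be sketched in a few lines of Python. The function names, the optional `seed` parameter, and the dictionary of per-patient probabilities are assumptions for this example; how the probability values themselves are computed is left open, as in the text.

```python
import random


def select_random(record_ids, k, seed=None):
    """Method 2: draw k distinct records uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(list(record_ids), k)


def select_max_probability(fraud_probs, k):
    """Method 3: take the k patients with the largest probability values."""
    return sorted(fraud_probs, key=fraud_probs.get, reverse=True)[:k]


picked = select_random(range(10), 3, seed=0)
top = select_max_probability({"p1": 0.2, "p2": 0.9, "p3": 0.6}, 2)
print(picked, top)
```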
  • the maximum entropy strategy is preferentially selected to select the samples to be marked, which can select the most uncertain samples to be marked with the greatest randomness, thereby increasing the authenticity of the sample selection.
  • step of training the prediction model in step S200 includes:
  • the prediction model can be established.
  • The Bayesian formula obtains the posterior probability of Z by observing the prior probability of patient visit data X. In practice, however, there is only data about X and no distribution function for X; that is, $p(X)$ is unknown, then $p(Z \mid \dots$
  • An auto-encoder consists of two parts: an encoder and a decoder. What can be obtained directly in the present invention is the patient's visit data X, and X is generated by the hidden variable Z; the generative model from $Z \to X$ is $p_\theta(X \mid \dots$
  • the distribution of the patient's visit records can be obtained.
  • the encoder in the VAE (variational auto-encoder)
  • VAE: variational auto-encoder
  • the expression of the prediction model is obtained by further combining the doctor-patient relationship neural network with the variational auto-encoder:
  • $\log p_\theta(X^{(i)}) = \mathrm{KL}(q_\phi(Z \mid \dots$
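For reference, the standard variational auto-encoder decomposition of the log-likelihood is written out below. It is offered as an assumption about the form the truncated expression above takes, not as the patent's exact formula, which may additionally condition on the graph structure; the symbols $\theta$, $\phi$ and $\mathcal{L}$ follow the usual VAE convention.

```latex
\log p_\theta\big(X^{(i)}\big)
  = \mathrm{KL}\Big(q_\phi\big(Z \mid X^{(i)}\big) \,\Big\|\, p_\theta\big(Z \mid X^{(i)}\big)\Big)
  + \mathcal{L}\big(\theta, \phi; X^{(i)}\big),
\qquad
\mathcal{L}\big(\theta, \phi; X^{(i)}\big)
  = \mathbb{E}_{q_\phi(Z \mid X^{(i)})}\Big[\log p_\theta\big(X^{(i)} \mid Z\big)\Big]
  - \mathrm{KL}\Big(q_\phi\big(Z \mid X^{(i)}\big) \,\Big\|\, p_\theta(Z)\Big).
```

Because the first KL term is non-negative, $\mathcal{L}$ is a lower bound on the log-likelihood, and maximizing it trains the encoder $q_\phi$ and the decoder $p_\theta$ together without needing $p(X)$ explicitly.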
  • The patient node can be predicted through the fraud prediction model. Specifically, the patient's medical record is used as the input data to establish a doctor-patient relationship neural network, the output of the doctor-patient relationship neural network is used as the input of the variational auto-encoder, and finally the prediction result is output. All patient nodes are trained in the fraud prediction model. When the preset number of training iterations is reached, the prediction and classification of the unknown patient nodes is complete; that is, the predicted value of fraudulent behavior is calculated for the patient nodes. The predicted value lies between 0 and 1, where 0 is non-fraud and 1 is fraud.
  • The present invention also proposes an online update strategy: new data is added automatically every day, and by adding new nodes and deleting useless old nodes, the number of nodes in the graph is kept at a fixed level, so that the system can complete training in a short time and maintain good real-time performance.
  • strategy implementation steps for online update in the present invention are as follows:
  • step S200 it further includes:
  • step S410 is performed to input patient nodes with predicted values into the pre-established dynamic update network, and delete invalid patient nodes.
  • Newly added medical treatment records can thus be added to the fraud prediction model for timely fraud analysis, and removing invalid data from the original medical treatment records in the fraud prediction model keeps the system running efficiently.
  • fraud prediction is performed on all medical treatment records according to the originally set prediction period to ensure the timeliness of the calculated predicted values.
  • S420 Arrange the remaining medical treatment records and the newly added medical treatment records after deleting the invalid node into an updated medical treatment record.
  • steps S100-S430 are performed cyclically by using the updated medical treatment record as the source data for obtaining the medical treatment record in step S100, so as to achieve the effect of continuously updating data and prediction nodes, and ensure the feasibility of the prediction model.
  • the system can be updated regularly, and some nodes with relatively little information in the graph relational network are deleted at each update.
  • Model prediction accuracy can thus be guaranteed while maintaining real-time performance and training efficiency.
  • S60 train a fraud prediction model; establish a fraud prediction model according to the marked fraud samples and the doctor-patient relationship neural network;
  • Steps S20-S90 are looped.
  • the step S410 specifically includes:
  • S412. Calculate the priority of each patient node respectively according to the generation date and the predicted value of the patient node with the predicted value.
  • The factors for judging a patient node to be an invalid node are as follows: the patient's visit time (the earlier the visit, the sooner the node is deleted) and the probability that the patient is predicted to be fraudulent (the smaller the probability, the sooner the node is deleted).
  • $s$ is the priority (the smaller the priority, the sooner the node is deleted)
  • $p$ is the probability that the patient node is predicted to be fraudulent
  • $d$ is the patient's visit date.
  • the probability and the date each have a weight, subscripted $p$ and $d$ respectively.
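The priority formula is not preserved in this text. The sketch below assumes a simple weighted sum of the predicted fraud probability and a visit recency scaled to [0, 1], so that older visits and lower probabilities give a smaller priority and are deleted first. The weight names `w_p` and `w_d`, their default values, the recency scaling, and the function names are all hypothetical.

```python
from datetime import date


def priority(p, visit_date, oldest, newest, w_p=0.5, w_d=0.5):
    """Assumed score: w_p * probability + w_d * recency, with recency in [0, 1]."""
    span = (newest - oldest).days
    recency = (visit_date - oldest).days / span if span else 1.0
    return w_p * p + w_d * recency


def prune(nodes, keep):
    """Keep the `keep` highest-priority nodes; the remaining nodes are treated as invalid.

    nodes: dict mapping a patient ID to (predicted fraud probability, visit date).
    """
    dates = [d for _, d in nodes.values()]
    oldest, newest = min(dates), max(dates)
    ranked = sorted(nodes,
                    key=lambda pid: priority(*nodes[pid], oldest, newest),
                    reverse=True)
    return ranked[:keep]


nodes = {
    "p1": (0.9, date(2020, 1, 1)),    # old visit, high predicted probability
    "p2": (0.1, date(2020, 1, 1)),    # old visit, low predicted probability
    "p3": (0.1, date(2020, 12, 31)),  # recent visit, low predicted probability
}
kept = prune(nodes, keep=2)
print(kept)
```

Keeping a fixed number of nodes mirrors the stated goal of holding the graph at a constant size between updates.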
  • step S100 before the step S100, it further includes:
  • the patient identity information in the patient medical treatment information is anonymized, and the processed medical treatment information is converted into a medical treatment record of a data structure type.
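The text does not say how the anonymization is performed. One possible approach, shown here purely as an assumption, replaces identity fields with a keyed hash (HMAC-SHA256) so that the same patient still maps to the same node without exposing the original identifier. The field names, the `secret_key` parameter, and the function name are hypothetical.

```python
import hashlib
import hmac


def anonymize_record(record, secret_key, id_fields=("name", "id_number")):
    """Return a copy of the record with identity fields replaced by a keyed hash."""
    out = dict(record)
    for field in id_fields:
        if field in out:
            digest = hmac.new(secret_key, str(out[field]).encode("utf-8"), hashlib.sha256)
            out[field] = digest.hexdigest()
    return out


raw = {"name": "Alice", "id_number": "123", "drug": "X", "visit_date": "2020-01-01"}
safe = anonymize_record(raw, secret_key=b"example-key")
print(safe["drug"], len(safe["name"]))
```

Because the hash is deterministic for a given key, repeated visits by one patient still link to a single node in the graph.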
  • The present invention can effectively predict the distribution of patient nodes, select the best samples for labeling through preset conditions, have experts label them, and then input them into the model for training, which reduces the cost of manual labeling and increases the accuracy of fraud prediction.
  • The present invention also provides an online dynamic update network model, which can update the system in real time while the number of patient nodes to be predicted in the system stays fixed, thereby improving the prediction efficiency and practicability of the fraud prediction model. In addition, deleting invalid nodes saves prediction time, avoids occupying system resources, and improves prediction accuracy.
  • The present invention also discloses a system which, as shown in FIG. 8, includes a memory 20 and one or more programs, wherein the one or more programs are stored in the memory 20 and are configured to be executed by one or more processors 10; the one or more programs include instructions for performing the method for detecting medical insurance fraud as described above.
  • the present invention also discloses a storage medium, wherein the storage medium stores a computer program, and the computer program can be executed to implement the above method for detecting medical insurance fraud; the details are as described above.
  • The method, system and storage medium for detecting medical insurance fraud disclosed by the present invention obtain medical records, extract patient characteristics and the relationships between patients and doctors from the medical records, and then use machine learning and deep learning methods to model the extracted features and establish a fraud prediction model, so as to detect medical insurance fraud with a small amount of manual intervention, ensure the effectiveness of medical insurance fraud detection, and save considerable human, material and financial expenditure.
  • The present invention also proposes an online dynamic update strategy, which dynamically updates the patient nodes in the graph neural network, thereby ensuring the real-time performance and accuracy of the prediction model and the operating efficiency of the system.

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The present invention provides a method for detecting medicare fraud, and a system and a storage medium. The method comprises: acquiring a medical record of a patient, extracting a corresponding patient feature according to the acquired medical record, and according to the extracted patient feature and the correlation between the patient and a doctor, establishing a doctor-patient relationship neural network; inputting a pre-marked fraud sample into the established doctor-patient relationship neural network, training a fraud prediction model, and outputting, from the trained fraud prediction model, a predicted value of each patient node having a fraud behavior; and according to the output predicted value, determining whether the patient corresponding to the node has a fraud behavior. By means of a machine learning method, whether a patient has a fraud behavior is predicted, so that the difficulty of predicting a fraud behavior is reduced, a medicare fraud behavior can be effectively detected, and the health popularization of a medicare system is facilitated.

Description

A method, system, and storage medium for detecting medical insurance fraud
Technical Field
The present invention relates to the field of medical technology, and in particular to a method, a system, and a storage medium for detecting medical insurance fraud.
Background
Medical insurance is one of China's social security programs, a system established to compensate citizens and workers for economic losses caused by illness. As medical insurance has spread, however, offenders have increasingly exploited universal coverage to commit medical insurance fraud, driving up national fiscal expenditure on medical and health care.
Medical insurance fraud therefore needs to be detected effectively. Existing detection methods fall into unsupervised learning and supervised learning. Unsupervised learning relies on outlier analysis to find potential anomalies in unlabeled data, but anomaly detection methods are ill-suited to highly skewed data such as medical insurance fraud data. Supervised learning requires a large number of labeled data points, both fraudulent and non-fraudulent examples, to make predictions; because domain experts are scarce and fraud investigations are few, very few points can actually be labeled, so effective detection cannot be achieved.
Neither of the two current approaches can effectively detect real medical insurance fraud, which hinders its prevention.
The prior art is therefore deficient and needs to be improved and developed.
Summary of the Invention
The technical problem to be solved by the present invention is to provide, in view of the above defects of the prior art, a method, a system, and a storage medium for detecting medical insurance fraud, so as to solve the problem that prior-art detection methods cannot detect medical insurance fraud effectively and therefore cannot prevent it.
The technical solution adopted by the present invention to solve the technical problem is as follows:
A method for detecting medical insurance fraud, comprising:
acquiring patients' visit records, extracting the corresponding patient features from the acquired visit records, and establishing a doctor-patient relationship neural network according to the extracted patient features and the correspondence between patients and doctors;
inputting pre-labeled fraud samples into the established doctor-patient relationship neural network, training a fraud prediction model, and outputting from the trained fraud prediction model a predicted value of fraudulent behavior for each patient node; and
determining, according to the output predicted value, whether the patient of the corresponding node has committed fraud.
Patient nodes exhibiting fraudulent behavior can thus be learned and predicted by machine through active learning, which facilitates effective management of medical insurance fraud and supports the healthy spread of the medical insurance system.
Further, after the step of inputting the pre-labeled fraud samples into the established doctor-patient relationship neural network, training the fraud prediction model, and outputting from the trained fraud prediction model the predicted value of fraudulent behavior for each patient node, the method comprises:
monitoring whether there are newly added visit records;
if there are newly added visit records, inputting the patient nodes that have predicted values into a pre-established dynamic update network and deleting the invalid patient nodes among them;
combining the visit records remaining after the invalid nodes are deleted with the newly added visit records to form updated visit records; and
continuing to determine, according to the updated visit records, whether the patient corresponding to each node has committed fraud.
Updating the data promptly to delete invalid nodes improves prediction efficiency while preserving prediction accuracy, and keeps the system running fast enough to predict more patient nodes with fraudulent behavior.
Further, in the step of inputting the patient nodes that have predicted values into the pre-established dynamic update network and deleting the invalid patient nodes when there are newly added visit records, the invalid patient nodes are determined by:
calculating the priority of each patient node from the generation date and the predicted value of that patient node; and
sorting the patient nodes by priority and selecting those with low priority as the invalid patient nodes.
Defining invalid patient nodes in a principled way further improves prediction accuracy, ensures the validity of the data used for prediction, and helps increase prediction speed.
Further, deleting the invalid patient nodes specifically comprises:
deleting, starting from the lowest-priority patient nodes, as many invalid patient nodes as there are newly added visits.
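The pruning step can be sketched as follows. This is a minimal illustration, not the patent's formula: this passage does not state how priority is computed, so the `priority` function, its `decay` parameter, and the `PatientNode` fields are hypothetical assumptions that merely combine the generation date and the predicted value.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PatientNode:
    node_id: str
    created: date   # generation date of the node
    score: float    # predicted fraud value output by the model


def priority(node: PatientNode, today: date, decay: float = 0.01) -> float:
    """Hypothetical priority: newer nodes and higher predicted values rank higher."""
    age_days = (today - node.created).days
    return node.score - decay * age_days


def prune(nodes: list[PatientNode], n_new: int, today: date) -> list[PatientNode]:
    """Delete the n_new lowest-priority nodes; return the rest, highest priority first."""
    ranked = sorted(nodes, key=lambda n: priority(n, today), reverse=True)
    keep = max(len(ranked) - n_new, 0)
    return ranked[:keep]
```

Calling `prune(nodes, n_new, today)` with `n_new` equal to the number of newly added visits keeps the node set at a constant size, which matches the "delete the same number" rule above.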
Further, in the step of inputting the pre-labeled fraud samples into the established doctor-patient relationship neural network, the pre-labeled fraud samples are obtained by:
selecting, in a preset manner, some of the visit records as samples to be labeled; and
having experts annotate the selected samples to identify those exhibiting fraudulent behavior, thereby obtaining the pre-labeled fraud samples.
Having experts annotate the samples makes the resulting fraud samples more authoritative, so that the predicted results are genuine and valid.
Further, in the step of selecting, in a preset manner, some of the visit records as samples to be labeled, the preset manner includes at least:
calculating an entropy value for each patient with a maximum entropy selection strategy, and selecting the record with the largest calculated entropy value as the sample to be labeled;
or adopting a random strategy that randomly takes some of the visit records as the samples to be labeled;
or calculating a probability value for each patient with a maximum probability strategy, and selecting the record with the largest calculated probability value as the sample to be labeled.
Selecting the samples to be labeled in a randomized manner maximizes the randomness of the selection and helps improve prediction accuracy.
Further, before the step of acquiring patients' visit records, extracting the corresponding patient features from the acquired visit records, and establishing the doctor-patient relationship neural network according to the extracted patient features and the correspondence between patients and doctors, the method comprises:
anonymizing the patient identity information in the patients' visit information, and converting the processed visit information into visit records in a data-structure format.
Anonymizing the patient identity information protects patient privacy and prevents leakage of patient information.
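One way to picture this preprocessing step is the sketch below. The source does not say how anonymization is performed; a keyed hash is assumed here purely for illustration, and `SECRET_KEY`, `pseudonymize`, `to_record`, and all field names are hypothetical.

```python
import hashlib
import hmac

# Assumption: a private key stored separately from the dataset.
SECRET_KEY = b"replace-with-a-private-key"


def pseudonymize(patient_id: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a stable, one-way keyed SHA-256 digest."""
    return hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def to_record(visit: dict) -> dict:
    """Convert raw visit information into a structured record without direct identifiers."""
    return {
        "patient": pseudonymize(visit["patient_id"]),
        "doctor": visit["doctor_id"],
        "insurance_type": visit["insurance_type"],
        "item": visit["item"],
        "quantity": int(visit["quantity"]),
        "date": visit["date"],
    }
```

Because the digest is deterministic, repeated visits by the same patient still map to the same node, which the later graph construction relies on.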
Further, the step of acquiring patients' visit records, extracting the corresponding patient features from the acquired visit records, and establishing the doctor-patient relationship neural network according to the extracted patient features and the correspondence between patients and doctors specifically comprises:
acquiring patients' visit records, extracting the corresponding patient features from the visit records, and establishing a patient feature degree matrix;
analyzing the doctor-patient relationships between doctors and patients in the visit records, and establishing the corresponding doctor-patient relationship adjacency matrix; and
establishing the doctor-patient relationship neural network according to the patient feature degree matrix and the doctor-patient relationship adjacency matrix.
The present invention also discloses a system comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for performing the method for detecting medical insurance fraud described above.
The present invention also discloses a storage medium, wherein the storage medium stores a computer program that can be executed to implement the method for detecting medical insurance fraud described above.
The present invention provides a method, a system, and a storage medium for detecting medical insurance fraud. The method comprises: acquiring patients' visit records, extracting the corresponding patient features from the acquired visit records, and establishing a doctor-patient relationship neural network according to the extracted patient features and the correspondence between patients and doctors; inputting pre-labeled fraud samples into the established doctor-patient relationship neural network, training a fraud prediction model, and outputting from the trained fraud prediction model a predicted value of fraudulent behavior for each patient node; and determining, according to the output predicted value, whether the patient of the corresponding node has committed fraud. Predicting whether a patient has committed fraud by machine learning lowers the difficulty of fraud prediction and detects medical insurance fraud effectively, which helps maintain the healthy spread of the medical insurance system.
Brief Description of the Drawings
FIG. 1 is a flowchart of a preferred embodiment of the method for detecting medical insurance fraud according to the present invention.
FIG. 2 is a flowchart of a specific embodiment of step S100 of the present invention.
FIG. 3 is a flowchart of a preferred embodiment of the present invention combined with the dynamic update network.
FIG. 4 is a flowchart of a preferred embodiment showing the link between the fraud prediction model and the dynamic update network.
FIG. 5 is a flowchart of a specific embodiment of step S410 of FIG. 3.
FIG. 6 is a flowchart of a preferred embodiment of the execution of the update algorithm of the present invention.
FIG. 7 compares experimental results with and without the dynamic update network.
FIG. 8 is a functional block diagram of a preferred embodiment of the system of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.
Medical insurance is one of China's social security programs, a system established to compensate citizens and workers for economic losses caused by illness. Individuals and employers pay insurance contributions, and when an insured person incurs medical expenses for treatment, the medical insurance institution provides a measure of financial compensation. By the end of 2018, participation in basic medical insurance in China had reached 1.35 billion people, a coverage rate above 95%. The medical insurance fund also plays a pivotal role in people's lives: according to statistics from the Ministry of Human Resources and Social Security, China's medical expenditure grew from 1.45 trillion in 2008 to 4.10 trillion in 2015, an average annual growth rate of 16%. As pressure on the medical insurance fund keeps growing, however, offenders increasingly exploit universal coverage to commit medical insurance fraud.
Medical insurance fraud is fraudulent conduct for profit in the course of medical services. It falls into two main categories: patients defrauding medical insurance by some means, and patients and doctors jointly defrauding medical insurance. From 2013 to 2017, cumulative national fiscal expenditure on medical and health care was 5,950.2 billion yuan, an average annual increase of 11.7%. While the state attaches growing importance to medical and health care, the extra expenditure caused by medical insurance fraud is also rising.
Existing detection of medical insurance fraud has two main branches: unsupervised learning methods and supervised learning methods. Unsupervised learning relies on outlier analysis to find potential anomalies in unlabeled data, but outlier detection methods are ill-suited to highly skewed data such as medical insurance fraud data. Supervised learning requires a large number of labeled data points, both fraudulent and non-fraudulent examples, to achieve good predictive performance; because domain experts are scarce and fraud investigations are expensive, very few points can be labeled. In addition, the labels of medical fraud datasets are extremely imbalanced, because non-fraudulent examples are usually not explicitly disclosed in practice. One-class classification (OCC) algorithms are used to model medical fraud data when non-fraudulent examples are lacking, but on medical fraud datasets OCC methods still perform poorly because there are too few training points. Both the unsupervised and the supervised approaches above therefore fall short and cannot predict fraudulent behavior effectively.
On this basis, the present invention proposes a machine-learning method for detecting medical insurance fraud, which solves the prior-art problem that fraudulent behavior cannot be predicted effectively. The method is explained in detail below.
Referring to FIG. 1, which is a flowchart of a method for detecting medical insurance fraud according to the present invention, the method of this embodiment comprises the following steps:
S100. Acquire patients' visit records, extract the corresponding patient features from the acquired visit records, and establish a doctor-patient relationship neural network according to the extracted patient features and the correspondence between patients and doctors.
Specifically, when a patient registers at an outpatient clinic and uses medical insurance, the hospital doctor enters the patient's visit information into the medical information system, which then holds a visit record for that patient. The visit information includes, but is not limited to, patient identity information, insurance type, purchased drug items, purchase quantity, and visit date. Because further items can be added according to the patient's circumstances and insurance types are broadly similar, the visit information is not enumerated in full here. These items only illustrate part of what the visit information covers and do not limit the present invention.
In general, analyzing the relationship between doctors and patients is also essential to medical insurance fraud analysis. The visit information includes the doctor who handled each medical item, so acquiring a patient's visit records yields both the detailed visit information and the doctor who received the patient at each visit.
The present invention uses machine learning and deep learning to model the patient features and the patient-doctor relationships extracted from the visit information, and thereby establishes the doctor-patient relationship neural network. This strengthens the links between patient nodes and in turn makes the patient nodes easier to classify.
S200. Input the pre-labeled fraud samples into the established doctor-patient relationship neural network, train a fraud prediction model, and output from the trained fraud prediction model a predicted value of fraudulent behavior for each patient node.
Specifically, the selected labeled fraud samples are input into the doctor-patient relationship neural network and the model is trained. Through an active learning strategy, more patient nodes are then classified and labeled on the basis of the pre-labeled fraud samples; that is, a predicted value of fraudulent behavior is calculated for every patient node. Analyzing patients' fraudulent behavior through these predicted values supports the detection and supervision of medical insurance fraud and helps medical insurance spread in a healthy way.
S300. Determine, according to the output predicted value, whether the patient of the corresponding node has committed fraud.
Specifically, the magnitude of the calculated predicted value indicates how likely fraudulent behavior is. In general, nodes with larger predicted values are treated as the main targets of analysis. The threshold on the predicted value that marks fraud can be set by domain experts, medical insurance administrators, or the hospital; it is not detailed here, and this passage only illustrates that fraud can be determined from the predicted value.
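The decision in step S300 amounts to comparing each predicted value with a cutoff. A minimal sketch follows, assuming a hypothetical default threshold of 0.5; the source leaves the actual cutoff to experts, administrators, or the hospital, and the name `flag_suspects` is illustrative.

```python
def flag_suspects(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Return node ids whose predicted value is at or above the threshold, highest first."""
    flagged = [(pid, s) for pid, s in scores.items() if s >= threshold]
    flagged.sort(key=lambda item: item[1], reverse=True)
    return [pid for pid, _ in flagged]
```

Returning the flagged nodes in descending order of predicted value reflects the remark above that nodes with larger values are analyzed first.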
In one embodiment, as shown in FIG. 2, step S100 specifically comprises:
S110. Acquire patients' visit records, extract the corresponding patient features from the visit records, and establish a patient feature degree matrix.
S120. Analyze the doctor-patient relationships between doctors and patients in the visit records, and establish the corresponding doctor-patient relationship adjacency matrix.
Once a patient's visit information has been entered into the medical information system, a visit record is added to the system. The features of each patient, and the doctor who received each patient at each visit, are extracted from all visit records in the system. Both kinds of information are processed and expressed as data so that they can be used for machine learning.
S130. Establish the doctor-patient relationship neural network according to the patient feature degree matrix and the doctor-patient relationship adjacency matrix.
The algorithm that builds the feature degree matrix from the patient features, builds the adjacency matrix from the relationships between doctors and patients, and then forms the doctor-patient relationship neural network from the degree matrix and the adjacency matrix is as follows. Specifically, the relationship network (P-D) between patients (P) and doctors (D) is defined as an undirected graph
$$G = (\mathcal{V}, \mathcal{E}, W),$$
where the number of nodes is $N = |\mathcal{V}|$, $\mathcal{E}$ denotes the connections between patients and doctors, and $W \in \mathbb{R}^{N \times N}$ denotes the connection weights between patients and doctors.
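The graph definition can be made concrete with a small sketch that builds the weight matrix from (patient, doctor) visit pairs. It assumes, for illustration only, that both patients and doctors are nodes, that the edge weight is the visit count, and that patient and doctor identifiers never collide; `build_adjacency` is a hypothetical name.

```python
import numpy as np


def build_adjacency(visits: list[tuple[str, str]]) -> tuple[np.ndarray, list[str]]:
    """Build a symmetric weight matrix over patient and doctor nodes.

    Each (patient, doctor) visit adds 1 to the weight of that edge.
    """
    patients = sorted({p for p, _ in visits})
    doctors = sorted({d for _, d in visits})
    nodes = patients + doctors
    index = {name: i for i, name in enumerate(nodes)}
    W = np.zeros((len(nodes), len(nodes)))
    for p, d in visits:
        i, j = index[p], index[d]
        W[i, j] += 1.0
        W[j, i] += 1.0  # undirected graph: keep W symmetric
    return W, nodes
```

Two patients who saw the same doctor end up two hops apart in this graph, which is what lets later convolution layers pass information between them.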
Then, a graph convolution operation $*_{G}$ is defined on the basis of spectral-domain convolution. Any patient node signal is represented as $x \in \mathbb{R}^{N}$, and $\Theta$ denotes a convolution kernel. The present invention defines the convolution operation as:
$$\Theta *_{G} x = \Theta(L)\,x = \Theta(U \Lambda U^{T})\,x = U\,\Theta(\Lambda)\,U^{T} x$$
Performing the convolution operation on the patient nodes lets interconnected patient nodes exchange information, so that patient nodes of the same type are distributed more closely together.
The degree matrix and the adjacency matrix are then combined into the Laplacian matrix, and its eigendecomposition (that is, spectral decomposition) is defined as:
$$L = I_n - D^{-1/2} A D^{-1/2} = U \Lambda U^{T}$$
where $I_n$ is the identity matrix (with $n = N$, the number of nodes); $D \in \mathbb{R}^{n \times n}$ is the degree matrix (representing patient feature information such as visit date, medical insurance type, and amount), computed as $D_{ii} = \sum_{j} A_{ij}$; $A$ is the adjacency matrix (representing the weight information between patients and doctors); $\Lambda \in \mathbb{R}^{n \times n}$ is the diagonal matrix of the eigenvalues of the Laplacian matrix; $U \in \mathbb{R}^{n \times n}$ is the matrix of eigenvectors of the Laplacian matrix; and the filter $\Theta(\Lambda)$ is a diagonal matrix associated with the Laplacian matrix.
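The decomposition and the spectral filter can be checked numerically. The sketch below assumes NumPy and a symmetric adjacency matrix; `normalized_laplacian` and `spectral_filter` are illustrative names, and `theta` is a hypothetical callable mapping eigenvalues to filter gains.

```python
import numpy as np


def normalized_laplacian(A: np.ndarray) -> np.ndarray:
    """L = I_n - D^{-1/2} A D^{-1/2}, with D_ii = sum_j A_ij."""
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg, dtype=float)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5  # isolated nodes keep a zero factor
    return np.eye(A.shape[0]) - inv_sqrt[:, None] * A * inv_sqrt[None, :]


def spectral_filter(A: np.ndarray, x: np.ndarray, theta) -> np.ndarray:
    """Apply U Theta(Lambda) U^T x, where theta maps eigenvalues to gains."""
    lam, U = np.linalg.eigh(normalized_laplacian(A))  # L is symmetric
    return U @ (theta(lam) * (U.T @ x))
```

Passing `theta = lambda lam: lam` reproduces `L @ x`, which is a quick sanity check that the eigendecomposition and the filter agree. The full decomposition is what makes this form expensive on large graphs.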
Because the above formula has a computational complexity of $O(n^2)$, it is improved with a Chebyshev polynomial approximation and a first-order approximation. The improved formula is:
$$\Theta *_{G} x \approx \theta\,(I_n + D^{-1/2} A D^{-1/2})\,x \approx \theta\,(\tilde{D}^{-1/2} \tilde{A}\, \tilde{D}^{-1/2})\,x$$
The spatial features of the topological graph are then extracted with a Graph Convolutional Network (GCN), which has the following layer-wise propagation rule:
$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-1/2} \tilde{A}\, \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$
where $\tilde{A} = A + I_n$ and $\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$; $H^{(0)} = X$ is the patient node information; $H^{(l)}$ is the output of layer $l$ of the graph convolutional neural network; $W^{(l)}$ is the weight matrix of layer $l$; and $\sigma(\cdot)$ is the sigmoid activation function.
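A single propagation step can be sketched in NumPy as below. This is an illustrative forward pass only, assuming the common GCN normalization with self-loops and non-negative edge weights; `gcn_layer` and `sigmoid` are hypothetical helper names, and no training loop is shown.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def gcn_layer(A, H, W):
    """One propagation step: sigmoid(D~^{-1/2} (A + I) D~^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops
    d = A_tilde.sum(axis=1)                    # degrees are >= 1 after self-loops
    A_hat = A_tilde / np.sqrt(np.outer(d, d))  # symmetric normalization
    return sigmoid(A_hat @ H @ W)
```

Stacking two such layers, with the output of the first passed as `H` to the second, lets each patient node aggregate information from nodes two hops away, for example other patients of the same doctor.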
Training the prediction model of the supervised learning model established in the present invention generally requires a sufficient amount of labeled data, and the more labeled data there is, the higher the prediction accuracy. In practice, labeling fraudulent behavior depends mainly on investigation by domain experts, which is very costly and inefficient, and manually labeling ever-growing volumes of medical data is plainly unrealistic. The present invention therefore uses active learning to select the most valuable data for labeling, reducing the manpower and expense that existing labeling approaches require.
In a specific embodiment, labeling fraud samples by active learning comprises the following steps:
S210. Select, in a preset manner, some of the visit records as samples to be labeled.
S220. Have experts annotate the selected samples to identify those exhibiting fraudulent behavior, thereby obtaining the pre-labeled fraud samples.
In a specific embodiment, the preset manner of selecting the samples to be labeled in step S210 includes at least one or more of a maximum entropy strategy, a random strategy, and a maximum probability strategy. Only a few sampling approaches are briefly introduced here. Combining the listed strategies with one another or with other techniques, for example selecting average points, mean points, or median points over the whole sample, is an extension of the approaches listed in this embodiment; such combinations are not detailed here and all fall within the scope of protection of this embodiment.
Mode 1: Calculate an entropy value for each patient with the maximum entropy selection strategy, and select the record with the largest calculated entropy value as the sample to be labeled.
The entropy is calculated as:
$$H(x) = -\sum_{i=1}^{n} p_i \log p_i$$
where $n$ is the number of distinct values that the random variable $x$ can take, and $p_i$ is the probability that $x$ takes the value $i$.
The maximum entropy (most uncertain) strategy is used to select the nodes to label.
The maximum entropy selection (MEs) strategy selects the visit records whose distribution is most uncertain, which ensures the randomness of the data. Specifically, conditional entropy describes the confidence with which a sample point is assigned to a class: the larger the conditional entropy, the less certain the classification of the sample point (the lower the classification confidence); the smaller the conditional entropy, the more certain the classification (the higher the classification confidence). The conditional entropy is calculated as: H(Y|Z) = H(Z,Y) - H(Z).
The entropy value is calculated for each patient node, the nodes are sorted by entropy, and each time the node with the largest entropy is selected for fraud labeling.
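The selection rule can be sketched as follows, assuming each patient node comes with a row of predicted class probabilities (for example fraud versus non-fraud). The function names and the small clipping constant are illustrative assumptions, not part of the source.

```python
import numpy as np


def entropy(probs: np.ndarray) -> np.ndarray:
    """Row-wise Shannon entropy H = -sum_i p_i log p_i (0 log 0 treated as 0)."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(probs * np.log(p)).sum(axis=1)


def select_most_uncertain(probs: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nodes with the largest predictive entropy, largest first."""
    h = entropy(probs)
    return np.argsort(-h)[:k]
```

A node predicted at 50/50 has the highest entropy and is sent to the expert first, while a node predicted with near certainty is left unlabeled.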
Mode 2: Adopt a random strategy that randomly takes some of the visit records as the samples to be labeled; that is, a preset number of visit records is selected at random from all visit records.
Mode 3: Calculate a probability value for each patient with the maximum probability strategy, and select the record with the largest calculated probability value as the sample to be labeled. Specifically, a probability value is calculated for each patient and the samples to be labeled are selected by the size of that value. Because calculating probability values is prior art, no example is given here.
It should be noted that, of the three modes above, the maximum entropy strategy is preferred for selecting the samples to be labeled. It selects the most uncertain samples with the greatest randomness, which makes the sample selection more representative of the real data.
在一实施例中,所述步骤S200中训练出预测模型的步骤包括:In one embodiment, the step of training the prediction model in step S200 includes:
通过结合医患关系神经网络与变分自解码器模型,能够建立出预测模型。By combining the doctor-patient relationship neural network and the variational self-decoder model, the prediction model can be established.
In general, the Bayesian formula
p(Z|X)=p(X|Z)p(Z)/p(X)
obtains the posterior probability of Z from the observed patient visit data X. In practice, however, only data about X is available and the distribution function of X is not, that is, p(X) is unknown, so p(Z|X) cannot be solved directly. The variational auto-encoder (Variational Auto-Encoder, VAE) addresses this problem.
An auto-encoder (Auto-Encoder) consists of two parts: a decoder and an encoder. What can be obtained directly in the present invention is the patient's visit data X, and X is in turn generated by a hidden variable Z. The generative model from Z→X is p_θ(X|Z), called the decoder; the recognition model from X→Z is q_φ(Z|X), called the encoder. Assuming that all data are independent and identically distributed, improving the generative model requires estimating the parameters of p_θ(X|Z). The present invention uses maximum likelihood estimation, maximizing the log-likelihood function, whose expression is as follows:
Figure PCTCN2020127183-appb-000015
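The formula image is not reproduced in this text. Under the independent and identically distributed assumption stated above, the log-likelihood would normally factor into a sum over visit records. The following is the standard form from the variational auto-encoder literature, offered as a likely reading and not as a transcription of the figure; N, the number of visit records, is an assumed symbol.

```latex
\log p_\theta\big(X^{(1)},\dots,X^{(N)}\big) \;=\; \sum_{i=1}^{N} \log p_\theta\big(X^{(i)}\big)
```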
By first obtaining the patient's visit records and then using the encoder q_φ(Z|X^(i)) to approximate the true posterior probability p_θ(Z|X^(i)), the distribution of the patient's visit records can be obtained. The present invention uses the encoder of a VAE (variational auto-encoder): by sampling from the distribution of patient nodes, the fraudulent behavior corresponding to a patient node can be obtained from that node. In this way, a limited number of labeled fraud samples are given as input, the hidden parameters are adjusted, and labels are generated for all patient nodes. This allows better prediction on doctor-patient relationships and also addresses the problem of imbalanced data samples.
The similarity between the distributions of two visit records is measured by the KL divergence (Kullback–Leibler divergence), which gives the following formula:
Figure PCTCN2020127183-appb-000016
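The formula image is not reproduced in this text. For reference, the general definition of the KL divergence, written here for the two distributions that appear in the model expression of this embodiment (the encoder distribution and the true posterior), is shown below. This is the textbook definition and may differ in layout from the figure.

```latex
\mathrm{KL}\big(q_\phi(Z\mid X^{(i)})\,\big\|\,p_\theta(Z\mid X^{(i)})\big)
  \;=\; \mathbb{E}_{q_\phi(Z\mid X^{(i)})}\!\left[\log q_\phi(Z\mid X^{(i)}) - \log p_\theta(Z\mid X^{(i)})\right]
```

The divergence is non-negative and equals zero only when the two distributions coincide.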
Further, combining the doctor-patient relationship neural network with the variational auto-encoder gives the expression of the prediction model:
log p_θ(X^(i)) = KL(q_φ(Z|X^(i)) || p_θ(Z|X^(i))) + L(θ, φ, X^(i))
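The decomposition above can be checked numerically on a toy model. The sketch below assumes a binary hidden variable Z and a single observed record x, and takes L(θ, φ, x) to be the usual variational lower bound E_q[log p(x, Z) - log q(Z|x)]; all probability values are illustrative and not taken from the source.

```python
import numpy as np

# Toy discrete model with a binary latent Z; all numbers are illustrative.
p_z = np.array([0.7, 0.3])           # prior p(Z)
p_x_given_z = np.array([0.2, 0.9])   # likelihood p(x | Z) for one observed x
q_z = np.array([0.4, 0.6])           # approximate posterior q(Z | x)

p_xz = p_z * p_x_given_z             # joint p(x, Z)
p_x = p_xz.sum()                     # evidence p(x)
p_z_given_x = p_xz / p_x             # true posterior p(Z | x)

# KL(q(Z|x) || p(Z|x)) and the lower bound L(theta, phi, x).
kl = np.sum(q_z * (np.log(q_z) - np.log(p_z_given_x)))
elbo = np.sum(q_z * (np.log(p_xz) - np.log(q_z)))

print(np.log(p_x), kl + elbo)        # the two values agree up to rounding
```

Because the KL term is non-negative, maximizing L(θ, φ, x) raises a lower bound on log p_θ(x) without requiring p(X) to be known.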
The fraud prediction model can make predictions for patient nodes. Specifically, the patient's visit records are used as input data to build the doctor-patient relationship neural network, the output of the doctor-patient relationship neural network is used as the input of the variational auto-encoder, and the prediction result is finally output. All patient nodes are trained in the fraud prediction model; once the preset number of training iterations is reached, the predictive classification of the unknown patient nodes can be completed, that is, a predicted value of fraudulent behavior is calculated for each patient node. The predicted values may also be divided into 0 and 1, where 0 indicates non-fraud and 1 indicates fraud.
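The final division into 0 and 1 can be sketched as a simple threshold on the predicted value. The threshold of 0.5, the function name `label_nodes`, and the node identifiers are assumptions made for illustration; the source does not state how the cut-off is chosen.

```python
def label_nodes(predictions, threshold=0.5):
    """Map each predicted fraud value to 1 (fraud) or 0 (non-fraud)."""
    return {node: int(value >= threshold) for node, value in predictions.items()}

# Hypothetical predicted values for three patient nodes.
predictions = {"patient_a": 0.91, "patient_b": 0.08, "patient_c": 0.50}
labels = label_nodes(predictions)
print(labels)
```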
In practical application scenarios, new methods of medical insurance fraud keep emerging, so the medical insurance fraud prediction model needs to be updated in time. However, because of the complexity of the graph relationship network and its large number of nodes, each training update consumes a great deal of time and computing resources, which severely limits practical use.
Moreover, the number of patient visit records grows over time, so the doctor-patient graph relationship network contains more and more patient nodes. This raises the requirements on the computing environment (hardware, memory, CPU, and so on), and the amount of computation increases sharply with the number of patient nodes, making it harder for the system to predict fraud values and difficult to apply in practice.
For these reasons, the present invention further proposes an online update strategy: newly added data is updated automatically every day, and useless old nodes are deleted as new nodes are added, so that the number of nodes in the graph is kept at a fixed level. The system can therefore complete training in a short time, which ensures good real-time performance.
In one embodiment, the online update strategy of the present invention is implemented by the following steps:
As shown in FIG. 3, after step S200 the method further includes:
S400: monitoring whether there are newly added visit records.
If so, step S410 is performed: the patient nodes with predicted values are input into a pre-established dynamic update network, and the invalid patient nodes among them are deleted.
Specifically, newly added visit records are monitored. When there are new visit records, adding them to the fraud prediction model allows fraud analysis to be performed on them in time, and removing the invalid data already in the fraud prediction model maintains the operating efficiency of the system. When no new visit records are detected, fraud prediction is performed on all visit records according to the originally set prediction period, which ensures the timeliness of the calculated predicted values.
S420: the visit records remaining after the invalid nodes are deleted and the newly added visit records are organized into updated visit records.
S430: based on the updated visit records, it continues to be determined whether the patient corresponding to each node has engaged in fraudulent behavior. Specifically, the updated visit records are used as the source data from which visit records are obtained in step S100, and steps S100-S430 are executed in a loop. This continuously updates the data and the predicted nodes and ensures that the prediction model remains practicable.
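The control flow of steps S400 to S430 can be sketched as one update pass. This is a hedged outline only: `predict` and `prune` are caller-supplied placeholders standing in for the fraud prediction model and the dynamic update network, and the toy scorer used in the demonstration has no meaning beyond making the example runnable.

```python
def online_update_step(records, new_records, predict, prune):
    """One pass of S400-S430; returns the updated records and their predicted values."""
    if not new_records:                      # S400: no newly added visit records
        return records, predict(records)     # regular prediction on existing records
    scores = predict(records)                # predicted values of the current nodes
    kept = prune(records, scores, len(new_records))   # S410: delete invalid nodes
    updated = kept + list(new_records)       # S420: assemble updated visit records
    return updated, predict(updated)         # S430: predict on the updated records

def toy_predict(records):
    # Placeholder scorer (assumption): longer identifiers get higher values.
    return {r: len(r) / 10 for r in records}

def toy_prune(records, scores, k):
    # Placeholder pruning (assumption): drop the k lowest-scoring records.
    return sorted(records, key=lambda r: scores[r])[k:]

records = ["aa", "bbbb", "c"]
updated, scores = online_update_step(records, ["dd"], toy_predict, toy_prune)
print(updated)
```

The number of records is unchanged after the pass, which is the property the online update strategy relies on to bound training time.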
Specifically, with the online update strategy the system can be updated at regular intervals. At each update, some nodes of the graph relationship network that carry relatively little information are deleted. Through continuous iteration, real-time performance and training efficiency are maintained while the prediction accuracy of the model is preserved.
To better show the connection between the fraud prediction model and the dynamic update network of the present invention, the flowchart in FIG. 4 further illustrates the process:
S10: start;
S20: obtain visit records; the visit records are obtained from the data center;
S30: extract patient features; the patient features are extracted from the visit records;
S40: build the doctor-patient relationship neural network from the patient features; the network is built from the extracted patient features and the relationships between doctors and patients;
S50: obtain pre-labeled fraud samples; some visit records are selected from the data center for expert labeling in order to mark fraud samples;
S60: train the fraud prediction model; the fraud prediction model is built from the labeled fraud samples and the doctor-patient relationship neural network;
S70: output the predicted value corresponding to each patient node; the predicted values of fraudulent behavior are output for all patient nodes;
S80: check whether there are newly added visit records; new visit records are input into the data center;
If so, perform S81: input the predicted values of all patient nodes into the dynamic update network;
S82: delete the invalid patient nodes and combine the remaining records with the newly added visit records to form the updated visit records; after the new visit records are formed, the data in the data center is updated;
If not, perform S90: end;
Steps S20-S90 are executed in a loop.
To make it easier to describe all visit records, this embodiment introduces the concept of a data center, which is used to describe how visit records flow through the process.
In a further specific embodiment, as shown in FIG. 5, step S410 specifically includes:
S411: inputting the patient nodes with predicted values into the pre-established dynamic update network.
S412: calculating the priority of each patient node from the generation date and the predicted value of that patient node.
S413: sorting the patient nodes by priority and selecting those with low priority as invalid patient nodes.
S414: according to the number of newly added visits, deleting the same number of invalid patient nodes, starting from the patient nodes with the lowest priority.
The factors used to determine that a patient node is invalid are: the patient's visit time (the earlier the visit, the sooner the node is deleted) and the probability that the patient is predicted to be fraudulent (the smaller the probability, the sooner the node is deleted).
Specifically, as shown in FIG. 6, the update algorithm is executed as follows:
S42: input the visit records V and the data W newly added by the hospital;
S43: generate the doctor-patient relationship neural network (adjacency matrix A and feature matrix X) from V;
S44: use the variational auto-encoder relationship model to predict all patient nodes in the doctor-patient relationship neural network, and output the predicted value p of each patient node;
S45: standardize the input date d of each patient node;
S46: combine the predicted value p and the input date d of each patient node; calculate the priority set s by s = λ_p·p + λ_d·d;
S47: sort s and, according to the amount of newly added data W, delete the corresponding number of nodes in s;
S48: iteratively update the system using the nodes remaining after deletion together with the newly added nodes.
Here, s is the priority (nodes with smaller s are deleted first), p is the probability that the patient node is predicted to be fraudulent, and d is the patient's visit date. λ_p and λ_d are the weights of the probability and the date, respectively. The value of s is calculated for all patient nodes, the nodes are sorted in ascending order of s, and the first k nodes are deleted (k equals the number of newly added nodes). Experiments, shown in the comparison in FIG. 7 (circles denote runs without the online dynamic update strategy, diamonds denote runs with it), indicate that the dynamic update strategy makes training of the fraud model at least 40 times faster than the method without it, while accuracy and precision are maintained. With the dynamic strategy the model can complete an update within 6 hours, so the approach is well suited to practical use.
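The priority computation and deletion described above can be sketched as follows. Several details are assumptions made for illustration, since the source does not specify them: visit dates are given as day numbers, the standardization of d is min-max scaling to [0, 1], and both weights default to 0.5. The function name `prune_nodes` is hypothetical.

```python
import numpy as np

def prune_nodes(p, d, k, lambda_p=0.5, lambda_d=0.5):
    """Return (kept, removed, s): indices kept, indices deleted, and priorities s."""
    p = np.asarray(p, dtype=float)
    d = np.asarray(d, dtype=float)
    span = d.max() - d.min()
    # Assumption: min-max scaling, so the earliest visit maps to 0 and the latest to 1.
    d_norm = (d - d.min()) / span if span > 0 else np.zeros_like(d)
    s = lambda_p * p + lambda_d * d_norm      # s = lambda_p * p + lambda_d * d
    order = np.argsort(s, kind="stable")      # ascending: smallest s is deleted first
    removed = sorted(order[:k].tolist())
    kept = sorted(order[k:].tolist())
    return kept, removed, s

# Hypothetical fraud probabilities and visit days for four patient nodes.
p = [0.9, 0.1, 0.2, 0.8]
d = [0, 10, 1, 9]
kept, removed, s = prune_nodes(p, d, k=2)     # two new nodes arrived, so delete two
print(kept, removed)
```

With equal weights, an old node can be deleted even when its fraud probability is high; raising λ_p relative to λ_d would favor retaining likely-fraud nodes.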
In one embodiment, before step S100 the method further includes:
Anonymizing the patient identity information in the patient visit information, and converting the processed visit information into visit records of a data structure type.
Anonymizing the patient identity information protects patient privacy and prevents leakage of patient information. By using the doctor-patient relationship neural network and the variational auto-encoder, the present invention can effectively predict the distribution of patient nodes. The best samples to label are selected according to preset conditions, labeled by experts, and then input into the model for training, which reduces the cost of manual labeling and increases the accuracy of fraud prediction. Further, the present invention provides an online dynamic update network model that updates the system in real time while keeping the number of patient nodes to be predicted fixed, which improves the prediction efficiency and practicability of the fraud prediction model. Deleting invalid nodes also shortens prediction time, avoids occupying system resources, and improves prediction accuracy.
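The anonymization step described above could, for example, replace identifying fields with keyed hashes before the records are converted for modeling. The source does not specify the technique; the use of HMAC-SHA256, the field names `patient_id` and `name`, and the function name `anonymize_record` are all assumptions made for this sketch.

```python
import hashlib
import hmac

def anonymize_record(record, secret_key, id_fields=("patient_id", "name")):
    """Return a copy of the record with identifying fields replaced by keyed hashes."""
    out = {}
    for key, value in record.items():
        if key in id_fields:
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            out[key] = digest.hexdigest()   # stable pseudonym for the same input
        else:
            out[key] = value                # non-identifying fields are kept as-is
    return out

# Hypothetical visit record and key; a real key would be stored securely.
record = {"patient_id": "P001", "name": "Zhang San", "doctor_id": "D17", "cost": 320.5}
anon = anonymize_record(record, secret_key=b"example-key")
print(anon["patient_id"][:8], anon["doctor_id"])
```

Because the same patient always maps to the same pseudonym, records of one patient can still be linked to a single node in the graph.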
The present invention also discloses a system which, as shown in FIG. 8, includes a memory 20 and one or more programs, wherein the one or more programs are stored in the memory 20 and are configured to be executed by one or more processors 10, the one or more programs including instructions for performing the method for detecting medical insurance fraud described above; the details are as described above.
The present invention also discloses a storage medium, wherein the storage medium stores a computer program that can be executed to implement the method for detecting medical insurance fraud described above; the details are as described above.
In summary, the method, system and storage medium for detecting medical insurance fraud disclosed by the present invention obtain visit records, extract from them the patient features and the relationships between patients and doctors, and then model the extracted features with machine learning and deep learning methods to build a fraud prediction model. Medical insurance fraud can thus be detected with only a small amount of manual intervention, which ensures the effectiveness of fraud detection and saves substantial manpower, material and financial resources. Furthermore, the present invention proposes an online dynamic update strategy that dynamically updates the patient nodes in the graph neural network, which ensures the real-time performance and accuracy of the prediction model as well as the operating efficiency of the system.
It should be understood that the application of the present invention is not limited to the above examples. Those of ordinary skill in the art can make improvements or modifications based on the above description, and all such improvements and modifications fall within the scope of protection of the appended claims of the present invention.

Claims (10)

  1. A method for detecting medical insurance fraud, characterized in that it includes:
    obtaining visit records of patients, extracting the corresponding patient features from the obtained visit records, and establishing a doctor-patient relationship neural network according to the extracted patient features and the correspondence between patients and doctors;
    inputting pre-labeled fraud samples into the established doctor-patient relationship neural network, training a fraud prediction model, and outputting from the trained fraud prediction model a predicted value of fraudulent behavior for each patient node;
    determining, according to the output predicted value, whether the patient of the corresponding node has engaged in fraudulent behavior.
  2. The method for detecting medical insurance fraud according to claim 1, characterized in that, after inputting the pre-labeled fraud samples into the established doctor-patient relationship neural network, training the fraud prediction model, and outputting from the trained fraud prediction model the predicted value of fraudulent behavior for each patient node, the method includes:
    monitoring whether there are newly added visit records;
    if there are newly added visit records, inputting the patient nodes with predicted values into a pre-established dynamic update network and deleting the invalid patient nodes among them;
    organizing the visit records remaining after the invalid nodes are deleted and the newly added visit records into updated visit records;
    continuing to determine, according to the updated visit records, whether the patient corresponding to each node has engaged in fraudulent behavior.
  3. The method for detecting medical insurance fraud according to claim 2, characterized in that, in the step of inputting the patient nodes with predicted values into the pre-established dynamic update network and deleting the invalid patient nodes among them if there are newly added visit records, the basis for determining the invalid patient nodes is:
    calculating the priority of each patient node from the generation date and the predicted value of the patient node with the predicted value;
    sorting the patient nodes by priority and selecting those with low priority as invalid patient nodes.
  4. The method for detecting medical insurance fraud according to claim 3, characterized in that deleting the invalid patient nodes specifically includes:
    according to the number of newly added visits, deleting the same number of invalid patient nodes, starting from the patient nodes with the lowest priority.
  5. The method for detecting medical insurance fraud according to claim 1, characterized in that, in the step of inputting the pre-labeled fraud samples into the established doctor-patient relationship neural network, the step of obtaining the pre-labeled fraud samples includes:
    selecting, in a preset manner, some of the visit records as samples to be labeled;
    having experts label the selected samples to be labeled, identifying the samples with fraudulent behavior among them, and obtaining the pre-labeled fraud samples.
  6. The method for detecting medical insurance fraud according to claim 1, characterized in that, in the step of selecting, in a preset manner, some of the visit records as samples to be labeled, the preset manner of selecting the samples to be labeled includes at least:
    calculating the entropy of each patient by a maximum entropy selection strategy, and selecting the largest of the calculated entropy values as the samples to be labeled;
    or, using a random strategy to randomly take some of the visit records as the samples to be labeled;
    or, calculating the probability value of each patient by a maximum probability strategy, and selecting the largest of the calculated probability values as the samples to be labeled.
  7. The method for detecting medical insurance fraud according to claim 1, characterized in that, before obtaining the visit records of patients, extracting the corresponding patient features from the obtained visit records, and establishing the doctor-patient relationship neural network according to the extracted patient features and the correspondence between patients and doctors, the method includes:
    anonymizing the patient identity information in the patient visit information, and converting the processed visit information into visit records of a data structure type.
  8. The method for detecting medical insurance fraud according to claim 1, characterized in that obtaining the visit records of patients, extracting the corresponding patient features from the obtained visit records, and establishing the doctor-patient relationship neural network according to the extracted patient features and the correspondence between patients and doctors specifically includes:
    obtaining the visit records of patients, extracting the corresponding patient features from the visit records, and establishing a patient characteristic degree matrix;
    analyzing the doctor-patient relationships between doctors and patients in the visit records, and establishing the corresponding doctor-patient relationship adjacency matrix;
    establishing the doctor-patient relationship neural network according to the patient characteristic degree matrix and the doctor-patient relationship adjacency matrix.
  9. A system, characterized in that it includes a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs including instructions for performing the method for detecting medical insurance fraud according to any one of claims 1-8.
  10. A storage medium, characterized in that the storage medium stores a computer program, and the computer program can be executed to implement the method for detecting medical insurance fraud according to any one of claims 1-8.
PCT/CN2020/127183 2020-09-15 2020-11-06 Method for detecting medicare fraud, and system and storage medium WO2022057057A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010967115.3 2020-09-15
CN202010967115.3A CN112200684B (en) 2020-09-15 2020-09-15 Method, system and storage medium for detecting medical insurance fraud

Publications (1)

Publication Number Publication Date
WO2022057057A1 true WO2022057057A1 (en) 2022-03-24

Family

ID=74015083

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/127183 WO2022057057A1 (en) 2020-09-15 2020-11-06 Method for detecting medicare fraud, and system and storage medium

Country Status (2)

Country Link
CN (1) CN112200684B (en)
WO (1) WO2022057057A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115456805A (en) * 2022-11-14 2022-12-09 华信咨询设计研究院有限公司 Medical insurance anti-fraud method and system based on machine learning
CN118378201A (en) * 2024-06-25 2024-07-23 浙江大学 Medical insurance group abnormal behavior detection method and device

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113538126A (en) * 2021-07-16 2021-10-22 中国工商银行股份有限公司 Fraud risk prediction method and device based on GCN
CN114357008A (en) * 2021-12-16 2022-04-15 上海金仕达卫宁软件科技有限公司 Medical behavior consistency identification model establishing method and risk identification method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598552A (en) * 2014-12-31 2015-05-06 大连钜正科技有限公司 Method for learning incremental update-supported big data features
CN106446942A (en) * 2016-09-18 2017-02-22 兰州交通大学 Crop disease identification method based on incremental learning
CN106446552A (en) * 2016-09-28 2017-02-22 湖南老码信息科技有限责任公司 Prediction method and prediction system for sleep disorder based on incremental neural network model
CN108334935A (en) * 2017-12-13 2018-07-27 华南师范大学 Simplify deep learning neural network method, device and the robot system of input
US20180293496A1 (en) * 2017-04-06 2018-10-11 Pixar Denoising monte carlo renderings using progressive neural networks
CN109636061A (en) * 2018-12-25 2019-04-16 深圳市南山区人民医院 Training method, device, equipment and the storage medium of medical insurance Fraud Prediction network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107657536B (en) * 2017-02-20 2018-07-31 平安科技(深圳)有限公司 The recognition methods of social security fraud and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115456805A (en) * 2022-11-14 2022-12-09 华信咨询设计研究院有限公司 Medical insurance anti-fraud method and system based on machine learning
CN115456805B (en) * 2022-11-14 2023-04-07 华信咨询设计研究院有限公司 Medical insurance anti-fraud method and system based on machine learning
CN118378201A (en) * 2024-06-25 2024-07-23 浙江大学 Medical insurance group abnormal behavior detection method and device

Also Published As

Publication number Publication date
CN112200684B (en) 2024-05-07
CN112200684A (en) 2021-01-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20953914

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23-06-2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20953914

Country of ref document: EP

Kind code of ref document: A1