CN109448846B - Analysis method for measuring and calculating rare disease incidence based on medical insurance big data

Publication number
CN109448846B
Authority
CN
China
Legal status
Active
Application number
CN201811045882.8A
Other languages
Chinese (zh)
Other versions
CN109448846A (en)
Inventor
詹思延
王胜锋
冯菁楠
许璐
高培
王金喜
尉晨
Current Assignee
Peking University
Original Assignee
Peking University

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance

Abstract

The invention discloses a method for measuring and calculating the incidence of rare diseases based on medical insurance big data, and relates to medical insurance data processing and analysis technology. The numerator and denominator information required for the incidence calculation is obtained by summarizing several key parameters of monthly medical insurance data, and the incidence is then calculated. The numerator is the number of new cases of the target disease occurring in a defined population within a specific period; the denominator is the exposed population within that period, i.e. persons who may develop the target disease, excluding persons who already have the disease and therefore cannot become new cases within the period. The method yields incidence data for rare diseases, promotes epidemiological research on rare diseases, and provides data and technical support for the rational formulation of clinical guidelines; it further promotes the translational application of medical insurance big data and fills the gap in epidemiological data on rare diseases in China.

Description

Analysis method for measuring and calculating rare disease incidence based on medical insurance big data
Technical Field
The invention relates to medical insurance data processing and analysis technology, in particular to an analysis method for measuring and calculating the incidence of rare diseases based on medical insurance big data.
Background
Rare diseases are diseases with low incidence and prevalence. At present, China lacks basic epidemiological information on these diseases, including incidence and prevalence. Medical insurance data (claims data) are formed by integrating payment information in the medical insurance administrative system and include the basic characteristics, diagnoses and treatments of insured persons. Such data are large in volume, comprehensive, timely, inexpensive and easy to work with, and they are longitudinal real-world data, which favours fast and efficient epidemiological research. In particular, national medical insurance data offer a new way to address the lack of epidemiological data on rare diseases in China.
Unlike other epidemiological measures, calculating incidence requires defining both the number of new cases in a given period and the number of people who already have the disease in that period. Among existing foreign methods for analysing rare disease incidence, Raghu G et al. used United States medical insurance data to calculate the incidence of idiopathic pulmonary fibrosis for 2001-2011, but such calculations are based on individual-level raw data in the insurance records. For the massive medical insurance data in China, the storage period, format and volume of the data, the span and missingness of the data indicators, and the withdrawal of individuals from insurance all differ from foreign insurance data, so those methods cannot be applied directly to epidemiological research on Chinese medical insurance databases. Existing research on medical insurance data in China mostly focuses on mining the data to detect fraud, improve treatment outcomes and support the revision and formulation of policy; medical insurance data are rarely used for epidemiological research, particularly on the incidence and prevalence of rare diseases. It is therefore currently difficult to make full use of medical insurance data to study the epidemiological characteristics of rare diseases.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a new method for measuring and calculating disease incidence based on medical insurance big data. Using an optimized intermediate data storage format, the method summarizes several key parameters of monthly medical insurance data: the total number of insured individuals per month, the number of newly insured individuals per month, the total number of visit records per month, the total number of records with a missing diagnosis per month, the number of new cases in a given period, and the number of persons already affected in a given period (the period is set according to the research need and may be one month, two months, one year, and so on). From these the numerator and denominator information required for the incidence calculation is obtained, and the incidence is then calculated. The method counts the numerator and denominator efficiently on summarized data, is suited to the characteristics of medical insurance data in China, and can be used for epidemiological analysis of rare diseases.
The diseases to which the invention applies cannot be completely cured: once diagnosed, the patient has the disease for life.
The principle of the invention is as follows: based on the concept of person-months, the total number of insured individuals per month, the number of newly insured individuals per month, the total number of visit records per month and the total number of records with a missing diagnosis per month are counted; target patients are defined and extracted according to the target disease; 'invisible patients' are inferred under the assumption that diagnoses are missing at random; and the incidence is derived by formula. Diseases that can be measured by the method include multiple myeloma, plasma cell leukemia, plasma cell disease, idiopathic pulmonary interstitial fibrosis, POEMS syndrome and other rare diseases that cannot be completely cured and that persist for life once diagnosed. The method yields incidence data for rare diseases, provides data and technical support for the rational formulation of clinical guidelines, and further promotes the translational application of medical insurance big data.
The technical scheme provided by the invention is as follows:
A method for measuring and calculating the incidence of a rare disease based on medical insurance big data, characterized in that the disease measured must be one that cannot be completely cured. Based on a medical insurance database, the numerator and denominator information required for the incidence calculation is obtained by summarizing several key parameters of monthly medical insurance data (the total number of insured individuals per month, the number of newly insured individuals per month, the total number of visit records with a missing diagnosis, the number of new cases in a given period and the number of persons already affected in a given period), and the incidence is then calculated. The numerator is the number of new cases of the target disease in a defined population within a specific period; the denominator is the exposed population within that period, i.e. persons who may develop the target disease, excluding persons who already have the disease and cannot become new cases within the period;
the method comprises the following steps:
A1. determining the scope of the medical insurance database (such as time span, regional distribution, outpatient/hospitalization);
A2. basic cleaning of the database and definition of the target disease;
the basic cleaning of the database comprises the following basic steps: (1) checking the integrity and the logicality of variables in the database; (2) code standardization and natural language processing of text contents in a database; (3) the version of International Classification of Diseases (ICD) in the database is determined and unified.
In the invention, the definition of the target disease is based on the name or ICD code of the corresponding disease appearing in the medical insurance database, and particularly, the multiple expression forms of the text and the ICD code need to be fully considered, and a dictionary database containing the target disease diagnosis name expression mode as comprehensive as possible is constructed through a word segmentation technology.
The process of constructing the dictionary database comprises the following steps:
first, text information containing the name of the target disease (e.g. multiple myeloma) is extracted from the medical insurance database; this text may contain incorrect expressions of the disease name as well as other diagnosis names, so it cannot be used directly;
next, fields related to the target disease name (multiple myeloma) are extracted from this text using a word segmentation technique;
then, each disease name expression is judged manually, and the correct text expressions form a preliminary dictionary;
all information containing the target diagnosis is extracted from the medical insurance database again according to the preliminary dictionary and recognized using word segmentation;
these steps are repeated until the accuracy of the text expressions of the target disease name exceeds 95%, at which point the dictionary is final. The aim is to capture as many expressions of the target disease name as possible so that patients are not missed in later identification.
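The iterative dictionary construction described above can be sketched as follows. This is only an illustration under stated assumptions: a delimiter-based split stands in for a real word segmentation tool, the `review` callback stands in for the manual judgment step, and the function names, delimiter set and sample strings are hypothetical rather than taken from the patent.

```python
import re

def extract_candidates(records, terms):
    """Split each free-text diagnosis into fields; keep fields mentioning any known term."""
    candidates = set()
    for text in records:
        # Assumption: several diagnoses in one field are separated by these delimiters.
        for field in re.split(r"[;,，；、/]+", text):
            field = field.strip()
            if field and any(term in field for term in terms):
                candidates.add(field)
    return candidates

def build_dictionary(records, seed_terms, review, target_accuracy=0.95, max_rounds=10):
    """Repeat extraction and review until the share of correct candidates reaches the target."""
    accepted, rejected = set(seed_terms), set()
    for _ in range(max_rounds):
        # Expressions rejected in an earlier round are not presented again.
        candidates = extract_candidates(records, accepted) - rejected
        correct = {c for c in candidates if review(c)}
        rejected |= candidates - correct
        accuracy = len(correct) / len(candidates) if candidates else 1.0
        accepted |= correct
        if accuracy >= target_accuracy:
            break
    return accepted

records = [
    "multiple myeloma; hypertension",
    "multiple myeloma (IgG type), diabetes",
    "rule out multiple myeloma",
]
# Stand-in for manual review: reject expressions that only rule the disease out.
dictionary = build_dictionary(records, {"multiple myeloma"},
                              lambda s: not s.startswith("rule out"))
```

Keeping a `rejected` set is a design choice of this sketch: it is what lets the measured accuracy rise from one round to the next, since the source does not specify how the repeated rounds differ.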
A3. Summarizing denominator information;
The population is divided into four groups: individuals who are insured but have never been reimbursed; individuals who are insured and have reimbursement records but no diagnosis of the target disease; individuals who are insured, have reimbursement records and have a diagnosis of the target disease; and individuals who already had the target disease within the given period. For each individual and each month, insured person-months are counted and uninsured person-months are removed.
Specifically, according to the enrollment status, a month with an enrollment record is counted for that individual and a month without one is dropped.
First group of the denominator: individuals who are insured but never reimbursed. The person-month sum is given by Formula 1:
$$N = \sum_{t} \sum_{n} \mathrm{insurance}_{t,n} \qquad (1)$$
where $t$ denotes the $t$-th month; $\mathrm{insurance}_{t,n}$ is the enrollment status of the $n$-th individual of the group in the $t$-th month; and $N$ is the person-month sum of the first group of the denominator.
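As a minimal sketch of this person-month count, assuming the enrollment status is held as a month-by-individual table of 0/1 values (the function name and the sample data are illustrative, not from the patent):

```python
def person_month_sum(insurance):
    """insurance[t][n] is 1 if individual n is insured in month t, otherwise 0."""
    return sum(status for month in insurance for status in month)

# Three months (rows) by three individuals (columns).
insurance = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
]
total = person_month_sum(insurance)  # 6 insured person-months
```

Uninsured months contribute 0 to the sum, which is the same as removing them before counting.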
Second group: individuals who are insured and have reimbursement records but no target diagnosis. Three cases are distinguished:
In the first case: months in which the individual had no visit and no reimbursement record are included directly in the denominator; for each month, the number of such persons is $m_{1,1}$.
In the second case: months in which the individual had a visit with a complete diagnosis are included in the denominator calculation; for each month, the number of such persons is $m_{1,2}$.
In the third case: months in which the individual had a visit but the diagnosis is missing are extracted for subsequent imputation; for each month, the number of such persons is $m_{1,3}$.
Taking each month as the unit, the person-month sum of the second group of the denominator is given by Formula 2:
$$M = \sum_{t} \sum_{m} \mathrm{insurance}_{t,m} \qquad (2)$$
where $t$ denotes the $t$-th month; $\mathrm{insurance}_{t,m}$ is the enrollment status of the $m$-th individual of the group in the $t$-th month; and $M$ is the person-month sum of the second group of the denominator.
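The three-way split of insured months can be sketched as follows. The flags `visited` and `diagnosis_recorded` and the label names are assumptions made for illustration; the source describes the three cases only in words.

```python
from collections import Counter

def classify_person_month(insured, visited, diagnosis_recorded):
    """Assign one person-month to a category of the denominator bookkeeping."""
    if not insured:
        return "not_insured"       # removed from the denominator
    if not visited:
        return "no_visit"          # case 1: counted directly
    if diagnosis_recorded:
        return "visit_complete"    # case 2: counted in the denominator
    return "visit_missing"         # case 3: set aside for imputation

def monthly_counts(rows):
    """rows: iterable of (month, insured, visited, diagnosis_recorded) tuples."""
    counts = {}
    for month, insured, visited, diagnosis_recorded in rows:
        label = classify_person_month(insured, visited, diagnosis_recorded)
        counts.setdefault(month, Counter())[label] += 1
    return counts

rows = [
    (1, True, False, False),
    (1, True, True, True),
    (1, True, True, False),
    (1, False, False, False),
    (2, True, True, False),
]
counts = monthly_counts(rows)
```

For the second group, the per-month counts of `no_visit`, `visit_complete` and `visit_missing` play the roles of the three case counts described above.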
Third group: individuals who are insured, have reimbursement records and have a target diagnosis. Three cases are distinguished:
In the first case: months in which the individual had no visit and no reimbursement record are included directly in the denominator; for each month, the number of such persons is $k_{1,1}$.
In the second case: months in which the individual had a visit with a complete diagnosis are included in the denominator; for each month, the number of such persons is $k_{1,2}$.
In the third case: months in which the individual had a visit but the diagnosis is missing are considered for subsequent imputation; for each month, the number of such persons is $k_{1,3}$.
Taking each month as the unit, the person-month sum of the third group of the denominator is given by Formula 3:
$$K = \sum_{t} \sum_{k} \mathrm{insurance}_{t,k} \qquad (3)$$
where $t$ denotes the $t$-th month; $\mathrm{insurance}_{t,k}$ is the enrollment status of the $k$-th individual of the group in the $t$-th month; and $K$ is the person-month sum of the third group of the denominator.
Fourth group: individuals who already have the disease.
The sum for the fourth group is given by Formula 4:
[Formula 4: equation image not available in the source text]
where $t_1$ denotes a given period and $P$ denotes the total number of persons already affected in that period.
A4. Summarizing numerator information, in two groups;
For the target disease, the corresponding numerator information is extracted in two groups: new-onset patients, and new-onset patients that must be estimated by imputation. New-onset patients are the new cases of the target disease in a defined population within a given period. The latter group is estimated under the assumption that whether the diagnosis is missing from a visit record has no statistically significant association with whether the person has the rare disease.
First group of the numerator: new-onset patients
All new-onset patients included within a given period (e.g. a month or a year) are recorded as
$$\sum_{t_1} \mathrm{Case\_new}$$
where $t_1$ denotes a given period and $\mathrm{Case\_new}$ is the number of patients newly diagnosed with the target disease in that period. A new-onset patient is one in whom the target diagnosis has not appeared before the period for which incidence is calculated; the washout period is chosen according to the disease studied. For example, when the incidence of a rare disease is calculated for a given year, a patient is a new-onset patient if no target diagnosis appears in the database in the years before that year.
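The new-patient rule with a washout (elution) period can be sketched as below. This is a hedged illustration: it works at calendar-year resolution, and the function name, the `washout_years` parameter and the example years are assumptions rather than details given in the source.

```python
def is_new_case(diagnosis_years, study_year, washout_years):
    """True if the target diagnosis appears in study_year and in none of the
    washout_years calendar years immediately before it."""
    if study_year not in diagnosis_years:
        return False
    washout_window = range(study_year - washout_years, study_year)
    return not any(year in diagnosis_years for year in washout_window)

# Diagnosed for the first time in the study year: a new case.
first_time = is_new_case({2016}, study_year=2016, washout_years=3)
# Already diagnosed two years earlier: not a new case.
seen_before = is_new_case({2014, 2016}, study_year=2016, washout_years=3)
```

A diagnosis older than the washout window is not detected by this check, so for a lifelong disease the window would normally be set as long as the available data allow.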
Second group of the numerator: new-onset patients to be estimated by imputation
A5. Checking and unifying the basic characteristics of new-onset patients, such as age, sex, nationality and household registration
The medical insurance data consist of three tables: the 'insured person information table', the 'general outpatient (emergency) cost and settlement information table' and the 'outpatient major illness, outpatient pooling, hospitalization and home sickbed cost and settlement information table'. Age, sex, nationality, household registration and similar fields must be checked and unified across the tables through linking variables, so that each linking variable corresponds to a unique identifier (such as the identity card number) and the information attached to each unique identifier is internally consistent.
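One way to carry out such a cross-table consistency check is sketched below. The row layout (dicts with an `id` key), the field names `sex` and `birth_year`, and the sample rows are hypothetical; the source does not specify how the tables are stored.

```python
from collections import defaultdict

def find_inconsistent_ids(tables, fields=("sex", "birth_year")):
    """Return, per identifier, the fields that take more than one value across tables."""
    seen = defaultdict(lambda: defaultdict(set))
    for table in tables:
        for row in table:
            for field in fields:
                if row.get(field) is not None:
                    seen[row["id"]][field].add(row[field])
    conflicts = {}
    for person_id, by_field in seen.items():
        clashing = {f: values for f, values in by_field.items() if len(values) > 1}
        if clashing:
            conflicts[person_id] = clashing
    return conflicts

enrollment = [
    {"id": "A", "sex": "F", "birth_year": 1950},
    {"id": "B", "sex": "M", "birth_year": 1960},
]
claims = [
    {"id": "A", "sex": "F", "birth_year": 1950},
    {"id": "B", "sex": "F", "birth_year": 1960},
]
conflicts = find_inconsistent_ids([enrollment, claims])
```

Identifiers returned by the check would then be resolved before patient characteristics are tabulated; how conflicts are resolved is left open here, as it is in the source.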
A6. The incidence is calculated as the quotient of the summarized numerator information and denominator information.
Formula for the incidence (taking the annual incidence as an example):
$$\mathrm{Incidence} = \frac{\sum \mathrm{NewCase}}{\mathrm{PersonYear}}$$
where $\sum \mathrm{NewCase}$ is the total number of new-onset patients in the observation year, i.e. the sum of the new-onset patients observed in the database and the new-onset patients estimated by imputation; $\mathrm{PersonYear}$ is the exposed population over the same period, i.e. the population in the observation area that could develop the disease in the observation year, and is obtained from $\sum_t \mathrm{PersonMonth}$. $\sum_t \mathrm{Case}_{\mathrm{impute\_m}}$ is the sum over the months of the new-onset patients to be imputed, where $t$ denotes the $t$-th month and $\mathrm{Case}_{\mathrm{impute\_m}}$ is the number of target patients estimated each month from the persons who had a visit but a missing diagnosis among insured individuals with reimbursement records but no target diagnosis. $\sum_{t_1} \mathrm{Case\_new}$ is the number of patients newly diagnosed with the target disease in a given period, where $t_1$ denotes the period.
The monthly denominator sum $\sum_t \mathrm{PersonMonth}$ is calculated as:
$$\sum_t \mathrm{PersonMonth} = \sum_t \mathrm{PersonMonth}_1 + \sum_t \mathrm{PersonMonth}_2 + \sum_t \mathrm{PersonMonth}_3$$
where $t$ denotes the $t$-th month; $\sum_t \mathrm{PersonMonth}_1$ is the person-months contributed by individuals who are insured but never reimbursed; $\sum_t \mathrm{PersonMonth}_2$ is the person-months contributed by individuals who are insured and have reimbursement records but no target diagnosis; $\sum_t \mathrm{PersonMonth}_3$ is the person-months contributed by individuals who are insured, have reimbursement records and have a target diagnosis; and the fourth-group term (symbol image not available in the source text) denotes the individuals who already had the disease within a given period, where $t_1$ denotes the period.
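The quotient can be sketched as below. Two things in this sketch are assumptions not stated in the source: person-months are converted to person-years by dividing by 12, and the rate is reported per 100,000 person-years. The function name and the numbers are illustrative only.

```python
def annual_incidence(observed_new, imputed_new, person_months, per=100_000):
    """Incidence per `per` person-years from summarized counts."""
    if person_months <= 0:
        raise ValueError("person_months must be positive")
    # Assumption: 12 person-months of exposure make one person-year.
    person_years = person_months / 12
    return (observed_new + imputed_new) * per / person_years

# 18 observed and 2 imputed new cases over 2,400,000 exposed person-months.
rate = annual_incidence(observed_new=18, imputed_new=2, person_months=2_400_000)
```

Here `person_months` stands for the summed denominator after persons who already have the disease have been excluded.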
The invention has the following beneficial effects:
The method for measuring and calculating the incidence of rare diseases based on medical insurance big data counts the numerator and denominator of the incidence efficiently on summarized data. On the one hand, it yields incidence and disease burden data for rare diseases in China (disease burden covering the pressure that disease, disability and premature death place on the economy and health of society as a whole), and provides data and technical support for the rational formulation of clinical guidelines. On the other hand, it provides a new way of calculating disease incidence from medical insurance data, which promotes the translational application of medical insurance big data and helps fill the gap in incidence data for rare diseases in China. The method is suited to the characteristics of medical insurance data in China and can be used for epidemiological analysis of rare diseases.
Drawings
FIG. 1 is a block flow diagram of a method of calculating morbidity provided by the present invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides a new method for measuring and calculating disease incidence based on medical insurance big data. Using an optimized intermediate data storage format, it summarizes several key parameters of monthly medical insurance data to obtain the numerator and denominator information required for the incidence calculation, and then calculates the incidence. The numerator is the number of new cases of the target disease in a defined population within a specific period; the denominator is the exposed population within that period, i.e. persons who may develop the target disease, excluding persons who already have the disease and cannot become new cases within the period.
Fig. 1 shows a flow of a method for calculating morbidity provided by the present invention, and the specific embodiment of the present invention is as follows:
A1. determining database range (e.g., time span, regional distribution, outpatient/hospitalization);
A2. basic cleaning of the database and definition of the target disease;
the basic cleaning of the database comprises the following basic steps: (1) checking the integrity and the logicality of variables in the database; (2) code standardization and natural language processing of text contents in a database; (3) the version of International Classification of Diseases (ICD) in the database is determined and unified.
In the invention, the definition of the target disease is based on the name of the corresponding disease or ICD code in the medical insurance database, and a dictionary database as comprehensive as possible is constructed by fully considering various expression forms of the text and the ICD code.
A3. Summarizing the denominator information for the incidence;
The denominator of the incidence is divided into four groups.
First group: individuals who are insured but never reimbursed
These individuals have never had a medical visit; they have enrollment records only and no reimbursement records, and contribute only to the denominator when the incidence is calculated. Specifically, the enrollment status of each observed individual in each month of the observation period is recorded (1 = insured, 0 = not insured); uninsured person-months are then removed and the total of insured person-months is added to the denominator. Taking each month as the unit, the person-month sum of the first group of the denominator is given by Formula 1:
$$N = \sum_{t} \sum_{n} \mathrm{insurance}_{t,n} \qquad (1)$$
where $t$ denotes the $t$-th month; $\mathrm{insurance}_{t,n}$ is the enrollment status of the $n$-th individual of the group in the $t$-th month; and $N$ is the person-month sum of the first group of the denominator.
Second group: individuals who are insured and have reimbursement records but no target diagnosis
These individuals have had medical visits but no target diagnosis; they have both enrollment and reimbursement records and contribute only to the denominator when the incidence is calculated. Specifically, the enrollment status of each observed individual in each month of the observation period is recorded (1 = insured, 0 = not insured) and uninsured months are removed as before. The insured months, however, cannot be added to the denominator directly; they are divided into three cases according to the diagnosis status:
In the first case: months with no visit and no reimbursement record are included directly in the denominator (as shown in Fig. 1); for each month, the number of such persons is $m_{1,1}$.
In the second case: months with a visit and a complete diagnosis are included in the denominator calculation (as shown in Fig. 1); for each month, the number of such persons is $m_{1,2}$.
In the third case: months with a visit but a missing diagnosis are considered for subsequent imputation (as shown in Fig. 1); for each month, the number of such persons is $m_{1,3}$.
Taking each month as the unit, the person-month sum of the second group of the denominator is given by Formula 2:
$$M = \sum_{t} \sum_{m} \mathrm{insurance}_{t,m} \qquad (2)$$
where $t$ denotes the $t$-th month; $\mathrm{insurance}_{t,m}$ is the enrollment status of the $m$-th individual of the group in the $t$-th month; and $M$ is the person-month sum of the second group of the denominator.
Third group: individuals who are insured, have reimbursement records and have a target diagnosis
These individuals have had medical visits and a target diagnosis, and have both enrollment and reimbursement records; they contribute to both the numerator and the denominator when the incidence is calculated. For the denominator, the enrollment status of each observed individual in each month of the observation period is recorded (1 = insured, 0 = not insured) and uninsured months are removed (as shown in Fig. 1). The insured months still cannot be added to the denominator directly; they are divided into three cases according to the diagnosis status:
In the first case: months with no visit and no reimbursement record are included directly in the denominator (as shown in Fig. 1); for each month, the number of such persons is $k_{1,1}$.
In the second case: months with a visit and a complete diagnosis are included in the denominator calculation (as shown in Fig. 1); for each month, the number of such persons is $k_{1,2}$.
In the third case: months with a visit but a missing diagnosis are considered for subsequent imputation (as shown in Fig. 1); for each month, the number of such persons is $k_{1,3}$.
Taking each month as the unit, the person-month sum of the third group of the denominator is given by Formula 3:
$$K = \sum_{t} \sum_{k} \mathrm{insurance}_{t,k} \qquad (3)$$
where $t$ denotes the $t$-th month; $\mathrm{insurance}_{t,k}$ is the enrollment status of the $k$-th individual of the group in the $t$-th month; and $K$ is the person-month sum of the third group of the denominator.
And a fourth group: an individual who has already suffered from the disease
Since the measurable disease is always suffered from once diagnosed, the denominator is the population likely to suffer from the target disease, i.e. the number of exposed population, so that the individuals suffered from the disease in a certain period need to be removed when calculating the denominator, and the sum of the fourth group corresponds to the calculation formula as 4:
Figure BDA0001793283790000082
wherein, t1Indicating a certain period of time; p represents the total number of affected people in the period.
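Removing the persons who already have the disease from the denominator can be sketched as follows. The data layout is an assumption made for illustration: months are encoded as integers of the form YYYYMM, `person_months_by_id` maps each identifier to that person's insured person-months in the period, and `first_dx_month` holds the month of the first target diagnosis where one exists.

```python
def exclude_prevalent(person_months_by_id, first_dx_month, period_start):
    """Keep only individuals not already diagnosed before the period starts."""
    return {
        person_id: months
        for person_id, months in person_months_by_id.items()
        if not (person_id in first_dx_month and first_dx_month[person_id] < period_start)
    }

person_months_by_id = {"A": 12, "B": 12, "C": 6}
first_dx_month = {"B": 201503, "C": 201605}  # B was diagnosed before 2016, C during 2016
at_risk = exclude_prevalent(person_months_by_id, first_dx_month, period_start=201601)
exposed_person_months = sum(at_risk.values())  # A and C remain: 12 + 6 = 18
```

Person C, diagnosed within the period, stays in the denominator because that person was still at risk when the period began.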
A4. Summarizing numerator information;
After the target disease is defined, the corresponding numerator information is extracted in two groups:
First group of the numerator: new-onset patients
All new-onset patients included in a given period are recorded as $\sum_t \mathrm{Case\_new}$, where $t$ denotes the specific period and $\mathrm{Case\_new}$ is the number of patients newly diagnosed with the target disease in that period. A new-onset patient is one in whom the target diagnosis has not appeared before the period for which incidence is calculated; the washout period is chosen according to the disease studied. For example, when the incidence of a rare disease is calculated for a given year, a patient is a new-onset patient if no target diagnosis appears in the database in the years before that year.
Second group of the numerator: new-onset patients to be estimated by imputation
Some visit records have a missing diagnosis. They belong either to non-target-disease individuals who are insured and had a visit with a missing diagnosis, or to target-disease patients who are insured and had a visit with a missing diagnosis, i.e. $m_{1,3}$ and $k_{1,3}$. The incidence calculation requires new-onset patients, i.e. the first occurrence of the target diagnosis. The $k_{1,3}$ patients have a missing diagnosis in the current month but were diagnosed with the target disease before, so they are not counted as new-onset patients and are not included in the numerator. The portion to be imputed is therefore $m_{1,3}$. These records should not simply be discarded, as shown in Table 1.
Table 1. Imputation of the numerator in the incidence calculation
[Table 1: table image not available in the source text]
The incidence under ideal conditions is calculated by formula 5:
Figure BDA0001793283790000092
where a denotes the number of target-disease patients captured among individuals who are insured, have visits, and whose visit records have no missing diagnosis; b denotes the number of non-target-disease patients captured among individuals who are insured, have visits, and whose visit records have no missing diagnosis; c denotes the number of target-disease patients that could theoretically be captured among individuals who are insured, have visits, and whose visit records have a missing diagnosis; d denotes the number of non-target-disease patients among individuals who are insured, have visits, and whose visit records have a missing diagnosis; and e denotes the number of individuals who are insured but have never had a visit.
The incidence after directly discarding the records with a missing diagnosis is calculated by formula 6:
Figure BDA0001793283790000093
The incidence under ideal conditions and the incidence after direct discarding are clearly not equal. The records with a missing diagnosis must therefore be estimated under a suitable assumption to obtain the values of c and d. The initial assumption adopted in the present invention is that a missing diagnosis in the visit information has no statistically significant association with whether the patient has a given rare disease, i.e.
Figure BDA0001793283790000094
If this assumption is satisfied, c is expressed as equation 7:
Figure BDA0001793283790000095
where c + d is the total number of records with a missing diagnosis, which can be counted directly.
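As a rough illustration of formula 7, the sketch below assumes it takes the proportional form c = a × (c + d) / (a + b), which follows from the stated assumption that the share of target-disease patients is the same among records with and without a missing diagnosis. The function name `impute_missing_cases` and the sample counts are hypothetical and are not taken from the patent.

```python
def impute_missing_cases(a: float, b: float, missing_total: float) -> float:
    """Estimate c, the target-disease patients among records with a missing diagnosis.

    Assumes a / (a + b) == c / (c + d), i.e. missingness is unrelated to
    disease status. `missing_total` is c + d, which can be counted directly.
    """
    observed_total = a + b
    if observed_total <= 0:
        raise ValueError("a + b must be positive")
    return a * missing_total / observed_total


# Hypothetical counts: 50 target patients among 10,000 complete records,
# plus 600 records with a missing diagnosis.
c = impute_missing_cases(a=50, b=9950, missing_total=600)
d = 600 - c
print(c, d)
```

Under this assumption the imputed records keep the same target-disease share as the complete records, so the estimate scales linearly with the number of records that lack a diagnosis.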
The number of people to be imputed into the numerator is calculated according to formula 7. Note that the new patients to be estimated from missing diagnoses are drawn from m_{1,3}, the number of people who, among insured individuals with reimbursement records but no target diagnosis, had a visit with a missing diagnosis in the month. Imputing under the above assumption gives Case_impute_m = c, and the total number of new patients with the target disease after imputation is
Figure BDA0001793283790000096
where t denotes the t-th month and Case_impute_m denotes the number of new target-disease patients estimated from the number of people who, among insured individuals with reimbursement records but no target diagnosis, had a visit with a missing diagnosis in that month.
A5. Checking and unifying the basic characteristics of numerator patients
The medical insurance data are held in three tables: the 'insured personnel information table', the 'general outpatient (emergency) and inpatient cost and settlement information table', and the 'outpatient major illness, outpatient pooled fund, inpatient, and home sickbed cost and settlement information table'. Age, gender, nationality, household registration, and similar attributes must be checked and unified across these tables through the linking variables, so that each linking variable corresponds to a unique identification ID (such as the identity card number) and the information attached to each unique ID, such as age, gender, nationality, and household registration, is internally consistent.
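A minimal sketch of such a cross-table consistency check is given below. It assumes each table has already been loaded as a list of dictionaries keyed by a shared `id` field; the field names, the sample rows, and the helper `find_inconsistent_ids` are hypothetical.

```python
from collections import defaultdict


def find_inconsistent_ids(tables, fields=("age", "gender")):
    """Return the IDs whose demographic fields disagree across the tables.

    `tables` is an iterable of lists of dict rows; every row carries an "id" key.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for rows in tables:
        for row in rows:
            for field in fields:
                if row.get(field) is not None:
                    seen[row["id"]][field].add(row[field])
    return sorted(
        pid
        for pid, values in seen.items()
        if any(len(distinct) > 1 for distinct in values.values())
    )


# Hypothetical rows from the three tables.
insured = [{"id": "A1", "age": 60, "gender": "F"}, {"id": "B2", "age": 45, "gender": "M"}]
outpatient = [{"id": "A1", "age": 60, "gender": "F"}]
inpatient = [{"id": "B2", "age": 45, "gender": "F"}]

conflicts = find_inconsistent_ids([insured, outpatient, inpatient])
print(conflicts)
```

IDs returned by the check would then be reviewed and unified before the numerator patients are described by age, gender, and so on.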
A6. Calculation of incidence
The disease incidence (in person-years) is calculated by the following formula:
Figure BDA0001793283790000101
where Σ NewCase denotes the total number of new patients in the observation year, comprising the new patients observed in the database and the new patients estimated by imputation; PersonYear denotes the exposed population over the same period, i.e., the population in the observation area at risk of developing the disease during the observation year; Σ_t Case is the sum of the new patients to be imputed in each month,
Figure BDA0001793283790000102
Figure BDA0001793283790000103
t denotes the t-th month, and Case_impute_m denotes the number of target-disease patients estimated each month from the number of people who, among insured individuals with reimbursement records but no target diagnosis, had a visit with a missing diagnosis.
Figure BDA0001793283790000104
denotes the number of newly diagnosed cases of the target disease in a given period, where t_1 denotes that period.
The denominator, the monthly sum Σ_t PersonMonth, is calculated by the following formula:
Σ_t PersonMonth = Σ_t PersonMonth1 + Σ_t PersonMonth2 + Σ_t PersonMonth3
where t denotes the t-th month; Σ_t PersonMonth1 is the person-months contributed by individuals who are insured but have never been reimbursed; Σ_t PersonMonth2 is the person-months contributed by individuals who are insured and have reimbursement records but no target diagnosis; and Σ_t PersonMonth3 is the person-months contributed by individuals who are insured, have reimbursement records, and have a target diagnosis,
Figure BDA0001793283790000105
denotes the individuals already affected in a given period, where t_1 denotes that period.
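The person-month bookkeeping for the three insured groups can be sketched as follows. The sketch assumes each individual's monthly insured status is stored as a list of 0/1 flags, and that person-years are obtained by dividing person-months by 12; the group contents and the helper `person_months` are hypothetical.

```python
def person_months(insured_flags):
    """Sum the insured months over all individuals in one denominator group.

    `insured_flags` maps an individual ID to a list of 0/1 flags, one per month.
    """
    return sum(sum(flags) for flags in insured_flags.values())


# Hypothetical groups covering one observation year (12 monthly flags each).
never_reimbursed = {"n1": [1] * 12, "n2": [1] * 3 + [0] * 9}   # PersonMonth1
no_target_diagnosis = {"m1": [1] * 12}                          # PersonMonth2
with_target_diagnosis = {"k1": [0] * 6 + [1] * 6}               # PersonMonth3

total_person_months = sum(
    person_months(group)
    for group in (never_reimbursed, no_target_diagnosis, with_target_diagnosis)
)
person_years = total_person_months / 12
print(total_person_months, person_years)
```

Individuals already affected in the period would still have to be removed from this total, as the description of the fourth denominator group requires.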
The invention is further illustrated by the following examples.
In this example, the urban employee medical insurance database and the urban resident medical insurance database of a certain province for 2012-2016 are selected, with four years as the washout period, and the incidence of multiple myeloma in 2016 is calculated from the province's medical insurance database; the 2016 database comprises 217,342,112 urban employees and 145,714,765 urban residents.
After basic data cleaning (for example, handling missing or abnormal reimbursement-date and visit-date variables), the clinical diagnosis of multiple myeloma is found to be expressed in the database as a combination of text and ICD codes, as shown in Table 2:
TABLE 2 diagnostic description and ICD coding enumeration of multiple myeloma
Figure BDA0001793283790000106
Figure BDA0001793283790000111
The database contains six diagnosis fields: primary diagnosis name, primary diagnosis code, first secondary diagnosis name, first secondary diagnosis code, second secondary diagnosis name, and second secondary diagnosis code. Based on this field structure, the working definition is as follows:
Values that the fields must contain (combined with OR): myeloma, Kahler, bone marrow cancer/bone marrow ca, myelopathy, C90, M9732, 203.0;
Values that the fields must exclude (combined with OR): plasma cells, isolated, C90.1, C90.2
In the actual numerator capture, the search covers all six fields (primary diagnosis, first secondary diagnosis, second secondary diagnosis, primary diagnosis code, first secondary diagnosis code, and second secondary diagnosis code): at least one 'must contain' value must appear somewhere in the six fields, and no 'must exclude' value may appear.
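The include/exclude rule over the six diagnosis fields can be sketched as below. The field names are hypothetical, the term lists are only a subset of the values listed above in their English rendering (the real database text would be in Chinese), and plain substring matching is an assumption about how the rule is applied.

```python
# Hypothetical names for the six diagnosis fields.
DIAG_FIELDS = (
    "main_diag_name", "main_diag_code",
    "secondary1_name", "secondary1_code",
    "secondary2_name", "secondary2_code",
)
# Illustrative subset of the "must contain" and "must exclude" values.
INCLUDE_TERMS = ("myeloma", "bone marrow cancer", "c90", "m9732", "203.0")
EXCLUDE_TERMS = ("plasma cell", "isolated", "c90.1", "c90.2")


def is_target_record(record, fields=DIAG_FIELDS):
    """True if any field holds an include term and no field holds an exclude term."""
    values = [str(record.get(field) or "").lower() for field in fields]
    has_include = any(term in value for value in values for term in INCLUDE_TERMS)
    has_exclude = any(term in value for value in values for term in EXCLUDE_TERMS)
    return has_include and not has_exclude


records = [
    {"main_diag_name": "Multiple myeloma", "main_diag_code": "C90.0"},
    {"main_diag_name": "Plasma cell leukemia", "main_diag_code": "C90.1"},
    {"main_diag_name": "Hypertension", "main_diag_code": "I10", "secondary1_name": "myeloma"},
    {"main_diag_name": "Hypertension", "main_diag_code": "I10"},
]
flags = [is_target_record(r) for r in records]
print(flags)
```

Because "C90" is a prefix of "C90.1" and "C90.2", the exclusion test has to take precedence over the inclusion test, which is why both are evaluated across all six fields before the decision is made.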
The denominator information is then summarized. The monthly numbers of insured individuals, visit records, and visit records with a missing diagnosis are shown in Table 3:
TABLE 3 summary of medical insurance data parameters needed for calculation of numerator and denominator
Figure BDA0001793283790000112
Figure BDA0001793283790000121
Note that the total number of insured individuals per month equals the number of individuals who are insured but have never been reimbursed, plus the number who are insured and have reimbursement records but no target diagnosis, plus the number who are insured and have reimbursement records with a target diagnosis; therefore only the key parameter, the total number of insured individuals per month, is listed here.
I. Calculation of the denominator
The denominator calculation has two parts: the number of insured individuals in 2016 and the number of individuals already affected before 2016.
Part 1: the average number of insured individuals in 2016 is obtained by averaging over the 12 months:
(30152092+28539556+30571984+30615202+28779370+30912654+31344596+28196530+30464440+30734528+32040068+30705856)/12=30254740
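The 12-month average can be checked with a few lines of Python; the variable names are illustrative, and the monthly counts are those quoted in the expression above.

```python
# Monthly numbers of insured individuals in 2016, as listed in the text.
monthly_insured_2016 = [
    30152092, 28539556, 30571984, 30615202, 28779370, 30912654,
    31344596, 28196530, 30464440, 30734528, 32040068, 30705856,
]

average_insured = sum(monthly_insured_2016) / len(monthly_insured_2016)
print(round(average_insured))
```

Rounded to a whole number of people, the result agrees with the 30254740 stated in the text.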
Part 2: the number of individuals already affected before 2016 is obtained as follows: the province's 2012-2016 database is split into a 2012-2015 part and a 2016 part, each deduplicated so that every individual has only one diagnosis record; the two parts are then joined by software, and the matched records give the number of patients who have a diagnosis record in 2016 and had already been diagnosed with multiple myeloma before 2016. By this method, the number of individuals in the province already affected before 2016 is 3821.
II. Calculation of the numerator
The numerator calculation has two parts: the number of new patients and the number of patients to be estimated by imputation.
Part 1: the number of new patients is obtained with the same procedure used for the already-affected patients: the province's 2012-2016 database is split into a 2012-2015 part and a 2016 part, each deduplicated so that every individual has only one diagnosis record, and the two parts are joined by software. The individuals who appear in the 2016 part without a match in 2012-2015 are the new multiple myeloma patients of 2016; the number of new patients in the province in 2016 is 417.
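The split-and-join step used for both counts can be sketched with set operations. This is a simplified stand-in for the database join described in the text; the patient IDs and the helper `split_current_year_patients` are hypothetical.

```python
def split_current_year_patients(prior_ids, current_ids):
    """Split the observation-year patients into already-affected and new-onset.

    Converting to sets deduplicates each part, so every individual counts once.
    """
    prior, current = set(prior_ids), set(current_ids)
    already_affected = current & prior   # matched: diagnosed in the washout period too
    new_onset = current - prior          # unmatched: first target diagnosis this year
    return already_affected, new_onset


# Hypothetical patient IDs with a target diagnosis in each part of the database.
washout_2012_2015 = ["p01", "p02", "p03", "p02"]
year_2016 = ["p02", "p04", "p05", "p04"]

already_affected, new_onset = split_current_year_patients(washout_2012_2015, year_2016)
print(sorted(already_affected), sorted(new_onset))
```

The matched set corresponds to the already-affected count used in the denominator, and the unmatched set to the new patients used in the numerator.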
Part 2: the number of patients to be imputed is calculated as follows (in this example the imputation is calculated on a yearly basis). The average monthly number of visit records in 2016 is:
(5266439+4685125+5446965+4767414+5356961+4962207+5627907+4545582+4700331+5365805+5684190+5739000)/12=5178994
The average monthly number of visit records with a missing diagnosis in 2016 is:
(331479+314970+333140+315726+300708+271454+274475+282899+273228+284421+277799+280500)/12=295067
Then, based on the 417 new patients in 2016, the number of patients to be imputed is calculated as: (417 × 295067)/(5178994-
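The expression above is cut off in the source. The sketch below assumes that its denominator is the average number of visit records minus the average number with a missing diagnosis (that is, a + b in formula 7), and that the 417 observed new patients play the role of a. Both points are assumptions, and the variable names are illustrative.

```python
# Monthly counts for 2016, as listed in the text.
visits_2016 = [
    5266439, 4685125, 5446965, 4767414, 5356961, 4962207,
    5627907, 4545582, 4700331, 5365805, 5684190, 5739000,
]
missing_2016 = [
    331479, 314970, 333140, 315726, 300708, 271454,
    274475, 282899, 273228, 284421, 277799, 280500,
]

avg_visits = round(sum(visits_2016) / 12)     # 5178994 in the text
avg_missing = round(sum(missing_2016) / 12)   # 295067 in the text
new_cases_observed = 417

# Assumed completion of the truncated expression: a * (c + d) / (a + b).
imputed_cases = new_cases_observed * avg_missing / (avg_visits - avg_missing)
print(avg_visits, avg_missing, imputed_cases)
```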
The incidence is then calculated as follows:
Figure BDA0001793283790000131
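Since the formula itself is only available as an image, the following is a hedged sketch of how the final quotient could be assembled from the quantities in this example. It assumes the numerator is the observed plus the imputed new patients, the denominator is the average insured population minus the individuals already affected, and the imputed count takes the proportional form discussed earlier; the function `incidence_per_100k` and the per-100,000 scaling are illustrative choices, not part of the patent.

```python
def incidence_per_100k(new_observed, new_imputed, insured_person_years, already_affected):
    """Incidence per 100,000 person-years under the assumptions stated above."""
    exposed = insured_person_years - already_affected
    if exposed <= 0:
        raise ValueError("exposed population must be positive")
    return (new_observed + new_imputed) / exposed * 100_000


# Quantities quoted in the example; the imputed count uses the assumed formula.
imputed = 417 * 295067 / (5178994 - 295067)
rate = incidence_per_100k(
    new_observed=417,
    new_imputed=imputed,
    insured_person_years=30254740,
    already_affected=3821,
)
print(rate)
```

Whether the imputed count is rounded to whole persons before it is added is not stated in the text, so the sketch leaves it unrounded.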
It is noted that the disclosed embodiments are intended to aid further understanding of the invention, but those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the embodiments disclosed; the scope of the invention is defined by the appended claims.

Claims (6)

1. A method for measuring and calculating disease incidence based on medical insurance big data, characterized in that the disease measured cannot be completely cured and, once diagnosed, persists for life; based on a medical insurance database, the numerator and denominator information required for the incidence calculation is obtained by summarizing a plurality of key parameters of the monthly medical insurance data, and the incidence is then calculated;
the plurality of key parameters includes: the total number of insured individuals per month, the number of newly insured individuals per month, the total number of visit records with a missing diagnosis per month, the number of new cases in a given period, and the number of individuals already affected in a given period;
the numerator required for the incidence calculation is the number of new cases of the target disease occurring in a defined population within a specific time;
the denominator required for the incidence calculation is the exposed population within the specific time, i.e., the people who may develop the target disease; people who are already affected and therefore cannot become new cases within the specific time must be excluded;
the method comprises the following steps:
A1. determining the scope of the medical insurance database, comprising: time span, regional distribution, outpatient/hospitalization;
A2. basic cleaning of a database, defining a target disease and constructing a target disease dictionary library;
the basic cleaning of the database comprises the following steps:
A21) checking the completeness and logical consistency of the variables in the database;
A22) code standardization and natural language processing of text contents in a database;
A23) determining and unifying versions of international disease classification ICD codes in a database;
the definition of the target disease is based on the names or ICD codes of the corresponding disease as they appear in the medical insurance database, covering multiple forms of text data and ICD codes; a dictionary library containing the diagnosis-name expressions of the target disease is constructed by word segmentation;
A3. collecting denominator information; the method is specifically divided into four groups:
denominator information first group: individuals who participate in the insurance but never reimburse;
second group of denominator information: individuals participating in insurance and having reimbursement records but no target disease diagnosis;
denominator information third group: individuals who are insured, have reimbursement records, and have a target disease diagnosis;
denominator information fourth group: individuals already affected by the target disease in a given period;
for each individual and each month, the individual is counted as insured if an insurance record exists for that month, and as uninsured otherwise;
the first group of denominator information is calculated as a person-month sum by formula 1:
Figure FDA0002974087490000011
wherein t denotes the t-th month; insurance_{t,n} is the insured status of the n-th group of individuals in the t-th month; n denotes the person-month sum of the first denominator group;
the second group of denominator information includes three cases;
denominator information second group, first case: months with no visit reimbursement record are included directly in the denominator; for each month, this is the number of people with no visit reimbursement record in that month, m_{1,1};
denominator information second group, second case: months with a visit and a complete diagnosis are included in the denominator calculation; for each month, this is the number of people with a visit and a complete diagnosis in that month, m_{1,2};
denominator information second group, third case: months with a visit but a missing diagnosis must be considered for subsequent imputation; the number of people with a visit but a missing diagnosis in the month, m_{1,3}, is extracted;
the second group of denominator information is calculated as a person-month sum by formula 2:
Figure FDA0002974087490000021
wherein t denotes the t-th month; insurance_{t,m} is the insured status of the m-th group of individuals in the t-th month; m denotes the person-month sum of the second denominator group;
the third group of denominator information includes three cases;
denominator information third group, first case: months with no visit reimbursement record are included directly in the denominator; for each month, this is the number of people with no visit reimbursement record in that month, k_{1,1};
denominator information third group, second case: months with a visit and a complete diagnosis are included in the denominator calculation; for each month, this is the number of people with a visit and a complete diagnosis in that month, k_{1,2};
denominator information third group, third case: people with a visit but a missing diagnosis must be considered for subsequent imputation; for each month, this is the number of people with a visit but a missing diagnosis in that month, k_{1,3};
the third group of denominator information is calculated as a person-month sum by formula 3:
Figure FDA0002974087490000022
wherein t denotes the t-th month; insurance_{t,k} is the insured status of the k-th group of individuals in the t-th month; k denotes the person-month sum of the third denominator group;
the fourth group of denominator information, the total number of individuals already affected, is calculated by formula 4:
Figure FDA0002974087490000023
wherein t_1 denotes a given period; P denotes the total number of individuals already affected in that period;
A4. summarizing numerator information, comprising two groups: new patients, and new patients to be estimated by imputation;
new patients are the new cases of the target disease occurring in a defined population within a given period; new patients are determined as follows: a washout period is selected according to the disease, and patients with no target diagnosis before the specific time for which incidence is calculated are counted as new; all new patients in a given period t_1 are recorded as
Figure FDA0002974087490000024
the new patients to be estimated by imputation are calculated on the assumption that a missing diagnosis in the visit information is not associated with whether the patient has a given rare disease, and are recorded as Σ_t Case;
A5. checking and unifying the basic characteristics of the new patients in the numerator information, to ensure that data from different sources are consistent;
A6. calculating the incidence: the incidence is obtained as the quotient of the summarized numerator information and denominator information;
through the steps, the disease incidence is calculated based on the medical insurance big data.
2. The method for estimating the disease Incidence based on the medical insurance big data as set forth in claim 1, wherein the Incidence index of an observation year is calculated by the following formula 8:
Figure FDA0002974087490000031
wherein NewCase denotes the total number of new patients in the observation year, comprising the new patients observed in the database and the new patients estimated by imputation, and is written Σ NewCase; PersonYear denotes the exposed population over the same period, i.e., the population in the observation area at risk of developing the disease during the observation year, and is written Σ PersonMonth; Σ_t Case is the sum of the new patients to be imputed in each month,
Figure FDA0002974087490000032
t denotes the t-th month, and Case_impute_m denotes the number of target-disease patients estimated each month from the number of people who, among insured individuals with reimbursement records but no target diagnosis, had a visit with a missing diagnosis;
Figure FDA0002974087490000033
denotes the number of newly diagnosed cases of the target disease in a given period t_1;
the denominator, the monthly sum Σ_t PersonMonth, is calculated by the following formula:
Σ_t PersonMonth = Σ_t PersonMonth1 + Σ_t PersonMonth2 + Σ_t PersonMonth3
wherein t denotes the t-th month; Σ_t PersonMonth1 is the person-months contributed by individuals who are insured but have never been reimbursed; Σ_t PersonMonth2 is the person-months contributed by individuals who are insured and have reimbursement records but no target diagnosis; Σ_t PersonMonth3 is the person-months contributed by individuals who are insured, have reimbursement records, and have a target diagnosis;
Figure FDA0002974087490000034
denotes the number of individuals already affected in a given period t_1.
3. The method for measuring and calculating the disease incidence based on the medical insurance big data as claimed in claim 1, wherein the process of constructing the target disease dictionary database in the step A23) specifically comprises the following steps:
A231) firstly, text information containing target diagnosis disease names is extracted from a medical insurance database;
A232) extracting fields related to target diagnosis disease names from the extracted text information by adopting a word segmentation technology;
A233) judging whether the expression of the disease names is correct one by one in a manual mode; forming the correct text expression into a preliminary dictionary library;
A234) and extracting all information containing the target diagnosis disease name from the medical insurance database again according to the primary dictionary database, and repeatedly executing the operations A232) to A233) for identification until the text expression accuracy of the target diagnosis disease name reaches more than 95%, thus constructing and obtaining the target disease dictionary database.
4. The method for estimating disease incidence based on health care big data as claimed in claim 3, wherein the target diagnosis disease name is multiple myeloma.
5. The method for measuring and calculating the disease incidence based on the medical insurance big data as claimed in claim 1, wherein, in summarizing the numerator information, the new patients to be estimated by imputation are calculated by the following formula 7, giving the number of people to be imputed into the numerator:
Figure FDA0002974087490000041
wherein a denotes the number of target-disease patients captured among individuals who are insured, have visits, and whose visit records have no missing diagnosis; b denotes the number of non-target-disease patients captured among individuals who are insured, have visits, and whose visit records have no missing diagnosis; c denotes the number of target-disease patients that could theoretically be captured among individuals who are insured, have visits, and whose visit records have a missing diagnosis; d denotes the number of non-target-disease patients among individuals who are insured, have visits, and whose visit records have a missing diagnosis; c + d is the total number of records with a missing diagnosis.
6. The method for assessing disease incidence based on healthcare big data as claimed in claim 1, wherein the assessed disease comprises multiple myeloma, plasma cell leukemia, plasmacytosis, idiopathic pulmonary interstitial fibrosis, POEMS syndrome.
CN201811045882.8A 2018-09-07 2018-09-07 Analysis method for measuring and calculating rare disease incidence based on medical insurance big data Active CN109448846B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811045882.8A CN109448846B (en) 2018-09-07 2018-09-07 Analysis method for measuring and calculating rare disease incidence based on medical insurance big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811045882.8A CN109448846B (en) 2018-09-07 2018-09-07 Analysis method for measuring and calculating rare disease incidence based on medical insurance big data

Publications (2)

Publication Number Publication Date
CN109448846A CN109448846A (en) 2019-03-08
CN109448846B true CN109448846B (en) 2021-04-30

Family

ID=65530399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811045882.8A Active CN109448846B (en) 2018-09-07 2018-09-07 Analysis method for measuring and calculating rare disease incidence based on medical insurance big data

Country Status (1)

Country Link
CN (1) CN109448846B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112967817B (en) * 2021-02-02 2022-06-10 武汉大学 Epidemiological research population screening method based on medical big data and storage medium
CN116959654A (en) * 2021-04-21 2023-10-27 广州医科大学附属第一医院 Method for establishing diagnosis time-effect-based COVID-19 triage system, system and triage method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105631235A (en) * 2016-03-10 2016-06-01 深圳市前海安测信息技术有限公司 Medical big data based medical insurance actuarial system and medical big data based medical insurance actuarial method
CN105678104A (en) * 2016-04-06 2016-06-15 电子科技大学成都研究院 Method for analyzing health data of old people on basis of Cox regression model
CN108198595B (en) * 2018-01-18 2022-05-03 北京化工大学 Multi-source heterogeneous unstructured medical record data fusion method

Also Published As

Publication number Publication date
CN109448846A (en) 2019-03-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant