CN109448846A

CN109448846A - An analysis method for calculating the incidence of rare diseases based on medical insurance big data

Info

Publication number: CN109448846A
Application number: CN201811045882.8A
Authority: CN
Inventors: 詹思延; 王胜锋; 冯菁楠; 许璐; 高培; 王金喜; 尉晨
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2018-09-07
Filing date: 2018-09-07
Publication date: 2019-03-08
Anticipated expiration: 2038-09-07
Also published as: CN109448846B

Abstract

The invention discloses a method for calculating the incidence of rare diseases based on medical insurance big data, and relates to medical insurance data processing and analysis technology. The diseases to which the calculation applies are those that cannot be completely cured and, once diagnosed, persist for life. By aggregating several key parameters of the monthly medical insurance data, the numerator and denominator information needed for the incidence calculation is obtained, and the incidence is then calculated. The numerator is the number of new cases of the target disease occurring in a defined population within a specific time. The denominator is the exposed population within that time, that is, the people who could develop the target disease; people who are already ill and therefore cannot become new cases within that time must be excluded. The method yields incidence data for rare diseases, promotes rare disease epidemiological research, and provides data and technical support for formulating sound clinical guidelines; it further promotes the translational application of medical insurance big data and fills the gap in rare disease epidemiological data in China.

Description

An analysis method for calculating the incidence of rare diseases based on medical insurance big data

Technical field

The present invention relates to medical insurance data processing and analysis technology, and in particular to an analysis method for calculating the incidence of rare diseases based on medical insurance big data.

Background art

Rare diseases are a class of diseases with very low prevalence and incidence. China currently lacks basic epidemiological information on these diseases, including incidence and prevalence. Medical insurance data (claims data) are generally formed at the administrative level of medical insurance by integrating payment information, and include the basic characteristics, diagnoses, and treatments of insured persons. The data volume is huge, and the data are comprehensive, timely, low in cost, and highly operable. They are also longitudinal data from the real world, which favors rapid and efficient epidemiological research. Using national medical insurance data in particular offers a new approach to the shortage of rare disease epidemiological data in China.

Unlike other epidemiological studies, calculating incidence requires knowing both the number of new cases in a given period and the number of people already ill in that period. Among existing methods abroad for analysing rare disease incidence, Raghu G et al. used U.S. Medicare data to calculate, among other measures, the incidence of idiopathic pulmonary fibrosis during 2001-2011. These incidence calculations, however, are based on individual-level raw data in the insurance records. China's massive medical insurance data differ from foreign insurance data in storage format and volume, in the span of data indicators, and in missing data and individual withdrawal from insurance, so the above methods cannot be applied directly to epidemiological research on China's medical insurance databases. A survey of domestic research using medical insurance data shows that it currently concentrates on detecting fraud through data mining, improving treatment outcomes, and revising or formulating related policies. Medical insurance data are rarely used for epidemiology, and especially rarely for studying the incidence and prevalence of rare diseases, so it is currently difficult to make full use of medical insurance big data to analyse the epidemiological characteristics of rare diseases.

Summary of the invention

To overcome the above deficiencies of the prior art, the present invention provides a new method for calculating incidence based on medical insurance big data. Based on an optimized intermediate data storage format, it aggregates several key parameters of the monthly medical insurance data: the total number of insured individuals per month, the number of newly insured individuals per month, the total number of diagnosis records per month, the total number of diagnosis records with a missing diagnosis per month, the number of new cases in a given period, and the number of existing patients in a given period (the period is determined by the needs of the study and may be one month, two months, one year, and so on). From these, the numerator and denominator information needed for the incidence calculation is obtained and the incidence is calculated. The method involves statistical operations that efficiently compute the numerator and denominator of the incidence from aggregated data. It is a method for calculating rare disease incidence that suits the characteristics of China's medical insurance data and can be used for rare disease epidemiological analysis.

The diseases to which the invention applies must be ones that cannot be completely cured and, once diagnosed, persist for life.

The principle of the invention is as follows. Based on the concept of person-months, the method counts the total number of insured individuals per month, the number of newly insured individuals per month, the total number of diagnosis records per month, and the total number of diagnosis records with a missing diagnosis per month; extracts target patients according to the definition of the target disease; derives the "hidden patients" under the assumption that diagnoses are missing at random; and calculates the incidence using the derived formulas. Diseases that can be handled include multiple myeloma, plasma cell leukemia, plasma cell dyscrasias, idiopathic pulmonary interstitial fibrosis, POEMS syndrome, and all other rare diseases that cannot be completely cured and, once diagnosed, persist for life. The method yields incidence data for rare diseases, provides data and technical support for formulating sound clinical guidelines, and further promotes the translational application of medical insurance big data.

The technical solution provided by the present invention is:

A method for calculating the incidence of rare diseases based on medical insurance big data, in which the diseases to be calculated must be ones that cannot be completely cured. Based on a medical insurance database, several key parameters of the monthly medical insurance data are aggregated (the total number of insured individuals per month, the number of newly insured individuals per month, the total number of diagnosis records per month, the total number of diagnosis records with a missing diagnosis per month, the number of new cases in a given period, and the number of existing patients in a given period) to obtain the numerator and denominator information needed for the incidence calculation, and the incidence is then calculated. The numerator is the number of new cases of the target disease occurring in a defined population within a specific time. The denominator is the exposed population within that time, that is, the people who could develop the target disease, excluding those who are already ill and cannot become new cases again within that time;

The method includes the following steps:

A1. Determine the scope of the medical insurance database (e.g., time span, regional distribution, outpatient/inpatient);

A2. Perform basic cleaning of the database and define the target disease;

Basic cleaning of the database comprises the following steps: (1) verify the completeness and logical consistency of the variables in the database; (2) standardize the coding of text content in the database and apply natural language processing; (3) determine and unify the version of the International Classification of Diseases (ICD) used in the database.

In the present invention, the target disease is defined by the appearance of the corresponding disease name or ICD code in the medical insurance database. The many forms in which the text and the ICD codes may be expressed must be fully considered, and word segmentation is used to build a dictionary that covers the ways of expressing the target diagnosis name as completely as possible.

The dictionary is built as follows:

First, text containing the target diagnosis name (e.g., multiple myeloma) is extracted from the medical insurance database. This text may contain incorrect expressions of the disease name and may also contain other diagnosis names, so it cannot be used directly;

Word segmentation is then applied to the extracted text to pull out the fields that relate to the target diagnosis name (multiple myeloma);

Each of these disease-name expressions is then judged manually, one by one, and the correct expressions form the preliminary dictionary;

Using the preliminary dictionary, all records containing the target diagnosis are extracted again from the medical insurance database and identified again by word segmentation;

This is repeated until the accuracy of the text expressions of the target diagnosis name reaches 95% or more, at which point the dictionary is final. The aim is to capture as many expressions of the target diagnosis name as possible, so that no patients are missed in the subsequent identification.

A3. Aggregate the denominator information;

Individuals are divided into four groups: insured individuals who have never been reimbursed; insured individuals who have reimbursement records but no target disease diagnosis; insured individuals who have reimbursement records and a target disease diagnosis; and individuals already ill with the target disease within the given period. According to each individual's insured status in each month, insured person-months are included and uninsured person-months are removed.

Specifically, a person-month is included if an insured record exists for that month and deleted if none exists.

First denominator group: individuals who are insured but have never been reimbursed. Their person-month total is given by Formula 1:

$$N = \sum_{t} \sum_{n} \mathrm{Insurance}_{t,n} \qquad (1)$$

where $t$ denotes the $t$-th month; $\mathrm{Insurance}_{t,n}$ is the insured status of the $n$-th individual of this group in month $t$; and $N$ is the person-month total of the first denominator group.
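The person-month count for this group can be sketched in a few lines. This is a minimal illustration, assuming insured status is stored as one 0/1 indicator per individual per month; the `insured_status` layout, the sample IDs, and the function name are illustrative assumptions, not part of the method as filed.

```python
def person_months(insured_status):
    """Sum 0/1 insured indicators over all individuals and months.

    insured_status maps an individual ID to a list of monthly indicators
    (1 = insured, 0 = not insured). Uninsured months add nothing to the
    total, which is equivalent to deleting those person-months.
    """
    return sum(sum(months) for months in insured_status.values())


# Hypothetical group of three never-reimbursed individuals over four months.
group1 = {
    "A": [1, 1, 1, 1],
    "B": [0, 1, 1, 0],
    "C": [1, 0, 0, 0],
}
total = person_months(group1)
print(total)  # 7
```

Because uninsured months are stored as 0, no separate filtering step is needed before summing.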

Second group: insured individuals who have reimbursement records but no target diagnosis. Three cases are distinguished;

Case 1: person-months with no medical visit and no reimbursement record are included directly in the denominator. For a given month, this is the number of people with no visit and no reimbursement record that month, $m_{1,1}$;

Case 2: person-months with a medical visit and a complete diagnosis are included in the denominator. For a given month, this is the number of people with a visit and a complete diagnosis that month, $m_{1,2}$;

Case 3: person-months with a medical visit but a missing diagnosis are set aside for subsequent imputation. The number of people with a visit but a missing diagnosis that month, $m_{1,3}$, is extracted.

Taking each month as an example, the person-month total of the second denominator group is given by Formula 2:

$$M = \sum_{t} \sum_{m} \mathrm{Insurance}_{t,m} \qquad (2)$$

where $t$ denotes the $t$-th month; $\mathrm{Insurance}_{t,m}$ is the insured status of the $m$-th individual of this group in month $t$; and $M$ is the person-month total of the second denominator group.
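One way to picture the three cases is as a per-month classification of each person-month in this group. The sketch below is an illustration only: the record layout `(insured, visited, diagnosis)` and the labels are hypothetical, and a diagnosis is treated as missing when it is empty or absent.

```python
from collections import Counter


def classify_person_month(insured, visited, diagnosis):
    """Return the case label for one person-month of a second-group individual.

    insured:   1 if insured that month, else 0
    visited:   True if the person had a reimbursed visit that month
    diagnosis: diagnosis text for the visit, or None / "" when missing
    """
    if not insured:
        return "dropped"   # uninsured person-months are removed
    if not visited:
        return "m_1_1"     # no visit, no reimbursement record
    if diagnosis:
        return "m_1_2"     # visit with a complete diagnosis
    return "m_1_3"         # visit with a missing diagnosis


# Hypothetical records for one month: (insured, visited, diagnosis)
month_records = [
    (1, False, None),
    (1, True, "hypertension"),
    (1, True, ""),
    (0, False, None),
    (1, True, None),
]
counts = Counter(classify_person_month(*r) for r in month_records)
print(counts["m_1_1"], counts["m_1_2"], counts["m_1_3"])  # 1 1 2
```

Only the `m_1_3` count feeds the later imputation step; the other two counts enter the denominator directly.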

Third group: insured individuals who have reimbursement records and a target diagnosis. Three cases are distinguished;

Case 1: person-months with no medical visit and no reimbursement record are included directly in the denominator. For a given month, this is the number of people with no visit and no reimbursement record that month, $k_{1,1}$;

Case 2: person-months with a medical visit and a complete diagnosis are included in the denominator. For a given month, this is the number of people with a visit and a complete diagnosis that month, $k_{1,2}$;

Case 3: person-months with a medical visit but a missing diagnosis are set aside for subsequent imputation. For a given month, this is the number of people with a visit but a missing diagnosis that month, $k_{1,3}$.

Taking each month as an example, the person-month total of the third denominator group is given by Formula 3:

$$K = \sum_{t} \sum_{k} \mathrm{Insurance}_{t,k} \qquad (3)$$

where $t$ denotes the $t$-th month; $\mathrm{Insurance}_{t,k}$ is the insured status of the $k$-th individual of this group in month $t$; and $K$ is the person-month total of the third denominator group.

Fourth group: individuals who are already ill;

The total for the fourth group is given by Formula 4:

where $t_1$ denotes the given period and $P$ is the total number of existing patients in that period.

A4. Aggregate the numerator information, which comprises two groups;

For the target disease, the corresponding numerator information is extracted in two groups: new patients, and new patients that must be estimated by imputation. New patients are the new cases of the target disease occurring in a defined population within a given period. The second group is estimated under the assumption that whether the diagnosis in a visit record is missing has no statistically significant association with whether the person has the rare disease.

First numerator group: new patients

All new patients within the given period (for example, per month or per year) are included, denoted $\sum_{t_1} \mathrm{Case\_new}$, where $t_1$ denotes the given period and $\mathrm{Case\_new}$ is the number of people newly diagnosed with the target disease in that period. A new patient is one for whom the target diagnosis did not appear before the specific time for which the study calculates incidence; different washout periods are chosen for different diseases. For example, to calculate the incidence of a rare disease in a given year, a patient with no target diagnosis in the database in the time before that year is judged to be a new patient.
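The washout rule can be illustrated with a short check on one person's diagnosis dates. This is a sketch under stated assumptions: the function name, the date-list input, and the sample dates are hypothetical, and the washout window is taken to run from `washout_start` up to (but not including) the start of the study period.

```python
from datetime import date


def is_new_patient(diagnosis_dates, period_start, period_end, washout_start):
    """Return True if the person counts as a new patient for the study period.

    diagnosis_dates: dates on which the target diagnosis appears for one person.
    A new patient has a target diagnosis inside [period_start, period_end]
    and none inside the washout window [washout_start, period_start).
    """
    in_period = any(period_start <= d <= period_end for d in diagnosis_dates)
    in_washout = any(washout_start <= d < period_start for d in diagnosis_dates)
    return in_period and not in_washout


# Hypothetical study year with a four-year washout window before it.
period_start, period_end = date(2016, 1, 1), date(2016, 12, 31)
washout_start = date(2012, 1, 1)

first_seen_in_period = [date(2016, 3, 14)]
seen_in_washout = [date(2014, 5, 2), date(2016, 2, 9)]

print(is_new_patient(first_seen_in_period, period_start, period_end, washout_start))  # True
print(is_new_patient(seen_in_washout, period_start, period_end, washout_start))       # False
```

A longer washout window lowers the chance that an existing patient is misclassified as new, at the cost of requiring more years of data.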

Second numerator group: new patients that must be estimated by imputation

A5. Verify and unify the basic characteristics of the new patients in the numerator, such as age, sex, nationality, and household registration

Medical insurance data are divided into three tables: the "insured person information table", the "general outpatient (emergency) visit expense and settlement information table", and the "outpatient serious illness, outpatient pooling, inpatient, and home sickbed expense and settlement information table". Using the variables that link the tables, age, sex, nationality, household registration, and similar variables must be verified and unified across tables, so that each linking variable corresponds to a unique identification ID (such as the identity card number) and the age, sex, nationality, and household registration attached to each unique ID are internally consistent.
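A cross-table consistency check of this kind might look like the following. It is a minimal sketch, assuming each table is a list of dictionaries keyed by a shared `id`; the field names, table names, and sample rows are hypothetical.

```python
def find_inconsistent_ids(tables, fields=("sex", "birth_year")):
    """Return the IDs whose value for any checked field differs between rows.

    tables: list of tables, each a list of dict rows with an "id" key and,
    optionally, the checked fields.
    """
    seen = {}             # id -> {field: first value seen}
    inconsistent = set()
    for table in tables:
        for row in table:
            known = seen.setdefault(row["id"], {})
            for field in fields:
                if field not in row:
                    continue
                if field in known and known[field] != row[field]:
                    inconsistent.add(row["id"])
                known.setdefault(field, row[field])
    return inconsistent


# Hypothetical rows from the three tables.
insured_info = [
    {"id": "001", "sex": "F", "birth_year": 1950},
    {"id": "002", "sex": "M", "birth_year": 1962},
]
outpatient = [
    {"id": "001", "sex": "F", "birth_year": 1950},
    {"id": "002", "sex": "F", "birth_year": 1962},  # sex conflicts with insured_info
]
inpatient = [{"id": "001", "sex": "F", "birth_year": 1950}]

conflicts = find_inconsistent_ids([insured_info, outpatient, inpatient])
print(conflicts)  # {'002'}
```

IDs flagged this way would then need to be resolved, for example against the insured person information table, before the numerator is tabulated by demographic group.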

A6. Calculate the incidence: the aggregated numerator information is divided by the aggregated denominator information.

The calculation formula for the incidence (Incidence), taking a one-year incidence as an example:

where $\sum \mathrm{NewCase}$ denotes the number of new patients in the observation year, comprising the new patients observed in the database and the new patients estimated by imputation; PersonYear denotes the exposed population in the same period, that is, the people in the observation area who could develop the disease in the observation year, and is expressed through $\sum \mathrm{PersonMonth}$; $\sum_t \mathrm{Case}_{impute_m}$ is the sum over months of the new patients to be estimated by imputation, where $t$ denotes the $t$-th month and $\mathrm{Case}_{impute_m}$ is, for each month, the number of target patients estimated from the number of people with a visit but a missing diagnosis among insured individuals who have reimbursement records and no target diagnosis; and $\sum_{t_1} \mathrm{Case\_new}$ is the number of people newly diagnosed with the target disease in the given period, where $t_1$ denotes that period.

The monthly denominator total $\sum_t \mathrm{PersonMonth}$ is calculated as:

$$\sum_t \mathrm{PersonMonth} = \sum_t \mathrm{PersonMonth1} + \sum_t \mathrm{PersonMonth2} + \sum_t \mathrm{PersonMonth3}$$

where $t$ denotes the $t$-th month; $\sum_t \mathrm{PersonMonth1}$ is the person-months contributed by individuals who are insured but have never been reimbursed; $\sum_t \mathrm{PersonMonth2}$ is the person-months contributed by insured individuals who have reimbursement records but no target diagnosis; $\sum_t \mathrm{PersonMonth3}$ is the person-months contributed by insured individuals who have reimbursement records and a target diagnosis; and the fourth-group total (Formula 4) denotes the individuals already ill in the given period $t_1$.
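Read together, the definitions above suggest the following overall form for a one-year incidence. This is a hedged reconstruction from the surrounding text, not the formula as filed: dividing the person-month total by 12 to obtain person-years is an assumption that follows the 12-month averaging used in the worked example, and subtracting the fourth-group total $P$ of Formula 4 is an assumption based on the requirement that individuals already ill be removed from the denominator.

```latex
\mathrm{Incidence}
  = \frac{\sum \mathrm{NewCase}}{\mathrm{PersonYear}}
  = \frac{\sum_{t_1} \mathrm{Case\_new} + \sum_{t} \mathrm{Case}_{impute_m}}
         {\frac{1}{12} \sum_{t} \mathrm{PersonMonth} - P}
```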

The beneficial effects of the present invention are:

The rare disease incidence calculation method based on medical insurance big data provided by the invention involves statistical operations that efficiently compute the numerator and denominator of the incidence from aggregated data. On the one hand, it can yield the incidence and disease burden of rare diseases in China (disease burden being the pressure that disease, disability, and premature death place on the economy and health of society as a whole), providing data and technical support for formulating sound clinical guidelines. On the other hand, it offers a new way of calculating incidence from medical insurance data, which can promote the translational application of medical insurance big data and fill the gap in incidence data on rare diseases in China. The method suits the characteristics of China's medical insurance data and can be used for rare disease epidemiological analysis.

Description of the drawings

Fig. 1 is a flow diagram of the incidence calculation method provided by the invention.

Specific embodiment

The present invention is further described below through embodiments with reference to the accompanying drawing, without limiting the scope of the invention in any way.

The present invention provides a new method for calculating incidence based on medical insurance big data. Based on an optimized intermediate data storage format, it aggregates several key parameters of the monthly medical insurance data to obtain the numerator and denominator information needed for the incidence calculation, and then calculates the incidence. The numerator is the number of new cases of the target disease occurring in a defined population within a specific time. The denominator is the exposed population within that time, that is, the people who could develop the target disease, excluding those who are already ill and cannot become new cases again within that time.

Fig. 1 shows the flow of the incidence calculation method provided by the invention. A specific embodiment is as follows:

A1. Determine the scope of the database (e.g., time span, regional distribution, outpatient/inpatient);

A2. Perform basic cleaning of the database and define the target disease;

In the present invention, the target disease is defined by the appearance of the corresponding disease name or ICD code in the medical insurance database. The many forms in which the text and the ICD codes may be expressed must be fully considered, and a dictionary that is as comprehensive as possible is built.

A3. Aggregate the denominator information for the incidence;

The denominator of the incidence is divided into four groups

First group: individuals who are insured but have never been reimbursed

These individuals have never had a medical visit for illness; they have only insured records and no reimbursement records, and serve only as denominator in the incidence calculation. Within the observation time, the insured status of each observed individual in each month is recorded (1 = insured, 0 = not insured); uninsured person-months are then removed, and the insured person-months are summed and placed in the denominator. Taking each month as an example, the person-month total of the first denominator group is given by Formula 1:

$$N = \sum_{t} \sum_{n} \mathrm{Insurance}_{t,n} \qquad (1)$$

where $t$ denotes the $t$-th month; $\mathrm{Insurance}_{t,n}$ is the insured status of the $n$-th individual of this group in month $t$; and $N$ is the person-month total of the first denominator group.

Second group: insured individuals who have reimbursement records but no target diagnosis

These individuals have had medical visits for illness but no target diagnosis; they have both insured records and reimbursement records, and likewise serve only as denominator in the incidence calculation. Within the observation time, the insured status of each observed individual in each month is recorded (1 = insured, 0 = not insured), and uninsured person-months are again removed. The insured person-months, however, cannot be placed directly in the denominator; they are divided into three cases according to diagnosis status:

Case 1: person-months with no medical visit and no reimbursement record are included directly in the denominator (see Fig. 1). For a given month, this is the number of people with no visit and no reimbursement record that month, $m_{1,1}$;

Case 2: person-months with a medical visit and a complete diagnosis are included in the denominator (see Fig. 1). For a given month, this is the number of people with a visit and a complete diagnosis that month, $m_{1,2}$;

Case 3: person-months with a medical visit but a missing diagnosis are set aside for subsequent imputation (see Fig. 1). The number of people with a visit but a missing diagnosis that month, $m_{1,3}$, is extracted.

Third group: insured individuals who have reimbursement records and a target diagnosis

These individuals have had medical visits for illness and have a target diagnosis; they have both insured records and reimbursement records, and serve as both numerator and denominator when prevalence and incidence are calculated. For the denominator, the insured status of each observed individual in each month within the observation time is recorded (1 = insured, 0 = not insured), and uninsured person-months are again removed (see Fig. 1). The insured person-months still cannot be placed directly in the denominator; they are divided into three cases according to diagnosis status:

Case 1: person-months with no medical visit and no reimbursement record are included directly in the denominator (see Fig. 1). For a given month, this is the number of people with no visit and no reimbursement record that month, $k_{1,1}$;

Case 2: person-months with a medical visit and a complete diagnosis are included in the denominator (see Fig. 1). For a given month, this is the number of people with a visit and a complete diagnosis that month, $k_{1,2}$;

Case 3: person-months with a medical visit but a missing diagnosis are set aside for subsequent imputation (see Fig. 1). For a given month, this is the number of people with a visit but a missing diagnosis that month, $k_{1,3}$.

Fourth group: individuals who are already ill

Because the diseases the invention can handle persist for life once diagnosed, the denominator should be the population that could develop the target disease, that is, the exposed population. Individuals already ill in the given period must therefore be removed when the denominator is calculated. The fourth-group total is given by Formula 4.

A4. Aggregate the numerator information;

Once the target disease has been defined, the corresponding numerator information is extracted in two groups:

First numerator group: new patients

All new patients within the given period are included, denoted $\sum_t \mathrm{Case\_new}$, where $t$ denotes the specific period and $\mathrm{Case\_new}$ is the number of people newly diagnosed with the target disease in that period. A new patient is one for whom the target diagnosis did not appear before the specific time for which the study calculates incidence; different washout periods are chosen for different diseases. For example, to calculate the incidence of a rare disease in a given year, a patient with no target diagnosis in the database in the time before that year is judged to be a new patient.

Some diagnosis records have a missing diagnosis. These comprise insured individuals without the target disease who had a visit with a missing diagnosis, and insured target disease patients who had a visit with a missing diagnosis, i.e., $m_{1,3}$ and $k_{1,3}$. Because incidence requires that the patient be new, that is, that the target diagnosis appear for the first time, patients in $k_{1,3}$ have a missing diagnosis in that month but were already diagnosed as target patients earlier, so they are not regarded as new patients and are not counted in the numerator. The part that needs imputation here is therefore $m_{1,3}$; these records should not simply be discarded, as shown in Table 1.

Table 1. Schematic of numerator imputation in the incidence calculation

The incidence under ideal conditions is given by Formula 5:

$$\mathrm{Incidence}_{\mathrm{ideal}} = \frac{a + c}{a + b + c + d + e} \qquad (5)$$

where $a$ is the number of target disease patients that can be identified among insured individuals who had a visit and whose diagnosis is not missing; $b$ is the number of non-target-disease patients that can be identified among the same individuals; $c$ is the number of target disease patients that could in theory be identified among insured individuals who had a visit but whose diagnosis is missing; $d$ is the number of non-target-disease patients among those individuals with a missing diagnosis; and $e$ is the number of insured individuals who never had a visit.

If the missing-diagnosis records are simply discarded, the incidence becomes Formula 6:

$$\mathrm{Incidence}_{\mathrm{discard}} = \frac{a}{a + b + e} \qquad (6)$$

The incidence under ideal conditions and the incidence after discarding clearly differ. The missing-diagnosis portion must therefore be estimated under a suitable assumption in order to obtain $c$ and $d$. The assumption adopted by the invention is that whether the diagnosis in a visit record is missing has no statistically significant association with whether the person has the rare disease, i.e., $\frac{a}{a+b} = \frac{c}{c+d}$. If this assumption holds, $c$ is given by Formula 7:

$$c = \frac{a}{a + b}\,(c + d) \qquad (7)$$

where $c + d$ is the total number of records with a missing diagnosis, which can be counted directly.

The number to be imputed into the numerator is calculated from Formula 7. Note that the missing-diagnosis portion for which new patients must be estimated is $m_{1,3}$: the number of people, among insured individuals who have reimbursement records and no target diagnosis, who had a visit but a missing diagnosis that month. Imputing under the above assumption gives $\mathrm{Case}_{impute_m} = c$, and the total number of new target disease patients after imputation is $\sum_{t_1} \mathrm{Case\_new} + \sum_t \mathrm{Case}_{impute_m}$, where $t$ denotes the $t$-th month and $\mathrm{Case}_{impute_m}$ is, for each month, the number of new target patients estimated from the number of people with a visit but a missing diagnosis that month among insured individuals who have reimbursement records and no target diagnosis.
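The imputation step can be illustrated numerically. The sketch below applies the stated assumption that diagnosis missingness is unrelated to disease status, so the share of target cases among complete records is carried over to the missing records. The function name and all counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def impute_missing_cases(a, b, missing_total):
    """Estimate c, the target-disease cases hidden among missing-diagnosis records.

    a: target-disease cases among records with a complete diagnosis
    b: non-target cases among records with a complete diagnosis
    missing_total: c + d, the number of records with a missing diagnosis
    Assumes a / (a + b) == c / (c + d).
    """
    if a + b == 0:
        raise ValueError("no complete-diagnosis records to estimate from")
    return a * missing_total / (a + b)


# Hypothetical counts for one month.
a, b, e = 40, 99_960, 900_000
missing_total = 5_000

c = impute_missing_cases(a, b, missing_total)
d = missing_total - c
print(c)  # 2.0

# Compare the imputed incidence with the one obtained by discarding
# the missing-diagnosis records.
incidence_imputed = (a + c) / (a + b + c + d + e)
incidence_discard = a / (a + b + e)
print(incidence_imputed, incidence_discard)
```

With these sample counts the two estimates differ, which is the reason the missing-diagnosis records are imputed rather than dropped.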

A5. Verify and unify the basic characteristics of the patients in the numerator

A6. Calculate the incidence

Incidence calculation formula (in person-years)

where $\sum \mathrm{NewCase}$ denotes the number of new patients in the observation year, comprising the new patients observed in the database and the new patients estimated by imputation; PersonYear denotes the exposed population in the same period, that is, the people in the observation area who could develop the disease in the observation year; $\sum_t \mathrm{Case}_{impute_m}$ is the sum over months of the new patients to be estimated by imputation, where $t$ denotes the $t$-th month and $\mathrm{Case}_{impute_m}$ is, for each month, the number of target patients estimated from the number of people with a visit but a missing diagnosis among insured individuals who have reimbursement records and no target diagnosis; and $\sum_{t_1} \mathrm{Case\_new}$ is the number of people newly diagnosed with the target disease in the given period, where $t_1$ denotes that period.

$$\sum_t \mathrm{PersonMonth} = \sum_t \mathrm{PersonMonth1} + \sum_t \mathrm{PersonMonth2} + \sum_t \mathrm{PersonMonth3}$$

The present invention is further described below by way of an example.

This example uses the urban employee insurance database and the urban resident insurance database of a certain province for 2012-2016. A 4-year washout period is selected, and the prevalence of multiple myeloma in the province in 2016 is then calculated from the medical insurance database. In 2016 the database contained urban employee insured persons (217,342,112 people) and urban resident insured persons (145,714,765 people).

After basic data cleaning is complete (for example, handling missing or abnormal reimbursement date and visit date variables), the clinical diagnosis of multiple myeloma is expressed through a combination of text and ICD codes, as listed in Table 2:

Table 2. Diagnosis descriptions and ICD codes for multiple myeloma

The database contains 6 fields with diagnostic information: primary diagnosis name, primary diagnosis code, first secondary diagnosis name, first secondary diagnosis code, second secondary diagnosis name, and second secondary diagnosis code. The working definition is therefore specified according to the field structure of the database, as follows:

Values that the above fields must include (an "or" relationship between values): myeloma, Kahler, bone marrow cancer / bone marrow ca, myelopathy, C90, M9732, 203.0;

Values that the above fields must exclude (an "or" relationship between values): plasma cell, solitary, C90.1, C90.2

When capturing numerator cases, the search runs over all six fields: primary diagnosis, first secondary diagnosis, second secondary diagnosis, primary diagnosis code, first secondary diagnosis code, and second secondary diagnosis code. Across the six fields, at least one "must include" value has to be present and no "must exclude" value may be present.
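The include/exclude rule above can be sketched as a simple substring filter. This is a minimal illustration, not the authors' implementation: the keyword lists are English renderings assumed for readability (the real database values would be in the source language), and `is_target_record` is a hypothetical helper name.

```python
# Assumed English renderings of the "must include" and "must exclude" values.
INCLUDE = ["myeloma", "kahler", "bone marrow cancer", "bone marrow ca",
           "myelopathy", "c90", "m9732", "203.0"]
EXCLUDE = ["plasma cell", "solitary", "c90.1", "c90.2"]


def is_target_record(fields):
    """Return True if a record counts as a target-disease record.

    fields: the six diagnosis name/code strings of one record
    (None or "" for an empty field).
    """
    values = [f.lower() for f in fields if f]
    # At least one "must include" value must appear in some field ...
    has_include = any(key in value for value in values for key in INCLUDE)
    # ... and no "must exclude" value may appear in any field.
    has_exclude = any(key in value for value in values for key in EXCLUDE)
    return has_include and not has_exclude


print(is_target_record(["multiple myeloma", "C90.0", None, None, None, None]))
print(is_target_record(["plasma cell leukemia", "C90.1", None, None, None, None]))
```

Note that "C90" also matches "C90.1" and "C90.2", so the exclusion check is what removes those subcategories.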

Denominator information is then summarized. The monthly distribution of the total number of insured individuals, the total number of visit records, and the total number of visit records with a missing diagnosis is shown in Table 3:

Table 3. Summary of the medical insurance data parameters required for numerator and denominator calculation

Note that the monthly total of insured individuals = individuals insured but never reimbursed + individuals insured with reimbursement records but without the target diagnosis + individuals insured with reimbursement records and with the target diagnosis. Therefore only the key parameter of the medical insurance data, the monthly total of insured individuals, is listed here.

I. Calculation of the denominator

The denominator calculation has two parts: the average number of insured individuals in 2016 and the number of individuals already affected before 2016.

First part: the average number of insured individuals in 2016 is taken as the mean over the 12 months: (30152092+28539556+30571984+30615202+28779370+30912654+31344596+28196530+30464440+30734528+32040068+30705856)/12 ≈ 30254740
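The 12-month average can be checked with a few lines of Python. The monthly figures are those given in the text; the variable names and the rounding to the nearest integer are assumptions for illustration.

```python
# Monthly totals of insured individuals in 2016, as listed in the text.
monthly_insured = [
    30152092, 28539556, 30571984, 30615202, 28779370, 30912654,
    31344596, 28196530, 30464440, 30734528, 32040068, 30705856,
]

# Mean over the 12 months, rounded to a whole number of individuals.
average_insured = round(sum(monthly_insured) / len(monthly_insured))
print(average_insured)  # 30254740
```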

Second part: the number of individuals already affected before 2016 is obtained as follows. The province's 2012-2016 database is split into two parts, 2012-2015 and 2016, each arranged so that every individual has only one diagnosis record. The two databases are then merged horizontally with software. The matched portion consists of individuals who have visit records in 2016 and were diagnosed with multiple myeloma before 2016. By this method, the number of individuals in the province already affected before 2016 is 3821.

II. Calculation of the numerator

The numerator calculation has two parts: the number of new patients and the number of patients to be imputed.

First part: the number of new patients is obtained by the same method as the number of already affected individuals. The province's 2012-2016 database is split into two parts, 2012-2015 and 2016, each arranged so that every individual has only one diagnosis record, and the two databases are merged horizontally with software. In the merge result, the portion that appears only in the 2016 database consists of the new multiple myeloma patients in 2016. By this method, the number of new patients in the province in 2016 is 417.
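The split-and-merge step can be sketched with Python sets. This is an illustration under assumptions: the patient identifiers are hypothetical, `split_2016_patients` is an invented helper name, and the text does not say which software performed the merge.

```python
def split_2016_patients(ids_2012_2015, ids_2016):
    """Separate 2016 target-disease patients into prior and new cases.

    Both arguments are iterables of patient identifiers taken from
    target-disease records; converting to sets keeps one entry per person.
    """
    before = set(ids_2012_2015)
    current = set(ids_2016)
    prevalent = current & before  # matched: diagnosed before 2016, seen in 2016
    incident = current - before   # only in the 2016 part: new patients
    return prevalent, incident


# Hypothetical identifiers, with duplicates to show deduplication.
prevalent, incident = split_2016_patients(
    ["p1", "p2", "p3", "p2"],
    ["p2", "p4", "p4", "p5"],
)
print(sorted(prevalent), sorted(incident))
```

In the example from the text, the two resulting counts are 3821 and 417 respectively.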

Second part: the number of patients to be imputed is calculated as follows (in this example it is computed on a yearly basis). First, the 2016 monthly average number of visit records is calculated:

(5266439+4685125+5446965+4767414+5356961+4962207+5627907+4545582+4700331+5365805+5684190+5739000)/12 ≈ 5178994

The 2016 monthly average number of visit records with a missing diagnosis is:

(331479+314970+333140+315726+300708+271454+274475+282899+273228+284421+277799+280500)/12 ≈ 295067

Then, given the 417 new patients in 2016, the number of patients to be imputed is: (417*295067)/(5178994-295067) ≈ 25
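The imputation arithmetic can be reproduced as follows. The monthly figures and the 417 observed new patients come from the text; the variable names and the rounding to whole numbers are assumptions for illustration.

```python
# Monthly visit records and visit records with a missing diagnosis (2016).
monthly_visits = [
    5266439, 4685125, 5446965, 4767414, 5356961, 4962207,
    5627907, 4545582, 4700331, 5365805, 5684190, 5739000,
]
monthly_missing = [
    331479, 314970, 333140, 315726, 300708, 271454,
    274475, 282899, 273228, 284421, 277799, 280500,
]

avg_visits = round(sum(monthly_visits) / 12)    # 5178994
avg_missing = round(sum(monthly_missing) / 12)  # 295067

observed_new = 417  # new patients found in records with a diagnosis

# Scale the observed count by the ratio of records with a missing
# diagnosis to records with a complete diagnosis.
imputed_new = round(observed_new * avg_missing / (avg_visits - avg_missing))
print(imputed_new)  # 25
```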

The incidence is then calculated as follows:
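A hedged sketch of this final step, combining the figures derived above: the numerator is taken as observed plus imputed new patients, and the denominator as the average insured population minus those already affected before 2016. Expressing the result per 100,000 persons is an assumption made here for readability; the scaling used in the original formula is not reproduced in this text.

```python
new_observed = 417          # new patients observed in 2016
new_imputed = 25            # new patients imputed for missing diagnoses
average_insured = 30254740  # average insured individuals in 2016
prevalent_before = 3821     # individuals already affected before 2016

numerator = new_observed + new_imputed
denominator = average_insured - prevalent_before  # population still at risk

# Assumed scaling: cases per 100,000 persons.
incidence_per_100k = numerator / denominator * 100_000
print(round(incidence_per_100k, 2))
```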

It should be noted that the embodiments are disclosed to aid further understanding of the present invention. Those skilled in the art will understand that various substitutions and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. Therefore, the present invention should not be limited to what the embodiments disclose, and the scope of protection claimed is that defined by the claims.

Claims

1. An analysis method for calculating the incidence of a rare disease based on medical insurance big data, characterized in that the disease addressed cannot be completely cured and, once diagnosed, persists for life; based on a medical insurance database, multiple key parameters of the monthly medical insurance data are summarized to obtain the numerator and denominator information required for the incidence calculation, and the incidence is then calculated;

The multiple key parameters include: monthly total of insured individuals, monthly number of newly insured individuals, monthly total of visit records, monthly total of visit records with a missing diagnosis, number of new cases within a given period, and number of existing patients within a given period;

The numerator required for the incidence calculation is the number of new cases of the target disease occurring in a defined population within a specific time;

The denominator required for the incidence calculation is the exposed population within the specific time, i.e., the people who could develop the target disease, excluding those who are already affected and therefore cannot become new cases again within the specific time;

The method includes the following steps:

A1. Determine the scope of the medical insurance database, including: time span, geographic coverage, and outpatient/inpatient setting;

A2. Perform basic cleaning of the database, define the target disease, and construct a target disease dictionary;

Basic cleaning of the database includes:

A21) verifying the completeness and logical consistency of the variables in the database;

A22) standardizing the encoding of text content in the database and applying natural language processing;

A23) determining and unifying the version of the International Classification of Diseases (ICD) codes in the database;

The target disease is defined according to the disease names or ICD codes that actually appear in the medical insurance database, covering the various forms of text data and ICD codes; a dictionary containing the expressions of the target disease diagnosis name is built using word segmentation techniques;

A3. Summarize denominator information, specifically divided into four groups:

First group of denominator information: individuals insured but never reimbursed;

Second group of denominator information: individuals insured with reimbursement records but without a target disease diagnosis;

Third group of denominator information: individuals insured with reimbursement records and with a target disease diagnosis;

Fourth group of denominator information: individuals already affected by the target disease within a given time;

According to each individual's insurance status in each month, a month with an insurance record is counted as an insured person-month; otherwise it is an uninsured person-month and is excluded;

The first group of denominator information is calculated as the sum of person-months by Formula 1:

where t denotes the t-th month; Insurance_{t,n} is the insurance status of the n-th individual of the group in the t-th month; N represents the person-month sum of the first denominator group;

The second group of denominator information covers three cases;

Case 1 of the second group: person-months with no visit for illness and no reimbursement record are included directly in the denominator; for each month,

i.e., the number of individuals in that month with no visit for illness and no reimbursement record, m_{1,1};

Case 2 of the second group: person-months with a visit for illness and a complete diagnosis are included in the denominator calculation; for each month,

i.e., the number of individuals in that month with a visit for illness and a complete diagnosis, m_{1,2};

Case 3 of the second group: person-months with a visit but a missing diagnosis are reserved for subsequent imputation; the number of individuals in that month with a visit for illness but a missing diagnosis, m_{1,3}, is extracted.

The second group of denominator information is calculated as the sum of person-months by Formula 2:

where t denotes the t-th month; Insurance_{t,m} is the insurance status of the m-th individual of the group in the t-th month; M represents the person-month sum of the second denominator group;

The third group of denominator information covers three cases;

Case 1 of the third group: person-months with no visit for illness and no reimbursement record are included directly in the denominator; for each month,

i.e., the number of individuals in that month with no visit for illness and no reimbursement record, k_{1,1};

Case 2 of the third group: person-months with a visit for illness and a complete diagnosis are included in the denominator calculation; for each month,

i.e., the number of individuals in that month with a visit for illness and a complete diagnosis, k_{1,2};

Case 3 of the third group: person-months with a visit but a missing diagnosis are reserved for subsequent imputation; for each month,

i.e., the number of individuals in that month with a visit for illness but a missing diagnosis, k_{1,3};

The third group of denominator information is calculated as the sum of person-months by Formula 3:

where t denotes the t-th month; Insurance_{t,k} is the insurance status of the k-th individual of the group in the t-th month; K represents the person-month sum of the third denominator group;

The fourth group of denominator information, the total number of already affected individuals, is calculated by Formula 4:

where t₁ denotes the given period; P represents the total number of patients within that period;

A4. Summarize numerator information, comprising two groups: new patients and new patients to be imputed;

New patients are the new cases of the target disease occurring in a defined population within a given period. They are identified as follows: a washout period is selected according to the disease, and patients with no target diagnosis before the specific time for which incidence is calculated are counted; all new patients within the given period t₁ are denoted by a term indexed by t₁. The new patients to be imputed are estimated on the premise that there is no statistically significant association between a missing diagnosis in the visit information and whether the individual has a given rare disease; they are denoted ∑_tCase;

A5. Verify and unify the basic characteristics of the new patients in the numerator information, so that data from different sources are consistent;

A6. Calculate the incidence: divide the summarized numerator information by the denominator information to obtain the incidence.

Through the above steps, the incidence is calculated based on medical insurance big data.

2. The method for calculating incidence based on medical insurance big data according to claim 1, characterized in that the one-year incidence, Incidence, is calculated by Formula 8:

where New Case denotes the number of new patients in the observation year, comprising the new patients observed in the database plus the imputed new patients, written ∑NewCase; PersonYear denotes the exposed population over the same period, i.e., the population in the observation area who could develop the disease during the observation year, written ∑PersonMonth; ∑_tCase is the sum over months of the new patients to be imputed, ∑_tCase=∑_tCase_{impute_m}, where t denotes the t-th month and Case_{impute_m} denotes, for each month, the number of target patients estimated from the number of individuals with a visit but a missing diagnosis, among insured individuals who have reimbursement records but no target diagnosis; the term indexed by t₁ denotes the number of individuals newly diagnosed with the target disease within the given period t₁;

∑_tPersonMonth=∑_tPersonMonth1+∑_tPersonMonth2+∑_tPersonMonth3

where t denotes the t-th month; ∑_tPersonMonth1 is the person-months contributed by individuals insured but never reimbursed; ∑_tPersonMonth2 is the person-months contributed by insured individuals with reimbursement records but without the target diagnosis; ∑_tPersonMonth3 is the person-months contributed by insured individuals with reimbursement records and with the target diagnosis; the term indexed by t₁ denotes the number of individuals already affected within the given period t₁.

3. The method for calculating incidence based on medical insurance big data according to claim 1, characterized in that the process of building the target disease dictionary in step A23) specifically includes:

A231) first extracting from the medical insurance database the text information containing the target diagnosis disease name;

A232) using word segmentation techniques to extract, from the text information obtained, the fields that involve the target diagnosis disease name;

A233) manually judging, one by one, whether each disease name expression is correct; the correct text expressions form a preliminary dictionary;

A234) extracting again from the medical insurance database, according to the preliminary dictionary, all information containing the target diagnosis disease name, then repeating operations A232) to A233) until the accuracy of the text expressions of the target diagnosis disease name reaches 95% or more, at which point the target disease dictionary is obtained.

4. The method for calculating the incidence of a rare disease based on medical insurance big data according to claim 3, characterized in that the target diagnosis disease name is multiple myeloma.

5. The method for calculating incidence based on medical insurance big data according to claim 1, characterized in that, in summarizing numerator information, the number of new patients to be imputed in the numerator is calculated by Formula 7:

where a represents the number of target disease patients that can be captured among insured individuals who have visits and whose visit records have a diagnosis; b represents the number of non-target disease patients that can be captured among insured individuals who have visits and whose visit records have a diagnosis; c represents the number of target disease patients that could in theory be captured among insured individuals who have visits and whose visit records lack a diagnosis; d represents the number of non-target disease patients among insured individuals who have visits and whose visit records lack a diagnosis; c+d is the total number of records with a missing diagnosis.
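Formula 7 itself is not reproduced in this text. One reading that is consistent with the definitions of a, b, c, and d and with the worked example in the description, offered as an assumption rather than as the claimed formula, is that target patients make up the same share of records with and without a diagnosis:

```latex
\frac{a}{a+b} = \frac{c}{c+d}
\quad\Longrightarrow\quad
c = \frac{a\,(c+d)}{a+b}
```

With a = 417, c+d = 295067, and a+b = 5178994-295067 = 4883927, this gives c ≈ 25, which matches the number of imputed patients in the example.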

6. The method for calculating incidence based on medical insurance big data according to claim 1, characterized in that the diseases addressed include multiple myeloma, plasma cell leukemia, plasma cell dyscrasias, idiopathic pulmonary interstitial fibrosis, and POEMS syndrome.