CN109147879A

CN109147879A - Method and system for visual reports based on medical documents

Info

Publication number: CN109147879A
Application number: CN201810709344.8A
Authority: CN
Inventors: 孙字弋
Original assignee: Beijing Zhongxin Yi Bao Technology Co Ltd
Current assignee: Beijing Zhongxin Yi Bao Technology Co Ltd
Priority date: 2018-07-02
Filing date: 2018-07-02
Publication date: 2019-01-04
Anticipated expiration: 2038-07-02
Also published as: CN109147879B

Abstract

The present invention relates to a method for visual reports based on medical documents. The method comprises the following steps: 1) acquiring data from medical documents; 2) dividing the medical document data into disease data and patient data; 3) analyzing the disease category data with a clustering algorithm and presenting the result of the analysis as a disease category distribution map; 4) analyzing the patient population data with a population attribute tagging algorithm and an association rule mining algorithm and presenting the result of the analysis as a network relationship graph of the patient population. The disease category data analysis uses a clustering algorithm, and the analysis of the patient population data performs association rule mining with the Apriori algorithm. Addressing the specific characteristics of medical big data, the invention proposes a solution that presents these different dimensions in a unified way for analysis that supports disease prevention and control.

Description

Method and system for visual reports based on medical documents

Technical field

The invention belongs to the technical field of data or information processing, relates in particular to the processing of medical big data, and more particularly to a method and system for visual reports based on medical documents.

Background art

In the medical industry, one kind of medical data is the specific diagnosis and treatment data held by hospitals. Such data is generally highly specialized and is mainly stored within individual hospital departments, so it is hard to obtain through ordinary channels. Medical document data (invoices, prescriptions, etc.), by contrast, must be handed to the patient and is therefore easy to obtain; insurance company claims channels, for example, can collect it. As a result, this kind of medical document data is growing geometrically. The accompanying problem is an extreme shortage of visualization systems for medical document big data.

When facing massive amounts of data, browsing records one by one becomes meaningless, which is why visualization systems are needed. For a visualization system, however, the data and data dimensions of different industries can make the final report presentation differ enormously.

With the rise of the big data concept, all industries have begun to attach great importance to acquiring and storing all kinds of industry data. Big data analysis already has some known applications: patent application No. 201610497249 relates to a method of building a disease cloud map based on big data analysis, and patent application No. 201710150587.8 relates to a big data visualization method for smart environmental protection. Medical big data, however, has its own specific characteristics; for example, it includes diseases, disease categories, and patient attributes such as age and gender. How to present these different dimensions in a unified way for analysis that supports disease prevention and control is a problem that needs to be solved.

Summary of the invention

To meet the above need, the present invention provides a method for visual reports based on medical documents.

The method for visual reports based on medical documents of the invention mainly comprises the following steps:

1) acquiring data from medical documents;

2) dividing the medical document data into disease data and patient data;

3) analyzing the disease category data with a clustering algorithm, then presenting the result of the analysis as a disease category distribution map;

4) analyzing the patient population data with a population attribute tagging algorithm and an association rule mining algorithm, then presenting the result of the analysis as a network relationship graph of the patient population.

Wherein, the method for above-mentioned disease category data analysis is as follows:

The source of disease data is obtained according to the disease name in the prescription and diagnosis proof on medical document.

ICD10 medical treatment catalogue is mainly used, as tree catalogue, then by disease specific, is done on this directory tree Clustering algorithm.Detailed process are as follows:

A icd10 catalogue) is sorted out in a manner of relational data, divides DS1, tri- ranks of DS2, DS3

B) the method searched with similarity, while the mode for being subject to error correction navigates to specific disease record DS3

The specific method of lookup is disease on traversal document, calculates the editing distance of it and DS3 grade disease.

The algorithm is as follows:

B1) If the length of str1 or str2 is 0, return the length of the other string: if (str1.length==0) return

B2) Initialize a matrix d of size (n+1) * (m+1), with the values of the first row and first column increasing from 0. Scan the two strings (on the order of n*m operations): if str1[i] == str2[j], set temp to 0; otherwise set temp to 1. Then assign d[i, j] the minimum of the three values d[i-1, j]+1, d[i, j-1]+1 and d[i-1, j-1]+temp.

B3) After the scan, return the last value of the matrix, d[n][m], which is the distance between the two strings.

B4) Compare the distance against all DS3-level entries. If the distance is 0 or below a threshold, it is a hit, and the disease on the document is taken to be that DS3 disease.

C) For each DS3 entry, record the number of patients.

D) At the DS2 level, sum all the DS3-level counts; at the DS1 level, sum all the DS2 data. In this way, the number of patients can be obtained at any level.

E) Finally, the number of cases and the number of patients for each disease can be aggregated through the tree structure.
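The matching in steps B1) to B4) can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function names `edit_distance` and `match_ds3`, the sample catalogue entries, and the threshold of 1 (applied here as "less than or equal to") are all assumptions made for the example.

```python
from typing import Optional


def edit_distance(str1: str, str2: str) -> int:
    """Edit distance computed with the (n+1) x (m+1) matrix of steps B1-B3."""
    n, m = len(str1), len(str2)
    # B1: if either string is empty, the distance is the other string's length.
    if n == 0:
        return m
    if m == 0:
        return n
    # B2: first row and first column count up from 0.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            temp = 0 if str1[i - 1] == str2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + temp)  # match or substitution
    # B3: the last cell holds the distance.
    return d[n][m]


def match_ds3(name: str, ds3_names: list[str], threshold: int = 1) -> Optional[str]:
    """B4: return the closest DS3 name if it is within the (assumed) threshold."""
    best = min(ds3_names, key=lambda cand: edit_distance(name, cand), default=None)
    if best is not None and edit_distance(name, best) <= threshold:
        return best
    return None
```

With a hypothetical catalogue `["asthma", "influenza"]`, the misspelled document entry `"asthm"` is one edit away from `"asthma"` and is therefore treated as a hit, while an unrelated string returns `None`.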

The result of the above method is finally presented as a visual report based on the disease category distribution map. The invention uses a rectangular treemap to show the number of cases of each disease: the larger the region, the more cases. The main purpose of a treemap is to make the overall situation clear at a glance within a single chart; the size of each tile is determined by the magnitude of the corresponding element, and the chart supports grouping.

The drawing method is as follows: first, compute each disease's share of total cases from the case counts of the third-level diseases, then use that share to determine the area each third-level disease occupies in the rectangle. Once the rectangle areas of all third-level diseases are determined, the areas of the second-level and first-level diseases follow.
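The area calculation described above can be sketched as follows. The function name `treemap_areas`, the use of `(DS1, DS2, DS3)` tuples as keys and the sample codes are assumptions for illustration; the sketch only computes areas and leaves the actual tile layout to a drawing library.

```python
from collections import defaultdict


def treemap_areas(ds3_counts: dict[tuple[str, str, str], int],
                  total_area: float = 1.0):
    """Give each third-level disease an area proportional to its case count,
    then derive second- and first-level areas by summing their children."""
    total = sum(ds3_counts.values())
    if total == 0:
        raise ValueError("no cases to draw")
    ds3_area: dict[tuple[str, str, str], float] = {}
    ds2_area: dict[tuple[str, str], float] = defaultdict(float)
    ds1_area: dict[str, float] = defaultdict(float)
    for (ds1, ds2, ds3), count in ds3_counts.items():
        area = total_area * count / total
        ds3_area[(ds1, ds2, ds3)] = area
        ds2_area[(ds1, ds2)] += area
        ds1_area[ds1] += area
    return dict(ds1_area), dict(ds2_area), ds3_area


# Illustrative counts only; the codes are placeholders, not data from the patent.
counts = {("J", "J45", "J45.9"): 30, ("J", "J10", "J10.1"): 10, ("I", "I10", "I10.0"): 60}
ds1, ds2, ds3 = treemap_areas(counts, total_area=100.0)
```

Because the second- and first-level areas are plain sums of their children, fixing the third-level areas fixes the whole chart, which is the property the paragraph above relies on.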

Disease data is divided into three levels according to the ICD-10 catalogue. First-level diseases are presented as regions of different colors, as in the example of Fig. 2. Second-level and third-level diseases are shown as subdivided regions within the first-level regions. Clicking any first-level region focuses on that level and shows its information on its own; for example, clicking respiratory diseases presents further information on that category.

The patient population data analysis is performed as follows:

The data sources are, first, the tree structure of each disease (obtained with the disease data analysis method above) and, second, the population attribute tags of the patient data.

The population attribute tags are derived from the patient's age, gender and medical insurance card number on the medical documents (such as case records); patients are then grouped by age and gender into different user groups.
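A minimal sketch of the grouping step is shown below. The patent does not specify the age bands or the label format, so the ten-year bands, the `F_30-39` style of group code, the record field names and the use of the insurance card number to identify distinct patients are all assumptions.

```python
from collections import defaultdict


def population_group(age: int, gender: str, band_width: int = 10) -> str:
    """Return a group code such as 'F_30-39' (format assumed for illustration)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    low = (age // band_width) * band_width
    return f"{gender}_{low}-{low + band_width - 1}"


def group_patients(records: list[dict]) -> dict[str, set[str]]:
    """Map each group code to the set of distinct insurance card numbers in it."""
    groups: dict[str, set[str]] = defaultdict(set)
    for rec in records:
        groups[population_group(rec["age"], rec["gender"])].add(rec["card_no"])
    return dict(groups)


# Hypothetical records; a repeated card number counts as one patient.
records = [
    {"age": 34, "gender": "F", "card_no": "A1"},
    {"age": 38, "gender": "F", "card_no": "A2"},
    {"age": 34, "gender": "F", "card_no": "A1"},
    {"age": 61, "gender": "M", "card_no": "B7"},
]
groups = group_patients(records)
```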

Association rule mining is then performed on the disease data and the patient data. Specifically, the Apriori algorithm is used for the association rule mining.

Apriori is one of the most influential algorithms for mining frequent itemsets for Boolean association rules. Its name reflects the fact that the algorithm uses prior knowledge of the properties of frequent itemsets. Apriori uses an iterative, level-wise search in which k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found and denoted L₁. L₁ is used to find the set L₂ of frequent 2-itemsets, L₂ is used to find L₃, and so on, until no more frequent k-itemsets can be found. Finding each L_k requires one scan of the database.

First, scan all transactions to obtain the candidate 1-itemsets C₁, and eliminate the itemsets that do not meet the support requirement to obtain the frequent 1-itemsets. Then repeat the following recursive step:

Given the frequent k-itemsets (the frequent 1-itemsets are already known), join the itemsets in the frequent k-itemsets to obtain all possible (k+1)-itemsets and prune them (if the k-item subsets of a (k+1)-itemset do not all meet the support condition, that (k+1)-itemset is cut), giving the candidate set C_(k+1). Then filter out the itemsets in C_(k+1) that do not meet the support condition to obtain the frequent (k+1)-itemsets. If the resulting C_(k+1) is empty, the algorithm terminates.

The join works as follows: assume that the items in every itemset of L_k are arranged in the same order. If the first k-1 items of L_k[i] and L_k[j] are identical and their k-th items differ, then L_k[i] and L_k[j] are joinable. For example, {I1, I2} and {I1, I3} in L₂ are joinable and yield {I1, I2, I3} after joining, whereas {I1, I2} and {I2, I3} are not joinable, since joining them would lead to duplicate entries in the itemsets.

As a further illustration of pruning: in generating C₃ from L₂, the enumerated 3-itemsets include {I1, I2, I3}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5} and {I2, I4, I5}. However, because 2-item subsets such as {I3, I4} and {I4, I5} do not appear in L₂, {I2, I3, I4}, {I2, I3, I5} and {I2, I4, I5} are pruned.
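The join and prune steps can be sketched as a single candidate-generation function. The name `apriori_gen` and the contents of `L2` below are assumptions: the text does not list L₂, so a set consistent with the itemsets it mentions is used purely for illustration.

```python
from itertools import combinations


def apriori_gen(frequent_k: list[tuple[str, ...]]) -> list[tuple[str, ...]]:
    """Build (k+1)-item candidates from frequent k-itemsets.

    Each itemset is a tuple whose items are in sorted order.
    """
    frequent = set(frequent_k)
    ordered = sorted(frequent)
    k = len(ordered[0]) if ordered else 0
    candidates = []
    for idx, a in enumerate(ordered):
        for b in ordered[idx + 1:]:
            # Join: the first k-1 items agree and the k-th items differ.
            if a[:k - 1] == b[:k - 1]:
                cand = a + (b[-1],)
                # Prune: every k-item subset of the candidate must be frequent.
                if all(sub in frequent for sub in combinations(cand, k)):
                    candidates.append(cand)
    return candidates


# Assumed L2, chosen for illustration only.
L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
C3 = apriori_gen(L2)
```

With this assumed L₂, the join produces six 3-itemsets and the prune keeps only `("I1", "I2", "I3")` and `("I1", "I2", "I5")`; candidates such as `("I2", "I3", "I4")` are dropped because `("I3", "I4")` is not in L₂.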

The result of the above method is finally presented as a network relationship graph of the patient population. Association rule mining reveals the internal links between disease categories and the attributes of susceptible populations. The specific method is as follows:

First, for each disease case, compute the level code DS1 of the disease category and the group code PG of the patient's population attributes, and construct a one-dimensional array [DS1, PG];

Then, scan all disease records and append the one-dimensional array from the first step for each record to a new array, building a higher-dimensional array;

Next, run association rule mining on the higher-dimensional array, which yields the frequency weight FP of each distinct combination of DS1 and PG. Because the goal is to analyze high-frequency relationships, the 80 most frequent results are taken and written as Gexf data. Gexf is an XML-based language for describing complex network relationships; a gexf file usually declares the nodes first and then establishes the relationships (edges) between them. DS3 and PG are written as the nodes of the Gexf, and their corresponding FP values are written as the edges.

Finally, the Gexf data is rendered as a relationship graph, in which red denotes disease categories and dark blue denotes population attributes. Population attributes are grouped by age band and gender, and disease categories are classified by the first-level ICD-10 catalogue. Once the weight of the relationship between a population group and a disease category has been computed, the weight FP can be displayed on the link. The higher the weight, the more susceptible that group is to the disease: since FP is mined from the frequency relationship between the population attribute PG and the disease code DS1, a high FP means that, in the data, the relationship between that population group and the disease is high-frequency.
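The step of writing the top results as Gexf data can be sketched with the Python standard library. Several things here are assumptions: FP is approximated by a simple co-occurrence count rather than full association rule mining, the function name `build_gexf` and the `d:` / `g:` node id prefixes are invented for the example, and the element and attribute names are assumed to follow the GEXF 1.2 draft schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def build_gexf(pairs, top_n: int = 80) -> str:
    """pairs: iterable of (disease_code, group_code), one per disease record."""
    freq = Counter(pairs)          # stand-in for the frequency weight FP
    top = freq.most_common(top_n)  # keep only the highest-frequency combinations

    gexf = ET.Element("gexf", {"xmlns": "http://www.gexf.net/1.2draft",
                               "version": "1.2"})
    graph = ET.SubElement(gexf, "graph", {"mode": "static",
                                          "defaultedgetype": "undirected"})
    nodes = ET.SubElement(graph, "nodes")  # nodes are declared first
    edges = ET.SubElement(graph, "edges")  # then the relationships between them

    seen = set()
    for edge_id, ((disease, group), fp) in enumerate(top):
        for node_id in (f"d:{disease}", f"g:{group}"):
            if node_id not in seen:
                seen.add(node_id)
                ET.SubElement(nodes, "node", {"id": node_id, "label": node_id[2:]})
        ET.SubElement(edges, "edge", {"id": str(edge_id),
                                      "source": f"d:{disease}",
                                      "target": f"g:{group}",
                                      "weight": str(fp)})
    return ET.tostring(gexf, encoding="unicode")


# Hypothetical records: three disease/group combinations.
sample = [("J", "F_30-39")] * 3 + [("I", "M_60-69")] * 2 + [("J", "M_60-69")]
gexf_xml = build_gexf(sample)
```

The resulting string can be saved to a `.gexf` file and opened in a graph tool for rendering; node colors for disease categories and population groups would be assigned at that stage.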

Correspondingly, the invention provides a system for visual reports of big data analysis based on medical documents, mainly comprising the following modules:

1) Data acquisition and classification module: acquires data from medical documents and divides the medical document data into disease data and patient data;

2) Data analysis module: comprises a disease category data analysis module and a patient population data analysis module;

3) Visual report module: presents the analysis results as a disease category distribution map and as a network relationship graph of the patient population, respectively.

Addressing the specific characteristics of medical big data (diseases, disease categories, and patient attributes such as age and gender), the invention proposes a solution that presents these different dimensions in a unified way for analysis that supports disease prevention and control. It also addresses the shortage of visualization systems for medical document big data caused by the geometric growth of medical document data, and has good application and promotion value.

Brief description of the drawings

Fig. 1: Basic flow chart of the method and system of the invention.

Fig. 2: Schematic view of the number of cases of various diseases (example).

Fig. 3: Schematic view of the number of cases of respiratory diseases (example).

Fig. 4: Network relationship graph of the internal links between disease categories and susceptible population attributes (example).

Detailed description of embodiments

The invention is further explained below through the description of specific embodiments, which are not to be construed as limiting the invention.

I. Main flow of the method of the invention

1) Acquire data from medical documents.

2) Divide the medical document data into disease data and patient data.

II. Description of the analysis methods

1. Data analysis method for the disease category distribution map

The ICD-10 medical catalogue is used as a tree-structured catalogue, and the specific diseases are clustered on this catalogue tree. The process is:

The algorithm is as follows:

B3) After the scan, return the last value of the matrix, d[n][m], which is the distance between the two strings.

B4) Compare the distance against all DS3-level entries. If the distance is 0 or below a threshold, it is a hit, and the disease on the document is taken to be that DS3 disease.

C) For each DS3 entry, record the number of patients.

E) Finally, the number of cases and the number of patients for each disease can be aggregated through the tree structure.
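The aggregation through the tree can be sketched as follows. The function name `roll_up`, the `parent_of` mapping and the sample codes are assumptions made for illustration; the sketch simply adds each DS3 count to every ancestor of that entry.

```python
from collections import defaultdict


def roll_up(ds3_counts: dict[str, int], parent_of: dict[str, str]) -> dict[str, int]:
    """Sum DS3 counts into their DS2 parents and DS2 totals into DS1.

    parent_of maps each DS3 code to its DS2 code and each DS2 code to its
    DS1 code; DS1 codes have no entry.
    """
    totals: dict[str, int] = defaultdict(int)
    for ds3, count in ds3_counts.items():
        node = ds3
        while node is not None:
            totals[node] += count
            node = parent_of.get(node)
    return dict(totals)


# Placeholder catalogue fragment and counts, for illustration only.
parent_of = {"J45.9": "J45", "J45": "J", "J10.1": "J10", "J10": "J"}
totals = roll_up({"J45.9": 30, "J10.1": 10}, parent_of)
```

After the roll-up, a count is available at every level, so the report can show totals for a first-level category as readily as for an individual third-level disease.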

2. Data analysis method for the network relationship graph of the patient population

Two data sources are needed: first, the tree structure of each disease (obtained from the data analysis for the disease category distribution map above) and, second, the population attribute tags of the patient data.

Association rule mining is then performed on the disease data and the patient data.

The Apriori algorithm is mainly used for the association rule mining.

The idea of the algorithm is briefly as follows: if a set I is not a frequent itemset, then no larger set containing I can be a frequent itemset either.

The initial data of the algorithm is as follows:

The basic process of the algorithm is as follows:

First, scan all transactions to obtain the candidate 1-itemsets C₁, and eliminate the itemsets that do not meet the support requirement to obtain the frequent 1-itemsets.

The recursive step is then carried out:

The join works as follows: assume that the items in every itemset of L_k are arranged in the same order. If the first k-1 items of L_k[i] and L_k[j] are identical and their k-th items differ, then L_k[i] and L_k[j] are joinable. For example, {I1, I2} and {I1, I3} in L₂ are joinable and yield {I1, I2, I3} after joining, whereas {I1, I2} and {I2, I3} are not joinable, since joining them would lead to duplicate entries in the itemsets.
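Putting the scan, join, prune and filter steps together, the level-wise search can be sketched as below. This is a compact teaching version under stated assumptions, not the patent's implementation: the function name `apriori`, the use of an absolute support count as the threshold, and the sample transactions (each pairing a disease code with a population group label) are all invented for the example.

```python
from itertools import combinations


def apriori(transactions: list[set[str]], min_support: int) -> dict[tuple[str, ...], int]:
    """Return every frequent itemset (as a sorted tuple) with its support count."""

    def count(cands):
        # One pass over the transactions per level.
        return {c: sum(1 for t in transactions if set(c) <= t) for c in cands}

    items = sorted({item for t in transactions for item in t})
    current = {c: n for c, n in count([(i,) for i in items]).items()
               if n >= min_support}          # frequent 1-itemsets
    frequent = dict(current)
    k = 1
    while current:
        level = sorted(current)
        candidates = []
        for idx, a in enumerate(level):
            for b in level[idx + 1:]:
                if a[:k - 1] == b[:k - 1]:   # join on the first k-1 items
                    cand = a + (b[-1],)
                    # prune: all k-item subsets must already be frequent
                    if all(s in current for s in combinations(cand, k)):
                        candidates.append(cand)
        current = {c: n for c, n in count(candidates).items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical transactions: one disease code and one group label per record.
transactions = [{"J", "F30"}, {"J", "F30"}, {"I", "M60"}, {"J", "M60"}]
result = apriori(transactions, min_support=2)
```

In this toy data the only frequent pair is `("F30", "J")` with a count of 2, which is the kind of disease and population group combination the relationship graph is meant to surface.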

III. Style and data structure of the visual reports

1. Visual report based on the disease category distribution map

A rectangular treemap is used to show the number of cases of each disease: the larger the region, the more cases. The main purpose of a treemap is to make the overall situation clear at a glance within a single chart; the size of each tile is determined by the magnitude of the corresponding element, and the chart supports grouping.

The drawing method is as follows: first, compute each disease's share of total cases from the case counts of the third-level diseases, then use that share to determine the area each third-level disease occupies in the rectangle. Once the rectangle areas of all third-level diseases are determined, the areas of the second-level and first-level diseases follow. Fig. 2 shows an example of a disease category distribution map.

Disease data is divided into three levels according to the ICD-10 catalogue. First-level diseases are presented as regions of different colors, as in the example of Fig. 2. Second-level and third-level diseases are shown as subdivided regions within the first-level regions. Clicking any first-level region focuses on that level and shows its information on its own; for example, clicking respiratory diseases presents further information on that category, as in the example of Fig. 3.

2. Network relationship graph of the patient population

Association rule mining is an important topic in data mining. As the name suggests, it discovers associations or connections that may exist between things hidden in the data. For example, a survey of what customers buy in a shopping mall finds that 30% of customers buy both bed sheets and pillowcases, and that 80% of the people who buy sheets also buy pillowcases. This hides an association, sheet -> pillowcase: a large share of customers buy sheets and pillowcases together, so the mall can place them in the same shopping area to make shopping more convenient.
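The two percentages in this example correspond to the usual support and confidence measures of a rule. The sketch below reproduces them on made-up data: the 40 baskets and the function names `support` and `confidence` are assumptions chosen so that the numbers match the 30% and 80% in the text.

```python
def support(transactions: list[set[str]], itemset: set[str]) -> float:
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions: list[set[str]], antecedent: set[str],
               consequent: set[str]) -> float:
    """Of the transactions containing the antecedent, the fraction that also
    contain the consequent."""
    return (support(transactions, antecedent | consequent)
            / support(transactions, antecedent))


# 40 invented baskets: 12 with both items, 3 with a sheet only, 25 with neither.
baskets = ([{"sheet", "pillowcase"}] * 12) + ([{"sheet"}] * 3) + ([{"towel"}] * 25)

rule_support = support(baskets, {"sheet", "pillowcase"})          # about 0.30
rule_confidence = confidence(baskets, {"sheet"}, {"pillowcase"})  # about 0.80
```

In the patent's setting the same two measures would apply to rules linking a population group to a disease category instead of sheets to pillowcases.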

In the present invention specifically, association rule mining can reveal the internal links between disease categories and the attributes of susceptible populations. The specific method is as follows:

Finally, the Gexf data is rendered as a relationship graph, in which red denotes disease categories and dark blue denotes population attributes. Population attributes are grouped by age band and gender, and disease categories are classified by the first-level ICD-10 catalogue. Once the weight of the relationship between a population group and a disease category has been computed, the weight FP can be displayed on the link. The higher the weight, the more susceptible that group is to the disease: since FP is mined from the frequency relationship between the population attribute PG and the disease code DS1, a high FP means that, in the data, the relationship between that population group and the disease is high-frequency. Fig. 4 shows an example network relationship graph of the internal links between disease categories and susceptible population attributes.

Claims

1. A method for visual reports based on medical documents, characterized by comprising the following steps:

1) acquiring data from medical documents;

2) dividing the medical document data into disease data and patient data;

3) analyzing the disease category data with a clustering algorithm, then presenting the result of the analysis as a disease category distribution map;

4) analyzing the patient population data with a population attribute tagging algorithm and an association rule mining algorithm, then presenting the result of the analysis as a network relationship graph of the patient population;

wherein the disease category data analysis uses the ICD-10 medical catalogue as a tree-structured catalogue and clusters the specific diseases on the catalogue tree;

and the analysis of the patient population data performs association rule mining on both the disease data and the patient data, using the Apriori algorithm.

2. The method according to claim 1, characterized in that the disease category data analysis is specifically as follows: the disease data is obtained from the disease names in the prescriptions and diagnosis certificates on the medical documents; the ICD-10 medical catalogue is used as a tree-structured catalogue, and the specific diseases are clustered on the catalogue tree, the clustering process being:

A) organizing the ICD-10 catalogue as relational data, divided into three levels: DS1, DS2 and DS3;

B) using similarity search, supplemented by error correction, to locate the specific DS3 disease record, the search traversing the diseases on the documents and computing the edit distance between each one and the DS3-level diseases;

C) for each DS3 entry, recording the number of patients;

D) at the DS2 level, summing all the DS3-level counts, and at the DS1 level, summing all the DS2 data, so that the number of patients can be obtained at any level.

3. The method according to claim 2, characterized in that the specific algorithm in B) is as follows:

B1) if the length of str1 or str2 is 0, returning the length of the other string: if (str1.length==0) return

B2) initializing a matrix d of size (n+1) * (m+1), with the values of the first row and first column increasing from 0; scanning the two strings (on the order of n*m operations): if str1[i] == str2[j], setting temp to 0, otherwise setting temp to 1; then assigning d[i, j] the minimum of the three values d[i-1, j]+1, d[i, j-1]+1 and d[i-1, j-1]+temp;

B3) after the scan, returning the last value of the matrix, d[n][m], which is the distance between the two strings;

B4) comparing the distance against all DS3-level entries; if the distance is 0 or below a threshold, it is a hit, and the disease on the document is taken to be that DS3 disease.

4. The method according to claim 1, characterized in that the patient population data is analyzed as follows:

the data sources are, first, the tree structure of each disease (obtained with the disease data analysis method above) and, second, the population attribute tags of the patient data, derived from the patient's age, gender and medical insurance card number on the medical documents (such as case records), with patients then grouped by age and gender into different user groups;

association rule mining is performed on the disease data and the patient data using the Apriori algorithm.

5. The method according to claim 1, characterized in that presenting the result of the analysis as a disease category distribution map means showing the number of cases of each disease with a rectangular treemap, in which a larger region represents more cases.

6. The method according to claim 5, characterized in that the disease category distribution map is drawn as follows: first, each disease's share of total cases is computed from the case counts of the third-level diseases, and that share then determines the area each third-level disease occupies in the rectangle; more preferably, disease data is divided into three levels according to the ICD-10 catalogue, first-level diseases are presented as regions of different colors, second-level and third-level diseases are shown as subdivided regions within the first-level regions, and clicking any first-level region focuses on that level and shows its information on its own.

7. The method according to claim 1, characterized in that the network relationship graph of the patient population is produced as follows:

first, for each disease case, the level code DS1 of the disease category and the group code PG of the patient's population attributes are computed, and a one-dimensional array [DS1, PG] is constructed;

next, association rule mining is run on the higher-dimensional array, yielding the frequency weight FP of each distinct combination of DS1 and PG; because the goal is to analyze high-frequency relationships, the 80 most frequent results are taken and written as Gexf data; DS3 and PG are written as the nodes of the Gexf, and their corresponding FP values are written as the edges;

finally, the Gexf data is rendered as a relationship graph, in which disease categories and population attributes are shown in different colors; population attributes are grouped by age band and gender; disease categories are classified by the first-level ICD-10 catalogue; and once the weight of the relationship between a population group and a disease category has been computed, the weight FP can be displayed on the link.

8. A system for visual reports of big data analysis based on medical documents, implementing the method according to any one of claims 1 to 7, characterized by mainly comprising the following modules:

1) a data acquisition and classification module, for acquiring data from medical documents and dividing the medical document data into disease data and patient data;

3) a visual report module, for presenting the analysis results as a disease category distribution map and as a network relationship graph of the patient population, respectively.