CN107480426B

CN107480426B - Self-iteration medical record file clustering analysis system

Info

Publication number: CN107480426B
Application number: CN201710596235.5A
Authority: CN
Inventors: 童永安; 陈卫单; 陈勇强
Original assignee: Wisefly Technology Co ltd
Current assignee: Wisefly Technology Co ltd
Priority date: 2017-07-20
Filing date: 2017-07-20
Publication date: 2021-01-19
Anticipated expiration: 2037-07-20
Also published as: CN107480426A

Abstract

The invention discloses a self-iterative medical record file cluster analysis system comprising a medical record import module, a vector processing module and an ISODATA cluster analysis module. The medical record import module extracts the variables to be analyzed from medical record files and normalizes them. The vector processing module performs type and proportion conversion on the different types of variables in the medical record files and, once the vector conversion is completed, stores the space vector coordinates of each individual in a space vector library. The ISODATA cluster analysis module retrieves the space vectors to be analyzed from the space vector library in the vector processing module and runs ISODATA cluster analysis on them. On one hand, this reduces the amount of computation compared with hierarchical clustering, which is computationally heavy; on the other hand, the most reasonable classification result can be obtained. By cluster analysis of electronic medical records with complex and varied content, a classification of large batches of medical record files is obtained, which facilitates subsequent processing or analysis.

Description

Self-iteration medical record file clustering analysis system

Technical Field

The invention relates to the technical field of medical treatment, in particular to a self-iterative medical record file cluster analysis system.

Background

Medical record files have both individual and common characteristics. In clinical research, it is often necessary to classify medical records according to certain of their characteristics for further processing or analysis. However, existing cluster analysis operates on specific numerical variables and is difficult to apply directly to medical records, whose variables are numerous and of complicated types. In addition, the previously developed medical record file cluster analysis system based on hierarchical clustering suffers from a large amount of computation and inaccurate classification. The existing algorithm therefore needs to be improved so that it suits the large quantity and complex content of medical record files.

Compared with hierarchical clustering, ISODATA requires less computation and yields a clustering result directly, without further screening by the user; compared with the K-MEANS clustering algorithm, ISODATA can adjust the number of categories to obtain a more reasonable classification result. Therefore, a medical record file cluster analysis algorithm is developed on the basis of ISODATA, suited to the characteristics of medical record files.

Disclosure of Invention

Aiming at the problems in the background art, the invention aims to provide a self-iterative medical record clustering analysis system, which obtains the classification result of a large quantity of medical record files by clustering analysis of electronic medical records with complex and numerous contents, thereby facilitating the next processing or analysis.

The technical scheme of the invention is realized as follows: a self-iterative medical record file cluster analysis system comprises a medical record import module, a vector processing module and an ISODATA cluster analysis module. The medical record import module comprises a filter, which preliminarily filters the medical record files imported by a user, extracts the variables to be analyzed from the medical record files according to an initialized mapping relation, and normalizes each variable in the medical record files for vector abstraction in the next step. The vector processing module performs type and proportion conversion on the different types of variables in the medical record files, including continuous variable conversion, logical variable conversion and text variable conversion; after the vector conversion is completed, the space vector coordinates of each individual are stored in a space vector library for the subsequent ISODATA cluster analysis. The ISODATA cluster analysis module retrieves the space vectors to be analyzed from the space vector library in the vector processing module and performs ISODATA cluster analysis.

In the above technical solution, the text type variable conversion is divided into a special conversion and a normal conversion.

In the above technical solution, the cluster analysis module of the ISODATA is divided into seven sub-modules, which are respectively an initialization module, a basic module I, a basic module II, a judgment and iteration module, a splitting module, a merging module and an ending module.

In the above technical solution, the basic module I includes center subset extraction, clustering by a minimum distance method, and cluster screening.

In the above technical solution, the judging and iterating module includes cluster center correction, average distance calculation, and overall average distance calculation for all classes.

The invention discloses a self-iterative medical record file cluster analysis system comprising a medical record import module, a vector processing module and an ISODATA cluster analysis module. Space vectors are abstracted from the concrete attributes of the medical record files and then used in ISODATA cluster analysis; according to the parameter values selected by the user, the ISODATA cluster analysis module performs multiple iterations on the space vectors to finally obtain a classification result. On one hand, this reduces the amount of computation compared with hierarchical clustering, which is computationally heavy; on the other hand, the most reasonable classification result can be obtained. By cluster analysis of electronic medical records with complex and varied content, a classification of large batches of medical record files is obtained, which facilitates subsequent processing or analysis.

Drawings

FIG. 1 is a block diagram of a self-iterative medical record cluster analysis system according to the present invention;

FIG. 2 is a diagram illustrating a connection relationship between seven sub-modules in the ISODATA clustering module shown in FIG. 1;

FIG. 3 is a spatial distribution diagram of cluster analysis according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention relates to a self-iterative medical record file cluster analysis system characterized by space vector conversion and ISODATA cluster analysis. Space vectors are abstracted from the concrete attributes of each medical record file and then used in ISODATA cluster analysis. According to the parameter values selected by the user, the ISODATA cluster analysis module performs multiple iterations on the space vectors to finally obtain a classification result. The system comprises a medical record import module, a vector processing module and an ISODATA cluster analysis module; the connection relationship of the modules is shown in FIG. 1, and each module is described in detail below.

(1) A medical record importing module:

The medical record import module is responsible for the preliminary processing of medical record files imported by a user. Its most critical part is a filter, which extracts the variables to be analyzed from the medical record files according to the initialized mapping relation, for vector abstraction in the next step. Processing the medical record files through the filter normalizes all variables in them.

(2) A vector processing module:

The concrete attributes of medical record files cannot be used directly for cluster analysis; cluster analysis can be performed only after vector abstraction. The variables therefore need to undergo type and proportion conversion according to certain rules. Different types of variables in medical record files call for different conversion methods, which fall into three main categories: continuous variable conversion, logical variable conversion, and text variable conversion. The specific steps are as follows:

a. Continuous variable conversion: a continuous variable is used as one dimension of the space vector. The average value of the variable is mapped to the standard value 100 (or another manually set standard value): the variable value of each individual in the sample is divided by the average value and then multiplied by the standard value, and the result is used as the value of that dimension in the space vector.

b. Logical variable conversion: a logical variable that takes the value "yes" or "no" is used as one dimension of the space vector. For "yes" the corresponding value is 100 (or a manually set standard value), and for "no" the corresponding value is 0; this value is used as the value of that dimension.

c. Text type variable conversion: the text type variable conversion method is divided into two modes: a special conversion method and a general conversion method. The common feature of both methods is that some kind of standard is adopted to convert text-type data into numerical data.

The special conversion method: a conversion standard is preset in this module of the system, and text is converted into specific numerical values according to that standard. For example, a diagnosis is a text variable; a four-dimensional disease spectrum is preset in the system, and each disease has corresponding spatial coordinates in the disease spectrum. The disease spectrum is a four-dimensional space laid out according to a defined standard based on the departments to which diseases belong, the relationships between diseases, and their severity.

For example, hyperthyroidism, type 1 diabetes mellitus and type 2 diabetes mellitus are similar to a degree in that all are endocrine diseases, while type 1 and type 2 diabetes mellitus are more similar to each other, so their coordinates in the disease spectrum are closer. The coordinates of hyperthyroidism are (102, 321, 210, 3), those of type 1 diabetes mellitus are (102, 321, 211, 4), and those of type 2 diabetes mellitus are (102, 321, 211, 5). The vector conversion module incorporates the coordinates of the diagnosis in the disease spectrum into the space vector. Besides the disease spectrum, the surgical operation spectrum, the prescription spectrum and the like also belong to the special conversion method.

The common conversion method: when importing medical records, the user is required to set a mapping relation between the different texts of a text variable and numerical values, for example excellent, good, medium and poor corresponding to 100, 75, 50 and 25. The vector conversion module then assigns the corresponding numerical value as a dimension of the space vector according to the set values and the mapping relation.

After the vector conversion operation is completed, the space vector coordinates of each individual are stored in a space vector library for the next ISODATA cluster analysis.
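The three conversions described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the field names (`age`, `smoker`, `diagnosis`, `recovery`), the function names and the four grade labels are assumptions made for the example; the disease-spectrum coordinates and the values 100, 75, 50, 25 are taken from the description.

```python
from statistics import mean

STANDARD = 100.0  # standard value from the description; may be set manually

def convert_continuous(values, standard=STANDARD):
    """Divide each value by the sample mean, then multiply by the standard value."""
    avg = mean(values)
    return [v / avg * standard for v in values]

def convert_logical(flag, standard=STANDARD):
    """Map "yes" (True) to the standard value and "no" (False) to 0."""
    return standard if flag else 0.0

# Special conversion: preset disease-spectrum coordinates quoted in the description.
DISEASE_SPECTRUM = {
    "hyperthyroidism": (102, 321, 210, 3),
    "type 1 diabetes": (102, 321, 211, 4),
    "type 2 diabetes": (102, 321, 211, 5),
}

# Common conversion: user-defined mapping from text to values (labels are assumed).
GRADE_MAP = {"excellent": 100, "good": 75, "medium": 50, "poor": 25}

def to_space_vector(record, age_scaled):
    """Assemble one individual's space vector (hypothetical record layout)."""
    return (
        age_scaled,
        convert_logical(record["smoker"]),
        *DISEASE_SPECTRUM[record["diagnosis"]],
        GRADE_MAP[record["recovery"]],
    )

records = [
    {"age": 30, "smoker": True, "diagnosis": "type 1 diabetes", "recovery": "good"},
    {"age": 50, "smoker": False, "diagnosis": "hyperthyroidism", "recovery": "poor"},
]
ages = convert_continuous([r["age"] for r in records])
vectors = [to_space_vector(r, a) for r, a in zip(records, ages)]
```

Each tuple in `vectors` plays the role of one entry in the space vector library: one dimension per continuous or logical variable, four dimensions for the diagnosis, and one for the graded text variable.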

(3) ISODATA cluster analysis module:

The ISODATA algorithm is the core of the ISODATA cluster analysis module. The module retrieves the space vectors to be analyzed from the space vector library in the vector processing module and performs ISODATA cluster analysis.

The ISODATA cluster analysis module is divided into seven secondary modules, whose connection relationship is shown in FIG. 2:

A. an initialization module:

a. Initializing parameters: before ISODATA cluster analysis starts, the following parameters need to be initialized:

Parameter name	Meaning
K	Target number of clusters
k	Initial number of clusters
θ_N	Minimum number of vectors in each cluster; a group with fewer vectors is not kept as a separate cluster
θ_s	Maximum allowed standard deviation within each cluster; the cluster is split if this value is exceeded
θ_c	Minimum distance between two cluster centers; the clusters are merged if the distance is smaller
L	Maximum number of pairs of cluster centers that can be merged in one iteration
I	Maximum number of iterations
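As an illustration, the parameters can be grouped in a small structure. The class name, field names and the validation rules below are assumptions made for this sketch; the values are those of the worked example given later in the description.

```python
from dataclasses import dataclass

@dataclass
class IsodataParams:
    K: int          # target number of clusters
    k: int          # initial number of clusters
    theta_N: int    # minimum number of vectors per cluster
    theta_s: float  # maximum allowed standard deviation within a cluster
    theta_c: float  # minimum distance between two cluster centers
    L: int          # maximum number of center pairs merged per iteration
    I: int          # maximum number of iterations

    def validate(self):
        """Reject obviously unusable settings (checks are an assumption of this sketch)."""
        if min(self.K, self.k, self.theta_N, self.I) < 1:
            raise ValueError("K, k, theta_N and I must be at least 1")
        if self.theta_s < 0 or self.theta_c < 0 or self.L < 0:
            raise ValueError("theta_s, theta_c and L must be non-negative")
        return self

# Values from the worked example in the description.
params = IsodataParams(K=3, k=2, theta_N=2, theta_s=20, theta_c=20, L=2, I=20).validate()
```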

B. A basic module I:

b. Center subset extraction: k samples are randomly extracted from the space vector library as the cluster center subset {z₁, z₂, z₃, …, z_k}.

c. Clustering by the minimum distance method: each space vector is assigned to the nearest cluster S_i.

d. Cluster screening: if the number of space vectors in S_i is below the prescribed minimum θ_N, the cluster is cancelled and k = k − 1.
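Steps b to d can be sketched as follows. The sketch assumes Euclidean distance and assumes that the vectors of a cancelled cluster are reassigned to the nearest surviving center; the description specifies neither point, and the function names are illustrative.

```python
import math
import random

def extract_centers(vectors, k, seed=None):
    """Randomly draw k samples from the space vector library as initial centers."""
    return random.Random(seed).sample(vectors, k)

def assign_to_nearest(vectors, centers):
    """Assign every vector to the cluster whose center is nearest (Euclidean)."""
    clusters = [[] for _ in centers]
    for x in vectors:
        i = min(range(len(centers)), key=lambda c: math.dist(x, centers[c]))
        clusters[i].append(x)
    return clusters

def screen_clusters(vectors, centers, theta_N):
    """Cancel clusters with fewer than theta_N members, then reassign all vectors."""
    clusters = assign_to_nearest(vectors, centers)
    kept = [c for c, members in zip(centers, clusters) if len(members) >= theta_N]
    if not kept:  # nothing survives: leave the clustering unchanged
        return centers, clusters
    return kept, assign_to_nearest(vectors, kept)

vectors = [(0, 0), (1, 1), (10, 10), (11, 11), (30, 30)]
centers, clusters = screen_clusters(vectors, [(0, 0), (10, 10), (30, 30)], theta_N=2)
```

In this small example the cluster around (30, 30) holds a single vector, so it is cancelled and its vector joins the cluster around (10, 10); the number of clusters drops from 3 to 2.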

C. A basic module II:

e. Correcting the cluster center: for the j-th dimension of the i-th class, the center value is corrected as:

f. Calculating the average distance: the average distance from each space vector in the cluster to the cluster center is calculated:

g. Calculating the overall average distance for all classes:
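Assuming the standard ISODATA definitions for steps e to g (per-dimension mean for the corrected center, mean Euclidean distance within a cluster, and a size-weighted mean across clusters), a minimal Python sketch is given below. The function names are illustrative and the choice of Euclidean distance is an assumption.

```python
import math

def correct_center(members):
    """Per-dimension mean of a cluster's vectors (assumed standard ISODATA update)."""
    n = len(members)
    return tuple(sum(x[j] for x in members) / n for j in range(len(members[0])))

def average_distance(members, center):
    """Mean Euclidean distance from the cluster's vectors to its center."""
    return sum(math.dist(x, center) for x in members) / len(members)

def overall_average_distance(clusters, centers):
    """Mean distance of all vectors to their own centers, weighted by cluster size."""
    total = sum(len(m) for m in clusters)
    return sum(len(m) * average_distance(m, c) for m, c in zip(clusters, centers)) / total

clusters = [[(0, 0), (2, 0)], [(10, 10), (10, 14)]]
centers = [correct_center(m) for m in clusters]
per_cluster = [average_distance(m, c) for m, c in zip(clusters, centers)]
overall = overall_average_distance(clusters, centers)
```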

D. a judging and iteration module:

h. A judgment is made: 1. If the number of iterations has reached I, θ_c is set to 0 and the process goes to module G.

2. If k ≤ K/2, the process goes to module E for splitting;

3. If k ≥ 2K, the process goes to module F for merging;

4. If K/2 < k < 2K, the process goes to module E if the iteration number is odd and to module F if it is even.
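The judgment can be expressed as a small decision function. This sketch reads the second and third rules as k ≤ K/2 and k ≥ 2K, the complement of the range stated in rule 4; that reading, the function name and the returned labels are assumptions.

```python
def next_step(iteration, k, K, I):
    """Return which module runs next: 'split', 'merge', or 'end'."""
    if iteration >= I:
        return "end"    # last iteration; the description also sets theta_c to 0 here
    if k <= K / 2:
        return "split"  # too few clusters compared with the target K
    if k >= 2 * K:
        return "merge"  # too many clusters compared with the target K
    # K/2 < k < 2K: split on odd iterations, merge on even ones
    return "split" if iteration % 2 == 1 else "merge"
```

With the example parameters K = 3 and I = 20, a run that currently holds k = 2 clusters alternates between splitting on odd iterations and merging on even ones.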

E. Splitting the module:

i. For each cluster S_i, the standard deviation of each dimension within the cluster is calculated according to the following formula:

The maximum σ_{i max} of the per-dimension standard deviations is then found for each cluster.

If σ_{i max} > θ_s and one of the following two conditions is satisfied:

1. The average distance within the class is larger than the overall average distance of all classes, and the number of space vectors is more than twice θ_N, that is, N_i > 2θ_N.

2. k ≤ K/2.

then the cluster is split into two clusters with two new cluster centers:

where h is any value from 0 to 1, chosen so that the distances from each vector in the original cluster to the two new cluster centers differ.

After the splitting is completed, k = k + 1.
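A hedged sketch of the splitting step follows. It assumes the population standard deviation (division by N_i) and assumes that the two new centers are obtained by shifting the old center by plus and minus h times σ_{i max} along the dimension with the largest deviation, as in the standard ISODATA algorithm; the function names are illustrative.

```python
import math

def dimension_std(members, center):
    """Standard deviation of each dimension around the cluster center."""
    n = len(members)
    return [math.sqrt(sum((x[j] - center[j]) ** 2 for x in members) / n)
            for j in range(len(center))]

def should_split(sigma_max, theta_s, avg_dist, overall_avg, n_i, theta_N, k, K):
    """Split test from the description: sigma_max > theta_s plus one of two conditions."""
    if sigma_max <= theta_s:
        return False
    return (avg_dist > overall_avg and n_i > 2 * theta_N) or k <= K / 2

def split_center(center, sigmas, h=0.5):
    """Shift the center by +/- h * sigma_max along the widest dimension (assumed form)."""
    j = max(range(len(sigmas)), key=lambda d: sigmas[d])
    plus, minus = list(center), list(center)
    plus[j] += h * sigmas[j]
    minus[j] -= h * sigmas[j]
    return tuple(plus), tuple(minus)

members = [(0, 0), (0, 10)]
sigmas = dimension_std(members, (0, 5))
new_centers = split_center((0, 5), sigmas, h=0.5)
```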

F. A merging module:

j. The distance between any two cluster centers C_i and C_j is calculated:

D_ij = d(C_i, C_j)

k. Each D_ij is compared with θ_c, and the D_ij smaller than θ_c are arranged in ascending order.

Starting from the smallest D_ij, the corresponding C_i and C_j are merged, and the new cluster center is:

k = k − 1;

From the second smallest D_ij onward, the two corresponding cluster centers are merged only if neither of them has been merged before. Merging stops once the total number of merged pairs reaches L.
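The merging step can be sketched as below. The merged center is taken to be the size-weighted mean of the two original centers, which is the standard ISODATA form and an assumption here; Euclidean distance and the function name are likewise assumptions.

```python
import math
from itertools import combinations

def merge_centers(centers, sizes, theta_c, L):
    """Merge up to L pairs of centers closer than theta_c, nearest pairs first."""
    pairs = sorted(
        (math.dist(centers[i], centers[j]), i, j)
        for i, j in combinations(range(len(centers)), 2)
        if math.dist(centers[i], centers[j]) < theta_c
    )
    used, merged = set(), []
    for _, i, j in pairs:
        if len(merged) == L:
            break              # pair limit for this iteration reached
        if i in used or j in used:
            continue           # each center is merged at most once
        n = sizes[i] + sizes[j]
        # Size-weighted mean of the two centers (assumed standard ISODATA form).
        merged.append((tuple((sizes[i] * a + sizes[j] * b) / n
                             for a, b in zip(centers[i], centers[j])), n))
        used.update((i, j))
    kept = [(c, s) for idx, (c, s) in enumerate(zip(centers, sizes)) if idx not in used]
    result = kept + merged
    return [c for c, _ in result], [s for _, s in result]

new_centers, new_sizes = merge_centers(
    [(0, 0), (4, 0), (100, 100)], sizes=[1, 3, 5], theta_c=20, L=2)
```

Each merge reduces the cluster count k by one, so the length of the returned center list reflects the updated k.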

G. An end module:

m. The iteration counter is incremented by one: i = i + 1.

n. If the number of iterations has reached the upper limit, the iteration ends; otherwise, the process returns to module B.

The following is a further description of the invention with reference to a specific example:

Ten existing medical record files need to undergo cluster analysis; their parameters are shown in the following table:

The spatial distribution diagram is shown in FIG. 3:

The set parameter values are:

Parameter name	Parameter value
K	3
k	2
θ_N	2
θ_s	20
θ_c	20
L	2
I	20

The two cluster centers originally set are: (0, 20) and (25, 200), but after a number of iterations, the two clusters are split into three clusters, with new cluster centers being (2, 20), (11, 83) and (25, 250), respectively. As can also be seen from the figure, the batch of medical records can be divided into three categories, namely, the lower left corner, the middle and the upper right corner.

In summary, compared with the prior art, the self-iterative medical record file cluster analysis system of the invention has the following beneficial effects:

1. Existing cluster analysis methods include hierarchical clustering, K-MEANS clustering and the like, but they cannot automatically adjust the number of categories to the actual distribution of the vectors; for example, K-MEANS requires the number of categories to be preset, and the number of clusters finally obtained equals that preset number. The most distinctive feature of ISODATA is that the number of clusters and the cluster centers can be adjusted according to the actual situation, so that the clustering result better matches the actual distribution. In practice, a researcher's estimate of the reasonable number of clusters may be biased, so the expected number of clusters may not match the actual distribution; ISODATA can adjust the number of clusters to the actual situation, making the classification of medical record files more reasonable.

2. If a large number of medical record files are classified manually, particularly when they are classified one by one according to several variables, the classifier has to analyze the variables comprehensively and judge the category of each file; this process takes a great deal of time and effort and is extremely inefficient. By using the hierarchical clustering system, correlation coefficients can be computed from multiple quantified variables, and the cluster analysis result can be obtained from the computed correlation coefficients.

3. The flexibility of the ISODATA clustering system aiming at medical record file classification is embodied in that a user can adjust various parameters of clustering analysis according to actual requirements. Although the number of parameters required to be set by ISODATA is large, the parameters provide a flexible selection range for users, and the accuracy of cluster analysis can be adjusted to make the result of cluster analysis more in line with the actual situation by selecting different parameters such as the iteration upper limit times, the minimum distance between clusters, the maximum standard deviation in clusters, and the like. In addition, the user can reset parameters according to the analysis result after the clustering analysis, so that the clustering analysis is closer to the actual situation.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A self-iterative medical record file cluster analysis system is characterized in that: comprises a medical record importing module, a vector processing module and an ISODATA cluster analysis module, wherein,

the medical record importing module: comprises a filter, which is used for preliminarily filtering a medical record file imported by a user, extracting the variables to be analyzed from the medical record file according to an initialized mapping relation, and normalizing each variable in the medical record file for vector abstraction in the next step;

the vector processing module: is used for performing type and proportion conversion on different types of variables in the medical record files, including continuous variable conversion, logical variable conversion and text variable conversion, and, after the vector conversion is completed, storing the space vector coordinates of each individual in a space vector library for the subsequent ISODATA cluster analysis; the continuous variable conversion uses a continuous variable as one dimension of the space vector, takes its average value as the standard value, divides the variable value of each individual in the sample by the average value and multiplies the result by the standard value, and uses the converted value as the value of that dimension in the space vector; the logical variable conversion uses a logical variable that is "yes" or "no" as one dimension of the space vector, the corresponding value being the standard value if "yes" and 0 if "no", and this value being set as the value of that dimension; the text variable conversion comprises a special conversion method and a common conversion method; in the special conversion method, a conversion standard is preset in this module of the system and text is converted into specific numerical values according to the conversion standard; in the common conversion method, a user is required, when importing medical records, to set a mapping relation between different texts and numerical values for the text variable, and a vector conversion module assigns the corresponding numerical value as a dimension of the space vector according to the set values and the mapping relation,

the ISODATA cluster analysis module: is used for retrieving the space vectors to be analyzed from the space vector library in the vector processing module and performing ISODATA cluster analysis.

2. The self-iterative medical record archive cluster analysis system of claim 1, wherein: the ISODATA cluster analysis module is divided into seven secondary modules which are respectively an initialization module, a basic module I, a basic module II, a judgment and iteration module, a splitting module, a merging module and an ending module.

3. The self-iterative medical record archive cluster analysis system of claim 2, wherein: the basic module I comprises center subset extraction, clustering by a minimum distance method and cluster screening.

4. The self-iterative medical record archive cluster analysis system of claim 2, wherein: the judging and iterating module comprises cluster center correction, average distance calculation and calculation of overall average distance of all classes.