CN107480426A

CN107480426A - Self-iterating medical record archive cluster analysis system

Info

Publication number: CN107480426A
Application number: CN201710596235.5A
Authority: CN
Inventors: 童永安; 陈卫单; 陈勇强
Original assignee: Guangzhou Huiyang Health Science And Technology Co Ltd
Current assignee: Guangzhou Huiyang Health Science And Technology Co Ltd
Priority date: 2017-07-20
Filing date: 2017-07-20
Publication date: 2017-12-15
Anticipated expiration: 2037-07-20
Also published as: CN107480426B

Abstract

The present invention discloses a self-iterating medical record archive cluster analysis system comprising a medical record import module, a vector processing module, and an ISODATA cluster analysis module. The medical record import module extracts the variables to be analyzed from the medical record archives and standardizes them. The vector processing module converts the type and scale of the different kinds of variables in the medical record archives and, once vector conversion is complete, stores each individual's space vector coordinates in a space vector library. The ISODATA cluster analysis module retrieves the space vectors to be analyzed from the space vector library of the vector processing module and runs ISODATA cluster analysis on them. On the one hand, this requires less computation than hierarchical clustering, which is computationally heavy; on the other hand, it yields the most reasonable classification results. By applying cluster analysis to electronic medical records whose content is complex and varied, classification results for large batches of medical record archives can be obtained, which facilitates subsequent processing or analysis.

Description

Self-iterating medical record archive cluster analysis system

Technical field

The present invention relates to the field of medical technology, and in particular to a self-iterating medical record archive cluster analysis system.

Background technology

Different medical record archives have individual characteristics but also share common ones. In clinical research, certain features of different medical record archives often need to be analyzed so that the records can be classified for subsequent processing or analysis. However, existing cluster analysis operates only on specific numeric variables, so medical records with many variables of complex types are difficult to compute on directly. Moreover, the previously developed medical record archive cluster analysis system based on hierarchical clustering suffers from heavy computation and insufficiently accurate classification. To address these problems, the existing algorithm must be improved so that it suits the huge number and complex content of medical record archives.

Compared with hierarchical clustering, ISODATA requires less computation and yields the clustering result directly, without further screening by the user. Compared with the K-means clustering algorithm, the ISODATA algorithm can adjust the number of classes and so obtains more reasonable classification results. A medical record archive cluster analysis algorithm is therefore developed on the basis of the ISODATA algorithm and adapted to the characteristics of medical record archives.

Summary of the invention

In view of the problems described in the background art, the object of the present invention is to provide a self-iterating medical record archive cluster analysis system that applies cluster analysis to electronic medical records with complex and varied content and obtains classification results for large batches of medical record archives, thereby facilitating subsequent processing or analysis.

The technical solution of the present invention is realized as follows: a self-iterating medical record archive cluster analysis system comprising a medical record import module, a vector processing module, and an ISODATA cluster analysis module, wherein the medical record import module is used to preliminarily filter, by means of a filter, the medical record archives imported by the user, to extract the variables to be analyzed from the medical record archives according to the initialized mapping relations, and to standardize each variable in the medical record archives for the vector abstraction of the next step; the vector processing module is used to convert the type and scale of the different kinds of variables in the medical record archives, including continuous variable conversion, logical variable conversion, and text variable conversion, and, after vector conversion is complete, to store each individual's space vector coordinates in a space vector library for the ISODATA cluster analysis of the next step; and the ISODATA cluster analysis module is used to retrieve the space vectors to be analyzed from the space vector library of the vector processing module and to run ISODATA cluster analysis on them.

In the above technical solution, the text variable conversion is divided into special conversion and common conversion.

In the above technical solution, the ISODATA cluster analysis module is divided into seven secondary modules: an initialization module, basic module I, basic module II, a judgment and iteration module, a division module, a merging module, and a termination module.

In the above technical solution, basic module I includes central subset extraction, clustering by the minimum distance method, and cluster screening.

In the above technical solution, the judgment and iteration module includes cluster centre correction, average distance calculation, and calculation of the overall average distance of all classes.

The self-iterating medical record archive cluster analysis system of the present invention comprises a medical record import module, a vector processing module, and an ISODATA cluster analysis module. Each medical record archive is first abstracted into a space vector according to its specific attributes, and these space vectors are then fed into ISODATA cluster analysis. Using the parameter values selected by the user, the ISODATA cluster analysis module iterates over the space vectors multiple times and finally obtains the classification results. On the one hand, this requires less computation than hierarchical clustering, which is computationally heavy; on the other hand, it yields the most reasonable classification results. By applying cluster analysis to electronic medical records whose content is complex and varied, classification results for large batches of medical record archives can be obtained, which facilitates subsequent processing or analysis.

Brief description of the drawings

Fig. 1 is a module connection diagram of the self-iterating medical record archive cluster analysis system of the present invention;

Fig. 2 is a connection diagram of the seven secondary modules of the ISODATA cluster analysis module in Fig. 1;

Fig. 3 is the cluster analysis spatial distribution map of the specific example of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative work fall within the scope of protection of the present invention.

The key points of the self-iterating medical record archive cluster analysis system of the present invention are space vector conversion and ISODATA cluster analysis. Each medical record archive is first abstracted into a space vector according to its specific attributes, and these space vectors are then fed into ISODATA cluster analysis. Using the parameter values selected by the user, the ISODATA cluster analysis module iterates over the space vectors multiple times and finally obtains the classification results. The system comprises a medical record import module, a vector processing module, and an ISODATA cluster analysis module. The connections between the modules are shown in Fig. 1, and each module is described in detail below.

(1) Medical record import module:

The medical record import module performs preliminary processing of the medical record archives imported by the user. Its key component is the filter, which extracts the variables to be analyzed from the medical record archives according to the initialized mapping relations, for use in the vector abstraction of the next step. After the filter has processed the medical record archives, every variable in them is standardized.
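As a rough illustration of the filter described above, the following Python sketch keeps only the fields named in a mapping and renames them to standard variable names. The function name `filter_record`, the field names, and the whitespace trimming are assumptions made for this example; the text does not specify how the mapping relations are represented.

```python
def filter_record(raw_record, field_mapping):
    """Return only the mapped fields, keyed by their standard variable names."""
    cleaned = {}
    for source_name, standard_name in field_mapping.items():
        value = raw_record.get(source_name)
        if isinstance(value, str):
            value = value.strip()  # assumed normalization step
        if value is not None and value != "":
            cleaned[standard_name] = value
    return cleaned


# Hypothetical raw record and mapping, for illustration only.
raw = {"patient_age": 42, "dx": " type 2 diabetes ", "ward_phone": "5550100"}
mapping = {"patient_age": "age", "dx": "diagnosis"}
print(filter_record(raw, mapping))
```

Fields that are not listed in the mapping, such as `ward_phone` here, are dropped, which is one simple way to read "preliminary filtering".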

(2) Vector processing module:

The specific attributes of a medical record archive cannot be used directly for cluster analysis; they must first be abstracted into a vector. The type and scale of each variable therefore have to be converted according to certain rules. Different kinds of variables in the medical record archives are converted by different methods, which fall into three main categories: continuous variable conversion, logical variable conversion, and text variable conversion. The details are as follows:

A. Continuous variable conversion: each continuous variable becomes one dimension of the space vector. Its mean is taken to correspond to the standard value 100 (another standard value can be set manually). Each individual's value in the sample is divided by the mean and multiplied by the standard value, and the result is the individual's value in that dimension of the space vector.
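The continuous conversion rule can be sketched in a few lines of Python. The function name `scale_continuous` and the sample ages are illustrative assumptions; the zero-mean check is an added safeguard that the text does not discuss.

```python
from statistics import mean


def scale_continuous(values, standard_value=100.0):
    """Scale values so that the sample mean maps to standard_value."""
    avg = mean(values)
    if avg == 0:
        raise ValueError("mean is zero; the scaling rule is undefined")
    return [v / avg * standard_value for v in values]


ages = [30, 60, 90]            # hypothetical sample; mean is 60
print(scale_continuous(ages))  # [50.0, 100.0, 150.0]
```

Dividing by the mean puts variables with very different units on a comparable scale before distances are computed.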

B. Logical variable conversion: each yes/no logical variable becomes one dimension of the space vector. "Yes" corresponds to the value 100 (another standard value can be set manually) and "no" corresponds to 0; this is the individual's value in that dimension.

C. Text variable conversion: text variables are converted in one of two ways, special conversion or common conversion. Both methods apply some standard to quantify text data as numeric data.

Special conversion method: the system module has preset conversion standards, and text values are converted to specific numerical values according to them. Take diagnosis as an example: a diagnostic result is a character variable, and a four-dimensional disease spectrum is preset in the system in which each disease has its own spatial coordinates. The disease spectrum is a four-dimensional space constructed according to a certain standard from the department each disease belongs to, the relationships between diseases, and even the severity of the disease.

For example, hyperthyroidism, type 1 diabetes, and type 2 diabetes share certain similarities and all belong to endocrine system diseases; type 1 and type 2 diabetes are more similar to each other, so their coordinates in the disease spectrum are closer. The coordinates are (102, 321, 210, 3) for hyperthyroidism, (102, 321, 211, 4) for type 1 diabetes, and (102, 321, 211, 5) for type 2 diabetes. The vector conversion module therefore integrates the coordinates of the diagnostic result in the disease spectrum into the space vector. Besides the disease spectrum, there are also a surgical operation spectrum, a prescription spectrum, and so on, all of which belong to the special conversion method.
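The disease spectrum lookup can be pictured as a table from diagnosis to four-dimensional coordinates. The sketch below uses only the three coordinates quoted in the text; the dictionary layout, the function names, and the use of Euclidean distance to compare diagnoses are assumptions made for illustration.

```python
import math

# Coordinates quoted in the text; a real spectrum would hold many more entries.
DISEASE_SPECTRUM = {
    "hyperthyroidism": (102, 321, 210, 3),
    "type 1 diabetes": (102, 321, 211, 4),
    "type 2 diabetes": (102, 321, 211, 5),
}


def diagnosis_to_coords(diagnosis):
    """Look up the preset four-dimensional coordinates of a diagnosis."""
    return DISEASE_SPECTRUM[diagnosis.lower()]


def spectrum_distance(a, b):
    """Euclidean distance between two diagnoses in the spectrum (assumed metric)."""
    return math.dist(diagnosis_to_coords(a), diagnosis_to_coords(b))


print(spectrum_distance("type 1 diabetes", "type 2 diabetes"))
print(spectrum_distance("hyperthyroidism", "type 1 diabetes"))
```

With these coordinates the two diabetes types lie closer to each other than either does to hyperthyroidism, which matches the similarity described in the text.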

Common conversion method: when importing medical records, the user sets, for each text variable, the mapping between the different text values and numerical values, for example excellent, good, fair, and poor corresponding to 100, 75, 50, and 25 respectively. The vector conversion module assigns the corresponding value according to the configured values and mapping, and the value becomes one dimension of the space vector.
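A user-defined mapping of this kind can be sketched as a small lookup function. The helper name `make_text_converter` and the English labels are assumptions made for the example; only the four values 100, 75, 50, and 25 come from the text.

```python
def make_text_converter(mapping):
    """Build a converter from a user-supplied text-to-number mapping."""
    def convert(text):
        try:
            return float(mapping[text])
        except KeyError:
            raise ValueError(f"no numeric value configured for {text!r}") from None
    return convert


# Hypothetical labels for a four-level rating.
rating = make_text_converter({"excellent": 100, "good": 75, "fair": 50, "poor": 25})
print(rating("good"))
```

Raising an error for unmapped text is one possible design choice; it makes gaps in the user's mapping visible instead of silently producing a default value.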

After the vector conversion is complete, each individual's space vector coordinates are stored in the space vector library for use in the ISODATA cluster analysis of the next step.
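Putting the three conversions together, one record could be turned into a space vector roughly as follows. This is a minimal sketch: the function `to_vector`, the record fields, and the use of a plain list as the "space vector library" are all assumptions, and text variables are handled with the common conversion only.

```python
def to_vector(record, means, text_maps, standard_value=100.0):
    """Convert one filtered record into a list of numeric coordinates."""
    vector = []
    for name, value in record.items():
        if isinstance(value, bool):            # logical variable: yes -> 100, no -> 0
            vector.append(standard_value if value else 0.0)
        elif isinstance(value, (int, float)):  # continuous variable: value / mean * 100
            vector.append(value / means[name] * standard_value)
        else:                                  # text variable, common conversion
            vector.append(float(text_maps[name][value]))
    return vector


# Hypothetical records, for illustration only.
records = [
    {"age": 30, "smoker": True, "recovery": "good"},
    {"age": 90, "smoker": False, "recovery": "poor"},
]
means = {"age": sum(r["age"] for r in records) / len(records)}
text_maps = {"recovery": {"excellent": 100, "good": 75, "fair": 50, "poor": 25}}
vector_store = [to_vector(r, means, text_maps) for r in records]
print(vector_store)
```

The boolean check comes before the numeric check because Python treats `bool` as a subtype of `int`.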

(3) ISODATA cluster analysis module:

The core of the ISODATA cluster analysis module is the ISODATA algorithm. The module retrieves the space vectors to be analyzed from the space vector library of the vector processing module and runs ISODATA cluster analysis on them.

The ISODATA cluster analysis module is divided into seven secondary modules, connected as shown in Fig. 2:

A. Initialization module:

a. Parameter initialization: the following parameters must be initialized before ISODATA cluster analysis starts:

Parameter name	Meaning
K	Target number of clusters
k	Initially set number of clusters
θ_N	Minimum number of vectors in each cluster; a cluster with fewer vectors is not kept as a separate cluster
θ_s	Maximum standard deviation of distance allowed within a cluster; a cluster exceeding this value is split
θ_c	Minimum distance between two cluster centres; two clusters closer than this value are merged
L	Maximum number of cluster-centre pairs that can be merged in one iteration
I	Number of iterations
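For readers who want to experiment, the parameter set can be carried around as a small Python dataclass. The class name `IsodataParams`, the ASCII field names, and the validation limits are assumptions; the sample values are those of the worked example given later in this description, reading the maximum standard deviation threshold as θ_s.

```python
from dataclasses import dataclass


@dataclass
class IsodataParams:
    K: int          # target number of clusters
    k: int          # initial number of clusters
    theta_n: int    # minimum number of vectors per cluster
    theta_s: float  # maximum within-cluster standard deviation before splitting
    theta_c: float  # minimum distance between centres before merging
    L: int          # maximum number of centre pairs merged per iteration
    I: int          # number of iterations (upper limit)

    def __post_init__(self):
        # Basic sanity checks; these limits are assumptions, not taken from the text.
        if min(self.K, self.k, self.theta_n, self.I) < 1:
            raise ValueError("K, k, theta_n and I must be at least 1")
        if self.theta_s <= 0 or self.theta_c < 0 or self.L < 0:
            raise ValueError("theta_s must be positive; theta_c and L non-negative")


params = IsodataParams(K=3, k=2, theta_n=2, theta_s=20, theta_c=20, L=2, I=20)
print(params)
```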

B. Basic module I:

b. Central subset extraction: k samples are randomly selected from the space vector library as the cluster centre subset {z₁, z₂, z₃, ..., z_k}.

c. Clustering by the minimum distance method: if

$$\lVert x - z_i \rVert = \min_{1 \le j \le k} \lVert x - z_j \rVert$$

then the space vector x is assigned to the nearest cluster S_i.

d. Cluster screening: if the number of space vectors in S_i is less than the specified minimum θ_N, the cluster is cancelled and k = k - 1.
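The three steps of basic module I can be sketched in plain Python as follows. The function names are illustrative, Euclidean distance is assumed as the metric, and reassigning the members of a cancelled cluster to the remaining centres is an assumption; the text only states that the cluster is cancelled.

```python
import math
import random


def pick_centres(vectors, k, seed=None):
    """Randomly choose k vectors from the library as initial cluster centres."""
    return random.Random(seed).sample(vectors, k)


def assign_nearest(vectors, centres):
    """Assign every vector to the cluster of its nearest centre."""
    clusters = [[] for _ in centres]
    for v in vectors:
        nearest = min(range(len(centres)), key=lambda i: math.dist(v, centres[i]))
        clusters[nearest].append(v)
    return clusters


def screen_clusters(vectors, centres, theta_n):
    """Cancel clusters with fewer than theta_n members, then reassign (assumed)."""
    clusters = assign_nearest(vectors, centres)
    kept = [c for c, members in zip(centres, clusters) if len(members) >= theta_n]
    if not kept:
        raise ValueError("every cluster is smaller than theta_n")
    return kept, assign_nearest(vectors, kept)


# Hypothetical two-dimensional vectors, for illustration only.
vectors = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (50, 50)]
centres, clusters = screen_clusters(vectors, [(0, 0), (10, 10), (50, 50)], theta_n=2)
print(centres)   # the singleton cluster around (50, 50) is cancelled
```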

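C. Basic module II:

e. Cluster centre correction: for the j-th dimension of the i-th class, the centre value is revised to

$$z_{ij} = \frac{1}{N_i} \sum_{x \in S_i} x_j$$

where N_i is the number of space vectors in S_i.

f. Average distance calculation: compute the average distance from each space vector in a cluster to the cluster centre:

$$\bar{D}_i = \frac{1}{N_i} \sum_{x \in S_i} \lVert x - z_i \rVert$$

g. Compute the overall average distance of all classes:

$$\bar{D} = \frac{1}{N} \sum_{i=1}^{k} N_i \bar{D}_i$$

where N is the total number of space vectors.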
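The three quantities of basic module II can be computed as in the sketch below. It assumes the usual ISODATA definitions: the corrected centre is the per-dimension mean of the cluster members, and the overall average distance is the size-weighted mean of the per-cluster average distances. The function names and sample clusters are illustrative.

```python
import math


def update_centre(members):
    """Corrected centre: mean of the members in every dimension."""
    n = len(members)
    return tuple(sum(v[j] for v in members) / n for j in range(len(members[0])))


def mean_distance(members, centre):
    """Average distance from the members of one cluster to its centre."""
    return sum(math.dist(v, centre) for v in members) / len(members)


def overall_mean_distance(clusters, centres):
    """Size-weighted average of the per-cluster mean distances."""
    total = sum(len(members) for members in clusters)
    weighted = sum(len(m) * mean_distance(m, c) for m, c in zip(clusters, centres))
    return weighted / total


# Hypothetical clusters, for illustration only.
clusters = [[(0, 0), (2, 0)], [(10, 0), (10, 4), (10, 8)]]
centres = [update_centre(members) for members in clusters]
print(centres)
print(overall_mean_distance(clusters, centres))
```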

D. Judgment and iteration module:

h. The following judgments are made:

1. If the number of iterations has reached I, set θ_c = 0 and go to module G.

2. If k ≤ K/2, go to the division module (module E) and perform splitting.

3. If k ≥ 2K, go to the merging module (module F) and perform merging.

4. If K/2 < k < 2K, go to the division module when the iteration count is odd and to the merging module when it is even.
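The branching of the judgment step can be written as a small function. This sketch reads the forced-merge rule as k ≥ 2K, the usual ISODATA convention, which is an assumption here; it also treats "go to module G" as ending the run. The function name and the returned labels are illustrative.

```python
def next_step(iteration, k, K, max_iterations):
    """Decide what follows the judgment step in one ISODATA iteration."""
    if iteration >= max_iterations:
        # The text also sets theta_c = 0 at this point before moving on.
        return "terminate"
    if k <= K / 2:
        return "split"
    if k >= 2 * K:          # assumed threshold for forced merging
        return "merge"
    # K/2 < k < 2K: split on odd iterations, merge on even ones.
    return "split" if iteration % 2 == 1 else "merge"


for it in (1, 2):
    print(it, next_step(it, k=2, K=3, max_iterations=20))
```

With K = 3 and k = 2, as in the worked example's initial settings, the sketch alternates between splitting and merging.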

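E. Division module:

i. For each cluster S_i, compute the standard deviation of each dimension within the cluster:

$$\sigma_{ij} = \sqrt{\frac{1}{N_i} \sum_{x \in S_i} \left(x_j - z_{ij}\right)^2}$$

where z_{ij} is the j-th coordinate of the centre of S_i and N_i is the number of space vectors in S_i.

Find the maximum σ_{i max} of the per-dimension standard deviations in each cluster.

For σ_{i max}, if σ_{i max} > θ_s and either of the following two conditions holds:

1. The average distance within the class, $\bar{D}_i$, is greater than the overall average distance of all classes, $\bar{D}$, and the number of space vectors in the class is more than twice θ_N, that is, N_i > 2θ_N.

2. k ≤ K/2.

then the cluster is split into two clusters, whose centres are respectively

$$z_i^{+} = z_i + h\,\sigma_{i\,max}, \qquad z_i^{-} = z_i - h\,\sigma_{i\,max}$$

where the offset is applied along the dimension with the largest standard deviation and h is any value between 0 and 1, so that the distances from each vector in the original cluster to the two new cluster centres differ.

After the split is complete, k = k + 1.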
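The split test and the construction of the two new centres can be sketched as follows. The sketch assumes the conventional ISODATA form in which the old centre is shifted by plus and minus h times the largest standard deviation along that dimension; the function names and the default h = 0.5 are illustrative choices, not taken from the text.

```python
import math


def std_per_dimension(members, centre):
    """Standard deviation of the members around the centre, per dimension."""
    n = len(members)
    return [math.sqrt(sum((v[j] - centre[j]) ** 2 for v in members) / n)
            for j in range(len(centre))]


def should_split(sigma_max, theta_s, mean_dist, overall_mean_dist, n_i, theta_n, k, K):
    """Split when sigma_max > theta_s and one of the two conditions holds."""
    if sigma_max <= theta_s:
        return False
    return (mean_dist > overall_mean_dist and n_i > 2 * theta_n) or k <= K / 2


def split_centre(centre, sigmas, h=0.5):
    """Return two centres shifted by +/- h * sigma along the widest dimension."""
    j = max(range(len(sigmas)), key=sigmas.__getitem__)
    plus, minus = list(centre), list(centre)
    plus[j] += h * sigmas[j]
    minus[j] -= h * sigmas[j]
    return tuple(plus), tuple(minus)


# Hypothetical cluster, for illustration only.
members = [(0, 0), (0, 10)]
sigmas = std_per_dimension(members, (0, 5))
print(sigmas)
print(split_centre((0, 5), sigmas))
```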

F. Merging module:

j. Compute the distance between every pair of cluster centres C_i and C_j:

D_ij = d(C_i, C_j)

k. Compare each D_ij with θ_c, and sort the D_ij that are less than θ_c in ascending order.

Starting from the smallest D_ij, merge the corresponding C_i and C_j; the new cluster centre is

$$C^{*} = \frac{N_i C_i + N_j C_j}{N_i + N_j}$$

where N_i and N_j are the numbers of space vectors in the two clusters, and then k = k - 1;

l. Continuing from the second-smallest D_ij, merge the two corresponding cluster centres if neither of them has already been merged. Stop merging once the number of merged pairs reaches L.
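The merging procedure can be sketched as below. It assumes Euclidean distance between centres and a size-weighted mean for the merged centre, which is the conventional ISODATA choice; the function name `merge_centres` and the sample data are illustrative.

```python
import math
from itertools import combinations


def merge_centres(centres, sizes, theta_c, max_pairs):
    """Merge up to max_pairs centre pairs closer than theta_c, nearest first."""
    distances = [(math.dist(centres[i], centres[j]), i, j)
                 for i, j in combinations(range(len(centres)), 2)]
    candidates = sorted(d for d in distances if d[0] < theta_c)

    used, merged, merged_sizes = set(), [], []
    for _, i, j in candidates:
        if len(merged) == max_pairs:
            break
        if i in used or j in used:   # each centre takes part in one merge at most
            continue
        n = sizes[i] + sizes[j]
        merged.append(tuple((sizes[i] * a + sizes[j] * b) / n
                            for a, b in zip(centres[i], centres[j])))
        merged_sizes.append(n)
        used.update((i, j))

    keep = [i for i in range(len(centres)) if i not in used]
    return ([centres[i] for i in keep] + merged,
            [sizes[i] for i in keep] + merged_sizes)


# Hypothetical centres and cluster sizes, for illustration only.
new_centres, new_sizes = merge_centres(
    [(0, 0), (2, 0), (100, 100)], [1, 3, 5], theta_c=20, max_pairs=2)
print(new_centres, new_sizes)
```

Setting `theta_c` to 0 disables merging entirely, because no distance is less than 0.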

G. Termination module:

m. Increment the iteration count: i = i + 1.

n. If the iteration count has reached the upper limit, terminate the iteration. Otherwise, return to module B.

The present invention is further explained below with reference to a specific example:

Ten medical record archives need to undergo cluster analysis; their parameters are shown in the following table:

Their spatial distribution is shown in Fig. 3:

The parameter values are set as follows:

Parameter name	Parameter value
K	3
k	2
θ_N	2
θ_s	20
θ_c	20
L	2
I	20

The two cluster centres initially set are (0,20) and (25,200), but after multiple iterations the two clusters split into three, whose new cluster centres are (2,20), (11,83), and (25,250) respectively. As the figure shows, this batch of medical records can be divided into three classes, located in the lower left corner, the middle, and the upper right corner.

In summary, compared with the prior art, the self-iterating medical record archive cluster analysis system of the present invention has the following beneficial effects:

1. Existing cluster analysis methods include hierarchical clustering, K-means clustering, and others, but they share one problem: they cannot automatically adjust the number of classes according to the actual distribution of the vectors. K-means, for example, requires the number of classes to be preset, and the number of clusters finally obtained equals it. The main feature of the ISODATA algorithm is that it adjusts the number of clusters and the cluster centres according to actual conditions, so that the clustering result better matches the actual distribution. In practice, researchers' judgment of a reasonable number of clusters may be biased, and the expected number of clusters may not match the actual distribution; with ISODATA, the number of clusters can be adjusted according to actual conditions, making the classification of medical record archives more reasonable.

2. If large batches of medical record archives are classified manually, especially when each must be classified according to multiple variables, the classifier has to analyze the variables comprehensively and judge the class each record belongs to. This takes a great deal of time and effort and is extremely inefficient. With a hierarchical clustering system, correlation coefficients can be computed from the quantified variables and the cluster analysis result obtained from them; this process can use a computer to handle massive amounts of data, greatly improving working efficiency.

3. The flexibility of the ISODATA clustering system for medical record archive classification lies in the fact that the user can adjust the parameters of the cluster analysis according to actual needs. Although ISODATA requires many parameters to be set, these parameters give the user a flexible range of choices: by selecting different values for the iteration upper limit, the minimum distance between clusters, the maximum standard deviation within a cluster, and so on, the user can tune the precision of the cluster analysis so that its result better matches the actual situation. In addition, after one cluster analysis the user can reset the parameters in light of the result, bringing the cluster analysis closer to actual conditions.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A self-iterating medical record archive cluster analysis system, characterized in that it comprises a medical record import module, a vector processing module, and an ISODATA cluster analysis module, wherein:

the medical record import module is used to preliminarily filter, by means of a filter, the medical record archives imported by the user, to extract the variables to be analyzed from the medical record archives according to the initialized mapping relations, and to standardize each variable in the medical record archives for the vector abstraction of the next step;

the vector processing module is used to convert the type and scale of the different kinds of variables in the medical record archives, including continuous variable conversion, logical variable conversion, and text variable conversion, and, after vector conversion is complete, to store each individual's space vector coordinates in a space vector library for the ISODATA cluster analysis of the next step;

the ISODATA cluster analysis module is used to retrieve the space vectors to be analyzed from the space vector library of the vector processing module and to run ISODATA cluster analysis on them.

2. The self-iterating medical record archive cluster analysis system according to claim 1, characterized in that the text variable conversion is divided into special conversion and common conversion.

3. The self-iterating medical record archive cluster analysis system according to claim 1, characterized in that the ISODATA cluster analysis module is divided into seven secondary modules: an initialization module, basic module I, basic module II, a judgment and iteration module, a division module, a merging module, and a termination module.

4. The self-iterating medical record archive cluster analysis system according to claim 3, characterized in that basic module I includes central subset extraction, clustering by the minimum distance method, and cluster screening.

5. The self-iterating medical record archive cluster analysis system according to claim 3, characterized in that the judgment and iteration module includes cluster centre correction, average distance calculation, and calculation of the overall average distance of all classes.