CN109685139A

CN109685139A - Method and system for extracting gastroesophageal reflux disease risk factors based on precise clustering

Info

Publication number: CN109685139A
Application number: CN201811589375.0A
Authority: CN
Inventors: 刘万里; 徐雷; 黄玉珍; 姚澜; 李荣臻; 夏吉安
Original assignee: 刘万里
Current assignee: Nanjing Hospital of Integrated Traditional and Chinese Medicine
Priority date: 2018-12-25
Filing date: 2018-12-25
Publication date: 2019-04-26

Abstract

The invention discloses a method and system for extracting gastroesophageal reflux disease risk factors based on precise clustering. First, an initial patient information set containing gastroesophageal reflux disease risk factors is constructed. Second, the factors in the initial patient information set are quantized to obtain a quantized data matrix. Third, each sample point in the quantized data matrix is clustered using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram. Fourth, the number of clusters is determined from the hierarchical clustering dendrogram, and the elements in the quantized data matrix are clustered using that number together with the K-Means clustering algorithm to obtain multiple class clusters. Finally, the correlation index between the elements within each class cluster is calculated, and the element with the largest correlation index is determined to be a gastroesophageal reflux disease risk factor. By combining two clustering methods, the invention efficiently screens out the risk factors that cause gastroesophageal reflux disease and reduces its incidence.

Description

Method and system for extracting gastroesophageal reflux disease risk factors based on precise clustering

Technical field

The present invention relates to the technical fields of clustering and medicine, and in particular to a method and system for extracting gastroesophageal reflux disease risk factors based on precise clustering.

Background art

Gastroesophageal reflux disease is a digestive system disease that is prevalent worldwide, and its incidence is rising year by year. Its treatment therefore deserves serious attention. Because the onset of gastroesophageal reflux disease is closely related to lifestyle, emotional changes, eating habits, and similar factors, and because the condition changes easily, collecting large amounts of data and analyzing their characteristics plays an important role in studying and preventing the disease.

At present, gastroesophageal reflux disease diagnostic techniques mainly use clustering algorithms to extract risk factors. However, choosing the number of clusters and the cluster centers is difficult, and a wrong choice of either often leads to low precision in risk factor extraction.

Summary of the invention

The object of the present invention is to provide a method and system for extracting gastroesophageal reflux disease risk factors based on precise clustering, so as to solve the prior-art problem of low risk factor extraction precision caused by wrongly chosen cluster numbers and cluster centers.

To achieve the above object, the present invention provides the following solutions:

A method for extracting gastroesophageal reflux disease risk factors based on precise clustering, the method comprising:

Constructing an initial patient information set; the initial patient information set is a data set of M rows and N columns; the factor in row i, column 1 of the initial patient information set is a patient questionnaire ID number, and the column-1 factors in different rows represent different patient questionnaire ID numbers; the factor in row 1, column j of the initial patient information set is a question of the questionnaire, and the row-1 factors in different columns represent different questions; the factor in row i, column j of the initial patient information set is the answer given under the i-th patient questionnaire ID number to the j-th question; wherein 2≤i≤M and 2≤j≤N;

Performing data quantization on the answers in the initial patient information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a patient questionnaire ID number, and the column-1 elements in different rows represent different patient questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a question of the questionnaire, and the row-1 elements in different columns represent different questions; the element in row i, column j of the quantized data matrix is the quantized result of the answer given under the i-th patient questionnaire ID number to the j-th question; wherein 2≤i≤M and 2≤j≤N;

Clustering each sample point in the quantized data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the quantized data matrix; the number of sample points is the same as the number of columns of the quantized data matrix, wherein 2≤z≤M;

Determining the number of clusters according to the hierarchical clustering dendrogram;

Clustering the elements in the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain multiple class clusters;

Calculating the correlation index between the elements within each class cluster, and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor; the correlation index is the average of the squared correlation coefficients.

Optionally, clustering each sample point in the quantized data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram specifically comprises:

Clustering each sample point in the quantized data matrix using an agglomerative hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram.

Optionally, clustering each sample point in the quantized data matrix using an agglomerative hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram specifically comprises:

Step 1: calculating the distance between every pair of sample points;

Step 2: selecting the two sample points with the smallest distance and merging them into one class;

Step 3: repeating step 1 and step 2 until all sample points are merged into one class, thereby obtaining the hierarchical clustering dendrogram.

Optionally, calculating the distance between every pair of sample points specifically comprises:

Calculating the distance between every pair of sample points using an average distance algorithm.

Optionally, before clustering the elements in the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain multiple class clusters, the method further comprises:

Determining initial cluster centers using the K-Means++ algorithm.

Optionally, clustering the elements in the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain multiple class clusters specifically comprises:

Inputting the number of clusters, the initial cluster centers, and the quantized data matrix, and running the program corresponding to the K-Means clustering algorithm to perform cluster analysis on the elements in the quantized data matrix, thereby obtaining multiple class clusters.

A system for extracting gastroesophageal reflux disease risk factors based on precise clustering, comprising:

An initial patient information set construction module, configured to construct an initial patient information set; the initial patient information set is a data set of M rows and N columns; the factor in row i, column 1 of the initial patient information set is a patient questionnaire ID number, and the column-1 factors in different rows represent different patient questionnaire ID numbers; the factor in row 1, column j of the initial patient information set is a question of the questionnaire, and the row-1 factors in different columns represent different questions; the factor in row i, column j of the initial patient information set is the answer given under the i-th patient questionnaire ID number to the j-th question; wherein 2≤i≤M and 2≤j≤N;

A quantized data matrix obtaining module, configured to perform data quantization on the answers in the initial patient information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a patient questionnaire ID number, and the column-1 elements in different rows represent different patient questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a question of the questionnaire, and the row-1 elements in different columns represent different questions; the element in row i, column j of the quantized data matrix is the quantized result of the answer given under the i-th patient questionnaire ID number to the j-th question; wherein 2≤i≤M and 2≤j≤N;

A hierarchical clustering dendrogram obtaining module, configured to cluster each sample point in the quantized data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the quantized data matrix; the number of sample points is the same as the number of columns of the quantized data matrix, wherein 2≤z≤M;

A cluster number determining module, configured to determine the number of clusters according to the hierarchical clustering dendrogram;

A class cluster obtaining module, configured to cluster the elements in the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain multiple class clusters;

A gastroesophageal reflux disease risk factor determining module, configured to calculate the correlation index between the elements within each class cluster and to determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor; the correlation index is the average of the squared correlation coefficients.

Optionally, the hierarchical clustering dendrogram obtaining module specifically comprises:

A hierarchical clustering dendrogram obtaining unit, configured to cluster each sample point in the quantized data matrix using an agglomerative hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram.

Optionally, the system further comprises:

An initial cluster center determining module, configured to determine initial cluster centers using the K-Means++ algorithm.

Optionally, the class cluster obtaining module specifically comprises:

A class cluster obtaining unit, configured to input the number of clusters, the initial cluster centers, and the quantized data matrix, and to run the program corresponding to the K-Means clustering algorithm to perform cluster analysis on the elements in the quantized data matrix, thereby obtaining multiple class clusters.

According to the specific embodiments provided by the present invention, the invention discloses the following technical effects:

The present invention proposes a method and system that extract gastroesophageal reflux disease risk factors accurately, mainly by combining two clustering methods. It combines hierarchical clustering with K-means clustering and applies them to population clustering, clustering along two dimensions (features and population) to obtain precisely clustered populations in different class clusters, and then uses statistical methods to analyze the key factors that determine each class of population. By combining two clustering methods, the invention thus efficiently screens out the risk factors that cause gastroesophageal reflux disease, provides a scientific basis for future medical research and disease diagnosis, guides the management of gastroesophageal reflux disease, and reduces its incidence.

Brief description of the drawings

In order to explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings required by the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of the method for extracting gastroesophageal reflux disease risk factors based on precise clustering according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of the system for extracting gastroesophageal reflux disease risk factors based on precise clustering according to an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

The object of the present invention is to provide a method and system for extracting gastroesophageal reflux disease risk factors based on precise clustering, which can screen out the risk factors that cause gastroesophageal reflux disease efficiently and accurately.

To make the above objects, features, and advantages of the present invention clearer and easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.

Embodiment 1

Fig. 1 is a schematic flowchart of the method for extracting gastroesophageal reflux disease risk factors based on precise clustering according to an embodiment of the present invention. As shown in Fig. 1, the method provided by this embodiment specifically includes the following steps.

Step 101: construct an initial patient information set; the initial patient information set is a data set of M rows and N columns; the factor in row i, column 1 of the initial patient information set is a patient questionnaire ID number, and the column-1 factors in different rows represent different patient questionnaire ID numbers; the factor in row 1, column j of the initial patient information set is a question of the questionnaire, and the row-1 factors in different columns represent different questions; the factor in row i, column j of the initial patient information set is the answer given under the i-th patient questionnaire ID number to the j-th question; wherein 2≤i≤M, 2≤j≤N, and i and j are positive integers.

Step 102: perform data quantization on the answers in the initial patient information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a patient questionnaire ID number, and the column-1 elements in different rows represent different patient questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a question of the questionnaire, and the row-1 elements in different columns represent different questions; the element in row i, column j of the quantized data matrix is the quantized result of the answer given under the i-th patient questionnaire ID number to the j-th question; wherein 2≤i≤M and 2≤j≤N. In other words, the quantized data matrix is the initial matrix of key risk factors.

Step 103: cluster each sample point in the quantized data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the quantized data matrix; the number of sample points is the same as the number of columns of the quantized data matrix, wherein 2≤z≤M and z is a positive integer.

Step 104: determine the number of clusters according to the hierarchical clustering dendrogram.

Step 105: cluster the elements in the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain multiple class clusters; the number of class clusters equals the number of clusters, and different class clusters represent different categories of patients.

Step 106: calculate the correlation index between the elements within each class cluster, and determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.

Step 101 specifically includes:

In this embodiment, questionnaires are distributed in the hospital to each person with gastroesophageal reflux disease, and the initial patient information set is built from the returned questionnaires.

In this embodiment the initial patient information set has 241 dimensions in total, comprising the uniquely identifying patient questionnaire ID number and the answers to each question of the questionnaire.

The questionnaire contains several groups of questions, covering general demographic data, lifestyle, eating habits, psychological factors, sleep factors, and so on, together with the respondents' answers. The questionnaire has three question types: single-choice questions, true-or-false questions, and open-ended questions.

Step 102 quantizes the respondents' answers.

Specifically, the patient questionnaire ID number serves as the unique identifier. For single-choice questions whose options express a severity level, for example "often", "occasionally", "seldom", and "never", the weights 4, 3, 2, and 1 are assigned in order of severity, and the weight matching the actual answer is selected. For true-or-false questions with yes/no options, "yes" is assigned 1 and "no" is assigned 0, and the value matching the actual answer is selected. Questions whose options have no severity ordering, such as occupation, contribute nothing to the result and can be deleted. For open-ended questions, the continuous numerical value entered by the user is used directly as the data. For example, for a user who is 45 years old, has a height of 172, is not infected with HP, and often has sleep disorders, the sample columns are [age, height, HP infection, sleep disorder status] and the data values are [45, 172, 0, 4]. The purpose of this step is to quantize all respondents' answers and obtain the quantized data matrix R.
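The quantization rules above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the question-kind labels (`single_choice`, `yes_no`, `numeric`, `unordered`), the function names, and the English option strings are assumptions introduced here; only the weights 4, 3, 2, 1 and the values 1 and 0 come from the text.

```python
# Hypothetical weight tables; the weights follow the example in the text.
SEVERITY_WEIGHTS = {"often": 4, "occasionally": 3, "seldom": 2, "never": 1}
YES_NO_VALUES = {"yes": 1, "no": 0}


def quantize_answer(kind, answer):
    """Map one questionnaire answer to a number.

    `kind` is a hypothetical label: "single_choice", "yes_no", or "numeric".
    """
    if kind == "single_choice":
        return SEVERITY_WEIGHTS[answer]
    if kind == "yes_no":
        return YES_NO_VALUES[answer]
    if kind == "numeric":
        return float(answer)
    raise ValueError(f"unsupported question kind: {kind}")


def quantize_row(kinds, answers):
    """Quantize one respondent's answers, dropping unordered questions."""
    return [
        quantize_answer(kind, answer)
        for kind, answer in zip(kinds, answers)
        if kind != "unordered"  # e.g. occupation: deleted, as in the text
    ]


# The worked example from the text: age 45, height 172, no HP infection,
# sleep disorders "often"; the occupation answer is dropped.
kinds = ["numeric", "numeric", "yes_no", "single_choice", "unordered"]
answers = ["45", "172", "no", "often", "teacher"]
row = quantize_row(kinds, answers)
print(row)  # [45.0, 172.0, 0, 4]
```

Stacking one such row per respondent, keyed by the patient questionnaire ID number, would give the data part of the matrix R.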

Step 103 specifically includes:

Cluster analysis can cluster either indicators or samples; here the indicators are clustered.

This embodiment uses hierarchical clustering, and more specifically agglomerative hierarchical clustering, to cluster each sample point in the quantized data matrix R. The basic principle is to treat each object to be clustered as its own class, measure the similarity between classes with a class statistic, merge the two closest classes into one, and keep merging step by step until all objects have been merged into a single class.

The steps of agglomerative hierarchical clustering are as follows:

Treat each sample point in the quantized data matrix R as an independent class.

Using the distance formula, calculate the distance between every pair of classes and find the two classes c1 and c2 with the smallest distance.

Merge class c1 and class c2 into one class;

Repeat the above steps until all sample points are merged into one class, which yields a hierarchical clustering dendrogram.
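The merge loop can be sketched as follows. This is a naive illustration under stated assumptions: it uses the average distance between classes with Euclidean point distances, which this embodiment selects as its distance measure, and the function names and toy points are hypothetical. Whether the points are rows or columns of R is left to the caller.

```python
import math
from itertools import combinations


def average_distance(points, c1, c2):
    """Mean Euclidean distance over all cross-class point pairs."""
    total = sum(math.dist(points[a], points[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerative_merges(points):
    """Return the merge history [(members_a, members_b, distance), ...]."""
    clusters = [(i,) for i in range(len(points))]  # each point starts alone
    merges = []
    while len(clusters) > 1:
        # Find the pair of current classes with the smallest distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(points, pair[0], pair[1]),
        )
        merges.append((a, b, average_distance(points, a, b)))
        # Merge the two classes into one.
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return merges


# Toy data: two tight pairs that are far from each other.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
merges = agglomerative_merges(points)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

The recorded merge distances are the heights at which branches join in a dendrogram. A large jump between consecutive merge distances suggests a natural place to cut the tree, although the text leaves the final choice of cluster number to domain expertise.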

Regarding the distance measure, this embodiment uses the average distance method to calculate the distance between every pair of classes.

The average distance method computes the distance from each data point in one class to every data point in the other class and takes the mean of all these distances as the distance between the two classes. The calculation formula is as follows:

$$d(c_1, c_2) = \frac{1}{|c_1|\,|c_2|} \sum_{x \in c_1} \sum_{z \in c_2} \mathrm{dist}(x, z)$$

where $|c_1|$ and $|c_2|$ are the numbers of data points in classes $c_1$ and $c_2$, and dist(x, z) is the Euclidean distance.

Step 104 specifically includes:

Drawing on domain expertise, the number of clusters k is determined by analyzing the hierarchical clustering dendrogram; k is also the number of risk factors to be finally selected.

Using the number of clusters k obtained in step 104, K-Means clustering is performed on the respondent variables, that is, on the quantized data matrix R.

Before step 105 is executed, the method further includes determining the initial cluster centers using the K-Means++ algorithm.

The idea behind K-Means++ initial center selection is that the initial cluster centers should be as far apart from one another as possible. The algorithm steps are as follows:

1. Randomly select one element of the quantized data matrix R as the first initial cluster center.

2. For each element, calculate the distance D(x) to its nearest initial cluster center, obtaining D(1), D(2), ..., D(n) in turn to form the set D, then sum all the distances to obtain Sum(D(x)).

3. Draw a random value and use it in a weighted manner to pick the next seed point. In practice, first draw a random value Random, then multiply Sum(D(x)) by Random to obtain a value r that falls within Sum(D(x)); next apply r = r − D(x_g) for successive elements until r ≤ 0. The g-th element at that point is the next seed point, that is, the next cluster center.

4. Repeat steps 2 and 3 until L initial cluster centers have been selected.
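The seeding steps above can be sketched as below. Several points are assumptions rather than statements from the text: Random is taken to be uniform on [0, 1), `num_centers` plays the role of L, points already at distance zero from a chosen center are skipped, and the function name is hypothetical. The sketch follows the text in weighting by D(x); the commonly published K-Means++ algorithm weights by the squared distance D(x)² instead.

```python
import math
import random


def kmeanspp_centers(points, num_centers, rng=None):
    """Pick initial centers with the D(x)-weighted sampling described above."""
    rng = rng or random.Random()
    centers = [rng.choice(points)]  # step 1: first center chosen at random
    while len(centers) < num_centers:
        # Step 2: distance from each point to its nearest chosen center.
        d = [min(math.dist(p, c) for c in centers) for p in points]
        # Step 3: r = Sum(D(x)) * Random, then subtract D(x_g) until r <= 0.
        r = sum(d) * rng.random()
        # Fallback (farthest point) guards against floating-point round-off.
        chosen = max(range(len(points)), key=d.__getitem__)
        for g, dist_g in enumerate(d):
            if dist_g == 0:
                continue  # already a center, or a duplicate of one
            r -= dist_g
            if r <= 0:
                chosen = g
                break
        centers.append(points[chosen])  # step 4: repeat until enough centers
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = kmeanspp_centers(points, num_centers=2, rng=random.Random(0))
print(centers)
```

Because a point's chance of being picked grows with its distance from the existing centers, far-apart seeds are favored without being guaranteed.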

Step 105 specifically includes: inputting the quantized data matrix R, the number of clusters, and the initial cluster centers, and running the program corresponding to the K-Means algorithm to perform cluster analysis on the elements in the quantized data matrix, thereby obtaining multiple class clusters.
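The text does not spell out the K-Means program itself. A minimal sketch of the standard iteration (assign each point to its nearest center, then move each center to the mean of its members) might look like this; the function name, the `max_iter` parameter, and the toy data are assumptions, and the number of clusters is implied by the number of initial centers passed in.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Standard K-Means from given initial centers; returns (labels, centers)."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(points, centers=[points[0], points[2]])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [(0.0, 0.5), (10.0, 10.5)]
```

The returned labels are the class labels by which the rows of R can then be grouped into class clusters.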

Step 106 specifically includes: to find the risk factors that determine the formation of each patient group, the correlation index (the average of the squared correlation coefficients) between the elements in each class cluster is calculated, and the element with the largest correlation index is determined to be a gastroesophageal reflux disease risk factor; that is, one gastroesophageal reflux disease risk factor is screened out of each class cluster.

The quantized data matrix R is grouped by class label into R₁, R₂, R₃, ..., R_k, and the initial risk factor set is set to the empty set. The correlation between the elements in each class cluster is analyzed, the correlation index between elements is calculated, and in each class cluster the element with the largest correlation index is selected and added to the risk factor set.

The correlation coefficient is calculated as follows:

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$

where Var(X) is the variance of X, Var(Y) is the variance of Y, Cov(X, Y) is the covariance between X and Y, and X and Y are elements in each class cluster.

The sample correlation index R² of a given feature is calculated as follows:

where X is a given feature, i is the feature index, and n is the total number of features.
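As a hedged illustration of step 106, the sketch below computes, within one class cluster, each feature's mean squared correlation with the other features and picks the largest. The text defines the index only as the average of the squared correlation coefficients; averaging over the other n − 1 features (excluding a feature's correlation with itself) is an assumption made here, and it does not change which feature ranks highest compared with averaging over all n. Treating a constant feature as uncorrelated, the function names, and the toy data are also assumptions.

```python
import math


def pearson(x, y):
    """Cov(X, Y) / sqrt(Var(X) * Var(Y)); the 1/n factors cancel out."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant feature counts as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def correlation_index(columns, j):
    """Mean squared correlation of feature j with every other feature."""
    others = [c for i, c in enumerate(columns) if i != j]
    return sum(pearson(columns[j], c) ** 2 for c in others) / len(others)


def pick_risk_factor(cluster_rows, feature_names):
    """Return the feature with the largest correlation index in one cluster."""
    columns = list(zip(*cluster_rows))  # patient rows -> feature columns
    scores = [correlation_index(columns, j) for j in range(len(columns))]
    best = max(range(len(scores)), key=scores.__getitem__)
    return feature_names[best], scores


# Toy class cluster with hypothetical feature names and values.
names = ["a", "b", "c"]
rows = [(1, 1, 5), (2, 3, 1), (3, 2, 4), (4, 4, 2)]
best, scores = pick_risk_factor(rows, names)
print(best)  # b
```

Running `pick_risk_factor` once per group R₁, ..., R_k and collecting the winners would fill the risk factor set described above.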

Embodiment 2

To achieve the above object, the present invention also provides a system for extracting gastroesophageal reflux disease risk factors based on precise clustering, as shown in Fig. 2. The system includes:

An initial patient information set construction module 100, configured to construct an initial patient information set; the initial patient information set is a data set of M rows and N columns; the factor in row i, column 1 of the initial patient information set is a patient questionnaire ID number, and the column-1 factors in different rows represent different patient questionnaire ID numbers; the factor in row 1, column j of the initial patient information set is a question of the questionnaire, and the row-1 factors in different columns represent different questions; the factor in row i, column j of the initial patient information set is the answer given under the i-th patient questionnaire ID number to the j-th question; wherein 2≤i≤M and 2≤j≤N.

A quantized data matrix obtaining module 200, configured to perform data quantization on the answers in the initial patient information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a patient questionnaire ID number, and the column-1 elements in different rows represent different patient questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a question of the questionnaire, and the row-1 elements in different columns represent different questions; the element in row i, column j of the quantized data matrix is the quantized result of the answer given under the i-th patient questionnaire ID number to the j-th question; wherein 2≤i≤M and 2≤j≤N.

A hierarchical clustering dendrogram obtaining module 300, configured to cluster each sample point in the quantized data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the quantized data matrix; the number of sample points is the same as the number of columns of the quantized data matrix, wherein 2≤z≤M.

A cluster number determining module 400, configured to determine the number of clusters according to the hierarchical clustering dendrogram.

A class cluster obtaining module 500, configured to cluster the elements in the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain multiple class clusters.

A gastroesophageal reflux disease risk factor determining module 600, configured to calculate the correlation index between the elements within each class cluster and to determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.

Preferably, the system further includes an initial cluster center determining module, configured to determine initial cluster centers using the K-Means++ algorithm.

The hierarchical clustering dendrogram obtaining module 300 specifically includes:

The class cluster obtaining module 500 specifically includes:

A class cluster obtaining unit, configured to input the number of clusters, the initial cluster centers, and the quantized data matrix, and to run the program corresponding to the K-Means clustering algorithm to perform cluster analysis on the elements in the quantized data matrix, thereby obtaining multiple class clusters.

Compared with the prior art, the advantages of the present invention are as follows:

At present, machine learning methods are rarely used to extract risk factors in gastroesophageal reflux disease diagnostic techniques. In the medical field, most risk factor extraction relies on statistical methods, which are computationally intensive and less precise than machine learning.

By combining two clustering methods, the present invention remedies shortcomings inherent in K-means and further improves clustering accuracy.

The risk factor extraction of the present invention combines population clustering with indicator screening and analyzes risk factors across different populations; combining clustering with statistical methods yields more accurate results.

The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to one another. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.

Specific examples are used herein to explain the principle and implementation of the present invention; the description of the above embodiments is intended only to help understand the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementation and scope of application. In conclusion, the content of this specification should not be construed as limiting the present invention.

Claims

1. a kind of based on the gastroesophageal reflux disease risk factor extracting method precisely clustered, which is characterized in that the method packet It includes:

Construct initial patient information collection；The initial patient information integrates as the data set of M row N column；The initial patient information collection In the factors of the i-th row the 1st column be patient questionnaire's ID number, and the factor of the 1st column is expressed as different patient questionnaires in not going together ID number；The problem of factor for the 1st row jth column that the initial patient information is concentrated is questionnaire, and the 1st row in different lines Factor is expressed as different problems；The factor for the i-th row jth column that the initial patient information is concentrated is i-th patient questionnaire's ID number Answer to jth problem；Wherein, 2≤i≤M, 2≤j≤N；

Data quantization processing is carried out to the answer that the initial patient information is concentrated, obtains quantized data matrix；The quantization number It is the matrix of M row N column according to matrix；The element of the i-th row the 1st column in the quantized data matrix is patient questionnaire's ID number, and not The element representation of the 1st column is different patient questionnaire's ID number in colleague；The member of the 1st row jth column in the quantized data matrix The problem of element is questionnaire, and the element representation of the 1st row is different problems in different lines；In the quantized data matrix The element of i-th row jth column is the data quantization result fruit of i-th patient questionnaire's ID number jth problem answers；Wherein, 2≤i≤M, 2≤ j≤N；

Clustering processing is carried out to each sample point in the quantized data matrix using hierarchical clustering algorithm, obtains hierarchical clustering Dendrogram；Z-th of sample point represents the z row data in the quantized data matrix；The number of the sample point and institute It is identical to state quantized data matrix column number, wherein 2≤z≤M；

According to the clusters number and K-Means clustering algorithm, the element in the quantized data matrix is clustered, is obtained Multiple class clusters；

Calculating a correlation index between the elements in each cluster, and determining the element with the largest correlation index as a gastroesophageal reflux disease risk factor; the correlation index is the mean of the squared correlation coefficients.
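The final step of claim 1 can be illustrated with a minimal sketch. It assumes that each element of a cluster is a questionnaire factor with one quantized value per patient, that the correlation coefficient is the Pearson coefficient, and that the correlation index of a factor is the mean of its squared correlations with the other factors in the same cluster. The claim does not spell out these details, so they are assumptions, and the factor names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant factor counts as uncorrelated
    return cov / (sx * sy)


def correlation_index(cluster, name):
    """Mean squared correlation between `name` and the other factors."""
    others = [k for k in cluster if k != name]
    if not others:
        return 0.0
    total = sum(pearson(cluster[name], cluster[k]) ** 2 for k in others)
    return total / len(others)


def top_factor(cluster):
    """Factor with the largest correlation index in the cluster."""
    return max(cluster, key=lambda name: correlation_index(cluster, name))


# Hypothetical cluster: factor name -> quantized answers of five patients.
cluster = {
    "late_meals": [1, 2, 3, 4, 5],
    "spicy_food": [1, 2, 3, 5, 4],
    "smoking": [1, 3, 2, 5, 4],
}
print(top_factor(cluster))
```

In this toy cluster `spicy_food` correlates at 0.9 with both other factors, so it has the largest index and would be reported as the candidate risk factor for the cluster.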

2. The gastroesophageal reflux disease risk factor extraction method according to claim 1, characterized in that clustering each sample point in the quantized data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram specifically comprises:

Clustering each sample point in the quantized data matrix using an agglomerative hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram.
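As an illustration of agglomerative hierarchical clustering, the sketch below starts with every sample point in its own cluster and repeatedly merges the two closest clusters, recording each merge and its distance. That merge history is the information a dendrogram displays. Euclidean distance and single linkage are assumptions made for the example, since the claim names neither, and the sample points are hypothetical.

```python
import math


def agglomerate(points):
    """Return the merge history as (members_a, members_b, distance) tuples."""
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Single linkage (assumed): smallest distance between any two members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append((clusters[p], clusters[q], d))
        merged = clusters[p] + clusters[q]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return merges


# Hypothetical quantized sample points (one tuple per patient row).
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
merges = agglomerate(points)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

A large jump between consecutive merge distances is one common visual cue for choosing where to cut a dendrogram, although the claims here do not state which rule is applied.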

3. The gastroesophageal reflux disease risk factor extraction method according to claim 2, characterized in that clustering each sample point in the quantized data matrix using the agglomerative hierarchical clustering algorithm to obtain the hierarchical clustering dendrogram specifically comprises:

Step 1: calculating the distance between every pair of sample points;
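Step 1 amounts to building a symmetric distance matrix over the sample points. The sketch below assumes Euclidean distance, which this section does not specify, and uses hypothetical sample values.

```python
import math


def pairwise_distances(samples):
    """Symmetric matrix of Euclidean distances (assumed metric)."""
    n = len(samples)
    dist = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            d = math.dist(samples[a], samples[b])
            dist[a][b] = dist[b][a] = d  # distance is symmetric
    return dist


# Hypothetical quantized rows for three patients.
samples = [(0, 0), (3, 4), (6, 8)]
dist = pairwise_distances(samples)
print(dist[0][1], dist[0][2])
```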

4. The gastroesophageal reflux disease risk factor extraction method according to claim 3, characterized in that calculating the distance between every pair of sample points specifically comprises:

5. The gastroesophageal reflux disease risk factor extraction method according to claim 1, characterized in that, before clustering the elements in the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain a plurality of clusters, the method further comprises:

Determining initial cluster centers using the K-Means++ algorithm.
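K-Means++ seeding picks the first center uniformly at random and each further center with probability proportional to its squared distance from the nearest center already chosen. The sketch below is a minimal standard-library version of that seeding. The function name, the fixed seed, and the sample points are illustrative assumptions, and the sketch assumes the data contain at least k distinct points.

```python
import math
import random


def kmeans_pp_init(points, k, seed=0):
    """Choose k initial centers with K-Means++ seeding."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Sample the next center with probability proportional to d2;
        # points that are already centers have weight 0.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


# Hypothetical quantized sample points.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
centers = kmeans_pp_init(points, k=3)
print(centers)
```

Spreading the initial centers in this way tends to reduce the chance that K-Means converges to a poor local optimum compared with purely random initialization.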

6. The gastroesophageal reflux disease risk factor extraction method according to claim 5, characterized in that clustering the elements in the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain a plurality of clusters specifically comprises:

Inputting the number of clusters, the initial cluster centers and the quantized data matrix, and running a program corresponding to the K-Means clustering algorithm to perform cluster analysis on the elements in the quantized data matrix, obtaining a plurality of clusters.
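The three inputs named in this claim (number of clusters, initial cluster centers, quantized data) map directly onto the arguments of a plain K-Means loop. The following is a minimal sketch of Lloyd's iteration under the assumption of Euclidean distance; the function signature and the data are hypothetical and are not taken from the patent.

```python
import math


def kmeans(points, k, centers, max_iter=100):
    """Lloyd's K-Means starting from the given initial centers."""
    if len(centers) != k:
        raise ValueError("expected exactly k initial centers")
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers


# Hypothetical quantized sample points and initial centers.
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
labels, centers = kmeans(points, k=2, centers=[(0, 0), (5, 5)])
print(labels, centers)
```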

7. A gastroesophageal reflux disease risk factor extraction system based on precise clustering, characterized in that the system comprises:

An initial patient information set constructing module, configured to construct an initial patient information set; the initial patient information set is a data set of M rows and N columns; the factor in row i, column 1 of the initial patient information set is a patient questionnaire ID number, and the column-1 factors in different rows represent different patient questionnaire ID numbers; the factor in row 1, column j of the initial patient information set is a question of the questionnaire, and the row-1 factors in different columns represent different questions; the factor in row i, column j of the initial patient information set is the answer given under the i-th patient questionnaire ID number to the j-th question; wherein 2≤i≤M, 2≤j≤N;

A quantized data matrix obtaining module, configured to perform data quantization on the answers in the initial patient information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a patient questionnaire ID number, and the column-1 elements in different rows represent different patient questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a question of the questionnaire, and the row-1 elements in different columns represent different questions; the element in row i, column j of the quantized data matrix is the data quantization result of the answer given under the i-th patient questionnaire ID number to the j-th question; wherein 2≤i≤M, 2≤j≤N;

A hierarchical clustering dendrogram obtaining module, configured to cluster each sample point in the quantized data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the quantized data matrix; the number of sample points is the same as the number of columns of the quantized data matrix, wherein 2≤z≤M;

A cluster obtaining module, configured to cluster the elements in the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain a plurality of clusters;

A gastroesophageal reflux disease risk factor determining module, configured to calculate a correlation index between the elements in each cluster, and to determine the element with the largest correlation index as a gastroesophageal reflux disease risk factor; the correlation index is the mean of the squared correlation coefficients.
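To make the data layout handled by the first two modules concrete, the sketch below quantizes a small information set whose first row holds the questions and whose first column holds the questionnaire ID numbers, leaving both untouched and converting only the answers. The answer-to-number codebook, the question names, and the ID values are hypothetical; the claim does not state how answers are mapped to numbers.

```python
def quantize(info_set, codebook):
    """Map every answer to a number; keep the header row and ID column."""
    header, *rows = info_set
    matrix = [list(header)]  # row 1: questions, copied unchanged
    for row in rows:
        questionnaire_id, *answers = row
        # Column 1 keeps the questionnaire ID; the rest are quantized answers.
        matrix.append([questionnaire_id] + [codebook[a] for a in answers])
    return matrix


# Hypothetical initial patient information set (M = 3 rows, N = 3 columns).
info_set = [
    ["id", "smoking", "late_meals"],
    ["Q001", "never", "often"],
    ["Q002", "often", "sometimes"],
]
# Hypothetical ordinal coding of the answers.
codebook = {"never": 0, "sometimes": 1, "often": 2}

matrix = quantize(info_set, codebook)
for row in matrix:
    print(row)
```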

8. The gastroesophageal reflux disease risk factor extraction system according to claim 7, characterized in that the hierarchical clustering dendrogram obtaining module specifically comprises:

A hierarchical clustering dendrogram obtaining unit, configured to cluster each sample point in the quantized data matrix using an agglomerative hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram.

9. The gastroesophageal reflux disease risk factor extraction system according to claim 7, characterized in that the system further comprises:

10. The gastroesophageal reflux disease risk factor extraction method according to claim 9, characterized in that the cluster obtaining module specifically comprises:

A cluster obtaining unit, configured to input the number of clusters, the initial cluster centers and the quantized data matrix, and to run a program corresponding to the K-Means clustering algorithm to perform cluster analysis on the elements in the quantized data matrix, obtaining a plurality of clusters.