CN109685139A - Method and system for extracting gastroesophageal reflux disease risk factors based on precise clustering - Google Patents
- Publication number
- CN109685139A (CN201811589375.0A)
- Authority
- CN
- China
- Prior art keywords
- quantized data
- data matrix
- row
- column
- obtains
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Public Health (AREA)
- Biomedical Technology (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Medical Informatics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Primary Health Care (AREA)
- Databases & Information Systems (AREA)
- Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Epidemiology (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and system for extracting gastroesophageal reflux disease risk factors based on precise clustering. First, an initial patient information set containing gastroesophageal reflux disease risk factors is constructed. Second, the factors in the initial patient information set are quantized to obtain a quantized data matrix. Third, the sample points in the quantized data matrix are clustered with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram. Next, the number of clusters is determined from the dendrogram, and the elements of the quantized data matrix are clustered with the K-Means clustering algorithm using that number, yielding a plurality of clusters. Finally, the correlation index between the elements in each cluster is calculated, and the element with the largest correlation index is determined to be a gastroesophageal reflux disease risk factor. By combining two clustering methods, the invention efficiently screens out the risk factors that cause gastroesophageal reflux disease, thereby reducing its incidence.
Description
Technical field
The present invention relates to the technical fields of clustering and medicine, and in particular to a method and system for extracting gastroesophageal reflux disease risk factors based on precise clustering.
Background
Gastroesophageal reflux disease is a digestive system disease that is common worldwide, and its incidence is rising year by year. Its treatment therefore deserves serious attention. Because the onset of gastroesophageal reflux disease is closely related to lifestyle, emotional changes, eating habits and the like, and the condition changes easily, collecting a large amount of data and analyzing its characteristics plays an important role in studying and preventing the disease.
At present, risk factors in gastroesophageal reflux disease diagnosis are mainly extracted with clustering algorithms. However, choosing the number of clusters and the cluster centers is difficult, and a wrong choice of either often leads to low precision in risk factor extraction.
Summary of the invention
The object of the present invention is to provide a method and system for extracting gastroesophageal reflux disease risk factors based on precise clustering, so as to solve the prior-art problem that a wrong choice of the number of clusters and of the cluster centers leads to low precision in risk factor extraction.
To achieve the above object, the present invention provides the following solutions:
A method for extracting gastroesophageal reflux disease risk factors based on precise clustering, the method comprising:
Constructing an initial patient information set, which is a data set of M rows and N columns. The factor in row i, column 1 of the initial patient information set is a patient questionnaire ID number, and the factors in column 1 of different rows are different patient questionnaire ID numbers. The factor in row 1, column j is a questionnaire question, and the factors in row 1 of different columns are different questions. The factor in row i, column j is the answer given under the i-th patient questionnaire ID number to the j-th question, where 2 ≤ i ≤ M and 2 ≤ j ≤ N;
Quantizing the answers in the initial patient information set to obtain a quantized data matrix of M rows and N columns. The element in row i, column 1 of the quantized data matrix is a patient questionnaire ID number, and the elements in column 1 of different rows are different patient questionnaire ID numbers. The element in row 1, column j is a questionnaire question, and the elements in row 1 of different columns are different questions. The element in row i, column j is the quantized answer given under the i-th patient questionnaire ID number to the j-th question, where 2 ≤ i ≤ M and 2 ≤ j ≤ N;
Clustering the sample points in the quantized data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram. The z-th sample point represents the z-th row of data in the quantized data matrix, and the number of sample points equals the number of columns of the quantized data matrix, where 2 ≤ z ≤ M;
Determining the number of clusters from the hierarchical clustering dendrogram;
Clustering the elements of the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain a plurality of clusters;
Calculating the correlation index between the elements in each cluster, and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor. The correlation index is the mean of the squared correlation coefficients.
Optionally, clustering the sample points in the quantized data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram specifically comprises:
Clustering the sample points in the quantized data matrix with an agglomerative hierarchical clustering algorithm to obtain the hierarchical clustering dendrogram.
Optionally, clustering the sample points in the quantized data matrix with an agglomerative hierarchical clustering algorithm to obtain the hierarchical clustering dendrogram specifically comprises:
Step 1: calculating the distance between every pair of sample points;
Step 2: merging the two sample points with the smallest distance into one class;
Step 3: repeating Step 1 and Step 2 until all sample points are merged into a single class, thereby obtaining the hierarchical clustering dendrogram.
Optionally, calculating the distance between every pair of sample points specifically comprises:
Calculating the distance between every pair of sample points with the average distance algorithm.
Optionally, before clustering the elements of the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain a plurality of clusters, the method further comprises:
Determining initial cluster centers with the K-Means++ algorithm.
Optionally, clustering the elements of the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain a plurality of clusters specifically comprises:
Inputting the number of clusters, the initial cluster centers and the quantized data matrix, and running the program corresponding to the K-Means clustering algorithm to perform cluster analysis on the elements of the quantized data matrix, thereby obtaining a plurality of clusters.
A system for extracting gastroesophageal reflux disease risk factors based on precise clustering, comprising:
An initial patient information set construction module, configured to construct an initial patient information set, which is a data set of M rows and N columns. The factor in row i, column 1 of the initial patient information set is a patient questionnaire ID number, and the factors in column 1 of different rows are different patient questionnaire ID numbers. The factor in row 1, column j is a questionnaire question, and the factors in row 1 of different columns are different questions. The factor in row i, column j is the answer given under the i-th patient questionnaire ID number to the j-th question, where 2 ≤ i ≤ M and 2 ≤ j ≤ N;
A quantized data matrix obtaining module, configured to quantize the answers in the initial patient information set to obtain a quantized data matrix of M rows and N columns. The element in row i, column 1 of the quantized data matrix is a patient questionnaire ID number, and the elements in column 1 of different rows are different patient questionnaire ID numbers. The element in row 1, column j is a questionnaire question, and the elements in row 1 of different columns are different questions. The element in row i, column j is the quantized answer given under the i-th patient questionnaire ID number to the j-th question, where 2 ≤ i ≤ M and 2 ≤ j ≤ N;
A hierarchical clustering dendrogram obtaining module, configured to cluster the sample points in the quantized data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram. The z-th sample point represents the z-th row of data in the quantized data matrix, and the number of sample points equals the number of columns of the quantized data matrix, where 2 ≤ z ≤ M;
A cluster number determination module, configured to determine the number of clusters from the hierarchical clustering dendrogram;
A cluster obtaining module, configured to cluster the elements of the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain a plurality of clusters;
A gastroesophageal reflux disease risk factor determination module, configured to calculate the correlation index between the elements in each cluster and to determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor. The correlation index is the mean of the squared correlation coefficients.
Optionally, the hierarchical clustering dendrogram obtaining module specifically comprises:
A hierarchical clustering dendrogram obtaining unit, configured to cluster the sample points in the quantized data matrix with an agglomerative hierarchical clustering algorithm to obtain the hierarchical clustering dendrogram.
Optionally, the system further comprises:
An initial cluster center determination module, configured to determine initial cluster centers with the K-Means++ algorithm.
Optionally, the cluster obtaining module specifically comprises:
A cluster obtaining unit, configured to input the number of clusters, the initial cluster centers and the quantized data matrix, and to run the program corresponding to the K-Means clustering algorithm to perform cluster analysis on the elements of the quantized data matrix, thereby obtaining a plurality of clusters.
According to the specific embodiments provided by the present invention, the invention discloses the following technical effects:
The present invention proposes a method and system that combine two clustering methods to accurately extract gastroesophageal reflux disease risk factors. It combines hierarchical clustering with K-means clustering and applies them to population clustering, clustering along two dimensions: features and population. It finally obtains the populations of the different clusters after precise clustering, and uses statistical methods to analyze the key factor that determines each population. By combining two clustering methods, the present invention therefore efficiently screens out the risk factors that cause gastroesophageal reflux disease, provides a scientific basis for future medical research and disease diagnosis, gives guidance on gastroesophageal reflux disease, and reduces its incidence.
Brief description of the drawings
To explain the embodiments of the present invention or the prior-art solutions more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of the method for extracting gastroesophageal reflux disease risk factors based on precise clustering according to an embodiment of the present invention;
Fig. 2 is a structural diagram of the system for extracting gastroesophageal reflux disease risk factors based on precise clustering according to an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The object of the present invention is to provide a method and system for extracting gastroesophageal reflux disease risk factors based on precise clustering that can efficiently and accurately screen out the risk factors that cause gastroesophageal reflux disease.
To make the above objects, features and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Embodiment 1
Fig. 1 is a flow diagram of the method for extracting gastroesophageal reflux disease risk factors based on precise clustering according to an embodiment of the present invention. As shown in Fig. 1, the method provided by this embodiment specifically includes the following steps.
Step 101: construct an initial patient information set, which is a data set of M rows and N columns. The factor in row i, column 1 of the initial patient information set is a patient questionnaire ID number, and the factors in column 1 of different rows are different patient questionnaire ID numbers. The factor in row 1, column j is a questionnaire question, and the factors in row 1 of different columns are different questions. The factor in row i, column j is the answer given under the i-th patient questionnaire ID number to the j-th question, where 2 ≤ i ≤ M, 2 ≤ j ≤ N, and i and j are positive integers.
Step 102: quantize the answers in the initial patient information set to obtain a quantized data matrix of M rows and N columns. The element in row i, column 1 of the quantized data matrix is a patient questionnaire ID number, and the elements in column 1 of different rows are different patient questionnaire ID numbers. The element in row 1, column j is a questionnaire question, and the elements in row 1 of different columns are different questions. The element in row i, column j is the quantized answer given under the i-th patient questionnaire ID number to the j-th question, where 2 ≤ i ≤ M and 2 ≤ j ≤ N. In other words, the quantized data matrix is the initial matrix of key risk factors.
Step 103: cluster the sample points in the quantized data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram. The z-th sample point represents the z-th row of data in the quantized data matrix, and the number of sample points equals the number of columns of the quantized data matrix, where 2 ≤ z ≤ M and z is a positive integer.
Step 104: determine the number of clusters from the hierarchical clustering dendrogram.
Step 105: cluster the elements of the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain a plurality of clusters. The number of clusters obtained equals the determined number of clusters, and different clusters represent different categories of patients.
Step 106: calculate the correlation index between the elements in each cluster, and determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.
Step 101 specifically includes:
In this embodiment, a questionnaire is distributed to each person with gastroesophageal reflux disease in a hospital, and the initial patient information set is built from the questionnaires collected back.
The initial patient information set in this embodiment has 241 dimensions in total, comprising the uniquely identifying patient questionnaire ID number and the answers to each questionnaire question.
The questionnaire contains questions in several parts, such as general demographic data, lifestyle, eating habits, psychological factors and sleep factors, together with the respondents' answers. There are three types of questions: single-choice questions, yes/no questions and open-ended questions.
Step 102 quantizes the respondents' answers.
Specifically, the patient questionnaire ID number serves as the unique identifier. For single-choice questions whose options express a degree of severity, for example "often", "occasionally", "rarely" and "never", the options are assigned the weights 4, 3, 2 and 1 in order of severity, and the weight corresponding to the given answer is selected. For yes/no questions, "yes" is assigned 1 and "no" is assigned 0, and the value corresponding to the given answer is selected. Questions whose options have no degree of severity, such as occupation, do not contribute to the result and can be deleted. For open-ended questions, the continuous numerical value entered by the user is used directly as data. For example, for a user who is 45 years old, has a height of 172, is not infected with HP and often has sleep disorders, the sample columns are [age, height, HP infection, sleep disorder] and the data values are [45, 172, 0, 4]. The purpose of this step is to quantize the answers of all respondents and obtain the quantized data matrix R.
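The quantization rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the question-type labels (`single_choice`, `yes_no`, `open`), the option strings and the column names are hypothetical and do not appear in the source; only the weights 4, 3, 2, 1 and the yes/no values 1 and 0 come from the description.

```python
# Weights taken from the description; the option strings are assumed English labels.
FREQUENCY_WEIGHTS = {"often": 4, "occasionally": 3, "rarely": 2, "never": 1}
YES_NO_VALUES = {"yes": 1, "no": 0}


def quantize_answer(kind, answer):
    """Map one questionnaire answer to a number.

    kind is one of "single_choice", "yes_no" or "open" (hypothetical labels).
    """
    if kind == "single_choice":
        return FREQUENCY_WEIGHTS[answer]
    if kind == "yes_no":
        return YES_NO_VALUES[answer]
    if kind == "open":
        # Open-ended answers are already numeric and are used directly.
        return float(answer)
    raise ValueError(f"unknown question kind: {kind}")


def quantize_record(schema, record):
    """Quantize one respondent's answers; schema maps column name to question kind."""
    return [quantize_answer(schema[name], record[name]) for name in schema]


# Hypothetical schema and record mirroring the example in the text.
schema = {
    "age": "open",
    "height": "open",
    "hp_infected": "yes_no",
    "sleep_disorder": "single_choice",
}
record = {"age": "45", "height": "172", "hp_infected": "no", "sleep_disorder": "often"}
row = quantize_record(schema, record)
```

Questions without a severity ordering, such as occupation, would simply be left out of `schema` in this sketch, which corresponds to deleting them.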
Step 103 specifically includes:
Cluster analysis can cluster indicators (variables) or samples; here the indicators are clustered.
This embodiment uses hierarchical clustering, and more specifically agglomerative hierarchical clustering, to cluster the sample points in the quantized data matrix R. The basic principle is as follows: each object to be clustered is first treated as its own class; the similarity between classes is then measured by an inter-class statistic, and the two closest classes are merged into one; merging continues step by step until all objects have been merged into a single class.
The steps of agglomerative hierarchical clustering are as follows:
Treat each sample point in the quantized data matrix R as a separate class.
Calculate the distance between every pair of classes according to the distance formula, and find the two classes c1 and c2 with the smallest distance.
Merge class c1 and class c2 into one class.
Repeat the above steps until all sample points are merged into a single class, thereby obtaining a hierarchical clustering dendrogram.
Regarding the distance measure, this embodiment uses the average distance method to calculate the distance between every pair of classes.
The average distance method calculates the distance between each data point in one class and every data point in the other class, and takes the mean of all these distances as the distance between the two classes. The calculation formula is as follows:
$$d(c_1, c_2) = \frac{1}{|c_1|\,|c_2|} \sum_{x \in c_1} \sum_{z \in c_2} \mathrm{dist}(x, z)$$
where |c1| and |c2| are the numbers of data points in classes c1 and c2, and dist(x, z) is the Euclidean distance.
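As an illustration, agglomerative clustering with the average distance between classes can be sketched in plain Python. The function names and the toy data are assumptions made for this sketch; it recomputes all pairwise class distances on every pass, so it is suitable only for small inputs, and an optimized library routine would normally be used in practice.

```python
import math
from itertools import combinations


def euclidean(x, z):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))


def average_distance(c1, c2, points):
    """Mean of all pairwise distances between members of classes c1 and c2.

    c1 and c2 are tuples of indices into points.
    """
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(points):
    """Return the merge history as a list of (class_a, class_b, distance)."""
    # Start with every sample point as its own class.
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the two classes with the smallest average distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0], pair[1], points),
        )
        d = average_distance(a, b, points)
        # Merge them into one class and record the merge height.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


# Toy data: two well-separated pairs of points.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
merges = agglomerative(points)
```

The merge history, read from first to last entry, is the information a dendrogram displays: which classes were joined and at what distance.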
Step 104 specifically includes:
Drawing on domain knowledge, the number of clusters k, that is, the number of risk factors to be finally selected, is determined by analyzing the hierarchical clustering dendrogram.
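One common way to read a cluster count off a dendrogram is to cut it at a chosen height and count the branches that remain. The sketch below assumes that the merge distances from the hierarchical clustering are available as a list and that the cut height is supplied by the analyst; the function name and the numbers are illustrative only, and the source does not prescribe a particular cutting rule.

```python
def cluster_count_at(merge_heights, cut_height):
    """Number of clusters left when the dendrogram is cut at cut_height.

    merge_heights holds the distance of each merge; n points produce n - 1 merges.
    Every merge at or below the cut joins two clusters into one.
    """
    n_points = len(merge_heights) + 1
    merged = sum(1 for h in merge_heights if h <= cut_height)
    return n_points - merged


# Illustrative merge heights for four sample points.
heights = [1.0, 1.0, 7.09]
k = cluster_count_at(heights, cut_height=3.0)
```

With these illustrative heights, any cut between 1.0 and 7.09 gives the same count, which is the kind of stable range an analyst would look for when choosing k.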
Using the number of clusters k obtained in step 104, K-Means clustering is performed on the respondent variables, that is, on the quantized data matrix R.
Before step 105 is executed, the method further includes determining the initial cluster centers with the K-Means++ algorithm.
The idea behind the K-Means++ choice of initial cluster centers is that the initial cluster centers should be as far apart from one another as possible. The algorithm proceeds as follows:
1. Randomly select one element of the quantized data matrix R as the first initial cluster center.
2. For each element, calculate the distance D(x) to its nearest initial cluster center, obtaining D(1), D(2), ..., D(n) in turn and forming the set D; then sum all the distances to obtain Sum(D(x)).
3. Take a random value and use it in a weighted manner to select the next seed point. This is implemented as follows: first take a random value Random so that the result falls within Sum(D(x)), that is, multiply Sum(D(x)) by Random to obtain r; then repeatedly apply r = r - D(x_g) until r ≤ 0. The g-th element at that moment is the next seed point, that is, the next cluster center.
4. Repeat steps 2 and 3 until L initial cluster centers have been selected.
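The seeding procedure above can be sketched as follows. The sketch follows the description literally and weights candidates by D(x); the commonly published K-Means++ algorithm weights by D(x) squared instead. The function name, the `rng` parameter and the toy data are assumptions made for illustration.

```python
import math
import random


def euclidean(x, z):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))


def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centers by distance-weighted roulette-wheel selection."""
    rng = rng or random.Random()
    # Step 1: the first center is chosen uniformly at random.
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        # Step 2: D(x) is the distance from each point to its nearest chosen center.
        d = [min(euclidean(p, c) for c in centers) for p in points]
        # Step 3: r = Random * Sum(D(x)); subtract D(x_g) until r <= 0.
        r = rng.random() * sum(d)
        chosen = points[d.index(max(d))]  # fallback for floating-point edge cases
        for p, dist in zip(points, d):
            if dist == 0:
                continue  # already a center, never re-selected
            r -= dist
            if r <= 0:
                chosen = p
                break
        centers.append(list(chosen))
    return centers


# Toy data and a seeded generator so the run is repeatable.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
seeds = kmeans_pp_init(points, 2, random.Random(0))
```

Points far from every existing center occupy a larger share of the interval [0, Sum(D(x))], so they are more likely to be picked, which is what spreads the initial centers apart.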
Step 105 specifically includes: inputting the quantized data matrix R, the number of clusters and the initial cluster centers, and running the program corresponding to the K-Means algorithm to perform cluster analysis on the elements of the quantized data matrix, thereby obtaining a plurality of clusters.
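A minimal sketch of the K-Means iteration that such a program would run is given below, assuming Euclidean distance and a fixed iteration cap. The function signature and the toy data are hypothetical; the source does not specify the implementation.

```python
import math


def euclidean(x, z):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))


def kmeans(points, centers, max_iter=100):
    """Alternate assignment and center update until the labels stop changing."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy data with two obvious groups and one initial center in each.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = kmeans(points, centers=[[0.0, 0.0], [10.0, 10.0]])
```

The number of clusters is implied by the number of initial centers passed in, so the three inputs named in the text map onto `points` and `centers` here.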
Step 106 specifically includes: to find the risk factor that determines each such patient population, the correlation index (the mean of the squared correlation coefficients) between the elements in each cluster is calculated, and the element with the largest correlation index is determined to be a gastroesophageal reflux disease risk factor, so that one gastroesophageal reflux disease risk factor is screened out of each cluster.
The quantized data matrix R is grouped by class label into R_1, R_2, R_3, ..., R_k, and the initial risk factor set is set to the empty set. The correlation between the elements in each cluster is analyzed and the correlation index between the elements is calculated; in each cluster, the element with the largest correlation index is selected and added to the risk factor set.
The correlation coefficient is calculated as follows:
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
where Var(X) is the variance of X, Var(Y) is the variance of Y, Cov(X, Y) is the covariance between X and Y, and X and Y are elements in a cluster.
For a given feature, the sample correlation index R² is calculated as follows:
where X is the given feature, i is the feature index and n is the total number of features.
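The selection of one risk factor per cluster can be sketched as follows. The sketch assumes that the correlation index of a feature is the mean of the squared Pearson correlation coefficients between that feature and every other feature in the same cluster; the exact normalization used by the original formula is not recoverable from the text, and the feature names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient Cov(X, Y) / sqrt(Var(X) * Var(Y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / math.sqrt(vx * vy)


def correlation_index(columns, name):
    """Mean squared correlation between one feature and all other features."""
    others = [c for c in columns if c != name]
    if not others:
        return 0.0
    return sum(pearson(columns[name], columns[o]) ** 2 for o in others) / len(others)


def pick_risk_factor(columns):
    """Return the feature with the largest correlation index in one cluster."""
    return max(columns, key=lambda name: correlation_index(columns, name))


# Hypothetical quantized features for the patients of one cluster.
cluster_columns = {
    "sleep_disorder": [1, 2, 3, 4],
    "late_meals": [1, 2, 4, 3],
    "anxiety": [2, 1, 3, 4],
}
risk_factor = pick_risk_factor(cluster_columns)
```

Running `pick_risk_factor` once per group R_1, ..., R_k and collecting the results would build the risk factor set described in the text.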
Embodiment 2
To achieve the above object, the present invention also provides a system for extracting gastroesophageal reflux disease risk factors based on precise clustering, as shown in Fig. 2. The system includes:
An initial patient information set construction module 100, configured to construct an initial patient information set, which is a data set of M rows and N columns. The factor in row i, column 1 of the initial patient information set is a patient questionnaire ID number, and the factors in column 1 of different rows are different patient questionnaire ID numbers. The factor in row 1, column j is a questionnaire question, and the factors in row 1 of different columns are different questions. The factor in row i, column j is the answer given under the i-th patient questionnaire ID number to the j-th question, where 2 ≤ i ≤ M and 2 ≤ j ≤ N.
A quantized data matrix obtaining module 200, configured to quantize the answers in the initial patient information set to obtain a quantized data matrix of M rows and N columns. The element in row i, column 1 of the quantized data matrix is a patient questionnaire ID number, and the elements in column 1 of different rows are different patient questionnaire ID numbers. The element in row 1, column j is a questionnaire question, and the elements in row 1 of different columns are different questions. The element in row i, column j is the quantized answer given under the i-th patient questionnaire ID number to the j-th question, where 2 ≤ i ≤ M and 2 ≤ j ≤ N.
A hierarchical clustering dendrogram obtaining module 300, configured to cluster the sample points in the quantized data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram. The z-th sample point represents the z-th row of data in the quantized data matrix, and the number of sample points equals the number of columns of the quantized data matrix, where 2 ≤ z ≤ M.
A cluster number determination module 400, configured to determine the number of clusters from the hierarchical clustering dendrogram.
A cluster obtaining module 500, configured to cluster the elements of the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain a plurality of clusters.
A gastroesophageal reflux disease risk factor determination module 600, configured to calculate the correlation index between the elements in each cluster and to determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.
Preferably, the system further includes an initial cluster center determination module, configured to determine the initial cluster centers with the K-Means++ algorithm.
The hierarchical clustering dendrogram obtaining module 300 specifically includes:
A hierarchical clustering dendrogram obtaining unit, configured to cluster the sample points in the quantized data matrix with an agglomerative hierarchical clustering algorithm to obtain the hierarchical clustering dendrogram.
The cluster obtaining module 500 specifically includes:
A cluster obtaining unit, configured to input the number of clusters, the initial cluster centers and the quantized data matrix, and to run the program corresponding to the K-Means clustering algorithm to perform cluster analysis on the elements of the quantized data matrix, thereby obtaining a plurality of clusters.
The prior art is compared, advantage of the invention are as follows:
It is actually rare using machine learning method extraction risk factor in gastroesophageal reflux disease diagnostic techniques at present, greatly
What majority took the extraction of risk factor in medical domain is statistical method, and statistical method is computationally intensive, simultaneously
Accurate rate is lower compared with machine learning.
The present invention in such a way that two kinds of clustering methods combine, for K mean value itself there are the shortcomings that be made that improvement,
Further raising has been done to the accuracy of cluster.
The extraction of risk factor of the invention combines crowd's cluster and index screening, analyzed from different crowd it is dangerous because
Element is as a result more accurate in conjunction with clustering method and statistical method.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Specific examples are used herein to explain the principle and implementation of the present invention. The above description of the embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, a person skilled in the art may, following the idea of the present invention, make changes to the specific implementation and the scope of application. In conclusion, the content of this specification should not be construed as limiting the present invention.
Claims (10)
1. A method for extracting gastroesophageal reflux disease risk factors based on precise clustering, characterized in that the method comprises:
constructing an initial patient information set; the initial patient information set is a data set of M rows and N columns; the factor in row i, column 1 of the initial patient information set is a patient questionnaire ID number, and the factors in column 1 of different rows represent different patient questionnaire ID numbers; the factor in row 1, column j of the initial patient information set is a question of the questionnaire, and the factors in row 1 of different columns represent different questions; the factor in row i, column j of the initial patient information set is the answer to the j-th question under the i-th patient questionnaire ID number; where 2≤i≤M, 2≤j≤N;
performing data quantization on the answers in the initial patient information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a patient questionnaire ID number, and the elements in column 1 of different rows represent different patient questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a question of the questionnaire, and the elements in row 1 of different columns represent different questions; the element in row i, column j of the quantized data matrix is the data quantization result of the answer to the j-th question under the i-th patient questionnaire ID number; where 2≤i≤M, 2≤j≤N;
clustering each sample point in the quantized data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the data in row z of the quantized data matrix; the number of sample points is the same as the number of columns of the quantized data matrix, where 2≤z≤M;
determining the number of clusters according to the hierarchical clustering dendrogram;
clustering the elements in the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain multiple class clusters;
calculating the correlation index between the elements in each class cluster, and determining the element with the largest correlation index as a gastroesophageal reflux disease risk factor; the correlation index is the mean of the squared correlation coefficients.
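As a non-limiting illustration of the last step, the following Python sketch computes a correlation index as the mean of squared correlation coefficients and picks the element with the largest index. It rests on several assumptions not stated in the text: each element of a class cluster is taken to be a numeric vector, the coefficient is taken to be Pearson's, a constant vector is treated as uncorrelated, and the names `pearson`, `correlation_index`, `pick_risk_factor` and the sample data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant vector counts as uncorrelated
    return sxy / sqrt(sxx * syy)


def correlation_index(cluster):
    """Map each element name to the mean squared correlation with the others.

    cluster: dict of element name -> numeric vector (assumed layout).
    """
    index = {}
    for a in cluster:
        others = [b for b in cluster if b != a]
        if not others:
            index[a] = 0.0  # a single-element cluster has nothing to compare
            continue
        total = sum(pearson(cluster[a], cluster[b]) ** 2 for b in others)
        index[a] = total / len(others)
    return index


def pick_risk_factor(cluster):
    """Return the element whose correlation index is largest."""
    index = correlation_index(cluster)
    return max(index, key=index.get)


# Hypothetical quantized values for three elements of one class cluster.
cluster = {"q1": [1, 2, 3, 4], "q2": [2, 4, 6, 8], "q3": [1, 3, 2, 4]}
print(correlation_index(cluster))
print(pick_risk_factor(cluster))
```

Squaring the coefficients makes strong negative and strong positive correlations count equally, which is one plausible reason for using the mean of squares rather than the plain mean.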
2. The gastroesophageal reflux disease risk factor extraction method according to claim 1, characterized in that clustering each sample point in the quantized data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram specifically comprises:
clustering each sample point in the quantized data matrix using an agglomerative hierarchical clustering algorithm to obtain the hierarchical clustering dendrogram.
3. The gastroesophageal reflux disease risk factor extraction method according to claim 2, characterized in that clustering each sample point in the quantized data matrix using an agglomerative hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram specifically comprises:
Step 1: calculating the distance between every two sample points;
Step 2: selecting the two sample points with the smallest distance and merging them into one class;
Step 3: repeating step 1 and step 2 until all sample points are merged into one class, thereby obtaining the hierarchical clustering dendrogram.
4. The gastroesophageal reflux disease risk factor extraction method according to claim 3, characterized in that calculating the distance between every two sample points specifically comprises:
calculating the distance between every two sample points using an average distance algorithm.
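As a non-limiting illustration of the agglomerative procedure with an average distance, the following Python sketch repeatedly merges the two closest classes and records each merge, which is the information a dendrogram displays. It assumes that "average distance" means average linkage over Euclidean distances; the function names and the sample points are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def average_linkage(points, a, b):
    """Mean pairwise distance between two classes given as index tuples."""
    total = sum(dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerative(points):
    """Return the merge history as a list of (class_a, class_b, distance)."""
    clusters = [(i,) for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        # Step 1: distance between every pair of current classes.
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(points, clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        # Step 2: merge the closest pair into one class.
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
        # Step 3: the loop repeats until a single class remains.
    return merges


# Hypothetical two-dimensional sample points.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
for a, b, d in agglomerative(points):
    print(a, b, round(d, 2))
```

The sketch recomputes all pairwise distances on every pass, which is simple but slow for large data sets; a production implementation would normally cache or update the distance matrix.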
5. The gastroesophageal reflux disease risk factor extraction method according to claim 1, characterized in that, before clustering the elements in the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain multiple class clusters, the method further comprises:
determining initial cluster centers using the K-Means++ algorithm.
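As a non-limiting illustration, the following Python sketch shows the usual K-Means++ seeding rule: the first center is drawn uniformly at random, and each further center is drawn with probability proportional to the squared distance to the nearest center already chosen. The function name `kmeans_pp_init`, the `seed` parameter and the sample points are hypothetical, and the sketch assumes the data contain at least k distinct points.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers from points using K-Means++ seeding."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(dist(p, c) ** 2 for c in centers) for p in points]
        # Points far from every center are more likely to be drawn;
        # already chosen points have weight zero and cannot be drawn again.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


# Hypothetical sample points forming two well separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans_pp_init(points, k=2))
```

Spreading the initial centers in this way is what addresses the sensitivity of plain K-Means to a poor random start.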
6. The gastroesophageal reflux disease risk factor extraction method according to claim 5, characterized in that clustering the elements in the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain multiple class clusters specifically comprises:
inputting the number of clusters, the initial cluster centers and the quantized data matrix, running the program corresponding to the K-Means clustering algorithm, and performing cluster analysis on the elements in the quantized data matrix to obtain multiple class clusters.
7. A system for extracting gastroesophageal reflux disease risk factors based on precise clustering, characterized in that the system comprises:
an initial patient information set construction module, configured to construct an initial patient information set; the initial patient information set is a data set of M rows and N columns; the factor in row i, column 1 of the initial patient information set is a patient questionnaire ID number, and the factors in column 1 of different rows represent different patient questionnaire ID numbers; the factor in row 1, column j of the initial patient information set is a question of the questionnaire, and the factors in row 1 of different columns represent different questions; the factor in row i, column j of the initial patient information set is the answer to the j-th question under the i-th patient questionnaire ID number; where 2≤i≤M, 2≤j≤N;
a quantized data matrix obtaining module, configured to perform data quantization on the answers in the initial patient information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a patient questionnaire ID number, and the elements in column 1 of different rows represent different patient questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a question of the questionnaire, and the elements in row 1 of different columns represent different questions; the element in row i, column j of the quantized data matrix is the data quantization result of the answer to the j-th question under the i-th patient questionnaire ID number; where 2≤i≤M, 2≤j≤N;
a hierarchical clustering dendrogram obtaining module, configured to cluster each sample point in the quantized data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the data in row z of the quantized data matrix; the number of sample points is the same as the number of columns of the quantized data matrix, where 2≤z≤M;
a cluster number determining module, configured to determine the number of clusters according to the hierarchical clustering dendrogram;
a class cluster obtaining module, configured to cluster the elements in the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain multiple class clusters;
a gastroesophageal reflux disease risk factor determining module, configured to calculate the correlation index between the elements in each class cluster and to determine the element with the largest correlation index as a gastroesophageal reflux disease risk factor; the correlation index is the mean of the squared correlation coefficients.
8. The gastroesophageal reflux disease risk factor extraction system according to claim 7, characterized in that the hierarchical clustering dendrogram obtaining module specifically comprises:
a hierarchical clustering dendrogram obtaining unit, configured to cluster each sample point in the quantized data matrix using an agglomerative hierarchical clustering algorithm to obtain the hierarchical clustering dendrogram.
9. The gastroesophageal reflux disease risk factor extraction system according to claim 7, characterized in that the system further comprises:
an initial cluster center determining module, configured to determine initial cluster centers using the K-Means++ algorithm.
10. The gastroesophageal reflux disease risk factor extraction system according to claim 9, characterized in that the class cluster obtaining module specifically comprises:
a class cluster obtaining unit, configured to input the number of clusters, the initial cluster centers and the quantized data matrix, run the program corresponding to the K-Means clustering algorithm, and perform cluster analysis on the elements in the quantized data matrix to obtain multiple class clusters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811589375.0A CN109685139A (en) | 2018-12-25 | 2018-12-25 | Based on the gastroesophageal reflux disease risk factor extracting method precisely clustered and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811589375.0A CN109685139A (en) | 2018-12-25 | 2018-12-25 | Based on the gastroesophageal reflux disease risk factor extracting method precisely clustered and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109685139A (en) | 2019-04-26 |
Family
ID=66189310
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811589375.0A Pending CN109685139A (en) | 2018-12-25 | 2018-12-25 | Based on the gastroesophageal reflux disease risk factor extracting method precisely clustered and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109685139A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105956318A (en) * | 2016-05-19 | 2016-09-21 | 上海电机学院 | Improved splitting H-K clustering method-based wind power plant fleet division method |
CN107368856A (en) * | 2017-07-25 | 2017-11-21 | 深信服科技股份有限公司 | Clustering method and device, the computer installation and readable storage medium storing program for executing of Malware |
Non-Patent Citations (1)
Title |
---|
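段明秀 (Duan Mingxiu): "层次聚类算法的研究及应用" [Research and Application of Hierarchical Clustering Algorithms], 《中国优秀硕士学位论文全文数据库》 [China Master's Theses Full-text Database] * |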
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112948640A (en) * | 2021-03-10 | 2021-06-11 | 成都工贸职业技术学院 | Big data clustering method and system based on cloud computing platform |
CN112948640B (en) * | 2021-03-10 | 2022-03-15 | 成都工贸职业技术学院 | Big data clustering method and system based on cloud computing platform |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Aada et al. | Predicting diabetes in medical datasets using machine learning techniques | |
US6988056B2 (en) | Signal interpretation engine | |
CN106778042A (en) | Cardio-cerebral vascular disease patient similarity analysis method and system | |
Patil et al. | An association between fingerprint patterns with blood group and lifestyle based diseases: a review | |
James et al. | Repeated split sample validation to assess logistic regression and recursive partitioning: an application to the prediction of cognitive impairment | |
CN109686442A (en) | Method and system are determined based on the gastroesophageal reflux disease risk factor of machine learning | |
CN111785366B (en) | Patient treatment scheme determination method and device and computer equipment | |
Jelinek et al. | Decision trees and multi-level ensemble classifiers for neurological diagnostics | |
Carrillo-Alarcón et al. | A metaheuristic optimization approach for parameter estimation in arrhythmia classification from unbalanced data | |
CN114732424B (en) | Method for extracting complex network attribute of muscle fatigue state based on surface electromyographic signal | |
Rubega et al. | EEG fractal analysis reflects brain impairment after stroke | |
Abdullah et al. | EEG channel selection techniques in motor imagery applications: a review and new perspectives | |
CN102068239A (en) | Method for intelligently acquiring physiological information in body sensor network | |
Chou et al. | Extracting drug utilization knowledge using self-organizing map and rough set theory | |
KR102169637B1 (en) | Method for predicting of mortality risk and device for predicting of mortality risk using the same | |
CN109685139A (en) | Based on the gastroesophageal reflux disease risk factor extracting method precisely clustered and system | |
Arif et al. | An Approach to ECG-based Gender Recognition Using Random Forest Algorithm | |
Sim et al. | Activity recognition using correlated pattern mining for people with dementia | |
CN109509513A (en) | Gastroesophageal reflux disease risk factor extracting method and system based on distributional clustering | |
Liang et al. | A learning model for the automated assessment of hand-drawn images for visuo-spatial neglect rehabilitation | |
KR102261270B1 (en) | Personalized content providing method based on personal multiple feature information and analysis apparatus | |
US11961204B2 (en) | State visualization device, state visualization method, and state visualization program | |
Gomiero et al. | A Short Version of SIS (Support Intensity Scale): The Utility of the Application of Artificial Adaptive Systems. | |
CN109978007A (en) | A kind of disease risk factor extracting method based on attribute weight cluster | |
da Silva Lourenço et al. | Not one size fits all: influence of EEG type when training a deep neural network for interictal epileptiform discharge detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | ||
Effective date of registration: 20190729 Address after: 210000 Xiaolingwei 179, Xuanwu District, Nanjing City, Jiangsu Province Applicant after: Nanjing Hospital of Integrated Traditional and Chinese Medicine Address before: 210000 Xiaolingwei 179, Xuanwu District, Nanjing City, Jiangsu Province Applicant before: Liu Wanli |
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190426 |