CN109685139A - Method and system for extracting gastroesophageal reflux disease risk factors based on precise clustering - Google Patents


Info

Publication number
CN109685139A
Authority
CN
China
Prior art keywords
quantized data
data matrix
row
column
obtains
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811589375.0A
Other languages
Chinese (zh)
Inventor
刘万里
徐雷
黄玉珍
姚澜
李荣臻
夏吉安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Hospital of Integrated Traditional and Chinese Medicine
Original Assignee
刘万里
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 刘万里
Priority to CN201811589375.0A
Publication of CN109685139A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211: Selection of the most significant subset of features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Primary Health Care (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and system for extracting gastroesophageal reflux disease risk factors based on precise clustering. First, an initial patient information set containing gastroesophageal reflux disease risk factors is constructed. Second, the factors in the initial patient information set are quantized to obtain a quantized data matrix. Third, a hierarchical clustering algorithm is applied to the sample points in the quantized data matrix to obtain a hierarchical clustering dendrogram. Fourth, the number of clusters is determined from the dendrogram and, together with the K-Means clustering algorithm, used to cluster the elements of the quantized data matrix into multiple clusters. Finally, the correlation index between the elements in each cluster is calculated, and the element with the largest correlation index is determined to be a gastroesophageal reflux disease risk factor. By combining two clustering methods, the invention efficiently screens out the risk factors that cause gastroesophageal reflux disease and thereby helps reduce its incidence.

Description

Method and system for extracting gastroesophageal reflux disease risk factors based on precise clustering
Technical field
The present invention relates to the technical fields of clustering and medicine, and in particular to a method and system for extracting gastroesophageal reflux disease risk factors based on precise clustering.
Background
Gastroesophageal reflux disease is a digestive system disease that is common worldwide, and its incidence is rising year by year. Its treatment therefore deserves serious attention. Because the onset of gastroesophageal reflux disease is closely related to lifestyle, emotional changes, eating habits, and similar factors, and the condition changes easily, collecting large amounts of data and analyzing their characteristics plays an important role in studying and preventing the disease.
At present, gastroesophageal reflux disease diagnostic techniques mainly use clustering algorithms to extract risk factors. However, choosing the number of clusters and the cluster centers is difficult, and a wrong choice of either often leads to low precision in risk factor extraction.
Summary of the invention
The object of the present invention is to provide a method and system for extracting gastroesophageal reflux disease risk factors based on precise clustering, so as to solve the prior-art problem that a wrong choice of the number of clusters and the cluster centers leads to low precision in risk factor extraction.
To achieve the above object, the present invention provides the following schemes:
A method for extracting gastroesophageal reflux disease risk factors based on precise clustering, the method comprising:
Constructing an initial patient information set. The initial patient information set is a data set of M rows and N columns. The entry in row i, column 1 is a patient questionnaire ID number, and the column-1 entries in different rows are different patient questionnaire ID numbers. The entry in row 1, column j is a question of the questionnaire, and the row-1 entries in different columns are different questions. The entry in row i, column j is the answer given under the i-th patient questionnaire ID number to the j-th question, where 2≤i≤M and 2≤j≤N;
Quantizing the answers in the initial patient information set to obtain a quantized data matrix. The quantized data matrix is a matrix of M rows and N columns. The element in row i, column 1 is a patient questionnaire ID number, and the column-1 elements in different rows are different patient questionnaire ID numbers. The element in row 1, column j is a question of the questionnaire, and the row-1 elements in different columns are different questions. The element in row i, column j is the quantized answer given under the i-th patient questionnaire ID number to the j-th question, where 2≤i≤M and 2≤j≤N;
Clustering the sample points in the quantized data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram. The z-th sample point represents the z-th row of data in the quantized data matrix, and the number of sample points equals the number of columns of the quantized data matrix, where 2≤z≤M;
Determining the number of clusters from the hierarchical clustering dendrogram;
Clustering the elements of the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain multiple clusters;
Calculating the correlation index between the elements in each cluster, and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor. The correlation index is the mean of the squared correlation coefficients.
Optionally, clustering the sample points in the quantized data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram specifically comprises:
Clustering the sample points in the quantized data matrix with an agglomerative hierarchical clustering algorithm to obtain the hierarchical clustering dendrogram.
Optionally, clustering the sample points in the quantized data matrix with an agglomerative hierarchical clustering algorithm to obtain the hierarchical clustering dendrogram specifically comprises:
Step 1: calculate the distance between every pair of sample points;
Step 2: merge the two sample points with the smallest distance into one class;
Step 3: repeat steps 1 and 2 until all sample points are merged into one class, yielding the hierarchical clustering dendrogram.
Optionally, calculating the distance between every pair of sample points specifically comprises:
Calculating the distance between every pair of sample points with the average distance algorithm.
Optionally, before clustering the elements of the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain multiple clusters, the method further comprises:
Determining the initial cluster centers with the K-Means++ algorithm.
Optionally, clustering the elements of the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain multiple clusters specifically comprises:
Inputting the number of clusters, the initial cluster centers, and the quantized data matrix, and running the program corresponding to the K-Means clustering algorithm to perform cluster analysis on the elements of the quantized data matrix, obtaining multiple clusters.
A system for extracting gastroesophageal reflux disease risk factors based on precise clustering, comprising:
An initial patient information set construction module, configured to construct an initial patient information set. The initial patient information set is a data set of M rows and N columns. The entry in row i, column 1 is a patient questionnaire ID number, and the column-1 entries in different rows are different patient questionnaire ID numbers. The entry in row 1, column j is a question of the questionnaire, and the row-1 entries in different columns are different questions. The entry in row i, column j is the answer given under the i-th patient questionnaire ID number to the j-th question, where 2≤i≤M and 2≤j≤N;
A quantized data matrix obtaining module, configured to quantize the answers in the initial patient information set to obtain a quantized data matrix. The quantized data matrix is a matrix of M rows and N columns. The element in row i, column 1 is a patient questionnaire ID number, and the column-1 elements in different rows are different patient questionnaire ID numbers. The element in row 1, column j is a question of the questionnaire, and the row-1 elements in different columns are different questions. The element in row i, column j is the quantized answer given under the i-th patient questionnaire ID number to the j-th question, where 2≤i≤M and 2≤j≤N;
A hierarchical clustering dendrogram obtaining module, configured to cluster the sample points in the quantized data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram. The z-th sample point represents the z-th row of data in the quantized data matrix, and the number of sample points equals the number of columns of the quantized data matrix, where 2≤z≤M;
A cluster number determining module, configured to determine the number of clusters from the hierarchical clustering dendrogram;
A cluster obtaining module, configured to cluster the elements of the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain multiple clusters;
A gastroesophageal reflux disease risk factor determining module, configured to calculate the correlation index between the elements in each cluster and to determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor. The correlation index is the mean of the squared correlation coefficients.
Optionally, the hierarchical clustering dendrogram obtaining module specifically comprises:
A hierarchical clustering dendrogram obtaining unit, configured to cluster the sample points in the quantized data matrix with an agglomerative hierarchical clustering algorithm to obtain the hierarchical clustering dendrogram.
Optionally, the system further comprises:
An initial cluster center determining module, configured to determine the initial cluster centers with the K-Means++ algorithm.
Optionally, the cluster obtaining module specifically comprises:
A cluster obtaining unit, configured to input the number of clusters, the initial cluster centers, and the quantized data matrix, and to run the program corresponding to the K-Means clustering algorithm to perform cluster analysis on the elements of the quantized data matrix, obtaining multiple clusters.
According to the specific embodiments provided by the present invention, the invention achieves the following technical effects:
The present invention proposes a method and system that extract gastroesophageal reflux disease risk factors precisely, mainly by combining two clustering methods. It combines hierarchical clustering with K-Means clustering, applies them to population clustering, and clusters along two dimensions, features and population. This finally yields precisely clustered populations in different clusters, and statistical methods are then used to identify the key factors that characterize each population. By combining the two clustering methods, the invention therefore efficiently screens out the risk factors that cause gastroesophageal reflux disease, provides a scientific basis for future medical research and disease diagnosis, offers guidance on gastroesophageal reflux disease, and helps reduce its incidence.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of the method for extracting gastroesophageal reflux disease risk factors based on precise clustering according to an embodiment of the present invention;
Fig. 2 is a structural diagram of the system for extracting gastroesophageal reflux disease risk factors based on precise clustering according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The object of the present invention is to provide a method and system for extracting gastroesophageal reflux disease risk factors based on precise clustering, which can efficiently and accurately screen out the risk factors that cause gastroesophageal reflux disease.
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Embodiment 1
Fig. 1 is a flow diagram of the method for extracting gastroesophageal reflux disease risk factors based on precise clustering according to an embodiment of the present invention. As shown in Fig. 1, the method provided by the embodiment specifically includes the following steps.
Step 101: construct an initial patient information set. The initial patient information set is a data set of M rows and N columns. The entry in row i, column 1 is a patient questionnaire ID number, and the column-1 entries in different rows are different patient questionnaire ID numbers. The entry in row 1, column j is a question of the questionnaire, and the row-1 entries in different columns are different questions. The entry in row i, column j is the answer given under the i-th patient questionnaire ID number to the j-th question, where 2≤i≤M, 2≤j≤N, and i and j are positive integers.
Step 102: quantize the answers in the initial patient information set to obtain a quantized data matrix. The quantized data matrix is a matrix of M rows and N columns. The element in row i, column 1 is a patient questionnaire ID number, and the column-1 elements in different rows are different patient questionnaire ID numbers. The element in row 1, column j is a question of the questionnaire, and the row-1 elements in different columns are different questions. The element in row i, column j is the quantized answer given under the i-th patient questionnaire ID number to the j-th question, where 2≤i≤M and 2≤j≤N. In other words, the quantized data matrix is the initial matrix of key risk factors.
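The layout described in steps 101 and 102 can be illustrated with a small Python sketch. The values, question names, and the helper `split_matrix` are hypothetical and serve only to show how row 1 (questions) and column 1 (questionnaire ID numbers) can be separated from the numeric body before clustering.

```python
# Hypothetical 3x4 quantized matrix: row 1 holds the questions,
# column 1 holds the patient questionnaire ID numbers.
quantized = [
    ["id", "age", "infected_hp", "sleep_disturbance"],
    ["Q001", 45, 0, 4],
    ["Q002", 52, 1, 2],
]


def split_matrix(matrix):
    """Separate questions, questionnaire IDs, and the numeric body."""
    questions = matrix[0][1:]
    ids = [row[0] for row in matrix[1:]]
    body = [row[1:] for row in matrix[1:]]
    return questions, ids, body


questions, ids, body = split_matrix(quantized)
```

Only `body` would be passed to a clustering routine; `questions` and `ids` are kept to label the results.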
Step 103: cluster the sample points in the quantized data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram. The z-th sample point represents the z-th row of data in the quantized data matrix, and the number of sample points equals the number of columns of the quantized data matrix, where 2≤z≤M and z is a positive integer.
Step 104: determine the number of clusters from the hierarchical clustering dendrogram.
Step 105: cluster the elements of the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain multiple clusters, where the number of clusters obtained equals the determined number of clusters and different clusters represent different categories of patients.
Step 106: calculate the correlation index between the elements in each cluster, and determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.
Step 101 specifically includes:
In this embodiment, questionnaires are distributed to each person with gastroesophageal reflux disease in the hospital, and the initial patient information set is built from the returned questionnaires.
The initial patient information set in this embodiment has 241 dimensions in total, comprising the uniquely identifying patient questionnaire ID number and the answers to each question of the questionnaire.
The questionnaire contains several groups of questions, covering general demographic data, lifestyle, eating habits, psychological factors, sleep factors, and so on, together with the respondents' answers. There are three question types: single-choice, yes/no, and open-ended questions.
Step 102 quantizes the respondents' answers.
Specifically, the patient questionnaire ID number serves as the unique identifier. For single-choice questions whose options express a degree of severity, such as "often", "occasionally", "seldom", and "never", the weights 4, 3, 2, and 1 are assigned in order of severity, and the weight matching the given answer is selected. For yes/no questions, "yes" is assigned 1 and "no" is assigned 0, and the value matching the given answer is selected. Questions whose options have no severity ordering, such as occupation, are treated as not useful for the result and may be deleted. For open-ended questions, the continuous numerical value entered by the user is used directly as the data. For example, for a user who is 45 years old, has a height of 172, is not infected with HP, and often has sleep disturbance, the sample data columns are [age, height, HP infection, sleep disturbance] and the data values are [45, 172, 0, 4]. The purpose of this step is to quantize all respondents' answers and obtain the quantized data matrix R.
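The quantization rules above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the question-kind labels (`single_choice`, `yes_no`, `numeric`), the English option strings, and the function names are assumptions introduced here.

```python
# Hypothetical option-to-weight tables; the weights follow the example in the text.
FREQUENCY_WEIGHTS = {"often": 4, "occasionally": 3, "seldom": 2, "never": 1}
YES_NO_WEIGHTS = {"yes": 1, "no": 0}


def quantize_answer(kind, answer):
    """Map one questionnaire answer to a number.

    kind is "single_choice", "yes_no", or "numeric" (hypothetical labels).
    """
    if kind == "single_choice":
        return FREQUENCY_WEIGHTS[answer]
    if kind == "yes_no":
        return YES_NO_WEIGHTS[answer]
    if kind == "numeric":
        return float(answer)
    raise ValueError(f"unknown question kind: {kind}")


def quantize_row(kinds, answers):
    """Quantize one respondent's answers, given the kind of each question."""
    return [quantize_answer(k, a) for k, a in zip(kinds, answers)]


# The worked example from the text: age 45, height 172, no HP infection,
# frequent sleep disturbance.
kinds = ["numeric", "numeric", "yes_no", "single_choice"]
row = quantize_row(kinds, ["45", "172", "no", "often"])
print(row)  # [45.0, 172.0, 0, 4]
```

Questions without a severity ordering, such as occupation, would simply be left out of `kinds` and `answers` before this step.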
Step 103 specifically includes:
Cluster analysis can cluster either indicators or samples; here, indicators are clustered.
This embodiment uses hierarchical clustering and, more specifically, agglomerative hierarchical clustering to cluster the sample points in the quantized data matrix R. The basic principle is to first treat each object to be clustered as its own class, measure the similarity between classes with a class statistic, merge the two closest classes into one, and continue merging step by step until all objects are merged into a single class.
The steps of agglomerative hierarchical clustering are as follows:
Treat each sample point in the quantized data matrix R as an independent class.
Using the distance formula, calculate the distance between every pair of classes and find the two classes c1 and c2 with the smallest distance.
Merge class c1 and class c2 into one class.
Repeat the above steps until all sample points are merged into one class, which yields a hierarchical clustering dendrogram.
As the distance metric, this embodiment uses the average distance method to calculate the distance between two classes.
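The average distance method computes the distance from each data point in one class to every data point in the other class and takes the mean of all these distances as the distance between the two classes. The calculation formula is as follows:
$d(c_1, c_2) = \frac{1}{|c_1|\,|c_2|} \sum_{x \in c_1} \sum_{z \in c_2} \mathrm{dist}(x, z)$
where $|c_1|$ and $|c_2|$ are the numbers of data points in classes c1 and c2, and dist(x, z) is the Euclidean distance.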
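The agglomerative procedure with the average distance method can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the embodiment's program: the function names and the returned merge-history format are introduced here, and the naive pairwise search is only practical for small inputs.

```python
import math


def euclidean(x, z):
    """Euclidean distance between two points of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))


def average_distance(c1, c2, points):
    """Mean Euclidean distance over all cross-class pairs of points."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(points):
    """Merge the two closest classes until one remains.

    Returns the merge history as (members_a, members_b, distance) tuples,
    which is the information a dendrogram plots.
    """
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_distance(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return history


# Hypothetical data: two well-separated pairs of points.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
history = agglomerative(points)
```

For this input the first two merges join the nearby pairs at distance 1.0, and the final merge happens at a much larger height, which is the kind of jump a dendrogram makes visible.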
Step 104 specifically includes:
Drawing on domain expertise, the number of clusters k, that is, the number of risk factors finally selected, is determined by analyzing the hierarchical clustering dendrogram.
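The text leaves the choice of k to expert reading of the dendrogram. As a starting point for that judgment, one common heuristic, which is an assumption here and not part of the described method, is to cut the dendrogram at the largest jump between consecutive merge heights. The function name `suggest_cluster_count` is hypothetical.

```python
def suggest_cluster_count(merge_distances):
    """Suggest k by cutting the dendrogram at the largest jump in merge height.

    merge_distances: the n-1 merge heights of a dendrogram over n points,
    in the order the merges happened (assumed non-decreasing).
    """
    n = len(merge_distances) + 1
    best_gap, best_k = -1.0, 1
    for i in range(len(merge_distances) - 1):
        gap = merge_distances[i + 1] - merge_distances[i]
        if gap > best_gap:
            # Cutting after merge i leaves n - (i + 1) clusters.
            best_gap, best_k = gap, n - (i + 1)
    return best_k


print(suggest_cluster_count([1.0, 1.0, 7.09]))  # 2
```

The suggested value is only a candidate; in the embodiment the final k is fixed together with domain expertise.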
Using the number of clusters k obtained in step 104, K-Means clustering is applied to the respondent variables, that is, to the quantized data matrix R.
Before step 105 is executed, the method further includes determining the cluster centers with the K-Means++ algorithm.
The idea behind how K-Means++ selects the initial cluster centers is that the initial centers should be as far from each other as possible. The algorithm steps are as follows:
1. Randomly select one element of the quantized data matrix R as the first initial cluster center.
2. For each element, calculate the distance D(x) to its nearest initial cluster center, obtaining D(1), D(2), ..., D(n), which form the set D; then sum all the distances to obtain Sum(D(x)).
3. Take a random value and use it, in a weighted manner, to pick the next seed point. Concretely, first take a random value Random, then multiply Sum(D(x)) by Random to obtain a value r that falls within Sum(D(x)), and repeatedly update r = r - D(x_g) over successive elements until r ≤ 0; the g-th element at that point is the next seed point, that is, the next cluster center.
4. Repeat steps 2 and 3 until k initial cluster centers have been selected.
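The seeding steps above can be sketched in Python as follows. The sketch follows the text in weighting candidates by D(x); the widely used formulation of K-Means++ weights by D(x) squared instead. The function names, the `rng` parameter, and the handling of zero-distance points are assumptions made for this illustration.

```python
import math
import random


def euclidean(x, z):
    """Euclidean distance between two points of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))


def kmeanspp_centers(points, k, rng=None):
    """Pick k initial centers, favoring points far from centers already chosen."""
    rng = rng or random.Random()
    # Step 1: the first center is a uniformly random point.
    centers = [points[rng.randrange(len(points))]]
    while len(centers) < k:
        # Step 2: D(x) is the distance from each point to its nearest center.
        d = [min(euclidean(p, c) for c in centers) for p in points]
        total = sum(d)
        if total == 0:
            # Every point coincides with a center; fall back to a uniform pick.
            centers.append(points[rng.randrange(len(points))])
            continue
        # Step 3: draw r in [0, total) and subtract D(x_g) until r <= 0.
        r = rng.random() * total
        chosen = None
        for g, dg in enumerate(d):
            if dg == 0:
                continue  # skip points that already coincide with a center
            chosen = g
            r -= dg
            if r <= 0:
                break
        centers.append(points[chosen])
    return centers


# Hypothetical data and a fixed seed for repeatability.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
centers = kmeanspp_centers(points, 2, random.Random(0))
```

Passing a seeded `random.Random` makes the selection repeatable, which is convenient when comparing clustering runs.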
Step 105 specifically includes: inputting the quantized data matrix R, the number of clusters, and the initial cluster centers, and running the program corresponding to the K-Means algorithm to perform cluster analysis on the elements of the quantized data matrix, obtaining multiple clusters.
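A minimal sketch of the K-Means iteration that step 105 refers to is given below, assuming the numeric body of R is passed as a list of rows and the initial centers come from a seeding step. The function name, the `max_iter` parameter, and the empty-cluster handling are assumptions of this sketch.

```python
import math


def euclidean(x, z):
    """Euclidean distance between two points of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))


def kmeans(points, centers, max_iter=100):
    """Run K-Means from the given initial centers; return (labels, centers)."""
    centers = [list(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
    return labels, centers


# Hypothetical data with hand-picked initial centers.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, final_centers = kmeans(points, [[0, 0], [10, 10]])
```

The returned `labels` list assigns each row of R to a cluster, which is the input needed for the grouping in step 106.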
Step 106 specifically includes: to find the risk factors that determine the formation of each patient group, the correlation index (the mean of the squared correlation coefficients) between the elements in each cluster is calculated, and the element with the largest correlation index is determined to be a gastroesophageal reflux disease risk factor. One risk factor is thus selected from each cluster.
The quantized data matrix R is grouped by cluster label into R1, R2, R3, ..., Rk, and the initial risk factor set is the empty set. The correlation of the elements in each cluster is analyzed, the correlation index between elements is calculated, and in each cluster the element with the largest correlation index is added to the risk factor set.
The correlation coefficient is calculated as follows:
$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$
where Var(X) is the variance of X, Var(Y) is the variance of Y, Cov(X, Y) is the covariance between X and Y, and X and Y are elements in a cluster.
The sample correlation index R² of a given feature is the mean of its squared correlation coefficients, as defined above.
Here X is the given feature, i is the feature index, and n is the total number of features.
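Step 106 can be sketched in Python as follows. The sketch assumes that the correlation index of a feature is the mean of its squared Pearson correlations with the other features of the same cluster (the feature itself is excluded, which is an assumption), and that a constant column counts as uncorrelated. All function names are hypothetical, and each cluster is assumed to have at least two features.

```python
import math


def pearson(x, y):
    """Pearson correlation: Cov(X, Y) / sqrt(Var(X) * Var(Y)).

    The common 1/n factors cancel, so unnormalized sums are used.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / math.sqrt(vx * vy)


def correlation_indices(rows):
    """For each column (feature), the mean squared correlation with the others."""
    cols = list(zip(*rows))
    n = len(cols)
    result = []
    for i in range(n):
        squares = [pearson(cols[i], cols[j]) ** 2 for j in range(n) if j != i]
        result.append(sum(squares) / len(squares))
    return result


def top_feature(rows):
    """Index of the feature with the largest correlation index in one cluster."""
    scores = correlation_indices(rows)
    return max(range(len(scores)), key=scores.__getitem__)


def risk_factors(rows, labels):
    """Group rows by cluster label and pick one top feature index per cluster."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {top_feature(group) for group in groups.values()}


# Hypothetical cluster: columns 0 and 1 move together, column 2 does not.
rows = [[1, 2, 5], [2, 4, 1], [3, 6, 4], [4, 8, 2]]
scores = correlation_indices(rows)
```

Because the selected indices are collected in a set, a feature chosen in several clusters appears only once in the resulting risk factor set.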
Embodiment 2
To achieve the above object, the present invention also provides a system for extracting gastroesophageal reflux disease risk factors based on precise clustering, as shown in Fig. 2. The system includes:
An initial patient information set construction module 100, configured to construct an initial patient information set. The initial patient information set is a data set of M rows and N columns. The entry in row i, column 1 is a patient questionnaire ID number, and the column-1 entries in different rows are different patient questionnaire ID numbers. The entry in row 1, column j is a question of the questionnaire, and the row-1 entries in different columns are different questions. The entry in row i, column j is the answer given under the i-th patient questionnaire ID number to the j-th question, where 2≤i≤M and 2≤j≤N.
A quantized data matrix obtaining module 200, configured to quantize the answers in the initial patient information set to obtain a quantized data matrix. The quantized data matrix is a matrix of M rows and N columns. The element in row i, column 1 is a patient questionnaire ID number, and the column-1 elements in different rows are different patient questionnaire ID numbers. The element in row 1, column j is a question of the questionnaire, and the row-1 elements in different columns are different questions. The element in row i, column j is the quantized answer given under the i-th patient questionnaire ID number to the j-th question, where 2≤i≤M and 2≤j≤N.
A hierarchical clustering dendrogram obtaining module 300, configured to cluster the sample points in the quantized data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram. The z-th sample point represents the z-th row of data in the quantized data matrix, and the number of sample points equals the number of columns of the quantized data matrix, where 2≤z≤M.
A cluster number determining module 400, configured to determine the number of clusters from the hierarchical clustering dendrogram.
A cluster obtaining module 500, configured to cluster the elements of the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain multiple clusters.
A gastroesophageal reflux disease risk factor determining module 600, configured to calculate the correlation index between the elements in each cluster and to determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.
Preferably, the system further includes an initial cluster center determining module, configured to determine the initial cluster centers with the K-Means++ algorithm.
The hierarchical clustering dendrogram obtaining module 300 specifically includes:
A hierarchical clustering dendrogram obtaining unit, configured to cluster the sample points in the quantized data matrix with an agglomerative hierarchical clustering algorithm to obtain the hierarchical clustering dendrogram.
The cluster obtaining module 500 specifically includes:
A cluster obtaining unit, configured to input the number of clusters, the initial cluster centers, and the quantized data matrix, and to run the program corresponding to the K-Means clustering algorithm to perform cluster analysis on the elements of the quantized data matrix, obtaining multiple clusters.
Compared with the prior art, the advantages of the present invention are as follows:
At present, machine learning methods are in fact rarely used to extract risk factors in gastroesophageal reflux disease diagnostic techniques. In the medical field, most risk factor extraction relies on statistical methods, which are computationally expensive and, compared with machine learning, less precise.
By combining two clustering methods, the present invention remedies shortcomings inherent in K-Means and further improves clustering accuracy.
The risk factor extraction of the present invention combines population clustering with indicator screening and analyzes risk factors within different populations; combining clustering methods with statistical methods makes the results more accurate.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be cross-referenced. Because the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, it is described relatively briefly; for relevant details, refer to the description of the method.
Specific examples are used herein to explain the principle and implementation of the present invention. The above description of the embodiments is intended only to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementation and the scope of application. In conclusion, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method for extracting gastroesophageal reflux disease risk factors based on precise clustering, characterized in that the method comprises:
Constructing an initial patient information set; the initial patient information set is a data set of M rows and N columns; the entry in row i, column 1 of the initial patient information set is a patient questionnaire ID number, and the column-1 entries in different rows are different patient questionnaire ID numbers; the entry in row 1, column j of the initial patient information set is a question of the questionnaire, and the row-1 entries in different columns are different questions; the entry in row i, column j of the initial patient information set is the answer given under the i-th patient questionnaire ID number to the j-th question; wherein 2≤i≤M and 2≤j≤N;
Performing data quantization on the answers in the initial patient information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a patient questionnaire ID number, and the column-1 elements in different rows are different patient questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a question of the questionnaire, and the row-1 elements in different columns are different questions; the element in row i, column j of the quantized data matrix is the data quantization result of the answer given under the i-th patient questionnaire ID number to the j-th question; wherein 2≤i≤M and 2≤j≤N;
Clustering the sample points in the quantized data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the quantized data matrix; the number of sample points is the same as the number of columns of the quantized data matrix, wherein 2≤z≤M;
Determining the number of clusters according to the hierarchical clustering dendrogram;
Clustering the elements in the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain a plurality of clusters;
Calculating the correlation index between the elements in each cluster, and determining the element with the largest correlation index as a gastroesophageal reflux disease risk factor; the correlation index is the mean of the squared correlation coefficients.
2. The gastroesophageal reflux disease risk factor extraction method according to claim 1, characterized in that clustering the sample points in the quantized data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram specifically comprises:
Clustering the sample points in the quantized data matrix using an agglomerative hierarchical clustering algorithm to obtain the hierarchical clustering dendrogram.
3. The gastroesophageal reflux disease risk factor extraction method according to claim 2, characterized in that clustering the sample points in the quantized data matrix using an agglomerative hierarchical clustering algorithm to obtain the hierarchical clustering dendrogram specifically comprises:
Step 1: calculating the distance between every two sample points;
Step 2: merging the two sample points with the smallest distance into one class;
Step 3: repeating step 1 and step 2 until all sample points are merged into one class, to obtain the hierarchical clustering dendrogram.
4. The gastroesophageal reflux disease risk factor extraction method according to claim 3, characterized in that calculating the distance between every two sample points specifically comprises:
Calculating the distance between every two sample points using an average distance algorithm.
5. The gastroesophageal reflux disease risk factor extraction method according to claim 1, characterized in that before clustering the elements in the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain a plurality of clusters, the method further comprises:
Determining the initial cluster centers using the K-Means++ algorithm.
6. The gastroesophageal reflux disease risk factor extraction method according to claim 5, characterized in that clustering the elements in the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain a plurality of clusters specifically comprises:
Taking the number of clusters, the initial cluster centers, and the quantized data matrix as input, running the program corresponding to the K-Means clustering algorithm, and performing cluster analysis on the elements in the quantized data matrix to obtain a plurality of clusters.
7. A system for extracting gastroesophageal reflux disease risk factors based on precise clustering, characterized in that the system comprises:
An initial patient information set constructing module, configured to construct an initial patient information set; the initial patient information set is a data set of M rows and N columns; the entry in row i, column 1 of the initial patient information set is a patient questionnaire ID number, and the column-1 entries in different rows are different patient questionnaire ID numbers; the entry in row 1, column j of the initial patient information set is a question of the questionnaire, and the row-1 entries in different columns are different questions; the entry in row i, column j of the initial patient information set is the answer given under the i-th patient questionnaire ID number to the j-th question; wherein 2≤i≤M and 2≤j≤N;
A quantized data matrix obtaining module, configured to perform data quantization on the answers in the initial patient information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a patient questionnaire ID number, and the column-1 elements in different rows are different patient questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a question of the questionnaire, and the row-1 elements in different columns are different questions; the element in row i, column j of the quantized data matrix is the data quantization result of the answer given under the i-th patient questionnaire ID number to the j-th question; wherein 2≤i≤M and 2≤j≤N;
A hierarchical clustering dendrogram obtaining module, configured to cluster the sample points in the quantized data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the quantized data matrix; the number of sample points is the same as the number of columns of the quantized data matrix, wherein 2≤z≤M;
A cluster number determining module, configured to determine the number of clusters according to the hierarchical clustering dendrogram;
A cluster obtaining module, configured to cluster the elements in the quantized data matrix according to the number of clusters and the K-Means clustering algorithm to obtain a plurality of clusters;
A gastroesophageal reflux disease risk factor determining module, configured to calculate the correlation index between the elements in each cluster and to determine the element with the largest correlation index as a gastroesophageal reflux disease risk factor; the correlation index is the mean of the squared correlation coefficients.
8. The gastroesophageal reflux disease risk factor extraction system according to claim 7, characterized in that the hierarchical clustering dendrogram obtaining module specifically comprises:
A hierarchical clustering dendrogram obtaining unit, configured to cluster the sample points in the quantized data matrix using an agglomerative hierarchical clustering algorithm to obtain the hierarchical clustering dendrogram.
9. The gastroesophageal reflux disease risk factor extraction system according to claim 7, characterized in that the system further comprises:
An initial cluster center determining module, configured to determine the initial cluster centers using the K-Means++ algorithm.
10. The gastroesophageal reflux disease risk factor extraction method according to claim 9, characterized in that the cluster obtaining module specifically comprises:
A cluster obtaining unit, configured to take the number of clusters, the initial cluster centers, and the quantized data matrix as input, run the program corresponding to the K-Means clustering algorithm, and perform cluster analysis on the elements in the quantized data matrix to obtain a plurality of clusters.
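The data layout recited in claims 1 and 7 (questions in the first row, patient questionnaire ID numbers in the first column, answers elsewhere) can be illustrated with a short sketch. The answer-to-number mapping, the sample questions, and the function names below are hypothetical; the claims do not specify how answers are quantized.

```python
def quantize(table, answer_codes):
    """Replace each answer with its numeric code.

    table: list of rows; row 0 holds the questions, column 0 holds the
    patient questionnaire ID numbers (both are copied through unchanged).
    answer_codes: assumed mapping from answer text to a number.
    """
    header, *rows = table
    quantized = [list(header)]
    for row in rows:
        patient_id, *answers = row
        quantized.append([patient_id] + [answer_codes[a] for a in answers])
    return quantized

def sample_points(quantized):
    """Drop the question row and the ID column, leaving one numeric
    vector per patient for the clustering steps."""
    return [row[1:] for row in quantized[1:]]

# Hypothetical questionnaire with M = 4 rows and N = 3 columns.
table = [
    ["ID", "Smokes", "Heartburn frequency"],
    ["P001", "yes", "often"],
    ["P002", "no", "never"],
    ["P003", "no", "sometimes"],
]
codes = {"no": 0, "yes": 1, "never": 0, "sometimes": 1, "often": 2}
matrix = quantize(table, codes)
vectors = sample_points(matrix)
```

Keeping the header row and ID column in the quantized matrix matches the M-row, N-column shape in the claims, while the clustering itself only needs the numeric block returned by `sample_points`.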
CN201811589375.0A 2018-12-25 2018-12-25 Based on the gastroesophageal reflux disease risk factor extracting method precisely clustered and system Pending CN109685139A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811589375.0A CN109685139A (en) 2018-12-25 2018-12-25 Based on the gastroesophageal reflux disease risk factor extracting method precisely clustered and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811589375.0A CN109685139A (en) 2018-12-25 2018-12-25 Based on the gastroesophageal reflux disease risk factor extracting method precisely clustered and system

Publications (1)

Publication Number Publication Date
CN109685139A true CN109685139A (en) 2019-04-26

Family

ID=66189310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811589375.0A Pending CN109685139A (en) 2018-12-25 2018-12-25 Based on the gastroesophageal reflux disease risk factor extracting method precisely clustered and system

Country Status (1)

Country Link
CN (1) CN109685139A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956318A (en) * 2016-05-19 2016-09-21 上海电机学院 Improved splitting H-K clustering method-based wind power plant fleet division method
CN107368856A (en) * 2017-07-25 2017-11-21 深信服科技股份有限公司 Clustering method and device, the computer installation and readable storage medium storing program for executing of Malware

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
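段明秀 (Duan Mingxiu): "Research and Application of Hierarchical Clustering Algorithms" (层次聚类算法的研究及应用), China Master's Theses Full-text Database *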

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112948640A (en) * 2021-03-10 2021-06-11 成都工贸职业技术学院 Big data clustering method and system based on cloud computing platform
CN112948640B (en) * 2021-03-10 2022-03-15 成都工贸职业技术学院 Big data clustering method and system based on cloud computing platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190729

Address after: 210000 Xiaolingwei 179, Xuanwu District, Nanjing City, Jiangsu Province

Applicant after: Nanjing Hospital of Integrated Traditional and Chinese Medicine

Address before: 210000 Xiaolingwei 179, Xuanwu District, Nanjing City, Jiangsu Province

Applicant before: Liu Wanli

RJ01 Rejection of invention patent application after publication

Application publication date: 20190426