CN109686442B - Machine learning-based gastroesophageal reflux disease risk factor determination method and system - Google Patents

Info

Publication number
CN109686442B
Authority
CN
China
Prior art keywords
data matrix
gastroesophageal reflux
reflux disease
questionnaire
equal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201811589405.8A
Other languages
Chinese (zh)
Other versions
CN109686442A (en
Inventor
刘万里
徐雷
黄玉珍
姚澜
李荣臻
夏吉安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING INTEGRATED TRADITIONAL CHINESE AND WESTERN MEDICINE HOSPITAL
Original Assignee
NANJING INTEGRATED TRADITIONAL CHINESE AND WESTERN MEDICINE HOSPITAL
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING INTEGRATED TRADITIONAL CHINESE AND WESTERN MEDICINE HOSPITAL filed Critical NANJING INTEGRATED TRADITIONAL CHINESE AND WESTERN MEDICINE HOSPITAL
Priority to CN201811589405.8A priority Critical patent/CN109686442B/en
Publication of CN109686442A publication Critical patent/CN109686442A/en
Application granted granted Critical
Publication of CN109686442B publication Critical patent/CN109686442B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Abstract

The invention discloses a machine learning-based method and system for determining risk factors of gastroesophageal reflux disease, which address the low accuracy of the statistical methods used for this purpose in the prior art. First, a user information set containing gastroesophageal reflux disease risk factors is constructed, and the factors in the user information set are quantized to obtain a quantized data matrix. Second, the quantized data matrix is standardized, and the standardized matrix is reduced in dimension using a principal component analysis algorithm. Then, the data in the processed data set are clustered with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the number of clusters is determined from the dendrogram, and the data in the processed data set are partitioned according to that number into a plurality of clusters. Finally, the correlation indexes among the elements in each cluster are calculated, and the element with the largest correlation index is determined to be a gastroesophageal reflux disease risk factor.

Description

Machine learning-based gastroesophageal reflux disease risk factor determination method and system
Technical Field
The invention relates to the technical field of machine learning and medicine, in particular to a method and a system for determining risk factors of gastroesophageal reflux disease based on machine learning.
Background
Gastroesophageal reflux disease is a common disease of the digestive system worldwide, and its incidence is rising year by year, so its treatment deserves sufficient attention. Because the occurrence of gastroesophageal reflux disease is closely related to lifestyle, mood changes, eating habits and the like, the condition changes easily. Collecting a large amount of data and analyzing its characteristics therefore plays an important role in studying and preventing the disease.
At present, machine learning methods are not used to extract risk factors in gastroesophageal reflux disease diagnosis. Most risk factor extraction in the medical field relies on statistical methods, which are computationally expensive and less accurate than machine learning.
Disclosure of Invention
The invention aims to provide a method and a system for determining risk factors of gastroesophageal reflux disease based on machine learning, and the method and the system are used for solving the problem of low accuracy rate when the risk factors of gastroesophageal reflux disease are determined by statistics in the prior art.
In order to achieve the purpose, the invention provides the following scheme:
a method for determining risk factors for gastroesophageal reflux disease based on machine learning, comprising:
constructing a user information set; the user information set is a data set with M rows and N columns; the factor in the ith row and 1st column of the user information set is a user questionnaire ID number, and the factors in the 1st column of different rows are different user questionnaire ID numbers; the factor in the 1st row and jth column of the user information set is a question of the questionnaire, and the factors in the 1st row of different columns are different questions; the factor in the ith row and jth column of the user information set is the answer given under the ith user questionnaire ID number to the jth question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N; M represents the number of all users participating in the questionnaire, and N represents the number of all questions in the questionnaire;
carrying out data quantization processing on the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix with M rows and N columns; the element in the ith row and 1st column of the quantized data matrix is a user questionnaire ID number, and the elements in the 1st column of different rows are different user questionnaire ID numbers; the element in the 1st row and jth column of the quantized data matrix is a question of the questionnaire, and the elements in the 1st row of different columns are different questions; the element in the ith row and jth column of the quantized data matrix is the data quantization result of the answer given under the ith user questionnaire ID number to the jth question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N;
standardizing the quantized data matrix to obtain a standardized data matrix;
adopting a principal component analysis algorithm to perform dimension reduction processing on the standardized data matrix, and performing reconstruction processing on the dimension-reduced data matrix to obtain a reconstructed data matrix;
processing each sample point in the reconstructed data matrix by adopting a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; wherein 2 ≤ z ≤ M;
determining the clustering number according to the hierarchical clustering dendrogram, and clustering elements in the reconstructed data matrix by adopting a clustering algorithm according to the clustering number to obtain a plurality of clusters;
calculating the correlation index among elements in each cluster, and determining the element with the maximum correlation index as a gastroesophageal reflux disease risk factor; the correlation index is the average of the squares of the correlation coefficients.
Optionally, standardizing the quantized data matrix to obtain a standardized data matrix specifically includes:
standardizing the quantized data matrix by adopting a Z-Score standardization algorithm; the data of each dimension in the standardized data matrix have a mean of 0 and a variance of 1.
Optionally, performing dimension reduction processing on the standardized data matrix by adopting a principal component analysis algorithm specifically includes:
calculating a correlation matrix of the standardized data matrix;
calculating the eigenvalues of the correlation matrix and the eigenvector corresponding to each eigenvalue;
and arranging the eigenvalues in descending order, and selecting the eigenvectors corresponding to the first N eigenvalues to form the dimension-reduced data set.
Optionally, processing each sample point in the reconstructed data matrix by adopting a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram specifically includes:
step 1, calculating the distance between every two sample points by adopting an average distance algorithm;
step 2, selecting the two sample points with the minimum distance and merging them into one class;
step 3, repeating step 1 and step 2 until all the sample points are merged into one class, thereby obtaining the hierarchical clustering dendrogram.
Optionally, the calculating a correlation index between elements in each cluster, and determining the element with the largest correlation index as a risk factor for gastroesophageal reflux disease specifically includes:
calculating the correlation index among elements in each class cluster;
and arranging all the correlation indexes in descending order, and determining the element corresponding to the largest correlation index as the gastroesophageal reflux disease risk factor.
A machine learning based gastroesophageal reflux disease risk factor determination system comprising:
the user information set construction module is used for constructing a user information set; the user information set is a data set with M rows and N columns; the factor in the ith row and 1st column of the user information set is a user questionnaire ID number, and the factors in the 1st column of different rows are different user questionnaire ID numbers; the factor in the 1st row and jth column of the user information set is a question of the questionnaire, and the factors in the 1st row of different columns are different questions; the factor in the ith row and jth column of the user information set is the answer given under the ith user questionnaire ID number to the jth question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N; M represents the number of all users participating in the questionnaire, and N represents the number of all questions in the questionnaire;
the quantization processing module is used for carrying out data quantization processing on the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix with M rows and N columns; the element in the ith row and 1st column of the quantized data matrix is a user questionnaire ID number, and the elements in the 1st column of different rows are different user questionnaire ID numbers; the element in the 1st row and jth column of the quantized data matrix is a question of the questionnaire, and the elements in the 1st row of different columns are different questions; the element in the ith row and jth column of the quantized data matrix is the data quantization result of the answer given under the ith user questionnaire ID number to the jth question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N;
the standardization processing module is used for carrying out standardization processing on the quantized data matrix to obtain a standardized data matrix;
the dimensionality reduction reconstruction module is used for carrying out dimensionality reduction on the standardized data matrix by adopting a principal component analysis algorithm and reconstructing the dimensionality reduced data matrix to obtain a reconstructed data matrix;
a hierarchical clustering dendrogram obtaining module, configured to process each sample point in the reconstructed data matrix by using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; wherein 2 ≤ z ≤ M;
the cluster dividing module is used for determining the clustering number according to the hierarchical clustering tree diagram and clustering elements in the reconstructed data matrix by adopting a clustering algorithm according to the clustering number to obtain a plurality of clusters;
the gastroesophageal reflux disease risk factor determining module is used for calculating the correlation index among all elements in each cluster, and determining the element with the maximum correlation index as the gastroesophageal reflux disease risk factor; the correlation index is the average of the squares of the correlation coefficients.
Optionally, the standardization processing module specifically includes:
the standardization processing unit is used for standardizing the quantized data matrix by adopting a Z-Score standardization algorithm; the data of each dimension in the standardized data matrix have a mean of 0 and a variance of 1.
Optionally, the module for determining risk factors of gastroesophageal reflux disease specifically includes:
a correlation index calculation unit, configured to calculate a correlation index between elements in each of the clusters;
and the gastroesophageal reflux disease risk factor determining unit is used for arranging all the correlation indexes in descending order, and determining the element corresponding to the largest correlation index as the gastroesophageal reflux disease risk factor.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention mainly provides a method and a system for determining risk factors of gastroesophageal reflux disease based on machine learning. The method first uses principal component analysis, a feature engineering technique, to extract features and reduce the data dimensionality, then performs cluster analysis on the resulting high-quality data and selects the most important risk factor in each cluster. By combining a clustering method with feature engineering, the method screens out the risk factors that cause gastroesophageal reflux disease, provides a scientific basis for future medical research and disease diagnosis, guides the prevention of gastroesophageal reflux disease, and helps reduce morbidity.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a method for determining risk factors for gastroesophageal reflux disease based on machine learning according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a system for determining risk factors for gastroesophageal reflux disease based on machine learning according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a method and a system for determining risk factors of gastroesophageal reflux disease based on machine learning, which can efficiently and accurately determine the risk factors of gastroesophageal reflux disease.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Example 1
Fig. 1 is a schematic flowchart of a method for determining risk factors for gastroesophageal reflux disease based on machine learning according to an embodiment of the invention, and as shown in fig. 1, the method for determining risk factors for gastroesophageal reflux disease based on machine learning according to an embodiment of the invention specifically includes the following steps.
Step 101: constructing a user information set; the user information set is a data set with M rows and N columns; the factor in the ith row and 1st column of the user information set is a user questionnaire ID number, and the factors in the 1st column of different rows are different user questionnaire ID numbers; the factor in the 1st row and jth column of the user information set is a question of the questionnaire, and the factors in the 1st row of different columns are different questions; the factor in the ith row and jth column of the user information set is the answer given under the ith user questionnaire ID number to the jth question; wherein 2 ≤ i ≤ M, 2 ≤ j ≤ N, and i and j are positive integers. M represents the number of all users participating in the questionnaire, and N represents the number of all questions in the questionnaire.
Step 102: carrying out data quantization processing on the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix with M rows and N columns; the element in the ith row and 1st column of the quantized data matrix is a user questionnaire ID number, and the elements in the 1st column of different rows are different user questionnaire ID numbers; the element in the 1st row and jth column of the quantized data matrix is a question of the questionnaire, and the elements in the 1st row of different columns are different questions; the element in the ith row and jth column of the quantized data matrix is the data quantization result of the answer given under the ith user questionnaire ID number to the jth question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N.
Step 103: standardizing the quantized data matrix to obtain a standardized data matrix.
Step 104: performing dimension reduction processing on the standardized data matrix by adopting a principal component analysis algorithm, and performing reconstruction processing on the dimension-reduced data matrix to obtain a reconstructed data matrix.
Step 105: processing each sample point in the reconstructed data matrix by adopting a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; wherein 2 ≤ z ≤ M, and z is a positive integer.
Step 106: determining the clustering number according to the hierarchical clustering dendrogram, and clustering the elements in the reconstructed data matrix by adopting a clustering algorithm according to the clustering number to obtain a plurality of clusters.
Step 107: calculating the correlation index among elements in each cluster, and determining the element with the maximum correlation index as a gastroesophageal reflux disease risk factor; the correlation index is the average of the squares of the correlation coefficients.
Step 101 specifically includes:
In this embodiment, a questionnaire is issued to each person who consults the hospital about gastroesophageal reflux disease, and the user information set is built from the questionnaires that are returned. The users in the user information set may be sick or healthy; each user is given a label after diagnosis by the hospital, and the label indicates whether the user is sick. The user information set is therefore a mixed data set of healthy and diseased users in which the diseased records are known. In this embodiment the data set has 241 dimensions, including the patient questionnaire ID number, which serves as the unique identifier, and the answers to the questions of each questionnaire.
The questionnaire covers questions on general demographic data, lifestyle, eating habits, mental factors, sleep factors and the like, together with the respondents' answers. The questions are of three types: single-choice, judgment (yes/no), and question-and-answer.
Step 102 performs data quantization processing on the respondents' answers.
Specifically, the patient's questionnaire ID number is used as the unique identification number. For single-choice questions whose options are graded by importance level, for example frequent, occasional, rare and never, weights of 4, 3, 2 and 1 are assigned in order of importance level, and the weight corresponding to the specific answer is selected. Judgment questions take yes/no options as answers; 'yes' is assigned a value of 1 and 'no' a value of 0, and the value corresponding to the specific answer is selected. Questions whose options have no importance level, such as occupation, can be eliminated because they do not contribute to the result. For question-and-answer questions, the continuous numerical value entered by the user is used directly as data. For example, if the user is 45 years old, has a height of 172, is not infected with HP, and frequently has sleep disorders, the sample columns are [age, height, HP infection, sleep disorder condition] and the data values are [45, 172, 0, 4]. The purpose of this step is to quantize the answers of all respondents to obtain the quantized data matrix D.
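The quantization rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the option labels, the question names (`hp_infected`, `sleep_disorder`, and so on), and the `schema` dictionary are hypothetical and do not come from the questionnaire itself.

```python
# Hypothetical mappings; the option labels follow the examples in the text.
FREQUENCY_WEIGHTS = {"frequent": 4, "occasional": 3, "rare": 2, "never": 1}
YES_NO_VALUES = {"yes": 1, "no": 0}


def quantize_answer(kind, answer):
    """Convert one questionnaire answer into a number.

    kind is "graded" (single-choice with ordered options), "yes_no"
    (judgment question), or "numeric" (question-and-answer with a number).
    """
    if kind == "graded":
        return FREQUENCY_WEIGHTS[answer]
    if kind == "yes_no":
        return YES_NO_VALUES[answer]
    if kind == "numeric":
        return float(answer)
    raise ValueError(f"unsupported question kind: {kind}")


def quantize_row(schema, answers):
    """Quantize one respondent's answers; schema maps question -> kind.

    Questions that are missing from schema (for example occupation)
    are dropped, mirroring the elimination rule in the text.
    """
    return [quantize_answer(schema[q], a) for q, a in answers.items() if q in schema]


# Assumed question names, used only for this example.
schema = {
    "age": "numeric",
    "height": "numeric",
    "hp_infected": "yes_no",
    "sleep_disorder": "graded",
}
answers = {
    "age": 45,
    "height": 172,
    "occupation": "teacher",
    "hp_infected": "no",
    "sleep_disorder": "frequent",
}
row = quantize_row(schema, answers)
print(row)  # [45.0, 172.0, 0, 4]
```

Stacking one such row per respondent would give the numeric part of the quantized data matrix D.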
Step 103 is data preprocessing, specifically data normalization.
The data matrix D is scaled with Z-Score standardization so that each dimension has a mean of 0 and a standard deviation of 1. This is necessary because the dimensions have different units and scales, so distances calculated from the raw values depend on the units chosen. After standardization every dimension has a mean of 0 and a variance of 1 and is unit-free when distances are calculated, which removes the influence of the choice of units on the distance calculation.
The Z-Score normalization formula is:
$$\text{Feature\_value}' = \frac{\text{Feature\_value} - \mu}{s}$$
where Feature_value is the original attribute value of a given feature of the data, μ is the mean of that feature, s is the standard deviation of that feature, and Feature_value' is the new attribute value of that feature.
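A minimal sketch of the Z-Score standardization follows, applied column by column to a small matrix. Two assumptions are made here that the text does not settle: the population standard deviation is used for s, and a constant column is mapped to zeros to avoid dividing by zero. The function name `z_score_columns` and the sample rows are illustrative.

```python
from statistics import mean, pstdev


def z_score_columns(matrix):
    """Standardize each column of a list-of-rows matrix to mean 0, std 1.

    Assumption of this sketch: s is the population standard deviation,
    and a constant column (s == 0) is mapped to zeros.
    """
    scaled_columns = []
    for col in zip(*matrix):
        mu = mean(col)
        s = pstdev(col)
        if s == 0:
            scaled_columns.append([0.0] * len(col))
        else:
            scaled_columns.append([(v - mu) / s for v in col])
    # Transpose back so that each inner list is again one respondent.
    return [list(row) for row in zip(*scaled_columns)]


# Illustrative quantized rows: [age, height, HP infection, sleep disorder].
D = [
    [45, 172, 0, 4],
    [30, 160, 1, 2],
    [60, 180, 1, 1],
]
D_std = z_score_columns(D)
for row in D_std:
    print([round(v, 3) for v in row])
```

After this step, age and height contribute to Euclidean distances on the same scale as the 0/1 and 1 to 4 coded answers.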
Step 104 is mainly to perform dimensionality reduction reconstruction on the data.
The principal component analysis method mainly studies the correlation among columns to obtain components sorted by their share of the variance, and identifies the factors that contribute most to the principal components from the coefficients of each attribute on the standardized indexes of the principal components. This step is used to preliminarily prune some of the irrelevant factors. The original data space is then recovered through the inverse operation, which achieves dimension reduction while keeping the data interpretable.
With the standardized data matrix D' as input, the principal component analysis method comprises the following steps:
calculating a correlation matrix C of the standardized data matrix D';
calculating the eigenvalues of the correlation matrix C and the eigenvector corresponding to each eigenvalue;
and forming a new data set, namely the dimension-reduced data set, from the eigenvectors corresponding to the top N eigenvalues. Specifically, the eigenvalues are sorted in descending order, and the eigenvectors corresponding to the top N eigenvalues are used as a new basis. Because the eigenvalues are in descending order, the corresponding eigenvectors are ordered from the most to the least information stored; 90 percent of the information is retained, and the later, less important features that carry little information are deleted, which achieves dimension reduction.
The basis vector that stores the most information is the eigenvector of the covariance matrix with the largest eigenvalue, and the amount of information stored by an eigenvector is its corresponding eigenvalue.
The top N components are chosen so that the main content of the data is not lost: the new features are selected to account for 90% of the total variance of the data, that is, 90% of the original information is retained, so that the dimension is reduced without losing much feature information.
Principal component analysis extracts features from the original data set to form new features. These new features do not correspond to the factors of the original data set, which reduces interpretability when risk factors are analyzed. The dimension-reduced data set therefore needs to be reconstructed and restored so that the principal components after dimension reduction correspond to the features contained in the original data set; this yields the features with useless factors screened out, which form the reconstructed data matrix D".
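The dimension reduction and reconstruction steps can be sketched with NumPy as below. This is one plausible reading rather than the embodiment's actual implementation: it assumes that "reconstruction" means projecting the standardized data onto the retained eigenvectors and mapping the result back with the transpose of the same basis. The function name, the `keep` parameter, and the synthetic data are all illustrative.

```python
import numpy as np


def pca_reduce_and_reconstruct(X_std, keep=0.90):
    """Project standardized data onto the leading principal components
    and map the result back to the original feature space.

    X_std: (samples, features) array whose columns are already z-scored.
    keep: fraction of the total variance to retain (0.90 in the text).
    """
    # Correlation matrix of the features (columns).
    C = np.corrcoef(X_std, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Smallest number of components whose cumulative share reaches `keep`.
    share = np.cumsum(eigvals) / np.sum(eigvals)
    n_components = min(int(np.searchsorted(share, keep)) + 1, len(eigvals))
    W = eigvecs[:, :n_components]
    reduced = X_std @ W             # dimension-reduced data
    reconstructed = reduced @ W.T   # back in the original feature space
    return reduced, reconstructed, n_components


# Synthetic data: 5 observed features driven by 2 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(50, 5))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

reduced, reconstructed, n_components = pca_reduce_and_reconstruct(X_std)
print(reduced.shape, reconstructed.shape, n_components)
```

The reconstructed array has the same columns as the input, so each column can still be read as an original questionnaire factor, which is the interpretability argument made above.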
Step 105 is clustering.
Cluster analysis can cluster either the indicators or the samples; here the indicators are clustered.
In this embodiment a hierarchical clustering method is adopted, specifically agglomerative hierarchical clustering, to cluster each sample point in the reconstructed data matrix D". The basic principle is that each object to be clustered is first regarded as its own class, the similarity between classes is computed, the two closest classes are merged into one class, and merging proceeds step by step until all the objects are merged into one class.
The steps of agglomerative hierarchical clustering are as follows:
treating each sample point in the reconstructed data matrix D" as an independent class;
calculating the distance between every two classes according to the distance formula, and finding the two classes c1 and c2 with the minimum distance;
merging class c1 and class c2 into one class;
and repeating the above steps until all the sample points are merged into one class, thereby obtaining the hierarchical clustering dendrogram.
Regarding the distance measure, this embodiment uses the average distance method to calculate the distance between two classes.
The average distance method calculates the distance between each data point in one class and every data point in the other class, and takes the average of all these distances as the distance between the two classes. The calculation formula is as follows:
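$$\mathrm{dist}_{avg}(c_1, c_2) = \frac{1}{|c_1|\,|c_2|} \sum_{x \in c_1} \sum_{z \in c_2} \mathrm{dist}(x, z)$$
where |c1| and |c2| are the numbers of data points in classes c1 and c2, and dist(x, z) is the Euclidean distance between data points x and z.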
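The agglomerative procedure with the average distance method can be sketched in plain Python as follows. It is a deliberately naive illustration for small inputs; the function names and the sample points are assumptions, and a production system would more likely rely on an optimized library routine.

```python
from itertools import combinations
from math import dist


def average_distance(points, a, b):
    """Mean Euclidean distance over all pairs (x in class a, z in class b)."""
    total = sum(dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate(points):
    """Merge the two closest classes until a single class remains.

    Returns the merge history as (members_1, members_2, distance) tuples,
    which is the information a dendrogram is drawn from.
    """
    # Each class is a tuple of point indices; start with one class per point.
    clusters = [(i,) for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(points, *pair),
        )
        history.append((c1, c2, average_distance(points, c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return history


# Illustrative sample points (rows of a small reconstructed matrix).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
history = agglomerate(points)
for c1, c2, d in history:
    print(c1, c2, round(d, 3))
```

Each recorded merge distance corresponds to the height of one junction in the dendrogram, so large jumps between consecutive distances indicate well-separated groups.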
In step 106, the clustering result among the features can be read level by level from the hierarchical clustering dendrogram. The clustering number can therefore be determined by analyzing the dendrogram, and the elements in the reconstructed data matrix are then clustered by a clustering algorithm according to the clustering number to obtain a plurality of clusters.
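The text does not say how the clustering number is read from the dendrogram. One common heuristic, shown here purely as an assumption and not as the embodiment's rule, is to cut the tree at the largest jump between consecutive merge distances. The function name and the sample distances are illustrative.

```python
def clusters_from_largest_gap(merge_distances):
    """Suggest a cluster count by cutting the dendrogram at the largest
    jump between consecutive merge distances (a heuristic, assumed here).

    merge_distances: distances of successive merges, in merge order,
    for n = len(merge_distances) + 1 sample points.
    """
    if len(merge_distances) < 2:
        raise ValueError("need at least two merges to compare gaps")
    n_points = len(merge_distances) + 1
    gaps = [b - a for a, b in zip(merge_distances, merge_distances[1:])]
    # Merges 0..cut are kept; every later merge is undone by the cut.
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    return n_points - (cut + 1)


# Four merges of five points; the largest jump is before the last merge.
print(clusters_from_largest_gap([1.0, 1.0, 7.1, 20.9]))  # 2
```

In practice the dendrogram is often inspected visually as well, so a value suggested this way is best treated as a starting point.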
Step 107 specifically includes:
To find the risk factors that determine whether a patient develops the disease, the correlation indexes (the average of the squares of the correlation coefficients) among the elements in each cluster are calculated, and the element with the largest correlation index is determined to be a gastroesophageal reflux disease risk factor, so that one risk factor is screened from each cluster.
The quantized data matrix D is grouped according to the class labels to obtain D1, D2, D3, ..., Dk. The initial risk factor set is the empty set; the correlation among the elements in each cluster is analyzed, the correlation index of each element is calculated, and the element with the largest correlation index in each cluster is added to the risk factor set.
Wherein the correlation coefficient is calculated as follows:
$$r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
wherein Var (X) is the variance of X, Var (Y) is the variance of Y, Cov (X, Y) is the covariance between X and Y, and X and Y are the elements in each cluster.
The sample correlation index R² of a feature is calculated as follows:
$$R^2 = \frac{1}{n-1} \sum_{i=1,\; X_i \neq X}^{n} r(X, X_i)^2$$
where X is a given feature, i is the feature index, n is the total number of features, and r(X, X_i) is the correlation coefficient between X and the ith feature.
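The selection of one risk factor per cluster can be sketched as below. The sketch assumes that the average is taken over the other elements of the same cluster, that each cluster holds at least two elements, and that no element is constant; the element names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: Cov(X, Y) / sqrt(Var(X) * Var(Y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    return cov / sqrt(var_x * var_y)


def correlation_index(name, cluster):
    """Average squared correlation between one element and the others.

    Assumption of this sketch: the element itself is excluded, so the
    average runs over the remaining elements of the cluster.
    """
    others = [k for k in cluster if k != name]
    return sum(pearson(cluster[name], cluster[k]) ** 2 for k in others) / len(others)


def pick_risk_factor(cluster):
    """Return the element of a cluster with the largest correlation index."""
    return max(cluster, key=lambda name: correlation_index(name, cluster))


# Hypothetical cluster: element name -> quantized values across respondents.
cluster = {
    "late_meals": [1, 2, 3, 4, 4],
    "sleep_disorder": [1, 2, 2, 4, 4],
    "spicy_food": [4, 1, 3, 2, 4],
}
risk_factor = pick_risk_factor(cluster)
print(risk_factor)
```

Repeating `pick_risk_factor` for every cluster and collecting the results gives the risk factor set described above, starting from an empty set.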
Example 2
To achieve the above object, the present invention further provides a machine learning-based gastroesophageal reflux disease risk factor determination system as shown in fig. 2. The system comprises:
A user information set constructing module 100, configured to construct a user information set; the user information set is a data set with M rows and N columns; the factor in the i-th row and 1st column of the user information set is a user questionnaire ID number, and the factors in the 1st column of different rows are different user questionnaire ID numbers; the factor in the 1st row and j-th column of the user information set is a question of the questionnaire, and the factors in the 1st row of different columns are different questions; the factor in the i-th row and j-th column of the user information set is the answer given under the i-th user questionnaire ID number to the j-th question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N; M represents the number of all users participating in the questionnaire, and N represents the number of all questions in the questionnaire.
A quantization processing module 200, configured to perform data quantization processing on the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix with M rows and N columns; the element in the i-th row and 1st column of the quantized data matrix is a user questionnaire ID number, and the elements in the 1st column of different rows are different user questionnaire ID numbers; the element in the 1st row and j-th column of the quantized data matrix is a question of the questionnaire, and the elements in the 1st row of different columns are different questions; the element in the i-th row and j-th column of the quantized data matrix is the data quantization result of the answer given under the i-th user questionnaire ID number to the j-th question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N.
A standardization processing module 300, configured to perform standardization processing on the quantized data matrix to obtain a standardized data matrix.
A dimension reduction reconstruction module 400, configured to perform dimension reduction processing on the standardized data matrix by using a principal component analysis algorithm, and perform reconstruction processing on the dimension-reduced data matrix to obtain a reconstructed data matrix.
A hierarchical clustering dendrogram obtaining module 500, configured to process each sample point in the reconstructed data matrix by using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; wherein 2 ≤ z ≤ M.
A cluster dividing module 600, configured to determine a cluster number according to the hierarchical clustering dendrogram, and cluster the elements in the reconstructed data matrix by using a clustering algorithm according to the cluster number to obtain a plurality of clusters.
A gastroesophageal reflux disease risk factor determining module 700, configured to calculate a correlation index between each element in each cluster, and determine the element with the largest correlation index as a gastroesophageal reflux disease risk factor; the correlation index is the average of the squares of the correlation coefficients.
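As an illustration of the first two modules above, a minimal Python sketch of the quantization step is given below. The answer-to-number mapping `ANSWER_CODES`, the function name `quantize`, and the sample questions are assumptions made for this illustration only; this section does not specify the coding scheme.

```python
# Hypothetical answer-to-number mapping; the actual coding scheme is not given here.
ANSWER_CODES = {"never": 0, "sometimes": 1, "often": 2, "always": 3, "no": 0, "yes": 1}

def quantize(user_info_set):
    """Return a matrix of the same shape with every answer replaced by a number.

    Row 1 (questions) and column 1 (questionnaire ID numbers) are copied
    unchanged, matching the layout described for the user information set.
    """
    header, *rows = user_info_set
    quantized = [list(header)]
    for row in rows:
        user_id, *answers = row
        codes = [ANSWER_CODES[answer.strip().lower()] for answer in answers]
        quantized.append([user_id] + codes)
    return quantized

# Hypothetical user information set: header row plus two questionnaires.
user_info_set = [
    ["ID", "Heartburn after meals?", "Do you smoke?"],
    ["Q001", "Often", "No"],
    ["Q002", "Never", "Yes"],
]
quantized = quantize(user_info_set)
```

An answer that is missing from the mapping raises a `KeyError` in this sketch, which makes unexpected questionnaire values visible instead of silently coding them.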
The standardization processing module 300 specifically includes:
a standardization processing unit, configured to standardize the quantized data matrix by using a Z-Score standardization algorithm; the data of each dimension in the standardized data matrix obeys a normal distribution with a mean of 0 and a variance of 1.
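Z-Score standardization of the answer columns can be sketched as follows. The function name `z_score_columns`, the use of the population variance, and the treatment of constant columns are assumptions of this sketch. Note that the transformation fixes the mean and variance of each column; it does not by itself change the shape of the distribution.

```python
import math

def z_score_columns(matrix):
    """Standardize each column to mean 0 and variance 1 (population variance)."""
    n_rows = len(matrix)
    columns = list(zip(*matrix))
    standardized_columns = []
    for column in columns:
        mean = sum(column) / n_rows
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n_rows)
        if std == 0:
            # Assumption: a constant column is mapped to zeros; the text does not cover this case.
            standardized_columns.append([0.0] * n_rows)
        else:
            standardized_columns.append([(v - mean) / std for v in column])
    # Transpose back so that rows are questionnaires again.
    return [list(row) for row in zip(*standardized_columns)]

# Hypothetical quantized answers (ID column and question row already removed).
answers = [[2, 0], [0, 1], [1, 1], [3, 0]]
standardized = z_score_columns(answers)
```

Only the numeric part of the quantized data matrix is passed in; the questionnaire ID column and the question row would be set aside beforehand.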
The gastroesophageal reflux disease risk factor determining module 700 specifically comprises:
a correlation index calculation unit, configured to calculate a correlation index between elements in each of the clusters;
a gastroesophageal reflux disease risk factor determining unit, configured to arrange all the correlation indexes in descending order and determine the element corresponding to the largest correlation index as the gastroesophageal reflux disease risk factor.
The method combines feature engineering with clustering to screen out irrelevant features, which yields higher-quality data and improves accuracy. Clustering groups highly similar features into one class, and the feature that best represents each cluster of similar features is selected as a risk factor, so that the selected features are representative.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant details can be found in the description of the method.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, a person skilled in the art may, according to the idea of the present invention, make changes to the specific embodiments and the scope of application. In view of the above, the content of this description should not be construed as limiting the invention.

Claims (8)

1. A method for determining risk factors for gastroesophageal reflux disease based on machine learning, the method comprising:
constructing a user information set; the user information set is a data set with M rows and N columns; the factor in the i-th row and 1st column of the user information set is a user questionnaire ID number, and the factors in the 1st column of different rows are different user questionnaire ID numbers; the factor in the 1st row and j-th column of the user information set is a question of the questionnaire, and the factors in the 1st row of different columns are different questions; the factor in the i-th row and j-th column of the user information set is the answer given under the i-th user questionnaire ID number to the j-th question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N; M represents the number of all users participating in the questionnaire, and N represents the number of all questions in the questionnaire;
carrying out data quantization processing on the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix with M rows and N columns; the element in the i-th row and 1st column of the quantized data matrix is a user questionnaire ID number, and the elements in the 1st column of different rows are different user questionnaire ID numbers; the element in the 1st row and j-th column of the quantized data matrix is a question of the questionnaire, and the elements in the 1st row of different columns are different questions; the element in the i-th row and j-th column of the quantized data matrix is the data quantization result of the answer given under the i-th user questionnaire ID number to the j-th question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N;
standardizing the quantized data matrix to obtain a standardized data matrix;
adopting a principal component analysis algorithm to perform dimension reduction processing on the standardized data matrix, and performing reconstruction processing on the dimension-reduced data matrix to obtain a reconstructed data matrix;
processing each sample point in the reconstructed data matrix by adopting a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; wherein 2 ≤ z ≤ M;
determining the clustering number according to the hierarchical clustering dendrogram, and clustering elements in the reconstructed data matrix by adopting a clustering algorithm according to the clustering number to obtain a plurality of clusters;
calculating the correlation index among elements in each cluster, and determining the element with the maximum correlation index as a gastroesophageal reflux disease risk factor; the correlation index is the average of the squares of the correlation coefficients.
2. The method for determining risk factors for gastroesophageal reflux disease according to claim 1, wherein the standardizing of the quantized data matrix to obtain a standardized data matrix specifically comprises:
adopting a Z-Score standardization algorithm to standardize the quantized data matrix; the data of each dimension in the standardized data matrix obeys a normal distribution with a mean of 0 and a variance of 1.
3. The method for determining risk factors for gastroesophageal reflux disease according to claim 1, wherein performing dimension reduction processing on the standardized data matrix by using a principal component analysis algorithm specifically comprises:
calculating a correlation matrix of the standardized data matrix;
calculating eigenvalues and the eigenvectors corresponding to the eigenvalues according to the correlation matrix;
and arranging the eigenvalues in descending order, and selecting the eigenvectors corresponding to the first N eigenvalues to form the dimension-reduced data set.
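The three steps of claim 3, together with the reconstruction step of claim 1, can be sketched with NumPy as follows. This is an illustration under stated assumptions: the function name `pca_reduce_and_reconstruct`, the parameter `n_components` (standing in for the number of retained eigenvalues), and the random sample data are hypothetical, and reconstruction is taken to mean projecting the reduced data back onto the original feature axes.

```python
import numpy as np

def pca_reduce_and_reconstruct(Z, n_components):
    """PCA on a standardized matrix Z (rows = questionnaires, columns = questions)."""
    corr = np.corrcoef(Z, rowvar=False)                # correlation matrix of the columns
    eigenvalues, eigenvectors = np.linalg.eigh(corr)   # symmetric input, ascending eigenvalues
    order = np.argsort(eigenvalues)[::-1]              # indices for descending eigenvalues
    components = eigenvectors[:, order[:n_components]]  # leading eigenvectors as columns
    reduced = Z @ components                           # dimension-reduced data
    reconstructed = reduced @ components.T             # back in the original feature space
    return reduced, reconstructed, eigenvalues[order]

# Hypothetical data: 20 questionnaires, 4 questions, two of them strongly correlated.
rng = np.random.default_rng(0)
raw = rng.normal(size=(20, 4))
raw[:, 1] = raw[:, 0] + 0.1 * rng.normal(size=20)
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0)         # Z-Score standardization

reduced, reconstructed, eigenvalues = pca_reduce_and_reconstruct(Z, n_components=2)
```

The reconstructed matrix has the same shape as the standardized matrix, so each column can still be read as one questionnaire item after the less informative directions have been discarded.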
4. The method for determining risk factors for gastroesophageal reflux disease according to claim 1, wherein the processing of each sample point in the reconstructed data matrix by using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram specifically comprises:
step 1, calculating the distance between every two sample points by adopting an average distance algorithm;
step 2, merging the two sample points with the minimum distance into one class;
step 3, repeating step 1 and step 2 until all the sample points are merged into one class, to obtain the hierarchical clustering dendrogram.
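Steps 1 to 3 of claim 4 can be sketched in plain Python as follows. The sketch assumes that the average distance algorithm denotes average linkage with Euclidean distance, and it records the merge history, which is the information a dendrogram displays; the function name `average_linkage` and the sample points are hypothetical.

```python
import math

def average_linkage(points):
    """Agglomerative clustering; returns the merge history in merge order.

    Each entry is (members_a, members_b, distance), where the members are
    indices of the original sample points.
    """
    clusters = [[i] for i in range(len(points))]  # every sample point starts alone
    merges = []

    def cluster_distance(a, b):
        # Average distance over all pairs of points drawn from the two classes.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Step 1: distance between every pair of current classes.
        pairs = [
            (cluster_distance(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        ]
        # Step 2: merge the closest pair into one class.
        distance, p, q = min(pairs)
        merges.append((clusters[p], clusters[q], distance))
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
        # Step 3: the loop repeats until a single class remains.
    return merges

# Hypothetical sample points (rows of a reconstructed data matrix with two columns).
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
merges = average_linkage(points)
```

In this toy data the merge distance jumps sharply after the second merge, so one possible reading of the dendrogram is three clusters; using such a jump to choose the cluster number is a rule of thumb assumed here, not a rule stated in the claims.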
5. The method for determining risk factors for gastroesophageal reflux disease according to claim 1, wherein the calculating of the correlation index among elements in each cluster and the determining of the element with the maximum correlation index as a gastroesophageal reflux disease risk factor specifically comprises:
calculating the correlation index among elements in each cluster;
and arranging all the correlation indexes in descending order, and determining the element corresponding to the largest correlation index as the gastroesophageal reflux disease risk factor.
6. A machine learning based gastroesophageal reflux disease risk factor determination system, the system comprising:
the user information set construction module is used for constructing a user information set; the user information set is a data set with M rows and N columns; the factor in the i-th row and 1st column of the user information set is a user questionnaire ID number, and the factors in the 1st column of different rows are different user questionnaire ID numbers; the factor in the 1st row and j-th column of the user information set is a question of the questionnaire, and the factors in the 1st row of different columns are different questions; the factor in the i-th row and j-th column of the user information set is the answer given under the i-th user questionnaire ID number to the j-th question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N; M represents the number of all users participating in the questionnaire, and N represents the number of all questions in the questionnaire;
the quantization processing module is used for carrying out data quantization processing on the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix with M rows and N columns; the element in the i-th row and 1st column of the quantized data matrix is a user questionnaire ID number, and the elements in the 1st column of different rows are different user questionnaire ID numbers; the element in the 1st row and j-th column of the quantized data matrix is a question of the questionnaire, and the elements in the 1st row of different columns are different questions; the element in the i-th row and j-th column of the quantized data matrix is the data quantization result of the answer given under the i-th user questionnaire ID number to the j-th question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N;
the standardization processing module is used for carrying out standardization processing on the quantized data matrix to obtain a standardized data matrix;
the dimensionality reduction reconstruction module is used for carrying out dimensionality reduction on the standardized data matrix by adopting a principal component analysis algorithm and reconstructing the dimensionality reduced data matrix to obtain a reconstructed data matrix;
the hierarchical clustering dendrogram obtaining module is used for processing each sample point in the reconstructed data matrix by adopting a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; wherein 2 ≤ z ≤ M;
the cluster dividing module is used for determining the clustering number according to the hierarchical clustering dendrogram and clustering elements in the reconstructed data matrix by adopting a clustering algorithm according to the clustering number to obtain a plurality of clusters;
the gastroesophageal reflux disease risk factor determining module is used for calculating the correlation index among all elements in each cluster, and determining the element with the maximum correlation index as the gastroesophageal reflux disease risk factor; the correlation index is the average of the squares of the correlation coefficients.
7. The gastroesophageal reflux disease risk factor determination system according to claim 6, wherein the standardization processing module specifically comprises:
a standardization processing unit, configured to standardize the quantized data matrix by adopting a Z-Score standardization algorithm; the data of each dimension in the standardized data matrix obeys a normal distribution with a mean of 0 and a variance of 1.
8. The system for determining risk factors for gastroesophageal reflux disease according to claim 6, wherein the gastroesophageal reflux disease risk factor determination module specifically comprises:
a correlation index calculation unit, configured to calculate a correlation index between elements in each of the clusters;
and a gastroesophageal reflux disease risk factor determining unit, configured to arrange all the correlation indexes in descending order and determine the element corresponding to the largest correlation index as the gastroesophageal reflux disease risk factor.
CN201811589405.8A 2018-12-25 2018-12-25 Machine learning-based gastroesophageal reflux disease risk factor determination method and system Expired - Fee Related CN109686442B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811589405.8A CN109686442B (en) 2018-12-25 2018-12-25 Machine learning-based gastroesophageal reflux disease risk factor determination method and system

Publications (2)

Publication Number Publication Date
CN109686442A CN109686442A (en) 2019-04-26
CN109686442B true CN109686442B (en) 2020-04-14

Family

ID=66189312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811589405.8A Expired - Fee Related CN109686442B (en) 2018-12-25 2018-12-25 Machine learning-based gastroesophageal reflux disease risk factor determination method and system

Country Status (1)

Country Link
CN (1) CN109686442B (en)

Also Published As

Publication number Publication date
CN109686442A (en) 2019-04-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190729

Address after: 210000 Xiaolingwei 179, Xuanwu District, Nanjing City, Jiangsu Province

Applicant after: NANJING INTEGRATED TRADITIONAL CHINESE AND WESTERN MEDICINE Hospital

Address before: 210000 Xiaolingwei 179, Xuanwu District, Nanjing City, Jiangsu Province

Applicant before: Liu Wanli

GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200414