CN109686442B

CN109686442B - Machine learning-based gastroesophageal reflux disease risk factor determination method and system

Info

Publication number: CN109686442B
Application number: CN201811589405.8A
Authority: CN
Inventors: 刘万里; 徐雷; 黄玉珍; 姚澜; 李荣臻; 夏吉安
Original assignee: NANJING INTEGRATED TRADITIONAL CHINESE AND WESTERN MEDICINE HOSPITAL
Current assignee: NANJING INTEGRATED TRADITIONAL CHINESE AND WESTERN MEDICINE HOSPITAL
Priority date: 2018-12-25
Filing date: 2018-12-25
Publication date: 2020-04-14
Anticipated expiration: 2038-12-25
Also published as: CN109686442A

Abstract

The invention discloses a method and a system for determining risk factors of gastroesophageal reflux disease based on machine learning, which solve the prior-art problem of low accuracy when the risk factors of gastroesophageal reflux disease are determined by statistics. First, a user information set containing gastroesophageal reflux disease risk factors is constructed, and the factors in the user information set are quantized to obtain a quantized data matrix. Second, the quantized data matrix is standardized, and the standardized matrix is reduced in dimension with a principal component analysis algorithm. Then, the data in the processed data set are clustered with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the number of clusters is determined from the dendrogram, and the data in the processed data set are divided according to that number to obtain a plurality of clusters. Finally, the correlation indexes among the elements in each cluster are calculated, and the element with the maximum correlation index is determined as a gastroesophageal reflux disease risk factor.

Description

Machine learning-based gastroesophageal reflux disease risk factor determination method and system

Technical Field

The invention relates to the technical field of machine learning and medicine, in particular to a method and a system for determining risk factors of gastroesophageal reflux disease based on machine learning.

Background

Gastroesophageal reflux disease is a common digestive system disease worldwide, and its incidence is rising year by year, so its treatment deserves close attention. Because its occurrence is closely related to lifestyle, mood changes, eating habits and the like, the condition changes easily. Collecting a large amount of data and analyzing its characteristics is therefore important for studying and preventing the disease.

At present, gastroesophageal reflux disease diagnosis technology does not use machine learning to extract risk factors. Most risk factor extraction in the medical field relies on statistical methods, which require a large amount of calculation and are less accurate than machine learning.

Disclosure of Invention

The invention aims to provide a method and a system for determining risk factors of gastroesophageal reflux disease based on machine learning, and the method and the system are used for solving the problem of low accuracy rate when the risk factors of gastroesophageal reflux disease are determined by statistics in the prior art.

In order to achieve the purpose, the invention provides the following scheme:

a method for determining risk factors for gastroesophageal reflux disease based on machine learning, comprising:

constructing a user information set; the user information set is a data set with M rows and N columns; the factor in the 1st column of the i-th row is a user questionnaire ID number, and the factors in the 1st column of different rows are different user questionnaire ID numbers; the factor in the 1st row and j-th column is a question of the questionnaire, and the factors in the 1st row of different columns are different questions; the factor in the i-th row and j-th column is the answer given under the i-th user questionnaire ID number to the j-th question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N; M represents the number of all users participating in the questionnaire, and N represents the number of all questions in the questionnaire;

carrying out data quantization processing on the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix with M rows and N columns; the element in the 1st column of the i-th row is a user questionnaire ID number, and the elements in the 1st column of different rows are different user questionnaire ID numbers; the element in the 1st row and j-th column is a question of the questionnaire, and the elements in the 1st row of different columns are different questions; the element in the i-th row and j-th column is the quantized result of the answer given under the i-th user questionnaire ID number to the j-th question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N;

standardizing the quantized data matrix to obtain a standardized data matrix;

adopting a principal component analysis algorithm to perform dimension reduction processing on the standardized data matrix, and performing reconstruction processing on the dimension-reduced data matrix to obtain a reconstructed data matrix;

processing each sample point in the reconstructed data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; wherein 2 ≤ z ≤ M;

determining the clustering number according to the hierarchical clustering dendrogram, and clustering elements in the reconstructed data matrix by adopting a clustering algorithm according to the clustering number to obtain a plurality of clusters;

calculating the correlation index among elements in each cluster, and determining the element with the maximum correlation index as a gastroesophageal reflux disease risk factor; the correlation index is the average of the squares of the correlation coefficients.

Optionally, standardizing the quantized data matrix to obtain a standardized data matrix specifically includes:

standardizing the quantized data matrix with a Z-Score standardization algorithm; the data of each dimension in the standardized data matrix obey a normal distribution with a mean of 0 and a variance of 1.

Optionally, performing dimension reduction on the standardized data matrix with a principal component analysis algorithm specifically includes:

calculating a correlation matrix of the standardized data matrix;

calculating the eigenvalues of the correlation matrix and the eigenvector corresponding to each eigenvalue;

arranging the eigenvalues in descending order, and selecting the eigenvectors corresponding to the first N eigenvalues to form the dimension-reduced data set.

Optionally, processing each sample point in the reconstructed data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram specifically includes:

step 1, calculating the distance between every two sample points with an average distance algorithm;

step 2, merging the two sample points with the minimum distance into one class;

step 3, repeating step 1 and step 2 until all sample points are merged into one class, thereby obtaining the hierarchical clustering dendrogram.

Optionally, calculating the correlation index among the elements in each cluster and determining the element with the largest correlation index as a gastroesophageal reflux disease risk factor specifically includes:

calculating the correlation index among the elements in each cluster;

arranging all the correlation indexes in descending order, and determining the element corresponding to the largest correlation index as the gastroesophageal reflux disease risk factor.

A machine learning based gastroesophageal reflux disease risk factor determination system comprising:

a user information set construction module, configured to construct a user information set; the user information set is a data set with M rows and N columns; the factor in the 1st column of the i-th row is a user questionnaire ID number, and the factors in the 1st column of different rows are different user questionnaire ID numbers; the factor in the 1st row and j-th column is a question of the questionnaire, and the factors in the 1st row of different columns are different questions; the factor in the i-th row and j-th column is the answer given under the i-th user questionnaire ID number to the j-th question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N; M represents the number of all users participating in the questionnaire, and N represents the number of all questions in the questionnaire;

a quantization processing module, configured to carry out data quantization processing on the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix with M rows and N columns; the element in the 1st column of the i-th row is a user questionnaire ID number, and the elements in the 1st column of different rows are different user questionnaire ID numbers; the element in the 1st row and j-th column is a question of the questionnaire, and the elements in the 1st row of different columns are different questions; the element in the i-th row and j-th column is the quantized result of the answer given under the i-th user questionnaire ID number to the j-th question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N;

the standardization processing module is used for carrying out standardization processing on the quantized data matrix to obtain a standardized data matrix;

the dimensionality reduction reconstruction module is used for carrying out dimensionality reduction on the standardized data matrix by adopting a principal component analysis algorithm and reconstructing the dimensionality reduced data matrix to obtain a reconstructed data matrix;

a hierarchical clustering dendrogram obtaining module, configured to process each sample point in the reconstructed data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; wherein 2 ≤ z ≤ M;

the cluster dividing module is used for determining the clustering number according to the hierarchical clustering tree diagram and clustering elements in the reconstructed data matrix by adopting a clustering algorithm according to the clustering number to obtain a plurality of clusters;

the gastroesophageal reflux disease risk factor determining module is used for calculating the correlation index among all elements in each cluster, and determining the element with the maximum correlation index as the gastroesophageal reflux disease risk factor; the correlation index is the average of the squares of the correlation coefficients.

Optionally, the standardization processing module specifically includes:

a standardization processing unit, configured to standardize the quantized data matrix with a Z-Score standardization algorithm; the data of each dimension in the standardized data matrix obey a normal distribution with a mean of 0 and a variance of 1.

Optionally, the gastroesophageal reflux disease risk factor determining module specifically includes:

a correlation index calculation unit, configured to calculate the correlation index among the elements in each cluster;

a gastroesophageal reflux disease risk factor determining unit, configured to arrange all the correlation indexes in descending order and determine the element corresponding to the largest correlation index as the gastroesophageal reflux disease risk factor.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

The invention provides a method and a system for determining risk factors of gastroesophageal reflux disease based on machine learning. The method first uses principal component analysis, a feature engineering technique, to extract features and reduce the data dimensionality, then performs cluster analysis on the resulting high-quality data and selects the most critical risk factor in each cluster. By combining clustering with feature engineering, the method screens out the risk factors that cause gastroesophageal reflux disease, provides a scientific basis for future medical research and disease diagnosis, guides the prevention of gastroesophageal reflux disease, and helps reduce morbidity.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of a method for determining risk factors for gastroesophageal reflux disease based on machine learning according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a system for determining risk factors for gastroesophageal reflux disease based on machine learning according to an embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention aims to provide a method and a system for determining risk factors of gastroesophageal reflux disease based on machine learning, which can efficiently and accurately determine the risk factors of gastroesophageal reflux disease.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Example 1

Fig. 1 is a schematic flowchart of a method for determining risk factors for gastroesophageal reflux disease based on machine learning according to an embodiment of the invention, and as shown in fig. 1, the method for determining risk factors for gastroesophageal reflux disease based on machine learning according to an embodiment of the invention specifically includes the following steps.

Step 101: constructing a user information set; the user information set is a data set with M rows and N columns; the factor in the 1st column of the i-th row is a user questionnaire ID number, and the factors in the 1st column of different rows are different user questionnaire ID numbers; the factor in the 1st row and j-th column is a question of the questionnaire, and the factors in the 1st row of different columns are different questions; the factor in the i-th row and j-th column is the answer given under the i-th user questionnaire ID number to the j-th question; wherein 2 ≤ i ≤ M, 2 ≤ j ≤ N, and i and j are positive integers. M represents the number of all users participating in the questionnaire, and N represents the number of all questions in the questionnaire.

Step 102: carrying out data quantization processing on the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix with M rows and N columns; the element in the 1st column of the i-th row is a user questionnaire ID number, and the elements in the 1st column of different rows are different user questionnaire ID numbers; the element in the 1st row and j-th column is a question of the questionnaire, and the elements in the 1st row of different columns are different questions; the element in the i-th row and j-th column is the quantized result of the answer given under the i-th user questionnaire ID number to the j-th question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N.

Step 103: standardizing the quantized data matrix to obtain a standardized data matrix.

Step 104: performing dimension reduction on the standardized data matrix with a principal component analysis algorithm, and reconstructing the dimension-reduced data matrix to obtain a reconstructed data matrix.

Step 105: processing each sample point in the reconstructed data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; wherein 2 ≤ z ≤ M, and z is a positive integer.

Step 106: determining the number of clusters according to the hierarchical clustering dendrogram, and clustering the elements in the reconstructed data matrix with a clustering algorithm according to that number to obtain a plurality of clusters.

Step 107: calculating the correlation index among elements in each cluster, and determining the element with the maximum correlation index as a gastroesophageal reflux disease risk factor; the correlation index is the average of the squares of the correlation coefficients.

Step 101 specifically includes:

In this embodiment, a questionnaire is issued to each person who consults a hospital about gastroesophageal reflux disease, and the user information set is built from the questionnaires that are returned. The users in the user information set may be sick or healthy; each user is given a label after diagnosis by the hospital, and the label indicates whether the user is sick. The user information set is thus a mixed data set of healthy and diseased users in which it is known which records belong to diseased users. In this embodiment the data set has 241 dimensions, comprising the uniquely identifying patient questionnaire ID number and the answers to the questions of each questionnaire.

The questionnaire includes questions on general demographic data, lifestyle, eating habits, mental factors, sleep factors and the like, together with the respondents' answers. The questions are of three types: single-choice questions, judgment (yes/no) questions, and question-and-answer questions.
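The layout described for Step 101 (questions in the first row, questionnaire ID numbers in the first column, answers elsewhere) can be sketched as a small table builder. This is only an illustrative sketch: the function name, the question labels and the sample answers are hypothetical and are not taken from the embodiment.

```python
def build_user_information_set(questionnaires, questions):
    """Return a table whose first row holds the questions and whose
    first column holds the questionnaire ID numbers."""
    header = ["questionnaire_id"] + list(questions)
    rows = [header]
    for qid, answers in questionnaires.items():
        # A missing answer is stored as None rather than guessed.
        rows.append([qid] + [answers.get(q) for q in questions])
    return rows


# Hypothetical example data, used only to show the layout.
questions = ["age", "height", "HP infected", "sleep disorder"]
questionnaires = {
    "Q001": {"age": 45, "height": 172, "HP infected": "no", "sleep disorder": "frequent"},
    "Q002": {"age": 60, "height": 165, "HP infected": "yes", "sleep disorder": "never"},
}
info_set = build_user_information_set(questionnaires, questions)
```

Keeping the header row and the ID column inside the table mirrors the indexing used in the description, where answers start at row 2 and column 2.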

Step 102 performs data quantization on the respondents' answers.

Specifically, the patient's questionnaire ID number is used as the unique identification number. For single-choice questions whose options carry a level of importance, such as frequent, occasional, rare and never, weights of 4, 3, 2 and 1 are assigned in order of importance, and the weight corresponding to the given answer is selected. For judgment questions, which take a yes/no option as the answer, 'yes' is assigned the value 1 and 'no' the value 0, and the value corresponding to the given answer is selected. Questions whose options have no level of importance, such as occupation, can be eliminated because they do not contribute to the result. For question-and-answer items, the continuous numerical value entered by the user is used directly as data. For example, if the user is 45 years old, has a height of 172, is not infected with HP, and has frequent sleep disorder, the sample columns are [age, height, HP infection, sleep disorder condition] and the data values are [45, 172, 0, 4]. The purpose of this step is to quantize the answers of all respondents to obtain a quantized data matrix D.
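The quantization rules above can be illustrated with a minimal sketch. The weights 4, 3, 2, 1 and the values 1 and 0 come from the description; the dictionary and function names, and the choice to raise an error for answers without a rule, are assumptions made for this example.

```python
# Weights for single-choice answers with a level of importance (from the description).
FREQUENCY_WEIGHTS = {"frequent": 4, "occasional": 3, "rare": 2, "never": 1}
# Values for judgment (yes/no) answers (from the description).
YES_NO_VALUES = {"yes": 1, "no": 0}


def quantize_answer(answer):
    """Map one questionnaire answer to a number."""
    if isinstance(answer, (int, float)):
        return answer  # numeric question-and-answer values are used directly
    key = str(answer).strip().lower()
    if key in FREQUENCY_WEIGHTS:
        return FREQUENCY_WEIGHTS[key]
    if key in YES_NO_VALUES:
        return YES_NO_VALUES[key]
    # Assumption: answers without a rule (e.g. occupation) are rejected here
    # so that the caller can drop the corresponding question.
    raise ValueError(f"no quantization rule for answer {answer!r}")


# The worked example from the description:
# age 45, height 172, not infected with HP, frequent sleep disorder.
sample = [45, 172, "no", "frequent"]
quantized = [quantize_answer(a) for a in sample]
```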

Step 103 is data preprocessing, specifically data normalization.

The data matrix D is scaled with Z-Score standardization into a matrix whose columns have a mean of 0 and a standard deviation of 1. This is done because the features are measured in different units (dimensions), so distances calculated on the raw values depend on the choice of units. Through standardization, each dimension obeys a normal distribution with a mean of 0 and a variance of 1 and is dimensionless when distances are calculated, which avoids the influence of the choice of units on the distance calculation.

The Z-Score standardization formula is:

$$\text{Feature\_value}' = \frac{\text{Feature\_value} - \mu}{s}$$

where Feature_value is the original attribute value of a feature of the data, μ is the mean of that feature, s is the standard deviation of that feature, and Feature_value' is the new attribute value of that feature.
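As an illustration of the Z-Score step, the following sketch standardizes each column of a small quantized matrix. It assumes the population standard deviation for s and maps a constant column to zeros; neither choice is stated in the description, and the sample rows are hypothetical.

```python
import statistics


def z_score_columns(matrix):
    """Standardize each column to mean 0 and standard deviation 1."""
    scaled_columns = []
    for col in zip(*matrix):
        mu = statistics.fmean(col)
        s = statistics.pstdev(col)  # population standard deviation (assumption)
        if s == 0:
            # A constant column carries no information; map it to zeros (assumption).
            scaled_columns.append([0.0] * len(col))
        else:
            scaled_columns.append([(v - mu) / s for v in col])
    # Transpose back so that each row is again one questionnaire.
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical quantized rows: [age, height, HP infection, sleep disorder].
D = [[45, 172, 0, 4], [60, 165, 1, 1], [30, 180, 0, 2]]
D_std = z_score_columns(D)
```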

Step 104 is mainly to perform dimensionality reduction reconstruction on the data.

Principal component analysis mainly studies the correlation among columns to obtain components sorted by their proportion of variance, and determines which factors play a greater role in each principal component according to the coefficients of the standardized attributes on that component. This step makes an initial pruning of irrelevant factors. The original data space is then recovered through the inverse operation, which achieves dimension reduction while keeping the data interpretable.

The normalized data matrix D' is used as input, and the principal component analysis method comprises the following steps:

calculating a correlation matrix C of the standardized data matrix D';

calculating the eigenvalue of the correlation matrix C and the eigenvector corresponding to the eigenvalue;

The eigenvectors corresponding to the top N eigenvalues form a new data set, namely the dimension-reduced data set. Specifically, the eigenvalues are sorted in descending order, and the eigenvectors corresponding to the top N eigenvalues are used as a new basis. Because the eigenvectors are ordered with their eigenvalues, the amount of information carried by each eigenvector decreases from first to last. 90% of the information is retained and the later, less informative features are deleted, which achieves the dimension reduction.

The basis vector that carries the most information is the eigenvector of the covariance matrix with the largest eigenvalue, and the amount of information carried by an eigenvector is its corresponding eigenvalue.

The top N are selected so that the main content of the data is not lost. The new features are therefore chosen to represent 90% of the total variance of the data, that is, 90% of the original information is retained, so that the dimension is reduced without much loss of feature information.

Principal component analysis extracts features from the original data set to form new features. The new features do not correspond to the factors of the original data set, which reduces interpretability when risk factors are analyzed. The dimension-reduced data set therefore needs to be reconstructed and restored so that the retained principal components correspond to the features contained in the original data set. This yields the features from which useless factors have been screened out, which form the reconstructed data matrix D''.
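The dimension reduction and reconstruction described above can be sketched with NumPy as follows. This is a sketch under assumptions rather than the embodiment's implementation: the input is assumed to be already standardized with no constant columns, reconstruction is taken to be projection back through the retained eigenvectors, and the function name, the `keep` parameter and the synthetic data are hypothetical.

```python
import numpy as np


def pca_reduce_and_reconstruct(D_std, keep=0.90):
    """Project standardized data onto the leading principal components
    and map it back to the original feature space."""
    X = np.asarray(D_std, dtype=float)
    C = np.corrcoef(X, rowvar=False)           # correlation matrix of the columns
    eigvals, eigvecs = np.linalg.eigh(C)       # eigh because C is symmetric
    order = np.argsort(eigvals)[::-1]          # descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    # Smallest number of components whose cumulative share reaches `keep`.
    n_keep = min(int(np.searchsorted(ratio, keep)) + 1, len(eigvals))
    W = eigvecs[:, :n_keep]
    reduced = X @ W                            # dimension-reduced data
    reconstructed = reduced @ W.T              # restored to the original features
    return reduced, reconstructed, n_keep


# Synthetic data: two nearly identical columns plus one independent column.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
raw = np.column_stack([base[:, 0], base[:, 0] + 0.05 * rng.normal(size=50), base[:, 1]])
X_std = (raw - raw.mean(axis=0)) / raw.std(axis=0)
reduced, reconstructed, n_keep = pca_reduce_and_reconstruct(X_std)
```

Because the reconstructed matrix has the same columns as the input, later steps can still refer to the original questionnaire items by name.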

Step 105 is clustering.

Cluster analysis may cluster either the indicators or the samples; here the indicators are clustered.

This embodiment adopts a hierarchical clustering method, specifically agglomerative hierarchical clustering, to cluster each sample point in the reconstructed data matrix D''. The basic principle is that each object to be clustered is first regarded as its own class, the degree of similarity between classes is computed, the two closest classes are merged into one class, and the merging proceeds step by step until all objects are merged into one class.

The steps of agglomerative hierarchical clustering are as follows:

treating each sample point in the reconstructed data matrix D'' as an independent class;

calculating the distance between every two classes according to a distance formula, and finding the two classes c1 and c2 with the minimum distance;

merging class c1 and class c2 into one class;

repeating the above steps until all sample points are merged into one class, thereby obtaining the hierarchical clustering dendrogram.

Specifically, for the distance measure, this embodiment uses the average distance method to calculate the distance between two classes.

The average distance method calculates the distance between each data point in one class and every data point in the other class, and takes the average of all these distances as the distance between the two classes. The calculation formula is as follows:

$$d(c_1, c_2) = \frac{1}{|c_1|\,|c_2|} \sum_{x \in c_1} \sum_{z \in c_2} \mathrm{dist}(x, z)$$

where |c_1| and |c_2| are the numbers of sample points in classes c1 and c2, and dist(x, z) is the Euclidean distance between sample points x and z.
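The agglomerative procedure with the average distance method can be sketched in plain Python. The sketch is deliberately simple (it recomputes all pairwise class distances at every step) and is not presented as the embodiment's implementation; the function names, the merge-history format and the sample points are assumptions for illustration.

```python
import math
from itertools import combinations


def average_distance(c1, c2):
    """Mean Euclidean distance over all pairs (x in c1, z in c2)."""
    total = sum(math.dist(x, z) for x in c1 for z in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points):
    """Merge the two closest classes until one class remains.

    Returns the merge history as (members_a, members_b, distance) tuples,
    where members are indices into `points`.
    """
    clusters = {i: [tuple(p)] for i, p in enumerate(points)}
    members = {i: [i] for i in clusters}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        # Find the pair of classes with the minimum average distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        d = average_distance(clusters[a], clusters[b])
        merges.append((members[a], members[b], d))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        members[next_id] = members.pop(a) + members.pop(b)
        next_id += 1
    return merges


# Hypothetical sample points forming two well-separated pairs.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
history = agglomerate(points)
```

The merge history contains the information a dendrogram displays: which classes were joined and at what distance.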

In step 106, the hierarchical clustering dendrogram shows, level by level, how the features are clustered. The number of clusters can therefore be determined by analyzing the dendrogram, and the elements in the reconstructed data matrix are then clustered with a clustering algorithm according to that number to obtain a plurality of clusters.
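The description does not state how the number of clusters is read from the dendrogram or which clustering algorithm is applied afterwards. The sketch below therefore rests on two assumptions: the number is chosen at the largest jump between successive merge distances, and the clusters are obtained by replaying the merge history until that many classes remain. The function names and the merge-history format are hypothetical.

```python
def choose_cluster_count(merge_distances):
    """Pick the cluster count at the largest jump between successive merge distances."""
    gaps = [b - a for a, b in zip(merge_distances, merge_distances[1:])]
    if not gaps:
        return 1
    cut = max(range(len(gaps)), key=gaps.__getitem__)  # merges 0..cut are kept
    # n samples give n - 1 merges; keeping cut + 1 merges leaves this many classes.
    return len(merge_distances) - cut


def cut_into_clusters(n_samples, merges, k):
    """Replay merges (lists of sample indices) until k clusters remain."""
    clusters = [[i] for i in range(n_samples)]
    for left, right, _distance in merges:
        if len(clusters) <= k:
            break
        clusters = [c for c in clusters if c != sorted(left) and c != sorted(right)]
        clusters.append(sorted(left + right))
    return sorted(clusters)


# Hypothetical merge history: (members_a, members_b, distance).
merges = [([0], [1], 1.0), ([2], [3], 1.0), ([0, 1], [2, 3], 7.09)]
k = choose_cluster_count([d for _, _, d in merges])
clusters = cut_into_clusters(4, merges, k)
```

In practice the cut could equally be chosen by visual inspection of the dendrogram, which is closer to what the description suggests.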

Step 107 specifically includes:

To find the risk factors that determine the onset of the disease, the correlation index (the average of the squares of the correlation coefficients) among the elements in each cluster is calculated, and the element with the largest correlation index is determined as a gastroesophageal reflux disease risk factor, so that one risk factor is screened from each cluster.

The quantized data matrix D is grouped according to the class labels to obtain D₁, D₂, D₃, ..., D_k. The initial risk factor set is an empty set. The correlation among the elements in each cluster is analyzed, the correlation index of each element is calculated, and the element with the maximum correlation index in each cluster is added to the risk factor set.

The correlation coefficient is calculated as follows:

$$r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$

where Var(X) is the variance of X, Var(Y) is the variance of Y, Cov(X, Y) is the covariance between X and Y, and X and Y are elements in the same cluster.

The sample correlation index R² of a feature is calculated as follows:

$$R^2 = \frac{1}{n} \sum_{i=1}^{n} r(X, X_i)^2$$

where X is a given feature, X_i is the feature with number i, i is the feature number, n is the total number of features, and r is the correlation coefficient.
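The selection of one risk factor per cluster can be sketched as follows. The sketch assumes that the correlation index of a feature is the average of its squared Pearson correlation coefficients with every feature in the same cluster, including itself; including or excluding that self term does not change which feature has the maximum index. The function names and the sample cluster are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient Cov(X, Y) / sqrt(Var(X) * Var(Y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def correlation_index(name, cluster):
    """Average squared correlation between one feature and all features in its cluster."""
    squares = [pearson(cluster[name], values) ** 2 for values in cluster.values()]
    return sum(squares) / len(squares)


def pick_risk_factor(cluster):
    """Return the feature with the largest correlation index in the cluster."""
    return max(cluster, key=lambda name: correlation_index(name, cluster))


# Hypothetical cluster: feature name -> column of values across questionnaires.
cluster = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
}
risk_factor = pick_risk_factor(cluster)
```

Running `pick_risk_factor` on each cluster and collecting the results gives the risk factor set described above.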

Example 2

To achieve the above object, the present invention further provides a machine learning-based gastroesophageal reflux disease risk factor determination system as shown in fig. 2. The system comprises:

A user information set constructing module 100, configured to construct a user information set; the user information set is a data set with M rows and N columns; the factor in the 1st column of the i-th row is a user questionnaire ID number, and the factors in the 1st column of different rows are different user questionnaire ID numbers; the factor in the 1st row and j-th column is a question of the questionnaire, and the factors in the 1st row of different columns are different questions; the factor in the i-th row and j-th column is the answer given under the i-th user questionnaire ID number to the j-th question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N. M represents the number of all users participating in the questionnaire, and N represents the number of all questions in the questionnaire.

A quantization processing module 200, configured to perform data quantization processing on the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix with M rows and N columns; the element in the 1st column of the i-th row is a user questionnaire ID number, and the elements in the 1st column of different rows are different user questionnaire ID numbers; the element in the 1st row and j-th column is a question of the questionnaire, and the elements in the 1st row of different columns are different questions; the element in the i-th row and j-th column is the quantized result of the answer given under the i-th user questionnaire ID number to the j-th question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N.

A standardization processing module 300, configured to standardize the quantized data matrix to obtain a standardized data matrix.

A dimension reduction reconstruction module 400, configured to perform dimension reduction on the standardized data matrix with a principal component analysis algorithm, and to reconstruct the dimension-reduced data matrix to obtain a reconstructed data matrix.

A hierarchical clustering dendrogram obtaining module 500, configured to process each sample point in the reconstructed data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; wherein 2 ≤ z ≤ M.

A cluster dividing module 600, configured to determine the number of clusters according to the hierarchical clustering dendrogram, and to cluster the elements in the reconstructed data matrix with a clustering algorithm according to that number to obtain a plurality of clusters.

A gastroesophageal reflux disease risk factor determining module 700, configured to calculate the correlation indexes among the elements in each cluster, and to determine the element with the largest correlation index as a gastroesophageal reflux disease risk factor; the correlation index is the average of the squares of the correlation coefficients.
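The correlation index can be illustrated with a short sketch. It assumes that the correlation coefficient is Pearson's coefficient and that the index of an element is averaged over its correlations with the other elements of the same cluster; both points are assumptions, and the feature names and values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def correlation_indexes(cluster):
    """Map each element to its mean squared correlation with the others."""
    indexes = {}
    for name, column in cluster.items():
        squared = [pearson(column, other_column) ** 2
                   for other, other_column in cluster.items() if other != name]
        indexes[name] = sum(squared) / len(squared) if squared else 0.0
    return indexes


# One hypothetical cluster of three features.
cluster = {
    "heartburn": [0, 1, 2, 3],
    "acid_regurgitation": [0, 1, 2, 4],
    "late_meals": [1, 0, 1, 0],
}
indexes = correlation_indexes(cluster)
# Sort in descending order; the first entry is the cluster's risk factor.
ranked = sorted(indexes, key=indexes.get, reverse=True)
risk_factor = ranked[0]
```

The element with the largest index is the one that correlates most strongly, on average, with the rest of its cluster, which is why it is taken as that cluster's representative.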

The standardization processing module 300 specifically includes:

The gastroesophageal reflux disease risk factor determining module 700 specifically comprises:

a risk factor determining unit, configured to sort all the correlation indexes in descending order and to determine the element corresponding to the largest correlation index as the gastroesophageal reflux disease risk factor.

The method combines feature engineering with clustering to screen out irrelevant features, which yields higher-quality data and improves accuracy. Clustering groups highly similar features into one class, and within each cluster of similar features the feature that best represents the class is selected as the risk factor, so that the selected features are representative.

The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief; for the relevant details, refer to the description of the method.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A method for determining risk factors for gastroesophageal reflux disease based on machine learning, the method comprising:

standardizing the quantized data matrix to obtain a standardized data matrix;

2. The method for determining risk factors for gastroesophageal reflux disease according to claim 1, wherein the standardizing the quantized data matrix to obtain a standardized data matrix specifically comprises:

3. The method for determining risk factors for gastroesophageal reflux disease according to claim 1, wherein the dimension reduction processing is performed on the standardized data matrix by using a principal component analysis algorithm, and specifically comprises:

calculating a correlation matrix of the standardized data matrix;

4. The method for determining risk factors for gastroesophageal reflux disease according to claim 1, wherein the processing of each sample point in the reconstructed data matrix by using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram specifically comprises:

5. The method for determining risk factors for gastroesophageal reflux disease according to claim 1, wherein the calculating of the correlation indexes among the elements in each cluster and the determining of the element with the largest correlation index as the gastroesophageal reflux disease risk factor specifically comprises:

calculating the correlation indexes among the elements in each cluster;

6. A machine learning based gastroesophageal reflux disease risk factor determination system, the system comprising:

7. The gastroesophageal reflux disease risk factor determination system as claimed in claim 6, wherein the standardization processing module specifically comprises:

8. The system for determining risk factors for gastroesophageal reflux disease according to claim 6, wherein the gastroesophageal reflux disease risk factor determination module specifically comprises: