CN109686442A

CN109686442A - Machine-learning-based method and system for determining gastroesophageal reflux disease risk factors

Info

Publication number: CN109686442A
Application number: CN201811589405.8A
Authority: CN
Inventors: 刘万里; 徐雷; 黄玉珍; 姚澜; 李荣臻; 夏吉安
Original assignee: Individual
Current assignee: NANJING INTEGRATED TRADITIONAL CHINESE AND WESTERN MEDICINE HOSPITAL
Priority date: 2018-12-25
Filing date: 2018-12-25
Publication date: 2019-04-26
Anticipated expiration: 2038-12-25
Also published as: CN109686442B

Abstract

The invention discloses a machine-learning-based method and system for determining gastroesophageal reflux disease risk factors, which address the low accuracy of prior-art approaches that use statistics to determine such risk factors. First, a user information set containing gastroesophageal reflux disease risk factors is constructed, and the factors in the set are quantized to obtain a quantized data matrix. Second, the quantized data matrix is standardized, and the standardized matrix is reduced in dimension with a principal component analysis algorithm. Then, a hierarchical clustering algorithm clusters the data in the processed data set to obtain a hierarchical clustering dendrogram. Next, the number of clusters is determined from the dendrogram, and the data in the processed data set are clustered according to that number to obtain multiple class clusters. Finally, the index of correlation between the elements in each class cluster is calculated, and the element with the largest index of correlation is determined to be a gastroesophageal reflux disease risk factor.

Description

Machine-learning-based method and system for determining gastroesophageal reflux disease risk factors

Technical field

The present invention relates to the fields of machine learning and medical technology, and more particularly to a machine-learning-based method and system for determining gastroesophageal reflux disease risk factors.

Background art

Gastroesophageal reflux disease is a digestive system disease found worldwide, and its incidence is rising year by year. Its treatment therefore deserves serious attention. Because the occurrence of gastroesophageal reflux disease is closely related to lifestyle, emotional changes, eating habits, and similar factors, and the condition changes easily, collecting large amounts of data and analyzing their characteristics plays an important role in studying and preventing the disease.

At present, machine learning methods are rarely used to extract risk factors in gastroesophageal reflux disease diagnosis. In the medical field, risk factors are mostly extracted with statistical methods, which are computationally intensive and less accurate than machine learning.

Summary of the invention

The object of the present invention is to provide a machine-learning-based method and system for determining gastroesophageal reflux disease risk factors, so as to solve the prior-art problem of low accuracy when statistics are used to determine such risk factors.

To achieve the above object, the present invention provides the following solutions:

A machine-learning-based method for determining gastroesophageal reflux disease risk factors, comprising:

Constructing a user information set; the user information set is a data set of M rows and N columns; the factor in row i, column 1 of the user information set is a user questionnaire ID number, and the column-1 factors in different rows are different user questionnaire ID numbers; the factor in row 1, column j of the user information set is a questionnaire question, and the row-1 factors in different columns are different questions; the factor in row i, column j of the user information set is the answer given under the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N;

Quantizing the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a user questionnaire ID number, and the column-1 elements in different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a questionnaire question, and the row-1 elements in different columns are different questions; the element in row i, column j of the quantized data matrix is the quantized answer given under the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N;

Standardizing the quantized data matrix to obtain a standardized data matrix;

Reducing the dimension of the standardized data matrix with a principal component analysis algorithm, and reconstructing the dimension-reduced data matrix to obtain a reconstructed data matrix;

Processing each sample point in the reconstructed data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; where 2≤z≤M;

Determining the number of clusters from the hierarchical clustering dendrogram, and clustering the elements in the reconstructed data matrix with a clustering algorithm according to that number to obtain multiple class clusters;

Calculating the index of correlation between the elements in each class cluster, and determining the element with the largest index of correlation to be a gastroesophageal reflux disease risk factor; the index of correlation is the mean of the squared correlation coefficients.

Optionally, standardizing the quantized data matrix to obtain a standardized data matrix specifically comprises:

Standardizing the quantized data matrix with the Z-Score standardization algorithm; the data of each dimension in the standardized data matrix obey a normal distribution with mean 0 and variance 1.

Optionally, reducing the dimension of the standardized data matrix with a principal component analysis algorithm specifically comprises:

Calculating the correlation matrix of the standardized data matrix;

Calculating, from the correlation matrix, the eigenvalues and the eigenvector corresponding to each eigenvalue;

Sorting the eigenvalues in descending order, and selecting the eigenvectors corresponding to the top N eigenvalues to form the dimension-reduced data set.

Optionally, processing each sample point in the reconstructed data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram specifically comprises:

Step 1: calculating the distance between each pair of sample points with the average distance algorithm;

Step 2: merging the two sample points with the smallest distance into one class;

Step 3: repeating step 1 and step 2 until all sample points are merged into one class, thereby obtaining the hierarchical clustering dendrogram.

Optionally, calculating the index of correlation between the elements in each class cluster and determining the element with the largest index of correlation to be a gastroesophageal reflux disease risk factor specifically comprises:

Calculating the index of correlation between the elements in each class cluster;

Sorting all the indices of correlation in descending order, and determining the element corresponding to the largest index of correlation to be a gastroesophageal reflux disease risk factor.

A machine-learning-based system for determining gastroesophageal reflux disease risk factors, comprising:

A user information set construction module, configured to construct a user information set; the user information set is a data set of M rows and N columns; the factor in row i, column 1 of the user information set is a user questionnaire ID number, and the column-1 factors in different rows are different user questionnaire ID numbers; the factor in row 1, column j of the user information set is a questionnaire question, and the row-1 factors in different columns are different questions; the factor in row i, column j of the user information set is the answer given under the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N;

A quantization module, configured to quantize the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a user questionnaire ID number, and the column-1 elements in different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a questionnaire question, and the row-1 elements in different columns are different questions; the element in row i, column j of the quantized data matrix is the quantized answer given under the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N;

A standardization module, configured to standardize the quantized data matrix to obtain a standardized data matrix;

A dimension reduction and reconstruction module, configured to reduce the dimension of the standardized data matrix with a principal component analysis algorithm, and to reconstruct the dimension-reduced data matrix to obtain a reconstructed data matrix;

A hierarchical clustering dendrogram module, configured to process each sample point in the reconstructed data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; where 2≤z≤M;

A class cluster division module, configured to determine the number of clusters from the hierarchical clustering dendrogram, and to cluster the elements in the reconstructed data matrix with a clustering algorithm according to that number to obtain multiple class clusters;

A gastroesophageal reflux disease risk factor determining module, configured to calculate the index of correlation between the elements in each class cluster, and to determine the element with the largest index of correlation to be a gastroesophageal reflux disease risk factor; the index of correlation is the mean of the squared correlation coefficients.

Optionally, the standardization module specifically comprises:

A standardization unit, configured to standardize the quantized data matrix with the Z-Score standardization algorithm; the data of each dimension in the standardized data matrix obey a normal distribution with mean 0 and variance 1.

Optionally, the gastroesophageal reflux disease risk factor determining module specifically comprises:

An index of correlation calculation unit, configured to calculate the index of correlation between the elements in each class cluster;

A gastroesophageal reflux disease risk factor determination unit, configured to sort all the indices of correlation in descending order and determine the element corresponding to the largest index of correlation to be a gastroesophageal reflux disease risk factor.

According to the specific embodiments provided herein, the invention achieves the following technical effects:

The present invention proposes a machine-learning-based method and system for determining gastroesophageal reflux disease risk factors. It first performs feature extraction with principal component analysis, a feature engineering technique, to reduce the data dimension, then performs cluster analysis on the resulting high-quality data and selects the most critical risk factor in each cluster. By combining clustering with feature engineering, the invention screens out the risk factors that cause gastroesophageal reflux disease, provides a scientific basis for future medical research and disease diagnosis, offers guidance on gastroesophageal reflux disease, and helps reduce its incidence.

Brief description of the drawings

To explain the embodiments of the present invention or the prior-art technical solutions more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of the machine-learning-based method for determining gastroesophageal reflux disease risk factors according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of the machine-learning-based system for determining gastroesophageal reflux disease risk factors according to an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

The object of the present invention is to provide a machine-learning-based method and system for determining gastroesophageal reflux disease risk factors that can determine such risk factors efficiently and accurately.

To make the above objects, features, and advantages of the present invention clearer and easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.

Embodiment 1

Fig. 1 is a schematic flowchart of the machine-learning-based method for determining gastroesophageal reflux disease risk factors according to an embodiment of the present invention. As shown in Fig. 1, the method provided by this embodiment comprises the following steps.

Step 101: construct a user information set; the user information set is a data set of M rows and N columns; the factor in row i, column 1 of the user information set is a user questionnaire ID number, and the column-1 factors in different rows are different user questionnaire ID numbers; the factor in row 1, column j of the user information set is a questionnaire question, and the row-1 factors in different columns are different questions; the factor in row i, column j of the user information set is the answer given under the i-th user questionnaire ID number to the j-th question; where 2≤i≤M, 2≤j≤N, and i and j are positive integers.

Step 102: quantize the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a user questionnaire ID number, and the column-1 elements in different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a questionnaire question, and the row-1 elements in different columns are different questions; the element in row i, column j of the quantized data matrix is the quantized answer given under the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N.

Step 103: standardize the quantized data matrix to obtain a standardized data matrix.

Step 104: reduce the dimension of the standardized data matrix with a principal component analysis algorithm, and reconstruct the dimension-reduced data matrix to obtain a reconstructed data matrix.

Step 105: process each sample point in the reconstructed data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; where 2≤z≤M and z is a positive integer.

Step 106: determine the number of clusters from the hierarchical clustering dendrogram, and cluster the elements in the reconstructed data matrix with a clustering algorithm according to that number to obtain multiple class clusters.

Step 107: calculate the index of correlation between the elements in each class cluster, and determine the element with the largest index of correlation to be a gastroesophageal reflux disease risk factor; the index of correlation is the mean of the squared correlation coefficients.

Step 101 specifically comprises:

In this embodiment, a questionnaire is distributed at the hospital to each person consulting about gastroesophageal reflux disease, and the user information set is built from the returned questionnaires. A user in the set may be ill or healthy; each user receives a label after the hospital's diagnosis, and the label indicates whether the user is ill. The user information set is therefore a mixed data set of healthy and ill users in which the ill users' data are known. The data set in this embodiment has 241 dimensions in total, comprising the questionnaire ID number that uniquely identifies the patient and the answers to the questions of each questionnaire.

The questionnaire contains several groups of questions, covering general demographic data, lifestyle, eating habits, psychological factors, sleep factors, and so on, together with the respondents' answers. The questions are of three types: single-choice, true-false, and open-ended.

Step 102 quantizes the respondents' answers.

Specifically, the patient's questionnaire ID number serves as the unique identifier. For single-choice questions whose options are graded by severity, such as often, occasionally, seldom, and never, the weights 4, 3, 2, and 1 are assigned in order of severity, and the weight corresponding to the chosen answer is used. For true-false questions with yes/no options, "yes" is assigned 1 and "no" is assigned 0, and the value corresponding to the chosen answer is used. Questions whose options have no severity ordering, such as occupation, contribute nothing to the result and can be deleted. For open-ended questions, the continuous numerical value entered by the user is used directly as data. For example, if a user is 45 years old, has a height of 172, is not infected with HP, and often has sleep disorders, the sample's data columns are [age, height, HP infection, sleep disorder status] and the data values are [45, 172, 0, 4]. The purpose of this step is to quantize all respondents' answers and obtain the quantized data matrix D.
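The quantization rules of step 102 can be sketched as follows. This is only an illustration under assumptions: the function names, the column names, and the English option labels are hypothetical and do not come from the embodiment; only the weights 4/3/2/1, the 1/0 assignment, and the worked example [45, 172, 0, 4] come from the text above.

```python
# Hypothetical weight tables; labels and names are illustrative assumptions.
FREQUENCY_WEIGHTS = {"often": 4, "occasionally": 3, "seldom": 2, "never": 1}
YES_NO_WEIGHTS = {"yes": 1, "no": 0}


def quantize_answer(kind, answer):
    """Map one questionnaire answer to a number.

    kind is "single_choice", "yes_no", or "open" (a numeric free-text answer).
    """
    if kind == "single_choice":
        return FREQUENCY_WEIGHTS[answer]
    if kind == "yes_no":
        return YES_NO_WEIGHTS[answer]
    if kind == "open":
        return float(answer)
    raise ValueError(f"unknown question kind: {kind}")


def quantize_row(schema, answers):
    """Quantize one respondent's answers; schema lists (column, kind) pairs."""
    return [quantize_answer(kind, answers[column]) for column, kind in schema]


# Questions without a severity ordering (e.g. occupation) are simply left out
# of the schema, mirroring the deletion rule in the text.
schema = [
    ("age", "open"),
    ("height", "open"),
    ("hp_infection", "yes_no"),
    ("sleep_disorder", "single_choice"),
]
answers = {"age": "45", "height": "172", "hp_infection": "no", "sleep_disorder": "often"}
row = quantize_row(schema, answers)
print(row)
```

Stacking one such row per questionnaire would give the numeric part of the quantized data matrix D.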

Step 103 is data preprocessing, specifically data standardization.

Z-Score standardization scales data matrix D to a matrix with mean 0 and standard deviation 1. This is needed because the features carry units: with different units, the calculated distances differ. After standardization each dimension obeys a normal distribution with mean 0 and variance 1, so each dimension is dimensionless when distances are calculated, and the choice of units no longer affects the distance calculation.

The Z-Score standardization formula is:

$$\text{Feature\_value}' = \frac{\text{Feature\_value} - \mu}{S}$$

where Feature_value is the original attribute value of a given feature of the data, μ is the mean of that feature, S is the standard deviation of that feature, and Feature_value' is the new attribute value of the feature.
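A minimal sketch of column-wise Z-Score standardization is given below. It assumes the population standard deviation for S (the embodiment does not say whether the population or the sample form is meant) and maps a constant column to zeros to avoid division by zero; the function name and the sample rows are illustrative.

```python
from statistics import mean, pstdev


def z_score_columns(rows):
    """Standardize each column of a list-of-rows matrix to mean 0 and std 1."""
    scaled_columns = []
    for column in zip(*rows):
        mu = mean(column)
        s = pstdev(column)  # population standard deviation (an assumption)
        if s == 0:
            # A constant column carries no information; map it to zeros.
            scaled_columns.append([0.0] * len(column))
        else:
            scaled_columns.append([(value - mu) / s for value in column])
    # Transpose back so that each inner list is one respondent again.
    return [list(row) for row in zip(*scaled_columns)]


# Illustrative quantized rows: [age, height, HP infection, sleep disorder].
D = [[45, 172, 0, 4], [30, 160, 1, 2], [60, 181, 0, 1]]
D_std = z_score_columns(D)
```

After this step, age and height contribute to a Euclidean distance on the same scale as the 0/1 and 1 to 4 coded answers.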

Step 104 is dimensionality reduction and reconstruction of the data.

Principal component analysis mainly studies the correlations between the columns, obtains the components sorted by their share of variance, and, from the coefficients of each attribute on each standardized indicator within each principal component, determines the factors that contribute most to the principal components. This step preliminarily removes some irrelevant factors. The original data space is then restored by the inverse operation, which both achieves dimensionality reduction and keeps the data interpretable.

With the standardized data matrix D' as input, the steps of principal component analysis are as follows:

Compute the correlation matrix C of the standardized data matrix D';

Compute the eigenvalues of correlation matrix C and their corresponding eigenvectors;

Form a new data set, namely the dimension-reduced data set, from the eigenvectors corresponding to the top N eigenvalues. Specifically, the eigenvalues are sorted in descending order and the eigenvectors of the top N eigenvalues are taken as the new basis. Because the eigenvalues are in descending order, so are their eigenvectors; that is, the amount of information each eigenvector preserves runs from most to least. Retaining 90% discards the trailing, less important features that carry little information, which achieves dimensionality reduction.

Here, the basis vector that preserves the most information is the eigenvector of the covariance matrix with the largest eigenvalue, and the amount of information that eigenvector preserves is exactly its corresponding eigenvalue.

N is chosen so that the main content of the data is not lost. The new features are therefore chosen to represent 90% of the total variance of the data, that is, 90% of the original information is retained, so that dimensionality reduction is achieved with little loss of feature information.
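The principal component steps above (correlation matrix, eigen-decomposition, descending sort, 90% retention) can be sketched with NumPy as follows. The function name `pca_components`, the `variance_ratio` parameter, and the synthetic data are assumptions made for illustration, and the sketch assumes no column is constant (a constant column has an undefined correlation).

```python
import numpy as np


def pca_components(D_std, variance_ratio=0.90):
    """Return the leading eigenvalues and eigenvectors of the correlation matrix.

    D_std: matrix whose rows are respondents and whose columns are factors.
    Keeps the fewest components whose cumulative variance share reaches
    variance_ratio.
    """
    C = np.corrcoef(D_std, rowvar=False)           # correlation matrix of the columns
    eigenvalues, eigenvectors = np.linalg.eigh(C)  # eigh: C is symmetric
    order = np.argsort(eigenvalues)[::-1]          # indices for descending order
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # First position where the cumulative share reaches the target, as a count.
    k = min(int(np.searchsorted(cumulative, variance_ratio)) + 1, len(eigenvalues))
    return eigenvalues[:k], eigenvectors[:, :k]


# Synthetic example: four columns made of two nearly duplicated pairs.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
noise = 0.01 * rng.normal(size=(50, 2))
X = np.column_stack(
    [base[:, 0], base[:, 0] + noise[:, 0], base[:, 1], base[:, 1] + noise[:, 1]]
)
values, vectors = pca_components(X)
print(len(values), "components retained")
```

Each retained eigenvalue divided by the number of columns is that component's share of the total variance, which is what the 90% criterion accumulates.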

The purpose of reconstruction is as follows. Principal component analysis extracts features from the original data set to form new features, so the dimension-reduced data set no longer corresponds to the factors of the original data set, which lowers interpretability when risk factors are analyzed. The dimension-reduced data set therefore needs to be reconstructed and restored: the principal components are mapped back to the features contained in the original data set, giving the features with useless factors screened out, which form the reconstructed data matrix D''.
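One possible reading of the reconstruction step is the usual inverse PCA transform: project the standardized data onto the retained eigenvectors and map the result back to the original columns. The sketch below follows that reading only; the embodiment does not spell out the inverse operation, and the function name and random data are illustrative assumptions.

```python
import numpy as np


def reduce_and_reconstruct(D_std, eigenvectors):
    """Project onto the retained components, then map back to the original columns."""
    scores = D_std @ eigenvectors    # dimension-reduced data, shape (rows, k)
    return scores @ eigenvectors.T   # reconstructed data, shape (rows, columns)


rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
D_std = (X - X.mean(axis=0)) / X.std(axis=0)

# np.linalg.eigh returns eigenvectors ordered by ascending eigenvalue.
_, W = np.linalg.eigh(np.corrcoef(D_std, rowvar=False))
full = reduce_and_reconstruct(D_std, W)             # all components kept: lossless
partial = reduce_and_reconstruct(D_std, W[:, -2:])  # two leading components only
```

The reconstructed matrix has the same columns as the original factors, so later steps can still refer to questionnaire items by name.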

Step 105 is clustering.

Cluster analysis can cluster indicators or samples; here the indicators are clustered.

This embodiment uses hierarchical clustering, and more specifically agglomerative hierarchical clustering, to cluster each sample point in the quantized data matrix R. The basic principle is to first treat each object to be clustered as its own class, measure the similarity between classes with a clustering statistic, merge the two closest classes into one, and continue merging step by step until all objects are merged into a single class.

The steps of agglomerative hierarchical clustering are as follows:

Treat each sample point in the reconstructed data matrix D'' as an independent class.

According to the distance formula, calculate the distance between each pair of classes and find the two classes c1 and c2 with the smallest distance.

Merge class c1 and class c2 into one class;

Repeat the above steps until all sample points are merged into one class, thereby obtaining a hierarchical clustering dendrogram.

Specifically, as the distance metric, this embodiment uses the average distance method to calculate the distance between each pair of classes.

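The average distance method calculates the distance between every data point in one class and every data point in the other class, and takes the mean of all these distances as the distance between the two classes. The calculation formula is as follows:

$$d(c_1, c_2) = \frac{1}{|c_1|\,|c_2|} \sum_{x \in c_1} \sum_{z \in c_2} \mathrm{dist}(x, z)$$

where |c1| and |c2| are the numbers of data points in classes c1 and c2, and dist(x, z) is the Euclidean distance.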
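The agglomerative procedure with the average distance can be sketched in plain Python as below. This is a deliberately simple version for illustration, not an efficient one; the function names and the four sample points are assumptions, and the `points` passed in may be rows or (transposed) columns of the matrix depending on whether samples or indicators are being clustered.

```python
from itertools import combinations
from math import dist


def average_distance(points, c1, c2):
    """Mean Euclidean distance over all cross-class pairs of point indices."""
    return sum(dist(points[x], points[z]) for x in c1 for z in c2) / (len(c1) * len(c2))


def agglomerate(points):
    """Merge the two closest classes until one remains; return the merge history.

    Each history entry is (class_a, class_b, distance), with classes given as
    tuples of point indices. The history holds the information a dendrogram draws.
    """
    clusters = [(i,) for i in range(len(points))]  # every point starts as its own class
    history = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(points, pair[0], pair[1]),
        )
        history.append((a, b, average_distance(points, a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return history


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
history = agglomerate(points)
for a, b, d in history:
    print(a, b, round(d, 3))
```

For n points the history contains n - 1 merges, and the merge distances are the heights at which branches join in the dendrogram.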

In step 106, the clustering result among the features can be read from the hierarchical clustering dendrogram. The number of clusters can therefore be determined by analyzing the dendrogram, and the elements in the reconstructed data matrix are then clustered with a clustering algorithm according to that number to obtain multiple class clusters.
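The embodiment does not state a rule for reading the number of clusters off the dendrogram. One common heuristic, shown here purely as an assumption, is to cut where consecutive merge heights jump the most; the function name and the example heights are illustrative.

```python
def clusters_from_largest_gap(merge_heights):
    """Pick a cluster count by cutting the dendrogram at its largest height jump.

    merge_heights: distances of successive merges, in merge order
    (n - 1 values for n points).
    """
    n = len(merge_heights) + 1
    gaps = [later - earlier for earlier, later in zip(merge_heights, merge_heights[1:])]
    if not gaps:
        return 1
    cut = max(range(len(gaps)), key=gaps.__getitem__)  # stop after merge number cut + 1
    return n - (cut + 1)  # each merge reduces the number of classes by one


# Two cheap merges followed by one expensive merge suggests two clusters.
print(clusters_from_largest_gap([1.0, 1.0, 7.09]))  # 2
```

In practice the cut could equally be chosen by visual inspection of the dendrogram, which is closer to what the text describes.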

Step 107 specifically comprises:

To find the risk factors that determine the formation of this kind of patient population, the index of correlation (the mean of the squared correlation coefficients) between the elements in each class cluster is calculated, and the element with the largest index of correlation is determined to be a gastroesophageal reflux disease risk factor; that is, one gastroesophageal reflux disease risk factor is selected from each class cluster.

The quantized data matrix D is grouped by class label into D₁, D₂, D₃, ..., D_k. The risk factor set is initialized as the empty set, the correlations among the elements in each class cluster are analyzed, the index of correlation between the elements is calculated, and in each class cluster the element with the largest index of correlation is added to the risk factor set.

The correlation coefficient is calculated as follows:

$$r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$

where Var(X) is the variance of X, Var(Y) is the variance of Y, Cov(X, Y) is the covariance between X and Y, and X and Y are elements in each class cluster.

The sample index of correlation R² of a given feature is calculated as follows:

where X is a given feature, i is the feature index, and n is the total number of features.
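As a rough sketch of step 107 within a single class cluster, the code below computes each feature's index of correlation as the mean of its squared correlation coefficients with the other features in the cluster, then picks the largest. Averaging over the other features only (excluding the feature itself) is an assumption, as are the function names, the feature names, and the toy data.

```python
from math import sqrt


def correlation(x, y):
    """Pearson correlation: Cov(X, Y) / sqrt(Var(X) * Var(Y))."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)  # the 1/n factors cancel


def index_of_correlation(columns, name):
    """Mean squared correlation of one feature with the others in its cluster."""
    others = [other for other in columns if other != name]
    return sum(correlation(columns[name], columns[o]) ** 2 for o in others) / len(others)


def pick_risk_factor(columns):
    """Return the feature with the largest index of correlation in one cluster."""
    return max(columns, key=lambda name: index_of_correlation(columns, name))


# Hypothetical cluster of three quantized questionnaire columns.
cluster = {
    "heartburn": [1, 2, 3, 4, 5],
    "late_meals": [1, 2, 3, 5, 4],
    "coffee": [2, 1, 4, 3, 5],
}
print(pick_risk_factor(cluster))
```

Running the same selection once per class cluster and collecting the winners would give the risk factor set described above.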

Embodiment 2

To achieve the above object, the present invention also provides a machine-learning-based system for determining gastroesophageal reflux disease risk factors, as shown in Fig. 2. The system comprises:

A user information set construction module 100, configured to construct a user information set; the user information set is a data set of M rows and N columns; the factor in row i, column 1 of the user information set is a user questionnaire ID number, and the column-1 factors in different rows are different user questionnaire ID numbers; the factor in row 1, column j of the user information set is a questionnaire question, and the row-1 factors in different columns are different questions; the factor in row i, column j of the user information set is the answer given under the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N.

A quantization module 200, configured to quantize the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a user questionnaire ID number, and the column-1 elements in different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a questionnaire question, and the row-1 elements in different columns are different questions; the element in row i, column j of the quantized data matrix is the quantized answer given under the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N.

A standardization module 300, configured to standardize the quantized data matrix to obtain a standardized data matrix.

A dimension reduction and reconstruction module 400, configured to reduce the dimension of the standardized data matrix with a principal component analysis algorithm, and to reconstruct the dimension-reduced data matrix to obtain a reconstructed data matrix.

A hierarchical clustering dendrogram module 500, configured to process each sample point in the reconstructed data matrix with a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; where 2≤z≤M.

A class cluster division module 600, configured to determine the number of clusters from the hierarchical clustering dendrogram, and to cluster the elements in the reconstructed data matrix with a clustering algorithm according to that number to obtain multiple class clusters.

A gastroesophageal reflux disease risk factor determining module 700, configured to calculate the index of correlation between the elements in each class cluster, and to determine the element with the largest index of correlation to be a gastroesophageal reflux disease risk factor; the index of correlation is the mean of the squared correlation coefficients.

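The standardization module 300 specifically comprises:

A standardization unit, configured to standardize the quantized data matrix with the Z-Score standardization algorithm; the data of each dimension in the standardized data matrix obey a normal distribution with mean 0 and variance 1.

The gastroesophageal reflux disease risk factor determining module 700 specifically comprises:

An index of correlation calculation unit, configured to calculate the index of correlation between the elements in each class cluster;

A gastroesophageal reflux disease risk factor determination unit, configured to sort all the indices of correlation in descending order and determine the element corresponding to the largest index of correlation to be a gastroesophageal reflux disease risk factor.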

The present invention combines feature engineering with clustering: irrelevant features are screened out, the calculations are performed on high-quality data, and accuracy is improved. The clustering method groups highly similar features into one class, and within each cluster of similar features the feature that best represents the cluster is chosen as the risk factor, so the result is representative.

The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and for the same or similar parts the embodiments may be referred to one another. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, see the description of the method.

Specific examples are used herein to explain the principle and implementation of the invention. The description of the above embodiments is only intended to help in understanding the method of the invention and its core idea. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and the scope of application. In conclusion, the content of this specification should not be construed as limiting the invention.

Claims

1. A machine-learning-based method for determining gastroesophageal reflux disease risk factors, characterized in that the method comprises:

Constructing a user information set; the user information set is a data set of M rows and N columns; the factor in row i, column 1 of the user information set is a user questionnaire ID number, and the factors in column 1 of different rows represent different user questionnaire ID numbers; the factor in row 1, column j of the user information set is a question of the questionnaire, and the factors in row 1 of different columns represent different questions; the factor in row i, column j of the user information set is the answer of the i-th user questionnaire ID number to the j-th question; wherein 2≤i≤M, 2≤j≤N;

Performing data quantization on the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a user questionnaire ID number, and the elements in column 1 of different rows represent different user questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a question of the questionnaire, and the elements in row 1 of different columns represent different questions; the element in row i, column j of the quantized data matrix is the data quantization result of the answer of the i-th user questionnaire ID number to the j-th question; wherein 2≤i≤M, 2≤j≤N;

Performing dimensionality reduction on the standardized data matrix using a principal component analysis algorithm, and reconstructing the dimension-reduced data matrix to obtain a reconstructed data matrix;

Processing each sample point in the reconstructed data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; wherein 2≤z≤M;

Determining the number of clusters from the hierarchical clustering dendrogram, and, according to the number of clusters, clustering the elements in the reconstructed data matrix using a clustering algorithm to obtain a plurality of clusters;

Calculating the correlation index between the elements in each cluster, and determining the element with the largest correlation index as a gastroesophageal reflux disease risk factor; the correlation index is the mean of the squared correlation coefficients.

2. The method for determining gastroesophageal reflux disease risk factors according to claim 1, characterized in that standardizing the quantized data matrix to obtain the standardized data matrix specifically comprises:

Standardizing the quantized data matrix using the Z-Score standardization algorithm; the data in each dimension of the standardized data matrix follow a normal distribution with a mean of 0 and a variance of 1.
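A minimal sketch of Z-Score standardization is given below for illustration. It assumes that the ID column and the question row have already been removed, so that the input holds only numeric quantized answers with one column per question. It rescales each column to zero mean and unit variance; the handling of constant columns and the function name `z_score` are assumptions of this sketch.

```python
import numpy as np


def z_score(matrix) -> np.ndarray:
    """Standardize each column to mean 0 and (population) variance 1."""
    x = np.asarray(matrix, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    # A constant column has zero spread; divide by 1 so it becomes all zeros
    # instead of NaN (assumption, not stated in the source).
    std[std == 0] = 1.0
    return (x - mean) / std


# Toy quantized answers: 3 respondents, 2 questions.
toy_standardized = z_score([[1, 10], [2, 20], [3, 30]])
```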

3. The method for determining gastroesophageal reflux disease risk factors according to claim 1, characterized in that performing dimensionality reduction on the standardized data matrix using the principal component analysis algorithm specifically comprises:

Calculating the correlation coefficient matrix of the standardized data matrix;

Arranging the eigenvalues in descending order, and selecting the eigenvectors corresponding to the first n eigenvalues to form the dimension-reduced data set.
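The dimensionality reduction can be sketched as follows, assuming the input is already standardized with one column per question. The sketch computes the correlation coefficient matrix, takes its eigenvalues and eigenvectors, orders them from largest to smallest eigenvalue, and projects the data onto the first n eigenvectors. The reconstruction shown (projecting back onto the original axes) is one common reading of "reconstructing" the reduced matrix and is an assumption here, as is the function name `pca_reduce_and_reconstruct`.

```python
import numpy as np


def pca_reduce_and_reconstruct(z: np.ndarray, n_components: int):
    """PCA on the correlation matrix of standardized data `z` (rows = samples)."""
    corr = np.corrcoef(z, rowvar=False)       # correlation coefficient matrix
    eigvals, eigvecs = np.linalg.eigh(corr)    # symmetric matrix, so eigh applies
    order = np.argsort(eigvals)[::-1]          # eigenvalues, largest first
    top = eigvecs[:, order[:n_components]]     # first n eigenvectors as columns
    reduced = z @ top                          # dimension-reduced data
    reconstructed = reduced @ top.T            # back-projection (assumed reconstruction)
    return reduced, reconstructed, eigvals[order]


# Toy standardized data: 4 respondents, 3 questions.
toy_raw = np.array([[1, 2, 1], [2, 4, 0], [3, 6, 1], [4, 8, 0]], dtype=float)
toy_z = (toy_raw - toy_raw.mean(axis=0)) / toy_raw.std(axis=0)
toy_reduced, toy_reconstructed, toy_eigvals = pca_reduce_and_reconstruct(toy_z, 2)
```

Because the eigenvectors are orthonormal, keeping all components reproduces the standardized matrix exactly, while keeping fewer components discards the directions with the smallest eigenvalues.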

4. The method for determining gastroesophageal reflux disease risk factors according to claim 1, characterized in that processing each sample point in the reconstructed data matrix using the hierarchical clustering algorithm to obtain the hierarchical clustering dendrogram specifically comprises:

5. The method for determining gastroesophageal reflux disease risk factors according to claim 1, characterized in that calculating the correlation index between the elements in each cluster and determining the element with the largest correlation index as a gastroesophageal reflux disease risk factor specifically comprises:

Arranging all correlation indices in descending order, and selecting the element corresponding to the largest correlation index as the gastroesophageal reflux disease risk factor.

6. A machine-learning-based system for determining gastroesophageal reflux disease risk factors, characterized in that the system comprises:

A user information set construction module, configured to construct a user information set; the user information set is a data set of M rows and N columns; the factor in row i, column 1 of the user information set is a user questionnaire ID number, and the factors in column 1 of different rows represent different user questionnaire ID numbers; the factor in row 1, column j of the user information set is a question of the questionnaire, and the factors in row 1 of different columns represent different questions; the factor in row i, column j of the user information set is the answer of the i-th user questionnaire ID number to the j-th question; wherein 2≤i≤M, 2≤j≤N;

A quantization module, configured to perform data quantization on the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a user questionnaire ID number, and the elements in column 1 of different rows represent different user questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a question of the questionnaire, and the elements in row 1 of different columns represent different questions; the element in row i, column j of the quantized data matrix is the data quantization result of the answer of the i-th user questionnaire ID number to the j-th question; wherein 2≤i≤M, 2≤j≤N;

A standardization module, configured to standardize the quantized data matrix to obtain a standardized data matrix;

A dimensionality reduction and reconstruction module, configured to perform dimensionality reduction on the standardized data matrix using a principal component analysis algorithm, and to reconstruct the dimension-reduced data matrix to obtain a reconstructed data matrix;

A hierarchical clustering dendrogram obtaining module, configured to process each sample point in the reconstructed data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; wherein 2≤z≤M;

A cluster division module, configured to determine the number of clusters from the hierarchical clustering dendrogram and, according to the number of clusters, to cluster the elements in the reconstructed data matrix using a clustering algorithm to obtain a plurality of clusters;

A gastroesophageal reflux disease risk factor determining module, configured to calculate the correlation index between the elements in each cluster and to determine the element with the largest correlation index as a gastroesophageal reflux disease risk factor; the correlation index is the mean of the squared correlation coefficients.

7. The system for determining gastroesophageal reflux disease risk factors according to claim 6, characterized in that the standardization module specifically comprises:

A standardization unit, configured to standardize the quantized data matrix using the Z-Score standardization algorithm; the data in each dimension of the standardized data matrix follow a normal distribution with a mean of 0 and a variance of 1.

8. The system for determining gastroesophageal reflux disease risk factors according to claim 6, characterized in that the gastroesophageal reflux disease risk factor determining module specifically comprises:

A gastroesophageal reflux disease risk factor determination unit, configured to arrange all correlation indices in descending order and to select the element corresponding to the largest correlation index as the gastroesophageal reflux disease risk factor.