CN109686442A - Method and system for determining gastroesophageal reflux disease risk factors based on machine learning
- Publication number
- CN109686442A (application CN201811589405.8A)
- Authority
- CN
- China
- Prior art keywords
- data matrix
- row
- correlation
- reflux disease
- column
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Biomedical Technology (AREA)
- Public Health (AREA)
- Pathology (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Measuring And Recording Apparatus For Diagnosis (AREA)
- Medical Treatment And Welfare Office Work (AREA)
Abstract
The invention discloses a machine-learning-based method and system for determining gastroesophageal reflux disease risk factors, which address the low accuracy of prior-art statistical approaches to determining such risk factors. First, a user information set containing gastroesophageal reflux disease risk factors is constructed, and the factors in the user information set are quantized to obtain a quantized data matrix. Second, the quantized data matrix is standardized, and the standardized matrix is reduced in dimension using a principal component analysis algorithm. Third, a hierarchical clustering algorithm is applied to the data in the processed data set to obtain a hierarchical clustering dendrogram. Next, the number of clusters is determined from the hierarchical clustering dendrogram, and the data in the processed data set are clustered according to that number to obtain multiple class clusters. Finally, the correlation index between the elements in each class cluster is calculated, and the element with the largest correlation index is determined to be a gastroesophageal reflux disease risk factor.
Description
Technical field
The present invention relates to the technical fields of machine learning and medicine, and in particular to a machine-learning-based method and system for determining gastroesophageal reflux disease risk factors.
Background art
Gastroesophageal reflux disease is a digestive system disease that is prevalent worldwide, and its incidence has been rising year by year. Its treatment therefore deserves sufficient attention. Because the onset of gastroesophageal reflux disease is closely related to lifestyle, emotional changes, eating habits and the like, and because the condition changes easily, collecting large amounts of data and analyzing their characteristics plays an important role in studying and preventing the disease.
At present, machine learning methods are rarely used to extract risk factors in gastroesophageal reflux disease diagnosis. In the medical field, risk factors are mostly extracted with statistical methods, which are computationally intensive and less accurate than machine learning.
Summary of the invention
The object of the present invention is to provide a machine-learning-based method and system for determining gastroesophageal reflux disease risk factors, so as to solve the prior-art problem of low accuracy when such risk factors are determined using statistics.
To achieve the above object, the present invention provides the following scheme:
A method for determining gastroesophageal reflux disease risk factors based on machine learning, comprising:
Constructing a user information set; the user information set is a data set of M rows and N columns; the factor in row i, column 1 of the user information set is a user questionnaire ID number, and the factors of column 1 in different rows are different user questionnaire ID numbers; the factor in row 1, column j of the user information set is a question of the questionnaire, and the factors of row 1 in different columns are different questions; the factor in row i, column j of the user information set is the answer of the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N;
Performing data quantization on the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a user questionnaire ID number, and the elements of column 1 in different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a question of the questionnaire, and the elements of row 1 in different columns are different questions; the element in row i, column j of the quantized data matrix is the quantized answer of the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N;
Standardizing the quantized data matrix to obtain a standardized data matrix;
Performing dimensionality reduction on the standardized data matrix using a principal component analysis algorithm, and reconstructing the data matrix after dimensionality reduction to obtain a reconstructed data matrix;
Processing each sample point in the reconstructed data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; where 2≤z≤M;
Determining the number of clusters from the hierarchical clustering dendrogram, and clustering the elements in the reconstructed data matrix with a clustering algorithm according to the number of clusters to obtain multiple class clusters;
Calculating the correlation index between the elements in each class cluster, and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor; the correlation index is the mean of the squared correlation coefficients.
Optionally, standardizing the quantized data matrix to obtain the standardized data matrix specifically comprises:
Standardizing the quantized data matrix using the Z-Score standardization algorithm; the data of each dimension in the standardized data matrix obey a normal distribution with mean 0 and variance 1.
Optionally, performing dimensionality reduction on the standardized data matrix using the principal component analysis algorithm specifically comprises:
Calculating the correlation matrix of the standardized data matrix;
Calculating, from the correlation matrix, the eigenvalues and the eigenvector corresponding to each eigenvalue;
Arranging the eigenvalues in descending order, and selecting the eigenvectors corresponding to the first N eigenvalues to form the data set after dimensionality reduction.
Optionally, processing each sample point in the reconstructed data matrix using the hierarchical clustering algorithm to obtain the hierarchical clustering dendrogram specifically comprises:
Step 1: calculating the distance between every two sample points using the average distance algorithm;
Step 2: merging the two sample points with the smallest distance into one class;
Step 3: repeating step 1 and step 2 until all sample points are merged into one class, thereby obtaining the hierarchical clustering dendrogram.
Optionally, calculating the correlation index between the elements in each class cluster and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor specifically comprises:
Calculating the correlation index between the elements in each class cluster;
Arranging all correlation indices in descending order, and determining the element corresponding to the largest correlation index to be a gastroesophageal reflux disease risk factor.
A system for determining gastroesophageal reflux disease risk factors based on machine learning, comprising:
A user information set construction module, configured to construct a user information set; the user information set is a data set of M rows and N columns; the factor in row i, column 1 of the user information set is a user questionnaire ID number, and the factors of column 1 in different rows are different user questionnaire ID numbers; the factor in row 1, column j of the user information set is a question of the questionnaire, and the factors of row 1 in different columns are different questions; the factor in row i, column j of the user information set is the answer of the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N;
A quantization module, configured to perform data quantization on the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a user questionnaire ID number, and the elements of column 1 in different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a question of the questionnaire, and the elements of row 1 in different columns are different questions; the element in row i, column j of the quantized data matrix is the quantized answer of the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N;
A standardization module, configured to standardize the quantized data matrix to obtain a standardized data matrix;
A dimensionality reduction and reconstruction module, configured to perform dimensionality reduction on the standardized data matrix using a principal component analysis algorithm, and to reconstruct the data matrix after dimensionality reduction to obtain a reconstructed data matrix;
A hierarchical clustering dendrogram obtaining module, configured to process each sample point in the reconstructed data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; where 2≤z≤M;
A class cluster division module, configured to determine the number of clusters from the hierarchical clustering dendrogram, and to cluster the elements in the reconstructed data matrix with a clustering algorithm according to the number of clusters to obtain multiple class clusters;
A gastroesophageal reflux disease risk factor determining module, configured to calculate the correlation index between the elements in each class cluster, and to determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor; the correlation index is the mean of the squared correlation coefficients.
Optionally, the standardization module specifically comprises:
A standardization unit, configured to standardize the quantized data matrix using the Z-Score standardization algorithm; the data of each dimension in the standardized data matrix obey a normal distribution with mean 0 and variance 1.
Optionally, the gastroesophageal reflux disease risk factor determining module specifically comprises:
A correlation index calculation unit, configured to calculate the correlation index between the elements in each class cluster;
A gastroesophageal reflux disease risk factor determination unit, configured to arrange all correlation indices in descending order and to determine the element corresponding to the largest correlation index to be a gastroesophageal reflux disease risk factor.
According to the specific embodiments provided by the present invention, the invention discloses the following technical effects:
The present invention proposes a method and system for determining gastroesophageal reflux disease risk factors that are based mainly on machine learning. The invention first performs feature extraction with principal component analysis, a feature engineering technique, to reduce the data dimension, then performs cluster analysis on the resulting higher-quality data, and selects the most critical risk factor in each class cluster. By combining clustering with feature engineering, the invention screens out the risk factors that lead to gastroesophageal reflux disease, provides a scientific basis for future medical research and disease diagnosis, guides the management of gastroesophageal reflux disease, and helps reduce its incidence.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the machine-learning-based method for determining gastroesophageal reflux disease risk factors according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the machine-learning-based system for determining gastroesophageal reflux disease risk factors according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The object of the present invention is to provide a machine-learning-based method and system for determining gastroesophageal reflux disease risk factors that can determine such risk factors efficiently and accurately.
To make the above objects, features and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Embodiment 1
Fig. 1 is a schematic flowchart of the machine-learning-based method for determining gastroesophageal reflux disease risk factors according to an embodiment of the present invention. As shown in Fig. 1, the method provided by this embodiment specifically includes the following steps.
Step 101: construct a user information set. The user information set is a data set of M rows and N columns. The factor in row i, column 1 of the user information set is a user questionnaire ID number, and the factors of column 1 in different rows are different user questionnaire ID numbers. The factor in row 1, column j of the user information set is a question of the questionnaire, and the factors of row 1 in different columns are different questions. The factor in row i, column j of the user information set is the answer of the i-th user questionnaire ID number to the j-th question, where 2≤i≤M, 2≤j≤N, and i and j are positive integers.
Step 102: perform data quantization on the answers in the user information set to obtain a quantized data matrix. The quantized data matrix is a matrix of M rows and N columns. The element in row i, column 1 of the quantized data matrix is a user questionnaire ID number, and the elements of column 1 in different rows are different user questionnaire ID numbers. The element in row 1, column j of the quantized data matrix is a question of the questionnaire, and the elements of row 1 in different columns are different questions. The element in row i, column j of the quantized data matrix is the quantized answer of the i-th user questionnaire ID number to the j-th question, where 2≤i≤M and 2≤j≤N.
Step 103: standardize the quantized data matrix to obtain a standardized data matrix.
Step 104: perform dimensionality reduction on the standardized data matrix using a principal component analysis algorithm, and reconstruct the data matrix after dimensionality reduction to obtain a reconstructed data matrix.
Step 105: process each sample point in the reconstructed data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram. The z-th sample point represents the z-th row of data in the reconstructed data matrix, where 2≤z≤M and z is a positive integer.
Step 106: determine the number of clusters from the hierarchical clustering dendrogram, and cluster the elements in the reconstructed data matrix with a clustering algorithm according to the number of clusters to obtain multiple class clusters.
Step 107: calculate the correlation index between the elements in each class cluster, and determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor. The correlation index is the mean of the squared correlation coefficients.
Step 101 specifically includes:
In this embodiment, a questionnaire is distributed in the hospital to each person who seeks consultation about gastroesophageal reflux disease, and the user information set is built from the returned questionnaires. A user in the user information set may have the disease or may be healthy; a label is assigned after hospital diagnosis, and the label indicates whether the user has the disease. The user information set is therefore a mixed data set of healthy and diseased users in which the diseased data are known. The data set in this embodiment has 241 dimensions in total, including the uniquely identifying patient questionnaire ID number and the answer to each question of the questionnaire.
The questionnaire contains questions in several parts, such as general demographic data, lifestyle, eating habits, mental factors and sleep factors, together with the respondents' answers. The questionnaire has three answer types: single-choice questions, true/false questions and short-answer questions.
Step 102 performs data quantization on the respondents' answers.
Specifically, the patient's questionnaire ID number serves as the unique identification number. For single-choice questions whose answers are graded by severity, for example with the options often, occasionally, seldom and never, the weights 4, 3, 2 and 1 are assigned in order of severity, and the weight corresponding to the actual answer is selected. For true/false questions with yes/no options, "yes" is assigned 1 and "no" is assigned 0, and the value corresponding to the actual answer is selected. Questions whose options have no severity ranking, such as occupation, are of no use for the result and can be deleted. For short-answer questions, the continuous numerical value entered by the user is used directly as the data. For example, if a user is 45 years old, has a height of 172, is not infected with HP and often has sleep disturbance, the columns of the sample are [age, height, HP infection, sleep disturbance] and the data values are [45, 172, 0, 4]. The purpose of this step is to quantize the answers of all respondents and obtain the quantized data matrix D.
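The quantization rules of step 102 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `quantize_answer` and `quantize_row` and the question-type labels are hypothetical; only the weights (4, 3, 2, 1 for often/occasionally/seldom/never and 1/0 for yes/no) are taken from the description.

```python
# Hypothetical sketch of step 102; names and question-type labels are assumptions.
SEVERITY_WEIGHTS = {"often": 4, "occasionally": 3, "seldom": 2, "never": 1}
YES_NO_VALUES = {"yes": 1, "no": 0}


def quantize_answer(question_type, answer):
    """Map one questionnaire answer to a number, or None if the question is dropped."""
    if question_type == "severity":   # single-choice question graded by severity
        return SEVERITY_WEIGHTS[answer]
    if question_type == "yes_no":     # true/false question
        return YES_NO_VALUES[answer]
    if question_type == "numeric":    # short-answer question with a continuous value
        return float(answer)
    return None                       # options without severity ranking, e.g. occupation


def quantize_row(question_types, answers):
    """Quantize one respondent's answers, skipping questions that are dropped."""
    row = []
    for qtype, ans in zip(question_types, answers):
        value = quantize_answer(qtype, ans)
        if value is not None:
            row.append(value)
    return row


# Example from the description: age 45, height 172, not infected with HP,
# frequent sleep disturbance.
types = ["numeric", "numeric", "yes_no", "severity"]
example = quantize_row(types, ["45", "172", "no", "often"])
print(example)  # [45.0, 172.0, 0, 4]
```

Stacking one such row per returned questionnaire gives the numeric part of the quantized data matrix D.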
Step 103 is data preprocessing, specifically data standardization.
Z-Score standardization is used to scale the data matrix D to a matrix with mean 0 and standard deviation 1. Because the dimensions carry units, using different units gives different distance results. After standardization, each dimension obeys a normal distribution with mean 0 and variance 1, so each dimension is dimensionless when distances are calculated, which prevents the choice of units from affecting the distance calculation.
The Z-Score standardization formula is:
$$\text{Feature\_value}' = \frac{\text{Feature\_value} - \mu}{S}$$
where Feature_value is the original attribute value of a feature of the data, μ is the mean of that feature, S is the standard deviation of that feature, and Feature_value' is the new attribute value of the feature.
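As an illustration of the Z-Score formula, the following sketch standardizes each column of a small numeric matrix using only the Python standard library. The function name `z_score_columns`, the sample values, and the use of the population standard deviation are assumptions; the description does not say whether the sample or population form is meant. Note that this rescaling fixes the mean and variance of each column but does not by itself change the shape of its distribution.

```python
from statistics import mean, pstdev


def z_score_columns(matrix):
    """Standardize each column of a row-major numeric matrix to mean 0 and std 1."""
    columns = list(zip(*matrix))
    scaled_columns = []
    for col in columns:
        mu = mean(col)
        s = pstdev(col)  # population standard deviation (an assumption)
        if s == 0:
            # A constant column carries no information; map it to zeros
            # instead of dividing by zero.
            scaled_columns.append([0.0] * len(col))
        else:
            scaled_columns.append([(v - mu) / s for v in col])
    return [list(row) for row in zip(*scaled_columns)]


# Three made-up respondents with columns [age, height, HP infection, sleep disturbance].
D = [[45, 172, 0, 4], [30, 160, 1, 2], [60, 181, 0, 1]]
D_std = z_score_columns(D)
```

After this step, a column measured in years and a column measured on a 1 to 4 scale contribute to Euclidean distances on the same footing.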
Step 104 mainly performs dimensionality reduction and reconstruction of the data.
Principal component analysis mainly studies the correlation between the columns. The components are obtained and sorted by their share of the variance, and the factors that play a larger role in each principal component are determined from the coefficient of each attribute, on each standardized index, in that principal component. This step preliminarily removes some irrelevant factors. The original data space is then restored by an inverse operation, which both achieves dimensionality reduction and keeps the data well interpretable.
With the standardized data matrix D' as input, the steps of principal component analysis are as follows:
Calculate the correlation matrix C of the standardized data matrix D';
Calculate the eigenvalues of the correlation matrix C and the eigenvector corresponding to each eigenvalue;
Form a new data set, that is, the data set after dimensionality reduction, from the eigenvectors corresponding to the top N eigenvalues. Specifically, the eigenvalues are arranged in descending order, and the eigenvectors corresponding to the top N eigenvalues are taken as the new basis. Because the eigenvalues are in descending order, the corresponding eigenvectors are in the same order, that is, the amount of information retained by each eigenvector goes from more to less. 90% is retained, that is, the later features, which are less important and carry less information, are removed, which achieves dimensionality reduction.
The basis vector that retains the most information is the eigenvector of the covariance matrix with the largest eigenvalue (for standardized data, the covariance matrix coincides with the correlation matrix C), and the amount of information retained by an eigenvector is its corresponding eigenvalue.
The top N is chosen so that the main content of the data is not lost. Here the new features are selected to represent 90% of the total variance of the data, that is, 90% of the original information is retained, so that dimensionality reduction is achieved with little loss of feature information.
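The 90% rule for choosing how many eigenvectors to keep can be sketched as follows. The function name `components_for_variance` and the example eigenvalues are hypothetical, and the sketch assumes the eigenvalues of the correlation matrix C have already been computed by some numerical routine; it only shows the selection step.

```python
def components_for_variance(eigenvalues, target=0.90):
    """Return the smallest number of leading components whose eigenvalues
    account for at least `target` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)  # descending, as in the description
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)


# Made-up eigenvalues of a 4 x 4 correlation matrix (they sum to 4).
k = components_for_variance([2.5, 1.0, 0.3, 0.2])
print(k)  # 3, because (2.5 + 1.0 + 0.3) / 4.0 = 0.95 >= 0.90
```

Keeping the first `k` eigenvectors then defines the reduced basis; the threshold is exposed as a parameter so that values other than 0.90 can be tried.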
The purpose of reconstructing the reduced data set is as follows. Principal component analysis extracts features from the original data set to form new features, which no longer correspond to the factors of the original data set, so interpretability is reduced when the risk factors are analyzed. The reduced data set therefore needs to be reconstructed and restored: the principal components after dimensionality reduction are mapped back to the features contained in the original data set, which yields the features with useless factors screened out, and these form the reconstructed data matrix D''.
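The description does not spell out the inverse operation used for reconstruction. One common reading, offered here only as an assumption, is the standard PCA back-projection: project each row onto the retained orthonormal eigenvectors, then map the resulting scores back into the original feature space. The function name `project_and_reconstruct` and the toy basis are hypothetical.

```python
def project_and_reconstruct(data, basis):
    """Project rows of `data` onto the basis vectors, then map the scores back.

    `basis` is a list of k orthonormal vectors, each with one entry per
    original feature (for PCA these would be the retained eigenvectors).
    """
    reconstructed = []
    for row in data:
        # Coordinates of the row in the reduced k-dimensional space.
        scores = [sum(x * b for x, b in zip(row, vec)) for vec in basis]
        # Back-projection into the original feature space.
        back = [sum(s * vec[j] for s, vec in zip(scores, basis))
                for j in range(len(row))]
        reconstructed.append(back)
    return reconstructed


# Toy example: keeping the first two axes of a 3-feature space discards the third.
rows = [[3, 4, 5]]
kept_axes = [[1, 0, 0], [0, 1, 0]]
print(project_and_reconstruct(rows, kept_axes))  # [[3, 4, 0]]
```

The reconstructed rows have the same columns as the original data, which is what lets later steps refer back to the original questionnaire factors.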
Step 105 is clustering.
Cluster analysis can cluster either indices (variables) or samples; here the indices are clustered.
This embodiment uses hierarchical clustering, and more specifically agglomerative hierarchical clustering, to cluster each sample point in the quantized data matrix R. The basic principle is that each object to be clustered is first treated as its own class; the similarity between classes is then measured by a class statistic, and the two closest classes are merged into one; merging continues step by step until all objects have been merged into a single class.
The steps of agglomerative hierarchical clustering are as follows:
Treat each sample point in the reconstructed data matrix D'' as an independent class.
Calculate the distance between every two classes according to the distance formula, and find the two classes c1 and c2 with the smallest distance.
Merge class c1 and class c2 into one class.
Repeat the above steps until all sample points are merged into one class, which yields a hierarchical clustering dendrogram.
As the distance metric, this embodiment uses the average distance method to calculate the distance between two classes.
The average distance method computes the distance between each data point in one class and every data point in the other class, and takes the mean of all these distances as the distance between the two classes. The calculation formula is:
$$\operatorname{dist}(c_1, c_2) = \frac{1}{|c_1|\,|c_2|} \sum_{x \in c_1} \sum_{z \in c_2} \operatorname{dist}(x, z)$$
where |c1| and |c2| are the numbers of data points in classes c1 and c2, and dist(x, z) is the Euclidean distance.
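The agglomerative procedure with the average distance method can be sketched as below. This is a deliberately naive illustration, not the embodiment's code: the function names `average_distance` and `agglomerate`, the tuple layout of the merge history, and the four sample points are assumptions. Each entry of the returned history corresponds to one join in the dendrogram.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def average_distance(cluster_a, cluster_b, points):
    """Mean Euclidean distance over all cross-cluster pairs of points."""
    total = sum(dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[i] for i in range(len(points))]  # every point starts as its own class
    merges = []  # (members of first cluster, members of second cluster, distance)
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_distance(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
history = agglomerate(points)
for left, right, height in history:
    print(left, right, round(height, 3))
```

The pairwise search makes this sketch cubic in the number of points, which is acceptable for illustration; a library routine would normally be used on real data.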
In step 106, the clustering result between the features can be read from the hierarchical clustering dendrogram. The number of clusters can therefore be determined by analyzing the dendrogram, and the elements in the reconstructed data matrix are clustered with a clustering algorithm according to that number to obtain multiple class clusters.
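The description leaves open how the dendrogram is analyzed to fix the number of clusters. One simple heuristic, shown here purely as an assumption, is to cut the dendrogram at the largest jump between successive merge distances. The function name `choose_cluster_count` is hypothetical, and the sketch assumes the merge distances are listed in merge order and do not decrease, as is the case for average linkage.

```python
def choose_cluster_count(merge_distances):
    """Pick the number of clusters by cutting the dendrogram at its largest gap.

    `merge_distances` holds the distance of each merge in the order the merges
    happened, so n elements give n - 1 values.
    """
    if len(merge_distances) < 2:
        return 1
    n = len(merge_distances) + 1
    gaps = [merge_distances[i + 1] - merge_distances[i]
            for i in range(len(merge_distances) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)  # index of the last merge kept
    # After (cut + 1) merges, n - (cut + 1) clusters remain.
    return n - (cut + 1)


# Two tight pairs that are far apart from each other: the natural answer is 2.
print(choose_cluster_count([1.0, 1.0, 7.09]))  # 2
```

In practice the cut is often also checked by eye on the plotted dendrogram, which is closer to what the description suggests.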
Step 107 specifically includes:
To find the risk factors that determine the formation of each patient group, the correlation index (the mean of the squared correlation coefficients) between the elements in each class cluster is calculated, and the element with the largest correlation index is determined to be a gastroesophageal reflux disease risk factor; that is, one gastroesophageal reflux disease risk factor is selected from each class cluster.
The quantized data matrix D is grouped by class label to obtain D_1, D_2, D_3, ..., D_k. The initial risk factor set is the empty set. The correlation between the elements in each class cluster is analyzed, the correlation index between the elements is calculated, and the element with the largest correlation index in each class cluster is added to the risk factor set.
The correlation coefficient is calculated as follows:
$$r(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}}$$
where Var(X) is the variance of X, Var(Y) is the variance of Y, Cov(X, Y) is the covariance between X and Y, and X and Y are elements in each class cluster.
The sample correlation index R² of a feature is calculated as follows:
$$R^2 = \frac{1}{n-1} \sum_{i=1}^{n-1} r(X, X_i)^2$$
where X is a given feature, i is the feature index running over the other features X_i, and n is the total number of features.
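The selection of one representative element per class cluster can be sketched as follows. The function names and the toy cluster are hypothetical, and the sketch assumes that the correlation index of a feature is the mean of its squared Pearson correlations with the other features in the same class cluster, which is one reading of "the mean of the squared correlation coefficients".

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient Cov(X, Y) / sqrt(Var(X) * Var(Y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # The 1/n factors of covariance and variance cancel, so plain sums are used.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_index(name, cluster):
    """Mean squared correlation between one feature and the others in its cluster."""
    others = [col for key, col in cluster.items() if key != name]
    if not others:
        return 0.0  # a single-feature cluster has nothing to correlate with
    return sum(pearson(cluster[name], col) ** 2 for col in others) / len(others)


def representative(cluster):
    """Return the feature with the largest correlation index in the cluster."""
    return max(cluster, key=lambda name: correlation_index(name, cluster))


# Toy class cluster: feature name -> column of quantized values (made-up data).
cluster = {"a": [1, 2, 3, 4], "b": [1, 2, 3, 5], "c": [2, 1, 4, 3]}
print(representative(cluster))  # a
```

Running `representative` once per class cluster and collecting the results gives the risk factor set described in step 107.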
Embodiment 2
To achieve the above object, the present invention also provides a machine-learning-based system for determining gastroesophageal reflux disease risk factors, as shown in Fig. 2. The system includes:
A user information set construction module 100, configured to construct a user information set. The user information set is a data set of M rows and N columns. The factor in row i, column 1 of the user information set is a user questionnaire ID number, and the factors of column 1 in different rows are different user questionnaire ID numbers. The factor in row 1, column j of the user information set is a question of the questionnaire, and the factors of row 1 in different columns are different questions. The factor in row i, column j of the user information set is the answer of the i-th user questionnaire ID number to the j-th question, where 2≤i≤M and 2≤j≤N.
A quantization module 200, for performing data quantization on the answers in the user information set to obtain a quantized data matrix. The quantized data matrix is a matrix of M rows and N columns. The element in row i, column 1 of the quantized data matrix is a user questionnaire ID number, and the column-1 elements of different rows are different user questionnaire ID numbers. The element in row 1, column j of the quantized data matrix is a questionnaire question, and the row-1 elements of different columns are different questions. The element in row i, column j of the quantized data matrix is the data quantization result of the answer to the j-th question under the i-th user questionnaire ID number, where 2≤i≤M and 2≤j≤N.
A standardization module 300, for standardizing the quantized data matrix to obtain a standardized data matrix.
A dimensionality reduction and reconstruction module 400, for performing dimensionality reduction on the standardized data matrix using a principal component analysis algorithm, and for reconstructing the reduced data matrix to obtain a reconstructed data matrix.
A hierarchical clustering dendrogram acquisition module 500, for processing each sample point in the reconstructed data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram. The z-th sample point represents the z-th row of data in the reconstructed data matrix, where 2≤z≤M.
A cluster division module 600, for determining the number of clusters according to the hierarchical clustering dendrogram and, according to that number, clustering the elements in the reconstructed data matrix using a clustering algorithm to obtain multiple clusters.
A gastroesophageal reflux disease risk factor determining module 700, for calculating the correlation index between the elements in each cluster and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor. The correlation index is the mean of the squared correlation coefficients.
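The dimensionality reduction and reconstruction performed by module 400 can be sketched as below. This is a hedged illustration only: it assumes the input is already standardized column-wise (mean 0, population variance 1) so that XᵀX/n is the correlation matrix, it assumes eigenpairs are found by power iteration with deflation (chosen here only to stay within the standard library; it converges slowly when eigenvalues are close), and it assumes "reconstruction" means projecting the reduced scores back to the original feature space. The function names and the `iters` parameter are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Plain matrix product of two row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def top_eigenpairs(sym, k, iters=500):
    """Leading k eigenpairs of a symmetric positive semi-definite matrix,
    in descending eigenvalue order, via power iteration with deflation."""
    n = len(sym)
    work = [row[:] for row in sym]
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    pairs = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(a * b for a, b in zip(row, v)) for row in work]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        lam = sum(v[i] * sum(work[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        # Deflate: remove the component just found.
        work = [[work[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs


def pca_reduce_and_reconstruct(x, k):
    """x: standardized data, one sample per row. Returns the top-k
    eigenvalues, the reduced scores (n x k), and the reconstruction (n x d)."""
    n = len(x)
    corr = [[v / n for v in row] for row in matmul(transpose(x), x)]
    pairs = top_eigenpairs(corr, k)
    w = transpose([vec for _, vec in pairs])      # d x k, columns are eigenvectors
    scores = matmul(x, w)                         # reduced data
    reconstructed = matmul(scores, transpose(w))  # back in the original space
    return [lam for lam, _ in pairs], scores, reconstructed
```

A production system would more likely rely on a numerical library's eigendecomposition; the hand-written iteration above is only meant to make the three steps (correlation matrix, leading eigenvectors, projection and back-projection) concrete.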
The standardization module 300 specifically includes:
A standardization unit, for standardizing the quantized data matrix using the Z-Score standardization algorithm; the data of each dimension in the standardized data matrix follow a normal distribution with mean 0 and variance 1.
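As a minimal sketch of Z-Score standardization, the function below rescales each column to mean 0 and variance 1. It assumes the ID column and question row have already been stripped so that only numeric answers remain, and it uses the population standard deviation (dividing by the number of rows); both are assumptions, and the name `z_score_columns` is hypothetical. Note that this rescaling fixes the mean and variance of each column but does not by itself change the shape of its distribution.

```python
import math


def z_score_columns(rows):
    """Standardize each column of a row-major numeric table.

    Each value v in a column with mean m and standard deviation s
    becomes (v - m) / s. Constant columns (s == 0) are mapped to 0.0.
    """
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / n)
        for c, m in zip(cols, means)
    ]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]
```

For example, `z_score_columns([[1, 10], [2, 20], [3, 30]])` returns a table whose two columns each have mean 0 and variance 1, regardless of their original scales.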
The gastroesophageal reflux disease risk factor determining module 700 specifically includes:
A correlation index computing unit, for calculating the correlation index between the elements in each cluster;
A gastroesophageal reflux disease risk factor determination unit, for arranging all correlation indices in descending order and determining the element corresponding to the largest correlation index to be a gastroesophageal reflux disease risk factor.
The present invention combines feature engineering with clustering to screen out irrelevant features and obtain high-quality data, which improves accuracy. The clustering method groups highly similar features into one class, and from each cluster of similar features the feature that best represents the class is chosen as a risk factor, so the selected risk factors are representative.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief; for relevant details, refer to the description of the method.
Specific examples are used herein to explain the principle and implementation of the present invention. The description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, those of ordinary skill in the art may, in accordance with the idea of the present invention, make changes to the specific implementation and the scope of application. In conclusion, the content of this specification should not be construed as limiting the present invention.
Claims (8)
1. A machine-learning-based method for determining gastroesophageal reflux disease risk factors, characterized in that the method comprises:
constructing a user information set; the user information set is a data set of M rows and N columns; the entry in row i, column 1 of the user information set is a user questionnaire ID number, and the column-1 entries of different rows are different user questionnaire ID numbers; the entry in row 1, column j of the user information set is a questionnaire question, and the row-1 entries of different columns are different questions; the entry in row i, column j of the user information set is the answer recorded under the i-th user questionnaire ID number to the j-th question; wherein 2≤i≤M and 2≤j≤N;
performing data quantization on the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a user questionnaire ID number, and the column-1 elements of different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a questionnaire question, and the row-1 elements of different columns are different questions; the element in row i, column j of the quantized data matrix is the data quantization result of the answer to the j-th question under the i-th user questionnaire ID number; wherein 2≤i≤M and 2≤j≤N;
standardizing the quantized data matrix to obtain a standardized data matrix;
performing dimensionality reduction on the standardized data matrix using a principal component analysis algorithm, and reconstructing the reduced data matrix to obtain a reconstructed data matrix;
processing each sample point in the reconstructed data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; wherein 2≤z≤M;
determining the number of clusters according to the hierarchical clustering dendrogram and, according to the number of clusters, clustering the elements in the reconstructed data matrix using a clustering algorithm to obtain multiple clusters;
calculating the correlation index between the elements in each cluster, and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor; the correlation index is the mean of the squared correlation coefficients.
2. The method for determining gastroesophageal reflux disease risk factors according to claim 1, characterized in that standardizing the quantized data matrix to obtain a standardized data matrix specifically comprises:
standardizing the quantized data matrix using the Z-Score standardization algorithm; the data of each dimension in the standardized data matrix follow a normal distribution with mean 0 and variance 1.
3. The method for determining gastroesophageal reflux disease risk factors according to claim 1, characterized in that performing dimensionality reduction on the standardized data matrix using a principal component analysis algorithm specifically comprises:
calculating the correlation matrix of the standardized data matrix;
calculating, from the correlation matrix, the eigenvalues and the eigenvector corresponding to each eigenvalue;
arranging the eigenvalues in descending order, and selecting the eigenvectors corresponding to the first N eigenvalues to form the dimension-reduced data set.
4. The method for determining gastroesophageal reflux disease risk factors according to claim 1, characterized in that processing each sample point in the reconstructed data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram specifically comprises:
Step 1: calculating the distance between every pair of sample points using the average distance algorithm;
Step 2: selecting the two sample points with the smallest distance and merging them into one class;
Step 3: repeating Step 1 and Step 2 until all sample points are merged into one class, obtaining the hierarchical clustering dendrogram.
5. The method for determining gastroesophageal reflux disease risk factors according to claim 1, characterized in that calculating the correlation index between the elements in each cluster and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor specifically comprises:
calculating the correlation index between the elements in each cluster;
arranging all correlation indices in descending order, and determining the element corresponding to the largest correlation index to be a gastroesophageal reflux disease risk factor.
6. A machine-learning-based system for determining gastroesophageal reflux disease risk factors, characterized in that the system comprises:
a user information set construction module, for constructing a user information set; the user information set is a data set of M rows and N columns; the entry in row i, column 1 of the user information set is a user questionnaire ID number, and the column-1 entries of different rows are different user questionnaire ID numbers; the entry in row 1, column j of the user information set is a questionnaire question, and the row-1 entries of different columns are different questions; the entry in row i, column j of the user information set is the answer recorded under the i-th user questionnaire ID number to the j-th question; wherein 2≤i≤M and 2≤j≤N;
a quantization module, for performing data quantization on the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a user questionnaire ID number, and the column-1 elements of different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a questionnaire question, and the row-1 elements of different columns are different questions; the element in row i, column j of the quantized data matrix is the data quantization result of the answer to the j-th question under the i-th user questionnaire ID number; wherein 2≤i≤M and 2≤j≤N;
a standardization module, for standardizing the quantized data matrix to obtain a standardized data matrix;
a dimensionality reduction and reconstruction module, for performing dimensionality reduction on the standardized data matrix using a principal component analysis algorithm, and for reconstructing the reduced data matrix to obtain a reconstructed data matrix;
a hierarchical clustering dendrogram acquisition module, for processing each sample point in the reconstructed data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; wherein 2≤z≤M;
a cluster division module, for determining the number of clusters according to the hierarchical clustering dendrogram and, according to the number of clusters, clustering the elements in the reconstructed data matrix using a clustering algorithm to obtain multiple clusters;
a gastroesophageal reflux disease risk factor determining module, for calculating the correlation index between the elements in each cluster and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor; the correlation index is the mean of the squared correlation coefficients.
7. The system for determining gastroesophageal reflux disease risk factors according to claim 6, characterized in that the standardization module specifically comprises:
a standardization unit, for standardizing the quantized data matrix using the Z-Score standardization algorithm; the data of each dimension in the standardized data matrix follow a normal distribution with mean 0 and variance 1.
8. The system for determining gastroesophageal reflux disease risk factors according to claim 6, characterized in that the gastroesophageal reflux disease risk factor determining module specifically comprises:
a correlation index computing unit, for calculating the correlation index between the elements in each cluster;
a gastroesophageal reflux disease risk factor determination unit, for arranging all correlation indices in descending order and determining the element corresponding to the largest correlation index to be a gastroesophageal reflux disease risk factor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811589405.8A CN109686442B (en) | 2018-12-25 | 2018-12-25 | Machine learning-based gastroesophageal reflux disease risk factor determination method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811589405.8A CN109686442B (en) | 2018-12-25 | 2018-12-25 | Machine learning-based gastroesophageal reflux disease risk factor determination method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109686442A (en) | 2019-04-26 |
CN109686442B (en) | 2020-04-14 |
Family
ID=66189312
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811589405.8A Expired - Fee Related CN109686442B (en) | 2018-12-25 | 2018-12-25 | Machine learning-based gastroesophageal reflux disease risk factor determination method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109686442B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110176308A (en) * | 2019-05-28 | 2019-08-27 | 广东工业大学 | Disease and the relevance of vital sign establish device, method, equipment and medium |
CN110189803A (en) * | 2019-06-05 | 2019-08-30 | 南京理工大学 | The disease risk factor extracting method combined based on cluster with classification |
CN113793667A (en) * | 2021-09-16 | 2021-12-14 | 平安科技(深圳)有限公司 | Disease prediction method and device based on cluster analysis and computer equipment |
CN114550121A (en) * | 2022-02-28 | 2022-05-27 | 重庆长安汽车股份有限公司 | Clustering-based automatic driving lane change scene classification method and recognition method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102999686A (en) * | 2011-09-19 | 2013-03-27 | 上海煜策信息科技有限公司 | Health management system and implementation method thereof |
CN103198211A (en) * | 2013-03-08 | 2013-07-10 | 北京理工大学 | Quantitative analysis method for influences of attack risk factors of type 2 diabetes on blood sugar |
CN107436933A (en) * | 2017-07-20 | 2017-12-05 | 广州慧扬健康科技有限公司 | The hierarchical clustering system arranged for case history archive |
US20180196873A1 (en) * | 2017-01-11 | 2018-07-12 | Siemens Medical Solutions Usa, Inc. | Visualization framework based on document representation learning |
Also Published As
Publication number | Publication date |
---|---|
CN109686442B (en) | 2020-04-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Kumar et al. | Performance analysis of machine learning algorithms on diabetes dataset using big data analytics | |
CN109686442A (en) | Method and system are determined based on the gastroesophageal reflux disease risk factor of machine learning | |
Ahmadlou et al. | Wavelet-synchronization methodology: a new approach for EEG-based diagnosis of ADHD | |
Chae et al. | Data mining approach to policy analysis in a health insurance domain | |
Rasero et al. | Consensus clustering approach to group brain connectivity matrices | |
Songdechakraiwut et al. | Topological learning and its application to multimodal brain network integration | |
Jiang et al. | Sleep stage classification using covariance features of multi-channel physiological signals on Riemannian manifolds | |
Berke Erdaş et al. | CNN-based severity prediction of neurodegenerative diseases using gait data | |
Popkes et al. | Interpretable outcome prediction with sparse Bayesian neural networks in intensive care | |
CN117591953A (en) | Cancer classification method and system based on multiple groups of study data and electronic equipment | |
US11961204B2 (en) | State visualization device, state visualization method, and state visualization program | |
Li et al. | Tensor approximate entropy: An entropy measure for sleep scoring | |
CN107256408B (en) | Method for searching key path of brain function network | |
Smith et al. | An immune network inspired evolutionary algorithm for the diagnosis of Parkinson’s disease | |
Toma et al. | Discovery and integration of univariate patterns from daily individual organ-failure scores for intensive care mortality prediction | |
Everitt et al. | The use of multivariate statistical methods in psychiatry | |
Ono et al. | Introduction to supervised machine learning in clinical epidemiology | |
CN109685139A (en) | Based on the gastroesophageal reflux disease risk factor extracting method precisely clustered and system | |
Andersson et al. | Hierarchical models for epidermal nerve fiber data | |
CN109509513A (en) | Gastroesophageal reflux disease risk factor extracting method and system based on distributional clustering | |
Abenna et al. | Alcohol use disorders automatic detection based BCI systems: a novel EEG classification based on machine learning and optimization algorithms | |
CN113838519A (en) | Gene selection method and system based on adaptive gene interaction regularization elastic network model | |
Chen et al. | Big data approaches to develop a comprehensive and accurate tool aimed at improving autism spectrum disorder diagnosis and subtype stratification | |
Izenman et al. | Recursive partitioning and tree-based methods | |
Matharage et al. | Analysing stillbirth data using dynamic self organizing maps |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | TA01 | Transfer of patent application right | Effective date of registration: 20190729. Address after: 210000 Xiaolingwei 179, Xuanwu District, Nanjing City, Jiangsu Province. Applicant after: NANJING INTEGRATED TRADITIONAL CHINESE AND WESTERN MEDICINE Hospital. Address before: 210000 Xiaolingwei 179, Xuanwu District, Nanjing City, Jiangsu Province. Applicant before: Liu Wanli |
 | GR01 | Patent grant | |
 | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200414 |