CN109686442A - Method and system for determining gastroesophageal reflux disease risk factors based on machine learning - Google Patents

Method and system for determining gastroesophageal reflux disease risk factors based on machine learning

Publication number
CN109686442A
Authority
CN
China
Prior art keywords
data matrix
row
correlation
reflux disease
column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811589405.8A
Other languages
Chinese (zh)
Other versions
CN109686442B (en)
Inventor
刘万里
徐雷
黄玉珍
姚澜
李荣臻
夏吉安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING INTEGRATED TRADITIONAL CHINESE AND WESTERN MEDICINE HOSPITAL
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201811589405.8A
Publication of CN109686442A
Application granted
Publication of CN109686442B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Public Health (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a machine-learning-based method and system for determining gastroesophageal reflux disease risk factors, addressing the low accuracy of the statistical methods used in the prior art to determine such risk factors. First, a user information set containing candidate gastroesophageal reflux disease risk factors is constructed, and the factors in the set are quantized to obtain a quantized data matrix. Second, the quantized data matrix is standardized, and a principal component analysis algorithm is applied to the standardized matrix for dimensionality reduction. Third, a hierarchical clustering algorithm is applied to the data in the processed data set to obtain a hierarchical clustering dendrogram. Fourth, the number of clusters is determined from the dendrogram, and the data in the processed data set are clustered according to that number to obtain multiple clusters. Finally, the correlation index between the elements in each cluster is calculated, and the element with the largest correlation index is determined to be a gastroesophageal reflux disease risk factor.

Description

Method and system for determining gastroesophageal reflux disease risk factors based on machine learning
Technical field
The present invention relates to the fields of machine learning and medical technology, and in particular to a machine-learning-based method and system for determining gastroesophageal reflux disease risk factors.
Background
Gastroesophageal reflux disease is a digestive system disease that is prevalent worldwide, and its incidence is rising year by year. Its treatment therefore deserves serious attention. Because the onset of gastroesophageal reflux disease is closely related to lifestyle, emotional changes, eating habits, and similar factors, and because the condition changes easily, collecting large amounts of data and analyzing their characteristics plays an important role in studying and preventing the disease.
At present, machine learning methods are rarely used to extract risk factors in gastroesophageal reflux disease diagnosis. Most risk-factor extraction in the medical field relies on statistical methods, which are computationally intensive and less accurate than machine learning.
Summary of the invention
The object of the present invention is to provide a machine-learning-based method and system for determining gastroesophageal reflux disease risk factors, so as to solve the prior-art problem of low accuracy when such risk factors are determined using statistics.
To achieve the above object, the present invention provides the following solutions:
A machine-learning-based method for determining gastroesophageal reflux disease risk factors, comprising:
Constructing a user information set; the user information set is a data set of M rows and N columns; the factor in row i, column 1 of the user information set is a user questionnaire ID number, and the column-1 factors in different rows are different user questionnaire ID numbers; the factor in row 1, column j of the user information set is a questionnaire question, and the row-1 factors in different columns are different questions; the factor in row i, column j of the user information set is the answer recorded under the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N;
Quantizing the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a user questionnaire ID number, and the column-1 elements in different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a questionnaire question, and the row-1 elements in different columns are different questions; the element in row i, column j of the quantized data matrix is the quantized answer recorded under the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N;
Standardizing the quantized data matrix to obtain a standardized data matrix;
Performing dimensionality reduction on the standardized data matrix using a principal component analysis algorithm, and reconstructing the dimension-reduced data matrix to obtain a reconstructed data matrix;
Processing each sample point in the reconstructed data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; where 2≤z≤M;
Determining the number of clusters from the hierarchical clustering dendrogram, and clustering the elements in the reconstructed data matrix with a clustering algorithm according to that number to obtain multiple clusters;
Calculating the correlation index between the elements in each cluster, and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor; the correlation index is the mean of the squared correlation coefficients.
Optionally, standardizing the quantized data matrix to obtain the standardized data matrix specifically comprises:
Standardizing the quantized data matrix using the Z-Score standardization algorithm; the data of each dimension in the standardized data matrix obey a normal distribution with mean 0 and variance 1.
Optionally, performing dimensionality reduction on the standardized data matrix using the principal component analysis algorithm specifically comprises:
Calculating the correlation matrix of the standardized data matrix;
Calculating, from the correlation matrix, the eigenvalues and the eigenvectors corresponding to the eigenvalues;
Sorting the eigenvalues in descending order, and selecting the eigenvectors corresponding to the top N eigenvalues to form the dimension-reduced data set.
Optionally, processing each sample point in the reconstructed data matrix using the hierarchical clustering algorithm to obtain the hierarchical clustering dendrogram specifically comprises:
Step 1: calculating the pairwise distances between sample points using the average distance algorithm;
Step 2: merging the two sample points with the smallest distance into one class;
Step 3: repeating steps 1 and 2 until all sample points have been merged into one class, thereby obtaining the hierarchical clustering dendrogram.
Optionally, calculating the correlation index between the elements in each cluster and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor specifically comprises:
Calculating the correlation index between the elements in each cluster;
Sorting all the correlation indices in descending order, and determining the element corresponding to the largest correlation index to be a gastroesophageal reflux disease risk factor.
A machine-learning-based system for determining gastroesophageal reflux disease risk factors, comprising:
A user information set construction module, configured to construct a user information set; the user information set is a data set of M rows and N columns; the factor in row i, column 1 of the user information set is a user questionnaire ID number, and the column-1 factors in different rows are different user questionnaire ID numbers; the factor in row 1, column j of the user information set is a questionnaire question, and the row-1 factors in different columns are different questions; the factor in row i, column j of the user information set is the answer recorded under the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N;
A quantization module, configured to quantize the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a user questionnaire ID number, and the column-1 elements in different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a questionnaire question, and the row-1 elements in different columns are different questions; the element in row i, column j of the quantized data matrix is the quantized answer recorded under the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N;
A standardization module, configured to standardize the quantized data matrix to obtain a standardized data matrix;
A dimensionality reduction and reconstruction module, configured to perform dimensionality reduction on the standardized data matrix using a principal component analysis algorithm and to reconstruct the dimension-reduced data matrix to obtain a reconstructed data matrix;
A hierarchical clustering dendrogram module, configured to process each sample point in the reconstructed data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; where 2≤z≤M;
A cluster division module, configured to determine the number of clusters from the hierarchical clustering dendrogram and to cluster the elements in the reconstructed data matrix with a clustering algorithm according to that number to obtain multiple clusters;
A gastroesophageal reflux disease risk factor determination module, configured to calculate the correlation index between the elements in each cluster and to determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor; the correlation index is the mean of the squared correlation coefficients.
Optionally, the standardization module specifically comprises:
A standardization unit, configured to standardize the quantized data matrix using the Z-Score standardization algorithm; the data of each dimension in the standardized data matrix obey a normal distribution with mean 0 and variance 1.
Optionally, the gastroesophageal reflux disease risk factor determination module specifically comprises:
A correlation index calculation unit, configured to calculate the correlation index between the elements in each cluster;
A gastroesophageal reflux disease risk factor determination unit, configured to sort all the correlation indices in descending order and to determine the element corresponding to the largest correlation index to be a gastroesophageal reflux disease risk factor.
According to the specific embodiments provided by the present invention, the invention achieves the following technical effects:
The present invention proposes a method and system for determining gastroesophageal reflux disease risk factors based mainly on machine learning. The invention first performs feature extraction using principal component analysis, a feature engineering technique, to reduce the data dimensionality, then performs cluster analysis on the resulting high-quality data and selects the most critical risk factor in each cluster. By combining clustering with feature engineering, the invention screens out the risk factors that cause gastroesophageal reflux disease, provides a scientific basis for future medical research and disease diagnosis, offers guidance on gastroesophageal reflux disease, and helps reduce its incidence.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the machine-learning-based method for determining gastroesophageal reflux disease risk factors according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the machine-learning-based system for determining gastroesophageal reflux disease risk factors according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The object of the present invention is to provide a machine-learning-based method and system for determining gastroesophageal reflux disease risk factors that can determine such risk factors efficiently and accurately.
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Embodiment 1
Fig. 1 is a schematic flowchart of the machine-learning-based method for determining gastroesophageal reflux disease risk factors according to an embodiment of the present invention. As shown in Fig. 1, the method provided by this embodiment comprises the following steps.
Step 101: construct a user information set. The user information set is a data set of M rows and N columns. The factor in row i, column 1 of the user information set is a user questionnaire ID number, and the column-1 factors in different rows are different user questionnaire ID numbers. The factor in row 1, column j of the user information set is a questionnaire question, and the row-1 factors in different columns are different questions. The factor in row i, column j of the user information set is the answer recorded under the i-th user questionnaire ID number to the j-th question, where 2≤i≤M, 2≤j≤N, and i and j are positive integers.
Step 102: quantize the answers in the user information set to obtain a quantized data matrix. The quantized data matrix is a matrix of M rows and N columns. The element in row i, column 1 of the quantized data matrix is a user questionnaire ID number, and the column-1 elements in different rows are different user questionnaire ID numbers. The element in row 1, column j of the quantized data matrix is a questionnaire question, and the row-1 elements in different columns are different questions. The element in row i, column j of the quantized data matrix is the quantized answer recorded under the i-th user questionnaire ID number to the j-th question, where 2≤i≤M and 2≤j≤N.
Step 103: standardize the quantized data matrix to obtain a standardized data matrix.
Step 104: perform dimensionality reduction on the standardized data matrix using a principal component analysis algorithm, and reconstruct the dimension-reduced data matrix to obtain a reconstructed data matrix.
Step 105: process each sample point in the reconstructed data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram. The z-th sample point represents the z-th row of data in the reconstructed data matrix, where 2≤z≤M and z is a positive integer.
Step 106: determine the number of clusters from the hierarchical clustering dendrogram, and cluster the elements in the reconstructed data matrix with a clustering algorithm according to that number to obtain multiple clusters.
Step 107: calculate the correlation index between the elements in each cluster, and determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor. The correlation index is the mean of the squared correlation coefficients.
Step 101 specifically includes:
In this embodiment, a questionnaire is distributed in the hospital to each person who seeks consultation about gastroesophageal reflux disease, and the user information set is built from the returned questionnaires. A user in the set may be ill or healthy; after the hospital's diagnosis each user is given a label, and the label indicates whether that user is ill. The user information set is therefore a mixed data set of healthy and ill users in which the ill records are known. The data set in this embodiment has 241 dimensions in total, comprising the questionnaire ID number that uniquely identifies the patient and the answers to the individual questionnaire questions.
The questionnaire covers several groups of questions, including general demographic data, lifestyle, eating habits, psychological factors, and sleep factors, together with the respondents' answers. There are three question types: single-choice questions, true/false questions, and open-ended questions.
Step 102 quantizes the respondents' answers.
Specifically, the patient's questionnaire ID number serves as the unique identifier. For single-choice questions whose options express degrees of severity, for example "often", "occasionally", "seldom", and "never", the weights 4, 3, 2, and 1 can be assigned in order of severity, and the weight corresponding to the chosen answer is used. For true/false questions with yes/no options, "yes" is assigned 1 and "no" is assigned 0, and the value corresponding to the chosen answer is used. Questions whose options have no severity ordering, such as occupation, can be deleted because they do not contribute to the result. For open-ended questions, the continuous numerical value entered by the user is used directly as data. For example, for a user who is 45 years old, has a height of 172, is not infected with HP, and often has sleep disorders, the sample columns are [age, height, HP infection, sleep disorder] and the data values are [45, 172, 0, 4]. The purpose of this step is to quantize all respondents' answers and obtain the quantized data matrix D.
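The quantization rules of step 102 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the question-type labels ("single_choice", "true_false", "open"), and the English option strings are hypothetical; only the weights 4/3/2/1 and 1/0 come from the description.

```python
# Hypothetical lookup tables; the weights follow the example in the description.
SEVERITY_WEIGHTS = {"often": 4, "occasionally": 3, "seldom": 2, "never": 1}
YES_NO = {"yes": 1, "no": 0}


def quantize_answer(question_type, answer):
    """Map one questionnaire answer to a number (question_type labels are assumed)."""
    if question_type == "single_choice":
        return SEVERITY_WEIGHTS[answer]   # severity-ordered options
    if question_type == "true_false":
        return YES_NO[answer]             # yes -> 1, no -> 0
    if question_type == "open":
        return float(answer)              # continuous value entered by the user
    raise ValueError(f"unknown question type: {question_type}")


def quantize_row(question_types, answers):
    """Quantize all answers of one questionnaire."""
    return [quantize_answer(t, a) for t, a in zip(question_types, answers)]


# Columns: age, height, HP infection, sleep disorder (the example from the text).
types = ["open", "open", "true_false", "single_choice"]
row = quantize_row(types, ["45", "172", "no", "often"])
print(row)  # [45.0, 172.0, 0, 4]
```

Questions without a severity ordering (such as occupation) would simply be dropped from `types` before this step.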
Step 103 is data preprocessing, specifically data standardization.
Z-Score standardization is used to scale the data matrix D into a matrix with mean 0 and standard deviation 1. Standardization is needed because the features are measured in different units, and distances computed under different units give different results. After standardization each dimension obeys a normal distribution with mean 0 and variance 1, so every dimension is unit-free when distances are computed and the choice of units no longer affects the distance calculation.
The Z-Score standardization formula is:
$$\mathrm{Feature\_value}' = \frac{\mathrm{Feature\_value} - \mu}{S}$$
where Feature_value is the original value of a given feature of the data, μ is the mean of that feature, S is the standard deviation of that feature, and Feature_value' is the new value of the feature.
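As a minimal sketch of this standardization step, the following Python function subtracts each column's mean and divides by its standard deviation. The function name and the sample values are hypothetical, and the use of the population standard deviation (dividing by the number of rows) is an assumption, since the description does not say which variant is used. Note that z-scoring fixes the mean and variance of a column but does not by itself change the shape of its distribution.

```python
import math


def zscore_columns(matrix):
    """Standardize each column to mean 0 and standard deviation 1.

    Assumption: population standard deviation (divide by the number of rows).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        col = [row[j] for row in matrix]
        mu = sum(col) / n_rows
        s = math.sqrt(sum((v - mu) ** 2 for v in col) / n_rows)
        for i in range(n_rows):
            # A constant column has s == 0; map it to 0.0 instead of dividing by zero.
            result[i][j] = (col[i] - mu) / s if s > 0 else 0.0
    return result


# Three made-up quantized questionnaires: age, height, HP infection, sleep disorder.
D = [[45, 172, 0, 4], [30, 160, 1, 2], [60, 181, 0, 1]]
D_std = zscore_columns(D)
```

After this step, the age column (tens of years) and the yes/no column (0 or 1) contribute on the same scale to any distance computed later.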
Step 104 mainly performs dimensionality reduction and reconstruction of the data.
Principal component analysis mainly studies the correlations between columns. It yields principal components ranked by their share of the variance, and the coefficients of each attribute on each standardized indicator within each principal component identify the factors that contribute most to that component. This step is used for a preliminary removal of some irrelevant factors. The original data space is then restored by the inverse operation, which both achieves dimensionality reduction and keeps the data interpretable.
With the standardized data matrix D' as input, the steps of principal component analysis are as follows:
Calculate the correlation matrix C of the standardized data matrix D';
Calculate the eigenvalues of the correlation matrix C and the eigenvectors corresponding to the eigenvalues;
Form a new data set, that is, the dimension-reduced data set, from the eigenvectors corresponding to the top N eigenvalues. Specifically, the eigenvalues are sorted in descending order and the eigenvectors corresponding to the top N eigenvalues are taken as the new basis. Because the eigenvalues are sorted in descending order, the corresponding eigenvectors are also ordered from the most to the least information retained. 90% is retained, that is, the trailing features that are less important and carry less information are dropped, which achieves dimensionality reduction.
The basis vector that retains the most information is the eigenvector of the covariance matrix with the largest eigenvalue, and the amount of information this eigenvector retains is its corresponding eigenvalue.
N is chosen so that the main content of the data is not lost. Here the new features are chosen to represent 90% of the total variance of the data, that is, 90% of the original information is retained, so dimensionality reduction is achieved with little loss of feature information.
The purpose of reconstruction is as follows. Because principal component analysis extracts features from the original data set to form new features, the dimension-reduced data set no longer corresponds to the factors of the original data set, which reduces interpretability when analyzing risk factors. The dimension-reduced data set is therefore reconstructed: the principal components after dimensionality reduction are mapped back to the features contained in the original data set, which yields features from which useless factors have been screened out, and these form the reconstructed data matrix D''.
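The PCA steps above (correlation matrix, eigen-decomposition, keeping components up to 90% of the variance, then mapping back to the original feature space) can be sketched with NumPy as follows. The function name, the `variance_kept` parameter, and the synthetic data are hypothetical. Interpreting "reconstruction" as projecting the retained components back onto the original axes is one plausible reading of the description, not a confirmed detail of the patented method.

```python
import numpy as np


def pca_reduce_and_reconstruct(X_std, variance_kept=0.90):
    """X_std: (samples, features) array whose columns are already z-scored."""
    C = np.corrcoef(X_std, rowvar=False)            # correlation matrix of the columns
    eigvals, eigvecs = np.linalg.eigh(C)            # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]               # re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()      # cumulative share of variance
    # Smallest number of components whose cumulative share reaches variance_kept.
    k = min(int(np.searchsorted(ratio, variance_kept)) + 1, len(eigvals))
    W = eigvecs[:, :k]                              # new basis (top-k eigenvectors)
    scores = X_std @ W                              # dimension-reduced data
    X_rec = scores @ W.T                            # mapped back to the original features
    return scores, X_rec, k


# Synthetic data (made up): two independent signals, each measured twice with small noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
X = np.column_stack([
    base[:, 0],
    2.0 * base[:, 0] + 0.01 * rng.normal(size=50),
    base[:, 1],
    -base[:, 1] + 0.01 * rng.normal(size=50),
])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
scores, X_rec, k = pca_reduce_and_reconstruct(X_std)
print(k, scores.shape, X_rec.shape)
```

Because the reconstructed matrix has the same columns as the input, later steps can still name the original questionnaire items, which is the interpretability argument made in the text.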
Step 105 is clustering.
Cluster analysis can cluster either indicators or samples; here the indicators are clustered.
This embodiment uses hierarchical clustering, and more specifically agglomerative hierarchical clustering, to cluster each sample point in the reconstructed data matrix D''. The basic principle is to first treat each object to be clustered as its own class. The similarity between classes is then measured by a class statistic, the two closest classes are merged into one, and merging continues step by step until all objects have been merged into a single class.
The steps of agglomerative hierarchical clustering are as follows:
Treat each sample point in the reconstructed data matrix D'' as an independent class.
Using the distance formula, calculate the pairwise distances between classes and find the two classes c1 and c2 with the smallest distance.
Merge class c1 and class c2 into one class;
Repeat the above steps until all sample points have been merged into one class, which yields a hierarchical clustering dendrogram.
Specifically, for the distance metric, this embodiment uses the average distance method to calculate the distance between two classes.
The average distance method calculates the distance between each data point in one class and every data point in the other class, and takes the mean of all these distances as the distance between the two classes. The formula is:
$$\mathrm{dist}_{avg}(c_1, c_2) = \frac{1}{|c_1|\,|c_2|} \sum_{x \in c_1} \sum_{z \in c_2} \mathrm{dist}(x, z)$$
where dist(x, z) is the Euclidean distance.
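A minimal pure-Python sketch of agglomerative clustering with the average distance between classes is shown below. The function names and the four sample points are hypothetical, and the merge history returned here is only one simple way to represent the information a dendrogram contains.

```python
import math


def average_distance(c1, c2, points):
    """Mean Euclidean distance over all cross pairs of two clusters (lists of indices)."""
    total = sum(math.dist(points[a], points[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points):
    """Merge the two closest clusters until one remains.

    Returns the merge history as a list of (cluster_a, cluster_b, distance).
    """
    clusters = [[i] for i in range(len(points))]   # every point starts as its own class
    merges = []
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = average_distance(clusters[p], clusters[q], points)
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append((clusters[p], clusters[q], d))
        merged = clusters[p] + clusters[q]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
        clusters.append(merged)
    return merges


# Four made-up points forming two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
history = agglomerate(points)
for a, b, d in history:
    print(a, b, round(d, 3))
```

The merge distances grow from the first merge to the last; a large jump between consecutive merges is what shows up as a long vertical gap in a dendrogram.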
In step 106, the clustering result among the features can be read from the hierarchical clustering dendrogram. The number of clusters can therefore be determined by analyzing the dendrogram, and according to that number a clustering algorithm is applied to the elements in the reconstructed data matrix to obtain multiple clusters.
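The description does not name the clustering algorithm used once the number of clusters is known. One simple option, shown here purely as an assumption, is to cut the dendrogram itself by replaying the merges until the chosen number of clusters remains. The function name and the merge-history format (a list of `(cluster_a, cluster_b, distance)` tuples ordered from first to last merge) are hypothetical.

```python
def cut_into_clusters(n_points, merges, k):
    """Replay the first n_points - k merges and return the k remaining clusters."""
    clusters = {frozenset([i]) for i in range(n_points)}
    for a, b, _dist in merges[: n_points - k]:
        clusters.discard(frozenset(a))
        clusters.discard(frozenset(b))
        clusters.add(frozenset(a) | frozenset(b))
    return sorted(sorted(c) for c in clusters)


# A made-up merge history for four points: two tight pairs, merged last at a large distance.
merges = [([0], [1], 1.0), ([2], [3], 1.0), ([0, 1], [2, 3], 7.09)]
print(cut_into_clusters(4, merges, 2))  # [[0, 1], [2, 3]]
```

Here the jump from a merge distance of 1.0 to 7.09 suggests two clusters, which is the kind of reading of the dendrogram the step describes.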
Step 107 specifically includes:
To find the risk factors that characterize this kind of patient group, the correlation index (the mean of the squared correlation coefficients) between the elements in each cluster is calculated, and the element with the largest correlation index is determined to be a gastroesophageal reflux disease risk factor. In other words, one gastroesophageal reflux disease risk factor is selected from each cluster.
The quantized data matrix D is grouped according to the class labels into D_1, D_2, D_3, ..., D_k. The initial risk factor set is the empty set. The correlation of each element within each cluster is analyzed, the correlation index between the elements is calculated, and in each cluster the element with the largest correlation index is added to the risk factor set.
The correlation coefficient is calculated as follows:
$$r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
where Var(X) is the variance of X, Var(Y) is the variance of Y, Cov(X, Y) is the covariance between X and Y, and X and Y are elements in a cluster.
The sample correlation index R² of a given feature is calculated as follows:
$$R_i^2 = \frac{1}{n-1} \sum_{j \neq i} r(X_i, X_j)^2$$
where X is a feature, i is the feature index, j runs over the other features, and n is the total number of features.
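The selection rule of step 107 can be sketched as follows: within one cluster, compute each feature's mean squared Pearson correlation with the other features and keep the feature with the largest value. The function names and the sample values are hypothetical. Averaging over the other n - 1 features in the cluster and returning 0.0 for a single-feature cluster are assumptions, since the description only states that the index is the mean of the squared correlation coefficients.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient Cov(X, Y) / sqrt(Var(X) * Var(Y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    return cov / math.sqrt(var_x * var_y)


def correlation_index(features, i):
    """Mean squared correlation of feature i with every other feature in the cluster."""
    others = [j for j in range(len(features)) if j != i]
    if not others:
        return 0.0  # assumption: a single-feature cluster has no index to compare
    return sum(pearson(features[i], features[j]) ** 2 for j in others) / len(others)


def pick_representative(features):
    """Index of the feature with the largest correlation index."""
    scores = [correlation_index(features, i) for i in range(len(features))]
    return max(range(len(features)), key=scores.__getitem__)


# One made-up cluster: each inner list holds one feature's values across four samples.
cluster = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 1.0, 3.0, 2.0],
]
best = pick_representative(cluster)
print(best, [round(correlation_index(cluster, i), 4) for i in range(3)])
```

Running this once per cluster and collecting the selected features gives the risk factor set described in the text.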
Embodiment 2
To achieve the above object, the present invention also provides a kind of gastroesophageal refluxs based on machine learning as shown in Figure 2 Disease risk factor determines system.The system includes:
User information collection constructs module 100, for constructing user information collection;The user information integrates as the data of M row N column Collection;The factor for the i-th row the 1st column that the user information is concentrated is user's questionnaire ID number, and the factor of the 1st column indicates in not going together For different user's questionnaire ID numbers;The problem of factor for the 1st row jth column that the user information is concentrated is questionnaire, and not The factor of the 1st row is expressed as different problems in same column;The factor for the i-th row jth column that the user information is concentrated is the i-th user Answer of the questionnaire ID number to jth problem;Wherein, 2≤i≤M, 2≤j≤N.
Quantification treatment module 200, the answer for concentrating to the user information carry out data quantization processing, are quantified Data matrix;The quantized data matrix is the matrix of M row N column;The element of the i-th row the 1st column in the quantized data matrix For user's questionnaire ID number, and the element representation of the 1st column is different user's questionnaire ID number in not going together;The quantized data matrix In the 1st row jth column element be questionnaire the problem of, and in different lines the 1st row element representation be different problems;Institute The element for stating the i-th row jth column in quantized data matrix is the data quantization result of i-th user's questionnaire ID number jth problem answers; Wherein, 2≤i≤M, 2≤j≤N.
A standardization module 300, for standardizing the quantized data matrix to obtain a standardized data matrix.
A dimensionality reduction and reconstruction module 400, for performing dimensionality reduction on the standardized data matrix using a principal component analysis algorithm, and reconstructing the reduced data matrix to obtain a reconstructed data matrix.
A hierarchical clustering dendrogram acquisition module 500, for processing each sample point in the reconstructed data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; where 2≤z≤M.
A cluster division module 600, for determining the number of clusters from the hierarchical clustering dendrogram and, according to that number, clustering the elements of the reconstructed data matrix using a clustering algorithm to obtain multiple clusters.
A gastroesophageal reflux disease risk factor determination module 700, for calculating the correlation index between the elements in each cluster, and determining the element with the largest correlation index as a gastroesophageal reflux disease risk factor; the correlation index is the mean of the squared correlation coefficients.
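The data layout handled by modules 100 and 200 can be illustrated with a minimal sketch. The specification does not state how answers are quantized, so the mapping `ANSWER_CODES`, the function name `quantize`, and the sample questions below are hypothetical and serve only to show the M-row, N-column structure.

```python
# Hypothetical coding scheme: the specification does not say how answers are quantized.
ANSWER_CODES = {"never": 0, "sometimes": 1, "often": 2, "always": 3}

def quantize(info_set):
    """Return an M x N quantized matrix from an M x N user information set.

    Row 1 (questions) and column 1 (questionnaire ID numbers) are kept as labels;
    every answer in rows 2..M, columns 2..N is replaced by its numeric code.
    """
    quantized = [list(info_set[0])]
    for row in info_set[1:]:
        questionnaire_id, answers = row[0], row[1:]
        quantized.append([questionnaire_id] + [ANSWER_CODES[a] for a in answers])
    return quantized

# Illustrative user information set with M = 3 rows and N = 3 columns.
info_set = [
    ["id", "heartburn", "late meals"],
    ["Q001", "often", "always"],
    ["Q002", "never", "sometimes"],
]
quantized = quantize(info_set)
```

In this sketch the label row and label column are carried through unchanged, so the quantized matrix keeps the same shape as the user information set, as the module description requires.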
The standardization module 300 specifically includes:
A standardization unit, for standardizing the quantized data matrix using the Z-Score standardization algorithm; the data of each dimension in the standardized data matrix follow a normal distribution with mean 0 and variance 1.
The gastroesophageal reflux disease risk factor determination module 700 specifically includes:
A correlation index calculation unit, for calculating the correlation index between the elements in each cluster;
A gastroesophageal reflux disease risk factor determination unit, for arranging all correlation indexes in descending order and determining the element corresponding to the largest correlation index as a gastroesophageal reflux disease risk factor.
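The two units above can be sketched as follows, under one stated assumption: the correlation index of an element is taken to be the mean of its squared Pearson correlation coefficients with the other elements of the same cluster. The specification defines the index only as the mean of squared correlation coefficients, so this pairing, the function names, and the sample cluster are hypothetical.

```python
from math import sqrt
from statistics import fmean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_index(name, cluster):
    """Mean squared correlation between one element and the others in its cluster."""
    others = [values for key, values in cluster.items() if key != name]
    if not others:
        return 0.0  # a singleton cluster has no pairs to average (assumed convention)
    return fmean([pearson(cluster[name], values) ** 2 for values in others])

def representative(cluster):
    """Sort elements by correlation index, largest first, and return the top one."""
    ranked = sorted(cluster, key=lambda name: correlation_index(name, cluster), reverse=True)
    return ranked[0]

# Hypothetical cluster of three questionnaire items, each with four responses.
cluster = {"q1": [1, 2, 3, 4], "q2": [1, 2, 3, 5], "q3": [1, 3, 2, 5]}
risk_factor = representative(cluster)
```

With this sample data, `q2` has the largest index because it correlates strongly with both of the other items, which is the sense in which the selected element represents its cluster.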
The present invention combines feature engineering with clustering to screen out irrelevant features and obtain high-quality data, thereby improving accuracy. The clustering method groups highly similar features into one class, and within each cluster of similar features the feature that best represents the cluster is selected as a risk factor, so that the selected risk factors are representative.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Specific examples are used herein to illustrate the principle and implementation of the invention; the above description of the embodiments is only intended to help in understanding the method of the invention and its core idea. Meanwhile, those skilled in the art may, following the idea of the invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (8)

1. A machine learning-based gastroesophageal reflux disease risk factor determination method, characterized in that the method comprises:
constructing a user information set; the user information set is a data set of M rows and N columns; the entry in row i, column 1 of the user information set is a user questionnaire ID number, and the column-1 entries in different rows are different user questionnaire ID numbers; the entry in row 1, column j of the user information set is a question of the questionnaire, and the row-1 entries in different columns are different questions; the entry in row i, column j of the user information set is the answer of the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N;
performing data quantization on the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a user questionnaire ID number, and the column-1 elements in different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a question of the questionnaire, and the row-1 elements in different columns are different questions; the element in row i, column j of the quantized data matrix is the quantized answer of the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N;
standardizing the quantized data matrix to obtain a standardized data matrix;
performing dimensionality reduction on the standardized data matrix using a principal component analysis algorithm, and reconstructing the reduced data matrix to obtain a reconstructed data matrix;
processing each sample point in the reconstructed data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; where 2≤z≤M;
determining the number of clusters from the hierarchical clustering dendrogram and, according to that number, clustering the elements of the reconstructed data matrix using a clustering algorithm to obtain multiple clusters;
calculating the correlation index between the elements in each cluster, and determining the element with the largest correlation index as a gastroesophageal reflux disease risk factor; the correlation index is the mean of the squared correlation coefficients.
2. The gastroesophageal reflux disease risk factor determination method according to claim 1, characterized in that standardizing the quantized data matrix to obtain a standardized data matrix specifically comprises:
standardizing the quantized data matrix using the Z-Score standardization algorithm; the data of each dimension in the standardized data matrix follow a normal distribution with mean 0 and variance 1.
3. The gastroesophageal reflux disease risk factor determination method according to claim 1, characterized in that performing dimensionality reduction on the standardized data matrix using a principal component analysis algorithm specifically comprises:
calculating the correlation matrix of the standardized data matrix;
calculating, from the correlation matrix, the eigenvalues and the eigenvectors corresponding to the eigenvalues;
arranging the eigenvalues in descending order, and selecting the eigenvectors corresponding to the top n eigenvalues to form the dimension-reduced data set.
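The three steps of claim 3 can be illustrated with a dependency-free sketch. The claim does not name an eigen-solver, so power iteration with deflation is used here purely as a stand-in; the function names, the iteration count, the start vector, and the projection of the samples onto the selected eigenvectors are all assumptions.

```python
from math import sqrt

def correlation_matrix(data):
    """Correlation matrix of the columns of data (rows are samples, no constant columns)."""
    m, n = len(data), len(data[0])
    cols = [[row[j] for row in data] for j in range(n)]
    centered = [[x - sum(c) / m for x in c] for c in cols]
    norms = [sqrt(sum(x * x for x in c)) for c in centered]
    return [[sum(p * q for p, q in zip(centered[a], centered[b])) / (norms[a] * norms[b])
             for b in range(n)] for a in range(n)]

def top_eigenpairs(matrix, k, iterations=500):
    """Leading k (eigenvalue, eigenvector) pairs of a symmetric positive semi-definite matrix."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        # Asymmetric start vector so it is unlikely to be orthogonal to the target eigenvector.
        vec = [float(i + 1) for i in range(n)]
        for _ in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = sqrt(sum(x * x for x in nxt))
            if norm == 0:
                break
            vec = [x / norm for x in nxt]
        length = sqrt(sum(x * x for x in vec))
        vec = [x / length for x in vec]
        # Rayleigh quotient gives the eigenvalue estimate for the converged vector.
        value = sum(vec[i] * work[i][j] * vec[j] for i in range(n) for j in range(n))
        pairs.append((value, vec))
        # Deflation: remove the component just found before searching for the next one.
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return pairs

def pca_reduce(standardized, k):
    """Project each sample onto the eigenvectors of the k largest eigenvalues."""
    pairs = sorted(top_eigenpairs(correlation_matrix(standardized), k),
                   key=lambda pair: pair[0], reverse=True)
    return [[sum(x * w for x, w in zip(row, vec)) for _, vec in pairs]
            for row in standardized]

# Illustrative numeric samples: four questionnaires, three questions.
samples = [[1.0, 2.0, 0.5], [2.0, 4.1, 0.1], [3.0, 5.9, 0.9], [4.0, 8.2, 0.3]]
reduced = pca_reduce(samples, k=2)
```

In practice a library eigen-decomposition would normally replace `top_eigenpairs`; the hand-written version is used only to keep the sketch self-contained.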
4. The gastroesophageal reflux disease risk factor determination method according to claim 1, characterized in that processing each sample point in the reconstructed data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram specifically comprises:
Step 1: calculating the distance between every two sample points using the average distance algorithm;
Step 2: merging the two sample points with the smallest distance into one class;
Step 3: repeating Step 1 and Step 2 until all sample points are merged into one class, obtaining the hierarchical clustering dendrogram.
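Steps 1 to 3 correspond to agglomerative clustering with average linkage, which can be sketched as below. The Euclidean metric, the function name `average_linkage`, and the representation of the dendrogram as a list of merges are assumptions; the claim specifies only the average distance and the repeated merging.

```python
from itertools import combinations
from math import dist

def average_linkage(points):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(points))]  # each sample starts as its own class
    merges = []

    def avg_distance(a, b):
        # Step 1: average of all pairwise Euclidean distances between two classes.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Step 2: merge the two classes that are closest to each other.
        a, b = min(combinations(clusters, 2), key=lambda pair: avg_distance(*pair))
        merges.append((a, b, avg_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        # Step 3: the loop repeats until a single class remains.
    return merges

# Five illustrative sample points in two dimensions.
points = [(0, 0), (0, 1), (5, 5), (5, 7), (10, 0)]
merges = average_linkage(points)
```

The recorded merge distances are the heights at which branches join in the dendrogram, so the list of merges carries the information needed to draw it.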
5. The gastroesophageal reflux disease risk factor determination method according to claim 1, characterized in that calculating the correlation index between the elements in each cluster, and determining the element with the largest correlation index as a gastroesophageal reflux disease risk factor, specifically comprises:
calculating the correlation index between the elements in each cluster;
arranging all correlation indexes in descending order, and determining the element corresponding to the largest correlation index as a gastroesophageal reflux disease risk factor.
6. A machine learning-based gastroesophageal reflux disease risk factor determination system, characterized in that the system comprises:
a user information set construction module, for constructing a user information set; the user information set is a data set of M rows and N columns; the entry in row i, column 1 of the user information set is a user questionnaire ID number, and the column-1 entries in different rows are different user questionnaire ID numbers; the entry in row 1, column j of the user information set is a question of the questionnaire, and the row-1 entries in different columns are different questions; the entry in row i, column j of the user information set is the answer of the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N;
a quantization processing module, for performing data quantization on the answers in the user information set to obtain a quantized data matrix; the quantized data matrix is a matrix of M rows and N columns; the element in row i, column 1 of the quantized data matrix is a user questionnaire ID number, and the column-1 elements in different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data matrix is a question of the questionnaire, and the row-1 elements in different columns are different questions; the element in row i, column j of the quantized data matrix is the quantized answer of the i-th user questionnaire ID number to the j-th question; where 2≤i≤M and 2≤j≤N;
a standardization module, for standardizing the quantized data matrix to obtain a standardized data matrix;
a dimensionality reduction and reconstruction module, for performing dimensionality reduction on the standardized data matrix using a principal component analysis algorithm, and reconstructing the reduced data matrix to obtain a reconstructed data matrix;
a hierarchical clustering dendrogram acquisition module, for processing each sample point in the reconstructed data matrix using a hierarchical clustering algorithm to obtain a hierarchical clustering dendrogram; the z-th sample point represents the z-th row of data in the reconstructed data matrix; where 2≤z≤M;
a cluster division module, for determining the number of clusters from the hierarchical clustering dendrogram and, according to that number, clustering the elements of the reconstructed data matrix using a clustering algorithm to obtain multiple clusters;
a gastroesophageal reflux disease risk factor determination module, for calculating the correlation index between the elements in each cluster, and determining the element with the largest correlation index as a gastroesophageal reflux disease risk factor; the correlation index is the mean of the squared correlation coefficients.
7. The gastroesophageal reflux disease risk factor determination system according to claim 6, characterized in that the standardization module specifically comprises:
a standardization unit, for standardizing the quantized data matrix using the Z-Score standardization algorithm; the data of each dimension in the standardized data matrix follow a normal distribution with mean 0 and variance 1.
8. The gastroesophageal reflux disease risk factor determination system according to claim 6, characterized in that the gastroesophageal reflux disease risk factor determination module specifically comprises:
a correlation index calculation unit, for calculating the correlation index between the elements in each cluster;
a gastroesophageal reflux disease risk factor determination unit, for arranging all correlation indexes in descending order and determining the element corresponding to the largest correlation index as a gastroesophageal reflux disease risk factor.
CN201811589405.8A 2018-12-25 2018-12-25 Machine learning-based gastroesophageal reflux disease risk factor determination method and system Expired - Fee Related CN109686442B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811589405.8A CN109686442B (en) 2018-12-25 2018-12-25 Machine learning-based gastroesophageal reflux disease risk factor determination method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811589405.8A CN109686442B (en) 2018-12-25 2018-12-25 Machine learning-based gastroesophageal reflux disease risk factor determination method and system

Publications (2)

Publication Number Publication Date
CN109686442A (en) 2019-04-26
CN109686442B (en) 2020-04-14

Family

ID=66189312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811589405.8A Expired - Fee Related CN109686442B (en) 2018-12-25 2018-12-25 Machine learning-based gastroesophageal reflux disease risk factor determination method and system

Country Status (1)

Country Link
CN (1) CN109686442B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110176308A (en) * 2019-05-28 2019-08-27 广东工业大学 Device, method, equipment and medium for establishing the relevance between diseases and vital signs
CN110189803A (en) * 2019-06-05 2019-08-30 南京理工大学 Disease risk factor extraction method based on combining clustering with classification
CN113793667A (en) * 2021-09-16 2021-12-14 平安科技(深圳)有限公司 Disease prediction method and device based on cluster analysis and computer equipment
CN114550121A (en) * 2022-02-28 2022-05-27 重庆长安汽车股份有限公司 Clustering-based automatic driving lane change scene classification method and recognition method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999686A (en) * 2011-09-19 2013-03-27 上海煜策信息科技有限公司 Health management system and implementation method thereof
CN103198211A (en) * 2013-03-08 2013-07-10 北京理工大学 Quantitative analysis method for influences of attack risk factors of type 2 diabetes on blood sugar
CN107436933A (en) * 2017-07-20 2017-12-05 广州慧扬健康科技有限公司 Hierarchical clustering system for organizing medical record archives
US20180196873A1 (en) * 2017-01-11 2018-07-12 Siemens Medical Solutions Usa, Inc. Visualization framework based on document representation learning

Also Published As

Publication number Publication date
CN109686442B (en) 2020-04-14

Similar Documents

Publication Publication Date Title
Kumar et al. Performance analysis of machine learning algorithms on diabetes dataset using big data analytics
CN109686442A (en) Machine learning-based gastroesophageal reflux disease risk factor determination method and system
Ahmadlou et al. Wavelet-synchronization methodology: a new approach for EEG-based diagnosis of ADHD
Chae et al. Data mining approach to policy analysis in a health insurance domain
Rasero et al. Consensus clustering approach to group brain connectivity matrices
Songdechakraiwut et al. Topological learning and its application to multimodal brain network integration
Jiang et al. Sleep stage classification using covariance features of multi-channel physiological signals on Riemannian manifolds
Berke Erdaş et al. CNN-based severity prediction of neurodegenerative diseases using gait data
Popkes et al. Interpretable outcome prediction with sparse Bayesian neural networks in intensive care
CN117591953A (en) Cancer classification method and system based on multiple groups of study data and electronic equipment
US11961204B2 (en) State visualization device, state visualization method, and state visualization program
Li et al. Tensor approximate entropy: An entropy measure for sleep scoring
CN107256408B (en) Method for searching key path of brain function network
Smith et al. An immune network inspired evolutionary algorithm for the diagnosis of Parkinson’s disease
Toma et al. Discovery and integration of univariate patterns from daily individual organ-failure scores for intensive care mortality prediction
Everitt et al. The use of multivariate statistical methods in psychiatry
Ono et al. Introduction to supervised machine learning in clinical epidemiology
CN109685139A (en) Gastroesophageal reflux disease risk factor extraction method and system based on precise clustering
Andersson et al. Hierarchical models for epidermal nerve fiber data
CN109509513A (en) Gastroesophageal reflux disease risk factor extracting method and system based on distributional clustering
Abenna et al. Alcohol use disorders automatic detection based BCI systems: a novel EEG classification based on machine learning and optimization algorithms
CN113838519A (en) Gene selection method and system based on adaptive gene interaction regularization elastic network model
Chen et al. Big data approaches to develop a comprehensive and accurate tool aimed at improving autism spectrum disorder diagnosis and subtype stratification
Izenman et al. Recursive partitioning and tree-based methods
Matharage et al. Analysing stillbirth data using dynamic self organizing maps

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190729

Address after: 210000 Xiaolingwei 179, Xuanwu District, Nanjing City, Jiangsu Province

Applicant after: NANJING INTEGRATED TRADITIONAL CHINESE AND WESTERN MEDICINE Hospital

Address before: 210000 Xiaolingwei 179, Xuanwu District, Nanjing City, Jiangsu Province

Applicant before: Liu Wanli

GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200414