CN109509513A - Gastroesophageal reflux disease risk factor extraction method and system based on distributed clustering
- Publication number
- CN109509513A (application CN201811589386.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- cluster
- questionnaire
- risk factor
- reflux disease
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/20—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Public Health (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- Databases & Information Systems (AREA)
- Pathology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and system for extracting gastroesophageal reflux disease risk factors based on distributed clustering. First, a user information set containing candidate gastroesophageal reflux disease risk factors is constructed. Second, the user information set is quantized to obtain a quantized data set, and the quantized data are stored in the HDFS distributed file system of the Hadoop big data analysis platform to form a sequence file. Third, using the MapReduce computing framework of the Hadoop big data analysis platform, the stored data are clustered with the K-means clustering algorithm and the Canopy clustering algorithm to obtain multiple clusters. Finally, the correlation index between the elements in each cluster is calculated, and the element with the largest correlation index is determined to be a gastroesophageal reflux disease risk factor. By using a distributed framework and an improved K-means clustering algorithm, the invention efficiently and accurately screens out the risk factors that cause gastroesophageal reflux disease, which helps reduce the incidence of the disease.
Description
Technical field
The present invention relates to the technical fields of clustering and medicine, and in particular to a method and system for extracting gastroesophageal reflux disease risk factors based on distributed clustering.
Background art
Gastroesophageal reflux disease is a digestive system disease that is prevalent worldwide, and its incidence is rising year by year. Its treatment therefore deserves sufficient attention. Because the onset of gastroesophageal reflux disease is closely related to lifestyle, emotional changes, eating habits and the like, and the condition changes easily, collecting a large amount of data and analyzing its characteristics plays an important role in studying and preventing the disease.
At present, gastroesophageal reflux disease diagnostic techniques mainly use clustering algorithms to extract risk factors. However, the number of clusters and the cluster centers are difficult to choose, and a wrong choice of either often leads to low risk factor extraction accuracy.
Summary of the invention
The object of the present invention is to provide a method and system for extracting gastroesophageal reflux disease risk factors based on distributed clustering, so as to solve the prior-art problem that a wrong choice of the number of clusters and the cluster centers leads to low risk factor extraction accuracy.
To achieve the above object, the present invention provides the following solutions:
A method for extracting gastroesophageal reflux disease risk factors based on distributed clustering, comprising:
constructing a user information set; the user information set is a data set with M rows and N columns; the element in row i, column 1 of the user information set is a user questionnaire ID number, and the column-1 elements of different rows are different user questionnaire ID numbers; the element in row 1, column j of the user information set is a question of the questionnaire, and the row-1 elements of different columns are different questions; the element in row i, column j of the user information set is the answer of the i-th user questionnaire ID number to the j-th question; where 2≤i≤M, 2≤j≤N, and i and j are positive integers;
performing data quantization on the answers in the user information set to obtain a quantized data set; the quantized data set is a data set with M rows and N columns; the element in row i, column 1 of the quantized data set is a user questionnaire ID number, and the column-1 elements of different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data set is a question of the questionnaire, and the row-1 elements of different columns are different questions; the element in row i, column j of the quantized data set is the quantized answer of the i-th user questionnaire ID number to the j-th question; where 2≤i≤M, 2≤j≤N; that is, the quantized data set is the initial set of key risk factors;
storing all data of the quantized data set in the HDFS distributed file system of the Hadoop big data analysis platform to form a sequence file; the sequence file contains multiple records of the form <key, value>; each <key, value> record represents one row of the quantized data set; where key represents the user questionnaire ID number and value represents the user's answers to all questions;
clustering the data in the sequence file with the K-means clustering algorithm and the Canopy clustering algorithm, using the MapReduce computing framework of the Hadoop big data analysis platform, to obtain multiple clusters;
calculating the correlation index between the elements in each cluster, and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.
Optionally, clustering the data in the sequence file with the K-means clustering algorithm and the Canopy clustering algorithm, using the MapReduce computing framework of the Hadoop big data analysis platform, to obtain multiple clusters specifically comprises:
processing the data in the sequence file by combining the K-means clustering algorithm with the Canopy clustering algorithm, using the MapReduce computing framework of the Hadoop big data analysis platform, to determine the number of clusters and the cluster centers;
clustering the data in the sequence file according to the number of clusters and the cluster centers to obtain multiple clusters.
Optionally, calculating the correlation index between the elements in each cluster and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor specifically comprises:
calculating the correlation coefficients between the elements in each cluster;
calculating the correlation index between the elements in each cluster from the correlation index formula and the calculated correlation coefficients;
sorting all correlation indices in descending order and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.
Optionally, calculating the correlation coefficients between the elements in each cluster specifically comprises:
calculating the correlation coefficient between the elements in each cluster using the following formula:
$$ r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} $$
where Var(X) is the variance of X, Var(Y) is the variance of Y, Cov(X, Y) is the covariance between X and Y, and X and Y are elements in each cluster.
Optionally, the correlation index formula is
$$ R^2 = \frac{1}{n} \sum_{i=1}^{n} r(X, X_i)^2 $$
where R^2 is the correlation index of element X, r(X, X_i) is the correlation coefficient between X and the i-th feature X_i, i is the feature index, and n is the total number of features.
A system for extracting gastroesophageal reflux disease risk factors based on distributed clustering, comprising:
a user information set construction module, configured to construct a user information set; the user information set is a data set with M rows and N columns; the element in row i, column 1 of the user information set is a user questionnaire ID number, and the column-1 elements of different rows are different user questionnaire ID numbers; the element in row 1, column j of the user information set is a question of the questionnaire, and the row-1 elements of different columns are different questions; the element in row i, column j of the user information set is the answer of the i-th user questionnaire ID number to the j-th question; where 2≤i≤M, 2≤j≤N, and i and j are positive integers;
a quantized data set acquisition module, configured to perform data quantization on the answers in the user information set to obtain a quantized data set; the quantized data set is a data set with M rows and N columns; the element in row i, column 1 of the quantized data set is a user questionnaire ID number, and the column-1 elements of different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data set is a question of the questionnaire, and the row-1 elements of different columns are different questions; the element in row i, column j of the quantized data set is the quantized answer of the i-th user questionnaire ID number to the j-th question; where 2≤i≤M, 2≤j≤N; that is, the quantized data set is the initial set of key risk factors;
a sequence file formation module, configured to store all data of the quantized data set in the HDFS distributed file system of the Hadoop big data analysis platform to form a sequence file; the sequence file contains multiple records of the form <key, value>; each <key, value> record represents one row of the quantized data set; where key represents the user questionnaire ID number and value represents the user's answers to all questions;
a cluster division module, configured to cluster the data in the sequence file with the K-means clustering algorithm and the Canopy clustering algorithm, using the MapReduce computing framework of the Hadoop big data analysis platform, to obtain multiple clusters;
a gastroesophageal reflux disease risk factor determination module, configured to calculate the correlation index between the elements in each cluster and to determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.
Optionally, the cluster division module specifically comprises:
a cluster number and cluster center determination unit, configured to process the data in the sequence file by combining the K-means clustering algorithm with the Canopy clustering algorithm, using the MapReduce computing framework of the Hadoop big data analysis platform, to determine the number of clusters and the cluster centers;
a cluster division unit, configured to cluster the data in the sequence file according to the number of clusters and the cluster centers to obtain multiple clusters.
Optionally, the gastroesophageal reflux disease risk factor determination module specifically comprises:
a correlation coefficient calculation unit, configured to calculate the correlation coefficients between the elements in each cluster;
a correlation index calculation unit, configured to calculate the correlation index between the elements in each cluster from the correlation index formula and the calculated correlation coefficients;
a gastroesophageal reflux disease risk factor determination unit, configured to sort all correlation indices in descending order and to determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.
According to the specific embodiments provided by the present invention, the invention discloses the following technical effects:
The present invention proposes a method and system for extracting gastroesophageal reflux disease risk factors, based mainly on the MapReduce computing framework of the Hadoop big data analysis platform. The invention uses a distributed computing framework to process medical data. To address the shortcomings of the K-means clustering algorithm, the invention improves it by combining it with the Canopy clustering algorithm and applies the improved K-means clustering algorithm to population clustering: the population is divided into different clusters according to health signs, and statistical methods are then used to identify the risk factors that cause gastroesophageal reflux disease in each group of people.
The invention uses a distributed framework to process high-dimensional big data and uses the improved K-means clustering algorithm to efficiently screen out the risk factors that cause gastroesophageal reflux disease. This provides a scientific basis for future medical research and disease diagnosis, offers guidance on gastroesophageal reflux disease, and reduces its incidence.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the method for extracting gastroesophageal reflux disease risk factors based on distributed clustering according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the system for extracting gastroesophageal reflux disease risk factors based on distributed clustering according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The object of the present invention is to provide a method and system for extracting gastroesophageal reflux disease risk factors based on distributed clustering that can efficiently and accurately screen out the risk factors that cause gastroesophageal reflux disease.
To make the above objects, features and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Embodiment 1
Fig. 1 is a schematic flowchart of the method for extracting gastroesophageal reflux disease risk factors based on distributed clustering according to an embodiment of the present invention. As shown in Fig. 1, the method provided by the embodiment of the present invention specifically comprises the following steps.
Step 101: construct a user information set; the user information set is a data set with M rows and N columns; the element in row i, column 1 of the user information set is a user questionnaire ID number, and the column-1 elements of different rows are different user questionnaire ID numbers; the element in row 1, column j of the user information set is a question of the questionnaire, and the row-1 elements of different columns are different questions; the element in row i, column j of the user information set is the answer of the i-th user questionnaire ID number to the j-th question; where 2≤i≤M, 2≤j≤N, and i and j are positive integers.
Step 102: perform data quantization on the answers in the user information set to obtain a quantized data set; the quantized data set is a data set with M rows and N columns; the element in row i, column 1 of the quantized data set is a user questionnaire ID number, and the column-1 elements of different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data set is a question of the questionnaire, and the row-1 elements of different columns are different questions; the element in row i, column j of the quantized data set is the quantized answer of the i-th user questionnaire ID number to the j-th question; where 2≤i≤M, 2≤j≤N; that is, the quantized data set is the initial set of key risk factors.
Step 103: store all data of the quantized data set in the HDFS distributed file system of the Hadoop big data analysis platform to form a sequence file; the sequence file contains multiple records of the form <key, value>; each <key, value> record represents one row of the quantized data set; where key represents the user questionnaire ID number and value represents the user's answers to all questions.
Step 104: cluster the data in the sequence file with the K-means clustering algorithm and the Canopy clustering algorithm, using the MapReduce computing framework of the Hadoop big data analysis platform, to obtain multiple clusters; different clusters represent different categories of users.
Step 105: calculate the correlation index between the elements in each cluster, and determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.
Step 101 is data collection and preparation.
In this embodiment, a questionnaire is issued to each person who consults the hospital about gastroesophageal reflux disease, and the user information set is built from the returned questionnaires. A user in the user information set may be ill or may be healthy; after the hospital's diagnosis, each user is given a label that indicates whether the user is ill. The user information set is therefore a mixed data set of healthy and ill users in which the records of ill users are known. The data set in this embodiment has 241 dimensions in total, comprising the uniquely identifying user questionnaire ID number and the answers to the questions of the questionnaire.
The questionnaire contains questions on several aspects, such as general demographic data, lifestyle, eating habits, psychological factors and sleep factors, together with the respondents' answers. The questions are of three types: single-choice questions, true/false questions and open-ended questions.
Step 102 is data quantization.
Specifically, the user questionnaire ID number serves as the unique identifier. For single-choice questions whose options express a degree of severity, for example "often", "occasionally", "seldom" and "never", the options are assigned the weights 4, 3, 2 and 1 in order of severity, and the weight corresponding to the actual answer is selected. For true/false questions with yes/no options, "yes" is assigned 1 and "no" is assigned 0, and the value corresponding to the actual answer is selected. Questions whose options have no ordering by severity, such as occupation, do not contribute to the result and can be deleted. For open-ended questions, the continuous numeric value entered by the user is used directly as the data. For example, for a user who is 45 years old, has a height of 172, is not infected with HP and often has sleep disorders, the sample columns are [age, height, HP infection, sleep disorder] and the data values are [45, 172, 0, 4]. The purpose of this step is to quantize the answers of all respondents to obtain the quantized data set R.
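As an illustration only, the quantization rules of step 102 could be sketched in Python as follows. The function names, the question type names and the English option labels are assumptions of this sketch and do not appear in the embodiment.

```python
# Hypothetical weight tables; the option labels follow the examples in the text.
SEVERITY_WEIGHTS = {"often": 4, "occasionally": 3, "seldom": 2, "never": 1}
YES_NO_VALUES = {"yes": 1, "no": 0}


def quantize_answer(question_type, answer):
    """Map one questionnaire answer to a number.

    question_type is "single_choice", "true_false" or "open"
    (names chosen for this sketch only).
    """
    if question_type == "single_choice":
        return SEVERITY_WEIGHTS[answer]
    if question_type == "true_false":
        return YES_NO_VALUES[answer]
    if question_type == "open":
        # Open-ended answers are used directly as continuous values.
        return float(answer)
    raise ValueError(f"unknown question type: {question_type}")


def quantize_row(question_types, answers):
    """Quantize all answers of one questionnaire."""
    return [quantize_answer(t, a) for t, a in zip(question_types, answers)]


# Columns: [age, height, HP infection, sleep disorder]
types = ["open", "open", "true_false", "single_choice"]
row = quantize_row(types, ["45", "172", "no", "often"])
print(row)
```

Questions without a severity ordering, such as occupation, would simply be left out of `types` before quantization.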
Step 103 is data storage.
All data of the resulting quantized data set are stored in the HDFS distributed file system of the Hadoop big data analysis platform. HDFS splits the data into blocks of equal size and stores them on the data nodes. Compared with a traditional file system, HDFS is designed for big data and can reduce the number of reads and improve efficiency.
The quantized data set is stored as a <key, value> sequence file, in which each <key, value> pair represents one data record of the quantized data set. The key is the label of the data record, i.e. the user questionnaire ID, and the value is the content of the data record, i.e. the user's answers to all questions.
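The <key, value> layout described above can be illustrated with a small in-memory Python sketch. It does not write a Hadoop sequence file; the function name `to_records` and the sample IDs are assumptions made for illustration.

```python
def to_records(quantized_rows):
    """Turn rows [questionnaire_id, a1, a2, ...] into (key, value) pairs.

    key   -> user questionnaire ID
    value -> list of quantized answers to all questions
    """
    return [(row[0], list(row[1:])) for row in quantized_rows]


# Hypothetical quantized rows; the IDs and values are made up for the example.
records = to_records([
    ["Q001", 45, 172, 0, 4],
    ["Q002", 30, 160, 1, 2],
])
for key, value in records:
    print(key, value)
```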
Step 104 is data clustering.
The optimization of the K-means clustering algorithm for massive data is described below.
The population is used as the clustering variable, and clustering yields different clusters of people. The K-means clustering algorithm has two obvious shortcomings, namely that the number of clusters and the initial cluster centers must be determined in advance. The Canopy clustering algorithm is therefore used here for an initial clustering that yields the number of clusters and the initial cluster centers. The data in the sequence file are clustered using the MapReduce computing framework of the Hadoop big data analysis platform.
Clustering is divided into two stages. The first stage performs a "coarse clustering" with the Canopy clustering algorithm; the second stage runs the K-means clustering algorithm on that basis, i.e. a "fine clustering". The final number of clusters is determined by analysis based on domain knowledge.
Completing the clustering on the Hadoop big data analysis platform requires 4 jobs in total.
Job1 generates the canopy centers; job2 uses the k canopy centers generated by job1 to generate k canopies; job3 performs K-means clustering on the data objects within the same canopy and generates stable K-means cluster centers; job4 completes the K-means clustering using the centers from job3.
(1) job1
Generate the k canopy centers.
A) Map stage
Initially, one data record in the sequence file is selected at random; its user questionnaire ID is used as the canopy ID of the first canopy, and the user's questionnaire information vector is used as the canopy center.
Distance thresholds t1 and t2 are set. The process then enters a loop: each remaining unmarked data object in the data subset is compared by distance with all canopy centers; if its distance exceeds the given parameter t2, it becomes the next canopy center and is marked. The Map function finally generates the set of local canopy centers on each data subset.
B) Reduce stage
The thresholds are adjusted and the above steps are applied again to all collected local canopy centers, yielding the set of global canopy centers in the form <canopy ID, value of each dimension of the canopy center>.
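A minimal single-process sketch of the center selection in job1 is given below, assuming Euclidean distance. It is not Hadoop code: the helper names, the sample points, taking the first record in order instead of a random one, and reusing the same t2 in the reduce step are all assumptions of this sketch.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t2):
    """Select canopy centers from a list of (id, vector) pairs.

    A point becomes a new center when it is farther than t2 from every
    center chosen so far. The first point always becomes a center
    (assumption: taken in order rather than at random).
    """
    centers = []
    for pid, vec in points:
        if all(euclidean(vec, c) > t2 for _, c in centers):
            centers.append((pid, vec))
    return centers


# "Map": each data subset produces its local centers.
split_a = [("u1", [0, 0]), ("u2", [1, 0]), ("u3", [10, 0])]
split_b = [("u4", [0, 1]), ("u5", [10, 1]), ("u6", [20, 0])]
local_a = canopy_centers(split_a, t2=3)
local_b = canopy_centers(split_b, t2=3)

# "Reduce": the same rule is applied to the collected local centers
# (assumption: the threshold is left unchanged here).
global_centers = canopy_centers(local_a + local_b, t2=3)
print([cid for cid, _ in global_centers])
```

In this toy data, u4 and u5 are local centers of their subset but are absorbed in the reduce step because they lie within t2 of u1 and u3.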
(2) job2
Generate k mutually overlapping canopies.
A) Map stage
The Map function receives the data to be clustered and, on startup, loads the set of global canopy centers generated by job1. Each data object is compared with all canopy centers; if its distance to a center is less than the given parameter t1, it is assigned to the corresponding canopy region.
B) Reduce stage
The Reduce function collects and organizes the results and obtains k mutually overlapping canopies in the form <user questionnaire ID, list of IDs of the canopies to which it belongs>.
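The assignment rule of job2 can be sketched in plain Python as follows, again assuming Euclidean distance. The function name `assign_to_canopies`, the center IDs and the sample values are hypothetical; the sketch only shows how one data object can fall into several overlapping canopies.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_canopies(points, centers, t1):
    """Return {point_id: [canopy_id, ...]} for every point.

    A point joins every canopy whose center is closer than t1,
    so canopies may overlap.
    """
    membership = {}
    for pid, vec in points:
        membership[pid] = [cid for cid, c in centers if euclidean(vec, c) < t1]
    return membership


# Hypothetical global canopy centers and data objects.
centers = [("c1", [0, 0]), ("c2", [6, 0])]
points = [("u1", [1, 0]), ("u2", [3, 0]), ("u3", [7, 0])]
membership = assign_to_canopies(points, centers, t1=4)
print(membership)
```

Here u2 lies within t1 of both centers and therefore belongs to both canopies.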
(3) job3
K-means clustering iterations are performed using the k canopy centers and k canopies generated above, producing k stable K-means cluster centers.
A) Map stage
The Map function receives the output of job2 and, on startup, loads the set of global canopy centers. For each data object it computes the distance to the nearest cluster center within the canopies to which the object belongs, and outputs the data and the cluster it belongs to in the form <cluster ID, data value>. The Combine function receives the output of the Map function and locally merges the data objects that belong to the same cluster; that is, it sums the corresponding dimensions of the data objects in a cluster and counts the data objects, with output in the form <cluster ID, number of data objects>.
B) Reduce stage
The Reduce function receives the output of the Combine function, accumulates the per-dimension sums and the total number of data objects of each cluster, obtains the new cluster center, and determines which canopies the new cluster center falls into. The output has the form <cluster ID, value of each dimension of the new cluster center, list of IDs of the canopies to which it belongs>. The Reduce function also judges whether the algorithm has converged.
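The Combine and Reduce steps of job3 amount to merging partial sums and counts per cluster. The following plain-Python sketch illustrates this under the assumption that each combiner emits a (sum vector, count) pair per cluster; the function names and sample data are hypothetical, and the canopy lookup and the convergence test are omitted.

```python
def combine(pairs):
    """Locally merge (cluster_id, vector) pairs into {cluster_id: (sums, count)}."""
    partial = {}
    for cid, vec in pairs:
        if cid in partial:
            sums, count = partial[cid]
            partial[cid] = ([s + v for s, v in zip(sums, vec)], count + 1)
        else:
            partial[cid] = (list(vec), 1)
    return partial


def reduce_partials(partials):
    """Merge the combiner outputs and return the new center of each cluster."""
    totals = {}
    for partial in partials:
        for cid, (sums, count) in partial.items():
            if cid in totals:
                tsums, tcount = totals[cid]
                totals[cid] = ([a + b for a, b in zip(tsums, sums)], tcount + count)
            else:
                totals[cid] = (list(sums), count)
    return {cid: [s / n for s in sums] for cid, (sums, n) in totals.items()}


# Two hypothetical nodes, each combining its own Map output.
node1 = combine([(0, [1, 1]), (0, [3, 1])])
node2 = combine([(0, [2, 4]), (1, [9, 8])])
new_centers = reduce_partials([node1, node2])
print(new_centers)
```

Sending sums and counts instead of raw records keeps the data transferred to the Reduce stage small, which is the usual reason for using a combiner.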
(4) job4
A) Map stage
The main task of the Map stage is to compute the distance from each record to the cluster centers and to label the record with the cluster it belongs to. For each input record row, the Map function computes its distance to all cluster centers, assigns the record to the nearest cluster center, and gives it the new class label. The specific steps are as follows:
A) Randomly select K records in the data set as the initial cluster centers. In this embodiment the number of clusters is determined by the five aspects of the questionnaire, i.e. K=5, and an array centers is set up to store the cluster center data.
B) For each remaining record, compute its distance to each cluster center in the centers array and assign it to the class of the nearest center. The Euclidean distance is used here:
$$ d(a, b) = \sqrt{\sum_{k=1}^{n} (x_{1k} - x_{2k})^2} $$
where n is the data dimension, each record represents an n-dimensional vector, x_{1k} is the value of the k-th dimension of record 1 and x_{2k} is the value of the k-th dimension of record 2, so that the two records can be written as a(x_{11}, x_{12}, …, x_{1n}) and b(x_{21}, x_{22}, …, x_{2n}).
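As a sketch of the Map stage of job4, the following Python code assigns each record to its nearest center using the Euclidean distance. The function names, the two-dimensional sample records and the two centers are illustrative assumptions (the embodiment uses K=5 and 241 dimensions).

```python
import math


def euclidean(a, b):
    """d(a, b) = sqrt(sum_k (x_1k - x_2k)^2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_center(record, centers):
    """Index of the center with the smallest Euclidean distance to record."""
    distances = [euclidean(record, c) for c in centers]
    return distances.index(min(distances))


def map_stage(records, centers):
    """Yield (cluster_label, vector) for each (questionnaire_id, vector) record."""
    for _key, value in records:
        yield nearest_center(value, centers), value


# Hypothetical centers and records.
centers = [[0, 0], [10, 10]]
records = [("u1", [1, 1]), ("u2", [9, 8]), ("u3", [2, 0])]
labelled = list(map_stage(records, centers))
print(labelled)
```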
B) Reduce stage
In the Reduce stage, the records of the same class are formed into a cluster according to the intermediate results of the Map function, and the new cluster centers are computed and stored in the centers array for the next Map round. The input is given as (key, value) pairs, where key is the class of the cluster and value is a record vector in that cluster. All records with the same key are sent to one Reduce task, which accumulates them and computes their mean to obtain the new cluster center. The output (key', value') serves as input to the Map stage, where key' is the cluster class and value' is the mean.
The mean is computed as follows:
A) Read the records with the same key as one group, and let num be the number of records in the group.
B) Add up the values of the same dimension over the group and divide by num to obtain the mean of that dimension; do the same for the other dimensions, looping 241 times.
C) Compute the center values of all 5 groups (C1, C2, C3, C4, C5) and store them in the centers array as the center array input to the Map stage.
Output the clustered data set, i.e. the data D' carrying five class labels.
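The mean computation of steps A) to C) can be sketched in Python as follows. The function name `reduce_stage` and the sample data are assumptions of this sketch; a real Reduce task would receive the grouped records from the MapReduce framework instead of grouping them itself.

```python
from collections import defaultdict


def reduce_stage(labelled):
    """Compute the new center of each cluster.

    labelled is an iterable of (cluster_label, vector) pairs. Records with
    the same label form one group; each dimension is summed over the group
    and divided by the group size num.
    """
    groups = defaultdict(list)
    for key, vec in labelled:
        groups[key].append(vec)

    new_centers = {}
    for key, vecs in groups.items():
        num = len(vecs)
        # zip(*vecs) iterates dimension by dimension.
        new_centers[key] = [sum(dim) / num for dim in zip(*vecs)]
    return new_centers


# Hypothetical Map output with two clusters and two dimensions.
new_centers = reduce_stage([(0, [1, 1]), (1, [9, 8]), (0, [3, 1])])
print(new_centers)
```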
Step 105 is risk factor determination.
To find the risk factors that determine each such patient group, the correlation index (the mean of the squared correlation coefficients) of each element in each cluster is calculated, and the element with the largest correlation index is determined to be a gastroesophageal reflux disease risk factor; one gastroesophageal reflux disease risk factor is thus selected from each cluster.
The quantized data set D is grouped by class label into D1, D2, D3, …, Dk, and the initial risk factor set is set to the empty set. The correlation between the elements within each cluster is analyzed, the correlation index of the elements is calculated, and in each cluster the element with the largest correlation index is added to the risk factor set.
The correlation coefficient is calculated as follows:
$$ r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} $$
where Var(X) is the variance of X, Var(Y) is the variance of Y, Cov(X, Y) is the covariance between X and Y, and X and Y are elements in each cluster.
The sample correlation index R^2 of a given feature is calculated as follows:
$$ R^2 = \frac{1}{n} \sum_{i=1}^{n} r(X, X_i)^2 $$
where X is the given feature, X_i is the i-th feature, i is the feature index, and n is the total number of features.
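A minimal sketch of the selection in step 105 for a single cluster is given below. It assumes that the correlation index of a feature is the mean of its squared correlation coefficients with the other features of the cluster; leaving out the feature's correlation with itself is an assumption of this sketch and does not change the ranking. The function names, the feature names q1 to q3 and the data are hypothetical.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Population covariance; covariance(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """r(X, Y) = Cov(X, Y) / sqrt(Var(X) * Var(Y)); 0.0 for constant columns."""
    vx, vy = covariance(xs, xs), covariance(ys, ys)
    if vx == 0 or vy == 0:
        return 0.0
    return covariance(xs, ys) / math.sqrt(vx * vy)


def correlation_index(columns, name):
    """Mean squared correlation of feature `name` with the other features."""
    others = [c for c in columns if c != name]
    return mean([correlation(columns[name], columns[o]) ** 2 for o in others])


def top_risk_factor(columns):
    """Feature with the largest correlation index in one cluster."""
    return max(columns, key=lambda n: correlation_index(columns, n))


# One hypothetical cluster: feature name -> quantized answers of its users.
cluster = {
    "q1": [2, 0, 0, -2],
    "q2": [1, -1, 1, -1],
    "q3": [1, 1, -1, -1],
}
print(top_risk_factor(cluster))
```

In this toy cluster q1 is correlated with both q2 and q3, while q2 and q3 are uncorrelated with each other, so q1 has the largest correlation index.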
Embodiment 2
To achieve the above object, the present invention also provides a kind of gastroesophageal refluxs based on distributional clustering as shown in Figure 2
Disease risk factor extraction system.The system includes:
A user information set construction module 100, configured to construct a user information set. The user information set is a data set of M rows and N columns. The element in row i, column 1 of the user information set is a user questionnaire ID number, and the elements of column 1 in different rows are different user questionnaire ID numbers. The element in row 1, column j of the user information set is a question of the questionnaire, and the elements of row 1 in different columns are different questions. The element in row i, column j of the user information set is the answer to the j-th question under the i-th user questionnaire ID number, where 2≤i≤M, 2≤j≤N, and i and j are positive integers;
A quantized data set obtaining module 200, configured to perform data quantization on the answers in the user information set to obtain a quantized data set. The quantized data set is a data set of M rows and N columns. The element in row i, column 1 of the quantized data set is a user questionnaire ID number, and the elements of column 1 in different rows are different user questionnaire ID numbers. The element in row 1, column j of the quantized data set is a question of the questionnaire, and the elements of row 1 in different columns are different questions. The element in row i, column j of the quantized data set is the quantized result of the answer to the j-th question under the i-th user questionnaire ID number, where 2≤i≤M and 2≤j≤N; that is, the quantized data set is the initial set of key risk factors;
A sequence file forming module 300, configured to store all data of the quantized data set in the HDFS distributed file system of the Hadoop big data analysis platform to form a sequence file. The sequence file contains a plurality of records of the form <key, value>; each <key, value> record represents one row of data in the quantized data set, where key represents the user questionnaire ID number and value represents the user's answers to all questions;
A cluster division module 400, configured to use the MapReduce computing framework of the Hadoop big data analysis platform to cluster the data in the sequence file with the K-means clustering algorithm and the Canopy clustering algorithm, obtaining a plurality of clusters;
A gastroesophageal reflux disease risk factor determining module 500, configured to calculate the correlation index among the elements in each cluster and to determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.
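The data flow through the first three modules can be illustrated with a small in-memory sketch. The questions, answers and coding scheme below are hypothetical, since the text does not specify how answers are quantized, and the records only mimic the logical <key, value> layout; a real Hadoop sequence file is a binary format written through the Hadoop API.

```python
# Hypothetical user information set: row 1 holds the questions,
# column 1 holds the user questionnaire ID numbers.
user_info = [
    ["id", "heartburn_frequency", "smokes"],
    ["Q001", "often", "no"],
    ["Q002", "never", "yes"],
]

# Hypothetical coding scheme mapping each answer to a number.
CODES = {"never": 0, "sometimes": 1, "often": 2, "no": 0, "yes": 1}

def quantize(table, codes):
    """Replace every answer (rows 2..M, columns 2..N) with its numeric code."""
    header, rows = table[0], table[1:]
    return [header] + [[row[0]] + [codes[a] for a in row[1:]] for row in rows]

def to_records(quantized):
    """One (key, value) pair per data row: key = ID number, value = all answers."""
    return [(row[0], row[1:]) for row in quantized[1:]]

records = to_records(quantize(user_info, CODES))
print(records)  # [('Q001', [2, 0]), ('Q002', [0, 1])]
```

Keeping the ID number as the key leaves the value as a pure numeric vector, which is the form a distance-based clustering step needs.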
The cluster division module 400 specifically includes:
A cluster number and cluster center determination unit, configured to use the MapReduce computing framework of the Hadoop big data analysis platform, in combination with the K-means clustering algorithm and the Canopy clustering algorithm, to process the data in the sequence file and determine the number of clusters and the cluster centers.
A cluster division unit, configured to cluster the data in the sequence file according to the number of clusters and the cluster centers, obtaining a plurality of clusters.
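A common way to combine the two algorithms, assumed here because the text does not spell out the details, is to let a Canopy pass choose the number of clusters and the initial centers and then refine them with K-means. The sketch below runs on a single machine with NumPy rather than on MapReduce, and it simplifies Canopy to a single distance threshold `t2` (the usual formulation also has a looser threshold T1); all names and parameter values are illustrative.

```python
import numpy as np

def canopy_centers(points, t2):
    """Take a point as a center, drop every point within t2 of it, repeat."""
    remaining = np.asarray(points, dtype=float)
    centers = []
    while len(remaining) > 0:
        center = remaining[0]
        centers.append(center)
        dist = np.linalg.norm(remaining - center, axis=1)
        remaining = remaining[dist > t2]  # the center itself is removed too
    return np.array(centers)

def assign(points, centers):
    """Index of the nearest center for every point."""
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(points, centers, iterations=20):
    """Plain K-means started from the given centers."""
    points = np.asarray(points, dtype=float)
    centers = np.array(centers, dtype=float)
    for _ in range(iterations):
        labels = assign(points, centers)
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members) > 0:  # keep an empty cluster's center unchanged
                centers[k] = members.mean(axis=0)
    return assign(points, centers), centers

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers = canopy_centers(points, t2=3.0)   # number of centers = cluster count
labels, final_centers = kmeans(points, centers)
print(len(centers), labels.tolist())
```

In a MapReduce setting the assignment step would correspond to the map phase and the center update to the reduce phase, but that mapping is a general observation and not a description of the patented implementation.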
The gastroesophageal reflux disease risk factor determining module 500 specifically includes:
A correlation coefficient calculation unit, configured to calculate the correlation coefficients among the elements in each cluster;
A correlation index calculation unit, configured to calculate the correlation index among the elements in each cluster from the correlation index calculation formula and the calculated correlation coefficients.
A gastroesophageal reflux disease risk factor determination unit, configured to sort all correlation indices in descending order and determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.
Compared with the prior art, the advantages of the present invention are as follows:
At present, machine learning methods are rarely used to extract risk factors in gastroesophageal reflux disease diagnosis. In the medical field, risk factors are mostly extracted by statistical methods, which are computationally intensive and less accurate than machine learning.
The present invention combines two clustering methods to remedy the shortcomings of the K-means algorithm itself, further improving the clustering accuracy.
The risk factor extraction of the present invention combines population clustering with index screening and analyzes risk factors across different populations; combining clustering with statistical methods makes the results more accurate.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Specific examples are used herein to illustrate the principle and implementation of the present invention. The description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, those of ordinary skill in the art may, in accordance with the idea of the present invention, make changes to the specific implementation and the scope of application. In conclusion, the content of this specification shall not be construed as limiting the present invention.
Claims (8)
1. A gastroesophageal reflux disease risk factor extraction method based on distributed clustering, characterized in that the method comprises:
constructing a user information set; the user information set is a data set of M rows and N columns; the element in row i, column 1 of the user information set is a user questionnaire ID number, and the elements of column 1 in different rows are different user questionnaire ID numbers; the element in row 1, column j of the user information set is a question of the questionnaire, and the elements of row 1 in different columns are different questions; the element in row i, column j of the user information set is the answer to the j-th question under the i-th user questionnaire ID number; wherein 2≤i≤M, 2≤j≤N, and i and j are positive integers;
performing data quantization on the answers in the user information set to obtain a quantized data set; the quantized data set is a data set of M rows and N columns; the element in row i, column 1 of the quantized data set is a user questionnaire ID number, and the elements of column 1 in different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data set is a question of the questionnaire, and the elements of row 1 in different columns are different questions; the element in row i, column j of the quantized data set is the quantized result of the answer to the j-th question under the i-th user questionnaire ID number; wherein 2≤i≤M and 2≤j≤N, that is, the quantized data set is the initial set of key risk factors;
storing all data of the quantized data set in the HDFS distributed file system of the Hadoop big data analysis platform to form a sequence file; the sequence file contains a plurality of records of the form <key, value>; each <key, value> record represents one row of data in the quantized data set; wherein key represents the user questionnaire ID number and value represents the user's answers to all questions;
using the MapReduce computing framework of the Hadoop big data analysis platform, clustering the data in the sequence file with the K-means clustering algorithm and the Canopy clustering algorithm to obtain a plurality of clusters;
calculating the correlation index among the elements in each cluster, and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.
2. The gastroesophageal reflux disease risk factor extraction method according to claim 1, characterized in that using the MapReduce computing framework of the Hadoop big data analysis platform to cluster the data in the sequence file with the K-means clustering algorithm and the Canopy clustering algorithm to obtain a plurality of clusters specifically comprises:
using the MapReduce computing framework of the Hadoop big data analysis platform, in combination with the K-means clustering algorithm and the Canopy clustering algorithm, to process the data in the sequence file and determine the number of clusters and the cluster centers;
clustering the data in the sequence file according to the number of clusters and the cluster centers to obtain a plurality of clusters.
3. The gastroesophageal reflux disease risk factor extraction method according to claim 1, characterized in that calculating the correlation index among the elements in each cluster and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor specifically comprises:
calculating the correlation coefficients among the elements in each cluster;
calculating the correlation index among the elements in each cluster from the correlation index calculation formula and the calculated correlation coefficients;
sorting all correlation indices in descending order and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.
4. The gastroesophageal reflux disease risk factor extraction method according to claim 3, characterized in that calculating the correlation coefficients among the elements in each cluster specifically comprises:
calculating the correlation coefficients among the elements in each cluster using the following formula;
the formula is
$$r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
wherein Var(X) is the variance of X, Var(Y) is the variance of Y, Cov(X, Y) is the covariance between X and Y, and X and Y are elements in each cluster.
5. The gastroesophageal reflux disease risk factor extraction method according to claim 4, characterized in that the correlation index calculation formula is:
wherein R² is the correlation index, i is the feature index, and n is the total number of features.
6. A gastroesophageal reflux disease risk factor extraction system based on distributed clustering, characterized in that the system comprises:
a user information set construction module, configured to construct a user information set; the user information set is a data set of M rows and N columns; the element in row i, column 1 of the user information set is a user questionnaire ID number, and the elements of column 1 in different rows are different user questionnaire ID numbers; the element in row 1, column j of the user information set is a question of the questionnaire, and the elements of row 1 in different columns are different questions; the element in row i, column j of the user information set is the answer to the j-th question under the i-th user questionnaire ID number; wherein 2≤i≤M, 2≤j≤N, and i and j are positive integers;
a quantized data set obtaining module, configured to perform data quantization on the answers in the user information set to obtain a quantized data set; the quantized data set is a data set of M rows and N columns; the element in row i, column 1 of the quantized data set is a user questionnaire ID number, and the elements of column 1 in different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data set is a question of the questionnaire, and the elements of row 1 in different columns are different questions; the element in row i, column j of the quantized data set is the quantized result of the answer to the j-th question under the i-th user questionnaire ID number; wherein 2≤i≤M and 2≤j≤N, that is, the quantized data set is the initial set of key risk factors;
a sequence file forming module, configured to store all data of the quantized data set in the HDFS distributed file system of the Hadoop big data analysis platform to form a sequence file; the sequence file contains a plurality of records of the form <key, value>; each <key, value> record represents one row of data in the quantized data set; wherein key represents the user questionnaire ID number and value represents the user's answers to all questions;
a cluster division module, configured to use the MapReduce computing framework of the Hadoop big data analysis platform to cluster the data in the sequence file with the K-means clustering algorithm and the Canopy clustering algorithm, obtaining a plurality of clusters;
a gastroesophageal reflux disease risk factor determining module, configured to calculate the correlation index among the elements in each cluster and to determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.
7. The gastroesophageal reflux disease risk factor extraction system according to claim 6, characterized in that the cluster division module specifically comprises:
a cluster number and cluster center determination unit, configured to use the MapReduce computing framework of the Hadoop big data analysis platform, in combination with the K-means clustering algorithm and the Canopy clustering algorithm, to process the data in the sequence file and determine the number of clusters and the cluster centers;
a cluster division unit, configured to cluster the data in the sequence file according to the number of clusters and the cluster centers, obtaining a plurality of clusters.
8. The gastroesophageal reflux disease risk factor extraction system according to claim 6, characterized in that the gastroesophageal reflux disease risk factor determining module specifically comprises:
a correlation coefficient calculation unit, configured to calculate the correlation coefficients among the elements in each cluster;
a correlation index calculation unit, configured to calculate the correlation index among the elements in each cluster from the correlation index calculation formula and the calculated correlation coefficients;
a gastroesophageal reflux disease risk factor determination unit, configured to sort all correlation indices in descending order and determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811589386.9A CN109509513A (en) | 2018-12-25 | 2018-12-25 | Gastroesophageal reflux disease risk factor extracting method and system based on distributional clustering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811589386.9A CN109509513A (en) | 2018-12-25 | 2018-12-25 | Gastroesophageal reflux disease risk factor extracting method and system based on distributional clustering |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109509513A true CN109509513A (en) | 2019-03-22 |
Family
ID=65754543
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811589386.9A Pending CN109509513A (en) | 2018-12-25 | 2018-12-25 | Gastroesophageal reflux disease risk factor extracting method and system based on distributional clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109509513A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110189803A (en) * | 2019-06-05 | 2019-08-30 | 南京理工大学 | The disease risk factor extracting method combined based on cluster with classification |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100010985A1 (en) * | 2006-07-28 | 2010-01-14 | Andrew Wong | System and method for detecting and analyzing pattern relationships |
CN106156107A (en) * | 2015-04-03 | 2016-11-23 | 刘岩松 | A kind of discovery method of hot news |
CN106530132A (en) * | 2016-11-14 | 2017-03-22 | 国家电网公司 | Power load clustering method and device |
-
2018
- 2018-12-25 CN CN201811589386.9A patent/CN109509513A/en active Pending
Non-Patent Citations (2)
Title |
---|
VAMEI: "《博客园 vamei》", 10 November 2013, HTTPS://WWW.CNBLOGS.COM/VAMEI/P/3416138.HTML * |
李应安: "基于MapReduce的聚类算法的并行化研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | TA01 | Transfer of patent application right | Effective date of registration: 20190729. Address after: 210000 Xiaolingwei 179, Xuanwu District, Nanjing City, Jiangsu Province. Applicant after: Nanjing Hospital of Integrated Traditional and Chinese Medicine. Address before: 210000 Xiaolingwei 179, Xuanwu District, Nanjing City, Jiangsu Province. Applicant before: Liu Wanli |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20190322 |