CN109509513A - Method and system for extracting gastroesophageal reflux disease risk factors based on distributed clustering

Info

Publication number: CN109509513A
Application number: CN201811589386.9A
Authority: CN
Inventors: 刘万里; 徐雷; 黄玉珍; 姚澜; 李荣臻; 夏吉安
Original assignee: 刘万里
Current assignee: Nanjing Hospital of Integrated Traditional and Chinese Medicine
Priority date: 2018-12-25
Filing date: 2018-12-25
Publication date: 2019-03-22

Abstract

The invention discloses a method and system for extracting gastroesophageal reflux disease risk factors based on distributed clustering. First, a user information set containing gastroesophageal reflux disease risk factors is constructed. Second, the user information set is quantized to obtain a quantized data set, and the quantized data are stored in the HDFS distributed file system of the Hadoop big data analysis platform to form a sequence file. Next, using the MapReduce computing framework of the Hadoop big data analysis platform, the stored data are clustered with the K-means clustering algorithm and the Canopy clustering algorithm to obtain multiple clusters. Finally, the correlation index between the elements in each cluster is calculated, and the element with the largest correlation index is determined to be a gastroesophageal reflux disease risk factor. By using a distributed architecture and an improved K-means clustering algorithm, the invention efficiently and accurately screens out the risk factors that cause gastroesophageal reflux disease, thereby helping to reduce its incidence.

Description

Method and system for extracting gastroesophageal reflux disease risk factors based on distributed clustering

Technical field

The present invention relates to the technical fields of clustering and medicine, and in particular to a method and system for extracting gastroesophageal reflux disease risk factors based on distributed clustering.

Background art

Gastroesophageal reflux disease is a digestive system disease that is prevalent worldwide, and its incidence is rising year by year. Its treatment therefore deserves serious attention. Because the onset of gastroesophageal reflux disease is closely related to lifestyle, emotional changes, eating habits and similar factors, and the condition changes easily, collecting large amounts of data and analyzing their characteristics plays an important role in studying and preventing the disease.

At present, gastroesophageal reflux disease diagnostic techniques mainly use clustering algorithms to extract risk factors. However, choosing the number of clusters and the cluster centers is difficult, and a wrong choice of either often leads to low precision in risk factor extraction.

Summary of the invention

The object of the present invention is to provide a method and system for extracting gastroesophageal reflux disease risk factors based on distributed clustering, so as to solve the prior-art problem that a wrong choice of the number of clusters and the cluster centers leads to low precision in risk factor extraction.

To achieve the above object, the present invention provides the following solutions:

A method for extracting gastroesophageal reflux disease risk factors based on distributed clustering, comprising:

constructing a user information set, where the user information set is a data set of M rows and N columns; the entry in row i, column 1 is a user questionnaire ID number, with different rows holding different ID numbers; the entry in row 1, column j is a questionnaire question, with different columns holding different questions; and the entry in row i, column j is the answer given under the i-th user questionnaire ID number to the j-th question, where 2 ≤ i ≤ M, 2 ≤ j ≤ N, and i and j are positive integers;

quantizing the answers in the user information set to obtain a quantized data set, where the quantized data set is a data set of M rows and N columns; the element in row i, column 1 is a user questionnaire ID number, with different rows holding different ID numbers; the element in row 1, column j is a questionnaire question, with different columns holding different questions; and the element in row i, column j is the quantized answer given under the i-th user questionnaire ID number to the j-th question, where 2 ≤ i ≤ M and 2 ≤ j ≤ N; the quantized data set serves as the initial set of key risk factors;

storing all data in the quantized data set in the HDFS distributed file system of the Hadoop big data analysis platform to form a sequence file, where the sequence file contains multiple records of the form <key, value>; each <key, value> record represents one row of the quantized data set; key is the user questionnaire ID number and value is that user's answers to all questions;

clustering the data in the sequence file with the K-means clustering algorithm and the Canopy clustering algorithm, using the MapReduce computing framework of the Hadoop big data analysis platform, to obtain multiple clusters;

calculating the correlation index between the elements in each cluster, and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.

Optionally, clustering the data in the sequence file with the K-means clustering algorithm and the Canopy clustering algorithm, using the MapReduce computing framework of the Hadoop big data analysis platform, to obtain multiple clusters specifically comprises:

using the MapReduce computing framework of the Hadoop big data analysis platform, processing the data in the sequence file with the K-means clustering algorithm combined with the Canopy clustering algorithm to determine the number of clusters and the cluster centers;

clustering the data in the sequence file according to the number of clusters and the cluster centers to obtain multiple clusters.

Optionally, calculating the correlation index between the elements in each cluster and determining the element with the largest correlation index to be a gastroesophageal reflux disease risk factor specifically comprises:

calculating the correlation coefficient between the elements in each cluster;

calculating the correlation index between the elements in each cluster from the correlation index formula and the calculated correlation coefficients;

sorting all correlation indices in descending order and selecting the element with the largest correlation index as a gastroesophageal reflux disease risk factor.

Optionally, calculating the correlation coefficient between the elements in each cluster specifically comprises:

calculating the correlation coefficient between the elements in each cluster using the following formula:

$$r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$

where Var(X) is the variance of X, Var(Y) is the variance of Y, Cov(X, Y) is the covariance between X and Y, and X and Y are elements in each cluster.

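Optionally, the correlation index is the mean of the squared correlation coefficients, and the correlation index formula is

$$R^2 = \frac{1}{n}\sum_{i=1}^{n} r(X, Y_i)^2$$

where R² is the correlation index, r(X, Y_i) is the correlation coefficient between an element X and the i-th feature Y_i, i is the feature index, and n is the total number of features.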

A system for extracting gastroesophageal reflux disease risk factors based on distributed clustering, comprising:

a user information set construction module, configured to construct a user information set, where the user information set is a data set of M rows and N columns; the entry in row i, column 1 is a user questionnaire ID number, with different rows holding different ID numbers; the entry in row 1, column j is a questionnaire question, with different columns holding different questions; and the entry in row i, column j is the answer given under the i-th user questionnaire ID number to the j-th question, where 2 ≤ i ≤ M, 2 ≤ j ≤ N, and i and j are positive integers;

a quantized data set acquisition module, configured to quantize the answers in the user information set to obtain a quantized data set, where the quantized data set is a data set of M rows and N columns; the element in row i, column 1 is a user questionnaire ID number, with different rows holding different ID numbers; the element in row 1, column j is a questionnaire question, with different columns holding different questions; and the element in row i, column j is the quantized answer given under the i-th user questionnaire ID number to the j-th question, where 2 ≤ i ≤ M and 2 ≤ j ≤ N; the quantized data set serves as the initial set of key risk factors;

a sequence file formation module, configured to store all data in the quantized data set in the HDFS distributed file system of the Hadoop big data analysis platform to form a sequence file, where the sequence file contains multiple records of the form <key, value>; each <key, value> record represents one row of the quantized data set; key is the user questionnaire ID number and value is that user's answers to all questions;

a cluster division module, configured to cluster the data in the sequence file with the K-means clustering algorithm and the Canopy clustering algorithm, using the MapReduce computing framework of the Hadoop big data analysis platform, to obtain multiple clusters;

a gastroesophageal reflux disease risk factor determination module, configured to calculate the correlation index between the elements in each cluster and to determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.

Optionally, the cluster division module specifically comprises:

a cluster number and cluster center determination unit, configured to use the MapReduce computing framework of the Hadoop big data analysis platform to process the data in the sequence file with the K-means clustering algorithm combined with the Canopy clustering algorithm, determining the number of clusters and the cluster centers;

a cluster division unit, configured to cluster the data in the sequence file according to the number of clusters and the cluster centers to obtain multiple clusters.

Optionally, the gastroesophageal reflux disease risk factor determination module specifically comprises:

a correlation coefficient calculation unit, configured to calculate the correlation coefficient between the elements in each cluster;

a correlation index calculation unit, configured to calculate the correlation index between the elements in each cluster from the correlation index formula and the calculated correlation coefficients;

a gastroesophageal reflux disease risk factor determination unit, configured to sort all correlation indices in descending order and select the element with the largest correlation index as a gastroesophageal reflux disease risk factor.

According to the specific embodiments provided by the present invention, the invention discloses the following technical effects:

The present invention proposes a method and system for extracting gastroesophageal reflux disease risk factors, built mainly on the MapReduce computing framework of the Hadoop big data analysis platform. It uses a distributed computing framework to process medical data. To address the shortcomings of the K-means clustering algorithm, the invention improves it by combining it with the Canopy clustering algorithm, and applies the improved K-means clustering algorithm to population clustering: the population is divided into different clusters according to health characteristics, and statistical methods are then used to identify the risk factors that cause gastroesophageal reflux disease in each group.

The present invention uses a distributed architecture to process high-dimensional big data and uses the improved K-means clustering algorithm to efficiently screen out the risk factors that cause gastroesophageal reflux disease. This provides a scientific basis for future medical research and disease diagnosis, offers guidance on gastroesophageal reflux disease, and helps reduce its incidence.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow diagram of the method for extracting gastroesophageal reflux disease risk factors based on distributed clustering according to an embodiment of the present invention;

Fig. 2 is a structural diagram of the system for extracting gastroesophageal reflux disease risk factors based on distributed clustering according to an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

The object of the present invention is to provide a method and system for extracting gastroesophageal reflux disease risk factors based on distributed clustering that can efficiently and accurately screen out the risk factors that cause gastroesophageal reflux disease.

To make the above objects, features and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.

Embodiment 1

Fig. 1 is a flow diagram of the method for extracting gastroesophageal reflux disease risk factors based on distributed clustering according to an embodiment of the present invention. As shown in Fig. 1, the method provided by the embodiment comprises the following steps.

Step 101: construct a user information set. The user information set is a data set of M rows and N columns; the entry in row i, column 1 is a user questionnaire ID number, with different rows holding different ID numbers; the entry in row 1, column j is a questionnaire question, with different columns holding different questions; and the entry in row i, column j is the answer given under the i-th user questionnaire ID number to the j-th question, where 2 ≤ i ≤ M, 2 ≤ j ≤ N, and i and j are positive integers.

Step 102: quantize the answers in the user information set to obtain a quantized data set. The quantized data set is a data set of M rows and N columns; the element in row i, column 1 is a user questionnaire ID number, with different rows holding different ID numbers; the element in row 1, column j is a questionnaire question, with different columns holding different questions; and the element in row i, column j is the quantized answer given under the i-th user questionnaire ID number to the j-th question, where 2 ≤ i ≤ M and 2 ≤ j ≤ N. The quantized data set serves as the initial set of key risk factors.

Step 103: store all data in the quantized data set in the HDFS distributed file system of the Hadoop big data analysis platform to form a sequence file. The sequence file contains multiple records of the form <key, value>; each <key, value> record represents one row of the quantized data set; key is the user questionnaire ID number and value is that user's answers to all questions.

Step 104: using the MapReduce computing framework of the Hadoop big data analysis platform, cluster the data in the sequence file with the K-means clustering algorithm and the Canopy clustering algorithm to obtain multiple clusters. Different clusters represent different categories of users.

Step 105: calculate the correlation index between the elements in each cluster, and determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.

Step 101: data collection and organization

In this embodiment, the hospital distributes a questionnaire to each person who consults about gastroesophageal reflux disease, and the user information set is built from the questionnaires that are returned. A user in the user information set may be ill or may be healthy; after the hospital's diagnosis, each user is given a label that indicates whether the user has the disease. The user information set is therefore a mixed data set of healthy and ill users in which the ill users' records are known. In this embodiment the data set has 241 dimensions in total, comprising the uniquely identifying user questionnaire ID number and the answers to each questionnaire question.

The questionnaire contains questions in several sections, including general demographic data, lifestyle, eating habits, psychological factors and sleep factors, together with the respondents' answers. The questions are of three types: single-choice, yes/no, and open-ended.

Step 102: data quantization

The user questionnaire ID number serves as the unique identifier. For single-choice questions whose options express a degree, such as often, occasionally, seldom and never, weights of 4, 3, 2 and 1 are assigned in order of degree, and the weight matching the chosen answer is used. For yes/no questions, "yes" is assigned 1 and "no" is assigned 0, and the value matching the chosen answer is used. Questions whose options have no degree ordering, such as occupation, are of no use to the result and can be deleted. For open-ended questions, the continuous numeric value entered by the user is used directly as data. For example, if a user is 45 years old, has a height of 172, is not infected with Helicobacter pylori (HP) and often has sleep disorders, the sample's columns are [age, height, HP infection, sleep disorder] and its data values are [45, 172, 0, 4]. The purpose of this step is to quantize the answers of all respondents and obtain the quantized data set R.
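The quantization rules above can be illustrated with a minimal Python sketch. The question type tags (`single_choice`, `yes_no`, `numeric`, `drop`), the field names and the schema layout are hypothetical and introduced only for illustration; the weights and the worked sample come from the text.

```python
# Weights for the four degree options named in the text.
FREQUENCY_WEIGHTS = {"often": 4, "occasionally": 3, "seldom": 2, "never": 1}
YES_NO_VALUES = {"yes": 1, "no": 0}


def quantize_answer(kind, answer):
    """Map one raw answer to a number; `kind` is a hypothetical type tag."""
    if kind == "single_choice":
        return FREQUENCY_WEIGHTS[answer]
    if kind == "yes_no":
        return YES_NO_VALUES[answer]
    if kind == "numeric":
        return float(answer)  # open-ended numeric input is used directly
    raise ValueError(f"unknown question kind: {kind}")


def quantize_row(schema, raw_row):
    """Quantize one questionnaire; questions tagged 'drop' are removed."""
    return [quantize_answer(kind, raw_row[name])
            for name, kind in schema if kind != "drop"]


# Hypothetical schema mirroring the sample in the text; occupation is dropped.
schema = [("age", "numeric"), ("height", "numeric"), ("occupation", "drop"),
          ("hp_infection", "yes_no"), ("sleep_disorder", "single_choice")]
raw = {"age": "45", "height": "172", "occupation": "teacher",
       "hp_infection": "no", "sleep_disorder": "often"}

sample = quantize_row(schema, raw)
print(sample)
```

Keeping the rules in lookup tables makes it easy to add further degree scales without touching the row logic.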

Step 103: data storage

All data in the resulting quantized data set are stored in the HDFS distributed file system of the Hadoop big data analysis platform. HDFS splits the data into blocks of equal size and stores them on the data nodes. Compared with a traditional file system, HDFS is designed for big data, which reduces the number of reads and improves efficiency.

The quantized data set is stored as a sequence file of <key, value> records, each of which represents one data record of the quantized data set. The key is the record's label, that is, the user questionnaire ID of the record; the value is the record's content, that is, the user's answers to all questions.
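As a rough illustration of the <key, value> layout, the sketch below turns a small quantized table into (key, value) pairs in plain Python. It does not write a Hadoop sequence file; the table contents, the ID strings and the function name `to_key_value_records` are assumptions made for the example.

```python
def to_key_value_records(table):
    """Split an M x N table (header row plus ID column) into (key, value) pairs.

    Row 1 holds the questions and column 1 holds the questionnaire IDs,
    matching the layout of the quantized data set described above.
    """
    header, *rows = table
    questions = header[1:]
    records = [(row[0], list(row[1:])) for row in rows]
    return questions, records


# Hypothetical quantized data set with two respondents.
table = [
    ["id", "age", "height", "hp_infection", "sleep_disorder"],
    ["Q001", 45, 172, 0, 4],
    ["Q002", 31, 160, 1, 2],
]

questions, records = to_key_value_records(table)
for key, value in records:
    print(key, value)
```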

Step 104: data clustering

The optimization of the K-means clustering algorithm for massive data is described below.

The population is used as the clustering variable, and clustering yields the different population clusters. The K-means clustering algorithm has two obvious shortcomings: the number of clusters and the initial cluster centers must be determined in advance. The Canopy clustering algorithm is therefore used for an initial clustering that yields the number of clusters and the initial cluster centers. The data in the sequence file are clustered with the MapReduce computing framework of the Hadoop big data analysis platform.

Clustering is divided into two stages. The first stage performs a "coarse clustering" with the Canopy clustering algorithm; the second stage runs the K-means clustering algorithm on top of the first stage's result, the "fine clustering". The final number of clusters is determined by analysis based on domain knowledge.

Completing the clustering on the Hadoop big data analysis platform takes four jobs in total.

job1 generates the canopy centers; job2 uses the centers from job1 to generate k canopies; job3 runs K-means clustering on the data objects within the same canopy to produce stable K-means cluster centers; and job4 uses the centers from job3 to complete the K-means clustering.

(1) job1

Generate the k canopy centers.

A) Map stage

Initially, one data record is selected at random from the sequence file. Its user questionnaire ID becomes the canopy ID of the first canopy, and the user's questionnaire information vector becomes the canopy center.

The distance thresholds t1 and t2 are set. A loop then compares the distance between each remaining unmarked data object in the data subset and all canopy centers; if the distance exceeds the given parameter t2, the data object becomes the next canopy center and is marked. The Map function finally produces the set of local canopy centers on each data subset.

B) Reduce stage

The thresholds are adjusted and the above steps are run again on all the collected local canopy center sets, yielding the global canopy center set in the form <canopy ID, canopy center value in each dimension>.
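The two stages of job1 can be sketched in plain Python as follows. This is a single-process approximation under stated assumptions, not Hadoop code: the first record of each subset stands in for the randomly chosen first center so the run is reproducible, and the function names and the separate `t2_map` and `t2_reduce` parameters are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(records, t2):
    """A record becomes a new canopy center when it lies farther than t2
    from every center chosen so far; the first record always qualifies."""
    centers = []
    for key, vector in records:
        if all(euclidean(vector, center) > t2 for _, center in centers):
            centers.append((key, vector))
    return centers


def job1(splits, t2_map, t2_reduce):
    # Map stage: local canopy centers on each data subset.
    local = [c for split in splits for c in canopy_centers(split, t2_map)]
    # Reduce stage: the same procedure on the collected local centers,
    # with a possibly adjusted threshold.
    return canopy_centers(local, t2_reduce)


# Hypothetical two-dimensional data spread over two subsets.
splits = [
    [("u1", [0.0, 0.0]), ("u2", [1.0, 0.0]), ("u3", [10.0, 10.0])],
    [("u4", [0.0, 1.0]), ("u5", [11.0, 10.0])],
]
global_centers = job1(splits, t2_map=3.0, t2_reduce=3.0)
print(global_centers)
```

In this toy run the two subsets each propose two local centers, and the reduce step merges nearby proposals so that one center per dense region remains.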

(2) job2

Generate k mutually overlapping canopies.

A) Map stage

The Map function receives the data to be clustered and, on startup, loads the global canopy center set generated by job1. Each data object is compared with all canopy centers; if the distance is less than the given parameter t1, the data object is assigned to the corresponding canopy region.

B) Reduce stage

The Reduce function collects and organizes the results to obtain k mutually overlapping canopies, in the form <user questionnaire ID, list of canopy IDs the user belongs to>.
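A minimal sketch of the job2 assignment, again in plain Python rather than MapReduce, is given below. The canopy centers, the records and the value of t1 are made-up illustration data, and `assign_to_canopies` is a hypothetical helper name.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_canopies(records, canopy_centers, t1):
    """Return {questionnaire ID: [canopy IDs]} for every canopy whose
    center lies closer than t1 to the record. A record may fall into
    several canopies, which is why the canopies overlap."""
    membership = {}
    for key, vector in records:
        membership[key] = [cid for cid, center in canopy_centers
                           if euclidean(vector, center) < t1]
    return membership


# Hypothetical global canopy centers, as job1 might produce them.
canopy_centers = [("u1", [0.0, 0.0]), ("u3", [10.0, 10.0])]
records = [("u2", [1.0, 0.0]), ("u6", [5.0, 5.0]), ("u5", [11.0, 10.0])]

membership = assign_to_canopies(records, canopy_centers, t1=8.0)
print(membership)
```

The record `u6` sits between the two centers and lands in both canopies, which is the overlap that lets job3 restrict distance computations to nearby cluster centers.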

(3) job3

K-means clustering iterations are run using the k canopy centers and k canopies generated above, producing k stable K-means cluster centers.

A) Map stage

The Map function receives the output of job2 and, on startup, loads the global canopy center set. For each data object it computes the distance to the nearest cluster center within the canopies the object belongs to, and outputs the data object with its assigned cluster in the form <cluster ID, data value>. The Combine function receives the output of the Map function and merges, locally, the data objects that belong to the same cluster: it sums the corresponding dimensions of the data objects in a cluster and counts those data objects, and outputs the result in the form <cluster ID, data count>.

B) Reduce stage

The Reduce function receives the output of the Combine function, totals the per-dimension sums and the number of data objects for each cluster, obtains the new cluster center values, and determines which canopies each new cluster center falls into. Its output has the form <cluster ID, new cluster center value in each dimension, list of canopy IDs it belongs to>. It also checks whether the algorithm has converged.
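One iteration of job3 might look like the following plain-Python sketch. Several details are assumptions rather than statements from the text: the combiner is shown emitting both the per-dimension sums and the count, a record with no shared canopy falls back to all centers, and the recomputation of each new center's canopy list and the convergence test are left out for brevity. All names and data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_assign(records, membership, centers, center_canopies):
    """Map: emit (cluster ID, vector), comparing each record only with
    cluster centers that share at least one canopy with it."""
    pairs = []
    for key, vector in records:
        shared = [cid for cid in centers
                  if set(center_canopies[cid]) & set(membership[key])]
        candidates = shared or list(centers)  # fallback is an assumption
        nearest = min(candidates, key=lambda cid: euclidean(vector, centers[cid]))
        pairs.append((nearest, vector))
    return pairs


def combine(pairs):
    """Combine: per cluster, sum each dimension locally and count records."""
    partial = {}
    for cid, vector in pairs:
        sums, count = partial.get(cid, ([0.0] * len(vector), 0))
        partial[cid] = ([s + x for s, x in zip(sums, vector)], count + 1)
    return partial


def reduce_update(partials):
    """Reduce: merge partial sums and counts, then divide for new centers."""
    totals = {}
    for partial in partials:
        for cid, (sums, count) in partial.items():
            old_sums, old_count = totals.get(cid, ([0.0] * len(sums), 0))
            totals[cid] = ([a + b for a, b in zip(old_sums, sums)],
                           old_count + count)
    return {cid: [s / count for s in sums]
            for cid, (sums, count) in totals.items()}


# Hypothetical data: two subsets, two canopies, two current cluster centers.
splits = [[("u1", [0.0, 0.0]), ("u2", [2.0, 0.0])],
          [("u3", [10.0, 10.0]), ("u4", [12.0, 10.0])]]
membership = {"u1": ["c1"], "u2": ["c1"], "u3": ["c2"], "u4": ["c2"]}
centers = {"k1": [0.0, 0.0], "k2": [10.0, 10.0]}
center_canopies = {"k1": ["c1"], "k2": ["c2"]}

partials = [combine(map_assign(s, membership, centers, center_canopies))
            for s in splits]
new_centers = reduce_update(partials)
print(new_centers)
```

Summing locally before the reduce step means only one small record per cluster leaves each subset, which is the usual reason for adding a combiner.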

(4) job4

A) Map stage

The main task of the Map stage is to calculate the distance from each record to the cluster centers and to label the record with the cluster it belongs to. For each input record, the Map function calculates the distance to all cluster centers, assigns the record to the nearest cluster center according to the minimum distance, and gives it the new class label. The specific steps are as follows:

A) Randomly select K records from the data set as the initial cluster centers. In this embodiment the number of clusters is determined by the five sections of the questionnaire, that is, K = 5, and an array centers is set up to store the cluster center data.

B) Calculate the distance from each remaining record to each cluster center in the centers array, and assign the record to the class of the nearest center. The distance used here is the Euclidean distance, calculated as follows:

$$d(a, b) = \sqrt{\sum_{k=1}^{n} (x_{1k} - x_{2k})^2}$$

where n is the data dimension and each record is an n-dimensional vector; $x_{1k}$ is the value of the k-th dimension of record 1 and $x_{2k}$ is the value of the k-th dimension of record 2, so that the two records can be written as $a(x_{11}, x_{12}, \ldots, x_{1n})$ and $b(x_{21}, x_{22}, \ldots, x_{2n})$.

B) Reduce stage

In the Reduce stage, records with the same class are grouped into one cluster according to the intermediate results from the Map function, and the new cluster centers are calculated and stored in the centers array for the next round of Map. The input is given as (key, value) pairs, where key is the cluster class and value is a record vector in that cluster. All records with the same key are sent to one Reduce task, which accumulates them and calculates their mean to obtain the new cluster center. The output (key', value') serves as input to the Map stage, where key' is the cluster class and value' is the mean.

The mean is calculated as follows:

A) Read the records with the same key as one group, and let num be the number of records in the group.

B) Add up the values of one dimension across the group and divide by num to obtain the mean of that dimension; do the same for the remaining dimensions, looping 241 times.

C) Calculate the center values (C1, C2, C3, C4, C5) of all five groups and store them in the centers array as the center array input to the Map stage.

Output the clustered data set, that is, the data D' carrying five class labels.
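The Map and Reduce rounds of job4 can be approximated in a single process as shown below. This is a sketch under assumptions: it uses two clusters and two dimensions instead of K = 5 and 241, takes the initial centers as an argument rather than drawing them at random, keeps the old center for an empty cluster, and stops when the centers no longer change. The function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_stage(records, centers):
    """Label each record with the index of its nearest center."""
    return [(min(range(len(centers)), key=lambda c: euclidean(v, centers[c])),
             key, v)
            for key, v in records]


def reduce_stage(labelled, centers):
    """Average, dimension by dimension, the records sharing a label."""
    new_centers = []
    for c, old in enumerate(centers):
        group = [v for label, _, v in labelled if label == c]
        if not group:
            new_centers.append(old)  # empty cluster keeps its center (assumption)
            continue
        num = len(group)
        new_centers.append([sum(dim) / num for dim in zip(*group)])
    return new_centers


def kmeans(records, centers, max_iter=100):
    """Alternate Map and Reduce rounds until the centers stop changing."""
    for _ in range(max_iter):
        new_centers = reduce_stage(map_stage(records, centers), centers)
        if new_centers == centers:
            break
        centers = new_centers
    labels = {key: label for label, key, _ in map_stage(records, centers)}
    return centers, labels


# Hypothetical records and initial centers.
records = [("u1", [0.0, 0.0]), ("u2", [2.0, 0.0]),
           ("u3", [10.0, 10.0]), ("u4", [12.0, 10.0])]
final_centers, labels = kmeans(records, [[0.0, 0.0], [12.0, 10.0]])
print(final_centers, labels)
```

The returned `labels` dictionary plays the role of the class labels attached to the clustered data set.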

Step 105: determining the risk factors

To find the risk factors that shape each patient group, the correlation index (the mean of the squared correlation coefficients) between the elements in each cluster is calculated. The element with the largest correlation index is determined to be a gastroesophageal reflux disease risk factor, so one risk factor is screened out per cluster.

The quantized data set D is grouped by class label into $D_1, D_2, D_3, \ldots, D_k$, and the initial risk factor set is set to the empty set. The correlation between the elements within each cluster is analyzed and the correlation index between the elements is calculated; in each cluster, the element with the largest correlation index is selected and added to the risk factor set.

The correlation coefficient is calculated as follows:

$$r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$

where Cov(X, Y) is the covariance between X and Y, and Var(X) and Var(Y) are their variances.

The sample correlation index R² of a feature is calculated as follows:

$$R^2 = \frac{1}{n}\sum_{i=1}^{n} r(X, Y_i)^2$$

where X is a given feature, $Y_i$ is the i-th feature, i is the feature index, and n is the total number of features.
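The selection in step 105 can be sketched as follows for one cluster. The sketch reads the correlation index as the mean of the squared correlation coefficients between a feature and every other feature in the cluster; excluding the feature's correlation with itself and treating a constant column as having zero correlation are assumptions, as are the feature names and values.

```python
import math


def pearson(x, y):
    """Correlation coefficient Cov(X, Y) / sqrt(Var(X) * Var(Y))."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / math.sqrt(var_x * var_y)


def correlation_index(columns, name):
    """Mean squared correlation between one feature and all the others."""
    others = [other for other in columns if other != name]
    return sum(pearson(columns[name], columns[other]) ** 2
               for other in others) / len(others)


def top_risk_factor(columns):
    """Feature with the largest correlation index in one cluster."""
    return max(columns, key=lambda name: correlation_index(columns, name))


# Hypothetical quantized answers of four users in one cluster.
cluster = {
    "sleep_disorder": [1, 2, 3, 4],
    "late_meals": [1, 3, 2, 4],
    "smoking": [2, 1, 4, 3],
}
risk_factor = top_risk_factor(cluster)
print(risk_factor, correlation_index(cluster, risk_factor))
```

Running `top_risk_factor` once per cluster and collecting the results would fill the risk factor set described above.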

Embodiment 2

To achieve the above object, the present invention also provides a system for extracting gastroesophageal reflux disease risk factors based on distributed clustering, as shown in Fig. 2. The system comprises:

a user information set construction module 100, configured to construct a user information set, where the user information set is a data set of M rows and N columns; the entry in row i, column 1 is a user questionnaire ID number, with different rows holding different ID numbers; the entry in row 1, column j is a questionnaire question, with different columns holding different questions; and the entry in row i, column j is the answer given under the i-th user questionnaire ID number to the j-th question, where 2 ≤ i ≤ M, 2 ≤ j ≤ N, and i and j are positive integers;

a quantized data set acquisition module 200, configured to quantize the answers in the user information set to obtain a quantized data set, where the quantized data set is a data set of M rows and N columns; the element in row i, column 1 is a user questionnaire ID number, with different rows holding different ID numbers; the element in row 1, column j is a questionnaire question, with different columns holding different questions; and the element in row i, column j is the quantized answer given under the i-th user questionnaire ID number to the j-th question, where 2 ≤ i ≤ M and 2 ≤ j ≤ N; the quantized data set serves as the initial set of key risk factors;

a sequence file formation module 300, configured to store all data in the quantized data set in the HDFS distributed file system of the Hadoop big data analysis platform to form a sequence file, where the sequence file contains multiple records of the form <key, value>; each <key, value> record represents one row of the quantized data set; key is the user questionnaire ID number and value is that user's answers to all questions;

a cluster division module 400, configured to cluster the data in the sequence file with the K-means clustering algorithm and the Canopy clustering algorithm, using the MapReduce computing framework of the Hadoop big data analysis platform, to obtain multiple clusters;

a gastroesophageal reflux disease risk factor determination module 500, configured to calculate the correlation index between the elements in each cluster and to determine the element with the largest correlation index to be a gastroesophageal reflux disease risk factor.

The cluster division module 400 specifically includes:

A cluster number and cluster center determination unit, configured to process the data in the sequence file using the MapReduce computing framework of the Hadoop big data analysis platform in combination with the K-means clustering algorithm and the Canopy clustering algorithm, and to determine the number of clusters and the cluster centers.
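One common way to combine the two algorithms is to run a Canopy pass first, take the number of canopies as the number of clusters and the canopy centers as initial centers, and then refine with K-means. The following single-machine sketch illustrates that idea only; it is not the MapReduce implementation, and the thresholds `t1` and `t2`, the sample points, and all function names are assumptions.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def canopy_centers(points, t1, t2):
    """Canopy pass with loose threshold t1 and tight threshold t2 (t1 > t2).
    Returns one center per canopy; the number of centers serves as k."""
    assert t1 > t2 >= 0
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining[0]
        members = [p for p in remaining if dist(seed, p) <= t1]
        centers.append(mean(members))
        # Points within t2 of the seed can no longer start a new canopy.
        remaining = [p for p in remaining if dist(seed, p) > t2]
    return centers


def kmeans(points, centers, max_iter=100):
    """Lloyd's K-means iteration starting from the given initial centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            clusters[nearest].append(p)
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical quantized answers for six users (two well-separated groups).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
initial = canopy_centers(points, t1=5.0, t2=2.0)
centers, clusters = kmeans(points, initial)
```

The Canopy pass removes the need to guess k in advance and gives K-means a data-driven starting point, which addresses two of the usual weaknesses of plain K-means.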

The gastroesophageal reflux disease risk factor determination module 500 specifically includes:

A correlation index calculation unit, configured to calculate the correlation index between the elements in each cluster by combining the correlation index calculation formula with the calculated correlation coefficients.
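The correlation step can be sketched as follows. The coefficient is computed as the Pearson correlation, Cov(X, Y) divided by the square root of Var(X)·Var(Y), which matches the quantities named in the claims. The correlation index formula itself is not reproduced in this text, so the score used in `rank_elements` (the mean squared coefficient of one element against all others) is a stand-in chosen purely for illustration, not the formula of the invention. All names and sample values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient Cov(X, Y) / sqrt(Var(X) * Var(Y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: coefficient undefined, treated as 0 here
    return cov / math.sqrt(var_x * var_y)


def rank_elements(cluster_rows, names):
    """Score every element (column) of one cluster and sort in descending order.
    ASSUMPTION: the score is the mean squared Pearson coefficient against the
    other elements; the source's own correlation index formula is not available."""
    columns = list(zip(*cluster_rows))
    scores = {}
    for i, name in enumerate(names):
        squared = [pearson(columns[i], columns[j]) ** 2
                   for j in range(len(names)) if j != i]
        scores[name] = sum(squared) / len(squared)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical quantized answers of four users that fell into the same cluster.
names = ["smoking", "late_meals", "stress"]
cluster_rows = [(1, 2, 1), (2, 4, 0), (3, 6, 1), (4, 8, 0)]
ranking = rank_elements(cluster_rows, names)
top_element = ranking[0][0]
```

Sorting in descending order and taking the first entry corresponds to selecting the element with the largest index as the candidate risk factor for that cluster.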

Compared with the prior art, the advantages of the present invention are as follows:

At present, machine learning methods are rarely used to extract risk factors in gastroesophageal reflux disease diagnostic techniques. In the medical field, risk factors are mostly extracted with statistical methods, which are computationally intensive and less accurate than machine learning.

The present invention combines two clustering methods, which remedies the shortcomings inherent in K-means and further improves clustering accuracy.

The risk factor extraction of the present invention combines population clustering with index screening: risk factors are analyzed separately for different populations, and combining the clustering method with the statistical method makes the results more accurate.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.

Specific examples are used herein to illustrate the principle and implementation of the present invention. The description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. A person skilled in the art may, following the idea of the present invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A gastroesophageal reflux disease risk factor extraction method based on distributed clustering, characterized in that the method comprises:

constructing a user information set; the user information set is a data set of M rows and N columns; the factor in row i, column 1 of the user information set is a user questionnaire ID number, and the factors in column 1 of different rows are different user questionnaire ID numbers; the factor in row 1, column j of the user information set is a question of the questionnaire, and the factors in row 1 of different columns are different questions; the factor in row i, column j of the user information set is the answer recorded under the i-th user questionnaire ID number to the j-th question; wherein 2 ≤ i ≤ M, 2 ≤ j ≤ N, and i and j are positive integers;

performing data quantization on the answers in the user information set to obtain a quantized data set; the quantized data set is a data set of M rows and N columns; the element in row i, column 1 of the quantized data set is a user questionnaire ID number, and the elements in column 1 of different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data set is a question of the questionnaire, and the elements in row 1 of different columns are different questions; the element in row i, column j of the quantized data set is the quantized value of the answer recorded under the i-th user questionnaire ID number to the j-th question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N; that is, the quantized data set is the initial set of key risk factors;

storing all data in the quantized data set in the HDFS distributed file system of the Hadoop big data analysis platform to form a sequence file; the sequence file contains multiple records of the form <key, value>; each <key, value> record represents one row of data in the quantized data set; wherein key represents the user questionnaire ID number and value represents that user's answers to all questions;

clustering the data in the sequence file using the MapReduce computing framework of the Hadoop big data analysis platform together with the K-means clustering algorithm and the Canopy clustering algorithm, to obtain multiple clusters;

calculating the correlation index between the elements in each cluster, and determining the element with the largest correlation index as a gastroesophageal reflux disease risk factor.

2. The gastroesophageal reflux disease risk factor extraction method according to claim 1, characterized in that clustering the data in the sequence file using the MapReduce computing framework of the Hadoop big data analysis platform together with the K-means clustering algorithm and the Canopy clustering algorithm, to obtain multiple clusters, specifically comprises:

processing the data in the sequence file using the MapReduce computing framework of the Hadoop big data analysis platform in combination with the K-means clustering algorithm and the Canopy clustering algorithm, and determining the number of clusters and the cluster centers;

clustering the data in the sequence file according to the number of clusters and the cluster centers, to obtain multiple clusters.

3. The gastroesophageal reflux disease risk factor extraction method according to claim 1, characterized in that calculating the correlation index between the elements in each cluster, and determining the element with the largest correlation index as a gastroesophageal reflux disease risk factor, specifically comprises:

calculating the correlation coefficient between the elements in each cluster;

calculating the correlation index between the elements in each cluster by combining the correlation index calculation formula with the calculated correlation coefficients;

sorting all correlation indices in descending order, and determining the element with the largest correlation index as a gastroesophageal reflux disease risk factor.

4. The gastroesophageal reflux disease risk factor extraction method according to claim 3, characterized in that calculating the correlation coefficient between the elements in each cluster specifically comprises:

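calculating the correlation coefficient according to the formula

$$r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$

where Var(X) is the variance of X, Var(Y) is the variance of Y, Cov(X, Y) is the covariance between X and Y, and X and Y are elements in each cluster.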

5. The gastroesophageal reflux disease risk factor extraction method according to claim 4, characterized in that the correlation index calculation formula is [formula not reproduced in the source text], where R² is the correlation index, i is the feature number, and n is the total number of features.

6. A gastroesophageal reflux disease risk factor extraction system based on distributed clustering, characterized in that the system comprises:

a user information set construction module, configured to construct a user information set; the user information set is a data set of M rows and N columns; the factor in row i, column 1 of the user information set is a user questionnaire ID number, and the factors in column 1 of different rows are different user questionnaire ID numbers; the factor in row 1, column j of the user information set is a question of the questionnaire, and the factors in row 1 of different columns are different questions; the factor in row i, column j of the user information set is the answer recorded under the i-th user questionnaire ID number to the j-th question; wherein 2 ≤ i ≤ M, 2 ≤ j ≤ N, and i and j are positive integers;

a quantized data set acquisition module, configured to perform data quantization on the answers in the user information set to obtain a quantized data set; the quantized data set is a data set of M rows and N columns; the element in row i, column 1 of the quantized data set is a user questionnaire ID number, and the elements in column 1 of different rows are different user questionnaire ID numbers; the element in row 1, column j of the quantized data set is a question of the questionnaire, and the elements in row 1 of different columns are different questions; the element in row i, column j of the quantized data set is the quantized value of the answer recorded under the i-th user questionnaire ID number to the j-th question; wherein 2 ≤ i ≤ M and 2 ≤ j ≤ N; that is, the quantized data set is the initial set of key risk factors;

a sequence file forming module, configured to store all data in the quantized data set in the HDFS distributed file system of the Hadoop big data analysis platform to form a sequence file; the sequence file contains multiple records of the form <key, value>; each <key, value> record represents one row of data in the quantized data set; wherein key represents the user questionnaire ID number and value represents that user's answers to all questions;

a cluster division module, configured to cluster the data in the sequence file using the MapReduce computing framework of the Hadoop big data analysis platform together with the K-means clustering algorithm and the Canopy clustering algorithm, to obtain multiple clusters;

a gastroesophageal reflux disease risk factor determination module, configured to calculate the correlation index between the elements in each cluster and to determine the element with the largest correlation index as a gastroesophageal reflux disease risk factor.

7. The gastroesophageal reflux disease risk factor extraction system according to claim 6, characterized in that the cluster division module specifically comprises:

a cluster number and cluster center determination unit, configured to process the data in the sequence file using the MapReduce computing framework of the Hadoop big data analysis platform in combination with the K-means clustering algorithm and the Canopy clustering algorithm, and to determine the number of clusters and the cluster centers;

a cluster division unit, configured to cluster the data in the sequence file according to the number of clusters and the cluster centers, to obtain multiple clusters.

8. The gastroesophageal reflux disease risk factor extraction system according to claim 6, characterized in that the gastroesophageal reflux disease risk factor determination module specifically comprises:

a correlation index calculation unit, configured to calculate the correlation index between the elements in each cluster by combining the correlation index calculation formula with the calculated correlation coefficients;

a gastroesophageal reflux disease risk factor determination unit, configured to sort all correlation indices in descending order and to determine the element with the largest correlation index as a gastroesophageal reflux disease risk factor.