Big data clustering method and device based on a cloud computing platform
Technical field
The present invention relates to the field of data mining, and in particular to a big data clustering method and device based on a cloud computing platform.
Background art
Over the past half century, as computer technology has been integrated into every aspect of social life, the accumulation of information has reached a degree at which it begins to trigger change. Not only is the world flooded with more information than ever before, the rate of growth is itself accelerating. The disciplines at the source of this information explosion, such as astronomy and genomics, coined the concept of "big data", which is now applied in almost every field of human endeavour. The 21st century is an era of large-scale development of data and information: the mobile internet, social networks and e-commerce have greatly expanded the boundaries and applications of the internet, and data of all kinds are expanding rapidly. The internet (social media, search, e-commerce), the mobile internet (microblogs), the Internet of Things (sensors, the smart planet), connected vehicles, GPS, medical imaging, security surveillance, finance (banking, stock markets, insurance) and telecommunications (calls, text messages) all produce data at a furious pace. As for the total volume of data on earth: in 2006, just as individual users were entering the terabyte era, the world created a total of about 180 EB of new data; by 2011 that figure had reached 1.8 ZB (1 ZB = 1 billion TB).
Big data means a surge in data volume (from ERP/CRM data, expanding step by step to internet data and then to the sensor data of the Internet of Things) and, at the same time, a rise in data complexity. Big data can be described as the qualitative change produced once data accumulate to a certain scale. The data types of big data are rich and varied: besides structured information such as traditional database records, there is unstructured information such as text and video, and the required speed of data acquisition and processing keeps increasing.
Big data includes the notion of "massive data" but goes beyond it in substance; in short, big data is "massive data" plus complex data types. Big data comprises all data sets, both transactional and interactional, whose scale or complexity exceeds the ability of commonly used technologies to capture, manage and process them at a reasonable cost and within a reasonable time.
Big data has formed from the convergence of three major technology trends:
Massive transaction data: in online transaction processing (OLTP), from ERP applications to data warehouse applications and analytic systems, traditional relational data as well as unstructured and semi-structured information continue to grow. The situation becomes more complicated as more data and business processes move to public and private clouds. Internally managed transaction data mainly comprise online transaction data and online analytical data; they are structured, static historical data managed and accessed through relational databases. From these data we can understand what has happened in the past.
Massive interaction data: this new force consists of social media data from Facebook, Twitter, LinkedIn and other sources. It includes call detail records (CDR), device and sensor information, GPS and geo-location mapping data, large image files transmitted via Managed File Transfer protocols, web text and clickstream data, scientific information, e-mail, and so on. These data can tell us what may happen in the future.
Mass data processing: multiple lightweight databases receive data from clients, and the data are imported into a centralized large-scale distributed database or distributed storage cluster; the distributed database is then used to run ordinary queries, classification and summarization over the centrally stored mass data, satisfying most common analysis needs, while data mining on top of these queries can satisfy high-level data analysis requirements. For example, YunTable is a new-generation distributed database developed on the basis of traditional distributed databases and new NoSQL technology; with it, distributed clusters of hundreds of nodes can be built to manage PB-scale mass data.
Facing the onslaught of big data, traditional data processing methods cope with increasing difficulty; many times we stand before a gold mine without effective tools and means, able only to sigh at the "data". The difficulties that big data poses for conventional analytical techniques mainly include:
limited analysis means prevent full use of all the data;
limited analysis capability makes answers to difficult problems unobtainable;
deadline constraints force the adoption of some simple modeling technique;
insufficient computing time compromises model accuracy.
As for the current state of research on clustering in data mining, existing approaches to mining large data clusters mostly sample the data, choosing representative data so that cluster analysis of the sample stands in for the whole. When processing big data, methods based on sampling probability are generally adopted, but sampling methods consider neither the overall relative distances between data points or between intervals nor the unevenness of the data distribution, leading to overly rigid interval boundaries. Although clustering, fuzzy concepts, cloud models and the like were later introduced to alleviate the problem of rigid interval division, with good results, none of these methods considered the different roles that individual data points play in the knowledge discovery task. Therefore, to make the mined clustering rules more effective and faster to obtain, deeper research into cluster analysis must start from fully considering the different roles of data points. Cloud computing is proposed precisely on the basis of processing among the data points of real big data, which provides a powerful theoretical foundation for mining more effective clustering rules.
Summary of the invention
To solve the above problems in the prior art, the present invention discloses a big data clustering method and device based on a cloud computing platform, which use the MapReduce programming model in combination with a clustering algorithm to process big data quickly and effectively, so that valuable information can be continuously mined from the data.
MapReduce is a programming model developed by Google, mainly for processing large-scale (TB-level) data files. Its central idea is to build the elementary unit of computation from the two concepts "Map" and "Reduce": a Map program first cuts the data into independent blocks and distributes (schedules) them to a large number of machines for processing, achieving distributed computation; a Reduce program then gathers the results and outputs them, so that massive data can be processed in parallel. Its general form is as follows:
Map(k1, v1) -> list(k2, v2)
Reduce(k2, list(v2)) -> list(v2)
In short, the Map-Reduce programming model divides the input data file into M independent splits, which are assigned to multiple workers; M Map functions are started and executed in parallel, writing their results as intermediate key/value pairs to local intermediate files. The intermediate key/value pairs are grouped by key and the Reduce functions are executed: according to the intermediate-file location information obtained from the Master, Reduce commands are sent to the nodes holding the intermediate files, which compute and output the final results into R output files. Executing Reduce where the intermediate files reside further lowers the bandwidth needed to transmit them.
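For illustration, a minimal in-process word-count sketch of this Map/Reduce form follows; the function names and the shuffle simulation are ours, not part of the invention, and a real Hadoop job would distribute these phases across machines.

# A minimal in-process simulation of the Map/Reduce form above (word count).
# The shuffle step groups intermediate (k2, v2) pairs by key, as Hadoop does
# between the Map and Reduce phases.
from collections import defaultdict

def map_fn(_k1, line):                 # Map(k1, v1) -> list(k2, v2)
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):           # Reduce(k2, list(v2)) -> list(v2)
    return [sum(counts)]

def run(lines):
    shuffle = defaultdict(list)
    for offset, line in enumerate(lines):                  # Map phase
        for k2, v2 in map_fn(offset, line):
            shuffle[k2].append(v2)                         # group by key
    return {k: reduce_fn(k, vs) for k, vs in shuffle.items()}  # Reduce phase

print(run(["big data on the cloud", "big data clustering"]))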
MapReduce relies on HDFS for its implementation. MapReduce usually divides the computed data into many small blocks; HDFS replicates each block several times to guarantee the reliability of the system, and at the same time places the blocks on different machines of the cluster according to certain rules, so that MapReduce can most conveniently compute on the machine where the data resides. HDFS, an open-source counterpart of Google's GFS, is a fault-tolerant distributed file system that provides high-throughput data access and is suitable for storing massive (PB-level) large files (usually above 64 MB).
The present invention uses the MapReduce programming model to design a clustering ensemble algorithm. The big data are cut into blocks and stored in the distributed file system HDFS of the cloud platform; Hadoop is responsible for managing the blocks, whose key value is the data block D_i to which they belong. Each computer in the computing cluster applies a clustering algorithm to the blocks it stores locally to obtain base clustering results, and a consistency scheme is used to Reduce the clustering results of the same machine (the key is the machine number, the value is a clustering result) into that machine's final clustering ensemble result. The big data are thus processed effectively in parallel, further improving data processing performance and efficiency.
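The following is a minimal sketch of this map-then-reduce ensemble pattern, under assumptions of ours: two placeholder base clusterers stand in for the patent's clustering algorithms, and a simple majority vote stands in for the consistency scheme (real consensus schemes first align the label numberings of the base results).

# Sketch of the described pattern: a machine runs base clusterings on its
# locally stored block (Map), then a consistency-based Reduce combines the
# base results of that machine into one ensemble clustering.
from collections import Counter

def base_a(block):                      # placeholder base clusterer 1
    return [0 if x < 0.5 else 1 for x in block]

def base_b(block):                      # placeholder base clusterer 2
    m = sum(block) / len(block)
    return [0 if x < m else 1 for x in block]

def map_phase(block):                   # Map: one (key, value) per base result
    return [("machine-0", base(block)) for base in (base_a, base_b)]

def reduce_phase(pairs):                # Reduce: majority vote per data point
    labelings = [labels for _, labels in pairs]
    # naive consensus: assumes the base labelings use aligned cluster ids
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*labelings)]

print(reduce_phase(map_phase([0.1, 0.4, 0.6, 0.9])))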
In order to achieve the above object, the present invention provides the following technical solution:
A big data clustering method based on a cloud computing platform, comprising:
Step S100, big data preprocessing: cleaning the real-world data from different data sources by filling in missing values, smoothing noisy data and identifying and deleting outliers, and normalizing the data into a standard format;
Step S200, big data cutting and management: cutting the big data into blocks to obtain multiple data blocks, storing them in the distributed file system HDFS of the cloud platform, and letting Hadoop manage the cut data blocks;
Step S300, establishing the hypergraph model for clustering, which specifically comprises:
establishing a weighted hypergraph H = (V, E), wherein V is the set of vertices, E is the set of hyperedges, and each hyperedge can connect more than two vertices; the vertices of the hypergraph represent the data items to be clustered, and each hyperedge represents the association among the data items represented by the vertices it connects; w(e_m) is the weight corresponding to each hyperedge e_m in E, e_m ∈ E, and w(e_m) is used to measure the degree of correlation among the related items connected by the hyperedge.
The weight of a hyperedge e_m can be determined by either of the following two methods:
(1) taking the support of the association rule corresponding to each hyperedge e_m as the weight of that hyperedge;
(2) taking the average confidence of all essential association rules of each hyperedge e_m as the weight of that hyperedge, wherein an essential association rule is a specific rule that has only one set of data items on the right-hand side of its rule expression and that involves all the data items associated with the hyperedge e_m.
Step S400, big data mapping: mapping each of the cut data blocks to a hypergraph H = (V, E), i.e. each data block is mapped to one hypergraph;
Step S500, clustering each data block by means of its hypergraph:
For a hypergraph H = (V, E), let C be a set of classes over the vertex set V, where each c_i ∈ C is a subset of V and any two classes c_i and c_j are disjoint, c_i ∩ c_j = φ. For a hyperedge e_m and a class c_i, if e_m ∩ c_i ≠ φ, then there is a relation between e_m and c_i, expressed as HC(e_m, c_i), wherein |e_m| denotes the number of vertices in hyperedge e_m, |c_i| denotes the number of vertices in class c_i, and |e_m ∩ c_i| is the number of vertices appearing in both e_m and c_i. Merging class c_i and class c_j yields c_ij = c_i ∪ c_j. For a hyperedge e_m with e_m ∩ c_i ≠ φ, if HC(e_m, c_i) > HC(e_m, c_ij), then e_m contains vertices of c_j; the change of the HC value reflects the similarity of c_i and c_j with respect to hyperedge e_m.
The quality Q(c_i) of a class c_i is defined as
Q(c_i) = Σ_{e_m ∈ E} w(e_m) · HC(e_m, c_i),
i.e. the quality of class c_i is the sum of the HC(e_m, c_i) values over all hyperedges e_m ∈ E, weighted by w(e_m).
The merged index f is defined as
f(c_i, c_j) = Q(c_ij) − [Q(c_i) − Q(c_j)].
The specific clustering process comprises:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and finding, for each class c_i, a class c_j that maximizes their merged index, i.e. the value of f(c_i, c_j); if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) building a new hypergraph from all the merged classes;
(4) repeating steps (1) to (3) until no class can be merged any longer.
Alternatively, the clustering process may comprise:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and finding, for each class c_i, a class c_j that maximizes their merged index, i.e. the value of f(c_i, c_j); if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) building a new hypergraph from all the merged classes;
(4) for the k segmentations {G_1, G_2, ..., G_k} corresponding to said new hypergraph, computing the mean value μ_i of the weights of all edges in the i-th segmentation and the mean-square deviation σ_i of those weights, calculated as follows:
μ_i = (1/n_i) · Σ_{e ∈ G_i} w(e),
σ_i = sqrt( (1/n_i) · Σ_{e ∈ G_i} (w(e) − μ_i)² ),
wherein i = 1, 2, ..., k, e denotes a hyperedge of the hypergraph, G_i denotes the i-th segmentation of the hypergraph, w(e) denotes the weight corresponding to hyperedge e, and n_i denotes the number of hyperedges e in segmentation G_i;
(5) judging whether the mean-square deviation σ_i is greater than a first threshold; if it is greater than the first threshold, repeating the clustering of steps (1) to (4); otherwise, ending the clustering process.
In step S500, the clustering of each data block by means of its hypergraph can also be realized by the following method:
(1) coarsening: constructing a minimal hypergraph based on the hypergraph H = (V, E), such that any partition of this minimal hypergraph, projected onto the initial hypergraph, is of better quality than a partition computed directly on the initial hypergraph in the same time;
In the coarsening phase, a series of successively smaller hypergraphs is constructed. The aim of coarsening is to build a minimal hypergraph such that any partition of it, projected back onto the initial hypergraph, is of better quality than a partition computed directly on the initial hypergraph in the same time. Coarsening also reduces the size of the hyperedges: after several passes, large hyperedges are compressed into small hyperedges connecting only a few vertices. This matters because the refinement heuristics are based on the Kernighan-Lin algorithm, which is very effective for small hyperedges but performs poorly on hyperedges containing many vertices that belong to different partition regions. Different schemes can be chosen for compressing a group of vertices into a single vertex of the next coarser hypergraph: from the viewpoint of node selection, FC (First Choice scheme), GFC (Greedy First Choice scheme), HFC (Hybrid First Choice scheme) and the like; from the viewpoint of node merging, EDGE (Edge scheme), HEDGE (Hyper-Edge scheme), MHEDGE (Modified Hyper-Edge scheme) and the like.
(2) initial partitioning: bisecting the hypergraph coarsened in (1);
In the initial partitioning phase, the coarsened hypergraph is bisected. Because the hypergraph at this point contains very few vertices (usually fewer than 100), many different algorithms can be adopted without unduly affecting the running time or the quality of the overall algorithm; repeated random bisection can be used, as can combinatorial methods, spectral methods and cellular automaton methods.
(3) migration optimization: using the partition of the minimal hypergraph to obtain a partition of a more refined hypergraph;
In the migration optimization phase, the partition of the minimal hypergraph is used to obtain a partition of the more refined hypergraph of the next level: the partition is projected onto the finer hypergraph, and a partition refinement algorithm is applied to reduce the number of partitioning passes and thereby improve partition quality. Because the finer hypergraph of the next level has more degrees of freedom, the refinement algorithm can achieve higher quality. The idea of the V-cycle refinement algorithm is to use the multilevel scheme to improve the quality of the bisection further; it has two parts, a coarsening phase and a migration optimization phase. Its coarsening phase preserves the initial bisection as input to the algorithm, which is referred to as restricted coarsening: the groups of vertices merged into vertices of the coarser hypergraph may only belong to one side of the bisection. As a result, the original bisection is preserved through coarsening and becomes the initial partition refined in the migration optimization phase, which is identical to the migration optimization phase of the multilevel hypergraph partitioning method described above: it moves vertices between the partition regions to improve the quality of the segmentation. Notably, the different coarsened representations of the original hypergraph allow the refinement to improve quality further and help it escape local minima.
(4) the final partition result is taken as the clustering processing result.
Step S600, clustering again the per-block clustering results obtained in step S500 to obtain the final clustering result;
The re-clustering of the results of step S500 can be realized with a variety of clustering methods, such as the k-means clustering method or a hypergraph-based clustering method.
The present invention uses the cloud platform in combination with hypergraph theory to mine and cluster big data, achieving fast, real-time and accurate analysis and processing of big data.
The present invention also provides a big data clustering device based on a cloud computing platform, comprising:
a big data preprocessing module for cleaning the real-world data from different data sources by filling in missing values, smoothing noisy data and identifying and deleting outliers, and for normalizing the data into a standard format;
a big data cutting and management module for cutting the big data into blocks to obtain multiple data blocks and storing them in the distributed file system HDFS of the cloud platform, Hadoop being responsible for managing the cut data blocks;
a hypergraph model establishing module, specifically for:
establishing a weighted hypergraph H = (V, E), wherein V is the set of vertices, E is the set of hyperedges, and each hyperedge can connect more than two vertices; the vertices of the hypergraph represent the data items to be clustered, and each hyperedge represents the association among the data items represented by the vertices it connects; w(e_m) is the weight corresponding to each hyperedge e_m in E, e_m ∈ E, and w(e_m) is used to measure the degree of correlation among the related items connected by the hyperedge.
The weight of a hyperedge e_m can be determined by either of the following two methods:
(1) taking the support of the association rule corresponding to each hyperedge e_m as the weight of that hyperedge;
(2) taking the average confidence of all essential association rules of each hyperedge e_m as the weight of that hyperedge, wherein an essential association rule is a specific rule that has only one set of data items on the right-hand side of its rule expression and that involves all the data items associated with the hyperedge e_m.
a big data mapping module for mapping each of the cut data blocks to a hypergraph H = (V, E), i.e. each data block is mapped to one hypergraph;
a clustering processing module for clustering each data block by means of its hypergraph:
For a hypergraph H = (V, E), let C be a set of classes over the vertex set V, where each c_i ∈ C is a subset of V and any two classes c_i and c_j are disjoint, c_i ∩ c_j = φ. For a hyperedge e_m and a class c_i, if e_m ∩ c_i ≠ φ, then there is a relation between e_m and c_i, expressed as HC(e_m, c_i), wherein |e_m| denotes the number of vertices in hyperedge e_m, |c_i| denotes the number of vertices in class c_i, and |e_m ∩ c_i| is the number of vertices appearing in both e_m and c_i. Merging class c_i and class c_j yields c_ij = c_i ∪ c_j. For a hyperedge e_m with e_m ∩ c_i ≠ φ, if HC(e_m, c_i) > HC(e_m, c_ij), then e_m contains vertices of c_j; the change of the HC value reflects the similarity of c_i and c_j with respect to hyperedge e_m.
The quality Q(c_i) of a class c_i is defined as
Q(c_i) = Σ_{e_m ∈ E} w(e_m) · HC(e_m, c_i),
i.e. the quality of class c_i is the sum of the HC(e_m, c_i) values over all hyperedges e_m ∈ E, weighted by w(e_m).
The merged index f is defined as
f(c_i, c_j) = Q(c_ij) − [Q(c_i) − Q(c_j)].
The specific clustering process comprises:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and finding, for each class c_i, a class c_j that maximizes their merged index, i.e. the value of f(c_i, c_j); if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) building a new hypergraph from all the merged classes;
(4) repeating steps (1) to (3) until no class can be merged any longer.
Alternatively, the clustering process may comprise:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and finding, for each class c_i, a class c_j that maximizes their merged index, i.e. the value of f(c_i, c_j); if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) building a new hypergraph from all the merged classes;
(4) for the k segmentations {G_1, G_2, ..., G_k} corresponding to said new hypergraph, computing the mean value μ_i of the weights of all edges in the i-th segmentation and the mean-square deviation σ_i of those weights, calculated as follows:
μ_i = (1/n_i) · Σ_{e ∈ G_i} w(e),
σ_i = sqrt( (1/n_i) · Σ_{e ∈ G_i} (w(e) − μ_i)² ),
wherein i = 1, 2, ..., k, e denotes a hyperedge of the hypergraph, G_i denotes the i-th segmentation of the hypergraph, w(e) denotes the weight corresponding to hyperedge e, and n_i denotes the number of hyperedges e in segmentation G_i;
(5) judging whether the mean-square deviation σ_i is greater than a first threshold; if it is greater than the first threshold, repeating the clustering of steps (1) to (4); otherwise, ending the clustering process.
a final clustering module for clustering again the per-block clustering results obtained by the clustering processing module to obtain the final clustering result;
the re-clustering of the results of the clustering processing module can be realized with a variety of clustering methods, such as the k-means clustering method or a hypergraph-based clustering method.
Brief description of the drawings
Fig. 1 is a flowchart of the big data clustering method of the present invention;
Fig. 2 is a structural diagram of the big data clustering device of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described clearly and completely below with reference to the accompanying drawings. Exemplary embodiments are described in detail here, with examples shown in the drawings; where the description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless indicated otherwise. The implementations described in the following exemplary embodiments do not represent all embodiments consistent with the present invention; on the contrary, they are merely examples of devices and methods consistent with some aspects of the present invention as detailed in the appended claims.
Referring to Fig. 1, the big data clustering method based on a cloud computing platform proposed by the present invention comprises:
Step S100, big data preprocessing: cleaning the real-world data from different data sources by filling in missing values, smoothing noisy data and identifying and deleting outliers, and normalizing the data into a standard format;
Data preprocessing refers to the processing applied to the data before the main processing. It provides clean, accurate and concise data for the information process, improves processing efficiency and accuracy, and is a very important link in information processing. Real-world data vary widely, and in order to process them uniformly, they must first be preprocessed into standard data meeting the requirements.
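By way of illustration, a small preprocessing sketch for step S100 on tabular data follows; it assumes pandas, and the smoothing window, outlier rule (|z| ≥ 3) and min-max normalization are example choices, not requirements of the invention.

# Illustrative S100 sketch: fill missing values, smooth noise, drop outliers,
# normalize to a standard [0, 1] format. Column names and data are made up.
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.fillna(df.mean(numeric_only=True))          # fill missing values
    df = df.rolling(window=3, min_periods=1).mean()     # smooth noisy readings
    z = (df - df.mean()) / df.std(ddof=0)               # per-column z-scores
    df = df[(z.abs() < 3).all(axis=1)]                  # delete outliers
    return (df - df.min()) / (df.max() - df.min())      # normalize to [0, 1]

print(preprocess(pd.DataFrame({"x": [0.1, None, 0.3, 9.9, 0.2]})))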
Step S200, big data cutting and management: cutting the big data into blocks to obtain multiple data blocks, storing them in the distributed file system HDFS of the cloud platform, and letting Hadoop manage the cut data blocks;
Hadoop, as an open-source implementation of Google's MapReduce model, can divide an application into many small work units, each of which can be executed or re-executed on any node of the cluster. In addition, Hadoop provides a distributed file system for storing data on the computing nodes and offers high throughput for reading and writing data. Many single-processor algorithms have been re-implemented on Hadoop, giving all kinds of algorithms high availability and scalability when processing massive data.
Step S300, establishing the hypergraph model for clustering, which specifically comprises:
establishing a weighted hypergraph H = (V, E), wherein V is the set of vertices, E is the set of hyperedges, and each hyperedge can connect more than two vertices; the vertices of the hypergraph represent the data items to be clustered, and each hyperedge represents the association among the data items represented by the vertices it connects; w(e_m) is the weight corresponding to each hyperedge e_m in E, e_m ∈ E, and w(e_m) is used to measure the degree of correlation among the related items connected by the hyperedge.
The weight of a hyperedge e_m can be determined by either of the following two methods:
(1) taking the support of the association rule corresponding to each hyperedge e_m as the weight of that hyperedge;
(2) taking the average confidence of all essential association rules of each hyperedge e_m as the weight of that hyperedge, wherein an essential association rule is a specific rule that has only one set of data items on the right-hand side of its rule expression and that involves all the data items associated with the hyperedge e_m.
To facilitate the understanding of the present invention, some concepts related to hypergraphs are given below.
Data item and item set: let I = {i_1, i_2, ..., i_m} be a set of m distinct items; each i_k (k = 1, 2, ..., m) is called a data item (item), and the set I of data items is called an item set (itemset). The number of its elements is called the length of the item set, and an item set of length k is called a k-itemset.
Transaction: a transaction T is a subset of the item set I, i.e. T ⊆ I. Each transaction has a unique identifier TID associated with it, and the collection of all the different transactions constitutes the transaction database D.
Support of an item set: let X be an item set, B the number of transactions in database D that contain X, and A the total number of transactions in database D; then the support of item set X is Support(X) = B / A. The support Support(X) describes the importance of the item set X.
Association rule: an association rule is expressed as R: X → Y, where X ⊆ I, Y ⊆ I and X ∩ Y = φ; it means that if the item set X occurs in a transaction, the item set Y will inevitably also occur in the same transaction. X is called the antecedent of the rule and Y its consequent.
Support of an association rule: for an association rule R: X → Y, where X ⊆ I, Y ⊆ I and X ∩ Y = φ, the support of rule R is the ratio of the number of transactions in database D that contain both item set X and item set Y to the total number of transactions.
Confidence of an association rule: for an association rule R: X → Y, where X ⊆ I, Y ⊆ I and X ∩ Y = φ, the confidence of rule R is expressed as Confidence(R) = Support(X ∪ Y) / Support(X), i.e. the probability that item set Y also occurs in the transactions of database D in which item set X occurs.
The support and confidence of an association rule are two measures of its interestingness: confidence measures the accuracy, or strength, of the rule, while support measures its importance, i.e. how frequently it applies. Without support and confidence constraints, a database contains very many association rules, and in practice one is generally interested only in rules meeting certain support and confidence requirements. Therefore, to find meaningful association rules, two basic thresholds are given by the user: the minimum support and the minimum confidence.
Minimum support and frequent item set: the minimum support, denoted minsupp, is the minimum support threshold that a data item set must meet for an association rule to be found; it represents the minimum statistical importance of an item set. Only item sets satisfying the minimum support can appear in association rules; an item set whose support is greater than the minimum support is called a frequent item set (large itemset), otherwise it is called an infrequent item set (small itemset).
Minimum confidence: the minimum confidence, denoted minconf, is the minimum confidence threshold that an association rule must meet; it represents the minimum reliability of an association rule.
Strong association rule: if Support(R) ≥ minsupp and Confidence(R) ≥ minconf, the association rule R: X → Y is called a strong association rule.
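A small sketch of the support and confidence definitions above; the transaction database and the rule are invented for illustration.

# Support and confidence of a rule X -> Y over a transaction database D.
def support(itemset, transactions):
    hits = sum(1 for t in transactions if itemset <= t)   # B: transactions containing the itemset
    return hits / len(transactions)                       # A: all transactions

def confidence(x, y, transactions):
    return support(x | y, transactions) / support(x, transactions)

D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
X, Y = {"a"}, {"b"}
print(support(X | Y, D))      # support of rule X -> Y: 2/4 = 0.5
print(confidence(X, Y, D))    # confidence: 0.5 / 0.75 = 2/3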
A hypergraph is H = (V, E), with vertex set V = {v_1, v_2, ..., v_n} and edge set E = {e_1, e_2, ..., e_m}. Let a_ij denote the number of edges directly joining vertices v_i and v_j, taking possible values 0, 1, 2, ...; the resulting n×n matrix A(H) = (a_ij), a_ij ∈ {0, 1, 2, ...}, is called the adjacency matrix of the hypergraph.
Since the definition of the hypergraph adjacency matrix is an extension of the definition of the adjacency matrix of a simple graph, combining it with the properties of the adjacency matrix yields the following properties of the hypergraph adjacency matrix:
(1) A(H) is a symmetric matrix;
(2) two hypergraphs G and H are isomorphic if and only if there exists a permutation matrix P such that A(H) = PᵀA(G)P.
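A short sketch of building such an adjacency matrix follows, reading a_ij as the number of hyperedges that contain both v_i and v_j; the hyperedges are illustrative.

# Adjacency matrix of a small hypergraph: a_ij counts the hyperedges
# containing both v_i and v_j, so A(H) is symmetric by construction.
import numpy as np
from itertools import combinations

n = 4
edges = [{0, 1, 2}, {1, 2, 3}, {0, 3}]        # hyperedges as vertex sets
A = np.zeros((n, n), dtype=int)
for e in edges:
    for i, j in combinations(sorted(e), 2):   # every vertex pair in the edge
        A[i, j] += 1
        A[j, i] += 1                          # symmetric entry
print(A)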
Step S400, big data mapping: mapping each of the cut data blocks to a hypergraph H = (V, E), i.e. each data block is mapped to one hypergraph;
Step S500, clustering each data block by means of its hypergraph:
For a hypergraph H = (V, E), let C be a set of classes over the vertex set V, where each c_i ∈ C is a subset of V and any two classes c_i and c_j are disjoint, c_i ∩ c_j = φ. For a hyperedge e_m and a class c_i, if e_m ∩ c_i ≠ φ, then there is a relation between e_m and c_i, expressed as HC(e_m, c_i), wherein |e_m| denotes the number of vertices in hyperedge e_m, |c_i| denotes the number of vertices in class c_i, and |e_m ∩ c_i| is the number of vertices appearing in both e_m and c_i. Merging class c_i and class c_j yields c_ij = c_i ∪ c_j. For a hyperedge e_m with e_m ∩ c_i ≠ φ, if HC(e_m, c_i) > HC(e_m, c_ij), then e_m contains vertices of c_j; the change of the HC value reflects the similarity of c_i and c_j with respect to hyperedge e_m.
The quality Q(c_i) of a class c_i is defined as
Q(c_i) = Σ_{e_m ∈ E} w(e_m) · HC(e_m, c_i),
i.e. the quality of class c_i is the sum of the HC(e_m, c_i) values over all hyperedges e_m ∈ E, weighted by w(e_m).
The merged index f is defined as
f(c_i, c_j) = Q(c_ij) − [Q(c_i) − Q(c_j)].
The specific clustering process comprises:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and finding, for each class c_i, a class c_j that maximizes their merged index, i.e. the value of f(c_i, c_j); if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) building a new hypergraph from all the merged classes;
(4) repeating steps (1) to (3) until no class can be merged any longer.
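A compact sketch of this merge loop follows. The HC measure is assumed here to be the overlap ratio |e_m ∩ c_i| / |e_m| (an assumption of ours); Q and f follow the definitions given above, and the hyperedges and weights are illustrative.

# Agglomerative merge loop of steps (1)-(4) over a weighted hypergraph.
def hc(e, c):
    return len(e & c) / len(e) if e & c else 0.0   # assumed HC: overlap ratio

def quality(c, edges):                    # Q(c) = sum of w(e) * HC(e, c)
    return sum(w * hc(e, c) for e, w in edges)

def merged_index(ci, cj, edges):          # f(c_i, c_j) as defined in the text
    return quality(ci | cj, edges) - (quality(ci, edges) - quality(cj, edges))

def cluster(vertices, edges):
    classes = [frozenset([v]) for v in vertices]   # (1) one class per vertex
    while True:
        best = None
        for i, ci in enumerate(classes):           # (2) best merge partner
            for cj in classes[i + 1:]:
                f = merged_index(ci, cj, edges)
                if f > 0 and (best is None or f > best[0]):
                    best = (f, ci, cj)
        if best is None:                           # (4) nothing merged: stop
            return classes
        _, ci, cj = best                           # (3) rebuild class list
        classes = [c for c in classes if c not in (ci, cj)] + [ci | cj]

edges = [(frozenset({0, 1, 2}), 2.0), (frozenset({2, 3}), 1.0)]
print(cluster(range(4), edges))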
Alternatively, the clustering process may comprise:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and finding, for each class c_i, a class c_j that maximizes their merged index, i.e. the value of f(c_i, c_j); if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) building a new hypergraph from all the merged classes;
(4) for the k segmentations {G_1, G_2, ..., G_k} corresponding to said new hypergraph, computing the mean value μ_i of the weights of all edges in the i-th segmentation and the mean-square deviation σ_i of those weights, calculated as follows:
μ_i = (1/n_i) · Σ_{e ∈ G_i} w(e),
σ_i = sqrt( (1/n_i) · Σ_{e ∈ G_i} (w(e) − μ_i)² ),
wherein i = 1, 2, ..., k, e denotes a hyperedge of the hypergraph, G_i denotes the i-th segmentation of the hypergraph, w(e) denotes the weight corresponding to hyperedge e, and n_i denotes the number of hyperedges e in segmentation G_i;
(5) judging whether the mean-square deviation σ_i is greater than a first threshold; if it is greater than the first threshold, repeating the clustering of steps (1) to (4); otherwise, ending the clustering process.
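The per-segmentation statistics of step (4) and the threshold test of step (5) can be sketched as follows; the edge weights and the value of the first threshold are examples, not values prescribed by the invention.

# Mean and mean-square deviation of hyperedge weights per segmentation G_i,
# followed by the first-threshold test that decides whether to re-cluster.
import math

def weight_stats(seg_weights):
    n = len(seg_weights)                       # n_i: hyperedges in segment G_i
    mu = sum(seg_weights) / n                  # mean weight
    sigma = math.sqrt(sum((w - mu) ** 2 for w in seg_weights) / n)
    return mu, sigma

segments = [[2.0, 2.1, 1.9], [0.5, 3.5]]       # edge weights per segmentation
FIRST_THRESHOLD = 1.0
for i, ws in enumerate(segments):
    mu, sigma = weight_stats(ws)
    repeat = sigma > FIRST_THRESHOLD           # repeat steps (1)-(4) if true
    print(f"G_{i}: mean={mu:.2f} sigma={sigma:.2f} repeat={repeat}")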
In step S500, the clustering of each data block by means of its hypergraph can also be realized by the following method:
(1) coarsening: constructing a minimal hypergraph based on the hypergraph H = (V, E), such that any partition of this minimal hypergraph, projected onto the initial hypergraph, is of better quality than a partition computed directly on the initial hypergraph in the same time;
In the coarsening phase, a series of successively smaller hypergraphs is constructed. The aim of coarsening is to build a minimal hypergraph such that any partition of it, projected back onto the initial hypergraph, is of better quality than a partition computed directly on the initial hypergraph in the same time. Coarsening also reduces the size of the hyperedges: after several passes, large hyperedges are compressed into small hyperedges connecting only a few vertices. This matters because the refinement heuristics are based on the Kernighan-Lin algorithm, which is very effective for small hyperedges but performs poorly on hyperedges containing many vertices that belong to different partition regions. Different schemes can be chosen for compressing a group of vertices into a single vertex of the next coarser hypergraph: from the viewpoint of node selection, FC (First Choice scheme), GFC (Greedy First Choice scheme), HFC (Hybrid First Choice scheme) and the like; from the viewpoint of node merging, EDGE (Edge scheme), HEDGE (Hyper-Edge scheme), MHEDGE (Modified Hyper-Edge scheme) and the like.
(2) initial partitioning: bisecting the hypergraph coarsened in (1);
In the initial partitioning phase, the coarsened hypergraph is bisected. Because the hypergraph at this point contains very few vertices (usually fewer than 100), many different algorithms can be adopted without unduly affecting the running time or the quality of the overall algorithm; repeated random bisection can be used, as can combinatorial methods, spectral methods and cellular automaton methods.
(3) migration optimization: using the partition of the minimal hypergraph to obtain a partition of a more refined hypergraph;
In the migration optimization phase, the partition of the minimal hypergraph is used to obtain a partition of the more refined hypergraph of the next level: the partition is projected onto the finer hypergraph, and a partition refinement algorithm is applied to reduce the number of partitioning passes and thereby improve partition quality. Because the finer hypergraph of the next level has more degrees of freedom, the refinement algorithm can achieve higher quality. The idea of the V-cycle refinement algorithm is to use the multilevel scheme to improve the quality of the bisection further; it has two parts, a coarsening phase and a migration optimization phase. Its coarsening phase preserves the initial bisection as input to the algorithm, which is referred to as restricted coarsening: the groups of vertices merged into vertices of the coarser hypergraph may only belong to one side of the bisection. As a result, the original bisection is preserved through coarsening and becomes the initial partition refined in the migration optimization phase, which is identical to the migration optimization phase of the multilevel hypergraph partitioning method described above: it moves vertices between the partition regions to improve the quality of the segmentation. Notably, the different coarsened representations of the original hypergraph allow the refinement to improve quality further and help it escape local minima.
(4) the final partition result is taken as the clustering processing result (a simplified coarsening sketch is given below).
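The following simplified sketch shows one first-choice-style coarsening pass, merging each unmatched vertex with its most strongly connected unmatched neighbour; the connectivity heuristic and data layout are ours for illustration, not the patent's exact FC scheme.

# One FC-style coarsening pass: visit vertices in order, merge each with the
# unmatched neighbour to which it is most heavily connected, roughly halving
# the vertex count of the next coarser hypergraph.
from collections import defaultdict

def coarsen(n_vertices, edges):                # edges: (vertex set, weight)
    conn = defaultdict(float)                  # pairwise connectivity strength
    for e, w in edges:
        verts = sorted(e)
        for a in verts:
            for b in verts:
                if a < b:
                    conn[(a, b)] += w / (len(e) - 1)
    matched, merge_map = set(), {}
    for v in range(n_vertices):                # first-choice visiting order
        if v in matched:
            continue
        nbrs = [(s, pair) for pair, s in conn.items()
                if v in pair and not (set(pair) - {v}) & matched]
        if nbrs:
            _, pair = max(nbrs)                # strongest unmatched neighbour
            u = (set(pair) - {v}).pop()
            matched |= {v, u}
            merge_map[u] = v                   # u collapses into v
    return merge_map                           # coarse vertex = representative

edges = [({0, 1, 2}, 2.0), ({2, 3}, 1.0), ({1, 3}, 1.0)]
print(coarsen(4, edges))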
Step S600, clustering again the per-block clustering results obtained in step S500 to obtain the final clustering result;
The re-clustering of the results of step S500 can be realized with a variety of clustering methods, such as the k-means clustering method or a hypergraph-based clustering method, as sketched below.
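A sketch of this re-clustering step, using k-means over the centroids of the block-level classes; the data, the choice of k, and the use of scikit-learn's KMeans are illustrative assumptions, standing in for any concrete k-means routine.

# S600 sketch: merge per-block clusterings by re-clustering their centroids.
import numpy as np
from sklearn.cluster import KMeans

block_results = [                      # per-block classes as point arrays
    [np.array([[0.1, 0.2], [0.2, 0.1]]), np.array([[0.9, 0.8]])],
    [np.array([[0.8, 0.9], [0.7, 0.8]]), np.array([[0.0, 0.1]])],
]
centroids = np.array([cls.mean(axis=0)            # one centroid per class
                      for classes in block_results for cls in classes])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(centroids)
print(labels)                          # global cluster id of each local class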
The present invention uses the cloud platform in combination with hypergraph theory to mine and cluster big data, achieving fast, real-time and accurate analysis and processing of big data.
Referring to Fig. 2, the big data clustering device based on a cloud computing platform proposed by the present invention comprises:
a big data preprocessing module for cleaning the real-world data from different data sources by filling in missing values, smoothing noisy data and identifying and deleting outliers, and for normalizing the data into a standard format;
data preprocessing refers to the processing applied to the data before the main processing: it provides clean, accurate and concise data for the information process, improves processing efficiency and accuracy, and is a very important link in information processing, since real-world data vary widely and must first be preprocessed into standard data meeting the requirements before they can be processed uniformly;
a big data cutting and management module for cutting the big data into blocks to obtain multiple data blocks and storing them in the distributed file system HDFS of the cloud platform, Hadoop being responsible for managing the cut data blocks;
Hadoop, as an open-source implementation of Google's MapReduce model, can divide an application into many small work units, each of which can be executed or re-executed on any node of the cluster; in addition, Hadoop provides a distributed file system for storing data on the computing nodes and offers high throughput for reading and writing data, and many single-processor algorithms have been re-implemented on Hadoop, giving all kinds of algorithms high availability and scalability when processing massive data;
a hypergraph model establishing module, specifically for:
establishing a weighted hypergraph H = (V, E), wherein V is the set of vertices, E is the set of hyperedges, and each hyperedge can connect more than two vertices; the vertices of the hypergraph represent the data items to be clustered, and each hyperedge represents the association among the data items represented by the vertices it connects; w(e_m) is the weight corresponding to each hyperedge e_m in E, e_m ∈ E, and w(e_m) is used to measure the degree of correlation among the related items connected by the hyperedge.
The weight of a hyperedge e_m can be determined by either of the following two methods:
(1) taking the support of the association rule corresponding to each hyperedge e_m as the weight of that hyperedge;
(2) taking the average confidence of all essential association rules of each hyperedge e_m as the weight of that hyperedge, wherein an essential association rule is a specific rule that has only one set of data items on the right-hand side of its rule expression and that involves all the data items associated with the hyperedge e_m.
a big data mapping module for mapping each of the cut data blocks to a hypergraph H = (V, E), i.e. each data block is mapped to one hypergraph;
a clustering processing module for clustering each data block by means of its hypergraph:
For a hypergraph H = (V, E), let C be a set of classes over the vertex set V, where each c_i ∈ C is a subset of V and any two classes c_i and c_j are disjoint, c_i ∩ c_j = φ. For a hyperedge e_m and a class c_i, if e_m ∩ c_i ≠ φ, then there is a relation between e_m and c_i, expressed as HC(e_m, c_i), wherein |e_m| denotes the number of vertices in hyperedge e_m, |c_i| denotes the number of vertices in class c_i, and |e_m ∩ c_i| is the number of vertices appearing in both e_m and c_i. Merging class c_i and class c_j yields c_ij = c_i ∪ c_j. For a hyperedge e_m with e_m ∩ c_i ≠ φ, if HC(e_m, c_i) > HC(e_m, c_ij), then e_m contains vertices of c_j; the change of the HC value reflects the similarity of c_i and c_j with respect to hyperedge e_m.
The quality Q(c_i) of a class c_i is defined as
Q(c_i) = Σ_{e_m ∈ E} w(e_m) · HC(e_m, c_i),
i.e. the quality of class c_i is the sum of the HC(e_m, c_i) values over all hyperedges e_m ∈ E, weighted by w(e_m).
The merged index f is defined as
f(c_i, c_j) = Q(c_ij) − [Q(c_i) − Q(c_j)].
The specific clustering process comprises:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and finding, for each class c_i, a class c_j that maximizes their merged index, i.e. the value of f(c_i, c_j); if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) building a new hypergraph from all the merged classes;
(4) repeating steps (1) to (3) until no class can be merged any longer.
Alternatively, the clustering process may comprise:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and finding, for each class c_i, a class c_j that maximizes their merged index, i.e. the value of f(c_i, c_j); if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) building a new hypergraph from all the merged classes;
(4) for the k segmentations {G_1, G_2, ..., G_k} corresponding to said new hypergraph, computing the mean value μ_i of the weights of all edges in the i-th segmentation and the mean-square deviation σ_i of those weights, calculated as follows:
μ_i = (1/n_i) · Σ_{e ∈ G_i} w(e),
σ_i = sqrt( (1/n_i) · Σ_{e ∈ G_i} (w(e) − μ_i)² ),
wherein i = 1, 2, ..., k, e denotes a hyperedge of the hypergraph, G_i denotes the i-th segmentation of the hypergraph, w(e) denotes the weight corresponding to hyperedge e, and n_i denotes the number of hyperedges e in segmentation G_i;
(5) judging whether the mean-square deviation σ_i is greater than a first threshold; if it is greater than the first threshold, repeating the clustering of steps (1) to (4); otherwise, ending the clustering process.
a final clustering module for clustering again the per-block clustering results obtained by the clustering processing module to obtain the final clustering result;
the re-clustering of the results of the clustering processing module can be realized with a variety of clustering methods, such as the k-means clustering method or a hypergraph-based clustering method.
Those skilled in the art, having considered the specification and practiced the invention disclosed herein, will easily conceive of other embodiments of the present invention. The present application is intended to cover any variations, uses or adaptations of the present invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein.
It should be understood that the present invention is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.