CN104820708A - Cloud computing platform based big data clustering method and device - Google Patents


Info

Publication number
CN104820708A
CN104820708A (application CN201510249032.XA / CN201510249032A; granted publication CN104820708B)
Authority
CN
China
Prior art keywords
class
data
hypergraph
hyperedge
vertex
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510249032.XA
Other languages
Chinese (zh)
Other versions
CN104820708B (en)
Inventor
马泳宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanning First Station Network Technology Co., Ltd.
Original Assignee
Chengdu Rui Feng Science And Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Rui Feng Science And Technology Ltd
Priority to CN201510249032.XA
Publication of CN104820708A
Application granted
Publication of CN104820708B
Active legal status
Anticipated expiration legal status

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 — Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 — Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a cloud computing platform based big data clustering method. The method comprises: step 100, big data preprocessing; step 200, big data partitioning and management; step 300, establishing a hypergraph model for clustering; step 400, big data mapping, in which each partitioned data block is mapped to a hypergraph H=(V, E), i.e. one hypergraph per data block; step 500, clustering each data block by means of its hypergraph; and step 600, clustering the per-block results obtained in step 500 once more to obtain the final clustering result. The method mines and clusters big data on a cloud platform using hypergraph theory, achieving fast, real-time, and accurate big data analysis and processing.

Description

A big data clustering method and device based on a cloud computing platform
Technical field
The present invention relates to the field of data mining, and in particular to a big data clustering method and device based on a cloud computing platform.
Background technology
Over the past half century, as computer technology has become woven into every aspect of social life, the accumulation of information has reached a point where quantitative growth is producing qualitative change. Not only is the world flooded with more information than ever before, but the rate of growth itself is accelerating. Disciplines at the front line of this information explosion, such as astronomy and genomics, coined the concept of "big data", a concept that has since been applied in nearly every field of human thought and development. The 21st century is an era of large-scale data growth: mobile internet, social networks, e-commerce and the like have greatly expanded the boundaries and applications of the internet, and data of all kinds is expanding rapidly. The internet (social media, search, e-commerce), the mobile internet (microblogs), the Internet of Things (sensors, the smart planet), connected vehicles, GPS, medical imaging, security surveillance, finance (banking, stock markets, insurance) and telecommunications (calls, text messages) all generate data at a furious pace. Individual users only crossed into the terabyte era around 2006, when the world generated roughly 180 EB of new data in total; by 2011 that figure had reached 1.8 ZB (1 ZB = 1 billion TB).
Big data represents both a surge in data volume (from ERP/CRM data, expanding step by step to internet data and then to sensor data from the Internet of Things) and a rise in data complexity. Big data can be described as the qualitative change that emerges once data accumulates past a certain scale. Its data types are rich and varied: structured information such as conventional database records coexists with unstructured information such as text and video, and the required speed of data acquisition and processing keeps increasing.
Big data includes, but goes beyond, the notion of "massive data": in short, big data is massive data of complex types. It encompasses all data sets, both transactional and interactional, whose scale or complexity exceeds the ability of common techniques to capture, manage and process them at reasonable cost and within a reasonable time.
Big data has formed from the convergence of three major technology trends:
Massive transaction data: in online transaction processing (OLTP) and in data warehouse and analytic applications ranging from ERP programs onward, traditional relational data as well as unstructured and semi-structured information continue to grow. As more data and business processes move to public and private clouds, the situation becomes more complex. Internally managed transaction information consists mainly of online transaction data and online analytical data: structured, static historical data managed and accessed through relational databases. From these data we can understand what happened in the past.
Massive interaction data: this new force consists of social media data from Facebook, Twitter, LinkedIn and other sources. It includes call detail records (CDR), device and sensor information, GPS and geo-location mapping data, large image files delivered over Managed File Transfer protocols, web text and clickstream data, scientific information, email, and so on. These data can tell us what may happen in the future.
Massive data processing: multiple lightweight databases receive data from clients and import it into a centralized large-scale distributed database or distributed storage cluster; the distributed database then serves ordinary queries and classified summaries over the massive centralized data, satisfying most common analysis needs, while data mining on top of those queries can satisfy higher-level analysis requirements. For example, YunTable is a new-generation distributed database developed on the basis of traditional distributed databases and newer NoSQL technology; with it, distributed clusters of hundred-node scale can be built to manage PB-scale data.
Facing the onslaught of big data, traditional data processing methods find it increasingly hard to cope. Much of the time we stand before a gold mine without effective tools and means, able only to sigh at the "data". The main difficulties big data poses for conventional analytical techniques are:
Limited analysis methods prevent full use of all the data;
Limited analysis capability leaves difficult questions unanswered;
Deadline pressure forces the adoption of a single, overly simple modeling technique;
Insufficient computing time compromises model accuracy.
As for the current state of clustering research in data mining, existing approaches to mining big data clusters mostly sample the data, select representative points, and use those points to stand in for the whole in cluster analysis. When processing big data, methods based on sampling probability are generally adopted, but such sampling methods consider neither the overall relative distances between data points or between intervals nor the unevenness of the data distribution, leading to overly rigid interval partitioning. Although clustering, fuzzy concepts, cloud models and the like were later introduced to mitigate the rigid-partitioning problem, with good results, these methods still did not account for the differing contributions of individual data points to the knowledge-discovery task. Therefore, to make the mined clustering rules more effective and faster to obtain, cluster analysis must be studied more deeply, starting from full consideration of the differing roles of data points. Cloud computing was proposed precisely to handle processing across the data points of real big data, and this provides a powerful theoretical foundation for mining more effective clustering rules.
Summary of the invention
To solve the above problems in the prior art, the invention discloses a big data clustering method and device based on a cloud computing platform, which combines the MapReduce programming model with a clustering algorithm to process big data quickly and effectively and to continuously mine valuable information from the data.
MapReduce is a programming model developed by Google, used mainly for processing large-scale (TB-level) data files. Its core idea is to build the elementary unit of computation from the concepts of "Map" and "Reduce": a Map program first cuts the data into unrelated blocks and distributes (schedules) them to a large number of computers for processing, achieving distributed computation; a Reduce program then gathers the results into the final output, so that massive data can be processed in parallel. Its general form is as follows:
Map(k1, v1) -> list(k2, v2)
Reduce(k2, list(v2)) -> list(v2)
In brief, the MapReduce programming model divides the input data file into M independent data fragments (splits), which are assigned to multiple workers; M Map functions execute concurrently and write their intermediate results, as key/value pairs, to local intermediate files. The intermediate key/value pairs are grouped by key and Reduce functions are executed: based on the intermediate-file location information obtained from the Master, Reduce commands are sent to the nodes holding the intermediate files, which compute and emit the final result into R output files, further reducing the bandwidth needed to transfer intermediate files.
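To illustrate the general form above, the following is a minimal in-memory sketch of the Map(k1, v1) -> list(k2, v2) / Reduce(k2, list(v2)) -> list(v2) pipeline, using word counting over hypothetical splits; it models the data flow only, not Hadoop's distributed execution:

```python
from collections import defaultdict

def map_fn(split_id, text):
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the split
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # Reduce(k2, list(v2)) -> list(v2): aggregate all values for one key
    return [sum(counts)]

def map_reduce(splits):
    # shuffle step: group the intermediate key/value pairs by key
    grouped = defaultdict(list)
    for split_id, text in splits.items():
        for k2, v2 in map_fn(split_id, text):
            grouped[k2].append(v2)
    return {k2: reduce_fn(k2, v2s) for k2, v2s in grouped.items()}

splits = {"s1": "big data cluster", "s2": "big data platform"}
print(map_reduce(splits))  # {'big': [2], 'data': [2], 'cluster': [1], 'platform': [1]}
```

In a real deployment the grouped intermediate pairs would live in per-node intermediate files rather than one in-memory dictionary.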
MapReduce relies on HDFS for its implementation. MapReduce typically divides the computation data into many small blocks; HDFS replicates each block several times to ensure system reliability, and places the blocks on different machines in the cluster according to certain rules so that MapReduce can compute on the machines where the data resides. HDFS is an open-source counterpart of Google's GFS: a fault-tolerant distributed file system that provides high-throughput data access and is suited to storing massive (PB-level) large files (usually over 64 MB).
The present invention uses the MapReduce programming model to design a clustering-ensemble algorithm: the big data is cut into blocks and stored in the HDFS distributed file system of the cloud platform, with Hadoop managing the blocks, each keyed by the data block Di to which it belongs. Each computer in the cluster applies a clustering algorithm to its locally stored blocks to obtain base clustering results, and a consistency scheme Reduces the clustering results of the same machine (key = machine number, value = clustering result) into that machine's final ensemble result, thereby processing big data effectively in parallel and further improving data processing performance and efficiency.
In order to achieve the above objects, the invention provides the following technical scheme:
A big data clustering method based on a cloud computing platform, comprising:
Step S100, big data preprocessing: cleaning the real-world data from different data sources by filling in missing values, smoothing noisy data and identifying and deleting outliers, then standardizing the data and converting it into a standard format;
Step S200, big data partitioning and management: cutting the big data into blocks to obtain multiple data blocks, storing them in the HDFS distributed file system of the cloud platform, with Hadoop managing the partitioned data blocks;
Step S300, establishing the hypergraph model for clustering, specifically comprising:
building a weighted hypergraph H=(V, E), where V is the set of vertices and E is the set of hyperedges, each hyperedge connecting two or more vertices; the vertices of the hypergraph represent the data items to be clustered, and each hyperedge represents the association among the data items represented by the vertices it connects; w(e_m) is the weight of each hyperedge e_m ∈ E and measures the degree of correlation among the adjacent items joined by that hyperedge;
The weight of hyperedge e_m can be determined by either of two methods:
(1) taking the support of the association rule of hyperedge e_m as the weight of the hyperedge;
(2) taking the mean confidence of all essential association rules of hyperedge e_m as the weight of the hyperedge; an essential association rule is a specific rule whose right-hand side contains only one item set, and which covers all the data items associated with hyperedge e_m.
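The two weighting methods can be sketched as follows; the rule supports and confidences are hypothetical placeholder values, since the text does not fix a particular rule-mining procedure:

```python
def edge_weight_from_support(rule_support):
    # Method (1): weight of hyperedge e_m = support of its association rule
    return rule_support

def edge_weight_from_confidence(essential_rule_confidences):
    # Method (2): weight of e_m = mean confidence of all its essential
    # association rules (rules with a single itemset on the right-hand side
    # that cover all data items of the hyperedge)
    return sum(essential_rule_confidences) / len(essential_rule_confidences)

# hypothetical values for one hyperedge over items {A, B, C}
w1 = edge_weight_from_support(0.12)
w2 = edge_weight_from_confidence([0.80, 0.60, 0.70])
print(w1, round(w2, 2))  # 0.12 0.7
```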
Step S400, big data mapping: mapping each partitioned data block to a hypergraph H=(V, E), i.e. one hypergraph per data block;
Step S500, clustering each data block by means of its hypergraph:
For the hypergraph H=(V, E), let C be a set of classes over the vertices V; each c_i ∈ C is a subset of V, and any two classes c_i and c_j satisfy c_i ∩ c_j = ∅. For a hyperedge e_m and a class c_i, if e_m ∩ c_i ≠ ∅, then there is a relation between e_m and c_i, expressed as:
HC(e_m, c_i) = (|e_m ∩ c_i| / |e_m|) × (|e_m ∩ c_i| / |c_i|),
wherein |e_m| denotes the number of vertices in hyperedge e_m, |c_i| the number of vertices in class c_i, and |e_m ∩ c_i| the number of vertices appearing in both e_m and c_i. Merging class c_i and class c_j yields c_ij = c_i ∪ c_j. For a hyperedge e_m with e_m ∩ c_i ≠ ∅, if HC(e_m, c_i) > HC(e_m, c_ij), then hyperedge e_m contains no vertices of class c_j; the change in the HC value reflects the similarity of c_i and c_j with respect to hyperedge e_m. The quality Q(c_i) of class c_i is defined as:
Q(c_i) = Σ_{e_m ∈ E} |e_m ∩ c_i| × w(e_m) × HC(e_m, c_i),
i.e. the quality of class c_i is the weighted sum of the HC(e_m, c_i) values over all hyperedges e_m ∈ E;
The merge index f is defined as:
f(c_i, c_j) = Q(c_ij) - [Q(c_i) + Q(c_j)];
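A minimal sketch of the HC, Q and f definitions above, on a toy hypergraph; the sign convention f = Q(c_ij) - [Q(c_i) + Q(c_j)] (a gain that is positive when merging improves quality) is assumed, as the original formula is garbled:

```python
def HC(edge, cls):
    # HC(e_m, c_i) = (|e_m ∩ c_i| / |e_m|) * (|e_m ∩ c_i| / |c_i|)
    inter = len(edge & cls)
    return (inter / len(edge)) * (inter / len(cls))

def quality(edges, weights, cls):
    # Q(c_i): sum over hyperedges of |e_m ∩ c_i| * w(e_m) * HC(e_m, c_i)
    return sum(len(e & cls) * w * HC(e, cls)
               for e, w in zip(edges, weights) if e & cls)

def merge_gain(edges, weights, ci, cj):
    # assumed convention: f(c_i, c_j) = Q(c_i ∪ c_j) - [Q(c_i) + Q(c_j)]
    return quality(edges, weights, ci | cj) - (
        quality(edges, weights, ci) + quality(edges, weights, cj))

# toy hypergraph: one strong hyperedge {1, 2, 3} and one weaker hyperedge {3, 4}
edges = [frozenset({1, 2, 3}), frozenset({3, 4})]
weights = [1.0, 0.5]
gain = merge_gain(edges, weights, frozenset({1}), frozenset({2}))
print(round(gain, 4))  # 0.6667: merging two vertices of the same hyperedge pays off
```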
The detailed clustering process comprises:
(1) initializing the class set C so that each class in C corresponds to one vertex of V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their merge index f(c_i, c_j); if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) building a new hypergraph from all the merged classes;
(4) repeating steps (1) to (3) until no more classes can be merged;
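Steps (1) to (4) can be sketched as an agglomerative loop; this is a simplified single-machine illustration under the assumed merge-gain convention f = Q(c_ij) - [Q(c_i) + Q(c_j)], not the patent's distributed implementation:

```python
def HC(e, c):
    # HC(e_m, c_i) = (|e_m ∩ c_i| / |e_m|) * (|e_m ∩ c_i| / |c_i|)
    inter = len(e & c)
    return (inter / len(e)) * (inter / len(c)) if inter else 0.0

def Q(edges, weights, c):
    # class quality: weighted sum of HC values over all hyperedges
    return sum(len(e & c) * w * HC(e, c) for e, w in zip(edges, weights))

def f(edges, weights, ci, cj):
    # assumed merge index: f(c_i, c_j) = Q(c_ij) - [Q(c_i) + Q(c_j)]
    return Q(edges, weights, ci | cj) - (Q(edges, weights, ci) + Q(edges, weights, cj))

def cluster_pass(classes, edges, weights):
    # steps (2)-(3): pair each class with its best partner, merge when f > 0
    merged, used = [], set()
    for i, ci in enumerate(classes):
        if i in used:
            continue
        best_j, best_f = None, 0.0
        for j in range(len(classes)):
            if j != i and j not in used:
                g = f(edges, weights, ci, classes[j])
                if g > best_f:
                    best_j, best_f = j, g
        if best_j is None:
            merged.append(ci)
            used.add(i)
        else:
            merged.append(ci | classes[best_j])
            used.update({i, best_j})
    return merged

def hypergraph_cluster(vertices, edges, weights):
    # step (1): one class per vertex; repeat passes until no class is merged
    classes = [frozenset({v}) for v in vertices]
    while True:
        nxt = cluster_pass(classes, edges, weights)
        if len(nxt) == len(classes):
            return classes
        classes = nxt

edges = [frozenset({1, 2}), frozenset({3, 4})]
result = hypergraph_cluster([1, 2, 3, 4], edges, [1.0, 1.0])
print(sorted(sorted(c) for c in result))  # [[1, 2], [3, 4]]
```

Here the two disconnected hyperedges yield two final classes; vertices joined by a common hyperedge merge, while a cross merge has negative gain and is rejected.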
Alternatively, the detailed clustering process may comprise:
(1) initializing the class set C so that each class in C corresponds to one vertex of V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their merge index f(c_i, c_j); if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) building a new hypergraph from all the merged classes;
(4) letting the new hypergraph correspond to k partitions {G_1, G_2, …, G_k}, with w_i the mean weight of all hyperedges in the i-th partition and D_i the mean square deviation of those weights, calculated as:
D_i = Σ_{e ⊆ G_i} (w(e) - w_i)² / |{e | e ⊆ G_i}|,
wherein i = 1, 2, …, k; e denotes a hyperedge of the hypergraph; G_i denotes the i-th partition; w(e) is the weight of hyperedge e; and |{e | e ⊆ G_i}| is the number of hyperedges contained in partition G_i;
(5) judging whether D_i is greater than a first threshold: if so, repeating the clustering of steps (1) to (4); otherwise, ending the clustering process.
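The per-partition statistics of step (4) and the threshold test of step (5) can be sketched as follows; the hyperedge names, weights, and the threshold value are hypothetical:

```python
def partition_stats(edge_names, w):
    # mean weight and mean square deviation D_i of the hyperedges in one partition G_i
    ws = [w[e] for e in edge_names]
    mean = sum(ws) / len(ws)
    msd = sum((x - mean) ** 2 for x in ws) / len(ws)
    return mean, msd

def keep_refining(partitions, w, first_threshold):
    # step (5): repeat steps (1)-(4) while any partition's D_i exceeds the threshold
    return any(partition_stats(p, w)[1] > first_threshold for p in partitions)

w = {"e1": 1.0, "e2": 1.0, "e3": 0.2}        # hypothetical hyperedge weights
partitions = [["e1", "e2"], ["e2", "e3"]]     # k = 2 partitions of the hypergraph
print(keep_refining(partitions, w, 0.1))  # True: D_2 = 0.16 exceeds the threshold
```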
In step S500, clustering each data block by means of its hypergraph can also be realized as follows:
(1) coarsening: constructing a minimal hypergraph from H=(V, E) such that, in equal time, any partition of this minimal hypergraph projects onto the initial hypergraph with better quality than a partition made directly on the initial hypergraph;
In the coarsening phase of the hypergraph we construct a series of successively smaller hypergraphs. The aim of coarsening is to build a minimal hypergraph with the projection property just described. In addition, coarsening reduces the size of the hyperedges: after coarsening, large hyperedges are compressed into small hyperedges connecting only a few vertices. This matters because the refinement heuristic is based on the Kernighan-Lin algorithm, which is very effective for small hyperedges but performs poorly on hyperedges that contain many vertices belonging to different partition regions. When forming the next-level coarsened hypergraph, different methods can be chosen for compressing a group of vertices into a single vertex. From the viewpoint of node selection, they include FC (First Choice scheme), GFC (Greedy First Choice scheme) and HFC (Hybrid First Choice scheme); from the viewpoint of node merging, they include EDGE (Edge scheme), HEDGE (Hyper-Edge scheme) and MHEDGE (Modified Hyper-Edge scheme).
(2) initial partitioning: bisecting the coarsened hypergraph of (1);
In the initial partitioning phase we bisect the coarsened hypergraph. Because the hypergraph now contains very few vertices (generally fewer than 100), many different algorithms can be adopted without unduly affecting the running time or quality of the overall algorithm. Random bisection can be applied repeatedly; combinatorial methods, spectral methods and cellular-automaton methods can also be used for the partitioning.
(3) migration and optimization: using the partition of the minimal hypergraph to obtain a partition of the more refined hypergraph;
In the migration-optimization phase we use the partition of the minimal hypergraph to derive a partition of the next, more refined hypergraph. This is realized by projecting onto the finer hypergraph of the next level and applying a partition-refinement algorithm, which reduces the number of partitioning passes and thus improves partition quality. Because the finer hypergraph of the next level has more degrees of freedom, the refinement algorithm can achieve higher quality. The idea of the V-cycle refinement algorithm is to use the multilevel structure to further improve the quality of the bisection. It consists of two parts: a coarsening phase and a migration-optimization phase. The coarsening phase preserves the initial partition as the input of the algorithm; we call this restricted coarsening. Under the restricted coarsening plan, each group of vertices merged to form a vertex of the coarsened graph may only belong to one side of the bisection. As a result, the original bisection is preserved through coarsening and becomes the initial partition that we refine in the migration-optimization phase, which then repeats the migration-optimization phase of the multilevel hypergraph partitioning method described above: it moves vertices between the regions of the partition to improve the quality of the cut. Note that the various coarsened representations of the original hypergraph allow refinement to improve quality further and help it escape local minima.
(4) the final partition result is the clustering result.
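The migration-optimization idea — moving vertices between the two regions whenever a move lowers the hyperedge cut, subject to a balance constraint — can be illustrated with a toy Kernighan-Lin-style pass; this is a deliberately simplified sketch on a hypothetical hypergraph, not the full multilevel V-cycle:

```python
def cut_size(edges, side):
    # number of hyperedges whose vertices fall on both sides of the bisection
    return sum(1 for e in edges if len({side[v] for v in e}) > 1)

def refine(edges, side, max_imbalance=2):
    # greedy vertex moves: keep any balanced move that reduces the cut
    improved = True
    while improved:
        improved = False
        for v in side:
            before = cut_size(edges, side)
            side[v] ^= 1                      # tentatively move v to the other region
            sizes = list(side.values())
            balanced = abs(sizes.count(0) - sizes.count(1)) <= max_imbalance
            if balanced and cut_size(edges, side) < before:
                improved = True               # keep the move
            else:
                side[v] ^= 1                  # undo the move
    return side

edges = [frozenset({1, 2}), frozenset({2, 3}), frozenset({4, 5}), frozenset({5, 6})]
side = {1: 0, 2: 0, 3: 1, 4: 1, 5: 0, 6: 1}   # a poor initial bisection, cut = 3
refine(edges, side)
print(cut_size(edges, side))  # 0: refinement recovers {1, 2, 3} vs {4, 5, 6}
```

A production implementation would use gain buckets (FM-style) rather than recomputing the cut per move, but the vertex-migration principle is the same.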
Step S600, clustering the per-block clustering results obtained in step S500 once more to obtain the final clustering result;
This re-clustering of the results of step S500 can be realized with a variety of clustering methods, such as k-means clustering or hypergraph-based clustering.
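For the k-means option, the step-S600 re-clustering over the per-block results can be sketched as follows; the block-level centroids are hypothetical, and the initialization is deliberately simple:

```python
def kmeans(points, k, iters=20):
    # plain k-means over the per-block cluster centroids (step S600 re-clustering)
    centers = points[:k]                       # simple deterministic initialization
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center (squared Euclidean distance)
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[j].append(p)
        # recompute each center as the mean of its group (keep old center if empty)
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers, groups

# hypothetical centroids produced by clustering each data block in step S500
block_centroids = [(0.1, 0.2), (0.0, 0.1), (5.0, 5.1), (5.2, 4.9)]
centers, groups = kmeans(block_centroids, k=2)
print(sorted(len(g) for g in groups))  # [2, 2]: two blocks per final cluster
```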
The invention mines and clusters big data on a cloud platform using hypergraph theory, achieving fast, real-time and accurate big data analysis and processing.
The invention also proposes a big data clustering device based on a cloud computing platform, comprising:
a big data preprocessing unit, for cleaning the real-world data from different data sources by filling in missing values, smoothing noisy data and identifying and deleting outliers, then standardizing the data and converting it into a standard format;
a big data partitioning and management unit, for cutting the big data into blocks to obtain multiple data blocks, storing them in the HDFS distributed file system of the cloud platform, with Hadoop managing the partitioned data blocks;
a clustering hypergraph model unit, specifically for:
building a weighted hypergraph H=(V, E), where V is the set of vertices and E is the set of hyperedges, each hyperedge connecting two or more vertices; the vertices of the hypergraph represent the data items to be clustered, and each hyperedge represents the association among the data items represented by the vertices it connects; w(e_m) is the weight of each hyperedge e_m ∈ E and measures the degree of correlation among the adjacent items joined by that hyperedge;
the weight of hyperedge e_m can be determined by either of two methods:
(1) taking the support of the association rule of hyperedge e_m as the weight of the hyperedge;
(2) taking the mean confidence of all essential association rules of hyperedge e_m as the weight of the hyperedge; an essential association rule is a specific rule whose right-hand side contains only one item set, and which covers all the data items associated with hyperedge e_m.
a big data mapping unit, for mapping each partitioned data block to a hypergraph H=(V, E), i.e. one hypergraph per data block;
a clustering unit, which clusters each data block by means of its hypergraph:
For the hypergraph H=(V, E), let C be a set of classes over the vertices V; each c_i ∈ C is a subset of V, and any two classes c_i and c_j satisfy c_i ∩ c_j = ∅. For a hyperedge e_m and a class c_i, if e_m ∩ c_i ≠ ∅, then there is a relation between e_m and c_i, expressed as:
HC(e_m, c_i) = (|e_m ∩ c_i| / |e_m|) × (|e_m ∩ c_i| / |c_i|),
wherein |e_m| denotes the number of vertices in hyperedge e_m, |c_i| the number of vertices in class c_i, and |e_m ∩ c_i| the number of vertices appearing in both e_m and c_i. Merging class c_i and class c_j yields c_ij = c_i ∪ c_j. For a hyperedge e_m with e_m ∩ c_i ≠ ∅, if HC(e_m, c_i) > HC(e_m, c_ij), then hyperedge e_m contains no vertices of class c_j; the change in the HC value reflects the similarity of c_i and c_j with respect to hyperedge e_m. The quality Q(c_i) of class c_i is defined as:
Q(c_i) = Σ_{e_m ∈ E} |e_m ∩ c_i| × w(e_m) × HC(e_m, c_i),
i.e. the quality of class c_i is the weighted sum of the HC(e_m, c_i) values over all hyperedges e_m ∈ E;
The merge index f is defined as:
f(c_i, c_j) = Q(c_ij) - [Q(c_i) + Q(c_j)];
The detailed clustering process comprises:
(1) initializing the class set C so that each class in C corresponds to one vertex of V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their merge index f(c_i, c_j); if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) building a new hypergraph from all the merged classes;
(4) repeating steps (1) to (3) until no more classes can be merged;
Alternatively, the detailed clustering process may comprise:
(1) initializing the class set C so that each class in C corresponds to one vertex of V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their merge index f(c_i, c_j); if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) building a new hypergraph from all the merged classes;
(4) letting the new hypergraph correspond to k partitions {G_1, G_2, …, G_k}, with w_i the mean weight of all hyperedges in the i-th partition and D_i the mean square deviation of those weights, calculated as:
D_i = Σ_{e ⊆ G_i} (w(e) - w_i)² / |{e | e ⊆ G_i}|,
wherein i = 1, 2, …, k; e denotes a hyperedge of the hypergraph; G_i denotes the i-th partition; w(e) is the weight of hyperedge e; and |{e | e ⊆ G_i}| is the number of hyperedges contained in partition G_i;
(5) judging whether D_i is greater than a first threshold: if so, repeating the clustering of steps (1) to (4); otherwise, ending the clustering process.
a final clustering unit, which clusters the per-block results produced by the clustering unit once more to obtain the final clustering result;
this re-clustering of the clustering unit's results can be realized with a variety of clustering methods, such as k-means clustering or hypergraph-based clustering.
Brief description of the drawings
Fig. 1 is a flowchart of the data storage method of the present invention;
Fig. 2 is a structural diagram of the data storage device of the present invention.
Embodiments
The technical scheme of the present invention is described clearly and completely below in conjunction with the accompanying drawings. Exemplary embodiments are described in detail here, with examples shown in the drawings. In the following description, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The embodiments described in the following exemplary description do not represent all embodiments consistent with the present invention; on the contrary, they are merely examples of apparatus and methods consistent with some aspects of the invention as detailed in the appended claims.
Referring to Fig. 1, the big data clustering method based on a cloud computing platform proposed by the present invention comprises:
Step S100, big data preprocessing: cleaning the real-world data from different data sources by filling in missing values, smoothing noisy data and identifying and deleting outliers, then standardizing the data and converting it into a standard format;
Data preprocessing refers to processing performed on the data before the main processing. It supplies clean, accurate and concise data for information processing, improving its efficiency and accuracy, and is a crucial link in information processing. Real-world data varies widely, and in order to process it uniformly it must first be preprocessed into standard data that meets the requirements.
Step S200, big data partitioning and management: cutting the big data into blocks to obtain multiple data blocks, storing them in the HDFS distributed file system of the cloud platform, with Hadoop managing the partitioned data blocks;
Hadoop is an open-source implementation of Google's MapReduce algorithm: an application can be divided into many small units of work, each of which can be executed or re-executed on any cluster node. In addition, Hadoop provides a distributed file system for storing data on the compute nodes, offering high throughput for reads and writes. Many single-machine algorithms have been re-implemented on Hadoop, providing high availability and scalability for all kinds of algorithms that process massive data.
Step S300, establishing the hypergraph model for clustering, specifically comprising:
building a weighted hypergraph H=(V, E), where V is the set of vertices and E is the set of hyperedges, each hyperedge connecting two or more vertices; the vertices of the hypergraph represent the data items to be clustered, and each hyperedge represents the association among the data items represented by the vertices it connects; w(e_m) is the weight of each hyperedge e_m ∈ E and measures the degree of correlation among the adjacent items joined by that hyperedge;
The weight of hyperedge e_m can be determined by either of two methods:
(1) taking the support of the association rule of hyperedge e_m as the weight of the hyperedge;
(2) taking the mean confidence of all essential association rules of hyperedge e_m as the weight of the hyperedge; an essential association rule is a specific rule whose right-hand side contains only one item set, and which covers all the data items associated with hyperedge e_m.
For the ease of understanding the present invention, shown below is the concept that some are relevant with hypergraph.
Data item and item set: establish I={i 1, i 2..., i mm different item destination aggregation (mda), each i k(k=1,2 ..., m) become data item (Item), the set I of data item is called item set (Item set), and referred to as item collection, its element number is called the length of item set.Length is that the item set of k is called k dimension data item collection, referred to as k-item collection (k-Item set).
Transaction: a transaction T (Transaction) is a subset of the itemset I; each transaction has a unique identifier TID associated with it, and the collection of all the different transactions constitutes the transaction set D (i.e., the transaction database).
Support of an itemset: let X be an itemset, B the number of transactions in database D containing X, and A the total number of transactions in database D; then the support (Support) of itemset X is Support(X) = B / A. The support Support(X) describes the importance of itemset X.
Association rule: an association rule can be expressed as R: X → Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅. It means that if itemset X occurs in a transaction, itemset Y will inevitably also occur in the same transaction. X is called the antecedent (precondition) of the rule, and Y is called the consequent (result) of the rule.
Support of an association rule: for an association rule R: X → Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅, the support of rule R is the ratio of the number of transactions in database D containing both itemset X and itemset Y to the total number of transactions.
Confidence of an association rule: for an association rule R: X → Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅, the confidence (Confidence) of rule R is expressed as:
Confidence(R) = P(Y ⊆ T | X ⊆ T) = P(Y ⊆ T ∩ X ⊆ T) / P(X ⊆ T) = Support(X ∪ Y) / Support(X)
That is, it indicates how likely itemset Y is to also occur in those transactions of database D in which itemset X occurs.
Support and confidence are two measures of the interestingness of a rule. Confidence measures the accuracy of an association rule, i.e., the strength of the rule; support measures the importance of an association rule, i.e., the frequency of the rule. If support and confidence were not considered, a database would yield an extremely large number of association rules. In practice, people are generally interested only in those association rules that satisfy certain support and confidence requirements. Therefore, to discover meaningful association rules, two basic thresholds given by the user are needed: the minimum support and the minimum confidence.
Minimum support and frequent itemsets: the minimum support (Minimum support), denoted minsupp, is the minimum support threshold that data items must satisfy for association rule discovery; it represents the lowest statistical importance of an itemset. Only itemsets satisfying the minimum support can appear in association rules. An itemset whose support is greater than the minimum support is called a frequent itemset (Large itemset); otherwise, it is called an infrequent itemset (Small itemset).
Minimum confidence: the minimum confidence (Minimum confidence), denoted minconf, is the minimum confidence that an association rule must satisfy; it represents the lowest reliability of an association rule.
Strong association rule: if Support(R) ≥ minsupp and Confidence(R) ≥ minconf, then the association rule R: X → Y is called a strong association rule.
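The definitions above can be checked on a small worked example. The transactions and the thresholds minsupp and minconf below are illustrative assumptions, not values from the patent:

```python
transactions = [
    {"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"},
    {"milk"}, {"bread", "milk"},
]

def support(itemset):
    # Support(X): fraction of transactions in D containing itemset X.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Rule R: X -> Y with X = {bread}, Y = {milk}.
X, Y = {"bread"}, {"milk"}
supp_R = support(X | Y)               # Support(R) = Support(X ∪ Y)
conf_R = support(X | Y) / support(X)  # Confidence(R) = Support(X ∪ Y) / Support(X)

minsupp, minconf = 0.4, 0.6           # user-given thresholds (assumed values)
is_strong = supp_R >= minsupp and conf_R >= minconf
print(supp_R, round(conf_R, 2), is_strong)  # 0.6 0.75 True
```

Three of five transactions contain both items (support 0.6) and four contain bread (support 0.8), so the confidence is 0.6 / 0.8 = 0.75 and the rule is strong under these thresholds.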
Hypergraph H = (V, E), where the vertex set V = {v_1, v_2, ..., v_n} and the edge set E = {e_1, e_2, ..., e_m}. Let a_ij denote the number of edges directly joining vertices v_i and v_j, with possible values 0, 1, 2, ...; the resulting n×n matrix A = [a_ij], a_ij ∈ {0, 1, 2, ...}, is called the adjacency matrix of the hypergraph.
The definition of the hypergraph adjacency matrix is an extension of the definition of the adjacency matrix of a simple graph. Combining this definition with the properties of adjacency matrices yields the following properties of the hypergraph adjacency matrix:
(1) A(H) is a symmetric matrix;
(2) two graphs G and H are isomorphic if and only if there exists a permutation matrix P such that A(H) = P^T A(G) P.
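A minimal sketch of the adjacency matrix definition and of symmetry property (1), using an invented three-edge hypergraph:

```python
n = 4
edges = [{0, 1, 2}, {1, 3}, {0, 2, 3}]  # hyperedges as sets of vertex indices

# a[i][j] = number of hyperedges joining v_i and v_j (0 on the diagonal).
A = [[0] * n for _ in range(n)]
for e in edges:
    for i in e:
        for j in e:
            if i != j:
                A[i][j] += 1

# Property (1): the adjacency matrix of a hypergraph is symmetric.
assert all(A[i][j] == A[j][i] for i in range(n) for j in range(n))
print(A)  # [[0, 1, 2, 1], [1, 0, 1, 1], [2, 1, 0, 1], [1, 1, 1, 0]]
```

Note that a[0][2] = 2 because vertices v_0 and v_2 appear together in two hyperedges.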
Step S400, big data mapping: specifically, the sliced data blocks are each mapped to a hypergraph H = (V, E); that is, each data block is mapped to one hypergraph;
Step S500, performing clustering on each data block respectively using its hypergraph,
For a hypergraph H = (V, E), let C be the set of classes over the vertex set V, where each c_i ∈ C is a subset of V and any two classes c_i and c_j satisfy c_i ∩ c_j = ∅. For a hyperedge e_m and a class c_i, if e_m ∩ c_i ≠ ∅, then a relation exists between e_m and c_i, expressed as:
HC(e_m, c_i) = (|e_m ∩ c_i| / |e_m|) × (|e_m ∩ c_i| / |c_i|),
where |e_m| denotes the number of vertices in hyperedge e_m, |c_i| denotes the number of vertices in class c_i, and |e_m ∩ c_i| is the number of vertices appearing in both e_m and c_i. Merging class c_i and class c_j yields c_ij, c_ij = c_i ∪ c_j. For a hyperedge e_m with e_m ∩ c_i ≠ ∅, if HC(e_m, c_i) > HC(e_m, c_ij), then hyperedge e_m contains vertices of c_j, and the change in the HC value reflects the similarity between c_i and c_j with respect to hyperedge e_m. The quality Q(c_i) of class c_i is defined as:
Q(c_i) = Σ_{e_m ∈ E} |e_m ∩ c_i| × w(e_m) × HC(e_m, c_i),
that is, the quality of class c_i is the sum of the weighted HC(e_m, c_i) values over all hyperedges e_m ∈ E;
The merge index f is defined as:
f(c_i, c_j) = Q(c_ij) - [Q(c_i) - Q(c_j)];
The detailed process of clustering comprises:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their merge index, i.e., the value of f(c_i, c_j); if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) building a new hypergraph from all the merged classes;
(4) repeating steps (1) to (3) until no more classes can be merged;
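Steps (1) to (4) can be sketched as follows. This is a simplified single-merge-per-pass reading of the procedure, using the merge index f exactly as printed above (note that f is asymmetric in its arguments); the hyperedges and weights are invented toy data, and with this particular data the index keeps merging until a single class remains:

```python
# Toy hypergraph: hyperedges (frozensets of vertex ids) with weights.
edges = {frozenset({0, 1, 2}): 1.0, frozenset({2, 3}): 0.5, frozenset({3, 4}): 0.8}

def HC(e, c):
    # Relation HC(e_m, c_i) between a hyperedge and a class, as defined above.
    inter = len(e & c)
    return (inter / len(e)) * (inter / len(c)) if inter else 0.0

def Q(c):
    # Quality of class c: weighted sum of HC values over all hyperedges.
    return sum(len(e & c) * w * HC(e, c) for e, w in edges.items())

def f(ci, cj):
    # Merge index, transcribed as printed in the text.
    return Q(ci | cj) - (Q(ci) - Q(cj))

classes = [frozenset({v}) for v in range(5)]   # step (1): one class per vertex
while True:
    # Step (2), simplified: perform the single best merge per pass.
    best = max(((f(a, b), a, b) for a in classes for b in classes if a != b),
               default=None)
    if best is None or best[0] <= 0:
        break                                  # step (4): nothing left to merge
    _, a, b = best
    classes = [c for c in classes if c not in (a, b)] + [a | b]

print(sorted(tuple(sorted(c)) for c in classes))  # [(0, 1, 2, 3, 4)]
```

A production version would rebuild the hypergraph between passes, as step (3) specifies; the sketch keeps the edge set fixed to stay short.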
Alternatively, the detailed process of clustering may comprise:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their merge index, i.e., the value of f(c_i, c_j); if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) building a new hypergraph from all the merged classes;
(4) the new hypergraph corresponds to k partitions {G_1, G_2, ..., G_k}; let w̄_i be the mean weight of all edges in the i-th partition and D̄_i be the mean square deviation of the weights of all edges in the i-th partition, calculated as follows:
D̄_i = Σ_{e ⊂ G_i} (w(e) - w̄_i)² / |{e | e ⊂ G_i}|,
where i = 1, 2, ..., k, e denotes a hyperedge in the hypergraph, G_i denotes the i-th partition of the hypergraph, w(e) denotes the weight corresponding to hyperedge e, and |{e | e ⊂ G_i}| denotes the number of hyperedges in partition G_i;
(5) judging whether D̄_i is greater than a first threshold; if it is greater than the first threshold, repeating the clustering process of steps (1) to (4); otherwise, ending the clustering process.
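Step (4) of this variant reduces to a per-partition mean and mean square deviation, as in this sketch with invented edge weights:

```python
# Hyperedge weights per partition G_i (illustrative values).
partitions = {"G1": [1.0, 0.8, 1.2], "G2": [0.5, 0.5, 0.5]}

def mean_square_deviation(weights):
    # Mean of squared deviations of edge weights from their partition mean.
    mean = sum(weights) / len(weights)
    return sum((w - mean) ** 2 for w in weights) / len(weights)

for name, ws in partitions.items():
    print(name, round(mean_square_deviation(ws), 4))
```

A uniform partition such as G2 yields D̄ = 0 and stops the iteration under any positive first threshold, while the spread-out weights of G1 give D̄ ≈ 0.0267.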
In step S500, the clustering of each data block using its hypergraph can also be realized by the following method:
(1) coarsening: based on the hypergraph H = (V, E), construct a minimal hypergraph such that any partition made on this minimal hypergraph, when projected onto the initial hypergraph, is of better quality than a partition made directly on the initial hypergraph in the same time;
In the coarsening stage of the hypergraph, a series of successively smaller hypergraphs needs to be constructed. The purpose of coarsening is to construct a minimal hypergraph such that any partition made on it, when projected onto the initial hypergraph, is of better quality than a partition made directly on the initial hypergraph in the same time. In addition, coarsening also reduces the size of the hyperedges: through successive coarsening, large hyperedges are compressed into small hyperedges connecting only a few vertices. This matters because the refinement heuristics are based on the Kernighan-Lin algorithm, which is very effective for small hyperedges but performs poorly on hyperedges containing a large number of vertices spanning different partition regions. When a group of vertices is compressed into a single vertex of the next-level coarsened hypergraph, different methods can be chosen. From the viewpoint of node selection, they include FC (First Choice scheme), GFC (Greedy First Choice scheme) and HFC (Hybrid First Choice scheme); from the viewpoint of node merging, they include EDGE (Edge scheme), HEDGE (Hyper-Edge scheme) and MHEDGE (Modified Hyper-Edge scheme).
(2) initial partitioning: perform a bipartition of the coarsened hypergraph obtained in (1);
In the initial partitioning stage, a bipartition of the coarsened hypergraph is needed. Because the hypergraph now contains very few vertices (generally fewer than 100), many different algorithms can be used without much effect on the running time or quality of the overall algorithm. Random bipartition can be applied repeatedly, and combinatorial methods, spectral methods, cellular automaton methods and the like can also be used for the partitioning.
(3) migration and refinement: use the partition of the minimal hypergraph to obtain a partition of a more refined hypergraph;
In the migration and refinement stage, the partition of the minimal hypergraph is used to obtain a partition of a more refined hypergraph. This is realized by projecting the partition onto the more refined hypergraph of the next level, and a partition refinement algorithm is used to reduce the number of partitioning passes and thus improve partition quality. Because the refined hypergraph of the next level has more degrees of freedom, the refinement algorithm can achieve higher quality. The idea of the V-cycle refinement algorithm is to use the multilevel scheme to further improve the quality of the bipartition. The V-cycle refinement algorithm consists of two parts: a coarsening stage and a migration-refinement stage. The coarsening stage preserves the initial partition as the input of the algorithm; this is referred to as the restricted coarsening scheme. In restricted coarsening, the group of vertices merged to form a vertex of the coarsened graph may only belong to one part of the bipartition. As a result, the original bipartition is preserved through the coarsening process and becomes the initial partition to be refined in the migration-refinement stage. The migration-refinement stage then duplicates the migration-refinement stage of the multilevel hypergraph partitioning method described above: it moves vertices between the partition regions to improve the quality of the partition. It should be noted that the various coarsened representations of the original hypergraph allow the refinement to further improve quality and thus help it escape local minima.
(4) the final partitioning result is the clustering result.
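The coarsening idea of step (1) can be illustrated by collapsing one hyperedge into a super-vertex, a much-simplified version of the HEDGE-style merge mentioned above (the vertex labels are illustrative):

```python
def coarsen(edges, merge_group, super_vertex):
    # Replace every vertex of merge_group by super_vertex in each hyperedge;
    # edges that collapse to a single vertex disappear from the coarser graph.
    new_edges = []
    for e in edges:
        ne = frozenset(super_vertex if v in merge_group else v for v in e)
        if len(ne) > 1:
            new_edges.append(ne)
    return new_edges

edges = [frozenset({0, 1, 2}), frozenset({2, 3}), frozenset({3, 4})]
coarse = coarsen(edges, {0, 1, 2}, "s0")  # collapse hyperedge {0,1,2} into "s0"
print(coarse)
```

After one merge the three-vertex hyperedge disappears and the graph shrinks to two small edges, which is exactly the effect that makes the Kernighan-Lin-style refinement effective.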
Step S600, clustering the clustering results of the data blocks obtained in step S500 again to obtain the final clustering result;
The re-clustering of the results obtained in step S500 can be realized with a variety of clustering methods, such as the k-means clustering method or a hypergraph-based clustering method.
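A minimal k-means pass over the block-level results, as one of the re-clustering options named above; the points stand in for per-block cluster centroids and are invented, and the deterministic initialisation is a simplification:

```python
def kmeans(points, k, iters=20):
    centers = points[:k]                      # simple deterministic init
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared distance).
            i = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                            + (p[1] - centers[c][1]) ** 2)
            groups[i].append(p)
        # Recompute each center as the mean of its group (keep empty groups).
        centers = [(sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
                   if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers, groups

# Centroids of the per-block clusters produced by step S500 (assumed values).
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centers, groups = kmeans(points, 2)
print(centers)  # ≈ [(0.1, 0.05), (5.1, 5.0)]
```

The two well-separated groups of block centroids converge to two final cluster centers within a few iterations.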
The present invention uses a cloud platform combined with hypergraph theory to mine and cluster big data, achieving fast, real-time and accurate analysis and processing of big data.
Referring to Fig. 2, the present invention also proposes a big data clustering device based on a cloud computing platform, comprising:
a big data preprocessing device, for cleaning real-world data by filling in missing values, smoothing noisy data, and identifying and deleting outliers, and for standardizing the data from different data sources into data of a standard format;
Data preprocessing refers to processing applied to the data before the main processing. It provides clean, accurate and concise data for information processing, improving its efficiency and accuracy, and is a very important link in information processing. Real-world data varies widely, and to process it uniformly it must first be preprocessed into standard data meeting the requirements.
a big data slicing and management device, for slicing the big data to obtain multiple data blocks and storing them in the distributed file system HDFS of the cloud platform, with Hadoop managing the sliced data blocks;
Hadoop is an open-source implementation of Google's MapReduce algorithm: an application is divided into many small work units, each of which can be executed or re-executed on any cluster node. In addition, Hadoop provides a distributed file system for storing data on the compute nodes, offering high throughput for reading and writing data. Many single-processor algorithms have been re-implemented on Hadoop, giving a variety of algorithms high availability and scalability when processing massive data.
a device for establishing the hypergraph model of clustering, specifically for:
establishing a weighted hypergraph H = (V, E), where V is the set of vertices and E is the set of hyperedges, each hyperedge being able to connect more than two vertices; the vertices of the hypergraph represent the data items to be clustered, and each hyperedge represents the association among the data items represented by the vertices it connects; w(e_m) is the weight corresponding to each hyperedge e_m in E, e_m ∈ E, and w(e_m) measures the degree of correlation among the multiple related items joined by the hyperedge;
The weight of hyperedge e_m can be determined by either of the following two methods:
(1) using the support of the association rule of each hyperedge e_m as the weight of that hyperedge;
(2) using the mean confidence of all essential association rules of each hyperedge e_m as the weight of that hyperedge. An essential association rule is a specific rule whose right-hand side contains only a single data item and which involves all of the data items associated with the hyperedge e_m.
a big data mapping device, for mapping the sliced data blocks respectively to hypergraphs H = (V, E), i.e., mapping each data block to one hypergraph;
a clustering device, which uses the hypergraphs to perform clustering on each data block respectively,
For a hypergraph H = (V, E), let C be the set of classes over the vertex set V, where each c_i ∈ C is a subset of V and any two classes c_i and c_j satisfy c_i ∩ c_j = ∅. For a hyperedge e_m and a class c_i, if e_m ∩ c_i ≠ ∅, then a relation exists between e_m and c_i, expressed as:
HC(e_m, c_i) = (|e_m ∩ c_i| / |e_m|) × (|e_m ∩ c_i| / |c_i|),
where |e_m| denotes the number of vertices in hyperedge e_m, |c_i| denotes the number of vertices in class c_i, and |e_m ∩ c_i| is the number of vertices appearing in both e_m and c_i. Merging class c_i and class c_j yields c_ij, c_ij = c_i ∪ c_j. For a hyperedge e_m with e_m ∩ c_i ≠ ∅, if HC(e_m, c_i) > HC(e_m, c_ij), then hyperedge e_m contains vertices of c_j, and the change in the HC value reflects the similarity between c_i and c_j with respect to hyperedge e_m. The quality Q(c_i) of class c_i is defined as:
Q(c_i) = Σ_{e_m ∈ E} |e_m ∩ c_i| × w(e_m) × HC(e_m, c_i),
that is, the quality of class c_i is the sum of the weighted HC(e_m, c_i) values over all hyperedges e_m ∈ E;
The merge index f is defined as:
f(c_i, c_j) = Q(c_ij) - [Q(c_i) - Q(c_j)];
The detailed process of clustering comprises:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their merge index, i.e., the value of f(c_i, c_j); if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) building a new hypergraph from all the merged classes;
(4) repeating steps (1) to (3) until no more classes can be merged;
Alternatively, the detailed process of clustering may comprise:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their merge index, i.e., the value of f(c_i, c_j); if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) building a new hypergraph from all the merged classes;
(4) the new hypergraph corresponds to k partitions {G_1, G_2, ..., G_k}; let w̄_i be the mean weight of all edges in the i-th partition and D̄_i be the mean square deviation of the weights of all edges in the i-th partition, calculated as follows:
D̄_i = Σ_{e ⊂ G_i} (w(e) - w̄_i)² / |{e | e ⊂ G_i}|,
where i = 1, 2, ..., k, e denotes a hyperedge in the hypergraph, G_i denotes the i-th partition of the hypergraph, w(e) denotes the weight corresponding to hyperedge e, and |{e | e ⊂ G_i}| denotes the number of hyperedges in partition G_i;
(5) judging whether D̄_i is greater than a first threshold; if it is greater than the first threshold, repeating the clustering process of steps (1) to (4); otherwise, ending the clustering process.
a final clustering device, which clusters the clustering results of the data blocks obtained by the clustering device again to obtain the final clustering result;
The re-clustering of the results obtained by the clustering device can be realized with a variety of clustering methods, such as the k-means clustering method or a hypergraph-based clustering method.
Those skilled in the art will easily conceive of other embodiments of the present invention after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of the present invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein.
It should be understood that the present invention is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (8)

1. A big data clustering method based on a cloud computing platform, comprising:
step S100, big data preprocessing: cleaning real-world data by filling in missing values, smoothing noisy data, and identifying and deleting outliers, and standardizing the data from different data sources into data of a standard format;
step S200, big data slicing and management: slicing the big data to obtain multiple data blocks, storing them in the distributed file system HDFS of the cloud platform, and having Hadoop manage the sliced data blocks;
step S300, establishing the hypergraph model for clustering;
step S400, big data mapping: mapping the sliced data blocks respectively to hypergraphs H = (V, E), i.e., mapping each data block to one hypergraph;
step S500, performing clustering on each data block respectively using its hypergraph, specifically comprising:
For a hypergraph H = (V, E), let C be the set of classes over the vertex set V, where each c_i ∈ C is a subset of V and any two classes c_i and c_j satisfy c_i ∩ c_j = ∅. For a hyperedge e_m and a class c_i, if e_m ∩ c_i ≠ ∅, then a relation exists between e_m and c_i, expressed as:
HC(e_m, c_i) = (|e_m ∩ c_i| / |e_m|) × (|e_m ∩ c_i| / |c_i|),
where |e_m| denotes the number of vertices in hyperedge e_m, |c_i| denotes the number of vertices in class c_i, and |e_m ∩ c_i| is the number of vertices appearing in both e_m and c_i. Merging class c_i and class c_j yields c_ij, c_ij = c_i ∪ c_j. For a hyperedge e_m with e_m ∩ c_i ≠ ∅, if HC(e_m, c_i) > HC(e_m, c_ij), then hyperedge e_m contains vertices of c_j, and the change in the HC value reflects the similarity between c_i and c_j with respect to hyperedge e_m. The quality Q(c_i) of class c_i is defined as:
Q(c_i) = Σ_{e_m ∈ E} |e_m ∩ c_i| × w(e_m) × HC(e_m, c_i),
that is, the quality of class c_i is the sum of the weighted HC(e_m, c_i) values over all hyperedges e_m ∈ E;
The merge index f is defined as:
f(c_i, c_j) = Q(c_ij) - [Q(c_i) - Q(c_j)];
The detailed process of clustering comprises:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their merge index, i.e., the value of f(c_i, c_j); if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) building a new hypergraph from all the merged classes;
(4) repeating steps (1) to (3) until no more classes can be merged;
step S600, clustering the clustering results of the data blocks obtained in step S500 again to obtain the final clustering result.
2. The big data clustering method based on a cloud computing platform as claimed in claim 1, wherein step S300, establishing the hypergraph model for clustering, specifically comprises:
establishing a weighted hypergraph H = (V, E), where V is the set of vertices and E is the set of hyperedges, each hyperedge being able to connect more than two vertices; the vertices of the hypergraph represent the data items to be clustered, and each hyperedge represents the association among the data items represented by the vertices it connects; w(e_m) is the weight corresponding to each hyperedge e_m in E, e_m ∈ E, and w(e_m) measures the degree of correlation among the multiple related items joined by the hyperedge.
3. The big data clustering method based on a cloud computing platform as claimed in claim 2, wherein the weight of hyperedge e_m is:
the support of the association rule of each hyperedge e_m, taken as the weight of that hyperedge.
4. The big data clustering method based on a cloud computing platform as claimed in claim 2, wherein the weight of hyperedge e_m is:
the mean confidence of all essential association rules of each hyperedge e_m, taken as the weight of that hyperedge; an essential association rule is a specific rule whose right-hand side contains only a single data item and which involves all of the data items associated with the hyperedge e_m.
5. A big data clustering device based on a cloud computing platform, comprising:
a big data preprocessing device, for cleaning real-world data by filling in missing values, smoothing noisy data, and identifying and deleting outliers, and for standardizing the data from different data sources into data of a standard format;
a big data slicing and management device, for slicing the big data to obtain multiple data blocks and storing them in the distributed file system HDFS of the cloud platform, with Hadoop managing the sliced data blocks;
a device for establishing the hypergraph model of clustering, for establishing the hypergraph model for clustering;
a big data mapping device, for mapping the sliced data blocks respectively to hypergraphs H = (V, E), i.e., mapping each data block to one hypergraph;
a clustering device, which uses the hypergraphs to perform clustering on each data block respectively, specifically comprising:
For a hypergraph H = (V, E), let C be the set of classes over the vertex set V, where each c_i ∈ C is a subset of V and any two classes c_i and c_j satisfy c_i ∩ c_j = ∅. For a hyperedge e_m and a class c_i, if e_m ∩ c_i ≠ ∅, then a relation exists between e_m and c_i, expressed as:
HC(e_m, c_i) = (|e_m ∩ c_i| / |e_m|) × (|e_m ∩ c_i| / |c_i|),
where |e_m| denotes the number of vertices in hyperedge e_m, |c_i| denotes the number of vertices in class c_i, and |e_m ∩ c_i| is the number of vertices appearing in both e_m and c_i. Merging class c_i and class c_j yields c_ij, c_ij = c_i ∪ c_j. For a hyperedge e_m with e_m ∩ c_i ≠ ∅, if HC(e_m, c_i) > HC(e_m, c_ij), then hyperedge e_m contains vertices of c_j, and the change in the HC value reflects the similarity between c_i and c_j with respect to hyperedge e_m. The quality Q(c_i) of class c_i is defined as:
Q(c_i) = Σ_{e_m ∈ E} |e_m ∩ c_i| × w(e_m) × HC(e_m, c_i),
that is, the quality of class c_i is the sum of the weighted HC(e_m, c_i) values over all hyperedges e_m ∈ E;
The merge index f is defined as:
f(c_i, c_j) = Q(c_ij) - [Q(c_i) - Q(c_j)];
The detailed process of clustering comprises:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their merge index, i.e., the value of f(c_i, c_j); if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) building a new hypergraph from all the merged classes;
(4) repeating steps (1) to (3) until no more classes can be merged;
a final clustering device, which clusters the clustering results of the data blocks obtained by the clustering device again to obtain the final clustering result.
6. The big data clustering device based on a cloud computing platform as claimed in claim 5, wherein the device for establishing the hypergraph model of clustering is specifically for:
establishing a weighted hypergraph H = (V, E), where V is the set of vertices and E is the set of hyperedges, each hyperedge being able to connect more than two vertices; the vertices of the hypergraph represent the data items to be clustered, and each hyperedge represents the association among the data items represented by the vertices it connects; w(e_m) is the weight corresponding to each hyperedge e_m in E, e_m ∈ E, and w(e_m) measures the degree of correlation among the multiple related items joined by the hyperedge.
7. The big data clustering device based on a cloud computing platform as claimed in claim 6, wherein the weight of hyperedge e_m is:
the support of the association rule of each hyperedge e_m, taken as the weight of that hyperedge.
8. The big data clustering device based on a cloud computing platform as claimed in claim 6, wherein the weight of hyperedge e_m is:
the mean confidence of all essential association rules of each hyperedge e_m, taken as the weight of that hyperedge; an essential association rule is a specific rule whose right-hand side contains only a single data item and which involves all of the data items associated with the hyperedge e_m.
CN201510249032.XA 2015-05-15 2015-05-15 A kind of big data clustering method and device based on cloud computing platform Active CN104820708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510249032.XA CN104820708B (en) 2015-05-15 2015-05-15 A kind of big data clustering method and device based on cloud computing platform


Publications (2)

Publication Number Publication Date
CN104820708A true CN104820708A (en) 2015-08-05
CN104820708B CN104820708B (en) 2018-02-09

Family

ID=53731003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510249032.XA Active CN104820708B (en) 2015-05-15 2015-05-15 A kind of big data clustering method and device based on cloud computing platform

Country Status (1)

Country Link
CN (1) CN104820708B (en)



Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809242A (en) * 2015-05-15 2015-07-29 成都睿峰科技有限公司 Distributed-structure-based big data clustering method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LIU, Lina: "A Data Clustering Method Based on the Hypergraph Model", Journal of Shijiazhuang Institute of Railway Technology *
ZHANG, Rong: "A Method for Clustering High-Dimensional Data Based on the Hypergraph Model", Computer Engineering *
SHA, Jin et al.: "HGHD: A Hypergraph-Based Clustering Algorithm for High-Dimensional Data", Microelectronics & Computer *
JIA, Junfang et al.: "Clustering Analysis of Large Data Sets Based on Distribution", Computer Engineering and Applications *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809242A (en) * 2015-05-15 2015-07-29 Chengdu Rui Feng Science and Technology Ltd Distributed-structure-based big data clustering method and device
CN104809242B (en) * 2015-05-15 2018-03-02 Chengdu Rui Feng Science and Technology Ltd Big data clustering method and device based on a distributed architecture
CN105354298A (en) * 2015-11-01 2016-02-24 Changchun University of Science and Technology Hadoop-based method for analyzing large-scale social networks and its analysis platform
CN106203516A (en) * 2016-07-13 2016-12-07 Central South University Subspace clustering visual analysis method based on dimension correlation
CN106203516B (en) * 2016-07-13 2019-04-09 Central South University Subspace clustering visual analysis method based on dimension correlation
CN106503086A (en) * 2016-10-11 2017-03-15 Chengdu Yunqilin Software Co., Ltd. Detection method for distributed local outliers
CN106874367A (en) * 2016-12-30 2017-06-20 Jiangsu Haobai Information Service Co., Ltd. Sampling-based distributed clustering method for a public-opinion platform
CN111507365A (en) * 2019-09-02 2020-08-07 Central South University Confidence rule automatic generation method based on fuzzy clustering
CN111125198A (en) * 2019-12-27 2020-05-08 Nanjing University of Aeronautics and Astronautics Computer data mining clustering method based on time series
CN112613562A (en) * 2020-12-24 2021-04-06 Shandong Xintaiyang Intelligent Technology Co., Ltd. Data analysis system and method based on multi-center cloud computing
CN112613562B (en) * 2020-12-24 2023-05-12 Guangzhou Xiwen Information Technology Co., Ltd. Data analysis system and method based on multi-center cloud computing
CN112948640A (en) * 2021-03-10 2021-06-11 Chengdu Industry and Trade Vocational and Technical College Big data clustering method and system based on a cloud computing platform
CN112948640B (en) * 2021-03-10 2022-03-15 Chengdu Industry and Trade Vocational and Technical College Big data clustering method and system based on a cloud computing platform
CN113255278A (en) * 2021-05-17 2021-08-13 Fuzhou University Timing-driven integrated circuit clustering method
CN113255278B (en) * 2021-05-17 2022-07-15 Fuzhou University Timing-driven integrated circuit clustering method
CN113988817A (en) * 2021-11-11 2022-01-28 Chongqing University of Posts and Telecommunications Dirty data cleaning method based on an intelligent data platform
CN113988817B (en) * 2021-11-11 2024-04-12 Chongqing University of Posts and Telecommunications Dirty data cleaning method based on an intelligent data platform

Also Published As

Publication number Publication date
CN104820708B (en) 2018-02-09

Similar Documents

Publication Publication Date Title
CN104809242A (en) Distributed-structure-based big data clustering method and device
CN104820708A (en) Cloud computing platform based big data clustering method and device
CN104809244A (en) Data mining method and device in big data environment
Gupta et al. Scalable machine‐learning algorithms for big data analytics: a comprehensive review
Lin MR-Apriori: Association rules algorithm based on MapReduce
Hongchao et al. Distributed data organization and parallel data retrieval methods for huge laser scanner point clouds
Wang et al. Research and implementation on spatial data storage and operation based on Hadoop platform
CN104794151A (en) Spatial knowledge service system building method based on collaborative plotting technology
Gupta et al. Faster as well as early measurements from big data predictive analytics model
Zhang et al. Optimization and improvement of data mining algorithm based on efficient incremental kernel fuzzy clustering for large data
Hashem et al. An Integrative Modeling of BigData Processing.
Venkatesh et al. Challenges and research disputes and tools in big data analytics
Karim et al. Spatiotemporal Aspects of Big Data.
Abdelhafez Big data technologies and analytics: A review of emerging solutions
Dass et al. Amelioration of Big Data analytics by employing Big Data tools and techniques
Tripathi et al. A comparative analysis of conventional hadoop with proposed cloud enabled hadoop framework for spatial big data processing
Khosla et al. Big data technologies
Ma et al. [Retracted] The Construction of Big Data Computational Intelligence System for E‐Government in Cloud Computing Environment and Its Development Impact
Agrawal et al. High performance big data clustering
Hanmanthu et al. Parallel optimal grid-clustering algorithm exploration on mapreduce framework
Prakash et al. Architecture Design for Hadoop No-SQL and Hive
Pratap Analysis of big data technology and its challenges
Tazeen et al. A Survey on Some Big Data Applications Tools and Technologies
Maguerra et al. A survey on solutions for big spatio-temporal data processing and analytics
Davoudian A workload-driven framework for NoSQL data modeling and partitioning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190729

Address after: 2-2-1-61, No. 319 Haier Road, Jiangbei District, Chongqing 400000 (Lianglu-Cuntan Bonded Port Area)

Patentee after: Chongqing Steady Technology Co., Ltd.

Address before: 10th floor, East Tower, Ladfans Building, No. 1480 North Section of Tianfu Avenue, Chengdu High-tech Zone, Sichuan Province 610041

Patentee before: Chengdu Rui Feng Science and Technology Ltd.

TR01 Transfer of patent right

Effective date of registration: 20190829

Address after: Room 5308, 5th Floor, Kechuang Building, No. 25-1 Keyuan Avenue, Nanning City, Guangxi Zhuang Autonomous Region 530000

Patentee after: Nanning Kehang Jinqiao Enterprise Consulting Co., Ltd.

Address before: 2-2-1-61, No. 319 Haier Road, Jiangbei District, Chongqing 400000 (Lianglu-Cuntan Bonded Port Area)

Patentee before: Chongqing Steady Technology Co., Ltd.

TR01 Transfer of patent right

Effective date of registration: 20190926

Address after: Room 607, Building C1, Phase I, China-ASEAN Science and Technology Enterprise Incubation Base, No. 1 Headquarters Road, Nanning City, Guangxi Zhuang Autonomous Region 530007

Patentee after: Nanning First Station Network Technology Co., Ltd.

Address before: Room 5308, 5th Floor, Kechuang Building, No. 25-1 Keyuan Avenue, Nanning City, Guangxi Zhuang Autonomous Region 530000

Patentee before: Nanning Kehang Jinqiao Enterprise Consulting Co., Ltd.

TR01 Transfer of patent right