CN103838863A - Big-data clustering algorithm based on cloud computing platform - Google Patents

Big-data clustering algorithm based on cloud computing platform

Publication number
CN103838863A
Authority
CN
China
Prior art keywords
data
clustering
cloud computing
carried out
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410104227.0A
Other languages
Chinese (zh)
Other versions
CN103838863B (en)
Inventor
孟海东
任敬佩
宋宇辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Science and Technology
Original Assignee
Inner Mongolia University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Science and Technology
Priority to CN201410104227.0A
Publication of CN103838863A
Application granted
Publication of CN103838863B
Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a big-data clustering algorithm based on a cloud computing platform. Raw data is pre-processed; the data is divided into M sub-data sets, which are distributed to M Map functions, and local clustering is carried out on each sub-data set; clusters with the same key are merged; if the actual number of clusters R is smaller than the number of clusters K, the number of representative points c and the contraction factor a are adjusted and clustering is carried out again until the termination condition is reached. If a new data set is generated, the way the data are partitioned before local clustering depends on whether the number of cluster centers K of the new data source is larger than the number of clusters K obtained before the update, or whether the number of points in the new data source is larger than the number of points in the data source before the update. The algorithm uses the parallel computing capacity of a high-performance cloud computing cluster system to solve the problem that clustering must process massive data, so that relationships in the data can be mined rapidly and efficiently.

Description

A big-data clustering algorithm based on a cloud computing platform
Technical field
The invention belongs to the technical field of data mining and relates to a big-data clustering algorithm based on a cloud computing platform.
Background art
Cluster analysis, an interdisciplinary subject spanning statistics, machine learning, data mining and other fields, has attracted numerous researchers and has become a very active research topic in data mining. Researchers in China and abroad have proposed many clustering algorithms so far, and the main clustering methods can be divided into partitioning-based methods, hierarchy-based methods, density-based methods, grid-based methods, model-based methods and so on.
At the 6th Mobile Internet International Symposium, held on August 21, 2012, Dr. Deng Kan, who holds a doctorate in computer science and robotics from Carnegie Mellon University in the United States, stated that finding the value in big data depends on data mining algorithms, and that those algorithms must be combined with the parallel computation of cloud computing. A distributed cloud storage platform offers lower cost and high processing performance; combined with efficient data mining algorithms, it becomes an effective remedy for big-data problems.
The University of Southampton's "Research on Mass Data Mining under Cloud Computing" notes that the emergence of cloud computing provides a cheap solution for the growing number of small and medium-sized enterprises that analyze massive data. After introducing the Hadoop cluster framework for cloud computing and the SPRINT (Scalable Parallelizable Induction of Decision Trees, a scalable decision-tree algorithm) classification algorithm from data mining, it describes in detail the execution flow of the parallel SPRINT algorithm on the MapReduce (a data processing model) programming model in Hadoop (a distributed programming framework), and uses the resulting decision-tree model to classify input data.
At present, data mining work based on cloud computing platforms has produced numerous achievements. The Apache Mahout project (an open-source project of the Apache Software Foundation) has developed parallel data mining algorithms aimed at a range of commercial applications; the parallel distributed data mining platform (PDMiner, Parallel Distributed Miner) released by the Institute of Computing Technology of the Chinese Academy of Sciences can process massive data at the TB level; and the parallel data mining tool of China Mobile (BC-PDM, Blue Carrier based Parallel Data Mining) goes further by providing a Web-based service mode. These notable achievements have greatly promoted the development of the field. Several data mining algorithms have already been implemented on the cloud computing programming model MapReduce. In 2007, Chu et al. proposed a Naive Bayes classification algorithm based on MapReduce. This algorithm adopts the idea of distributed processing and constructs the classifier by gathering sample statistics in a decentralized way and integrating them centrally, but it can process only discrete data and cannot effectively support continuous data. In addition, to the best of our knowledge, no authoritative report has yet been published on MapReduce implementations of general clustering algorithms in data mining work.
At present, research on clustering methods in China and abroad mostly remains at the optimization of serial methods. Serial clustering algorithms have been widely studied and applied in statistics and databases, for example the K-Means (K-means method) algorithm, the BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies) algorithm for comprehensive hierarchical clustering in large-scale database systems, and the STING (Statistical Information Grid) algorithm for processing spatial data. Faced with ever-growing massive databases and high-dimensional data types, it is of great significance to study clustering algorithms under a parallel model and to use the high-speed computing capacity of clusters to solve the clustering of big data.
With the diversified development of the internet, real-time data streams and connected devices, and driven by demand for search services, social networks, mobile commerce, open collaboration and the like, cloud computing has developed rapidly. Unlike earlier parallel and distributed computing, the emergence of cloud computing will bring revolutionary change to the whole Internet model and to enterprise management models. Several major companies were the forerunners of cloud computing.
Google is the largest user of cloud computing. At present, Google allows third parties to run large-scale parallel applications in Google's cloud through Google App Engine (Google's application engine). MapReduce is a distributed computing programming framework first proposed by Google in 2004, and it supports distributed processing of large volumes of data.
Hadoop is an open-source distributed computing framework of the Apache open-source organization and has been applied on many large websites; the core designs of the Hadoop framework are MapReduce and the Hadoop Distributed File System (HDFS). Amazon uses the Elastic Compute Cloud (EC2) and the Simple Storage Service (S3) to provide computing and storage services for enterprises. In November 2007 IBM released the "game-changing" "Blue Cloud" computing platform, bringing customers a cloud computing platform that is ready to use on purchase. Microsoft followed closely, releasing the Windows Azure operating system in October 2008. Azure (translated as "blue sky") is another disruptive transition for Microsoft after Windows replaced DOS: by building a new cloud computing platform on the Internet architecture, it lets Windows truly extend from the PC to the "blue sky".
In China, cloud computing is also developing very rapidly. In 2008, IBM set up two cloud computing centers in succession, in Wuxi and in Beijing; 21Vianet released the CloudEx product line (an elastic cloud computing platform), providing internet hosting services, online storage virtualization services and the like; the China Mobile Research Institute set up a cloud computing experimental center with 1024 CPUs; and the PLA University of Science and Technology developed the cloud storage system MassCloud (a mass cloud storage platform) and uses it to support large-scale 3G-based video surveillance applications and digital earth systems.
Given the present state of clustering research in data mining, most existing methods for mining big-data clusters sample the data, select representative data, and let the sample stand for the whole in cluster analysis. When facing big data, methods based on sampling probability are generally adopted, but sampling methods consider neither the global relative distance between data points or between intervals nor the uneven distribution of the data, so the interval division is too rigid. Although clustering, fuzzy concepts, cloud models and the like were later introduced to soften this rigid interval division, with good results, none of these methods consider the different roles that data points in big data play in the knowledge discovery task. Therefore, to make the clustering rules obtained by mining more effective and faster to obtain, cluster analysis must be studied in more depth, starting from full consideration of the different roles of data points. Cloud computing was proposed precisely for processing real-world big data, which provides a strong theoretical foundation for mining more effective clustering rules.
Summary of the invention
The object of the invention is to overcome the defects of the above technology by providing a big-data clustering algorithm based on a cloud computing platform. The method uses the parallel computing capacity of a high-performance cloud computing cluster system to solve the big-data processing problem faced by clustering, so that relationships in the data can be mined quickly and effectively. The specific technical scheme is as follows:
A big-data clustering algorithm based on a cloud computing platform comprises the following steps:
(1) The raw data is pre-processed;
The basic idea is as follows. First, the whole data source is scanned to check for null values, and each missing value is filled in with the mean of the dimension in which the null value occurs. Second, the data set is vectorized and partitioned; the resulting data blocks are distributed to the nodes, and each node assigns its data blocks to the M Map functions. Within the function, a threshold T (the distance between points) and a value M (the minimum number of points allowed in a cluster; the symbol M is reused here and is distinct from the number of partitions) are set. The c points that are farthest apart are chosen as representative points for clustering. Points that satisfy the T requirement are gathered into one class and placed in one cluster; this is repeated until no point satisfies the requirement, and the remaining points are then grouped into one class to form a cluster. The center of each cluster is represented by (N, SUM, SUMSQ), where N is the number of points in the cluster, SUM is the per-dimension sum of the vectors of all points, and SUMSQ is the per-dimension sum of the squared components of all points. Finally, the number of points in each resulting cluster is checked: if it is less than M, all points in that cluster are deleted; otherwise the clusters form a data set U, and a cluster number K is obtained. The concrete steps are as follows:
1: Scan the whole data set, check each dimension for null values, and fill in missing values;
2: Vectorize the data set;
3: Partition the data set into M sub-data sets and assign them to the child nodes;
4: Distribute the M sub-data sets to M Map functions, each Map task processing one data fragment;
5: In the Map stage, carry out local clustering on each sub-data set, choosing the c representative points that are farthest apart;
6: If the distance between points is less than T, gather them into one class; otherwise, group the remaining points into one cluster;
7: Count the points in each cluster; if a cluster contains fewer than M points, delete that cluster;
8: In the Reduce stage, form a new data set U from all clusters, compute the number of clusters K, and represent the center of each cluster by (N, SUM, SUMSQ).
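The pre-processing steps above can be sketched in Python as follows. This is a minimal single-machine illustration, not the patented implementation: the function names, the use of `None` for null values, the Euclidean distance, and the greedy rule "join the first cluster that has a member closer than T" are all assumptions, since the text does not fix these details.

```python
import math

def impute_missing(rows):
    """Fill each None entry with the mean of the known values in its dimension."""
    dims = len(rows[0])
    means = []
    for d in range(dims):
        known = [r[d] for r in rows if r[d] is not None]
        # Assumption: a dimension with no known values falls back to 0.0.
        means.append(sum(known) / len(known) if known else 0.0)
    return [[means[d] if r[d] is None else r[d] for d in range(dims)] for r in rows]

def threshold_cluster(points, t, min_points):
    """Group points whose distance to an existing cluster member is below t,
    then delete clusters with fewer than min_points members (step 7)."""
    clusters = []
    for p in points:
        for cluster in clusters:
            if any(math.dist(p, q) < t for q in cluster):
                cluster.append(p)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([p])
    return [c for c in clusters if len(c) >= min_points]

rows = [[0.0, 0.0], [0.5, 0.0], [10.0, 10.0], [10.0, None], [50.0, 2.0]]
clean = impute_missing(rows)
clusters = threshold_cluster(clean, t=8.0, min_points=2)
```

In a MapReduce setting, each Map task would run a routine of this kind on its own fragment; the isolated point `[50.0, 2.0]` ends up alone in a cluster and is dropped by the minimum-size check.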
(2) The data set U is divided into M sub-data sets, which are distributed to M Map functions;
(3) In the Map stage, local clustering is carried out on each sub-data set, and the c representative points that are farthest apart are chosen;
(4) The center (N, SUM, SUMSQ) of each cluster is calculated;
(5) In the Reduce stage, classes with the same key are merged; the center of the merged cluster is (N1 + N2 + ... + Ni, SUM1 + SUM2 + ... + SUMi, SUMSQ1 + SUMSQ2 + ... + SUMSQi);
(6) If the actual cluster number R is less than the cluster number K, the representative point number c and the contraction factor a are adjusted and clustering is carried out again, until the termination condition is reached.
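The text does not spell out how the c representative points are chosen or how the contraction factor a is applied. One plausible reading, sketched below, follows the well-known CURE-style approach: pick well-scattered points by farthest-first selection, then move each one a fraction a of the way toward the cluster center. The function names and this interpretation of a are assumptions made for illustration only.

```python
import math

def centroid(points):
    """Per-dimension mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def pick_representatives(points, c):
    """Farthest-first selection of up to c well-scattered points."""
    center = centroid(points)
    # Start with the point farthest from the cluster center.
    reps = [max(points, key=lambda p: math.dist(p, center))]
    while len(reps) < min(c, len(points)):
        # Next: the point whose nearest chosen representative is farthest away.
        reps.append(max(points, key=lambda p: min(math.dist(p, r) for r in reps)))
    return reps

def shrink(reps, center, a):
    """Move each representative a fraction a (0 <= a <= 1) toward the center."""
    return [[r[d] + a * (center[d] - r[d]) for d in range(len(r))] for r in reps]

cluster = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0], [2.0, 2.0]]
reps = pick_representatives(cluster, c=2)
shrunk = shrink(reps, centroid(cluster), a=0.5)
```

Under this reading, step (6) would rerun the clustering with a different c or a whenever the resulting cluster count R falls below K; a larger a pulls representatives closer to the center and makes neighbouring clusters less likely to be joined through outlying points.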
(7) Big data is characterized not only by high dimensionality and massive volume but also by rapid data generation and rapid data updates; the algorithm handles this characteristic with the following method;
The basic idea is as follows. First, the new data source is pre-processed (as above), yielding its data set U, the number K of cluster centers, and the total number of data points N. Next, if the number of centers K of the new data source is greater than the cluster number K obtained before the update, or the number of points in the new data source is greater than the number of points in the data source before the update, the new data source and the data source before the update are partitioned again together; otherwise, the K cluster centers obtained from the data set before the update are treated as K points and combined with the new data source to form a new data set, which is then partitioned. The subsets are then assigned to the child nodes and distributed to several Map functions for local clustering. In the first case K is taken as [(K_new + K_old)/2]; otherwise K keeps its value from before the update. Stages 3, 4, 5 and 6 (pre-processing stage) are then repeated. The concrete steps are as follows:
1: Pre-process the new data source (as above);
2: Vectorize the data set;
3: Compare the number of points N and the number of centers K of the new data source with the number of points N and the number of centers K of the data source before the update;
4: If N_new > N_old or K_new > K_old, partition the two data sets again together, with K = [(K_new + K_old)/2]; otherwise, treat the K cluster centers obtained from the data set before the update as K points, combine them with the new data source to form a new data set and partition it, with K = K_old;
5: Divide the data set U into M sub-data sets and distribute them to M Map functions;
6: In the Map stage, carry out local clustering on each sub-data set, choosing the c representative points that are farthest apart;
7: Calculate the center (N, SUM, SUMSQ) of each cluster;
8: In the Reduce stage, merge classes with the same key; the center of the merged cluster is (N1 + N2 + ... + Ni, SUM1 + SUM2 + ... + SUMi, SUMSQ1 + SUMSQ2 + ... + SUMSQi);
9: If the actual cluster number R is less than the cluster number K, adjust the representative point number c and the contraction factor a and carry out clustering again, until the termination condition is reached.
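The decision in step 4 of the update procedure can be written as a small function. The sketch below is illustrative only: the function name, the returned tuple, and the reading of the square brackets in [(K_new + K_old)/2] as integer (floor) division are assumptions.

```python
def plan_update(new_points, k_new, old_points, old_centers, k_old):
    """Decide which data set to re-partition after an update, and which K to use.

    Returns (data_set, k), where data_set is the list of points to be split
    into sub-data sets for the Map stage.
    """
    if len(new_points) > len(old_points) or k_new > k_old:
        # The new source dominates: re-partition old and new data together.
        return old_points + new_points, (k_new + k_old) // 2
    # Otherwise the old data is represented only by its K cluster centers.
    return old_centers + new_points, k_old

old_points = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
old_centers = [[0.1, 0.05], [5.05, 4.95]]
new_points = [[9.0, 9.0], [9.1, 9.2]]

data_set, k = plan_update(new_points, 1, old_points, old_centers, 2)
```

In the second branch the work grows with the size of the new data plus K points rather than with the full history, which is the apparent motivation for reusing the old centers.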
Compared with the prior art, the beneficial effects of the present invention are as follows. The invention uses the parallel computing capacity of a high-performance cloud computing cluster system to solve the big-data processing problem faced by clustering. Taking parallel clustering as its goal, it proposes a new clustering approach and an improved method. The data processing cost of an enterprise is greatly reduced, and the enterprise no longer depends on high-performance machines. Developing big-data mining on cloud computing is convenient because the underlying layers are hidden. Under parallelization, cloud computing can use existing equipment to improve the capacity and speed of processing large-scale data, ensuring fault tolerance while allowing nodes to be added. The invention applies cloud computing to cluster analysis in data mining and realizes a new abstract model that hides messy details such as parallelization, fault tolerance, data distribution and load balancing, so that data can be processed quickly, the associations between data can be mined, the great influence of big data on modern life can be captured, and the problem of data mining in the face of big data is solved.
Description of the drawings
Fig. 1 is the flow chart of big-data pre-processing in the big-data clustering algorithm based on a cloud computing platform of the present invention;
Fig. 2 is the flow chart of the big-data clustering algorithm based on a cloud computing platform of the present invention;
Fig. 3 is the flow chart of the clustering algorithm after a big-data update on the cloud computing platform of the present invention.
Embodiment
The technical scheme of the present invention is described in more detail below with reference to a specific embodiment, which follows the steps set out in the Summary of the invention.
With reference to Figs. 1, 2 and 3. In Fig. 1, T: the distance between points; M: the minimum number of points contained in a cluster; N: the number of points in a cluster; SUM: the per-dimension sum of the vectors of all points; SUMSQ: the per-dimension sum of the squared components of all points. In Fig. 3, N1: the number of points in the initial data source; N2: the number of points in the new data source; K1: the initial cluster number; K2: the cluster number after pre-processing the new data; Pi: the center of an initial cluster; K = [(K1 + K2)/2].
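The (N, SUM, SUMSQ) triple is convenient because summaries of disjoint point sets can be merged by component-wise addition, and the centroid and per-dimension variance can be recovered from the triple alone. The following Python sketch illustrates this property and the Reduce-stage merge of summaries that share a key; the function names and the string keys are hypothetical.

```python
def summarize(points):
    """Build the (N, SUM, SUMSQ) summary of a non-empty list of points."""
    dims = len(points[0])
    n = len(points)
    s = [sum(p[d] for p in points) for d in range(dims)]
    sq = [sum(p[d] ** 2 for p in points) for d in range(dims)]
    return n, s, sq

def merge_by_key(keyed_summaries):
    """Reduce step: add up, component by component, all summaries with the same key."""
    merged = {}
    for key, (n, s, sq) in keyed_summaries:
        if key not in merged:
            merged[key] = (n, list(s), list(sq))
        else:
            n0, s0, sq0 = merged[key]
            merged[key] = (
                n0 + n,
                [x + y for x, y in zip(s0, s)],
                [x + y for x, y in zip(sq0, sq)],
            )
    return merged

def centroid_and_variance(summary):
    """Centroid is SUM/N; per-dimension variance is SUMSQ/N - (SUM/N)^2."""
    n, s, sq = summary
    center = [x / n for x in s]
    variance = [q / n - c * c for q, c in zip(sq, center)]
    return center, variance

# Two Map tasks emit partial summaries for the same cluster key "c0".
part_a = summarize([[1.0, 2.0], [3.0, 4.0]])
part_b = summarize([[5.0, 6.0]])
merged = merge_by_key([("c0", part_a), ("c0", part_b)])
center, variance = centroid_and_variance(merged["c0"])
```

Because only three fixed-size quantities per cluster cross the network, the Reduce stage never needs the raw points of a cluster to combine partial results.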
Verifying the validity and timeliness of the algorithm
To verify the validity and timeliness of the big-data clustering algorithm on the Hadoop platform, the algorithm is verified with several groups of test data sets. The classical UCI data sets and the Public Data Sets (Amazon has provided developers with development data sets of tens of TB since 2008) are used to test the validity and timeliness of the big-data clustering results on the cloud computing platform.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited thereto. Any simple change or equivalent replacement of the technical scheme that a person skilled in the art can obviously obtain within the technical scope disclosed by the present invention falls within the protection scope of the present invention.

Claims (1)

1. A big-data clustering algorithm based on a cloud computing platform, characterized in that it comprises the following steps:
1) pre-processing the raw data set;
2) dividing the data set U into M sub-data sets and distributing them to M Map functions;
3) in the Map stage, carrying out local clustering on each sub-data set;
4) in the Reduce stage, merging classes with the same key;
5) if the actual cluster number R is less than the cluster number K, adjusting the representative point number c and the contraction factor and carrying out clustering again, until the termination condition is reached;
6) if N_new > N_old or K_new > K_old, partitioning the two data sets again together, with K = [(K_new + K_old)/2]; otherwise, treating the K cluster centers obtained from the data set before the update as K points, combining them with the new data source to form a new data set and partitioning it, with K = K_old;
7) repeating stages 3), 4) and 5) until the termination condition is reached.
CN201410104227.0A 2014-03-14 2014-03-14 A kind of big data clustering algorithm based on cloud computing platform Expired - Fee Related CN103838863B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410104227.0A CN103838863B (en) 2014-03-14 2014-03-14 A kind of big data clustering algorithm based on cloud computing platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410104227.0A CN103838863B (en) 2014-03-14 2014-03-14 A kind of big data clustering algorithm based on cloud computing platform

Publications (2)

Publication Number Publication Date
CN103838863A true CN103838863A (en) 2014-06-04
CN103838863B CN103838863B (en) 2017-07-18

Family

ID=50802359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410104227.0A Expired - Fee Related CN103838863B (en) 2014-03-14 2014-03-14 A kind of big data clustering algorithm based on cloud computing platform

Country Status (1)

Country Link
CN (1) CN103838863B (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104156463A (en) * 2014-08-21 2014-11-19 南京信息工程大学 Big-data clustering ensemble method based on MapReduce
CN104461551A (en) * 2014-12-16 2015-03-25 芜湖乐锐思信息咨询有限公司 Parallel data processing based big data processing system
CN104503820A (en) * 2014-12-10 2015-04-08 华南师范大学 Hadoop optimization method based on asynchronous starting
CN104699772A (en) * 2015-03-05 2015-06-10 孟海东 Big data text classifying method based on cloud computing
CN104933089A (en) * 2015-05-15 2015-09-23 江苏博智软件科技有限公司 Big data set spectrum clustering method based on accelerating iteration
CN105095455A (en) * 2015-07-27 2015-11-25 中国联合网络通信集团有限公司 Data connection optimization method and data operation system
CN105468698A (en) * 2015-11-18 2016-04-06 上海电机学院 Real-time processing method of mass orders
CN106446255A (en) * 2016-10-18 2017-02-22 安徽天达网络科技有限公司 Data processing method based on cloud server
CN106547890A (en) * 2016-11-04 2017-03-29 深圳云天励飞技术有限公司 Quick clustering preprocess method in large nuber of images characteristic vector
CN107291847A (en) * 2017-06-02 2017-10-24 东北大学 A kind of large-scale data Distributed Cluster processing method based on MapReduce
CN109143017A (en) * 2018-07-31 2019-01-04 成都天衡智造科技有限公司 A kind of semicon industry production test data processing method
CN110781815A (en) * 2019-10-25 2020-02-11 四川东方网力科技有限公司 Video data processing method and system
CN111460046A (en) * 2020-03-06 2020-07-28 合肥海策科技信息服务有限公司 Scientific and technological information clustering method based on big data
CN112200206A (en) * 2019-07-08 2021-01-08 浙江宇视科技有限公司 BIRCH algorithm improvement method, device and equipment based on distributed platform
CN112286989A (en) * 2020-10-28 2021-01-29 上海电机学院 Big data clustering mining method and platform
CN116595102A (en) * 2023-07-17 2023-08-15 法诺信息产业有限公司 Big data management method and system for improving clustering algorithm
CN116882850A (en) * 2023-09-08 2023-10-13 山东科技大学 Garden data intelligent management method and system based on big data
CN117194020A (en) * 2023-09-04 2023-12-08 北京宝联之星科技股份有限公司 Cloud computing original big data processing method, system and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110072006A1 (en) * 2009-09-18 2011-03-24 Microsoft Corporation Management of data and computation in data centers
CN103064991A (en) * 2013-02-05 2013-04-24 杭州易和网络有限公司 Mass data clustering method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JUN ZHAO .ETC: ""Parallelized Incremental Support Vector Machines Based on MapRaduce and Bagging Technique"", 《2012 IEEE INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND TECHNOLOGY》 *
KIRAN M .ETC: ""Verification and Validation of Parallel Support Vector Machine Algorithm based on MapReduce Program Model on Hadoop Cluster"", 《ADVANCED COMPUTING AND COMMUNICATION SYSTEM (ICACCS),2013 INTERNATIONAL CONFERENCE ON》 *
MIRKO KAMPF .ETC: ""Hadoop.TS: Large-Scale Time-Series Processing"", 《INTERNATIONAL JOURNAL OF COMPUTER APPLICATIONS》 *
顾瑞春,等: ""一种基于MapReduce的并行聚类模型"", 《计算机与现代化》 *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104156463A (en) * 2014-08-21 2014-11-19 南京信息工程大学 Big-data clustering ensemble method based on MapReduce
CN104503820B (en) * 2014-12-10 2018-07-24 华南师范大学 A kind of Hadoop optimization methods based on asynchronous starting
CN104503820A (en) * 2014-12-10 2015-04-08 华南师范大学 Hadoop optimization method based on asynchronous starting
CN104461551A (en) * 2014-12-16 2015-03-25 芜湖乐锐思信息咨询有限公司 Parallel data processing based big data processing system
CN104699772A (en) * 2015-03-05 2015-06-10 孟海东 Big data text classifying method based on cloud computing
CN104933089A (en) * 2015-05-15 2015-09-23 江苏博智软件科技有限公司 Big data set spectrum clustering method based on accelerating iteration
CN105095455A (en) * 2015-07-27 2015-11-25 中国联合网络通信集团有限公司 Data connection optimization method and data operation system
CN105095455B (en) * 2015-07-27 2018-10-19 中国联合网络通信集团有限公司 A kind of data connection optimization method and data arithmetic system
CN105468698A (en) * 2015-11-18 2016-04-06 上海电机学院 Real-time processing method of mass orders
CN106446255A (en) * 2016-10-18 2017-02-22 安徽天达网络科技有限公司 Data processing method based on cloud server
CN106547890B (en) * 2016-11-04 2018-04-03 深圳云天励飞技术有限公司 Quick clustering preprocess method in large nuber of images characteristic vector
CN106547890A (en) * 2016-11-04 2017-03-29 深圳云天励飞技术有限公司 Quick clustering preprocess method in large nuber of images characteristic vector
CN107291847A (en) * 2017-06-02 2017-10-24 东北大学 A kind of large-scale data Distributed Cluster processing method based on MapReduce
WO2018219163A1 (en) * 2017-06-02 2018-12-06 东北大学 Mapreduce-based distributed cluster processing method for large-scale data
CN107291847B (en) * 2017-06-02 2019-06-25 东北大学 A kind of large-scale data Distributed Cluster processing method based on MapReduce
CN109143017A (en) * 2018-07-31 2019-01-04 成都天衡智造科技有限公司 A kind of semicon industry production test data processing method
CN112200206A (en) * 2019-07-08 2021-01-08 浙江宇视科技有限公司 BIRCH algorithm improvement method, device and equipment based on distributed platform
CN112200206B (en) * 2019-07-08 2024-02-27 浙江宇视科技有限公司 BIRCH algorithm improvement method, device and equipment based on distributed platform
CN110781815A (en) * 2019-10-25 2020-02-11 四川东方网力科技有限公司 Video data processing method and system
CN110781815B (en) * 2019-10-25 2022-09-27 四川东方网力科技有限公司 Video data processing method and system
CN111460046A (en) * 2020-03-06 2020-07-28 合肥海策科技信息服务有限公司 Scientific and technological information clustering method based on big data
CN112286989A (en) * 2020-10-28 2021-01-29 上海电机学院 Big data clustering mining method and platform
CN116595102A (en) * 2023-07-17 2023-08-15 法诺信息产业有限公司 Big data management method and system for improving clustering algorithm
CN116595102B (en) * 2023-07-17 2023-10-17 法诺信息产业有限公司 Big data management method and system for improving clustering algorithm
CN117194020A (en) * 2023-09-04 2023-12-08 北京宝联之星科技股份有限公司 Cloud computing original big data processing method, system and storage medium
CN117194020B (en) * 2023-09-04 2024-04-05 北京宝联之星科技股份有限公司 Cloud computing original big data processing method, system and storage medium
CN116882850A (en) * 2023-09-08 2023-10-13 山东科技大学 Garden data intelligent management method and system based on big data
CN116882850B (en) * 2023-09-08 2023-12-12 山东科技大学 Garden data intelligent management method and system based on big data

Also Published As

Publication number Publication date
CN103838863B (en) 2017-07-18

Similar Documents

Publication Publication Date Title
CN103838863A (en) Big-data clustering algorithm based on cloud computing platform
CN102799486B (en) Data sampling and partitioning method for MapReduce system
Luo et al. A parallel DBSCAN algorithm based on Spark
Hao et al. Research of Cloud Computing based on the Hadoop platform
Wei et al. Incremental FP-Growth mining strategy for dynamic threshold value and database based on MapReduce
Zhou et al. Research of the FP-Growth Algorithm Based on Cloud Environments
Moutafis et al. Efficient processing of all-k-nearest-neighbor queries in the MapReduce programming framework
Shaikh et al. GeoFlink: A distributed and scalable framework for the real-time processing of spatial streams
Gunarathne et al. Portable parallel programming on cloud and HPC: Scientific applications of Twister4Azure
Ayall et al. Graph computing systems and partitioning techniques: A survey
Chan et al. A distributed stream library for Java 8
Fu et al. Research and application of DBSCAN algorithm based on Hadoop platform
Yu Data processing and development of big data system: a survey
Sharma et al. Parallelization of association rule mining: survey
Wang et al. Spark load balancing strategy optimization based on internet of things
Gao et al. On the power of combiner optimizations in MapReduce over MPI workflows
Luo et al. Implementation of a parallel graph partition algorithm to speed up BSP computing
Song et al. Big data mining method of thermal power based on spark and optimization guidance
Cheng et al. Stream-based particle swarm optimization for data migration decision
Wang et al. Distributed data mining based on semantic web and grid
Sharma et al. Simulation of performance analysis of MongoDB, Pig, Hive storage, MapReduce, Spark and YARN
Thein et al. Optimization of region distribution using binary partition-based matching algorithm for data distribution management
Lina Application Analysis and Development Strategy of Cloud Computing Technology in Computer Data Processing
Li et al. Skew-aware task scheduling in clouds
Shao et al. Large-scale Graph Analysis: System, Algorithm and Optimization

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
DD01 Delivery of document by public notice Addressee: Patent director of Inner Mongolia University of Science and Technology; Document name: Notice of termination of patent
CF01 Termination of patent right due to non-payment of annual fee Granted publication date: 20170718; Termination date: 20200314