CN109657712A - A kind of electric business food and drink data analysing method based on the improved K-Means algorithm of Spark - Google Patents

A kind of electric business food and drink data analysing method based on the improved K-Means algorithm of Spark Download PDF

Info

Publication number
CN109657712A
CN109657712A CN201811507426.0A CN201811507426A CN109657712A CN 109657712 A CN109657712 A CN 109657712A CN 201811507426 A CN201811507426 A CN 201811507426A CN 109657712 A CN109657712 A CN 109657712A
Authority
CN
China
Prior art keywords
cluster
data
spark
point
center
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811507426.0A
Other languages
Chinese (zh)
Other versions
CN109657712B (en
Inventor
任晨雨
唐月标
黄鹏程
华惊宇
张昱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201811507426.0A priority Critical patent/CN109657712B/en
Publication of CN109657712A publication Critical patent/CN109657712A/en
Application granted granted Critical
Publication of CN109657712B publication Critical patent/CN109657712B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/12Hotels or restaurants

Abstract

A kind of electric business food and drink data analysing method based on the improved K-Means algorithm of Spark, comprising the following steps: step 1, experiment porch and its configuration are built, to realize that Parallel K-Means Clustering Algorithm in Web builds performing environment;Step 2, the acquisition of raw data set;Step 3, raw data set is pre-processed;Step 4, exploitation of the K-Means algorithm in Spark is realized using Scalable language;Step 5, the program editted is compiled execution, is finally completed cluster process.Map the and Combine operator that the present invention utilizes Spark to provide;Using the data structure of RDD;Spark results of intermediate calculations is stored in memory, in conjunction with the clustering algorithm that the initialization cluster centre part of a kind of pair of K-Means algorithm improves, realizes the analysis of electric business food and drink data, processing speed is very fast, and Clustering Effect is preferable.

Description

A kind of electric business food and drink data analysing method based on the improved K-Means algorithm of Spark
Technical field
The invention belongs to big data analysis digging technology fields, especially a kind of to be calculated based on the improved cluster of Spark platform Application of the method in electric business food and drink data analysis field.
Background technique
Since 21 century, with the continuous progress of science and technology, our society is also more and more information-based, and following each row is each The huge data of industry are also to complement each other with information-based industry.The presence of big data, to our life, business, medical treatment, boat It, agriculture, traffic and the development of other field have served very important.Therefore, more profound between mining data Relationship has critically important significance to the anticipation research aspect in each field.But face data, the investment of analytical technology and There are a huge contradiction between acquisition of information, information required for how efficiently quickly extracting and knowledge, removal are not required to The secondary or garbage wanted, to improve data mining in the practicability in each field be a critically important research direction.
In information-based industry high speed development, data are increased with almost exponential other speed, for grinding for data The method of studying carefully is also with diversity, and wherein clustering is one such important analysis method, and analysis and research personnel One of highest method of frequency of use.But traditional data analysis is often limited to data processing platform (DPP) and technology, is unable to satisfy The demand of research and development at this stage.It but is data mining in recent years, with the successive appearance of Hadoop platform and Spark platform Analysis provides good distributed type assemblies frame, targetedly solves the storage and operation difficult point of mass data.Compared to Hadoop platform, Spark platform are transported on the basis of Hadoop MapReduce frame by increasing RDD and the data of itself It calculates intermediate result and is stored in this main two major features of memory, so that Spark platform has higher fault-tolerance and high scalability.
Summary of the invention:
In order to improve the analysis mining working efficiency for facing the huge electric business food and drink data that information-based industry increasingly generates, this Map the and Combine operator that invention is provided using Spark;The data structure of the RDD of inner most general;Spark intermediate computations knot Fruit is stored in the advantages such as memory, the clustering algorithm knot improved with the initialization cluster centre part of a kind of pair of K-Means algorithm It closes, realizes and the data of some electric business catering industries in Beijing are analyzed, the superiority and algorithm for therefrom comparing modified hydrothermal process exist The runing time of Spark platform and single machine platform embodies the superiority of Spark cluster parallel data processing speed.
The technical solution adopted by the present invention to solve the technical problems is:
A kind of electric business food and drink data analysing method based on the improved K-Means algorithm of Spark, the method includes following Step:
Step 1, experiment porch and its configuration are built, to realize that Parallel K-Means Clustering Algorithm in Web builds performing environment, process It is as follows:
The distributed environment of the Spark of 1.1 configuration Master is as follows:
1.1.1 Spark2.1.0 installation kit is downloaded, decompresses and is installed;
1.1.2 modifying associated profile:
Under into/spark2.1.0-bin-hadoop2.7.5/conf catalogue, two files are modified.One of them is to match Spark-env.sh file is set, variable SCALA_HONE, JAVA_HOME, SPARK_MASTER_IP, SPARK_WOKER_ are set The value of MEMORY and HADOOP_CONF_DIR;The other is configuration slave file, is added to this for master and each node In file;
1.2 configuration Scala develop environment:
Because Spark platform is compiled using Scala language, need that Scala is installed;Download to .msi file with Afterwards, it is installed according to step, after being installed, the installation path that global variable SCALA_HOME is Scala is set;
It is finally tested, checks whether Scala installs success, open a new CMD window, input default Scala instruction, if interactive command can be executed with normal circulation, expression is installed successfully;
Step 2, the acquisition of raw data set, process are as follows: experimental data is to choose the information data of food and drink retail shop, data Object include longitude, dimension, city, trade name, address, comprehensive score, comment number, environment scoring, taste scoring, service scoring and Commercial circle data information;
Step 3, raw data set is pre-processed, the data that fill up the vacancy and deletion hash;
Step 4, exploitation of the K-Means algorithm in Spark is realized using Scalable language, process is as follows:
4.1K-Means algorithm is using distance as the similarity measures between data object, to gather to data Class belongs to unsupervised learning, the similitude between data is indicated using Euclidean distance, the calculation formula of Euclidean distance:
Wherein, xi, xjAny two data object in data set is respectively represented, N indicates of the total attribute of each data object Number;
Iteration each time in K-Means cluster process, cluster centre will be calculated and be updated from newly, calculate new cluster Center exactly calculates in this cluster, the mean value of all objects, it is assumed that the cluster centre of k-th cluster is expressed as Centerk, meter The mode for calculating the new cluster centre of this cluster is as follows:
Wherein, CkIt is K class cluster, | Ck| it is the number of data object in K class cluster.Here summation refers to K class cluster Ck Sum of the middle all elements on every Column Properties, so CenterkIt is the vector that a length is D, is expressed as follows:
Centerk=[Centerk1, Centerk2, Centerk3..., CenterkD]
There are two types of the termination conditions of iteration, and one is setting the number of iterations T, and when reaching the T times iteration, program is whole Only iteration, the cluster result obtained at this time are final cluster result;Another kind is that error sum of squares is used to change as program The threshold values that generation terminates, the functional expression are expressed as follows:
Wherein, K indicates the number of class cluster, when the difference of the E of iteration twice is less than certain certain threshold values, i.e. Δ E < δ, program determination iteration;
The algorithm improvement of 4.2 pairs of initialization cluster centre parts, steps are as follows:
4.2.1 a data object is randomly choosed from pretreated data set at random as initial center point Ci, wherein I ∈ (1,2,3 ... K), the data object of data set assume that one shares N number of, data object are referred to as data point, data point is Refer to the vector of the three-dimensional feature composition of data intensive data object;
Ci=rand ([V(1, j), V(2, j), V(3, j))=[V(1, i), V(2, i), V(3, i)]
Wherein, j ∈ (1,2,3 ... N)
4.2.2 each point is calculated in data set to nearest cluster centre point Ci(Ci+1) distance Dj, to all DjIt does Be denoted as Sumi, C is used for the first timei, circulation is internal to use Ci+1
4.2.3 a random point is taken again, then calculates next initial cluster center point, random point using the mode of weight RiValue mode be random value Ri∈ (0, Sumi), R is done to data seti=Ri-DjLoop computation, until Ri< 0, then, Corresponding DjIt is exactly next cluster centre point Ci+1=Dj
4.2.4 4.2.2 and 4.2.3 two above step is repeated, until k-th central point is selected, chooses initial center Point algorithm terminates;
4.3 improved K-Means algorithms realize step in Spark:
4.3.1 parallel processing is carried out to initial data using HDFS, RDD
Numeralization processing first is carried out to the quantitative characteristic inside data set and removes unwanted feature inside data object Dimension, and data source is stored to HDFS by treated;
4.3.2 the cluster centre initialization procedure for executing 4.2 steps, K initial cluster center required for obtaining;
4.3.3 iterative operation is executed, meeting with reference to the number of iterations or obtained new cluster centre point has been more than to be advised Fixed critical value range, then iteration terminates, and otherwise continues to execute iterative operation;
4.3.4 new data variable is obtained, to obtain new cluster centre point;
4.3.5 cluster centre point and the number of iterations are updated;
Step 5, the program editted is compiled execution, is finally completed cluster process.
Further, the method also includes following steps:
Step 6, the data after cluster are visualized.
Further, the method also includes following steps:
Step 7, the execution velocity test under single board computer and cluster is carried out respectively: respectively by program in single board computer and difference Under the cluster of Worker node, the time used in entire data processing is recorded.Technical concept of the invention are as follows: the present invention utilizes Map the and Combine operator that Spark is provided;The data structure of the RDD of inner most general;In Spark results of intermediate calculations is stored in It the advantages such as deposits, reads data set from HDFS first, and RDD required for creating, then execute the cluster centre of this experiment Innovatation Initial method, more high probability are obtained apart from farther away initial cluster center point, and further progress iteration obtains new cluster Center is iterated operation under new cluster centre, after the cluster operation of each section is all completed, converges the poly- of each section Class is as a result, calculating finally obtains cluster result until iteration completion apart from new central point.
Beneficial effects of the present invention are mainly manifested in: on the one hand, on Spark platform, operation function is and the direct phase of RDD Mutual correlation, that is, the data distributed still are clustered to central point inside each RDD, and are executed parallel, are run As a result it does not need repeatedly to return and be calculated from new, then fast more many than being handled on single board computer using cluster processing data;It is another Aspect, in terms of algorithm, with primal algorithm it is direct at random take K initialization cluster centre point compared with, improve in initialization The process of heart point, so that direct random choosing of the distance between random point chosen probability big as far as possible than primal algorithm The probability taken wants high.
Detailed description of the invention
Fig. 1 is to run the realization of K-Means algorithm parallel using Spark to the overall flow figure of the cluster of data set.
Fig. 2 is that the runing time pair in different number of nodes is arranged in K-Means clustering algorithm on single machine and Spark platform Than figure.
Fig. 3 is the cluster result three-dimensional scatterplot schematic diagram to food and drink retail shop, Beijing.
Specific embodiment
The invention will be further described below in conjunction with the accompanying drawings.
Referring to Fig.1~Fig. 3, a kind of electric business food and drink data analysing method based on the improved K-Means algorithm of Spark, packet Include following steps:
Step 1, experiment porch and its configuration are built, to realize that Parallel K-Means Clustering Algorithm in Web builds necessary execution ring Border, process are as follows:
The distributed environment of the Spark of 1.1 configuration Master is as follows:
1.1.1 Spark2.1.0 installation kit is downloaded, decompresses and is installed;
1.1.2 modifying associated profile:
Under into/spark2.1.0-bin-hadoop2.7.5/conf catalogue, two files are modified.One of them is to match Spark-env.sh file is set, variable SCALA_HONE, JAVA_HOME, SPARK_MASTER_IP, SPARK_WOKER_ are set The value of MEMORY and HADOOP_CONF_DIR;The other is configuration slave file, is added to this for master and each node In file;
1.2 configuration Scala develop environment:
Because Spark platform is compiled using Scala language, need that Scala is installed;Download to .msi file with Afterwards, it is installed according to step, after being installed, the installation path that global variable SCALA_HOME is Scala is set;
It is finally tested, checks whether Scala installs success.A new CMD window is opened, input default Scala instruction, if interactive command can be executed with normal circulation, expression is installed successfully.
Step 2, the acquisition of raw data set
2.1 experimental datas are to choose some information datas of the food and drink retail shop of Beijing, and data object includes longitude, dimension 112065 numbers such as degree, city, trade name, address, comprehensive score, comment number, environment scoring, taste scoring, service scoring, commercial circle It is believed that breath.
It step 3, is the overall distribution situation for studying whole background food and drink retail shop and the comprehensive score of retail shop for notebook data With the degree of correlation of comment quantity and environment scoring three, field required for the experiment has longitude, dimension, comprehensive score, point Comment number, environment scoring.Many data are there are vacancy value it can be seen from the initial data of upper figure, and the one of raw data set A little fields are unwanted in test, for example, retail shop address, call the roll etc..Therefore original data set is pre-processed;
Step 4, exploitation of the K-Means algorithm in Spark is realized using Scalable language, process is as follows:
4.1K-Means algorithm is using distance as the similarity measures between data object, to gather to data Class, belongs to unsupervised learning, and this method generallys use Euclidean distance to indicate the similitude between data, the calculating of Euclidean distance Formula:
Wherein, xi, xjAny two data object in data set is respectively represented, N indicates of the total attribute of each data object Number;
Iteration each time in K-Means cluster process, cluster centre will be calculated and be updated from newly, calculate new cluster Center exactly calculates in this cluster, the mean value of all objects, it is assumed that the cluster centre of k-th cluster is expressed as Centerk, meter The mode for calculating the new cluster centre of this cluster is as follows:
Wherein, CkIt is K class cluster, | Ck| it is the number of data object in K class cluster.Here summation refers to K class cluster Ck Sum of the middle all elements on every Column Properties, so CenterkIt is the vector that a length is D, is expressed as follows:
Centerk=[Centerk1, Centerk2, Centerk3..., CenterkD]
There are two types of the termination condition of the algorithm iteration is general, one is setting the number of iterations T, when reaching the T times iteration When, program determination iteration, the cluster result obtained at this time is final cluster result;Another kind is using error sum of squares As the threshold values of program iteration ends, which is expressed as follows:
Wherein, K indicates the number of class cluster.When the difference of the E of iteration twice is less than certain certain threshold values, i.e. Δ E < δ, program determination iteration.
The algorithm improvement of 4.2 pairs of initialization cluster centre parts
The algorithm of original K-Means the step of initializing cluster centre, is then generated at the beginning of K using random operator Beginning cluster centre point.Due to initial point generate randomness, the distance between possible initial clustering point very little, or selection be The noise spot etc. of data set, it is very poor to frequently can lead to cluster result.Therefore, the distance between selection initialization cluster centre point is most It is possible big, following improvement is done to initialization this part of cluster centre, steps are as follows:
4.2.1 a data object is randomly choosed from pretreated data set at random as initial center point Ci.Wherein, I ∈ (1,2,3 ... K), the data object hypothesis one of data set shares N number of.For a better understanding, data object is referred to as Data point, point here really refer to the vector of the three-dimensional feature composition of data intensive data object;;
Ci=rand ([V(1, j), V(2, j), V(3, j))=[V(1, i), V(2, i), V(3, i)]
Wherein, j ∈ (1,2,3 ... N)
4.2.2 each point is calculated in data set to nearest cluster centre point Ci(Ci+1) distance Dj, to all DjIt does Be denoted as Sumi.Algorithm uses C for the first timei, circulation is internal to use Ci+1
4.2.3 a random point is taken again, then calculates next initial cluster center point, random point using the mode of weight RiValue mode be random value Ri∈ (0, Sumi), R is done to data seti=Ri-DjLoop computation, until Ri< 0, then, Corresponding DjIt is exactly next cluster centre point Ci+1=Dj
4.2.4 repeating 4.2.2 and 4.2.3 two above step, until k-th central point is selected, initial center is chosen Point algorithm terminates;
4.3 improved K-Means algorithms realize step in Spark:
4.3.1 parallel processing is carried out to initial data using HDFS, RDD
Numeralization processing first is carried out to the quantitative characteristic inside data set and removes unwanted feature inside data object Dimension, and data source is stored to HDFS by treated;
4.3.2 the cluster centre initialization procedure for executing 4.2 steps, K initial cluster center required for obtaining;
4.3.3 iterative operation is executed inside algorithm, is met super with reference to the number of iterations or obtained new cluster centre point The critical value range of defined is crossed, then iteration terminates, and otherwise continues to execute iterative operation.
4.3.4 new data variable is obtained, to obtain new cluster centre point.
4.3.5 cluster centre point and the number of iterations are updated.
Step 5, the program editted is compiled execution, is finally completed cluster process.
Step 6, the data after cluster are visualized
Step 7, the execution velocity test of program under single board computer and cluster is carried out: respectively in order to preferably prove Spark collection Group parallel data processing superiority, the step for respectively by program under the cluster of single board computer and different Worker nodes, note Record the time used in entire data processing.The final runing time of data is shown in Fig. 2.
For the validity of verification algorithm improvement part, below to the primal algorithm of the library the ML institute band of innovatory algorithm and Spark The effect of cluster is assessed from silhouette coefficient, CH (Calinski-Harabaz) index.
Rationally whether silhouette coefficient be to cluster, effectively to measure.For each sample, the expression formula of silhouette coefficient is as follows:
Wherein, j ∈ (1,2,3 ... N), ajIndicate the average distance with sample in other classifications, bjIt is nearest with its distance The average distance of different classes of middle sample.
It can also be expressed as following formula:
Silhouette coefficient value is [- 1,1], between the bigger explanation of value is similar sample apart from very little, it is different classes of between Sample distance is bigger, so value is the bigger the better.For a sample set, its silhouette coefficient is the silhouette coefficient of all samples Average value.
CH index be the dense degree in same category of indicating and be different classes of between dispersion degree, its mathematical table It is as follows up to formula:
Wherein, m is the sample number of training set;K is classification number, in this experiment, k=3;BkCovariance between classification Matrix;WkFor the covariance matrix inside classification, the mark of tr representing matrix.If the covariance inside classification is smaller, between classification Covariance it is bigger, then fractional value is also bigger, illustrate that Clustering Effect is better.If numerical value is smaller, different clusters are indicated Between boundary it is unobvious, Clustering Effect is also poorer.
The two indices size of ML-K-Means algorithm and innovatory algorithm difference is as shown in table 1:
Silhouette coefficient CH index
ML-K-Means 0.8166 47557.9709
Change-K-Means 0.8596 50846.2341
Table 1
From table 1 it follows that the algorithm that modified hydrothermal process provides on silhouette coefficient and CH index with the library ML of Spark It compares, two coefficients are above primal algorithm, although improving in silhouette coefficient index less obviously, mention in CH index Height is bigger, therefore on the whole, and the Clustering Effect of modified hydrothermal process wants the excellent algorithm carried with the library ML.
Fig. 1 is the overall flow frame diagram of K-Means algorithm data processing on Spark frame, mainly by three parts group At being these three processes of Map, Combine, Reduce respectively.Using the relationship between RDD and Spark and memory, guarantee first Operational efficiency;Then RDD is assigned on each node, each Paralleled executes local cluster operation, each section Local Clustering After, convergence Local Clustering is as a result, whether the distance for calculating new central point restrains, and algorithm terminates if convergence, otherwise Each subregion is from all of above step is newly executed, until the distance convergence of last new central point, algorithm terminate.
Fig. 2 is that the speed of service of the algorithm between nodes different under single board computer and Spark frame is compared.Its In, data set is 93625 datas, and " 0 " indicates the runing time on single board computer;" 1 " indicates there is being a Worker node Cluster runing time;" 2 " are indicated there are two the cluster runing time of Worker node, and so on.As seen from the figure, When cluster Worker node only one when, the runing time of algorithm be above single board computer operation, this is because cluster Calculating needs to initialize inter-related task, there is the newly-built of job, the distribution of resource and scheduling etc., these are all needs time-consumings, but It is that single board computer does not need.But with the increase of number of nodes, the executing tasks parallelly advantage of Spark is just more obvious, it can be seen that When Worker node is equal to 4, the processing speed of Spark is 1.6 times of single board computer or so.
The case where this figure of Fig. 3 is the whole cluster in terms of three attribute angles of data itself.Wherein, symbol " ◆ " indicates Cluster 0, symbol "+" region indicate that cluster 1, symbol " * " indicate cluster 2.As seen from the figure, 0 comprehensive score of cluster and environment score two Dimension is substantially linear, and comprehensive score is higher, and environment scoring is also relatively high, and comprehensive score is lower, environment scoring It is relatively low, but StoreFront general comment quantity it is most of at 2000 hereinafter, differ greatly with other two cluster in general comment quantity, In three-dimensional scatter plot, it is substantially distributed in following;2 overall distribution of cluster be on cluster 0, and comprehensive score and environment scoring Medium higher fraction range is all concentrated on, comments on total quantity substantially in 2000 to 5000 ranges;Cluster 1 is distributed in three-dimensional scatterplot The linear relationship of the top of figure, comprehensive score and environment two dimensions of scoring is less obvious, and composite score is arrived 4.5 substantially 5, environment scoring compare substantially in the score of 8 to 10, two dimensions with other two dimension, fraction range be it is highest, most An apparent index is exactly general comment quantity, substantially between 5000 to 15000.

Claims (3)

1. a kind of electric business food and drink data analysing method based on the improved K-Means algorithm of Spark, which is characterized in that the side Method the following steps are included:
Step 1, experiment porch and its configuration are built, to realize that Parallel K-Means Clustering Algorithm in Web builds performing environment, process is such as Under:
The distributed environment of the Spark of 1.1 configuration Master is as follows:
1.1.1 Spark2.1.0 installation kit is downloaded, decompresses and is installed;
1.1.2 modifying associated profile:
Under into/spark2.1.0-bin-hadoop2.7.5/conf catalogue, two files are modified, one of them is configuration Variable SCALA_HONE, JAVA_HOME, SPARK_MASTER_IP, SPARK_WOKER_ is arranged in spark-env.sh file The value of MEMORY and HADOOP_CONF_DIR;The other is configuration slave file, is added to this for master and each node In file;
1.2 configuration Scala develop environment:
Because Spark platform is compiled using Scala language, need that Scala is installed;It downloads to after .msi file, It is installed according to step, after being installed, the installation path that global variable SCALA_HOME is Scala is set;
It is finally tested, checks whether Scala installs success, open a new CMD window, the Scala of input default refers to It enables, if interactive command can be executed with normal circulation, expression is installed successfully;
Step 2, the acquisition of raw data set, process are as follows: experimental data is to choose the information data of food and drink retail shop, data object Including longitude, dimension, city, trade name, address, comprehensive score, comment number, environment scoring, taste scoring, service scoring and commercial circle Data information;
Step 3, raw data set is pre-processed, the data that fill up the vacancy and deletion hash;
Step 4, exploitation of the K-Means algorithm in Spark is realized using Scalable language, process is as follows:
4.1K-Means algorithm is using distance as the similarity measures between data object, to cluster to data, belongs to In unsupervised learning, the similitude between data is indicated using Euclidean distance, the calculation formula of Euclidean distance:
Wherein, xi, xjAny two data object in data set is respectively represented, N indicates the number of the total attribute of each data object;
Iteration each time in K-Means cluster process, cluster centre will be calculated and be updated from newly, calculate in new cluster The heart exactly calculates in this cluster, the mean value of all objects, it is assumed that the cluster centre of k-th cluster is expressed as Centerk, calculate The mode of the new cluster centre of this cluster is as follows:
Wherein, CkIt is K class cluster, | Ck| it is the number of data object in K class cluster, summation here refers to K class cluster CkMiddle institute There is sum of the element on every Column Properties, so CenterkIt is the vector that a length is D, is expressed as follows:
Centerk=[Centerk1, Centerk2, Centerk3..., CenterkD]
There are two termination conditions for the iteration. One is to set a number of iterations T: when the T-th iteration is reached, the program stops iterating, and the cluster result obtained at that point is the final cluster result. The other is to use the error sum of squares as the threshold for terminating the program iteration; this function is expressed as:

E = Σ_{k=1}^{K} Σ_{x ∈ C_k} ||x − Center_k||²

where K is the number of clusters; when the difference between the values of E in two successive iterations is smaller than a given threshold δ, i.e. ΔE < δ, the program stops iterating;
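The error-sum-of-squares stopping criterion can be sketched in plain Python (`delta` plays the role of the threshold δ; function names are our own):

```python
def sse(clusters, centers):
    """Error sum of squares E over all K clusters (squared Euclidean distances)."""
    total = 0.0
    for points, center in zip(clusters, centers):
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, center))
    return total

def should_stop(e_prev, e_curr, delta=1e-4):
    """Stop when the change of E between two successive iterations is below delta."""
    return abs(e_prev - e_curr) < delta

# one cluster {[0,0], [2,0]} with center [1,0]: E = 1 + 1
print(sse([[[0, 0], [2, 0]]], [[1, 0]]))  # → 2.0
```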
4.2 Improvement of the initialization of the cluster centers; the steps are as follows:
4.2.1 Randomly select one data object from the pre-processed data set as the initial center point C_i, where i ∈ {1, 2, 3, ..., K}; assume the data set contains N data objects in total; a data object is also called a data point, i.e. the vector formed by the three-dimensional features of a data object in the data set:

C_i = rand([V_(1,j), V_(2,j), V_(3,j)]) = [V_(1,i), V_(2,i), V_(3,i)]

where j ∈ {1, 2, 3, ..., N};
4.2.2 For each point in the data set, compute the distance D_j to the nearest already-chosen cluster center point C_i (or C_{i+1}); sum all the D_j and denote the sum Sum_i; C_i is used in the first pass, and C_{i+1} inside the loop;
4.2.3 Then compute the next initial cluster center in a weighted manner: draw a random value R_i ∈ (0, Sum_i), then loop over the data set computing R_i = R_i − D_j until R_i < 0; the point whose D_j made R_i negative is taken as the next cluster center point C_{i+1};
4.2.4 Repeat steps 4.2.2 and 4.2.3 above until the K-th center point has been selected; the initial-center selection method then terminates;
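Steps 4.2.1–4.2.4 describe a K-Means++-style weighted seeding. A standalone sketch in plain Python (our own function names; the patented Spark version would distribute the distance computation across Worker nodes):

```python
import random

def init_centers(points, k, seed=42):
    """K-Means++-style seeding following steps 4.2.1-4.2.4."""
    rng = random.Random(seed)
    centers = [list(rng.choice(points))]            # 4.2.1: first center chosen at random
    while len(centers) < k:
        # 4.2.2: D_j = distance of each point to its nearest chosen center; Sum_i = sum of D_j
        d = [min(sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5 for c in centers)
             for p in points]
        r = rng.uniform(0, sum(d))                  # 4.2.3: random value R_i in (0, Sum_i)
        for p, dj in zip(points, d):                # subtract each D_j until R_i < 0
            r -= dj
            if r < 0:                               # this point becomes the next center
                centers.append(list(p))
                break
        else:                                       # numeric edge case: fall back to last point
            centers.append(list(points[-1]))
    return centers
```

Points far from all chosen centers carry a large D_j and are therefore more likely to be picked, which spreads the initial centers across the data.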
4.3 Implementation steps of the improved K-Means algorithm on Spark:
4.3.1 Parallel processing of the raw data with HDFS and RDDs:
first, the quantitative features in the data set are converted to numeric form and the unneeded feature dimensions are removed from the data objects; the processed data source is then stored on HDFS;
4.3.2 Execute the cluster-center initialization procedure of step 4.2 to obtain the K required initial cluster centers;
4.3.3 Execute the iterative operation; if the specified number of iterations is reached or the change of the newly obtained cluster center points satisfies the specified threshold range, the iteration terminates; otherwise the iterative operation continues;
4.3.4 Obtain the new data variables, and from them the new cluster center points;
4.3.5 Update the cluster center points and the iteration count;
Step 5, the edited program is compiled and executed, finally completing the clustering process.
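Taken together, steps 4.3.2–4.3.5 form the usual assign-update loop. A standalone sketch in plain Python (lists stand in for the HDFS/RDD data source, and simple random seeding stands in for the improved initialization of step 4.2):

```python
import random

def kmeans(points, k, max_iter=20, delta=1e-4, seed=0):
    """Assign-update loop of steps 4.3.3-4.3.5 with an SSE-based stop (plain Python)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]   # stand-in for step 4.3.2 seeding
    prev_e = float("inf")
    for _ in range(max_iter):                            # 4.3.3: iterate
        clusters = [[] for _ in range(k)]
        e = 0.0
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            i = dists.index(min(dists))                  # assign to nearest center
            clusters[i].append(p)
            e += dists[i]
        for i, pts in enumerate(clusters):               # 4.3.4: new centers from assignments
            if pts:
                centers[i] = [sum(col) / len(pts) for col in zip(*pts)]
        if abs(prev_e - e) < delta:                      # SSE change below threshold: stop
            break
        prev_e = e                                       # 4.3.5: update and continue
    return centers, clusters
```

In the Spark implementation the assignment step would be a `map` over an RDD and the center recomputation a `reduceByKey`; the loop structure is the same.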
2. The e-commerce catering data analysis method based on the Spark-improved K-Means algorithm according to claim 1, characterized in that the method further comprises the following step:
Step 6, the clustered data are visualized.
3. The e-commerce catering data analysis method based on the Spark-improved K-Means algorithm according to claim 2, characterized in that the method further comprises the following step:
Step 7, execution-speed tests are carried out on a single machine and on clusters respectively: the program is run on a single machine and on clusters with different numbers of Worker nodes, and the time taken for the entire data processing is recorded in each case.
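A minimal timing harness for the speed test of Step 7 might look like the following plain-Python sketch (`fn` would be the whole clustering run; the function name is our own):

```python
import time

def timed_run(fn, *args, **kwargs):
    """Run fn once and return (result, wall-clock seconds), as in the Step 7 speed test."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# e.g. timing a cheap stand-in workload
result, seconds = timed_run(sum, range(1000))
print(result, seconds >= 0.0)
```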
CN201811507426.0A 2018-12-11 2018-12-11 E-commerce catering data analysis method based on Spark improved K-Means algorithm Active CN109657712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811507426.0A CN109657712B (en) 2018-12-11 2018-12-11 E-commerce catering data analysis method based on Spark improved K-Means algorithm


Publications (2)

Publication Number Publication Date
CN109657712A true CN109657712A (en) 2019-04-19
CN109657712B CN109657712B (en) 2021-06-18

Family

ID=66114001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811507426.0A Active CN109657712B (en) 2018-12-11 2018-12-11 E-commerce catering data analysis method based on Spark improved K-Means algorithm

Country Status (1)

Country Link
CN (1) CN109657712B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150066646A1 (en) * 2013-08-27 2015-03-05 Yahoo! Inc. Spark satellite clusters to hadoop data stores
EP3180695A1 (en) * 2014-08-14 2017-06-21 Qubole Inc. Systems and methods for auto-scaling a big data system
CN105678398A (en) * 2015-12-24 2016-06-15 国家电网公司 Power load forecasting method based on big data technology, and research and application system based on method
CN107886132A (en) * 2017-11-24 2018-04-06 云南大学 A kind of Time Series method and system for solving music volume forecasting

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TAPAN SHARMA et al.: "Best of breed solution for clustering of satellite images using big data platform Spark", IEEE *
CHENG Guojian et al.: "Application of the K-means clustering algorithm on the Spark platform", Software Guide (《软件导刊》) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378550A (en) * 2019-06-03 2019-10-25 东南大学 The processing method of the extensive food data of multi-source based on distributed structure/architecture
CN110705606A (en) * 2019-09-12 2020-01-17 武汉大学 Spatial K-means clustering method based on Spark distributed memory calculation
CN111163071A (en) * 2019-12-20 2020-05-15 杭州九略智能科技有限公司 Unknown industrial protocol recognition engine
CN111145042A (en) * 2019-12-31 2020-05-12 国网北京市电力公司 Power distribution network voltage abnormity diagnosis method adopting full-connection neural network
CN111241812A (en) * 2020-01-09 2020-06-05 内蒙古工业大学 Big data text clustering test method and system based on parallel improved K-means algorithm
CN111858671A (en) * 2020-07-14 2020-10-30 苏州浪潮智能科技有限公司 Method, device, equipment and medium for improving CleaStream algorithm
CN111858671B (en) * 2020-07-14 2022-07-05 苏州浪潮智能科技有限公司 Method, device, equipment and medium for improving CleaStream algorithm
CN112381559A (en) * 2020-10-14 2021-02-19 浪潮软件股份有限公司 Tobacco retailer segmentation method based on unsupervised machine learning algorithm
CN112905863A (en) * 2021-03-19 2021-06-04 青岛檬豆网络科技有限公司 Automatic customer classification method based on K-Means clustering
CN116595102A (en) * 2023-07-17 2023-08-15 法诺信息产业有限公司 Big data management method and system for improving clustering algorithm
CN116595102B (en) * 2023-07-17 2023-10-17 法诺信息产业有限公司 Big data management method and system for improving clustering algorithm


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant