CN109657712A - A kind of electric business food and drink data analysing method based on the improved K-Means algorithm of Spark - Google Patents

A kind of electric business food and drink data analysing method based on the improved K-Means algorithm of Spark Download PDF

Info

Publication number
CN109657712A
CN109657712A CN201811507426.0A CN201811507426A CN109657712A CN 109657712 A CN109657712 A CN 109657712A CN 201811507426 A CN201811507426 A CN 201811507426A CN 109657712 A CN109657712 A CN 109657712A
Authority
CN
China
Prior art keywords
cluster
data
spark
point
center
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811507426.0A
Other languages
Chinese (zh)
Other versions
CN109657712B (en
Inventor
任晨雨
唐月标
黄鹏程
华惊宇
张昱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201811507426.0A priority Critical patent/CN109657712B/en
Publication of CN109657712A publication Critical patent/CN109657712A/en
Application granted granted Critical
Publication of CN109657712B publication Critical patent/CN109657712B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/12Hotels or restaurants

Abstract

A kind of electric business food and drink data analysing method based on the improved K-Means algorithm of Spark, comprising the following steps: step 1, experiment porch and its configuration are built, to realize that Parallel K-Means Clustering Algorithm in Web builds performing environment;Step 2, the acquisition of raw data set;Step 3, raw data set is pre-processed;Step 4, exploitation of the K-Means algorithm in Spark is realized using Scalable language;Step 5, the program editted is compiled execution, is finally completed cluster process.Map the and Combine operator that the present invention utilizes Spark to provide;Using the data structure of RDD;Spark results of intermediate calculations is stored in memory, in conjunction with the clustering algorithm that the initialization cluster centre part of a kind of pair of K-Means algorithm improves, realizes the analysis of electric business food and drink data, processing speed is very fast, and Clustering Effect is preferable.

Description

A kind of electric business food and drink data analysing method based on the improved K-Means algorithm of Spark
Technical field
The invention belongs to big data analysis digging technology fields, especially a kind of to be calculated based on the improved cluster of Spark platform Application of the method in electric business food and drink data analysis field.
Background technique
Since 21 century, with the continuous progress of science and technology, our society is also more and more information-based, and following each row is each The huge data of industry are also to complement each other with information-based industry.The presence of big data, to our life, business, medical treatment, boat It, agriculture, traffic and the development of other field have served very important.Therefore, more profound between mining data Relationship has critically important significance to the anticipation research aspect in each field.But face data, the investment of analytical technology and There are a huge contradiction between acquisition of information, information required for how efficiently quickly extracting and knowledge, removal are not required to The secondary or garbage wanted, to improve data mining in the practicability in each field be a critically important research direction.
In information-based industry high speed development, data are increased with almost exponential other speed, for grinding for data The method of studying carefully is also with diversity, and wherein clustering is one such important analysis method, and analysis and research personnel One of highest method of frequency of use.But traditional data analysis is often limited to data processing platform (DPP) and technology, is unable to satisfy The demand of research and development at this stage.It but is data mining in recent years, with the successive appearance of Hadoop platform and Spark platform Analysis provides good distributed type assemblies frame, targetedly solves the storage and operation difficult point of mass data.Compared to Hadoop platform, Spark platform are transported on the basis of Hadoop MapReduce frame by increasing RDD and the data of itself It calculates intermediate result and is stored in this main two major features of memory, so that Spark platform has higher fault-tolerance and high scalability.
Summary of the invention:
In order to improve the analysis mining working efficiency for facing the huge electric business food and drink data that information-based industry increasingly generates, this Map the and Combine operator that invention is provided using Spark;The data structure of the RDD of inner most general;Spark intermediate computations knot Fruit is stored in the advantages such as memory, the clustering algorithm knot improved with the initialization cluster centre part of a kind of pair of K-Means algorithm It closes, realizes and the data of some electric business catering industries in Beijing are analyzed, the superiority and algorithm for therefrom comparing modified hydrothermal process exist The runing time of Spark platform and single machine platform embodies the superiority of Spark cluster parallel data processing speed.
The technical solution adopted by the present invention to solve the technical problems is:
A kind of electric business food and drink data analysing method based on the improved K-Means algorithm of Spark, the method includes following Step:
Step 1, experiment porch and its configuration are built, to realize that Parallel K-Means Clustering Algorithm in Web builds performing environment, process It is as follows:
The distributed environment of the Spark of 1.1 configuration Master is as follows:
1.1.1 Spark2.1.0 installation kit is downloaded, decompresses and is installed;
1.1.2 modifying associated profile:
Under into/spark2.1.0-bin-hadoop2.7.5/conf catalogue, two files are modified.One of them is to match Spark-env.sh file is set, variable SCALA_HONE, JAVA_HOME, SPARK_MASTER_IP, SPARK_WOKER_ are set The value of MEMORY and HADOOP_CONF_DIR;The other is configuration slave file, is added to this for master and each node In file;
1.2 configuration Scala develop environment:
Because Spark platform is compiled using Scala language, need that Scala is installed;Download to .msi file with Afterwards, it is installed according to step, after being installed, the installation path that global variable SCALA_HOME is Scala is set;
It is finally tested, checks whether Scala installs success, open a new CMD window, input default Scala instruction, if interactive command can be executed with normal circulation, expression is installed successfully;
Step 2, the acquisition of raw data set, process are as follows: experimental data is to choose the information data of food and drink retail shop, data Object include longitude, dimension, city, trade name, address, comprehensive score, comment number, environment scoring, taste scoring, service scoring and Commercial circle data information;
Step 3, raw data set is pre-processed, the data that fill up the vacancy and deletion hash;
Step 4, exploitation of the K-Means algorithm in Spark is realized using Scalable language, process is as follows:
4.1K-Means algorithm is using distance as the similarity measures between data object, to gather to data Class belongs to unsupervised learning, the similitude between data is indicated using Euclidean distance, the calculation formula of Euclidean distance:
Wherein, xi, xjAny two data object in data set is respectively represented, N indicates of the total attribute of each data object Number;
Iteration each time in K-Means cluster process, cluster centre will be calculated and be updated from newly, calculate new cluster Center exactly calculates in this cluster, the mean value of all objects, it is assumed that the cluster centre of k-th cluster is expressed as Centerk, meter The mode for calculating the new cluster centre of this cluster is as follows:
Wherein, CkIt is K class cluster, | Ck| it is the number of data object in K class cluster.Here summation refers to K class cluster Ck Sum of the middle all elements on every Column Properties, so CenterkIt is the vector that a length is D, is expressed as follows:
Centerk=[Centerk1, Centerk2, Centerk3..., CenterkD]
There are two types of the termination conditions of iteration, and one is setting the number of iterations T, and when reaching the T times iteration, program is whole Only iteration, the cluster result obtained at this time are final cluster result;Another kind is that error sum of squares is used to change as program The threshold values that generation terminates, the functional expression are expressed as follows:
Wherein, K indicates the number of class cluster, when the difference of the E of iteration twice is less than certain certain threshold values, i.e. Δ E < δ, program determination iteration;
The algorithm improvement of 4.2 pairs of initialization cluster centre parts, steps are as follows:
4.2.1 a data object is randomly choosed from pretreated data set at random as initial center point Ci, wherein I ∈ (1,2,3 ... K), the data object of data set assume that one shares N number of, data object are referred to as data point, data point is Refer to the vector of the three-dimensional feature composition of data intensive data object;
Ci=rand ([V(1, j), V(2, j), V(3, j))=[V(1, i), V(2, i), V(3, i)]
Wherein, j ∈ (1,2,3 ... N)
4.2.2 each point is calculated in data set to nearest cluster centre point Ci(Ci+1) distance Dj, to all DjIt does Be denoted as Sumi, C is used for the first timei, circulation is internal to use Ci+1
4.2.3 a random point is taken again, then calculates next initial cluster center point, random point using the mode of weight RiValue mode be random value Ri∈ (0, Sumi), R is done to data seti=Ri-DjLoop computation, until Ri< 0, then, Corresponding DjIt is exactly next cluster centre point Ci+1=Dj
4.2.4 4.2.2 and 4.2.3 two above step is repeated, until k-th central point is selected, chooses initial center Point algorithm terminates;
4.3 improved K-Means algorithms realize step in Spark:
4.3.1 parallel processing is carried out to initial data using HDFS, RDD
Numeralization processing first is carried out to the quantitative characteristic inside data set and removes unwanted feature inside data object Dimension, and data source is stored to HDFS by treated;
4.3.2 the cluster centre initialization procedure for executing 4.2 steps, K initial cluster center required for obtaining;
4.3.3 iterative operation is executed, meeting with reference to the number of iterations or obtained new cluster centre point has been more than to be advised Fixed critical value range, then iteration terminates, and otherwise continues to execute iterative operation;
4.3.4 new data variable is obtained, to obtain new cluster centre point;
4.3.5 cluster centre point and the number of iterations are updated;
Step 5, the program editted is compiled execution, is finally completed cluster process.
Further, the method also includes following steps:
Step 6, the data after cluster are visualized.
Further, the method also includes following steps:
Step 7, the execution velocity test under single board computer and cluster is carried out respectively: respectively by program in single board computer and difference Under the cluster of Worker node, the time used in entire data processing is recorded.Technical concept of the invention are as follows: the present invention utilizes Map the and Combine operator that Spark is provided;The data structure of the RDD of inner most general;In Spark results of intermediate calculations is stored in It the advantages such as deposits, reads data set from HDFS first, and RDD required for creating, then execute the cluster centre of this experiment Innovatation Initial method, more high probability are obtained apart from farther away initial cluster center point, and further progress iteration obtains new cluster Center is iterated operation under new cluster centre, after the cluster operation of each section is all completed, converges the poly- of each section Class is as a result, calculating finally obtains cluster result until iteration completion apart from new central point.
Beneficial effects of the present invention are mainly manifested in: on the one hand, on Spark platform, operation function is and the direct phase of RDD Mutual correlation, that is, the data distributed still are clustered to central point inside each RDD, and are executed parallel, are run As a result it does not need repeatedly to return and be calculated from new, then fast more many than being handled on single board computer using cluster processing data;It is another Aspect, in terms of algorithm, with primal algorithm it is direct at random take K initialization cluster centre point compared with, improve in initialization The process of heart point, so that direct random choosing of the distance between random point chosen probability big as far as possible than primal algorithm The probability taken wants high.
Detailed description of the invention
Fig. 1 is to run the realization of K-Means algorithm parallel using Spark to the overall flow figure of the cluster of data set.
Fig. 2 is that the runing time pair in different number of nodes is arranged in K-Means clustering algorithm on single machine and Spark platform Than figure.
Fig. 3 is the cluster result three-dimensional scatterplot schematic diagram to food and drink retail shop, Beijing.
Specific embodiment
The invention will be further described below in conjunction with the accompanying drawings.
Referring to Fig.1~Fig. 3, a kind of electric business food and drink data analysing method based on the improved K-Means algorithm of Spark, packet Include following steps:
Step 1, experiment porch and its configuration are built, to realize that Parallel K-Means Clustering Algorithm in Web builds necessary execution ring Border, process are as follows:
The distributed environment of the Spark of 1.1 configuration Master is as follows:
1.1.1 Spark2.1.0 installation kit is downloaded, decompresses and is installed;
1.1.2 modifying associated profile:
Under into/spark2.1.0-bin-hadoop2.7.5/conf catalogue, two files are modified.One of them is to match Spark-env.sh file is set, variable SCALA_HONE, JAVA_HOME, SPARK_MASTER_IP, SPARK_WOKER_ are set The value of MEMORY and HADOOP_CONF_DIR;The other is configuration slave file, is added to this for master and each node In file;
1.2 configuration Scala develop environment:
Because Spark platform is compiled using Scala language, need that Scala is installed;Download to .msi file with Afterwards, it is installed according to step, after being installed, the installation path that global variable SCALA_HOME is Scala is set;
It is finally tested, checks whether Scala installs success.A new CMD window is opened, input default Scala instruction, if interactive command can be executed with normal circulation, expression is installed successfully.
Step 2, the acquisition of raw data set
2.1 experimental datas are to choose some information datas of the food and drink retail shop of Beijing, and data object includes longitude, dimension 112065 numbers such as degree, city, trade name, address, comprehensive score, comment number, environment scoring, taste scoring, service scoring, commercial circle It is believed that breath.
It step 3, is the overall distribution situation for studying whole background food and drink retail shop and the comprehensive score of retail shop for notebook data With the degree of correlation of comment quantity and environment scoring three, field required for the experiment has longitude, dimension, comprehensive score, point Comment number, environment scoring.Many data are there are vacancy value it can be seen from the initial data of upper figure, and the one of raw data set A little fields are unwanted in test, for example, retail shop address, call the roll etc..Therefore original data set is pre-processed;
Step 4, exploitation of the K-Means algorithm in Spark is realized using Scalable language, process is as follows:
4.1K-Means algorithm is using distance as the similarity measures between data object, to gather to data Class, belongs to unsupervised learning, and this method generallys use Euclidean distance to indicate the similitude between data, the calculating of Euclidean distance Formula:
Wherein, xi, xjAny two data object in data set is respectively represented, N indicates of the total attribute of each data object Number;
Iteration each time in K-Means cluster process, cluster centre will be calculated and be updated from newly, calculate new cluster Center exactly calculates in this cluster, the mean value of all objects, it is assumed that the cluster centre of k-th cluster is expressed as Centerk, meter The mode for calculating the new cluster centre of this cluster is as follows:
Wherein, CkIt is K class cluster, | Ck| it is the number of data object in K class cluster.Here summation refers to K class cluster Ck Sum of the middle all elements on every Column Properties, so CenterkIt is the vector that a length is D, is expressed as follows:
Centerk=[Centerk1, Centerk2, Centerk3..., CenterkD]
There are two types of the termination condition of the algorithm iteration is general, one is setting the number of iterations T, when reaching the T times iteration When, program determination iteration, the cluster result obtained at this time is final cluster result;Another kind is using error sum of squares As the threshold values of program iteration ends, which is expressed as follows:
Wherein, K indicates the number of class cluster.When the difference of the E of iteration twice is less than certain certain threshold values, i.e. Δ E < δ, program determination iteration.
The algorithm improvement of 4.2 pairs of initialization cluster centre parts
The algorithm of original K-Means the step of initializing cluster centre, is then generated at the beginning of K using random operator Beginning cluster centre point.Due to initial point generate randomness, the distance between possible initial clustering point very little, or selection be The noise spot etc. of data set, it is very poor to frequently can lead to cluster result.Therefore, the distance between selection initialization cluster centre point is most It is possible big, following improvement is done to initialization this part of cluster centre, steps are as follows:
4.2.1 a data object is randomly choosed from pretreated data set at random as initial center point Ci.Wherein, I ∈ (1,2,3 ... K), the data object hypothesis one of data set shares N number of.For a better understanding, data object is referred to as Data point, point here really refer to the vector of the three-dimensional feature composition of data intensive data object;;
Ci=rand ([V(1, j), V(2, j), V(3, j))=[V(1, i), V(2, i), V(3, i)]
Wherein, j ∈ (1,2,3 ... N)
4.2.2 each point is calculated in data set to nearest cluster centre point Ci(Ci+1) distance Dj, to all DjIt does Be denoted as Sumi.Algorithm uses C for the first timei, circulation is internal to use Ci+1
4.2.3 a random point is taken again, then calculates next initial cluster center point, random point using the mode of weight RiValue mode be random value Ri∈ (0, Sumi), R is done to data seti=Ri-DjLoop computation, until Ri< 0, then, Corresponding DjIt is exactly next cluster centre point Ci+1=Dj
4.2.4 repeating 4.2.2 and 4.2.3 two above step, until k-th central point is selected, initial center is chosen Point algorithm terminates;
4.3 improved K-Means algorithms realize step in Spark:
4.3.1 parallel processing is carried out to initial data using HDFS, RDD
Numeralization processing first is carried out to the quantitative characteristic inside data set and removes unwanted feature inside data object Dimension, and data source is stored to HDFS by treated;
4.3.2 the cluster centre initialization procedure for executing 4.2 steps, K initial cluster center required for obtaining;
4.3.3 iterative operation is executed inside algorithm, is met super with reference to the number of iterations or obtained new cluster centre point The critical value range of defined is crossed, then iteration terminates, and otherwise continues to execute iterative operation.
4.3.4 new data variable is obtained, to obtain new cluster centre point.
4.3.5 cluster centre point and the number of iterations are updated.
Step 5, the program editted is compiled execution, is finally completed cluster process.
Step 6, the data after cluster are visualized
Step 7, the execution velocity test of program under single board computer and cluster is carried out: respectively in order to preferably prove Spark collection Group parallel data processing superiority, the step for respectively by program under the cluster of single board computer and different Worker nodes, note Record the time used in entire data processing.The final runing time of data is shown in Fig. 2.
For the validity of verification algorithm improvement part, below to the primal algorithm of the library the ML institute band of innovatory algorithm and Spark The effect of cluster is assessed from silhouette coefficient, CH (Calinski-Harabaz) index.
Rationally whether silhouette coefficient be to cluster, effectively to measure.For each sample, the expression formula of silhouette coefficient is as follows:
Wherein, j ∈ (1,2,3 ... N), ajIndicate the average distance with sample in other classifications, bjIt is nearest with its distance The average distance of different classes of middle sample.
It can also be expressed as following formula:
Silhouette coefficient value is [- 1,1], between the bigger explanation of value is similar sample apart from very little, it is different classes of between Sample distance is bigger, so value is the bigger the better.For a sample set, its silhouette coefficient is the silhouette coefficient of all samples Average value.
CH index be the dense degree in same category of indicating and be different classes of between dispersion degree, its mathematical table It is as follows up to formula:
Wherein, m is the sample number of training set;K is classification number, in this experiment, k=3;BkCovariance between classification Matrix;WkFor the covariance matrix inside classification, the mark of tr representing matrix.If the covariance inside classification is smaller, between classification Covariance it is bigger, then fractional value is also bigger, illustrate that Clustering Effect is better.If numerical value is smaller, different clusters are indicated Between boundary it is unobvious, Clustering Effect is also poorer.
The two indices size of ML-K-Means algorithm and innovatory algorithm difference is as shown in table 1:
Silhouette coefficient CH index
ML-K-Means 0.8166 47557.9709
Change-K-Means 0.8596 50846.2341
Table 1
From table 1 it follows that the algorithm that modified hydrothermal process provides on silhouette coefficient and CH index with the library ML of Spark It compares, two coefficients are above primal algorithm, although improving in silhouette coefficient index less obviously, mention in CH index Height is bigger, therefore on the whole, and the Clustering Effect of modified hydrothermal process wants the excellent algorithm carried with the library ML.
Fig. 1 is the overall flow frame diagram of K-Means algorithm data processing on Spark frame, mainly by three parts group At being these three processes of Map, Combine, Reduce respectively.Using the relationship between RDD and Spark and memory, guarantee first Operational efficiency;Then RDD is assigned on each node, each Paralleled executes local cluster operation, each section Local Clustering After, convergence Local Clustering is as a result, whether the distance for calculating new central point restrains, and algorithm terminates if convergence, otherwise Each subregion is from all of above step is newly executed, until the distance convergence of last new central point, algorithm terminate.
Fig. 2 is that the speed of service of the algorithm between nodes different under single board computer and Spark frame is compared.Its In, data set is 93625 datas, and " 0 " indicates the runing time on single board computer;" 1 " indicates there is being a Worker node Cluster runing time;" 2 " are indicated there are two the cluster runing time of Worker node, and so on.As seen from the figure, When cluster Worker node only one when, the runing time of algorithm be above single board computer operation, this is because cluster Calculating needs to initialize inter-related task, there is the newly-built of job, the distribution of resource and scheduling etc., these are all needs time-consumings, but It is that single board computer does not need.But with the increase of number of nodes, the executing tasks parallelly advantage of Spark is just more obvious, it can be seen that When Worker node is equal to 4, the processing speed of Spark is 1.6 times of single board computer or so.
The case where this figure of Fig. 3 is the whole cluster in terms of three attribute angles of data itself.Wherein, symbol " ◆ " indicates Cluster 0, symbol "+" region indicate that cluster 1, symbol " * " indicate cluster 2.As seen from the figure, 0 comprehensive score of cluster and environment score two Dimension is substantially linear, and comprehensive score is higher, and environment scoring is also relatively high, and comprehensive score is lower, environment scoring It is relatively low, but StoreFront general comment quantity it is most of at 2000 hereinafter, differ greatly with other two cluster in general comment quantity, In three-dimensional scatter plot, it is substantially distributed in following;2 overall distribution of cluster be on cluster 0, and comprehensive score and environment scoring Medium higher fraction range is all concentrated on, comments on total quantity substantially in 2000 to 5000 ranges;Cluster 1 is distributed in three-dimensional scatterplot The linear relationship of the top of figure, comprehensive score and environment two dimensions of scoring is less obvious, and composite score is arrived 4.5 substantially 5, environment scoring compare substantially in the score of 8 to 10, two dimensions with other two dimension, fraction range be it is highest, most An apparent index is exactly general comment quantity, substantially between 5000 to 15000.

Claims (3)

1. a kind of electric business food and drink data analysing method based on the improved K-Means algorithm of Spark, which is characterized in that the side Method the following steps are included:
Step 1, experiment porch and its configuration are built, to realize that Parallel K-Means Clustering Algorithm in Web builds performing environment, process is such as Under:
The distributed environment of the Spark of 1.1 configuration Master is as follows:
1.1.1 Spark2.1.0 installation kit is downloaded, decompresses and is installed;
1.1.2 modifying associated profile:
Under into/spark2.1.0-bin-hadoop2.7.5/conf catalogue, two files are modified, one of them is configuration Variable SCALA_HONE, JAVA_HOME, SPARK_MASTER_IP, SPARK_WOKER_ is arranged in spark-env.sh file The value of MEMORY and HADOOP_CONF_DIR;The other is configuration slave file, is added to this for master and each node In file;
1.2 configuration Scala develop environment:
Because Spark platform is compiled using Scala language, need that Scala is installed;It downloads to after .msi file, It is installed according to step, after being installed, the installation path that global variable SCALA_HOME is Scala is set;
It is finally tested, checks whether Scala installs success, open a new CMD window, the Scala of input default refers to It enables, if interactive command can be executed with normal circulation, expression is installed successfully;
Step 2, the acquisition of raw data set, process are as follows: experimental data is to choose the information data of food and drink retail shop, data object Including longitude, dimension, city, trade name, address, comprehensive score, comment number, environment scoring, taste scoring, service scoring and commercial circle Data information;
Step 3, raw data set is pre-processed, the data that fill up the vacancy and deletion hash;
Step 4, exploitation of the K-Means algorithm in Spark is realized using Scalable language, process is as follows:
4.1K-Means algorithm is using distance as the similarity measures between data object, to cluster to data, belongs to In unsupervised learning, the similitude between data is indicated using Euclidean distance, the calculation formula of Euclidean distance:
Wherein, xi, xjAny two data object in data set is respectively represented, N indicates the number of the total attribute of each data object;
Iteration each time in K-Means cluster process, cluster centre will be calculated and be updated from newly, calculate in new cluster The heart exactly calculates in this cluster, the mean value of all objects, it is assumed that the cluster centre of k-th cluster is expressed as Centerk, calculate The mode of the new cluster centre of this cluster is as follows:
Wherein, CkIt is K class cluster, | Ck| it is the number of data object in K class cluster, summation here refers to K class cluster CkMiddle institute There is sum of the element on every Column Properties, so CenterkIt is the vector that a length is D, is expressed as follows:
Centerk=[Centerk1, Centerk2, Centerk3..., CenterkD]
There are two termination conditions for the iteration. One is to set a number of iterations T: when the T-th iteration is reached, the program stops iterating, and the cluster result obtained at that point is the final cluster result. The other is to use the error sum of squares as the threshold for terminating the program iteration; this function is expressed as:

E = Σ_{k=1}^{K} Σ_{x ∈ C_k} ||x − Center_k||²

where K is the number of clusters; when the difference between the values of E in two successive iterations is smaller than a given threshold δ, i.e. ΔE < δ, the program stops iterating;
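The error-sum-of-squares stopping criterion can be sketched in plain Python (`delta` plays the role of the threshold δ; function names are our own):

```python
def sse(clusters, centers):
    """Error sum of squares E over all K clusters (squared Euclidean distances)."""
    total = 0.0
    for points, center in zip(clusters, centers):
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, center))
    return total

def should_stop(e_prev, e_curr, delta=1e-4):
    """Stop when the change of E between two successive iterations is below delta."""
    return abs(e_prev - e_curr) < delta

# one cluster {[0,0], [2,0]} with center [1,0]: E = 1 + 1
print(sse([[[0, 0], [2, 0]]], [[1, 0]]))  # → 2.0
```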
4.2 Improvement of the initialization of the cluster centers; the steps are as follows:
4.2.1 Randomly select one data object from the pre-processed data set as the initial center point C_i, where i ∈ {1, 2, 3, ..., K}; assume the data set contains N data objects in total; a data object is also called a data point, i.e. the vector formed by the three-dimensional features of a data object in the data set:

C_i = rand([V_(1,j), V_(2,j), V_(3,j)]) = [V_(1,i), V_(2,i), V_(3,i)]

where j ∈ {1, 2, 3, ..., N};
4.2.2 For each point in the data set, compute the distance D_j to the nearest already-chosen cluster center point C_i (or C_{i+1}); sum all the D_j and denote the sum Sum_i; C_i is used in the first pass, and C_{i+1} inside the loop;
4.2.3 Then compute the next initial cluster center in a weighted manner: draw a random value R_i ∈ (0, Sum_i), then loop over the data set computing R_i = R_i − D_j until R_i < 0; the point whose D_j made R_i negative is taken as the next cluster center point C_{i+1};
4.2.4 Repeat steps 4.2.2 and 4.2.3 above until the K-th center point has been selected; the initial-center selection method then terminates;
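Steps 4.2.1–4.2.4 describe a K-Means++-style weighted seeding. A standalone sketch in plain Python (our own function names; the patented Spark version would distribute the distance computation across Worker nodes):

```python
import random

def init_centers(points, k, seed=42):
    """K-Means++-style seeding following steps 4.2.1-4.2.4."""
    rng = random.Random(seed)
    centers = [list(rng.choice(points))]            # 4.2.1: first center chosen at random
    while len(centers) < k:
        # 4.2.2: D_j = distance of each point to its nearest chosen center; Sum_i = sum of D_j
        d = [min(sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5 for c in centers)
             for p in points]
        r = rng.uniform(0, sum(d))                  # 4.2.3: random value R_i in (0, Sum_i)
        for p, dj in zip(points, d):                # subtract each D_j until R_i < 0
            r -= dj
            if r < 0:                               # this point becomes the next center
                centers.append(list(p))
                break
        else:                                       # numeric edge case: fall back to last point
            centers.append(list(points[-1]))
    return centers
```

Points far from all chosen centers carry a large D_j and are therefore more likely to be picked, which spreads the initial centers across the data.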
4.3 Implementation steps of the improved K-Means algorithm on Spark:
4.3.1 Parallel processing of the raw data with HDFS and RDDs:
first, the quantitative features in the data set are converted to numeric form and the unneeded feature dimensions are removed from the data objects; the processed data source is then stored on HDFS;
4.3.2 Execute the cluster-center initialization procedure of step 4.2 to obtain the K required initial cluster centers;
4.3.3 Execute the iterative operation; if the specified number of iterations is reached or the change of the newly obtained cluster center points satisfies the specified threshold range, the iteration terminates; otherwise the iterative operation continues;
4.3.4 Obtain the new data variables, and from them the new cluster center points;
4.3.5 Update the cluster center points and the iteration count;
Step 5, the edited program is compiled and executed, finally completing the clustering process.
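Taken together, steps 4.3.2–4.3.5 form the usual assign-update loop. A standalone sketch in plain Python (lists stand in for the HDFS/RDD data source, and simple random seeding stands in for the improved initialization of step 4.2):

```python
import random

def kmeans(points, k, max_iter=20, delta=1e-4, seed=0):
    """Assign-update loop of steps 4.3.3-4.3.5 with an SSE-based stop (plain Python)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]   # stand-in for step 4.3.2 seeding
    prev_e = float("inf")
    for _ in range(max_iter):                            # 4.3.3: iterate
        clusters = [[] for _ in range(k)]
        e = 0.0
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            i = dists.index(min(dists))                  # assign to nearest center
            clusters[i].append(p)
            e += dists[i]
        for i, pts in enumerate(clusters):               # 4.3.4: new centers from assignments
            if pts:
                centers[i] = [sum(col) / len(pts) for col in zip(*pts)]
        if abs(prev_e - e) < delta:                      # SSE change below threshold: stop
            break
        prev_e = e                                       # 4.3.5: update and continue
    return centers, clusters
```

In the Spark implementation the assignment step would be a `map` over an RDD and the center recomputation a `reduceByKey`; the loop structure is the same.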
2. The e-commerce catering data analysis method based on the Spark-improved K-Means algorithm according to claim 1, characterized in that the method further comprises the following step:
Step 6, the clustered data are visualized.
3. The e-commerce catering data analysis method based on the Spark-improved K-Means algorithm according to claim 2, characterized in that the method further comprises the following step:
Step 7, execution-speed tests are carried out on a single machine and on clusters respectively: the program is run on a single machine and on clusters with different numbers of Worker nodes, and the time taken for the entire data processing is recorded in each case.
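A minimal timing harness for the speed test of Step 7 might look like the following plain-Python sketch (`fn` would be the whole clustering run; the function name is our own):

```python
import time

def timed_run(fn, *args, **kwargs):
    """Run fn once and return (result, wall-clock seconds), as in the Step 7 speed test."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# e.g. timing a cheap stand-in workload
result, seconds = timed_run(sum, range(1000))
print(result, seconds >= 0.0)
```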
CN201811507426.0A 2018-12-11 2018-12-11 E-commerce catering data analysis method based on Spark improved K-Means algorithm Active CN109657712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811507426.0A CN109657712B (en) 2018-12-11 2018-12-11 E-commerce catering data analysis method based on Spark improved K-Means algorithm


Publications (2)

Publication Number Publication Date
CN109657712A true CN109657712A (en) 2019-04-19
CN109657712B CN109657712B (en) 2021-06-18

Family

ID=66114001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811507426.0A Active CN109657712B (en) 2018-12-11 2018-12-11 E-commerce catering data analysis method based on Spark improved K-Means algorithm

Country Status (1)

Country Link
CN (1) CN109657712B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150066646A1 (en) * 2013-08-27 2015-03-05 Yahoo! Inc. Spark satellite clusters to hadoop data stores
EP3180695A1 (en) * 2014-08-14 2017-06-21 Qubole Inc. Systems and methods for auto-scaling a big data system
CN105678398A (en) * 2015-12-24 2016-06-15 国家电网公司 Power load forecasting method based on big data technology, and research and application system based on method
CN107886132A (en) * 2017-11-24 2018-04-06 云南大学 A kind of Time Series method and system for solving music volume forecasting

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TAPAN SHARMA et al.: "Best of breed solution for clustering of satellite images using big data platform Spark", IEEE *
CHENG Guojian et al.: "Application of the K-means clustering algorithm on the Spark platform", Software Guide (《软件导刊》) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378550A (en) * 2019-06-03 2019-10-25 东南大学 The processing method of the extensive food data of multi-source based on distributed structure/architecture
CN110705606A (en) * 2019-09-12 2020-01-17 武汉大学 Spatial K-means clustering method based on Spark distributed memory calculation
CN111163071A (en) * 2019-12-20 2020-05-15 杭州九略智能科技有限公司 Unknown industrial protocol recognition engine
CN111145042A (en) * 2019-12-31 2020-05-12 国网北京市电力公司 Power distribution network voltage abnormity diagnosis method adopting full-connection neural network
CN111241812A (en) * 2020-01-09 2020-06-05 内蒙古工业大学 Big data text clustering test method and system based on parallel improved K-means algorithm
CN111858671A (en) * 2020-07-14 2020-10-30 苏州浪潮智能科技有限公司 Method, device, equipment and medium for improving CleaStream algorithm
CN111858671B (en) * 2020-07-14 2022-07-05 苏州浪潮智能科技有限公司 Method, device, equipment and medium for improving CleaStream algorithm
CN112381559A (en) * 2020-10-14 2021-02-19 浪潮软件股份有限公司 Tobacco retailer segmentation method based on unsupervised machine learning algorithm
CN112905863A (en) * 2021-03-19 2021-06-04 青岛檬豆网络科技有限公司 Automatic customer classification method based on K-Means clustering
CN116595102A (en) * 2023-07-17 2023-08-15 法诺信息产业有限公司 Big data management method and system for improving clustering algorithm
CN116595102B (en) * 2023-07-17 2023-10-17 法诺信息产业有限公司 Big data management method and system for improving clustering algorithm


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant