CN104598565B

CN104598565B - A K-means large-scale data clustering method based on a stochastic gradient descent algorithm

Info

Publication number: CN104598565B
Application number: CN201510011974.4A
Authority: CN
Inventors: 韩海韵; 丁杰; 戴江鹏; 周爱华; 孙玉宝
Original assignee: State Grid Corp of China SGCC; China Electric Power Research Institute Co Ltd CEPRI; Global Energy Interconnection Research Institute
Current assignee: State Grid Corp of China SGCC; China Electric Power Research Institute Co Ltd CEPRI; State Grid Smart Grid Research Institute of SGCC
Priority date: 2015-01-09
Filing date: 2015-01-09
Publication date: 2018-08-14
Anticipated expiration: 2035-01-09
Also published as: CN104598565A

Abstract

The present invention provides a K-means large-scale data clustering method based on a stochastic gradient descent algorithm, comprising the following steps: randomly initializing K cluster centers; sampling a data sample and assigning it to its class; iterating on the objective function; and repeating steps 1-3 until the cluster centers converge. The method greatly improves the execution efficiency of the algorithm and achieves a better clustering result. It allows data to be mined more quickly and effectively, and offers a possible approach to handling electric power big data and other data problems.

Description

A K-means Large-Scale Data Clustering Method Based on Stochastic Gradient Descent Algorithm

Technical field

The invention relates to a clustering method, and in particular to a K-means large-scale data clustering method based on a stochastic gradient descent algorithm.

Background art

In recent years, as the means and capacity for data collection have improved, the amount of data available to individuals, and especially to enterprises, has increased dramatically. For example, after the State Grid Corporation of China completed its SG186 project, its eight major business applications added on average more than 50 million data records (144G) per day; with the construction of the smart grid and SG-ERP, the company's data growth rate will multiply several times over. Ultra-large-scale composite information storage, backup, and disaster recovery will all become important technical fields, and the quality of data center and disaster recovery center construction will directly affect the continuity of the enterprise's overall business. How to use powerful algorithms to make full use of the historical data, real-time data, and forecast data generated in electric power production control and enterprise operation, as well as data from different regions and levels, so as to extract ("purify") the value of the data more quickly, is a pressing problem for electric power big data.

Enterprise data comes from a wide range of sources and keeps growing in scale. In a sense, the proportion of information that is valuable to a company is falling, and finding useful information in the mass of data is becoming increasingly difficult. Organizing and analyzing data effectively and thoroughly, reducing or compressing worthless data, and raising the utilization value of useful data can shrink the scale of data storage and reduce the computing resources consumed by data analysis, thereby directly guiding the optimization of enterprise information assets.

With the rapid development of computer technology and storage devices, people can easily obtain tens of thousands or even millions of data items. How to extract useful or interesting information from these data has become a pressing problem. The traditional K-means clustering algorithm is widely used in the field of data mining. It first randomly initializes K cluster centers, then divides all samples into K classes according to the distance from each sample to the cluster centers, and finally updates each cluster center with the mean of all samples in its class; the whole process is iterated until convergence. Clearly, every iteration requires computing the distances from all samples to the K cluster centers. For large-scale data this computation takes a great deal of time, which greatly reduces the execution efficiency of the algorithm.
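As a point of reference, the classical procedure described above can be sketched in Python with NumPy. This is an illustrative sketch, not code from the patent: the function name `batch_kmeans`, its parameters, and the choice to initialise the centers from randomly chosen samples are all assumptions.

```python
import numpy as np

def batch_kmeans(X, K, max_iter=100, tol=1e-6, seed=0):
    """Classical (batch) K-means: every iteration touches every sample."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise the centers with K distinct samples chosen at random.
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(max_iter):
        # N x K matrix of squared Euclidean distances: the costly step.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # keep the old center if a class is empty
                new_centers[k] = members.mean(axis=0)
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    # labels come from the last assignment step that was executed
    return centers, labels
```

The N x K distance matrix is rebuilt in every iteration, so the cost of one iteration grows with the full data size. That is the cost the paragraph above refers to.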

At present, the big data processing flow can generally be summarized in four steps: data collection, import and preprocessing, statistics and analysis, and mining and decision support. Mining and decision support mainly consists of running computations based on various algorithms over existing data to support prediction and decision-making, and thereby to meet the needs of higher-level data analysis; a typical example is the K-means algorithm used for clustering. However, the biggest problem facing traditional data mining techniques is poor real-time performance: processing the data takes a great deal of time. For data that changes in real time, it is difficult to obtain useful information promptly, which affects enterprise decision-making.

Summary of the invention

To overcome the above shortcomings of the prior art, the present invention provides a K-means large-scale data clustering method based on a stochastic gradient descent algorithm, which greatly improves the execution efficiency of the algorithm and achieves a better clustering result. It allows data to be mined more quickly and effectively, and offers a possible approach to handling electric power big data and other data problems.

To achieve the above object, the present invention adopts the following technical solution:

The invention provides a K-means large-scale data clustering method based on a stochastic gradient descent algorithm, the method comprising the following steps:

Step 1: randomly initialize K cluster centers;

Step 2: sample a data sample and assign it to its class;

Step 3: iterate on the objective function;

Step 4: repeat steps 1-3 until the cluster centers converge.

In step 1, for a data set to be processed that contains K classes, K cluster centers $w_1, w_2, \ldots, w_k, \ldots, w_K \in \mathbb{R}^d$ are randomly initialized, where $\mathbb{R}$ denotes the real numbers and $d$ denotes the dimension, so that $\mathbb{R}^d$ denotes the $d$-dimensional real space, and $w_k$ denotes the cluster center corresponding to the $k$-th class.

Also in step 1, the numbers of data samples assigned to each cluster center, $n_1, n_2, \ldots, n_k, \ldots, n_K \in \mathbb{N}$, are initialized to 0, where $\mathbb{N}$ denotes the integers and $n_k$ denotes the number of data samples in the $k$-th class.

In step 2, a data sample $z \in \mathbb{R}^d$ is randomly sampled and assigned to the class of the cluster center at minimum distance from it.

The index of the class whose cluster center is at minimum distance is denoted $k^*$:

$$k^* = \arg\min_{k} (z - w_k)^2$$

where $(z - w_k)^2$ denotes the distance from data sample $z$ to $w_k$.
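The assignment rule of step 2 can be illustrated with a few lines of Python. The function name `nearest_center` and the numeric values are hypothetical, and indices here are zero-based, whereas the text numbers the classes from 1 to K.

```python
import numpy as np

def nearest_center(z, centers):
    """Return k*, the index of the center with the smallest squared distance to z."""
    d2 = ((centers - z) ** 2).sum(axis=1)  # (z - w_k)^2 for every k
    return int(d2.argmin())

centers = np.array([[0.0, 0.0], [4.0, 4.0]])  # illustrative values only
print(nearest_center(np.array([3.0, 3.5]), centers))  # prints 1
```

Only K distances are computed for each sampled point, independent of the size of the data set.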

Step 3 specifically comprises the following sub-steps:

Step 3-1: let the objective function be $Q_{kmeans}$:

[equation not preserved in the source text]

The derivative of $Q_{kmeans}$ with respect to $w_{k^*}$ is given by:

[equation not preserved in the source text]

where $w_{k^*}$ is the cluster center corresponding to the $k^*$-th class;

Step 3-2: let $n_{k^*}$ denote the number of data samples in the $k^*$-th class; $n_{k^*}$ and $w_{k^*}$ are updated using $Q_{kmeans}$ and its derivative, respectively:

[equation not preserved in the source text]
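The display equations of steps 3-1 and 3-2 did not survive in this text. One plausible reading, assuming the standard online K-means formulation (per-sample loss equal to half the squared distance to the nearest center, and a per-class learning rate of $1/n_{k^*}$), is given below. The factor $\tfrac{1}{2}$ and the exact form of the update are assumptions and may differ from the granted patent.

```latex
\begin{aligned}
Q_{kmeans}(z, w) &= \min_{k} \tfrac{1}{2}\,(z - w_k)^2, \\
\frac{\partial Q_{kmeans}}{\partial w_{k^*}} &= -(z - w_{k^*}), \\
n_{k^*} &\leftarrow n_{k^*} + 1, \\
w_{k^*} &\leftarrow w_{k^*} - \frac{1}{n_{k^*}}\,\frac{\partial Q_{kmeans}}{\partial w_{k^*}}
        = w_{k^*} + \frac{1}{n_{k^*}}\,(z - w_{k^*}).
\end{aligned}
```

Under this reading, with $n_{k^*}$ starting at 0, the first sample assigned to a class replaces its center outright, and from then on $w_{k^*}$ is the running mean of all samples assigned to that class so far.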

In step 4, steps 1-3 are repeated; if the distance between the cluster centers of two successive iterations is less than the threshold $10^{-6}$, the cluster centers $w_1, w_2, \ldots, w_k, \ldots, w_K$ are deemed to have converged.
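Steps 1 to 4 can be put together in a short Python sketch. Several details are assumptions rather than the patent's specification: the update uses a learning rate of $1/n_{k^*}$ (the standard online K-means rule), the initial centers are drawn uniformly inside the bounding box of the data unless `init` is supplied, initialisation happens once (as in the embodiment, which repeats only the sampling and update steps), and convergence is checked every `check_every` samples because the text does not say how an "iteration" is delimited. The names `sgd_kmeans`, `check_every`, `init`, and `max_steps` are hypothetical.

```python
import numpy as np

def sgd_kmeans(X, K, max_steps=100_000, check_every=1_000, tol=1e-6,
               init=None, seed=0):
    """Online K-means: one randomly sampled point per update."""
    rng = np.random.default_rng(seed)
    # Step 1: random centers (assumed uniform in the data's bounding box), zero counts.
    if init is None:
        w = rng.uniform(X.min(axis=0), X.max(axis=0), size=(K, X.shape[1]))
    else:
        w = np.array(init, dtype=float)
    n = np.zeros(K, dtype=np.int64)
    w_prev = w.copy()
    for step in range(1, max_steps + 1):
        # Step 2: sample one point and find its nearest center k*.
        z = X[rng.integers(len(X))]
        k_star = int(((w - z) ** 2).sum(axis=1).argmin())
        # Step 3: increment the count, then move the center toward z by 1/n_k*.
        n[k_star] += 1
        w[k_star] += (z - w[k_star]) / n[k_star]
        # Step 4: compare the centers between two successive checks.
        if step % check_every == 0:
            if np.linalg.norm(w - w_prev) < tol:
                break
            w_prev = w.copy()
    return w, n
```

Each update costs K distance evaluations regardless of how many samples the data set holds, which is where the saving over the batch procedure comes from.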

Compared with the prior art, the beneficial effects of the present invention are as follows:

The K-means large-scale data clustering method based on the stochastic gradient descent algorithm provided by the present invention greatly reduces the computational complexity of the algorithm, converges more quickly, and also obtains a better clustering result. Because the sample is selected at random in each iteration, without regard to previous samples, the stochastic gradient descent algorithm is in essence a process of minimizing the expected risk. The method offers a possible approach to handling electric power big data and other data problems.
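The expected-risk remark can be made concrete with the usual statistical-learning notation. The symbols $E(w)$, $E_N(w)$, and $P$ below are standard but do not appear in the source; $Q(z, w)$ stands for the per-sample loss, which in this method is $Q_{kmeans}$.

```latex
E(w) = \int Q(z, w)\, dP(z), \qquad
E_N(w) = \frac{1}{N} \sum_{i=1}^{N} Q(z_i, w)
```

A batch method lowers the empirical risk $E_N(w)$ over a fixed set of $N$ samples and needs all of them in every iteration. A stochastic gradient step uses a single randomly drawn $z$; if each $z$ is an independent draw from $P$, then wherever $Q$ is differentiable its gradient is an unbiased estimate of the gradient of $E(w)$, which is the sense in which the procedure targets the expected risk.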

Brief description of the drawings

Fig. 1 is a schematic diagram of the stochastic gradient descent algorithm in an embodiment of the present invention;

Fig. 2 is a distribution plot of the raw data in an embodiment of the present invention;

Fig. 3 is the clustering result of the prior-art K-means clustering method;

Fig. 4 is the clustering result of K-means based on the stochastic gradient descent algorithm in an embodiment of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings.

Embodiment

First, two crescent-shaped ("moon") sample families are randomly generated, represented by triangles and dots respectively, as shown in Fig. 2. The data have two feature dimensions, and each class contains 200,000 samples, for 400,000 data points in total, which makes this a large-scale data processing problem; for ease of display, only part of the data is plotted. The computer used for the experiments in this embodiment had a 64-bit operating system, 16 GB of memory, and an Intel processor; the software environment was MATLAB R2012a. The procedure is as follows:

a) Randomly initialize two cluster centers $w_1, w_2 \in \mathbb{R}^2$, and initialize the number of samples in each class, $n_1, n_2 \in \mathbb{N}$, to 0;

b) Randomly sample one data sample $z \in \mathbb{R}^2$ and assign it to the corresponding class according to the formula $k^* = \arg\min_k (z - w_k)^2$;

d) Update $n_{k^*}$ and $w_{k^*}$:

[equation not preserved in the source text]

e) Repeat steps b) to d) until the cluster centers $w_1, w_2$ converge.
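The embodiment was run in MATLAB, and neither its code nor its data generator is given. The Python sketch below only mirrors the setup that is stated (two crescent-shaped classes of 200,000 two-dimensional samples, two centers, counts starting at zero). The crescent geometry, the noise level, the $1/n_{k^*}$ update rule, and the reading of "500 iterations" as 500 single-sample updates are all assumptions; `two_moons` and `run_embodiment` are hypothetical names.

```python
import numpy as np

def two_moons(n_per_class, noise=0.1, seed=0):
    """Two interleaved half-circles with Gaussian noise (assumed geometry)."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, np.pi, size=n_per_class)
    upper = np.column_stack([np.cos(t), np.sin(t)])
    t = rng.uniform(0.0, np.pi, size=n_per_class)
    lower = np.column_stack([1.0 - np.cos(t), 0.5 - np.sin(t)])
    X = np.vstack([upper, lower])
    X = X + rng.normal(scale=noise, size=X.shape)
    y = np.repeat([0, 1], n_per_class)  # ground-truth class of each row
    return X, y

def run_embodiment(X, n_steps, seed=0):
    rng = np.random.default_rng(seed)
    # a) two random centers in R^2 (assumed uniform in the bounding box), counts at 0
    w = rng.uniform(X.min(axis=0), X.max(axis=0), size=(2, 2))
    n = np.zeros(2, dtype=np.int64)
    for _ in range(n_steps):
        # b) sample one point and assign it to the nearest center
        z = X[rng.integers(len(X))]
        k = int(((w - z) ** 2).sum(axis=1).argmin())
        # d) update the count and the center (assumed 1/n_k* step)
        n[k] += 1
        w[k] += (z - w[k]) / n[k]
    return w, n

X, y = two_moons(200_000)
w, n = run_embodiment(X, n_steps=500)
```

A fixed step count replaces the convergence test of step e) here to keep the sketch short.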

Fig. 3 shows the result obtained by the classical K-means clustering algorithm after 3 iterations, which took 32 seconds in total, while Fig. 4 shows the result obtained by the K-means clustering algorithm based on stochastic gradient descent after 17 seconds and 500 iterations; the circles marked "x" indicate the two cluster centers. As the figures show, the cluster centers in the two plots are almost identical. Quantitatively, classical K-means clustering took 32 seconds, whereas K-means clustering based on the stochastic gradient descent algorithm took only 17 seconds and reached an accuracy of 78.41%, slightly higher than the 78.1% of classical K-means clustering.
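The text does not say how the accuracy figures were computed. One common convention for two clusters, shown here purely as an assumption, is to compare cluster assignments with the known class labels and take the better of the two possible label matchings, since cluster indices are arbitrary. The function name `two_cluster_accuracy` and the toy labels are hypothetical.

```python
import numpy as np

def two_cluster_accuracy(y_true, y_pred):
    """Accuracy for K = 2 with labels in {0, 1}, using the better label matching."""
    agree = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
    return max(agree, 1.0 - agree)

y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([1, 1, 0, 0, 0, 0])  # cluster ids may be swapped relative to classes
print(two_cluster_accuracy(y_true, y_pred))
```

In the toy example one of the six points ends up in the wrong cluster once the labels are matched, so the score is 5/6.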

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. A person of ordinary skill in the art may still modify the specific embodiments of the present invention or make equivalent substitutions with reference to the above embodiments; any such modification or equivalent substitution that does not depart from the spirit and scope of the present invention falls within the scope of protection of the pending claims of the present invention.

Claims

1. A K-means large-scale data clustering method based on a stochastic gradient descent algorithm, characterized in that the method comprises the following steps:

Step 1: randomly initialize K cluster centers;

Step 2: sample a data sample and assign it to its class;

Step 3: iterate on the objective function;

Step 4: repeat steps 1-3 until the cluster centers converge;

in step 1, for a data set to be processed that contains K classes, K cluster centers $w_1, w_2, \ldots, w_k, \ldots, w_K \in \mathbb{R}^d$ are randomly initialized, where $\mathbb{R}$ denotes the real numbers and $d$ denotes the dimension, so that $\mathbb{R}^d$ denotes the $d$-dimensional real space, and $w_k$ denotes the cluster center corresponding to the $k$-th class;

in step 1, the numbers of data samples assigned to each cluster center, $n_1, n_2, \ldots, n_k, \ldots, n_K \in \mathbb{N}$, are initialized to 0, where $\mathbb{N}$ denotes the integers and $n_k$ denotes the number of data samples in the $k$-th class;

in step 2, a data sample $z \in \mathbb{R}^d$ is randomly sampled and assigned to the class of the cluster center at minimum distance from it;

the index of the class whose cluster center is at minimum distance is denoted $k^*$:

$$k^* = \arg\min_{k} (z - w_k)^2$$

where $(z - w_k)^2$ denotes the distance from data sample $z$ to $w_k$;

step 3 specifically comprises the following sub-steps:

Step 3-1: let the objective function be $Q_{kmeans}$:

[equation not preserved in the source text]

the derivative of $Q_{kmeans}$ with respect to $w_{k^*}$ is given by:

[equation not preserved in the source text]

where $w_{k^*}$ is the cluster center corresponding to the $k^*$-th class;

Step 3-2: let $n_{k^*}$ denote the number of data samples in the $k^*$-th class; $n_{k^*}$ and $w_{k^*}$ are updated using $Q_{kmeans}$ and its derivative, respectively:

[equation not preserved in the source text]

in step 4, steps 1-3 are repeated; if the distance between the cluster centers of two successive iterations is less than the threshold $10^{-6}$, the cluster centers $w_1, w_2, \ldots, w_k, \ldots, w_K$ are deemed to have converged.