CN104598565B - K-means large-scale data clustering method based on a stochastic gradient descent algorithm - Google Patents

K-means large-scale data clustering method based on a stochastic gradient descent algorithm

Info

Publication number
CN104598565B
Authority
CN
China
Prior art keywords
data
clustering
steps
data set
gradient descent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510011974.4A
Other languages
Chinese (zh)
Other versions
CN104598565A (en)
Inventor
韩海韵
丁杰
戴江鹏
周爱华
孙玉宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
China Electric Power Research Institute Co Ltd CEPRI
Smart Grid Research Institute of SGCC
Original Assignee
State Grid Corp of China SGCC
China Electric Power Research Institute Co Ltd CEPRI
Global Energy Interconnection Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, China Electric Power Research Institute Co Ltd CEPRI, Global Energy Interconnection Research Institute
Priority to CN201510011974.4A
Publication of CN104598565A
Application granted
Publication of CN104598565B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23211 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with adaptive number of clusters

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention provides a K-means large-scale data clustering method based on a stochastic gradient descent algorithm, comprising the following steps: randomly initializing K cluster centers; sampling a data sample and assigning it to the class it belongs to; iterating the objective function; and repeating steps 1-3 until the cluster centers converge. The method greatly improves the execution efficiency of the algorithm and achieves a better clustering effect. Data can be mined more quickly and effectively, and the method offers one possible approach to processing electric power big data and other data problems.

Description

K-means large-scale data clustering method based on a stochastic gradient descent algorithm
Technical Field
The invention relates to a clustering method, in particular to a K-means large-scale data clustering method based on a stochastic gradient descent algorithm.
Background
With the recent growth in data collection means and capabilities, the amount of data that individuals, and enterprises in particular, can acquire has increased dramatically. For example, after the State Grid Corporation of China completed the SG186 project, the average daily data volume of its eight business applications reached more than 50 million records (144 GB); with the construction of the smart grid and SG-ERP, the company's data growth rate has doubled. Ultra-large-scale composite information storage, backup, and disaster recovery have become important technical fields, and the effectiveness of the data center and disaster recovery center directly affects the continuity of an enterprise's entire business. How to use powerful algorithms to make full use of the historical data, real-time data, forecast data, and data from different regions, spaces, and levels generated in power production control and enterprise operation, and thereby extract the value of the data more quickly, is an urgent problem for electric power big data.
Enterprise data comes from a wide range of sources and keeps growing. In a sense, the proportion of information that is valuable to a company is decreasing, and finding useful information in a huge volume of data is becoming increasingly difficult. Effectively and thoroughly sorting and analyzing the data, reducing or compressing worthless data, and raising the utilization value of useful data can shrink the data storage scale and the computing resources consumed by data analysis, thereby directly guiding the optimization of enterprise information assets.
With the rapid development of computer technology and storage devices, it is easy to acquire tens of thousands or even millions of data records. How to extract useful or interesting information from such data has become an urgent problem. The traditional K-means clustering algorithm is widely used in data mining: K cluster centers are first initialized randomly; all samples are then divided into K classes according to the distance from each sample to the cluster centers; finally, each cluster center is updated to the mean of all samples in its class, and the whole process is iterated until convergence. Every iteration therefore requires computing the distances from all samples to the K cluster centers, which takes a large amount of time on large-scale data and greatly reduces the execution efficiency of the algorithm.
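The traditional procedure described above can be sketched in a few lines of Python to make its per-iteration cost visible. This is only an illustration: the names `squared_distance` and `batch_kmeans`, the choice of K random samples as initial centers, and the stopping rule are assumptions, not details taken from the patent.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def batch_kmeans(samples, k, max_iters=100, seed=0):
    """Classical K-means: every iteration scans all samples."""
    rng = random.Random(seed)
    # Random initialization (assumption): use k distinct samples as centers.
    centers = [list(p) for p in rng.sample(samples, k)]
    for _ in range(max_iters):
        # Assignment step: len(samples) * k distance computations.
        groups = [[] for _ in range(k)]
        for p in samples:
            nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
            groups[nearest].append(p)
        # Update step: each center becomes the mean of its group.
        new_centers = []
        for j, group in enumerate(groups):
            if group:
                dim = len(group[0])
                new_centers.append(
                    [sum(p[i] for p in group) / len(group) for i in range(dim)]
                )
            else:
                new_centers.append(centers[j])  # keep the center of an empty class
        if new_centers == centers:  # centers no longer change
            break
        centers = new_centers
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
demo_centers = sorted(batch_kmeans(points, k=2))
print(demo_centers)
```

The inner loop over `samples` is the part whose cost grows with the size of the data set, which is the bottleneck the paragraph above points to.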
At present, the processing flow of big data can generally be summarized in four steps: data acquisition, import and preprocessing, statistics and analysis, and mining and decision support. Mining and decision support mainly means running various algorithms on the existing data to support prediction and decision-making and to meet higher-level data analysis needs; a K-means clustering algorithm is typically used for clustering. However, the biggest problem of conventional data mining technology is poor real-time performance: processing the data takes a great deal of time. For data that changes in real time, useful information is hard to obtain in time, which affects enterprise decision-making.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a K-means large-scale data clustering method based on a stochastic gradient descent algorithm, which greatly improves the execution efficiency of the algorithm and achieves a better clustering effect. Data can be mined more quickly and effectively, and the method offers one possible approach to processing electric power big data and other data problems.
To achieve the purpose of the invention, the following technical solution is adopted:
The invention provides a K-means large-scale data clustering method based on a stochastic gradient descent algorithm, comprising the following steps:
step 1: randomly initializing K cluster centers;
step 2: sampling a data sample and assigning it to the class it belongs to;
step 3: iterating the objective function;
step 4: repeating steps 1-3 until the cluster centers converge.
In step 1, for a data set that is to be divided into K classes, K cluster centers $w_1, w_2, \dots, w_k, \dots, w_K \in \mathbb{R}^d$ are initialized randomly, where $\mathbb{R}$ denotes the real numbers and $d$ denotes the dimension, so that $\mathbb{R}^d$ denotes the d-dimensional real space, and $w_k$ denotes the cluster center corresponding to the k-th class of the data set.
In step 1, the numbers of data samples in the classes, $n_1, n_2, \dots, n_k, \dots, n_K \in \mathbb{N}$, are initialized to 0, where $\mathbb{N}$ denotes the integers and $n_k$ denotes the number of data samples corresponding to the k-th class of the data set.
In step 2, a data sample $z \in \mathbb{R}^d$ is sampled at random and assigned to the class whose cluster center is at the minimum distance.
The index of the class whose cluster center is at the minimum distance is denoted by $k^*$, as follows:
$$k^* = \arg\min_k \,(z - w_k)^2$$
where $(z - w_k)^2$ denotes the distance from the data sample $z$ to $w_k$.
Step 3 specifically comprises the following steps:
step 3-1: let the objective function be $Q_{kmeans}$:
$$Q_{kmeans} = \min_k \frac{1}{2}(z - w_k)^2$$
The derivative of $Q_{kmeans}$ with respect to $w_{k^*}$ is denoted by $\partial Q_{kmeans} / \partial w_{k^*}$, as follows:
$$\frac{\partial Q_{kmeans}}{\partial w_{k^*}} = -(z - w_{k^*})$$
where $w_{k^*}$ is the cluster center corresponding to the $k^*$-th class of the data set;
step 3-2: let $n_{k^*}$ denote the number of data samples corresponding to the $k^*$-th class of the data set; $n_{k^*}$ and $w_{k^*}$ are updated respectively by
$$n_{k^*} \leftarrow n_{k^*} + 1, \qquad w_{k^*} \leftarrow w_{k^*} + \frac{1}{n_{k^*}}(z - w_{k^*})$$
In step 4, steps 1 to 3 are executed repeatedly; if the distance between the cluster centers of two successive iterations is less than $10^{-6}$, the cluster centers $w_1, w_2, \dots, w_k, \dots, w_K$ are considered to have converged.
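The four steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the per-sample update is taken to be the usual online K-means rule (increment the count of the selected class, then move its center toward the sample with step size one over that count); the initial centers are passed in by the caller; only the sampling and update steps are repeated; and convergence is tested every `check_every` updates rather than after every sample. The names `nearest_center`, `sgd_kmeans`, `init_centers`, and `check_every` are hypothetical.

```python
import random


def nearest_center(z, centers):
    """Index k* of the center with the smallest squared distance to z."""
    return min(
        range(len(centers)),
        key=lambda k: sum((zi - wi) ** 2 for zi, wi in zip(z, centers[k])),
    )


def sgd_kmeans(samples, init_centers, max_steps=20000, check_every=1000,
               tol=1e-6, seed=0):
    """Online K-means: one randomly drawn sample per update."""
    rng = random.Random(seed)
    centers = [list(w) for w in init_centers]  # step 1: centers w_k
    counts = [0] * len(centers)                # step 1: counts n_k = 0
    snapshot = [w[:] for w in centers]
    steps = 0
    while steps < max_steps:
        z = rng.choice(samples)                # step 2: draw one sample
        k_star = nearest_center(z, centers)    # step 2: nearest center
        counts[k_star] += 1                    # step 3: update the count
        rate = 1.0 / counts[k_star]            # assumed step size 1 / n_k*
        w = centers[k_star]
        for i in range(len(w)):
            w[i] += rate * (z[i] - w[i])       # step 3: move center toward z
        steps += 1
        if steps % check_every == 0:           # step 4: convergence check
            moved = max(
                abs(a - b)
                for w_new, w_old in zip(centers, snapshot)
                for a, b in zip(w_new, w_old)
            )
            if moved < tol:
                break
            snapshot = [w[:] for w in centers]
    return centers, counts, steps


# Two small, well-separated grids of points as toy data.
blob_a = [(x / 10, y / 10) for x in range(-5, 6) for y in range(-5, 6)]
blob_b = [(x + 10.0, y + 10.0) for x, y in blob_a]
centers, counts, steps = sgd_kmeans(
    blob_a + blob_b, init_centers=[(0.0, 0.0), (10.0, 10.0)]
)
print(centers, counts, steps)
```

Each update evaluates only K distances, regardless of how many samples the data set contains.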
Compared with the prior art, the invention has the beneficial effects that:
The K-means large-scale data clustering method based on the stochastic gradient descent algorithm greatly reduces the computational complexity of the algorithm, converges more quickly, and obtains a better clustering effect. Because each iteration uses one randomly drawn sample and does not need to revisit previous samples, the stochastic gradient descent algorithm is in essence a process of minimizing the expected risk. The method offers one possible approach to processing electric power big data and other data problems.
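A short derivation may help show why single-sample updates need not revisit earlier samples. It assumes the center update has the usual online form with step size $1/n$ (an assumption about the exact rule); the symbols $z_i$ (the i-th sample assigned to a given class) and $w^{(n)}$ (that class's center after $n$ assigned samples) are introduced only for this illustration.

```latex
\begin{align*}
w^{(n)} &= w^{(n-1)} + \frac{1}{n}\bigl(z_n - w^{(n-1)}\bigr)
         = \frac{n-1}{n}\,w^{(n-1)} + \frac{1}{n}\,z_n, \\
w^{(1)} &= z_1
\quad\Longrightarrow\quad
w^{(n)} = \frac{1}{n}\sum_{i=1}^{n} z_i .
\end{align*}
```

Under this assumption each center is always the running mean of the samples assigned to it so far, and one update costs K distance evaluations, whereas one iteration of classical K-means over N samples costs N times K.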
Drawings
FIG. 1 is a schematic diagram of the stochastic gradient descent algorithm in an embodiment of the present invention;
FIG. 2 is a graph of raw data distribution in an embodiment of the present invention;
FIG. 3 is a graph of the clustering results of the prior art K-means clustering method;
FIG. 4 is a K-means clustering result graph based on a stochastic gradient descent algorithm in the embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
Examples
First, two "moon"-shaped sample families, represented by triangles and dots respectively, are randomly generated, as shown in FIG. 2. Each sample has two feature dimensions and each class contains 200,000 samples, for a total of 400,000 samples, which constitutes a large-scale data processing problem; for ease of display, only part of the data is plotted. The computer used for the experiment in this embodiment is configured as follows: 64-bit operating system, 16 GB memory, Intel processor, and MATLAB R2012a as the software environment. The specific process is as follows:
a) randomly initializing 2 cluster centers $w_1, w_2 \in \mathbb{R}^2$, and initializing the number of samples of each class, $n_1, n_2 \in \mathbb{N}$, to 0;
b) randomly sampling a data sample $z \in \mathbb{R}^2$ and assigning it to the corresponding class according to the formula $k^* = \arg\min_k \,(z - w_k)^2$;
d) updating $n_{k^*}$ and $w_{k^*}$;
e) repeating steps b) to d) until the cluster centers $w_1, w_2$ converge.
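The text does not say how the two "moon"-shaped families were generated. One common construction, shown here purely as an assumption, places each class on a half circle and adds Gaussian noise; the function name `make_two_moons` and the `noise` parameter are hypothetical.

```python
import math
import random


def make_two_moons(n_per_class, noise=0.05, seed=0):
    """Return (points, labels) for two interleaved half-circle classes."""
    rng = random.Random(seed)
    points, labels = [], []
    for _ in range(n_per_class):
        # Class 0: upper half circle of radius 1 centered at the origin.
        t = rng.uniform(0.0, math.pi)
        points.append((math.cos(t) + rng.gauss(0.0, noise),
                       math.sin(t) + rng.gauss(0.0, noise)))
        labels.append(0)
        # Class 1: lower half circle of radius 1 centered at (1, 0.5).
        t = rng.uniform(0.0, math.pi)
        points.append((1.0 - math.cos(t) + rng.gauss(0.0, noise),
                       0.5 - math.sin(t) + rng.gauss(0.0, noise)))
        labels.append(1)
    return points, labels


# A small demo set; the embodiment uses 200,000 samples per class.
moon_points, moon_labels = make_two_moons(1000)
print(len(moon_points), moon_labels.count(0), moon_labels.count(1))
```

Keeping the true labels alongside the points makes it possible to score a clustering result afterwards.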
FIG. 3 shows the result obtained by the classical K-means clustering algorithm after 3 iterations, which took 32 seconds in total, while FIG. 4 shows the result obtained by the K-means clustering algorithm based on stochastic gradient descent after 500 iterations, which took 17 seconds; the "x"-shaped marks indicate the two cluster centers. As the figures show, the cluster centers in the two results are almost the same. Quantitatively, classical K-means clustering took 32 seconds, whereas K-means clustering based on the stochastic gradient descent algorithm took only 17 seconds and reached an accuracy of 78.41%, slightly higher than the 78.1% of classical K-means clustering.
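The text does not state how the accuracy figures were computed. Because cluster indices are arbitrary, one plausible way (an assumption) is to score the best one-to-one mapping between cluster indices and true class labels. The function name `clustering_accuracy` is hypothetical, and the brute-force search over mappings is practical only for a small number of classes such as the two used here.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids, k):
    """Fraction of samples matched under the best cluster-to-class mapping."""
    best_hits = 0
    for mapping in permutations(range(k)):
        # mapping[c] is the class label assigned to cluster index c.
        hits = sum(1 for t, c in zip(true_labels, cluster_ids) if mapping[c] == t)
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


truth = [0, 0, 0, 1, 1, 1]
found = [1, 1, 0, 0, 0, 0]
demo_accuracy = clustering_accuracy(truth, found, k=2)
print(demo_accuracy)
```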
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. A person of ordinary skill in the art may modify the specific embodiments of the present invention or replace them with equivalents with reference to the above embodiments, and any such modification or equivalent replacement that does not depart from the spirit and scope of the present invention falls within the scope of the claims of the present invention.

Claims (1)

1. A K-means large-scale data clustering method based on a stochastic gradient descent algorithm, characterized in that the method comprises the following steps:
step 1: randomly initializing K cluster centers;
step 2: sampling a data sample and assigning it to the class it belongs to;
step 3: iterating the objective function;
step 4: repeating steps 1-3 until the cluster centers converge;
in step 1, for a data set that is to be divided into K classes, K cluster centers $w_1, w_2, \dots, w_k, \dots, w_K \in \mathbb{R}^d$ are initialized randomly, where $\mathbb{R}$ denotes the real numbers and $d$ denotes the dimension, so that $\mathbb{R}^d$ denotes the d-dimensional real space, and $w_k$ denotes the cluster center corresponding to the k-th class of the data set;
in step 1, the numbers of data samples in the classes, $n_1, n_2, \dots, n_k, \dots, n_K \in \mathbb{N}$, are initialized to 0, where $\mathbb{N}$ denotes the integers and $n_k$ denotes the number of data samples corresponding to the k-th class of the data set;
in step 2, a data sample $z \in \mathbb{R}^d$ is sampled at random and assigned to the class whose cluster center is at the minimum distance;
the index of the class whose cluster center is at the minimum distance is denoted by $k^*$, as follows:
$$k^* = \arg\min_k \,(z - w_k)^2$$
where $(z - w_k)^2$ denotes the distance from the data sample $z$ to $w_k$;
step 3 specifically comprises the following steps:
step 3-1: let the objective function be $Q_{kmeans}$:
$$Q_{kmeans} = \min_k \frac{1}{2}(z - w_k)^2$$
the derivative of $Q_{kmeans}$ with respect to $w_{k^*}$ is denoted by $\partial Q_{kmeans} / \partial w_{k^*}$, as follows:
$$\frac{\partial Q_{kmeans}}{\partial w_{k^*}} = -(z - w_{k^*})$$
where $w_{k^*}$ is the cluster center corresponding to the $k^*$-th class of the data set;
step 3-2: let $n_{k^*}$ denote the number of data samples corresponding to the $k^*$-th class of the data set; $n_{k^*}$ and $w_{k^*}$ are updated respectively by
$$n_{k^*} \leftarrow n_{k^*} + 1, \qquad w_{k^*} \leftarrow w_{k^*} + \frac{1}{n_{k^*}}(z - w_{k^*})$$
in step 4, steps 1 to 3 are executed repeatedly; if the distance between the cluster centers of two successive iterations is less than $10^{-6}$, the cluster centers $w_1, w_2, \dots, w_k, \dots, w_K$ are considered to have converged.
CN201510011974.4A 2015-01-09 2015-01-09 K-means large-scale data clustering method based on a stochastic gradient descent algorithm Active CN104598565B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510011974.4A CN104598565B (en) 2015-01-09 2015-01-09 K-means large-scale data clustering method based on a stochastic gradient descent algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510011974.4A CN104598565B (en) 2015-01-09 2015-01-09 K-means large-scale data clustering method based on a stochastic gradient descent algorithm

Publications (2)

Publication Number Publication Date
CN104598565A (en) 2015-05-06
CN104598565B (en) 2018-08-14

Family

ID=53124350

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510011974.4A Active CN104598565B (en) 2015-01-09 2015-01-09 K-means large-scale data clustering method based on a stochastic gradient descent algorithm

Country Status (1)

Country Link
CN (1) CN104598565B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139277B (en) * 2015-08-18 2018-09-11 国家电网公司 A kind of power matching network information cluster system and method
CN105681089B (en) * 2016-01-26 2019-10-18 上海晶赞科技发展有限公司 Networks congestion control clustering method, device and terminal
US10503580B2 (en) 2017-06-15 2019-12-10 Microsoft Technology Licensing, Llc Determining a likelihood of a resource experiencing a problem based on telemetry data
US11062226B2 (en) 2017-06-15 2021-07-13 Microsoft Technology Licensing, Llc Determining a likelihood of a user interaction with a content element
US10805317B2 (en) 2017-06-15 2020-10-13 Microsoft Technology Licensing, Llc Implementing network security measures in response to a detected cyber attack
US10922627B2 (en) 2017-06-15 2021-02-16 Microsoft Technology Licensing, Llc Determining a course of action based on aggregated data
CN108846532A (en) * 2018-03-21 2018-11-20 宁波工程学院 Business risk appraisal procedure and device applied to logistics supply platform chain
CN108460499B (en) * 2018-04-02 2022-03-08 福州大学 Microblog user influence ranking method integrating user time information
CN111385243A (en) * 2018-12-27 2020-07-07 中国移动通信集团山西有限公司 DDoS detection method, device and equipment
CN110288004B (en) * 2019-05-30 2021-04-20 武汉大学 System fault diagnosis method and device based on log semantic mining

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101488189A (en) * 2009-02-04 2009-07-22 天津大学 Brain-electrical signal processing method based on isolated component automatic clustering process
CN101872343A (en) * 2009-04-24 2010-10-27 罗彤 Semi-supervised mass data hierarchy classification method
CN103810261A (en) * 2014-01-26 2014-05-21 西安理工大学 K-means clustering method based on quotient space theory

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7353472B2 (en) * 2005-08-12 2008-04-01 International Business Machines Corporation System and method for testing pattern sensitive algorithms for semiconductor design
US20070118492A1 (en) * 2005-11-18 2007-05-24 Claus Bahlmann Variational sparse kernel machines

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
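A new clustering algorithm based on the genetic algorithm and the gradient descent method (基于遗传算法和梯度下降法的聚类新算法); 吴小涛 (Wu Xiaotao) et al.; 《计算技术与信息发展》; 2009-04-30 (No. 4); pp. 61-62 *
Some properties of the stochastic gradient descent method (随机梯度下降法的一些性质); 汪宝彬 (Wang Baobin) et al.; 《数学杂志》; 2011-11-15; Vol. 31 (No. 6); pp. 1041-1044 *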

Also Published As

Publication number Publication date
CN104598565A (en) 2015-05-06

Similar Documents

Publication Publication Date Title
CN104598565B (en) K-means large-scale data clustering method based on a stochastic gradient descent algorithm
Zeng et al. A GA-based feature selection and parameter optimization for support tucker machine
CN111091247A (en) Power load prediction method and device based on deep neural network model fusion
Yang et al. A scalable data chunk similarity based compression approach for efficient big sensing data processing on cloud
CN104933445A (en) Mass image classification method based on distributed K-means
CN111612319A (en) Load curve depth embedding clustering method based on one-dimensional convolution self-encoder
CN114065850A (en) Spectral clustering method and system based on uniform anchor point and subspace learning
CN113705793A (en) Decision variable determination method and device, electronic equipment and medium
CN117236201B (en) Diffusion and ViT-based downscaling method
CN114913921B (en) Marker gene identification system and method
CN114067915A (en) scRNA-seq data dimension reduction method based on deep antithetical variational self-encoder
CN114358216B (en) Quantum clustering method based on machine learning framework and related device
CN103793438A (en) MapReduce based parallel clustering method
Hu et al. Parallel clustering of big data of spatio-temporal trajectory
Tsai et al. An effective and efficient grid-based data clustering algorithm using intuitive neighbor relationship for data mining
CN106021170A (en) Graph building method employing semi-supervised low-rank representation model
CN111090679B (en) Time sequence data representation learning method based on time sequence influence and graph embedding
CN117060401A (en) New energy power prediction method, device, equipment and computer readable storage medium
CN115115920B (en) Graph data self-supervision training method and device
CN113835964B (en) Cloud data center server energy consumption prediction method based on small sample learning
CN115329082A (en) Log sequence anomaly detection method based on deep hybrid neural network
CN102663141B (en) Multi-channel quantification and hierarchical clustering method based on multi-core parallel computation
CN114219092A (en) Data processing method and system
CN114997214A (en) Fault diagnosis method and device for residual error intensive network
CN113808670A (en) Method for predicting cell differentiation by using single-cell transcriptome data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20160425

Address after: 100031 Xicheng District West Chang'an Avenue, No. 86, Beijing

Applicant after: State Grid Corporation of China

Applicant after: China Electric Power Research Institute

Applicant after: State Grid Smart Grid Institute

Address before: 100031 Xicheng District West Chang'an Avenue, No. 86, Beijing

Applicant before: State Grid Corporation of China

Applicant before: China Electric Power Research Institute

CB02 Change of applicant information

Address after: 100031 Xicheng District West Chang'an Avenue, No. 86, Beijing

Applicant after: State Grid Corporation of China

Applicant after: China Electric Power Research Institute

Applicant after: GLOBAL ENERGY INTERCONNECTION RESEARCH INSTITUTE

Address before: 100031 Xicheng District West Chang'an Avenue, No. 86, Beijing

Applicant before: State Grid Corporation of China

Applicant before: China Electric Power Research Institute

Applicant before: State Grid Smart Grid Institute

COR Change of bibliographic data
GR01 Patent grant
GR01 Patent grant