CN106570173B - Spark-based high-dimensional sparse text data clustering method - Google Patents


Info

Publication number
CN106570173B
CN106570173B (application CN201610988558.4A)
Authority
CN
China
Prior art keywords
matrix
data
data set
similarity
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610988558.4A
Other languages
Chinese (zh)
Other versions
CN106570173A (en
Inventor
王进
黄超
莫倩雯
陈乔松
邓欣
欧阳卫华
胡峰
李智星
雷大江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yami Technology Guangzhou Co ltd
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN201610988558.4A priority Critical patent/CN106570173B/en
Publication of CN106570173A publication Critical patent/CN106570173A/en
Application granted granted Critical
Publication of CN106570173B publication Critical patent/CN106570173B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Spark-based high-dimensional sparse text data clustering method comprising the following steps: 1. read in the data set using an RDD; 2. design a distributed sparse vector set with the RDD interface; 3. compute the similarity between each distributed sparse vector set and the complete data set on its node, and organize the results into a similarity matrix by index; 4. symmetrize the stored similarity matrix and solve for its normalized Laplacian form; 5. decompose the normalized Laplacian matrix by SVD and construct a new matrix; 6. input the constructed matrix as samples into a K-means model for training, then cluster the test set with the trained model. The invention improves the runtime performance of the traditional spectral clustering algorithm on large data sets.

Description

Spark-based high-dimensional sparse text data clustering method
Technical Field
The invention relates to the fields of text data clustering, machine learning, and distributed computing, and in particular to a Spark-based high-dimensional sparse text data clustering method.
Background
With the advent of the big data age, the Internet has accumulated more and more network data, and the volume has reached the limit of what ordinary computers can handle. To cope with this increasingly difficult data processing problem, many industries have turned to Spark-based distributed processing platforms and parallel sparse data set storage technologies.
Spark is a big data distributed programming framework similar to Hadoop, but differences between the two make Spark superior for some workloads: Spark provides in-memory distributed data sets that support interactive queries and also optimize iterative workloads. The Spark big data platform integrates batch processing, real-time stream processing, interactive queries, and graph computation, avoiding the resource waste of deploying a separate cluster for each kind of job. Spark provides programmers with the Resilient Distributed Dataset (RDD), a data structure distributed across the machines of a cluster with an efficient fault-tolerance mechanism. Building on the RDD, many traditional machine learning algorithms have extended their computational performance and data processing capacity.
Sparse vector set storage targets matrices in which most elements are 0. In fact, large-scale data sets in practical problems are mostly sparse, often with sparsity above 90% or even 99%, so efficient sparse storage formats and computation methods are needed. Combined with the Resilient Distributed Dataset (RDD) provided by the Spark big data platform, a sparse data set can be stored in parallel across different computing nodes. This mainly addresses the storage and computation of large data sets; parallel sparse storage methods are widely adopted for high-dimensional sparse text data sets and in computer vision.
Clustering is a very important method in machine learning and data mining tasks. Spectral clustering is a graph-theoretic method that divides a weighted undirected graph into two or more optimal subgraphs. Compared with traditional methods, spectral clustering is more effective at capturing the similarity among data samples, and it has therefore been widely used in fields such as information retrieval and computer vision. Unfortunately, when the number of data samples n increases dramatically, spectral clustering faces a computational bottleneck. For example, computing the similarity matrix M (n × n) between samples, or storing a matrix of that size, may become intractable when n is very large. The traditional spectral clustering algorithm also needs a large amount of storage space and running time to compute the K eigenvectors of the Laplacian matrix. These problems make spectral clustering increasingly unable to meet today's computational requirements as data volumes grow sharply.
To address this situation, the invention provides a spectral clustering method based on the Spark big data processing platform.
Disclosure of Invention
The present invention aims to solve the above problems of the prior art by providing a Spark-based high-dimensional sparse text data clustering method. The technical scheme of the invention is as follows:
a Spark-based high-dimensional sparse text data clustering method comprises the following steps:
Step 1: read in the data set samples to be processed through the Resilient Distributed Dataset (RDD) provided by the Spark big data platform, and design a distributed sparse vector set for storing the high-dimensional sparse data set using the RDD interface;
Step 2: calculate a similarity matrix M among the data set samples to be processed and store it as a parallel sparse vector set; similarity is measured by Euclidean distance;
Step 3: symmetrize the similarity matrix M stored in step 2 as a parallel sparse vector set, and solve for the normalized Laplacian matrix;
Step 4: decompose the normalized Laplacian matrix by SVD (singular value decomposition), obtain the K nearest-neighbor eigenvectors, and assemble them into a nearest-neighbor matrix;
Step 5: input the constructed nearest-neighbor matrix as samples into a K-means model for training, completing the clustering.
Further, step 1 reads the data set samples to be processed into a Resilient Distributed Dataset (RDD) provided by the Spark big data platform and loads them into a distributed sparse vector set;
the data source to be processed (from the UCI data platform) is read into the RDD, loaded into high-dimensional distributed vector set data P, and divided into a training set A1 and a test set A2.
Further, the data set to be processed is the RCV1 data set, whose form is {decision label, condition attribute 1, condition attribute 2, condition attribute 3, ..., condition attribute n}; the dimensionality of the data set is >30000.
Further, the division into training set A1 and test set A2 randomly selects, without repetition, 90% of the samples in the data set as training set A1 and uses the remaining 10% as test set A2.
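The 90/10 split without repetition can be sketched as follows. This is a minimal local NumPy version (the patent performs the split over Spark RDDs; `split_dataset` is an illustrative name, not from the source):

```python
import numpy as np

def split_dataset(samples, train_frac=0.9, seed=0):
    """Randomly select train_frac of the samples, without repetition,
    as training set A1; the remainder becomes test set A2."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))      # every index appears exactly once
    cut = int(len(samples) * train_frac)
    a1 = [samples[i] for i in idx[:cut]]     # training set A1 (90%)
    a2 = [samples[i] for i in idx[cut:]]     # test set A2 (10%)
    return a1, a2

a1, a2 = split_dataset(list(range(100)))
```

Because a permutation is used, every sample lands in exactly one of A1 or A2, matching the "without repetition" requirement.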
Further, the step of loading the high-dimensional distributed vector set data P includes:
A1, read in the high-dimensional sparse text data set using a Resilient Distributed Dataset (RDD);
A2, record each sample in the data set, using sparse storage, as A;
A3, randomly sample A and divide it into small-sample data blocks B, marking each data block with an index;
A4, use the mapPartitionsWithIndex programming interface provided by the Spark platform to distribute the small-sample data blocks B to the cluster nodes according to their index numbers.
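Steps A1–A4 can be illustrated locally. The sketch below mimics what Spark's `mapPartitionsWithIndex` would see (index, block) without requiring a cluster; `make_indexed_blocks` and the block size are illustrative assumptions, not the patent's code:

```python
from scipy.sparse import random as sparse_random

def make_indexed_blocks(matrix, block_rows):
    """Split a sparse sample matrix A into small row blocks B, each tagged
    with a partition index, as mapPartitionsWithIndex would enumerate them."""
    blocks = []
    for idx, start in enumerate(range(0, matrix.shape[0], block_rows)):
        blocks.append((idx, matrix[start:start + block_rows]))  # (index, block B)
    return blocks

# Sparse storage of the data set A (CSR format), then blocking with indices
A = sparse_random(10, 6, density=0.1, format="csr", random_state=0)
blocks = make_indexed_blocks(A, block_rows=4)
```

On a real cluster, each (index, block) pair would then be routed to a node keyed by its index number.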
Further, step 3 specifically includes the following steps:
B1, compute the Euclidean distance between the data block B on each computing node and the sparse vector set P to obtain a distance matrix representing similarity, and obtain an upper triangular matrix U by the parallel upper-triangle method;
B2, store the distance matrix obtained in step B1 in point-coordinate (COO) form, recording the point set as CO;
B3, interchange the row and column coordinates of the points in CO, recording the result as CO';
B4, merge the point sets of steps B2 and B3 to form a symmetric matrix S;
B5, set the elements Sij = 0 where i = j in the symmetric matrix S, and record the diagonal matrix as D;
then calculate the Laplacian matrix and normalize it:
L = D - M; (2)
in formula (2), L is the Laplacian matrix, D is the diagonal matrix, and M is the similarity matrix;
normalize L:
L1 = D^(-1/2) L D^(-1/2) = I - D^(-1/2) M D^(-1/2); (3)
where I is the identity matrix and L1 is the normalized Laplacian matrix.
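Formulas (2) and (3) can be checked with a small NumPy sketch (a local stand-in; the patent computes this over distributed sparse matrices):

```python
import numpy as np

# Formula (2): L = D - M, and formula (3): L1 = D^(-1/2) L D^(-1/2)
M = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])          # similarity matrix with zeroed diagonal
D = np.diag(M.sum(axis=1))            # diagonal matrix of row sums
L = D - M                             # formula (2)
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L1 = D_inv_sqrt @ L @ D_inv_sqrt      # formula (3)
```

Note that L1 has unit diagonal, consistent with L1 = I - D^(-1/2) M D^(-1/2).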
Further, step 4 comprises: applying a parallel SVD (singular value decomposition) method to factorize the normalized Laplacian matrix L1 and selecting the first K eigenvectors θi = (θ1i, θ2i, ..., θNi)', i = 1, 2, 3, ..., K. The first K eigenvectors form an N x K eigenvector matrix θ whose rows represent the cluster samples, i.e., the corresponding decision labels in the high-dimensional sparse text data set.
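A minimal sketch of this eigenvector extraction, with `np.linalg.svd` standing in for Spark's parallel SVD (an assumption; which K vectors to keep follows the patent's "first K" wording):

```python
import numpy as np

def top_k_eigenvectors(L1, k):
    """Return the N x K matrix theta of the first K eigenvectors of L1."""
    # L1 is symmetric, so its SVD coincides with an eigendecomposition
    U, s, _ = np.linalg.svd(L1)
    return U[:, :k]                   # columns = leading eigenvectors

L1 = np.array([[1.0, -0.5],
               [-0.5, 1.0]])          # toy normalized Laplacian
theta = top_k_eigenvectors(L1, 1)     # N x K feature matrix (here 2 x 1)
```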
Further, in step 5 the constructed nearest-neighbor matrix is input as samples into a K-means model for training; the test set A2 is then clustered with the model, and the procedure ends.
The invention has the following advantages and beneficial effects:
the embodiment of the invention discloses a Spark-based high-dimensional sparse text data clustering method, wherein a high-dimensional sparse text data set (such as RCV1) comprises the following steps: the dimension is high (n is greater than 10000, n is attribute dimension), the space complexity is high, and the storage and the calculation are not convenient. The invention mainly solves the problem of difficult storage and calculation of the high-dimensional sparse text data set, and has the following specific advantages and beneficial effects: 1. acquiring a high-dimensional sparse text data set (such as RCV1) from a UCI data platform; 2. reading in a high-dimensional sparse text data set by using a distributed elastic data set (RDD), selecting a sparse storage strategy (such as CSR) to store data samples, randomly sampling and dividing the data samples into a few sample data blocks, using an index mark for each data block, and finally distributing the data blocks with few samples to each node of a cluster according to an index number, wherein the method has the advantages of fully utilizing the memory resources of each node in the cluster and storing data with higher dimensionality; 3. splitting the data set to verify the quality of model training; 4. on the basis of a distributed sparse vector set, a set of distributed computing mode (such as parallel triangular matrix taking and parallel symmetric matrix transformation) is designed according to the reality, and aims to solve the problem of computing efficiency caused by high space complexity of a high-dimensional sparse text data set and accelerate the computing efficiency; 5. and solving and normalizing the Laplace matrix, aiming at mapping the data set to a high-dimensional space, conveniently extracting the characteristic vectors of the data set, and selecting the first K characteristic vectors to construct an N x K characteristic matrix. 6. 
And training the K-means model by using the feature matrix and testing.
Drawings
FIG. 1 is a block diagram of a flow chart of a Spark-based spectral clustering method according to a preferred embodiment of the present invention;
FIG. 2 is a flow chart of a parallel triangle-fetching method;
FIG. 3 is a flow chart of parallel acquisition of symmetric matrices;
FIG. 4 is a flow chart of distributed sparse vector set generation.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme of the invention is as follows:
FIG. 1 is a block diagram of the process of the present invention, which includes the following steps:
1. The data loading phase, as shown in FIG. 2;
In this stage, the data source to be processed (from the UCI data platform) is read into a Resilient Distributed Dataset (RDD), loaded into high-dimensional distributed vector set data P, and divided into a training set A1 and a test set A2.
The RCV1 data set is downloaded from the UCI experimental data platform (website: http://archive.ics.uci.edu/ml/). The form of the data set is {decision label, condition attribute 1, condition attribute 2, condition attribute 3, ..., condition attribute n}, and the data set is characterized by high dimensionality (>30000), sparsity, and high time complexity.
2. Loading into a high-dimensional distributed vector set;
Step 1, read in the high-dimensional sparse text data set using a Resilient Distributed Dataset (RDD);
Step 2, since the high-dimensional sparse text data set is sparse, record each sample in the data set as A;
Step 3, randomly sample A and divide it into small-sample data blocks B, marking each data block with an index;
Step 4, use the mapPartitionsWithIndex programming interface provided by the Spark platform to distribute the small-sample data blocks B to the cluster nodes according to the index numbers.
3. Splitting the data set;
Splitting the data set means dividing it into a training set and a test set: 90% of the samples are randomly selected without repetition as training set A1, and the remaining 10% form test set A2.
4. Calculating an adjacency matrix;
This step computes the similarity between individual samples, measured by Euclidean distance. The similarity values represent the correlation between samples; treating the samples as vertices (Vertex) and the similarities as edges (Edge) yields the familiar concept of a graph. Given the similarity matrix, the adjacency matrix is computed according to graph theory.
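The pairwise Euclidean distance matrix can be sketched as follows (a local NumPy version of the patent's distributed computation; smaller distance means more similar samples):

```python
import numpy as np

def euclidean_distance_matrix(X):
    """Pairwise Euclidean distances between the rows (samples) of X."""
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i . x_j
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(d2, 0.0))   # clip tiny negatives from rounding

X = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]])
M = euclidean_distance_matrix(X)          # 3 x 3 distance/similarity matrix
```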
5. Taking the upper triangle of the matrix in parallel, as shown in FIG. 3;
Step 1, load the similarity matrix M between samples in the high-dimensional distributed vector set, assuming each block contains n samples (n less than the attribute dimension d), so that the number of partitions is idn = d/n, with d % n == 0;
Step 2, split the data of each node's data set by columns into idn n × n matrices;
Step 3, find the n × n matrix at the position corresponding to the partition number id, take its upper triangle, and keep all columns to its right unchanged;
Step 4, if the last matrix is n × n, take its upper triangle directly and end the operation; otherwise, discard it.
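A local sketch of this blockwise upper-triangle step. Each partition id keeps the upper triangle of its diagonal n × n block and leaves the columns to its right unchanged; zeroing the blocks left of the diagonal is my reading of the method (an assumption), under which the blocks reassemble into the full upper triangle:

```python
import numpy as np

def upper_triangle_by_blocks(M, n):
    """Blockwise upper triangle: one loop iteration per partition id
    (a serial stand-in for the patent's parallel Spark partitions)."""
    d = M.shape[1]
    assert d % n == 0                        # the method assumes d % n == 0
    out = M.copy()
    for pid in range(d // n):
        blk = out[pid*n:(pid+1)*n, pid*n:(pid+1)*n]
        out[pid*n:(pid+1)*n, pid*n:(pid+1)*n] = np.triu(blk)  # triangle of diag block
        out[pid*n:(pid+1)*n, :pid*n] = 0.0   # assumed: drop blocks left of diagonal
    return out

M = np.arange(16, dtype=float).reshape(4, 4)
U = upper_triangle_by_blocks(M, 2)           # equals the full upper triangle of M
```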
6. Parallel symmetric matrix transformation, as shown in FIG. 4;
Step 1, compute the Euclidean distance between the data block B on each computing node and the sparse vector set P to represent similarity, and obtain an upper triangular matrix U by the parallel upper-triangle method above;
Step 2, store the distance matrix obtained in step 1 in point-coordinate form (COO), recording the point set as CO;
Step 3, interchange the row and column coordinates of the points in CO, recording the result as CO';
Step 4, merge the point sets of steps 2 and 3 to form a symmetric matrix S.
Step 5, set the elements Sij = 0 where i = j in the symmetric matrix S, and record the diagonal matrix as D.
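The COO symmetrization of steps 2–4 can be sketched with SciPy (a local stand-in for the distributed point-set merge; the toy values are illustrative):

```python
from scipy.sparse import coo_matrix

# Upper triangular matrix U stored as coordinate points (COO)
U = coo_matrix(([5.0, 3.0], ([0, 0], [1, 2])), shape=(3, 3))
co_rows, co_cols, co_vals = U.row, U.col, U.data     # point set CO
# CO': same values with row and column coordinates interchanged,
# then CO and CO' are merged into the symmetric matrix S
rows = list(co_rows) + list(co_cols)
cols = list(co_cols) + list(co_rows)
vals = list(co_vals) + list(co_vals)
S = coo_matrix((vals, (rows, cols)), shape=U.shape).toarray()
```

Because U has an empty diagonal, the merged S already satisfies Sij = 0 for i = j.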
7. Calculating a Laplace matrix and normalizing;
L = D - M; (2)
in formula (2), L is the Laplacian matrix, D is the diagonal matrix, and M is the similarity matrix.
Normalize L:
L1 = D^(-1/2) L D^(-1/2) = I - D^(-1/2) M D^(-1/2); (3)
where I is the identity matrix and L1 is the normalized Laplacian matrix.
8. Decompose the normalized Laplacian matrix by SVD;
SVD is a typical eigendecomposition method. The normalized Laplacian matrix L1 obtained in step (6) is factorized using the parallel SVD method provided by the Spark big data platform. The first K eigenvectors θi = (θ1i, θ2i, ..., θNi)', i = 1, 2, 3, ..., K are selected and form an N x K eigenvector matrix θ. The rows of the matrix represent the cluster samples, i.e., the corresponding decision labels in the high-dimensional sparse text data set.
9. Training a K-means model;
and (4) generating a N x K feature matrix, wherein each row of the matrix represents a clustering sample and represents a decision label in the high-dimensional sparse text data set. And training a K-means model by using the feature matrix.
10. Testing a K-means model;
The 10% of the data set split off in step (1) is the test set A2, which is used to test the model trained in step (8).
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (2)

1. A Spark-based high-dimensional sparse text data clustering method is characterized by comprising the following steps:
step 1: reading in a data set sample to be processed through an elastic distributed data set RDD provided by a Spark big data platform, and designing a distributed sparse vector set for storing a high-dimensional sparse data set by using an RDD interface;
the step of designing a distributed sparse vector set suitable for storing a high-dimensional sparse data set by using an RDD interface comprises the following steps:
a1, reading in a high-dimensional sparse text data set by using a distributed elastic data set (RDD);
a2, adopting sparse storage to record each sample in the data set as A;
a3, randomly sampling the sample of A and dividing the sample into a few sample data blocks B, wherein each data block is marked by an index;
A4, using the mapPartitionsWithIndex programming interface provided by the Spark platform to distribute the small-sample data blocks B to the cluster nodes according to their index numbers;
step 2: calculating a similarity matrix M between the data set samples to be processed and storing the similarity matrix M in a sparse vector set mode, wherein the similarity is measured in an Euclidean distance mode;
Step 3: symmetrizing the similarity matrix M stored in step 2 in sparse vector set form and solving for the normalized Laplacian matrix, wherein the parallel symmetric matrix transformation specifically comprises the following steps:
the step 3 comprises the following steps:
b1, solving Euclidean distance between the data block B in each calculation node and the sparse vector set P to obtain a distance matrix for representing similarity, and designing a parallel upper triangular method to obtain an upper triangular matrix U;
the parallel upper triangle taking step of the matrix comprises the following steps:
step 1, loading the similarity matrix M between samples in the high-dimensional distributed vector set, assuming each block contains n samples (n less than the attribute dimension d), so that the number of partitions is idn = d/n, with d % n == 0;
step 2, splitting the data of each node's data set by columns into idn n × n matrices;
step 3, finding the n × n matrix at the position corresponding to the partition number id, taking its upper triangle, and keeping all columns to its right unchanged;
step 4, if the last matrix is n × n, taking its upper triangle directly and ending the operation; otherwise, discarding it;
b2, constructing a symmetric matrix S by the distance matrix obtained in the step B1;
B3, setting the elements Sij = 0 where i = j in the symmetric matrix S, recording the diagonal matrix as D, and calculating and normalizing the Laplacian matrix;
Step 4: decomposing the normalized Laplacian matrix by SVD (singular value decomposition), obtaining the K nearest-neighbor eigenvectors, and assembling them into a nearest-neighbor matrix;
Step 5: inputting the constructed nearest-neighbor matrix as samples into a K-means model for training, completing the clustering.
2. The Spark-based high-dimensional sparse text data clustering method according to claim 1, wherein a method for calculating the similarity between samples and constructing the symmetric matrix S is designed and implemented as follows:
d1, solving Euclidean distance between the data block B in each calculation node and a sparse vector set P to represent similarity, and obtaining an upper triangular matrix U by using the parallel upper triangular method;
d2, storing the distance matrix obtained in the step D1 by adopting a point coordinate mode COO, and recording a point set as: CO;
d3, interchanging the row coordinate and the column coordinate of the CO midpoint coordinate, and recording as CO';
d4, combining the point sets of the step D2 and the step D3 together to form a symmetric matrix S.
CN201610988558.4A 2016-11-09 2016-11-09 Spark-based high-dimensional sparse text data clustering method Active CN106570173B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610988558.4A CN106570173B (en) 2016-11-09 2016-11-09 Spark-based high-dimensional sparse text data clustering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610988558.4A CN106570173B (en) 2016-11-09 2016-11-09 Spark-based high-dimensional sparse text data clustering method

Publications (2)

Publication Number Publication Date
CN106570173A CN106570173A (en) 2017-04-19
CN106570173B true CN106570173B (en) 2020-09-29

Family

ID=58540842

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610988558.4A Active CN106570173B (en) 2016-11-09 2016-11-09 Spark-based high-dimensional sparse text data clustering method

Country Status (1)

Country Link
CN (1) CN106570173B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115543991A (en) * 2022-12-02 2022-12-30 湖南工商大学 Data restoration method and device based on feature sampling and related equipment

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108197272A (en) * 2018-01-05 2018-06-22 北京搜狐新媒体信息技术有限公司 A kind of update method and device of distributed association rules increment
CN108805174B (en) * 2018-05-18 2022-03-29 广东惠禾科技发展有限公司 Clustering method and device
CN111767941B (en) * 2020-05-15 2022-11-18 上海大学 Improved spectral clustering and parallelization method based on symmetric nonnegative matrix factorization
CN112988693A (en) * 2021-03-26 2021-06-18 武汉大学 Spectral clustering algorithm parallelization method and system in abnormal data detection

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103903258A (en) * 2014-02-27 2014-07-02 西安电子科技大学 Method for detecting changes of remote sensing image based on order statistic spectral clustering
CN105354243A (en) * 2015-10-15 2016-02-24 东南大学 Merge clustering-based parallel frequent probability subgraph searching method
CN105808581A (en) * 2014-12-30 2016-07-27 Tcl集团股份有限公司 Data clustering method and device and Spark big data platform

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10496642B2 (en) * 2014-10-03 2019-12-03 The Regents Of The University Of Michigan Querying input data

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103903258A (en) * 2014-02-27 2014-07-02 西安电子科技大学 Method for detecting changes of remote sensing image based on order statistic spectral clustering
CN105808581A (en) * 2014-12-30 2016-07-27 Tcl集团股份有限公司 Data clustering method and device and Spark big data platform
CN105354243A (en) * 2015-10-15 2016-02-24 东南大学 Merge clustering-based parallel frequent probability subgraph searching method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
梁彦, "Research on the Parallelization of Data Mining Algorithms Based on the Distributed Platforms Spark and YARN", China Master's Theses Full-text Database, Information Science and Technology, 2015-01-15, No. 1, pp. I138-744 *
吴哲夫 et al., "Improvement and Parallel Implementation of the K-means Clustering Algorithm Based on the Spark Platform", 《互联网天地》 (China Internet), 2016, pp. 44-50 *
张吉文, "Research on Text Clustering Algorithms Based on Spectral Clustering", China Master's Theses Full-text Database, Information Science and Technology, 2016, pp. I138-7927 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115543991A (en) * 2022-12-02 2022-12-30 湖南工商大学 Data restoration method and device based on feature sampling and related equipment
CN115543991B (en) * 2022-12-02 2023-03-10 湖南工商大学 Data restoration method and device based on feature sampling and related equipment

Also Published As

Publication number Publication date
CN106570173A (en) 2017-04-19

Similar Documents

Publication Publication Date Title
Wang et al. A survey on learning to hash
Shen et al. Deep asymmetric pairwise hashing
Li et al. A deeper look at facial expression dataset bias
Zhu et al. Exploring auxiliary context: discrete semantic transfer hashing for scalable image retrieval
Cakir et al. Hashing with mutual information
CN106570173B (en) Spark-based high-dimensional sparse text data clustering method
Iscen et al. Memory vectors for similarity search in high-dimensional spaces
Wu et al. Online multi-modal distance metric learning with application to image retrieval
Chen et al. Parallel spectral clustering in distributed systems
Wang et al. Fast approximate k-means via cluster closures
Liu et al. Supervised hashing with kernels
Mu et al. Weakly-supervised hashing in kernel space
WO2013129580A1 (en) Approximate nearest neighbor search device, approximate nearest neighbor search method, and program
Qian et al. Unsupervised feature selection for multi-view clustering on text-image web news data
Qin et al. Fast action retrieval from videos via feature disaggregation
CN109784405B (en) Cross-modal retrieval method and system based on pseudo-tag learning and semantic consistency
Babenko et al. Similarity metrics for categorization: from monolithic to category specific
US20100299379A1 (en) Non-Negative Matrix Factorization as a Feature Selection Tool for Maximum Margin Classifiers
CN114299362A (en) Small sample image classification method based on k-means clustering
Li et al. Sub-selective quantization for learning binary codes in large-scale image search
Zhang et al. Dataset-driven unsupervised object discovery for region-based instance image retrieval
Mithun et al. Generating diverse image datasets with limited labeling
Duan et al. Minimizing reconstruction bias hashing via joint projection learning and quantization
Magliani et al. An efficient approximate kNN graph method for diffusion on image retrieval
CN110209895B (en) Vector retrieval method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20230224

Address after: Room 801, 85 Kefeng Road, Huangpu District, Guangzhou City, Guangdong Province

Patentee after: Yami Technology (Guangzhou) Co.,Ltd.

Address before: 400065 Chongwen Road, Nanshan Street, Nanan District, Chongqing

Patentee before: CHONGQING University OF POSTS AND TELECOMMUNICATIONS