CN111159406A - Big data text clustering method and system based on parallel improved K-means algorithm - Google Patents

Big data text clustering method and system based on parallel improved K-means algorithm

Info

Publication number
CN111159406A
CN111159406A
Authority
CN
China
Prior art keywords
clustering
text
canopy
algorithm
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911393493.9A
Other languages
Chinese (zh)
Inventor
李雷孝
周成栋
王慧
马志强
王永生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Technology filed Critical Inner Mongolia University of Technology
Priority to CN201911393493.9A priority Critical patent/CN111159406A/en
Publication of CN111159406A publication Critical patent/CN111159406A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Abstract

The invention belongs to the technical field of text clustering, and particularly relates to a big data text clustering method and system based on a parallel improved K-means algorithm.

Description

Big data text clustering method and system based on parallel improved K-means algorithm
Technical Field
The invention belongs to the technical field of text clustering, and particularly relates to a big data text clustering method and system based on a parallel improved K-means algorithm.
Background
In recent years, with the rapid growth of Internet information, massive network text data has been generated. Text data is unstructured and is characterized by high dimensionality, large volume, and low value density. How to effectively process and mine massive network text information has become one of the research hotspots of current Chinese information processing, and classifying large batches of texts is an important research field. Currently, in large-scale Internet text mining, clustering can be applied in many areas, such as the preprocessing stage, text semantic analysis, document similarity analysis, corpus classification analysis, and topic analysis. Text clustering divides texts into meaningful categories so that the similarity between texts within the same category is higher than that between texts in different categories, thereby effectively organizing and managing text information; effective text clustering can help people better understand and navigate the results returned by information retrieval tools. Among clustering methods, the most widely used is the partition-based K-means algorithm. When clustering with K-means, the number of clusters K must be specified and K initial center points are selected at random; however, the number of clusters usually cannot be determined in advance, and an unsuitable choice of initial cluster centers often traps K-means in a local optimum and yields poor clustering results. Aiming at the problem of the random selection of the initial K-means cluster centers, by introducing the ideas of density and nearest neighbors, the literature (Yang J, Ma Y, Zhang X, et al. An Initialization Method Based on Hybrid Distance for K-Means Algorithm [J]. Neural Computation, 2017: 1-24.)
provides an initial clustering center selection algorithm that improves the clustering quality and stability of the K-means algorithm. The literature (Zhang Jinrui, Chai Yumei, et al. [J]. Computer Engineering and Design, 2017, 38(1): 86-91.) proposes an initial clustering center selection algorithm based on the LDA topic probability model, which greatly improves running efficiency and solves the problem that K-means cannot determine the K value in advance. The literature (Limwattanapibool O, Arch-int S. Determination of the appropriate parameters for K-means clustering using selection of region clusters based on density DBSCAN [J]. Expert Systems, 2017(1): 220-224.) adopts the density-based DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm so that the number of clusters and the initial cluster centers can be found automatically; compared with algorithms that select the initial cluster centers at random, it runs efficiently and obtains a good clustering effect. The literature (Wang Jianren, Ma Xin, Qin Changlong. Improved K-means clustering K-value selection algorithm [J]. Computer Engineering and Applications, 2019, 55(08): 33-39.) proposes an improved K-value selection algorithm, the ET-SSE algorithm,
based on the basic ideas of exponential function properties, weight adjustment, bias terms and the elbow method, and verifies the effectiveness of the algorithm through experiments. However, the currently available clustering algorithms are basically suitable only for processing small-scale data. Against the background of the current Internet information explosion, the number of network documents grows exponentially and the dimensionality of the text feature space also increases sharply, which seriously degrades the classification capability of clustering algorithms and greatly prolongs their running time; this is obviously unsuitable for practical application. How to perform text clustering on large-scale text data quickly and effectively is therefore a valuable research direction;
parallel computing is an effective solution to the low efficiency of large-scale computation. There are many traditional parallel computing methods, but some of them cannot meet the ever-growing requirements of large-scale Internet data processing: MPI (Message Passing Interface), which appeared at the end of the 20th century, and grid computing, which appeared at the beginning of the 21st century, suffer from complex development, poor scalability and other problems. Facing these challenges, cloud computing arose, with MapReduce as one of its most prominent key technologies. Hadoop provides a parallel computing interface with MapReduce at its core together with the distributed file system HDFS (Hadoop Distributed File System), and can process data on up to millions of nodes and at the ZB level. The literature (Zhang Shufen, Dong Yan, et al. Design and research of the HKM algorithm [J]. Applied Science, 2018(03): 118-128.) proposes a text clustering algorithm based on parallel K-means on the Hadoop platform, which solves the problems of the traditional algorithm when processing large-scale data; however, MapReduce parallel processing consumes a large amount of network communication and disk I/O read-write time that is not spent on the computing task itself, which reduces the efficiency of the Hadoop architecture. Spark not only retains the advantages of traditional MapReduce but also offers fast processing, complex queries, and seamless integration with existing Hadoop deployments, so the efficiency of the Spark platform is greatly improved compared with that of the Hadoop platform. The literature (Liu Peng, et al. Large-scale text K-means parallel clustering algorithm based on Spark [J]. Journal of Chinese Information Processing, 2017(04): 150-;
in summary, the prior art suffers from low clustering accuracy and efficiency because the K-means algorithm is either left unoptimized or falls into local optima.
Disclosure of Invention
The invention provides a big data text clustering method and system based on a parallel improved K-means algorithm, aiming to solve the prior-art problem of low clustering accuracy and efficiency caused by the K-means algorithm being unoptimized or falling into local optima.
The technical problem solved by the invention is realized by adopting the following technical scheme: the big data text clustering method based on the parallel improved K-means algorithm comprises the following steps:
preprocessing unstructured text data of a big data text in a text storage system;
calculating the text feature word weights of the preprocessed big data text through the word2Vec word-vector-training feature word weight algorithm;
clustering low-dimensional big data text data through SWCK-means text clustering algorithm processing combining a Canopy center point selection algorithm and a K-means distance-based clustering algorithm.
Further, SWCK-means text clustering algorithm processing combining a Canopy center point selection algorithm and a K-means distance-based clustering algorithm comprises the following steps:
and clustering the big data text data with the text feature word weight in parallel by Canopy to obtain a clustering central point, taking the clustering central point as an initial clustering central point of the K-means clustering, and clustering by using a parallel K-means algorithm.
Further, the big data text clustering method further comprises the following steps:
reading the text data object set in the HDFS distributed file system of the Hadoop distributed processing software framework to generate an initial resilient distributed dataset (RDD);
preprocessing the RDD data, vectorizing the preprocessed text data, and adding the vectorized text data into the Cache for persistence to form a persistent text vector;
training the word2Vec model by the persistence text vector in parallel;
parallelizing the persistence text vector into a Canopy algorithm, segmenting the RDD, and distributing to each parallel node in the cluster;
executing Map operation at each parallel node in the cluster to calculate the distance between the text data object of each fragment and the Canopy center point so as to determine the local Canopy center point;
merging the local Canopy center points into a global Canopy center point through Reduce operation;
dividing a data object corpus into different Canopy by each parallel node in the cluster through Map operation according to a global Canopy center point, and executing Cache operation to persist the data to form a Canopy persisted text vector;
after removing the Canopy categories that contain few data objects from the Canopy persisted text vectors, assigning the list of remaining Canopy center points to the initial clustering center point list of the K-means algorithm;
running K-means local clustering operation at each parallel node in the cluster, wherein the K-means local clustering operation is to perform K-means local clustering by performing Map operation on RDD after passing through the Cache;
and running main control local clustering operation in the main control nodes in the cluster, wherein the main control local clustering operation comprises the steps of merging local clustering results generated by all parallel nodes into a global clustering result through Reduce operation, and updating the central points of all classes.
And judging whether the iteration exit condition is met, if so, outputting a result, and if not, repeatedly executing the K-means local clustering operation and the main control local clustering operation.
Further, the word2Vec feature word weight algorithm includes:
mapping each word into a word vector with a fixed size and 50-200 dimensions, wherein the word vector represents semantic information describing words to a certain extent, and the probability of a word occurring is calculated according to a plurality of words in front of the word or a plurality of continuous words in front of and behind the word.
Further, the word2Vec feature word weight algorithm further includes a Skip-gram feature word weight algorithm, and the Skip-gram feature word weight algorithm includes:
predicting the context window words of each central word, respectively calculating the probabilities of the several words appearing before and after the central word, and correcting the word vector of the central word according to the prediction result.
Further, the Canopy center point selection algorithm includes:
inputting the data set RDD and two Canopy distance thresholds, a first distance threshold T1 and a second distance threshold T2, with the first distance threshold T1 > the second distance threshold T2;
taking any data object from the data set RDD; if no Canopy class currently exists, taking the data object as a Canopy class and deleting it from the data set RDD;
continuing to take another data object from the data set RDD and calculating its distance to all the Canopies already generated; if the distance to a certain Canopy is less than T1, adding the data object to that Canopy;
determining a new Canopy: if the distance from the data object to the centers of all existing Canopies is greater than T1, taking the data object as the center of a new Canopy;
if the distance from a data object to the center of a Canopy is within T2, adding the current data object to that Canopy and deleting it from the data set RDD;
continuing the above Canopy determination and deletion operations on the data objects in the data set RDD until all data objects are classified into their corresponding Canopies.
Further, the K-means distance-based clustering algorithm includes:
inputting a data set RDD and K Canopy clustering centers;
determining the new clustering centers through the distances to the clustering centers: calculating the distance from every object in the RDD other than the K Canopy center points to each clustering center, and assigning each object to the cluster whose center is closest to it;
recalculating the average value of all data objects in each cluster, and taking the average value as the new clustering center;
repeating the determination of the new clustering centers through the distances to the clustering centers until the change of the clustering centers is smaller than a set threshold or the maximum number of iterations is reached, at which point the iteration ends.
Further, the text storage system is based on an HDFS distributed file system of a Hadoop distributed processing framework.
A big data text clustering system based on a parallel improved K-means algorithm comprises: a big data text clustering module;
the big data text clustering module applies any one of the big data text clustering methods based on the parallel improved K-means algorithm.
The beneficial technical effects are as follows:
the method comprises the steps of preprocessing a big data text in a text storage system; calculating the weight of the text feature words of the preprocessed big data text by a training word vector method word2Vec feature word weight algorithm; clustering low-dimensional big data text data through an SWCK-means text clustering algorithm process combining a Canopy center point selection algorithm and a K-means distance-based clustering algorithm, performing parallel Canopy clustering on the big data text data with text characteristic word weight to obtain a clustering center point, performing clustering by taking the clustering center point as an initial clustering center point of the K-means clustering and performing the K-means algorithm in parallel, preprocessing the data by taking a Hadoop Distributed File System (HDFS) as a text storage system based on the SWCK-means text clustering algorithm of a Spark platform, and then calculating the article characteristic word weight by using word2 Vec; the text data are clustered, a clustering algorithm combining Canopy and K-means is adopted, a clustering center point is obtained by parallel Canopy clustering, the clustering center point is used as an initial clustering center point of the K-means clustering, then the K-means clustering is carried out in parallel, finally, the main performance indexes such as accuracy, acceleration ratio and expansion ratio are compared through experiments, and an experiment conclusion is obtained.
Drawings
FIG. 1 is a general flowchart of the big data text clustering method based on the parallel improved K-means algorithm of the present invention;
FIG. 2 is a detailed flowchart of the big data text clustering method based on the parallel improved K-means algorithm of the present invention;
FIG. 3 is a model diagram of a Skip-gram feature word weight algorithm of the big data text clustering method based on the parallel improved K-means algorithm;
FIG. 4 is a flow chart of a Canopy center point selection algorithm of the big data text clustering method based on the parallel improved K-means algorithm of the present invention;
FIG. 5 is a flow chart of a K-means distance-based clustering algorithm of the big data text clustering method based on the parallel improved K-means algorithm of the present invention;
Detailed Description
The invention is further described below with reference to the accompanying drawings:
in the figure:
S101: preprocessing the big data text in the text storage system;
S102: calculating the text feature word weights of the preprocessed big data text through the word2Vec word-vector-training feature word weight algorithm;
S103: clustering the big data text data carrying the text feature word weights through the SWCK-means text clustering algorithm, which combines the Canopy center point selection algorithm with the K-means distance-based clustering algorithm;
Embodiments:
the first embodiment is as follows: as shown in fig. 1, the big data text clustering method based on the parallel improved K-means algorithm includes:
preprocessing unstructured text data of a big data text in a text storage system S101;
calculating the text feature word weights of the preprocessed big data text through the word2Vec word-vector-training feature word weight algorithm S102;
clustering the low-dimensional big data text data through SWCK-means text clustering algorithm processing combining a Canopy center point selection algorithm and a K-means distance-based clustering algorithm S103.
The SWCK-means text clustering algorithm processing combining the Canopy center point selection algorithm and the K-means distance-based clustering algorithm comprises the following steps:
and clustering the big data text data with the text feature word weight in parallel by Canopy to obtain a clustering central point, taking the clustering central point as an initial clustering central point of the K-means clustering, and clustering by using a parallel K-means algorithm.
The big data text in the text storage system is preprocessed; the text feature word weights of the preprocessed big data text are calculated through the word2Vec word-vector-training feature word weight algorithm; and the low-dimensional big data text data is clustered through the SWCK-means text clustering algorithm, which combines the Canopy center point selection algorithm with the K-means distance-based clustering algorithm. Parallel Canopy clustering is performed on the big data text data carrying the text feature word weights to obtain cluster center points; these are used as the initial cluster centers of the K-means clustering, and the K-means algorithm is then executed in parallel. The SWCK-means text clustering algorithm, based on the Spark platform, uses the Hadoop Distributed File System (HDFS) as the text storage system, preprocesses the data, and then calculates the article feature word weights with word2Vec. For clustering the text data, a clustering algorithm combining Canopy and K-means is adopted: parallel Canopy clustering obtains the cluster center points, which serve as the initial cluster centers for the K-means clustering carried out in parallel. Finally, the main performance indexes such as accuracy, speed-up ratio and expansion ratio are compared through experiments, and experimental conclusions are drawn.
As shown in fig. 2, the big data text clustering method further includes:
reading the text data object set in the HDFS distributed file system of the Hadoop distributed processing software framework to generate an initial resilient distributed dataset (RDD);
preprocessing the RDD data, vectorizing the preprocessed text data, and adding the vectorized text data into the Cache for persistence to form a persistent text vector;
training the word2Vec model by the persistence text vector in parallel;
parallelizing the persistence text vector into a Canopy algorithm, segmenting the RDD, and distributing to each parallel node in the cluster;
executing Map operation at each parallel node in the cluster to calculate the distance between the text data object of each fragment and the Canopy center point so as to determine the local Canopy center point;
merging the local Canopy center points into a global Canopy center point through Reduce operation;
dividing a data object corpus into different Canopy by each parallel node in the cluster through Map operation according to a global Canopy center point, and executing Cache operation to persist the data to form a Canopy persisted text vector;
after removing the Canopy categories that contain few data objects from the Canopy persisted text vectors, assigning the list of remaining Canopy center points to the initial clustering center point list of the K-means algorithm;
running K-means local clustering operation at each parallel node in the cluster, wherein the K-means local clustering operation is to perform K-means local clustering by performing Map operation on RDD after passing through the Cache;
running main control local clustering operation in main control nodes in a cluster, wherein the main control local clustering operation comprises the steps of merging local clustering results generated by all parallel nodes into a global clustering result through Reduce operation, and updating the central point of each class;
and judging whether the iteration exit condition is met, if so, outputting a result, and if not, repeatedly executing the K-means local clustering operation and the main control local clustering operation.
An initial resilient distributed dataset (RDD) is generated by reading the text data object set from the HDFS distributed file system of the Hadoop distributed processing software framework; the RDD data is preprocessed, the preprocessed text data is vectorized and added to the Cache for persistence to form persisted text vectors; the word2Vec model is trained in parallel on the persisted text vectors; the persisted text vectors are fed in parallel into the Canopy algorithm, the RDD is partitioned, and the partitions are distributed to the parallel nodes in the cluster; each parallel node executes a Map operation to calculate the distance between the text data objects of its partition and the Canopy center points so as to determine the local Canopy center points. The parallel design idea of the SWCK-means algorithm is mainly to design a parallelization scheme for the two parts of the Canopy + K-means algorithm (the word2Vec algorithm in the Spark MLlib machine learning library is used to calculate the text word vector weights, and this word2Vec implementation is already parallelized). The Spark-based parallel Canopy + K-means algorithm follows roughly the same logic as its serial counterpart; Spark can automatically parallelize the Canopy + K-means algorithm according to the serial logic of the algorithm, the difference being that the parallel algorithm uses the resilient distributed dataset (RDD) to automatically realize parallel data distribution.
The parallelization process of the Canopy + K-means algorithm based on the Spark platform is roughly divided into two parts: one is Canopy center point selection, the other is the final clustering by the K-means algorithm. Accordingly, the parallelized design of the algorithm first runs the parallelized Canopy algorithm to select the center points and then performs the final clustering with the parallelized K-means algorithm. The specific process is as follows:
Step 1: read the text data object set from the Hadoop Distributed File System (HDFS) to generate the initial RDD;
Step 2: preprocess the data, vectorize the text data, and add it to the Cache for persistence;
Step 3: train the word2Vec model in parallel, convert the texts into vector representations, and persist the text vectors;
Step 4: execute the parallelized Canopy algorithm, partition the data RDD, and distribute the partitions to the parallel nodes in the cluster;
Step 5: execute a Map operation to calculate the distance between the text data objects of each partition and the Canopy center points, thereby obtaining the local Canopy center points;
Step 6: execute a Reduce operation to merge the local Canopy center points into the global Canopy center points;
Step 7: each node executes a Map operation according to the global Canopy center points, divides the full data object set into the different Canopies, and executes a Cache operation to persist the data;
Step 8: remove the Canopy categories that contain few objects, and then assign the list of remaining Canopy center points to the initial clustering center point list of the K-means algorithm;
Step 9: each node performs a Map operation on the cached RDD to execute local K-means clustering;
Step 10: the master node executes a Reduce operation to merge the local clustering results generated by the computing nodes into a global clustering result and updates the center point of each class;
Step 11: judge whether the iteration exit condition is satisfied; if so, output the result, otherwise repeat Step 9-Step 10.
The word2Vec feature word weight algorithm comprises the following steps:
mapping each word into a word vector with a fixed size and 50-200 dimensions, wherein the word vector represents semantic information describing words to a certain extent, and the probability of a word occurring is calculated according to a plurality of words in front of the word or a plurality of continuous words in front of and behind the word.
The word2Vec feature word weight algorithm maps each word into a word vector of a fixed size; the dimensionality of the word vector is generally chosen between 50 and 200, and the word vector describes, to a certain extent, the semantic information of the word. The probability of a given word occurring is calculated from the C words preceding it, or from C consecutive words before and after it. Text is unstructured data and must be represented in a form that a computer can recognize and process. The word2Vec model is trained on the sequences of words that represent the documents; in the word2Vec algorithm each document is represented as a feature word weight vector in a vector space model, so the training speed is high and the accuracy of semantic similarity is improved. Therefore, word vector weights are calculated based on the Skip-gram model in word2Vec, after which cluster analysis is carried out on the text data; for massive text data, calculating the word vector weights with the word2Vec neural network reduces the text dimensionality.
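As an illustrative sketch of how fixed-size word vectors reduce text dimensionality (not the patent's exact weighting procedure), a document can be represented as the mean of its trained word vectors, keeping it at the word-vector dimensionality (50-200) instead of vocabulary-sized dimensions; the toy 2-dimensional vocabulary below is an assumption for demonstration only.

```python
def doc_vector(tokens, word_vecs):
    """Represent a document as the mean of its known word vectors."""
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    if not vecs:
        return None  # no known words: no vector representation
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# toy 2-dimensional "trained" vectors (illustrative only; real
# word2Vec vectors would have 50-200 dimensions)
word_vecs = {"big": [1.0, 0.0], "data": [0.0, 1.0], "text": [1.0, 1.0]}
```

For example, `doc_vector(["big", "data"], word_vecs)` yields `[0.5, 0.5]`, a single fixed-length vector suitable as input to the Canopy and K-means stages.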
As shown in fig. 3, the word2Vec feature word weight algorithm further includes a Skip-gram feature word weight algorithm, and the Skip-gram feature word weight algorithm includes:
and predicting the context window words of the central words by each central word, respectively calculating the probabilities of the appearance of a plurality of words before and after the context window words, and correcting the word vectors of the central words according to the prediction result.
The word2Vec feature word weight algorithm further includes a Skip-gram feature word weight algorithm: each central word predicts the words in its context window, the probabilities of the several words appearing before and after the central word are calculated respectively, and the word vector of the central word is corrected according to the prediction result.
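For illustration, the (central word, context word) training pairs that a Skip-gram model learns to predict can be enumerated as below. This is a hedged sketch of the pair-generation step only, not the full neural training, and the window size is an assumed parameter:

```python
def skipgram_pairs(tokens, window=2):
    """Enumerate the (central word, context word) pairs used to train Skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                # context window start
        hi = min(len(tokens), i + window + 1)  # context window end (exclusive)
        for j in range(lo, hi):
            if j != i:  # every window word except the center itself
                pairs.append((center, tokens[j]))
    return pairs
```

Each pair drives one prediction of a context word from the central word's vector; the prediction error is what corrects the central word's vector during training.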
As shown in fig. 4, the Canopy center point selection algorithm includes:
inputting the data set RDD and two Canopy distance thresholds, a first distance threshold T1 and a second distance threshold T2, with the first distance threshold T1 > the second distance threshold T2;
taking any data object from the data set RDD; if no Canopy class currently exists, taking the data object as a Canopy class and deleting it from the data set RDD;
continuing to take another data object from the data set RDD and calculating its distance to all the Canopies already generated; if the distance to a certain Canopy is less than T1, adding the data object to that Canopy;
determining a new Canopy: if the distance from the data object to the centers of all existing Canopies is greater than T1, taking the data object as the center of a new Canopy;
if the distance from a data object to the center of a Canopy is within T2, adding the current data object to that Canopy and deleting it from the data set RDD;
continuing the above Canopy determination and deletion operations on the data objects in the data set RDD until all data objects are classified into their corresponding Canopies.
The Canopy center point selection algorithm comprises the following steps: inputting the data set RDD and two Canopy distance thresholds, a first distance threshold T1 and a second distance threshold T2, with the first distance threshold T1 > the second distance threshold T2; taking any data object from the data set RDD, and if no Canopy class currently exists, taking the data object as a Canopy class and deleting it from the data set RDD; continuing to take another data object from the data set RDD, calculating its distance to all the Canopies already generated, and adding it to a Canopy if its distance to that Canopy is less than T1; and determining and deleting key data objects: if the distance from the data object to the centers of all Canopies is greater than T1, the data object is taken as a new key Canopy.

Canopy is a fast approximate clustering technique: the clusters are obtained very quickly, and the result is obtained by traversing the data only once, but for the same reason the Canopy algorithm cannot give an accurate clustering result. The Canopy algorithm mainly comprises the following steps:

Step 1: input the data set D and the two Canopy distance thresholds T1 and T2, with T1 > T2.
Step 2: take a data object from the data set; if no Canopy class currently exists, treat the data object as a Canopy class and delete it from the data set.
Step 3: continue to take a point P from the data set and calculate its distance to all the Canopies already generated; if the distance to a Canopy is less than T1, add P to that Canopy; if the distances from P to the centers of all Canopies are greater than T1, regard P as a new Canopy.
Step 4: if the distance from the data object to the center of a Canopy is within T2, add it to that Canopy and remove it from the data set; because the data object is so close to this Canopy, it can no longer be the center of any other Canopy.
Step 5: continue to perform Step 3 and Step 4 on the points in the set until all the data objects are divided into corresponding Canopies; the algorithm then terminates and the clustering is finished.

Although the Canopy algorithm is a rough clustering algorithm with low accuracy, the obtained clustering result can be used as a preprocessing step for the K-means algorithm. This avoids the random selection of the initial K-means cluster centers, effectively reduces the number of K-means iterations, and improves the clustering effect.
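Steps 1 to 5 above can be sketched in a few lines of plain Python. This is a minimal single-pass illustration of the described Canopy procedure, not the patent's Spark implementation; the function name, the one-dimensional test data, and the simplification that every visited object is consumed (which preserves the one-scan property the text describes) are assumptions of this sketch.

```python
def canopy(points, t1, t2, dist):
    """Single-pass Canopy sketch: t1 is the loose threshold, t2 the tight one."""
    assert t1 > t2, "the loose threshold T1 must exceed the tight threshold T2"
    remaining = list(points)
    canopies = []                                # each entry: {"center": ..., "members": [...]}
    while remaining:
        p = remaining.pop(0)                     # Step 2/3: take an object from the set
        if not canopies:                         # first object founds the first Canopy
            canopies.append({"center": p, "members": [p]})
            continue
        dists = [dist(p, c["center"]) for c in canopies]
        if min(dists) > t1:                      # far from every center: p founds a new Canopy
            canopies.append({"center": p, "members": [p]})
            continue
        for d, c in zip(dists, canopies):
            if d < t1:                           # loosely belongs; an object may join several Canopies
                c["members"].append(p)
        # Step 4 analogue: objects close to a center are consumed and cannot
        # found further Canopies (here every visited object is consumed).
    return canopies
```

With one-dimensional "weights" and `dist=lambda a, b: abs(a - b)`, `canopy([0, 1, 10, 11, 30], 5, 2, ...)` yields three Canopies centered near 0, 10 and 30, whose centers can then seed K-means.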
As shown in fig. 5, the K-means distance-based clustering algorithm includes:
inputting a data set RDD and K Canopy clustering centers;
determining key cluster centers through the distances to the cluster centers: calculating the distance from every object in the data set RDD other than the K Canopy center points to each cluster center, and assigning each object to the cluster whose center is closest to it;
recalculating the average value of all data objects in each cluster, and taking the average value as a key clustering center;
and repeatedly determining the key cluster centers through the distances to the cluster centers until the change of the cluster centers is smaller than a set threshold or the maximum number of iterations is reached, at which point the iteration finishes.
The K-means distance-based clustering algorithm comprises the following steps: inputting the data set RDD and the number of clusters K; randomly selecting K points from the data set RDD as the initial cluster centers; determining key cluster centers through the distances to the cluster centers: calculating the distance from every object in the data set RDD other than the K selected points to each cluster center, and assigning each object to the cluster whose center is closest to it; recalculating the average value of all data objects in each cluster and taking the average value as the key cluster center; and repeatedly determining the key cluster centers until the change of the cluster centers is smaller than a set threshold or the maximum number of iterations is reached.

K-means is the most widely applied distance-based clustering algorithm. Its core idea is to obtain mutually independent classes through iteration, that is, all objects are divided into K different clusters so that objects within a class have high similarity and objects between classes have low similarity. The main steps of the K-means algorithm are as follows:

Step 1: input the data set D and the number of clusters K.
Step 2: randomly select K points from the data set D as the initial cluster centers.
Step 3: calculate the distances from all the remaining objects to the cluster centers, and assign each object to the cluster whose center is closest to it.
Step 4: recalculate the average of all data objects in each cluster and use it as the new cluster center.
Step 5: repeat Step 3 and Step 4 until the change of the cluster centers is smaller than a set threshold or the maximum number of iterations is reached.

In the present scheme, the Canopy algorithm first clusters the text weight data to select the initial K-means cluster centers, K-means then performs the final clustering, and finally the algorithm is given a parallelized design.
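The K-means steps above, seeded with externally supplied initial centers as the scheme does with Canopy centers, can be sketched in plain Python. The names `kmeans`, `dist` and `mean` and the one-dimensional test data are assumptions of this illustration; the patent's actual implementation is parallel Spark code, not this serial sketch.

```python
def kmeans(points, centers, dist, mean, max_iter=100, tol=1e-6):
    """K-means sketch seeded with externally chosen centers (e.g. Canopy centers)."""
    centers = list(centers)
    for _ in range(max_iter):                           # Step 5: iterate
        clusters = [[] for _ in centers]
        for p in points:                                # Step 3: nearest-center assignment
            i = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            clusters[i].append(p)
        new_centers = [mean(c) if c else centers[i]     # Step 4: recompute cluster means
                       for i, c in enumerate(clusters)]
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:                                 # centers moved less than the threshold
            break
    return centers, clusters
```

Seeding with two well-separated initial centers, e.g. `kmeans([0, 1, 2, 10, 11, 12], [0, 10], ...)`, converges in two passes, illustrating why Canopy-chosen seeds reduce the number of K-means iterations.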
The text storage system is based on the HDFS distributed file system of the Hadoop distributed processing framework, and the text clustering algorithm is the SWCK-means text clustering algorithm based on the Spark platform. The Hadoop Distributed File System (HDFS) is adopted as the text storage system, and the data are preprocessed there.
A big data text clustering system based on a parallel improved K-means algorithm comprises: a big data text clustering module;
the big data text clustering module applies any one of the big data text clustering methods based on the parallel improved K-means algorithm.
Because the big data text clustering module applies any one of the big data text clustering methods based on the parallel improved K-means algorithm, and takes massive Internet text information classification as its application background, the accuracy and the performance of the K-means algorithm are improved.
The working principle is as follows:
the method comprises: preprocessing the big data text in the text storage system; calculating the text feature word weights of the preprocessed big data text with the word2Vec trained-word-vector feature word weight algorithm; and clustering the low-dimensional big data text data through the SWCK-means text clustering algorithm, which combines the Canopy center point selection algorithm with the K-means distance-based clustering algorithm. Parallel Canopy clustering is performed on the big data text data carrying the text feature word weights to obtain cluster center points, and these points are used as the initial cluster centers for the parallel K-means clustering.

The SWCK-means text clustering algorithm is based on the Spark platform. The Hadoop Distributed File System (HDFS) is adopted as the text storage system and the data are preprocessed, after which word2Vec is used to calculate the article feature word weights. The text data are then clustered with the algorithm combining Canopy and K-means: parallel Canopy clustering is first performed to obtain cluster center points, which serve as the initial cluster centers for K-means clustering, and parallel K-means clustering is then performed. Finally, experiments on the main performance indicators, such as accuracy, speedup ratio and scale-up ratio, are carried out to obtain the experimental conclusions.

The scheme takes massive Internet text information classification as the application background for the experimental analysis of the present algorithm. The experimental results show that the classification effect of the present algorithm is obviously improved compared with the traditional K-means algorithm, and that the present algorithm has a larger performance advantage when processing massive data. The invention solves the problem in the prior art that the accuracy and efficiency of clustering are low because the K-means algorithm is not optimized or is only locally optimized, and has the beneficial technical effects of improving the accuracy and efficiency of K-means clustering, reducing the dimensionality of the text, improving the clustering effect and realizing a parallelized design.
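The parallel pattern described above, local Canopy clustering on each data slice (Map) followed by merging the local centers into global centers (Reduce), can be imitated in plain Python. The sketch below is a hypothetical stand-in: `local_centers` uses a crude T1-only center pick in place of full local Canopy clustering, the data are one-dimensional, and a real implementation would run the Map step with Spark's `mapPartitions` on an RDD and combine the results with `reduce`.

```python
from functools import reduce

def local_centers(partition, t1):
    """Map side: one-pass center pick inside a single data slice
    (a simplified stand-in for local Canopy clustering)."""
    centers = []
    for p in partition:
        if all(abs(p - c) > t1 for c in centers):   # far from every local center
            centers.append(p)
    return centers

def merge_centers(a, b, t1):
    """Reduce side: fold two local center lists into one global list,
    dropping centers within T1 of an existing global center."""
    out = list(a)
    for c in b:
        if all(abs(c - g) > t1 for g in out):
            out.append(c)
    return out

# Simulated cluster: three partitions of one-dimensional "weight" data.
partitions = [[0.0, 0.5, 9.0], [0.2, 9.5, 10.0], [20.0, 20.5]]
locals_ = [local_centers(part, t1=3.0) for part in partitions]           # Map
global_centers = reduce(lambda a, b: merge_centers(a, b, 3.0), locals_)  # Reduce
```

The merged global centers (here near 0, 9 and 20) would then seed the parallel K-means stage, mirroring the Map/Reduce division of labor in the working principle.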
The technical solutions of the present invention, and similar technical solutions designed by those skilled in the art based on the teachings of the present invention to achieve the above technical effects, all fall within the protection scope of the present invention.

Claims (9)

1. The big data text clustering method based on the parallel improved K-means algorithm is characterized by comprising the following steps of:
preprocessing unstructured text data of a big data text in a text storage system;
calculating the weight of the text feature words of the preprocessed big data text by a training word vector method word2Vec feature word weight algorithm;
clustering low-dimensional big data text data through SWCK-means text clustering algorithm processing combining a Canopy center point selection algorithm and a K-means distance-based clustering algorithm.
2. The big data text clustering method based on the parallel improved K-means algorithm as claimed in claim 1, wherein the SWCK-means text clustering algorithm process combining the Canopy center point selection algorithm and the K-means distance-based clustering algorithm comprises:
and clustering the big data text data with the text feature word weight in parallel by Canopy to obtain a clustering central point, taking the clustering central point as an initial clustering central point of the K-means clustering, and clustering by using a parallel K-means algorithm.
3. The big data text clustering method based on the parallel improved K-means algorithm as claimed in claim 2, wherein the big data text clustering method further comprises:
reading a text data object set in an HDFS distributed file system based on a software framework of Hadoop distributed processing to generate an initial distributed elastic data set (RDD);
preprocessing the RDD data, vectorizing the preprocessed text data, and adding the vectorized text data into the Cache for persistence to form a persistent text vector;
training the word2Vec model by the persistence text vector in parallel;
parallelizing the persistence text vector into a Canopy algorithm, segmenting the RDD, and distributing to each parallel node in the cluster;
executing Map operation at each parallel node in the cluster to calculate the distance between the text data object of each fragment and the Canopy center point so as to determine the local Canopy center point;
merging the local Canopy center points into a global Canopy center point through Reduce operation;
dividing a data object corpus into different Canopy by each parallel node in the cluster through Map operation according to a global Canopy center point, and executing Cache operation to persist the data to form a Canopy persisted text vector;
after removing the Canopy categories that contain fewer data objects from the Canopy persisted text vectors, assigning the remaining Canopy center point list to the initial cluster center point list of the K-means algorithm;
running K-means local clustering operation at each parallel node in the cluster, wherein the K-means local clustering operation is to perform K-means local clustering by performing Map operation on RDD after passing through the Cache;
running main control local clustering operation in main control nodes in a cluster, wherein the main control local clustering operation comprises the steps of merging local clustering results generated by all parallel nodes into a global clustering result through Reduce operation, and updating the central point of each class;
and judging whether the iteration exit condition is met, if so, outputting a result, and if not, repeatedly executing the K-means local clustering operation and the main control local clustering operation.
4. The big data text clustering method based on the parallel improved K-means algorithm as claimed in claim 3, wherein the word vector model word2Vec feature word weight algorithm comprises:
mapping each word into a word vector of fixed size, typically 50 to 200 dimensions, wherein the word vector represents, to a certain extent, the semantic information of the word, and the probability of a word occurring is calculated from several words preceding it or several consecutive words before and after it.
5. The big data text clustering method based on the parallel improved K-means algorithm as claimed in claim 4, wherein the word2Vec feature word weight algorithm further comprises a Skip-gram feature word weight algorithm, and the Skip-gram feature word weight algorithm comprises:
predicting the context window words from each central word, respectively calculating the probabilities of the appearance of several words before and after the central word, and correcting the word vector of the central word according to the prediction result.
6. The big data text clustering method based on the parallel improved K-means algorithm as claimed in claim 3, wherein the Canopy center point selection algorithm comprises:
inputting the data set RDD and two Canopy distance thresholds, a first distance threshold T1 and a second distance threshold T2, with the first distance threshold T1 > the second distance threshold T2;
taking any data object from the data set RDD, if the Canopy class does not exist currently, taking the data object as the Canopy class, and deleting the data object from the data set RDD;
continuing to take another data object from the data set RDD, calculating the distance from the other data object to all the Canopies already generated, and adding the other data object to a Canopy if the distance to that Canopy is less than T1;
determine and delete key Canopy: if the distance from the other data object to the center of all Canopy is greater than T1, then the other data object is taken as a key Canopy;
if the distance from a data object to the center of a Canopy is within T2, adding the current data object to the Canopy and deleting the current data object from the data set RDD;
the key Canopy determination and deletion operations continue on the data objects in the data set RDD until all the data objects are assigned to corresponding Canopies.
7. The big data text clustering method based on the parallel improved K-means algorithm as claimed in claim 3, wherein the K-means distance-based clustering algorithm comprises:
inputting a data set RDD and K Canopy clustering centers;
determining key cluster centers through the distances to the cluster centers: calculating the distance from every object in the data set RDD other than the K Canopy center points to each cluster center, and assigning each object to the cluster whose center is closest to it;
recalculating the average value of all data objects in each cluster, and taking the average value as a key clustering center;
and repeatedly determining the key cluster centers through the distances to the cluster centers until the change of the cluster centers is smaller than a set threshold or the maximum number of iterations is reached, at which point the iteration finishes.
8. The big data text clustering method based on the parallel improved K-means algorithm as claimed in claim 1, wherein the text storage system is based on HDFS distributed file system of Hadoop distributed processing framework.
9. The big data text clustering system based on the parallel improved K-means algorithm is characterized by comprising the following steps: a big data text clustering module;
the big data text clustering module applies the big data text clustering method based on the parallel improved K-means algorithm according to any one of claims 1 to 8.
CN201911393493.9A 2019-12-30 2019-12-30 Big data text clustering method and system based on parallel improved K-means algorithm Pending CN111159406A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911393493.9A CN111159406A (en) 2019-12-30 2019-12-30 Big data text clustering method and system based on parallel improved K-means algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911393493.9A CN111159406A (en) 2019-12-30 2019-12-30 Big data text clustering method and system based on parallel improved K-means algorithm

Publications (1)

Publication Number Publication Date
CN111159406A true CN111159406A (en) 2020-05-15

Family

ID=70559122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911393493.9A Pending CN111159406A (en) 2019-12-30 2019-12-30 Big data text clustering method and system based on parallel improved K-means algorithm

Country Status (1)

Country Link
CN (1) CN111159406A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444544A (en) * 2020-06-12 2020-07-24 支付宝(杭州)信息技术有限公司 Method and device for clustering private data of multiple parties
CN113139061A (en) * 2021-05-14 2021-07-20 东北大学 Case feature extraction method based on word vector clustering
CN116993059A (en) * 2023-09-26 2023-11-03 南通广袤丰信息技术有限公司 Internet of things intelligent agricultural plant protection system based on big data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106127244A (en) * 2016-06-22 2016-11-16 Tcl集团股份有限公司 A kind of parallelization K means improved method and system
US20180089303A1 (en) * 2016-09-26 2018-03-29 Splunk Inc. Clustering events based on extraction rules

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106127244A (en) * 2016-06-22 2016-11-16 Tcl集团股份有限公司 A kind of parallelization K means improved method and system
US20180089303A1 (en) * 2016-09-26 2018-03-29 Splunk Inc. Clustering events based on extraction rules

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Bo (张波): "Parallel Implementation and Optimization of the K-means Algorithm Based on Spark", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444544A (en) * 2020-06-12 2020-07-24 支付宝(杭州)信息技术有限公司 Method and device for clustering private data of multiple parties
WO2021249502A1 (en) * 2020-06-12 2021-12-16 支付宝(杭州)信息技术有限公司 Method and apparatus for clustering privacy data of multiple parties
CN113139061A (en) * 2021-05-14 2021-07-20 东北大学 Case feature extraction method based on word vector clustering
CN113139061B (en) * 2021-05-14 2023-07-21 东北大学 Case feature extraction method based on word vector clustering
CN116993059A (en) * 2023-09-26 2023-11-03 南通广袤丰信息技术有限公司 Internet of things intelligent agricultural plant protection system based on big data

Similar Documents

Publication Publication Date Title
US8745055B2 (en) Clustering system and method
CN102737126B (en) Classification rule mining method under cloud computing environment
CN111159406A (en) Big data text clustering method and system based on parallel improved K-means algorithm
JP2022020070A (en) Information processing, information recommendation method and apparatus, electronic device and storage media
Bijari et al. Memory-enriched big bang–big crunch optimization algorithm for data clustering
Zhang et al. An affinity propagation clustering algorithm for mixed numeric and categorical datasets
CN110020435B (en) Method for optimizing text feature selection by adopting parallel binary bat algorithm
Wang et al. Design and Application of a Text Clustering Algorithm Based on Parallelized K-Means Clustering.
Xiong et al. Recursive learning for sparse Markov models
Xu Research and implementation of improved random forest algorithm based on Spark
El Bakry et al. Big data classification using fuzzy K-nearest neighbor
Chen et al. Distributed text feature selection based on bat algorithm optimization
Chu et al. A binary superior tracking artificial bee colony with dynamic Cauchy mutation for feature selection
CN117093885A (en) Federal learning multi-objective optimization method integrating hierarchical clustering and particle swarm
Bae et al. Label propagation-based parallel graph partitioning for large-scale graph data
Chen et al. Community detection based on deepwalk model in large-scale networks
Matharage et al. A scalable and dynamic self-organizing map for clustering large volumes of text data
Wang et al. A Second-Order HMM Trajectory Prediction Method based on the Spark Platform.
Umale et al. Overview of k-means and expectation maximization algorithm for document clustering
Zhang et al. Coarse-grained parallel AP clustering algorithm based on intra-class and inter-class distance
Bagde et al. An analytic survey on mapreduce based k-means and its hybrid clustering algorithms
Kumar et al. Clustering of web usage data using hybrid K-means and PACT Algorithms
Kim et al. Big numeric data classification using grid-based Bayesian inference in the MapReduce framework
Lu et al. Dynamic Partition Forest: An Efficient and Distributed Indexing Scheme for Similarity Search based on Hashing
Feng et al. A genetic k-means clustering algorithm based on the optimized initial centers

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20200515