WO2017181660A1 - Data clustering method and apparatus based on K-Means algorithm - Google Patents

Data clustering method and apparatus based on K-Means algorithm

Info

Publication number
WO2017181660A1
WO2017181660A1 (PCT/CN2016/105949)
Authority
WO
WIPO (PCT)
Prior art keywords
data
clustering
data set
calculation amount
cluster centers
Prior art date
Application number
PCT/CN2016/105949
Other languages
English (en)
French (fr)
Inventor
胡斐然
王楠楠
曹俊
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2017181660A1 publication Critical patent/WO2017181660A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Definitions

  • the present invention relates to the field of computer technology, and in particular, to a data clustering method and apparatus based on the K-Means algorithm.
  • the K-Means algorithm is the classic distance-based clustering algorithm. It uses distance as the measure of similarity: the closer two objects are, the more similar they are considered to be.
  • the process of clustering data based on the K-Means algorithm may be as follows: group the data to be classified into a data set and specify the number K of categories to be formed; randomly select K data from the data set as the initial cluster centers of the K categories; for each data in the data set other than the K initial cluster centers, calculate the distance between that data and each of the K initial cluster centers, and assign the data to the category corresponding to the nearest initial cluster center; then recalculate the new cluster centers of the K categories according to the data included in the K categories, and reclassify the data in the data set, repeating until, for every category, the distance between its two most recent cluster centers is within a preset distance.
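  • As a reference for the procedure above, a minimal Python sketch of this standard K-Means loop may help; it is an illustrative sketch (function and variable names are ours, not the patent's), with random initial centers and convergence judged by how far each center moves between rounds.

```python
import random

def kmeans(data, k, preset_distance=1e-4, max_rounds=100):
    """Plain K-Means: random initial centers, then alternate assignment and
    center updates until every center moves less than preset_distance."""
    centers = random.sample(data, k)
    for _ in range(max_rounds):
        # Assign each point to the category of its nearest center.
        categories = [[] for _ in range(k)]
        for point in data:
            dists = [sum((p - c) ** 2 for p, c in zip(point, center)) ** 0.5
                     for center in centers]
            categories[dists.index(min(dists))].append(point)
        # Recompute each category's center as the mean of its members.
        new_centers = []
        for old, members in zip(centers, categories):
            if members:
                new_centers.append(tuple(sum(dim) / len(members)
                                         for dim in zip(*members)))
            else:
                new_centers.append(old)  # an empty category keeps its old center
        # Stop once every center moved less than the preset distance.
        shift = max(sum((a - b) ** 2 for a, b in zip(o, n)) ** 0.5
                    for o, n in zip(centers, new_centers))
        centers = new_centers
        if shift < preset_distance:
            break
    return centers, categories

# Example: cluster four 2-D points into K = 2 categories.
centers, categories = kmeans([(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)], 2)
```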
  • the present invention provides a data clustering method and apparatus based on the K-Means algorithm.
  • the technical solutions are as follows:
  • a first aspect of the present invention provides a computing device, which is configured to cluster N data included in a data set to be processed into K categories, where N is an integer greater than K, K is a preset number of categories and is an integer greater than or equal to 2, each of the K categories corresponding to an initial cluster center;
  • the computing device includes a communication interface, a processor, and a memory; the communication interface establishes communication connections with the processor and the memory respectively, and the processor establishes a communication connection with the memory;
  • the communication interface is configured to receive a clustering request, where the clustering request includes a maximum amount of calculation, the K, and the data set;
  • the memory is configured to store the maximum calculation amount, the K, and the data set;
  • the memory is further configured to store program instructions;
  • the processor is configured to read the program instructions in the memory to perform the following steps:
  • the processor is further configured to determine, according to the maximum calculation amount, an adjustment factor corresponding to the maximum calculation amount;
  • the processor is further configured to acquire the data set from the memory, and randomly select one data from the data set;
  • the processor is further configured to select K-1 data from the data set according to the adjustment factor and the randomly selected data, where the randomly selected data and the K-1 data constitute the K initial cluster centers of the data set;
  • the processor is further configured to cluster N data in the data set according to the K initial cluster centers.
  • in the embodiment of the present invention, K-1 data are selected according to the adjustment factor corresponding to the maximum calculation amount and the randomly selected data, and the randomly selected data and the K-1 data constitute the K initial cluster centers of the data set. Therefore, the present invention can automatically adjust the clustering speed according to the maximum calculation amount, thereby improving clustering efficiency.
  • the clustering request further includes a training number and a data size of the data set
  • the processor is configured to determine, according to the maximum calculation amount, an adjustment factor corresponding to the maximum calculation amount, which may be implemented by the following steps:
  • the processor is further configured to determine, according to the training times, the data size, and the K, a center point initialization calculation amount and an iterative training calculation amount when clustering N data included in the data set;
  • the processor is further configured to determine an adjustment factor corresponding to the maximum calculation amount according to the central point initialization calculation amount, the iterative training calculation amount, and the maximum calculation amount.
  • since the adjustment factor corresponding to the maximum calculation amount is determined according to the training number, the data size, the K, and the maximum calculation amount of the data set, the adjustment factor is better matched to the data set and therefore more accurate, which can further improve clustering efficiency.
  • the processor is configured to select K-1 data from the data set according to the adjustment factor and the randomly selected data, which can be achieved by the following steps:
  • the processor is further configured to select M data from the data set according to the adjustment factor, where the M is an integer greater than K;
  • the processor is further configured to select, according to the randomly selected data and the M data, the K-1 data farthest from the randomly selected data from among the M data.
  • the K initial cluster centers are selected according to the adjustment factor, so that the selected K initial cluster centers are far apart from one another; as a result, when the data in the data set are clustered according to the K initial cluster centers, the iterative training requires less computation, which improves clustering efficiency.
  • the processor is configured to cluster the N data in the data set according to the K initial cluster centers, which can be achieved by the following steps:
  • the processor is further configured to determine K final cluster centers according to the K initial cluster centers and N data in the data set;
  • for any data in the data set, the processor is further configured to separately calculate the distance between that data and each of the K final cluster centers;
  • the processor is further configured to select, from the K final cluster centers, the final cluster center with the smallest distance to that data, and cluster that data into the category corresponding to the selected final cluster center.
  • the processor is configured to separately calculate the distance between any given data and each of the K final cluster centers, which can be achieved by the following steps:
  • when the data includes a text type field and a numeric type field, the processor is further configured to acquire each word segment included in the data;
  • the processor is further configured to separately calculate the weighted value of each word segment, and, according to the weighted value of each word segment and each of the K final cluster centers, separately calculate the distance between the data and each final cluster center.
  • when the data in the data set includes a text type field and a numeric type field, the distance between the data and the final cluster center is calculated according to the weighted values of the word segments included in the data, so the present invention can support cluster analysis of mixed text-numeric data.
  • the processor is further configured to select a category to be eliminated from the K categories, where the number P of data included in the category to be eliminated is greater than a preset number;
  • the processor is further configured to eliminate, from the data included in the category to be eliminated, the (P - the preset number) data farthest from the final cluster center of the category to be eliminated;
  • the processor is further configured to update the final cluster center of the category to be eliminated according to the data in that category other than the eliminated data.
  • a part of the data is eliminated from a category that includes too many data, and the updated final cluster center of the category is recalculated, so that the updated final cluster center is more accurate; this solves the sensitivity of the existing streaming K-Means clustering algorithm to the temporal order of the data.
  • in a second aspect, a data clustering method based on the K-Means algorithm is provided, the method being performed by a clustering server and used for clustering the N data included in a data set to be processed into K categories
  • the N is an integer greater than K
  • the K is a preset number of categories and is an integer greater than or equal to 2
  • each of the K categories corresponds to an initial cluster center
  • the method includes:
  • the clustering server receives a clustering request, the clustering request including a maximum amount of calculation, the K, and the data set;
  • the clustering server determines, according to the maximum calculation amount, an adjustment factor corresponding to the maximum calculation amount; the clustering server randomly selects one data from the data set;
  • the clustering server selects K-1 data from the data set according to the adjustment factor and the randomly selected data, and the randomly selected data and the K-1 data constitute the K initial cluster centers of the data set;
  • the clustering server clusters the N data in the data set according to the K initial cluster centers.
  • the clustering request further includes a training number and a data size of the data set
  • the clustering server determines, according to the maximum calculation amount, an adjustment factor corresponding to the maximum calculation amount, including:
  • the clustering server determines, according to the number of trainings, the data size, and the K, a center point initialization calculation amount and an iterative training calculation amount when clustering N data included in the data set;
  • the clustering server determines an adjustment factor corresponding to the maximum calculation amount according to the central point initialization calculation amount, the iterative training calculation amount, and the maximum calculation amount.
  • the clustering server selects K-1 data from the data set according to the adjustment factor and the randomly selected data, including:
  • the clustering server selects M data from the data set according to the adjustment factor, and the M is an integer greater than K;
  • the clustering server selects, according to the randomly selected data and the M data, the K-1 data farthest from the randomly selected data from among the M data.
  • the clustering server clusters the N data in the data set according to the K initial cluster centers, including:
  • the clustering server determines K final cluster centers according to the K initial cluster centers and N data in the data set;
  • for any data in the data set, the clustering server separately calculates the distance between that data and each of the K final cluster centers;
  • the clustering server selects, from the K final cluster centers, the final cluster center with the smallest distance to that data, and classifies that data into the category corresponding to the selected final cluster center.
  • the clustering server separately calculating the distance between any given data and each of the K final cluster centers includes:
  • when the data includes a text type field and a numeric type field, the clustering server acquires each word segment included in the data;
  • the clustering server separately calculates the weighted value of each word segment, and, according to the weighted value of each word segment and each of the K final cluster centers, separately calculates the distance between the data and each final cluster center.
  • the method further includes:
  • the clustering server selects a category to be eliminated from the K categories, where the number P of data included in the category to be eliminated is greater than a preset number;
  • the clustering server eliminates, from the data included in the category to be eliminated, the (P - the preset number) data farthest from the final cluster center of the category to be eliminated;
  • the clustering server updates the final cluster center of the category to be eliminated according to the data in that category other than the eliminated data.
  • in a third aspect, a data clustering apparatus is provided, the apparatus being applied in a clustering server and comprising at least one module for performing the clustering method provided by the second aspect.
  • the adjustment factor corresponding to the maximum calculation amount is determined according to the maximum calculation amount included in the clustering request; one data is randomly selected from the data set to be clustered, and K-1 data are selected from the data set according to the adjustment factor and the randomly selected data; the randomly selected data and the K-1 data constitute the K initial cluster centers of the data set, and the N data in the data set are clustered according to the K initial cluster centers. Since the K-1 initial cluster centers are selected from the data set according to the maximum calculation amount and the randomly selected data, the present invention can automatically adjust the clustering speed according to the maximum calculation amount, thereby improving clustering efficiency.
  • FIG. 1-1 is a schematic structural diagram of a data clustering system based on the K-Means algorithm according to an embodiment of the present invention;
  • FIG. 1-2 is a schematic structural diagram of a computing device according to an embodiment of the present invention;
  • FIG. 2-1 is a flowchart of a data clustering method based on the K-Means algorithm according to an embodiment of the present invention;
  • FIG. 2-2 is a schematic diagram of preprocessing data according to an embodiment of the present invention;
  • FIG. 2-3 is an effect diagram of clustering data according to an embodiment of the present invention;
  • FIG. 3-1 is a schematic structural diagram of a data clustering apparatus based on the K-Means algorithm according to an embodiment of the present invention;
  • FIG. 3-2 is a schematic structural diagram of a data clustering apparatus based on the K-Means algorithm according to an embodiment of the present invention.
  • when data are clustered based on the K-Means algorithm, K data are randomly selected from the data set as the initial cluster centers of K categories, and each data in the data set is then clustered into the category corresponding to its nearest initial cluster center;
  • based on the data included in the K categories, the new cluster centers of the K categories are then recalculated, and the data in the data set are reclassified, until for each of the K categories the distance between its two most recent cluster centers is within a preset distance. Since the initial cluster centers of the K categories are randomly selected, when K is large and/or the data set includes many data, the amount of calculation increases, resulting in low clustering efficiency.
  • in the embodiment of the present invention, the adjustment factor corresponding to the maximum calculation amount is determined according to the maximum calculation amount set at the time of clustering; one data is randomly selected from the data set to be clustered, and K-1 data are selected from the data set according to the adjustment factor and the randomly selected data; the randomly selected data and the K-1 data constitute the K initial cluster centers of the data set, and the data in the data set are clustered according to the K initial cluster centers. Since the K-1 initial cluster centers are selected from the data set according to the maximum calculation amount and the randomly selected data, the present invention can automatically adjust the clustering speed according to the maximum calculation amount, thereby improving clustering efficiency.
  • the embodiment of the present invention provides a data clustering system based on the K-Means algorithm.
  • referring to FIG. 1-1, the clustering system includes a terminal, a communication network, and a clustering server; the terminal is configured to send a clustering request to the clustering server through the communication network, where the clustering request includes a maximum calculation amount, a K, and a data set, and the data set includes N data to be clustered;
  • the clustering server is configured to receive the clustering request sent by the terminal through the communication network, cluster the N data included in the data set into K categories, and feed the clustering result back to the terminal through the communication network.
  • the embodiment of the invention provides a data clustering method based on the K-Means algorithm, which is executed by the clustering server to cluster the N data included in the data set to be processed into K categories, where N is an integer greater than K, K is a preset number of categories and is an integer greater than or equal to 2, and each of the K categories corresponds to an initial cluster center.
  • the clustering server may be implemented by a computing device.
  • the organizational structure of the computing device is shown in FIG. 1-2.
  • the computing device may include a communication interface 110, a processor 120, and a memory 130.
  • the communication interface 110 establishes communication connections with the processor 120 and the memory 130 respectively, and the processor 120 establishes a communication connection with the memory 130.
  • the communication interface 110 is configured to receive, through the communication network, the clustering request sent by the terminal, where the clustering request includes a maximum calculation amount, a K, and a data set.
  • the processor 120 can be a central processing unit (CPU).
  • the memory 130 is configured to store the maximum calculation amount, the K, and the data set included in the clustering request; the memory 130 includes a volatile memory, such as a random-access memory (RAM)
  • the memory may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD); the memory 130 may also include a combination of the above types of memories.
  • when the technical solution provided by the present application is implemented in software, program instructions for implementing the data clustering based on the K-Means algorithm provided in FIG. 1-2 of the present application are stored in the memory 130, and the processor 120 is configured to read the program instructions in the memory 130 to perform the following steps.
  • the communication interface 110 is configured to receive a clustering request, where the clustering request includes a maximum computing amount, a K, and a data set;
  • the memory 130 is configured to store a maximum calculation amount, a K, and a data set.
  • the processor 120 is configured to determine, according to a maximum calculation amount, an adjustment factor corresponding to the maximum calculation amount;
  • the processor 120 is further configured to: acquire a data set from the memory, and randomly select one data from the data set;
  • the processor 120 is further configured to select K-1 data from the data set according to the adjustment factor and the randomly selected data, and the randomly selected data and the K-1 data form K initial cluster centers of the data set;
  • the processor 120 is further configured to cluster the N data in the data set according to the K initial cluster centers.
  • the clustering request further includes a training number and a data size of the data set.
  • the processor 120 is configured to determine an adjustment factor corresponding to the maximum calculation amount according to the maximum calculation amount, which may be implemented by the following steps:
  • the processor 120 is further configured to determine, according to the training times, the data size, and the K, a center point initialization calculation amount and an iterative training calculation amount when clustering the N data included in the data set;
  • the processor 120 is further configured to determine an adjustment factor corresponding to the maximum calculation amount according to the center point initialization calculation amount, the iterative training calculation amount, and the maximum calculation amount.
  • the processor 120 is configured to select K-1 data from the data set according to the adjustment factor and the randomly selected data, which can be implemented by the following steps:
  • the processor 120 is further configured to select M data from the data set according to the adjustment factor, where M is an integer greater than K;
  • the processor 120 is further configured to select, according to the randomly selected data and the M data, the K-1 data farthest from the randomly selected data from among the M data.
  • the processor 120 is configured to cluster the N data in the data set according to the K initial cluster centers, which may be implemented by the following steps:
  • the processor 120 is further configured to determine K final cluster centers according to the K initial cluster centers and the N data in the data set;
  • for any data in the data set, the processor 120 is further configured to separately calculate the distance between that data and each of the K final cluster centers;
  • the processor 120 is further configured to select, from the K final cluster centers, the final cluster center with the smallest distance to that data, and cluster that data into the category corresponding to the selected final cluster center.
  • the processor 120 is configured to separately calculate the distance between any given data and each of the K final cluster centers, which may be implemented by the following steps:
  • when the data includes a text type field and a numeric type field, the processor 120 is further configured to acquire each word segment included in the data;
  • the processor 120 is further configured to separately calculate the weighted value of each word segment, and, according to the weighted value of each word segment and each of the K final cluster centers, separately calculate the distance between the data and each final cluster center.
  • the processor 120 is further configured to select a category to be eliminated from the K categories, where the number P of data included in the category to be eliminated is greater than a preset number;
  • the processor 120 is further configured to eliminate, from the data included in the category to be eliminated, the (P - the preset number) data farthest from the final cluster center of the category to be eliminated;
  • the processor 120 is further configured to update the final cluster center of the category to be eliminated according to the data in that category other than the eliminated data.
  • in the embodiment of the present invention, the adjustment factor corresponding to the maximum calculation amount is determined according to the maximum calculation amount included in the clustering request; one data is randomly selected from the data set to be clustered, and K-1 data are selected from the data set according to the adjustment factor and the randomly selected data; the randomly selected data and the K-1 data constitute the K initial cluster centers of the data set, and the N data in the data set are clustered according to the K initial cluster centers. Since the K-1 initial cluster centers are selected from the data set according to the maximum calculation amount and the randomly selected data, the present invention can automatically adjust the clustering speed according to the maximum calculation amount, thereby improving clustering efficiency.
  • the embodiment of the present invention provides a data clustering method based on the K-Means algorithm, which is executed by a clustering server and is used for clustering the N data included in a data set to be processed into K categories, where N is an integer greater than K, K is a preset number of categories and is an integer greater than or equal to 2, and each of the K categories corresponds to an initial cluster center.
  • referring to FIG. 2-1, the method includes:
  • Step 201: The clustering server receives a clustering request, where the clustering request includes a maximum calculation amount, a K, and a data set.
  • an optional implementation of step 201 is as follows:
  • when a user wants to cluster N data, the user can compose the N data into one data set and set the maximum time consumption for clustering the data set, that is, the maximum calculation amount, as well as the preset number of categories, that is, K.
  • the terminal corresponding to the user sends a clustering request to the clustering server through the communication network; the clustering request includes at least the maximum calculation amount, the K, and the data set, and may further include a training number and a data size of the data set.
  • each category corresponds to an initial cluster center; initial clustering is performed according to the initial cluster centers, and the cluster center of each category is then recalculated until the distance between each category's final cluster center and the cluster center obtained in the previous round is within a preset distance; the training number can be the number of training rounds needed to obtain the final cluster centers.
  • the data size is the number of data included in the data set, that is, the data size is N.
  • the clustering server receives the clustering request sent by the terminal through the communication interface, and obtains the maximum calculation amount, the K, and the data set from the clustering request; if the clustering request further includes the training number and the data size, the clustering server can also obtain the training number and the data size from the clustering request.
  • it should be noted that the terminal may input the data set included in the clustering request to the clustering server as a whole, and after receiving the data set the clustering server clusters all the data in the data set together; alternatively, the terminal may input the data in the data set to the clustering server one by one, and the clustering server starts clustering processing each time one data is received.
  • Step 202: The clustering server determines, according to the maximum calculation amount, an adjustment factor corresponding to the maximum calculation amount.
  • an optional implementation of step 202 is as follows:
  • this step is implemented through the following steps (1) and (2):
  • (1) the clustering server determines, according to the training number, the data size, and the K, a center point initialization calculation amount and an iterative training calculation amount for clustering the N data included in the data set.
  • the training number is the number of times the cluster centers are calculated, denoted B, where B is an integer greater than or equal to 2; the data size is the number of data included in the data set, denoted N, where N is an integer greater than K.
  • further, in this step an intermediate variable also needs to be set; this intermediate variable has no relationship with the adjustment factor corresponding to the maximum calculation amount. The intermediate variable may be t_dist, where t_dist represents the time required to compute the distances between any one data and the K cluster centers.
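  • Since t_dist is an empirically measured time rather than a derived quantity, one way to obtain it is a small timing loop such as the following sketch (the helper names are illustrative, not from the patent):

```python
import time

def estimate_t_dist(sample_point, centers, dist, trials=1000):
    """Estimate t_dist: the average time needed to compute the distances
    from one data point to all K cluster centers."""
    start = time.perf_counter()
    for _ in range(trials):
        for center in centers:
            dist(sample_point, center)
    return (time.perf_counter() - start) / trials
```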
  • the clustering server determines the center point initialization calculation amount for clustering the N data included in the data set according to the data size and the number of categories, using the following formula (1):
  • [formula (1) — published as an image in the original]
  • where T_init is the center point initialization calculation amount, the symbol published as an image in the original is the adjustment factor, t_dist is the intermediate variable, K is the preset number of categories, and N is the data size.
  • the iterative training is divided into two parts: calculating data categories and updating cluster centers. The clustering server determines the iterative training calculation amount for clustering the N data included in the data set according to the training number, the data size, and the number of categories, using the following formula (2):
  • T_itera = B × (T_classify × t_dist + T_update × t_mean) ≈ B × N × K × t_dist    (2)
  • where T_itera is the iterative training calculation amount, K is the preset number of categories, N is the data size, t_dist is the intermediate variable, and B is the training number.
  • (2) the clustering server determines the adjustment factor corresponding to the maximum calculation amount according to the center point initialization calculation amount, the iterative training calculation amount, and the maximum calculation amount.
  • based on extensive experiments, the maximum calculation amount can be expressed by the following formula (3):
  • T_tolerance ≈ 7600000 × t_dist    (3)
  • where T_tolerance is the maximum calculation amount and t_dist is the intermediate variable; in addition, T_tolerance = T_init + T_itera.
  • from the above, the adjustment factor corresponding to the maximum calculation amount can be derived, as shown in formula (4) (also published as an image in the original); that is, the clustering server can calculate the adjustment factor according to the training number, the data size, and the number of categories.
  • in this way, the maximum calculation amount can be set, and the clustering speed is then automatically adjusted according to the maximum calculation amount, which improves clustering efficiency, avoids unbounded growth of the calculation time for a single data set, and maintains the response speed of the overall algorithm.
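  • Formulas (1) and (4) are published only as images, so the exact adjustment-factor expression is not reproducible here; still, the stated relationships T_tolerance = T_init + T_itera and T_itera ≈ B × N × K × t_dist admit a sketch like the following, which assumes (our assumption, standing in for formula (1)) that initialization costs roughly M × K × t_dist when M candidates are sampled, and derives the sampling interval from the leftover budget.

```python
def adjustment_factor(t_tolerance, b, n, k, t_dist):
    """Sketch: derive a sampling interval from the computation budget.

    Uses T_itera ~ B*N*K*t_dist (formula (2)); the initialization cost model
    M*K*t_dist is an assumption, since formula (1) is not reproduced in the
    text, so the patent's actual formula (4) may differ.
    """
    t_itera = b * n * k * t_dist         # iterative training cost, formula (2)
    init_budget = t_tolerance - t_itera  # budget left for center initialization
    if init_budget <= 0:
        raise ValueError("budget too small for the requested training rounds")
    m = max(int(init_budget / (k * t_dist)), k + 1)  # affordable candidates, M > K
    return max(n // m, 1)                # sample one data point every `factor`
```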
  • it should be noted that the clustering server is any server implementing the K-Means algorithm; the terminal can be a smartphone, a tablet computer, a smart TV, an e-book reader, a multimedia player, a laptop computer, a desktop computer, and the like.
  • Step 203: The clustering server randomly selects one data from the data set, and selects K-1 data from the data set according to the adjustment factor and the randomly selected data; the randomly selected data and the K-1 data constitute the K initial cluster centers of the data set.
  • in the prior art, K data are randomly selected from the data set as the K initial cluster centers, so that when K is large and/or the data set includes many data, the amount of calculation increases, resulting in low clustering efficiency.
  • an optional implementation of step 203 is as follows:
  • the clustering server randomly selects one data from the data set and uses the randomly selected data as one initial cluster center; then, according to the adjustment factor and the randomly selected data, it selects K-1 data from the data set through the following steps (1) and (2), and uses the K-1 data as the other K-1 initial cluster centers.
  • (1) the clustering server selects M data from the data set according to the adjustment factor.
  • specifically, the clustering server selects one data from the data set at every interval equal to the adjustment factor, obtaining M data, so that the spacing between two adjacent selected data in the M data equals the adjustment factor.
  • for example, M data (M = 7) are selected from the data set: data 1, data 4, data 7, data 10, data 13, data 16, and data 19.
  • (2) the clustering server selects, based on the randomly selected data and the M data, the K-1 data farthest from the randomly selected data from among the M data.
  • specifically, the clustering server separately calculates the distance between each of the M data and the randomly selected data; according to the distance between each data and the randomly selected data, the K-1 data with the largest distances are selected from the M data.
  • in this way, the K initial cluster centers are relatively far apart, so that when the data in the data set are clustered according to the K initial cluster centers, the iterative training calculation amount is small, which improves clustering efficiency.
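  • A minimal sketch of this initialization as described above: sample one data point at every interval equal to the adjustment factor, then keep the K-1 sampled points farthest from the random seed (the names are illustrative, not the patent's).

```python
import random

def init_centers(data, k, factor, dist):
    """Pick K initial cluster centers: a random seed plus the K-1 sampled
    candidates farthest from it, sampling one candidate every `factor` items."""
    seed = random.choice(data)
    candidates = [c for c in data[::factor] if c != seed]  # the M candidates
    # Keep the K-1 candidates farthest from the random seed.
    farthest = sorted(candidates, key=lambda c: dist(seed, c), reverse=True)[:k - 1]
    return [seed] + farthest
```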
  • Step 204: The clustering server clusters the N data in the data set according to the K initial cluster centers.
  • an optional implementation of step 204 is as follows:
  • This step can be specifically implemented by the following steps (1) to (3), including:
  • (1) the clustering server determines K final cluster centers according to the K initial cluster centers and the N data in the data set.
  • This step can be implemented by the following steps (1-1) to (1-3), including:
  • (1-1) for each data except the K initial cluster centers in the data set, the clustering server calculates the distance between the data and each of the K initial cluster centers.
  • the clustering server calculates the distance between the data and each of the initial cluster centers in the K initial cluster centers according to the K-Means algorithm.
  • specifically, the clustering server converts the data into a multi-dimensional numerical vector, and calculates the distance between the multi-dimensional numerical vector and each of the K initial cluster centers according to the K-Means algorithm.
  • when the data includes a text type field, the clustering server acquires each word segment included in the data, separately calculates the weighted value of each word segment, and, according to the weighted value of each word segment and each of the K initial cluster centers, calculates the distance between the data and each initial cluster center by the K-Means algorithm.
  • the data may be segmented by any existing word segmentation algorithm to obtain each word segment included in the data.
  • the clustering server calculates the TF-IDF value of each word segment and takes the TF-IDF value of each word segment as its weighted value. Moreover, since the TF-IDF calculation relies on the data included in the data set, and the data set is built up by continuously loading data, the data set may change in real time; therefore, the TF-IDF value of a word segment cannot be calculated immediately after word segmentation, but needs to be calculated after the data set has been fully assembled.
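  • A small sketch of this weighting step, run only once the data set has finished loading, and assuming each data has already been segmented into a list of word segments (the textbook TF-IDF definition is used here; the patent does not pin down a variant):

```python
import math
from collections import Counter

def tfidf_weights(segmented_data):
    """segmented_data: one list of word segments per data. Returns, per data,
    a dict mapping each segment to its TF-IDF weighted value. Computed only
    after the data set is complete, since IDF depends on the whole set."""
    n_docs = len(segmented_data)
    doc_freq = Counter(seg for doc in segmented_data for seg in set(doc))
    weights = []
    for doc in segmented_data:
        tf = Counter(doc)
        weights.append({seg: (count / len(doc)) * math.log(n_docs / doc_freq[seg])
                        for seg, count in tf.items()})
    return weights
```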
  • each word segment participates in the cluster calculation as a single dimension.
  • in addition, a weight vector w needs to be maintained, with a weight set for each dimension; the weight of each dimension can be set and changed as needed.
  • the weight of each dimension is not specifically limited in the embodiment of the present invention.
  • generally, the weight of each numeric type field is set to 1, and the sum of the weights of all the word segments included in a text type field is 1; the weights of the word segments included in a text type field may be equal or unequal.
  • for example, consider a log containing three numeric type fields and one text type field, where the three numeric type fields are a first numeric type field, a second numeric type field, and a third numeric type field: the first, second, and third numeric type fields each have a weight of 1, and the text type field includes three word segments, namely a first text segment, a second text segment, and a third text segment, whose weights sum to 1 and are respectively 1/3, 1/3, and 1/3, as shown in FIG. 2-2.
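  • Written out, the weight vector w for such a log record could look as follows (the field and segment names are invented for illustration):

```python
# Weight vector for a log with three numeric type fields and one text type
# field whose three word segments share the text field's total weight of 1.
w = {
    "numeric_field_1": 1.0,
    "numeric_field_2": 1.0,
    "numeric_field_3": 1.0,
    "text_segment_1": 1.0 / 3,
    "text_segment_2": 1.0 / 3,
    "text_segment_3": 1.0 / 3,
}
```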
  • the clustering server calculating the distance between the data and each of the K initial cluster centers by the K-Means algorithm according to the weighted value of each word segment and each initial cluster center can also be:
  • the clustering server calculates the distance between the data and each of the K initial cluster centers by the K-Means algorithm according to the weighted value of each word segment, the weight vector, and each initial cluster center.
  • (1-2) the clustering server clusters the data into the category corresponding to the initial cluster center closest to the data, according to the distance between the data and each initial cluster center.
  • specifically, the clustering server selects the closest initial cluster center from the initial cluster centers according to the distance between the data and each initial cluster center, and clusters the data into the category of the selected initial cluster center.
  • (1-3) the clustering server recalculates the new cluster centers of the K categories according to the data included in the K categories, until the distance between the two most recent cluster centers of each of the K categories is within a preset distance, obtaining the K final cluster centers.
  • specifically, for each of the K categories, the clustering server calculates the average of the data included in the category as the new cluster center of the category, and calculates the distance between the new cluster center and the previous cluster center; if the distance is within a preset distance (referred to as the first preset distance for ease of differentiation), the new cluster center is taken as the final cluster center of the category.
  • One category corresponds to a final cluster center, which is used to cluster the data to be clustered.
  • if the distance is not within the first preset distance, steps (1-1) to (1-3) are re-executed until the distance between the two most recent cluster centers of each category is within the first preset distance.
  • the first preset distance may be set and changed according to requirements, and the first preset distance is not specifically limited in the embodiment of the present invention.
  • (2) for any data in the data set, the clustering server separately calculates the distance between that data and each of the K final cluster centers;
  • specifically, the clustering server calculates the distance between the data and each of the K final cluster centers according to the K-Means algorithm.
  • specifically, the clustering server converts the data into a multi-dimensional numerical vector, and calculates the distance between the multi-dimensional numerical vector and each of the K final cluster centers according to the K-Means algorithm.
  • when the data includes a text type field and a numeric type field, the clustering server acquires each word segment included in the data, separately calculates the weighted value of each word segment, and, according to the weighted value of each word segment and each final cluster center, calculates the distance between the data and each final cluster center.
  • specifically, the clustering server calculates the distance between the data and the final cluster center by the following formula (5), based on the weighted value of each word segment included in the data:
  • [formula (5) — published as an image in the original]
  • where D(l, c) is the distance between the data and the final cluster center, DF denotes a numeric type field, WF denotes a text type field, l is the data, c is the final cluster center, l(wf) is the weighted value of the word segment wf, w(fw) is the weight, and c(fw) is the value of the final cluster center for that word segment.
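  • Formula (5) itself is published only as an image, so the following sketch is one consistent reading of the listed symbols (numeric type fields DF compared directly, word segments compared through their weighted values against the center's per-segment values, everything scaled by the weight vector w), not the patent's exact definition:

```python
def mixed_distance(l, c, w, numeric_fields, text_segments):
    """Sketch of a weighted distance D(l, c) between a data l and a final
    cluster center c over mixed numeric/text fields; an assumed reading of
    formula (5), which is not reproduced in the text."""
    total = 0.0
    for df in numeric_fields:        # numeric type fields, compared directly
        total += w[df] * (l[df] - c[df]) ** 2
    for wf in text_segments:         # word segments of the text type fields
        l_wf = l.get(wf, 0.0)        # weighted (e.g., TF-IDF) value in l
        c_wf = c.get(wf, 0.0)        # the center's value for this segment
        total += w.get(wf, 0.0) * (l_wf - c_wf) ** 2
    return total ** 0.5
```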
  • (3) the clustering server selects, from the K final cluster centers, the final cluster center with the smallest distance to the data, and classifies the data into the category corresponding to the selected final cluster center.
  • after the clustering is completed, the clustering server generates a clustering result, which includes the data included in each category, and sends the clustering result to the terminal through the communication interface.
  • the terminal receives the clustering result sent by the clustering server, and displays the clustering result.
  • Figure 2-3 shows the effect of clustering the data by the clustering method provided by the embodiment of the present invention.
  • after the N data included in the data set are clustered into K categories, each of the K categories corresponds to one data subset, and each data subset includes part of the N data; the final cluster center of each category is updated according to the data included in the data subset corresponding to that category.
  • a subset of data can be used as a classifier to cluster the data to be clustered.
  • specifically, the clustering server calculates the average value of the data included in each data subset as the cluster center of the corresponding classifier; when the terminal needs to cluster a certain data, the terminal sends the data to be clustered to the clustering server through the communication interface, and the clustering server separately calculates the distance between the data to be clustered and the cluster center of each classifier, selects the classifier whose cluster center is closest to the data to be clustered, and clusters the data to be clustered into the selected classifier.
  • as data subsets of the same category are merged, the merged data subset of a category can become over-expanded, so an elimination mechanism is needed to limit the growth of the data subsets; moreover, the number of data included in different data subsets may vary greatly, and when the total data subset size is fixed, if the data subset of a certain category includes too many data, the growth of the data subsets of other categories is restricted. Therefore, the data included in each category can be pruned through the following steps 205 and 206.
  • Step 205: The clustering server selects a category to be eliminated from the K categories, where the number P of data included in the category to be eliminated is greater than a preset number.
  • an optional implementation of step 205 is as follows:
  • for ease of differentiation, this preset number is referred to as the first preset number; the first preset number may be set and changed as needed, and is not specifically limited in the embodiment of the present invention.
  • for example, the first preset number can be 100 or the like.
  • alternatively, the clustering server may select, from the K categories, a second preset number of categories including the most data as the categories to be eliminated.
  • the second preset number is an integer less than K, and the second preset number can be set and changed as needed; it is not specifically limited in the embodiment of the present invention.
  • for example, the second preset number may be 2 or 3, etc.
  • Step 206: The clustering server eliminates, from the data included in the category to be eliminated, the (P - the first preset number) data farthest from the final cluster center of the category to be eliminated.
  • an optional implementation of step 206 is as follows:
  • the clustering server separately calculates the distance between each of the P data included in the category to be eliminated and the final cluster center of the category to be eliminated; according to the distance between each data and the final cluster center of the category to be eliminated, the (P - the first preset number) farthest data are selected from the P data and eliminated from the category to be eliminated, so that the category retains the first preset number of data.
  • alternatively, the clustering server may eliminate, from the P data included in the category to be eliminated, the data whose distance from the final cluster center of the category to be eliminated exceeds a second preset distance.
  • the second preset distance may be set and changed as needed.
  • the second preset distance is not specifically limited in the embodiment of the present invention.
  • the clustering server may further select, from the P data included in the category to be eliminated, the P/m data with the largest distance from the final cluster center of the category to be eliminated, and eliminate the selected data from the P data.
  • m is an integer greater than 2, and m can be set and changed as needed.
  • m is not specifically limited in the embodiment of the present invention; for example, m may be 20 or 50 or the like.
  • in addition, the clustering server may recalculate the updated final cluster center of each category according to the data included in each category.
  • the clustering server can then eliminate, from the P data included in the category to be eliminated, the (P - the first preset number) data farthest from the updated final cluster center of the category to be eliminated.
  • Step 207: The clustering server updates the final cluster center of the category to be eliminated according to the data in the category other than the eliminated data.
  • an optional implementation of step 207 is as follows:
  • the clustering server calculates the average value of the data remaining in the category to be eliminated, that is, the data other than the eliminated data, and uses this average as the updated final cluster center of the category to be eliminated.
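  • A sketch of steps 206 and 207 together: keep only the preset number of data closest to the category's final cluster center, then recompute the center as the mean of what remains (the names are illustrative):

```python
def eliminate_and_update(members, center, preset_number, dist, mean):
    """If a category holds P > preset_number data, eliminate the
    P - preset_number members farthest from its final cluster center,
    then return the kept members and the updated center."""
    if len(members) <= preset_number:
        return members, center
    # Sort by distance to the final cluster center, nearest first.
    kept = sorted(members, key=lambda x: dist(x, center))[:preset_number]
    return kept, mean(kept)  # updated final cluster center
```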
  • subsequently, when the clustering server receives data to be clustered, it clusters the data according to the data to be clustered and the updated final cluster centers, thereby improving the clustering accuracy.
  • in the embodiment of the present invention, the adjustment factor corresponding to the maximum calculation amount is determined according to the maximum calculation amount included in the clustering request; one data is randomly selected from the data set to be clustered, and K-1 data are selected from the data set according to the adjustment factor and the randomly selected data; the randomly selected data and the K-1 data constitute the K initial cluster centers of the data set, and the N data in the data set are clustered according to the K initial cluster centers. Since the K-1 initial cluster centers are selected from the data set according to the maximum calculation amount and the randomly selected data, the present invention can automatically adjust the clustering speed according to the maximum calculation amount, thereby improving clustering efficiency.
  • the embodiment of the invention further provides a data clustering apparatus, which can be implemented by the computing device shown in FIG. 1-2, by an application-specific integrated circuit (ASIC), or by a programmable logic device (PLD).
  • the PLD may be a complex programmable logic device (CPLD), an FPGA, a generic array logic (GAL), or any combination thereof.
  • the data clustering apparatus is used to implement the data clustering method based on the K-Means algorithm shown in FIG. 2-1.
  • the data clustering apparatus based on the K-Means algorithm may also be a software module.
  • a schematic diagram of the organizational structure of the data clustering apparatus is shown in FIG. 3-1, and the apparatus includes a receiving module 301, a determining module 302, a selecting module 303, and a clustering module 304.
  • when the receiving module 301 operates, step 201 of the data clustering method based on the K-Means algorithm shown in FIG. 2-1 is executed; when the determining module 302 operates, step 202 is executed; when the selecting module 303 operates, step 203 is executed; and when the clustering module 304 operates, step 204 and its alternatives are executed.
  • the data clustering apparatus may further include a culling module 305 and a computing module 306.
  • when the culling module 305 operates, step 206 and its alternatives of the data clustering method based on the K-Means algorithm shown in FIG. 2-1 are executed; when the computing module 306 operates, step 207 and its alternatives are executed.
  • in the embodiment of the present invention, the adjustment factor corresponding to the maximum calculation amount is determined according to the maximum calculation amount included in the clustering request; one data is randomly selected from the data set to be clustered, and K-1 data are selected from the data set according to the adjustment factor and the randomly selected data; the randomly selected data and the K-1 data constitute the K initial cluster centers of the data set, and the N data in the data set are clustered according to the K initial cluster centers. Since the K-1 initial cluster centers are selected from the data set according to the maximum calculation amount and the randomly selected data, the present invention can automatically adjust the clustering speed according to the maximum calculation amount, thereby improving clustering efficiency.
  • it should be noted that the data clustering apparatus provided by the foregoing embodiment is illustrated only by the division of the above functional modules; in actual applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above.
  • the data clustering apparatus provided by the foregoing embodiment belongs to the same concept as the embodiment of the data clustering method based on the K-Means algorithm; its specific implementation process is described in detail in the method embodiment and is not repeated here.
  • a person skilled in the art may understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing related hardware; the program may be stored in a computer-readable storage medium.
  • the storage medium mentioned may be a read-only memory, a magnetic disk, an optical disk, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data clustering method and apparatus based on the K-Means algorithm, belonging to the field of computer technology. The method includes: a clustering server receives a clustering request, the clustering request including a maximum calculation amount, K, and a data set (201); the clustering server determines, according to the maximum calculation amount, an adjustment factor corresponding to the maximum calculation amount (202); the clustering server randomly selects one data from the data set, and selects K-1 data from the data set according to the adjustment factor and the randomly selected data, the randomly selected data and the K-1 data constituting K initial cluster centers of the data set (203); the clustering server clusters the N data in the data set according to the K initial cluster centers (204). The method can automatically adjust the clustering speed according to the maximum calculation amount, thereby improving clustering efficiency.

Description

Data clustering method and apparatus based on K-Means algorithm
Technical Field
The present invention relates to the field of computer technology, and in particular, to a data clustering method and apparatus based on the K-Means algorithm.
Background
The K-Means algorithm is the classic distance-based clustering algorithm. It uses distance as the measure of similarity: the closer two objects are, the more similar they are considered to be.
The process of clustering data based on the K-Means algorithm may be as follows: group the data to be classified into a data set and specify the number K of categories to be formed; randomly select K data from the data set as the initial cluster centers of the K categories; for each data in the data set other than the K initial cluster centers, calculate the distance between that data and each of the K initial cluster centers, and assign the data to the category corresponding to the nearest initial cluster center; then recalculate the new cluster centers of the K categories according to the data included in the K categories, and reclassify the data in the data set, repeating until, for every category, the distance between its two most recent cluster centers is within a preset distance.
The prior art has at least the following technical problem:
Since the initial cluster centers of the K categories are randomly selected, when K is large and/or the data set includes many data, the amount of calculation increases, resulting in low clustering efficiency.
Summary
In order to solve the problems of the prior art, the present invention provides a data clustering method and apparatus based on the K-Means algorithm. The technical solutions are as follows:
In a first aspect of the present invention, a computing device is provided. The computing device is configured to cluster N data included in a data set to be processed into K categories, where N is an integer greater than K, K is a preset number of categories and is an integer greater than or equal to 2, and each of the K categories corresponds to an initial cluster center. The computing device includes a communication interface, a processor, and a memory; the communication interface establishes communication connections with the processor and the memory respectively, and the processor establishes a communication connection with the memory;
the communication interface is configured to receive a clustering request, the clustering request including a maximum calculation amount, the K, and the data set;
the memory is configured to store the maximum calculation amount, the K, and the data set;
the memory is further configured to store program instructions;
the processor is configured to read the program instructions in the memory to perform the following steps;
the processor is further configured to determine, according to the maximum calculation amount, an adjustment factor corresponding to the maximum calculation amount;
the processor is further configured to acquire the data set from the memory and randomly select one data from the data set;
the processor is further configured to select K-1 data from the data set according to the adjustment factor and the randomly selected data, the randomly selected data and the K-1 data constituting K initial cluster centers of the data set;
the processor is further configured to cluster the N data in the data set according to the K initial cluster centers.
In the embodiment of the present invention, since K-1 data are selected according to the adjustment factor corresponding to the maximum calculation amount and the randomly selected data, and the randomly selected data and the K-1 data constitute the K initial cluster centers of the data set, the present invention can automatically adjust the clustering speed according to the maximum calculation amount, thereby improving clustering efficiency.
With reference to the first aspect, in a first implementation of the first aspect, the clustering request further includes a training number and a data size of the data set;
the processor being configured to determine, according to the maximum calculation amount, the adjustment factor corresponding to the maximum calculation amount may specifically be achieved by the following steps:
the processor is further configured to determine, according to the training number, the data size, and the K, a center point initialization calculation amount and an iterative training calculation amount for clustering the N data included in the data set;
the processor is further configured to determine the adjustment factor corresponding to the maximum calculation amount according to the center point initialization calculation amount, the iterative training calculation amount, and the maximum calculation amount.
In the embodiment of the present invention, since the adjustment factor corresponding to the maximum calculation amount is determined according to the training number, the data size, the K, and the maximum calculation amount of the data set, the adjustment factor is better matched to the data set and therefore more accurate, which can further improve clustering efficiency.
With reference to the first aspect or the first implementation of the first aspect, in a second implementation of the first aspect, the processor being configured to select K-1 data from the data set according to the adjustment factor and the randomly selected data may specifically be achieved by the following steps:
the processor is further configured to select M data from the data set according to the adjustment factor, where M is an integer greater than K;
the processor is further configured to select, according to the randomly selected data and the M data, the K-1 data farthest from the randomly selected data from among the M data.
In the embodiment of the present invention, the K initial cluster centers are selected according to the adjustment factor, so that the selected K initial cluster centers are far apart from one another; as a result, when the data in the data set are clustered according to the K initial cluster centers, the iterative training requires less computation, which improves clustering efficiency.
With reference to the first aspect or any implementation of the first aspect, in a third implementation of the first aspect, the processor being configured to cluster the N data in the data set according to the K initial cluster centers may specifically be achieved by the following steps:
the processor is further configured to determine K final cluster centers according to the K initial cluster centers and the N data in the data set;
for any data in the data set, the processor is further configured to separately calculate the distance between that data and each of the K final cluster centers;
the processor is further configured to select, from the K final cluster centers, the final cluster center with the smallest distance to that data, and cluster that data into the category corresponding to the selected final cluster center.
With reference to the first aspect or any implementation of the first aspect, in a fourth implementation of the first aspect, the processor being configured to separately calculate the distance between that data and each of the K final cluster centers may specifically be achieved by the following steps:
when the data includes a text type field and a numeric type field, the processor is further configured to acquire each word segment included in the data;
the processor is further configured to separately calculate the weighted value of each word segment, and, according to the weighted value of each word segment and each of the K final cluster centers, separately calculate the distance between the data and each final cluster center.
In the embodiment of the present invention, when the data in the data set includes a text type field and a numeric type field, the distance between the data and the final cluster center is calculated according to the weighted values of the word segments included in the data, so the present invention can support cluster analysis of mixed text-numeric data.
With reference to the first aspect or any implementation of the first aspect, in a fifth implementation of the first aspect, the processor is further configured to select a category to be eliminated from the K categories, where the number P of data included in the category to be eliminated is greater than a preset number;
the processor is further configured to eliminate, from the data included in the category to be eliminated, the (P - the preset number) data farthest from the final cluster center of the category to be eliminated;
the processor is further configured to update the final cluster center of the category to be eliminated according to the data in that category other than the eliminated data.
In the embodiment of the present invention, a part of the data is eliminated from a category that includes too many data, and the updated final cluster center of the category is recalculated, so that the updated final cluster center is more accurate; this solves the sensitivity of the existing streaming K-Means clustering algorithm to the temporal order of the data.
In a second aspect of the present invention, a data clustering method based on the K-Means algorithm is provided. The method is performed by a clustering server and is used for clustering N data included in a data set to be processed into K categories, where N is an integer greater than K, K is a preset number of categories and is an integer greater than or equal to 2, and each of the K categories corresponds to an initial cluster center. The method includes:
the clustering server receives a clustering request, the clustering request including a maximum calculation amount, the K, and the data set;
the clustering server determines, according to the maximum calculation amount, an adjustment factor corresponding to the maximum calculation amount;
the clustering server randomly selects one data from the data set;
the clustering server selects K-1 data from the data set according to the adjustment factor and the randomly selected data, the randomly selected data and the K-1 data constituting K initial cluster centers of the data set;
the clustering server clusters the N data in the data set according to the K initial cluster centers.
With reference to the second aspect, in a first implementation of the second aspect, the clustering request further includes a training number and a data size of the data set;
the clustering server determining, according to the maximum calculation amount, the adjustment factor corresponding to the maximum calculation amount includes:
the clustering server determines, according to the training number, the data size, and the K, a center point initialization calculation amount and an iterative training calculation amount for clustering the N data included in the data set;
the clustering server determines the adjustment factor corresponding to the maximum calculation amount according to the center point initialization calculation amount, the iterative training calculation amount, and the maximum calculation amount.
With reference to the second aspect or the first implementation of the second aspect, in a second implementation of the second aspect, the clustering server selecting K-1 data from the data set according to the adjustment factor and the randomly selected data includes:
the clustering server selects M data from the data set according to the adjustment factor, where M is an integer greater than K;
the clustering server selects, according to the randomly selected data and the M data, the K-1 data farthest from the randomly selected data from among the M data.
With reference to the second aspect or any implementation of the second aspect, in a third implementation of the second aspect, the clustering server clustering the N data in the data set according to the K initial cluster centers includes:
the clustering server determines K final cluster centers according to the K initial cluster centers and the N data in the data set;
for any data in the data set, the clustering server separately calculates the distance between that data and each of the K final cluster centers;
the clustering server selects, from the K final cluster centers, the final cluster center with the smallest distance to that data, and classifies that data into the category corresponding to the selected final cluster center.
With reference to the second aspect or any implementation of the second aspect, in a fourth implementation of the second aspect, the clustering server separately calculating the distance between that data and each of the K final cluster centers includes:
when the data includes a text type field and a numeric type field, the clustering server acquires each word segment included in the data;
the clustering server separately calculates the weighted value of each word segment, and, according to the weighted value of each word segment and each of the K final cluster centers, separately calculates the distance between the data and each final cluster center.
With reference to the second aspect or any implementation of the second aspect, in a fifth implementation of the second aspect, the method further includes:
the clustering server selects a category to be eliminated from the K categories, where the number P of data included in the category to be eliminated is greater than a preset number;
the clustering server eliminates, from the data included in the category to be eliminated, the (P - the preset number) data farthest from the final cluster center of the category to be eliminated;
the clustering server updates the final cluster center of the category to be eliminated according to the data in that category other than the eliminated data.
In a third aspect of the present invention, a data clustering apparatus is provided. The apparatus is applied in a clustering server, and the apparatus includes at least one module for performing the clustering method provided by the second aspect.
In the embodiment of the present invention, the adjustment factor corresponding to the maximum calculation amount is determined according to the maximum calculation amount included in the clustering request; one data is randomly selected from the data set to be clustered, and K-1 data are selected from the data set according to the adjustment factor and the randomly selected data; the randomly selected data and the K-1 data constitute the K initial cluster centers of the data set, and the N data in the data set are clustered according to the K initial cluster centers. Since the K-1 initial cluster centers are selected from the data set according to the maximum calculation amount and the randomly selected data, the present invention can automatically adjust the clustering speed according to the maximum calculation amount, thereby improving clustering efficiency.
Brief Description of the Drawings
FIG. 1-1 is a schematic structural diagram of a data clustering system based on the K-Means algorithm according to an embodiment of the present invention;
FIG. 1-2 is a schematic structural diagram of a computing device according to an embodiment of the present invention;
FIG. 2-1 is a flowchart of a data clustering method based on the K-Means algorithm according to an embodiment of the present invention;
FIG. 2-2 is a schematic diagram of preprocessing data according to an embodiment of the present invention;
FIG. 2-3 is an effect diagram of clustering data according to an embodiment of the present invention;
FIG. 3-1 is a schematic structural diagram of a data clustering apparatus based on the K-Means algorithm according to an embodiment of the present invention;
FIG. 3-2 is a schematic structural diagram of a data clustering apparatus based on the K-Means algorithm according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
When data are clustered based on the K-Means algorithm, K data are randomly selected from the data set as the initial cluster centers of K categories; each data in the data set is then clustered into the category corresponding to its nearest initial cluster center; the new cluster centers of the K categories are then recalculated according to the data included in the K categories, and the data in the data set are reclassified, until for each of the K categories the distance between its two most recent cluster centers is within a preset distance. Since the initial cluster centers of the K categories are randomly selected, when K is large and/or the data set includes many data, the amount of calculation increases, resulting in low clustering efficiency.
In the embodiment of the present invention, the adjustment factor corresponding to the maximum calculation amount is determined according to the maximum calculation amount set at the time of clustering; one data is randomly selected from the data set to be clustered, and K-1 data are selected from the data set according to the adjustment factor and the randomly selected data; the randomly selected data and the K-1 data constitute the K initial cluster centers of the data set, and the data in the data set are clustered according to the K initial cluster centers. Since the K-1 initial cluster centers are selected from the data set according to the maximum calculation amount and the randomly selected data, the present invention can automatically adjust the clustering speed according to the maximum calculation amount, thereby improving clustering efficiency.
The embodiment of the present invention provides a data clustering system based on the K-Means algorithm. Referring to FIG. 1-1, the clustering system includes a terminal, a communication network, and a clustering server. The terminal is configured to send a clustering request to the clustering server through the communication network, where the clustering request includes a maximum calculation amount, a K, and a data set, and the data set includes N data to be clustered. The clustering server is configured to receive the clustering request sent by the terminal through the communication network, cluster the N data included in the data set into K categories, and feed the clustering result back to the terminal through the communication network.
An embodiment of the present invention provides a data clustering method based on the K-Means algorithm. The method is performed by a cluster server, and is used to cluster N data included in a to-be-processed data set into K categories, where N is an integer greater than K, K is a preset number of categories and is an integer greater than or equal to 2, and each of the K categories corresponds to an initial cluster center.
The cluster server may be implemented by a computing device. A schematic structural diagram of the computing device is shown in FIG. 1-2. The computing device may include a communication interface 110, a processor 120, and a memory 130. The communication interface 110 establishes communication connections with the processor 120 and the memory 130 respectively, and the processor 120 establishes a communication connection with the memory 130.
The communication interface 110 is configured to receive, through a communication network, a clustering request sent by a terminal, where the clustering request includes a maximum calculation amount, K, and a data set.
The processor 120 may be a central processing unit (CPU).
The memory 130 is configured to store the maximum calculation amount, the K, and the data set that are included in the clustering request. The memory 130 includes a volatile memory, for example, a random-access memory (RAM). The memory may also include a non-volatile memory, for example, a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD). The memory 130 may further include a combination of the foregoing types of memory. When the technical solutions provided in this application are implemented by software, program instructions for implementing the data clustering based on the K-Means algorithm provided in this application are stored in the memory 130, and the processor 120 is configured to read the program instructions in the memory 130 to perform the following steps.
The communication interface 110 is configured to receive a clustering request, where the clustering request includes the maximum calculation amount, the K, and the data set.
The memory 130 is configured to store the maximum calculation amount, the K, and the data set.
The processor 120 is configured to determine, according to the maximum calculation amount, an adjustment factor corresponding to the maximum calculation amount.
The processor 120 is further configured to obtain the data set from the memory, and randomly select one data from the data set.
The processor 120 is further configured to select K-1 data from the data set according to the adjustment factor and the randomly selected data, where the randomly selected data and the K-1 data constitute K initial cluster centers of the data set.
The processor 120 is further configured to cluster the N data in the data set according to the K initial cluster centers.
The clustering request further includes a training count and a data size of the data set. Correspondingly, that the processor 120 is configured to determine, according to the maximum calculation amount, the adjustment factor corresponding to the maximum calculation amount may be specifically implemented by the following steps:
The processor 120 is further configured to determine, according to the training count, the data size, and the K, a center point initialization calculation amount and an iterative training calculation amount for clustering the N data included in the data set.
The processor 120 is further configured to determine, according to the center point initialization calculation amount, the iterative training calculation amount, and the maximum calculation amount, the adjustment factor corresponding to the maximum calculation amount.
That the processor 120 is configured to select the K-1 data from the data set according to the adjustment factor and the randomly selected data may be specifically implemented by the following steps:
The processor 120 is further configured to select M data from the data set according to the adjustment factor, where M is an integer greater than K.
The processor 120 is further configured to select, according to the randomly selected data and the M data, the K-1 data farthest from the randomly selected data from the M data.
That the processor 120 is configured to cluster the N data in the data set according to the K initial cluster centers may be specifically implemented by the following steps:
The processor 120 is further configured to determine K final cluster centers according to the K initial cluster centers and the N data in the data set.
For any data in the data set, the processor 120 is further configured to separately calculate the distance between the any data and each of the K final cluster centers.
The processor 120 is further configured to select, from the K final cluster centers, the final cluster center with the smallest distance to the any data, and cluster the any data into the category corresponding to the selected final cluster center.
That the processor 120 is configured to separately calculate the distance between the any data and each of the K final cluster centers may be specifically implemented by the following steps:
When the any data includes a text type field and a numeric type field, the processor 120 is further configured to obtain each word segment included in the any data.
The processor 120 is further configured to separately calculate the weighted value of each word segment, and separately calculate, according to the weighted value of each word segment and each of the K final cluster centers, the distance between the any data and each final cluster center.
The processor 120 is further configured to select a to-be-eliminated category from the K categories, where the quantity P of data included in the to-be-eliminated category is greater than a preset number.
The processor 120 is further configured to eliminate, from the data included in the to-be-eliminated category, the (P - preset number) data farthest from the final cluster center of the to-be-eliminated category.
The processor 120 is further configured to update the final cluster center of the to-be-eliminated category according to the data in the to-be-eliminated category other than the eliminated data.
In this embodiment of the present invention, the adjustment factor corresponding to the maximum calculation amount included in the clustering request is determined according to the maximum calculation amount; one data is randomly selected from the to-be-clustered data set; K-1 data are selected from the data set according to the adjustment factor and the randomly selected data, where the randomly selected data and the K-1 data constitute the K initial cluster centers of the data set; and the N data in the data set are clustered according to the K initial cluster centers. Because the K-1 initial cluster centers are selected from the data set according to the maximum calculation amount and the randomly selected data, the present invention can automatically adjust the amount of clustering computation according to the maximum calculation amount, thereby improving clustering efficiency.
An embodiment of the present invention provides a data clustering method based on the K-Means algorithm. The method is performed by a cluster server, and is used to cluster N data included in a to-be-processed data set into K categories, where N is an integer greater than K, K is a preset number of categories and is an integer greater than or equal to 2, and each of the K categories corresponds to an initial cluster center.
Referring to FIG. 2-1, the method includes the following steps.
Step 201: The cluster server receives a clustering request, where the clustering request includes a maximum calculation amount, K, and a data set.
An optional solution of step 201 is as follows:
When a user wants to cluster N data, the user may group the N data into a data set, and set a maximum time consumption for clustering the data set, that is, the maximum calculation amount, as well as a preset number of categories, that is, K.
The terminal corresponding to the user sends a clustering request to the cluster server through a communication network. The clustering request includes at least the maximum calculation amount, the K, and the data set, and may further include a training count and a data size of the data set.
Each category corresponds to an initial cluster center. Initial clustering is performed according to the initial cluster centers, and the cluster center of each category is then recalculated, until the distance between the final cluster center of each category and the adjacent previously obtained cluster center is within a preset distance. The training count may be the number of training iterations performed to obtain the final cluster centers.
The data size is the number of data included in the data set, that is, the data size is N.
The cluster server receives, through the communication interface, the clustering request sent by the terminal, and obtains the maximum calculation amount, the K, and the data set from the clustering request. If the clustering request further includes the training count and the data size, the cluster server may also obtain the training count and the data size from the clustering request.
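As an illustration, the clustering request could be represented as follows (a hypothetical Python representation; the field names and values are assumptions and are not specified by the patent):

    # A hypothetical shape for the clustering request of step 201.
    cluster_request = {
        "max_calculation_amount": 7_600_000,   # maximum calculation amount
        "k": 50,                               # preset number of categories
        "training_count": 5000,                # optional: training count B
        "data_size": 100_000,                  # optional: data size N
        "data_set": [(0.1, 0.2), (0.3, 0.4)],  # the N to-be-clustered data
    }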
It should be noted that the terminal may input the data set included in the clustering request to the cluster server as a whole, and after receiving the data set, the cluster server clusters the data in the data set at one time; alternatively, the terminal may input the data in the data set to the cluster server one by one, and the cluster server starts clustering processing each time one data is received.
Step 202: The cluster server determines, according to the maximum calculation amount, an adjustment factor corresponding to the maximum calculation amount.
An optional solution of step 202 is as follows:
This step is implemented by the following steps (1) and (2):
(1): The cluster server determines, according to the training count, the data size, and the K, a center point initialization calculation amount and an iterative training calculation amount for clustering the N data included in the data set.
The training count is the number of times the cluster centers are calculated (B), where B is an integer greater than or equal to 2; the data size is the number of data included in the data set (N), where N is an integer greater than K.
Further, an intermediate variable needs to be set in this step. The intermediate variable is unrelated to the adjustment factor corresponding to the maximum calculation amount. The intermediate variable may be tdist, where tdist denotes the time required to calculate the distance between any one data and any one of the K cluster centers.
The cluster server determines, according to the data size and the number of cluster centers, the center point initialization calculation amount for clustering the N data included in the data set according to the following formula (1):
Tinit ≈ (N/δ)×K×tdist    (1)
where Tinit is the center point initialization calculation amount, δ is the adjustment factor, tdist is the intermediate variable, K is the preset number of categories, and N is the data size.
The iterative training consists of two parts: calculating data categories and updating cluster centers. The cluster server determines, according to the training count, the data size, and the number of cluster centers, the iterative training calculation amount for clustering the N data included in the data set according to the following formula (2):
Titera = B×(Tclassify×tdist + Tupdate×tmean) ≈ B×N×K×tdist    (2)
where Titera is the iterative training calculation amount, Tclassify and Tupdate denote the calculation amounts of the category calculation part and the center update part respectively, tmean denotes the time required for one mean calculation, K is the preset number of categories, N is the data size, tdist is the intermediate variable, and B is the training count.
(2): The cluster server determines, according to the center point initialization calculation amount, the iterative training calculation amount, and the maximum calculation amount, the adjustment factor corresponding to the maximum calculation amount.
A large number of experiments show that the maximum calculation amount can be expressed by the following formula (3):
Ttolerance ≈ 7600000×tdist    (3)
where Ttolerance is the maximum calculation amount, and tdist is the intermediate variable.
The sum of the center point initialization calculation amount and the iterative training calculation amount is the maximum calculation amount, that is, Ttolerance = Tinit + Titera. Therefore, given the maximum calculation amount, the adjustment factor corresponding to the maximum calculation amount can be derived, as shown in the following formula (4):
δ ≈ (N×K×tdist)/(Ttolerance − B×N×K×tdist) = N×K/(7600000 − B×N×K)    (4)
It can be seen that the cluster server can calculate the adjustment factor according to the training count, the data size, and the number of cluster centers.
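A minimal sketch of this derivation, assuming the reconstructed forms of formulas (1) to (4) above (the function name and argument order are illustrative):

    def adjustment_factor(b, n, k, t_tolerance=7_600_000):
        # All amounts are measured in units of tdist, per formula (3).
        t_itera = b * n * k                 # formula (2): B x N x K x tdist
        t_init = t_tolerance - t_itera      # since Ttolerance = Tinit + Titera
        return n * k / t_init               # formula (1) solved for the factor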
Experimental analysis shows that clustering efficiency is highest when k=50, b=5000, and n=1; in that case, the adjustment factor is 2.
In this embodiment of the present invention, a maximum calculation amount can be set, and the clustering speed is then automatically adjusted according to the maximum calculation amount. This improves clustering efficiency, prevents the calculation time of a single data set from growing without bound, and maintains the response speed of the overall algorithm.
The cluster server is any server provided with the K-Means algorithm. The terminal may be a smartphone, a tablet computer, a smart TV, an e-book reader, a multimedia player, a laptop computer, a desktop computer, or the like.
Step 203: The cluster server randomly selects one data from the data set, and selects K-1 data from the data set according to the adjustment factor and the randomly selected data, where the randomly selected data and the K-1 data constitute K initial cluster centers of the data set.
In the prior art, K data are directly randomly selected from the data set as the K initial cluster centers. In this case, when K is relatively large and/or the data set includes relatively much data, the calculation amount increases, resulting in low clustering efficiency.
An optional solution of step 203 is as follows:
In this embodiment of the present invention, the cluster server randomly selects one data from the data set and uses the randomly selected data as one initial cluster center; then, according to the adjustment factor and the randomly selected data, the cluster server selects K-1 data from the data set by using the following steps (1) and (2), and uses the K-1 data as the remaining K-1 initial cluster centers.
(1): The cluster server selects M data from the data set according to the adjustment factor.
The cluster server selects one data from the data set at every interval of the adjustment factor to obtain M data, where the interval between two adjacent data in the M data is the adjustment factor.
For example, if the adjustment factor is 2 and the data set includes 20 data, namely data 1 to 20, then M data (M is 7) are selected from the data set: data 1, data 4, data 7, data 10, data 13, data 16, and data 19.
(2): The cluster server selects, according to the randomly selected data and the M data, the K-1 data farthest from the randomly selected data from the M data.
Specifically, the cluster server separately calculates, according to the randomly selected data and the M data, the distance between each of the M data and the randomly selected data, and then selects, according to these distances, the K-1 data with the largest distances from the M data.
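The selection of the K initial cluster centers in step 203 can be sketched as follows (Python, reusing the dist helper from the first sketch; the stride of factor + 1 matches the example above, where a factor of 2 selects data 1, 4, 7, and so on):

    import random

    def init_centers(data, k, factor):
        # Step (1): select one data at every interval of the adjustment
        # factor, i.e. skip `factor` data between consecutive picks.
        m_data = data[::factor + 1]
        # Randomly select one data as the first initial cluster center.
        seed = random.choice(data)
        # Step (2): among the M sampled data, keep the K-1 farthest from
        # the randomly selected data.
        farthest = sorted(m_data, key=lambda x: dist(x, seed),
                          reverse=True)[:k - 1]
        return [seed] + farthest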
In this embodiment of the present invention, the K initial cluster centers are relatively far from each other. Therefore, when the data in the data set is clustered according to the K initial cluster centers, the iterative training calculation amount is relatively small, improving clustering efficiency.
Step 204: The cluster server clusters the N data in the data set according to the K initial cluster centers.
An optional solution of step 204 is as follows:
This step may be specifically implemented by the following steps (1) to (3):
(1): The cluster server determines K final cluster centers according to the K initial cluster centers and the N data in the data set.
This step may be implemented by the following steps (1-1) to (1-3):
(1-1): For each data in the data set other than the K initial cluster centers, the cluster server separately calculates the distance between the data and each of the K initial cluster centers.
When the data includes only numeric type fields, the cluster server calculates the distance between the data and each of the K initial cluster centers according to the K-Means algorithm.
When the data includes only text type fields, the cluster server converts the data into a multidimensional numeric vector, and calculates the distance between the multidimensional numeric vector and each of the K initial cluster centers according to the K-Means algorithm.
When the data includes both text type fields and numeric type fields, the cluster server obtains each word segment included in the data, separately calculates the weighted value of each word segment, and separately calculates, according to the weighted value of each word segment and each of the K initial cluster centers, the distance between the data and each initial cluster center by using the K-Means algorithm.
It should be noted that, in this embodiment of the present invention, the data may be segmented by using any existing word segmentation algorithm to obtain each word segment included in the data.
The cluster server separately calculates the TF-IDF value of each word segment, and uses the TF-IDF value of each word segment as its weighted value. Moreover, because the TF-IDF calculation depends on the data included in the data set, and the data included in the data set is built up by continuously loading data, the data set may change in real time. Therefore, the TF-IDF value of a word segment cannot be calculated immediately after word segmentation; the calculation can be performed only after the aggregation of the data set is complete.
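A minimal sketch of this deferred TF-IDF weighting, assuming one common TF-IDF variant (the patent does not fix the exact formula, and the function name is illustrative):

    import math
    from collections import Counter

    def tfidf_weights(segmented_docs):
        # segmented_docs: one word-segment list per data record; callable
        # only after the data set has finished aggregating, since the IDF
        # term depends on the whole set.
        n = len(segmented_docs)
        df = Counter(t for doc in segmented_docs for t in set(doc))
        weighted = []
        for doc in segmented_docs:
            tf = Counter(doc)
            weighted.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                             for t in tf})
        return weighted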
Further, each word segment participates in the clustering calculation as a separate dimension. To ensure that each dimension has the same magnitude of influence on the clustering result, a weight vector w needs to be maintained to set a weight for each dimension. The weight of each dimension can be set and changed as required, and is not specifically limited in this embodiment of the present invention. For example, the weight of each numeric type field may be set to 1, the sum of the weights of all word segments included in a text type field may be 1, and the weights of the word segments included in the text type field may be equal or unequal.
For example, consider a log containing three numeric type fields and one text type field, where the three numeric type fields are a first numeric type field, a second numeric type field, and a third numeric type field, each with a weight of 1, and the text type field includes three text word segments, namely a first text word segment, a second text word segment, and a third text word segment, whose weights sum to 1 and are each 1/3, as shown in FIG. 2-2.
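For such a log record, the weight vector w could look as follows (a hypothetical Python representation; the field and word-segment names are assumptions, not taken from the patent):

    # Three numeric fields, each with weight 1; the text field's three
    # word segments share a total weight of 1 in equal shares.
    record = {"cpu": 0.93, "mem": 0.51, "latency": 12.0,
              "msg": ["disk", "read", "error"]}
    w = {"cpu": 1.0, "mem": 1.0, "latency": 1.0,
         "disk": 1 / 3, "read": 1 / 3, "error": 1 / 3}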
Correspondingly, the step in which the cluster server separately calculates, according to the weighted value of each word segment and each initial cluster center, the distance between the data and each of the K initial cluster centers by using K-Means may alternatively be:
The cluster server separately calculates, according to the weighted value and the weight of each word segment as well as each initial cluster center, the distance between the data and each of the K initial cluster centers by using the K-Means algorithm.
(1-2): The cluster server clusters, according to the distances between the data and the initial cluster centers, the data into the category corresponding to the initial cluster center nearest to the data.
The cluster server selects, according to the distances between the data and the initial cluster centers, the nearest initial cluster center, and clusters the data into the category corresponding to the selected initial cluster center.
(1-3): The cluster server recalculates the new cluster centers of the K categories according to the data included in the K categories, until the distance between two adjacent cluster centers of each of the K categories is within a preset distance, to obtain the K final cluster centers.
For each of the K categories, the mean of the data included in the category is calculated as the new cluster center of the category, and the distance between the new cluster center and the previous cluster center is calculated. If the distance is within a preset distance (for ease of distinction, referred to as the first preset distance), the new cluster center is used as the final cluster center of the category. Each category corresponds to one final cluster center, and the final cluster center is used to cluster to-be-clustered data.
If the distance is not within the first preset distance, steps (1-1) to (1-3) are performed again until the distance between two adjacent cluster centers of each category is within the first preset distance. The first preset distance can be set and changed as required, and is not specifically limited in this embodiment of the present invention.
(2): For any data in the data set, the cluster server separately calculates the distance between the any data and each of the K final cluster centers.
When the any data includes only numeric type fields, the cluster server calculates the distance between the any data and each of the K final cluster centers according to the K-Means algorithm.
When the any data includes only text type fields, the cluster server converts the any data into a multidimensional numeric vector, and calculates the distance between the multidimensional numeric vector and each of the K final cluster centers according to K-Means.
When the any data includes a text type field and a numeric type field, the cluster server obtains each word segment included in the any data, separately calculates the weighted value of each word segment, and separately calculates, according to the weighted value of each word segment and each final cluster center, the distance between the any data and each final cluster center.
For each final cluster center, the cluster server separately calculates, according to the weighted value of each word segment included in the any data, the distance between the any data and the final cluster center by using the following formula (5):
D(l, c) = sqrt( Σf∈DF w(f)×(l(f)−c(f))² + Σwf∈WF w(wf)×(l(wf)−c(wf))² )    (5)
where D(l, c) is the distance between the any data and the final cluster center; DF denotes the numeric type fields and WF denotes the text type fields; l is the any data and c is the final cluster center; l(wf) is the weighted value of a word segment, w(wf) is the weight of the corresponding dimension, and c(wf) is the value of the final cluster center in that dimension.
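Under the reconstructed reading of formula (5) above, the mixed-type distance can be sketched as follows (Python; representing l and c as mappings from field names and word segments to values is an assumption):

    def mixed_distance(l, c, w, numeric_fields, word_segments):
        # Weighted squared differences over the numeric fields (DF) ...
        s = sum(w[f] * (l[f] - c.get(f, 0.0)) ** 2 for f in numeric_fields)
        # ... plus weighted squared differences over the word-segment
        # dimensions (WF), where l[wf] is the segment's TF-IDF value.
        s += sum(w[t] * (l.get(t, 0.0) - c.get(t, 0.0)) ** 2
                 for t in word_segments)
        return s ** 0.5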
(3): The cluster server selects, from the K final cluster centers, the final cluster center with the smallest distance to the any data, and classifies the any data into the category corresponding to the selected final cluster center.
After the clustering is complete, the cluster server generates a clustering result, where the clustering result includes the data included in each category, and sends the clustering result to the terminal through the communication interface.
The terminal receives the clustering result sent by the cluster server and displays the clustering result. FIG. 2-3 shows the effect of clustering data by using the clustering method provided in this embodiment of the present invention.
After the clustering is complete, the N data included in the data set are clustered into the K categories. Each of the K categories corresponds to a data subset, and each data subset includes some of the N data. The final cluster center of each category is updated according to the data included in the data subset corresponding to the category.
After the clustering is complete, a data subset can serve as a classifier for clustering to-be-clustered data.
The cluster server separately calculates the mean of the data included in each data subset as the cluster center of the corresponding classifier. When the terminal needs to cluster certain to-be-clustered data, the cluster server receives, through the communication interface, the to-be-clustered data sent by the terminal, separately calculates the distance between the to-be-clustered data and the cluster center of each classifier, selects the classifier whose cluster center is nearest to the to-be-clustered data, and clusters the to-be-clustered data into the selected classifier.
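A minimal sketch of this per-record classification, reusing the dist helper from the first sketch (the classifier representation as center/subset pairs is an assumption):

    def classify(record, classifiers):
        # classifiers: list of (cluster_center, data_subset) pairs,
        # one per category.
        j = min(range(len(classifiers)),
                key=lambda i: dist(record, classifiers[i][0]))
        # Cluster the incoming record into the nearest classifier.
        classifiers[j][1].append(record)
        return j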
After multiple rounds of clustering, the data subsets of a same category can be merged, and the merged data subset of the category would expand excessively. Therefore, an elimination mechanism is needed to limit the growth of the data subsets. In addition, the numbers of data included in the data subsets of different categories may differ considerably; when the total data subset size is fixed, if the data subset of one category includes too much data, the development of the data subsets of the other categories is restricted. Therefore, the data included in each category can be eliminated through the following steps 205 and 206.
Step 205: The cluster server selects a to-be-eliminated category from the K categories, where the number P of data included in the to-be-eliminated category is greater than a preset number.
An optional solution of step 205 is as follows:
For ease of distinction, this preset number is referred to as the first preset number. The first preset number can be set and changed as required, and is not specifically limited in this embodiment of the present invention; for example, the first preset number may be 100.
In this step, the cluster server may alternatively select, from the K categories, a second preset number of categories that include the most data as the to-be-eliminated categories.
The second preset number is an integer less than K, and can be set and changed as required; the second preset number is not specifically limited in this embodiment of the present invention; for example, the second preset number may be 2 or 3.
Step 206: The cluster server eliminates, from the data included in the to-be-eliminated category, the (P - preset number) data farthest from the final cluster center of the to-be-eliminated category.
An optional solution of step 206 is as follows:
The cluster server separately calculates the distance between each of the P data included in the to-be-eliminated category and the final cluster center of the to-be-eliminated category; according to these distances, the cluster server selects the (P - first preset number) data with the largest distances from the P data, and eliminates the selected data from the to-be-eliminated category, so that the first preset number of data are retained in the to-be-eliminated category.
In this step, the cluster server may alternatively eliminate, from the P data included in the to-be-eliminated category, the data whose distance from the final cluster center of the to-be-eliminated category exceeds a second preset distance.
The second preset distance can be set and changed as required, and is not specifically limited in this embodiment of the present invention.
In this step, the cluster server may alternatively select, from the P data included in the to-be-eliminated category, the preset number of data nearest to the final cluster center of the to-be-eliminated category, and eliminate the data other than the selected data from the P data.
When the training effect is sufficiently good, the data records assigned to a same cluster should be highly similar. It then suffices to select only the m data records nearest to the final cluster center, a small but effective portion of the data, and add them to the main classifier to retain the information of the clustering result, where m is an integer greater than 2 and can be set and changed as required; m is not specifically limited in this embodiment of the present invention; for example, m may be 20 or 50.
It should be noted that, after the N data are clustered into the K categories, the final cluster center of each of the K categories may change. In this case, the cluster server recalculates the updated final cluster center of each category according to the data included in the category. In this step, the cluster server may eliminate, from the P data included in the to-be-eliminated category, the (P - preset number) data farthest from the updated final cluster center of the to-be-eliminated category.
Step 207: The cluster server updates the final cluster center of the to-be-eliminated category according to the data in the to-be-eliminated category other than the eliminated data.
An optional solution of step 207 is as follows:
The cluster server calculates, according to the non-eliminated data in the to-be-eliminated category, the mean of the non-eliminated data as the updated final cluster center of the to-be-eliminated category.
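Steps 205 to 207 can be sketched together as follows (Python, reusing dist and mean from the first sketch; preset_number corresponds to the first preset number):

    def eliminate_and_update(subset, center, preset_number):
        # Step 205/206: if the category holds P > preset_number data,
        # drop the (P - preset_number) data farthest from the center.
        if len(subset) > preset_number:
            subset.sort(key=lambda x: dist(x, center))
            del subset[preset_number:]
        # Step 207: the mean of the retained data becomes the updated
        # final cluster center of the category.
        return mean(subset)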
When the cluster server receives to-be-clustered data, it clusters the to-be-clustered data according to the to-be-clustered data and the updated final cluster centers, thereby improving clustering accuracy.
In this embodiment of the present invention, the adjustment factor corresponding to the maximum calculation amount included in the clustering request is determined according to the maximum calculation amount; one data is randomly selected from the to-be-clustered data set; K-1 data are selected from the data set according to the adjustment factor and the randomly selected data, where the randomly selected data and the K-1 data constitute the K initial cluster centers of the data set; and the N data in the data set are clustered according to the K initial cluster centers. Because the K-1 initial cluster centers are selected from the data set according to the maximum calculation amount and the randomly selected data, the present invention can automatically adjust the amount of clustering computation according to the maximum calculation amount, thereby improving clustering efficiency.
An embodiment of the present invention further provides a data clustering apparatus. The apparatus may be implemented by the computing device shown in FIG. 1-2, by an application-specific integrated circuit (ASIC), or by a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), an FPGA, a generic array logic (GAL), or any combination thereof. The data clustering apparatus is configured to implement the data clustering method based on the K-Means algorithm shown in FIG. 2-1. When the data clustering method based on the K-Means algorithm shown in FIG. 2-1 is implemented by software, the data clustering apparatus based on the K-Means algorithm may also be a software module.
A schematic structural diagram of the data clustering apparatus is shown in FIG. 3-1, including a receiving module 301, a determining module 302, a selection module 303, and a clustering module 304.
When working, the receiving module 301 performs the part of step 201 in the data clustering method based on the K-Means algorithm shown in FIG. 2-1; when working, the determining module 302 performs step 202 and its optional solutions in that method; when working, the selection module 303 performs steps 203 and 205 and their optional solutions in that method; and when working, the clustering module 304 performs step 204 and its optional solutions in that method.
In addition, as shown in FIG. 3-2, the data clustering apparatus may further include an elimination module 305 and a calculation module 306. When working, the elimination module 305 performs step 206 and its optional solutions in the data clustering method based on the K-Means algorithm shown in FIG. 2-1, and when working, the calculation module 306 performs step 207 and its optional solutions in that method.
In this embodiment of the present invention, the adjustment factor corresponding to the maximum calculation amount included in the clustering request is determined according to the maximum calculation amount; one data is randomly selected from the to-be-clustered data set; K-1 data are selected from the data set according to the adjustment factor and the randomly selected data, where the randomly selected data and the K-1 data constitute the K initial cluster centers of the data set; and the N data in the data set are clustered according to the K initial cluster centers. Because the K-1 initial cluster centers are selected from the data set according to the maximum calculation amount and the randomly selected data, the present invention can automatically adjust the amount of clustering computation according to the maximum calculation amount, thereby improving clustering efficiency.
It should be noted that when the data clustering apparatus provided in the foregoing embodiments performs data clustering based on the K-Means algorithm, the division of the foregoing functional modules is merely used as an example for description. In actual application, the foregoing functions may be allocated to different functional modules as required; that is, the internal structure of the apparatus is divided into different functional modules to complete all or some of the functions described above. In addition, the data clustering apparatus provided in the foregoing embodiments and the embodiments of the data clustering method based on the K-Means algorithm belong to a same concept; for the specific implementation process, refer to the method embodiments, and details are not described herein again.
A person of ordinary skill in the art may understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or may be implemented by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing descriptions are merely preferred embodiments of the present invention, and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (18)

1. A computing device, wherein the computing device is configured to cluster N data included in a to-be-processed data set into K categories, N is an integer greater than K, K is a preset number of categories and is an integer greater than or equal to 2, and each of the K categories corresponds to an initial cluster center; the computing device comprises a communication interface, a processor, and a memory; the communication interface establishes communication connections with the processor and the memory respectively, and the processor establishes a communication connection with the memory;
    the communication interface is configured to receive a clustering request, wherein the clustering request comprises a maximum calculation amount, the K, and the data set;
    the memory is configured to store the maximum calculation amount, the K, and the data set;
    the processor is configured to determine, according to the maximum calculation amount, an adjustment factor corresponding to the maximum calculation amount;
    the processor is further configured to obtain the data set from the memory, and randomly select one data from the data set;
    the processor is further configured to select K-1 data from the data set according to the adjustment factor and the randomly selected data, wherein the randomly selected data and the K-1 data constitute K initial cluster centers of the data set; and
    the processor is further configured to cluster the N data in the data set according to the K initial cluster centers.
2. The computing device according to claim 1, wherein the clustering request further comprises a training count and a data size of the data set;
    the processor is further configured to determine, according to the training count, the data size, and the K, a center point initialization calculation amount and an iterative training calculation amount for clustering the N data included in the data set; and
    the processor is further configured to determine, according to the center point initialization calculation amount, the iterative training calculation amount, and the maximum calculation amount, the adjustment factor corresponding to the maximum calculation amount.
3. The computing device according to claim 1, wherein
    the processor is further configured to select M data from the data set according to the adjustment factor, wherein M is an integer greater than K; and
    the processor is further configured to select, according to the randomly selected data and the M data, the K-1 data farthest from the randomly selected data from the M data.
4. The computing device according to claim 1, wherein
    the processor is further configured to determine K final cluster centers according to the K initial cluster centers and the N data in the data set;
    for any data in the data set, the processor is further configured to separately calculate a distance between the any data and each of the K final cluster centers; and
    the processor is further configured to select, from the K final cluster centers, a final cluster center with a smallest distance to the any data, and cluster the any data into a category corresponding to the selected final cluster center.
5. The computing device according to claim 4, wherein
    when the any data includes a text type field and a numeric type field, the processor is further configured to obtain each word segment included in the any data; and
    the processor is further configured to separately calculate a weighted value of each word segment, and separately calculate, according to the weighted value of each word segment and each of the K final cluster centers, a distance between the any data and each final cluster center.
6. The computing device according to any one of claims 1 to 5, wherein
    the processor is further configured to select a to-be-eliminated category from the K categories, wherein a quantity P of data included in the to-be-eliminated category is greater than a preset number;
    the processor is further configured to eliminate, from the data included in the to-be-eliminated category, the (P - preset number) data farthest from the final cluster center of the to-be-eliminated category; and
    the processor is further configured to update the final cluster center of the to-be-eliminated category according to the data in the to-be-eliminated category other than the eliminated data.
7. A data clustering method based on the K-Means algorithm, wherein the method is performed by a cluster server and is used to cluster N data included in a to-be-processed data set into K categories, N is an integer greater than K, K is a preset number of categories and is an integer greater than or equal to 2, and each of the K categories corresponds to an initial cluster center; and the method comprises:
    receiving, by the cluster server, a clustering request, wherein the clustering request comprises a maximum calculation amount, the K, and the data set;
    determining, by the cluster server according to the maximum calculation amount, an adjustment factor corresponding to the maximum calculation amount;
    randomly selecting, by the cluster server, one data from the data set;
    selecting, by the cluster server, K-1 data from the data set according to the adjustment factor and the randomly selected data, wherein the randomly selected data and the K-1 data constitute K initial cluster centers of the data set; and
    clustering, by the cluster server, the N data in the data set according to the K initial cluster centers.
8. The method according to claim 7, wherein the clustering request further comprises a training count and a data size of the data set; and
    the determining, by the cluster server according to the maximum calculation amount, an adjustment factor corresponding to the maximum calculation amount comprises:
    determining, by the cluster server according to the training count, the data size, and the K, a center point initialization calculation amount and an iterative training calculation amount for clustering the N data included in the data set; and
    determining, by the cluster server according to the center point initialization calculation amount, the iterative training calculation amount, and the maximum calculation amount, the adjustment factor corresponding to the maximum calculation amount.
9. The method according to claim 7, wherein the selecting, by the cluster server, K-1 data from the data set according to the adjustment factor and the randomly selected data comprises:
    selecting, by the cluster server, M data from the data set according to the adjustment factor, wherein M is an integer greater than K; and
    selecting, by the cluster server according to the randomly selected data and the M data, the K-1 data farthest from the randomly selected data from the M data.
10. The method according to claim 7, wherein the clustering, by the cluster server, the N data in the data set according to the K initial cluster centers comprises:
    determining, by the cluster server, K final cluster centers according to the K initial cluster centers and the N data in the data set;
    for any data in the data set, separately calculating, by the cluster server, a distance between the any data and each of the K final cluster centers; and
    selecting, by the cluster server from the K final cluster centers, a final cluster center with a smallest distance to the any data, and classifying the any data into a category corresponding to the selected final cluster center.
11. The method according to claim 10, wherein the separately calculating, by the cluster server, a distance between the any data and each of the K final cluster centers comprises:
    when the any data includes a text type field and a numeric type field, obtaining, by the cluster server, each word segment included in the any data; and
    separately calculating, by the cluster server, a weighted value of each word segment, and separately calculating, according to the weighted value of each word segment and each of the K final cluster centers, a distance between the any data and each final cluster center.
12. The method according to any one of claims 7 to 11, wherein the method further comprises:
    selecting, by the cluster server, a to-be-eliminated category from the K categories, wherein a quantity P of data included in the to-be-eliminated category is greater than a preset number;
    eliminating, by the cluster server from the data included in the to-be-eliminated category, the (P - preset number) data farthest from the final cluster center of the to-be-eliminated category; and
    updating, by the cluster server, the final cluster center of the to-be-eliminated category according to the data in the to-be-eliminated category other than the eliminated data.
13. A data clustering apparatus, wherein the apparatus is applied to a cluster server and is configured to cluster N data included in a to-be-processed data set into K categories, N is an integer greater than K, K is a preset number of categories and is an integer greater than or equal to 2, and each of the K categories corresponds to an initial cluster center; and the apparatus comprises:
    a receiving module, configured to receive a clustering request, wherein the clustering request comprises a maximum calculation amount, the K, and the data set;
    a determining module, configured to determine, according to the maximum calculation amount, an adjustment factor corresponding to the maximum calculation amount;
    a selection module, configured to randomly select one data from the data set, wherein
    the selection module is further configured to select K-1 data from the data set according to the adjustment factor and the randomly selected data, and the randomly selected data and the K-1 data constitute K initial cluster centers of the data set; and
    a clustering module, configured to cluster the N data in the data set according to the K initial cluster centers.
14. The apparatus according to claim 13, wherein the clustering request further comprises a training count and a data size of the data set; and
    the determining module is further configured to: determine, according to the training count, the data size, and the K, a center point initialization calculation amount and an iterative training calculation amount for clustering the N data included in the data set; and determine, according to the center point initialization calculation amount, the iterative training calculation amount, and the maximum calculation amount, the adjustment factor corresponding to the maximum calculation amount.
15. The apparatus according to claim 13, wherein
    the selection module is further configured to: select M data from the data set according to the adjustment factor, wherein M is an integer greater than K; and select, according to the randomly selected data and the M data, the K-1 data farthest from the randomly selected data from the M data.
16. The apparatus according to claim 13, wherein
    the clustering module is configured to: determine K final cluster centers according to the K initial cluster centers and the N data in the data set; for any data in the data set, separately calculate a distance between the any data and each of the K final cluster centers; and select, from the K final cluster centers, a final cluster center with a smallest distance to the any data, and classify the any data into a category corresponding to the selected final cluster center.
17. The apparatus according to claim 16, wherein
    the clustering module is further configured to: when the any data includes a text type field and a numeric type field, obtain each word segment included in the any data; separately calculate a weighted value of each word segment; and separately calculate, according to the weighted value of each word segment and each of the K final cluster centers, a distance between the any data and each final cluster center.
18. The apparatus according to any one of claims 13 to 17, wherein
    the selection module is further configured to select a to-be-eliminated category from the K categories, wherein a quantity P of data included in the to-be-eliminated category is greater than a preset number; and
    the apparatus further comprises:
    an elimination module, configured to eliminate, from the data included in the to-be-eliminated category, the (P - preset number) data farthest from the final cluster center of the to-be-eliminated category; and
    a calculation module, configured to update the final cluster center of the to-be-eliminated category according to the data in the to-be-eliminated category other than the eliminated data.