CN108805174B - Clustering method and device

Info

Publication number: CN108805174B
Application number: CN201810482717.2A
Authority: CN
Inventors: 姚佳
Original assignee: Guangdong Huihe Technology Development Co ltd
Current assignee: Guangdong Huihe Technology Development Co ltd
Priority date: 2018-05-18
Filing date: 2018-05-18
Publication date: 2022-03-29
Anticipated expiration: 2038-05-18
Also published as: CN108805174A

Abstract

The embodiments of the application provide a clustering method and device. The method comprises: reading data to be processed comprising a plurality of samples, and forming the data to be processed into an RDD (Resilient Distributed Dataset) object; randomly determining a first preset number of target samples among the plurality of samples, and using the first preset number of target samples respectively as the cluster centers of a first preset number of clusters; randomly determining a second preset number of samples among the samples other than the target samples; for each of the second preset number of samples, calculating the distance between the sample and the cluster center of each cluster, and assigning the sample to the cluster whose cluster center is closest to the sample; and calculating the average value of the samples in each cluster as a first average value, and updating the value of the cluster center of the cluster according to the first average value and the current value of the cluster center, until a preset condition is met. In this way, a result that is only locally optimal can be avoided relatively effectively.

Description

Clustering method and device

Technical Field

The application relates to the technical field of big data, in particular to a clustering method and a clustering device.

Background

Clustering methods are widely used in machine learning, data mining and many other fields. Conventional clustering methods can be classified, according to their characteristics, into hierarchical clustering, partitional clustering, density-based clustering, grid-based clustering, kernel clustering, spectral clustering and the like. Correspondingly, hierarchical clustering includes BIRCH, ROCK, Chameleon, etc.; partitional clustering includes K-means and its variants; density-based clustering includes DBSCAN, OPTICS, etc.; and grid-based clustering includes STING, CLIQUE, etc. Among these methods, K-means is the most commonly used. Despite its drawbacks, such as the a priori assumption that clusters are roughly circular (spherical) and the need to specify the number of clusters in advance, its relatively low computational complexity makes it a popular algorithm. However, when the amount of data to be processed is huge, existing K-means still has high computational complexity and is prone to falling into a local optimum.

Disclosure of Invention

In view of the above, the present disclosure provides a clustering method and apparatus to improve the above problem.

In order to achieve the above purpose, the embodiment of the present application adopts the following technical solutions:

In a first aspect, an embodiment of the present application provides a clustering method, which is applied to a terminal device based on the Spark framework, where the method includes:

reading data to be processed comprising a plurality of samples through the Spark framework, and forming the data to be processed into an RDD (Resilient Distributed Dataset) object;

randomly determining a first preset number of target samples in the plurality of samples, and respectively using the first preset number of samples as the clustering centers of the first preset number of clusters;

randomly determining a second preset number of samples from the samples except the first preset number of target samples, wherein the second preset number is a preset multiple of the first preset number;

for each sample in the second preset number of samples, calculating the distance between the sample and the clustering center of each cluster, and dividing the sample into the cluster where the clustering center closest to the sample is located;

calculating the average value of each sample in each cluster as a first average value, and updating the value of the cluster center of the cluster according to the first average value and the current value of the cluster center of the cluster;

and when the updated value does not meet the preset condition, determining a second preset number of samples in the samples except the first preset number of target samples, and dividing the determined second preset number of samples into corresponding clusters to update the value of the clustering center of each cluster.

Optionally, according to the clustering method provided in the first aspect of the embodiment of the present application, the preset condition includes:

the average value of the included angles between the value before updating and the value after updating of the cluster center of each cluster is smaller than a preset angle;

the number of times the value of the cluster center of each cluster has been updated according to a determined second preset number of samples reaches a preset number of times.

Optionally, according to the clustering method provided in the first aspect of the embodiment of the present application, the terminal device presets a cluster center update rate;

updating the value of the cluster center of the cluster according to the first average value and the current value of the cluster center of the cluster, including:

and calculating the weighted average value of the first average value and the current value of the cluster center of the cluster by taking the update rate of the cluster center as the weight of the first average value and taking the difference value between 1 and the update rate of the cluster center as the weight of the current value of the cluster center of the cluster, and updating the value of the cluster center of the cluster into the obtained weighted average value.

Optionally, according to the clustering method provided in the first aspect of the embodiment of the present application, the method further includes:

when any one of the preset conditions is met, updating the clustering centers of the clusters according to the plurality of samples;

and when the preset condition is not met after updating, updating the clustering centers of the clusters again according to the samples.

Optionally, according to the clustering method provided in the first aspect of the embodiment of the present application, updating the clustering center of each cluster according to the plurality of samples includes:

calculating the distance between the sample and the clustering center of each cluster for each sample in the plurality of samples, and dividing the sample into the cluster where the clustering center closest to the sample is located;

and calculating the average value of each sample in each cluster as a second average value, and updating the value of the cluster center of the cluster as the second average value.

In a second aspect, an embodiment of the present application further provides a clustering device, which is applied to a terminal device based on the Spark framework, where the clustering device includes:

the data reading module is used for reading data to be processed comprising a plurality of samples through the Spark framework and forming the data to be processed into an RDD object;

a center determining module, configured to randomly determine a first preset number of target samples from the multiple samples, and use the first preset number of samples as a clustering center of the first preset number of clusters respectively;

a sample selection module, configured to randomly determine a second preset number of samples from samples, other than the first preset number of target samples, in the plurality of samples, where the second preset number is a preset multiple of the first preset number;

the first dividing module is used for calculating the distance between the sample and the clustering center of each cluster aiming at each sample in the second preset number of samples, and dividing the sample into the cluster where the clustering center closest to the sample is located;

the first updating module is used for calculating the average value of each sample in each cluster to serve as a first average value, updating the value of the clustering center of the cluster according to the first average value and the current value of the clustering center of the cluster, determining a second preset number of samples in the samples except the first preset number of target samples when the updating does not meet a preset condition, and dividing the second preset number of samples into corresponding clusters to update the value of the clustering center of each cluster.

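Optionally, according to the clustering device provided in the second aspect of the embodiment of the present application, the preset condition includes:

the average value of the included angles between the value before updating and the value after updating of the cluster center of each cluster is smaller than a preset angle;

the number of times the value of the cluster center of each cluster has been updated according to a determined second preset number of samples reaches a preset number of times.

Optionally, according to the clustering device provided in the second aspect of the embodiment of the present application, a cluster center update rate is preset in the terminal device; the first updating module updates the value of the cluster center of the cluster according to the first average value and the current value of the cluster center of the cluster in the following way:

calculating the weighted average of the first average value and the current value of the cluster center of the cluster, with the cluster center update rate as the weight of the first average value and the difference between 1 and the cluster center update rate as the weight of the current value, and updating the value of the cluster center of the cluster to the obtained weighted average.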

Optionally, according to the clustering device provided by the second aspect of the embodiment of the present application, the clustering device further includes:

and the second updating module is used for updating the clustering center of each cluster according to the plurality of samples when any one of the preset conditions is met, and updating the clustering center of each cluster according to the plurality of samples again when the preset conditions are not met after the updating.

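Optionally, according to the clustering device provided in the second aspect of the embodiment of the present application, the second updating module updates the cluster center of each cluster according to the plurality of samples in the following way:

for each sample in the plurality of samples, calculating the distance between the sample and the cluster center of each cluster, and assigning the sample to the cluster whose cluster center is closest to the sample;

calculating the average value of the samples in each cluster as a second average value, and updating the value of the cluster center of the cluster to the second average value.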

Compared with the prior art, the embodiment of the application has the following beneficial effects:

The clustering method and device provided by the embodiments of the application are applied to a terminal device based on the Spark framework. The terminal device reads data to be processed comprising a plurality of samples through the Spark framework and forms the read data into an RDD object. It randomly determines a first preset number of target samples among the plurality of samples to serve respectively as the cluster centers of a first preset number of clusters, randomly determines a second preset number of samples among the samples other than the first preset number of target samples, and assigns the second preset number of samples to the corresponding clusters. It then calculates the average value of the samples in each cluster as a first average value and updates the value of the cluster center of the cluster according to the first average value and the current value of the cluster center, until a preset condition is met. With this process, not every update needs to use all the samples, which reduces the computational complexity, and the tendency to converge to a local optimum can be avoided to a certain extent.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.

Fig. 1 is a connection block diagram of a terminal device according to an embodiment of the present disclosure;

Fig. 2 is a schematic flow chart of a clustering method according to an embodiment of the present application;

Fig. 3 is a schematic flow chart of a clustering method provided in the embodiment of the present application;

Fig. 4 is a functional module block diagram of a clustering device according to an embodiment of the present application.

Reference numerals: 100-terminal device; 110-memory; 120-processor; 130-communication unit; 200-clustering device; 210-data reading module; 220-center determining module; 230-sample selecting module; 240-dividing module; 250-first updating module; 260-second updating module.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

At present, few clustering methods are suitable for use in a big data environment, and K-means is one of the relatively good ones. Taking a scenario based on the Spark framework as an example, K-means in some embodiments is described as follows:

(1) randomly determining k samples among the given samples, wherein each of the k samples represents the cluster center of one cluster, and k is the pre-specified number of cluster categories, namely the number of clusters to be divided;

(2) for each of the remaining samples among the given samples, calculating the distance between the sample and the cluster center of each cluster, and assigning the sample to the cluster represented by the cluster center closest to the sample;

(3) calculating the average value of the samples in each cluster, and updating the value of the cluster center of the cluster to the average value;

(4) repeating steps (2) to (3) until the corresponding conditions are met.
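Steps (1) to (4) can be sketched in plain Python as follows. This is a minimal single-machine illustration, not the Spark implementation: the function name `kmeans`, the Euclidean distance, the `max_iters` cap, and the "centers stop moving" stopping rule are all assumptions made here, since the text only says "corresponding conditions". For brevity the sketch also assigns every sample, including the ones chosen as initial centers.

```python
import math
import random


def kmeans(samples, k, max_iters=100, seed=0):
    """Baseline K-means following steps (1) to (4); samples are equal-length tuples."""
    rng = random.Random(seed)
    # Step (1): pick k distinct samples as the initial cluster centers.
    centers = [list(s) for s in rng.sample(samples, k)]
    for _ in range(max_iters):
        # Step (2): assign every sample to the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for s in samples:
            nearest = min(range(k), key=lambda j: math.dist(s, centers[j]))
            clusters[nearest].append(s)
        # Step (3): replace each center with the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        # Step (4): repeat until the centers no longer move (assumed condition).
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

Note that step (3) overwrites the old center completely, which is the behavior the following paragraph criticizes.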

However, the inventor has found that, in the above manner, in order to ensure convergence, all samples are directly used as the learning samples for the new cluster center of each cluster at every iteration (update), and it is precisely this that makes K-means prone to converging to a local optimum. Moreover, in the above manner, each iteration directly replaces the cluster center with the new value and completely ignores the value of the cluster center before the update, which makes K-means easily disturbed by outlier samples. In addition, the time complexity of K-means in the above manner is still high when the amount of data is large.

Based on this, the embodiment of the present application provides a clustering method and apparatus, which are applied to a terminal device based on a Spark framework, so as to at least partially improve the above problem. It should be understood that the clustering method and apparatus provided in this embodiment may also be applied to other frameworks such as MapReduce, which is not limited in this embodiment. This will be explained in detail below.

As shown in fig. 1, which is a block schematic diagram of a terminal device 100 provided in this embodiment of the present application, the terminal device 100 may be any electronic device having a data processing function and a communication function, for example, a server, and specifically, when the terminal device 100 is a server, the terminal device may be a single server or a server cluster formed by a plurality of servers, which is not limited in this embodiment.

In this embodiment, the terminal device 100 may be installed with an operating platform corresponding to Spark, where Spark is a fast and general computing engine specially designed for large-scale data processing, and is an open-source general parallel framework similar to Hadoop MapReduce.

The terminal device 100 includes a clustering apparatus 200, a memory 110, a processor 120, and a communication unit 130.

In the present embodiment, the memory 110, the processor 120 and the communication unit 130 are electrically connected to each other directly or indirectly, so as to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The clustering device 200 includes at least one software functional module that can be stored in the memory 110 in the form of software or firmware (firmware) or solidified in an Operating System (OS) of the terminal device 100. The processor 120 is used for executing executable modules stored in the memory 110, such as software functional modules and computer programs included in the clustering device 200.

The memory 110 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.

The processor 120 may be an integrated circuit chip having signal processing capabilities. The processor 120 may be a general-purpose processor, such as a Central Processing Unit (CPU), a Network Processor (NP), or a microprocessor; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components; the processor 120 may also be any conventional processor that can implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application.

The communication unit 130 is used for establishing a communication connection between the terminal device 100 and other devices to implement data interaction or communication.

It should be understood that the configuration shown in fig. 1 is merely illustrative, and the terminal device 100 may include more or fewer components than those shown in fig. 1, and may have a completely different configuration than that shown in fig. 1. It should be noted that, the components shown in fig. 1 may be implemented by hardware, software or a combination thereof, and the embodiment is not limited thereto.

Fig. 2 is a schematic flow chart of a clustering method provided in an embodiment of the present application, and the clustering method can be applied to the terminal device 100 shown in fig. 1.

Step S201, reading data to be processed including a plurality of samples through the Spark framework, and forming an RDD object from the data to be processed.

The RDD (Resilient Distributed Dataset) is a Spark-specific data model, and the samples are the samples predetermined for clustering.

Step S202, randomly determining a first preset number of target samples in the plurality of samples, and respectively using the first preset number of samples as the clustering centers of the first preset number of clusters.

The first preset number is a predetermined required cluster category number, that is, the number of clusters required to be divided.

Step S203, randomly determining a second preset number of samples among the samples except the first preset number of target samples.

The second preset number is a preset multiple of the first preset number. In this embodiment, the second preset number is a predetermined value that represents how many samples are used in each iteration to update the cluster centers of the clusters, and the second preset number may generally be 100-.

For example, if the first predetermined number is k and the predetermined multiple is γ, the second predetermined number is k × γ.
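Steps S202 and S203 amount to two rounds of sampling without replacement. The sketch below illustrates this on an in-memory list; the function name `pick_centers_and_batch` and its parameters are hypothetical, and it assumes at least k + k × γ samples are available.

```python
import random


def pick_centers_and_batch(samples, k, gamma, rng):
    """Pick k target samples as centers, then k * gamma of the remaining samples."""
    indices = list(range(len(samples)))
    # Step S202: k randomly chosen target samples become the cluster centers.
    center_idx = rng.sample(indices, k)
    chosen = set(center_idx)
    # Step S203: draw the second preset number (k * gamma) from the other samples.
    remaining = [i for i in indices if i not in chosen]
    batch_idx = rng.sample(remaining, k * gamma)
    centers = [samples[i] for i in center_idx]
    batch = [samples[i] for i in batch_idx]
    return centers, batch
```

Calling it again with the same excluded target samples yields a fresh batch for the next iteration.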

Step S204, for each sample in the second preset number of samples, calculating a distance between the sample and a cluster center of each cluster, and dividing the sample into the cluster where the cluster center closest to the sample is located.

Taking the second preset number k × γ as an example, for each of k × γ samples, calculating the distance from the sample to the current cluster center of each cluster, determining the cluster corresponding to the cluster center with the minimum distance to the sample as a target cluster, and assigning the sample to the target cluster.
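The assignment in step S204 can be illustrated as below. The text does not name a distance measure, so Euclidean distance is assumed here, and `assign_to_clusters` is a hypothetical helper name.

```python
import math


def assign_to_clusters(batch, centers):
    """Return one list of samples per cluster, using Euclidean distance (assumed)."""
    clusters = [[] for _ in centers]
    for sample in batch:
        # The target cluster is the one whose current center is nearest to the sample.
        nearest = min(range(len(centers)), key=lambda j: math.dist(sample, centers[j]))
        clusters[nearest].append(sample)
    return clusters
```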

Step S205, calculating the average value of each sample in each cluster as a first average value, and updating the value of the cluster center of the cluster according to the first average value and the current value of the cluster center of the cluster.

In this embodiment, for each cluster, after the first average value of the cluster is obtained, the cluster center of the cluster is updated according to the first average value and the current value of the cluster center of the cluster.

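Optionally, in this embodiment, a cluster center update rate may be preset in the terminal device 100. Correspondingly, step S205 may include the following step:

calculating the weighted average of the first average value and the current value of the cluster center of the cluster, with the cluster center update rate as the weight of the first average value and the difference between 1 and the cluster center update rate as the weight of the current value, and updating the value of the cluster center of the cluster to the obtained weighted average.

Assuming that the update rate of the cluster center is α, the current value of the cluster center of the cluster is old_center, and the calculated first average value is new_center, the new value center (i.e., the weighted average) of the cluster center of the cluster can be calculated by the following formula:

center = α * new_center + (1 - α) * old_center

The update rate of the cluster center can be adjusted according to the actual situation, which limits the influence of any single update that performs poorly because it is based on only part of the samples.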
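A minimal sketch of the weighted update in step S205 follows. The function name `update_center` is hypothetical, and keeping the old center when a cluster receives no samples in the current batch is an assumption; the text does not cover that case.

```python
def update_center(old_center, members, alpha):
    """Blend the mean of the cluster's members (the first average) into the old center."""
    if not members:
        return list(old_center)  # assumption: an empty cluster keeps its center
    # new_center is the first average value of the cluster.
    new_center = [sum(col) / len(members) for col in zip(*members)]
    # center = alpha * new_center + (1 - alpha) * old_center, per coordinate.
    return [alpha * n + (1 - alpha) * o for n, o in zip(new_center, old_center)]
```

With α = 1 this reduces to the plain K-means update; smaller values of α make the center move more cautiously.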

Step S206, when the updated value does not satisfy the preset condition, re-determining a second preset number of samples from the samples except the first preset number of target samples, and dividing the re-determined second preset number of samples into corresponding clusters to update the value of the cluster center of each cluster.

In other words, in the present embodiment, the terminal device 100 repeatedly executes steps S203 to S205 until the preset condition is satisfied.

A specific example is given below to explain step S203 to step S206 in detail.

Assume that in this example steps S203-S206 are repeated 1000 times in total, that the first average value (i.e., new_center) obtained in the i-th calculation is v_i, and that the new value of the cluster center (i.e., the center mentioned above) obtained by calculating the weighted average in the i-th iteration is V_i. Assuming that the update rate α of the cluster center is 0.8, V_i can be calculated by the following formula:

V_i = 0.8 * v_i + 0.2 * V_{i-1}

It can be seen that the process of repeatedly executing steps S203-S205 to update the value of the cluster center is in effect an exponentially weighted average. In other words, in the clustering method provided in this embodiment, each update takes into account the influence of the previously updated values on the current result, rather than treating each update in isolation, so that convergence to a local optimum can be avoided relatively reliably.
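To make the exponential weighting concrete, the weighted update can be unrolled. The short derivation below writes v_i for the i-th first average value and V_i for the i-th cluster center value, and assumes that V_0 denotes the initial cluster center chosen in step S202 (a symbol introduced here for illustration), with α = 0.8 as in the example:

```latex
\begin{aligned}
V_i &= \alpha v_i + (1-\alpha) V_{i-1} \\
    &= \alpha \sum_{j=0}^{i-1} (1-\alpha)^{j} v_{i-j} + (1-\alpha)^{i} V_0 \\
    &= 0.8\,v_i + 0.16\,v_{i-1} + 0.032\,v_{i-2} + \dots + 0.2^{\,i} V_0
       \qquad (\alpha = 0.8)
\end{aligned}
```

The weight of a first average value shrinks by a factor of 0.2 with every later update, so recent batches dominate while earlier ones still contribute.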

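Optionally, in this embodiment, the preset condition may include at least one of the following:

the average value of the included angles between the value before updating and the value after updating of the cluster center of each cluster is smaller than a preset angle;

the number of times the value of the cluster center of each cluster has been updated according to a determined second preset number of samples reaches a preset number of times.

The preset angle and the preset number of times can be set flexibly according to actual conditions, and this embodiment does not limit them.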
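The two preset conditions named in the disclosure (a mean included angle below a preset angle, or a preset number of updates reached) can be checked as in the sketch below. The function names, the use of radians, and treating each center as a vector from the origin are assumptions; the sketch also assumes no center is the zero vector.

```python
import math


def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def should_stop(old_centers, new_centers, iteration, max_angle, max_iters):
    """True when either preset condition holds."""
    mean_angle = sum(
        angle_between(o, n) for o, n in zip(old_centers, new_centers)
    ) / len(old_centers)
    return mean_angle < max_angle or iteration >= max_iters
```

One consequence of an angle-based test is that a center moving along the ray through the origin produces an angle of zero, so the update-count condition serves as a useful second safeguard.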

In addition, the inventors have also found that complete convergence may be difficult to ensure if the cluster centers of the clusters are always iteratively updated with only a portion of the samples. Therefore, after the preset condition is reached in the foregoing manner, the cluster centers of the clusters may be further updated with all the samples. After the foregoing steps have been performed, only a few further updates (typically fewer than 10) based on all samples are needed, and their time cost is negligible.

Based on this, as shown in fig. 3, the clustering method provided in this embodiment may further include step S301 and step S302.

Step S301, when any one of the preset conditions is met, updating the cluster center of each cluster according to the plurality of samples.

Step S301 can be implemented by the following sub-steps:

for each sample in the plurality of samples, calculating the distance between the sample and the cluster center of each cluster, and assigning the sample to the cluster whose cluster center is closest to the sample;

calculating the average value of the samples in each cluster as a second average value, and updating the value of the cluster center of the cluster to the second average value.

The second average value is the average value of the samples in each cluster, calculated when the cluster center of each cluster is updated according to all samples (each sample in the plurality of samples).

Step S302, when the preset condition is not met after the updating, updating the cluster centers of the clusters again according to the plurality of samples.

In this embodiment, the iterative update based on all samples continues until the preset condition is satisfied. Based on the above analysis, typically fewer than 10 updates are performed, and their time cost is negligible.
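Putting the two phases together, the overall flow of Fig. 2 and Fig. 3 can be sketched on a single machine as follows. This is an illustration under stated assumptions rather than the patented Spark implementation: all function and parameter names are hypothetical, Euclidean distance is assumed, an empty cluster keeps its center, a zero-length center vector is treated as an angle of zero, and `max_iters` plays the role of the preset number of times in both phases.

```python
import math
import random


def _nearest(sample, centers):
    return min(range(len(centers)), key=lambda j: math.dist(sample, centers[j]))


def _means(samples, centers):
    """Mean of the samples assigned to each center; empty clusters keep their center."""
    groups = [[] for _ in centers]
    for s in samples:
        groups[_nearest(s, centers)].append(s)
    return [
        [sum(col) / len(g) for col in zip(*g)] if g else list(c)
        for g, c in zip(groups, centers)
    ]


def _mean_angle(old, new):
    total = 0.0
    for u, v in zip(old, new):
        norm = math.hypot(*u) * math.hypot(*v)
        cos = sum(a * b for a, b in zip(u, v)) / norm if norm else 1.0
        total += math.acos(max(-1.0, min(1.0, cos)))
    return total / len(old)


def two_phase_cluster(samples, k, gamma, alpha, max_angle, max_iters, seed=0):
    rng = random.Random(seed)
    # Step S202: k random target samples become the initial cluster centers.
    center_idx = set(rng.sample(range(len(samples)), k))
    centers = [list(samples[i]) for i in sorted(center_idx)]
    rest = [s for i, s in enumerate(samples) if i not in center_idx]

    # Phase 1 (steps S203 to S206): mini-batch updates blended with rate alpha.
    for _ in range(max_iters):
        batch = rng.sample(rest, k * gamma)
        first_avg = _means(batch, centers)
        updated = [
            [alpha * n + (1 - alpha) * o for n, o in zip(new, old)]
            for new, old in zip(first_avg, centers)
        ]
        converged = _mean_angle(centers, updated) < max_angle
        centers = updated
        if converged:
            break

    # Phase 2 (steps S301 and S302): plain mean updates over all samples.
    for _ in range(max_iters):
        updated = _means(samples, centers)
        converged = _mean_angle(centers, updated) < max_angle
        centers = updated
        if converged:
            break
    return centers
```

Only phase 2 touches every sample, which is why the text expects its handful of updates to add little to the total cost.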

As shown in fig. 4, an embodiment of the present application further provides a clustering apparatus 200, which is applied to the terminal device 100 shown in fig. 1. The clustering device 200 includes a data reading module 210, a center determining module 220, a sample selecting module 230, a dividing module 240, and a first updating module 250.

The data reading module 210 is configured to read data to be processed including a plurality of samples through the Spark framework, and form the data to be processed into an RDD object.

In the present embodiment, the description of the data reading module 210 may refer to the detailed description of step S201 shown in fig. 2, that is, step S201 may be performed by the data reading module 210.

The center determining module 220 is configured to randomly determine a first preset number of target samples among the plurality of samples, and use the first preset number of samples as the clustering centers of the first preset number of clusters, respectively.

In the present embodiment, the description about the center determining module 220 may refer to the detailed description of step S202 shown in fig. 2, that is, step S202 may be performed by the center determining module 220.

The sample selecting module 230 is configured to randomly determine a second preset number of samples from the samples except for the first preset number of target samples, where the second preset number is a preset multiple of the first preset number.

In the present embodiment, the description of the sample selecting module 230 may refer to the detailed description of step S203 shown in fig. 2, that is, step S203 may be performed by the sample selecting module 230.

The dividing module 240 is configured to calculate, for each sample in the second preset number of samples, a distance between the sample and a cluster center of each cluster, and divide the sample into the cluster where the cluster center closest to the sample is located.

In the present embodiment, the description about the dividing module 240 may specifically refer to the detailed description of step S204 shown in fig. 2, that is, step S204 may be performed by the dividing module 240.

The first updating module 250 is configured to calculate an average value of each sample in each cluster as a first average value, update a value of a cluster center of the cluster according to the first average value and a current value of the cluster center of the cluster, determine a second preset number of samples again in samples other than the first preset number of target samples in the plurality of samples when a preset condition is not met after the updating, and divide the second preset number of samples determined again into corresponding clusters to update the value of the cluster center of each cluster.

In the present embodiment, the description about the first updating module 250 may specifically refer to the detailed description of step S205 shown in fig. 2, that is, step S205 may be executed by the first updating module 250.

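The preset condition may include at least one of the following:

the average value of the included angles between the value before updating and the value after updating of the cluster center of each cluster is smaller than a preset angle;

the number of times the value of the cluster center of each cluster has been updated according to a determined second preset number of samples reaches a preset number of times.

Optionally, the terminal device 100 may preset a cluster center update rate.

Correspondingly, the way for the first updating module 250 to update the value of the cluster center of the cluster according to the first average value and the current value of the cluster center of the cluster may be:

calculating the weighted average of the first average value and the current value of the cluster center of the cluster, with the cluster center update rate as the weight of the first average value and the difference between 1 and the cluster center update rate as the weight of the current value, and updating the value of the cluster center of the cluster to the obtained weighted average.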

Optionally, the clustering device 200 may further include a second updating module 260.

The second updating module 260 is configured to update the clustering center of each cluster according to the multiple samples when any one of the preset conditions is met, and update the clustering center of each cluster according to the multiple samples again when the preset condition is not met after the updating.

In the present embodiment, the description of the second updating module 260 can refer to the detailed description of step S301 and step S302 shown in fig. 3, that is, step S301 and step S302 can be executed by the second updating module 260.

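Optionally, in this embodiment, the manner in which the second updating module 260 updates the cluster center of each cluster according to the multiple samples may be:

for each sample in the plurality of samples, calculating the distance between the sample and the cluster center of each cluster, and assigning the sample to the cluster whose cluster center is closest to the sample;

calculating the average value of the samples in each cluster as a second average value, and updating the value of the cluster center of the cluster to the second average value.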

In summary, the clustering method and the clustering device provided by the embodiments of the application are applied to a terminal device based on the Spark framework. The terminal device reads data to be processed comprising a plurality of samples through the Spark framework and forms the read data into an RDD object. It randomly determines a first preset number of target samples among the plurality of samples to serve respectively as the cluster centers of a first preset number of clusters, randomly determines a second preset number of samples among the samples other than the first preset number of target samples, and assigns the second preset number of samples to the corresponding clusters. It then calculates the average value of the samples in each cluster as a first average value and updates the value of the cluster center of the cluster according to the first average value and the current value of the cluster center, until a preset condition is met. With this process, not every update needs to use all the samples, which reduces the computational complexity, and the tendency to converge to a local optimum can be avoided to a certain extent.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part thereof that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

It should be noted that, herein, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another entity or action, and do not necessarily require or imply any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The above description covers only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims

1. A clustering method, applied to a Spark-based terminal device, the method comprising:

reading data to be processed comprising a plurality of samples through Spark, and forming the data to be processed into an RDD object;

randomly determining a first preset number of target samples in the plurality of samples, and respectively using the first preset number of target samples as the clustering centers of a first preset number of clusters;

2. The clustering method according to claim 1, wherein the preset condition comprises:

3. The clustering method according to claim 2, wherein a cluster center update rate is preset in the terminal device;

4. The clustering method according to claim 2 or 3, wherein the method further comprises:

5. The clustering method according to claim 4, wherein updating the cluster center of each cluster according to the plurality of samples comprises:

6. A clustering device applied to a Spark framework-based terminal device, the device comprising:

a data reading module, configured to read data to be processed comprising a plurality of samples through the Spark framework, and to form the data to be processed into an RDD object;

a center determining module, configured to randomly determine a first preset number of target samples among the plurality of samples, and to use the first preset number of target samples respectively as the clustering centers of a first preset number of clusters;

a dividing module, configured to calculate, for each sample of the second preset number of samples, the distance between the sample and the clustering center of each cluster, and to assign the sample to the cluster whose clustering center is closest to the sample;

7. The clustering device according to claim 6, wherein the preset condition comprises:

8. The clustering device according to claim 7, wherein a cluster center update rate is preset in the terminal device; the first updating module updates the value of the cluster center of the cluster according to the first average value and the current value of the cluster center of the cluster in the following way:

9. The clustering device according to claim 7 or 8, wherein the clustering device further comprises:

10. The clustering device according to claim 9, wherein the second updating module updates the clustering center of each cluster according to the plurality of samples in a manner that: