CN108805174B - Clustering method and device - Google Patents

Clustering method and device

Info

Publication number
CN108805174B
Authority
CN
China
Prior art keywords
cluster
samples
center
clustering
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810482717.2A
Other languages
Chinese (zh)
Other versions
CN108805174A (en
Inventor
姚佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Huihe Technology Development Co ltd
Original Assignee
Guangdong Huihe Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Huihe Technology Development Co ltd filed Critical Guangdong Huihe Technology Development Co ltd
Priority to CN201810482717.2A priority Critical patent/CN108805174B/en
Publication of CN108805174A publication Critical patent/CN108805174A/en
Application granted granted Critical
Publication of CN108805174B publication Critical patent/CN108805174B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a clustering method and a device, wherein the method comprises the following steps: reading data to be processed comprising a plurality of samples, and forming the data to be processed into an RDD (Resilient Distributed Dataset) object; randomly determining a first preset number of target samples among the plurality of samples, and using the first preset number of target samples respectively as the clustering centers of a first preset number of clusters; randomly determining a second preset number of samples among the samples other than the target samples; for each of the second preset number of samples, calculating the distance between the sample and the clustering center of each cluster, and dividing the sample into the cluster whose clustering center is closest to the sample; and calculating the average value of the samples in each cluster as a first average value, and updating the value of the clustering center of the cluster according to the first average value and the current value of the clustering center of the cluster, until a preset condition is met. In this way, obtaining a result that is only locally optimal can be avoided relatively effectively.

Description

Clustering method and device
Technical Field
The application relates to the technical field of big data, in particular to a clustering method and a clustering device.
Background
Clustering methods are widely used in numerous fields such as machine learning and data mining. Conventional clustering methods can be classified, according to their characteristics, into hierarchical clustering, partition-based clustering, density-based clustering, grid-based clustering, kernel clustering, spectral clustering and the like. Correspondingly, hierarchical clustering includes BIRCH, ROCK, Chameleon, etc., partition-based clustering includes K-means and its variants, density-based clustering includes DBSCAN, OPTICS, etc., and grid-based clustering includes STING, CLIQUE, etc. Among these clustering methods, the most commonly used is K-means. Despite the drawbacks of K-means, such as its a priori assumption of circularly distributed clusters and the need to specify the number of clusters in advance, its relatively low computational complexity makes it a popular algorithm. However, when the amount of data to be processed is huge, the existing K-means still has a large computational complexity and is prone to converging to a local optimum.
Disclosure of Invention
In view of the above, the present disclosure provides a clustering method and apparatus to improve the above problem.
In order to achieve the above purpose, the embodiment of the present application adopts the following technical solutions:
In a first aspect, an embodiment of the present application provides a clustering method, which is applied to a terminal device based on the Spark framework, where the method includes:
reading data to be processed comprising a plurality of samples through the Spark framework, and forming the data to be processed into an RDD (Resilient Distributed Dataset) object;
randomly determining a first preset number of target samples in the plurality of samples, and respectively using the first preset number of samples as the clustering centers of the first preset number of clusters;
randomly determining a second preset number of samples from the samples except the first preset number of target samples, wherein the second preset number is a preset multiple of the first preset number;
for each sample in the second preset number of samples, calculating the distance between the sample and the clustering center of each cluster, and dividing the sample into the cluster where the clustering center closest to the sample is located;
calculating the average value of each sample in each cluster as a first average value, and updating the value of the cluster center of the cluster according to the first average value and the current value of the cluster center of the cluster;
and when the updated value does not meet the preset condition, determining a second preset number of samples in the samples except the first preset number of target samples, and dividing the determined second preset number of samples into corresponding clusters to update the value of the clustering center of each cluster.
Optionally, according to the clustering method provided in the first aspect of the embodiment of the present application, the preset condition includes:
the average value of the included angle between the value before updating and the value after updating of the clustering center of each cluster is smaller than a preset angle;
and the number of updates of the value of the clustering center of each cluster, performed according to the determined second preset number of samples, reaches a preset number of times.
Optionally, according to the clustering method provided in the first aspect of the embodiment of the present application, the terminal device presets a cluster center update rate;
updating the value of the cluster center of the cluster according to the first average value and the current value of the cluster center of the cluster, including:
and calculating the weighted average value of the first average value and the current value of the cluster center of the cluster by taking the update rate of the cluster center as the weight of the first average value and taking the difference value between 1 and the update rate of the cluster center as the weight of the current value of the cluster center of the cluster, and updating the value of the cluster center of the cluster into the obtained weighted average value.
Optionally, according to the clustering method provided in the first aspect of the embodiment of the present application, the method further includes:
when any one of the preset conditions is met, updating the clustering centers of the clusters according to the plurality of samples;
and when the preset condition is not met after updating, updating the clustering centers of the clusters again according to the samples.
Optionally, according to the clustering method provided in the first aspect of the embodiment of the present application, updating the clustering center of each cluster according to the plurality of samples includes:
calculating the distance between the sample and the clustering center of each cluster for each sample in the plurality of samples, and dividing the sample into the cluster where the clustering center closest to the sample is located;
and calculating the average value of each sample in each cluster as a second average value, and updating the value of the cluster center of the cluster as the second average value.
In a second aspect, an embodiment of the present application further provides a clustering device, which is applied to a terminal device based on the Spark framework, where the clustering device includes:
the data reading module is used for reading data to be processed comprising a plurality of samples through the Spark framework and forming the data to be processed into an RDD object;
a center determining module, configured to randomly determine a first preset number of target samples from the multiple samples, and use the first preset number of samples as a clustering center of the first preset number of clusters respectively;
a sample selection module, configured to randomly determine a second preset number of samples from samples, other than the first preset number of target samples, in the plurality of samples, where the second preset number is a preset multiple of the first preset number;
the first dividing module is used for calculating the distance between the sample and the clustering center of each cluster aiming at each sample in the second preset number of samples, and dividing the sample into the cluster where the clustering center closest to the sample is located;
the first updating module is used for calculating the average value of each sample in each cluster to serve as a first average value, updating the value of the clustering center of the cluster according to the first average value and the current value of the clustering center of the cluster, determining a second preset number of samples in the samples except the first preset number of target samples when the updating does not meet a preset condition, and dividing the second preset number of samples into corresponding clusters to update the value of the clustering center of each cluster.
Optionally, according to the clustering apparatus provided in the second aspect of the embodiment of the present application, the preset condition includes:
the average value of the included angle between the value before updating and the value after updating of the clustering center of each cluster is smaller than a preset angle;
and the number of updates of the value of the clustering center of each cluster, performed according to the determined second preset number of samples, reaches a preset number of times.
Optionally, according to the clustering device provided in the second aspect of the embodiment of the present application, a cluster center update rate is preset in the terminal device; the first updating module updates the value of the cluster center of the cluster according to the first average value and the current value of the cluster center of the cluster in the following way:
and calculating the weighted average value of the first average value and the current value of the cluster center of the cluster by taking the update rate of the cluster center as the weight of the first average value and taking the difference value between 1 and the update rate of the cluster center as the weight of the current value of the cluster center of the cluster, and updating the value of the cluster center of the cluster into the obtained weighted average value.
Optionally, according to the clustering device provided by the second aspect of the embodiment of the present application, the clustering device further includes:
and the second updating module is used for updating the clustering center of each cluster according to the plurality of samples when any one of the preset conditions is met, and updating the clustering center of each cluster according to the plurality of samples again when the preset conditions are not met after the updating.
Optionally, according to the clustering device provided in the second aspect of the embodiment of the present application, a manner that the second updating module updates the clustering center of each cluster according to the plurality of samples is as follows:
calculating the distance between the sample and the clustering center of each cluster for each sample in the plurality of samples, and dividing the sample into the cluster where the clustering center closest to the sample is located;
and calculating the average value of each sample in each cluster as a second average value, and updating the value of the cluster center of the cluster as the second average value.
Compared with the prior art, the embodiment of the application has the following beneficial effects:
The clustering method and device provided by the embodiment of the application are applied to a terminal device based on the Spark framework. The terminal device reads data to be processed including a plurality of samples through the Spark framework, and forms an RDD object from the read data. It then randomly determines a first preset number of target samples among the plurality of samples to serve respectively as the clustering centers of a first preset number of clusters, randomly determines a second preset number of samples among the samples other than the first preset number of target samples, and divides the second preset number of samples into the corresponding clusters. It then calculates the average value of the samples in each cluster as a first average value, and updates the value of the clustering center of the cluster according to the first average value and the current value of the clustering center of the cluster, until a preset condition is met. Through this process, each update does not need to be performed on all samples, which reduces the computational complexity, and the tendency to converge to a local optimum can be avoided to a certain extent.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a connection block diagram of a terminal device according to an embodiment of the present disclosure;
Fig. 2 is a schematic flow chart of a clustering method according to an embodiment of the present application;
Fig. 3 is another schematic flow chart of the clustering method provided in the embodiment of the present application;
Fig. 4 is a functional module block diagram of a clustering device according to an embodiment of the present application.
Reference numerals: 100-terminal device; 110-memory; 120-processor; 130-communication unit; 200-clustering device; 210-data reading module; 220-center determining module; 230-sample selecting module; 240-dividing module; 250-first updating module; 260-second updating module.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
At present, few clustering methods are suitable for use in a big data environment, and K-means is one of the relatively good ones. Taking a scene based on the Spark framework as an example, K-means in some embodiments is described as follows:
(1) randomly determining k samples in a given sample, wherein each of the k samples represents a cluster center of one cluster, and k is a pre-specified cluster category number, namely the number of clusters to be divided;
(2) calculating the distance between each sample and the clustering center of each cluster in the rest samples in the given sample, and dividing the sample into clusters represented by the clustering centers closest to the sample;
(3) calculating the average value of the samples in each cluster, and updating the value of the cluster center of the cluster to the average value;
(4) repeating steps (2) to (3) until corresponding conditions are met.
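Steps (1) to (4) can be sketched in plain Python as follows. This is a minimal single-machine illustration, not the Spark implementation the text refers to. The function names, the squared Euclidean distance, the iteration cap, and the "centers stopped changing" stop condition are all assumptions, since the text only says "until corresponding conditions are met". For brevity the sketch also assigns every sample in step (2), including the k samples chosen as initial centers.

```python
import random

def squared_distance(a, b):
    # Squared Euclidean distance (an assumed metric; the text does not name one).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    # Coordinate-wise average of a non-empty list of points.
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def baseline_kmeans(samples, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step (1): pick k samples at random as the initial cluster centers.
    centers = [list(c) for c in rng.sample(samples, k)]
    for _ in range(max_iters):
        # Step (2): assign each sample to the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for s in samples:
            j = min(range(k), key=lambda idx: squared_distance(s, centers[idx]))
            clusters[j].append(s)
        # Step (3): overwrite each center with the mean of its cluster
        # (an empty cluster keeps its old center, which is an assumption).
        new_centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
        # Step (4): repeat until the centers stop changing (assumed condition).
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

Note that step (3) discards the previous center entirely and that every iteration touches all samples; these are the two properties the following paragraph criticizes.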
However, the inventor has found that, in the above manner, in order to ensure convergence, all samples are directly used as learning samples for the new cluster center of a cluster at each iteration (update), which is precisely what makes K-means prone to converging to a local optimum. Moreover, in the above manner, each iteration directly replaces the cluster center with the new value and completely ignores the value of the cluster center before updating, which makes K-means susceptible to interference from outlier samples. In addition, the time complexity of K-means in the above manner is still high when the data size is large.
Based on this, the embodiment of the present application provides a clustering method and apparatus, which are applied to a terminal device based on a Spark framework, so as to at least partially improve the above problem. It should be understood that the clustering method and apparatus provided in this embodiment may also be applied to other frameworks such as MapReduce, which is not limited in this embodiment. This will be explained in detail below.
As shown in fig. 1, which is a block schematic diagram of a terminal device 100 provided in this embodiment of the present application, the terminal device 100 may be any electronic device having a data processing function and a communication function, for example, a server, and specifically, when the terminal device 100 is a server, the terminal device may be a single server or a server cluster formed by a plurality of servers, which is not limited in this embodiment.
In this embodiment, the terminal device 100 may be installed with an operating platform corresponding to Spark, where Spark is a fast and general computing engine specially designed for large-scale data processing, and is an open-source general parallel framework similar to Hadoop MapReduce.
The terminal device 100 includes a clustering apparatus 200, a memory 110, a processor 120, and a communication unit 130.
In the present embodiment, the memory 110, the processor 120 and the communication unit 130 are electrically connected to each other directly or indirectly, so as to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The clustering device 200 includes at least one software functional module that can be stored in the memory 110 in the form of software or firmware (firmware) or solidified in an Operating System (OS) of the terminal device 100. The processor 120 is used for executing executable modules stored in the memory 110, such as software functional modules and computer programs included in the clustering device 200.
The memory 110 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor 120 may be an integrated circuit chip having signal processing capabilities. The processor 120 may be a general-purpose processor, such as a Central Processing Unit (CPU), a Network Processor (NP), a microprocessor, etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components; the processor 120 may also be any conventional processor that may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present invention.
The communication unit 130 is used for establishing a communication connection between the terminal device 100 and other devices to implement data interaction or communication.
It should be understood that the configuration shown in fig. 1 is merely illustrative, and the terminal device 100 may include more or fewer components than those shown in fig. 1, and may have a completely different configuration than that shown in fig. 1. It should be noted that, the components shown in fig. 1 may be implemented by hardware, software or a combination thereof, and the embodiment is not limited thereto.
Fig. 2 is a schematic flow chart of a clustering method provided in an embodiment of the present application, and the clustering method can be applied to the terminal device 100 shown in fig. 1.
Step S201, reading data to be processed including a plurality of samples through Spark, and forming an RDD (Resilient Distributed Dataset) object from the data to be processed.
The RDD is a Spark-specific data model, and the samples are predetermined samples for clustering.
Step S202, randomly determining a first preset number of target samples in the plurality of samples, and respectively using the first preset number of samples as the clustering centers of the first preset number of clusters.
The first preset number is a predetermined required cluster category number, that is, the number of clusters required to be divided.
Step S203, randomly determining a second preset number of samples among the samples except the first preset number of target samples.
And the second preset number is a preset multiple of the first preset number. In this embodiment, the second preset number is a predetermined value and represents how many samples need to be used in each iteration for updating the cluster center of the cluster, and the second preset number may be generally 100-.
For example, if the first predetermined number is k and the predetermined multiple is γ, the second predetermined number is k × γ.
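Steps S202 and S203 can be illustrated with a short plain-Python sketch. It is not Spark code, and the helper name `pick_centers_and_batch` is hypothetical. Sampling without replacement is an assumption; the text only says the samples are determined randomly. The sketch also assumes at least k + k × γ samples are available.

```python
import random

def pick_centers_and_batch(samples, k, gamma, rng):
    # S202: choose k distinct sample indices; those samples become the
    # initial cluster centers (the "target samples").
    center_idx = rng.sample(range(len(samples)), k)
    centers = [list(samples[i]) for i in center_idx]
    # S203: draw k * gamma samples from the samples that were NOT chosen
    # as target samples (drawn without replacement, which is an assumption).
    excluded = set(center_idx)
    remaining = [i for i in range(len(samples)) if i not in excluded]
    batch_idx = rng.sample(remaining, k * gamma)
    batch = [samples[i] for i in batch_idx]
    return centers, center_idx, batch, batch_idx
```

With k = 2 and γ = 3, for instance, each iteration would use 6 samples regardless of how large the full data set is.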
Step S204, for each sample in the second preset number of samples, calculating a distance between the sample and a cluster center of each cluster, and dividing the sample into the cluster where the cluster center closest to the sample is located.
Taking the second preset number k × γ as an example, for each of k × γ samples, calculating the distance from the sample to the current cluster center of each cluster, determining the cluster corresponding to the cluster center with the minimum distance to the sample as a target cluster, and assigning the sample to the target cluster.
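The assignment in step S204 might look like the following plain-Python sketch. Euclidean distance is assumed here because the text does not specify a distance measure, and the function names are hypothetical.

```python
import math

def euclidean(p, q):
    # Assumed distance measure; the text only says "distance".
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_to_nearest(batch, centers):
    # One (initially empty) list of samples per cluster.
    clusters = [[] for _ in centers]
    for sample in batch:
        distances = [euclidean(sample, c) for c in centers]
        # The target cluster is the one whose center is closest to the sample.
        target = distances.index(min(distances))
        clusters[target].append(sample)
    return clusters
```

For centers (0, 0) and (10, 10), the samples (1, 1) and (0, 2) would land in the first cluster and (9, 9) in the second.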
Step S205, calculating the average value of each sample in each cluster as a first average value, and updating the value of the cluster center of the cluster according to the first average value and the current value of the cluster center of the cluster.
In this embodiment, for each cluster, after the first average value of the cluster is obtained, the cluster center of the cluster is updated according to the first average value and the current value of the cluster center of the cluster.
Optionally, in this embodiment, the cluster center update rate may be preset in the terminal device 100. Correspondingly, step S205 may include the steps of:
and calculating the weighted average value of the first average value and the current value of the cluster center of the cluster by taking the update rate of the cluster center as the weight of the first average value and taking the difference value between 1 and the update rate of the cluster center as the weight of the current value of the cluster center of the cluster, and updating the value of the cluster center of the cluster into the obtained weighted average value.
Assuming that the update rate of the cluster center is α, the current value of the cluster center of the cluster is old_center, and the calculated first average value is new_center, the new value center (i.e. the weighted average) of the cluster center of the cluster can be calculated by the following formula:
center = α * new_center + (1 - α) * old_center
The update rate of the cluster center can be controlled according to the actual situation, so that the influence caused by a single poor update based on only part of the samples can be avoided.
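The weighted update can be written as a small plain-Python function. The name `update_center` is hypothetical, and leaving a center unchanged when no batch sample was assigned to its cluster is an assumption; the text does not cover that case.

```python
def update_center(old_center, batch_cluster, alpha):
    # Assumption: a cluster that received no samples keeps its current center.
    if not batch_cluster:
        return list(old_center)
    n = len(batch_cluster)
    # First average value: coordinate-wise mean of the samples in the cluster.
    new_center = [sum(coords) / n for coords in zip(*batch_cluster)]
    # center = alpha * new_center + (1 - alpha) * old_center
    return [alpha * new + (1 - alpha) * old
            for new, old in zip(new_center, old_center)]
```

For example, with old_center = (0, 0), a batch cluster of (2, 2) and (4, 4), and α = 0.8, the first average is (3, 3) and the updated center is approximately (2.4, 2.4).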
Step S206, when the updated value does not satisfy the preset condition, re-determining a second preset number of samples from the samples except the first preset number of target samples, and dividing the re-determined second preset number of samples into corresponding clusters to update the value of the cluster center of each cluster.
In other words, in the present embodiment, the terminal device 100 repeatedly executes steps S203 to S205 until the preset condition is satisfied.
A specific example is given below to explain step S203 to step S206 in detail.
Assume that in this example, steps S203-S206 are repeated 1000 times in total, where the first average value (i.e., new_center) obtained in the i-th calculation is v_i, and the new value of the cluster center (i.e., the center mentioned above) obtained by calculating the weighted average in the i-th iteration is V_i. Assuming that the update rate α of the cluster center is 0.8, V_i can be calculated by the following formula:
V_i = 0.8 * v_i + 0.2 * V_{i-1}
where V_{i-1} is the value of the cluster center before the i-th update (i.e., old_center).
It can be seen that repeatedly executing steps S203-S205 to update the value of the cluster center is in effect an exponentially weighted average. In other words, in the clustering method provided in this embodiment, each update fully takes into account the influence of the previously updated values on the current result, rather than treating each update in isolation, so that convergence to a local optimum can be avoided relatively reliably.
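To see why the repeated update behaves as an exponentially weighted average, the recursion (each new center mixes the latest first average with the previous center) can be unrolled: after n updates, the first average from step i carries weight α(1 - α)^(n - i), and the initial center carries weight (1 - α)^n. The sketch below checks this numerically for scalar values; the function names are hypothetical and scalars are used only to keep the example short.

```python
def ewma_recursive(initial_center, first_averages, alpha):
    # Apply center = alpha * new + (1 - alpha) * old once per iteration.
    center = initial_center
    for v in first_averages:
        center = alpha * v + (1 - alpha) * center
    return center

def ewma_unrolled(initial_center, first_averages, alpha):
    # Closed form: each past first average is weighted by alpha * (1 - alpha)^age.
    n = len(first_averages)
    total = (1 - alpha) ** n * initial_center
    for i, v in enumerate(first_averages, start=1):
        total += alpha * (1 - alpha) ** (n - i) * v
    return total
```

With α = 0.8, a first average from three iterations earlier contributes with weight 0.8 × 0.2³ = 0.0064, so older batches fade quickly but are never ignored outright.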
Optionally, in this embodiment, the preset condition may include:
the average value of the included angle between the value before updating and the value after updating of the clustering center of each cluster is smaller than a preset angle;
and the number of updates of the value of the clustering center of each cluster, performed according to the determined second preset number of samples, reaches a preset number of times.
The preset angle and the preset times can be flexibly set according to actual conditions, and the embodiment does not limit the preset angle and the preset times.
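One possible reading of the two preset conditions is sketched below in plain Python. It assumes that the "included angle" is the angle between the old and new center vectors measured from the origin, expressed in radians, and that satisfying either condition is enough to stop. The function names and the handling of a zero-length center vector are assumptions.

```python
import math

def angle_between(u, v):
    # Angle (in radians) between two center vectors, measured from the origin.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 0.0  # assumption: a zero vector is treated as "no change"
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def should_stop(old_centers, new_centers, preset_angle, updates_done, preset_times):
    # Condition 1: mean angle between old and new centers below the preset angle.
    mean_angle = sum(angle_between(o, n)
                     for o, n in zip(old_centers, new_centers)) / len(old_centers)
    # Condition 2: the number of updates performed has reached the preset number.
    return mean_angle < preset_angle or updates_done >= preset_times
```

An angle-based test is insensitive to a center moving directly toward or away from the origin, so the iteration cap acts as a backstop in this sketch.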
In addition, the inventors have also found that it may be difficult to ensure complete convergence if the clustering centers of the respective clusters are always iteratively updated with only a portion of the samples. Therefore, after the preset condition is reached in the foregoing manner, the clustering centers of the respective clusters may be further updated with all the samples. After the foregoing steps have been performed, only a few further updates (typically fewer than 10) based on all samples are needed, so the added time complexity is negligible.
Based on this, as shown in fig. 3, the clustering method provided in this embodiment may further include step S301 and step S302.
Step S301, when any one of the preset conditions is met, updating the clustering center of each cluster according to the plurality of samples.
Wherein, step S301 can be implemented by the following sub-steps:
calculating the distance between the sample and the clustering center of each cluster for each sample in the plurality of samples, and dividing the sample into the cluster where the clustering center closest to the sample is located;
and calculating the average value of each sample in each cluster as a second average value, and updating the value of the cluster center of the cluster as the second average value.
The second average value is the average value of the samples in each cluster, calculated after all samples (each sample in the plurality of samples) have been divided into clusters.
Step S302, when the preset condition is not met after the updating, updating the clustering centers of the clusters again according to the plurality of samples.
In this embodiment, the iterative update based on all samples continues until the preset condition is satisfied, and then stops. Based on the above analysis, typically fewer than 10 updates are performed, so the time complexity is negligible.
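Putting the two phases together, the overall flow (steps S202 to S206 on random subsets, followed by steps S301 and S302 on all samples) might be sketched in plain Python as below. This is a single-machine illustration under several assumptions, not the Spark implementation: Euclidean distance, angles measured from the origin in radians, empty clusters keeping their centers, and the same preset angle and preset number of times being reused for both phases. All function names are hypothetical.

```python
import math
import random

def nearest(sample, centers):
    # Index of the center with the smallest (assumed Euclidean) distance.
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return min(range(len(centers)), key=lambda j: dist(sample, centers[j]))

def cluster_means(samples, centers):
    # Assign samples to their nearest center; return each cluster's mean,
    # or None for a cluster that received no samples.
    groups = [[] for _ in centers]
    for s in samples:
        groups[nearest(s, centers)].append(s)
    return [[sum(c) / len(g) for c in zip(*g)] if g else None for g in groups]

def mean_angle(old, new):
    # Average angle between each old center and its updated value.
    total = 0.0
    for u, v in zip(old, new):
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        cos = sum(a * b for a, b in zip(u, v)) / norm if norm else 1.0
        total += math.acos(max(-1.0, min(1.0, cos)))
    return total / len(old)

def run_phase(centers, draw_samples, blend, preset_angle, preset_times):
    # Repeat "assign, average, update" until either preset condition holds.
    for _ in range(preset_times):
        means = cluster_means(draw_samples(), centers)
        updated = [blend(m, c) if m is not None else c
                   for m, c in zip(means, centers)]
        angle = mean_angle(centers, updated)
        centers = updated
        if angle < preset_angle:
            break
    return centers

def cluster(samples, k, gamma, alpha, preset_angle, preset_times, seed=0):
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(samples)), k))            # S202
    centers = [list(samples[i]) for i in chosen]
    rest = [s for i, s in enumerate(samples) if i not in chosen]
    batch_size = min(k * gamma, len(rest))
    # Phase 1 (S203-S206): random subsets, weighted center update.
    centers = run_phase(
        centers,
        lambda: rng.sample(rest, batch_size),
        lambda m, c: [alpha * x + (1 - alpha) * y for x, y in zip(m, c)],
        preset_angle, preset_times)
    # Phase 2 (S301-S302): all samples, center replaced by the second average.
    return run_phase(centers, lambda: samples, lambda m, c: m,
                     preset_angle, preset_times)
```

A call such as `cluster(data, k=2, gamma=4, alpha=0.5, preset_angle=1e-3, preset_times=30)` returns two centers; the parameter values here are illustrative only.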
As shown in fig. 4, an embodiment of the present application further provides a clustering apparatus 200, which is applied to the terminal device 100 shown in fig. 1. The clustering device 200 includes a data reading module 210, a center determining module 220, a sample selecting module 230, a dividing module 240, and a first updating module 250.
The data reading module 210 is configured to read data to be processed including a plurality of samples through the Spark framework, and form the data to be processed into an RDD object.
In the present embodiment, the description of the data reading module 210 may refer to the detailed description of step S201 shown in fig. 2, that is, step S201 may be performed by the data reading module 210.
The center determining module 220 is configured to randomly determine a first preset number of target samples among the plurality of samples, and use the first preset number of samples as the clustering centers of the first preset number of clusters, respectively.
In the present embodiment, the description about the center determining module 220 may refer to the detailed description of step S202 shown in fig. 2, that is, step S202 may be performed by the center determining module 220.
The sample selecting module 230 is configured to randomly determine a second preset number of samples from the samples except for the first preset number of target samples, where the second preset number is a preset multiple of the first preset number.
In the present embodiment, the description of the sample selecting module 230 may refer to the detailed description of step S203 shown in fig. 2, that is, step S203 may be performed by the sample selecting module 230.
The dividing module 240 is configured to calculate, for each sample in the second preset number of samples, a distance between the sample and a cluster center of each cluster, and divide the sample into the cluster where the cluster center closest to the sample is located.
In the present embodiment, the description about the dividing module 240 may specifically refer to the detailed description of step S204 shown in fig. 2, that is, step S204 may be performed by the dividing module 240.
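The division step can be illustrated with the following sketch. It assumes Euclidean distance, which this section does not prescribe; another distance measure could be substituted without changing the structure.

```python
import math


def assign_to_nearest(batch, centers):
    """Group samples by the index of their nearest center.

    Uses Euclidean distance via math.dist (Python 3.8+); the choice of
    distance is an assumption of this sketch.
    """
    clusters = [[] for _ in centers]
    for sample in batch:
        distances = [math.dist(sample, center) for center in centers]
        clusters[distances.index(min(distances))].append(sample)
    return clusters
```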
The first updating module 250 is configured to calculate the average of the samples in each cluster as a first average value, and to update the value of the cluster center of that cluster according to the first average value and the current value of the cluster center. When the preset condition is not met after the update, the module again determines a second preset number of samples from the samples other than the first preset number of target samples, and divides the newly determined samples into the corresponding clusters to update the value of the cluster center of each cluster.
In the present embodiment, the description about the first updating module 250 may specifically refer to the detailed description of step S205 shown in fig. 2, that is, step S205 may be executed by the first updating module 250.
The preset condition may include:
the average of the included angles between the pre-update value and the post-update value of the cluster center of each cluster is smaller than a preset angle;
and the number of times the value of the cluster center of each cluster has been updated according to the determined second preset number of samples reaches a preset number of times.
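The two conditions can be checked as in the sketch below. It assumes the cluster centers are non-zero vectors, measures angles in radians, and treats the second condition as an update counter reaching a limit. All function and parameter names are hypothetical.

```python
import math


def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def should_stop(old_centers, new_centers, max_angle, updates_done, max_updates):
    """True if the mean center rotation is below max_angle
    or the number of updates has reached max_updates."""
    angles = [angle_between(o, n) for o, n in zip(old_centers, new_centers)]
    mean_angle = sum(angles) / len(angles)
    return mean_angle < max_angle or updates_done >= max_updates
```

Either condition alone ends the sampled updates, which matches the later wording "when any one of the preset conditions is met".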
Optionally, the terminal device 100 may preset a cluster center update rate.
Correspondingly, the first updating module 250 may update the value of the cluster center of the cluster according to the first average value and the current value of the cluster center of the cluster as follows:
taking the cluster center update rate as the weight of the first average value and the difference between 1 and the cluster center update rate as the weight of the current value of the cluster center of the cluster, calculating the weighted average of the first average value and the current value of the cluster center, and updating the value of the cluster center of the cluster to the obtained weighted average.
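Written out, the update is new center = rate × first average + (1 − rate) × current center. A small sketch, with `rate` standing for the cluster center update rate and the function names chosen for illustration:

```python
def mean_of(samples):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def update_center(center, batch_mean, rate):
    """Blend the current center with the batch mean:
    rate * batch_mean + (1 - rate) * center, component by component."""
    return [rate * m + (1.0 - rate) * c for c, m in zip(center, batch_mean)]


first_mean = mean_of([[2.0, 4.0], [4.0, 8.0]])            # [3.0, 6.0]
new_center = update_center([1.0, 2.0], first_mean, 0.25)  # [1.5, 3.0]
```

A small rate moves the center only a short way toward each batch mean, which damps the noise introduced by using a random subset of the samples.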
Optionally, the clustering device 200 may further include a second updating module 260.
The second updating module 260 is configured to update the cluster center of each cluster according to the plurality of samples when any one of the preset conditions is met, and, when the preset condition is not met after that update, to update the cluster center of each cluster according to the plurality of samples again.
In the present embodiment, the description of the second updating module 260 can refer to the detailed description of step S301 and step S302 shown in fig. 3, that is, step S301 and step S302 can be executed by the second updating module 260.
Optionally, in this embodiment, the manner in which the second updating module 260 updates the cluster center of each cluster according to the multiple samples may be:
calculating the distance between the sample and the clustering center of each cluster for each sample in the plurality of samples, and dividing the sample into the cluster where the clustering center closest to the sample is located;
and calculating the average of the samples in each cluster as a second average value, and updating the value of the cluster center of that cluster to the second average value.
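One full-sample update of this kind can be sketched as below, again assuming Euclidean distance and an in-memory list of samples. Keeping the old center for a cluster that receives no samples is an assumption of the sketch and is not stated in the text.

```python
import math


def full_update(samples, centers):
    """One pass over all samples: assign each sample to its nearest center,
    then replace every center with the mean of its cluster."""
    clusters = [[] for _ in centers]
    for sample in samples:
        nearest = min(range(len(centers)),
                      key=lambda j: math.dist(sample, centers[j]))
        clusters[nearest].append(sample)

    new_centers = []
    for center, members in zip(centers, clusters):
        if members:
            # Second average value: component-wise mean of the cluster.
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(list(center))
    return new_centers
```

Unlike the sampled update, this step replaces the center outright instead of blending it with the old value.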
In summary, the clustering method and the clustering device provided by the embodiments of the present application are applied to a Spark framework-based terminal device. The terminal device reads data to be processed, which includes a plurality of samples, through the Spark framework, and forms the read data into an RDD object. It then randomly determines a first preset number of target samples among the plurality of samples to serve as the cluster centers of a first preset number of clusters, randomly determines a second preset number of samples from the samples other than the first preset number of target samples, and divides the second preset number of samples into the corresponding clusters. Next, it calculates the average of the samples in each cluster as a first average value and updates the value of the cluster center of that cluster according to the first average value and the current value of the cluster center, repeating the sampling and updating until a preset condition is met. Because each update uses only a subset of the samples rather than all of them, the computational complexity is reduced, and the tendency to converge to a local optimum can be avoided to a certain extent.
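The summarized procedure can be put together in one short, single-machine sketch. It is not the Spark implementation: it assumes an in-memory list of samples, Euclidean distance, angles in radians, and that a cluster receiving no samples in a batch keeps its center. All names and parameters are hypothetical.

```python
import math
import random


def mean_angle(old, new):
    """Average angle (radians) between corresponding old and new centers."""
    total = 0.0
    for u, v in zip(old, new):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.hypot(*u) * math.hypot(*v)
        # Assumption: a zero vector contributes an angle of 0.
        total += math.acos(max(-1.0, min(1.0, dot / norm))) if norm else 0.0
    return total / len(old)


def minibatch_kmeans(samples, k, multiple, rate, max_angle, max_updates, seed=0):
    """Sampled center updates until the angle or update-count condition is met."""
    rng = random.Random(seed)
    targets = rng.sample(range(len(samples)), k)      # first preset number
    centers = [list(samples[i]) for i in targets]
    excluded = set(targets)
    pool = [i for i in range(len(samples)) if i not in excluded]
    batch_size = min(k * multiple, len(pool))         # second preset number

    for _ in range(max_updates):                      # update-count condition
        batch = [samples[i] for i in rng.sample(pool, batch_size)]
        clusters = [[] for _ in range(k)]
        for s in batch:                               # nearest-center division
            j = min(range(k), key=lambda c: math.dist(s, centers[c]))
            clusters[j].append(s)

        old = [list(c) for c in centers]
        for j, members in enumerate(clusters):
            if not members:
                continue                              # assumption: keep center
            mean = [sum(col) / len(members) for col in zip(*members)]
            centers[j] = [rate * m + (1 - rate) * c
                          for m, c in zip(mean, centers[j])]

        if mean_angle(old, centers) < max_angle:      # angle condition
            break
    return centers
```

Each iteration touches only `k * multiple` samples, so the cost per update does not grow with the total number of samples.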
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
If the functions are implemented in the form of software functional modules and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A clustering method, applied to a Spark-based terminal device, the method comprising:
reading data to be processed comprising a plurality of samples through Spark, and forming the data to be processed into an RDD object;
randomly determining a first preset number of target samples in the plurality of samples, and respectively using the first preset number of target samples as the clustering centers of a first preset number of clusters;
randomly determining a second preset number of samples from the samples except the first preset number of target samples, wherein the second preset number is a preset multiple of the first preset number;
for each sample in the second preset number of samples, calculating the distance between the sample and the clustering center of each cluster, and dividing the sample into the cluster where the clustering center closest to the sample is located;
calculating the average value of each sample in each cluster as a first average value, and updating the value of the cluster center of the cluster according to the first average value and the current value of the cluster center of the cluster;
and when the updated value does not meet the preset condition, determining a second preset number of samples in the samples except the first preset number of target samples, and dividing the determined second preset number of samples into corresponding clusters to update the value of the clustering center of each cluster.
2. The clustering method according to claim 1, wherein the preset condition comprises:
the average value of the included angle between the value before updating and the value after updating of the clustering center of each cluster is smaller than a preset angle;
and the number of times the value of the clustering center of each cluster has been updated according to the determined second preset number of samples reaches a preset number of times.
3. The clustering method according to claim 2, wherein a cluster center update rate is preset in the terminal device;
updating the value of the cluster center of the cluster according to the first average value and the current value of the cluster center of the cluster, including:
and calculating the weighted average value of the first average value and the current value of the cluster center of the cluster by taking the update rate of the cluster center as the weight of the first average value and taking the difference value between 1 and the update rate of the cluster center as the weight of the current value of the cluster center of the cluster, and updating the value of the cluster center of the cluster into the obtained weighted average value.
4. A clustering method according to claim 2 or 3, characterized in that the method further comprises:
when any one of the preset conditions is met, updating the clustering centers of the clusters according to the plurality of samples;
and when the preset condition is not met after updating, updating the clustering centers of the clusters again according to the samples.
5. The clustering method according to claim 4, wherein updating the cluster center of each cluster according to the plurality of samples comprises:
calculating the distance between the sample and the clustering center of each cluster for each sample in the plurality of samples, and dividing the sample into the cluster where the clustering center closest to the sample is located;
and calculating the average value of each sample in each cluster as a second average value, and updating the value of the cluster center of the cluster as the second average value.
6. A clustering device applied to a Spark framework-based terminal device, the device comprising:
a data reading module, configured to read data to be processed comprising a plurality of samples through the Spark framework, and to form the data to be processed into an RDD object;
a center determining module, configured to randomly determine a first preset number of target samples from the multiple samples, and use the first preset number of target samples as a clustering center of a first preset number of clusters respectively;
a sample selection module, configured to randomly determine a second preset number of samples from samples, other than the first preset number of target samples, in the plurality of samples, where the second preset number is a preset multiple of the first preset number;
the dividing module is used for calculating the distance between the sample and the clustering center of each cluster aiming at each sample in the second preset number of samples, and dividing the sample into the cluster where the clustering center closest to the sample is located;
the first updating module is used for calculating the average value of each sample in each cluster to serve as a first average value, updating the value of the clustering center of the cluster according to the first average value and the current value of the clustering center of the cluster, determining a second preset number of samples in the samples except the first preset number of target samples when the updating does not meet a preset condition, and dividing the second preset number of samples into corresponding clusters to update the value of the clustering center of each cluster.
7. The clustering device according to claim 6, wherein the preset condition comprises:
the average value of the included angle between the value before updating and the value after updating of the clustering center of each cluster is smaller than a preset angle;
and the number of times the value of the clustering center of each cluster has been updated according to the determined second preset number of samples reaches a preset number of times.
8. The clustering device according to claim 7, wherein a cluster center update rate is preset in the terminal device; the first updating module updates the value of the cluster center of the cluster according to the first average value and the current value of the cluster center of the cluster in the following way:
and calculating the weighted average value of the first average value and the current value of the cluster center of the cluster by taking the update rate of the cluster center as the weight of the first average value and taking the difference value between 1 and the update rate of the cluster center as the weight of the current value of the cluster center of the cluster, and updating the value of the cluster center of the cluster into the obtained weighted average value.
9. The clustering device according to claim 7 or 8, characterized in that the clustering device further comprises:
and the second updating module is used for updating the clustering center of each cluster according to the plurality of samples when any one of the preset conditions is met, and updating the clustering center of each cluster according to the plurality of samples again when the preset conditions are not met after the updating.
10. The clustering device according to claim 9, wherein the second updating module updates the clustering center of each cluster according to the plurality of samples in a manner that:
calculating the distance between the sample and the clustering center of each cluster for each sample in the plurality of samples, and dividing the sample into the cluster where the clustering center closest to the sample is located;
and calculating the average value of each sample in each cluster as a second average value, and updating the value of the cluster center of the cluster as the second average value.
CN201810482717.2A 2018-05-18 2018-05-18 Clustering method and device Active CN108805174B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810482717.2A CN108805174B (en) 2018-05-18 2018-05-18 Clustering method and device

Publications (2)

Publication Number Publication Date
CN108805174A (en) 2018-11-13
CN108805174B (en) 2022-03-29

Family

ID=64091236

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant