CN112766403A

CN112766403A - Incremental clustering method and device based on information gain weight

Info

Publication number: CN112766403A
Application number: CN202110123316.XA
Authority: CN
Inventors: 张子瑛; 杨强; 陈晓科; 范颖; 梁敏玲
Original assignee: Electric Power Research Institute of Guangdong Power Grid Co Ltd
Current assignee: Electric Power Research Institute of Guangdong Power Grid Co Ltd
Priority date: 2020-12-29
Filing date: 2021-01-28
Publication date: 2021-05-07

Abstract

The invention discloses an incremental clustering method and device based on information gain weight, which specifically comprise the following steps: calculating the classification contribution rate of each feature according to the initial data attribute feature information gain weight; respectively calculating the intra-class distances from the initial data to the initial clustering center according to the classification contribution rates, and performing iterative combination on classes of which the intra-class distances are smaller than a distance threshold value to obtain the clustering centers of all classes and the maximum intra-class distance; respectively calculating the distance from the newly added data points to the clustering centers according to the classification contribution rate, and determining the minimum distance and the corresponding clustering centers; and when the minimum distance is less than or equal to the maximum intra-class distance of the corresponding clustering center, merging the newly added data points into the class of the corresponding clustering center, and when the minimum distance is greater than the maximum intra-class distance of the corresponding clustering center, determining the newly added data points as an independent class. The method utilizes the information gain weight to calculate the intra-class distance, and sets the intra-class distance as the incremental data classification threshold, so that the robustness of the incremental clustering method is improved.

Description

Incremental clustering method and device based on information gain weight

Technical Field

The invention relates to the technical field of data processing, in particular to an incremental clustering method and device based on information gain weight.

Background

As a branch of statistics, clustering is a machine learning method that continuously adjusts its own model by observing and learning, and is currently widely used in the fields of network intrusion detection, image recognition, and the like. With the coming of big data era, in order to overcome the limitation of the traditional clustering algorithm in large-scale data calculation, technicians perform incremental reconstruction by using the existing algorithm model, an incremental clustering method is provided: when a batch of clustering results exist, if data is newly added, only the newly added data is clustered, and the existing clustering results are incrementally modified without re-clustering the whole data set after the newly added data.

The existing clustering method comprises the steps of clustering newly added data by using an expansion vectorization method, firstly setting a threshold value, when the minimum distance between the newly added data and the existing central point is smaller than the threshold value, classifying the newly added data into the existing class, otherwise, using the newly added data as an independent class object, and because the method needs to manually appoint the threshold value, the robustness is poor.

Disclosure of Invention

Aiming at the technical problems, the invention provides an incremental clustering method and device based on information gain weight, which eliminate the influence caused by manually setting the threshold value and effectively improve the robustness of the method by setting the maximum radius of each category as an incremental threshold value.

The embodiment of the invention provides an incremental clustering method based on information gain weight, which comprises the following steps:

calculating the classification contribution rate of each characteristic according to the information gain weight of the initial data attribute characteristic;

respectively calculating the intra-class distances from the initial data to the initial clustering centers according to the classification contribution rates, and carrying out iterative combination on classes of which the intra-class distances are smaller than a distance threshold value to obtain clustering results, wherein the clustering results comprise the clustering centers of all classes and the maximum intra-class distance;

respectively calculating the distance from the newly added data points to the clustering centers according to the classification contribution rate, and determining the minimum distance and the corresponding clustering centers;

and when the minimum distance is smaller than or equal to the maximum intra-class distance of the corresponding clustering center, merging the newly added data points into the class of the corresponding clustering center, and when the minimum distance is larger than the maximum intra-class distance of the corresponding clustering center, determining the newly added data points as a single class.

In one embodiment, the information gain weight of the initial data attribute feature is determined according to the information entropy of the initial data.

In one embodiment, the classification contribution rate ∈ of each feature_iDetermined according to the following formula:

where W (T) is the information gain weight of the attribute feature T.

In a certain embodiment, the calculating the intra-class distance from the initial data to the initial clustering center according to the classification contribution ratio includes:

in one embodiment, the continuous values of the initial data attributes are discretized.

In one embodiment, the distance threshold comprises a minimum intra-class distance of intra-class distances of the initial data to the initial cluster center.

The embodiment of the present invention further provides an incremental clustering device based on information gain weight, including:

the first initialization unit is used for calculating the classification contribution rate of each characteristic according to the information gain weight of the initial data attribute characteristic;

the second initialization unit is used for respectively calculating the intra-class distances from the initial data to the initial clustering centers according to the classification contribution rates, and performing iterative combination on the classes with the intra-class distances smaller than a distance threshold value to obtain clustering results, wherein the clustering results comprise the clustering centers of all classes and the maximum intra-class distance;

the data calculation unit is used for respectively calculating the distances from the newly added data points to the clustering centers according to the classification contribution rates, and determining the minimum distance and the corresponding clustering centers;

and the data clustering unit is used for merging the newly added data points into the category of the corresponding clustering center when the minimum distance is less than or equal to the maximum intra-category distance of the corresponding clustering center, and determining the newly added data points as a single category when the minimum distance is greater than the maximum intra-category distance of the corresponding clustering center.

An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the method according to any of the above embodiments.

Compared with the prior art, the embodiment of the invention has the beneficial effects that:

the incremental clustering method and device based on the information gain weight fully consider the contributions of different attribute characteristics to clustering, utilize the information gain proportional weight to calculate the distance from data to a clustering center, and eliminate the influence of artificially appointed threshold values by taking the maximum intra-class distance of each class in a clustering result as an incremental data classification threshold value, thereby improving the robustness of the incremental clustering method.

Drawings

In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic flow chart of an incremental clustering method based on information gain weighting according to an embodiment of the present invention;

fig. 2 is a schematic structural diagram of an incremental clustering device based on information gain weights according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be understood that the step numbers used herein are for convenience of description only and are not intended as limitations on the order in which the steps are performed.

It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.

As shown in fig. 1, an embodiment of the present invention provides an incremental clustering method based on information gain weights, which includes the following steps.

S11: and calculating the classification contribution rate of each characteristic according to the information gain weight of the initial data attribute characteristic.

In this embodiment, the continuous values of the initial data attribute are discretized.

In this embodiment, first, clustering parameters are set, including an initial data set X, an attribute set T, and a classification category Y, where T ═ T { (T)₁,T₂,…,T_m}，1≤i≤m。

For the initial data set X, the data set X can be correspondingly divided into sets C according to different values of the classification category Y, where C ═ C₁,C₂,…,C_n}; and simultaneously, the attribute set T can be correspondingly divided into v subsets according to different values of the attribute set T, and then X is equal to { X ═ X₁,X₂,…,X_v}。

In this embodiment, the characteristic T for the initial data_iIts information gain weight W (T)_i) May be determined from the initial data information entropy.

Specifically, the information gain weight W (T)_i)＝IG(T_i)/IS(T_i) Wherein IG (T)_i) Is a characteristic T_iCan be determined by the following formula:

IG(T_i)＝I(C)-I(C|T_i)

in the formula (I), the compound is shown in the specification,

i (C | t) and

then, the information entropy I according to the data is obtained, specifically:

in the formula

Wherein the content of the first and second substances,

for IS (T)_i) Can be represented by formula

Determination, wherein the determination is made according to Bernoulli's law of large numbers

Further, according to the information gain weight W (T)_i) Get the characteristic T_iThe classification contribution rate of-_iComprises the following steps:

s12: and respectively calculating the intra-class distances from the initial data to the initial clustering centers according to the classification contribution rates, and performing iterative combination on classes of which the intra-class distances are smaller than a distance threshold value to obtain clustering results, wherein the clustering results comprise the clustering centers of all classes and the maximum intra-class distance.

In this embodiment, the distance threshold comprises the smallest of the intra-class distances of the initial data to the initial cluster center.

In this embodiment, the cluster category is set as d, and an initial cluster center set center is randomly selected from the initial data set X, and the center is defined as { ct ═ ct_jJ is more than or equal to 1 and less than or equal to d, and meanwhile, a similar cohesion class R is set as { R ═ R { (R) }_jJ is more than or equal to 1 and less than or equal to d }, the iteration frequency is t, the clustering precision is elig, and R is initialized_j＝0。

In the iterative process, the data points X in the initial data set X to the initial cluster centers ct are calculated according to the following formula_jThe distance of (c):

obtaining the minimum Min (dist (x, c) in the distance_j)_j) And merging the data points x into the category according to the category, and simultaneously updating the cluster center of the category and the intra-class distance from each data point to the cluster center.

Wherein the updated cluster center ct_j＝{ct_j，iI is more than or equal to 1 and less than or equal to d, wherein

And when the updated cluster center point is deviated, returning to execute S12 until the updated cluster center point is not deviated any more or the iteration is completed, and recording the maximum intra-class distance of each class.

In the embodiment, the cluster center point of the current iteration is used for subtracting the cluster center point set of the previous iteration, the difference values are compared with the clustering precision elig, if the difference values are less than or equal to elig, the updated cluster center point does not deviate, and the iteration is finished; and if the difference value is larger than elig, shifting the updated cluster center point.

And after the iteration is finished, obtaining a clustering result, wherein the clustering result comprises the clustering center and the maximum intra-class distance of each class.

S13: and respectively calculating the distance from the newly added data point to the clustering center according to the classification contribution rate, and determining the minimum distance and the corresponding clustering center.

In this embodiment, for the newly added data point x_nCalculating the cluster center distance between the point and each category in the cluster result according to the following formula:

and according to new _ Min ═ Min (rect (x)_n,c_j)_j) Obtaining the minimum distance new _ min, wherein j is more than or equal to 1 and less than or equal to d.

S14: and when the minimum distance is smaller than or equal to the maximum intra-class distance of the corresponding clustering center, merging the newly added data points into the class of the corresponding clustering center, and when the minimum distance is larger than the maximum intra-class distance of the corresponding clustering center, determining the newly added data points as a single class.

In this embodiment, the minimum distance new _ min is compared with the maximum intra-class distance of the corresponding class, and when new _ min is less than or equal to the maximum intra-class distance, the newly added data point is classified into the class, otherwise, the newly added data point is set as a separate class.

For different attributes of a data set, the information gain of the attribute is the largest, and the more uncertain information is contained, the more beneficial the clustering is, the embodiment of the invention calculates the distance from the data to the clustering center by using the information gain proportion weight of the attribute characteristics, introduces the attribute information gain into the clustering, and simultaneously avoids the influence of the artificially specified threshold on the clustering effect and effectively improves the robustness of the incremental clustering method by taking the maximum intra-class distance of each class in the clustering result as the classification judgment threshold of the incremental data.

As shown in fig. 2, an embodiment of the present invention further provides an incremental clustering apparatus based on information gain weights, which includes a first initializing unit 101, a second initializing unit 102, a data calculating unit 103, and a data clustering unit 104.

The first initialization unit 101 is configured to calculate a classification contribution rate of each feature according to an information gain weight of the initial data attribute feature.

The second initialization unit 102 is configured to calculate intra-class distances from the initial data to an initial clustering center according to the classification contribution rates, and perform iterative combination on classes with intra-class distances smaller than a distance threshold to obtain a clustering result, where the clustering result includes the clustering centers of each class and a maximum intra-class distance.

The data calculating unit 103 is configured to calculate distances from the newly added data points to the cluster centers according to the classification contribution rates, and determine a minimum distance and a corresponding cluster center.

The data clustering unit 104 is configured to merge the newly added data points into the category of the corresponding clustering center when the minimum distance is less than or equal to the maximum intra-category distance of the corresponding clustering center, and determine the new data points as a single category when the minimum distance is greater than the maximum intra-category distance of the corresponding clustering center.

Because the content of information interaction, execution process, and the like among the units in the device is based on the same concept as the method embodiment of the present invention, specific content can be referred to the description in the method embodiment of the present invention, and is not described herein again.

It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium, and may include the processes of the embodiments of the methods when executed. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims

1. An incremental clustering method based on information gain weight is characterized by comprising the following steps:

2. The incremental clustering method based on information gain weight according to claim 1, wherein the information gain weight of the attribute feature of the initial data is determined according to the information entropy of the initial data.

3. The incremental clustering method based on information gain weight according to claim 2, wherein the classification contribution rate of each feature ℃_iDetermined according to the following formula:

where W (T) is the information gain weight of the attribute feature T.

4. The incremental clustering method based on information gain weight according to claim 1, wherein the intra-class distances from the initial data to an initial clustering center are respectively calculated according to the classification contribution rates, and specifically:

5. the incremental clustering method based on information gain weight according to claim 1, wherein the continuous values of the initial data attribute are discretized.

6. The incremental clustering method based on information gain weight according to claim 1, wherein the distance threshold comprises a smallest intra-class distance among intra-class distances of the initial data to an initial cluster center.

7. An incremental clustering device based on information gain weight, comprising:

8. The incremental clustering device based on information gain weight of claim 7, wherein the information gain weight of the attribute feature of the initial data is determined according to the information entropy of the initial data.

9. The incremental clustering apparatus based on information gain weight according to claim 7, wherein the distance threshold comprises a smallest intra-class distance among intra-class distances of the initial data to an initial cluster center.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1 to 6.