CN112766403A - Incremental clustering method and device based on information gain weight - Google Patents

Incremental clustering method and device based on information gain weight

Info

Publication number
CN112766403A
CN112766403A
Authority
CN
China
Prior art keywords
clustering
class
distance
intra
information gain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110123316.XA
Other languages
Chinese (zh)
Inventor
张子瑛
杨强
陈晓科
范颖
梁敏玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Electric Power Research Institute of Guangdong Power Grid Co Ltd
Original Assignee
Electric Power Research Institute of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Electric Power Research Institute of Guangdong Power Grid Co Ltd filed Critical Electric Power Research Institute of Guangdong Power Grid Co Ltd
Publication of CN112766403A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23211 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with adaptive number of clusters

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an incremental clustering method and device based on information gain weights. The method specifically comprises the following steps: calculating the classification contribution rate of each feature according to the information gain weights of the attribute features of the initial data; calculating the intra-class distances from the initial data to the initial clustering centers according to the classification contribution rates, and iteratively merging classes whose intra-class distances are smaller than a distance threshold to obtain the clustering center and maximum intra-class distance of each class; calculating the distances from a newly added data point to the clustering centers according to the classification contribution rates, and determining the minimum distance and the corresponding clustering center; and, when the minimum distance is less than or equal to the maximum intra-class distance of the corresponding clustering center, merging the newly added data point into the class of that clustering center, or, when the minimum distance is greater than the maximum intra-class distance of the corresponding clustering center, treating the newly added data point as an independent class. The method uses the information gain weights to calculate the intra-class distances and sets the maximum intra-class distance as the classification threshold for incremental data, thereby improving the robustness of the incremental clustering method.

Description

Incremental clustering method and device based on information gain weight
Technical Field
The invention relates to the technical field of data processing, in particular to an incremental clustering method and device based on information gain weight.
Background
As a branch of statistics, clustering is a machine learning method that continuously adjusts its own model through observation and learning, and is currently widely used in fields such as network intrusion detection and image recognition. With the arrival of the big data era, in order to overcome the limitations of traditional clustering algorithms on large-scale data, technicians have incrementally reconstructed existing algorithm models and proposed incremental clustering methods: when a clustering result already exists for a batch of data, newly added data are clustered on their own and the existing clustering result is modified incrementally, without re-clustering the whole data set after the data are added.
An existing clustering method clusters newly added data by an extended vectorization approach: a threshold is set first; when the minimum distance between a newly added data point and the existing center points is smaller than the threshold, the point is assigned to the existing class, and otherwise it is treated as an independent class. Because this method requires the threshold to be specified manually, its robustness is poor.
Disclosure of Invention
In view of the above technical problems, the invention provides an incremental clustering method and device based on information gain weights, which eliminate the influence of a manually set threshold and effectively improve the robustness of the method by using the maximum radius of each category as the incremental threshold.
The embodiment of the invention provides an incremental clustering method based on information gain weight, which comprises the following steps:
calculating the classification contribution rate of each characteristic according to the information gain weight of the initial data attribute characteristic;
respectively calculating the intra-class distances from the initial data to the initial clustering centers according to the classification contribution rates, and carrying out iterative combination on classes of which the intra-class distances are smaller than a distance threshold value to obtain clustering results, wherein the clustering results comprise the clustering centers of all classes and the maximum intra-class distance;
respectively calculating the distance from the newly added data points to the clustering centers according to the classification contribution rate, and determining the minimum distance and the corresponding clustering centers;
and when the minimum distance is smaller than or equal to the maximum intra-class distance of the corresponding clustering center, merging the newly added data points into the class of the corresponding clustering center, and when the minimum distance is larger than the maximum intra-class distance of the corresponding clustering center, determining the newly added data points as a single class.
In one embodiment, the information gain weight of the initial data attribute feature is determined according to the information entropy of the initial data.
In one embodiment, the classification contribution rate ε_i of each feature is determined according to the following formula:
[formula given only as an image in the original publication]
where W (T) is the information gain weight of the attribute feature T.
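The formula for ε_i is not reproduced in the text. A plausible form, assuming that the contribution rate simply normalizes the information gain weights over the m attribute features (an assumption, not confirmed by the visible text), would be ε_i = W(T_i) / (W(T_1) + W(T_2) + … + W(T_m)).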
In one embodiment, the intra-class distances from the initial data to the initial clustering centers are calculated according to the classification contribution rates, specifically by the following formula:
[formula given only as an image in the original publication]
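This distance formula is likewise given only as an image. A plausible reading, assuming a weighted Euclidean distance in which each attribute dimension is scaled by its classification contribution rate, would be dist(x, ct_j) = sqrt( Σ_i ε_i · (x_i − ct_{j,i})² ); the illustrative sketches in the detailed description below use this assumed form.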
in one embodiment, the continuous values of the initial data attributes are discretized.
In one embodiment, the distance threshold comprises a minimum intra-class distance of intra-class distances of the initial data to the initial cluster center.
The embodiment of the present invention further provides an incremental clustering device based on information gain weight, including:
the first initialization unit is used for calculating the classification contribution rate of each characteristic according to the information gain weight of the initial data attribute characteristic;
the second initialization unit is used for respectively calculating the intra-class distances from the initial data to the initial clustering centers according to the classification contribution rates, and performing iterative combination on the classes with the intra-class distances smaller than a distance threshold value to obtain clustering results, wherein the clustering results comprise the clustering centers of all classes and the maximum intra-class distance;
the data calculation unit is used for respectively calculating the distances from the newly added data points to the clustering centers according to the classification contribution rates, and determining the minimum distance and the corresponding clustering centers;
and the data clustering unit is used for merging the newly added data points into the category of the corresponding clustering center when the minimum distance is less than or equal to the maximum intra-category distance of the corresponding clustering center, and determining the newly added data points as a single category when the minimum distance is greater than the maximum intra-category distance of the corresponding clustering center.
In one embodiment, the information gain weight of the initial data attribute feature is determined according to the information entropy of the initial data.
In one embodiment, the distance threshold comprises a minimum intra-class distance of intra-class distances of the initial data to the initial cluster center.
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the method according to any of the above embodiments.
Compared with the prior art, the embodiment of the invention has the beneficial effects that:
the incremental clustering method and device based on the information gain weight fully consider the contributions of different attribute characteristics to clustering, utilize the information gain proportional weight to calculate the distance from data to a clustering center, and eliminate the influence of artificially appointed threshold values by taking the maximum intra-class distance of each class in a clustering result as an incremental data classification threshold value, thereby improving the robustness of the incremental clustering method.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed in the embodiments are briefly described below. The drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of an incremental clustering method based on information gain weighting according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an incremental clustering device based on information gain weights according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the step numbers used herein are for convenience of description only and are not intended as limitations on the order in which the steps are performed.
It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.
As shown in fig. 1, an embodiment of the present invention provides an incremental clustering method based on information gain weights, which includes the following steps.
S11: and calculating the classification contribution rate of each characteristic according to the information gain weight of the initial data attribute characteristic.
In this embodiment, the continuous values of the initial data attribute are discretized.
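The embodiment does not state which discretization scheme is used. As an illustrative sketch only (the equal-width strategy and bin count are assumptions, not taken from the patent), continuous attribute values could be discretized as follows:

```python
import numpy as np

def discretize_equal_width(values, n_bins=5):
    """Discretize one continuous attribute column into equal-width bins,
    returning integer bin labels in [0, n_bins - 1]. The equal-width
    strategy and bin count are illustrative assumptions only."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Interior edges only, so labels fall in [0, n_bins - 1].
    return np.digitize(values, edges[1:-1], right=False)

# Example: three bins over a small continuous attribute column.
print(discretize_equal_width([0.1, 0.4, 0.35, 2.0, 5.5], n_bins=3))  # [0 0 0 1 2]
```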
In this embodiment, the clustering parameters are first set, including an initial data set X, an attribute set T and a classification category Y, where T = {T_1, T_2, …, T_m}, 1 ≤ i ≤ m.
For the initial data set X, the data set can be correspondingly divided into sets C = {C_1, C_2, …, C_n} according to the different values of the classification category Y; at the same time, according to the different values of the attributes in T, the data set can be correspondingly divided into v subsets, so that X = {X_1, X_2, …, X_v}.
In this embodiment, for a feature T_i of the initial data, its information gain weight W(T_i) may be determined from the information entropy of the initial data.
Specifically, the information gain weight W(T_i) = IG(T_i)/IS(T_i), where IG(T_i) is the information gain of feature T_i and can be determined by the following formula:
IG(T_i) = I(C) − I(C|T_i)
In this formula, I(C) is the information entropy of the classification of the data and I(C|T_i) is the corresponding conditional entropy given feature T_i; their expressions, and the intermediate quantities they depend on, are given only as formula images in the original publication and are not reproduced here.
for IS (T)i) Can be represented by formula
Figure BDA0002922282890000061
Determination, wherein the determination is made according to Bernoulli's law of large numbers
Figure BDA0002922282890000062
Further, from the information gain weight W(T_i), the classification contribution rate ε_i of the feature T_i is obtained; the formula for ε_i is given only as an image in the original publication.
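Because the formula images for I(C), I(C|T_i), IS(T_i) and ε_i are not reproduced in the text, the following is only a minimal sketch of how these quantities are commonly computed. It assumes the standard Shannon entropy, conditional entropy and split-information definitions, and the weight normalization assumed earlier; none of these choices is confirmed by the visible text.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label array (assumed standard definition)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain_weight(feature, labels):
    """W(T_i) = IG(T_i) / IS(T_i), with IG(T_i) = I(C) - I(C|T_i).
    IS(T_i) is assumed here to be the split information of the feature."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    values, counts = np.unique(feature, return_counts=True)
    p = counts / counts.sum()
    cond_entropy = sum(p_v * entropy(labels[feature == v]) for v, p_v in zip(values, p))
    ig = entropy(labels) - cond_entropy
    split_info = float(-np.sum(p * np.log2(p)))   # IS(T_i), assumed form
    return ig / split_info if split_info > 0 else 0.0

def contribution_rates(X, labels):
    """epsilon_i for every (discretized) attribute column of X, assumed to be
    the information gain weights normalized so that they sum to 1."""
    w = np.array([information_gain_weight(X[:, i], labels) for i in range(X.shape[1])])
    return w / w.sum()
```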
s12: and respectively calculating the intra-class distances from the initial data to the initial clustering centers according to the classification contribution rates, and performing iterative combination on classes of which the intra-class distances are smaller than a distance threshold value to obtain clustering results, wherein the clustering results comprise the clustering centers of all classes and the maximum intra-class distance.
In this embodiment, the distance threshold comprises the smallest of the intra-class distances of the initial data to the initial cluster center.
In this embodiment, the number of cluster categories is set to d, an initial cluster center set center = {ct_j | 1 ≤ j ≤ d} is randomly selected from the initial data set X, an intra-class cohesion set R = {R_j | 1 ≤ j ≤ d} is defined, the number of iterations is t, the clustering precision is elig, and each R_j is initialized to 0.
In the iterative process, the distance from each data point x in the initial data set X to each initial cluster center ct_j is calculated according to the following formula:
[formula given only as an image in the original publication]
The minimum distance min_j dist(x, ct_j) is obtained, the data point x is merged into the corresponding category, and at the same time the cluster center of that category and the intra-class distances from each data point to the cluster center are updated.
The updated cluster center is ct_j = {ct_{j,i} | 1 ≤ i ≤ d}, where ct_{j,i} is given by a formula that appears only as an image in the original publication.
If the updated cluster center has shifted, the process returns to S12 and repeats until the updated cluster center no longer shifts or the iterations are completed, and the maximum intra-class distance of each class is recorded.
In this embodiment, the cluster center set of the previous iteration is subtracted from the cluster centers of the current iteration, and the differences are compared with the clustering precision elig: if the differences are less than or equal to elig, the updated cluster centers have not shifted and the iteration ends; if a difference is greater than elig, the updated cluster centers are considered to have shifted.
And after the iteration is finished, obtaining a clustering result, wherein the clustering result comprises the clustering center and the maximum intra-class distance of each class.
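As an illustrative sketch of S12 only (not the patent's own code), the iteration described above, using the weighted distance assumed earlier and the precision elig as the stopping test, could look roughly as follows:

```python
import numpy as np

def weighted_distance(x, center, eps):
    """Weighted Euclidean distance with each attribute dimension scaled by its
    classification contribution rate eps[i] (an assumed form of the image formula)."""
    return float(np.sqrt(np.sum(eps * (x - center) ** 2)))

def initial_clustering(X, eps, d, t=100, elig=1e-4, rng=None):
    """Assign points to the nearest of d randomly chosen centers and update the
    centers until they shift by at most elig or t iterations are reached.
    Returns the centers, the maximum intra-class distance of each class, and labels."""
    if rng is None:
        rng = np.random.default_rng(0)
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=d, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(t):
        # Merge each point into the category of its nearest cluster center.
        labels = np.array([
            int(np.argmin([weighted_distance(x, c, eps) for c in centers])) for x in X
        ])
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(d)
        ])
        shifted = np.max(np.abs(new_centers - centers)) > elig  # compare with precision elig
        centers = new_centers
        if not shifted:
            break
    max_intra = np.array([
        max((weighted_distance(x, centers[j], eps) for x in X[labels == j]), default=0.0)
        for j in range(d)
    ])
    return centers, max_intra, labels
```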
S13: and respectively calculating the distance from the newly added data point to the clustering center according to the classification contribution rate, and determining the minimum distance and the corresponding clustering center.
In this embodiment, for a newly added data point x_n, the distance between the point and the cluster center of each category in the clustering result is calculated according to a formula that is given only as an image in the original publication, and the minimum distance new_min = min_j rect(x_n, c_j), 1 ≤ j ≤ d, is obtained.
S14: and when the minimum distance is smaller than or equal to the maximum intra-class distance of the corresponding clustering center, merging the newly added data points into the class of the corresponding clustering center, and when the minimum distance is larger than the maximum intra-class distance of the corresponding clustering center, determining the newly added data points as a single class.
In this embodiment, the minimum distance new _ min is compared with the maximum intra-class distance of the corresponding class, and when new _ min is less than or equal to the maximum intra-class distance, the newly added data point is classified into the class, otherwise, the newly added data point is set as a separate class.
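A minimal sketch of S13 and S14, reusing the weighted_distance helper and the centers and maximum intra-class distances produced by the sketch above (all names are illustrative, not from the patent):

```python
import numpy as np

def assign_new_point(x_n, centers, max_intra, eps):
    """Assign a newly added point to an existing class or open a new one.
    If the smallest weighted distance to any center does not exceed that
    class's maximum intra-class distance, the point joins the class;
    otherwise it is treated as a separate class (returned as -1 here)."""
    dists = np.array([weighted_distance(x_n, c, eps) for c in centers])
    j = int(np.argmin(dists))   # index of the nearest cluster center
    new_min = dists[j]          # the minimum distance new_min
    return j if new_min <= max_intra[j] else -1
```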
For the different attributes of a data set, the larger the information gain of an attribute, the more uncertainty it resolves and the more it benefits clustering. The embodiment of the invention therefore calculates the distance from the data to the clustering centers using the information gain proportional weights of the attribute features, introducing the attribute information gain into the clustering; at the same time, by taking the maximum intra-class distance of each class in the clustering result as the classification decision threshold for incremental data, the influence of a manually specified threshold on the clustering effect is avoided and the robustness of the incremental clustering method is effectively improved.
As shown in fig. 2, an embodiment of the present invention further provides an incremental clustering apparatus based on information gain weights, which includes a first initializing unit 101, a second initializing unit 102, a data calculating unit 103, and a data clustering unit 104.
The first initialization unit 101 is configured to calculate a classification contribution rate of each feature according to an information gain weight of the initial data attribute feature.
The second initialization unit 102 is configured to calculate intra-class distances from the initial data to an initial clustering center according to the classification contribution rates, and perform iterative combination on classes with intra-class distances smaller than a distance threshold to obtain a clustering result, where the clustering result includes the clustering centers of each class and a maximum intra-class distance.
The data calculating unit 103 is configured to calculate distances from the newly added data points to the cluster centers according to the classification contribution rates, and determine a minimum distance and a corresponding cluster center.
The data clustering unit 104 is configured to merge the newly added data points into the category of the corresponding clustering center when the minimum distance is less than or equal to the maximum intra-category distance of the corresponding clustering center, and determine the new data points as a single category when the minimum distance is greater than the maximum intra-category distance of the corresponding clustering center.
Because the content of information interaction, execution process, and the like among the units in the device is based on the same concept as the method embodiment of the present invention, specific content can be referred to the description in the method embodiment of the present invention, and is not described herein again.
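As a rough sketch of how the four units could be bundled into a single apparatus (an illustration under the same assumptions as the sketches in the method embodiment, not the patent's implementation; the helpers contribution_rates, initial_clustering and assign_new_point are taken from those sketches):

```python
import numpy as np

class InfoGainIncrementalClusterer:
    """Bundles the first initialization, second initialization, data calculation
    and data clustering units described above (illustrative sketch only)."""

    def __init__(self, d, t=100, elig=1e-4):
        self.d, self.t, self.elig = d, t, elig

    def fit(self, X, labels):
        # First and second initialization units: contribution rates, then clustering.
        self.eps = contribution_rates(np.asarray(X), np.asarray(labels))
        self.centers, self.max_intra, _ = initial_clustering(
            X, self.eps, self.d, self.t, self.elig
        )
        return self

    def add_point(self, x_n):
        # Data calculation and data clustering units: merge or open a new class.
        x_n = np.asarray(x_n, dtype=float)
        j = assign_new_point(x_n, self.centers, self.max_intra, self.eps)
        if j == -1:
            # The newly added point becomes its own class with zero intra-class radius.
            self.centers = np.vstack([self.centers, x_n])
            self.max_intra = np.append(self.max_intra, 0.0)
            j = len(self.centers) - 1
        return j
```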
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the method according to any of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium, and may include the processes of the embodiments of the methods when executed. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. An incremental clustering method based on information gain weight is characterized by comprising the following steps:
calculating the classification contribution rate of each characteristic according to the information gain weight of the initial data attribute characteristic;
respectively calculating the intra-class distances from the initial data to the initial clustering centers according to the classification contribution rates, and carrying out iterative combination on classes of which the intra-class distances are smaller than a distance threshold value to obtain clustering results, wherein the clustering results comprise the clustering centers of all classes and the maximum intra-class distance;
respectively calculating the distance from the newly added data points to the clustering centers according to the classification contribution rate, and determining the minimum distance and the corresponding clustering centers;
and when the minimum distance is smaller than or equal to the maximum intra-class distance of the corresponding clustering center, merging the newly added data points into the class of the corresponding clustering center, and when the minimum distance is larger than the maximum intra-class distance of the corresponding clustering center, determining the newly added data points as a single class.
2. The incremental clustering method based on information gain weight according to claim 1, wherein the information gain weight of the attribute feature of the initial data is determined according to the information entropy of the initial data.
3. The incremental clustering method based on information gain weight according to claim 2, wherein the classification contribution rate ε_i of each feature is determined according to the following formula:
[formula given only as an image in the original publication]
where W (T) is the information gain weight of the attribute feature T.
4. The incremental clustering method based on information gain weight according to claim 1, wherein the intra-class distances from the initial data to an initial clustering center are respectively calculated according to the classification contribution rates, and specifically:
[formula given only as an image in the original publication]
5. the incremental clustering method based on information gain weight according to claim 1, wherein the continuous values of the initial data attribute are discretized.
6. The incremental clustering method based on information gain weight according to claim 1, wherein the distance threshold comprises a smallest intra-class distance among intra-class distances of the initial data to an initial cluster center.
7. An incremental clustering device based on information gain weight, comprising:
the first initialization unit is used for calculating the classification contribution rate of each characteristic according to the information gain weight of the initial data attribute characteristic;
the second initialization unit is used for respectively calculating the intra-class distances from the initial data to the initial clustering centers according to the classification contribution rates, and performing iterative combination on the classes with the intra-class distances smaller than a distance threshold value to obtain clustering results, wherein the clustering results comprise the clustering centers of all classes and the maximum intra-class distance;
the data calculation unit is used for respectively calculating the distances from the newly added data points to the clustering centers according to the classification contribution rates, and determining the minimum distance and the corresponding clustering centers;
and the data clustering unit is used for merging the newly added data points into the category of the corresponding clustering center when the minimum distance is less than or equal to the maximum intra-category distance of the corresponding clustering center, and determining the newly added data points as a single category when the minimum distance is greater than the maximum intra-category distance of the corresponding clustering center.
8. The incremental clustering device based on information gain weight of claim 7, wherein the information gain weight of the attribute feature of the initial data is determined according to the information entropy of the initial data.
9. The incremental clustering apparatus based on information gain weight according to claim 7, wherein the distance threshold comprises a smallest intra-class distance among intra-class distances of the initial data to an initial cluster center.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1 to 6.
CN202110123316.XA 2020-12-29 2021-01-28 Incremental clustering method and device based on information gain weight Pending CN112766403A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011599131 2020-12-29
CN2020115991318 2020-12-29

Publications (1)

Publication Number Publication Date
CN112766403A true CN112766403A (en) 2021-05-07

Family

ID=75706592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110123316.XA Pending CN112766403A (en) 2020-12-29 2021-01-28 Incremental clustering method and device based on information gain weight

Country Status (1)

Country Link
CN (1) CN112766403A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190285722A1 (en) * 2012-08-03 2019-09-19 Polte Corporation Network architecture and methods for location services
CN107067045A (en) * 2017-05-31 2017-08-18 北京京东尚科信息技术有限公司 Data clustering method, device, computer-readable medium and electronic equipment
CN110110736A (en) * 2018-04-18 2019-08-09 爱动超越人工智能科技(北京)有限责任公司 Increment clustering method and device
CN108804588A (en) * 2018-05-28 2018-11-13 山西大学 A kind of mixed data flow data label method
CN110866555A (en) * 2019-11-11 2020-03-06 广州国音智能科技有限公司 Incremental data clustering method, device and equipment and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
江志良 (Jiang Zhiliang): "Research on Clustering Ensemble Based on Feature Relationships" (基于特征关系的聚类集成研究), Master's Electronic Journals (Information Science and Technology), 15 January 2018 (2018-01-15), pages 1-89 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113364751A (en) * 2021-05-26 2021-09-07 北京电子科技职业学院 Network attack prediction method, computer-readable storage medium, and electronic device

Similar Documents

Publication Publication Date Title
CN111291678B (en) Face image clustering method and device based on multi-feature fusion
CN109583332B (en) Face recognition method, face recognition system, medium, and electronic device
WO2017157183A1 (en) Automatic multi-threshold characteristic filtering method and apparatus
CN111553399A (en) Feature model training method, device, equipment and storage medium
WO2022042297A1 (en) Text clustering method, apparatus, electronic device, and storage medium
CN111062425B (en) Unbalanced data set processing method based on C-K-SMOTE algorithm
WO2018006631A1 (en) User level automatic segmentation method and system
Kumar et al. A fuzzy clustering technique for enhancing the convergence performance by using improved Fuzzy c-means and Particle Swarm Optimization algorithms
CN110019805A (en) Article Topics Crawling method and apparatus and computer readable storage medium
CN105160598B (en) Power grid service classification method based on improved EM algorithm
CN111564179A (en) Species biology classification method and system based on triple neural network
CN113743474A (en) Digital picture classification method and system based on cooperative semi-supervised convolutional neural network
CN110830291B (en) Node classification method of heterogeneous information network based on meta-path
CN110348516B (en) Data processing method, data processing device, storage medium and electronic equipment
JP2016194914A (en) Method and device for selecting mixture model
CN114417095A (en) Data set partitioning method and device
CN112766403A (en) Incremental clustering method and device based on information gain weight
CN110378389A (en) A kind of Adaboost classifier calculated machine creating device
Lim et al. More powerful selective kernel tests for feature selection
CN110991517A (en) Classification method and system for unbalanced data set in stroke
CN110837853A (en) Rapid classification model construction method
CN114139636B (en) Abnormal operation processing method and device
CN112738724B (en) Method, device, equipment and medium for accurately identifying regional target crowd
CN111428510B (en) Public praise-based P2P platform risk analysis method
CN110263196B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination