CN112766403A - Incremental clustering method and device based on information gain weight - Google Patents
- Publication number
- CN112766403A (application CN202110123316.XA)
- Authority
- CN
- China
- Prior art keywords
- clustering
- class
- distance
- intra
- information gain
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23211—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with adaptive number of clusters
Abstract
The invention discloses an incremental clustering method and device based on information gain weights. The method comprises the following steps: calculating the classification contribution rate of each feature from the information gain weight of each initial-data attribute feature; calculating the intra-class distances from the initial data to the initial clustering centers according to the classification contribution rates, and iteratively merging classes whose intra-class distances are smaller than a distance threshold to obtain the cluster center and the maximum intra-class distance of each class; calculating the distance from each newly added data point to the cluster centers according to the classification contribution rates, and determining the minimum distance and the corresponding cluster center; when the minimum distance is less than or equal to the maximum intra-class distance of the corresponding cluster center, merging the newly added data point into that center's class, and when the minimum distance is greater, treating the newly added data point as an independent class. The method uses the information gain weights to calculate intra-class distances and sets the maximum intra-class distance as the classification threshold for incremental data, which improves the robustness of the incremental clustering method.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to an incremental clustering method and device based on information gain weight.
Background
As a branch of statistics, clustering is a machine learning method that continuously adjusts its own model through observation and learning; it is currently widely used in fields such as network intrusion detection and image recognition. With the advent of the big data era, to overcome the limitations of traditional clustering algorithms on large-scale data, technicians have rebuilt existing algorithm models incrementally, producing incremental clustering methods: given an existing batch of clustering results, when data is newly added, only the new data is clustered and the existing clustering results are incrementally modified, instead of re-clustering the whole data set after each addition.
One existing method clusters newly added data using an extended vectorization approach: a threshold is set in advance; when the minimum distance between a new data point and the existing center points is smaller than the threshold, the point is merged into the existing class, and otherwise it is treated as an independent class. Because this method requires a manually specified threshold, its robustness is poor.
Disclosure of Invention
To address these technical problems, the invention provides an incremental clustering method and device based on information gain weights, which eliminate the influence of a manually set threshold and effectively improve the robustness of the method by using the maximum radius of each category as the incremental threshold.
The embodiment of the invention provides an incremental clustering method based on information gain weight, which comprises the following steps:
calculating the classification contribution rate of each characteristic according to the information gain weight of the initial data attribute characteristic;
respectively calculating the intra-class distances from the initial data to the initial clustering centers according to the classification contribution rates, and carrying out iterative combination on classes of which the intra-class distances are smaller than a distance threshold value to obtain clustering results, wherein the clustering results comprise the clustering centers of all classes and the maximum intra-class distance;
respectively calculating the distance from the newly added data points to the clustering centers according to the classification contribution rate, and determining the minimum distance and the corresponding clustering centers;
and when the minimum distance is smaller than or equal to the maximum intra-class distance of the corresponding clustering center, merging the newly added data points into the class of the corresponding clustering center, and when the minimum distance is larger than the maximum intra-class distance of the corresponding clustering center, determining the newly added data points as a single class.
In one embodiment, the information gain weight of the initial data attribute feature is determined according to the information entropy of the initial data.
In one embodiment, the classification contribution rate εi of each feature is determined according to the following formula:

εi = W(Ti) / Σj=1..m W(Tj)

where W(Ti) is the information gain weight of the attribute feature Ti.
In one embodiment, calculating the intra-class distance from the initial data to the initial clustering center according to the classification contribution rate includes weighting each attribute feature's squared difference by its classification contribution rate when computing the distance.
in one embodiment, the continuous values of the initial data attributes are discretized.
In one embodiment, the distance threshold comprises a minimum intra-class distance of intra-class distances of the initial data to the initial cluster center.
The embodiment of the present invention further provides an incremental clustering device based on information gain weight, including:
the first initialization unit is used for calculating the classification contribution rate of each characteristic according to the information gain weight of the initial data attribute characteristic;
the second initialization unit is used for respectively calculating the intra-class distances from the initial data to the initial clustering centers according to the classification contribution rates, and performing iterative combination on the classes with the intra-class distances smaller than a distance threshold value to obtain clustering results, wherein the clustering results comprise the clustering centers of all classes and the maximum intra-class distance;
the data calculation unit is used for respectively calculating the distances from the newly added data points to the clustering centers according to the classification contribution rates, and determining the minimum distance and the corresponding clustering centers;
and the data clustering unit is used for merging the newly added data points into the category of the corresponding clustering center when the minimum distance is less than or equal to the maximum intra-category distance of the corresponding clustering center, and determining the newly added data points as a single category when the minimum distance is greater than the maximum intra-category distance of the corresponding clustering center.
In one embodiment, the information gain weight of the initial data attribute feature is determined according to the information entropy of the initial data.
In one embodiment, the distance threshold comprises a minimum intra-class distance of intra-class distances of the initial data to the initial cluster center.
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the method according to any of the above embodiments.
Compared with the prior art, the embodiment of the invention has the beneficial effects that:
the incremental clustering method and device based on the information gain weight fully consider the contributions of different attribute characteristics to clustering, utilize the information gain proportional weight to calculate the distance from data to a clustering center, and eliminate the influence of artificially appointed threshold values by taking the maximum intra-class distance of each class in a clustering result as an incremental data classification threshold value, thereby improving the robustness of the incremental clustering method.
Drawings
In order to illustrate the technical solution of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of an incremental clustering method based on information gain weighting according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an incremental clustering device based on information gain weights according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention; all other embodiments obtained by a person skilled in the art without creative effort shall fall within the protection scope of the present invention.
It should be understood that the step numbers used herein are for convenience of description only and are not intended as limitations on the order in which the steps are performed.
It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.
As shown in fig. 1, an embodiment of the present invention provides an incremental clustering method based on information gain weights, which includes the following steps.
S11: and calculating the classification contribution rate of each characteristic according to the information gain weight of the initial data attribute characteristic.
In this embodiment, the continuous values of the initial data attribute are discretized.
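The patent does not specify the discretization scheme. As an illustration only, equal-width binning of a continuous attribute might look like the following Python sketch (the function name and default bin count are assumptions, not from the patent):

```python
def discretize(values, bins=4):
    """Equal-width binning of a continuous attribute into integer bins.

    The patent only states that continuous values are discretized;
    equal-width binning is one common choice, used here for illustration.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # a constant attribute collapses into a single bin
        return [0] * len(values)
    width = (hi - lo) / bins
    # clamp the maximum value into the last bin
    return [min(int((v - lo) / width), bins - 1) for v in values]
```

For example, `discretize([0.0, 1.0, 2.0, 3.0], 2)` splits the range [0, 3] into two equal-width bins.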
In this embodiment, first, the clustering parameters are set, including an initial data set X, an attribute set T = {T1, T2, …, Tm} (with attribute features Ti, 1 ≤ i ≤ m), and a classification category Y.
For the initial data set X, the data can be partitioned into a set C = {C1, C2, …, Cn} according to the distinct values of the classification category Y; likewise, according to the v distinct values of an attribute Ti, X can be partitioned into v subsets, X = {X1, X2, …, Xv}.
In this embodiment, for each attribute feature Ti of the initial data, its information gain weight W(Ti) may be determined from the information entropy of the initial data.
Specifically, the information gain weight is W(Ti) = IG(Ti)/IS(Ti), where IG(Ti) is the information gain of the feature Ti, which can be determined by the following formula:

IG(Ti) = I(C) − I(C|Ti)

where I(C) = −Σk=1..n p(Ck) log2 p(Ck) is the information entropy of the class partition C, and I(C|Ti) = Σu=1..v (|Xu|/|X|) I(Cu) is the conditional entropy of the classes given the values of Ti, with Cu denoting the class labels within the subset Xu.

For IS(Ti), the split information can be determined by IS(Ti) = −Σu=1..v (|Xu|/|X|) log2(|Xu|/|X|), where, according to Bernoulli's law of large numbers, each probability is estimated by the corresponding sample frequency |Xu|/|X|.
Further, according to the information gain weight W(Ti), the classification contribution rate εi of the feature Ti is obtained as εi = W(Ti) / Σj=1..m W(Tj).
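A minimal Python sketch of this step, under the assumption that the weight is the information gain ratio W(Ti) = IG(Ti)/IS(Ti) and the contribution rate is W(Ti) normalized over all features (the function names are illustrative, not from the patent):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy I(C) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_weights(rows, labels):
    """Per-feature information gain IG, split information IS,
    weight W = IG / IS, and contribution rate eps = W / sum(W)."""
    n = len(rows)
    base = entropy(labels)
    weights = []
    for i in range(len(rows[0])):
        # partition the labels by the (discretized) value of feature i
        parts = {}
        for row, y in zip(rows, labels):
            parts.setdefault(row[i], []).append(y)
        # conditional entropy I(C|Ti) and split information IS(Ti)
        cond = sum(len(p) / n * entropy(p) for p in parts.values())
        split = -sum(len(p) / n * math.log2(len(p) / n) for p in parts.values())
        ig = base - cond
        weights.append(ig / split if split else 0.0)
    total = sum(weights)
    # normalize to classification contribution rates
    return [w / total if total else 1.0 / len(weights) for w in weights]
```

For instance, on a data set where the first feature perfectly predicts the class and the second is uninformative, `gain_weights` assigns essentially all of the contribution to the first feature.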
s12: and respectively calculating the intra-class distances from the initial data to the initial clustering centers according to the classification contribution rates, and performing iterative combination on classes of which the intra-class distances are smaller than a distance threshold value to obtain clustering results, wherein the clustering results comprise the clustering centers of all classes and the maximum intra-class distance.
In this embodiment, the distance threshold comprises the smallest of the intra-class distances of the initial data to the initial cluster center.
In this embodiment, the number of clusters is set to d, an initial cluster center set center = {ctj | 1 ≤ j ≤ d} is randomly selected from the initial data set X, a set of intra-class radii R = {Rj | 1 ≤ j ≤ d} is defined and initialized to Rj = 0, and the iteration count t and the clustering precision elig are set.
During the iteration, the distance from each data point x in the initial data set X to each initial cluster center ctj is calculated according to the following formula:

dist(x, ctj) = sqrt( Σi=1..m εi (xi − ctj,i)² )
obtaining the minimum Min (dist (x, c) in the distancej)j) And merging the data points x into the category according to the category, and simultaneously updating the cluster center of the category and the intra-class distance from each data point to the cluster center.
The updated cluster center is ctj = {ctj,i | 1 ≤ i ≤ m}, where each component ctj,i is the mean of the i-th attribute over all data points currently assigned to class j.
If the updated cluster centers have shifted, S12 is executed again until the centers no longer shift or the iteration count t is exhausted, and the maximum intra-class distance of each class is recorded.
In this embodiment, the cluster center set of the current iteration is compared with that of the previous iteration: if each difference is less than or equal to the clustering precision elig, the updated centers are considered not to have shifted and the iteration ends; if any difference is greater than elig, the centers are considered to have shifted.
And after the iteration is finished, obtaining a clustering result, wherein the clustering result comprises the clustering center and the maximum intra-class distance of each class.
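Step S12 can be sketched as a weighted k-means-style loop. This is an illustrative reconstruction: the weighted Euclidean distance, the convergence test against elig, and names such as `initial_cluster` are assumptions based on the description, not the patent's exact implementation:

```python
import math
import random

def wdist(x, c, eps):
    """Weighted Euclidean distance: sqrt(sum_i eps_i * (x_i - c_i)^2)."""
    return math.sqrt(sum(e * (a - b) ** 2 for e, a, b in zip(eps, x, c)))

def initial_cluster(points, eps, d, elig=1e-4, t=100):
    """Weighted k-means-style iteration; returns the final cluster centers
    and each class's maximum intra-class distance (the incremental thresholds)."""
    centers = random.sample(points, d)
    for _ in range(t):
        # assign every point to its nearest center under the weighted distance
        groups = [[] for _ in range(d)]
        for x in points:
            j = min(range(d), key=lambda k: wdist(x, centers[k], eps))
            groups[j].append(x)
        # recompute each center as the per-attribute mean of its class
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        shift = max(wdist(a, b, eps) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= elig:  # centers no longer deviate: iteration ends
            break
    # final assignment and per-class maximum intra-class distance
    groups = [[] for _ in range(d)]
    for x in points:
        j = min(range(d), key=lambda k: wdist(x, centers[k], eps))
        groups[j].append(x)
    radii = [max((wdist(x, c, eps) for x in g), default=0.0)
             for c, g in zip(centers, groups)]
    return centers, radii
```

By construction, every initial point lies within the maximum intra-class distance of its nearest center, which is what makes these radii usable as incremental thresholds in the next step.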
S13: and respectively calculating the distance from the newly added data point to the clustering center according to the classification contribution rate, and determining the minimum distance and the corresponding clustering center.
In this embodiment, for a newly added data point xn, the distance from the point to the cluster center cj of each class in the clustering result is calculated according to the following formula:

rect(xn, cj) = sqrt( Σi=1..m εi (xn,i − cj,i)² )

and the minimum distance new_min = Min(rect(xn, cj)) is obtained, where 1 ≤ j ≤ d.
S14: and when the minimum distance is smaller than or equal to the maximum intra-class distance of the corresponding clustering center, merging the newly added data points into the class of the corresponding clustering center, and when the minimum distance is larger than the maximum intra-class distance of the corresponding clustering center, determining the newly added data points as a single class.
In this embodiment, the minimum distance new _ min is compared with the maximum intra-class distance of the corresponding class, and when new _ min is less than or equal to the maximum intra-class distance, the newly added data point is classified into the class, otherwise, the newly added data point is set as a separate class.
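The incremental decision of S13 and S14 can be sketched as follows; the helper `wdist` and the list-mutation style are illustrative assumptions, not the patent's implementation:

```python
import math

def wdist(x, c, eps):
    """Weighted Euclidean distance using contribution rates eps."""
    return math.sqrt(sum(e * (a - b) ** 2 for e, a, b in zip(eps, x, c)))

def assign_incremental(x_new, centers, radii, eps):
    """Merge a new point into the nearest class if its distance is within
    that class's maximum intra-class distance; otherwise open a new class."""
    j = min(range(len(centers)), key=lambda k: wdist(x_new, centers[k], eps))
    if wdist(x_new, centers[j], eps) <= radii[j]:
        return j                      # merged into existing class j
    centers.append(tuple(x_new))      # new point becomes its own class
    radii.append(0.0)
    return len(centers) - 1
```

Because the threshold is each class's own maximum intra-class distance rather than a hand-picked constant, no parameter needs to be tuned when new data arrives.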
Among the attributes of a data set, an attribute with a larger information gain resolves more uncertainty and therefore contributes more to clustering. The embodiment of the invention calculates the distance from data to a cluster center using the information gain proportional weights of the attribute features, thereby introducing attribute information gain into the clustering; at the same time, by using the maximum intra-class distance of each class in the clustering result as the classification threshold for incremental data, it avoids the influence of a manually specified threshold on the clustering effect and effectively improves the robustness of the incremental clustering method.
As shown in fig. 2, an embodiment of the present invention further provides an incremental clustering apparatus based on information gain weights, which includes a first initializing unit 101, a second initializing unit 102, a data calculating unit 103, and a data clustering unit 104.
The first initialization unit 101 is configured to calculate a classification contribution rate of each feature according to an information gain weight of the initial data attribute feature.
The second initialization unit 102 is configured to calculate intra-class distances from the initial data to an initial clustering center according to the classification contribution rates, and perform iterative combination on classes with intra-class distances smaller than a distance threshold to obtain a clustering result, where the clustering result includes the clustering centers of each class and a maximum intra-class distance.
The data calculating unit 103 is configured to calculate distances from the newly added data points to the cluster centers according to the classification contribution rates, and determine a minimum distance and a corresponding cluster center.
The data clustering unit 104 is configured to merge the newly added data points into the category of the corresponding clustering center when the minimum distance is less than or equal to the maximum intra-category distance of the corresponding clustering center, and determine the new data points as a single category when the minimum distance is greater than the maximum intra-category distance of the corresponding clustering center.
Because the content of information interaction, execution process, and the like among the units in the device is based on the same concept as the method embodiment of the present invention, specific content can be referred to the description in the method embodiment of the present invention, and is not described herein again.
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the method according to any of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium, and may include the processes of the embodiments of the methods when executed. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.
Claims (10)
1. An incremental clustering method based on information gain weight is characterized by comprising the following steps:
calculating the classification contribution rate of each characteristic according to the information gain weight of the initial data attribute characteristic;
respectively calculating the intra-class distances from the initial data to the initial clustering centers according to the classification contribution rates, and carrying out iterative combination on classes of which the intra-class distances are smaller than a distance threshold value to obtain clustering results, wherein the clustering results comprise the clustering centers of all classes and the maximum intra-class distance;
respectively calculating the distance from the newly added data points to the clustering centers according to the classification contribution rate, and determining the minimum distance and the corresponding clustering centers;
and when the minimum distance is smaller than or equal to the maximum intra-class distance of the corresponding clustering center, merging the newly added data points into the class of the corresponding clustering center, and when the minimum distance is larger than the maximum intra-class distance of the corresponding clustering center, determining the newly added data points as a single class.
2. The incremental clustering method based on information gain weight according to claim 1, wherein the information gain weight of the attribute feature of the initial data is determined according to the information entropy of the initial data.
5. The incremental clustering method based on information gain weight according to claim 1, wherein the continuous values of the initial data attribute are discretized.
6. The incremental clustering method based on information gain weight according to claim 1, wherein the distance threshold comprises a smallest intra-class distance among intra-class distances of the initial data to an initial cluster center.
7. An incremental clustering device based on information gain weight, comprising:
the first initialization unit is used for calculating the classification contribution rate of each characteristic according to the information gain weight of the initial data attribute characteristic;
the second initialization unit is used for respectively calculating the intra-class distances from the initial data to the initial clustering centers according to the classification contribution rates, and performing iterative combination on the classes with the intra-class distances smaller than a distance threshold value to obtain clustering results, wherein the clustering results comprise the clustering centers of all classes and the maximum intra-class distance;
the data calculation unit is used for respectively calculating the distances from the newly added data points to the clustering centers according to the classification contribution rates, and determining the minimum distance and the corresponding clustering centers;
and the data clustering unit is used for merging the newly added data points into the category of the corresponding clustering center when the minimum distance is less than or equal to the maximum intra-category distance of the corresponding clustering center, and determining the newly added data points as a single category when the minimum distance is greater than the maximum intra-category distance of the corresponding clustering center.
8. The incremental clustering device based on information gain weight of claim 7, wherein the information gain weight of the attribute feature of the initial data is determined according to the information entropy of the initial data.
9. The incremental clustering apparatus based on information gain weight according to claim 7, wherein the distance threshold comprises a smallest intra-class distance among intra-class distances of the initial data to an initial cluster center.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1 to 6.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011599131 | 2020-12-29 | ||
CN2020115991318 | 2020-12-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112766403A (en) | 2021-05-07
Family
ID=75706592
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110123316.XA Pending CN112766403A (en) | 2020-12-29 | 2021-01-28 | Incremental clustering method and device based on information gain weight |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112766403A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113364751A (en) * | 2021-05-26 | 2021-09-07 | 北京电子科技职业学院 | Network attack prediction method, computer-readable storage medium, and electronic device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107067045A (en) * | 2017-05-31 | 2017-08-18 | 北京京东尚科信息技术有限公司 | Data clustering method, device, computer-readable medium and electronic equipment |
CN108804588A (en) * | 2018-05-28 | 2018-11-13 | 山西大学 | A kind of mixed data flow data label method |
CN110110736A (en) * | 2018-04-18 | 2019-08-09 | 爱动超越人工智能科技(北京)有限责任公司 | Increment clustering method and device |
US20190285722A1 (en) * | 2012-08-03 | 2019-09-19 | Polte Corporation | Network architecture and methods for location services |
CN110866555A (en) * | 2019-11-11 | 2020-03-06 | 广州国音智能科技有限公司 | Incremental data clustering method, device and equipment and readable storage medium |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 