CN107122793A - An improved global optimization k-modes clustering method
- Publication number: CN107122793A
- Application number: CN201710177995.2A
- Authority: CN (China)
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Abstract
The invention discloses an improved global optimization k-modes clustering method with the following steps. Step 1: data preprocessing. Step 2: build a linearized mathematical programming model of k-modes. Step 3: solve the model. Step 4: output the results. These steps carry out k-modes clustering. Because the model is linear, a globally optimal solution can be obtained, and because no initial solution has to be set, sensitivity to the initial solution is avoided as well. This resolves two common problems of the traditional k-modes algorithm.
Description
1. Technical field:
The invention provides a method that guarantees an optimal solution for k-modes clustering. It addresses two weaknesses of the traditional k-modes clustering algorithm: sensitivity to the initial solution and difficulty in reaching a globally optimal solution. The method belongs to the field of data analysis and data mining and helps data-processing practitioners obtain better clustering results.
2. Background:
We live in the age of data. Every day, data from all directions pours into our computers, networks, and data storage facilities of every kind. The rapid growth in data volume and the widespread use of data call for a powerful, general-purpose tool that can process this explosion of data efficiently. Data mining methods arose to meet this need by turning raw data of different kinds into useful knowledge.
Cluster analysis is one of the most basic and most important of these data mining techniques. Clustering partitions a set of data objects into multiple classes on the principle of maximizing similarity within a class and minimizing similarity between classes. In other words, objects that end up in the same cluster are highly similar to one another and very dissimilar to objects in other clusters. Basic clustering algorithms fall into the following categories: partitioning methods, hierarchical methods, density-based methods, and grid-based methods. Partitioning methods are the most basic and most practical clustering algorithms and often serve as the first stage of later analysis. The commonly used k-modes algorithm is a classical partitioning method for categorical data. However, it is based on a nonlinear integer programming model, which leads to two problems that are hard to solve. (1) An initial value must be set, and different choices of initial value affect the final clustering result, so the algorithm as a whole is unstable. (2) The algorithm often becomes trapped in a locally optimal solution and rarely reaches a globally optimal one.
Based on integer programming (IP), the present invention converts the k-modes clustering problem into a linear programming model. This removes the step of setting an initial solution, and because the whole model is linear, a globally optimal solution can be obtained. Both problems mentioned above are thereby resolved. The method writes the model in AMPL (A Mathematical Programming Language), a modeling language for describing and solving large-scale, complex mathematical problems, and then calls a mixed integer programming (MIP) solver such as CPLEX or Lingo to carry out clustering computations of medium scale. The invention is an improved global optimization clustering method: it first proposes a new representation of categorical data sets and then uses this representation to build an integer linear programming model of k-modes.
3. Summary of the invention:
3.1 Object of the invention
The object of the invention is to overcome the long-standing shortcomings of the traditional k-modes method by proposing a stable clustering method that can reach the globally optimal solution, giving practitioners of data mining and big data analysis a scheme with better clustering results.
3.2 Technical solution
The problem is abstracted as follows. Suppose a data set X contains n data objects, X = {X1, X2, …, Xn}. Each data object is described by m attributes A1, A2, …, Am, so each data object Xi can be written as Xi = {xi1, xi2, …, xim}. Because the k-modes algorithm works on categorical attribute values, each categorical attribute Aj has a value domain Vj with pj category values (written qj in the steps below). The number of clusters, l (l ≤ n), is assumed to be given in advance, and the k-modes algorithm groups the data objects in X into l clusters. The idea of the algorithm is first to find l cluster centers {C1, C2, …, Cl} such that the sum of the distances d(Xi, Cj) between each data object and its nearest cluster center is minimal. This sum of within-group distances is called the objective function, and d(Xi, Cj) can also be called the dissimilarity measure between two objects. The cluster center of each group is then redetermined from the current grouping, the grouping step is repeated, and the two steps alternate until convergence. The goal of clustering is to minimize the total within-group distance over all groups.
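The alternating procedure of the traditional algorithm described above can be sketched in Python as follows. This is a minimal illustration and not the patent's own code: the function names (`mismatch`, `mode_of`, `k_modes`), the tie-breaking rule (lowest cluster index), and the handling of empty clusters (keep the old center) are assumptions made for the example.

```python
from collections import Counter

def mismatch(a, b):
    """Simple matching dissimilarity: number of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def mode_of(rows):
    """Attribute-wise most frequent value of a non-empty list of rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def k_modes(data, centers, max_iter=100):
    """Alternate the assignment and center-update steps from given initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each object joins its nearest center (ties go to the lowest index).
        labels = [min(range(len(centers)), key=lambda k: mismatch(row, centers[k]))
                  for row in data]
        # Update step: each center becomes the mode of its members (assumed: an
        # empty cluster keeps its previous center).
        new_centers = []
        for k, old in enumerate(centers):
            members = [row for row, lab in zip(data, labels) if lab == k]
            new_centers.append(mode_of(members) if members else old)
        if new_centers == centers:
            break  # converged: centers no longer change
        centers = new_centers
    cost = sum(mismatch(row, centers[lab]) for row, lab in zip(data, labels))
    return labels, centers, cost
```

The result depends on the `centers` argument, which is the initial solution the text refers to; the linear model introduced below has no such input.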
The mathematical symbols used in the technical solution are listed below:
The improved global optimization k-modes clustering method of the invention has the following steps:
Step 1: Data preprocessing
Number the n objects to be clustered from 1 to n, and let yij denote the value of the j-th attribute of data object i. Categorical values are replaced by the integers 1, 2, 3, …. For example, if attribute Aj has value domain Vj with qj category values, these values are represented by {1, 2, 3, …, qj}. Record the number of clusters required;
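As an illustration of this preprocessing step, the sketch below maps the category labels of each attribute to the integers 1..qj. The function name `encode_categories`, the choice to number categories in order of first appearance, and the sample labels are assumptions for the example; the text only requires that each category receive a distinct integer.

```python
def encode_categories(rows):
    """Replace the category labels of each attribute by integers 1..q_j.

    Returns the encoded rows y, the list q of category counts per attribute,
    and the per-attribute codebooks (label -> integer).
    """
    m = len(rows[0])
    codebooks = [{} for _ in range(m)]
    encoded = []
    for row in rows:
        coded = []
        for j, value in enumerate(row):
            book = codebooks[j]
            if value not in book:
                # Assumed convention: number categories in order of first appearance.
                book[value] = len(book) + 1
            coded.append(book[value])
        encoded.append(coded)
    q = [len(book) for book in codebooks]
    return encoded, q, codebooks


# Hypothetical sample rows, used only to show the call.
y, q, books = encode_categories([("young", "no"), ("old", "yes"), ("young", "yes")])
```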
Step 2: Build the linearized mathematical programming model of k-modes
Following the idea of the k-modes algorithm, the data objects in X are grouped into l (l ≤ n) clusters. The clustering criterion is to find l cluster centers {C1, C2, …, Cl} such that the sum of the distances between each data object and its nearest cluster center is minimal; this sum of within-group distances is called the objective function. The invention builds a linear programming model to carry out the k-modes computation. The model is built as follows:
a) k-modes minimizes the sum of the distances between each data object and its nearest cluster center, and this sum of within-group distances is called the objective function. The objective function of the linear programming model takes this as its goal: the distances between each data object i and its cluster center k are summed over every attribute j. This gives the objective function established below.
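To make the triple sum concrete, the following sketch evaluates the terms dikj and their total for a given grouping and given centers. It assumes 0-based Python indices and the hypothetical names `distance_terms` and `objective`; dikj is taken to be 1 exactly when object i is in cluster k and differs from that center on attribute j, as the text describes.

```python
def distance_terms(y, labels, centers):
    """Return d[i, k, j] for every object i, cluster k and attribute j (0-based)."""
    d = {}
    for i, row in enumerate(y):
        for k, center in enumerate(centers):
            for j, value in enumerate(row):
                # 1 only if object i belongs to cluster k AND its value differs from the center's.
                d[i, k, j] = int(labels[i] == k and value != center[j])
    return d

def objective(y, labels, centers):
    """F = sum over k, i, j of d[i, k, j]."""
    return sum(distance_terms(y, labels, centers).values())


# Three objects, two attributes, two clusters (toy values for illustration).
F = objective([[1, 1], [1, 2], [2, 2]], labels=[0, 0, 1], centers=[[1, 1], [2, 2]])
```

Only the second object contributes here, because it differs from its own center on one attribute.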
b) The distance dikj used in the objective function must satisfy the following requirement. The distance between data object i and cluster center k on attribute Aj is counted only when object i belongs to the class of cluster center k: dikj = 1 if and only if object i belongs to that class and its value of the attribute differs from that of the cluster center. This gives constraint (1).
(1)
c) The attribute values of the cluster centers are restricted: each attribute of a cluster center takes exactly one value. This gives constraint (2).
(2)
d) The grouping rules are specified. Each data object i must be assigned to exactly one class, which gives constraint (3). The clustering result must be divided into the prespecified l classes, which gives constraint (4).
(3)
(4)
e) The input data give the value of the j-th attribute of object i in the form yij, but a linearized model is hard to build from this representation, so it is converted into a new form: xijt is a 0/1 parameter with xijt = 1 when the j-th attribute of data object i takes the t-th category value (that is, yij = t). This gives constraint (6). Because the j-th attribute of object i can take only one value, constraint (5) follows;
(5)
(6)
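The conversion from yij to the 0/1 quantities xijt described in e) can be sketched as below. The function name `indicator_tensor` and the nested-list layout `x[i][j][t-1]` are assumptions for the example.

```python
def indicator_tensor(y, q):
    """Build x with x[i][j][t-1] = 1 if y[i][j] == t, else 0.

    y holds integer-coded category values in 1..q[j]; q[j] is the number of
    category values of attribute j.
    """
    return [
        [[1 if y_ij == t else 0 for t in range(1, q[j] + 1)]
         for j, y_ij in enumerate(row)]
        for row in y
    ]


x = indicator_tensor([[1, 2], [3, 1]], q=[3, 2])
# Each (i, j) slice contains exactly one 1, because an attribute takes one value.
one_value_per_attribute = all(sum(x_ij) == 1 for x_i in x for x_ij in x_i)
```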
The linear programming model can therefore be summarized as follows:
Objective function:
$$\min F(W,U)=\sum_{k=1}^{l}\sum_{i=1}^{n}\sum_{j=1}^{m} d_{ikj}$$
Constraints:
(1)
(2)
(3)
(4)
(5)
(6)
The meaning of each symbol used above can be looked up in the symbol table.
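The formulas for constraints (1) to (6) are not reproduced in this text. As a sketch only, one linear formulation consistent with the descriptions in a) to e) is written out below. Its exact form is an assumption, in particular the inequality chosen for constraint (1) and the reading of constraint (4) as "every cluster is non-empty"; the symbols wik, ukjt, xijt, dikj and qj are those used in the text, with wik and ukjt binary.

```latex
\begin{align}
d_{ikj} &\ge w_{ik} - \sum_{t=1}^{q_j} x_{ijt}\,u_{kjt}, \qquad d_{ikj} \ge 0
  && \forall i,k,j \tag{1}\\
\sum_{t=1}^{q_j} u_{kjt} &= 1 && \forall k,j \tag{2}\\
\sum_{k=1}^{l} w_{ik} &= 1 && \forall i \tag{3}\\
\sum_{i=1}^{n} w_{ik} &\ge 1 && \forall k \tag{4}\\
\sum_{t=1}^{q_j} x_{ijt} &= 1 && \forall i,j \tag{5}\\
x_{ijt} &= \begin{cases} 1 & \text{if } y_{ij}=t\\ 0 & \text{otherwise} \end{cases}
  && \forall i,j,t \tag{6}
\end{align}
```

Under this reading, xijt is data rather than a variable, so the product in (1) is linear in ukjt. The sum in (1) equals 1 when the center of cluster k agrees with object i on attribute j, so minimization drives dikj to 1 only when wik = 1 and the values differ.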
Step 3: Model solution
The model above is solved with commercial software such as Lingo or CPLEX. Because the mathematical programming model is linear, solving it to optimality is entirely feasible;
Here, the "model" is the linear programming model made up of the objective function and constraints (1) to (6) built in Step 2;
For "model solution", the AMPL language is used to call the CPLEX solver. The procedure is as follows:
(1) Enter the data to be clustered and the basic clustering parameters, creating the AMPL data file xxx.dat;
(2) Create the AMPL model file xxx.mod, which defines the linear programming model;
(3) Create the AMPL batch file xxx.sh;
(4) Run the batch file xxx.sh with AMPL to start solving;
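The patent solves the model with AMPL and CPLEX, which is not reproduced here. For very small instances, the globally optimal objective value can be cross-checked by exhaustive enumeration, as in the following sketch. This is a different technique from the integer programming solve, it scales exponentially in the number of objects, and the function names and the requirement that every cluster be non-empty are assumptions for the example.

```python
from collections import Counter
from itertools import product

def partition_cost(data, labels, n_clusters):
    """Minimum total mismatch of a fixed grouping, using the best center per cluster."""
    cost = 0
    for k in range(n_clusters):
        members = [row for row, lab in zip(data, labels) if lab == k]
        for col in zip(*members):
            # The most frequent value is the best center value for this attribute;
            # every other member value counts as one mismatch.
            cost += len(col) - Counter(col).most_common(1)[0][1]
    return cost

def brute_force_k_modes(data, n_clusters):
    """Try every assignment of objects to clusters and keep the cheapest one."""
    best_cost, best_labels = None, None
    for labels in product(range(n_clusters), repeat=len(data)):
        if len(set(labels)) < n_clusters:
            continue  # assumed: every cluster must contain at least one object
        cost = partition_cost(data, labels, n_clusters)
        if best_cost is None or cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_cost, best_labels


# Toy data set (not the Lenses data): six objects, three attributes.
toy = [(1, 1, 1), (1, 1, 2), (1, 1, 1), (2, 2, 2), (2, 2, 1), (2, 2, 2)]
best_cost, best_labels = brute_force_k_modes(toy, 2)
```

Such a check is only practical for a handful of objects, which is why a solver is used for data sets of realistic size.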
Step 4: Result output
Solving the model yields the optimal objective function value together with the corresponding values of the decision variables wik and ukjt. By their definitions, ukjt and wik determine the cluster assignment of each data object and the center of each cluster:
(1) For every i ∈ N, inspect wik: if wik = 1, object i belongs to class k;
(2) If ukjt = 1, the j-th attribute of the center of cluster k takes the t-th category value;
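Reading the grouping and the centers off the solved decision variables can be sketched as follows. The function name `decode_solution` and the nested-list layout of `w` and `u` are assumptions; the sketch picks the largest entry rather than testing for exactly 1, on the assumption that a solver may return values such as 0.999999.

```python
def decode_solution(w, u):
    """Turn solved w[i][k] and u[k][j][t] values into clusters and centers.

    Inputs use 0-based positions; outputs use the 1-based numbering of the text.
    """
    def argmax(values):
        return max(range(len(values)), key=values.__getitem__)

    clusters = {k + 1: [] for k in range(len(u))}
    for i, w_i in enumerate(w):
        # Rule (1): object i belongs to the class k with w_ik = 1.
        clusters[argmax(w_i) + 1].append(i + 1)
    # Rule (2): attribute j of center k takes the category t with u_kjt = 1.
    centers = {k + 1: [argmax(u_kj) + 1 for u_kj in u_k] for k, u_k in enumerate(u)}
    return clusters, centers


# Hypothetical solver output for 3 objects, 2 clusters, 2 attributes.
w = [[1, 0], [0, 1], [1, 0]]
u = [[[1, 0, 0], [0, 1]], [[0, 0, 1], [1, 0]]]
clusters, centers = decode_solution(w, u)
```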
The steps above carry out k-modes clustering. Because the model is linear, the globally optimal solution can be obtained, and because no initial solution has to be set, sensitivity to the initial solution is avoided as well. This resolves two common problems of the traditional k-modes algorithm.
3.3 Advantages and effects of the invention
Building on earlier research, the invention summarizes the long-standing shortcomings of k-modes and proposes a linear model of the k-modes algorithm. By defining new binary decision variables, the method converts the original problem into a linear integer programming problem, which resolves the difficulty the traditional k-modes algorithm has in reaching a globally optimal solution and its sensitivity to the initial solution. Solving instances from the UCI repository shows that the proposed model is feasible and effective.
4. Description of the drawings
Fig. 1 is a flowchart of the implementation of the method of the invention.
5. Detailed description
This section uses Lenses, a benchmark data set published in the UCI repository, to illustrate the method in detail. The data set has 24 data objects, each described by 4 attributes A1, A2, A3, A4. All attributes are categorical, with {3, 2, 2, 2} category values respectively. The whole data set is to be divided into 3 clusters, as set in advance. The data values are shown in the table below:
Table 1. The Lenses data set
The improved global optimization k-modes clustering method of the invention, shown in Fig. 1, is implemented as follows:
Step 1: Data preprocessing
Because the data obtained are already categorical values expressed as integers, numbering the data objects completes the preprocessing. The data set is to be grouped into 3 classes. See Table 2.
Table 2
Step 2: Build the linearized mathematical programming model of k-modes
Objective function:
Constraints:
(1)
(2)
(3)
(4)
(5)
(6)
Step 3: Model solution
Write the AMPL data file from the data in Table 2 and the specified number of clusters (3), and name it lenses.dat.
Write the AMPL model file from the mathematical programming model built in Step 2, and name it lenses.mod. Because the model is linear, solving it to optimality is entirely feasible. Create the batch file lenses.sh.
Finally, run the batch file lenses.sh with AMPL to start solving.
Step 4: Result output
Solving the model gives the following result.
Inspect wik: if wik = 1, object i belongs to class k. The decision variables give the following grouping of the data objects: {1,5,6,7,8,13,15,21,22}, {2,4,9,10,12,14,16,18}, {3,11,17,19,20,23,24}. The optimal objective function value, that is, the minimum within-group distance, is 27.
Clustering the same data set with the traditional k-modes method gives a minimum within-group distance of 32.3, which is larger than the value obtained by this method. The method, a k-modes clustering optimization based on linear integer programming, therefore obtains a better clustering result than the traditional k-modes clustering method.
Claims (2)
1. An improved global optimization k-modes clustering method, characterized in that its steps are as follows:
Step 1: Data preprocessing
Number the n objects to be clustered from 1 to n; let yij denote the value of the j-th attribute of data object i; categorical values are replaced by the integers 1, 2, 3, …; for example, if attribute Aj has value domain Vj with qj category values, these values are represented by {1, 2, 3, …, qj}; record the number of clusters required;
Step 2: Build the linearized mathematical programming model of k-modes
Following the idea of the k-modes algorithm, the data objects in X are grouped into l clusters, l ≤ n; the clustering criterion is to find l cluster centers {C1, C2, …, Cl} such that the sum of the distances between each data object and its nearest cluster center is minimal, this sum of within-group distances being called the objective function; the invention builds a linear programming model to carry out the k-modes computation; the model is built as follows:
a) k-modes minimizes the sum of the distances between each data object and its nearest cluster center, and this sum of within-group distances is called the objective function; the objective function of the linear programming model takes this as its goal, that is, the distances between each data object i and its cluster center k are summed over every attribute j; the objective function is thus established as follows:
$$\min F(W,U)=\sum_{k=1}^{l}\sum_{i=1}^{n}\sum_{j=1}^{m} d_{ikj}$$
b) The distance dikj used in the objective function must satisfy the following requirement: the distance between data object i and cluster center k on attribute Aj is counted only when object i belongs to the class of cluster center k, that is, dikj = 1 if and only if object i belongs to that class and its value of the attribute differs from that of the cluster center; this gives constraint (1);
(1)
c) The attribute values of the cluster centers are restricted: each attribute of a cluster center takes exactly one value; this gives constraint (2);
(2)
d) The grouping rules are specified: each data object i must be assigned to exactly one class, which gives constraint (3); the clustering result must be divided into the prespecified l classes, which gives constraint (4);
(3)
(4)
e) The input data give the value of the j-th attribute of object i in the form yij, but a linearized model is hard to build from this representation, so it is converted into a new form: xijt is a 0/1 parameter with xijt = 1 when the j-th attribute of data object i takes the t-th category value, that is, yij = t; this gives constraint (6); because the j-th attribute of object i can take only one value, constraint (5) follows;
(5)
(6)
The linear programming model is therefore summarized as follows:
Objective function:
$$\min F(W,U)=\sum_{k=1}^{l}\sum_{i=1}^{n}\sum_{j=1}^{m} d_{ikj}$$
Constraints:
(1)
(2)
(3)
(4)
(5)
(6)
The meaning of each symbol used above can be looked up in the symbol table of the description;
Step 3: Model solution
The model above is solved with commercial software such as Lingo or CPLEX; because the mathematical programming model is linear, solving it to optimality is entirely feasible;
Step 4: Result output
Solving the model yields the optimal objective function value together with the corresponding values of the decision variables wik and ukjt; by their definitions, ukjt and wik determine the cluster assignment of each data object and the center of each cluster;
(1) For every i ∈ N, inspect wik: if wik = 1, object i belongs to class k;
(2) If ukjt = 1, the j-th attribute of the center of cluster k takes the t-th category value;
The steps above carry out k-modes clustering; because the model is linear, the globally optimal solution can be obtained, and because no initial solution has to be set, sensitivity to the initial solution is avoided as well, which resolves two common problems of the traditional k-modes algorithm.
2. The improved global optimization k-modes clustering method according to claim 1, characterized in that:
the "model" in Step 3 is the linear programming model made up of the objective function and constraints (1) to (6) built in Step 2;
for the "model solution" in Step 3, the AMPL language is used to call the CPLEX solver, and the procedure is as follows:
(1) Enter the data to be clustered and the basic clustering parameters, creating the AMPL data file xxx.dat;
(2) Create the AMPL model file xxx.mod, which defines the linear programming model;
(3) Create the AMPL batch file xxx.sh;
(4) Run the batch file xxx.sh with AMPL to start solving.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710177995.2A CN107122793A (en) | 2017-03-23 | 2017-03-23 | A kind of improved global optimization k modes clustering methods |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710177995.2A CN107122793A (en) | 2017-03-23 | 2017-03-23 | A kind of improved global optimization k modes clustering methods |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107122793A true CN107122793A (en) | 2017-09-01 |
Family
ID=59718021
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710177995.2A Pending CN107122793A (en) | 2017-03-23 | 2017-03-23 | A kind of improved global optimization k modes clustering methods |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107122793A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108256102A (en) * | 2018-02-01 | 2018-07-06 | 厦门大学嘉庚学院 | A kind of Independent College Studentss based on cluster comment religion data analysing method |
CN111160382A (en) * | 2019-09-29 | 2020-05-15 | 山西大学 | Effective method for processing classified data in real life |
CN112132217A (en) * | 2020-09-23 | 2020-12-25 | 广西大学 | Classification type data clustering method based on intra-cluster dissimilarity |
CN113159392A (en) * | 2021-03-30 | 2021-07-23 | 刘昊戈 | Optimization calculation method for position distribution problem of continuous space |
-
2017
- 2017-03-23 CN CN201710177995.2A patent/CN107122793A/en active Pending
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108256102A (en) * | 2018-02-01 | 2018-07-06 | 厦门大学嘉庚学院 | A kind of Independent College Studentss based on cluster comment religion data analysing method |
CN108256102B (en) * | 2018-02-01 | 2022-02-11 | 厦门大学嘉庚学院 | Independent college student evaluation and education data analysis method based on clustering |
CN111160382A (en) * | 2019-09-29 | 2020-05-15 | 山西大学 | Effective method for processing classified data in real life |
CN112132217A (en) * | 2020-09-23 | 2020-12-25 | 广西大学 | Classification type data clustering method based on intra-cluster dissimilarity |
CN112132217B (en) * | 2020-09-23 | 2023-08-15 | 广西大学 | Classification type data clustering method based on inter-cluster dissimilarity in clusters |
CN113159392A (en) * | 2021-03-30 | 2021-07-23 | 刘昊戈 | Optimization calculation method for position distribution problem of continuous space |
CN113159392B (en) * | 2021-03-30 | 2022-06-24 | 刘昊戈 | Optimization calculation method for position distribution problem of continuous space |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20170901 |