2. Background Art:
We now live in an age of data. Every day, data from all sides pour into our computers, networks,
and storage devices of every kind. Because the scale of data is growing sharply and data are now so
widely used, we need a powerful, general-purpose tool that can efficiently process this explosively
growing data. Data mining methods arose to address these data processing problems and to turn raw
data into useful knowledge.
Cluster analysis is one of the most basic and most important of these data mining techniques.
Clustering is the process of partitioning a set of data objects into multiple classes so that
within-class similarity is maximized and between-class similarity is minimized. That is, data
objects that end up in the same cluster are highly similar to one another and very dissimilar to
objects in other clusters. Current basic clustering algorithms fall into the following categories:
partitioning methods, hierarchical methods, density-based methods, and grid-based methods.
Partitioning methods are among the most basic and most practical clustering algorithms, and they
also serve as a basic step for subsequent analysis. The commonly used k-modes algorithm is a
classical partitioning method for categorical data. However, it is based on a nonlinear integer
programming model, which brings two problems that are difficult to solve: (1) because initial values
must be set, different choices of initial values affect the final clustering result, so the
algorithm as a whole is not stable; (2) the algorithm often becomes trapped in a locally optimal
solution, so a globally optimal solution is difficult to obtain.
Based on an integer programming (IP) model, the present invention converts the k-modes clustering
problem into a linear programming model. This eliminates the step of setting an initial solution,
and because the whole model is linear, a globally optimal solution can also be obtained. The two
problems mentioned above are thereby solved. The method states the model in AMPL (A Mathematical
Programming Language), a modeling language for describing and solving large-scale, complex
mathematical problems, and then calls a mixed integer programming (MIP) solver (such as CPLEX or
Lingo) to complete clustering computations of medium scale. The present invention is an improved
global optimization clustering method: it first proposes a new representation of categorical data
sets and then uses this representation to build a k-modes integer linear programming model.
3. Summary of the Invention:
3.1 Object of the invention
The object of the invention is to overcome the shortcomings that the traditional k-modes method has
always faced and to propose a stable clustering method that can obtain a globally optimal solution,
thereby giving practitioners of data mining and big data analysis a scheme that yields better
clustering results.
3.2 Technical scheme
The problem is abstracted as follows. Assume that data set X contains n data objects,
X = {X_1, X_2, …, X_n}. Each data object is described by m attributes A_1, A_2, …, A_m, so each
data object X_i can be written as X_i = {x_i1, x_i2, …, x_im}. Because the k-modes algorithm deals
with categorical attribute values, each categorical attribute A_j here has a value domain, and p_j
is the number of category values in that domain. The number of clusters is assumed to be given in
advance; the k-modes algorithm groups the data objects of X into l (l ≤ n) clusters. The idea of
the k-modes algorithm is first to find l cluster centers {C_1, C_2, …, C_l} such that the sum of
the distances d(X_i, C_j) between each data object and its nearest cluster center is minimized.
This sum of within-group distances is called the objective function, and d(X_i, C_j) can also be
called the dissimilarity measure between two objects. The center of each cluster is then
redetermined from the current grouping, after which the grouping step is repeated; the two steps
alternate until convergence. The goal of clustering is to minimize the sum of within-group
distances over all groups.
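The distance and objective described above can be illustrated with a minimal Python sketch. It assumes the simple matching dissimilarity (the number of attributes on which two objects differ), which agrees with the per-attribute distance of 0 or 1 used later in the model; the function names and the toy data are hypothetical and are not taken from the invention.

```python
from typing import Sequence


def mismatch_distance(obj: Sequence[int], center: Sequence[int]) -> int:
    """Simple matching dissimilarity: count the attributes whose values differ."""
    if len(obj) != len(center):
        raise ValueError("object and center must have the same number of attributes")
    return sum(1 for a, b in zip(obj, center) if a != b)


def within_group_distance(data, centers) -> int:
    """Sum, over all objects, of the distance to the nearest cluster center."""
    return sum(min(mismatch_distance(x, c) for c in centers) for x in data)


# Hypothetical toy data: 4 objects, 3 categorical attributes coded as integers.
toy_data = [(1, 1, 2), (1, 1, 1), (2, 2, 1), (2, 2, 2)]
toy_centers = [(1, 1, 1), (2, 2, 2)]
total = within_group_distance(toy_data, toy_centers)
```

For this toy input, two objects coincide with a center and the other two each differ from their nearest center on one attribute.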
The mathematical symbols used in the technical scheme are listed below:
The improved global optimization k-modes clustering method of the present invention comprises the
following steps:
Step 1: Data preprocessing
Number the n objects to be clustered from 1 to n, and let y_ij denote the value of the jth
attribute of data object i. The categorical values are replaced by the integers 1, 2, 3, …
respectively. For example, if attribute A_j has value domain V_j containing q_j category values,
then {1, 2, 3, …, q_j} represent the respective category values. Record the number of clusters
required;
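As a rough illustration of this preprocessing step, the following Python sketch replaces category labels with the integers 1 to q_j. The text does not say which label receives which integer, so numbering by order of first appearance is an assumption, and the helper name and sample rows are hypothetical.

```python
def encode_categories(rows):
    """Replace each attribute's labels by 1..q_j (assumed: order of first appearance).

    Returns the coded matrix y (y[i][j] is the code of attribute j of object i)
    and one label-to-code dictionary per attribute.
    """
    codebooks = [{} for _ in range(len(rows[0]))]
    y = []
    for row in rows:
        coded = []
        for j, label in enumerate(row):
            book = codebooks[j]
            if label not in book:
                book[label] = len(book) + 1  # next unused integer for attribute j
            coded.append(book[label])
        y.append(coded)
    return y, codebooks


# Hypothetical raw records with two categorical attributes.
raw = [("young", "no"), ("old", "yes"), ("young", "yes")]
y, codebooks = encode_categories(raw)
q = [len(book) for book in codebooks]  # number of category values per attribute
```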
Step 2: Build the linearized k-modes mathematical programming model
Following the k-modes idea, the data objects of X are grouped into l (l ≤ n) clusters. The
clustering criterion is to find l cluster centers {C_1, C_2, …, C_l} such that the sum of the
distances between each data object and its nearest cluster center is minimized; this sum of
within-group distances is called the objective function. The present invention builds a linear
programming model to carry out the k-modes computation. The model is built as follows:
A) Because k-modes minimizes the sum of the distances between each data object and its nearest
cluster center, and this sum of within-group distances is called the objective function, the
objective function of the linear programming model is built on the same goal: the distances
between each data object i and cluster center k on each attribute j are summed. This gives the
objective function stated below.
B) The distance d_ikj used in the objective function must be correct, that is, it must measure the
distance between data object i and cluster center k on attribute A_j: d_ikj = 1 if and only if
object i belongs to the class of cluster center k and the value of object i on that attribute
differs from the value of the cluster center. This gives constraint (1).
(1)
C) The attribute values of the cluster centers must be well defined: each attribute of each cluster
center takes exactly one attribute value. This gives constraint (2).
(2)
D) The grouping rules are as follows: each data object i must be assigned to exactly one class,
which gives constraint (3); the clustering result must consist of the prespecified l classes,
which gives constraint (4).
(3)
(4)
E) The input data give the value of the jth attribute of object i in the form y_ij, but a
linearized model is difficult to build from this representation, so the data are converted into a
new form: x_ijt is a 0-1 indicator, with x_ijt = 1 when the jth attribute of data object i takes
the t-th category value (that is, y_ij = t). This gives constraint (6). Because the jth attribute
of object i can take only one value, constraint (5) is also established;
(5)
(6)
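The conversion from y_ij to the 0-1 indicators x_ijt described in E) can be sketched in Python as follows. The nested-list layout and the function name are illustrative assumptions; only the rule "x_ijt = 1 when y_ij = t" comes from the text.

```python
def one_hot_indicators(y, q):
    """Build x with x[i][j][t-1] = 1 if y[i][j] == t, else 0.

    y: coded data, y[i][j] in 1..q[j]; q: number of category values per attribute.
    """
    x = []
    for row in y:
        if len(row) != len(q):
            raise ValueError("each object needs one value per attribute")
        x_row = []
        for value, q_j in zip(row, q):
            if not 1 <= value <= q_j:
                raise ValueError("attribute value outside 1..q_j")
            x_row.append([1 if value == t else 0 for t in range(1, q_j + 1)])
        x.append(x_row)
    return x


# Hypothetical coded data: 2 objects, attributes with 3 and 2 category values.
y_toy = [[1, 2], [3, 1]]
q_toy = [3, 2]
x_toy = one_hot_indicators(y_toy, q_toy)
```

Each inner list contains exactly one 1, which is the property that constraint (5) is described as expressing.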
In summary, the linear programming model is as follows:
Objective function:
Constraints:
(1)
(2)
(3)
(4)
(5)
(6)
The meaning of each symbol used above can be found in the symbol table.
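The equations themselves are not reproduced in this text. As a hedged sketch only, one formulation consistent with the verbal descriptions in A) to E) could be written as below. The exact forms of constraints (1) and (4), the index sets, and the choice of inequalities are assumptions made for illustration and are not the source's own equations.

```latex
% Assumed index sets: i = 1..n (objects), k = 1..l (clusters),
% j = 1..m (attributes), t = 1..q_j (category values of attribute j).
% Assumed roles: x_{ijt} is input data; w_{ik}, u_{kjt}, d_{ikj} are decision variables.
\begin{align}
\min \quad & \sum_{i=1}^{n} \sum_{k=1}^{l} \sum_{j=1}^{m} d_{ikj} \notag \\
\text{s.t.} \quad
& d_{ikj} \ge w_{ik} - \sum_{t=1}^{q_j} x_{ijt}\, u_{kjt}
  && \forall i, k, j \tag{1} \\
& \sum_{t=1}^{q_j} u_{kjt} = 1 && \forall k, j \tag{2} \\
& \sum_{k=1}^{l} w_{ik} = 1 && \forall i \tag{3} \\
& \sum_{i=1}^{n} w_{ik} \ge 1 && \forall k \tag{4} \\
& \sum_{t=1}^{q_j} x_{ijt} = 1 && \forall i, j \tag{5} \\
& x_{ijt} = \begin{cases} 1, & y_{ij} = t \\ 0, & \text{otherwise} \end{cases}
  && \forall i, j, t \tag{6} \\
& w_{ik} \in \{0, 1\}, \quad u_{kjt} \in \{0, 1\}, \quad d_{ikj} \ge 0 \notag
\end{align}
```

In this sketch x_ijt is data, so the product x_ijt u_kjt stays linear in the decision variables. Under minimization, d_ikj is driven to 1 exactly when w_ik = 1 and the center of cluster k takes a different value from object i on attribute j, and to 0 otherwise, which matches the description in B).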
Step 3: Model solution
The above model is solved with commercial software such as Lingo or CPLEX. Because this
mathematical programming model is linear, an optimal solution can be obtained and solving it is
feasible;
Here, the "model" refers to the linear programming model formed by the objective function and
constraints (1) to (6) built in Step 2;
For the "model solution", the CPLEX solver is called from the AMPL language to solve the model.
The specific procedure is as follows:
(1) Enter the data to be clustered and the basic clustering parameters, creating the AMPL data
file xxx.dat;
(2) Create the AMPL model file xxx.mod, which states the linear programming model;
(3) Create the AMPL batch file xxx.sh;
(4) Call the batch file xxx.sh with AMPL to start solving;
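The AMPL and CPLEX files are not reproduced here. To illustrate only what a globally optimal grouping means, the following Python sketch enumerates every assignment of a tiny data set and keeps the cheapest one. This exhaustive search is a stand-in chosen for illustration, not the solver procedure of the invention, and it is practical only for a handful of objects; requiring every cluster to be non-empty is an assumption, and the names and toy data are hypothetical.

```python
from collections import Counter
from itertools import product


def cluster_cost(members):
    """Mismatches of a cluster against its best center (the per-attribute modes)."""
    cost = 0
    for column in zip(*members):
        most_frequent = Counter(column).most_common(1)[0][1]
        cost += len(column) - most_frequent
    return cost


def brute_force_kmodes(data, l):
    """Try every labelling with l non-empty clusters; return (best cost, labels)."""
    best_cost, best_labels = None, None
    for labels in product(range(l), repeat=len(data)):
        if len(set(labels)) != l:  # assumed: all l clusters must be used
            continue
        cost = sum(
            cluster_cost([x for x, lab in zip(data, labels) if lab == k])
            for k in range(l)
        )
        if best_cost is None or cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_cost, best_labels


# Hypothetical toy data: 4 objects, 3 coded categorical attributes, 2 clusters.
toy = [(1, 1, 2), (1, 1, 1), (2, 2, 1), (2, 2, 2)]
best_cost, best_labels = brute_force_kmodes(toy, 2)
```

Because every feasible grouping is examined, the result does not depend on any initial solution, which is the property the linear model is intended to deliver at larger scale through the solver.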
Step 4: Output of results
Solving the model yields the optimal objective function value, together with the corresponding
values of the decision variables w_ik and u_kjt. By definition, the decision variables u_kjt and
w_ik determine the cluster assignment of each data object and the center of each cluster;
(1) For all i ∈ N, examine w_ik: if w_ik = 1, object i belongs to class k;
(2) If u_kjt = 1, the jth attribute of the center of cluster k takes the t-th category value;
The above steps accomplish the purpose of k-modes clustering. Moreover, because the mathematical
model is linear, a globally optimal solution can be obtained, and because no initial solution needs
to be set, sensitivity to the initial solution is avoided. The two common problems of traditional
k-modes computation are thus solved.
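Reading the grouping and the cluster centers from the solved variables, as in items (1) and (2) of Step 4, can be sketched in Python as follows. The nested-list layout of w and u, the rounding of solver values, and the function name are assumptions made for illustration.

```python
def decode_solution(w, u):
    """Turn 0-1 solution values into 1-based cluster labels and cluster centers.

    w[i][k] = 1 means object i+1 belongs to cluster k+1.
    u[k][j][t] = 1 means attribute j+1 of center k+1 takes category value t+1.
    """
    labels = []
    for row in w:
        chosen = [k + 1 for k, value in enumerate(row) if round(value) == 1]
        if len(chosen) != 1:
            raise ValueError("each object must belong to exactly one cluster")
        labels.append(chosen[0])
    centers = []
    for attrs in u:
        center = []
        for flags in attrs:
            picked = [t + 1 for t, value in enumerate(flags) if round(value) == 1]
            if len(picked) != 1:
                raise ValueError("each center attribute must take exactly one value")
            center.append(picked[0])
        centers.append(center)
    return labels, centers


# Hypothetical solver output: 3 objects, 2 clusters, attributes with 3 and 2 values.
w_toy = [[1, 0], [1, 0], [0, 1]]
u_toy = [[[1, 0, 0], [0, 1]], [[0, 0, 1], [1, 0]]]
labels, centers = decode_solution(w_toy, u_toy)
```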
3.3 Advantages and effects of the invention
Building on earlier research, the present invention summarizes the long-standing shortcomings of
k-modes and proposes a linear model of the k-modes algorithm. By defining new binary decision
variables, the method converts the original problem into a linear integer programming problem,
which solves the problems that the traditional k-modes algorithm has difficulty obtaining a
globally optimal solution and is easily affected by the initial solution. Solving examples from
the UCI repository shows that the model proposed by the present invention is feasible and
effective.
5. Detailed Description of Embodiments
This section uses Lenses, a benchmark data set published in the UCI repository, to illustrate the
specific procedure of the method of the invention. The data set has 24 data objects, each
described by 4 attributes A1, A2, A3, A4. Every attribute is categorical, and the numbers of
category values of the attributes are {3, 2, 2, 2} respectively. The whole data set is to be
divided into 3 clusters, as set in advance. The specific data values are shown in the table below:
Table 1. The Lenses data set
The improved global optimization k-modes clustering method of the present invention, as shown in
Figure 1, is implemented by the following steps:
Step 1: Data preprocessing
Because the data obtained are already categorical data expressed as numerals, numbering each data
object directly completes the data preprocessing. The data set is to be grouped into 3 classes.
See Table 2.
Table 2
Step 2: Build the linearized k-modes mathematical programming model
Objective function:
Constraints:
(1)
(2)
(3)
(4)
(5)
(6)
Step 3: Model solution
Write the AMPL data file from the data in Table 2 and the specified number of clusters (3), and
name it lenses.dat.
Write the AMPL model file from the mathematical programming model built in Step 2, and name it
lenses.mod. Because this mathematical model is linear, an optimal solution can be obtained and
solving it is feasible. Create the batch file lenses.sh.
Finally, call the batch file lenses.sh with AMPL to start solving.
Step 4: Output of results
Solving the model gives the following results:
Examine w_ik: if w_ik = 1, object i belongs to class k. The decision variables give the following
grouping of the data objects: {1,5,6,7,8,13,15,21,22}, {2,4,9,10,12,14,16,18},
{3,11,17,19,20,23,24}. The optimal objective function value, that is, the minimum within-group
distance, is 27.
Clustering the same data set with the traditional k-modes method gives a minimum within-group
distance of 32.3, which is larger than the value obtained by this method. Therefore this method,
the k-modes clustering optimization method based on linear integer programming, obtains a better
clustering result than the traditional k-modes clustering method.