CN107122793A

CN107122793A - An improved global-optimization k-modes clustering method

Info

Publication number: CN107122793A
Application number: CN201710177995.2A
Authority: CN
Inventors: 黄昌浩; 肖依永
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2017-03-23
Filing date: 2017-03-23
Publication date: 2017-09-01

Abstract

An improved global-optimization k-modes clustering method with the following steps. Step one: data preprocessing; step two: building the linearized k-modes mathematical programming model; step three: model solution; step four: result output. Through these steps the invention accomplishes k-modes clustering. Because the mathematical model is linear, a globally optimal solution can be obtained and no initial solution needs to be set, which avoids sensitivity to the initial solution. The method thus solves two common problems of traditional k-modes computation.

Description

An improved global-optimization k-modes clustering method

1. Technical field:

The present invention provides a method that guarantees an optimal solution for k-modes clustering, overcoming the sensitivity of the traditional k-modes clustering algorithm to the initial solution and its difficulty in obtaining a globally optimal solution. The method belongs to the field of data analysis and data mining and enables data-processing practitioners to obtain better clustering results.

2. Background art:

We now live in an era of data: every day, data from all sides pour into our computers, networks and data storage devices of all kinds. With the sharp growth in data volume and the widespread use of data, a powerful and general tool is needed to process this explosion of data efficiently. Data mining methods arose to handle all kinds of data problems and to turn different data into useful knowledge.

Clustering is one of the most basic and most important techniques among the data mining methods mentioned above. Clustering is the process of partitioning a set of data objects into multiple classes on the principle of maximizing similarity within a class and minimizing similarity between classes. That is, data objects in the same cluster are highly similar to each other and very dissimilar to objects in other clusters. Basic clustering algorithms currently fall into the following categories: partitioning methods, hierarchical methods, density-based methods and grid-based methods. Partitioning methods are the most basic and most practical clustering algorithms and a basic step for subsequent analysis. The commonly used k-modes algorithm is a classical partitioning method for categorical data. However, it is based on a nonlinear integer programming model, which brings two problems that are hard to solve: (1) an initial value must be set, so different choices of initial value affect the final clustering result and the algorithm as a whole is unstable; (2) the algorithm often becomes trapped in a locally optimal solution and has difficulty reaching the globally optimal solution.

Based on an integer programming (IP) model, the present invention converts the k-modes clustering problem into a linear programming model and eliminates the step of setting an initial solution; because the whole model is linear, a globally optimal solution can be obtained. This resolves the two problems mentioned above. The method describes the model in AMPL (A Mathematical Programming Language), a modeling language for describing and solving large-scale complex mathematical problems, and then calls a mixed integer programming (MIP) solver (such as CPLEX or Lingo) to carry out medium-scale clustering computations. The invention is an improved global-optimization clustering method: it first proposes a new representation of categorical data sets and then uses that representation to build a k-modes integer linear programming model.

3. Summary of the invention:

3.1 Object of the invention

The object of the invention is to overcome the long-standing shortcomings of the traditional k-modes method by proposing a stable clustering method that can obtain a globally optimal solution, giving practitioners of data mining and big data analysis a scheme that yields better clustering results.

3.2 Technical solution

The problem is abstracted as follows. Suppose data set X contains n data objects, X = {X₁, X₂, …, X_n}. Each data object is described by m attributes A₁, A₂, …, A_m, so each data object X_i can be written as X_i = {x_i1, x_i2, …, x_im}. Because the k-modes algorithm deals with categorical attribute values, each categorical attribute A_j has a domain, where p_j is the number of category values. The number of clusters is assumed to be given in advance, and the k-modes algorithm groups the data objects into l (l ≤ n) clusters. The idea of k-modes is first to find l cluster centers {C₁, C₂, …, C_l} such that the sum of the distances d_XiCj between each data object and its nearest cluster center is minimal; this sum of within-group distances is called the objective function, and d_XiCj can also be regarded as a measure of dissimilarity between two objects. The cluster center of each group is then redetermined from the current grouping, after which the grouping step is repeated; the two steps alternate until convergence. The goal of clustering is to minimize the total within-group distance.
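The traditional two-step iteration described above can be sketched in Python as follows. This is a minimal illustration, not the patent's own code: the function names are hypothetical, and the distance is assumed to be the simple matching dissimilarity (the number of attributes on which two objects differ), which agrees with the per-attribute 0/1 distance used later in the model.

```python
from collections import Counter


def mismatch_distance(a, b):
    """Simple matching dissimilarity: number of attributes whose values differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def cluster_mode(rows):
    """Attribute-wise most frequent value of a non-empty list of rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def kmodes(data, centers, max_iter=100):
    """Heuristic k-modes: alternate assignment and mode update until stable."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Grouping step: each object goes to its nearest cluster center.
        labels = [min(range(len(centers)),
                      key=lambda k: mismatch_distance(row, centers[k]))
                  for row in data]
        # Update step: each center becomes the mode of its members
        # (an empty cluster keeps its previous center).
        new_centers = []
        for k, old in enumerate(centers):
            members = [row for row, lab in zip(data, labels) if lab == k]
            new_centers.append(cluster_mode(members) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    cost = sum(mismatch_distance(row, centers[lab])
               for row, lab in zip(data, labels))
    return labels, centers, cost
```

The result depends on the `centers` argument, which is exactly the initial-solution sensitivity that the invention sets out to remove.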

The mathematical symbols used in the technical solution are listed below:

The improved global-optimization k-modes clustering method of the present invention has the following steps:

Step one: Data preprocessing

The n objects to be clustered are numbered from 1 to n. y_ij denotes the value of the j-th attribute of data object i. Categorical values are replaced by the integers 1, 2, 3, …; for example, if attribute A_j has domain V_j containing q_j category values, then {1, 2, 3, …, q_j} represent the category values respectively. The number of clusters required is recorded;
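As a small illustration of this preprocessing step, the following Python sketch maps raw category labels to the integer codes 1..q_j. The function name and the choice to assign codes in order of first appearance are assumptions made for the example; the text does not prescribe any particular ordering.

```python
def encode_categories(records):
    """Map raw categorical values to integer codes 1..q_j for each attribute j."""
    m = len(records[0])
    # One code book per attribute, filled in order of first appearance.
    codebooks = [{} for _ in range(m)]
    y = []
    for rec in records:
        row = []
        for j, value in enumerate(rec):
            book = codebooks[j]
            if value not in book:
                book[value] = len(book) + 1
            row.append(book[value])
        y.append(row)
    # q[j] is the number of distinct category values of attribute j.
    q = [len(book) for book in codebooks]
    return y, q, codebooks
```

Here `y[i][j]` plays the role of y_ij (with 0-based list indices) and `q[j]` the role of q_j.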

Step two: Building the linearized k-modes mathematical programming model

According to the k-modes idea, the data objects X are grouped into l (l ≤ n) clusters. The clustering criterion is to find l cluster centers {C₁, C₂, …, C_l} such that the sum of the distances between each data object and its nearest cluster center is minimal; this sum of within-group distances is called the objective function. The present invention builds a linear programming model to carry out the k-modes computation. The linear programming model is built as follows:

A) k-modes minimizes the sum of the distances between each data object and its nearest cluster center, and this sum of within-group distances is the objective function. The objective function of the linear programming model is built on this basis: the distances between each data object i and its cluster center k on each attribute j are summed. See the objective function given below.

B) The distance d_ikj used in the objective function must be correct, i.e. it must measure the distance between data object i and cluster center k on attribute A_j: d_ikj = 1 if and only if object i belongs to the cluster of center k and the value of object i on that attribute differs from the value of the cluster center. This gives constraint (1).

(1)

C) The attribute values of the cluster centers are restricted: each attribute of a cluster center takes exactly one attribute value. This gives constraint (2).

(2)

D) The grouping rules are specified: each data object i must be assigned to one and only one cluster, which gives constraint (3); the clustering result must consist of the prescribed l clusters, which gives constraint (4).

(3)

(4)

E) The input data give the value of the j-th attribute of object i in the form y_ij, but this representation makes it difficult to build a linearized model. It is therefore converted into a new form: x_ijt is a 0-1 indicator with x_ijt = 1 when the j-th attribute of data object i takes the t-th category value (i.e. y_ij = t). This gives constraint (6). Because the j-th attribute of object i can take only one value, the values x_ijt sum to 1 over t, which gives constraint (5);

(5)

(6)
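The conversion in item E from the integer codes y_ij to the 0-1 indicators x_ijt can be sketched in Python as below. The function name is hypothetical, and the nested-list layout `x[i][j][t-1]` is just one convenient choice for the illustration.

```python
def one_hot(y, q):
    """Build x[i][j][t-1] = 1 if y[i][j] == t, else 0, for t = 1..q[j]."""
    return [[[1 if y_ij == t else 0 for t in range(1, q_j + 1)]
             for y_ij, q_j in zip(row, q)]
            for row in y]
```

By construction every inner list contains exactly one 1, which is the property that constraint (5) expresses.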

In summary, the linear programming model is as follows:

Objective function:

$$\min F(W,U)=\sum_{k=1}^{l}\sum_{i=1}^{n}\sum_{j=1}^{m} d_{ikj}$$

Constraints:

(1)

(2)

(3)

(4)

(5)

(6)

The meaning of every symbol used above can be found in the symbol table.
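The constraint formulas themselves are not reproduced in this text. Purely as an illustration, one formulation that is consistent with items A to E could read as follows. It is an assumption, not necessarily the form used in the patent: w_ik = 1 is taken to mean that object i is assigned to cluster k, u_kjt = 1 that attribute j of cluster center k takes the t-th category value (as in step four), and constraint (4) is read here as requiring every one of the l clusters to be non-empty.

```latex
\begin{align}
\min\quad & F(W,U)=\sum_{k=1}^{l}\sum_{i=1}^{n}\sum_{j=1}^{m} d_{ikj} \notag\\
\text{s.t.}\quad
 & d_{ikj} \ge w_{ik}-\sum_{t=1}^{q_j} x_{ijt}\,u_{kjt}
   && \forall i,k,j \tag{1}\\
 & \sum_{t=1}^{q_j} u_{kjt}=1 && \forall k,j \tag{2}\\
 & \sum_{k=1}^{l} w_{ik}=1 && \forall i \tag{3}\\
 & \sum_{i=1}^{n} w_{ik}\ge 1 && \forall k \tag{4}\\
 & \sum_{t=1}^{q_j} x_{ijt}=1 && \forall i,j \tag{5}\\
 & x_{ijt}=\begin{cases}1,& y_{ij}=t\\ 0,& \text{otherwise}\end{cases}
   && \forall i,j,t \tag{6}\\
 & w_{ik},\,u_{kjt}\in\{0,1\},\quad d_{ikj}\ge 0 \notag
\end{align}
```

Because x_ijt is data rather than a variable, the product x_ijt u_kjt is linear in the decision variables. Under minimization, d_ikj settles at 1 exactly when object i belongs to cluster k and its value on attribute j differs from the center's value, and at 0 otherwise, which matches the description in item B.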

Step three: Model solution

The above model is solved with commercial software such as Lingo or CPLEX. Because the mathematical programming model is linear, solving it to optimality is entirely feasible;

Here, "model" refers to the linear programming model consisting of the objective function and constraints (1) to (6) built in step two;

For "model solution", the AMPL language is chosen to call the CPLEX solver. The specific procedure is as follows:

(1) Enter the data to be clustered and the basic clustering parameters, creating the AMPL data file xxx.dat;

(2) Create the AMPL model file xxx.mod, which defines the linear programming model;

(3) Create the AMPL batch file xxx.sh;

(4) Use AMPL to run the batch file xxx.sh and start solving;
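The AMPL and CPLEX files are not listed in this text. For very small instances, the globally optimal objective value that a solver should report can be cross-checked by exhaustive enumeration. The Python sketch below is such a check, not the linear model itself: the function names are hypothetical, the distance is assumed to be the simple matching dissimilarity, and every cluster is assumed to be non-empty. Its running time grows as l^n, so it is only usable for a handful of objects.

```python
from collections import Counter
from itertools import product


def partition_cost(data, labels, l):
    """Total within-cluster mismatch when each center is the attribute-wise mode."""
    total = 0
    for k in range(l):
        members = [row for row, lab in zip(data, labels) if lab == k]
        for col in zip(*members):
            # With the mode as center, mismatches = size - frequency of the mode.
            total += len(col) - Counter(col).most_common(1)[0][1]
    return total


def brute_force_kmodes(data, l):
    """Try every assignment of objects to l non-empty clusters; keep the best."""
    if l > len(data):
        raise ValueError("l must not exceed the number of objects")
    best_cost, best_labels = None, None
    for labels in product(range(l), repeat=len(data)):
        if len(set(labels)) < l:  # skip assignments that leave a cluster empty
            continue
        cost = partition_cost(data, labels, l)
        if best_cost is None or cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_cost, list(best_labels)
```

No initial centers are supplied, so the result does not depend on a starting point; this is the property the linear model achieves at a scale where enumeration is impossible.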

Step four: Result output

Solving the model yields the optimal objective function value, together with the values of the decision variables w_ik and u_kjt that correspond to it. By their definitions, the decision variables u_kjt and w_ik determine the cluster assignment of each data object and the center of each cluster;

(1) For every i ∈ N, inspect w_ik: if w_ik = 1, object i belongs to cluster k;

(2) If u_kjt = 1, the j-th attribute of the center of cluster k takes the t-th category value;
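Reading the clustering off the solved variables, as described in (1) and (2), can be sketched in Python as follows. The function name and the nested-list layout (`w[i][k]`, `u[k][j][t]`, 0-based) are assumptions for the illustration; values are rounded because a solver may return numbers such as 0.9999999 for a binary variable.

```python
def decode_solution(w, u):
    """Turn 0/1 arrays w[i][k] and u[k][j][t] into 1-based labels and centers."""
    # Object i belongs to the cluster k for which w[i][k] is 1.
    labels = [next(k + 1 for k, v in enumerate(row) if round(v) == 1)
              for row in w]
    # Attribute j of center k takes the category t for which u[k][j][t] is 1.
    centers = [[next(t + 1 for t, v in enumerate(vals) if round(v) == 1)
                for vals in attrs]
               for attrs in u]
    return labels, centers
```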

The above steps accomplish k-modes clustering. Because the mathematical model is linear, a globally optimal solution can be obtained and no initial solution needs to be set, which avoids sensitivity to the initial solution. The two common problems of traditional k-modes computation are thereby solved.

3.3 Advantages and effects of the invention

Building on earlier research, the present invention summarizes the long-standing shortcomings of k-modes and proposes a linear model of the k-modes algorithm. By defining new binary decision variables, the method converts the original problem into a linear integer programming problem, which solves the problems that the traditional k-modes algorithm has difficulty obtaining a globally optimal solution and is easily influenced by the initial solution. Solving instances from the UCI database shows that the proposed model is feasible and effective.

4. Brief description of the drawings

Fig. 1 is a flow chart of the implementation of the method of the invention.

5. Detailed description of the embodiment

This section uses Lenses, a benchmark data set published in the UCI repository, to illustrate the specific procedure of the method. The data set has 24 data objects, each described by 4 attributes A1, A2, A3, A4. Every attribute is categorical, and the numbers of category values of the attributes are {3, 2, 2, 2} respectively. The whole data set is to be divided into 3 clusters. The data values are shown in the table below:

Table 1. The Lenses data set

The improved global-optimization k-modes clustering method of the present invention, as shown in Fig. 1, is implemented in the following steps:

Step one: Data preprocessing

Because the data obtained are already categorical data expressed as numbers, numbering each data object completes the preprocessing. The data set is to be grouped into 3 clusters. See Table 2.

Table 2

Step two: Building the linearized k-modes mathematical programming model

Objective function:

$$\min F(W,U)=\sum_{k=1}^{3}\sum_{i=1}^{24}\sum_{j=1}^{4} d_{ikj}$$

Constraints:

(1)

(2)

(3)

(4)

(5)

(6)

Step three: Model solution

Using the data of Table 2 and the prescribed number of clusters (3), write the AMPL data file and name it lenses.dat.

Using the mathematical programming model built in step two, write the AMPL model file and name it lenses.mod.

Because the mathematical model is linear, solving it to optimality is entirely feasible. Create the batch file lenses.sh.

Finally, use AMPL to run the batch file lenses.sh and start solving.

Step four: Result output

Solving the model gives the following results.

Inspect w_ik: if w_ik = 1, object i belongs to cluster k. The decision variables partition the data objects into the clusters {1, 5, 6, 7, 8, 13, 15, 21, 22}, {2, 4, 9, 10, 12, 14, 16, 18} and {3, 11, 17, 19, 20, 23, 24}. The optimal objective function value, i.e. the minimum within-group distance, is 27.

Clustering the same data set with the traditional k-modes method gives a minimum within-group distance of 32.3, which is greater than the value obtained by the present method. Therefore the present method, a k-modes clustering optimization method based on linear integer programming, obtains a better clustering result than the traditional k-modes clustering method.

Claims

1. An improved global-optimization k-modes clustering method, characterized in that its steps are as follows:

Step one: Data preprocessing

The n objects to be clustered are numbered from 1 to n; y_ij denotes the value of the j-th attribute of data object i; categorical values are replaced by the integers 1, 2, 3, …; for example, if attribute A_j has domain V_j containing q_j category values, then {1, 2, 3, …, q_j} represent the category values respectively; the number of clusters required is recorded;

Step two: Building the linearized k-modes mathematical programming model

According to the k-modes idea, the data objects X are grouped into l clusters, l ≤ n; the clustering criterion is to find l cluster centers {C₁, C₂, …, C_l} such that the sum of the distances between each data object and its nearest cluster center is minimal, this sum of within-group distances being called the objective function; the present invention builds a linear programming model to carry out the k-modes computation; the linear programming model is built as follows:

A) k-modes minimizes the sum of the distances between each data object and its nearest cluster center, and this sum of within-group distances is the objective function; the objective function of the linear programming model is built on this basis, i.e. the distances between each data object i and its cluster center k on each attribute j are summed; the objective function is thus as follows:

$$\min F(W,U)=\sum_{k=1}^{l}\sum_{i=1}^{n}\sum_{j=1}^{m} d_{ikj}$$

B) The distance d_ikj used in the objective function must be correct, i.e. it must measure the distance between data object i and cluster center k on attribute A_j: d_ikj = 1 if and only if object i belongs to the cluster of center k and the value of object i on that attribute differs from the value of the cluster center; this gives constraint (1);

(1)

C) The attribute values of the cluster centers are restricted: each attribute of a cluster center takes exactly one attribute value; this gives constraint (2);

(2)

D) The grouping rules are specified: each data object i must be assigned to one and only one cluster, which gives constraint (3); the clustering result must consist of the prescribed l clusters, which gives constraint (4);

(3)

(4)

E) The input data give the value of the j-th attribute of object i in the form y_ij, but this representation makes it difficult to build a linearized model; it is therefore converted into a new form: x_ijt is a 0-1 indicator with x_ijt = 1 when the j-th attribute of data object i takes the t-th category value, i.e. y_ij = t; this gives constraint (6); because the j-th attribute of object i can take only one value, the values x_ijt sum to 1 over t, which gives constraint (5);

(5)

(6)

In summary, the linear programming model is as follows:

Objective function:

$$\min F(W,U)=\sum_{k=1}^{l}\sum_{i=1}^{n}\sum_{j=1}^{m} d_{ikj}$$

Constraints:

(1)

(2)

(3)

(4)

(5)

(6)

The meaning of every symbol used above can be found in the symbol table in the description;

Step three: Model solution

The above model is solved with commercial software such as Lingo or CPLEX; because the mathematical programming model is linear, solving it to optimality is entirely feasible;

Step four: Result output

Solving the model yields the optimal objective function value, together with the values of the decision variables w_ik and u_kjt that correspond to it; by their definitions, the decision variables u_kjt and w_ik determine the cluster assignment of each data object and the center of each cluster;

The above steps accomplish k-modes clustering; because the mathematical model is linear, a globally optimal solution can be obtained and no initial solution needs to be set, which avoids sensitivity to the initial solution and solves the two common problems of traditional k-modes computation.

2. The improved global-optimization k-modes clustering method according to claim 1, characterized in that:

the "model" in step three refers to the linear programming model consisting of the objective function and constraints (1) to (6) built in step two;

for the "model solution" in step three, the AMPL language is chosen to call the CPLEX solver, and the specific procedure is as follows:

(1) Enter the data to be clustered and the basic clustering parameters, creating the AMPL data file xxx.dat;

(2) Create the AMPL model file xxx.mod, which defines the linear programming model;

(3) Create the AMPL batch file xxx.sh;

(4) Use AMPL to run the batch file xxx.sh and start solving.