CN107122793A - An improved global-optimization k-modes clustering method - Google Patents

An improved global-optimization k-modes clustering method

Info

Publication number
CN107122793A
Authority
CN
China
Prior art keywords
data
cluster
value
modes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710177995.2A
Other languages
Chinese (zh)
Inventor
黄昌浩
肖依永
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN201710177995.2A
Publication of CN107122793A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an improved global-optimization k-modes clustering method comprising the following steps. Step 1: data preprocessing. Step 2: building a linearized k-modes mathematical programming model. Step 3: solving the model. Step 4: outputting the results. Through these steps the invention accomplishes k-modes clustering. Because the mathematical model is linear, a globally optimal solution can be obtained, and because no initial solution has to be set, sensitivity to the initial solution is avoided. The method thus resolves two common problems of traditional k-modes computation.

Description

An improved global-optimization k-modes clustering method
1. Technical field:
The invention provides a method that guarantees that k-modes clustering obtains an optimal solution. It addresses the problems that traditional k-modes clustering algorithms are sensitive to the initial solution and rarely reach a globally optimal solution. The method belongs to the field of data analysis and data mining and helps data-processing practitioners achieve better clustering results.
2. Background:
We live in an era of data. Every day, data from many sources pour into our computers, networks, and storage devices. The sharp growth in data volume and the widespread use of data call for a powerful, general-purpose tool that can process this explosion of data efficiently. Data mining methods emerged to handle these varied data problems and to turn different kinds of data into useful knowledge.
Cluster analysis is one of the most basic and most important techniques among the data mining methods mentioned above. Clustering partitions a set of data objects into multiple classes according to the principle of maximizing similarity within a class and minimizing similarity between classes. That is, objects that end up in the same cluster are highly similar to each other and dissimilar to objects in other clusters. Basic clustering algorithms fall into the following categories: partitioning methods, hierarchical methods, density-based methods, and grid-based methods. Partitioning methods are the most basic and most practical clustering algorithms and often serve as a building block for subsequent analysis. The commonly used k-modes algorithm is a classical partitioning method for categorical data. However, it is based on a nonlinear integer programming model, which brings two problems that are hard to resolve. (1) An initial value must be set, and different choices of initial value affect the final clustering result, so the algorithm as a whole is not stable. (2) The algorithm often gets trapped in a locally optimal solution and struggles to reach the globally optimal solution.
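The two problems can be reproduced on a toy example. The following is a minimal sketch of the classical alternating k-modes heuristic, assuming simple matching distance and per-attribute modes as centres; the function names and the four-object data set are hypothetical and are not taken from the patent.

```python
from collections import Counter

def hamming(a, b):
    """Simple matching dissimilarity: number of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def kmodes(data, init_modes, max_iter=100):
    """Classical alternating k-modes heuristic started from init_modes."""
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest mode (ties go to the lowest index).
        new_labels = [min(range(len(modes)), key=lambda k: hamming(x, modes[k]))
                      for x in data]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: each mode takes the most frequent value of every attribute.
        for k in range(len(modes)):
            members = [x for x, lab in zip(data, labels) if lab == k]
            if members:
                modes[k] = tuple(Counter(col).most_common(1)[0][0]
                                 for col in zip(*members))
    cost = sum(hamming(x, modes[lab]) for x, lab in zip(data, labels))
    return labels, modes, cost

# Hypothetical toy data: four objects, three categorical attributes coded as integers.
data = [(1, 1, 1), (1, 1, 2), (2, 2, 1), (2, 2, 2)]
cost_a = kmodes(data, [data[0], data[1]])[2]  # start from objects 1 and 2
cost_b = kmodes(data, [data[0], data[2]])[2]  # start from objects 1 and 3
print(cost_a, cost_b)
```

On this data the first start converges to a within-cluster distance of 4 and the second to 2, so the heuristic's result depends on the initial modes and the first run stops at a local optimum.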
The invention is based on integer programming (IP). It converts the k-modes clustering problem into a linear programming model and eliminates the step of setting an initial solution; because the whole model is linear, a globally optimal solution can also be obtained. The two problems mentioned above are thereby resolved. The method uses a modeling language for describing and solving large, complex mathematical problems (A Mathematical Programming Language, AMPL) and then calls a mixed integer programming (MIP) solver (such as CPLEX or Lingo) to complete clustering computations of moderate scale. The invention is an improved global-optimization clustering method: it first proposes a new representation of categorical data sets and then uses that representation to build an integer linear programming model of k-modes.
3. Summary of the invention:
3.1 Object of the invention
The object of the invention is to overcome the shortcomings that traditional k-modes methods have always faced and to propose a stable clustering method that can obtain a globally optimal solution, giving practitioners in data mining and big data analysis a scheme with better clustering results.
3.2 Technical solution
The problem is abstracted as follows. Assume that data set X contains n data objects, X = {$X_1, X_2, \ldots, X_n$}. Each data object is described by m attributes $A_1, A_2, \ldots, A_m$, so each data object $X_i$ can be written as $X_i$ = {$x_{i1}, x_{i2}, \ldots, x_{im}$}. Because the k-modes algorithm works on categorical attribute values, each categorical attribute $A_j$ has a value domain of $p_j$ category values. The number of clusters is assumed to be given in advance, and the k-modes algorithm groups the data objects of X into l (l ≤ n) clusters. The idea of the algorithm is first to find l cluster centres {$C_1, C_2, \ldots, C_l$} such that the sum of the distances $d_{X_i C_j}$ between each data object and its nearest cluster centre is minimal. This sum of within-group distances is called the objective function, and $d_{X_i C_j}$ can also be regarded as a dissimilarity measure between two objects. The centre of each group is then redetermined from the current grouping, the grouping step is repeated, and the two steps alternate until convergence. The goal of clustering is to minimize the sum of within-group distances over all groups.
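In symbols, the dissimilarity measure and the clustering goal described above can be sketched as follows. The simple matching form is the usual choice for k-modes and agrees with the later statement that the distance on an attribute is 1 when the values differ; $c_{kj}$ (the value of attribute j at centre $C_k$), $\delta$, and $S_k$ (the set of objects assigned to cluster k) are notation introduced here for illustration and are not symbols from the patent.

```latex
d(X_i, C_k) = \sum_{j=1}^{m} \delta(x_{ij}, c_{kj}), \qquad
\delta(a, b) =
\begin{cases}
0, & a = b \\
1, & a \neq b
\end{cases}
\qquad
\min \; \sum_{k=1}^{l} \sum_{X_i \in S_k} d(X_i, C_k)
```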
The mathematical symbols used in the technical solution are listed below:
The improved global-optimization k-modes clustering method of the invention comprises the following steps:
Step 1: Data preprocessing
Number the n objects to be clustered from 1 to n; let $y_{ij}$ denote the value of attribute j of data object i; categorical values are replaced by the integers 1, 2, 3, ...; for example, an attribute $A_j$ with domain $V_j$ has $q_j$ category values, which are represented by {1, 2, 3, ..., $q_j$} respectively; record the number of clusters required;
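The preprocessing step can be sketched in a few lines. This is only an illustration under the assumption that codes are assigned in order of first appearance; the function name `encode_categories` and the sample rows are hypothetical.

```python
def encode_categories(rows):
    """Map raw categorical values to integer codes 1..q_j, separately per attribute."""
    m = len(rows[0])
    codebooks = [{} for _ in range(m)]  # one value -> code dictionary per attribute
    y = []
    for row in rows:
        coded = []
        for j, value in enumerate(row):
            book = codebooks[j]
            if value not in book:
                book[value] = len(book) + 1  # next unused code, starting at 1
            coded.append(book[value])
        y.append(coded)
    q = [len(book) for book in codebooks]  # q_j: number of category values of attribute j
    return y, q, codebooks

# Hypothetical raw data: three objects, two categorical attributes.
rows = [("red", "small"), ("blue", "small"), ("red", "large")]
y, q, codebooks = encode_categories(rows)
print(y)  # y[i][j] is the integer code of attribute j of object i
print(q)
```

For the three sample rows this yields `y = [[1, 1], [2, 1], [1, 2]]` and `q = [2, 2]`.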
Step 2: Build the linearized k-modes mathematical programming model
Following the k-modes idea, the data objects X are grouped into l (l ≤ n) clusters. The clustering criterion is to find l cluster centres {$C_1, C_2, \ldots, C_l$} such that the sum of the distances between each data object and its nearest cluster centre is minimal; this sum of within-group distances is called the objective function. The invention builds a linear programming model to carry out the k-modes computation. "Building a linear programming model" proceeds as follows:
A) k-modes minimizes the sum of the distances between each data object and its nearest cluster centre, and this sum of within-group distances is called the objective function. The objective function of the linear programming model is built with this goal, i.e. the distances between each data object i and its cluster centre k on each attribute j are summed. This gives the objective function stated below.
B) The distance $d_{ikj}$ used in the objective function must behave as required: it is the distance between data object i and cluster centre k on attribute $A_j$, and $d_{ikj}=1$ if and only if object i belongs to the class of cluster centre k and the value of object i on that attribute differs from the value of the cluster centre. This gives constraint (1).
(1)
C) The attribute values of the cluster centres must be well defined: each attribute of a cluster centre takes exactly one uniquely determined value. This gives constraint (2).
(2)
D) Rules for cluster assignment: each data object i must be assigned to exactly one class, which gives constraint (3); the clustering result must consist of the prespecified l classes, which gives constraint (4).
(3)
(4)
E) The input data express the value of attribute j of object i in the form $y_{ij}$, but this representation makes it hard to build a linearized model. It is therefore converted into a new form: an indicator $x_{ijt}$ that takes the value 0 or 1, with $x_{ijt}=1$ when attribute j of data object i takes the t-th category value (i.e. $y_{ij}=t$). This gives constraint (6). Because attribute j of object i can take only one value, constraint (5) follows.
(5)
(6)
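The conversion from $y_{ij}$ to the 0/1 indicators $x_{ijt}$ described in E) is a one-hot encoding. A minimal sketch, assuming integer codes that start at 1 and a list `q` holding the number of category values per attribute (both names are chosen here for illustration):

```python
def one_hot(y, q):
    """Return x with x[i][j][t-1] == 1 exactly when y[i][j] == t (codes start at 1)."""
    return [[[1 if y_ij == t else 0 for t in range(1, q_j + 1)]
             for y_ij, q_j in zip(row, q)]
            for row in y]

# Hypothetical coded data: two objects, attribute 1 has 3 values, attribute 2 has 2.
y = [[1, 2], [3, 1]]
q = [3, 2]
x = one_hot(y, q)
print(x)

# Each (i, j) pair has exactly one indicator set, matching the verbal form of constraint (5).
assert all(sum(x_ij) == 1 for x_i in x for x_ij in x_i)
```

For this input the result is `[[[1, 0, 0], [0, 1]], [[0, 0, 1], [1, 0]]]`.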
The linear programming model can therefore be summarized as follows:
Objective function:
$$\min F(W,U)=\sum_{k=1}^{l}\sum_{i=1}^{n}\sum_{j=1}^{m} d_{ikj}$$
Constraints:
(1)
(2)
(3)
(4)
(5)
(6)
The meaning of every symbol used above can be looked up in the symbol table.
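The formulas of constraints (1) to (6) are not reproduced in this text. Purely as an illustration, one linear formulation that matches the verbal descriptions in A) to E) is sketched below. Every constraint form is an assumption, in particular the linearization chosen for (1) and the reading of (4) as "every cluster is non-empty"; it should not be taken as the patent's own formulation.

```latex
\begin{aligned}
\min\ & F(W,U)=\sum_{k=1}^{l}\sum_{i=1}^{n}\sum_{j=1}^{m} d_{ikj} \\
\text{s.t.}\ & d_{ikj} \ge w_{ik} - \sum_{t=1}^{q_j} x_{ijt}\,u_{kjt} && \forall i,k,j && (1) \\
& \sum_{t=1}^{q_j} u_{kjt} = 1 && \forall k,j && (2) \\
& \sum_{k=1}^{l} w_{ik} = 1 && \forall i && (3) \\
& \sum_{i=1}^{n} w_{ik} \ge 1 && \forall k && (4) \\
& \sum_{t=1}^{q_j} x_{ijt} = 1 && \forall i,j && (5) \\
& x_{ijt} = 1 \text{ if } y_{ij}=t,\ \text{otherwise } 0 && \forall i,j,t && (6) \\
& w_{ik},\,u_{kjt} \in \{0,1\},\quad d_{ikj} \ge 0
\end{aligned}
```

In this sketch $x_{ijt}$ is fixed by the data, so the product $x_{ijt}\,u_{kjt}$ is linear in the decision variables. When $w_{ik}=1$ and the centre's value on attribute j differs from the object's value, the right-hand side of (1) equals 1 and forces $d_{ikj}=1$; in every other case minimization drives $d_{ikj}$ to 0.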
Step 3: Model solving
Solve the model above. Commercial software such as Lingo or CPLEX may be used. Because the mathematical programming model is linear, solving it to optimality is entirely feasible;
Here, the "model" refers to the linear programming model formed by the objective function and constraints (1) to (6) built in Step 2;
For "model solving", the AMPL language is used to call the CPLEX solver. The specific procedure is as follows:
(1) input the data to be clustered and the basic clustering parameters, and create the AMPL data file xxx.dat;
(2) create the AMPL model file xxx.mod, which defines the linear programming model;
(3) create the AMPL batch file xxx.sh;
(4) call the batch file xxx.sh with AMPL to start solving;
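The AMPL and CPLEX files themselves are not shown in the text. To illustrate what "globally optimal" means for this objective without a solver, the sketch below swaps in a different technique: exhaustive enumeration of all assignments, which is only practical for a handful of objects. The function names and toy data are hypothetical.

```python
from collections import Counter
from itertools import product

def cluster_cost(members):
    """Within-cluster distance when the centre takes the most frequent value per attribute."""
    cost = 0
    for column in zip(*members):
        # Objects that disagree with the modal value each contribute distance 1.
        cost += len(column) - Counter(column).most_common(1)[0][1]
    return cost

def exact_kmodes(data, l):
    """Globally optimal k-modes cost by enumerating every assignment (tiny inputs only)."""
    best_cost, best_labels = None, None
    for labels in product(range(l), repeat=len(data)):
        if len(set(labels)) < l:  # all l clusters must be used
            continue
        cost = sum(cluster_cost([x for x, lab in zip(data, labels) if lab == k])
                   for k in range(l))
        if best_cost is None or cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_cost, best_labels

# Hypothetical toy data: four objects, three categorical attributes coded as integers.
data = [(1, 1, 1), (1, 1, 2), (2, 2, 1), (2, 2, 2)]
print(exact_kmodes(data, 2))
```

For the toy data the minimum within-cluster distance is 2, reached by grouping the first two objects together and the last two together. The enumeration visits $l^n$ assignments, which is why a MIP solver is the practical route for data sets of realistic size.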
Step 4: Result output
Solving the model yields the optimal objective function value together with the corresponding values of the decision variables $w_{ik}$ and $u_{kjt}$. By their definitions, the decision variables $u_{kjt}$ and $w_{ik}$ determine the cluster assignment of each data object and the centre of each class;
(1) for all i ∈ N, examine $w_{ik}$: if $w_{ik}=1$, object i belongs to class k;
(2) if $u_{kjt}=1$, attribute j of the centre of cluster k takes the t-th category value;
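Reading the clusters and centres off the solved variables can be sketched as follows. The nested-list layout of `w` and `u` is an assumption about how a solver's output might be stored, and the sample values are hypothetical.

```python
def decode_solution(w, u):
    """Turn 0/1 variables into clusters and centres, using 1-based numbering.

    w[i][k] == 1 means object i+1 belongs to cluster k+1.
    u[k][j][t] == 1 means attribute j+1 of centre k+1 takes category value t+1.
    """
    l = len(u)
    clusters = {k + 1: [] for k in range(l)}
    for i, row in enumerate(w):
        k = row.index(1)  # the single cluster to which object i is assigned
        clusters[k + 1].append(i + 1)
    centers = {k + 1: [attr.index(1) + 1 for attr in u[k]] for k in range(l)}
    return clusters, centers

# Hypothetical solver output: 3 objects, 2 clusters, 2 attributes with 2 and 3 values.
w = [[1, 0], [0, 1], [1, 0]]
u = [[[1, 0], [0, 1, 0]],
     [[0, 1], [0, 0, 1]]]
clusters, centers = decode_solution(w, u)
print(clusters)
print(centers)
```

Here objects 1 and 3 fall into cluster 1 with centre values (1, 2), and object 2 falls into cluster 2 with centre values (2, 3).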
Through the steps above, k-modes clustering is accomplished. Because the mathematical model is linear, a globally optimal solution can be obtained, and because no initial solution has to be set, sensitivity to the initial solution is avoided. The two common problems of traditional k-modes computation are thereby resolved.
3.3 Advantages and effects of the invention
Building on earlier research, the invention summarizes the long-standing shortcomings of k-modes and proposes a linear model of the k-modes algorithm. By defining new binary decision variables, the method converts the original problem into a linear integer programming problem and resolves the difficulties that the traditional k-modes algorithm rarely reaches a globally optimal solution and is easily influenced by the initial solution. Solving instances from the UCI database shows that the proposed model is feasible and effective.
4. Brief description of the drawings
Fig. 1 is the implementation flow chart of the method of the invention.
5. Detailed description of embodiments
This section uses Lenses, a benchmark data set published on UCI, to illustrate the method in detail. The data set has 24 data objects, each described by 4 attributes A1, A2, A3, A4. Every attribute is categorical, and the numbers of category values of the attributes are {3, 2, 2, 2} respectively. The whole data set is to be divided into 3 clusters. The data values are shown in the following table:
Table 1: Lenses data
The improved global-optimization k-modes clustering method of the invention, shown in Fig. 1, is implemented through the following steps:
Step 1: Data preprocessing
The data obtained are already categorical data expressed as numbers, so numbering each data object completes the preprocessing. The data set is to be grouped into 3 classes. See Table 2.
Table 2
Step 2: Build the linearized k-modes mathematical programming model
Objective function:
Constraints:
(1)
(2)
(3)
(4)
(5)
(6)
Step 3: Model solving
From the data of Table 2 and the prescribed number of 3 clusters, write the AMPL data file and name it lenses.dat.
From the mathematical programming model built in Step 2, write the AMPL model file and name it lenses.mod.
Because the mathematical model is linear, solving it to optimality is entirely feasible. Create the batch file lenses.sh.
Finally, call the batch file lenses.sh with AMPL to start solving.
Step 4: Result output
Solving the model gives the following results.
Examine $w_{ik}$: if $w_{ik}=1$, object i belongs to class k. The decision variables yield the following clustering of the data objects: {1, 5, 6, 7, 8, 13, 15, 21, 22}, {2, 4, 9, 10, 12, 14, 16, 18}, {3, 11, 17, 19, 20, 23, 24}. The optimal objective function value, i.e. the minimum within-group distance, is 27.
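As a quick consistency check on the grouping reported above, the short sketch below verifies that the three clusters form a partition of the 24 object numbers. It uses only the numbers listed in the text; the variable names are chosen here for illustration.

```python
# Clusters as reported in the result above (object numbers 1..24).
clusters = [
    {1, 5, 6, 7, 8, 13, 15, 21, 22},
    {2, 4, 9, 10, 12, 14, 16, 18},
    {3, 11, 17, 19, 20, 23, 24},
]

all_ids = set().union(*clusters)
sizes = [len(c) for c in clusters]

# A valid assignment covers every object exactly once, as required of the assignment rule.
is_partition = all_ids == set(range(1, 25)) and sum(sizes) == 24
print(sizes, is_partition)
```

The clusters contain 9, 8, and 7 objects, and every object from 1 to 24 appears in exactly one of them.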
Solving the same data set with the traditional k-modes method gives a minimum within-group distance of 32.3, which is greater than the minimum within-group distance obtained by this method. Therefore this method, i.e. the k-modes clustering optimization method based on linear integer programming, obtains a better clustering result than the traditional k-modes clustering method.

Claims (2)

1. An improved global-optimization k-modes clustering method, characterized in that its steps are as follows:
Step 1: Data preprocessing
Number the n objects to be clustered from 1 to n; let $y_{ij}$ denote the value of attribute j of data object i; categorical values are replaced by the integers 1, 2, 3, ...; for example, an attribute $A_j$ with domain $V_j$ has $q_j$ category values, which are represented by {1, 2, 3, ..., $q_j$} respectively; record the number of clusters required;
Step 2: Build the linearized k-modes mathematical programming model
Following the k-modes idea, the data objects X are grouped into l clusters, l ≤ n; the clustering criterion is to find l cluster centres {$C_1, C_2, \ldots, C_l$} such that the sum of the distances between each data object and its nearest cluster centre is minimal, this sum of within-group distances being called the objective function; the invention builds a linear programming model to carry out the k-modes computation; wherein said "building a linear programming model" proceeds as follows:
A) k-modes minimizes the sum of the distances between each data object and its nearest cluster centre, and this sum of within-group distances is called the objective function; the objective function of the linear programming model is built with this goal, i.e. the distances between each data object i and its cluster centre k on each attribute j are summed; the objective function is thus:
$$\min F(W,U)=\sum_{k=1}^{l}\sum_{i=1}^{n}\sum_{j=1}^{m} d_{ikj}$$
B) the distance $d_{ikj}$ used in the objective function must behave as required: it is the distance between data object i and cluster centre k on attribute $A_j$, and $d_{ikj}=1$ if and only if object i belongs to the class of cluster centre k and the value of object i on that attribute differs from the value of the cluster centre; this gives constraint (1);
(1)
C) the attribute values of the cluster centres must be well defined: each attribute of a cluster centre takes exactly one uniquely determined value; this gives constraint (2);
(2)
D) rules for cluster assignment: each data object i must be assigned to exactly one class, which gives constraint (3); the clustering result must consist of the prespecified l classes, which gives constraint (4);
(3)
(4)
E) the input data express the value of attribute j of object i in the form $y_{ij}$, but this representation makes it hard to build a linearized model, so it is converted into a new form: an indicator $x_{ijt}$ that takes the value 0 or 1, with $x_{ijt}=1$ when attribute j of data object i takes the t-th category value, i.e. $y_{ij}=t$; this gives constraint (6); because attribute j of object i can take only one value, constraint (5) follows;
(5)
(6)
The linear programming model is therefore summarized as follows:
Objective function:
$$\min F(W,U)=\sum_{k=1}^{l}\sum_{i=1}^{n}\sum_{j=1}^{m} d_{ikj}$$
Constraints:
(1)
(2)
(3)
(4)
(5)
(6)
The meaning of every symbol used above can be looked up in the symbol table in the description;
Step 3: Model solving
Solve the model above; commercial software such as Lingo or CPLEX may be used; because the mathematical programming model is linear, solving it to optimality is entirely feasible;
Step 4: Result output
Solving the model yields the optimal objective function value together with the corresponding values of the decision variables $w_{ik}$ and $u_{kjt}$; by their definitions, the decision variables $u_{kjt}$ and $w_{ik}$ determine the cluster assignment of each data object and the centre of each class;
(1) for all i ∈ N, examine $w_{ik}$: if $w_{ik}=1$, object i belongs to class k;
(2) if $u_{kjt}=1$, attribute j of the centre of cluster k takes the t-th category value;
Through the steps above, k-modes clustering is accomplished; because the mathematical model is linear, a globally optimal solution can be obtained, and because no initial solution has to be set, sensitivity to the initial solution is avoided; the two common problems of traditional k-modes computation are thereby resolved.
2. The improved global-optimization k-modes clustering method according to claim 1, characterized in that:
the "model" in Step 3 refers to the linear programming model formed by the objective function and constraints (1) to (6) built in Step 2;
for the "model solving" in Step 3, the AMPL language is used to call the CPLEX solver, and the specific procedure is as follows:
(1) input the data to be clustered and the basic clustering parameters, and create the AMPL data file xxx.dat;
(2) create the AMPL model file xxx.mod, which defines the linear programming model;
(3) create the AMPL batch file xxx.sh;
(4) call the batch file xxx.sh with AMPL to start solving.
CN201710177995.2A 2017-03-23 2017-03-23 An improved global-optimization k-modes clustering method Pending CN107122793A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710177995.2A CN107122793A (en) 2017-03-23 2017-03-23 An improved global-optimization k-modes clustering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710177995.2A CN107122793A (en) 2017-03-23 2017-03-23 An improved global-optimization k-modes clustering method

Publications (1)

Publication Number Publication Date
CN107122793A (en) 2017-09-01

Family

ID=59718021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710177995.2A Pending CN107122793A (en) 2017-03-23 2017-03-23 An improved global-optimization k-modes clustering method

Country Status (1)

Country Link
CN (1) CN107122793A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256102A (en) * 2018-02-01 2018-07-06 厦门大学嘉庚学院 A kind of Independent College Studentss based on cluster comment religion data analysing method
CN111160382A (en) * 2019-09-29 2020-05-15 山西大学 Effective method for processing classified data in real life
CN112132217A (en) * 2020-09-23 2020-12-25 广西大学 Classification type data clustering method based on intra-cluster dissimilarity
CN113159392A (en) * 2021-03-30 2021-07-23 刘昊戈 Optimization calculation method for position distribution problem of continuous space

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256102A (en) * 2018-02-01 2018-07-06 厦门大学嘉庚学院 A kind of Independent College Studentss based on cluster comment religion data analysing method
CN108256102B (en) * 2018-02-01 2022-02-11 厦门大学嘉庚学院 Independent college student evaluation and education data analysis method based on clustering
CN111160382A (en) * 2019-09-29 2020-05-15 山西大学 Effective method for processing classified data in real life
CN112132217A (en) * 2020-09-23 2020-12-25 广西大学 Classification type data clustering method based on intra-cluster dissimilarity
CN112132217B (en) * 2020-09-23 2023-08-15 广西大学 Classification type data clustering method based on inter-cluster dissimilarity in clusters
CN113159392A (en) * 2021-03-30 2021-07-23 刘昊戈 Optimization calculation method for position distribution problem of continuous space
CN113159392B (en) * 2021-03-30 2022-06-24 刘昊戈 Optimization calculation method for position distribution problem of continuous space

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170901