TWI474139B

TWI474139B - Data clustering method and computer product thereof

Info

Publication number: TWI474139B
Application number: TW101134505A
Authority: TW
Inventors: Jengming Yih
Original assignee: Min Hwei College Of Health Care Man
Priority date: 2012-09-20
Filing date: 2012-09-20
Publication date: 2015-02-21
Also published as: TW201413405A

Description

Data clustering method and computer program product thereof

The present invention relates to a data clustering method and a computer program product thereof, and in particular to a data clustering method based on a fuzzy clustering method and a computer program product thereof.

With rapid economic development and the fast advance of information technology, people can now easily obtain large amounts of information or data through a variety of channels. To process this data, various data clustering algorithms have been developed to help with data mining and data classification.

Data clustering belongs to multivariate data analysis in mathematical statistics. Clustering partitions the original data into groups and finds the center point (representative point) of each group, which reduces the complexity of data analysis. A clustering algorithm works from the distribution of the data's features (elliptical or circular): data items with high similarity are assigned to the same cluster, so that the whole data set is divided into a plurality of clusters.

Data clustering algorithms fall into two types: hierarchical clustering and partitioning clustering. Hierarchical clustering splits or merges the data level by level and represents the whole data set as a tree structure. Partitioning clustering first fixes the number of clusters and then clusters the data by iterative computation.

Common partitioning clustering methods include the fuzzy C-means clustering algorithm, the GK (Gustafson-Kessel) clustering algorithm, and the GG (Gath and Geva) clustering algorithm.

The fuzzy C-means clustering algorithm computes distances with the Euclidean distance and is used to identify clusters in data whose structure is spherical. The GK and GG clustering algorithms are used to classify data with non-spherical structure. However, the GK clustering algorithm relies on a fuzzy covariance matrix, and its objective function is constrained by the Mahalanobis distance computed from that matrix; the GG clustering algorithm is suited to data that follow a multivariate normal (Gaussian) distribution.
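To make the contrast between the two distances concrete, the following is a minimal Python sketch, not taken from the patent. The function names and the restriction to two dimensions (which allows a closed-form matrix inverse) are assumptions made for illustration.

```python
def squared_euclidean(x, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def squared_mahalanobis_2d(x, v, cov):
    """Squared Mahalanobis distance for 2-D vectors and a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [x[0] - v[0], x[1] - v[1]]
    return sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))


point, center = [3.0, 0.0], [0.0, 0.0]
stretched = [[9.0, 0.0], [0.0, 1.0]]  # large variance along the first axis
print(squared_euclidean(point, center))
print(squared_mahalanobis_2d(point, center, stretched))
```

With the stretched covariance, the same point is far from the center in the Euclidean sense but close in the Mahalanobis sense, which is why covariance-based distances can follow elongated (elliptical) clusters.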

Because most known data clustering algorithms are constrained by the distribution pattern of the data itself, they tend to cluster poorly and take too long to compute when the distribution of the data is unknown.

A new data clustering method is therefore needed to improve the accuracy of data clustering and reduce the time required for the clustering computation.

One aspect of the present invention is to provide a data clustering method and a computer program product thereof that improve the accuracy and time efficiency of data clustering.

According to one embodiment of the present invention, the data clustering method first provides a plurality of data items, a preset covariance threshold, and a preset convergence value. Next, an original membership matrix of the data is calculated. An original center matrix of the data is then calculated from the original membership matrix, and an original covariance matrix of the data is calculated from the original center matrix. A covariance determination step is then performed on the original covariance matrix. This step includes determining whether the absolute value of the original covariance matrix is greater than the preset covariance threshold, to provide a first determination result, and determining whether the absolute value of the original covariance matrix is less than the reciprocal of the preset covariance threshold, to provide a second determination result. When both the first and the second determination results are negative, a first data update step is performed to update the original membership matrix and the original center matrix.

According to another embodiment of the present invention, the data clustering method first provides a plurality of data items, a preset covariance threshold, and a preset convergence value. Next, an original membership matrix of the data is calculated, wherein the original membership matrix contains the above data. An original center matrix of the data is then calculated from the original membership matrix, and an original covariance matrix of the data is calculated from the original center matrix. A first data update step is then performed to update the original membership matrix and the original center matrix. In the first data update step, a pivoting operation is performed first to update the determinant value of the original covariance matrix and obtain an updated covariance matrix. It is then determined whether the absolute value of the updated covariance matrix is greater than the preset covariance threshold, to provide a first determination result, and whether the absolute value of the updated covariance matrix is less than the reciprocal of the preset covariance threshold, to provide a second determination result. When both the first and the second determination results are negative, the updated covariance matrix is used to update the original membership matrix and the original center matrix. Next, it is determined whether the difference between the updated center matrix and the original center matrix is less than the preset convergence value, to provide a third determination result. When the third determination result is affirmative, the data are grouped according to the updated membership matrix.

As the above description shows, the data clustering method of the embodiments of the present invention uses a covariance threshold and an adjustable covariance matrix to make the iterative clustering computation converge, which improves the accuracy and time efficiency of data clustering.

Suppose the data set to be clustered, X = {x_1, x_2, ..., x_n}, has p feature attributes, where x_k is any sample of X, x_k = [x_k1, x_k2, ..., x_kp] ∈ R^p, k = 1, 2, ..., n. The data set X can be expressed as a matrix as follows: If the goal of clustering is to divide X into c groups G_i, i = 1, 2, ..., c, then the membership u_ik of each data item can be expressed as follows: In addition, u_ik must satisfy the following three conditions:

Equations (3) and (4) state that any data sample x_k in the data set X belongs to exactly one group, and equation (5) states that each group contains at least one data sample and at most n-1 data samples.
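The equations referred to in this passage are not reproduced in this text. A standard hard-partition formulation that agrees with the surrounding description (an n×p data matrix, a membership indicator, and the three conditions explained above) would read as follows. The exact form and the assignment of the numbers (1) and (2) are assumptions; only the meaning of (3) to (5) is stated in the text.

```latex
\begin{align}
X &= \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix} \tag{1} \\
u_{ik} &= \begin{cases} 1, & x_k \in G_i \\ 0, & x_k \notin G_i \end{cases} \tag{2} \\
u_{ik} &\in \{0, 1\}, \quad 1 \le i \le c,\ 1 \le k \le n \tag{3} \\
\sum_{i=1}^{c} u_{ik} &= 1, \quad 1 \le k \le n \tag{4} \\
0 < \sum_{k=1}^{n} u_{ik} &< n, \quad 1 \le i \le c \tag{5}
\end{align}
```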

For the GK algorithm, the objective function is defined as follows:

Here X is the data set, an n×p matrix; U is a c×n matrix, U = [u_ik], 1 ≤ i ≤ c, 1 ≤ k ≤ n; V is a vector, V = [v_1, v_2, ..., v_c], v_i ∈ R^p, 1 ≤ i ≤ c; v_i is the center of group G_i, i = 1, 2, ..., c; u_ik is the membership of data sample x_k in cluster G_i, u_ik ∈ [0, 1], i = 1, 2, ..., c, k = 1, 2, ..., n; and m is the weighting exponent, m ∈ [1, ∞).

d²(x_k, v_i) is the squared distance from data sample x_k to cluster center v_i, which can be expressed as follows:

where A_i is a matrix that can be expressed as follows:

where ρ_i is a constant and F_i is the covariance matrix of the i-th group, which can be expressed as follows:
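The GK equations themselves are missing from this text. For orientation, the usual textbook form of the Gustafson-Kessel objective function, distance, norm-inducing matrix, and fuzzy covariance matrix is given below, written with the symbols defined above. This is the standard formulation, not a transcription of the patent's equations; the text only confirms that the distance is equation (7), so the other numbers are assumed.

```latex
\begin{align}
J(X; U, V) &= \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ik})^{m}\, d^{2}(x_k, v_i) \tag{6} \\
d^{2}(x_k, v_i) &= (x_k - v_i)^{T} A_i\, (x_k - v_i) \tag{7} \\
A_i &= \left[ \rho_i \det(F_i) \right]^{1/p} F_i^{-1} \tag{8} \\
F_i &= \frac{\sum_{k=1}^{n} (u_{ik})^{m} (x_k - v_i)(x_k - v_i)^{T}}{\sum_{k=1}^{n} (u_{ik})^{m}} \tag{9}
\end{align}
```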

In the GK algorithm's procedure, the fuzziness parameter m, the number of groups c, the stopping condition ε (the preset convergence value), and the initial setting U^(0) are provided first, where U^(0) is the membership matrix of the 0th iteration.

Next, the group center matrix v_i is calculated using the following equation:

where l is the iteration count.
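The center equation is not reproduced here. In the standard fuzzy C-means and GK formulations, each center is the mean of the samples weighted by the m-th power of their memberships; the sketch below assumes that form. The function name `update_centers` and the list-of-lists layout (U as c rows of n memberships, X as n rows of p features) are illustrative choices.

```python
def update_centers(X, U, m=2.0):
    """Return c centers, each the u_ik**m weighted mean of the n samples in X."""
    p = len(X[0])
    centers = []
    for memberships in U:  # one row of U per group
        weights = [u ** m for u in memberships]
        total = sum(weights)
        if total == 0:
            raise ValueError("a group has zero total membership")
        centers.append(
            [sum(w * x[j] for w, x in zip(weights, X)) / total for j in range(p)]
        )
    return centers


samples = [[0.0, 0.0], [2.0, 2.0]]
print(update_centers(samples, [[1.0, 0.0], [0.0, 1.0]]))  # crisp memberships
print(update_centers(samples, [[0.5, 0.5]]))              # one group, equal memberships
```

With crisp memberships each center coincides with its own sample; with equal memberships the center falls at the midpoint.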

Then the distance d²(x_k, v_i) is calculated using equation (7) above. Next, the membership matrix U^(l) is updated to U^(l+1).

Here, if , the membership u_ik can be expressed as follows:

If , then u_ik^(l) is 0.

Next, it is determined whether |U^(l) - U^(l+1)| is smaller than the preset convergence value ε. If so, the data are classified according to the membership matrix; if not, l is set to l + 1 and the algorithm returns to the group-center calculation step and continues.
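The membership update and the stopping test can be sketched as follows. Because the conditions and the update equation are missing from this text, the code assumes the standard fuzzy update u_ik = 1 / Σ_j (d_ik / d_jk)^(2/(m-1)) when all distances are positive, and gives zero membership to every group at nonzero distance when a sample coincides with a center. Sharing the membership equally among coinciding centers, and reading |U^(l) - U^(l+1)| as the largest element-wise difference, are further assumptions.

```python
def update_memberships(D2, m=2.0):
    """D2[i][k] is the squared distance from sample k to center i; returns U (c x n)."""
    if m <= 1.0:
        raise ValueError("the fuzziness parameter m must be greater than 1")
    c, n = len(D2), len(D2[0])
    U = [[0.0] * n for _ in range(c)]
    for k in range(n):
        zero_groups = [i for i in range(c) if D2[i][k] == 0.0]
        if zero_groups:
            # Sample sits exactly on one or more centers: other groups keep 0.
            for i in zero_groups:
                U[i][k] = 1.0 / len(zero_groups)
            continue
        for i in range(c):
            # Squared distances, so the exponent is 1/(m-1) rather than 2/(m-1).
            ratio_sum = sum((D2[i][k] / D2[j][k]) ** (1.0 / (m - 1.0)) for j in range(c))
            U[i][k] = 1.0 / ratio_sum
    return U


def max_abs_change(U_old, U_new):
    """Largest element-wise absolute difference between two membership matrices."""
    return max(abs(a - b) for r_old, r_new in zip(U_old, U_new) for a, b in zip(r_old, r_new))


U1 = update_memberships([[1.0, 0.0], [3.0, 4.0]])
print(U1)
print(max_abs_change([[0.5, 0.5], [0.5, 0.5]], U1) < 0.01)
```

Each column of the returned matrix sums to 1, so every sample distributes its full membership across the groups.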

The GK algorithm is suited to detecting data with an elliptical distribution, but it cannot effectively classify data whose distribution is not elliptical. The embodiments of the present invention therefore provide a data clustering method that addresses this problem and improves clustering results. The data clustering method according to the embodiments of the present invention is described in detail below.

Please refer to FIG. 1, which is a flowchart of a data clustering method 100 according to an embodiment of the present invention. In data clustering method 100, data providing step 110 is performed first to provide the data to be clustered. As described above, the data to be clustered can be represented by the matrix X. Next, initial setting providing step 120 is performed to provide preset values such as the preset number of clusters, the initial condition U^(0), and the convergence value. Although the initial condition U^(0) is chosen by the user in this embodiment, in other embodiments of the present invention it may also be calculated from the data X to be clustered. Initial setting providing step 120 also provides a covariance threshold D preset by the user. In this embodiment, the covariance threshold D is 290, and it is calculated by the following equation: where x_j = (x_j1, x_j2, ..., x_jp) denotes the j-th sample vector and a_i = (a_i1, a_i2, ..., a_ip) denotes the center vector of the i-th group.

Note that the covariance threshold D of the embodiments of the present invention is not limited to 290. In other embodiments, the covariance threshold D may differ depending on the data to be clustered. The covariance threshold D is used to improve the efficiency of the clustering algorithm; how it is applied is described later.

Then calculation step 130 is performed to calculate the membership matrix, the center matrix, and the covariance matrix. In this embodiment, the membership matrix and the center matrix may be calculated with the GK equations given above, although embodiments of the present invention are not limited to this. Calculation step 130 computes the covariance matrix differently from the GK algorithm: the GK algorithm computes one covariance matrix for each group, whereas calculation step 130 of this embodiment computes a single covariance matrix Σ for all groups, as follows:

where Σ^(0) denotes the covariance matrix at iteration 0, λ_si denotes an eigenvalue, and Γ_si denotes an eigenvector.

Next, covariance determination step 140 is performed. Step 140 determines whether the absolute value of the covariance matrix is greater than the covariance threshold D and whether it is less than 1/D, and provides two corresponding determination results. If either result is affirmative, covariance adjustment step 150 is performed; if both are negative, updating step 160 is performed. Updating step 160 is described first.
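The two comparisons of step 140 can be sketched as below. The text speaks of the "absolute value of the covariance matrix"; this sketch assumes that means the absolute value of its determinant, which is an interpretation rather than something the text states outright. The helper `determinant` and the function name `covariance_check` are illustrative, and D defaults to the 290 used in this embodiment.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [[float(v) for v in row] for row in matrix]
    n = len(A)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if A[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            A[col], A[pivot] = A[pivot], A[col]
            det = -det  # a row swap flips the sign
        det *= A[col][col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for cc in range(col, n):
                A[r][cc] -= factor * A[col][cc]
    return det


def covariance_check(cov, threshold=290.0):
    """Return (first_result, second_result) for a covariance matrix, as in step 140."""
    size = abs(determinant(cov))  # assumed reading of the "absolute value"
    return size > threshold, size < 1.0 / threshold


print(covariance_check([[1.0, 0.0], [0.0, 1.0]]))
print(covariance_check([[100.0, 0.0], [0.0, 10.0]]))
```

If either returned flag is true the method would proceed to adjustment step 150; if both are false it would proceed to updating step 160.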

Updating step 160 uses the covariance matrix to update the membership matrix and the center matrix. In this embodiment these updates are computed with the GK equations given above, so the computation of updating step 160 is not repeated here. Note, however, that embodiments of the present invention are not limited to this. Once the membership matrix has been updated, the center matrix can also be updated from the updated membership matrix.

After updating step 160, determination step 170 is performed to determine whether the absolute value of the difference between the updated center matrix and the original center matrix is less than the preset convergence value. If so, the updated membership matrix satisfies the user's classification requirement, and grouping step 190 is performed to classify the data according to the updated membership matrix. If not, the updated membership matrix does not satisfy the user's classification requirement, and a further iteration is needed to calculate a new membership matrix.

In this embodiment, when the result of determination step 170 is negative, covariance matrix update step 180 is performed to update the covariance matrix. Step 180 performs a pivoting operation on the covariance matrix, which changes the determinant value of the covariance matrix and yields an updated covariance matrix.

After covariance matrix update step 180, the method returns to covariance determination step 140 to determine whether the absolute value of the updated covariance matrix is greater than the covariance threshold D and whether it is less than 1/D, and provides two corresponding determination results. If either result is affirmative, covariance adjustment step 150 is performed; if both are negative, updating step 160 is performed.

By iterating to update the covariance matrix repeatedly, multiple updated membership matrices and center matrices are obtained. When the absolute difference |v_i^(n) - v_i^(n-1)| between the center matrix after the n-th update, v_i^(n), and the center matrix after the (n-1)-th update, v_i^(n-1), is smaller than the preset convergence value, the iteration stops and the data are classified according to the membership matrix after the n-th update.

Covariance adjustment step 150 is described next.

As described above, when either of the two determination results of covariance determination step 140 is affirmative, data clustering method 100 of this embodiment performs covariance adjustment step 150. Covariance adjustment step 150 sets the original covariance matrix (or the covariance matrix after the (n-1)-th update) to the identity matrix. Updating step 160 is then performed, using the identity matrix to update the original membership matrix and the original center matrix and so obtain the updated membership matrix and center matrix (or the membership matrix and center matrix after the n-th update). The equations used are as follows:
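The equations for this step are not reproduced in this text. The sketch below only illustrates the stated idea: when either determination result is affirmative, the covariance matrix is replaced by the identity matrix, and a quadratic-form distance evaluated with the identity reduces to the squared Euclidean distance. The function names are illustrative, and the quadratic form (x - v)^T A (x - v) is assumed from the GK distance described earlier.

```python
def identity_matrix(p):
    """p x p identity matrix as a list of lists."""
    return [[1.0 if r == c else 0.0 for c in range(p)] for r in range(p)]


def adjust_covariance(cov, first_result, second_result):
    """Step 150 as described: fall back to the identity if either check is affirmative."""
    if first_result or second_result:
        return identity_matrix(len(cov))
    return cov


def squared_distance(x, v, A):
    """Quadratic form (x - v)^T A (x - v)."""
    diff = [a - b for a, b in zip(x, v)]
    p = len(diff)
    return sum(diff[r] * A[r][c] * diff[c] for r in range(p) for c in range(p))


cov = [[500.0, 0.0], [0.0, 1.0]]
adjusted = adjust_covariance(cov, True, False)
print(adjusted)
print(squared_distance([1.0, 2.0], [0.0, 0.0], adjusted))
```

Because the identity matrix is its own inverse, the result is the same whether the distance uses the matrix or its inverse, so the fallback always yields the Euclidean case.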

As the above description shows, data clustering method 100 of the embodiments of the present invention uses the covariance threshold to decide whether the membership matrix is updated with the Euclidean distance or with the Mahalanobis distance. By iterating, the adjustment value -ln|Σ_i^-1| is used to keep updating the original membership matrix (or the membership matrix after the (n-1)-th update), the original center matrix (or the center matrix after the (n-1)-th update), and the original covariance matrix (or the covariance matrix after the (n-1)-th update). This yields the updated membership matrix (or the membership matrix after the n-th update), the updated center matrix (or the center matrix after the n-th update), and the updated covariance matrix (or the covariance matrix after the n-th update) corresponding to the Euclidean distance or the Mahalanobis distance.

Data clustering method 100 of the embodiments of the present invention is therefore not constrained by the distribution pattern of the data to be clustered. In addition, data clustering method 100 performs a pivoting operation on the covariance matrix in covariance matrix update step 180, which also improves the classification efficiency of data clustering method 100.

Please refer to FIG. 2, which is a flowchart of a data clustering method 200 according to an embodiment of the present invention. Data clustering method 200 is similar to data clustering method 100, except that data clustering method 200 further includes counting step 210. Counting step 210 determines whether the number of times covariance determination step 140 has been performed exceeds a preset number of updates. If it does, covariance adjustment step 150 is performed to force the covariance matrix to the identity matrix, and updating step 160 is then performed to update the membership matrix and the center matrix using the identity matrix. The preset number of updates may be provided by initial setting providing step 120 or by counting step 210. In this embodiment the preset number of updates is 100, but embodiments of the present invention are not limited to this.

In data clustering method 100, if the repeatedly updated covariance matrix never makes the result of covariance determination step 140 affirmative, and the repeatedly updated center matrix never meets the user's preset convergence value, the method may become trapped in an endless iteration loop and fail to calculate a suitable membership matrix. Data clustering method 200 of this embodiment therefore uses counting step 210 to force an exit from the endless loop, which improves the chance that data clustering method 200 finds a suitable membership matrix and thereby improves the efficiency of data clustering.
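The control flow described for FIG. 2 can be sketched as a loop with a counter. The figure itself is not available here, so the exact position of counting step 210 is an assumption, and the four callables (`check`, `update`, `converged`, `pivot`) are hypothetical placeholders for steps 140, 160, 170, and 180 rather than the patent's computations. Only the step numbers and the default of 100 come from the text.

```python
def identity_like(cov):
    """Identity matrix with the same size as cov."""
    p = len(cov)
    return [[1.0 if r == c else 0.0 for c in range(p)] for r in range(p)]


def run_capped(cov, state, check, update, converged, pivot, max_checks=100):
    """Iterate steps 140-180; after max_checks checks, step 210 forces step 150."""
    checks = 0
    while True:
        checks += 1
        first, second = check(cov)                      # step 140
        if first or second or checks > max_checks:      # step 210 forces the fallback
            cov = identity_like(cov)                    # step 150
        new_state = update(cov, state)                  # step 160
        if converged(state, new_state):                 # step 170
            return new_state, checks                    # step 190 would group the data here
        state = new_state
        cov = pivot(cov)                                # step 180


# Toy stand-ins: the "state" is just a counter that converges once it reaches 5.
seen = []

def toy_update(cov, s):
    seen.append(cov)
    return s + 1

result, n_checks = run_capped(
    [[2.0, 0.0], [0.0, 2.0]], 0,
    check=lambda c: (False, False),
    update=toy_update,
    converged=lambda old, new: new >= 5,
    pivot=lambda c: c,
    max_checks=3,
)
print(result, n_checks)
```

In this toy run the first three passes use the original matrix, and from the fourth pass onward the counter forces the identity matrix, which mirrors the forced exit described above.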

In the following, an example is used to illustrate the effectiveness of the data clustering method of the embodiments of the present invention.

Please refer to FIG. 3, which shows the format of the data to be clustered according to an embodiment of the present invention. In this embodiment, the data to be clustered are in comma-separated values (CSV) format, in which all values are separated by commas (","). The data used in FIG. 3 are the iris data set, which contains four features: sepal length, sepal width, petal length, and petal width.
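Reading such a file can be sketched with Python's standard `csv` module. FIG. 3 is not available here, so the column layout is an assumption: the sketch takes the first four fields of each row as the four features and ignores anything after them. The two sample rows are illustrative iris-style measurements, not a copy of the figure.

```python
import csv
import io


def read_samples(text, n_features=4):
    """Parse CSV text into a list of feature vectors (first n_features columns)."""
    samples = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # skip blank lines
            continue
        samples.append([float(value) for value in record[:n_features]])
    return samples


example = "5.1,3.5,1.4,0.2\n4.9,3.0,1.4,0.2\n"
print(read_samples(example))
```

Each returned row corresponds to one sample x_k with p = 4 features, in the order sepal length, sepal width, petal length, petal width.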

After the data to be clustered are obtained, a data clustering tool program is used to carry out the data clustering method of the embodiments of the present invention. Parameters such as the number of clusters and the fuzziness value must be set when running the method. In this embodiment, the number of clusters is set to 3 and the fuzziness value is set to 2.

Once the settings are complete, the data clustering method of the embodiments of the present invention is used to cluster the data shown in FIG. 3, which gives the clustering result shown in FIG. 4. Because three clusters were specified, there are three center coordinates, each with four dimensions (variables). The center of the first cluster is [5.0036, 3.4030, 1.4850, 0.2515]; the center of the second cluster is [6.7751, 3.0524, 5.6469, 2.0536]; and the center of the third cluster is [5.8892, 2.7612, 4.3642, 1.3974].
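As a quick way to read these centers, the sketch below assigns a sample to the nearest of the three reported centers by squared Euclidean distance. This is a simplification for illustration: the method described here groups data from the membership matrix, not by this nearest-center rule. The three sample vectors are illustrative iris-style measurements.

```python
# Cluster centers as reported for the example.
CENTERS = [
    [5.0036, 3.4030, 1.4850, 0.2515],
    [6.7751, 3.0524, 5.6469, 2.0536],
    [5.8892, 2.7612, 4.3642, 1.3974],
]


def nearest_center(sample, centers=CENTERS):
    """Index (0-based) of the center with the smallest squared Euclidean distance."""
    distances = [
        sum((s - c) ** 2 for s, c in zip(sample, center)) for center in centers
    ]
    return distances.index(min(distances))


for sample in ([5.1, 3.5, 1.4, 0.2], [6.3, 3.3, 6.0, 2.5], [5.7, 2.8, 4.1, 1.3]):
    print(sample, "-> cluster", nearest_center(sample) + 1)
```

The small petal measurements of the first sample place it near the first center, while the other two samples fall near the second and third centers respectively.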

Experiments show that the data clustering method of the embodiments of the present invention reaches an accuracy of 89.33%. The method therefore does help to improve the efficiency of data clustering.

In addition, the above embodiments may be implemented as a computer program product, which may include a machine-readable medium storing a plurality of instructions that program a computer to carry out the steps of the above embodiments. The machine-readable medium may be, but is not limited to, a floppy disk, an optical disc, a CD-ROM, a magneto-optical disc, a read-only memory, a random access memory, an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), an optical card or magnetic card, a flash memory, or any machine-readable medium suitable for storing electronic instructions. Furthermore, embodiments of the present invention may also be downloaded as a computer program product, which may be transferred from a remote computer to a requesting computer by data signals over a communication connection (for example, a network connection).

Although the present invention has been disclosed above by way of several embodiments, they are not intended to limit the invention. Anyone with ordinary skill in the technical field of the invention may make various changes and refinements without departing from its spirit and scope, so the scope of protection of the present invention is defined by the appended claims.

100‧‧‧Data clustering method

110‧‧‧Data providing step

120‧‧‧Initial setting providing step

130‧‧‧Calculation step

140‧‧‧Covariance determination step

150‧‧‧Covariance adjustment step

160‧‧‧Updating step

170‧‧‧Determination step

180‧‧‧Covariance matrix update step

190‧‧‧Grouping step

200‧‧‧Data clustering method

210‧‧‧Counting step

To make the above and other objects, features, and advantages of the present invention easier to understand, several preferred embodiments are described in detail above with reference to the accompanying drawings, which are as follows:

FIG. 1 is a flowchart of a data clustering method according to an embodiment of the present invention.

FIG. 2 is a flowchart of a data clustering method according to another embodiment of the present invention.

FIG. 3 shows the format of the data to be clustered according to an embodiment of the present invention.

FIG. 4 shows the clustering result of the data clustering method according to an embodiment of the present invention.

Claims

A data clustering method, comprising: providing a plurality of data items, a preset covariance threshold, and a preset convergence value; calculating an original membership matrix of the data; calculating an original center matrix of the data according to the original membership matrix; calculating an original covariance matrix according to the original center matrix; performing a covariance determination step on the original covariance matrix, wherein the covariance determination step comprises: determining whether an absolute value of the original covariance matrix is greater than the preset covariance threshold to provide a first determination result; and determining whether the absolute value of the original covariance matrix is less than a reciprocal of the preset covariance threshold to provide a second determination result; when the first determination result and the second determination result are both negative, performing a first data update step to update the original membership matrix and the original center matrix, wherein the first data update step comprises: updating the original membership matrix by using the original covariance matrix to obtain a first updated membership matrix; and calculating a first updated center matrix according to the first updated membership matrix; determining whether a difference between the first updated center matrix and the original center matrix is less than the preset convergence value to provide a third determination result; and when the third determination result is affirmative, grouping the data according to the first updated membership matrix.

The data grouping method of claim 1, wherein the preset covariance threshold is 290.

The data grouping method of claim 1, further comprising: when the third determination result is negative, performing a pivoting operation to update the determinant value of the original covariance matrix to obtain an updated covariance matrix; performing the covariance determination step on the updated covariance matrix to determine whether the absolute value of the updated covariance matrix is greater than the preset covariance threshold to provide a fourth determination result, and to determine whether the absolute value of the updated covariance matrix is less than the reciprocal of the preset covariance threshold to provide a fifth determination result; and when the fourth determination result or the fifth determination result is affirmative, performing a second data update step to update the first updated membership matrix and the first updated center matrix by using the updated covariance matrix.

The data grouping method of claim 3, further comprising: when the first determination result or the second determination result is affirmative, performing a third data update step to update the original membership matrix and the original center matrix, wherein the third data update step comprises: setting the original covariance matrix to an identity matrix; and updating the original membership matrix and the original center matrix by using the identity matrix to obtain a second updated membership matrix and a second updated center matrix; determining whether a difference between the second updated center matrix and the original center matrix is less than the preset convergence value to provide a sixth determination result; and when the sixth determination result is affirmative, grouping the data according to the second updated membership matrix.

The data grouping method of claim 3, further comprising: providing a preset update number; determining whether the number of times the covariance determination step has been performed is greater than the preset update number; and performing the third data update step when the number of times the covariance determination step has been performed is greater than the preset update number.

A data grouping method, comprising: providing a plurality of data, a preset covariance threshold, and a preset convergence value; calculating an original membership matrix of the data; calculating an original center matrix of the data according to the original membership matrix; calculating an original covariance matrix of the data according to the original center matrix; performing a first data update step to update the original membership matrix and the original center matrix, wherein the first data update step comprises: performing a pivoting operation to update the determinant value of the original covariance matrix to obtain an updated covariance matrix; determining whether an absolute value of the updated covariance matrix is greater than the preset covariance threshold to provide a first determination result; determining whether the absolute value of the updated covariance matrix is less than a reciprocal of the preset covariance threshold to provide a second determination result; and when the first determination result and the second determination result are both negative, updating the original membership matrix and the original center matrix by using the updated covariance matrix; determining whether a difference between the updated center matrix and the original center matrix is less than the preset convergence value to provide a third determination result; and when the third determination result is affirmative, grouping the data according to the updated membership matrix.

The data grouping method of claim 6, wherein the preset covariance threshold is 290.

The data grouping method of claim 6, further comprising: when the first determination result or the second determination result is affirmative, performing a second data update step to update the original membership matrix and the original center matrix, wherein the second data update step comprises: setting the original covariance matrix to an identity matrix; and updating the original membership matrix and the original center matrix by using the identity matrix.

The data grouping method of claim 6, further comprising: determining whether an absolute value of the original covariance matrix is greater than the preset covariance threshold to provide a fourth determination result; determining whether the absolute value of the original covariance matrix is less than the reciprocal of the preset covariance threshold to provide a fifth determination result; and when the fourth determination result and the fifth determination result are both negative, performing the first data update step.

Computer program software which, when loaded by a computer, performs the data grouping method of claim 6.
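
As an illustration only, the covariance determination step recited in the claims above can be sketched in Python as follows. The sketch assumes that the "absolute value" of a covariance matrix means the absolute value of its determinant, which the claims do not state explicitly. The function names `determinant`, `covariance_check`, and `choose_covariance` are hypothetical, and the default threshold of 290 is the value recited in the dependent claims. The determinant is computed by ordinary Gaussian elimination with partial pivoting, which is not necessarily the pivoting operation recited in the claims.

```python
def determinant(matrix):
    """Determinant of a square matrix via Gaussian elimination with partial pivoting."""
    a = [[float(x) for x in row] for row in matrix]  # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def covariance_check(cov, threshold=290.0):
    """Return (first_result, second_result) of the covariance determination step.

    Assumption: the quantity compared is |det(cov)|.
    """
    magnitude = abs(determinant(cov))
    first_result = magnitude > threshold          # too large
    second_result = magnitude < 1.0 / threshold   # too small
    return first_result, second_result


def choose_covariance(cov, threshold=290.0):
    """Keep cov when both results are negative; otherwise use an identity matrix."""
    first_result, second_result = covariance_check(cov, threshold)
    if first_result or second_result:
        n = len(cov)
        return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return cov
```

Under these assumptions, a matrix whose determinant magnitude lies between 1/290 and 290 is kept for the membership update, while a matrix outside that range is replaced by the identity matrix, so that the distance used in the update reduces to the ordinary Euclidean distance.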