TWI334092B

TWI334092B - Data clustering method

Info

Publication number: TWI334092B
Application number: TW96119098A
Authority: TW
Inventors: Cheng Fa Tsai; chun chang Li
Original assignee: Univ Nat Pingtung Sci & Tech
Priority date: 2007-05-29
Filing date: 2007-05-29
Publication date: 2010-12-01
Also published as: TW200846950A

Description

IX. Description of the invention:

[Technical field to which the invention pertains]

The present invention relates to a data clustering method, and in particular to a data clustering method that applies the K-means clustering principle together with a verification step based on a symmetric clustering rule and a verification step for asymmetric clustering results, so as to obtain data clustering results with good stability.

[Prior Art]

Conventional data clustering methods are mainly applied in the technical field of data mining, where they are used to find the features and relationships implied among the data in a large data set, so as to establish a complete data analysis model that decision makers can consult. Conventional data clustering methods can be roughly classified into partitioning, hierarchical, density-based, and grid-based clustering. Among these, partitioning clustering is one of the more common approaches: it classifies data by partitioning the data space into several subspaces of different sizes, and the data points in the same subspace are regarded as the same cluster. The advantage of this type of algorithm is that clustering is fast; the most common example is the K-means clustering method.

Referring to FIG. 1, the conventional K-means clustering method uses a computer processing unit to retrieve a plurality of data objects from a database and place them in a data space, and randomly selects several centroid points from among the data objects in advance (S11). Next, the distances between the data objects are calculated according to the centroid points in order to perform the clustering operation (S12). After the clustering operation is completed, new centroid points are calculated (S13). Finally, it is determined whether a termination condition is met (S14). The termination condition is that the newly calculated centroid points are the same as the previous centroid points, in which case the data clustering operation terminates; otherwise steps S12, S13, and S14 are repeated. In this way, hidden and useful information can be discovered and used as a reference by decision makers.

However, the conventional K-means clustering method still has the following drawback: the initial centroid points must be selected at random, so the results of repeated clustering runs may differ, and the method provides no step for verifying the correctness of the data clustering result. It therefore tends to produce data clustering results with poor stability.

To address the drawback of poor stability, clustering techniques based on K-means have been published in international journals, for example the two clustering methods KGA and GKA. However, the KGA and GKA methods take too long to execute, so their time cost is correspondingly high and they are still not economical. For these reasons, it is necessary to further improve the conventional data clustering methods.

In view of this, the present invention remedies the shortcomings of the conventional data clustering methods described above. It performs clustering according to the K-means clustering principle and, after a data clustering result is obtained, further performs a symmetric clustering result verification step and an asymmetric clustering result verification step, so as to output a data clustering result with good stability.

[Summary of the Invention]

The main object of the present invention is to provide a data clustering method that uses a data object clustering step together with a symmetric clustering result verification step to determine whether the distances between the centroid points conform to a symmetric clustering rule, so as to output a data clustering result with good stability, whereby the invention improves the correctness of data clustering.

Another object of the present invention is to provide a data clustering method that uses an asymmetric clustering result verification step, which repeats the K-means data clustering operation several times in order to calculate the relative distances between each centroid point and the other centroid points and thereby select a data clustering result, whereby the invention further improves the correctness of data clustering.

The data clustering method according to the present invention first obtains a data clustering result using the K-means clustering principle. Next, the distances between the centroid points are calculated. If they conform to a symmetric clustering rule, the data clustering result is output; otherwise, an asymmetric clustering result verification step is performed. The asymmetric clustering result verification step repeats the K-means clustering operation several times to obtain several data clustering results, calculates for each result the distances between each centroid point and the other centroid points, and selects the data clustering result with the smallest distance, thereby obtaining a data clustering result with good stability.

[Embodiments]

To make the above and other objects, features, and advantages of the present invention clearer and easier to understand, preferred embodiments of the invention are described in detail below with reference to the accompanying drawings.

Referring to FIG. 2, the data clustering method of the preferred embodiment of the present invention is built on a data clustering system 1. The data clustering system 1 includes a computer processing unit 11 and a database 12, and the computer processing unit 11 is connected to the database 12. A plurality of data objects 121 are stored in the database 12 in advance, so that the computer processing unit 11 can retrieve the data objects 121 from the database 12 and perform the data clustering operation. In addition, the computer processing unit 11 has a user interface 111 for operating the system and displaying the data clustering results.

The data clustering method of the invention comprises a data object clustering step S21, a symmetric clustering result verification step S22, an asymmetric clustering result verification step S23, and a data clustering result output step S24. Through these steps, the data clustering operation can be completed quickly and correctly.

Referring to FIGS. 2 and 3, in the data object clustering step S21 of the preferred embodiment, a data space is constructed through the user interface 111, and the computer processing unit 11 retrieves the data objects 121 from the database 12 and allocates them to the data space. Then, according to the K-means clustering principle, several "centroid points" are randomly selected from among the data objects 121, and the distances between the data objects 121 are calculated according to the centroid points to complete the clustering operation and obtain a data clustering result. (The K-means clustering principle is a common data clustering technique and is not described further here.)
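The specification treats K-means as known art and gives no code. As an illustration only, the data object clustering step S21 (which follows the conventional flow S11 to S14) might be sketched as below. The function name `kmeans`, its parameters, the tuple-based point representation, and the `max_iter` safety limit are assumptions made for this sketch and do not come from the patent.

```python
import random
from math import dist


def kmeans(points, k, max_iter=100, seed=None):
    """Cluster `points` (tuples of numbers) into k groups.

    Returns (centroids, labels). Hypothetical helper, not the patent's code.
    """
    rng = random.Random(seed)
    # S11: randomly pick k data objects as the initial centroid points.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):  # max_iter is an assumed safety limit
        # S12: assign every data object to its nearest centroid point.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # S13: recompute each centroid point as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # S14: terminate once the centroid points no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
```

Because the initial centroid points are drawn at random, two calls with different seeds can return different results, which is the instability that the later verification steps are meant to address.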
At this point it cannot be determined whether the data objects 121 that have been clustered satisfy the conditions for a data clustering result with good stability, so the subsequent symmetric clustering result verification step S22 and asymmetric clustering result verification step S23 must still be performed in order to obtain a data clustering result with good stability.

Referring to FIG. 3, the symmetric clustering result verification step S22 of the preferred embodiment calculates the "distances" between the centroid points of the data clustering result according to a "symmetric clustering rule", and thereby determines whether the data clustering result has good stability. The symmetric clustering rule preferably defines the number of centroid points as K and the distance between two centroid points as a "side length"; when at least [illegible] of these "side lengths" are symmetric, the data clustering result is defined as a symmetric clustering result. The result is then judged to be a data clustering result with good stability, and the data clustering result output step S24 is performed to output the data clustering result.

Otherwise, the asymmetric clustering result verification step S23 is performed directly.

The symmetric clustering rule can further determine whether the number of centroid points is odd or even, so as to apply either the "odd centroid point rule" or the "even centroid point rule".

In the "odd centroid point rule", an axis of symmetry is set to pass through one specific centroid point. If the distances from that centroid point, which lies on the axis of symmetry, to all centroid points on one side of the axis are equal to its distances to all centroid points on the other side of the axis, the result is judged to conform to the "symmetric clustering rule". In the embodiment shown in FIG. 4, an axis of symmetry X passes through a specific centroid point A. The distance from centroid point A to a centroid point B on one side of the axis X equals the distance from centroid point A to a centroid point C on the other side of the axis X; likewise, the distance from centroid point A to a centroid point D on one side of the axis X equals the distance from centroid point A to a centroid point E on the other side, and so on. This is used to determine whether the data clustering result conforms to the "symmetric clustering rule".

In the "even centroid point rule", an axis of symmetry is set to pass through two specific centroid points. If the relative distances among the two centroid points on the axis and all centroid points on one side of the axis are equal to the relative distances among the two centroid points and all centroid points on the other side, the result is judged to conform to the "symmetric clustering rule". In the embodiment shown in FIG. 5, an axis of symmetry Y passes through two specific centroid points A and F. The distance from centroid point A to a centroid point B on one side of the axis Y equals the distance from centroid point A to a centroid point C on the other side; likewise, the distance from centroid point B on one side of the axis Y to another centroid point D equals the distance from centroid point C on the other side to another centroid point E, and so on. This is used to determine whether the data clustering result conforms to the "symmetric clustering rule". Furthermore, when the number of centroid points is greater than four, the "odd centroid point rule" described above is likewise applicable.

Referring again to FIG. 3, the asymmetric clustering result verification step S23 of the preferred embodiment repeats the data object clustering step S21 (that is, the K-means data clustering operation) several times to obtain several data clustering results, and calculates, for each data clustering result, the distances between each centroid point and the other centroid points. In the embodiment shown in FIG. 6, the distances from a centroid point A to another centroid point B and from centroid point A to another centroid point C are calculated. In the embodiment shown in FIG. 7, the distances from a centroid point A to another centroid point B, from centroid point A to another centroid point C, and from centroid point A to another centroid point D are calculated. The data clustering result whose centroid-to-centroid distances have the smallest sum is then selected and directly judged to be the data clustering result with good stability.
With this step, a data clustering result with good stability is obtained, and the results are consistent when clustering is performed multiple times. The method was tested on several data sets, including Data Set 2, Data Set 3, and Data Set 4 (the data set sizes are illegible in this copy); the test comparison results are given in a table in the original specification.
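The selection made in the asymmetric clustering result verification step S23 can be illustrated with a short sketch. It assumes that each clustering result is represented only by its list of centroid points, and that the "sum of distances" is taken over every pair of centroid points; the specification's FIGS. 6 and 7 only show distances measured from one centroid point A, so the all-pairs reading is an assumption. The names `centroid_distance_sum` and `select_smallest_sum` are hypothetical.

```python
from itertools import combinations
from math import dist


def centroid_distance_sum(centroids):
    """Sum of the distances over every pair of centroid points in one result.

    Assumption: all pairs are counted; the patent text does not spell this out.
    """
    return sum(dist(a, b) for a, b in combinations(centroids, 2))


def select_smallest_sum(results):
    """Return the centroid list whose pairwise distance sum is smallest.

    `results` is a list of centroid lists, one per repeated K-means run.
    """
    return min(results, key=centroid_distance_sum)
```

For example, given the two results `[(0, 0), (3, 4)]` and `[(0, 0), (6, 8)]`, the first has a distance sum of 5 and the second a sum of 10, so the first would be selected.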

Furthermore, although other conventional data clustering methods such as GKA and KGA can also obtain data clustering results with good stability, their execution time cost on large data sets is relatively high compared with the present invention.

Overall, after completing the data object clustering step S21, the present invention uses the symmetric clustering result verification step S22 to verify whether the data clustering result meets the standard of good stability, and then decides whether to output the data clustering result. Even if the data clustering result obtained in the data object clustering step S21 does not meet the standard of good stability, the asymmetric clustering result verification step S23 can additionally be performed to produce a data clustering result with good stability. The data clustering method of the present invention is therefore superior to the conventional data clustering algorithms.
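As a rough illustration of the check made in the symmetric clustering result verification step S22, the "odd centroid point rule" (FIG. 4) could be approximated as follows. This sketch tests only the distance condition stated in the description: for some candidate on-axis centroid point, the distances to the remaining centroid points must pair up as equal values. It does not construct the axis of symmetry itself, so it is a necessary condition rather than a full geometric test. The function name `odd_rule_holds` and the tolerance parameter `rel_tol` are assumptions.

```python
from math import dist, isclose


def odd_rule_holds(centroids, rel_tol=1e-6):
    """Approximate the odd centroid point rule using distances only.

    Returns True if some centroid point has distances to all the other
    centroid points that can be grouped into equal pairs.
    """
    if len(centroids) % 2 == 0:
        return False  # the odd rule needs an odd number of centroid points
    for i, pivot in enumerate(centroids):
        # Distances from the candidate on-axis centroid to every other one.
        d = sorted(dist(pivot, c) for j, c in enumerate(centroids) if j != i)
        # After sorting, equal distances are adjacent: (d0, d1), (d2, d3), ...
        if all(isclose(d[n], d[n + 1], rel_tol=rel_tol)
               for n in range(0, len(d), 2)):
            return True
    return False
```

A result that passes this check would go to the output step S24; one that fails would fall through to the asymmetric verification step S23.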
Although the present invention has been disclosed by way of the preferred embodiments above, they are not intended to limit the invention. Any person skilled in the art may make various changes and modifications to the above embodiments without departing from the spirit and scope of the invention, and such changes remain within the technical scope protected by the invention. The scope of protection of the invention is therefore defined by the appended claims.

[Brief description of the drawings]

FIG. 1: Block diagram of the step flow of the conventional K-means clustering method.
FIG. 2: Schematic diagram of the system architecture of the data clustering method of the preferred embodiment of the present invention.
FIG. 3: Block diagram of the step flow of the data clustering method of the preferred embodiment of the present invention.
FIG. 4: Schematic diagram of the odd centroid point rule of the data clustering method of the preferred embodiment of the present invention.
FIG. 5: Schematic diagram of the even centroid point rule of the data clustering method of the preferred embodiment of the present invention.
FIG. 6: Schematic diagram (1) of an asymmetric clustering result of the data clustering method of the preferred embodiment of the present invention.
FIG. 7: Schematic diagram (2) of an asymmetric clustering result of the data clustering method of the preferred embodiment of the present invention.

[Description of main component symbols]

1 Data clustering system
11 Computer processing unit
111 User interface
12 Database
121 Data object
S11 Select several centroid points
S12 Perform the clustering operation
S13 Calculate new centroid points
S14 Determine whether the termination condition is met
S21 Data object clustering step
S22 Symmetric clustering result verification step
S23 Asymmetric clustering result verification step
S24 Data clustering termination step

Claims

Scope of the patent application (amended replacement page, August 17 of ROC year 99, i.e. 2010)

1. A data clustering method, comprising:
retrieving a plurality of data objects from a database by a computer processing unit, and allocating the data objects to a data space constructed through a user interface;
randomly selecting several centroid points from among the data objects using the K-means clustering principle, and calculating the distances between the data objects according to the centroid points, so as to complete a clustering operation and obtain a data clustering result;
calculating the distances between the centroid points of the data clustering result and determining whether they conform to a symmetric clustering rule, wherein the symmetric clustering rule defines the number of centroid points as K and the distance between two centroid points as a side length [remainder of this condition illegible], and further determines whether the number of centroid points is odd or even so as to apply an "odd centroid point rule" or an "even centroid point rule"; wherein the odd centroid point rule sets an axis of symmetry passing through one specific centroid point, such that the distances from that centroid point to all centroid points on one side of the axis of symmetry are equal to its distances to all centroid points on the other side; and the even centroid point rule sets an axis of symmetry passing through two specific centroid points, such that the relative distances among the two centroid points on the axis and all centroid points on one side of the axis are equal to the relative distances among the two centroid points and all centroid points on the other side; and
if the distances between the centroid points do not conform to the symmetric clustering rule, performing an asymmetric clustering result verification step, which repeats the K-means data clustering operation several times to obtain several data clustering results, calculates the distances between each centroid point and the other centroid points of each data clustering result, selects the data clustering result with the smallest sum of distances as the data clustering result with good stability, and outputs it.

2. The data clustering method according to claim 1, wherein the odd centroid point rule is applied when the number of centroid points is greater than four.