TWI407365B - Method for data clustering - Google Patents

Method for data clustering

Info

Publication number
TWI407365B
Authority
TW
Taiwan
Prior art keywords
point
data
array
points
neighboring
Prior art date
Application number
TW98121992A
Other languages
Chinese (zh)
Other versions
TW201101176A (en)
Inventor
Cheng Fa Tsai
Shih Yu Huang
Original Assignee
Univ Nat Pingtung Sci & Tech
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Nat Pingtung Sci & Tech
Priority to TW98121992A
Publication of TW201101176A
Application granted
Publication of TWI407365B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for data clustering comprises: summing the coordinate values of each data point across all dimensions to obtain a reference value for each point; sorting the reference values; defining a scanning range that is centered on the reference value of a core point and whose width is a multiple of an expanding radius; calculating the distance between the core point and each data point whose reference value falls within the scanning range; and treating the data points whose distance is smaller than the expanding radius as neighbors. The method thereby lowers the time cost of searching for neighbors and improves clustering efficiency.

Description

Data clustering method

The present invention relates to a data clustering method, and more particularly to a density-based clustering algorithm.

Conventional density-based data clustering methods use the density of data points as the basis for clustering. For example, given the parameters R (expansion radius) and MinPts (minimum number of included points), if the data point density of a region satisfies the set condition, an expansion search is performed from that region, and other regions whose density also satisfies the condition are merged step by step to obtain the final clustering result. Representative algorithms include DBSCAN and IDBSCAN. Although these conventional methods can effectively detect irregularly shaped patterns and filter out noise points, the time required for clustering increases accordingly.

Several representative conventional density-based data clustering techniques are described below:

1. DBSCAN data clustering method: In step 1, one data point is randomly selected from the data points in a data set as the initial seed point. In step 2, it is determined whether the number of data points within radius R of the current seed point exceeds the minimum number of included points; if the threshold is reached, the data points within the range are assigned to the same cluster and treated as seed points, and expansion proceeds one by one from the other seed points in the range. In step 3, step 2 is repeated until all data points in the data set have been classified. Because DBSCAN clusters by a relatively logical density criterion, it can filter out noise and handle data points distributed in irregular patterns. However, because a complicated density check must be performed for every data point, clustering takes a long time. Moreover, when searching for neighboring points, the distance between the core point and every data point must be computed, which consumes a large amount of time and increases the computational cost.
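The cost described above comes from the neighbor query itself. As a minimal sketch (not the original DBSCAN implementation), the following function computes one distance per data point for every query; the function name and the choice to exclude the core point from its own neighbor list are assumptions made for illustration.

```python
import math

def brute_force_neighbors(points, core_index, radius):
    """Return indices of all points closer than `radius` to points[core_index].

    Every call visits every point, so a full clustering run performs
    on the order of n * n distance computations.
    """
    core = points[core_index]
    neighbors = []
    for i, p in enumerate(points):
        if i == core_index:
            continue  # assumption: the core point is not its own neighbor
        if math.dist(core, p) < radius:
            neighbors.append(i)
    return neighbors

# Hypothetical sample data for illustration only.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.5), (5.0, 5.0)]
result = brute_force_neighbors(points, 0, 2.0)
```

The sorted-array search described later in this document aims to replace the full loop over `points` with a loop over a short candidate window.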

2. IDBSCAN data clustering method: This density-based data clustering technique was proposed by B. Borah et al. in 2004. It addresses the time-consuming behavior of DBSCAN, which expands by examining data points sequentially, and adopts a strategy of speeding up clustering by reducing the number of queries. IDBSCAN places 8 marked boundary points at equal intervals on the boundary of the scanning range of radius R around the expansion seed point. Among the data points within the scanning range, only those closest to the 8 marked boundary points are selected as seed points. Reducing the number of seed points in this way reduces repeated expansion and overcomes the slowness caused by the excessive number of seed points in DBSCAN; however, the reduction in clustering time is still quite limited.
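The seed selection described above can be sketched as follows. This is only an illustration of the idea as summarized in this document, not the published IDBSCAN code; the function name, the starting angle of the marks, and the handling of ties are assumptions.

```python
import math

def select_boundary_seeds(core, neighbors, radius, n_marks=8):
    """Pick at most n_marks seeds: the neighbor nearest to each marked boundary point."""
    if not neighbors:
        return []
    seeds = []
    for k in range(n_marks):
        angle = 2 * math.pi * k / n_marks
        # Marked boundary point on the circle of the given radius around the core.
        mark = (core[0] + radius * math.cos(angle),
                core[1] + radius * math.sin(angle))
        nearest = min(neighbors, key=lambda p: math.dist(p, mark))
        if nearest not in seeds:
            seeds.append(nearest)
    return seeds

# Hypothetical neighbors of a core point at the origin with R = 2.
neighbors = [(1.5, 0.0), (0.0, 1.5), (-1.5, 0.0), (0.0, -1.5), (0.1, 0.1)]
seeds = select_boundary_seeds((0.0, 0.0), neighbors, 2.0)
```

In this example the point (0.1, 0.1), which lies close to the core, is never chosen as a seed, so it triggers no further expansion.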

In general, although IDBSCAN can reduce the number of seed points within the scanning range of an expansion seed point to no more than 8, the data points close to the expansion seed point cover a relatively large area, so including them as seed points increases the time cost of the search. Moreover, even though the number of seed points within the scanning range is no more than 8, the scanning ranges of adjacent seed points overlap heavily, so the proportion of repeated expansion is also quite high, which further increases the time cost. For these reasons, the conventional data clustering methods described above need further improvement.

An object of the present invention is to remedy the above shortcomings by providing a data clustering method that reduces the time cost of searching for neighboring points in density-based algorithms.

A further object of the present invention is to provide a data clustering method that reduces the number of data points requiring distance computation when searching for neighboring points.

To achieve the foregoing objects, the present invention provides a data clustering method comprising: a parameter setting step of setting an expansion radius parameter and a minimum-included-points parameter; a sorting step of summing the coordinate values of each data point in an initial array across all dimensions to obtain a reference value for that point, sorting the reference values by magnitude, and storing them in a sorted array; a searching step of taking one data point as a core point and, in the sorted array, taking the reference value of the core point as the center and a multiple of the expansion radius parameter as the scanning range, the multiple being not less than √2, computing the distance between the core point and each data point whose reference value lies within the scanning range, and defining the data points whose distance is smaller than the expansion radius parameter R as neighboring points; a determining step of determining whether the number of neighboring points is greater than the minimum-included-points parameter; a cluster determining step of determining whether a neighboring point has already been clustered, and if so, performing the searching step, and if not, adding the neighboring point to an expansion array, assigning the neighboring point to the same cluster as the core point, and then performing an expanding step; the expanding step of taking neighboring points out of the expansion array as core points and performing the searching step, and, after all neighboring points in the expansion array have completed the searching step, performing a termination determining step; and the termination determining step of determining whether to terminate according to a termination condition.

To make the above and other objects, features, and advantages of the present invention more readily understood, a preferred embodiment is described in detail below with reference to the accompanying drawings. Referring to FIGS. 1 and 2, the data clustering method of the preferred embodiment is executed by a computer system connected to at least one database. The database stores an initial array containing a plurality of data points 11. The method first performs a parameter setting step S1, in which an expansion radius (Eps) parameter R and a minimum-included-points (MinPts) parameter are set in the computer system.

Referring to FIGS. 1 to 3, the method then performs a sorting step S2. For each data point 11 in the initial array, the coordinate values in all dimensions are summed to give a reference value 111, and the reference values 111 of the data points 11 are sorted by magnitude and stored in a sorted array 2. For ease of explanation, this embodiment uses two-dimensional plane coordinates: the data points 11 are scattered in a two-dimensional plane, so each data point 11 has a two-dimensional coordinate (a,b). The sorting step S2 sums the coordinate values a and b of the two dimensions (x and y) and uses the sum a+b as the reference value 111, so that each data point 11 has a corresponding reference value 111. The coordinate values a and b are not less than 0; that is, every data point lies in the first quadrant of the two-dimensional plane. For example, as shown in Table 1 for data points A, B, C, and D, the reference value of each data point is obtained simply by adding its coordinate values.

Table 1. Coordinates and reference values of data points A, B, C, and D

Next, the reference values 111 are sorted by magnitude; in this embodiment they are sorted in ascending order and stored in the sorted array 2. For data points A, B, C, and D, the resulting order in the sorted array 2 is B, A, C, D. In this way, as shown in FIG. 3, the reference values 111 of the data points 11 are stored in the sorted array 2 in order of magnitude.
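The sorting step can be sketched in a few lines. This is an illustrative sketch under assumptions: the function name and the (reference value, index) pair layout are not from the patent, and because the contents of Table 1 did not survive in this copy, the sample points are E, F, G and the core point (3,5) that appear later in the description.

```python
def build_sorted_array(points):
    """Return (reference_value, point_index) pairs sorted in ascending order.

    The reference value of a point is the sum of its coordinates.
    """
    return sorted((sum(p), i) for i, p in enumerate(points))

# E(13,3), F(15,3), G(3,14) and the core point (3,5) from the description.
points = [(13, 3), (15, 3), (3, 14), (3, 5)]
sorted_array = build_sorted_array(points)
```

Keeping the original index next to each reference value lets later steps look up the full coordinates of a candidate after it has been located in the sorted array.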

For convenience in the following description, the relationship between the maximum reference value of a data point and the expansion radius parameter R is introduced first. Referring to FIG. 4 and Table 2, let the expansion radius parameter R be 1 and draw a circle 3 of radius R centered on the core point O(0,0). Any point M(c,d) on circle 3, its projection N(c,0) on the X axis, and the point O together form a right triangle. The relationship between the two legs of the right triangle OMN, their sum c+d, and ∠MON is shown in Table 2.

The results in Table 2 show that when ∠MON is 45°, the sum c+d reaches its maximum value of √2. In other words, relative to the core point O(0,0), the maximum reference value of the data point M in the sorted array is √2 times the expansion radius parameter R. If the core point has coordinates (11,5) and a reference value of 16, the maximum reference value of the data point M is 16+√2·R, that is, 16+√2 for R = 1.
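The bound on the reference value can also be checked with a short calculation instead of a table. As a sketch (the symbol θ for ∠MON and the offsets relative to a general core point are notation introduced here), write a point M(c,d) on the circle of radius R around O as c = R cos θ and d = R sin θ:

```latex
c + d = R(\cos\theta + \sin\theta)
      = \sqrt{2}\,R\,\sin\!\left(\theta + 45^\circ\right)
      \le \sqrt{2}\,R,
\qquad \text{with equality at } \theta = 45^\circ .
```

Applying the same argument to the offsets from a core point (x0, y0) shows that every point within distance R of the core has a reference value within √2·R of the core's reference value, which is why a window of half-width √2·R in the sorted array cannot miss a neighbor in the two-dimensional case.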

Referring to FIGS. 1 and 5 to 8, the method then performs a searching step S3. One data point 11 is taken as the core point 12. In the sorted array 2, the reference value 121 of the core point 12 is taken as the center and a multiple of the expansion radius parameter R as the scanning range, the multiple being not less than √2. The distance between the core point 12 and each data point 11' whose reference value 111' lies within the scanning range is computed, and the data points whose distance is smaller than the expansion radius parameter R are defined as neighboring points 11".

For example, referring to FIGS. 5 and 6, this embodiment uses a core point 12 with two-dimensional coordinates (3,5), so the reference value 121 of the core point 12 is 8. Next, referring to FIGS. 7 and 8, in the sorted array 2 the reference value 121 of the core point 12 is taken as the center and a multiple of the expansion radius parameter R as the scanning range. In this embodiment the expansion radius parameter R is chosen to be 2 and the multiple is chosen to be √2. As described above, if the reference value 121 of the core point 12 is taken as the center and a range of √2·R is taken to the left and to the right, the data points 11' corresponding to the reference values 111' within the scanning range necessarily include all neighboring points 11" within the expansion range of the core point 12. Continuing the example, the scanning range is 2√2: as shown in FIG. 8, taking the reference value 121 of the core point 12 as the center in the sorted array 2 and extending 2√2 to the left and to the right, the data points 11' corresponding to the reference values 111' within the scanning range necessarily include all neighboring points 11" inside circle 3. The multiple is preferably not less than √2, because a smaller multiple, for example 1.3, would leave the scanning range unable to cover all neighboring points 11", and some neighboring points 11" could be missed.
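The effect of choosing a multiple below √2 can be demonstrated numerically. The sketch below uses the core point (3,5) and R = 2 from the example; the candidate point (4.4, 6.4) and the function name are hypothetical and chosen only so that the point lies near the 45° direction.

```python
import math

def in_scan_window(core, point, radius, multiplier):
    """True if the point's reference value lies within multiplier * radius
    of the core point's reference value."""
    return abs(sum(point) - sum(core)) <= multiplier * radius

core = (3.0, 5.0)        # core point from the description, reference value 8
radius = 2.0             # expansion radius R from the description
candidate = (4.4, 6.4)   # hypothetical point close to the 45-degree direction

is_neighbor = math.dist(core, candidate) < radius              # about 1.98 < 2
kept_with_sqrt2 = in_scan_window(core, candidate, radius, math.sqrt(2))
kept_with_1_3 = in_scan_window(core, candidate, radius, 1.3)
```

The candidate is a true neighbor and falls inside the window of half-width 2√2, but a window built with the multiple 1.3 (half-width 2.6) excludes it, so it would be missed.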

In addition, data points whose reference values are adjacent in the sorted array 2 are not necessarily adjacent in the two-dimensional plane. For example, referring to FIG. 9, points E(13,3), F(15,3), and G(3,14) have reference values 111 of 16, 18, and 17, respectively, and are therefore adjacent in the sorted array 2; in the two-dimensional plane, however, point G(3,14) is far from both E(13,3) and F(15,3). Therefore, as shown in FIG. 10, this embodiment next computes the distance between each data point 11' and the core point 12, using the Euclidean distance formula. The data points whose distance is smaller than the expansion radius R, that is, the data points located within the expansion range (circle 3) of the core point 12, are defined as neighboring points 11". This distance computation alone is enough to screen out the data points 11' that are not neighboring points 11". Consequently, when searching for neighboring points, distances need to be computed only for the data points 11' whose reference values 111' lie within the scanning range, rather than for all data points as in conventional density-based methods. This substantially reduces the time cost of searching for neighbors and improves clustering efficiency.
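The two-stage search (window lookup in the sorted array, then a Euclidean distance filter) might be sketched as follows. The function name, the use of binary search via `bisect`, and the extra points H and the far point are assumptions for illustration; the patent text only specifies the window and the distance test.

```python
import bisect
import math

def find_neighbors(points, sorted_refs, core_index, radius, multiplier=math.sqrt(2)):
    """Return indices of points closer than `radius` to the core point.

    sorted_refs is a list of (coordinate_sum, index) pairs in ascending order.
    Only points whose coordinate sum lies within multiplier * radius of the
    core's sum are distance-checked.
    """
    core = points[core_index]
    center = sum(core)
    half_width = multiplier * radius
    # Binary search for both ends of the scanning window.
    lo = bisect.bisect_left(sorted_refs, (center - half_width, -1))
    hi = bisect.bisect_right(sorted_refs, (center + half_width, len(points)))
    neighbors = []
    for _, i in sorted_refs[lo:hi]:
        if i != core_index and math.dist(core, points[i]) < radius:
            neighbors.append(i)
    return neighbors

# E, F, G from the description, plus hypothetical points H and a far point.
points = [(13, 3), (15, 3), (3, 14), (12, 4), (40, 40)]
sorted_refs = sorted((sum(p), i) for i, p in enumerate(points))
neighbors = find_neighbors(points, sorted_refs, core_index=0, radius=2.5)
```

With E as the core point, G falls inside the window because its reference value is 17, but the distance filter rejects it; the far point is never distance-checked at all.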

Referring to FIGS. 1 and 10, the method then performs a determining step S4, which determines whether the number of neighboring points 11" is greater than the minimum-included-points parameter. If the number of neighboring points 11" is greater than the parameter, a cluster determining step S5 is performed; if not, another data point 11 is taken from the initial array and the searching step S3 is performed on it.

Referring to FIGS. 1 and 10, the cluster determining step S5 determines whether the neighboring point 11" has already been clustered. If so, the searching step S3 is performed again. If not, the cluster number of the neighboring point 11" is updated, that is, the neighboring point 11" is assigned to the same cluster as the core point 12; the neighboring point 11" is then added to an expansion array, and an expanding step S6 is performed.

Referring to FIGS. 1 and 10, the expanding step S6 takes each neighboring point 11" out of the expansion array in turn as the core point 12 and performs the searching step S3. After all neighboring points 11" in the expansion array have completed the searching step S3, a termination determining step S7 is performed.

Referring to FIGS. 1 and 10, the termination determining step S7 terminates the method once a termination condition is met. The termination condition is that all data points 11 in the initial array have completed the searching step S3. If the condition is not met, that is, some data point 11 has not yet undergone the searching step S3, another data point 11 is taken from the initial array and the searching step S3 is performed, until all data points 11 have undergone the searching step S3. This completes the data clustering method of the present invention.
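Steps S1 to S7 can be put together in a compact sketch. This is one possible reading of the flow, not the patented implementation: the function name, the noise label -1, the exclusion of the core point from its own neighbor count, the relabeling of earlier noise points as cluster members, and the restriction to two-dimensional points (where the multiple √2 applies) are all assumptions.

```python
import bisect
import math

NOISE = -1  # assumed label for points that end up in no cluster

def cluster(points, radius, min_pts):
    """Density-based clustering of 2-D points using the sorted-sum search."""
    n = len(points)
    refs = sorted((sum(p), i) for i, p in enumerate(points))      # sorting step S2

    def search(core_index):                                        # searching step S3
        core = points[core_index]
        half = math.sqrt(2) * radius
        lo = bisect.bisect_left(refs, (sum(core) - half, -1))
        hi = bisect.bisect_right(refs, (sum(core) + half, n))
        return [i for _, i in refs[lo:hi]
                if i != core_index and math.dist(core, points[i]) < radius]

    labels = [None] * n          # None means "not yet searched"
    next_label = 0
    for start in range(n):       # loop until every point is processed (S7)
        if labels[start] is not None:
            continue
        neighbors = search(start)
        if len(neighbors) <= min_pts:                              # determining step S4
            labels[start] = NOISE
            continue
        labels[start] = next_label
        expansion = list(neighbors)                                # expansion array
        while expansion:                                           # expanding step S6
            j = expansion.pop()
            if labels[j] == NOISE:
                labels[j] = next_label   # earlier noise point joins as a border point
                continue
            if labels[j] is not None:    # already clustered (S5)
                continue
            labels[j] = next_label
            more = search(j)
            if len(more) > min_pts:
                expansion.extend(more)
        next_label += 1
    return labels

# Hypothetical data: two square clusters and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 20)]
labels = cluster(points, radius=1.5, min_pts=2)
```

The only difference from a brute-force density-based loop is the `search` helper, which restricts distance computations to the scanning window in the sorted array.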

Referring to Table 3, to verify that the data clustering method of the present invention achieves high clustering efficiency, data sets A to F were clustered and the results compared with the conventional DBSCAN and IDBSCAN methods. Each of data sets A to F contains 115,000 data points, including 15,000 noise points, and the image size is 900×650 pixels.

The results show that the data clustering method of the present invention substantially reduces the time cost and improves clustering efficiency.

The data clustering method of the present invention reduces the time cost of searching for neighboring points in density-based algorithms, thereby improving clustering efficiency.

The data clustering method of the present invention reduces the number of data points requiring distance computation when searching for neighboring points. This substantially reduces the time cost of the neighbor search and the computational complexity, and the effect becomes more pronounced as the number of data points grows.

Although the present invention has been disclosed by way of the preferred embodiment above, the embodiment is not intended to limit the invention. Changes and modifications made to the embodiment by a person skilled in the art without departing from the spirit and scope of the invention remain within the technical scope protected by the invention. The scope of protection is therefore defined by the appended claims.

[The present invention]

11: data point

11': data point

11": neighboring point

111: reference value

111': reference value

12: core point

121: reference value

2: sorted array

3: circle

FIG. 1: Flowchart of the data clustering method of the present invention.

FIG. 2: Schematic diagram of the data point distribution for the data clustering method of the present invention.

FIG. 3: Schematic diagram of the sorting step of the data clustering method of the present invention.

FIG. 4: Schematic diagram of the data clustering method of the present invention.

FIG. 5: Schematic diagram of the distribution of the core point and data points for the data clustering method of the present invention.

FIG. 6: Schematic diagram of the sorted array of the data clustering method of the present invention.

FIG. 7: Schematic diagram of the searching step of the data clustering method of the present invention in the plane.

FIG. 8: Schematic diagram of the reference values in the searching step of the data clustering method of the present invention.

FIG. 9: Schematic diagram of non-adjacent data points in the data clustering method of the present invention.

FIG. 10: Schematic diagram of the distance determination in the data clustering method of the present invention.

Claims (3)

1. A data clustering method, comprising: a parameter setting step of setting an expansion radius parameter and a minimum-included-points parameter; a sorting step of summing the coordinate values of each data point in an initial array across all dimensions to obtain a reference value, sorting the reference values by magnitude, and storing them in a sorted array; a searching step of taking one data point as a core point and, in the sorted array, taking the reference value of the core point as the center and a multiple of the expansion radius parameter as the scanning range, the multiple being not less than √2, computing the distance between the core point and each data point whose reference value lies within the scanning range, and defining the data points whose distance is smaller than the expansion radius parameter as neighboring points; a determining step of determining whether the number of neighboring points is greater than the minimum-included-points parameter; a cluster determining step of determining whether a neighboring point has already been clustered, and if so, performing the searching step, and if not, adding the neighboring point to an expansion array, assigning the neighboring point to the same cluster as the core point, and then performing an expanding step; the expanding step of taking neighboring points out of the expansion array as core points and performing the searching step, and, after all neighboring points in the expansion array have completed the searching step, performing a termination determining step; and the termination determining step of determining whether to terminate according to a termination condition.

2. The data clustering method of claim 1, wherein in the determining step, if the determination is yes, the cluster determining step is performed, and if the determination is no, the searching step is performed.

3. The data clustering method of claim 1, wherein the termination condition in the termination determining step is that all data points in the initial array have completed the searching step; if the termination condition has not been met, another data point is taken from the initial array and the searching step is performed; and if the termination condition has been met, the method terminates.
TW98121992A 2009-06-30 2009-06-30 Method for data clustering TWI407365B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW98121992A TWI407365B (en) 2009-06-30 2009-06-30 Method for data clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW98121992A TWI407365B (en) 2009-06-30 2009-06-30 Method for data clustering

Publications (2)

Publication Number Publication Date
TW201101176A TW201101176A (en) 2011-01-01
TWI407365B true TWI407365B (en) 2013-09-01

Family

ID=44836924

Family Applications (1)

Application Number Title Priority Date Filing Date
TW98121992A TWI407365B (en) 2009-06-30 2009-06-30 Method for data clustering

Country Status (1)

Country Link
TW (1) TWI407365B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI453613B (en) * 2011-05-17 2014-09-21 Univ Nat Pingtung Sci & Tech Data clustering method based on grid

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5832182A (en) * 1996-04-24 1998-11-03 Wisconsin Alumni Research Foundation Method and system for data clustering for very large databases
TW200828053A (en) * 2006-12-22 2008-07-01 Univ Nat Pingtung Sci & Tech A method for grid-based data clustering
TW200846950A (en) * 2007-05-29 2008-12-01 Univ Nat Pingtung Sci & Tech Data clustering method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
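吳宗諭, "An improved density-based fast clustering algorithm and its application to real-time detection of polarizer defect regions", master's thesis, Graduate Institute of Automation and Control, National Taiwan University of Science and Technology, URL: http://pc01.lib.ntust.edu.tw/ETD-db/ETD-search-c/view_etd?URN=etd-0727107-114727, 2008/07/30 *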

Also Published As

Publication number Publication date
TW201101176A (en) 2011-01-01

Similar Documents

Publication Publication Date Title
TWI385544B (en) Density-based data clustering method
CN1873647A (en) CAD method, CAD system and program storage medium storing CAD program thereof
CN105787977A (en) Building vector boundary simplification method
TWI391837B (en) Data clustering method based on density
TWI396106B (en) Grid-based data clustering method
CN105975519A (en) Multi-supporting point index-based outlier detection method and system
TWI460680B (en) Data clustering method based on density
TWI453613B (en) Data clustering method based on grid
TWI407365B (en) Method for data clustering
US8612183B2 (en) Analysis model generation system
CN108509532B (en) Point gathering method and device applied to map
CN104036096B (en) Method for mapping bump features on inclined face to manufacturing feature bodies
CN112734934A (en) STL model 3D printing slicing method based on intersecting edge mapping
TWI431496B (en) Method for grid-based data clustering
JP4440246B2 (en) Spatial index method
CN113763240B (en) Point cloud thumbnail generation method, device, equipment and storage medium
CN113362340A (en) Dynamic space sphere searching point cloud K neighborhood method
TWI402701B (en) Method for density-based data clustering
CN113204607A (en) Vector polygon rasterization method for balancing area, topology and shape features
TWI463339B (en) Method for data clustering
TWI396103B (en) Method for data clustering
CN116070785B (en) Land-air cooperative airspace distribution method based on Andrew algorithm
CN105678753B (en) A kind of method for segmenting objects and device
CN114298881B (en) Vector map watermark processing method and terminal based on gradient lifting decision tree
TWI441085B (en) Data clustering method

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees