TW201407390A - Data clustering apparatus and method - Google Patents

Data clustering apparatus and method

Info

Publication number
TW201407390A
Authority
TW
Taiwan
Prior art keywords
data
grouping
clusters
group
average distance
Prior art date
Application number
TW101129472A
Other languages
Chinese (zh)
Other versions
TWI465949B (en)
Inventor
Wei-Yao Chuang
Original Assignee
Acer Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Acer Inc
Priority to TW101129472A
Publication of TW201407390A
Application granted
Publication of TWI465949B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data clustering apparatus includes: a news database configured to store data; a calculation module that builds a Global Silhouette Pattern from the distance relationships among the data to obtain a rough clustering number; a clustering module that divides the data into a plurality of clusters with a clustering algorithm according to the rough clustering number and calculates the intra-cluster distance of each cluster; and a comparing module that compares each intra-cluster distance with a threshold, wherein when the intra-cluster distance is less than the threshold, the corresponding cluster is stored in an event database.

Description

Data clustering apparatus and method

The present invention relates generally to data clustering, and in particular to a technique that clusters data using an Auto-detect Text Recursively Clustering (ADTR) method.

In recent years, wireless communication technology has developed rapidly. As a result, a wide variety of portable and handheld devices, such as mobile phones, smart phones, personal digital assistants (PDAs), and tablet PCs, have been introduced to the market, and their functions have become increasingly diverse. Because of their convenience, these devices have become everyday necessities.

Beyond the hardware of these wireless communication devices, software and functions that run on that hardware are continually being developed so that users can manage finances, work, enjoy entertainment, or retrieve information more conveniently, more immediately, and from anywhere. With the spread of mobile networks and handheld mobile devices, reading news over a mobile network while on the move, for example in a car or on the MRT, has become an important trend. There are now a great many news source websites, and the Really Simple Syndication (RSS) news feeds provided by the various media outlets each use their own ordering, which makes the overall picture very cluttered. Although a large number of news items are easy to obtain, they cannot be tracked by news event or ranked by importance. In addition, current news-reading applications mainly present news by RSS source and by general news category. As a result, readers cannot easily find the news events they care about, nor the important news events of the moment.

Furthermore, Chinese news is written in an unstructured format, so it is difficult for automatic classification or clustering by artificial intelligence to judge similar articles as one group. Conversely, different news items are often placed in the same cluster because of a few unrepresentative words, which makes it harder to find the articles that cover the same news event. In addition, deciding the number of clusters is often very difficult. The number is usually defined in advance or obtained by prior observation, and either approach requires manual involvement.

In view of the above problems of the prior art, the present invention provides a data clustering technique, and in particular a technique that clusters data using an Auto-detect Text Recursively Clustering (ADTR) method.

An embodiment of the invention provides a data clustering method comprising the following steps: obtaining a plurality of data from a news database; establishing a Global Silhouette Pattern table according to a distance relationship among the data, to obtain a preliminary clustering reference number; dividing the data into a plurality of clusters with a clustering algorithm according to the preliminary clustering reference number; calculating an intra-cluster distance for each of the clusters; and comparing whether each intra-cluster distance is less than a threshold, wherein if the intra-cluster distance is less than the threshold, the corresponding cluster is stored in an event database.

An embodiment of the invention provides a data clustering apparatus comprising: a news database for storing a plurality of data; a calculation module that establishes a Global Silhouette Pattern table according to a distance relationship among the data and obtains a preliminary clustering reference number from that table; a clustering module that divides the data into a plurality of clusters with a clustering algorithm according to the preliminary clustering reference number and then calculates an intra-cluster distance for each of the clusters; and a comparison module that compares whether each intra-cluster distance is less than a threshold, wherein if the intra-cluster distance is less than the threshold, the corresponding cluster is stored in an event database.

FIG. 1 is a block diagram of a data clustering apparatus 100 according to an embodiment of the invention. As shown, the data clustering apparatus 100 includes a news database 110, a preprocessing module 120, a calculation module 130, a clustering module 140, a comparison module 150, and an event database 160.

According to an embodiment of the invention, the news database 110 stores and provides a plurality of data, and the data it stores can be updated in real time. The data may include news events of every type, such as international, political, social, sports, and entertainment news, as well as various feature reports or other text material.

According to an embodiment of the invention, the preprocessing module 120 performs a preprocessing operation on the data stored in the news database 110. That is, it vectorizes the features of the data so that the data are converted into a spatial model, which simplifies the later clustering. The features here are the keywords extracted from the content of the data after word or sentence segmentation. For example, from the sentence "Global warming has melted Arctic icebergs and thereby raised the sea level", the keywords "global warming", "Arctic", "iceberg", and "sea level rise" can be extracted. The extracted keywords are then vectorized into vector points with different weights, so that the original data are converted from text into a spatial model expressed in vectors.
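The vectorization step can be illustrated with a minimal sketch. The specification does not state how the weights are computed, so the raw term-frequency weighting below is an assumption, and the names `vectorize`, `docs`, and `vocabulary` are hypothetical.

```python
from collections import Counter


def vectorize(keyword_lists):
    """Map each document's keyword list to a vector over a shared vocabulary.

    Assumption: the weight of a keyword is its raw count in the document.
    The specification only says the keywords receive "different weights".
    """
    vocabulary = sorted({kw for kws in keyword_lists for kw in kws})
    vectors = []
    for kws in keyword_lists:
        counts = Counter(kws)
        vectors.append([float(counts[term]) for term in vocabulary])
    return vocabulary, vectors


# Keywords as they might come out of word segmentation (illustrative data).
docs = [
    ["global warming", "Arctic", "iceberg", "sea level rise"],
    ["Arctic", "iceberg", "iceberg"],
]
vocabulary, vectors = vectorize(docs)
```

Each row of `vectors` is one document expressed as a point in the keyword space, which is the form the later distance calculations need.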

According to an embodiment of the invention, the calculation module 130 receives the data preprocessed by the preprocessing module 120 and, according to the distance relationships among the data in the spatial model, establishes a Global Silhouette Pattern table, from which it obtains a preliminary clustering reference number. More specifically, in this embodiment the calculation module 130 proceeds as follows. First, it uses the silhouette formula given below to calculate a plurality of silhouette coefficients from the distances among the data in each cluster; a silhouette coefficient is an index for evaluating clustering validity and indicates how good a clustering result is. Next, from the clustering results for different cluster numbers, it generates a Global Silhouette value (GS_u) for each cluster number in a range that runs from 2 to the total number of data items. Finally, it builds the Global Silhouette Pattern table, which records the Global Silhouette value for each cluster number, and sets the cluster number with the largest Global Silhouette value as the preliminary clustering reference number. The calculation is detailed below.

Silhouette formula. The Silhouette coefficient of the i-th data point is computed as follows:

1. Calculate the average distance (a_i) from the i-th data point to all other data points in the same cluster.

2. For each of the other clusters, calculate the average distance from the i-th data point to all data points in that cluster, and take the minimum of these averages (b_i).

3. Calculate the Silhouette coefficient (S_i) of the i-th data point, defined as:

$$S_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

where the max operator takes the larger of a_i and b_i as the denominator, so that -1 ≤ S_i ≤ 1.
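The three steps can be sketched in code, using the standard silhouette definition S_i = (b_i - a_i) / max(a_i, b_i), which matches the description of the max operator and the range -1 ≤ S_i ≤ 1. The specification does not name the distance used at this stage, so Euclidean distance is an assumption here, as is the convention of returning 0 for a point that is alone in its cluster.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_coefficient(i, points, labels, dist=euclidean):
    """Silhouette coefficient S_i of point i; needs at least two clusters."""
    own = labels[i]
    # Step 1: a_i, the mean distance to the other points of the same cluster.
    same = [dist(points[i], points[k]) for k in range(len(points))
            if k != i and labels[k] == own]
    if not same:
        return 0.0  # assumption: a singleton cluster scores 0
    a_i = sum(same) / len(same)
    # Step 2: b_i, the smallest mean distance to any other cluster.
    other_means = []
    for label in set(labels) - {own}:
        d = [dist(points[i], points[k]) for k in range(len(points))
             if labels[k] == label]
        other_means.append(sum(d) / len(d))
    b_i = min(other_means)
    # Step 3: S_i = (b_i - a_i) / max(a_i, b_i).
    denom = max(a_i, b_i)
    return 0.0 if denom == 0 else (b_i - a_i) / denom


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
well_placed = silhouette_coefficient(0, points, [0, 0, 1, 1])
badly_placed = silhouette_coefficient(0, points, [0, 1, 0, 1])
```

A point close to its own cluster and far from the others scores near 1, while a point grouped with distant points scores below 0.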

To obtain the Global Silhouette value (GS_u), the calculation module 130 first calculates the Cluster Silhouette Value of each cluster for each cluster number. For a given cluster under a given cluster number, the Cluster Silhouette Value (S_j) is calculated as:

$$S_j = \frac{1}{m}\sum_{i=1}^{m} S_i$$

where m is the number of data items contained in that single cluster.

Take the case in which the data are divided into c clusters, that is, the cluster number is c. The Global Silhouette value (GS_u) is then obtained by averaging the Cluster Silhouette Values of all the clusters. It is defined as:

$$GS_u = \frac{1}{c}\sum_{j=1}^{c} S_j$$
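As a hedged illustration of the two averages, the sketch below takes per-point silhouette coefficients that are assumed to have been computed already and derives the Cluster Silhouette Values and the Global Silhouette value from them. The function names and the sample numbers are hypothetical.

```python
from collections import defaultdict


def cluster_silhouettes(coefficients, labels):
    """S_j for each cluster: the mean of S_i over the points in cluster j."""
    grouped = defaultdict(list)
    for s_i, label in zip(coefficients, labels):
        grouped[label].append(s_i)
    return {label: sum(vals) / len(vals) for label, vals in grouped.items()}


def global_silhouette(coefficients, labels):
    """GS_u: the mean of the Cluster Silhouette Values over all c clusters."""
    per_cluster = cluster_silhouettes(coefficients, labels)
    return sum(per_cluster.values()) / len(per_cluster)


# Illustrative per-point coefficients for three points in two clusters.
coefficients = [0.8, 0.6, 0.2]
labels = [0, 0, 1]
per_cluster = cluster_silhouettes(coefficients, labels)
gs = global_silhouette(coefficients, labels)
```

Because the average is taken over clusters and not over points, a small cluster weighs as much as a large one. In this example GS_u is the mean of 0.7 and 0.2, which differs from the plain mean of the three coefficients.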

FIG. 2 is a schematic diagram of the correspondence between the Global Silhouette value and the cluster number according to an embodiment of the invention. As shown in FIG. 2, if the news database 110 holds m data items, the range of cluster numbers to be evaluated runs from 2 clusters to m clusters. The calculation module 130 calculates the Global Silhouette value for each division of the data into 2 to m clusters and records each value in the Global Silhouette Pattern table. If the largest Global Silhouette value is obtained when the data are divided into N clusters, the calculation module 130 sets N as the preliminary clustering reference number.
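The table-building loop and the selection of N can be sketched as follows. The callables `cluster_into` and `score` are hypothetical placeholders for the clustering algorithm and the Global Silhouette calculation; the stubs and score values used in the example are made up purely to exercise the loop.

```python
def build_pattern_table(items, cluster_into, score):
    """Record one global silhouette value per cluster number k, for k = 2..m."""
    table = {}
    for k in range(2, len(items) + 1):
        labels = cluster_into(items, k)
        table[k] = score(items, labels)
    return table


def preliminary_cluster_count(table):
    """Return the cluster number whose recorded value is the largest."""
    return max(table, key=table.get)


# Stand-in callables (assumptions, not the apparatus's real modules).
def fake_cluster_into(items, k):
    return [index % k for index in range(len(items))]


def fake_score(items, labels):
    return {2: 0.41, 3: 0.58, 4: 0.12}[len(set(labels))]


table = build_pattern_table(["a", "b", "c", "d"], fake_cluster_into, fake_score)
n = preliminary_cluster_count(table)
```

With four items the loop evaluates k = 2, 3, and 4, and the selection picks the k whose recorded value is highest.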

According to an embodiment of the invention, the clustering module 140 divides the data into a plurality of clusters with a clustering algorithm according to the preliminary clustering reference number, and then calculates the intra-cluster distance of each cluster. To do so, the clustering module 140 first calculates the center point of the data included in each cluster in the vector space, and then calculates the average distance from the data included in each cluster to that center point. The resulting average distance is the intra-cluster distance of the cluster. In this embodiment the intra-cluster distance is obtained with a cosine distance formula, and it can be used to evaluate the cohesion of a cluster.
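A minimal sketch of the intra-cluster distance follows, assuming the center point is the component-wise mean of the cluster's vectors and that cosine distance is 1 minus cosine similarity. Neither detail is spelled out in the specification, and the function names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of the vectors in one cluster (assumed center point)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(p, q):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros (assumption)."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    if norm == 0:
        return 1.0
    return 1.0 - dot / norm


def intra_cluster_distance(vectors):
    """Average cosine distance from each vector in the cluster to the centroid."""
    center = centroid(vectors)
    return sum(cosine_distance(v, center) for v in vectors) / len(vectors)


loose = intra_cluster_distance([[1.0, 0.0], [0.0, 1.0]])
tight = intra_cluster_distance([[1.0, 2.0], [2.0, 4.0]])
```

Vectors that point in the same direction give a distance near 0, which indicates high cohesion, while orthogonal vectors give a clearly larger value.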

Note that the clustering algorithm used in the above embodiment is a hierarchical clustering algorithm, but the invention is not limited to it. After reading this specification, a person skilled in the art may replace the hierarchical clustering algorithm with another suitable clustering algorithm, for example a partitional clustering method such as the K-means algorithm or the K-medoids algorithm.

According to an embodiment of the invention, the comparison module 150 compares whether the intra-cluster distance of each cluster is less than a threshold. If the intra-cluster distance is less than the threshold, the corresponding cluster is stored in the event database 160. If the intra-cluster distance is not less than the threshold, a recursive clustering operation is performed: the data in that cluster are sent back to the calculation module 130, which again builds a Global Silhouette Pattern table to obtain a preliminary clustering reference number, and the remaining modules of the data clustering apparatus 100 then repeat their processing. Clustering is complete only when all the data have been stored in the event database 160. As for the threshold, a person skilled in the art may, after reading this specification, choose a suitable value (for example, 0.2 to 0.3). According to an embodiment of the invention, a user can retrieve the clustered results from the event database 160 through a search unit (not shown) and view them on a display unit (not shown).
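The recursive control flow can be sketched independently of any particular clustering algorithm. In the sketch below, `split` stands in for "build the pattern table, pick N, and cluster", and `cohesion` stands in for the intra-cluster distance; both are hypothetical callables. The guard that stores a cluster when it cannot be split further is an added assumption to guarantee termination and is not stated in the specification. The toy example clusters plain numbers only to exercise the loop.

```python
def recursive_clustering(items, split, cohesion, threshold):
    """Store cohesive clusters; send the rest back to be clustered again."""
    events = []                 # plays the role of the event database
    pending = [list(items)]     # data sets still waiting to be clustered
    while pending:
        group = pending.pop()
        for cluster in split(group):
            cannot_split = len(cluster) < 2 or len(cluster) == len(group)
            if cannot_split or cohesion(cluster) < threshold:
                events.append(cluster)
            else:
                pending.append(cluster)   # recursive clustering step
    return events


# Toy stand-ins: split a list of numbers at its widest gap, and measure
# cohesion as the spread between the largest and smallest value.
def split_at_largest_gap(values):
    ordered = sorted(values)
    if len(ordered) < 2:
        return [ordered]
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return [ordered[:cut], ordered[cut:]]


def spread(values):
    return max(values) - min(values)


events = recursive_clustering([1, 2, 3, 10, 11, 50], split_at_largest_gap, spread, 3)
```

In the toy run the first pass separates 50 from the rest, and the remaining five numbers fail the threshold and are split again into two cohesive groups.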

FIG. 3 is a flowchart 300 of a data clustering method according to an embodiment of the invention. First, in step S310, a plurality of data are obtained from a news database. In step S320, a preprocessing operation vectorizes the features of the data so that the data are converted into a spatial model. In step S330, a Global Silhouette Pattern table is established according to a distance relationship among the data, to obtain a preliminary clustering reference number. In step S340, the data are divided into a plurality of clusters with a clustering algorithm according to the preliminary clustering reference number. In step S350, the intra-cluster distance of each cluster is obtained. In step S360, each intra-cluster distance is compared with a threshold. If the intra-cluster distance is less than the threshold, step S370 stores the corresponding cluster in an event database. If the intra-cluster distance is not less than the threshold, step S380 recalculates the silhouette coefficients for the corresponding cluster to obtain a new preliminary clustering reference number; that is, the flow returns to step S330 and the clustering steps are repeated. Note that the clustering algorithm used in the above embodiment is a hierarchical clustering algorithm, but the invention is not limited to it. After reading this specification, a person skilled in the art may replace the hierarchical clustering algorithm with another suitable clustering algorithm, for example a partitional clustering method such as the K-means algorithm or the K-medoids algorithm.

FIG. 4 is a flowchart 400 of establishing the Global Silhouette Pattern table according to an embodiment of the invention. First, in step S410, a silhouette formula is applied to the distance relationships among the data in the vector space to generate a Global Silhouette value for each cluster number in a range that runs from 2 to the total number of data items. In step S420, the Global Silhouette values are recorded in the Global Silhouette Pattern table. In step S430, the cluster number corresponding to the largest Global Silhouette value is set as the preliminary clustering reference number.

FIG. 5 is a flowchart 500 of obtaining the intra-cluster distance of each cluster according to an embodiment of the invention. First, in step S510, the center point of the data included in each cluster is obtained. In step S520, the average distance from the data included in each cluster to that center point is obtained and used as the intra-cluster distance.

To address users' needs and the existing problems of RSS sources, and to give users a better reading experience, the data clustering method proposed here builds on text mining within artificial intelligence. It uses the Auto-detect Text Recursively Clustering (ADTR) technique to improve on traditional clustering algorithms by detecting the clustering parameter automatically. Cluttered news can thus be grouped into clusters of similar news events that come from different RSS sources, which improves the accuracy of news event clustering. The proposed method can also help identify important personal names and a potentially important lexicon in the news. It adapts as the news context changes and shows good resistance to lexicon over-fitting. By comparison, the traditional single-pass clustering approach processes one article at a time and assigns it by comparing its similarity with the clusters that already exist. The ADTR technique used by the proposed method instead builds the Global Silhouette Pattern table over all the available data at once, finds the initial cluster number, and then applies the recursive clustering algorithm to the clusters.

Reference in this specification to "an embodiment" or "one embodiment" means that a particular feature, structure, or characteristic may be included in at least one embodiment of this specification. Thus the phrase "in one embodiment" appearing in different places does not necessarily refer to the same embodiment. Such a feature, structure, or characteristic may also be combined with one or more embodiments in any suitable manner. Note also that the accompanying drawings are provided only to aid explanation and are not drawn to scale.

Although this specification describes the subject matter of the invention by way of the disclosed embodiments, those embodiments serve to support the scope of the claims and are not intended to limit the scope of the invention. A person skilled in the art will readily appreciate the above advantages from the disclosed embodiments and, after reading this specification, may make appropriate changes and substitutions in a broad manner without departing from the spirit and scope of the invention.

100: data clustering apparatus
110: news database
120: preprocessing module
130: calculation module
140: clustering module
150: comparison module
160: event database
300, 400, 500: flowcharts
S310, S320, S330, S340, S350, S360, S370, S380, S410, S420, S430, S510, S520: steps

FIG. 1 is a block diagram of a data clustering apparatus 100 according to an embodiment of the invention.

FIG. 2 is a schematic diagram of the correspondence between the Global Silhouette value and the cluster number according to an embodiment of the invention.

FIG. 3 is a flowchart 300 of a data clustering method according to an embodiment of the invention.

FIG. 4 is a flowchart 400 of establishing the Global Silhouette Pattern table according to an embodiment of the invention.

FIG. 5 is a flowchart 500 of calculating the intra-cluster distance of each cluster according to an embodiment of the invention.


Claims (12)

一種資料分群裝置,包括:一新聞資料庫,用以儲存複數資料;一計算模組,根據上述資料間之一距離關係,用以建立一整體側影樣式表,再根據上述整體側影樣式表,取得一初步分群參考數目;一分群模組,根據上述初步分群參考數目利用一分群演算法將複數資料分為複數群集,再計算每一上述群集之一群內平均距離(Intra-Cluster distance),以及一比較模組,用以比較上述群內平均距離是否小於一門檻值,其中若上述群內平均距離小於上述門檻值,則將對應上述群內平均距離之上述群集存入一事件資料庫中。 A data grouping device includes: a news database for storing a plurality of data; a computing module, based on a distance relationship between the data, for establishing an overall silhouette style sheet, and then obtaining the overall silhouette style sheet according to the overall pattern a preliminary grouping reference number; a grouping module, according to the preliminary grouping reference number, using a grouping algorithm to divide the complex data into a plurality of clusters, and then calculating an intra-Cluster distance of each of the clusters, and a The comparison module is configured to compare whether the average distance in the group is less than a threshold value, wherein if the average distance within the group is less than the threshold value, the cluster corresponding to the average distance within the group is stored in an event database. 如申請專利範圍第1項所述之資料分群裝置,更包括一預處理模組,用以將上述複數資料經過一前處理運算,以將上述資料之複數特徵進行一向量化處理,使上述資料轉換成一空間模型。 The data grouping device of claim 1, further comprising a pre-processing module for performing a pre-processing operation on the plurality of data to perform a vectorization process on the plurality of features of the data to convert the data. Into a space model. 如申請專利範圍第1項所述之資料分群裝置,其中若上述群內平均距離未小於上述門檻值,則將對應上述群內平均距離之上述群集重新傳回上述計算模組,以建立上述整體側影樣式表,而取得上述初步分群參考數目。 The data grouping device of claim 1, wherein if the average distance within the group is not less than the threshold value, the cluster corresponding to the average distance within the group is returned to the computing module to establish the overall The pattern style sheet is obtained, and the number of preliminary grouping references mentioned above is obtained. 
如申請專利範圍第1項所述之資料分群裝置,其中上述計算模組建立上述整體側影樣式表之步驟包括:根據上述資料間之上述距離關係,利用一側影公式計算複數側影係數,以產生對應一群集數目範圍之不同群集數目之複數整體側影值(GSu),其中上述群集數目範圍介於 2到上述資料之總數之間;記錄上述側影值至上述整體分群側影樣式表;以及將對應上述整體側影值之最大值之上述群集數目設定為上述初步分群參考數目。 The data grouping device of claim 1, wherein the step of establishing the overall silhouette pattern table by the calculating module comprises: calculating a complex silhouette coefficient by using a one-side shadow formula according to the distance relationship between the data to generate a corresponding a complex overall silhouette value (GS u ) of a number of different clusters in a range of clusters, wherein the number of clusters ranges from 2 to the total number of the above data; recording the above-mentioned silhouette values to the overall grouping pattern style sheet; and The number of clusters of the maximum value of the overall silhouette value is set to the above-mentioned preliminary grouping reference number. 如申請專利範圍第1項所述之資料分群裝置,其中上述分群模組計算每一上述群集之上述群內平均距離之步驟包括:取得每一上述群集中所包括之上述資料之一中心點;以及取得每一上述群集中所包括之上述資料到上述中心點之一平均距離,上述平均距離即上述群內平均距離。 The data grouping device of claim 1, wherein the step of calculating, by the grouping module, the average distance within the group of each of the clusters comprises: obtaining a center point of the data included in each of the clusters; And obtaining an average distance from the above-mentioned center point of the above-mentioned data included in each of the clusters, wherein the average distance is the average distance within the group. 如申請專利範圍第1項所述之資料分群裝置,其中上述分群演算法為一階層式分群演算法。 The data grouping device according to claim 1, wherein the grouping algorithm is a hierarchical grouping algorithm. 
一種資料分群方法,包括以下步驟:由一新聞資料庫取得複數資料;根據上述資料間之一距離關係,建立一整體側影樣式表,以取得一初步分群參考數目;根據上述初步分群參考數目利用一分群演算法將複數資料分為複數群集;取得每一上述群集之一群內平均距離;以及比較上述群內平均距離是否小於一門檻值,其中若上述群內平均距離小於上述門檻值,則將對應上述群內平均距離之上述群集存入一事件資料庫。 A data grouping method includes the following steps: obtaining a plurality of materials from a news database; and establishing an overall silhouette style sheet according to a distance relationship between the data to obtain a preliminary grouping reference number; using one according to the preliminary grouping reference number The clustering algorithm divides the plurality of data into a plurality of clusters; obtains an average distance within a group of each of the clusters; and compares whether the average distance within the group is less than a threshold, wherein if the average distance within the group is less than the threshold, the corresponding The above clusters of average distances within the above groups are stored in an event database. 如申請專利範圍第7項所述之資料分群方法,其中在建立上述整體側影樣式表前,更包括,對上述資料,執 行一前處理運算以將上述資料之複數特徵進行一向量化處理,使上述資料轉換成一空間模型。 For example, the data grouping method described in claim 7 of the patent application, wherein before the establishment of the above-mentioned overall silhouette style sheet, A pre-processing operation is performed to perform a vectorization process on the complex features of the above data to convert the data into a spatial model. 如申請專利範圍第7項所述之資料分群方法,其中若上述群內平均距離未小於上述門檻值,則將對應上述群內平均距離之上述群集重新建立上述整體側影樣式表以取得上述初步分群參考數目。 The data grouping method according to claim 7, wherein if the average distance within the group is not less than the threshold value, the cluster corresponding to the average distance within the group is re-established to form the overall silhouette pattern to obtain the preliminary grouping. Reference number. 
The data clustering method of claim 7, wherein establishing the global silhouette pattern table comprises: computing a plurality of silhouette coefficients with a silhouette formula, according to the distance relationship among the data, to produce a plurality of global silhouette values, each corresponding to a different cluster number within a cluster-number range, wherein the cluster-number range runs from 2 to the total number of the data; recording the silhouette values in the global silhouette pattern table; and setting the cluster number corresponding to the maximum of the global silhouette values as the preliminary clustering reference number.

The data clustering method of claim 7, wherein obtaining the intra-cluster average distance of each cluster comprises: obtaining a center point of the data included in the cluster; and obtaining an average distance from the data included in the cluster to the center point, to serve as the intra-cluster average distance.

The data clustering method of claim 7, wherein the clustering algorithm is a hierarchical clustering algorithm.
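The threshold test and the re-clustering of clusters that fail it can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the center point is assumed to be the coordinate-wise mean, distances are Euclidean, and the silhouette-guided clustering step is abstracted as a caller-supplied function `cluster_fn`. The toy `split_at_largest_gap` merely stands in for that step so the example runs on its own; all function names are hypothetical.

```python
from math import dist


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster (assumed center point)."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))


def intra_cluster_distance(cluster):
    """Average distance from each point of the cluster to its center point."""
    center = centroid(cluster)
    return sum(dist(p, center) for p in cluster) / len(cluster)


def extract_events(points, cluster_fn, threshold):
    """Keep clusters tighter than the threshold and re-cluster the rest.

    `cluster_fn` must split any input of two or more points into at least
    two non-empty clusters, otherwise the loop would not terminate.
    """
    events, pending = [], [points]
    while pending:
        for cluster in cluster_fn(pending.pop()):
            if intra_cluster_distance(cluster) < threshold:
                events.append(cluster)    # would be stored in the event database
            elif len(cluster) > 1:
                pending.append(cluster)   # too loose: cluster it again
    return events


def split_at_largest_gap(points):
    """Toy stand-in for the clustering step: cut the sorted points in two."""
    ordered = sorted(points)
    if len(ordered) < 2:
        return [ordered]
    cut = max(range(1, len(ordered)),
              key=lambda i: dist(ordered[i - 1], ordered[i]))
    return [ordered[:cut], ordered[cut:]]


data = [(0.0,), (0.2,), (5.0,), (5.1,), (9.0,)]
events = extract_events(data, split_at_largest_gap, threshold=0.5)
```

On this toy input the first pass accepts the pair near 0 and rejects the looser group {5.0, 5.1, 9.0}, which a second pass splits into a tight pair and a singleton. Whether a singleton should count as an event is a policy choice the claims leave open; the sketch accepts it because its intra-cluster distance is 0.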
TW101129472A 2012-08-15 2012-08-15 Data clustering apparatus and method TWI465949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW101129472A TWI465949B (en) 2012-08-15 2012-08-15 Data clustering apparatus and method

Publications (2)

Publication Number Publication Date
TW201407390A (en) 2014-02-16
TWI465949B TWI465949B (en) 2014-12-21

Family

ID=50550489

Family Applications (1)

Application Number Title Priority Date Filing Date
TW101129472A TWI465949B (en) 2012-08-15 2012-08-15 Data clustering apparatus and method

Country Status (1)

Country Link
TW (1) TWI465949B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103914518A (en) * 2014-03-14 2014-07-09 小米科技有限责任公司 Clustering method and clustering device
WO2015135276A1 (en) * 2014-03-14 2015-09-17 小米科技有限责任公司 Clustering method and related device
CN103914518B (en) * 2014-03-14 2017-05-17 小米科技有限责任公司 Clustering method and clustering device
RU2628167C2 (en) * 2014-03-14 2017-08-15 Сяоми Инк. Method and device for clustering
US10037345B2 (en) 2014-03-14 2018-07-31 Xiaomi Inc. Clustering method and device
CN104268149A (en) * 2014-08-28 2015-01-07 小米科技有限责任公司 Clustering method and clustering device
CN112954268A (en) * 2019-12-10 2021-06-11 晶睿通讯股份有限公司 Queue analysis method and image monitoring equipment
CN112954268B (en) * 2019-12-10 2023-07-18 晶睿通讯股份有限公司 Queue analysis method and image monitoring equipment

Also Published As

Publication number Publication date
TWI465949B (en) 2014-12-21
