TW201243627A - Multi-label text categorization based on fuzzy similarity and k nearest neighbors - Google Patents

Multi-label text categorization based on fuzzy similarity and k nearest neighbors

Info

Publication number
TW201243627A
TW201243627A TW100113975A
Authority
TW
Taiwan
Prior art keywords
category
training
file
similarity
nearest neighbor
Prior art date
Application number
TW100113975A
Other languages
Chinese (zh)
Other versions
TWI452477B (en)
Inventor
Shie-Jue Lee
Jung-Yi Jiang
Shian-Chi Tsai
Original Assignee
Univ Nat Sun Yat Sen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Nat Sun Yat Sen filed Critical Univ Nat Sun Yat Sen
Priority to TW100113975A priority Critical patent/TWI452477B/en
Publication of TW201243627A publication Critical patent/TW201243627A/en
Application granted granted Critical
Publication of TWI452477B publication Critical patent/TWI452477B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The multi-label text categorization method of the present invention includes the steps of: (a) calculating the fuzzy similarity of every training document to every class according to the similarity of every feature to every class and the similarity of every feature to every training document; (b) clustering all the training documents according to the fuzzy similarities; (c) calculating the distribution of the k nearest neighbors of an unknown document according to the clustering results; and (d) determining the classification of the unknown document according to the distribution of the k nearest neighbors.

Description

VI. Description of the Invention:

[Technical Field]

The present invention relates to a multi-label document classification method, and more particularly to a multi-label document classification method based on fuzzy similarity and the K-nearest-neighbor method.

[Prior Art]

In the era of exploding online information, search engines have become an indispensable tool for obtaining information. A good search engine must not only find the web documents relevant to a query, but also answer the query as efficiently as possible. Among the various ways of improving efficiency, classifying web documents in advance is one of the main approaches. Web data are usually stored in document form, containing content, program blocks, links and other information. By representing a web page with its document content, document classification can be used to distinguish spam pages from ordinary web pages.

When processing document data, a document is usually described with the vector space model. Most existing document classification methods are automatic classification methods developed with single-label data as training samples. In practice, however, a document may belong to several categories at the same time; such documents are called multi-label documents.

In 2000, Schapire and Singer [1] proposed BOOSTEXTER, a method for the automatic recognition of multi-label documents extended from ADABOOST [2], which trains the recognition module by adjusting a set of weights over pairs of training samples and labels.

In 1999 McCallum [3] and in 2003 Ueda and Saito [4] proposed two methods for the automatic recognition of multi-label documents based on models of feature frequencies. McCallum [3] uses a probabilistic mixture model trained with the EM algorithm to learn the mixture weights and the word distribution within each mixture component.

Ueda and Saito [4] use two probabilistic generative models; they assume that a multi-label document contains words specific to each of the single-label categories covered by that document.

In 2004, Gao et al. [5] proposed a method that combines the parameters of the recognition module into a continuous, differentiable function approximating a specific performance measure, and trains the recognition module by optimizing this function.

In 2003, Comité et al. [6] extended the alternating decision tree method and used it as the base learner of ADABOOST.MH [7] to train a multi-label recognition module.

Zhang and Zhou [8] trained a back-propagation neural network with a new error function so that it can be applied to automatic multi-label recognition.

In 2005, Ghamrawi and McCallum [9] and Zhu et al. [10] each extended the maximum entropy model, adding a second-order constraint that captures the correlation between categories in order to handle multi-label data.

In 2004, Godbole and Sarawagi [11] extended support-vector-machine-based document recognition: a kernel function is built from heterogeneous feature sets that combine the original document features and is used to train a multi-label SVM recognition module.

Kazawa et al. [12] handle the multi-label document recognition problem by converting the multi-label data into a multi-class, single-label data set.

Besides document recognition, multi-label learning has also shown excellent results in bioinformatics and automatic scene recognition.

In 2001, Clare and King [13] used C4.5 decision trees with a modified definition of entropy to handle multi-label gene expression problems; the rules produced by the learned decision trees are comparable with known biological knowledge.

In 2002, Elisseeff and Weston [14] defined a special multi-label loss and proposed RANKSVM, a kernel method for automatic multi-label recognition, which was tested on the Yeast gene-function data set.

In 2007, Brinker and Hüllermeier [15] introduced a pairwise-comparison learning method for automatic multi-label recognition.

In 2004, Boutell et al. [16] applied multi-label learning to automatic scene recognition; they decompose the multi-label recognition problem into several independent binary recognition problems and provide several labeling criteria to combine the results produced by these binary recognizers.

In 2007, Qi et al. [17] studied the automatic multi-label image annotation problem; they map the input samples into a high-dimensional vector space that encodes the correlation between inputs and outputs, and propose a maximum-margin algorithm to learn from the transformed data. In recent years, Zhang and his collaborators have made considerable contributions by adapting traditional single-label recognition techniques to multi-label recognition.

In 2007, ML-KNN [18] was proposed, an automatic recognition method based on the K-nearest-neighbor rule that estimates, in a probabilistic manner, the probability that an object belongs to each category and judges the categories of the object accordingly.

In 2009, two further methods were proposed: MLNB [20], an automatic recognition method based on a naive Bayes classifier combined with feature selection, and ML-RBF [19], which is based on an RBF neural network. In ML-RBF, K-means clustering with the Euclidean distance is performed for each category to find K cluster centers, which form the first-layer network nodes, and the weights from the first layer to the second layer are obtained by solving a least-squares problem, yielding a neural network for automatic multi-label recognition.

However, the above conventional methods require a great deal of training and testing time to perform the classification work, which also increases the required cost. It is therefore necessary to provide an innovative and progressive multi-label document classification method based on fuzzy similarity and the K-nearest-neighbor method to solve the above problems.

The prior art documents referred to above are listed below:

1. R. E. Schapire, Y. Singer, "BoosTexter: a boosting-based system for text categorization," Machine Learning 39(2/3), pp. 135-168, 2000.
2. Y. Freund, R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences 55(1), pp. 119-139, 1997.
3. A. McCallum, "Multi-label text classification with a mixture model trained by EM," Working Notes of the AAAI'99 Workshop on Text Learning, 1999.
4. N. Ueda, K. Saito, "Parametric mixture models for multi-label text," Advances in Neural Information Processing Systems, vol. 15, MIT Press, Cambridge, MA, pp. 721-728, 2003.
5. S. Gao, W. Wu, C.-H. Lee, T.-S. Chua, "A MFoM learning approach to robust multiclass multi-label text categorization," 21st International Conference on Machine Learning, pp. 329-336, 2004.
6. F. D. Comité, R. Gilleron, M. Tommasi, "Learning multi-label alternating decision trees from texts and data," Lecture Notes in Computer Science, vol. 2734, Springer, Berlin, pp. 35-49, 2003.
7. R. E. Schapire, Y. Singer, "Improved boosting algorithms using confidence-rated predictions," 11th Annual Conference on Computational Learning Theory, pp. 80-91, 1998.
8. M.-L. Zhang, Z.-H. Zhou, "Multilabel neural networks with applications to functional genomics and text categorization," IEEE Transactions on Knowledge and Data Engineering 18(10), pp. 1338-1351, 2006.
9. N. Ghamrawi, A. McCallum, "Collective multi-label classification," 14th ACM International Conference on Information and Knowledge Management, pp. 195-200, 2005.
10. S. Zhu, X. Ji, W. Xu, Y. Gong, "Multi-labelled classification using maximum entropy method," 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 274-281, 2005.
11. S. Godbole, S. Sarawagi, "Discriminative methods for multi-labeled classification," Lecture Notes in Artificial Intelligence, vol. 3056, pp. 22-30, 2004.
12. H. Kazawa, T. Izumitani, H. Taira, E. Maeda, "Maximal margin labeling for multi-topic text categorization," Advances in Neural Information Processing Systems, vol. 17, MIT Press, Cambridge, MA, pp. 649-656, 2005.
13. A. Clare, R. D. King, "Knowledge discovery in multi-label phenotype data," Lecture Notes in Computer Science, vol. 2168, Springer, Berlin, pp. 42-53, 2001.
14. A. Elisseeff, J. Weston, "A kernel method for multi-labelled classification," Advances in Neural Information Processing Systems, vol. 14, MIT Press, Cambridge, MA, pp. 681-687, 2002.
15. K. Brinker, E. Hüllermeier, "Case-based multilabel ranking," 20th International Joint Conference on Artificial Intelligence, pp. 702-707, 2007.
16. M. R. Boutell, J. Luo, X. Shen, C. M. Brown, "Learning multi-label scene classification," Pattern Recognition 37(9), pp. 1757-1771, 2004.
17. G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, T. Mei, H.-J. Zhang, "Correlative multi-label video annotation," 15th ACM International Conference on Multimedia, pp. 17-26, 2007.
18. M.-L. Zhang, Z.-H. Zhou, "ML-kNN: a lazy learning approach to multi-label learning," Pattern Recognition 40(7), pp. 2038-2048, 2007.
19. M.-L. Zhang, "ML-RBF: RBF neural networks for multi-label learning," Neural Processing Letters 29(2), pp. 61-74, 2009.
20. M.-L. Zhang, J. M. Peña, V. Robles, "Feature selection for multi-label naive Bayes classification," Information Sciences 179(19), pp. 3218-3229, 2009.

[Summary of the Invention]

The present invention provides a multi-label document classification method based on fuzzy similarity and the K-nearest-neighbor method, comprising the following steps: (a) calculating the fuzzy similarity of every training document to every category according to the membership degree of every feature with respect to every category and the membership degree of every feature with respect to every training document; (b) clustering all the training documents according to these fuzzy similarities; (c) calculating the category distribution of the K nearest neighbors of an unknown document with the K-nearest-neighbor method, according to the unknown document and the fuzzy similarities; and (d) determining the categories to which the unknown document belongs according to the category distribution of the K nearest neighbors.

Through the computation of fuzzy similarities, the method of the present invention preferentially selects the more similar clusters when classifying with the K-nearest-neighbor method, and uses the data in these more similar clusters for the training and testing of the K-nearest-neighbor method, thereby improving the execution performance of the classification method.

Moreover, the multi-label document classification method of the present invention, being based on fuzzy similarity and the K-nearest-neighbor method, reduces the training and testing time required for the classification work and therefore saves a considerable amount of time and cost.

[Embodiments]

In the process of information retrieval, the information itself must be analyzed and indexed to support retrieval. The index mainly represents the content of a document and assigns a weight to each index term, reflecting the importance and value of that term for identifying the document. In most current document retrieval systems based on the vector space model, an attribute usually represents a term or concept, and the attribute value is a statistic of that term or concept in the document. Since a document is composed of many terms, the meaningful index terms (keywords) of the document can be combined into a document vector, and this document vector represents the document in the vector space model.

In a document collection, each index term corresponds to one dimension of the space, and the value along each dimension represents how significant the document is in that dimension. This value, called the term significance or weight, can be computed from term statistics of the document, for example the term frequency (TF). Representing each document as a vector not only makes it convenient to express the relationships among documents, but also makes it easy to compute their mutual similarity: documents with similar meanings tend to share many terms, so their vectors are close to one another.

In the vector space model, since a set of terms represents a document, the choice of terms is particularly important. Term weights are used to distinguish which terms are more representative of a document and may serve as its keywords. Commonly used term weights are the term frequency and the document frequency. The term frequency is the number of times a term appears in a document; the higher it is, the more important the term is to that document. The document frequency is the number of documents in the collection in which a term appears; the lower it is, the better the term distinguishes a document from the other documents, and the more representative it is.

In the feature representation of document data, feature values are computed from the number of times a word appears in a document, which makes them very different from ordinary data. For example, the difference between a feature value of 0 and a feature value of 1 and the difference between 2 and 3 are both 1, but the difference between a word not appearing at all (value 0) and appearing once (value 1) is clearly far greater than the difference between appearing twice and three times. A special method is therefore required when measuring the similarity between a document and a category, and fuzzy similarity measures work well for document data.

Fig. 1 shows a flow chart of the multi-label document classification method of the present invention based on fuzzy similarity and the K-nearest-neighbor method. Referring to step S11, the fuzzy similarity of every training document to every category is calculated according to the membership degree of every feature with respect to every category and the membership degree of every feature with respect to every training document.

Referring to Figs. 2 and 3, in this embodiment the training documents are d_1, d_2, ..., d_n, where n is the number of training documents. The training documents are distributed over p categories (three categories c_1 to c_3 in Fig. 3), and each training document may belong to one or more categories. Each training document is represented by m features t_1, t_2, ..., t_m, where each feature corresponds to a word. Let dt(t_i, c_j) and dd(t_i, c_j) denote the distribution ratios of feature t_i in category c_j; they can be expressed as follows.

dt(t_i, c_j) = \frac{\sum_{v=1}^{n} y_{vj}\, x_{vi}}{\sum_{v=1}^{n} x_{vi}}, \qquad dd(t_i, c_j) = \frac{\sum_{v=1}^{n} y_{vj}\, \mathrm{sgn}(x_{vi})}{\sum_{v=1}^{n} y_{vj}}    (1)

dt(t_i, c_j) represents the proportion of the total occurrence frequency of feature t_i that is contributed by the training documents belonging to category c_j; dd(t_i, c_j) represents the proportion of the training documents of category c_j in which feature t_i appears, relative to the total number of training documents of category c_j. Here x_{vi} is the frequency of feature t_i in the v-th training document; y_{vj} indicates whether the v-th training document belongs to category c_j, its value being 1 if the document belongs to c_j and 0 otherwise; and \mathrm{sgn}(x_{vi}) is 1 when x_{vi} > 0 and 0 otherwise.

The two distributions dt(t_i, c_j) and dd(t_i, c_j) can be used to measure the membership degree \mu(t_i, c_j) of a feature t_i with respect to a category c_j, expressed as

\mu(t_i, c_j) = \frac{dt(t_i, c_j)}{\max_{1 \le u \le m,\, 1 \le v \le p} dt(t_u, c_v)} \times \frac{dd(t_i, c_j)}{\max_{1 \le u \le m,\, 1 \le v \le p} dd(t_u, c_v)}    (2)

In this embodiment, the membership degree \mu(t_i, c_j) is obtained by multiplying the normalized dt(t_i, c_j) and the normalized dd(t_i, c_j).
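For illustration only, the following Python sketch shows one way to compute the distribution ratios of equation (1) and the feature-to-category membership degrees of equation (2) from a term-frequency matrix X (n documents by m features) and a binary label matrix Y (n documents by p categories). The patent does not prescribe any implementation; all function names, variable names and toy values below are assumptions made for this sketch.

```python
import numpy as np

def feature_category_membership(X, Y):
    """Compute dt, dd (Eq. 1) and the membership degrees mu (Eq. 2).

    X: (n, m) term-frequency matrix, X[v, i] = frequency of feature t_i in document d_v.
    Y: (n, p) binary label matrix, Y[v, j] = 1 if document d_v belongs to category c_j.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    sgn_X = (X > 0).astype(float)          # sgn(x_vi): 1 if the feature appears, 0 otherwise

    # dt[i, j]: share of the total frequency of t_i contributed by the documents of c_j
    dt = (Y.T @ X).T / np.maximum(X.sum(axis=0)[:, None], 1e-12)

    # dd[i, j]: share of the documents of c_j that contain t_i at least once
    dd = (Y.T @ sgn_X).T / np.maximum(Y.sum(axis=0)[None, :], 1e-12)

    # mu[i, j]: product of the globally normalized dt and dd (Eq. 2)
    mu = (dt / dt.max()) * (dd / dd.max())
    return dt, dd, mu

if __name__ == "__main__":
    # Toy data: 3 documents, 4 features, 2 categories (illustrative values only).
    X = np.array([[2, 0, 1, 0],
                  [0, 3, 0, 1],
                  [1, 1, 0, 2]])
    Y = np.array([[1, 0],
                  [0, 1],
                  [1, 1]])
    dt, dd, mu = feature_category_membership(X, Y)
    print(np.round(mu, 3))
```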

For a training document d = <x_1, x_2, ..., x_m>, the similarity Sim(d, c_j) of the training document to category c_j can be expressed as

Sim(d, c_j) = \bigoplus_{i=1}^{m} \left( f_d(t_i) \otimes \mu(t_i, c_j) \right)    (3)

where f_d(t_i) represents the membership degree of feature t_i with respect to the training document d, expressed as

f_d(t_i) = \frac{x_i}{\max_{1 \le v \le m} x_v}    (4)

and \otimes and \oplus are fuzzy operators whose rules are defined as

x \otimes y = x \cdot y, \qquad x \oplus y = x + y - x \cdot y    (5)
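Under the same assumptions as the previous sketch, equations (3) to (5), together with the normalization of equation (6) defined immediately below in the text, can be evaluated by folding the fuzzy OR x (+) y = x + y - xy over the m terms f_d(t_i) (x) mu(t_i, c_j); starting the fold at 0 is harmless because 0 (+) y = y. The helper below is a minimal illustrative sketch, not the patent's implementation, and its names are assumptions.

```python
import numpy as np

def fuzzy_or(a, b):
    # x (+) y = x + y - x*y   (Eq. 5)
    return a + b - a * b

def document_fuzzy_similarity(x, mu):
    """Fuzzy similarity of one document to every category (Eqs. 3-6).

    x:  (m,) term-frequency vector of document d.
    mu: (m, p) feature-to-category membership degrees from Eq. 2.
    """
    x = np.asarray(x, dtype=float)
    f_d = x / max(x.max(), 1e-12)            # f_d(t_i) = x_i / max_v x_v   (Eq. 4)

    p = mu.shape[1]
    sim = np.zeros(p)
    for j in range(p):
        terms = f_d * mu[:, j]               # f_d(t_i) (x) mu(t_i, c_j), the product operator of Eq. 5
        s = 0.0
        for t in terms:                      # fold the fuzzy OR over all m features (Eq. 3)
            s = fuzzy_or(s, t)
        sim[j] = s

    return sim / max(sim.max(), 1e-12)       # mu_{c_j}(d) = Sim(d, c_j) / max_v Sim(d, c_v)   (Eq. 6)
```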

Finally, the fuzzy similarity \mu_{c_j}(d) of a training document d to category c_j is defined as

\mu_{c_j}(d) = \frac{Sim(d, c_j)}{\max_{1 \le v \le p} Sim(d, c_v)}    (6)

so that \mu_{c_j}(d) is the value of the fuzzy similarity of the training document to the category. Taking the training documents d_1, d_2, d_3 of Fig. 2 as an example, the fuzzy similarities of d_1 to categories c_1, c_2, c_3 are \mu_{c_1}(d_1), \mu_{c_2}(d_1), \mu_{c_3}(d_1); the fuzzy similarities of d_2 to categories c_1 to c_3 are \mu_{c_1}(d_2), \mu_{c_2}(d_2), \mu_{c_3}(d_2); and the fuzzy similarities of d_3 to categories c_1 to c_3 are \mu_{c_1}(d_3), \mu_{c_2}(d_3), \mu_{c_3}(d_3), as shown in Fig. 3.

Referring to step S12, all the training documents are clustered according to these fuzzy similarities. After the fuzzy similarities of all training documents to every category have been obtained, they are used to cluster all the training documents. This clustering serves two purposes: it narrows the search range when looking for similar documents, which speeds up the search, and it acts as a filter applied in advance, which increases accuracy. With clustering by fuzzy similarity, each cluster represents a set of documents that have a considerable degree of similarity to a certain category.

Referring to step S13, the category distribution of the K nearest neighbors of an unknown document is calculated with the K-nearest-neighbor method, according to the unknown document and the fuzzy similarities. Once all training documents have been clustered, the prior statistics of all training documents can be obtained by finding their K (for example, two) nearest neighbors. When searching for the K nearest neighbors, because the training documents have been clustered in advance, only the clusters whose similarity exceeds a threshold need to be searched, which improves the efficiency of the search.

Referring to step S14, the categories to which the unknown document belongs are determined according to the category distribution of the K nearest neighbors. When an unknown document needs to be recognized, its K nearest neighbors are found, and the numbers {n_1, n_2, ..., n_p} of these K neighbors that belong to the respective categories are counted. Whether the unknown document belongs to category c_j is then judged by

y_j = \begin{cases} 1, & \text{if } P(H_j = 1 \mid E = n_j) > P(H_j = 0 \mid E = n_j) \\ 0, & \text{if } P(H_j = 1 \mid E = n_j) < P(H_j = 0 \mid E = n_j) \\ R[0, 1], & \text{otherwise} \end{cases}    (7)

where P denotes a probability; H_j = b, with b being 0 or 1, denotes the event that a training document belongs to category c_j; E = n_j denotes the event that n_j of the K nearest neighbors belong to category c_j; and R is a random function that outputs 0 or 1 at random. Since

P(H_j = b \mid E = n_j) = \frac{P(H_j = b)\, P(E = n_j \mid H_j = b)}{P(E = n_j)}    (8)

where b is 0 or 1, equation (7) can be rewritten as

y_j = \begin{cases} 1, & \text{if } P(H_j = 1)\, P(E = n_j \mid H_j = 1) > P(H_j = 0)\, P(E = n_j \mid H_j = 0) \\ 0, & \text{if } P(H_j = 1)\, P(E = n_j \mid H_j = 1) < P(H_j = 0)\, P(E = n_j \mid H_j = 0) \\ R[0, 1], & \text{otherwise} \end{cases}    (9)

and P(H_j = b) and P(E = n_j \mid H_j = b) can be computed from the training documents. The K-nearest-neighbor method itself is well known in the art and is not described further here.

After the prior statistics of all training documents have been obtained in the training stage, an automatic recognition module for multi-label documents has been produced. When an unknown document needs to be recognized automatically, features are first extracted from the unknown document and its fuzzy similarities are calculated (through equations (1) to (4)); the K nearest neighbors are then found in the clusters whose fuzzy similarity is closest to that of the unknown document, and from the category distribution of these K neighbors the recognition module makes its judgment and obtains the categories to which the unknown document should belong.
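As a sketch of the decision rule in equations (7) to (9), the prior P(H_j = b) and the likelihood P(E = n_j | H_j = b) can be estimated by counting over the training documents and their K nearest neighbors. The patent only states that these probabilities are computed from the training data, so the counting scheme and the smoothing constant used below are assumptions of this sketch, as are all function and variable names.

```python
import numpy as np

def estimate_knn_statistics(neighbor_counts, Y, K, smooth=1.0):
    """Estimate P(H_j = b) and P(E = k | H_j = b), b in {0, 1}, from the training set.

    neighbor_counts: (n, p) matrix; entry [v, j] = how many of the K nearest
                     neighbors of training document d_v belong to category c_j.
    Y:               (n, p) binary label matrix of the training documents.
    """
    n, p = Y.shape
    prior1 = (smooth + Y.sum(axis=0)) / (2.0 * smooth + n)       # P(H_j = 1)
    prior0 = 1.0 - prior1                                        # P(H_j = 0)

    c1 = np.full((p, K + 1), smooth)   # c1[j, k]: docs of c_j with k neighbors in c_j
    c0 = np.full((p, K + 1), smooth)   # c0[j, k]: docs not of c_j with k neighbors in c_j
    for v in range(n):
        for j in range(p):
            k = int(neighbor_counts[v, j])
            if Y[v, j] == 1:
                c1[j, k] += 1
            else:
                c0[j, k] += 1
    like1 = c1 / c1.sum(axis=1, keepdims=True)                   # P(E = k | H_j = 1)
    like0 = c0 / c0.sum(axis=1, keepdims=True)                   # P(E = k | H_j = 0)
    return prior0, prior1, like0, like1

def decide_labels(n_j, prior0, prior1, like0, like1, rng=None):
    """Apply Eq. (9): compare P(H_j = 1)P(E = n_j | H_j = 1) with P(H_j = 0)P(E = n_j | H_j = 0).

    Dividing both sides by P(E = n_j), cf. Eq. (8), does not change the comparison.
    """
    rng = np.random.default_rng() if rng is None else rng
    y = np.zeros(len(n_j), dtype=int)
    for j, k in enumerate(n_j):
        s1 = prior1[j] * like1[j, int(k)]
        s0 = prior0[j] * like0[j, int(k)]
        if s1 > s0:
            y[j] = 1
        elif s1 < s0:
            y[j] = 0
        else:
            y[j] = int(rng.integers(0, 2))   # random tie-break, the R[0, 1] case of Eq. (7)
    return y
```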
The wheat deficit step '^ 刀奶竹1彳日似度' uses the K nearest neighbor method to calculate the category distribution of the nearest neighbor of the unknown document data. After all the training files are grouped, by looking for κ (for example, 2 nearest neighbors), you can get the statistical probability of all training documents beforehand. Looking for K nearest neighbor Xiao, because the training documents have been grouped beforehand' ® This only needs to find the grouping data with the similarity higher than the threshold value, so that it can improve the performance of the search. Referring to step S14', it is judged based on the class distribution of the K nearest neighbors that the unknown file material belongs to the category. When an unknown document data needs to be self-existing, find the nearest one (four) 'The number of the κ neighbors belonging to each category is counted by statistics, ..., and the unknown file data is I54351.doc 14 201243627 No belongs to category ~ can be judged by the following formula: 1, if the household (/ / wide 1 | £ = ~ less / = j 〇, if. that U], the other ' (7) where P stands for probability ', ', 〇 or 1, Representing a training document data belongs to -〇1 _ «'丨琛 仟 科 科 belongs to the category Cy event; Ε is the nearest neighbor of the genus /41 咕T belongs to the cheek \ there are E things /it · r\ »jl . ι. Piece; R is a random function, random output 〇 or 丨. Since P(Hj=h\F.=y, h, , J ηεΤΓ]~—(8) where 卩/^^/乂 is ~ Gob's style can be rewritten as · ls if Mo Guang, called " corpse 1) > M = 〇) 擎"% is 0, 〇' if /> (/ / Guangqing =, = 0) >· =:": I chat], other (9) and household (¥) and engine can be calculated based on the training documents. The K nearest neighbor method is a technology well known in the art. No longer described. In the training phase All training files are expected to have a pre-statistical probability. - An automatic identification module for multi-label files has been generated. When an unknown document needs to be automatically identified, the unknown file is first retrieved and the fuzzy similarity is calculated. Equation (1) to (4)), and then find the nearest K (four) by the group closest to its fuzzy similarity. Through the category knife # in the K neighbors, the automatic identification module can be given the automatic identification judgment, and the unknown file data is obtained. The category that should belong. For example, the hypothesis has 4 total 3 categories, and the fuzzy similarity of the corresponding training file data di~d3 calculated according to the formulas (1) to (6) is as shown in Fig. 4. If the similarity threshold of αΧ疋 is α=〇. 5, the fuzzy similarity of the training document data d I (4), the heart (4), M) relative category y value are [1 〇 1;), 15435 丨.doc 15 201243627 Training document data d2 fuzzy similar sounds ~ values are [丨 乂 (4)), 弋 (4)), heart (4)) relative classes w3), # (I, (0), training documents information heart The fuzzy similarity is 1, the table;: the relative category of the document "Qing" is not. If the data belongs to the group of the relative category, the value of the group is. Then q J, ·, 〇禾所不, 柏斜相相Qd2, relative to _ 1 group ~ contains training file seven, relative category, 3 grouping ^ group ^ contains training file data d2 and hypothesis for the unknown text 2: practice file f_. 
As an example, suppose there are three categories c_1 to c_3, and that the fuzzy similarities of the training documents d_1 to d_3, calculated according to equations (1) to (6), are as shown in Fig. 4. Given a similarity threshold α = 0.5, the fuzzy similarities \mu_{c_1}(d_1), \mu_{c_2}(d_1), \mu_{c_3}(d_1) of training document d_1 are converted into indicator values with respect to categories c_1 to c_3 (for example [1 0 1] for d_1 in Fig. 5), and likewise for d_2 and d_3; an indicator value of 1 means that the document belongs to the cluster of that category, and 0 means that it does not. The resulting clusters G_{c_1}, G_{c_2}, G_{c_3}, each containing the training documents whose indicator for that category is 1, are shown in Fig. 6.

Suppose that, for the features contained in an unknown document d, the fuzzy similarities \mu_{c_1}(d), \mu_{c_2}(d), \mu_{c_3}(d) of the unknown document are calculated according to equations (1) to (6) and compared with the similarity threshold α. Since the indicator of the unknown document with respect to category c_2 is 0, only the clusters of the categories whose indicator is 1 need to be considered, and the cluster G_{c_2} need not be searched. Because the method of the present invention has clustered the training documents d_1 to d_3 in advance, only the clusters whose fuzzy similarity reaches the threshold α need to be searched. In this way, the categories to which the unknown document d should belong can be determined simply and quickly.

Through the computation of fuzzy similarity, the method of the present invention preferentially selects the more similar clusters when classifying with the K-nearest-neighbor method, and uses the data in these more similar clusters for the training and testing of the K-nearest-neighbor method, which improves the execution performance of the classification method.

Moreover, the multi-label document classification method of the present invention, based on fuzzy similarity and the K-nearest-neighbor method, reduces the training and testing time required for the classification work and therefore saves a considerable amount of time and cost.

The above embodiments merely illustrate the principles and effects of the present invention and do not limit the present invention; those skilled in the art may modify and vary the above embodiments without departing from the spirit of the present invention. The scope of the present invention shall be as listed in the claims below.

[Brief Description of the Drawings]

Fig. 1 shows a flow chart of the multi-label document classification method of the present invention based on fuzzy similarity and the K-nearest-neighbor method;

Fig. 2 shows a schematic diagram of the training documents and the features they contain in an embodiment of the present invention;

Fig. 3 shows a schematic diagram of the fuzzy similarities of the training documents and the categories to which they belong in an embodiment of the present invention;

Fig. 4 shows a schematic diagram of the fuzzy similarities of the three training documents of the present invention and the categories to which they belong;

Fig. 5 shows a schematic diagram of the fuzzy similarities of the three training documents of the present invention after being processed with a similarity threshold, together with the categories to which they belong; and

Fig. 6 shows a schematic diagram of the clustering result of the three training documents of the present invention.

[Description of Main Reference Numerals]

(No reference numerals are used.)
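Returning to the worked example above, the sketch below ties steps S12 and S13 together: it clusters the training documents by thresholding their fuzzy similarities and, for an unknown document, looks for the K nearest neighbors only inside the clusters whose fuzzy similarity reaches the threshold. The patent does not fix the distance measure used inside the candidate clusters; the Euclidean distance between fuzzy-similarity vectors is used here purely as an assumed example, and all names and defaults are illustrative.

```python
import numpy as np

def build_clusters(fuzzy_sim, alpha=0.5):
    """Step S12: one cluster per category, containing the training documents
    whose fuzzy similarity mu_{c_j}(d_v) to that category is at least alpha.

    fuzzy_sim: (n, p) matrix of fuzzy similarities of the n training documents.
    """
    return [np.where(fuzzy_sim[:, j] >= alpha)[0] for j in range(fuzzy_sim.shape[1])]

def neighbor_category_counts(unknown_sim, fuzzy_sim, Y, clusters, K=2, alpha=0.5):
    """Step S13: category counts n_j of the K nearest neighbors of an unknown document,
    searching only the clusters whose similarity indicator for the unknown document is 1."""
    candidates = set()
    for j, cluster in enumerate(clusters):
        if unknown_sim[j] >= alpha:                 # skip clusters below the threshold
            candidates.update(cluster.tolist())
    candidates = np.array(sorted(candidates), dtype=int)
    if candidates.size == 0:                        # assumed fallback: search everything
        candidates = np.arange(fuzzy_sim.shape[0])

    # Assumed proximity measure: Euclidean distance between fuzzy-similarity vectors.
    dist = np.linalg.norm(fuzzy_sim[candidates] - np.asarray(unknown_sim), axis=1)
    nearest = candidates[np.argsort(dist)[:K]]
    return Y[nearest].sum(axis=0).astype(int)       # n_j, used in Eqs. (7)-(9) (step S14)
```

The counts returned here would feed the decision rule sketched after equation (9) to complete step S14.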

Claims (1)

VII. Scope of Patent Application (Claims):

1. A multi-label document classification method based on fuzzy similarity and the K-nearest-neighbor method, comprising the following steps:
(a) calculating the fuzzy similarity of every training document to every category according to the membership degree of every feature with respect to every category and the membership degree of every feature with respect to every training document;
(b) clustering all the training documents according to the fuzzy similarities;
(c) calculating the category distribution of the K nearest neighbors of an unknown document with the K-nearest-neighbor method, according to the unknown document and the fuzzy similarities; and
(d) determining the categories to which the unknown document belongs according to the category distribution of the K nearest neighbors.

2. The multi-label document classification method based on fuzzy similarity and the K-nearest-neighbor method of claim 1, wherein in step (a) the training documents are d_1, ..., d_n, n being the number of training documents; the documents are distributed over p categories, each document belonging to one or more categories; each document is represented by m features t_1, ..., t_m, each feature corresponding to a word; and dt(t_i, c_j) and dd(t_i, c_j) respectively denote the distribution ratios of feature t_i in category c_j, expressed as

dt(t_i, c_j) = \frac{\sum_{v=1}^{n} y_{vj}\, x_{vi}}{\sum_{v=1}^{n} x_{vi}}, \qquad dd(t_i, c_j) = \frac{\sum_{v=1}^{n} y_{vj}\, \mathrm{sgn}(x_{vi})}{\sum_{v=1}^{n} y_{vj}}

where dt(t_i, c_j) represents the proportion of the total occurrence frequency of feature t_i contributed by the training documents belonging to category c_j; dd(t_i, c_j) represents the proportion of the training documents of category c_j that contain feature t_i, relative to the total number of training documents of category c_j; y_{vj} indicates whether the v-th training document belongs to category c_j, its value being 1 if the v-th training document belongs to c_j and 0 otherwise; and \mathrm{sgn}(x_{vi}) is 1 when x_{vi} > 0 and 0 otherwise.

3. The multi-label document classification method based on fuzzy similarity and the K-nearest-neighbor method of claim 2, wherein the membership degree \mu(t_i, c_j) of each feature with respect to each category is expressed as

\mu(t_i, c_j) = \frac{dt(t_i, c_j)}{\max_{1 \le u \le m,\, 1 \le v \le p} dt(t_u, c_v)} \times \frac{dd(t_i, c_j)}{\max_{1 \le u \le m,\, 1 \le v \le p} dd(t_u, c_v)}

where max denotes taking the maximum value.

4. The multi-label document classification method based on fuzzy similarity and the K-nearest-neighbor method of claim 3, wherein for a training document d = <x_1, x_2, ..., x_m> the similarity Sim(d, c_j) of the training document to category c_j is expressed as

Sim(d, c_j) = \bigoplus_{i=1}^{m} \left( f_d(t_i) \otimes \mu(t_i, c_j) \right)

where f_d(t_i) represents the membership degree of feature t_i with respect to the training document d, expressed as f_d(t_i) = x_i / \max_{1 \le v \le m} x_v, and \otimes and \oplus are fuzzy operators whose rules are defined as x \otimes y = x \cdot y and x \oplus y = x + y - x \cdot y.

5. The multi-label document classification method based on fuzzy similarity and the K-nearest-neighbor method of claim 4, wherein the fuzzy similarity \mu_{c_j}(d) of each training document to each category is defined as

\mu_{c_j}(d) = \frac{Sim(d, c_j)}{\max_{1 \le v \le p} Sim(d, c_v)}

6. The multi-label document classification method based on fuzzy similarity and the K-nearest-neighbor method of claim 1, wherein in step (b) each cluster represents a set of documents having a considerable degree of similarity to a certain category.

7. The multi-label document classification method based on fuzzy similarity and the K-nearest-neighbor method of claim 1, wherein step (c) further comprises a step of finding the required cluster data according to a similarity threshold.

8. The multi-label document classification method based on fuzzy similarity and the K-nearest-neighbor method of claim 1, wherein in step (c), according to the numbers {n_1, n_2, ..., n_p} of the K neighbors that belong to the respective categories, p being the number of categories, whether the unknown document belongs to category c_j is judged from the value of n_j by

y_j = \begin{cases} 1, & \text{if } P(H_j = 1 \mid E = n_j) > P(H_j = 0 \mid E = n_j) \\ 0, & \text{if } P(H_j = 1 \mid E = n_j) < P(H_j = 0 \mid E = n_j) \\ R[0, 1], & \text{otherwise} \end{cases}

where P denotes a probability; H_j = b, with b being 0 or 1, denotes the event that a training document belongs to category c_j; E = n_j denotes the event that n_j of the K nearest neighbors belong to category c_j; and R is a random function that outputs 0 or 1 at random.

9. The multi-label document classification method based on fuzzy similarity and the K-nearest-neighbor method of claim 8, wherein P(H_j = b \mid E = n_j) is expressed as

P(H_j = b \mid E = n_j) = \frac{P(H_j = b)\, P(E = n_j \mid H_j = b)}{P(E = n_j)}

where b is 0 or 1.

10. The multi-label document classification method based on fuzzy similarity and the K-nearest-neighbor method of claim 9, wherein the value of y_j is expressed as

y_j = \begin{cases} 1, & \text{if } P(H_j = 1)\, P(E = n_j \mid H_j = 1) > P(H_j = 0)\, P(E = n_j \mid H_j = 0) \\ 0, & \text{if } P(H_j = 1)\, P(E = n_j \mid H_j = 1) < P(H_j = 0)\, P(E = n_j \mid H_j = 0) \\ R[0, 1], & \text{otherwise} \end{cases}
TW100113975A 2011-04-22 2011-04-22 Multi-label text categorization based on fuzzy similarity and k nearest neighbors TWI452477B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW100113975A TWI452477B (en) 2011-04-22 2011-04-22 Multi-label text categorization based on fuzzy similarity and k nearest neighbors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW100113975A TWI452477B (en) 2011-04-22 2011-04-22 Multi-label text categorization based on fuzzy similarity and k nearest neighbors

Publications (2)

Publication Number Publication Date
TW201243627A true TW201243627A (en) 2012-11-01
TWI452477B TWI452477B (en) 2014-09-11

Family

ID=48093893

Family Applications (1)

Application Number Title Priority Date Filing Date
TW100113975A TWI452477B (en) 2011-04-22 2011-04-22 Multi-label text categorization based on fuzzy similarity and k nearest neighbors

Country Status (1)

Country Link
TW (1) TWI452477B (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190156200A1 (en) * 2017-11-17 2019-05-23 Aivitae LLC System and method for anomaly detection via a multi-prediction-model architecture
CN116802645A (en) * 2020-11-20 2023-09-22 自然智能系统公司 Neural Processing Unit (NPU) and computing system using the same

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7747626B2 (en) * 2007-01-30 2010-06-29 Microsoft Corporation Search results clustering in tabbed browsers
US7809705B2 (en) * 2007-02-13 2010-10-05 Yahoo! Inc. System and method for determining web page quality using collective inference based on local and global information
US8112402B2 (en) * 2007-02-26 2012-02-07 Microsoft Corporation Automatic disambiguation based on a reference resource
US20100185568A1 (en) * 2009-01-19 2010-07-22 Kibboko, Inc. Method and System for Document Classification

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI614723B (en) * 2016-12-29 2018-02-11 大仁科技大學 Analysis system based on humanity action image
TWI672597B (en) * 2018-11-27 2019-09-21 洽吧智能股份有限公司 Automatic text labeling method and system
CN112464973A (en) * 2020-08-13 2021-03-09 浙江师范大学 Multi-label classification method based on average distance weight and value calculation
CN112464973B (en) * 2020-08-13 2024-02-02 浙江师范大学 Multi-label classification method based on average distance weight and value calculation

Also Published As

Publication number Publication date
TWI452477B (en) 2014-09-11

Similar Documents

Publication Publication Date Title
US9589208B2 (en) Retrieval of similar images to a query image
Noh et al. Keyword selection and processing strategy for applying text mining to patent analysis
US9208441B2 (en) Information processing apparatus, information processing method, and program
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
US20120166439A1 (en) Method and system for classifying web sites using query-based web site models
Cai et al. Feature selection for multi-label classification using neighborhood preservation
JP6680956B1 (en) Search needs evaluation device, search needs evaluation system, and search needs evaluation method
CN109816015B (en) Recommendation method and system based on material data
US20220058464A1 (en) Information processing apparatus and non-transitory computer readable medium
CN103778206A (en) Method for providing network service resources
TW201243627A (en) Multi-label text categorization based on fuzzy similarity and k nearest neighbors
Indira et al. Visual and buying sequence features-based product image recommendation using optimization based deep residual network
CN112417082B (en) Scientific research achievement data disambiguation filing storage method
Tian et al. Image search reranking with hierarchical topic awareness
Bu et al. Unsupervised face-name association via commute distance
Xie et al. Analyzing semantic correlation for cross-modal retrieval
Wang et al. High-level semantic image annotation based on hot Internet topics
Sawarn et al. MASSTagger: metadata aware semantic strategy for automatic image tagging
JP6924450B2 (en) Search needs evaluation device, search needs evaluation system, and search needs evaluation method
Xu et al. Two-stage semantic matching for cross-media retrieval
Zhang Research on the Application of Artificial Intelligence Technology in the Banking Internet Finance Industry
Huang et al. Rough-set-based approach to manufacturing process document retrieval
Umair et al. Content-based venue recommender approach for publication
Zhang et al. Large scale incremental web video categorization
Ceri et al. Classification and Clustering

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees