TW569112B

TW569112B - Selection method of combined classification feature

Info

Publication number: TW569112B
Application number: TW91116798A
Authority: TW
Inventors: Shih-Chun Chou; Wentai Hsieh; I-Heng Meng
Original assignee: Inst Information Industry
Priority date: 2002-07-26
Filing date: 2002-07-26
Publication date: 2004-01-01

Abstract

A method for selecting combined category features groups single keywords into multi-keyword combinations that represent the features of a category. Such combinations determine a document's category more precisely than a single keyword can.

Description


The present invention relates to a method for selecting combined category features, and in particular to a method that combines keywords to represent the features of a category, thereby improving how accurately the keywords represent that category.

The Internet today holds documents of every kind, and to make searching convenient it is often necessary to classify them by type. Documents of different types contain different keywords, and these keywords are the basis for searching or classifying.

In the current art, however, category features are selected by statistically computing the probability or frequency with which a single keyword occurs, for example with TFIDF or entropy. A single keyword may stand for several different categories and therefore cannot filter documents precisely. For example, "firewall" may indicate the fire-protection category or the information-technology category; if a document contains the keyword "firewall", that keyword alone cannot tell whether the document belongs to one or the other. Combining keywords into a combined category feature represents a category unambiguously: "firewall" plus "computer hacker" indicates the information-technology category, while "firewall" plus "fire" indicates the fire-protection category.

In view of this, the main object of the present invention is to provide a method for selecting combined category features. In the invention, single keywords are combined into multiple keywords to represent a category feature, which is more accurate and more convincing than computing the occurrence frequency of single keywords alone. The invention can further be applied between documents and concept construction: features under a given category are extracted automatically from multiple documents, the extracted features being the keywords that documents of that category have in common, and these keywords can then serve as features in later concept construction and in building an Ontology Base.

0213-8269TW(N);STLC-02-B-9115;andyp.ptd

To achieve the above object, the present invention provides a method for selecting combined category features, comprising the following steps.

First, documents of the same category are segmented into words and tagged with parts of speech. Particles, adjectives and single-character words are filtered out and discarded, the remaining words become keywords, and the keywords of all documents are gathered into a keyword set.

Next, a document vector is built for each document, following the order of the keywords in the keyword set: an element is 1 if the document contains the corresponding keyword and 0 if it does not, so the number of elements in a document vector equals the number of keywords.

Next, all document vectors are inverted, with the keywords taken as vector names, to build keyword vectors. The number of elements in a keyword vector equals the number of documents.

Next, low-frequency keyword vectors are filtered out: a threshold is set, and every keyword vector whose number of elements equal to 1 is below the threshold is removed.

Finally, keyword vectors are combined to form a feature value set. (N-1)-keyword vectors are combined in pairs with a Boolean AND operation to form N-keyword vectors, where N is greater than or equal to two, subject to two conditions: the number of elements equal to 1 in the N-keyword vector is not below the threshold, and, for every choice of N-1 of the N keywords, the number of elements equal to 1 in the corresponding (N-1)-keyword vector is not below the threshold. This step is repeated until every combination that meets the conditions has been formed.

Embodiment

Fig. 1 shows the architecture of the present invention. Documents 110 denote multiple documents of the same type. To find the keywords of these documents, they must first undergo word segmentation and part-of-speech tagging. The sentences in the documents can, for example, be broken into individual words using the CKIP (Chinese Knowledge Information Processing) word segmentation and part-of-speech tagging system of Academia Sinica.
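The keyword-extraction step of the summary above (discard particles, adjectives and single-character words, then pool what remains into one keyword set) can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: it assumes segmentation and tagging have already been done by some external tool, and the tag names `PARTICLE` and `ADJECTIVE`, like the function names, are hypothetical placeholders rather than the tag set of CKIP or any other real tagger.

```python
# Hypothetical part-of-speech labels; a real tagger uses its own tag set.
DISCARDED_TAGS = {"PARTICLE", "ADJECTIVE"}


def extract_keywords(tagged_words):
    """Return the keywords of one document, in first-appearance order.

    tagged_words: iterable of (word, tag) pairs from a segmenter/tagger.
    Particles, adjectives and single-character words are dropped.
    """
    keywords = []
    for word, tag in tagged_words:
        if tag in DISCARDED_TAGS or len(word) <= 1:
            continue
        if word not in keywords:
            keywords.append(word)
    return keywords


def build_keyword_set(tagged_documents):
    """Pool the keywords of all documents into one ordered keyword set."""
    keyword_set = []
    for tagged_words in tagged_documents:
        for keyword in extract_keywords(tagged_words):
            if keyword not in keyword_set:
                keyword_set.append(keyword)
    return keyword_set
```

A list is used instead of a Python `set` so that the keywords keep a fixed order, which the later vector steps rely on.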

Word segmentation breaks each sentence into individual words, but not every word can serve as a feature of the category. Particles, adjectives, words that occur very frequently, and single-character words are unlikely to be category features, so such unused words are kept in the stop-word dictionary 130. After word segmentation 120, every document passes through the stop-word filtering 140 procedure, which removes the unused words so that they do not burden the database. When these steps are finished, the results are stored in the database 150, which holds every document of the category together with the keywords it contains, in the form "document code {keywords}". The combined category feature selection method 160 of the invention then processes this data to obtain the combined category features 170.

Fig. 2 shows the flowchart of the invention. First, in the build keyword set 210 step, the keywords obtained from word segmentation are formed into a keyword set that contains every keyword appearing in any document, for example keyword set F{f1, f2, f3, ..., fn}. Then, in the build document vector 220 step, an element is 1 if the document contains the keyword and 0 if it does not, for example document D1{1, 1, 0, 1, ..., 1}; a document vector is built for every document. Then, in the invert into keyword vector 230 step, all document vectors are inverted with the keywords taken as vector names: if keyword f1 appears in documents 1, 3 and 4, it is written f1{1, 0, 1, 1, ...}. All keyword vectors are built this way, and they are called 1-keyword vectors. The number of 1-keyword vectors may be very large at this point, and some keywords occur infrequently, so the filtering 240 step sets a threshold to remove such low-frequency keywords.
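Steps 220 and 230 amount to building a 0/1 document-by-keyword matrix and transposing it. The sketch below shows one way to do that with plain Python lists and dictionaries; the function names and the dictionary-based data layout are assumptions made for illustration and do not come from the patent.

```python
def build_document_vectors(documents, keyword_set):
    """Step 220: one 0/1 vector per document, one element per keyword.

    documents: dict mapping a document id to the set of keywords it contains.
    keyword_set: list of keywords in a fixed order.
    """
    return {
        doc_id: [1 if kw in words else 0 for kw in keyword_set]
        for doc_id, words in documents.items()
    }


def invert_to_keyword_vectors(document_vectors, keyword_set):
    """Step 230: one 0/1 vector per keyword, one element per document."""
    doc_ids = list(document_vectors)  # fixes the document order
    return {
        kw: [document_vectors[doc_id][i] for doc_id in doc_ids]
        for i, kw in enumerate(keyword_set)
    }


# Keyword f1 appears in documents 1, 3 and 4, as in the text above.
docs = {1: {"f1", "f2"}, 2: {"f2"}, 3: {"f1", "f3"}, 4: {"f1"}}
F = ["f1", "f2", "f3"]
keyword_vectors = invert_to_keyword_vectors(build_document_vectors(docs, F), F)
print(keyword_vectors["f1"])  # [1, 0, 1, 1]
```

The number of 1s in a keyword vector is simply the number of documents that contain the keyword, which is what the threshold in step 240 is compared against.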

When 1 appears in a keyword vector fewer times than the threshold, the vector is removed; those that remain are the 1-keyword vectors.

Next comes the combination stage. All 1-keyword vectors are combined in pairs into 2-keyword vectors. For example, the 1-keyword vector f1{1, 0, 1, 1} and the 1-keyword vector f2{1, 1, 0, 1} are combined into the 2-keyword f1+f2 by applying a Boolean AND to the two vectors: {1, 0, 1, 1} AND {1, 1, 0, 1} gives {1, 0, 0, 1}, which is the vector of the 2-keyword f1+f2. In the threshold filtering 260 step, 2-keyword vectors in which 1 appears fewer times than the threshold are removed in the same way, leaving the 2-keyword vectors.

The check whether further combination is possible 270 step decides as follows. For n-keyword vectors to be combined into an (n+1)-keyword vector, two conditions must hold: the number of elements equal to 1 in the (n+1)-keyword vector is not below the threshold, and, for every choice of n of the n+1 keywords, the number of elements equal to 1 in the corresponding n-keyword vector is not below the threshold. Repeating the combination finds the feature combinations of every length and forms the combined feature set 280.

Fig. 3 shows an embodiment of the invention.
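The filtering and combination steps (240 to 280) can be sketched as a level-wise loop: filter the 1-keyword vectors, AND surviving vectors pairwise, keep the results that still meet the threshold, and repeat with the survivors. The code below is an illustrative sketch under stated assumptions, not the patent's implementation. The names `and_vectors` and `combine_features` are made up for this example, and a vector is kept when its count of 1s is greater than or equal to the threshold, which is an interpretation of "not below the threshold".

```python
from itertools import combinations


def and_vectors(a, b):
    """Element-wise Boolean AND of two 0/1 vectors of equal length."""
    return [x & y for x, y in zip(a, b)]


def combine_features(keyword_vectors, threshold):
    """Return {frozenset of keywords: combined vector} for every valid
    combination of two or more keywords.

    keyword_vectors: dict mapping a keyword to its 0/1 vector.
    """
    # Step 240: keep only 1-keyword vectors that meet the threshold.
    level = {
        frozenset([kw]): vec
        for kw, vec in keyword_vectors.items()
        if sum(vec) >= threshold
    }
    features = {}
    while level:
        next_level = {}
        for (ka, va), (kb, vb) in combinations(level.items(), 2):
            combo = ka | kb
            # Only pairs that differ in exactly one keyword grow by one.
            if len(combo) != len(ka) + 1 or combo in next_level:
                continue
            # Step 270: every n-keyword subset must itself have survived.
            if any(combo - {kw} not in level for kw in combo):
                continue
            vec = and_vectors(va, vb)
            # Step 260: threshold filtering of the new combination.
            if sum(vec) >= threshold:
                next_level[combo] = vec
        features.update(next_level)
        level = next_level
    return features


# The f1/f2 example from the text: {1,0,1,1} AND {1,1,0,1}.
print(and_vectors([1, 0, 1, 1], [1, 1, 0, 1]))  # [1, 0, 0, 1]
```

Keys are `frozenset` objects so that f1+f2 and f2+f1 are treated as the same combination. The loop stops as soon as a level produces no new valid combination.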

Four documents belonging to the financial-management category are denoted D1, D2, D3 and D4. After word segmentation, the keywords are obtained: document 1 contains stock market, investment, economy and fund; document 2 contains travel, investment, economy and stock market; document 3 contains economy, travel, Taipei City, stock market and investment; document 4 contains investment and stock market. The keywords are then formed into a keyword set F{f1: stock market, f2: investment, f3: economy, f4: fund, f5: travel, f6: Taipei City}.

The document vectors are therefore D1{1, 1, 1, 1, 0, 0}, D2{1, 1, 1, 0, 1, 0}, D3{1, 1, 1, 0, 1, 1} and D4{1, 1, 0, 0, 0, 0}. Inverting all document vectors into keyword vectors gives f1{1, 1, 1, 1}, f2{1, 1, 1, 1}, f3{1, 1, 1, 0}, f4{1, 0, 0, 0}, f5{0, 1, 1, 0} and f6{0, 0, 1, 0}, which are called 1-keyword vectors.

With the threshold set to 3, f4, f5 and f6 are removed, and f1, f2 and f3 can be combined into 2-keyword vectors. Applying AND to each pair of 1-keyword vectors gives f1+f2{1, 1, 1, 1}, f1+f3{1, 1, 1, 0} and f2+f3{1, 1, 1, 0}. The number of 1s in each of these 2-keyword vectors is at least the threshold of 3, so f1+f2, f1+f3 and f2+f3 are all valid 2-keywords and can be combined into the 3-keyword f1+f2+f3. Applying AND to f1+f2 and f1+f3 gives f1+f2+f3{1, 1, 1, 0}, in which the number of 1s again reaches the threshold of 3, so f1+f2+f3 is a valid 3-keyword. It follows that when the keywords f1+f2+f3, that is f1: stock market, f2: investment and f3: economy, appear together in an article, the article is judged to belong to the financial-management category.
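The arithmetic of this example can be checked with a few lines of Python. The sketch below is a brute-force check rather than the level-wise procedure of the flowchart: it enumerates every combination of the surviving keywords and counts the documents that contain all of them. Because adding a keyword can never increase that count, the brute-force result coincides with the level-wise one. The English keyword strings, the variable names, and the "greater than or equal to" reading of the threshold are assumptions made for illustration.

```python
from itertools import combinations

documents = {
    "D1": {"stock market", "investment", "economy", "fund"},
    "D2": {"travel", "investment", "economy", "stock market"},
    "D3": {"economy", "travel", "Taipei City", "stock market", "investment"},
    "D4": {"investment", "stock market"},
}
keyword_set = ["stock market", "investment", "economy",
               "fund", "travel", "Taipei City"]  # f1 .. f6
threshold = 3

# 1-keyword vectors: one element per document, in the order D1 .. D4.
keyword_vectors = {
    kw: [1 if kw in words else 0 for words in documents.values()]
    for kw in keyword_set
}

# Keywords whose vectors survive the threshold filter.
kept = [kw for kw in keyword_set if sum(keyword_vectors[kw]) >= threshold]


def support(combo):
    """Count of 1s after AND-ing the vectors of every keyword in combo."""
    return sum(all(bits) for bits in zip(*(keyword_vectors[kw] for kw in combo)))


valid = [
    combo
    for size in range(2, len(kept) + 1)
    for combo in combinations(kept, size)
    if support(combo) >= threshold
]

print(kept)   # ['stock market', 'investment', 'economy']
for combo in valid:
    print(" + ".join(combo), support(combo))
```

The run lists the three 2-keyword combinations and the single 3-keyword combination named in the example, with supports 4, 3, 3 and 3.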
Therefore, with the combined category feature selection method proposed by the present invention, single keywords can be combined into multiple keywords to represent a category feature, which is more accurate and more convincing than computing the occurrence frequency of single keywords alone.

Although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, so the scope of protection of the invention is that defined by the appended claims.

Brief description of the drawings

To make the above objects, features and advantages of the invention easier to understand, an embodiment is described in detail with reference to the accompanying drawings:

Fig. 1 shows the architecture of the invention.
Fig. 2 shows the flowchart of the invention.
Fig. 3 shows an embodiment of the invention.

Explanation of symbols

110: documents of the same category
120: word segmentation
130: stop-word dictionary
140: stop-word filtering
150: database
160: combined category feature selection method
170: combined category features
210: build keyword set
220: build document vector
230: invert into keyword vector
240: filtering
250: combine keyword vectors
260: threshold filtering
270: check whether further combination is possible
280: combined feature set
D1, D2, D3, D4: documents
F: keyword set


Claims

1. A method for selecting combined category features, comprising the following steps:
building a keyword set from the keywords of all documents;
building document vectors in the order of the keywords in the keyword set;
inverting all document vectors, with the keywords taken as vector names, to build keyword vectors;
filtering out low-frequency keyword vectors; and
combining keyword vectors to form a feature value set.

2. The method for selecting combined category features as described in claim 1, wherein building the keyword set comprises the following steps:
performing word segmentation and part-of-speech tagging on a plurality of documents of the same category;
filtering out particles, adjectives and single-character words and discarding them, the remaining words forming keywords; and
arranging the keywords that have appeared in the plurality of documents of the same category in a certain order to form a keyword set.

3. The method for selecting combined category features as described in claim 1, wherein building a document vector means that an element is 1 if the document contains the keyword and 0 if it does not, and the number of elements of the document vector is the number of keywords.

4. The method for selecting combined category features as described in claim 3, wherein building a keyword vector means that an element is 1 if the keyword exists in the document and 0 if it does not, the number of elements of the keyword vector is the number of documents, and in the keyword vector the documents are arranged in a certain order.

5. The method for selecting combined category features as described in claim 4, wherein filtering out low-frequency keyword vectors means setting a threshold and filtering out the keyword vectors whose number of elements equal to 1 is below the threshold, to form a plurality of keyword vectors.

6. The method for selecting combined category features as described in claim 5, wherein combining keyword vectors comprises the following steps:
combining (N-1)-keyword vectors in pairs with a Boolean AND operation to form N-keyword vectors, where N is greater than or equal to two, subject to the conditions that the number of elements equal to 1 in the N-keyword vector is higher than the threshold and that, among the N-1 keywords of the N keywords, the number of elements equal to 1 in any (N-1)-keyword vector is higher than the threshold;
repeating the above step until all combinations meeting the above conditions are completed; and
forming the N-keyword vectors into a feature value set.

7. A readable medium sufficient to carry out the following steps:
building a keyword set from the keywords of all documents;
building document vectors in the order of the keywords in the keyword set;
inverting all document vectors, with the keywords taken as vector names, to build keyword vectors;
filtering out low-frequency keyword vectors; and
combining keyword vectors to form a feature value set.

8. The readable medium as described in claim 7, wherein building the keyword set comprises the following steps:
performing word segmentation and part-of-speech tagging on a plurality of documents of the same category;
filtering out particles, adjectives and single-character words and discarding them, the remaining words forming keywords; and
arranging the keywords that have appeared in the plurality of documents of the same category in a certain order to form a keyword set.

9. The readable medium as described in claim 7, wherein building a document vector means that an element is 1 if the document contains the keyword and 0 if it does not, and the number of elements of the document vector is the number of keywords.

10. The readable medium as described in claim 9, wherein building a keyword vector means that an element is 1 if the keyword exists in the document and 0 if it does not, the number of elements of the keyword vector is the number of documents, and in the keyword vector the documents are arranged in a certain order.

11. The readable medium as described in claim 10, wherein filtering out low-frequency keyword vectors means setting a threshold and filtering out the keyword vectors whose number of elements equal to 1 is below the threshold, to form a plurality of keyword vectors.

12. The readable medium as described in claim 11, wherein combining keyword vectors comprises the following steps:
combining (N-1)-keyword vectors in pairs with a Boolean AND operation to form N-keyword vectors, where N is greater than or equal to two, subject to the conditions that the number of elements equal to 1 in the N-keyword vector is higher than the threshold and that, among the N-1 keywords of the N keywords, the number of elements equal to 1 in any (N-1)-keyword vector is higher than the threshold;
repeating the above step until all combinations meeting the above conditions are completed; and
forming the N-keyword vectors into a feature value set.
