TW569112B - Selection method of combined classification feature - Google Patents

Info

Publication number
TW569112B
Authority
TW
Taiwan
Prior art keywords
key
keyword
vector
keywords
vectors
Prior art date
Application number
TW91116798A
Other languages
Chinese (zh)
Inventor
Shih-Chun Chou
Wentai Hsieh
I-Heng Meng
Original Assignee
Inst Information Industry
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inst Information Industry
Priority to TW91116798A
Application granted
Publication of TW569112B

Abstract

A method for selecting combined classification features groups single keywords into multi-keyword combinations that represent a particular category feature. Such combinations determine the category of a document more precisely than any single keyword can.

Description

The present invention relates to a method for selecting combined category features, and in particular to a method that combines keywords to represent the features of a given category, improving the accuracy with which keywords represent that category.

The Internet today contains a wide variety of documents. To make searching convenient, documents of different types often need to be classified. Documents of different types contain different keywords, and these keywords are the basis for searching or classification.

In current techniques, however, category features are selected by statistically computing the probability or frequency with which a single keyword occurs, as in TFIDF and Entropy. A single keyword may represent several different categories and therefore cannot filter documents precisely. For example, "firewall" may represent the fire-protection category or the information category; if a document contains the keyword "firewall", that keyword alone cannot determine whether the document belongs to the fire-protection category or the information category. A combined category feature can represent a category unambiguously: "firewall" plus "computer hacker" can represent the information category, while "firewall" plus "fire" can represent the fire-protection category.

In view of this, the main object of the present invention is to provide a method for selecting combined category features. In the invention, single keywords are combined into multiple keywords to represent a category feature, which is more accurate and more persuasive than computing only the frequency of a single keyword. The invention can also be applied between documents and concept construction: the features of a category are extracted automatically from multiple documents, and the extracted features indicate that keywords of this kind appear in that category. Treating these keywords as features can then be applied to later concept construction and to the construction of an Ontology Base.
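To make the firewall example from the background concrete, the following minimal Python sketch shows how a keyword combination can resolve a category that a single keyword leaves ambiguous. The table `COMBINED_FEATURES` and the function `categories_for` are hypothetical names introduced only for illustration; the patent does not prescribe this lookup.

```python
# Hypothetical combined features taken from the firewall example in the text.
COMBINED_FEATURES = {
    frozenset({"firewall", "computer hacker"}): "information",
    frozenset({"firewall", "fire"}): "fire protection",
}


def categories_for(document_keywords):
    """Return the categories whose whole keyword combination occurs in the document."""
    present = set(document_keywords)
    return sorted(
        category
        for combination, category in COMBINED_FEATURES.items()
        if combination <= present  # every keyword of the combination is present
    )


if __name__ == "__main__":
    print(categories_for(["firewall", "computer hacker", "password"]))
    print(categories_for(["firewall"]))  # "firewall" alone matches no combination
```

A document that contains only "firewall" matches neither combination, which is exactly the ambiguity the combined features are meant to remove.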

To achieve the above object, the present invention provides a method for selecting combined category features, comprising the following steps. First, word segmentation and part-of-speech tagging are applied to a plurality of documents of the same category; modal particles, adjectives, and single-character words are filtered out and discarded, and the remaining words become keywords. The keywords of all documents are collected into a keyword set. A document vector is then built for each document, following the order of the keywords in the keyword set: an element is 1 if the document contains the corresponding keyword and 0 if it does not, so the number of elements in a document vector equals the number of keywords. Next, using each keyword as a vector name, all document vectors are inverted to build keyword vectors; the number of elements in a keyword vector equals the number of documents. Low-frequency keyword vectors are then filtered out: a threshold is set, and any keyword vector whose number of 1 elements is below the threshold is removed. Finally, keyword vectors are combined to form the feature value set: (N-1)-keyword vectors are combined pairwise with a Boolean AND to form N-keyword vectors, where N is at least two. Two conditions must hold: the number of 1 elements in the N-keyword vector must not be below the threshold, and the number of 1 elements in every (N-1)-keyword vector formed from the keywords of that N-keyword vector must not be below the threshold. This step is repeated until every combination that satisfies the conditions has been formed.

Embodiment

Figure 1 shows the architecture of the invention. Documents 110 are a plurality of documents of the same type. To find the keywords of these documents, they must first be segmented into words and tagged with parts of speech. The sentences in the documents can be split into individual words using, for example, Academia Sinica's CKIP (Chinese Knowledge Information Processing) word segmentation and part-of-speech tagging system. Not every word can serve as a feature of the category: modal particles, adjectives, words with a very high frequency of occurrence, and single-character words are unlikely to be category features, so such unused words are stored in the stop-word lexicon 130. After word segmentation 120, every document passes through the stop-word filtering 140 procedure, which removes the unused words so that they do not burden data processing. When these steps are complete, a database 150 holds all documents of the category together with the keywords each contains, stored in the form "document ID {keywords}". The method for selecting combined category features 160 of the invention then processes this data to obtain the combined category features 170.

Figure 2 shows the flowchart of the invention. First, in building the keyword set 210, the keywords of the previously segmented documents are formed into a keyword set that contains every keyword appearing in any of the documents. Next, in building the document vectors 220, an element is 1 if the document contains a given keyword and 0 if it does not, giving a document vector of the form {1, 1, 0, 1, ..., 1}; a document vector is built for every document. Then, in the step of inverting into keyword vectors 230, each keyword is used as a vector name and all document vectors are inverted. For instance, if a keyword f1 appears in documents 1, 3, and 4, it is represented as f1{1, 0, 1, 1}. All keyword vectors are built in this way, and they are called 1-keyword vectors. The number of 1-keyword vectors can be very large, and some keywords occur infrequently, so the filtering step 240 sets a threshold to remove such low-frequency keywords: when the number of 1s in a keyword vector is less than the threshold, the keyword is removed, and what remains are the 1-keyword vectors.

The combination stage follows. In combining keyword vectors 250, all 1-keyword vectors are combined pairwise into 2-keyword vectors. For example, the 1-keyword vector f1{1, 0, 1, 1} and the 1-keyword vector f2{1, 1, 0, 1} are combined into the 2-keyword f1+f2 by applying a Boolean AND to the two vectors: {1, 0, 1, 1} AND {1, 1, 0, 1} gives {1, 0, 0, 1}, which is the vector of the 2-keyword f1+f2. In threshold filtering 260, 2-keyword vectors whose number of 1s is less than the threshold are likewise removed, leaving the valid 2-keyword vectors. Step 270 checks whether further combination is possible. The test is as follows: when n-keyword vectors are to be combined into an (n+1)-keyword vector, the number of 1 elements in the (n+1)-keyword vector must not be below the threshold, and the number of 1 elements in every n-keyword vector formed from the keywords of that (n+1)-keyword vector must not be below the threshold. Repeating this process finds the feature combinations of every length and forms the combined feature set 280.

Figure 3 shows an embodiment of the invention.
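Steps 210 to 240 and the Boolean AND used in the combination stage can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, documents are assumed to be already segmented and stop-word filtered, and a count equal to the threshold is assumed to pass, since the text removes only vectors whose count is less than the threshold.

```python
def build_keyword_set(documents):
    """Step 210: collect every keyword once, in order of first appearance."""
    keywords = []
    for words in documents:
        for word in words:
            if word not in keywords:
                keywords.append(word)
    return keywords


def build_document_vectors(documents, keywords):
    """Step 220: one 0/1 element per keyword, for every document."""
    return [tuple(1 if k in words else 0 for k in keywords) for words in documents]


def invert_to_keyword_vectors(document_vectors, keywords):
    """Step 230: transpose so that each keyword names a vector over documents."""
    columns = (tuple(column) for column in zip(*document_vectors))
    return dict(zip(keywords, columns))


def filter_by_threshold(keyword_vectors, threshold):
    """Step 240: drop vectors whose number of 1s is less than the threshold."""
    return {k: v for k, v in keyword_vectors.items() if sum(v) >= threshold}


def and_vectors(a, b):
    """Element-wise Boolean AND of two 0/1 vectors of equal length."""
    return tuple(x & y for x, y in zip(a, b))


if __name__ == "__main__":
    # Made-up sample documents, used only to exercise the functions.
    docs = [["firewall", "hacker", "network"],
            ["firewall", "hacker"],
            ["firewall", "network", "virus"]]
    keywords = build_keyword_set(docs)
    vectors = invert_to_keyword_vectors(build_document_vectors(docs, keywords), keywords)
    print(filter_by_threshold(vectors, threshold=2))
    # The f1/f2 example from the text: {1,0,1,1} AND {1,1,0,1} gives {1,0,0,1}.
    print(and_vectors((1, 0, 1, 1), (1, 1, 0, 1)))  # (1, 0, 0, 1)
```

Tuples are used for the vectors so that they stay immutable and can be compared directly; a bitset would serve the same purpose for large document collections.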

Documents 1, 2, 3, and 4 are denoted D1, D2, D3, and D4, and all four belong to the financial-management category. After word segmentation, each document yields its keywords: document 1 contains stock market, investment, economy, and fund; document 2 contains travel, investment, economy, and stock market; document 3 contains economy, travel, Taipei City, stock market, and investment; document 4 contains investment and stock market. All keywords are then formed into a keyword set F{f1: stock market, f2: investment, f3: economy, f4: fund, f5: travel, f6: Taipei City}. The document vectors are therefore:

D1{1, 1, 1, 1, 0, 0}
D2{1, 1, 1, 0, 1, 0}
D3{1, 1, 1, 0, 1, 1}
D4{1, 1, 0, 0, 0, 0}

All document vectors are then inverted into keyword vectors, which gives:

f1{1, 1, 1, 1}
f2{1, 1, 1, 1}
f3{1, 1, 1, 0}
f4{1, 0, 0, 0}
f5{0, 1, 1, 0}
f6{0, 0, 1, 0}

These are called 1-keyword vectors. If the threshold is set to 3, f4, f5, and f6 are removed. f1, f2, and f3 can be combined into 2-keyword vectors by applying AND to each pair of 1-keyword vectors, which gives f1+f2{1, 1, 1, 1}, f1+f3{1, 1, 1, 0}, and f2+f3{1, 1, 1, 0}. The number of 1s in each of these 2-keyword vectors is not below the threshold of 3 (f1+f3 and f2+f3 each contain exactly three), so f1+f2, f1+f3, and f2+f3 are all valid 2-keywords and can form the 3-keyword f1+f2+f3. Applying AND to f1+f2 and f1+f3 gives f1+f2+f3{1, 1, 1, 0}. The number of 1s in this vector is 3, which meets the threshold, so f1+f2+f3 is a valid 3-keyword. It follows that when the keywords f1+f2+f3 appear together in an article, that is, when f1: stock market, f2: investment, and f3: economy all occur in it, the article can be judged to belong to the financial-management category.

With the method for selecting combined category features proposed by the invention, single keywords can therefore be combined into multiple keywords to represent a category feature, which is more accurate and more persuasive than computing only the frequency of a single keyword.

Although the invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention.

The scope of protection of the invention is therefore as defined by the appended claims.

Brief description of the drawings

To make the above objects, features, and advantages of the invention easier to understand, an embodiment is described in detail below with reference to the accompanying drawings:

Figure 1 shows the architecture of the invention.
Figure 2 shows the flowchart of the invention.
Figure 3 shows an embodiment of the invention.

Explanation of symbols

110: documents of the same category
120: word segmentation
130: stop-word lexicon
140: stop-word filtering
150: database
160: method for selecting combined category features
170: combined category features
210: build keyword set
220: build document vectors
230: invert into keyword vectors
240: filtering
250: combine keyword vectors
260: threshold filtering
270: check whether further combination is possible
280: combined feature set
D1, D2, D3, D4: documents
F: keyword set

Claims (1)

1. A method for selecting combined category features, comprising the following steps:
building the keywords of all documents into a keyword set;
building document vectors according to the order of the keywords in the keyword set;
using keywords as vector names, inverting all document vectors to build keyword vectors;
filtering out keyword vectors with a low frequency of occurrence; and
combining keyword vectors to form a feature value set.

2. The method for selecting combined category features of claim 1, wherein building the keyword set comprises the following steps:
performing word segmentation and part-of-speech tagging on a plurality of documents of the same category;
filtering out modal particles, adjectives, and single-character words and discarding them, the remaining words forming keywords; and
arranging the keywords that have appeared in the plurality of documents of the same category in a fixed order to form a keyword set.

3. The method for selecting combined category features of claim 1, wherein, in building the document vectors, an element is 1 if the document contains the keyword and 0 if it does not, and the number of elements of the document vector is the number of keywords.

4. The method for selecting combined category features of claim 3, wherein, in building the keyword vectors, an element is 1 if the keyword exists in the document and 0 if it does not; the number of elements of the keyword vector is the number of documents; and the documents are arranged in a fixed order within the keyword vector.

5. The method for selecting combined category features of claim 4, wherein filtering out keyword vectors with a low frequency of occurrence means setting a threshold and filtering out the keyword vectors whose number of 1 elements is lower than the threshold, forming a plurality of keyword vectors.

6. The method for selecting combined category features of claim 5, wherein combining keyword vectors comprises the following steps:
combining (N-1)-keyword vectors pairwise with a Boolean AND operation to form a plurality of N-keyword vectors, where N is greater than or equal to two, subject to the conditions that the number of 1 elements of the N-keyword vector is higher than the threshold and that, among the N-1 keywords of the N-keyword vector, the number of 1 elements of any (N-1)-keyword vector is higher than the threshold;
repeating the above step until all combinations satisfying the above conditions are complete; and
forming the N-keyword vectors into a feature value set.

7. A readable medium containing program code for performing the following steps:
building the keywords of all documents into a keyword set;
building document vectors according to the order of the keywords in the keyword set;
using keywords as vector names, inverting all document vectors to build keyword vectors;
filtering out keyword vectors with a low frequency of occurrence; and
combining keyword vectors to form a feature value set.

8. The readable medium of claim 7, wherein building the keyword set comprises the following steps:
performing word segmentation and part-of-speech tagging on a plurality of documents of the same category;
filtering out modal particles, adjectives, and single-character words and discarding them, the remaining words forming keywords; and
arranging the keywords that have appeared in the plurality of documents of the same category in a fixed order to form a keyword set.

9. The readable medium of claim 7, wherein, in building the document vectors, an element is 1 if the document contains the keyword and 0 if it does not, and the number of elements of the document vector is the number of keywords.

10. The readable medium of claim 9, wherein, in building the keyword vectors, an element is 1 if the keyword exists in the document and 0 if it does not; the number of elements of the keyword vector is the number of documents; and the documents are arranged in a fixed order within the keyword vector.

11. The readable medium of claim 10, wherein filtering out keyword vectors with a low frequency of occurrence means setting a threshold and filtering out the keyword vectors whose number of 1 elements is lower than the threshold, forming a plurality of keyword vectors.

12. The readable medium of claim 11, wherein combining keyword vectors comprises the following steps:
combining (N-1)-keyword vectors pairwise with a Boolean AND operation to form a plurality of N-keyword vectors, where N is greater than or equal to two, subject to the conditions that the number of 1 elements of the N-keyword vector is higher than the threshold and that, among the N-1 keywords of the N-keyword vector, the number of 1 elements of any (N-1)-keyword vector is higher than the threshold;
repeating the above step until all combinations satisfying the above conditions are complete; and
forming the N-keyword vectors into a feature value set.
TW91116798A 2002-07-26 2002-07-26 Selection method of combined classification feature TW569112B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW91116798A TW569112B (en) 2002-07-26 2002-07-26 Selection method of combined classification feature

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW91116798A TW569112B (en) 2002-07-26 2002-07-26 Selection method of combined classification feature

Publications (1)

Publication Number Publication Date
TW569112B true TW569112B (en) 2004-01-01

Family

ID=32590423

Family Applications (1)

Application Number Title Priority Date Filing Date
TW91116798A TW569112B (en) 2002-07-26 2002-07-26 Selection method of combined classification feature

Country Status (1)

Country Link
TW (1) TW569112B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI615725B (en) * 2016-11-30 2018-02-21 優像數位媒體科技股份有限公司 Phrase vector generation device and operation method thereof

Similar Documents

Publication Publication Date Title
Kumar et al. Sentiment analysis of multimodal twitter data
Asghar et al. T‐SAF: Twitter sentiment analysis framework using a hybrid classification scheme
CN109933664B (en) Fine-grained emotion analysis improvement method based on emotion word embedding
WO2019174132A1 (en) Data processing method, server and computer storage medium
Dave et al. Mining the peanut gallery: Opinion extraction and semantic classification of product reviews
WO2016107326A1 (en) Search recommending method and device based on search terms
Thomas et al. Automatic keyword extraction for text summarization in e-newspapers
CN108776901B (en) Advertisement recommendation method and system based on search terms
WO2022116435A1 (en) Title generation method and apparatus, electronic device and storage medium
CN111291177A (en) Information processing method and device and computer storage medium
Tran et al. Spam detection in online classified advertisements
Roy et al. An ensemble approach for aggression identification in English and Hindi text
Patil et al. Web spam detection using SVM classifier
WO2022105497A1 (en) Text screening method and apparatus, device, and storage medium
CN107665442B (en) Method and device for acquiring target user
Malhotra et al. An effective approach for news article summarization
Gupta et al. Text analysis and information retrieval of text data
Pekar et al. Selecting classification features for detection of mass emergency events on social media
TW569112B (en) Selection method of combined classification feature
AleEbrahim et al. Summarising customer online reviews using a new text mining approach
Goodman et al. FastWordBug: A fast method to generate adversarial text against NLP applications
Hosseini et al. Implicit entity linking through ad-hoc retrieval
CN107729509A (en) The chapter similarity decision method represented based on recessive higher-dimension distributed nature
Cao et al. A joint model for chinese microblog sentiment analysis
Khan et al. Semantic-based unsupervised hybrid technique for opinion targets extraction from unstructured reviews

Legal Events

Date Code Title Description
GD4A Issue of patent certificate for granted invention patent
MM4A Annulment or lapse of patent due to non-payment of fees