TW569112B - Selection method of combined classification feature - Google Patents
- Publication number
- TW569112B (application TW91116798A)
- Authority
- TW
- Taiwan
- Prior art keywords
- key
- keyword
- vector
- keywords
- vectors
- Prior art date
Abstract
Description
The present invention relates to a method for selecting combined category features, and in particular to a method that combines keywords so that the combination represents the features of a category, improving the accuracy with which keywords represent that category.
The Internet today holds documents of every kind, and for convenient searching they often need to be classified by type. Documents of different types contain different keywords, and these keywords are the basis for searching and classification.
In current practice, however, category features are selected statistically, by computing the probability or frequency of occurrence of a single keyword, for example with TFIDF or entropy. A single keyword may belong to several categories and therefore cannot filter documents precisely. For example, "firewall" may indicate the fire-protection category or the information-technology category, so a document that contains the keyword "firewall" cannot be assigned to either category on that keyword alone. A combined category feature, by contrast, can identify a category unambiguously: "firewall" together with "computer hacker" indicates the information-technology category, while "firewall" together with "fire" indicates the fire-protection category.
In view of this, the main object of the invention is to provide a method for selecting combined category features. In the invention, single keywords are combined into multi-keyword sets that represent the features of a category, which is more accurate and more convincing than counting the frequency of a single keyword alone. The invention can also be applied between documents and concept construction: features of a given category are extracted automatically from multiple documents, the extracted features are the keywords that documents of that category have in common, and using these keywords as features supports later concept construction and the construction of an ontology base.
0213-8269TW (N); STLC-02-B-9ll5; andyp.ptd
To achieve the above object, the invention provides a method for selecting combined category features, comprising the following steps:
- Perform word segmentation and part-of-speech tagging on documents of the same category, then filter out and discard particles, adjectives, words that occur very frequently, and single-character words. The remaining words are the keywords.
- Gather the keywords of all documents into one keyword set.
- Build a document vector for each document, following the order of the keywords in the keyword set: an entry is 1 if the document contains the keyword and 0 if it does not. The number of entries in a document vector equals the number of keywords.
- Invert all document vectors, using each keyword as a vector name, to build the keyword vectors (1-keyword vectors). The number of entries in a keyword vector equals the number of documents.
- Filter out low-frequency keyword vectors: set a threshold and discard every keyword vector whose number of 1 entries is below the threshold.
- Combine keyword vectors to form the feature set: apply a Boolean AND to pairs of (N-1)-keyword vectors to form N-keyword vectors, where N is at least 2. An N-keyword vector is kept only if its number of 1 entries reaches the threshold and if, among its N keywords, every (N-1)-keyword vector also reaches the threshold.
- Repeat the previous step until every combination that satisfies the conditions has been formed.
Embodiment
FIG. 1 shows the architecture of the invention. Documents 110 denote multiple documents of the same type. To find the keywords of these documents, the documents must first undergo word segmentation and part-of-speech tagging.
The sentences in the documents can be split into individual words using, for example, the CKIP (Chinese Knowledge Information Processing) word segmentation and part-of-speech tagging system of Academia Sinica. Not every word can serve as a feature of the category: particles, adjectives, words that occur very frequently, and single-character words are unlikely to be category features, so such unused words are stored in the stop-word lexicon 130. After word segmentation 120, every document passes through stop-word filtering 140, which removes the unused words and avoids burdening the program. The result is stored in the database 150, which holds every document of the category together with the keywords it contains, in the form "document ID {keywords}". The combined category feature selection method 160 of the invention then processes this data to obtain the combined category features 170.
FIG. 2 shows the flowchart of the invention. First, in build keyword set 210, the keywords obtained from the preceding word segmentation are gathered into one keyword set, which contains every keyword that appears in any document. Next, in build document vectors 220, an entry is 1 if the document contains the corresponding keyword and 0 if it does not, for example {1, 1, 0, 1, ..., 1}; a document vector is built for every document. Then, in invert into keyword vectors 230, all document vectors are inverted with the keyword as the vector name: if a keyword f1 appears in documents 1, 3 and 4 of a set of four documents, it is written f1{1, 0, 1, 1}. All keyword vectors are built this way and are called 1-keyword vectors. The number of 1-keyword vectors can be very large, and some keywords occur only rarely.
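Steps 210 to 230 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names and the input dictionary (document ID mapped to a keyword list) are assumptions, and the sample data is chosen so that f1 appears in documents 1, 3 and 4 as in the text.

```python
def build_keyword_set(docs):
    """Step 210: collect every keyword, in order of first appearance."""
    keywords = []
    for words in docs.values():
        for word in words:
            if word not in keywords:
                keywords.append(word)
    return keywords


def build_document_vectors(docs, keywords):
    """Step 220: one 0/1 entry per keyword for each document."""
    return {
        doc_id: [1 if kw in words else 0 for kw in keywords]
        for doc_id, words in docs.items()
    }


def invert_to_keyword_vectors(doc_vectors, keywords):
    """Step 230: one 0/1 entry per document for each keyword."""
    doc_ids = list(doc_vectors)
    return {
        kw: [doc_vectors[doc_id][i] for doc_id in doc_ids]
        for i, kw in enumerate(keywords)
    }


# Hypothetical input in the "document ID {keywords}" form.
docs = {1: ["f1", "f2"], 2: ["f2"], 3: ["f1"], 4: ["f1", "f2"]}
keywords = build_keyword_set(docs)
doc_vectors = build_document_vectors(docs, keywords)
keyword_vectors = invert_to_keyword_vectors(doc_vectors, keywords)
print(keyword_vectors["f1"])  # [1, 0, 1, 1]: f1 occurs in documents 1, 3 and 4
```

The inversion is a transpose of the document-by-keyword 0/1 matrix, so each keyword vector has one entry per document.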
In filtering 240, a threshold is set to remove keywords that occur very rarely: a keyword vector is discarded when the number of 1s it contains is less than the threshold, and the vectors that remain are the 1-keyword vectors. The combination stage 250 follows. All 1-keyword vectors are combined in pairs into 2-keyword vectors. For example, to combine the 1-keyword vector f1{1, 0, 1, 1} with the 1-keyword vector f2{1, 1, 0, 1} into the 2-keyword f1+f2, a Boolean AND is applied to the two vectors: {1, 0, 1, 1} AND {1, 1, 0, 1} gives {1, 0, 0, 1}, which is the vector of the 2-keyword f1+f2. In threshold filtering 260, 2-keyword vectors whose number of 1s is less than the threshold are likewise discarded, leaving the valid 2-keyword vectors. Step 270 checks whether further combination is possible. When n-keyword vectors are to be combined into an (n+1)-keyword vector, two conditions must hold: (1) the number of 1s in the (n+1)-keyword vector reaches the threshold; (2) among the n+1 keywords of that vector, every n-keyword vector formed from them also reaches the threshold. Repeating this step finds the feature combinations of every length, which together form the combined feature set 280.
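The filtering and combination steps 240 to 280 can be sketched as a level-by-level search. This is a minimal sketch under stated assumptions: the function name `select_combined_features`, the use of `frozenset` keys for keyword combinations, and the threshold of 2 in the usage lines are illustrative and do not come from the patent. A count equal to the threshold is kept, following the rule that a vector is discarded only when its count is less than the threshold.

```python
from itertools import combinations


def and_vectors(a, b):
    """Element-wise Boolean AND of two 0/1 vectors."""
    return [x & y for x, y in zip(a, b)]


def select_combined_features(keyword_vectors, threshold):
    """Return every keyword combination whose vector has at least
    `threshold` ones, keyed by a frozenset of its keywords."""
    # Step 240: drop 1-keyword vectors with fewer than `threshold` ones.
    current = {
        frozenset([kw]): vec
        for kw, vec in keyword_vectors.items()
        if sum(vec) >= threshold
    }
    selected = dict(current)
    while current:
        next_level = {}
        # Step 250: combine pairs of n-keyword vectors.
        for (ka, va), (kb, vb) in combinations(current.items(), 2):
            combo = ka | kb
            if len(combo) != len(ka) + 1 or combo in next_level:
                continue  # the pair must differ in exactly one keyword
            # Step 270, condition (2): every n-keyword subset survived.
            if any(combo - {kw} not in current for kw in combo):
                continue
            vec = and_vectors(va, vb)
            # Steps 260/270, condition (1): the combined vector reaches the threshold.
            if sum(vec) >= threshold:
                next_level[combo] = vec
        selected.update(next_level)
        current = next_level
    return selected  # step 280: the combined feature set


# The two vectors from the AND example in the text; threshold 2 is assumed.
keyword_vectors = {"f1": [1, 0, 1, 1], "f2": [1, 1, 0, 1]}
features = select_combined_features(keyword_vectors, threshold=2)
print(features[frozenset({"f1", "f2"})])  # [1, 0, 0, 1]
```

Because the vector of a combination is the AND of the vectors of all its keywords, ANDing any two n-keyword vectors that together cover the same n+1 keywords gives the same result, so each combination needs to be computed only once per level.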
In the embodiment of FIG. 3, documents 1, 2, 3 and 4 are denoted D1, D2, D3 and D4, and all four are assumed to be documents of the financial-management category. After word segmentation the documents contain the following keywords:
- Document 1: stock market, investment, economy, fund
- Document 2: travel, investment, economy, stock market
- Document 3: economy, travel, Taipei City, stock market, investment
- Document 4: investment, stock market
These keywords form the keyword set F{f1: stock market, f2: investment, f3: economy, f4: fund, f5: travel, f6: Taipei City}.
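The four documents and the keyword set F above can be run through the procedure as a check. The sketch below is illustrative only: the English keyword strings are glosses of the original Chinese terms, the variable and function names are assumptions, and the threshold of 3 is the value the embodiment goes on to use.

```python
from itertools import combinations

# Keywords per document, as listed for FIG. 3 (English glosses).
docs = {
    "D1": ["stock market", "investment", "economy", "fund"],
    "D2": ["travel", "investment", "economy", "stock market"],
    "D3": ["economy", "travel", "Taipei City", "stock market", "investment"],
    "D4": ["investment", "stock market"],
}
# Keyword set F in the order f1..f6.
keywords = ["stock market", "investment", "economy", "fund", "travel", "Taipei City"]
THRESHOLD = 3

# Keyword vectors: one 0/1 entry per document, in the order D1..D4.
keyword_vectors = {
    kw: tuple(int(kw in words) for words in docs.values()) for kw in keywords
}

# Keep keywords whose vector contains at least THRESHOLD ones.
kept = [kw for kw in keywords if sum(keyword_vectors[kw]) >= THRESHOLD]


def combined_vector(*kws):
    """Boolean AND of the vectors of the given keywords (min of 0/1 bits)."""
    return tuple(min(bits) for bits in zip(*(keyword_vectors[kw] for kw in kws)))


pair_vectors = {pair: combined_vector(*pair) for pair in combinations(kept, 2)}
triple_vector = combined_vector(*kept)

for pair, vec in pair_vectors.items():
    print(pair, vec, sum(vec) >= THRESHOLD)
print(tuple(kept), triple_vector, sum(triple_vector) >= THRESHOLD)
```

Only stock market, investment and economy pass the first filter, and every pair and the triple built from them contain at least three 1s.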
The document vectors are therefore D1{1, 1, 1, 1, 0, 0}, D2{1, 1, 1, 0, 1, 0}, D3{1, 1, 1, 0, 1, 1} and D4{1, 1, 0, 0, 0, 0}. Inverting all document vectors gives the keyword vectors f1{1, 1, 1, 1}, f2{1, 1, 1, 1}, f3{1, 1, 1, 0}, f4{1, 0, 0, 0}, f5{0, 1, 1, 0} and f6{0, 0, 1, 0}, which are the 1-keyword vectors. With the threshold set to 3, f4, f5 and f6 are discarded, and f1, f2 and f3 can be combined into 2-keyword vectors. Applying AND to each pair of 1-keyword vectors gives f1+f2{1, 1, 1, 1}, f1+f3{1, 1, 1, 0} and f2+f3{1, 1, 1, 0}. The number of 1s in each of these 2-keyword vectors is at least the threshold of 3, so f1+f2, f1+f3 and f2+f3 are all valid 2-keywords and can be combined into the 3-keyword f1+f2+f3. Applying AND to f1+f2 and f1+f3 gives f1+f2+f3{1, 1, 1, 0}. The number of 1s in this vector is 3, which again reaches the threshold, so f1+f2+f3 is a valid 3-keyword. It follows that when the keywords f1+f2+f3, that is f1: stock market, f2: investment and f3: economy, appear together in an article, the article can be judged to belong to the financial-management category.
With the combined category feature selection method of the invention, single keywords can thus be combined into multi-keyword sets that represent the features of a category, which is more accurate and more convincing than counting the frequency of a single keyword alone.
Although the invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention.
The scope of protection of the invention is therefore defined by the appended claims.
Brief description of the drawings
To make the above objects, features and advantages of the invention easier to understand, embodiments are described in detail with reference to the accompanying drawings:
- FIG. 1 shows the architecture of the invention.
- FIG. 2 shows the flowchart of the invention.
- FIG. 3 shows an embodiment of the invention.
Explanation of reference symbols
- 110: documents of the same category
- 120: word segmentation
- 130: stop-word lexicon
- 140: stop-word filtering
- 150: database
- 160: combined category feature selection method
- 170: combined category features
- 210: build keyword set
- 220: build document vectors
- 230: invert into keyword vectors
- 240: filtering
- 250: combine keyword vectors
- 260: threshold filtering
- 270: check whether further combination is possible
- 280: combined feature set
- D1, D2, D3, D4: documents
- F: keyword set
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW91116798A TW569112B (en) | 2002-07-26 | 2002-07-26 | Selection method of combined classification feature |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW91116798A TW569112B (en) | 2002-07-26 | 2002-07-26 | Selection method of combined classification feature |
Publications (1)
Publication Number | Publication Date |
---|---|
TW569112B (en) | 2004-01-01 |
Family
ID=32590423
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW91116798A TW569112B (en) | 2002-07-26 | 2002-07-26 | Selection method of combined classification feature |
Country Status (1)
Country | Link |
---|---|
TW (1) | TW569112B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI615725B (en) * | 2016-11-30 | 2018-02-21 | 優像數位媒體科技股份有限公司 | Phrase vector generation device and operation method thereof |
- 2002-07-26: TW application TW91116798A, patent TW569112B (en), status: not active, IP right cessation
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Kumar et al. | Sentiment analysis of multimodal twitter data | |
Asghar et al. | T‐SAF: Twitter sentiment analysis framework using a hybrid classification scheme | |
CN109933664B (en) | Fine-grained emotion analysis improvement method based on emotion word embedding | |
WO2019174132A1 (en) | Data processing method, server and computer storage medium | |
Dave et al. | Mining the peanut gallery: Opinion extraction and semantic classification of product reviews | |
WO2016107326A1 (en) | Search recommending method and device based on search terms | |
Thomas et al. | Automatic keyword extraction for text summarization in e-newspapers | |
CN108776901B (en) | Advertisement recommendation method and system based on search terms | |
WO2022116435A1 (en) | Title generation method and apparatus, electronic device and storage medium | |
CN111291177A (en) | Information processing method and device and computer storage medium | |
Tran et al. | Spam detection in online classified advertisements | |
Roy et al. | An ensemble approach for aggression identification in English and Hindi text | |
Patil et al. | Web spam detection using SVM classifier | |
WO2022105497A1 (en) | Text screening method and apparatus, device, and storage medium | |
CN107665442B (en) | Method and device for acquiring target user | |
Malhotra et al. | An effective approach for news article summarization | |
Gupta et al. | Text analysis and information retrieval of text data | |
Pekar et al. | Selecting classification features for detection of mass emergency events on social media | |
TW569112B (en) | Selection method of combined classification feature | |
AleEbrahim et al. | Summarising customer online reviews using a new text mining approach | |
Goodman et al. | FastWordBug: A fast method to generate adversarial text against NLP applications | |
Hosseini et al. | Implicit entity linking through ad-hoc retrieval | |
CN107729509A (en) | The chapter similarity decision method represented based on recessive higher-dimension distributed nature | |
Cao et al. | A joint model for chinese microblog sentiment analysis | |
Khan et al. | Semantic-based unsupervised hybrid technique for opinion targets extraction from unstructured reviews |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
GD4A | Issue of patent certificate for granted invention patent | ||
MM4A | Annulment or lapse of patent due to non-payment of fees |