TW201126354A - Method for multi-layer classifier - Google Patents
Method for multi-layer classifier
- Publication number
- TW201126354A (application TW99101931A)
- Authority
- TW
- Taiwan
- Prior art keywords
- category
- model
- attributes
- layer
- undetermined
- Prior art date
Description
VI. Description of the Invention

[Technical Field]

The present invention relates to a multi-level classification method, and more particularly to a method for building a multi-layer discriminant analysis model and for determining attribute selection and cut points.

[Prior Art]

Classification methods have very wide uses. In the financial industry, for example, a bank reviewing credit-card applicants wants to tell whether an applicant is likely to become a bad debt; in medicine, cell tissue must be judged normal or abnormal; in marketing research, one wants to judge whether a given campaign can attract customers to buy a product. Classification methods therefore occupy an important place in the field of data mining.

Classification is a supervised learning method: a supervised method learns with the target output value of each sample known, whereas methods that do not use a target output, such as principal component analysis, are unsupervised. A classification method generally requires selecting appropriate attributes to build the model. For example, when judging whether a person is male or female from height and weight, height and weight are called attributes. When a classification model is built, the data are also usually first divided into two groups: the training samples, used to fit the model, and the test samples, used to validate it.

Two existing classification methods are especially common: Fisher linear discriminant analysis (FLD), familiar from multivariate statistical analysis, and classification and regression trees (CART).
Both methods have shortcomings. In attribute selection in particular, some attributes can discriminate only specific classes, which limits the accuracy of the resulting classifier; and earlier classification models were often built with differing attribute choices, or without any evaluation of the discriminant model's performance, which further harms classification accuracy. A new multi-level classification method is therefore urgently needed to solve these problems.

[Summary of the Invention]

The main object of the present invention is to provide a multi-level classification method based on a multi-layer discriminant analysis model: each layer searches for one or two cut points to classify one or two classes, each layer may use several attributes at once, and Fisher discriminant analysis is used to find the best linear combination of those attributes.

The invention provides a multi-level classification method, embodied in a computer-recordable medium, for classifying a plurality of image samples; the computer-recordable medium includes a processor, an input device, and a storage device. The method comprises at least:

(a) receiving a plurality of original samples;

(b) providing a plurality of attributes and computing, with a multivariate parameter, a significance evaluation of the original samples with respect to these attributes;

(c) selecting at least one cut point and building a discriminant analysis model: for one of the attributes found significant in step (b), a variable-homogeneity analysis parameter is provided to select the at least one cut point, and the original samples covered by the significant attributes are grouped into at least one class, the classes comprising a first class (NodeA), a second class (NodeB), and an undecided third class (NodeN);

(d) performing a model-evaluation step, in which further attributes are added to the discriminant model and significance is re-evaluated: when the added attributes improve the model's significance, the method proceeds to the next layer of the model, where the variable-homogeneity parameter again selects at least one cut point and the original samples covered by the significant attributes continue to be grouped into a first class (NodeA), a second class (NodeB), and an undecided third class (NodeN); and

(e) adding a stop condition: under the chosen variable-homogeneity analysis parameter, if the null hypothesis is not rejected, the discriminant model stops splitting into further layers; alternatively, the attributes added in the model-evaluation step may be tested by a regression-based significance analysis, and when the added attributes can no longer raise the model's significance, so that the null hypothesis is rejected, the discriminant model stops splitting into further layers.

The invention also provides a computer-recordable medium for classifying a plurality of image samples by the multi-level classification method above.

According to the multi-level classification method of the invention, after the stop condition is reached, the undecided third class of the final classification layer contains zero samples; in other words, the method must ultimately classify every original sample into the first class (NodeA) and/or the second class (NodeB).

According to the multi-level classification method of the invention, the choice of the multivariate parameter is not limited, but Wilk's lambda or the Gini index is preferred; the choice of attributes is not limited, but preferably at least one is selected from the group consisting of ringPDVImax, VeinCentralVImin, VeinTDCentralVImax, TDVImax, CI1, RMV, CI2, MCI3, and MI2.
In addition, the choice of the variable-homogeneity analysis parameter is not limited, but the Gini index, the Mahalanobis distance, or Youden's Index is preferred.

For the significance evaluation, either a p-value computed from an F statistic is used, the p-value expressing how significantly the attribute means differ between the classes, or an impurity criterion is used. For two classes the F statistic is

F = ((1 - Λ)/Λ) · ((n - p - 1)/p),

where n is the sample size, p is the number of attributes, and Λ is Wilk's lambda. The impurity is

Impurity = (N_L · Gini(t_L) + N_M · Gini(t_M) + N_R · Gini(t_R)) / (N_L + N_M + N_R),

where N_L, N_M, and N_R are the numbers of samples in the first class (NodeA), the undecided third class (NodeN), and the second class (NodeB), and Gini(t_L), Gini(t_M), and Gini(t_R) are the corresponding Gini values.

According to the multi-level classification method of the invention, the model-evaluation step may use any of the following four approaches:

1. add the new attributes to the same layer as the discriminant model built in step (c), to increase that layer's discriminating power;
2. add the new attributes over the undecided third class (NodeN) and create a new layer; this layer again uses the variable-homogeneity parameter to select at least one cut point and splits the remaining undecided samples into a first class (NodeA), a second class (NodeB), and an undecided third class (NodeN);
3. set the first class (NodeA) back to undecided, add the new attributes over the combination of NodeA and NodeN, and create a new layer that splits those samples into NodeA, NodeB, and NodeN as above; or
4. set the second class (NodeB) back to undecided, add the new attributes over the combination of NodeB and NodeN, and create a new layer that splits those samples into NodeA, NodeB, and NodeN as above.

In short, the invention provides a new discriminant-model structure that resembles a classification tree in splitting the data layer by layer from the top down. Unlike a tree, however, each layer classifies some of the data into one or two classes and passes the undecided data to the next layer; moreover, each layer may select several attributes and combine them linearly by Fisher discriminant analysis. A layer may discriminate only one class (NodeA or NodeB) or both, leaving the still-undecided samples for the next layer. The analysis further includes methods and criteria for choosing the effective variables and cut points of each layer; when a new attribute is added, the overall performance is considered in deciding how to extend the model, and stop conditions are set to avoid overfitting. The invention thus also provides an attribute-selection and cut-point-determination method that accounts for the overall model performance when new attributes are added, deciding how the discriminant model should be built and when it should stop, thereby greatly improving classification accuracy.

[Embodiments]

Fig. 2 is a schematic diagram of the architecture of a computer-recordable medium that can execute the multi-level classification method of the multi-layer discriminant analysis model of the invention. As shown in Fig. 2, the computer-recordable medium 1 comprises a display device 13, a processor 12, a memory 11, an input device 14, and a storage device 15. The input device 14 accepts images, text, commands, and other data; the storage device 15, for example a hard disk, an optical drive, or a database reached over the Internet, stores system programs, applications, and user data; the memory 11 temporarily holds data and running programs; the processor 12 performs the computation and data processing; and the display device 13 shows the output. Such a medium normally runs applications (word processors, drawing programs, scientific computation, browsers, e-mail programs, and so on) under an operating system. In this embodiment, the storage device 15 stores a program that makes the computer-recordable medium execute the multi-level classification method; when the method is to be run, the program is loaded into the memory 11 and executed in cooperation with the processor 12, and the classification results are then shown on the display device 13 or stored in a remote database over the Internet.

The flow of the method is sketched in Fig. 1a, and the multi-layer discriminant model it builds is shown in Fig. 1b. Like a classification tree, the model splits the data continuously from the top down. Unlike a tree, however, each layer judges part or all of the samples, and the samples already judged (NodeA or NodeB) do not enter the next layer's model; only the samples left undecided (NodeN) at this layer pass down, where new attributes are added to judge them. A layer may target one class or two. To judge one class, only one cut point is needed, splitting the samples into two parts: those classified at this layer, and those that remain undecided and must be left to the next layer. To judge two classes, two cut points are needed, cutting the data into three parts: the first class (NodeA), the second class (NodeB), and the remaining undecided third class (NodeN). Each time a new attribute is added, the overall model performance decides whether to combine the attribute into an existing layer, sharpening that layer's judgment, or to add a new layer that judges the samples not yet classified. New attributes are added until the model reaches the stop condition.

The multi-level classification method of the invention and the multi-layer discriminant model it builds are detailed below. First, a plurality of original samples is received. For these samples, an attribute must be chosen from the candidate attributes, and a significance evaluation of the samples with respect to the attributes is computed with the multivariate parameter. The significance evaluation relies on the variable-homogeneity analysis parameter to select at least one cut point among the attributes found significant, preferably the most significant one, and the cut points decide which class (NodeA, NodeB, or NodeN) each sample in the model falls into, or whether it is left to the next layer. Choosing the attributes and determining the cut points are therefore very important. The discriminant model so built is then subjected to the model-evaluation step: additional attributes are added and two model variants are compared, one adding an attribute inside the existing layer, combined by Fisher linear discriminant analysis (hereinafter FLD), the other adding a new layer.

[Attribute and multivariate parameter selection]

The attributes may be chosen from ringPDVImax, VeinCentralVImin, VeinTDCentralVImax, TDVImax, CI1, RMV, CI2, MCI3, and MI2. For the multivariate parameter, two criteria are available: Wilk's lambda, commonly used in multivariate statistics to test whether the class means differ, and the Gini index, used in classification trees to evaluate impurity.
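To make the layer mechanics concrete, the following is a minimal numpy sketch of a single layer: the selected attributes are combined into one score along the Fisher direction, and two cut points split the samples into NodeA, NodeB, and the undecided NodeN. The helper names and the way the cut points are supplied are choices of this sketch, not taken from the patent.

```python
import numpy as np

def fisher_direction(X, y):
    """Fisher discriminant direction w = Sw^-1 (mu1 - mu0) for two classes."""
    X0, X1 = X[y == 0], X[y == 1]
    # Pooled within-class scatter matrix.
    Sw = (np.cov(X0, rowvar=False) * (len(X0) - 1)
          + np.cov(X1, rowvar=False) * (len(X1) - 1))
    return np.linalg.solve(Sw, X1.mean(axis=0) - X0.mean(axis=0))

def one_layer(X, y, c_low, c_high):
    """Split samples into NodeA / NodeB / undecided NodeN using two cut
    points on the one-dimensional Fisher score."""
    score = X @ fisher_direction(X, y)
    node_a = score <= c_low            # judged class A at this layer
    node_b = score >= c_high           # judged class B at this layer
    node_n = ~(node_a | node_b)        # undecided: passed to the next layer
    return node_a, node_b, node_n
```

In the full method, `one_layer` would be applied repeatedly, each time only to the NodeN samples of the previous layer, until the stop condition is met.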
Wilk's lambda

Suppose there are g classes and p attributes, and test

H0: μ1 = μ2 = … = μg    versus    H1: H0 is not true,

where H0 is the null hypothesis, H1 is the alternative hypothesis, and μk is the mean vector of class k.
Wilk's lambda is

Λ = |W| / |B + W| = 1 / |I + W⁻¹B| = Π_i 1/(1 + λ_i)    (Eq. 1)

where W is the within-group variation matrix, B is the between-group variation matrix, I is the identity matrix, and the λ_i are the eigenvalues of W⁻¹B. When H0 is true, Λ after certain transformations follows an F distribution (Eq. 2).
Test statistic:

F = [(1 − Λ^(1/s)) / Λ^(1/s)] · [m·s − p(g−1)/2 + 1] / [p(g−1)] ~ F( p(g−1), m·s − p(g−1)/2 + 1 )    (Eq. 2)

where

m = n − (p + g + 2)/2,    s = sqrt( (p²(g−1)² − 4) / (p² + (g−1)² − 5) ).

For g = 2 classes, s = 1 and the statistic simplifies to

F = ((1 − Λ)/Λ) · ((n − p − 1)/p) ~ F(p, n − p − 1).
Wilk’ s lambda也可轉換成卡方分配 Bartlett's χ2 statistic test statistic - -[(« -1) - (P + g) / 2] In Λ - z;^ (式 3 ) 當類別少的時候,F統計量會比卡方統計量好。由於多 層判別分析較佳為針對2個類別分析,所以我們選用F統計 量。 本發明可以比較每個屬性用前述F統計量算出的p值 (p-value),p值越小代表這個屬性在類別間平均的差異越顯 著,比較每個屬性的p值即可選出一個最顯著的屬性。若是 要同一層中選進新的屬性,則比較新屬性跟原有的屬性組 合後得到的p值,選出跟原有屬性組合?值最小的變數即可。Wilk's lambda can also be converted to a chi-square allocation Bartlett's χ2 statistic test statistic - -[(« -1) - (P + g) / 2] In Λ - z;^ (Formula 3) When there are few categories, F The statistic will be better than the chi-square statistic. Since multi-layer discriminant analysis is better for two categories, we use F statistic. The present invention can compare the p-values calculated by using the aforementioned F statistic for each attribute. The smaller the p-value is, the more significant the difference between the averages of the attributes is. The p-value of each attribute is selected to be the most Significant attributes. If you want to select a new attribute in the same layer, compare the p value obtained by combining the new attribute with the original attribute, and select the combination with the original attribute? The smallest value can be.
Gini Index

Each split requires searching for a good attribute and cut point, so a splitting criterion is needed to evaluate the effectiveness of a candidate attribute and cut point; the most common criterion is the Gini index. The Gini index measures impurity, so the smaller it is, the better. Pairing each attribute with its best corresponding cut point yields that attribute's Gini index, so variable selection reduces to comparing each attribute together with its best cut point and choosing the attribute and cut point that give the best split.

Suppose there are g classes. The Gini index is defined as

Gini(t) = Σ_{i≠j} p(i|t) p(j|t) = 1 − Σ_j p(j|t)²    (Eq. 4)

and the impurity of a binary split is

Impurity = (N_L/N) · Gini(t_L) + (N_R/N) · Gini(t_R)

where p(j|t) is the proportion of class j at node t, N_L is the number of samples in the left node, N_R is the number in the right node, and N = N_L + N_R is the total number of samples.

Here the multi-layer discriminant model differs from a classification tree: a tree makes a binary split at each node, but each layer of the multi-layer discriminant model must split the data into three nodes, so the impurity computation becomes
Impurity = (N_L · Gini(t_L) + N_M · Gini(t_M) + N_R · Gini(t_R)) / (N_L + N_M + N_R)    (Eq. 5)

Comparing each attribute paired with its best set of cut points gives each attribute's impurity, and the attribute with the smallest impurity is selected. To add a new attribute to the same layer, compute the impurity from the discriminant scores obtained by combining the new attribute with the existing attributes through FLD, and choose the attribute that yields the lowest combined impurity.

[Cut point selection]

There are three methods for choosing cut points: the Gini index, the Mahalanobis distance, and Youden's Index.
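The Gini index of Eq. 4 and the three-node impurity of Eq. 5 translate directly into code. This is a plain-Python sketch, with `left`, `middle`, and `right` standing for NodeA, the undecided NodeN, and NodeB.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node (Eq. 4): 1 - sum_j p(j|t)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def three_node_impurity(left, middle, right):
    """Weighted impurity of a three-way split (Eq. 5); left and right
    are the decided nodes, middle is the undecided node."""
    n = len(left) + len(middle) + len(right)
    return (len(left) * gini(left) + len(middle) * gini(middle)
            + len(right) * gini(right)) / n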
Gini IndexGini Index
When the Gini index is used to select attributes, each attribute must be paired with a set of cut points to obtain its impurity, so a method is needed to find the best set of cut points, the one giving the lowest impurity. A classification tree needs only one cut point, found by trying every possible cut and keeping the one with the lowest impurity. In the multi-layer discriminant model, however, two cut points are needed to split the data into three groups. With N samples, finding one cut point requires trying only about N candidates, but finding two cut points jointly requires on the order of N² candidate pairs, which is very slow when the sample is large. The invention therefore develops a fast method for finding the two cut points.

First, as in an ordinary classification tree, search all candidates for the cut point C0 that splits all the data into two groups with the lowest impurity, and use C0 to cut the data into two nodes, Node I and Node II. Within Node I, search for the cut point C1 that splits Node I into two groups with the lowest impurity; likewise, within Node II, search for the cut point C2 that splits Node II into two groups with the lowest impurity, as shown in Fig. 3. This yields the three candidate cut points C0, C1, and C2.
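A sketch of this two-stage search under the Gini criterion follows; the tie-breaking, the helper names, and the omission of the impurity restriction on C1 and C2 are simplifications of this sketch, not the patent's.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_cut(points):
    """points: list of (value, label).  Scan the N-1 boundaries of the
    sorted values; return the midpoint cut with the lowest two-way
    weighted Gini impurity."""
    pts = sorted(points)
    best_val, best_imp = None, float("inf")
    for k in range(1, len(pts)):
        left = [l for _, l in pts[:k]]
        right = [l for _, l in pts[k:]]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(pts)
        if imp < best_imp:
            best_imp, best_val = imp, (pts[k - 1][0] + pts[k][0]) / 2.0
    return best_val

def two_stage_cuts(points):
    """~2N-scan search: C0 over all data, C1 inside the low side of C0,
    C2 inside the high side; then compare the three candidate pairs by
    three-way impurity.  Assumes each side of C0 keeps >= 2 samples."""
    c0 = best_cut(points)
    c1 = best_cut([p for p in points if p[0] <= c0])
    c2 = best_cut([p for p in points if p[0] > c0])
    def impurity3(a, b):
        a, b = sorted((a, b))
        groups = ([l for v, l in points if v <= a],
                  [l for v, l in points if a < v <= b],
                  [l for v, l in points if v > b])
        return sum(len(g) * gini(g) for g in groups) / len(points)
    return min([(c0, c1), (c1, c2), (c0, c2)], key=lambda pr: impurity3(*pr))
```

On well-separated data the selected pair leaves a pure class on each outer side, with the ambiguous samples in the middle, undecided group.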
From the three candidate cut points C0, C1, and C2, three pairs can be formed: (C0, C1), (C1, C2), and (C0, C2). Comparing the impurity of the three-way splits produced by these three pairs selects the best pair; preferably, the highly homogeneous samples end up on the two outer sides. For this reason a restriction is imposed when searching for C1: of the two groups produced by C1, the group farther from C0 must have higher impurity than the other group. For the same reason, the same restriction is imposed when searching for C2. With this search algorithm, only about 2N positions must be examined to find the three candidate cut points, after which only the three pairs need be compared.

Mahalanobis distance

Another cut-point selection method of the invention is the Mahalanobis distance. It differs from the Euclidean distance in that it considers not only the distance to each class's center but also how each class is scattered. For example, if a sample is equally far from the centers of classes A and B, but class A has a large variance and is widely scattered while class B has a small variance and is tightly concentrated, then the sample's Mahalanobis distance to class A is smaller than its distance to class B, and the sample is therefore considered to belong more to class A.
The Mahalanobis distance is ··仏(1)=yJ(X -» 5, and the Mahalanobis distance of the distance B category is .dh(x,= buckle-...)rsk'xD where M/Gp/k,.·./, ) is the A category of 16 201126354 The average is the A-category covariance matrix (c〇variance), A=(/WW,) is the average of the B categories, the total number of variants of the category is 0 (7)) < Then belongs to category A, and belongs to category b. In the multi-layer discriminant analysis model of the present invention, the present invention divides a plurality of samples into A, B two categories (N〇deA' NodeB) and undecided (N〇deN) These three groups, so the original q (J〇<Aj(Y), the number of samples belonging to the A category, and then use these originals already The sample broken into the A category is calculated as a new f^A\ λέβ\ L ' A, ' and then the samples that have been judged as the Α category are recalculated with the new mean and the number of variances. The DH (x) )=^-Mm)rs-](x_Mfji) If )U)<~(7), it belongs to category A; and Α|(χ)>ζ^(χ),. is undecided. The same 'take the original &〇〇>& 00 'the number of samples belonging to the B category, and then use these samples that have been judged as 8 categories to calculate a new 仏 2 ' ' 'S«2 ' and then These samples, which have been judged to be in the B category, are recalculated with a new average and the number of mutations, D, 2M = ^MA2yS; A'2 (x~P/l2), Dh2{x) = yli^2ys; \(x-Me2) If %00>仏0〇' belongs to category B, and Z)42(x)<仏w, it is undecided. It should also be noted that when the Mahalanobis distance finding point is used in the multi-layer discriminant analysis model of the present invention, the main purpose is to divide the data into A categories and compare them to the B category, and then use the two data pieces. 
These two sub-collections of data are then used to find the desired cut points. However, when the two sub-collections are distributed as in Figure 4a, the choice of cut points affects the classification accuracy: a large difference in sample sizes between the categories makes the Mahalanobis-distance cut points unreliable. To improve this situation, the Gini index can further be used to correct the Mahalanobis distance, as shown in Figure 4b. First, the Gini index is used to find a cut point, which divides the data into two sides, and the proportion of each category on the two sides is compared. If the proportion of category A on the left side is larger than that on the right side, the category-A samples on the right side are removed; otherwise, the category-A samples on the left side are removed. Similarly, for category B the proportions on the two sides are compared, and the category-B samples on the side with the smaller proportion are removed. The remaining category-A and category-B samples are then used to recalculate the means and covariances, yielding a Mahalanobis distance corrected by the Gini index.
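A minimal sketch of the Gini-index correction described above, for one attribute and two classes (labelled 0 and 1 here): each class is removed from the side of the Gini cut point where its proportion is smaller, and the trimmed samples are then used to recompute the mean and variance that feed the corrected Mahalanobis distance. The helper name and one-dimensional setting are assumptions for illustration.

```python
import numpy as np

def trimmed_stats(x, y, cut):
    """Drop each class from the side of `cut` where its proportion is smaller,
    then return the trimmed (mean, variance) for each class."""
    left, right = x <= cut, x > cut
    stats = {}
    for cls in (0, 1):
        p_left = np.mean(y[left] == cls) if left.any() else 0.0
        p_right = np.mean(y[right] == cls) if right.any() else 0.0
        keep = left if p_left >= p_right else right   # side where the class dominates
        vals = x[keep & (y == cls)]
        stats[cls] = (vals.mean(), vals.var(ddof=1))
    return stats
```

The returned per-class means and variances replace the raw estimates when the Mahalanobis-distance cut points are recomputed.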
Youden's Index

First, define Youden's Index = specificity + sensitivity − 1, where specificity is the proportion of all category-A samples among the provided original samples that are judged correctly, and sensitivity is the proportion of all category-B samples that are judged correctly; hence, the higher the Youden's Index, the better.

The method of searching for cut points is similar to that using the Gini index. First, all possible cut points are searched for the one, C0, that yields the highest Youden's Index after dividing all the data into two groups; C0 cuts the data into a left node and a right node. Within the left node, a cut point C1 is searched for that divides it into two groups with the highest Youden's Index; likewise, within the right node, a cut point C2 is searched for. In this way three candidate cut points C0, C1, and C2 are obtained, which form the three cut-point combinations (C0, C1), (C0, C2), and (C1, C2). The Youden's Index obtained after each combination cuts the data into three groups is compared, and the best combination is selected. When the data are divided into three groups, since there is an undetermined part, the calculation of specificity and sensitivity must be modified:
Specificity = (number of correctly judged category-A samples + 0.5 × number of undetermined category-A samples) / total number of category-A samples; and
Sensitivity = (number of correctly judged category-B samples + 0.5 × number of undetermined category-B samples) / total number of category-B samples.

Thereafter, the combination with the highest Youden's Index among the three cut-point combinations is selected.

[Evaluating model performance]

In the multi-layer discriminant analysis model, each time an attribute is to be added to the model, the evaluation can be carried out under the following four schemes. First, as shown in Figure 5, assume there is already a one-layer model built from X1, which divides the samples into three groups: category A, category B, and undetermined, denoted NodeA, NodeB, and NodeN respectively.

Scheme 1: Add the new attribute X2 to the existing layer and combine it with X1 using FLD, so as to increase the discriminating power of that layer.

Scheme 2: Build a model with X2 on NodeN, and use this model to distinguish the samples that the existing layer could not.

Scheme 3: Merge the samples of NodeA and NodeN, denoted NodeAN. The existing model built from X1 is then used only to separate out category B, and a model with X2 is built on NodeAN to distinguish the samples that the existing layer could not.

Scheme 4: Merge the samples of NodeB and NodeN, denoted NodeBN. The existing model built from X1 is then used only to separate out category A, and a model with X2 is built on NodeBN to distinguish the samples that the existing layer could not.

[Stop conditions]

The stop conditions of the multi-layer discriminant analysis model of the present invention are of two kinds: one decides whether the undetermined samples should be split further, and the other decides whether a new attribute should be added to an existing layer.

To decide whether to continue splitting the undetermined samples, the Wilks' lambda test mentioned in attribute selection can be used. If the null hypothesis is not rejected, it means that no attribute can be found among the remaining samples that significantly separates the categories, so the splitting stops.

As mentioned above, the other stop condition decides whether a new attribute should be added to an existing layer. Since the model already contains some significant attributes, the question when adding a new attribute is not whether the overall model is significant afterwards, but how much additional variation the new attribute explains. Here, reference can be made to the partial F-test used in forward selection in regression analysis, which tests whether the model with the newly added attribute differs significantly from the original model. If the null hypothesis is rejected, the model with the new attribute shows no significant improvement, and the attribute is not added to the model. The test model is given by (Equation 6):

H0: y = β0 + β1X1 + β2X2 (full model)
H1: y = β0 + β1X1 (reduced model)

F* = [(SSR(X1, X2) − SSR(X1)) / (df_R − df_F)] / [SSE(X1, X2) / df_F]   (Equation 6)

where df_F is the degrees of freedom of the full model; df_R is the degrees of freedom of the reduced model; β0, β1, β2 are the parameters of the variables; SSR is the explained sum of squares; and SSE is the residual sum of squares.

In the forward selection method for discriminant analysis, the model is as in (Equation 7); if the null hypothesis is rejected, the model need not include the new attribute.

H0: d = ω1X1 + ω2X2 (full model)
H1: d = ω1X1 (reduced model)   (Equation 7)

If the newly added attribute is significant enough, the performance of the overall model before and after the addition is further compared using the model-evaluation method described above. Conversely, if adding the new attribute cannot improve the performance of the overall model, no new attribute is added. It should be noted that, in the multi-level classification method of the present invention and the multi-layer discriminant analysis model architecture it establishes, the last layer of the model is forced to classify all the data; no undetermined samples may be left.

Based on the foregoing parameters and conditions, the detailed flowchart for establishing a multi-layer discriminant analysis model according to the method of the present invention is shown in Figure 6.

First, after a plurality of original samples are received (not shown), Wilks' lambda or the Gini index is used to select the single most significant attribute, which is then tested for its ability to significantly discriminate the categories. If the null hypothesis is rejected, the attribute has explanatory power.
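The partial F statistic of (Equation 6) can be sketched with ordinary least squares as follows. This is an illustration of the standard partial F-test rather than code from the patent; the numerator is expressed through residual sums of squares (SSE_R − SSE_F equals SSR_F − SSR_R, which is algebraically equivalent), and the degrees of freedom are taken as the residual degrees of freedom of each model.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares of an OLS fit of y on X (intercept added)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def partial_f(X_reduced, X_full, y):
    """F* = [(SSE_R - SSE_F) / (df_R - df_F)] / (SSE_F / df_F)."""
    n = len(y)
    sse_r, sse_f = sse(X_reduced, y), sse(X_full, y)
    df_r = n - (X_reduced.shape[1] + 1)   # residual df, reduced model
    df_f = n - (X_full.shape[1] + 1)      # residual df, full model
    return ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
```

A large F* means the added attribute explains substantial additional variation; the statistic would be compared against an F(df_R − df_F, df_F) reference distribution.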
Next, the Mahalanobis distance or the Gini index described above is used to find the best set of cut points for this attribute, dividing the data into three groups: the first category (category A, NodeA), the second category (category B, NodeB), and the undetermined third category (NodeN). The performance of the model can then be evaluated from these three groups.

When the second attribute is selected, where it should be added must be considered. The four schemes mentioned above are: (Scheme 1) in the existing layer, find the best attribute and cut points in combination with the existing variable; (Scheme 2) use the previously undetermined samples to find the most suitable attribute and cut points; (Scheme 3) treat category A as undetermined, and use the category-A samples together with the undetermined samples to find the most suitable attribute and cut points; and (Scheme 4) treat category B as undetermined, and use the category-B samples together with the undetermined samples to find the most suitable attribute and cut points.

After each scheme has selected an attribute, its significance is tested with Wilks' lambda, and a scheme judged not significant enough discards its attribute. The model-evaluation step described above is then used to evaluate the performance of the model under each remaining scheme. If Scheme 1 performs best, the new attribute is added to the existing layer. If Scheme 2 performs best, a new layer model is built from the undetermined samples left over from the previous layer. If Scheme 3 or Scheme 4 performs best, category A (or category B) in the previous layer is treated as undetermined, a new layer model is built from all the remaining undetermined samples, and the model of the upper layer is converted to cut at only one cut point, judging only category A or category B; it no longer judges both categories in the same layer.

If the current model already has n layers and a new attribute is to be added, adding the attribute to one of the existing layers gives n cases; together with Schemes 2, 3, and 4, a total of n + 3 cases must be considered. If the newly added variable is not significant in any of these n + 3 cases, the model stops. If some cases pass, the one with the best performance is selected, and it is checked whether the overall model performance improves after this additional attribute is selected. If there is no improvement, no more variables are added; if there is improvement, new attributes continue to be added to the model until the model performance no longer improves.

In summary, the present invention provides a systematic variable selection method for the multi-layer discriminant analysis model: variables can be selected using the p-value obtained by converting Wilks' lambda to an F distribution, or using the Gini index. For determining the cut points, methods such as the Mahalanobis distance and the Gini index are also provided. When the Gini index is used to determine the cut points, at least two cut points must be found, and searching all possible cut-point combinations would be very time-consuming; the present invention therefore also provides a faster method of searching for the desired cut points. When the Mahalanobis distance is used to determine the cut points, all samples are first divided by the Mahalanobis distance into those leaning toward category A and those leaning toward category B, and these two groups are then used to find the two Mahalanobis-distance cut points. However, because the data are first divided into two groups in this way, the difference in sample sizes between the categories within the two groups is usually large, which makes the Mahalanobis-distance cut points unreliable; the present invention therefore also provides a correction of the Mahalanobis distance using the Gini index to solve this problem. Each time a new attribute is added to the model, not only the performance of one layer but the performance of the overall model is considered before deciding where the new attribute goes. For the stop conditions of the model, methods such as Wilks' lambda are provided to prevent overfitting, thereby greatly improving the classification accuracy.

[Embodiment 1]

This embodiment provides data with 100 samples, 2 categories, and 5 attributes (X1, X2, ..., X5), where every attribute follows N(0, 1). The category scatter plot is shown in Figure 7b, and the preset model is shown in Figure 7a, in which the first layer is explained by X1, and the part it cannot classify is left to the next layer to be explained by X2.

The results obtained by multi-layer discriminant analysis are shown in Figure 7c. Since the multi-layer discriminant analysis model has two cut-point-finding methods, the Gini index and the Mahalanobis distance, both are presented in the results. The results obtained by CART are shown in Figure 7d. The multi-layer discriminant analysis that uses the Gini index to find cut points can be compared with CART, since the two use the same criterion for finding cut points.
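In the preset model above, each layer sends part of the samples to a category and passes the rest down as undetermined. For a single attribute (or discriminant score) and a pair of cut points, this per-layer rule reduces to a three-way assignment; a minimal sketch, assuming category A lies below the lower cut point and category B above the upper one:

```python
def split_layer(scores, c_low, c_high):
    """Assign each score to 'A' (at or below c_low), 'B' (at or above c_high),
    or 'N' (undetermined, passed to the next layer)."""
    def assign(s):
        if s <= c_low:
            return 'A'
        if s >= c_high:
            return 'B'
        return 'N'
    return [assign(s) for s in scores]
```

The 'N' samples are exactly the ones the next layer receives; in the last layer the two cut points coincide, so no sample remains undetermined.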
In the multi-layer discriminant analysis, the first layer uses X1 to separate out both category 0 and category 1: the category-0 node contains 24 category-0 samples and 0 category-1 samples, and the category-1 node contains 3 category-0 samples and 35 category-1 samples. By contrast, as shown in Figure 7d, CART uses X1 in the first layer to separate out category 1 (3 category-0 and 35 category-1 samples) and uses X1 again in the second layer to separate out category 0 (24 category-0 and 0 category-1 samples); the classification results are the same. However, the multi-layer discriminant analysis uses the ability of this attribute to discriminate both categories (category 0 and category 1) within a single layer, whereas CART can only discriminate one category per layer and must use the same attribute again in the next layer to discriminate the other category.

The overall results are presented in Table 1; it can be seen that the multi-layer discriminant analysis using the Gini index obtains results as good as CART.

| | Multi-layer FLD, cut points: Gini index | Multi-layer FLD, cut points: MD | CART | FLD |
|---|---|---|---|---|
| Accuracy | 0.89 | 0.85 | 0.89 | 0.83 |

Table 1
[Embodiment 2]

This embodiment provides data with 200 samples, 2 categories, and 10 attributes (X1, X2, ..., X10), where every attribute follows N(0, 1). The preset model is shown in Figure 8a: the first layer selects X1 and X2 and combines them into one FLD model, and the samples the first layer cannot classify are left to the second layer, explained by an FLD model combining X3 and X4.

The results obtained by multi-layer discriminant analysis are shown in Figure 8b, and the results obtained by CART are shown in Figure 8c. The results of this embodiment are presented in Table 2; the multi-layer discriminant analysis, whether it uses the Gini index or the Mahalanobis distance to find the cut points, obtains a higher accuracy than CART and FLD.

| | Multi-layer FLD, cut points: Gini index | Multi-layer FLD, cut points: MD | CART | FLD |
|---|---|---|---|---|
| Accuracy | 0.9 | 0.885 | 0.83 | 0.88 |

Table 2

[Embodiment 3]

This embodiment provides data with 1000 samples, 2 categories, and 5 attributes (X1, X2, ..., X5), where every attribute follows N(0, 1). The category scatter plot is shown in Figure 9b, and the preset model is shown in Figure 9a.
In the first layer, X1 is used, and X1 only has the ability to separate out category 0; the remaining samples that cannot be classified are left to the next layer, explained by X2.

The results obtained by multi-layer discriminant analysis are shown in Figure 9c, and the results obtained by CART are shown in Figure 9d. Since the preset model can be regarded as a univariate tree structure, in this case the multi-layer discriminant analysis using the Gini index as the cut-point criterion gives the same results as CART.

The results of this embodiment are presented in Table 3; the multi-layer discriminant analysis using the Gini index obtains results as good as CART.

| | Multi-layer FLD, cut points: Gini index | Multi-layer FLD, cut points: MD | CART | FLD |
|---|---|---|---|---|
| Accuracy | 0.84 | 0.835 | 0.84 | 0.785 |

Table 3
[Embodiment 4]

This embodiment provides data with 1000 samples, 2 categories, and 5 attributes (X1, X2, ..., X5), where every attribute follows N(0, 1). The preset model is shown in Figure 10a: the first layer uses X1, which only has the ability to separate out category 0, and the remaining samples that cannot be classified are left to the next layer, explained by X2 and X3.

The results obtained by multi-layer discriminant analysis are shown in Figure 10b, and the results obtained by CART are shown in Figure 10c. The results of this embodiment are presented in Table 4; the multi-layer discriminant analysis using the Gini index obtains the best results.

| | Multi-layer FLD, cut points: Gini index | Multi-layer FLD, cut points: MD | CART | FLD |
|---|---|---|---|---|
| Accuracy | 0.865 | 0.795 | 0.85 | 0.79 |

Table 4
[實施例5] 在本實施例中提供了透過超音波掃描來得到一些腫瘤 影像的量化的屬性,再透過這些屬性來建構一個判別模 型,其中腫瘤影像樣本有160個’有108個以類別〇代表,52 個以類別1代表。 首先提供CI、έΐ、MI、HI、ringPDVImax這5個屬性做分析, 若直接使用費雪判別分析合併這5個屬性,得到的準確率為 0.793,使用多層判別分析的結果準確率則為0.8。此外,多 層判別分析只會使用其中四個變數,如圖11 a所示,且得到 的準確率比傳統的費雪判別分析高。 除上述5個屬性之外,根據本實施例可再加入其他屬性一起 分析。多層判別分析使用Gini index決定切點得到的結果如圖 27 201126354 1 1 b所示,準减率為0.906。多層判別分析使用Youden’s index 決定切點得到的結果則如圖1 1 c所示,準確率為0.801 2。 CART所得到的結果如圖lid所示,準確率為0.868。FLD使 用了 ringPDVImax、VeinCentralVImin、VeinTDCentralVImax、 TDVImax、Cl、RMV、CI2、MCI3、MI2這 9個屬性,得到的準 確率為0.843。如表5所示,多層判別分析得到的準確率最好。 多層FLD切 點:Gin i index 多層FLD切 點:Youden’s index CART FLD 準確率 0.906 0.801 0.868 0.838 表5 再者,本發明上述執行步驟,可以電腦語言寫成以便 執行,而此寫成之軟體程式可以儲存於任何微處理單元可 以辨識、解讀之紀錄媒體,或包含有此紀錄媒體之物品及 裝置。其不限為任何形式,此物品可為硬碟、軟碟、光碟、 ZIP、MO、1C晶片、隨機存取記憶體(RAM),或任何熟悉此 項技藝者所可使用之包含有此紀錄媒體之物品。由於本發 明之多層次分類方法已揭露完整如前,任何熟悉電腦語言 者閱讀本發明說明書即知如何撰寫軟體程式,故有關軟體 程式細節部分不在此贅述。 上述實施例僅係為了方便說明而舉例而已,本發明所 主張之權利範圍自應以申請專利範圍所述為準,而非僅限 於上述實施例。 28 201126354 【圖式簡單說明】 圖1a係本發明多❹丨別分析流程圖。 架構 圖1b係根據本發明之方法所建立之多層判別分析模型 示意圖。 、 圖2係顯不-電腦可紀錄媒體之架構的示意圖。 圖3係本發明一較佳實施例之搜尋⑶口丨切點示意圖。 圖4a-4b係本發明—較佳實施例之使用丨以以修正馬氏距 離示意圖。 圖5係本發明一比較模型之四種方式示意圖。 圖6係本發明多層判別分析模型詳細流程圖。 圖7a-7d係本發明實施例1示意圖。 圖8a-8c係本發明實施例2示意圖。 圖9a-9d係本發明實施例3示意圖。 圖l〇a-l〇c係本發明實施例4示意圖。 圖11 a-11 d係本發明實施例5示意圖。 【主要元件符號說明】 電腦可紀錄媒體1 記憶體11 處理器12 顯示裝置13 輸入裝置14 儲存裝置15 29[Embodiment 5] In the present embodiment, the quantized attributes of some tumor images are obtained by ultrasonic scanning, and then a discriminant model is constructed through these attributes, wherein there are 160 'with 108 categories of tumor image samples〇 Representatives, 52 are represented by category 1. Firstly, the five attributes of CI, έΐ, MI, HI, and ringPDVImax are provided for analysis. If the five attributes are directly combined using Fisher's discriminant analysis, the accuracy rate is 0.793, and the accuracy of using multi-level discriminant analysis is 0.8. 
In addition, the multi-layer discriminant analysis uses only four of these variables, as shown in Figure 11a, and the accuracy obtained is higher than that of the traditional Fisher discriminant analysis. Beyond the above five attributes, other attributes can also be added to the analysis according to this embodiment. The result of the multi-layer discriminant analysis using the Gini index to determine the cut points is shown in Figure 11b, with an accuracy of 0.906. The result using Youden's index to determine the cut points is shown in Figure 11c, with an accuracy of 0.8012. The result obtained by CART is shown in Figure 11d, with an accuracy of 0.868. FLD uses the nine attributes ringPDVImax, VeinCentralVImin, VeinTDCentralVImax, TDVImax, CI, RMV, CI2, MCI3, and MI2, with an accuracy of 0.843. As shown in Table 5, the multi-layer discriminant analysis obtains the best accuracy.

| | Multi-layer FLD, cut points: Gini index | Multi-layer FLD, cut points: Youden's index | CART | FLD |
|---|---|---|---|---|
| Accuracy | 0.906 | 0.801 | 0.868 | 0.838 |

Table 5

Furthermore, the above execution steps of the present invention can be written in a computer language for execution, and the software program so written can be stored in any recording medium that a microprocessing unit can recognize and interpret, or in any article or device containing such a recording medium. The form is not limited: the article may be a hard disk, a floppy disk, an optical disc, a ZIP disk, an MO disk, an IC chip, a random access memory (RAM), or any article containing such a recording medium usable by those skilled in the art. Since the multi-level classification method of the present invention has been fully disclosed above, anyone familiar with computer languages will know how to write the software program after reading this specification, so the details of the software program are not described here.
The above embodiments are merely examples for convenience of description; the scope of the claimed rights shall be determined by the appended claims and is not limited to the above embodiments.

[Brief description of the drawings]

Figure 1a is a flowchart of the multi-layer discriminant analysis of the present invention.
Figure 1b is a schematic diagram of the multi-layer discriminant analysis model architecture established according to the method of the present invention.
Figure 2 is a schematic diagram of the architecture of a computer-recordable medium.
Figure 3 is a schematic diagram of searching for Gini cut points in a preferred embodiment of the present invention.
Figures 4a-4b are schematic diagrams of using the Gini index to correct the Mahalanobis distance in a preferred embodiment of the present invention.
Figure 5 is a schematic diagram of the four schemes for comparing models in the present invention.
Figure 6 is a detailed flowchart of the multi-layer discriminant analysis model of the present invention.
Figures 7a-7d are schematic diagrams of Embodiment 1 of the present invention.
Figures 8a-8c are schematic diagrams of Embodiment 2 of the present invention.
Figures 9a-9d are schematic diagrams of Embodiment 3 of the present invention.
Figures 10a-10c are schematic diagrams of Embodiment 4 of the present invention.
Figures 11a-11d are schematic diagrams of Embodiment 5 of the present invention.

[Description of main component symbols]

Computer-recordable medium 1; memory 11; processor 12; display device 13; input device 14; storage device 15
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW099101931A TWI521361B (en) | 2010-01-25 | 2010-01-25 | Method for multi-layer classifier |
Publications (2)
Publication Number | Publication Date |
---|---|
TW201126354A true TW201126354A (en) | 2011-08-01 |
TWI521361B TWI521361B (en) | 2016-02-11 |
Family
ID=45024492
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI521361B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI564740B (en) * | 2015-08-24 | 2017-01-01 | 國立成功大學 | Mutually-exclusive and collectively-exhaustive (mece) feature selection method and computer program product |
Also Published As
Publication number | Publication date |
---|---|
TWI521361B (en) | 2016-02-11 |