TW201126354A - Method for multi-layer classifier - Google Patents

Method for multi-layer classifier Download PDF

Info

Publication number
TW201126354A
Authority
TW
Taiwan
Prior art keywords
category
model
attributes
layer
undetermined
Prior art date
Application number
TW99101931A
Other languages
Chinese (zh)
Other versions
TWI521361B (en)
Inventor
King-Jen Chang
Wen-Hwa Chen
Argon Chen
Chiung-Nein Chen
Ming-Chih Ho
Hao-Chih Tai
Ming-Hsun Wu
Hsin-Jung Wu
Original Assignee
Amcad Biomed Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Amcad Biomed Corp filed Critical Amcad Biomed Corp
Priority to TW099101931A priority Critical patent/TWI521361B/en
Publication of TW201126354A publication Critical patent/TW201126354A/en
Application granted granted Critical
Publication of TWI521361B publication Critical patent/TWI521361B/en

Links

Abstract

The present invention relates to a method for a multi-layer classifier applied on a computer-readable medium for classifying multiple image samples. The method comprises at least the following steps: (a) receiving a plurality of samples; (b) providing a plurality of attributes and evaluating the significance of the attributes by a selection criterion; (c) selecting at least one cut-point to establish a discriminant analysis model; (d) evaluating the performance of the discriminant analysis model by adding the attributes to the discriminant analysis model; and (e) providing a stop criterion. The present invention also provides a computer-readable medium for classifying multiple image samples by using the method for the multi-layer classifier.

Description

201126354
VI. Description of the Invention:

[Technical Field]

The present invention relates to a multi-level classification method, and more particularly to a classification method suitable for establishing a multi-layer discriminant analysis model and for determining attribute selection and cut-points.

[Prior Art]

Classification methods have very wide uses. In the financial industry, for example, a bank reviewing credit-card applicants can judge whether an applicant is likely to become a bad debt; in medicine, a classifier can judge whether cell tissue is normal or abnormal; in marketing research, it can judge whether a marketing approach will attract customers to buy goods. The study of classification methods therefore occupies an important place in the field of data mining.

Classification is a supervised learning method: the model is built with the target output values known. It contrasts with unsupervised learning methods, such as principal component analysis, in which the target outputs are not used. A classification method generally requires selecting appropriate attributes to build the classification model; for example, when height and weight are used to judge whether a person is male or female, height and weight are called attributes. When building a classification model, the data are usually divided into two groups first: one group of training samples used to build the model, and another group of testing samples used to validate the classification model.

Two existing classification methods are especially common: Fisher's linear discriminant analysis (FLD), familiar from multivariate statistical analysis, and classification and regression trees (CART).
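Since FLD recurs throughout the method described below, a minimal sketch of how it combines two attributes into one discriminant score may be helpful. This is an illustration only, not code from the patent; the data and names are invented for the example.

```python
import numpy as np

def fisher_direction(XA, XB):
    """Fisher's linear discriminant direction w = Sw^-1 (mu_A - mu_B)."""
    mu_a, mu_b = XA.mean(axis=0), XB.mean(axis=0)
    # Pooled within-class scatter matrix
    Sw = ((XA - mu_a).T @ (XA - mu_a)) + ((XB - mu_b).T @ (XB - mu_b))
    return np.linalg.solve(Sw, mu_a - mu_b)

rng = np.random.default_rng(42)
XA = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))  # e.g. two attributes
XB = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(100, 2))
w = fisher_direction(XA, XB)
# Projecting onto w turns the two attributes into a single discriminant score,
# on which a cut-point can then be searched.
score_a, score_b = XA @ w, XB @ w
```

By construction the difference of the projected class means equals a positive definite quadratic form, so class A always scores higher than class B on this direction.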

In these classification methods, and especially in attribute selection, some attributes can discriminate only particular classes, which limits the accuracy of the resulting classifier. Moreover, in building classification models in the past, the result could differ according to the attributes chosen, and the performance of the discriminant analysis model being built was often not evaluated, which further affects classification accuracy.

A new multi-level classification method is therefore urgently needed to solve these problems.

SUMMARY OF THE INVENTION

The main object of the present invention is to provide a multi-level classification method in which a multi-layer discriminant analysis model searches, at each layer, for one or two cut-points to classify one or two classes; each layer may use several attributes at the same time, and Fisher discriminant analysis is used to find the best linear combination of those attributes.

The present invention provides a multi-level classification method applied on a computer-readable medium for classifying multiple image samples, the computer-readable medium comprising a processor, an input device, and a storage device. The method comprises at least:

(a) receiving a plurality of original samples;

(b) providing a plurality of attributes, and evaluating the significance of the attributes on these original samples with a multivariate parameter;

(c) selecting at least one cut-point and establishing a discriminant analysis model: for one of the attributes found significant in step (b), a variable-homogeneity analysis parameter is provided to screen out the at least one cut-point, and the original samples covered by that significant attribute are grouped into at least one class so as to establish the discriminant analysis model, the classes comprising a first class (NodeA), a second class (NodeB), and an undetermined third class (NodeN);

(d) evaluating the performance of the discriminant analysis model by adding further attributes to it and re-evaluating significance: when an added attribute improves the significance of the discriminant analysis model, the method proceeds to the next layer of the model, where the variable-homogeneity analysis parameter again screens out at least one cut-point and the original samples covered by the significant attributes continue to be grouped into the first class (NodeA), the second class (NodeB), and the undetermined third class (NodeN); and

(e) providing a stop criterion: using the selected variable-homogeneity analysis parameter, if the null hypothesis is not rejected, the discriminant analysis model stops splitting into further layers; alternatively, in the performance-evaluation step, the added attributes are tested by a regression-based significance evaluation, and when the added attributes can no longer improve the significance of the discriminant analysis model and the null hypothesis is rejected, the discriminant analysis model stops splitting into further layers.

The present invention also provides a computer-readable medium for classifying multiple image samples by establishing the multi-level classification method described above.

According to the multi-level classification method of the present invention, once the stop criterion is reached, the undetermined third class (NodeN) of the last classification layer contains zero samples; in other words, the final result of the method must classify every original sample into the first class (NodeA) and/or the second class (NodeB).

According to the multi-level classification method of the present invention, the choice of the multivariate parameter is not limited, but is preferably Wilk's lambda or the Gini index. The choice of attributes is not limited, but is preferably at least one selected from the group consisting of ringPDVImax, VeinCentralVImin, VeinTDCentralVImax, TDVImax, CI, RMV, CI2, MCI3, and MI2. The choice of the variable-homogeneity analysis parameter is likewise not limited, but is preferably the Gini index, the Mahalanobis distance, or Youden's Index.

For the significance evaluation, a p-value computed from an F statistic expresses how significantly the attributes differ on average between the classes; alternatively, a criterion measuring impurity may be used. The F statistic is defined in Equation 2 below, and the impurity is

Impurity = (NL x Gini(tL) + NM x Gini(tM) + NR x Gini(tR)) / (NL + NM + NR)

where n is the sample size, p is the number of attributes, and Λ is Wilk's lambda; and where NL is the sample size of the first class, NM is the sample size of the third class, NR is the sample size of the second class, and tL, tM and tR are the Gini values of the first, third and second classes respectively.

According to the multi-level classification method of the present invention, the step of evaluating model performance may comprise the following four approaches:

adding the attributes in the same layer as the discriminant analysis model established in step (c), so as to increase the discriminating ability of that original layer;

adding the attributes to the third class (NodeN) and adding a new layer to build a model; this model likewise uses the variable-homogeneity analysis parameter to screen out at least one cut-point, and the remaining undetermined original samples continue to be grouped into the first class (NodeA), the second class (NodeB), and the undetermined third class (NodeN);

setting the first class (NodeA) as undetermined, adding the attributes to the combination formed by the first class (NodeA) plus the undetermined third class (NodeN), and adding a new layer to build a model; this model likewise uses the variable-homogeneity analysis parameter to screen out at least one cut-point, and the remaining undetermined original samples continue to be grouped into the first class (NodeA), the second class (NodeB), and the undetermined third class (NodeN); or

setting the second class (NodeB) as undetermined, adding the attributes to the combination formed by the second class (NodeB) plus the undetermined third class (NodeN), and adding a new layer to build a model; this model likewise uses the variable-homogeneity analysis parameter to screen out at least one cut-point, and the remaining undetermined original samples continue to be grouped into the first class (NodeA), the second class (NodeB), and the undetermined third class (NodeN).

From the above it can be seen that the present invention provides a new discriminant analysis model structure and method. Like a tree classifier, it partitions the data layer by layer from the top down; unlike a tree structure, however, each layer of this discriminant analysis model classifies some of the data into one or two classes and leaves the undetermined data to the next layer, and each layer may select several attributes and combine them linearly by Fisher discriminant analysis.

In other words, the present invention uses the method described above to establish a new multi-layer discriminant analysis model in which a layer may discriminate only one class (NodeA or NodeB) or may discriminate both classes (NodeA and NodeB), and the samples whose class has not yet been decided are left to the next layer for discrimination. The analysis of this discriminant model includes methods and criteria for selecting the effective variables and finding the cut-points developed at each layer; the performance-evaluation step considers the overall performance when a new attribute is added, so as to decide how the model should be constructed; and a stop criterion is established to avoid over-fitting.

Accordingly, the present invention also provides an attribute-selection and cut-point determination method in which the discriminant analysis model, when adding a new attribute, considers the performance of the overall model in order to decide how the discriminant analysis model should be built and when it should stop, thereby greatly improving classification accuracy.

[Embodiments]

Fig. 2 is a schematic diagram showing the architecture of a computer-readable recording medium that can execute the multi-level classification method of the multi-layer discriminant analysis model of the present invention.

As shown in Fig. 2, the computer-readable recording medium 1 comprises a display device 13, a processor 12, a memory 11, an input device 14, and a storage device 15. The input device 14 can be used to input images, text, instructions and other data into the computer. The storage device 15 is, for example, a hard disk, an optical disk drive, or a remote database connected through the Internet, and is used to store system programs, application programs, user data and the like; the memory 11 temporarily stores data or programs being executed; the processor 12 performs computation and data processing; and the display device 13 displays the output data.

The computer-readable recording medium shown in Fig. 2 generally executes various application programs, such as word processors, drawing programs, scientific computation programs, browsers, and e-mail programs, under an operating system. In the present embodiment, the storage device 15 stores a program that causes the computer-readable recording medium to execute the multi-level classification method. When the method is to be executed, the corresponding program is loaded into the memory 11 and executed in cooperation with the processor 12. Finally, the data related to the classification result are shown on the display device 13 or stored in a remote database through the Internet.

A flow diagram of the method of the present invention is shown in Fig. 1a, and the architecture of the multi-layer discriminant analysis model it establishes is shown in Fig. 1b. Like a classification tree, the model keeps partitioning the data from the top down. Unlike a classification tree, however, the multi-level classification method of the present invention makes a decision at every layer for some or all of the plurality of original samples: the samples already classified (NodeA or NodeB) do not enter the model of the next layer, and only the samples judged undetermined (NodeN) at this layer go to the next layer, where new attributes are added to discriminate them. Each layer may decide only one class or both classes. If only one class is to be decided, only one cut-point is needed to split the samples into two parts, one part classified at this layer and the other part undetermined and left to the next layer; if both classes are to be decided, two cut-points are needed, cutting the data into three parts: one part is the first class (NodeA), one part is the second class (NodeB), and the remaining part is the undetermined third class (NodeN). Each time a new attribute is to be added, the performance of the overall model is considered in deciding whether the new attribute should be combined into an existing layer to improve that layer's decisions, or should be added in a new layer to discriminate the samples not yet classified. New attributes are added to the model continually until the model reaches the stop criterion.

The multi-level classification method of the present invention and the multi-layer discriminant analysis model it establishes are detailed below. First, a plurality of original samples is received. For these original samples, an attribute must first be selected from the plurality of attributes, and a multivariate parameter is used to evaluate the significance of the attributes on the original samples. For the attributes evaluated as significant, a variable-homogeneity analysis parameter is provided to screen out at least one cut-point, and the cut-point determines to which class (NodeA, NodeB, or NodeN) each sample in the model is assigned, or whether it is left to the next layer; preferably the most significant attribute after evaluation is chosen. Selecting the attributes and determining these cut-points is therefore very important. Afterwards, the step of evaluating model performance is applied to the discriminant analysis model established above, namely by adding further attributes to the model and comparing two models: one obtained by adding an attribute to the original discriminant analysis model and combining it by Fisher linear discriminant analysis (hereinafter FLD), and the other obtained by adding a new layer.

[Attribute and multivariate parameter selection]

For attribute selection, ringPDVImax, VeinCentralVImin,

VeinTDCentralVImax, TDVImax, CI, RMV, CI2, MCI3, and MI2 may be used. For the multivariate parameter, two criteria are available: one is Wilk's lambda, commonly used in multivariate statistical methods to test whether the class means differ, and the other is the Gini index, used on classification trees to evaluate impurity.
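Before the two criteria are detailed, the layer-by-layer flow described above can be sketched in a few lines. This is a hypothetical, much-simplified illustration: each layer scores the still-undecided samples, assigns samples below the lower cut-point to NodeA and above the upper cut-point to NodeB, and passes the middle NodeN on to the next layer. All names, scores, and cut-points are invented for the example.

```python
def classify_multilayer(scores_per_layer, cuts_per_layer):
    """scores_per_layer[k][i]: discriminant score of sample i at layer k
    (e.g. an FLD combination of that layer's attributes);
    cuts_per_layer[k]: the pair (c1, c2) chosen for layer k."""
    n = len(scores_per_layer[0])
    labels = ["N"] * n                      # every sample starts undecided (NodeN)
    for scores, (c1, c2) in zip(scores_per_layer, cuts_per_layer):
        for i in range(n):
            if labels[i] != "N":
                continue                    # already in NodeA/NodeB: frozen
            if scores[i] <= c1:
                labels[i] = "A"             # first class, NodeA
            elif scores[i] > c2:
                labels[i] = "B"             # second class, NodeB
    return labels

# Layer 1 separates the extremes; layer 2 resolves the middle samples
layer1 = [0.1, 0.3, 0.5, 0.55, 0.7, 0.9]
layer2 = [0.2, 0.2, 0.1, 0.9, 0.8, 0.8]
labels = classify_multilayer([layer1, layer2], [(0.3, 0.7), (0.4, 0.6)])
```

Consistent with the stop criterion above, after the last layer the NodeN group here is empty: every sample ends in NodeA or NodeB.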

Wilk's lambda

Suppose there are g classes and p attributes, and test

H0: mu1 = mu2 = ... = mug versus H1: H0 is not true

where H0 is the null hypothesis, H1 is the alternative hypothesis, and mu_k is the mean vector of class k.

Wilk's lambda is

Λ = |W| / |B + W| = 1 / |I + W^-1 B| = Π_i 1 / (1 + λi)   (Equation 1)

where W is the within-group variation matrix, B is the between-group variation matrix, I is the identity matrix, and the λi are the eigenvalues of W^-1 B.

When H0 is true, Λ after a suitable transformation follows an F distribution (Equation 2):

F = ((1 - Λ^(1/s)) / Λ^(1/s)) (m2 / m1) ~ F(m1, m2)   (Equation 2)

with m1 = p(g - 1), s = sqrt((p^2 (g - 1)^2 - 4) / (p^2 + (g - 1)^2 - 5)), and m2 = s[n - (p + g + 2)/2] - [p(g - 1) - 2]/2. For g = 2 classes the statistic simplifies to

F = ((1 - Λ) / Λ) ((n - p - 1) / p) ~ F(p, n - p - 1)

Wilk's lambda can also be converted into a chi-square statistic (Bartlett's χ2 statistic):

χ2 = -[(n - 1) - (p + g)/2] ln Λ ~ χ2 with p(g - 1) degrees of freedom   (Equation 3)

When the number of classes is small, the F statistic behaves better than the chi-square statistic. Since the multi-layer discriminant analysis is preferably applied to two classes, the F statistic is used here.

The present invention can compare, for each attribute, the p-value computed from the F statistic above; the smaller the p-value, the more significant the average difference of that attribute between the classes, so comparing the p-values of all attributes selects the most significant one. To select an additional attribute into the same layer, the p-value obtained by combining each new attribute with the existing attributes is compared, and the variable whose combination with the existing attributes gives the smallest p-value is chosen.

Gini Index

Since a better or best attribute and cut-point must be searched for at every split, a splitting criterion is needed to evaluate the effectiveness of the attribute and cut-point; the most common such criterion is the Gini index. The Gini index measures impurity, so the smaller the Gini index the better. Each attribute paired with a corresponding cut-point yields a Gini index, so for each attribute a best corresponding cut-point can be searched. When selecting variables, comparing each attribute together with its best cut-point by the resulting Gini index selects the best attribute and cut-point for the split.

Suppose there are g classes. The Gini index is defined as

Gini(t) = Σ_{i≠j} p(i|t) p(j|t) = 1 - Σ_j p(j|t)^2   (Equation 4)

and the impurity of a binary split is

Impurity = (NL/N) Gini(tL) + (NR/N) Gini(tR)

where p(j|t) is the proportion of class j at node t, NL is the number of samples at the left node, NR is the number of samples at the right node, and N = NL + NR is the total number of samples.

Here the multi-layer discriminant analysis model of the present invention differs from a classification tree: a classification tree makes a binary split at each node, but in the multi-layer discriminant analysis model each layer must split the data into three nodes, so the impurity computation is changed to

Impurity = (NL Gini(tL) + NM Gini(tM) + NR Gini(tR)) / (NL + NM + NR)   (Equation 5)

The present invention can compare, for each attribute, the impurity obtained with its best set of cut-points, and select the attribute with the smallest impurity. To add a new attribute into the same layer, the discriminant score obtained by combining the new attribute with the existing attributes through FLD is used to compute the impurity, and the attribute with the lowest impurity in combination with the existing attributes is selected.

[Cut-point selection]

There are three methods for selecting the cut-points: the Gini index, the Mahalanobis distance, and Youden's Index.
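Equations 4-5 can be illustrated with a short numeric check: the Gini value of each node, and the three-node impurity of a layer that cuts a one-dimensional score at two cut-points c1 < c2 (left = NodeA, middle = undetermined NodeN, right = NodeB). The scores, labels, and cut-points below are invented for the example.

```python
def gini(labels):
    """Gini value of a node (Equation 4); lower means purer."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def three_node_impurity(scores, labels, c1, c2):
    """Sample-weighted impurity of the three nodes produced by c1 < c2 (Equation 5)."""
    left = [y for s, y in zip(scores, labels) if s <= c1]
    mid = [y for s, y in zip(scores, labels) if c1 < s <= c2]
    right = [y for s, y in zip(scores, labels) if s > c2]
    return (len(left) * gini(left) + len(mid) * gini(mid)
            + len(right) * gini(right)) / len(labels)

scores = [0.1, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9]
labels = ["A", "A", "A", "B", "A", "B", "B"]
# Pure outer nodes; only the mixed middle node (NodeN) contributes impurity
imp = three_node_impurity(scores, labels, 0.4, 0.6)
```

With these cut-points the left and right nodes are pure, so the whole impurity comes from the two mixed middle samples: 2 x 0.5 / 7 = 1/7.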

Gini IndexGini Index

在使用Gini index選擇屬性時,每個屬性都要選擇—組切 點來搭配才能得到其不純度,所以需要一個方法選擇—組 取好的切點來搭配,以得到最低的不純度,若是在分類樹 裡,只需尋找一個切點,所以在分類樹裡尋找切點的方法 為把所有可能的切點都試過一次,找出一個不純度最低的 切點。然而,在本發明多層判別分析模型中,例如需要得 到兩個切點來把資料分成三群,在此情形時假設會存在有 #個樣本,尋找一個切點只需要試驗W種可能,若要找兩個 切點則會有種可能,在樣本數很大時,料以試 過所有可能的切點來找兩個切點會非常慢,所以本發明為 解決則述問題,並發展一個快速搜尋出兩個切點的方法。 首先,像一般的分類樹一樣,先搜尋所有可能找一個 斤有資料刀成兩群後不純度最低的切點,C。,然後利用C σ 、把=貝料切成N〇de,和N〇de;f。在Node,裡,再搜尋一個可以 把N〇dei分成兩群後不純度最低的切點,c:同樣的,在NodeWhen using Gini index to select attributes, each attribute must be selected - group cut points to match to get its impureness, so you need a method selection - the group takes good cut points to match, to get the lowest impurity, if it is in the classification tree In that, just look for a tangent point, so the way to find the tangent point in the classification tree is to try all possible cut points and find a cut point with the lowest purity. However, in the multi-layer discriminant analysis model of the present invention, for example, it is necessary to obtain two tangent points to divide the data into three groups. In this case, it is assumed that there are # samples, and it is only necessary to test W types to find a tangent point. There is a possibility that there will be a cut point. When the number of samples is large, it is very slow to find all the possible tangent points to find two tangent points. Therefore, the present invention solves the problem and develops a fast search for two tangent points. Methods. First of all, like the general classification tree, first search for all the cut points that may find a kilogram with the data knife into two groups with the lowest purity, C. Then, using C σ , cut the = material into N〇de, and N〇de; f. In Node, search for another cut point that can divide N〇dei into two groups with the lowest purity, c: the same, in Node

中也搜尋—彻-r t K 可以把⑽屿分成兩群後不純度最低的切點, ’如圖3所示。 201126354 如此來可得到二個候選切點C(i,c,,與c,,用這三個 候選切點可以組成(C。’ (d2)與(Cq,C2)三種切點組 合,比較這三種切點組合把資料切成三群後的不純度,選 出一個最佳的切點組合即可,較佳為把同質性高的樣本放 在左右兩惻。因此,在搜尋^時會設下限制,用q切出來的 兩群資料裡,比較遠離C。的那群資料不純度要比另一群資 料高。基於同前理由,在搜尋6時也要設下一樣的限制。 也就疋《兒,G/m(/A/)<G/mU , 。若以此搜尋演 算法,只需搜尋大約2N次來尋找那三個候選切點,再比較 三種組合即可。 馬氏距離(Mahalanobis distance) 根據本發明之另一個切點選擇的方法為馬氏距離,其 與歐氏距離(Euclidean distance)的差別在於馬氏距離考慮的不 只是類別間中心點差異,還會考慮各個類別的散佈情形, 舉例來說,若有一個樣本距離A類別跟B類別的中心都一樣 达’若A類別的變異數比較大’散佈情形的比較分散,b類 別的變異數比較小’散佈情形很集中,那此樣本離A類別的 馬氏距離就會比離B類別的馬氏距離來的小,故因此而認為 其比較屬於A類別。 以下將詳細介紹利用馬氏距離應用於分類上之方法, 首先’假設現在有2個類別,則可以算出距離a類別的馬氏 距離為··仏⑴=yJ(X -» 5,而距離B類別的馬氏距離 為.dh(x、=扣-…)rsk'xD 其中 M/Gp/k,.·./、)為 A類別的 16 201126354 平均數為A類別的共變異數矩陣(c〇variance爪献), A=(/WW、)為B類別的平均數,類別的共變異數 車田0七)< 則屬於A類別,而則屬於b類 別。 仁在本發明之多層判別分析模型令,本發明將複數個 樣本刀成A,B兩類別(N〇deA' NodeB)與未決定(N〇deN)這三群, 故將原本q(J〇<Aj(Y),屬於A類別的樣本數挑出來,然後利 用這些原本已經判斷為A類別的樣本算一個新的 f^A\ λέβ\ L ’ A, ’接著把這些已判斷為Α類別的樣本用新的平均數和 變異數再計算一次馬氏距離: ,DHi(x)=^-Mm)rs-](x_Mfji) 若仏U)<〜⑺,則屬於A類別;而Α|(χ)>ζ^(χ),.則屬於 未決定。 同樣的’把原本&〇〇>&00 ’屬於B類別的樣本數挑出 來,然後利用這些原本已經判斷為8類別的樣本算一個新的 仏2 ’ ’ ‘S«2 ’然後把這些已判斷為B類別的樣本用新 的平均數和變異數再計算一次馬氏距離, D,2M = ^MA2yS;A'2(x~P/l2) , Dh2{x) = yli^2ys;\(x-Me2) 若%00>仏0〇 ’則屬於B類別,而Z)42(x)<仏w,則屬 於未決定。 另須注意的是’在本發明多層判別分析模型中若使用 馬氏距離找切點時,主要係為了把資料用分成比較屬於A 頌別的和比較屬於B類別的,再利用這兩個資料子集合來求 201126354 出我們要的切點,但當此兩筆資料子集合為如圖4a之情形 時,由於其結點之選擇會影響非分類的準確度,為了要改 善此情形中樣本數差距極大造成馬氏距離切點的不可靠, 可進一步利用Gini index來修正馬氏距離,如圖4b所示。首 先,先用Gini index找一個切點,然後把資料用此切點分成兩 邊,比較各個類別在這兩邊所占的比例,若是A類別在左邊 所占的比例大於右邊所占的比例,則把右邊資料的A類別移 除,反之則把左邊資料的A類別移除。同樣的,B類別也比 較其在左右兩邊所占的比例,移除掉比例比較小那邊的B 類別。然後用剩餘的A類別、B類別來重新計算平均及變異 數,即可得到一個經過Gini index修正後的馬氏距離。The middle search also - the -r t K can divide the (10) island into two groups and the cut point with the lowest purity, as shown in Figure 3. 201126354 So we can get two candidate cut points C(i,c,, and c, which can be composed of these three candidate cut points (C.' 
(d2) and (Cq, C2) three kinds of tangent points combination, compare these three kinds of tangent point combinations After cutting the data into three groups, the purity is selected, and an optimal combination of tangent points is selected. It is preferable to place the samples with high homogeneity on the left and right sides. Therefore, when searching for ^, the limit is set, and the limit is set by q. In the two groups of data that came out, the data that is farther away from C is less pure than the other group. For the same reason, the same restriction should be set when searching for 6. That is, "G, m" (/A/) <G/mU, . If you use this search algorithm, you only need to search about 2N times to find those three candidate cut points, and then compare the three combinations. Mahalanobis distance According to the present invention Another method of tangent point selection is the Mahalanobis distance, which differs from the Euclidean distance in that the Mahalanobis distance considers not only the difference in the center point between the categories, but also the distribution of each category. For example, If there is a sample distance A and B The other centers are the same. 'If the A category has a large number of variations, the distribution is more scattered, and the b category has a smaller variation. The distribution is very concentrated. The sample is farther away from the A category. The Markov distance of the category is small, so it is considered that the comparison belongs to the category A. The method of applying the Mahalanobis distance to the classification is described in detail below. First, it is assumed that there are two categories, and the distance a category can be calculated. 
The Mahalanobis distance to class A is d_A(x) = √((x − μ_A)ᵀ S_A⁻¹ (x − μ_A)), and the Mahalanobis distance to class B is d_B(x) = √((x − μ_B)ᵀ S_B⁻¹ (x − μ_B)), where μ_A is the mean vector of class A, S_A is the covariance matrix of class A, μ_B is the mean vector of class B, and S_B is the covariance matrix of class B. If d_A(x) < d_B(x), the sample belongs to class A; otherwise it belongs to class B.

In the multi-layer discriminant analysis model of the present invention, however, the plurality of samples is divided into three groups: class A (NodeA), class B (NodeB), and undetermined (NodeN). The samples with d_A(x) < d_B(x), i.e. those originally judged to belong to class A, are therefore picked out, and a new mean vector μ_A1 and covariance matrix S_A1 are computed from them; the Mahalanobis distance of these samples is then recalculated with the new mean and covariance: D_A1(x) = √((x − μ_A1)ᵀ S_A1⁻¹ (x − μ_A1)). If D_A1(x) is the smaller distance, the sample belongs to class A; otherwise, it is undetermined.

Similarly, the samples originally judged to belong to class B are picked out, a new mean vector μ_B2 and covariance matrix S_B2 are computed from them, and the Mahalanobis distance of these samples is recalculated with the new mean and covariance: D_B2(x) = √((x − μ_B2)ᵀ S_B2⁻¹ (x − μ_B2)). If D_B2(x) is the smaller distance, the sample belongs to class B; otherwise, it is undetermined.

It should also be noted that when the Mahalanobis distance is used to find cut points in the multi-layer discriminant analysis model of the present invention, its main purpose is to divide the data into samples leaning more toward class A and samples leaning more toward class B.
These two data subsets are then used to obtain the desired cut points. However, when the two data subsets are in the situation of Fig. 4a, the choice of nodes affects classification accuracy. To remedy the unreliable Mahalanobis cut points caused by the large gap in sample sizes in this situation, the Gini index can further be used to correct the Mahalanobis distance, as shown in Fig. 4b. First, a cut point is found with the Gini index, and the data are split into two sides at that cut point; the proportion each class occupies on the two sides is then compared. If class A occupies a larger proportion on the left side than on the right, the class-A samples on the right side are removed; otherwise, the class-A samples on the left side are removed. Likewise, the proportions of class B on the two sides are compared, and the class-B samples on the side with the smaller proportion are removed. The remaining class-A and class-B samples are then used to recompute the means and variances, giving a Gini-index-corrected Mahalanobis distance.
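As a rough sketch of this correction step (a simplified one-dimensional illustration, not the patent's exact procedure; the function name and the toy data are hypothetical):

```python
import numpy as np

def gini_corrected_stats(x, y, cut, cls):
    """Keep the samples of class `cls` only on the side of `cut` where that
    class occupies the larger proportion, then return the mean and variance
    recomputed from the surviving samples (one-dimensional sketch)."""
    left, right = x <= cut, x > cut
    frac_left = (y[left] == cls).mean() if left.any() else 0.0
    frac_right = (y[right] == cls).mean() if right.any() else 0.0
    keep = left if frac_left >= frac_right else right
    vals = x[keep & (y == cls)]
    return vals.mean(), vals.var(ddof=1)

# Class A sits mostly left of the Gini cut at 5; the stray A at x=8 is removed
# before the class-A mean and variance are recomputed.
x = np.array([0.0, 1.0, 2.0, 8.0, 9.0, 10.0])
y = np.array(["A", "A", "A", "A", "B", "B"])
mean_a, var_a = gini_corrected_stats(x, y, cut=5.0, cls="A")  # 1.0, 1.0
```

The corrected mean and variance would then feed the Mahalanobis distance in place of the raw class statistics.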

Youden's index

First, define Youden's index = specificity + sensitivity − 1, where specificity is the proportion of all class-A samples in the provided original samples that are judged correctly, and sensitivity is the proportion of all class-B samples in the provided original samples that are judged correctly; the higher the Youden's index, the better.

The cut-point search is similar to the one using the Gini index. First, all possibilities are searched for the cut point C0 that splits all the data into two groups with the highest Youden's index; C0 divides the data into two nodes. Within the first node, a cut point C1 is searched for that splits it into two groups with the highest Youden's index; likewise, within the other node a cut point C2 is searched for that splits it into two groups with the highest Youden's index. In this way, three candidate cut points C0, C1, and C2 are obtained, from which the three cut-point pairs (C0, C1), (C1, C2), and (C0, C2) can be formed; the Youden's indices of the three-group splits produced by the three pairs are compared, and the best pair is selected.

When the data are split into three groups, the undetermined portion means that the calculation of specificity and sensitivity must be modified.
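The Gini-index search described earlier and the Youden's-index search here share the same candidate structure: one global cut C0, one refinement cut inside each resulting node, and a comparison of the three pairs, taking roughly 2N evaluations. A minimal sketch with the Gini criterion (helper names are illustrative; a Youden-based score can be substituted for the impurity):

```python
import numpy as np

def gini(labels):
    """Gini impurity of one node: 1 - sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_binary_cut(x, y):
    """Cut point minimizing the size-weighted Gini impurity of a 2-way split."""
    best_c, best_imp = None, np.inf
    for c in np.unique(x)[:-1]:          # candidate thresholds
        left, right = y[x <= c], y[x > c]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if imp < best_imp:
            best_c, best_imp = c, imp
    return best_c

def three_group_impurity(x, y, lo, hi):
    """Impurity of the 3-way split x<=lo / lo<x<=hi / x>hi."""
    groups = [y[x <= lo], y[(x > lo) & (x <= hi)], y[x > hi]]
    return sum(len(g) * gini(g) for g in groups) / len(y)

def two_cut_search(x, y):
    """~2N search: C0 on all data, C1/C2 inside each half, best pair wins."""
    c0 = best_binary_cut(x, y)
    c1 = best_binary_cut(x[x <= c0], y[x <= c0])
    c2 = best_binary_cut(x[x > c0], y[x > c0])
    pairs = [p for p in [(c1, c0), (c0, c2), (c1, c2)] if None not in p]
    return min(pairs, key=lambda p: three_group_impurity(x, y, *p))
```

For example, on a class-0 / class-1 / class-0 pattern the search recovers the two cut points bracketing the middle group.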

Specificity = (number of class-A samples judged correctly + 0.5 × number of class-A samples left undetermined) / total number of class-A samples; and

Sensitivity=(B類別判對樣本數+ 〇·5*未判別屬b類別樣 本數)/B類別總樣本數; 其後,再選取這三組切點中Y0Uden,sindex最高的一組切 點即可。 【評估模型效能】 在多層判別分析模型中’每次要加一個屬性進模型裡 時’可以下列四種不同方案進行評估之步驟。 首先,如圖5所不,假設已有一層由不構成的模型並 利用X,把樣本分成三群,分別為A類別,B類別,以及未決 定的樣本,分別以NodeA,N〇deB,NodeN來表示。 方案1 : 在原有的那一層新加入屬性<,跟A利用FLD組合,以 增加原有那一層的區別能力。 201126354 在Node、.上加入屬性1建一個模型,利用此模型來區別 在原有的層裡區別不出來的樣本。 方案3 : 把NodeA跟NodeN的樣本合併,以ν〇<ν表示此時原有 的那層A構成的模型只拿來區分出B類別,在N〇deAN上加入 屬性A建一個模型,利用此模型來區別在原有的層裡區別 不出來的樣本。 方案4 : 把Ν〇<^β跟N〇deN的樣本合併,以N〇d^表示,此時原有的 那層A構成的模型只拿來區分出A類別,在N〇deBN上加入屬 性七建-個模型’利用此模型來區別在原有的層裡區別不 出來的樣本。 ί停止條件| 在本發明多層判別分析模型的停止條件上可分為兩 種,-為決定是否要把未決定的樣本繼續往下分割,另一 為决定要不要在已存在的層裡加入新屬性。 在決定是否要繼續把未決定的樣本繼續往下分割之判 別,可利用在屬性選擇時提到的職,s !細恤,若不拒絕虛 無假設’代表在剩餘的樣本裡,找不到能把類別間顯著區 为開來的屬性,所以就停止繼續往下分割。 如前所述,另一停止條件為決定是否在原有的層裡加 入新屬性,由於模型原本已存在—些顯著的綠,若要在 加人新屬性時’此時需考量的不是加人新層性後整體模型 20 201126354 夠不夠顯著’而是考量新加入的屬性額外解釋了多少變 異。在此,可以參考迴歸分析法中之順向選擇法(forward selection)使用的partial F-test,其做法為檢定新加入一個屬性的 模型跟原始模型有沒有顯著差異。若拒絕了虛無假設,表 示加入新屬性的模型無顯著改善,不將此屬性加入模型。 其檢定模型如(式6) ί^〇 ' y = β〇+ β\^\+ (full model) = β〇+ β'χ\ (reduce model) f ^ ^.SR^Xi>X2)~ SSRjX^) ^ SSE{Xl,ΧΊ) SSR{X]) χΊ) SSE(X”X,) H 办1 dfn~dfF dfF 6) 其中, 也為flill model的自由度; 為reduce model的自由度; <厶〇,/5 i,A為變數的參數; 人SSR為-人方的解釋平均(eXpiaine(j sum SqUare) ·,以及 SSE為-人方的剩餘總和(e residuai sum SqUare)。 而在判別分析的順向選擇法,其模型如(式7) 若拒絕虛無假設,則表示模型不需加入此新屬性。 (full model) (reduce model) (式7) //〇 . d = ύ)χΧλ + 〇)2Χί Η{ . 
y = 〇)]Χ] w2) 若加入之新屬性夠顯著,還要用評估模型效能的方法 比較加入前和加入後整體模型的效能。反之,若加入新的 201126354 屬性後無法提升整體模型的效能的話,就停止加入新屬 性。需注意的是,本發明之多層次分類方法及所建立之多 層判別分析模型架構,在模型的最後一層要強迫對所有資 料進行分類’不能再留下未決定的樣本。 根據前述參數及設定條件,根據本發明之方法建立多 層判別分析模型詳細流程圖係如圖6所示。 夕 首先,在接受複數個原始樣本後(圖未示),利用wuk,s lambda或是Ginnndex選進一個最顯著的屬性,然後檢定這個 屬性是否有顯著區別出各個類別的能力。若是拒絕了虛無 假設,則代表此屬性具有解釋能力。再利用如前述之馬氏 距離或Gini index來找出此屬性最好的一組切點,把資料分成 第一類別(A類別,NodeA),第二類別(B類別,N〇deB)跟未決 定之第三類別(N〇deN)三群資料,然後就可以根據這三群資料 來s平估這個模型的效能。 接著,在選進第二個屬性時,要考慮把第二個屬性加 在哪個地方,在前述所提到四個方案,分別為:(方案说 原有的那層,找一個跟原有變數組合後最好的屬性與切 點’(方案2)用原本未決定那群樣本找—個最適合的的屬性 與切點]方案3)把A類別當成未決定,用A類別加上未決定Sensitivity=(B category judges the number of samples + 〇·5* unrecognized b category sample number) / B total sample number; Then, select the three sets of cut points in the Y0Uden, the highest set of cut points. [Evaluation of Model Effectiveness] In the multi-layer discriminant analysis model, the step of evaluation can be performed in the following four different schemes when adding an attribute into the model each time. First, as shown in Fig. 5, it is assumed that there is a layer of unconformed model and X is used to divide the sample into three groups, namely A category, B category, and undetermined samples, respectively, NodeA, N〇deB, NodeN. To represent. Option 1: Add the attribute < in the original layer, and use FLD combination with A to increase the difference of the original layer. 201126354 Add attribute 1 to Node, . to build a model, and use this model to distinguish samples that are not distinguishable in the original layer. Scheme 3: Combine the NodeA and NodeN samples, and use ν〇<ν to indicate that the model of the original layer A is only used to distinguish the B category, and add the attribute A to the N〇deAN to build a model. This model is used to distinguish between samples that are not distinguishable in the original layer. Scheme 4: Combine the samples of Ν〇<^β with N〇deN, denoted by N〇d^. 
At this time, the model formed by X1 in the existing layer is used only to separate class A; a model with attribute X2 is built on NodeBN and used to separate the samples the existing layer could not classify.

[Stop conditions]

The stop conditions of the multi-layer discriminant analysis model of the present invention are of two kinds: one decides whether the undetermined samples should be split further, and the other decides whether a new attribute should be added to an existing layer.

To decide whether to keep splitting the undetermined samples, the Wilks' lambda test mentioned in attribute selection can be used: if the null hypothesis is not rejected, no attribute can be found among the remaining samples that significantly separates the classes, so further splitting stops.

As mentioned above, the other stop condition decides whether a new attribute is added to an existing layer. Since the model already contains some significant attributes, what matters when a new attribute is added is not whether the whole model is significant enough, but how much additional variation the newly added attribute explains. Here, the partial F-test used in the forward selection of regression analysis can be referred to; it tests whether the model with the newly added attribute differs significantly from the original model. If the null hypothesis is rejected, the model with the new attribute shows no significant improvement, and the attribute is not added to the model.
The test model is as in (Eq. 6):

H0: y = β0 + β1X1 + β2X2 (full model)
H1: y = β0 + β1X1 (reduced model)

F* = [(SSR(X1, X2) − SSR(X1)) / (df_R − df_F)] / [SSE(X1, X2) / df_F]   (Eq. 6)

where df_F is the degrees of freedom of the full model; df_R is the degrees of freedom of the reduced model; β0, β1, β2 are the parameters of the variables; SSR is the explained sum of squares; and SSE is the residual sum of squares.

For the forward selection of discriminant analysis, the model is as in (Eq. 7); if the null hypothesis is rejected, the new attribute need not be added to the model:

H0: d = ω1X1 + ω2X2 (full model)
H1: d = ω1X1 (reduced model)   (Eq. 7)

If the newly added attribute is significant enough, the performance of the overall model before and after the addition is further compared with the model-performance evaluation method described above. Conversely, if adding the new attribute cannot improve the performance of the overall model, no new attribute is added. Note that in the multi-level classification method of the present invention and the multi-layer discriminant analysis model it establishes, the last layer of the model is forced to classify all the data; no undetermined samples may remain.

Based on the foregoing parameters and conditions, a detailed flowchart for building the multi-layer discriminant analysis model according to the method of the present invention is shown in Fig. 6.

First, after a plurality of original samples is received (not shown), the most significant attribute is selected with Wilks' lambda or the Gini index, and this attribute is tested for the ability to significantly separate the classes. If the null hypothesis is rejected, the attribute has explanatory power.
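A sketch of the partial F-test of (Eq. 6) for the regression case (the toy data and helper are illustrative; the discriminant-analysis variant of (Eq. 7) follows the same pattern):

```python
import numpy as np

def partial_f(y, X_reduced, X_full):
    """Partial F statistic comparing a full regression model against a reduced
    one: F* = [(SSE_R - SSE_F)/(df_R - df_F)] / (SSE_F/df_F), equivalent to
    (Eq. 6) since SSR(X1,X2) - SSR(X1) = SSE_R - SSE_F for a fixed total sum
    of squares.  An intercept column is added to both designs here."""
    def fit_sse(X):
        A = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r = y - A @ beta
        return float(r @ r), len(y) - A.shape[1]
    sse_r, df_r = fit_sse(X_reduced)
    sse_f, df_f = fit_sse(X_full)
    return ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)

# Does adding x2 explain significantly more of y than x1 alone?
x1 = np.array([0.0, 1.0, 2.0, 3.0])
x2 = np.array([0.0, 1.0, 0.0, 1.0])
y = np.array([0.0, 2.0, 2.0, 5.0])
F = partial_f(y, x1.reshape(-1, 1), np.column_stack([x1, x2]))  # 5.0
```

The resulting F* would be compared against an F distribution with (df_R − df_F, df_F) degrees of freedom to decide significance.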
The Mahalanobis distance or Gini index described above is then used to find the best set of cut points for this attribute, and the data are divided into three groups: the first class (class A, NodeA), the second class (class B, NodeB), and the undetermined third class (NodeN). The performance of the model can then be evaluated on these three groups.

Next, when the second attribute is selected, where to add it must be considered. The four schemes mentioned above are: (Scheme 1) in the existing layer, find the attribute and cut points that work best in combination with the existing variable; (Scheme 2) find the most suitable attribute and cut points using the samples left undetermined; (Scheme 3) treat class A as undetermined, and use the class-A samples plus the undetermined

samples to find the most suitable attribute and cut points; and (Scheme 4) treat class B as undetermined, and use the class-B samples plus the undetermined samples to find the most suitable attribute and cut points.

After the attribute is selected under each scheme, its significance is tested with Wilks' lambda.
A scheme whose selected attribute is judged insufficiently significant discards that attribute; the model-performance evaluation described above is then used to assess the overall model under each remaining scheme. If Scheme 1 performs best, the new attribute is added to the existing layer. If Scheme 2 performs best, a new layer model is built from the undetermined samples left by the upper layer. If Scheme 3 or Scheme 4 is best, then in the upper layer class A (or class B) is treated as undetermined, a new layer model is built from all the remaining undetermined samples, and the upper-layer model is converted to cut at only one cut point, judging class A or class B alone rather than both classes in the same layer.

If the current model already has n layers, then when a new attribute is to be added there are n cases of adding it to an existing layer, plus Schemes 2, 3, and 4, giving n + 3 cases to consider. If the newly added variable is not significant in any of the n + 3 cases, the model stops. If some cases pass, the best-performing one is selected, and it is checked whether the overall model improves after this additional attribute. If there is no improvement, no more variables are added; if there is, new attributes keep being added to the model until the model performance no longer improves.

In summary, the present invention provides a systematic variable selection method for the multi-layer discriminant analysis model: variables can be selected using the p-value obtained by converting Wilks' lambda into an F distribution, or using the Gini index. For deciding the cut points, methods such as the Mahalanobis distance and the Gini index are also provided. When the Gini index is used to decide the cut points, at least one cut point must be found, and searching all possible cut-point combinations would be very time-consuming, so the present invention also provides a faster method of searching for the desired cut points.
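Returning to the layer-building loop: the n + 3 candidate placements tried at each step can be enumerated structurally as follows (a skeletal sketch; all names and the node lists are hypothetical):

```python
def candidate_placements(n_layers, node_a, node_b, node_n):
    """Enumerate where a new attribute may be tried: inside each of the n
    existing layers (Scheme 1, one case per layer), or as a new layer on
    NodeN (Scheme 2), NodeA+NodeN (Scheme 3), or NodeB+NodeN (Scheme 4)."""
    cases = [("scheme 1, layer %d" % i, node_a + node_b + node_n)
             for i in range(n_layers)]
    cases += [
        ("scheme 2: new layer on undetermined", node_n),
        ("scheme 3: new layer on A + undetermined", node_a + node_n),
        ("scheme 4: new layer on B + undetermined", node_b + node_n),
    ]
    return cases
```

Each candidate would then be fitted, tested with Wilks' lambda, and compared on overall model performance before one is kept.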
When the Mahalanobis distance is used to decide these cut points, the Mahalanobis distance is first used to divide all samples into those leaning toward class A and those leaning toward class B, and these two groups are then used to find two Mahalanobis-distance cut points. Because the data are first split into two groups by the Mahalanobis distance, the gap in sample sizes between the classes within each group is usually large, and this gap makes the Mahalanobis-distance cut points unreliable; the present invention therefore provides the use of the Gini index to correct the Mahalanobis distance and resolve this problem. Whenever a new attribute is added to the model, not only the performance of one layer but the performance of the overall model is considered before deciding where to add it. For the model's stop conditions, the use of, for example, Wilks' lambda is also provided to prevent over-fitting of the model, thus greatly improving classification accuracy.

[Embodiment 1]

In this embodiment, data with 100 samples, 2 classes, and 5 attributes (X1, X2, …, X5) are provided, where each attribute follows N(0,1); the class scatter plot is shown in Fig. 7b, and the preset model in Fig. 7a. The first layer is explained by X1, and the samples it cannot classify are left to the next layer, explained by X2.

The result obtained by the multi-layer discriminant analysis is shown in Fig. 7c; since the multi-layer discriminant analysis model has two cut-point search methods, the Gini index and the Mahalanobis distance, both are presented in the results. The result obtained by CART is shown in Fig. 7d. The multi-layer discriminant analysis using the Gini index to find cut points can be compared with CART, since both use the same cut-point criterion.
In the multi-layer discriminant analysis, the first layer uses X1 to separate class 0 and class 1: the class-0 node contains 24 class-0 samples and 0 class-1 samples, and the class-1 node contains 3 class-0 samples and 35 class-1 samples. In CART, however, as shown in Fig. 7d, the first layer uses X1 to separate class 1 (3 class-0 samples and 35 class-1 samples), and the second layer uses X1 again to separate class 0 (24 class-0 samples and 0 class-1 samples), so the classification results are the same. The difference is that the multi-layer discriminant analysis uses the attribute's ability to discriminate both classes (class 0 and class 1) within one layer, whereas CART can discriminate only one class per layer and must use the same attribute again in the next layer to discriminate the other class.

The results are presented in Table 1. As can be seen from Table 1, the multi-layer discriminant analysis using the Gini index performs as well as CART.

             Multi-layer FLD,      Multi-layer FLD,     CART     FLD
             cut points: Gini      cut points: MD
Accuracy     0.89                  0.85                 0.89     0.83

Table 1

[Embodiment 2]

In this embodiment, data with 200 samples, 2 classes, and 10 attributes (X1, X2, …, X10) are provided, where each attribute follows N(0,1). The preset model is shown in Fig. 8a: the first layer selects X1 and X2 and combines them into one FLD model, and the samples the first layer cannot classify are left to the second layer, explained by an FLD model combining X3 and X4.

The result obtained by the multi-layer discriminant analysis is shown in Fig. 8b, and the result obtained by CART in Fig. 8c.

The results of this embodiment are presented in Table 2; the multi-layer discriminant analysis, whether using the Gini index or the Mahalanobis distance to find cut points, is more accurate than both CART and FLD.

             Multi-layer FLD,      Multi-layer FLD,     CART     FLD
             cut points: Gini      cut points: MD
Accuracy     0.9                   0.885                0.83     0.88

Table 2

[Embodiment 3]

In this embodiment, data with 1000 samples, 2 classes, and 5 attributes (X1, …, X5) are provided, where each attribute follows N(0,1); the class scatter plot is shown in Fig. 9b, and the preset model in Fig. 9a. The first layer is explained by X1, and X1 has only the ability to classify class 0; the remaining unclassifiable samples are left to the next layer, explained by X2.

The result obtained by the multi-layer discriminant analysis is shown in Fig. 9c, and the result obtained by CART in Fig. 9d. Since the preset model can be regarded as a univariate tree structure, in this case the multi-layer discriminant analysis using the Gini index as the cut-point criterion gives the same result as CART.

The results of this embodiment are presented in Table 3; the multi-layer discriminant analysis using the Gini index performs as well as CART.

             Multi-layer FLD,      Multi-layer FLD,     CART     FLD
             cut points: Gini      cut points: MD
Accuracy     0.84                  0.835                0.84     0.785

Table 3

[Embodiment 4]

In this embodiment, data with 1000 samples, 2 classes, and 5 attributes (X1, X2, …, X5) are provided, where each attribute follows N(0,1). The preset model is shown in Fig. 10a. The first layer is explained by X1, which has only the ability to classify class 0; the remaining unclassifiable samples are left to the next layer, explained by X2 and X3.

The result obtained by the multi-layer discriminant analysis is shown in Fig. 10b, and the result obtained by CART in Fig. 10c.

The results of this embodiment are presented in Table 4; the multi-layer discriminant analysis using the Gini index performs best.

             Multi-layer FLD,      Multi-layer FLD,     CART     FLD
             cut points: Gini      cut points: MD
Accuracy     0.865                 0.795                0.85     0.79

Table 4

[實施例5] 在本實施例中提供了透過超音波掃描來得到一些腫瘤 影像的量化的屬性,再透過這些屬性來建構一個判別模 型,其中腫瘤影像樣本有160個’有108個以類別〇代表,52 個以類別1代表。 首先提供CI、έΐ、MI、HI、ringPDVImax這5個屬性做分析, 若直接使用費雪判別分析合併這5個屬性,得到的準確率為 0.793,使用多層判別分析的結果準確率則為0.8。此外,多 層判別分析只會使用其中四個變數,如圖11 a所示,且得到 的準確率比傳統的費雪判別分析高。 除上述5個屬性之外,根據本實施例可再加入其他屬性一起 分析。多層判別分析使用Gini index決定切點得到的結果如圖 27 201126354 1 1 b所示,準减率為0.906。多層判別分析使用Youden’s index 決定切點得到的結果則如圖1 1 c所示,準確率為0.801 2。 CART所得到的結果如圖lid所示,準確率為0.868。FLD使 用了 ringPDVImax、VeinCentralVImin、VeinTDCentralVImax、 TDVImax、Cl、RMV、CI2、MCI3、MI2這 9個屬性,得到的準 確率為0.843。如表5所示,多層判別分析得到的準確率最好。 多層FLD切 點:Gin i index 多層FLD切 點:Youden’s index CART FLD 準確率 0.906 0.801 0.868 0.838 表5 再者,本發明上述執行步驟,可以電腦語言寫成以便 執行,而此寫成之軟體程式可以儲存於任何微處理單元可 以辨識、解讀之紀錄媒體,或包含有此紀錄媒體之物品及 裝置。其不限為任何形式,此物品可為硬碟、軟碟、光碟、 ZIP、MO、1C晶片、隨機存取記憶體(RAM),或任何熟悉此 項技藝者所可使用之包含有此紀錄媒體之物品。由於本發 明之多層次分類方法已揭露完整如前,任何熟悉電腦語言 者閱讀本發明說明書即知如何撰寫軟體程式,故有關軟體 程式細節部分不在此贅述。 上述實施例僅係為了方便說明而舉例而已,本發明所 主張之權利範圍自應以申請專利範圍所述為準,而非僅限 於上述實施例。 28 201126354 【圖式簡單說明】 圖1a係本發明多❹丨別分析流程圖。 架構 圖1b係根據本發明之方法所建立之多層判別分析模型 示意圖。 、 圖2係顯不-電腦可紀錄媒體之架構的示意圖。 圖3係本發明一較佳實施例之搜尋⑶口丨切點示意圖。 圖4a-4b係本發明—較佳實施例之使用丨以以修正馬氏距 離示意圖。 圖5係本發明一比較模型之四種方式示意圖。 圖6係本發明多層判別分析模型詳細流程圖。 圖7a-7d係本發明實施例1示意圖。 圖8a-8c係本發明實施例2示意圖。 圖9a-9d係本發明實施例3示意圖。 圖l〇a-l〇c係本發明實施例4示意圖。 圖11 a-11 d係本發明實施例5示意圖。 【主要元件符號說明】 電腦可紀錄媒體1 記憶體11 處理器12 顯示裝置13 輸入裝置14 儲存裝置15 29[Embodiment 5] In the present embodiment, the quantized attributes of some tumor images are obtained by ultrasonic scanning, and then a discriminant model is constructed through these attributes, wherein there are 160 'with 108 categories of tumor image samples〇 Representatives, 52 are represented by category 1. Firstly, the five attributes of CI, έΐ, MI, HI, and ringPDVImax are provided for analysis. If the five attributes are directly combined using Fisher's discriminant analysis, the accuracy rate is 0.793, and the accuracy of using multi-level discriminant analysis is 0.8. 
In addition, the multi-layer discriminant analysis uses only four of these variables, as shown in Fig. 11a, and its accuracy is higher than that of the traditional Fisher discriminant analysis.

Beyond the above five attributes, other attributes can be added to the analysis according to this embodiment. The multi-layer discriminant analysis using the Gini index to decide the cut points gives the result shown in Fig. 11b, with an accuracy of 0.906; using Youden's index to decide the cut points gives the result shown in Fig. 11c, with an accuracy of 0.8012. The result obtained by CART is shown in Fig. 11d, with an accuracy of 0.868. FLD uses the nine attributes ringPDVImax, VeinCentralVImin, VeinTDCentralVImax, TDVImax, CI, RMV, CI2, MCI3, and MI2, with an accuracy of 0.843. As shown in Table 5, the multi-layer discriminant analysis gives the best accuracy.

             Multi-layer FLD,      Multi-layer FLD,        CART     FLD
             cut points: Gini      cut points: Youden
Accuracy     0.906                 0.801                   0.868    0.838

Table 5

Furthermore, the execution steps of the present invention described above can be written in a computer language for execution, and the software program so written can be stored on any recording medium that a processing unit can recognize and interpret, or in any article or device containing such a recording medium. The form is not limited: the article may be a hard disk, a floppy disk, an optical disc, ZIP, MO, an IC chip, random access memory (RAM), or any other article containing such a recording medium known to those skilled in the art. Since the multi-level classification method of the present invention has been fully disclosed above, anyone familiar with computer languages will know how to write the software program after reading this specification, so the details of the software program are not described here.
The above embodiments are given by way of example for convenience of description only; the scope of the claimed rights shall be as set out in the claims, and is not limited to the above embodiments.

[Brief description of the drawings]

Fig. 1a is a flowchart of the multi-layer discriminant analysis of the present invention.
Fig. 1b is a schematic diagram of the multi-layer discriminant analysis model architecture established according to the method of the present invention.
Fig. 2 is a schematic diagram of the architecture of a computer-recordable medium.
Fig. 3 is a schematic diagram of searching for Gini cut points according to a preferred embodiment of the present invention.
Figs. 4a-4b are schematic diagrams of using the Gini index to correct the Mahalanobis distance according to a preferred embodiment of the present invention.
Fig. 5 is a schematic diagram of the four schemes of a comparison model of the present invention.
Fig. 6 is a detailed flowchart of the multi-layer discriminant analysis model of the present invention.
Figs. 7a-7d are schematic diagrams of Embodiment 1 of the present invention.
Figs. 8a-8c are schematic diagrams of Embodiment 2 of the present invention.
Figs. 9a-9d are schematic diagrams of Embodiment 3 of the present invention.
Figs. 10a-10c are schematic diagrams of Embodiment 4 of the present invention.
Figs. 11a-11d are schematic diagrams of Embodiment 5 of the present invention.

[Description of main component symbols]

Computer-recordable medium 1; memory 11; processor 12; display device 13; input device 14; storage device 15

Claims (1)

201126354
VII. Claims:
1. A multi-layer classification method, applied on a computer-readable medium for classifying a plurality of image samples, the computer-readable medium comprising a processor, an input device, and a storage device, the method at least comprising the following steps:
(a) receiving a plurality of original samples;
(b) providing a plurality of attributes, and evaluating the significance of the attributes over the original samples by a multivariate parameter;
(c) selecting at least one cut-point and establishing a discriminant analysis model: taking one of the attributes found significant in step (b), a variable-homogeneity analysis parameter is provided to screen out the at least one cut-point, and the plurality of original samples covered by the significant attribute are grouped into at least one category to establish the discriminant analysis model, the at least one category comprising a first category (Node A), a second category (Node B), and an undetermined third category (Node N);
(d) performing a step of evaluating the model performance, in which the attributes are added to the discriminant analysis model for significance evaluation; when adding the attributes improves the significance of the discriminant analysis model, the method proceeds to the next layer of the discriminant analysis model, the variable-homogeneity analysis parameter again screens out at least one cut-point, and the plurality of original samples covered by the attributes found significant after being added to the discriminant analysis model continue to be grouped into the first category (Node A), the second category (Node B), and the undetermined third category (Node N); and
(e) adding a stop criterion: the stop criterion selects the variable-homogeneity analysis parameter, and if the null hypothesis is not rejected, the discriminant analysis model stops splitting into the next layer; or, in the step of evaluating the model performance, the added attributes are evaluated for significance by a regression analysis, and when adding the attributes cannot improve the significance of the discriminant analysis model and the null hypothesis is rejected, the discriminant analysis model stops splitting into the next layer.
2. The multi-layer classification method of claim 1, wherein, when the stop criterion is added, the number of samples contained in the undetermined third category (Node N) of the last classification layer of the discriminant analysis model is zero.
3. The multi-layer classification method of claim 1, wherein the multivariate parameter is Wilks' lambda or the Gini index.
4. The multi-layer classification method of claim 1, wherein the significance evaluation is computed as a p-value from an F statistic, the p-value representing the significance of the difference of the attribute means between the categories, or is judged by an impurity criterion;
wherein the F statistic is
F = ((1 - Λ) / Λ) × ((n - p - 1) / p),
and the impurity is
Impurity = (N_L × Gini(t_L) + N_M × Gini(t_M) + N_R × Gini(t_R)) / (N_L + N_M + N_R);
where n is the sample size, p is the number of attributes, and Λ is Wilks' lambda; N_L is the sample size of the first category, N_M the sample size of the third category, and N_R the sample size of the second category; t_L, t_M, and t_R are the Gini values of the first, third, and second categories, respectively.
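The F statistic and impurity formulas of claim 4 can be sketched in a few lines of Python. This is an illustration only, not part of the patent: the function names are ours, the univariate Wilks' lambda is computed as the within-group sum of squares over the total sum of squares, and the degrees of freedom (p, n − p − 1) follow the claim's F formula.

```python
import numpy as np

def wilks_lambda(groups):
    """Univariate Wilks' lambda: within-group sum of squares divided by
    the total sum of squares (smaller values mean better separation)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    pooled = np.concatenate(groups)
    ss_total = np.sum((pooled - pooled.mean()) ** 2)
    ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)
    return ss_within / ss_total

def f_statistic(groups, p=1):
    """F statistic of claim 4: F = ((1 - L) / L) * ((n - p - 1) / p),
    where n is the sample size and p the number of attributes.
    A p-value then follows from an F(p, n - p - 1) distribution."""
    lam = wilks_lambda(groups)
    n = sum(len(g) for g in groups)
    return ((1 - lam) / lam) * ((n - p - 1) / p)

def gini(labels):
    """Gini value of a node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    if counts.sum() == 0:
        return 0.0  # an empty node contributes no impurity
    props = counts / counts.sum()
    return 1.0 - float(np.sum(props ** 2))

def weighted_impurity(node_a, node_n, node_b):
    """Claim-4 impurity: sample-size-weighted Gini over the first category
    (Node A), the undetermined third category (Node N), and Node B."""
    nodes = [node_a, node_n, node_b]
    sizes = np.array([len(t) for t in nodes], dtype=float)
    return float(sum(s * gini(t) for s, t in zip(sizes, nodes)) / sizes.sum())
```

For two well-separated groups such as [1, 2, 3] and [11, 12, 13], Λ = 4/154 ≈ 0.026 and F = 150, so such an attribute would clear any conventional significance threshold.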
5. The multi-layer classification method of claim 1, wherein the attributes are at least one selected from the group consisting of ringPDVImax, VeinCentralVImin, VeinTDCentralVImax, TDVImax, CI, RMV, CI2, MCI3, and MI2.
6. The multi-layer classification method of claim 1, wherein the variable-homogeneity analysis parameter is the Gini index, the Mahalanobis distance, or Youden's index.
7. The multi-layer classification method of claim 1, wherein the step of evaluating the model performance comprises: adding the attributes to the same layer as the discriminant analysis model established in step (c), so as to increase the discriminating ability within that original layer of the discriminant analysis model.
8. The multi-layer classification method of claim 1, wherein the step of evaluating the model performance comprises: adding the attributes on the third category (Node N) and adding a new layer to establish a model, the model likewise screening out at least one cut-point by the variable-homogeneity analysis parameter and continuing to group the remaining undetermined plurality of original samples into a first category (Node A), a second category (Node B), and an undetermined third category (Node N).
9. The multi-layer classification method of claim 1, wherein the step of evaluating the model performance comprises: setting the first category (Node A) as an undetermined category, adding the at least one attribute to the combination formed by the first category (Node A) and the undetermined third category (Node N), and adding a new layer to establish a model, the model likewise screening out at least one cut-point by the variable-homogeneity analysis parameter and continuing to group the remaining undetermined plurality of original samples into a first category (Node A), a second category (Node B), and an undetermined third category (Node N).
10. The multi-layer classification method of claim 1, wherein the step of evaluating the model performance comprises: setting the second category (Node B) as an undetermined category, adding the attributes to the combination formed by the second category (Node B) and the undetermined third category (Node N), and adding a new layer to establish a model, the model likewise screening out at least one cut-point by the variable-homogeneity analysis parameter and continuing to group the remaining undetermined plurality of original samples into a first category (Node A), a second category (Node B), and an undetermined third category (Node N).
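Claims 6 through 10 repeatedly "screen out at least one cut-point by the variable-homogeneity analysis parameter." As an illustration only (the function names and the brute-force scan are ours, not the patent's), a pair of cut-points on a single attribute can be chosen by minimising the size-weighted Gini impurity of the resulting three-way split:

```python
import numpy as np

def gini(labels):
    """Gini value of a node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    props = counts / counts.sum()
    return 1.0 - float(np.sum(props ** 2))

def best_two_cutpoints(x, y):
    """Scan ordered pairs of candidate cut-points (midpoints between
    distinct attribute values) and keep the pair whose three-way split
    (Node A / undetermined Node N / Node B) minimises the
    size-weighted Gini impurity."""
    xs = np.unique(x)
    mids = (xs[:-1] + xs[1:]) / 2.0
    best_lo, best_hi, best_imp = None, None, np.inf
    for i, lo in enumerate(mids):
        for hi in mids[i:]:              # lo == hi yields an empty Node N
            masks = [x <= lo, (x > lo) & (x <= hi), x > hi]
            sizes = np.array([m.sum() for m in masks], dtype=float)
            imp = sum(s * gini(y[m]) for s, m in zip(sizes, masks) if s > 0)
            imp /= sizes.sum()
            if imp < best_imp:
                best_lo, best_hi, best_imp = lo, hi, imp
    return best_lo, best_hi, best_imp
```

For x = [1, 2, 3, 4, 5, 6] with labels [0, 0, 0, 1, 1, 1], the scan reaches zero impurity: everything above 3.5 forms a pure node and the samples between the two cut-points are themselves pure, so nothing needs to stay undetermined.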
11. The multi-layer classification method of claim 1, wherein the regression analysis comprises a partial F test used in a forward selection method.
12. A computer-readable medium for classifying a plurality of image samples, which classifies the image samples by a multi-layer classification method, the computer-readable medium comprising a processor, an input device, and a storage device, the multi-layer classification method at least comprising the following steps:
(a) receiving a plurality of original samples;
(b) providing a plurality of attributes, and evaluating the significance of the attributes over the original samples by a multivariate parameter;
(c) selecting at least one cut-point and establishing a discriminant analysis model: taking one of the attributes found significant in step (b), a variable-homogeneity analysis parameter is provided to screen out the at least one cut-point, and the plurality of original samples covered by the significant attribute are grouped into at least one category to establish the discriminant analysis model, the at least one category comprising a first category (Node A), a second category (Node B), and an undetermined third category (Node N);
(d) performing a step of evaluating the model performance, in which the attributes are added to the discriminant analysis model for significance evaluation; when adding the attributes improves the significance of the discriminant analysis model, the method proceeds to the next layer of the discriminant analysis model, the variable-homogeneity analysis parameter again screens out at least one cut-point, and the plurality of original samples covered by the attributes found significant after being added to the discriminant analysis model continue to be grouped into the first category (Node A), the second category (Node B), and the undetermined third category (Node N); and
(e) adding a stop criterion: the stop criterion selects the variable-homogeneity analysis parameter, and if the null hypothesis is not rejected, the discriminant analysis model stops splitting into the next layer; or, in the step of evaluating the model performance, the added attributes are evaluated for significance by a regression analysis, and when adding the attributes cannot improve the significance of the discriminant analysis model and the null hypothesis is rejected, the discriminant analysis model stops splitting into the next layer.
13. The computer-readable medium of claim 12, wherein, when the stop criterion is added, the number of samples contained in the undetermined third category (Node N) of the last classification layer of the discriminant analysis model is zero.
14. The computer-readable medium of claim 12, wherein the multivariate parameter is Wilks' lambda or the Gini index.
15. The computer-readable medium of claim 12, wherein the significance evaluation is computed as a p-value from an F statistic, the p-value representing the significance of the difference of the attribute means between the categories, or is judged by an impurity criterion;
wherein the F statistic is
F = ((1 - Λ) / Λ) × ((n - p - 1) / p),
and the impurity is
Impurity = (N_L × Gini(t_L) + N_M × Gini(t_M) + N_R × Gini(t_R)) / (N_L + N_M + N_R);
where n is the sample size, p is the number of attributes, and Λ is Wilks' lambda; N_L is the sample size of the first category, N_M the sample size of the third category, and N_R the sample size of the second category; t_L, t_M, and t_R are the Gini values of the first, third, and second categories, respectively.
16. The computer-readable medium of claim 12, wherein the attributes are at least one selected from the group consisting of ringPDVImax, VeinCentralVImin, VeinTDCentralVImax, TDVImax, CI, RMV, CI2, MCI3, and MI2.
17. The computer-readable medium of claim 12, wherein the variable-homogeneity analysis parameter is the Gini index, the Mahalanobis distance, or Youden's index.
18. The computer-readable medium of claim 12, wherein the step of evaluating the model performance comprises: adding the attributes to the same layer as the discriminant analysis model established in step (c), so as to increase the discriminating ability within that original layer of the discriminant analysis model.
19. The computer-readable medium of claim 12, wherein the step of evaluating the model performance comprises: adding the attributes on the third category (Node N) and adding a new layer to establish a model, the model likewise screening out at least one cut-point by the variable-homogeneity analysis parameter and continuing to group the remaining undetermined plurality of original samples into a first category (Node A), a second category (Node B), and an undetermined third category (Node N).
20. The computer-readable medium of claim 12, wherein the step of evaluating the model performance comprises: setting the first category (Node A) as an undetermined category, adding the attributes to the combination formed by the first category (Node A) and the undetermined third category (Node N), and adding a new layer to establish a model, the model likewise screening out at least one cut-point by the variable-homogeneity analysis parameter and continuing to group the remaining undetermined plurality of original samples into a first category (Node A), a second category (Node B), and an undetermined third category (Node N).
21. The computer-readable medium of claim 12, wherein the step of evaluating the model performance comprises: setting the second category (Node B) as an undetermined category, adding the attributes to the combination formed by the second category (Node B) and the undetermined third category (Node N), and adding a new layer to establish a model, the model likewise screening out at least one cut-point by the variable-homogeneity analysis parameter and continuing to group the remaining undetermined plurality of original samples into a first category (Node A), a second category (Node B), and an undetermined third category (Node N).
22. The computer-readable medium of claim 12, wherein the regression analysis comprises a partial F test used in a forward selection method.
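Putting claims 1 and 12 together, the layered flow — split, leave an undetermined Node N, push it down to the next layer, and stop when nothing is left undetermined — can be sketched as follows. The driver below is a hypothetical illustration: the per-layer (attribute, lo, hi) triples stand in for the cut-points that the patent's significance tests and homogeneity parameters would actually select.

```python
import numpy as np

def multilayer_classify(X, layers):
    """Hypothetical multi-layer driver. Each layer is a tuple
    (attribute_index, lo, hi): samples at or below lo go to Node A (0),
    samples above hi go to Node B (1), and samples in between remain
    undetermined (Node N) and fall through to the next layer."""
    n = X.shape[0]
    assigned = np.full(n, -1)            # -1 marks a still-undetermined sample
    undecided = np.ones(n, dtype=bool)
    for attr, lo, hi in layers:
        xa = X[:, attr]
        node_a = undecided & (xa <= lo)
        node_b = undecided & (xa > hi)
        assigned[node_a] = 0
        assigned[node_b] = 1
        undecided &= ~(node_a | node_b)
        if not undecided.any():          # claims 2/13: the last layer leaves Node N empty
            break
    return assigned
```

With two layers on two attributes, every sample ends up in Node A or Node B, mirroring the claim-2 stopping state in which the final undetermined category is empty.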
TW099101931A 2010-01-25 2010-01-25 Method for multi-layer classifier TWI521361B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW099101931A TWI521361B (en) 2010-01-25 2010-01-25 Method for multi-layer classifier

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW099101931A TWI521361B (en) 2010-01-25 2010-01-25 Method for multi-layer classifier

Publications (2)

Publication Number Publication Date
TW201126354A true TW201126354A (en) 2011-08-01
TWI521361B TWI521361B (en) 2016-02-11

Family

ID=45024492

Family Applications (1)

Application Number Title Priority Date Filing Date
TW099101931A TWI521361B (en) 2010-01-25 2010-01-25 Method for multi-layer classifier

Country Status (1)

Country Link
TW (1) TWI521361B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI564740B (en) * 2015-08-24 2017-01-01 國立成功大學 Mutually-exclusive and collectively-exhaustive (mece) feature selection method and computer program product

Also Published As

Publication number Publication date
TWI521361B (en) 2016-02-11

Similar Documents

Publication Publication Date Title
Santhanam et al. Application of K-means and genetic algorithms for dimension reduction by integrating SVM for diabetes diagnosis
Yu et al. An automatic method to determine the number of clusters using decision-theoretic rough set
US7801836B2 (en) Automated predictive data mining model selection using a genetic algorithm
Zhu et al. Balancing accuracy, complexity and interpretability in consumer credit decision making: A C-TOPSIS classification approach
Sun et al. An adaptive density peaks clustering method with Fisher linear discriminant
Alsheref et al. Automated prediction of employee attrition using ensemble model based on machine learning algorithms
Nathiya et al. An analytical study on behavior of clusters using k means, em and k* means algorithm
US20060047616A1 (en) System and method for biological data analysis using a bayesian network combined with a support vector machine
Hou et al. A new density kernel in density peak based clustering
Hsiao et al. Integrating MTS with bagging strategy for class imbalance problems
AlKubaisi et al. Multivariate discriminant analysis managing staff appraisal case study
CN106126973B (en) Gene correlation method based on R-SVM and TPR rules
Yotsawat et al. Improved credit scoring model using XGBoost with Bayesian hyper-parameter optimization
Oreški et al. Cost-sensitive learning from imbalanced datasets for retail credit risk assessment
TW201126354A (en) Method for multi-layer classifier
Li et al. Personal credit default discrimination model based on super learner ensemble
AU2012255722A1 (en) Computer-implemented method and system for detecting interacting DNA loci
Cai et al. Fuzzy criteria in multi-objective feature selection for unsupervised learning
Caplescu et al. Will they repay their debt? Identification of borrowers likely to be charged off
Amaratunga et al. Ensemble classifiers
Mukhopadhyay et al. Unsupervised cancer classification through SVM-boosted multiobjective fuzzy clustering with majority voting ensemble
AlSaif Large scale data mining for banking credit risk prediction
Huang et al. A Study of Genetic Neural Network as Classifiers and its Application in Breast Cancer Diagnosis.
Kutnjak et al. Applying the decision tree method in identifying key indicators of the Digital Economy and Society Index (DESI)
Mondal et al. Simultaneous clustering and gene ranking: A multiobjective genetic approach