TWI521361B - Method for multi-layer classifier


Publication number
TWI521361B
TWI521361B
Authority
TW
Taiwan
Prior art keywords
category
node
model
attributes
undetermined
Prior art date
Application number
TW099101931A
Other languages
Chinese (zh)
Other versions
TW201126354A
Inventor
張金堅
陳文華
陳正剛
陳炯年
何明志
戴浩志
吳明勳
巫信融
Original Assignee
安克生醫股份有限公司
Priority date
Filing date
Publication date
Application filed by 安克生醫股份有限公司
Priority to TW099101931A
Publication of TW201126354A
Application granted
Publication of TWI521361B


Description

Multi-level classification method

The present invention relates to a multi-level classification method, and more particularly to a classification method suitable for building a multi-layer discriminant analysis model and for determining attribute selection and cut points.

Classification methods have a very wide range of applications. In finance, for example, a bank reviewing credit-card applicants can assess whether an applicant is likely to become a bad debt; in medicine, tissue can be judged as normal or abnormal; and in marketing research, one can judge whether a given marketing approach will attract customers to purchase goods. Exploring classification methods is therefore an important part of the field of data mining.

A classification method is a form of supervised learning; a supervised learning method performs data mining with the target output values known, whereas the opposite case is called unsupervised learning, of which principal component analysis is an example. A classification method generally requires selecting appropriate attributes to build the classification model; for example, when height and weight are used to judge whether a person is male or female, height and weight are the attributes. When a classification model is built, the data are usually first divided into two groups, training samples and independent test samples: the training samples are used to build the classification model, and the independent test samples are used to verify whether the model is robust.
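By way of a minimal sketch of this workflow (assuming a tiny attribute matrix with height and weight as the attributes and using scikit-learn's train_test_split; none of these names or values come from the patent):

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical attributes (height, weight) and labels (1 = male, 0 = female).
X = np.array([[170.0, 65.0], [158.0, 50.0], [182.0, 80.0], [165.0, 55.0]])
y = np.array([1, 0, 1, 0])

# Training samples build the classification model; the held-out independent
# test samples verify whether the model is robust.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)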

At present, two existing classification methods are most common: Fisher linear discriminant analysis (FLD), widely used in multivariate statistical analysis, and classification and regression trees (CART). However, the inventors have found that with these classification methods, particularly in attribute selection, some attributes can discriminate only specific categories, which limits the accuracy of the classification; moreover, when the classification model is built, differences in the attributes chosen, or the absence of a performance evaluation of the discriminant analysis model used for the desired classification, further affect classification accuracy.

Therefore, a new multi-level classification method is urgently needed to solve the above problems.

The main object of the present invention is to provide a multi-level classification method in which a multi-layer discriminant analysis model searches, at each layer, for one or two cut points to classify one or two categories; each layer can use several attributes at the same time, and Fisher discriminant analysis is used to find the best linear combination of those attributes.

The present invention provides a multi-level classification method, implemented on a computer-recordable medium, for classifying a plurality of image samples. The computer-recordable medium includes a processor, an input device, and a storage device, and the method comprises at least:

(a) receiving a plurality of original samples;

(b) providing a plurality of attributes, and evaluating the significance of these attributes on the original samples using a multivariate parameter;

(c) selecting at least one cut point and establishing a discriminant analysis model: taking one of the attributes found significant in step (b), the at least one cut point is selected using a variable-homogeneity analysis parameter, and the plurality of original samples covered by the significant attribute are grouped into at least one category to establish the discriminant analysis model, wherein the at least one category includes a first category (NodeA), a second category (NodeB), and an undetermined third category (NodeN);

(d) performing a step of evaluating model performance, in which the attributes are added to the discriminant analysis model and their significance is evaluated; when adding these attributes improves the significance of the discriminant analysis model, the model proceeds to the next layer, where at least one cut point is again selected using the variable-homogeneity analysis parameter and the plurality of original samples covered by the newly significant attributes are further grouped into a first category (NodeA), a second category (NodeB), and an undetermined third category (NodeN); and

(e) adding a stopping condition: the stopping condition selects the variable-homogeneity analysis parameter, and if the null hypothesis is not rejected, the discriminant analysis model stops splitting into a further layer; alternatively, in the model-performance evaluation step, the attributes are assessed for significance by a regression analysis, and when adding these attributes cannot improve the significance of the discriminant analysis model, if the null hypothesis is rejected, the discriminant analysis model stops splitting into a further layer.

The invention also provides a computer-recordable medium for classifying a plurality of image samples, on which such a multi-level classification method is established to classify those image samples.

According to the multi-level classification method of the invention, when the stopping condition is applied, the number of samples contained in the undetermined third category (NodeN) in the last classification layer of the discriminant analysis model is zero; in other words, the final result of the multi-level classification method of the invention must assign all of the original samples to the first category (NodeA) and/or the second category (NodeB).

According to the multi-level classification method of the invention, the choice of the multivariate parameter is not limited, but it is preferably Wilks' lambda or the Gini index; the choice of the attributes is not limited, but preferably at least one is selected from the group consisting of ringPDVImax, VeinCentralVImin, VeinTDCentralVImax, TDVImax, CI, RMV, CI2, MCI3, and MI2. In addition, the choice of the variable-homogeneity analysis parameter is not limited, but it is preferably the Gini index, the Mahalanobis distance, or Youden's index.

On the other hand, the significance evaluation may use the p-value computed from an F statistic, the p-value indicating how significant the difference in the attributes' means between the categories is, or it may use a criterion that measures impurity. The F statistic is

F = ((n - p - 1)/p) · (1 - Λ)/Λ,

and the impurity is

impurity = (N_L · Gini(t_L) + N_M · Gini(t_M) + N_R · Gini(t_R)) / (N_L + N_M + N_R),

where n is the sample size, p is the number of attributes, and Λ is Wilks' lambda; N_L is the sample size of the first category, N_M is the sample size of the third category, N_R is the sample size of the second category, and Gini(t_L), Gini(t_M), and Gini(t_R) are the Gini values of the first, third, and second categories, respectively.

According to the multi-level classification method of the invention, the model-performance evaluation step may comprise the following four approaches: adding the attributes in the same layer as the discriminant analysis model established in step (c), so as to increase the discriminating power of that layer; adding the attributes on the third category (NodeN) and creating a new layer to build a model, which again selects at least one cut point with the variable-homogeneity analysis parameter and groups the remaining undetermined original samples into a first category (NodeA), a second category (NodeB), and an undetermined third category (NodeN); setting the first category (NodeA) as undetermined, adding the attributes to the combination formed by the first category (NodeA) plus the undetermined third category (NodeN), and creating a new layer to build a model, which again selects at least one cut point with the variable-homogeneity analysis parameter and groups the remaining undetermined original samples into a first category (NodeA), a second category (NodeB), and an undetermined third category (NodeN); or setting the second category (NodeB) as undetermined, adding the attributes to the combination formed by the second category (NodeB) plus the undetermined third category (NodeN), and creating a new layer to build a model, which again selects at least one cut point with the variable-homogeneity analysis parameter and groups the remaining undetermined original samples into a first category (NodeA), a second category (NodeB), and an undetermined third category (NodeN).

As can be seen from the above, the present invention provides a new discriminant analysis model structure and a corresponding method. Like a tree-structured classifier, it splits the data layer by layer from top to bottom; unlike a tree structure, however, each layer of this discriminant analysis model classifies some of the data into one or two categories and leaves the undetermined data to the next layer. In addition, each layer can select several attributes and combine them linearly using Fisher discriminant analysis.

In other words, the invention uses the above method to build a new multi-layer discriminant analysis model in which each layer may distinguish only one category (NodeA or NodeB) or may distinguish both categories (NodeA and NodeB), leaving the samples whose category has not yet been decided (NodeN) to the next layer for discrimination. The analysis of this discriminant model includes methods and criteria for selecting effective variables and finding cut points at each layer of the model; the model-performance evaluation step considers the overall performance when a new attribute is added, in order to decide how the model should be constructed; and a stopping condition is established to avoid overfitting.

Accordingly, the invention also provides an attribute-selection and cut-point-determination method in which the discriminant analysis model, when adding a new attribute, considers the performance of the overall model to decide how the discriminant analysis model should be built and when it should stop, thereby greatly improving classification accuracy.

FIG. 2 is a schematic diagram showing the architecture of a computer-recordable medium that can be used to execute the multi-level classification method of the multi-layer discriminant analysis model of the present invention.

As shown in FIG. 2, the computer-recordable medium 1 includes a display device 13, a processor 12, a memory 11, an input device 14, and a storage device 15. The input device 14 is used to input images, text, instructions, and other data into the computer-recordable medium; the storage device 15 is, for example, a hard disk, an optical drive, or a remote database connected through the Internet, and stores system programs, application programs, and user data; the memory 11 temporarily stores data or programs being executed; the processor 12 performs computation and data processing; and the display device 13 displays the output data.

The computer-recordable medium shown in FIG. 2 generally runs various application programs under an operating system, for example word-processing programs, drawing programs, scientific computing programs, browsers, and e-mail programs. In this embodiment, the storage device 15 stores a program that causes the computer-recordable medium to execute a multi-level classification method. When the computer-recordable medium is to execute this classification method, the corresponding program is loaded into the memory 11 and executed in cooperation with the processor 12. Finally, the data related to the classification result are displayed on the display device 13 or stored in a remote database through the Internet.

The flow of the method of the invention is shown schematically in FIG. 1a, and the architecture of the multi-layer discriminant analysis model it builds is shown in FIG. 1b; like a classification tree, it continuously splits the data from top to bottom. Unlike a classification tree, however, the multi-level classification method of the invention makes a judgment at each layer on some or all of the plurality of original samples: the samples whose category has already been decided (NodeA or NodeB) do not enter the next layer of the model, and only the samples judged at this layer to be still undetermined (NodeN) are passed to the next layer, where a new attribute is added to discriminate them. Each layer may decide only one category or both categories. If only one category is decided, a single cut point suffices to split the samples into two parts, one that can be classified at this layer and one that is undetermined and must be left to the next layer; if two categories are decided, two cut points are needed, splitting the data into three parts, one being the first category (NodeA), one being the second category (NodeB), and the remainder being the undetermined third category (NodeN). Each time a new attribute is to be added, the performance of the overall model is considered in deciding whether to combine the new attribute into an existing layer, so that that layer discriminates better, or to add a new layer to classify the samples that have not yet been classified. New attributes are continually added to the model until the stopping condition is reached.
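As a minimal sketch of how one layer might split samples into the three nodes, assuming a one-dimensional discriminant score and two already-chosen cut points c1 < c2 (the function and variable names are illustrative assumptions, not taken from the patent):

import numpy as np

def split_layer(score, c1, c2):
    """Assign each sample to NodeA, NodeB, or the undetermined NodeN."""
    node_a = score < c1             # decided as the first category
    node_b = score >= c2            # decided as the second category
    node_n = ~(node_a | node_b)     # left undetermined for the next layer
    return node_a, node_b, node_n

# usage: the score comes from one attribute or from an FLD combination of attributes
scores = np.array([0.1, 0.4, 0.9, 0.55, 0.2])
node_a, node_b, node_n = split_layer(scores, c1=0.3, c2=0.8)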

The multi-level classification method of the invention and the multi-layer discriminant analysis model architecture it builds are described in detail below.

First, a plurality of original samples is received. For these original samples, one attribute must first be selected from the plurality of attributes, and a multivariate parameter is used to evaluate the significance of the attributes on the original samples. For the evaluation of significance, a variable-homogeneity analysis parameter is used to select at least one cut point from the attributes found significant, and that cut point determines which category (NodeA, NodeB, or NodeN) each sample in the model is assigned to, or whether it is left to the next layer; preferably the most significant attribute after evaluation is chosen. It follows that selecting the attributes and determining these cut points is very important. Afterwards, the model-performance evaluation step must be applied to the discriminant analysis model established above, that is, further attributes are added to the model and the resulting models are compared; one option is to add an attribute to the existing discriminant analysis model and combine it by Fisher linear discriminant analysis (FLD), and the other is to add a new layer to the model.

[Attribute and multivariate parameter selection]

For the attributes, ringPDVImax, VeinCentralVImin, VeinTDCentralVImax, TDVImax, CI, RMV, CI2, MCI3, and MI2 may be used. For the multivariate parameter, two criteria are available: Wilks' lambda, commonly used in multivariate statistics to test whether the means differ between categories, and the Gini index, used in classification trees to evaluate impurity.

Wilks' lambda

Suppose there are g categories and p attributes, and x_k ~ N_p(μ_k, Σ), k = 1, 2, ..., g.

The hypotheses are H_0: μ_1 = μ_2 = ... = μ_g versus H_1: at least two of the μ_k differ, where H_0 is the null hypothesis, H_1 is the alternative hypothesis, and μ_k is the mean of class k.

Wilks' lambda is

Λ = |W| / |W + B| = 1 / |I + W^(-1)B| = ∏_i 1/(1 + λ_i),

where W is the within-group variation matrix, B is the between-group variation matrix, I is the identity matrix, and the λ_i are the eigenvalues of W^(-1)B.

Under H_0, Λ follows an F distribution after a suitable transformation (Equation 2). When s = 1 (as in the two-category case), with m_1 = p and m_2 = n - p - 1, the statistic F simplifies to

F = ((n - p - 1)/p) · (1 - Λ)/Λ, which follows F(p, n - p - 1).

Wilks' lambda can also be converted to a chi-square statistic,

-(n - 1 - (p + g)/2) ln Λ, which is approximately χ² with p(g - 1) degrees of freedom.

When there are few categories, the F statistic is better than the chi-square statistic. Since the multi-layer discriminant analysis is preferably applied to two categories, the F statistic is used.

For each attribute, the p-value computed from the F statistic above can be compared: the smaller the p-value, the more significant the difference in the attribute's mean between the categories, so comparing the p-values of the attributes selects the most significant one. To select a new attribute into the same layer, the p-value obtained after combining the new attribute with the existing attributes is compared, and the variable whose combination with the existing attributes gives the smallest p-value is selected.
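A sketch of this selection criterion for the two-category case, under the assumption that the data are held in an (n x p) attribute matrix X with binary labels y (the layout and function name are assumptions, not the patent's reference code):

import numpy as np
from scipy import stats

def wilks_lambda_pvalue(X, y):
    """Wilks' lambda and its F-test p-value for two categories."""
    n, p = X.shape
    grand_mean = X.mean(axis=0)
    W = np.zeros((p, p))            # within-group variation matrix
    B = np.zeros((p, p))            # between-group variation matrix
    for k in np.unique(y):          # assumes exactly two categories
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        centered = Xk - mk
        W += centered.T @ centered
        d = (mk - grand_mean).reshape(-1, 1)
        B += Xk.shape[0] * (d @ d.T)
    lam = np.linalg.det(W) / np.linalg.det(W + B)
    F = ((n - p - 1) / p) * (1 - lam) / lam      # s = 1 when g = 2
    return lam, F, stats.f.sf(F, p, n - p - 1)

# rank single attributes by p-value (smaller means more significant), e.g.
# p_values = [wilks_lambda_pvalue(X[:, [j]], y)[2] for j in range(X.shape[1])]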

Gini index

Because every split requires searching for a good, preferably the best, attribute and cut point, a splitting criterion is needed to evaluate the performance of a candidate attribute and cut point; the most common such criterion is the Gini index. The Gini index measures impurity, so the smaller it is, the better. Pairing an attribute with a corresponding cut point yields its Gini index, so for each attribute an optimal corresponding cut point can be searched for. When selecting a variable, comparing each attribute paired with its best cut point by its Gini index identifies the best attribute and cut point for the split.

Suppose there are g categories. The Gini index of a node t is defined as:

Gini(t) = Σ_{i≠j} P(i|t) P(j|t) = 1 - Σ_i P(i|t)²,

and the impurity of a split is:

impurity = (n_L / N) · Gini(t_L) + (n_R / N) · Gini(t_R),

where P(i|t) is the proportion of category i at node t,

P(j|t) is the proportion of category j at node t,

n_L is the number of samples in the left node,

n_R is the number of samples in the right node, and

N = n_R + n_L is the total number of samples.

Here the multi-layer discriminant analysis model of the invention differs from a classification tree: a classification tree makes a binary split at each node, whereas in the multi-layer discriminant analysis model of the invention each layer must split the data into three nodes, so the impurity calculation is changed to

impurity = (N_L · Gini(t_L) + N_M · Gini(t_M) + N_R · Gini(t_R)) / (N_L + N_M + N_R).

The impurity obtained by pairing each attribute with its best set of cut points can then be compared, and the attribute with the smallest impurity is selected.

To add a new attribute within the same layer, the impurity can be computed from the discriminant score obtained by combining the new attribute with the existing attributes through FLD, and the attribute whose combination with the existing attributes gives the lowest impurity is selected.
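A sketch of the Gini index and of the three-node impurity described above, for a candidate pair of cut points on a one-dimensional attribute or FLD score (assumed helper code, not from the patent):

import numpy as np

def gini(labels):
    """Gini index of one node: 1 minus the sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def three_way_impurity(score, labels, c1, c2):
    """Sample-weighted Gini impurity after splitting by two cut points c1 < c2."""
    left = labels[score < c1]
    mid = labels[(score >= c1) & (score < c2)]
    right = labels[score >= c2]
    total = len(labels)
    return (len(left) * gini(left) + len(mid) * gini(mid)
            + len(right) * gini(right)) / total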

[Cut-point selection]

There are three methods of selecting cut points: the Gini index, the Mahalanobis distance, and Youden's index.

Gini index

When the Gini index is used to select attributes, each attribute must be paired with a set of cut points to obtain its impurity, so a method is needed to choose the best set of cut points, i.e. the one that gives the lowest impurity. In a classification tree only one cut point is needed, so the cut point is found by trying every possible cut point and keeping the one with the lowest impurity. In the multi-layer discriminant analysis model of the invention, however, two cut points may be needed, for example, to split the data into three groups. Suppose there are N samples: finding one cut point requires trying only N possibilities, but finding two cut points requires N(N - 1)/2 possibilities, and when the sample size is large, trying every possible pair of cut points is very slow. The invention therefore solves this problem by developing a fast method of searching for the two cut points.

First, as in an ordinary classification tree, all possibilities are searched for the cut point C_0 that splits all the data into two groups with the lowest impurity; C_0 then splits the data into Node_L and Node_R. Within Node_L, a cut point C_1 is searched for that splits Node_L into two groups with the lowest impurity. Likewise, within Node_R a cut point C_2 is searched for that splits Node_R into two groups with the lowest impurity, as shown in FIG. 3.

This yields three candidate cut points, C_0, C_1, and C_2, which can form the three cut-point combinations (C_0, C_1), (C_1, C_2), and (C_0, C_2). The impurities obtained when each of these three combinations splits the data into three groups are compared, and the best combination is selected; preferably the highly homogeneous samples end up on the left and right sides. A constraint is therefore imposed when searching for C_1: of the two groups produced by C_1, the group farther from C_0 must have lower impurity than the other group. For the same reason, the same constraint is imposed when searching for C_2; that is, Gini(t_LL) < Gini(t_LR) and Gini(t_RR) < Gini(t_RL). With this search algorithm, only about 2N searches are needed to find the three candidate cut points, after which the three combinations are compared.
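The fast search could be sketched as follows (an assumed implementation), reusing gini() and three_way_impurity() from the sketch above and assuming both sides of C_0 are non-empty; the additional homogeneity constraint on C_1 and C_2 is omitted here for brevity:

import numpy as np

def best_single_cut(score, labels, candidates):
    """Cut point with the lowest two-node Gini impurity."""
    best_c, best_imp = None, np.inf
    total = len(labels)
    for c in candidates:
        left, right = labels[score < c], labels[score >= c]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / total
        if imp < best_imp:
            best_c, best_imp = c, imp
    return best_c

def fast_two_cuts(score, labels):
    c0 = best_single_cut(score, labels, np.unique(score))
    on_left, on_right = score < c0, score >= c0
    c1 = best_single_cut(score[on_left], labels[on_left], np.unique(score[on_left]))
    c2 = best_single_cut(score[on_right], labels[on_right], np.unique(score[on_right]))
    combos = [(min(a, b), max(a, b)) for a, b in [(c0, c1), (c1, c2), (c0, c2)]]
    # keep the pair whose three-group split has the lowest impurity
    return min(combos, key=lambda p: three_way_impurity(score, labels, p[0], p[1]))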

Mahalanobis distance

Another cut-point selection method according to the invention uses the Mahalanobis distance. It differs from the Euclidean distance in that the Mahalanobis distance considers not only the difference between the category centers but also how each category is dispersed. For example, if a sample lies equally far from the centers of category A and category B, but category A has a large variance and is widely spread while category B has a small variance and is tightly concentrated, then the sample's Mahalanobis distance to category A is smaller than its Mahalanobis distance to category B, and the sample is therefore considered to belong more to category A.

The application of the Mahalanobis distance to classification is described in detail below. First, suppose there are two categories. The Mahalanobis distance to category A is D_A(x) = ((x - μ_A)' S_A^(-1) (x - μ_A))^(1/2), and the Mahalanobis distance to category B is D_B(x) = ((x - μ_B)' S_B^(-1) (x - μ_B))^(1/2), where μ_A = (μ_A1, μ_A2, ..., μ_Ap) is the mean of category A, S_A is the covariance matrix of category A, μ_B = (μ_B1, μ_B2, ..., μ_Bp) is the mean of category B, and S_B is the covariance matrix of category B. A sample with D_A(x) < D_B(x) belongs to category A, and a sample with D_A(x) > D_B(x) belongs to category B.

In the multi-layer discriminant analysis model of the invention, however, the samples are divided into three groups, the two categories A and B (NodeA, NodeB) and the undetermined group (NodeN). The samples originally satisfying D_A(x) < D_B(x), i.e. assigned to category A, are therefore picked out, new values μ_A^1, μ_B^1, S_A^1, S_B^1 are computed from these samples, and the Mahalanobis distances of these samples are then recomputed with the new means and covariances:

if D_A^1(x) < D_B^1(x), the sample belongs to category A; if D_A^1(x) > D_B^1(x), it is undetermined.

Similarly, the samples originally satisfying D_A(x) > D_B(x), i.e. assigned to category B, are picked out, new values μ_A^2, μ_B^2, S_A^2, S_B^2 are computed from these samples, and the Mahalanobis distances of these samples are then recomputed with the new means and variances:

if D_A^2(x) > D_B^2(x), the sample belongs to category B; if D_A^2(x) < D_B^2(x), it is undetermined.

It should also be noted that when the Mahalanobis distance is used to find cut points in the multi-layer discriminant analysis model of the invention, the purpose is mainly to divide the data into a subset leaning toward category A and a subset leaning toward category B, and then to use these two data subsets to obtain the desired cut points. When the two subsets are as in FIG. 4a, however, the choice of nodes affects classification accuracy; to remedy the unreliability of the Mahalanobis cut points caused by the very large difference in sample sizes in this situation, the Gini index can additionally be used to correct the Mahalanobis distance, as shown in FIG. 4b. First, a cut point is found with the Gini index and the data are split into two sides by this cut point; the proportion of each category on the two sides is then compared. If the proportion of category A on the left side is greater than on the right side, the category-A samples on the right side are removed; otherwise the category-A samples on the left side are removed. Likewise, the proportions of category B on the two sides are compared and the category-B samples on the side with the smaller proportion are removed. The remaining category-A and category-B samples are then used to recompute the means and variances, giving a Gini-index-corrected Mahalanobis distance.
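The Gini-index correction could be sketched like this (assumed code), reusing best_single_cut() from the earlier sketch and operating on a one-dimensional attribute or discriminant score, with labels again assumed to be 0 for category A and 1 for category B:

import numpy as np

def gini_corrected_subset(score, labels):
    """Boolean mask of the samples kept for recomputing the Mahalanobis
    means and covariances: on each side of a Gini-chosen cut, only the
    class that is relatively over-represented on that side survives."""
    c = best_single_cut(score, labels, np.unique(score))
    on_left = score < c
    keep = np.ones(len(labels), dtype=bool)
    for cls in (0, 1):
        is_cls = labels == cls
        prop_left = (is_cls & on_left).sum() / max(on_left.sum(), 1)
        prop_right = (is_cls & ~on_left).sum() / max((~on_left).sum(), 1)
        if prop_left > prop_right:
            keep &= ~(is_cls & ~on_left)    # drop this class on the right side
        else:
            keep &= ~(is_cls & on_left)     # drop this class on the left side
    return keep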

Youden's index

First, define Youden's index = specificity + sensitivity - 1, where specificity is the proportion of all category-A samples among the provided original samples that are judged correctly, and sensitivity is the proportion of all category-B samples among the provided original samples that are judged correctly; the higher the Youden's index, the better.

The method of searching for cut points is similar to that using the Gini index. First, all possibilities are searched for the cut point C_0 that splits all the data into two groups with the highest Youden's index; C_0 then splits the data into Node_L and Node_R. Within Node_L, a cut point C_1 is searched for that splits Node_L into two groups with the highest Youden's index, and likewise within Node_R a cut point C_2 is searched for that splits Node_R into two groups with the highest Youden's index. This yields three candidate cut points C_0, C_1, and C_2, which can form the three cut-point combinations (C_0, C_1), (C_1, C_2), and (C_0, C_2); the Youden's indices obtained when each combination splits the data into three groups are compared, and the best combination is selected.

When the data are split into three groups, the calculation of specificity and sensitivity must be modified because of the undetermined part:

Specificity = (number of correctly judged category-A samples + 0.5 × number of undetermined category-A samples) / total number of category-A samples; and

Sensitivity = (number of correctly judged category-B samples + 0.5 × number of undetermined category-B samples) / total number of category-B samples.

The set of cut points with the highest Youden's index among the three combinations is then selected.
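A small sketch of the modified Youden's index for a three-group split (labels assumed to be 0 for category A and 1 for category B; the boolean node masks follow the earlier split_layer() sketch):

import numpy as np

def youden_three_groups(labels, node_a, node_b, node_n):
    is_a, is_b = labels == 0, labels == 1
    specificity = ((node_a & is_a).sum() + 0.5 * (node_n & is_a).sum()) / is_a.sum()
    sensitivity = ((node_b & is_b).sum() + 0.5 * (node_n & is_b).sum()) / is_b.sum()
    return specificity + sensitivity - 1.0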

[Evaluating model performance]

In the multi-layer discriminant analysis model, each time an attribute is to be added to the model, the evaluation step can follow four different schemes.

First, as shown in FIG. 5, suppose a layer built from X_1 already exists, and X_1 splits the samples into three groups, category A, category B, and the undetermined samples, denoted NodeA, NodeB, and NodeN, respectively.

Scheme 1:

A new attribute X_i is added in the existing layer and combined with X_1 by FLD, so as to increase the discriminating power of that layer.

Scheme 2:

A new attribute is added on NodeN to build a model, and this model is used to discriminate the samples that could not be distinguished in the existing layer.

Scheme 3:

The samples of NodeA and NodeN are merged, denoted NodeAN; the existing layer built from X_1 is then used only to separate out category B, and an attribute X_k is added on NodeAN to build a model, which is used to discriminate the samples that could not be distinguished in the existing layer.

Scheme 4:

The samples of NodeB and NodeN are merged, denoted NodeBN; the existing layer built from X_1 is then used only to separate out category A, and an attribute X_p is added on NodeBN to build a model, which is used to discriminate the samples that could not be distinguished in the existing layer.
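Schemes 2 to 4 differ only in which samples the new layer is built on; a minimal sketch of those candidate subsets, using the boolean node masks from the earlier split_layer() sketch (Scheme 1, combining the new attribute into the existing layer by FLD, does not create a new subset and is not shown):

def scheme_subsets(node_a, node_b, node_n):
    return {
        'scheme 2': node_n,            # model only the undetermined samples
        'scheme 3': node_a | node_n,   # treat NodeA as undetermined again
        'scheme 4': node_b | node_n,   # treat NodeB as undetermined again
    }

# usage: subsets = scheme_subsets(node_a, node_b, node_n)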

[Stopping conditions]

The stopping conditions of the multi-layer discriminant analysis model of the invention are of two kinds: one decides whether the undetermined samples should continue to be split further, and the other decides whether a new attribute should be added to an existing layer.

To decide whether to continue splitting the undetermined samples, the Wilks' lambda mentioned under attribute selection can be used: if the null hypothesis is not rejected, no attribute can be found among the remaining samples that separates the categories significantly, so the downward splitting stops.

As mentioned above, the other stopping condition decides whether to add a new attribute to an existing layer. Since the model already contains some significant attributes, what must be considered when adding a new attribute is not whether the overall model is significant after the addition, but how much additional variation the newly added attribute explains. Here the partial F-test used by forward selection in regression analysis can be referred to; it tests whether the model with the newly added attribute differs significantly from the original model. If the null hypothesis is rejected, the model with the new attribute shows no significant improvement, and the attribute is not added to the model. The test statistic is (Equation 6)

F* = ((SSE_R - SSE_F) / (df_R - df_F)) / (SSE_F / df_F),

where df_F is the degrees of freedom of the full model, df_R is the degrees of freedom of the reduced model, β_0, β_1, β_2 are the parameters of the variables, SSR is the explained sum of squares, and SSE is the residual sum of squares.
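An assumed numerical sketch of this partial F-test, comparing a reduced model with a full model that adds the candidate attribute:

from scipy import stats

def partial_f_test(sse_reduced, df_reduced, sse_full, df_full):
    """Partial F statistic and p-value for the added variables."""
    f_stat = ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)
    return f_stat, stats.f.sf(f_stat, df_reduced - df_full, df_full)

# e.g. with residual sums of squares from least-squares fits of the two models:
# f_stat, p_value = partial_f_test(sse_reduced=120.0, df_reduced=97, sse_full=95.0, df_full=96)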

For forward selection in discriminant analysis, the corresponding test model is given by (Equation 7).

If the null hypothesis is rejected, the model does not need to include this new attribute.

If the newly added attribute is sufficiently significant, the performance of the overall model before and after the addition must still be compared using the model-performance evaluation method. Conversely, if adding the new attribute cannot improve the performance of the overall model, no further new attributes are added. Note that, in the multi-level classification method of the invention and the multi-layer discriminant analysis model it builds, the last layer of the model must classify all the data; no undetermined samples may remain.

Based on the foregoing parameters and settings, a detailed flow chart for building the multi-layer discriminant analysis model according to the method of the invention is shown in FIG. 6.

First, after the plurality of original samples is received (not shown), Wilks' lambda or the Gini index is used to select the most significant attribute, and this attribute is tested for the ability to separate the categories significantly. If the null hypothesis is rejected, the attribute has explanatory power. The Mahalanobis distance or the Gini index, as described above, is then used to find the best set of cut points for this attribute, splitting the data into three groups, the first category (category A, NodeA), the second category (category B, NodeB), and the undetermined third category (NodeN); the performance of the model can then be evaluated from these three groups.

Next, when the second attribute is selected, where it should be added must be considered. The four schemes mentioned above are: (Scheme 1) in the existing layer, find the attribute and cut point that work best when combined with the existing variables; (Scheme 2) find the most suitable attribute and cut point using the originally undetermined samples; (Scheme 3) treat category A as undetermined and use the category-A samples plus the undetermined samples to find the most suitable attribute and cut point; and (Scheme 4) treat category B as undetermined and use the category-B samples plus the undetermined samples to find the most suitable attribute and cut point.

After each scheme selects its attribute, the attribute's significance is tested with Wilks' lambda; if it is not sufficiently significant, that scheme's attribute is discarded. The overall model performance of each remaining scheme is then evaluated with the model-performance evaluation steps described above. If Scheme 1 performs best, the new attribute is added to the existing layer. If Scheme 2 performs best, a new layer is built from the undetermined samples left by the previous layer. If Scheme 3 or Scheme 4 is best, category A or category B is treated as undetermined in the previous layer, a new layer is built from all the remaining undetermined samples, and the model of the previous layer is converted to use only one cut point, deciding only category A or category B rather than both categories in the same layer.

If the current model already has n layers, then when a new attribute is to be added there are n ways of adding it to an existing layer, plus Schemes 2, 3, and 4, so n + 3 cases must be considered in total. If the newly added variable is not significant in any of these n + 3 cases, the model stops. If some cases pass, the scheme with the best performance is selected, and it is checked whether the overall model performance improves after this additional attribute is selected. If there is no improvement, no new variable is added; if there is improvement, new attributes continue to be added to the model until the model performance no longer improves.

In summary, the invention provides a systematic variable-selection method for the multi-layer discriminant analysis model: variables can be selected by the p-value obtained from converting Wilks' lambda into an F distribution, or by the Gini index. For determining the cut points, the Mahalanobis distance, the Gini index, and related methods are provided. When the Gini index is used to determine the cut points, at least one cut point must be found, and searching all possible cut-point combinations would be very time-consuming, so the invention also provides a faster method of searching for the desired cut points. When the Mahalanobis distance is used to determine the cut points, the Mahalanobis distance is first used to divide all samples into a subset leaning toward category A and a subset leaning toward category B, and these two groups of samples are then used to find the two Mahalanobis cut points; however, because the data are first divided into two groups by the Mahalanobis distance, the difference in the number of samples between the categories within each group is usually large, which makes the Mahalanobis cut points unreliable, so the invention also provides a Gini-index correction of the Mahalanobis distance to solve this problem. Each time a new attribute is added to the model, not only the performance of a single layer is considered: the performance of the overall model is considered before deciding where the new attribute should be added. For the stopping condition of the model, Wilks' lambda or the like is used to prevent overfitting of the model, thereby greatly improving classification accuracy.

[Example 1]

In this example, data with 100 samples, 2 categories, and 5 attributes (X1, X2, ..., X5) are provided, each attribute following N(0, 1). The scatter plot of the categories is shown in FIG. 7b, and the preset model is shown in FIG. 7a, in which the first layer is explained by X1 and the part that cannot be classified is left to the next layer to be explained by X2.

The results obtained by the multi-layer discriminant analysis are shown in FIG. 7c; since the multi-layer discriminant analysis model can find cut points either by the Gini index or by the Mahalanobis distance, both methods are included in the presentation of the results. The results obtained by CART are shown in FIG. 7d. The multi-layer discriminant analysis that finds cut points with the Gini index can be compared directly with CART, since both use the same cut-point criterion.

In the multi-layer discriminant analysis, the first layer uses X1 to separate out both category 0 and category 1: the category-0 node contains 24 category-0 samples and 0 category-1 samples, and the category-1 node contains 3 category-0 samples and 35 category-1 samples. In CART, by contrast, as shown in FIG. 7d, the first layer uses X1 to separate out category 1, containing 3 category-0 samples and 35 category-1 samples, and the second layer uses X2 to separate out the node containing 24 category-0 samples and 0 category-1 samples, so the two methods produce the same classification. However, the multi-layer discriminant analysis uses, within a single layer, an attribute's ability to discriminate both categories (category 0 and category 1), whereas CART can decide only one category in a layer and must use the same attribute again in the next layer to decide the other category.

The results of the various methods are presented in Table 1; as can be seen there, the multi-layer discriminant analysis using the Gini index performs as well as CART.

[Example 2]

In this example, data with 200 samples, 2 categories, and 10 attributes (X1, X2, ..., X10) are provided, each attribute following N(0, 1). The preset model is shown in FIG. 8a, in which the first layer selects X1 and X2 and combines them into one FLD model, and the samples that the first layer cannot classify are left to the second layer, explained by an FLD model combining X3 and X4.

The results obtained by the multi-layer discriminant analysis are shown in FIG. 8b, and the results obtained by CART are shown in FIG. 8c.

The results of this example are presented in Table 2; the multi-layer discriminant analysis, whether it finds cut points with the Gini index or with the Mahalanobis distance, achieves better accuracy than both CART and FLD.

[Example 3]

In this example, data with 1000 samples, 2 categories, and 5 attributes (X1, X2, ..., X5) are provided, each attribute following N(0, 1). The scatter plot of the categories is shown in FIG. 9b, and the preset model is shown in FIG. 9a: the first layer is explained by X1, which can only separate out category 0, and the remaining part that cannot be classified is left to the next layer to be explained by X2.

The results obtained by the multi-layer discriminant analysis are shown in FIG. 9c, and the results obtained by CART are shown in FIG. 9d. Since the preset model can be regarded as a univariate tree structure, the multi-layer discriminant analysis using the Gini index as the cut-point criterion gives the same result as CART in this case.

The results of this example are presented in Table 3; the multi-layer discriminant analysis using the Gini index performs as well as CART.

[Example 4]

In this example, data with 1000 samples, 2 categories, and 5 attributes (X1, X2, ..., X5) are provided, each attribute following N(0, 1). The preset model is shown in FIG. 10a: the first layer is explained by X1, which can only separate out category 0, and the remaining part that cannot be classified is left to the next layer to be explained by X2 and X3.

The results obtained by the multi-layer discriminant analysis are shown in FIG. 10b, and the results obtained by CART are shown in FIG. 10c.

The results of this example are presented in Table 4; the multi-layer discriminant analysis using the Gini index gives the best result.

[Example 5]

In this example, quantified attributes of tumor images obtained by ultrasound scanning are provided and used to construct a discriminant model. There are 160 tumor image samples, of which 108 are represented by category 0 and 52 by category 1.

First, the five attributes CI, EI, MI, HI, and ringPDVImax are provided for analysis. If Fisher discriminant analysis is used directly to combine these five attributes, the resulting accuracy is 0.793, whereas the accuracy of the multi-layer discriminant analysis is 0.8. Moreover, the multi-layer discriminant analysis uses only four of the variables, as shown in FIG. 11a, and still achieves higher accuracy than conventional Fisher discriminant analysis.

In addition to the above five attributes, further attributes can be added to the analysis according to this example. The multi-layer discriminant analysis using the Gini index to determine the cut points gives the result shown in FIG. 11b, with an accuracy of 0.906; using Youden's index to determine the cut points gives the result shown in FIG. 11c, with an accuracy of 0.8012. The result obtained by CART is shown in FIG. 11d, with an accuracy of 0.868. FLD, using the nine attributes ringPDVImax, VeinCentralVImin, VeinTDCentralVImax, TDVImax, CI, RMV, CI2, MCI3, and MI2, achieves an accuracy of 0.843. As shown in Table 5, the multi-layer discriminant analysis gives the best accuracy.

Furthermore, the above steps of the invention may be written in a computer language for execution, and the resulting software program may be stored on any recording medium that a microprocessing unit can recognize and interpret, or in an article or device containing such a recording medium. The form is not limited: the article may be a hard disk, a floppy disk, an optical disc, a ZIP disk, an MO disc, an IC chip, random access memory (RAM), or any article containing such a recording medium that is available to those skilled in the art. Since the multi-level classification method of the invention has been fully disclosed above, anyone familiar with computer languages will know, after reading this specification, how to write the software program, so the details of the software program are not described here.

The above embodiments are given by way of example only for convenience of description; the scope claimed for the invention shall be as set out in the appended claims and is not limited to the above embodiments.

1‧‧‧computer-recordable medium

11‧‧‧memory

12‧‧‧processor

13‧‧‧display device

14‧‧‧input device

15‧‧‧storage device

圖1a係本發明多層判別分析流程圖。 Figure 1a is a flow chart of the multilayer discriminant analysis of the present invention.

圖1b係根據本發明之方法所建立之多層判別分析模型架構示意圖。 Figure 1b is a schematic diagram of a multi-layer discriminant analysis model architecture established in accordance with the method of the present invention.

圖2係顯示一電腦可紀錄媒體之架構的示意圖。 Figure 2 is a schematic diagram showing the architecture of a computer recordable medium.

圖3係本發明一較佳實施例之搜尋Gini index切點示意圖。 Figure 3 is a schematic diagram of searching for a Gini index cut point according to a preferred embodiment of the present invention.

圖4a-4b係本發明一較佳實施例之使用Gini index修正馬氏距離示意圖。 Figures 4a-4b are schematic diagrams of correcting the Mahalanobis distance using the Gini index according to a preferred embodiment of the present invention.

圖5係本發明一比較模型之四種方式示意圖。 Figure 5 is a schematic diagram of the four ways of comparing models in the present invention.

圖6係本發明多層判別分析模型詳細流程圖。 Figure 6 is a detailed flow chart of the multi-layer discriminant analysis model of the present invention.

圖7a-7d係本發明實施例1示意圖。 Figures 7a-7d are schematic diagrams of Embodiment 1 of the present invention.

圖8a-8c係本發明實施例2示意圖。 Figures 8a-8c are schematic diagrams of Embodiment 2 of the present invention.

圖9a-9d係本發明實施例3示意圖。 Figures 9a-9d are schematic diagrams of Embodiment 3 of the present invention.

圖10a-10c係本發明實施例4示意圖。 Figures 10a-10c are schematic diagrams of Embodiment 4 of the present invention.

圖11a-11d係本發明實施例5示意圖。 Figures 11a-11d are schematic diagrams of Embodiment 5 of the present invention.

(該圖為一流程圖故無元件代表符號)(The figure is a flow chart, so there are no reference numerals for components.)

Claims (22)

一種多層次分類方法,係於一電腦可紀錄媒體中用以分類多個影像樣本,該電腦可紀錄媒體包括有一處理器、一輸入裝置、及一儲存裝置,該方法至少包括下列步驟:(a)接收複數個原始樣本;(b)提供複數個屬性,並以一多變量參數對該些原始樣本由該些屬性進行顯著性評估計算;(c)選擇至少一切點並建立一判別分析模型,其係將該步驟(b)中評估後具有顯著性者其中之一,藉提供一變數同質分析參數篩選出該至少一切點,將該些屬性評估後具有顯著性者中所包含之該複數個原始樣本在這一層中分群為至少一類別以建立該判別分析模型,其中該至少一類別係包括有第一類別(NodeA)、第二類別(NodeB)、及未決定之第三類別(NodeN);(d)進行一評估模型效能之步驟,其係將該判別分析模型中加入該些屬性進行顯著性評估;其中,當加入該至少一屬性後有增進該判別分析模型之顯著性時,未決定之第三類別(NodeN)的該複數個原始樣本便進入該判別分析模型之下一層,再以該變數同質分析參數篩選出至少一切點,將該判別分析模型中加入該些屬性評估後具有顯著性者中所包含之該複數個原始樣本接著分群為第一類別(NodeA)、第二類別(NodeB)、及未決定之第三類別(NodeN);以及 (e)加入一停止條件,該停止條件係以選擇該變數同質分析參數,若不拒絕虛無假設,該判別分析模型即停止往下一層分群;或在該評估模型效能之步驟中加入該些屬性以一迴歸分析法進行顯著性評估,當加入該些屬性後無法提升該判別分析模型之顯著性時,若拒絕虛無假設,該判別分析模型即停止往下一層分群。 A multi-level classification method for classifying a plurality of image samples in a computer recordable medium, the computer recordable medium comprising a processor, an input device, and a storage device, the method comprising at least the following steps: (a Receiving a plurality of original samples; (b) providing a plurality of attributes, and performing a significant evaluation of the original samples from the attributes by a multivariate parameter; (c) selecting at least all points and establishing a discriminant analysis model, It is one of the significant ones after the evaluation in the step (b), and the at least one point is selected by providing a variable homogeneity analysis parameter, and the plurality of points included in the attribute are evaluated after the attribute is evaluated. The original samples are grouped into at least one category in this layer to establish the discriminant analysis model, wherein the at least one category includes a first category (Node A ), a second category (Node B ), and an undetermined third category ( Node N); (d) performing a step of evaluating the effectiveness of the model, which is added to the system in the plurality of discriminant analysis model assessment significant attributes; wherein, when there is added to the at least one property enhancing When significant discriminant analysis model, the third category of undecided (Node N) of the plurality of original sample will enter the discriminant analysis model under layer, then the homogeneous analysis parameter variable at least all selected points, the determination The plurality of original samples included in the analytic model after adding the attributes to the analysis model are then grouped into a first category (Node A ), a second category (Node B ), and an undetermined third category (Node). And (e) adding a stop condition for selecting the variable homogeneity analysis parameter, and if the null hypothesis is not rejected, the discriminant analysis model stops the next layer grouping; or in the step of evaluating the performance of the model Adding these attributes to the saliency analysis by a regression analysis method, when the attributes are added, the saliency of the discriminant analysis model cannot be improved. If the null hypothesis is rejected, the discriminant analysis model stops the next layer grouping. 如申請專利範圍第1項所述之多層次分類方法,其中,在加入該停止條件時,該判別分析模型之最後一層分類層中,該未決定之第三類別(NodeN)中所包含之樣本數為零。 The multi-level classification method according to claim 1, wherein when the stop condition is added, in the last layer classification layer of the discriminant analysis model, the undetermined third category (Node N ) is included The number of samples is zero. 如申請專利範圍第1項所述之多層次分類方法,其中,該多變量參數係為Wilk’s lambda或Gini index。 The multi-level classification method according to claim 1, wherein the multivariate parameter is Wilk’s lambda or Gini index. 
如申請專利範圍第1項所述之多層次分類方法,該顯著性評估計算係以一F統計量算出的p值,以該p值表示該些屬性在該類別間平均的差異顯著性;或以一衡量不純度(impurity)之準則判斷;其中,該F統計量為 該不純度(impurity)為 其中,n為樣本空間(sample size),p為屬性的數目,Λ則為Wilk’s lambda; 其中,NL為第一類別的樣本空間,NM為第三類別的樣本空間,NR為第二類別的樣本空間,tL為第一類別的Gini值,tM為第三類別的Gini值,tR為第二類別的Gini值。 The multi-level classification method as described in claim 1, wherein the significance evaluation calculation is a p-value calculated by a F-statistic, and the p-value indicates an average significance of the differences between the categories; or Judging by a measure of impureness; wherein the F statistic is The impureness is Where n is the sample size, p is the number of attributes, and Λ is Wilk's lambda; where N L is the sample space of the first category, N M is the sample space of the third category, and N R is the second The sample space of the category, t L is the Gini value of the first category, t M is the Gini value of the third category, and t R is the Gini value of the second category. 如申請專利範圍第1項所述之多層次分類方法,其中,該些屬性係至少一選自由ringPDVImax、VeinCentralVImin、VeinTDCentralVImax、TDVImax、CI、RMV、CI2、MCI3、及MI2所組成之群組。 The multi-level classification method according to claim 1, wherein the attributes are at least one selected from the group consisting of ringPDVImax, VeinCentralVImin, VeinTDCentralVImax, TDVImax, CI, RMV, CI2, MCI3, and MI2. 如申請專利範圍第1項所述之多層次分類方法,其中,該變數同質分析參數係為Gini index、Mahalanobis distance、或Youden’s Index。 The multi-level classification method according to claim 1, wherein the variable homogeneity analysis parameter is Gini index, Mahalanobis distance, or Youden’s Index. 如申請專利範圍第1項所述之多層次分類方法,其中,該評估模型效能之步驟包括:在與步驟(c)所建立之該判別分析模型同層中加入該些屬性,以增加該判別分析模型之原同層中的區別能力。 The multi-level classification method according to claim 1, wherein the step of evaluating the performance of the model comprises: adding the attributes to the same layer of the discriminant analysis model established in step (c) to increase the discrimination Analyze the difference in the original layer of the model. 如申請專利範圍第1項所述之多層次分類方法,其中,該評估模型效能之步驟包括:在該第三類別(NodeN)上加入該些屬性並新增一層以建立一模型,該模型亦以該變數同質分析參數篩選出至少一切點,將剩餘未決定之該複數個原始樣本繼續分群為第一類別(NodeA)、第二類別(NodeB)、及未決定之第三類別(NodeN)。 The multi-level classification method according to claim 1, wherein the step of evaluating the performance of the model comprises: adding the attributes to the third category (Node N ) and adding a layer to establish a model, the model At least all the points are also filtered by the variable homogeneity analysis parameter, and the remaining undetermined plurality of original samples are further grouped into a first category (Node A ), a second category (Node B ), and an undetermined third category ( Node N ). 如申請專利範圍第1項所述之多層次分類方法,其中,該評估模型效能之步驟包括:將第一類別(NodeA)設定為未決定之類別,並將第一類別(NodeA)加上未決定之第三類別(NodeN)而形成的組合中加入該至少一屬性並新增一層以建 立一模型,該模型亦以該變數同質分析參數篩選出至少一切點,將剩餘未決定之該複數個原始樣本繼續分群為第一類別(NodeA)、第二類別(NodeB)、及未決定之第三類別(NodeN)。 The multi-level classification method as described in claim 1, wherein the step of evaluating the performance of the model comprises: setting the first category (Node A ) to an undetermined category, and adding the first category (Node A ) Adding the at least one attribute to the combination formed by the undetermined third category (Node N ) and adding a layer to establish a model, the model also filters at least all points with the variable homogeneity analysis parameter, and the remaining undetermined The plurality of original samples continue to be grouped into a first category (Node A ), a second category (Node B ), and an undetermined third category (Node N ). 
如申請專利範圍第1項所述之多層次分類方法,其中,該評估模型效能之步驟包括:將第二類別(NodeB)設定為未決定之類別,並將第二類別(NodeB)加上未決定之第三類別(NodeN)而形成的組合中加入該些屬性並新增一層以建立一模型,該模型亦以該變數同質分析參數篩選出至少一切點,將剩餘未決定之該複數個原始樣本繼續分群為第一類別(NodeA)、第二類別(NodeB)、及未決定之第三類別(NodeN)。 The multi-level classification method described in claim 1, wherein the step of evaluating the performance of the model comprises: setting the second category (Node B ) to an undetermined category, and adding the second category (Node B ) Adding these attributes to the combination formed by the undetermined third category (Node N ) and adding a layer to establish a model, the model also filters out at least all points with the variable homogeneity analysis parameters, and the remaining undetermined The plurality of original samples continue to be grouped into a first category (Node A ), a second category (Node B ), and an undetermined third category (Node N ). 如申請專利範圍第1項所述之多層次分類方法,其中,該迴歸分析法係包括一順向選擇法使用之partial F-test。 The multi-level classification method described in claim 1, wherein the regression analysis method comprises a partial F-test used by the forward selection method. 一種用以分類多個影像樣本之非暫態電腦可紀錄媒體,其係以建立一多層次分類方法對該些影像樣本進行分類,該多層次分類方法至少包括下列步驟:(a)接收複數個原始樣本;(b)提供複數個屬性,並以一多變量參數對該些原始樣本由該些屬性進行顯著性評估計算;(c)選擇至少一切點並建立一判別分析模型,其係將該步驟(b)中評估後具有顯著性者其中之一,藉提供一變數同質分析參數篩選出該至少一切點,將該些屬性評估後具有顯著性者中所包含之該複數個原始樣本在這一層中分群為至少一類別以建立該判別分析模型,其中該至少一類別係 包括有第一類別(NodeA)、第二類別(NodeB)、及未決定之第三類別(NodeN);(d)進行一評估模型效能之步驟,其係將該判別分析模型中加入該些屬性進行顯著性評估;其中,當加入該些屬性後有增進該判別分析模型之顯著性時,未決定之第三類別(NodeN)的該複數個原始樣本便進入該判別分析模型之下一層,再以該變數同質分析參數篩選出至少一切點,將該判別分析模型中加入該些屬性評估後具有顯著性者中所包含之該複數個原始樣本接著分群為第一類別(NodeA)、第二類別(NodeB)、及未決定之第三類別(NodeN);以及(e)加入一停止條件,該停止條件係以選擇該變數同質分析參數,若不拒絕虛無假設,該判別分析模型即停止往下一層分群;或在該評估模型效能之步驟中加入該些屬性以一迴歸分析法進行顯著性評估,當加入該些屬性後無法提升該判別分析模型之顯著性時,若拒絕虛無假設,該判別分析模型即停止往下一層分群。 A non-transitory computer recordable medium for classifying a plurality of image samples, wherein the image samples are classified by establishing a multi-level classification method, the multi-level classification method comprising at least the following steps: (a) receiving a plurality of image samples (b) providing a plurality of attributes, and performing a significant evaluation of the original samples from the attributes by a multivariate parameter; (c) selecting at least all points and establishing a discriminant analysis model, One of the significant ones after the evaluation in step (b) is to screen out the at least all points by providing a variable homogeneity analysis parameter, and the plurality of original samples included in the attribute are evaluated after the attribute is evaluated. 
Grouping into at least one category in a layer to establish the discriminant analysis model, wherein the at least one category comprises a first category (Node A ), a second category (Node B ), and an undetermined third category (Node N ); (d) performing a step of evaluating the performance of the model by adding the attributes to the discriminant analysis model for significant evaluation; wherein, when the attributes are added, the discriminant analysis model is enhanced Sexual, a third category of undecided (Node N) of the plurality of original sample will enter the discriminant analysis model under layer, then the homogeneous analysis parameter variable at least all selected points, the discriminant analysis is added to the model The plurality of original samples included in the saliency after the attribute evaluation are then grouped into a first category (Node A ), a second category (Node B ), and an undetermined third category (Node N ); and e) adding a stop condition, which is to select the variable homogeneity analysis parameter, if the null hypothesis is not rejected, the discriminant analysis model stops the next layer grouping; or add the attributes in the step of evaluating the model performance A regression analysis performs a significant evaluation. When the saliency of the discriminant analysis model cannot be improved after adding these attributes, if the null hypothesis is rejected, the discriminant analysis model stops the next layer grouping. 如申請專利範圍第12項所述之電腦可紀錄媒體,其中,在加入該停止條件時,該判別分析模型之最後一層分類層中,該未決定之第三類別(NodeN)中所包含之樣本數為零。 The computer recordable medium according to claim 12, wherein, when the stop condition is added, in the last layer classification layer of the discriminant analysis model, the undetermined third category (Node N ) is included The number of samples is zero. 如申請專利範圍第12項所述之電腦可紀錄媒體,其中,該多變量參數係為Wilk’s lambda或Gini index。 The computer recordable medium of claim 12, wherein the multivariate parameter is Wilk’s lambda or Gini index. 如申請專利範圍第12項所述之電腦可紀錄媒體,該顯著性評估計算係以一F統計量算出的p值,以該p值表示該 些屬性在該類別間平均的差異顯著性;或以一衡量不純度(impurity)之準則判斷;其中,該F統計量為 該不純度(impurity)為 其中,n為樣本空間(sample size),p為屬性的數目,Λ則為Wilk’s lambda;其中,NL為第一類別的樣本空間,NM為第三類別的樣本空間,NR為第二類別的樣本空間,tL為第一類別的Gini值,tM為第三類別的Gini值,tR為第二類別的Gini值。 The computer-recordable medium according to claim 12, wherein the significance evaluation calculation is a p-value calculated by a F statistic, and the p-value indicates an average difference of the attributes among the categories; or Judging by a measure of impureness; wherein the F statistic is The impureness is Where n is the sample size, p is the number of attributes, and Λ is Wilk's lambda; where N L is the sample space of the first category, N M is the sample space of the third category, and N R is the second The sample space of the category, t L is the Gini value of the first category, t M is the Gini value of the third category, and t R is the Gini value of the second category. 如申請專利範圍第12項所述之電腦可紀錄媒體,其中,該些屬性係至少一選自由ringPDVImax、VeinCentralVImin、VeinTDCentralVImax、TDVImax、CI、RMV、CI2、MCI3、及MI2所組成之群組。 The computer recordable medium of claim 12, wherein the attributes are at least one selected from the group consisting of ringPDVImax, VeinCentralVImin, VeinTDCentralVImax, TDVImax, CI, RMV, CI2, MCI3, and MI2. 如申請專利範圍第12項所述之電腦可紀錄媒體,其中,該變數同質分析參數係為Gini index、Mahalanobis distance、或Youden’s Index。 The computer recordable medium according to claim 12, wherein the variable homogeneity analysis parameter is Gini index, Mahalanobis distance, or Youden’s Index. 
如申請專利範圍第12項所述之電腦可紀錄媒體,其中,該評估模型效能之步驟包括:在與步驟(c)所建立之該判別分析模型同層中加入該些屬性,以增加該判別分析模型之原同層中的區別能力。 The computer recordable medium according to claim 12, wherein the step of evaluating the performance of the model comprises: adding the attributes to the same layer of the discriminant analysis model established in step (c) to increase the discrimination. Analyze the difference in the original layer of the model. 如申請專利範圍第12項所述之電腦可紀錄媒體,其中,該評估模型效能之步驟包括:在該第三類別(NodeN)上加入該些屬性並新增一層以建立一模型,該模型亦以該變數同質分析參數篩選出至少一切點,將剩餘未決定之該複數個原始樣本繼續分群為第一類別(NodeA)、第二類別(NodeB)、及未決定之第三類別(NodeN)。 The computer recordable medium according to claim 12, wherein the step of evaluating the performance of the model comprises: adding the attributes to the third category (Node N ) and adding a layer to establish a model, the model At least all the points are also filtered by the variable homogeneity analysis parameter, and the remaining undetermined plurality of original samples are further grouped into a first category (Node A ), a second category (Node B ), and an undetermined third category ( Node N ). 如申請專利範圍第12項所述之電腦可紀錄媒體,其中,該評估模型效能之步驟包括:將第一類別(NodeA)設定為未決定之類別,並將第一類別(NodeA)加上未決定之第三類別(NodeN)而形成的組合中加入該些屬性並新增一層以建立一模型,該模型亦以該變數同質分析參數篩選出至少一切點,將剩餘未決定之該複數個原始樣本繼續分群為第一類別(NodeA)、第二類別(NodeB)、及未決定之第三類別(NodeN)。 The computer recordable medium according to claim 12, wherein the step of evaluating the performance of the model comprises: setting the first category (Node A ) to an undetermined category, and adding the first category (Node A ) Adding these attributes to the combination formed by the undetermined third category (Node N ) and adding a layer to establish a model, the model also filters out at least all points with the variable homogeneity analysis parameters, and the remaining undetermined The plurality of original samples continue to be grouped into a first category (Node A ), a second category (Node B ), and an undetermined third category (Node N ). 如申請專利範圍第12項所述之電腦可紀錄媒體,其中,該評估模型效能之步驟包括:將第二類別(NodeB)設定為未決定之類別,並將第二類別(NodeB)加上未決定之第三類別(NodeN)而形成的組合中加入該些屬性並新增一層以建立一模型,該模型亦以該變數同質分析參數篩選出至少一切點,將剩餘未決定之該複數個原始樣本繼續分群為第一類別(NodeA)、第二類別(NodeB)、及未決定之第三類別(NodeN)。 The computer recordable medium according to claim 12, wherein the step of evaluating the performance of the model comprises: setting the second category (Node B ) to an undetermined category, and adding the second category (Node B ) Adding these attributes to the combination formed by the undetermined third category (Node N ) and adding a layer to establish a model, the model also filters out at least all points with the variable homogeneity analysis parameters, and the remaining undetermined The plurality of original samples continue to be grouped into a first category (Node A ), a second category (Node B ), and an undetermined third category (Node N ). 如申請專利範圍第12項所述之電腦可紀錄媒體,其中,該迴歸分析法係包括一順向選擇法使用之partial F-test。The computer recordable medium according to claim 12, wherein the regression analysis method comprises a partial F-test used by the forward selection method.
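Claims 1 and 12 above recite the same layered procedure: at each layer, attribute significance is evaluated, cut points are chosen with a variable-homogeneity parameter, clearly separated samples are assigned to the first category (Node_A) or the second category (Node_B), and the undetermined third category (Node_N) is passed to the next layer until a stopping condition holds. The sketch below is a minimal, simplified rendering of that flow. The exact statistics recited in claims 4 and 15 are given by formulas not reproduced on this page; for illustration the code assumes the standard two-group Wilks'-lambda F approximation, F = ((1 - Λ)/Λ)((n - p - 1)/p), and a size-weighted impurity of the form (N_L*t_L + N_M*t_M + N_R*t_R)/(N_L + N_M + N_R), and it replaces the claimed stopping tests with a simple purity threshold. It should be read as a sketch under those assumptions, not as the patented implementation.

    # Simplified layered three-way classifier: each layer picks one attribute and
    # two cut points, sends confident samples to Node_A (class 0) or Node_B
    # (class 1), and rebuilds only on the undetermined Node_N samples.
    import numpy as np

    def wilks_f(x, y):
        """Two-group F from Wilks' lambda for a single attribute (assumed standard form)."""
        n, p = y.size, 1
        total = ((x - x.mean()) ** 2).sum()
        within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in (0, 1))
        lam = within / total if total > 0 else 1.0           # Wilks' lambda
        return ((1.0 - lam) / lam) * (n - p - 1) / p if lam > 0 else np.inf

    def fit_layers(X, y, purity=0.9, max_layers=5):
        """Greedy layered splits; returns a list of (attribute index, low cut, high cut)."""
        layers, undecided = [], np.ones(y.size, dtype=bool)
        for _ in range(max_layers):
            Xu, yu = X[undecided], y[undecided]
            if yu.size == 0 or yu.min() == yu.max():
                break                                        # nothing left to separate
            # Pick the attribute with the largest (most significant) F statistic.
            j = int(np.argmax([wilks_f(Xu[:, k], yu) for k in range(X.shape[1])]))
            lo, hi = np.quantile(Xu[:, j], [0.25, 0.75])     # candidate cut points
            node_a = Xu[:, j] <= lo                          # tentatively class 0 (Node_A)
            node_b = Xu[:, j] >= hi                          # tentatively class 1 (Node_B)
            # Simplified stopping rule: stop if neither side is pure enough.
            if (yu[node_a] == 0).mean() < purity and (yu[node_b] == 1).mean() < purity:
                break
            layers.append((j, lo, hi))
            # Node_N: samples between the cuts remain undetermined for the next layer.
            still = np.zeros(y.size, dtype=bool)
            still[np.flatnonzero(undecided)[~(node_a | node_b)]] = True
            undecided = still
        return layers

    # Toy usage on random 160 x 5 data mirroring the 108/52 class sizes of Example 5.
    rng = np.random.default_rng(2)
    y = np.concatenate([np.zeros(108, int), np.ones(52, int)])
    X = rng.normal(size=(160, 5)) + 0.8 * y[:, None]
    print(fit_layers(X, y))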
