TWI564740B - Mutually-exclusive and collectively-exhaustive (MECE) feature selection method and computer program product

Publication number
TWI564740B
Authority
TW
Taiwan
Prior art keywords
variables
variable
grouping
indicator
samples
Application number
TW104127537A
Other languages
Chinese (zh)
Other versions
TW201709090A (en)
Inventor
李家岩
陳柏勳
Original Assignee
國立成功大學
Application filed by 國立成功大學
Priority to TW104127537A
Application granted
Publication of TWI564740B
Publication of TW201709090A

Description

Mutually-exclusive and collectively-exhaustive (MECE) feature selection method and computer program product

The present invention relates to a feature selection method, and in particular to a mutually-exclusive and collectively-exhaustive (MECE) feature selection method and a computer program product.

In machine learning and data mining, it is common to process a large number of samples, each of which may have a great many variables. Too many variables, however, can lead to the curse of dimensionality, in which the useful samples become sparse. To address this problem, feature selection is generally used to reduce the dimension or the number of variables. For example, principal component analysis (PCA) is a tool frequently used for dimensionality reduction. A good feature selection method should select as few variables as possible while ensuring that the selected variables still represent the complete information of the original samples, so that a machine learning or data mining algorithm run on the selected variables requires less computation yet still achieves good accuracy. Those of ordinary skill in the art therefore continue to look for better feature selection methods.

The present invention provides a mutually-exclusive and collectively-exhaustive feature selection method that takes independence, importance, and completeness into account.

An embodiment of the present invention provides a feature selection method for an electronic device. The feature selection method includes: obtaining a plurality of samples, each of which has a plurality of variables; performing an independent-variable generation step that divides the variables into a plurality of variable groups according to a number of clusters and selects at least one variable from each variable group to produce a plurality of independent variables; selecting a plurality of important variables from the independent variables; calculating a first clustering index of the samples according to the important variables; and repeating a first procedure a preset number of times. The first procedure includes: generating a dummy variable for each sample; and combining the important variables with the dummy variable to calculate a second clustering index of the samples. The feature selection method further includes determining whether the first clustering index and the second clustering indices pass a completeness test, and outputting the important variables when they do.

In some embodiments, the independent-variable generation step includes: performing Ward's method on the samples to determine the number of clusters; performing a K-means clustering algorithm on the variables according to the number of clusters to divide the variables into the variable groups; calculating a third clustering index for each variable in each variable group; and selecting one variable from each variable group according to the third clustering index.

In some embodiments, the step of selecting the important variables from the independent variables includes performing a stepwise selection algorithm on the independent variables to select the important variables.

In some embodiments, the step of selecting the important variables from the independent variables includes performing a best subset algorithm on the independent variables to select the important variables.

In some embodiments, the step of generating the dummy variable includes randomly generating the dummy variable for each sample from a uniform distribution.

In some embodiments, the step of determining whether the first clustering index and the second clustering indices pass the completeness test includes: calculating the mean of the second clustering indices; performing a statistical hypothesis test in which the null hypothesis is that the first clustering index equals the mean; and, if the null hypothesis is accepted, determining that the first clustering index and the second clustering indices pass the completeness test.

In some embodiments, the feature selection method further includes: if the first clustering index and the second clustering indices do not pass the completeness test, determining whether the number of important variables is less than the number of variables; and, if so, increasing the number of clusters and performing the independent-variable generation step again according to the increased number of clusters.

An embodiment of the present invention also provides a computer program product that carries out the above feature selection method when it is loaded and executed by a computer.

To make the above features and advantages of the present invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings.

100‧‧‧electronic device

102‧‧‧samples

104‧‧‧variables

110‧‧‧processor

120‧‧‧memory

210, 220, 230‧‧‧spaces

S301~S309, S401~S411‧‧‧steps

FIG. 1 is a block diagram of an electronic device according to an embodiment.

FIG. 2 is a schematic diagram illustrating the mutually-exclusive and collectively-exhaustive principle according to an embodiment.

FIG. 3 is a flowchart of a feature selection method according to an embodiment.

FIG. 4A and FIG. 4B are flowcharts of a feature selection method according to an embodiment.

Terms such as "first" and "second" used herein do not denote any particular order or sequence; they merely distinguish elements or operations described with the same technical term.

Referring to FIG. 1, FIG. 1 is a block diagram of an electronic device according to an embodiment. The electronic device 100 includes a processor 110 and a memory 120. The memory 120 stores a computer program product, and the processor 110 performs a feature selection method when it executes this computer program product. In some embodiments, the electronic device 100 may be implemented as any form of computer, such as a personal computer, a server, or an industrial computer; the invention is not limited in this respect.

The electronic device 100 obtains a plurality of samples 102, each of which has a plurality of variables; in other words, a sample can be represented as a vector of variables. After performing the feature selection method, the electronic device 100 outputs the selected variables 104. In some embodiments, the samples 102 relate to a semiconductor process and are obtained, for example, from sensors on a semiconductor production line; each sample may represent one product, and each variable may be a temperature, a thickness, a material concentration, and so on. Alternatively, the samples 102 may relate to bioinformatics, metal smelting, retail sales data, digital images, and the like. The invention does not limit the content, number, or data type of the samples and variables.

Referring to FIG. 2, FIG. 2 is a schematic diagram illustrating the mutually-exclusive and collectively-exhaustive principle according to an embodiment. Assume there are five variables, labeled as regions A to E in FIG. 2. If the variables are mutually exclusive, then, as shown in space 210, regions A to E do not overlap one another, but gaps remain between them. If the variables are collectively exhaustive, then, as shown in space 220, regions A to E fill the entire space 220 but overlap one another. If the variables are both mutually exclusive and collectively exhaustive, then, as shown in space 230, regions A to E do not overlap and yet fill the entire space. The embodiments below aim to select variables like those of space 230, which are mutually exclusive and collectively exhaustive at the same time.
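The three situations of FIG. 2 can be illustrated with plain sets. This is only a toy sketch: modeling each region as a set of integer cells, and the function names used here, are assumptions made for illustration and are not part of the patent.

```python
def is_mutually_exclusive(regions):
    """True if no cell belongs to more than one region."""
    seen = set()
    for region in regions:
        if seen & region:          # overlaps an earlier region
            return False
        seen |= region
    return True


def is_collectively_exhaustive(regions, space):
    """True if the union of all regions covers the whole space."""
    return set().union(*regions) == space


def is_mece(regions, space):
    return (is_mutually_exclusive(regions)
            and is_collectively_exhaustive(regions, space))


space = set(range(10))
# Like space 210: no overlap, but cells 4, 7 and 9 are left uncovered.
exclusive_only = [{0, 1}, {2, 3}, {5}, {6}, {8}]
# Like space 220: everything is covered, but neighbours overlap.
exhaustive_only = [{0, 1, 2}, {2, 3, 4}, {4, 5, 6}, {6, 7, 8}, {8, 9, 0}]
# Like space 230: no overlap and no gaps.
mece = [{0, 1}, {2, 3}, {4, 5}, {6, 7}, {8, 9}]
```

Only the third list satisfies both conditions, which is the property the method tries to approximate for the selected variables.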

Referring to FIG. 3, FIG. 3 is a flowchart of a feature selection method according to an embodiment. In step S301, the data are preprocessed; for example, incomplete, incorrect, inaccurate, or irrelevant variables may be identified and then replaced, modified, transformed, or removed. In this embodiment, step S301 includes the following sub-steps. First, data collection: data representing real-world physical conditions are collected through sensors or measuring instruments and converted into a specific format. Second, data cleaning: errors, noise, missing values, outliers, and inconsistent values are identified. Data cleaning improves data quality; for a missing value, for example, the maximum, minimum, mean, interpolated value, or probability distribution of the other samples may be computed to estimate it. Third, data integration: different data sources are combined to produce meaningful and valuable data. Fourth, data transformation: the data are converted to binary, decimal, or octal form, or to an interval scale, a ratio scale, and so on. These sub-steps are only examples; other embodiments may also perform processing such as denoising or removing extreme values. Alternatively, in some embodiments, the obtained variables may be passed through a sigmoid function or another function to change the distribution of their values. The invention does not limit the specific content of the data preprocessing.
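Two of the preprocessing options mentioned above, estimating a missing value from the mean of the other samples and passing a variable through a sigmoid function, might look as follows. This is a minimal sketch; representing a missing value as `None` and the helper names are assumptions for illustration.

```python
import math


def impute_with_mean(column):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def sigmoid(v):
    """Squash a value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


temperature = [1.0, None, 3.0, 2.0]      # one variable across four samples
cleaned = impute_with_mean(temperature)   # missing entry becomes 2.0
squashed = [sigmoid(v) for v in cleaned]
```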

In step S302, independence is considered in order to obtain a plurality of independent variables that are independent of one another (or only weakly correlated). In some embodiments, all the variables are first clustered into a plurality of variable groups, and the independent variables are then picked from these variable groups (this step is also called the independent-variable generation step). For example, Ward's method may be performed on all the samples to determine a number of clusters. Ward's method first treats each sample as its own group and then repeatedly merges two groups according to the sum of squared errors (SSE) until all the groups have been merged into one. Let $x_{ij}$ denote the $i$-th sample in the $j$-th group, $\bar{x}_j$ the mean of the variables over all samples in the $j$-th group, and $\bar{x}$ the mean of the variables over all samples. The sum of squared errors can then be written as equation (1), the total sum of squares (SST) as equation (2), and the R-squared value as equation (3).

$$\mathrm{SSE} = \sum_{j} \sum_{i} \lVert x_{ij} - \bar{x}_j \rVert^2 \tag{1}$$

$$\mathrm{SST} = \sum_{j} \sum_{i} \lVert x_{ij} - \bar{x} \rVert^2 \tag{2}$$

$$R^2 = (\mathrm{SST} - \mathrm{SSE}) / \mathrm{SST} \tag{3}$$
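The quantities SSE, SST, and R-squared defined above can be computed directly for a given grouping of samples. The sketch below assumes the usual definitions (squared Euclidean distance to the group mean for SSE, and to the overall mean for SST); the function names are illustrative only.

```python
def mean_vector(samples):
    """Component-wise mean of a list of equally long sample vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(groups):
    """Sum over groups of squared distances to each group's own mean."""
    total = 0.0
    for group in groups:
        center = mean_vector(group)
        total += sum(squared_distance(s, center) for s in group)
    return total


def sst(groups):
    """Sum of squared distances of every sample to the overall mean."""
    all_samples = [s for group in groups for s in group]
    center = mean_vector(all_samples)
    return sum(squared_distance(s, center) for s in all_samples)


def r_squared(groups):
    total = sst(groups)
    return (total - sse(groups)) / total


# Two tight, well separated groups of one-dimensional samples.
groups = [[[0.0], [2.0]], [[10.0], [12.0]]]
```

For this toy grouping the SSE is 4, the SST is 104, and R-squared is 100/104, so a value close to 1 indicates that the grouping explains most of the spread in the data.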

Thus, as Ward's method proceeds, the SSE grows with each merge. A threshold can be set so that merging stops once the SSE exceeds it, and the number of groups at that point is taken as the number of clusters. Once the number of clusters is obtained, a K-means clustering algorithm is performed on all the variables according to this number to divide the variables into a plurality of variable groups. Ward's method and the K-means algorithm are only examples, however; any known clustering algorithm may be used to produce the variable groups. For example, Ward's method may be replaced by another hierarchical clustering method, such as single linkage, complete linkage, average linkage, or centroid linkage, to determine the number of clusters. Alternatively, the fuzzy C-means algorithm, a mixture-of-Gaussians model, hierarchical clustering, and so on may be used for the clustering, and the clustering algorithm may be supervised or unsupervised; the invention is not limited in this respect.
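One way to read the threshold rule described above is sketched below: groups are merged greedily, always choosing the pair whose merge increases the SSE the least, and merging stops just before the total SSE would exceed the threshold. The brute-force pair search and the exact stopping convention are assumptions; the text only says that merging stops when the SSE passes a threshold.

```python
from itertools import combinations


def group_sse(group):
    """Sum of squared distances of a group's samples to the group mean."""
    n = len(group)
    center = [sum(col) / n for col in zip(*group)]
    return sum(sum((x - c) ** 2 for x, c in zip(s, center)) for s in group)


def ward_cluster_count(samples, sse_threshold):
    """Merge groups Ward-style and return how many remain at the threshold."""
    groups = [[s] for s in samples]          # every sample starts alone
    while len(groups) > 1:
        best = None
        for i, j in combinations(range(len(groups)), 2):
            increase = (group_sse(groups[i] + groups[j])
                        - group_sse(groups[i]) - group_sse(groups[j]))
            if best is None or increase < best[0]:
                best = (increase, i, j)
        increase, i, j = best
        current = sum(group_sse(g) for g in groups)
        if current + increase > sse_threshold:
            break                            # next merge would exceed the limit
        merged = groups[i] + groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
    return len(groups)


samples = [[0.0], [1.0], [10.0], [11.0], [20.0], [21.0]]
n_clusters = ward_cluster_count(samples, sse_threshold=5.0)
```

With three well separated pairs and a threshold of 5.0, the three cheap merges are accepted and the next one is rejected, so the sketch settles on three clusters.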

Note that the above step clusters the variables, not the samples. For example, the value of a given variable can be taken from every sample to form a vector that represents that variable, and the distance between two variables can then be computed as the Euclidean distance between their vectors; the invention does not, however, limit how the distance between variables is defined. After the variable groups are obtained, a clustering index (also called the third clustering index) is calculated for each variable in each variable group, and at least one variable is picked from each variable group according to this index to serve as an independent variable. In this embodiment, the variable with the best clustering index is picked from each variable group to produce the independent variables, but other embodiments may pick two, three, or more variables from each variable group; the invention is not limited in this respect. In addition, the clustering index may be an R-squared value, an F value, or another suitable index; the invention is not limited in this respect.
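Clustering variables rather than samples amounts to transposing the data so that each variable becomes a vector over the samples. The sketch below shows this with a bare-bones K-means and then keeps the best-scoring variable of each group. The naive initialization (first `k` vectors), the fixed iteration count, and the externally supplied `scores` list standing in for the third clustering index are all assumptions for illustration.

```python
import math


def variable_vectors(samples):
    """Transpose: one vector per variable, holding its value in every sample."""
    return [list(col) for col in zip(*samples)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_variables(vectors, k, iterations=20):
    """Plain K-means over variable vectors; returns one group label per variable."""
    centers = [list(v) for v in vectors[:k]]   # naive initialization (assumption)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: euclidean(v, centers[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:                        # keep old center if group is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def pick_representatives(labels, scores):
    """Keep, per variable group, the index of the variable with the highest score."""
    best = {}
    for index, (label, score) in enumerate(zip(labels, scores)):
        if label not in best or score > scores[best[label]]:
            best[label] = index
    return sorted(best.values())


# Three samples, four variables; variables 0 and 1 move together, as do 2 and 3.
samples = [[1.0, 1.1, 9.0, 9.2],
           [2.0, 2.1, 7.0, 7.1],
           [3.0, 3.2, 5.0, 5.1]]
labels = kmeans_variables(variable_vectors(samples), k=2)
independent = pick_representatives(labels, scores=[0.2, 0.9, 0.8, 0.3])
```

Here variables 0 and 1 end up in one group and variables 2 and 3 in the other, and one variable per group is kept as an independent variable.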

In step S303, importance is considered, and a plurality of important variables are selected from the independent variables. In some embodiments, step S303 performs a stepwise selection algorithm or a best subset algorithm to select the important variables. Broadly speaking, stepwise selection consists of forward selection and backward elimination. In forward selection, if any of the independent variables not yet selected has an F value greater than a certain threshold, the variable with the largest F value is selected. In backward elimination, if any of the selected variables has an F value less than a certain threshold, the variable with the smallest F value is removed. Forward selection and backward elimination may be performed several times, and the variables finally selected are the important variables. Those of ordinary skill in the art will understand the stepwise selection algorithm, so it is not described in detail here.
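The alternation of forward selection and backward elimination can be sketched generically. How the F value of a variable is computed is not specified here, so the sketch takes it as a caller-supplied function `f_value(variable, already_selected)`; that callback, the two thresholds `f_in` and `f_out`, and the round limit are hypothetical names introduced for illustration.

```python
def stepwise_select(candidates, f_value, f_in, f_out, max_rounds=100):
    """Alternate forward selection and backward elimination until stable."""
    selected = []
    for _ in range(max_rounds):
        changed = False
        # Forward selection: add the unselected variable with the largest
        # F value, provided that value exceeds the entry threshold.
        remaining = [v for v in candidates if v not in selected]
        if remaining:
            best = max(remaining, key=lambda v: f_value(v, selected))
            if f_value(best, selected) > f_in:
                selected.append(best)
                changed = True
        # Backward elimination: drop the selected variable with the smallest
        # F value, provided that value is below the removal threshold.
        if selected:
            def others(v):
                return [s for s in selected if s != v]
            worst = min(selected, key=lambda v: f_value(v, others(v)))
            if f_value(worst, others(worst)) < f_out:
                selected.remove(worst)
                changed = True
        if not changed:
            break
    return selected


# Toy F values that ignore the variables already in the model.
toy_f = {"temperature": 9.0, "thickness": 5.0, "noise": 0.5}
important = stepwise_select(list(toy_f), lambda v, chosen: toy_f[v],
                            f_in=4.0, f_out=3.0)
```

Keeping `f_in` at or above `f_out` avoids a variable being added and removed in the same round; the round limit is only a safety net.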

In addition, the best subset algorithm produces the best-fitting regression model from combinations of variables freely chosen by the user; the usual selection principle is to satisfy the statistical properties of the data set with the fewest variables. When selecting variables, the best subset algorithm uses clustering indices that include the R-squared value, the adjusted R-squared value, the mean squared error, and Mallows's Cp. When the candidate combinations contain the same number of variables, the R-squared value may be used; when they contain different numbers of variables, the adjusted R-squared value and Mallows's Cp may be used as statistics that cross-check each other. Those of ordinary skill in the art will also understand the best subset algorithm, so it is not described further here.
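When candidate combinations differ in size, the adjusted R-squared penalizes the extra variables. The sketch below uses the standard formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1) for n samples and p variables, and enumerates all subsets. The callback `r2_of`, which returns the R-squared of a fitted model for a given subset, is a hypothetical stand-in, and the lookup table is made-up toy data.

```python
from itertools import combinations


def adjusted_r_squared(r2, n_samples, n_variables):
    """Standard adjusted R-squared for a model with n_variables predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_variables - 1)


def best_subset(variables, r2_of, n_samples):
    """Try every non-empty subset and keep the one with the best adjusted R^2."""
    best, best_score = None, float("-inf")
    for size in range(1, len(variables) + 1):
        for subset in combinations(variables, size):
            score = adjusted_r_squared(r2_of(subset), n_samples, size)
            if score > best_score:
                best, best_score = subset, score
    return best, best_score


# Toy R-squared values: adding "b" to "a" barely improves the fit.
toy_r2 = {("a",): 0.80, ("b",): 0.50, ("a", "b"): 0.81}
subset, score = best_subset(["a", "b"], lambda s: toy_r2[s], n_samples=12)
```

In this toy case the single variable "a" wins, because the small gain from adding "b" does not offset the penalty for the extra variable.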

The purpose of step S303 is to pick the important variables, and other embodiments may use methods other than the stepwise selection and best subset algorithms described above. For example, principal component analysis may first be performed on the independent variables obtained in step S302; the component with the largest eigenvalue has a plurality of coefficients (which act as weights), one for each independent variable, and the independent variables with the largest coefficients may be picked as the important variables. Alternatively, adaptive boosting (AdaBoost) may be performed on the independent variables; in adaptive boosting, each variable chosen carries a weight, and the important variables can be picked according to these weights (for example, by picking the variables with larger weights). Alternatively, a linear regression may be fitted to the independent variables; after the regression, each variable likewise has its own weight, and the important variables can be picked according to these weights. The spirit of step S303 is to pick the important variables from independent variables that are independent of one another (or only weakly correlated); those of ordinary skill in the art will understand that many existing techniques can achieve this, and the invention does not limit how step S303 is carried out.

Next, in step S304, completeness is considered; in this embodiment, both internal completeness and external completeness are taken into account. The important variables obtained in step S303 are a subset of all the variables. Internal completeness tests whether these important variables carry enough information for decision-making, and external completeness tests whether they carry enough information with respect to all the samples.

Specifically, a clustering index (also called the first clustering index) is first calculated according to the important variables, for example an R-squared value, denoted here $r_s^2$. That is, the samples are re-clustered according to the important variables, and the R-squared value $r_s^2$ is calculated from the clustering result. Next, a first procedure is repeated a preset number of times; the first procedure includes generating a dummy variable for each sample. In this embodiment, the dummy variable is generated randomly from a uniform distribution. In other embodiments, however, the dummy variable may be generated from another probability distribution; alternatively, several important variables may each be multiplied by a weight and summed, and a random value added to the sum to produce the dummy variable. The invention is not limited in this respect.
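Generating one uniformly distributed dummy variable per sample and appending it to the important variables can be sketched as follows. The range [0, 1], the seeded generator, and the function name are assumptions chosen for illustration.

```python
import random


def add_dummy_variable(samples, rng, low=0.0, high=1.0):
    """Return copies of the samples with one extra uniform random variable."""
    return [list(s) + [rng.uniform(low, high)] for s in samples]


rng = random.Random(0)                  # seeded only to make the sketch repeatable
important = [[0.3, 1.2], [0.5, 0.9], [0.1, 1.4]]   # samples restricted to important variables
augmented = add_dummy_variable(important, rng)
```

Each repetition of the first procedure would call `add_dummy_variable` again, so that every repetition sees a fresh random column.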

The first procedure further includes combining the important variables with the dummy variable and re-clustering the samples according to the combined variables to calculate a clustering index of the samples (also called the second clustering index; it must be of the same type as the first clustering index, for example an R-squared value). The preset number of times is, for example but not limited to, 100, so that 100 R-squared values are produced in total. After the first procedure has been performed 100 times, it is determined whether the first clustering index and the 100 second clustering indices pass a completeness test. In this embodiment, the mean of the second clustering indices, denoted $r_n^2$, is calculated first, and a statistical hypothesis test is then performed. The null hypothesis is that the first clustering index $r_s^2$ equals the mean $r_n^2$, which can be expressed as equation (4) below; the alternative hypothesis may be that the mean $r_n^2$ is greater than (or differs from) the first clustering index $r_s^2$, which can be expressed as equation (5) below.

$$H_0: r_n^2 = r_s^2 \tag{4}$$

$$H_1: r_n^2 > r_s^2 \tag{5}$$

Here, the statistical hypothesis test is a Z test. Those of ordinary skill in the art will understand the Z test and other tests, so they are not described further here. If the null hypothesis is accepted (that is, not rejected), the current important variables are already complete; in other words, no matter how many times a dummy variable is added, the R-squared value does not change significantly, and the completeness test is judged to have been passed. Conversely, if the null hypothesis is rejected, the added dummy variable may improve the completeness of the important variables, and the completeness test is judged not to have been passed.
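A one-sided Z test of this kind can be sketched with the standard library. The text does not say how the standard error is obtained, so the sketch estimates it from the sample standard deviation of the second clustering indices; that choice, the 5% significance level, and the function name are assumptions.

```python
import math
from statistics import NormalDist, mean, stdev


def passes_completeness_test(r_s, r_n_values, alpha=0.05):
    """One-sided Z test of H0: mean(r_n) == r_s against H1: mean(r_n) > r_s.

    Returns True when H0 is not rejected, i.e. the dummy variables did not
    raise the clustering index significantly.
    """
    n = len(r_n_values)
    r_n = mean(r_n_values)
    standard_error = stdev(r_n_values) / math.sqrt(n)
    if standard_error == 0.0:            # all repetitions gave the same index
        return r_n <= r_s
    z = (r_n - r_s) / standard_error
    critical = NormalDist().inv_cdf(1.0 - alpha)
    return z <= critical


unchanged = passes_completeness_test(0.80, [0.79, 0.81, 0.80, 0.80, 0.79, 0.81])
improved = passes_completeness_test(0.80, [0.89, 0.91, 0.90, 0.90])
```

In the first call the dummy variables leave the index essentially where it was, so the completeness (integrity) test passes; in the second the index rises clearly above 0.80 and the test fails.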

In this embodiment, a statistical hypothesis test is used to determine whether the first clustering index and the second clustering indices pass the completeness test. In other embodiments, however, it may instead be determined whether the difference between the first clustering index and the mean $r_n^2$ is less than a threshold; if so, the completeness test is passed. Alternatively, the mean may be replaced by the median. Alternatively, the mean squared error between the first clustering index and the second clustering indices may be calculated; if it exceeds a certain threshold, the completeness test is not passed. Furthermore, although the first and second clustering indices are R-squared values in this embodiment, other embodiments may use other suitable clustering indices, such as the F value. Alternatively, the first and second clustering indices may be the accuracy (or error) rate of a prediction model. For example, if the samples are semiconductor products, a machine learning algorithm may be run on the current important variables to obtain a model that determines whether a semiconductor product is defective, with the prediction accuracy serving as the clustering index, and the model is retrained each time a dummy variable is added. That is, if the dummy variable cannot increase the prediction accuracy, the completeness test is judged to have been passed. Those of ordinary skill in the art may refine or modify the above teachings; the invention does not limit how it is determined whether the first clustering index and the second clustering indices pass the completeness test.
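The simpler alternatives to the hypothesis test, comparing the first clustering index with the mean or median of the second clustering indices, or thresholding their mean squared error, can be sketched as follows. The function names and the thresholds in the usage lines are illustrative assumptions.

```python
from statistics import mean, median


def passes_by_tolerance(r_s, r_n_values, tolerance, use_median=False):
    """Pass when the mean (or median) of the second indices stays close to r_s."""
    center = median(r_n_values) if use_median else mean(r_n_values)
    return abs(center - r_s) < tolerance


def passes_by_mse(r_s, r_n_values, threshold):
    """Pass unless the mean squared error against r_s exceeds the threshold."""
    mse = mean([(r - r_s) ** 2 for r in r_n_values])
    return mse <= threshold


close_enough = passes_by_tolerance(0.80, [0.79, 0.81, 0.82], tolerance=0.05)
median_check = passes_by_tolerance(0.80, [0.79, 0.81, 0.95], tolerance=0.05,
                                   use_median=True)
too_far = passes_by_mse(0.80, [0.90, 0.90], threshold=0.001)
```

The median variant is less sensitive to a single repetition in which the dummy variable happens to help by chance.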

If the completeness test is judged to have been passed in step S304, the selected important variables are output in step S305. If the completeness test is not passed, step S306 determines whether the number of selected important variables is less than the total number of variables. If the result of step S306 is yes, step S307 increases the number of clusters, and the flow returns to step S302 to perform the independent-variable generation step again according to the increased number of clusters. That is, when step S302 is performed the second time, the number of clusters is not determined by Ward's method; instead, the K-means algorithm is performed according to the increased number of clusters. As a result, more variable groups are produced and more independent variables are picked. If, on the other hand, the result of step S306 is no, step S308 ends the feature selection. In some embodiments, step S309 may also be performed to look for unobserved variables, for example by adding sensors to the semiconductor production line, so as to obtain more variables.
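The control flow of steps S302 to S308 can be summarized as a loop. In the sketch below the three stages are passed in as callbacks, whose names and signatures are hypothetical, and one extra stopping guard is added that is not part of the described flow, so that the loop cannot run forever with arbitrary callbacks.

```python
def mece_feature_selection(n_variables, n_clusters, pick_independent,
                           pick_important, passes_completeness):
    """Repeat S302 to S304, adding one cluster per round, until done."""
    while True:
        independent = pick_independent(n_clusters)      # S302
        important = pick_important(independent)         # S303
        if passes_completeness(important):              # S304
            return important                            # S305: output
        if len(important) >= n_variables:               # S306 answered "no"
            return None                                 # S308: end selection
        if n_clusters >= n_variables:                   # extra guard (assumption)
            return None
        n_clusters += 1                                 # S307


# Toy stand-ins: k clusters yield variables 0..k-1, all of which are kept,
# and the set counts as complete once it holds three variables.
selected = mece_feature_selection(
    n_variables=5, n_clusters=2,
    pick_independent=lambda k: list(range(k)),
    pick_important=lambda independent: independent,
    passes_completeness=lambda important: len(important) >= 3)
```

A return value of `None` corresponds to ending without a complete set, the case in which step S309 would look for variables that have not been observed yet.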

The present invention also provides a computer program product that carries out the above feature selection method when it is loaded and executed by a computer. The invention does not limit the programming language in which the computer program product is implemented.

Please refer to FIG. 4A and FIG. 4B, which are flowcharts of the feature selection method according to an embodiment. In step S401, a plurality of samples is obtained. In step S402, the variables are divided into a plurality of variable groups according to the number of clusters, and at least one variable is selected from each variable group to produce a plurality of independence variables. In step S403, a plurality of important variables is selected from the independence variables. In step S404, a first grouping indicator of the samples is calculated from the important variables. In step S405, a dummy variable is generated for each sample. In step S406, the important variables are combined with the dummy variable to calculate a second grouping indicator of the samples; steps S405 and S406 are collectively referred to as the first procedure. In step S407, it is determined whether a preset number of repetitions has been reached; if not, the flow returns to step S405, and if so, it proceeds to step S408. In step S408, it is determined whether the first grouping indicator and the second grouping indicators pass the integrity test; if so, the important variables are output in step S409, and if not, the flow proceeds to step S410. In step S410, it is determined whether the number of important variables is smaller than the number of all variables; if not, the flow ends, and if so, step S412 is performed, in which the number of clusters is increased and the flow returns to step S402. The steps of FIG. 4A and FIG. 4B have been described in detail above and are not repeated here. Note that each step in FIG. 4A and FIG. 4B may be implemented as program code or as a circuit; the invention is not limited in this respect. In addition, the method of FIG. 4A and FIG. 4B may be used together with the above embodiments or on its own. In other words, other steps may be added between the steps of FIG. 4A and FIG. 4B.
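The control flow of steps S401 to S412 can be sketched in Python as follows. This is only an illustrative skeleton, not the implementation of the embodiment: the function name `mece_select`, the four callables passed in, and the default of 30 repetitions are all hypothetical. The dummy variable is drawn uniformly from [0, 1), which follows claim 5 rather than the flowchart text. The guard that stops once the number of clusters reaches the number of variables is an added assumption, included so that the sketch always terminates.

```python
import random
from typing import Callable, List, Optional, Sequence

Sample = Sequence[float]


def mece_select(
    samples: List[Sample],                                         # S401: samples already obtained
    n_groups: int,                                                 # initial number of clusters
    pick_independent: Callable[[List[Sample], int], List[int]],    # hypothetical S402 helper
    pick_important: Callable[[List[Sample], List[int]], List[int]],  # hypothetical S403 helper
    grouping_indicator: Callable[[List[List[float]]], float],      # hypothetical S404/S406 helper
    passes_test: Callable[[float, List[float]], bool],             # hypothetical S408 helper
    repeats: int = 30,                                             # assumed preset number of times
    seed: int = 0,
) -> Optional[List[int]]:
    """Return the column indices of the important variables, or None if the flow ends."""
    rng = random.Random(seed)
    n_vars = len(samples[0])
    while True:
        independent = pick_independent(samples, n_groups)          # S402
        important = pick_important(samples, independent)           # S403
        reduced = [[s[j] for j in important] for s in samples]
        first = grouping_indicator(reduced)                        # S404

        second = []
        for _ in range(repeats):                                   # S407: repeat the first procedure
            augmented = [row + [rng.random()] for row in reduced]  # S405: one dummy value per sample
            second.append(grouping_indicator(augmented))           # S406

        if passes_test(first, second):                             # S408
            return important                                       # S409
        if len(important) >= n_vars:                               # S410: nothing left to add
            return None
        if n_groups >= n_vars:                                     # added guard, not in the source
            return None
        n_groups += 1                                              # S412, then back to S402
```

Passing the individual steps in as callables keeps the loop independent of any particular clustering algorithm, selection algorithm, or hypothesis test, which matches the statement that other steps may be added between the steps of the flowchart.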

Although the invention has been disclosed above by way of embodiments, they are not intended to limit the invention. Anyone with ordinary knowledge in the relevant technical field may make minor changes and refinements without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore defined by the appended claims.

S301~S309: steps

Claims (8)

1. A mutually-exclusive and collectively-exhaustive feature selection method for an electronic device, the feature selection method comprising: obtaining a plurality of samples, wherein each of the samples has a plurality of variables; performing an independence variable generation step, in which one of the variables is taken from each of the samples to form a vector representing the corresponding variable, the variables are divided into a plurality of variable groups according to a number of clusters, and at least one of the variables is selected from each of the variable groups to produce a plurality of independence variables; selecting a plurality of important variables from the independence variables; calculating a first grouping indicator of the samples according to the important variables; repeating a first procedure a preset number of times, wherein the first procedure comprises: generating a dummy variable for each of the samples; and combining the important variables with the dummy variable to calculate a second grouping indicator of the samples; determining whether the first grouping indicator and the second grouping indicators pass an integrity test; and outputting the important variables when the first grouping indicator and the second grouping indicators pass the integrity test.

2. The feature selection method of claim 1, wherein the independence variable generation step comprises: performing Ward's method on the samples to determine the number of clusters; performing a K-means clustering algorithm on the variables according to the number of clusters to divide the variables into the variable groups; calculating a third grouping indicator for each of the variables in each of the variable groups; and selecting one of the variables from each of the variable groups according to the third grouping indicators.

3. The feature selection method of claim 1, wherein the step of selecting the important variables from the independence variables comprises: performing a stepwise selection algorithm on the independence variables to select the important variables.

4. The feature selection method of claim 1, wherein the step of selecting the important variables from the independence variables comprises: performing a best subset algorithm on the independence variables to select the important variables.

5. The feature selection method of claim 1, wherein the step of generating the dummy variable for each of the samples comprises: randomly generating the dummy variable for each of the samples from a uniform distribution.

6. The feature selection method of claim 1, wherein the step of determining whether the first grouping indicator and the second grouping indicators pass the integrity test comprises: calculating an average of the second grouping indicators; performing a statistical hypothesis test, in which a null hypothesis is set to be that the first grouping indicator equals the average; and if the null hypothesis is accepted, determining that the first grouping indicator and the second grouping indicators pass the integrity test.

7. The feature selection method of claim 1, further comprising: if the first grouping indicator and the second grouping indicators do not pass the integrity test, determining whether the number of the important variables is smaller than the number of the variables; and if the number of the important variables is smaller than the number of the variables, increasing the number of clusters and performing the independence variable generation step again according to the increased number of clusters.

8. A computer program product which, when loaded and executed by a computer, carries out the feature selection method of any one of claims 1 to 7.
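The dummy-variable generation of claim 5 and the integrity test of claim 6 can be illustrated with the short Python sketch below. The claims only require "a statistical hypothesis test" whose null hypothesis is that the first grouping indicator equals the average of the second grouping indicators. The two-sided test with a normal approximation, the significance level of 0.05, and the names `add_dummy_variable` and `passes_integrity_test` are assumptions made for this example and are not taken from the patent.

```python
import math
import random
import statistics
from typing import List, Sequence


def add_dummy_variable(samples: Sequence[Sequence[float]],
                       rng: random.Random) -> List[List[float]]:
    """Append one dummy value, drawn uniformly from [0, 1), to every sample."""
    return [list(row) + [rng.random()] for row in samples]


def passes_integrity_test(first_indicator: float,
                          second_indicators: Sequence[float],
                          alpha: float = 0.05) -> bool:
    """Accept the null hypothesis 'first indicator == mean of second indicators'.

    Assumption: a two-sided z-test (normal approximation) stands in for the
    unspecified hypothesis test; it is reasonable only for many repetitions.
    """
    n = len(second_indicators)
    if n < 2:
        raise ValueError("need at least two second grouping indicators")
    mean = statistics.fmean(second_indicators)
    sd = statistics.stdev(second_indicators)
    if sd == 0.0:
        # No spread at all: the null hypothesis holds only on exact equality.
        return mean == first_indicator
    z = (mean - first_indicator) / (sd / math.sqrt(n))
    critical = statistics.NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return abs(z) <= critical
```

Accepting the null hypothesis means that adding a purely random variable does not change the grouping indicator in a statistically detectable way. With only a few repetitions, a t-based test would be a more careful choice than the normal approximation used here.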
TW104127537A 2015-08-24 2015-08-24 Mutually-exclusive and collectively-exhaustive (mece) feature selection method and computer program product TWI564740B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW104127537A TWI564740B (en) 2015-08-24 2015-08-24 Mutually-exclusive and collectively-exhaustive (mece) feature selection method and computer program product

Publications (2)

Publication Number Publication Date
TWI564740B true TWI564740B (en) 2017-01-01
TW201709090A TW201709090A (en) 2017-03-01

Family

ID=58407816

Family Applications (1)

Application Number Title Priority Date Filing Date
TW104127537A TWI564740B (en) 2015-08-24 2015-08-24 Mutually-exclusive and collectively-exhaustive (mece) feature selection method and computer program product

Country Status (1)

Country Link
TW (1) TWI564740B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6529895B2 (en) * 1999-04-23 2003-03-04 Microsoft Corporation Determining a distribution of a numeric variable
US6977679B2 (en) * 2001-04-03 2005-12-20 Hewlett-Packard Development Company, L.P. Camera meta-data for content categorization
TWI294971B (en) * 2006-03-21 2008-03-21 Method of system model dimension identification and important variables selection
TW201126354A (en) * 2010-01-25 2011-08-01 Amcad Biomed Corp Method for multi-layer classifier

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees