TWI474210B

TWI474210B - A method of applying a genetic algorithm to automatically group cases and select independent variables while simultaneously estimating regression model parameters

Info

Publication number: TWI474210B
Application number: TW101144015A
Authority: TW
Original assignee: Nat Taichung University Science & Technology
Priority date: 2012-11-23
Filing date: 2012-11-23
Publication date: 2015-02-21
Also published as: TW201421275A

Description

A method of applying a genetic algorithm to automatically group cases and select independent variables while simultaneously estimating regression model parameters

The present invention relates to a method that applies a genetic algorithm to automatically group cases and select independent variables while simultaneously estimating regression model parameters.

Regression analysis is a very important tool in the social sciences. In many situations the collected cases must be divided into groups before a causal model can be built, yet common software packages (such as SPSS and Statistica) do not offer grouping and regression parameter estimation in a single step. Analysts therefore usually have to group the cases first with some clustering method (such as cluster analysis), then build a regression model for each group and estimate its parameters. Because grouping and parameter estimation are not performed simultaneously, the resulting models are not guaranteed to be optimal. In addition, building the regression model for each group often takes a great deal of time spent eliminating unsuitable cases, selecting suitable independent variables, and judging whether the model parameters are reasonable.

As for grouping techniques, besides cluster analysis from the field of multivariate analysis, case-based reasoning (CBR) from the field of expert systems can also be used to screen and group cases. For example, Kuncheva and Jain (1999) proposed a method for simultaneously optimizing case selection and feature selection, and Ahn, Kim and Han (2007) applied CBR to the problem of customer segmentation. However, no literature has been found that applies CBR to automatic grouping while simultaneously building regression models and estimating their parameters, and this is the problem the present invention sets out to solve.

The present invention combines a genetic algorithm, CBR techniques and regression models to group cases while simultaneously building a regression model for each group and estimating its parameters. The number of groups and the dependent variable (more than one dependent variable may be specified) are set in advance, whereas the independent variables of each group are not known beforehand. This makes it possible to find more accurate grouping criteria and to identify the key independent variables that apply to each group.

Accordingly, the main object of the present invention is to provide a method that applies a genetic algorithm to automatically group cases and select independent variables while simultaneously estimating regression model parameters, which can find more accurate grouping criteria and thereby identify the key independent variables applicable to each group.

To achieve the foregoing object, the method provided by the present invention, which applies a genetic algorithm to automatically group cases and select independent variables while simultaneously estimating regression model parameters, comprises the following steps:

A) Define the chromosome structure and encoding: the chromosome structure and encoding used in the genetic algorithm are defined. Each chromosome contains the cases, the group to which each case belongs, the independent variables, and the groups to which each independent variable belongs. The group membership of a case or of an independent variable is expressed in a predetermined encoding format. A case belongs either to exactly one of a plurality of groups or to no group at all (that is, it belongs to zero groups), and an independent variable belongs to zero or more groups.

B) Generate the initial population: a predetermined number of generation-0 chromosomes are generated at random and defined as the previous-generation population.

C) Perform genetic operations to produce new offspring: selection, crossover and mutation are applied to the chromosomes of the previous-generation population to produce the chromosomes of the next generation, which are defined as the next-generation population.

D) Decode the chromosomes of the next-generation population produced in step C): decoding reveals which group each case has been assigned to and which independent variables each group uses, and the decoded case data are divided into a plurality of data sets according to the group to which they belong.

E) Estimate the regression model parameters and compute the fitness value: each data set is fed into a regression model for parameter estimation, and the fitness value is computed from the result.

F) Check whether a stop condition is satisfied: if so, the chromosome with the best fitness value and its corresponding population are retrieved, the grouping result is stored, and the procedure ends; if not, the next-generation population produced in step C) is treated as the previous-generation population and the procedure returns to step C).

Preferably, in step A), the group to which a case belongs is represented by an integer code that gives its group number, and the groups to which an independent variable belongs are represented by a binary code in which each bit corresponds to one group.

Preferably, a step C0) is further included before step C): before selection, crossover and mutation are applied to the chromosomes of the previous-generation population, the fitness value of every chromosome in that population is evaluated; if a chromosome is an infeasible solution, it is either regenerated or modified so that it becomes a feasible solution.

Preferably, producing a next-generation population in step C) is referred to as evolving one generation; in step F), the stop condition is that the number of generations evolved reaches the number set by the user.

Preferably, in step E), the fitness value is computed from the adjusted coefficient of determination of the regression model of each group, and two calculation modes are available: an overall mode and a weighted-average mode.

Preferably, in step E), if the sign of an independent-variable coefficient in the regression model of any group differs from the criterion entered in advance (positive, negative, or undetermined), the adjusted coefficient of determination of that model is reset to 0.

To explain the structure and features of the present invention in detail, a preferred embodiment is described below with reference to the drawing, wherein:

As shown in the first figure, a preferred embodiment of the present invention provides a method that applies a genetic algorithm to automatically group cases and select independent variables while simultaneously estimating regression model parameters. The method mainly comprises the following steps:

A) Define the chromosome structure and encoding: the chromosome structure and encoding used in the genetic algorithm are defined. Each chromosome contains the cases (which may also be called samples), the group to which each case belongs, the independent variables, and the groups to which each independent variable belongs. In this example the cases correspond to manufacturers and the independent variables correspond to performance indicators. The group membership of a case or of an independent variable is expressed in a predetermined encoding format. A case belongs either to exactly one of a plurality of groups or to no group at all (that is, it belongs to zero groups), and an independent variable belongs to zero or more groups. In this embodiment, the group to which each case belongs is represented by an integer code that gives its group number, and the groups to which an independent variable belongs are represented by a binary code in which each bit corresponds to one group.

B) Generate the initial population: a predetermined number of generation-0 chromosomes are generated at random and defined as the previous-generation population.
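Step B) can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names, the dictionary layout and the use of Python's `random` module are assumptions. Case genes are drawn from 0..G and independent-variable genes from 0..2^G - 1, following the integer and binary encodings of step A).

```python
import random

def random_chromosome(n_cases, n_vars, n_groups, rng):
    """Build one random chromosome (hypothetical layout, for illustration).

    cases[i] in 0..n_groups        : 0 = not selected, g = case belongs to group g
    vars[j]  in 0..2**n_groups - 1 : bit (g - 1) set = variable j is used by group g
    """
    return {
        "cases": [rng.randint(0, n_groups) for _ in range(n_cases)],
        "vars": [rng.randint(0, 2 ** n_groups - 1) for _ in range(n_vars)],
    }

def initial_population(pop_size, n_cases, n_vars, n_groups, seed=None):
    """Generate the generation-0 population of step B) at random."""
    rng = random.Random(seed)
    return [random_chromosome(n_cases, n_vars, n_groups, rng)
            for _ in range(pop_size)]

# A population of 5 chromosomes for 10 cases, 5 independent variables and 2 groups.
population = initial_population(pop_size=5, n_cases=10, n_vars=5, n_groups=2, seed=0)
```

Passing a seed makes the random population reproducible, which is convenient when comparing runs.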

For example, suppose there are 10 cases (I) and 5 independent variables (F), to be divided into 2 groups, and the population is set to 5 chromosomes; the chromosomes can then be represented as in Table 1.

In Table 1, the total number of groups is 2, so the code value for each case is 0, 1 or 2, meaning respectively that the case is not selected, is assigned to the first group, or is assigned to the second group.

Take chromosome 5 as an example. The group code of a case may be 0 to 2, and the gene value of I₂ is 2, which means that case 2 is assigned to group 2. Table 1 likewise shows that in chromosome 1, case 1 is assigned to group 1, case 2 is assigned to group 2, and case 3 is not selected (that is, not assigned to any group).

Next, the selection state of the independent variables (F) is explained. In Table 1, with a total of 2 groups, the code value for each independent variable ranges from 0 to 3, four values in all. Because a decimal value is actually stored in binary form inside a computer, binary encoding allows one bit (that is, one binary digit) of computer memory to indicate whether an independent variable is selected by a given group, and hence which groups select it. A 1 in the binary code means that the variable is selected by the corresponding group, and a 0 means that it is not. Table 2 gives an example of the selection states of an independent variable.

Reading Table 1 together with Table 2, the value of independent variable F₃ in chromosome 5 is 2. Table 2 shows that the binary value of the integer 2 is 10, which means that F₃ is selected by group 2 while group 1 does not use it. The value of F₁ is 0, meaning it is not selected by any group. The value of F₂ is 1, meaning it is selected by group 1. The value of F₄ is 3 (binary 11), meaning it is selected by both group 1 and group 2.

Table 1 thus shows which group each case is assigned to and which independent variables each group uses.
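The decoding described above can be sketched in Python. The function name `decode` and the dictionary layout are assumptions made for illustration. Table 1 itself is not reproduced in this text, so the gene values used for chromosome 5 are reconstructed from the description: the F values discussed above and the group memberships later listed for Table 3, from which F₅ = 1 is inferred.

```python
def decode(chromosome, n_groups):
    """Return {group: {"cases": [...], "vars": [...]}} using 1-based indices."""
    groups = {g: {"cases": [], "vars": []} for g in range(1, n_groups + 1)}
    # Integer encoding: gene value g > 0 assigns the case to group g; 0 = not selected.
    for i, g in enumerate(chromosome["cases"], start=1):
        if g != 0:
            groups[g]["cases"].append(i)
    # Binary encoding: bit (g - 1) of the gene tells whether group g uses the variable.
    for j, code in enumerate(chromosome["vars"], start=1):
        for g in range(1, n_groups + 1):
            if (code >> (g - 1)) & 1:
                groups[g]["vars"].append(j)
    return groups

# Chromosome 5 as reconstructed from the text (an inference, not a copy of Table 1).
chromosome_5 = {
    "cases": [1, 2, 0, 2, 1, 1, 2, 1, 2, 1],  # I1..I10
    "vars": [0, 1, 2, 3, 1],                  # F1..F5
}
decoded = decode(chromosome_5, n_groups=2)
# decoded[1] == {"cases": [1, 5, 6, 8, 10], "vars": [2, 4, 5]}
# decoded[2] == {"cases": [2, 4, 7, 9], "vars": [3, 4]}
```

The same function works for any number of groups, because each additional group only adds one more bit to the independent-variable genes.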

C0) Evaluate the fitness value of every chromosome in the previous-generation population. If a chromosome is an infeasible solution, it is either regenerated or modified so that it becomes a feasible solution. In this embodiment, regenerating the chromosome is used as the example. (This step is not shown in the first figure.)
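The text does not define what makes a chromosome infeasible. The Python sketch below assumes one plausible rule, chosen only for illustration: every group must use at least one independent variable and must have more cases than variables plus one, so that a regression with an intercept and its adjusted R² can be computed. The function names and the `max_tries` guard are likewise hypothetical.

```python
def is_feasible(decoded_groups):
    """Assumed rule: each group needs k >= 1 variables and n > k + 1 cases."""
    for members in decoded_groups.values():
        n = len(members["cases"])
        k = len(members["vars"])
        if k < 1 or n - k - 1 <= 0:
            return False
    return True

def regenerate_until_feasible(make_chromosome, decode, max_tries=1000):
    """Keep drawing new chromosomes until one decodes to a feasible solution."""
    for _ in range(max_tries):
        candidate = make_chromosome()
        if is_feasible(decode(candidate)):
            return candidate
    raise RuntimeError("no feasible chromosome found within max_tries")
```

Under this assumed rule the two groups of chromosome 5 (five cases with three variables, and four cases with two variables) are both feasible.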

C) Perform genetic operations to produce new offspring: selection, crossover and mutation are applied to the chromosomes of the previous-generation population to produce the chromosomes of the next generation, which are defined as the next-generation population. Producing the next-generation population in this step is referred to as evolving one generation.
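The text names selection, crossover and mutation but does not say which variants are used. The Python sketch below therefore picks common ones as assumptions: tournament selection, one-point crossover and uniform-redraw mutation. It treats a chromosome as a flat list of genes with a per-gene upper bound (2 for case genes and 3 for variable genes when there are two groups); all names and the default mutation rate are hypothetical.

```python
import random

def tournament_select(population, fitnesses, rng, k=2):
    """Pick the fittest of k randomly drawn chromosomes (assumed selection scheme)."""
    contenders = rng.sample(range(len(population)), k)
    best = max(contenders, key=lambda idx: fitnesses[idx])
    return population[best]

def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length gene lists at a random cut point."""
    cut = rng.randint(1, len(parent_a) - 1)
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def mutate(genes, upper_bounds, rate, rng):
    """Redraw each gene uniformly from 0..upper_bound with probability `rate`."""
    return [rng.randint(0, hi) if rng.random() < rate else g
            for g, hi in zip(genes, upper_bounds)]

def next_generation(population, fitnesses, upper_bounds, rng, mutation_rate=0.05):
    """Produce a next-generation population of the same size as the previous one."""
    children = []
    while len(children) < len(population):
        a = tournament_select(population, fitnesses, rng)
        b = tournament_select(population, fitnesses, rng)
        for child in one_point_crossover(a, b, rng):
            children.append(mutate(child, upper_bounds, mutation_rate, rng))
    return children[:len(population)]
```

Because mutation redraws within each gene's own bound, offspring always remain valid encodings, although they may still need the feasibility handling of step C0).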

D) Decode the chromosomes of the next-generation population produced in step C): decoding reveals which group each case has been assigned to and which independent variables each group uses, and the decoded data are divided into a plurality of data sets according to the group to which they belong.

E) Estimate the regression model parameters and compute the fitness value: each data set is fed into a regression model for parameter estimation, and the fitness value is computed from the result. In this embodiment the fitness value is computed from the adjusted coefficient of determination of the regression model of each group, and two calculation modes are available: an overall mode and a weighted-average mode. The overall mode suits situations in which every case carries equal weight (that is, all cases are equally important). The weighted-average mode suits situations in which the causal model built for every group is expected to have good explanatory power, so that the result is not skewed too heavily toward any one group.

If the sign of an independent-variable coefficient in the regression model of any group differs from the criterion entered in advance (positive, negative, or undetermined), the adjusted coefficient of determination of that model is reset to 0.
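The sign check can be sketched in Python as follows. The function name, the `'+'`, `'-'` and `'?'` markers and the dictionary inputs are hypothetical. The sketch assumes that an "undetermined" criterion imposes no constraint and that a coefficient of exactly zero does not count as a violation; the text does not settle either point.

```python
def apply_sign_check(adj_r2, coefficients, expected_signs):
    """Return adj_r2, or 0.0 if any coefficient contradicts its expected sign.

    coefficients   : {variable name: estimated coefficient}
    expected_signs : {variable name: '+', '-' or '?'}  ('?' = undetermined)
    """
    for name, coef in coefficients.items():
        expected = expected_signs.get(name, "?")
        if expected == "+" and coef < 0:
            return 0.0
        if expected == "-" and coef > 0:
            return 0.0
    return adj_r2

# Hypothetical coefficients for one group: F2 is expected positive, F4 negative.
penalised = apply_sign_check(0.8, {"F2": 1.5, "F4": 0.3}, {"F2": "+", "F4": "-"})
# penalised == 0.0 because F4 came out positive
```

Resetting the adjusted R² to 0 lowers the fitness of chromosomes whose models contradict prior expectations, so the search is steered toward models with interpretable coefficients.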

The adjusted coefficient of determination (adjusted R²) of a regression model is given by Formula 1 below.
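Formula 1 is not reproduced in this text. As a point of reference only, the standard textbook definition of the adjusted coefficient of determination is 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of cases and k the number of independent variables excluding the intercept. The Python sketch below implements that definition; whether Formula 1 takes exactly this form cannot be confirmed from the text, and the function and parameter names are assumptions.

```python
def adjusted_r2(r2, n_cases, n_vars):
    """Standard adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    dof = n_cases - n_vars - 1  # residual degrees of freedom (model with intercept)
    if dof <= 0:
        raise ValueError("need more cases than independent variables plus one")
    return 1.0 - (1.0 - r2) * (n_cases - 1) / dof

# Hypothetical R^2 of 0.9 for a group with 5 cases and 3 independent variables.
value = adjusted_r2(0.9, n_cases=5, n_vars=3)  # about 0.6
```

The penalty grows as the number of independent variables approaches the number of cases, which discourages chromosomes that assign many variables to a small group.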

The way the fitness value is computed is explained below with reference to Table 3:

Table 3 summarizes, for chromosome 5 of Table 1, the group assignment of each case and of each independent variable.

Suppose that separate regression analyses of the first group (five cases I₁, I₅, I₆, I₈ and I₁₀, with F₂, F₄ and F₅ as independent variables) and the second group (four cases I₂, I₄, I₇ and I₉, with F₃ and F₄ as independent variables) give the results shown in Table 4.

The overall-mode fitness value is computed by pooling all groups into a single whole and then applying Formula 1 above; the calculation is as follows:

Taking Table 4 as an example, the overall-mode fitness value is:

The weighted-average-mode fitness value uses the number of cases in each group as the weight and takes the weighted average of the adjusted R² (Adj.-R²) of each group as the fitness value.
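The weighted-average mode can be sketched in Python as follows. The function name and input layout are assumptions. The group sizes 5 and 4 come from the chromosome 5 example, while the two adjusted R² values are hypothetical placeholders, since the figures of Table 4 are not reproduced in this text.

```python
def weighted_average_fitness(group_results):
    """Weighted mean of per-group adjusted R^2, weighted by the number of cases.

    group_results: list of (n_cases, adj_r2) pairs, one per group.
    """
    total_cases = sum(n for n, _ in group_results)
    if total_cases == 0:
        return 0.0  # no case was assigned to any group
    return sum(n * adj_r2 for n, adj_r2 in group_results) / total_cases

# Group sizes from the example; adjusted R^2 values are made up for illustration.
fitness = weighted_average_fitness([(5, 0.80), (4, 0.62)])  # (5*0.80 + 4*0.62) / 9, about 0.72
```

A group with a poor fit pulls the average down in proportion to its size, so a chromosome cannot score well by fitting only one group.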

Taking Table 4 as an example, the weighted-average-mode fitness value is:

F) Check whether a stop condition is satisfied: if so, the chromosome with the best fitness value and its corresponding population are retrieved, the grouping result is stored, and the procedure ends; if not, the next-generation population produced in step C) is treated as the previous-generation population and the procedure returns to step C). In this embodiment, the stop condition is that the number of generations evolved reaches the number set by the user.
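The overall loop of steps C0) to F) can be sketched in Python as follows. The function name `run_ga` and its callable parameters are hypothetical: `fitness_fn` stands for decoding plus regression plus fitness calculation, and `breed_fn` stands for selection, crossover and mutation. The sketch keeps the best chromosome seen across all generations and stops after a user-set number of generations.

```python
def run_ga(initial_population, fitness_fn, breed_fn, max_generations):
    """Evolve for a fixed number of generations and return (best chromosome, best fitness)."""
    previous = list(initial_population)
    fitnesses = [fitness_fn(c) for c in previous]            # step C0)
    best_fit, best = max(zip(fitnesses, previous), key=lambda pair: pair[0])
    for _ in range(max_generations):                          # step F) stop condition
        current = breed_fn(previous, fitnesses)               # step C)
        fitnesses = [fitness_fn(c) for c in current]          # steps D) and E)
        gen_fit, gen_best = max(zip(fitnesses, current), key=lambda pair: pair[0])
        if gen_fit > best_fit:
            best_fit, best = gen_fit, gen_best
        previous = current                                    # next becomes previous
    return best, best_fit

# Toy usage: fitness is the gene sum, breeding adds 1 to every gene (capped at 3).
toy_best, toy_fit = run_ga(
    [[0, 0], [1, 0]],
    fitness_fn=sum,
    breed_fn=lambda pop, fits: [[min(g + 1, 3) for g in c] for c in pop],
    max_generations=5,
)
# toy_best == [3, 3], toy_fit == 6
```

Passing the fitness and breeding steps in as functions keeps the loop independent of the particular regression model or genetic operators chosen.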

As the above steps show, once the stop condition is satisfied the best fitness value and its corresponding population are retrieved and evolution ends. In this way the chromosome with the best fitness value across all generations can be found, and the cases and independent variables decoded from that chromosome give the most accurate grouping criteria. In this embodiment, because the cases correspond to manufacturers and the independent variables to performance indicators, the technique of the present invention can identify the key independent variables (that is, performance indicators) applicable to each group of manufacturers. By analogy, when the technique is applied to other problems, its grouping and independent-variable extraction can likewise obtain the optimal causal-model parameters for each group simultaneously, saving a great deal of manual work and trial-and-error time.

It can thus be seen that the present invention combines a genetic algorithm, CBR techniques and regression models to group cases while simultaneously building a regression model for each group. The number of groups is set in advance, but no prior assumption is made about which groups each independent variable belongs to; this makes it possible to find more accurate grouping criteria and, in turn, the key independent variables applicable to each group.

The first figure is a flow chart of a preferred embodiment of the present invention.

Claims

A method of applying a genetic algorithm to automatically group cases and select independent variables while simultaneously estimating regression model parameters, comprising the following steps: A) defining a chromosome structure and encoding: the chromosome structure and encoding used in the genetic algorithm are defined, each chromosome containing cases, the group to which each case belongs, independent variables, and the groups to which each independent variable belongs; wherein the group membership of a case or of an independent variable is expressed in a predetermined encoding format, each case belongs either to one of a plurality of groups or to no group, and each independent variable belongs to zero or more groups; B) generating an initial population: a predetermined number of generation-0 chromosomes are generated at random and defined as a previous-generation population; C) performing genetic operations to produce new offspring: selection, crossover and mutation are applied to the chromosomes of the previous-generation population to produce the chromosomes of the next generation, which are defined as a next-generation population; D) decoding the chromosomes of the next-generation population produced in step C): decoding reveals which group each case has been assigned to and which independent variables each group uses, and the decoded case data are divided into a plurality of data sets according to the group to which they belong; E) estimating the regression model parameters and computing a fitness value: each data set is fed into a regression model for parameter estimation, and the fitness value is computed from the result; and F) determining whether a stop condition is satisfied: if so, the chromosome with the best fitness value and its corresponding population are retrieved, the grouping result is stored, and the procedure ends; if not, the next-generation population produced in step C) is treated as the previous-generation population and the procedure returns to step C).

The method of applying a genetic algorithm to automatically group cases and select independent variables while simultaneously estimating regression model parameters according to claim 1, wherein: in step A), the group to which a case belongs is represented by an integer code that gives its group number, and the groups to which an independent variable belongs are represented by a binary code in which each bit corresponds to one group.

The method of applying a genetic algorithm to automatically group cases and select independent variables while simultaneously estimating regression model parameters according to claim 1, wherein: a step C0) is further included before step C): before selection, crossover and mutation are applied to the chromosomes of the previous-generation population, the fitness value of every chromosome in that population is evaluated; if a chromosome is an infeasible solution, it is either regenerated or modified so that it becomes a feasible solution.

The method of applying a genetic algorithm to automatically group cases and select independent variables while simultaneously estimating regression model parameters according to claim 1, wherein: producing a next-generation population in step C) is referred to as evolving one generation; and in step F), the stop condition is that the number of generations evolved reaches the number set by the user.

The method of applying a genetic algorithm to automatically group cases and select independent variables while simultaneously estimating regression model parameters according to claim 1, wherein: in step E), the fitness value is computed from the adjusted coefficient of determination of the regression model of each group, and two calculation modes are available: an overall mode and a weighted-average mode.

The method of applying a genetic algorithm to automatically group cases and select independent variables while simultaneously estimating regression model parameters according to claim 1, wherein: in step E), if the sign of an independent-variable coefficient in the regression model of any group differs from the criterion entered in advance (positive, negative, or undetermined), the adjusted coefficient of determination of that model is reset to 0.