TWI721331B - Classification device and classification method - Google Patents


Info

Publication number
TWI721331B
Authority
TW
Taiwan
Prior art keywords
decision tree
feature set
classification
feature
features
Application number
TW107139402A
Other languages
Chinese (zh)
Other versions
TW202018527A (en)
Inventor
賴瓊惠
黃裕峰
鄭力嘉
Original Assignee
中華電信股份有限公司
Application filed by 中華電信股份有限公司 filed Critical 中華電信股份有限公司
Priority to TW107139402A priority Critical patent/TWI721331B/en
Publication of TW202018527A publication Critical patent/TW202018527A/en
Application granted granted Critical
Publication of TWI721331B publication Critical patent/TWI721331B/en


Abstract

A classification method for classifying a plurality of data pieces is provided, wherein each of the plurality of data pieces is associated with a first feature set. The classification method includes: calculating correlations of a plurality of first features in the first feature set, and filtering the first feature set according to the correlations and weights of the plurality of first features so as to generate a second feature set; grouping the plurality of data pieces according to the second feature set to generate a grouping result; generating a first decision tree set according to the second feature set and the grouping result; filtering the second feature set according to error rates of a plurality of first decision trees in the first decision tree set to generate a third feature set; and generating a classification result according to the grouping result and the third feature set.

Description

Classification device and classification method

The present invention relates to a classification device and a classification method.

At present, applications of big data are becoming widespread, and many industries have begun to make decisions based on the results of big data analysis. Generally, big data serves as the input for building mathematical models, and classification and evaluation are then performed with the completed models. However, big data is huge, complex, and fragmented. Simply feeding an enormous amount of raw data into a model often causes a garbage-in, garbage-out (GIGO) phenomenon that distorts the resulting model.

Therefore, finding the particularly important or meaningful features in a huge amount of data is one of the goals pursued by those skilled in the art.

The present invention provides a classification device and a classification method.

A classification device of the present invention is adapted to classify multiple pieces of data, each of which is associated with a first feature set. The classification device includes a storage medium and a processor. The storage medium stores a plurality of modules. The processor is coupled to the storage medium, and accesses and executes the modules, which include a feature screening module, a grouping module, a random forest selection module, and a classification module. The feature screening module calculates the correlations of the first features in the first feature set, and screens the first feature set according to the correlations and weights of the first features to generate a second feature set. The grouping module groups the pieces of data according to the second feature set to generate a grouping result. The random forest selection module generates a first decision tree set according to the second feature set and the grouping result, and screens the second feature set according to the error rates of the first decision trees in the first decision tree set to generate a third feature set. The classification module generates a classification result according to the grouping result and the third feature set.

A classification method of the present invention is adapted to classify multiple pieces of data, each of which is associated with a first feature set. The classification method includes: calculating the correlations of the first features in the first feature set, and screening the first feature set according to the correlations and weights of the first features to generate a second feature set; grouping the pieces of data according to the second feature set to generate a grouping result; generating a first decision tree set according to the second feature set and the grouping result; screening the second feature set according to the error rates of the first decision trees in the first decision tree set to generate a third feature set; and generating a classification result according to the grouping result and the third feature set.

Based on the above, the present invention can effectively reduce a large amount of data, obtain the important data more precisely, and in turn derive meaningful rules from that data. In addition, among the many rule-building algorithms, the present invention can find a decision tree that classifies the current data with a certain degree of credibility without becoming so complex that it overfits.

To make the above features and advantages of the present invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.

FIG. 1 is a schematic diagram illustrating a classification device 10 according to an embodiment of the present invention. The classification device 10 may include a processor 100 and a storage medium 300.

The processor 100 is coupled to the storage medium 300, and accesses and executes a plurality of modules stored in the storage medium 300. The processor 100 may be, for example, a central processing unit (CPU), or another programmable general-purpose or special-purpose microprocessor, a digital signal processor (DSP), a programmable controller, an application specific integrated circuit (ASIC), a similar element, or a combination of the above elements; the invention is not limited thereto.

The storage medium 300 may store a plurality of modules, which may include a feature screening module 310, a grouping module 330, a random forest selection module 350, and a classification module 370. The functions of these modules are explained later. The storage medium 300 may be, for example, any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, hard disk drive (HDD), solid state drive (SSD), a similar element, or a combination of the above elements; the invention is not limited thereto.

FIG. 2 is a flowchart illustrating a classification method 20 according to an embodiment of the present invention, where the classification method 20 can be implemented by the classification device 10. The classification device 10 and the classification method 20 are adapted to classify multiple pieces of data, each of which is associated with a first feature set composed of multiple features. Take the data in Table 1 as an example: it represents 23,837 customer records in total, each having 227 fields, where every field corresponds to one feature. In other words, the data in Table 1 has 227 features; that is, each record in Table 1 is associated with a first feature set consisting of 227 features (the features in the first feature set are hereinafter called "first features").

Table 1 [table not reproduced]

In step S210, the feature screening module 310 may calculate the correlations of the first features in the first feature set, and screen the first feature set according to the correlations and weights of the first features to generate a second feature set, where a correlation may be expressed, for example, in the form of a correlation coefficient.

FIG. 3 is a detailed flowchart illustrating step S210 of the classification method 20 according to an embodiment of the present invention. Specifically, in step S211, the feature screening module 310 may calculate the correlation coefficient between every pair of first features and obtain the weight of each first feature. If the first feature set includes $m$ first features, the feature screening module 310 may calculate the correlation coefficients between the $m$ first features (for example, the correlation coefficient between the $i$-th and the $j$-th first features is $r_{i,j}$) and remove the duplicate correlation coefficients as well as the autocorrelation coefficients, thereby obtaining a correlation coefficient matrix composed of $\binom{m}{2} = \frac{m(m-1)}{2}$ correlation coefficients, as shown in Table 2. On the other hand, the weight of each first feature can be set according to how important that feature is. For example, if gender is a more important feature to the user of the data in Table 1 and the local-call ratio is a less important one, the weight of gender can be set higher and that of the local-call ratio lower, as shown in Table 3. Table 3 lists the weights used in this embodiment for the first features of Table 1; the weights can be adjusted by the user according to actual needs, and the invention is not limited thereto.

Table 2 [table not reproduced]

Table 3 [table not reproduced]
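
The following sketch mirrors step S211, assuming the records sit in a pandas DataFrame whose columns are the $m$ first features; the column names and weight values below are hypothetical, not fields from Table 1.

```python
import pandas as pd

def pairwise_correlations(df: pd.DataFrame) -> dict:
    """Return the m(m-1)/2 unique coefficients r_{i,j} (i < j), dropping the
    duplicate lower triangle and the autocorrelation diagonal."""
    corr = df.corr()  # Pearson correlation matrix of the first features
    cols = list(corr.columns)
    return {
        (cols[i], cols[j]): float(corr.iloc[i, j])
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
    }

# User-assigned importance weights (illustrative names and values only).
weights = {"gender": 0.9, "local_call_ratio": 0.1}
```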

After obtaining the correlation and weight corresponding to each first feature in the first feature set, in step S212, the feature screening module 310 may store the related information of each first feature in the data format of feature set information, shown as formula (1):

$$x_y = \{\, r_{y,(y+1)},\, r_{y,(y+2)},\, \ldots,\, r_{y,z},\, w_y,\, d_y \,\} \tag{1}$$

where $z$ is the total number of first features, $y$ is the index of a first feature, $x_y$ is the feature set information of the $y$-th first feature, $r_{y,(y+1)}$ is the correlation coefficient between the $y$-th and the $(y+1)$-th first features (likewise, $r_{y,z}$ is the correlation coefficient between the $y$-th and the $z$-th first features), $w_y$ is the weight of the $y$-th first feature, and $d_y$ is the annotation marking whether the $y$-th first feature is to be deleted. In this embodiment, the feature set information of some of the features in Table 1 can be organized as shown in Table 4.

Table 4 [table not reproduced]
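
As a sketch, the record of formula (1) maps onto a small data structure; the field names below are assumptions chosen for readability.

```python
from dataclasses import dataclass

@dataclass
class FeatureSetInfo:
    """Formula (1): x_y bundles the correlations to all later features,
    the weight w_y, and the deletion annotation d_y."""
    name: str
    correlations: dict    # {later_feature_name: r_{y,j}}
    weight: float         # w_y
    delete: bool = False  # d_y
```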

Returning to FIG. 3, in step S213, the feature screening module 310 may screen each first feature according to the feature set information; the screening may be implemented, for example, based on the encoding in Table 5. The main principle is to use the correlation coefficients and weights as the screening basis. A correlation coefficient shows how strongly one feature is related to the others. For example, suppose the first features include the mobile-phone contract duration and the number of months until the contract expires; the two are highly positively correlated, but because the user cares more about the months until expiry, only that feature is ultimately kept, and the feature set information of the contract-duration feature is marked with the deletion annotation. Specifically, the feature screening module 310 first compares the correlation coefficient of every two first features so as to pick out, from the first features, multiple features whose correlation is greater than a threshold; these features can be regarded as one highly positively correlated feature group. Then, the feature screening module 310 selects from this group the key feature with the largest weight and deletes the other features in the group (for example, by marking the deletion annotation in their feature set information). The above steps are repeated until all the features have been compared, and the resulting key feature(s) form the second feature set, as sketched after the tables below. The features marked with the deletion annotation can be regarded as less important. Therefore, step S210 can filter out a large number of unimportant features, effectively reducing a huge amount of data to the truly important data. For example, in this embodiment, the second feature set (that is, the set generated after the first feature set is screened in step S210) may be composed of the 30 features shown in Table 6.

Table 5 [table not reproduced]

Table 6 [table not reproduced]
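
A sketch of the screening loop of step S213, reusing the correlation dictionary and weight table from the earlier sketch; the 0.8 threshold is an assumed value, since the patent only states that some threshold is used.

```python
def screen_features(features, correlations, weights, threshold=0.8):
    """Within each highly positively correlated group, keep only the feature
    with the largest weight; the survivors form the second feature set."""
    deleted = set()
    for i, fi in enumerate(features):
        if fi in deleted:
            continue
        # Collect the group of later, still-live features highly correlated with fi.
        group = [fi] + [
            fj for fj in features[i + 1:]
            if fj not in deleted and correlations[(fi, fj)] > threshold
        ]
        key = max(group, key=lambda f: weights[f])    # the key feature
        deleted.update(f for f in group if f != key)  # mark d_y for the rest
    return [f for f in features if f not in deleted]
```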

In step S220, the grouping module 330 may group the pieces of data according to the second feature set to generate a grouping result. Specifically, the grouping module 330 may group the data according to the second feature set based on, but not limited to, the K-means algorithm. For example, the grouping module 330 may group the data of Table 1 with the K-means algorithm based on the second feature set of Table 6; the grouping result may be as shown in Table 7.

Table 7 [table not reproduced]
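
A minimal grouping sketch for step S220 with scikit-learn's KMeans; fixing 5 clusters mirrors the 5 groups of this embodiment, while the remaining parameters are assumptions.

```python
from sklearn.cluster import KMeans

def group_data(df, second_feature_set, n_clusters=5, seed=0):
    """Cluster the records on the screened (second) feature set; the returned
    labels are 0-based group indices, used as targets in the next step."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return kmeans.fit_predict(df[second_feature_set])
```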

After the grouping result is generated, in step S230, the random forest selection module 350 may generate a first decision tree set according to the second feature set and the grouping result. Specifically, the random forest selection module 350 may use the random forest algorithm (Random Forests) to generate the first decision tree set, which has multiple first decision trees, according to the second feature set and the grouping result. In this embodiment, the random forest selection module 350 generates 30 decision tree models according to the second feature set and the grouping result, and the error rate of each decision tree model may be as shown in Table 8.

Table 8 [table not reproduced]
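
Step S230 sketched with scikit-learn's random forest. Per-tree error rates are obtained by scoring each member tree individually; evaluating on the training data is a simplification (the patent does not specify the evaluation set), and the group labels are assumed to be the 0-based K-means indices, which match the encoded labels the member trees are trained on.

```python
from sklearn.ensemble import RandomForestClassifier

def forest_with_error_rates(X, groups, n_trees=30, seed=0):
    """Fit a forest of 30 first decision trees on the second feature set with
    the grouping result as the target, and measure each tree's error rate."""
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X, groups)
    error_rates = [1.0 - tree.score(X, groups) for tree in forest.estimators_]
    return forest, error_rates
```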

Next, in step S240, the random forest selection module 350 may screen the second feature set according to the error rates of the first decision trees in the first decision tree set to generate a third feature set.

FIG. 4 is a detailed flowchart illustrating step S240 of the classification method 20 according to an embodiment of the present invention. Specifically, in step S241, the random forest selection module 350 may pick, from the first decision trees, the decision tree with the lowest error rate. In this embodiment, the random forest selection module 350 may pick the 12th decision tree, which has the lowest error rate according to the information in Table 8. Next, in step S242, the random forest selection module 350 may use the features corresponding to the decision tree with the lowest error rate to form the third feature set. In this embodiment, the random forest selection module 350 may use the features corresponding to the 12th decision tree to form the third feature set; the features of the 12th decision tree (that is, the features in the third feature set) may be as shown in Table 9.

Table 9 [table not reproduced]
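
Steps S241 and S242 then reduce to an argmin over the error rates plus a scan of the winning tree's split nodes; a sketch, continuing from the forest above.

```python
import numpy as np

def third_feature_set(forest, error_rates, feature_names):
    """Pick the lowest-error tree and keep the features that actually appear
    in its split nodes (scikit-learn marks leaf nodes with -2)."""
    best = forest.estimators_[int(np.argmin(error_rates))]
    split_features = best.tree_.feature  # split feature index per node
    return sorted({feature_names[i] for i in split_features if i >= 0})
```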

Returning to FIG. 2, in step S250, the classification module 370 may generate a classification result according to the grouping result and the third feature set.

FIG. 5 is a detailed flowchart of step S250 of the classification method according to an embodiment of the present invention. In step S251, the classification module 370 may generate a second decision tree set according to the grouping result and the third feature set. In this embodiment, the classification module 370 may feed the grouping result of Table 7 and the third feature set of Table 9 into one or more decision tree algorithms to produce a second decision tree set that includes one or more second decision trees. For example, the classification module 370 may feed the grouping result of Table 7 and the third feature set of Table 9 into, but not limited to, the CART, C5.0, and CHAID classification decision tree algorithms, thereby obtaining a CART classification decision tree, a C5.0 classification decision tree, and a CHAID classification decision tree; the accuracy of each of these classification decision trees is shown in Table 10.

Table 10 [table not reproduced]
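
A rough sketch of step S251. scikit-learn only ships a CART-style learner, so the entropy criterion below is merely a C5.0-flavoured stand-in, and CHAID, which needs a dedicated implementation, is omitted; treat this as an approximation of the patent's three candidate algorithms, not a faithful reproduction.

```python
from sklearn.tree import DecisionTreeClassifier

def candidate_trees(X3, groups, seed=0):
    """Fit candidate classification trees on the third feature set, with the
    grouping result as the target."""
    return {
        "CART": DecisionTreeClassifier(criterion="gini",
                                       random_state=seed).fit(X3, groups),
        "C5.0-like": DecisionTreeClassifier(criterion="entropy",
                                            random_state=seed).fit(X3, groups),
    }
```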

In step S252, the classification module 370 may calculate the score of each second decision tree in the second decision tree set and select the final decision tree according to the scores. In this embodiment, the score of a second decision tree is determined by three factors: credibility, complexity, and accuracy.

For the credibility calculation, the classification module 370 may first compute, according to the grouping result, the first ratio of the data corresponding to each group. For example, assuming the grouping result of Table 7 indicates that the 23,837 records of Table 1 are divided into 5 groups in total, the classification module 370 may compute the first ratios of the data corresponding to groups 1 through 5, as shown in Table 11.

Table 11 [table not reproduced]

Here $g_j$ denotes the first ratio of the data that falls in group $j$. For example, according to Table 11, the records assigned to group 1 account for 0.253 of the 23,837 records in Table 1.
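
The first ratios are plain group shares; a one-function sketch.

```python
import numpy as np

def first_ratios(groups):
    """First ratio g_j: the share of records that fall in group j (Table 11)."""
    labels, counts = np.unique(groups, return_counts=True)
    return {int(j): float(c) / len(groups) for j, c in zip(labels, counts)}
```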

Then, the classification module 370 may compute, according to the grouping result, the second ratio of the data corresponding to each node of a second decision tree. Specifically, after each second decision tree is generated, the classification module 370 may directly read the information of each node to obtain the second ratio that each group occupies at that node. Taking the CART classification decision tree as an example, suppose the CART classification decision tree generated in step S251 has 16 nodes in total. In this embodiment, the first node of the CART classification decision tree may include the information shown in Table 12.

Table 12 [table not reproduced]

After obtaining the first ratios and the second ratios, the classification module 370 may calculate the credibility of the second decision tree from the first ratios and the second ratios based on formula (2) [formula not reproduced], where $j$ is the index of a group, $m$ is the total number of groups in the grouping result, $p$ denotes the $p$-th node of the second decision tree, $k$ denotes the total number of nodes of the second decision tree, and $i$ denotes the index of the second decision tree. In this embodiment there are 5 groups in total, so $m = 5$; the first second decision tree has 16 nodes, so $k = 16$. Here the first second decision tree ($i = 1$) is the CART classification decision tree, the second ($i = 2$) is the C5.0 classification decision tree, and the third ($i = 3$) is the CHAID classification decision tree.

For example, based on formula (2) and the information in Tables 11 and 12, the classification module 370 can compute the credibility corresponding to node 1 of the CART classification decision tree. By analogy, the classification module 370 can compute the credibility corresponding to every node of the CART, C5.0, and CHAID classification decision trees, as shown in Tables 13, 14, and 15, respectively.

Table 13: CART classification decision tree [table not reproduced]

Table 14: C5.0 classification decision tree [table not reproduced]

Table 15: CHAID classification decision tree [table not reproduced]

After the credibility of every node of every second decision tree has been computed with formula (2), the node credibilities of the $i$-th second decision tree can be normalized with formula (3) to obtain the credibility of the $i$-th second decision tree, consistent with the normalized totals quoted below:

$$t_i = \frac{\sum_{p=1}^{k} t_{i,p}}{\max\limits_{1 \le q \le n} \sum_{p=1}^{k} t_{q,p}} \tag{3}$$

where $t_{i,p}$ is the credibility of the $p$-th node of the $i$-th second decision tree from formula (2), $j$ is the index of a group, $m$ is the total number of groups in the grouping result, $p$ denotes the $p$-th node of the second decision tree, $k$ denotes the total number of nodes of the respective second decision tree, $i$ denotes the index of the second decision tree, and $n$ is the total number of second decision trees. In this embodiment there are 3 second decision trees in total, so $n = 3$. Taking Tables 13, 14, and 15 as an example, the credibility total of the CART classification decision tree of Table 13 is 3.44661719, which becomes 0.263655231 after normalization; the total of the C5.0 classification decision tree of Table 14 is 13.07244, which becomes 1 after normalization; and the total of the CHAID classification decision tree of Table 15 is 8.1228693, which becomes 0.621373587 after normalization.

For complexity, the classification module 370 may obtain the complexity $s_i$ (the total node count) of the $i$-th second decision tree while generating it, and then normalize this complexity with formula (4):

$$\hat{s}_i = \frac{s_i}{\max\limits_{1 \le q \le n} s_q} \tag{4}$$

where $i$ is the index of the second decision tree and $n$ is the total number of second decision trees. In this embodiment there are 3 second decision trees in total, so $n = 3$. Taking Tables 13, 14, and 15 as an example, the CART classification decision tree of Table 13 has 16 nodes in total, giving a normalized complexity of 0.219178082; the C5.0 classification decision tree of Table 14 has 73 nodes, giving a normalized complexity of 1; and the CHAID classification decision tree of Table 15 has 35 nodes, giving a normalized complexity of 0.479452055.

For accuracy, the classification module 370 may obtain the accuracy $c_i$ of the $i$-th second decision tree while generating it. Taking Table 10 as an example, the accuracy of the CART classification decision tree is 0.9871, that of the C5.0 classification decision tree is 0.9933, and that of the CHAID classification decision tree is 0.9787.

After obtaining the credibility, complexity, and accuracy of each second decision tree, the classification module 370 may calculate the score of each second decision tree from these three factors, using a scoring formula such as formula (5) [formula not reproduced], where $j$ is the index of a group, $m$ is the total number of groups in the grouping result, $p$ denotes the $p$-th node of the second decision tree, $k$ denotes the total number of nodes of the second decision tree, and $i$ denotes the index of the second decision tree. The computed score of each second decision tree is shown in Table 16.

Table 16 [table not reproduced]

After the score of each second decision tree is computed, the classification module 370 may select the final decision tree according to the scores. For example, the classification module 370 may select, from Table 16, the CHAID decision tree with the highest score as the final decision tree.
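
The final selection is a plain argmax over the Table 16 scores; since that table is not reproduced here, the example is schematic.

```python
def pick_final_tree(scores):
    """Step S252 tail: the candidate tree with the highest score wins."""
    return max(scores, key=scores.get)

# Fed with the Table 16 scores, this returns "CHAID" in the embodiment.
```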

After the final decision tree is determined, in step S253, the classification module 370 may classify the pieces of data according to the final decision tree to generate the classification result.

Features and effects

The important-feature selection method of the present invention adopts a weight-setting method and a correlation-coefficient test: it retains the important feature parameters set in advance, while the correlation-coefficient test computes the correlation coefficients between features, effectively reducing a large number of feature parameters so that more meaningful classification rules can be found in a large amount of data according to the important feature parameters.

The present invention exploits the strengths of random forests in handling a large number of input feature parameters and in evaluating feature importance, finding the model with the lowest error rate in order to pick out the most important features.

The present invention adopts a scoring method based on credibility, complexity, and accuracy, and can find a classification decision tree whose construction is not so complex as to cause over-learning, that has a certain degree of credibility, and whose accuracy is above a certain level, thereby obtaining the best classification method for the customer data.

The present invention is not limited to customer data; it can be applied to different kinds of data to find the rules hidden in them and thereby mine the important rules in big data.

In summary, the present invention can screen out the more critical features through the correlations and weights of the individual features, and can further screen these critical features with a grouping algorithm and decision tree algorithms to find the features that matter most to the established model. The present invention can also comprehensively score a variety of decision tree algorithms and, according to the scores, select the final decision tree used to classify the data. On this basis, the present invention can effectively reduce a large amount of data, obtain the important data more precisely, and in turn derive meaningful rules from that data. In addition, among the many rule-building algorithms, the present invention can find a decision tree that classifies the current data with a certain degree of credibility without becoming so complex that it overfits.

Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the invention. Anyone with ordinary knowledge in the relevant technical field may make some changes and modifications without departing from the spirit and scope of the present invention; the protection scope of the invention shall therefore be defined by the appended claims.

10: classification device; 20: classification method; 100: processor; 300: storage medium; 310: feature screening module; 330: grouping module; 350: random forest selection module; 370: classification module; S210, S211, S212, S213, S220, S230, S240, S241, S242, S250, S251, S252, S253: steps

FIG. 1 is a schematic diagram illustrating a classification device according to an embodiment of the present invention. FIG. 2 is a flowchart illustrating a classification method according to an embodiment of the present invention. FIG. 3 is a detailed flowchart of step S210 of the classification method according to an embodiment of the present invention. FIG. 4 is a detailed flowchart of step S240 of the classification method according to an embodiment of the present invention. FIG. 5 is a detailed flowchart of step S250 of the classification method according to an embodiment of the present invention.

20: classification method

S210, S220, S230, S240, S250: steps

Claims (12)

1. A classification device, adapted to classify multiple pieces of data, wherein each of the multiple pieces of data is associated with a first feature set, the classification device comprising: a storage medium, storing a plurality of modules; and a processor, coupled to the storage medium and accessing and executing the plurality of modules, the plurality of modules comprising: a feature screening module, calculating correlations of a plurality of first features in the first feature set, and screening the first feature set according to the correlations and weights of the plurality of first features to generate a second feature set; a grouping module, grouping the multiple pieces of data according to the second feature set to generate a grouping result; a random forest selection module, generating a first decision tree set according to the second feature set and the grouping result, and screening the second feature set according to error rates of a plurality of first decision trees in the first decision tree set to generate a third feature set; and a classification module, generating a classification result according to the grouping result and the third feature set, including generating a second decision tree set according to the grouping result and the third feature set, calculating scores of second decision trees in the second decision tree set, and selecting a final decision tree according to the scores; wherein the step of calculating the scores of the second decision trees in the second decision tree set and selecting the final decision tree according to the scores further comprises: calculating, according to the grouping result, a first ratio of the multiple pieces of data corresponding to each group; calculating, according to the grouping result, a second ratio of the multiple pieces of data corresponding to each node of the second decision tree; calculating a credibility of the second decision tree according to the first ratio and the second ratio; and selecting the final decision tree according to the scores.

2. The classification device according to claim 1, wherein the step of screening the first feature set according to the correlations and weights of the plurality of first features to generate the second feature set comprises: picking out, from the plurality of first features, a plurality of features whose correlations are greater than a threshold; and picking out, from the plurality of features, a key feature having a largest weight, and deleting the features other than the key feature among the plurality of features.
3. The classification device according to claim 1, wherein the step of grouping the multiple pieces of data according to the second feature set to generate the grouping result comprises: grouping the multiple pieces of data according to the second feature set based on a K-means algorithm to generate the grouping result.

4. The classification device according to claim 1, wherein the step of generating the first decision tree set according to the second feature set and the grouping result comprises: generating, based on a random forest algorithm, the first decision tree set having the plurality of first decision trees according to the second feature set and the grouping result.

5. The classification device according to claim 1, wherein the step of screening the second feature set according to the error rates of the plurality of first decision trees in the first decision tree set to generate the third feature set comprises: picking out, from the plurality of first decision trees, a decision tree having a lowest error rate; and using features corresponding to the decision tree having the lowest error rate to form the third feature set.

6. The classification device according to claim 1, wherein the step of generating the classification result according to the grouping result and the third feature set further comprises: classifying the multiple pieces of data according to the final decision tree to generate the classification result.
7. A classification method, adapted to classify multiple pieces of data, wherein each of the multiple pieces of data is associated with a first feature set, the classification method comprising: calculating, by a feature screening module of a computer device, correlations of a plurality of first features in the first feature set, and screening the first feature set according to the correlations and weights of the plurality of first features to generate a second feature set; grouping, by a grouping module of the computer device, the multiple pieces of data according to the second feature set to generate a grouping result; generating, by a random forest selection module of the computer device, a first decision tree set according to the second feature set and the grouping result; screening, by the random forest selection module, the second feature set according to error rates of a plurality of first decision trees in the first decision tree set to generate a third feature set; and generating, by a classification module of the computer device, a classification result according to the grouping result and the third feature set, including generating a second decision tree set according to the grouping result and the third feature set, calculating scores of second decision trees in the second decision tree set, and selecting a final decision tree according to the scores; wherein the step of calculating the scores of the second decision trees in the second decision tree set and selecting the final decision tree according to the scores further comprises: calculating, according to the grouping result, a first ratio of the multiple pieces of data corresponding to each group; calculating, according to the grouping result, a second ratio of the multiple pieces of data corresponding to each node of the second decision tree; calculating the score of the second decision tree according to the credibility, complexity, and accuracy of the second decision tree; and selecting the final decision tree according to the scores.

8. The classification method according to claim 7, wherein the step of screening the first feature set according to the correlations and weights of the plurality of first features to generate the second feature set comprises: picking out, from the plurality of first features, a plurality of features whose correlations are greater than a threshold; and picking out, from the plurality of features, a key feature having a largest weight, and deleting the features other than the key feature among the plurality of features.
9. The classification method according to claim 7, wherein the step of grouping the multiple pieces of data according to the second feature set to generate the grouping result comprises: grouping the multiple pieces of data according to the second feature set based on a K-means algorithm to generate the grouping result.

10. The classification method according to claim 7, wherein the step of generating the first decision tree set according to the second feature set and the grouping result comprises: generating, based on a random forest algorithm, the first decision tree set having the plurality of first decision trees according to the second feature set and the grouping result.

11. The classification method according to claim 7, wherein the step of screening the second feature set according to the error rates of the plurality of first decision trees in the first decision tree set to generate the third feature set comprises: picking out, from the plurality of first decision trees, a decision tree having a lowest error rate; and using features corresponding to the decision tree having the lowest error rate to form the third feature set.

12. The classification method according to claim 7, wherein the step of generating the classification result according to the grouping result and the third feature set further comprises: classifying the multiple pieces of data according to the final decision tree to generate the classification result.
TW107139402A 2018-11-06 2018-11-06 Classification device and classification method TWI721331B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW107139402A TWI721331B (en) 2018-11-06 2018-11-06 Classification device and classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW107139402A TWI721331B (en) 2018-11-06 2018-11-06 Classification device and classification method

Publications (2)

Publication Number Publication Date
TW202018527A TW202018527A (en) 2020-05-16
TWI721331B true TWI721331B (en) 2021-03-11

Family

ID=71895754

Family Applications (1)

Application Number Title Priority Date Filing Date
TW107139402A TWI721331B (en) 2018-11-06 2018-11-06 Classification device and classification method

Country Status (1)

Country Link
TW (1) TWI721331B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090281981A1 (en) * 2008-05-06 2009-11-12 Chen Barry Y Discriminant Forest Classification Method and System
TW201737058A (en) * 2016-03-31 2017-10-16 Alibaba Group Services Ltd Method and apparatus for training model based on random forest
US20180176243A1 (en) * 2016-12-16 2018-06-21 Patternex, Inc. Method and system for learning representations for log data in cybersecurity

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
https://blog.csdn.net/zjupeco/article/details/77371645 *

Also Published As

Publication number Publication date
TW202018527A (en) 2020-05-16

Similar Documents

Publication Publication Date Title
CN108898479B (en) Credit evaluation model construction method and device
CN108363810B (en) Text classification method and device
CN109508374B (en) Text data semi-supervised clustering method based on genetic algorithm
TW202022716A (en) Clustering result interpretation method and device
WO2010061537A1 (en) Search device, search method, and recording medium on which programs are stored
US8639643B2 (en) Classification of a document according to a weighted search tree created by genetic algorithms
CN108280236A (en) A kind of random forest visualization data analysing method based on LargeVis
WO2023155508A1 (en) Graph convolutional neural network and knowledge base-based paper correlation analysis method
CN110956277A (en) Interactive iterative modeling system and method
CN108733745A (en) A kind of enquiry expanding method based on medical knowledge
CN111797267A (en) Medical image retrieval method and system, electronic device and storage medium
WO2020147259A1 (en) User portait method and apparatus, readable storage medium, and terminal device
Zada et al. Performance evaluation of simple K-mean and parallel K-mean clustering algorithms: big data business process management concept
CN110222177A (en) A kind of initial cluster center based on K- means clustering algorithm determines method and device
TWI721331B (en) Classification device and classification method
JP5929532B2 (en) Event detection apparatus, event detection method, and event detection program
Zhang et al. Research on borrower's credit classification of P2P network loan based on LightGBM algorithm
CN112214684A (en) Seed-expanded overlapped community discovery method and device
CN111611228A (en) Load balance adjustment method and device based on distributed database
CN111339294B (en) Customer data classification method and device and electronic equipment
KR101085066B1 (en) An Associative Classification Method for detecting useful knowledge from huge multi-attributes dataset
JP4125951B2 (en) Text automatic classification method and apparatus, program, and recording medium
CN114048796A (en) Improved hard disk failure prediction method and device
Le et al. Optimizing genetic algorithm in feature selection for named entity recognition
CN114024912A (en) Network traffic application identification analysis method and system based on improved CHAMELEON algorithm