TW202018527A - Classification device and classification method - Google Patents

Classification device and classification method

Info

Publication number
TW202018527A
Authority
TW
Taiwan
Prior art keywords
decision tree
feature set
classification
features
feature
Prior art date
Application number
TW107139402A
Other languages
Chinese (zh)
Other versions
TWI721331B (en)
Inventor
賴瓊惠
黃裕峰
鄭力嘉
Original Assignee
中華電信股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中華電信股份有限公司
Priority to TW107139402A
Publication of TW202018527A
Application granted
Publication of TWI721331B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A classification method adapted to classify a plurality of pieces of data is provided, wherein each of the pieces of data is associated with a first feature set. The classification method includes: calculating correlations of a plurality of first features in the first feature set, and filtering the first feature set according to the correlations and weights of the plurality of first features so as to generate a second feature set; grouping the pieces of data according to the second feature set to generate a grouping result; generating a first decision tree set according to the second feature set and the grouping result; filtering the second feature set according to error rates of a plurality of first decision trees in the first decision tree set to generate a third feature set; and generating a classification result according to the grouping result and the third feature set.

Description

Classification device and classification method

The present invention relates to a classification device and a classification method.

At present, applications of big data are becoming widespread, and many industries have begun to make decisions based on the results of big data analysis. Generally, big data is used as the input for building mathematical models, and classification and evaluation are then performed with the completed models. However, big data is massive, complex, and fragmented. Simply feeding a huge amount of data into a model often results in a garbage-in, garbage-out (GIGO) situation that distorts the resulting model.

Therefore, how to identify particularly important or meaningful features in a huge amount of data is one of the goals that those skilled in the art strive to achieve.

The present invention provides a classification device and a classification method.

A classification device of the present invention is adapted to classify a plurality of pieces of data, where each piece of data is associated with a first feature set. The classification device includes a storage medium and a processor. The storage medium stores a plurality of modules. The processor is coupled to the storage medium, and accesses and executes the modules, which include a feature screening module, a grouping module, a random forest selection module, and a classification module. The feature screening module calculates the correlations of a plurality of first features in the first feature set, and filters the first feature set according to the correlations and weights of the first features to generate a second feature set. The grouping module groups the pieces of data according to the second feature set to generate a grouping result. The random forest selection module generates a first decision tree set according to the second feature set and the grouping result, and filters the second feature set according to the error rates of the first decision trees in the first decision tree set to generate a third feature set. The classification module generates a classification result according to the grouping result and the third feature set.

A classification method of the present invention is adapted to classify a plurality of pieces of data, where each piece of data is associated with a first feature set. The classification method includes: calculating the correlations of a plurality of first features in the first feature set, and filtering the first feature set according to the correlations and weights of the first features to generate a second feature set; grouping the pieces of data according to the second feature set to generate a grouping result; generating a first decision tree set according to the second feature set and the grouping result; filtering the second feature set according to the error rates of the first decision trees in the first decision tree set to generate a third feature set; and generating a classification result according to the grouping result and the third feature set.

Based on the above, the present invention can effectively reduce a large amount of data, obtain the important data more precisely, and then derive meaningful rules from that data. In addition, among the many rule-building algorithms, the present invention can find a decision tree that classifies the current data with a certain degree of credibility without being so complex that it overfits.

To make the above features and advantages of the present invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.

FIG. 1 is a schematic diagram illustrating a classification device 10 according to an embodiment of the present invention. The classification device 10 may include a processor 100 and a storage medium 300.

The processor 100 is coupled to the storage medium 300, and accesses and executes a plurality of modules stored in the storage medium 300. The processor 100 may be, for example, a central processing unit (CPU), or another programmable general-purpose or special-purpose microprocessor, a digital signal processor (DSP), a programmable controller, an application-specific integrated circuit (ASIC), another similar element, or a combination of the above elements; the invention is not limited thereto.

The storage medium 300 may store a plurality of modules, which may include a feature screening module 310, a grouping module 330, a random forest selection module 350, and a classification module 370. The functions of these modules are described later. The storage medium 300 may be, for example, any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, hard disk drive (HDD), solid state drive (SSD), another similar element, or a combination of the above elements; the invention is not limited thereto.

FIG. 2 is a flowchart illustrating a classification method 20 according to an embodiment of the present invention, where the classification method 20 may be implemented by the classification device 10. The classification device 10 and the classification method 20 are adapted to classify a plurality of pieces of data, where each piece of data is associated with a first feature set composed of multiple features. Take the data of Table 1 as an example: the data of Table 1 represent 23,837 customer records in total, and each record has 227 fields, where each field represents a feature. In other words, the data of Table 1 have 227 features in total; that is, each record of Table 1 is associated with a first feature set composed of 227 features (the features in the first feature set are hereinafter referred to as "first features").

Table 1 (rendered as an image in the original publication: Figure 107139402-A0304-0001)

In step S210, the feature screening module 310 may calculate the correlations of the first features in the first feature set, and filter the first feature set according to the correlations and the weights of the first features to generate a second feature set, where a correlation may be expressed, for example, in the form of a correlation coefficient.

FIG. 3 is a detailed flowchart illustrating step S210 of the classification method 20 according to an embodiment of the present invention. Specifically, in step S211, the feature screening module 310 may calculate the correlation coefficient between every pair of first features and obtain the weight of each first feature. If the first feature set includes m first features, the feature screening module 310 may calculate the correlation coefficients between the m first features (for example, the correlation coefficient between first feature i and first feature j is $r_{i,j}$), and discard the duplicated coefficients and the self-correlation coefficients, thereby obtaining a correlation coefficient matrix composed of $\binom{m}{2} = m(m-1)/2$ correlation coefficients, as shown in Table 2. On the other hand, the weight of each first feature may be set according to its importance. For example, if gender is a more important feature to the users of the data of Table 1 and the local-call ratio is a less important one, the weight of gender may be set higher and the weight of the local-call ratio set lower, as shown in Table 3. Table 3 is the weight table used in this embodiment for the first features of Table 1; these weights may be adjusted by users according to actual needs, and the invention is not limited thereto.

Table 2 (rendered as an image in the original publication: Figure 107139402-A0304-0002)
Table 3 (rendered as an image in the original publication: Figure 107139402-A0304-0003)
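As a rough Python sketch of step S211 (pandas and NumPy are our choice of tooling, not named by the patent; the file name and the two weight entries are purely illustrative):

```python
import numpy as np
import pandas as pd

# Stand-in for Table 1: 23,837 records x 227 numeric feature columns.
df = pd.read_csv("customer_data.csv")  # hypothetical file name

# Full correlation matrix; entry (i, j) is the coefficient r_{i,j}.
corr = df.corr()

# Keep the strict upper triangle only: this drops the self-correlations on
# the diagonal and the mirrored (j, i) duplicates, leaving the m(m-1)/2
# coefficients of Table 2.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()  # Series indexed by (feature_i, feature_j)

# User-assigned importance weights (Table 3); the values here are made up.
weights = {"gender": 0.9, "local_call_ratio": 0.1}
```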

After obtaining the correlation and the weight corresponding to each first feature in the first feature set, in step S212, the feature screening module 310 may store the related information of each first feature in the data format of feature set information, shown in formula (1) below:

$$x_y = \left(r_{y,(y+1)},\ r_{y,(y+2)},\ \ldots,\ r_{y,z},\ w_y,\ d_y\right) \tag{1}$$

where z denotes the total number of first features, y denotes the index of a first feature, $x_y$ denotes the feature set information of the y-th first feature, $r_{y,(y+1)}$ denotes the correlation coefficient between the y-th first feature and the (y+1)-th first feature (likewise, $r_{y,z}$ denotes the correlation coefficient between the y-th first feature and the z-th first feature), $w_y$ denotes the weight of the y-th first feature, and $d_y$ is a note marking whether the y-th first feature is to be deleted. In this embodiment, the feature set information of some of the features of Table 1 can be organized as shown in Table 4.

Table 4 (rendered as an image in the original publication: Figure 107139402-A0304-0004)
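Continuing the sketch, the feature set information $x_y$ of formula (1) can be held in a small record per feature; the field names are ours, not the patent's:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class FeatureInfo:
    """Feature set information x_y of formula (1) for one first feature."""
    name: str
    corr: Dict[str, float] = field(default_factory=dict)  # r_{y,j}, j > y
    weight: float = 0.0                                    # w_y
    delete: bool = False                                   # d_y deletion note

records = {
    name: FeatureInfo(
        name=name,
        corr={b: r for (a, b), r in pairs.items() if a == name},
        weight=weights.get(name, 0.0),
    )
    for name in df.columns
}
```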

Returning to FIG. 3, in step S213, the feature screening module 310 may screen the first features according to the feature set information; the screening may be implemented, for example, based on the encoding of Table 5. The main principle is to use the correlation coefficients and the weights as the screening basis. A correlation coefficient shows how strongly one feature is related to another. For example, suppose the first features include the mobile-phone contract period and the number of months until the contract expires; the two are highly positively correlated, but because the user cares more about the number of months until the contract expires, only that feature is ultimately retained, and the feature set information of the contract-period feature is marked with a deletion note. Specifically, the feature screening module 310 first compares the correlation coefficient of every pair of first features, so as to select from the first features a plurality of features whose correlations are greater than a threshold, where these features can be regarded as a highly positively correlated feature group. Next, the feature screening module 310 selects from these features the key feature having the largest weight, and deletes the other features in the group (for example, by marking their feature set information with a deletion note). The above steps are repeated until all features have been compared, and the one or more key features finally obtained form the second feature set; a sketch of this loop follows Table 6 below. The features marked for deletion can be regarded as less important features. Step S210 can therefore screen out a large number of unimportant features, effectively reducing the huge data set to the data that really matter. For example, in this embodiment, the second feature set (i.e., the set generated after the first feature set is screened in step S210) may be composed of the 30 features shown in Table 6.

Table 5 (rendered as an image in the original publication: Figure 107139402-A0304-0005)
Table 6 (rendered as an image in the original publication: Figure 107139402-A0304-0006)
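A minimal sketch of the screening loop of step S213 (the 0.9 threshold is an assumption; the patent only says "greater than a threshold" without publishing a value):

```python
THRESHOLD = 0.9  # assumed; not specified in the patent

# For every highly correlated pair, keep the higher-weight feature and mark
# the other one with the d_y deletion note.
for (a, b), r in pairs.items():
    if r > THRESHOLD:
        drop = b if records[a].weight >= records[b].weight else a
        records[drop].delete = True

# The surviving key features form the second feature set (Table 6).
second_feature_set = [n for n, rec in records.items() if not rec.delete]
```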

In step S220, the grouping module 330 may group the pieces of data according to the second feature set to generate a grouping result. Specifically, the grouping module 330 may group the data according to the second feature set based on, but not limited to, the K-means algorithm, thereby generating the grouping result. For example, the grouping module 330 may group the data of Table 1 according to the K-means algorithm based on the second feature set of Table 6; the grouping result may be as shown in Table 7.

Table 7 (rendered as an image in the original publication: Figure 107139402-A0304-0007)
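A sketch of step S220 using scikit-learn's KMeans (scikit-learn is our assumption; the patent only names the K-means algorithm, and the choice of five clusters follows the five groups reported later in Table 11):

```python
from sklearn.cluster import KMeans

X = df[second_feature_set].to_numpy()

# Five groups, matching the embodiment; this is the grouping result of S220.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
group_labels = kmeans.fit_predict(X)
```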

After the grouping result is generated, in step S230, the random forest selection module 350 may generate a first decision tree set according to the second feature set and the grouping result. Specifically, the random forest selection module 350 may generate the first decision tree set, which has a plurality of first decision trees, according to the second feature set and the grouping result based on the random forest algorithm (Random Forests). In this embodiment, the random forest selection module 350 generates 30 decision tree models according to the second feature set and the grouping result, and the error rate of each decision tree model may be as shown in Table 8.

Table 8 (rendered as an image in the original publication: Figure 107139402-A0304-0008)
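A sketch of step S230, again with scikit-learn; the per-tree error rate of Table 8 is approximated here by scoring each member tree against the cluster labels, which is one plausible reading of the patent rather than its stated procedure:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# 30 trees, matching the 30 decision tree models of the embodiment.
forest = RandomForestClassifier(n_estimators=30, random_state=0)
forest.fit(X, group_labels)

# Error rate of each first decision tree (Table 8): 1 - accuracy.
error_rates = np.array(
    [1.0 - tree.score(X, group_labels) for tree in forest.estimators_]
)
```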

Next, in step S240, the random forest selection module 350 may filter the second feature set according to the error rates of the first decision trees in the first decision tree set, thereby generating a third feature set.

FIG. 4 is a detailed flowchart illustrating step S240 of the classification method 20 according to an embodiment of the present invention. Specifically, in step S241, the random forest selection module 350 may select the decision tree having the lowest error rate from the first decision trees. In this embodiment, the random forest selection module 350 may select, according to the information of Table 8, the decision tree with the lowest error rate, namely the 12th decision tree. Next, in step S242, the random forest selection module 350 may use the features corresponding to the decision tree having the lowest error rate to form the third feature set. In this embodiment, the random forest selection module 350 may use the features corresponding to the 12th decision tree to form the third feature set. The features of the 12th decision tree (i.e., the features in the third feature set) may be as shown in Table 9.

Table 9 (rendered as an image in the original publication: Figure 107139402-A0304-0009)
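Steps S241 and S242 then reduce to an argmin plus reading off which features the winning tree actually splits on; in scikit-learn, tree_.feature lists the split feature index per node, with negative values marking leaves:

```python
best_idx = int(np.argmin(error_rates))   # the 12th tree in the embodiment
best_tree = forest.estimators_[best_idx]

# Features used at internal nodes; leaves are marked with -2.
used = {f for f in best_tree.tree_.feature if f >= 0}
third_feature_set = [second_feature_set[f] for f in sorted(used)]
```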

Returning to FIG. 2, in step S250, the classification module 370 may generate a classification result according to the grouping result and the third feature set.

FIG. 5 is a detailed flowchart illustrating step S250 of the classification method according to an embodiment of the present invention. In step S251, the classification module 370 may generate a second decision tree set according to the grouping result and the third feature set. In this embodiment, the classification module 370 may feed the grouping result of Table 7 and the third feature set of Table 9 into one or more decision tree algorithms, thereby generating a second decision tree set that includes one or more second decision trees. For example, the classification module 370 may feed the grouping result of Table 7 and the third feature set of Table 9 into, but not limited to, the CART, C5.0, and CHAID classification decision tree algorithms, thereby obtaining a CART classification decision tree, a C5.0 classification decision tree, and a CHAID classification decision tree, whose accuracy rates are as shown in Table 10.

Table 10 (rendered as an image in the original publication: Figure 107139402-A0304-0010)
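Of the three algorithms named, only CART has a standard scikit-learn implementation; the sketch below therefore trains a CART-style tree on the third feature set, with the understanding that the C5.0 and CHAID trees of the embodiment would come from other toolkits (both exist as R packages, for example):

```python
from sklearn.tree import DecisionTreeClassifier

X3 = df[third_feature_set].to_numpy()

# CART-style second decision tree; a C5.0 or CHAID counterpart would be
# trained analogously with its own implementation.
cart = DecisionTreeClassifier(random_state=0).fit(X3, group_labels)
second_trees = {"CART": cart}

# Accuracy c_i of each second decision tree (Table 10).
accuracy = {name: t.score(X3, group_labels) for name, t in second_trees.items()}
```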

In step S252, the classification module 370 may calculate the score of each second decision tree in the second decision tree set, and select the final decision tree according to the scores. In this embodiment, the score of a second decision tree may be determined by three factors: credibility, complexity, and accuracy.

For the calculation of the credibility, the classification module 370 may first calculate, according to the grouping result, the first ratio of the pieces of data corresponding to each group. For example, assuming the grouping result of Table 7 indicates that the pieces of data (i.e., the 23,837 records of Table 1) are divided into 5 groups in total, the classification module 370 may calculate the first ratios of the data corresponding to groups 1, 2, 3, 4, and 5, as shown in Table 11.

Table 11 (rendered as an image in the original publication: Figure 107139402-A0304-0011)

Here $g_j$ denotes the first ratio of the pieces of data falling in the j-th group. For example, according to Table 11, among the 23,837 records of Table 1, the records classified into group 1 account for 0.253 of the total.
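Continuing the sketch, the first ratios $g_j$ are simply the share of records per group:

```python
# First ratio g_j of each group j (Table 11).
counts = np.bincount(group_labels, minlength=5)
g = counts / counts.sum()  # e.g. g[0] is about 0.253 in the embodiment
```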

Next, the classification module 370 may calculate, according to the grouping result, the second ratio of the pieces of data corresponding to each node of a second decision tree. Specifically, after each second decision tree is generated, the classification module 370 may directly read the information of each node to obtain the second ratio that each group occupies at that node. Taking the CART classification decision tree as an example, assume that the CART classification decision tree generated in step S251 has 16 nodes in total. In this embodiment, the first node of the CART classification decision tree may include the information shown in Table 12.

Table 12 (rendered as an image in the original publication: Figure 107139402-A0304-0012)
After obtaining the first ratios and the second ratios, the classification module 370 may calculate the credibility of a second decision tree from the first ratios and the second ratios based on formula (2) (rendered only as an image in the original publication: Figure 02_image007), where j is the index of a group, m is the total number of groups in the grouping result, p denotes the p-th node of the second decision tree, k denotes the total number of nodes of the second decision tree, and i denotes the index of the second decision tree. In this embodiment there are 5 groups in total, so m = 5, and the first second decision tree has 16 nodes, so k = 16; the first second decision tree (i = 1) is the CART classification decision tree, the second (i = 2) is the C5.0 classification decision tree, and the third (i = 3) is the CHAID classification decision tree.
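In scikit-learn, the per-node information needed for the second ratios can be read directly from a fitted tree: tree_.value holds, for every node, the amount of training data of each class, so normalizing each row gives the share of each group at that node (the information Table 12 shows for node 1):

```python
# Second ratios: one row per node, one column per group.
node_counts = cart.tree_.value.reshape(cart.tree_.node_count, -1)
second_ratios = node_counts / node_counts.sum(axis=1, keepdims=True)
```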

For example, based on formula (2) and the information of Tables 11 and 12, the classification module 370 may calculate the credibility corresponding to node 1 of the CART classification decision tree (the worked computation is rendered only as an image in the original publication: Figure 02_image009). By analogy, the classification module 370 may calculate the credibilities corresponding to the nodes of the CART, C5.0, and CHAID classification decision trees, as shown in Tables 13, 14, and 15, respectively.

Table 13, CART classification decision tree (rendered as an image in the original publication: Figure 107139402-A0304-0013)
Table 14, C5.0 classification decision tree (rendered as an image in the original publication: Figure 107139402-A0304-0014)
Table 15, CHAID classification decision tree (rendered as an image in the original publication: Figure 107139402-A0304-0015)

After the credibility corresponding to each node of each second decision tree has been calculated with formula (2), the node credibilities of the i-th second decision tree may be normalized with formula (3) to obtain the credibility corresponding to the i-th second decision tree:

$$a_i = \frac{\sum_{p=1}^{k_i} b_{i,p}}{\max\limits_{1 \le i' \le n} \sum_{p=1}^{k_{i'}} b_{i',p}} \tag{3}$$

where $b_{i,p}$ is the formula-(2) credibility of the p-th node of the i-th second decision tree, $k_i$ is the total number of nodes of that tree, n is the total number of second decision trees, and the group index j and the total number of groups m enter through formula (2). In this embodiment there are 3 second decision trees in total, so n = 3. Taking Tables 13, 14, and 15 as an example, the total of the credibility list of the CART classification decision tree in Table 13 is 3.44661719, which is 0.263655231 after normalization; the total for the C5.0 classification decision tree in Table 14 is 13.07244, which is 1 after normalization; and the total for the CHAID classification decision tree in Table 15 is 8.1228693, which is 0.621373587 after normalization.

For the complexity, the classification module 370 may obtain the complexity $s_i$ of the i-th second decision tree in the process of generating the second decision tree, and then normalize this complexity with formula (4):

$$s_i' = \frac{s_i}{\max\limits_{1 \le i' \le n} s_{i'}} \tag{4}$$

where i is the index of the second decision tree and n is the total number of second decision trees; in this embodiment, n = 3. Taking Tables 13, 14, and 15 as an example, the CART classification decision tree of Table 13 has 16 nodes in total, so its normalized complexity is 0.219178082; the C5.0 classification decision tree of Table 14 has 73 nodes in total, so its normalized complexity is 1; and the CHAID classification decision tree of Table 15 has 35 nodes in total, so its normalized complexity is 0.479452055.
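Both normalizations divide by the largest total among the candidate trees, so they share one helper; node_credibility is left as a stub because formula (2) itself is published only as an image:

```python
def node_credibility(tree, g):
    """Per-node credibility of formula (2); the exact formula is rendered
    only as an image in the original publication, so this is a stub."""
    raise NotImplementedError

def normalize_by_max(totals):
    """Formulas (3) and (4): divide every total by the largest one."""
    top = max(totals)
    return [t / top for t in totals]

# Complexity s_i = node count; [16, 73, 35] -> [0.2192..., 1.0, 0.4795...].
normalized_complexity = normalize_by_max([16, 73, 35])
```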

For the accuracy, the classification module 370 may obtain the accuracy $c_i$ of the i-th second decision tree in the process of generating the second decision tree. Taking Table 10 as an example, the accuracy of the CART classification decision tree is 0.9871, the accuracy of the C5.0 classification decision tree is 0.9933, and the accuracy of the CHAID classification decision tree is 0.9787.

After obtaining the credibility, the complexity, and the accuracy of each second decision tree, the classification module 370 may calculate the score of each second decision tree according to these three quantities, using the scoring formula of formula (5) (rendered only as an image in the original publication: Figure 02_image015), where j is the index of a group, m is the total number of groups, p denotes the p-th node of the second decision tree, k denotes the total number of nodes of the second decision tree, and i denotes the index of the second decision tree. The calculated scores of the second decision trees are shown in Table 16.

Table 16 (rendered as an image in the original publication: Figure 107139402-A0304-0016)

After the score of each second decision tree has been calculated, the classification module 370 may select the final decision tree according to the scores. For example, the classification module 370 may select the CHAID decision tree, which has the highest score in Table 16, as the final decision tree.
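Formula (5) is likewise not recoverable from the published text, so the sketch below assumes a simple additive combination, score = credibility + accuracy − complexity. With the embodiment's numbers this assumption does reproduce the reported outcome (the CHAID tree scores highest), but it remains our guess, not the patent's formula:

```python
def score(credibility, accuracy, complexity):
    # Assumed stand-in for formula (5): reward credibility and accuracy,
    # penalize complexity. Not the patent's exact scoring formula.
    return credibility + accuracy - complexity

names = ["CART", "C5.0", "CHAID"]
cred = [0.263655231, 1.0, 0.621373587]  # formula (3), Tables 13-15
acc = [0.9871, 0.9933, 0.9787]          # Table 10
comp = [0.219178082, 1.0, 0.479452055]  # formula (4)

scores = {n: score(cr, ac, cp) for n, cr, ac, cp in zip(names, cred, acc, comp)}
final = max(scores, key=scores.get)     # 'CHAID' under this assumption
```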

After the final decision tree has been determined, in step S253, the classification module 370 may classify the pieces of data according to the final decision tree, thereby generating the classification result.

Features and effects

The important-feature selection method of the present invention adopts a weight setting method and a correlation coefficient test: important feature parameters set in advance can be retained, while the correlation coefficients between features are calculated with the correlation coefficient test, providing an effective reduction of a large number of feature parameters so that more meaningful classification rules can be found in a large amount of data according to the important feature parameters.

The present invention exploits the strengths of random forests in handling a large number of input feature parameters and in evaluating feature importance, finding the model with the lowest error rate in order to pick out the most important features.

The present invention adopts a scoring method based on credibility, complexity, and accuracy, and can find a classification decision tree whose construction is not so complex as to cause overfitting, that has a certain degree of credibility, and whose accuracy is above a certain level, thereby obtaining the best method for classifying customer data.

The present invention is not limited to customer data; it can be applied to different kinds of data to find the rules hidden in the data, and thereby mine the important rules in big data.

In summary, the present invention can screen out the more critical features through the correlations and weights of the features, and can further screen these critical features with a grouping algorithm and decision tree algorithms, thereby finding the features most important to the established model. The present invention can also comprehensively score a variety of different decision tree algorithms and, according to the scores, select the final decision tree used to classify the data. On this basis, the present invention can effectively reduce a large amount of data, obtain the important data more precisely, and then derive meaningful rules from that data. In addition, among the many rule-building algorithms, the present invention can find a decision tree that classifies the current data with a certain degree of credibility without being so complex that it overfits.

Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the present invention. Anyone with ordinary knowledge in the art may make some changes and modifications without departing from the spirit and scope of the present invention; therefore, the protection scope of the present invention shall be defined by the appended claims.

10: classification device; 20: classification method; 100: processor; 300: storage medium; 310: feature screening module; 330: grouping module; 350: random forest selection module; 370: classification module; S210, S211, S212, S213, S220, S230, S240, S241, S242, S250, S251, S252, S253: steps

FIG. 1 is a schematic diagram illustrating a classification device according to an embodiment of the present invention. FIG. 2 is a flowchart illustrating a classification method according to an embodiment of the present invention. FIG. 3 is a detailed flowchart illustrating step S210 of the classification method according to an embodiment of the present invention. FIG. 4 is a detailed flowchart illustrating step S240 of the classification method according to an embodiment of the present invention. FIG. 5 is a detailed flowchart illustrating step S250 of the classification method according to an embodiment of the present invention.

20: classification method

S210, S220, S230, S240, S250: steps

Claims (14)

1. A classification device, adapted to classify a plurality of pieces of data, wherein each of the pieces of data is associated with a first feature set, the classification device comprising: a storage medium, storing a plurality of modules; and a processor, coupled to the storage medium, and accessing and executing the plurality of modules, the plurality of modules comprising: a feature screening module, calculating correlations of a plurality of first features in the first feature set, and filtering the first feature set according to the correlations and weights of the plurality of first features to generate a second feature set; a grouping module, grouping the pieces of data according to the second feature set to generate a grouping result; a random forest selection module, generating a first decision tree set according to the second feature set and the grouping result, and filtering the second feature set according to error rates of a plurality of first decision trees in the first decision tree set to generate a third feature set; and a classification module, generating a classification result according to the grouping result and the third feature set.

2. The classification device according to claim 1, wherein filtering the first feature set according to the correlations and the weights of the plurality of first features to generate the second feature set comprises: selecting, from the plurality of first features, a plurality of features whose correlations are greater than a threshold; and selecting, from the plurality of features, a key feature having a largest weight, and deleting the features other than the key feature among the plurality of features.

3. The classification device according to claim 1, wherein grouping the pieces of data according to the second feature set to generate the grouping result comprises: grouping the pieces of data according to the second feature set based on a K-means algorithm to generate the grouping result.

4. The classification device according to claim 1, wherein generating the first decision tree set according to the second feature set and the grouping result comprises: generating, based on a random forest algorithm, the first decision tree set having the plurality of first decision trees according to the second feature set and the grouping result.

5. The classification device according to claim 1, wherein filtering the second feature set according to the error rates of the plurality of first decision trees in the first decision tree set to generate the third feature set comprises: selecting, from the plurality of first decision trees, a decision tree having a lowest error rate; and composing the third feature set of features corresponding to the decision tree having the lowest error rate.

6. The classification device according to claim 1, wherein generating the classification result according to the grouping result and the third feature set comprises: generating a second decision tree set according to the grouping result and the third feature set; calculating scores of second decision trees in the second decision tree set, and selecting a final decision tree according to the scores; and classifying the pieces of data according to the final decision tree to generate the classification result.

7. The classification device according to claim 6, wherein calculating the scores of the second decision trees in the second decision tree set and selecting the final decision tree according to the scores comprises: calculating, according to the grouping result, first ratios of the pieces of data corresponding to respective groups; calculating, according to the grouping result, second ratios of the pieces of data corresponding to respective nodes of a second decision tree; calculating a credibility of the second decision tree according to the first ratios and the second ratios; calculating the score of the second decision tree according to the credibility, a complexity, and an accuracy of the second decision tree; and selecting the final decision tree according to the scores.

8. A classification method, adapted to classify a plurality of pieces of data, wherein each of the pieces of data is associated with a first feature set, the classification method comprising: calculating correlations of a plurality of first features in the first feature set, and filtering the first feature set according to the correlations and weights of the plurality of first features to generate a second feature set; grouping the pieces of data according to the second feature set to generate a grouping result; generating a first decision tree set according to the second feature set and the grouping result; filtering the second feature set according to error rates of a plurality of first decision trees in the first decision tree set to generate a third feature set; and generating a classification result according to the grouping result and the third feature set.

9. The classification method according to claim 8, wherein filtering the first feature set according to the correlations and the weights of the plurality of first features to generate the second feature set comprises: selecting, from the plurality of first features, a plurality of features whose correlations are greater than a threshold; and selecting, from the plurality of features, a key feature having a largest weight, and deleting the features other than the key feature among the plurality of features.

10. The classification method according to claim 8, wherein grouping the pieces of data according to the second feature set to generate the grouping result comprises: grouping the pieces of data according to the second feature set based on a K-means algorithm to generate the grouping result.

11. The classification method according to claim 8, wherein generating the first decision tree set according to the second feature set and the grouping result comprises: generating, based on a random forest algorithm, the first decision tree set having the plurality of first decision trees according to the second feature set and the grouping result.

12. The classification method according to claim 8, wherein filtering the second feature set according to the error rates of the plurality of first decision trees in the first decision tree set to generate the third feature set comprises: selecting, from the plurality of first decision trees, a decision tree having a lowest error rate; and composing the third feature set of features corresponding to the decision tree having the lowest error rate.

13. The classification method according to claim 8, wherein generating the classification result according to the grouping result and the third feature set comprises: generating a second decision tree set according to the grouping result and the third feature set; calculating scores of second decision trees in the second decision tree set, and selecting a final decision tree according to the scores; and classifying the pieces of data according to the final decision tree to generate the classification result.

14. The classification method according to claim 13, wherein calculating the scores of the second decision trees in the second decision tree set and selecting the final decision tree according to the scores comprises: calculating, according to the grouping result, first ratios of the pieces of data corresponding to respective groups; calculating, according to the grouping result, second ratios of the pieces of data corresponding to respective nodes of a second decision tree; calculating a credibility of the second decision tree according to the first ratios and the second ratios; calculating the score of the second decision tree according to the credibility, a complexity, and an accuracy of the second decision tree; and selecting the final decision tree according to the scores.
TW107139402A 2018-11-06 2018-11-06 Classification device and classification method TWI721331B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW107139402A TWI721331B (en) 2018-11-06 2018-11-06 Classification device and classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW107139402A TWI721331B (en) 2018-11-06 2018-11-06 Classification device and classification method

Publications (2)

Publication Number Publication Date
TW202018527A (en) 2020-05-16
TWI721331B TWI721331B (en) 2021-03-11

Family

ID=71895754

Family Applications (1)

Application Number Title Priority Date Filing Date
TW107139402A TWI721331B (en) 2018-11-06 2018-11-06 Classification device and classification method

Country Status (1)

Country Link
TW (1) TWI721331B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8306942B2 (en) * 2008-05-06 2012-11-06 Lawrence Livermore National Security, Llc Discriminant forest classification method and system
CN107292186B (en) * 2016-03-31 2021-01-12 阿里巴巴集团控股有限公司 Model training method and device based on random forest
US10367841B2 (en) * 2016-12-16 2019-07-30 Patternex, Inc. Method and system for learning representations for log data in cybersecurity

Also Published As

Publication number Publication date
TWI721331B (en) 2021-03-11

Similar Documents

Publication Publication Date Title
TWI769754B (en) Method and device for determining target business model based on privacy protection
CN108898479B (en) Credit evaluation model construction method and device
WO2015135321A1 (en) Method and device for mining social relationship based on financial data
CN104573130B (en) The entity resolution method and device calculated based on colony
TW202004559A (en) Feature interpretation method and device for GBDT model
CN108701259A (en) The system and method that the data handled using binary classifier carry out create-rule
WO2018161900A1 (en) Risk control event automatic processing method and apparatus
WO2021244583A1 (en) Data cleaning method, apparatus and device, program, and storage medium
CN108280236A (en) A kind of random forest visualization data analysing method based on LargeVis
CN113568368B (en) Self-adaptive determination method for industrial control data characteristic reordering algorithm
CN108346474A (en) The electronic health record feature selection approach of distribution within class and distribution between class based on word
CN110956277A (en) Interactive iterative modeling system and method
CN112437053A (en) Intrusion detection method and device
CN116362823A (en) Recommendation model training method, recommendation method and recommendation device for behavior sparse scene
CN108733745A (en) A kind of enquiry expanding method based on medical knowledge
CN111062806A (en) Personal finance credit risk evaluation method, system and storage medium
CN110222177A (en) A kind of initial cluster center based on K- means clustering algorithm determines method and device
Zhang et al. Research on borrower's credit classification of P2P network loan based on LightGBM algorithm
JP5929532B2 (en) Event detection apparatus, event detection method, and event detection program
TWI721331B (en) Classification device and classification method
CN111339294B (en) Customer data classification method and device and electronic equipment
US20220351087A1 (en) Feature pruning and algorithm selection for machine learning
TWI710960B (en) Image classification system and method
CN114170000A (en) Credit card user risk category identification method, device, computer equipment and medium
JP4125951B2 (en) Text automatic classification method and apparatus, program, and recording medium