KR100727555B1

KR100727555B1 - Creating method for decision tree using time-weighted entropy and recording medium thereof

Info

Publication number: KR100727555B1
Application number: KR1020060024327A
Authority: KR
Inventors: 이지형; 동립권
Original assignee: 성균관대학교산학협력단
Priority date: 2005-12-05
Filing date: 2006-03-16
Publication date: 2007-06-14
Also published as: KR20070058936A

Abstract

The present invention relates to a method for generating a decision tree using time-weighted entropy, which takes time into account so that the trend of recent data is better represented, and to a recording medium on which the method is recorded. In a decision tree generation method, among the classification techniques of data mining, by which a computer recognizes patterns in a plurality of externally input case data and produces newly classified data, the method comprises: (a) performing a test, using time-weighted entropy, to select from the attributes of the case data the attribute that will become a node of the current level; (b) selecting as the node the attribute with the largest information gain obtained in the test; (c) repeating steps (a) and (b) until all nodes of the current level have been selected; and (d) once all nodes of the current level have been selected, repeating steps (a) to (c) for the next lower level.

With the decision tree generation method using time-weighted entropy and the recording medium described above, greater emphasis is placed on recent data, so the influence of recent data remains visible even when the entire data set is considered.

Data mining, decision tree, ID3, entropy

Description

CREATING METHOD FOR DECISION TREE USING TIME-WEIGHTED ENTROPY AND RECORDING MEDIUM THEREOF

FIG. 1 shows a direct mail (DM) mailing data table collected for a conventional decision tree and the rules extracted from it;

FIG. 2 shows the weights assigned to the individual cases;

FIG. 3 is a flowchart illustrating the decision tree generation method using time-weighted entropy according to the present invention;

FIG. 4 shows the data table used in an experiment comparing the conventional ID3 algorithm with the decision tree generation method according to the present invention; and

FIG. 5 shows an example of decision trees generated from the table shown in FIG. 4.

The present invention relates to decision trees, one of the classification techniques of data mining, and more particularly to a method for generating a decision tree using time-weighted entropy, which takes time into account so that the trend of recent data is better represented, and to a recording medium on which the method is recorded.

In a ubiquitous computing environment, data related to user profile information is stored in a database, and the stored user profile information data must be learned in order to analyze the user's tendencies and provide the user with a variety of services. In a real ubiquitous environment, however, a user's tendencies often change over time. For example, user A may have performed task A10 in situation A1 three months ago, but recently may no longer perform A10 in the same situation A1, or may perform A10 in situation B1 instead. When analyzing user profile data, therefore, recent data should be given more weight than past data.

In general, data mining is a field of knowledge discovery that effectively finds hidden, potentially useful knowledge in large volumes of data. That is, it explores and analyzes the patterns and relationships among data hidden in a data warehouse. Data mining techniques include association, in which one event is correlated with another; sequence, in which one event leads to a later event; classification, which recognizes patterns and produces newly classified data; clustering, which discovers and visualizes previously unknown groups of facts; and prediction, which forecasts the future from patterns found in the data.

Classification methods commonly used in data mining include Bayesian classification, k-nearest neighbors, genetic algorithms, neural networks, rule-based algorithms, and decision trees. Among these, decision trees, whose results take the form of a simple tree, are widely used in data mining because people can easily understand and explain them. Decision trees are also among the most widely used methods for prediction and classification. In some prediction and classification tasks, the only question is how well the model classifies or predicts. A direct mail (DM) company, for example, cares less about how its recipient-group model is constructed than about how well it can identify the group most likely to respond to its mailings. In other cases, however, it is important to explain why a particular decision was made. Decision trees are useful here: for example, when a credit card application is rejected, a decision tree can be used to explain the rejection.

A decision tree is a method that partitions a data set into subsets in order to identify the relationship between the attributes that make up the data and the class, and that characterizes each resulting subset. In other words, it is a tree-shaped classification model that analyzes records of previously collected data and expresses the patterns that exist among the classes as combinations of attributes. The resulting model is then used to classify new records and predict the class to which they belong. A decision tree resembles the game of Twenty Questions: the first question determines which question to ask next, and if the questions are well chosen, incoming records can be classified correctly in a short time. A decision tree is built by recursive partitioning and consists of a root node at the top of the tree, internal nodes that hold the splitting criterion for an attribute, links (branches) that connect nodes, and leaf nodes that represent the final classification.

One form a decision tree can take is a binary tree. In this structure each node has two child nodes, and a record proceeds to a leaf node by answering a yes/no query at each node. Trees are not limited to a purely binary shape; mixed forms also exist. The effectiveness of a constructed model is measured by applying test data and examining its prediction rate. Some paths may turn out to be more effective than others, in which case pruning should be applied to remove the ineffective branches. Construction of a decision tree begins by finding the splitting criterion, that is, the splitting variable that best separates the data. The splitting criterion is then sought again at the next node, and the tree is grown until the data can no longer be usefully split.

As an example of decision tree generation, consider the DM mailing case shown in FIG. 1.

FIG. 1 shows a DM mailing data table collected for a conventional decision tree and the rules extracted from it: FIG. 1A is an example of collected DM mailing data, and FIG. 1B shows the generated classification rules in tree form.

As shown in FIG. 1A, the goal in DM mailing is to predict which customers are likely to respond positively. First, customers who answered 'yes' to the telecommunications company's question and customers who answered 'no' are classified according to their characteristics. The data consists of three attributes, 'occupation', 'gender', and 'age', together with the class, 'response'. There are 10 customers in total, of whom 3 answered 'no' and 7 answered 'yes'. The decision tree measures how strongly each attribute influences the classification of the customers, selects the most influential attribute, and makes it the root node. In this example the attribute 'occupation' becomes the root node, and the customers are split by its value into three branches: 'office worker', 'self-employed', and 'unemployed'. This yields the first rule: a customer whose occupation is 'self-employed' answers 'yes' regardless of gender and age. Of the 10 customers, 5 fall into the 'office worker' branch, of whom 2 answered 'no' and 3 answered 'yes'. To classify these 5 customers further, the decision tree expands, finds that 'age' is now the most influential attribute, and makes it an internal node with a split value of 35. Both customers aged 35 or under answered 'no', which yields the second rule: a customer who is an office worker aged 35 or under answers 'no'. Expanding the tree in this way produces four classification rules in total. As shown in FIG. 1B, to judge how a 33-year-old male office worker would respond to a DM mailing, the rules above predict that he will answer 'no'. Excluding such customers from the mailing list therefore saves cost and effort.
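The two rules spelled out in this example can be written as a small lookup, shown here as a minimal Python sketch. The function name `predict_response` and the string labels are illustrative assumptions; the remaining two of the four rules are not stated in the text, so the sketch returns `None` for them rather than guessing.

```python
def predict_response(occupation, age):
    """Apply the two classification rules stated for the DM example."""
    # Rule 1: self-employed customers answer 'yes' regardless of gender and age.
    if occupation == "self-employed":
        return "yes"
    # Rule 2: office workers aged 35 or under answer 'no'.
    if occupation == "office worker" and age <= 35:
        return "no"
    # The other two rules are not spelled out in the text.
    return None

# The 33-year-old male office worker from the example; gender is not needed by either rule.
print(predict_response("office worker", 33))
```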

Decision tree algorithms include ID3 (Iterative Dichotomiser 3), ID5R, C4.5, CART (Classification And Regression Trees), SPRINT (Scalable PaRallelizable INduction of decision Trees), and CHAID (Chi-squared Automatic Interaction Detection). The most representative of these is the ID3 algorithm proposed by Quinlan in 1986.

The ID3 algorithm uses information gain as the attribute selection measure for choosing the best test attribute on which to split the decision tree. This measure is based on the concept of entropy from information theory, where a smaller entropy value is preferable. The test attribute that yields the smallest entropy therefore has the largest information gain and is selected. The ID3 approach to pattern recognition and classification is an efficient procedure for generating a discrimination tree that classifies patterns with non-numeric attributes or feature values. Because a discrimination tree can be expressed as a collection of rules, ID3 is also regarded as inductive inference for machine learning or rule acquisition. ID3 can be very efficient under certain conditions, but it should not be used outside its range of applicability. It is effective when the number of patterns is large and each pattern consists of a long list of non-numeric attribute values.
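For reference, the information gain used by ID3 is conventionally written as below. The symbols A (a candidate attribute), Values(A), and S_v (the subset of S whose value of A is v) are standard textbook notation and do not appear in this document; the second term is what the description later calls the system entropy.

```latex
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
```

Because Entropy(S) is the same for every candidate attribute, choosing the attribute with the smallest system entropy is the same as choosing the one with the largest gain.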

An example of a technique for selecting an optimal decision tree for data mining is disclosed in Korean Registered Patent Publication No. 10-0497211 (registered June 15, 2005; "Apparatus and method for selecting an optimal decision tree for data mining").

The technique disclosed in that publication comprises: probability calculation means for calculating, from input multidimensional data and nested decision trees, the probability of each item of multidimensional data given a tree; prior probability calculation means for calculating the prior probability of the tree; posterior probability calculation means for calculating the posterior probability by Bayes' theorem from the probability of each item of multidimensional data given the tree and the prior probability of the tree; and decision tree selection means for selecting the decision tree with the largest posterior probability to obtain a single optimal decision tree. It uses the posterior probability to define a new quantity called TIC (Tree Information Criteria), selects the optimal decision tree on the basis of the TIC value, and builds the decision tree only once, thereby greatly increasing computation speed.

An example of a decision tree technique used as a statistical classification method in data mining is disclosed in Korean Registered Patent Publication No. 10-0498651 (registered June 22, 2005; "Statistical method for classifying data through classification of pure nodes of interest, that is, nodes with small variance, in a classification decision tree for data mining").

In forming a classification decision tree, one kind of decision tree used in data mining, the technique disclosed in that publication uses a splitting method based on min(P_L0·P_L1, P_R0·P_R1), min(P_L1, P_L0, P_R1, P_R0), or max(P_L1, P_L0, P_R1, P_R0): an independent variable x and its threshold are chosen as the splitting criterion, the child nodes are split according to that criterion, every child node is checked to determine whether splitting is finished, and the splitting process is repeated for each child node for which it is not. The response variable is a categorical variable, and the class of each observation is predicted by the decision tree. The resulting tree is unbalanced, but for that very reason it has more explanatory power and finds the desired subset more quickly and concisely.

However, when analyzing data whose trends change over time, as in a ubiquitous environment, the existing ID3 algorithm cannot adequately represent the time factor.

In addition, when the existing ID3 algorithm processes data that changes over time, it is difficult to discern the trend of recent data. That is, the existing algorithm has no easy way of knowing which data is recent.

The present invention has been made to solve the problems described above, and an object thereof is to provide a method for generating a decision tree using time-weighted entropy, which makes recent trends easy to identify even when all the data is used, and a recording medium on which the method is recorded.

Another object of the present invention is to provide a method for generating a decision tree using time-weighted entropy, which can predict a user's behavior or recognize the user's inclinations by analyzing the user's preference data in a ubiquitous environment, and a recording medium on which the method is recorded.

To achieve the above objects, the decision tree generation method using time-weighted entropy according to the present invention is a decision tree generation method, among the classification techniques of data mining, by which a computer recognizes patterns in a plurality of externally input case data and produces newly classified data, the method comprising: (a) performing a test, using time-weighted entropy, to select from the attributes of the case data the attribute that will become a node of the current level; (b) selecting as the node the attribute with the largest information gain obtained in the test; (c) repeating steps (a) and (b) until all nodes of the current level have been selected; and (d) once all nodes of the current level have been selected, repeating steps (a) to (c) for the next lower level, wherein the decision tree produced by steps (a) and (b) is grown until every subgroup of case data is unified into a single class and the system entropy becomes 0, and wherein the time-weighted entropy assigns larger weights to more recent trends.

Further, in the decision tree generation method using time-weighted entropy according to the present invention, the time-weighted entropy is computed by executing the time-weighted entropy equation (Equation 4 below), where S is the set of case data, T_c is the total number of classes to which the case data belong, i is an index over the case data in order, W(i) is the weight indicating the recency of the i-th case data, and c is an index over the classes.

Further, in the decision tree generation method using time-weighted entropy according to the present invention, W(i) is computed by executing the weight equation (Equation 3 below), where n is the number of case data, i is an index over the case data in order, and θ is the angle that determines the weight of each case data in a linear graph whose X axis is n and whose Y axis is the reliability of the case data, and tan θ is greater than or equal to -2/n and less than or equal to 2/n.

Further, in the decision tree generation method using time-weighted entropy according to the present invention, if during generation of the tree the entropy of a particular node is smaller than a predetermined threshold, the node is replaced with the class to which most of the case data belonging to that node belong.

Further, to achieve the above objects, the computer-readable recording medium according to the present invention is a computer-readable recording medium for a decision tree generation method, among the classification techniques of data mining, by which a computer recognizes patterns in a plurality of externally input case data and produces newly classified data, the medium recording a program for executing: (a) performing a test, using time-weighted entropy, to select from the attributes of the case data the attribute that will become a node of the current level; (b) selecting as the node the attribute with the largest information gain obtained in the test; (c) repeating steps (a) and (b) until all nodes of the current level have been selected; and (d) once all nodes of the current level have been selected, repeating steps (a) to (c) for the next lower level, wherein the decision tree produced by steps (a) and (b) is grown until every subgroup of case data is unified into a single class and the system entropy becomes 0, and wherein the time-weighted entropy assigns larger weights to more recent trends.

Hereinafter, the most preferred embodiment of the present invention is described in detail with reference to the accompanying drawings, so that a person of ordinary skill in the art to which the invention pertains can readily practice it. In the description, identical parts are given identical reference numerals, and repeated description of them is omitted.

First, the entropy formula defined in the ID3 algorithm is introduced. Entropy is defined as follows.

$$\mathrm{Entropy}(S) = -\sum_{c=1}^{T_c} P_c \log_2 P_c \qquad \text{[Equation 1]}$$

In Equation 1, S is the set of cases, T_c is the total number of classes to which the cases belong, and P_c is the proportion of cases in S that belong to class c. That is, if S contains n cases, of which n_c belong to class c, then P_c = n_c/n. By definition, 0·log₂0 = 0.
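As a minimal sketch of the ID3 entropy with P_c = n_c/n, the following Python function computes it from a list of class labels. The function name `entropy` and the list representation are illustrative assumptions; the sample data is the 3 'no' and 7 'yes' responses of the DM example.

```python
import math
from collections import Counter

def entropy(labels):
    """ID3 entropy: -sum over classes of P_c * log2(P_c), with P_c = n_c / n."""
    n = len(labels)
    counts = Counter(labels)
    # Classes with no cases never appear in `counts`, which matches 0*log2(0) = 0.
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

# DM example: 3 'no' and 7 'yes' responses among 10 customers.
responses = ["no"] * 3 + ["yes"] * 7
print(round(entropy(responses), 4))
```

A set whose cases all belong to one class gives an entropy of 0, which is the stopping condition used later for tree generation.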

If exactly one case in S belongs to class c, then P_c = 1/n; if two do, P_c = 2/n. Each case in S can therefore be said to affect the entropy of S in proportion to 1/n. Accordingly, if the weight indicating the recency of the i-th case is written W(i), and W(i) = 1 for all i, then P_c can be rewritten as in Equation 2, where c(i) is the class to which the i-th case belongs.
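Equation 2 is not reproduced in this text. A form consistent with the definitions of W(i) and c(i) given here, offered as a reconstruction rather than as the patent's exact notation, is:

```latex
P_c = \frac{\sum_{i:\,c(i)=c} W(i)}{\sum_{i=1}^{n} W(i)}
\qquad\text{which, for } W(i) = 1 \text{ for all } i, \text{ reduces to}\qquad
P_c = \frac{n_c}{n}.
```

Writing the denominator as the sum of the weights, rather than as n, is an assumption; the two coincide whenever the weights sum to n.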

FIG. 2 shows the weights assigned to the individual cases.

FIG. 2A shows the per-case weights in the ID3 algorithm, each represented as the area of one square. For example, the hatched portion of FIG. 2A represents the weight of the first case. In FIG. 2A every data item has a weight of 1. If n is regarded as time, a larger n means more recent data, so the corresponding area should grow with n. In ID3, however, the areas, that is, the weights of all data items, are identical, so the effect of time cannot be expressed.

Stated generally: if recent data should have more influence, the weight should increase with n; if older data should have more influence, the weight should decrease with n. FIG. 2B shows how weights can be determined under this generalized definition.

FIG. 2B shows the weights assigned to the individual cases. The Y axis in FIG. 2B is the reliability of each case; a more reliable case has a larger weight. The area OABC in FIG. 2B equals the sum of the areas of the n squares in FIG. 2A. That is, the weights are defined so that the sum of the weights of the n cases in FIG. 2B equals the sum of the weights of the n cases in the conventional entropy. The new weighting, the straight line AB, is determined by tan θ.

In Equation 3, the corresponding area in FIG. 2B is used as the weight of the i-th case. The weight W(i) of the i-th data item is given by Equation 3.
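The weight formula itself is not reproduced in this text. One reconstruction consistent with the surrounding description, offered as an assumption rather than as the patent's own notation, takes AB to be a straight line of slope tan θ over 0 ≤ x ≤ n whose mean height is 1 (so that the area OABC equals n), and takes W(i) to be the area under AB between x = i - 1 and x = i:

```latex
y(x) = 1 + \tan\theta \left(x - \frac{n}{2}\right), \qquad
W(i) = \int_{i-1}^{i} y(x)\,dx = 1 + \frac{\tan\theta}{2}\,(2i - 1 - n), \qquad
\sum_{i=1}^{n} W(i) = n .
```

Under this reading, the stated bounds of -2/n and 2/n are exactly the conditions that the line stays non-negative at x = 0 and at x = n.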

The angle θ in Equation 3 is defined in order to assign a weight to each data item; as shown in FIG. 2B, it determines the weight of each case data in a linear graph whose X axis is n and whose Y axis is the reliability of the case data. The range is -2/n ≤ tan θ ≤ 2/n. When tan θ = 2/n, the weight of the n-th case is largest and the influence of the most recent data is at its maximum; when tan θ = 0, every weight equals 1 and the method reduces to the ID3 algorithm; and when tan θ = -2/n, the weight of the n-th case is smallest and the influence of the most recent data is at its minimum.
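The three cases above can be checked numerically with a minimal Python sketch. It assumes the weight of the i-th case is the area under a line of slope tan θ with mean height 1, that is W(i) = 1 + tan θ · (2i - 1 - n)/2; this closed form is a reconstruction, and the function name `time_weights` and parameter `tan_theta` are illustrative.

```python
def time_weights(n, tan_theta):
    """Weights W(1)..W(n) for n cases ordered oldest to newest (assumed closed form)."""
    if not -2 / n <= tan_theta <= 2 / n:
        raise ValueError("tan_theta must lie in [-2/n, 2/n]")
    # Area under y = 1 + tan_theta * (x - n/2) between x = i-1 and x = i.
    return [1 + tan_theta * (2 * i - 1 - n) / 2 for i in range(1, n + 1)]

n = 4
print(time_weights(n, 2 / n))    # most recent case weighted most heavily
print(time_weights(n, 0))        # all weights 1: plain ID3
print(time_weights(n, -2 / n))   # most recent case weighted least
```

In all three cases the weights sum to n, matching the requirement that the area OABC equal the area of the n unit squares.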

The entropy using these new weights is given by Equation 4.

The entropy of Equation 4 is defined as the time-weighted entropy. In the present invention, the decision tree is constructed using the ID3 algorithm together with the time-weighted entropy.
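A minimal Python sketch of a time-weighted entropy follows. It assumes that the class proportion P_c is the share of the total weight carried by the cases of class c, which is one reading of the definitions given for W(i) and c(i); the function name `time_weighted_entropy` and the sample data are illustrative, and all weights are assumed to be positive.

```python
import math
from collections import defaultdict

def time_weighted_entropy(labels, weights):
    """Entropy where P_c is the share of total weight held by class c (assumed form)."""
    total = sum(weights)
    per_class = defaultdict(float)
    for label, w in zip(labels, weights):
        per_class[label] += w
    return -sum((w / total) * math.log2(w / total)
                for w in per_class.values() if w > 0)

# Four hypothetical cases, oldest to newest: the response switches from 'yes' to 'no'.
labels = ["yes", "yes", "no", "no"]
print(time_weighted_entropy(labels, [1, 1, 1, 1]))              # uniform weights: ID3
print(time_weighted_entropy(labels, [0.25, 0.75, 1.25, 1.75]))  # recent cases weighted more
```

With uniform weights the two classes look evenly balanced; with larger weights on recent cases the recent 'no' responses dominate and the entropy drops, which is the effect the method relies on.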

Next, the decision tree generation method using time-weighted entropy according to an embodiment of the present invention is described with reference to FIG. 3.

FIG. 3 is a flowchart illustrating the decision tree generation method using time-weighted entropy according to the present invention.

The decision tree is generated starting from the root node and proceeding downward one level at a time. Because data mining classification techniques usually rely on a computer to classify a large number of externally supplied case data, this embodiment is likewise described on the assumption that a computer is used. First, a test is performed, using time-weighted entropy, to select which of the attributes of the case data will become a node of the current level. That is, the test for selecting a node is started (ST 2010). The initial value of entropy is computed using time-weighted entropy (ST 2020). Next, one of the attributes is chosen and provisionally assumed to be the node to be selected (ST 2030). The entropy of each branch of this provisionally selected node is then computed using time-weighted entropy (ST 2031), and the system entropy is computed from the branch entropies (ST 2032). The information gain (the "information acquisition amount"), which is the initial entropy minus the system entropy, is then computed (ST 2033); the smaller the system entropy, the larger the information gain. Steps ST 2030 to ST 2033 are repeated for every attribute (ST 2034). Once every attribute that could serve as the node has been tested, the information gains are compared and the attribute with the largest gain is selected as the node (ST 2035). Since this is the first level, the node selected in step ST 2035 becomes the root node.
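The selection procedure of steps ST 2020 to ST 2035 can be sketched in Python as follows. This is only an illustration under stated assumptions: the entropy is taken to be the ordinary Shannon entropy with each case counted by its weight, and the system entropy is taken to be the weight-share average of the branch entropies. The patent's own entropy and weight formulas are not reproduced in this text, and the function names, the sample rows, and the weights 1 to 4 are hypothetical.

```python
import math
from collections import defaultdict

def weighted_entropy(labels, weights):
    """Shannon entropy of the class labels, each case counted by its weight (assumed form)."""
    total = sum(weights)
    mass = defaultdict(float)
    for label, w in zip(labels, weights):
        mass[label] += w
    return -sum((m / total) * math.log2(m / total) for m in mass.values() if m > 0)

def information_gain(rows, labels, weights, attr):
    """Initial entropy minus system entropy for one candidate attribute."""
    initial = weighted_entropy(labels, weights)            # ST 2020
    branches = defaultdict(lambda: ([], []))               # value -> (labels, weights)
    for row, label, w in zip(rows, labels, weights):
        branches[row[attr]][0].append(label)
        branches[row[attr]][1].append(w)
    total = sum(weights)
    system = sum(                                          # ST 2031 and ST 2032
        (sum(ws) / total) * weighted_entropy(ls, ws)       # assumed weight-share average
        for ls, ws in branches.values()
    )
    return initial - system                                # ST 2033

def select_node(rows, labels, weights, attrs):
    """Test every attribute (ST 2034) and keep the one with the largest gain (ST 2035)."""
    return max(attrs, key=lambda a: information_gain(rows, labels, weights, a))

# Hypothetical cases in time order; later cases get larger (assumed) weights.
rows = [
    {"windy": "strong", "humidity": "high"},
    {"windy": "weak", "humidity": "high"},
    {"windy": "strong", "humidity": "normal"},
    {"windy": "weak", "humidity": "normal"},
]
labels = ["no", "yes", "no", "yes"]
weights = [1, 2, 3, 4]

best = select_node(rows, labels, weights, ["windy", "humidity"])
print(best)
```

In this toy data 'windy' separates the classes perfectly, so its system entropy is zero and its gain equals the initial entropy, which is why it is selected over 'humidity'.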

Once the root node has been selected, the process moves down one level to select the nodes of level-1 (ST 2060). Selecting each node at the lower level follows the same procedure as steps ST 2010 to ST 2035 used for the root node, so a detailed description is omitted. When all nodes of the current level have been selected (ST 2040), the process again moves down one level to select the nodes of level-2. This is repeated until tree generation is complete (ST 2050), that is, until every subset belongs to a single class and the system entropy becomes zero.

In addition, if the entropy of a particular node falls below a predetermined reference value during tree generation, the node is replaced with the class to which most of its data belong.
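The level-by-level generation and the reference-value rule can be sketched as a recursive builder. This is a minimal sketch, not the patented implementation: the weighted entropy is an assumed form (Shannon entropy over weight shares), the majority class is taken by plain case count, and the `build_tree` name, the sample cases, and the weights (0.03 for one old case, 1.0 for the rest) are hypothetical.

```python
import math
from collections import Counter, defaultdict

def weighted_entropy(labels, weights):
    """Shannon entropy of the labels, each case counted by its weight (assumed form)."""
    total = sum(weights)
    mass = defaultdict(float)
    for label, w in zip(labels, weights):
        mass[label] += w
    return -sum((m / total) * math.log2(m / total) for m in mass.values() if m > 0)

def split(cases, attr):
    """Group (row, label, weight) cases by the value of one attribute."""
    groups = defaultdict(list)
    for case in cases:
        groups[case[0][attr]].append(case)
    return groups

def build_tree(cases, attrs, threshold=0.0):
    """Return a nested dict tree; leaves are class labels."""
    labels = [label for _, label, _ in cases]
    weights = [w for _, _, w in cases]
    node_entropy = weighted_entropy(labels, weights)
    # Stop when the subset is pure, falls below the reference value,
    # or no attribute is left; the leaf is the class with the most cases.
    if node_entropy == 0 or node_entropy < threshold or not attrs:
        return Counter(labels).most_common(1)[0][0]

    total = sum(weights)

    def system_entropy(attr):
        return sum(
            (sum(w for _, _, w in group) / total)
            * weighted_entropy([l for _, l, _ in group], [w for _, _, w in group])
            for group in split(cases, attr).values()
        )

    # The smallest system entropy corresponds to the largest information gain.
    best = min(attrs, key=system_entropy)
    rest = [a for a in attrs if a != best]
    return {best: {value: build_tree(group, rest, threshold)
                   for value, group in split(cases, best).items()}}

# Hypothetical cases: one old 'no' with a small weight, recent cases with weight 1.0.
cases = [
    ({"windy": "weak", "humidity": "high"}, "no", 0.03),
    ({"windy": "weak", "humidity": "normal"}, "yes", 1.0),
    ({"windy": "weak", "humidity": "high"}, "yes", 1.0),
    ({"windy": "weak", "humidity": "normal"}, "yes", 1.0),
    ({"windy": "strong", "humidity": "high"}, "no", 1.0),
    ({"windy": "strong", "humidity": "normal"}, "no", 1.0),
]

pruned = build_tree(cases, ["windy", "humidity"], threshold=0.1)
full = build_tree(cases, ["windy", "humidity"], threshold=0.0)
print(pruned)
```

With the reference value 0.1 the 'weak' branch, whose weighted entropy is about 0.08 in this toy data, collapses to the leaf 'yes'; with a reference value of zero the same branch is split further on 'humidity'.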

Experiments described with reference to FIGS. 4 and 5 show that the present invention represents the trend of recent data better than the conventional ID3 algorithm.

FIG. 4 shows the data table used for the comparative experiment between the conventional ID3 algorithm and the decision tree generation method according to the present invention, and FIG. 5 shows examples of decision trees generated from the table of FIG. 4.

The tennis data table in FIG. 4 records whether the user played tennis as a function of 'outlook', 'temperature', 'humidity', and 'windy'. Because such user-related data change over time, the experiment added new data whose trend differs from that of the existing data. Data 1 to 14 in FIG. 4 are the existing data, and data 15 to 21 are the newly added data. Data 15 to 21 were given a different trend in order to represent a change in the user's tennis-playing habits. Specifically, data 1 to 14 follow the trend of FIG. 5a, and data 15 to 21 were chosen so that data 8 to 21, a window that overlaps the older data, follow the trend of FIG. 5b. The overlap reflects the expectation that a user's tendencies change gradually.

To learn such a changing user trend properly, attention should go to current data rather than to past data that already show a different trend. The difficulty lies in determining from which data point the new trend begins. For example, if data 8 to 21 are assumed to show the new trend and the ID3 algorithm is applied to them, a decision tree that reflects the changed trend well can be obtained. However, if all data are judged equally important and ID3 is applied to data 1 to 21, a complex tree such as that of FIG. 5d results. The tree is complex because outdated data and data with the new trend are mixed together. Conversely, considering only the very latest data on the grounds that they reflect the new trend does not yield an appropriate decision tree either. For example, the decision tree for data 15 to 21 alone is overly simple, as shown in FIG. 5c. Generating a decision tree that reflects the trend in changing data is therefore not easy.

To address this, the present invention proposes a method of generating a decision tree using time-weighted entropy together with the ID3 algorithm. That is, in the course of building the decision tree with ID3, the newly proposed entropy is used in place of the conventional entropy. In addition, if the entropy of a particular node falls below a predetermined reference value during tree generation, the node is replaced with the class to which most of its data belong. Applying this method to the data of FIG. 4 yields the decision tree of FIG. 5e, obtained by applying the time-weighted entropy algorithm to data 1 to 21. Because recent data are emphasized, 'windy' is selected as the root node. For the case 'windy = weak', only two past cases, data 1 and 8, are 'no', while the rest, including the latest data, are 'yes'. Because the past data carry low weights, the entropy for this case is 0.095. This is below the predetermined reference value of 0.1, so the node is replaced with 'yes', the most frequent class.
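The effect of down-weighting old cases on a node such as 'windy = weak' can be illustrated numerically. The sketch below is not a reproduction of the 0.095 figure: FIG. 4 and the patent's weight formula are not available in this text, so the positions of the 'yes' cases and the linear weight W(i) = i are assumptions made only for illustration. Only the fact that data 1 and 8 are the 'no' cases comes from the description.

```python
import math

def entropy(masses):
    """Shannon entropy of a list of non-negative class masses."""
    total = sum(masses)
    return -sum((m / total) * math.log2(m / total) for m in masses if m > 0)

def weight(i):
    """Assumed linearly increasing weight: later cases count more."""
    return i

no_cases = [1, 8]                                # from the description
yes_cases = [12, 14, 16, 17, 18, 19, 20, 21]     # hypothetical positions

# Conventional entropy: every case counts once.
unweighted = entropy([len(no_cases), len(yes_cases)])

# Time-weighted variant: every case counts by its (assumed) weight.
weighted = entropy([
    sum(weight(i) for i in no_cases),
    sum(weight(i) for i in yes_cases),
])

print(f"unweighted={unweighted:.3f} weighted={weighted:.3f}")
```

Because the two 'no' cases are old, their share of the total weight is smaller than their share of the case count, so the weighted entropy comes out lower than the conventional one. A steeper weight function would lower it further, which is the mechanism by which such a node can fall below the reference value.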

As this experiment shows, the tree built with the time-weighted entropy formula proposed in the present invention is simpler and more advantageous than the tree built with the conventional ID3 algorithm. Whereas it is difficult for ID3 to capture recent trends, the proposed structure places more emphasis on recent data, so the influence of recent data remains visible even when the entire data set is considered. Comparing the experimental results, the present invention with time weights generated a simpler and more effective decision tree than the existing algorithm and achieved a prediction rate 3 to 6% higher on the test data.

The decision tree generation method using time-weighted entropy according to the present invention, and the recording medium on which it is recorded, can infer and learn a user's behavior from the user's preference data in an environment such as ubiquitous computing, and can thereby also predict the user's behavior.

Although the invention made by the present inventors has been described in detail with reference to the above embodiment, the present invention is not limited to that embodiment and can of course be modified in various ways without departing from its gist.

As described above, the decision tree generation method using time-weighted entropy according to the present invention, and the recording medium on which it is recorded, place more emphasis on recent data, so the influence of recent data remains visible even when the entire data set is considered.

In addition, the decision tree generation method using time-weighted entropy according to the present invention, and the recording medium on which it is recorded, make it possible to generate a decision tree that is simpler and more effective than a tree built with the existing ID3 algorithm.

Furthermore, the decision tree generation method using time-weighted entropy according to the present invention, and the recording medium on which it is recorded, make it possible to predict a user's behavior or recognize the user's tendencies by analyzing the user's preference data in a ubiquitous environment.

Claims

A decision tree generation method, among data mining classification techniques in which a computer recognizes patterns in a plurality of externally input case data and produces newly classified data, the method comprising:

(a) calculating an initial value of entropy using time-weighted entropy,

(b) choosing one attribute among the attributes of the plurality of case data and assuming it to be the node to be selected,

(c) calculating the system entropy for the assumed node using time-weighted entropy,

(d) calculating an information acquisition amount using the initial value of the entropy and the system entropy,

(e) selecting, as the node of the current level, the attribute having the largest information acquisition amount among those calculated for the attributes,

(f) repeating steps (a) to (e) until all nodes of the current level are selected,

(g) repeating steps (a) to (f) for the next lower level once all nodes of the current level have been selected,

wherein the decision tree generated by steps (a) through (e) is generated until every subset of the case data is unified into one class and the system entropy becomes zero, and

wherein the time-weighted entropy gives a greater weight to case data that reflect more recent trends.

The method of claim 1,

wherein the time-weighted entropy is computed by

[equation omitted in source]

where S is the set of case data, T_c is the total number of classes to which the case data belong, i is an index over the case data, W(i) is a weight representing how well the i-th case datum reflects the latest trend, and c is an index over the classes.

The method of claim 2,

wherein W(i) is computed by

[equation omitted in source]

where n is the number of case data, i is an index over the case data, and θ is the angle that expresses the weight of each case datum in a linearly proportional graph whose X axis is n and whose Y axis is the reliability of the case data, and

wherein the range of said [quantity omitted in source] is greater than or equal to -2/n and less than or equal to 2/n.

The method of claim 1,

wherein, if the entropy of a particular node is smaller than a predetermined reference value during tree generation, the node is replaced with the class to which most of the case data belonging to the node belong.

A computer-readable recording medium on which is recorded a program for executing a decision tree generation method, among data mining classification techniques for recognizing patterns in a plurality of externally input case data and producing newly classified data, the program executing the steps of:

(a) calculating an initial value of entropy using time-weighted entropy,

(g) repeating steps (a) to (f) for the next lower level once all nodes of the current level have been selected,

wherein the time-weighted entropy gives a greater weight to case data that reflect more recent trends.