KR20040028081A

KR20040028081A - Method of Data Sorting for efficient fitness function evaluation in genetic Algorithm

Info

Publication number: KR20040028081A
Application number: KR1020020059174A
Authority: KR
Inventors: 구세본
Original assignee: 주식회사 케이티
Priority date: 2002-09-28
Filing date: 2002-09-28
Publication date: 2004-04-03

Abstract

PURPOSE: A data classification method for efficient fitness function evaluation in a genetic algorithm is provided. It greatly shortens the running time of the entire algorithm by building a table that lists, for each attribute, the individual data records having that attribute, and then evaluating the fitness function by referring to the table instead of reading the entire data set again. CONSTITUTION: Data is received and an individual group is generated for each attribute value (301). This step is repeated until the last input data, so that an individual group sharing the same attribute is generated for every attribute value (303). A classification condition expressed in attribute values is received, and a result obtained from the generated individual groups is output (307).

Description

Method of Data Sorting for efficient fitness function evaluation in genetic Algorithm

The present invention relates to a data classification method for efficient fitness function evaluation in a genetic algorithm, and to a computer-readable recording medium on which a program for implementing the method is recorded.

The present invention proposes a fast and efficient way to address the problem that fitness function evaluation in a genetic algorithm, which is frequently used to solve NP-complete (nondeterministic polynomial-time complete) problems, demands more computation as the data grows. To this end, the training data is not represented, as in conventional methods, as a list of the attributes that each individual data record (instance) has. Instead, a table is generated by listing, for each attribute, the individual data records that have that attribute. When the fitness function is later evaluated, it is computed by referring to the generated table, without reading the entire data set again, which can drastically reduce the running time of the whole algorithm. The present invention relates to technical fields such as artificial intelligence, machine learning, data mining, and information retrieval.

The terms used in the present invention are defined as follows.

A genetic algorithm applies nature's law of survival of the fittest to problem solving. It is used mainly for optimization problems, that is, to efficiently find a near-optimal solution close to the exact solution of an NP-complete problem that conventional general-purpose algorithms cannot solve in practical time. The algorithm is also frequently used for classification and clustering, which are branches of data mining.

An individual is the term for one candidate solution in a genetic algorithm. An individual is generally represented in computer memory as a bit string, and individuals whose bits differ represent different solutions to the given problem.

Fitness is a number that indicates how close the solution represented by each individual is to the desired solution. It can also be seen as a number that indicates which of the various solutions represented by the individuals solves the problem more efficiently. Computing the fitness of each individual is called fitness function evaluation, and it accounts for a large share of the total running time of a genetic algorithm. Every individual in a genetic algorithm has its own fitness.

A population is the set of individuals in a genetic algorithm. A genetic algorithm repeatedly selects individuals with high fitness from the current population and applies genetic operators to them to produce a new population made up of new individuals. This repetition produces individuals with high fitness, and such an individual is the solution to the given problem.

A genetic operator is a function applied to individuals in a genetic algorithm to create a new individual that inherits some characteristics of the original individuals while also exhibiting new ones. The two operators mainly used are the crossover operator and the mutation operator.
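The two operators can be illustrated with a minimal sketch. The source does not specify which variants are used, so one-point crossover and bit-flip mutation on bit strings are assumed here, and the function names are hypothetical.

```python
import random

def one_point_crossover(a, b, rng):
    """Swap the tails of two equal-length bit strings at a random cut point (assumed variant)."""
    point = rng.randrange(1, len(a))  # cut strictly inside the string
    return a[:point] + b[point:], b[:point] + a[point:]

def bit_flip_mutation(bits, rate, rng):
    """Flip each bit independently with probability `rate` (assumed variant)."""
    return "".join(
        ("1" if c == "0" else "0") if rng.random() < rate else c
        for c in bits
    )

rng = random.Random(0)  # seeded only to make the sketch repeatable
child1, child2 = one_point_crossover("0000", "1111", rng)
print(child1, child2)
print(bit_flip_mutation(child1, 0.25, rng))
```

Each child keeps a prefix from one parent and a suffix from the other, which is the "inherits some characteristics" part; mutation supplies the new characteristics.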

In the general operation of a genetic algorithm, each variable to be found is regarded as one chromosome, and the position and pose to be searched for are together called one gene consisting of N chromosomes. Using the group of genes generated in the initial calibration step, that is, the population, the genes in the population are crossed over with one another at a predetermined random-variable rate, and the chromosomes within each gene are mutated according to a predetermined random variable. Each gene is then evaluated using the error (Derr): genes with good performance are selected, the remaining genes are eliminated, and new genes are generated and added to the population.

The crossover and mutation steps are then repeated. As this process continues, most genes in the population converge to values that minimize the sum of errors between feature points (Derr).

Conventional genetic algorithms have so far been applied mainly to optimization problems, but recently they have also been used for classification problems. A genetic algorithm used specifically for classification is sometimes called a classifier system. Classification problems are studied mainly in the field of machine learning.

A classification problem is the problem of building a model of the relationship between classes and attributes from training data that gives the classes and the attributes of the examples belonging to each class, and then using that model to predict the class of new data whose attributes are known but whose class is not.

When a genetic algorithm is applied to such a classification problem, each individual is a candidate for the model that the classification problem must find, and the fitness of an individual is the degree to which the model it represents predicts the actual class correctly when applied to the training data.

To obtain fitness in this way, that is, to evaluate the fitness function, the attributes of each training record must be fed to the model and the class predicted by the model compared with the actual class, and this must be done for all of the training data. The more often the predicted class matches the actual class, the better the model and the closer it is to a solution of the classification problem. This relationship between individuals and training data is shown in FIG. 1. Individuals generated by the genetic algorithm are drawn as boxes on the left, and the training data is shown on the right. Each individual is a model describing the features thought to be common to the data belonging to one class.

FIG. 1 is an exemplary diagram showing the relationship between individuals and training data in a genetic algorithm to which the present invention is applied.

In FIG. 1, the first individual predicts that training data in which attribute 1 has the value 1 and attribute 3 has the value 0 belongs to class 2. For each training record, the leftmost field is a unique number that identifies the record and the middle fields describe its attributes. Many variations exist depending on the problem, but to keep the example simple, assume there are three attributes and each attribute takes a value from 0 to 4. The rightmost field gives the class of the training record. Four training records, numbers 2, 4, 5, and 6, match the features described by the first individual. Of these, all but record 5 actually belong to class 2, so the rule described by the first individual is 75% accurate on the training data. In this case the fitness of the first individual is 0.75.
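The arithmetic in this example can be checked with a short sketch. The actual contents of FIG. 1 are not reproduced in the text, so the `TRAINING` records below are hypothetical values chosen only to agree with the description (records 2, 4, 5, and 6 match, and record 5 is not in class 2); the function name is also an assumption.

```python
# Hypothetical training data: id -> ((attr1, attr2, attr3), class).
# Values are invented to be consistent with the description of FIG. 1.
TRAINING = {
    1: ((0, 2, 0), 1),
    2: ((1, 3, 0), 2),
    3: ((1, 1, 4), 1),
    4: ((1, 0, 0), 2),
    5: ((1, 4, 0), 3),
    6: ((1, 2, 0), 2),
    7: ((2, 2, 0), 3),
    8: ((3, 1, 2), 1),
}

def fitness_full_scan(conditions, predicted_class, training):
    """Scan every record; `conditions` maps a 1-based attribute index to a required value."""
    matched = correct = 0
    for attrs, actual_class in training.values():
        if all(attrs[i - 1] == v for i, v in conditions.items()):
            matched += 1
            if actual_class == predicted_class:
                correct += 1
    return correct / matched if matched else 0.0

# First individual: attribute 1 == 1 and attribute 3 == 0 implies class 2.
print(fitness_full_scan({1: 1, 3: 0}, 2, TRAINING))  # 0.75
```

Note that the loop touches all eight records even though only four of them match the condition.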

Until now, when fitness is obtained from training data, as in classification problems, two main methods have been used to evaluate the fitness function. One iterates over the individuals and consults the training data for each; the other iterates over the training data and consults the individuals for each record.

The first method, as in the example above, examines the entire training data for each individual to obtain its fitness. This method is the one mostly used in ordinary genetic algorithms.

The second method finds, for each training record, the individuals whose condition matches the record and updates the fitness of those individuals. This method is used mainly in the classifier systems mentioned above.
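The second, record-driven method might be sketched as follows. The source does not say how a classifier system updates fitness, so simple match and hit counters are assumed here; the training records and names are hypothetical.

```python
# Hypothetical training data: id -> ((attr1, attr2, attr3), class).
TRAINING = {
    1: ((0, 2, 0), 1), 2: ((1, 3, 0), 2), 3: ((1, 1, 4), 1), 4: ((1, 0, 0), 2),
    5: ((1, 4, 0), 3), 6: ((1, 2, 0), 2), 7: ((2, 2, 0), 3), 8: ((3, 1, 2), 1),
}

def fitness_by_record(individuals, training):
    """Outer loop over records; each record updates every individual it matches."""
    matched = [0] * len(individuals)
    correct = [0] * len(individuals)
    for attrs, actual_class in training.values():
        for k, (conditions, predicted_class) in enumerate(individuals):
            if all(attrs[i - 1] == v for i, v in conditions.items()):
                matched[k] += 1
                if actual_class == predicted_class:
                    correct[k] += 1
    return [c / m if m else 0.0 for c, m in zip(correct, matched)]

# Each individual: (conditions as {1-based attribute index: value}, predicted class).
individuals = [({1: 1, 3: 0}, 2), ({2: 2}, 3)]
print(fitness_by_record(individuals, TRAINING))
```

Either loop order performs one comparison per (individual, record) pair, which is why both methods slow down as the training data grows.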

With both methods, the time needed for fitness function evaluation grows sharply as the training data grows. To address this, a sampling method that uses only part of the training data for fitness function evaluation is employed, but the fitness it computes becomes inaccurate depending on how much the sampled training data deviates from the full data.

The present invention has been proposed to solve the above problems. Its object is to provide a data classification method for efficient fitness function evaluation in a genetic algorithm, and a computer-readable recording medium on which a program for implementing the method is recorded.

That is, in a genetic algorithm that uses training data to evaluate the fitness function, the evaluation time rises sharply and the required storage capacity also grows as the data grows, degrading the performance of the algorithm as a whole. The invention aims to provide a way to handle this efficiently.

In particular, because fitness function evaluation in a genetic algorithm demands more computation as the data grows, the present invention aims to provide a way to perform it quickly and efficiently.

FIG. 1 is an exemplary diagram showing the relationship between individuals and training data in a genetic algorithm to which the present invention is applied.

FIG. 2 is a structural diagram of an embodiment of the inverse table produced by the data classification method for efficient fitness function evaluation in a genetic algorithm according to the present invention.

FIG. 3 is a flowchart of an embodiment of the data classification method for efficient fitness function evaluation in a genetic algorithm according to the present invention.

To achieve the above object, the present invention provides a data classification method for efficient fitness function evaluation in a genetic algorithm, comprising: a first step of receiving data and generating an individual group according to an attribute value; a second step of repeating the process of the first step up to the last input data, thereby generating, for every attribute value, an individual group having the same attribute; and a third step of receiving a classification condition expressed in attribute values and outputting a result obtained using the individual groups generated in the second step.

The present invention also provides a computer-readable recording medium having recorded thereon a program for implementing, in a classification system having a processor to which a genetic algorithm is applied: a first function of receiving data and generating an individual group according to an attribute value; a second function of repeating the process of step 1 up to the last input data, thereby generating, for every attribute value, an individual group having the same attribute; and a third function of receiving a classification condition expressed in attribute values and outputting a result obtained using the individual groups generated by the second function.

Because the present invention improves the performance of genetic algorithms that use large volumes of data, it can be applied in many fields. For example, when a genetic algorithm is applied to the classification problem in data mining, which finds and exploits similarities among large amounts of data, the memory and computation time needed to find a classifier can be greatly reduced. Such technology can serve as a core component in building an email call-center solution that routes large volumes of email with imprecisely specified recipients to the appropriate recipients, or a spam filtering system that separates bulk spam from useful mail and blocks the delivery of spam. Beyond these, the invention can be used to improve algorithm performance in any field that uses genetic algorithms, so its range of application is very wide.

Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 shows the relationship between individuals and training data to which the present invention is applied. As explained above, the cells inside the boxes that represent individuals and training data denote specific attributes of the individual or the training record. To keep the explanation simple, FIG. 1 uses a simple situation as the example.

FIG. 2 is a structural diagram of an embodiment of the inverse table produced by the data classification method for efficient fitness function evaluation in a genetic algorithm according to the present invention.

That is, FIG. 2 shows the table generated when the individuals and training data are as in FIG. 1. As shown, the left column of the table lists each attribute number together with the possible values of that attribute. The right column lists the serial numbers of the training records that have that value for that attribute.
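A table with this layout could be built as in the following sketch. The training records are hypothetical stand-ins for FIG. 1, and keying the table by `(attribute number, value)` pairs is an assumed representation, not one stated in the source.

```python
from collections import defaultdict

# Hypothetical training data: id -> ((attr1, attr2, attr3), class).
TRAINING = {
    1: ((0, 2, 0), 1), 2: ((1, 3, 0), 2), 3: ((1, 1, 4), 1), 4: ((1, 0, 0), 2),
    5: ((1, 4, 0), 3), 6: ((1, 2, 0), 2), 7: ((2, 2, 0), 3), 8: ((3, 1, 2), 1),
}

def build_inverse_table(training):
    """Map each (attribute number, value) pair to the set of record ids having it."""
    table = defaultdict(set)
    for record_id, (attrs, _class) in training.items():
        for index, value in enumerate(attrs, start=1):
            table[(index, value)].add(record_id)
    return table

table = build_inverse_table(TRAINING)
# Left column: attribute number and value. Right column: matching serial numbers.
for (index, value), ids in sorted(table.items()):
    print(f"attribute {index} = {value} | {sorted(ids)}")
```

The training data is read exactly once to build the table; only attribute values that actually occur get a row.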

Evaluating the fitness function of each individual accounts for a very large share of the total running time of a genetic algorithm, so the way the fitness function is evaluated directly determines the performance of the whole algorithm. Take the evaluation for the first individual in FIG. 1 as an example. The model represented by the first individual is "data in which attribute 1 has the value 1 and attribute 3 has the value 0 belongs to class 2." Such a model may or may not hold in practice, so the genetic algorithm applies the rule expressed by the model to the training data to decide which model is better. To apply the rule, the training records in which attribute 1 is 1 and attribute 3 is 0 must be found, and it must then be checked how many of those records actually belong to class 2. Until now, the fitness function has been evaluated by reading the entire training data and checking the conditions record by record. When there was too much data for this to be practical, only part of the training data was chosen, at random or according to a predetermined rule, and used for fitness function evaluation.

With the data classification method proposed in the present invention, however, this computation can be done much faster and more accurately. As FIG. 1 shows, the condition part of the first individual does not involve any other attribute: it matters only whether attribute 1 and attribute 3 have the values 1 and 0, respectively. When the present invention is applied, the table shown in FIG. 2 therefore reveals that the training records with these features are those with serial numbers 2, 4, 5, and 6. In this way, the fitness of the first individual can be obtained by examining only the four training records 2, 4, 5, and 6, without searching the entire training data.
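The table lookup described here amounts to intersecting the record sets for the attribute values in the condition, as in this minimal sketch. `TABLE` is a hypothetical fragment of the FIG. 2 table holding only the two rows needed, and `CLASSES` holds hypothetical class labels; both are assumptions made for illustration.

```python
# Hypothetical fragment of the inverse table: (attribute number, value) -> record ids.
TABLE = {
    (1, 1): {2, 3, 4, 5, 6},
    (3, 0): {1, 2, 4, 5, 6, 7},
}
# Hypothetical class label of each training record.
CLASSES = {1: 1, 2: 2, 3: 1, 4: 2, 5: 3, 6: 2, 7: 3, 8: 1}

def fitness_from_table(conditions, predicted_class, table, classes):
    """Intersect the record sets for each condition, then check only those records."""
    groups = [table.get(item, set()) for item in conditions.items()]
    matched = set.intersection(*groups) if groups else set(classes)
    if not matched:
        return 0.0
    correct = sum(1 for record_id in matched if classes[record_id] == predicted_class)
    return correct / len(matched)

# First individual: attribute 1 == 1 and attribute 3 == 0 implies class 2.
print(fitness_from_table({1: 1, 3: 0}, 2, TABLE, CLASSES))  # 0.75
```

Only the four records in the intersection have their class checked, and because no sampling is involved the result equals the full-scan fitness.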

Because the example above uses little training data, the difference in performance with and without the present invention may be small. As the amount of training data grows, however, the algorithm can be expected to run much faster when the present invention is applied.

FIG. 3 is a flowchart of an embodiment of the data classification method for efficient fitness function evaluation in a genetic algorithm according to the present invention.

The relationship between individuals and training data presented in FIG. 1 is organized into the inverse table of FIG. 2, so that data classification for efficient fitness function evaluation in a genetic algorithm can be performed.

First, data containing an individual and its attribute values is received (301).

Next, the attribute values of the input data are sorted individually, and the individual value of the input data is added under each attribute value that the data contains, forming an individual group for each attribute value (302).

It is then checked whether the last of the data to be input has been received (303). If not, new data containing an individual and its attribute values is received (304), the attribute values of the new data are sorted individually, and the individual value is added under each attribute value that the data contains, forming an individual group for each attribute value (305).

When the check (303) finds that all of the data has been received, a classification condition expressed in attribute values is received (306), and the classification result, obtained as the intersection of the individual groups corresponding to the attribute values in the condition, is output (307).
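The flow of FIG. 3 can be sketched end to end as below, with the step numbers from the text marked in comments. The comma-separated input format, the sample records, and the function names are assumptions made for illustration; the source does not specify an input format.

```python
import io
from collections import defaultdict

# Hypothetical input, one record per line: id,attr1,attr2,attr3,class
DATA = """\
1,0,2,0,1
2,1,3,0,2
3,1,1,4,1
4,1,0,0,2
5,1,4,0,3
6,1,2,0,2
7,2,2,0,3
8,3,1,2,1
"""

def build_groups(lines):
    """Read records one at a time and file each id under every attribute value it has."""
    groups = defaultdict(set)
    for line in lines:                      # steps 301 / 304: receive one record
        line = line.strip()
        if not line:
            continue
        record_id, *attrs, _class = map(int, line.split(","))
        for index, value in enumerate(attrs, start=1):
            groups[(index, value)].add(record_id)   # steps 302 / 305
    return groups                           # step 303: input exhausted

def classify(groups, conditions):
    """Steps 306-307: intersect the groups named by the classification condition."""
    sets = [groups.get(item, set()) for item in conditions.items()]
    return set.intersection(*sets) if sets else set()

groups = build_groups(io.StringIO(DATA))
print(sorted(classify(groups, {1: 1, 3: 0})))  # [2, 4, 5, 6]
```

The input is consumed in a single pass, after which any number of classification conditions can be answered from `groups` alone.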

The method of the present invention described above may be implemented as a program and stored in computer-readable form on a recording medium (CD-ROM, RAM, floppy disk, hard disk, magneto-optical disk, and the like).

The present invention described above is not limited by the foregoing embodiment or the accompanying drawings, since a person of ordinary skill in the art to which the invention pertains can make various substitutions, modifications, and changes without departing from the technical idea of the invention.

As described above, the present invention does not represent the training data as a list of the attributes that each individual data record (instance) has. Instead, it generates a table by listing, for each attribute, the individual data records that have that attribute, and later performs fitness function evaluation by referring to the generated table without reading the entire data set again. This has the effect of drastically reducing the running time of the whole algorithm.

Claims

A data classification method for efficient fitness function evaluation in a genetic algorithm, the method comprising:

a first step of receiving data and generating an individual group according to an attribute value;

a second step of repeating the process of the first step up to the last input data, thereby generating, for every attribute value, an individual group having the same attribute; and

a third step of receiving a classification condition expressed in attribute values and outputting a result obtained using the individual groups generated in the second step.

The method of claim 1, wherein the third step outputs the classification result as the intersection of the individual groups generated in the second step that correspond to each of the predetermined number of attribute values in the received classification condition.

A computer-readable recording medium having recorded thereon a program for implementing, in a classification system having a processor to which a genetic algorithm is applied:

a first function of receiving data and generating an individual group according to an attribute value;

a second function of repeating the process of step 1 up to the last input data, thereby generating, for every attribute value, an individual group having the same attribute; and

a third function of receiving a classification condition expressed in attribute values and outputting a result obtained using the individual groups generated by the second function.