KR100791037B1 - Grid-based data classification device and method thereof

Info

Publication number: KR100791037B1
Application number: KR1020070038262A
Authority: KR
Inventors: 이종식; 조규철
Original assignee: 인하대학교 산학협력단
Priority date: 2007-04-19
Filing date: 2007-04-19
Publication date: 2008-01-03

Abstract

A grid-based mass data classification apparatus and method are provided. They shorten data classification time by generating and training clusters to classify mass data and then removing, from the trained clusters, those that cannot represent the data or carry little weight. They also increase stability by dynamically allocating grid resources to the data classification job. A data classification management unit (10) includes a resource management module (11) that classifies the available grid resources, a threshold adjustment module (13) that initializes a threshold for data classification and adjusts it according to the result of the classified data, and a result test module (15) that checks the classification result and passes it on. A data mediation unit (30) includes a plurality of resource mediation modules (31) that deliver the grid resources and the threshold, and that merge the classified data and send it to the result test module. A plurality of data classification units (50) receive the grid resources and the threshold, classify the mass data into clusters and train them, remove outlier clusters, and return the result to the data mediation unit.

Description

Grid-based data classification device and method

FIG. 1 is a block diagram schematically showing a grid-based mass data classification apparatus according to the present invention.

FIG. 2 is a flowchart schematically showing a grid-based mass data classification method according to the present invention.

FIG. 3 is a flowchart showing in detail the data classification method within the grid-based mass data classification method according to the present invention.

FIG. 4 is a flowchart showing in detail the data training method within the data classification method of FIG. 3.

FIG. 5 is a flowchart showing in detail the outlier cluster removal method within the data classification method of FIG. 3.

FIG. 6 is a conceptual diagram schematically showing a recognition method in which clusters are grouped, within the data classification method of FIG. 3.

<Brief description of reference numerals for the main parts of the drawings>

1: grid-based mass data classification apparatus

10: data classification management unit    11: resource management module

13: classification threshold adjustment module    15: result test module

30: data mediation unit    31: resource mediation module

31a: information transmitter    31b: result integrator

50: data classification unit    51: training data management module

53: cluster management module    55: data training module

57: cluster removal module

The present invention relates to a grid-based mass data classification apparatus and method. More specifically, clusters are generated and trained so that mass data can be classified, and outlier clusters are removed from the generated clusters. An outlier cluster is one that cannot represent the data, or one of small weight whose number of member data items is at or below a certain number. The resulting reduction in clusters shortens the processing time of data classification. The threshold is adjusted on the basis of completed classification results so as to derive an optimal threshold, which makes it possible to approach the optimal recognition rate and thereby reduce computation time and shorten job processing time. Data classification jobs are dynamically allocated to grid resources, which makes it possible to operate on and process mass data while preventing a bottleneck at the central server and thus increasing stability.

In general, grid computing is a method of maximizing data processing capability by connecting many computers over a network. Geographically distributed resources such as computers and storage devices are connected over the network so that they can be shared and used.

Grid computing is also a kind of virtual computer: it searches for and identifies computer resources that have been idle for a certain time and shares the resources held by each computer, so it can increase working speed when processing data.

However, when mass data consisting of character strings and real numbers was classified, classification was performed against representative patterns. Because the representative patterns were limited and restrictive, flexible grouping could not be carried out, and classification accuracy decreased accordingly. The computation time needed to calculate the optimal recognition rate for mass data also increased. In addition, because grid resources were not classified flexibly, stability decreased, for example through bottlenecks at the central server, while resource utilization and response time increased.

The present invention has been devised to solve the above problems. Its object is to provide a grid-based mass data classification apparatus and method with the following properties. Clusters are generated and trained so that mass data can be classified, and outlier clusters (clusters that cannot represent the data, or clusters of small weight whose number of member data items is at or below a certain number) are removed, so that the reduced number of clusters shortens the processing time of data classification. The threshold is adjusted on the basis of completed classification results so as to derive an optimal threshold, so that the optimal recognition rate can be approached and computation time and job processing time are reduced. Data classification jobs are dynamically allocated to grid resources, so that mass data can be operated on and processed while a bottleneck at the central server is prevented and stability is increased.

To achieve the above object, the present invention comprises: a data classification management unit including a resource management module that classifies the available grid resources to be used for classifying mass data, a threshold adjustment module that initializes a threshold for the data classification so that the data is classified and adjusts the threshold according to the result of the classified data, and a result test module that checks the result of the data classified under the threshold adjustment module and passes the result on; a data mediation unit including a plurality of resource mediation modules that deliver the grid resources and the threshold, and that merge the classified data and transmit it to the result test module; and a plurality of data classification units that receive the grid resources and the threshold, classify the mass data into clusters and train them, and remove outlier clusters before transmitting to the data mediation unit.

The resource management module collects the available grid resources and classifies them according to storage space and job throughput.

The outlier cluster is a cluster, among the generated clusters, that cannot represent the data or whose share of the data is not large.

Here, the resource mediation module comprises: an information transmitter that delivers the grid resources and the threshold to the data classification units; and a result integrator that merges the data training results, classification results, and cluster information output from the plurality of data classification units and transmits them to the result test module.

In addition, the data classification unit comprises: a training data management module that inputs and manages the data; a data training module that uses the grid resources and generates and trains clusters so that the data is classified on the basis of the threshold; a cluster management module that manages and updates the representative data forming the clusters; and a cluster removal module that removes outlier clusters from the generated clusters.

The cluster removal module inspects all generated clusters and removes outlier clusters, that is, clusters whose number of member data items is at or below a certain criterion or that cannot represent the data.

Meanwhile, the method comprises the steps of: classifying the grid resources collected for mass data classification according to storage space and job throughput, and initializing a threshold; dynamically allocating jobs to the classified grid resources, generating and training clusters so that the data is classified on the basis of the threshold, inspecting all clusters, and removing outlier clusters before transmitting the result; and checking the result of merging the data training, classification, and cluster information, and adjusting the threshold so that an optimal threshold is derived.

Here, the outlier cluster is a cluster that cannot represent the data, or a cluster of small weight whose number of member data items, among all generated clusters, is at or below a certain criterion.

The data classification method comprises the steps of: inputting and managing mass data; generating and training clusters so that the data is classified on the basis of the threshold transmitted to the dynamically allocated grid resources; managing and updating the representative data that generates the clusters; and inspecting all generated clusters and removing outlier clusters.

The data training method comprises the steps of: generating a cluster from the mass data on the basis of the threshold; matching the pattern of the generated cluster against the pattern of the input data by comparison with the threshold; if the pattern of the generated cluster is included in the pattern of the input data, assigning the data to the generated cluster; if the pattern of the generated cluster is not included in the pattern of the input data, generating a cluster for the new pattern; and designating the data that generated the cluster as representative data and updating it.

The outlier cluster removal method comprises the step of inspecting all generated clusters and removing outlier clusters, that is, clusters whose number of member data items is at or below a certain criterion or that cannot represent the data.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying exemplary drawings.

FIG. 1 is a block diagram schematically showing a grid-based mass data classification apparatus according to the present invention. As shown in the figure, the grid-based mass data classification apparatus (1) according to the present invention comprises a data classification management unit (10), a data mediation unit (30), and a plurality of data classification units (50).

Here, to classify mass data, the data classification management unit (10) filters the grid resources it holds according to storage space and job throughput, initializes the threshold for the mass data classification job, and passes on the grid resources and the initialized threshold. It then tests the data for which the classification job has completed, adjusts the threshold accordingly, and keeps adjusting and updating the threshold until the optimal recognition rate is reached.

To this end, the data classification management unit (10) consists of a resource management module (11), a threshold adjustment module (13), and a result test module (15). To classify mass data, the resource management module (11) classifies the grid-computing-based grid resources according to the storage space available for storing the data and the job processing capability available for processing it. Grid resources below the predetermined reference storage space and reference job processing capability are filtered out. The corresponding formula is as follows.

[Equation 1]

$$\mathrm{LOC}(X[j+1]) = \mathrm{LOC}(X[j]) + c$$

$$\mathrm{LOC}(X[j]) = L_0 + c\,j$$

Here, LOC denotes the node-wise order of a grid resource, $c$ denotes the storage space (the data storage capability of the grid resource) and job processing capability, $j$ denotes the node position of the grid resource, and $L_0$ denotes the base value of the node position.

In other words, after the usable grid resources among those selected according to Equation 1 are sorted, grid resources with superior job processing capability are dynamically allocated in sequential order.
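The filtering and ordering described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `GridResource` type, its field names, the units, and the tie-breaking rule are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridResource:
    # Hypothetical record for one grid node; field names are illustrative.
    name: str
    storage_gb: float   # available storage space
    throughput: float   # job processing capability (arbitrary units)


def select_resources(resources, min_storage_gb, min_throughput):
    """Drop resources below either reference value, then order the rest best-first."""
    usable = [
        r for r in resources
        if r.storage_gb >= min_storage_gb and r.throughput >= min_throughput
    ]
    # Highest processing capability first; storage space breaks ties (assumption).
    return sorted(usable, key=lambda r: (r.throughput, r.storage_gb), reverse=True)


nodes = [
    GridResource("n1", 500, 8.0),
    GridResource("n2", 50, 9.5),    # below the storage reference, filtered out
    GridResource("n3", 200, 3.0),
    GridResource("n4", 300, 9.0),
]
ranked = select_resources(nodes, min_storage_gb=100, min_throughput=2.5)
print([r.name for r in ranked])  # ['n4', 'n1', 'n3']
```

Jobs would then be handed out by walking `ranked` from the front, which corresponds to the sequential allocation of the most capable resources.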

When classification of the mass data allocated to grid resources at or above the reference storage space and reference job throughput is complete, the threshold adjustment module (13) can test it and readjust the threshold on the basis of the test result. The changed threshold is delivered to the data mediation unit (30) and used to classify the mass data.

The result test module (15) is provided to test classification accuracy. When classification of the mass data allocated to grid resources at or above the reference storage space and reference job throughput is complete, the classification results of the plurality of data classification units (50) are merged in the data mediation unit (30), and the merged result is used to test the classification accuracy of the data for which the classification job has completed.

The data mediation unit (30) transmits the grid resource and threshold information received from the data classification management unit (10) to the plurality of data classification units (50), merges the individual results of the data classification completed using that grid resource and threshold information, and delivers them to the data classification management unit (10).

To this end, the data mediation unit (30) consists of a plurality of resource mediation modules (31), each of which comprises an information transmitter (31a) and a result integrator (31b).

Here, the plurality of resource mediation modules (31) transmit the grid resource information and threshold information received from the data classification management unit (10) to the data classification units (50). The mass data classification results of the plurality of data classification units (50) are merged in the resource mediation modules (31) of the data mediation unit (30), which is the upper node of the data classification units (50), and are delivered to the data classification management unit (10), which is the upper node of the data mediation unit (30).

The information transmitter (31a) is provided to transmit the grid resource information and threshold information delivered from the data classification management unit (10) to the data classification units (50). One is provided in each of the plurality of resource mediation modules (31) and transmits the information destined for the data classification units (50), which are lower nodes.

The result integrator (31b), provided in each of the plurality of resource mediation modules (31) of the data mediation unit (30), which is the upper node, merges the results for the mass data classified in the data classification units (50), which are lower nodes, and delivers them to the data classification management unit (10).
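The source does not say how the result integrator combines partial results, so the following is only one plausible reading: each data classification unit reports a mapping from a cluster's representative pattern to its member count, and the integrator sums counts for identical representatives. The function name `merge_results`, the report format, and the sample patterns are assumptions.

```python
from collections import Counter


def merge_results(partial_results):
    """Combine per-unit cluster counts into one table keyed by representative pattern."""
    merged = Counter()
    for result in partial_results:
        # Counter.update adds counts for representatives that were already seen.
        merged.update(result)
    return dict(merged)


# Hypothetical reports from two data classification units.
unit_a = {"pattern-A": 40, "pattern-B": 2}
unit_b = {"pattern-A": 35, "pattern-C": 12}
print(merge_results([unit_a, unit_b]))
# {'pattern-A': 75, 'pattern-B': 2, 'pattern-C': 12}
```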

Here, "upper" and "lower" in "upper node" and "lower node" are relative terms used when comparing the data classification management unit (10), the data mediation unit (30), and the data classification units (50).

The plurality of data classification units (50) receive the grid resource information and threshold information output from the data classification management unit (10) through the result integrators (31b) of the plurality of resource mediation modules (31) in the data mediation unit (30), distribute the mass data to the dynamically allocated grid resources, and train and classify it. Clusters are formed so that incoming data is classified. When data whose pattern is similar to an existing cluster is input, the data is assigned to that cluster. When data with a new pattern that differs from the existing clusters is input, a cluster for the new pattern is generated, and that data is designated as its representative data and updated. Once all data has been classified, all generated clusters are inspected and outlier clusters are removed, that is, clusters of small weight whose number of member data items is at or below a certain criterion, or clusters that cannot represent the data.

To this end, each of the plurality of data classification units (50) includes a training data management module (51), a cluster management module (53), a data training module (55), and a cluster removal module (57).

Here, the training data management module (51) receives and manages the mass data.

The data training module (55) generates a cluster as soon as data managed by the training data management module (51) is input. For each subsequent input, it checks, on the basis of the threshold transmitted from the data mediation unit (30), whether the pattern is similar to an existing cluster. Data with a similar pattern is assigned to the existing cluster; data with a new pattern that is not similar to any existing cluster causes a new cluster to be generated. Training is complete when the data has been classified in this way.

The cluster management module (53) designates and manages, as representative data, the new-pattern data that caused a new cluster to be formed in the data training module (55). As further data is input and the data training module (55) forms clusters to classify it, the cluster management module designates the data forming each new-pattern cluster as representative data and updates it.

Here, the cluster removal module (57) inspects all clusters generated by the data training module (55) and counts the data belonging to each. A cluster of small weight, whose number of member data items is at or below a certain criterion, or a cluster that cannot represent the data is defined as an outlier cluster and is removed. In the next classification, when input data is compared against the patterns of existing clusters, fewer clusters have to be searched, which reduces search time and raises processing speed.
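The count-based part of this rule can be sketched as below. Only the member-count criterion is shown; the source does not define how a cluster is judged unable to represent the data, so that test is left out. The function name, the dictionary layout, and the sample values are assumptions.

```python
def remove_outlier_clusters(clusters, min_members):
    """Keep clusters with more than `min_members` members; the rest are outliers."""
    return {
        rep: members
        for rep, members in clusters.items()
        if len(members) > min_members   # "at or below the criterion" is removed
    }


# Hypothetical clusters keyed by their representative data item.
clusters = {"a": [1, 2, 3, 4], "b": [5], "c": [6, 7]}
kept = remove_outlier_clusters(clusters, min_members=2)
print(list(kept))  # ['a']
```

With fewer clusters left, the next training pass compares each input against fewer representatives, which is where the reduced search time comes from.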

Finally, the results produced by the plurality of data classification units (50) are merged in the data mediation unit (30), the upper node, and delivered to the result test module (15) of the data classification management unit (10). The threshold of the threshold adjustment module (13) is adjusted according to the analysis and test results of the result test module (15), and, as described above, adjustment continues so that the recognition rate is optimized.

FIG. 2 is a flowchart schematically showing a grid-based mass data classification method according to the present invention. As shown in the figure, in classifying data on the basis of grid computing, the method begins with the resource management module collecting the grid resources available for mass data classification, ordering the grid resources, which differ in storage space and job processing capability, according to those two properties, and classifying them so that grid resources below the reference storage space and reference job processing capability are removed (S10).

Next, the threshold adjustment module initializes the threshold for the mass data classification job (S20), and the threshold and grid resources are transmitted to each of the plurality of data classification units through the information transmitters of the resource mediation modules in the data mediation unit (S30).

Clusters are then generated and trained so that the mass data is classified using the grid resources and threshold allocated to the data classification units, and outlier clusters are removed from the full set of generated clusters. The result is transmitted to the result integrator in the data mediation unit, the upper node of the data classification units, and the result integrator merges the result information received from the plurality of data classification units (S40).

The result integrator merges the classification results, training results, and cluster information transmitted from the plurality of data classification units and transmits them to the result test module in the data classification management unit (S50). The result test module analyzes and tests the classification results, training results, and cluster information it receives (S60).

It is then checked whether a threshold giving the optimal recognition rate has been derived (S70). Until the optimal threshold is derived, the threshold adjustment module readjusts and updates the classification threshold on the basis of the test results (S80) and steps S10 to S60 are repeated. Once the optimal threshold is derived, the grid-based mass data classification method according to the present invention ends.
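The adjust-and-repeat loop of steps S70 and S80 can be sketched as follows. The source does not specify how the threshold is changed or how optimality is detected, so this sketch assumes a fixed step and stops as soon as the recognition rate no longer improves. `tune_threshold`, the `evaluate` callback (standing in for one full pass of classification and testing), and the toy rate curve are all hypothetical.

```python
def tune_threshold(evaluate, initial, step, max_rounds=20):
    """Raise the threshold by `step` while the recognition rate keeps improving."""
    best_t, best_rate = initial, evaluate(initial)
    for _ in range(max_rounds):
        candidate = best_t + step
        rate = evaluate(candidate)      # one classify-merge-test pass
        if rate <= best_rate:           # no improvement: treat best_t as optimal
            break
        best_t, best_rate = candidate, rate
    return best_t, best_rate


def toy_rate(t):
    # Made-up recognition rate (percent) that peaks at a threshold of 60.
    return 100 - abs(t - 60)


print(tune_threshold(toy_rate, initial=20, step=10))  # (60, 100)
```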

FIG. 3 is a flowchart showing in detail the data classification method within the grid-based mass data classification method according to the present invention. As shown in the figure, the data classification method details step S50 of FIG. 2. The mass data is distributed to the grid resources dynamically allocated to the data classification unit, and the mass data is input to and managed by the training data management module (S50-1).

The threshold transmitted to the data classification unit is passed to the grid resources allocated for classifying the mass data managed by the training data management module, and on that basis clusters are formed and trained so that the mass data is classified (S50-2).

The clusters generated while data classification proceeds are managed and updated. Data whose pattern is not similar to any existing cluster generates a cluster for the new pattern, and the data that forms such a cluster is designated as representative data, managed, and updated (S50-3).

After data classification is complete, all generated clusters are inspected and the data belonging to each cluster is counted. A cluster of small weight, whose number of member data items is at or below a certain criterion, or a cluster that cannot represent the data is defined as an outlier cluster. When an outlier cluster is found it is removed, so that in the next classification, when input data is compared against the patterns of existing clusters, fewer clusters have to be searched and search time is reduced (S50-4).

FIG. 4 is a flowchart showing in detail the data training method within the data classification method of FIG. 3. As shown in the figure, the mass data is input to the allocated grid resources, and a cluster is generated based on the threshold transmitted from the data mediator (S50-2a).

Next, the pattern of the next input data is compared with the previously generated clusters and matched on the basis of the threshold (S50-2b). If the pattern of the input data falls within the pattern of a previously generated cluster (S50-2c), the input data is assigned to that cluster (S50-2d).

Meanwhile, if the pattern of the input data is not included in the pattern of any previously generated cluster in step S50-2c, a cluster for the new pattern is generated and the cluster set is updated (S50-2e).

Then it is checked whether training is complete, that is, whether every input datum has either been included in a previously generated cluster or has created a new cluster (S50-2f). If training has not been completed for all data, the process returns to step S50-2b, in which data is compared and matched against the previously generated clusters on the basis of the threshold; once training has been completed for all data, the data training method according to the present invention ends.
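The training loop of steps S50-2a through S50-2f can be sketched as a single pass over the data. The patent does not define how patterns are compared, so this sketch assumes numeric feature vectors and treats a datum as matching a cluster when its Euclidean distance to the cluster's representative datum is at most the threshold. The `Cluster` class and the `train` function are illustrative names, not part of the patent.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence

Vector = Sequence[float]


@dataclass
class Cluster:
    representative: Vector  # the datum that created this cluster
    members: List[Vector] = field(default_factory=list)


def train(data: Sequence[Vector], threshold: float) -> List[Cluster]:
    """One pass of threshold-based cluster generation.

    Assumption: similarity means Euclidean distance to the representative.
    """
    clusters: List[Cluster] = []
    for x in data:
        best, best_d = None, None
        # Compare the input with every existing cluster (S50-2b).
        for c in clusters:
            d = math.dist(x, c.representative)
            if d <= threshold and (best_d is None or d < best_d):
                best, best_d = c, d
        if best is not None:
            best.members.append(x)  # matched: join that cluster (S50-2d)
        else:
            # No match: the datum founds a new cluster and becomes its
            # representative (S50-2a for the first datum, S50-2e after).
            clusters.append(Cluster(representative=x, members=[x]))
    return clusters  # the loop ends once all data is processed (S50-2f)


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 9.0)]
clusters = train(points, threshold=1.0)
```

With this rule, a smaller threshold yields more clusters, and every comparison scans all existing clusters, which is why reducing the cluster count shortens later searches.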

FIG. 5 is a flowchart showing in detail the outlier cluster removal method within the data classification method of FIG. 3. As shown in the figure, the method begins by inspecting all the clusters whose generation has been completed in the training data module (S50-4a).

In inspecting all clusters, the number of data belonging to each completed cluster is counted (S50-4b), and it is checked whether the counted number is at or below a predetermined criterion, and whether the cluster can represent its data (S50-4c).

If the number of data belonging to the cluster exceeds the predetermined criterion, that is, the cluster is not of small weight, and the cluster represents its data and so retains its character as a cluster, the inspection of that cluster is complete. As each cluster is inspected, it is checked whether all clusters have been inspected (S50-4d). If so, the method ends; if any cluster has not yet been checked for being an outlier cluster, the process returns to the cluster inspection step (S50-4a).

Conversely, a cluster whose number of member data is at or below the predetermined criterion, that is, a cluster of small weight, or a cluster that fails to represent its data and so cannot retain its character as a cluster, is defined as an outlier cluster and deleted (S50-4e). Removing outlier clusters reduces the number of clusters, so that in the next classification, when the pattern of input data is compared with the previously generated clusters, searching and comparison take less time and processing speed increases.

Here, too, it is checked whether all clusters have been inspected (S50-4d). If so, the method ends; if any cluster has not yet been checked for being an outlier cluster, the process returns to the cluster inspection step (S50-4a).
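The inspection loop of steps S50-4a through S50-4e can be sketched as a filter over the generated clusters. The member-count test follows the text directly. The test for whether a cluster can represent its data is not specified in the patent, so it appears here only as an optional caller-supplied predicate. All names (`remove_outlier_clusters`, `min_count`, `is_representative`) are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def remove_outlier_clusters(
    clusters: Dict[str, List],
    min_count: int,
    is_representative: Optional[Callable[[str, List], bool]] = None,
) -> Dict[str, List]:
    """Return only the clusters that are not outlier clusters.

    A cluster is dropped when its member count is at or below `min_count`
    (S50-4b, S50-4c), or when the optional predicate says it cannot
    represent its data. The predicate is an assumption of this sketch.
    """
    kept: Dict[str, List] = {}
    for cluster_id, members in clusters.items():  # inspect each cluster (S50-4a)
        too_small = len(members) <= min_count
        unrepresentative = (
            is_representative is not None
            and not is_representative(cluster_id, members)
        )
        if too_small or unrepresentative:
            continue  # outlier cluster: delete it (S50-4e)
        kept[cluster_id] = members
    return kept  # all clusters inspected (S50-4d)


clusters = {"c1": [1, 2, 3], "c2": [4], "c3": [5, 6]}
kept = remove_outlier_clusters(clusters, min_count=1)
```

Because later inputs are compared against every surviving cluster, each removed cluster saves one comparison per input in the next classification.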

FIG. 6 is a conceptual diagram schematically showing the recognition method that groups clusters, within the data classification method of FIG. 3. As shown in the figure, in the cluster-grouping recognition method according to the present invention, the number of clusters is determined by the threshold as the classification operation proceeds.

Classification is then performed by grouping the generated clusters. The generated clusters are grouped into two classification groups, A and B. Case 1 and Case 2, in which cluster units within a group are not in their normal classification position, are misclassifications: some units are located in a different cluster.

Here, Group A Outlier and Group B Outlier, which are clusters containing only a few units, do not qualify as clusters and are therefore removed.

In addition, since the mass data has already been classified through the clusters, various classification results can be obtained depending on how groups are assigned to the generated clusters.

Thus, for a different classification, there is no need to restart the training and identification process suited to that classification; the desired result can be obtained simply by grouping the clusters differently.
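The regrouping idea can be illustrated with two lookup tables: one records which cluster each datum fell into during training, and the other assigns clusters to groups. Changing the classification then means replacing only the second table. This is a sketch; the function `classify_by_group` and the item, cluster, and group labels are hypothetical.

```python
from typing import Dict


def classify_by_group(
    assignments: Dict[str, str], grouping: Dict[str, str]
) -> Dict[str, str]:
    """Map each data item to a group through its cluster.

    Items whose cluster has no group (for example, a removed outlier
    cluster) are left out of the result.
    """
    return {
        item: grouping[cluster]
        for item, cluster in assignments.items()
        if cluster in grouping
    }


# Cluster membership produced once by training.
assignments = {"x1": "c1", "x2": "c2", "x3": "c3"}

# Two different groupings of the same clusters, with no retraining.
first = classify_by_group(assignments, {"c1": "A", "c2": "A", "c3": "B"})
second = classify_by_group(assignments, {"c1": "A", "c2": "B", "c3": "B"})
```

The cluster memberships are computed once; each new grouping costs only a dictionary lookup per item.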

The preferred embodiments of the present invention have been described above by way of example. However, the scope of the present invention is not limited to these specific embodiments, and a person of ordinary skill in the art may make appropriate modifications within the scope set forth in the claims of the present invention.

As described above, the present invention generates and trains clusters so that mass data is classified, and removes from the generated clusters the outlier clusters, namely those that cannot represent the data or that have small weight because the number of data belonging to them is at or below a certain number. The resulting reduction in clusters shortens the processing time of data classification. By adjusting the threshold toward an optimum based on the completed classification results, the invention can approach the optimal recognition rate, which reduces computation time and thus shortens job processing time. And by dynamically assigning data classification jobs to grid resources, the invention makes it possible to operate on and process mass data while preventing a bottleneck at the central server, thereby increasing stability.
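The text says the threshold is adjusted on the basis of completed classification results but gives no adjustment rule. The following sketch assumes a simple hill-climbing search: starting from the initial threshold, it tries a step in each direction and keeps whichever candidate scores best, stopping when neither neighbor improves. The function `adjust_threshold`, its parameters, and the scoring callback `evaluate` (standing in for a recognition-rate measurement) are hypothetical.

```python
from typing import Callable


def adjust_threshold(
    initial: float,
    evaluate: Callable[[float], float],
    step: float = 0.1,
    max_rounds: int = 20,
) -> float:
    """Hill-climb the threshold toward a higher evaluation score.

    Assumption: `evaluate(t)` runs a classification with threshold `t`
    and returns a score such as the recognition rate (higher is better).
    """
    current = initial
    best_t, best_score = initial, evaluate(initial)
    for _ in range(max_rounds):
        improved = False
        for candidate in (current - step, current + step):
            if candidate <= 0:
                continue  # keep the threshold positive
            score = evaluate(candidate)
            if score > best_score:
                best_t, best_score = candidate, score
                improved = True
        if not improved:
            break  # neither neighbor is better: stop adjusting
        current = best_t
    return best_t


# Toy score that peaks at 0.5, standing in for a measured recognition rate.
tuned = adjust_threshold(0.2, lambda t: -(t - 0.5) ** 2)
```

Each call to `evaluate` corresponds to one full classification round, so the number of rounds is bounded by `max_rounds` to keep the total cost predictable.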

Claims

A grid-based mass data classification device comprising:

a data distribution unit including a resource management module that classifies the available grid resources to be used for classifying mass data, a threshold adjustment module that initializes a threshold for classifying the data, has the data classified, and adjusts the threshold according to the result of the classified data, and a result test module that inspects the result of the data whose classification has been completed under the threshold adjustment module and delivers that result;

a data mediator including a plurality of resource mediation modules that transmit the grid resources and the threshold, merge the classified data, and transmit the merged data to the result test module; and

a plurality of data classification units that receive the grid resources and the threshold, classify the mass data into clusters and train them, remove outlier clusters, and transmit the result to the data mediator.

The device of claim 1,

wherein the resource management module collects available grid resources and classifies them according to storage space and throughput.

The device of claim 1,

wherein the outlier cluster is, among the generated clusters, a cluster that cannot represent the data or a cluster in which the weight of the included data is not large.

The device of claim 1,

wherein the resource mediation module comprises:

an information transmitter that transmits the grid resources and the threshold to the data classification units; and

a result integrator that merges the data training results, classification results, and cluster information output from the plurality of data classification units and transmits the merged data to the result test module.

The device of claim 1,

wherein the data classification unit comprises:

a training data management module that inputs and manages the data;

a data training module that uses the grid resources to generate and train clusters so that the data is classified based on the threshold;

a cluster management module that manages and updates the representative data forming the clusters; and

a cluster removal module that removes outlier clusters from the generated clusters.

The device of claim 5,

wherein the cluster removal module inspects all generated clusters and removes outlier clusters in which the number of data belonging to the cluster is at or below a predetermined criterion or which cannot represent the data.

A grid-based mass data classification method comprising:

classifying grid resources collected for mass data classification according to storage space and throughput, and initializing a threshold;

dynamically assigning jobs to the classified grid resources, generating and training clusters so that data is classified based on the threshold, inspecting all clusters, removing outlier clusters, and transmitting the result; and

examining the result of merging the data training, classification, and cluster information, and adjusting the threshold so as to derive an optimal threshold.

The method of claim 7,

wherein the outlier cluster is, among all generated clusters, a cluster that cannot represent the data or a cluster of small weight in which the number of data belonging to the cluster is at or below a predetermined criterion.

The method of claim 7,

wherein the data classification comprises:

inputting and managing the mass data;

generating and training clusters so that the data is classified based on the threshold transmitted to the dynamically allocated grid resources;

managing and updating the representative data that creates the clusters; and

inspecting all generated clusters and removing outlier clusters.

The method of claim 7,

wherein the data training comprises:

generating a cluster from the mass data based on the threshold;

comparing the pattern of input data with the generated cluster and matching them on the basis of the threshold;

if the pattern of the input data is included in the pattern of the generated cluster, assigning the input data to the generated cluster;

if the pattern of the input data is not included in the pattern of the generated cluster, generating a cluster according to the new pattern; and

designating the data that created the cluster as the representative data and updating it.

The method of claim 7,

wherein the outlier cluster removal comprises:

inspecting all generated clusters, and removing outlier clusters in which the number of data belonging to the cluster is at or below a predetermined criterion or which cannot represent the data.