KR20230099450A

KR20230099450A - Algorithm for Distributed Parallel Processing of Multi-grid DC distribution Big Data

Info

Publication number: KR20230099450A
Application number: KR1020210188802A
Authority: KR
Inventors: 박정희
Original assignee: 주식회사 엘시스
Priority date: 2021-12-27
Filing date: 2021-12-27
Publication date: 2023-07-04

Abstract

A big data analysis and processing apparatus using the Algorithm for Distributed Parallel Processing of Multi-grid DC distribution Big Data may include: a plurality of map modules, each configured to be assigned one of a plurality of input subpopulations obtained by dividing population data, to obtain an output subpopulation including offspring data by performing a distributed parallel processing algorithm operation on the assigned input subpopulation based on fitness values for a predetermined problem, and to repeat the distributed parallel processing algorithm operation a plurality of times using the output subpopulation as the input subpopulation; a reduce module configured to receive the output subpopulation from each of the plurality of map modules after the repeated operations have finished, and to generate output data from the received output subpopulations based on the fitness values; and an output module configured to output the output data.

Description

Algorithm for Distributed Parallel Processing of Multi-grid DC distribution Big Data

Embodiments relate to a data analysis and processing apparatus and a data analysis and processing method that use an algorithm for distributed parallel processing of multi-circuit DC distribution network big data (Algorithm for Distributed Parallel Processing of Multi-grid DC distribution Big Data). More specifically, they relate to an apparatus and method based on a MapReduce distributed parallel processing algorithm whose processing and analysis performance is improved by reducing data input/output through an improved map and reduce process.

Big data refers to data so large in scale that it cannot be processed, collected, stored, explored, or analyzed with commonly used software tools. Big data analysis has become a concern not only in engineering and science but in every field that collects and manages data, including business and government. The main issue in big data is coping effectively with data volume, variety, and processing speed. Hadoop is a well-known open platform for storing, managing, processing, and analyzing big data. Processing speed is a very important factor in big data, and Hadoop's MapReduce technique is used in various forms as a parallel distributed programming model for efficiently processing, searching, and analyzing big data.

Meanwhile, a distributed parallel processing algorithm is an algorithm that obtains an optimal solution to a specific mathematical and/or logical problem by repeatedly applying parallel operators, consisting of selection, crossover, and mutation, to candidate solutions. It is an optimization technique that searches for the optimal solution based on the principle of survival of the fittest, which underlies the evolutionary process in nature. Because the algorithm searches simultaneously from a population of many points rather than from a single point, it can find a global optimum rather than a local optimum. Because of this advantage, parallel algorithms are combined with fuzzy systems, neural networks, and the like and applied in various fields.

According to one aspect of the present invention, it is possible to provide a big data analysis and processing apparatus and method that improve the performance of a conventional distributed parallel processing algorithm using MapReduce and that use a new distributed parallel processing algorithm capable of large-scale distributed processing of multi-circuit DC distribution network big data.

An apparatus for analyzing and processing multi-circuit DC distribution network big data using a distributed parallel processing algorithm according to an embodiment may include: a plurality of map modules, each configured to be assigned one of a plurality of input subpopulations obtained by dividing population data, to obtain an output subpopulation including offspring data by performing a distributed parallel processing algorithm operation on the assigned input subpopulation based on fitness values for a predetermined problem, and to repeat the distributed parallel processing algorithm operation a plurality of times using the output subpopulation as the input subpopulation; a reduce module configured to receive the output subpopulation from each of the plurality of map modules after the repeated operations have finished, and to generate output data from the received output subpopulations based on the fitness values; and an output module configured to output the output data.

A data analysis and processing method using MRPGA according to an embodiment includes: dividing population data into a plurality of input subpopulations; assigning the input subpopulations to a plurality of map modules, respectively; obtaining, in each of the map modules, an output subpopulation including offspring data by performing a distributed parallel processing algorithm operation on the input subpopulation based on fitness values for a predetermined problem; receiving, in a reduce module, the output subpopulation obtained from each of the map modules and determining output data from the output subpopulations based on the fitness values; and outputting the output data. The step of obtaining the output subpopulation may be repeated a plurality of times, using the output subpopulation obtained in each map module as the input subpopulation.

A computer-readable storage medium according to an embodiment may store instructions that, when executed by a computer, cause the computer to perform the data analysis and processing method using MRPGA.

By improving the map and reduce process of a MapReduce-based distributed parallel processing algorithm, the data analysis and processing apparatus and method according to one aspect of the present invention converge faster than the conventional simple distributed parallel algorithm and can perform parallel processing efficiently. In addition, the apparatus and method can quickly find a good optimal solution depending on the number of sub-generation iterations, and they perform well in processing big data.

FIG. 1 is a schematic diagram showing the operational flow of the Hadoop Distributed File System.
FIG. 2 is a schematic diagram showing data processing by the MapReduce technique.
FIG. 3 is a schematic diagram showing the data flow of a simple algorithm (SGA) using conventional MapReduce.
FIG. 4 is a schematic block diagram of a data analysis and processing apparatus using the distributed parallel processing algorithm (MRPGA) according to an embodiment.

A data analysis and processing apparatus and a data analysis and processing method according to embodiments of the present invention are configured to analyze and process data using a MapReduce distributed parallel processing algorithm (MRPGA), which applies the MapReduce technique of the Hadoop environment to a distributed parallel processing algorithm (Parallel Algorithm; PGA) in a new way.

Hadoop is MapReduce-based open source software middleware that is used as a cloud computing platform by many companies, including Yahoo, Facebook, Amazon, IBM, and NexR. It consists mainly of HDFS (Hadoop Distributed File System), a distributed file system that supports the distributed computing framework, and MapReduce, a distributed programming model.

In the Hadoop environment, a MapReduce application consists of jobs, the unit of work that a client wants performed. A job consists of input data, a MapReduce program, and configuration information, and it is executed by dividing it into map tasks and reduce tasks. One job tracker and multiple task trackers may be used to control job execution. The job tracker coordinates all jobs in the system by scheduling the tasks that the task trackers run. Each task tracker runs its tasks and sends the overall result of each job to the job tracker. If a task fails, the job tracker reschedules it on a different task tracker.

FIG. 1 is a schematic diagram showing the operational flow of the Hadoop Distributed File System (HDFS) used in embodiments of the present invention.

Referring to FIG. 1, in HDFS the namenode maintains the metadata of the data blocks that make up a file and provides block location information to clients through block mapping. For example, the metadata may include file information such as the storage location of a file stored on a datanode (e.g., home/foo/data) and the number of replicas. A client may store metadata information in the namenode or request a specific data block from the namenode. To perform the actual operation on the file, the client then accesses the block on a specific datanode within a rack of datanodes and reads, writes, or replicates the data block.
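The division of labor described above, in which the namenode holds only metadata while the datanodes hold the blocks, can be illustrated with a toy in-memory model. This is a minimal sketch and not the HDFS API: the classes `NameNode` and `DataNode`, their methods, and the block ids are all hypothetical names chosen for illustration.

```python
class DataNode:
    """Hypothetical datanode: stores raw blocks keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Hypothetical namenode: keeps only metadata (path -> block locations)."""

    def __init__(self):
        self.block_map = {}  # path -> ordered list of (block_id, datanode)

    def register(self, path, block_id, datanode):
        self.block_map.setdefault(path, []).append((block_id, datanode))

    def locate(self, path):
        return list(self.block_map[path])


def read_file(namenode, path):
    # Step 1: ask the namenode where the blocks are (metadata only).
    # Step 2: fetch each block directly from the datanode that holds it.
    return b"".join(dn.read_block(bid) for bid, dn in namenode.locate(path))


nn = NameNode()
dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.write_block("blk_0", b"hello ")
dn2.write_block("blk_1", b"world")
nn.register("home/foo/data", "blk_0", dn1)
nn.register("home/foo/data", "blk_1", dn2)
content = read_file(nn, "home/foo/data")
```

The point of the sketch is that file contents never pass through the namenode; it only answers the location query.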

A Hadoop cluster is divided into a master node, which combines the namenode and the job tracker, and slave nodes, each of which includes a task tracker and a datanode. In Hadoop, the RPC (Remote Procedure Call) protocol is used for control signals between the master node and the slave nodes, and RPC is also used for communication between the master and clients. Datanodes and clients transfer data over TCP sockets. Finally, during the MapReduce process the task trackers pass the results of map tasks to reduce tasks, and Hadoop uses HTTP for this communication.

A distributed file system such as HDFS is a client/server-based file system in which a client can access data stored on a server and process it as if it were stored locally. Unlike conventional distributed file systems, such a system must be very large and high-performance. It must therefore cope efficiently with continuous data growth, and it needs a mechanism that can handle frequent failures.

The MapReduce technique used in MRPGA according to the embodiments is a technique for distributed processing of large volumes of data in a large-scale cluster environment such as HDFS. Existing parallel processing models such as MPI (Message Passing Interface) are well suited to fields that require high-performance computing, but they have many problems when applied to large-scale data processing. MapReduce was designed with requirements such as scalability with data volume and minimal network traffic when moving data between nodes in mind, and it is now widely used in large-scale data processing. By performing large-scale distributed processing with MapReduce, the embodiments can be useful in big data environments, for example in fields suited to large-scale data processing such as data warehousing, data mining, and information retrieval.

FIG. 2 is a schematic diagram showing data processing by the MapReduce technique.

Referring to FIG. 2, input data on the namenode may be split into a plurality of pieces by the map function and distributed across a plurality of datanodes. For example, each datanode may be the local storage of an individual computer. The intermediate data distributed to each computer is sorted, then collected by the reduce function and written as output data. The map and reduce functions process and store data as (key, value) pairs, similar to a hash data structure. This makes data easy to distribute and collect, and data can be grouped and sorted by identical keys.
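The (key, value) flow described above can be sketched in a few lines of single-process Python. This is an illustration of the programming model only, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_mapreduce`) and the word-count example are assumptions introduced here.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every input record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def shuffle(pairs):
    """Group intermediate values by key and sort the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups, reduce_fn):
    """Collapse each group of values into one output value per key."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def run_mapreduce(records, map_fn, reduce_fn):
    return reduce_phase(shuffle(map_phase(records, map_fn)), reduce_fn)


# Illustrative job: count how often each word occurs.
def word_map(line):
    return [(word, 1) for word in line.split()]


def word_reduce(key, values):
    return sum(values)


counts = run_mapreduce(["a b a", "b c"], word_map, word_reduce)
```

In a real cluster the map calls run on different nodes and the shuffle moves data over the network; here everything runs in one process so the grouping by key is easy to see.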

FIG. 3 is a schematic diagram showing the data flow of a simple algorithm (SGA) using conventional MapReduce, which is based on a selecto-recombinative algorithm.

The conventional SGA using MapReduce presented an initial model of the algorithm adapted to fit map and reduce. Referring to FIG. 3, the input data is first split and assigned to multiple map functions, and each map function computes the fitness values of its data for a specific problem and outputs the data with the best fitness in (key, value) form. For example, the key output by the map function may be the key of an individual data set, and the value may be the fitness value of each datum in that data set. Next, offspring data is generated by mixing the key-value pairs obtained from each map function through selection and crossover using parallel operators. The reduce function generates output data by gathering offspring data that share the same key.

One generation ends with each pass through the MapReduce process, and the output data produced by the reduce step becomes the input data of the next generation. The MapReduce process is repeated for a predetermined number of generations (that is, iterations), and at every generation change the map and reduce functions cause data input/output against the HDFS-based database (DB).

In the conventional SGA using MapReduce described above, the map function only receives data, calculates fitness values, and stores the best global solution, while the parallel algorithm operations are distributed through the reduce function. The drawback of this structure is that data input/output occurs in every generation. That is, the map function reads population data from the HDFS-based database, processes it, and passes it to the reduce function; the reduce function then performs its work and writes the resulting data (here, data refers to the individual values that make up the population) back to the HDFS-based DB. In the next generation, the map function again reads the population stored in the HDFS-based DB and processes it.
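The per-generation I/O cost of the conventional scheme can be made concrete with a small counting model. This is a sketch under stated assumptions: `CountingStore` is a hypothetical stand-in for the HDFS-based DB, and the `evolve` callback is a placeholder for the selection and crossover work, which is not modeled here.

```python
class CountingStore:
    """Hypothetical stand-in for the HDFS-based DB that counts accesses."""

    def __init__(self, population):
        self._population = list(population)
        self.reads = 0
        self.writes = 0

    def read(self):
        self.reads += 1
        return list(self._population)

    def write(self, population):
        self.writes += 1
        self._population = list(population)


def conventional_run(store, generations, evolve):
    """One full MapReduce pass per generation, as in the conventional SGA."""
    for _ in range(generations):
        population = store.read()       # map side: load the population from the DB
        offspring = evolve(population)  # placeholder for selection and crossover
        store.write(offspring)          # reduce side: write the population back


store = CountingStore([[0, 1, 0], [1, 1, 0]])
conventional_run(store, generations=50, evolve=lambda pop: pop)
```

With this model, the number of reads and the number of writes both grow linearly with the number of generations, which is the overhead the passage above identifies.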

To improve on this point, one aspect of the present invention proposes MRPGA, a new PGA technique using MapReduce. FIG. 4 is a schematic block diagram of a data analysis and processing apparatus using MRPGA according to an embodiment.

Referring to FIG. 4, the data analysis and processing apparatus according to this embodiment may include an HDFS-based database (DB) 10, a plurality of map modules 21, 22, 23, ..., 2n, a reduce module 30, and an output module 40. In this specification, terms such as "database", "unit", "module", "device", and "system" are intended to refer to a combination of hardware and the software driven by that hardware. For example, the hardware may be a data processing device including a CPU or another processor. The software driven by the hardware may refer to a running process, an object, an executable, a thread of execution, a program, and the like.

In the block diagram of FIG. 4, the DB 10, the map modules 21, 22, 23, ..., 2n, the reduce module 30, and the output module 40 are each drawn as separate rectangles, but this merely divides the apparatus functionally according to the operations it performs. That is, some or all of the DB 10, the map modules 21, 22, 23, ..., 2n, the reduce module 30, and the output module 40 may be integrated into a single device, or each of them may be implemented as a physically distinct device that operates by communicating over a network.

The MRPGA used in this embodiment applies and improves the selecto-recombinative parallel algorithm used in the conventional SGA method described above with reference to FIG. 3. It proposes a map structure in the form of a local search that combines selection and crossover, and it applies mutation using the MapReduce technique while passing the best individual data to the reduce module 30. The detailed process of MRPGA according to this embodiment is as follows.

First, as an initialization step, population data may be generated by expressing the data to be analyzed as solutions to a given problem, and input data may be constructed by calculating the fitness value of each solution in the population for the given problem and then stored in the HDFS-based DB 10. Here, a solution refers to data or a set of data that can be an answer to a given mathematical and/or logical problem. The specific content and form of the problems and solutions involved in data analysis according to the embodiments are not limited to any one kind and may differ from case to case.

Next, the population may be divided into a plurality of subpopulations, which are assigned as input subpopulations to the map modules 21, 22, 23, ..., 2n, one each. That is, the total size of the population equals the number of map modules 21, 22, 23, ..., 2n multiplied by the size of each subpopulation. Each map module 21, 22, 23, ..., 2n takes each solution in its assigned input subpopulation as the key and the fitness value of that solution as the value, and may perform parallel algorithm operations on the input subpopulation.
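The partitioning step and the (solution, fitness) key-value representation can be sketched as follows. The helper names `split_population` and `to_key_value` are hypothetical, and using the bit count (`sum`) as the fitness function is an assumption made only so the example has something to evaluate.

```python
def split_population(population, num_map_modules):
    """Divide the population into equally sized input subpopulations."""
    if num_map_modules <= 0 or len(population) % num_map_modules != 0:
        raise ValueError("population size must be a multiple of the number of map modules")
    size = len(population) // num_map_modules
    return [population[i * size:(i + 1) * size] for i in range(num_map_modules)]


def to_key_value(subpopulation, fitness):
    """Represent each solution as a (key, value) pair: (solution, fitness value)."""
    return [(tuple(solution), fitness(solution)) for solution in subpopulation]


# Six 3-bit solutions split across three map modules (two solutions each).
population = [[0, 1, 1], [1, 1, 1], [0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]]
subpopulations = split_population(population, num_map_modules=3)
pairs = to_key_value(subpopulations[0], fitness=sum)
```

Solutions are converted to tuples because a key must be hashable; the equal-size check mirrors the statement that the population size is the number of map modules times the subpopulation size.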

In this specification, a distributed parallel processing algorithm operation refers to a computation that searches for the solution with the best fitness value for a given problem by performing selection, crossover, and mutation among solutions based on the principle of survival of the fittest. In a parallel algorithm, selection means choosing solutions whose fitness values for the given problem are relatively high, and crossover means exchanging part of the data contained in individual solutions between those solutions. Specific methods for distributed parallel processing algorithm operations, including selection and crossover, are well known in the technical field of the present invention, so a detailed description is omitted. For example, in one embodiment tournament selection may be used as the selection method and uniform crossover as the crossover method, but the embodiments are not limited to these.
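Tournament selection and uniform crossover, the two operators named as examples above, can be sketched as follows. This is a generic textbook form rather than the patent's specific implementation; the tournament size `k`, the swap probability `swap_prob`, and the use of `sum` as the fitness function are assumptions.

```python
import random


def tournament_select(population, fitness, k, rng):
    """Pick k solutions at random and return the fittest of them."""
    contestants = rng.sample(population, k)
    return max(contestants, key=fitness)


def uniform_crossover(parent_a, parent_b, rng, swap_prob=0.5):
    """Swap each position between the two parents independently with swap_prob."""
    child_a, child_b = list(parent_a), list(parent_b)
    for i in range(len(child_a)):
        if rng.random() < swap_prob:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b


rng = random.Random(0)  # seeded so the run is repeatable
population = [[0, 0, 0, 0], [1, 0, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1]]
parent_a = tournament_select(population, sum, k=2, rng=rng)
parent_b = tournament_select(population, sum, k=2, rng=rng)
child_a, child_b = uniform_crossover(parent_a, parent_b, rng)
```

Uniform crossover only exchanges values between the parents at the same position, so at every index the two children together carry exactly the two values the parents had there.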

In each map module 21, 22, 23, ..., 2n, solutions with relatively high fitness values for the given problem are selected from among all solutions in the assigned subpopulation, and crossover may be performed on the selected solutions. Offspring data is generated through this selection and crossover, and the offspring data may replace existing data in the subpopulation. The map module 21, 22, 23, ..., 2n may then recalculate fitness values only for the offspring data produced by selection and replacement.

In the embodiments, the map modules 21, 22, 23, ..., 2n may repeat the above process (selecting solutions with relatively high fitness values from the input subpopulation, performing crossover, and recalculating fitness values) for a predetermined number of sub-generations. That is, the output subpopulation that a map module 21, 22, 23, ..., 2n produces through selection and crossover, which includes the offspring data, becomes the input subpopulation of a map module 21, 22, 23, ..., 2n again, and this completes the computation for one sub-generation. The output subpopulation may be fed back as the input subpopulation to the same map module that produced it, or it may be fed as the input subpopulation to a different map module.
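The sub-generation loop inside each map module, followed by a single reduce step, can be sketched as follows. This is a minimal sketch under several assumptions that the text does not fix: the bit-count fitness, a tournament size of 2, replacing the two least-fit members with the two children, and picking the single fittest solution in the reduce step. Mutation is omitted, and the map modules run one after another here rather than in parallel. All function names are hypothetical.

```python
import random


def bit_count(bits):
    """Assumed fitness: number of 1 bits in the solution."""
    return sum(bits)


def tournament(scored, k, rng):
    """Return the solution with the best fitness among k random (solution, fitness) pairs."""
    return max(rng.sample(scored, k), key=lambda pair: pair[1])[0]


def uniform_crossover(a, b, rng):
    mask = [rng.random() < 0.5 for _ in a]
    child_a = [y if m else x for x, y, m in zip(a, b, mask)]
    child_b = [x if m else y for x, y, m in zip(a, b, mask)]
    return child_a, child_b


def map_module(subpopulation, sub_generations, rng, fitness=bit_count):
    """Evolve one subpopulation in memory for a fixed number of sub-generations."""
    scored = [(list(s), fitness(s)) for s in subpopulation]
    for _ in range(sub_generations):
        parent_a = tournament(scored, 2, rng)
        parent_b = tournament(scored, 2, rng)
        children = uniform_crossover(parent_a, parent_b, rng)
        # Assumed replacement rule: the two children replace the two least-fit
        # members, and only the children have their fitness recomputed.
        scored.sort(key=lambda pair: pair[1])
        scored[:2] = [(child, fitness(child)) for child in children]
    return scored  # output subpopulation as (solution, fitness) pairs


def reduce_module(outputs):
    """Merge all output subpopulations and return the fittest (solution, fitness) pair."""
    merged = [pair for output in outputs for pair in output]
    return max(merged, key=lambda pair: pair[1])


rng = random.Random(42)
population = [[rng.randint(0, 1) for _ in range(16)] for _ in range(20)]
subpops = [population[i::4] for i in range(4)]  # four map modules, five solutions each
outputs = [map_module(sp, sub_generations=30, rng=rng) for sp in subpops]
best_solution, best_fitness = reduce_module(outputs)
```

Because each output subpopulation is fed straight back in as the next input inside `map_module`, no storage access happens between sub-generations; the merged result is produced only once, after all sub-generations have finished.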

The inventors of the present invention conducted an experiment to evaluate the MRPGA method according to the embodiments. The experiment used Hadoop version 1.2.0 on hardware running Ubuntu 12.04. The CPU was an Intel Core(TM) i5 650 (3.20 GHz), and the machine had 3 GB of RAM.

The experiment ran Hadoop in pseudo-distributed mode, which sets up a virtual distributed environment on a single computing device, to perform the parallel processing.

The problem used to search for an optimal solution with the distributed parallel processing algorithm was the OneMax problem, a simple and widely used optimization problem. The problem is to find the value whose fitness value is maximal, where the fitness value F(x) is calculated as in Equation 1. In the standard formulation of OneMax, x = (x_1, ..., x_n) is a string of bits x_i ∈ {0, 1} and the fitness is the number of bits set to 1:

$$F(x) = \sum_{i=1}^{n} x_i \qquad (1)$$
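As a point of reference, the OneMax fitness in its standard textbook form simply counts the 1 bits in a binary string, so the maximum for a string of length n is n. The sketch below assumes that standard form; the function name `onemax` and the input validation are illustrative choices.

```python
def onemax(bits):
    """Standard OneMax fitness: the number of 1 bits in a binary string."""
    if any(bit not in (0, 1) for bit in bits):
        raise ValueError("OneMax is defined on strings of 0s and 1s")
    return sum(bits)


candidates = [[0, 0, 0, 0], [1, 0, 1, 1], [1, 1, 1, 1]]
best = max(candidates, key=onemax)
```

Because the optimum is known in advance (the all-ones string), OneMax is convenient for checking how quickly a search method converges.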

The data analysis and processing apparatus and method using MRPGA according to the embodiments of the present invention described above implement a PGA through the MapReduce technique in a distributed environment built on Hadoop, and they constitute a new MapReduce-based PGA method that improves on the drawbacks of conventional methods. With the apparatus and method, the speed of the PGA can be improved by reducing the number of data input/output operations against the HDFS DB, and the diversity of the data can be maintained through the distributed maps while the search regions are merged in a single reduce step. In addition, repeating the distributed parallel processing algorithm operation for each subpopulation improves PGA performance through local search, and efficient mutation is implemented using the characteristics of MapReduce.

The data analysis and processing method according to the embodiments described above may be implemented at least partially as a computer program and recorded on a computer-readable recording medium. A computer-readable recording medium on which a program implementing the method is recorded includes every kind of recording device that stores data readable by a computer. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, as well as media implemented in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium may also be distributed across computer systems connected by a network, so that computer-readable code is stored and executed in a distributed manner.

The present invention has been described above with reference to the embodiments shown in the drawings, but these are merely illustrative, and those of ordinary skill in the art will understand that various modifications and variations of the embodiments are possible from them.

Such modifications, however, should be regarded as falling within the technical protection scope of the present invention.

Accordingly, the true technical protection scope of the present invention should be determined by the technical spirit of the appended claims.

10: HDFS DB
21: map module 1
22: map module 2
23: map module 3
2n: map module 2n
30: reduce module
40: output module

Claims

1. A data analysis and processing apparatus comprising:
a plurality of map modules, each configured to be allocated one of a plurality of input subpopulations obtained by dividing population data, to obtain an output subpopulation including offspring data by performing, on the allocated input subpopulation, a distributed parallel processing algorithm operation based on a fitness value for a predetermined problem, and to repeat the distributed parallel processing algorithm operation a plurality of times using the output subpopulation as the input subpopulation;
a reduce module configured to receive the output subpopulation from each of the plurality of map modules after the repeated execution of the distributed parallel processing algorithm operation has finished, and to generate output data from the received output subpopulations based on the fitness value; and
an output module configured to output the output data.

2. The data analysis and processing apparatus of claim 1, wherein the output module is configured to output the output data generated by repeating the distributed parallel processing algorithm operation a predetermined number of times, with the output data input to the plurality of map modules as the population data.

3. The data analysis and processing apparatus of claim 1, further comprising a database, based on the Hadoop Distributed File System, configured to store the population data and the output data.

4. A data analysis and processing method comprising:
partitioning population data into a plurality of input subpopulations;
assigning the plurality of input subpopulations to a plurality of map modules, respectively;
obtaining an output subpopulation including offspring data by performing, in each of the plurality of map modules, a distributed parallel processing algorithm operation on the input subpopulation based on a fitness value for a predetermined problem;
receiving, in a reduce module, the output subpopulation obtained from each of the plurality of map modules, and determining output data from the output subpopulations based on the fitness value; and
outputting the output data,
wherein the obtaining of the output subpopulation is repeated a plurality of times using the output subpopulation obtained from each map module as the input subpopulation.

5. The method of claim 4, wherein the partitioning of the population data into a plurality of input subpopulations operates on data in a database based on the Hadoop Distributed File System.

6. The method of claim 5, further comprising recording the output data in the database as the population data, wherein the outputting of the output data comprises outputting the output data generated by repeating the distributed parallel processing algorithm operation a predetermined number of times using the output data as the population data.

7. A computer-readable recording medium storing instructions for performing, by a computing device, a data analysis and processing method comprising:
dividing, by the computing device, population data into a plurality of input subpopulations;
assigning each of the plurality of input subpopulations to a plurality of map modules of the computing device;
obtaining an output subpopulation including offspring data by performing, in each of the plurality of map modules, a distributed parallel processing algorithm operation on the input subpopulation based on a fitness value for a predetermined problem;
receiving, in a reduce module of the computing device, the output subpopulation obtained from each of the plurality of map modules, and determining output data from the output subpopulations based on the fitness value; and
outputting the output data,
wherein the obtaining of the output subpopulation is repeated a plurality of times using the output subpopulation obtained from each map module as the input subpopulation.