KR20130080721A

KR20130080721A - Host node and memory management method for cluster system based on parallel computing framework

Info

Publication number: KR20130080721A
Application number: KR1020120001689A
Authority: KR
Inventors: 이재진; 김정원
Original assignee: 서울대학교산학협력단
Priority date: 2012-01-05
Filing date: 2012-01-05
Publication date: 2013-07-15
Also published as: KR101332839B1

Abstract

PURPOSE: A host node and a memory management method are provided to reduce the bug incidence rate of an application by having a host node and a calculation node in a cluster environment execute an application of a parallel computing programming model. CONSTITUTION: A host thread (221) executes a host program for a parallel computing framework. The host thread inserts a command related to a kernel program for the parallel computing framework into a command queue. A command scheduler (222) schedules the command queue and selects a command to be executed. The command scheduler generates a request message requesting execution of the command. A memory management unit (280) determines whether a first calculation module, which is to execute the selected command, has a buffer object used by the selected command. According to the result, the memory management unit copies the buffer object of a second calculation module, different from the first calculation module, to the first calculation module. [Reference numerals] (220) Host node; (241a) Calculation module #1; (241b) Calculation module #2; (241c) Calculation module #3; (241d) Calculation module #4; (AA,240) Calculation node; (BB,CC,DD,EE) Calculation module

Description

Host node and memory management method for cluster system based on parallel computing framework

The following description relates to parallel computing framework technology.

Parallel computing is a method of computation in which many calculations are performed simultaneously. It is mainly used to divide a large, complex problem into smaller parts and solve them at the same time. There are several forms of parallel computing, for example bit-level, instruction-level, data, and task parallelism. Parallel computing has long been used mainly in high-performance computing, and it has drawn more attention as processor clock frequencies have approached their physical limits. More recently, with growing concern about heat generation and power consumption, it has become a dominant paradigm in computer architecture, centered on multi-core processors.

A representative framework for parallel computing is OpenCL. OpenCL is a framework for writing programs with data and task parallelism that run on heterogeneous platforms made up of multiple CPUs, GPUs, and other processors. OpenCL includes a new language based on C99 for writing kernels, together with APIs for defining and controlling the heterogeneous platform. OpenCL provides task-based and data-based parallel computing.

OpenCL is typically intended for single-node systems. Recently, however, cluster systems with many nodes have attracted attention, and OpenCL does not operate in such a cluster environment. To write an OpenCL application for a cluster, the user must directly use a message-passing programming library for the network, which makes the application harder to write. Furthermore, an OpenCL application to which the user has added such code may no longer run on a single-node system.

A host node of a cluster system on which a parallel computing framework can operate effectively, and a memory management method for such a system, are provided.

A host node according to an aspect of the present invention may include: a host thread that executes a host program for a parallel computing framework and inserts a command related to a kernel program for the parallel computing framework into a command queue; a command scheduler that schedules the command queue to select a command to be executed, generates a request message requesting execution of the selected command, and transmits the generated request message over a network to a calculation node that includes the calculation module that is to execute the command; and a memory management unit that determines whether a first calculation module, which is the calculation module that is to execute the selected command, has a buffer object used by the selected command and, according to the result of the determination, copies the buffer object of a second calculation module, which is a calculation module different from the first calculation module, to the first calculation module.

A memory management method according to an aspect of the present invention may include: executing a host program for a parallel computing framework and inserting a command related to a kernel program for the parallel computing framework into a command queue; scheduling the command queue to select a command to be executed; determining whether a first calculation module, which is the calculation module that is to execute the selected command, has a buffer object used by the selected command and, according to the result of the determination, copying the buffer object of a second calculation module, which is a calculation module different from the first calculation module, to the first calculation module; and generating a request message requesting execution of the selected command and transmitting the generated request message over a network to the calculation node that includes the first calculation module.

Because the host node and the calculation nodes of the cluster environment execute an application of the parallel computing programming model by exchanging messages, the user no longer has to add code such as MPI (message passing interface) calls directly to the application. This reduces the incidence of bugs in the application and improves programmer productivity and the portability of the parallel computing framework.

In addition, in a cluster environment based on a parallel computing framework, a calculation module list is used to determine in advance whether the calculation module that is to execute a command holds the buffer object used by that command, and the buffer object of another calculation module is copied to that calculation module accordingly, so that efficient memory management can be ensured.

FIG. 1 illustrates a configuration of a cluster system according to an embodiment of the present invention.
FIG. 2 illustrates a configuration of a host node and a calculation node of a cluster system according to an embodiment of the present invention.
FIG. 3 illustrates a memory management method of a cluster system according to an embodiment of the present invention.

Hereinafter, specific examples for carrying out the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 illustrates a configuration of a cluster system according to an embodiment of the present invention.

Referring to FIG. 1, the cluster system 100 according to the present embodiment includes a host node 101 and a plurality of calculation nodes 102a, 102b, 102n. The host node 101 and each of the calculation nodes 102a, 102b, 102n are connected through a network 103. The network 103 may be a wired or wireless network. The host node 101 and the calculation nodes 102a, 102b, 102n can exchange information through network messages. A network message may take the form of, for example, a LAN, IPv4, or IPv6 packet.

The cluster system 100 is built on a parallel computing framework. In other words, the parallel computing framework according to the present embodiment can operate on the cluster system 100. OpenCL, OpenMP, CUDA, or the like may be used as the parallel computing framework of the cluster system 100.

An application for the parallel computing framework of the cluster system 100 may consist of a host program and a kernel program. The host program is the part that manages the kernel program so that the kernel program is executed, and the kernel program is the part in which data operations are processed. According to one aspect, the host program may be executed on the host node 101 and the kernel program may be executed on the calculation nodes 102a, 102b, 102n.

The host node 101 executes the host program. The host node 101 may manage execution of the kernel program by executing the host program of an application written with a parallel computing programming tool. For example, the host node 101 can use a command related to the kernel program to generate a request message requesting execution of that command, and transmit the generated request message to the calculation nodes 102a, 102b, 102n through the network 103.

The calculation nodes 102a, 102b, 102n execute the kernel program. A calculation node (e.g., 102a) that has received a request message requesting command execution from the host node 101 can execute the command according to the request message, generate a completion message indicating that execution of the command is complete, and transmit the generated completion message to the host node 101 through the network 103.

FIG. 2 illustrates a configuration of a host node and a calculation node of a cluster system according to an embodiment of the present invention.

Referring to FIG. 2, the host node 220 may include a host thread 221, a command scheduler 222, command queues 223a to 223f, an issue queue 224, and a completion queue 225.

The host thread 221 executes the host program.

The host thread 221 inserts commands into the command queues 223a to 223f. Each of the command queues 223a to 223f may correspond to one of the calculation modules 241a, 241b, 241c, 241d of the calculation nodes 240, 240b. For example, the first command queue 223a in the first command queue group 226 may be mapped to calculation module #1 241a in the first calculation node 240. A command may be any of various operations related to the kernel program to be executed on the calculation node 240. In other words, the host thread 221 can determine the calculation node 240 and calculation module 241a that are to execute a command related to the kernel program, and put the command into the command queue 223a corresponding to the determined calculation node 240 and calculation module 241a.

The command scheduler 222 schedules the command queues 223a to 223f. A round-robin scheme may be used as the scheduling policy, although various scheduling policies may of course be applied depending on the application. For example, the command scheduler 222 may select a command to be executed according to a predetermined scheduling policy and take the selected command out of the command queues 223a to 223f.
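The round-robin policy mentioned above can be sketched as follows. This is only an illustration under assumptions: the class name `RoundRobinScheduler`, the use of `collections.deque` for each command queue, and the string command names are all hypothetical and do not come from the patent.

```python
from collections import deque


class RoundRobinScheduler:
    """Cycles over per-module command queues, taking one command per turn."""

    def __init__(self, queues):
        # queues: dict mapping a queue id (e.g. "223a") to a deque of commands
        self.queues = queues
        self.order = list(queues)
        self.next_index = 0

    def select(self):
        """Return (queue_id, command) from the next non-empty queue, or None."""
        for offset in range(len(self.order)):
            index = (self.next_index + offset) % len(self.order)
            queue_id = self.order[index]
            if self.queues[queue_id]:
                # Resume after this queue next time so that no queue is starved.
                self.next_index = (index + 1) % len(self.order)
                return queue_id, self.queues[queue_id].popleft()
        return None


queues = {"223a": deque(["k1", "k2"]), "223b": deque(), "223c": deque(["k3"])}
scheduler = RoundRobinScheduler(queues)
picked = [scheduler.select() for _ in range(4)]
```

Empty queues are skipped, and `select` returns `None` once every queue has been drained.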

The command scheduler 222 generates a request message using the selected command. A request message is a packet-type message transmitted over the network that requests execution of the selected command. For example, the command scheduler 222 may generate packet data having an address field, which identifies the calculation node 240 and calculation module 241a mapped to the command queue 223a of the selected command, and a description field for the command.
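As a rough sketch of such a packet, the snippet below packs an address part (node and module identifiers) followed by a command description. The wire layout is an assumption made for illustration: the patent does not specify field widths or encodings, and `HEADER`, `encode_request`, `decode_request`, and the JSON description are hypothetical.

```python
import json
import struct

# Hypothetical header: node id, module id, payload length, in network byte order.
HEADER = struct.Struct("!HHI")


def encode_request(node_id, module_id, description):
    """Build a request packet: address fields followed by the command description."""
    payload = json.dumps(description, sort_keys=True).encode("utf-8")
    return HEADER.pack(node_id, module_id, len(payload)) + payload


def decode_request(packet):
    """Split a request packet back into (node_id, module_id, description)."""
    node_id, module_id, length = HEADER.unpack_from(packet)
    payload = packet[HEADER.size:HEADER.size + length]
    return node_id, module_id, json.loads(payload.decode("utf-8"))


packet = encode_request(240, 1, {"kernel": "vec_add", "buffers": ["A"]})
decoded = decode_request(packet)
```

A length-prefixed payload keeps message boundaries recoverable when several requests share one stream connection.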

The command scheduler 222 transmits the generated request message to the calculation node 240 through the network 260.

The memory management unit 280 determines whether the calculation module that is to execute the command selected by the command scheduler 222 (hereinafter, the 'executing calculation module') (e.g., 241a) has the buffer object used by the selected command (hereinafter, the 'required buffer object').

A buffer object may be defined as an abstracted memory area used by an application for the parallel computing framework. A buffer object is not tied to a particular memory and can be mapped to several memories. It is therefore possible for several calculation modules 241a, 241b, 241c, 241d to share one buffer object. The required buffer object is the buffer object that the executing calculation module needs in order to execute the command.

The memory management unit 280 may use a calculation module list to determine whether the executing calculation module 241a holds the required buffer object. A calculation module list is kept for each buffer object and lists the calculation modules that hold the latest data of that buffer object. The memory management unit 280 may determine whether the executing calculation module holds the required buffer object according to whether the executing calculation module 241a is present in the calculation module list of the required buffer object.

If the determination shows that the executing calculation module 241a does not hold the required buffer object, the memory management unit 280 may find another calculation module (e.g., 241b) that holds the required buffer object, copy the required buffer object from the found calculation module 241b, and write it to the memory of the executing calculation module 241a. The other calculation module 241b holding the required buffer object may be any one of the calculation modules present in the calculation module list of the required buffer object.
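The check-then-copy behavior can be sketched as below. Module memories are simulated with plain dictionaries, and the names `ensure_buffer`, `module_lists`, and `memories` are hypothetical. Taking the first listed module as the copy source is an arbitrary choice for the sketch, since any listed module holds the latest data.

```python
def ensure_buffer(buffer_name, executing_module, module_lists, memories):
    """Copy buffer_name into executing_module's memory if it is not a holder.

    Returns the module the data was copied from, or None if no copy was needed.
    """
    holders = module_lists[buffer_name]
    if executing_module in holders:
        return None
    if not holders:
        raise LookupError(f"no module holds buffer {buffer_name!r}")
    source = holders[0]  # any module in the list is a valid source
    memories[executing_module][buffer_name] = bytes(memories[source][buffer_name])
    return source


module_lists = {"A": ["241b", "241c"]}
memories = {"241a": {}, "241b": {"A": b"\x01\x02"}, "241c": {"A": b"\x01\x02"}}
source = ensure_buffer("A", "241a", module_lists, memories)
no_copy = ensure_buffer("A", "241b", module_lists, memories)
```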

According to one aspect, if the calculation module list contains a plurality of calculation modules, the calculation module from which the data is copied may be selected in the following order of priority.

1. CPU on the same node

2. GPU on the same node

3. CPU on another node

4. GPU on another node

This is because data can be moved over the network only between the main memories of nodes. To move data from the GPU of one node to the GPU of another node, the data is first copied from the source GPU memory to the main memory of its node and then copied over the network to the main memory of the other node. The data must then be copied from the main memory of the receiving node to its GPU memory. To reduce this additional copy cost, it is preferable to give priority to CPU calculation devices.
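The four-level priority above can be expressed as a ranking function. The sketch assumes each calculation module is described by a node identifier and a device kind. The `Module` dataclass and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    name: str
    node: str
    kind: str  # "CPU" or "GPU"


def copy_priority(candidate, target):
    """Lower is better: same-node CPU, same-node GPU, remote CPU, remote GPU."""
    same_node = candidate.node == target.node
    is_cpu = candidate.kind == "CPU"
    if same_node:
        return 0 if is_cpu else 1
    return 2 if is_cpu else 3


def pick_source(holders, target):
    """Choose the holder that is cheapest to copy from (holders must be non-empty)."""
    return min(holders, key=lambda module: copy_priority(module, target))


target = Module("241a", node="240", kind="GPU")
holders = [
    Module("remote_gpu", node="240b", kind="GPU"),
    Module("remote_cpu", node="240b", kind="CPU"),
    Module("local_gpu", node="240", kind="GPU"),
]
best = pick_source(holders, target)
best_remote = pick_source(holders[:2], target)
```

With only remote holders available, the remote CPU wins because it avoids the extra GPU-to-main-memory copy on the sending node.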

In addition, when the required buffer object has been copied to the executing calculation module, the memory management unit 280 may add the executing calculation module to the calculation module list. According to a further aspect, when a buffer object is subsequently modified, the memory management unit 280 may clear the calculation module list of that buffer object and then add the calculation module that modified the buffer object (e.g., 241a) to the calculation module list. The memory management unit 280 may analyze the kernel source (e.g., OpenCL kernel source) to find out which command (or which kernel) modifies which buffer object.
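A minimal sketch of this list maintenance follows, assuming one list per buffer object keyed by name. The class `ModuleLists` and its method names are hypothetical. How a write is detected (the kernel source analysis mentioned above) is outside the sketch.

```python
class ModuleLists:
    """Tracks, per buffer object, which modules hold its latest data."""

    def __init__(self):
        self._lists = {}

    def holders(self, buffer_name):
        return list(self._lists.get(buffer_name, []))

    def record_copy(self, buffer_name, module):
        # A copy adds another up-to-date holder.
        holders = self._lists.setdefault(buffer_name, [])
        if module not in holders:
            holders.append(module)

    def record_write(self, buffer_name, module):
        # A write invalidates every other copy: clear, then add the writer.
        self._lists[buffer_name] = [module]


lists = ModuleLists()
lists.record_write("A", "241b")
lists.record_copy("A", "241c")
lists.record_copy("A", "241a")
after_copies = lists.holders("A")
lists.record_write("A", "241a")
after_write = lists.holders("A")
```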

In FIG. 2, suppose that the command selected by the command scheduler 222 is a command in the first command queue 223a and that the calculation module that is to execute this command is calculation module #1 241a. Suppose also that calculation module #2 241b and calculation module #3 241c are present in the calculation module list for buffer object 'A', which the selected command uses. The memory management unit 280 determines whether calculation module #1 241a is in the calculation module list for buffer object A. If it is, the command scheduler 222 transmits the request message to calculation module #1 241a. If it is not, the memory management unit 280 copies the buffer object from calculation module #2 241b or calculation module #3 241c, which are in the calculation module list, to calculation module #1 241a. The command scheduler 222 then transmits the request message to calculation module #1 241a.

According to one aspect, if the required buffer object is not present in the executing calculation module, the command scheduler 222 may generate the request message after the memory management unit 280 has copied the required buffer object to the executing calculation module.

According to another aspect, the command scheduler 222 may store the command corresponding to a transmitted request message in the issue queue 224 and wait for a response from the calculation node 240. When the calculation node 240 responds that execution of the command is complete, the command scheduler 222 can remove the command from the issue queue 224 and move it to the completion queue 225.
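The bookkeeping between the issue queue and the completion queue might look like the sketch below. The class `CommandTracker` and the string command identifiers are hypothetical. Network transmission and receipt of the completion message are left out.

```python
from collections import deque


class CommandTracker:
    """Host-side bookkeeping for commands that are in flight or finished."""

    def __init__(self):
        self.issue_queue = deque()       # sent, waiting for a completion message
        self.completion_queue = deque()  # confirmed complete by a calculation node

    def issued(self, command_id):
        self.issue_queue.append(command_id)

    def completed(self, command_id):
        # Raises ValueError if the command was never issued.
        self.issue_queue.remove(command_id)
        self.completion_queue.append(command_id)


tracker = CommandTracker()
tracker.issued("cmd-1")
tracker.issued("cmd-2")
tracker.completed("cmd-2")  # completions may arrive out of order
```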

Also in FIG. 2, the calculation node 240 may include a plurality of calculation modules 241a, 241b, 241c, 241d, a command handler 242, a plurality of ready queues 243a, 243b, 243c, 243d, and a plurality of module threads 244a, 244b, 244c, 244d.

Each of the calculation modules 241a, 241b, 241c, 241d may be a device that computes and processes data, such as a CPU or a GPU.

The command handler 242 receives request messages from the host node 220 through the network 260.

The command handler 242 creates a command object according to the received request message and puts the created command object into one of the ready queues 243a, 243b, 243c, 243d. The ready queues 243a, 243b, 243c, 243d correspond to the calculation modules 241a, 241b, 241c, 241d, respectively. Thus, if the request message specifies that the command is to be executed on the first calculation module 241a, the command object may be inserted into the ready queue 243a of the first calculation module 241a.

The module threads 244a, 244b, 244c, 244d execute commands using the command objects inserted into the ready queues 243a, 243b, 243c, 243d and the calculation modules 241a, 241b, 241c, 241d. In the example above, the module thread 244a of the first calculation module 241a can take the command object out of the ready queue 243a and have the command processed on the first calculation module 241a based on it.

According to one aspect, when execution of a command is complete, the module threads 244a, 244b, 244c, 244d may insert the completed command into the completion queue 245.

According to a further aspect, the command handler 242 can use a command in the completion queue 245 to generate a completion message indicating that execution of the command is complete, and transmit the generated completion message to the host node 220 through the network 260.
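The node-side flow (command handler, per-module ready queues, module threads, shared completion queue) can be sketched with Python threads. Everything here is illustrative: the function names, the dictionary-shaped request, and the `None` shutdown sentinel are assumptions, and "executing" a command is reduced to recording its name.

```python
import queue
import threading


def module_thread(ready_queue, completion_queue, execute):
    """Take commands from one module's ready queue until a None sentinel arrives."""
    while True:
        command = ready_queue.get()
        if command is None:
            break
        execute(command)               # stand-in for running the kernel on the module
        completion_queue.put(command)  # report the finished command


ready_queues = {"241a": queue.Queue(), "241b": queue.Queue()}
completion_queue = queue.Queue()
executed = []
threads = [
    threading.Thread(target=module_thread, args=(q, completion_queue, executed.append))
    for q in ready_queues.values()
]
for thread in threads:
    thread.start()


def handle_request(request):
    """Command handler: route the command to the ready queue of its target module."""
    ready_queues[request["module"]].put(request["command"])


handle_request({"module": "241a", "command": "k1"})
handle_request({"module": "241b", "command": "k2"})
for q in ready_queues.values():
    q.put(None)  # ask each module thread to stop after draining its queue
for thread in threads:
    thread.join()

completed = []
while not completion_queue.empty():
    completed.append(completion_queue.get())
```

In a fuller version the command handler would turn each entry of `completed` into a completion message for the host node. The order of completions across modules is not deterministic.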

Each component shown in FIG. 2 may be implemented as an electrical circuit and/or hardware, or as an application program or the like executed on a processor. The division into components is merely one example of a logical division by function, and the functions may be divided according to criteria other than those shown in FIG. 2. In other words, two or more functional units may be integrated into one functional unit, or some of the functions performed by one functional unit may be performed by one or more other functional units.

FIG. 3 illustrates a memory management method of a parallel computing framework-based cluster system according to an embodiment of the present invention. The method may be performed by the host node of FIG. 2.

Referring to FIGS. 2 and 3, according to the memory management method of the present embodiment, a host program for the parallel computing framework is first executed and a command related to a kernel program for the parallel computing framework is inserted into a command queue (301). For example, the host thread 221 can put a command into the command queue 223a.

Then a command to be executed is selected according to the scheduling of the command queue (302). For example, the command scheduler 222 can select a command in the command queue 223a according to a predetermined scheduling policy.

Then it is determined whether the calculation module that is to execute the selected command has the buffer object used by that command (303). For example, the memory management unit 280 may use the per-buffer-object calculation module lists to determine whether that calculation module is present in the calculation module list of the buffer object. If the calculation module that is to execute the selected command (e.g., 241a) does not hold the buffer object, the buffer object of another calculation module present in the calculation module list (e.g., 241b) is copied to the memory of the calculation module 241a that is to execute the selected command, and the calculation module 241a is added to that calculation module list (304).

Then a request message requesting execution of the selected command is transmitted to the corresponding calculation node (305). For example, the command scheduler 222 can generate a request message requesting execution of the selected command and transmit the generated request message through the network.
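Steps 301 to 305 can be strung together in a short sketch. It assumes that every buffer used by a command already has at least one holder in its calculation module list, and it records copies and sent messages in plain lists instead of touching real memory or a network. The function `process_next` and the dictionary-shaped command are hypothetical.

```python
from collections import deque


def process_next(command_queue, module_lists, copies, sent):
    """Run steps 302 to 305 for the command at the head of one command queue."""
    command = command_queue.popleft()                  # 302: select a command
    target = command["module"]
    for name in command["buffers"]:
        holders = module_lists[name]
        if target not in holders:                      # 303: is the buffer present?
            copies.append((name, holders[0], target))  # 304: copy from a holder
            holders.append(target)                     # 304: update the list
    sent.append({"to": target, "kernel": command["kernel"]})  # 305: send request
    return command


command_queue = deque()
command_queue.append({"module": "241a", "kernel": "vec_add", "buffers": ["A"]})  # 301
module_lists = {"A": ["241b", "241c"]}
copies, sent = [], []
process_next(command_queue, module_lists, copies, sent)
```

The request is appended to `sent` only after the copy has been recorded, which mirrors the ordering in which the request message follows the buffer copy.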

As described above, according to the disclosed embodiments, in a cluster environment based on a parallel computing framework, a calculation module list is used to determine in advance whether the calculation module that is to execute a command holds the buffer object used by that command, and the buffer object of another calculation module is copied to that calculation module accordingly, so that efficient memory management can be ensured.

Meanwhile, embodiments of the present invention can be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes all kinds of recording devices in which data readable by a computer system is stored.

Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and also include implementation in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium may also be distributed over network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present invention can be readily inferred by programmers skilled in the art to which the present invention pertains.

Furthermore, the embodiments described above are intended to illustrate the present invention, and the scope of the present invention is not limited to any specific embodiment.

220: host node
221: host thread
222: command scheduler
240: calculation node
241: calculation module
242: command handler
244: module thread
260: network
280: memory management unit

Claims

A host node of a parallel computing framework-based cluster system, comprising:
a host thread that executes a host program for the parallel computing framework and inserts a command related to a kernel program for the parallel computing framework into a command queue;
a command scheduler that schedules the command queue to select a command to be executed, generates a request message requesting execution of the selected command, and transmits the generated request message over a network to a calculation node including a calculation module that is to execute the command; and
a memory management unit that determines whether a first calculation module, which is the calculation module that is to execute the selected command, has a buffer object used by the selected command and, according to the result of the determination, copies the buffer object of a second calculation module, which is a calculation module different from the first calculation module, to the first calculation module.

The host node of claim 1, wherein the memory management unit generates and manages, for each buffer object, a calculation module list composed of the calculation modules holding the latest data of that buffer object.

The host node of claim 2, wherein the memory management unit determines whether the first calculation module has the buffer object by using the calculation module list.

The host node of claim 3, wherein, if the first calculation module is not in the calculation module list, the memory management unit copies the buffer object of the second calculation module to the first calculation module and adds the first calculation module to the calculation module list.

The host node of claim 2, wherein, when the buffer object is modified, the memory management unit clears the calculation module list of the buffer object and then adds the calculation module that modified the buffer object to the calculation module list.

The host node of claim 1, wherein, if the first calculation module does not hold the buffer object, the command scheduler generates the request message after the buffer object of the second calculation module has been copied to the first calculation module.

The host node of claim 1, wherein the buffer object is defined as an abstracted memory area used by an application for the parallel computing framework.

A memory management method of a parallel computing framework-based cluster system, comprising:
executing a host program for the parallel computing framework and inserting a command related to a kernel program for the parallel computing framework into a command queue;
scheduling the command queue to select a command to be executed;
determining whether a first calculation module, which is the calculation module that is to execute the selected command, has a buffer object used by the selected command and, according to the result of the determination, copying the buffer object of a second calculation module, which is a calculation module different from the first calculation module, to the first calculation module; and
generating a request message requesting execution of the selected command and transmitting the generated request message over a network to a calculation node including the first calculation module.