KR20120083000A

KR20120083000A - Method for dynamically assigned of parallel control module

Info

Publication number: KR20120083000A
Application number: KR1020110004395A
Authority: KR
Inventors: 김동순; 이승열; 이상설; 안재훈
Original assignee: 전자부품연구원
Priority date: 2011-01-17
Filing date: 2011-01-17
Publication date: 2012-07-25
Also published as: KR101177059B1

Abstract

PURPOSE: A method for dynamically allocating parallel control modules is provided that adjusts the previously fixed number of parallel control modules dynamically according to the input threads, thereby improving execution speed and convenience. CONSTITUTION: A CPU core analyzes the number of threads for which data processing is requested. The number of parallel control modules that will process the data is determined using the number of threads, the minimum number of threads that can be bundled at one time, and the maximum number of threads that can be bundled into one block (S102). The CPU core stores information about the parallel control modules and the data to be processed by each parallel control module, and then requests data processing from each parallel control module (S108). The minimum number of threads that can be bundled at one time is 32. The maximum number of threads that can be bundled into one block is 512.

Description

Method for dynamically allocating parallel control modules {Method for dynamically assigned of parallel control module}

The present invention relates to a thread processing system composed of a CPU core and a plurality of parallel control modules and, more particularly, to a method of dynamically using the plurality of parallel control modules according to the input threads.

The increased hardware complexity associated with high-clock-frequency processors is now widely recognized. Moreover, clock frequency cannot be raised indefinitely, so other approaches are needed. Multiprocessor and multithreading technologies have therefore emerged as ways to improve overall performance through thread-level parallelism (TLP) as well as to improve processor efficiency.

This recent shift to multiprocessing has spread from the desktop to embedded design, and it requires a change in thinking for much software. For many years, embedded designers have achieved greater computing capability within a limited power budget by building multiple processors into their designs.

Both multiprocessing and multithreading improve overall processor performance and thereby shorten application processing time by using concurrent software threads. However, the two technologies take different hardware approaches to this goal and achieve different levels of benefit for a given piece of software code.

The basic idea of multithreading is to raise overall processor performance by exploiting the wasted cycles that typically arise in a uniprocessor design, because of access latency, when a high-frequency processor is paired with slow memory.

Fitting threads into these spare cycles improves core efficiency. Fundamentally, a multithreaded processor is classified as a uniprocessor that supports additional hardware threads by duplicating only a minimal amount of processor logic.

Typically, the duplicated logic is the programmer's register set and the CPU's supervisor state, which is sufficient for today's operating systems to recognize each hardware thread as a virtual processor. Sharing the remaining processor logic is a significant issue that increases software complexity.

In general, a CPU is designed to process a series of instructions executed in order. With the advent of multithreading and multiple cores, CPUs can also achieve a considerable degree of parallelism, but a performance gap remains between the CPU and the GPU.

A GPU is a parallel processor designed to handle many tasks at the same time. Each GPU core is simpler and less powerful than a CPU core, but a GPU contains tens to hundreds of cores.

Because it can handle hundreds of tasks simultaneously, the GPU has the advantage in parallel processing. Executing a parallel processing program using the existing methods of generating parallel software structures is therefore very time-consuming and difficult. A method of generating and controlling parallel software structures that differs from the existing methods is needed.

That is, threads have conventionally been processed using a fixed number of parallel control modules: a preset number of modules is used without analyzing the input threads. Because the threads input to the GPU core are handled by a fixed set of parallel control modules, processing can in some cases take a long time and be inefficient.

An object of the present invention is to propose a method of analyzing the input threads and processing them efficiently in the parallel control modules constituting a GPU core.

Another object of the present invention is to propose a method of using the previously fixed number of parallel control modules dynamically according to the input threads.

Yet another object of the present invention is to propose a method of improving convenience and execution speed by using the parallel control modules efficiently.

To this end, the present invention provides, in a data processing system composed of a CPU core and a GPU core that includes a plurality of parallel control modules receiving data processing requests from the CPU core, a method of determining the number of parallel control modules that will perform data processing, the method comprising: analyzing, in the CPU core, the number of threads for which data processing is requested; determining the number of parallel control modules that will process the data using the number of threads analyzed by the CPU core, the minimum number of threads that can be bundled at one time (warp size), and the maximum number of threads that can be bundled into one block; and storing, in the CPU core, information about the parallel control modules and the data to be requested of each parallel control module, and then requesting data processing from each parallel control module.

Also to this end, the present invention provides, in a data processing system composed of a CPU core and a GPU core that includes a plurality of parallel control modules receiving data processing requests from the CPU core, a method of determining the number of parallel control modules that will perform data processing, the method comprising: analyzing, in the CPU core, the number of threads for which data processing is requested; determining the number of parallel control modules that will process the data using the number of threads analyzed by the CPU core, the minimum number of threads that can be bundled at one time (warp size), the maximum number of threads that can be bundled into one block, and a priority related to the data processing of each parallel control module; storing, in the CPU core, information about the parallel control modules and the data to be requested of each parallel control module, and then requesting data processing from each parallel control module; and collecting, in the CPU core, the data processed by each parallel control module.

The method for dynamically allocating parallel control modules according to the present invention analyzes the input threads and processes them efficiently in the parallel control modules constituting the GPU core. That is, by using the previously fixed number of parallel control modules dynamically according to the input threads, it improves convenience and execution speed.

FIG. 1 is a flowchart illustrating operations performed by a CPU and a GPU according to an embodiment of the present invention.
FIG. 2 illustrates a thread processing system according to an embodiment of the present invention.
FIG. 3 illustrates an example of the information stored in a log storage according to an embodiment of the present invention.

The foregoing and additional aspects of the present invention will become more apparent through the preferred embodiments described with reference to the accompanying drawings. The embodiments are described in detail below so that those skilled in the art can readily understand and reproduce the invention.

If a program is regarded as a process, a thread is a unit of execution within that program. Java represents each task as a thread and enables multitasking by allowing multiple such threads. That is, Java handles multitasking through multithreading, in which several threads run simultaneously.

The Java virtual machine allows a single application to have multiple threads running concurrently. Just as tasks have priorities, every thread has a priority.

To overcome the drawbacks of using a fixed number of parallel modules when a parallel control structure is requested automatically, the present invention proposes a method of using the number of parallel modules dynamically according to the environment.

FIG. 1 is a flowchart illustrating operations performed by a CPU and a GPU according to an embodiment of the present invention. These operations are described in detail below with reference to FIG. 1.

In step S100, the threads are analyzed in order to convert the CPU's multithreaded parallel structure into a parallel GPU control structure. That is, step S100 analyzes the threads to be processed.

In step S102, the number of parallel control modules to be used dynamically is determined according to the number of CPU cores and the number of GPU cores. In addition, the number of parallel control modules is determined in consideration of the 512 blocks and 32 warps that make up each parallel control module. That is, the minimum number of threads that can be bundled at one time (warp size) is 32, and the maximum number of threads that can be bundled into one block is 512.
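
The text does not give an explicit formula for step S102. As a minimal sketch under assumed rules, the thread count could be rounded up to a whole number of warps (32 threads), packed into blocks of at most 512 threads, and one parallel control module used per block, capped by the number of modules available. The function name `module_count`, the one-module-per-block mapping, and the `max_modules` cap are illustrative assumptions, not taken from the source.

```python
import math

WARP_SIZE = 32           # minimum number of threads bundled at one time (from the text)
MAX_BLOCK_THREADS = 512  # maximum number of threads bundled into one block (from the text)


def module_count(num_threads: int, max_modules: int) -> int:
    """Hypothetical rule: one parallel control module per block of up to 512 threads."""
    if num_threads <= 0:
        return 0
    # Round the thread count up to a whole number of warps.
    padded = math.ceil(num_threads / WARP_SIZE) * WARP_SIZE
    # Pack the padded threads into blocks of at most MAX_BLOCK_THREADS.
    blocks = math.ceil(padded / MAX_BLOCK_THREADS)
    # Assumed cap: never use more modules than the GPU core provides.
    return min(blocks, max_modules)
```

Under these assumptions, a small request uses a single module, while a request larger than the available capacity uses every module.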

In step S104, once the number of parallel control modules to be used dynamically has been determined, the identifier of each parallel control module is stored in the log storage.

In step S106, the data processing required of each parallel control module is defined, and the module is created.

In step S108, the created parallel control modules are each assigned to the GPU core and process the resources allocated to them on request. For the resource processing order, the log storage holds the identifier of each parallel control module and the processing part that each module must handle. Each parallel control module therefore processes its own part using the information about each processing part stored in the log storage.
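
One way to picture the log storage used in steps S104 to S108 is as a list of records, each holding a module identifier and the part of the data that module must process. The sketch below is illustrative only: the `LogEntry` fields, the `build_log` helper, and the contiguous, near-equal split are assumptions, since the source does not specify how parts are defined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LogEntry:
    module_id: int  # identifier of the parallel control module
    start: int      # first item of the processing part (inclusive)
    end: int        # end of the processing part (exclusive)


def build_log(num_items: int, num_modules: int) -> List[LogEntry]:
    """Split num_items into contiguous parts, one per module (assumed policy).

    num_modules is assumed to be a positive integer.
    """
    base, extra = divmod(num_items, num_modules)
    log, start = [], 0
    for module_id in range(1, num_modules + 1):
        # The first `extra` modules take one additional item each.
        size = base + (1 if module_id <= extra else 0)
        log.append(LogEntry(module_id, start, start + size))
        start += size
    return log
```

Each module can then look up its own entry by identifier and process only the range recorded there.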

In step S110, the data processed by each parallel control module is collected through a merging process based on the log storage. While the maximum timeout is being checked during this collection step, if a GPU core job is found to be delayed, the identifier of the corresponding parallel control module is checked and the priority of the job is raised so that the maximum timeout is not exceeded.
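
The collection step with its timeout check can be sketched as follows. This is a simplified illustration, not the patented procedure: the `ModuleTask` fields, the `warn_ratio` cutoff that decides when a job counts as delayed, and the convention that a larger `priority` value means more urgent are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModuleTask:
    module_id: int
    elapsed: float                 # seconds since the request was issued
    done: bool
    priority: int = 0              # larger value = more urgent (assumed convention)
    result: Optional[list] = None  # data produced by the module, if finished


def check_and_merge(tasks: List[ModuleTask], max_timeout: float, warn_ratio: float = 0.8):
    """Boost delayed jobs; return (merged data or None, boosted module ids)."""
    boosted = []
    for task in tasks:
        # A job is treated as delayed once it nears the maximum timeout.
        if not task.done and task.elapsed >= warn_ratio * max_timeout:
            task.priority += 1
            boosted.append(task.module_id)
    if not all(task.done for task in tasks):
        return None, boosted
    merged = []
    # Merge in module-identifier order, mirroring the order kept in the log.
    for task in sorted(tasks, key=lambda t: t.module_id):
        merged.extend(task.result or [])
    return merged, boosted
```

Calling the function repeatedly while jobs are outstanding yields `None` until every module has finished, at which point the merged data is returned.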

As described above, the present invention proposes a method of using the number of parallel control modules, which was conventionally fixed, dynamically on the basis of various factors.

FIG. 2 illustrates a thread processing system according to an embodiment of the present invention. The thread processing system is described below with reference to FIG. 2.

Referring to FIG. 2, the thread processing system includes a CPU core 200 and a plurality of parallel control modules 220 constituting a GPU core 210. Naturally, the thread processing system may include components other than those described above.

The CPU core 200 analyzes the input threads; that is, it analyzes the number of threads to be processed. To process the input threads efficiently, the CPU core 200 determines, according to the number of input threads, the number of parallel control modules to be used dynamically. The CPU core 200 stores in the log storage the number of parallel control modules determined for processing the threads and the part that each parallel control module must process. The CPU core 200 then provides each parallel control module with information about the threads it must process.

The CPU core 200 collects the data processed by each parallel control module on the basis of the information stored in the log storage. While the maximum timeout is being checked during this collection step, if a GPU core job is found to be delayed, the identifier of the corresponding parallel control module is checked and the priority of the job is raised so that the maximum timeout is not exceeded.

That is, if a particular parallel control module among the plurality takes a long time to process its data, the CPU core 200 cannot collect the processed data. Therefore, when data processing is delayed in a particular parallel control module, the CPU core 200 raises the priority of that module's data processing job so that the data can be processed quickly.

In addition, when data processing is delayed in a particular parallel control module, the CPU core 200 may request a parallel control module that is not currently performing a data processing job to process some or all of the corresponding data. In this case, the CPU core stores the new assignment in the log storage. In this way, the CPU core can process the input threads as quickly as possible.
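
The reassignment described here might be sketched as below. The source allows "some or all" of the delayed data to be moved; handing over exactly the second half is an arbitrary choice made for illustration, and the `reassign_half` name and the dictionary layout of the log are assumptions.

```python
def reassign_half(log, delayed_id, idle_id):
    """Hand the second half of a delayed module's part to an idle module.

    `log` maps a module identifier to its (start, end) item range, end exclusive.
    A new dictionary is returned so the caller's log is left untouched.
    """
    start, end = log[delayed_id]
    mid = (start + end) // 2
    updated = dict(log)
    updated[delayed_id] = (start, mid)  # the delayed module keeps the first half
    updated[idle_id] = (mid, end)       # the new assignment is recorded in the log
    return updated
```

For example, if module 2 holds items 100 to 199 and is running late, the updated log leaves it with items 100 to 149 and gives items 150 to 199 to the idle module.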

Separately, the CPU core may include a performance measurement unit for measuring the performance of each parallel control module.

The performance measurement unit measures (or predicts) the load of each thread and updates the thread log storage with the measured per-thread load information. It also determines whether the load is unbalanced among the parallel control modules and, if the load among the cores is determined to be unbalanced, transmits a load balancing request.

To do so, the performance measurement unit may measure or predict load using any of various conventional methods. For example, if a value representing the load imbalance of each core exceeds a set threshold, the load may be determined to be unbalanced.
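
The threshold test mentioned above could take many forms, and the source does not name a specific metric. The sketch below assumes one simple choice: the gap between the busiest module's load and the mean load. The function name `is_unbalanced` and the metric itself are illustrative assumptions.

```python
def is_unbalanced(loads, threshold):
    """Return True if the busiest module exceeds the mean load by more than threshold.

    `loads` maps a module identifier to its measured (or predicted) load.
    The max-minus-mean metric is an assumed example, not taken from the source.
    """
    if not loads:
        return False
    mean = sum(loads.values()) / len(loads)
    return max(loads.values()) - mean > threshold
```

A result of `True` would correspond to the point at which the performance measurement unit sends a load balancing request.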

The parallel control module 220 performs the data processing job requested by the CPU core. When the requested job is complete, the parallel control module 220 reports this to the CPU core. The CPU core can thus obtain information about which of the plurality of parallel control modules 220 have completed their data processing jobs.

In addition, if the currently requested data processing job overloads the parallel control module 220, the module may report this to the CPU core. The CPU core can then quickly assign and request the data processing job to another parallel control module.

FIG. 3 illustrates an example of the information stored in the log storage according to an embodiment of the present invention. An example of this information is described in detail below with reference to FIG. 3.

Referring to FIG. 3, thread 1 is processed by parallel control module 1, and threads 2 and 3 are processed by parallel control module 2. Likewise, thread n is processed by parallel control module n. The threads processed by each parallel control module are subsequently collected by the CPU core.

As described above, when the parallel control module that processes a given thread changes, the CPU core naturally updates the corresponding information. FIG. 3 shows each parallel control module processing at least one thread, but the invention is not limited to this; a single thread may also be processed by a plurality of parallel control modules.

In addition, as shown in FIG. 3, the log storage may record whether each parallel control module has completed its assigned data processing job. In FIG. 3, parallel control modules 1 and n have completed their assigned jobs, whereas parallel control module 2 has not.
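
The contents described for FIG. 3 can be represented with two small mappings, one from threads to modules and one from modules to completion flags. In this sketch, "n" is taken as 4 purely for illustration, and the names `assignment`, `completed`, `pending_modules`, and `threads_of` are hypothetical.

```python
# Thread-to-module assignments as described for FIG. 3 ("n" is taken as 4 here).
assignment = {"thread1": 1, "thread2": 2, "thread3": 2, "thread4": 4}

# Completion flag per parallel control module, as described for FIG. 3.
completed = {1: True, 2: False, 4: True}


def pending_modules(completed):
    """Return the identifiers of modules that have not finished their jobs."""
    return sorted(m for m, done in completed.items() if not done)


def threads_of(assignment, module_id):
    """Return the threads currently assigned to the given module."""
    return sorted(t for t, m in assignment.items() if m == module_id)
```

With these mappings, the CPU core could find that module 2 is still pending and that it is responsible for threads 2 and 3.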

The present invention has been described with reference to the embodiment shown in the drawings, but this is merely illustrative, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible.

200: CPU core 210: GPU core
220: parallel control module

Claims

1. In a data processing system composed of a CPU core and a GPU core that includes a plurality of parallel control modules receiving data processing requests from the CPU core, a method for dynamically allocating the number of parallel control modules to perform data processing, the method comprising:
analyzing, in the CPU core, the number of threads for which data processing is requested;
determining the number of parallel control modules that will process the data using the number of threads analyzed by the CPU core, the minimum number of threads that can be bundled at one time, and the maximum number of threads that can be bundled into one block; and
storing, in the CPU core, information about the parallel control modules and the data to be requested of each parallel control module, and then requesting data processing from each parallel control module.

2. The method of claim 1, wherein the minimum number of threads that can be bundled at one time is 32, and the maximum number of threads that can be bundled into one block is 512.

3. The method of claim 2, wherein the CPU core stores the information about the parallel control modules and the data to be processed by each parallel control module in a log storage.

4. The method of claim 3, wherein the log storage stores information about the identifier of each parallel control module and information about the data processed by each parallel control module.

5. In a data processing system composed of a CPU core and a GPU core that includes a plurality of parallel control modules receiving data processing requests from the CPU core, a method for dynamically allocating the number of parallel control modules to perform data processing, the method comprising:
analyzing, in the CPU core, the number of threads for which data processing is requested;
determining the number of parallel control modules that will process the data using the number of threads analyzed by the CPU core, the minimum number of threads that can be bundled at one time (warp size), the maximum number of threads that can be bundled into one block, and a priority related to the data processing of each parallel control module;
storing, in the CPU core, information about the parallel control modules and the data to be requested of each parallel control module, and then requesting data processing from each parallel control module; and
collecting, in the CPU core, the data processed by each parallel control module.

6. The method of claim 5, wherein the CPU core sets the data processing priority of a parallel control module whose processing of the requested data is delayed beyond a predetermined time.

7. The method of claim 6, wherein the minimum number of threads that can be bundled at one time is 32, and the maximum number of threads that can be bundled into one block is 512.