KR101658792B1

KR101658792B1 - Computing system and method

Info

Publication number: KR101658792B1
Application number: KR1020100000812A
Authority: KR
Inventors: 이원종; 정석윤
Original assignee: 삼성전자주식회사
Priority date: 2010-01-06
Filing date: 2010-01-06
Publication date: 2016-09-26
Also published as: KR20110080540A

Abstract

A computing system for a multi-core or many-core environment is provided. For load balancing among tasks within the computing system, a plurality of tasks are distributed to the cores in consideration of various variables such as the number of cores and hardware (H/W) units, the bus structure, and the memory size.

Description

COMPUTING SYSTEM AND METHOD

The following relates to a computing system in a multi-core or many-core processor environment and, more particularly, to a load balancing method that allocates load evenly so that no core remains idle, in order to scale performance with the number of cores.

Stream processors are increasingly used for processing multimedia data such as audio, video, and graphics. Recently, for low power and high performance, research on and use of stream processors in multi-core or many-core environments capable of processing tasks in parallel have been increasing.

In such multi-core or many-core processors, balancing the load so that no core is idle is important for scaling performance with the number of cores. Even if the number of cores increases, overall system performance does not improve significantly unless load balancing is done efficiently.

In particular, in multimedia data processing, the type and amount of data generated and processed vary over time, so a system that can cope adaptively is required.

For such load balancing, a conventional method periodically monitors whether each core is idle and assigns tasks to idle cores. This method is effective when the number of cores is relatively small, for example for a processor having two to eight cores, but it is not effective for larger numbers of cores.

This is because, with 16, 32, 64, or 128 cores, for example, the control needed for processor load balancing, such as monitoring the state of each core and synchronizing between tasks, incurs considerable overhead.

According to one aspect of the present invention, there is provided a computing system including a task profiler that calculates the latency of each of a plurality of tasks to be processed, a task analyzer that performs dependency analysis among the plurality of tasks to extract a task dependency graph (TDG), and a scheduler that performs software pipelining on the plurality of tasks using information on a plurality of cores and at least one piece of hardware.

In this case, the plurality of tasks to be processed have a plurality of iterations.

Here, the scheduler may, with reference to the TDG, calculate a minimum initiation interval (mII), which is the minimum number of cycles between the start of processing of one iteration and the start of the next iteration among the plurality of iterations, use the mII to find an optimal solution of the task mapping in which the iterations can overlap, and use the optimal solution to map the plurality of tasks to the plurality of cores and the at least one piece of hardware.

If the optimal solution is not found using the mII, the scheduler may add one cycle at a time to the mII until the optimal solution is found, search for the optimal solution at each resulting interval, and use the optimal solution to map the plurality of tasks to the plurality of cores and the at least one piece of hardware.

According to an embodiment of the present invention, the scheduler maps the plurality of tasks to the plurality of cores and the at least one piece of hardware and finds the optimal solution by heuristic search.

The optimal solution may be the task mapping combination that maximizes the processing IPC (Instructions Per Cycle) of the computing system.

The information on the plurality of cores and the at least one piece of hardware may include at least one of: the number of the cores and hardware units, their internal structure, their topology, and their internal buffer sizes.

According to another aspect of the present invention, there is provided a computing method including calculating the latency of each of a plurality of tasks to be processed, performing dependency analysis among the plurality of tasks to extract a task dependency graph (TDG), and performing software pipelining on the plurality of tasks using information on a plurality of cores and at least one piece of hardware.

Because load balance among the cores is maximized in a many-core processor environment, the number of idle cores is minimized, and the overall data processing speed therefore improves.

FIG. 1 illustrates a computing system according to an embodiment of the present invention.
FIG. 2 illustrates an exemplary structure of the multiple cores and hardware in a processing unit of a computing system according to an embodiment of the present invention.
FIG. 3 illustrates exemplary pseudo code of an application source input to a computing system according to an embodiment of the present invention.
FIG. 4 illustrates the result of generating a task dependency graph (TDG) from the application source of FIG. 3 according to an embodiment of the present invention.
FIG. 5 illustrates the result of performing software pipelining on the task iterations using the TDG of FIG. 4 according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating a computing method according to an embodiment of the present invention.
FIG. 7 is a flowchart detailing the software pipelining step of a computing method according to an embodiment of the present invention.

Hereinafter, some embodiments of the present invention will be described in detail with reference to the accompanying drawings. However, the present invention is not limited by these embodiments. Like reference numerals in the drawings denote like elements.

FIG. 1 illustrates a computing system 100 according to an embodiment of the present invention.

The computing system 100 includes a processing unit 140 for a multi-core or many-core environment (hereinafter collectively referred to as multi-core). The processing unit 140 includes a plurality of cores capable of processing tasks in parallel and may also include at least one piece of dedicated hardware (fixed H/W).

In a processor in a multi-core environment, maximizing resource utilization requires maximizing parallel processing through appropriate task scheduling.

Conventional software pipelining is a method that maps instruction iterations appropriately across the multiple arithmetic-logic units (ALUs) or other function units (FUs) within a single core in order to maximize parallel processing among them.

According to an embodiment of the present invention, this software pipelining technique is used to process tasks in parallel across multiple cores in a multi-core environment.

When application source code is input, the task profiler 110 analyzes the source code and calculates the processing time, i.e., the latency, for each type of task to be processed.

The application source code analyzed in this embodiment contains task iteration. That is, unit code containing a plurality of tasks is to be processed repeatedly through a loop statement such as a for statement.

Once the task profiler 110 has analyzed the task latencies, the task analyzer 120 builds a task dependency graph (TDG) based on the dependencies among the tasks in the iterated unit code. The TDG shows which tasks can be processed only after which other tasks have been processed. For this analysis, the source code may be converted into assembly language.

The scheduler 130 then takes the task iteration into account, calculates the minimum initiation interval (mII) at which one iteration and the next can be executed in an overlapping manner, and searches for the optimal solution of the task mapping at the mII.

The optimal solution of the task mapping is the mapping combination that maximizes TPC (Tasks Per Cycle), the number of tasks processed per cycle. The search for this optimal solution may be performed by heuristic search.

If no optimal solution is found at the mII, the scheduler 130 increases the initiation interval (II) by one cycle and searches again. In this way, II is increased one cycle at a time until an optimal solution is found.

Once an II with an optimal solution is found, the scheduler 130 uses the optimal solution to map the tasks to the multiple cores and the dedicated hardware, completing the task scheduling.

The processing unit 140 then processes the application by processing the task iterations according to the schedule produced by the scheduler 130.

The operation of the task profiler 110, the task analyzer 120, and the scheduler 130 is described in more detail below with reference to FIG. 2 and the subsequent figures.

FIG. 2 illustrates an exemplary structure of the multiple cores and hardware in the processing unit 140 of a computing system according to an embodiment of the present invention.

In this embodiment, the processing unit 140 includes four cores 210, 220, 230, and 240; four dedicated hardware units 251, 252, 253, and 254 for ROP (Raster Operation); a BMU (Batch Management Unit) 260, which is dedicated hardware for batch management; and fragment generation hardware FG (Fragment Generation) 270.

The cores 210, 220, 230, and 240 and some of the hardware units (253, 254, 260, and 270) have internal buffers, while the other hardware units (251 and 252) may have no buffer.

A descriptor of this hardware structure is used when the scheduler 130 of FIG. 1 performs task mapping.
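The text does not specify the format of the descriptor. As a minimal sketch, assuming it is a flat list of units with a kind and a buffer flag, the structure of FIG. 2 could be recorded as follows. The `Unit` class, its field names, and `units_per_kind` are hypothetical and exist only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    ref: int          # reference numeral used in FIG. 2
    kind: str         # "core", "ROP", "BMU", or "FG"
    has_buffer: bool  # whether the unit has an internal buffer


# Hypothetical descriptor for the processing unit 140 as described above.
DESCRIPTOR = (
    Unit(210, "core", True), Unit(220, "core", True),
    Unit(230, "core", True), Unit(240, "core", True),
    Unit(251, "ROP", False), Unit(252, "ROP", False),
    Unit(253, "ROP", True), Unit(254, "ROP", True),
    Unit(260, "BMU", True), Unit(270, "FG", True),
)


def units_per_kind(descriptor):
    """Count how many units of each kind are available for task mapping."""
    return dict(Counter(unit.kind for unit in descriptor))


print(units_per_kind(DESCRIPTOR))
```

A scheduler would also need the topology and buffer sizes mentioned earlier in the description; they are left out here because the embodiment gives no values for them.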

FIG. 3 illustrates exemplary pseudo code 300 of unit code that is to be processed over a plurality of iterations within the application source input to a computing system according to an embodiment of the present invention.

In this embodiment, the pseudo code 300 is to be processed for 100 iterations. The pseudo code 300 contains tasks such as BMU, VS (Vertex Shader), FG, PS (Pixel Shader), and ROP.

The pseudo code 300 is analyzed by the task profiler 110. In this embodiment, the latency of BMU is assumed to be 1, that of VS 2, that of PS 3, that of FG 1, and that of ROP 1.

These latencies are analyzed by the task profiler 110.

As shown in FIG. 2, BMU, FG, and ROP are processed by dedicated hardware, and the remaining tasks, VS and PS, can be processed by any of the cores 210, 220, 230, or 240.

As for task dependencies, VS can be processed only after the BMU in the pseudo code 300 has been performed twice. FG can be processed only after VS has been processed twice.

PS is also processed twice; each PS can be processed after one FG has been processed. ROP is likewise processed twice; each ROP can be processed after one PS has been processed.

FIG. 4 illustrates the result of generating a TDG (Task Dependency Graph) 400 from the application source of FIG. 3 according to an embodiment of the present invention.

As described above, the tasks have dependencies, and these dependencies can be expressed as a graph structure.

The task analyzer 120 expresses the following in the TDG 400 and provides it to the scheduler 130: after BMU processing, two VS tasks can be processed in parallel; FG can be processed only after both VS tasks have been processed; after FG processing, two PS tasks can be processed in parallel; and after the PS tasks, two ROP tasks can be processed in parallel.
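The dependency chain summarized above can be written down as a small graph. The sketch below is one possible encoding, not the format of FIG. 4: node names such as `VS_a` are hypothetical, BMU and FG appear as single nodes as in the summary, and each ROP is assumed to depend on one PS. It walks the graph in dependency order and computes the earliest finish time of each task using the latencies assumed in this embodiment, ignoring any limit on the number of units.

```python
from graphlib import TopologicalSorter

# Latencies assumed in the embodiment (cycles per task type).
LATENCY = {"BMU": 1, "VS": 2, "FG": 1, "PS": 3, "ROP": 1}

# Each node maps to the set of tasks that must finish before it can start.
TDG = {
    "BMU": set(),
    "VS_a": {"BMU"}, "VS_b": {"BMU"},
    "FG": {"VS_a", "VS_b"},
    "PS_a": {"FG"}, "PS_b": {"FG"},
    "ROP_a": {"PS_a"}, "ROP_b": {"PS_b"},
}


def task_type(node):
    """Strip the hypothetical '_a' / '_b' suffix to get the task type."""
    return node.split("_")[0]


def earliest_finish_times(tdg, latency):
    """Earliest finish time of every task, assuming unlimited units."""
    finish = {}
    for node in TopologicalSorter(tdg).static_order():
        start = max((finish[pred] for pred in tdg[node]), default=0)
        finish[node] = start + latency[task_type(node)]
    return finish


finish = earliest_finish_times(TDG, LATENCY)
critical_path = max(finish.values())
print(critical_path)  # 1 + 2 + 1 + 3 + 1 = 8
```

The value 8 is only a dependency-driven lower bound for this simplified graph. The 10 cycles quoted for one iteration in the description also reflect repeated BMU and FG tasks sharing a single dedicated unit each.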

FIG. 5 illustrates the result of performing software pipelining on the task iterations using the TDG of FIG. 4 according to an embodiment of the present invention.

The scheduler 130 maps the tasks to the cores and hardware units using the latency of each task provided by the task profiler 110, the TDG provided by the task analyzer 120, and the descriptor, i.e., the configuration information of the cores and hardware described with reference to FIG. 2.

In this process, the scheduler uses the software pipelining technique described above.

First, the scheduler 130 finds the minimum initiation interval (mII) at which one iteration and the next can be processed in an overlapping manner.

The mII can be determined with reference to the hardware structure, the types of tasks, and so on, and can be understood by a person of ordinary skill in software pipelining.
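The text leaves the mII computation to ordinary skill. One common ingredient in the software pipelining literature is a resource-constrained lower bound: for each kind of unit, divide the total busy cycles one iteration needs on that kind by the number of units, round up, and take the maximum. The sketch below applies that bound to this embodiment. It is an assumption, not the patent's stated formula; it ignores dependency-driven bounds and load/store cycles, the function name is hypothetical, and the per-iteration count of two tasks per type follows the ten-task iteration the description mentions.

```python
from math import ceil

LATENCY = {"BMU": 1, "VS": 2, "FG": 1, "PS": 3, "ROP": 1}
TASK_COUNT = {"BMU": 2, "VS": 2, "FG": 2, "PS": 2, "ROP": 2}  # assumed: ten tasks per iteration
RUNS_ON = {"BMU": "BMU", "FG": "FG", "ROP": "ROP", "VS": "core", "PS": "core"}
UNITS = {"core": 4, "ROP": 4, "BMU": 1, "FG": 1}  # unit counts from FIG. 2


def resource_bound_mii(latency, task_count, runs_on, units):
    """Lower bound on the initiation interval imposed by unit counts alone."""
    busy = {}
    for task, count in task_count.items():
        kind = runs_on[task]
        busy[kind] = busy.get(kind, 0) + count * latency[task]
    return max(ceil(cycles / units[kind]) for kind, cycles in busy.items())


print(resource_bound_mii(LATENCY, TASK_COUNT, RUNS_ON, UNITS))  # 3
```

Under these assumptions the cores are the bottleneck: VS and PS together need 10 core-cycles per iteration, and four cores give a bound of 3 cycles, which agrees with the final II of 3 reported for this embodiment.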

Once the mII is found, the scheduler 130 determines whether an optimal solution can be found at the mII among the combinations that map the tasks to the cores or the dedicated hardware.

Here, BMU, FG, and ROP, which have dedicated hardware, are each mapped to their dedicated hardware, and the remaining tasks, VS and PS, are mapped appropriately to the four cores.

If no optimal solution can be found at the mII, the II (initiation interval) is increased by 1 and the search is repeated; when an optimal solution is found through this process, the II at that point is taken as the final II.

In the optimal task mapping of this embodiment, the final II 510 is 3.

Two BMU1 tasks of the first iteration are placed on the dedicated hardware, and two VS1 tasks are mapped to core 3 and core 4, respectively. After VS processing, FG1 is processed on its dedicated hardware, and two PS1 tasks are mapped to core 1 and core 2 for parallel processing.

Using their results, two ROP1 tasks are split across the dedicated hardware units ROP1 and ROP2 for parallel processing.

Meanwhile, as illustrated, while VS1 is being processed on core 4, BMU2 of the second iteration is mapped to be processed on the dedicated hardware.

While FG1 is being processed on the dedicated hardware, VS2 of the second iteration is mapped to be processed on core 4.

In this manner, software pipelining is performed that overlaps iterations while minimizing idle cores and idle hardware.
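Whether a given mapping lets iterations overlap at a chosen II can be checked by folding each task's busy cycles modulo II and confirming that no unit is claimed twice in the same folded slot. The sketch below does this for a hypothetical one-iteration schedule. The unit assignments follow the placements described above (VS on cores 3 and 4, PS on cores 1 and 2), but the start cycles and task names are assumptions made for illustration and are not read from FIG. 5.

```python
# Hypothetical one-iteration schedule: task -> (unit, start cycle, latency).
SCHEDULE = {
    "BMU_a": ("BMU", 0, 1), "BMU_b": ("BMU", 1, 1),
    "VS_a": ("core3", 2, 2), "VS_b": ("core4", 2, 2),
    "FG_a": ("FG", 4, 1), "FG_b": ("FG", 5, 1),
    "PS_a": ("core1", 6, 3), "PS_b": ("core2", 6, 3),
    "ROP_a": ("ROP1", 9, 1), "ROP_b": ("ROP2", 9, 1),
}


def overlaps_cleanly(schedule, ii):
    """True if iterations started every `ii` cycles never share a unit in the same cycle."""
    taken = set()
    for unit, start, latency in schedule.values():
        for cycle in range(start, start + latency):
            slot = (unit, cycle % ii)  # fold the cycle into one II-long window
            if slot in taken:
                return False
            taken.add(slot)
    return True


print(overlaps_cleanly(SCHEDULE, 3))  # True
print(overlaps_cleanly(SCHEDULE, 2))  # False: a 3-cycle PS cannot repeat every 2 cycles on one core
```

With II = 3 this assumed schedule keeps cores 1 and 2 busy in every cycle of the steady state, which is the kind of idle-time reduction the paragraph above describes.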

Looking at the steady-state interval 520, it contains tasks belonging to different iterations, but because it contains two BMU, two VS, two FG, two PS, and two ROP tasks, it contains all ten tasks that make up one iteration.

Therefore, one iteration, which originally took 10 cycles, takes only 3 cycles in each steady-state interval.

Accordingly, TPC (Tasks Per Cycle), the task throughput per cycle, improves from 1 (= 10 tasks / 10 cycles) without software pipelining to about 3.3 (= 10 tasks / 3 cycles).

Parallel processing can thus be said to improve by a factor of about 3.3. Of course, the iterations do not fully overlap in the lead-in before the steady state is reached or in the final part where the iterations end, but this becomes negligible as the number of iterations grows.
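The effect of the lead-in and the tail can be estimated with a short calculation. The sketch below assumes that every iteration still takes 10 cycles from its first task to its last and that a new iteration starts every II = 3 cycles, so the last of N iterations starts at cycle (N - 1) * II. The function name is hypothetical and the model is a simplification, not a figure taken from the patent.

```python
def total_cycles(iterations, iteration_length, ii):
    """Cycles to finish all iterations when a new one starts every `ii` cycles."""
    return (iterations - 1) * ii + iteration_length


TASKS_PER_ITERATION = 10
ITERATION_LENGTH = 10  # cycles for one iteration without overlap
II = 3                 # final initiation interval of the embodiment

steady_state_tpc = TASKS_PER_ITERATION / II  # about 3.33 tasks per cycle

for n in (1, 10, 100, 10_000):
    pipelined = total_cycles(n, ITERATION_LENGTH, II)
    sequential = n * ITERATION_LENGTH
    print(n, pipelined, round(sequential / pipelined, 2))
```

Under these assumptions, the 100 iterations of this embodiment take 307 cycles instead of 1000, a speedup of about 3.26, and the speedup approaches 10/3 as the iteration count grows.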

FIG. 6 is a flowchart illustrating a computing method according to an embodiment of the present invention.

In step S610, the source code of an application is input. As described above, the application source code contains a unit of tasks that has a plurality of iterations.

According to the computing method of an embodiment of the present invention, the parallel processing efficiency of the given multiple cores and hardware units is maximized in processing these iterations.

Then, in step S620, the task profiler 110 obtains the processing latency of each of the plurality of tasks included in one iteration. This process is as described above with reference to FIGS. 1 and 3.

In step S630, the task analyzer 120 analyzes the dependencies among the plurality of tasks included in one iteration and generates the TDG. The TDG generation process and an exemplary result are as described above with reference to FIG. 4.

Then, in step S640, the scheduler 130 performs software pipelining on the plurality of iterations with reference to the task latencies, the TDG, and the descriptor of the cores and hardware.

The software pipelining process is described in more detail below with reference to FIG. 7.

Once software pipelining has been performed, all task iterations are scheduled, and the processing unit 140 processes the tasks according to the schedule.

FIG. 7 is a flowchart detailing the software pipelining step of a computing method according to an embodiment of the present invention.

In step S710, the mII of the task iteration is calculated. The scheduler determines the mII in consideration of the hardware structure, the task latencies, data load/store cycles, and the like. The determination of the mII is as described above with reference to FIG. 5.

In step S720, the scheduler searches for the optimal solution of the task mapping using the mII. As described above, this search may be a heuristic search.

According to an embodiment of the present invention, the optimal solution may be defined as the task mapping combination that maximizes TPC, the task throughput per cycle.

In step S730, it is determined whether an optimal solution has been derived. If so, in step S750 the scheduler finalizes the task mapping using the optimal solution and schedules all task iterations.

If it is determined in step S730 that no optimal solution has been derived, II is increased by 1 in step S740 and the search for the optimal solution is repeated.
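The control flow of steps S710 to S750 can be sketched as a short loop. The heuristic search itself is not specified in the text, so it is passed in as a callable here. `pipeline_search`, `try_map`, and the `max_ii` safety bound are hypothetical names introduced for this illustration, and `toy_try_map` is a stand-in that simply succeeds from II = 3 onward.

```python
from typing import Callable, Dict, Optional, Tuple

Mapping = Dict[str, str]  # task name -> unit name


def pipeline_search(
    mii: int,
    try_map: Callable[[int], Optional[Mapping]],
    max_ii: int,
) -> Tuple[int, Mapping]:
    """Increase II from mII until the mapping search succeeds."""
    ii = mii                          # S710: start from the minimum initiation interval
    while ii <= max_ii:
        mapping = try_map(ii)         # S720: search for a task mapping at this II
        if mapping is not None:       # S730: was a solution derived?
            return ii, mapping        # S750: finalize the mapping and schedule
        ii += 1                       # S740: add one cycle and search again
    raise ValueError("no mapping found up to max_ii")


def toy_try_map(ii: int) -> Optional[Mapping]:
    """Stand-in for the heuristic search; not the patent's algorithm."""
    if ii < 3:
        return None
    return {"VS": "core 3", "PS": "core 1"}


final_ii, mapping = pipeline_search(mii=2, try_map=toy_try_map, max_ii=16)
print(final_ii)  # 3
```

The `max_ii` bound is an addition for safety so that the loop always terminates; the flowchart as described repeats until a solution is found.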

Through this process, software pipelining such as that of the embodiment of FIG. 5 is performed, and in a multi-core environment the idle time of the cores and/or dedicated hardware units can be minimized and the parallelism of task processing increased.

The embodiments of the present invention described above are more systematic and efficient than the conventional method of monitoring for idle cores and assigning a task when one is found. Moreover, because the control for monitoring idle cores and assigning tasks is not complex, the approach becomes more useful as the number of cores grows.

The method according to an embodiment of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and constructed for the present invention or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code such as that produced by a compiler as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

Although the present invention has been described with reference to limited embodiments and drawings, it is not limited to these embodiments, and a person of ordinary skill in the field to which the invention belongs can make various modifications and variations from this description.

Therefore, the scope of the present invention should not be limited to the described embodiments but should be determined by the claims below and their equivalents.

100: Computing system
110: Task profiler
120: Task analyzer
130: Task scheduler
140: Processing unit

Claims

1. A computing system comprising:
a task profiler that calculates a latency of each of a plurality of tasks to be processed;
a task analyzer that performs dependency analysis among the plurality of tasks to extract a task dependency graph (TDG); and
a scheduler that performs software pipelining on the plurality of tasks using information on a plurality of cores and at least one piece of hardware,
wherein the plurality of tasks to be processed have a plurality of iterations, and
wherein the scheduler, with reference to the TDG, calculates a minimum initiation interval (mII), which is the minimum number of cycles between the start of processing of one iteration and the start of the next iteration among the plurality of iterations, and uses the mII to find an optimal solution of a task mapping in which the iterations can overlap.

2. (Deleted)

3. The computing system of claim 1,
wherein the scheduler uses the optimal solution to map the plurality of tasks to the plurality of cores and the at least one piece of hardware.

4. The computing system of claim 3,
wherein the scheduler, if the optimal solution is not found using the mII, adds one cycle at a time to the mII until the optimal solution is found, searches for the optimal solution at each resulting interval, and
uses the optimal solution to map the plurality of tasks to the plurality of cores and the at least one piece of hardware.

5. The computing system of claim 3,
wherein the scheduler maps the plurality of tasks to the plurality of cores and the at least one piece of hardware and finds the optimal solution by heuristic search.

6. The computing system of claim 3,
wherein the optimal solution is the task mapping combination that maximizes the processing TPC (Tasks Per Cycle) of the computing system.

7. The computing system of claim 1,
wherein the information on the plurality of cores and the at least one piece of hardware
includes at least one of: the number of the plurality of cores and the at least one piece of hardware, their internal structure, their topology, and their internal buffer sizes.

8. A computing method comprising:
calculating a latency of each of a plurality of tasks to be processed;
extracting a task dependency graph (TDG) by performing dependency analysis among the plurality of tasks; and
performing software pipelining on the plurality of tasks using information on a plurality of cores and at least one piece of hardware,
wherein the plurality of tasks to be processed have a plurality of iterations, and
wherein performing the software pipelining comprises:
calculating, with reference to the TDG, a minimum initiation interval (mII), which is the minimum number of cycles between the start of processing of one iteration and the start of the next iteration among the plurality of iterations; and
finding, using the mII, an optimal solution of a task mapping in which the iterations can overlap.

9. (Deleted)

10. The computing method of claim 8,
wherein performing the software pipelining comprises
mapping the plurality of tasks to the plurality of cores and the at least one piece of hardware using the optimal solution.

11. The computing method of claim 10,
wherein performing the software pipelining further comprises,
if the optimal solution is not found using the mII, adding one cycle at a time to the mII until the optimal solution is found and searching for the optimal solution at each resulting interval.

12. The computing method of claim 10,
wherein performing the software pipelining comprises
mapping the plurality of tasks to the plurality of cores and the at least one piece of hardware and finding the optimal solution by heuristic search.

13. The computing method of claim 10,
wherein the optimal solution is the task mapping combination that maximizes the processing TPC (Tasks Per Cycle) of the computing system on which the computing method is executed.

14. The computing method of claim 8,
wherein the information on the plurality of cores and the at least one piece of hardware
includes at least one of: the number of the plurality of cores and the at least one piece of hardware, their internal structure, their topology, and their internal buffer sizes.

15. A computer-readable recording medium storing a program for carrying out the method of any one of claims 8 to 14.