KR101109009B1

KR101109009B1 - A method for parallelizing irregular reduction on explicitly managed memory hierarchy

Info

Publication number: KR101109009B1
Application number: KR1020090110616A
Authority: KR
Inventors: 한환수; 김성건; 최광무
Original assignee: 선문대학교 산학협력단; 한국과학기술원
Priority date: 2009-11-17
Filing date: 2009-11-17
Publication date: 2012-01-31
Also published as: KR20110054102A

Abstract

The present invention relates to a method for parallelizing irregular reductions, comprising: step (a) of dividing data into blocks of a fixed size matched to the on-chip memory size of an explicitly managed memory hierarchy; step (b) of grouping the iterations of a loop that access the same data blocks and generating unit tasks for distributed execution from those groups; and step (c) of assigning the generated unit tasks to a plurality of devices on an event basis.

According to the present invention, on a parallel processing device having an explicitly managed memory hierarchy, iterations that access the same data blocks are grouped into unit tasks for distributed execution and assigned to a plurality of devices, which minimizes the time required to compute an irregular reduction and yields results quickly.

Irregular, reduction, parallel processing, IRREGULAR, REDUCTION

Description

A METHOD FOR PARALLELIZING IRREGULAR REDUCTION ON EXPLICITLY MANAGED MEMORY HIERARCHY

The present invention relates to a method for parallelizing irregular reductions, and more particularly to a technique for dividing the computation into tasks and assigning them to individual processing units so that an irregular reduction can be executed in a distributed manner on a parallel processing device having an explicitly managed memory hierarchy.

As the gap between processor speed and main memory speed has grown, performance degradation when reading and writing data has become a problem, and explicitly managed memory hierarchies have been proposed in place of conventional cache structures. The purpose of an explicitly managed memory hierarchy is to expose the memory hierarchy to the programmer, so that the programmer can raise execution performance by writing code with a data access scheme suited to each application. In a conventional cache-based environment, most of the data access process is handled automatically by the processor, which limits the programmer's ability to optimize for an application's characteristics. In an environment with an explicitly managed memory hierarchy, the programmer directly manages data movement between memory levels, which makes it easy to apply optimizations such as pinning frequently used data in on-chip memory or fetching data into on-chip memory before it is needed.

Irregular reductions are used frequently when problems in scientific fields such as molecular dynamics and fluid dynamics are solved by computer, and they are generally expressed in the form shown in FIG. 1. In every iteration, the loop of FIG. 1 computes a value specific to the purpose of the computation and accumulates that value into elements of an array using a reduction operator, that is, an operator for which the commutative law holds. The indices used to access the array are arbitrary functions that cannot be written as regular expressions of the loop variable, and they may take the form of indirection through another array.

Because computations carried out through irregular reductions generally take a long time, techniques have been developed that shorten execution time by distributing an irregular reduction across a high-performance parallel processing device. Proposed approaches include executing the iterations in parallel while protecting the updates with a critical section, and having each processing unit independently compute a partial result at the same time and then merging the partial results into the final result.

However, existing methods for distributed execution of irregular reductions do not provide software support for the data movement between levels of the memory hierarchy that occurs when data is read and written, so they cannot be used on parallel processing devices with an explicitly managed memory hierarchy. Many recent high-performance parallel processing devices adopt such a hierarchy, and running scientific programs that contain irregular reductions effectively on them requires a new distributed execution method that takes the characteristics of the memory hierarchy into account.

An object of the present invention is to enable distributed execution of irregular reductions on a parallel processing device having an explicitly managed memory hierarchy by providing what existing techniques do not consider: a method of partitioning the computation in light of the characteristics of the explicitly managed memory hierarchy, a method of assigning each partitioned task to a processing unit, and a method of minimizing the cost of reading and writing the data each task needs from and to main memory.

To achieve this object, the method for parallelizing irregular reductions according to the present invention includes: step (a) of dividing data into blocks of a fixed size matched to the on-chip memory size of an explicitly managed memory hierarchy; step (b) of grouping the iterations of a loop that access the same data blocks and generating unit tasks for distributed execution from those groups; and step (c) of assigning the generated unit tasks to a plurality of devices on an event basis.

In addition, step (a) includes step (a-1) of computing the size of the data block through [Equation 1] below, which combines the size of the on-chip memory area available for data allocation, the number of dimensions of the iteration space, and the element size of each array used in the computation.

[Equation 1]

In addition, step (b) includes step (b-1) of generating the unit tasks for distributed execution by performing bucket sorting with the block size used as the bucket size.

In addition, step (c) includes: step (c-1) in which a scheduler thread passes to a worker thread the live-in variables that signal the start of work; step (c-2) in which the worker thread requests a task assignment from the scheduler thread; step (c-3) in which the scheduler thread determines whether any of the tasks still to be assigned is data independent; step (c-4) in which, if step (c-3) finds such a task, the scheduler thread assigns to the worker thread a task that is data independent of the tasks currently being processed; and step (c-5) in which the worker thread executes the assigned task and, on completion, passes the live-out variables to the scheduler thread.

The method further includes step (c-6) in which, if step (c-3) finds no data-independent task among the tasks still to be assigned, the scheduler waits for a predetermined time until a task currently being processed completes and then returns to step (c-3).

According to the present invention as described above, on a parallel processing device having an explicitly managed memory hierarchy, iterations that access the same data blocks are grouped into unit tasks for distributed execution and assigned to a plurality of devices, which minimizes the time required to compute an irregular reduction and yields results quickly.

Specific features and advantages of the present invention will become more apparent from the following detailed description taken with the accompanying drawings. Before that, the terms and words used in this specification and the claims should be interpreted with meanings and concepts consistent with the technical idea of the invention, on the principle that an inventor may appropriately define the concept of a term in order to describe the invention in the best way. Detailed descriptions of well-known functions and configurations related to the invention are omitted where they would unnecessarily obscure the gist of the invention.

FIG. 2 shows an example of an irregular reduction taken from a molecular dynamics algorithm, to which the parallelization method of the present invention is applied.

The loop illustrated in FIG. 2 (the body following 'for') computes, in each iteration, the interaction force for one pair of molecules. The loop works as follows.

First, the indices p1 and p2 of the two molecules are read from the array interactions, which lists the interacting molecule pairs.

Next, the position of each molecule is read from the array positions.

Then the interaction force d_force is computed and accumulated into force, the array that holds the force acting on each molecule.

In the example of FIG. 2, the irregular reduction targets the array force. In what follows, force is called the 'reduction array'; positions, which is accessed with the same indices as the reduction array, is called the 'auxiliary array'; and interactions, which determines the indices into the reduction array, is called the 'index array'.
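The loop described above can be sketched in Python as follows. FIG. 2 itself is not reproduced in this text, so the details are assumptions: `pair_force` is a hypothetical one-dimensional stand-in for the real force computation, and the opposite-sign update of `force[p2]` is assumed from the usual pairwise-force convention rather than taken from the figure.

```python
def pair_force(pos_a, pos_b):
    """Hypothetical stand-in for the force law: a 1-D position difference."""
    return pos_b - pos_a


def accumulate_forces(interactions, positions):
    """Serial irregular reduction over the pair list (the index array)."""
    force = [0.0] * len(positions)        # reduction array
    for p1, p2 in interactions:           # p1, p2 come from the index array
        d_force = pair_force(positions[p1], positions[p2])  # auxiliary array reads
        force[p1] += d_force              # accumulate into the reduction array
        force[p2] -= d_force              # assumed: equal and opposite force
    return force
```

Because `p1` and `p2` are read from `interactions` at run time, the elements of `force` and `positions` touched by an iteration cannot be determined by inspecting the loop alone.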

In the loop of FIG. 2, the addresses of the reduction array and the auxiliary array accessed in each iteration are determined indirectly by the index array, so they cannot be predicted when the code is written. Consequently, code written for an explicitly managed memory hierarchy would have to issue data movement code in every iteration, which is inefficient.

FIG. 3 shows, for the case where the reduction array and the auxiliary array of FIG. 2 are divided into blocks of blocksize elements, which region each iteration falls into according to its p1 and p2 values.

For example, with blocksize equal to 100, an iteration that belongs to R(1, 2) accesses the second and third blocks of the reduction array and the auxiliary array. Similarly, an iteration with p1 equal to 150 and p2 equal to 220 also belongs to R(1, 2). Because the two iterations access the same data blocks, processing them together requires no additional data movement code.
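The mapping from an iteration to its region can be illustrated with a short sketch. The function name `region` is an assumption made for this example; the integer division by the block size follows from the example values in the text, where blocks are numbered from zero.

```python
def region(p1, p2, blocksize):
    """Return (a, b) such that the iteration with indices p1, p2 lies in R(a, b)."""
    return (p1 // blocksize, p2 // blocksize)


# The iteration from the text: p1 = 150, p2 = 220, blocksize = 100.
example = region(150, 220, 100)   # (1, 2): the second and third blocks
```

Any two iterations for which `region` returns the same pair need the same blocks in on-chip memory and can share one transfer.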

As described above, when iterations that use the same set of data blocks are collected and processed together, the entire set of data blocks needed for the computation must be able to reside in on-chip memory.

In the example of FIG. 3, the maximum number of data blocks needed to compute R(a, b) is two for each of the reduction array (force) and the auxiliary array (positions), which equals the number of dimensions of the iteration space.

As shown in FIG. 4, the method for parallelizing irregular reductions according to the present invention first divides the data into blocks of a fixed size matched to the size of the on-chip memory (S10).

Next, the iterations of the loop are grouped by the data blocks they access, and each group becomes a unit task for distributed execution (S20).

The generated unit tasks are then assigned to the processing units in response to events (S30).

When the data is divided into fixed-size blocks in step (a), the block size is determined by [Equation 1].

[Equation 1]

The quantities in Equation 1 are the block size, the size of the on-chip memory area available for data allocation, the number of dimensions of the iteration space shown in FIG. 3, and the element size of each array.
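Equation 1 is not reproduced in this text, so the sketch below does not claim to be the patent's formula. It only encodes the capacity constraint implied by the description: each array needs as many resident blocks as the iteration space has dimensions, so the blocks fit on chip when `num_dims * block_size * sum(element_sizes)` does not exceed the available area. The function name, parameter names, and sample values are all assumptions.

```python
def max_block_size(on_chip_bytes, num_dims, element_sizes):
    """Largest block size (in elements) whose working set fits on chip.

    Assumed constraint, not the patent's Equation 1:
        num_dims * block_size * sum(element_sizes) <= on_chip_bytes
    """
    bytes_per_index = num_dims * sum(element_sizes)
    if bytes_per_index <= 0:
        raise ValueError("need at least one dimension and one non-empty array")
    return on_chip_bytes // bytes_per_index


# Hypothetical numbers: 192 KiB usable, 2-D iteration space,
# force and positions elements of 24 bytes each (three doubles).
block = max_block_size(192 * 1024, 2, [24, 24])   # 2048
```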

In step (b), as illustrated in FIG. 5, iterations that access the same data blocks are collected into a unit task for distributed execution.

Each rectangle in FIG. 5 represents (p1, p2), the array indices accessed in one iteration. When the iterations, originally ordered as in the upper part of FIG. 5, are rearranged according to the blocks they access, the result shown in the lower part of FIG. 5 is obtained.

That is, step (b), which rearranges the iterations, can be carried out by bucket sorting with the block size determined in step (a) used as the bucket size (S21).
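A minimal sketch of this bucket sort follows. It assumes the index array is a list of `(p1, p2)` pairs and represents each unit task as the list of iteration numbers that fall into one region; `make_unit_tasks` is a name introduced for this example.

```python
from collections import defaultdict


def make_unit_tasks(interactions, blocksize):
    """Group iteration numbers by the region (pair of blocks) they access."""
    buckets = defaultdict(list)
    for i, (p1, p2) in enumerate(interactions):
        buckets[(p1 // blocksize, p2 // blocksize)].append(i)
    # Sort the regions only to make the result order deterministic.
    return dict(sorted(buckets.items()))


tasks = make_unit_tasks([(150, 220), (10, 120), (199, 201), (5, 150)], 100)
# tasks == {(0, 1): [1, 3], (1, 2): [0, 2]}
```

Each bucket is filled in a single pass over the index array, so the regrouping cost is linear in the number of iterations.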

In step (c), the unit tasks determined in step (b) are assigned to a plurality of devices and executed in parallel, following the event passing scheme shown in FIGS. 6 and 7.

When a unit task is newly assigned to a processing unit, it must be data independent of the tasks currently being processed, so that no race condition arises during parallel execution.

Tasks are data independent when they do not use any of the same data blocks. For example, in FIG. 3, R(0, 1) and R(2, 3) use only different blocks and are therefore data independent, whereas R(1, 2) and R(1, 3) share a block and are therefore data dependent.
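The independence test can be written as a set-disjointness check on block numbers. The sketch assumes a task is identified by its region tuple, as in R(a, b); the function name is hypothetical.

```python
def is_data_independent(region_a, region_b):
    """True when the two regions share no block number."""
    return set(region_a).isdisjoint(region_b)


independent = is_data_independent((0, 1), (2, 3))   # True: no block in common
dependent = is_data_independent((1, 2), (1, 3))     # False: both use block 1
```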

Step (c) is described in detail with reference to FIGS. 6 and 7. First, the scheduler thread passes to the worker thread the live-in variables that signal the start of work (S31).

Next, the worker thread requests a task assignment from the scheduler thread (S32).

The scheduler thread then determines whether any of the tasks still to be assigned is data independent (S33).

If step S33 finds such a task, the scheduler thread assigns to the worker thread a task that is data independent of the tasks currently being processed (S34).

The worker thread then executes the assigned task and, on completion, passes the live-out variables to the scheduler thread (S35).

If step S33 finds no data-independent task among the tasks still to be assigned, the scheduler waits for a predetermined time until a task currently being processed completes and then returns to step S33 (S36).
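The event flow of steps S31 to S36 can be sketched with Python threads and queues. This is an illustration under several assumptions, not the patent's implementation: the calling thread plays the scheduler, regions are tuples of block numbers, the live-in value is a placeholder dictionary, the live-out value is whatever the caller-supplied `execute` returns, and the timed wait of S36 is replaced by blocking until the next worker event arrives.

```python
import queue
import threading


def run_unit_tasks(tasks, num_workers, execute):
    """Run unit tasks so that tasks running at the same time share no block.

    tasks maps a region (tuple of block numbers) to its list of iterations;
    execute(region, iterations) is supplied by the caller.
    """
    events = queue.Queue()                                # worker -> scheduler
    inbox = [queue.Queue() for _ in range(num_workers)]   # scheduler -> worker
    results = {}

    def worker(wid):
        inbox[wid].get()                                  # S31: receive live-in
        while True:
            events.put(("request", wid, None, None))      # S32: ask for a task
            region = inbox[wid].get()
            if region is None:                            # no work left
                return
            out = execute(region, tasks[region])
            events.put(("done", wid, region, out))        # S35: send live-out

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for q in inbox:
        q.put({"start": True})                            # S31: placeholder live-in

    pending, busy, idle, retired = list(tasks), set(), [], 0
    while retired < num_workers:
        kind, wid, region, out = events.get()             # block until an event
        if kind == "done":
            results[region] = out
            busy.difference_update(region)                # blocks are free again
        else:
            idle.append(wid)
        waiting = []
        for w in idle:
            # S33: is there a task independent of everything now running?
            choice = next((r for r in pending if busy.isdisjoint(r)), None)
            if choice is not None:                        # S34: assign it
                pending.remove(choice)
                busy.update(choice)
                inbox[w].put(choice)
            elif pending:                                 # S36: retry after a completion
                waiting.append(w)
            else:
                inbox[w].put(None)                        # tell the worker to stop
                retired += 1
        idle = waiting

    for t in threads:
        t.join()
    return results
```

A worker is left waiting only while some other task holds a conflicting block, so every wait is ended by a later "done" event and the loop cannot stall.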

Although the present invention has been described and illustrated with reference to a preferred embodiment that exemplifies its technical idea, the invention is not limited to the exact configuration and operation shown and described. Those skilled in the art will understand that many changes and modifications can be made without departing from the scope of the technical idea. All such appropriate changes, modifications, and equivalents should therefore be regarded as falling within the scope of the present invention.

FIG. 1 is an exemplary diagram showing code of the general form used to implement an irregular reduction.

FIG. 2 is an exemplary diagram showing an irregular reduction from a molecular dynamics algorithm, to which the parallelization method of the present invention is applied.

FIG. 3 is an exemplary diagram showing which region each iteration falls into according to its p1 and p2 values when the reduction array and the auxiliary array of FIG. 2 are divided by blocksize.

FIG. 4 is a flowchart showing the method for parallelizing irregular reductions according to the present invention.

FIG. 5 is an exemplary diagram supplementing the description of step (b) of the method according to the present invention.

FIG. 6 is a flowchart showing step (c) of the method according to the present invention.

FIG. 7 is an exemplary diagram supplementing the description of step (c) of the method according to the present invention.

Claims

A method for parallelizing an irregular reduction, comprising:

(a) dividing data into blocks of a fixed size matched to the on-chip memory size of an explicitly managed memory hierarchy;

(b) grouping the iterations of a loop that access the same data blocks and generating unit tasks for distributed execution from those groups; and

(c) assigning the generated unit tasks to a plurality of devices on an event basis,

wherein step (b) includes generating the unit tasks for distributed execution by performing bucket sorting with the block size used as the bucket size.

The method of claim 1,

wherein step (a) includes:

(a-1) computing the size of the data block through [Equation 1] below, which combines the size of the on-chip memory area available for data allocation, the number of dimensions of the iteration space, and the element size of each array used in the computation.

[Equation 1]

(Deleted)

The method of claim 1,

wherein step (c) includes:

(c-1) the scheduler thread passing to the worker thread the live-in variables that signal the start of work;

(c-2) the worker thread requesting a task assignment from the scheduler thread;

(c-3) the scheduler thread determining whether any of the tasks still to be assigned is data independent;

(c-4) if step (c-3) finds such a task, the scheduler thread assigning to the worker thread a task that is data independent of the tasks currently being processed; and

(c-5) the worker thread executing the assigned task and, on completion, passing the live-out variables to the scheduler thread.

The method of claim 4, further comprising:

(c-6) if step (c-3) finds no data-independent task among the tasks still to be assigned, waiting for a predetermined time until a task currently being processed completes and then returning to step (c-3).