KR102382336B1

KR102382336B1 - Method for computing tridiagonal matrix

Info

Publication number: KR102382336B1
Application number: KR1020190113981A
Authority: KR
Inventors: 최정일; 김기하; 강지훈
Original assignee: 연세대학교 산학협력단; 한국과학기술정보연구원
Priority date: 2019-09-17
Filing date: 2019-09-17
Publication date: 2022-04-04
Also published as: KR20210032670A

Abstract

The present invention is a method for solving tridiagonal systems on a multi-core distributed memory system. The method includes: each of a plurality of cores obtaining, from each of a plurality of successively supplied tridiagonal systems, a partitioned matrix divided row-wise into a number of parts corresponding to the number of cores, modifying it in a predetermined manner to obtain a modified system, and extracting the first and last rows of the modified system to obtain a reduced modified system; transmitting the reduced modified systems obtained from each of the tridiagonal systems to the corresponding core among the plurality of cores according to a predetermined order of the cores; each core combining the reduced modified systems transmitted to it to obtain and store a reduced tridiagonal system; the cores computing in parallel the solutions of the obtained reduced modified systems; and computing the remaining solution of each tridiagonal system using the solution of the reduced tridiagonal system. The method can thereby maximize the computational efficiency of a multi-core distributed memory system.

Description

Method for computing tridiagonal matrix {METHOD FOR COMPUTING TRIDIAGONAL MATRIX}

The present invention relates to a method for solving tridiagonal systems, and more particularly to a method that allows one or more tridiagonal systems to be solved efficiently in parallel on a multi-core distributed memory system.

A tridiagonal system is a system of linear equations whose coefficient matrix has entries only on three diagonals: the main diagonal and the diagonals immediately above and below it. Such systems arise frequently when problems in fluid mechanics, heat transfer, quantum mechanics, electromagnetics, and similar fields are solved numerically.

The Thomas algorithm, widely used to solve tridiagonal systems, is a special form of Gaussian elimination. It processes the equations sequentially and, in terms of operation count alone, is the most efficient method. However, because its computation proceeds sequentially, it cannot be parallelized. When applied to a high-performance computing system such as a multi-core distributed memory system, it cannot make use of parallel computation, and its efficiency drops sharply.
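For reference, the sequential dependence described above can be seen in a minimal Python sketch of the standard Thomas algorithm. The function name `thomas` and the list-based storage of the three diagonals are illustrative choices, not taken from the patent, and the sketch assumes a system that needs no pivoting (for example, a diagonally dominant one).

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system given its sub-, main and super-diagonal.

    a[0] and c[-1] lie outside the matrix and do not affect the result.
    """
    n = len(b)
    cp = [0.0] * n  # modified super-diagonal coefficients
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward sweep: eliminate the sub-diagonal, one row after another.
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    # Back substitution: each unknown depends on the one after it.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Each iteration of both loops reads a value produced by the previous iteration, which is why the loops cannot simply be split across cores.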

The parallel cyclic reduction (PCR) algorithm was devised to allow tridiagonal systems to be solved in parallel. It is a recursive algorithm that eliminates unknowns by grouping the equations three at a time, which has the advantage of allowing parallel processing, but its basic efficiency is lower than that of the Thomas algorithm. The cost of this efficiency gap grows as the tridiagonal systems to be solved become larger and more numerous. In addition, the PCR algorithm is not well suited to distributed memory systems.
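The elimination in groups of three equations can be sketched as follows. This is a minimal single-process Python illustration of textbook parallel cyclic reduction, with the hypothetical function name `pcr`; it is not code from the patent and assumes no pivoting is required. The inner loop over `i` has no dependence between iterations, which is what a parallel implementation would exploit.

```python
def pcr(a, b, c, d):
    """Parallel cyclic reduction, simulated sequentially.

    a, b, c are the sub-, main and super-diagonal; a[0] and c[-1] must be 0.
    """
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)
    stride = 1
    while stride < n:
        na, nb, nc, nd = [0.0] * n, list(b), [0.0] * n, list(d)
        # Row i is combined with rows i-stride and i+stride only.
        for i in range(n):
            lo, hi = i - stride, i + stride
            if lo >= 0:
                alpha = -a[i] / b[lo]   # cancels the coupling to x[lo]
                na[i] = alpha * a[lo]
                nb[i] += alpha * c[lo]
                nd[i] += alpha * d[lo]
            if hi < n:
                gamma = -c[i] / b[hi]   # cancels the coupling to x[hi]
                nc[i] = gamma * c[hi]
                nb[i] += gamma * a[hi]
                nd[i] += gamma * d[hi]
        a, b, c, d = na, nb, nc, nd
        stride *= 2
    # Every remaining off-diagonal coupling now points outside the system.
    return [d[i] / b[i] for i in range(n)]
```

Every step updates all N rows, and about log2(N) steps are needed, so the total work exceeds that of the Thomas algorithm even though each step is parallel.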

An algorithm for solving a tridiagonal system in parallel on a distributed memory system has also been devised (Mattor et al. 1995). This method first gathers the element values prepared at each compute node to form a small sub-tridiagonal system and solves it. Each compute node then uses the solution of the sub-tridiagonal system to obtain, in parallel, the solution of the original tridiagonal system. However, when a single tridiagonal system is solved, every compute node redundantly solves the same sub-tridiagonal system, so efficiency remains limited.

US Patent Application Publication No. 2019/0153824 (published May 23, 2019)

An object of the present invention is to provide a method that can solve tridiagonal systems efficiently on a multi-core distributed memory system.

To achieve the above object, a method for solving tridiagonal systems on a multi-core distributed memory system according to an embodiment of the present invention includes: each of a plurality of cores obtaining, from each of a plurality of successively supplied tridiagonal systems, a partitioned matrix divided row-wise into a number of parts corresponding to the number of cores, modifying it in a predetermined manner to obtain a modified system, and extracting the first and last rows of the modified system to obtain a reduced modified system; transmitting the reduced modified systems obtained from each of the tridiagonal systems to the corresponding core among the plurality of cores according to a predetermined order of the cores; each of the cores combining the reduced modified systems transmitted to it to obtain and store a reduced tridiagonal system; each of the cores computing in parallel the solutions of the obtained reduced modified systems; and computing the remaining solution of each tridiagonal system using the solution of the reduced tridiagonal system.

In the transmitting step, the reduced modified systems obtained from the plurality of tridiagonal systems may be transmitted to the plurality of cores in a predetermined order, following the order in which the tridiagonal systems were supplied.

In the step of computing the solutions of the reduced modified systems, if the number of stored reduced modified systems exceeds the number of cores, the solutions of as many reduced modified systems as there are cores may be computed in parallel, and the solutions of the remaining reduced modified systems may then be computed in parallel in batches of the number of cores.

In the step of obtaining the modified system, the partitioned matrix may be modified according to the modified Thomas algorithm, and in the step of computing the solution of the reduced tridiagonal system, the computation may be performed according to the Thomas algorithm.

The step of computing the remaining solution may include: distributing the computed solution of the reduced tridiagonal system to the plurality of cores; and substituting the solution of the reduced tridiagonal system and the corresponding modified system into the update algorithm of the Thomas algorithm.

In the distributing step, each part of the solution of the reduced tridiagonal system may be transmitted to the core that sent the corresponding reduced modified system.

Accordingly, the method according to an embodiment of the present invention improves parallel scalability when tridiagonal systems are solved on multiple cores and can provide computational performance optimized for a multi-core distributed memory system. It can reduce load by minimizing inter-core communication volume and idle time, and can improve load balance by preventing redundant computation, thereby maximizing computational efficiency.

FIG. 1 shows a method for solving tridiagonal systems according to an embodiment of the present invention.
FIG. 2 shows an example of a tridiagonal system.
FIG. 3 shows an example of the tridiagonal system as modified in the tridiagonal system modification step of FIG. 1.
FIG. 4 shows an example of the reduced tridiagonal system obtained in the modified-system reduction step of FIG. 1.
FIG. 5 shows an example of distributed computation of the reduced tridiagonal system.
FIG. 6 visualizes the overall computation process of the method of FIG. 1.
FIG. 7 visualizes the data transmitted between cores when multiple tridiagonal systems are solved.
FIG. 8 shows simulation results comparing the performance of the method according to an embodiment of the present invention.

To fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference should be made to the accompanying drawings, which illustrate preferred embodiments of the present invention, and to the contents described in them.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described. Parts unrelated to the description are omitted for clarity, and the same reference numerals in the drawings denote the same members.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. Terms such as "...unit", "...device", "module", and "block" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, software, or a combination of hardware and software.

FIG. 1 shows a method for solving tridiagonal systems according to an embodiment of the present invention, FIG. 2 shows an example of a tridiagonal system, and FIG. 3 shows an example of the tridiagonal system as modified in the tridiagonal system modification step of FIG. 1. FIG. 4 shows an example of the reduced tridiagonal system obtained in the modified-system reduction step of FIG. 1, and FIG. 5 shows an example of distributed computation of the reduced tridiagonal system.

Referring to FIGS. 1 to 5, the method according to this embodiment first obtains the plurality of tridiagonal systems to be solved (S11).

Here, a tridiagonal system is a matrix equation of the form Ax = d, where A is an N × N tridiagonal matrix and x and d are column vectors of length N. It is a system of linear equations in which, given the tridiagonal matrix A with elements a_i, b_i, c_i (i = 1, …, N) and the right-hand-side column vector d with elements d_i, the unknown column vector x with elements x_i is sought.

That is, the tridiagonal system can be expressed as Equation 1.

Here, the values of a_1 and c_N are 0.
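Equation 1 itself is not reproduced in this text. From the definitions of a_i, b_i, c_i, d_i and x_i given above, a tridiagonal system is conventionally written row by row as follows; this is a reconstruction from those definitions, not a copy of the patent's equation.

```latex
a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i, \qquad i = 1, \dots, N, \qquad a_1 = c_N = 0
```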

Once the plurality of tridiagonal systems has been obtained, each tridiagonal system is divided evenly by rows, as shown in FIG. 2, according to the number of cores performing the computation (S12).

The method according to this embodiment is premised on being applied to a multi-core distributed memory system. In such a system each core can compute independently, so parallelizing the solution of tridiagonal systems can greatly increase computational efficiency. However, the cores in a distributed memory system do not share memory, so the results computed on each core must be exchanged through inter-core communication, which generates a large amount of communication traffic. Moreover, when many tridiagonal systems are solved in succession, a particular core receives the results computed across the cores and performs the final computation, and the remaining cores stay idle during that time, so efficiency cannot be maximized. This embodiment therefore parallelizes the solution of tridiagonal systems and performs the parallel computation in a time-division-multiplexed manner so that computational efficiency is maximized.

To this end, this embodiment divides each tridiagonal system according to the number of cores so that the cores can compute simultaneously in parallel.

When there are P cores, the N × N tridiagonal matrix A can be divided by rows into groups of m = N/P rows and provided to the P cores. For example, as shown in FIG. 2, when the tridiagonal matrix is 12 × 12 (N = 12) and there are 3 cores, the matrix can be divided evenly into groups of 4 rows. Blank entries in the matrix of FIG. 2 are elements with value 0. When the tridiagonal matrix is divided evenly into groups of m rows, the columns are correspondingly divided into groups of m columns. That is, the matrix is divided into P × P sub-matrices of size m × m.
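The row-wise split can be sketched in a few lines of Python. The function name `partition_rows` and the storage of the matrix as three diagonal lists plus a right-hand side are illustrative assumptions; the sketch also assumes that N is divisible by P, as in the 12 × 12, 3-core example.

```python
def partition_rows(a, b, c, d, num_cores):
    """Split the diagonals and right-hand side into num_cores blocks of m rows."""
    n = len(b)
    if n % num_cores != 0:
        raise ValueError("this sketch assumes N is divisible by P")
    m = n // num_cores
    blocks = []
    for p in range(num_cores):
        rows = slice(p * m, (p + 1) * m)  # rows owned by core p
        blocks.append((a[rows], b[rows], c[rows], d[rows]))
    return blocks
```

With N = 12 and `num_cores=3`, core 0 receives rows 0 to 3, core 1 rows 4 to 7, and core 2 rows 8 to 11 (0-based).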

Each partitioned system can be expressed as Equation 2.

Here, p is the index of the partition, with 0 ≤ p ≤ P-1, and x_0^p and x_{m+1}^p correspond to x_m^{p-1} and x_1^{p+1}, respectively.
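Equation 2 is likewise not reproduced in this text. Assuming the same row form as the full system with a superscript p for the partition, one consistent way to write it is the following; the superscripted coefficient names are a notational assumption.

```latex
a_i^{p} x_{i-1}^{p} + b_i^{p} x_i^{p} + c_i^{p} x_{i+1}^{p} = d_i^{p},
\qquad i = 1, \dots, m, \qquad 0 \le p \le P-1,
\qquad x_0^{p} = x_m^{p-1}, \quad x_{m+1}^{p} = x_1^{p+1}
```

The two boundary unknowns are the only places where partition p touches its neighbors, which is what later allows each partition to be processed independently.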

Once the system has been partitioned, each core that receives a partitioned system modifies it by applying the modified Thomas algorithm of Table 1 (S13).

The modified Thomas algorithm is a known technique for solving tridiagonal systems. When the tridiagonal matrix is transformed according to it, all diagonal elements become 1, as shown in FIG. 3. In addition, the first element of each row, that is, the first element within the partitioned system, is converted to a nonzero value.
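Table 1 is not reproduced in this text. The sketch below follows one commonly published formulation of a modified Thomas algorithm and is an assumption about the details, not the patent's table. The function name `modified_thomas` and the 0-based indexing are illustrative, and the sketch assumes at least three rows per partition and no need for pivoting.

```python
def modified_thomas(a, b, c, d):
    """Transform one partition so that every row has a unit diagonal.

    On return (0-based rows, m rows per partition):
      row 0:        a[0]*x_prev + x[0]   + c[0]*x[m-1]    = d[0]
      rows 1..m-2:  a[i]*x[0]   + x[i]   + c[i]*x[m-1]    = d[i]
      row m-1:      a[m-1]*x[0] + x[m-1] + c[m-1]*x_next  = d[m-1]
    where x_prev / x_next belong to the neighbouring partitions.
    """
    m = len(b)
    if m < 3:
        raise ValueError("this sketch assumes at least 3 rows per partition")
    a, c, d = list(a), list(c), list(d)
    # Normalize the first two rows by their diagonal entries.
    for i in (0, 1):
        r = 1.0 / b[i]
        a[i] *= r
        c[i] *= r
        d[i] *= r
    # Forward elimination: remove x[i-1] from row i; fill-in appears in column 0.
    for i in range(2, m):
        r = 1.0 / (b[i] - a[i] * c[i - 1])
        d[i] = r * (d[i] - a[i] * d[i - 1])
        c[i] = r * c[i]
        a[i] = -r * a[i] * a[i - 1]
    # Backward elimination: remove x[i+1] from row i; fill-in appears in column m-1.
    for i in range(m - 3, 0, -1):
        d[i] -= c[i] * d[i + 1]
        a[i] -= c[i] * a[i + 1]
        c[i] = -c[i] * c[i + 1]
    # Remove x[1] from row 0 using row 1.
    r = 1.0 / (1.0 - a[1] * c[0])
    d[0] = r * (d[0] - c[0] * d[1])
    a[0] = r * a[0]
    c[0] = -r * c[0] * c[1]
    return a, c, d
```

After this transformation only the first and last rows of a partition still refer to unknowns owned by other cores, and every interior row depends only on the partition's own first and last unknowns.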

Each core can then transform the equations for the first row and the m-th row of the modified system as in Equation 3.

The equations for the second through (m-1)-th rows can be transformed as in Equation 4.

In Equations 3 and 4, the coefficient symbols (rendered as images in the original and not reproduced here) denote the element coefficients modified by the modified Thomas algorithm.

Meanwhile, replacing x_0^p and x_{m+1}^p with x_m^{p-1} and x_1^{p+1} turns Equation 3 into Equation 5.
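Equations 3 to 5 are not reproduced in this text. One form consistent with the description (unit diagonal after modification, first and m-th rows coupled to the neighboring partitions, interior rows expressed through the first and m-th unknowns) is given below. The primed coefficient names are an assumed notation for the modified coefficients, and the exact form in the patent may differ.

```latex
% Equation 3 (assumed form): first and m-th rows of the modified partition
a'^{p}_{1} x^{p}_{0} + x^{p}_{1} + c'^{p}_{1} x^{p}_{m} = d'^{p}_{1}, \qquad
a'^{p}_{m} x^{p}_{1} + x^{p}_{m} + c'^{p}_{m} x^{p}_{m+1} = d'^{p}_{m}

% Equation 4 (assumed form): rows i = 2, ..., m-1
x^{p}_{i} = d'^{p}_{i} - a'^{p}_{i} x^{p}_{1} - c'^{p}_{i} x^{p}_{m}

% Equation 5 (assumed form): Equation 3 with x_0^p = x_m^{p-1} and x_{m+1}^p = x_1^{p+1}
a'^{p}_{1} x^{p-1}_{m} + x^{p}_{1} + c'^{p}_{1} x^{p}_{m} = d'^{p}_{1}, \qquad
a'^{p}_{m} x^{p}_{1} + x^{p}_{m} + c'^{p}_{m} x^{p+1}_{1} = d'^{p}_{m}
```

Written this way, the 2P equations of the assumed Equation 5 involve only the unknowns x_1^p and x_m^p, each coupled to its immediate neighbors in the ordering x_1^0, x_m^0, x_1^1, x_m^1, and so on, so together they again form a tridiagonal system.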

Once its partitioned system has been modified, each core reduces the modified system by extracting only the first row and the m-th row (S14).

In FIGS. 3 and 4, elements computed on different cores are shown in different colors to make visible how each of the three cores modifies and reduces its own partitioned system.

Once a modified system has been reduced, it is determined whether all of the obtained tridiagonal systems have been modified and reduced (S15). If any obtained tridiagonal system has not yet been modified and reduced, the next tridiagonal system to be processed is partitioned, modified, and reduced to obtain its reduced modified system. Each reduced modified system may be stored temporarily in the memory associated with each core.

Once all tridiagonal systems are determined to have been modified and reduced, each of the P cores transmits its stored reduced modified systems to different cores (S16). Based on the order in which the reduced modified systems were obtained, the P cores send the reduced modified systems obtained in the same time interval to one core according to a predetermined order, and send those obtained in the next time interval to the next designated core.
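The text only says that destinations follow a predetermined order. As an illustration, the sketch below assumes a simple round-robin order in which the k-th tridiagonal system is assigned to core k mod P; both the round-robin rule and the names `owner_core` and `send_plan` are assumptions, not taken from the patent.

```python
def owner_core(system_index, num_cores):
    """Core that collects the reduced rows of one tridiagonal system (round-robin)."""
    return system_index % num_cores


def send_plan(num_systems, num_cores):
    """Map each destination core to the systems whose reduced rows it receives."""
    plan = {q: [] for q in range(num_cores)}
    for k in range(num_systems):
        plan[owner_core(k, num_cores)].append(k)
    return plan
```

For example, with 7 systems and 3 cores, core 0 would collect the reduced rows of systems 0, 3 and 6, core 1 those of systems 1 and 4, and core 2 those of systems 2 and 5.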

Because the core that assembles a reduced tridiagonal system receives from each of the other cores a reduced modified system of only two rows, the communication volume is far smaller than if all m rows were received from each core. Inter-core communication efficiency is thus greatly improved.

A core that has received reduced modified systems from the other cores combines them to obtain and store a reduced tridiagonal system as shown in FIG. 4 (S17).
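The combination step can be sketched as follows, assuming that each core contributes its first and last modified rows as `(a, c, d)` triples with a unit diagonal and that the reduced unknowns are ordered partition by partition (first row, then last row). The function name `assemble_reduced` and this data layout are illustrative assumptions.

```python
def assemble_reduced(reduced_rows):
    """Stack the two rows from each of the P cores into a 2P-row tridiagonal system.

    reduced_rows[p] = ((a_first, c_first, d_first), (a_last, c_last, d_last))
    as sent by core p. Returns the sub-diagonal, diagonal, super-diagonal
    and right-hand side of the reduced system.
    """
    a, b, c, d = [], [], [], []
    for first, last in reduced_rows:
        for a_i, c_i, d_i in (first, last):
            a.append(a_i)   # couples to the previous reduced unknown
            b.append(1.0)   # the diagonal is 1 after the modification step
            c.append(c_i)   # couples to the next reduced unknown
            d.append(d_i)
    return a, b, c, d
```

For P = 3 cores this yields a 6-row tridiagonal system, regardless of how many rows m each partition originally had.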

If the number of obtained tridiagonal systems exceeds the number of cores, each core may repeatedly receive and combine reduced modified systems according to the predetermined order to obtain and store further reduced tridiagonal systems.

Once reduced tridiagonal systems have been obtained for all of the obtained tridiagonal systems, the cores solve the reduced tridiagonal systems simultaneously in parallel (S18).

As shown in FIG. 4, the reduced tridiagonal system is itself tridiagonal, and each of the P cores computes its solution using the Thomas algorithm of Tables 2 and 3.

Once the solutions of the reduced tridiagonal systems have been computed by the Thomas algorithm, it is determined whether all reduced tridiagonal systems have been solved (S19). As noted above, when the number of obtained tridiagonal systems exceeds the number of cores, the cores cannot solve all stored reduced tridiagonal systems at the same time; the number that can be solved in parallel at once is limited to the number of cores. It is therefore determined whether all reduced tridiagonal systems have been solved, and if any remain, the cores again solve different reduced tridiagonal systems simultaneously in parallel (S18).
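The S18/S19 loop amounts to processing the reduced systems in rounds of at most P. A minimal sketch, with the hypothetical function name `solve_rounds` and systems identified by their index in arrival order:

```python
def solve_rounds(num_systems, num_cores):
    """List, for each round, the reduced systems solved in parallel in that round."""
    return [
        list(range(start, min(start + num_cores, num_systems)))
        for start in range(0, num_systems, num_cores)
    ]
```

With 7 systems and 3 cores this gives three rounds, the last of which keeps only one core busy.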

Once all reduced tridiagonal systems are determined to have been solved, the solution of each reduced tridiagonal system is divided into P parts and distributed to the P cores (S20). The solution obtained by the Thomas algorithm covers only the first and m-th rows of each partitioned system of m rows, so the solutions for the second through (m-1)-th rows must still be computed to obtain the complete solution of the tridiagonal system.

So that the computation for the second through (m-1)-th rows is also performed in parallel, the solution of the reduced tridiagonal system is divided into P parts and distributed to the P cores. Each part may be delivered to the core that sent the corresponding reduced modified system.

Each of the P cores then substitutes its received part of the reduced solution into the modified system it computed earlier and, in parallel, performs the iterative computation of the update algorithm shown in Table 3 to obtain the solutions for the second through (m-1)-th rows of the tridiagonal system (S21).
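Table 3 is not reproduced in this text. Assuming that each interior row of the modified partition has a unit diagonal and refers only to the partition's first and last unknowns, the update could look like the sketch below. The function name `update_interior`, the argument layout and the 0-based indexing are illustrative assumptions.

```python
def update_interior(a_mod, c_mod, d_mod, x_first, x_last):
    """Recover all m unknowns of one partition from its two boundary unknowns.

    a_mod, c_mod, d_mod are the modified coefficients of the partition;
    x_first and x_last come from the solution of the reduced system.
    """
    m = len(d_mod)
    x = [0.0] * m
    x[0] = x_first
    x[-1] = x_last
    # Interior rows are independent of each other once x_first and x_last are known.
    for i in range(1, m - 1):
        x[i] = d_mod[i] - a_mod[i] * x_first - c_mod[i] * x_last
    return x
```

No communication is needed in this step: each core uses only its own stored coefficients and the two values it received.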

Since FIG. 2 assumes that three cores solve in parallel a system with a 12 × 12 tridiagonal matrix, the three cores can each obtain the solutions for the second and third rows of the three partitioned systems as shown in FIG. 5.

It is then determined whether the update computation has been performed for all of the obtained reduced tridiagonal systems (S22). If so, the tridiagonal computation ends. If a reduced tridiagonal system remains for which the update has not been performed, the P cores again perform the update computation in parallel (S21).

In fields such as numerical analysis where tridiagonal systems must be solved, it is very rare for only a single tridiagonal system to arise; in most cases a large number must be solved, often in succession. If a core that has assembled a reduced tridiagonal system were to solve it immediately, the remaining cores would sit idle while that core computes the solution, greatly reducing computational efficiency.

In this embodiment, therefore, the reduced modified systems for each of the tridiagonal systems are obtained in parallel and then transmitted together in a batch, which reduces inter-core communication time. In addition, the cores solve in parallel the reduced tridiagonal systems formed by combining the transmitted reduced modified systems, and the results are distributed back to the cores for the update computation, which minimizes the idle time of the cores.

As a result, the solution of the entire tridiagonal system shown in FIG. 2 can be computed while minimizing inter-core communication volume and core idle time.

FIG. 6 visualizes the overall computation process of the method of FIG. 1, and FIG. 7 visualizes the data transmitted between cores when multiple tridiagonal systems are solved.

Looking again at the overall process of FIG. 6 with reference to FIGS. 1 to 5: when tridiagonal systems are obtained, each core receives its partition of each tridiagonal system, modifies it according to the modified Thomas algorithm to obtain a modified system, and extracts the first and last rows of the modified system to obtain a reduced modified system. Once reduced modified systems have been obtained for all tridiagonal systems, they are delivered to different cores in a predetermined order.

Each core that receives reduced modified systems combines them to obtain and store a reduced tridiagonal system, and the cores solve their reduced tridiagonal systems in parallel using the Thomas algorithm. That is, the solutions for the first and last rows of each partitioned system are computed.

Once the solutions of all stored reduced tridiagonal systems have been obtained, the results are again divided according to the number of cores and distributed to the plurality of cores. Here, the cores may communicate using, for example, MPI_Alltoall.
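The data movement of such an all-to-all exchange can be illustrated without an MPI installation. The following single-process Python sketch only simulates the pattern: chunk j held by core i ends up as chunk i on core j. The function name `alltoall` and the nested-list layout are assumptions of this example; in an actual MPI program the exchange would be performed by the MPI_Alltoall collective itself (for instance through `comm.alltoall(...)` in mpi4py).

```python
def alltoall(send):
    """Simulate an all-to-all exchange among p cores in one process.

    send[i][j] is the chunk that core i sends to core j.
    The result recv satisfies recv[j][i] == send[i][j].
    """
    p = len(send)
    if any(len(row) != p for row in send):
        raise ValueError("every core must provide exactly one chunk per core")
    return [[send[i][j] for i in range(p)] for j in range(p)]


# Usage: core j has solved reduced system j and holds one chunk per partition i.
# After the exchange, core i holds the chunks for its own partition of every system.
send = [[f"system {j}, partition {i}" for i in range(3)] for j in range(3)]
recv = alltoall(send)
```

The same pattern, applied in the opposite direction, corresponds to gathering the reduced modified systems onto the cores that solve them.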

In each core, the distributed partial solution of the reduced tridiagonal system is applied, together with the previously computed modified system, to the update algorithm, which computes the solutions for the rows of the partition other than the first and last rows, thereby yielding the complete solution of the tridiagonal system.
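A minimal sketch of such an update step is given below. It assumes, for illustration only, that each interior row i of the modified system has the form A[i]*x_first + x[i] + C[i]*x_last = D[i], where x_first and x_last are the first-row and last-row solutions returned from the reduced system; the function name `update_interior` and this storage layout are hypothetical and not taken from the embodiment.

```python
def update_interior(A, C, D, x_first, x_last):
    """Recover all unknowns of one partition from its two boundary solutions.

    Assumes interior rows of the modified system read
        A[i]*x_first + x[i] + C[i]*x_last = D[i].
    Entries 0 and -1 of A, C and D are not used.
    """
    n = len(D)
    x = [0.0] * n
    x[0], x[-1] = x_first, x_last
    # Each interior unknown follows by direct substitution; the rows are
    # independent of one another, so no further communication is needed.
    for i in range(1, n - 1):
        x[i] = D[i] - A[i] * x_first - C[i] * x_last
    return x


# Usage with a hand-made 4-row modified partition.
x = update_interior([0.0, 0.5, 0.25, 0.0], [0.0, 0.25, 0.5, 0.0],
                    [9.0, 4.0, 5.0, 9.0], x_first=2.0, x_last=4.0)
```

Under this assumption the update costs one multiply-subtract pair per boundary value for each interior row and uses only data already resident on the core.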

The cores repeat the update algorithm in parallel until the solutions of all of the obtained tridiagonal systems have been obtained.

FIG. 8 shows the results of a comparative simulation of the performance of the tridiagonal system computing method according to an embodiment of the present invention.

In FIG. 8, A and B are conventional tridiagonal system computing methods: A denotes a technique in which the entire tridiagonal system is transmitted to the plurality of cores for computation, and B denotes a technique that computes using time-division multiplexing of a single core. C denotes the PaScal TDMA (Parallel and Scalable Library for TDMA) technique, which is the tridiagonal system computing method according to the present embodiment. FIG. 8 shows the simulated data communication time per core when a plurality of tridiagonal systems are computed with the grid size per core fixed at 512² and the number of cores increased from 8 to 4096.

As shown in FIG. 8, the tridiagonal system computing method according to the present embodiment not only greatly reduces inter-core communication time compared with the conventional methods, but the reduction also becomes larger as the number of cores increases.

The method according to the present invention may be implemented as a computer program stored in a medium for execution on a computer. Here, the computer-readable medium may be any available medium that can be accessed by a computer, and may include all computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data, and may include ROM (read-only memory), RAM (random access memory), CD (compact disc)-ROM, DVD (digital video disc)-ROM, magnetic tape, floppy disks, optical data storage devices, and the like.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and equivalent other embodiments are possible therefrom.

Accordingly, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.

Claims

A tridiagonal system computing method for a multi-core distributed memory system, the method comprising:
obtaining, by each of a plurality of cores, for each of a plurality of sequentially applied tridiagonal systems, a partitioned system divided row-wise into a number of partitions corresponding to the number of cores, modifying the partitioned system according to a predetermined scheme to obtain a modified system, and extracting only the first row and the last row of the modified system to obtain a reduced modified system;
transmitting the reduced modified system obtained from each of the plurality of tridiagonal systems to a corresponding core among the plurality of cores according to a predetermined order of the plurality of cores;
obtaining and storing, by each of the plurality of cores, a reduced tridiagonal system by combining the transmitted reduced modified systems;
when the reduced tridiagonal systems for all of the tridiagonal systems have been obtained, computing, by the plurality of cores simultaneously and in parallel, a solution of each of the plurality of combined reduced tridiagonal systems; and
computing the remaining solutions of the tridiagonal systems using the solutions of the plurality of reduced tridiagonal systems,
wherein computing the remaining solutions comprises:
transmitting the solution of the reduced tridiagonal system to the core that transmitted the corresponding reduced modified system; and
computing the solutions of the rows other than the first row and the last row by substituting the solution of the reduced tridiagonal system into the modified system, and
wherein computing the solution of the reduced modified system comprises,
when the number of stored reduced modified systems exceeds the number of cores, computing in parallel the solutions of a number of reduced modified systems corresponding to the number of cores, and thereafter computing the solutions of the remaining reduced modified systems in parallel in units of the number of cores.

The method of claim 1, wherein the transmitting to the core comprises
transmitting the reduced modified systems obtained from the plurality of tridiagonal systems to the plurality of cores in the predetermined order, according to the order in which the plurality of tridiagonal systems are applied.

(Deleted)

The method of claim 1, wherein obtaining the modified system comprises
modifying the partitioned system according to a modified Thomas algorithm to obtain the modified system, and
wherein computing the solution of the reduced tridiagonal system
is performed according to the Thomas algorithm.

(Deleted)