KR20210032670A

KR20210032670A - Method for computing tridiagonal matrix

Info

Publication number: KR20210032670A
Application number: KR1020190113981A
Authority: KR
Inventors: 최정일; 김기하; 강지훈
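Original assignee: 연세대학교 산학협력단 (Industry-Academic Cooperation Foundation, Yonsei University); 한국과학기술정보연구원 (Korea Institute of Science and Technology Information)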
Priority date: 2019-09-17
Filing date: 2019-09-17
Publication date: 2021-03-25
Also published as: KR102382336B1

Abstract

The present invention relates to a method for solving tridiagonal systems on a multi-core distributed-memory system. The method comprises the following steps:

1. For each of a plurality of tridiagonal systems supplied in succession, each of a plurality of cores obtains a partitioned sub-system. The sub-system is one of the blocks produced by dividing the system by rows into as many blocks as there are cores. Each core modifies its sub-system in a predetermined manner to obtain a modified system, and extracts the first and last rows of the modified system to obtain a reduced modified system.
2. The reduced modified systems obtained from each tridiagonal system are transmitted to a corresponding one of the cores according to a predetermined order of the cores.
3. Each core combines the reduced modified systems it receives to obtain and store a reduced tridiagonal system.
4. The cores solve the reduced systems in parallel.
5. The remaining unknowns of each tridiagonal system are computed using the solution of the reduced tridiagonal system.

The computational efficiency of the multi-core distributed-memory system can thereby be maximized.

Description

Method for solving tridiagonal matrix systems {METHOD FOR COMPUTING TRIDIAGONAL MATRIX}

The present invention relates to a method for solving tridiagonal matrix systems. More particularly, it relates to a method that lets one or more tridiagonal systems be solved efficiently in parallel on a multi-core distributed-memory system.

A tridiagonal system is a system of linear equations whose coefficient matrix has nonzero entries only on the main diagonal and on the diagonals immediately above and below it. Such systems arise frequently when problems in fluid mechanics, heat transfer, quantum mechanics, electromagnetics and similar fields are solved numerically.

The Thomas algorithm is widely used to solve tridiagonal systems and is a special form of Gaussian elimination. It proceeds sequentially and, in terms of operation count alone, is the most efficient method: it needs O(N) operations for an N × N system. However, each step depends on the result of the previous step, so the Thomas algorithm cannot be parallelized. On a high-performance computing system such as a multi-core distributed-memory system, it therefore cannot use parallel computation, and its efficiency drops sharply.
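The sequential dependency described above is easiest to see in code. Below is a minimal pure-Python sketch of the textbook Thomas algorithm. It is not a reproduction of the patent's Tables 2 and 3, which are not included in this text.

The sketch assumes:
- 0-based lists `a` (sub-diagonal), `b` (diagonal), `c` (super-diagonal) and `d` (right-hand side);
- a system that needs no pivoting, such as a diagonally dominant one.

The function name is illustrative.

```python
def thomas(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for i = 0..n-1.

    a[0] and c[n-1] multiply unknowns outside the system and are never used
    in the result. Assumes no pivoting is required (e.g. diagonal dominance).
    """
    n = len(b)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward elimination: row i needs the already-updated row i-1 (serial).
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    # Back substitution: x[i] needs x[i+1] (serial again).
    x = [0.0] * n
    x[n - 1] = dp[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Both loops carry a dependency from one row to the next. This is why the algorithm cannot be split across cores as written.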

The parallel cyclic reduction (PCR) algorithm was devised so that tridiagonal systems can be solved in parallel. It is a recursive algorithm that eliminates unknowns by combining equations in groups of three, which lets the eliminations proceed in parallel. However, its basic efficiency is lower than that of the Thomas algorithm: it performs O(N log N) operations rather than O(N). This efficiency gap grows as the tridiagonal systems to be solved become larger and more numerous. In addition, the PCR algorithm is not well suited to distributed-memory systems.
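For comparison, the following is a hedged serial emulation of parallel cyclic reduction, with a hypothetical function name. It uses the same array convention as a standard Thomas solver (0-based `a`, `b`, `c`, `d`, with `a[0] = c[-1] = 0`).

- Within one reduction step, every row is rebuilt only from the previous step's values, so the inner loop could run in parallel.
- The outer loop doubles the stride about log2(N) times. This is where the extra O(N log N) work comes from.

```python
def pcr(a, b, c, d):
    """Serial emulation of parallel cyclic reduction (requires a[0] == c[-1] == 0).

    At stride s, row i reads a[i]*x[i-s] + b[i]*x[i] + c[i]*x[i+s] = d[i];
    combining it with rows i-s and i+s removes x[i-s] and x[i+s].
    """
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)
    s = 1
    while s < n:
        na, nb, nc, nd = [0.0] * n, [0.0] * n, [0.0] * n, [0.0] * n
        for i in range(n):  # rows are independent within a step -> parallel
            lo, hi = i - s, i + s
            nb[i], nd[i] = b[i], d[i]
            if lo >= 0:  # eliminate x[i-s] using row i-s
                alpha = -a[i] / b[lo]
                na[i] = alpha * a[lo]
                nb[i] += alpha * c[lo]
                nd[i] += alpha * d[lo]
            if hi < n:  # eliminate x[i+s] using row i+s
                gamma = -c[i] / b[hi]
                nc[i] = gamma * c[hi]
                nb[i] += gamma * a[hi]
                nd[i] += gamma * d[hi]
        a, b, c, d = na, nb, nc, nd
        s *= 2
    # Every equation is now decoupled: b[i] * x[i] = d[i].
    return [d[i] / b[i] for i in range(n)]
```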

An algorithm for solving a tridiagonal system in parallel on a distributed-memory system has also been proposed (Mattor et al., 1995). This method first gathers the element values prepared on each compute node into a small reduced tridiagonal system and solves it. Each compute node then uses that solution to compute its part of the solution of the original tridiagonal system in parallel. However, when one tridiagonal system is solved, every compute node solves the same reduced system redundantly, so efficiency remains low.

US Patent Application Publication No. 2019/0153824 (published May 23, 2019)

An object of the present invention is to provide a method for efficiently solving tridiagonal systems on a multi-core distributed-memory system.

To achieve the above object, an embodiment of the present invention provides a method for solving tridiagonal systems on a multi-core distributed-memory system. The method comprises:

1. For each of a plurality of tridiagonal systems supplied in succession, each of a plurality of cores obtains a partitioned sub-system. The sub-system is one of the blocks produced by dividing the system by rows into as many blocks as there are cores. Each core modifies its sub-system in a predetermined manner to obtain a modified system, and extracts the first and last rows of the modified system to obtain a reduced modified system.
2. The reduced modified systems obtained from each tridiagonal system are transmitted to a corresponding one of the cores according to a predetermined order of the cores.
3. Each core combines the transmitted reduced modified systems to obtain and store a reduced tridiagonal system.
4. Each core computes, in parallel, the solution of the reduced system it has obtained.
5. The remaining unknowns of each tridiagonal system are computed using the solution of the reduced tridiagonal system.

In the transmitting step, the reduced modified systems obtained from the plurality of tridiagonal systems may be sent to the cores in a predetermined order. That order follows the order in which the tridiagonal systems were supplied.

In the step of computing the solutions of the reduced systems, the number of stored reduced systems may exceed the number of cores. In that case, the cores first solve, in parallel, as many reduced systems as there are cores. They then solve the remaining reduced systems in parallel in later rounds, each round again handling as many systems as there are cores.

In the step of obtaining the modified system, the sub-system may be modified according to a modified Thomas algorithm. In the step of solving the reduced tridiagonal system, the computation may be performed with the Thomas algorithm.

The step of computing the remaining unknowns may include two steps:
1. distributing the computed solution of the reduced tridiagonal system to the plurality of cores;
2. substituting the solution of the reduced system, together with the corresponding modified system, into the update algorithm of the Thomas algorithm.

In the distributing step, each part of the reduced system's solution may be sent to the core that transmitted the corresponding reduced modified system.

Accordingly, the method according to the embodiment of the present invention offers three benefits:
- It improves parallel scalability when tridiagonal systems are solved on multiple cores, giving computing performance optimized for multi-core distributed-memory systems.
- It reduces overhead by minimizing inter-core communication volume and idle time.
- It improves load balance by avoiding redundant computation.

Together these maximize computational efficiency.

FIG. 1 shows a method for solving tridiagonal systems according to an embodiment of the present invention.
FIG. 2 shows an example of a tridiagonal system.
FIG. 3 shows an example of a tridiagonal system after the modification step of FIG. 1.
FIG. 4 shows an example of a reduced tridiagonal system obtained in the reduction step of FIG. 1.
FIG. 5 shows an example of distributed computation of the reduced tridiagonal system.
FIG. 6 visually shows the entire computation process of the method of FIG. 1.
FIG. 7 visually shows the data transmitted between cores when a plurality of tridiagonal systems are solved.
FIG. 8 shows the results of a simulation comparing the performance of the method according to an embodiment of the present invention.

For a full understanding of the present invention, its operational advantages, and the objects achieved by implementing it, reference should be made to the accompanying drawings, which illustrate preferred embodiments, and to the description of those drawings.

Hereinafter, the present invention is described in detail through preferred embodiments with reference to the accompanying drawings. However, the present invention may be implemented in various forms and is not limited to the described embodiments. For clarity, parts unrelated to the description are omitted, and the same reference numerals in the drawings denote the same members.

Throughout the specification, when a part "includes" a component, it may also include other components, unless specifically stated otherwise. In addition, terms such as "...unit", "...er", "module", and "block" denote a unit that performs at least one function or operation. Such a unit may be implemented in hardware, in software, or in a combination of the two.

FIG. 1 shows a method for solving tridiagonal systems according to an embodiment of the present invention. FIG. 2 shows an example of a tridiagonal system, and FIG. 3 shows an example of a tridiagonal system after the modification step of FIG. 1. FIG. 4 shows an example of a reduced tridiagonal system obtained in the reduction step of FIG. 1, and FIG. 5 shows an example of distributed computation of the reduced tridiagonal system.

Referring to FIGS. 1 to 5, the method according to the present embodiment first obtains a plurality of tridiagonal systems to be solved (S11).

Here, a tridiagonal system is a matrix equation of the form Ax = d, where A is an N × N tridiagonal matrix and x and d are column vectors of length N. The tridiagonal matrix A is formed by the elements a_i, b_i and c_i (i = 1, …, N), and the right-hand-side vector d has elements d_i. Solving the system means finding the unknown column vector x with elements x_i.

That is, the tridiagonal system can be written as Equation 1:

$$a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i, \qquad i = 1, \dots, N \tag{1}$$

Here, a_1 = 0 and c_N = 0, because the unknowns x_0 and x_{N+1} do not exist.
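To make the index convention concrete, here is a small helper. It assumes 0-based Python lists, where `a[0]` and `c[-1]` play the role of a_1 and c_N. It forms the product Ax row by row, which is convenient for building a right-hand side d from a known x when checking a solver. The name `tridiag_matvec` is hypothetical.

```python
def tridiag_matvec(a, b, c, x):
    """Return A @ x for the tridiagonal A with sub-diagonal a, diagonal b, super-diagonal c.

    Row i is a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1]; the first row has no x[i-1]
    term and the last row no x[i+1] term, matching a_1 = c_N = 0.
    """
    n = len(b)
    y = []
    for i in range(n):
        s = b[i] * x[i]
        if i > 0:
            s += a[i] * x[i - 1]
        if i < n - 1:
            s += c[i] * x[i + 1]
        y.append(s)
    return y
```

For example, with a = [0, 1, 1, 1], b = [4, 4, 4, 4], c = [1, 1, 1, 0] and x = [1, 2, 3, 4], the helper returns d = [6, 12, 18, 19].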

Once the tridiagonal systems have been obtained, each one is divided evenly by rows according to the number of cores performing the computation, as shown in FIG. 2 (S12).

The method according to the present embodiment is assumed to run on a multi-core distributed-memory system.

Because each core of such a system can compute independently, parallelizing the solution of tridiagonal systems can greatly improve computational efficiency. However, two problems limit the gain:
- The cores of a distributed-memory system do not share memory. Results computed on each core must therefore be exchanged through inter-core communication, which generates heavy communication traffic.
- When many tridiagonal systems are solved in succession, a specific core receives the partial results computed on the other cores and performs the final computation. Meanwhile the remaining cores stay idle, which prevents efficiency from being maximized.

The present embodiment therefore parallelizes the solution of tridiagonal systems and schedules the parallel computation with a time-division multiplexing technique, thereby maximizing computational efficiency.

To this end, the present embodiment divides each tridiagonal system according to the number of cores, so that the cores can compute simultaneously in parallel.

When there are P computing cores, the N × N tridiagonal matrix A is divided by rows into blocks of m = N/P rows, and the blocks are assigned to the P cores.

For example, in FIG. 2 the tridiagonal matrix is 12 × 12 (N = 12) and there are three computing cores, so the matrix is divided evenly into blocks of four rows. Blank entries in the matrix of FIG. 2 are zeros.

Dividing the rows evenly into blocks of m rows also divides the columns into blocks of m columns. The matrix is thus partitioned into P × P sub-matrices of size m × m.

Each partitioned sub-system can then be written as Equation 2:

$$a_i^p x_{i-1}^p + b_i^p x_i^p + c_i^p x_{i+1}^p = d_i^p, \qquad i = 1, \dots, m \tag{2}$$

Here, p is the partition index, with 0 ≤ p ≤ P-1. The unknowns x_0^p and x_{m+1}^p correspond to x_m^{p-1} and x_1^{p+1}, respectively.
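The row partitioning of S12 and the neighbour relation (x_0^p = x_m^{p-1}, x_{m+1}^p = x_1^{p+1}) can be sketched as index bookkeeping. The sketch assumes 0-based global row indices and that P divides N, as in the 12-row, 3-core example. The function names are hypothetical.

```python
def partition_rows(n, p):
    """Split rows 0..n-1 into p contiguous [start, end) blocks of m = n // p rows."""
    if n % p:
        raise ValueError("this sketch assumes N is divisible by P")
    m = n // p
    return [(k * m, (k + 1) * m) for k in range(p)]


def halo_indices(blocks, k):
    """Global indices playing the role of x_0^k and x_{m+1}^k for block k.

    x_0^k is the last row of block k-1 and x_{m+1}^k is the first row of block k+1;
    None marks the outer boundaries, where a_1 = 0 and c_N = 0 make them unused.
    """
    start, end = blocks[k]
    left = start - 1 if k > 0 else None
    right = end if k < len(blocks) - 1 else None
    return left, right
```

For the FIG. 2 case, `partition_rows(12, 3)` gives `[(0, 4), (4, 8), (8, 12)]`. The middle core's neighbour unknowns sit at global rows 3 and 8.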

Once the system has been partitioned, each core applies the modified Thomas algorithm of Table 1 to the sub-system it received, thereby modifying it (S13).

The modified Thomas algorithm is a known technique for solving tridiagonal systems. After the tridiagonal matrix is transformed with it, all diagonal elements equal 1, as shown in FIG. 3. In addition, the first entry of each row within a partition becomes nonzero; that is, the first column of the partitioned sub-system fills in.

Each core can then write the equations for the first and m-th rows of the modified sub-system as shown in Equation 3.

The equations for the second through (m-1)-th rows can be written as shown in Equation 4.

In Equations 3 and 4, the modified coefficients are the element coefficients produced by the modified Thomas algorithm.

Replacing x_0^p and x_{m+1}^p with x_m^{p-1} and x_1^{p+1}, respectively, turns Equation 3 into Equation 5.
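Table 1 and Equations 3 to 5 are not reproduced in this text. The sketch below therefore follows one commonly used formulation of the modified Thomas algorithm, and its exact form should be read as an assumption. It works in three passes:

1. It normalizes every row to a unit diagonal, as FIG. 3 describes.
2. A forward pass creates fill-in in the block's first column.
3. A backward pass creates fill-in in the block's last column.

As a result, the first and last rows couple only to the block's own boundary unknowns and to the neighbouring partitions' boundary unknowns, which is the structure Equation 5 expresses. The function name is hypothetical.

```python
def modified_thomas(a, b, c, d):
    """Modify one partition of m >= 2 rows (assumed form of the S13 step).

    Result, with x_left = x_m^{p-1} and x_right = x_1^{p+1} on neighbouring cores:
      row 0:       ap[0]*x_left + x[0]   + cp[0]*x[m-1]    = dp[0]
      rows 1..m-2: ap[i]*x[0]   + x[i]   + cp[i]*x[m-1]    = dp[i]
      row m-1:     ap[-1]*x[0]  + x[m-1] + cp[-1]*x_right  = dp[-1]
    """
    m = len(b)
    if m < 2:
        raise ValueError("each partition needs at least two rows")
    ap, cp, dp = [0.0] * m, [0.0] * m, [0.0] * m
    # Forward pass: normalise rows 0 and 1, then eliminate x[i-1] from rows 2..m-1.
    for i in range(2):
        ap[i], cp[i], dp[i] = a[i] / b[i], c[i] / b[i], d[i] / b[i]
    for i in range(2, m):
        r = 1.0 / (b[i] - a[i] * cp[i - 1])
        ap[i] = -r * a[i] * ap[i - 1]  # fill-in in the block's first column
        cp[i] = r * c[i]
        dp[i] = r * (d[i] - a[i] * dp[i - 1])
    # Backward pass: eliminate x[i+1] from rows m-3..1 (cp[i] is updated last).
    for i in range(m - 3, 0, -1):
        ap[i] = ap[i] - cp[i] * ap[i + 1]
        dp[i] = dp[i] - cp[i] * dp[i + 1]
        cp[i] = -cp[i] * cp[i + 1]  # fill-in in the block's last column
    # Row 0: eliminate x[1] with row 1 and renormalise (already in form if m == 2).
    if m > 2:
        r = 1.0 / (1.0 - cp[0] * ap[1])
        ap[0] = r * ap[0]
        dp[0] = r * (dp[0] - cp[0] * dp[1])
        cp[0] = -r * cp[0] * cp[1]
    return ap, cp, dp
```

Each core runs this on its own block without communication. Only the neighbouring boundary unknowns remain coupled across cores.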

Once its sub-system has been modified, each core keeps only the first and m-th rows of the modified system, thereby reducing it (S14).

In FIGS. 3 and 4, elements computed on different cores are shown in different colors, to illustrate how each of the three cores modifies and reduces its own partition.

After the modified system has been reduced, it is determined whether all obtained tridiagonal systems have been modified and reduced (S15). If any system has not yet been processed, the next one is partitioned, modified and reduced to obtain its reduced modified system. Each reduced modified system may be stored temporarily in the memory of the corresponding core.

When all tridiagonal systems have been modified and reduced, each of the P cores transmits its stored reduced modified systems to other cores (S16). The transmission follows the order in which the reduced modified systems were obtained:
- the reduced modified system obtained in one time interval is sent to the core designated by a predetermined order;
- the one obtained in the next time interval is sent to the next designated core.
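The "predetermined order" of S16 is not specified beyond the description above. A round-robin rule is one natural reading: the k-th system in arrival order is gathered on core k mod P. The sketch below assumes that rule; all names are hypothetical. It also counts how many parallel rounds of reduced solves (S18 to S19) the cores then need, given that at most P reduced systems can be solved at once.

```python
def gather_core(k, p):
    """Hypothetical 'predetermined order': system k (arrival order) is gathered on core k mod p."""
    return k % p


def gather_plan(num_systems, p):
    """Map each core to the systems whose reduced rows it collects (S16-S17)."""
    plan = {core: [] for core in range(p)}
    for k in range(num_systems):
        plan[gather_core(k, p)].append(k)
    return plan


def solve_rounds(num_systems, p):
    """Parallel rounds of reduced solves (S18-S19): at most p systems per round."""
    return -(-num_systems // p)  # ceiling division
```

With five systems on three cores, core 0 assembles systems 0 and 3, core 1 assembles systems 1 and 4, and core 2 assembles system 2. The reduced solves then take two rounds.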

The core that assembles a reduced tridiagonal system receives only a two-row reduced modified system from each of the other cores. The communication volume is therefore much smaller than if all m rows were received from each core, which greatly improves inter-core communication efficiency.
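A rough count of the values involved illustrates the saving. The accounting below is illustrative and not taken from the patent. It assumes that:
- each reduced row carries its two off-diagonal coefficients and its right-hand side, since the unit diagonal need not be sent;
- a full row carries a_i, b_i, c_i and d_i;
- each system is gathered on one core from the other P-1 cores.

The function names are hypothetical.

```python
def values_per_message(m, reduced=True):
    """Values one core sends for one system's partition (illustrative accounting).

    reduced=True : rows 1 and m of the modified block, 3 values each (a', c', d').
    reduced=False: all m rows of the original block, 4 values each (a, b, c, d).
    """
    return 2 * 3 if reduced else 4 * m


def gather_volume(m, p, num_systems, reduced=True):
    """Total values moved when each system is gathered on one core from the other p-1."""
    return num_systems * (p - 1) * values_per_message(m, reduced)
```

With m = 512 rows per partition, for instance, each message shrinks from 2048 values to 6. Under the reduced scheme the message size does not depend on m at all.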

The core that has received the reduced modified systems from the other cores combines them into the reduced tridiagonal system shown in FIG. 4 and stores it (S17).
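Steps S14 and S17 can be sketched together: each block contributes its first and last modified rows, and the receiving core stacks them. The sketch assumes, as one reading of Equation 5, that:
- the modified first row of partition p couples x_m^{p-1}, x_1^p and x_m^p;
- the modified last row couples x_1^p, x_m^p and x_1^{p+1}.

Ordering the unknowns as (x_1^0, x_m^0, x_1^1, x_m^1, …) then gives a 2P × 2P tridiagonal system with a unit diagonal. Function names are hypothetical.

```python
def extract_reduced_rows(ap, cp, dp):
    """S14: keep only the first and last rows (a', c', d') of a modified block."""
    return (ap[0], cp[0], dp[0]), (ap[-1], cp[-1], dp[-1])


def assemble_reduced_system(reduced_rows):
    """S17: stack the two-row pieces of all P blocks into one 2P x 2P tridiagonal system.

    reduced_rows[p] = (first_row, last_row) from block p. Returns sub-diagonal A,
    diagonal B (all ones after the modified Thomas step), super-diagonal C and RHS D.
    """
    A, B, C, D = [], [], [], []
    for first, last in reduced_rows:
        for ai, ci, di in (first, last):
            A.append(ai)
            B.append(1.0)
            C.append(ci)
            D.append(di)
    return A, B, C, D
```

The result can be passed directly to an ordinary Thomas solver. The first sub-diagonal entry and the last super-diagonal entry are zero when a_1 = c_N = 0.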

If there are more tridiagonal systems than cores, each core repeatedly receives and combines reduced modified systems in the predetermined order, obtaining and storing several reduced tridiagonal systems.

Once the reduced tridiagonal systems for all tridiagonal systems have been assembled, the cores solve the reduced systems simultaneously in parallel (S18).

As shown in FIG. 4, the reduced system is also tridiagonal. Each of the P cores solves it using the Thomas algorithm of Tables 2 and 3.

After a reduced tridiagonal system has been solved with the Thomas algorithm, it is determined whether all reduced systems have been solved (S19). As noted above, when there are more tridiagonal systems than cores, the cores cannot solve all stored reduced systems at once: at most as many reduced systems as there are cores can be solved in one parallel round. If any reduced system remains unsolved, the cores again solve different reduced systems simultaneously in parallel (S18).

When all reduced systems have been solved, the solution of each reduced system is divided into P parts and distributed to the P cores (S20). The Thomas algorithm yields only the unknowns of the first and m-th rows of each m-row partition. To obtain the complete solution of the tridiagonal system, the unknowns of the second through (m-1)-th rows must still be computed.

So that the second through (m-1)-th rows can also be computed in parallel, the solution of the reduced system is divided into P parts and distributed to the P cores. Each part may be sent to the core that transmitted the corresponding reduced modified system.

Each of the P cores then substitutes its part of the reduced solution into the modified system it computed earlier. The cores apply the update algorithm of Table 3 in parallel to obtain the unknowns of the second through (m-1)-th rows of the tridiagonal system (S21).
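The sketch below assumes that, after modification, interior row i of partition p reads a'_i x_1^p + x_i^p + c'_i x_m^p = d'_i, which is one reading of Equation 4. Under that assumption the update of S21 needs no recurrence: each interior unknown follows directly from the two boundary values. The function name is hypothetical.

```python
def update_interior(ap, cp, dp, x_first, x_last):
    """S21: recover all m unknowns of one block from its first and last unknowns.

    Assumes interior rows have the modified form ap[i]*x_first + x[i] + cp[i]*x_last = dp[i];
    x_first and x_last come from the solution of the reduced system.
    """
    m = len(dp)
    x = [0.0] * m
    x[0], x[m - 1] = x_first, x_last
    for i in range(1, m - 1):  # rows are independent: no dependency between iterations
        x[i] = dp[i] - ap[i] * x_first - cp[i] * x_last
    return x
```

Because the iterations are independent, this step could also be vectorized, or spread over the many systems a core holds.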

FIG. 2 assumes that three cores solve a system with a 12 × 12 tridiagonal matrix in parallel. Each of the three cores therefore obtains the unknowns of the second and third rows (the interior rows when m = 4) of the partitioned systems, as shown in FIG. 5.

It is then determined whether the update has been performed for all reduced tridiagonal systems (S22). If so, the computation ends. If any reduced system has not yet been updated, the P cores again perform the update in parallel (S21).

In fields that must solve tridiagonal systems, such as numerical analysis, a single tridiagonal system rarely appears alone; usually a large number of them must be solved, often in succession. Suppose the core that assembled a reduced system solved it immediately. The remaining cores would then sit idle while that core computed the solution, which greatly reduces computational efficiency.

The present embodiment avoids this in two ways:
- The reduced modified systems of the tridiagonal systems are obtained in parallel and transmitted together as one batch, which reduces inter-core communication time.
- Each core solves, in parallel, the reduced tridiagonal systems assembled from the reduced modified systems it received, and the results are distributed back to the cores for the update step. This minimizes the idle time of the cores.

As a result, the solution of the entire tridiagonal system shown in FIG. 2 can be computed while minimizing inter-core communication volume and core idle time.

FIG. 6 visually shows the entire computation process of the method of FIG. 1. FIG. 7 visually shows the data transmitted between cores when a plurality of tridiagonal systems are solved.

The overall process of FIG. 6 can be reviewed with reference to FIGS. 1 to 5. When tridiagonal systems are obtained, each core receives one partition of each system. It modifies that partition with the modified Thomas algorithm to obtain a modified system, then extracts the first and last rows to obtain a reduced modified system. Once the reduced modified systems for all tridiagonal systems have been obtained, they are transferred to different cores in a predetermined order.

Each core that receives reduced modified systems combines them to obtain and store a reduced tridiagonal system, and the cores solve their reduced tridiagonal systems in parallel using the Thomas algorithm. That is, the unknowns of the first and last rows of each partitioned system are computed.
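Combining the reduced rows and solving them might look like the sketch below. It assumes the first and last modified rows of partitions 0 to P-1 arrive in rank order as (a, b, c, d) tuples with unit diagonal, so stacking them gives a 2P-by-2P tridiagonal system whose unknowns are the first and last unknowns of every partition. It also assumes a non-periodic system, so the left coupling of partition 0 and the right coupling of partition P-1 are ignored. `thomas` is the textbook Thomas algorithm; `solve_reduced` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple

Row = Tuple[float, float, float, float]  # (a, b, c, d) of one equation


def thomas(a: Sequence[float], b: Sequence[float], c: Sequence[float],
           d: Sequence[float]) -> List[float]:
    """Textbook Thomas algorithm for a non-periodic tridiagonal system; a[0] and c[-1] are ignored."""
    n = len(b)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):  # forward elimination
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def solve_reduced(pairs: Sequence[Tuple[Row, Row]]) -> List[Tuple[float, float]]:
    """Stack the (first, last) rows of partitions 0..P-1, solve, and
    return (x_first, x_last) for each partition."""
    rows = [row for pair in pairs for row in pair]
    a, b, c, d = (list(col) for col in zip(*rows))
    x = thomas(a, b, c, d)
    return [(x[2 * p], x[2 * p + 1]) for p in range(len(pairs))]
```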

When the solutions of all stored reduced tridiagonal systems have been obtained, the results are again divided according to the number of cores and distributed to the cores. Here, the cores may communicate using, for example, MPI_Alltoall.
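The return path can be emulated the same way. This sketch assumes a serial stand-in for MPI_Alltoall and assumes each solver core stores, per system, the reduced solution as a list of (x_first, x_last) pairs ordered by source core. Each solver core splits its solutions so that every piece goes back to the core that sent the matching reduced rows, as claim 6 states. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[float, float]  # (x_first, x_last) of one partition


def split_by_source(solutions: Dict[int, List[Pair]], num_cores: int) -> List[Dict[int, Pair]]:
    """On one solver core: piece p of every reduced solution goes back to core p."""
    send: List[Dict[int, Pair]] = [dict() for _ in range(num_cores)]
    for system, per_partition in solutions.items():
        for p in range(num_cores):
            send[p][system] = per_partition[p]
    return send


def alltoall(send_buffers):
    """Serial stand-in for MPI_Alltoall: recv[q][p] is the block core p addressed to core q."""
    n = len(send_buffers)
    return [[send_buffers[p][q] for p in range(n)] for q in range(n)]


def merge(received: List[Dict[int, Pair]]) -> Dict[int, Pair]:
    """On one partition-owning core: (x_first, x_last) of its own partition, for every system."""
    out: Dict[int, Pair] = {}
    for batch in received:
        out.update(batch)
    return out
```

After `merge`, every core holds the two boundary unknowns of its own partition for every system, which is all it needs to recover its interior unknowns locally.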

On each core, the distributed part of the reduced-system solution is applied to the update algorithm together with the modified system computed earlier on that core. This yields the unknowns for all rows of the partitioned system except the first and last rows, and thus the full solution of the tridiagonal system.
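With (x_first, x_last) available locally, the update is a single pass over the interior rows. A minimal sketch, assuming the modified system has the form a'_i x_0 + x_i + c'_i x_{m-1} = d'_i for every interior row i (the form produced by the common modified-Thomas formulation); the first and last entries of a', c', d' are not used here because those unknowns come from the reduced solve. `update_partition` and `update_all` are hypothetical names, and `update_all` simply repeats the update for every system.

```python
from typing import Dict, List, Sequence, Tuple


def update_partition(a_mod: Sequence[float], c_mod: Sequence[float], d_mod: Sequence[float],
                     x_first: float, x_last: float) -> List[float]:
    """Recover all unknowns of one partition from its modified coefficients
    and the reduced-system solution (x_first, x_last) for that partition."""
    m = len(d_mod)
    x = [0.0] * m
    x[0], x[-1] = x_first, x_last
    for i in range(1, m - 1):
        # Interior row i of the modified system: a'_i x_0 + x_i + c'_i x_{m-1} = d'_i
        x[i] = d_mod[i] - a_mod[i] * x_first - c_mod[i] * x_last
    return x


def update_all(modified: Dict[int, Tuple[Sequence[float], Sequence[float], Sequence[float]]],
               solutions: Dict[int, Tuple[float, float]]) -> Dict[int, List[float]]:
    """Apply the update to every system held by this core."""
    return {s: update_partition(*modified[s], *solutions[s]) for s in modified}
```

Each interior unknown depends only on local data and the two boundary values, so this step needs no further communication.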

The cores repeat the update algorithm in parallel until the solutions of all the obtained tridiagonal systems have been computed.

FIG. 8 shows the results of a simulation comparing the performance of the tridiagonal system computation method according to an embodiment of the present invention.

In FIG. 8, A and B denote conventional tridiagonal system computation methods: A transmits the entire tridiagonal system to multiple cores for computation, and B computes on a single core using time-division multiplexing. C denotes PaScal TDMA (Parallel and Scalable Library for TDMA), the tridiagonal system computation method according to this embodiment. FIG. 8 shows the simulated data communication time per core when a plurality of tridiagonal systems are computed with the grid size per core fixed at 512² and the number of cores increased from 8 to 4096.

As shown in FIG. 8, the method according to this embodiment substantially reduces inter-core communication time compared with the conventional methods, and the reduction grows as the number of cores increases.

The method according to the present invention may be implemented as a computer program stored on a medium for execution on a computer. Here, the computer-readable medium may be any available medium that can be accessed by a computer and may include all computer storage media. Computer storage media include both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data, and may include ROM (read-only memory), RAM (random access memory), CD-ROM (compact disc ROM), DVD-ROM (digital video disc ROM), magnetic tape, floppy disks, optical data storage devices, and the like.

The present invention has been described with reference to the embodiments shown in the drawings, but these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible.

Therefore, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

Claims

A tridiagonal system computation method for a multi-core distributed memory system, the method comprising:
obtaining, for each of a plurality of tridiagonal systems input in sequence to each of a plurality of cores, a partitioned system divided row-wise into as many parts as there are cores, modifying the partitioned system in a predetermined manner to obtain a modified system, and extracting the first row and the last row of the modified system to obtain a reduced modified system;
transmitting the reduced modified system obtained from each of the plurality of tridiagonal systems to a corresponding core among the plurality of cores according to a predetermined order of the plurality of cores;
combining, at each of the plurality of cores, the transmitted reduced modified systems to obtain and store a reduced tridiagonal system;
solving, at each of the plurality of cores in parallel, the reduced tridiagonal system thus obtained; and
calculating the remaining solution of the tridiagonal systems using the solution of the reduced tridiagonal system.

The method of claim 1, wherein transmitting to the cores comprises:
transmitting the reduced modified systems obtained from the plurality of tridiagonal systems to the plurality of cores in a predetermined order that follows the order in which the plurality of tridiagonal systems were input.

The method of claim 1, wherein solving the reduced tridiagonal system comprises:
when the number of stored reduced systems exceeds the number of cores, solving in parallel as many reduced systems as there are cores, and then solving the remaining reduced systems in parallel in batches equal to the number of cores.

The method of claim 1, wherein obtaining the modified system comprises:
modifying the partitioned system according to a modified Thomas algorithm to obtain the modified system, and
wherein solving the reduced tridiagonal system comprises:
performing the computation according to the Thomas algorithm.

The method of claim 1, wherein calculating the remaining solution comprises:
distributing the computed solution of the reduced tridiagonal system among the plurality of cores; and
substituting the solution of the reduced tridiagonal system, together with the corresponding modified system, into an update algorithm of the Thomas algorithm to compute the remaining solution.

The method of claim 5, wherein the distributing comprises:
transmitting each part of the solution of the reduced tridiagonal system to the core that transmitted the corresponding reduced modified system.
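The batching rule in claim 3 (more stored reduced systems than cores) amounts to processing them in rounds of at most one system per core. A minimal sketch, assuming systems are numbered in input order and that, within a round, the hypothetical helper `schedule_rounds` assigns systems to cores by their position in that round:

```python
from typing import List, Tuple


def schedule_rounds(num_systems: int, num_cores: int) -> List[List[Tuple[int, int]]]:
    """Group reduced systems into rounds of at most `num_cores`; each entry is (system, core)."""
    if num_cores < 1:
        raise ValueError("need at least one core")
    rounds = []
    for start in range(0, num_systems, num_cores):
        batch = range(start, min(start + num_cores, num_systems))
        # Within a round every core solves at most one reduced system, all in parallel.
        rounds.append([(s, s - start) for s in batch])
    return rounds
```

For example, 10 reduced systems on 4 cores take three rounds: two full rounds of four systems, then a final round of two.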