KR102327234B1

KR102327234B1 - Memory data transform method and computer for matrix multiplication

Info

Publication number: KR102327234B1
Application number: KR1020190122023A
Authority: KR
Inventors: 정성우; 콩 뚜안 두; 최정환
Original assignee: 고려대학교 산학협력단
Priority date: 2019-10-02
Filing date: 2019-10-02
Publication date: 2021-11-15
Also published as: KR20210039600A

Abstract

A method and a computer for converting memory data during matrix operations are disclosed. The method includes: (a) fetching first matrix data from storage and storing it in a cache memory; (b) fetching second matrix data from the storage; and (c) interchanging the columns and rows of the fetched second matrix data and storing the result in a scratchpad memory.

Description

Memory data transform method and computer for matrix multiplication

The present invention relates to a method and a computer for converting memory data during matrix operations.

In matrix multiplication on a GPU, on-chip memories such as the L1 data cache (L1D) and the scratchpad memory (SPM) supply the data of the two input matrices (row data of the first matrix and column data of the second matrix). However, because the on-chip memory architecture allows only one row to be accessed at a time, the column data of the second matrix must be read with multiple accesses, which slows down matrix multiplication.

(01) Republic of Korea Registered Patent Publication No. 10-0909510 (2009.07.20.)

An object of the present invention is to provide a memory data conversion method and a computer that can increase operation speed during matrix operations.

Another object of the present invention is to provide a memory data conversion method and a computer that improve computational performance by allowing the matrix data being multiplied to be read with a single access during matrix operations.

According to one aspect of the present invention, a method of converting memory data during matrix operations is provided.

According to an embodiment of the present invention, there is provided a method of converting memory data during matrix operations, the method including: (a) fetching first matrix data from storage and storing it in a cache memory; (b) fetching second matrix data from the storage; and (c) converting a column of the fetched second matrix data so that it is located in one physical row, and storing it in a scratchpad memory.

In step (c), data located in a logical column of the fetched second matrix data may be stored in the scratchpad memory so as to be located in one physical row.

The second matrix data is the matrix data to be multiplied.
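Steps (a) through (c) can be sketched in software as follows. This is a minimal illustration in Python, not the hardware mechanism of the invention: the names `fetch` and `transform_to_scratchpad`, and the use of nested lists to stand in for the storage, the cache memory, and the scratchpad memory, are all assumptions made for the example.

```python
# Toy model: nested lists stand in for storage, cache, and scratchpad.
# All names are hypothetical and chosen for illustration only.

def fetch(storage, name):
    """Return a copy of the matrix stored under `name` (steps (a) and (b))."""
    return [row[:] for row in storage[name]]


def transform_to_scratchpad(matrix):
    """Step (c): place each logical column of `matrix` in one physical row."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [[matrix[r][c] for r in range(n_rows)] for c in range(n_cols)]


storage = {
    "A": [[1, 2], [3, 4]],  # first matrix
    "B": [[5, 6], [7, 8]],  # second matrix (the one being multiplied)
}

cache = fetch(storage, "A")                      # step (a)
fetched_b = fetch(storage, "B")                  # step (b)
scratchpad = transform_to_scratchpad(fetched_b)  # step (c)

# Logical column 0 of B, (5, 7), now occupies physical row 0.
print(scratchpad[0])  # [5, 7]
```

In this sketch the first matrix is stored unchanged, while only the second matrix is rearranged on its way into the scratchpad.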

According to another aspect of the present invention, a computer capable of improving matrix operation performance is provided.

According to an embodiment of the present invention, there may be provided a computer including: a storage; a cache memory; a scratchpad memory; and a processor that, during a matrix operation, fetches first matrix data from the storage and stores it in the cache memory, and, when fetching second matrix data from the storage and storing it in the scratchpad memory, stores the second matrix data so that it is located in one physical row.

The processor may interchange the columns and rows of the fetched second matrix data and store the result in the scratchpad memory.

By providing the memory data conversion method and computer according to an embodiment of the present invention, the matrix data being multiplied can be read through a single word line during matrix operations.

As a result, the present invention can improve operation speed during matrix operations.

FIG. 1 is a flowchart illustrating a method of converting a memory address during matrix operation according to an embodiment of the present invention.
FIG. 2 is a block diagram schematically illustrating the internal configuration of a computer according to an embodiment of the present invention.
FIG. 3 is a graph illustrating the improvement in matrix multiplication performance according to an embodiment of the present invention.
FIG. 4 is a graph showing the energy consumption of the scratchpad memory and of the entire system as a function of matrix size, according to an embodiment of the present invention compared to the prior art.

As used herein, a singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "consisting of" or "comprising" should not be construed as necessarily including all of the components or steps described in the specification; some of those components or steps may be omitted, and additional components or steps may be included. In addition, terms such as "...unit" and "module" described in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a flowchart illustrating a method of converting a memory address during matrix operation according to an embodiment of the present invention. A method of performing a matrix multiplication operation is described below.

In step 110, the computer 100 fetches the first matrix data from storage.

In step 115, the computer 100 stores the fetched first matrix data in the cache memory. Here, the cache memory may be an L1 data cache.

The method of fetching the first matrix data from the storage and storing it in the L1 data cache may be the same as in the related art.

In step 120, the computer 100 fetches the second matrix data from the storage.

In step 125, the computer 100 converts a column of the fetched second matrix data into a row and stores it in the scratchpad memory.

That is, according to an embodiment of the present invention, the computer 100 may store a logical column of the fetched second matrix data in the scratchpad memory so that it is located in one physical row.

In this way, an embodiment of the present invention can improve computational performance during matrix operations. In general, when the second matrix data to be multiplied is stored in the conventional manner, the column data of the second matrix data is read through multiple accesses. This is because, owing to the nature of physical memory, all of the rows of the physical memory must be opened in order to read column data.

Accordingly, in an embodiment of the present invention, to minimize multiple accesses to the column data of the second matrix data being multiplied, the logical column data of the fetched second matrix data is stored so that it is located in one row of the physical memory. The logical column data of the second matrix data can then be read with a single access.
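The difference in access counts can be illustrated with a toy model. The sketch below assumes a memory that returns exactly one physical row per access; the class `RowMemory` and the two helper functions are hypothetical names introduced for this example and do not come from the specification.

```python
class RowMemory:
    """Toy memory that can only be read one physical row per access."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]
        self.accesses = 0

    def read_row(self, index):
        self.accesses += 1
        return self.rows[index]


def read_column_conventional(mem, col):
    # The elements of a logical column sit in different physical rows,
    # so every element costs one row access.
    return [mem.read_row(r)[col] for r in range(len(mem.rows))]


def read_column_transformed(mem, col):
    # After the transformation, logical column `col` is physical row `col`.
    return list(mem.read_row(col))


B = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]

conventional = RowMemory(B)
transformed = RowMemory([list(col) for col in zip(*B)])

col_a = read_column_conventional(conventional, 2)
col_b = read_column_transformed(transformed, 2)

print(col_a == col_b)         # True
print(conventional.accesses)  # 4
print(transformed.accesses)   # 1
```

Both layouts return the same logical column; only the number of physical row accesses differs.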

As a result, an embodiment of the present invention can increase operation speed by converting the columns of the matrix data being multiplied into rows and storing them in the scratchpad memory of the GPU during matrix operations.

FIG. 2 is a block diagram schematically illustrating the internal configuration of a computer according to an embodiment of the present invention.

Referring to FIG. 2, the computer 100 according to an embodiment of the present invention includes a storage 210, a cache memory 220, a scratchpad memory 230, and a processor 240.

The storage 210 stores the first matrix data and the second matrix data. The storage 210 may of course store various other data in addition to the first matrix data and the second matrix data.

The cache memory 220 may be an L1 (level-1) cache. Since the L1 cache is well known to those skilled in the art, a detailed description is omitted.

The scratchpad memory 230 is a high-speed memory, separate from the main memory, used for holding the operands required by the instruction being executed, retaining intermediate data, handling interrupts, and the like.

The processor 240 controls the internal components of the computer 100 according to an embodiment of the present invention (for example, the storage 210, the cache memory 220, and the scratchpad memory 230).

In addition, the processor 240 fetches the first matrix data and the second matrix data from the storage 210 for the matrix operation. The processor 240 may then store the first matrix data in the L1 cache. The processor 240 may also store a logical column of the fetched second matrix data in the scratchpad memory 230 so that it is located in one physical row. That is, the processor 240 may convert a column of the fetched second matrix data into a row and store it in the scratchpad memory 230.

As a result, during a matrix operation the processor 240 can read the second matrix data by accessing the scratchpad memory 230 only once, which improves computational performance.
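To show that the rearranged layout still yields the ordinary matrix product, the following Python sketch multiplies a first matrix by a second matrix whose logical columns have been stored as rows. The function names and the list-based "scratchpad" are assumptions for illustration; the specification describes a hardware arrangement, not this code.

```python
def transpose(matrix):
    """Store each logical column of `matrix` as one row."""
    return [list(col) for col in zip(*matrix)]


def matmul_with_transformed_b(a, b_rows):
    """Multiply `a` by the matrix whose columns are stored as rows in `b_rows`.

    Each output element pairs one row of `a` with one row of `b_rows`,
    so both operands are read row-wise.
    """
    return [[sum(x * y for x, y in zip(a_row, b_row)) for b_row in b_rows]
            for a_row in a]


def matmul_reference(a, b):
    """Textbook product, reading `b` column-wise, used only as a check."""
    inner, cols = len(b), len(b[0])
    return [[sum(a_row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for a_row in a]


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]

scratchpad = transpose(B)  # second matrix as it would sit in the scratchpad
C = matmul_with_transformed_b(A, scratchpad)

print(C)                            # [[58, 64], [139, 154]]
print(C == matmul_reference(A, B))  # True
```

The result is unchanged; only the access pattern for the second operand differs.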

FIG. 3 is a graph illustrating the improvement in matrix multiplication performance according to an embodiment of the present invention. According to an embodiment of the present invention, matrix multiplication performance improves by 7.6%, 17.8%, 31.6%, and 34.9% over the prior art for matrix sizes of 8 × 8, 16 × 16, 24 × 24, and 32 × 32, respectively. This is mainly because less time is needed to read the column-wise data of the second matrix for the matrix product. That is, instead of aligning the column-wise data of the second matrix along a column of the scratchpad memory, the data is stored so that it is located in one physical row, which allows the matrix operation to proceed with a single access. Because the conventional approach spends considerable time accessing the column elements sequentially as the matrix grows, the memory data conversion method according to an embodiment of the present invention offers a greater benefit for larger matrices.
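Why the benefit grows with matrix size can be seen from a simple count. The sketch below assumes a deliberately simplified model in which every output element of an N × N product reads one logical column of the second matrix, costing N row accesses in the conventional layout and one in the transformed layout. The function name and the model are assumptions for illustration; the counts are not the measured results of FIG. 3 and do not account for caching, data reuse, or other pipeline effects.

```python
def scratchpad_row_accesses(n, transformed):
    """Row accesses to the second matrix for an n x n product (toy model)."""
    per_column = 1 if transformed else n  # accesses to read one logical column
    return n * n * per_column             # one column read per output element


for n in (8, 16, 24, 32):
    conventional = scratchpad_row_accesses(n, transformed=False)
    proposed = scratchpad_row_accesses(n, transformed=True)
    print(f"{n} x {n}: conventional={conventional}, transformed={proposed}")
```

Under this model the ratio between the two counts is N, so the gap widens as the matrix grows, which is consistent in direction with the trend described above.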

FIG. 4 is a graph showing the energy consumption of the scratchpad memory and of the entire system as a function of matrix size, according to an embodiment of the present invention compared to the prior art.

The method according to an embodiment of the present invention reduces the energy consumption of the scratchpad memory by 92.7% to 93.9% and that of the entire system by 6.8% to 19.4%. This is because only a single word line of the scratchpad memory is read for the column-wise elements of the second matrix.

The apparatus and method according to an embodiment of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the field of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

The present invention has been described above with a focus on its embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that it can be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a restrictive sense. The scope of the present invention is indicated by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

100: computer
110: storage
120: cache memory
130: scratchpad memory

Claims

A method of converting memory data during matrix operation, the method comprising:
(a) fetching first matrix data from storage and storing it in a cache memory;
(b) fetching second matrix data from the storage; and
(c) interchanging columns and rows of the fetched second matrix data and storing the result in a scratchpad memory,
wherein step (c) converts data located in a logical column of the fetched second matrix data so that it is located in one physical row and stores it in the scratchpad memory, and
wherein the second matrix data is the matrix data to be multiplied.

(deleted)

A computer comprising:
a storage;
a cache memory;
a scratchpad memory; and
a processor that, during a matrix operation, fetches first matrix data from the storage and stores it in the cache memory, and, when fetching second matrix data from the storage and storing it in the scratchpad memory, stores the second matrix data so that it is located in one physical row,
wherein the processor converts data located in a logical column of the fetched second matrix data so that it is located in one physical row and stores it in the scratchpad memory, and
wherein the second matrix data is the matrix data to be multiplied.

(deleted)