KR20170012019A

KR20170012019A - Method for optimizing parallel matrix multiplication in a system supporting multiple CPU and multiple GPU

Info

Publication number: KR20170012019A
Application number: KR1020160077978A
Authority: KR
Inventors: 한흥우; 이준원; 김석현
Original assignee: 삼성전자주식회사
Priority date: 2015-07-24
Filing date: 2016-06-22
Publication date: 2017-02-02
Also published as: KR102505279B1

Abstract

A computing method and a computing system for multi-process computing are provided. The computing method estimates a block size to be distributed to a plurality of graphics processing units (GPUs) connected to a central processing unit (CPU), and performs, in the memory of each of the plurality of GPUs, a matrix multiplication operation having the estimated block size. The estimated block size is a value estimated through machine learning based on the number of GPUs connected to the CPU and the size of the matrix obtained through the matrix multiplication operation of the plurality of GPUs.

Description

[0001] The present invention relates to a computing method in a computing environment supporting a plurality of CPUs and a plurality of GPUs,

The present disclosure relates to computer processing technology and, more specifically, to a method and system for providing optimal numerical computation (parallel matrix multiplication) in a computing environment composed of a plurality of central processing units (CPUs) and a plurality of graphics processing units (GPUs).

As demand for processing large volumes of data such as big data grows, so does the demand for computing environments that use multiple CPUs and multiple GPUs. There is therefore a need for a computation method that speeds up numerical computation in a system composed of multiple CPUs and multiple GPUs. A GPU greatly increases computation speed by using a large number of cores in parallel to operate on large volumes of data simultaneously.

Conventionally, when data is exchanged among a plurality of GPUs in a computing environment that includes a plurality of GPUs and a plurality of CPUs, the operation data generated by the operations performed on each GPU must traverse a communication path that interconnects the CPUs, such as a QuickPath Interconnect (QPI) channel.

In addition, large-scale matrix multiplication is used when processing large volumes of data. Various techniques for block decomposition of matrices have been proposed for systems implementing multiple processors (multi processes). For systems using multiple GPUs, by contrast, the development of techniques for optimal block decomposition of matrices is at an early stage. A method is therefore needed for finding the optimal block size that can improve matrix operation speed in a multi-GPU system.

The present disclosure has been devised to address the need described above, and its object is to provide a method and system that estimate an optimal block size through machine learning and thereby speed up parallel operations on a plurality of GPUs connected to a CPU.

To achieve the above object, a computing system for multi-process computing according to an embodiment of the present disclosure includes a central processing unit (CPU), a plurality of graphics processing units (GPUs) connected to the CPU through a connector, and a memory included in each of the plurality of GPUs. Each of the plurality of GPUs performs, in its memory, a matrix multiplication operation having a designated block size. The designated block size is a value estimated using machine learning based on the number of GPUs connected to the CPU and the size of the square matrix obtained through the matrix multiplication operation of the plurality of GPUs.

Likewise, to achieve the above object, a computing method for multi-process computing according to an embodiment of the present disclosure includes estimating a block size to be distributed to a plurality of GPUs connected to a CPU through a connector, and performing, in the memory of each of the plurality of GPUs, a matrix multiplication operation having the estimated block size. The estimated block size is a value estimated using machine learning based on the number of GPUs connected to the CPU and the size of the square matrix obtained through the matrix multiplication operation of the plurality of GPUs.

The data processing method according to embodiments of the present disclosure can provide high performance computing (HPC) by performing high-speed parallel operations through a plurality of GPUs in a computing system that includes a plurality of GPUs connected to a CPU.

FIG. 1 is a simplified block diagram of a computing system according to an embodiment of the present disclosure;
FIG. 2 is a detailed block diagram of a computing system according to an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating a computation method in a plurality of GPUs according to an embodiment of the present disclosure;
FIG. 4 is a diagram illustrating a computation method in a plurality of GPUs according to embodiments of the present disclosure;
FIG. 5 is a flowchart illustrating a method of estimating an optimal block size through machine learning according to an embodiment of the present disclosure;
FIG. 6 is a graph showing execution time according to block size according to an embodiment of the present disclosure; and
FIG. 7 is a table showing the results of estimating the optimal block size according to an embodiment of the present disclosure.

The terms used in this specification are described briefly, and then the present disclosure is described in detail.

The terms used in the present disclosure are general terms that are currently in wide use, selected in consideration of their functions in the present disclosure; however, they may vary depending on the intent of those skilled in the art, precedent, the emergence of new technologies, and the like.

In certain cases, a term may have been arbitrarily selected by the applicant, in which case its meaning is described in detail in the corresponding part of the description. Accordingly, the terms used in the present disclosure should be defined based on their meaning and on the overall content of the present disclosure, not simply on their names.

The embodiments of the present disclosure may be modified in various ways and may take various forms; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the scope to particular embodiments, and the disclosure should be understood to include all modifications, equivalents, and alternatives falling within the disclosed spirit and technical scope. In describing the embodiments, a detailed description of related known technology is omitted when it is determined that it could obscure the gist.

Terms such as first and second may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "consist of" are intended to specify the presence of the features, numbers, steps, operations, elements, components, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.

In the present disclosure, a "module" or "unit" performs at least one function or operation, and may be implemented in hardware, in software, or in a combination of hardware and software. In addition, a plurality of "modules" or "units" may be integrated into at least one module and implemented by at least one processor (not shown), except for a "module" or "unit" that needs to be implemented in specific hardware.

Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can readily practice them. The present disclosure may, however, be implemented in many different forms and is not limited to the embodiments described here. In the drawings, parts unrelated to the description are omitted so that the present disclosure can be described clearly, and similar parts are given similar reference numerals throughout the specification.

FIG. 1 is a simplified block diagram of a computing system according to an embodiment of the present disclosure. Referring to FIG. 1, the computing system 10 may include a plurality of CPUs 110 and 210, and a plurality of connectors 120 and 220 that connect each CPU 110, 210 and each memory 130, 230 to a plurality of GPUs 140, 150, 240, and 250.

The system 10 may be, for example, a supercomputer that includes a plurality of CPUs and a plurality of GPUs. The system 10 may also be a computing system that includes one CPU and a plurality of GPUs. The system 10 may be implemented in a single electronic device such as a computer.

Each CPU 110, 210 may be a chipset that includes system memory. Each GPU 140, 150, 240, 250 may be a special-purpose processor for processing graphics data, such as that of video games. For example, the GPU may be an NVIDIA™ GPU, but is not limited thereto.

The connectors 120 and 220 allow the plurality of GPUs 140, 150, 240, and 250, which are connected to the connectors 120 and 220 through buses serving as a plurality of network interfaces, to communicate with the memories 130 and 230 and the CPUs 110 and 210 connected to each connector 120, 220.

For example, the connectors 120 and 220 may be root complexes in a PCI Express (Peripheral Component Interconnect Express, PCIe™) system, which is a bridged host interface. For convenience of description, the present disclosure describes a computing system implemented with a PCIe™ system, but is not limited thereto. The term "root complex" as used in the present disclosure may mean "connector".

According to an embodiment of the present disclosure, GPU1 140 and GPU2 150, which are a plurality of first GPUs, may be connected to a first CPU 110 and a first memory 130 through a first connector 120. GPU3 240 and GPU4 250, which are a plurality of second GPUs, may be connected to a second CPU 210 and a second memory 230 through a second connector 220. The first CPU 110 and the second CPU 210 may communicate through a communication path that interconnects the CPUs 110 and 210.

According to an embodiment of the present disclosure, each of the plurality of first GPUs 140 and 150 may perform a first operation based on address information stored in each other's memory to generate first operation data. Likewise, each of the plurality of second GPUs 240 and 250 may perform a second operation based on address information stored in each other's memory to generate second operation data. The first operation data and the second operation data may then be exchanged through the communication path.

For example, GPUs 140 and 150, the plurality of first GPUs commonly connected to the first connector 120, may each perform an operation to generate the first operation data, and GPUs 240 and 250, the plurality of second GPUs commonly connected to the second connector 220, may each perform an operation to generate the second operation data.

As another example, GPU1 140 and GPU2 150 connected to the first connector 120 may share a single switch, as shown in FIG. 2. Alternatively, GPU1 140 and GPU2 150 may each be connected to a different switch and, through those switches, to the one first connector 120.

When GPU1 140 and GPU2 150 are connected to the first connector 120 through different switches, GPU1 140 and GPU2 150 may also be connected to each other through a bus such as NVIDIA Lane™. In this case, GPU1 140 and GPU2 150 may perform operations by requesting each other's address information directly over the connecting bus, without going through the first connector 120.

The operation of GPU3 240 and GPU4 250 connected to the second connector 220 corresponds to the operation of GPU1 140 and GPU2 150 connected to the first connector 120 described above, so a detailed description is omitted.

Although FIG. 1 shows two GPUs connected to each of the connectors 120 and 220, this is only one embodiment, and each connector 120, 220 may be connected to two or more GPUs. In addition, each of the plurality of CPUs 110 and 210 in the computing system 10 may include one or more connectors 120, 220, but the computing system 10 may also include a CPU that does not include a connector.

Each of the plurality of first GPUs 140 and 150 and the plurality of second GPUs 240 and 250 may include a memory. The operation performed on each GPU may be a parallel matrix multiplication operation in which other GPUs are accessed based on the addresses stored in the memory of each GPU. The parallel matrix multiplication operation of a GPU may be performed in the memory of that GPU.

In an embodiment of the present disclosure, when the target address of at least one of the GPUs 140, 150, 240, and 250, for example GPU2 150, is in the memory of GPU3 240, which is connected to a different connector 220 from GPU2 150, the exchange of operation data between GPU2 150 and GPU3 240 may be performed through the communication path interconnecting CPU1 110 and CPU2 210.

For example, the communication path may be HyperTransport (HT) or QuickPath Interconnect (QPI), but is not limited thereto.

The plurality of first GPUs 140 and 150 and the plurality of second GPUs 240 and 250 may be connected to the respective connectors 120 and 220 through switches. There may be a plurality of switches connected to the first connector 120 and to the second connector 220. The switches connecting the connectors 120 and 220 to the plurality of GPUs 140, 150, 240, and 250 are described with reference to FIG. 2.

At least one of the plurality of GPUs 140, 150, 240, and 250 connected to the CPUs 110 and 210 may be a master, and the remaining GPUs may be slaves. For example, when four GPUs are connected to the first CPU 110, one of the four GPUs may be the master and the other three may be slaves. The master GPU may build a database, in the memory of the master GPU or in the host memory of the CPU, of the input matrix size of each GPU, the number of GPUs connected to the CPU, and the block size used for matrix multiplication on the plurality of GPUs.

According to an embodiment of the present disclosure, the master GPU may use machine learning, based on the data stored in the database, to estimate the optimal block size with which the plurality of GPUs can perform the matrix multiplication operation in the shortest time. The method of estimating the optimal block size is described in detail with reference to FIGS. 5 to 7.
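The database described above can be illustrated with a minimal Python sketch. The record layout, the timing values, and the names `Record`, `best_blocks`, and `estimate_block_size` are hypothetical, and the nearest-neighbor lookup is only an assumed stand-in for the machine-learning estimator, which this part of the description does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    matrix_size: int  # N of the N x N input matrices
    num_gpus: int     # number of GPUs connected to the CPU
    block_size: int   # block size used for the run
    seconds: float    # measured execution time (hypothetical values)


def best_blocks(records):
    """Map (matrix_size, num_gpus) to the block size with the shortest time."""
    best = {}
    for r in records:
        key = (r.matrix_size, r.num_gpus)
        if key not in best or r.seconds < best[key][1]:
            best[key] = (r.block_size, r.seconds)
    return {key: block for key, (block, _) in best.items()}


def estimate_block_size(records, matrix_size, num_gpus):
    """Return the best block size of the closest recorded configuration.

    Assumed stand-in for the learned model: prefer records with the same
    GPU count, then the closest matrix size.
    """
    table = best_blocks(records)
    if not table:
        raise ValueError("the database holds no records")

    def distance(key):
        n, g = key
        return (abs(g - num_gpus), abs(n - matrix_size))

    return table[min(table, key=distance)]


if __name__ == "__main__":
    db = [
        Record(4096, 2, 256, 1.9),
        Record(4096, 2, 512, 1.4),
        Record(8192, 4, 512, 5.0),
        Record(8192, 4, 1024, 4.2),
    ]
    print(estimate_block_size(db, 5000, 2))
```

A real estimator would be trained on many such records; the lookup only shows which inputs (matrix size, GPU count) and which target (block size with the shortest execution time) the database provides.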

FIG. 2 is a detailed block diagram of a computing system according to an embodiment of the present disclosure. Referring to FIG. 2, the computing system 10 may include a plurality of CPUs, such as a first CPU 110 and a second CPU 210. The CPUs 110 and 210 may be connected to a first root complex 120 and a second root complex 220, respectively, through a PCIe™ bus. The first root complex 120 and the second root complex 220 each include a plurality of PCIe™ device downstream ports 270 and can access the respective memories 130 and 230.

Each root complex 120, 220 may communicate, through a PCIe™ bus, with a PCIe™ switch 160, 260 connected to a plurality of GPUs. One of the PCIe™ device downstream ports 270 of the root complex 120, 220 may communicate with a PCIe™ device upstream port 280 of the PCIe™ switch 160, 260 through the PCIe™ bus.

Each PCIe™ switch 160, 260 includes a plurality of PCIe™ device downstream ports 270, and each PCIe™ device downstream port 270 may communicate with the PCIe™ device upstream port 280 of one GPU 140, 150 through a PCIe™ bus. The GPUs 140, 150, 240, and 250 may include GPU memories 140-1, 150-1, 240-1, and 250-1, respectively. Each GPU 140, 150, 240, 250 can access the GPU memories 140-1, 150-1, 240-1, and 250-1 through the PCIe™ bus. The GPU memories 140-1, 150-1, 240-1, and 250-1 can perform numerous parallel matrix multiplication operations.

For example, the plurality of CPUs 110 and 210 of the computing system 10 may be interconnected through a QuickPath Interconnect (QPI).

However, the configuration of the computing system 10 shown in FIG. 2 is only one embodiment for describing the present disclosure; the system is not limited to it and may be implemented in various ways.

FIG. 3 is a flowchart illustrating a computation method in a plurality of GPUs according to an embodiment of the present disclosure.

In step S310, a parallel matrix multiplication operation is performed on each of the plurality of GPUs connected to each root complex of the computing system 10. In the parallel matrix multiplication operation, the memory of another GPU is accessed based on the addresses stored in the memory of each GPU, and the operation may be performed in the memory of each GPU.

For example, the parallel matrix multiplication operation performed on the GPUs may be the square-matrix product C = A×B. Here, A may be the input matrix (N×N) of a first GPU, B may be the input matrix (N×N) of a second GPU, and C may be the square matrix (N×N) resulting from the parallel matrix multiplication operation of the first GPU and the second GPU. Specifically, C(i) is a row vector of the N×N square matrix and may be given by the following function.

Here, |B| may be the blocks in the rows of the matrix that belong to S_i, S_i may be the i-th GPU, A(i) may be |B|/S, S may be the number of GPUs included in the CPU, B(i, j) may be |B|/S, and |B|/S may be a sub-block of B(i).
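The role of the row blocks can be illustrated with a minimal Python sketch. It is not the function referred to above, which the text does not reproduce; it shows one common row-block decomposition, assumed here, in which simulated GPU i holds the row block A(i) and computes C(i) by accumulating the products of the column slices A(i, j) with the row blocks B(j) held by the other GPUs. The names `matmul`, `row_blocks`, and `block_parallel_matmul` are hypothetical, and the GPUs are simulated by a sequential loop.

```python
def matmul(a, b):
    """Plain dense product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def row_blocks(m, parts):
    """Split the rows of m into `parts` contiguous blocks of equal height."""
    n = len(m)
    if n % parts:
        raise ValueError("row count must be divisible by the number of parts")
    h = n // parts
    return [m[k * h:(k + 1) * h] for k in range(parts)]


def block_parallel_matmul(a, b, num_gpus):
    """Compute C = A x B with one row block C(i) per simulated GPU."""
    a_blocks = row_blocks(a, num_gpus)  # A(i), held by GPU i
    b_blocks = row_blocks(b, num_gpus)  # B(j), held by GPU j
    h = len(a) // num_gpus
    c = []
    for i in range(num_gpus):
        # GPU i accumulates A(i, j) x B(j) over all j; each B(j) with
        # j != i would be fetched from the memory of GPU j.
        acc = [[0] * len(b[0]) for _ in range(h)]
        for j in range(num_gpus):
            a_ij = [row[j * h:(j + 1) * h] for row in a_blocks[i]]
            partial = matmul(a_ij, b_blocks[j])
            acc = [[x + y for x, y in zip(r1, r2)]
                   for r1, r2 in zip(acc, partial)]
        c.extend(acc)
    return c
```

Each fetch of B(j) from another GPU is the data exchange whose path (shared switch, shared root complex, or inter-CPU link) determines the communication cost.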

Conventionally, when each GPU performs a parallel matrix multiplication operation in a computing system 10 that includes a plurality of CPUs, each with a plurality of GPUs, B(j) must be exchanged from one CPU to another over a communication path such as QPI for every process during the computation. This lowers the speed of the computation.

According to an embodiment of the present disclosure, the matrix multiplication operations of the plurality of GPUs may be performed in the memory of each GPU. In addition, the matrix multiplication operations are performed first among the GPUs connected to a common root complex. Unlike the conventional approach, the GPU computation method according to the embodiment therefore does not need to traverse a communication path such as QPI every time a GPU performs an operation, which can improve computation speed.

In step S330, operation data is exchanged between GPUs connected to different root complexes among the plurality of GPUs of the computing system 10. In the parallel matrix multiplication operation, a GPU may access another GPU based on the addresses stored in the memory of each GPU. Accordingly, the target address of at least one of the GPUs connected to a common root complex may belong to one of the GPUs connected to a different root complex.

For example, when a plurality of root complexes are connected to a common CPU, the target address of at least one of the GPUs connected to one root complex may belong to one of the GPUs connected to another root complex of the same CPU. In this case, the GPUs can exchange the operation data produced by their computations within the common CPU, without traversing a communication path such as QPI. Computation speed can therefore be improved over the conventional approach.

In step S350, GPU operation data is exchanged between the plurality of CPUs of the computing system 10. As described for step S330, the target address of at least one of the GPUs connected to a common root complex may belong to one of the GPUs connected to a different root complex.

For example, the target address of a first GPU connected to the first root complex of the first CPU may be in the memory of a second GPU connected to the second root complex of the second CPU. In this case, the first GPU may exchange the first operation data obtained by the plurality of first GPUs of the first CPU and the second operation data obtained by the plurality of second GPUs of the second CPU through a communication path such as QPI.
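The ordering of steps S310, S330, and S350 can be sketched in Python as a grouping of GPU pairs by how far apart they sit in the topology. This is a minimal illustration under assumptions: the `Gpu` record, the integer indices, and the functions `exchange_stage` and `schedule` are hypothetical and only encode the rule that pairs under a common root complex come first, pairs under different root complexes of a common CPU come next, and pairs under different CPUs come last.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Gpu:
    name: str
    cpu: int           # index of the CPU the GPU is attached to
    root_complex: int  # index of the root complex (connector)


def exchange_stage(a, b):
    """Return the step in which GPUs a and b would work together."""
    if a.cpu != b.cpu:
        return "S350"  # crosses the inter-CPU path (for example QPI)
    if a.root_complex != b.root_complex:
        return "S330"  # different root complexes of a common CPU
    return "S310"      # common root complex


def schedule(gpus):
    """Group every GPU pair by step, in the order S310, S330, S350."""
    stages = {"S310": [], "S330": [], "S350": []}
    for a, b in combinations(gpus, 2):
        stages[exchange_stage(a, b)].append((a.name, b.name))
    return stages


if __name__ == "__main__":
    # Topology of FIG. 1: two GPUs per root complex, one root complex per CPU.
    fig1 = [Gpu("GPU1", 0, 0), Gpu("GPU2", 0, 0),
            Gpu("GPU3", 1, 1), Gpu("GPU4", 1, 1)]
    for step, pairs in schedule(fig1).items():
        print(step, pairs)
```

With the FIG. 1 topology no pair falls into S330, because each CPU has a single root complex; that step only matters when a CPU hosts more than one root complex.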

FIG. 4 is a diagram illustrating a computation method in a plurality of GPUs according to embodiments of the present disclosure.

Referring to FIG. 4, the thin lines 410 illustrate operations on GPUs connected through a common root complex and a common switch (GPU1 and GPU2, GPU3 and GPU4, GPU5 and GPU6, GPU7 and GPU8). These GPUs may each generate operation data by performing parallel matrix multiplication operations based on the address information stored in the memory of each GPU. The GPUs may exchange the generated operation data with each other.

For example, GPU1 and GPU2 may be connected through a bus such as a PCIe™ bus, and GPU1 and GPU2 may directly access each other's memory to perform operations. Each of the pairs GPU3 and GPU4, GPU5 and GPU6, and GPU7 and GPU8 may perform operations in the same way as GPU1 and GPU2.

The dotted lines 420 show data exchange between GPU pairs connected to a common root complex through different switches (GPU1 and GPU4, GPU2 and GPU3, GPU5 and GPU8, GPU6 and GPU7).

For example, GPU1 may exchange data with GPU4 by way of the first switch and the first root complex over the PCIe™ bus. Each of the pairs GPU2 and GPU3, GPU5 and GPU8, and GPU6 and GPU7 may exchange data in the same way as GPU1 and GPU4.

The thick lines 430 show data exchange between GPUs that share neither a CPU nor a root complex.

For example, GPU1 may exchange data with GPU8 over the PCIe™ bus that connects the devices of the computing system 10 (CPUs, root complexes, switches, memories, and GPUs) and over a communication path such as QPI that interconnects the first CPU and the second CPU. GPU4 may likewise exchange data with GPU5 using the QPI communication path. When GPU1 and GPU8, or GPU4 and GPU5, exchange data with each other, the data may pass through the switch, root complex, and CPU to which each GPU is connected over the PCIe™ bus.
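The three kinds of lines in FIG. 4 can be read as a classification of GPU pairs by the interconnect components they share. The following is a minimal sketch of that reading, not the disclosed implementation. The `TOPOLOGY` table and the `classify_path` helper are hypothetical names, and the switch numbers are assumed for illustration; only the pairings themselves come from the description of FIG. 4.

```python
# Hypothetical table mirroring FIG. 4: GPU -> (root complex, switch).
# Each root complex belongs to one CPU; switch numbers are assumed labels.
TOPOLOGY = {
    "GPU1": (1, 1), "GPU2": (1, 1),
    "GPU3": (1, 2), "GPU4": (1, 2),
    "GPU5": (2, 3), "GPU6": (2, 3),
    "GPU7": (2, 4), "GPU8": (2, 4),
}


def classify_path(gpu_a, gpu_b):
    """Return which line style of FIG. 4 describes the path between two GPUs."""
    root_a, switch_a = TOPOLOGY[gpu_a]
    root_b, switch_b = TOPOLOGY[gpu_b]
    if root_a != root_b:
        # Different root complexes (and CPUs): traffic crosses the CPU-to-CPU path (e.g., QPI).
        return "thick"
    if switch_a != switch_b:
        # Same root complex, different switches: traffic goes up to the root complex.
        return "dotted"
    # Same switch: the two GPUs can reach each other below the root complex.
    return "thin"


print(classify_path("GPU1", "GPU2"))
print(classify_path("GPU1", "GPU4"))
print(classify_path("GPU1", "GPU8"))
```

Only the pairs classified as "thick" need the inter-CPU communication path, which is why that path is the one the description seeks to use sparingly.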

Each root complex may receive, from each GPU, a request to access the memory connected to that root complex. The root complex may receive the results of data exchange between the GPUs on behalf of the CPU and store the results in the memory connected to (or included in) the root complex. The root complex may transmit the data exchange results of the GPUs to the CPU, and the CPU may store the received data in the host memory included in the CPU.

That is, conventionally, GPU data was exchanged over QPI as many times as the number of GPUs included in the computing system 10 (8 in the embodiment of FIG. 4), whereas the method according to an embodiment of the present disclosure may use QPI only as many times as the number of CPUs included in the computing system 10 (2 in the embodiment of FIG. 4). Therefore, the data exchange method between GPUs according to the embodiments described above can greatly reduce the number of QPI uses compared with the conventional method and thereby improve the operation speed of graphics processing.

FIG. 5 is a flowchart illustrating a method of estimating an optimal block size through machine learning according to an embodiment of the present disclosure.

For example, the GPUs may perform parallel matrix multiplication using a programming model such as NVIDIA CUDA™. However, this is only one embodiment for describing the present disclosure, and the disclosure is not limited thereto.

In step S510, the computing system 10 may build a database for estimating the optimal block size in the parallel matrix multiplication of a plurality of GPUs connected to one CPU.

The computing system 10 may split the square input matrix for the matrix multiplication of the GPUs into a plurality of blocks. The computing system 10 may distribute the split blocks to the master GPU and the slave GPUs. Therefore, if the optimal size of the blocks to be distributed to each GPU is determined from the input square matrix, the operation speed of the GPUs can be optimized.

The computing system 10 may designate one of the GPUs included in one CPU as the master and the remaining GPUs as slaves. The computing system 10 may store in a database the number of GPUs included in one CPU (s), the input matrix size of each GPU (n), and the block size of the square matrix (b). The database may reside in the memory of the master GPU or in a big-data store of the computing system 10.

In step S520, the computing system 10 may estimate the optimal block size using machine learning, such as deep learning, based on the database built in step S510.

In block matrix decomposition, the block size plays an important role in the run time of the GPUs when parallel matrix multiplication is performed across a plurality of GPUs. Therefore, estimating the optimal block size can shorten the data processing time on the GPUs.
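The splitting and distribution described for step S510 can be sketched as follows. This is an illustrative sketch only: the function names `split_into_blocks` and `distribute` are hypothetical, and the round-robin assignment with GPU index 0 as the master is an assumption, since the disclosure does not state how blocks are assigned to the master and slave GPUs.

```python
def split_into_blocks(n, b):
    """Return (row_start, row_stop, col_start, col_stop) for each block of an n x n matrix."""
    edges = list(range(0, n, b)) + [n]
    return [
        (edges[i], edges[i + 1], edges[j], edges[j + 1])
        for i in range(len(edges) - 1)
        for j in range(len(edges) - 1)
    ]


def distribute(blocks, s):
    """Assign blocks to s GPUs in round-robin order (assumed policy); GPU 0 is the master."""
    assignment = {gpu: [] for gpu in range(s)}
    for k, block in enumerate(blocks):
        assignment[k % s].append(block)
    return assignment


# The example from the description: a 16k x 16k matrix, 4k x 4k blocks, two GPUs.
blocks = split_into_blocks(16000, 4000)
assignment = distribute(blocks, 2)
print(len(blocks), [len(v) for v in assignment.values()])
```

When the matrix size is not a multiple of the block size, the last block in each direction is simply smaller.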

FIG. 6 is a graph showing execution time as a function of block size according to an embodiment of the present disclosure.

As described with reference to FIG. 3, the matrix multiplication on two GPUs may, for example, produce the square matrix C = A×B. Here, A may be an N×N matrix of the first GPU, B may be an N×N matrix of the second GPU, and C may be the N×N square matrix that is the parallel product of A and B.
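To make the block decomposition of C = A×B concrete, the sketch below computes the product block by block and checks it against the direct product. It runs in a single process in plain Python; in the setting of the disclosure each output block could be assigned to a different GPU, which this sketch does not attempt. The function names are hypothetical.

```python
def matmul(A, B):
    """Direct N x N matrix product, used here only as a reference."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def blocked_matmul(A, B, b):
    """Compute C = A x B by accumulating b x b block products."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, b):              # block row of C
        for j0 in range(0, n, b):          # block column of C
            for k0 in range(0, n, b):      # blocks of A and B that contribute
                for i in range(i0, min(i0 + b, n)):
                    for j in range(j0, min(j0 + b, n)):
                        C[i][j] += sum(A[i][k] * B[k][j]
                                       for k in range(k0, min(k0 + b, n)))
    return C


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
print(blocked_matmul(A, B, 2) == matmul(A, B))
```

The result does not depend on the block size; only the amount of work per block and the number of blocks change, which is the trade-off that the block size estimation addresses.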

FIG. 6 is a graph showing the matrix multiplication speed as a function of block size when the product of random 10,000 × 10,000 square matrices (A×B) is computed using four GPUs.

Referring to FIG. 6, a small number of blocks (# of Block) means that the square matrix has been decomposed into relatively large blocks, so that a large amount of data is assigned to each block. Because the amount of operation data processed per block within the GPUs increases, the input matrix processing time may decrease. However, if the number of blocks is too small, the square matrix has been decomposed into blocks that are too large, so the amount of operation data within each GPU becomes excessive and the input matrix processing time may increase.

Conversely, a large number of blocks (# of Block) means that the square matrix has been decomposed into relatively small blocks, so that the number of block data exchanges between the GPUs increases. As the number of data exchanges between the GPUs increases, the processing time of the input matrix multiplication on the GPUs may increase.

For example, in FIG. 6, the number of blocks may be optimal when the square matrix is decomposed into about 50 blocks. This may mean that the data is evenly distributed across the GPUs. Accordingly, the speed of the input matrix operation may reach its optimal value.

According to an embodiment of the present disclosure, when the square matrix C in C = A×B is decomposed into a plurality of blocks in the parallel matrix multiplication using a plurality of GPUs described with reference to FIG. 3, the optimal block size $b^*$ for the parallel matrix multiplication may be estimated with the function $b^* = g(n, s)$. Here, $n$ is the size of the square matrix and $s$ is the number of GPUs performing the operation under one CPU. The function $g$ may be a function that infers the optimal block size $b^*$ using machine learning, such as deep learning, and regression analysis.

Regression analysis is an analysis method that fits a model between two variables for observed continuous variables and then measures the goodness of fit. As one example, Gaussian process regression may be used. In addition, data analysis through machine learning using an artificial neural network can achieve statistically high accuracy. In the present disclosure, Gaussian process regression was applied to estimate the optimal block size, but this is only one embodiment and the disclosure is not limited thereto.
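As a rough illustration of how Gaussian process regression could yield a function of the form b* = g(n, s), the sketch below fits a zero-mean Gaussian process with a squared-exponential kernel and returns its posterior mean as g. Everything specific here is an assumption made for the example: the kernel, the log2 scaling of n and b, the noise level, the helper names, and above all the training samples, which are invented and are not values from the disclosure.

```python
import math


def rbf(x, y, length=1.0):
    """Squared-exponential kernel between two feature tuples."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * length ** 2))


def solve(K, y):
    """Solve K a = y by Gaussian elimination with partial pivoting."""
    n = len(K)
    M = [row[:] + [y[i]] for i, row in enumerate(K)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    a = [0.0] * n
    for i in range(n - 1, -1, -1):
        a[i] = (M[i][n] - sum(M[i][j] * a[j] for j in range(i + 1, n))) / M[i][i]
    return a


def fit_g(samples, noise=1e-6):
    """Fit g from samples {(n, s): b*}; n and b* are log2-scaled (assumed choice)."""
    X = [(math.log2(n), float(s)) for n, s in samples]
    targets = [math.log2(b) for b in samples.values()]
    mean = sum(targets) / len(targets)
    K = [[rbf(xi, xj) + (noise if i == j else 0.0) for j, xj in enumerate(X)]
         for i, xi in enumerate(X)]
    alpha = solve(K, [t - mean for t in targets])

    def g(n, s):
        """Posterior mean of the Gaussian process, mapped back to a block size."""
        x = (math.log2(n), float(s))
        return 2.0 ** (mean + sum(a * rbf(xi, x) for a, xi in zip(alpha, X)))

    return g


# Invented training samples (n, s) -> b*; NOT taken from FIG. 7.
samples = {(4000, 2): 1000, (8000, 2): 2000, (16000, 2): 4000,
           (4000, 4): 1000, (8000, 4): 2000, (16000, 4): 2000}
g = fit_g(samples)
print(round(g(8000, 2)), g(12000, 3) > 0)
```

At the training points the fitted g reproduces the logged optimum almost exactly, and between them it interpolates. In practice the prediction would still need to be rounded to a block size the system supports.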

For example, machine learning such as deep learning can infer and extract hidden key values from a database such as big data. That is, a deep neural network (DNN) repeatedly trains by computing on input values and can infer the key values obtained as a result of training.

According to an embodiment of the present disclosure, the input values to the deep neural network (DNN) may be, for the square matrix C = A×B, the size of the square matrix ($n$), the number of GPUs connected to the CPU ($s$), and the block size ($b$). The testing function for extracting the optimal block size $b^*$ may be $b^* = \arg\min_b f(n, s, b)$, where $f(n, s, b)$ may be the execution time of the parallel matrix multiplication. Based on the values $(n, s, b)$ input to the DNN, the system 10 may extract the function $b^* = g(n, s)$ using a deep learning method, regression analysis, or the like. Accordingly, the system 10 may infer an $h$-dimensional vector, where $h$ is the number of hidden nodes in the DNN.
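The testing function, which selects the block size with the shortest execution time for a given n and s, can be sketched as a lookup over logged run times. The `best_block_size` helper and the contents of `log` are hypothetical; the run times are placeholder numbers chosen for the example, not measurements from the disclosure.

```python
def best_block_size(log, n, s):
    """Return b* = argmin over b of f(n, s, b) among the block sizes logged for (n, s)."""
    candidates = {b: t for (n_i, s_i, b), t in log.items() if (n_i, s_i) == (n, s)}
    if not candidates:
        raise KeyError(f"no run times logged for n={n}, s={s}")
    return min(candidates, key=candidates.get)


# Placeholder log {(n, s, b): run time in seconds}; values are invented.
log = {
    (16000, 2, 1000): 9.1,
    (16000, 2, 2000): 7.4,
    (16000, 2, 4000): 6.8,
    (16000, 2, 8000): 8.2,
    (8000, 2, 1000): 1.3,
    (8000, 2, 2000): 1.1,
}

print(best_block_size(log, 16000, 2))
```

Pairs of (n, s) together with the b* found this way are the kind of samples from which a regression model for g(n, s) could then be fitted.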

FIG. 7 is a table showing the results of estimating the optimal block size according to an embodiment of the present disclosure.

According to an embodiment of the present disclosure, the database may be generated by logging the function $f(n, s, b)$. As described above, $n$ may be the input matrix size of the GPUs, $s$ may be the number of GPUs included in the CPU, and $b$ may be the block size in the matrix multiplication. For example, $n$ may be an input matrix size of 4,000 (4k), 8,000 (8k), or 16,000 (16k); $s$ may be a GPU count such as 2, 3, or 4; and $b$ may range over 1k, 2k, 4k, 8k, and 16k. For example, one CPU may include two GPUs, the square matrix input through the two GPUs may have a size of 16,000 (16k), and the input square matrix may be decomposed into blocks of 4,000 (4k) × 4,000 (4k) and distributed to each GPU.

The table of FIG. 7 shows the mean and standard deviation of the function $f(n, s, b)$ for each parameter setting, obtained by repeating the matrix multiplication 10 times. Each entry of the table is the mean and standard deviation of $f(n, s, b)$ over at least 5 runs at specific values of s, n, and b (the number in parentheses is the standard deviation and the number outside the parentheses is the mean). The circled numbers in the table are the mean values at the n, s, and b for which the optimal block size $b^*$ was obtained. Based on the data in the table, the computing system 10 may generate metadata and apply Gaussian process regression to the metadata values.
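Logging f(n, s, b) as a mean and standard deviation over repeated runs might look like the sketch below. The `log_runtime` helper is hypothetical, and `dummy_run` is a stand-in workload used only so that the example executes; a real system would launch the multi-GPU matrix multiplication there instead.

```python
import statistics
import time


def log_runtime(run, n, s, b, repeats=10):
    """Time run(n, s, b) `repeats` times; return (mean, standard deviation) in seconds."""
    if repeats < 2:
        raise ValueError("at least two runs are needed for a standard deviation")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run(n, s, b)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


def dummy_run(n, s, b):
    """Stand-in workload (hypothetical); it ignores its arguments."""
    sum(i * i for i in range(1000))


# Build a small database keyed by (n, s, b), in the spirit of the table of FIG. 7.
database = {}
for b in (1000, 2000, 4000):
    database[(16000, 2, b)] = log_runtime(dummy_run, 16000, 2, b)

for key, (mean, std) in database.items():
    print(key, f"{mean:.6f} ({std:.6f})")
```

The printed layout follows the convention described for the table, with the mean outside the parentheses and the standard deviation inside.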

Accordingly, when parallel matrix multiplication is performed on a plurality of GPUs included in a CPU using the method described above, the matrix may be partitioned into a plurality of blocks by block decomposition, the optimal block size may be estimated, and matrices having the optimal block size may be distributed to the GPUs. Neural network machine learning and regression analysis (e.g., Gaussian process regression) may be used to estimate the optimal block size.

According to the embodiment of the present disclosure described above, the optimal block size to be distributed to a plurality of GPUs may be estimated using machine learning. Accordingly, high-speed parallel operations can be performed through the GPUs connected to the CPU to provide high performance computing (HPC).

Meanwhile, the method described above can be written as a program executable on a computer and can be implemented on a general-purpose digital computer that runs the program using a computer-readable recording medium. In addition, the structure of the data used in the method can be recorded on a computer-readable recording medium through various means. The computer-readable recording medium includes storage media such as magnetic storage media (e.g., ROM, floppy disk, hard disk) and optical reading media (e.g., CD-ROM, DVD).

Those of ordinary skill in the art related to the present embodiments will understand that the disclosure may be implemented in modified forms without departing from the essential characteristics of the above description. Therefore, the disclosed methods should be considered in an illustrative rather than a restrictive sense. The scope of the present invention is set forth in the claims rather than in the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

10: computing system
110: first CPU, 210: second CPU
120: first connector (root complex), 220: second connector (root complex)
130: first memory, 230: second memory
140: first GPU, 150: second GPU, 240: third GPU, 250: fourth GPU

Claims

1. A computing system for multi-processing, comprising:
a central processing unit (CPU);
a plurality of graphics processing units (GPUs) connected to the CPU; and
a memory included in each of the plurality of GPUs,
wherein each of the plurality of GPUs performs matrix multiplication with a designated block size in the memory of that GPU, and
wherein the designated block size is a value estimated using machine learning based on the number of GPUs connected to the CPU and the size of a square matrix obtained through the matrix multiplication of the plurality of GPUs.

2. The computing system of claim 1,
wherein one of the plurality of GPUs is a master and the remaining GPUs are slaves, and the master GPU builds, in a memory of the master GPU, a database of the input matrix size of each of the plurality of GPUs, the number of the plurality of GPUs, and the block size.

3. The computing system of claim 2,
wherein the master GPU splits the square matrix into blocks of the estimated block size and distributes the blocks to the master GPU and the slave GPUs.

4. The computing system of claim 1,
wherein the testing function for extracting the optimal block size ($b^*$) using the machine learning is $b^* = \arg\min_b f(n, s, b)$,
where $b^*$ is the optimal block size, $f(n, s, b)$ is the execution time of the parallel matrix multiplication, $s$ is the number of GPUs connected to the CPU, $n$ is the size of the input square matrix, and $b$ is the block size.

5. The computing system of claim 4,
wherein the function for estimating, through the machine learning, the block size with which the matrix multiplication is performed in the shortest time is
$b^* = g(n, s)$,
where $g$ is a function that infers the optimal block size ($b^*$) using machine learning and regression analysis.

6. The computing system of claim 1,
further comprising a network interface connecting each of the plurality of GPUs,
wherein, in the matrix multiplication,
each of the plurality of GPUs accesses the memory of another GPU through the network interface based on an address stored in the memory of each of the plurality of GPUs.

7. A computing method for multi-processing, comprising:
estimating a block size to be distributed to a plurality of GPUs connected to a CPU; and
performing matrix multiplication with the estimated block size in a memory of each of the plurality of GPUs,
wherein the estimated block size is a value estimated using machine learning based on the number of GPUs connected to the CPU and the size of a square matrix obtained through the matrix multiplication of the plurality of GPUs.

8. The method of claim 7,
wherein one of the plurality of GPUs is a master and the remaining GPUs are slaves, and
wherein the estimating comprises
constructing, in a memory of the master GPU, a database of the input matrix size of each of the plurality of GPUs, the number of the plurality of GPUs, and the block size.

9. The method of claim 8,
further comprising splitting, by the master GPU, the square matrix into blocks of the estimated block size and distributing the blocks to the master GPU and the slave GPUs.

10. The method of claim 7,
wherein the testing function for extracting the optimal block size ($b^*$) using the machine learning is $b^* = \arg\min_b f(n, s, b)$,
where $b^*$ is the optimal block size, $f(n, s, b)$ is the execution time of the parallel matrix multiplication, $s$ is the number of GPUs connected to the CPU, $n$ is the size of the input square matrix, and $b$ is the block size.

11. The method of claim 10,
wherein the function for estimating, through the machine learning, the block size with which the matrix multiplication is performed in the shortest time is
$b^* = g(n, s)$,
where $g$ is a function that infers the optimal block size ($b^*$) using machine learning and regression analysis.

12. The method of claim 7,
wherein the plurality of GPUs are connected to one another through a network interface, and
wherein the matrix multiplication comprises
each of the plurality of GPUs accessing, and storing data in, the memory of another GPU through the network interface based on an address stored in the memory of each of the plurality of GPUs.