KR102505279B1

KR102505279B1 - Method for optimizing parallel matrix multiplication in a system supporting multiple CPU and multiple GPU

Info

Publication number: KR102505279B1
Application number: KR1020160077978A
Authority: KR
Inventors: 한흥우; 이준원; 김석현
Original assignee: 삼성전자주식회사
Priority date: 2015-07-24
Filing date: 2016-06-22
Publication date: 2023-03-02
Also published as: KR20170012019A

Abstract

A multi-process computing method and a computing system are provided. The computing method for multi-process operation in a computing system of the present disclosure estimates a block size to be distributed to a plurality of GPUs connected to a CPU, and performs a matrix multiplication operation having the estimated block size in the memory of each of the plurality of GPUs. The estimated block size is a value estimated through machine learning based on the number of GPUs connected to the CPU and the size of the square matrix obtained through the matrix multiplication operation of the plurality of GPUs.

Description

Operation method in a computing environment supporting multiple CPUs and multiple GPUs {Method for optimizing parallel matrix multiplication in a system supporting multiple CPU and multiple GPU}

The present disclosure relates to computer processing technology and, more specifically, to a method and system for providing optimal numerical computation (parallel matrix multiplication) in a computing environment composed of a plurality of central processing units (CPUs) and a plurality of graphics processing units (GPUs).

As the demand for processing large volumes of data, such as big data, increases, so does the demand for computing environments that use multiple CPUs and multiple GPUs. Accordingly, there is a need for a computation method that speeds up numerical operations in a system composed of multiple CPUs and multiple GPUs. A GPU uses many cores in parallel to operate on large amounts of data simultaneously, which greatly increases computation speed.

Conventionally, when data is exchanged between GPUs in a computing environment including a plurality of GPUs and a plurality of CPUs, the operation data generated by the operation performed on each GPU must pass through a communication path that interconnects the CPUs, such as a QuickPath Interconnect (QPI) channel.

In addition, large-scale matrix multiplication is used when processing large volumes of data. Various techniques for block decomposition of a matrix have been proposed for systems in which multi-process operation is implemented. In systems using multiple GPUs, by contrast, the development of techniques for optimal block decomposition of a matrix is still at an early stage. Accordingly, a method is needed for finding an optimal block size that can improve matrix operation speed in a multi-GPU system.

The present disclosure has been devised in view of the above need, and its object is to provide a method and system that speed up parallel operations on a plurality of GPUs connected to a CPU by estimating an optimal block size through machine learning.

To achieve the above object, a computing system for multi-process operation according to an embodiment of the present disclosure includes a central processing unit (CPU), a plurality of graphics processing units (GPUs) connected to the CPU through a connector, and a memory included in each of the plurality of GPUs. Each of the plurality of GPUs performs a matrix multiplication operation having a designated block size in the memory of that GPU, and the designated block size is a value estimated using machine learning based on the number of GPUs connected to the CPU and the size of the square matrix obtained through the matrix multiplication operation of the plurality of GPUs.

In addition, to achieve the above object, a computing method for multi-process operation according to an embodiment of the present disclosure includes estimating a block size to be distributed to a plurality of GPUs connected to a CPU through a connector, and performing a matrix multiplication operation having the estimated block size in the memory of each of the plurality of GPUs. The estimated block size is a value estimated using machine learning based on the number of GPUs connected to the CPU and the size of the square matrix obtained through the matrix multiplication operation of the plurality of GPUs.

A data processing method according to embodiments of the present disclosure can provide high performance computing (HPC) by performing high-speed parallel operations through a plurality of GPUs in a computing system including a plurality of GPUs connected to a CPU.

FIG. 1 is a simplified block diagram of a computing system according to an embodiment of the present disclosure;
FIG. 2 is a detailed block diagram of a computing system according to an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating an operation method in a plurality of GPUs according to an embodiment of the present disclosure;
FIG. 4 is a diagram illustrating an operation method in a plurality of GPUs according to embodiments of the present disclosure;
FIG. 5 is a flowchart illustrating a method of estimating an optimal block size through machine learning according to an embodiment of the present disclosure;
FIG. 6 is a graph showing execution time as a function of block size according to an embodiment of the present disclosure; and
FIG. 7 is a table showing the results of estimating an optimal block size according to an embodiment of the present disclosure.

The terms used in this specification are briefly explained first, and the present disclosure is then described in detail.

The terms used in the present disclosure were selected, as far as possible, from general terms that are currently in wide use, taking into account their functions in the present disclosure. They may nevertheless vary according to the intention of those skilled in the art, legal precedent, the emergence of new technologies, and the like.

In certain cases a term has been arbitrarily selected by the applicant, and its meaning is then set out in detail in the corresponding part of the description. Accordingly, a term used in the present disclosure should be defined on the basis of its meaning and the overall content of the present disclosure, not simply on the basis of its name.

The embodiments of the present disclosure may be modified in various ways and may take various forms, so specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the scope to specific embodiments, and the disclosure should be understood to include all modifications, equivalents, and substitutes falling within the disclosed spirit and technical scope. In describing the embodiments, a detailed description of related known technology is omitted where it is judged that it could obscure the subject matter.

Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

A singular expression includes the plural unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "consist of" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood as not excluding in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In the present disclosure, a "module" or "unit" performs at least one function or operation and may be implemented in hardware, in software, or in a combination of hardware and software. In addition, a plurality of "modules" or "units" may be integrated into at least one module and implemented by at least one processor (not shown), except for a "module" or "unit" that needs to be implemented with specific hardware.

Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can readily practice them. The present disclosure may, however, be implemented in many different forms and is not limited to the embodiments described herein. To describe the present disclosure clearly, parts of the drawings that are unrelated to the description are omitted, and similar parts are given similar reference numerals throughout the specification.

FIG. 1 is a simplified block diagram of a computing system according to an embodiment of the present disclosure. Referring to FIG. 1, the computing system 10 may include a plurality of central processing units (CPUs) 110 and 210, and a plurality of connectors 120 and 220 that connect each CPU 110, 210 and each memory 130, 230 to a plurality of graphics processing units (GPUs) 140, 150, 240, and 250.

The system 10 may be, for example, a supercomputer including a plurality of CPUs and a plurality of GPUs. The system 10 may also be a computing system including one CPU and a plurality of GPUs. The system 10 may be implemented in a single electronic device such as a computer.

Each of the CPUs 110 and 210 may be a chipset including a system memory. Each of the GPUs 140, 150, 240, and 250 may be a special-purpose processor for processing graphics data, such as that of video games. For example, the GPU may be an NVIDIA™ GPU, but is not limited thereto.

The connectors 120 and 220 allow the GPUs 140, 150, 240, and 250, which are connected to the connectors 120 and 220 through buses serving as a plurality of network interfaces, to communicate with the memories 130 and 230 and the CPUs 110 and 210 connected to the respective connectors 120 and 220.

For example, the connectors 120 and 220 may be root complexes in a Peripheral Component Interconnect Express (PCIe™) system, which is a bridged host interface. For convenience of explanation, the present disclosure describes a computing system implemented as a PCIe™ system, but is not limited thereto. The term "root complex" as used in the present disclosure may mean "connector".

According to an embodiment of the present disclosure, GPU1 140 and GPU2 150, which are a plurality of first GPUs, may be connected to the first CPU 110 and the first memory 130 through the first connector 120. GPU3 240 and GPU4 250, which are a plurality of second GPUs, may be connected to the second CPU 210 and the second memory 230 through the second connector 220. In addition, the first CPU 110 and the second CPU 210 may communicate through a communication path that interconnects the CPUs 110 and 210.

According to an embodiment of the present disclosure, each of the plurality of first GPUs 140 and 150 may perform a first operation based on address information stored in each other's memories to generate first operation data. Likewise, each of the plurality of second GPUs 240 and 250 may perform a second operation based on address information stored in each other's memories to generate second operation data. The first operation data and the second operation data may then be exchanged through the communication path.

For example, the GPUs 140 and 150, which are the plurality of first GPUs commonly connected to the first connector 120, may each perform an operation to generate the first operation data, and the GPUs 240 and 250, which are the plurality of second GPUs commonly connected to the second connector 220, may each perform an operation to generate the second operation data.

As another example, GPU1 140 and GPU2 150 connected to the first connector 120 may share a single switch, as shown in FIG. 2. Alternatively, GPU1 140 and GPU2 150 may each be connected to a different switch and, through those switches, to the single first connector 120.

When GPU1 140 and GPU2 150 are connected to the first connector 120 through different switches, GPU1 140 and GPU2 150 may also be connected to each other through a bus such as NVIDIA Lane™. In this case, GPU1 140 and GPU2 150 may perform operations by requesting each other's address information directly over the connecting bus, without going through the first connector 120.

The operation of GPU3 240 and GPU4 250 connected to the second connector 220 corresponds to the operation of GPU1 140 and GPU2 150 connected to the first connector 120 described above, so a detailed description is omitted.

Although FIG. 1 shows two GPUs connected to each of the connectors 120 and 220 (GPUs 140 and 150, and GPUs 240 and 250), this is only an example, and each of the connectors 120 and 220 may be connected to two or more GPUs. In addition, each of the CPUs 110 and 210 in the computing system 10 may include one or more connectors 120, 220, but the computing system 10 may also include a CPU that does not include a connector.

Each of the plurality of first GPUs 140 and 150 and the plurality of second GPUs 240 and 250 may include a memory. The operation performed on each GPU may be a parallel matrix multiplication operation in which another GPU is accessed based on an address stored in the memory of each GPU. The parallel matrix multiplication operation of a GPU may be performed in the memory of that GPU.

In an embodiment of the present disclosure, when the target address of GPU2 150, which is at least one of the GPUs 140, 150, 240, and 250, is in the memory of GPU3 240, which is connected to a connector 220 different from that of GPU2 150, the exchange of operation data between GPU2 150 and GPU3 240 may be performed through the communication path that interconnects CPU1 110 and CPU2 210.

For example, the communication path may be HyperTransport (HT) or QuickPath Interconnect (QPI), but is not limited thereto.

In addition, the plurality of first GPUs 140 and 150 and the plurality of second GPUs 240 and 250 may be connected to the respective connectors 120 and 220 through switches. A plurality of switches may be connected to each of the first connector 120 and the second connector 220. The switches that connect the connectors 120 and 220 to the GPUs 140, 150, 240, and 250 are described with reference to FIG. 2.

At least one of the plurality of GPUs 140, 150, 240, and 250 connected to each of the CPUs 110 and 210 may be a master, and the remaining GPUs may be slaves. For example, when four GPUs are connected to the first CPU 110, one of the four GPUs may be the master and the other three may be slaves. The master GPU may build, in the memory of the master GPU or in the host memory of the CPU, a database of the size of the input matrix of each GPU, the number of GPUs connected to the CPU, and the block size used for the matrix multiplication operation on the plurality of GPUs.

According to an embodiment of the present disclosure, the master GPU may use machine learning, based on the data stored in the database, to estimate the optimal block size with which the plurality of GPUs can complete the matrix multiplication operation in the shortest time. The method of estimating the optimal block size is described in detail with reference to FIGS. 5 to 7.
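
As a non-limiting illustration, the database described above can be pictured as a list of records that is queried for a block size. The sketch below is only a lookup stand-in and is not the machine-learning estimator of the disclosure. The record layout, the `seconds` field, the function name `fastest_block_size`, and all timing values are hypothetical.

```python
from collections import namedtuple

# Hypothetical record layout. The description names the matrix size, the GPU
# count and the block size; the `seconds` field is an assumed measurement.
Record = namedtuple("Record", "matrix_size num_gpus block_size seconds")


def fastest_block_size(database, matrix_size, num_gpus):
    """Return the block size with the lowest recorded time.

    Stand-in for the machine-learning estimator: it only looks up the
    nearest recorded matrix size for the same GPU count.
    """
    candidates = [r for r in database if r.num_gpus == num_gpus]
    if not candidates:
        raise ValueError("no records for this GPU count")
    nearest = min(candidates,
                  key=lambda r: abs(r.matrix_size - matrix_size)).matrix_size
    same_size = [r for r in candidates if r.matrix_size == nearest]
    return min(same_size, key=lambda r: r.seconds).block_size


# Synthetic placeholder values for illustration only, not measurements.
database = [
    Record(4096, 4, 256, 1.9),
    Record(4096, 4, 512, 1.4),
    Record(4096, 4, 1024, 1.7),
    Record(8192, 4, 512, 11.0),
    Record(8192, 4, 1024, 9.5),
]
```

A trained model would replace the nearest-size lookup. The sketch only shows which inputs (matrix size and GPU count) and which output (block size) such an estimator would work with.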

FIG. 2 is a detailed block diagram of a computing system according to an embodiment of the present disclosure. Referring to FIG. 2, the computing system 10 may include a plurality of CPUs, such as the first CPU 110 and the second CPU 210. The CPUs 110 and 210 may be connected to the first root complex 120 and the second root complex 220, respectively, through a PCIe™ bus. The first root complex 120 and the second root complex 220 each include a plurality of PCIe™ device downstream ports 270 and can access the respective memories 130 and 230.

Each of the root complexes 120 and 220 may communicate, through the PCIe™ bus, with the PCIe™ switches 160 and 260 to which the plurality of GPUs are connected. One of the PCIe™ device downstream ports 270 of each root complex 120, 220 may communicate with the PCIe™ device upstream port 280 of the corresponding PCIe™ switch 160, 260 through the PCIe™ bus.

Each of the PCIe™ switches 160 and 260 includes a plurality of PCIe™ device downstream ports 270, and each PCIe™ device downstream port 270 may communicate with the PCIe™ device upstream port 280 of one GPU 140, 150 through the PCIe™ bus. The GPUs 140, 150, 240, and 250 may include GPU memories 140-1, 150-1, 240-1, and 250-1, respectively. Each of the GPUs 140, 150, 240, and 250 may access the memories 140-1, 150-1, 240-1, and 250-1 of the GPUs through the PCIe™ bus. Numerous parallel matrix multiplication operations may be performed in the memories 140-1, 150-1, 240-1, and 250-1 of the GPUs.

For example, the plurality of CPUs 110 and 210 constituting the computing system 10 may be interconnected through QuickPath Interconnect (QPI).
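
As a non-limiting illustration, the hierarchy of FIG. 2 can be modeled as a tree, from which the hops between two GPUs follow. This is a simplified sketch. The node names reuse the reference numerals of the description, while the `PARENT` table, the helper names, and the assumption that two GPUs under one switch exchange data at that switch are hypothetical modeling choices.

```python
# Hypothetical model of the FIG. 2 hierarchy: each node maps to its upstream
# neighbor. Node names reuse the reference numerals from the description.
PARENT = {
    "GPU1(140)": "switch(160)",
    "GPU2(150)": "switch(160)",
    "GPU3(240)": "switch(260)",
    "GPU4(250)": "switch(260)",
    "switch(160)": "root_complex(120)",
    "switch(260)": "root_complex(220)",
    "root_complex(120)": "CPU(110)",
    "root_complex(220)": "CPU(210)",
}


def chain_to_cpu(node):
    """Return the list of nodes from `node` up to its CPU."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def route(src, dst):
    """Return the hops a transfer from src to dst would traverse."""
    up = chain_to_cpu(src)
    down = chain_to_cpu(dst)
    # Turn around at the lowest node shared by both chains, if there is one.
    for i, node in enumerate(up):
        if node in down:
            j = down.index(node)
            return up[: i + 1] + down[:j][::-1]
    # No shared node: the transfer crosses the CPU-to-CPU interconnect.
    return up + ["QPI"] + down[::-1]
```

Under these assumptions, `route("GPU1(140)", "GPU2(150)")` turns around at switch 160, whereas `route("GPU2(150)", "GPU3(240)")` passes through both CPUs and the QPI link.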

However, the configuration of the computing system 10 shown in FIG. 2 is only one embodiment for describing the present disclosure; the disclosure is not limited thereto and may be implemented in various ways.

FIG. 3 is a flowchart illustrating an operation method in a plurality of GPUs according to an embodiment of the present disclosure.

In step S310, each of the plurality of GPUs connected to each root complex constituting the computing system 10 performs a parallel matrix multiplication operation. In the parallel matrix multiplication operation, the memory of another GPU is accessed based on an address stored in the memory of each GPU, and the operation may be performed in the memory of each GPU.

For example, the parallel matrix multiplication operation performed on the GPUs may produce a square matrix C = A × B. Here, A may be the N × N input matrix of the first GPU, B may be the N × N input matrix of the second GPU, and C may be the N × N square matrix resulting from the parallel matrix multiplication operation of the first GPU and the second GPU. Specifically, C(i) is a row vector of the N × N square matrix and may be given by the following function.

Here, |B| may be the blocks in the rows of the matrix belonging to S_i, S_i may be the i-th GPU, A(i) may be |B|/S, S may be the number of GPUs included in the CPU, B(i, j) may be |B|/S, and |B|/S may be a sub-block of B(i).
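
As a non-limiting illustration, the partitioning of C = A × B across S GPUs can be sketched with a standard one-dimensional block-row decomposition. This decomposition is an assumption and may differ from the function of the disclosure. In the sketch, simulated GPU i owns the row blocks A(i) and B(i) and accumulates C(i) as the sum over j of A(i, j) × B(j). The function names and the requirement that N be divisible by the number of GPUs are hypothetical.

```python
def split_rows(m, parts):
    """Split matrix m (a list of rows) into `parts` contiguous row blocks."""
    size = len(m) // parts
    return [m[p * size:(p + 1) * size] for p in range(parts)]


def block_row_matmul(a, b, num_gpus):
    """Compute C = A x B where simulated GPU i owns row blocks A(i) and B(i).

    Assumes square N x N inputs with N divisible by num_gpus.
    """
    n = len(a)
    if n % num_gpus:
        raise ValueError("N must be divisible by the number of GPUs")
    size = n // num_gpus
    a_blocks = split_rows(a, num_gpus)
    b_blocks = split_rows(b, num_gpus)
    c = []
    for i in range(num_gpus):
        c_i = [[0] * n for _ in range(size)]      # row block C(i)
        for j in range(num_gpus):                 # B(j) would come from GPU j
            for r in range(size):
                for k in range(size):
                    a_rk = a_blocks[i][r][j * size + k]   # element of A(i, j)
                    for col in range(n):
                        c_i[r][col] += a_rk * b_blocks[j][k][col]
        c.extend(c_i)
    return c
```

In this sketch, every iteration of the inner `j` loop with `j != i` corresponds to a transfer of B(j) from another GPU, which is the exchange whose path the disclosure seeks to keep off the CPU interconnect where possible.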

Conventionally, in a computing system 10 that includes a plurality of CPUs, each of which includes a plurality of GPUs, when each GPU performs a parallel matrix multiplication operation, B(j) must be exchanged from one CPU to another through a communication path such as QPI for every process during the operation. As a result, the operation speed is reduced.

According to an embodiment of the present disclosure, the matrix multiplication operation of the plurality of GPUs may be performed in the memory of each GPU. In addition, the matrix multiplication operation of the plurality of GPUs is performed first among the GPUs connected to a common root complex. Accordingly, the GPU operation method according to an embodiment of the present disclosure does not need to pass through a communication path such as QPI every time a GPU performs an operation, as in the conventional approach, and can therefore improve operation speed.

In step S330, operation data is exchanged between GPUs that are connected to different root complexes, among the plurality of GPUs constituting the computing system 10. In the parallel matrix multiplication operation on a GPU, another GPU may be accessed based on an address stored in the memory of each GPU. Accordingly, the target address of at least one of the GPUs connected to a common root complex may be one of the GPUs connected to a different root complex.

For example, when a plurality of root complexes are connected to a common CPU, the target address of at least one of the GPUs connected to one root complex may be one of the GPUs connected to another root complex that is connected to the same CPU. In this case, the GPUs can exchange the operation data generated by their operations within the common CPU, without passing through a communication path such as QPI. Accordingly, operation speed can be improved compared with the conventional approach.

In step S350, GPU operation data is exchanged between the plurality of CPUs constituting the computing system 10. As described above for step S330, the target address of at least one of the GPUs connected to a common root complex may be one of the GPUs connected to a different root complex.

For example, the target address of a first GPU connected to a first root complex of a first CPU may be in the memory of a second GPU connected to a second root complex of a second CPU. In this case, the first GPU may exchange the first operation data obtained by the plurality of first GPUs of the first CPU and the second operation data obtained by the plurality of second GPUs of the second CPU through a communication path such as QPI.
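
As a non-limiting illustration, steps S310, S330, and S350 can be read as three locality tiers, and each GPU pair can be assigned to the step that would handle it. This mapping is one interpretation of the flowchart rather than a definitive implementation. The `PLACEMENT` table, the GPU, CPU, and root complex names in it, and the functions `stage_for` and `plan` are hypothetical.

```python
# Hypothetical placement table: GPU name -> (CPU, root complex).
PLACEMENT = {
    "GPU1": ("CPU1", "RC1"),
    "GPU2": ("CPU1", "RC1"),
    "GPU3": ("CPU1", "RC2"),
    "GPU4": ("CPU2", "RC3"),
}


def stage_for(src, dst):
    """Map a GPU pair to the step of FIG. 3 assumed to handle it."""
    src_cpu, src_rc = PLACEMENT[src]
    dst_cpu, dst_rc = PLACEMENT[dst]
    if src_rc == dst_rc:
        return "S310"   # same root complex
    if src_cpu == dst_cpu:
        return "S330"   # different root complexes under one CPU
    return "S350"       # different CPUs, so the CPU interconnect is used


def plan(pairs):
    """Group GPU pairs by step, keyed in S310 -> S330 -> S350 order."""
    stages = {"S310": [], "S330": [], "S350": []}
    for src, dst in pairs:
        stages[stage_for(src, dst)].append((src, dst))
    return stages
```

Processing the groups in key order keeps traffic on the closest shared link first and defers the transfers that cross a communication path such as QPI to the last step.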

FIG. 4 is a diagram illustrating an operation method in a plurality of GPUs according to embodiments of the present disclosure.

Referring to FIG. 4, the thin lines 410 show operations between GPUs connected through a common root complex and a common switch (GPU1 and GPU2, GPU3 and GPU4, GPU5 and GPU6, GPU7 and GPU8). These GPUs may each generate operation data by performing a parallel matrix multiplication based on the address information stored in each GPU's memory, and may exchange the generated operation data with each other.

For example, GPU1 and GPU2 may be connected through a bus such as a PCIe™ bus, and may perform operations by directly accessing the respective memories. GPU3 and GPU4, GPU5 and GPU6, and GPU7 and GPU8 may each operate in the same way as GPU1 and GPU2.

The dotted lines 420 show data exchange between GPUs connected to a common root complex through different switches (GPU1 and GPU4, GPU2 and GPU3, GPU5 and GPU8, GPU6 and GPU7).

For example, GPU1 may exchange data with GPU4 by going from the first switch over the PCIe™ bus and through the first root complex. GPU2 and GPU3, GPU5 and GPU8, and GPU6 and GPU7 may each exchange data in the same way as GPU1 and GPU4.

The thick lines 430 show data exchange between GPUs that share neither a CPU nor a root complex.

For example, GPU1 may exchange data with GPU8 over the PCIe™ buses that connect the devices of the computing system 10 (CPUs, root complexes, switches, memories, GPUs), using a communication path such as QPI that interconnects the first CPU and the second CPU. Likewise, GPU4 may exchange data with GPU5 over the QPI communication path. When GPU1 and GPU8, or GPU4 and GPU5, exchange data, the data may pass through the switch, root complex, and CPU to which each GPU is connected via the PCIe™ bus.
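
The three kinds of lines in FIG. 4 can be read as a classification of GPU pairs by the path their data takes. The following Python sketch illustrates that reading under a hypothetical topology (two CPUs, one root complex per CPU, two switches per root complex, two GPUs per switch). The table `TOPOLOGY`, the function `exchange_path`, and the returned labels are illustrative names and are not taken from the disclosure.

```python
# Hypothetical topology loosely following FIG. 4 (an assumption, not the
# patent's exact wiring): gpu -> (cpu, root_complex, switch).
TOPOLOGY = {
    1: (1, 1, 1), 2: (1, 1, 1),
    3: (1, 1, 2), 4: (1, 1, 2),
    5: (2, 2, 3), 6: (2, 2, 3),
    7: (2, 2, 4), 8: (2, 2, 4),
}


def exchange_path(gpu_a, gpu_b):
    """Classify the path two GPUs would use to exchange data."""
    cpu_a, rc_a, sw_a = TOPOLOGY[gpu_a]
    cpu_b, rc_b, sw_b = TOPOLOGY[gpu_b]
    if sw_a == sw_b:
        return "same-switch"          # thin line 410
    if rc_a == rc_b:
        return "same-root-complex"    # dotted line 420
    if cpu_a == cpu_b:
        return "same-cpu"             # different root complexes, one CPU
    return "cross-cpu-qpi"            # thick line 430


if __name__ == "__main__":
    for pair in [(1, 2), (1, 4), (1, 8), (4, 5)]:
        print(pair, exchange_path(*pair))
```

Only pairs in the last category need the inter-CPU path such as QPI; the other categories stay within one CPU.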

Each root complex may receive, from each GPU, requests to access the memory connected to that root complex. The root complex may receive the results of data exchange between the GPUs on behalf of the CPU and store them in the memory connected to (or included in) the root complex. The root complex may transmit the data exchange results of the GPUs to the CPU, and the CPU may store the received data in the host memory included in the CPU.

That is, conventionally GPU data was exchanged over QPI as many times as the number of GPUs included in the computing system 10 (8 in the embodiment of FIG. 4), whereas the method according to an embodiment of the present disclosure uses QPI only as many times as the number of CPUs included in the computing system 10 (2 in the embodiment of FIG. 4). Accordingly, the method of exchanging data between a plurality of GPUs according to the embodiments described above greatly reduces the number of QPI uses compared to the conventional method, and can thereby improve the computation speed of graphics processing.

FIG. 5 is a flowchart illustrating a method of estimating an optimal block size through machine learning, according to an embodiment of the present disclosure.

For example, the plurality of GPUs may perform parallel matrix multiplication using a programming model such as NVIDIA CUDA™. However, this is merely an example for describing the present disclosure, which is not limited thereto.

In step S510, the computing system 10 may build a database for estimating the optimal block size in the parallel matrix multiplication of a plurality of GPUs connected to one CPU.

The computing system 10 may split the input matrix of the square matrix produced by the matrix multiplication of the plurality of GPUs into a plurality of blocks. The computing system 10 may distribute the resulting blocks to the master GPU and the slave GPUs. Accordingly, when the optimal block size for distribution to each GPU is derived from the input square matrix, the computation speed of the plurality of GPUs can be optimized.
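
The split-and-distribute step can be sketched as follows. This is a minimal illustration, assuming square tiles of side `b` and a round-robin assignment in which GPU 0 plays the master role. The disclosure does not specify the assignment policy, and the function names `split_into_blocks` and `distribute` are hypothetical.

```python
def split_into_blocks(n, b):
    """Return (row_start, row_end, col_start, col_end) tiles covering an n x n matrix.

    Edge tiles are smaller when b does not divide n.
    """
    edges = list(range(0, n, b)) + [n]
    spans = list(zip(edges, edges[1:]))
    return [(r0, r1, c0, c1) for r0, r1 in spans for c0, c1 in spans]


def distribute(blocks, num_gpus):
    """Assign tiles to GPUs round-robin (an assumed policy); GPU 0 is the master."""
    assignment = {gpu: [] for gpu in range(num_gpus)}
    for index, block in enumerate(blocks):
        assignment[index % num_gpus].append(block)
    return assignment


if __name__ == "__main__":
    # A 16,000 x 16,000 matrix cut into 4,000 x 4,000 tiles for two GPUs.
    tiles = split_into_blocks(16000, 4000)
    plan = distribute(tiles, 2)
    print(len(tiles), {gpu: len(owned) for gpu, owned in plan.items()})
```

Representing a tile by its index ranges keeps the sketch independent of any particular array library; a real implementation would copy the corresponding submatrix into each GPU's memory.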

The computing system 10 may designate one of the GPUs connected to one CPU as the master and the remaining GPUs as slaves. The computing system 10 may store, as a database, the number of GPUs connected to one CPU (s), the input matrix size of each GPU (n), and the block size of the square matrix (b). The database may reside in the memory of the master GPU or in a big data store of the computing system 10.

In step S520, the computing system 10 may estimate the optimal block size using machine learning, such as deep learning, based on the database built in step S510.

In block matrix decomposition, the block size plays an important role in the run time of the GPUs when a parallel matrix multiplication is performed across a plurality of GPUs. Estimating the optimal block size can therefore shorten the data processing time on the GPUs.

FIG. 6 is a graph illustrating execution time as a function of block size, according to an embodiment of the present disclosure.

As described with reference to FIG. 3, for example, the square matrix resulting from matrix multiplication on two GPUs may be C = A×B. Here, A may be the N×N matrix of the first GPU, B the N×N matrix of the second GPU, and C the N×N square matrix that is the parallel product of A and B.

FIG. 6 is a graph showing matrix multiplication speed as a function of block size for a random matrix product (A×B) of 10,000 × 10,000 square matrices computed on four GPUs.

Referring to FIG. 6, a small number of blocks (# of Block) means that the square matrix has been decomposed into relatively large blocks, so that a large amount of data is assigned to each block. Because the amount of GPU operation data handled per block increases, the processing time for the input matrix operation may decrease. However, if the number of blocks is too small, the blocks are too large, so the amount of operation data within each GPU grows excessively and the processing time for the input matrix operation may increase.

Conversely, a large number of blocks (# of Block) means that the square matrix has been decomposed into relatively small blocks, so that the number of block data exchanges between the GPUs increases. As the number of data exchanges between the GPUs increases, the processing time of the input matrix multiplication on the GPUs may increase.

For example, in FIG. 6 the number of blocks may be optimal when the square matrix is decomposed into about 50 blocks. In this case, the data is distributed evenly across the plurality of GPUs, and the speed of the input matrix operation may therefore reach its optimum.
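
To relate a block count to a block size, the arithmetic can be sketched as below. The sketch assumes that "number of blocks" counts square tiles of side `b` covering the whole n × n matrix, which is an interpretation and is not stated in the disclosure. The function name `block_count` is hypothetical.

```python
def block_count(n, b):
    """Number of square tiles of side b needed to cover an n x n matrix."""
    per_side = -(-n // b)  # integer ceiling of n / b
    return per_side * per_side


if __name__ == "__main__":
    # Under this interpretation, a 7 x 7 grid (49 tiles) is the square
    # grid closest to "about 50 blocks" for n = 10,000.
    for b in (10000, 5000, 1429, 1000):
        print(b, block_count(10000, b))
```

Under this reading, a block side of roughly 1,400 to 1,500 elements would correspond to the region around 50 blocks in FIG. 6.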

According to an embodiment of the present disclosure, when the square matrix C = A×B of the multi-GPU parallel matrix multiplication described with reference to FIG. 3 is decomposed into a plurality of blocks, the optimal block size (b*) for the parallel matrix multiplication may be estimated with a function b* = g(n, s). Here, n is the size of the square matrix, and s is the number of GPUs performing operations under one CPU. The function g may be a function that infers the optimal block size (b*) using machine learning, such as deep learning, and regression analysis.

Regression analysis is a method that fits a model between two variables from observed continuous variables and then measures the goodness of fit. As one example, Gaussian process regression may be used. Data analysis by machine learning with an artificial neural network can also achieve statistically high accuracy. In the present disclosure, the optimal block size was estimated by applying Gaussian process regression, but this is only one embodiment and the disclosure is not limited thereto.
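
As a rough illustration of how Gaussian process regression could produce a function g(n, s), the following self-contained Python sketch fits a GP posterior mean with an RBF kernel. The kernel choice, the length scale, the noise level, the log-scaled features, and all training values are assumptions made for illustration. None of the numbers come from the disclosure.

```python
import math


def rbf(x, y, length=1.0):
    """Squared-exponential kernel between two feature tuples."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-dist_sq / (2.0 * length ** 2))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def gp_fit(xs, ys, noise=1e-6, length=1.0):
    """Return the GP posterior-mean predictor for training pairs (xs, ys)."""
    mean = sum(ys) / len(ys)
    k = [[rbf(a, b, length) + (noise if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    alpha = solve(k, [y - mean for y in ys])
    return lambda x: mean + sum(w * rbf(x, xi, length) for w, xi in zip(alpha, xs))


# Made-up training data: features are (log2(n / 1000), s) and targets are
# log2(b* / 1000). These values are placeholders, not measured results.
train_x = [(2, 2), (3, 2), (4, 2), (2, 4), (3, 4), (4, 4)]
train_y = [1.0, 2.0, 2.0, 0.0, 1.0, 2.0]
g = gp_fit(train_x, train_y)

if __name__ == "__main__":
    print(g((3, 2)), g((3.5, 3)))
```

The returned predictor interpolates the training points closely and gives a smooth estimate between them, which is the role g plays when it is queried for an unseen pair (n, s).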

For example, machine learning such as deep learning can infer and extract hidden key values from a database such as big data. That is, a deep neural network (DNN) computes on the input values, repeats training, and infers the key values obtained as a result of the training.

According to an embodiment of the present disclosure, the inputs to the deep neural network (DNN) may be the size of the square matrix (n) in C = A×B, the number of GPUs connected to the CPU (s), and the block size (b). The testing function for extracting the optimal block size (b*) may then be b* = argmin_b f(n, s, b), where f(n, s, b) is the execution time of the parallel matrix multiplication function f. Based on the values (n, s, b) input to the DNN, the system 10 may extract the function b* = g(n, s) using deep learning and regression analysis. The system 10 can thus infer an h-dimensional vector, where h is the number of hidden nodes in the DNN.
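
The testing function b* = argmin_b f(n, s, b) amounts to picking, for a fixed (n, s), the block size with the smallest measured run time. A minimal Python sketch follows. The dictionary layout, the function name `best_block_size`, and the run-time values are hypothetical placeholders and are not measurements from the disclosure.

```python
def best_block_size(runtimes, n, s):
    """Return the b that minimizes f(n, s, b) among the logged candidates."""
    candidates = {b: t for (n_, s_, b), t in runtimes.items()
                  if n_ == n and s_ == s}
    if not candidates:
        raise KeyError(f"no measurements logged for n={n}, s={s}")
    return min(candidates, key=candidates.get)


# Placeholder run times in seconds, keyed by (n, s, b). Illustrative only.
runtimes = {
    (16000, 2, 2000): 9.0,
    (16000, 2, 4000): 6.5,
    (16000, 2, 8000): 7.8,
    (8000, 2, 2000): 1.1,
    (8000, 2, 4000): 1.4,
}

if __name__ == "__main__":
    print(best_block_size(runtimes, 16000, 2))
    print(best_block_size(runtimes, 8000, 2))
```

The pairs ((n, s), b*) obtained this way are the kind of training targets from which a regression model for g(n, s) could be fitted.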

FIG. 7 is a table showing the results of estimating the optimal block size, according to an embodiment of the present disclosure.

According to an embodiment of the present disclosure, the database may be created by logging the function f(n, s, b). As described above, n may be the input matrix size of the GPUs, s the number of GPUs included in the CPU, and b the block size in the matrix multiplication. For example, n may be an input matrix size of 4,000 (4k), 8,000 (8k), or 16,000 (16k); s may be a GPU count such as 2, 3, or 4; and b may take values such as 1k, 2k, 4k, 8k, and 16k. For example, one CPU may include two GPUs, the square matrix input through the two GPUs may have a size of 16,000 (16k), and the input square matrix may be decomposed into 4,000 (4k) × 4,000 (4k) blocks that are distributed to the GPUs.

The table of FIG. 7 shows the mean and standard deviation, for each parameter setting, of the function f(n, s, b) obtained by repeating the matrix multiplication 10 times. Each entry in the table is the mean and standard deviation of f(n, s, b) over at least 5 runs at specific values of s, n, and b (the number outside the parentheses is the mean, and the number inside is the standard deviation). The circled numbers are the mean values at the n, s, and b for which the optimal block size (b*) was obtained. Based on the data in the table, the computing system 10 may generate metadata and apply Gaussian process regression to the metadata values.
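
The table entries can be produced from a run-time log as sketched below, using Python's standard `statistics` module. The log layout, the helper names `summarize` and `format_entry`, and the sample values are assumptions for illustration. The use of the sample standard deviation is also an assumption, since the disclosure does not say which estimator was used.

```python
import statistics


def summarize(log):
    """Map each (n, s, b) to (mean, sample standard deviation) of its run times."""
    return {key: (statistics.mean(times), statistics.stdev(times))
            for key, times in log.items()}


def format_entry(mean, std):
    """Render an entry in the 'mean (std)' style described for FIG. 7."""
    return f"{mean:.2f} ({std:.2f})"


# Placeholder run times in seconds for repeated runs. Illustrative only.
log = {
    (16000, 2, 4000): [1.0, 2.0, 3.0],
    (16000, 2, 8000): [2.0, 3.0, 4.0],
}
table = summarize(log)

if __name__ == "__main__":
    for key, (mean, std) in table.items():
        print(key, format_entry(mean, std))
```

For a fixed (n, s), the entry with the smallest mean identifies the candidate optimal block size, which corresponds to the circled entries in the table.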

Accordingly, when a parallel matrix multiplication is performed on a plurality of GPUs included in a CPU using the method described above, the matrix can be partitioned into a plurality of blocks by block decomposition, the optimal block size can be estimated, and blocks of the optimal size can be distributed to the plurality of GPUs. Neural network machine learning and regression analysis (for example, Gaussian process regression) may be used to estimate the optimal block size.

According to the embodiment of the present disclosure described above, the optimal block size to be distributed to a plurality of GPUs can be estimated using machine learning. High-performance computing (HPC) can thus be provided by performing high-speed parallel operations on a plurality of GPUs connected to a CPU.

Meanwhile, the method described above can be written as a program executable on a computer and can be implemented on a general-purpose digital computer that runs the program from a computer-readable recording medium. In addition, the data structures used in the method can be recorded on a computer-readable recording medium through various means. The computer-readable recording medium includes storage media such as magnetic storage media (for example, ROM, floppy disks, and hard disks) and optically readable media (for example, CD-ROMs and DVDs).

Those of ordinary skill in the art related to this embodiment will understand that it can be implemented in modified forms without departing from the essential characteristics of the above description. The disclosed methods should therefore be considered in an illustrative rather than a limiting sense. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the scope of their equivalents should be construed as being included in the present invention.

10: computing system
110: first CPU, 210: second CPU
120: first connector (root complex), 220: second connector (root complex)
130: first memory, 230: second memory
140: first GPU, 150: second GPU, 240: third GPU, 250: fourth GPU

Claims

A computing system for multi-process computing, comprising:
a central processing unit (CPU);
a plurality of graphics processing units (GPUs) connected to the CPU; and
a memory included in each of the plurality of GPUs,
wherein each of the plurality of GPUs performs, in the memory of that GPU, a matrix multiplication operation having a designated block size, and
wherein the designated block size is a value estimated using machine learning based on the number of the plurality of GPUs connected to the CPU and the size of a square matrix obtained through the matrix multiplication operation of the plurality of GPUs.

The computing system of claim 1,
wherein one of the plurality of GPUs is a master and the remaining GPUs are slaves, and the master GPU builds, in the memory of the master GPU, a database of the input matrix size of each of the plurality of GPUs, the number of the plurality of GPUs, and the block size.

The computing system of claim 2,
wherein the master GPU
divides the square matrix into blocks of the estimated block size and distributes the blocks to the master GPU and the slave GPUs.

The computing system of claim 1,
wherein a testing function for extracting the optimal block size (b*) using the machine learning is b* = argmin_b f(n, s, b),
where b* is the optimal block size, f(n, s, b) is the execution time of the parallel matrix multiplication function f, s is the number of GPUs connected to the CPU, n is the size of the input square matrix, and b is the block size.

The computing system of claim 4,
wherein a function for estimating, through the machine learning, the block size that yields the shortest matrix multiplication operation is
b* = g(n, s),
where g is a function that infers the optimal block size (b*) using machine learning and regression analysis.

The computing system of claim 1,
further comprising a network interface connecting the plurality of GPUs to one another,
wherein the matrix multiplication operation
is performed by each of the plurality of GPUs accessing the memory of another GPU through the network interface, based on the addresses stored in the memory of each of the plurality of GPUs.

A computing method for multi-process computing, comprising:
estimating a block size to be distributed to a plurality of GPUs connected to a CPU; and
performing, in the memory of each of the plurality of GPUs, a matrix multiplication operation having the estimated block size,
wherein the estimated block size is a value estimated using machine learning based on the number of the plurality of GPUs connected to the CPU and the size of a square matrix obtained through the matrix multiplication operation of the plurality of GPUs.

The computing method of claim 7,
wherein one of the plurality of GPUs is a master and the remaining GPUs are slaves, and
wherein the estimating
further comprises building, in the memory of the master GPU, a database of the input matrix size of each of the plurality of GPUs, the number of the plurality of GPUs, and the block size.

The computing method of claim 8,
further comprising dividing, by the master GPU, the square matrix into blocks of the estimated block size and distributing the blocks to the master GPU and the slave GPUs.

The computing method of claim 7,
wherein a testing function for extracting the optimal block size (b*) using the machine learning is b* = argmin_b f(n, s, b),
where b* is the optimal block size, f(n, s, b) is the execution time of the parallel matrix multiplication function f, s is the number of GPUs connected to the CPU, n is the size of the input square matrix, and b is the block size.

The computing method of claim 10,
wherein a function for estimating, through the machine learning, the block size that yields the shortest matrix multiplication operation is
b* = g(n, s),
where g is a function that infers the optimal block size (b*) using machine learning and regression analysis.

The computing method of claim 7,
wherein the matrix multiplication operation
is performed by each of the plurality of GPUs accessing the memory of another GPU through a network interface, based on the addresses stored in the memory of each of the plurality of GPUs.