KR102300118B1

KR102300118B1 - Job placement method for GPU application based on machine learning and device for method

Info

Publication number: KR102300118B1
Application number: KR1020190178561A
Authority: KR
Inventors: 김윤희; 오지선
Original assignee: 숙명여자대학교산학협력단
Priority date: 2019-12-30
Filing date: 2019-12-30
Publication date: 2021-09-07
Also published as: KR20210085490A

Abstract

The present invention provides a machine learning-based job placement method for GPU applications, comprising: when information about an application requested for execution is received, checking the attributes of the application and the state of the system based on the received information; selecting input data for a learning model from among the attributes of the application and inputting the data to the learning model; applying the learning model to determine the job order of the application and the co-location structure of the application with other applications; and executing the application on the GPU according to the determined job order and co-location structure.

Description

JOB PLACEMENT METHOD FOR GPU APPLICATION BASED ON MACHINE LEARNING AND DEVICE FOR METHOD

The present invention provides a machine learning-based job placement method for GPU applications.

A GPU (Graphics Processing Unit) virtualization system can serve as a GPGPU (General-Purpose GPU) by performing parallel operations on many processing cores, and is provided to users by data centers, cloud providers, and the like.

In a GPU virtualization system, the service provider supplies a GPU by granting dedicated access rights to the user who requests it. To increase resource-use efficiency, several application programs can share the GPU resources and run jointly. However, when the kernels of several applications are executed as a single process, the resource-usage characteristics of the applications vary, and running them jointly can degrade the performance of the whole system. Research on this problem is in progress.

By providing a machine learning-based job placement method for GPU applications, the present invention aims to improve the performance and resource utilization of the whole system when a GPU executes multiple applications jointly.

According to an embodiment, the application attributes input to the learning model include at least one of GPU application preliminary information, GPU application profiling information, and cluster environment information, and the GPU application profiling information may include at least one of the application's GPU utilization, GPU memory usage, GPU core usage, PCIe throughput, and execution time.

According to an embodiment, at least one of the GPU utilization, GPU memory usage, GPU core usage, and PCIe throughput included in the GPU application profiling information may be information collected by periodically monitoring the GPU for a certain period of time.

According to an embodiment, the cluster environment information may include at least one of a node name, GPU card, GPU architecture, GPU memory, number of GPU cores, and PCIe bandwidth.

According to an embodiment, the learning model observes the current state and determines an action according to a policy, the action being determined so that the discounted cumulative reward expected from performing it is maximized. The current state of the learning model is determined based on the application attributes input to the learning model, and the action may relate to a job slot of the co-location structure.

According to an embodiment, the number of actions may be equal to the number of job slots of the co-location structure.

According to an embodiment, the learning model may use gradient descent that reduces the deviation of the distribution of the discounted cumulative reward function.

According to an embodiment, the gradient descent may estimate the gradient descent value after subtracting the empirical discounted cumulative reward obtained under the policy from the expected discounted cumulative reward.

According to an embodiment, determining the co-location structure may include: when more job slots are needed than are available, adding the excess job slots to a backlog; and when fewer job slots are required than are available, so that empty job slots exist, filling the empty job slots with jobs that were added to the backlog.

According to an embodiment, the learning model may be a DQN learning model.

According to an embodiment, the method may further include, after executing the application on the GPU, monitoring the execution result of the application and modifying the input of the learning model according to the monitoring result.

The present invention also provides a computer-readable storage medium storing a computer program that, when executed by a processor, performs the method described above.

According to an embodiment disclosed in the present invention, jobs are placed on the shared resources of the GPU based on application profiling data, so the resource utilization of the whole GPU system can be improved regardless of the type of application.

In addition, according to an embodiment disclosed in the present invention, interference between applications can be reduced, which lessens the slowdown of jobs.

FIG. 1 is a diagram for explaining the relationship between performance and shared resources when applications are executed jointly.
FIG. 2 is a flowchart of a machine learning-based job placement method for GPU applications according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining the structure of a learning model according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining an algorithm of a learning model according to another embodiment of the present invention.
FIG. 5 is a diagram for explaining resource allocation over time when applications are executed jointly according to the prior art.
FIG. 6 is a diagram for explaining resource allocation over time when applications are executed jointly according to an embodiment of the present invention.
FIG. 7 is a diagram for explaining the structure of a job placement service model according to an embodiment of the present invention.
FIGS. 8 to 10 are diagrams for explaining per-workload experimental results of the machine learning-based job placement method for GPU applications according to an embodiment of the present invention.
FIG. 11 is a diagram for explaining the results of a job slowdown analysis experiment for the machine learning-based job placement method for GPU applications according to an embodiment of the present invention.
FIG. 12 is a diagram for explaining the results of a training overhead comparison experiment for the machine learning-based job placement method for GPU applications according to an embodiment of the present invention.
FIG. 13 is a diagram for explaining a price comparison between a machine learning-based job placement device for GPU applications according to an embodiment of the present invention and a conventional device.

The present invention can be modified in various ways and can have various embodiments, so specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to specific embodiments, and it should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention. In describing the drawings, like reference numerals are used for like elements.

Terms such as first, second, A, and B may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a first element may be called a second element, and similarly a second element may be called a first element. The term "and/or" includes a combination of a plurality of related listed items or any one of a plurality of related listed items.

When an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to that other element, but it should be understood that other elements may exist in between. In contrast, when an element is said to be "directly connected" or "directly coupled" to another element, it should be understood that no other element exists in between.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to indicate that a feature, number, step, operation, element, part, or combination thereof described in the specification exists, and should be understood not to exclude in advance the existence or possible addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in this application. Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram for explaining the relationship between performance and shared resources when applications are executed jointly.

Conventional machine learning-based MPS, when the target resource is the GPU, is mostly limited to machine learning applications, and is therefore difficult to apply where the number and length of kernels vary, as in HPC (High Performance Computing) applications. In addition, because it does not consider the detailed resources of an application, interference arises between shared resources and degrades the performance of the whole system.

FIG. 1 illustrates the degree of load on the GPU, GPU memory, SM (cores), and input/output bandwidth (I/O bandwidth) when two of five applications (Lammps, Gromacs, CNN, vgg16, googlenet) share GPU resources. Of the five applications in FIG. 1, Lammps and Gromacs are HPC applications, and CNN, vgg16, and googlenet are machine learning applications.

Specifically, FIG. 1(a) shows the application execution time when each pair of applications shares GPU resources, FIG. 1(b) shows the load on GPU memory, FIG. 1(c) shows the load on the GPU cores, and FIG. 1(d) shows the load on the I/O bandwidth. In each of FIG. 1(a) to (d), a darker color indicates a heavier load for that item.

Referring to FIG. 1(a) to (d), when the CNN application and the Gromacs application share GPU resources, the execution time and the GPU core load are both at level 3, while the GPU memory load is at level 1, leaving headroom, and the I/O bandwidth load is at level 5, which is very heavy. When the googlenet application and the vgg16 application share GPU resources, the execution time is at about level 4, while the GPU memory load is 3, the GPU core load is 2, and the I/O bandwidth load is 1.

Based on the results of FIG. 1, execution time increases as the GPU memory load grows, but for the other factors it is hard to say that a higher load on a factor affects execution time. This is because the interference that can arise differs with the characteristics of the applications sharing the GPU, and this interference makes system performance hard to predict.

FIG. 2 is a flowchart of a machine learning-based job placement method for GPU applications according to an embodiment of the present invention.

In step 210, when information about an application requested for execution is received, the attributes of the application and the state of the system may be checked based on the received information.

Meanwhile, in the machine learning-based job placement method for GPU applications according to an embodiment of the present invention, the data input to the learning model may include at least one of the application attributes in Table 1. In other words, the application attributes input to the learning model may include at least one of GPU application preliminary information, GPU application profiling information, and cluster environment information.

[Table 1]
Category: Application attributes
GPU application preliminary information: application name, application type, input file size, input variables
GPU application profiling information: GPU utilization, GPU memory usage, PCIe throughput, execution time
Cluster environment information: node name, GPU card, GPU architecture, GPU memory, number of GPU cores, PCIe bandwidth

Of these, the GPU application preliminary information may include at least one of the application name, application type, input file size, and input variables, and the GPU application profiling information may include at least one of the application's GPU utilization, GPU memory usage, GPU core usage, PCIe throughput, and execution time. The cluster environment information may include at least one of a node name, GPU card, GPU architecture, GPU memory, number of GPU cores, and PCIe bandwidth. Table 1 is an example of the items of an application's resource-usage history collected for use by the learning model, and it is apparent to those skilled in the art that other information may also be input to the learning model.

In the machine learning-based job placement method for GPU applications according to an embodiment of the present invention, the application attributes may be information that the user submits to the system in order to run the application, or information collected while the application runs, but the way the attributes are obtained is not limited to these. The usage of each item included in the GPU application profiling information of Table 1 may be collected by periodic monitoring over a certain period of time. For example, at least one of the GPU utilization, GPU memory usage, GPU core usage, and PCIe throughput included in the GPU application profiling information may be information collected by periodically monitoring the GPU for a certain period of time.
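The periodic collection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: `collect_profile`, `mean_usage`, the `sample_fn` callback, and the field names are all hypothetical, and a real deployment would replace the callback with a query to the GPU driver or a monitoring tool.

```python
import time
from typing import Callable, Dict, List

# Hypothetical field names mirroring the profiling items named in the text.
FIELDS = ("gpu_util", "gpu_mem", "gpu_core", "pcie_throughput")


def collect_profile(
    sample_fn: Callable[[], Dict[str, float]],
    n_samples: int,
    interval_s: float,
) -> List[Dict[str, float]]:
    """Call sample_fn n_samples times, waiting interval_s seconds between calls."""
    history = []
    for i in range(n_samples):
        sample = sample_fn()
        # Keep only the fields of interest and normalize them to floats.
        history.append({name: float(sample[name]) for name in FIELDS})
        if i < n_samples - 1:
            time.sleep(interval_s)
    return history


def mean_usage(history: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each field over the monitoring window."""
    return {name: sum(h[name] for h in history) / len(history) for name in FIELDS}
```

Passing the sampler in as a callback keeps the sketch independent of any particular monitoring library and makes it easy to exercise with a stub.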

In step 220, input data for the learning model may be selected from among the attributes of the application and input to the learning model. Here, the learning model may be a DQN learning model.

In step 230, the learning model may be applied to determine the job order of the application and the jobs to be co-located.

Here, the learning model observes the current state and determines an action according to a policy, the action being determined so that the discounted cumulative reward expected from performing it is maximized. The current state of the learning model is determined based on the application attributes input to the learning model, and the action may relate to a job slot of the co-location structure. The number of actions may be equal to the number of job slots of the co-location structure.
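As a rough illustration of an action space with one action per job slot, the sketch below picks a slot index from a list of per-slot value estimates. The function name `select_action` and the epsilon-greedy exploration are assumptions added for illustration; the text only states that the action is chosen to maximize the expected discounted cumulative reward.

```python
import random
from typing import Optional, Sequence


def select_action(
    q_values: Sequence[float],
    epsilon: float = 0.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Return the index of the job slot to schedule.

    q_values holds one estimated value per job slot, so the number of
    actions equals the number of job slots. With probability epsilon a
    random slot is explored (an assumed exploration scheme); otherwise
    the slot with the highest estimate is chosen.
    """
    if not q_values:
        raise ValueError("at least one job slot is required")
    rng = rng or random.Random()
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice; ties resolve to the lowest slot index.
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

With `epsilon=0.0` the choice is purely greedy, which matches the maximization described in the text most directly.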

The learning model may also use gradient descent that reduces the deviation of the distribution of the discounted cumulative reward function. In this case, the gradient descent may estimate the gradient descent value after subtracting the empirical discounted cumulative reward obtained under the policy from the expected discounted cumulative reward.

In step 240, the applications may be co-located on the GPU and executed according to the determined job order.

Meanwhile, step 240 may include: when more job slots are needed than are available, adding the excess job slots to a backlog; and when fewer job slots are required than are available, so that empty job slots exist, filling the empty job slots with jobs that were added to the backlog.
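The slot-and-backlog behavior described above can be sketched as a small data structure. The class name `JobSlots`, its methods, and the first-in-first-out order of the backlog are assumptions made for illustration; the text does not state the order in which backlog entries are promoted.

```python
from collections import deque
from typing import Deque, List, Optional


class JobSlots:
    """Fixed number of job slots plus a FIFO backlog (hypothetical names)."""

    def __init__(self, n_slots: int) -> None:
        self.slots: List[Optional[str]] = [None] * n_slots
        self.backlog: Deque[str] = deque()

    def submit(self, job: str) -> None:
        """Place the job in the first empty slot, or in the backlog if none is free."""
        for i, occupant in enumerate(self.slots):
            if occupant is None:
                self.slots[i] = job
                return
        self.backlog.append(job)

    def release(self, index: int) -> Optional[str]:
        """Remove the job in a slot and refill the slot from the backlog if possible."""
        job = self.slots[index]
        self.slots[index] = self.backlog.popleft() if self.backlog else None
        return job
```

Submitting three jobs to a two-slot instance leaves the third in the backlog, and releasing either slot promotes it.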

The method according to an embodiment of the present invention may further include, after executing the application on the GPU, monitoring the execution result of the application and modifying the input of the learning model according to the monitoring result.

Accordingly, the machine learning-based job placement method for GPU applications according to an embodiment of the present invention predicts interference between applications sharing GPU resources based on the applications' profiling data and places jobs accordingly, which improves the overall performance of the system.

The following description assumes that the method according to an embodiment of the present invention uses a DQN learning model.

FIG. 3 is a diagram for explaining the structure of a learning model according to an embodiment of the present invention.

FIG. 3 shows the overall DQN structure in which an agent interacts with the environment and repeats learning, reward, and observation. Referring to FIG. 3, at time t the agent observes the current state (s_t) and is asked to perform a specific action a. After the action is performed, the state transitions to the next state (s_{t+1}) and the agent receives a reward (r_t). Because the state transition and the reward depend only on the environment state and the action taken by the agent, they have the Markov property. The agent repeats this learning process with the goal of maximizing the expected cumulative discounted reward.
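To make the cumulative discounted reward concrete, the sketch below computes it for every time step of a recorded reward sequence. The function name `discounted_returns` and the discount factor `gamma` are assumptions; the text does not name a discount factor or give its value.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For rewards `[1, 1, 1]` and `gamma = 0.5` this yields `[1.75, 1.5, 1.0]`, showing how later rewards count for less.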

Here, the action is selected based on the agent's policy, and the policy can be expressed as in Equation 1.

[Equation 1]

Here, π(s,a) denotes the probability that action a is executed in state s. Instead of storing a large number of {state, action} values in a table, the method according to an embodiment of the present invention uses a DNN (Deep Neural Network) as a function approximator to compute the policy parameter θ. The policy can therefore be written as π_θ(s,a). That is, the final goal of the DQN learning model is to maximize the discounted cumulative reward expected when action a is performed in state s under the policy π_θ(s,a). The expression for this expected discounted cumulative reward is given as an inline formula in the original (formula not reproduced here).
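For reference, the objective of maximizing an expected discounted cumulative reward is commonly written as below in the reinforcement learning literature. The symbol J(θ) and the discount factor γ are standard notation assumed here for illustration; this is not a reconstruction of the formula in the original figures, which cannot be recovered from this text.

```latex
J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \right], \qquad 0 < \gamma \le 1
```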

Meanwhile, for co-located job placement, the method according to an embodiment of the present invention may use the application attributes defined in Table 1 above as the input to the DQN learning model. The input state of the learning model, s = (j, R), consists of a job vector j and a cluster resource vector R. From the resource environment history items of the cluster, the cluster resources can be expressed as in Equation 2 below, where each element of R denotes the available amount of each GPU shared resource.
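One way to picture the input state s = (j, R) is as a single flat vector built from a per-time-slot job profile and the cluster resource vector. The function `build_state` and the flattening layout are assumptions for illustration; the text does not specify how the state is encoded for the network.

```python
from typing import List, Sequence


def build_state(
    job_profile: Sequence[Sequence[float]],
    cluster_resources: Sequence[float],
) -> List[float]:
    """Flatten s = (j, R) into one input vector.

    job_profile: one row per time slot, one column per GPU shared resource.
    cluster_resources: available amount of each GPU shared resource (R).
    """
    n_resources = len(cluster_resources)
    for row in job_profile:
        if len(row) != n_resources:
            raise ValueError("each time slot must list every resource")
    # Job rows first (time-slot order), then the cluster resource vector.
    flat_job = [float(x) for row in job_profile for x in row]
    return flat_job + [float(x) for x in cluster_resources]
```

A job with two time slots over four resources and a four-element R gives a 12-element vector.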

[Equation 2]

Meanwhile, job j represents the resource profiling history collected at each fixed time slot (T) for each job, and can be expressed as in Equation 3.

[Equation 3]

Accordingly, j_i may include information on the resource usage that changes at every time slot (T_j) of the corresponding application (APP), and the term contained in j_i (symbol not reproduced here) can be defined as in Equation 4.

[Equation 4]

The machine learning-based job placement method for GPU applications according to an embodiment of the present invention may use the job information and the values of the GPU cluster environment as the input values of the DNN, and the input state space may be modified in order to feed in the changing resource usage.

FIG. 4 is a diagram for explaining an algorithm of a learning model according to another embodiment of the present invention.

The first part 410 of FIG. 4 is the part that, when the current state is s_i, selects an action a_i and accordingly determines the reward r_i the agent will receive and the policy q_i. The second part 420 is the part that modifies the policy parameter θ in order to reduce the deviation of the distribution of the Q value of the discounted cumulative reward function.

Referring to the first part 410, the agent observes the current state (s_t) and is asked to perform a specific action a. After the action is performed, the state transitions to the next state (s_{t+1}) and the agent receives a reward (r_t). Referring to the first region 415 of the first part 410, the method according to an embodiment of the present invention performs action a in state s so as to maximize the Q value of the discounted cumulative reward function.

The method according to an embodiment of the present invention maximizes the discounted cumulative reward expected when action a is performed in state s under the policy π_θ(s,a). For reference, reinforcement learning uses gradient descent to reduce the deviation of the distribution of Q values, and conventional gradient descent uses only the empirical discounted cumulative reward obtained under the policy.

However, as shown in the second region 425 of the second part 420, the method according to an embodiment of the present invention predicts the gradient descent value by subtracting the empirical discounted cumulative reward obtained by following the policy from the predicted discounted cumulative reward, so as to obtain a policy parameter with a lower variance than in the prior art.
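The two quantities contrasted above can be illustrated numerically. The following is only a sketch under stated assumptions, not the patented implementation: the discount factor `gamma`, the reward list, and the `predicted` values are hypothetical, and the function names are invented for illustration. It computes the empirical discounted cumulative reward for each step and the difference (predicted minus empirical) that the description says is used when forming the gradient.

```python
def discounted_returns(rewards, gamma):
    """Empirical discounted cumulative reward: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def prediction_gap(predicted, rewards, gamma):
    """Predicted minus empirical discounted cumulative reward, per step."""
    empirical = discounted_returns(rewards, gamma)
    return [p - g for p, g in zip(predicted, empirical)]


# Hypothetical episode of three steps (values chosen for easy arithmetic).
rewards = [1.0, 0.0, 2.0]
returns = discounted_returns(rewards, gamma=0.5)
gaps = prediction_gap([2.0, 1.0, 2.0], rewards, gamma=0.5)
print(returns, gaps)
```

A gap of zero means the prediction already matches what the policy actually earned from that step onward; non-zero gaps are what a learning step would act on.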

FIG. 5 is a diagram for explaining the resource allocation result over time when applications are co-executed according to the prior art.

FIG. 5 shows the result of placing jobs on shared GPU resources using the DeepRM learning model; the resources considered by the DeepRM learning model are CPU and memory. Because the DeepRM learning model generates virtual jobs, its job placement is not performed on real applications. In addition, as described above, it considers only a constant level of CPU and memory usage over the execution time.

FIG. 6 is a diagram for explaining the resource allocation result over time when applications are co-executed according to an embodiment of the present invention.

FIG. 6 consists of a cluster image for four resources (GPU, GPU memory, GPU core, and PCIe) and three job images (job slot 1 to job slot 3). The matrix at the bottom of FIG. 6 shows the amount of each resource used by each job in every time slot T, written as {T, (R1, R2, R3, R4)}. For example, over five time slots the purple job (job slot 1) has the resource requirements {0, (1, 1, 0, 1)}, {1, (2, 3, 2, 1)}, {2, (3, 0, 2, 1)}, {3, (0, 0, 3, 1)}, {4, (0, 0, 0, 1)}. The blue job (job slot 2) has the resource requirements {0, (2, 2, 1, 3)}, {1, (1, 1, 1, 2)}, {2, (1, 1, 0, 2)}, {3, (1, 2, 0, 0)}, {4, (0, 0, 0, 0)}. Finally, the green job (job slot 3) has the resource requirements {0, (1, 1, 2, 0)}, {1, (1, 0, 1, 0)}, {2, (0, 1, 1, 0)}, {3, (3, 0, 0, 2)}, {4, (3, 4, 0, 2)}.

Because the input of the neural network is preferably represented as a fixed state, only M job slots are used as input. The learning result is used to determine the next action a, and the action space is expressed as a = {0, 1, ..., M}. a = 1 means executing the job assigned to the first job slot, while a = 0 means holding, which indicates that the cluster image has no room for a job image in the current time slot T. As the time slots advance, a suitable job a is thus determined and executed, and the resource cluster image is updated accordingly.

If more than M job slots are needed, the excess jobs can be added to the backlog component of the state space, and when an empty job slot exists it is filled with a job from the backlog. However, because this approach can execute only one job per result a, it cannot co-locate and execute jobs together.
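The slot-and-backlog behaviour described above can be sketched as a small data structure. This is a minimal illustration, assuming a hypothetical `JobSlots` class and string job names; it mirrors only what the text states: M visible job slots, a backlog for overflow, action 0 meaning "hold", and action k (1 to M) meaning "execute the job in slot k".

```python
from collections import deque


class JobSlots:
    """M visible job slots plus a FIFO backlog for overflow (illustrative)."""

    def __init__(self, m):
        self.slots = [None] * m   # fixed-size part of the state
        self.backlog = deque()    # jobs waiting for a free slot

    def submit(self, job):
        """Put the job in the first empty slot, or in the backlog if full."""
        for i, current in enumerate(self.slots):
            if current is None:
                self.slots[i] = job
                return
        self.backlog.append(job)

    def act(self, a):
        """a == 0 holds; a in 1..M returns the job in slot a and refills it."""
        if a == 0 or self.slots[a - 1] is None:
            return None
        job = self.slots[a - 1]
        # Refill the freed slot from the backlog when something is waiting.
        self.slots[a - 1] = self.backlog.popleft() if self.backlog else None
        return job


queue = JobSlots(m=3)
for name in ["j1", "j2", "j3", "j4", "j5"]:
    queue.submit(name)
print(queue.act(1), queue.slots, list(queue.backlog))
```

Each call to `act` returns at most one job, which reflects the single-job limitation noted in the text.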

In contrast, the method according to an embodiment of the present invention can place jobs in consideration of the number of job slots that can use one resource at the same time. For example, when a total of four units is available for one resource, the present invention can allocate the resource to as many job slots as possible, as long as the usage of that resource does not exceed four units per time slot. Accordingly, in the first time slot (T = 0) of FIG. 6, the amounts of resources allocated across the job slots can be expressed as {4, 4, 3, 4}, which means that in the first time slot three units of GPU core are allocated and four units each of GPU, GPU memory, and PCIe are allocated.
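The per-time-slot totals can be checked directly from the job requirements listed for FIG. 6. The sketch below is illustrative only: the dictionary layout and function names are assumptions, while the numbers and the per-resource limit of four units are taken from the description.

```python
RESOURCES = ("GPU", "GPU memory", "GPU core", "PCIe")
CAPACITY = 4  # units available per resource in each time slot

# Requirements (R1, R2, R3, R4) for time slots T = 0..4, as listed for FIG. 6.
jobs = {
    "job slot 1": [(1, 1, 0, 1), (2, 3, 2, 1), (3, 0, 2, 1), (0, 0, 3, 1), (0, 0, 0, 1)],
    "job slot 2": [(2, 2, 1, 3), (1, 1, 1, 2), (1, 1, 0, 2), (1, 2, 0, 0), (0, 0, 0, 0)],
    "job slot 3": [(1, 1, 2, 0), (1, 0, 1, 0), (0, 1, 1, 0), (3, 0, 0, 2), (3, 4, 0, 2)],
}


def totals_per_time_slot(jobs):
    """Sum each resource over all co-located jobs, one tuple per time slot."""
    n_slots = len(next(iter(jobs.values())))
    return [
        tuple(sum(demand[t][r] for demand in jobs.values())
              for r in range(len(RESOURCES)))
        for t in range(n_slots)
    ]


def fits(totals, capacity):
    """True when no resource exceeds capacity in any time slot."""
    return all(used <= capacity for slot in totals for used in slot)


totals = totals_per_time_slot(jobs)
print(totals[0])               # totals for T = 0
print(fits(totals, CAPACITY))  # whether the three jobs can run together
```

For T = 0 the sum is (4, 4, 3, 4), matching the {4, 4, 3, 4} given in the text.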

Comparing FIG. 6 with FIG. 5, the method according to an embodiment of the present invention considers more variables when allocating shared resources, so it can share resources with less performance degradation than the conventional DeepRM of FIG. 5. Moreover, whereas the conventional DeepRM works on virtual jobs and supports only single-job placement, an embodiment of the present invention can run real jobs and can also co-locate them. That is, according to an embodiment of the present invention, co-location can both shorten execution time and increase resource utilization.

FIG. 7 is a diagram for explaining the structure of a job placement service model according to an embodiment of the present invention.

Referring to FIG. 7(a), the infrastructure layer of the method according to an embodiment of the present invention targets GPU computing containers and assumes Kubernetes as the orchestration platform. The proposed service model is the service model that reflects the method according to an embodiment of the present invention described with reference to FIGS. 2 to 6. The job placement service actually places and executes the jobs to be co-located, that is, the action value selected according to the characteristics of each application and the system environment, which are the causes of interference, and monitors the state of the system and the jobs. It can thereby detect job failure, termination, and the like, and provide a service that places the next jobs on the resources.

FIG. 7(b) is an algorithm showing the overall flow of placing jobs for co-execution based on reinforcement learning, while reflecting the profiling data of the applications submitted by the user and the state of the cluster resources.

Referring to FIG. 7(b), the user first submits the application to be executed, and the algorithm checks the profiling data of the submitted application and the state information of the system (cluster status). It then determines the input data to be used as input to the learning model. Here, j_i of Equation 4 and R of Equation 2 may be input as the profiling data of the application and the state information of the system, respectively.

Reinforcement learning is then applied to determine the order of the jobs in the entire workload queue and the jobs to be co-located at each position in that order. According to the order of the jobs that cause interference, the jobs are co-located on the GPU and executed, after which the state of the jobs is monitored. The GPU co-location, execution, and monitoring operations described above are repeated until all jobs have been placed.
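The repeat-until-placed flow described for FIG. 7(b) can be outlined as a loop. This is a rough sketch, not the algorithm of the figure: `place_all`, `choose_group`, and `run_group` are hypothetical names, the learning model is replaced by a stub that picks the first two queued jobs, and re-queuing failed jobs is an assumption based on the monitoring step and the re-execution of failed applications mentioned in the experiments.

```python
from collections import deque


def place_all(jobs, choose_group, run_group):
    """Repeat: pick a co-location group, run it, monitor, until queue is empty.

    choose_group(pending) -> non-empty list of jobs to co-locate next
    run_group(group)      -> list of jobs that failed and must be re-run
    """
    queue = deque(jobs)
    history = []
    while queue:
        group = choose_group(list(queue))
        if not group:
            raise ValueError("choose_group must return at least one job")
        for job in group:
            queue.remove(job)
        failed = run_group(group)   # execution plus monitoring result
        history.append(group)
        queue.extend(failed)        # failed jobs go back into the queue
    return history


# Stub policy standing in for the learned model: co-locate the first two jobs.
def first_two(pending):
    return pending[:2]


# Stub executor: every job succeeds.
def always_ok(group):
    return []


print(place_all(["a", "b", "c", "d", "e"], first_two, always_ok))
```

In a real deployment the two callables would wrap the trained model and the container runtime; here they only make the control flow visible.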

FIGS. 8 to 10 are diagrams for explaining the per-workload experimental results of the machine learning-based job placement method for GPU applications according to an embodiment of the present invention.

To compare the performance of the method according to an embodiment of the present invention with the prior art, performance analysis was carried out on five workloads (GPU heavy, GPU light, GPU memory heavy, GPU memory light, and random) built from the four HPC applications provided by NGC and the seven TensorFlow benchmark ML (machine learning) applications listed in Table 2 below.

Table 2

| HPC | ML | ML |
|---|---|---|
| Lammps | googlenet | Resnet50 |
| Gromacs | alexnet | Resnet101 |
| Qmcpack | Vgg16 | Resnet152 |
| Hoomd | Vgg11 | |

In addition, the four job placement methods listed in Table 3 were used as the prior art against which the method according to an embodiment of the present invention is compared.

Table 3. Conventional job placement methods

| Method | Placement type | Description |
|---|---|---|
| SJF | Single placement | Places jobs in order of increasing execution time |
| Tetris | Multiple placement | Places jobs by packing, according to each application's resource usage and resource availability |
| DeepRM Max | Multiple placement | Places jobs by applying reinforcement learning with the maximum resource usage of each job as input |
| DeepRM Mean | Multiple placement | Places jobs by applying reinforcement learning with the average resource usage of each job as input |

The GPU container cluster used in the experiments is a node cluster built with Kubernetes, a container orchestration platform, and k8s, the default Kubernetes device plugin, was modified to allow multiple job placement. Jobs were run in NVIDIA Docker containers, and the specifications of the computing nodes used in the experiments are shown in Table 4 below.

Table 4

| Node details | CPU (master) | GPU (node) |
|---|---|---|
| Architecture | Intel® Core™ i7-5820K | Nvidia GeForce Titan Xp D5x |
| Core clock | 3.30 GHz | 1.58 GHz |
| Number of cores | 6 | 3840 |
| Memory size | 32 GB | 12 GB |
| Threading API | - | Nvidia CUDA 10.0 |
| PCIe bandwidth | - | 32 GB/s |
| Operating system | Ubuntu 16.04.6 LTS | Ubuntu 16.04.6 LTS |

FIG. 8 shows the execution time, GPU utilization, and GPU memory utilization of the prior art and of the method according to an embodiment of the present invention for the GPU heavy (a) and GPU light (b) workloads, which are classified using an average GPU utilization of 40% as the threshold.

The GPU heavy workload of FIG. 8(a) consists of the vgg11, vgg16, resnet50, resnet101, resnet152, qmcpack, hoomd, and lammps applications, which show an average GPU utilization of 40% or more. SJF placement has the shortest execution time, 4807 seconds, but the lowest average GPU utilization (70.50%) and average GPU memory utilization (36.44%). Among the techniques that run multiple jobs together, the method according to an embodiment of the present invention (DqnGPU) performs best at 4891 seconds and shows high resource utilization, with GPU utilization of 82.09% and GPU memory utilization of 59.13%. DeepRM Mean has the longest execution time, 5530 seconds, because out-of-memory (OOM) errors occurred when GPU memory was used to its maximum and the failed applications had to be re-executed.

The GPU light workload of FIG. 8(b) consists of the alexnet and gromacs applications, which show an average GPU utilization of less than 40%. When applications with low GPU utilization are run, SJF placement has the lowest performance, at 2295 seconds, and the lowest resource utilization. This appears to be because SJF places applications one at a time, so its execution time is longer than that of the other techniques, which can co-execute jobs. The gromacs application runs longer than alexnet but uses less GPU memory, so up to three jobs can run in a single placement. The method according to an embodiment of the present invention (DqnGPU) performs best, with an execution time of 1214 seconds, GPU utilization of 89.31%, and GPU memory utilization of 88.42%, and placement through reinforcement learning allows up to 11755 MB of GPU memory to be used.

FIG. 9 shows the execution time, GPU utilization, and GPU memory utilization of the prior art and of the method according to an embodiment of the present invention for the GPU memory heavy (a) and GPU memory light (b) workloads, which are classified using an average GPU memory utilization of 50% as the threshold.

The GPU memory heavy workload of FIG. 9(a) consists of the vgg11, vgg16, resnet50, resnet101, and resnet152 applications, which show an average GPU memory utilization of 50% or more. SJF has the shortest execution time, followed by the method according to an embodiment of the present invention (DqnGPU). The execution time of the method according to an embodiment of the present invention appears to have increased because the executed applications also have high GPU utilization, which caused resource contention when they were co-located. Nevertheless, compared with the other co-execution placement techniques, it has the shortest execution time, 1214 seconds, and the highest resource utilization, 89.31% and 88.42%.

The GPU memory light workload of FIG. 9(b) consists of the alexnet, googlenet, qmcpack, hoomd, and gromacs applications, which show an average GPU memory utilization of less than 50%. The method according to an embodiment of the present invention (DqnGPU) is the best in terms of execution time. It has the shortest execution time, 3928 seconds, because, based on the learning result, the gromacs and qmcpack applications, which have low GPU memory utilization but long execution times, were co-located and executed together.

FIG. 10 shows the execution time, GPU utilization, and GPU memory utilization of the prior art and of the method according to an embodiment of the present invention for a randomly composed workload.

The random workload of FIG. 10 consists of all the applications used in the experiments. When co-location is performed with the method according to an embodiment of the present invention (DqnGPU), the execution time is 2481 seconds, GPU utilization is 62.16%, and GPU memory utilization is 20.45%. In terms of execution time, it outperforms the other co-execution placement techniques. However, its GPU utilization is lower than that of Tetris and DeepRM Max and lower than in the other workload experiments, and its GPU memory utilization is similar to that of SJF. FIG. 10 indicates that, with the method according to an embodiment of the present invention, when many applications with diverse resource usage characteristics are present, the effects of resource contention can be reflected in various ways during job placement.

FIG. 11 is a diagram for explaining the results of the job slowdown analysis experiment for the machine learning-based job placement method for GPU applications according to an embodiment of the present invention.

For the slowdown analysis, the slowdown of a total of 30 jobs was analyzed for each of the four co-execution placement techniques applied to the random workload of FIG. 10. Using the execution time of each application run alone as the baseline, jobs were classified as no slowdown (within a 2% margin of error), slowdown of less than 10%, slowdown of 10% or more and less than 20%, and slowdown of 20% or more.
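The four categories above amount to a simple threshold rule on relative slowdown. The sketch below is one plausible reading, not the evaluation script used in the experiments: the function name, the bucket labels, and the example timings are hypothetical, and treating anything up to 2% as "no slowdown" is an interpretation of the stated margin of error.

```python
def slowdown_bucket(solo_seconds, colocated_seconds, tolerance=0.02):
    """Classify a job by its slowdown relative to running alone."""
    slowdown = (colocated_seconds - solo_seconds) / solo_seconds
    if slowdown <= tolerance:
        return "no slowdown"
    if slowdown < 0.10:
        return "under 10%"
    if slowdown < 0.20:
        return "10% to under 20%"
    return "20% or more"


# Hypothetical timings in seconds: (solo run, co-located run).
samples = [(100, 101), (100, 105), (100, 115), (100, 130)]
for solo, shared in samples:
    print(solo, shared, slowdown_bucket(solo, shared))
```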

FIG. 11 and Table 5 show the slowdown of each job under the four techniques when the random workload is run. With the method according to an embodiment of the present invention, a total of 20 jobs experienced no slowdown. Because the model learns from resource usage that varies over the execution time, jobs are placed within the limits of the actual resources, so the slowdown caused by resource contention appears to be small. The slowdown that does occur indicates that, in addition to resources not considered during co-execution, jobs can be affected by the characteristics of the running applications and by various conditions and environments. With Tetris and DeepRM Max, 14 jobs experienced no slowdown. DeepRM Mean has the most jobs with a slowdown of 10% or more, 12 in total. Because DeepRM Mean places jobs according to average resource usage, the OOM errors that occur during co-execution caused many slowdowns.

Table 5

| Technique | No degradation | Less than 10% degradation | 10% to under 20% degradation | 20% or more degradation |
|---|---|---|---|---|
| Tetris | 14 | 10 | 4 | 2 |
| DqnGPU | 20 | 5 | 3 | 2 |
| DeepRM Max | 14 | 6 | 7 | 3 |
| DeepRM Mean | 13 | 5 | 8 | 4 |

FIG. 12 is a diagram for explaining the results of the training overhead comparison experiment for the machine learning-based job placement method for GPU applications according to an embodiment of the present invention.

FIG. 12 shows a comparison of overhead, defined as the ratio of training time to total execution time, using the random workload. Total execution times were normalized to that of SJF, which has the shortest execution time. Assuming that the resource usage history of all applications exists, the total execution time including the training time for the first job placement is compared. For the method according to an embodiment of the present invention (DqnGPU), DeepRM Max, and DeepRM Mean, training time is included because placement follows the result of reinforcement learning.

FIG. 12 shows the training overhead of each placement technique. The normalized execution times of Tetris, the method according to an embodiment of the present invention (DqnGPU), DeepRM Max, and DeepRM Mean are about 1.294, 1.081, 1.303, and 1.346, respectively. For the method according to an embodiment of the present invention, the training time is 123 seconds, or about 0.054 in normalized terms, and the execution time excluding training is 1.027. Compared with SJF, the difference in execution time is small. This means that, apart from the initial training time for a job set, there is almost no performance difference when subsequent jobs are placed. It also means that resource utilization increases and that, if training continues online while jobs run, accuracy improves and performance on the next job set can improve. The training time of DeepRM Max is 0.048, and comparing only the execution times of Tetris and DeepRM Max gives 1.294 and 1.254, so DeepRM Max is better. Therefore, setting aside the initial training overhead, job placement techniques that apply reinforcement learning are worthwhile.
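The subtraction behind these figures can be reproduced from the quoted values. The snippet below is only a worked check: the dictionary and function names are invented, the training share for DeepRM Mean is omitted because the text does not state it, and treating Tetris as having no training time is an assumption based on the statement that training applies to the three reinforcement learning techniques.

```python
# Normalized total execution time (SJF = 1.0), as quoted in the text.
normalized_total = {
    "Tetris": 1.294,
    "DqnGPU": 1.081,
    "DeepRM Max": 1.303,
    "DeepRM Mean": 1.346,
}

# Normalized training time; only DqnGPU and DeepRM Max are stated.
# Tetris = 0.0 is an assumption (it is not a learning-based technique).
normalized_training = {"Tetris": 0.0, "DqnGPU": 0.054, "DeepRM Max": 0.048}


def execution_only(name):
    """Normalized execution time with the training share removed."""
    return normalized_total[name] - normalized_training[name]


for name in normalized_training:
    print(name, round(execution_only(name), 3))
```

With these rounded inputs DqnGPU comes out at 1.027, as stated, and DeepRM Max at about 1.255; the 1.254 in the text presumably reflects unrounded measurements.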

FIG. 13 is a diagram for explaining a price comparison between the machine learning-based job placement device for GPU applications according to an embodiment of the present invention and a conventional device.

FIG. 13 compares the price of SJF with that of the device according to an embodiment of the present invention, based on GPU instances offered by a cloud provider.

Table 6 lists the prices of GPU instances offered by Amazon EC2. G3 instances provide NVIDIA Tesla M60 GPU cards, and G4 instances provide NVIDIA T4 GPU cards. Assuming that, in terms of GPU memory, one g4dn.4xlarge is equivalent to two g3.xlarge instances, sharing a g4dn.4xlarge reduces the cost to $0.602 (hereinafter, cost 1). Likewise, assuming that, in terms of vCPU and memory, one g4dn.4xlarge is equivalent to four g4dn.xlarge instances, sharing a g4dn.4xlarge reduces the cost to $0.301 (hereinafter, cost 2).

Table 6

| Instance | GPU | vCPUs | Memory (GB) | GPU memory (GB) | GPU cores | Cost (USD) |
|---|---|---|---|---|---|---|
| g3.xlarge | 1 | 4 | 30.5 | 8 | 2048 | 0.75 |
| g4dn.xlarge | 1 | 4 | 16 | 16 | 2560 | 0.526 |
| g4dn.4xlarge | 1 | 16 | 64 | 16 | 2560 | 1.204 |
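The cost 1 and cost 2 figures follow from dividing the g4dn.4xlarge price among the jobs assumed to share it. The snippet below is a worked check of that arithmetic only: the dictionary keys and the `price_per_tenant` helper are hypothetical names, and the values are those of Table 6.

```python
# Values from Table 6 (price in USD as listed there).
instances = {
    "g3.xlarge":    {"vcpus": 4,  "memory_gb": 30.5, "gpu_memory_gb": 8,  "price_usd": 0.75},
    "g4dn.xlarge":  {"vcpus": 4,  "memory_gb": 16,   "gpu_memory_gb": 16, "price_usd": 0.526},
    "g4dn.4xlarge": {"vcpus": 16, "memory_gb": 64,   "gpu_memory_gb": 16, "price_usd": 1.204},
}


def price_per_tenant(name, tenants):
    """Price of one instance split evenly among the jobs sharing it."""
    return round(instances[name]["price_usd"] / tenants, 3)


big = instances["g4dn.4xlarge"]

# Cost 1: GPU memory of one g4dn.4xlarge equals that of two g3.xlarge.
assert big["gpu_memory_gb"] == 2 * instances["g3.xlarge"]["gpu_memory_gb"]
cost1 = price_per_tenant("g4dn.4xlarge", 2)

# Cost 2: vCPU and memory of one g4dn.4xlarge equal four g4dn.xlarge.
assert big["vcpus"] == 4 * instances["g4dn.xlarge"]["vcpus"]
assert big["memory_gb"] == 4 * instances["g4dn.xlarge"]["memory_gb"]
cost2 = price_per_tenant("g4dn.4xlarge", 4)

print(cost1, cost2)
```

These are per-share prices only; the percentage savings reported for the random workload also depend on measured execution times, which this check does not model.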

The graph of FIG. 13 shows the comparison under the costs assumed above, based on running the random workload described earlier. For cost 1, using an instance shared according to an embodiment of the present invention (DqnGPU) reduces the cost by about 29.86% compared with using a dedicated instance. For cost 2, using an instance shared according to an embodiment of the present invention (DqnGPU) reduces the cost by about 49.98% compared with using a dedicated instance.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention pertains may make various modifications and variations without departing from the essential characteristics of the present invention. Accordingly, the embodiments described in the present invention are intended to explain, not to limit, the technical idea of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within an equivalent scope should be construed as falling within the scope of rights of the present invention.

Claims

1. A machine learning-based job placement method for GPU applications, performed by a machine learning-based job placement device for GPU applications, the method comprising:
when information on an application requested for execution is received, checking properties of the application and a state of the system based on the received information on the application;
selecting input data of a learning model from among the properties of the application and inputting the input data into the learning model;
determining a job order of the application and a co-location structure of the application and other applications by applying the learning model; and
executing the application on the GPU according to the determined job order and co-location structure,
wherein the learning model observes a current state and determines an action according to a policy such that the expected discounted cumulative reward when performing the action is maximized.

2. The method of claim 1,
wherein the application properties input to the learning model include at least one of GPU application dictionary information, GPU application profiling information, and cluster environment information, and the GPU application profiling information includes GPU utilization, GPU memory usage, GPU core usage, PCIe throughput, and execution time of the application.

3. The method of claim 2,
wherein at least one of the GPU utilization, the GPU memory usage, the GPU core usage, and the PCIe throughput included in the GPU application profiling information is information collected by periodically monitoring the GPU for a certain period of time.

4. The method of claim 2,
wherein the cluster environment information includes at least one of a node name, a GPU card, a GPU structure, GPU memory, a number of GPU cores, and PCIe bandwidth.

5. The method of claim 1,
wherein the current state of the learning model is determined based on the application properties input to the learning model, and the action relates to a job slot of the co-location structure.

6. The method of claim 5,
wherein the number of actions is equal to the number of job slots of the co-location structure.

7. The method of claim 5,
wherein the learning model uses gradient descent to reduce the variance of the distribution of the discounted cumulative reward function.

8. The method of claim 7,
wherein the gradient descent predicts the gradient descent value after subtracting the empirical discounted cumulative reward obtained according to the policy from the predicted discounted cumulative reward.

9. A machine learning-based job placement method for GPU applications, performed by a machine learning-based job placement device for GPU applications, the method comprising:
when information on an application requested for execution is received, checking properties of the application and a state of the system based on the received information on the application;
selecting input data of a learning model from among the properties of the application and inputting the input data into the learning model;
determining a job order of the application and a co-location structure of the application and other applications by applying the learning model; and
executing the application on the GPU according to the determined job order and co-location structure,
wherein determining the co-location structure comprises:
when job slots exceeding the number of available job slots are needed, adding the excess job slots to a backlog; and
when empty job slots exist because fewer job slots than the number of available job slots are requested, filling the empty job slots with jobs added to the backlog.

10. The method of claim 1,
wherein the learning model is a DQN learning model.

11. The method of claim 1, further comprising:
after executing the application on the GPU, monitoring the execution result of the application; and
modifying the input of the learning model according to the monitoring result.

12. A computer-readable storage medium storing a computer program that, when executed by a processor, performs the method according to any one of claims 1 to 11.