KR102662496B1

KR102662496B1 - Batch scheduling method for generating multiple deep learning model based inference responses using multi-gpu

Info

Publication number: KR102662496B1
Application number: KR1020230104373A
Authority: KR
Inventors: 조창희; 고형석; 이홍재
Original assignee: (주)유알피
Priority date: 2023-08-09
Filing date: 2023-08-09
Publication date: 2024-05-03

Abstract

The present invention relates to a method of using a batch scheduler to use GPU resources efficiently when generating inference responses based on multiple deep learning models using multiple GPUs. In a batch scheduling device for generating deep learning model-based inference responses, the method comprises: a request receiving step of receiving a request for generating an inference response of a deep learning model from at least one of an external server and a user terminal; a request classification step of classifying the inference response generation request and registering it in an inference request queue; a scheduling step of periodically checking the inference request queue to determine a processing order and a GPU for the inference response generation requests; and a batch execution step of loading the deep learning model required for generating the inference response onto the selected GPU and executing a batch.

Description

Batch scheduling method for generating inference responses based on multiple deep learning models using multi-GPU {BATCH SCHEDULING METHOD FOR GENERATING MULTIPLE DEEP LEARNING MODEL BASED INFERENCE RESPONSES USING MULTI-GPU}

The present invention relates to a method of using a batch scheduler to use GPU resources efficiently when generating inference responses based on multiple deep learning models using multiple GPUs. More specifically, it is a batch scheduling method for generating inference responses based on multiple deep learning models using multi-GPU, in which deep learning inference response generation requests are registered in per-model request queues, the request queues are checked periodically to select the GPU that will process the requests and to determine the inference processing order, and deep learning model loading and inference response generation are processed in batches.

Recently, as the development of deep learning technology for exploring and processing large volumes of data has accelerated, the complexity of deep learning models has increased, and as the demand for model accuracy has grown, the size of deep learning models has been increasing exponentially.

Hardware performance and infrastructure are improving rapidly, but as hardware limitations have come to the fore in large-scale AI training and inference, various technologies are required to overcome these hardware-related limits and improve AI performance.

In this regard, parallel processing across multiple GPUs is used to address the GPU memory limits that can arise when training and running inference with large deep learning models and, particularly in real-time applications, to reduce the inference response time, that is, latency. However, processing model inference in parallel on a multi-GPU system requires effective GPU allocation and scheduling of inference processing. In particular, when inference responses based on multiple deep learning models must be generated, it is difficult to minimize inference latency while taking into account factors such as importance, waiting time, model size, and the state of available resources.

To solve the above problems, an object of the present invention is to provide a batch scheduling method for generating inference responses based on multiple deep learning models using multi-GPU. The method schedules deep learning model-based inference response generation requests and executes them in batches, taking into account the hardware and service environment, such as the type, importance, model size, and priority of the inference models, so that limited GPU resources can be used effectively when performing deep learning inference.

A batch scheduling method for generating inference responses based on multiple deep learning models using multi-GPU according to an embodiment of the present invention may include, in a batch scheduling device for generating deep learning model-based inference responses: a request receiving step of receiving a request for generating an inference response of a deep learning model from at least one of an external server and a user terminal; a request classification step of classifying the inference response generation request and registering it in an inference request queue; a scheduling step of periodically checking the inference request queue to determine a processing order and a GPU for the inference response generation requests; and a batch execution step of loading the deep learning model required for generating the inference response onto the selected GPU and executing a batch.

In addition, the request receiving step is characterized by receiving inference response generation requests for one or more different types of deep learning models.

In addition, the request classification step is characterized by checking the inference response generation request, determining the deep learning model that will perform the inference, and registering the request in the inference request queue that handles inference response generation requests for that deep learning model.

In addition, the inference request queues are characterized in that they are created in the same number as the deep learning models that perform inference, are mapped to and managed together with those deep learning models, and each include information on the GPU memory required to generate an inference response with the mapped deep learning model.

In addition, the scheduling step includes: a step of periodically checking at least one of the inference request queues, calculating the maximum GPU memory required to process the inference response generation requests registered in each inference request queue, and identifying the GPUs that can be allocated; a step of determining the processing order of the inference response generation requests; and a step of selecting, according to the maximum GPU memory value, the GPU that will process the inference response generation requests.

In addition, the scheduling step is characterized by determining the processing order according to at least one of the GPU memory size required to process the inference response generation requests, the request time, the per-model priority, and the per-model maximum waiting time.

In addition, the batch execution step is characterized in that, when a deep learning model different from the one to be executed is loaded on the GPU selected for generating the inference response, or when no deep learning model is loaded, the deep learning model required for generating the inference response is newly loaded and a response generation batch is executed.

In addition, the response generation batch is characterized by performing inference for a plurality of inference response generation requests on a single deep learning model.

Batch execution enables parallel processing across multiple GPUs, which can reduce inference latency.

In addition, by using the batch scheduler, multiple deep learning models can be allocated efficiently to limited GPU resources.

In addition, by reflecting service importance and the maximum allowable waiting time, priority can be determined when there are identical response generation targets.

FIG. 1 is an overall relationship diagram of a batch scheduling device for generating inference responses based on multiple deep learning models using multi-GPU according to an embodiment of the present invention.
FIG. 2 is a diagram showing the functions and operational structure of a batch scheduling device for generating inference responses based on multiple deep learning models using multi-GPU according to an embodiment of the present invention.
FIG. 3 is a diagram showing the hardware structure of a batch scheduling device for generating inference responses based on multiple deep learning models using multi-GPU according to an embodiment of the present invention.
FIG. 4 is a flowchart of a batch scheduling method for generating inference responses based on multiple deep learning models using multi-GPU according to an embodiment of the present invention.
FIG. 5 is a detailed flowchart of the inference processing order and GPU selection (S430) in the batch scheduling method for generating inference responses based on multiple deep learning models using multi-GPU according to an embodiment of the present invention.

Hereinafter, specific embodiments of the present invention will be described in detail with reference to the drawings. However, the spirit of the present invention is not limited to the embodiments presented. A person skilled in the art who understands the spirit of the present invention may readily propose, by adding, changing, or deleting components within the scope of the same spirit, other regressive inventions or other embodiments that fall within the scope of the spirit of the present invention, and these are also to be regarded as falling within that scope.

The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or custom of the inventor, so their definitions should be based on the content of this specification as a whole. Where a detailed description of a well-known configuration or function related to the present invention is judged likely to obscure the gist of the invention, that description is omitted.

Hereinafter, a batch scheduling device for generating inference responses based on multiple deep learning models using multi-GPU according to the present invention will be described with reference to the drawings.

FIG. 1 is an overall relationship diagram of a batch scheduling device 100 for generating inference responses based on multiple deep learning models using multi-GPU (hereinafter referred to as the batch scheduling device for generating deep learning model-based inference responses) according to an embodiment of the present invention.

Referring to FIG. 1, the batch scheduling device 100 for generating deep learning model-based inference responses is connected over a network to at least one user terminal 200 and at least one application server 300, and can communicate with them.

The network referred to in the present invention may be a core network integrated with a wired public network, a wireless mobile communication network, the mobile Internet, or the like. It may also mean the worldwide open computer network architecture that provides the TCP/IP protocol and the various services that exist in its upper layers, such as HTTP (Hyper Text Transfer Protocol), HTTPS (Hyper Text Transfer Protocol Secure), Telnet, and FTP (File Transfer Protocol). The term is not limited to these examples and broadly covers any data communication network capable of transmitting and receiving data in various forms.

The batch scheduling device 100 for generating deep learning model-based inference responses of the present invention registers deep learning inference response generation requests in per-model request queues, periodically checks the request queues, determines priority by reflecting the GPU memory required for response generation, the per-model priority (service importance), and the maximum allowable waiting time, and executes batches to generate responses.

To this end, the device receives inference response generation requests for deep learning models from at least one of the user terminal 200 and the external server 300, classifies the requests and registers them in the per-model request queues, periodically checks the per-model request queues to select the GPU that will process the requests, schedules the priority of the requests, loads the deep learning model, and generates the inference responses.

In the present invention, the user terminal 200 or the external server 300 may request inference based on at least one deep learning model through a user interface or an integration interface provided by the batch scheduling device 100.

The external server 300 may be any of various application servers that provide artificial intelligence-based services, a big data server that processes and analyzes data using artificial intelligence technology, or the like.

FIG. 2 is a diagram showing the functions and operational structure of the batch scheduling device 100 for generating inference responses based on multiple deep learning models using multi-GPU according to an embodiment of the present invention.

Referring to FIG. 2, the batch scheduling device 100 for generating deep learning model-based inference responses may include a request receiving unit 110, a request classification unit 120, a scheduling unit 130, and a response transmission unit 140.

The request receiving unit 110 receives requests for generating an inference response of a deep learning model from at least one of an external server and a user terminal.

The batch scheduling device 100 for generating deep learning model-based inference responses can manage at least one deep learning model, and each deep learning model is an inference model whose training has been completed and which provides an artificial intelligence service.

In addition, the models generate inference responses that differ from one another.

Accordingly, the request receiving unit 110 receives inference response generation requests for one or more different types of deep learning models.

For example, in a system that provides document processing, analysis, and search services, the requests may be deep learning model-based inference requests such as summary generation, document classification, and keyword generation.

Although the present invention describes embodiments for deep learning inference models, it is not limited to these and may cover inference models in general.

The request receiving unit 110 may provide an interface for receiving requests from the user terminal 200 or the external server 300.

For example, the interface that receives requests may be based on HTTP, REST, or SOAP.

The request classification unit 120 classifies the inference response generation requests received by the request receiving unit 110 and registers them in the inference request queues.

Here, an inference request queue may be a queue-type storage space with a FIFO (First In First Out) structure, in which data is output in the order in which it was input.

In addition, as many inference request queues are created as there are models, and each inference request queue may manage requests for only one inference model.

To this end, each inference request queue may be configured with a model identifier that identifies the model to be processed and with information on the GPU memory required to generate an inference response through that model.

The request classification unit 120 checks the type of each inference response generation request to identify the inference model, and adds the request to the corresponding inference request queue.
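As a non-limiting illustration, the per-model queue arrangement described above could be sketched as follows. This is a minimal sketch rather than the implementation of the embodiment: the names `InferenceRequest`, `ModelQueue`, and `RequestClassifier`, the model identifiers, and the memory figures are all hypothetical.

```python
import time
from collections import deque
from dataclasses import dataclass, field


@dataclass
class InferenceRequest:
    model_id: str   # which inference model the request targets
    payload: str    # input data to run inference on
    reply_to: str   # requester that should receive the response
    requested_at: float = field(default_factory=time.monotonic)


@dataclass
class ModelQueue:
    model_id: str                # model identifier set on the queue
    gpu_mem_per_request_mb: int  # GPU memory needed for one inference response
    requests: deque = field(default_factory=deque)  # FIFO storage


class RequestClassifier:
    def __init__(self, model_memory_mb: dict[str, int]) -> None:
        # One queue per model, so the number of queues equals the number of models.
        self.queues = {m: ModelQueue(m, mem) for m, mem in model_memory_mb.items()}

    def register(self, request: InferenceRequest) -> None:
        queue = self.queues.get(request.model_id)
        if queue is None:
            raise KeyError(f"no queue for model {request.model_id!r}")
        queue.requests.append(request)  # enqueue at the tail (FIFO)


# Hypothetical models and memory sizes, for illustration only.
classifier = RequestClassifier({"summary": 4000, "classify": 1500, "keyword": 2000})
classifier.register(InferenceRequest("summary", "doc A", "terminal-200"))
classifier.register(InferenceRequest("summary", "doc B", "server-300"))
```

Keeping the memory figure on the queue itself means the scheduler can size a batch from the queue alone, without consulting the model.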

The scheduling unit 130 checks each of the inference request queues, selects the GPU that will process the inference response generation requests, determines the inference processing order, loads the deep learning model, and generates the inference responses.

The scheduling unit 130 includes a batch scheduler 131, a response generation batch 132, and a resource estimation unit 133.

The batch scheduler 131 periodically checks at least one inference request queue for inference response generation requests, schedules them according to the GPU memory size required to process them and the order in which they were requested, and executes at least one response generation batch 132.

Here, the response generation batch 132 may be a plurality of processes or threads that process given tasks in parallel.

In addition, when inference response generation requests with the same request time exist in different inference request queues, the response generation batch 132 is executed after scheduling according to the per-model priority and the maximum waiting time.

The priority of each deep learning model may be set in advance by the operator.

The maximum waiting time may be calculated from the average inference time of each deep learning model and the number of response generation requests to be executed in the batch.
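As a non-limiting illustration, the ordering rule described above could be sketched as follows. The description does not state how the average inference time and the request count are combined, nor which direction the priority and waiting-time comparisons run, so the sketch makes three labeled assumptions: the maximum waiting time is their product, a lower priority number means a more important model, and a shorter maximum waiting time wins a remaining tie. The name `QueueHead` and all values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueueHead:
    model_id: str
    requested_at: float     # request time of the oldest request in the queue
    priority: int           # operator-set; lower number = more important (assumption)
    avg_inference_s: float  # average inference time of this model
    batch_count: int        # number of requests to be processed in the batch

    @property
    def max_wait_s(self) -> float:
        # Assumption: the two inputs are combined as a simple product.
        return self.avg_inference_s * self.batch_count


def processing_order(heads: list[QueueHead]) -> list[str]:
    # Earlier requests first; equal request times fall back to the
    # per-model priority, then to the shorter maximum waiting time.
    ranked = sorted(heads, key=lambda h: (h.requested_at, h.priority, h.max_wait_s))
    return [h.model_id for h in ranked]


heads = [
    QueueHead("keyword", 10.0, priority=2, avg_inference_s=0.2, batch_count=8),
    QueueHead("summary", 10.0, priority=1, avg_inference_s=1.5, batch_count=4),
    QueueHead("classify", 9.0, priority=3, avg_inference_s=0.1, batch_count=16),
]
order = processing_order(heads)
```

Here `classify` is served first because its request is oldest, and `summary` precedes `keyword` because the two share a request time and `summary` has the higher operator-set priority.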

The batch scheduler 131 checks each inference request queue according to a configured collection cycle, collects the inference response generation requests in the order in which they were requested, calculates the maximum GPU memory required to process the requests through the resource estimation unit 133, and determines the usable GPUs.

The resource estimation unit 133 calculates the maximum required GPU memory based on the GPU memory required to generate one inference response and the number of inference response generation requests to be processed in the batch, and delivers information on the usable GPUs.

The batch scheduling device 100 for generating deep learning model-based inference responses may include at least one GPU, and the GPUs may have different memory sizes.

Accordingly, the resource estimation unit 133 calculates the maximum GPU memory required for response generation as (GPU memory of the deep learning model that will generate the inference responses) × (number of inference response generation requests to be processed in the batch), identifies the GPUs whose available memory is greater than this maximum GPU memory value, and delivers the usable GPU information.
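As a non-limiting illustration, the calculation performed by the resource estimation unit could be sketched as follows. The function names, the GPU indices, and the memory figures (in MB) are hypothetical; only the product and the strict greater-than comparison come from the description.

```python
def max_gpu_memory_mb(mem_per_request_mb: int, request_count: int) -> int:
    # Per-response GPU memory of the model x number of requests in the batch.
    return mem_per_request_mb * request_count


def usable_gpus(free_mem_mb: dict[int, int], required_mb: int) -> list[int]:
    # A GPU qualifies only if its available memory exceeds the required maximum.
    return [gpu for gpu, free in sorted(free_mem_mb.items()) if free > required_mb]


# Hypothetical figures: 1500 MB per response, 8 queued requests,
# and three GPUs with different amounts of free memory.
required = max_gpu_memory_mb(1500, 8)
free = {0: 8000, 1: 16000, 2: 24000}
candidates = usable_gpus(free, required)
```

With these figures the batch needs 12000 MB, so GPU 0 is excluded and GPUs 1 and 2 are reported as usable.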

The batch scheduler 131 loads the relevant deep learning model onto the GPU selected for generating the inference responses and executes the response generation batch 132.

In addition, when the response generation batch 132 is executed, the deep learning model that will process the responses and the input data to be inferred may be passed to it.

At this time, if a deep learning model different from the one to be executed is loaded on the GPU selected for generating the inference responses, or if no deep learning model is loaded, the batch scheduler 131 newly loads the deep learning model required for generating the inference responses and executes the batch.
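As a non-limiting illustration, the loading condition described above could be sketched as follows. The class `GpuSlot` is hypothetical and the assignment stands in for a real model load. The description states only when a load occurs; skipping the load when the same model is already resident is an inference from that condition.

```python
from typing import Optional


class GpuSlot:
    """Tracks which model, if any, is currently loaded on one GPU."""

    def __init__(self, gpu_id: int) -> None:
        self.gpu_id = gpu_id
        self.loaded_model: Optional[str] = None
        self.load_count = 0

    def ensure_loaded(self, model_id: str) -> bool:
        """Load model_id unless it is already resident; return True if a load happened."""
        if self.loaded_model == model_id:
            return False  # same model already on the GPU (inferred reuse case)
        # No model, or a different model, is loaded: load the required one.
        self.loaded_model = model_id  # stand-in for the real load
        self.load_count += 1
        return True


slot = GpuSlot(gpu_id=0)
loads = [
    slot.ensure_loaded("summary"),  # nothing loaded yet
    slot.ensure_loaded("summary"),  # same model requested again
    slot.ensure_loaded("keyword"),  # a different model is requested
]
```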

The response generation batch 132 performs inference for a plurality of inference response generation requests on a single deep learning model.

At this time, the response generation batch 132 may preprocess and transform the input data to be inferred before applying it to the deep learning model.
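As a non-limiting illustration, a response generation batch that preprocesses its inputs and then applies one model to several requests, with separate batches running in parallel threads, could be sketched as follows. The two "models" are trivial stand-in functions, and `preprocess` and `run_batch` are hypothetical names; no real deep learning framework or GPU is involved.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Model = Callable[[list[str]], list[str]]


def preprocess(text: str) -> str:
    # Stand-in preprocessing: collapse whitespace and lowercase.
    return " ".join(text.split()).lower()


def run_batch(model: Model, inputs: list[str]) -> list[str]:
    # One batch: preprocess every input, then apply a single model to all of them.
    cleaned = [preprocess(x) for x in inputs]
    return model(cleaned)


def fake_keyword_model(batch: list[str]) -> list[str]:
    return [text.split()[0] for text in batch]  # first word as the "keyword"


def fake_length_model(batch: list[str]) -> list[str]:
    return [str(len(text)) for text in batch]


jobs = {
    "keyword": (fake_keyword_model, ["  GPU  scheduling ", "Batch inference"]),
    "length": (fake_length_model, ["a  b", "ABC"]),
}

# Each batch runs in its own thread, mirroring the "plurality of threads" option.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {
        name: pool.submit(run_batch, model, inputs)
        for name, (model, inputs) in jobs.items()
    }
    results = {name: future.result() for name, future in futures.items()}
```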

The response transmission unit 140 transmits the response to an inference request, generated through the response generation batch 132, to the user terminal 200 or external server 300 that made the request.

The target to which the generated response is to be delivered may be included in the request information when the request is registered in the inference request queue, and the response transmission unit 140 may check that information and deliver the response to the requester.

In addition, the generated inference response, or the result of the response generation, may be delivered to the requesting user terminal 200 or external server 300 in various forms, such as a response message, a callback, or an event.
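As a non-limiting illustration, callback-based delivery that uses the reply target stored with the request could be sketched as follows. The names `QueuedRequest` and `ResponseSender` and the target labels are hypothetical, and the callbacks simply record what they receive instead of sending anything over a network.

```python
from dataclasses import dataclass
from typing import Callable

Callback = Callable[[str, str], None]  # (request_id, response_body)


@dataclass
class QueuedRequest:
    request_id: str
    reply_to: str  # delivery target recorded when the request was enqueued


class ResponseSender:
    def __init__(self) -> None:
        self._targets: dict[str, Callback] = {}

    def register_target(self, name: str, callback: Callback) -> None:
        self._targets[name] = callback

    def send(self, request: QueuedRequest, response: str) -> None:
        # Look up the requester from the request information and deliver to it.
        self._targets[request.reply_to](request.request_id, response)


delivered: list[tuple[str, str, str]] = []
sender = ResponseSender()
sender.register_target(
    "terminal-200", lambda rid, body: delivered.append(("terminal-200", rid, body))
)
sender.register_target(
    "server-300", lambda rid, body: delivered.append(("server-300", rid, body))
)
sender.send(QueuedRequest("req-1", "server-300"), "summary text")
sender.send(QueuedRequest("req-2", "terminal-200"), "keywords")
```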

FIG. 3 is a diagram showing the hardware structure of the batch scheduling device 100 for generating inference responses based on multiple deep learning models using multi-GPU according to an embodiment of the present invention.

Referring to FIG. 3, the hardware structure of the batch scheduling device 100 for generating deep learning model-based inference responses includes a central processing unit 1000, a memory 2000, a user interface 3000, a database interface 4000, a network interface 5000, a web server 6000, and the like.

The user interface 3000 provides input and output interfaces to the user by using a graphical user interface (GUI).

The database interface 4000 provides an interface between the database and the hardware structure.

The network interface 5000 provides network connections between devices owned by users.

The web server 6000 provides a means for users to access the hardware structure through a network. Most users can connect to the web server remotely and use the batch scheduling device 100 for generating deep learning model-based inference responses.

Each step of the configuration or method described above may be implemented as computer-readable code on a computer-readable recording medium or transmitted through a transmission medium. A computer-readable recording medium is a data storage device capable of storing data that can be read by a computer system.

The graphics processing unit 7000 is a computing device that processes graphics operations quickly and outputs the results, and it processes the large-scale computations of artificial intelligence in parallel.

Examples of computer-readable recording media include, but are not limited to, databases, ROM, RAM, CD-ROM, DVD, magnetic tape, floppy disks, and optical data storage devices. Transmission media may include carrier waves transmitted over the Internet or various types of communication channels. The computer-readable recording medium may also be distributed over network-coupled computer systems so that the computer-readable code is stored and executed in a distributed manner.

In addition, at least one of the components applied to the present invention may include, or be implemented by, a processor such as a central processing unit (CPU) or microprocessor and a graphics processing unit (GPU) that perform the respective functions, and two or more of the components may be combined into a single component that performs all the operations or functions of the combined components. Some of the functions of at least one of the components applied to the present invention may also be performed by another of these components. Communication between the components may be performed through a bus (not shown).

FIG. 4 is a flowchart of a batch scheduling method for generating inference responses based on multiple deep learning models using multi-GPU according to an embodiment of the present invention, and FIG. 5 is a detailed flowchart of the inference processing order and GPU selection (S430) in that batch scheduling method.

이하, 도 4 내지 도 5를 참조하여 멀티 GPU를 사용한 다수의 딥러닝 모델 기반 추론 응답 생성을 위한 배치 스케줄링 방법을 설명한다.Hereinafter, a batch scheduling method for generating inference responses based on multiple deep learning models using multi-GPU will be described with reference to FIGS. 4 and 5.

First, the batch scheduling device 100 for generating deep learning model-based inference responses performs a request receiving step (S410) of receiving a deep learning inference request from at least one of the user terminal 200 and the external server 300.

At this time, requests for generating inference responses for one or more different types of deep learning models may be received.

For example, when the batch scheduling device 100 for generating deep learning model-based inference responses provides inference services for a generative summarization model, a document classification model, and a keyword generation model, inference requests for those models may be received through the request receiving unit 110.

Thereafter, a request classification step (S420) is performed in which the received inference response generation request is checked to determine its request type, classified, and registered in an inference request queue.

Received requests are classified into a plurality of inference request queues by the request classification unit 120. The inference request queues are created in the same number as the types of inference request, that is, the number of deep learning models, and each queue is mapped to and managed together with the deep learning model that performs the inference.

In addition, each inference request queue may include, as configuration information, the GPU memory information required for the mapped deep learning model to generate an inference response.

The GPU memory information of each model is used when selecting a GPU in the scheduling step (S430).

For example, when inference services are provided for a generative summarization model, a document classification model, and a keyword generation model, each inference request is registered in the generative summarization request Queue, the document classification request Queue, or the keyword generation request Queue.

An inference response generation request may internally include an inference model type that specifies the type of the request, that is, the inference model; the inference model type is checked, and the request is classified and registered in the Queue of the corresponding model.
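The request classification step (S420) can be illustrated with a minimal sketch. The class names (`InferenceRequest`, `ModelQueue`, `RequestClassifier`), the model type strings, and the memory figures are hypothetical and chosen only for illustration; the specification states only that one queue exists per model, that each queue carries the GPU memory information of its model, and that requests are routed by their inference model type.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict


@dataclass
class InferenceRequest:
    model_type: str      # inference model type carried inside the request
    payload: str         # input text for the model (hypothetical field)
    request_time: float  # arrival time in seconds (hypothetical field)


@dataclass
class ModelQueue:
    # Configuration information: GPU memory needed to generate one response.
    gpu_memory_per_request_mb: int
    requests: Deque[InferenceRequest] = field(default_factory=deque)


class RequestClassifier:
    """Keeps one queue per deep learning model, mapped 1:1 by model type."""

    def __init__(self, memory_by_model: Dict[str, int]) -> None:
        self.queues = {m: ModelQueue(mem) for m, mem in memory_by_model.items()}

    def classify(self, request: InferenceRequest) -> None:
        # Route the request to the queue of the model named in the request.
        if request.model_type not in self.queues:
            raise KeyError(f"no queue for model type {request.model_type!r}")
        self.queues[request.model_type].requests.append(request)


# Assumed example values: three models with illustrative memory settings (MB).
classifier = RequestClassifier({"summary": 4000, "classification": 1500, "keyword": 2000})
classifier.classify(InferenceRequest("summary", "doc A", 0.0))
classifier.classify(InferenceRequest("keyword", "doc B", 0.1))
classifier.classify(InferenceRequest("summary", "doc C", 0.2))
```

Because each queue holds requests in arrival order, the scheduler can later collect them in the order requested without re-sorting.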

Next, the batch scheduling device 100 for generating deep learning model-based inference responses performs a scheduling step (S430) in which the batch scheduler 131 periodically checks each inference request queue and determines the processing order and the GPU for the inference response generation requests according to at least one of the GPU memory size required to process the requests, the request time, the per-model priority, and the per-model maximum waiting time.

The batch scheduler 131 periodically checks at least one of the inference request queues and calculates the maximum GPU memory required to process the inference response generation requests registered in each inference request queue (S431).

For example, when requests are registered in the generative summarization request Queue, the document classification request Queue, and the keyword generation request Queue, the request times registered in each Queue are checked, the inference response generation requests are collected in the order requested, and the maximum GPU memory required to process those requests is calculated by the formula below.

Maximum GPU memory required to process requests = number of response generation requests * GPU memory required to generate a response with the corresponding model

In addition, whether an allocable GPU exists is checked based on the calculated maximum GPU memory required to process the requests.
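The calculation in S431 and the check for allocable GPUs can be sketched as follows. The function names, the megabyte unit, and the free-memory figures are assumptions for illustration; only the product of request count and per-response memory comes from the formula above.

```python
from typing import Dict, List


def max_gpu_memory_mb(request_count: int, memory_per_request_mb: int) -> int:
    # Formula from the description: number of requests * memory per response.
    return request_count * memory_per_request_mb


def allocatable_gpus(required_mb: int, free_mb_by_gpu: Dict[int, int]) -> List[int]:
    # A GPU is treated as allocable when its free memory covers the requirement.
    return [gpu for gpu, free in sorted(free_mb_by_gpu.items()) if free >= required_mb]


# Assumed example: 3 queued requests for a model needing 2000 MB per response.
required = max_gpu_memory_mb(request_count=3, memory_per_request_mb=2000)
candidates = allocatable_gpus(required, {0: 4000, 1: 8000, 2: 16000})
print(required, candidates)  # prints: 6000 [1, 2]
```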

Thereafter, the processing order of the inference response generation requests is determined according to at least one of the GPU memory size required to process the requests, the request time, the per-model priority, and the per-model maximum waiting time (S432).

At this time, when inference response generation requests with the same request time exist, their processing order may be determined according to the per-model priority and the maximum waiting time of each deep learning model.

The per-model priority may be set in advance by an operator, and the maximum waiting time may be calculated from the average inference time of each deep learning model and the number of response generation requests to be processed in a batch.
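One possible reading of the ordering rule in S432 is sketched below. Several points are assumptions not fixed by the description: the maximum waiting time is taken as average inference time multiplied by batch size (the text names these two inputs but gives no formula), a smaller priority number is treated as more urgent, and ties on request time are broken first by priority and then by the shorter maximum waiting time. The `QueueSnapshot` type and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QueueSnapshot:
    model: str
    oldest_request_time: float  # request time of the earliest queued request
    priority: int               # operator-set; assumed: lower number = more urgent
    avg_inference_s: float      # average inference time of the model
    batch_size: int             # number of requests to be run in one batch

    @property
    def max_wait_s(self) -> float:
        # Assumed formula: average inference time * batch size.
        return self.avg_inference_s * self.batch_size


def processing_order(snapshots: List[QueueSnapshot]) -> List[str]:
    # Earlier request time first; equal times fall back to priority, then max wait.
    ranked = sorted(
        snapshots,
        key=lambda s: (s.oldest_request_time, s.priority, s.max_wait_s),
    )
    return [s.model for s in ranked]


order = processing_order([
    QueueSnapshot("summary", 10.0, priority=2, avg_inference_s=1.5, batch_size=4),
    QueueSnapshot("classification", 10.0, priority=1, avg_inference_s=0.2, batch_size=8),
    QueueSnapshot("keyword", 9.0, priority=3, avg_inference_s=0.5, batch_size=4),
])
print(order)  # prints: ['keyword', 'classification', 'summary']
```

The keyword queue runs first because its request is oldest; the other two share a request time, so the operator-set priority decides between them.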

Once the processing order of the inference response generation requests is determined, the GPU that will process the requests is determined according to the previously calculated maximum GPU memory (S433).

The batch scheduling device 100 for generating deep learning model-based inference responses may include a plurality of GPUs and may process inference response generation in parallel using each GPU.

Accordingly, the available memory of each GPU is checked to select the GPU that will process the inference response generation.
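GPU selection in S433 can be sketched as a simple fit test against each GPU's available memory. The description does not say which GPU is preferred when several have enough memory; the best-fit policy used here (the smallest sufficient free memory) is an assumption, as is the `select_gpu` name.

```python
from typing import Dict, Optional


def select_gpu(required_mb: int, free_mb_by_gpu: Dict[int, int]) -> Optional[int]:
    # Keep only GPUs whose available memory covers the calculated maximum.
    fitting = [(free, gpu) for gpu, free in free_mb_by_gpu.items() if free >= required_mb]
    if not fitting:
        return None  # no allocable GPU in this scheduling cycle
    # Assumed policy: best fit, leaving larger GPUs free for larger batches.
    return min(fitting)[1]


print(select_gpu(6000, {0: 4000, 1: 8000, 2: 16000}))   # prints: 1
print(select_gpu(20000, {0: 4000, 1: 8000, 2: 16000}))  # prints: None
```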

After step S430, a batch execution step (S440) is performed in which the deep learning model required for generating the inference response is loaded on the selected GPU and the batch is executed.

At this time, when the deep learning model to be executed is already loaded on the GPU selected for generating the inference response, the response generation batch 132 is executed to process the inference response generation. When a deep learning model different from the one to be executed is loaded, or when no deep learning model is loaded, the deep learning model required for generating the inference response is newly loaded and the response generation batch 132 is then executed.

A plurality of response generation batches 132 may be executed so that inference response generation is performed in parallel.

In addition, a response generation batch 132 may generate responses to a plurality of inference requests for one model.
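The load-or-reuse behavior of the batch execution step (S440) can be sketched as follows. `GpuWorker`, `run_batch`, and `fake_infer` are hypothetical names, and the model load is simulated by a counter rather than a real framework call; the sketch shows only that a model is loaded when the GPU holds a different model or none, and that one batch serves several requests for a single model.

```python
from typing import Callable, List, Optional


class GpuWorker:
    """Tracks which model is resident on one GPU (simulated, no real GPU calls)."""

    def __init__(self, gpu_id: int) -> None:
        self.gpu_id = gpu_id
        self.loaded_model: Optional[str] = None
        self.load_count = 0  # counts simulated model loads

    def run_batch(
        self, model: str, inputs: List[str], infer: Callable[[str, str], str]
    ) -> List[str]:
        # Load only when nothing is loaded or a different model is resident.
        if self.loaded_model != model:
            self.loaded_model = model
            self.load_count += 1
        # One response generation batch handles many requests for one model.
        return [infer(model, text) for text in inputs]


def fake_infer(model: str, text: str) -> str:
    # Stand-in for real inference; returns a tagged string.
    return f"{model}:{text}"


worker = GpuWorker(0)
first = worker.run_batch("summary", ["a", "b"], fake_infer)  # loads "summary"
second = worker.run_batch("summary", ["c"], fake_infer)      # reuses loaded model
third = worker.run_batch("keyword", ["d"], fake_infer)       # swaps to "keyword"
print(worker.load_count)  # prints: 2
```

Running one such worker per GPU would correspond to the parallel execution of several response generation batches described above.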

The generated inference response, or the result of the response generation, may be delivered to the requesting user terminal 200 or external server 300 in various forms such as a response message, a callback, or an event.

100: Batch scheduling device for generating deep learning model-based inference responses
110: Request receiving unit
120: Request classification unit
130: Scheduling unit
131: Batch scheduler
132: Response generation batch
133: Resource calculation unit
140: Response transmission unit
200: User terminal
300: External server

Claims

In a batch scheduling device for generating inference responses based on deep learning models, a request receiving step of receiving a request for generating an inference response of a deep learning model from at least one of an external server and a user terminal;
a request classification step of classifying the inference response generation request and registering it in an inference request queue;
a scheduling step of periodically checking the inference request queue to determine a processing order and a GPU for inference response generation requests; and
a batch execution step of loading, on the selected GPU, the deep learning model required for generating the inference response and executing a batch,
wherein the request classification step comprises
checking the inference response generation request, determining the deep learning model that is to perform inference, and registering the request in the inference request queue that handles inference response generation requests for that deep learning model,
wherein the inference request queues
are created in the same number as the deep learning models that perform different inferences, are mapped 1:1 to the deep learning models so that each queue manages the inference response generation requests for one inference model, and are each configured with the GPU memory information required for the mapped deep learning model to generate an inference response,
wherein the scheduling step comprises:
periodically checking the plurality of inference request queues, calculating the maximum GPU memory required to process the plurality of inference response generation requests registered in each inference request queue, and identifying allocable GPUs;
determining a processing order of the inference response generation requests; and
selecting a GPU to process the inference response generation requests according to the maximum GPU memory,
wherein, in the determining of the processing order of the inference response generation requests,
the processing order is determined according to the GPU memory size required to process the inference response generation requests, the request time, the per-model priority, and the per-model maximum waiting time,
wherein, in the batch execution step,
when a deep learning model different from the deep learning model to be executed is loaded on the GPU selected for generating the inference response, or when no deep learning model is loaded, the deep learning model required for generating the inference response is newly loaded and a response generation batch is executed, and
wherein the response generation batch
performs inference on a plurality of inference response generation requests for one deep learning model,
a batch scheduling method for generating inference responses based on deep learning models.

The method of claim 1,
wherein the request receiving step
receives requests for generating inference responses for one or more different types of deep learning models,
a batch scheduling method for generating inference responses based on deep learning models.

(Deleted)