KR20210065817A

KR20210065817A - Apparatus for Layer switching of Deep Learning Private Cloud Service

Info

Publication number: KR20210065817A
Application number: KR1020200037145A
Authority: KR
Inventors: 허대영; 최규연
Original assignee: 주식회사 가토랩
Priority date: 2019-11-27
Filing date: 2020-03-26
Publication date: 2021-06-04
Also published as: KR102341376B1; KR20210065818A; KR102341377B1

Abstract

Embodiments of the present invention disclose a layer switching device for a deep learning private cloud service. The device includes a compute node agent that communicates with a deep learning integration server and converts various types of physical device layers into an abstraction layer, and a task scheduler that finds an idle device group so that the operating environment set by a user can be combined with a device group registered in the deep learning integration server, and assigns the idle device group to the compute node agent according to the user's request. Because limited resources are allocated mutually exclusively, an independent development and execution environment can be configured without interference from other users.

Description

Apparatus for Layer switching of Deep Learning Private Cloud Service

The present invention relates to a private cloud service, and more particularly, to an apparatus for layer switching of a deep learning private cloud service.

The content described in this section merely provides background information on the present embodiment and does not constitute prior art.

A private cloud service means that a company builds its own cloud environment and uses it internally or makes it available to its affiliates. Because the infrastructure for the service is operated internally, behind the organization's firewall, without using the services of an external cloud provider, computing resources can be managed more efficiently. However, private cloud services are expensive because the highest level of security is applied.

Because a private cloud service does not depend on external parties, it can be configured freely, makes it easy to implement services optimized for the company's own business, and offers relatively high security. However, the initial investment cost is high, and the service loses utility unless a certain scale is secured.

The main object of the embodiments of the present invention is to distribute limited resources, such as GPGPUs and CPUs, in a deep-learning-based private cloud that can jointly and effectively utilize GPGPU systems owned by or accessible to individuals or organizations, so that independent development and execution environments can be configured without interference from other users and provided through container technology or cloud technology.

Other objects of the present invention that are not explicitly stated may additionally be considered within the scope that can be easily inferred from the following detailed description and its effects.

According to one aspect of the present embodiment, the present invention proposes a layer switching device for a private cloud service that includes a compute node agent which communicates with a deep learning integration server and converts various types of physical device layers into an abstraction layer.

According to another embodiment of the present invention, the present invention proposes a layer switching device for a private cloud service that includes a task scheduler which finds an idle device group so that the operating environment set by a user can be combined with a device group registered in the deep learning integration server, and assigns the idle device group to a compute node agent according to the user's request.

As described above, according to the embodiments of the present invention, limited resources are distributed mutually exclusively, so that an independent development and execution environment can be configured without interference from other users.

According to the embodiments of the present invention, a programming notebook interface is provided by default as an environment for programming in the AI field, and, for users who need a general development environment for development projects above a certain size, an environment is provided in which a development environment (IDE) released as open source by Microsoft can be used online.

Effects that are not explicitly mentioned here but are described in the following specification as expected from the technical features of the present invention, together with their provisional effects, are treated as if described in the specification of the present invention.

FIG. 1 is a flowchart illustrating a layer switching method performed by a layer switching device of a deep learning private cloud service according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an abstraction layer of a deep learning private cloud service according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating the processing of a compute node agent of a deep learning private cloud service according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating registration of a new device group by a compute node agent of a deep learning private cloud service according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating operation container synchronization by a compute node agent of a deep learning private cloud service according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating agent state control by a compute node agent of a deep learning private cloud service according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating an operation container control procedure in a compute node agent of a deep learning private cloud service according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating the operation of a deep learning task scheduler of a deep learning private cloud service according to an embodiment of the present invention.
FIG. 9 is a flowchart illustrating the processing of a deep learning task scheduler of a deep learning private cloud service according to an embodiment of the present invention.
FIG. 10 is a diagram illustrating a web service structure of a deep learning private cloud service according to an embodiment of the present invention.
FIG. 11 is a diagram illustrating a conceptual structure of a deep learning private cloud service according to an embodiment of the present invention.
FIG. 12 is a block diagram illustrating a computing environment including a computing device suitable for use in the embodiments.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods for achieving them, will become apparent with reference to the embodiments described below in detail in conjunction with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of the scope of the invention, which is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the meaning commonly understood by those of ordinary skill in the art to which the present invention belongs. In addition, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are clearly and specifically defined.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly dictates otherwise. In the present application, terms such as "comprise" or "have" are intended to specify that a feature, number, step, operation, component, part, or combination thereof described in the specification exists, and should be understood as not precluding the existence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Terms including ordinal numbers, such as first and second, may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a second component may be referred to as a first component, and similarly, a first component may be referred to as a second component. The term "and/or" includes a combination of a plurality of related listed items or any one of a plurality of related listed items.

FIG. 1 is a flowchart illustrating a layer switching method performed by a layer switching device of a deep learning private cloud service according to an embodiment of the present invention.

The deep learning private cloud service makes it possible to share, effectively, GPGPU systems owned by or accessible to individuals or organizations. For users ranging from data scientists who want to use GPGPUs to highly skilled IT developers, the service distributes limited resources such as GPGPUs and central processing units (CPUs) mutually exclusively, and uses container technology and cloud technology to let each user configure an independent development and execution environment at any time without interference from other users. Here, a data scientist means a person who has majored in a field related to data science and works in data analysis.

GPGPU (General-Purpose computing on Graphics Processing Units) refers to general-purpose computation on a GPU. It is a technique in which the graphics processing unit (GPU), which traditionally handled only computation for computer graphics, is used for the application computations traditionally handled by the central processing unit (CPU).

The deep learning private cloud service 10 provides a programming notebook interface by default as an environment for programming in the AI field, which has recently attracted attention. For users who need a general development environment for development projects above a certain size, it provides an environment in which a development environment (IDE) released as open source by Microsoft can be used online.

The layer switching device 30 of the deep learning private cloud service includes a compute node agent 110 and a task scheduler 120.

A layer switching method performed by the layer switching device 30 of the deep learning private cloud service will be described with reference to FIG. 1.

The layer switching method of the private cloud service includes a step (S110) in which the task scheduler finds an idle device group so that the operating environment set by the user can be combined with a device group registered in the deep learning integration server and assigns the idle device group to a compute node agent according to the user's request, and a step (S120) of communicating with the deep learning integration server and converting, by the compute node agent, various types of physical device layers into an abstraction layer.

Although FIG. 1 describes the steps as being executed sequentially, this is merely illustrative. Those skilled in the art may modify and vary the method in various ways, by changing the order described in FIG. 1, executing one or more steps in parallel, or adding other steps, without departing from the essential characteristics of the embodiment of the present invention.

The layer switching method of the private cloud service is performed by the layer switching device 30 of the private cloud service. The operations performed by the layer switching device 30 are described below.

The layer switching device 30 of the private cloud service may perform layer switching by means of the compute node agent 110 and the task scheduler 120.

The compute node agent 110 may communicate with the deep learning integration server 300 and convert various types of physical device layers 100 into the abstraction layer 200.

The task scheduler 120 may find an idle device group so that the operating environment set by the user can be combined with a device group registered in the deep learning integration server 300, and assign the idle device group to the compute node agent 110 according to the user's request.

The physical device layer 100 may use a plurality of servers over a network.

The abstraction layer 200 simplifies the physical device layer into conceptual devices divided into a plurality of device groups 210, a plurality of storage volumes 230, and a plurality of operation containers 240, and uses these simplified conceptual devices as the basis for providing services to users.

Each of the plurality of device groups 210 may include a plurality of computation cores 220 that perform computation tasks.

In the plurality of device groups 210, each of the computation cores 220 may be separated and combined with an operation container 240 individually, or some of the computation cores 220 may be grouped and combined with an operation container 240.

The computation cores 220 may include a general-purpose computation core, which carries at least one of manufacturer information, a core performance rating, or information on the computation technology the core provides, and a deep learning computation core, which carries at least one of manufacturer information, model information of the plurality of servers, core performance rating information, or computation technology information, but are not necessarily limited thereto.

The storage volume 230 stores data that the user uses or generates, and may be combined with and shared by at least one operation container 240.

The operation container 240 may provide a user environment in which an operating system available to the user, system software, and application software are preconfigured.

The operation container 240 is combined with at least one computation core 220 and at least one storage volume 230, and exists in either an image state or an active state. The user may select an operation container in the image state and activate it.

When the user deactivates an operation container 240, the deactivated operation container is removed. When the user reactivates it, a new operation container is created by reassembling the previously used combination.
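
The container lifecycle described above (deactivation removes the container, reactivation builds a new one from the previously used combination) can be illustrated with a minimal Python sketch. The class names `Binding`, `OperationContainer`, and `ContainerManager`, and the idea of keying state by user, are assumptions made for illustration; the patent does not specify data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    """Hypothetical record of one combination: image + cores + volumes."""
    image: str
    core_ids: tuple
    volume_ids: tuple


@dataclass
class OperationContainer:
    """An active operation container created from a binding."""
    binding: Binding


class ContainerManager:
    """Tracks active containers and the last combination each user used."""

    def __init__(self):
        self.active = {}        # user -> OperationContainer
        self.last_binding = {}  # user -> Binding

    def activate(self, user, binding):
        # Activating an image-state container creates a fresh active container.
        container = OperationContainer(binding)
        self.active[user] = container
        self.last_binding[user] = binding
        return container

    def deactivate(self, user):
        # A deactivated container is removed, not paused.
        self.active.pop(user, None)

    def reactivate(self, user):
        # Reactivation reassembles the previous combination into a NEW container.
        return self.activate(user, self.last_binding[user])
```

Keeping only the binding, rather than the container itself, mirrors the text: what survives deactivation is the combination, not the running instance.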

The compute node agent 110 may collect the CPU and GPGPU information included in the physical device layer 100, convert each item into computation core 220 information, and register it with the deep learning integration server 300 as one device group 210. Specifically, when registering the device group 210, the compute node agent 110 is given the address of the deep learning integration server 300 and the security token required for registration, and the users who can access the device group 210 may be restricted according to the security token.

The compute node agent 110 may periodically query the CPU and GPGPU information included in the physical device layer, update the computation core information, and report the updated information to the deep learning integration server. Specifically, when a new CPU or GPGPU is added, the compute node agent 110 detects it and adds it to the device group 210 it manages; when an existing CPU or GPGPU is removed, the agent detects the removal and deletes it from the device group 210, thereby updating the information of the device group 210.
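
The periodic update amounts to a set difference between the devices the agent already knows and the devices it currently detects. The following Python sketch shows that comparison only; the function name `refresh_device_group` and the use of plain string identifiers for devices are assumptions, and how devices are actually probed or reported is not covered by the source.

```python
def refresh_device_group(known, detected):
    """Compare the known device IDs with the newly detected ones.

    Returns (updated, added, removed) as sorted lists, where `updated`
    is the device group membership after the refresh.
    """
    known_set, detected_set = set(known), set(detected)
    added = sorted(detected_set - known_set)    # newly attached CPUs/GPGPUs
    removed = sorted(known_set - detected_set)  # devices that disappeared
    updated = sorted(detected_set)
    return updated, added, removed
```

For example, calling `refresh_device_group(["cpu0", "gpu0"], ["cpu0", "gpu1"])` reports `gpu1` as added and `gpu0` as removed.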

When activating an operation container 240 of the abstraction layer 200, the compute node agent 110 searches the operation container repository for the latest operation container and synchronizes the operation container 240. If the existing operation container and the latest one are the same, the existing one is reused; if they differ, the existing one is left as is and a new, latest operation container is formed. Specifically, once all existing active operation containers 112 combined with the device group 210 have been deactivated, the compute node agent 110 may remove those existing active operation containers 112 from the physical device layer 100.
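
The reuse-or-create decision in the synchronization step can be sketched as follows. This is an illustrative Python sketch under the assumption that container versions are compared by some opaque version string (called a digest here); the source says only that the existing and latest containers are compared, not how.

```python
def synchronize(local_images, name, latest_digest):
    """Decide whether to reuse or create an operation container image.

    local_images maps a container name to the list of version digests
    already present on the node (hypothetical representation).
    """
    digests = local_images.setdefault(name, [])
    if latest_digest in digests:
        # Existing and latest are the same: reuse the existing one.
        return "reused"
    # They differ: keep the existing versions untouched and add the latest.
    digests.append(latest_digest)
    return "created"
```

Older versions are deliberately not deleted here, since the text states that they are left in place and removed only after every active container using them has been deactivated.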

The compute node agent 110 receives a request for user environment information from the deep learning integration server, receives the request information accompanying that request, analyzes it, and combines and activates an operation container 240, a storage volume 230, and a device group 210 of the abstraction layer 200 to create an active operation container 112. Specifically, the compute node agent 110 forwards all information generated during activation to the deep learning integration server, removes the active operation container 112 in which a problem occurred if a problem occurs, and may initialize the active services of the operation container according to the type of job the user requested.

For example, for an interactive job, the compute node agent 110 may activate, in the operation container 240, a service providing a personal notebook interface and a service providing an online integrated development environment interface. When an interactive job comes with a request for class or team use, the compute node agent 110 organizes and reconfigures a service providing a notebook hub interface in place of the service providing a personal notebook interface, and may start with the online integrated development environment interface disabled. For a batch job, the compute node agent 110 may check the operating state of the batch job, which communicates directly with the deep learning integration server 300, and provide a secure terminal service through which the user can connect and check the status.
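
The three cases in the example above can be summarized as a small lookup. The Python sketch below is illustrative only: the service keys (`personal_notebook`, `notebook_hub`, `online_ide`, `batch_monitor`, `secure_terminal`) and the `shared` flag are hypothetical names chosen to mirror the text, not identifiers from the source.

```python
def services_for(job_type, shared=False):
    """Return which services start enabled for a requested job type.

    job_type: "interactive" or "batch".
    shared:   True when an interactive job is requested for a class or team.
    """
    if job_type == "batch":
        return {"batch_monitor": True, "secure_terminal": True}
    if job_type == "interactive":
        if shared:
            # The notebook hub replaces the personal notebook; the IDE starts disabled.
            return {"notebook_hub": True, "online_ide": False}
        return {"personal_notebook": True, "online_ide": True}
    raise ValueError(f"unknown job type: {job_type!r}")
```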

The compute node agent 110 may track the active operation containers 112 combined with the computation cores 220 of the device group 210, measure the load on the device group used by each active operation container 112, detect state changes, and update the information on the active operation containers 112 in the deep learning integration server 300.

The task scheduler 120 may find an idle device group so that the operating environment set by the user can be combined with a device group registered in the deep learning integration server 300, and assign the idle device group to the user's request.

The user's request may be an interactive job, in which work proceeds through the user's operations after computation cores 220 are assigned, or a batch job, in which the user's operations are defined in advance or an automated process is in place.

The task scheduler 120 may select the users to be scheduled; select, according to the priorities of the selected users, the prioritized job of the highest-priority user; select resources by reading the resources in order and comparing idle resources and resource groups to allocate resources to the job; and assign a compute node agent 110 to the selected job.

The task scheduler 120 selects the users waiting to be allocated resources, adjusts their priorities based on each user's history of exclusive resource use, and, when users have the same history score, gives higher priority to the user who requested resource allocation first.

The task scheduler 120 may apply tickets and ticket debt to compute the score based on user history.

Tickets are initialized to a fixed value and may decrease each time resources are used, according to a resource usage criterion.

Ticket debt may increase each time the resource usage criterion is reached while the user continues to use resources after all tickets have been exhausted.
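
The ticket and ticket-debt bookkeeping can be sketched as a simple counter pair. In this Python sketch, the class name `TicketAccount`, the initial value of 10, and the notion of charging one "usage unit" each time the resource usage criterion is reached are all assumptions; the source does not give concrete values or units.

```python
from dataclasses import dataclass


@dataclass
class TicketAccount:
    """Per-user ticket balance (initial value of 10 is an assumed example)."""
    tickets: int = 10
    debt: int = 0

    def charge(self, units=1):
        """Record `units` occurrences of reaching the resource usage criterion."""
        for _ in range(units):
            if self.tickets > 0:
                self.tickets -= 1  # tickets are consumed first
            else:
                self.debt += 1     # usage beyond the tickets accrues debt
```

With this model a user who starts with two tickets and is charged three times ends with zero tickets and one unit of ticket debt.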

As for user priority, a user who holds tickets has the highest priority. A user with no tickets and with ticket debt has the next priority if the ticket debt level is below a preset warning level, and the lowest priority if the ticket debt level is at or above the preset warning level. When resources run short, the resources of such a lowest-priority user may be forcibly released.
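
The three priority tiers, combined with the first-come tie-break among users with equal scores, can be sketched as a sort key. The Python below is a sketch under stated assumptions: the dictionary fields (`name`, `tickets`, `debt`, `requested_at`), the numeric tier values, and the treatment of a user with zero tickets and zero debt as the middle tier are illustrative choices, not details given in the source.

```python
def priority_tier(tickets, debt, warning_level):
    """Map a user's ticket state to a tier; lower numbers schedule first."""
    if tickets > 0:
        return 0  # holds tickets: highest priority
    if debt < warning_level:
        return 1  # no tickets, debt below the warning level: next priority
    return 2      # debt at or above the warning level: lowest priority


def order_waiting_users(users, warning_level):
    """Sort waiting users by tier, then by earliest resource request."""
    return sorted(
        users,
        key=lambda u: (
            priority_tier(u["tickets"], u["debt"], warning_level),
            u["requested_at"],
        ),
    )


def preemption_candidates(users, warning_level):
    """Users whose resources may be forcibly released when resources run short."""
    return [
        u["name"]
        for u in users
        if priority_tier(u["tickets"], u["debt"], warning_level) == 2
    ]
```

Using a tuple as the sort key keeps the tie-break explicit: two users in the same tier are ordered purely by when they asked for resources.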

프라이빗 클라우드 서비스의 계층 전환 장치(30)는 딥러닝 통합 서버(300)에 통신하여 작업 스케줄러(120)에 의해 배정된 계산 노드 에이전트(110)를 통해 다양한 형태의 물리 장치 계층(100)을 추상화 계층(200)으로 전환할 수 있다.The layer switching device 30 of the private cloud service communicates with the deep learning integration server 300 to abstract the physical device layer 100 of various types through the compute node agent 110 assigned by the task scheduler 120 . (200) can be converted.

딥러닝 프라이빗 클라우드 서비스를 구성하는 계층에 대해서는 도 2를 참조하여 자세히 설명한다.Layers constituting the deep learning private cloud service will be described in detail with reference to FIG. 2 .

도 2는 본 발명의 일 실시예에 따른 딥러닝 프라이빗 클라우드 서비스의 추상화 계층을 나타내는 도면이다.2 is a diagram illustrating an abstraction layer of a deep learning private cloud service according to an embodiment of the present invention.

딥러닝 프라이빗 클라우드 서비스(10)는 도 11과 같은 구조를 가지며, 크게 물리 장치 계층(100)과 물리 장치 계층(100)을 개념화하는 추상화 계층(200), 이들 추상 계층 개념을 이용하여 다양한 물리 장치를 통합 및 조합하여 사용자가 사용할 수 있도록 여러 가지 서비스를 제공하는 서비스 계층으로 구분한다. 다음은 각 계층에 대한 개념에 관해서 설명한다.The deep learning private cloud service 10 has a structure as shown in FIG. 11 , the abstraction layer 200 conceptualizing the physical device layer 100 and the physical device layer 100 , and various physical devices using these abstraction layer concepts It is divided into a service layer that provides various services for users to use by integrating and combining them. The following describes the concept of each layer.

물리 장치 계층(100)은 물리 장치 처리부이며, 추상화 계층(200)은 추상화 처리부이고, 서비스 계층은 서비스 처리부이다. 여기서, 서비스 계층은 딥러닝 통합 서버(300)로 구현될 수 있으며, 반드시 이에 한정되는 것은 아니다.The physical device layer 100 is a physical device processing unit, the abstraction layer 200 is an abstraction processing unit, and the service layer is a service processing unit. Here, the service layer may be implemented as the deep learning integrated server 300, but is not necessarily limited thereto.

The physical device layer 100 may be formed of a server for computation, a storage server for data storage, a storage server holding various container configuration information, a web server that lets users access the service through a web interface, and the like.

Specifically, the physical device layer 100 refers to the actual physical devices together with the operating system and software that make those devices usable over a network. Accordingly, the physical device layer 100 may also include virtual machines created with computer virtualization technology, provided that each has its own independent operating system and software. That is, the physical device layer 100 may also include public cloud systems that provide virtualization technology and various platform services.

The abstraction layer 200 simplifies the highly varied physical device layer 100 and, using the simplified conceptual devices, provides a basis for delivering services to users through various combinations and integrations. In the abstraction layer 200, the physical device layer 100 is divided into device groups 210, computation cores 220, storage volumes 230, and operation containers 240.

The abstraction layer 200 is described in detail with reference to FIG. 2.

A device group 210 is a set of computation cores 220. The computation cores 220 in a device group 210 can each be selected independently and combined with an operation container 240, or several of them can be combined with a single operation container 240. That is, computation cores 220 belonging to different device groups 210 cannot be included in the same operation container 240. Depending on the abstraction method, the ways in which computation cores in the same device group 210 may be combined can also be restricted. For example, the number that can be combined at once may be capped, such as at most four general-purpose cores or at most four deep learning cores.
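As an illustrative sketch only, the combination rule for computation cores could be checked as follows. The class name `ComputationCore`, the function `can_share_container`, the kind labels, and the constant `MAX_CORES_PER_KIND` are hypothetical names introduced for this example; the cap of four simply mirrors the example given in the text.

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative cap, taken from the "at most four cores" example in the text.
MAX_CORES_PER_KIND = 4


@dataclass(frozen=True)
class ComputationCore:
    core_id: str
    group_id: str  # device group the core belongs to
    kind: str      # assumed labels: "general" or "deep_learning"


def can_share_container(cores):
    """Return True if the cores may be combined with one operation container."""
    if not cores:
        return False
    # Cores from different device groups must not be mixed.
    if len({core.group_id for core in cores}) != 1:
        return False
    # Per-kind cap on how many cores can be combined at once.
    per_kind = Counter(core.kind for core in cores)
    return all(n <= MAX_CORES_PER_KIND for n in per_kind.values())
```

Under these assumptions, four deep learning cores from one device group pass the check, while a fifth deep learning core or a core from another device group causes it to fail.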

A computation core 220 is a device that can be isolated from other users' environments. A computation core 220 can be combined with an operation container 240 through various technologies, and all of the user's software running in the operation container 240 must use only the combined computation cores 220 and must not affect any other computation core 220.

In the deep learning private cloud service 10, computation cores 220 may be classified into general-purpose computation cores and deep learning computation cores, although the classification is not necessarily limited to these.

A general-purpose computation core corresponds to a core unit of a typical CPU chipset. It may include physical manufacturer information, a core performance rating, and information on the computation technology the core provides (such as its instruction set).

Deep learning computation cores include GPGPU devices. Like a general-purpose computation core, a deep learning computation core may include physical manufacturer information, physical device model information, a core performance rating, and computation technology information.

A storage volume 230 is a device that can hold the data used or generated by the user's software. One or more storage volumes 230 can be combined with the operation container 240 that the user will use. Storage volumes 230 are not part of a device group 210. Each storage volume 230 is bound to a designated user and a designated operation container 240. When a new operation container 240 is created, a new storage volume 230 is allocated.

Storage volumes 230 can be shared among operation containers 240. If a user wants to refer to an existing storage volume 230, access can be granted by sharing that storage volume 230.

An operation container 240 is a user environment in which the operating system, system software, and application software available to the user are preconfigured. An operation container 240 is either in the image state or in the active state, and the user selects an operation container 240 in the image state and activates it for use.

Referring to FIG. 2, activating an operation container 240 requires combining it with the computation cores 220 and storage volumes 230 described above. An active operation container 112 is one that is combined with one or more computation cores 220 and one or more storage volumes 230. When the user deactivates an operation container 240, that operation container 240 is removed completely. When the user wants to reactivate it, a new operation container 240 can be provided by reproducing the previously used combination.
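The activation, deactivation, and reactivation behavior can be sketched as a small in-memory model. This is a simplified illustration under stated assumptions, not the disclosed implementation: `Binding`, `ContainerLifecycle`, and the identifier format are hypothetical, and no real container runtime is involved.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Binding:
    image: str         # operation container in the image state
    core_ids: tuple    # combined computation cores
    volume_ids: tuple  # combined storage volumes


class ContainerLifecycle:
    def __init__(self):
        self.active = {}        # container id -> Binding
        self.last_binding = {}  # user -> most recent Binding
        self._ids = count(1)

    def activate(self, user, image, core_ids, volume_ids):
        # Activation needs at least one core and at least one volume.
        if not core_ids or not volume_ids:
            raise ValueError("activation needs at least one core and one volume")
        binding = Binding(image, tuple(core_ids), tuple(volume_ids))
        container_id = f"container-{next(self._ids)}"
        self.active[container_id] = binding
        self.last_binding[user] = binding
        return container_id

    def deactivate(self, container_id):
        # Deactivation removes the active container completely.
        del self.active[container_id]

    def reactivate(self, user):
        # Reactivation yields a new container that reproduces the old binding.
        b = self.last_binding[user]
        return self.activate(user, b.image, b.core_ids, b.volume_ids)
```

In this sketch, reactivation returns a new container identifier while the image, cores, and volumes match the earlier combination.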

The compute node agent is described in detail with reference to FIGS. 3 to 7.

FIG. 3 is a flowchart illustrating the processing of the compute node agent of a deep learning private cloud service according to an embodiment of the present invention.

The compute node agent 110 is a service that converts the various devices in the physical device layer 100 into the abstraction layer 200. The compute node agent 110 periodically communicates with the integrated server of the deep learning private cloud 10 and handles the following tasks.

When the compute node agent starts (S310), it updates device group information (S320), manages operation containers (S330), and performs communication (S340).

A new device group can be registered before the device group information is updated; this is described in detail with reference to FIG. 4.

FIG. 4 is a diagram illustrating registration of a new device group by the compute node agent of a deep learning private cloud service according to an embodiment of the present invention.

The compute node agent 110 collects the CPU and GPGPU information of a single machine (server), converts each item into computation core information, and registers the result as one device group with the deep learning private cloud integrated server 300.

At registration, the compute node agent 110 must be given the address of the deep learning private cloud integrated server 300 to register with and the security token required for registration. Depending on the security token used, the users who can access this device group may be restricted.

Updating the device group information (S320) queries CPU and GPGPU device information (S322) and then suspends querying for N seconds (S324). Here, N seconds may be a time set by the user, and the query (S322) and the N-second pause (S324) can be performed repeatedly.

When the query of CPU and GPGPU device information (S322) finishes, the update step (S320) can pass that information on for communication (S340).

Through system queries, the compute node agent 110 periodically re-reads the CPU and GPGPU information of the physical device and updates the existing computation core information. The updated information is reported to the deep learning private cloud integrated server 300 so that it always stays current.

If a device administrator adds a new CPU or GPGPU, the agent detects it and adds it to the device group information it manages. Conversely, when an existing CPU or GPGPU is removed, the device group information is updated through the same procedure as for an addition.
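The query-and-pause cycle (S322, S324) and the detection of added or removed devices can be sketched as below. This is a minimal sketch under stated assumptions: `query_devices` and `report` are hypothetical callbacks standing in for the system query and for the report to the integrated server, devices are represented by plain string identifiers, and the loop runs a fixed number of cycles so that it terminates.

```python
import time


def diff_devices(known, current):
    """Return (added, removed) device identifiers between two snapshots."""
    added = sorted(set(current) - set(known))
    removed = sorted(set(known) - set(current))
    return added, removed


def refresh_device_group(query_devices, report, interval_s, cycles, sleep=time.sleep):
    """Repeat the query (S322) and the N-second pause (S324) for `cycles` rounds."""
    known = set()
    for _ in range(cycles):
        current = set(query_devices())            # S322: read CPU/GPGPU info
        added, removed = diff_devices(known, current)
        report(sorted(current), added, removed)   # keep the server up to date
        known = current
        sleep(interval_s)                         # S324: pause for N seconds
    return known
```

Passing `sleep` as a parameter is a design choice for this sketch only; it lets the loop be exercised without real waiting.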

Managing operation containers (S330) controls containers (S332) and synchronizes container information (S334). Container control (S332) and container information synchronization (S334) are performed repeatedly, and the resulting information can be passed on for communication (S340).

Communication (S340) can register the CPU and GPGPU device information, the container control information, and the container synchronization information with the deep learning integrated server (S350).

Operation container synchronization (S334) is described in detail with reference to FIG. 5. FIG. 5 is a diagram illustrating operation container synchronization by the compute node agent of a deep learning private cloud service according to an embodiment of the present invention.

When activating an operation container 240, the compute node agent 110 always looks up the latest operation container in the operation container repository formed in the inactive operation containers 114, as shown in FIG. 5. If the previously used operation container and the latest one are the same, the agent does not download from the repository and reuses the existing operation container. If they differ, the agent leaves the existing operation container in place and downloads the new one. In this case, two or more different versions of the same operation container may coexist; these are the first operation container 242 and the second operation container 244.
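The reuse-or-download decision can be illustrated with a short sketch. The function name `sync_operation_container` and the representation of local images as a mapping from container name to a list of version strings are assumptions made for this example; the actual download is left out.

```python
def sync_operation_container(local_versions, name, latest_version):
    """Decide whether the latest version must be downloaded.

    local_versions maps a container name to the versions already held locally.
    Returns (updated mapping, download_needed).
    """
    versions = local_versions.get(name, [])
    if latest_version in versions:
        # Same as a version already held: reuse it, no download.
        return local_versions, False
    # Different: keep the existing versions and add the new one alongside them.
    updated = dict(local_versions)
    updated[name] = versions + [latest_version]
    return updated, True
```

After a download, the mapping holds both the older and the newer version of the same container, which corresponds to the coexisting first and second operation containers described above.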

The compute node agent 110 removes an operation container from the physical device once all existing active operation containers 112 combined with the device group 210 have been deactivated.

Container control (S332) is described in detail with reference to FIG. 6. FIG. 6 is a diagram illustrating agent state control by the compute node agent of a deep learning private cloud service according to an embodiment of the present invention.

As shown in FIG. 6, the compute node agent 110 receives requests to configure user environment information 310 from the deep learning private cloud integrated scheduler. When a new request for user environment information 310 occurs, the agent receives the request information, analyzes it, and promptly combines and activates the operation container 240, the storage volume 230, and the computation cores 220 of the device group 210 it manages. All information generated during activation is forwarded to the deep learning private cloud integrated server 300, and if a problem occurs, the affected active operation container 112 is removed.

When a user deactivates an active operation container 112, the deep learning private cloud integrated server 300 dispatches the event to the compute node agent 110 that manages that container, and the compute node agent 110 removes the active operation container 112.

The container control (S332) procedure is described in detail with reference to FIG. 7. FIG. 7 is a flowchart illustrating the operation container control procedure in the compute node agent of a deep learning private cloud service according to an embodiment of the present invention.

The compute node agent 110 initializes the active services of the operation container 240 differently depending on the type of job the user requested. For an interactive job, it activates a service that provides a personal notebook interface and a service that provides an online integrated development environment interface in the operation container 240. If the job is interactive and is also requested for use by a class or a team, the agent sets up and reconfigures a service that provides a notebook hub interface in place of the personal notebook service, and starts with the online integrated development environment interface deactivated. If the job must be configured as a batch job, only a secure terminal service that can communicate directly with the deep learning integrated server 300 is provided. This terminal service is used so that the user can connect and check status, for example to inspect how the batch job is running.
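The mapping from job type to initially active services can be summarized in a short sketch. The function name `initial_services`, the job type strings, and the service keys are hypothetical labels chosen for this example.

```python
def initial_services(job_type, shared=False):
    """Return a mapping of service name to whether it starts activated."""
    if job_type == "interactive":
        if shared:
            # Class or team use: notebook hub replaces the personal notebook,
            # and the online IDE starts deactivated.
            return {"notebook_hub": True, "online_ide": False}
        return {"notebook": True, "online_ide": True}
    if job_type == "batch":
        # Batch jobs expose only a secure terminal for status checks.
        return {"secure_terminal": True}
    raise ValueError(f"unknown job type: {job_type!r}")
```

For instance, `initial_services("interactive", shared=True)` enables the notebook hub and leaves the online development environment switched off at start.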

The container control (S332) procedure passes through a container control step (S710) and ends either in activation (S720) or in activation failure (S730). Here, on activation failure (S730), the active services of the operation container may be initialized differently depending on the type of job the user requested.

Job processing failure (S730) can occur when a check cannot be completed in the user volume check step (S750), the multi-user check step (S752), the notebook programming requirement check step (S754), the personal development environment requirement check step (S756), or the remote access requirement check step (S758).

In the compute node agent, the operation container control procedure passes through the user volume check step (S750), the multi-user check step (S752), the notebook programming requirement check step (S754), the personal development environment requirement check step (S756), or the remote access requirement check step (S758), and then synchronizes container information (S740).

Each check step leads to a corresponding action, after which the operation container information is synchronized (S770): the user volume check step (S750) attaches a volume to the container (S760); the multi-user check step (S752) installs and activates the Jupyter Hub package (S762); the notebook programming requirement check step (S754) activates the Jupyter Notebook service (S764); the personal development environment requirement check step (S756) installs additional packages and activates services (S766); and the remote access requirement check step (S758) configures SSH (S768).

In the container control (S332) procedure, the active services of the operation container may thus be initialized differently depending on the type of job the user requested. This is the state reached once the following are complete: the check steps (S750, S752, S754, S756, S758) with their synchronization of operation container information (S740); the corresponding actions of attaching a volume to the container (S760), installing and activating the Jupyter Hub package (S762), activating the Jupyter Notebook service (S764), installing additional packages and activating services (S766), and configuring SSH (S768); and the subsequent synchronization of operation container information (S770).

The compute node agent 110 tracks the active operation containers 112 combined with the computation cores 220 of the device group 210 it manages. It measures the load on the device group used by each active operation container 112, detects changes in the state of the active operation container 112, and periodically and continuously updates the information about the active operation container 112 on the deep learning private cloud integrated server 300. For example, if an active operation container 112 is stopped by an external factor, the agent reports the problem to the deep learning private cloud integrated server 300 so that the user and the administrator can respond to the situation.

The task scheduler is described in detail with reference to FIGS. 8 and 9.

FIG. 8 is a diagram illustrating the operation of the deep learning task scheduler of a deep learning private cloud service according to an embodiment of the present invention.

The deep learning task scheduler 120 finds an idle device group so that the operating environment the user needs can be combined with a device group registered with the deep learning private cloud integrated server 300, and assigns that device group to the user request 410.

Referring to FIG. 8, the deep learning task scheduler 120 determines the range of device groups 210 available to the user according to the user's permissions. From that range it arbitrarily picks a device group 210 that has the requested (410) type and number of computation cores 220 idle, and assigns the computation cores 220 in that device group 210 to the request. The assignment can be made using a user permission lookup 420 and a device status lookup 430.
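The selection just described can be sketched as a filter followed by an arbitrary pick. In this illustration, `select_device_group`, the shape of `groups` (device group id mapped to idle core counts per kind), and the use of a random choice to model the arbitrary pick are all assumptions rather than details from the disclosure.

```python
import random


def select_device_group(groups, allowed_group_ids, requested, rng=random):
    """Pick a device group that the user may use and that has enough idle cores.

    groups: {group_id: {core_kind: idle_count}}     (device status lookup)
    allowed_group_ids: groups the user may access   (user permission lookup)
    requested: {core_kind: count} from the user request
    """
    candidates = [
        group_id
        for group_id, idle in groups.items()
        if group_id in allowed_group_ids
        and all(idle.get(kind, 0) >= n for kind, n in requested.items())
    ]
    # None signals that no group currently satisfies the request.
    return rng.choice(candidates) if candidates else None
```

Because every requested core must come from the one selected group, the check is made per group rather than across the pooled idle cores of several groups.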

For the resources assigned to a request, the operation container 240 is deployed by the deep learning agent 110 that manages the corresponding computation cores 220. If no device group 210 satisfies the computation core 220 requirements of the user's request, the handling depends on the request type. Requests are divided into interactive jobs and batch jobs, both of which are described below. Here, the deep learning agent is the compute node agent.

An interactive job is one in which, after the computation cores 220 are assigned, almost nothing is processed automatically by software and the user must operate the environment to move the work forward. Most data manipulation, steps such as training on data, and the user's program development stage can be expressed as interactive jobs.

For interactive jobs, the deep learning task scheduler 120 tries to assign resources as quickly as possible. If assignment is not possible, it waits a short time for other resources to become idle and, if assignment still fails, notifies the deep learning private cloud integrated server 300 so that the user can learn that assignment is currently difficult. When deploying an operation container for an interactive job, the deep learning agent 110 tries to provide the programming notebook interface environment widely used for data manipulation and a program development environment that can be used online. If the operation container being deployed cannot provide the notebook interface environment or the program development environment, this is treated as a job processing failure, and the deep learning private cloud integrated server 300 must be informed that the job cannot be performed. The deep learning agent 110 then puts the operation container it was deploying into the inactive state.

A batch job, unlike an interactive job, is a job whose user operations are defined in advance or that has an automated processing pipeline requiring no user operation. It is typically used when a processor built beforehand through data training exists and the user wants to run other input data through that processor to obtain results.

When a batch job cannot be assigned, the deep learning task scheduler 120 waits considerably longer than for an interactive job before concluding that resource assignment has failed. When processing a batch job, the deep learning agent 110 creates an operation container, does not deploy a separate user environment, and tries to provide only an environment for remotely collecting information that confirms whether the user process is running normally. If this information cannot be collected normally, the batch job is treated as failed and the operation container 240 is put into the inactive state.

FIG. 9 is a flowchart illustrating the processing of the deep learning task scheduler of a deep learning private cloud service according to an embodiment of the present invention.

As described above, the criteria by which the deep learning task scheduler 120 judges a job to have failed depend on the job type. In addition, when the number of computation cores 220 is insufficient relative to user requests, the scheduler can prevent resources from being concentrated on particular users. When this feature is used, the deep learning task scheduler 120 adjusts priorities among users by examining each user's past usage patterns, recent usage time, accumulated recent idle time, number of recently used computation cores, time waited since the request, and so on.

By default, the deep learning task scheduler 120 processes jobs in the order they were requested, but it can reorder jobs according to the priorities among users.

Referring to FIG. 9, when the task scheduler 120 starts, it can load the requested job queue (S910), the resource pool (S920), and the user pool (S930). The user pool may include user usage history, but is not necessarily limited to this.

The requested job queue (S910) can be loaded in response to a job placement request (S940) and is used in the user candidate selection step (S912) and the target task selection step (S914).

The resource pool (S920) is used in the resource selection step (S950), and the user pool (S930) is used in the user candidate selection step (S912).

FIG. 9 shows the components of the deep learning task scheduler and its information flow. The deep learning task scheduler 120 may consist of three processes: selecting user candidates (S912), selecting the target job (task) (S914), and selecting compute node resources (S950).

If resources are insufficient in the resource selection process (S950), the scheduler goes through the N-minute queue wait step (S952) and performs the requested job queue step (S910) again. When the resource selection process (S950) completes, the compute node agent step (S960) can be performed through agent assignment.
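The wait-and-retry path (S950, S952, S910) can be sketched as a bounded loop. This is only an illustration: `assign_with_retry`, the `try_select` callback, and the `max_attempts` bound are hypothetical, and the text does not state how many retries are made before a job is judged to have failed.

```python
import time


def assign_with_retry(try_select, wait_s, max_attempts, sleep=time.sleep):
    """Retry resource selection, pausing between attempts.

    try_select: callable returning a resource, or None when resources are short.
    Returns the resource, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        resource = try_select()      # S950: resource selection
        if resource is not None:
            return resource          # proceed to agent assignment (S960)
        if attempt < max_attempts - 1:
            sleep(wait_s)            # S952: wait before re-reading the queue
    return None
```

A larger `max_attempts` or `wait_s` for batch jobs than for interactive jobs would be one way to express the longer waiting time that batch jobs tolerate.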

The user candidate selection step (S912) selects the users to be scheduled: through step S930, it selects the users who are waiting to be assigned resources. Among the selected users, priorities are then adjusted in step S920 using score factors based on user history, such as whether a user has recently been using resources exclusively. If users have the same user history score, the user who requested resource assignment first has the higher priority.

The deep learning task scheduler 120 applies tickets and ticket debt to compute the user-history-based score. Tickets are initialized to a fixed value. The ticket count decreases each time one unit of resource usage (for example, one GPGPU used for 10 minutes) is consumed. Ticket debt increases each time one unit of resource usage is reached while the user continues to use resources after all tickets have been exhausted (ticket count 0). For user priority, users who hold tickets have the highest priority; the number of tickets is not compared, and only whether the user has any tickets is considered. A user who has no tickets and whose ticket debt is below the warning level has the next priority. Finally, a user who has no tickets and whose ticket debt exceeds the warning level has the lowest priority, and that user's resources may be forcibly released when the system runs short of resources.
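
The ticket and ticket-debt rule described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the names `UserAccount`, `rank_users`, `INITIAL_TICKETS`, and `DEBT_WARNING_LEVEL` and both numeric values are assumptions, as is the handling of two cases the text leaves open (a user with zero tickets and zero debt is placed in the middle tier, and a debt exactly at the warning level is placed in the lowest tier).

```python
from dataclasses import dataclass

# Hypothetical values: the source fixes neither the initial ticket count
# nor the warning level for ticket debt.
INITIAL_TICKETS = 10
DEBT_WARNING_LEVEL = 5


@dataclass
class UserAccount:
    name: str
    requested_at: float            # time of the resource request, in seconds
    tickets: int = INITIAL_TICKETS
    debt: int = 0

    def charge_usage_unit(self) -> None:
        """Record one usage unit (e.g. one GPGPU used for 10 minutes)."""
        if self.tickets > 0:
            self.tickets -= 1      # tickets are consumed first
        else:
            self.debt += 1         # with no tickets left, debt grows instead

    def tier(self) -> int:
        """0 = holds tickets, 1 = debt below warning level, 2 = otherwise."""
        if self.tickets > 0:
            return 0               # ticket count itself is not compared
        if self.debt < DEBT_WARNING_LEVEL:
            return 1
        return 2


def rank_users(users):
    """Order users by tier; ties go to the earlier resource request."""
    return sorted(users, key=lambda u: (u.tier(), u.requested_at))
```

Under these assumptions, a user in tier 2 would be the candidate for forced release of resources when the system runs short.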

The target task selection step (S914) selects the top-priority task of the highest-priority user according to the determined user priorities. If other tasks were submitted at about the same time as that user's task (for example, within 1 minute), they are selected together with it.
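
A minimal sketch of this selection step is given here. The names `Task` and `select_target_tasks` are hypothetical, and two points are assumptions because the text does not settle them: a user's "top-priority task" is taken to be that user's earliest submitted task, and tasks from any user that fall inside the time window are batched with it.

```python
from dataclasses import dataclass

SUBMIT_WINDOW_SECONDS = 60.0   # the "within 1 minute" example from the text


@dataclass
class Task:
    task_id: str
    user: str
    submitted_at: float        # submission time, in seconds


def select_target_tasks(ranked_users, pending):
    """Pick the first ranked user's earliest task plus tasks submitted near it."""
    for user in ranked_users:
        own = sorted((t for t in pending if t.user == user),
                     key=lambda t: t.submitted_at)
        if not own:
            continue           # this user has nothing queued; try the next one
        head = own[0]
        batch = [t for t in pending
                 if abs(t.submitted_at - head.submitted_at) <= SUBMIT_WINDOW_SECONDS]
        return sorted(batch, key=lambda t: t.submitted_at)
    return []
```

The returned list is in submission order, which is the order in which the resource selection step reads the tasks.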

According to an embodiment of the present invention, a task refers to a basic unit of a program controlled by an operating system.

The resource selection step (S950) reads the selected tasks in submission order and allocates resources to each task by comparing its requirements with the idle resources and resource groups in the resource pool. If resources are insufficient at this step, the task is put on hold and is not scheduled for a certain period (for example, 5 minutes). When another task finishes and its resources are returned, a held task is checked against the returned resources; if they match the task's requirements, the hold is released immediately. A task released from hold goes through the entire scheduling procedure again. When resource selection succeeds, a compute node agent is assigned to the task.
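
The hold-and-release behavior of this step can be sketched as below. It is an illustration under stated assumptions: `DeviceGroup`, `PendingTask`, `select_resources`, and `on_resources_returned` are hypothetical names, and a task's requirement is reduced to a single core count, whereas the source compares requirements against resources and resource groups in more general terms.

```python
from dataclasses import dataclass
from typing import Optional

HOLD_SECONDS = 5 * 60   # the "for example, 5 minutes" hold from the text


@dataclass
class DeviceGroup:
    name: str
    idle_cores: int


@dataclass
class PendingTask:
    task_id: str
    cores_needed: int
    hold_until: float = 0.0              # skipped while now < hold_until
    assigned_group: Optional[str] = None


def select_resources(tasks, pool, now):
    """Try to place each task (given in submission order) on an idle group."""
    for task in tasks:
        if task.assigned_group is not None or now < task.hold_until:
            continue
        group = next((g for g in pool if g.idle_cores >= task.cores_needed), None)
        if group is None:
            task.hold_until = now + HOLD_SECONDS   # insufficient resources: hold
        else:
            group.idle_cores -= task.cores_needed
            task.assigned_group = group.name       # agent assignment would follow


def on_resources_returned(group, cores, held_tasks):
    """Return cores to a group and release holds the group can now satisfy."""
    group.idle_cores += cores
    for task in held_tasks:
        if task.assigned_group is None and task.cores_needed <= group.idle_cores:
            task.hold_until = 0.0                  # task re-enters scheduling
```

Releasing a hold only clears the timer; the task is placed on the next scheduling pass, which mirrors the statement that a released task goes through the whole procedure again.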

According to an embodiment of the present invention, a resource refers to a facility or function of the computer system or operating system that is required by a job or task executed on a computer. For example, resources include, but are not necessarily limited to, a main memory device, an input/output device, a central processing unit, a timer, a data set, a control program, and a processing program.

FIG. 10 is a diagram illustrating the web service structure of a deep learning private cloud service according to an embodiment of the present invention.

The deep learning integration server 300 provides its services using various database solutions. The deep learning integration server 300 has (i) a REST API service for requests to the system, information manipulation, and the like; (ii) a WebSocket service for prompt state-change notification and information exchange; (iii) a performance measurement function that records performance metrics of the deep learning integration server; (iv) an incident reporting function that records the various exceptions that may occur in the system; (v) a background processing service that performs processing that need not respond immediately to a user request or that may take a long time; (vi) database integration for recording operational data; and (vii) a cache function for storing frequently accessed data and sessions.

The REST API service lets a client such as a web browser or an agent send requests to the server to query, create, or modify information. Any client that can handle the HTTP(S) protocol can use the API provided by the server.

The WebSocket service is based on the WebSocket protocol, which is established through an HTTP upgrade handshake and, unlike plain HTTP, keeps the connection open so that both sides can send and receive data. The WebSocket service can therefore be used to push state changes or notifications from either side. The deep learning integration server 300 maintains a separate WebSocket connection with each user so that the user learns immediately when the state of a user request changes. The deep learning integration server 300 also maintains a WebSocket connection with the deep learning agent 110 so that the assignment of a task to the deep learning agent 110, or a state change of the operation container 240 of the deep learning agent 110, is reflected immediately.

Referring to FIG. 10, the deep learning integration server 300 communicates with other components using the REST API and WebSocket. The REST API is provided over the HTTP(S) protocol so that clients can perform actions such as manipulating a state or creating new information. WebSocket may be used to propagate changes in information that occur in the deep learning integration server 300 immediately.

For example, when a user submits a task request, the interface the user is viewing internally uses the REST API to deliver the request to the deep learning integration server 300. The deep learning integration server 300 calls the deep learning task scheduler 120 to request task assignment; if the assignment succeeds, it notifies both the user and the deep learning agent 110 that holds the assigned device group over WebSocket. If the assignment fails, only the user is notified over WebSocket. If the notification to the user fails, the user can obtain the latest information by refreshing, which issues a REST API call.
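
The notification rule in this example (user and agent on success, user only on failure) can be sketched as a small dispatch function. Everything here is hypothetical: `handle_task_request`, the `schedule` and `notify` callables, and the message fields are illustrative stand-ins for the scheduler 120 and for a WebSocket push, not an API taken from the source.

```python
def handle_task_request(user, request, schedule, notify):
    """Dispatch one task request and notify the affected parties.

    schedule(request) -> agent id, or None if assignment fails
                         (stands in for the task scheduler 120)
    notify(recipient, message) -> None
                         (stands in for a WebSocket push)
    """
    agent = schedule(request)
    if agent is None:
        # Assignment failed: only the user is told.
        notify(user, {"event": "assignment_failed", "request": request})
        return None
    # Assignment succeeded: the user and the agent holding the device group are told.
    notify(user, {"event": "assigned", "request": request, "agent": agent})
    notify(agent, {"event": "new_task", "request": request, "user": user})
    return agent
```

Passing the scheduler and the notifier in as callables keeps the sketch independent of any particular web framework or WebSocket library.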

The deep learning agent 110, once notified of a new task, deploys an operation container and propagates information about it to the integration server using both WebSocket and the REST API. Based on what it receives, the deep learning integration server 300 records important items in the database and, when necessary, processes the changed state information and propagates it to the user over WebSocket.

For the performance measurement function, the integration server 300 of the deep learning private cloud manages metrics such as user request frequency, database access frequency, and access time so that the state of the server can be known, and it can work with databases that record performance metrics, such as Prometheus and StatsD (302). The performance metrics exposed for external measurement include the number of authentication attempts, the number of authentication successes, the number of authentication failures, the number of successfully authenticated users, HTTP response status frequency, HTTP request frequency, HTTP response time, the number of session information accesses, cache hit frequency, cache miss frequency, database access frequency, and database access time.

The incident reporting function records unexpected errors that occur in each software component of the deep learning private cloud service 10 and helps resolve them. For this purpose, incident-recording software called Sentry (302) is used. When an error occurs, the situation is communicated immediately to the relevant personnel, such as the operations staff, and the separately recorded incident details are analyzed so that the problem can be resolved.

The background processing service is used in the deep learning private cloud service 10 to process deferred user requests and to run periodic service routines. The background processing service is implemented using Celery, a Python task queue library. The service routine of the deep learning task scheduler also runs through the background processing service.

The background processing service is designed to allow redundancy against system failures, as shown in FIG. 12. RabbitMQ or Redis (304), the intermediary software that delivers background processing requests, can each be configured as a cluster for failover. In addition, the deep learning deferred-processing routine and the schedule function that receive and process those requests can each be run independently on different machines. This prevents a failure of one system from interrupting the service, and when load concentrates on a system, additional systems can be added to distribute the load.

According to an embodiment of the present invention, the deep learning deferred-processing routine and the deep learning task schedule (309) may be used for horizontal scaling for load balancing, but are not necessarily limited thereto.

According to an embodiment of the present invention, RabbitMQ is open-source message broker software (message-oriented middleware): a producer places messages in a queue, and a consumer retrieves and processes them, in a publish/subscribe style of message delivery. Redis, short for Remote Dictionary Server, is an open-source, non-relational, in-memory database management system (DBMS) for storing and managing unstructured data in a key-value structure, and it can optionally persist data to disk.
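
The producer/consumer pattern described here, with several independent workers draining one queue, can be illustrated in-process with the Python standard library. This is only an analogy: `queue.Queue` stands in for the broker (RabbitMQ or Redis), threads stand in for workers running on separate machines, and `run_background_jobs` is a hypothetical name. It does not use the Celery or RabbitMQ APIs.

```python
import queue
import threading


def run_background_jobs(jobs, handler, worker_count=2):
    """Process jobs with several workers that share one queue."""
    q = queue.Queue()            # stand-in for the message broker
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            job = q.get()
            if job is None:      # sentinel: no more work for this worker
                q.task_done()
                break
            outcome = handler(job)
            with lock:
                results.append(outcome)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for job in jobs:             # producer side: enqueue the requests
        q.put(job)
    for _ in threads:            # one sentinel per worker
        q.put(None)
    q.join()
    for t in threads:
        t.join()
    return sorted(results)       # completion order is not deterministic
```

Because every worker pulls from the same queue, adding workers spreads the load, and the loss of one worker leaves the others able to continue, which is the property the redundancy design relies on.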

The deep learning private cloud integration server uses a relational database to record and query information. PostgreSQL and MySQL (306) may be used as the relational database. Testing has been completed against the latest versions of PostgreSQL, which is open source, very stable, and widely used; MySQL, open-source software acquired by Oracle; and MariaDB, in which the developers of MySQL improved on MySQL's functionality. The database stores information needed for operation, such as user access records and records of users' use of computational cores.

According to an embodiment of the present invention, PostgreSQL is an object-relational database management system (ORDBMS) that emphasizes extensibility and standards compliance. MySQL is an open-source relational database management system maintained and distributed by Oracle Corporation.

To distribute the load caused by surges in user requests, the deep learning private cloud integration server can store the sessions that record user connection information in a separate cache server. Memcached and Redis are supported as cache servers.

According to an embodiment of the present invention, Memcached is a general-purpose distributed cache system. It can be used to speed up dynamic, database-driven websites by caching data and objects in RAM to reduce the number of reads from external data sources (for example, a database or an API).

Even if a system failure occurs, the user can connect to another instance of the deep learning integration server. Because the cache records and sessions used on the previous server are preserved, the user can continue to use the service without noticing the failure.
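
The reason a session survives a switch between server instances is that it lives in the external cache rather than inside any one instance. A minimal sketch follows, in which `SharedSessionStore` is a hypothetical in-memory stand-in for a cache server such as Memcached or Redis and `ServerInstance` is a hypothetical stand-in for one instance of the integration server.

```python
class SharedSessionStore:
    """In-memory stand-in for an external cache server."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id)

    def set(self, session_id, value):
        self._data[session_id] = value


class ServerInstance:
    """One server instance; it keeps no session state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store       # shared across all instances

    def login(self, session_id, user):
        self.store.set(session_id, {"user": user})

    def whoami(self, session_id):
        session = self.store.get(session_id)
        return session["user"] if session else None
```

For instance, a session created through one instance can be read through another instance that was given the same store, so a client redirected after a failure does not need to log in again.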

The web server serves the JavaScript program code containing the user interface, web documents, images, and the like, and performs part of the load balancing for requests coming into the REST API service or the WebSocket service. Nginx and the Apache HTTP Server are supported as web servers.

According to an embodiment of the present invention, Nginx is web server software that aims to be lightweight and high-performance, and it can act as a web server, a reverse proxy, and a mail proxy. The Apache HTTP Server is an HTTP web server maintained by the Apache Software Foundation.

The data storage server is a server that provides user volumes that can be attached to a user's active operation container. In principle, any product capable of providing an NFS service can be used as this server, and if necessary, a product using network-based storage technology such as IP-SAN can be used.

FIG. 11 is a diagram illustrating the conceptual structure of a deep learning private cloud service according to an embodiment of the present invention.

FIG. 11 illustrates an example of the deep learning private cloud service provided to a user according to an embodiment of the present invention.

The deep learning private cloud service may comprise a user web integrated interface and an administrator web integrated interface. The user web integrated interface is the system through which the user interacts with a terminal, and the administrator web integrated interface may be the system through which the administrator interacts with a terminal. Here, the terminal may be a device, such as a computer or a mobile phone, that is connected to a communication network and inputs data or outputs processing results.

According to an embodiment of the present invention, the user web integrated interface allows the user to use a Git-based source configuration management service, the deep learning task scheduler, a source and Learning DB integration tool, a deep learning execution service, an execution service (reports and logs), and a Learning DB management service. According to an embodiment of the present invention, the administrator web integrated interface allows the administrator to use device status/performance monitoring, a scheduler tuner, a student support service, a computing device monitor, and a TensorFlow monitor. The operations performed by the user web integrated interface and the administrator web integrated interface are not limited to those described above. Here, the computing device may be implemented as a device group, but is not necessarily limited thereto.

In FIG. 11, the physical device layer refers to the physical device layer 100, and the abstraction layer may refer to the abstraction layer 200.

FIG. 12 is a block diagram illustrating a computing environment that includes a computing device suitable for use in the embodiments.

In the embodiment illustrated in FIG. 12, each component may have functions and capabilities different from those described below, and components other than those described below may additionally be included.

The illustrated computing environment includes a layer switching device 30 of a private cloud service having linearity. In an embodiment, the layer switching device 30 of the private cloud service having linearity may be any type of computing device that transmits signals to and receives signals from other terminals.

The layer switching device 30 of the private cloud service having linearity includes at least one processor 1610, a computer-readable storage medium 1620, and a communication bus 1660. The processor 1610 may cause the layer switching device 30 to operate according to the exemplary embodiments described above. For example, the processor 1610 may execute one or more programs stored in the computer-readable storage medium 1620. The one or more programs may include one or more computer-executable instructions, and the computer-executable instructions may be configured so that, when executed by the processor 1610, they cause the layer switching device 30 to perform operations according to the exemplary embodiments.

The computer-readable storage medium 1620 is configured to store computer-executable instructions or program code, program data, and/or other suitable forms of information. The program 1630 stored in the computer-readable storage medium 1620 includes a set of instructions executable by the processor 1610. In an embodiment, the computer-readable storage medium 1620 may be memory (volatile memory such as random access memory, non-volatile memory, or a suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, any other form of storage medium that can be accessed by the layer switching device 30 of the private cloud service having linearity and can store the desired information, or a suitable combination thereof.

The communication bus 1660 interconnects the various other components of the layer switching device 30 of the private cloud service having linearity, including the processor 1610 and the computer-readable storage medium 1620.

The layer switching device 30 of the private cloud service having linearity may also include one or more input/output interfaces 1640, which provide an interface for one or more input/output devices (not shown), and one or more communication interfaces 1650. The input/output interface 1640 and the communication interface 1650 are connected to the communication bus 1660. An input/output device (not shown) may be connected to the other components of the layer switching device 30 through the input/output interface 1640. Exemplary input/output devices may include input devices such as a pointing device (a mouse, a trackpad, or the like), a keyboard, a touch input device (a touchpad, a touchscreen, or the like), a voice or sound input device, various kinds of sensor devices, and/or an imaging device, and/or output devices such as a display device, a printer, a speaker, and/or a network card. An exemplary input/output device (not shown) may be included inside the layer switching device 30 as one of its components, or may be connected to the computing device as a separate device distinct from the layer switching device 30.

The operations according to the present embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. A computer-readable medium refers to any medium that participates in providing instructions to a processor for execution. The computer-readable medium may include program instructions, data files, data structures, or a combination thereof; examples include magnetic media, optical recording media, and memory. The computer program may also be distributed over computer systems connected by a network so that the computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present embodiments can be readily inferred by programmers in the technical field to which the embodiments pertain.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention pertains can make various modifications, changes, and substitutions without departing from the essential characteristics of the present invention. Accordingly, the embodiments and accompanying drawings disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments and drawings. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of the present invention.

10: deep learning private cloud
100: physical device layer
200: abstraction layer

Claims

1. A layer switching device of a private cloud service, the device comprising
a compute node agent that communicates with a deep learning integration server and converts various types of physical device layers into an abstraction layer.

2. The device of claim 1, wherein
the physical device layer uses a plurality of servers through a network,
the abstraction layer simplifies the physical device layer into conceptual devices divided into a plurality of device groups, a plurality of storage volumes, and a plurality of operation containers, and forms a basis for providing services to a user using the simplified conceptual devices,
each of the plurality of device groups includes a plurality of computational cores for performing computational tasks,
each storage volume stores data used or generated by the user and is shared by being combined with at least one operation container, and
each operation container provides a user environment in which an operating system, system software, and application software usable by the user are set up in advance.

3. The device of claim 2, wherein
the plurality of device groups
separate each of the plurality of computational cores and combine them with the operation container, or combine some of the computational cores with the operation container, and
the computational cores include:
a general-purpose computational core including at least one of manufacturer information, core performance indication information, or information on the computational technology provided by the core; and
a deep learning computational core including at least one of manufacturer information, model information of the plurality of servers, core performance indication information, or computational technology information.

4. The device of claim 2, wherein
the operation container
is combined with at least one of the computational cores and at least one of the storage volumes, is divided into an image state and an active state, and is activated when the user selects an operation container in the image state, and
when the user deactivates the operation container, the deactivated operation container is removed, and when the user reactivates the operation container, the previously used combination is recombined to create a new operation container.

5. The device of claim 1, wherein
the compute node agent
collects information on the CPUs and GPGPUs included in the physical device layer, converts each into computational core information, and registers them as one device group in the deep learning integration server, and
when registering the device group, receives as input the address of the deep learning integration server and a security token required for registration, the users who can access the device group being restricted according to the security token.

6. The device of claim 5, wherein
the compute node agent
periodically queries the information on the CPUs and GPGPUs included in the physical device layer to update the information of the computational cores, and reports the updated information of the computational cores to the deep learning integration server, and
when a new CPU or GPGPU is added, detects the added CPU or GPGPU and adds it to the managed device group, and when an existing CPU or GPGPU is removed, detects the removed CPU or GPGPU and deletes it from the managed device group, thereby updating the information of the device group.

7. The layer switching device of claim 6, wherein
the compute node agent,
when activating the operating container of the abstraction layer, searches the operating container repository for the latest operating container and synchronizes the operating container,
reuses the existing operating container when the existing operating container and the latest operating container are the same, and, when they differ, leaves the existing operating container as it is and creates the newer operating container, and
when all existing active operating containers associated with the device group have been deactivated, removes those existing operating containers from the physical device layer.
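
As a non-limiting illustration, the reuse-or-create behavior recited above could be sketched as follows. This is one possible reading: versions are compared by a simple string label, and once every active container of the device group has been deactivated, all local versions other than the latest are removed. The class name `DeviceGroupContainers` and the version labels are assumptions made for this sketch.

```python
from typing import Dict, List, Tuple


class DeviceGroupContainers:
    """Illustrative bookkeeping of operating container versions on one node."""

    def __init__(self) -> None:
        self.local_versions: List[str] = []   # versions present on the node
        self.active: Dict[int, str] = {}      # container id -> version in use
        self.latest: str = ""
        self._next_id = 1

    def activate(self, latest_version: str) -> Tuple[int, bool]:
        """Activate a container; return (container id, whether a local copy was reused)."""
        reused = latest_version in self.local_versions
        if not reused:
            # Existing versions are left as they are; the newer one is added.
            self.local_versions.append(latest_version)
        self.latest = latest_version
        container_id = self._next_id
        self._next_id += 1
        self.active[container_id] = latest_version
        return container_id, reused

    def deactivate(self, container_id: int) -> None:
        del self.active[container_id]
        if not self.active:
            # Nothing is active any more: drop every version except the latest.
            self.local_versions = [v for v in self.local_versions if v == self.latest]
```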

8. The layer switching device of claim 7, wherein
the compute node agent
receives a user environment request from the deep learning integration server, receives request information according to that request, analyzes the request information, and combines and activates the operating container, storage volume, and device group of the abstraction layer to create an active operating container, and
transmits all information generated during the activation to the deep learning integration server, removes any active operating container in which a problem occurs, and initializes the active services of the operating container according to the type of task requested by the user.

9. The layer switching device of claim 8, wherein
the compute node agent,
for an interactive task, enables a service that provides a personal notebook interface to the operating container and a service that provides an online integrated development environment interface,
for an interactive task requested for class or team use, provides a service that provides a notebook hub interface in place of the service that provides the personal notebook interface, and starts with the online integrated development environment interface disabled, and
for a batch task, checks the operating status of the batch task, which communicates directly with the deep learning integration server, and provides a secure terminal service through which the user can connect and check the status.
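
As a non-limiting illustration, the selection of services by task type recited above could be sketched as a small lookup. The function name `select_services` and the service keys are hypothetical labels chosen for this sketch; the actual services and the way they are started are not specified here.

```python
from typing import Dict


def select_services(task_type: str, shared: bool = False) -> Dict[str, bool]:
    """Return which services start enabled for a requested task.

    task_type: "interactive" or "batch".
    shared:    True when an interactive task is requested for class or team use.
    """
    if task_type == "interactive":
        if shared:
            # Hub interface replaces the personal interface; the online IDE starts disabled.
            return {"notebook_hub": True, "online_ide": False}
        return {"personal_notebook": True, "online_ide": True}
    if task_type == "batch":
        # Status of the batch task is monitored; the user checks it over a secure terminal.
        return {"batch_status_monitor": True, "secure_terminal": True}
    raise ValueError(f"unknown task type: {task_type!r}")
```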

10. The layer switching device of claim 9, wherein
the compute node agent
tracks the computational cores of the device group and the active operating containers combined with them, measures the load of the device group used by each active operating container, detects state changes, and updates the information on the active operating containers in the deep learning integration server.

The layer switching device of claim 1,
further comprising a task scheduler that finds an idle device group in order to combine the operating environment set by the user with a device group registered in the deep learning integration server, and assigns the idle device group in response to the user's request, wherein
the task scheduler selects a user to be scheduled, selects the task with the highest priority among the selected user's tasks, reads the resources in order to find idle resources, compares resource groups to select a resource and allocate it to the task, and assigns the compute node agent to the selected task, and
the layer switching device communicates with the deep learning integration server and converts physical device layers of various types into the abstraction layer through the compute node agent assigned by the task scheduler.

A layer switching device of a private cloud service, comprising
a task scheduler that finds an idle device group in order to combine the operating environment set by the user with a device group registered in a deep learning integration server, and assigns the idle device group to a compute node agent according to the user's request.

13. The layer switching device of claim 12, wherein
the user's request is
either an interactive task, in which work is performed by the user's operations after a computational core is allocated, or a batch task, in which the user's operations are defined in advance or an automated processing procedure is provided, and
the task scheduler
selects a user to be scheduled, selects the task with the highest priority among the selected user's tasks, reads the resources in order to find idle resources, compares resource groups to select a resource and allocate it to the task, and assigns the compute node agent to the selected task.

14. The layer switching device of claim 13, wherein
the task scheduler
selects the users waiting to be allocated a resource, changes the priority among the selected users based on each user's history of exclusive resource use, and, when users have the same history score, gives priority to the user who requested resource allocation first.
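
As a non-limiting illustration, the ordering of waiting users recited above could be sketched as a sort on two keys. The sketch assumes that a higher history score means higher priority and that request times are comparable numbers; neither convention is stated in the claim, and the names `WaitingUser` and `order_waiting_users` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class WaitingUser:
    name: str
    history_score: int    # assumed: higher score = higher priority
    requested_at: float   # assumed: smaller value = earlier request


def order_waiting_users(users: Iterable[WaitingUser]) -> List[WaitingUser]:
    """Order by history score first; equal scores are broken by earliest request."""
    return sorted(users, key=lambda u: (-u.history_score, u.requested_at))
```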

15. The layer switching device of claim 13, wherein
the task scheduler
applies tickets and ticket debt for scoring based on user history,
the ticket is initialized to a constant value and decreases each time it is used according to a resource usage criterion,
the ticket debt increases each time the resource usage criterion is reached while the resource continues to be used after all tickets are exhausted, and
the user priority is such that a user holding tickets has the highest priority; a user with no tickets whose ticket debt is below a preset warning level has the next priority; and a user with no tickets whose ticket debt is above the preset warning level has the lowest priority, that user's resources being forcibly released when resources are insufficient.
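
As a non-limiting illustration, the ticket and ticket-debt scoring recited above could be sketched as follows. The constants `INITIAL_TICKETS` and `DEBT_WARNING_LEVEL` are assumed values, a debt exactly equal to the warning level is treated as the lowest rank (the claim does not settle that boundary), and the names `TicketAccount`, `charge`, `rank`, and `may_force_release` are hypothetical.

```python
from dataclasses import dataclass

INITIAL_TICKETS = 10     # assumed constant initial value
DEBT_WARNING_LEVEL = 3   # assumed preset warning level


@dataclass
class TicketAccount:
    tickets: int = INITIAL_TICKETS
    debt: int = 0

    def charge(self) -> None:
        """Call each time usage reaches the resource usage criterion."""
        if self.tickets > 0:
            self.tickets -= 1
        else:
            # Tickets are exhausted but the resource is still in use.
            self.debt += 1

    def rank(self) -> int:
        """0 = highest priority, 2 = lowest priority."""
        if self.tickets > 0:
            return 0
        if self.debt < DEBT_WARNING_LEVEL:
            return 1
        return 2

    def may_force_release(self, resources_insufficient: bool) -> bool:
        """Only lowest-rank users lose resources, and only under shortage."""
        return resources_insufficient and self.rank() == 2
```

A scheduler could sort waiting users by `rank()` and, when no idle resources remain, release resources held by users for whom `may_force_release(True)` is true.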