KR102405933B1

KR102405933B1 - Platform system and method for managing execution of machine learning based on learning

Info

Publication number: KR102405933B1
Application number: KR1020200033179A
Authority: KR
Inventors: 홍지만; 김영관; 박기철; 이주석
Original assignee: 숭실대학교 산학협력단
Priority date: 2020-02-27
Filing date: 2020-03-18
Publication date: 2022-06-08
Also published as: KR20210109407A

Abstract

The present invention relates to a learning-based machine learning execution management platform system and method that analyze log information on machine learning tasks to predict the computing resources a task will require, and select an optimal worker node on the basis of that prediction to perform machine learning.
The present invention further relates to a learning-based machine learning execution management platform system and method that detect abnormalities in the structure or function of the platform caused by failure factors outside the platform, including network errors, and make it possible to change the node that performs machine learning.

Description

SYSTEM AND METHOD FOR MANAGING EXECUTION OF MACHINE LEARNING BASED ON LEARNING

The present invention relates to a learning-based machine learning execution management platform system and method that analyze log information on machine learning tasks to predict the computing resources a task will require, and select an optimal worker node on the basis of that prediction to perform machine learning.

The present invention further relates to a learning-based machine learning execution management platform system and method that detect abnormalities in the structure or function of the platform caused by failure factors outside the platform, including network errors, and make it possible to change the node that performs machine learning.

Machine learning is a field of artificial intelligence (AI) that evolved from the study of pattern recognition and computational learning theory. It builds systems, and the algorithms behind them, that learn and make predictions from empirical data and improve their own performance.

An integrated remote execution platform for such machine learning consists, from its top layer down, of a master node, session nodes, and worker nodes. It also has a separate external network storage, and users access the platform through an external program.

When a user who has accessed the platform registers a task on which machine learning is to be performed, the registered task is first assigned to a session node that matches the framework in which the machine learning will run.

The session node assigns the task to the worker node that can perform it and has the best computational capability. When the user issues a machine learning command to be run on the worker node, machine learning is performed according to that command.

However, when a session node of a conventional integrated remote execution platform for machine learning assigns a machine learning task to one of the worker nodes, it considers only the performance of the worker node, irrespective of the task.

Consequently, when the performance of the worker node is lower than the hardware performance the machine learning task requires, the task runs markedly slower, and in severe cases the worker node shuts down and can no longer be used.

Furthermore, because the platform cannot respond to failure factors external to it, such as the network, problems may arise in the platform structure. For example, it cannot respond to a network failure that occurs in a worker node before or during execution of a job.

US registered patent US 16/248,560
Korean Patent Laid-Open Publication No. 10-2012-0001688

The present invention is intended to solve the problems described above. It aims to provide a learning-based machine learning execution management platform system and method that analyze log information on machine learning tasks to predict the computing resources a task will require, and select an optimal worker node on the basis of that prediction.

The present invention further aims to provide a learning-based machine learning execution management platform system and method that detect abnormalities in the structure or function of the platform caused by failure factors outside the platform, including network errors, and make it possible to change the node that performs machine learning.

To this end, the learning-based machine learning execution management platform system according to the present invention comprises: an external input module that receives a task on which machine learning is to be performed; a worker node that performs machine learning on the input task and provides information on at least one task that is being performed or has been completed; a session node that analyzes the task information provided by the worker node and calculates, for each task, the computing resources required for the machine learning; and a master node that delivers a machine learning execution command for a task newly input through the external input module, wherein the session node predicts the computing resources of the newly input task on the basis of the previously calculated computing resources of tasks and, according to the prediction result, selects the worker node to which the newly input task is assigned.

Preferably, the worker node provides the computing resource information consumed by machine learning of the task in log format.

Preferably, the worker node provides, as the computing resource information, CPU usage, memory usage, disk usage, GPU FLOPS, the total number of epochs of the machine learning task, and the total machine learning execution time.

Preferably, the session node analyzes the computing resources reported by the worker node, calculates the computing cost and execution time per epoch of the machine learning task, and thereby predicts the computing resources the newly input task will require.

Preferably, the session node refers to the CPU usage, memory usage, disk usage, and GPU FLOPS for each epoch, calculates the computing cost as the average of each of the CPU usage, memory usage, and disk usage over all epochs performed, and predicts the newly input task from the average per-epoch CPU usage, memory usage, and disk usage at a given GPU FLOPS.

Preferably, the session node refers to the total epochs of the machine learning task, the total machine learning execution time, and the GPU FLOPS, calculates the execution time per epoch from the total epochs and the total execution time, and predicts the newly input task from the execution time per epoch at a given GPU FLOPS.

Preferably, there are a plurality of session nodes, and each session node manages worker nodes configured with the same framework or with frameworks classified into the same group.

Preferably, the session node adds the computing resources consumed in performing machine learning on the newly input task, thereby updating the record of task computing resources used for prediction.

Preferably, at least one of the master node and the session node performs a network ping test on the worker nodes to extract the nodes to which a task can be assigned.

Preferably, the system further comprises an initial driving module that, when information on tasks on which machine learning has been performed has not accumulated to a level that yields reliable predictions, applies linear interpolation to the information on tasks already performed, wherein the initial driving module uses the task information estimated by linear interpolation to generate the task information used to predict the computing resources of the newly input task.

Meanwhile, the learning-based machine learning execution management method according to the present invention comprises: an information providing step in which a worker node performs machine learning on tasks and provides information on at least one task that is being performed or has been completed; a resource calculation step in which a session node analyzes the task information provided by the worker node and calculates, for each task, the computing resources required for the machine learning; a task registration step in which a new task on which machine learning is to be performed is received through an external input module and registered; a node selection step in which the session node predicts the computing resources of the newly input task on the basis of the previously calculated computing resources of tasks and, according to the prediction result, selects the worker node to which the newly input task is assigned; an execution command step in which a master node delivers a machine learning execution command for the task newly input through the external input module; and a machine learning step in which the worker node receives the machine learning execution command and performs machine learning on the newly input task.

Preferably, the method further comprises a state search step in which at least one of the master node and the session node performs a network ping test on the worker nodes to determine the nodes to which the task can be assigned.

Preferably, the method further comprises a data interpolation step in which, when information on tasks on which machine learning has been performed has not accumulated to a level that yields reliable predictions, an initial driving module applies linear interpolation to the information on tasks already performed, wherein the data interpolation step uses the task information estimated by linear interpolation to generate the task information used to predict the computing resources of the newly input task.

As described above, the present invention analyzes log information on machine learning tasks to predict the computing resources a task will require, and selects an optimal worker node on the basis of that prediction. The task is therefore assigned to, and machine learning is performed on, a worker node that can process the task under optimal conditions.

Furthermore, the present invention uses a ping test to detect abnormalities in the structure or function of the platform caused by failure factors outside the platform. Because an abnormality is detected when it occurs and the node performing machine learning can be changed, machine learning can be carried out continuously and efficiently.

FIG. 1 is an overall configuration diagram of the learning-based machine learning execution management platform system according to the present invention.
FIG. 2 is a connection state diagram of the learning-based machine learning execution management platform system according to the present invention.
FIG. 3 is a flowchart of the learning-based machine learning execution management method according to the present invention.
FIG. 4 shows a specific embodiment of the learning-based machine learning execution management method according to the present invention.

Hereinafter, a learning-based machine learning execution management platform system and method according to preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

As shown in FIG. 1, the learning-based machine learning execution management platform system according to the present invention includes an external input module 10, a worker node 20, a session node 30, and a master node 40. Another preferred embodiment further includes an initial driving module.

With this configuration, a task input through the external input module 10 is registered by the master node 40, and the session node 30 assigns the registered task to a worker node 20 so that machine learning on the task is performed there.

In particular, the present invention analyzes log-format information on machine learning tasks to predict the computing resources a task will require, and selects the optimal worker node 20 on the basis of the prediction for the task on which machine learning is to be performed.

It also detects abnormalities in the structure or function of the platform caused by failure factors outside the platform, including network errors, and makes it possible to change the worker node that performs machine learning.

Before describing the characteristic components of the present invention in detail, the components for integrated remote execution management of machine learning tasks are examined first. In addition to the nodes, the present invention has a separate external network storage (DB-O).

The nodes, including the worker node 20, the session node 30, and the master node 40, are each a kind of computing module capable of running processes for machine learning, and they can connect to the external network storage (DB-O) to process data.

The external network storage (DB-O) provides the datasets for machine learning. A worker node 20 that has set up its machine learning execution environment downloads the dataset from the external network storage (DB-O) and performs machine learning.

The user accesses the platform of the present invention through the external input module 10. The external input module 10 is implemented, for example, as an external program such as an API (Application Programming Interface), so the platform is easy to access.

The user also uses the external input module 10 to input the task on which machine learning is to be performed. The input task is assigned to a session node 30 that matches the framework in which the machine learning will run, and the session node 30 assigns the task to a worker node 20.

When the user then enters, through the external input module 10, a machine learning command to be run on the worker node 20, the command is delivered to the worker node 20 and executed. Typically, issuing a 'machine learning execution command' starts machine learning.

The master node 40 monitors the resources of the worker nodes 20 in the platform, checks the machine learning progress of tasks, uploads tasks to the platform, relays user commands input from the external input module 10, and manages tasks and datasets.

The session node 30 manages worker nodes 20 configured with the same framework or with frameworks classified into the same group. Each session node 30 monitors the resources of its worker nodes 20 and relays commands received from the master node 40 to the worker nodes 20.

The worker node 20 performs the machine learning corresponding to the task received from the session node 30. It also periodically reports its own resource status to the session node 30.

Accordingly, the worker node 20 uses a 'task executor' to set up the machine learning execution environment and download the machine learning dataset from the external network storage (DB-O), and when a machine learning execution command arrives it performs machine learning in the prepared environment.

The worker node 20 also reports its resource status to the session node 30 through a resource management module. Because the resource status is reported to the session node 30 in real time or periodically, it is used in choosing the worker node 20 that will perform a task assigned to the platform.

As described above, the present invention analyzes log information on machine learning tasks to predict the computing resources a task will require, and selects a worker node 20 on the basis of the prediction for the task on which machine learning is to be performed.

To this end, as shown in FIG. 2, the worker node 20 performs machine learning on the task input through the external input module 10 and provides the session node 30 with information on tasks that are being performed or have been completed.

The task information is the computing resource information consumed by machine learning of the task in the worker node 20, and it is provided in log format. Information on at least one task is provided. Preferably, data on machine learning tasks accumulated at least a set number of times is used so that the calculated values are reliable.

In one embodiment, the worker node 20 provides, as the computing resource information, CPU usage, memory usage, disk usage, GPU FLOPS (Graphics Processing Unit FLOPS), the total number of epochs of the machine learning task, and the total machine learning execution time.
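As an illustration of the log-format information listed above, the following is a minimal sketch of how one task record might be represented and parsed. The JSON encoding, the `TaskLog` class, the field names, and the units are assumptions made for this example; the source specifies only which quantities are reported, not their format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskLog:
    """One log record for a machine learning task (hypothetical layout)."""
    task_id: str
    gpu_flops: float      # GPU FLOPS of the worker node that ran the task
    cpu_usage: float      # assumed unit: percent
    memory_usage: float   # assumed unit: MiB
    disk_usage: float     # assumed unit: MiB
    total_epochs: int     # total number of epochs of the task
    total_time_s: float   # total machine learning execution time, seconds


def parse_log_line(line: str) -> TaskLog:
    """Parse one JSON-encoded log line into a TaskLog record."""
    return TaskLog(**json.loads(line))
```

A session node could collect such records from its worker nodes and use them as the input to the cost and time calculations described in the text.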

As described below, when the computing resources comprise a 'computing cost' and an 'execution time', the CPU usage, memory usage, and disk usage are used to calculate the computing cost. The total epochs and total execution time are used to calculate the execution time per epoch, the epoch being the unit of computation.

As described above, the session node 30 analyzes the machine learning task information (that is, the log-format information) provided by the worker node 20 and calculates, for each task, the computing resources required for machine learning.

In particular, the session node 30 predicts the computing resources of a newly input task on the basis of the previously calculated computing resources of tasks and, according to the prediction result, selects the worker node 20 to which the newly input task is assigned.

That is, the session node 30 calculates information on computing resources in advance and uses it to predict the computing resources that a subsequent new machine learning task will use. The worker node 20 is then chosen on that basis.

Accordingly, the session node 30 analyzes the computing resources reported by the worker node 20, calculates the computing cost and execution time per epoch of the machine learning task, and predicts the computing resources the newly input task will require. The predicted execution time can of course also be used to predict the expected end time.

To calculate the 'computing cost', the session node 30 refers to the CPU usage, memory usage, disk usage, and GPU FLOPS for each epoch. From these it computes the average of each of the CPU usage, memory usage, and disk usage over all epochs performed, and these averages determine the computing cost.

Accordingly, the newly input task is predicted from the average per-epoch CPU usage, memory usage, and disk usage at the GPU FLOPS provided by the worker node 20. That is, a prediction for the task is made for each GPU FLOPS value.
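The averaging described above can be sketched as follows. This is a minimal example under stated assumptions: each input row is a hypothetical `(gpu_flops, cpu, memory, disk)` tuple for one epoch, and rows are grouped by exact GPU FLOPS value. The function name and data layout are not taken from the source.

```python
from collections import defaultdict
from statistics import mean


def average_cost_by_flops(epoch_logs):
    """Average per-epoch CPU, memory, and disk usage for each GPU FLOPS value.

    epoch_logs: iterable of (gpu_flops, cpu, memory, disk), one row per epoch.
    Returns {gpu_flops: (avg_cpu, avg_memory, avg_disk)}.
    """
    grouped = defaultdict(list)
    for flops, cpu, memory, disk in epoch_logs:
        grouped[flops].append((cpu, memory, disk))
    # zip(*rows) turns the list of rows into three columns, one per resource.
    return {
        flops: tuple(mean(column) for column in zip(*rows))
        for flops, rows in grouped.items()
    }
```

The resulting table, keyed by GPU FLOPS, is what a session node would consult when it estimates the computing cost of a new task on a candidate worker node.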

On this basis, when the session node 30 assigns a newly input task to a worker node 20, it can refer to the FLOPS of the candidate worker node 20 to predict the computing information needed to perform machine learning on the task.

Therefore, if the predicted computing cost exceeds the hardware performance of the worker node 20, the task is not assigned to that worker node 20. Instead, another worker node 20 capable of performing the new machine learning task is searched for and the task is assigned to it.
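The selection rule above can be sketched as a simple filter. The `Worker` class, its capacity fields, and the choice to return the first fitting node in the given order are assumptions for this example; the source states only that a node whose hardware cannot meet the predicted cost is skipped in favor of another node.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical description of a worker node's hardware."""
    name: str
    gpu_flops: float
    cpu_capacity: float
    memory_capacity: float
    disk_capacity: float


def select_worker(workers, cost_by_flops):
    """Return the first worker whose hardware covers the predicted cost.

    cost_by_flops: {gpu_flops: (cpu, memory, disk)} predicted per-epoch cost.
    Returns None when no worker qualifies.
    """
    for worker in workers:
        cost = cost_by_flops.get(worker.gpu_flops)
        if cost is None:
            continue  # no history for this FLOPS value, so no prediction
        cpu, memory, disk = cost
        if (cpu <= worker.cpu_capacity
                and memory <= worker.memory_capacity
                and disk <= worker.disk_capacity):
            return worker
    return None
```

Passing the workers sorted by preference (for example, fastest first) would reproduce the behavior of choosing the most capable node that can still run the task.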

To calculate the 'execution time', the session node 30 refers to the total epochs of the machine learning task, the total machine learning execution time, and the GPU FLOPS. From the total epochs and the total execution time it calculates the execution time per epoch, which is the total execution time divided by the total number of epochs.
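The division described above can be sketched per GPU FLOPS value as follows. Pooling several tasks that ran at the same FLOPS value (summing their times and epochs before dividing) is an assumption of this example, as are the function name and the tuple layout.

```python
from collections import defaultdict


def epoch_time_by_flops(task_logs):
    """Execution time per epoch for each GPU FLOPS value.

    task_logs: iterable of (gpu_flops, total_epochs, total_time_s).
    Returns {gpu_flops: seconds_per_epoch}.
    """
    totals = defaultdict(lambda: [0.0, 0])  # flops -> [seconds, epochs]
    for flops, epochs, seconds in task_logs:
        totals[flops][0] += seconds
        totals[flops][1] += epochs
    # Skip FLOPS values with zero recorded epochs to avoid dividing by zero.
    return {
        flops: seconds / epochs
        for flops, (seconds, epochs) in totals.items()
        if epochs > 0
    }
```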

Accordingly, the newly input task is predicted from the execution time per epoch at the GPU FLOPS provided by the worker node 20. That is, a corresponding execution time is calculated for each GPU FLOPS value. Because the execution time is the time from a given point until the end, the end time can also be calculated from it.

In this way, the execution time per epoch is calculated for each FLOPS value. When a new machine learning task is later assigned, its end time is predicted from the total epochs of the new task and the FLOPS of the candidate worker node 20.
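The end-time prediction can be sketched as the start time plus the total epochs multiplied by the per-epoch time at the candidate node's FLOPS value. The function names and the use of seconds are assumptions for this example.

```python
from datetime import datetime, timedelta


def predict_duration_s(total_epochs: int, seconds_per_epoch: float) -> float:
    """Predicted execution time: total epochs times time per epoch."""
    if total_epochs < 0 or seconds_per_epoch < 0:
        raise ValueError("epochs and per-epoch time must be non-negative")
    return total_epochs * seconds_per_epoch


def predict_end_time(start: datetime, total_epochs: int,
                     seconds_per_epoch: float) -> datetime:
    """Predicted end time of a task that starts at `start`."""
    return start + timedelta(seconds=predict_duration_s(total_epochs,
                                                        seconds_per_epoch))
```

For instance, a task of 20 epochs on a node whose history gives 360 seconds per epoch would be expected to finish two hours after it starts.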

Preferably, there are a plurality of session nodes 30, and each session node 30 manages worker nodes 20 configured with the same framework or with frameworks classified into the same group.

Because each session node 30 manages worker nodes 20 configured with the same framework or the same group of frameworks, a session node 30 is assigned according to the input machine learning task, and a worker node 20 is then selected from within the frameworks that session node manages collectively.
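The framework-based routing described above can be sketched as a lookup from the task's framework to the session node that manages it. The mapping structure, the function name, and any framework names used with it are hypothetical; the source does not name specific frameworks or say how groups are represented.

```python
def route_to_session(task_framework, sessions):
    """Return the name of the session node that manages the given framework.

    sessions: {session_name: set of framework names in that session's group}.
    Raises LookupError when no session node manages the framework.
    """
    for session_name, frameworks in sessions.items():
        if task_framework in frameworks:
            return session_name
    raise LookupError(f"no session node manages framework {task_framework!r}")
```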

Preferably, the session node 30 adds the computing resources consumed in performing machine learning on the newly input task, thereby updating the record of task computing resources used for prediction.

A machine learning task is thus assigned to a worker node 20 on the basis of a prediction, and the information on the task actually performed on that worker node 20 is added and reflected in the next prediction. Repeating this process continuously raises the reliability of the predictions over time.
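The feedback loop above amounts to folding each newly observed value into the stored statistic. A minimal sketch, assuming the stored statistic is a plain average and using a hypothetical `RunningMean` helper, updates the average incrementally so earlier records need not be kept:

```python
class RunningMean:
    """Incrementally updated average of observed values."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def add(self, value: float) -> float:
        """Fold one new observation into the average and return it."""
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean
```

One such accumulator per GPU FLOPS value and per resource (CPU, memory, disk, time per epoch) would keep the prediction record current as tasks complete.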

The master node 40 delivers the machine learning execution command for a task newly input through the external input module 10. The command from the master node 40 is delivered to the worker node 20 through, for example, the session node 30.

In addition, as described above, the master node 40 assigns the task input through the external input module 10 to a session node 30 according to the framework configuration, and the session node 30 in turn assigns a worker node 20.

The initial driving module expands the data sample by adding estimated data when the information on tasks for which machine learning has been performed has not accumulated to a level that yields reliable predictions. In one embodiment, the initial driving module may be implemented integrally with the master node 40.

In the early stage of platform operation, or for GPU flops values that are rarely used, too little machine learning task information has accumulated, so predicting a newly input task from an average value or the like is difficult or unreliable.

Accordingly, the initial driving module applies linear interpolation to the small number of task records for which machine learning has already been performed, and uses the interpolated task information to generate the task information needed to predict the computing resources of the newly input task.
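The interpolation performed by the initial driving module might be sketched as follows. The description does not state which variable is interpolated over, so this sketch assumes that the per-epoch execution time is interpolated across GPU flops values; the function name `interpolate_epoch_time`, the clamping at the ends of the observed range, and the sample figures are all illustrative assumptions.

```python
from bisect import bisect_left


def interpolate_epoch_time(known, gpu_flops):
    """Linearly interpolate seconds-per-epoch at `gpu_flops`.

    `known` maps an observed GPU flops value to its observed seconds per
    epoch and must hold at least two points. Queries outside the observed
    range are clamped to the nearest endpoint (an assumption of this sketch).
    """
    xs = sorted(known)
    if len(xs) < 2:
        raise ValueError("need at least two observed points")
    if gpu_flops <= xs[0]:
        return known[xs[0]]
    if gpu_flops >= xs[-1]:
        return known[xs[-1]]
    i = bisect_left(xs, gpu_flops)      # first index with xs[i] >= gpu_flops
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = known[x0], known[x1]
    return y0 + (y1 - y0) * (gpu_flops - x0) / (x1 - x0)


# Two sparse observations (hypothetical values): flops value -> s/epoch.
observed = {4.0: 120.0, 8.0: 60.0}
print(interpolate_epoch_time(observed, 6.0))  # 90.0
```

The interpolated value stands in for a missing record until enough real measurements have accumulated for that GPU flops value.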

Meanwhile, in another embodiment, the present invention detects abnormalities in the platform structure or function caused by failure factors outside the platform, including network errors, and makes it possible to change the node that performs machine learning.

To this end, at least one of the master node 40 and the session node 30 performs a network ping test on the worker nodes 20 to extract the nodes to which a task can be assigned. When the master node 40 performs the ping test, it may also monitor the session node 30.

A ping test is a network signaling technique that sends a test signal and checks whether a response comes back. Here, the ping command is used to confirm that nodes exchange signals normally and to check connection speed, disconnections, and the like.
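As a rough illustration of extracting assignable nodes with the ping command, the sketch below shells out to `ping` once per node. It assumes a Linux host with the iputils `ping` (the `-c` count and `-W` timeout flags differ on other systems); the helper names `ping_once` and `assignable_nodes` and the injectable `probe` parameter are hypothetical and not taken from the description.

```python
import subprocess


def ping_once(host, timeout_s=1):
    """Return True if `host` answers a single ICMP echo request.

    Assumes Linux iputils flags: -c <count>, -W <timeout in seconds>.
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,
        )
    except (subprocess.TimeoutExpired, OSError):
        # No reply in time, or the ping binary is unavailable.
        return False
    return result.returncode == 0


def assignable_nodes(nodes, probe=ping_once):
    """Return the names of nodes whose host passed the probe.

    `nodes` maps a node name to its host address.
    """
    return [name for name, host in nodes.items() if probe(host)]
```

Passing the probe as a parameter keeps the selection logic testable without real network traffic, for example `assignable_nodes(nodes, probe=lambda host: True)`.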

Since the master node 40 and the session node 30 are both a kind of management node on the network for performing machine learning, the ping test may be performed by either of them.

In the present invention, the worker node 20 may also perform a ping test to notify the management node that it is ready to proceed with machine learning. However, performing the ping test at the master node 40 or the session node 30, which assign tasks, is advantageous from a management standpoint.

Therefore, by running the ping test at the master node 40 or the session node 30 (hereinafter, the 'master node'), the platform can determine when a node has run into a problem for any of various reasons and can no longer play its role in the platform, and can respond immediately.

For example, network ping can be used to recognize situations in which a node can no longer continue its work, such as a node shutdown caused by overload during platform operation, a shutdown caused by a hardware fault, or a node becoming unreachable because of a network error.

The master node 40 sends a network ping to each node in the platform and determines from the reply whether the node is active. If a particular node does not respond for a certain period of time or longer, the node is judged to have failed and to be no longer functioning as a node.

When a failure occurs at a particular node, the master node 40 notifies the user of the failure and waits for a response from that node. If the node reconnects and responds within a certain period after the failure, the node's work is resumed.

If, on the other hand, the node does not reconnect within that period, it is judged to be no longer able to function as a node within the platform. In this case, the machine learning task is assigned to another worker node 20 on the basis of the task's job information stored in the platform DB (DB-I).
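The failure handling described above can be read as a small state decision followed by reassignment. The sketch below is one possible reading: the state names, the two timeout values, and the round-robin choice of replacement worker are assumptions made for illustration, since the description only speaks of "a certain period of time" and of reassigning from the job information in the platform DB.

```python
from enum import Enum


class NodeState(Enum):
    ACTIVE = "active"
    WAITING = "waiting"   # failure reported to the user, awaiting reconnection
    FAILED = "failed"     # no longer treated as a node; tasks are reassigned


def classify_node(seconds_since_reply, response_timeout=5.0, reconnect_window=60.0):
    """Classify a node from the time elapsed since its last ping reply."""
    if seconds_since_reply < response_timeout:
        return NodeState.ACTIVE
    if seconds_since_reply < response_timeout + reconnect_window:
        return NodeState.WAITING
    return NodeState.FAILED


def reassign_tasks(assignments, failed_node, healthy_nodes):
    """Return a new task -> worker mapping with the failed node's tasks moved.

    Replacement workers are chosen round-robin (an assumption of this sketch).
    """
    if not healthy_nodes:
        raise ValueError("no healthy worker node available")
    orphaned = [task for task, node in assignments.items() if node == failed_node]
    moved = {
        task: healthy_nodes[i % len(healthy_nodes)]
        for i, task in enumerate(orphaned)
    }
    return {**assignments, **moved}
```

A node in the `WAITING` state keeps its tasks, which matches the behaviour of resuming work when the node reconnects in time.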

In the case of a session node 30, a new session node 30 is created and its information is updated from the information of the existing session node 30 stored in the platform DB (DB-I). The worker nodes 20 that were managed by the existing session node 30 are then assigned to and managed by the new session node 30.

If the failed node reconnects after this, a worker node 20 resets the work that was in progress and is reassigned to a session node 30. A session node 30 resets its information and waits until a new session node 30 needs to be created and assigned.

Hereinafter, the learning-based machine learning execution management method according to the present invention is described with reference to the accompanying drawings.

In a preferred embodiment, the learning-based machine learning execution management method according to the present invention is implemented on the platform system described above. Redundant description is therefore omitted below wherever possible.

As shown in FIGS. 3 and 4, the learning-based machine learning execution management method according to the present invention includes an information providing step (S10), a resource calculation step (S20), a task registration step (S30), a node selection step (S40), an execution command step (S50), and a machine learning step (S60).

Furthermore, in another preferred embodiment, the method further includes a state search step of performing a ping test periodically at regular intervals, and a data interpolation step (S1) of applying linear interpolation to insufficient data, for example at initial platform startup.

In the information providing step (S10), the worker node 20 performs machine learning on tasks and provides information on at least one task that is being performed or has been completed.

The machine learning task information provided by the worker node 20 is the computing resource information consumed by machine learning of the task at the worker node 20, and it is provided in log format to the session node 30, which is the layer above.

In an embodiment, the worker node 20 provides CPU usage, memory usage, disk usage, GPU flops (Graphics Processing Unit FLOPS), the total number of epochs of the machine learning task, and the total machine learning execution time as the computing resource information.
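One way to picture the log record a worker node might send is a small data class serialized to a single line. The six fields follow the list above; the field names, the units, the `task_id` field, and the choice of JSON as the log format are assumptions, since the description specifies only that the information is provided in log format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TaskLog:
    task_id: str          # hypothetical identifier, not named in the description
    cpu_usage: float      # unit assumed: percent
    memory_usage: float   # unit assumed: GiB
    disk_usage: float     # unit assumed: GiB
    gpu_flops: float      # GPU performance figure of the worker node
    total_epochs: int
    total_time_s: float   # total machine learning execution time, in seconds

    def to_log_line(self):
        """Serialize the record as one JSON log line with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)


record = TaskLog("task-1", 55.0, 3.2, 10.5, 8.0, 20, 3600.0)
print(record.to_log_line())
```

A session node could parse such lines with `json.loads` before computing per-epoch statistics.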

Accordingly, when predicting the computing resources required for a newly input task, CPU usage, memory usage, and disk usage are used to calculate the 'computing cost', and the total number of epochs and the total execution time are used to calculate the 'execution time' per epoch.

Next, in the resource calculation step (S20), the session node 30 analyzes the task information provided by the worker node 20 as above and calculates the computing resources consumed by machine learning for each task.

That is, the session node 30 calculates the computing resource information in advance and, on that basis, predicts the computing resources expected to be used by a subsequent task on which new machine learning is to be performed. The worker node 20 can then be determined in a later step on the basis of this prediction.

Specifically, the session node 30 analyzes the computing resource information provided by the worker node 20, calculates the per-epoch computing cost and execution time of the machine learning task, and thereby predicts the computing resources required by the newly input task.

Next, in the task registration step (S30), a new task on which machine learning is to be performed is received through the external input module 10 and registered. For example, a machine learning task ordered by a user is newly input, and the master node 40 registers the input task in the platform DB (DB-I).

The machine learning task registered by the master node 40 is assigned to the session node 30 judged suitable according to the framework structure. The session node 30 then predicts the computing resources as described below and determines the worker node 20 that will perform the machine learning.

Next, in the node selection step (S40), the session node 30 predicts the computing resources of the newly input task on the basis of the previously calculated task computing resources, and selects the worker node 20 to which the newly input task is assigned according to the prediction result.

Therefore, when a machine learning task is assigned to a worker node 20 in an integrated remote execution platform for machine learning, the optimal node is selected by considering not only the performance of the worker node 20 itself but also the computing resources the machine learning task requires.

Specifically, in the node selection step (S40), the session node 30 refers to the CPU usage, memory usage, disk usage, and GPU flops for each epoch in order to calculate the computing cost. From these it calculates the average of each of CPU usage, memory usage, and disk usage over all completed epochs, and these averages determine the computing cost.
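The averaging that yields the computing cost could look like the sketch below. The per-epoch sample layout and the key names `cpu`, `memory`, and `disk` are assumptions; the description does not say how the three averages are combined into a single figure, so the sketch simply returns all three.

```python
from statistics import mean

RESOURCE_KEYS = ("cpu", "memory", "disk")


def computing_cost(epoch_samples):
    """Average CPU, memory, and disk usage over all completed epochs.

    `epoch_samples` is a list with one dict per epoch, each holding the
    keys in RESOURCE_KEYS.
    """
    if not epoch_samples:
        raise ValueError("no completed epochs to average")
    return {key: mean(sample[key] for sample in epoch_samples) for key in RESOURCE_KEYS}


# Hypothetical measurements for two epochs.
samples = [
    {"cpu": 40.0, "memory": 2.0, "disk": 10.0},
    {"cpu": 60.0, "memory": 4.0, "disk": 10.0},
]
print(computing_cost(samples))  # {'cpu': 50.0, 'memory': 3.0, 'disk': 10.0}
```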

In addition, the session node 30 refers to the total number of epochs of the machine learning task, the total machine learning execution time, and the GPU flops in order to calculate the execution time. The execution time per epoch is calculated from the total number of epochs and the total execution time, by dividing the total execution time by the total number of epochs.
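As a worked example of this division, a task that took 3600 seconds over 20 epochs has an execution time of 180 seconds per epoch. The sketch below also scales that figure to a planned number of epochs; that second step is an assumption about how the per-epoch value might be used, and both function names are illustrative.

```python
def seconds_per_epoch(total_time_s, total_epochs):
    """Execution time per epoch = total execution time / total epochs."""
    if total_epochs <= 0:
        raise ValueError("total_epochs must be positive")
    return total_time_s / total_epochs


def estimate_total_time(per_epoch_s, planned_epochs):
    """Scale a per-epoch time to a planned epoch count (assumed usage)."""
    return per_epoch_s * planned_epochs


per_epoch = seconds_per_epoch(3600.0, 20)
print(per_epoch)                           # 180.0
print(estimate_total_time(per_epoch, 50))  # 9000.0
```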

Accordingly, for a newly input task that needs a worker node 20 assigned for machine learning, the computing resources needed to perform the machine learning are predicted, and the task is assigned to a worker node 20 that satisfies the predicted computing resources.
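A minimal sketch of choosing a worker node that satisfies the predicted resources is given below. The description requires only that the chosen node satisfy the prediction; taking the first satisfying node in iteration order, as well as the dictionary layout and names, are assumptions of this sketch.

```python
RESOURCES = ("cpu", "memory", "disk")


def select_worker(free_by_worker, predicted):
    """Return the first worker whose free resources cover `predicted`.

    `free_by_worker` maps a worker name to its free amount of each resource.
    Returns None when no worker satisfies the prediction.
    """
    for name, free in free_by_worker.items():
        if all(free[r] >= predicted[r] for r in RESOURCES):
            return name
    return None


# Hypothetical free capacity per worker and a predicted requirement.
workers = {
    "worker-1": {"cpu": 30.0, "memory": 8.0, "disk": 100.0},
    "worker-2": {"cpu": 70.0, "memory": 16.0, "disk": 100.0},
}
print(select_worker(workers, {"cpu": 50.0, "memory": 3.0, "disk": 10.0}))  # worker-2
```

A different tie-breaking policy, such as preferring the node with the highest GPU flops, could replace the first-match rule without changing the feasibility check.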

In addition, the session node 30 adds the computing resources consumed in performing machine learning on the newly input task and updates the computing resource record used for later predictions. By repeating this process of feeding results back into the prediction, reliability gradually improves.
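The record update can be pictured as a history keyed by GPU flops whose prediction shifts as measurements are appended. The class name `ResourceHistory`, the use of a plain arithmetic mean, and the restriction to per-epoch execution time are assumptions chosen to keep the sketch short.

```python
from collections import defaultdict


class ResourceHistory:
    """Per-epoch execution times grouped by GPU flops value."""

    def __init__(self):
        self._records = defaultdict(list)

    def add(self, gpu_flops, per_epoch_s):
        """Append a measurement taken after a task actually ran."""
        self._records[gpu_flops].append(per_epoch_s)

    def predict(self, gpu_flops):
        """Average of the recorded values, or None if nothing is recorded."""
        samples = self._records.get(gpu_flops)
        if not samples:
            return None
        return sum(samples) / len(samples)


history = ResourceHistory()
history.add(8.0, 60.0)
print(history.predict(8.0))  # 60.0
history.add(8.0, 80.0)       # a newly completed task refines the estimate
print(history.predict(8.0))  # 70.0
```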

Next, in the execution command step (S50), the master node 40 transmits a machine learning execution command for the task newly input through the external input module 10. As one example, the execution command is delivered from the master node 40 to the session node 30, and from the session node 30 to the worker node 20.
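The relay path from master node to session node to worker node can be illustrated with three small classes. Everything here is a hypothetical in-process stand-in: the class and method names, the framework label `framework-a`, and the returned status string are assumptions, and a real platform would pass the command over the network.

```python
class WorkerNode:
    def __init__(self, name):
        self.name = name
        self.executed = []

    def execute(self, task_id):
        # Stand-in for actually starting the machine learning job.
        self.executed.append(task_id)
        return f"{self.name}: started {task_id}"


class SessionNode:
    def __init__(self, workers):
        self.workers = {worker.name: worker for worker in workers}

    def forward(self, task_id, worker_name):
        # Relay the command to the worker selected for this task.
        return self.workers[worker_name].execute(task_id)


class MasterNode:
    def __init__(self, sessions):
        self.sessions = sessions  # framework label -> SessionNode

    def send_execute(self, task_id, framework, worker_name):
        # The master addresses the session node in charge of the framework.
        return self.sessions[framework].forward(task_id, worker_name)


worker = WorkerNode("worker-1")
master = MasterNode({"framework-a": SessionNode([worker])})
print(master.send_execute("task-42", "framework-a", "worker-1"))
# worker-1: started task-42
```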

When the machine learning execution command from the user is delivered in this way, in the machine learning step (S60) the worker node 20 receives the execution command and performs machine learning on the newly input task.

To perform machine learning, the worker node 20, as one example, sets up the machine learning execution environment through a 'task executor', downloads the machine learning dataset from the external network storage (DB-O), and performs the machine learning.

Meanwhile, in the state search step (not shown), at least one of the master node 40 and the session node 30 performs a network ping test on other nodes, such as the worker nodes 20, to determine the nodes to which a task can be assigned.

The state search step, in which the ping test is run, can be executed both before the platform starts operating and while it is running.

It can therefore be performed during any of the information providing step (S10), the resource calculation step (S20), the task registration step (S30), the node selection step (S40), the execution command step (S50), and the machine learning step (S60), as well as before each of these steps.

Next, in the data interpolation step (S1), it is determined whether the information on tasks for which machine learning has been performed has accumulated to a level that yields reliable predictions (S1a). If it is determined that the information has not accumulated sufficiently, the initial driving module described above applies linear interpolation to the information on the tasks for which machine learning has been performed.

In the early stage of platform operation, or for GPU flops values that are rarely used, too little information on machine learning tasks has accumulated, so predicting a newly input task from an average value or the like is meaningless, or the reliability of the calculated result is low.

Therefore, the initial driving module applies linear interpolation to the information on tasks for which machine learning has already been performed, and uses the interpolated task information to generate the task information needed to predict the computing resources of the newly input task.
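The decision in step S1a might be expressed as a simple threshold on the number of accumulated samples. The description does not define what counts as sufficient accumulation, so the threshold `MIN_SAMPLES`, the function name, and the `fallback` callable that supplies the interpolated estimate are all assumptions of this sketch.

```python
from statistics import mean

MIN_SAMPLES = 3  # hypothetical threshold for "sufficiently accumulated"


def predict_epoch_time(samples, fallback, min_samples=MIN_SAMPLES):
    """Return (estimate, source) for the per-epoch execution time.

    Uses the average when enough samples exist (S1a: yes); otherwise calls
    `fallback`, which is expected to return an interpolated estimate.
    """
    if len(samples) >= min_samples:
        return mean(samples), "average"
    return fallback(), "interpolated"


print(predict_epoch_time([60.0, 62.0, 64.0], lambda: 90.0))  # (62.0, 'average')
print(predict_epoch_time([60.0], lambda: 90.0))              # (90.0, 'interpolated')
```

Returning the source alongside the estimate makes it easy to tell later which predictions rested on interpolated rather than measured data.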

As described above, the present invention stores, in log form, the information on machine learning tasks previously executed by the worker nodes 20, in order to improve the performance of an existing integrated remote learning execution management platform for machine learning.

From this information it predicts the information of the next machine learning task to be executed, so that machine learning tasks are distributed more efficiently. It also checks the network status periodically so that the platform can respond flexibly to failures.

Specific embodiments of the present invention have been described above. However, those of ordinary skill in the art to which the present invention pertains will understand that the spirit and scope of the present invention are not limited to these specific embodiments, and that various modifications and variations are possible without departing from the gist of the present invention.

Accordingly, the embodiments described above are provided to fully convey the scope of the invention to those of ordinary skill in the art to which the present invention pertains. They should be understood as illustrative in all respects and not restrictive, and the present invention is defined only by the scope of the claims.

10: External input module
20: Worker node
30: Session node
40: Master node
DB-I: Platform DB
DB-O: External network storage

Claims

1. A learning-based machine learning execution management platform system comprising:
an external input module 10 that receives a task on which machine learning is to be performed;
a worker node 20 that performs machine learning on the input task and provides computing resource information on at least one task that is being performed or has been completed;
a session node 30 that analyzes the computing resource information on the task provided by the worker node 20 and calculates the computing resources required for the machine learning for each task; and
a master node 40 that transmits a machine learning execution command for a task newly input through the external input module 10,
wherein the session node 30 predicts the computing resources of the newly input task on the basis of the previously calculated task computing resources and selects the worker node 20 to which the newly input task is assigned according to the prediction result, and
wherein the worker node 20 provides CPU usage, memory usage, disk usage, GPU flops, the total number of epochs of the machine learning task, and the total machine learning execution time as the computing resource information.

2. The system of claim 1,
wherein the worker node 20 provides, in log format, the computing resource information consumed by machine learning of a task.

3. (Deleted)

4. The system of claim 1,
wherein the session node 30 analyzes the computing resource information provided by the worker node 20, calculates the per-epoch computing cost and execution time of the machine learning task, and thereby predicts the computing resources required by the newly input task.

5. The system of claim 4,
wherein the session node 30 refers to the CPU usage, memory usage, disk usage, and GPU flops for each epoch,
calculates the computing cost as the average of each of CPU usage, memory usage, and disk usage over all completed epochs, and
predicts the newly input task using the per-epoch average CPU usage, memory usage, and disk usage at the GPU flops.

6. The system of claim 4,
wherein the session node 30 refers to the total number of epochs of the machine learning task, the total machine learning execution time, and the GPU flops,
calculates the execution time per epoch from the total number of epochs of the machine learning task and the total machine learning execution time, and
predicts the newly input task using the execution time per epoch at the GPU flops.

7. The system of claim 1,
wherein a plurality of session nodes 30 are provided, and
each session node 30 manages worker nodes 20 configured with frameworks that are identical or classified into the same group.

8. The system of claim 1,
wherein the session node 30 adds the computing resources consumed in performing machine learning on the newly input task and thereby updates the record of task computing resources used for the prediction.

9. The system of claim 1,
wherein at least one of the master node 40 and the session node 30 performs a network ping test on the worker nodes 20 to extract the nodes to which a task can be assigned.

10. The system of claim 1,
further comprising an initial driving module that, when the information on tasks for which machine learning has been performed has not accumulated to a level that yields reliable predictions, applies linear interpolation to the information on tasks for which machine learning has already been performed,
wherein the initial driving module uses the task information estimated by the linear interpolation to generate task information for predicting the computing resources of the newly input task.

11. A learning-based machine learning execution management method comprising:
an information providing step (S10) in which a worker node 20 performs machine learning on a task and provides computing resource information on at least one task that is being performed or has been completed;
a resource calculation step (S20) in which a session node 30 analyzes the computing resource information on the task provided by the worker node 20 and calculates the computing resources required for the machine learning for each task;
a task registration step (S30) of receiving, through an external input module 10, and registering a new task on which machine learning is to be performed;
a node selection step (S40) in which the session node 30 predicts the computing resources of the newly input task on the basis of the previously calculated task computing resources and selects the worker node 20 to which the newly input task is assigned according to the prediction result;
an execution command step (S50) in which a master node 40 transmits a machine learning execution command for the task newly input through the external input module 10; and
a machine learning step (S60) in which the worker node 20 receives the machine learning execution command and performs machine learning on the newly input task,
wherein the worker node 20 provides CPU usage, memory usage, disk usage, GPU flops, the total number of epochs of the machine learning task, and the total machine learning execution time as the computing resource information.

12. The method of claim 11,
further comprising a state search step in which at least one of the master node 40 and the session node 30 performs a network ping test on the worker nodes 20 to determine the nodes to which a task can be assigned.

13. The method of claim 11,
further comprising a data interpolation step (S1) in which, when the information on tasks for which machine learning has been performed has not accumulated to a level that yields reliable predictions, an initial driving module applies linear interpolation to the information on tasks for which machine learning has already been performed,
wherein, in the data interpolation step (S1), the task information estimated by the linear interpolation is used to generate task information for predicting the computing resources of the newly input task.