KR102266324B1

KR102266324B1 - Worker node management method of managing execution platform and platform system for the same

Info

Publication number: KR102266324B1
Application number: KR1020200034648A
Authority: KR
Inventors: 홍지만; 김영관; 박기철; 이주석
Original assignee: 숭실대학교산학협력단
Priority date: 2020-02-28
Filing date: 2020-03-20
Publication date: 2021-06-17

Abstract

The present invention relates to a method for managing worker nodes in a machine learning execution management platform, and a platform system therefor, which improve the reconfiguration performance of worker nodes by speeding up the operations for adding or deleting worker nodes and which can respond flexibly to situations arising within the platform. The method includes the steps of: registering, in a deletion list, a node to be deleted from among the worker nodes recorded in a node list of a session node; searching, in the session node that has been requested by a master node to assign a task, for the highest-priority worker node to perform machine learning; comparing, in the session node, a priority ID indicating the highest-priority worker node with a deletion registration ID registered in the deletion list; deleting the highest-priority worker node from the node list when the priority ID and the deletion registration ID are the same; assigning the task so that machine learning is performed on the worker node corresponding to the priority ID when the priority ID and the deletion registration ID differ; and performing machine learning, according to a machine learning execution command, on the worker node to which the task has been assigned.

Description

WORKER NODE MANAGEMENT METHOD OF MANAGING EXECUTION PLATFORM AND PLATFORM SYSTEM FOR THE SAME

The present invention relates to a method for managing worker nodes in a machine learning execution management platform, and a platform system therefor, which improve the reconfiguration performance of worker nodes by speeding up the operations for adding or deleting worker nodes and which can respond flexibly to situations arising within the platform.

Machine learning is a field of artificial intelligence (AI) that evolved from the study of pattern recognition and computational learning theory. It builds systems and algorithms that learn and predict from empirical data and improve their own performance.

An integrated remote execution platform for such machine learning consists, from its top layer down, of a master node, session nodes, and worker nodes. It also has a separate external network storage, and users access the platform through an external program.

Accordingly, when a user who has accessed the platform registers a task, that is, a target on which machine learning is to be performed, the registered task is first assigned to the session node that matches the framework in which the machine learning will run.

The session node assigns the task to the worker node that is able to perform it and has the best computational capability. When the user issues a machine learning command to be executed on that worker node, machine learning is performed according to the command.

However, the conventional integrated remote execution platform for machine learning sorts and assigns worker nodes solely by FLOPS, the computational capability of the CPU/GPU, so cases that require reconfiguration are difficult to handle.

That is, situations that require the worker node list to be restructured, such as the insertion of additional worker nodes, their arbitrary removal, or their deletion due to failure, are not taken into account. As a result, changes in the number of worker nodes, such as expansion or contraction of the platform, are difficult to handle.

US registered patent US 16/248,560
Republic of Korea Patent Publication No. 10-2012-0001688

The present invention is intended to solve the problems described above. It aims to provide a method for managing worker nodes in a machine learning execution management platform, and a platform system therefor, that improve worker node reconfiguration performance by speeding up the operations for adding or deleting worker nodes and that can respond flexibly to situations arising within the platform.

To this end, the method for managing worker nodes in a machine learning execution management platform according to the present invention comprises: a deletion node registration step in which a session node registers, in a deletion list, a node to be deleted from among the worker nodes recorded in a node list; a worker node search step in which the session node, having been requested by a master node to assign a task, searches for the highest-priority worker node to perform machine learning; a node validity determination step in which the session node compares a priority ID indicating the highest-priority worker node with a deletion registration ID registered in the deletion list; a node list cleanup step in which, when the priority ID and the deletion registration ID are the same, the session node deletes the highest-priority worker node from the node list; a task assignment step in which, when the priority ID and the deletion registration ID differ, the session node assigns the task so that machine learning is performed on the worker node corresponding to the priority ID; and a machine learning step in which machine learning is performed, according to a machine learning execution command, on the worker node to which the task has been assigned.

Preferably, the method further comprises: a worker node analysis step in which the session node analyzes the performance of the worker nodes; and a node list generation step of generating the node list, organized as a heap data structure, according to the performance order of the analyzed worker nodes.

Preferably, the session node classifies the performance of the worker nodes by FLOPS, the computational capability of the CPU and GPU, and sorts the node list in units of gigaflops (GFLOPS).

Preferably, the method further comprises: a deletion table generation step in which the session node generates a deletion list table in which the deletion registration ID is recorded; and an excluded node analysis step in which the master node detects a worker node to be deleted, which is to be registered in the deletion list table.

Preferably, the deletion list table is a table with a hash map structure, in which mapping is performed according to key values generated by a hash function.

Preferably, the method further comprises a node assignment request step in which a request is received from the master node to assign a task, on which machine learning is to be performed, to a worker node, wherein the master node requests the task assignment from the session node that manages the worker nodes configured with the framework in which the machine learning is performed.

Preferably, in the deletion node registration step, the master node identifies a worker node that has been arbitrarily excluded or has failed and requests its deletion from the session node, and the session node registers the deletion registration ID in the deletion list having the hash map structure.

Preferably, in the worker node search step, the session node retrieves the topmost node, the root node, of the node list organized as the heap data structure as the highest-priority worker node.

Preferably, in the node validity determination step, the session node compares the priority ID of the worker node corresponding to the root node with the deletion registration ID registered in the deletion list.

Preferably, in the node list cleanup step, when the priority ID and the deletion registration ID are the same, a pop operation that deletes the root node is performed on the node list, after which the node list is restructured as a heap.

Preferably, the method further comprises a deletion list update step of removing from the deletion list the deletion registration ID of the root node deleted by the pop operation.

Preferably, in the task assignment step, the session node assigns the task to the worker node corresponding to the root node.

Preferably, the session node assigns the task to the worker node corresponding to the root node, performs a pop operation that deletes the root node, and then restructures the node list as a heap.

Meanwhile, the machine learning execution management platform on which the worker node management method according to the present invention is performed preferably comprises: an external input module that receives a task on which machine learning is to be performed; worker nodes that perform machine learning on the input task, a plurality of which are registered in a node list; a session node that registers, in a deletion list, a node to be deleted from among the worker nodes recorded in the node list; and a master node that receives a task assignment request from a user and instructs the session node to search for the highest-priority worker node to perform the machine learning, wherein the session node compares a priority ID designating the highest-priority worker node with a deletion registration ID registered in the deletion list, deletes the ID of the highest-priority worker node from the node list when the priority ID and the deletion registration ID are the same, and assigns the task so that machine learning is performed on the worker node corresponding to the priority ID when the two IDs differ.

As described above, the present invention improves the internal speed of a previously implemented machine learning platform by means of a node list that records the worker nodes and a deletion list that records the worker nodes that need to be excluded.

In addition, implementing the worker node list as a heap data structure speeds up restructuring of, insertion into, and selection from the node list, while the deletion list compensates for the search and delete operations that are slow in a heap data structure.

FIG. 1 is an overall configuration diagram of the machine learning execution management platform system according to the present invention.
FIG. 2 is a configuration diagram of the node list and the deletion list according to the present invention.
FIG. 3 is a flowchart illustrating the method for managing worker nodes in the machine learning execution management platform according to the present invention.
FIG. 4 is a detailed flowchart illustrating the deletion node registration step according to the present invention.

Hereinafter, a method for managing worker nodes in a machine learning execution management platform according to a preferred embodiment of the present invention, and a platform system therefor, will be described in detail with reference to the accompanying drawings.

As shown in FIG. 1, the machine learning execution management platform system according to the present invention includes an external input module (10), worker nodes (20), a session node (30), and a master node (40). In a preferred embodiment, it further includes an external network storage (DB-O) and an internal platform management DB (DB-I).

With this configuration, the master node (40) registers a task input by the user through the external input module (10), and the session node (30) assigns the registered task to a worker node (20), where machine learning is performed on the task.

In particular, as shown in FIG. 2, the present invention improves the internal speed of a previously implemented machine learning platform by means of a node list (31) that records the worker nodes (20) and a deletion list (32) that records the worker nodes (20) that need to be excluded.

That is, instead of directly searching the worker nodes (20) recorded in the node list (31), the IDs of the worker nodes (20) registered in advance in the deletion list (32) are consulted, so that one or more worker nodes (20) to be excluded can be identified quickly.

In addition, the list of worker nodes (20) is implemented as a heap data structure to speed up restructuring of, insertion into, and selection from the node list (31), while the search and delete operations that are slow in a heap data structure are compensated for by the deletion list (32).

Accordingly, the present invention improves the reconfiguration performance of the worker nodes (20) by speeding up the operations for adding or deleting worker nodes (20), and makes it possible to respond flexibly to situations arising within the platform.

Before these characteristic functions are described in detail, the integrated remote execution management technology to which the present invention can be applied is reviewed first. The nodes (20, 30, 40) are implemented in a hierarchical structure to deliver commands and allocate resources.

Each of the nodes, including the worker nodes (20), the session node (30), and the master node (40), is a kind of computing module capable of handling processes for machine learning, and these nodes can access the external network storage (DB-O) to process data.

In addition, the worker node (20), the session node (30), and the master node (40) may be implemented together in a single computing module, or some resources may be implemented in an external third module, and the functions of each node may also be implemented together on a different node.

The external network storage (DB-O) provides the machine learning dataset. A worker node (20) that has set up its machine learning execution environment downloads the dataset from the external network storage (DB-O) and performs machine learning.

The user accesses the platform of the present invention through the external input module (10). The external input module (10) is implemented, for example, as an external program such as an API (Application Programming Interface), which makes the platform easy to access.

The user also uses the external input module (10) to input a task, that is, a target on which machine learning is to be performed. The input task is assigned to the session node (30) that matches the framework in which the machine learning will run. The session node (30) then assigns the task to a worker node (20).
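The routing of a task to a session node by framework can be sketched as follows. This is only an illustration under stated assumptions: the `SessionNode` class, the `route_task` function, and the framework names `"tensorflow"` and `"pytorch"` are hypothetical and do not appear in the source, which does not specify how the lookup is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class SessionNode:
    """Hypothetical stand-in for a session node (30) serving one framework."""
    framework: str
    pending_tasks: list = field(default_factory=list)


def route_task(sessions, task_name, framework):
    """Hand a task to the session node that manages workers for its framework."""
    if framework not in sessions:
        raise ValueError(f"no session node registered for framework {framework!r}")
    session = sessions[framework]
    session.pending_tasks.append(task_name)
    return session


# Illustrative framework names; the source does not name any framework.
sessions = {
    "tensorflow": SessionNode("tensorflow"),
    "pytorch": SessionNode("pytorch"),
}
chosen = route_task(sessions, "image-classifier", "pytorch")
```

Keying the session nodes by framework keeps the routing step a single dictionary lookup; the choice of worker within the session is a separate decision.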

When the user then inputs, through the external input module (10), a machine learning command to be executed on the worker node (20), the command is delivered to the worker node (20) and executed. Typically, issuing a 'machine learning execution command' starts the machine learning.

The master node (40) monitors the resources of the worker nodes (20) in the platform, checks the machine learning progress of tasks, uploads tasks to the platform, relays user commands input through the external input module (10), and manages tasks and datasets.

The session node (30) manages worker nodes (20) configured with the same framework or with frameworks classified into the same group. Each session node (30) monitors the resources of its worker nodes (20) and relays commands received from the master node (40) to the worker nodes (20).

The worker node (20) performs the machine learning corresponding to the task received from the session node (30). It also periodically reports its own resource status to the session node (30).

Accordingly, the worker node (20) uses a 'task executor' to set up the machine learning execution environment and to download the machine learning dataset from the external network storage (DB-O), and when a machine learning execution command is delivered, it performs machine learning in the prepared environment.

It also reports its resource status to the session node (30) through a 'resource management module'. Because the resource status is reported to the session node (30) in real time or periodically, it is used to select the worker node (20) that will perform a task assigned to the platform.

Meanwhile, as described above, the present invention improves the reconfiguration performance of the worker nodes (20) by speeding up the operations for adding or deleting worker nodes (20), and makes it possible to respond flexibly to situations arising within the platform.

As seen in FIG. 2, the present invention improves the internal speed of a previously implemented machine learning platform by means of a node list (31) that records the worker nodes (20) and a deletion list (32) that records the worker nodes (20) that need to be excluded.

In addition, the list of worker nodes (20) is implemented as a heap data structure to speed up restructuring of, insertion into, and selection from the node list (31), and the deletion list (32) is implemented to compensate for the slow search and delete operations of the heap data structure.

The node list (31) and the deletion list (32) are therefore significant as information stores or data structures that are specially constructed or newly added in the present invention.

Here, the node list (31) is the data structure with which the session node (30) manages the worker nodes (20). The node list (31) records the worker nodes (20) to which machine learning tasks are assigned. As described below, the node list (31) uses a heap data structure in this embodiment.

In addition, the node list (31) is sorted by FLOPS, the computational capability of the CPU/GPU, expressed in gigaflops (GFLOPS). The FLOPS of each CPU core and the FLOPS of the GPU are summed so that the computational capability of both the CPU and the GPU is taken into account.

FLOPS (floating-point operations per second) is a unit of a computer's computational speed. GFLOPS (gigaflops) expresses the number of floating-point operations a computer executes per second in units of one billion (10⁹).
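The ranking score described above, the per-core CPU FLOPS plus the GPU FLOPS expressed in GFLOPS, can be sketched as follows. The function names and the sample figures are assumptions made for illustration; the source gives no concrete values or formula beyond the summation.

```python
def to_gflops(flops):
    """Convert raw floating-point operations per second to GFLOPS (10**9 FLOPS)."""
    return flops / 1e9


def total_gflops(cpu_core_gflops, gpu_gflops=0.0):
    """Sum per-core CPU GFLOPS and GPU GFLOPS into one ranking score."""
    return sum(cpu_core_gflops) + gpu_gflops


# Hypothetical worker: four CPU cores at 50 GFLOPS each plus an 8000 GFLOPS GPU.
score = total_gflops([50.0, 50.0, 50.0, 50.0], gpu_gflops=8000.0)
```

A worker without a GPU simply contributes a GPU term of zero, so CPU-only and GPU-equipped workers are ranked on the same scale.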

In this way, tasks input by the user are assigned in order starting from the worker node (20) with the best computational capability, and machine learning on the input task is ultimately performed on the worker node to which the task has been assigned.

The deletion list (32) is a list proposed in the present invention to improve deletion performance for the node list (31). It compensates for the fact that, when a heap data structure is used for the node list (31), searching for and deleting a node is very slow.
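To illustrate why deleting an arbitrary node directly from a heap is costly, the sketch below removes a worker eagerly from an array-backed heap. It is not the method of the invention but the approach the deletion list is meant to avoid; the `(-gflops, worker_id)` entry layout and the function name are assumptions made for this example.

```python
import heapq


def remove_eagerly(heap, worker_id):
    """Remove worker_id from a heap of (-gflops, worker_id) entries right away.

    The heap order says nothing about where a given ID sits, so every entry
    may have to be inspected (O(n)), and the heap must then be rebuilt.
    """
    for index, (_, candidate) in enumerate(heap):
        if candidate == worker_id:
            heap[index] = heap[-1]   # overwrite the entry with the last one
            heap.pop()               # drop the now-duplicated tail
            heapq.heapify(heap)      # restore the heap property, O(n)
            return True
    return False


heap = [(-900.0, "w1"), (-300.0, "w2"), (-500.0, "w3")]
heapq.heapify(heap)
removed = remove_eagerly(heap, "w3")
```

Both the scan and the rebuild grow linearly with the number of workers, whereas recording the ID in a hash-based deletion list is a constant-time operation on average.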

To improve the performance of the delete operation, when deletion of a worker node (20) is requested, the session node (30) does not delete it from the node list (31) immediately. Instead, it stores the ID of the worker node (20) to be deleted (hereinafter, the 'deletion registration ID') in the deletion list (32) and uses it later.

The deletion list (32) uses, for example, a hash map structure so that insertion, lookup, and deletion of worker node (20) IDs can be performed quickly. When a request to assign a new machine learning task is subsequently delivered to the session node (30), a worker node (20) is selected and assigned.
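A hash-map-backed deletion list of this kind might look like the following minimal sketch. The class name, method names, and the stored `reason` value are hypothetical; the source only states that a hash map structure is used for the deletion registration IDs.

```python
class DeletionList:
    """Hash-based record of worker IDs that are pending deletion."""

    def __init__(self):
        # Python's dict is a hash map: keys are hashed to locate entries.
        self._entries = {}

    def register(self, worker_id, reason="removed"):
        """Record a deletion registration ID (the reason field is an assumption)."""
        self._entries[worker_id] = reason

    def __contains__(self, worker_id):
        return worker_id in self._entries

    def discard(self, worker_id):
        """Remove an ID once the worker has actually left the node list."""
        self._entries.pop(worker_id, None)

    def __len__(self):
        return len(self._entries)


deletion_list = DeletionList()
deletion_list.register("worker-07", reason="failure")
```

Registration, membership testing, and removal each take constant time on average, which is the property the text relies on.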

The worker node is selected and assigned from among the available resources capable of performing machine learning: the highest-priority worker node in the node list (31), ranked by computational capability, is selected. In this embodiment, the highest-priority node is the root node, the topmost node of the heap data structure.

When the root node is examined, if the ID of that worker node (20) is registered in the deletion list (32), no machine learning task is assigned to it. Instead, a pop operation is performed on the node list (31) to delete that worker node (20).

This process continues until a worker node (20) that can properly perform machine learning becomes the root node and is assigned the machine learning task, or until the node list (31) has no elements left.
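The selection loop described above can be sketched with Python's `heapq` module. This is a minimal sketch under stated assumptions, not the patented implementation: the function name, the `(-gflops, worker_id)` entry layout, and the use of a plain `set` as the deletion list are choices made for this example. Popping the assigned worker follows the preferred variant in which the root is popped after assignment; how a worker returns to the heap after finishing is not described in the source and is left out.

```python
import heapq


def pick_worker(node_heap, deletion_ids):
    """Return the ID of the worker that should receive the task, or None.

    node_heap holds (-gflops, worker_id) tuples; negating the score turns
    Python's min-heap into a max-heap, so node_heap[0] is the fastest worker.
    deletion_ids is a set of worker IDs registered for deletion.
    """
    while node_heap:
        _, worker_id = node_heap[0]           # look at the root node
        if worker_id in deletion_ids:         # compare with the deletion list
            heapq.heappop(node_heap)          # pop the root and re-heapify
            deletion_ids.discard(worker_id)   # drop the ID from the deletion list
            continue
        heapq.heappop(node_heap)              # the assigned worker leaves the heap
        return worker_id
    return None                               # node list exhausted


node_heap = [(-8200.0, "w-gpu"), (-400.0, "w-cpu-a"), (-200.0, "w-cpu-b")]
heapq.heapify(node_heap)
deletion_ids = {"w-gpu"}          # the fastest worker has been marked for deletion
first = pick_worker(node_heap, deletion_ids)
```

Each stale root costs one O(log n) pop, so the price of a deletion is paid only when the deleted worker actually reaches the top of the heap.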

To this end, as shown in FIG. 3, the method for managing worker nodes in a machine learning execution management platform according to the present invention includes a deletion node registration step (S10), a worker node search step (S20), and a node validity determination step (S30).

The method also includes a node list cleanup step (S40), a task assignment step (S50), and a machine learning step (S60), and in a preferred embodiment further includes a node assignment request step (S20a) and a deletion list update step (S41).

With these technical features, the present invention improves the structure of the node list (31) and introduces the deletion list (32), thereby managing the worker nodes (20) more efficiently than existing machine learning execution management platforms.

The node list (31) is changed from the existing list-type data structure to a heap data structure to speed up sorting and lookup. The deletion list (32) is introduced to compensate for the heap data structure's weaknesses in search and deletion.

That is, instead of having the session node (30) search the node list (31) and delete entries in real time according to whether each recorded worker node (20) is available for machine learning, the deletion list (32) is queried when a machine learning task is assigned, so that the availability of the candidate worker node (20) can be checked immediately and the best worker node (20) selected.

In addition, when deletion is requested, the worker node (20) is not deleted from the node list (31) immediately; its ID is stored in the deletion list (32). Later, when a new task is input, if the worker node (20) to which the machine learning task would be assigned is in the deletion list (32), the node is deleted without being assigned the task.

More specifically, in the deletion node registration step (S10), the session node (30) registers in the deletion list (32) a node to be deleted from among the worker nodes (20) recorded in the node list (31). A worker node (20) registered in the deletion list (32) is a node that cannot perform machine learning because of arbitrary removal, node failure, or the like.

To this end, the master node (40) identifies a worker node (20) that has been arbitrarily excluded or has failed and requests its deletion from the session node (30). In a preferred embodiment, the session node (30), on receiving the deletion request from the master node (40), registers the deletion registration ID in the deletion list (32) having a hash map structure, as described below.

As shown in FIG. 2, the deletion node registration step (S10) refers to the worker nodes (20) in the node list (31) and records those to be excluded in the deletion list (32). The node list (31) and the deletion list (32) therefore need to be created beforehand.

Creating the node list (31) involves a worker node analysis step (S1) and a node list generation step (S2). In the worker node analysis step (S1), the session node (30) analyzes the performance of the worker nodes (20) to determine their order in the list.

To this end, the session node (30) may monitor the worker nodes (20), or each worker node (20) may send the computing resources it can provide to the session node (30) in real time or periodically. The performance of a worker node (20) is preferably classified by the computational processing capability of its CPU and GPU.

In the node list generation step (S2), a node list (31) with a heap data structure is generated according to the performance order of the worker nodes (20). Preferably, the node list (31) is ordered by computational processing capability expressed in flops (floating-point operations per second), using gigaflops (GFlops) as the unit.

A heap is a tree-based data structure, a complete binary tree that satisfies the heap property, designed to make finding the maximum or minimum value fast.

The heap property states that if A is the parent node of B, an ordering relationship holds between the key of A and the key of B. A heap in which the key of a parent node is always greater than the keys of its child nodes is called a 'max heap', and the opposite is called a 'min heap'.
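The heap property can be checked directly when the complete binary tree is stored as an array, a common layout in which the children of index i sit at indices 2i+1 and 2i+2. The following is a minimal sketch under that assumption; the function name `is_max_heap` is hypothetical, and equal parent and child keys are accepted here for simplicity.

```python
def is_max_heap(keys):
    """Return True if every parent key is >= each of its child keys."""
    for child in range(1, len(keys)):
        parent = (child - 1) // 2  # array index of this child's parent
        if keys[parent] < keys[child]:
            return False
    return True


print(is_max_heap([950, 410, 120]))  # True
print(is_max_heap([120, 410, 950]))  # False
```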

Because the present invention orders the worker nodes (20) by computational processing capability, the node list (31) follows a max heap in which a worker node (20) with better capability has a larger key. The root node is therefore the highest priority node.
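As an illustration of such a node list, the sketch below builds a max heap keyed on GFlops with Python's `heapq` module. Because `heapq` provides a min heap, the keys are negated so that the most capable worker ends up at the root. The worker IDs, the GFlops figures, and the helper names `build_node_list` and `peek_highest_priority` are assumptions made for this example only.

```python
import heapq


def build_node_list(gflops_by_id):
    """Return a heap-ordered list of (-gflops, worker_id) entries."""
    node_list = [(-gflops, worker_id) for worker_id, gflops in gflops_by_id.items()]
    heapq.heapify(node_list)  # smallest negated key first, i.e. largest GFlops first
    return node_list


def peek_highest_priority(node_list):
    """Return the worker ID at the root without removing it."""
    _, worker_id = node_list[0]
    return worker_id


node_list = build_node_list({"w1": 120.0, "w2": 950.5, "w3": 410.0})
print(peek_highest_priority(node_list))  # w2
```

Reading the root is a constant-time operation, which is what makes the heap attractive for repeatedly selecting the best worker.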

In FIG. 2, the deletion list (32) is created through a deletion table creation step (S3) and an exclusion node analysis step (S4). In the deletion table creation step (S3), the session node (30) creates the deletion list table in which deletion registration IDs are recorded.

The deletion list table preferably has a hash map structure, in which entries are mapped according to a key value generated by a hash function. A hash map is a data structure used to implement an associative array, a structure that maps keys to values; it uses a hash function to compute an index into an array of buckets or slots.
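The index computation described above can be sketched as a small hash table with chained buckets. This is only an illustration: the choice of `zlib.crc32` as the hash function, the bucket count of 8, and the helper names `bucket_index`, `put`, and `get` are assumptions, not details taken from the patent.

```python
import zlib


def bucket_index(key, num_buckets):
    """Map a string key to a bucket index with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def put(buckets, key, value):
    """Insert or overwrite the value stored for key."""
    bucket = buckets[bucket_index(key, len(buckets))]
    for i, (existing_key, _) in enumerate(bucket):
        if existing_key == key:
            bucket[i] = (key, value)  # same key: overwrite, no duplicate entry
            return
    bucket.append((key, value))


def get(buckets, key, default=None):
    """Return the value stored for key, or default if absent."""
    for existing_key, value in buckets[bucket_index(key, len(buckets))]:
        if existing_key == key:
            return value
    return default


buckets = [[] for _ in range(8)]
put(buckets, "worker-07", "node failure")
print(get(buckets, "worker-07"))  # node failure
print(get(buckets, "worker-01"))  # None
```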

In the exclusion node analysis step (S4), the master node (40) monitors each worker node (20) to detect the worker nodes (20) to be registered in the deletion list table. In the present invention, deletion is a concept that includes exclusion and means being excluded from machine learning.

The deletion target worker node (20) detected by the master node (40) is reported to the session node (30) in the deletion node registration step (S10) described above, and the session node (30) registers it in the deletion list (32).

When the session node (30) receives information about a deletion target worker node (20) and inserts its ID into the deletion list (32), the hash map structure allows it to ignore duplicate deletion requests for the same worker node (20) and to hold deletion requests for multiple worker nodes (20) in the deletion list (32).
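A minimal sketch of this registration behavior, using Python's built-in `dict` as the hash map: the function name `register_deletion`, the worker IDs, and the stored reason strings are hypothetical and serve only to show that duplicate requests are ignored while several distinct requests can be pending at once.

```python
def register_deletion(deletion_list, worker_id, reason):
    """Record worker_id for deletion; return False if already registered."""
    if worker_id in deletion_list:
        return False  # duplicate request: keep the first record
    deletion_list[worker_id] = reason
    return True


deletion_list = {}
print(register_deletion(deletion_list, "w2", "node failure"))    # True
print(register_deletion(deletion_list, "w2", "node failure"))    # False
print(register_deletion(deletion_list, "w5", "manual removal"))  # True
print(sorted(deletion_list))  # ['w2', 'w5']
```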

Once the worker nodes (20) have been recorded in the node list (31) and the deletion targets among them have been registered in the deletion list (32), so that the pending deletions are recorded, newly input machine learning tasks are assigned to the worker nodes (20).

Accordingly, in the node assignment request step (S20a), the session node (30) receives a request from the master node (40) to assign a machine learning task to a worker node (20). The master node (40) requests task assignment from the session node (30) that manages the worker nodes (20) configured with the framework in which the machine learning is to be performed.

Then, in the worker node search step (S20), the session node (30) that received the task assignment request searches for the highest priority worker node (20) to perform machine learning. This search finds the worker node (20) with the best computational processing capability, regardless of the deletion list (32).

When the node list (31) is built as a heap data structure as described above, the session node (30) looks up the root node at the top of the node list (31). The root node corresponds to the highest priority worker node (20).

In a max heap, where the key of a parent node is always greater than the keys of its child nodes, the nodes are ordered by CPU/GPU computational processing capability. The root node therefore designates the highest priority worker node (20), which is looked up first when a worker node (20) is to be assigned.

Next, the node validity determination step (S30) determines whether the worker node (20) found as the highest priority is registered in the deletion list (32). It compares the priority ID designating the highest priority worker node (20) with the deletion registration IDs registered in the deletion list (32).

To this end, when a new machine learning task is input, the session node (30) retrieves the highest priority worker node (20) through the search and checks whether it matches any node registered in the deletion list (32).

When the node list (31) recording the worker nodes (20) is a heap data structure, the session node (30) compares the priority ID of the worker node (20) at the root node with the deletion registration IDs registered in the deletion list (32).

Next, in the node list cleaning step (S40), if the worker node (20) at the highest priority (the root node) is registered in the deletion list (32), that worker node (20) is removed from the node list (31).

If the priority ID matches a deletion registration ID, the worker node (20) found for task assignment is unsuitable for or incapable of performing machine learning, so the session node (30) deletes it from the node list (31).

Here, the session node (30) deletes the root node from the heap-structured node list (31) by performing a pop operation. After the existing root node is deleted by the pop operation, the node list (31) is rebuilt so that it again satisfies the heap structure.

A pop operation takes out and deletes the element currently at the top of the structure, so after the current root node is deleted, the node list (31) is reconstructed from the remaining elements below it.

After the node list (31) is reconstructed, the deletion list (32) may also be updated. This is implemented as a deletion list update step (S41), which removes from the deletion list (32) the deletion registration ID of the root node deleted by the pop operation.
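Steps S30, S40, and S41 can be sketched together as follows. The sketch assumes the node list is a `heapq` list of `(-gflops, worker_id)` tuples and the deletion list is a Python `set`; the function name `clean_root` and the sample data are hypothetical.

```python
import heapq


def clean_root(node_list, deletion_list):
    """Pop the root if it is registered for deletion and clear its ID."""
    if not node_list:
        return None
    _, root_id = node_list[0]         # S20: the root is the highest priority worker
    if root_id not in deletion_list:  # S30: compare priority ID with deletion IDs
        return None
    heapq.heappop(node_list)          # S40: pop the root; heapq restores the heap
    deletion_list.discard(root_id)    # S41: drop the ID from the deletion list
    return root_id


node_list = [(-950.5, "w2"), (-120.0, "w1"), (-410.0, "w3")]
heapq.heapify(node_list)
deletion_list = {"w2"}
print(clean_root(node_list, deletion_list))  # w2
print(node_list[0][1])                       # w3
print(deletion_list)                         # set()
```

The pop costs time logarithmic in the number of workers, and it is paid only when a deleted worker actually reaches the root.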

Next, in the task assignment step (S50), if the highest priority worker node (20) is determined not to be registered in the deletion list (32), the newly input machine learning task is assigned to it.

Specifically, when the priority ID differs from the deletion registration IDs, the session node (30) assigns the task to the worker node (20) corresponding to the priority ID so that machine learning is performed there.

When the highest priority worker node (20) is the root node of the node list (31), the session node (30) assigns the task to the worker node (20) corresponding to the root node. In other words, the task is assigned to the best worker node (20) among those not registered in the deletion list (32).

After the task is assigned, the node list (31) is preferably updated as well. This excludes the worker node (20) currently used for machine learning so that the next worker node (20) is used for the following task.

Accordingly, the session node (30) assigns the task to the worker node (20) at the highest priority root node and performs a pop operation (S51) that deletes the assigned root node. After the pop operation (S51), the node list (31) is reconstructed as a heap data structure.
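The assignment flow from S20 through S51 might look like the sketch below. It assumes the same `(-gflops, worker_id)` heap and `set`-based deletion list as a plausible representation, and it also assumes that the search is simply repeated with the new root whenever a deleted worker is discarded, which the description implies but does not spell out. The function name `assign_task` and the task names are hypothetical.

```python
import heapq


def assign_task(node_list, deletion_list, task):
    """Return (worker_id, task) for the best non-deleted worker, or None."""
    while node_list:
        _, worker_id = heapq.heappop(node_list)  # pop the current root
        if worker_id in deletion_list:
            deletion_list.discard(worker_id)     # S40/S41: drop the deleted worker
            continue                             # assumption: retry with the new root
        return worker_id, task                   # S50/S51: assign; root already popped
    return None                                  # no usable worker remains


node_list = [(-950.5, "w2"), (-410.0, "w3"), (-120.0, "w1")]
heapq.heapify(node_list)
deletion_list = {"w2"}
print(assign_task(node_list, deletion_list, "train-resnet"))  # ('w3', 'train-resnet')
print(assign_task(node_list, deletion_list, "train-bert"))    # ('w1', 'train-bert')
print(assign_task(node_list, deletion_list, "train-gan"))     # None
```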

Then, in the machine learning step (S60), machine learning is performed on the worker node (20) to which the task was assigned, in accordance with a machine learning execution command. The master node (40) receives the user's command through the external input module (10) and forwards it as the machine learning execution command.

Here, the worker node (20) sets up the machine learning execution environment through a 'task executor' and downloads the machine learning dataset from the external network storage (DB-O). When the machine learning execution command arrives, it performs machine learning in the prepared environment.

The worker node (20) also reports its resource status to the session node (30) through a resource management module. Because the resource status is reported in real time or periodically, it is used to select the worker nodes (20) that will perform tasks subsequently assigned to the platform.

Hereinafter, the machine learning execution management platform system in which the worker node management method according to the present invention is performed will be described with reference to the accompanying drawings.

Note that the machine learning execution management platform system according to the present invention is the system to which the worker node management method described above is applied. Redundant description is therefore omitted below wherever possible.

As described above with reference to FIGS. 1 and 2, the machine learning execution management platform system according to the present invention includes an external input module (10), worker nodes (20), a session node (30), and a master node (40). Preferably, it further includes an external network storage (DB-O) and an internal platform management DB (DB-I).

In an embodiment, the machine learning execution management platform system according to the present invention may be built on a machine learning server or a PC. It may also be built as a database server including a DBMS (DataBase Management System), and some of the computing resources or databases may be provided in conjunction with an external third-party server.

The external input module (10) receives the tasks on which machine learning is to be performed, and users access the platform of the present invention through it. The external input module (10) is implemented, for example, as an external program such as an API (Application Programming Interface), so users can access the platform easily.

The nodes (20, 30, 40), that is, the worker nodes (20), the session node (30), and the master node (40), are organized in a hierarchical structure in which they pass commands to one another and allocate resources.

The worker nodes (20), the session node (30), and the master node (40) are computing modules capable of running machine learning processes, and they can access the external network storage (DB-O) and process the downloaded data.

The external network storage (DB-O) provides machine learning datasets. A worker node (20) that has set up its machine learning execution environment downloads a dataset from the external network storage (DB-O) and performs machine learning.

Specifically, a worker node (20) performs machine learning on a task input by the user, and the ID designating each of the multiple worker nodes (20) is recorded in the node list (31). The worker nodes (20) are therefore managed by the session node (30), in which the node list (31) is built.

Because the node list (31) is a heap data structure, the highest priority worker node (20), the one with the best current computational processing capability, becomes the root node. The priority ID of the root node is compared with the IDs registered in the deletion list (32), and this comparison determines whether the task is assigned.

The session node (30) manages worker nodes (20) configured with the same framework or with frameworks classified into the same group. It also monitors the resources of the worker nodes (20) and forwards commands received from the master node (40) to the worker nodes (20).

In particular, the session node (30) registers in the deletion list (32) the nodes to be deleted from among the worker nodes (20) recorded in the node list (31). It also determines which worker node (20) receives the machine learning task by checking whether the highest priority (root node) worker node (20) is registered in the deletion list (32).

The master node (40) receives task assignment requests from the user and instructs the session node (30) to search for the highest priority worker node (20) to perform machine learning. It also detects worker nodes (20) to be excluded from machine learning and notifies the session node (30).

In this configuration, the session node (30) compares the priority ID designating the highest priority worker node (20) with the deletion registration IDs registered in the deletion list (32). If they match, the worker node (20) is excluded; if they do not, the machine learning task is assigned to it.

When the priority ID matches a deletion registration ID, the worker node (20) found for task assignment has been arbitrarily excluded or has failed. The session node (30) therefore deletes the ID of that worker node (20) from the node list (31) so that no machine learning task is assigned to it.

When the node list (31) is a heap data structure, the highest priority worker node (20) is the one designated by the root node. After the root node is deleted from the node list (31) by a pop operation, the node list (31) is reconstructed as a heap from the remaining nodes.

In contrast, when the priority ID differs from the deletion registration IDs, the session node (30) assigns the task so that machine learning is performed on the worker node (20) corresponding to the priority ID, and that worker node (20) performs machine learning in accordance with the machine learning execution command.
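To illustrate how commands flow through the hierarchy of master node, session node, and worker nodes, the sketch below models each as a small class. It is a simplified illustration, not the platform's implementation: the class and method names are hypothetical, task assignment and the execution command are collapsed into a single call, and `WorkerNode.run` merely records the task instead of performing machine learning.

```python
import heapq


class WorkerNode:
    def __init__(self, worker_id, gflops):
        self.worker_id = worker_id
        self.gflops = gflops
        self.tasks = []

    def run(self, task):
        self.tasks.append(task)  # stand-in for actual machine learning


class SessionNode:
    def __init__(self, workers):
        self.workers = {w.worker_id: w for w in workers}
        self.node_list = [(-w.gflops, w.worker_id) for w in workers]
        heapq.heapify(self.node_list)  # max heap on GFlops via negated keys
        self.deletion_list = set()

    def request_deletion(self, worker_id):
        self.deletion_list.add(worker_id)  # lazy: the node list is untouched

    def assign(self, task):
        while self.node_list:
            _, worker_id = heapq.heappop(self.node_list)
            if worker_id in self.deletion_list:
                self.deletion_list.discard(worker_id)
                continue
            self.workers[worker_id].run(task)
            return worker_id
        return None


class MasterNode:
    def __init__(self, session):
        self.session = session

    def report_failure(self, worker_id):
        self.session.request_deletion(worker_id)

    def submit(self, task):
        return self.session.assign(task)


session = SessionNode([WorkerNode("w1", 120.0), WorkerNode("w2", 950.5)])
master = MasterNode(session)
master.report_failure("w2")
print(master.submit("train-resnet"))  # w1
```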

Specific embodiments of the present invention have been described above. However, those of ordinary skill in the art to which the present invention pertains will understand that the spirit and scope of the present invention are not limited to these specific embodiments, and that various modifications and variations are possible without departing from the gist of the present invention.

Accordingly, the embodiments described above are provided to fully convey the scope of the invention to those of ordinary skill in the art to which the present invention pertains. They should be understood as illustrative in all respects and not restrictive, and the present invention is defined only by the scope of the claims.

10: external input module
20: worker node
30: session node
31: node list
32: deletion list
40: master node
DB-I: platform DB
DB-O: external network storage

Claims

1. A method for managing worker nodes in a machine learning execution management platform including a worker node (20), a session node (30), and a master node (40), the method comprising:
a deletion node registration step (S10) of registering, in the session node (30), a node to be deleted among the worker nodes (20) recorded in the node list (31) in the deletion list (32);
a worker node search step (S20) of searching for the highest priority worker node (20) to perform machine learning in the session node (30) that has been requested to assign a task by the master node (40);
a node validity determination step (S30) of comparing, in the session node (30), the priority ID indicating the highest priority worker node (20) with the deletion registration ID registered in the deletion list (32);
a node list cleaning step (S40) of deleting, in the session node (30), the highest priority worker node (20) from the node list (31) when the priority ID and the deletion registration ID are the same;
a task assignment step (S50) of assigning, in the session node (30), a task so that machine learning is performed on the worker node (20) corresponding to the priority ID when the priority ID and the deletion registration ID are different; and
a machine learning step (S60) of performing machine learning on the worker node (20) to which the task is assigned, in accordance with a machine learning execution command.

2. The method of claim 1, further comprising:
a worker node analysis step (S1) of analyzing the performance of the worker nodes (20) in the session node (30); and
a node list generation step (S2) of generating the node list (31), composed of a heap data structure, according to the performance order of the analyzed worker nodes (20).

3. The method of claim 2,
wherein the session node (30) classifies the performance of the worker nodes (20) in flops, the computational processing capability of the CPU and GPU, and sorts the node list (31) using gigaflops (GFlops) as the unit.

4. The method of claim 1, further comprising:
a deletion table creation step (S3) of creating, in the session node (30), a deletion list table in which the deletion registration ID is recorded; and
an exclusion node analysis step (S4) of detecting, in the master node (40), the deletion target worker node (20) to be registered in the deletion list table.

5. The method of claim 4,
wherein the deletion list table is a table with a hash map structure in which mapping is performed according to a key value generated by a hash function.

6. The method of claim 1, further comprising
a node assignment request step (S20a) of receiving a request from the master node (40) to assign a task on which machine learning is performed to a worker node (20),
wherein the master node (40) requests assignment of the task from the session node (30) that manages the worker nodes (20) configured with the framework in which the machine learning is performed.

7. The method of claim 5,
wherein, in the deletion node registration step (S10),
the master node (40) identifies a worker node (20) that has been arbitrarily excluded or has failed and requests the session node (30) to delete it, and
the session node (30) registers the deletion registration ID in the deletion list (32) having the hash map structure.

8. The method of claim 2,
wherein, in the worker node search step (S20),
the session node (30) searches for the root node at the top of the node list (31) composed of the heap data structure as the highest priority worker node (20).

9. The method of claim 8,
wherein, in the node validity determination step (S30),
the session node (30) compares the priority ID of the worker node (20) corresponding to the root node with the deletion registration ID registered in the deletion list (32).

10. The method of claim 8,
wherein, in the node list cleaning step (S40),
when the priority ID and the deletion registration ID are the same, a pop operation that deletes the root node is performed on the node list (31), and the node list (31) is then reconstructed as a heap data structure.

11. The method of claim 10,
further comprising a deletion list update step (S41) of removing from the deletion list (32) the deletion registration ID of the root node deleted by the pop operation.

12. The method of claim 8,
wherein, in the task assignment step (S50),
the session node (30) assigns the task to the worker node (20) corresponding to the root node.

13. The method of claim 12,
wherein the session node (30) assigns the task to the worker node (20) corresponding to the root node, performs a pop operation that deletes the root node, and then reconstructs the node list (31) as a heap data structure.

14. A machine learning execution management platform system in which the worker node management method of any one of claims 1 to 13 is performed, the system comprising:
an external input module (10) that receives a task on which machine learning is performed;
a plurality of worker nodes (20) that perform machine learning on the input task and are registered in a node list (31);
a session node (30) that registers, in a deletion list (32), a node to be deleted among the worker nodes (20) recorded in the node list (31); and
a master node (40) that receives a request for assignment of a task from a user and instructs the session node (30) to search for the highest priority worker node (20) to perform the machine learning,
wherein the session node (30) compares the priority ID designating the highest priority worker node (20) with the deletion registration ID registered in the deletion list (32),
when the priority ID and the deletion registration ID are the same, the session node (30) deletes the ID of the highest priority worker node (20) from the node list (31), and
when the priority ID and the deletion registration ID are different, the session node (30) assigns the task so that machine learning is performed on the worker node (20) corresponding to the priority ID.