KR20050066133A

KR20050066133A - Processes monitoring system for cluster and method thereof

Info

Publication number: KR20050066133A
Application number: KR1020030097367A
Authority: KR
Inventors: 김상완
Original assignee: 한국과학기술정보연구원
Priority date: 2003-12-26
Filing date: 2003-12-26
Publication date: 2005-06-30

Abstract

The present invention relates to a system and method for monitoring information about the processes running on each compute node in a cluster environment. Because the amount of information monitored at any given moment grows with the size of the cluster, the monitoring system is given a hierarchical structure so that it can also be applied to large clusters.

By applying the monitoring system and method of the present invention, detailed information about the processes running on each node of the cluster can be monitored. Because the monitoring information is stored in a database, monitoring is possible through a web interface or similar means without logging in to the system directly, and users can selectively monitor the process information they want according to various criteria such as user and CPU usage.

Description

Process monitoring system for a cluster and method thereof

The present invention relates to a system and method for monitoring information about the processes running on each compute node in a cluster environment.

A cluster is a system in which independent computer systems, such as personal computers, are connected by network equipment and used for a special purpose. Clusters are widely used today as a solution for providing high-performance computing and high-availability computing.

In addition, when an upgrade or expansion is needed, only the necessary nodes or network components can be replaced or added without replacing the entire system, which gives clusters flexible scalability. Because of these advantages, a growing number of university laboratories, companies, and research institutes that need to run complex scientific computations are building and using clusters sized to their computational requirements.

Because a cluster is a collection of independent systems, each with its own hardware and operating system, single system image technology, which makes the collection appear as one unified computing resource, is important. A single system image lets users perceive a set of computers as if it were one large computer.

A single system image can be considered for various elements such as user policy, file systems, CPU and memory resources, and system monitoring; the present invention can be said to provide a single system image for the monitoring of system processes. A process is a program being executed under an operating system such as UNIX, and obtaining information about the currently running processes is a very important part of system monitoring for system administrators and users.

From the system administrator's point of view, the purpose of monitoring a cluster is first of all to detect faults in some of the nodes that make up the cluster and to identify load concentrating on particular nodes. Continuously observing overall system utilization over time can also be used to predict usage patterns.

From the user's point of view, cluster monitoring is needed to determine, from the overall utilization of the system, whether an intended computation can be run, and to find out which nodes the work is concentrated on so that the job can be run on lightly loaded nodes. A further purpose is to check the status of a running job to see whether it is executing as originally intended.

Monitoring the processes running on the cluster is the most common way for users to follow the progress of their jobs. By monitoring the amount of system resources a process is using (CPU usage, memory usage, and so on), users can tell how their jobs are progressing.

Ganglia, an existing cluster monitoring system developed at the University of California, Berkeley, is good at monitoring overall system utilization but has the drawback that it cannot monitor the individual processes running on the constituent nodes. On the other hand, it uses an RRD (Round Robin Database) to accumulate monitoring information over time, which makes it easy to see how the system changes over time.

One tool that can monitor processes in a cluster is the SCMS package created at Kasetsart University in Thailand, which includes other cluster management functions in addition to monitoring. For process monitoring, however, it merely displays the output of the UNIX ps command, which is not user-friendly, and it has no filter function for showing only the processes of interest, which makes it inconvenient to use.

The present invention provides a system and method for monitoring information about the processes running on each constituent node of a cluster.

In addition, because the amount of information monitored at any given moment grows with the size of the cluster, the invention gives the monitoring system a hierarchical structure, providing a system and method that can also be applied to large clusters.

FIG. 1 shows the configuration of the process monitoring system for a cluster according to the present invention. The monitoring system includes a monitoring daemon, a monitoring information collector, a database, and an interface. The information collector, database, and interface together constitute a monitoring node.

First, the monitoring daemon runs on every target node to be monitored; it collects information about the processes running on its node and transmits it over the network. On a Linux system, process monitoring can be done through the /proc file system. Because information such as instantaneous CPU usage can only be obtained by sampling repeatedly at intervals, the daemon retains the previously sampled information and updates it continuously.
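The interval-based CPU calculation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function and class names are hypothetical, and the default of 100 clock ticks per second is an assumption (on a real Linux system the value can be read with `os.sysconf("SC_CLK_TCK")`). The field positions follow the documented layout of `/proc/<pid>/stat`, where `utime` and `stime` are fields 14 and 15.

```python
def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split after the last ")" instead of splitting the whole line.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field N sits at index N - 3.
    utime = int(rest[11])  # field 14: user-mode ticks
    stime = int(rest[12])  # field 15: kernel-mode ticks
    return utime + stime


def cpu_percent(prev_ticks, cur_ticks, elapsed_seconds, ticks_per_second=100):
    """CPU usage over the interval between two samples, as a percentage."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    cpu_seconds = (cur_ticks - prev_ticks) / ticks_per_second
    return 100.0 * cpu_seconds / elapsed_seconds


class CpuSampler:
    """Keeps the previous sample per PID so each update yields a usage figure."""

    def __init__(self, ticks_per_second=100):  # assumed tick rate
        self._last = {}  # pid -> (ticks, timestamp)
        self._hz = ticks_per_second

    def update(self, pid, stat_line, now):
        """Record a new sample; return usage since the last one, or None if first."""
        ticks = parse_cpu_ticks(stat_line)
        prev = self._last.get(pid)
        self._last[pid] = (ticks, now)
        if prev is None:
            return None
        return cpu_percent(prev[0], ticks, now - prev[1], self._hz)
```

A daemon built this way would read each `/proc/<pid>/stat` file on every cycle and pass its contents to `CpuSampler.update`; the first call for a PID yields no figure because there is no earlier sample to compare against.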

Next, the information collector runs on the node in the cluster responsible for monitoring (the monitoring node). It collects information from the monitoring daemons, and the collected information is stored in the database. The collector gathers monitoring information from the monitoring daemons periodically or on demand and updates the database.

The database is where the process information collected from the cluster is stored. Storing the monitored information in an RDBMS (Relational Database Management System) enables fast and flexible searching. Keeping past monitoring information in the database also makes it possible to show users how conditions have changed over time.

The interface component provides various interfaces that let users conveniently use the monitoring information collected in the database. The web interface presents process information through a web browser, so monitoring is easy without logging in to the system, even for someone without access rights to it. The console interface lets users query monitoring information with simple commands from a console.

Next, a method of monitoring system processes using the process monitoring system for a cluster of the present invention is described.

First, in the monitoring information collection step, the monitoring information is initially gathered by the monitoring daemon and passed to the information collector. The timing of collection is determined by the information collector: the monitoring daemon responds to a request from the collector by gathering the information and handing it over.

There are two ways to deliver monitoring information to the information collector: text-based messages in XML form, and data in binary form.

In the XML approach, the collected monitoring information is turned into an XML document and sent to the information collector. Thanks to XML's extensibility, monitoring items can be added freely and easily, and there are no restrictions on how the monitoring daemon and information collector are implemented. On the other hand, XML documents must be generated and parsed, and because text-based messages are transmitted, the volume of data is larger and transfer is therefore slower.

In the binary approach, the data gathered by the monitoring daemon is sent to the information collector as-is, in a language-dependent structure. No parsing is needed, so it is fast, and the volume of data exchanged is small. The drawback is that adding a monitoring item requires modifying all monitoring daemons and information collectors at the same time. It also imposes the implementation constraint that the monitoring daemon and information collector must be written in the same programming language.

The present invention collects process information using text messages in XML format; FIG. 2 shows an example of collected process information.
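The trade-off between the two delivery formats can be illustrated with a small sketch. The element name `process`, the attributes `pid`, `cpu`, and `rss_kb`, and the 16-byte binary layout are all assumptions made for this example; the actual message format shown in FIG. 2 is not reproduced in the text.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical fixed binary layout: pid (uint32), CPU % (float32),
# resident memory in KB (uint64), network byte order, no padding.
BINARY_LAYOUT = struct.Struct("!IfQ")


def encode_xml(pid, cpu, rss_kb):
    """Build a self-describing XML message; new attributes can be added freely."""
    root = ET.Element("process", pid=str(pid), cpu=str(cpu), rss_kb=str(rss_kb))
    return ET.tostring(root)  # bytes


def decode_xml(data):
    """Parse the XML message back into (pid, cpu, rss_kb)."""
    root = ET.fromstring(data)
    return int(root.get("pid")), float(root.get("cpu")), int(root.get("rss_kb"))


def encode_binary(pid, cpu, rss_kb):
    """Pack the same fields into a fixed-size record; no parsing step needed."""
    return BINARY_LAYOUT.pack(pid, cpu, rss_kb)


def decode_binary(data):
    """Unpack the fixed-size record; both sides must agree on the layout."""
    return BINARY_LAYOUT.unpack(data)
```

In this sketch the binary record is always 16 bytes, while the XML message is several times larger but can gain a new attribute without breaking a receiver that ignores unknown attributes. Adding a field to `BINARY_LAYOUT`, by contrast, requires changing sender and receiver together, which mirrors the drawback described above.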

Next, in the monitoring information request and response step, the process information obtained by the monitoring daemon is delivered to the information collector. Information about all processes may be delivered, or only the processes the collector wants. For the latter, the collector passes the monitoring daemon a list of the PIDs (process IDs) of the processes whose status it wants. In other words, the monitoring daemon and the information collector communicate through a request/response protocol. Stating constraints on the monitoring information to be collected in the request message speeds up monitoring, and because only the necessary data is sent to the collector, the volume of data transferred is reduced.
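The PID-filtered request can be sketched as a single handler on the daemon side. The function name `handle_request` and the dictionary-based process table are hypothetical; the text specifies only that the request carries a PID list and the response carries the matching process information.

```python
def handle_request(process_table, pids=None):
    """Answer a collector request from the daemon's current process table.

    process_table maps pid -> a dict of process information.
    pids is the optional PID list from the request; None means "all processes".
    """
    if pids is None:
        return list(process_table.values())
    # Return only the requested processes, silently skipping PIDs that
    # have exited since the collector last saw them.
    return [process_table[pid] for pid in pids if pid in process_table]
```

For example, a collector interested only in PIDs 10 and 99 would receive the entry for PID 10 alone if PID 99 no longer exists on the target node.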

The monitoring period can be varied according to the start time of the process, according to changes in the process's resource usage, or as specified directly by the user.

The start-time method lengthens the monitoring period as the total running time of a process grows. System processes start during boot and keep running; since such processes do not need to be monitored often, this method reduces the overall amount of monitoring information.

The resource-usage method monitors the usage of resources such as CPU and memory and transmits updated information only when usage changes by more than a certain amount.

Meanwhile, for processes such as system daemons that are always running and change little over time, the amount of monitoring information can be reduced by lengthening the monitoring period.
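The three ways of varying the monitoring period can be sketched as follows. The text gives no concrete formulas or thresholds, so the linear growth rule, the 5-second base, the 300-second cap, and the 5-point change threshold below are all illustrative assumptions, as are the function names.

```python
def period_from_runtime(runtime_seconds, base=5.0, cap=300.0):
    """Start-time method: the longer a process has run, the longer its period.

    Assumed policy: the period grows by one base interval per hour of
    run time and never exceeds cap seconds.
    """
    return min(cap, base * (1.0 + runtime_seconds / 3600.0))


def changed_enough(previous, current, threshold=5.0):
    """Resource-usage method: report only when usage moved by >= threshold."""
    return abs(current - previous) >= threshold


def choose_period(runtime_seconds, user_period=None):
    """User-specified method: an explicit period overrides the automatic one."""
    if user_period is not None:
        return user_period
    return period_from_runtime(runtime_seconds)
```

Under these assumed settings, a process that has just started is sampled every 5 seconds, one that has run for an hour every 10 seconds, and a long-lived system daemon at most every 300 seconds.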

In another embodiment, as a method of monitoring an expanded cluster, the invention uses multiple information collectors and limits the number of nodes each collector monitors so that performance does not degrade. When the cluster grows, the information sent to an information collector increases and the performance of the overall monitoring system may fall. In that case, as shown in FIG. 3, each monitoring daemon monitors only the target nodes assigned to it and sends the monitored information to its information collector, so that the information for the target nodes assigned to each monitoring daemon is stored separately on a monitoring node. Here, a monitoring node consists of an information collector and a database; there are as many monitoring nodes as there are monitoring daemons, and each stores the information for the target nodes assigned to its monitoring daemon.

The monitoring interface node provides an interface through which users can access, in a consistent way, the monitoring information distributed across and stored on the monitoring nodes. It accepts the user's search conditions, issues SQL queries to the distributed monitoring nodes, and presents the results to the user.
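The fan-out query performed by the monitoring interface node can be sketched with in-memory SQLite databases standing in for the monitoring nodes. The table name `process`, its columns, and the helper names are assumptions for this example; the text states only that SQL queries are sent to the distributed monitoring nodes and the results returned to the user.

```python
import sqlite3


def make_monitoring_node(rows):
    """Create a stand-in monitoring node: one database holding process rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE process (node TEXT, pid INTEGER, user TEXT, cpu REAL)")
    db.executemany("INSERT INTO process VALUES (?, ?, ?, ?)", rows)
    return db


def query_all(nodes, user, min_cpu):
    """Send the same SQL query to every monitoring node and merge the results."""
    sql = "SELECT node, pid, user, cpu FROM process WHERE user = ? AND cpu >= ?"
    results = []
    for db in nodes:
        results.extend(db.execute(sql, (user, min_cpu)).fetchall())
    # Present one combined view, busiest processes first.
    return sorted(results, key=lambda row: row[3], reverse=True)
```

Because each monitoring node answers only for its own target nodes, the interface node sees the whole cluster through one call while no single database has to hold every process record.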

The process monitoring system for a cluster of the present invention makes it easy to monitor information about the processes running on every node of the cluster in a consistent way, regardless of the size of the cluster.

In particular, process information can be searched using various criteria, such as by user or by resource usage, and because a database is used, changes in process state over time can also be searched.

In addition, fast and accurate information can be obtained because the load that monitoring places on the system is reduced.

FIG. 1 is a structural diagram of the process monitoring system of the present invention.

FIG. 2 shows an example of collected process information.

FIG. 3 is a structural diagram of a system for monitoring an expanded cluster.

Claims

1. A process monitoring system comprising:

a monitoring daemon for monitoring processes;

an information collector for collecting information from the monitoring daemon;

a database for storing the information collected by the information collector; and

an interface for providing process information.

2. The process monitoring system of claim 1, wherein the monitoring daemon and the information collector communicate through a request/response protocol.

3. The process monitoring system of claim 1, wherein the information collector and the database form a monitoring node, and the interface communicates with a plurality of monitoring nodes through a request/response protocol.

4. A process monitoring method comprising: a monitoring information collection step in which information from a target node is first gathered by a monitoring daemon and delivered to an information collector; and a monitoring information request and response step in which constraints on the monitoring information to be collected are stated in a request message, so that only the processes the information collector wants are sent to the information collector.

5. The process monitoring method of claim 4, wherein, in the monitoring information collection step, each monitoring daemon monitors only its assigned target nodes and transmits the information to the corresponding information collector, and, in the monitoring information request and response step, a monitoring interface node queries the monitoring nodes that include the information collectors.