KR20220020553A - Method and device for monitoring application performance in multi-cloud environment

Info

Publication number: KR20220020553A
Application number: KR1020200101006A
Authority: KR
Inventors: 유명식; 당꽝녓밍; 당꽝?퓜?
Original assignee: 숭실대학교산학협력단
Priority date: 2020-08-12
Filing date: 2020-08-12
Publication date: 2022-02-21
Also published as: KR102372958B1; KR102372958B9

Abstract

Provided are a method and an apparatus for monitoring application performance in a multi-cloud environment. The method comprises the steps of: collecting log information of a cloud storage; detecting, by an abnormal resource detection unit, whether an abnormal resource exists by comparing a metric value calculated on the basis of a model trained in advance through machine learning with a metric value corresponding to the log information; determining, by an alarm manager service, whether to generate an alarm according to the comparison result of the metric values; and performing, by an adaptation manager service, a scaling operation according to an alarm provided from the alarm manager service or an abnormal operation predicted by the model. Because the model is constructed through machine learning, the presence of an abnormal resource can be predicted in advance.

Description

Method and device for monitoring application performance in multi-cloud environment

The present invention relates to a method and apparatus for monitoring application performance in a multi-cloud environment.

ELK is used to search and analyze, in real time, all types of structured and unstructured data.

ELK consists of three components: Elasticsearch, Logstash, and Kibana. Elasticsearch is a distributed RESTful search and analytics engine. Logstash is an open-source server-side data processing pipeline that collects data from a variety of sources simultaneously, transforms it, and sends it to a frequently used "stash" (storage destination).

Kibana lets users explore data visually and analyze it in real time.

ELK with Filebeat added is called the ELK Stack.

Here, Filebeat is installed on a server as an agent and sends various types of data to Elasticsearch or Logstash.

In such a multi-cloud environment, application performance needs to be monitored in real time.

In the existing AWS EC2 (Amazon Elastic Compute Cloud) system, an automatic scaling operation (single-dimensional, at either the infrastructure level or the application level) must be configured with a predefined target metric value, either an "infrastructure-level" metric such as CPU utilization or an "application-level" metric such as application throughput, and TTS adds or removes instances to keep the specified metric value close to the target metric value.
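The add-or-remove behavior described above can be illustrated with a simple proportional rule. This is a minimal sketch, not AWS's actual algorithm: the function name, the proportional formula, and the instance bounds are all assumptions made for illustration.

```python
import math


def target_tracking_desired_capacity(current_instances: int,
                                     observed_metric: float,
                                     target_metric: float,
                                     min_instances: int = 1,
                                     max_instances: int = 20) -> int:
    """Proportional rule (assumed): size the fleet so the per-instance
    metric moves back toward the target value."""
    if current_instances < 1 or target_metric <= 0:
        raise ValueError("need at least one instance and a positive target")
    desired = math.ceil(current_instances * observed_metric / target_metric)
    # Clamp to the configured fleet bounds.
    return max(min_instances, min(max_instances, desired))


# 4 instances averaging 90% CPU against a 60% target
print(target_tracking_desired_capacity(4, 90.0, 60.0))  # 6
```

A rule of this kind reacts only after the metric has already drifted from the target, which is the limitation the following paragraphs raise.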

However, existing approaches to application performance monitoring cannot predict failures before they occur.

US Patent No. 10,649,756

To solve the above problems of the prior art, the present invention proposes a method and apparatus for monitoring application performance in a multi-cloud environment that can respond to problems in advance.

To achieve the above object, according to an embodiment of the present invention, there is provided a method for monitoring application performance in a multi-cloud environment, the method comprising: collecting log information of a cloud storage; detecting, by an abnormal resource detection unit, whether an abnormal resource exists by comparing a metric value calculated based on a model trained in advance through machine learning with a metric value corresponding to the log information; determining, by an alarm manager service, whether to generate an alarm according to a result of the comparison of the metric values; and performing, by an adaptation manager service, a scaling operation according to an alarm provided from the alarm manager service or an abnormal operation predicted by the model.

The log information may include at least one of a log time, a log type, cloud storage identification information, an operation request type, a time at which a request was sent to the cloud storage, a time at which a response was received from the cloud storage, a size of a file related to the request, a communication protocol of the cloud storage, an error code, or an error message.

The model may be trained in advance, based on Robust Random Cut Forest, using previously collected seasonal historical data and a CPU utilization threshold set by a user as inputs.

The scaling operation may include an operation of calculating resources for provisioning.

According to another aspect of the present invention, there is provided an apparatus for monitoring application performance in a multi-cloud environment, comprising: a processor; and a memory connected to the processor, wherein the memory stores program instructions executed by the processor to compare a metric value calculated based on a pre-trained model with a metric value corresponding to log information and to determine whether an abnormal resource exists according to the comparison, wherein an alarm manager service determines whether to generate an alarm according to a result of the comparison of the metric values, and an adaptation manager service performs a scaling operation according to an alarm provided from the alarm manager service or an abnormal operation predicted by the model.

According to yet another aspect of the present invention, there is provided a method for monitoring application performance in a multi-cloud environment, the method comprising: comparing a metric value calculated based on a pre-trained model with a metric value corresponding to log information; and determining whether an abnormal resource exists according to the comparison, wherein an alarm manager service determines whether to generate an alarm according to a result of the comparison of the metric values, and an adaptation manager service performs a scaling operation according to an alarm provided from the alarm manager service or an abnormal operation predicted by the model.

According to yet another aspect of the present invention, there is provided a computer-readable program for performing the above method.

According to the present invention, the presence of an abnormal resource can be predicted in advance through a model built through machine learning.

FIG. 1 is a diagram illustrating the configuration of an application performance monitoring system according to an embodiment of the present invention.
FIG. 2 is a flowchart of an abnormality detection process according to the present embodiment.
FIG. 3 is a diagram illustrating the configuration of an abnormal resource detection unit according to an embodiment of the present invention.

Since the present invention can be modified in various ways and can have various embodiments, specific embodiments are illustrated in the drawings and described in detail.

However, this is not intended to limit the present invention to specific embodiments, and it should be understood to include all modifications, equivalents, and substitutes included in the spirit and technical scope of the present invention.

FIG. 1 is a diagram illustrating the configuration of an application performance monitoring system according to an embodiment of the present invention.

Referring to FIG. 1, the system according to the present embodiment may include a cloud storage 100, Filebeat 102, Logstash 104, Elasticsearch 106, Kibana 108, an abnormal resource detection unit (Abnormal Resource Detection) 110, an alarm manager service (Alert Manager Service) 112, and an adaptation manager service (Adaption Manager Service) 114.

Logstash 104 collects log information of the cloud storage 100 using Filebeat 102.

Here, the log information may include at least one of a log time, a log type, cloud storage identification information, an operation request type, a time at which a request was sent to the cloud storage, a time at which a response was received from the cloud storage, a size of a file related to the request, a communication protocol of the cloud storage, an error code, or an error message.
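To make the listed fields concrete, the following is a minimal sketch of how one log entry might be represented. The class name, field names, and the two derived metrics (latency and throughput) are illustrative assumptions; the source lists the fields but does not say which metrics are derived from them.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class StorageLogRecord:
    """One cloud-storage log entry (hypothetical field names)."""
    log_time: datetime
    log_type: str
    storage_id: str
    operation: str                 # operation request type, e.g. "PUT"
    request_sent_at: datetime
    response_received_at: datetime
    file_size_bytes: int
    protocol: str
    error_code: Optional[str] = None
    error_message: Optional[str] = None

    def latency_ms(self) -> float:
        """Round-trip time between sending the request and receiving the response."""
        delta = self.response_received_at - self.request_sent_at
        return delta.total_seconds() * 1000.0

    def throughput_bytes_per_s(self) -> Optional[float]:
        """File size divided by round-trip time, or None if the time is not positive."""
        seconds = (self.response_received_at - self.request_sent_at).total_seconds()
        return self.file_size_bytes / seconds if seconds > 0 else None


record = StorageLogRecord(
    log_time=datetime(2020, 8, 12, 12, 0, 1),
    log_type="INFO",
    storage_id="storage-a",
    operation="PUT",
    request_sent_at=datetime(2020, 8, 12, 12, 0, 0, 0),
    response_received_at=datetime(2020, 8, 12, 12, 0, 0, 250000),
    file_size_bytes=1_000_000,
    protocol="HTTPS",
)
print(record.latency_ms(), record.throughput_bytes_per_s())
```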

Logstash 104 is a server-side data processing pipeline that collects data from multiple sources simultaneously.

Elasticsearch 106 is a distributed search and analytics engine backed by a NoSQL database that is accessible through a RESTful API.

Elasticsearch 106 uses the Apache Lucene search engine and its query syntax. In this embodiment, Elasticsearch 106 is used to meet the various search and analysis requirements of the automatic scaling algorithm.
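As an illustration of the kind of analysis request such a setup might send, the sketch below builds an Elasticsearch query body that averages a latency field per time bucket. The field names `storage_id` and `latency_ms`, the helper name, and the choice of a date histogram are assumptions; the source does not specify any query.

```python
import json
from typing import Any, Dict


def build_latency_histogram_query(storage_id: str,
                                  window: str = "now-1h",
                                  interval: str = "1m") -> Dict[str, Any]:
    """Query DSL body: average latency per interval for one storage (assumed fields)."""
    return {
        "size": 0,  # only aggregation results are needed, not individual hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"storage_id": storage_id}},
                    {"range": {"@timestamp": {"gte": window}}},
                ]
            }
        },
        "aggs": {
            "per_interval": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": interval},
                "aggs": {"avg_latency_ms": {"avg": {"field": "latency_ms"}}},
            }
        },
    }


body = build_latency_histogram_query("storage-a")
print(json.dumps(body, indent=2))
```

The resulting dictionary would be sent as the JSON body of a search request against the index holding the collected logs.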

Kibana 108 is the software used to visualize data in the ELK Stack; in this embodiment it serves as the management dashboard for the automatic scaling algorithm.

The abnormal resource detection unit 110 detects whether an abnormal resource exists by comparing a metric value calculated based on a model trained in advance through machine learning with a metric value corresponding to the log information.

Here, the model for abnormality detection may be built on Robust Random Cut Forest and may be trained in advance using previously collected seasonal historical data and a CPU utilization threshold set by the user as inputs.
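The intuition behind a random-cut model can be shown with a simplified stand-in. The sketch below is not Robust Random Cut Forest itself: it borrows the RRCF cutting rule (cut dimension chosen in proportion to its range, cut position uniform within the range) but scores a point by how few cuts isolate it, rather than by RRCF's collusive displacement. The data, function names, and the omission of the user-set CPU threshold are all assumptions.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def isolation_depth(points: List[Point], target: Point,
                    rng: random.Random, max_depth: int = 64) -> int:
    """Count the random cuts needed to separate `target` from the other points."""
    current = list(points)
    dims = len(target)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lows = [min(p[d] for p in current) for d in range(dims)]
        ranges = [max(p[d] for p in current) - lows[d] for d in range(dims)]
        if sum(ranges) == 0:  # all remaining points are identical
            break
        # Cut dimension chosen with probability proportional to its range,
        # cut position chosen uniformly inside that range.
        d = rng.choices(range(dims), weights=ranges)[0]
        cut = rng.uniform(lows[d], lows[d] + ranges[d])
        # Keep only the points on the same side of the cut as the target.
        current = [p for p in current if (p[d] <= cut) == (target[d] <= cut)]
        depth += 1
    return depth


def mean_isolation_depth(points: List[Point], target: Point,
                         n_trees: int = 100, seed: int = 0) -> float:
    """Average isolation depth over repeated random cut sequences."""
    rng = random.Random(seed)
    return sum(isolation_depth(points, target, rng) for _ in range(n_trees)) / n_trees


# Hypothetical seasonal history: (hour_of_day, cpu_percent) over three days.
history = [(float(h), 30.0 + h) for _ in range(3) for h in range(24)]
spike = (12.0, 98.0)
data = history + [spike]

typical = mean_isolation_depth(data, (12.0, 42.0))
outlier = mean_isolation_depth(data, spike)
print(f"typical depth={typical:.1f}, spike depth={outlier:.1f}")
```

A point that follows the daily pattern needs many cuts to isolate, while the spike is usually separated within a few, so a low mean depth serves as the anomaly signal in this sketch.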

The alarm manager service 112 determines whether to generate an alarm according to the result of the metric value comparison performed by the abnormal resource detection unit 110, and transmits a trigger alarm to the adaptation manager service 114.

The adaptation manager service 114 performs a scaling operation according to an alarm provided from the alarm manager service 112 or an abnormal operation predicted by the model of the abnormal resource detection unit 110.
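The hand-off between the two services can be sketched as follows. The class names mirror the components in FIG. 1, but the method names, the rule of requiring two consecutive abnormal results before raising an alarm, and the recorded action strings are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AdaptationManagerService:
    """Records scaling actions; a real service would call a cloud provider API."""
    actions: List[str] = field(default_factory=list)

    def scale(self, resource: str, reason: str) -> None:
        self.actions.append(f"{resource}:{reason}")


@dataclass
class AlarmManagerService:
    adaptation: AdaptationManagerService
    consecutive_required: int = 2  # hypothetical debounce rule, not from the source
    _streaks: Dict[str, int] = field(default_factory=dict)

    def report(self, resource: str, abnormal: bool) -> bool:
        """Decide whether a comparison result should raise an alarm."""
        streak = self._streaks.get(resource, 0) + 1 if abnormal else 0
        self._streaks[resource] = streak
        if streak >= self.consecutive_required:
            # Transmit the trigger alarm to the adaptation manager.
            self.adaptation.scale(resource, "alarm")
            self._streaks[resource] = 0
            return True
        return False


adaptation = AdaptationManagerService()
alarms = AlarmManagerService(adaptation)
results = [alarms.report("storage-a", flag) for flag in (True, False, True, True)]
# Second path: scaling driven directly by a model prediction, without an alarm.
adaptation.scale("storage-b", "predicted")
print(results, adaptation.actions)
```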

Here, the scaling operation may be an operation of calculating resources for provisioning.

According to this embodiment, unsupervised runtime anomaly detection is used to learn from history data in order to detect abnormal resource behavior (such as a surge in CPU resource usage) and to predict abnormal behavior that will reach the system in the near future, so that a scheduled action can be prepared.

FIG. 2 is a flowchart of an abnormality detection process according to the present embodiment.

Referring to FIG. 2, the system first collects log information from the cloud storage (step 200).

According to this embodiment, a model for abnormal resource detection is trained, based on Robust Random Cut Forest, using previously collected seasonal historical data and a CPU utilization threshold set by the user as inputs (step 202).

The system compares the metric value corresponding to the log information collected in real time with the value predicted by the model trained in step 202 (step 204), and triggers an alarm when an abnormality occurs (step 206).
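Steps 204 and 206 can be sketched as a comparison followed by an alarm decision. The relative-deviation test, the 25% tolerance, and the sample values below are assumptions; the source does not state how the two metric values are compared.

```python
from typing import Iterable, List, Tuple


def is_abnormal(observed: float, predicted: float, tolerance: float = 0.25) -> bool:
    """True when the observed metric deviates from the prediction by more
    than `tolerance`, measured relative to the predicted value."""
    scale = max(abs(predicted), 1e-9)  # avoid division by zero
    return abs(observed - predicted) / scale > tolerance


def detect(stream: Iterable[Tuple[str, float, float]],
           tolerance: float = 0.25) -> List[str]:
    """Return the timestamps at which an alarm would be triggered."""
    return [ts for ts, observed, predicted in stream
            if is_abnormal(observed, predicted, tolerance)]


# (timestamp, observed CPU %, model-predicted CPU %), hypothetical values
samples = [("12:00", 41.0, 40.0), ("12:01", 44.0, 40.0), ("12:02", 93.0, 41.0)]
print(detect(samples))  # ['12:02']
```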

A scaling operation is performed according to the alarm generated in step 206 (step 208).

According to the present embodiment, the scaling operation may be triggered not only by the alarm generated in step 206 but also by an abnormal operation predicted by the model according to the present embodiment.

Here, the scaling operation includes an operation of calculating resources for provisioning.
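One way such a provisioning calculation could look is sketched below. The sizing rule (predicted peak load divided by usable per-instance capacity), the 70% target utilization, and all parameter names are assumptions; the source only states that resources for provisioning are calculated.

```python
import math


def instances_to_provision(predicted_peak_load: float,
                           capacity_per_instance: float,
                           current_instances: int,
                           target_utilization: float = 0.7) -> int:
    """Number of additional instances needed so the predicted peak fits
    within the target utilization of the fleet (assumed sizing rule)."""
    if capacity_per_instance <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("capacity must be positive and utilization in (0, 1]")
    usable = capacity_per_instance * target_utilization
    required = math.ceil(predicted_peak_load / usable)
    # Never return a negative number; scale-in is out of scope for this sketch.
    return max(0, required - current_instances)


# Predicted peak of 1000 requests/s, 200 requests/s per instance, 5 instances running
print(instances_to_provision(1000.0, 200.0, 5))  # 3
```

Because the input is a predicted peak rather than a currently observed metric, the extra capacity can be requested before the load arrives.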

FIG. 3 is a diagram illustrating the configuration of an abnormal resource detection unit according to an embodiment of the present invention.

Referring to FIG. 3, the abnormal resource detection unit according to the present embodiment may include a processor 300 and a memory 302.

The processor 300 may include a central processing unit (CPU) capable of executing a computer program, or another device such as a virtual machine.

The memory 302 may include a non-volatile storage device such as a fixed hard drive or a removable storage device. The removable storage device may include a compact flash unit, a USB memory stick, and the like. The memory 302 may also include volatile memory, such as various kinds of random access memory.

The memory 302 stores program instructions executable by the processor 300.

The program instructions according to an embodiment of the present invention compare a metric value calculated based on a pre-trained model with a metric value corresponding to log information collected from the cloud storage, and determine whether an abnormal resource exists according to the comparison.

According to the result of the metric value comparison, the alarm manager service determines whether to generate an alarm, and the adaptation manager service performs a scaling operation according to an alarm provided from the alarm manager service or an abnormal operation predicted by the model.

The above-described embodiments of the present invention have been disclosed for purposes of illustration. Those skilled in the art with ordinary knowledge of the present invention will be able to make various modifications, changes, and additions within the spirit and scope of the present invention, and such modifications, changes, and additions should be regarded as falling within the scope of the following claims.

Claims

1. A method for monitoring application performance in a multi-cloud environment, comprising:
collecting log information of a cloud storage;
detecting, by an abnormal resource detection unit, whether an abnormal resource exists by comparing a metric value calculated based on a model trained in advance through machine learning with a metric value corresponding to the log information;
determining, by an alarm manager service, whether to generate an alarm according to a result of the comparison of the metric values; and
performing, by an adaptation manager service, a scaling operation according to an alarm provided from the alarm manager service or an abnormal operation predicted by the model.

2. The method of claim 1,
wherein the log information includes at least one of a log time, a log type, cloud storage identification information, an operation request type, a time at which a request was sent to the cloud storage, a time at which a response was received from the cloud storage, a size of a file related to the request, a communication protocol of the cloud storage, an error code, or an error message.

3. The method of claim 1,
wherein the model is trained in advance, based on Robust Random Cut Forest, using previously collected seasonal historical data and a CPU utilization threshold set by a user as inputs.

4. The method of claim 1,
wherein the scaling operation includes an operation of calculating resources for provisioning.

5. A device for monitoring application performance in a multi-cloud environment, comprising:
a processor; and
a memory connected to the processor,
wherein the memory stores program instructions executed by the processor to
compare a metric value calculated based on a pre-trained model with a metric value corresponding to log information, and
determine whether an abnormal resource exists according to the comparison,
wherein an alarm manager service determines whether to generate an alarm according to a result of the comparison of the metric values, and an adaptation manager service performs a scaling operation according to an alarm provided from the alarm manager service or an abnormal operation predicted by the model.

6. The device of claim 5,
wherein the log information includes at least one of a log time, a log type, cloud storage identification information, an operation request type, a time at which a request was sent to the cloud storage, a time at which a response was received from the cloud storage, a size of a file related to the request, a communication protocol of the cloud storage, an error code, or an error message.

7. The device of claim 5,
wherein the model is trained in advance, based on Robust Random Cut Forest, using previously collected seasonal historical data and a CPU utilization threshold set by a user as inputs.

8. A method for monitoring application performance in a multi-cloud environment, comprising:
comparing a metric value calculated based on a pre-trained model with a metric value corresponding to log information; and
determining whether an abnormal resource exists according to the comparison,
wherein an alarm manager service determines whether to generate an alarm according to a result of the comparison of the metric values, and an adaptation manager service performs a scaling operation according to an alarm provided from the alarm manager service or an abnormal operation predicted by the model.

9. A computer-readable program for performing the method according to claim 8.