KR102367409B1

KR102367409B1 - Method, server and computer program for predicting IT service failure using pre-learned failure prediction model

Info

Publication number: KR102367409B1
Application number: KR1020210148448A
Authority: KR
Inventors: 명재석
Original assignee: 주식회사 데이탄소프트
Priority date: 2021-11-02
Filing date: 2021-11-02
Publication date: 2022-02-24

Abstract

Provided are a method, a server, and a computer program for predicting an IT service failure using a pre-learned failure prediction model. According to various embodiments of the present invention, the method, performed by a computing device, comprises: collecting indicator information on an IT service; extracting a load value for the IT service by analyzing the collected indicator information through the pre-learned failure prediction model; and determining a failure possibility for the IT service using the extracted load value.

Description

Failure prediction method of IT service using pre-learned failure prediction model, server and computer program

Various embodiments of the present invention relate to a method, a server, and a computer program for predicting failures of an IT service using a pre-learned failure prediction model.

Because an IT service is built and operated by integrating a wide range of equipment and technologies, such as infrastructure systems and applications (source code, databases, interfaces), identifying the cause of a failure and resolving it takes considerable time and effort, even when experts handle the failure.

In particular, application change information is either not managed at all or managed only in an individual, sporadic manner, so it is not available as information when a failure occurs. This makes identifying the cause of a failure even more difficult.

To overcome these problems and operate IT services stably, monitoring systems have been introduced that monitor IT service performance in real time and analyze trends in performance data to predict and prepare for future failures. However, because such systems add to the monitoring workload of operations staff, it is difficult to obtain any benefit beyond real-time notification.

Furthermore, when artificial intelligence is used to assist operations staff and to automate failure prediction and response in order to reduce the monitoring workload, the characteristics of performance data differ from one IT service to another. As a result, either the performance data of all services must be collected to learn a general pattern, or an expert must be assigned to each service to train a model manually.

Korean Patent Laid-Open Publication No. 10-2017-0071825 (2017.06.26)

An object of the present invention is to solve the problems described above by providing a method, a server, and a computer program for predicting failures of an IT service using a pre-learned failure prediction model, in which the possibility of failure of the IT service is determined using two failure prediction models trained on indicator information that includes various indicators of the IT service, thereby enabling more accurate failure prediction.

Another object of the present invention is to provide a method, a server, and a computer program for predicting failures of an IT service using a pre-learned failure prediction model that determine whether to retrain the failure prediction model based on external factors, such as application modification and deployment or changes to the infrastructure of the system providing the application, or on internal factors, such as the result of comparing a failure possibility determination with whether a failure actually occurred, and that retrain the failure prediction model automatically in accordance with that determination, so that the model can be retrained and improved without expert intervention.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

To solve the problems described above, a method for predicting failures of an IT service using a pre-learned failure prediction model according to an embodiment of the present invention is performed by a computing device and may include: collecting indicator information on the IT service; extracting a load value for the IT service by analyzing the collected indicator information through the pre-learned failure prediction model; and determining the possibility of failure of the IT service using the extracted load value.

In various embodiments, extracting the load value may include extracting a critical load value of the IT service from the collected indicator information and extracting a future load value of the IT service from the collected indicator information, and determining the possibility of failure may include determining the possibility of failure of the IT service according to whether the extracted future load value exceeds the extracted critical load value.
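The comparison between the future load value and the critical load value can be illustrated with a minimal Python sketch. It assumes that the future load is available as a sequence of forecast values and that the critical load is a single number; the function names `first_failure_index` and `failure_possible` are hypothetical and do not come from the patent.

```python
from typing import Optional, Sequence


def first_failure_index(future_loads: Sequence[float], critical_load: float) -> Optional[int]:
    """Return the index of the first forecast step whose load exceeds the critical load."""
    for i, load in enumerate(future_loads):
        if load > critical_load:  # "exceeds" is read here as strictly greater than
            return i
    return None


def failure_possible(future_loads: Sequence[float], critical_load: float) -> bool:
    """A failure is considered possible if any forecast load exceeds the critical load."""
    return first_failure_index(future_loads, critical_load) is not None
```

In this sketch a forecast that only reaches the critical load, without exceeding it, is not flagged; whether the boundary case should count is a design choice the text leaves open.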

In various embodiments, extracting the critical load value may include: generating a critical load prediction model that includes a critical load prediction function, in the form of an exponential function, defining the relationship between the number of concurrent users of the IT service and the response time of the IT service; calculating the difference between the performance of the generated critical load prediction model and a preset target performance, based on the result value obtained by feeding indicator information of the IT service for a predetermined period into the generated critical load prediction model; confirming the generated critical load prediction model when the calculated difference is less than or equal to a preset reference value, and correcting the critical load prediction function included in the generated critical load prediction model when the calculated difference exceeds the preset reference value; and extracting the critical load value from the collected indicator information using the confirmed critical load prediction model.
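The generate, check, and confirm-or-correct sequence described above can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the exponential form `a * exp(b * users) + c`, the use of the mean absolute gap between predicted and observed response times as the "difference", the correction by searching a list of candidate growth rates, and the reading of the critical load as the user count at which the predicted response time reaches a maximum tolerated value. None of these is stated in the patent text.

```python
import math
from typing import List, Sequence, Tuple

Params = Tuple[float, float, float]  # (a, b, c) of the assumed exponential form


def predict_response_time(users: float, a: float, b: float, c: float) -> float:
    """Assumed exponential relationship between concurrent users and response time."""
    return a * math.exp(b * users) + c


def mean_abs_gap(samples: Sequence[Tuple[float, float]], params: Params) -> float:
    """Mean absolute gap between predicted and observed response times (assumed metric)."""
    a, b, c = params
    return sum(abs(predict_response_time(u, a, b, c) - t) for u, t in samples) / len(samples)


def confirm_or_correct(samples, params: Params, reference: float,
                       b_candidates: List[float]) -> Tuple[Params, float]:
    """Confirm the model if the gap is within the reference value, else correct b."""
    gap = mean_abs_gap(samples, params)
    if gap <= reference:
        return params, gap  # model confirmed as generated
    a, _, c = params
    best_b = min(b_candidates, key=lambda b: mean_abs_gap(samples, (a, b, c)))
    corrected = (a, best_b, c)
    return corrected, mean_abs_gap(samples, corrected)


def critical_load_value(params: Params, max_response_time: float) -> float:
    """User count at which the predicted response time reaches the tolerated maximum."""
    a, b, c = params
    if max_response_time <= c:
        raise ValueError("max_response_time must exceed the baseline c")
    return math.log((max_response_time - c) / a) / b
```

A real implementation would likely fit all parameters with a numerical optimizer rather than a candidate list; the grid search is used only to keep the sketch dependency-free.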

In various embodiments, calculating the difference between the performance of the generated critical load prediction model and the preset target performance may include extracting the result value using Equation 1 below, and correcting the critical load prediction function may include, when the calculated difference exceeds the preset reference value, determining the parameters of the critical load prediction function using Equation 2 below such that the calculated difference is minimized, and correcting the critical load prediction function using the determined parameters.

<Equation 1>

Here, [...] is the critical load prediction function; [...], [...], and [...] are the initial parameter values of the critical load prediction function; and [...] is the input value of the critical load prediction function, which may denote the number of concurrent users of the IT service. (Equation 1 applies when [...] exceeds [...]; when [...] is 0 or less, [...] is 0; and when [...] is greater than 0 and no more than [...], [...] is [...].)

<Equation 2>

Here, [...] is the parameter value of the critical load prediction function for which the calculated difference is minimized; [...] is the result value of the critical load prediction function, that is, the expected response time when the IT service has n concurrent users; and [...] may denote the preset target performance.

In various embodiments, extracting the future load value may include: dividing indicator information of the IT service for a predetermined period into a plurality of sections; calculating a feature vector for each of the plurality of sections and, using the calculated feature vectors, calculating a feature function for the indicator information of the IT service for the predetermined period; generating a future load prediction model by training on the calculated feature function based on quantile regression; and extracting the future load value from the collected indicator information using the generated future load prediction model.
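As a rough illustration of the section-splitting and quantile regression steps, the following Python sketch forecasts the load of the next section from the peak loads of past sections. It is heavily simplified and rests on assumptions: the feature of each section is reduced to its index, the target is the section's peak load, and the model is a single straight line fitted for one quantile. The fit relies on the standard property that, for a line, a minimizer of the pinball (quantile) loss can be found among lines passing through two data points, so a brute-force search over pairs is exact for small inputs. All names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def split_windows(series: Sequence[float], size: int) -> List[Sequence[float]]:
    """Divide the series into consecutive, non-overlapping sections of equal size."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, size)]


def pinball_loss(residual: float, q: float) -> float:
    """Quantile (pinball) loss for one residual y - prediction."""
    return q * residual if residual >= 0 else (q - 1) * residual


def fit_quantile_line(xs: Sequence[float], ys: Sequence[float], q: float) -> Tuple[float, float]:
    """Exact one-feature quantile regression by checking every line through two points."""
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line cannot be expressed as slope/intercept
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        loss = sum(pinball_loss(y - (slope * x + intercept), q) for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, slope, intercept)
    if best is None:
        raise ValueError("need at least two sections with distinct indices")
    return best[1], best[2]


def forecast_future_load(series: Sequence[float], window: int, q: float = 0.9) -> float:
    """Predict the q-quantile of the peak load for the section after the last one."""
    peaks = [max(w) for w in split_windows(series, window)]
    xs = list(range(len(peaks)))
    slope, intercept = fit_quantile_line(xs, peaks, q)
    return slope * len(peaks) + intercept
```

The pairwise search costs cubic time in the number of sections, so it suits only short histories; a production system would more plausibly use a linear-programming or gradient-based quantile regression solver with richer feature vectors.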

In various embodiments, calculating the feature function may include assigning a weight to each of the plurality of feature vectors calculated for the plurality of sections, in consideration of at least one of a time characteristic, a usage pattern characteristic of users of the IT service, and a characteristic of correlation with the extracted future load value, and calculating the feature function using the plurality of weighted feature vectors.

In various embodiments, determining the possibility of failure may include: when a failure of the IT service was predicted, based on the extracted load value, to occur at a first time point but no failure occurred at the first time point, retraining the pre-learned failure prediction model using the indicator information of the IT service at the first time point as training data; and when no failure of the IT service was predicted, based on the extracted load value, but a failure occurred at a second time point, retraining the pre-learned failure prediction model using the indicator information of the IT service at the second time point as training data.
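The two retraining triggers above (a predicted failure that did not occur, and an unpredicted failure that did occur) amount to queuing the indicator information whenever prediction and outcome disagree. A minimal Python sketch follows; the `RetrainQueue` class, its field names, and the `failure` label key are hypothetical, and the actual retraining step is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RetrainQueue:
    """Collects indicator snapshots from time points where the prediction was wrong."""
    samples: List[Dict[str, float]] = field(default_factory=list)

    def review(self, predicted_failure: bool, actual_failure: bool,
               indicators: Dict[str, float]) -> bool:
        """Queue the indicators as training data when prediction and outcome disagree."""
        if predicted_failure != actual_failure:
            labeled = {**indicators, "failure": 1.0 if actual_failure else 0.0}
            self.samples.append(labeled)
            return True  # retraining is warranted for this time point
        return False
```

Storing the observed outcome alongside the indicators lets a later retraining job treat both kinds of misprediction uniformly.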

In various embodiments, when a second IT service is provided because a previously provided first IT service has been updated or the infrastructure of the system providing the first IT service has been changed, the method may further include: excluding information on the performance of the first IT service from the indicator information of the first IT service previously used as training data for the pre-learned failure prediction model; generating virtual user requests for the second IT service; collecting information on the performance of the second IT service based on the generated virtual user requests; and retraining the pre-learned failure prediction model using, as training data, the indicator information of the first IT service from which the performance information has been excluded together with the collected information on the performance of the second IT service.
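The assembly of retraining data after an update or infrastructure change can be sketched in Python as below. The record layout, the set of keys treated as performance information (`PERFORMANCE_KEYS`), and the `send_request` callable that stands in for issuing virtual user requests are all assumptions for illustration; the patent does not specify them.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical names of the fields that count as performance information.
PERFORMANCE_KEYS = {"response_time", "throughput"}


def strip_performance(records: Iterable[Dict[str, float]]) -> List[Dict[str, float]]:
    """Drop performance fields of the first (old) IT service from its indicator records."""
    return [{k: v for k, v in r.items() if k not in PERFORMANCE_KEYS} for r in records]


def probe_performance(send_request: Callable[[int], float],
                      virtual_users: Iterable[int]) -> List[Dict[str, float]]:
    """Measure the second (new) IT service under generated virtual user load."""
    return [{"users": n, "response_time": send_request(n)} for n in virtual_users]


def build_retraining_data(old_records: Iterable[Dict[str, float]],
                          send_request: Callable[[int], float],
                          virtual_users: Iterable[int]) -> List[Dict[str, float]]:
    """Combine stripped old indicators with freshly measured performance of the new service."""
    return strip_performance(old_records) + probe_performance(send_request, virtual_users)
```

In practice `send_request` would wrap a load-testing tool; here any function that maps a virtual user count to a measured response time will do.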

To solve the problems described above, a server for predicting failures of an IT service using a pre-learned failure prediction model according to another embodiment of the present invention includes a processor, a network interface, a memory, and a computer program that is loaded into the memory and executed by the processor, wherein the computer program may include: an instruction for collecting indicator information on an IT service; an instruction for extracting a load value for the IT service by analyzing the collected indicator information through the pre-learned failure prediction model; and an instruction for determining the possibility of failure of the IT service using the extracted load value.

To solve the problems described above, a computer program recorded on a computer-readable recording medium according to yet another embodiment of the present invention may be combined with a computing device and stored in the computer-readable recording medium in order to execute a method for predicting failures of an IT service using a pre-learned failure prediction model, the method including: collecting indicator information on the IT service; extracting a load value for the IT service by analyzing the collected indicator information through the pre-learned failure prediction model; and determining the possibility of failure of the IT service using the extracted load value.

Other specific details of the present invention are included in the detailed description and the drawings.

According to various embodiments of the present invention, the possibility of failure of an IT service is determined using two failure prediction models trained on indicator information that includes various indicators of the IT service, which has the advantage of enabling more accurate failure prediction.

In addition, whether to retrain the failure prediction model is determined based on external factors, such as application modification and deployment or changes to the infrastructure of the system providing the application, or on internal factors, such as the result of comparing a failure possibility determination with whether a failure actually occurred, and the failure prediction model is retrained automatically in accordance with that determination. This has the advantage that the failure prediction model can be retrained and improved without expert intervention.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a diagram illustrating a system for predicting failures of an IT service using a pre-learned failure prediction model according to an embodiment of the present invention.
FIG. 2 is a hardware configuration diagram of a server for predicting failures of an IT service using a pre-learned failure prediction model according to another embodiment of the present invention.
FIG. 3 is a diagram illustrating the structure of a system for predicting failures of an IT service using a pre-learned failure prediction model, applicable to various embodiments.
FIG. 4 is a flowchart of a method for predicting failures of an IT service using a pre-learned failure prediction model according to yet another embodiment of the present invention.
FIG. 5 is a flowchart illustrating a method of extracting a critical load value in various embodiments.
FIGS. 6 to 8 are graphs of a critical load prediction function illustrating the training process of a critical load prediction model in various embodiments.
FIG. 9 is a graph illustrating a configuration that provides a response delay prediction point using the result value of the critical load prediction model in various embodiments.
FIG. 10 is a flowchart illustrating a method of extracting a future load value in various embodiments.
FIGS. 11 and 12 are graphs of characteristic equations illustrating the training process of a future load prediction model in various embodiments.
FIG. 13 is a graph illustrating a configuration that determines the possibility of failure using the critical load value and the future load value in various embodiments.
FIG. 14 is a flowchart illustrating a process of retraining the failure prediction model in response to an external factor in various embodiments.
FIG. 15 is a flowchart illustrating a process of retraining the failure prediction model in response to an internal factor in various embodiments.
FIG. 16 is a graph illustrating a configuration that detects a section in which outliers occur in various embodiments.
FIGS. 17 to 19 are diagrams illustrating examples of dashboards provided by the server for predicting failures of an IT service using a pre-learned failure prediction model in various embodiments.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of the scope of the invention, which is defined only by the scope of the claims.

The terminology used herein is for the purpose of describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural unless the context specifically states otherwise. As used herein, "comprises" and/or "comprising" does not exclude the presence or addition of one or more components other than those stated. Like reference numerals refer to like components throughout the specification, and "and/or" includes each of the stated components and every combination of one or more of them. Although "first", "second", and the like are used to describe various components, these components are of course not limited by these terms. The terms are used only to distinguish one component from another. Accordingly, a first component mentioned below may also be a second component within the technical spirit of the present invention.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the meaning commonly understood by those of ordinary skill in the art to which the present invention belongs. In addition, terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive manner unless they are clearly and specifically defined.

As used herein, the term "unit" or "module" refers to software or to a hardware component such as an FPGA or ASIC, and a "unit" or "module" performs certain roles. However, a "unit" or "module" is not limited to software or hardware. A "unit" or "module" may be configured to reside on an addressable storage medium or may be configured to run on one or more processors. Thus, as an example, a "unit" or "module" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. The functionality provided within the components and "units" or "modules" may be combined into a smaller number of components and "units" or "modules", or further separated into additional components and "units" or "modules".

Spatially relative terms such as "below", "beneath", "lower", "above", and "upper" may be used to easily describe the relationship between one component and other components as illustrated in the drawings. Spatially relative terms should be understood as encompassing different orientations of the components during use or operation in addition to the orientation shown in the drawings. For example, when a component shown in a drawing is turned over, a component described as "below" or "beneath" another component may be placed "above" that other component. Accordingly, the exemplary term "below" may encompass both the downward and upward directions. Components may also be oriented in other directions, and spatially relative terms may be interpreted according to the orientation.

In this specification, a computer refers to any kind of hardware device that includes at least one processor and, depending on the embodiment, may be understood as also encompassing the software configuration operating on that hardware device. For example, a computer may be understood to include, without being limited to, smartphones, tablet PCs, desktops, laptops, and the user clients and applications running on each of these devices.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Each step described in this specification is described as being performed by a computer, but the entity performing each step is not limited thereto, and depending on the embodiment, at least some of the steps may be performed by different devices.

FIG. 1 is a diagram illustrating a system for predicting failures of an IT service using a pre-learned failure prediction model according to an embodiment of the present invention.

Referring to FIG. 1, a system for predicting failures of an IT service using a pre-learned failure prediction model according to an embodiment of the present invention may include an IT service failure prediction server 100, a user terminal 200, an external server 300, and a network 400.

Here, the system shown in FIG. 1 for predicting failures of an IT service using a pre-learned failure prediction model is according to one embodiment, and its components are not limited to the embodiment shown in FIG. 1; components may be added, changed, or removed as needed.

In one embodiment, the IT service failure prediction server 100 (hereinafter, "server 100") may determine the possibility of failure of an IT service. For example, the server 100 may collect, in real time, the indicator information generated as the IT service is provided, may extract a load value for the IT service by analyzing the indicator information on the IT service using a pre-learned failure prediction model, and may determine the possibility of failure of the IT service using the extracted load value.

Here, the pre-learned failure prediction model may be an artificial intelligence model trained in advance using training data that includes indicator information on a plurality of IT services.

An artificial intelligence model (also called a computational model, neural network, or network function) consists of one or more network functions, and each network function may consist of a set of interconnected computational units that may generally be referred to as 'nodes'. These 'nodes' may also be referred to as 'neurons'. Each network function includes at least one node. The nodes (or neurons) constituting the one or more network functions may be interconnected by one or more 'links'.

Within the artificial intelligence model, one or more nodes connected through a link may form a relative relationship of input node and output node. The concepts of input node and output node are relative: any node that is an output node with respect to one node may be an input node in its relationship with another node, and vice versa. As described above, the input-node-to-output-node relationship may be defined around a link. One or more output nodes may be connected to one input node through links, and vice versa.

In an input node and output node relationship connected through one link, the value of the output node may be determined based on the data input to the input node. Here, the link interconnecting the input node and the output node may have a weight. The weight may be variable and may be changed by a user or an algorithm so that the AI model performs a desired function. For example, when one or more input nodes are connected to one output node by respective links, the output node value may be determined based on the values input to the connected input nodes and the weights set on the links corresponding to those input nodes.
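The relationship between input values, link weights, and the output node value can be sketched as follows. This is a minimal illustration that assumes a plain weighted sum with no bias or activation function; the description only states that the output is determined "based on" the inputs and weights, and the function name `output_node_value` is hypothetical.

```python
def output_node_value(inputs, weights):
    """Return the weighted sum of input-node values over their links.

    Assumption: the output node simply sums value * weight per link.
    """
    if len(inputs) != len(weights):
        raise ValueError("each link needs exactly one weight")
    return sum(x * w for x, w in zip(inputs, weights))


# Three input nodes connected to one output node by three weighted links.
value = output_node_value([1.0, 2.0, 3.0], [0.5, -1.0, 2.0])
```

Changing any one weight changes `value`, which is the sense in which the weights define the behavior of the model.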

As described above, in the AI model one or more nodes are interconnected through one or more links to form input node and output node relationships. The characteristics of the AI model may be determined by the number of nodes and links, the connections between the nodes and links, and the weight assigned to each link. For example, two AI models with the same number of nodes and links but different link weights may be regarded as different from each other.

Some of the nodes constituting the AI model may form one layer based on their distances from the initial input node. For example, the set of nodes at distance n from the initial input node may form layer n. The distance from the initial input node may be defined as the minimum number of links that must be traversed to reach the node from the initial input node. However, this definition of a layer is arbitrary and given for explanation; the order of a layer within the AI model may be defined differently. For example, a layer of nodes may instead be defined by distance from the final output node.
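The layer definition above (minimum number of links from an initial input node) corresponds to a breadth-first search over the link graph. The following sketch assumes links are given as directed (source, destination) pairs; the node names and the function `layer_indices` are hypothetical.

```python
from collections import deque


def layer_indices(links, input_nodes):
    """Map each reachable node to its minimum link count from any initial input node."""
    adjacency = {}
    for src, dst in links:
        adjacency.setdefault(src, []).append(dst)

    distance = {node: 0 for node in input_nodes}
    queue = deque(input_nodes)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in distance:  # first visit is the shortest path in BFS
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance


# "y" is reachable through hidden nodes (2 links) and directly from "x1" (1 link).
links = [("x1", "h1"), ("x2", "h1"), ("x2", "h2"),
         ("h1", "y"), ("h2", "y"), ("x1", "y")]
layers = layer_indices(links, ["x1", "x2"])
```

Because the distance is the minimum link count, the direct link from `x1` places `y` in layer 1 in this example, which shows why the description calls the layer definition arbitrary.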

The initial input node may mean one or more nodes in the AI model to which data is input directly, without passing through a link from another node. Alternatively, in terms of link-based relationships between nodes in the AI model network, it may mean nodes that have no other input nodes connected to them by a link. Similarly, the final output node may mean one or more nodes in the AI model that have no output node in their relationships with other nodes. A hidden node may mean a node of the AI model that is neither an initial input node nor a final output node. In an AI model according to an embodiment of the present disclosure, the input layer may have more nodes than a hidden layer close to the output layer, and the number of nodes may decrease from the input layer toward the hidden layers.

The AI model may include one or more hidden layers. A hidden node of a hidden layer may take as input the output of the previous layer and the outputs of neighboring hidden nodes. The number of hidden nodes may be the same or different from one hidden layer to another. The number of nodes in the input layer may be determined based on the number of data fields in the input data and may be the same as or different from the number of hidden nodes. Input data fed to the input layer may be processed by the hidden nodes of the hidden layers and output by a fully connected layer (FCL) serving as the output layer.

In various embodiments, the server 100 may build training data for training the AI model and may train the AI model with that data using at least one of supervised learning, unsupervised learning, and semi-supervised learning.

The AI model is trained to minimize the error of its output. Training is a process of repeatedly inputting training data into the AI model, computing the error between the model's output for the training data and the target, and backpropagating that error from the output layer toward the input layer so as to reduce it, thereby updating the weight of each node of the AI model.

Supervised learning uses training data in which each item is labeled with the correct answer (that is, labeled training data), whereas in unsupervised learning the training data may not be labeled with correct answers. For example, training data for supervised learning of data classification may be data in which each item is labeled with a category. The labeled training data is input to the AI model, and an error may be calculated by comparing the model's output (category) with the label of the training data.

As another example, in unsupervised learning for data classification, an error may be calculated by comparing the input training data with the AI model's output. The calculated error is backpropagated through the AI model in the reverse direction (that is, from the output layer toward the input layer), and the connection weights of the nodes in each layer may be updated accordingly. The amount by which each node's connection weight changes may be determined by the learning rate.
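The error-driven weight update scaled by a learning rate can be sketched for the simplest possible case: a single linear output node trained by gradient descent on a squared error. This is an illustrative assumption, not the training procedure of the disclosed model; `train_step` and the sample values are hypothetical.

```python
def train_step(weights, inputs, target, learning_rate):
    """One gradient-descent update for a single linear node with squared error."""
    prediction = sum(w * x for w, x in zip(weights, inputs))
    error = prediction - target
    # For loss = 0.5 * error**2, the gradient with respect to w_i is error * x_i.
    return [w - learning_rate * error * x for w, x in zip(weights, inputs)]


weights = [0.0, 0.0]
inputs, target = [1.0, 2.0], 5.0
for _ in range(200):
    weights = train_step(weights, inputs, target, learning_rate=0.1)

prediction = sum(w * x for w, x in zip(weights, inputs))
```

With these values each step halves the remaining error, so `prediction` converges to the target; a larger learning rate changes the weights by more per step, as the paragraph above states.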

One forward computation of the AI model on the input data together with the backpropagation of the error may constitute a learning cycle (epoch). The learning rate may be applied differently depending on the number of learning cycles completed. For example, a high learning rate may be used early in training so that the AI model quickly reaches a certain level of performance, improving efficiency, and a low learning rate may be used late in training to improve accuracy.
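A learning rate that starts high and is lowered in later epochs can be sketched with a step-decay schedule. The schedule type and the default values below are assumptions chosen for illustration; the description does not specify how the rate is varied.

```python
def learning_rate_for_epoch(epoch, initial_rate=0.1, decay=0.5, step=10):
    """Step decay: multiply the rate by `decay` once every `step` epochs."""
    return initial_rate * decay ** (epoch // step)


# Epochs 0 and 9 share the initial rate; the rate drops at epochs 10 and 20.
rates = [learning_rate_for_epoch(e) for e in (0, 9, 10, 25)]
```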

In training an AI model, the training data is generally a subset of the real data (that is, the data to be processed with the trained AI model), so there may be learning cycles in which the error on the training data decreases while the error on the real data increases. Overfitting is this phenomenon, in which the model learns the training data excessively and the error on real data increases. For example, an AI model that learned what a cat is only from yellow cats and fails to recognize a cat of another color as a cat exhibits a kind of overfitting.

Overfitting can increase the error of a machine learning algorithm. Various optimization methods may be used to prevent it, such as increasing the amount of training data, regularization, or dropout, in which some nodes of the network are omitted during training.
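Dropout, one of the measures named above, can be sketched as follows. The sketch assumes the common "inverted dropout" variant, in which surviving activations are rescaled so their expected value is unchanged; the description does not say which variant is used, and the function name is hypothetical.

```python
import random


def dropout(values, drop_probability, rng):
    """Zero each value with probability `drop_probability`; rescale the survivors."""
    if not 0.0 <= drop_probability < 1.0:
        raise ValueError("drop_probability must be in [0, 1)")
    keep = 1.0 - drop_probability
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the example is repeatable
activations = [1.0] * 1000
dropped = dropout(activations, 0.5, rng)
```

Dropout is applied only during training; at inference time the activations are passed through unchanged.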

In various embodiments, the server 100 may determine the possibility of a failure of the IT service using the indicator information for the IT service and, when it determines that a failure will occur, may provide a notification to a user (e.g., an IT service manager).

In one embodiment, the user terminal 200 (e.g., the terminal of an IT service manager) may be connected to the server 100 through the network 400, may provide the server 100 with various information and data for determining the possibility of a failure of the IT service (e.g., log data generated as the IT service is used), and may in response receive the failure possibility determination result for the IT service from the server 100.

In various embodiments, the user terminal 200 includes an operating system for executing the IT service failure prediction process using the pre-learned failure prediction model, implemented in web or application form, and may be a smartphone having a display in a predetermined area for outputting the dashboard provided as that process runs (e.g., a dashboard for monitoring failure possibility determination results in real time, FIGS. 17 to 19). However, the user terminal 200 is not limited thereto and, as a wireless communication device that guarantees portability and mobility, may include any kind of handheld wireless communication device, such as a navigation device, a PCS (Personal Communication System), GSM (Global System for Mobile communications), PDC (Personal Digital Cellular), PHS (Personal Handyphone System), PDA (Personal Digital Assistant), IMT (International Mobile Telecommunication)-2000, CDMA (Code Division Multiple Access)-2000, W-CDMA (W-Code Division Multiple Access), or Wibro (Wireless Broadband Internet) terminal, a smart pad, or a tablet PC.

Here, the network 400 means a connection structure that allows information to be exchanged between nodes such as a plurality of terminals and servers. Examples of such a network include a local area network (LAN), a wide area network (WAN), the Internet (WWW: World Wide Web), wired and wireless data communication networks, telephone networks, and wired and wireless television networks.

The wireless data communication network may include, but is not limited to, 3G, 4G, 5G, 3GPP (3rd Generation Partnership Project), 5GPP (5th Generation Partnership Project), LTE (Long Term Evolution), WIMAX (World Interoperability for Microwave Access), Wi-Fi, the Internet, LAN (Local Area Network), Wireless LAN (Wireless Local Area Network), WAN (Wide Area Network), PAN (Personal Area Network), RF (Radio Frequency), a Bluetooth network, an NFC (Near-Field Communication) network, a satellite broadcasting network, an analog broadcasting network, and a DMB (Digital Multimedia Broadcasting) network.

In one embodiment, the external server 300 may be connected to the server 100 through the network 400, may provide various information and data needed to determine the possibility of a failure of the IT service, and may in response receive the failure possibility determination result for the IT service from the server 100. For example, the external server 300 may be, but is not limited to, an IT service operation server (a customer's server) that deploys and operates the IT service. Hereinafter, the hardware configuration of the server 100 that performs the IT service failure prediction process using the pre-learned failure prediction model is described with reference to FIG. 2.

FIG. 2 is a hardware configuration diagram of an IT service failure prediction server using a pre-learned failure prediction model according to another embodiment of the present invention.

Referring to FIG. 2, in various embodiments, the server 100 may include one or more processors 110, a memory 120 into which a computer program 151 executed by the processor 110 is loaded, a bus 130, a communication interface 140, and a storage 150 that stores the computer program 151. FIG. 2 shows only the components related to the embodiment of the present invention. Accordingly, a person of ordinary skill in the art to which the present invention pertains will understand that other general-purpose components may be included in addition to those shown in FIG. 2.

The processor 110 controls the overall operation of each component of the server 100. The processor 110 may include a CPU (Central Processing Unit), MPU (Micro Processor Unit), MCU (Micro Controller Unit), GPU (Graphic Processing Unit), or any other type of processor well known in the technical field of the present invention.

The processor 110 may also perform operations for at least one application or program for executing the method according to the embodiments of the present invention, and the server 100 may include one or more processors.

In various embodiments, the processor 110 may further include a RAM (Random Access Memory, not shown) and a ROM (Read-Only Memory, not shown) that temporarily and/or permanently store signals (or data) processed inside the processor 110. The processor 110 may also be implemented as a system on chip (SoC) that includes at least one of a graphics processing unit, a RAM, and a ROM.

The memory 120 stores various data, commands, and/or information. The memory 120 may load the computer program 151 from the storage 150 in order to execute methods/operations according to various embodiments of the present invention. When the computer program 151 is loaded into the memory 120, the processor 110 may perform the method/operation by executing one or more instructions constituting the computer program 151. The memory 120 may be implemented as a volatile memory such as RAM, but the technical scope of the present disclosure is not limited thereto.

The bus 130 provides communication between the components of the server 100. The bus 130 may be implemented as various types of buses, such as an address bus, a data bus, and a control bus.

The communication interface 140 supports wired and wireless Internet communication of the server 100. The communication interface 140 may also support various communication methods other than Internet communication. To this end, the communication interface 140 may include a communication module well known in the technical field of the present invention. In some embodiments, the communication interface 140 may be omitted.

The storage 150 may store the computer program 151 non-transitorily. When the IT service failure prediction process using the pre-learned failure prediction model is performed through the server 100, the storage 150 may store the various information needed to provide that process.

The storage 150 may include a non-volatile memory such as a ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), or flash memory, a hard disk, a removable disk, or any other type of computer-readable recording medium well known in the art to which the present invention pertains.

The computer program 151 may include one or more instructions that, when loaded into the memory 120, cause the processor 110 to perform methods/operations according to various embodiments of the present invention. That is, the processor 110 may perform those methods/operations by executing the one or more instructions.

In one embodiment, the computer program 151 may include one or more instructions for performing an IT service failure prediction method using a pre-learned failure prediction model, the method including: collecting indicator information for an IT service; extracting a load value for the IT service by analyzing the collected indicator information through the pre-learned failure prediction model; and determining the possibility of a failure of the IT service using the extracted load value.

The steps of a method or algorithm described in connection with an embodiment of the present invention may be implemented directly in hardware, in a software module executed by hardware, or in a combination of the two. The software module may reside in RAM (Random Access Memory), ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other type of computer-readable recording medium well known in the art to which the present invention pertains.

The components of the present invention may be implemented as a program (or application) and stored in a medium in order to be executed in combination with a computer, which is hardware. The components of the present invention may be implemented through software programming or software elements; similarly, embodiments may be implemented in a programming or scripting language such as C, C++, Java, or assembler, with various algorithms implemented as combinations of data structures, processes, routines, or other programming constructs. Functional aspects may be implemented as algorithms running on one or more processors. Hereinafter, the IT service failure prediction method using a pre-learned failure prediction model performed by the server 100 is described with reference to FIGS. 3 to 18.

FIG. 3 is a diagram showing the structure of an IT service failure prediction system using a pre-learned failure prediction model applicable to various embodiments, and FIG. 4 is a flowchart of an IT service failure prediction method using a pre-learned failure prediction model according to yet another embodiment of the present invention.

Referring to FIGS. 3 and 4, in step S110, the server 100 may collect indicator information for an IT service. For example, the server 100 may obtain indicator information for the IT service in real time from a server (or system) that deploys and operates the IT service.

Here, the indicator information may include information about the load and performance of the IT service. For example, the server 100 may collect indicator information that includes, as load information, the number of concurrent users of the IT service and the number of transactions per second (TPS) and, as performance information, the response time of the IT service. However, the indicator information is not limited thereto.
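One way to represent a single sample of the indicator information described above is a small record type that keeps the load fields and the performance field together. The field names, units, and the `timestamp` field are hypothetical; the description only names concurrent users, TPS, and response time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndicatorSample:
    timestamp: float         # seconds since the epoch (assumed field)
    concurrent_users: int    # load: number of simultaneous users
    tps: float               # load: transactions per second
    response_time_ms: float  # performance: response time (unit assumed)


def split_load_and_performance(sample):
    """Separate one sample into its load part and its performance part."""
    load = {"concurrent_users": sample.concurrent_users, "tps": sample.tps}
    performance = {"response_time_ms": sample.response_time_ms}
    return load, performance


sample = IndicatorSample(timestamp=0.0, concurrent_users=120,
                         tps=35.5, response_time_ms=210.0)
load, performance = split_load_and_performance(sample)
```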

In step S120, the server 100 may extract a threshold load value for the IT service by analyzing the indicator information collected in step S110 through the pre-learned failure prediction model. In step S130, the server 100 may likewise extract a future load value for the IT service by analyzing the indicator information collected in step S110 through the pre-learned failure prediction model.

In various embodiments, the pre-learned failure prediction model may include a threshold load prediction model and a future load prediction model; the server 100 may extract the threshold load value of the IT service using the threshold load prediction model and the future load value of the IT service using the future load prediction model.

Here, the threshold load value may mean the threshold of load that the IT service can tolerate, and the future load value may mean the predicted load of the IT service at a given point in the future.

The IT service failure prediction method using the pre-learned failure prediction model shown in FIG. 4 is illustrated as extracting the threshold load value and then the future load value, but this is only to distinguish the two operations. The order of extraction is not limited to the embodiment shown in FIG. 1: the future load value may be extracted first, or the two extractions may be performed simultaneously in parallel. Hereinafter, the process of extracting the threshold load value and the future load value from the indicator information of the IT service is described with reference to FIGS. 5 to 12.
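Because the two extractions read the same collected indicator information and do not depend on each other, they can be run concurrently. The sketch below uses Python's standard `concurrent.futures` module; the two extraction functions are placeholders standing in for the threshold load and future load prediction models, not the disclosed models, and the sample format is assumed.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_threshold_load(samples):
    # Placeholder for the threshold load prediction model (step S120).
    return max(users for users, _ in samples)


def extract_future_load(samples):
    # Placeholder for the future load prediction model (step S130).
    return samples[-1][0]


def extract_loads(samples):
    """Run both extractions in parallel and return (threshold, future)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        threshold = pool.submit(extract_threshold_load, samples)
        future = pool.submit(extract_future_load, samples)
        return threshold.result(), future.result()


# Each sample is (concurrent users, response time in ms); values are illustrative.
samples = [(100, 180.0), (150, 240.0), (130, 205.0)]
threshold_load, future_load = extract_loads(samples)
```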

FIG. 5 is a flowchart illustrating a method of extracting a threshold load value in various embodiments.

Referring to FIG. 5, in step S210, the server 100 may generate a threshold load prediction model that includes a threshold load prediction function for predicting the threshold load.

In various embodiments, the server 100 may generate a threshold load prediction model that includes a threshold load prediction function defining the relationship between the number of concurrent users of the IT service and the response time of the IT service. Because contention for computing resources generally makes performance under load follow an exponentially increasing trend, as in graph (a) of FIG. 6, the server 100 may generate a threshold load prediction function of exponential form, as in Equation 1 below.

In practice, moreover, even when the number of concurrent users is close to 0, any nonzero load, however small, consumes a baseline amount of resources, and resource consumption changes little until the number of concurrent users reaches a certain level. In view of this, the server 100 may first assume a non-linear threshold load prediction function as shown in FIG. 7 and, by training that function through steps S220 to S260 described later, process the non-linear threshold load prediction function into a linear threshold load prediction function.

<Equation 1>

Here, Equation 1 defines the critical load prediction function, whose three parameters start from their initial values and whose input value is the number of concurrent users of the IT service. Equation 1 applies when the input value exceeds the threshold given by one of these parameters; when the input value is 0 or less, the function value is 0, and when the input value is greater than 0 and no more than that threshold, the function value is a constant given by one of these parameters.
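The piecewise behavior described for Equation 1 can be illustrated with a minimal sketch. The exact expression of Equation 1 is not reproduced in this text, so the exponential form below and the parameter names `a` (growth rate), `b` (baseline value), and `c` (threshold number of users) are assumptions chosen only to match the three cases described above.

```python
import math


def critical_load_response(x, a, b, c):
    """Hypothetical piecewise critical load function (not the patent's exact Equation 1).

    x: number of concurrent users
    a: assumed growth-rate parameter
    b: assumed baseline value returned for small loads
    c: assumed threshold up to which the value stays flat
    """
    if x <= 0:
        return 0.0                      # no load, no response time
    if x <= c:
        return b                        # flat baseline up to the threshold
    return b * math.exp(a * (x - c))    # exponential growth beyond the threshold
```

With this assumed form the function is continuous at `x = c`, which is one reasonable design choice; other exponential forms would fit the description equally well.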

In step S220, the server 100 may set the parameters of the critical load prediction function generated in step S210.

In various embodiments, the server 100 may set the initial value of each of the three parameters of the critical load prediction function to an arbitrary value greater than or equal to 0.

In various embodiments, the server 100 may set each of the initial parameter values of the critical load prediction function according to statistical characteristics. For example, as shown in FIG. 8, the server 100 may set the initial value of one parameter to 1, the initial value of another parameter to the minimum value (min(x)) of the critical load prediction function, and the initial value of the remaining parameter to the average value (avg(x)) of the critical load prediction function. However, the present invention is not limited thereto.
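As a minimal sketch of the statistical initialization described above, the helper below returns the three initial values (1, min(x), avg(x)). It assumes that x denotes the observed concurrent-user counts and that the three values are returned in the order given in the text; the function name is hypothetical.

```python
def initial_parameters(concurrent_users):
    """Return hypothetical initial values (1, min(x), avg(x)) for the three parameters."""
    xs = list(concurrent_users)
    if not xs:
        raise ValueError("at least one observation is required")
    # 1.0 for the first parameter, the minimum and the mean of x for the other two.
    return 1.0, float(min(xs)), sum(xs) / len(xs)
```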

In step S230, the server 100 may extract a result value by inputting the index information of the IT service into the critical load prediction function generated through steps S210 and S220, calculate the performance of the critical load prediction model based on the extracted result value, and calculate the difference between that performance and a preset target performance.

In various embodiments, the server 100 may calculate the result value of the critical load prediction model using Equation 1 above, and may calculate the difference between the performance of the critical load prediction model and the preset target performance using Equation 3 below.

<Equation 3>

Here, the result value of the critical load prediction function is the expected response time when the number of concurrent users of the IT service is n, and the other term in Equation 3 denotes the preset target performance.

In step S240, the server 100 may determine whether the difference calculated in step S230 satisfies the termination condition.

Here, the termination condition may be whether the performance of the critical load prediction model generated through the above process has reached the preset target performance. For example, the termination condition may be whether the difference between the performance of the critical load prediction model and the preset target performance is less than or equal to a preset value, but is not limited thereto. That is, the server 100 may determine whether the termination condition is satisfied by determining whether the difference calculated in step S230 is less than or equal to the preset value.

In step S250, when the difference evaluated in step S240 exceeds the preset reference value, the server 100 may use Equation 2 below to determine the parameters of the critical load prediction function so that the difference between the performance of the critical load prediction model and the preset target performance is minimized.

<Equation 2>

Here, the value obtained from Equation 2 may be the parameter value of the critical load prediction function for which the calculated difference is minimized.

In various embodiments, the server 100 may use a model trained with stochastic gradient descent (SGD) to determine the parameters of the critical load prediction function that minimize the difference between the performance of the critical load prediction model and the preset target performance.

Thereafter, the server 100 may correct the parameters of the previously generated critical load prediction function using the parameters determined as above, and may then repeat steps S220 to S250 until the termination condition is satisfied.
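The evaluate, check, and correct loop of steps S230 to S250 can be sketched as follows. This is only an illustration under stated assumptions: it fits a plain exponential y ≈ b·exp(a·x) by per-sample stochastic gradient descent on log(y), uses mean squared error in log space as the "difference", and treats observed response times as the target. The function name, the loss, the learning rate, and the scaling of x are all hypothetical and are not taken from Equations 1 to 3.

```python
import math
import random


def fit_exponential_sgd(xs, ys, tol=1e-6, lr=0.3, max_epochs=2000, seed=0):
    """Fit y ~ b * exp(a * x) by SGD on log(y). Requires every y > 0.

    Returns (a, b, final_loss, converged).
    """
    rng = random.Random(seed)
    x_max = max(xs)
    # Scale x into (0, 1] so that a fixed learning rate stays stable.
    data = [(x / x_max, math.log(y)) for x, y in zip(xs, ys)]
    w0, w1 = 0.0, 0.0  # w0 = log(b), w1 = a * x_max

    def loss():
        return sum((w0 + w1 * u - t) ** 2 for u, t in data) / len(data)

    for _ in range(max_epochs):
        current = loss()
        if current <= tol:            # termination check (cf. step S240)
            return w1 / x_max, math.exp(w0), current, True
        rng.shuffle(data)
        for u, t in data:             # parameter correction (cf. step S250)
            err = w0 + w1 * u - t
            w0 -= lr * err
            w1 -= lr * err * u
    return w1 / x_max, math.exp(w0), loss(), False
```

Fitting in log space turns the exponential model into a linear one, which keeps the sketch short and well behaved; a direct fit of a piecewise exponential would need a more careful optimizer.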

In step S260, when it is determined in step S240 that the difference between the performance of the critical load prediction model and the preset target performance is less than or equal to the preset value, the server 100 may store, as the optimal model, the critical load prediction model including the critical load prediction function to which the finally determined parameters are applied.

In step S270, the server 100 may extract the critical load value of the IT service as a result value by inputting the index information of the IT service into the optimal model stored through the above process.

Once a specific critical load prediction model has been confirmed and stored as the optimal model through the above process, the server 100 may use the stored optimal model to extract critical load values until a trigger for retraining the critical load prediction model is detected.

In various embodiments, as shown in FIG. 9, the server 100 may not only extract the critical load value of the IT service using the optimal model stored through the above process, but also calculate a predicted response delay point using the extracted critical load value and provide the calculated point.

FIG. 10 is a flowchart illustrating a method of extracting a future load value in various embodiments.

Referring to FIG. 10, in step S310, the server 100 may calculate a characteristic function for the index information of the IT service over a predetermined period.

First, referring to FIG. 10, the server 100 may divide the index information of the IT service for the predetermined period into a plurality of sections along the time series (e.g., t₀~t₁, t₁~t₂, t₂~t₃, …, t_(k-1)~t_k). The sections may all have the same time length, but are not limited thereto.

Thereafter, the server 100 may calculate a characteristic vector for each of the plurality of sections. Various techniques for calculating a characteristic vector (or eigenvector) are known, so the specific technique is not limited here.

Thereafter, the server 100 may calculate the characteristic function for the index information of the IT service over the predetermined period using the plurality of characteristic vectors calculated for the respective sections. For example, the server 100 may combine the characteristic vectors to calculate a characteristic function of the form φ = x₁ + x₂ + x₃ + … + x_k, but is not limited thereto.

In various embodiments, the server 100 may assign a weight to each of the plurality of characteristic vectors calculated for the respective sections, taking into account at least one of: a time characteristic (e.g., year, month, day, hour, minute, day of the week, whether it is a weekend, whether it is a holiday), a usage pattern characteristic of users of the IT service (e.g., trend characteristics such as the value n minutes ago, n days ago, or n weeks ago, or an n-minute moving average), and a characteristic of correlation with the future load value (e.g., the value n minutes ago of an index correlated with the future load). By calculating the characteristic function from the weighted characteristic vectors, the server 100 may obtain a characteristic function of the form φ = w₁x₁ + w₂x₂ + w₃x₃ + … + w_k x_k, that is, a characteristic function reflecting time, trend, and correlation characteristics.
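The weighted combination φ = w₁x₁ + w₂x₂ + … + w_k x_k can be sketched as an element-wise weighted sum of the per-section characteristic vectors. The function name and the representation of vectors as plain lists of floats are assumptions made for illustration; how the weights are derived from time, trend, and correlation characteristics is not specified here.

```python
def weighted_characteristic(vectors, weights):
    """Combine per-section characteristic vectors as phi = sum_i w_i * x_i."""
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per characteristic vector")
    dim = len(vectors[0])
    phi = [0.0] * dim
    for w, vec in zip(weights, vectors):
        if len(vec) != dim:
            raise ValueError("all characteristic vectors must have the same length")
        for i, value in enumerate(vec):
            phi[i] += w * value
    return phi
```

Setting every weight to 1 reduces this to the unweighted form φ = x₁ + x₂ + … + x_k.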

In step S320, as shown in FIG. 12, the server 100 may generate a future load prediction model by training the characteristic function calculated in step S310 based on quantile regression. Here, the server 100 may set the upper bound to 95% and the lower bound to 5% for the characteristic function and train the median between them by quantile regression, so that the prediction interval has a confidence level of about 90%. Various techniques for training a characteristic function based on quantile regression are known, so the specific technique is not limited here.
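Quantile regression is commonly trained by minimizing the pinball (quantile) loss, which is sketched below. The text does not say which training procedure is used, so this is only an illustration of the standard loss, with a hypothetical function name; q = 0.05, 0.5, and 0.95 would correspond to the lower bound, median, and upper bound mentioned above.

```python
def pinball_loss(actual, predicted, q):
    """Mean pinball loss for quantile level q (0 < q < 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    total = 0.0
    for y, p in zip(actual, predicted):
        diff = y - p
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(actual)
```

At q = 0.95, predicting too low is penalized 19 times more heavily than predicting too high, which pushes the fitted curve toward the upper bound of the load.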

In step S330, the server 100 may store the future load prediction model trained through steps S310 and S320 as the optimal model.

In step S340, the server 100 may extract the future load value of the IT service by analyzing the index information of the IT service using the future load prediction model stored as the optimal model.

Once a specific future load prediction model has been confirmed and stored as the optimal model through the above process, the server 100 may use the stored optimal model to extract future load values until a trigger for retraining the future load prediction model is detected.

Referring again to FIGS. 3 and 4, in step S140, the server 100 may determine the possibility of failure of the IT service using the critical load value extracted in step S120 and the future load value extracted in step S130.

In various embodiments, as shown in FIG. 13, the server 100 may determine the possibility of failure of the IT service according to whether the future load value exceeds the critical load value.

In various embodiments, the server 100 may normalize the critical load value and the future load value, and may determine the possibility of failure of the IT service according to whether the difference between the normalized critical load value and the normalized future load value is 0 or less.

Here, determining the possibility of failure of the IT service may be implemented as a binary decision on whether a failure will occur, such as "a failure is predicted to occur" or "a failure is not predicted to occur". For example, the server 100 may determine that no failure of the IT service will occur when the future load value does not exceed the critical load value, and that a failure of the IT service will occur when the future load value exceeds the critical load value. However, the present invention is not limited thereto, and in some cases the determination may be implemented so as to calculate a probability of failure according to the degree to which the future load value exceeds the critical load value.
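The binary decision compares the future load value with the critical load value, as in the description of FIG. 13. The sketch below shows that comparison together with one possible graded score. The score (relative overshoot clipped to the range 0 to 1) is a hypothetical stand-in for the unspecified probability calculation, and both function names are assumptions.

```python
def predict_failure(future_load, critical_load):
    """Binary decision: a failure is predicted when future load exceeds critical load."""
    return future_load > critical_load


def excess_ratio(future_load, critical_load):
    """Hypothetical graded score: relative overshoot of the critical load, clipped to [0, 1]."""
    if critical_load <= 0:
        raise ValueError("critical_load must be positive")
    overshoot = (future_load - critical_load) / critical_load
    return min(1.0, max(0.0, overshoot))
```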

In various embodiments, to improve the reliability of the failure possibility determination, the server 100 may retrain the failure prediction model that determines the possibility of failure when a trigger for retraining (e.g., an internal or external factor relating to the IT service) occurs. This is described below with reference to FIGS. 14 and 15.

FIG. 14 is a flowchart illustrating a process of retraining the failure prediction model according to an external factor in various embodiments.

Referring to FIG. 14, in step S410, the server 100 may detect a model retraining trigger caused by an external factor.

Here, a model retraining trigger caused by an external factor may mean that a new IT service is deployed because the previously provided first IT service has been updated, or that the infrastructure of the system providing the first IT service has changed, but is not limited thereto.

In step S420, when a model retraining trigger caused by an external factor is detected in step S410, that is, when it is detected that a second IT service rather than the first IT service is being provided because the previously provided first IT service has been updated or the infrastructure of the system providing it has changed, the server 100 may process the training data of the failure prediction model in order to retrain it. For example, the index information of the first IT service may include information about the load of the first IT service and information about its performance, and the server 100 may exclude the performance information from the index information of the first IT service.

When the first IT service is changed to the second IT service by such an external factor, the performance information of the first IT service (e.g., response time) is no longer meaningful, whereas information about the behavior of users of the first IT service, that is, the load information of the first IT service (e.g., number of concurrent users), remains meaningful. Accordingly, the server 100 may invalidate the performance information, which is meaningless, and keep only the load information, which is meaningful, from the index information of the first IT service.

In step S430, after processing the training data in step S420, the server 100 may collect call statistics for the first IT service and then perform an on-demand performance test on the newly provided second IT service.

First, the server 100 may generate a virtual user request for the second IT service. Here, the virtual user request is a control command that controls various functions and operations of the second IT service in order to evaluate its performance, and may be a virtually generated user request rather than one input by an actual user. However, the present invention is not limited thereto.

Thereafter, the server 100 may send the virtual user request to the server providing the second IT service, and may collect from that server the information generated as the second IT service is controlled according to the virtual user request (e.g., information about the performance of the second IT service). However, the present invention is not limited thereto.

In step S440, the server 100 may generate training data by combining the training data processed in step S420 (e.g., training data containing only the load information from the index information of the first IT service) with the performance information of the second IT service collected in step S430, and may train the failure prediction model on it, thereby generating a failure prediction model applicable to the second IT service.
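Steps S420 to S440 can be sketched as rebuilding the training data: the load information of the first IT service is kept, its performance information is discarded, and performance measured on the second IT service is attached instead. The record field names and the `measure_response_time` callable (standing in for the on-demand test driven by virtual user requests) are hypothetical; the text does not specify how load and performance records are paired.

```python
def rebuild_training_data(first_service_records, measure_response_time):
    """Keep old load info, drop old performance info, attach new measurements.

    first_service_records: iterable of dicts with "concurrent_users" and
        "response_time" keys (hypothetical field names).
    measure_response_time: callable taking a number of concurrent users and
        returning the response time observed on the second IT service.
    """
    rows = []
    for record in first_service_records:
        users = record["concurrent_users"]   # load information stays valid
        # The old "response_time" value is deliberately ignored.
        rows.append({
            "concurrent_users": users,
            "response_time": measure_response_time(users),
        })
    return rows
```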

That is, even when the provided IT service is changed by an external factor, the present invention can automatically build a customized failure prediction model without direct intervention by an expert.

FIG. 15 is a flowchart illustrating a process of retraining the failure prediction model according to an internal factor in various embodiments.

Referring to FIG. 15, in step S510, the server 100 may detect a model retraining trigger caused by an internal factor.

Here, a model retraining trigger caused by an internal factor may mean that the result value of the failure prediction model differs from the actual value. For example, as shown in FIG. 16, the server 100 may detect, as a model retraining trigger, a case where no failure of the IT service occurs at a first time point at which a failure was predicted, or a case where a failure of the IT service occurs at a second time point although no failure was predicted.

In step S520, when a model retraining trigger is detected in step S510, the server 100 may generate training data and retrain the failure prediction model.

In various embodiments, when no failure of the IT service occurs at a first time point at which a failure was predicted based on the load value extracted from the index information of the IT service, the server 100 may retrain the pre-learned failure prediction model using the index information of the IT service at the first time point as training data.

Likewise, when a failure of the IT service occurs at a second time point although no failure was predicted based on the load value extracted from the index information of the IT service, the server 100 may retrain the pre-learned failure prediction model using the index information of the IT service at the second time point as training data.

In various embodiments, when an outlier occurs, such as no failure occurring at a first time point at which a failure was predicted, or a failure occurring at a second time point although none was predicted, the server 100 may label the index information for the section in which the outlier occurred and store it temporarily. Later, when the amount of temporarily stored index information exceeds a certain level, the server 100 may retrain the failure prediction model using each piece of that index information as training data.
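The buffered retraining described above can be sketched as a small class that stores labeled mispredictions and calls a retraining routine once the buffer exceeds a limit. The class name, the `retrain` callback, and the `min_samples` limit are hypothetical; the text only states that retraining happens when the stored index information exceeds a certain level.

```python
class MispredictionBuffer:
    """Collect labeled mispredictions and trigger retraining in batches."""

    def __init__(self, retrain, min_samples):
        self.retrain = retrain          # callable receiving a list of (index_info, label)
        self.min_samples = min_samples  # retrain once the buffer exceeds this size
        self.samples = []

    def observe(self, index_info, predicted_failure, actual_failure):
        """Record one outcome; return True if retraining was triggered."""
        if predicted_failure == actual_failure:
            return False                # prediction was correct, nothing to store
        # Label the index information with what actually happened.
        self.samples.append((index_info, actual_failure))
        if len(self.samples) > self.min_samples:
            batch, self.samples = self.samples, []
            self.retrain(batch)
            return True
        return False
```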

That is, when the failure prediction model is judged to be failing to extract accurate result values, the present invention continuously retrains it on training data so that it extracts accurate result values, thereby continuously improving the performance of the failure prediction model.

The failure prediction method of an IT service using the pre-learned failure prediction model has been described above with reference to the flowcharts shown in the drawings. For simplicity, the method has been shown and described as a series of blocks, but the present invention is not limited to the order of the blocks; some blocks may be performed in an order different from that shown and described herein, or may be performed concurrently. In addition, new blocks not described in this specification and drawings may be added, or some blocks may be deleted or changed.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing its technical spirit or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

100: failure prediction server for IT services
200: user terminal
300: external server
400: network

Claims

A method performed by a computing device, the method comprising:
collecting index information on an IT service;
extracting a load value for the IT service by analyzing the collected index information through a pre-learned failure prediction model; and
determining a possibility of failure of the IT service using the extracted load value,
wherein extracting the load value comprises:
extracting a critical load value of the IT service from the collected index information; and
extracting a future load value of the IT service from the collected index information,
wherein determining the possibility of failure comprises
determining the possibility of failure of the IT service according to whether the extracted future load value exceeds the extracted critical load value, and
wherein extracting the critical load value comprises:
generating a critical load prediction model including a critical load prediction function in the form of an exponential function that defines a relationship between the number of concurrent users of the IT service and a response time of the IT service;
calculating a difference between the performance of the generated critical load prediction model and a preset target performance based on a result value extracted by using the index information of the IT service for a predetermined period as an input value of the generated critical load prediction model;
confirming the generated critical load prediction model when the calculated difference is less than or equal to a preset reference value, and correcting the critical load prediction function included in the generated critical load prediction model when the calculated difference exceeds the preset reference value; and
extracting the critical load value from the collected index information using the confirmed critical load prediction model,
Failure prediction method of IT service using pre-learned failure prediction model.

delete

The method of claim 1,
wherein calculating the difference between the performance of the generated critical load prediction model and the preset target performance comprises
extracting the result value using Equation 1 below, and
wherein correcting the critical load prediction function comprises,
when the calculated difference exceeds the preset reference value, determining the parameters of the critical load prediction function using Equation 2 below so that the calculated difference is minimized, and correcting the critical load prediction function using the determined parameters,
Failure prediction method of IT service using pre-learned failure prediction model.
<Equation 1>

Here, Equation 1 defines the critical load prediction function, whose input value is the number of concurrent users of the IT service. Equation 1 applies when the input value exceeds the threshold given by one of the parameters of the function; when the input value is 0 or less, the function value is 0, and when the input value is greater than 0 and no more than that threshold, the function value is a constant given by one of the parameters.

<Equation 2>

Here, Equation 2 includes a term denoting the preset target performance.

A method performed by a computing device, the method comprising:
collecting index information on an IT service;
extracting a load value for the IT service by analyzing the collected index information through a pre-learned failure prediction model; and
determining a possibility of failure of the IT service using the extracted load value,
wherein extracting the load value comprises:
extracting a critical load value of the IT service from the collected index information; and
extracting a future load value of the IT service from the collected index information,
wherein determining the possibility of failure comprises
determining the possibility of failure of the IT service according to whether the extracted future load value exceeds the extracted critical load value, and
wherein extracting the future load value comprises:
dividing the index information of the IT service for a predetermined period into a plurality of sections;
calculating a characteristic vector for each of the plurality of sections, and calculating a characteristic function for the index information of the IT service for the predetermined period using the calculated characteristic vectors;
generating a future load prediction model by training the calculated characteristic function based on quantile regression; and
extracting the future load value from the collected index information using the generated future load prediction model,
Failure prediction method of IT service using pre-learned failure prediction model.

6. The method of claim 5,
wherein calculating the characteristic function comprises
assigning a weight to each of the plurality of characteristic vectors calculated for the respective sections in consideration of at least one of a time characteristic, a usage pattern characteristic of users of the IT service, and a characteristic of correlation with the extracted future load value, and calculating the characteristic function using the plurality of weighted characteristic vectors,
Failure prediction method of IT service using pre-learned failure prediction model.

A method performed by a computing device, the method comprising:
collecting index information on an IT service;
extracting a load value for the IT service by analyzing the collected index information through a pre-learned failure prediction model; and
determining a possibility of failure of the IT service using the extracted load value,
wherein determining the possibility of failure comprises:
when no failure of the IT service occurs at a first time point at which a failure of the IT service was predicted based on the extracted load value, retraining the pre-learned failure prediction model using the index information of the IT service at the first time point as training data; and
when a failure of the IT service occurs at a second time point although no failure of the IT service was predicted based on the extracted load value, retraining the pre-learned failure prediction model using the index information of the IT service at the second time point as training data,
Failure prediction method of IT service using pre-learned failure prediction model.
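
The two re-learning triggers recited above correspond to a false alarm (failure predicted, none occurred) and a missed failure (none predicted, one occurred). A minimal sketch of how such cases might be queued as learning data follows. The class name, the labels 0 and 1, and the dictionary form of the index information are assumptions for illustration; the actual re-learning routine is left out because the claim does not specify it.

```python
from dataclasses import dataclass, field


@dataclass
class RetrainingBuffer:
    """Collects index information from mispredicted time points."""
    samples: list = field(default_factory=list)

    def observe(self, index_info, predicted_failure, actual_failure):
        """Queue index info as learning data when prediction and outcome differ."""
        if predicted_failure and not actual_failure:
            # First time point: failure was predicted but did not occur.
            self.samples.append((index_info, 0))
            return "false_alarm"
        if not predicted_failure and actual_failure:
            # Second time point: failure occurred although none was predicted.
            self.samples.append((index_info, 1))
            return "missed_failure"
        return None  # Prediction matched the outcome; nothing to re-learn.

    def drain(self):
        """Return the queued learning data and clear the buffer."""
        batch, self.samples = self.samples, []
        return batch


if __name__ == "__main__":
    buf = RetrainingBuffer()
    buf.observe({"cpu": 0.9}, predicted_failure=True, actual_failure=False)
    buf.observe({"cpu": 0.4}, predicted_failure=False, actual_failure=True)
    print(buf.drain())  # batch to hand to the model's re-learning step
```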

A method performed by a computing device, comprising:
collecting index information on an IT service;
extracting a load value for the IT service by analyzing the collected index information through a pre-learned failure prediction model; and
determining the possibility of failure of the IT service by using the extracted load value,
wherein the method further comprises:
when a second IT service is provided because a previously provided first IT service is updated or the infrastructure of a system providing the first IT service is changed, excluding information on the performance of the first IT service from the index information of the first IT service previously used as learning data of the pre-learned failure prediction model;
generating a virtual user request for the second IT service;
collecting information on the performance of the second IT service based on the generated virtual user request; and
re-learning the pre-learned failure prediction model using, as learning data, the collected information on the performance of the second IT service and the index information of the first IT service from which the information on the performance of the first IT service has been excluded,
Failure prediction method of IT service using pre-learned failure prediction model.
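
The data preparation recited above (dropping the first service's performance information, probing the second service with virtual user requests, and merging the two as learning data) might look like the following sketch. The field names in `PERFORMANCE_KEYS`, the request format, and the `handler` callable standing in for the second IT service are all hypothetical; the claim does not define them.

```python
import random

# Assumed names of the performance-related fields in the index information.
PERFORMANCE_KEYS = frozenset({"response_time_ms", "throughput_rps"})


def strip_performance(index_records, performance_keys=PERFORMANCE_KEYS):
    """Return copies of the records without the performance fields."""
    return [
        {k: v for k, v in record.items() if k not in performance_keys}
        for record in index_records
    ]


def generate_virtual_requests(n, endpoints, seed=0):
    """Create n synthetic user requests against the given endpoints."""
    rng = random.Random(seed)
    return [{"endpoint": rng.choice(endpoints), "virtual": True} for _ in range(n)]


def collect_performance(requests, handler):
    """Send each virtual request to the service and record its performance."""
    return [handler(request) for request in requests]


def build_relearning_data(first_service_records, second_service_performance):
    """Merge stripped first-service index info with second-service performance."""
    return {
        "index_info": strip_performance(first_service_records),
        "performance": list(second_service_performance),
    }


if __name__ == "__main__":
    old_records = [{"cpu": 0.7, "memory": 0.5, "response_time_ms": 120}]
    requests = generate_virtual_requests(3, ["/login", "/search"])
    # Stand-in for the second IT service; a real handler would measure latency.
    fake_service = lambda req: {"endpoint": req["endpoint"], "response_time_ms": 80}
    data = build_relearning_data(old_records, collect_performance(requests, fake_service))
    print(data)
```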

a processor;
a network interface;
a memory; and
a computer program loaded into the memory and executed by the processor,
wherein the processor performs the method of claim 1, 5, 7 or 8 by executing one or more instructions included in the computer program,
Failure prediction server of IT service using pre-learned failure prediction model.

A computer program combined with a computing device and stored in a computer-readable recording medium in order to execute the method of claim 1, 5, 7 or 8.