KR102307632B1

KR102307632B1 - Unusual Insider Behavior Detection Framework on Enterprise Resource Planning Systems using Adversarial Recurrent Auto-encoder

Info

Publication number: KR102307632B1
Application number: KR1020210070225A
Authority: KR
Inventors: 오현택; 유종민; 김민경; 김옥수; 정세훈
Original assignee: 주식회사 아미크; 한국과학기술원
Priority date: 2021-05-31
Filing date: 2021-05-31
Publication date: 2021-10-05

Abstract

Disclosed are a system and method for detecting unusual behavior of a user of an enterprise information system based on an adversarial recurrent auto-encoder. According to an embodiment of the present invention, the method, performed by a computer system, comprises: deriving a discriminative model of the user's normal behavior in the enterprise information system; and performing unusual insider behavior detection (UIBD) by calculating an error with the discriminative model. Because the discriminative model is built using only normal behavior, the UIBD step can identify a behavior as unusual when its error is greater than a preset error of normal behavior.

Description

{Unusual Insider Behavior Detection Framework on Enterprise Resource Planning Systems using Adversarial Recurrent Auto-encoder}

The following embodiments relate to a system and method for detecting unusual user behavior in an enterprise information system and, more particularly, to a system and method for detecting unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder.

The modernization and globalization of enterprises have made their organization far more complex than in the past. During this transformation, many companies computerized large parts of their business operations to improve operational efficiency. Today, enterprise resources such as finance, human resources, manufacturing, and the supply chain are managed by a computerized system called an enterprise information system (Enterprise Resource Planning system, ERP system). Meanwhile, threats that seek to compromise or destroy enterprise resources are becoming increasingly sophisticated and covert. In particular, threats to enterprise resources posed by insiders are an important issue that must be addressed.

An insider is defined as a current or former employee, contractor, or other business partner who has or had access to an organization's networks, systems, or data. Insider threats to enterprise resources can be more dangerous than outsider threats because insiders can reach deeper parts of a system that outsiders cannot access. According to Gurucul's 2020 Insider Threat Report, more than 82% of the security practitioners surveyed rated their organization's insider threat effectiveness as 'somewhat effective', 'very effective', or 'extremely effective'. Security professionals, government agencies, and corporate organizations have raised the need to prevent or mitigate attacks by both malicious and careless insiders.

Early approaches to identifying insider criminal activity or threats were role-based or scenario-based. Such approaches predefined patterns or conditions of suspicious behavior or threatening activity, and an insider's activity was considered suspicious if it satisfied a condition or matched a pattern. These approaches detected predefined unusual behaviors or threats very well but could not identify behaviors that had not been defined in advance. In addition, creating further conditions or criteria to detect new threats or unusual behaviors requires understanding every detail of the ERP system flow.

To address these limitations, studies have treated the UIBD problem as a data-driven machine learning task. Based on a single probabilistic model or a classifier derived only from normal-behavior samples, they estimate how anomalous a given sample is by calculating a likelihood or an error with the model. They assume that, because the model is built only from normal samples, the likelihood or error of an abnormal sample will be greater than that of a normal sample. Hidden Markov models, principal component analysis, and the isolation forest algorithm have been used to derive the normal-behavior model. However, these machine learning models assume that the given data distribution is linear, and this assumption cannot be guaranteed in practice. To overcome this limitation, various methods using deep learning, which can model nonlinear and complex data distributions, have recently been proposed (Non-Patent Document 1). However, these machine-learning-based and deep-learning-based methods require 'feature engineering', that is, mining useful features from raw system logs. This work can improve the discriminative ability of a model by removing meaningless information or noise, but it is time-consuming and difficult.

B. Sharma, P. Pokharel, and B. Joshi, "User behavior analytics for anomaly detection using lstm autoencoder-insider threat detection," in Proceedings of the 11th International Conference on Advances in Information Technology, pp. 1-9, 2020.
P. Wang, B. Xu, J. Xu, G. Tian, C.-L. Liu, and H. Hao, "Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification," Neurocomputing, vol. 174, pp. 806-814, 2016.
D. P. Kingma and M. Welling, "Auto-encoding variational bayes," arXiv preprint arXiv:1312.6114, 2013.

The embodiments describe a system and method for detecting unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder. More specifically, an object is to provide such a system and method that can recognize unusual user behavior without user intervention by performing unusual insider behavior detection (UIBD) with a discriminative model of normal behavior derived using an adversarial recurrent auto-encoder (ARAE).

A further object of the embodiments is to provide a system and method for detecting unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder that can quickly identify specific threat cases: when a system administrator finds a serious case among the detection results, a significant threatening dictionary is built and the latent features of the serious case are stored in the dictionary.

According to an embodiment, a method for detecting unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder, performed by a computer system, includes: deriving a discriminative model of a user's normal behavior in the enterprise information system; and performing unusual insider behavior detection (UIBD) by calculating an error with the discriminative model. Because the discriminative model is modeled using only normal behavior, the performing of the UIBD may identify a behavior as unusual when its error is greater than a preset error of normal behavior.

The deriving of the discriminative model of the user's normal behavior may derive the discriminative model using an adversarial recurrent auto-encoder (ARAE).

Given an encoded input, the adversarial recurrent auto-encoder (ARAE) encodes the input into latent features and generates a reconstructed result, which makes it possible to calculate the error. The latent features are used to calculate an adversarial loss, the reconstructed result is used to calculate a reconstruction loss, and the ARAE may optimize the adversarial loss and the reconstruction loss.

The deriving of the discriminative model of the user's normal behavior may include: encoding raw system logs of normal behavior as dense embedding vectors (DEV) or with one-hot encoding; and applying the resulting dense embedding vectors or one-hot encodings to the adversarial recurrent auto-encoder (ARAE) to derive the discriminative model of normal behavior.
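As a non-limiting illustration, the encoding step described above might be sketched as follows. The event names, the vocabulary-building helper, and the fixed embedding table are assumptions made only for this example; in practice the dense embedding vectors would be learned rather than hand-written, and the source does not specify how they are obtained.

```python
from typing import Dict, List, Sequence


def build_vocab(sequences: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Assign an integer index to every distinct log event, in first-seen order."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for event in seq:
            if event not in vocab:
                vocab[event] = len(vocab)
    return vocab


def one_hot_encode(seq: Sequence[str], vocab: Dict[str, int]) -> List[List[int]]:
    """Encode each event as a vector with a single 1 at the event's index."""
    encoded = []
    for event in seq:
        row = [0] * len(vocab)
        row[vocab[event]] = 1
        encoded.append(row)
    return encoded


def dense_encode(seq: Sequence[str], vocab: Dict[str, int],
                 table: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up a dense embedding vector (DEV) for each event."""
    return [list(table[vocab[event]]) for event in seq]


# Hypothetical audit-log event names (not taken from the source).
normal_logs = [
    ["LOGIN", "OPEN_REPORT", "LOGOUT"],
    ["LOGIN", "EXPORT_DATA", "LOGOUT"],
]
vocab = build_vocab(normal_logs)

# Hypothetical 2-dimensional embedding table, one row per vocabulary entry.
embedding_table = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.1], [0.2, 0.2]]

one_hot = one_hot_encode(normal_logs[0], vocab)
dense = dense_encode(normal_logs[0], vocab, embedding_table)
```

Either representation yields one vector per log event, so a log sequence becomes a sequence of vectors that a recurrent encoder can consume.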

The method may further include performing significant threat detection through a fast track using a pre-stored significant threatening dictionary.

The significant threatening dictionary may consist of the latent features of behaviors that a system administrator classified as serious threats during the unusual insider behavior detection.

The performing of significant threat detection through the fast track may compare a latent feature derived during the unusual insider behavior detection with the features stored in the significant threatening dictionary and, if they are similar, regard the behavior corresponding to the stored feature as having occurred.
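As a non-limiting illustration, the fast-track comparison might be sketched as follows. The source does not state how similarity is measured; cosine similarity, the 0.95 cut-off, the dictionary labels, and the function names are all assumptions made for this example.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fast_track_match(latent: Sequence[float],
                     threat_dictionary: Dict[str, Sequence[float]],
                     min_similarity: float = 0.95) -> Optional[str]:
    """Return the label of the most similar stored feature, or None if none is similar enough."""
    best_label: Optional[str] = None
    best_score = min_similarity
    for label, stored in threat_dictionary.items():
        score = cosine_similarity(latent, stored)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical dictionary: latent features of cases an administrator marked as serious.
threat_dictionary = {
    "mass_export": [0.9, 0.1, 0.0],
    "privilege_change": [0.0, 0.2, 0.95],
}

hit = fast_track_match([0.88, 0.12, 0.01], threat_dictionary)
miss = fast_track_match([0.3, 0.9, 0.1], threat_dictionary)
```

Because the lookup only compares latent vectors, it does not need the reconstruction error to be computed first, which is one plausible reading of why the source calls it a fast track.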

Performing the unusual insider behavior detection by calculating the error with the discriminative model and performing the significant threat detection through the fast track using the pre-stored significant threatening dictionary may complement each other in identifying unusual user behavior.

According to another embodiment, a system for detecting unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder includes: a discriminative model modeling unit that derives a discriminative model of a user's normal behavior in the enterprise information system; and an unusual insider behavior detection unit that performs unusual insider behavior detection (UIBD) by calculating an error with the discriminative model. Because the discriminative model is modeled using only normal behavior, the unusual insider behavior detection unit may identify a behavior as unusual when its error is greater than a preset error of normal behavior.

The discriminative model modeling unit may derive the discriminative model using an adversarial recurrent auto-encoder (ARAE).

The discriminative model modeling unit may include: a system log embedding unit that encodes raw system logs of normal behavior as dense embedding vectors (DEV) or with one-hot encoding; and an adversarial recurrent auto-encoder (ARAE) that derives the discriminative model of normal behavior from the resulting dense embedding vectors or one-hot encodings.

The system may further include a significant threat detection unit that performs significant threat detection through a fast track using a pre-stored significant threatening dictionary.

The significant threat detection unit may be configured with the latent features of behaviors that a system administrator classified as serious threats during the unusual insider behavior detection.

The significant threat detection unit may compare a latent feature derived during the unusual insider behavior detection with the features stored in the significant threatening dictionary and, if they are similar, regard the behavior corresponding to the stored feature as having occurred.

According to the embodiments, it is possible to provide a system and method for detecting unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder that can recognize unusual user behavior without user intervention by performing unusual insider behavior detection (UIBD) with a discriminative model of normal behavior derived using an adversarial recurrent auto-encoder (ARAE).

In addition, according to the embodiments, it is possible to provide such a system and method that can quickly identify specific threat cases: when a system administrator finds a serious case among the detection results, a significant threatening dictionary is built and the latent features of the serious case are stored in the dictionary.

FIG. 1 is a block diagram illustrating an example of the internal configuration of a computer system according to an embodiment.
FIG. 2 is a block diagram illustrating a system for detecting unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder according to an embodiment.
FIG. 3 is a flowchart illustrating a method for detecting unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder according to an embodiment.
FIG. 4 is a diagram illustrating a UIBD process using the ARAE according to an embodiment.
FIG. 5 is a schematic diagram of system log embedding according to an embodiment.
FIG. 6 is a diagram illustrating structural details of the ARAE for the process of calculating an objective function according to an embodiment.

Hereinafter, embodiments will be described with reference to the accompanying drawings. However, the described embodiments may be modified in various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art. The shapes and sizes of elements in the drawings may be exaggerated for clarity.

Detecting unusual insider behavior in an enterprise information system (ERP system) is an essential part of reducing the risk that insiders threaten or abuse enterprise resources. Many approaches that detect behavior using rule-based systems and probabilistic processes are currently limited to heuristic monitoring with manually set algorithms or probabilistic boundaries. Such approaches require prior knowledge, such as user authorization guidelines and process data characteristics. However, obtaining this prior knowledge is difficult in practice, and these approaches are not well suited to detecting unusual behaviors, which typically cannot be clearly defined with heuristic rules.

The following embodiments of the present invention propose a new framework for unusual insider behavior detection (UIBD) for ERP systems. The proposed framework first derives a discriminative model from normal-behavior samples, and UIBD is then performed by calculating an error with the model. Because the model is built using only normal samples, the error of an abnormal sample is greater than that of a normal sample. An adversarial recurrent auto-encoder (ARAE) is presented to derive a robust normal-behavior model. To demonstrate the effectiveness of the proposed ARAE-based framework, experiments may be performed on a dataset of insider behaviors defined by security audit log sequences of an ERP system operated in a real enterprise. The experimental results show that the proposed framework with the ARAE can successfully detect unusual user behavior and can outperform other methods of detecting unusual user behavior or threats.

FIG. 1 is a block diagram illustrating an example of the internal configuration of a computer system according to an embodiment. For example, the system for detecting unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder according to embodiments of the present invention may be implemented through the computer system (device) 100 of FIG. 1. As shown in FIG. 1, the computer system 100 may include, as components for executing the detection method, a processor 110, a memory 120, a persistent storage device 130, a bus 140, an input/output interface 150, and a network interface 160.

The processor 110 may include, or be part of, any device capable of processing any sequence of instructions. The processor 110 may include, for example, a computer processor, a processor in a mobile device or other electronic device, and/or a digital processor. The processor 110 may be included in, for example, a server computing device, a server computer, a series of server computers, a server farm, a cloud computer, a content platform, a mobile computing device, a smartphone, a tablet, a set-top box, or a media player. The processor 110 may be connected to the memory 120 through the bus 140.

The memory 120 may include volatile, permanent, virtual, or other memory for storing information used by or output by the computer system 100. The memory 120 may include, for example, random access memory (RAM) and/or dynamic RAM (DRAM). The memory 120 may be used to store any information, such as state information of the computer system 100. The memory 120 may also be used to store instructions of the computer system 100, including, for example, instructions for detecting unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder. The computer system 100 may include one or more processors 110 as needed or appropriate.

The bus 140 may include a communication infrastructure that enables interaction between the various components of the computer system 100. The bus 140 may carry data between components of the computer system 100, for example between the processor 110 and the memory 120. The bus 140 may include wireless and/or wired communication media between components of the computer system 100 and may include parallel, serial, or other topological arrangements.

The persistent storage device 130 may include components, such as memory or another persistent storage device, used by the computer system 100 to store data for an extended period (for example, relative to the memory 120). The persistent storage device 130 may include non-volatile main memory as used by the processor 110 in the computer system 100. The persistent storage device 130 may include, for example, flash memory, a hard disk, an optical disk, or another computer-readable medium.

The input/output interface 150 may include interfaces to a keyboard, a mouse, voice command input, a display, or other input or output devices. Configuration commands and/or information for detecting unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder may be received through the input/output interface 150.

The network interface 160 may include one or more interfaces to networks such as a local area network or the Internet. The network interface 160 may include interfaces for wired or wireless connections. Configuration commands and/or information for detecting unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder may be received through the network interface 160.

In other embodiments, the computer system 100 may include more components than those shown in FIG. 1. However, most conventional components need not be explicitly illustrated.

FIG. 2 is a block diagram illustrating a system for detecting unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder according to an embodiment.

FIG. 2 illustrates an example of components that the processor 110 of the computer system 100 according to the embodiment of FIG. 1 may include. Here, the processor 110 of the computer system 100 may include the system 200 for detecting unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder according to an embodiment. The detection system 200 may include a discriminative model modeling unit 210, an unusual insider behavior detection unit 220, and a significant threat detection unit 230. Depending on the embodiment, the discriminative model modeling unit 210 may include a system log embedding unit 211 and an adversarial recurrent auto-encoder (ARAE) 212.

The processor 110 and its components may perform steps S310 to S330 of the detection method of FIG. 3. For example, the processor 110 and its components may be implemented to execute instructions according to the code of an operating system included in the memory 120 and the at least one program code described above. Here, the at least one program code may correspond to the code of a program implemented to perform detection of unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder.

The steps of the detection method need not occur in the order shown; some steps may be omitted, and additional steps may be included.

FIG. 3 is a flowchart illustrating a method for detecting unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder according to an embodiment.

Referring to FIG. 3, a method for detecting unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder, performed by a computer system according to an embodiment, may include deriving a discriminative model of a user's normal behavior in the enterprise information system (S310) and performing unusual insider behavior detection (UIBD) by calculating an error with the discriminative model (S320). Because the discriminative model is modeled using only normal behavior, step S320 may identify a behavior as unusual when its error is greater than a preset error of normal behavior.
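As a non-limiting illustration, the comparison in step S320 might be sketched as follows. The source only states that a behavior is identified as unusual when its error exceeds a preset error of normal behavior; setting that preset value to the mean plus k standard deviations of the errors observed on normal samples is an assumption made for this example, as are the function names and the sample error values.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def fit_threshold(normal_errors: Sequence[float], k: float = 3.0) -> float:
    """Preset error of normal behavior: mean + k * standard deviation (assumed rule)."""
    return mean(normal_errors) + k * pstdev(normal_errors)


def detect_unusual(errors: Sequence[float], threshold: float) -> List[bool]:
    """Flag a behavior as unusual when its error exceeds the preset threshold."""
    return [error > threshold for error in errors]


# Hypothetical reconstruction errors measured on held-out normal behavior.
normal_errors = [0.10, 0.12, 0.09, 0.11, 0.08]
threshold = fit_threshold(normal_errors)

# Two new behaviors: one with a typical error, one with a much larger error.
flags = detect_unusual([0.10, 0.45], threshold)
```

A larger k lowers the false-alarm rate at the cost of missing milder deviations, so the value would need to be tuned on the deployment's own logs.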

The method may further include performing significant threat detection through a fast track using a pre-stored significant threatening dictionary (S330).

According to the embodiments, unusual user behavior can be recognized without user intervention by performing unusual insider behavior detection (UIBD) with a discriminative model of normal behavior derived using an adversarial recurrent auto-encoder (ARAE).

In addition, according to the embodiments, when a system administrator finds a serious case among the detection results, the administrator builds the significant threatening dictionary and stores the latent features of the serious case in it, so that specific threat cases can be identified quickly.

The method according to an embodiment can be described in more detail using, as an example, the adversarial recurrent auto-encoder-based system for detecting unusual user behavior in an enterprise information system described with reference to FIG. 2. The system (200) according to an embodiment may include a discriminative model modeling unit (210), an unusual user behavior detection unit (220), and a significant threat detection unit (230). Depending on the embodiment, the discriminative model modeling unit (210) may include a system log embedding unit (211) and an adversarial recurrent auto-encoder (ARAE, 212).

In step S310, the discriminative model modeling unit (210) may derive a discriminative model of the user's normal behavior in the enterprise information system, using the adversarial recurrent auto-encoder (ARAE). Given an encoded input, the ARAE encodes the input into latent features and generates a reconstructed result, which makes it possible to calculate an error. The latent features are used to calculate the adversarial loss, the reconstructed result is used to calculate the reconstruction loss, and the ARAE is optimized with respect to both losses.

The discriminative model modeling unit (210) may further include a system log embedding unit (211) and an adversarial recurrent auto-encoder (ARAE, 212). The system log embedding unit (211) may encode raw system logs of normal behavior into dense embedding vectors (DEVs). The ARAE (212) may then derive the discriminative model of normal behavior from the encoded dense embedding vectors. Although the raw system logs are described here as being encoded with dense vector embedding, the embodiments are not limited to this; other methods of encoding time-series data for deep learning, such as one-hot encoding, may be used. For example, the system log embedding unit (211) may encode the raw system logs of normal behavior with one-hot encoding, and the ARAE (212) may derive the discriminative model of normal behavior from the one-hot encoded result.

In step S320, the unusual user behavior detection unit (220) may perform UIBD by calculating an error through the discriminative model. Because the discriminative model is built using only normal behavior, the detection unit (220) can identify a behavior as unusual when its error is greater than the preset error for normal behavior.

In step S330, the significant threat detection unit (230) may perform significant threat detection through a fast track using the pre-stored significant threatening dictionary. The dictionary may consist of the latent features of behaviors that the system administrator classified as serious threats during UIBD. The significant threat detection unit (230) compares the latent features derived during UIBD with the features stored in the dictionary and, if they are similar, may consider that the behavior corresponding to the stored feature has occurred.

UIBD based on the error calculated through the discriminative model and significant threat detection through the fast track using the pre-stored significant threatening dictionary can complement each other in identifying unusual user behavior.

Embodiments of the present invention may provide a framework for Unusual Insider Behavior Detection (UIBD) in ERP systems. The framework can provide robust insider behavior detection without predefined cases of unusual behavior and without feature engineering. To achieve this, the proposed framework encodes raw system logs of normal behavior into dense embedding vectors (DEVs) and applies them to the adversarial recurrent auto-encoder (ARAE) to derive a discriminative model of normal behavior. The framework provides both an unsupervised process for detecting unusual behavior comprehensively and a supervised process for detecting specific insider behaviors defined after the fact by the system administrator. To demonstrate its effectiveness, the embodiments may test the proposed framework on a dataset of ERP system logs recorded at various real companies. The experimental results show that the proposed framework is effective in detecting unusual user behavior.

The main contributions of the present invention can be summarized as follows.

The proposed framework is a new framework for UIBD. It provides robust detection performance, identifying a variety of abnormal or threatening behaviors (unusual behaviors) in ERP systems without prepared examples of such behaviors and without feature engineering by system experts.

In addition, a new method is used to derive a discriminative reconstruction model for given normal behavior, namely the adversarial recurrent auto-encoder (ARAE).

In addition, the contributions include a large-scale public dataset for UIBD, containing a variety of insider behaviors captured by ERP systems in real industrial settings.

UIBD is described below, followed by a detailed description of the proposed framework and its training procedure.

FIG. 4 is a diagram illustrating the UIBD process using the ARAE according to an embodiment.

Referring to FIG. 4, the main components of the proposed framework can be divided into two core pipelines according to the purpose and type of the unusual user behavior to be detected in the enterprise information system (401). First, the ARAE (405)-based UIBD provides a comprehensive, automated process for identifying unusual user behavior. This may be included in steps S310 and S320 described with reference to FIG. 3. Given a sequence of raw system logs (402), the framework maps (403) the sequence to a DEV (404) and applies the DEV (404) to the ARAE (405) to reconstruct it. The input vector and the reconstructed result (406) are then compared (407, 408) to identify unusual behavior (409). This is the main pipeline, and it can recognize unusual user behavior without user intervention.

The second part is the fast track (412), which detects specific threats using the significant threatening dictionary (411). This may be included in step S330 described with reference to FIG. 3. The dictionary (411) consists of the latent features of behaviors that the system administrator has classified as serious threats among all detection results. In practice, it is unavoidable that an administrator analyzes the UIBD results. When the administrator finds a serious case among the detection results of the first pipeline (410), the administrator can build the significant threatening dictionary (411) and store the latent features of the serious case in it (413, 414, 415, 416).

The two pipelines complement each other in identifying unusual user behavior. The first UIBD process, which uses the ARAE (405), can automatically provide comprehensive UIBD results, but because it is computationally intensive it may not be suitable for classifying certain serious behaviors that must be identified quickly. The second UIBD process, which uses the dictionary to reduce response time and identify specific threatening behaviors, can provide a shortcut for identifying specific behaviors defined by the administrator.

In other words, the embodiments enable fully automated detection and analysis of unusual behavior through the first pipeline. This pipeline provides detection results based on the reconstruction error of the model; once the system user sets the desired unusual-behavior score, results are output accordingly. However, because it provides comprehensive results, detailed analysis of those results must be carried out by the system user, and because it runs the model's entire computation routine, it is computationally intensive.

Accordingly, the embodiments may use the second pipeline to detect unusual behavior based on information registered by the user. At the request of a system user analyzing the detection results of the first pipeline, the latent features of a selected behavior are stored, and detection is performed only for those features. In this case the output takes the form of classification rather than detection, so a specific behavior can be recognized. However, this pipeline can be used only after the system user has registered behaviors in advance.

The embodiments can therefore support both unsupervised and supervised detection of unusual behavior by using the two pipelines. In particular, the embodiments can improve the discriminative performance of the model by introducing adversarial learning.

To process system logs for training a deep learning model, it is essential to encode the raw logs into a vector space. One approach is one-hot encoding, a method of vectorizing categorical signals. Creating and updating the encoding is simple and fast, because a new category only requires adding a new vector. Du represents system logs as one-hot vectors and applies them to a model. However, because the method creates a new dimension for each category, it can lead to the curse of dimensionality.

Therefore, to encode the raw system logs into a vector space, the embodiments may use a neural-network-based dense vector embedding (Non-Patent Document 2) placed in front of the ARAE, as shown in FIG. 4.

FIG. 5 is a schematic diagram of system log embedding according to an embodiment.

As shown in FIG. 5, the raw ERP system logs (510), which are expressed as complex codes, are replaced with integers s = {s_t}, t = 1, ..., n (520), where s_t is the single log at time t and n is the length of the behavior. The integers are then converted into dense vectors x = {x_t}, t = 1, ..., n (540) using a neural-network-based embedding network (530).
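The two-step mapping described above (log code to integer, integer to dense vector) can be sketched as follows. This is only an illustration under stated assumptions: the log codes, the embedding size, and the random embedding table are hypothetical, and in the framework the table would be a trained embedding network rather than random values.

```python
import numpy as np

# Hypothetical raw log codes for one behavior (illustrative values only).
raw_logs = ["CODE_A", "CODE_C", "CODE_A", "CODE_B"]

# Step 1: replace each distinct code with an integer index s_t.
vocab = {code: idx for idx, code in enumerate(sorted(set(raw_logs)))}
s = np.array([vocab[code] for code in raw_logs])

# Step 2: look up a dense vector x_t for each s_t in an embedding table.
# Assumption: a random table stands in for the trained embedding network.
embedding_dim = 2  # assumed size, chosen for illustration
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), embedding_dim))
x = embedding_table[s]  # shape (n, embedding_dim)
```

Identical log codes map to the same integer and therefore to the same dense vector, which is what lets the downstream model treat repeated actions consistently.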

Using an embedding layer instead of one-hot encoding has several advantages. First, a trainable embedding layer can represent semantic relationships between arbitrary input signals (Non-Patent Document 2), which is not possible with one-hot encoding because it maps the input signals to mutually orthogonal vectors. Second, dense embedding offers flexibility in the dimension of the vector space into which the raw signals are embedded. Given a signal with N categories, the dimension of the one-hot vector space is at least N. Dense embedding, by contrast, does not enforce orthogonality between the mapped vectors, so it can map the signals into a vector space with fewer than N dimensions.
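The dimensional argument can be checked numerically with a small sketch. The values of N and d are assumptions chosen for illustration, and a random matrix stands in for a trained embedding table.

```python
import numpy as np

N = 6  # number of distinct log categories (assumed)
d = 3  # dense embedding size, chosen smaller than N (assumed)

# One-hot: one N-dimensional row per category, all rows mutually orthogonal.
one_hot = np.eye(N)
gram = one_hot @ one_hot.T

# Dense: N vectors in only d dimensions (random stand-in for a trained table).
rng = np.random.default_rng(0)
dense = rng.normal(size=(N, d))
dense_gram = dense @ dense.T
off_diag = dense_gram[~np.eye(N, dtype=bool)]  # dot products between distinct categories
```

With one-hot vectors every dot product between distinct categories is zero, so no similarity can be expressed. With d smaller than N, the N nonzero dense vectors cannot all be mutually orthogonal, which is what leaves room for learned similarity between related log codes.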

The core hypothesis of the proposed framework is that, when the reconstruction model is trained using only normal behavior, the reconstruction error of unusual behavior will be larger than that of normal behavior. Under this hypothesis, it is essential to derive a robust model that can account for all the details of the given normal behavior samples and can handle normal samples not seen during training. As the input for deriving the discriminative model, a behavior is defined by a sequence of system logs. A global representation of all the system logs in a sequence must therefore be learned. Learning informative features that cover behaviors of varying lengths and patterns is not a simple task.

To solve this problem, an auto-encoder-based generative model called the adversarial recurrent auto-encoder (ARAE) is proposed.

FIG. 6 is a diagram illustrating the structural details of the ARAE and the process of calculating the objective function according to an embodiment.

Referring to FIG. 6, the main components of the ARAE may include an encoder f, a decoder g, and a discriminator D. The discriminator D is applied only in the training phase, to make use of adversarial learning. RNN-based auto-encoders are commonly used to extract features from time-series data (Non-Patent Document 1). Here, this architecture can be used to extract latent features from behaviors of varying lengths.

The encoder extracts latent features (vectors) from the input, and feeding the latent features to the decoder yields the result reconstructed by the auto-encoder. The reconstruction loss (620), which is the error between the original input and the reconstructed result, can then be obtained. In addition, by obtaining a loss on the latent features (610), the dimension can be reduced and a generalized model derived. As one example, a Gaussian distribution may be used to constrain the range of the latent feature vectors.

First, given the DEV x encoded by the system log embedding, the encoder f maps x to the latent features z: f(x) = z, where z = {z_t}, t = 1, ..., n, and z_t is the latent feature extracted from x_t. Applying z to the decoder g generates the reconstruction result g(z). To process sequential data and extract global information from behaviors of varying lengths, f and g are built on an RNN structure, which can be constructed from various types of memory cells such as LSTM, GRU, or naive RNN cells. In the training phase, x and its reconstruction g(z) are used to calculate the reconstruction loss L_Re (620), and z is applied to the adversarial loss L_Adv (610). ARAE training for UIBD is described in detail below. With the proposed RNN-based structure, the ARAE can model useful features across a sequence of logs and can automatically learn the various patterns in sequences of ERP system logs in order to identify unusual behavior.

The ARAE is optimized with respect to an objective function consisting of a reconstruction loss and an adversarial loss. Because batch training requires behaviors of uniform length, all behaviors are normalized to the maximum length n_max among the given training samples. The dummy entries created by this length normalization are filled with zeros. To remove the effect of the dummy entries, a mask is also generated that distinguishes real data from dummy data. The reconstruction loss term is formulated based on the Euclidean distance, and the masked reconstruction loss is calculated as follows.
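The length normalization and mask described above can be sketched as follows. The function name `pad_and_mask` and the toy batch are hypothetical; the sketch assumes each behavior is an array of shape (length, dimension).

```python
import numpy as np

def pad_and_mask(sequences):
    """Zero-pad variable-length sequences to n_max and build a 0/1 mask."""
    n_max = max(len(seq) for seq in sequences)
    dim = sequences[0].shape[1]
    padded = np.zeros((len(sequences), n_max, dim))
    mask = np.zeros((len(sequences), n_max))
    for i, seq in enumerate(sequences):
        padded[i, : len(seq)] = seq   # real time steps
        mask[i, : len(seq)] = 1.0     # 1 marks real data, 0 marks dummy data
    return padded, mask

# Two behaviors of length 2 and 4, each step a 3-dimensional vector.
batch = [np.ones((2, 3)), np.ones((4, 3))]
padded, mask = pad_and_mask(batch)
```

Summing a row of `mask` gives the true length of that behavior, which is the quantity a masked loss would normalize by.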

[Equation 1]

Here, m is the mask for temporal length normalization, composed of 0s and 1s that indicate whether each element is valid for the loss calculation. |m|² denotes the scale of m and can be interpreted as the number of nonzero elements in m. The reconstruction loss term serves to extract abstracted features z that are more global than x.
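The body of Equation 1 does not survive in this text. One form consistent with the definitions given (a Euclidean distance, the 0/1 mask m, and normalization by |m|²) would be the following sketch for a single training sample. The symbol \hat{x} for the reconstruction g(f(x)), the per-step indexing, and the choice of an unsquared norm are assumptions, and the original equation may differ in these details.

```latex
\mathcal{L}_{Re}(x) \;=\; \frac{1}{|m|^{2}} \sum_{t=1}^{n_{max}} m_t \,\bigl\lVert x_t - \hat{x}_t \bigr\rVert_2 ,
\qquad \hat{x} = g\bigl(f(x)\bigr)
```

Because m_t is 0 for padded steps, the dummy entries contribute nothing to the sum, and dividing by |m|² averages over the real steps only.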

The reconstruction loss above is inspired by the auto-encoder (Non-Patent Document 3), one of the commonly used generative models. An auto-encoder encodes high-dimensional raw data into an arbitrary low-dimensional latent feature space and decodes it back into a high-dimensional result. Through this encode-decode process (that is, the reconstruction process), abstracted features, which carry more global information than the input samples, are mapped into the latent space. However, the latent feature space of an auto-encoder is organized arbitrarily, which can increase the uncertainty of the reconstruction process. According to a previous study (Non-Patent Document 1), this uncertainty can weaken discriminative power by yielding smaller reconstruction errors for unusual samples, and can therefore degrade UIBD performance.

To reduce the uncertainty of the latent feature space, the distribution of the latent features, represented by z, can be forced toward a known distribution using adversarial learning. The encoder f plays the role of the generator in the adversarial learning process. The adversarial loss term can be defined as follows.

[Equation 2]

Here, D is the discriminator that applies adversarial learning to ARAE training. p_x denotes the distribution of the training behavior samples x, which in practice is defined by the training dataset. The reference samples are random variables drawn from the normal distribution N(0, 1). D aims to distinguish the encodings generated by f from samples of this prior normal distribution. Accordingly, f tries to minimize L_Adv against the adversary D, which tries to maximize it. The adversarial loss can be thought of as providing guidance on the shape of the probability distribution of the latent features, and it can reduce the uncertainty of the latent space.
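The body of Equation 2 does not survive in this text. Given the roles described (D separates encodings f(x) from samples of the prior N(0, 1), f minimizes and D maximizes), a standard adversarial auto-encoder objective of the following form would be consistent. This is a sketch under that assumption; the symbol \tilde{z} for the prior sample is introduced here for illustration and is not taken from the source.

```latex
\mathcal{L}_{Adv} \;=\;
\mathbb{E}_{\tilde{z} \sim \mathcal{N}(0,1)}\bigl[\log D(\tilde{z})\bigr]
\;+\;
\mathbb{E}_{x \sim p_x}\bigl[\log\bigl(1 - D(f(x))\bigr)\bigr]
```

In this form D increases the loss by scoring prior samples high and encodings low, while f decreases it by producing encodings that D cannot tell apart from prior samples.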

Variational inference (Non-Patent Document 3) may seem a more natural fit, since the ARAE is based on an auto-encoder, but it requires structural modification in order to derive the expectation and variance. The adversarial approach can be applied easily by adding a discriminator, which is a kind of independent function. This is a significant practical advantage.

The overall objective can be defined by combining the losses above, as follows.

[Equation 3]

Here, λ is a parameter that balances the reconstruction loss and the adversarial loss. The parameters of the ARAE can be optimized according to the following equation.

[Equation 4]

The ARAE is trained using stochastic gradient descent, and the discriminator is used only in the training phase.
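The bodies of Equations 3 and 4 do not survive in this text. A combination consistent with the description (a weighted sum of the two losses, with the encoder and decoder minimizing and the discriminator maximizing) could be written as below. Placing λ on the adversarial term and the starred symbols for the optimized functions are assumptions.

```latex
\mathcal{L} \;=\; \mathcal{L}_{Re} + \lambda\, \mathcal{L}_{Adv},
\qquad
f^{*},\, g^{*} \;=\; \arg\min_{f,\,g}\; \max_{D}\; \mathcal{L}
```

Only the adversarial term depends on D, so the inner maximization affects that term alone, while the reconstruction term is shaped by f and g.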

As shown in FIG. 4, there are two parts for identifying unusual user behavior. First, the proposed framework can identify unusual behavior by comparing the input x with the reconstruction result g(f(x)) generated by the ARAE. The comparison is performed by calculating the error defined by the following equation.

[Equation 5]

Here, n_x denotes the temporal length of x.

The UIBD process using the above error can be expressed as follows.

[Equation 6]

Here, τ is a manually determined threshold, and E denotes the function that calculates the error between x and its reconstruction. If the error is greater than the threshold τ, the given input is regarded as unusual.
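The thresholding rule described above can be sketched in a few lines. Because the equation bodies are not reproduced in this text, the error function here is an assumption: the Euclidean distance between x and its reconstruction at each time step, averaged over the n_x steps. The function names and the numeric values are hypothetical.

```python
import numpy as np

def reconstruction_error(x, x_hat):
    """Mean per-step Euclidean distance between x and x_hat (both n_x by d)."""
    return float(np.mean(np.linalg.norm(x - x_hat, axis=1)))

def is_unusual(x, x_hat, tau):
    """Flag the behavior as unusual when the error exceeds the threshold tau."""
    return reconstruction_error(x, x_hat) > tau

x = np.array([[0.0, 0.0], [1.0, 1.0]])
good_recon = x + 0.01   # stand-in for a well-reconstructed normal behavior
bad_recon = x + 3.0     # stand-in for a poorly reconstructed unusual behavior
tau = 0.5               # manually chosen threshold (illustrative value)
```

In practice `x_hat` would come from the trained ARAE, and tau would be tuned by the system user to trade missed detections against false alarms.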

Second, significant threats can be detected through the fast track using the significant threatening dictionary. Significant threat detection is performed by comparing latent features with the features stored in the dictionary. When a given latent feature is similar to a particular feature in the dictionary, the behavior corresponding to the stored feature is considered to have occurred. Detection is carried out by applying the error calculation of [Equation 5] to the latent feature z and each feature stored in the dictionary, as follows.

[Equation 7]

Here, each feature compared against z is a latent feature stored in advance in the dictionary, and N_DICT is the number of features registered in the dictionary. Latent features carry more generalized information than raw logs or dense vectors, which can improve the generalization performance of UIBD through the fast track.
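A dictionary lookup of this kind can be sketched as a nearest-match search over the stored latent features. The function name `fast_track`, the similarity threshold, and the assumption that the query and the stored features share the same shape are all hypothetical; the source does not specify how "similar" is decided beyond reusing the error calculation.

```python
import numpy as np

def fast_track(z, dictionary, threshold):
    """Return the index of the closest stored feature, or None if none is close enough."""
    errors = [
        float(np.mean(np.linalg.norm(z - z_dict, axis=1)))  # same error form as the main pipeline
        for z_dict in dictionary
    ]
    best = int(np.argmin(errors))
    return best if errors[best] <= threshold else None

# Two stored latent features (N_DICT = 2), each n by d; values are illustrative.
dictionary = [np.zeros((3, 2)), np.full((3, 2), 5.0)]
match = fast_track(np.full((3, 2), 4.9), dictionary, threshold=0.5)
no_match = fast_track(np.full((3, 2), 2.5), dictionary, threshold=0.5)
```

Returning the index of the matched entry reflects the point that this pipeline classifies which registered behavior occurred, rather than only flagging that something is unusual.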

Because adversarial learning is applied only in the training phase, the model complexity differs between the training and test phases. In the training phase, given an encoded input, the ARAE encodes the input into latent features and generates a reconstructed result. The latent features are used to calculate the adversarial loss L_Adv [Equation 2], and the reconstructed result is used to calculate the reconstruction loss L_Re [Equation 1]. The discriminator is used for adversarial learning in this process. The model complexity of the proposed framework in the training phase is therefore the sum of the model complexities of the ARAE and the discriminator.

In the test phase, by contrast, given input data, the ARAE generates a reconstructed result and calculates the error to measure how unusual the input is. Because the discriminator for adversarial learning is not applied in the test phase, the model complexity of the proposed framework in the test phase is that of the ARAE alone.

The model complexities of the ARAE and the discriminator are calculated as follows. First, the model complexity of the ARAE using GRUs is O(2W), where W is determined by n_c, the number of GRU cells, n_i, the number of input units, and n_o, the number of output units. Second, the model complexity of the discriminator, which consists of a fully connected network, is O(N) with N = n_i^fc × n_o^fc, where n_i^fc and n_o^fc denote the numbers of input and output units, respectively. The discriminator is used only in the training phase.

The model complexity of the proposed framework is therefore O(2W + N) in the training phase and O(2W) in the test phase. The model complexity of UIBD through the fast track is O(W), half that of the main pipeline using the full ARAE. Because the fast track uses only the encoder part f of the ARAE, it can be regarded as less computationally intensive than the main pipeline. In practice, however, the computational cost can vary with the size of the training dataset or the temporal length of each input. It is therefore important to monitor the actual execution speed in the test phase.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications executed on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but a person of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may also be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored in one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from that described, and/or the described components of a system, structure, apparatus, circuit, or the like are coupled or combined in a form different from that described, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

1. A method for detecting unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder, performed by a computer system, the method comprising:
deriving a discrimination model for normal behavior of a user in the enterprise information system; and
performing unusual insider behavior detection (UIBD) by calculating an error through the discrimination model,
wherein the performing of the UIBD comprises:
identifying an unusual behavior when the error of the behavior is greater than a preset error of the normal behavior, the discrimination model having been modeled using only the normal behavior.

2. The method of claim 1,
wherein the deriving of the discrimination model for the normal behavior of the user comprises:
deriving the discrimination model using an adversarial recurrent auto-encoder (ARAE).

3. The method of claim 2,
wherein the adversarial recurrent auto-encoder (ARAE),
given an encoded input, encodes the input into a latent feature and produces a reconstructed result so that the error can be calculated, the latent feature being used to calculate an adversarial loss and the reconstructed result being used to calculate a reconstruction loss, and wherein the ARAE is optimized with respect to the adversarial loss and the reconstruction loss.

4. The method of claim 1,
wherein the deriving of the discrimination model for the normal behavior of the user comprises:
encoding a raw system log of the normal behavior as a dense embedding vector (DEV) or by one-hot encoding; and
applying the encoded dense embedding vector or the one-hot encoding result to an adversarial recurrent auto-encoder (ARAE) to derive the discrimination model for the normal behavior.

5. The method of claim 1, further comprising:
performing significant threat detection through a fast track using a pre-stored significant threat dictionary.

6. The method of claim 5,
wherein the significant threat dictionary
consists of latent features of behaviors that a system administrator has classified as serious threat behaviors among the detected unusual user behaviors.

7. The method of claim 6,
wherein the performing of the significant threat detection through the fast track comprises:
comparing a latent feature derived during the UIBD with the features stored in the significant threat dictionary and, if they are similar, determining that the behavior corresponding to the stored feature has occurred.

8. The method of claim 5,
wherein the UIBD performed by calculating the error through the discrimination model and the significant threat detection performed through the fast track using the pre-stored significant threat dictionary complement each other in identifying unusual user behavior.

9. A system for detecting unusual user behavior in an enterprise information system based on an adversarial recurrent auto-encoder, the system comprising:
a discrimination model modeling unit that derives a discrimination model for normal behavior of a user in the enterprise information system; and
an unusual user behavior detection unit that performs unusual insider behavior detection (UIBD) by calculating an error through the discrimination model,
wherein the unusual user behavior detection unit
identifies an unusual behavior when the error of the behavior is greater than a preset error of the normal behavior, the discrimination model having been modeled using only the normal behavior.

10. The system of claim 9,
wherein the discrimination model modeling unit
derives the discrimination model using an adversarial recurrent auto-encoder (ARAE).

11. The system of claim 10,
wherein the adversarial recurrent auto-encoder (ARAE),
given an encoded input, encodes the input into a latent feature and produces a reconstructed result so that the error can be calculated, the latent feature being used to calculate an adversarial loss and the reconstructed result being used to calculate a reconstruction loss, and wherein the ARAE is optimized with respect to the adversarial loss and the reconstruction loss.

12. The system of claim 9,
wherein the discrimination model modeling unit comprises:
a system log embedding unit that encodes a raw system log of the normal behavior as a dense embedding vector (DEV) or by one-hot encoding; and
an adversarial recurrent auto-encoder (ARAE) that derives the discrimination model for the normal behavior from the encoded dense embedding vector or the one-hot encoding result applied to it.

13. The system of claim 9, further comprising:
a significant threat detection unit that performs significant threat detection through a fast track using a pre-stored significant threat dictionary.

14. The system of claim 13,
wherein the significant threat dictionary used by the significant threat detection unit
consists of latent features of behaviors that a system administrator has classified as serious threat behaviors among the detected unusual user behaviors.

15. The system of claim 14,
wherein the significant threat detection unit
compares a latent feature derived during the UIBD with the features stored in the significant threat dictionary and, if they are similar, determines that the behavior corresponding to the stored feature has occurred.

16. The system of claim 13,
wherein the UIBD performed by calculating the error through the discrimination model and the significant threat detection performed through the fast track using the pre-stored significant threat dictionary complement each other in identifying unusual user behavior.
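
For illustration only, the two complementary detection paths recited in the claims (error-based UIBD and the fast track over a threat dictionary) might be sketched in Python as follows. This is a minimal sketch under stated assumptions and does not limit the claims: mean squared error as the reconstruction error, cosine similarity as the "similar" test, checking the fast track before the reconstruction error, and the names `detect`, `encode`, `decode`, `error_threshold`, and `sim_threshold` are all hypothetical choices that the text does not specify. The toy encoder and decoder stand in for a trained ARAE.

```python
import math


def reconstruction_error(x, x_hat):
    """Mean squared error between an input and its reconstruction (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def cosine_similarity(u, v):
    """Cosine similarity between two latent features (assumed similarity test)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def detect(x, encode, decode, error_threshold, threat_dictionary, sim_threshold=0.95):
    """Classify one encoded behavior sequence.

    Fast track: compare the latent feature with stored threat features.
    Main pipeline: flag the input when its reconstruction error exceeds
    the preset error of normal behavior.
    """
    z = encode(x)
    for label, feature in threat_dictionary.items():
        if cosine_similarity(z, feature) >= sim_threshold:
            return ("significant_threat", label)
    error = reconstruction_error(x, decode(z))
    if error > error_threshold:
        return ("unusual", error)
    return ("normal", error)


# Toy stand-ins for a trained encoder and decoder (hypothetical).
def toy_encode(x):
    return x[:2]


def toy_decode(z):
    return list(z) + [0.0, 0.0]


if __name__ == "__main__":
    dictionary = {"mass_export": [0.0, 5.0]}  # hypothetical stored feature
    for sample in ([1.0, 2.0, 0.0, 0.0], [1.0, 2.0, 3.0, 0.0], [0.0, 4.0, 0.0, 0.0]):
        print(detect(sample, toy_encode, toy_decode, 0.5, dictionary))
```

Because the fast track needs only the encoder output, it can return before the decoder is evaluated, which matches the lower cost attributed to it in the description.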