KR102534396B1

KR102534396B1 - Method of operating artificial intelligence algorithms, apparatus for operating artificial intelligence algorithms and storage medium for storing a software operating artificial intelligence algorithms

Info

Publication number: KR102534396B1
Application number: KR1020200165605A
Authority: KR
Inventors: 송중석; 최윤수; 김규일; 이준; 권태웅; 이윤수; 최상수; 이혁로; 박진형
Original assignee: 한국과학기술정보연구원
Priority date: 2020-12-01
Filing date: 2020-12-01
Publication date: 2023-05-22
Also published as: KR20220076780A

Abstract

The disclosed embodiments relate to a method of executing an artificial intelligence (AI) algorithm, an apparatus for executing an AI algorithm, and a storage medium storing software that executes an AI algorithm.
According to one embodiment, an apparatus for executing an AI algorithm is provided, comprising: a class distribution detection unit that detects the binary label distribution of binary-labeled data; a label distribution processing unit that classifies the data according to the detected binary label distribution and processes the classified data according to that distribution; and a learning processing unit that trains on each set of data processed by the label distribution processing unit, using AI algorithms selected according to the label distribution of that data.

Description

Artificial intelligence (AI) algorithm execution method, artificial intelligence algorithm execution device, and storage medium for storing artificial intelligence algorithm execution software

The following disclosure relates to a method of executing an artificial intelligence (AI) algorithm, an apparatus for executing an AI algorithm, and a storage medium storing software that executes an AI algorithm.

When building an AI model, securing a high-quality training dataset is very important. In practice, however, the distribution and characteristics of real data are skewed in most cases. In such an environment, the data must first be corrected, selected, and processed before a model can be built, which consumes considerable time and material cost.

For example, even when the same data and learning algorithm are used, how they are applied can vary greatly depending on the user's purpose.

Data processing, including the handling of data imbalance, and the selection of a learning algorithm suited to the intended use have required human intervention, which has greatly limited how AI models can be used and selected.

In addition, although attempts to use AI technology are under way in various fields, AI models are generally developed with data specialized for each field, so their applicability is extremely limited.

For example, in network security monitoring, such as intrusion detection systems in computer security, the class distribution of the input data is highly variable, and AI model results are easily biased depending on the nature of the data.

Security data collected in a real environment, such as the true-positive or false-positive detection results for security events, contains imbalanced binary-class data (binary classification being the task of dividing the elements of a set into two groups according to a classification rule).

For security events, the proportions of true positives and false positives differ greatly, so a way of applying the optimal algorithm is needed when running AI algorithms on such imbalanced data.

However, data that a user has arbitrarily preprocessed clearly differs from data collected in the real environment, which may affect model performance.

If a user who does not know the characteristics of the data applies an AI algorithm, or a user with limited understanding of AI algorithms chooses the wrong algorithm or data processing, poor results may follow.

The following disclosure addresses these problems by providing a method of executing an AI algorithm, an apparatus for executing an AI algorithm, and a storage medium storing software that executes an AI algorithm, each of which can apply AI algorithms optimally according to the imbalance of the input data.

One object of the disclosure is to provide a method, an apparatus, and a storage medium storing software that can automatically apply AI algorithms according to the imbalance of the input data, without arbitrary data preprocessing by the user.

Another object of the disclosure is to provide a method, an apparatus, and a storage medium storing software that allow even a user who does not know the characteristics of the data, or who has limited understanding of AI algorithms, to run an optimal AI algorithm.

One disclosed embodiment provides an apparatus for executing an AI algorithm, comprising: a class distribution detection unit that detects the binary label distribution of binary-labeled data; a label distribution processing unit that classifies the data according to the detected binary label distribution and processes the classified data according to that distribution; and a learning processing unit that trains on each set of data processed by the label distribution processing unit, using AI algorithms selected according to the label distribution of that data.

The label distribution processing unit may process the classified data according to whether its label distribution is a single-label distribution, an imbalanced label distribution, or a balanced label distribution.

When the label distribution of the classified data is imbalanced, the label distribution processing unit may over-fit or under-fit the classified data to change it into a dataset with a single-label distribution or a balanced label distribution.

When the label distribution of the classified data is a single-label distribution, the learning processing unit may apply an unsupervised learning AI algorithm to the classified data.

When the label distribution of the classified data is balanced, the learning processing unit may apply any one of the following AI algorithms: a Deep Neural Network (DNN), a Convolution Neural Network (CNN), a Recurrent Neural Network (RNN), or a Binarized Support Vector Machine (BSVM).
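
The selection logic described above can be illustrated with a minimal Python sketch. The function names and the `balance_ratio` threshold are hypothetical, since the text does not state how "balanced" is distinguished from "imbalanced"; the candidate algorithm names are the ones listed in the text.

```python
from collections import Counter


def detect_distribution(labels, balance_ratio=0.4):
    """Classify a binary label list as 'single', 'imbalanced', or 'balanced'.

    balance_ratio is an assumed threshold, not a value from the text:
    if the minority class makes up at least this share, the data is
    treated as balanced.
    """
    if not labels:
        raise ValueError("no labels given")
    counts = Counter(labels)
    if len(counts) > 2:
        raise ValueError("expected binary labels")
    if len(counts) == 1:
        return "single"
    minority_share = min(counts.values()) / len(labels)
    return "balanced" if minority_share >= balance_ratio else "imbalanced"


def select_algorithms(distribution):
    """Return the candidate algorithm families named in the text."""
    if distribution == "single":
        return ["unsupervised"]
    if distribution == "balanced":
        return ["DNN", "CNN", "RNN", "BSVM"]
    # Imbalanced data is first converted to a single-label or balanced
    # dataset by the label distribution processing unit, so no algorithm
    # is selected at this stage.
    return []
```

Under these assumptions, a dataset with 95 false positives and 5 true positives would be reported as imbalanced and routed back to the label distribution processing unit before any learning algorithm is chosen.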

Another disclosed embodiment provides a method of executing an AI algorithm, comprising: detecting the binary label distribution of input data; classifying the data according to the detected binary label distribution and processing the classified data according to that distribution; and training on each set of processed data by applying AI algorithms selected according to the label distribution.

Another disclosed embodiment provides a storage medium storing software that executes an AI algorithm by: detecting the binary label distribution of input data; classifying the data according to the detected binary label distribution and processing the classified data according to that distribution; and training on each set of processed data by applying AI algorithms selected according to the label distribution of the classified data.

According to the examples disclosed below, an AI algorithm can be executed optimally even when the input data has an imbalanced distribution, because the imbalance is taken into account.

According to the examples disclosed below, an AI algorithm can be applied automatically according to the imbalance of the input data, without arbitrary data preprocessing by the user.

According to the examples disclosed below, even a user who does not know the characteristics of the data, or who has limited understanding of AI algorithms, can run an optimal AI algorithm.

FIG. 1 is a conceptual diagram of an embodiment of the AI model platform proposed in the present invention.
FIG. 2 is a block diagram of an AI model platform according to an embodiment of the present invention.
FIG. 3 is a block diagram of a feature information recommendation device according to an embodiment of the present invention.
FIG. 4 is a block diagram of a normalization method recommendation device according to an embodiment of the present invention.
FIG. 5 illustrates an embodiment in which an AI algorithm is executed according to the label distribution of imbalanced data.
FIG. 6 illustrates an example of a label (class) distribution processing unit according to an embodiment.
FIG. 7 illustrates an example of selectively applying AI algorithms using data processed according to the distribution of the input data.
FIG. 8 illustrates an example of a method of executing an AI algorithm that selectively applies AI algorithms using data processed according to the distribution of the data.

Embodiments are described below with reference to the accompanying drawings.

Currently, the real-time security monitoring service provided by the Science and Technology Cyber Safety Center has a service structure in which security monitoring personnel perform rule-based analysis and response support, based on security events detected and collected by the threat management system (TMS).

However, the number of security events detected by the TMS is growing explosively, and it is becoming practically impossible for security monitoring personnel to analyze security events at this volume in full.

Moreover, because the existing security monitoring service relies on the expertise and experience of security monitoring personnel, analysis is not uniform: work becomes concentrated on particular security events, and analysis results vary.

Consequently, with the number of security events detected by the TMS growing explosively, the structure of the existing security monitoring service, which relies on analysis by security monitoring personnel, itself needs to be overhauled.

A security monitoring service structure that uses an AI model in place of analysis by security monitoring personnel can therefore be considered.

The present invention aims to provide an AI model platform that makes it possible to generate AI models for security monitoring.

In particular, the present invention aims to provide an AI model platform that allows even general users who are unfamiliar with security monitoring technology to generate an optimal AI model for security monitoring.

FIG. 1 conceptually shows an embodiment of the AI model platform proposed in the present invention.

As shown in FIG. 1, the AI model platform of the present invention can be divided into three functions: a collection function that collects and processes the various data needed to generate an AI model for security monitoring; an AI function that generates AI models from the data collected and processed by the collection function and manages the associated performance and history; and a management function that, through a user interface (UI) provided to system administrators and general users, handles the various settings related to the collection and AI functions as well as user management.

The AI model platform of the present invention also includes a search engine that periodically collects newly generated source security data from the integrated big-data storage. The various data of the collection function can be loaded into the search engine so that the search engine serves as a data store.

The modules of the collection function (e.g., collection, feature extraction, normalization, output) can then operate on top of the search engine (data store).

The configuration of the AI model platform according to an embodiment of the present invention, and the role of each component, are described in detail below with reference to FIG. 2.

The AI model platform 100 of the present invention includes a data collection module 110, a feature extraction module 120, a normalization module 130, a data output module 140, and a model generation module 150.

The AI model platform 100 may further include a performance management module 160 and a UI module 170.

All or at least some of the components of the AI model platform 100 may be implemented as hardware modules, as software modules, or as a combination of hardware and software modules.

Here, a software module may be understood as, for example, instructions executed by a processor that controls computation within the AI model platform 100, and such instructions may be stored in a memory within the AI model platform 100.

Through the configuration described above, the AI model platform 100 according to an embodiment of the present invention realizes the technology proposed in the present invention, namely technology that enables generation of an optimal AI model for security monitoring. Each component of the AI model platform 100 is described in more detail below.

First, the UI module 170 provides a user interface (UI) for setting at least one of: the specific search condition of the data collection module 110, the feature information of the feature extraction module 120, the normalization method of the normalization module 130, and the condition of the data output module 140.

For example, the UI module 170 provides this UI in response to operation by a system administrator or general user (hereinafter collectively, the user) who wishes to generate an AI model for security monitoring on the AI model platform 100.

Based on the provided UI, the UI module 170 then stores and manages, in a user information/setting information store, the various settings related to the collection and AI functions. Specifically, these are the settings for the AI model to be generated as described later: the specific search condition of the data collection module 110, the feature information of the feature extraction module 120, the normalization method of the normalization module 130, and the condition of the data output module 140.

The data collection module 110 collects, from the source security data, the security events to be used as training/test data according to a specific search condition, namely the condition previously set by the user.

For example, a date (or period), a number of events, or an IP address to be used for the training/test data may be set as the specific search condition of the data collection module 110.

When the specific search condition is a date (or period), the data collection module 110 may collect the security events in the source security data that fall within the set date (or period).

When the specific search condition is a number of events, the data collection module 110 may collect the set number of security events (e.g., 500,000) from the source security data as of a designated point in time.

When the specific search condition is an IP address, the data collection module 110 may collect the security events in the source security data whose source IP or destination IP matches the set IP address.

A combination of date (or period), number of events, IP address, and so on may of course also be set as the specific search condition.

In that case as well, the data collection module 110 may collect security events from the source security data according to the set combination of date (or period), number of events, IP address, and so on.
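
The combination of search conditions can be sketched in Python as follows. The `SecurityEvent` and `SearchCondition` types and the `collect` function are hypothetical names introduced only for illustration; the text does not describe the actual interface of the data collection module 110.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SecurityEvent:
    detected_on: date
    source_ip: str
    destination_ip: str


@dataclass
class SearchCondition:
    # Each field is optional; unset fields place no restriction.
    start: Optional[date] = None
    end: Optional[date] = None
    max_count: Optional[int] = None
    ip: Optional[str] = None


def collect(events, cond):
    """Return the events that satisfy every condition that is set."""
    selected = []
    for ev in events:
        # Stop once the requested number of events has been reached.
        if cond.max_count is not None and len(selected) >= cond.max_count:
            break
        if cond.start is not None and ev.detected_on < cond.start:
            continue
        if cond.end is not None and ev.detected_on > cond.end:
            continue
        # The IP condition matches either the source or the destination.
        if cond.ip is not None and cond.ip not in (ev.source_ip, ev.destination_ip):
            continue
        selected.append(ev)
    return selected
```

In this sketch a condition that sets only `ip` behaves like the IP-only case, and setting several fields at once gives the combined case.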

More specifically, when collecting security events from the source security data as described above, the maximum number of events that the data collection module 110 can collect concurrently may be limited in order to reduce system load.

For example, when collecting security events belonging to a set date (or period) from the source security data, assume that the total number of security events to collect for that date (or period) is 1,000,000 and that the maximum number that can be collected concurrently is 500,000.

In this case, the data collection module 110 determines that the total for this collection exceeds the maximum concurrent number, stores the portion in excess of the maximum in a queue, and then processes it sequentially.

That is, of the 1,000,000 events in this collection, the data collection module 110 collects the first 500,000 (the maximum) in chronological order, and stores the remaining 500,000 that exceed the maximum in a queue to be collected sequentially afterward.

For the 500,000 events processed after being queued, the data collection module 110 collects security events only from data in the source security data that predates the time at which the collection request occurred.

That is, for the 500,000 queued events, there is a gap between the time the collection request occurred and the time the collection is actually carried out. To prevent the collection errors this gap could cause, security events are collected only from source security data that predates the time the request occurred.
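
The queueing behavior can be sketched in Python as follows. `CollectionBatch`, `plan_collection`, and `eligible` are hypothetical names, and treating "predates" as a strict comparison is an assumption; the text only states that excess items are queued and that queued items are restricted to data from before the request time.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CollectionBatch:
    size: int
    requested_at: datetime  # cutoff: only earlier data may be collected


def plan_collection(total, max_concurrent, requested_at):
    """Split a collection request into an immediate batch and queued batches."""
    if max_concurrent <= 0:
        raise ValueError("max_concurrent must be positive")
    immediate = CollectionBatch(min(total, max_concurrent), requested_at)
    queued = deque()
    remaining = total - immediate.size
    while remaining > 0:
        size = min(remaining, max_concurrent)
        # Queued batches keep the original request time, so data that
        # arrives while they wait in the queue is not picked up.
        queued.append(CollectionBatch(size, requested_at))
        remaining -= size
    return immediate, queued


def eligible(event_time, batch):
    """True if an event predates the time the collection was requested."""
    return event_time < batch.requested_at
```

With the figures from the text (1,000,000 events in total, at most 500,000 at once), this plan yields one immediate batch of 500,000 and one queued batch of 500,000.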

As mentioned earlier, the AI model platform 100 of the present invention includes a search engine that periodically collects newly generated source security data from the integrated big-data storage.

In this case, the data collection module 110 may collect security data from the source security data in the search engine (data store).

Because the integrated big-data storage is used not only by the AI model platform 100 of the present invention but also by other systems, collecting a large volume of data (security events) from it would put load on the storage and could affect those other systems.

In the present invention, however, the data collection module 110 does not collect security events directly from the integrated big-data storage. It collects them through the search engine, which periodically pulls only newly generated source security data from that storage, and so avoids the storage load problem described above.

The feature extraction module 120 extracts preset feature information, namely the features previously set by the user, from the security events collected by the data collection module 110.

When generating an AI model, classifying data (security events) with an AI algorithm requires finding the features that the data has and turning them into vectors. This process is called feature information extraction.

The feature extraction module 120 is responsible for performing this feature information extraction on the security events collected by the data collection module 110.

The feature information extracted from each security event by the feature extraction module 120 is then used for machine learning (e.g., deep learning) when the AI model is generated, as described later.

In particular, the present invention allows the user to set both single features and composite features as feature information.

Here, single features are features that can be extracted from one security event.

Examples of single features include the detection time, source IP, source port, destination IP, destination port, protocol, security event name, security event type, attack count, attack direction, packet size, automatic analysis result, dynamic analysis result, institution number, jumbo payload flag, payload, and payload converted with word2vec.

For reference, payload conversion with word2vec turns words into vectors, determining the vector of each word from its relationships with the surrounding words. In ordinary sentences, words can be separated by whitespace, but a payload is very difficult to divide into meaningful units and contains many special characters, so preprocessing is required before word2vec can be applied.

본 발명에서는, word2vec을 적용하기 위한 사전 처리로서, 다음의 4단계를 수행할 수 있다.In the present invention, as a pre-processing for applying word2vec, the following 4 steps can be performed.

1) Convert the hexadecimal-encoded string to an ASCII string (bytes outside ASCII codes 32 to 127 are converted to spaces).

2) Decode URL-encoded parts (%25 -> '%', %26 -> '&', %2A -> '*', ...).

3) Replace all special symbols except '@', '\', '-', ':', '%', '_', '.', '!', '/', and '`' with spaces, and convert all uppercase letters to lowercase.

4) Apply the word2vec algorithm, excluding words that consist of a single character.
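The four pre-processing steps above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation described in the patent: the function names are hypothetical, the payload is assumed to arrive as a plain hexadecimal string, and the sketch stops at producing tokens, leaving the word2vec training itself to whatever library is used.

```python
import re
from urllib.parse import unquote

# Symbols that step 3 keeps; every other special symbol becomes a space.
KEPT_SYMBOLS = "@\\-:%_.!/`"
_DROP_PATTERN = re.compile(r"[^A-Za-z0-9\s" + re.escape(KEPT_SYMBOLS) + r"]")


def hex_to_ascii(hex_payload: str) -> str:
    """Step 1: decode hex; bytes outside codes 32..127 become spaces."""
    raw = bytes.fromhex(hex_payload)
    return "".join(chr(b) if 32 <= b <= 127 else " " for b in raw)


def preprocess_payload(hex_payload: str) -> list[str]:
    """Return the tokens that would be handed to a word2vec trainer."""
    text = hex_to_ascii(hex_payload)        # step 1
    text = unquote(text)                    # step 2: %2A -> '*', %26 -> '&', ...
    text = _DROP_PATTERN.sub(" ", text)     # step 3a: drop other special symbols
    text = text.lower()                     # step 3b: lowercase everything
    # Step 4: single-character words are excluded before word2vec is applied.
    return [token for token in text.split() if len(token) > 1]


if __name__ == "__main__":
    sample = "GET /a.php?id=%2A&x=1".encode("ascii").hex()
    print(preprocess_payload(sample))
```

Note that the URL-decoded '*' and '&' are removed again in step 3, because they are not in the list of symbols to keep; the order of the steps matters.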

A composite feature, on the other hand, is a single feature that can be extracted by applying aggregation or statistical techniques across multiple security events.

For example, security events may be grouped by a criterion such as a time period or a number of events, and a single feature (e.g., the result of the operation) extracted through an operation within the group (e.g., aggregation or a statistical technique) may be a composite feature.

For example, assume that a security event group such as that shown in Table 1 below is formed based on a period (8.22 to 9.3).

A single feature (e.g., 4) extracted through an operation within the security event group (e.g., the number of security events whose source IP, destination IP, and security event name are 100.100.100.100/111.111.111.11/AAA) may be a composite feature.
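As an illustration of such a composite feature, the sketch below counts the events in a group that share the same source IP, destination IP, and security event name. The `SecurityEvent` type, the function name, and the sample events are all hypothetical; the sample group is made up for the example and does not reproduce Table 1.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class SecurityEvent(NamedTuple):
    """Hypothetical minimal event record; real events carry many more fields."""
    src_ip: str
    dst_ip: str
    event_name: str


def count_by_key(events: Iterable[SecurityEvent]) -> Counter:
    """Composite feature: number of events per (src_ip, dst_ip, event_name)."""
    return Counter((e.src_ip, e.dst_ip, e.event_name) for e in events)


if __name__ == "__main__":
    # Made-up group of events assumed to fall inside one grouping period.
    group = [
        SecurityEvent("100.100.100.100", "111.111.111.11", "AAA"),
        SecurityEvent("100.100.100.100", "111.111.111.11", "AAA"),
        SecurityEvent("100.100.100.100", "111.111.111.11", "BBB"),
        SecurityEvent("100.100.100.100", "111.111.111.11", "AAA"),
        SecurityEvent("10.0.0.1", "111.111.111.11", "AAA"),
        SecurityEvent("100.100.100.100", "111.111.111.11", "AAA"),
    ]
    counts = count_by_key(group)
    print(counts[("100.100.100.100", "111.111.111.11", "AAA")])
```

The count for each key can then be attached to every event in the group as one additional feature value.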

Accordingly, the feature extraction module 120 may extract the preset feature information (single features and/or composite features) from the security events collected by the data collection module 110.

The normalization module 130 performs preset normalization on the feature information extracted from the security events.

Normalization is the process of bringing the value ranges of the extracted features onto a consistent scale. If field A ranges from 50 to 100 and field B ranges from 0 to 100, the same value of 50 has a different meaning in each field because it is measured on a different scale. A process is therefore needed that adjusts the values of different fields to a common scale so that they carry a consistent meaning, and this process is called normalization.

The normalization module 130 normalizes the feature information extracted from the security events by adjusting the values of different fields to a common scale, according to a preset normalization method, so that they carry a consistent meaning.

Here, the preset normalization method is the normalization method set in advance by the user.

The artificial intelligence model platform 100 of the present invention provides the following three normalization methods from which the user can choose in advance.

Equation 1 is the Feature scaling [a,b] normalization method, Equation 2 is the Mean normalization [-1,1] normalization method, and Equation 3 is the Standard score normalization method.
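The equations themselves are not reproduced in this text. For orientation only, the commonly used textbook forms of the three named methods are given below; this is an assumption about what Equations 1 to 3 contain, not a transcription of them. Here $x$ is a raw field value, $x'$ the normalized value, $\min(x)$, $\max(x)$, and $\bar{x}$ the minimum, maximum, and mean of the field, $\sigma$ its standard deviation, and $[a, b]$ the target range.

```latex
% Assumed standard forms, not copied from the patent's equation images.
\begin{align}
  x' &= a + \frac{\bigl(x - \min(x)\bigr)\,(b - a)}{\max(x) - \min(x)}
      && \text{(Feature scaling to } [a, b]\text{)} \\
  x' &= \frac{x - \bar{x}}{\max(x) - \min(x)}
      && \text{(Mean normalization, } x' \in [-1, 1]\text{)} \\
  x' &= \frac{x - \bar{x}}{\sigma}
      && \text{(Standard score)}
\end{align}
```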

The normalization module 130 normalizes the feature information extracted from the security events according to whichever of the three normalization methods above has been set in advance by the user.
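A minimal sketch of how a user-selected method might be applied to one field is shown below. It assumes the usual textbook definitions of the three methods (the patent's own equations are not reproduced here), uses the population standard deviation for the standard score, and all function and dictionary names are hypothetical.

```python
from statistics import mean, pstdev


def feature_scaling(values, a=0.0, b=1.0):
    """Rescale values linearly into the range [a, b]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant field: map everything to a
        return [a for _ in values]
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]


def mean_normalization(values):
    """Center on the mean and divide by the range; results lie in [-1, 1]."""
    lo, hi, mu = min(values), max(values), mean(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - mu) / (hi - lo) for v in values]


def standard_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical lookup from the user's preset choice to the function to apply.
NORMALIZERS = {
    "feature_scaling": feature_scaling,
    "mean_normalization": mean_normalization,
    "standard_score": standard_score,
}

if __name__ == "__main__":
    field_a = [50, 75, 100]
    for name, normalize in NORMALIZERS.items():
        print(name, normalize(field_a))
```

Applying the same method to field A (50 to 100) and field B (0 to 100) puts both on a common scale, which is the purpose of normalization described above.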

The data output module 140 extracts learning data or test data from the security events whose feature information has been normalized, according to given conditions, that is, conditions set in advance by the user.

Specifically, the data output module 140 outputs the security events whose feature information has been normalized to a screen or a file according to the values, order, format, learning/test data ratio, file splitting method, and so on that the user wants.

The learning data or test data output in this way is managed by date and by user in a database or file storage so that it can be used immediately when an artificial intelligence model is generated.

The model generation module 150 applies an artificial intelligence algorithm to the learning data output by the data output module 140 and managed in file storage, thereby generating an artificial intelligence model for security control.

That is, the model generation module 150 may apply an artificial intelligence algorithm to the learning data to generate an artificial intelligence model for security control, for example an artificial intelligence model with the function requested by the user.

For example, at the user's request, the model generation module 150 may generate an artificial intelligence detection model for detecting whether a security event is malicious, or an artificial intelligence classification model for classifying security events as true positives or false positives.

Specifically, based on the learning data output by the data output module 140 and managed in file storage, the model generation module 150 may generate an artificial intelligence model for security control according to an artificial intelligence algorithm, for example a machine learning (e.g., deep learning) algorithm selected in advance by the user.

For example, in machine learning based on backpropagation, the model generation module 150 may use a loss function, which represents the deviation between the value predicted by the model and the actual value, to generate from the learning data an artificial intelligence model for which the deviation given by the loss function becomes 0.

As described above, the artificial intelligence model platform 100 of the present invention provides a platform environment in which an artificial intelligence model for security control can be generated through a UI without separate programming, so that even general users who are unfamiliar with security control technology can generate artificial intelligence models that suit their own purposes and requirements.

Furthermore, the performance management module 160 of the artificial intelligence model platform 100 tests the accuracy of the generated artificial intelligence model using the test data output by the data output module 140 and managed in file storage.

The performance management module 160 manages the artificial intelligence models generated by the model generation module 150. It records and manages, in the system (file storage), performance information such as who created the model, when, and with which data, which fields, which sampling method, which normalization method, and which model, as well as what level of performance (correct answer rate) the generated model has.

Based on this performance information, the performance management module 160 allows the conditions used for model generation and the resulting performance to be compared at a glance, so that the correlation between conditions and performance can be grasped easily.

Because the present invention provides a platform environment in which even general users who are unfamiliar with security control technology can generate artificial intelligence models, testing the accuracy (performance) of the models generated in this platform environment may be essential.

Specifically, the performance management module 160 tests the accuracy of the generated artificial intelligence model using the test data output by the data output module 140 and managed in file storage (security events for which the actual results of true positive/false positive classification and malicious detection are known).

For example, the performance management module 160 may test the generated artificial intelligence model with the test data and output, as the test result, the ratio at which the values predicted by the model match the known actual values, that is, the accuracy (performance) of the model.
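The match ratio described above is ordinary classification accuracy. A minimal sketch follows; the function name and the label values are illustrative assumptions.

```python
from typing import Sequence


def accuracy(predicted: Sequence, actual: Sequence) -> float:
    """Fraction of test events whose predicted label equals the known label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one test event is required")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


if __name__ == "__main__":
    # Hypothetical labels: 1 = true positive (real attack), 0 = false positive.
    known = [1, 0, 0, 1]
    model_output = [1, 0, 1, 1]
    print(accuracy(model_output, known))
```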

When an artificial intelligence model is generated, which features are used and which normalization method is applied have a large effect on model performance (accuracy).

However, it is difficult for a person, especially a general user who is unfamiliar with security control technology, to combine and set the feature information that yields the best performance for the artificial intelligence model he or she wants to generate.

Accordingly, in the present invention, the feature extraction module 120 may recommend a change to the feature information, based on the accuracy test result of the performance management module 160, so as to increase the accuracy of the generated artificial intelligence model.

It is likewise difficult for a person, especially a general user who is unfamiliar with security control technology, to know and set the normalization method that yields the best performance for the artificial intelligence model he or she wants to generate.

Accordingly, in the present invention, the normalization module 130 may also recommend a change of normalization method so as to increase the accuracy of the artificial intelligence model.

Hereinafter, with reference to FIG. 3, a technique for recommending a change of feature information so as to increase the accuracy of an artificial intelligence model, and specifically a feature information recommendation device that realizes the technique, will be described.

FIG. 3 shows the configuration of a feature information recommendation device according to an embodiment of the present invention.

As shown in FIG. 3, the feature information recommendation device 200 of the present invention includes a model performance checking unit 210, a combination performance checking unit 220, and a recommendation unit 230.

All or at least part of the configuration of the feature information recommendation device 200 may be implemented as a hardware module, as a software module, or as a combination of a hardware module and a software module.

Here, a software module may be understood as, for example, instructions executed by a processor that controls operations within the feature information recommendation device 200, and these instructions may be loaded in a memory within the feature information recommendation device 200.

Through the configuration described above, the feature information recommendation device 200 according to an embodiment of the present invention realizes the technique proposed in the present invention, that is, the technique of recommending a change of feature information so as to increase the accuracy of an artificial intelligence model. Each component of the feature information recommendation device 200 is described in more detail below.

The model performance checking unit 210 checks the model performance of an artificial intelligence model generated by learning preset feature information, selected from all the feature information that can be set when an artificial intelligence model is generated.

That is, the model performance checking unit 210 checks the performance (accuracy) of the artificial intelligence model generated by learning the feature information set by the user.

For concreteness, the following description assumes an artificial intelligence model generated in the artificial intelligence model platform 100 of the present invention by learning the feature information set by the user (hereinafter, user-set feature information).

The model performance checking unit 210 checks the model performance of the artificial intelligence model generated, as described above, by learning the user-set feature information in the artificial intelligence model platform 100.

For example, the model performance checking unit 210 may test and check the model performance (accuracy) of the artificial intelligence model using the test data output from the artificial intelligence model platform 100 of the present invention, in particular from the data output module 140 (security events for which the actual results of true positive/false positive classification and malicious detection are known).

Accordingly, the model performance checking unit 210 may test the artificial intelligence model generated in the artificial intelligence model platform 100 (in particular, the data output module 140) with the test data and output, as the test result, the ratio at which the values predicted by the model match the known actual values, that is, the accuracy (performance) of the model.

The combination performance checking unit 220 sets a plurality of feature information combinations from the entire feature information and checks the performance of the artificial intelligence model generated by learning each combination.

Specifically, from all the feature information that can be set when an artificial intelligence model is generated, the combination performance checking unit 220 may set various feature information combinations other than the user-set feature information learned for the current model, and check the performance of the artificial intelligence model generated by learning each of these combinations.

The recommendation unit 230 may recommend, from among the per-combination performances checked by the combination performance checking unit 220, a specific feature information combination whose performance is higher than the model performance checked by the model performance checking unit 210, that is, the performance of the artificial intelligence model generated from the current user settings.

Specific embodiments of recommending a specific feature information combination are described below.

According to one embodiment, the plurality of feature information combinations set by the combination performance checking unit 220 may be combinations obtained by sequentially adding, to the user-set feature information learned for the current model, at least one item of the remaining feature information, that is, the entire feature information excluding the user-set feature information.

The following description assumes that, of the entire feature information (e.g., a, b, c, ..., z (n=26)), the user-set feature information learned for the current model is a, b, c, d, e, f (k=6). It also assumes that the artificial intelligence model performance (mk) checked by the model performance checking unit 210 is 85%.

The combination performance checking unit 220 may then set a plurality of feature information combinations by sequentially adding, to the user-set feature information (a, b, c, d, e, f), at least one item of the remaining feature information, that is, the entire feature information (n) excluding the user-set feature information (a, b, c, d, e, f).

For example, the combination performance checking unit 220 may sequentially add 1 to (n-k) items of the remaining feature information to the user-set feature information (a, b, c, d, e, f) and thereby set a plurality of feature information combinations such as the following.

a,b,c,d,e,f,g -> m(k+1)1 -> 82%

a,b,c,d,e,f,h -> m(k+1)2 -> 80%

......

a,b,c,d,e,f,g,h,i -> m(k+3)1 -> 88%

......

a,b,c,d,e,f,...,z -> m(n) -> 85%

The combination performance checking unit 220 may then check the performance of the artificial intelligence model generated by learning each of these feature information combinations: 82%, 80%, ..., 88%, ..., 85%.

In this case, the recommendation unit 230 may select and recommend, as specific feature information combinations, the top N (e.g., 4) combinations among those whose performance is higher than the performance (mk=85%) of the artificial intelligence model generated from the current user settings.

The number N may, of course, be specified or changed by the system administrator or the user.

As another example, the combination performance checking unit 220 may add the remaining feature information, that is, the entire feature information (n) excluding the user-set feature information (a, b, c, d, e, f), to the user-set feature information one item at a time and thereby set a plurality of feature information combinations such as the following.

a,b,c,d,e,f,g -> m(k+1)1 -> 82%

a,b,c,d,e,f,h -> m(k+1)2 -> 80%

......

a,b,c,d,e,f,z -> m(k+1)ζ+1 -> 90%

The combination performance checking unit 220 may then check the performance of the artificial intelligence model generated by learning each of these feature information combinations: 82%, 80%, ..., 90%.

In this case, the recommendation unit 230 may select and recommend, as specific feature information combinations, the top N (e.g., 3) combinations among those whose performance is higher than the performance (mk=85%) of the artificial intelligence model generated from the current user settings.
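The one-feature-at-a-time variant of this embodiment can be sketched as below. This is an illustration under stated assumptions rather than the patented implementation: `evaluate` is a hypothetical callback that stands in for training a model on a feature combination and measuring its accuracy on the test data, and the function name and return format are invented for the example.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

Combination = FrozenSet[str]


def recommend_top_n(
    all_features: Iterable[str],
    user_features: Iterable[str],
    evaluate: Callable[[Combination], float],
    baseline: float,
    top_n: int,
) -> List[Tuple[float, Combination]]:
    """Add each remaining feature to the user's set, score the result,
    and return the top_n combinations that beat the baseline accuracy."""
    base = frozenset(user_features)
    remaining = sorted(set(all_features) - base)
    scored = [(evaluate(base | {f}), base | {f}) for f in remaining]
    # Keep only combinations strictly better than the user's current model.
    better = [(score, combo) for score, combo in scored if score > baseline]
    better.sort(key=lambda item: item[0], reverse=True)
    return better[:top_n]


if __name__ == "__main__":
    user = set("abcdef")
    # Toy accuracies keyed by the added feature; any other feature scores 0.50.
    toy = {"g": 0.82, "h": 0.80, "l": 0.87, "z": 0.90}

    def toy_evaluate(combo: Combination) -> float:
        (added,) = combo - user
        return toy.get(added, 0.50)

    for score, combo in recommend_top_n(
        "abcdefghijklmnopqrstuvwxyz", user, toy_evaluate, baseline=0.85, top_n=3
    ):
        print(score, "".join(sorted(combo)))
```

Fewer than N combinations are returned when fewer than N candidates exceed the baseline. In practice each call to `evaluate` means a full training run, so the number of candidates drives the cost.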

According to another embodiment, the combination performance checking unit 220 may perform a combination setting process in which it sets a plurality of feature information combinations by adding, one item at a time, the remaining feature information, that is, the entire feature information (n) excluding the user-set feature information (a, b, c, d, e, f), to the preset feature information, that is, the user-set feature information (a, b, c, d, e, f).

The combination performance checking unit 220 can then check, as before, the performance of each of the following feature information combinations: 82%, 80%, ..., 90%.

a,b,c,d,e,f,g -> m(k+1)1 -> 82%

a,b,c,d,e,f,h -> m(k+1)2 -> 80%

......

a,b,c,d,e,f,z -> m(k+1)ζ+1 -> 90%

The combination performance checking unit 220 may then perform a resetting process: each feature information combination whose performance is higher than the performance (mk=85%) of the artificial intelligence model generated from the current user settings is reset as the feature information, and the combination setting process is repeated for each reset feature information.

That is, the combination performance checking unit 220 deletes the feature information combinations whose performance is lower than or equal to the performance of the artificial intelligence model (mk=85%) and keeps only the combinations with higher performance, as listed below. It then resets each of these as the feature information and repeats the combination setting process for each reset feature information, as shown in Table 2 below.

a,b,c,d,e,f,l -> m(k+1)1 -> 87%

a,b,c,d,e,f,m -> m(k+1)2 -> 86%

a,b,c,d,e,f,n -> m(k+1)3 -> 86%

The combination performance checking unit 220 repeats the combination setting process and the resetting process described above. When no feature information combination has a higher performance than the performance (mk=85%) of the artificial intelligence model generated from the current user settings, it selects the immediately preceding feature information as the specific feature information combination and passes it to the recommendation unit 230.

In this case, the recommendation unit 230 may recommend the feature information passed from the combination performance checking unit 220 as the specific feature information combination whose performance is higher than the performance (mk=85%) of the artificial intelligence model generated from the current user settings.
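The repeated combination setting and resetting can be read as a forward search over feature sets. The sketch below is one possible reading, not the patented implementation. It assumes that every round compares candidates against the fixed baseline mk, that a feature set is recommended once none of its one-feature extensions beats that baseline, and that `evaluate` is a hypothetical callback that trains a model on the given features and returns its test accuracy.

```python
from typing import Callable, FrozenSet, Iterable, List

Combination = FrozenSet[str]


def iterative_recommend(
    all_features: Iterable[str],
    user_features: Iterable[str],
    evaluate: Callable[[Combination], float],
    baseline: float,
) -> List[Combination]:
    """Grow feature sets one feature at a time while they beat the baseline."""
    universe = set(all_features)
    start = frozenset(user_features)
    frontier = [start]           # feature sets to extend in this round
    seen = {start}               # avoid re-expanding the same combination
    recommended: List[Combination] = []
    while frontier:
        next_frontier: List[Combination] = []
        for base in frontier:
            # Combination setting: add each remaining feature to the base.
            survivors = [
                base | {f}
                for f in sorted(universe - base)
                if evaluate(base | {f}) > baseline
            ]
            if not survivors:
                # No extension beats the baseline: keep the preceding set,
                # unless it is just the user's original selection.
                if base != start:
                    recommended.append(base)
                continue
            # Resetting: each surviving combination becomes a new base.
            for combo in survivors:
                if combo not in seen:
                    seen.add(combo)
                    next_frontier.append(combo)
        frontier = next_frontier
    return recommended
```

The loop terminates because every surviving combination is strictly larger than its base and the set of all features is finite. The number of training runs can still grow quickly, so a practical system would likely cap the rounds or the frontier size.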

As described above, according to the present invention, a user who generates an artificial intelligence model for security control through the UI in the environment provided by the artificial intelligence model platform 100 can be recommended, and can apply, the features that give the best performance (accuracy), so that even general users who are unfamiliar with security control technology can generate an optimal artificial intelligence model for security control.

Hereinafter, with reference to FIG. 4, a technique for recommending a change of normalization method so as to increase the accuracy of an artificial intelligence model, and specifically a normalization method recommendation device that realizes the technique, will be described.

FIG. 4 shows the configuration of a normalization method recommendation device according to an embodiment of the present invention.

As shown in FIG. 4, the normalization method recommendation device 300 of the present invention includes an attribute checking unit 310, a determination unit 320, and a recommendation unit 330.

All or at least part of the configuration of the normalization method recommendation device 300 may be implemented as a hardware module, as a software module, or as a combination of a hardware module and a software module.

Here, a software module may be understood as, for example, instructions executed by a processor that controls operations within the normalization method recommendation device 300, and these instructions may be loaded in a memory within the normalization method recommendation device 300.

Through the configuration described above, the normalization method recommendation device 300 according to an embodiment of the present invention realizes the technique proposed in the present invention, that is, the technique of recommending a change of normalization method so as to increase the accuracy of an artificial intelligence model. Each component of the normalization method recommendation device 300 is described in more detail below.

The attribute checking unit 310 checks the attribute of the feature information used for learning when an artificial intelligence model is generated.

Here, the feature information used for learning may be feature information set directly by the user through the UI from all the feature information that can be set when an artificial intelligence model is generated, or it may be feature information set by applying a specific feature information combination recommended from the entire feature information.

The attribute of feature information can be broadly divided into a numeric attribute and a category attribute.

That is, the attribute checking unit 310 can check whether the attribute of the feature information used for learning (whether set directly or applied from a recommendation) is a numeric attribute, a category attribute, or a combined numeric and category attribute.

The determination unit 320 determines, from all the normalization methods that can be set, the normalization method that corresponds to the attribute of the feature information checked by the attribute checking unit 310.

Specifically, before determining the normalization method according to the attribute of the feature information, the determination unit 320 may first distinguish whether the same normalization method is applied to all fields of the current feature information or whether a normalization method is applied field by field.

When only numeric and/or category data exists in all fields of the current feature information (including the single-feature case), the determination unit 320 may determine that the same normalization method is applied to all fields of the current feature information.

In this case, the determination unit 320 may determine the normalization method as follows. When the attribute of the feature information is a numeric attribute, it determines a first normalization method according to the overall numeric pattern of the feature information. When the attribute is a category attribute, it determines a second normalization method, in which each category is represented by a vector whose length is the total number of categories of the feature information and which holds a non-zero feature value only at the position designated for that category. When the attribute is a combined numeric and category attribute, it determines both the second normalization method and the first normalization method.

구체적으로, 제1 정규화 방식은, 기 정의된 우선순위에 따라 Standard score 정규화 방식, Mean normalization 정규화 방식, Feature scaling 정규화 방식을 포함한다(수학식 1,2,3 참조).Specifically, the first normalization method includes a standard score normalization method, a mean normalization normalization method, and a feature scaling normalization method according to predefined priorities (see Equations 1, 2, and 3).

결정부(320)는, 특징정보 전체 필드에 숫자 데이터만 존재하는 경우 특징정보의 속성이 숫자 속성인 것으로 구분하고, 이 경우 특징정보의 전체 숫자패턴에 따른 제1 정규화 방식을 결정한다.The determination unit 320 determines that the attribute of the feature information is a numeric attribute when only numeric data exists in the entire field of feature information, and in this case, determines the first normalization method according to the entire number pattern of feature information.

이때, 결정부(320)는, 제1 정규화 방식 중 우선순위에 따라 Standard score 정규화 방식, Mean normalization 정규화 방식, Feature scaling 정규화 방식의 순서로 결정하되, 특징정보의 전체 숫자패턴에 대한 표준편차 및 정규화 스케일링 범위 상/하한 존재 여부를 근거로, 제1 정규화 방식 중 적용 가능한 가장 우선순위가 높은 정규화 방식을 결정할 수 있다.At this time, the determination unit 320 determines the standard score normalization method, the mean normalization normalization method, and the feature scaling normalization method in order according to the priority among the first normalization methods, and standard deviation and normalization for all numerical patterns of feature information. Based on whether upper/lower limits of the scaling range exist, a normalization method having the highest applicable priority among the first normalization methods may be determined.
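The priority-based selection can be sketched as follows. The three formulas are the standard textbook definitions of these normalizations and are assumed to match Equations 1 to 3, which are not reproduced in this passage. The applicability rules are also assumptions made for illustration: Standard score is treated as applicable when the standard deviation is non-zero and no bounded scaling range is required, while Mean normalization (output within [-1, 1]) and Feature scaling (output within [0, 1]) are treated as applicable when their output fits the required range.

```python
import statistics
from typing import Optional, Sequence, Tuple


def standard_score(xs: Sequence[float]) -> list:
    """(x - mean) / standard deviation."""
    mean, std = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


def mean_normalization(xs: Sequence[float]) -> list:
    """(x - mean) / (max - min); results lie within [-1, 1]."""
    mean, lo, hi = statistics.fmean(xs), min(xs), max(xs)
    return [(x - mean) / (hi - lo) for x in xs]


def feature_scaling(xs: Sequence[float]) -> list:
    """(x - min) / (max - min); results lie within [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def choose_first_normalization(
    xs: Sequence[float], bounds: Optional[Tuple[float, float]] = None
) -> Optional[str]:
    """Return the highest-priority applicable method name, or None.

    `bounds` is a hypothetical (lower, upper) limit on the scaling range;
    None means that no limit exists.
    """
    std = statistics.pstdev(xs)
    spread = max(xs) - min(xs)

    def fits(lo: float, hi: float) -> bool:
        # The method's output interval must lie inside the required range.
        return bounds is None or (bounds[0] <= lo and hi <= bounds[1])

    candidates = [  # listed in the priority order stated in the text
        ("standard_score", std > 0 and bounds is None),
        ("mean_normalization", spread > 0 and fits(-1.0, 1.0)),
        ("feature_scaling", spread > 0 and fits(0.0, 1.0)),
    ]
    for name, applicable in candidates:
        if applicable:
            return name
    return None
```

Under these assumed rules, a constant-valued field yields no applicable method, since both the standard deviation and the value range are zero.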

In addition, when only category data exists in all fields of the feature information, the determination unit 320 classifies the attribute of the feature information as a category attribute and, in this case, may determine the second normalization method, in which a non-zero feature value is placed only at the position designated for each category of the feature information within a vector whose size is defined by the total number of categories of the feature information.

To generate an artificial intelligence model by applying an artificial intelligence algorithm (e.g., machine learning) to training data, the data must be converted into a numerical form that a machine can understand. In the present invention, One Hot Encoding may be adopted as this conversion method (the second normalization method).

Accordingly, when the attribute of the feature information is a category attribute, the determination unit 320 may determine the second normalization method (One Hot Encoding), in which a non-zero feature value (e.g., 1) is placed only at the position designated for each category of the feature information within a vector whose size is defined by the total number of categories of the feature information.

To briefly explain the second normalization method (One Hot Encoding), assume that the feature information has the category attribute "fruit" and that the full set of categories is apple, pear, and persimmon (since there are three kinds of fruit, each is expressed as a three-dimensional vector).

Each piece of feature information whose data is apple, pear, or persimmon can then be expressed as follows under the second normalization method (One Hot Encoding).

Apple = {1, 0, 0}

Pear = {0, 1, 0}

Persimmon = {0, 0, 1}
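The fruit example above can be reproduced with a few lines of Python. The helper name `one_hot` and the use of a Python list for the vector are illustrative choices, not part of the disclosure.

```python
def one_hot(categories: list, value: str) -> list:
    """Return a vector with 1 at the position designated for `value` and 0 elsewhere."""
    vector = [0] * len(categories)          # one dimension per category
    vector[categories.index(value)] = 1     # raises ValueError for an unknown category
    return vector


# The full set of categories from the example; the order fixes each position.
FRUITS = ["apple", "pear", "persimmon"]
```

Calling `one_hot(FRUITS, "pear")` gives the second vector of the example, with the single non-zero value in the middle position.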

In addition, when both numeric and category data exist in the fields of the feature information, the determination unit 320 classifies the attribute of the feature information as a combined numeric and category attribute and, in this case, may determine the second normalization method and the first normalization method described above.

That is, when the attribute of the feature information is a combined numeric and category attribute, the determination unit 320 may determine both the second normalization method and the first normalization method, so that the second normalization method (One Hot Encoding) is first applied to the category-attribute data in the feature information, and the highest-priority applicable method among the first normalization methods is then determined based on the standard deviation of the overall numeric pattern of the feature information and on whether upper/lower limits of the normalization scaling range exist.

Meanwhile, when the feature information is a composite feature (a single feature that can be extracted by applying aggregation and statistical techniques across multiple security events), the determination unit 320 may determine that a normalization method is applied field by field across the fields of the current feature information.

In this case, for a field of the feature information whose attribute is a type attribute, the determination unit 320 may determine the highest-priority applicable method between the Mean normalization method and the Feature scaling normalization method.

For a field whose attribute is a count attribute, the determination unit 320 may likewise determine the highest-priority applicable method between the Mean normalization method and the Feature scaling normalization method.

For a field whose attribute is a ratio attribute, the determination unit 320 may either leave the normalization method undetermined and exclude the field from normalization, or determine the Standard score normalization method.

For a field whose attribute is an existence attribute (e.g., the presence/absence of an operation result value), the determination unit 320 may leave the normalization method undetermined and exclude the field from normalization.
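The per-field rules for composite features can be summarized as a small lookup. This is a sketch under assumptions: the attribute names are written as plain strings, an empty list stands for "excluded from normalization", and the `normalize_ratio` flag is a hypothetical switch between the two options the text allows for ratio fields.

```python
def candidate_methods(attribute: str, normalize_ratio: bool = False) -> list:
    """Return candidate normalization methods for one field, in priority order.

    An empty list means the field is excluded from normalization.
    """
    if attribute in ("type", "count"):
        return ["mean_normalization", "feature_scaling"]
    if attribute == "ratio":
        # The text allows either exclusion or Standard score for ratio fields.
        return ["standard_score"] if normalize_ratio else []
    if attribute == "existence":
        return []
    raise ValueError(f"unknown field attribute: {attribute}")


def plan_composite(field_attributes: dict, normalize_ratio: bool = False) -> dict:
    """Map each field name of a composite feature to its candidate methods."""
    return {
        name: candidate_methods(attr, normalize_ratio)
        for name, attr in field_attributes.items()
    }
```

Which of the two candidates is finally chosen for a type or count field would still depend on the applicability check described for the first normalization method.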

The recommendation unit 330 recommends the normalization method determined by the determination unit 320.

As described above, according to the present invention, an optimal normalization method offering the best performance (accuracy) can be recommended/applied to a user who generates an artificial intelligence model for security control through the UI in the environment provided by the artificial intelligence model platform 100, so that even general users unfamiliar with security control technology can generate an optimal artificial intelligence model for security control.

As described above, the present invention implements an artificial intelligence model platform that makes it possible to generate artificial intelligence models for security control. In particular, by optimally recommending/applying the feature information and the normalization method, which directly affect model performance, the platform allows even general users unfamiliar with security control technology to generate an optimal artificial intelligence model for security control.

As a result, according to the present invention, optimal artificial intelligence models suited to the purposes and requirements of security control can be generated and applied flexibly and in various ways. This can maximize the improvement in the quality of security control services and can also be expected to support the construction of an artificial-intelligence-based incident response system for efficiently analyzing signs of large-scale cyberattacks and abnormal behavior.

Embodiments that can further improve the execution performance of an artificial intelligence algorithm by taking the characteristics of the data into account are disclosed below.

An embodiment is disclosed for using imbalanced data collected in a real environment, such as security data, in machine learning and artificial intelligence techniques without separate processing and classification procedures.

According to the embodiment, an artificial intelligence algorithm that can learn in an optimal manner may be selected automatically according to the label distribution of the data held, or an optimal algorithm may be applied in consideration of the user's purpose.

FIG. 5 discloses an embodiment in which an artificial intelligence algorithm is executed according to the label distribution of imbalanced data.

This embodiment includes a database 510, a label (class) distribution processing unit 520, and a learning processing unit 530.

The database 510 stores the user's data; for example, it may store security-related data such as security events or security log data.

The label (class) distribution processing unit 520 classifies the data stored in the database 510 according to its label distribution, or detects data that has been so classified, and processes the data according to the classified or detected label distribution.

The label (class) distribution processing unit 520 may include a plurality of label distribution processing units corresponding to the label distributions.

In this embodiment, the label (class) distribution processing unit 520 is illustrated as including a label (class) distribution detection unit 521, a first label distribution processing unit 523, a second label distribution processing unit 525, and a third label distribution processing unit 527.

The label (class) distribution detection unit 521 may classify the data stored in the database 510 according to the label distribution of that data, or detect data that has been so classified.

For convenience of description, the following embodiments illustrate the case where, according to the label (class) distribution ratio, the distribution is a single label distribution, an imbalanced label distribution, or a balanced label distribution.

In this case, the first label distribution processing unit 523 processes the single label distribution data detected by the label (class) distribution detection unit 521. An example of processing single label distribution data is described in detail below.

The second label distribution processing unit 525 processes the imbalanced label distribution data detected by the label (class) distribution detection unit 521. An example of processing imbalanced label distribution data is described in detail below.

The third label distribution processing unit 527 processes the balanced label distribution data detected by the label (class) distribution detection unit 521. An example of processing balanced label distribution data is described in detail below.

Although an example in which the label distribution is divided into three cases (single, imbalanced, and balanced) is disclosed here, the label (class) distribution may be divided more finely according to the user's purpose. In that case, the label (class) distribution processing unit 520 may include a plurality of label distribution processing units beyond those illustrated.

For example, if the user divides the label distribution ratios more finely, the ratios of data labels that fall under the single label, imbalanced label, and balanced label may change. In that case, for example, the imbalanced label may itself comprise a plurality of imbalanced labels, such as a first imbalanced label, a second imbalanced label, and a third imbalanced label, according to the ratio of the data labels.

Here, a single label means the case where, for the first class and the second class that are the binary labels of the data in the input data set, the ratio of their distributions is so one-sided that the data set can be judged to consist almost entirely of a single class. That is, the ratio of the first class to the second class is very large to very small or, conversely, very small to very large; the ratio of the two classes may be defined by the user.

A balanced label means the case where the distributions of the first class and the second class, which are the binary labels of the data in the input data set, are similar; the ratio of the two classes may be defined by the user.

An imbalanced label means the case that is neither a single label ratio nor a balanced label ratio; likewise, the ratio of the two classes of data in the input data set may be defined by the user.

The classification of multiple label distributions and the classification ratios are described in detail in the following embodiments.

The learning processing unit 530 selects an artificial intelligence learning algorithm for each of the data classifications that the label (class) distribution processing unit 520 has processed according to label distribution, and learns the data of each classification with the selected algorithm.

In the embodiment of this figure, the database 510 may correspond to the data collection module 110 of FIG. 2. The label distribution processing unit 520 may correspond to the feature extraction module 120 of FIG. 2, and the learning processing unit 530 may correspond to the model generation module 150 of FIG. 2.

FIG. 6 discloses an example of the label (class) distribution processing unit according to an embodiment.

In this example, the label or class distribution of the data is assumed to be binary (binary class).

In this example, a binary label distribution of 9:1 or more is assumed to be a single label distribution, a binary label distribution of 6:4 or more and 9:1 or less is assumed to be an imbalanced label distribution, and a binary label distribution of 5:5 or more and 6:4 or less is assumed to be a balanced label distribution.

In this example, according to the distribution of each label, the first label distribution processing unit 523 is referred to as the single label distribution unit, which processes single label distribution data, and the second label distribution processing unit 525 is referred to as the imbalanced label distribution unit, which processes imbalanced label distribution data. Likewise, the third label distribution processing unit 527 is referred to as the balanced label distribution unit, which processes balanced label distribution data.

The label (class) distribution detection unit 521 may detect the label distribution of stored or input data according to the label distribution criteria. The detected data are output to the first label distribution processing unit 523, the second label distribution processing unit 525, or the third label distribution processing unit 527 according to their label distribution.

When the input data set has a single label distribution (exemplified here as a binary class data distribution ratio of 9:1 or more), the label (class) distribution detection unit 521 detects this label distribution and, according to the detected label distribution, outputs the input data to the first label distribution processing unit 523.

For example, this applies when, of the two classes, the amount of class 1 data is extremely large and the amount of class 2 data is extremely small, or conversely when the amount of class 1 data is extremely small and the amount of class 2 data is extremely large.

When the input data set has an imbalanced label distribution (a binary class data distribution ratio of 6:4 or more and 9:1 or less), the label (class) distribution detection unit 521 detects this label distribution and, according to the detected label distribution, outputs the input data to the second label distribution processing unit 525.

For example, this applies when, of the two classes, class 1 data is the majority and class 2 data is the minority, or conversely when class 1 data is the minority and class 2 data is the majority.

When the input data set has a balanced label distribution (exemplified here as a binary class data distribution ratio of 6:4 or less and 5:5 or more), the label (class) distribution detection unit 521 detects this label distribution and, according to the detected label distribution, outputs the input data to the third label distribution processing unit 527.
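The detection and routing performed by the label (class) distribution detection unit 521 can be sketched as below, using the example thresholds of 9:1 and 6:4. Because the stated ranges share their boundary values, this sketch assumes that exactly 9:1 counts as a single label distribution and exactly 6:4 counts as a balanced label distribution; the function name, the integer ratio parameters, and the `ROUTE` table are illustrative.

```python
from collections import Counter


def classify_label_distribution(labels, single=(9, 1), balanced=(6, 4)) -> str:
    """Classify binary labels as 'single', 'imbalanced', or 'balanced'.

    The ratio thresholds are user-definable, as the text notes.
    Integer arithmetic avoids floating-point issues at the boundaries.
    """
    counts = Counter(labels)
    if not counts or len(counts) > 2:
        raise ValueError("expected a non-empty set of binary labels")
    majority, total = max(counts.values()), len(labels)
    if majority * sum(single) >= single[0] * total:      # majority share >= 9/10
        return "single"
    if majority * sum(balanced) <= balanced[0] * total:  # majority share <= 6/10
        return "balanced"
    return "imbalanced"


# Reference numerals of the processing units that receive each kind of data set.
ROUTE = {"single": 523, "imbalanced": 525, "balanced": 527}
```

A data set with 95 samples of one class and 5 of the other would be routed to unit 523, while a 7:3 data set would be routed to unit 525.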

For example, when security data is detected, the data may have a binary class distribution of true positives and false positives. In such security data detection, the proportions of true positives and false positives differ greatly, so the classes of the data can be judged to have a single label distribution.

Accordingly, in this case the label (class) distribution detection unit 521 may output the input data set to the first label distribution processing unit 523 according to that data distribution. An artificial intelligence algorithm may then be selected and applied to the data processed by the first label distribution processing unit 523 in consideration of the data distribution.

For example, the first label distribution processing unit 523 may review and process the label distribution of the received data set. For a data set with a single label distribution, the first label distribution processing unit 523 may, for example, perform detection based on the similarity of the data, anomaly detection among the data, and the like.
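One simple way to realize similarity-based anomaly detection on a single label data set is to flag points that lie unusually far from the centroid. The sketch below is an illustration under assumptions, not the method of the disclosure: Euclidean distance serves as the similarity measure, and the threshold of the mean distance plus `k` standard deviations is a hypothetical rule.

```python
import math
import statistics


def centroid(points):
    """Return the coordinate-wise mean of a list of equal-length tuples."""
    return [statistics.fmean(column) for column in zip(*points)]


def find_anomalies(points, k: float = 2.0):
    """Return indices of points whose distance to the centroid is unusually large."""
    center = centroid(points)
    distances = [math.dist(p, center) for p in points]
    threshold = statistics.fmean(distances) + k * statistics.pstdev(distances)
    return [i for i, d in enumerate(distances) if d > threshold]
```

In a data set dominated by one class, the few points flagged this way are candidates for the rare class, which is the motivation for treating such data with unsupervised techniques.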

FIG. 7 discloses an example of selectively applying artificial intelligence algorithms using data processed according to the distribution of the input data.

In this example as well, according to the distribution of each label, the first label distribution processing unit 523 is referred to as the single label distribution unit, which processes single label distribution data; the second label distribution processing unit 525 is referred to as the imbalanced label distribution unit, which processes imbalanced label distribution data; and the third label distribution processing unit 527 is referred to as the balanced label distribution unit, which processes balanced label distribution data.

According to the embodiment, a different artificial intelligence algorithm may be selected for each label distribution group, and the data belonging to each distribution group may be learned using the algorithm selected for that label distribution.

In this example, the first label distribution processing unit 523 verifies and processes the data distribution of the single label distribution data, and the learning processing unit 530 applies a suitable artificial intelligence algorithm to the processed data.

In this case, the learning processing unit 530 applies an unsupervised learning algorithm rather than a supervised learning algorithm. For example, when the label distribution is a single label distribution, the learning processing unit 530 groups the data based on similarity, as in a clustering algorithm, and performs data mining in consideration of the characteristics of the resulting clusters. The learning processing unit 530 then performs machine learning that takes the characteristics of the entire data into account, using the representative value of each cluster of the classified data.
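As an illustration of similarity-based grouping with a representative value per cluster, the following is a minimal k-means sketch in plain Python. K-means is chosen here only as a familiar example of a clustering algorithm; the disclosure does not name a specific one, and the seed, iteration count, and function name are assumptions.

```python
import math
import random
import statistics


def kmeans(points, k: int, iterations: int = 20, seed: int = 0):
    """Cluster points into k groups; return (centers, assignment).

    Each center is the mean of its members and serves as the cluster's
    representative value.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from k distinct data points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center.
        assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(statistics.fmean(c) for c in zip(*members))
    return centers, assignment
```

The returned centers could then stand in for the clusters when the characteristics of the whole data set are summarized.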

The learning processing unit 530 may apply an algorithm such as an unsupervised generative adversarial network (GAN). In this case, the learning processing unit 530 estimates the probability distribution of the data in the manner exemplified for the normalization module of FIG. 2 or FIG. 4 above, so that the artificial neural network generates that distribution.

In this case, when new data arrives, the label (class) distribution detection unit 520 compares the similarity between the new data and the previously processed data according to the previously generated probability distribution, and determines the label of the new data by feeding it into the artificial intelligence model generated through learning.

The third label distribution processing unit 527 verifies and processes the data distribution of the balanced label distribution data, and the learning processing unit 530 applies an artificial intelligence algorithm suited to the balanced label distribution to the processed data. In this case, the learning processing unit 530 may apply supervised learning algorithms such as a DNN (Deep Neural Network), CNN (Convolution Neural Network), RNN (Recurrent Neural Network), or BSVM (Binarized Support Vector Machine).

The second label distribution processing unit 525 processes the imbalanced label distribution data. The learning processing unit 530 may execute an artificial intelligence algorithm on the imbalanced label distribution data processed by the second label distribution processing unit 525.

As another example, the second label distribution processing unit 525 performs over-sampling or under-sampling on a data set with an imbalanced label distribution. As a result, the data set over-sampled or under-sampled by the second label distribution processing unit 525 may become a single label distribution or a balanced label distribution.

The second label distribution processing unit 525 may process a data set with an imbalanced label distribution to change it into a single label distribution and transmit the changed data to the first label distribution processing unit 523, or change it into a balanced label distribution and transmit the changed data to the third label distribution processing unit 527.
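The over-sampling and under-sampling step can be sketched with random resampling. This sketch covers only the conversion of an imbalanced data set into a balanced one; the function names, the fixed seed, and the choice of purely random resampling are assumptions, since the disclosure does not specify the sampling technique.

```python
import random


def _group_by_label(samples, labels):
    """Collect samples into a dict keyed by label, preserving order."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return groups


def random_oversample(samples, labels, seed: int = 0):
    """Duplicate random minority samples until every class matches the largest."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = max(len(g) for g in groups.values())
    out_samples, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def random_undersample(samples, labels, seed: int = 0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = min(len(g) for g in groups.values())
    out_samples, out_labels = [], []
    for label, group in groups.items():
        out_samples.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_samples, out_labels
```

After either call, the resampled data set has a 5:5 label ratio and could be passed on to the third label distribution processing unit 527.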

학습 처리부(530)는 제 1, 제 2, 제 3 레이블 분포 처리부(523, 525, 527)가 처리한 데이터에 대해 각각 적합한 인공지능 알고리즘을 적용할 수 있다. The learning processing unit 530 may apply suitable artificial intelligence algorithms to the data processed by the first, second, and third label distribution processing units 523, 525, and 527, respectively.

도 8은 데이터의 분포에 따라 처리한 데이터를 이용해 인공지능 알고리즘을 선택적으로 적용하는 인공지능 알고리즘 수행 방법의 일 예를 개시한다. 8 discloses an example of a method of performing an artificial intelligence algorithm that selectively applies an artificial intelligence algorithm using data processed according to data distribution.

입력된 데이터로부터 이진 클래스 데이터의 분포를 검출한다(S610). 데이터의 이진 클래스를 검출하는 예는 도 5 및 도 6에서 상술하였다. 예를 들어 이진 클래스의 데이터 분포는 단일 레이블 분포, 불균형 레이블 분포, 균형 레이블 분포에 따라 나눌 수 있다. 이진 클래스를 검출하는 예는 데이터 레이블의 분포의 비에 따라 달라질 수 있으므로 분포의 비가 달라지는 경우 예시한 레이블 분포 외에 다른 레이블 분포들이 있을 수 있다. 또한 데이터 분포를 단일 레이블 분포, 불균형 레이블 분포, 균형 레이블 분포에 따라 나눈 후 각각의 단일 레이블 분포, 불균형 레이블 분포, 균형 레이블 분포를 다시 레이블 분포의 비에 따라 세분하여 각각 나눌 수도 있다. The distribution of binary class data is detected from the input data (S610). An example of detecting the binary class of data is described above in FIGS. 5 and 6 . For example, the data distribution of the binary class can be divided according to single label distribution, unbalanced label distribution, and balanced label distribution. Since an example of detecting a binary class may vary according to a distribution ratio of data labels, there may be other label distributions other than the exemplified label distribution when the distribution ratio is changed. In addition, after dividing the data distribution into single label distribution, unbalanced label distribution, and balanced label distribution, each of the single label distribution, unbalanced label distribution, and balanced label distribution can be further subdivided according to the ratio of the label distributions to be divided.

The data included in each detected label or class is processed according to the distribution of the detected label or class (S620). Methods of processing the data included in a class are described above with reference to FIGS. 6 and 7.

For example, when the ratio of the data label distribution corresponds to an imbalanced label distribution, the data may be changed to a single label or balanced label distribution through over-fitting, under-fitting, or the like, in consideration of the user's purpose.
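The conversion of an imbalanced data set can be sketched as follows. The text does not specify how the conversion is carried out, so this sketch swaps in simple random resampling as an assumption: dropping the minority label to reach a single label distribution, and duplicating minority samples to reach a balanced one. The function name `rebalance` and its parameters are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def rebalance(
    samples: Sequence, labels: Sequence, target: str, seed: int = 0
) -> Tuple[List, List]:
    """Convert imbalanced binary data to a "single" or "balanced" data set."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two distinct labels")
    (major, n_major), (minor, n_minor) = counts.most_common(2)
    pairs = list(zip(samples, labels))
    if target == "single":
        # Assumption: keep only the majority label and drop minority samples.
        kept = [p for p in pairs if p[1] == major]
    elif target == "balanced":
        # Assumption: randomly duplicate minority samples until both labels
        # have the same number of samples.
        rng = random.Random(seed)
        minority = [p for p in pairs if p[1] == minor]
        kept = pairs + rng.choices(minority, k=n_major - n_minor)
    else:
        raise ValueError(f"unknown target: {target!r}")
    out_samples = [s for s, _ in kept]
    out_labels = [l for _, l in kept]
    return out_samples, out_labels
```

The `target` argument stands in for the user's purpose mentioned above: it decides whether the converted data is handled as single label data or as balanced label data.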

An artificial intelligence algorithm is selected for each set of data processed according to its label distribution, and the processed data is learned using the selected algorithm (S630).

Examples of selecting an artificial intelligence algorithm for data processed according to label distribution are described above with reference to FIGS. 6 and 7.

The label distribution categories may also be further subdivided by the user. The embodiment discloses an example in which data is classified as single label, imbalanced label, or balanced label. When the label distribution differs, a different artificial intelligence algorithm may be selected accordingly.

Even when the class distribution of the data is a single label, the label distribution ratios of the data set classified as single label may be subdivided internally so that different unsupervised learning artificial intelligence algorithms are applied.

Likewise, even when the class distribution of the data is a balanced label, the label distribution ratios of the data set classified as balanced label may be subdivided so that an artificial intelligence algorithm such as DNN, CNN, RNN, or BSVM is selected and applied according to the subdivided ratio.
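The selection in step S630 can be pictured as a lookup from label distribution to candidate algorithms. This is a minimal sketch under stated assumptions: the names `CANDIDATES` and `select_algorithms` are hypothetical, the entries for "balanced" are the algorithms the text names, and the entries for "single" are only illustrative unsupervised algorithms, since the text does not enumerate them. Imbalanced data is assumed to have been converted to a single or balanced distribution beforehand.

```python
from typing import Dict, List

# Candidate algorithms per label distribution. The "balanced" entries are the
# ones the text names; the "single" entries are hypothetical examples of
# unsupervised algorithms, which the text does not enumerate.
CANDIDATES: Dict[str, List[str]] = {
    "single": ["k-means", "autoencoder"],
    "balanced": ["DNN", "CNN", "RNN", "BSVM"],
}


def select_algorithms(distribution: str) -> List[str]:
    """Return candidate algorithms for data with the given label distribution."""
    try:
        # Return a copy so callers cannot modify the shared table.
        return list(CANDIDATES[distribution])
    except KeyError:
        raise ValueError(f"no candidates for distribution: {distribution!r}") from None
```

A finer subdivision by label distribution ratio, as described above, could be added by keying the table on narrower ratio ranges; the text does not specify which ranges map to which algorithm.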

According to this embodiment, when artificial intelligence and machine learning methods are applied to data, a high-performance artificial intelligence model can be generated by applying different artificial intelligence algorithms according to the data distribution, even when the data distribution of the classes strongly affects the results or degrades the training or performance of the model.

In addition, automatically recommending algorithms according to the purpose of use or the data distribution of the classes lowers the barrier to entry for using artificial intelligence and can maximize productivity.

100: AI model platform
110: data collection module
120: feature extraction module
130: normalization module
140: data output module
150: model generation module
160: performance management module
170: UI module
210: model performance confirmation unit
220: combination performance confirmation unit
230: recommendation unit
310: property confirmation unit
320: decision unit
330: recommendation unit
510: database
520: label distribution processing unit
521: label distribution detection unit
523, 525, 527: first, second, and third label distribution processing units
530: learning processing unit

Claims

An artificial intelligence algorithm execution device comprising:
a class distribution detector that detects a binary label distribution of binary label data;
a label distribution processing unit that classifies the data according to the detected binary label distribution and processes the classified data according to the binary label distribution; and
a learning processing unit that learns each set of data processed by the label distribution processing unit, using artificial intelligence algorithms selected according to the label distribution of that data,
wherein the label distribution processing unit
checks the performance of artificial intelligence models generated for each combination of feature information for the binary label data, and
recommends one of the feature information combinations based on the performance of the artificial intelligence models.

The artificial intelligence algorithm execution device of claim 1,
wherein the label distribution processing unit
processes the classified data according to whether the label distribution of the classified data is a single label distribution, an imbalanced label distribution, or a balanced label distribution.

The artificial intelligence algorithm execution device of claim 1,
wherein the label distribution processing unit,
when the label distribution of the classified data is an imbalanced label distribution, changes the classified data into a data set having a single label distribution or a balanced label distribution by over-fitting or under-fitting the classified data.

The artificial intelligence algorithm execution device of claim 1,
wherein the learning processing unit
applies an unsupervised learning artificial intelligence algorithm to the classified data when the label distribution of the classified data is a single label distribution.

The artificial intelligence algorithm execution device of claim 1,
wherein the learning processing unit,
when the label distribution of the classified data is a balanced label distribution, applies any one artificial intelligence algorithm among a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and a Binarized Support Vector Machine (BSVM).

A method of performing an artificial intelligence algorithm, each step of which is performed by a computing device, the method comprising:
detecting a binary label distribution of data from input data;
classifying the data according to the detected binary label distribution and processing the classified data according to the binary label distribution; and
learning each set of the processed data by applying artificial intelligence algorithms selected according to the label distribution,
wherein the processing step comprises
checking the performance of artificial intelligence models generated for each combination of feature information for binary label data, and
recommending one of the feature information combinations based on the performance of the artificial intelligence models.

The method of claim 6,
wherein the step of processing according to the binary label distribution
processes the classified data according to whether the label distribution of the classified data is a single label distribution, an imbalanced label distribution, or a balanced label distribution.

The method of claim 6,
wherein the step of processing according to the binary label distribution,
when the label distribution of the classified data is an imbalanced label distribution, changes the classified data into a data set having a single label distribution or a balanced label distribution by over-fitting or under-fitting the classified data.

The method of claim 6,
wherein the learning step
applies an unsupervised learning artificial intelligence algorithm to the classified data when the label distribution of the classified data is a single label distribution.

The method of claim 6,
wherein the learning step,
when the label distribution of the classified data is a balanced label distribution, applies any one artificial intelligence algorithm among a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and a Binarized Support Vector Machine (BSVM).

A computer-readable storage medium storing software that performs an artificial intelligence algorithm, the software performing processes of: detecting a binary label distribution of data from input data; classifying the data according to the detected binary label distribution and processing the classified data according to the binary label distribution; and learning each set of the processed data by applying artificial intelligence algorithms selected according to the label distribution of the classified data,
wherein the processing process comprises
checking the performance of artificial intelligence models generated for each combination of feature information for binary label data, and
recommending one of the feature information combinations based on the performance of the artificial intelligence models.

The computer-readable storage medium of claim 11,
wherein the process of learning each set of the processed data
applies an unsupervised learning artificial intelligence algorithm to the classified data when the label distribution of the classified data is a single label distribution.

The computer-readable storage medium of claim 11,
wherein the process of learning each set of the processed data,
when the label distribution of the classified data is a balanced label distribution, applies any one artificial intelligence algorithm among a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and a Binarized Support Vector Machine (BSVM).