KR20220074638A

KR20220074638A - A method and apparatus for determining sampling point and sampling rate for multiple traffic analyzers using reinforcement learning on software-defined networks

Info

Publication number: KR20220074638A
Application number: KR1020200163365A
Authority: KR
Inventors: 임혁; 김성환; 윤승현
Original assignee: 광주과학기술원
Priority date: 2020-11-27
Filing date: 2020-11-27
Publication date: 2022-06-03

Abstract

The present invention relates to a method and apparatus for determining sampling points and sampling rates for multiple traffic analyzers using reinforcement learning in an SDN-supported network. An operating method of an SDN using reinforcement learning according to an embodiment of the present invention may include: (a) receiving, from a traffic analyzer, an inspection result of the sampled traffic for each sampling point; and (b) determining at least one traffic analyzer, at least one sampling point, and sampling rate-related information based on the inspection result of the sampled traffic.

Description

A method and apparatus for determining sampling point and sampling rate for multiple traffic analyzers using reinforcement learning on software-defined networks

The present invention relates to a software-defined networking (SDN)-supported network and, more particularly, to a method and apparatus for determining sampling points and sampling rates for multiple traffic analyzers using reinforcement learning in an SDN-supported network.

Over the past few decades, the Internet has become an indispensable part of many people's lives, resulting in a massive increase in network traffic.

As the number of Internet users and the volume of network traffic have grown, so has the importance of cybersecurity, which defends against network attacks and anomalies such as denial-of-service attacks, man-in-the-middle attacks, and malware.

One serious threat to the Internet is the advanced persistent threat (APT). An APT uses a variety of attack techniques and operates quietly, allowing an attacker to remain undetected and maintain control over a target system for a long period.

Such attacks are difficult to detect because they steal important information from specific targets, such as government agencies, corporations, or the military, through a covert and persistent attack process.

The purpose of an APT attack is not to paralyze the target's network but to steal and seize the target's data and then make monetary demands for the release of the seized data.

Accordingly, the attacker first breaks into the target network, installs APT malware on the victim's computer, and puts the computer into an abnormal state. The attacker uses this malware to remotely control the compromised system and to steal sensitive data, such as confidential information, from the target over a long period.

Various studies have been conducted on detecting network anomalies such as APT attacks, but improvements remain insufficient.

[Patent Document 1] Korean Patent No. 10-2033169

The present invention was devised to solve the problems described above, and an object of the present invention is to provide a method and apparatus for determining sampling points and sampling rates for multiple traffic analyzers using reinforcement learning in an SDN-supported network.

Another object of the present invention is to provide a method and apparatus for determining a traffic analyzer, a sampling point, and a sampling rate using a DDPG model.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood from the description below.

To achieve the above objects, an operating method of an SDN using reinforcement learning according to an embodiment of the present invention may include: (a) receiving, from a traffic analyzer, an inspection result of the sampled traffic for each sampling point; and (b) determining at least one traffic analyzer, at least one sampling point, and sampling rate-related information based on the inspection result of the sampled traffic.

In an embodiment, step (b) may include: calculating a reward for the inspection result of the sampled traffic; and determining the at least one traffic analyzer, the at least one sampling point, and the sampling rate-related information by applying the reward to a reinforcement learning model.

In an embodiment, step (b) may include: determining a sampling policy by applying the reward to the reinforcement learning model; and determining, according to the sampling policy, the at least one traffic analyzer, the at least one sampling point, and the sampling rate-related information that maximize the reward.

In an embodiment, the traffic sampled at the sampling point may be forwarded to the traffic analyzer.

In an embodiment, the reinforcement learning model may include a Deep Deterministic Policy Gradient (DDPG) model.

In an embodiment, step (b) may include: determining state space information for the at least one traffic analyzer and the at least one sampling point based on the inspection result of the sampled traffic; and determining, based on the state space information, action space information for the at least one traffic analyzer, the at least one sampling point, and the sampling rate-related information.

In an embodiment, step (b) may include: when the action space information is executed at a first time step, determining transition probability information for the change from the state space information for the first time step to the state space information for a second time step; determining a reward according to the reinforcement learning model based on the transition probability information; and determining the at least one traffic analyzer, the at least one sampling point, and the sampling rate-related information by applying the reward to the reinforcement learning model.

In an embodiment, an SDN device may include: a communication unit that receives, from a traffic analyzer, an inspection result of the sampled traffic for each sampling point; and a control unit that determines at least one traffic analyzer, at least one sampling point, and sampling rate-related information based on the inspection result of the sampled traffic.

In an embodiment, the control unit may calculate a reward for the inspection result of the sampled traffic and determine the at least one traffic analyzer, the at least one sampling point, and the sampling rate-related information by applying the reward to a reinforcement learning model.

In an embodiment, the control unit may determine a sampling policy by applying the reward to the reinforcement learning model and determine, according to the sampling policy, the at least one traffic analyzer, the at least one sampling point, and the sampling rate-related information that maximize the reward.

In an embodiment, the control unit may determine state space information for the at least one traffic analyzer and the at least one sampling point based on the inspection result of the sampled traffic, and determine, based on the state space information, action space information for the at least one traffic analyzer, the at least one sampling point, and the sampling rate-related information.

In an embodiment, when the action space information is executed at a first time step, the control unit may determine transition probability information for the change from the state space information for the first time step to the state space information for a second time step, determine a reward according to the reinforcement learning model based on the transition probability information, and determine the at least one traffic analyzer, the at least one sampling point, and the sampling rate-related information by applying the reward to the reinforcement learning model.

Specific details for achieving the above objects will become clear with reference to the embodiments described in detail below together with the accompanying drawings.

However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided so that the disclosure of the present invention is complete and fully conveys the scope of the invention to a person of ordinary skill in the art to which the present invention pertains (hereinafter, "a person skilled in the art").

According to an embodiment of the present invention, determining the traffic analyzer, the sampling point, and the sampling rate using a DDPG model can increase the probability of sampling a malicious flow during traffic sampling and reduce the processing overhead of the SDN.

The effects of the present invention are not limited to those described above, and potential effects expected from the technical features of the present invention will be clearly understood from the description below.

FIG. 1 is a diagram illustrating an SDN-supported network according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a DDPG model according to an embodiment of the present invention.
FIG. 3A is a graph of reward performance according to an embodiment of the present invention.
FIG. 3B is a graph of the probability of sampling a malicious flow according to an embodiment of the present invention.
FIG. 3C is a graph of the load-balancing performance of multiple traffic analyzers according to an embodiment of the present invention.
FIG. 3D is a graph of the processing overhead of the SDN according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a method of operating an SDN controller according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating a functional configuration of an SDN controller according to an embodiment of the present invention.

Because the present invention can be modified in various ways and can have various embodiments, specific embodiments are illustrated in the drawings and described in detail.

The various features of the invention disclosed in the claims may be better understood in view of the drawings and the detailed description. The apparatus, methods, preparations, and various embodiments disclosed in the specification are provided for illustration. The disclosed structural and functional features are intended to enable a person skilled in the art to practice the various embodiments and are not intended to limit the scope of the invention. The disclosed terms and sentences are intended to explain the various features of the disclosed invention in an easily understood way and are not intended to limit the scope of the invention.

In describing the present invention, a detailed description of related known technology is omitted where it is judged that it could unnecessarily obscure the gist of the present invention.

Hereinafter, a method and apparatus for determining sampling points and sampling rates for multiple traffic analyzers using reinforcement learning in an SDN-supported network according to an embodiment of the present invention will be described.

FIG. 1 is a diagram illustrating an SDN-supported network 100 according to an embodiment of the present invention.

Referring to FIG. 1, the SDN-supported network 100 may include an SDN controller 110, at least one traffic analyzer 120, and at least one switch 130.

Because a traffic analyzer 120, such as an intrusion detection system (IDS), has a fixed location and a limited capacity to perform deep packet inspection (DPI) on a huge amount of network traffic, sampling network traffic and steering the sampled traffic to the traffic analyzer 120 are important for cybersecurity.

In the SDN-supported network 100, which separates the control plane from the data plane to provide flexibility, elasticity, and programmability, the SDN controller 110 can probabilistically capture data packets at the switches 130 and steer the sampled traffic to any destination, such as the traffic analyzer 120.

For example, the switch 130 may be implemented as an OpenFlow (OF)-enabled switch.

However, because potentially useful information may be lost in the discarded traffic, the sampling points and sampling rates for traffic monitoring need to be determined. Here, a sampling point means at least one switch, among the plurality of switches 130, that performs traffic sampling.

Therefore, according to the present invention, the sampling points and sampling rates for the multiple traffic analyzers 120 in the SDN-supported network 100 can be determined based on a deep reinforcement learning (DRL) algorithm.

The DRL algorithm according to the present invention can learn a sampling resource allocation policy under uncertainty in the flow distribution, based on the inspection results of the sampled traffic obtained from the multiple traffic analyzers 120.

The SDN controller 110 can provide flexibility, elasticity, and programmability by separating the control plane from the data plane.

The problem of determining the sampling points and sampling rates for the multiple traffic analyzers 120 in the SDN-supported network 100 can be formulated as a Markov decision process (MDP). The MDP can account for the uncertainty of network flows by maximizing a long-term objective function in order to improve the monitoring performance for network flows. In this case, the current decision is made in consideration of future rewards.

A DDPG-based DRL algorithm can then be used to automatically determine the sampling rates according to the inspection results obtained from the multiple traffic analyzers 120.

In an embodiment, a per-sampling-point sampling approach may be used, in which a sampling rate is configured for each sampling point rather than for each flow. The per-sampling-point approach may be chosen because it is more scalable.

As a network grows, the number of flows generally increases much faster than the number of sampling points.

In addition, sampling at specific switches 130, that is, at selected sampling points, instead of at all switches 130 can reduce the processing overhead that the SDN controller 110 incurs in updating the flow tables of all switches 130.

Because it is impossible to know which malicious flows will occur in a real network environment, the sampling locations need to be chosen so that as many flows as possible are sampled.

In an embodiment, in the sampling process, traffic is sampled at each sampling point and the sampled traffic is forwarded to the multiple traffic analyzers 120.

In an embodiment, in the inspection process, the traffic analyzer 120 inspects the forwarded sampled traffic and transmits to the SDN controller 110 an inspection result that includes the number of non-redundant sampled flows.

In an embodiment, in the reconfiguration process, the SDN controller 110 adjusts the sampling rates of the sampling points according to the sampling policy determined using the DRL model of the SDN controller 110.

The entire process may be performed periodically over time.
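The three processes above (sampling, inspection, reconfiguration) can be sketched as one iteration of a periodic loop. This is only an illustrative sketch: the function `run_sampling_cycle`, the callbacks `sample` and `decide`, and the identifiers `s1`, `s2`, `f1` to `f3` are hypothetical names, and the stand-in policy is not the DRL model of the invention.

```python
from typing import Callable, Dict, List


def run_sampling_cycle(
    rates: Dict[str, float],
    sample: Callable[[str, float], List[str]],
    decide: Callable[[int], Dict[str, float]],
) -> Dict[str, float]:
    """Run one period: sampling, inspection, then reconfiguration."""
    # Sampling process: every sampling point forwards the IDs of the flows it sampled.
    forwarded = [flow for point, rate in rates.items() for flow in sample(point, rate)]
    # Inspection process: the analyzer reports how many distinct flows it saw.
    non_redundant_flows = len(set(forwarded))
    # Reconfiguration process: the policy turns the inspection result into new rates.
    return decide(non_redundant_flows)


# Hypothetical stand-ins for the switches and for the learned policy.
def demo_sample(point: str, rate: float) -> List[str]:
    return {"s1": ["f1", "f2"], "s2": ["f2", "f3"]}[point]


def demo_decide(non_redundant_flows: int) -> Dict[str, float]:
    return {"s1": 10.0 * non_redundant_flows, "s2": 5.0}


new_rates = run_sampling_cycle({"s1": 10.0, "s2": 5.0}, demo_sample, demo_decide)
```

Flow `f2` is seen at both sampling points but counted once, which is why the inspection result is the number of non-redundant flows rather than the number of samples.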

In terms of the system model, when the SDN-supported network 100 contains n switches 130 and l traffic analyzers 120, the sets of switches 130 and traffic analyzers 120 may be denoted O = {o_1, o_2, ..., o_n} and D = {d_1, d_2, ..., d_l}, respectively.

The SDN controller 110 may periodically select m sampling points from among the n switches 130 and assign sampling rates to them. Here, T denotes the sampling resource allocation period.

In an embodiment, the processing capacity of traffic analyzer d_k (120) is denoted c_k (packets per T), and the total capacity C of the l traffic analyzers 120 may be calculated as the sum of the individual capacities, C = c_1 + c_2 + ... + c_l (Equation 1).

In an embodiment, two flow sets are defined over the period T: the total set of flows in the network, and the total set of non-redundant sampled flows in the network. The former may change dynamically because of the uncertain nature of network flows. The latter may be determined periodically by the sampling resource allocation policy.

In an embodiment, to determine the sampling points and sampling rates, the ordered m-tuple of sampling points during T may be denoted P = (p_1, p_2, ..., p_m), where each sampling point p_j is one of the n switches 130.

Once the sampling points are determined, the volume of traffic sampled at each sampling point may be determined according to the order of the m-tuple.

The sampling rate may decrease from one sampling point to the next by r, where r is a small positive number between 0 and 1.

The number of sampling points m and the sampling rate reduction ratio r may vary depending on the network state.

The sampling rate vector may then be denoted x = (x_1, x_2, ..., x_m), where x_j is the sampling rate at sampling point p_j, in units of packets per T.

Because the total amount of sampled traffic must not exceed the total processing capacity of the traffic analyzers 120, the relationship between C and x is constrained as in Equation 2: x_1 + x_2 + ... + x_m ≤ C.

In addition, in an embodiment, the amount of each sampling rate x_j may be expressed as in Equation 3.
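The capacity constraint on the sampling rates can be illustrated with a short Python sketch. The total capacity and the constraint check follow the text directly. The rule in `geometric_rates` is an assumption: Equation 3 is not reproduced in this text, so the sketch uses one possible reading in which each rate is (1 - r) times the previous one in the tuple order and the rates are scaled to use the full capacity. All function names are hypothetical.

```python
from typing import List


def total_capacity(capacities: List[float]) -> float:
    """Total capacity C of the analyzers: the sum of the c_k (packets per T)."""
    return sum(capacities)


def geometric_rates(capacity: float, m: int, r: float) -> List[float]:
    """ASSUMED rule, not the patent's Equation 3: x_{j+1} = (1 - r) * x_j,
    scaled so that the m rates add up to the total capacity."""
    if m < 1 or not 0.0 < r < 1.0:
        raise ValueError("need m >= 1 and 0 < r < 1")
    weights = [(1.0 - r) ** j for j in range(m)]
    scale = capacity / sum(weights)
    return [scale * w for w in weights]


def within_capacity(rates: List[float], capacity: float, tol: float = 1e-9) -> bool:
    """Constraint from the text: the sum of the x_j must not exceed C."""
    return sum(rates) <= capacity + tol


C = total_capacity([100.0, 200.0])      # two analyzers, 300 packets per T in total
x = geometric_rates(C, m=3, r=0.5)      # first point in the tuple gets the largest rate
ok = within_capacity(x, C)
```

Under this assumed rule the first sampling point in the ordered tuple always receives the largest rate, which matches the statement that the order of the m-tuple determines the sampled volume.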

In an embodiment, the resource utilization u_k of traffic analyzer d_k during T may be obtained according to the traffic analyzer selection G, as in Equation 4. Here, G denotes the set of groups used to select the traffic analyzers 120, in which the m sampling points are clustered into l groups.

Here, w_j is the sampling cost of p_j during T. If x_j is less than or equal to the data rate of p_j during T, w_j equals x_j; otherwise, w_j equals the data rate of p_j during T.
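The definition of the sampling cost w_j amounts to taking the smaller of the configured rate and the actual data rate, which a short Python sketch makes concrete. The `utilization` function is an assumption: Equation 4 is not reproduced in this text, so the sketch takes u_k to be the summed cost of the sampling points grouped onto analyzer d_k divided by its capacity c_k. The function names and the index-list representation of the groups are hypothetical.

```python
from typing import List


def sampling_cost(x_j: float, data_rate_j: float) -> float:
    """w_j as described in the text: x_j if it does not exceed the data rate
    of p_j during T, otherwise the data rate itself."""
    return min(x_j, data_rate_j)


def utilization(
    costs: List[float], groups: List[List[int]], capacities: List[float]
) -> List[float]:
    """ASSUMED form of u_k: load assigned to analyzer k divided by c_k.
    groups[k] lists the indices j of the sampling points sent to analyzer k."""
    return [
        sum(costs[j] for j in groups[k]) / capacities[k]
        for k in range(len(capacities))
    ]


# Three sampling points: configured rate vs. observed data rate (packets per T).
w = [sampling_cost(100.0, 80.0), sampling_cost(50.0, 70.0), sampling_cost(30.0, 30.0)]
u = utilization(w, groups=[[0], [1, 2]], capacities=[100.0, 200.0])
```

The cap matters because a switch cannot contribute more sampled packets than it actually carries, so a generous x_j on a quiet link loads the analyzer less than the configured rate suggests.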

After the sampling points and the traffic analyzer selection are determined, the sampled traffic is steered to the selected traffic analyzers 120, which incurs overhead in the SDN-supported network 100.

V = {v_1, v_2, ..., v_l} is the set of steering overheads of the traffic analyzers, and v_k may be expressed as in Equation 5.

Here, the hop count in Equation 5 is the number of hops from the j-th sampling point to traffic analyzer d_k.
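A small Python sketch shows how a per-analyzer steering overhead could be computed from hop counts. Equation 5 is not reproduced in this text, so the form used here is an assumption: each sampling point contributes its sampling cost w_j multiplied by the number of hops to its analyzer. The function name, the `hops[j][k]` matrix layout, and the sample values are hypothetical.

```python
from typing import List


def steering_overhead(
    costs: List[float], hops: List[List[int]], groups: List[List[int]]
) -> List[float]:
    """ASSUMED form of v_k: sum over the sampling points j assigned to
    analyzer k of (sampling cost w_j) * (hops from point j to analyzer k)."""
    return [
        sum(costs[j] * hops[j][k] for j in groups[k])
        for k in range(len(groups))
    ]


w = [80.0, 50.0, 30.0]                 # sampling costs of three sampling points
hops = [[1, 3], [2, 1], [4, 2]]        # hops[j][k]: point j to analyzer k
v = steering_overhead(w, hops, groups=[[0], [1, 2]])
```

With this reading, assigning a heavily sampled point to a distant analyzer is penalized more than assigning a lightly sampled one, which is the trade-off the steering overhead is meant to capture.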

The SDN-supported network 100 operates on a discrete-time basis in units of time steps, and a better sampling policy can be learned gradually by applying reinforcement learning over successive time steps in which future information is uncertain.

The length of a time step is one unit, the time it takes for a single state change. According to the present invention, the time step is set to the sampling period T, and two uncertainties in future information are considered: the occurrence of network flows and the data rates of the flows. Information about these two uncertainties is available from the initial state up to the current state (t).

Therefore, the system can be modeled as a Markov decision process (MDP) in order to find the optimal policy. In this case, a DDPG model may be used as the optimization method.

In an embodiment, to apply the DDPG model to the SDN-supported network 100 with the multiple traffic analyzers 120 and the SDN controller 110, the state space, action space, transition probabilities, and reward function of the system may be formulated as an MDP under the assumed uncertainty.

In an embodiment, regarding the state space and action space, the state space of the SDN-supported network 100 may consist of the state of the currently allocated sampling resources and the state of the selected traffic analyzers.

The state s_t denotes the state at time step t and may be expressed, as in Equation 6, as a set of ordered pairs of the m sampling points and the traffic analyzer 120 selected for each p_j.

Here, the state s_t represents the m ordered selected sampling points, each sampling point p_j being paired with its traffic analyzer d_k.

In addition, in an embodiment, the action space of the SDN-supported network 100 may consist of three actions.

The first action determines the sampling point, the second determines the traffic analyzer to which each sampling point is assigned, and the third determines how much the sampling rate decreases (r). The set of actions a_t at time step t may be expressed as in Equation 7.

Here, the action a_t selects one of the n switches 130, selects one of the traffic analyzers 120 for the selected switch 130, and selects the sampling rate reduction ratio r at time step t.

When the action is performed in state s_t, the switch 130 selected by action a_t becomes the first element of the next state s_{t+1}.

Because the sampling rate vector x is determined by the order of the sampling points, the switch 130 selected by action a_t has the largest sampling rate among the sampling points. The sampling rates of the sampling points also change according to the selected sampling rate reduction ratio.
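The state and action described above can be sketched as plain Python tuples. The text only fixes that the switch chosen by a_t becomes the first element of the next state. What happens to the remaining elements is an assumption of this sketch: an earlier entry for the same switch is dropped, the others shift back, and the tuple is truncated to m entries. The names `State`, `Action`, `next_state`, and the identifiers `o1` to `o4`, `d1`, `d2` are hypothetical.

```python
from typing import Tuple

# State: m ordered (sampling point, traffic analyzer) pairs.
State = Tuple[Tuple[str, str], ...]
# Action: (selected switch, selected analyzer, rate reduction ratio r).
Action = Tuple[str, str, float]


def next_state(state: State, action: Action) -> State:
    """Put the selected (switch, analyzer) pair first, as stated in the text.
    ASSUMED: drop any earlier entry for that switch, shift the rest back,
    and keep exactly m pairs."""
    switch, analyzer, _r = action
    m = len(state)
    rest = tuple(pair for pair in state if pair[0] != switch)
    return ((switch, analyzer),) + rest[: m - 1]


s_t: State = (("o1", "d1"), ("o2", "d2"), ("o3", "d1"))
s_new = next_state(s_t, ("o4", "d2", 0.1))    # a new switch enters at the front
s_moved = next_state(s_t, ("o3", "d2", 0.1))  # an existing switch moves to the front
```

Because the sampling rates follow the tuple order, moving a switch to the front is what gives it the largest sampling rate in the next period.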

In an embodiment, regarding the transition probability and the reward function, the transition probability of the SDN-supported network 100 is the probability of moving from state s_t to state s_{t+1} when action a_t is performed.

Because the SDN controller 110 can dynamically change the sampling rate of each switch 130, the transition probability is determined by the probability, defined by the policy, of selecting the action when the state is s_t.

The overall reward of the SDN-enabled network 100 may be designed to achieve three main goals: (1) fair-share flow monitoring, (2) load balancing across the multiple traffic analyzers 120, and (3) reducing the overhead of steering sampled traffic to the multiple traffic analyzers (sampled traffic steering overhead).

In one embodiment, to achieve fair-share flow monitoring, the flow fair-share reward r^f may be expressed as in Equation 8 below.

Here, […] is the total set of flows in the network during the corresponding period, and […] may be determined periodically by the sampling resource allocation policy.

In one embodiment, the load-balancing reward r^u of the traffic analyzers 120 is defined for balanced utilization of the traffic analyzers 120, and this fairness index r^u may be expressed as in Equation 9 below.
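To make the idea of a fairness index over analyzer loads concrete, the sketch below computes Jain's fairness index, a widely used measure that equals 1 when all loads are equal and approaches 1/n when a single analyzer carries everything. Using Jain's index here is an assumption; the text only calls r^u a fairness index, and the function name `jain_fairness` is hypothetical.

```python
from typing import Sequence


def jain_fairness(loads: Sequence[float]) -> float:
    """Jain's index: (sum x)^2 / (n * sum x^2), in the range [1/n, 1]."""
    n = len(loads)
    if n == 0:
        raise ValueError("loads must not be empty")
    square_sum = sum(x * x for x in loads)
    if square_sum == 0:
        return 1.0  # all analyzers idle: treated here as perfectly balanced
    total = sum(loads)
    return (total * total) / (n * square_sum)


balanced = jain_fairness([0.5, 0.5, 0.5])    # three equally loaded analyzers
unbalanced = jain_fairness([1.0, 0.0, 0.0])  # one analyzer carries all traffic
```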

In one embodiment, to account for the sampled traffic steering overhead, a normalized average steering overhead penalty r^v is defined as in Equation 10 below.

In one embodiment, R(s_t, a_t) may denote a reward function that returns a cost value indicating whether network flows are sampled fairly while the multiple traffic analyzers 120 are utilized equally and the sampled traffic steering overhead is taken into account. When action a_t is performed in state s_t, the reward function can be expressed as in Equation 11 below.

Here, R(s_t, a_t) lies in the range [-1, 1] and represents the sampling performance of the SDN-enabled network 100. To account for the effect of the current action on future rewards, the total expected discounted reward under a given policy π can be expressed as in Equation 12 below.

Here, 0 ≤ γ ≤ 1 denotes the discount factor, which determines the importance of future rewards from the next time step t+1 through infinitely many time steps.

That is, the total expected discounted reward may be defined as the sum of the instantaneous reward at time step t and the discounted reward at the next time step t+1. For example, γ = 0 means that the SDN-enabled network 100 considers only the current reward, whereas γ = 1 may mean that the system weights long-term rewards over infinitely many time steps equally with the current reward.
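The recursive definition above (current reward plus the discounted reward from the next step) can be sketched for a finite reward sequence as follows. The function name `discounted_return` and the sample rewards are illustrative assumptions, and the finite horizon is a simplification of the infinite sum described in the text.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite sequence."""
    total = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]
only_current = discounted_return(rewards, gamma=0.0)  # only the first reward counts
equal_weight = discounted_return(rewards, gamma=1.0)  # all rewards weighted equally
```

With γ = 0 the result equals the first reward alone, and with γ = 1 it equals the plain sum, which mirrors the two extreme cases described above.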

That is, to improve the sampling performance of the SDN-enabled network 100, an action-selection policy that increases the total expected discounted reward may be determined.

FIG. 2 is a diagram illustrating a DDPG model according to an embodiment of the present invention.

Referring to FIG. 2, a reinforcement learning (RL) algorithm may determine a policy, defined as a sequence of actions, that maximizes the reward accumulated in the future.

In one embodiment, the optimal value function and the optimal policy of the MDP may be determined through the Bellman expectation equation and the Bellman optimality equation. The Bellman equations can basically be solved through dynamic programming, which has evolved into SARSA and Q-learning.

Q-learning, a representative RL algorithm, may be used because it is model-free, operating without prior knowledge of the system's future rewards or transition probabilities, and because it is off-policy, learning about the optimal policy while following a behavior policy.

Q-learning can therefore be used for real-time system operation under uncertainty about future information. In real applications such as network environment management and operation, however, RL algorithms such as Q-learning may face complexity problems because of the enormous state and action spaces of the MDP.

To overcome the curse of dimensionality caused by table-based value updates in Q-learning, a combination of RL and deep learning, known as deep reinforcement learning (DRL), may be used.
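For reference, the table-based update that becomes impractical for large state and action spaces can be sketched as the standard tabular Q-learning rule below. The function name, the dictionary-based table, and the learning rate `alpha` are illustrative assumptions and are not taken from the specification.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, int], float]


def q_learning_update(q: QTable, state: Hashable, action: int, reward: float,
                      next_state: Hashable, actions: Sequence[int],
                      alpha: float = 0.1, gamma: float = 0.9) -> float:
    """Apply one off-policy update: the target uses the greedy next action."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q: QTable = defaultdict(float)
value = q_learning_update(q, "s0", 1, 1.0, "s1", actions=[0, 1], alpha=0.5)
```

The table needs one entry per state-action pair, which is why its size grows quickly as the number of switches, analyzers, and rate levels increases.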

The purpose of DRL may be to learn an optimal policy by using a multilayer perceptron (MLP)-based nonlinear function approximator that represents the probability distribution of the DRL agent's action strategy and can maximize the expected long-term reward.

Deep Q-network (DQN), a common DRL algorithm, may be widely used to handle the complex state spaces that are problematic for Q-learning.

However, because DQN has a discrete action space, it may be difficult to apply in environments that require control over a continuous action space.

To solve this problem, as shown in FIG. 2, a deep deterministic policy gradient (DDPG) model may be used, which applies an actor-critic technique based on the deterministic policy gradient (DPG) algorithm.

An actor-critic method may be used that consists of a critic network 220, which updates the action-value function to maximize the long-term reward, and an actor network 210, which determines the policy according to the critic model.

DPG can determine an optimal deterministic policy using a policy gradient method that optimizes, through gradient ascent, a parameterized policy for a nonlinear function approximator such as an MLP.

The DDPG model 200 is a model-free, off-policy, actor-critic algorithm for solving MDPs in which enormous state and action spaces may exist.

In one embodiment, a sampling policy update algorithm using the DDPG model 200 may be used.

To find an action-selection policy that increases the total expected discounted reward defined in Equation 12 above, the DDPG model 200, which is model-free, off-policy, and actor-critic, may be used.

In one embodiment, an optimal action-selection policy may be determined that increases sampling accuracy through balanced resource utilization of the multiple traffic analyzers 120 and reduces the sampled traffic steering overhead.

To achieve this goal, the action-value function Q(s_t, a_t), which returns the total expected discounted reward when action a_t is taken in state s_t under a given policy π, can be expressed as in Equation 13 below.

The optimal action-value function Q^*(s, a) to be approximated can be expressed as in Equation 14 below.

Q-learning, a commonly used off-policy algorithm, can approximate it using both a behavior policy and a target policy. For the behavior policy, the epsilon-greedy method may be used.
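A minimal sketch of an epsilon-greedy behavior policy over a discrete set of Q values is shown below. The function name `epsilon_greedy` and the optional `rng` argument are illustrative assumptions; the exact expression used in the specification is not reproduced in the text.

```python
import random
from typing import Sequence


def epsilon_greedy(q_values: Sequence[float], epsilon: float, rng=random) -> int:
    """With probability epsilon explore uniformly; otherwise pick the best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)   # always exploits
explored_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=1.0)  # always explores
```

The greedy branch has to compare every Q value, which is feasible for a small discrete action set but not for the continuous action space discussed next.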

However, in a continuous action space, it may be difficult to find the optimal action from the action-value function, because all Q values must be searched and the complexity grows exponentially with the state and action spaces.

In one embodiment, regarding the DDPG model for sampling resource allocation, to handle the continuous action space in DRL, the DDPG model 200 may approximate the optimal action-value function using two neural networks, as shown in FIG. 2.

One is the critic network 220, which approximates the action-value function. Its input may be the action and the observation, and its output may be the value of the state-action pair, that is, Q(s_t, a_t).

The other neural network, called the actor network 210, is used to approximate the policy function; its input may be the observation value and its output may be the action value.

To parameterize the nonlinear function approximators, the DDPG model 200 may use θ^μ for the actor network 210 and θ^Q for the critic network 220.

θ^μ and θ^Q may be updated iteratively according to the policy gradient and the loss function.

The critic network may be optimized as in Equation 15 below by updating its parameters in the direction that minimizes the loss.

Here, […].
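As a hedged illustration of the critic update, the sketch below follows the standard DDPG formulation, in which each sampled transition gets a target equal to its reward plus the discounted target-network value of the next state, and the loss is the mean squared error against the critic's current estimates. Whether Equation 15 has exactly this form is an assumption, and the names `td_targets` and `critic_loss` are hypothetical.

```python
from typing import List, Sequence


def td_targets(rewards: Sequence[float], next_q_values: Sequence[float],
               gamma: float) -> List[float]:
    """Target per transition: reward + gamma * target-critic value of next state."""
    return [r + gamma * q_next for r, q_next in zip(rewards, next_q_values)]


def critic_loss(targets: Sequence[float], q_values: Sequence[float]) -> float:
    """Mean squared error between the targets and the critic's estimates."""
    n = len(targets)
    if n == 0:
        raise ValueError("batch must not be empty")
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / n


targets = td_targets(rewards=[1.0, 0.0], next_q_values=[2.0, 4.0], gamma=0.5)
loss = critic_loss(targets, q_values=[1.0, 3.0])
```

In a full implementation the next-state values would come from the target critic evaluated at the target actor's action, and the loss would be minimized by gradient descent on the critic parameters.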

In one embodiment, the actor network may be optimized as in Equation 16 below by updating its parameters in the direction given by J, the objective function of the policy gradient method.
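For orientation, the sampled deterministic policy gradient commonly used in DDPG is written below, with N the batch size, μ the actor with parameters θ^μ, and Q the critic with parameters θ^Q. This is the standard textbook form; that Equation 16 matches it term by term is an assumption.

```latex
\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i=1}^{N}
  \nabla_{a} Q(s, a \mid \theta^{Q}) \Big|_{s = s_i,\, a = \mu(s_i \mid \theta^{\mu})}
  \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \Big|_{s = s_i}
```

The actor parameters are moved along this gradient (gradient ascent), so the actor shifts toward actions that the critic currently rates more highly.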

To update the target networks softly, their parameters may be updated as in Equations 17 and 18 below, respectively.

Here, […] denotes the updated critic network.

Here, […] denotes the updated actor network. In addition, […] and […] are small positive numbers less than 1.
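A soft target update of the kind described above is usually implemented as a weighted average of the online and target parameters. The sketch below assumes that standard form with a single coefficient `tau`; the function name `soft_update` and the flat list representation of the weights are illustrative assumptions.

```python
from typing import List, Sequence


def soft_update(target: Sequence[float], online: Sequence[float],
                tau: float) -> List[float]:
    """Return tau * online + (1 - tau) * target for each parameter."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau is expected to be a small positive number below 1")
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


target_weights = [0.0, 1.0]
online_weights = [1.0, 1.0]
target_weights = soft_update(target_weights, online_weights, tau=0.1)
```

Because the coefficient is small, the target networks trail the online networks slowly, which tends to stabilize the targets used in the critic loss.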

In one embodiment, the DDPG model-based sampling policy update algorithm may be expressed as shown in Table 1 below.

// Initialization
Set critic network Q(s, a | θ^Q) and actor network μ(s | θ^μ) with weights θ^Q and θ^μ
Set target networks Q' and μ' with weights θ^Q' and θ^μ'
Initialize experience replay buffer B
Set initial state s_0 according to the initial sampling policy
for each time step do
    Select action a_t following the parameter noise for exploration
    Take action a_t and observe R(s_t, a_t), s_{t+1}
    Store transition {s_t, a_t, R(s_t, a_t), s_{t+1}} in B
    Randomly sample a batch of N transitions {s_i, a_i, R(s_i, a_i), s_{i+1}} from B
    Set […]
    Update the critic network by minimizing the loss in (15)
    Update the actor network using the sampled policy gradient in (16)
    Update the target networks softly as in (17) and (18)
end for
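The experience replay buffer B used in the algorithm can be sketched as a bounded queue with uniform random sampling. The class name `ReplayBuffer`, the capacity handling, and the tuple layout are illustrative assumptions; the specification only states that transitions are stored in B and sampled in batches of N.

```python
import random
from collections import deque
from typing import Any, List, Tuple

Transition = Tuple[Any, Any, float, Any]  # (state, action, reward, next_state)


class ReplayBuffer:
    def __init__(self, capacity: int) -> None:
        # The oldest transitions are discarded once capacity is reached.
        self._items: deque = deque(maxlen=capacity)

    def store(self, state: Any, action: Any, reward: float, next_state: Any) -> None:
        self._items.append((state, action, reward, next_state))

    def sample(self, batch_size: int, rng=random) -> List[Transition]:
        """Draw up to batch_size distinct transitions uniformly at random."""
        k = min(batch_size, len(self._items))
        return rng.sample(list(self._items), k)

    def __len__(self) -> int:
        return len(self._items)


buffer = ReplayBuffer(capacity=2)
buffer.store("s0", 0, 0.1, "s1")
buffer.store("s1", 1, 0.2, "s2")
buffer.store("s2", 0, 0.3, "s3")  # evicts the oldest transition
batch = buffer.sample(batch_size=5)
```

Sampling at random from stored transitions breaks the correlation between consecutive time steps, which is the usual motivation for a replay buffer in off-policy methods such as DDPG.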

FIG. 3A is a diagram illustrating a reward performance graph according to an embodiment of the present invention.

Referring to FIG. 3A, a fat-tree topology with 10 switches 130 and 500 flows may be considered, and the data rate of each flow may be set between 1 and 20 Mbps. To account for the uncertainty of network flows, the routing and data rate of the flows may be changed randomly, and 2% of all flows are malicious.

The number of traffic analyzers 120 may be set to three, each with the same processing capacity of 1 Gbps. In the simulation, the sampling algorithm according to the present invention may operate in discrete time, with the time step set to 1 second.

The initial state s_0 may be chosen arbitrarily from the state space S. The performance of the sampling algorithm according to the present invention may be compared with that of conventional random sampling and flow betweenness centrality (FBC) sampling.

Conventional random sampling randomly selects the sampling points, sampling rates, and traffic analyzers at each time step.

Conventional FBC sampling selects sampling points among the switches based on a centrality measure from graph theory, and determines the sampling rate of each sampling switch by rate-proportional sampling, which samples traffic in proportion to the data rate of each flow.

To select traffic analyzers, conventional FBC sampling performs greedy selection according to the sampling rates of the selected sampling points. That is, a sampling switch with a higher sampling rate selects a traffic analyzer with more available resources.
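The greedy analyzer selection described for the FBC baseline can be sketched as follows: switches are processed in decreasing order of sampled traffic, and each one is assigned to the analyzer with the most remaining capacity. The function name `greedy_assign`, the Mbps units, and the example figures are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def greedy_assign(sampled_mbps: Dict[str, float],
                  capacities_mbps: Sequence[float]) -> Dict[str, int]:
    """Map each sampling switch to the analyzer index with most free capacity."""
    remaining: List[float] = list(capacities_mbps)
    assignment: Dict[str, int] = {}
    # Switches with the highest sampled rate choose first.
    for switch in sorted(sampled_mbps, key=sampled_mbps.get, reverse=True):
        best = max(range(len(remaining)), key=lambda k: remaining[k])
        assignment[switch] = best
        remaining[best] -= sampled_mbps[switch]
    return assignment


assignment = greedy_assign({"s1": 600.0, "s2": 300.0, "s3": 200.0},
                           capacities_mbps=[1000.0, 1000.0])
```

In this example the busiest switch takes the first analyzer and the two lighter switches share the second, because after each assignment the analyzer with more free capacity is chosen.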

According to the present invention, the number of sampling points is set to five, and all sampling methods perform sampling in each iteration under the condition that the aggregate amount of sampled traffic equals the IDS capacity.

Thus, the total rate of sampled data traffic can be kept within the IDS capacity.
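One simple way to enforce such a capacity condition is to scale all sampling rates by a common factor whenever the aggregate sampled traffic would exceed the analyzer capacity. The sketch below assumes this proportional scaling rule, which is not stated in the specification; the function name `cap_sampling_rates` and the example numbers are hypothetical.

```python
from typing import List, Sequence


def cap_sampling_rates(traffic_mbps: Sequence[float],
                       sampling_rates: Sequence[float],
                       capacity_mbps: float) -> List[float]:
    """Scale sampling rates down so that sampled traffic fits the capacity."""
    sampled = sum(t * x for t, x in zip(traffic_mbps, sampling_rates))
    if sampled <= capacity_mbps:
        return list(sampling_rates)  # already within capacity
    factor = capacity_mbps / sampled
    return [x * factor for x in sampling_rates]


# Two sampling points carrying 2000 Mbps each, sampled at 50%,
# would produce 2000 Mbps for an analyzer that can process 1000 Mbps.
rates = cap_sampling_rates([2000.0, 2000.0], [0.5, 0.5], capacity_mbps=1000.0)
```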

Referring to FIG. 3A, the reward can be observed as a function of the number of time steps. Using the DDPG model 200, it can be evaluated how the sampling algorithm according to the present invention improves performance in terms of reward.

FIG. 3B is a diagram illustrating a graph of the probability of sampling malicious flows according to an embodiment of the present invention. FIG. 3C is a diagram illustrating a graph of the load balancing performance of multiple traffic analyzers according to an embodiment of the present invention. FIG. 3D is a diagram illustrating a graph of the processing overhead performance of the SDN according to an embodiment of the present invention.

Referring to FIG. 3B, as the ratio of switches excluded when routing malicious flows increases, the proposed method shows a lower probability of failing to sample a malicious flow than the other methods.

For example, when this ratio is 0.1, malicious flows are routed through the switches other than those in the top 10% by centrality measure. In other words, the proposed method monitors malicious flows that pass through switches without high centrality values better than the other methods do.

FIGS. 3C and 3D show that, when there are two or more traffic analyzers, the proposed method can reduce the overhead of steering sampled traffic to the multiple traffic analyzers compared with the other methods, while maintaining a balanced load across the traffic analyzers.

It can be seen that the algorithm using the DDPG model 200 according to the present invention increases the reward more than the conventional methods as the number of time steps increases. This may be because the algorithm according to the present invention can select better actions as it experiences more iterations.

According to the present invention, network traffic monitoring can play an important role in cybersecurity for detecting network anomalies. Because monitoring the entire, massive volume of network traffic is impossible given the limited capacity of traffic analyzers, it is important to sample traffic so that it can be monitored selectively. Using the SDN controller 110, network traffic sampling and adjustment may be enabled for multiple traffic analyzers 120, such as IDSs performing DPI.

According to the present invention, the sampling points and sampling rates for the multiple traffic analyzers 120 can be determined under network flow uncertainty. Determining the sampling points and rates for the multiple traffic analyzers 120 may be formulated as a discrete-time MDP in which the uncertainty is the dynamic nature of network flows.

A deep reinforcement learning approach may be applied using the available real-time monitoring results from the traffic analyzers.

The algorithm according to the present invention may correct for network flow uncertainty when making sampling rate decisions.

FIG. 4 is a diagram illustrating an operation method of the SDN controller 110 according to an embodiment of the present invention.

Referring to FIG. 4, in step S401, an inspection result of the sampled traffic for each sampling point 130 is received from the traffic analyzer 120.

In one embodiment, the traffic sampled by each sampling point 130 may be forwarded to the traffic analyzer 120.

In step S403, at least one traffic analyzer 120, at least one sampling point 130, and sampling rate-related information are determined based on the inspection result of the sampled traffic.

In one embodiment, a reward for the inspection result of the sampled traffic may be calculated, and the reward may be applied to a reinforcement learning model to determine the at least one traffic analyzer 120, the at least one sampling point 130, and the sampling rate-related information.

Specifically, in one embodiment, a sampling policy may be determined by applying the reward to the reinforcement learning model, and, according to the sampling policy, the at least one traffic analyzer 120, the at least one sampling point 130, and the sampling rate-related information that maximize the reward may be determined.

In one embodiment, the reinforcement learning model may include a deep deterministic policy gradient (DDPG) model.

In this case, in one embodiment, state space information on the at least one traffic analyzer 120 and the at least one sampling point 130 may be determined based on the inspection result of the sampled traffic.

Thereafter, action space information on the at least one traffic analyzer 120, the at least one sampling point 130, and the sampling rate-related information may be determined based on the state space information.

In addition, in one embodiment, when the action space information is executed at a first time step, transition probability information for the change from the state space information for the first time step to the state space information for a second time step may be determined.

In addition, a reward according to the reinforcement learning model may be determined based on the transition probability information, and the reward may be applied to the reinforcement learning model to determine the at least one traffic analyzer 120, the at least one sampling point 130, and the sampling rate-related information.

FIG. 5 is a diagram illustrating a functional configuration of the SDN controller 110 according to an embodiment of the present invention. In one embodiment, the SDN controller 110 may be referred to as an SDN, an SDN device, or another term having an equivalent technical meaning.

Referring to FIG. 5, the SDN controller 110 may include a communication unit 510, a control unit 520, and a storage unit 530.

The communication unit 510 may receive, from the traffic analyzer 120, an inspection result of the sampled traffic for each sampling point 130.

In one embodiment, the communication unit 510 may include at least one of a wired communication module and a wireless communication module. All or part of the communication unit 510 may be referred to as a 'transmitter', a 'receiver', or a 'transceiver'.

The control unit 520 may determine at least one traffic analyzer 120, at least one sampling point 130, and sampling rate-related information based on the inspection result of the sampled traffic.

In one embodiment, the control unit 520 may be implemented as a centralized SDN controller that remotely controls the sampling points 130 (e.g., switches or routers) through the control plane using the OpenFlow (OF) protocol.

In one embodiment, the control unit 520 may include at least one processor or microprocessor, or may be part of a processor. The control unit 520 may also be referred to as a communication processor (CP). The control unit 520 may control the operation of the SDN controller 110 according to various embodiments of the present invention.

The storage unit 530 may store the inspection result of the sampled traffic. In one embodiment, the storage unit 530 may store the state space information, action space information, transition probability information, and rewards for the at least one traffic analyzer 120, the at least one sampling point 130, and the sampling rate-related information.

In one embodiment, the storage unit 530 may be configured as volatile memory, non-volatile memory, or a combination of the two. The storage unit 530 may provide stored data at the request of the control unit 520.

Referring to FIG. 5, the SDN controller 110 may include the communication unit 510, the control unit 520, and the storage unit 530. In various embodiments of the present invention, the components illustrated in FIG. 5 are not essential, so the SDN controller 110 may be implemented with more or fewer components than those illustrated in FIG. 5.

The above description merely illustrates the technical idea of the present invention, and those skilled in the art will be able to make various changes and modifications without departing from the essential characteristics of the present invention.

The various embodiments disclosed herein may be performed in any order, and may be performed simultaneously or separately.

In one embodiment, at least one step may be omitted from or added to each figure described herein, and the steps may be performed in reverse order or simultaneously.

The embodiments disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of the present invention is not limited by these embodiments.

The scope of protection of the present invention should be interpreted according to the claims, and all technical ideas within a scope equivalent thereto should be understood to fall within the scope of rights of the present invention.

100: SDN-enabled network
110: SDN controller
120: traffic analyzer
130: switch
200: DDPG model
210: actor network
220: critic network
510: communication unit
520: control unit
530: storage unit

Claims

1. A method of operating an SDN using reinforcement learning, the method comprising:
(a) receiving, from a traffic analyzer, an inspection result of sampled traffic for each sampling point; and
(b) determining at least one traffic analyzer, at least one sampling point, and sampling rate-related information based on the inspection result of the sampled traffic.

2. The method of claim 1,
wherein step (b) comprises:
calculating a reward for the inspection result of the sampled traffic; and
determining the at least one traffic analyzer, the at least one sampling point, and the sampling rate-related information by applying the reward to a reinforcement learning model.

3. The method of claim 2,
wherein step (b) comprises:
determining a sampling policy by applying the reward to the reinforcement learning model; and
determining, according to the sampling policy, the at least one traffic analyzer, the at least one sampling point, and the sampling rate-related information for maximizing the reward.

4. The method of claim 1,
wherein traffic sampled by the sampling point is forwarded to the traffic analyzer.

5. The method of claim 2,
wherein the reinforcement learning model includes a deep deterministic policy gradient (DDPG) model.

6. The method of claim 5,
wherein step (b) comprises:
determining state space information on the at least one traffic analyzer and the at least one sampling point based on the inspection result of the sampled traffic; and
determining action space information on the at least one traffic analyzer, the at least one sampling point, and the sampling rate-related information based on the state space information.

7. The method of claim 6,
wherein step (b) comprises:
determining, when the action space information is executed at a first time step, transition probability information for a change from the state space information at the first time step to the state space information at a second time step;
determining a reward according to the reinforcement learning model based on the transition probability information; and
determining the at least one traffic analyzer, the at least one sampling point, and the sampling rate-related information by applying the reward to the reinforcement learning model.
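
The state, action, reward, and next-state sequence in the claim above matches the transition tuple that DDPG stores in a replay buffer. Because DDPG is model-free, an implementation would typically record observed transitions rather than compute the transition probabilities explicitly. The following is a minimal sketch under that assumption; the class name, capacity, and example values are illustrative only.

```python
import random
from collections import deque, namedtuple

# One observed step: state at time t, action taken, reward received, state at time t+1.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity, seed=None):
        self._items = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def push(self, state, action, reward, next_state):
        self._items.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a random minibatch without replacement."""
        k = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self):
        return len(self._items)


buffer = ReplayBuffer(capacity=2, seed=0)
buffer.push([0.5, 0.0], [0.3, 1.0], 0.6, [0.4, 0.1])
buffer.push([0.4, 0.1], [0.5, 0.5], 0.2, [0.3, 0.2])
buffer.push([0.3, 0.2], [0.7, 0.2], 0.4, [0.2, 0.2])  # evicts the first transition
batch = buffer.sample(2)
```

Sampling minibatches at random from the buffer reduces the correlation between consecutive time steps during training.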

8. An SDN device using reinforcement learning, comprising:
a communication unit configured to receive, from a traffic analyzer, an inspection result of sampled traffic for each sampling point; and
a control unit configured to determine at least one traffic analyzer, at least one sampling point, and sampling rate-related information based on the inspection result of the sampled traffic.

9. The device of claim 8,
wherein the control unit is configured to:
calculate a reward for the inspection result of the sampled traffic; and
determine the at least one traffic analyzer, the at least one sampling point, and the sampling rate-related information by applying the reward to a reinforcement learning model.

10. The device of claim 9,
wherein the control unit is configured to:
determine a sampling policy by applying the reward to the reinforcement learning model; and
determine, according to the sampling policy, the at least one traffic analyzer, the at least one sampling point, and the sampling rate-related information that maximize the reward.

11. The device of claim 8,
wherein traffic sampled at the sampling point is forwarded to the traffic analyzer.

12. The device of claim 9,
wherein the reinforcement learning model includes a DDPG (Deep Deterministic Policy Gradient) model.

13. The device of claim 12,
wherein the control unit is configured to:
determine state space information for the at least one traffic analyzer and the at least one sampling point based on the inspection result of the sampled traffic; and
determine, based on the state space information, action space information for the at least one traffic analyzer, the at least one sampling point, and the sampling rate-related information.

14. The device of claim 13,
wherein the control unit is configured to:
determine, when the action space information is executed at a first time step, transition probability information for a change from the state space information at the first time step to the state space information at a second time step;
determine a reward according to the reinforcement learning model based on the transition probability information; and
determine the at least one traffic analyzer, the at least one sampling point, and the sampling rate-related information by applying the reward to the reinforcement learning model.