KR102670448B1

KR102670448B1 - Routing method and device using deep reinforce learning

Info

Publication number: KR102670448B1
Application number: KR1020220016151A
Authority: KR
Inventors: 이성근; 정설령
Original assignee: 국립순천대학교산학협력단
Priority date: 2021-12-14
Filing date: 2022-02-08
Publication date: 2024-05-29
Also published as: KR20230090200A

Abstract

The disclosed technology relates to a deep reinforcement learning-based routing method and device. The method comprises: receiving, by a control device of a software-defined network, a request for a communication service from a first node among a plurality of nodes in the network; determining, by the control device, a QoS level for the communication service according to the request; and inputting, by the control device, state information on the available nodes among the plurality of nodes into a reinforcement learning model to calculate a plurality of routing paths capable of providing the communication service, and selecting at least one path corresponding to the QoS level from among the plurality of routing paths.

Description

Deep reinforcement learning-based routing method and device {ROUTING METHOD AND DEVICE USING DEEP REINFORCE LEARNING}

The disclosed technology relates to a deep reinforcement learning-based routing method and device that take into account the QoS level of network services.

A conventional software-defined network (SDN) provides network management functions. By giving administrators a centralized view of the network state, it allows network management to be performed flexibly. However, real communication networks change very dynamically and form complex environments, so maintaining a QoS level that meets user requirements demands high computing performance.

Korean Patent Publication No. 10-2020-0002439

The disclosed technology aims to provide a deep reinforcement learning-based routing method and device that take into account the QoS level of network services.

To achieve the above technical objective, a first aspect of the disclosed technology provides a deep reinforcement learning-based routing method comprising: receiving, by a control device of a software-defined network, a request for a communication service from a first node among a plurality of nodes in the network; determining, by the control device, a QoS level for the communication service according to the request; and inputting, by the control device, state information on the available nodes among the plurality of nodes into a reinforcement learning model to calculate a plurality of routing paths capable of providing the communication service, and selecting at least one path corresponding to the QoS level from among the plurality of routing paths.

To achieve the above technical objective, a second aspect of the disclosed technology provides a deep reinforcement learning-based routing device comprising: a communication device that receives a request for a communication service from a first node among a plurality of nodes in a network; a storage device that stores a reinforcement learning model trained to calculate routing paths using the plurality of nodes in the network; and a computing device that determines a QoS level for the communication service according to the request, inputs state information on the available nodes among the plurality of nodes into the reinforcement learning model to calculate a plurality of routing paths capable of providing the communication service, and selects at least one path corresponding to the QoS level from among the plurality of routing paths.

Embodiments of the disclosed technology may have effects including the following advantages. However, this does not mean that every embodiment must include all of them, so the scope of rights of the disclosed technology should not be understood as being limited thereby.

The deep reinforcement learning-based routing method and device according to an embodiment of the disclosed technology distribute the load that would otherwise concentrate on a specific routing path by using, among the available routing paths, a routing path that satisfies the requested service.

In addition, grading the QoS of communication services into discrete levels reduces the complexity of the network.

Figure 1 is a diagram showing a deep reinforcement learning-based routing process according to an embodiment of the disclosed technology.
Figure 2 is a flowchart of a deep reinforcement learning-based routing method according to an embodiment of the disclosed technology.
Figure 3 is a block diagram of a deep reinforcement learning-based routing device according to an embodiment of the disclosed technology.
Figure 4 is a diagram showing the learning performance of a reinforcement learning model according to the disclosed technology.

Since the present invention can be modified in various ways and can have various embodiments, specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all changes, equivalents, and substitutes that fall within its spirit and technical scope.

Terms such as first, second, A, and B may be used to describe various components, but the components are not limited by these terms; the terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be named a second component, and similarly a second component may be named a first component. The term "and/or" includes a combination of a plurality of related listed items or any one of the plurality of related listed items.

As used in this specification, singular expressions should be understood to include plural expressions unless the context clearly indicates otherwise. Terms such as "include" indicate the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood as not excluding the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Before the drawings are described in detail, it should be made clear that the components in this specification are divided merely according to the main function each component is responsible for. That is, two or more of the components described below may be combined into a single component, or a single component may be divided into two or more components according to more finely subdivided functions.

In addition to its own main function, each of the components described below may also perform some or all of the functions of other components, and some of the main functions of each component may be carried out exclusively by another component. Therefore, the presence or absence of each component described in this specification should be interpreted functionally.

A software-defined network (SDN) is a new networking paradigm that provides on-demand resource allocation, easy reconfiguration, and programmable network management functions. In an SDN, network management functions are logically divided into a control plane and a data plane. By providing a centralized view of the network state, an SDN allows network management and control to be performed more flexibly, consistently, and comprehensively. This makes it possible to improve traffic control and management by dynamically adjusting the bandwidth and paths allocated to user connection services.

The routing module of the SDN control plane collects network state information in real time. Based on this information, it provides a path between the two communicating endpoints that satisfies the quality of service (QoS) parameters requested by the user, such as delay, loss, and bandwidth. QoS routing has been studied under a variety of network conditions. Most of these studies are model-based and assume that user demands and the network environment can be modeled adequately. Handling multiple QoS parameters also requires a high level of computing resources. Meanwhile, communication networks are highly dynamic and are evolving toward greater complexity, which makes them difficult to model and control.

Since DeepMind proposed the DQN (Deep Q-Network), deep reinforcement learning (DRL) has been applied in many fields, because it learns from experience, does not require an exact mathematical model, and can solve very complex problems. Traditional reinforcement learning is of limited use in large-scale systems that require complex state and action sets. Deep reinforcement learning overcomes this limitation by combining deep learning with reinforcement learning theory. Because DRL can solve the large-scale, complex problems that arise in communications and networking, it has attracted researchers from many fields. A QoS-aware adaptive greedy online routing algorithm has been proposed by applying reinforcement learning to software-defined networks. Several recent studies have applied various DRL algorithms, such as DDPG (Deep Deterministic Policy Gradient), to routing in communication networks, defining the routing problem as a continuous control problem and considering k-shortest paths. That is, the shortest path is determined using a traffic matrix of continuous traffic flows. These DRL methods consider only the k shortest paths for communication between each source-destination pair. Their performance may therefore be limited, because other paths that could provide better quality of service may exist.

In the present invention, the routing problem is defined as a control problem. A sender wishing to communicate passes to the service provider not only the destination information but also the minimum quality-of-service parameters needed to receive the service smoothly. The quality-of-service parameters are converted into an appropriate service level by a mapping function. The SDN controller holds the current network state information converted, link by link, into the corresponding service level. The DRL agent learns, based on the service level information, to find a communication path between a source-destination pair that satisfies the QoS requirements.

Figure 1 is a diagram showing a deep reinforcement learning-based routing process according to an embodiment of the disclosed technology. Referring to Figure 1, the control device may be the controller of a software-defined network. The control device hosts a reinforcement learning model and can calculate a routing path by inputting state information about the network into the model. Conventionally, the best routing path the network can currently offer is provided for every communication service. If many similar services are requested within a short time, the same routing path, or a large share of the nodes on it, is used repeatedly, and the requested communication services cannot be provided smoothly. To solve this problem, the disclosed technology classifies the QoS level of the communication service requested by a node into stages and provides a routing path for each stage, thereby distributing the load across the network.

As shown in Figure 1, the control device collects network state information in real time. Based on this information, it can provide a routing path between the two communicating endpoints that satisfies the quality of service (QoS) parameters requested by the user, such as delay, loss, and bandwidth. The control device treats the determination of the routing path as a control problem. A sender wishing to communicate passes to the service provider not only the destination information but also the minimum quality-of-service parameters needed to receive the service smoothly. The quality-of-service parameters are converted into an appropriate service level by a mapping function. The control device holds the current network state information converted, link by link, into the corresponding service level. The reinforcement learning model is trained, based on the service level information, to find a communication path between a source-destination pair that satisfies the QoS requirements.

Referring again to Figure 1, the overall system consists of a data layer, made up of a plurality of nodes that request communication services, and an SDN control layer. The plurality of nodes may be an interconnection of users and routers. The data layer forwards data between network devices. The control layer handles communication between the data layer and the application layer, which performs functions such as routing; it also dynamically updates the forwarding rules and allocates and controls network resources. The control layer knows the full state of the current network and uses it to set up paths and control the forwarding logic of each router when a user requests communication. The network can be represented as a bidirectional graph G(V, E), where V is the set of all routers and E is the set of links defined as E = {(i, j) | (i, j) ∈ V × V, i ≠ j}. The state information of each link is managed separately for delay, loss rate, and bandwidth.
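The graph model above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `LinkState` type, the function names, the units, and the sample values are hypothetical and do not come from the patent. It shows one way to hold the link set E and to keep delay, loss rate, and bandwidth in separate |V| × |V| matrices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkState:
    """Per-link measurements (hypothetical field names and units)."""
    delay_ms: float
    loss_rate: float
    bandwidth_mbps: float


def build_link_set(num_routers, link_states):
    """Collect the directed links (i, j) with i != j and both ends in V."""
    edges = set()
    for (i, j) in link_states:
        if i == j:
            raise ValueError("self-loops are excluded from E")
        if not (0 <= i < num_routers and 0 <= j < num_routers):
            raise ValueError("link endpoint outside V")
        edges.add((i, j))
    return edges


def per_parameter_matrices(num_routers, link_states):
    """Keep delay, loss rate and bandwidth in separate |V| x |V| matrices.

    Entries without a link are None.
    """
    def empty():
        return [[None] * num_routers for _ in range(num_routers)]

    delay, loss, bandwidth = empty(), empty(), empty()
    for (i, j), state in link_states.items():
        delay[i][j] = state.delay_ms
        loss[i][j] = state.loss_rate
        bandwidth[i][j] = state.bandwidth_mbps
    return delay, loss, bandwidth


# Illustrative three-router topology; every link exists in both directions.
links = {
    (0, 1): LinkState(5.0, 0.01, 100.0),
    (1, 0): LinkState(5.0, 0.01, 100.0),
    (1, 2): LinkState(20.0, 0.05, 10.0),
    (2, 1): LinkState(20.0, 0.05, 10.0),
}
E = build_link_set(3, links)
delay, loss, bandwidth = per_parameter_matrices(3, links)
```

Keeping one matrix per parameter mirrors the statement that link state is managed separately for delay, loss rate, and bandwidth.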

The QoS-aware routing problem is defined as the problem of determining a path between a source node x and a destination node y that satisfies the QoS requirements of a flow f. In other words, it is the problem of finding a path that maximizes bandwidth while minimizing delay and loss, which is very complex. Mathematical modeling is therefore very difficult, and training is very slow even when the problem is approximated with deep learning. The present invention therefore grades the QoS required by each flow into discrete levels, computes in advance the minimum QoS level that each link in the SDN can currently provide, and replaces the original problem with that of finding a path from the source to the destination using only links that satisfy the QoS level required by flow f. In this environment, a communication service is provided by the following procedure.
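The reduced problem described above (keep only the links that meet the required level, then find a path over them) can be illustrated as follows. This is a sketch, not the patent's method: it uses a plain breadth-first search instead of the learned agent, and it assumes that a larger level number means better service. The function names and sample levels are hypothetical.

```python
from collections import deque


def feasible_links(link_levels, required_level):
    """Keep only links whose precomputed level meets the required level.

    Assumption: a higher level number means better service.
    """
    return {link for link, level in link_levels.items()
            if level >= required_level}


def find_path(links, source, destination):
    """Breadth-first search over the remaining links; None if no path exists."""
    adjacency = {}
    for i, j in links:
        adjacency.setdefault(i, []).append(j)
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == destination:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(adjacency.get(node, [])):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical per-link levels for a four-node network.
link_levels = {(0, 1): 3, (1, 3): 1, (0, 2): 2, (2, 3): 2}
path_level_2 = find_path(feasible_links(link_levels, 2), 0, 3)
path_level_3 = find_path(feasible_links(link_levels, 3), 0, 3)
```

With a required level of 2 the link (1, 3) is pruned, so the search returns the path through node 2; with a required level of 3 no path remains.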

First, a user who wants a communication service sends the destination information and a service profile to the access router. The service profile contains the minimum QoS parameter values needed to provide the service. The QoS parameters considered in the present invention are bandwidth, delay, and loss rate. The service profile requested by the user is converted into a service level by a mapping function. The router then asks the routing function in the SDN to set up a path, sending the source and destination node information together with the service level. The SDN routing layer uses the same mapping function to convert the current network state information into a matrix of the service levels offered by each link. The DQN agent takes the source, the destination, and the service level as input and, by interacting with the environment, determines a path that satisfies the service level. The minimum QoS level depends on the required bandwidth, delay, loss, and so on, and can be adjusted through the weight given to each parameter.
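The source does not give the mapping function itself, so the following is only a sketch of what such a function might look like. The thresholds, the three-level scale, and the weighted-average rule are all assumptions made for illustration. The same function could be applied to a user's service profile and to the measured state of a link.

```python
import bisect

# Hypothetical thresholds; the patent does not specify concrete values.
BANDWIDTH_STEPS_MBPS = (10.0, 100.0)   # higher bandwidth -> higher level
DELAY_STEPS_MS = (20.0, 100.0)         # lower delay -> higher level
LOSS_STEPS = (0.01, 0.05)              # lower loss rate -> higher level


def level_higher_is_better(value, steps):
    """Number of thresholds the value reaches or exceeds (0 .. len(steps))."""
    return bisect.bisect_right(steps, value)


def level_lower_is_better(value, steps):
    """Number of thresholds the value stays at or below (0 .. len(steps))."""
    return len(steps) - bisect.bisect_left(steps, value)


def service_level(bandwidth_mbps, delay_ms, loss_rate,
                  weights=(1.0, 1.0, 1.0)):
    """Map three QoS parameters to one level via a weighted average.

    The weighting rule is an assumption; the source only says that the
    level can be adjusted according to per-parameter weights.
    """
    levels = (
        level_higher_is_better(bandwidth_mbps, BANDWIDTH_STEPS_MBPS),
        level_lower_is_better(delay_ms, DELAY_STEPS_MS),
        level_lower_is_better(loss_rate, LOSS_STEPS),
    )
    weighted = sum(w * lv for w, lv in zip(weights, levels))
    return int(weighted / sum(weights))  # truncate toward the lower level


video_level = service_level(100.0, 20.0, 0.01)
text_level = service_level(5.0, 150.0, 0.1)
bandwidth_heavy = service_level(100.0, 150.0, 0.1, weights=(3.0, 1.0, 1.0))
```

Raising the bandwidth weight in the last call lifts the result from level 0 to level 1, which is one way the weights could change the outcome.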

In the DRL-based routing algorithm, which finds a path between two communicating nodes while satisfying the QoS parameters specified in the service profile, the state, action, and reward are defined by the following state space, action space, and reward function, respectively.

First, the state space must contain the current state of the network, so the control device first obtains the connectivity information and link state information of the network. The initial state information is represented, for each QoS parameter, as a two-dimensional matrix of size |V| × |V|. Because the raw values of the parameters may be on different scales, they are normalized. The mapping function then converts the per-link QoS parameter information into the service level of each link, yielding a single combined two-dimensional matrix.
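A possible shape for this preprocessing step is sketched below. The min-max normalization, the equal-weight score, and the convention that level 0 means "no link" are assumptions chosen for illustration; the source only states that the per-parameter matrices are normalized and then merged into one level matrix.

```python
def min_max_normalize(matrix):
    """Scale the existing entries of a matrix to [0, 1]; None stays None."""
    values = [v for row in matrix for v in row if v is not None]
    lo, hi = min(values), max(values)
    span = hi - lo
    return [[None if v is None else (0.0 if span == 0 else (v - lo) / span)
             for v in row] for row in matrix]


def link_level_matrix(bandwidth, delay, loss, num_levels=3):
    """Merge three per-parameter matrices into one matrix of link levels.

    Level 0 marks "no link"; existing links get levels 1 .. num_levels.
    """
    nb = min_max_normalize(bandwidth)
    nd = min_max_normalize(delay)
    nl = min_max_normalize(loss)
    n = len(bandwidth)
    levels = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if nb[i][j] is None:
                continue
            # High bandwidth is good; high delay and loss are bad.
            score = (nb[i][j] + (1.0 - nd[i][j]) + (1.0 - nl[i][j])) / 3.0
            levels[i][j] = 1 + min(num_levels - 1, int(score * num_levels))
    return levels


def flatten(matrix):
    """Row-major flattening, e.g. to feed a network input layer."""
    return [v for row in matrix for v in row]


bandwidth = [[None, 100.0, None], [100.0, None, 10.0], [None, 10.0, None]]
delay = [[None, 5.0, None], [5.0, None, 20.0], [None, 20.0, None]]
loss = [[None, 0.01, None], [0.01, None, 0.05], [None, 0.05, None]]
state_matrix = link_level_matrix(bandwidth, delay, loss)
state_vector = flatten(state_matrix)
```

In this toy network the fast, clean link between routers 0 and 1 receives the top level, and the slow, lossy link between routers 1 and 2 receives the lowest.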

Next, the action space is defined as the set of all links in the network. That is, the action space vector is A = [a₁, a₂, ..., a_|E|], where each action corresponds to a link (i, j) ∈ E of the network. The actions that are actually available at a node are limited to the links leaving that node. To keep the model simple, however, every link is made selectable, and the agent is trained through the reward values to choose an outgoing link as its action.
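As a small illustration of this definition, the links can be given fixed indices and the valid subset for a node computed on demand. The function names and the sorted ordering are assumptions, not something the source prescribes.

```python
def build_action_space(edges):
    """Give every link a fixed index: action k corresponds to actions[k]."""
    return sorted(edges)


def valid_actions(actions, node):
    """Indices of the actions whose link leaves the given node."""
    return [k for k, (i, _j) in enumerate(actions) if i == node]


edges = {(0, 1), (1, 0), (1, 2), (2, 1)}
actions = build_action_space(edges)
valid_at_node_1 = valid_actions(actions, 1)
```

The agent may still pick any index in `actions`; the valid subset is only used by the environment when it decides whether to apply a penalty.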

To improve the learning accuracy of the model, the reward function must be specified appropriately. Because the action set is defined so that the agent can select any link in the network, the reward function has to cover a variety of situations. In an episode of T time steps, the reinforcement learning agent must find a path from the source node x to the destination node y. At an arbitrary node z, the agent selects, at time step t, an action corresponding to a link (i, j) and receives a reward determined by the reward function f((i, j)) defined in Equation 1.

Here, the valid action set of node z consists of the links that leave node z; an action is regarded as valid only if it selects an outgoing link of node z. If the selected action is invalid, the agent receives a reward of -|V|/2, that is, a penalty. Otherwise, the reward is determined by a separately defined function. Selecting a link that leads to an already selected node incurs a very large negative reward. For this purpose, a list is kept that is empty at the start of each episode and is then filled with the selected links. This prevents the agent from being trapped in a network loop by repeatedly selecting the same link within an episode. If the node at the far end of the selected link is the destination node y, the state is terminal and the environment returns the highest positive reward, |V|, to the agent. Each episode is limited to T time steps so that the agent does not get stuck in an infinite loop while exploring the action space. If the current time step exceeds the total number of edges, the agent fails and the episode ends with a high penalty of -|V|. In all remaining cases, the QoS level of the link selected by the agent is compared with the required QoS level, and a suitable positive or negative reward is returned.
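The cases described above can be collected into one function. This is a sketch under assumptions: the source fixes only the values -|V|/2, |V|, and -|V|; the loop penalty, the QoS bonus and penalty, the order in which the cases are checked, and all names below are hypothetical.

```python
def reward(link, current_node, destination, visited_links, step,
           num_nodes, num_edges, link_level, required_level,
           loop_penalty=None, qos_bonus=1.0, qos_penalty=-1.0):
    """Return (reward, episode_done) for selecting `link` at `current_node`.

    Values given in the text: -|V|/2 (invalid action), |V| (destination
    reached), -|V| (time limit exceeded). Everything else is hypothetical.
    """
    if loop_penalty is None:
        loop_penalty = -float(num_nodes)   # assumed "very large" penalty
    if step > num_edges:                   # time limit T = |E| exceeded
        return -float(num_nodes), True
    i, j = link
    if i != current_node:                  # not an outgoing link of this node
        return -num_nodes / 2.0, False
    visited_nodes = {n for visited in visited_links for n in visited}
    if j in visited_nodes:                 # would revisit a node (loop)
        return loop_penalty, False
    if j == destination:                   # terminal state
        return float(num_nodes), True
    if link_level >= required_level:       # link meets the requested QoS
        return qos_bonus, False
    return qos_penalty, False


# Four nodes, six links, destination node 3, required level 2.
invalid = reward((1, 2), 0, 3, [], 1, 4, 6, 2, 2)
looping = reward((1, 0), 1, 3, [(0, 1)], 2, 4, 6, 2, 2)
arrived = reward((1, 3), 1, 3, [(0, 1)], 2, 4, 6, 2, 2)
timed_out = reward((0, 1), 0, 3, [], 7, 4, 6, 2, 2)
```

The visited nodes are derived from the list of selected links, matching the description of a list that starts empty and is filled with links as the episode proceeds.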

The state space, action space, and reward function defined above are applied to the DQN. The DQN learning algorithm is defined by Equation 2.
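Equation 2 itself is not reproduced in this text. For orientation only, the loss commonly used to train a DQN in the general literature is shown here; it may differ in form or notation from the patent's Equation 2. In it, θ denotes the main Q-network parameters, θ⁻ the target Q-network parameters, D the replay buffer, and γ a discount factor whose value the source does not state.

```latex
L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim D}
\left[ \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}) - Q(s_t, a_t; \theta) \right)^{2} \right]
```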

The input layer is a one-dimensional vector of size |V| × |V|. Two hidden layers with 512 neurons are used, with the rectified linear unit (ReLU) as the activation function. The output layer has size |A|, and its outputs are the Q-function values of the individual actions. The action with the largest Q-function value among the outputs is selected. The DRL agent was trained with the hyperparameters batch size = 64, learning rate = 0.001, and buffer size = 5000. An epsilon-greedy policy was chosen to keep learning going. The initial epsilon value, which is the probability of selecting a random action, was set to 0.9, and the epsilon decay value, which lowers that probability as training progresses, was set to 0.99.
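The epsilon-greedy selection and decay described above can be sketched without any deep learning library. The hyperparameter values are the ones stated in the text; the function names, the `floor` parameter, and the choice to apply the decay once per call are assumptions, since the source does not say when the decay is applied.

```python
import random

# Values stated in the text.
HYPERPARAMETERS = {
    "batch_size": 64,
    "learning_rate": 0.001,
    "buffer_size": 5000,
    "epsilon_start": 0.9,
    "epsilon_decay": 0.99,
}


def select_action(q_values, epsilon, rng):
    """Epsilon-greedy: random action with probability epsilon, else argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def decayed_epsilon(epsilon, decay, floor=0.0):
    """Multiply epsilon by the decay factor (the floor is hypothetical)."""
    return max(floor, epsilon * decay)


rng = random.Random(0)
greedy_choice = select_action([0.1, 0.7, 0.2], 0.0, rng)
random_choice = select_action([0.1, 0.7, 0.2], 1.0, rng)
next_epsilon = decayed_epsilon(HYPERPARAMETERS["epsilon_start"],
                               HYPERPARAMETERS["epsilon_decay"])
```

With epsilon at 0 the function always returns the index of the largest Q-value; with epsilon at 1 it always draws a random index.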

At the start of the algorithm, an instance of the environment is created by specifying the number of network nodes and the link information. The replay buffer and the hyperparameters of the main Q-network and the target Q-network are then initialized. The algorithm runs for a total of N episodes. At the start of each episode, the network link state information is normalized, and the mapping function converts the per-link QoS parameter information into the service level of each link, yielding a single combined two-dimensional matrix. An empty set is created that is used to inform the reward function about edges that have already been visited. Each episode lasts T time steps, where T = |E|.

At each time step t, an action is selected using the epsilon-greedy method. If the selected action is valid, it is executed in the environment: the agent obtains a reward according to the reward function, the selected link is added to the set of visited edges, and the state space vector is updated. If the selected action is not valid, a negative reward is returned and the same state is used unchanged for the next iteration. Once the state, action, reward, and next state have been obtained, the transition is stored in the experience replay buffer. When a sufficient number of transitions has accumulated in the replay buffer, a random mini-batch of transitions is sampled from it, and the weights of the deep neural network are optimized by gradient descent on θ to minimize the loss. The target network parameters are updated from the main network parameters at regular intervals. An episode ends when the path to the destination node is complete or when t > |E|. The reinforcement learning model is trained by this process, and the trained model can then calculate routing paths that satisfy the QoS level.
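The bookkeeping part of this loop (storing transitions, sampling a mini-batch once enough have accumulated, and deciding when to refresh the target network) can be sketched as follows. The class and function names, the `done` flag stored with each transition, and the `sync_every` interval are assumptions; the network update itself is omitted.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity experience replay; the oldest transitions drop out."""

    def __init__(self, capacity=5000):
        self._items = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self._items.append((state, action, reward, next_state, done))

    def __len__(self):
        return len(self._items)

    def sample(self, batch_size, rng):
        """Uniformly sample a mini-batch without replacement."""
        return rng.sample(list(self._items), batch_size)


def should_sync_target(step, sync_every):
    """True on the steps where the target network copies the main network.

    The interval is hypothetical; the source only says "at regular intervals".
    """
    return step > 0 and step % sync_every == 0


rng = random.Random(0)
buffer = ReplayBuffer(capacity=5000)
for t in range(6000):
    # Placeholder transitions; a real state would be the flattened matrix.
    buffer.push(state=t, action=0, reward=0.0, next_state=t + 1, done=False)

batch = buffer.sample(64, rng) if len(buffer) >= 64 else []
oldest_state = min(transition[0] for transition in buffer.sample(5000, rng))
```

After 6000 pushes into a buffer of capacity 5000, the first 1000 transitions have been evicted, so the oldest remaining state is 1000.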

Figure 2 is a flowchart of a deep reinforcement learning-based routing method according to an embodiment of the disclosed technology. Referring to Figure 2, the deep reinforcement learning-based routing method (200) includes a step (210) in which the control device receives a request for a communication service from a first node among a plurality of nodes in the network, a step (220) in which the control device determines a QoS level for the communication service according to the request, and a step (230) in which the control device inputs state information on the available nodes among the plurality of nodes into a reinforcement learning model to calculate a plurality of routing paths capable of providing the communication service and selects at least one path corresponding to the QoS level from among the plurality of routing paths. The deep reinforcement learning-based routing method (200) can be performed by the control device of a software-defined network.

In step 210, the control device receives a request for a communication service from a first node among the plurality of nodes in the network. The communication service may be requested from the control device by a single node or by multiple nodes.

In step 220, the control device determines the QoS level for the communication service according to the request. To provide communication services, the control device may define a plurality of QoS levels for each communication service in advance. For example, a request to transmit high-resolution video, which involves a large amount of data, may be classified as high level, and a request to transmit a small amount of data, such as a text message, may be classified as low level. The number of QoS levels may vary depending on the settings of the control device.
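As a rough illustration of step 220, the sketch below maps a requested service type to a predefined QoS level. The two levels mirror the high/low example in the text; the names `QoSLevel`, `SERVICE_LEVELS`, and `determine_qos_level`, the service-type strings, and the default for unknown services are all assumptions made for illustration.

```python
from enum import Enum


class QoSLevel(Enum):
    HIGH = "high"
    LOW = "low"


# Levels are classified in advance per service type (hypothetical keys).
SERVICE_LEVELS = {
    "high_resolution_video": QoSLevel.HIGH,
    "text_message": QoSLevel.LOW,
}


def determine_qos_level(service_type, default=QoSLevel.LOW):
    """Look up the pre-classified QoS level for a requested service."""
    return SERVICE_LEVELS.get(service_type, default)
```

Because the number of levels depends on the control device's settings, a table-driven lookup like this would let levels be added without changing the lookup code.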

In step 230, the control device inputs state information about available nodes among the plurality of nodes into the reinforcement learning model to calculate a plurality of routing paths capable of providing the communication service, and selects at least one path corresponding to the QoS level from among the plurality of routing paths. The reinforcement learning model installed in the control device receives state information as input and calculates the optimal action. Here, the state information refers to the connection information and link states of the network. The action refers to defining the set of all links in the network and calculating a plurality of routing paths that can be generated using that set of links.
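To make the data shapes in step 230 concrete, the sketch below enumerates the loop-free paths that can be built from a link set and keeps those meeting a QoS requirement. It is a brute-force stand-in, not the reinforcement learning model of the disclosure: the functions `simple_paths`, `bottleneck`, and `select_paths`, the use of bottleneck bandwidth as the link-state metric, and the `required` threshold are all assumptions.

```python
def simple_paths(links, src, dst):
    """Enumerate loop-free paths from src to dst over an undirected link set (DFS)."""
    adjacency = {}
    for u, v in links:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)

    paths = []

    def dfs(node, path):
        if node == dst:
            paths.append(list(path))
            return
        for nxt in adjacency.get(node, []):
            if nxt not in path:  # skip visited nodes to avoid routing loops
                path.append(nxt)
                dfs(nxt, path)
                path.pop()

    dfs(src, [src])
    return paths


def bottleneck(path, bandwidth):
    """Smallest link bandwidth along a path with at least one link."""
    return min(bandwidth[frozenset(pair)] for pair in zip(path, path[1:]))


def select_paths(links, bandwidth, src, dst, required):
    """Keep the candidate paths whose bottleneck bandwidth meets the requirement."""
    return [
        p for p in simple_paths(links, src, dst)
        if bottleneck(p, bandwidth) >= required
    ]
```

Exhaustive enumeration grows quickly with network size, which is one reason a learned policy is attractive here; the sketch is only meant to show what "a plurality of routing paths" and "selecting by QoS level" could look like as data.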

The reinforcement learning model can be trained to calculate the optimal routing path. That is, it can be trained so that the reward is maximized in each training round. Early on, before sufficient training has taken place, the model may reselect a link it has already selected; in this case, a very large negative reward can be assigned. As training rounds are repeated, the model increasingly selects links that yield positive rewards. By repeating this process, training proceeds so that the reward is maximized, and in the subsequent test phase the model can output the optimal path.
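The reselection penalty described above could be expressed as in the following sketch. The penalty magnitude, the per-link `quality` score used as the positive reward, and the function names are assumptions; the description only states that reselecting a link earns a very large negative reward.

```python
REVISIT_PENALTY = -100.0  # assumed magnitude for "very large negative reward"


def link_reward(link, visited, quality):
    """Penalize a link that was already selected; otherwise return its quality score."""
    if link in visited:
        return REVISIT_PENALTY
    return quality[link]


def episode_return(chosen_links, quality):
    """Cumulative reward over one episode's sequence of link choices."""
    visited = set()
    total = 0.0
    for link in chosen_links:
        total += link_reward(link, visited, quality)
        visited.add(link)
    return total
```

For example, with `quality = {"e1": 1.0, "e2": 2.0}`, the sequence `["e1", "e2"]` returns 3.0, while `["e1", "e1", "e2"]` returns -97.0 because the repeated `"e1"` incurs the penalty.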

FIG. 3 is a block diagram of a deep reinforcement learning-based routing device according to an embodiment of the disclosed technology. Referring to FIG. 3, the deep reinforcement learning-based routing device (300) includes a communication device (310), a storage device (320), and a computing device (330).

The communication device (310) receives a communication service request from a first node among a plurality of nodes in the network. The communication device (310) may be a communication module through which the control device in a software-defined network receives communication service requests from nodes on the network.

The storage device (320) stores a reinforcement learning model trained to calculate routing paths using the plurality of nodes in the network. The storage device (320) may be the memory of the deep reinforcement learning-based routing device (300).

The computing device (330) determines the QoS level for the communication service according to the request, inputs state information about available nodes among the plurality of nodes into the reinforcement learning model to calculate a plurality of routing paths capable of providing the communication service, and selects at least one path corresponding to the QoS level from among the plurality of routing paths.

Meanwhile, the deep reinforcement learning-based routing device (300) described above may also be implemented as a program (or application) containing an executable algorithm that can run on a computer. The program may be stored and provided on a transitory or non-transitory computer-readable medium.

A non-transitory readable medium is a medium that stores data semi-permanently and can be read by a device, as opposed to a medium that stores data only briefly, such as a register, cache, or memory. Specifically, the applications or programs described above may be stored and provided on a non-transitory readable medium such as a CD, DVD, hard disk, Blu-ray disc, USB, memory card, ROM (read-only memory), PROM (programmable read-only memory), EPROM (erasable PROM), EEPROM (electrically erasable PROM), or flash memory.

A transitory readable medium refers to various types of RAM, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synclink DRAM (SLDRAM), and direct Rambus RAM (DRRAM).

FIG. 4 shows the training performance of the reinforcement learning model according to the disclosed technology. The x-axis is the episode number, and the y-axis is the cumulative reward at the end of the episode. Early in training, the agent spends a considerable period exploring the environment, so it selects actions that are invalid or that it has already selected. Most episodes terminate abnormally, and the agent therefore receives very large negative cumulative rewards. As training progresses, the agent selects actions that avoid invalid choices and network loops, and the cumulative reward increases. As the cumulative reward approaches 0, the agent has learned how to route under a substantial portion of network states. After 250 episodes, the agent receives positive cumulative rewards, which indicates that it was trained successfully.

The deep reinforcement learning-based routing method and device according to an embodiment of the disclosed technology have been described with reference to the embodiments shown in the drawings to aid understanding. These embodiments are merely illustrative, and those of ordinary skill in the art will understand that various modifications and equivalent embodiments are possible. Therefore, the true technical scope of protection of the disclosed technology should be determined by the appended claims.

Claims

A deep reinforcement learning-based routing method comprising:
receiving, by a control device of a software-defined network, a communication service request from a first node among a plurality of nodes in the network;
determining, by the control device, a QoS level for the communication service according to the request; and
calculating, by the control device, a plurality of routing paths capable of providing the communication service by inputting state information about available nodes among the plurality of nodes into a reinforcement learning model, and selecting at least one path corresponding to the QoS level from among the plurality of routing paths,
wherein the control device classifies a plurality of QoS levels for communication services in advance and determines which of the plurality of QoS levels the communication service requested by the node corresponds to,
wherein the control device inputs connection information and link state information of the network as the state information, and
wherein the reinforcement learning model defines a set of all links in the network and calculates the plurality of routing paths that can be generated using the set of links.

delete

The deep reinforcement learning-based routing method of claim 1,
wherein the reinforcement learning model is trained so that the reward is maximized in each training round.

A deep reinforcement learning-based routing device comprising:
a communication device that receives a communication service request from a first node among a plurality of nodes in a network;
a storage device that stores a reinforcement learning model trained to calculate routing paths using the plurality of nodes in the network; and
a computing device that determines a QoS level for the communication service according to the request, inputs state information about available nodes among the plurality of nodes into the reinforcement learning model to calculate a plurality of routing paths capable of providing the communication service, and selects at least one path corresponding to the QoS level from among the plurality of routing paths,
wherein the computing device classifies a plurality of QoS levels for communication services in advance and determines which of the plurality of QoS levels the communication service requested by the node corresponds to,
wherein the computing device inputs connection information and link state information of the network as the state information, and
wherein the reinforcement learning model defines a set of all links in the network and calculates the plurality of routing paths that can be generated using the set of links.

delete

The deep reinforcement learning-based routing device of claim 6,
wherein the reinforcement learning model is trained so that the reward is maximized in each training round.