KR20190127480A

KR20190127480A - Method and apparatus for determining transmission point and link adaptation scheme in communication system

Info

Publication number: KR20190127480A
Application number: KR1020180052144A
Authority: KR
Inventors: 김정임; 이유로
Original assignee: 한국전자통신연구원
Priority date: 2018-05-04
Filing date: 2018-05-04
Publication date: 2019-11-13

Abstract

Disclosed is an operation method of a resource managing device connected with a plurality of transmission points in a communication system. According to the present invention, the operation method of a resource managing device comprises the following steps: receiving wireless channel state information and transmission point state information from a terminal; determining a transmission point providing a maximum data rate and a link adaptation scheme based on the wireless channel state information and the transmission point state information; and transmitting information on the transmission point providing the maximum data rate and information on the link adaptation scheme to the terminal.

Description

METHOD AND APPARATUS FOR DETERMINING TRANSMISSION POINT AND LINK ADAPTATION SCHEME IN COMMUNICATION SYSTEM

The present invention relates to a communication system, and more particularly, to a method and apparatus for determining a transmission point and a link adaptation scheme.

The long term evolution (LTE) system has standardized techniques that deploy small cells to improve radio link quality and the efficient reuse of frequency space, in order to meet the bandwidth requirements of mobile communication. However, unlike macro cells, which are deployed according to a plan, small cells are deployed randomly and ad hoc, so their locations are difficult to predict. The conventional approach of managing the network through manual intervention and centralized control is time consuming and expensive, and is therefore unsuitable for managing small cells with a high deployment density per unit area. Accordingly, the 3rd Generation Partnership Project (3GPP) has standardized self-organizing network (SON) technology for small cell network management. In addition, to satisfy a 1000-fold improvement in data rate per unit area, which is one of the requirements of the 5G system, 3GPP is considering a transmission point network for transmitting high-density data per unit area.

SON, the technology standardized by 3GPP, consists of several functions that are designed independently of one another. Accordingly, when a plurality of transmission points operate SON functions at the same time, the different SON functions may independently make conflicting decisions. As a result, the transmission points may set parameters that are mutually incompatible, which causes performance degradation or link failure.

3GPP has introduced the coordinated multi-point (CoMP) scheme, in which data is transmitted to a terminal using a plurality of transmission points. However, when a plurality of transmission points exist, there is currently no method for selecting, from among them, the transmission point that is to transmit data to the terminal.

In current mobile communication systems, the link adaptation scheme was designed without considering transmission points that transmit signals over a short range, such as small cells. In a frequency division duplex (FDD) system, a terminal may measure the downlink radio channel and report a channel quality indicator (CQI) to a macro base station. Based on the CQI received from the terminal, the macro base station may determine a modulation scheme and a channel coding rate suited to the radio channel conditions, and may transmit packets to the terminal according to the resulting modulation and coding scheme (MCS). When an error is detected in a packet received from the macro base station, the terminal may perform a hybrid automatic repeat request (HARQ) operation. The macro base station may adapt the radio channel link by retransmitting packets according to HARQ, thereby reducing the packet error rate. That is, the macro base station may adapt the radio channel link through MCS and HARQ. The conventional link adaptation scheme does not consider the rapid changes in the radio channel environment caused by dynamic resource allocation when transmission points exist near the macro base station. The conventional link adaptation scheme is therefore inefficient when a plurality of transmission points near the macro base station change their transmission schemes quickly.

To solve the problems described above, the present invention aims to provide a network configuration method in which no conflict between parameters occurs when each of a plurality of transmission points in a communication system uses a SON function. The present invention also aims to provide a method for selecting, from among the plurality of transmission points, at least one transmission point for transmitting data to a terminal without conflict between parameters. The present invention further aims to provide a link adaptation scheme that takes into account a plurality of transmission points located near a macro base station.

An operation method of a resource manager connected to a plurality of transmission points in a communication system according to an embodiment of the present invention includes: receiving radio channel state information and transmission point state information from a terminal; determining, based on the radio channel state information and the transmission point state information, a transmission point and a link adaptation scheme that provide a maximum data rate; and transmitting, to the terminal, information on the transmission point providing the maximum data rate and information on the link adaptation scheme. The transmission point and the link adaptation scheme providing the maximum data rate are determined through a Q (quality)-learning scheme or a deep Q-network learning scheme.

The operation method of the resource manager may further include updating a Q (quality) value based on the radio channel state information and the transmission point state information.

The radio channel state information may include at least one of a channel state information reference signal (CSI-RS), CSI interference measurement (CSI-IM), a channel quality indicator (CQI), a precoding matrix indicator (PMI), a rank indication (RI), and interference state information. The transmission point state information may include at least one of a transmission rate of the transmission point and a handover rate of the transmission point. The Q value may indicate the maximum data rate determined based on the radio channel state information and the state information of each transmission point.
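As a rough illustration of the Q value update described above, the following sketch applies the standard tabular Q-learning rule. The text here does not state the update formula, so the rule itself, the learning rate `alpha`, the discount factor `gamma`, and the encoding of the state as a `(CQI, RI, PMI, load)` tuple are all assumptions made for illustration; the reward is taken to be the observed data rate.

```python
from collections import defaultdict


def update_q(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step (textbook rule, assumed for illustration)."""
    # Best value reachable from the next state over all candidate actions.
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the stored estimate a fraction alpha toward the target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical state encoding: (CQI, RI, PMI, transmission point load).
q_table = defaultdict(float)
state = (9, 2, 1, "low")
next_state = (10, 2, 1, "low")
tp_actions = [0, 1, 2]  # hypothetical indices of candidate transmission points

# Reward is assumed to be the data rate observed after choosing point 1.
new_q = update_q(q_table, state, 1, reward=10.0,
                 next_state=next_state, actions=tp_actions)
```

Unvisited state and action pairs default to a Q value of zero in this sketch, which is one common initialization choice rather than something the text prescribes.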

The operation method of the resource manager may further include: determining an action mode, which is either exploitation, in which the resource management action that yields the largest Q value among the possible states and actions is performed, or exploration, in which an action is taken at random regardless of the Q value; and determining a transmission point according to the maximum Q value.
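The choice between exploitation and exploration can be sketched with an epsilon-greedy rule. The text does not say how the action mode is chosen, so the probability `epsilon` and the function name `choose_action` are assumptions; the sketch only shows one common way to realize the two modes.

```python
import random


def choose_action(q_row, epsilon, rng=random):
    """Return (action index, action mode) for one decision.

    q_row holds the Q value of each candidate action in the current state.
    """
    if rng.random() < epsilon:
        # Exploration: act at random, regardless of the Q values.
        return rng.randrange(len(q_row)), "exploration"
    # Exploitation: take the action with the largest Q value.
    best = max(range(len(q_row)), key=q_row.__getitem__)
    return best, "exploitation"


# With epsilon = 0 the rule always exploits and picks the largest Q value.
action, mode = choose_action([1.0, 5.0, 3.0], epsilon=0.0)
```

A larger `epsilon` spends more decisions on untried actions, and a smaller one relies more on what has already been learned.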

The operation method of the resource manager may further include, when the action mode is exploitation, taking the channel state of the terminal (CQI, RI, PMI) and the load of the transmission points as the state, updating the Q value, and determining, based on the Q value, the handover transmission point that provides the largest Q value among the transmission points.

The operation method of the resource manager may include: determining whether the action mode is exploitation or exploration; when the action is exploration, updating the Q value; and determining an arbitrary modulation and coding scheme (MCS), hybrid automatic repeat request (HARQ), and antenna transmission mode.

The operation method of the resource manager may further include, when the action mode is exploitation, determining the MCS, HARQ, and antenna transmission mode based on the maximum Q value.
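For link adaptation, one way to picture the action is as a joint choice of MCS, HARQ setting, and antenna transmission mode. The sketch below enumerates such joint actions and selects one either at random (exploration) or by largest Q value (exploitation). The concrete value sets, the use of a maximum retransmission count to stand for the HARQ setting, and the names `ACTIONS` and `select_link_adaptation` are hypothetical.

```python
import itertools
import random

# Hypothetical candidate values; a real system would use its own sets.
MCS_INDICES = range(4)
HARQ_MAX_RETX = (1, 2, 4)
ANTENNA_MODES = ("single", "diversity", "spatial_multiplexing")

# Every joint (MCS, HARQ, antenna mode) combination is one action.
ACTIONS = list(itertools.product(MCS_INDICES, HARQ_MAX_RETX, ANTENNA_MODES))


def select_link_adaptation(q_values, explore, rng=random):
    """Pick a joint link adaptation action for the current state."""
    if explore:
        # Exploration: an arbitrary MCS, HARQ setting, and antenna mode.
        return rng.choice(ACTIONS)
    # Exploitation: the combination with the largest Q value
    # (combinations never tried are treated as having Q = 0).
    return max(ACTIONS, key=lambda a: q_values.get(a, 0.0))


q_values = {(2, 2, "diversity"): 7.5, (0, 1, "single"): 3.0}
chosen = select_link_adaptation(q_values, explore=False)
```

Treating the three settings as a single joint action keeps the selection rule identical to the transmission point case, at the cost of a larger action set.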

The operation method of the resource manager may further include: updating a weight value, based on the radio channel state information and the transmission point state information, so that a cost is minimized; and updating the Q value.
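The weight update that minimizes a cost can be illustrated with a single gradient step on a squared error. The sketch below swaps the deep Q-network for a linear approximator so that it stays short and dependency-free; the cost definition (half the squared difference between a bootstrapped target and the current Q estimate), the feature vector, and the parameters `gamma` and `lr` are assumptions, not details taken from the text.

```python
def q_value(weights, features, action):
    """Linear stand-in for the network: Q(s, a) = w_a . features(s)."""
    return sum(w * x for w, x in zip(weights[action], features))


def sgd_step(weights, features, action, reward, next_features, gamma=0.9, lr=0.1):
    """One semi-gradient step; returns the cost measured before the update."""
    # Bootstrapped target, held constant while differentiating the cost.
    target = reward + gamma * max(
        q_value(weights, next_features, a) for a in range(len(weights))
    )
    error = target - q_value(weights, features, action)
    # Gradient descent on cost = 0.5 * error**2 with respect to w_action.
    weights[action] = [w + lr * error * x for w, x in zip(weights[action], features)]
    return 0.5 * error ** 2


# Two hypothetical actions, two hypothetical state features.
weights = [[0.0, 0.0], [0.0, 0.0]]
features = [1.0, 0.5]
cost_before = sgd_step(weights, features, 0, reward=2.0, next_features=features)
cost_after = sgd_step(weights, features, 0, reward=2.0, next_features=features)
```

Repeating the step on the same sample lowers the cost, which is the behaviour the weight update is meant to achieve; a deep network would replace `q_value` with a multi-layer model trained the same way.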

The operation method of the resource manager may further include: updating the weight value to a value that minimizes the cost; and, when the action is exploration, determining an arbitrary number of transmission points among the plurality of transmission points.

The operation method of the resource manager may further include: updating the weight value to a value that minimizes the cost; and, when the action is exploitation, determining the transmission point that provides the maximum Q value among the plurality of transmission points.

The operation method of the resource manager may include: updating the weight value to a value that minimizes the cost; and, when the action is exploration, determining an arbitrary MCS, HARQ, and antenna transmission mode.

The operation method of the resource manager may include: updating the weight value to a value that minimizes the cost; and, when the action is exploitation, determining the MCS, HARQ, and antenna transmission mode that provide the maximum Q value.

In addition, a resource manager connected to a plurality of transmission points in a communication system according to an embodiment of the present invention includes a processor and a memory storing at least one instruction executed by the processor. The at least one instruction is executed to receive radio channel state information and transmission point state information from a terminal, to determine, based on the radio channel state information and the transmission point state information, a transmission point and a link adaptation scheme that provide a maximum data rate, and to transmit, to the terminal, information on the transmission point providing the maximum data rate and information on the link adaptation scheme. The transmission point and the link adaptation scheme providing the maximum data rate are determined through a Q (quality)-learning scheme or a deep Q-network learning scheme.

The at least one instruction may be further executed to update a Q (quality) value based on the radio channel state information and the transmission point state information.

The at least one instruction determines whether the action is exploration or exploitation and, when the action is exploitation, selects from the plurality of transmission points the transmission point for which the Q value is the maximum, and updates the Q value.

When the action is exploration, an arbitrary number of transmission points among the plurality of transmission points is determined. When the handover rate exceeds a predetermined threshold rate, the at least one instruction may be further executed to determine, based on the Q value, the transmission point that provides the largest transmission rate among the plurality of transmission points.
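The handover-rate condition above can be read as a simple guard around the Q-based selection. In the sketch below, the function name, the dictionary layout, and the choice to keep the current transmission point when the threshold is not exceeded are assumptions; the text only specifies what happens when the threshold is exceeded.

```python
def reselect_on_handover(q_by_tp, handover_rate, threshold, current_tp):
    """Re-select the transmission point when handovers become too frequent.

    q_by_tp maps a transmission point identifier to its current Q value.
    """
    if handover_rate > threshold:
        # Threshold exceeded: pick the point with the largest Q value,
        # which in this scheme stands for the largest expected rate.
        return max(q_by_tp, key=q_by_tp.get)
    # Assumed fallback: otherwise stay on the current transmission point.
    return current_tp


# Hypothetical Q values for three transmission points.
q_by_tp = {"tp-a": 4.2, "tp-b": 9.1, "tp-c": 6.7}
selected = reselect_on_handover(q_by_tp, handover_rate=0.35,
                                threshold=0.2, current_tp="tp-a")
```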

The at least one instruction may be further executed to determine whether the action is exploration or exploitation, to determine an arbitrary MCS, HARQ, and antenna transmission mode when the action is exploration, and, when the action is exploitation, to determine, based on the Q value, the MCS, HARQ, and antenna transmission mode that provide the maximum Q value.

The at least one instruction may be further executed to update a deep Q-network weight value, based on the radio channel state information and the transmission point state information, so that a cost is minimized, and to update the Q value.

The at least one instruction may be further executed to determine whether the action is exploration or exploitation, to determine an arbitrary number of transmission points among the plurality of transmission points when the action is exploration, to update the deep Q-network weight value so that the cost is minimized, and, when the handover rate exceeds a predetermined threshold rate, to determine, based on the weight value, the transmission point that provides the largest transmission rate among the plurality of transmission points.

The at least one instruction may be further executed to determine whether the action is exploration or exploitation, to determine an arbitrary MCS, HARQ, and antenna transmission mode and update the deep Q-network weight value when the action is exploration, and, when the action is exploitation, to determine the MCS, HARQ, and antenna transmission mode that provide the maximum Q value.

According to the present invention, interference between a plurality of communication nodes can be minimized through a link configuration operation that follows the operation scheme between the communication nodes. In addition, according to the present invention, interference between the communication nodes can be minimized by transmitting and receiving data between them through resources of different resource allocation units. Furthermore, according to the present invention, the total interference between the communication nodes can be minimized by averaging, based on interference information exchanged between them, the interference power values caused by the signals they transmit to and receive from one another.

FIG. 1 is a conceptual diagram illustrating a communication system according to a first embodiment of the present invention.
FIG. 2 is a block diagram illustrating a communication node according to the first embodiment of the present invention.
FIG. 3 is a conceptual diagram illustrating a communication environment and the operation of a resource manager in a communication system according to an embodiment of the present invention.
FIG. 4 is a conceptual diagram illustrating a deep Q-network learning operation in a communication system according to an embodiment of the present invention.
FIG. 5 is a conceptual diagram illustrating a deep Q-network operation that updates weights in a communication system according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating the operation of a resource manager in a communication system according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating the operation of a Q-learning resource manager that determines a transmission point through Q-learning in a communication system according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating the operation of a Q-learning resource manager that determines a link adaptation scheme through Q-learning in a communication system according to an embodiment of the present invention.
FIG. 9 is a flowchart illustrating the operation of a deep Q-network learning resource manager that determines a transmission point through deep Q-network learning in a communication system according to an embodiment of the present invention.
FIG. 10 is a flowchart illustrating the operation of a deep Q-network learning resource manager that determines a link adaptation scheme through deep Q-network learning in a communication system according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments, and specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the present invention to specific embodiments, however, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly, a second component may be referred to as a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of the plurality of related listed items.

When a component is described as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may be present in between. In contrast, when a component is described as being "directly connected" or "directly coupled" to another component, it should be understood that no other component is present in between.

The terminology used in this application is used only to describe specific embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to indicate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. To facilitate overall understanding, the same reference numerals are used for the same components in the drawings, and redundant descriptions of the same components are omitted.

FIG. 1 is a conceptual diagram illustrating a communication system according to a first embodiment of the present invention.

Referring to FIG. 1, the communication system 100 may include a plurality of communication nodes 110-1, 110-2, 110-3, 120-1, 120-2, 130-1, 130-2, 130-3, 130-4, 130-5, and 130-6. Each of the plurality of communication nodes may support at least one communication protocol. For example, each of the plurality of communication nodes may support a code division multiple access (CDMA) based communication protocol, a wideband CDMA (WCDMA) based communication protocol, a time division multiple access (TDMA) based communication protocol, a frequency division multiple access (FDMA) based communication protocol, an orthogonal frequency division multiplexing (OFDM) based communication protocol, an orthogonal frequency division multiple access (OFDMA) based communication protocol, a single carrier FDMA (SC-FDMA) based communication protocol, a non-orthogonal multiple access (NOMA) based communication protocol, a space division multiple access (SDMA) based communication protocol, and the like. The structure of each of the plurality of communication nodes is described below with reference to FIG. 2.

FIG. 2 is a block diagram illustrating a communication node in the communication system according to the first embodiment of the present invention.

Referring to FIG. 2, the communication node 200 may include at least one processor 210, a memory 220, and a transceiver 230 that is connected to a network and performs communication. The communication node 200 may further include an input interface device 240, an output interface device 250, a storage device 260, and the like. The components included in the communication node 200 may be connected by a bus 270 and communicate with one another.

However, the components included in the communication node 200 may instead be connected through individual interfaces or individual buses centered on the processor 210, rather than through the common bus 270. For example, the processor 210 may be connected to at least one of the memory 220, the transceiver 230, the input interface device 240, the output interface device 250, and the storage device 260 through a dedicated interface.

The processor 210 may execute program commands stored in at least one of the memory 220 and the storage device 260. The processor 210 may be a central processing unit (CPU), a graphics processing unit (GPU), or a dedicated processor on which the methods according to embodiments of the present invention are performed. Each of the memory 220 and the storage device 260 may be configured as at least one of a volatile storage medium and a nonvolatile storage medium. For example, the memory 220 may be configured as at least one of read only memory (ROM) and random access memory (RAM).

FIG. 3 is a conceptual diagram illustrating a communication environment and the operation of a resource manager in a communication system according to an embodiment of the present invention.

Referring to FIG. 3, the communication system 300 may include a communication environment 310 and a resource manager 320. The communication environment 310 may include at least one terminal (not shown) and a plurality of transmission points (not shown). The resource manager 320 may be a radio unit (not shown) connected to the at least one terminal and the plurality of transmission points in the communication environment 310. Here, the communication system 300 may be the same as or similar to the communication system 100 of FIG. 1. In addition, the structure of the at least one terminal, the plurality of transmission points, and the resource manager 320 may be the same as or similar to the structure of the communication node 200 of FIG. 2.

In the communication system 300, the resource manager 320 may determine the transmission point for transmitting data to the terminal, and may determine the link adaptation scheme, by using a method that makes optimal decisions in a communication environment 310 that changes dynamically over time. For example, the resource manager 320 may use Q-learning (quality learning) and a deep Q-network (deep quality network).

The communication system 300 may select an optimal transmission point and perform link adaptation through trial-and-error experience, even in a hard-to-predict situation in which at least one transmission point is placed at random in the communication environment 310. For example, in the communication system 300, the resource manager 320 may perform a resource management action 312 in each state 311 of the communication environment 310 in order to receive a reward 313 such as the maximum data rate. In doing so, the resource manager 320 may set parameters and select the optimal transmission point and link adaptation scheme. That is, the communication system 300 may provide, through the resource manager 320, a model for optimizing the communication environment 310 by using the reward 313 that is fed back in response to the action 312 taken in a given state 311.

In the communication system 300, the resource manager 320 may operate so as to maximize the reward 313, such as the data rate, based on trial-and-error experience. For example, the resource manager 320 may define the states 311 of the given communication environment 310 and, in each state 311, perform an action 312 for resource management to obtain a reward 313. Here, the resource manager 320 may perform exploration, that is, take a random action 312 regardless of the reward 313, in order to try new actions it has not yet experienced. The resource manager 320 may balance this exploration against exploitation, that is, taking the action 312 that maximizes the reward 313.

In the communication system 300, when defining the reward 313 and the state 311, the resource manager 320 combines the independent conditions that the SON functions must satisfy, and may therefore avoid conflicts between parameters.

In the communication system 300, the resource manager 320 may provide a scheme for selecting, through trial-and-error learning, the transmission point that yields the maximum data rate. In addition, the resource manager 320 may provide, through trial-and-error learning, a link adaptation scheme for nearby transmission points that yields the maximum data rate.

FIG. 4 is a conceptual diagram illustrating a deep Q-network learning operation in a communication system according to an embodiment of the present invention.

Referring to FIG. 4, a resource manager (not shown) in the communication system may derive Q values 440-1 to 440-n for the actions 430-1 to 430-n that are determined, through learning of the deep Q-network 420, based on the state 410 of the communication environment. Here, the communication system may be the same as or similar to the communication system 300 of FIG. 3. The resource manager may operate in a manner the same as or similar to the resource manager 320 of FIG. 3. The communication environment may be the same as or similar to the communication environment 310 of FIG. 3. The learning of the deep Q-network 420 is described below with reference to FIG. 5.

FIG. 5 is a conceptual diagram illustrating a deep Q-network operation of updating weights in a communication system according to an embodiment of the present invention.

Referring to FIG. 5, the deep Q-network learning unit may include an input layer 520 that receives a state s as an input 510, and a first operation unit 530 that operates on the values passed through the input layer 520. The deep Q-network learning unit may further include a second operation unit 540 that applies weights to the values output from the first operation unit 530, and an output layer 550 that outputs actions based on the values received from the second operation unit 540. The resource manager may output (Y) the Q values through the deep Q-network learning unit. The resource manager may operate in a manner the same as or similar to the resource manager of FIG. 4.

FIG. 6 is a flowchart illustrating the operation of a resource manager in a communication system according to an embodiment of the present invention.

Referring to FIG. 6, the resource manager may receive radio channel state information from the terminal (S601). The resource manager may receive transmission point state information from the terminal (S602). Based on the radio channel state information and the transmission point state information, the resource manager may determine the transmission point and the link adaptation scheme that provide the maximum data rate (S603). The resource manager may transmit information on the transmission point that provides the maximum data rate to the terminal (S604). The resource manager may transmit information on the link adaptation scheme that provides the maximum data rate to the terminal (S605).

The resource manager may operate in a manner the same as or similar to the resource manager of FIG. 5. The detailed operation of the resource manager is described below with reference to FIGS. 7 to 10.

FIG. 7 is a flowchart illustrating the operation of a Q-learning resource manager that determines a transmission point through Q-learning in a communication system according to an embodiment of the present invention.

In the communication system, the Q-learning resource manager may receive feedback information from each of a plurality of terminals (S701). For example, the feedback information may include radio channel state information measured by the terminal, channel interference information, load information of the transmission point, and handover ratio information of the transmission point.

Here, the radio channel state information may include a common reference signal (CRS) and a channel state information reference signal (CSI-RS). The channel interference information may include a CSI interference measurement (CSI-IM).

The load information of the transmission point may include information on the data rate of the transmission point per terminal. A handover failure of the transmission point may include a handover that is faster or slower than a predetermined rate, or a ping-pong handover.

The CSI-RS may indicate a predefined value rs according to the strength of the received signal, where rs = 1, 2, ..., S. The CSI-IM may indicate a predefined value im according to the strength of the interference signal, where im = 1, 2, ..., I.

The load information of the transmission point may indicate a predefined value l, where l = 1, 2, ..., L. The handover ratio information of the transmission point may indicate a predefined value h, where h = 1, 2, ..., H. For example, h may represent a handover failure ratio or a handover success ratio. When the handover failure ratio or success ratio of the transmission point is 0.1 or more, h = 1. When the ratio is 0.01 or more and 0.1 or less, h = 2. In general, when the ratio is $10^{-H}$ or more and $10^{-(H-1)}$ or less, h = H.

Here, S, I, L, and H are natural numbers.
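As a rough illustration of how a handover ratio could be mapped to the level h, the following Python sketch buckets the ratio by powers of ten. The function name `handover_level`, the half-open bucket boundaries, and the clamping of very small ratios to H are assumptions made for this example, not details given in the description.

```python
def handover_level(ratio: float, num_levels: int) -> int:
    """Map a handover failure (or success) ratio to a level h in 1..num_levels.

    Assumed bucketing: ratios of 10**-1 or more map to h = 1, and otherwise
    h = k when 10**-k <= ratio < 10**-(k-1). Ratios below 10**-num_levels
    are clamped to num_levels (an assumption of this sketch).
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in the interval (0, 1]")
    if num_levels < 1:
        raise ValueError("num_levels must be a natural number")
    for h in range(1, num_levels):
        if ratio >= 10.0 ** -h:
            return h
    return num_levels


if __name__ == "__main__":
    for ratio in (0.5, 0.05, 0.005, 1e-9):
        print(ratio, handover_level(ratio, num_levels=4))
```

The loop form avoids logarithms, so no rounding near bucket edges depends on `math.log10`.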

The action by which the Q-learning resource manager selects a transmission point may be denoted by a, and the number of transmission points that the Q-learning resource manager can select from may be denoted by P. A selected transmission point may be marked with 1 and an unselected one with 0. For example, when P = 17, the action may be written as a = (0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0). That is, when P = 17, a is an ordered tuple of 17 entries, each of which is 0 or 1.
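The action encoding above can be sketched in Python as follows. The helper name `encode_action` and the zero-based indexing of transmission points are assumptions of this example.

```python
def encode_action(selected_points, num_points):
    """Return a tuple of length num_points with 1 at each selected index.

    selected_points uses zero-based indices (an assumption of this sketch).
    """
    action = [0] * num_points
    for index in selected_points:
        if not 0 <= index < num_points:
            raise IndexError(f"transmission point index {index} out of range")
        action[index] = 1
    return tuple(action)


if __name__ == "__main__":
    # Selecting the 4th and 5th of P = 17 transmission points.
    print(encode_action([3, 4], 17))
```

Using a tuple keeps the action hashable, so it can serve directly as a key in a Q table.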

The Q-learning resource manager may update the Q value (S702). In its initial operation, the Q-learning resource manager may set the Q value to 0. The resource manager may also set S, I, L, and H, which correspond to the feedback information, to 0.

In the communication system, the resource manager may determine the Q value by defining the state of the communication environment, the reward produced by an action taken in that state, a random value ε, a learning rate α, and a discount factor γ. A maximized Q value indicates that the policy used to determine the state, action, and reward during learning is optimal. The Q-learning resource manager may determine the Q value through two nested loops, an inner loop and an outer loop. Here, the inner loop iterates over the sequence of input states, and the outer loop iterates over the sequence of time steps. For example, the Q-learning resource manager may run the inner loop over the radio communication state variables rs, im, l, and h, and may run the outer loop over time t = 1, 2, ..., T. During loop initialization, the Q-learning resource manager may set the Q value to 0 for every state and action, and then run the loops.
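The initialization and loop structure described above could look like the following Python sketch. The dictionary layout of the Q table and the names `init_q_table` and `count_loop_visits` are assumptions for illustration; the description does not specify a data structure.

```python
from itertools import product


def state_space(S, I, L, H):
    """Enumerate every state (rs, im, l, h) with each component starting at 1."""
    return product(range(1, S + 1), range(1, I + 1),
                   range(1, L + 1), range(1, H + 1))


def init_q_table(S, I, L, H, actions):
    """Set Q to 0 for every (state, action) pair before the loops start."""
    return {(state, action): 0.0
            for state in state_space(S, I, L, H)
            for action in actions}


def count_loop_visits(S, I, L, H, T):
    """Walk the outer loop over time and the inner loop over states."""
    visits = 0
    for _t in range(1, T + 1):                   # outer loop: t = 1, ..., T
        for _state in state_space(S, I, L, H):   # inner loop: (rs, im, l, h)
            visits += 1
    return visits


if __name__ == "__main__":
    q = init_q_table(2, 3, 2, 2, actions=("tp1", "tp2"))
    print(len(q), count_loop_visits(2, 3, 2, 2, T=5))
```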

The Q-learning resource manager may update the new state that results from the selected action, together with the Q value. For example, the Q-learning resource manager may update the Q value $Q_t(s, a)$ for each state s and action a at time t as shown in Equation 1 below.

Here, $r_{t-1}(s, a)$ denotes the reward that the Q-learning resource manager obtains by performing action a in state s at time t-1, and $\max_{a'} Q_{t-1}(v, a')$ denotes the maximum Q value that the Q-learning resource manager obtains over the actions $a'$ performed in state v.

That is, the Q-learning resource manager may update the Q value $Q_t(s, a)$ by multiplying the maximum Q value $\max_{a'} Q_{t-1}(v, a')$ by the discount factor $\gamma$, adding the reward $r_{t-1}(s, a)$, multiplying the resulting sum $r_{t-1}(s, a) + \gamma \max_{a'} Q_{t-1}(v, a')$ by the learning rate $\alpha$, and then adding the Q value at time t-1, $Q_{t-1}(s, a)$, multiplied by $(1 - \alpha)$:

$$Q_t(s, a) = (1 - \alpha)\, Q_{t-1}(s, a) + \alpha \left[ r_{t-1}(s, a) + \gamma \max_{a'} Q_{t-1}(v, a') \right]$$

Here, the learning rate $\alpha$ is the ratio between the new value obtained from experience and the previously learned value. That is, the Q value may be updated stochastically rather than deterministically.
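The update rule described above can be sketched as a tabular Q-learning step in Python. The dictionary-based table, the function name `q_update`, and the weighting of the old value by (1 - α) are assumptions of this example; the last follows the usual Q-learning form rather than a symbol preserved in the text.

```python
def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Apply one Q-learning update and return the new Q value.

    q maps (state, action) pairs to Q values; missing entries count as 0.
    """
    # Largest Q value over all actions available in the next state v.
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old_value = q.get((state, action), 0.0)
    # Blend the old estimate with the newly experienced return.
    new_value = (1.0 - alpha) * old_value + alpha * (reward + gamma * best_next)
    q[(state, action)] = new_value
    return new_value


if __name__ == "__main__":
    table = {}
    print(q_update(table, "s", "tp1", 10.0, "v", ["tp1", "tp2"], 0.5, 0.9))
```

With α close to 1 the newest experience dominates; with α close to 0 the previously learned value changes slowly.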

In addition, once sufficiently many experiences have been accumulated, the Q-learning resource manager may use, as its exploration and exploitation scheme, not the fixed ε-greedy scheme but a scheme in which the random value ε is generated repeatedly and decreased according to the number of iterations.
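One possible way to decrease ε with the iteration count is sketched below in Python. The specific schedule, ε divided by one plus the iteration index, and the name `decayed_epsilon` are assumptions of this example; the text does not state which decay rule is used.

```python
def decayed_epsilon(initial_epsilon: float, iteration: int) -> float:
    """Return an exploration rate that shrinks as the iteration count grows.

    Assumed schedule: epsilon_k = initial_epsilon / (1 + k).
    """
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    return initial_epsilon / (1 + iteration)


if __name__ == "__main__":
    for k in (0, 1, 3, 9):
        print(k, decayed_epsilon(1.0, k))
```

Early iterations then explore almost always, while later iterations mostly exploit the learned Q values.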

The Q-learning resource manager may generate a random value (S702) and compare the generated random value with a predetermined constant ε. When the random value is smaller than ε, the Q-learning resource manager performs a random action (S705). When the random value is larger than ε, it performs the action that provides the largest Q value in the current state (S704). This operation may be referred to as the ε-greedy scheme.
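The ε-greedy selection just described can be sketched in Python as follows. The function name `epsilon_greedy`, the dictionary of Q values keyed by action, and the injected random generator are assumptions made for this illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action from q_values (a dict mapping action -> Q value).

    A uniform draw below epsilon triggers a random action (exploration);
    otherwise the action with the largest Q value is chosen (exploitation).
    """
    draw = rng.random()  # uniform in [0.0, 1.0)
    if draw < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


if __name__ == "__main__":
    rng = random.Random(0)
    q_row = {"tp1": 1.0, "tp2": 3.0, "tp3": 2.0}
    print(epsilon_greedy(q_row, epsilon=0.1, rng=rng))
```

Passing the generator in explicitly makes runs reproducible when a seed is fixed.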

When the transmission point that maximizes the Q value changes, the Q-learning resource manager may provide handover information indicating the change of transmission point (S706). For example, the Q-learning resource manager may determine the data rate of each transmission point based on the load information of the transmission points. The Q-learning resource manager may also determine the total data rate over all transmission points based on the load information of the transmission points.

The Q-learning resource manager may calculate the data rate of each transmission point based on the updated Q value $Q_t(s, a)$, the Q value before the update $Q_{t-1}(s, a)$, the action a of selecting a transmission point in the state s = (rs, im, l, h), the state s, and the data rate $r_{t-1}(s, a)$ obtained when action a is performed. For example, the Q-learning resource manager may calculate the data rate $d_i$ of the i-th transmission point through Equation 2 below.

When different transmission points have the same data rate, the Q-learning resource manager may select the transmission point with the lower load.
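The tie-breaking rule above can be expressed compactly in Python. The tuple layout `(name, data_rate, load)` and the function name `pick_transmission_point` are assumptions of this sketch.

```python
def pick_transmission_point(candidates):
    """Return the name of the best candidate.

    candidates is a list of (name, data_rate, load) tuples. The highest
    data rate wins; among equal data rates, the lowest load wins.
    """
    best = max(candidates, key=lambda c: (c[1], -c[2]))
    return best[0]


if __name__ == "__main__":
    points = [("tp1", 100.0, 0.7), ("tp2", 100.0, 0.3), ("tp3", 80.0, 0.1)]
    print(pick_transmission_point(points))
```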

The Q-learning resource manager may transmit information on the transmission point to the terminal (S706). For example, the Q-learning resource manager may transmit to the terminal information on the transmission point that provides the maximum data rate, or information on fewer than three arbitrarily selected transmission points. After the operation ends, the Q-learning resource manager may return to step S701 and repeat all of the steps.

In the communication system, after Q-learning has been configured, the Q-learning resource manager may take the state of the communication environment as input and perform the action that provides the optimal reward, that is, the maximum data rate. The Q-learning resource manager may operate in a manner the same as or similar to the resource manager of FIG. 6.

FIG. 8 is a flowchart illustrating the operation of a Q-learning resource manager that determines a link adaptation scheme through Q-learning in a communication system according to an embodiment of the present invention.

The Q-learning resource manager may operate in a manner the same as or similar to the Q-learning resource manager of FIG. 7. In the following description, operations that overlap with those of the Q-learning resource manager of FIG. 7 are omitted.

Referring to FIG. 8, the Q-learning resource manager may receive feedback information from the terminal (S801). The feedback information may include channel quality information (CQI), a precoding matrix index (PMI), a rank indication (RI), and interference information. The CQI may indicate a predefined value CQI, where CQI = 1, 2, ..., CQ. The RI may indicate a predefined value RI according to the radio channel state, where RI = 1, 2, ..., R. The PMI may indicate a predefined value PMI, where PMI = 1, 2, ..., PM. The interference information may be the CSI-IM. The CSI-IM may indicate a predefined value IM according to the strength of the interference signal, where IM = 1, 2, ..., I. Here, CQ, R, PM, and I are natural numbers.

The Q-learning resource manager may update the Q value based on the feedback information (S802). In its initial operation, the Q-learning resource manager may set the Q value to 0. The Q-learning resource manager may also set CQI, RI, PMI, and IM, which correspond to the feedback information, to 0. When the operation is not the initial operation, the Q-learning resource manager may update the Q value based on Equation 1.

The Q-learning resource manager may run an inner loop and an outer loop. For example, the Q-learning resource manager may run the inner loop over the radio communication state variables CQI, RI, PMI, and IM, and may run the outer loop over time t = 1, 2, ..., T.

The Q-learning resource manager may select a random value (S802) and compare the selected random value with ε. When the random value is smaller than ε, the Q-learning resource manager performs a random action (S805). When the random value is larger than ε, it performs the action that provides the largest Q value in the current state (S804).

The Q-learning resource manager determines, as the link adaptation scheme, the combination of MCS, HARQ, and antenna transmission scheme that maximizes the Q value.
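Treating each (MCS, HARQ, antenna transmission scheme) combination as one action, the selection above reduces to an argmax over the Q values of the current state. In the Python sketch below, the function names and the example values for MCS indices, HARQ modes, and antenna modes are hypothetical placeholders, not values taken from the description.

```python
from itertools import product


def link_adaptation_actions(mcs_levels, harq_modes, antenna_modes):
    """Enumerate every (MCS, HARQ, antenna) combination as one action."""
    return list(product(mcs_levels, harq_modes, antenna_modes))


def best_link_adaptation(q_row):
    """Return the combination with the largest Q value in the current state.

    q_row maps (mcs, harq, antenna) tuples to Q values.
    """
    return max(q_row, key=q_row.get)


if __name__ == "__main__":
    actions = link_adaptation_actions(
        [0, 1], ["chase", "ir"], ["diversity", "multiplexing"])
    q_row = {action: 0.0 for action in actions}
    q_row[(1, "ir", "multiplexing")] = 2.5
    print(best_link_adaptation(q_row))
```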

For example, the Q-learning resource manager may determine the data rate of each transmission point based on the load information of the transmission points. The Q-learning resource manager may also determine the total data rate over all transmission points based on the load information of the transmission points.

The Q-learning resource manager may transmit information on the link adaptation scheme to the terminal (S807). For example, the resource manager may transmit to the terminal information on the MCS, HARQ, and antenna transmission scheme that constitute the link adaptation scheme providing the maximum data rate. Alternatively, the Q-learning resource manager may transmit to the terminal information on an arbitrarily selected MCS, HARQ, and antenna transmission scheme. After the operation ends, the Q-learning resource manager may return to step S801 and repeat all of the steps.

In addition, in the communication system, after a deep Q-network has been configured, the Q-learning resource manager may take the state of the communication environment as input and perform the action that provides the optimal reward, that is, the maximum data rate. In this case, the Q-learning resource manager may be referred to as a deep Q-network learning resource manager. The operation of the deep Q-network learning resource manager is described below with reference to FIGS. 9 and 10.

FIG. 9 is a flowchart illustrating the operation of a deep Q-network learning resource manager that determines a transmission point through deep Q-network learning in a communication system according to an embodiment of the present invention.

The deep Q-network learning resource manager may operate in a manner the same as or similar to the Q-learning resource manager of FIG. 8. In the following description, operations that overlap with those of the Q-learning resource manager of FIG. 8 are omitted.

Referring to FIG. 9, in the communication system, the deep Q-network learning resource manager may receive feedback information from each of a plurality of terminals (S901). This operation may be the same as or similar to the operation in which the Q-learning resource manager of FIG. 7 receives feedback information (S701).

The deep Q-network learning resource manager may update the weights at a predetermined period (S902). For example, the deep Q-network learning resource manager may define the state of the communication environment, the action, the reward, the random value ε, and the discount factor γ in the communication system. The deep Q-network learning resource manager may determine the weights that make up the network for deep Q-network learning through two nested loops, an inner loop and an outer loop. The inner loop iterates over the sequence of input states, and the outer loop iterates over the sequence of time steps. At the start of the loops, the deep Q-network learning resource manager may initialize the network weights to random values and then run the loops.

The deep Q-network learning resource manager may update the state that results from the action selected in a given state as the new state, and may perform the calculation for updating the weights as shown in Equation 3 below.

Here, $\gamma$ denotes the discount factor, and the two weight terms denote the weight of the hypothesis (prediction) model and the weight of the actual model network, respectively.

The deep Q-network learning resource manager may update the weights to the values that minimize the square of the difference between the hypothesis model network (hypothesis network) term, expressed as the product of the input s and the hypothesis model weight, and the actual model network weight term. That is, the deep Q-network learning resource manager may update the weights to the values that minimize the cost, or loss.
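Equation 3 itself is not reproduced in the text, so the following Python sketch uses the squared temporal-difference error that is commonly used as the loss in deep Q-networks as a stand-in. Whether Equation 3 matches this form term for term is an assumption; the function name `td_loss` and its parameters are hypothetical.

```python
def td_loss(q_predicted, reward, gamma, next_q_actual):
    """Squared difference between a bootstrapped target and a prediction.

    q_predicted   : Q value from the hypothesis (prediction) network for the
                    action actually taken.
    reward        : reward observed after taking that action.
    gamma         : discount factor.
    next_q_actual : Q values for every action in the next state, computed
                    with the actual model network weights.
    """
    target = reward + gamma * max(next_q_actual)
    return (target - q_predicted) ** 2


if __name__ == "__main__":
    print(td_loss(q_predicted=1.0, reward=2.0, gamma=0.5,
                  next_q_actual=[0.0, 4.0]))
```

Minimizing this quantity over the hypothesis network weights pulls the prediction toward the target computed with the separately held weights.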

The deep Q-network learning resource manager may update the hypothesis model network weight on every loop iteration. In contrast, the deep Q-network learning resource manager may update the actual model network weight not on every loop iteration but at a predetermined loop period. Because errors may occur if the hypothesis model network weight and the actual network weight are computed without being separated, the deep Q-network learning resource manager may compute the two weights independently, that is, separately from each other.

That is, to find the weights that minimize the square of the difference between the hypothesis model network weight and the actual network weight, the deep Q-network learning resource manager may compute the hypothesis model network weight on every loop iteration and, at a predetermined loop period, update the actual network weight to the hypothesis model network weight.
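The two update rates described above can be sketched in Python as follows. The names `train_with_periodic_sync`, `update_fn`, and `sync_period`, and the representation of weights as a plain list of floats, are assumptions for illustration only.

```python
def train_with_periodic_sync(weights, update_fn, num_steps, sync_period):
    """Update the hypothesis weights every step and copy them into the
    actual network weights only every sync_period steps.

    Returns (hypothesis_weights, actual_weights, sync_steps).
    """
    hypothesis = list(weights)
    actual = list(weights)      # kept separate from the hypothesis weights
    sync_steps = []
    for step in range(1, num_steps + 1):
        hypothesis = update_fn(hypothesis)   # updated on every iteration
        if step % sync_period == 0:
            actual = list(hypothesis)        # periodic copy
            sync_steps.append(step)
    return hypothesis, actual, sync_steps


if __name__ == "__main__":
    def bump(w):
        # Stand-in for a real gradient step.
        return [x + 1.0 for x in w]

    print(train_with_periodic_sync([0.0], bump, num_steps=7, sync_period=3))
```

Between copies the actual network weights stay fixed, so the target used in the loss does not move on every step.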

The deep Q-network learning resource manager may generate a random value (S903). The deep Q-network learning resource manager may generate a random value ε and, when the weight is smaller than the random value ε, perform a random action. In contrast, when the weight is larger than the random value ε, the deep Q-network learning resource manager may collect the action that gives the largest reward in the current state. Here, collecting actions may mean performing experience replay, in which actions are not executed directly but are buffered, and only a few of them are selected at random and performed. Because many samples in experience replay are correlated with one another, which can lead to erroneous calculations, the deep Q-network learning resource manager may select samples at random so as to remove the correlation.
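A minimal experience replay buffer along the lines described above might look like this in Python. The class name `ReplayBuffer`, the fixed capacity, and the `(state, action, reward, next_state)` transition layout are assumptions of this sketch.

```python
import random
from collections import deque


class ReplayBuffer:
    """Buffer transitions and hand back uniformly random batches."""

    def __init__(self, capacity, seed=None):
        # A bounded deque discards the oldest transition once full.
        self._items = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition):
        """Store one (state, action, reward, next_state) tuple."""
        self._items.append(transition)

    def sample(self, batch_size):
        """Draw up to batch_size distinct transitions at random, which
        breaks up the correlation between consecutive samples."""
        count = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), count)

    def __len__(self):
        return len(self._items)


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=5, seed=0)
    for step in range(8):
        buffer.add((step, "tp1", 1.0, step + 1))
    print(len(buffer), len(buffer.sample(3)))
```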

The deep Q-network learning resource manager may select a random value (S903) and compare the selected random value with ε. When the random value is smaller than ε, the deep Q-network learning resource manager selects a transmission point at random (S706). When the random value is larger than ε, it may select the transmission point that provides the largest Q value in the current state (S705). After the operation ends, the deep Q-network learning resource manager may return to step S901 and repeat all of the steps.

FIG. 10 is a flowchart illustrating the operation of a deep Q-network learning resource manager that determines a link adaptation scheme through deep Q-network learning in a communication system according to an embodiment of the present invention.

The deep Q-network learning resource manager may operate in a manner the same as or similar to the deep Q-network learning resource manager of FIG. 9. In the following description, operations that overlap with those of the deep Q-network learning resource manager of FIG. 9 are omitted.

The deep Q-network learning resource manager may receive feedback information from the terminal (S1001). This operation may be the same as or similar to operation S801 of FIG. 8, in which the Q-learning resource manager receives feedback information from the terminal.

The deep Q-network learning resource manager may update the weights based on the feedback information (S1002). In its initial operation, the manager may set the weights to zero, and may also set the CQI, RI, PMI, and IM values corresponding to the feedback information to zero. When the operation is not the initial operation, the manager may update the weights based on Equation 3. The weight update operation may be the same as or similar to the weight update operation (S902) of the deep Q-network learning resource manager of FIG. 9.

The deep Q-network learning resource manager may perform inner and outer loop operations. For example, the manager may perform an inner loop over the wireless communication states CQI, RI, PMI, and IM. In addition, the manager may perform an outer loop over time t = 1, 2, ..., T.
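One way to picture the two loops is the following Python sketch. The text does not say how the inner loop traverses the states, so iterating over every combination of CQI, RI, PMI, and IM values is an assumption, as are the function name `iterate_states` and its parameters.

```python
from itertools import product


def iterate_states(T, cqi_values, ri_values, pmi_values, im_values):
    """Yield (t, (cqi, ri, pmi, im)) pairs, with time as the outer loop."""
    for t in range(1, T + 1):  # outer loop: t = 1, 2, ..., T
        # Inner loop: every combination of the channel-state values (assumed).
        for state in product(cqi_values, ri_values, pmi_values, im_values):
            yield t, state
```

For example, `list(iterate_states(2, [1, 2], [1], [0], [0]))` visits both CQI values at t = 1 before moving on to t = 2.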

The deep Q-network learning resource manager selects a random value (S1003) and compares it with ε. If the random value is smaller than ε, the manager performs random transmission point selection (S1006). If the random value is larger than ε, the manager selects the transmission point that provides the largest Q value in the current state (S1005).

The deep Q-network learning resource manager may determine the MCS, HARQ, and antenna transmission scheme, which together constitute the resource management (action) that maximizes the Q value (S1005).

For example, the deep Q-network learning resource manager may determine the data rate of each transmission point based on the load information of the transmission points. In addition, the manager may determine the total data rate across all transmission points based on the load information of the transmission points.

The deep Q-network learning resource manager may transmit information on the link adaptation scheme to the terminal (S1007). For example, the manager may transmit to the terminal information on the MCS, HARQ, and antenna transmission scheme that form the link adaptation scheme providing the maximum data rate. Alternatively, the manager may transmit to the terminal information on a randomly selected MCS, HARQ, and antenna transmission scheme. After the operation ends, the manager may return to step S1001 and repeat the preceding steps.
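The choice between the Q-maximizing and the randomly selected link adaptation scheme can be sketched by treating each (MCS, HARQ, antenna scheme) combination as one action. In the Python sketch below, the candidate values in `MCS_LEVELS`, `HARQ_MODES`, and `ANTENNA_SCHEMES`, the dictionary `q_of_action`, and the default Q value of 0.0 for unseen actions are all hypothetical and are not taken from the specification.

```python
import random
from itertools import product

# Hypothetical candidate sets, for illustration only.
MCS_LEVELS = ["QPSK-1/2", "16QAM-3/4", "64QAM-5/6"]
HARQ_MODES = ["chase", "ir"]
ANTENNA_SCHEMES = ["diversity", "spatial-multiplexing"]

# One action is one (MCS, HARQ, antenna scheme) combination.
ACTIONS = list(product(MCS_LEVELS, HARQ_MODES, ANTENNA_SCHEMES))


def choose_link_adaptation(q_of_action, epsilon, rng=random):
    """Pick a link adaptation action for the current state.

    q_of_action: dict mapping an action tuple to its Q value (assumed layout).
    """
    if rng.random() < epsilon:
        # Exploration: any MCS, HARQ, and antenna transmission scheme.
        return rng.choice(ACTIONS)
    # Exploitation: the combination with the largest Q value.
    return max(ACTIONS, key=lambda a: q_of_action.get(a, 0.0))
```

Enumerating the combinations keeps the sketch short; a real action space could be much larger, in which case the network would typically output one Q value per action instead of a lookup table.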

In addition, after constructing the deep Q-network, the deep Q-network learning resource manager in the communication system may input the state of the communication environment and perform the action that provides the maximum data rate, which is the optimal reward.

Conventional resource management schemes handle states and resource management (actions) using only exploitation (utilization) behavior, which provides the maximum reward at each instant. In contrast, the scheme proposed by the present invention performs exploration behavior together with exploitation, so it covers more states and resource management actions and provides an opportunity to obtain a higher data rate.
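A toy Python example can illustrate why adding exploration may help. Everything here is hypothetical: two actions with fixed data rates, Q values initialized to zero, and a simplified update that stores the observed rate directly instead of applying the Q-learning or deep Q-network update of the specification.

```python
import random


def run(rates, epsilon, steps, seed=0):
    """Return the total reward of an epsilon-greedy agent on fixed per-action rates."""
    rng = random.Random(seed)
    q = [0.0] * len(rates)  # Q values start at zero
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(len(rates))  # exploration
        else:
            a = max(range(len(rates)), key=lambda i: q[i])  # exploitation
        q[a] = rates[a]  # simplification: rewards are deterministic
        total += rates[a]
    return total


greedy_total = run([1.0, 5.0], epsilon=0.0, steps=1000)
explore_total = run([1.0, 5.0], epsilon=0.1, steps=1000)
```

In this toy setting the purely greedy agent tries the first action, sees a positive rate, and never leaves it, so `greedy_total` is 1000.0. The agent with ε = 0.1 eventually tries the second action and then prefers it, so `explore_total` ends up higher.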

The methods according to the present invention may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable medium may be specially designed and constructed for the present invention, or may be known and available to those skilled in computer software.

Examples of computer-readable media include hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as at least one software module in order to perform the operations of the present invention, and vice versa.

Although the present invention has been described with reference to the above embodiments, those skilled in the art will understand that the present invention can be variously modified and changed without departing from the spirit and scope of the invention as set forth in the claims below.

Claims

A method of operating a resource manager connected to a plurality of transmission points in a communication system, the method comprising:
Receiving wireless channel state information and transmission point state information from a terminal;
Determining a transmission point and a link adaptation scheme that provide a maximum data rate based on the wireless channel state information and the transmission point state information; and
Transmitting, to the terminal, information on the transmission point that provides the maximum data rate and information on the link adaptation scheme,
wherein the transmission point and the link adaptation scheme providing the maximum data rate are determined through a Q-learning (quality-learning) scheme or a deep Q-network learning scheme.

The method according to claim 1,
wherein the method of operating the resource manager further comprises:
Updating a quality (Q) value based on the wireless channel state information and the transmission point state information; and
Generating a random value.

The method according to claim 2,
wherein the wireless channel state information includes at least one of a channel state information (CSI)-reference signal (RS), CSI-interference measurement (CSI-IM), channel quality information (CQI), a precoding matrix index (PMI), a rank indication (RI), and interference state information,
the transmission point state information includes at least one of a transmission rate of a transmission point and a handover rate of the transmission point, and
the Q value indicates a maximum data rate determined based on the wireless channel state information and the respective transmission point state information.

The method according to claim 3,
wherein the method of operating the resource manager further comprises:
Selecting a mode of exploration or utilization; and
Determining a number of transmission points among the plurality of transmission points so as to maximize the Q value.

The method according to claim 4,
wherein the method of operating the resource manager further comprises:
Determining, based on the Q value, the transmission point that provides the largest transmission rate among the plurality of transmission points when the Q value is a maximum value and the handover rate exceeds a predetermined threshold rate.

The method according to claim 3,
wherein the method of operating the resource manager further comprises:
Determining the Q value through utilization behavior; and
Determining an MCS, HARQ, and antenna transmission mode when the Q value is determined through exploration behavior.

The method according to claim 6,
wherein the method of operating the resource manager further comprises:
Determining the MCS, HARQ, and antenna transmission mode based on the maximum value of the Q value.

The method according to claim 3,
wherein the method of operating the resource manager further comprises:
Updating a weight value based on the wireless channel state information and the transmission point state information; and
Generating a Q value.

The method according to claim 8,
wherein the method of operating the resource manager further comprises:
Selecting an action in an exploration manner or a utilization manner with respect to the Q value; and
Determining, in the case of the exploration manner, an arbitrary number of transmission points among the plurality of transmission points.

The method according to claim 9,
wherein the method of operating the resource manager further comprises:
Determining the transmission point that provides the largest transmission rate among the plurality of transmission points when the Q value is maximum and the handover rate exceeds a predetermined threshold rate.

The method according to claim 8,
wherein the method of operating the resource manager further comprises:
Determining a behavior manner of utilization or exploration; and
Determining, in the case of exploratory behavior, an arbitrary MCS, HARQ, and antenna transmission mode.

The method according to claim 11,
wherein the method of operating the resource manager further comprises:
Determining, when the behavior is performed in the utilization manner, the MCS, HARQ, and antenna transmission mode according to the maximum Q value.

A resource manager connected to a plurality of transmission points in a communication system, the resource manager comprising:
A processor; and
A memory storing at least one instruction executed by the processor,
wherein the at least one instruction is executed to:
Receive wireless channel state information and transmission point state information from a terminal; determine a transmission point and a link adaptation scheme that provide a maximum data rate based on the wireless channel state information and the transmission point state information; and transmit, to the terminal, information about the transmission point that provides the maximum data rate and information about the link adaptation scheme,
wherein the transmission point and the link adaptation scheme providing the maximum data rate are determined through a Q-learning (quality-learning) scheme or a deep Q-network learning scheme.

The resource manager according to claim 13,
wherein the at least one instruction is further executed to
update a quality (Q) value based on the wireless channel state information and the transmission point state information.

The resource manager according to claim 14,
wherein the wireless channel state information includes at least one of a channel state information (CSI)-reference signal (RS), CSI-interference measurement (CSI-IM), channel quality information (CQI), a precoding matrix index (PMI), a rank indication (RI), and interference state information,
the transmission point state information includes at least one of a transmission rate of a transmission point and a handover rate of the transmission point, and
the Q value indicates a maximum data rate determined based on the wireless channel state information and the respective transmission point state information.

The resource manager according to claim 13,
wherein the at least one instruction is further executed to:
when the behavior is exploratory,
determine an arbitrary number of transmission points among the plurality of transmission points; and
update a quality (Q) value based on the wireless channel state information and the transmission point state information, and generate a random value.

The resource manager according to claim 13,
wherein the at least one instruction is further executed to:
determine a behavior manner and, when the behavior is exploratory, determine an arbitrary MCS, HARQ, and antenna transmission mode; and
when the Q value calculation scheme is utilization, determine the MCS, HARQ, and antenna transmission mode that maximize the Q value.

The resource manager according to claim 13,
wherein the at least one instruction is further executed to
update a weight value based on the wireless channel state information and the transmission point state information, and generate a Q value.

The resource manager according to claim 18,
wherein the at least one instruction is further executed to determine a behavior manner and, when the behavior manner is exploration, determine an arbitrary number of transmission points among the plurality of transmission points,
and, when the behavior manner is utilization, determine a transmission point based on the maximum Q value.

The resource manager according to claim 18,
wherein the at least one instruction is further executed to:
determine a behavior manner of utilization or exploration;
in the case of exploration, determine an arbitrary MCS, HARQ, and antenna transmission mode; and
in the case of utilization, determine the MCS, HARQ, and antenna transmission mode that provide the maximum Q value.