KR101877243B1

KR101877243B1 - AP apparatus clustering method using neural network based on reinforcement learning and cooperative communication apparatus using neural network based on reinforcement learning

Info

Publication number: KR101877243B1
Application number: KR1020170053025A
Authority: KR
Inventors: 조동호; 이혁준; 지동진; 정배렬
Original assignee: 한국과학기술원
Priority date: 2017-04-25
Filing date: 2017-04-25
Publication date: 2018-07-11

Abstract

An AP device clustering method using a reinforcement learning-based neural network includes: identifying the distribution of terminals in a cell of a first AP device by using channel state information; determining, based on the distribution of the terminals, at least one candidate AP device capable of serving a specific area of the cell together with the first AP device; determining at least one second AP device among the candidate AP devices by using a reinforcement learning-based neural network that takes as input the position of the first AP device, the position of the candidate AP device, the distribution of the terminals, and the channel state information of the terminals; and clustering the first AP device and the second AP device. The method can thereby provide high-quality service to terminals located at the boundary of a cell.

Description

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an AP apparatus clustering method using a neural network based on reinforcement learning and a cooperative communication apparatus using a neural network based on reinforcement learning.

The technology described below relates to cooperative communication among mobile communication AP devices.

Various techniques are being studied to meet the data demand that grows as the number of mobile communication devices increases. For a terminal located in the boundary area of a cell served by an AP device such as a base station, the AP device may cooperate with an adjacent AP device to provide service. That is, a plurality of AP devices cooperate to provide a communication service to one terminal. One example is Coordinated Multi-Point (CoMP), introduced in LTE-Advanced.

Cooperative communication requires selecting the plurality of AP devices that will jointly provide the service. The process of selecting these AP devices is called clustering. Clustering techniques fall broadly into dynamic clustering and static clustering. Dynamic clustering performs clustering in real time according to the locations of terminals, whereas static clustering performs clustering using a predetermined pattern.

Korean Patent Publication No. 10-2012-0060627

Dynamic clustering reflects terminal locations in real time and therefore imposes a large system overhead. Static clustering has difficulty providing stable service when terminals move out of their expected positions or when traffic surges. The technology described below aims to provide clustering among AP devices using a reinforcement learning-based neural network, specifically a deep Q-network (DQN).

An AP device clustering method using a reinforcement learning-based neural network includes: identifying the distribution of terminals in a cell of a first AP device using channel state information; determining, based on the distribution of the terminals, at least one candidate AP device capable of serving a specific area of the cell together with the first AP device; determining at least one second AP device among the at least one candidate AP device using a reinforcement learning-based neural network that takes as input the location of the first AP device, the location of the candidate AP device, the distribution of the terminals, and the channel state information of the terminals; and clustering the first AP device and the at least one second AP device.

A cooperative communication apparatus using a reinforcement learning-based neural network includes: a storage device that stores the parameters of the neural network and the locations of neighboring AP devices; an antenna that receives channel state information from terminals in a cell; and a control circuit. The control circuit determines, based on the distribution of terminals in the cell identified using the channel state information, at least one candidate AP device among the neighboring AP devices that is capable of serving a specific area of the cell. It then inputs the location of the candidate AP device, the distribution of the terminals, and the channel state information of each terminal into the reinforcement learning-based neural network to determine at least one target AP device among the at least one candidate AP device.

The technology described below uses reinforcement learning with a reinforcement learning-based neural network to provide clustering that is optimal for the situation, thereby providing high-quality service to terminals located in the boundary area of a cell.

FIG. 1 is an example of a communication environment for cooperative communication.
FIG. 2 is an example of clustering for cooperative communication.
FIG. 3 is an example of a sigmoid function.
FIG. 4 is an example of a flowchart for Q-learning.
FIG. 5 is an example of clustering through reinforcement learning.
FIG. 6 is an example of a DQN.
FIG. 7 is an example of the post-learning process of the DQN.
FIG. 8 is an example of clustering using the DQN.
FIG. 9 is another example of clustering using the DQN.
FIG. 10 is yet another example of clustering using the DQN.

The technology described below may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the technology to specific embodiments, and it should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

Terms such as first, second, A, and B may be used to describe various components, but the components are not limited by these terms, which are used only to distinguish one component from another. For example, without departing from the scope of the technology described below, a first component may be referred to as a second component, and similarly a second component may be referred to as a first component. The term "and/or" includes a combination of a plurality of related listed items or any one of the plurality of related listed items.

As used in this specification, singular expressions should be understood to include plural expressions unless the context clearly indicates otherwise. Terms such as "includes" mean that the stated features, numbers, steps, operations, components, parts, or combinations thereof are present, and should be understood not to exclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Before describing the drawings in detail, it should be made clear that the components in this specification are distinguished only by the main function each one performs. That is, two or more components described below may be combined into one component, or one component may be divided into two or more components by more finely subdivided function. In addition to its own main function, each component described below may additionally perform some or all of the functions of other components, and some of the main functions of each component may be carried out entirely by another component.

Likewise, in carrying out a method or an operating method, the steps of the method may occur in an order different from the stated order unless the context clearly specifies a particular order. That is, the steps may occur in the stated order, may be performed substantially simultaneously, or may be performed in reverse order.

The technology described below relates to cooperative communication among a plurality of AP devices, and specifically to clustering among a plurality of AP devices for cooperative communication. An AP device may be a mobile communication AP (base station), a small cell AP, a WiFi AP, or the like. Cooperative communication may be performed among APs of the same type or in a heterogeneous network such as one combining macro cells and small cells. The AP devices may use the same communication scheme or, in some cases, different communication schemes. For convenience, the description below assumes cooperative communication among AP devices such as mobile communication base stations.

FIG. 1 is an example of a communication environment for cooperative communication. AP devices are deployed over a given area. FIG. 1 shows the entire area divided into an n×n grid. For convenience, one AP device is shown in each square. Terminals (UEs) are distributed randomly over the entire area, and their positions may change over time. For example, the AP device 10A can provide communication service to the terminals 5A and 5B. The terminal 5B, located in the boundary area of the cell, may also be served by another AP device 10B, so the AP device 10A and the AP device 10B can interfere with each other. In this case, if the AP device 10A and the AP device 10B perform cooperative communication, they can provide communication service to the terminal 5B without interference. The process of determining which AP devices to cluster for cooperative communication is described below.

A reinforcement learning-based neural network is used to determine the AP devices that perform cooperative communication. For example, a DQN can be used, and the description below focuses on the DQN. The DQN is an algorithm that adds a value network to Q-learning so that reinforcement learning can proceed over a larger state space.

Conventional Q-learning can learn effectively in an environment that moves within a limited number of states. As the state space grows, however, storing the Q values becomes a problem. A Q value is the measured value of the value function for a state. For example, the positions and distribution of terminals can change without limit and the number of possible combinations is very large, so storing a Q value for every situation is inefficient.

The DQN solves this problem by estimating the function that determines the Q value instead of storing individual Q values. Whereas conventional Q-learning stores states in a table and obtains the Q value by lookup, the DQN feeds the current state into the value network and extracts the Q value as the output. The DQN can approximate the function that determines the Q value using a value network of three or more layers.

Q-learning is a reinforcement learning algorithm built from an environment, an agent, states, actions, and rewards. When the agent takes an action, it moves to a new state. For the action taken, the agent receives two kinds of reward from the environment: an immediate reward and a future reward. The immediate reward is the reward that arises directly from the action, and the future reward is the reward for the future environment that results from the action. The agent's ultimate goal is to update the Q value so as to maximize both rewards. This can be expressed as Equation 1 below.

Here s denotes the state, a the action, and r the reward. γ is the discount factor, with a value between 0 and 1: the closer it is to 0, the more weight is given to the present reward, and the closer it is to 1, the more weight is given to the future reward. In the present invention it is set to 0.5 so that present and future rewards are weighted equally. α_t is the learning rate, with a value between 0 and 1, and determines how quickly the Q value is learned. For example, if α_t = 0 the agent does not learn, and if α_t = 1 the agent learns using only the most recent information. Because the agent must learn from past Q values, α_t is assumed to be set to 1.
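The role of γ and α_t can be illustrated with the standard tabular Q-learning rule, Q(s,a) ← (1 − α_t)·Q(s,a) + α_t·(r + γ·max Q(s',a')). This is a minimal sketch that assumes Equation 1 has this standard form; the dictionary-based table, the state and action labels, and the function name `q_update` are illustrative and do not come from the patent.

```python
from collections import defaultdict

GAMMA = 0.5   # discount factor stated in the description
ALPHA = 1.0   # learning rate assumed in the description


def q_update(q_table, state, action, reward, next_state, actions,
             alpha=ALPHA, gamma=GAMMA):
    """Apply one standard Q-learning update and return the new Q value."""
    # Best value reachable from the next state (the "future reward" term).
    best_next = max(q_table[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    old = q_table[(state, action)]
    q_table[(state, action)] = (1 - alpha) * old + alpha * target
    return q_table[(state, action)]


# Hypothetical states and actions, used only to exercise the function.
q = defaultdict(float)
q[("s1", "keep")] = 4.0
print(q_update(q, "s0", "merge", reward=2.0, next_state="s1",
               actions=["keep", "merge"]))  # 2.0 + 0.5 * 4.0 = 4.0
```

With α_t = 1 the old entry is fully replaced by the new target, which is why only the most recent observation determines the stored value.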

In the clustering process, the action is the clustering itself and the reward is the performance (throughput) that results from the clustering. The agent is the entity that performs clustering and may be an AP device. Alternatively, a separate control device located in the mobile communication core network may act as the agent.

Referring to FIG. 1, terminals are distributed randomly over the entire area and may move over time. It is assumed that a user's location can be determined from channel state information (CSI), which describes the state of the user's channel. The Q-learning state is defined as in Equation 2 below.

Here C denotes the base station identification number, C ∈ {1, 2, 3, ..., N}. UE denotes the terminal identification number, UE ∈ {user₁, user₂, user₃, ..., user_M}. CI denotes the CSI information, CI ∈ {CSI₁, CSI₂, CSI₃, ..., CSI_M}. For example, with three AP devices and five terminals, where the three base stations serve {1, 2, 2} terminals respectively, the state can be generated as follows: Q_t(s) = [(1, 2, 3), (user₁, {user₂, user₃}, {user₄, user₅}), (CSI₁, {CSI₂, CSI₃}, {CSI₄, CSI₅})].
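One way to hold such a state in code is an immutable record of the three tuples, which can then be used as a dictionary key or passed to a network encoder. The sketch below reproduces the example with per-AP terminal counts {1, 2, 2}; the class name `ClusterState`, the helper `build_state`, and the string labels are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ClusterState:
    """State s = (C, UE, CI), grouped per AP device."""
    ap_ids: Tuple[int, ...]
    users: Tuple[Tuple[str, ...], ...]
    csi: Tuple[Tuple[str, ...], ...]


def build_state(users_per_ap):
    """Number APs from 1 and assign consecutive terminal labels to each AP."""
    ap_ids, users, csi = [], [], []
    next_user = 1
    for ap_id, count in enumerate(users_per_ap, start=1):
        indices = range(next_user, next_user + count)
        ap_ids.append(ap_id)
        users.append(tuple(f"user{i}" for i in indices))
        csi.append(tuple(f"CSI{i}" for i in indices))  # placeholder CSI labels
        next_user += count
    return ClusterState(tuple(ap_ids), tuple(users), tuple(csi))


state = build_state([1, 2, 2])
print(state.ap_ids)  # (1, 2, 3)
print(state.users)   # (('user1',), ('user2', 'user3'), ('user4', 'user5'))
```

Because the record is frozen and built from tuples, it is hashable, so a tabular learner could index Q values by it directly.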

The agent's action depends on the surroundings of the AP device. For example, the action may differ depending on whether the surrounding area is a road used by vehicles or a pedestrian zone. For an AP device near a road, the number K of AP devices forming a cluster can be fixed, and the cluster can be formed according to the shape of the road (the expected direction of movement).

FIG. 2 is an example of clustering for cooperative communication. FIG. 2(a) is an example of clustering AP devices deployed in a road area. When K = 2, two adjacent AP devices near the road can be selected to form a cluster.

FIG. 2(b) is an example of clustering for a pedestrian zone. In a pedestrian zone, a cluster can be formed in the following steps. (1) CSI information is used to find the zone with the most users in its boundary area. (2) As shown in FIG. 2(b), one neighboring AP device is selected relative to the AP device of that zone to form a cluster. (a) The neighboring AP device selected is the one that maximizes interference cancellation. (b) If several neighboring AP devices have the same interference level, all of them are clustered. FIG. 2(b) shows an example in which the neighboring AP device that can best remove the interference experienced by terminals in the boundary area is selected: the AP device 20A selects the AP device 20B located below it.
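The two pedestrian-zone steps can be sketched as a pair of small selection functions. This is an illustrative sketch only: the dictionaries of per-zone boundary user counts and per-neighbor removable interference are hypothetical inputs, and the patent does not specify how either quantity is measured.

```python
def busiest_boundary_zone(boundary_users):
    """Step (1): return the zone with the most users in its boundary area."""
    return max(boundary_users, key=boundary_users.get)


def select_neighbors(interference_removed):
    """Step (2): return the neighbor AP(s) that remove the most interference.

    Rule (a) picks the maximum; rule (b) keeps every neighbor tied at it.
    Exact equality is used for ties, which is a simplification.
    """
    if not interference_removed:
        return []
    best = max(interference_removed.values())
    return sorted(ap for ap, gain in interference_removed.items() if gain == best)


# Hypothetical measurements for illustration.
print(busiest_boundary_zone({"20A": 7, "20C": 2}))             # 20A
print(select_neighbors({"20B": 3.0, "20D": 1.5}))              # ['20B']
print(select_neighbors({"20B": 3.0, "20D": 3.0, "20E": 1.0}))  # ['20B', '20D']
```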

The right side of FIG. 2(b) shows the AP device acting as the agent device. The AP device includes a storage device 21 that stores the DQN parameters and other information, a control circuit 22 that trains the DQN and determines the clustering configuration, and an antenna 23 that communicates with terminals. The antenna 23 can receive channel state information from terminals in the cell. The control circuit 22 determines, based on the distribution of terminals in the cell identified using the channel state information, at least one candidate AP device among the neighboring AP devices that is capable of serving a specific area of the cell. It then inputs the location of the candidate AP device, the distribution of the terminals, and the channel state information of each terminal into the DQN to determine at least one target AP device among the at least one candidate AP device. The AP device and the target AP device are then clustered and communicate cooperatively. As described later, the storage device 21 may also store actions and rewards for later training.

The reward is the value returned for the action taken by the agent, and the performance or throughput of the terminals can be used for it. The reward can be set as in Equation 3 and Equation 4 below.

Here S is a modified sigmoid function used to compute e_t, which corrects the reward. FIG. 3 is an example of the sigmoid function. T_lb is the performance value of the lowest 5% of overall performance, and T_avg is the average of overall performance. The 5% figure is one example. As T_lb increases, the reward increases, so the cluster configuration that raises overall terminal performance is maintained. Conversely, when T_lb is small, the shape of the sigmoid function shown in FIG. 3 causes the reward to drop sharply, so the agent revises its existing clustering action and forms a different cluster configuration to serve users. The sigmoid function of FIG. 3 has a small derivative near 0 and 1 in its domain and a larger derivative as the input approaches 0.5. When the lowest-5% performance is close to the average performance, only a small penalty is imposed. When the lowest-5% performance falls below a certain level, a large penalty is imposed, which also helps guarantee capacity for terminals in the boundary area.
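The qualitative behavior described here can be illustrated with a small sketch. Because the exact forms of Equation 3 and Equation 4 are not given in this text, everything below is an assumption: the logistic curve centered at 0.5, its steepness, the use of the ratio T_lb / T_avg as the sigmoid input, and the product e_t · T_avg as the reward. It is meant only to show how a sigmoid-shaped correction penalizes a weak lowest-5% throughput.

```python
import math


def shaped_sigmoid(x, steepness=10.0):
    """Logistic curve on [0, 1], flat near 0 and 1, steepest at 0.5 (assumed form)."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - 0.5)))


def reward(throughputs, lower_fraction=0.05):
    """Assumed reward: average throughput scaled by a sigmoid of T_lb / T_avg."""
    ordered = sorted(throughputs)
    # T_lb: throughput at the lower-5% point of the sorted list (assumed definition).
    index = max(0, math.ceil(len(ordered) * lower_fraction) - 1)
    t_lb = ordered[index]
    t_avg = sum(ordered) / len(ordered)
    if t_avg == 0:
        return 0.0
    e_t = shaped_sigmoid(t_lb / t_avg)
    return e_t * t_avg


balanced = [10.0] * 10             # worst terminal matches the average
starved = [1.0] + [11.0] * 9       # same average, one terminal far below it
print(reward(balanced) > reward(starved))  # True
```

Both throughput lists have the same average, so the difference in reward comes entirely from the correction term, which is the effect the paragraph attributes to the sigmoid.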

FIG. 4 is an example of a flowchart for the Q-learning process 100. The agent identifies the current state s(C, UE, CSI) (110). The agent obtains the Q value using the DQN (120). The agent selects an action that determines the clustering configuration according to the Q value (130). The agent then observes the reward resulting from the action (140). If learning has not ended, the agent stores its action and the resulting reward (150). This process repeats until learning ends, and through it the agent prepares the DQN used to decide clustering. The agent can learn while clustering in a real environment, or it can use sample data to learn in advance. As noted above, the agent may be any one of the AP devices, or it may be another control device that receives information from the AP devices, for example a control device located in the mobile communication core network.
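The loop of steps 110 to 150 can be sketched with the environment and the network passed in as plain callables. This is a hedged outline rather than the patent's implementation: `observe_state`, `q_value`, `apply_action`, and the fixed `max_steps` stopping rule are hypothetical stand-ins, and the state is stored alongside the action and reward as an added convenience.

```python
def learning_episode(observe_state, q_value, actions, apply_action, max_steps):
    """Run the observe / score / act / store loop and return the stored samples."""
    memory = []
    for _ in range(max_steps):
        state = observe_state()                            # step 110
        scores = {a: q_value(state, a) for a in actions}   # step 120
        action = max(scores, key=scores.get)               # step 130
        observed = apply_action(action)                    # step 140
        memory.append((state, action, observed))           # step 150
    return memory


# Stub environment and value function, for illustration only.
samples = learning_episode(
    observe_state=lambda: "s0",
    q_value=lambda s, a: {"cluster_AB": 1.0, "cluster_AC": 2.0}[a],
    actions=["cluster_AB", "cluster_AC"],
    apply_action=lambda a: 3.0,
    max_steps=2,
)
print(samples)  # [('s0', 'cluster_AC', 3.0), ('s0', 'cluster_AC', 3.0)]
```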

FIG. 5 is an example of the clustering process 200 through reinforcement learning. FIG. 5 assumes that a DQN trained according to FIG. 4 is available. The agent identifies the current state s(C, UE, CSI) (210). The agent obtains the Q value using the trained DQN (220). The agent selects an action that determines the clustering configuration according to the Q value (230). The agent then observes the reward resulting from the action (240). The agent determines whether the reward for the current action is greater than the previous reward (250). The agent may judge the reward to be greater when the current reward exceeds the previous reward by more than a given threshold. In other words, the agent determines whether the clustering improves terminal performance by a sufficient margin.

If the current reward exceeds the previous reward by the required margin, the agent changes the cluster according to the action (260). If the current reward is not greater than the previous reward, the agent leaves the cluster unchanged. The agent checks whether communication has ended (270) and repeats the entire process until it does.
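The decision in steps 250 and 260 amounts to a threshold test on the improvement in reward. The sketch below shows that test in isolation; the function name, the tuple representation of a cluster, and the choice to keep the earlier reward as the baseline when the change is rejected are assumptions made for illustration.

```python
def decide_cluster(current_cluster, previous_reward,
                   proposed_cluster, observed_reward, threshold):
    """Adopt the proposed cluster only if the reward improves by more than threshold."""
    if observed_reward - previous_reward > threshold:
        return proposed_cluster, observed_reward   # step 260: change the cluster
    return current_cluster, previous_reward        # otherwise keep the cluster


# Hypothetical clusters named after the AP devices they contain.
print(decide_cluster(("10A", "10B"), 5.0, ("10A", "10C"), 5.05, 0.1))
# (('10A', '10B'), 5.0): improvement too small, cluster kept
print(decide_cluster(("10A", "10B"), 5.0, ("10A", "10C"), 5.5, 0.1))
# (('10A', '10C'), 5.5): improvement exceeds the threshold, cluster changed
```

A non-zero threshold avoids reconfiguring the cluster for marginal gains, which fits the stated aim of limiting the overhead of fully dynamic clustering.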

To build an effective value network, the characteristics of the state must be reflected. The clustering environment here has a two-dimensional structure of AP devices and terminals. AP devices with many terminals in their boundary areas can increase capacity by removing interference through clustering. When most terminals are close to their AP devices and barely move, it is more efficient to operate the AP devices individually. Using an artificial neural network that captures this two-dimensional structure well as the value network therefore helps improve performance.

A convolutional neural network (CNN) is the artificial neural network structure best suited to capturing such a two-dimensional structure. A CNN consists of several convolutional layers and several fully connected layers. A convolutional layer extracts 2-D structure from the observed state using a convolution mask and shared weights. Stacking convolutional layers makes it possible to find increasingly complex features, and the fully connected layers then derive the Q value from these features. A technique often used in CNNs is max pooling, which keeps only the largest value within the area covered by the mask; it lowers complexity and serves as a mechanism for translational invariance.
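Max pooling is easy to show on a small grid. The following sketch applies a non-overlapping 2×2 window to a list-of-lists array using only the standard library; the sample values are arbitrary and stand in for a feature map produced by a convolutional layer.

```python
def max_pool_2x2(grid):
    """Keep the largest value in each non-overlapping 2x2 window.

    A trailing odd row or column, if any, is dropped.
    """
    rows, cols = len(grid), len(grid[0])
    return [
        [
            max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


feature_map = [
    [1, 2, 0, 0],
    [3, 4, 0, 5],
    [0, 0, 9, 1],
    [7, 0, 1, 1],
]
print(max_pool_2x2(feature_map))  # [[4, 5], [7, 9]]
```

The 4×4 input shrinks to 2×2, and a strong response keeps its value even if it shifts by one position inside its window, which is the invariance property mentioned above.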

FIG. 6 is an example of a DQN, showing the value network described above. The first convolutional layer receives as input the location of the current AP device, the distribution of terminals, and the CSI of each terminal. This layer uses a 5×5 convolution mask to find low-level features, meaning simple features such as the terminal distribution and density between two AP devices. The next two layers use 3×3 convolution masks to find high-level features. High-level features are those that can be inferred from the low-level features, such as the spatial distribution of AP device pairs with many terminals and the movement patterns of terminals over time.

The last layer performs 2×2 max pooling. Max pooling keeps only the single largest value within an n×n mask, which may appear to lower accuracy because it discards data. After this layer, all output values are fed into the fully connected layers. The fully connected layers may have 1000 dimensions in the first layer, followed by 100 and then 10. The number of neuron outputs is reduced gradually so that only the important features remain. Finally, the 10 outputs are gathered into a single neuron to derive the Q value. The value network structure shown in FIG. 6 is one example; an actual DQN may use a value network with a different structure.
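The layer sequence (a 5×5 mask, two 3×3 masks, 2×2 max pooling, then fully connected layers of width 1000, 100, 10, and 1) can be checked with simple shape arithmetic. The sketch below assumes stride 1 and no padding for the convolutions and a single feature channel, none of which is stated in the description; the 16×16 input size is likewise only an example.

```python
FC_WIDTHS = (1000, 100, 10, 1)  # fully connected widths from the description


def spatial_sizes(n, conv_masks=(5, 3, 3), pool=2):
    """Side length of the feature map after each layer (stride 1, no padding assumed)."""
    sizes = [n]
    for mask in conv_masks:
        n = n - mask + 1
        sizes.append(n)
    sizes.append(n // pool)
    return sizes


def fc_weight_count(flattened_inputs, widths=FC_WIDTHS):
    """Number of weights (biases excluded) in the fully connected stack."""
    total, fan_in = 0, flattened_inputs
    for width in widths:
        total += fan_in * width
        fan_in = width
    return total


sizes = spatial_sizes(16)
print(sizes)                           # [16, 12, 10, 8, 4]
print(fc_weight_count(sizes[-1] ** 2))  # 16*1000 + 1000*100 + 100*10 + 10*1 = 117010
```

Under these assumptions most of the weights sit in the first two fully connected layers, which is one reason the pooling step before them reduces cost.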

The value network is trained following the basic DQN training procedure. First, an action is applied to the communication environment, that is, the clustering configuration is changed, and the reward that follows the action is observed. The agent stores each observed action and reward pair in the storage device. The agent trains the value network at regular training periods: in each period it uses the stored actions and rewards to train and can then update the DQN.

FIG. 7 is an example of the post-learning process (300) of the DQN. The agent checks the action and reward information stored in the storage device (310). The agent loads the DQN variables (320), which are stored in the storage device in advance. The agent trains the DQN using the stored actions and rewards (330). The agent then uses the retrained DQN to observe the reward that follows an action (340). The agent repeats the learning process (350) until all of the sample data (actions and rewards) in the storage device has been used and learning ends. Finally, the agent sets the variables of the newly trained DQN (360). The newly set variables can be stored in the storage device.
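The steps of process 300 can be laid out as a short sketch. To keep it self-contained, a dictionary of per-action value estimates stands in for the DQN variables, and a simple running-average update stands in for the network's training step; both are simplifications assumed here, not the method of the source. The function name `post_learning`, the `storage` layout, and the learning rate are likewise hypothetical.

```python
def post_learning(storage, observe_reward, lr=0.5):
    """Sketch of steps 310-360 using a value table in place of a neural network."""
    samples = list(storage["samples"])        # 310: check stored actions and rewards
    q = dict(storage["variables"])            # 320: load previously stored variables
    observed = []
    for action, reward in samples:            # 350: repeat until all samples are used
        old = q.get(action, 0.0)
        q[action] = old + lr * (reward - old)            # 330: learn from the sample
        best = max(q, key=q.get)                         # action preferred after update
        observed.append((best, observe_reward(best)))    # 340: observe resulting reward
    storage["variables"] = q                  # 360: set and store the new variables
    return q, observed


if __name__ == "__main__":
    storage = {
        "samples": [("A", 1.0), ("B", 3.0), ("A", 2.0)],
        "variables": {"A": 0.0, "B": 0.0},
    }
    q, observed = post_learning(storage, observe_reward=lambda action: 0.0)
    print(q)
```

In a real DQN, step 330 would be a gradient update of the value network and step 340 would query the live communication environment; the control flow around them is what this sketch is meant to show.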

Several examples of clustering using the DQN described above are presented below.

FIG. 8 is an example of clustering using the DQN. FIG. 8 depicts a performance venue, here a stadium, while a performance is under way. Many terminals gather at the venue for the duration of the performance and attempt to communicate at the same time. Because performances take place at irregular times, the AP devices around the stadium come under very heavy load at unpredictable times. An existing static clustering scheme is designed for only one of the two situations, performance in progress or no performance, so QoS is difficult to guarantee. If reinforcement learning clustering is performed with the DQN as described above, the clustering of the AP devices can be changed as the terminal density changes, thereby increasing capacity. For example, AP devices around the stadium can be clustered with AP devices whose surrounding terminal density is low, so that the communication resources of the low-density AP devices are put to use. Resources that the low-density AP devices would otherwise leave unused are thus shared with the AP devices around the stadium, increasing the total capacity available to terminals in the stadium.

FIG. 9 is another example of clustering using the DQN, showing an urban area. In an urban area, terminal traffic rises sharply during particular periods (commuting hours) and subsides afterward. Existing static clustering cannot adapt clusters to terminal traffic that changes by time of day, so network capacity degrades. Reinforcement learning clustering using the DQN seeks to maximize a reward determined by throughput; by forming clusters only in areas where terminal traffic is increasing, it removes interference between AP devices and increases network capacity. In addition, during the morning and early-morning hours when terminals move at high speed, it can form clusters that follow the road layout, reducing the number of terminal handovers and providing stable network capacity. Dynamic clustering, by contrast, must reflect the state changes of a large number of terminals in real time, so system overhead grows substantially and it is difficult to apply to a real network.
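The text states that the reward is determined by throughput but gives no formula. The following is a minimal sketch, assuming the reward is the total throughput of the served terminals and that actions are chosen epsilon-greedily (a common DQN exploration strategy that the source does not mention). The function names and the action labels are hypothetical.

```python
import random


def throughput_reward(terminal_throughputs_mbps):
    """Hypothetical reward: total throughput over the served terminals."""
    return sum(terminal_throughputs_mbps)


def select_action(q_values, epsilon=0.1, rng=random):
    """Epsilon-greedy choice over candidate clustering actions.

    q_values maps each candidate action to its estimated Q value.
    With probability epsilon a random action is explored; otherwise
    the action with the highest Q value is chosen.
    """
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_values[a])


if __name__ == "__main__":
    q = {"no_cluster": 1.0, "cluster_with_AP2": 3.0}
    print(select_action(q, epsilon=0.0))
    print(throughput_reward([10.0, 5.5]))
```

With `epsilon=0.0` the choice is purely greedy, which is how a trained agent would typically act once exploration is no longer needed.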

FIG. 10 is yet another example of clustering using the DQN, showing a disaster situation. When a disaster occurs, nearby AP devices are destroyed and the number of rescue personnel increases, so the amount of data traffic that the remaining operable AP devices must temporarily handle surges. Static clustering applies a fixed cluster pattern without recognizing the change in circumstances, so it is difficult to provide stable network capacity in a disaster. If reinforcement learning clustering is performed with the DQN, the system recognizes that the reward has dropped sharply and, to raise it, forms clusters among the operable AP devices as shown in FIG. 10, so that AP devices needing additional network capacity share the resources of neighboring AP devices. Network resources can therefore be concentrated on the AP devices that need support, providing the network capacity required for rescue operations. Dynamic clustering, in contrast, demands high computational power, and cannot provide stable network capacity when most operable AP devices have been lost in a disaster.

The present embodiments and the accompanying drawings merely illustrate part of the technical idea included in the technology described above. It is evident that all modifications and specific embodiments that a person skilled in the art can readily infer within the scope of the technical idea contained in the specification and drawings fall within the scope of rights of the technology described above.

5: Terminal
10A, 10B: AP device
20A, 20B: AP device
21: Storage device
22: Control circuit
23: Antenna

Claims

1. An AP device clustering method using a reinforcement learning based neural network, the method comprising:
confirming a distribution of terminals in a cell of a first AP device using channel state information;
determining, based on the distribution of the terminals, at least one candidate AP device capable of serving a specific area of the cell together with the first AP device;
determining at least one second AP device among the at least one candidate AP device using a reinforcement learning based neural network that receives as input the location of the first AP device, the location of the candidate AP device, the distribution of the terminals, and the channel state information of the terminals; and
clustering the first AP device and the at least one second AP device.

2. The method of claim 1,
wherein the distribution of the terminals is confirmed using channel state information (CSI) for the terminals located in the cell.

3. The method of claim 1,
wherein the specific area is the area, within the boundary region of the cell, in which the largest number of terminals are located.

4. The method of claim 1,
wherein the first AP device or a control device of a core network determines the second AP device using the neural network.

5. The method of claim 1,
wherein the first AP device and the second AP device provide a service to a target terminal located in the specific area through cooperative communication.

6. The method of claim 5,
further comprising measuring the performance of the cooperative communication for the target terminal, and updating the neural network using the clustering of the first AP device and the second AP device and the measured performance.

7. The method of claim 1,
wherein the candidate AP device is determined in consideration of a movement path in the area where the first AP device is located.

8. The method of claim 1,
wherein the reinforcement learning based neural network is a neural network that outputs state information including a specific AP device and a specific location by using learning data including the locations of AP devices, the distribution of terminals, and channel state information.

9. The method of claim 1,
wherein the reinforcement learning based neural network is a Deep Q-network (DQN).

10. A cooperative communication apparatus using a reinforcement learning based neural network, the apparatus comprising:
a storage device that stores variables of the neural network and the locations of adjacent AP devices;
an antenna that receives channel state information from terminals in a cell; and
a control circuit that determines, among the adjacent AP devices, at least one candidate AP device capable of serving a specific area, based on a distribution of the terminals in the cell identified using the channel state information, and that determines at least one target AP device among the at least one candidate AP device by inputting the location of the candidate AP device, the distribution of the terminals, and the channel state information of each terminal to the reinforcement learning based neural network.

11. The cooperative communication apparatus of claim 10,
wherein the specific area is the area, within the boundary region of the cell, in which the largest number of terminals are located.

12. The cooperative communication apparatus of claim 10,
wherein the cooperative communication apparatus is an AP device or a control device of a core network.

13. The cooperative communication apparatus of claim 10,
wherein the cooperative communication apparatus provides a service for the specific area through cooperative communication with the target AP device.

14. The cooperative communication apparatus of claim 13,
wherein the reinforcement learning based neural network is updated based on the target AP device and the performance of the cooperative communication.

15. The cooperative communication apparatus of claim 10,
wherein the reinforcement learning based neural network is a Deep Q-network (DQN).