WO2018151344A1

WO2018151344A1 - AP device clustering method using DQN, and cooperative communication device using DQN

Info

Publication number: WO2018151344A1
Application number: PCT/KR2017/001683
Authority: WO
Inventors: 조동호; 이혁준; 지동진; 정배렬
Original assignee: 한국과학기술원
Priority date: 2017-02-16
Filing date: 2017-02-16
Publication date: 2018-08-23

Abstract

An AP device clustering method using a deep Q-network (DQN) comprises the steps of: identifying a distribution of terminals in a cell of a first AP device, by using channel state information; determining one or more candidate AP devices capable of performing a service for a specific area of the cell together with the first AP device, on the basis of the distribution of terminals; determining at least one second AP device from among the one or more candidate AP devices by using a DQN which receives, as inputs, a position of the first AP device, positions of the candidate AP devices, the distribution of terminals, and the channel state information of each terminal; and clustering the first AP device and the at least one second AP device.

Description

AP device clustering method using DQN and cooperative communication device using DQN

The technology described below relates to cooperative communication of a mobile communication AP device.

Various techniques have been studied to meet the growing demand for data as the number of mobile communication devices increases. In one such technique, a terminal located in the boundary area of a cell served by an AP device, such as a base station, is served in cooperation with an adjacent AP device. That is, a plurality of AP devices cooperate to provide a communication service to one terminal. An example is CoMP (Coordinated Multi-Point), proposed for LTE-Advanced.

For cooperative communication, it is necessary to select a plurality of AP devices that provide services in cooperation. The process of selecting a plurality of AP devices is called clustering. Clustering techniques are divided into dynamic clustering and static clustering. Dynamic clustering performs clustering in real time according to the location of the terminal, whereas static clustering uses a predetermined pattern.

Dynamic clustering reflects the location of the terminal in real time, which imposes a large overhead on the system, and static clustering has difficulty providing stable service when a terminal leaves its expected location or traffic increases rapidly. The technology described below is intended to provide clustering between AP devices using a deep Q-network (DQN).

The AP device clustering method using a DQN includes: identifying a distribution of terminals in a cell of a first AP device using channel state information; determining, based on the distribution of terminals, at least one candidate AP device that can serve a specific area of the cell together with the first AP device; determining at least one second AP device from among the at least one candidate AP device using a DQN that takes as inputs the location of the first AP device, the location of the candidate AP device, the distribution of terminals, and the channel state information of each terminal; and clustering the first AP device and the at least one second AP device.

The cooperative communication device using a DQN includes: a storage device for storing DQN variables and the locations of neighboring AP devices; an antenna for receiving channel state information from terminals in the cell; and a control circuit that determines, based on the distribution of terminals in the cell identified using the channel state information, at least one candidate AP device capable of serving a specific area of the cell from among the neighboring AP devices, and that determines at least one target AP device from among the at least one candidate AP device by inputting the location of the candidate AP device, the distribution of terminals, and the channel state information of each terminal to the DQN.

The technology described below provides high quality service to a terminal located in a boundary region of a cell by providing optimal clustering for a situation through reinforcement learning using DQN.

FIG. 1 is an example of a communication environment for cooperative communication.

FIG. 2 is an example of clustering for cooperative communication.

FIG. 3 is an example of a sigmoid function.

FIG. 4 is an example of a flowchart for Q-learning.

FIG. 5 is an example of clustering through reinforcement learning.

FIG. 6 is an example of a DQN.

FIG. 7 is an example of a post-learning process of DQN.

FIG. 8 is an example of clustering using DQN.

FIG. 9 is another example of clustering using DQN.

FIG. 10 is another example of clustering using DQN.

The technology described below may be modified in various ways and may have a variety of embodiments; specific embodiments are illustrated in the drawings and described in detail. However, this is not intended to limit the technology described below to specific embodiments, and it should be understood to include all changes, equivalents, and substitutes that fall within its spirit and scope.

Terms such as first, second, A, and B may be used to describe various components, but the components are not limited by these terms, which are used only to distinguish one component from another. For example, the first component may be referred to as the second component, and similarly, the second component may be referred to as the first component without departing from the scope of the technology described below. The term "and/or" includes a combination of a plurality of related items or any one of a plurality of related items.

As used herein, singular expressions are intended to include the plural as well, unless the context clearly indicates otherwise. Terms such as "comprises" mean that the described features, numbers, steps, operations, components, parts, or combinations thereof are present, and do not exclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Prior to the detailed description of the drawings, it should be made clear that the components in this specification are divided merely according to the main function of each component. That is, two or more components described below may be combined into one component, or one component may be divided into two or more components by function. Each component described below may additionally perform some or all of the functions of other components in addition to its own main functions, and some of the main functions of each component may, of course, be carried out exclusively by another component.

In addition, in carrying out the method or operation method, each process constituting the method may occur differently from the stated order unless the context clearly indicates a specific order. That is, each process may occur in the same order as specified, may be performed substantially simultaneously, or may be performed in the reverse order.

The technology described below relates to a cooperative communication technique for a plurality of AP devices, and in particular to clustering between a plurality of AP devices for cooperative communication. The AP device may be a mobile communication AP (base station), a small cell AP, a WiFi AP, or the like. The cooperative communication may be performed between APs of the same type or in heterogeneous networks such as macro cells and small cells. In addition, the AP devices may use the same communication method or, in some cases, different communication methods. For convenience of explanation, the description below assumes cooperative communication between AP devices such as mobile communication base stations.

FIG. 1 is an example of a communication environment for cooperative communication. AP devices are arranged in a certain area. FIG. 1 illustrates an example in which the entire region is divided into an n × n grid. For convenience of explanation, one AP device is shown for each rectangle. Terminals (UEs) are randomly distributed over the entire area and may change position over time. For example, the AP device 10A may provide a communication service to the terminals 5A and 5B. The terminal 5B, located in the boundary region of the cell, may also be served by another AP device 10B. Therefore, the AP device 10A and the AP device 10B may interfere with each other. In this case, if cooperative communication is performed, the AP device 10A and the AP device 10B may provide a communication service to the terminal 5B without interference. Hereinafter, a process of determining an AP device to be clustered for cooperative communication will be described.

First, the DQN will be briefly described. DQN is an algorithm for reinforcement learning in a wider state space by adding a value network to Q-learning technology.

Conventional Q-learning is capable of robust learning in environments that move among a limited number of states. However, as the state space grows, storing the Q values becomes a problem. The Q value is a measure of the value function for a state. For example, the positions and distribution of the terminals can change without limit, and because the number of combinations is very large, it is not efficient to store a Q value for each situation.

DQN solves this problem by estimating a function that determines the Q value, rather than storing individual Q values. Whereas conventional Q-learning stores each state in a table and obtains the Q value through a lookup, the DQN inputs the current state to the value network and obtains the Q value as its output. The DQN can approximate the function that determines the Q value using a value network of three or more layers.

Q-learning is a reinforcement learning algorithm that consists of an environment, an agent, states, actions, and rewards. When the agent takes an action, it moves to a new state. The environment provides two kinds of reward for the action taken by the agent: an immediate reward and a future reward. The immediate reward is the reward received at once for the action, and the future reward is the reward obtained from the future environment that results from the action. Ultimately, the agent's goal is to update the Q value so as to maximize both rewards. This can be expressed as Equation 1 below.
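Equation 1 is not reproduced in this text. Assuming it follows the standard Q-learning update rule, which uses the same symbols as the definitions that follow, it would take this form:

```latex
Q_{t+1}(s_t, a_t) = (1 - \alpha_t)\, Q_t(s_t, a_t)
  + \alpha_t \left[ r_{t+1} + \gamma \max_{a} Q_t(s_{t+1}, a) \right]
```

The bracketed term combines the immediate reward with the discounted best Q value available from the next state.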

Here s is a state, a is an action, and r is a reward. γ is a discount factor between 0 and 1: the closer it is to 0, the greater the importance of the immediate reward, and the closer it is to 1, the greater the importance of the future reward. In the present invention, γ is set to 0.5 so that present and future rewards are considered equally. α_t is the learning rate, with a value between 0 and 1, and determines how quickly the Q value is learned. For example, if α_t = 0, the agent learns nothing. If α_t = 1, the agent learns using only the most recent information. It is assumed here that the agent sets α_t = 1, since the agent must learn from past Q values.
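The roles of γ and α_t can be illustrated with a minimal sketch of a single Q-value update. It assumes the standard Q-learning update form; the function name `q_update` and its arguments are hypothetical and do not come from the specification.

```python
def q_update(q_old, reward, next_q_values, alpha=1.0, gamma=0.5):
    """Return the updated Q value for one (state, action) pair.

    Assumes the standard Q-learning rule:
    Q <- (1 - alpha) * Q + alpha * (reward + gamma * max(next Q values)).
    """
    best_future = max(next_q_values) if next_q_values else 0.0
    target = reward + gamma * best_future
    return (1.0 - alpha) * q_old + alpha * target


# alpha = 1: the old estimate is replaced by the newest target.
print(q_update(q_old=2.0, reward=1.0, next_q_values=[4.0, 6.0]))
# alpha = 0: the old estimate is kept, so nothing is learned.
print(q_update(q_old=2.0, reward=1.0, next_q_values=[4.0, 6.0], alpha=0.0))
```

With γ = 0.5, the future term contributes half of the best next-state Q value to the target.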

In the clustering process, the action is the clustering decision, and the reward is the throughput resulting from that clustering. The agent is the entity that performs clustering and may be an AP device. Alternatively, a separate control device located in the mobile communication core network may be the agent.

Referring to FIG. 1, terminals are randomly distributed over the entire region and may move continually over time. It is assumed that the location of each user can be determined from channel state information (CSI), which represents the state of the user's channel. The state for Q-learning is defined as in Equation 2 below.

Here C represents an AP device (base station) identification number, C ∈ {1, 2, 3, ..., N}. UE represents a terminal identifier, UE ∈ {user₁, user₂, user₃, ..., user_M}. CI represents CSI, CI ∈ {CSI₁, CSI₂, CSI₃, ..., CSI_M}. For example, assuming that there are three AP devices and five terminals, and that the numbers of terminals served by the three AP devices are {1, 2, 2}, the state may be generated as follows: Q_t(s) = [(1, 2, 3), (user₁, {user₂, user₃}, {user₄, user₅}), (CSI₁, {CSI₂, CSI₃}, {CSI₄, CSI₅})].
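The example state can be assembled as in the following sketch. The function `build_state` and the dictionary layout are assumptions made for illustration; the specification does not prescribe a data structure.

```python
def build_state(assignment, csi):
    """Build the state (C, UE, CI) from per-AP terminal assignments.

    assignment: {AP id: [terminal ids served by that AP]}  (hypothetical layout)
    csi:        {terminal id: CSI value for that terminal}
    """
    ap_ids = tuple(sorted(assignment))
    terminals = tuple(tuple(assignment[ap]) for ap in ap_ids)
    channel_info = tuple(tuple(csi[t] for t in assignment[ap]) for ap in ap_ids)
    return (ap_ids, terminals, channel_info)


# Three AP devices serving {1, 2, 2} terminals, as in the example above.
assignment = {1: ["user1"], 2: ["user2", "user3"], 3: ["user4", "user5"]}
csi = {f"user{i}": f"CSI{i}" for i in range(1, 6)}
state = build_state(assignment, csi)
print(state)
```

Tuples are used so that the state is hashable, which would allow it to serve as a key in a tabular Q store as well as an input to a value network.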

The action of the agent depends on the surrounding environment of the AP device. For example, the action may vary depending on whether the environment is a road on which cars travel or a pedestrian zone. In the case of an AP device near a road, the number of AP devices forming a cluster (K) is determined, and a cluster may be formed according to the road shape (the expected movement direction).

FIG. 2 is an example of clustering for cooperative communication. FIG. 2(a) is an example of clustering of AP devices disposed in a road area. When K = 2, a cluster may be formed by selecting two adjacent AP devices along a road.

FIG. 2(b) is an example of clustering for a pedestrian zone. In a pedestrian zone, clusters may be formed by the following steps.

(1) Find the boundary area with the most users by using CSI.
(2) As shown in FIG. 2(b), select one adjacent AP device, relative to the AP device of that area, to form a cluster.
(a) The neighboring AP device selected is the one capable of maximizing interference cancellation.
(b) If the interference magnitudes are the same, all neighboring AP devices are clustered.

FIG. 2(b) shows an example of selecting the neighboring AP device that can eliminate the most interference for a terminal located in a boundary area: the AP device 20A selects the AP device 20B located below it.
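The pedestrian-zone steps above can be sketched as two small selection functions. The names, the area labels, and the numeric interference values are hypothetical; how the interference removed by each neighbor is estimated is not specified here and is taken as given.

```python
def busiest_boundary_area(terminals_per_area):
    """Step (1): return the boundary area holding the most terminals."""
    return max(terminals_per_area, key=terminals_per_area.get)


def select_neighbors(cancelled_interference):
    """Step (2): pick the neighbor that removes the most interference.

    cancelled_interference: {neighbor AP id: interference removed if clustered}
    Rule (b): neighbors that tie for the maximum are all returned.
    """
    if not cancelled_interference:
        return []
    best = max(cancelled_interference.values())
    return sorted(ap for ap, value in cancelled_interference.items() if value == best)


print(busiest_boundary_area({"north": 3, "south": 9, "east": 4}))
print(select_neighbors({"20B": 0.8, "20C": 0.5}))
print(select_neighbors({"20B": 0.5, "20C": 0.5}))
```

In the first selection only 20B is returned; in the second, both neighbors tie, so both are clustered.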

The right side of FIG. 2(b) shows the AP device that acts as the agent. The AP device includes a storage device 21 for storing DQN variables and other information, a control circuit 22 for performing DQN learning and determining the clustering type, and an antenna 23 for communicating with terminals. The antenna 23 may receive channel state information from the terminals in the cell. The control circuit 22 determines at least one candidate AP device capable of serving a specific area of the cell from among the neighboring AP devices, based on the distribution of terminals in the cell identified using the channel state information. The control circuit 22 then determines at least one target AP device from among the at least one candidate AP device by inputting the location of the candidate AP device, the distribution of terminals, and the channel state information of each terminal to the DQN. Thereafter, the AP device and the target AP device are clustered to perform cooperative communication. As will be described later, the storage device 21 may store actions and rewards for later learning.

The reward may be the performance or throughput of the terminals resulting from the action taken by the agent. The reward may be set as in Equations 3 and 4 below.

Here S is used to calculate e_t, which adjusts the reward by means of a modified sigmoid function. FIG. 3 is an example of a sigmoid function. T_lb is the performance of the lowest 5% of terminals, and T_avg is the average of overall performance; 5% is one example. As T_lb increases, the reward increases, so that a cluster form that raises the overall performance of the terminals is maintained. Conversely, when T_lb is small, the reward is greatly reduced owing to the characteristic of the sigmoid function shown in FIG. 3, so that the existing clustering action is modified and another cluster is formed to support the user. The sigmoid function of FIG. 3 has the characteristic that its derivative is small where the function value is near 0 or 1 and becomes larger as the value approaches 0.5. When the lowest-5% performance is close to the average performance, a small penalty is imposed. If the lowest-5% performance falls by a certain degree or more, a large penalty is applied to guarantee the capacity of terminals in the edge area.
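Equations 3 and 4 are not reproduced in this text, so the following is only an illustrative sketch of a sigmoid-shaped reward with the qualitative behavior described above; it is not the patent's formula. The function `shaped_reward`, the parameters `steepness` and `midpoint`, and the reading of T_lb as the mean throughput of the worst 5% of terminals are all assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Largest where sigmoid(x) = 0.5, small where sigmoid(x) is near 0 or 1."""
    s = sigmoid(x)
    return s * (1.0 - s)


def lower_tail_mean(throughputs, fraction=0.05):
    """Assumed T_lb: mean throughput of the worst `fraction` of terminals."""
    ordered = sorted(throughputs)
    count = max(1, int(len(ordered) * fraction))
    return sum(ordered[:count]) / count


def shaped_reward(throughputs, steepness=10.0, midpoint=0.5):
    """Hypothetical reward: T_avg scaled by a sigmoid of the ratio T_lb / T_avg."""
    t_avg = sum(throughputs) / len(throughputs)
    if t_avg == 0:
        return 0.0
    t_lb = lower_tail_mean(throughputs)
    return t_avg * sigmoid(steepness * (t_lb / t_avg - midpoint))


fair = [10.0] * 20             # edge terminals perform like the average
unfair = [1.0] + [10.0] * 19   # one edge terminal is starved
print(shaped_reward(fair), shaped_reward(unfair))
```

Under these assumptions the starved-edge case is penalized far more heavily than its slightly lower average alone would suggest, which is the behavior the text attributes to the sigmoid.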

FIG. 4 is an example of a flowchart for the Q-learning process 100. The agent checks the current state s (C, UE, CSI) (110). The agent obtains a Q value using the DQN (120). The agent selects an action that determines the clustering type according to the Q value (130). The agent then observes the reward resulting from the action (140). If learning is not finished, the agent stores its action and the corresponding reward (150). This process is repeated until learning ends. Through this process, the agent prepares a DQN for determining clustering. The agent can learn while clustering in a live environment, or it can be trained in advance using sample data. The agent may be any one AP device as described above. Alternatively, it may be another control device that receives information from the AP devices. For example, the agent may be a control device located in the core network of the mobile communication system.
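The loop of FIG. 4 can be sketched as follows. The Q function and environment are toy stand-ins supplied by the caller, greedy action selection is assumed because no exploration strategy is described, and all names are hypothetical.

```python
def learning_episode(initial_state, actions, q_function, environment, steps):
    """Run the FIG. 4 loop and return the stored (state, action, reward) samples."""
    memory = []
    state = initial_state                                       # step 110
    for _ in range(steps):
        q_values = {a: q_function(state, a) for a in actions}   # step 120
        action = max(q_values, key=q_values.get)                # step 130 (greedy)
        reward, next_state = environment(state, action)         # step 140
        memory.append((state, action, reward))                  # step 150
        state = next_state
    return memory


# Toy stand-ins: a fixed Q function and an environment that rewards clustering.
def toy_q(state, action):
    return 1.0 if action == "cluster_with_B" else 0.0


def toy_env(state, action):
    reward = 5.0 if action == "cluster_with_B" else 2.0
    return reward, state + 1


samples = learning_episode(0, ["standalone", "cluster_with_B"], toy_q, toy_env, steps=3)
print(samples)
```

The stored samples are what a later training pass would consume to update the value network.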

FIG. 5 is an example of a process 200 of clustering through reinforcement learning. FIG. 5 assumes that a DQN trained according to FIG. 4 is available. The agent checks the current state s (C, UE, CSI) (210). The agent obtains a Q value using the trained DQN (220). The agent selects an action that determines the clustering type according to the Q value (230). The agent then observes the reward for the action (240). The agent determines whether the reward for the current action is greater than the immediately preceding reward (250). The agent may judge the reward to be greater only if the current reward exceeds the previous reward by at least a certain threshold. That is, the agent determines whether the performance of the terminals is consistently improved by the clustering.

If the current reward is consistently greater than the previous reward, the agent changes the cluster according to the action (260). If the current reward is not greater than the previous reward, the agent does not change the cluster. The agent checks whether communication has ended (270) and repeats the whole process until it has.
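The decision in steps 250 and 260 can be sketched as a threshold comparison. The function names and the threshold value are hypothetical; the specification only says that the improvement must exceed a certain threshold.

```python
def should_change_cluster(current_reward, previous_reward, threshold=0.0):
    """Step 250: change only if the reward improves by more than the threshold."""
    return current_reward - previous_reward > threshold


def next_cluster(current_cluster, proposed_cluster, current_reward,
                 previous_reward, threshold=1.0):
    """Step 260: adopt the proposed cluster, or keep the existing one."""
    if should_change_cluster(current_reward, previous_reward, threshold):
        return proposed_cluster
    return current_cluster


print(next_cluster(("10A",), ("10A", "10B"), current_reward=10.0, previous_reward=8.0))
print(next_cluster(("10A",), ("10A", "10B"), current_reward=8.5, previous_reward=8.0))
```

A non-zero threshold keeps the cluster from flipping back and forth on small fluctuations in measured throughput.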

To create an effective value network, the network must reflect the nature of the state. The current clustering environment has a two-dimensional structure composed of AP devices and terminals. AP devices with many terminals in the boundary region can increase capacity by removing interference through clustering. If the terminals are mostly near the AP device and move little, it is more efficient to operate the AP devices individually. Therefore, using as the value network an artificial neural network that can reflect this two-dimensional structure helps to improve performance.

A convolutional neural network (CNN) is an artificial neural network structure well suited to capturing this two-dimensional structure. A CNN consists of several convolutional layers and several fully connected layers. A convolutional layer extracts two-dimensional structure from the observed state through a convolution mask and shared weights. By stacking convolutional layers, more complex features can be found. The fully connected layers then derive the Q value from these complex features. One of the most frequently used techniques in CNNs is max pooling, which keeps only the largest value in the space covered by the mask, thereby reducing complexity and providing a degree of translation invariance.
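Max pooling as described above can be shown with a small pure-Python sketch. The function name and the sample grid of terminal counts are hypothetical; non-overlapping windows are assumed.

```python
def max_pool_2d(grid, size=2):
    """Keep only the largest value in each non-overlapping size x size window."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            max(grid[r + dr][c + dc] for dr in range(size) for dc in range(size))
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


# A 4 x 4 map of terminal counts reduced to 2 x 2.
density = [
    [1, 3, 0, 0],
    [2, 9, 0, 1],
    [0, 0, 4, 4],
    [5, 0, 4, 7],
]
print(max_pool_2d(density))  # [[9, 1], [5, 7]]
```

Moving the 9 anywhere within its 2 x 2 window leaves the output unchanged, which is the limited translation invariance that pooling provides.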

FIG. 6 is an example of a DQN, specifically of the value network described above. The first convolutional layer receives as input the location of the current AP device, the distribution of terminals, and the CSI of each terminal. This layer uses a 5 × 5 convolution mask to find low-level features, meaning simple features such as the terminal distribution and density between any two AP devices. The next two layers use 3 × 3 convolution masks to find high-level features. The high-level features are inferred from the low-level features and indicate, for example, the spatial distribution of pairs of AP devices with many terminals and the movement pattern of terminals over time.

In the last layer, 2 × 2 max pooling is performed. Max pooling keeps only the maximum value within an n × n mask, which can be seen as trading some precision for a reduction in data. After this layer, all output values are fed into the fully connected layers. The fully connected layers may have 1000 dimensions in the first layer, followed by 100 and then 10 dimensions. This gradually reduces the number of neuron outputs so that only the important features remain. Finally, the 10 outputs are collected in one neuron to derive the Q value. The value network structure shown in FIG. 6 is one example; an actual DQN may use a value network of a different structure.
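To make the layer sequence concrete, the following sketch traces tensor shapes through the example network. The grid size, the three input planes (AP location, terminal distribution, CSI), the channel counts, stride 1, and the absence of padding are all assumptions; only the mask sizes and the fully connected widths come from the description.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def value_network_shapes(grid_size, in_channels=3, conv_channels=(16, 32, 32)):
    """List (layer name, output shape) pairs for the example value network."""
    shapes = [("input", (in_channels, grid_size, grid_size))]
    size = grid_size
    layers = zip(("conv1", "conv2", "conv3"), (5, 3, 3), conv_channels)
    for name, kernel, channels in layers:
        size = conv_out(size, kernel)
        shapes.append((f"{name} {kernel}x{kernel}", (channels, size, size)))
    size //= 2                                   # 2 x 2 max pooling
    shapes.append(("maxpool 2x2", (conv_channels[-1], size, size)))
    shapes.append(("flatten", (conv_channels[-1] * size * size,)))
    for units in (1000, 100, 10, 1):             # fully connected layers, then Q
        shapes.append((f"fc {units}", (units,)))
    return shapes


for name, shape in value_network_shapes(20):
    print(name, shape)
```

For a 20 × 20 grid under these assumptions, the convolutional stack reduces the map to 6 × 6 before the fully connected layers narrow it to a single Q value.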

The value network is trained according to the basic DQN training procedure. First, the action is changed in the communication environment, that is, the clustering is changed, and the reward resulting from the action is observed. The agent stores the observed action and reward pairs in the storage device. The agent trains the value network at regular training intervals, using the actions and rewards stored in the storage device to perform training and update the DQN.

FIG. 7 is an example of a post-learning process 300 of the DQN. The agent checks the action and reward information stored in the storage device (310). The agent retrieves the variables of the DQN (320), which are stored in advance in the storage device. The agent trains the DQN using the actions and rewards stored in the storage device (330). The agent then uses the trained DQN again to observe the reward resulting from the action (340). The agent repeats the learning process until learning has been completed using all the sample data (actions and rewards) stored in the storage device (350). Finally, the agent sets the variables of the newly trained DQN (360). The newly set variables may be stored in the storage device.
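The post-learning pass can be sketched in heavily simplified form. Here a linear model stands in for the DQN, each stored sample is reduced to a feature vector and its observed reward, and the training target is the reward alone; all of these are assumptions made to keep the example short, not the training procedure of the specification.

```python
def post_learning(variables, samples, learning_rate=0.1, epochs=50):
    """Update stored model variables from stored (features, reward) samples.

    variables: list of weights retrieved from storage (step 320)
    samples:   list of (feature tuple, observed reward) pairs (step 310)
    """
    weights = list(variables)
    for _ in range(epochs):
        for features, reward in samples:          # steps 330 to 350: use every sample
            predicted = sum(w * x for w, x in zip(weights, features))
            error = reward - predicted
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
    return weights                                # step 360: newly learned variables


stored_samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), 5.0)]
new_variables = post_learning([0.0, 0.0], stored_samples)
print(new_variables)
```

After training, the weights approach the rewards observed for each feature, and they would then be written back to the storage device.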

Hereinafter, some examples of clustering using the aforementioned DQN will be described.

FIG. 8 is an example of clustering using DQN, showing a venue in which a performance is held. In the venue, many terminals attempt to communicate at once during the performance. Since performances take place at irregular times, a very heavy load is placed on the AP devices around the venue at unpredictable times. With the existing static clustering structure, it is difficult to guarantee QoS because only one of the two environments, with or without a performance, is taken into account. If reinforcement learning clustering is performed using the DQN as described above, capacity can be increased by changing the clustering form of the AP devices according to the changing terminal density. For example, an AP device around the venue and an AP device with a low terminal density may be clustered so as to use the communication resources of the latter. Accordingly, the total capacity available to terminals in the venue may be increased by sharing the resources not used by the low-density AP device with the AP devices around the venue.

FIG. 9 is another example of clustering using DQN, showing a downtown situation. In a downtown area, terminal traffic increases sharply during certain periods (commuting hours) and subsides afterwards. With the existing static clustering, network capacity is degraded because clustering cannot respond to terminal traffic that changes from one period to the next. Reinforcement learning clustering using DQN seeks to maximize a reward determined by throughput, so it forms clusters only in areas where terminal traffic increases, which eliminates interference between AP devices and increases network capacity. In addition, in the early morning and dawn hours, when terminals move at high speed, clusters may be formed according to the road shape to reduce the number of handovers, thereby providing stable network capacity. Dynamic clustering, by contrast, is difficult to apply to an actual network because the state changes of many terminals must be reflected in real time, which greatly increases system overhead.

FIG. 10 is another example of clustering using DQN, showing a situation in which a disaster has occurred. In a disaster, neighboring AP devices are destroyed and the number of rescue personnel increases, so the data traffic that the remaining available AP devices must temporarily handle surges. With static clustering, it is difficult to provide stable network capacity in a disaster because the cluster pattern is applied without awareness of the change in the situation. If reinforcement learning clustering is performed using DQN, the system recognizes that the reward has dropped sharply and, to increase the reward, forms clusters among the AP devices that can still operate, as shown in FIG. 10, so that resources are shared. Network resources can therefore be concentrated on the AP devices that need support, providing the network capacity required for the rescue effort. Dynamic clustering, on the other hand, requires high computational power and cannot provide stable network capacity when most of the available AP devices are lost in a disaster.

The embodiments and the drawings attached to this specification merely show clearly a part of the technical idea included in the above-described technology, and it will be apparent that modifications and specific embodiments that those skilled in the art can easily infer within the scope of the technical idea included in the description and drawings are all included in the scope of the above-described technology.

Claims

1. An AP device clustering method using a DQN, comprising:

identifying a distribution of terminals in a cell of a first AP device by using channel state information;

determining at least one candidate AP device capable of serving a specific area of the cell together with the first AP device, based on the distribution of terminals;

determining at least one second AP device from among the at least one candidate AP device by using a deep Q-network (DQN) that takes as input the location of the first AP device, the location of the candidate AP device, the distribution of terminals, and the channel state information of each terminal; and

clustering the first AP device and the at least one second AP device.

2. The method of claim 1, wherein the distribution of terminals is identified by using channel state information (CSI) for terminals located in the cell.

3. The method of claim 1, wherein the specific area is the area in which the most terminals are located among the boundary areas of the cell.

4. The method of claim 1, wherein the at least one second AP device is determined, using the DQN, by the first AP device or by a control device of a core network.

5. The method of claim 1, further comprising providing, by the first AP device and the second AP device in cooperative communication, a service for a target terminal located in the specific area.

6. The method of claim 5, further comprising measuring the performance of the cooperative communication with respect to the target terminal, and updating and storing the DQN by using the clustering of the first AP device and the second AP device and the measured performance.

7. The method of claim 1, wherein the candidate AP device is determined in consideration of a movement path in the region where the first AP device is located.

8. The method of claim 1, wherein the DQN uses learning data including the location of the AP device, the distribution of terminals, and the channel state information to output state information including a specific AP device and a specific location.

9. A cooperative communication device using a DQN, comprising:

a storage device for storing deep Q-network (DQN) variables and locations of neighboring AP devices;

an antenna for receiving channel state information from terminals in a cell; and

a control circuit for determining at least one candidate AP device capable of serving a specific area of the cell from among the neighboring AP devices, based on the distribution of terminals in the cell identified using the channel state information, and for determining at least one target AP device from among the at least one candidate AP device by inputting the location of the candidate AP device, the distribution of terminals, and the channel state information of each terminal to the DQN.

10. The device of claim 9, wherein the specific area is the area in which the most terminals are located among the boundary areas of the cell.

11. The device of claim 9, wherein the cooperative communication device is an AP device or a control device of a core network.

12. The device of claim 9, wherein the cooperative communication device provides a service for the specific area in cooperative communication with the target AP device.

13. The device of claim 12, wherein the cooperative communication device updates the DQN based on the target AP device and the performance of the cooperative communication.