JP7720587B2

JP7720587B2 - Communication device, communication system, and communication method

Info

Publication number: JP7720587B2
Application number: JP2022020561A
Authority: JP
Inventors: 憲一河村; 大輔村山; 俊朗中平; 貴庸守山; めぐみ金子; ディンティハーリー
Original assignee: Nippon Telegraph and Telephone Corp; Inter University Research Institute Corp Research Organization of Information and Systems; NTT Inc USA
Current assignee: Inter University Research Institute Corp Research Organization of Information and Systems; NTT Inc; NTT Inc USA
Priority date: 2022-02-14
Filing date: 2022-02-14
Publication date: 2025-08-08
Anticipated expiration: 2042-02-14
Also published as: JP2023117803A

Description

The present invention relates to packet scheduling in wireless communication systems.

Wireless communication systems have evolved into heterogeneous networks built on multi-band, multi-access systems. In cellular communications, fifth-generation mobile communications (5G) has been put into practical use. It uses a wide range of frequencies, from sub-1 GHz to millimeter-wave bands, and cells of various sizes, from small cells to macrocells, are deployed so that they overlap.

Wireless LAN, another representative wireless access system, uses radio frequencies in the 2.4, 5, and 60 GHz bands, and use of the 6 GHz band is also under consideration. Wireless terminals such as smartphones generally have interfaces (I/Fs) for both cellular and wireless LAN access, and each I/F supports multiple bands. It has become common for a terminal to select the wireless base station it connects to from among multiple frequencies and access methods, and a single terminal may also use multiple base stations together, as in dual connectivity.

In such a heterogeneous environment, controlling and optimizing, across the entire system, which base station each terminal selects on which I/F is effective for making efficient use of system resources.

In addition, as an evolution of 5G, one goal is to realize communication functions for ultra-reliable, ultra-low-latency applications that conventional wireless communications have rarely served, such as uRLLC (Ultra-Reliable and Low Latency Communications).

One conventional technique for achieving high reliability (low packet loss) and low latency is to transmit the same data redundantly over multiple wireless I/Fs and multiple bands and to combine the data on the receiving side (e.g., Non-Patent Document 1).

Non-Patent Document 1: Cisco Parallel Redundancy Protocol Over Wireless, https://www.cisco.com/c/ja_jp/td/docs/wireless/outdoor_industrial/iw3702/technote/b_prp_dg.html
Non-Patent Document 2: Yue Gao, Kry Yik Chau Lui, Pablo Hernandez-Leal, "Robust Risk-Sensitive Reinforcement Learning Agents for Trading Markets," RL4RealLife Workshop in Int. Conf. on Machine Learning (ICML), 2021.

The technique of Non-Patent Document 1 basically configures, in a fixed manner, the wireless I/Fs or bands to be made redundant according to the required QoS level. As a result, it may use more wireless resources than necessary, and its wireless resource utilization efficiency is poor. Furthermore, it cannot flexibly adjust the amount of resources used as the environment changes.

The present invention has been made in view of the above points, and aims to provide a technique that achieves both the desired communication quality and improved wireless resource utilization efficiency while following changes in the environment.

According to the disclosed technology, there is provided a communication device that performs wireless communication using a plurality of wireless interfaces, the communication device comprising:
a reinforcement learning unit that determines, using risk-averse reinforcement learning, a wireless interface for transmitting packets to a device and the number of packets to be transmitted to the device via that wireless interface; and
a transmitting unit that transmits the number of packets determined by the reinforcement learning unit to the device,
wherein the reinforcement learning unit learns actions for states through risk-averse reinforcement learning in which the state is a satisfaction level based on the packet loss rate on each wireless interface, and the action is the combination of wireless interfaces used by each device and the number of packets transmitted on each wireless interface.

The disclosed technology provides a technique for achieving both the desired communication quality and improved wireless resource utilization efficiency while following changes in the environment.

FIG. 1 is a diagram showing an example configuration of a wireless communication system.
FIG. 2 is a configuration diagram of a wireless base station (or wireless terminal).
FIG. 3 is a configuration diagram of a wireless base station (or wireless terminal).
FIG. 4 is a flowchart showing an outline of the operation.
FIG. 5 is a diagram for explaining the system model.
FIG. 6 is a diagram for explaining reinforcement learning.
FIG. 7 is a diagram showing Algorithm 1.
FIG. 8 is a diagram showing an example hardware configuration of the device.

An embodiment of the present invention (the present embodiment) is described below with reference to the drawings. The embodiment described below is merely an example, and the embodiments to which the present invention can be applied are not limited to it.

(System configuration example)
FIG. 1 shows an example of the configuration of a wireless communication system according to this embodiment. As shown in FIG. 1, the system includes a wireless base station 100 and a plurality of wireless terminals 200. In the example of FIG. 1, the wireless base station 100 is connected to the Internet.

In this embodiment, a wireless base station 100 equipped with multiple wireless interfaces uses the reinforcement learning technique described below to determine, for packets to be transmitted to a device (wireless terminal), the wireless interface that transmits them and the number of packets to transmit on that wireless interface, and then transmits the packets. Determining the number of packets may be referred to as packet scheduling. The technique according to this embodiment can also be applied in the wireless terminal 200. The wireless base station and the wireless terminal may be collectively referred to as a communication device.

In the specific examples described below, two types of wireless interface, Sub-6 GHz and mmWave, are used, but the wireless interfaces are not limited to these. "Wireless interface" may also be read as "frequency." That is, in a configuration that aggregates multiple frequencies, this embodiment can realize frequency selection and packet-count determination using the reinforcement learning technique described below.

FIG. 2 shows an example configuration of the wireless base station 100. The wireless terminal 200 may have the same configuration as that shown in FIG. 2.

As shown in FIG. 2, the wireless base station 100 has a communication I/F unit 110, a control unit 120, a wireless communication unit 130, and an antenna 101.

The wireless communication unit 130 includes a scheduler unit 140, a receiving unit 131, a wireless communication signal generation unit 132, and an RF unit 135. The scheduler unit 140 includes a reinforcement learning unit 150, a communication quality measurement unit 141, a total wireless resource allocation calculation unit 142, and an individual wireless resource allocation calculation unit 143. One set consisting of the individual wireless resource allocation calculation unit 143, the receiving unit 131, the wireless communication signal generation unit 132, the RF unit 135, and the antenna 101 is provided for each wireless interface. However, any element of this set may be shared by multiple interfaces. The set itself may also be referred to as a "wireless interface."

The reinforcement learning unit 150 includes a Q-table management unit 151, a state calculation unit 152, a reward calculation unit 153, and a risk assessment unit 154. The operation of each unit is as follows.

The communication I/F unit 110 communicates with, for example, the Internet. The control unit 120 includes, for example, a CPU and memory, and controls the entire device. The wireless communication unit 130 performs operations related to wireless communication.

The scheduler unit 140 performs packet scheduling and related processing. The receiving unit 131 receives signals from other communication devices (e.g., feedback from a wireless terminal) via the antenna and the RF unit. The wireless communication signal generation unit 132 generates the signal to be transmitted wirelessly from the data of the packets to be transmitted. The RF unit 135 performs processing such as modulating the signal onto a carrier wave. The scheduler unit 140 can also be realized by a computer and a program, and the program can be recorded on a recording medium or provided via a network.

The communication quality measurement unit 141 measures communication quality (e.g., packet loss rate) based on, for example, the number of transmitted packets and feedback from the communication partner (e.g., ACK/NACK). This embodiment assumes that instantaneous CSI feedback (ACK/NACK, etc.) cannot be obtained from each device but that sporadic CSI feedback can. Communication quality statistics (e.g., the average over all devices) can be obtained from this sporadic CSI feedback.

The total wireless resource allocation calculation unit 142 determines, based on the action that the reinforcement learning unit 150 decides for each frame, how much of the total number of packets to be transmitted is allocated to each wireless interface. The individual wireless resource allocation calculation unit 143 determines, based on the same per-frame action, the amount of wireless resources corresponding to the number of packets to be transmitted on its wireless interface (the wireless interface connected to that individual wireless resource allocation calculation unit 143).

The wireless base station 100 (or wireless terminal 200) can also be represented by the configuration shown in FIG. 3. As shown in FIG. 3, the wireless base station 100 has a reinforcement learning unit 10, a transmitting unit 20, and a receiving unit 30. The reinforcement learning unit 10 performs the same processing as the reinforcement learning unit 150. The transmitting unit 20 performs transmission-related processing (e.g., transmission resource allocation calculation, packet transmission), and the receiving unit 30 performs reception-related processing (e.g., feedback reception, communication quality calculation).

(Reinforcement learning unit 150)
In this embodiment, the wireless base station 100 (or the wireless terminal 200) employs a configuration in which a plurality of wireless interfaces (or a plurality of frequencies) are aggregated.

Reinforcement learning is applied (1) by providing the reinforcement learning unit 150 in the scheduler unit 140, which allocates wireless resources to the packets transmitted on each wireless interface, so that the device autonomously learns and makes the optimal connections for obtaining the desired communication quality. In addition, (2) a risk-averse learning technique (Non-Patent Document 2) based on parallel updating of multiple Q-tables (the case of a single Q-table is also included) is used, which enables action selection that emphasizes communication reliability.

Regarding the application of reinforcement learning in (1) above, in this embodiment the state s(t) is the Satisfaction Level based on packet loss rate information (detected from ACK feedback) on each wireless interface. The action a(t) is the combination of wireless interfaces used for each device (a wireless terminal when the sender is a wireless base station) together with the packet scheduling (the number of packets transmitted on each wireless interface). In this embodiment, the optimal action a(t) for each device is learned from the state s(t) by Risk-Averse Average Q-learning.

In the uRLLC case assumed in this embodiment, instantaneous CSI feedback cannot be used because low latency must be maintained. In this embodiment, the wireless interface selection and packet scheduling method is designed so that risk-averse learning works well even when the instantaneous channel state is unknown.

Regarding the risk-averse learning technique of (2) above, the expressions that capture the concept of an evaluation function responsive to risk (the magnitude of the variance), namely Equations (11) and (12) described later, contain a term that responds sensitively to the variance (risk) of the past reward r, so that the lower reward of high-risk actions is reflected. The term that responds to the variance of the past reward r and reflects it in the evaluation is the second term (the term containing Var) of Equation (12), which is the Taylor expansion of Equation (11).

As explained in the specific example described later, in this embodiment the instantaneous reward reflects the average packet reception success rate across all devices and a penalty for risk states (e.g., states in which QoS targets such as reliability and delay are not achieved).

In the reinforcement learning unit 150 shown in FIG. 2, the Q-table management unit 151 holds, initializes, and updates the Q-tables. The state calculation unit 152 calculates the state s(t). The reward calculation unit 153 calculates the reward r for s(t) and a(t). The risk assessment unit 154 calculates the evaluation function based on the Q-tables and selects an action. The evaluation function may instead be calculated by the reward calculation unit 153.

An overview of the operation of the wireless base station 100 related to reinforcement learning is now described with reference to the flowchart in FIG. 4.

In S101, the state calculation unit 152 obtains the Satisfaction Level based on packet loss rate information (detected from ACK feedback) on each wireless interface and calculates the state s(t).

In S102, the risk assessment unit 154 determines the action a by the ε-greedy method, based on the multiple Q-tables (or the single Q-table) managed by the Q-table management unit 151.

In S103, the reinforcement learning unit 150 notifies the total wireless resource allocation calculation unit 142, the individual wireless resource allocation calculation unit 143, and so on of the determined action a, and the wireless base station 100 thereby executes action a.

In S104, the communication quality measurement unit 141 acquires packet loss information and passes it to the reward calculation unit 153 in the reinforcement learning unit 150.

In S105, the reward calculation unit 153 calculates the reward. In S106, the Q-table management unit 151 updates the multiple Q-tables (or the single Q-table).
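The per-frame flow S101 to S106 can be sketched as a small tabular learning loop. This is only an illustrative sketch: the function names, the two-state toy environment, and the hyperparameter values are hypothetical, a single Q-table is used, and an ordinary Q-learning update stands in for the update rule of this embodiment, which is described later.

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    """S102: with probability epsilon explore, otherwise take the best-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def run_frames(env_step, n_states, n_actions, frames,
               alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Run the per-frame loop and return the learned Q-table."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    state = 0                                            # S101: current state s(t)
    for _ in range(frames):
        action = epsilon_greedy(q[state], epsilon, rng)  # S102: choose action a
        next_state, reward = env_step(state, action)     # S103-S105: act, observe, reward
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])  # S106: update Q-table
        state = next_state
    return q

def toy_env_step(state, action):
    """Hypothetical environment: action 2 always meets the target (state 1, reward 1)."""
    return (1, 1.0) if action == 2 else (0, 0.0)
```

Calling `run_frames(toy_env_step, n_states=2, n_actions=3, frames=20000)` returns a table in which action 2 has the highest value in both states, because it is the only action that earns a reward in the toy environment.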

The operation of the wireless base station 100 in this embodiment (in particular, the operation of the reinforcement learning unit 150) is described in more detail below, using an example with specific wireless interfaces.

(System model)
In this embodiment, as shown in FIG. 5, downlink (DL) transmission in a wireless network composed of multiple APs, each serving multiple devices, is described as an example. Each AP is assumed to have a Sub-6 GHz interface and an mmWave (millimeter-wave) interface. Each AP corresponds to the wireless base station 100, and each device corresponds to a wireless terminal 200. The following description assumes that the wireless base station 100 performs the reinforcement learning operation according to the present invention, but the wireless terminal 200 can perform the same operation.

As shown in FIG. 5, AP b transmits the desired packets to a set of devices K. The devices in K also receive DL interference from all other APs b′ ≠ b.

At the start of each scheduling frame t, AP b holds L_k(t) packets for each device k ∈ K. Each packet l ∈ L_k(t) is d bits in size and is to be transmitted to device k ∈ K.

AP b transmits these packets via N subchannels on the Sub-6 GHz interface and M beams on the mmWave interface. In each scheduling time frame, each Sub-6 GHz subchannel and each mmWave beam can be assigned to exactly one device. Multiple devices can be served in each frame through different subchannels on Sub-6 GHz and different beams on mmWave.

In the Sub-6 GHz band, the signal-to-interference-plus-noise ratio (SINR) from AP b to device k on subchannel n is expressed as follows:

Here, the transmit power p_bkn^sub from AP b to device k on subchannel n is assumed to be equal across subchannels. W_sub is the bandwidth per subchannel. The term h_bkn^sub is the channel power between AP b and device k on subchannel n, and is given by h_bkn^sub(t) = |^~h_bkn^sub(t)|^2. In the text of this specification, for convenience of notation, a symbol that belongs above a character may be written before that character; "^~h" (h with a tilde above it) is an example. Here, ^~h_bkn^sub(t) is the complex channel coefficient, which includes small-scale and large-scale fading effects. σ_n^2 represents the additive white Gaussian noise (AWGN) power. I_bkn^sub is the interference power on subchannel n from the APs b′ ≠ b to device k.

For the mmWave interface, analog beamforming is assumed. The transmit beam width and beam direction from AP b to device k on beam m are denoted θ_bkm and β_bkm, respectively, and are adjusted according to the target device k and the time frame t for each beam m.

For simplicity and without loss of generality, the receive beam gain G_k^Rx at device k is assumed to be fixed. To maximize the achievable rate, θ_bkm is set to the narrowest beam width, and β_bkm is given by the line-of-sight (LoS) direction from AP b to device k. The SINR of beam m at device k served by AP b is therefore given as follows:

Here, p_bkm^mW and h_bkm^mW are, respectively, the transmit power and the channel power between AP b and device k on beam m, and W_mw is the bandwidth. The channel power h_bkm^mW is a function of the transmit beam width and direction on beam m, as follows:

Here, PL_bkm denotes the path loss between AP b and device k on beam m, and G_b(θ_bkm, β_bkm) is the main transmit beam gain between AP b and device k, which is modeled as follows:

Here, ε is the sidelobe beam gain. In Equation (2), I_bkm^mW is the interference power from all APs b′ ≠ b to device k served by AP b, and is calculated based on their sidelobe beam gains.

The achievable rate of device k served by AP b is therefore as follows:

Here, ν ∈ {sub, mW} (Sub-6 GHz or mmWave). Under the low-latency requirements of the devices' applications, instantaneous CSI feedback from the devices to the APs is not assumed. The APs must therefore decide the allocation without knowing the achievable rate (Equation (5)).
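For orientation, the relationship between the quantities defined above can be sketched numerically. The sketch assumes the conventional forms, namely SINR as received power over interference plus noise power and a Shannon-type rate W·log2(1 + SINR). It also assumes that the noise term scales with the bandwidth. These forms, the function names, and the parameter names are illustrative assumptions rather than the exact expressions of this embodiment.

```python
import math

def sinr(tx_power, channel_power, interference, bandwidth, noise_density):
    """Assumed conventional form: p * h / (I + W * sigma^2), all in linear units."""
    return tx_power * channel_power / (interference + bandwidth * noise_density)

def achievable_rate(bandwidth, sinr_value):
    """Assumed Shannon-type rate in bit/s for a bandwidth in Hz and a linear SINR."""
    return bandwidth * math.log2(1.0 + sinr_value)
```

With these definitions, a link whose SINR equals 1 (0 dB) carries one bit per second per hertz of bandwidth. The AP cannot evaluate these functions at run time, because the instantaneous channel power and interference are unknown to it.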

The number of packets transmitted to device k on interface ν in frame t is denoted l_k^ν(t) ∈ {0, ..., L_k(t)}. Since L_k(t) is the total number of packets queued in frame t, l_k^sub(t) + l_k^mW(t) ≤ L_k(t). On each interface, the number of packets successfully received by device k, Ω_k^ν(t), can be calculated by AP b from device k's ACK feedback as follows:

Here, ω_kl^ν(t) denotes the feedback from device k for packet l on interface ν in frame t, and is as follows:

Furthermore, within a frame of duration T_s, the maximum number of d-bit packets that device k can successfully receive on interface ν, l_{k,max}^ν(t), is given as follows:

Here, because r_bk^ν(t) is unknown at the AP, l_{k,max}^ν(t) is also unknown at the AP. Accordingly, when l_k^ν(t) ≤ l_{k,max}^ν(t), that is, when the number of packets transmitted on the subchannel or beam assigned to device k does not exceed the number of packets device k can receive, all of these packets are assumed to be received successfully and their ACKs fed back to the AP. When l_k^ν(t) ≥ l_{k,max}^ν(t), however, l_k^ν(t) - l_{k,max}^ν(t) packets end up in the NACK state.
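The ACK/NACK model above can be sketched as follows. The sketch assumes that the maximum packet count is the number of whole d-bit packets that fit into one frame at the achievable rate, and that the per-packet feedback is 1 for ACK and 0 for NACK. The function names and this encoding are illustrative assumptions.

```python
import math

def max_deliverable_packets(rate_bps, frame_duration_s, packet_bits):
    """Whole d-bit packets that fit into one frame at the given rate (assumed form)."""
    return math.floor(rate_bps * frame_duration_s / packet_bits)

def ack_feedback(sent, l_max):
    """Per-packet feedback: 1 = ACK, 0 = NACK; packets beyond l_max are NACKed."""
    acked = min(sent, l_max)
    return [1] * acked + [0] * (sent - acked)

def successful_packets(feedback):
    """Number of successfully received packets, counted from the ACKs."""
    return sum(feedback)
```

For example, at 1 Mbit/s with a 1 ms frame and 256-bit packets, three packets fit into the frame, so sending five packets produces three ACKs and two NACKs. The AP only observes the feedback list and never l_max itself, which is why the packet count has to be learned.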

Based on the above, the PLR (packet loss rate) of device k at frame t is defined as the packet loss occurrence up to frame t, averaged over both interfaces, as follows:

Here, the following denotes the Packet Successful Delivery Rate (PSR) across both interfaces in frame τ:

The PLR of device k at frame t on each interface is updated as follows:
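A minimal sketch of this bookkeeping is shown below. It assumes that the per-frame PSR is the delivered packets divided by the sent packets summed over both interfaces, that the PLR up to frame t is one minus the average PSR over frames 1 to t, and that the update takes a running-average form. It also assumes that a frame with nothing sent counts as loss-free. These forms and all function names are assumptions made for illustration.

```python
def frame_psr(delivered_sub, delivered_mw, sent_sub, sent_mw):
    """PSR of one frame across both interfaces (assumed: delivered / sent)."""
    sent = sent_sub + sent_mw
    if sent == 0:
        return 1.0  # assumption: nothing sent means nothing lost
    return (delivered_sub + delivered_mw) / sent

def plr_up_to_frame(psr_history):
    """PLR at frame t as one minus the average PSR over frames 1..t (assumed form)."""
    return 1.0 - sum(psr_history) / len(psr_history)

def update_plr(prev_plr, psr_now, frame_index):
    """Running-average update of the PLR; frame_index counts from 1."""
    return prev_plr + ((1.0 - psr_now) - prev_plr) / frame_index
```

The incremental form gives the same value as recomputing the average from the full history, so the AP only needs to store the previous PLR and the frame index for each device.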

The technique according to this embodiment is described in detail below.

(Markov decision process (MDP))
The goal here is to maximize the long-term PSR averaged over all devices (here, ρ_max) while satisfying the individual PLR constraint of each device. This problem can be modeled as an MDP characterized by a state space, an action space, transition probabilities, and a reward function, as shown in FIG. 6. In FIG. 5, the state s_t is the PLR satisfaction level (and ACK feedback state) for all devices. The action a_t is the interface selection and packet scheduling for all devices. In this embodiment, the actions are optimized by obtaining the reward r(t) from the state s(t) and the action a(t) and maximizing the objective function.

Each AP (wireless base station) is an agent that makes the interface selection and packet scheduling decisions. In each frame t, the AP knows the current state s_t, which consists of the current PLR satisfaction levels of the devices associated with the AP and their feedback states in the previous frame t-1. Based on s_t, the AP takes action a_t. That is, the AP determines the number of packets on each interface for each device in the current frame t, obtains an immediate reward r_t from the environment, and transitions to a new state s_{t+1}.

Because information such as the instantaneous CSI and the interface statistics is unknown, the AP has no knowledge of the transition probability P(s_{t+1} | s_t, a_t). In this embodiment, this problem is solved using the reinforcement learning (RL) framework.

(Risk-Averse Reinforcement Learning)
To best satisfy stringent reliability requirements, this embodiment uses a Risk-Sensitive Reinforcement Learning (RSRL) approach called Risk-Averse Average Q-learning (RAQL). Compared with traditional RL methods such as Q-learning (QL), which aim to maximize the expected return, RSRL introduces a notion of risk, which is tied to the variance of the reward. RAQL achieves a further reduction in variance and thereby reduces risk.

Instead of taking the expected reward as the objective function as in traditional RL, the expected utility of the reward is used as the objective function, as follows:

In the above equation (11), the expectation is taken over the stochastic policy π: S×A→[0,1] used to select actions and over the channel realization h across both interfaces. Taking a Taylor expansion yields the following equation (12):

With β<0, the expected reward is maximized while the variance is minimized, so the objective function becomes risk-averse.
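Equations (11) and (12) are not reproduced in this text. Purely as a sketch, and assuming the exponential-utility objective that is common in risk-sensitive reinforcement learning (this form is an assumption and is not confirmed by the description), an objective and its expansion consistent with the listed symbols could be written as follows. The logarithm is included so that the second-order expansion holds exactly as stated.

```latex
\max_{\pi \in \Pi} \; J_\pi
  = \frac{1}{\beta} \log \mathbb{E}_{\pi,h}\!\left[ \exp\!\Big( \beta \sum_{t} r_t \Big) \right]
  \qquad \text{(assumed form of (11))}

J_\pi
  = \mathbb{E}_{\pi,h}\!\Big[ \sum_{t} r_t \Big]
  + \frac{\beta}{2}\, \mathrm{Var}\!\Big[ \sum_{t} r_t \Big]
  + O(\beta^{2})
  \qquad \text{(assumed form of (12))}
```

Under this form, a negative β turns the variance term into a penalty, which matches the statement that the objective becomes risk-averse for β<0.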

The symbols in the above equations (11) and (12) have the following meanings:

J_π: Average utility function (discounted sum of immediate rewards r_t) under policy π in the Markov decision process
Π: Policy
E_{π,h}: Expected value under policy π and state h of the wireless channel (propagation path, etc.)
r_t: Immediate reward value at step t
β: Parameter
Var[ ]: Variance of [ ]
O( ): Order of ( )
As will be described later, in this embodiment, multiple Q-tables are learned simultaneously by using equation (22) as the update rule. The sample variance of these Q-tables is then used as an approximation of the true variance. From this variance, a risk-averse Q̂-table is computed and used for action selection.

(RAQL-based interface selection and packet scheduling method)
Next, the RAQL-based algorithm executed by the AP (wireless base station 100) in this embodiment will be described in detail. The state space and the action space are defined as follows.

State: s(t) consists of, for all devices k∈K at frame t, the current QoS satisfaction level for the PLR and the most recent ACK feedback for the packets transmitted in frame t-1, as shown in equations (13) and (14) below. The ACK feedback may be omitted from s(t).

ここで、 where:

である。 is.
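As a minimal sketch of how such a state could be assembled: equations (13) and (14) are not reproduced in this text, so the code below assumes that the satisfaction indicator is 1 when the measured PLR on an interface is at or below ρ_max and 0 otherwise (the risk state). The function names `satisfaction` and `build_state` and the per-device tuple layout are hypothetical.

```python
from typing import List, Tuple

# One entry per device k: (u_sub, u_mmw, ack). This layout is an assumption.
DeviceState = Tuple[int, int, int]


def satisfaction(plr: float, rho_max: float) -> int:
    """Assumed indicator: 1 if the PLR requirement is met, 0 if in the risk state."""
    return 1 if plr <= rho_max else 0


def build_state(plr_sub: List[float], plr_mmw: List[float],
                acks: List[int], rho_max: float) -> Tuple[DeviceState, ...]:
    """Build a hashable state s(t) covering all devices k."""
    return tuple(
        (satisfaction(ps, rho_max), satisfaction(pm, rho_max), ack)
        for ps, pm, ack in zip(plr_sub, plr_mmw, acks)
    )
```

Returning a tuple keeps the state hashable, so it can be used directly as a key of a tabular Q-function.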

Action: a(t) indicates the selection of the interface over which each device's packets should be transmitted. To avoid an explosion of the action space size and keep the proposed method scalable, this embodiment aggregates the interface selection task and the packet scheduling task into three actions a_k(t) for device k, as explained below. Although the AP does not have knowledge of the instantaneous CSI, it is reasonable to assume that long-term CSI, such as the average path loss or average SINR, is known through sporadic feedback.

Therefore, each AP can assign subchannels and beams based on the average CSI of each device. In this case, all subchannels are equivalent for each device, so the AP can randomly select the subchannels to be assigned to each device. The scheduling task of each AP then amounts to determining the number of packets to be transmitted per subchannel for each device. The maximum number of packets transmitted from AP b and successfully received by device k during a frame of length T_s can be estimated by the following equation (15):

r̃_bk^ν is the known average rate of device k on interface ν. Each action a_k(t) is as follows:

a_k(t) = 0: Only the Sub-6 GHz interface is used, and the number of transmitted packets is

である。 is.

a_k(t) = 1: Only the mmWave interface is used, and the number of transmitted packets is

である。 is.

a_k(t) = 2: Both the Sub-6 GHz and mmWave interfaces are used, but mmWave is given higher priority so as to exploit its high data rate and maximize the number of transmitted packets.
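The three per-device actions can be sketched as follows. Because equation (15) and the per-action packet counts are not reproduced in this text, the sketch assumes that the per-frame capacity of an interface is the floor of (average rate × frame length / packet size), and it introduces a hypothetical `pending` argument for the number of packets waiting for the device. All function and constant names are illustrative.

```python
import math
from typing import Tuple

SUB6_ONLY, MMWAVE_ONLY, BOTH_MMWAVE_FIRST = 0, 1, 2  # values of a_k(t)


def max_packets(avg_rate_bps: float, frame_len_s: float, packet_bits: int) -> int:
    """Assumed capacity estimate: packets that fit in one frame at the average rate."""
    return math.floor(avg_rate_bps * frame_len_s / packet_bits)


def packets_per_interface(action: int, pending: int,
                          n_sub: int, n_mmw: int) -> Tuple[int, int]:
    """Return (packets on Sub-6 GHz, packets on mmWave) for one device."""
    if action == SUB6_ONLY:
        return min(pending, n_sub), 0
    if action == MMWAVE_ONLY:
        return 0, min(pending, n_mmw)
    if action == BOTH_MMWAVE_FIRST:
        on_mmw = min(pending, n_mmw)            # fill the faster interface first
        on_sub = min(pending - on_mmw, n_sub)   # overflow goes to Sub-6 GHz
        return on_sub, on_mmw
    raise ValueError(f"unknown action: {action}")
```

Restricting each device to three actions keeps the per-device action space constant regardless of how many packets could be scheduled, which is the scalability argument made above.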

Finally, under the constraints on the number of subchannels and beams, the action a(t) for all devices is given by the following equation (20):

Reward: r(s(t), a(t)) represents the immediate reward achieved by performing action a(t) in frame t, given by the average PSR across the devices. In particular, this reward function also takes into account the risk state defined in equation (14). Based on the ACK/NACK feedback from which the AP obtains Ω_k^ν(t) in equation (6), the reward is calculated by equation (21) below.

The symbols in equation (21) have the following meanings:

r(s(t), a(t)): Immediate reward value at step t
Ω_k^sub(τ): Number of packets successfully transmitted over the Sub-6 GHz I/F
Ω_k^mW(τ): Number of packets successfully transmitted over the millimeter-wave I/F
l_k^sub(τ): Number of packets transmitted over the Sub-6 GHz I/F
l_k^mW(τ): Number of packets transmitted over the millimeter-wave I/F
u_k^sub(t): Variable whose value depends on whether the packet loss rate ρ on the Sub-6 GHz I/F meets the required quality ρ_max
u_k^mW(t): Variable whose value depends on whether the packet loss rate ρ on the millimeter-wave I/F meets the required quality ρ_max
As is clear from equation (14), if u_k^ν(t) = 0, that is, if device k is in a risk state in which the PLR requirement in equation (14) is not satisfied, the reward is penalized.
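The shape of such a reward can be illustrated with a short sketch. Equation (21) is not reproduced in this text, so the form below is an assumption: the per-device PSR is taken as successful packets over transmitted packets across both interfaces, and a hypothetical constant `penalty` is subtracted for any device that is in the risk state on either interface. The actual penalty term of equation (21) may differ.

```python
from typing import Sequence


def immediate_reward(succ_sub: Sequence[int], succ_mmw: Sequence[int],
                     sent_sub: Sequence[int], sent_mmw: Sequence[int],
                     u_sub: Sequence[int], u_mmw: Sequence[int],
                     penalty: float = 1.0) -> float:
    """Average PSR over devices, with an assumed additive penalty in the risk state."""
    num_devices = len(succ_sub)
    total = 0.0
    for k in range(num_devices):
        sent = sent_sub[k] + sent_mmw[k]
        # PSR of device k in this frame; defined as 0 when nothing was sent.
        psr = (succ_sub[k] + succ_mmw[k]) / sent if sent > 0 else 0.0
        in_risk = u_sub[k] == 0 or u_mmw[k] == 0
        total += psr - (penalty if in_risk else 0.0)
    return total / num_devices
```

For example, with two devices where the first delivers 4 of 4 packets with both requirements met and the second delivers 3 of 6 packets while in the risk state on mmWave, this sketch returns (1.0 + (0.5 - 1.0)) / 2 = 0.25.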

In this embodiment, the RAQL-based interface selection and packet scheduling method is executed by Algorithm 1 shown in FIG. 7. That is, the wireless base station 100 executes this algorithm by, for example, running a program on a CPU. The symbols have the following meanings:

ε: Exploration rate
λ: Decay rate
I: Number of Q-tables
λ_p: Risk control parameter
Q: Q-table
V: Number of Q-table updates
α: Learning rate
In Algorithm 1, the AP first initializes I Q-tables together with a table V that counts the number of times each action a has been selected in state s. The corresponding learning rate α is also initialized to 0, and the algorithm starts from a random state (lines 1-2).

At each frame t, a Q-table is selected at random and used to compute the risk-averse Q̂-table by equation (24) described below (lines 3-5). Unlike conventional QL, RAQL updates the Q-function by equation (22) below.

In equation (22), x_0 is a constant and is set, for example, to x_0 = -1. α(s(t), a(t)) is the learning rate of the state-action pair (s(t), a(t)), γ is the decay rate (discount factor), and u(x) is a monotonically increasing concave utility function, expressed as follows:

β is a parameter that gives the utility function its risk-averse characteristic; here β<0. The risk-averse Q̂-table is calculated by the following equation (24).

λ_p is the risk control parameter, and Q̄(s, a) = (1/I) Σ_{i=1}^{I} Q^i(s, a) is the average Q-table.
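The update and the risk-averse table can be sketched as follows. Equations (22) to (24) are not reproduced in this text, so every formula in the code is an assumption: the utility is taken as u(x) = -exp(βx), which is monotonically increasing and concave for β<0 and satisfies u(0) = x_0 = -1; the update applies u to the temporal-difference error; and Q̂ subtracts λ_p times the sample standard deviation across the I tables from the randomly selected table. Whether equation (24) uses the variance or its square root cannot be recovered here. The function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

QTable = Dict[Tuple[Hashable, int], float]


def utility(x: float, beta: float = -0.5) -> float:
    """Assumed exponential utility; increasing and concave when beta < 0."""
    return -math.exp(beta * x)


def raql_update(q: QTable, s, a: int, r: float, s_next, actions: Sequence[int],
                alpha: float, gamma: float,
                beta: float = -0.5, x0: float = -1.0) -> None:
    """Assumed form of the update: utility applied to the TD error, shifted by x0."""
    best_next = max(q[(s_next, b)] for b in actions)
    td_error = r + gamma * best_next - q[(s, a)]
    q[(s, a)] += alpha * (utility(td_error, beta) - x0)


def risk_averse_q(q_tables: List[QTable], h: int, s, a: int, lambda_p: float) -> float:
    """Assumed form of Q-hat: selected table minus lambda_p times the sample std."""
    num_tables = len(q_tables)
    if num_tables < 2:
        raise ValueError("sample variance needs at least two Q-tables")
    values = [q[(s, a)] for q in q_tables]
    mean = sum(values) / num_tables                                  # average Q-table entry
    variance = sum((v - mean) ** 2 for v in values) / (num_tables - 1)
    return q_tables[h][(s, a)] - lambda_p * math.sqrt(variance)
```

With this choice of utility, a zero temporal-difference error leaves the table unchanged, and negative errors are weighted more heavily than positive ones of the same size, which is what makes the learned values pessimistic.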

Next, given the current state and the exploration rate ε, an action a(t) is selected by the ε-greedy strategy. The AP transmits packets according to the selected action and receives the immediate reward (equation (21)) (lines 6-9). The environment then transitions to a new state (lines 10-16). This process is repeated until the maximum number of frames T is reached.
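The ε-greedy selection step can be sketched as below. This is a generic illustration rather than a transcription of Algorithm 1: `q_hat_row` is a hypothetical list holding the risk-averse Q̂ values of the candidate actions in the current state, and `rng` is an injectable random source added for testability.

```python
import random
from typing import Sequence


def epsilon_greedy(q_hat_row: Sequence[float], epsilon: float, rng=random) -> int:
    """Pick a random action with probability epsilon, else the best Q-hat action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_hat_row))      # explore
    # Exploit: index of the largest risk-averse value (first one on ties).
    return max(range(len(q_hat_row)), key=q_hat_row.__getitem__)
```

With ε = 0 the choice is purely greedy with respect to Q̂, and with ε = 1 it is uniformly random; intermediate values trade exploration against exploitation.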

(Hardware configuration example)
Both the wireless base station 100 and the wireless terminal 200 can be realized, for example, by causing a computer to execute a program. This computer may be a physical computer or a virtual machine in the cloud. Hereinafter, the wireless base station 100 and the wireless terminal 200 are collectively referred to as the communication device.

In other words, the communication device can be realized by using hardware resources such as a CPU and memory built into a computer to execute a program corresponding to the processing performed by the communication device. The program can be recorded on a computer-readable recording medium (such as portable memory) and then stored or distributed. The program can also be provided over a network, such as via the Internet or email.

FIG. 8 is a diagram showing an example of the hardware configuration of the computer. The computer in FIG. 8 has a drive device 1000, an auxiliary storage device 1002, a memory device 1003, a CPU 1004, an interface device 1005, a display device 1006, an input device 1007, an output device 1008, and the like, all of which are interconnected by a bus BS. Note that the communication device need not include the display device 1006.

The program that realizes the processing on the computer is provided by, for example, a recording medium 1001 such as a CD-ROM or memory card. When the recording medium 1001 storing the program is set in the drive device 1000, the program is installed from the recording medium 1001 into the auxiliary storage device 1002 via the drive device 1000. However, the program does not necessarily have to be installed from the recording medium 1001; it may instead be downloaded from another computer via a network. The auxiliary storage device 1002 stores the installed program as well as necessary files, data, and the like.

When an instruction to start the program is received, the memory device 1003 reads the program from the auxiliary storage device 1002 and stores it. The CPU 1004 implements the functions of the communication device in accordance with the program stored in the memory device 1003. The interface device 1005 is used as an interface for connecting to a network. The display device 1006 displays a GUI (Graphical User Interface) or the like provided by the program. The input device 1007 consists of a keyboard and mouse, buttons, a touch panel, or the like, and is used to input various operating instructions. The output device 1008 outputs the results of computation.

(Effects of the embodiment)
The technology according to the present embodiment makes it possible to achieve both the desired communication quality and improved utilization efficiency of wireless resources while tracking changes in the environment.

(Additional Note)
This specification discloses at least the following communication devices and communication methods.
(Section 1)
A communication device that performs wireless communication using a plurality of wireless interfaces, the communication device comprising:
a reinforcement learning unit that determines, using risk-averse reinforcement learning, a wireless interface for transmitting packets to a device and the number of packets to be transmitted to the device via that wireless interface; and
a transmitting unit that transmits the number of packets determined by the reinforcement learning unit to the device.
(Section 2)
The communication device according to Section 1, wherein the reinforcement learning unit learns actions for states through risk-averse reinforcement learning, with the satisfaction level based on the packet loss rate on each wireless interface being the state, and the combination of wireless interfaces used by each device and the number of packets transmitted on each wireless interface being the actions.
(Section 3)
The communication device according to Section 2, wherein the reinforcement learning unit further includes a receiving unit that receives feedback from a plurality of devices that are packet destinations, and
the reinforcement learning unit calculates the packet loss rate based on the feedback.
(Section 4)
The communication device according to any one of Sections 1 to 3, wherein the reinforcement learning unit calculates an immediate reward based on the average packet reception success rate for all devices and a penalty due to a risk state in which the QoS target value is not achieved, and calculates, using past immediate rewards, a policy that maximizes the average utility function so as to reflect a decrease in reward for high-risk actions.
(Section 5)
The communication device according to any one of Sections 1 to 4, wherein the communication device includes a first wireless interface and a second wireless interface that performs communication at a data rate higher than that of the first wireless interface, and
the action selected by the reinforcement learning unit is one of three actions: using only the first wireless interface, using only the second wireless interface, and using the second wireless interface preferentially.
(Section 6)
A communication system including the communication device according to any one of Sections 1 to 5 and the device.
(Section 7)
A communication method executed by a communication device that performs wireless communication using a plurality of wireless interfaces, the communication method comprising:
a reinforcement learning step of determining, using risk-averse reinforcement learning, a wireless interface for transmitting packets to a device and the number of packets to be transmitted to the device via that wireless interface; and
a transmitting step of transmitting the number of packets determined in the reinforcement learning step to the device.

The present embodiment has been described above, but the present invention is not limited to this specific embodiment, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.

100 Wireless base station
101 Antenna
110 Communication I/F unit
120 Control unit
130 Wireless communication unit
131 Receiving unit
132 Wireless communication signal generation unit
135 RF unit
140 Scheduler unit
141 Communication quality measurement unit
142 Total wireless resource allocation calculation unit
143 Individual wireless resource allocation calculation unit
150 Reinforcement learning unit
151 Q-table management unit
152 State calculation unit
153 Reward calculation unit
154 Risk evaluation unit
200 Wireless terminal
1000 Drive device
1001 Recording medium
1002 Auxiliary storage device
1003 Memory device
1004 CPU
1005 Interface device
1006 Display device
1007 Input device
1008 Output device

Claims

A communication device that performs wireless communication using a plurality of wireless interfaces, the communication device comprising:
a reinforcement learning unit that determines, using risk-averse reinforcement learning, a wireless interface for transmitting packets to a device and the number of packets to be transmitted to the device via that wireless interface; and
a transmitting unit that transmits the number of packets determined by the reinforcement learning unit to the device,
wherein the reinforcement learning unit learns actions for states through risk-averse reinforcement learning, where the state is a satisfaction level based on the packet loss rate on each wireless interface, and the actions are the combination of wireless interfaces used by each device and the number of packets transmitted on each wireless interface.

The communication device according to claim 1, wherein the reinforcement learning unit further includes a receiving unit that receives feedback from a plurality of devices that are packet destinations, and
the reinforcement learning unit calculates the packet loss rate based on the feedback.

A communication device that performs wireless communication using a plurality of wireless interfaces, the communication device comprising:
a reinforcement learning unit that determines, using risk-averse reinforcement learning, a wireless interface for transmitting packets to a device and the number of packets to be transmitted to the device via that wireless interface; and
a transmitting unit that transmits the number of packets determined by the reinforcement learning unit to the device,
wherein the reinforcement learning unit calculates an immediate reward based on the average packet reception success rate for all devices and a penalty due to a risk state in which the QoS target value is not achieved, and calculates, using past immediate rewards, a policy that maximizes the average utility function so as to reflect a decrease in reward for high-risk actions.

A communication device that performs wireless communication using a plurality of wireless interfaces, the communication device comprising:
a reinforcement learning unit that determines, using risk-averse reinforcement learning, a wireless interface for transmitting packets to a device and the number of packets to be transmitted to the device via that wireless interface; and
a transmitting unit that transmits the number of packets determined by the reinforcement learning unit to the device,
wherein the communication device includes a first wireless interface and a second wireless interface that performs communication at a data rate higher than that of the first wireless interface, and
the action selected by the reinforcement learning unit is one of three actions: using only the first wireless interface, using only the second wireless interface, and using the second wireless interface preferentially.

A communication system including the communication device according to any one of claims 1 to 4 and the device.

A communication method executed by a communication device that performs wireless communication using a plurality of wireless interfaces, the communication method comprising:
a reinforcement learning step of determining, using risk-averse reinforcement learning, a wireless interface for transmitting packets to a device and the number of packets to be transmitted to the device via that wireless interface; and
a transmitting step of transmitting the number of packets determined in the reinforcement learning step to the device,
wherein, in the reinforcement learning step, the communication device learns actions for states through risk-averse reinforcement learning, where the state is a satisfaction level based on the packet loss rate on each wireless interface, and the actions are the combination of wireless interfaces used by each device and the number of packets transmitted on each wireless interface.

A communication method executed by a communication device that performs wireless communication using a plurality of wireless interfaces, the communication method comprising:
a reinforcement learning step of determining, using risk-averse reinforcement learning, a wireless interface for transmitting packets to a device and the number of packets to be transmitted to the device via that wireless interface; and
a transmitting step of transmitting the number of packets determined in the reinforcement learning step to the device,
wherein, in the reinforcement learning step, the communication device calculates an immediate reward based on the average packet reception success rate for all devices and a penalty due to a risk state in which the QoS target value is not achieved, and calculates, using past immediate rewards, a policy that maximizes the average utility function so as to reflect a decrease in reward for high-risk actions.

A communication method executed by a communication device that performs wireless communication using a plurality of wireless interfaces, the communication method comprising:
a reinforcement learning step of determining, using risk-averse reinforcement learning, a wireless interface for transmitting packets to a device and the number of packets to be transmitted to the device via that wireless interface; and
a transmitting step of transmitting the number of packets determined in the reinforcement learning step to the device,
wherein the communication device includes a first wireless interface and a second wireless interface that performs communication at a data rate higher than that of the first wireless interface, and
the action selected in the reinforcement learning step is one of three actions: using only the first wireless interface, using only the second wireless interface, and using the second wireless interface preferentially.