JP2022018901A

JP2022018901A - Optimization method of wireless communication system, wireless communication system, and program for wireless communication system

Info

Publication number: JP2022018901A
Application number: JP2020122332A
Authority: JP
Inventors: Emiko Shinohara; Yasuhiko Inoue; Yusuke Asai; Taiji Takatori; Hiroshi Ozeki; Yoshiaki Narusue; Hiroyuki Morikawa
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2020-07-16
Filing date: 2020-07-16
Publication date: 2022-01-27
Anticipated expiration: 2040-07-16
Also published as: JP7385869B2

Abstract

PROBLEM TO BE SOLVED: To provide an optimization method for a wireless communication system that optimizes wireless communication for each of a plurality of wireless communication terminals and also optimizes the plurality of wireless communication terminals viewed as a whole. SOLUTION: A wireless communication system including a plurality of wireless communication terminals is optimized. For each wireless communication terminal, an action A(t+1) is determined such that the highest reward R'(t) is obtained, on the basis of the state S(t) provided by an environment 14. When the action A(t+1) is returned to the environment 14, the individual reward R(t+1) obtained by the wireless communication terminal is calculated. On the basis of the rewards of the wireless communication terminals belonging to a group, a utility representing the fairness among those terminals is calculated. The reward R'(t+1) for each wireless communication terminal is calculated on the basis of its individual reward R(t+1) and the utility. SELECTED DRAWING: Figure 3

Description

The present invention relates to a method for optimizing a wireless communication system, a wireless communication system, and a program for a wireless communication system, and in particular to ones that optimize the communication state by using learning with multi-stage evaluation.

More specifically, the present invention concerns computer-based learning, such as machine learning or reinforcement learning, that performs evaluation so as to achieve the following two goals together in an environment where different wireless communication systems coexist and interfere with each other.
1. Maximize the communication capacity within each wireless communication system.
2. Achieve overall optimization among wireless communication systems that share the same frequency resources. That is, achieve fairness with respect to evaluation items defined for each wireless communication system, such as the throughput achievement rate.

A wireless LAN is a wireless communication system that can be used inexpensively in unlicensed bands. Its adoption has therefore spread rapidly, and large numbers of wireless LAN terminals now coexist in the same area. As a result, interference among wireless LAN terminals has become a problem. In response, many techniques have been proposed for minimizing the effect of that interference and expanding the individual or overall system capacity.

For example, FIG. 1 shows a case in which wireless communication terminals 1 to N are wireless LAN base stations (APs: Access Points) that interfere with each other. The wireless communication terminals N+1 to N+M shown in the lower part of FIG. 1 are user terminals, such as smartphones, that establish communication with these APs. In this example, each of the wireless communication terminals 1 to N functioning as an AP acquires interference information about its surroundings and information on the success or failure of connections with the wireless communication terminals N+1 to N+M, and transmits it to the control server 10 as wireless environment information.

The control server 10 calculates an allocation of frequency channels and transmission power values that maximizes the throughput of the AP group including the wireless communication terminals 1 to N, and returns the result to each AP as control information.

On the other hand, wireless communication systems other than wireless LAN also use unlicensed bands and communicate by sharing the same frequency resources as wireless LAN. In particular, multiple wireless communication systems coexist in the 920 MHz band, which is currently open to RFID and IoT use in Japan. For example, wireless communication systems such as LoRaWAN, Wi-SUN, and SIGFOX have started services within that same frequency band in Japan. Among wireless LAN standards, IEEE 802.11ah is regarded as the one that uses the 920 MHz band.

Overseas, systems whose standards specify carrier sensing share the band by time division. Systems whose standards do not specify carrier sensing either split the frequency resources with other wireless communication systems or use them simultaneously.

In Japan, however, the 920 MHz band has not been allocated enough bandwidth to accommodate multiple wireless communication systems. It is therefore difficult to always split the frequency resources, and simultaneous use of the same frequency resources is assumed.

Although these wireless communication systems sharing the same frequency resources are all aimed at IoT, their standards and specifications differ greatly. Because their modulation and demodulation schemes and access control also differ, so do their spectral efficiency and communication range, and it is not reasonable to evaluate them on a single common axis.

According to Non-Patent Document 1, LoRaWAN has an occupied bandwidth of 125 kHz, a communication distance of about 10 km, and a communication speed of several tens of kbps at most. SIGFOX has a bandwidth of 100 Hz, a communication distance of several tens of kilometers, and a basic communication speed of 100 bps. Wi-SUN has an occupied bandwidth of up to 600 kHz, a communication distance of about 1 km, and a communication speed of several hundred kbps. 802.11ah has an occupied bandwidth of 1 MHz or more, a communication distance of about 1 km, and a communication speed of several Mbps.

Beyond their standards and specifications, these wireless communication systems also differ completely in use cases and traffic. For example, a wide-area, low-speed system such as SIGFOX is applied to sensor-type use cases in which traffic occurs a few times a day and is transmitted at low speed. A high-speed system such as 802.11ah, by contrast, is expected to be applied to use cases in which traffic occurs constantly, such as video transmission from surveillance cameras.

Thus, wireless communication systems that use the same frequency resources differ greatly not only in communication standards and specifications but also in required throughput and traffic frequency. Optimization calculations for frequency resource allocation and the like must therefore be based on the conditions of each system.

On the other hand, when frequency resources are allocated to wireless communication terminals with different conditions, the result can be optimal for individual terminals yet not optimal overall. For example, a result that preferentially allocates frequency resources only to terminals expected to achieve high communication speeds stalls service at the terminals that received few frequency resources. Evaluated across all wireless communication systems using the frequency resources, such a result leaves outages and cannot be called optimal.

Therefore, when wireless communication systems with different conditions coexist, control is needed that optimizes individual terminals and also enables optimization that takes into account the terminals under each condition and the terminals of all coexisting wireless communication systems.

Non-Patent Document 1: Shiro Sakata, "Latest trends and future prospects of LPWA," Chiba University, June 2018
Non-Patent Document 2: IEEE Std 802.11ah-2016, December 2016

When optimizing wireless communication resources shared by multiple coexisting wireless communication systems subject to different conditions, as in the unlicensed band described above, it is not sufficient for a single type of wireless communication system to optimize its own wireless communication resources, as in the conventional approach described above. In such a situation, each of the wireless communication systems must be evaluated, and optimization must be achieved not only within individual systems but also for all wireless communication terminals belonging to all wireless communication systems that use the wireless communication resources.

The present invention performs the evaluation in reinforcement learning in multiple stages, so that wireless communication is optimized for each of a plurality of wireless communication terminals and optimization is also performed for the plurality of wireless communication terminals viewed as a whole.

To achieve the above object, the first invention is a method for optimizing a wireless communication system including a plurality of wireless communication terminals, and desirably includes: a step of determining, for each wireless communication terminal, an action such that the highest reward is obtained, based on a state provided by an environment; a step of calculating the individual reward that the wireless communication terminal obtains when the action is returned to the environment; a step of calculating, based on the individual rewards of the plurality of wireless communication terminals, a utility representing the fairness among the plurality of wireless communication terminals; and a reward calculation step of calculating the reward for each wireless communication terminal based on the individual reward and the utility.

The second invention is a wireless communication system including a plurality of wireless communication terminals, and includes a control server that receives wireless environment information from the plurality of wireless communication terminals and provides control information to them. The control server desirably executes: a process of determining, for each wireless communication terminal, an action such that the highest reward is obtained, based on a state provided by an environment; a process of calculating the individual reward that the wireless communication terminal obtains when the action is returned to the environment; a process of calculating, based on the individual rewards of the plurality of wireless communication terminals, a utility representing the fairness among the plurality of wireless communication terminals; and a process of calculating the reward for each wireless communication terminal based on the individual reward and the utility.

The third invention is a program for a wireless communication system, implemented on a control server that receives wireless environment information from a plurality of wireless communication terminals and provides control information to them. The program desirably causes the control server to execute: a process of determining, for each wireless communication terminal, an action such that the highest reward is obtained, based on a state provided by an environment; a process of calculating the individual reward that the wireless communication terminal obtains when the action is returned to the environment; a process of calculating, based on the individual rewards of the plurality of wireless communication terminals, a utility representing the fairness among the plurality of wireless communication terminals; and a process of calculating the reward for each wireless communication terminal based on the individual reward and the utility.

According to the present invention, the reward of a wireless communication terminal is calculated based on the reward that the terminal receives individually and on the utility, which is the result of evaluating the plurality of wireless communication terminals from the viewpoint of fairness. The action of each wireless communication terminal is then determined so as to maximize that reward. The present invention can therefore achieve, in a well-balanced manner, both the optimization of each wireless communication terminal and the optimization of the plurality of wireless communication terminals evaluated as a whole.

FIG. 1 is a diagram for explaining a configuration example of a wireless communication system.
FIG. 2 is a diagram for explaining an example of a conventional reinforcement learning model.
FIG. 3 is a diagram for explaining an example of the reinforcement learning model used in Embodiment 1 of the present invention.
FIG. 4 is a flowchart for explaining an example of the learning algorithm used in Embodiment 1 of the present invention.
FIG. 5 is a diagram for explaining an example of the reinforcement learning model used in Embodiment 2 of the present invention.
FIG. 6 is a flowchart for explaining an example of the learning algorithm used in Embodiment 2 of the present invention.

Embodiment 1.
[Configuration of Embodiment 1]
The wireless communication system of Embodiment 1 of the present invention can be realized by the configuration example shown in FIG. 1. In FIG. 1, the wireless communication terminals 1 to N shown in the middle row each function as an access point (AP). They can communicate with the wireless communication terminals N+1 to N+M shown in the lower row of FIG. 1, which consist of smartphones, IoT sensors, smart meters, and the like. The configuration shown in FIG. 1 thus includes multiple wireless communication systems that share the same frequency resources but differ in standards and specifications.

The wireless communication system of this embodiment includes a control server 10. The control server 10 includes hardware such as a communication interface, a processor unit, and memory, and realizes the functions described below by having this hardware execute processing according to a program stored in the memory.

The control server 10 can provide control information to the wireless communication terminals 1 to N that function as APs. The control information includes, for example, the available frequency resources and the transmission power. The wireless communication terminals 1 to N, in turn, can transmit wireless environment information to the control server 10. The wireless environment information includes interference information about the surroundings of each of the wireless communication terminals 1 to N and information on the success or failure of connections with the wireless communication terminals N+1 to N+M.

The control server 10 also has a learning function for optimizing the various parameters included in the control information based on the wireless environment information and the like, and a function for determining those parameters based on the result of that learning.

[Outline of reinforcement learning]

In this embodiment, reinforcement learning is used to optimize the various parameters included in the control information. FIG. 2 shows a model diagram of general reinforcement learning. In the model shown in FIG. 2, an agent 12 is the entity that learns. With t denoting the observation timing of events, the agent 12, within a single environment 14, calculates and executes an action A(t+1) from the current state S(t) and reward R(t). As a result, the state S(t+1) is realized. From this state S(t+1), the reward R(t+1) that evaluates the action is obtained, and the next action is calculated.

In the following description, s and S denote states, a and A denote actions, and r and R denote rewards. Lowercase letters are parameters for an individual agent (optimization target), and uppercase letters are parameters for a set of agents. The subscript t on a parameter indicates its value at observation timing t; S_t, A_t, and R_t are the same as S(t), A(t), and R(t), respectively.

The reinforcement learning shown in FIG. 2 proceeds by repeating the following steps.
1. The agent 12 receives the state S(t) and the reward R(t) from the environment 14, and returns to the environment 14 the action A(t) determined according to a policy π.
2. The environment 14 changes to the next state S(t+1) based on the action A(t) received from the agent 12 and the current state S(t), and provides the new state S(t+1) and the reward R(t+1) to the agent 12. The reward R is a scalar quantity indicating how good or bad the immediately preceding action A was.
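The two-step loop can be sketched in Python as follows. This is only an illustration: `ToyEnvironment`, its two states, its reward rule, and `run_episode` are hypothetical stand-ins and do not come from the patent.

```python
class ToyEnvironment:
    """Hypothetical two-state environment used only to illustrate the loop."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # Reward is 1.0 when the action matches the current state, else 0.0.
        reward = 1.0 if action == self.state else 0.0
        # Move to the next state S(t+1); in this toy example it just toggles.
        self.state = 1 - self.state
        return self.state, reward


def run_episode(policy, steps=10):
    """Alternate between the agent (policy) and the environment."""
    env = ToyEnvironment()
    state, reward = env.state, 0.0
    history = []
    for _ in range(steps):
        # Step 1: the agent receives S(t) and R(t) and returns A(t).
        action = policy(state, reward)
        # Step 2: the environment returns S(t+1) and R(t+1).
        next_state, reward = env.step(action)
        history.append((state, action, reward))
        state = next_state
    return history


# A policy that always matches the current state earns every reward.
history = run_episode(lambda state, reward: state, steps=4)
```

Each entry of `history` is a `(state, action, reward)` triple, which is the kind of record a learning algorithm would consume.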

If the agent takes action A in a certain state S, the sum of the rewards R obtainable from the present into the infinite future, that is, the return G, is given by the following equation.
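The equation itself is not reproduced in this text. The standard discounted return, which matches the description here and the definition of γ that follows, can be written as shown; the index k and the exact indexing of R are assumptions rather than the patent's own notation.

```latex
\[
G_t \;=\; R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots
\;=\; \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
\]
```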

Here, γ satisfies 0 ≤ γ ≤ 1 and is a parameter that adjusts how strongly future rewards are counted toward the return.

In Q-learning, a form of reinforcement learning, the value of an action a is evaluated by the following function.
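The function is not reproduced in this text. Under the usual definition of an action-value function, and assuming the return G_t is the discounted sum of future rewards, it would read:

```latex
\[
Q^{\pi}(s, a) \;=\; \mathrm{E}_{\pi}\!\left[\, G_t \;\middle|\; S_t = s,\ A_t = a \,\right]
\]
```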

Here, E denotes the expectation. Q^π is the value function (hereinafter, the "Q function") representing the expected return when an agent that takes action a in state s subsequently acts according to the policy π.

The reinforcement learning shown in FIG. 2 proceeds so as to maximize this Q function. For example, the learning can proceed by obtaining the Q function, which estimates the return G when action a is taken in state s, with the algorithm given by the following equation.

Here, p is a parameter called the learning rate, a constant chosen by the designer of the machine learning system, and is usually set to a small value less than 1. max Q denotes the maximum value of the Q function that could ideally be obtained. At each time t, Q-function learning estimates all the Q values obtainable by the actions that can be taken at the next time t+1, and updates the Q value using the largest of them.
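A minimal tabular sketch of such an update is shown below, assuming the standard Q-learning rule Q(s, a) ← Q(s, a) + p · (r + γ · max Q(s', a') − Q(s, a)). The patent's own equation is not reproduced in this text, so this form, the function name `q_update`, and the default values of `p` and `gamma` are assumptions.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, p=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q value.

    p is the learning rate and gamma the discount factor (assumed defaults).
    """
    # Estimate the Q values of every action available at the next time step
    # and keep the largest one.
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    # Move the current estimate a fraction p of the way toward the target.
    q[(state, action)] += p * (target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
q_update(q, state=0, action=1, reward=1.0, next_state=1, actions=[0, 1])
```

A table works only when the number of states is small; with a very large state space the table is typically replaced by a function approximator.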

[Features of Embodiment 1]
FIG. 3 shows the reinforcement learning model used in the wireless communication system of this embodiment. This embodiment targets multiple wireless communication systems with different conditions and seeks optimization by evaluating both individual terminals and each condition. The wireless communication systems can be grouped according to their respective conditions. In the model shown in FIG. 3, there are three groups, with an agent for each group.

The agents 12-1, 12-2, and 12-3 shown in FIG. 3 each operate under the same environment 14, evaluate the actions of the individual users i belonging to their respective groups, and also evaluate the group as a whole. For example, the agent 12-1 contains an agent i for each of the users i in group 1. Agent i evaluates the actions of user i and also evaluates group 1 as a whole with fairness taken into account.

Requirements such as the number of connections and the bandwidth needed differ from one agent i to another, and resources cannot be said to be fully utilized unless they are allocated accordingly. Simply dividing resources equally by the number of agents i therefore does not guarantee fairness when the group as a whole is evaluated. The appropriateness of the allocation to each agent i achieved by a given distribution of resources is thus defined by a utility function.

Let x_i be the resource allocated to user i, and let R(x_i) denote that user's utility function. When the sum of the utility functions over the users i is maximized, the appropriateness of the resource allocation for the whole system is maximized, and the resources can be said to have been allocated fairly.

Specifically, the following function is used as the utility function R(x_i).

Here, α is a parameter that determines the fairness of the utility function R. Setting α to ∞ in the above utility function yields a utility that maximizes the minimum value among users, that is, it evaluates max-min fairness. With this setting, the utility function R in this embodiment makes it possible to distribute resources in a way that is matched to the wireless communication terminal with the smallest reward.
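The utility function itself is not reproduced in this text. A parameter α whose limit α → ∞ gives max-min fairness is characteristic of the widely used α-fair utility, x^(1−α)/(1−α) for α ≠ 1 and log x for α = 1. The sketch below assumes that form; the function names are hypothetical.

```python
import math


def alpha_fair_utility(x, alpha):
    """Alpha-fair utility of a positive allocation x (assumed form)."""
    if x <= 0:
        raise ValueError("x must be positive")
    if alpha == 1:
        return math.log(x)  # proportional fairness
    return x ** (1 - alpha) / (1 - alpha)


def total_utility(allocations, alpha):
    """Sum of per-user utilities, the quantity to be maximized."""
    return sum(alpha_fair_utility(x, alpha) for x in allocations)


# With alpha = 0 only the total matters, so both splits score the same.
even_0, skewed_0 = total_utility([2, 2], 0), total_utility([1, 3], 0)
# With alpha = 2 the even split scores higher than the skewed one.
even_2, skewed_2 = total_utility([2, 2], 2), total_utility([1, 3], 2)
```

Larger values of α penalize small allocations more heavily, which is how the parameter shifts the optimum toward the worst-off terminal.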

For example, consider optimizing the allocation of frequency resources to wireless communication systems. Suppose that the wireless communication system of group 1 can be allocated 1, 2, or 4 MHz of bandwidth, and that a throughput achievement rate can be calculated from each wireless communication terminal's required traffic and its throughput. The throughput can be calculated from the allocated bandwidth, the number of wireless communication terminals coexisting within the allocated frequency resource, the distance between the transmitting and receiving terminals, and so on.

Similarly, suppose that the wireless communication system of group 2 can be allocated 200, 400, or 600 kHz of bandwidth, and that its throughput achievement rate can be calculated from traffic and throughput in the same way as for group 1. The throughput achievement rate of the wireless communication system of group 3 can also be calculated by the same method.

Let the evaluation values of the wireless communication terminals in this case be x1, x2, x3, ... for the wireless communication system of group 1, y1, y2, y3, ... for group 2, and z1, z2, z3, ... for group 3. The overall evaluation of each of groups 1 to 3 can then be expressed as follows. In these evaluation functions, β and ε, like α, are parameters that determine the fairness of the utility function.
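The evaluation functions are not reproduced in this text. Assuming that each group's overall evaluation is the sum of its terminals' utility values, in line with the earlier statement that a fair allocation maximizes the sum of the per-user utility functions, they could be written as below. The symbols U_1, U_2, U_3 and the subscripted R are notation introduced here for illustration.

```latex
\[
U_1 = \sum_{i} R_{\alpha}(x_i), \qquad
U_2 = \sum_{i} R_{\beta}(y_i), \qquad
U_3 = \sum_{i} R_{\varepsilon}(z_i)
\]
```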

Based on the overall evaluation of the group, the reward of a given wireless communication terminal k can be calculated, for example, as follows.
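The reward equation is not reproduced in this text; the source states only that the reward is calculated from the terminal's individual reward and the utility. The sketch below shows one possible combination. The additive form and the weight `w` are assumptions made purely for illustration and should not be read as the patent's formula.

```python
def combined_reward(individual_reward, group_utility, w=0.5):
    """Blend a terminal's own reward with its group's overall utility.

    The additive form and the weight w are illustrative assumptions,
    not the formula from the source.
    """
    return individual_reward + w * group_utility


# A terminal that succeeded (reward 1.0) in a group whose utility is -1.0.
r_k = combined_reward(1.0, -1.0)
```

Whatever the exact form, the point is that a terminal's reward drops when its group's allocation is unfair, even if the terminal itself succeeded.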

Consider the following environment as a concrete example of the algorithm.
First, assume an environment with n wireless communication terminals and k available frequency channels. At each time, every terminal takes one of (k+1) possible actions: it either selects one of the k channels and attempts to use it, or does not use any channel. If the channel a terminal selects does not overlap with that of any other terminal and the channel can be used, the terminal receives an ACK; if it selects the same channel as even one other terminal, it does not receive an ACK. The success or failure of receiving this ACK is treated as the reward of each terminal. The action of each terminal and the resulting reward are treated as the state at that time. As a separate reward, a utility function is defined that is calculated from the total number of connections of the terminals (the number of wireless communication terminals that received an ACK) over each fixed period.
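The ACK rule in this environment can be sketched in a few lines of Python. The encoding of actions (0 for "do not transmit", 1 to k for a channel) and the name `ack_rewards` are assumptions made for illustration.

```python
from collections import Counter

NO_TRANSMIT = 0  # assumed encoding: 0 = use no channel, 1..k = select a channel


def ack_rewards(actions, num_channels):
    """Return 1 for each terminal that alone chose its channel, else 0."""
    for a in actions:
        if not 0 <= a <= num_channels:
            raise ValueError(f"action {a} is outside 0..{num_channels}")
    # Count how many terminals selected each channel.
    counts = Counter(a for a in actions if a != NO_TRANSMIT)
    # An ACK arrives only when exactly one terminal used the channel.
    return [1 if a != NO_TRANSMIT and counts[a] == 1 else 0 for a in actions]


# Five terminals, three channels: terminals 2 and 3 collide on channel 2.
rewards = ack_rewards([1, 2, 2, 0, 3], num_channels=3)
total_connections = sum(rewards)  # number of terminals that received an ACK
```

Summing the per-terminal rewards over a fixed period gives the total connection count from which the utility is computed.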

Besides the success or failure of receiving an ACK, the reward may be the result of determining, from throughput and communication capacity calculated from past communication results, whether a terminal has fallen into an outage state in which the required communication quality is not met. Alternatively, the number of wireless communication terminals in the group that are not in an outage state may be used as the reward. Taking the outage state into account as a reward allows learning to proceed with the preservation of user quality as the indicator. This approach therefore enables effective learning that reflects the user experience.
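A small sketch of an outage-based reward follows. It assumes that "outage" means the achieved throughput is below the required throughput; the source does not give the exact criterion, and the function names are hypothetical.

```python
def in_outage(achieved_throughput, required_throughput):
    """Assumed criterion: outage when achieved throughput is below the requirement."""
    return achieved_throughput < required_throughput


def non_outage_count(achieved, required):
    """Group-level reward: number of terminals that meet their requirement."""
    return sum(not in_outage(a, r) for a, r in zip(achieved, required))


# Three terminals (throughput in arbitrary units); the second one falls short.
reward = non_outage_count([1.0, 0.2, 3.0], [0.5, 0.5, 3.0])
```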

FIG. 4 shows an outline of the learning performed by the control server 10 in the present embodiment.
According to the algorithm shown in FIG. 4, the method of action selection is first determined for each of the n users i (step 100).

In step 100, one of the following three methods is selected at random.
1. A method that determines a random action without using learning results (step 102)
2. A method that uses learning with Main-net (steps 104, 106)
3. A method that uses learning with Fair-net (steps 108, 110)
The channel is selected at random with a certain probability in order to prevent learning from falling into a local optimum and to advance learning efficiently.
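The selection in step 100 can be sketched as follows. The sketch assumes that each network has already produced a list of (k + 1) Q-value estimates for the terminal's current state; the names `METHODS`, `select_action`, `q_main`, and `q_fair` are hypothetical, and the uniform choice among the three methods is one reading of "selected at random".

```python
import random

METHODS = ("random", "main_net", "fair_net")


def select_action(q_main, q_fair, rng=random):
    """Pick one of the three methods, then pick an action with it.

    q_main and q_fair are assumed to be lists of (k + 1) Q-value
    estimates, one per action, for a single terminal's current state.
    """
    method = rng.choice(METHODS)  # uniform choice among the three methods
    if method == "random":
        # Method 1: ignore the learning results entirely.
        action = rng.randrange(len(q_main))
    else:
        # Methods 2 and 3: take the action with the highest Q-value.
        q = q_main if method == "main_net" else q_fair
        action = max(range(len(q)), key=q.__getitem__)
    return method, action


method, action = select_action([0.1, 0.9, 0.3], [0.8, 0.2, 0.1])
print(method, action)
```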

In the present embodiment, the known Deep Q Network (DQN) technique is used to learn the Q function, in order to cope with cases where the number of states the agent can take is enormous. Main-net is the name given to the DQN that searches for a policy π that maximizes each user i's own reward r, that is, the expected value of whether user i can use a channel at each time. Fair-net is the name given to the DQN that searches for a policy π that maximizes the above utility function, which is set in consideration of the utility of the entire group.

Once the actions of all terminals i have been determined, the reward r and the state s of each terminal i are determined (step 112).

Next, the action a, the reward r, and the state s of each terminal i are added to the memory of the control server 10 for learning (step 114). The reward r consists of the result xi of whether terminal i could use a channel at each time and the calculated value of the utility function R. These data need only be stored for a fixed period of time.

Next, information on each terminal corresponding to a plurality of time slots is extracted at random from the memory (step 116).
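Steps 114 and 116 correspond to what is commonly called an experience replay memory in DQN training. The following is a minimal sketch under stated assumptions: the class name `ReplayMemory`, the `Transition` fields, and the use of a fixed-capacity buffer to model "stored for a fixed period of time" are all choices made for this example.

```python
import random
from collections import deque, namedtuple

# Assumed record layout: x_i is the channel-use result of terminal i,
# utility is the calculated value of the utility function R.
Transition = namedtuple("Transition", "terminal state action x_i utility")


class ReplayMemory:
    def __init__(self, capacity):
        # A bounded deque silently discards the oldest entries, so only
        # the most recent `capacity` records are retained.
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        """Step 114: store one terminal's record for one time slot."""
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        """Step 116: draw records at random, without replacement."""
        size = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.add(Transition(terminal=0, state=t, action=1, x_i=1, utility=0.5))
print(len(memory))  # 3
```

Sampling at random rather than in time order is the usual reason for such a memory: it weakens the correlation between consecutive training samples before the batch update of step 118.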

Batch learning is then executed using the extracted information as training data, and the parameters of Main-net and Fair-net are updated (step 118).

Based on the updated parameters, each terminal again performs channel selection based on the learning results (steps 104 to 110) or random channel selection (step 102), and learning proceeds by repeating the same flow (steps 112 to 118).

In the above description, which of the three methods a terminal uses to determine its action is decided at random, but the present invention is not limited to this. For example, when a terminal selects an action based on learning results, it may use the results of learning with Main-net as a rule and use the results of learning with Fair-net with a certain probability. The probability of determining the action at random may also be set lower than the probability of determining the action using learning results.
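The weighted variant described above could look like the sketch below. The probabilities 0.1 and 0.2 are purely illustrative values chosen for this example, as are the function and parameter names; the text only states that Main-net is the default, that Fair-net is used with a certain probability, and that random selection may be made less likely than learned selection.

```python
import random


def choose_method(p_random=0.1, p_fair=0.2, rng=random):
    """Choose an action-selection method with unequal probabilities.

    Main-net receives the remaining probability mass, so it is the
    default whenever p_random + p_fair is small.
    """
    p_main = 1.0 - p_random - p_fair
    if min(p_random, p_fair, p_main) < 0:
        raise ValueError("probabilities must be non-negative and sum to 1")
    methods = ["random", "main_net", "fair_net"]
    weights = [p_random, p_main, p_fair]
    # random.choices draws with replacement according to the weights.
    return rng.choices(methods, weights=weights)[0]


print(choose_method())
```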

As described above, in the wireless communication system of the present embodiment, action based on individual learning results serves as the first step and overall evaluation by the utility function serves as the second step, so that multi-stage evaluation can be performed on the actions of individual terminals. The present embodiment can therefore achieve both optimization of each of the plurality of wireless communication terminals belonging to the same group and optimization for ensuring fairness within that group.

Embodiment 2.
Next, Embodiment 2 of the present invention will be described with reference to FIGS. 5 and 6 together with FIG. 1. As in Embodiment 1, the wireless communication system of the present embodiment can be realized by the configuration shown in FIG. 1. The system of the present embodiment is the same as that of Embodiment 1 except that the method of determining terminal actions and the method of learning the Q function differ.

[Features of Embodiment 2]
FIG. 5 shows the reinforcement learning model implemented in the wireless communication system of the present embodiment. In the model shown in FIG. 5, in addition to the processing executed by the model shown in FIG. 3, an overall evaluation covering all three groups is performed. This overall evaluation is performed to optimize fairness across all wireless communication terminals operating under the same environment 14.

In the model shown in FIG. 5, the utility function R_all for the overall evaluation is calculated by the following algorithm, using the evaluation results for the wireless communication systems of groups 1 to 3. The parameter θ in the following equation determines the fairness of the utility function, in the same way as α in the utility function R for a group.

Here, Σ in the above equation denotes the sum over the rewards Rx, Ry, and Rz of the three groups.

In the system of the present embodiment, the reward Rxk of a given wireless communication terminal k can be calculated as follows, using the individual reward xk, the group reward R, and the overall reward R_all.

If the reward Rxk of wireless communication terminal k is calculated as above, the action of each terminal can be determined in consideration of max-min fairness across all wireless communication terminals belonging to the same environment 14.

FIG. 6 shows an outline of the learning performed by the control server 10 in the present embodiment. The flowchart shown in FIG. 6 is the same as the flowchart shown in FIG. 4 except that steps 120 and 122 are added.

As shown in FIG. 6, in the present embodiment a method that uses learning with Fair-net (overall) is adopted, with a certain probability, as a method of action selection (steps 120, 122). "Fair-net (overall)" is the name given to the DQN that searches for a policy π that maximizes the utility function R_all for all wireless communication terminals belonging to the same environment 14.

The above processing achieves, in a well-balanced manner, optimization of the individual wireless communication terminals, fairness among terminals belonging to the same group, and fairness among all terminals.

10 Control server
12, 12-1, 12-2, 12-3 Agent
14 Environment
S, s State
R, r Reward
A, a Action

Claims

An optimization method for a wireless communication system including a plurality of wireless communication terminals, the method comprising:
a step of determining, for each wireless communication terminal, an action that obtains the highest reward based on a state provided by an environment;
a step of calculating an individual reward obtained by the wireless communication terminal by returning the action to the environment;
a step of calculating, based on the individual reward of each of the plurality of wireless communication terminals, a utility representing fairness among the plurality of wireless communication terminals; and
a reward calculation step of calculating the reward for each individual wireless communication terminal based on the individual reward and the utility.

The optimization method according to claim 1, wherein the plurality of wireless communication terminals include a plurality of wireless communication terminals belonging to a group that shares at least one of a communication standard and requirements,
the utility is calculated for the group, and
in the reward calculation step, the reward of a wireless communication terminal belonging to the group is calculated based on the individual reward and the utility for the group.

The optimization method according to claim 2, wherein the plurality of wireless communication terminals are classified into a plurality of groups,
the method further comprises a step of calculating, based on the utility for each of the plurality of groups, an overall utility representing fairness across all of the plurality of wireless communication terminals, and
in the reward calculation step, the reward of a wireless communication terminal belonging to the group is calculated based on the individual reward, the utility for the group, and the overall utility.

The optimization method according to any one of claims 1 to 3, wherein the utility is calculated based on the number of terminals, among the plurality of wireless communication terminals, that have received an ACK indicating successful communication.

The optimization method according to any one of claims 1 to 3, wherein the utility is a value evaluated by fairness calculated from throughput and traffic load of the plurality of wireless communication terminals.

The optimization method according to any one of claims 1 to 3, wherein the utility is calculated based on the number of outage terminals calculated from the requirements for the plurality of wireless communication terminals.

A wireless communication system including a plurality of wireless communication terminals, the system comprising:
a control server that receives wireless environment information from the plurality of wireless communication terminals and provides control information to the plurality of wireless communication terminals,
wherein the control server executes:
a process of determining, for each wireless communication terminal, an action that obtains the highest reward based on a state provided by an environment;
a process of calculating an individual reward obtained by the wireless communication terminal by returning the action to the environment;
a process of calculating, based on the individual reward of each of the plurality of wireless communication terminals, a utility representing fairness among the plurality of wireless communication terminals; and
a process of calculating the reward for each individual wireless communication terminal based on the individual reward and the utility.

A program for a wireless communication system, implemented in a control server that receives wireless environment information from a plurality of wireless communication terminals and provides control information to the plurality of wireless communication terminals,
the program causing the control server to execute:
a process of determining, for each wireless communication terminal, an action that obtains the highest reward based on a state provided by an environment;
a process of calculating an individual reward obtained by the wireless communication terminal by returning the action to the environment;
a process of calculating, based on the individual reward of each of the plurality of wireless communication terminals, a utility representing fairness among the plurality of wireless communication terminals; and
a process of calculating the reward for each individual wireless communication terminal based on the individual reward and the utility.