JP2019208188A

JP2019208188A - Communication system, traffic control device, and traffic control method

Info

Publication number: JP2019208188A
Application number: JP2018103999A
Authority: JP
Inventors: Ryo Miyatake; Yusuke Asai; Takayuki Nishio
Original assignee: Kyoto University; Nippon Telegraph and Telephone Corp
Current assignee: Kyoto University; Nippon Telegraph and Telephone Corp
Priority date: 2018-05-30
Filing date: 2018-05-30
Publication date: 2019-12-05
Anticipated expiration: 2038-05-30
Also published as: JP7007669B2

Abstract

PROBLEM TO BE SOLVED: To increase total throughput in an environment where a moving obstacle temporarily blocks a line-of-sight channel for wireless communication. SOLUTION: An action determination unit 512 of a traffic control device 5 uses image data obtained by imaging the communication environment between an AP 2 and STAs 3, together with information on the amount of data addressed to each STA 3 stored in a storage unit 42 of a proxy server 4, as inputs to a value function. The value function calculates the value of an action, where an action is expressed as a combination of the traffic of the individual STAs 3. The unit calculates the value of each type of action and determines an action on the basis of those values. A communication control unit 53 controls communication so that the data addressed to the STAs 3 is delivered according to the determined action. A reward calculation unit 52 acquires the communication status of the STAs 3 resulting from this control and calculates a reward indicating the degree of improvement over the past communication status. A learning unit 513 updates the value function on the basis of the calculated reward. SELECTED DRAWING: Figure 1

Description

The present invention relates to a communication system, a traffic control device, and a traffic control method.

Millimeter-wave communication is attracting attention as a next-generation wireless technology capable of large-capacity, high-speed communication (see, for example, Non-Patent Document 1). One of its advantages is the wide available bandwidth, which enables high-speed communication exceeding 1 Gbit/s (gigabit per second). On the other hand, millimeter waves are strongly attenuated by moisture and oxygen, and communication quality drops sharply when the line-of-sight communication path is blocked by a human body or the like (see, for example, Non-Patent Document 2). To cope with this sharp degradation caused by blockage, a device that controls the flow rate and the route of traffic on the blocked communication path is required. Specifically, in a wireless communication system in which an AP (access point) communicates with a plurality of STAs (stations) over millimeter waves as shown in FIG. 9, a human body can block the line-of-sight communication path between the AP and an STA, and a control device is required to make effective use of the AP's radio band in such a situation. Hereinafter, the N STAs (N is an integer of 1 or more) are also referred to as STA-1 to STA-N.

As a technique for solving this communication control problem in millimeter-wave communication, a traffic control device based on human blockage prediction using an RGB-D camera has been proposed (see, for example, Non-Patent Document 3). In this prior art, a human body is detected using image and video data obtained from the RGB-D camera, and its destination is predicted. If the movement to that destination would cause the human body to block the line-of-sight communication path between the AP and an STA, the traffic between the AP and that STA is stopped immediately before the blockage occurs, and traffic on unblocked communication paths is transmitted with priority. With this control, the total throughput at the AP can be increased compared with the case without control. In other words, traffic control that makes effective use of the radio band becomes possible. Furthermore, because the blockage is predicted and control is applied proactively just before it occurs, the total throughput can be increased compared with conventional reactive control schemes, which apply control only after the throughput has dropped.

FIG. 10 is a functional block diagram of a traffic control device to which the technique of Non-Patent Document 3 is applied. In the figure, the traffic control device is mounted on a proxy server of a wireless communication system in which an AP communicates wirelessly with STA-1 to STA-N. The traffic control device includes an image analysis unit, a blockage determination unit, and a communication control unit. Before the traffic control device is put into operation, the communication paths are registered in the blockage determination unit as an initial setting. The image analysis unit uses images obtained from the RGB-D camera to estimate the position of a human body (obstacle) in the millimeter-wave communication environment. The blockage determination unit then determines, from the estimated position of the human body and its moving speed, whether a preset line-of-sight communication path will be blocked by the human body and, if so, estimates when the blockage will occur.

The communication control unit controls the traffic flow rate, based on the blockage status of the line-of-sight communication path estimated by the blockage determination unit, so that the traffic is stopped at the time the blockage is estimated to occur. Specifically, the communication control unit stops transmitting packets that were received from the Internet and are addressed to the STA whose line-of-sight communication path will be blocked. The communication control unit resumes transmission of packets addressed to that STA at the time the blockage is estimated to end. With this traffic control, the AP can allocate resources to communication with other STAs even while the throughput of communication with a certain STA is reduced by human blockage. The total throughput at the AP can therefore be increased compared with the case where no traffic control is performed.

Non-Patent Document 1: P. Wang, Y. Li, L. Song, and B. Vucetic, “Multi-gigabit millimeter wave wireless communications for 5G: From fixed access to cellular networks,” IEEE Communications Magazine, January 2015, vol. 53, no. 1, p. 168-178
Non-Patent Document 2: S. Collonge, G. Zaharia, and G.E. Zein, “Influence of the human activity on wide-band characteristics of the 60 GHz indoor radio channel,” IEEE Transactions on Wireless Communications, November 2004, vol. 3, no. 6, p. 2396-2406
Non-Patent Document 3: T. Nishio, R. Arai, K. Yamamoto, and M. Morikura, “Proactive traffic control based on human blockage prediction using RGBD cameras for millimeter-wave communications,” Proc. 2015 IEEE Consumer Communications and Networking Conference (CCNC), Las Vegas, Nevada, USA, January 2015, p. 152-153

The technique of Non-Patent Document 3 performs rule-based control: when a line-of-sight communication path is about to be blocked, communication with the STA using that path is cut off and resources are allocated to communication with other STAs. With this approach, the rules must be created by hand to suit the environment. For example, in an environment where blockage of the line-of-sight communication path does not affect communication quality (an environment where a reflected communication path is available), there is no need to stop communication even when the line-of-sight path is blocked. However, a millimeter-wave communication environment changes easily with the placement of millimeter-wave base stations and furniture, so the rules must be reconfigured every time it changes.

In addition, the appropriate traffic control policy is likely to differ in environments where suitable rules are hard to design by hand, for example when there are many blocking pedestrians who arrive unevenly, or when the applications differ, such as video versus voice calls. Determining an appropriate control policy in such cases is not easy.

Furthermore, the rule-based approach requires various processing stages on the images, such as human body recognition, movement prediction, and prediction of line-of-sight path blockage. The performance of each of these stages strongly affects the performance of the communication control.

In view of the above circumstances, an object of the present invention is to provide a communication system, a traffic control device, and a traffic control method capable of increasing the total throughput in an environment where a moving obstacle temporarily blocks a line-of-sight communication path for wireless communication.

One aspect of the present invention is a communication system having a first communication device, one or more second communication devices that communicate wirelessly with the first communication device, a third communication device that acquires data to be transmitted from the first communication device to the second communication devices, and a traffic control device. The traffic control device includes: an action determination unit that, using image data obtained by imaging the communication environment between the first communication device and the second communication devices and information on the amount of untransmitted data addressed to the second communication devices stored in the third communication device, calculates the value of each of a plurality of types of action with a value function that calculates the value of an action represented by a combination of the traffic of the individual second communication devices, and determines an action based on the calculated values; a communication control unit that controls the third communication device to transmit the data addressed to the second communication devices to the first communication device in accordance with the traffic of each second communication device represented by the action determined by the action determination unit; a reward calculation unit that acquires the communication status of the second communication devices resulting from the control by the communication control unit and calculates a reward representing the degree to which the acquired communication status has improved over the past communication status; and a learning unit that updates the value function based on the reward calculated by the reward calculation unit. The first communication device wirelessly transmits, to the second communication devices, the data addressed to the second communication devices that it has received from the third communication device.

One aspect of the present invention is a traffic control device including: an action determination unit that, using image data obtained by imaging the communication environment between a first communication device and one or more second communication devices and information on the amount of untransmitted data addressed to the second communication devices, calculates the value of each of a plurality of types of action with a value function that calculates the value of an action represented as a combination of the traffic of the individual second communication devices, and determines an action based on the calculated values; a communication control unit that controls communication so that the data addressed to the second communication devices is delivered from the first communication device in accordance with the traffic of each second communication device represented by the action determined by the action determination unit; a reward calculation unit that acquires the communication status of the second communication devices resulting from the control by the communication control unit and calculates a reward representing the degree to which the acquired communication status has improved over the past communication status; and a learning unit that updates the value function based on the reward calculated by the reward calculation unit.

One aspect of the present invention is the traffic control device described above, wherein the communication status of a second communication device is information representing the throughput at the second communication device or the time taken to transmit the data addressed to the second communication device.

One aspect of the present invention is the traffic control device described above, wherein the value function is approximated by a deep neural network.

One aspect of the present invention is the traffic control device described above, wherein the image data used for the value function is data obtained by reducing the resolution of each of a plurality of images captured at different times and then normalizing the pixel values.

One aspect of the present invention is the traffic control device described above, wherein the information on the amount of untransmitted data addressed to the second communication devices used for the value function is information in which vectors are arranged side by side, each vector representing in one-hot form the amount of untransmitted data addressed to one of the plurality of second communication devices.

One aspect of the present invention is the traffic control device described above, wherein the image data is depth image data.

One aspect of the present invention is a traffic control method having: an action determination step of, using image data obtained by imaging the communication environment between a first communication device and one or more second communication devices and information on the amount of untransmitted data addressed to the second communication devices, calculating the value of each of a plurality of types of action with a value function that calculates the value of an action represented as a combination of the traffic of the individual second communication devices, and determining an action based on the calculated values; a communication control step of controlling communication so that the data addressed to the second communication devices is delivered from the first communication device in accordance with the traffic of each second communication device represented by the action determined in the action determination step; a reward calculation step of acquiring the communication status of the second communication devices resulting from the control in the communication control step and calculating a reward representing the degree to which the acquired communication status has improved over the past communication status; and a learning step of updating the value function based on the reward calculated in the reward calculation step.

According to the present invention, it is possible to increase the total throughput in an environment where a moving obstacle temporarily blocks a line-of-sight communication path for wireless communication.

FIG. 1 is a diagram showing a configuration example of a wireless communication system according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the processing flow of the traffic control device according to the embodiment.
FIG. 3 is a diagram for explaining an episode according to the embodiment.
FIG. 4 is a diagram showing the processing of a camera image into input data according to the embodiment.
FIG. 5 is a diagram showing the processing of remaining-file-size information into input data according to the embodiment.
FIG. 6 is a diagram showing the layer design of the action evaluation function according to the embodiment.
FIG. 7 is a diagram showing the parameters of the simulation evaluation of the traffic control device according to the embodiment.
FIG. 8 is a diagram showing the simulation evaluation results of the traffic control device according to the embodiment.
FIG. 9 is a diagram showing a configuration example of the wireless communication system to be controlled.
FIG. 10 is a functional block diagram of a traffic control device according to the prior art.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
The traffic control device of this embodiment uses deep reinforcement learning to solve the problems of the prior art. It uses the camera image and the traffic buffer as the "state" and learns, through trial and error, the control appropriate to that state. Reinforcement learning is a type of machine learning in which an agent, the acting entity, acts on an environment by trial and error and acquires a better policy by receiving rewards from the environment for its actions. The agent acts according to a value function representing the reward expected from a state, and updates this value function with the rewards it obtains. In deep reinforcement learning, the value function is approximated with a neural network such as a convolutional neural network (CNN). This makes the method applicable to problems with an enormous number of states and, through the use of convolutional layers, effective for problems whose input is an image.
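The act-then-update loop described above can be illustrated with a small tabular sketch. The text here does not name a specific update rule, so the one-step Q-learning rule below is an assumption, as are the learning rate `alpha`, the discount factor `gamma`, and the state and action labels. The embodiment approximates the value function with a neural network rather than a table.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One-step Q-learning update (assumed rule; alpha and gamma are hypothetical)."""
    # Best value reachable from the next state under the current estimates.
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Hypothetical labels: two states and two traffic actions.
q = defaultdict(float)
actions = ["send_sta1", "send_sta2"]
new_value = q_update(q, "s0", "send_sta1", reward=1.0, next_state="s1", actions=actions)
```

With all estimates starting at zero, a reward of 1.0 moves the estimate for the chosen action to `alpha * 1.0`.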

FIG. 1 is a diagram showing a communication system 1 according to an embodiment of the present invention. The communication system 1 includes an access point (AP) 2, stations (STAs) 3, a proxy server 4, a traffic control device 5, and an imaging device 6. Of the N STAs 3 (N is an integer of 1 or more), the n-th STA 3 (n is an integer from 1 to N) is referred to as STA-n. In the figure, the traffic control device 5 is mounted on the proxy server 4. The communication system 1 shown in the figure has a configuration in which the conventional traffic control device shown in FIG. 10 is replaced with the traffic control device 5.

The AP 2 communicates wirelessly with one or more STAs 3. The AP 2 wirelessly transmits packets addressed to an STA 3 that the proxy server 4 has received from a communication device connected via the Internet 7. The AP 2 also wirelessly receives, from an STA 3, packets addressed to a communication device connected via the Internet 7 and transmits them to the proxy server 4. The proxy server 4 communicates via the Internet 7 on behalf of the STAs 3. The imaging device 6 is, for example, an RGB-D camera, which captures an RGB image (color image) and a depth image. The imaging device 6 captures, at a predetermined interval, an image of the environment that includes the wireless line-of-sight communication paths between the AP 2 and the STAs 3 and their surroundings. The imaging device 6 transmits the camera image, that is, the data of the captured image, to the traffic control device 5.

The proxy server 4 includes a first communication unit 41, a storage unit 42, a second communication unit 43, and the traffic control device 5. The first communication unit 41 receives, via the Internet 7, the packets of files addressed to the STAs 3 and writes them to the storage unit 42 separately for each STA 3. The storage unit 42 has a plurality of file buffers. A file addressed to an STA 3 is stored in a file buffer assigned to that STA 3. A plurality of file buffers can be assigned to a single STA 3, and an upper limit may be set on the number of file buffers assignable to a single STA 3. In this embodiment, up to three file buffers can be assigned to one STA 3. The second communication unit 43 reads the files addressed to the STAs 3 from the storage unit 42 and transmits them to the AP 2 under the control of the traffic control device 5.
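The per-STA file buffers of the storage unit 42 can be pictured with the following minimal sketch. The class name `FileStore`, its methods, and the choice to reject a file when all three buffers of an STA are occupied are assumptions made for illustration. The text only states that up to three buffers can be assigned per STA.

```python
from collections import deque

MAX_BUFFERS_PER_STA = 3  # the embodiment allows three file buffers per STA


class FileStore:
    """Hypothetical model of storage unit 42: one queue of file sizes per STA."""

    def __init__(self, num_stas):
        self.buffers = {n: deque() for n in range(1, num_stas + 1)}

    def enqueue(self, sta, size_bytes):
        """Store an arriving file; return False if the STA's buffers are all in use."""
        if len(self.buffers[sta]) >= MAX_BUFFERS_PER_STA:
            return False  # assumed policy; the source does not say what happens here
        self.buffers[sta].append(size_bytes)
        return True

    def remaining(self, sta):
        """Total untransmitted bytes addressed to this STA."""
        return sum(self.buffers[sta])

    def send(self, sta, nbytes):
        """Drain up to nbytes from the oldest files; return the bytes actually sent."""
        sent = 0
        queue = self.buffers[sta]
        while queue and sent < nbytes:
            chunk = min(queue[0], nbytes - sent)
            queue[0] -= chunk
            sent += chunk
            if queue[0] == 0:
                queue.popleft()  # file fully delivered, free its buffer
        return sent
```

The value returned by `remaining` corresponds to the remaining file size that the traffic control device reads as traffic buffer information.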

The traffic control device 5 includes a reinforcement learning unit 51, a reward calculation unit 52, and a communication control unit 53. The reinforcement learning unit 51 includes a processing unit 511, an action determination unit 512, and a learning unit 513. The action determination unit 512 and the learning unit 513 are the processing units of the deep reinforcement learning algorithm. The processing unit 511 converts the camera image input from the imaging device 6 and the traffic buffer information into a data format suitable for processing and outputs them to the processing units of the deep reinforcement learning algorithm. The action determination unit 512 outputs a traffic control signal as the "action", based on the "state" consisting of the converted camera image and traffic buffer information. The traffic buffer information is the amount of untransmitted data addressed to each STA 3 that is accumulated in the proxy server 4. In this embodiment, the remaining file size is used as the traffic buffer information. The remaining file size is the size of the untransmitted files addressed to each STA 3 that are stored in the storage unit 42. The learning unit 513 learns a better control method based on the reward that the reward calculation unit 52 calculates for the output "action".

The reward calculation unit 52 outputs a reward designed to suit the objective, computed from the throughput of each STA 3 and the traffic buffer information, or from a subset of these. The communication control unit 53 controls the second communication unit 43 of the proxy server 4 so that files addressed to the STAs 3 are delivered while the traffic between the AP 2 and each STA 3 is scheduled. File delivery is used here because a practical use case expected for millimeter-wave communication is the transmission of large files, which takes advantage of its high speed.

Note that the traffic control device 5 may itself include any one or more of the first communication unit 41, the storage unit 42, and the second communication unit 43 of the proxy server 4. The first communication unit 41 and the communication control unit 53 may be the same functional unit, the second communication unit 43 and the communication control unit 53 may be the same functional unit, or the first communication unit 41, the second communication unit 43, and the communication control unit 53 may all be the same functional unit. The traffic control device 5 may also be an external device connected to the proxy server 4 via a communication network. Furthermore, any one or more of the first communication unit 41, the storage unit 42, the second communication unit 43, the reinforcement learning unit 51, the reward calculation unit 52, and the communication control unit 53 may be realized by the proxy server 4 and the traffic control device 5 working together.

FIG. 2 is a flowchart showing the processing flow of the traffic control device 5.
When the traffic control device 5 starts, the imaging device 6 photographs the communication environment at fixed time intervals, generates camera images, and transmits them to the reinforcement learning unit 51 (step S1). Meanwhile, the communication control unit 53 acquires the remaining file size in the file buffers of each STA 3 and transmits it to the reinforcement learning unit 51 (step S2). The processing unit 511 preprocesses the data received from the imaging device 6 and the communication control unit 53 to match the design of the deep reinforcement learning, and then inputs the result to the action determination unit 512 (step S3).

Because deep reinforcement learning uses a neural network for the value function, the processing unit 511 converts the camera image and the remaining-file-size information into input data suited to the designed neural network. Examples of neural networks for the value function include a simple one consisting only of fully connected layers and one that includes the convolutional layers commonly used in image recognition. For example, when the value function is a neural network with only fully connected layers, the processing unit 511 lowers the resolution of the depth image from the camera image, flattens it into one-dimensional data, and normalizes each depth value to a value between 0 and 1. The processing unit 511 also discretizes the size of the files remaining in the file buffers of each STA 3, generates remaining-file-size information in one-hot form, and uses it as input data. A one-hot representation is a vector in which exactly one element is 1 and all other elements are 0. Each element of the vector representing the file size corresponds to a range of file sizes: the element corresponding to the file size remaining in the file buffer is set to 1, and the other elements are set to 0.
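The preprocessing for the fully connected case can be sketched in plain Python as follows. The use of block averaging to lower the resolution, the `max_depth` scale, the bin edges, and all function names are assumptions for illustration. The text only specifies lowering the resolution, flattening, normalizing to the range 0 to 1, and one-hot encoding of discretized file sizes. The sketch handles a single depth frame.

```python
def downsample(depth, factor):
    """Lower the resolution by averaging factor x factor blocks (assumed method)."""
    rows, cols = len(depth) // factor, len(depth[0]) // factor
    return [
        [
            sum(depth[r * factor + i][c * factor + j]
                for i in range(factor) for j in range(factor)) / factor ** 2
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def normalize_flat(depth, max_depth):
    """Flatten to one dimension and scale each depth value into [0, 1]."""
    return [min(max(v / max_depth, 0.0), 1.0) for row in depth for v in row]


def one_hot_remaining(remaining, bin_edges):
    """One-hot vector whose single 1 marks the size range containing `remaining`."""
    vec = [0] * (len(bin_edges) + 1)
    index = sum(1 for edge in bin_edges if remaining >= edge)
    vec[index] = 1
    return vec


def build_state(depth, remaining_per_sta, factor=2, max_depth=8.0,
                bin_edges=(1, 100, 1000)):
    """Concatenate the image part and the per-STA one-hot vectors into one input."""
    image_part = normalize_flat(downsample(depth, factor), max_depth)
    file_part = [bit for rem in remaining_per_sta
                 for bit in one_hot_remaining(rem, bin_edges)]
    return image_part + file_part


# Hypothetical 4x4 depth frame (metres) and remaining sizes for two STAs.
frame = [[0, 0, 8, 8], [0, 0, 8, 8], [4, 4, 4, 4], [4, 4, 4, 4]]
state = build_state(frame, remaining_per_sta=[0, 250])
```

Here the 4x4 frame becomes four normalized values, followed by one four-element one-hot vector per STA, giving a twelve-element input.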

The action determination unit 512 uses a deep reinforcement learning algorithm to determine the communication traffic of each STA 3 (the "action" in reinforcement learning) based on the output of the value function (step S4). Specifically, in the "state" given by the camera image and the remaining-file information of the file buffer, the action determination unit 512 preferentially adopts, among the possible "actions", the "action" (the traffic of each STA 3) that causes the state transition with the highest value. The action determination unit 512 transmits the determined traffic control information for each STA 3 to the communication control unit 53. On receiving it, the communication control unit 53 controls the second communication unit 43 of the proxy server 4 so that the files held in the file buffer are put into packets and transmitted to the AP 2 according to the traffic control information (step S5).
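One way to "preferentially adopt" the highest-valued action is an epsilon-greedy rule, sketched below. The embodiment does not name a selection rule, so epsilon-greedy, the function name `select_action`, and the default `epsilon` are assumptions made for illustration.

```python
import random


def select_action(action_values, epsilon=0.1, rng=random):
    """Return the index of the action to take.

    With probability epsilon a random action is explored (assumed rule);
    otherwise the action with the highest estimated value is chosen.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])


# With exploration disabled the choice is purely greedy.
chosen = select_action([0.2, 1.5, -0.3], epsilon=0.0)
```

Setting `epsilon` to 0 after learning has converged would make the device always apply the traffic combination that the value function rates highest.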

After the packets are transmitted, the communication control unit 53 acquires the amount of file data remaining in the buffer for each STA 3 and the throughput of each STA 3 at that time, and transmits them to the reward calculation unit 52 (step S6). The reward calculation unit 52 calculates a reward from the received remaining-file and throughput information (step S7). The reward is designed to match the specific objective of the traffic control. Examples of such objectives include maximizing the total throughput of the AP 2 and minimizing the total file transmission time. When the objective is to maximize the total throughput of the AP 2, each time the action determination unit 512 determines an action and the communication control unit 53 acts on that decision, the reward calculation unit 52 gives the total throughput of the AP 2 at that time as the reward. When the objective is to minimize the total file transmission time, each time the action determination unit 512 determines an action and the communication control unit 53 acts on that decision, the reward calculation unit 52 gives a negative constant as the reward for as long as a file has arrived at the proxy server 4 but has not yet been completely transmitted to the STA 3. The cumulative sum of the rewards is therefore proportional to the total file transmission time.
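The two reward designs described above can be sketched as follows. The function names and the value of `penalty` are hypothetical. Charging the negative constant once per STA that still has untransmitted data is one reading of the text, chosen here so that the cumulative reward tracks the sum of the per-file transmission times.

```python
def throughput_reward(per_sta_throughput):
    """Reward for the throughput objective: total AP throughput at this step."""
    return float(sum(per_sta_throughput))


def transmission_time_reward(remaining_per_sta, penalty=-1.0):
    """Reward for the transmission-time objective.

    A negative constant is charged for every STA whose file is still
    pending (assumed interpretation), so the sum over an episode is
    proportional to the total time files spent waiting or in transit.
    """
    pending = sum(1 for remaining in remaining_per_sta if remaining > 0)
    return penalty * pending


r_throughput = throughput_reward([800.0, 0.0])          # STA-2 is blocked
r_time = transmission_time_reward([700, 0, 50])         # two files pending
```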

For example, when the objective is to maximize the total throughput of the AP 2, the reward r_t at time step t is calculated by the following equation (1).

T_t is the total throughput at time step t, and c(t) is a coefficient that depends on the time parameter t. The Σ term is a weighted average of the total throughput up to that point, weighted by a parameter such as time. For example, each c(i) may be chosen so that the second term of equation (1) gives a time-weighted average throughput. If c(i) = 1 (where i is an integer less than or equal to t), the reward r_t is calculated by the following equation (2).

The reward may also be the throughput normalized by the average throughput T̄_t of the entire AP 2, as shown in equation (3), or the difference between normalized throughputs, as shown in equation (4).

Alternatively, as in the following equation (5), a large negative reward may be given when the attenuation rate of the throughput relative to its average falls below a fixed value α.

The throughput in equations (1) to (5) may also be replaced with the physical transmission rate of the millimeter wave communication.

The reward calculation unit 52 transmits the calculated reward to the reinforcement learning unit 51. Based on the notified reward, the reinforcement learning unit 51 advances learning by updating the value function with the deep reinforcement learning algorithm (step S8).
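The embodiment does not specify which deep reinforcement learning algorithm updates the value function. Assuming a Q-learning style update, the training target toward which the network output for the taken action is moved could be computed as in the sketch below. The function name `td_target` and the discount factor `gamma` are assumptions.

```python
def td_target(reward, next_action_values, done, gamma=0.99):
    """Temporal-difference target for the action that was taken.

    reward             : reward notified by the reward calculation unit
    next_action_values : value-function outputs for the next state
    done               : True if the episode ended with this step
    gamma              : discount factor (assumed hyperparameter)
    """
    if done:
        # No future reward once every file in the buffer has been sent.
        return reward
    return reward + gamma * max(next_action_values)


target_mid_episode = td_target(1.0, [0.5, 2.0], done=False, gamma=0.5)
target_last_step = td_target(1.0, [0.5, 2.0], done=True)
```

In a DQN-style learner, the squared difference between this target and the current network output would be minimized by gradient descent on the network weights.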

By repeating this series of operations, the reinforcement learning unit 51 determines the traffic of each STA 3 while advancing learning so that the cumulative sum of the input rewards is maximized. As learning progresses, the traffic control device 5 therefore automatically acquires a traffic control method adapted to the environment in which it is installed.

The traffic control device 5 performs the above processing over a plurality of episodes and learns the action value function from the results. FIG. 3 is a diagram for explaining an episode. An episode is the series of steps that continues until every file in the file buffer of the storage unit 42 has been transmitted. Under the control of the communication control unit 53 of the traffic control device 5, the proxy server 4 transmits the files stored in the file buffer to each STA 3 via the AP 2, and one episode ends when all files in the file buffer have been transmitted. No files are added during an episode. As episodes accumulate, the learning of the traffic control device 5 of the present embodiment advances. An upper limit on the number of episodes may be set in advance, and learning may be terminated when the number of episodes reaches that limit.

An example of the input data and layer design of the deep neural network (CNN) used as the value function will be described.
FIG. 4 is a diagram showing the processing from camera images to input data in step S3. The reinforcement learning unit 51 compresses the depth image data contained in each of the past five camera images, captured within one second, into two-dimensional image data of 20 × 20 pixels. The reinforcement learning unit 51 uses the resulting five-channel two-dimensional image as input data to the CNN.
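Keeping the five most recent depth frames as channels can be sketched with a fixed-length buffer. The class name `FrameStack` and its methods are hypothetical, and each frame is assumed to have already been compressed to 20 × 20 pixels before it is pushed.

```python
from collections import deque


class FrameStack:
    """Holds the most recent depth frames as the channels of one input."""

    def __init__(self, num_frames=5):
        # A bounded deque discards the oldest frame automatically.
        self.frames = deque(maxlen=num_frames)

    def push(self, frame):
        self.frames.append(frame)

    def ready(self):
        """True once enough frames have arrived to fill every channel."""
        return len(self.frames) == self.frames.maxlen

    def as_channels(self):
        """Return the frames oldest-first as a list of 2D images."""
        return list(self.frames)


stack = FrameStack(num_frames=5)
for i in range(6):
    stack.push([[i]])  # 1 x 1 stand-ins for 20 x 20 depth frames
channels = stack.as_channels()
```

Stacking consecutive frames lets the network see motion, for example a person walking toward the line-of-sight path, rather than a single static scene.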

FIG. 5 is a diagram showing the processing from remaining-file information to input data in step S3. First, the remaining amount of each file is discretized into a plurality of levels. Here, as an example, the maximum file size is 2000 Mbit (megabits) and the amount is discretized into 10 levels. In this case, the elements of the One-Hot vector used as the remaining-file information are defined as [(0-200 Mbit), (200-400 Mbit), (400-600 Mbit), (600-800 Mbit), ..., (1800-2000 Mbit)]. When the remaining file amount for STA-n (n is an integer from 1 to N) acquired from the storage unit 42 is 700 Mbit, the remaining-file information is the vector [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. The reinforcement learning unit 51 concatenates the vectors of remaining-file information generated for STA-1, STA-2, ..., STA-N and uses the result as input data.
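The discretization above can be sketched as follows, using the 2000 Mbit maximum and 10 levels from the example. The function names are hypothetical, and the treatment of values that fall exactly on a boundary (assigned to the upper range, with 2000 Mbit kept in the last range) is an assumption the text does not settle.

```python
def one_hot_remaining(remaining_mbit, max_mbit=2000, levels=10):
    """Encode one STA's remaining file amount as a One-Hot vector."""
    width = max_mbit / levels
    # Clamp so that the maximum size falls in the last range.
    index = min(int(remaining_mbit // width), levels - 1)
    vector = [0] * levels
    vector[index] = 1
    return vector


def encode_all(remaining_per_sta, max_mbit=2000, levels=10):
    """Concatenate the One-Hot vectors of STA-1 ... STA-N in order."""
    encoded = []
    for remaining in remaining_per_sta:
        encoded.extend(one_hot_remaining(remaining, max_mbit, levels))
    return encoded


sta_vector = one_hot_remaining(700)
network_input = encode_all([700, 0])
```

With 700 Mbit remaining, the fourth element (600-800 Mbit) is set, matching the vector given in the example.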

FIG. 6 is a diagram showing the layer design of the CNN. "Affine, a-b" denotes an operation that inputs an a-dimensional vector to a fully connected layer and outputs a b-dimensional vector. "k×l 2D Conversion, a-b" denotes an operation that convolves an a-channel input with a k×l two-dimensional filter and outputs b channels. "k×l 2D MaxPooling" denotes an operation that divides the input into grids of size k×l and outputs the maximum value of each grid as its representative value. "ReLU" denotes an operation that applies the activation function ReLU (Rectified Linear Unit) to the input. The activation function ReLU converts negative values to 0.

In the input layer, a five-channel two-dimensional image (5 Channels 2D Image) is generated by the processing shown in FIG. 4. The input layer also converts the remaining file amount of each STA 3 into a One-Hot vector by the processing shown in FIG. 5 and concatenates the vectors to generate a 60-dimensional vector.

The hidden layers consist of layers 1a to 8a, layers 1b to 2b, and layer 9, which takes the outputs of layers 8a and 2b as input.
In layer 1a, the five-channel two-dimensional image (5 Channels 2D Image) is convolved with a 5 × 5 two-dimensional filter and output as 20 channels. In layer 2a, the 20-channel output of layer 1a is passed through the activation function ReLU to remove negative values. In layer 3a, the 20-channel output of layer 2a is divided into 2 × 2 grids, and the maximum value of each grid is output. In layer 4a, the 20-channel output of layer 3a is convolved with a 5 × 5 two-dimensional filter and output as 50 channels. In layer 5a, the 50-channel output of layer 4a is passed through the activation function ReLU to remove negative values. In layer 6a, the 50-channel output of layer 5a is divided into 2 × 2 grids, and the maximum value of each grid is output. In layer 7a, the 1250-dimensional vector from layer 6a is input to a fully connected layer, which outputs a 500-dimensional vector. In layer 8a, the output of layer 7a is passed through the activation function ReLU to remove negative values.
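The 1250-dimensional input to layer 7a can be checked with simple shape arithmetic. The embodiment does not state the stride or padding of the 5 × 5 filters. The sketch below assumes stride 1 and shows that zero padding of 2 pixels (which preserves the spatial size) reproduces 1250, whereas no padding would not. The function names are hypothetical.

```python
def conv_out(size, kernel, padding, stride=1):
    """Spatial size after a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel):
    """Spatial size after non-overlapping max pooling."""
    return size // kernel


def branch_a_flat_dim(padding, image_size=20, out_channels=50):
    """Flattened dimension fed to layer 7a for a given (assumed) padding."""
    s = conv_out(image_size, 5, padding)  # layer 1a: 5x5 filter, 5 -> 20 channels
    s = pool_out(s, 2)                    # layer 3a: 2x2 max pooling
    s = conv_out(s, 5, padding)           # layer 4a: 5x5 filter, 20 -> 50 channels
    s = pool_out(s, 2)                    # layer 6a: 2x2 max pooling
    return out_channels * s * s


with_padding = branch_a_flat_dim(padding=2)     # 20 -> 20 -> 10 -> 10 -> 5
without_padding = branch_a_flat_dim(padding=0)  # 20 -> 16 -> 8 -> 4 -> 2
```

With padding 2 the final feature map is 50 channels of 5 × 5, that is 50 × 25 = 1250 values, which agrees with the dimension stated for layer 7a.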

In layer 1b, on the other hand, the 60-dimensional vector obtained from the remaining file amount of each STA 3 is input to a fully connected layer, which outputs a 100-dimensional vector. Here, the product of the number N of STAs 3 and the number of elements of the One-Hot vector is assumed to be 60. In layer 2b, the output of layer 1b is passed through the activation function ReLU to remove negative values.

In layer 9, the 600-dimensional vector formed by combining the output of layer 8a and the output of layer 2b is input to a fully connected layer to obtain an evaluation value for each action. The output layer outputs the evaluation value of each action. Each action may be a combination of whether communication with each STA 3 is turned ON or OFF, or a combination of the traffic volumes of the N STAs 3. In FIG. 6, the actions are the combinations of turning communication with each of two STAs 3 ON or OFF, excluding the combination in which both are OFF. That is, there are three types of actions in which (STA-1, STA-2) is (ON, ON), (ON, OFF), or (OFF, ON). To obtain an evaluation value for each of these three types of actions, layer 9 outputs a three-dimensional vector.
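The ON/OFF action set can be enumerated as in the following sketch, where `True` stands for ON. The function name `on_off_actions` is hypothetical. For N STAs this yields 2^N - 1 actions, which is the output dimension layer 9 would need under this action design.

```python
from itertools import product


def on_off_actions(num_stas):
    """All ON/OFF combinations except the one where every STA is OFF."""
    return [combo
            for combo in product((True, False), repeat=num_stas)
            if any(combo)]


actions_two_stas = on_off_actions(2)
num_actions_three_stas = len(on_off_actions(3))
```

For two STAs the result lists (ON, ON), (ON, OFF), and (OFF, ON) in that order, matching the three actions in the example.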

For the Conversion (convolution) layers, filters that extract features from the image are expected to be learned near the input layer, and filters that predict values from those features are expected to be learned near the output layer. ReLU is widely used as an activation function. Compared with other activation functions (such as the sigmoid function), ReLU is empirically known to give faster learning and higher performance. The MaxPooling layers are used to shorten the learning time by reducing the number of parameters, which increases after passing through the Conversion layers. The Affine layers are used in the expectation that they predict values from the features extracted by the convolution layers. It is empirically known that this design can be expected to shorten the learning time compared with a layer design consisting only of convolution layers.

The learning unit 513 updates the CNN used as the value function. Specifically, the learning unit 513 updates the weights in the fully connected layers based on the reward calculated by the reward calculation unit 52. For example, when the action determination unit 512 decides that communication between the AP 2 and STA-1 is ON and communication between the AP 2 and STA-2 is OFF, the communication control unit 53 performs control so that only communication between the AP 2 and STA-1 is turned ON. For example, the communication control unit 53 controls the second communication unit 43 of the proxy server 4 so that files addressed to STA-1 are output to the AP 2 and files addressed to STA-2 are not. Alternatively, a control signal may be sent to the AP 2 via the second communication unit 43 of the proxy server 4 instructing it to communicate with STA-1 and not with STA-2. Even with such control, however, the communication speed is low if the propagation path between the AP 2 and STA-1 is actually in a poor state, for example because the path is blocked or because multipath reflection occurs. As an extreme example, if there is a metal wall between the AP 2 and STA-1 and no radio waves reach STA-1 at all, the throughput is 0 Mbit/s even when communication is ON. The learning unit 513 can learn to control the ON/OFF state of each STA 3 so that such situations do not occur.

According to the traffic control device 5 of the present embodiment, traffic control is performed by deep reinforcement learning that takes camera images as input, which makes it possible to adapt automatically to various communication environments and use the radio band effectively. Even when the installation environment of the communication terminals or the camera changes, traffic can be controlled automatically in a manner adapted to the changed environment. The device is particularly useful in a communication system in which a wireless LAN (Local Area Network) router equipped with a millimeter wave communication function is connected to a plurality of millimeter wave communication terminals and blockage by human bodies may occur. It can also handle cases where the installation environment of the wireless LAN router or the millimeter wave communication terminals changes.

A simulation evaluation of the traffic control device 5 using measured data will now be described. FIG. 7 is a diagram showing the specifications of the simulation evaluation. The evaluation assumes that two STAs 3 are connected to one AP 2, and compares the total throughput at the AP under the traffic control of the present embodiment with that under round-robin control, in which the transmission destination is switched alternately each time a file transmission completes. The AP 2 is a millimeter wave AP. The line-of-sight and blocked-path throughputs of the millimeter wave communication and the camera images used in the simulation were values measured in experiments with real devices. The camera images were data captured with an RGB-D camera. Commercially available devices were used for the AP 2 and the STAs 3.

FIG. 8 is a diagram showing the simulation evaluation results. It shows how the total throughput changes with the number of episodes under the traffic control of the present embodiment and under round-robin control. The total throughput of the AP 2 plotted in the graph is the time average of the total throughput of the AP 2 within each episode. In this simulation, files of random size are first placed in the file buffer of the proxy server 4 and are then transmitted to each STA 3 through the AP 2. One episode ends when all files in the file buffer have been transmitted. The results show that, as episodes accumulate and the learning of the traffic control device 5 advances, the total throughput under the traffic control of the present embodiment exceeds the throughput under round-robin control.

According to the embodiment described above, the communication system includes a first communication device, one or more second communication devices that communicate wirelessly with the first communication device, a third communication device that acquires data to be transmitted from the first communication device to the second communication devices, and a traffic control device. For example, the first communication device is the AP 2, the second communication device is an STA 3, and the third communication device is the proxy server 4.

The traffic control device includes an action determination unit, a communication control unit, a reward calculation unit, and a learning unit. Using image data obtained by imaging the communication environment between the first communication device and the second communication devices, together with information on the amount of untransmitted data addressed to the second communication devices stored in the third communication device, the action determination unit calculates the value of each of a plurality of types of actions with a value function that calculates the value of an action expressed as a combination of the traffic of each second communication device. The value function may be approximated by a deep neural network. In this case, the image data input to the deep neural network is data obtained by reducing the resolution of each of a plurality of images captured at different times and then normalizing the pixel values. The information on the amount of untransmitted data input to the deep neural network is information in which vectors, each expressing in One-Hot representation the amount of untransmitted data addressed to one of the second communication devices, are arranged side by side. The action determination unit determines an action based on the calculated values.

The communication control unit controls the third communication device so that data addressed to the second communication devices is transmitted to the first communication device according to the traffic of each second communication device represented by the action determined by the action determination unit. The reward calculation unit acquires the communication status of the second communication devices resulting from the control by the communication control unit, and calculates a reward representing the degree to which the acquired communication status has improved over the past communication status. The communication status of a second communication device represents the throughput at the second communication device or the time taken to transmit the data addressed to it. The learning unit updates the value function based on the calculated reward. The first communication device wirelessly transmits the data addressed to the second communication devices, received from the third communication device, to the second communication devices.

The functions of the traffic control device 5 in the embodiment described above may be implemented by a computer. In that case, a program for realizing these functions may be recorded on a computer-readable recording medium, and the functions may be realized by having a computer system read and execute the program recorded on the recording medium. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The "computer-readable recording medium" may further include a medium that dynamically holds the program for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as volatile memory inside a computer system serving as the server or client in that case. The program may realize only part of the functions described above, or may realize them in combination with a program already recorded in the computer system.

The embodiment of the present invention has been described in detail above with reference to the drawings. However, the specific configuration is not limited to this embodiment and also includes designs and the like that do not depart from the gist of the present invention.

The present invention can be used for a communication system that performs wireless communication.

DESCRIPTION OF SYMBOLS 1 ... communication system, 2 ... access point, 3 ... wireless station, 4 ... proxy server, 5 ... traffic control device, 6 ... imaging device, 7 ... Internet, 41 ... first communication unit, 42 ... storage unit, 43 ... second communication unit, 51 ... reinforcement learning unit, 52 ... reward calculation unit, 53 ... communication control unit, 511 ... processing unit, 512 ... action determination unit, 513 ... learning unit

Claims

A communication system comprising a first communication device, one or more second communication devices that communicate wirelessly with the first communication device, a third communication device that acquires data to be transmitted from the first communication device to the second communication device, and a traffic control device,
wherein the traffic control device includes:
an action determination unit that, using image data obtained by imaging a communication environment between the first communication device and the second communication device and information on the amount of untransmitted data addressed to the second communication device stored in the third communication device, calculates the value of each of a plurality of types of actions with a value function that calculates the value of an action expressed as a combination of the traffic of each of the second communication devices, and determines an action based on the calculated values;
a communication control unit that controls the third communication device to transmit the data addressed to the second communication device to the first communication device according to the traffic of each of the second communication devices represented by the action determined by the action determination unit;
a reward calculation unit that acquires a communication status of the second communication device resulting from the control by the communication control unit, and calculates a reward representing the degree to which the acquired communication status has improved over a past communication status; and
a learning unit that updates the value function based on the reward calculated by the reward calculation unit,
and wherein the first communication device wirelessly transmits the data addressed to the second communication device, received from the third communication device, to the second communication device.

A traffic control device comprising:
an action determination unit that, using image data obtained by imaging a communication environment between a first communication device and one or more second communication devices and information on the amount of untransmitted data addressed to the second communication device, calculates the value of each of a plurality of types of actions with a value function that calculates the value of an action expressed as a combination of the traffic of each of the second communication devices, and determines an action based on the calculated values;
a communication control unit that controls communication so that the data addressed to the second communication device is distributed from the first communication device according to the traffic of each of the second communication devices represented by the action determined by the action determination unit;
a reward calculation unit that acquires a communication status of the second communication device resulting from the control by the communication control unit, and calculates a reward representing the degree to which the acquired communication status has improved over a past communication status; and
a learning unit that updates the value function based on the reward calculated by the reward calculation unit.

The communication status of the second communication device is information indicating the throughput in the second communication device or the time taken to transmit the data addressed to the second communication device.
The traffic control device according to claim 2.

The value function is approximated by a deep neural network,
The traffic control device according to claim 2 or claim 3.

The image data used for the value function is data obtained by normalizing pixel values after reducing the resolution of each of a plurality of image data captured at different timings.
The traffic control device according to claim 4.

The information on the amount of untransmitted data addressed to the second communication device used in the value function is information in which vectors, each expressing in One-Hot representation the amount of untransmitted data addressed to one of a plurality of the second communication devices, are arranged side by side.
The traffic control device according to claim 4.

The image data is depth image data.
The traffic control device according to any one of claims 2 to 6.

A traffic control method comprising:
an action determination step of, using image data obtained by imaging a communication environment between a first communication device and one or more second communication devices and information on the amount of untransmitted data addressed to the second communication device, calculating the value of each of a plurality of types of actions with a value function that calculates the value of an action expressed as a combination of the traffic of each of the second communication devices, and determining an action based on the calculated values;
a communication control step of controlling communication so that the data addressed to the second communication device is distributed from the first communication device according to the traffic of each of the second communication devices represented by the action determined in the action determination step;
a reward calculation step of acquiring a communication status of the second communication device resulting from the control in the communication control step, and calculating a reward representing the degree to which the acquired communication status has improved over a past communication status; and
a learning step of updating the value function based on the reward calculated in the reward calculation step.