JP7007669B2 - Communication system, traffic control device and traffic control method

Info

Publication number: JP7007669B2
Application number: JP2018103999A
Authority: JP
Inventors: 遼宮武; 裕介淺井; 理志西尾
Original assignee: Kyoto University; Nippon Telegraph and Telephone Corp
Current assignee: Kyoto University; Nippon Telegraph and Telephone Corp
Priority date: 2018-05-30
Filing date: 2018-05-30
Publication date: 2022-01-24
Anticipated expiration: 2038-05-30
Also published as: JP2019208188A

Description

The present invention relates to a communication system, a traffic control device, and a traffic control method.

Millimeter-wave communication is attracting attention as a next-generation wireless communication technology capable of high-capacity, high-speed communication (see, for example, Non-Patent Document 1). One advantage of millimeter-wave communication is its wide available bandwidth, which enables high-speed communication exceeding 1 Gbit/s (gigabit per second). On the other hand, millimeter waves are strongly attenuated by moisture and oxygen, and communication quality degrades sharply when the line-of-sight communication path is blocked by a human body or the like (see, for example, Non-Patent Document 2). To cope with this sharp degradation caused by blockage, a device is needed that controls the flow rate on the blocked communication path and the routing of traffic. Specifically, in a wireless communication system in which an AP (Access Point) communicates with a plurality of STAs (Stations) using millimeter waves, as shown in FIG. 9, a human body may block the line-of-sight communication path between the AP and an STA, and a control device is needed to make effective use of the AP's radio band in such a situation. In the following, N STAs (N is an integer of 1 or more) are also referred to as STA-1 to STA-N.

As a solution to the communication control problem in millimeter-wave communication, a traffic control device based on human blockage prediction using an RGB-D camera has been proposed (see, for example, Non-Patent Document 3). In this prior art, a human body is detected using image and video data obtained from the RGB-D camera, and its destination is predicted. If the movement to that destination would cause the human body to block the line-of-sight communication path between the AP and an STA, the traffic between the AP and that STA is stopped immediately before the blockage occurs, and traffic on unblocked communication paths is transmitted with priority. This control increases the total throughput at the AP compared with the uncontrolled case. That is, it enables traffic control that makes effective use of the radio band. In addition, because the blockage is predicted and control is applied proactively just before it occurs, the total throughput can be increased compared with conventional reactive control schemes, which apply control only after the throughput has dropped.

FIG. 10 is a functional block diagram of a traffic control device to which the technique of Non-Patent Document 3 is applied. In the figure, the traffic control device is installed in a proxy server of a wireless communication system in which an AP communicates wirelessly with STA-1 to STA-N. The traffic control device includes an image analysis unit, a blockage determination unit, and a communication control unit. Before the traffic control device is put into operation, the communication paths are registered in the blockage determination unit as an initial setting. The image analysis unit uses images obtained from the RGB-D camera to estimate the position of a human body (obstacle) in the millimeter-wave communication environment. Next, the blockage determination unit determines, from the estimated position of the human body and its moving speed, whether a preset line-of-sight communication path will be blocked by the human body, and if so, estimates when the blockage will occur.

The communication control unit controls the traffic flow rate, based on the blockage status of the line-of-sight communication path estimated by the blockage determination unit, so that the traffic is stopped at the time the blockage is estimated to occur. Specifically, the communication control unit stops forwarding packets received from the Internet that are addressed to an STA whose line-of-sight communication path will be blocked. The communication control unit then resumes transmission of packets addressed to that STA at the time the blockage is estimated to be released. With this traffic control, even when the throughput of communication with one STA drops because of human blockage, the AP can allocate resources to communication with another STA. The total throughput at the AP can therefore be increased compared with the case without traffic control.
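The stop-and-resume rule of this prior-art controller can be illustrated with a minimal sketch. The `BlockagePrediction` structure, the function names, and the use of seconds as the time unit are assumptions made for illustration only and do not come from Non-Patent Document 3.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class BlockagePrediction:
    """Predicted blockage window for one line-of-sight path (hypothetical structure)."""
    start: float  # estimated time at which the blockage begins, in seconds
    end: float    # estimated time at which the blockage is released, in seconds


def may_transmit(now: float, prediction: Optional[BlockagePrediction]) -> bool:
    """Return True if packets for this STA should be forwarded at time `now`."""
    if prediction is None:
        return True  # no blockage predicted for this path
    # Hold traffic from the predicted start until the predicted release.
    return not (prediction.start <= now < prediction.end)


def select_destinations(now: float,
                        predictions: Dict[str, Optional[BlockagePrediction]]) -> List[str]:
    """List the STAs whose line-of-sight paths are predicted to be clear at `now`."""
    return [sta for sta, p in sorted(predictions.items()) if may_transmit(now, p)]
```

In this sketch, an STA drops out of the destination list only during its predicted blockage window, so the AP's airtime goes to the remaining STAs during that window.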

Non-Patent Document 1: P. Wang, Y. Li, L. Song, and B. Vucetic, “Multi-gigabit millimeter wave wireless communications for 5G: From fixed access to cellular networks,” IEEE Communications Magazine, vol. 53, no. 1, pp. 168-178, January 2015.
Non-Patent Document 2: S. Collonge, G. Zaharia, and G. E. Zein, “Influence of the human activity on wide-band characteristics of the 60 GHz indoor radio channel,” IEEE Transactions on Wireless Communications, vol. 3, no. 6, pp. 2396-2406, November 2004.
Non-Patent Document 3: T. Nishio, R. Arai, K. Yamamoto, and M. Morikura, “Proactive traffic control based on human blockage prediction using RGBD cameras for millimeter-wave communications,” Proc. 2015 IEEE Consumer Communications and Networking Conference (CCNC), Las Vegas, Nevada, USA, pp. 152-153, January 2015.

The technique of Non-Patent Document 3 performs rule-based control: when a line-of-sight communication path is about to be blocked, communication with the STA using that path is cut off and resources are allocated to communication with another STA. With this approach, rules must be created manually to suit the environment. For example, in an environment where blockage of the line-of-sight path does not affect communication quality (an environment in which a reflected path is available), communication need not be stopped even when the line-of-sight path is blocked. However, a millimeter-wave communication environment changes easily with the placement of millimeter-wave base stations and furniture, so the rules must be reconfigured each time.

In addition, the appropriate traffic control policy is likely to differ in environments where appropriate rules are difficult to design by hand, for example, when many blocking pedestrians are present and their arrivals are uneven, or when applications differ, such as video and voice calls. Determining an appropriate control policy is not easy, however.

Furthermore, various kinds of processing must be performed on the images, such as human body recognition, movement prediction, and line-of-sight path blockage prediction. The performance of these processes strongly affects the performance of communication control.

In view of the above circumstances, an object of the present invention is to provide a communication system, a traffic control device, and a traffic control method capable of increasing the total throughput in an environment in which a moving obstacle temporarily blocks a line-of-sight communication path for wireless communication.

One aspect of the present invention is a communication system having a first communication device, one or more second communication devices that communicate wirelessly with the first communication device, a third communication device that acquires data to be transmitted from the first communication device to the second communication devices, and a traffic control device, wherein the traffic control device comprises: an action determination unit that, using image data capturing the communication environment between the first communication device and the second communication devices and information on the amount of untransmitted data addressed to the second communication devices stored in the third communication device, calculates the value of each of a plurality of types of actions by means of a value function that calculates the value of an action represented by a combination of the traffic of each of the second communication devices, and determines an action based on the calculated values; a communication control unit that controls the third communication device so as to transmit the data addressed to the second communication devices to the first communication device in accordance with the traffic of each second communication device represented by the action determined by the action determination unit; a reward calculation unit that acquires the communication status of the second communication devices resulting from the control by the communication control unit, and calculates a reward representing the degree to which the acquired communication status has improved over past communication status; and a learning unit that updates the value function so as to maximize the cumulative sum of the rewards calculated by the reward calculation unit for different time intervals, wherein the reward in a time interval is a first value obtained by subtracting, from the total throughput of the first communication device in that time interval, a weighted average obtained by multiplying the total throughput of the first communication device in each time interval from a past time interval up to that time interval by a time-dependent coefficient and then averaging, or a second value obtained by normalizing the total throughput of the first communication device in that time interval by the average throughput of the first communication device, or a third value that is a positive constant when the ratio of the total throughput of the first communication device in that time interval to the average throughput of the first communication device exceeds a predetermined value and a negative constant whose absolute value is larger than the positive constant when the ratio is equal to or less than the predetermined value, and wherein the first communication device wirelessly transmits the data addressed to the second communication devices, received from the third communication device, to the second communication devices.

One aspect of the present invention is a traffic control device comprising: an action determination unit that, using image data capturing the communication environment between a first communication device and one or more second communication devices and information on the amount of untransmitted data addressed to the second communication devices, calculates the value of each of a plurality of types of actions by means of a value function that calculates the value of an action represented as a combination of the traffic of each of the second communication devices, and determines an action based on the calculated values; a communication control unit that controls communication so that the data addressed to the second communication devices is delivered from the first communication device in accordance with the traffic of each second communication device represented by the action determined by the action determination unit; a reward calculation unit that acquires the communication status of the second communication devices resulting from the control by the communication control unit, and calculates a reward representing the degree to which the acquired communication status has improved over past communication status; and a learning unit that updates the value function so as to maximize the cumulative sum of the rewards calculated by the reward calculation unit for different time intervals, wherein the reward in a time interval is a first value obtained by subtracting, from the total throughput of the first communication device in that time interval, a weighted average obtained by multiplying the total throughput of the first communication device in each time interval from a past time interval up to that time interval by a time-dependent coefficient and then averaging, or a second value obtained by normalizing the total throughput of the first communication device in that time interval by the average throughput of the first communication device, or a third value that is a positive constant when the ratio of the total throughput of the first communication device in that time interval to the average throughput of the first communication device exceeds a predetermined value and a negative constant whose absolute value is larger than the positive constant when the ratio is equal to or less than the predetermined value.
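The three reward variants named in this aspect can be written out as a minimal sketch. The geometric weighting `decay ** age`, the division by the sum of the weights, the use of plain division as the normalization, and the default constants are all assumptions chosen for illustration; the text only requires a time-dependent coefficient, a normalization by the average throughput, and a negative constant larger in magnitude than the positive one.

```python
from typing import Sequence


def reward_weighted_baseline(totals: Sequence[float], decay: float = 0.9) -> float:
    """First variant: current total throughput minus a time-weighted average.

    `totals` holds the AP's total throughput per time interval, oldest first,
    with the current interval last. The weight decay ** age is an assumed
    choice of the time-dependent coefficient.
    """
    n = len(totals)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    baseline = sum(w * x for w, x in zip(weights, totals)) / sum(weights)
    return totals[-1] - baseline


def reward_normalized(current_total: float, average_total: float) -> float:
    """Second variant: current total throughput normalized by the average throughput."""
    return current_total / average_total


def reward_threshold(current_total: float, average_total: float,
                     threshold: float = 1.0,
                     positive: float = 1.0, negative: float = -2.0) -> float:
    """Third variant: fixed positive reward above the threshold, larger fixed penalty otherwise."""
    if positive <= 0 or negative >= 0 or abs(negative) <= positive:
        raise ValueError("negative constant must exceed the positive one in magnitude")
    ratio = current_total / average_total
    return positive if ratio > threshold else negative
```

With equal weights (`decay=1.0`), the first variant reduces to the current total minus the plain mean of the history, which is positive only when the current interval beats the recent average.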

One aspect of the present invention is the above traffic control device, wherein the communication status of the second communication device is information representing the throughput at the second communication device or the time taken to transmit the data addressed to the second communication device.

One aspect of the present invention is the above traffic control device, wherein the value function is approximated by a deep neural network.

One aspect of the present invention is the above traffic control device, wherein the image data used for the value function is data obtained by reducing the resolution of each of a plurality of pieces of image data captured at different times and then normalizing the pixel values.
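A minimal sketch of this preprocessing follows, assuming block averaging as the resolution-reduction method and division by a maximum pixel value as the normalization; neither choice is specified in the text, and the function names and defaults are hypothetical.

```python
from typing import List, Sequence

Image = List[List[float]]


def downsample(image: Sequence[Sequence[float]], factor: int) -> Image:
    """Reduce resolution by averaging non-overlapping factor x factor blocks."""
    rows = len(image) // factor
    cols = len(image[0]) // factor
    small = []
    for r in range(rows):
        row = []
        for c in range(cols):
            block = [image[r * factor + i][c * factor + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def normalize(image: Sequence[Sequence[float]], max_value: float) -> Image:
    """Scale pixel values into the range [0, 1], clipping anything outside it."""
    return [[min(max(p / max_value, 0.0), 1.0) for p in row] for row in image]


def preprocess_frames(frames: Sequence[Sequence[Sequence[float]]],
                      factor: int = 2, max_value: float = 255.0) -> List[Image]:
    """Downsample and then normalize each frame captured at a different time."""
    return [normalize(downsample(frame, factor), max_value) for frame in frames]
```

Passing several consecutive frames, rather than a single one, gives the value function a chance to infer the direction and speed of a moving obstacle.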

One aspect of the present invention is the above traffic control device, wherein the information on the amount of untransmitted data addressed to the second communication devices used for the value function is information in which vectors, each expressing in one-hot representation the amount of untransmitted data addressed to one of the plurality of second communication devices, are arranged side by side.

One aspect of the present invention is the above traffic control device, wherein the image data is depth image data.

One aspect of the present invention is a traffic control method in a traffic control device that controls wireless communication between a first communication device and one or more second communication devices, wherein the traffic control device executes: an action determination step of, using image data capturing the communication environment between the first communication device and the second communication devices and information on the amount of untransmitted data addressed to the second communication devices, calculating the value of each of a plurality of types of actions by means of a value function that calculates the value of an action represented as a combination of the traffic of each of the second communication devices, and determining an action based on the calculated values; a communication control step of controlling communication so that the data addressed to the second communication devices is delivered from the first communication device in accordance with the traffic of each second communication device represented by the action determined in the action determination step; a reward calculation step of acquiring the communication status of the second communication devices resulting from the control in the communication control step, and calculating a reward representing the degree to which the acquired communication status has improved over past communication status; and a learning step of updating the value function so as to maximize the cumulative sum of the rewards calculated in the reward calculation step for different time intervals, wherein the reward in a time interval is a first value obtained by subtracting, from the total throughput of the first communication device in that time interval, a weighted average obtained by multiplying the total throughput of the first communication device in each time interval from a past time interval up to that time interval by a time-dependent coefficient and then averaging, or a second value obtained by normalizing the total throughput of the first communication device in that time interval by the average throughput of the first communication device, or a third value that is a positive constant when the ratio of the total throughput of the first communication device in that time interval to the average throughput of the first communication device exceeds a predetermined value and a negative constant whose absolute value is larger than the positive constant when the ratio is equal to or less than the predetermined value.

According to the present invention, the total throughput can be increased in an environment in which a moving obstacle temporarily blocks a line-of-sight communication path for wireless communication.

FIG. 1 is a diagram showing a configuration example of a wireless communication system according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the processing flow of the traffic control device according to the embodiment.
FIG. 3 is a diagram for explaining an episode according to the embodiment.
FIG. 4 is a diagram showing the processing of a camera image into input data according to the embodiment.
FIG. 5 is a diagram showing the processing of file remaining-amount information into input data according to the embodiment.
FIG. 6 is a diagram showing the layer design of the action evaluation function according to the embodiment.
FIG. 7 is a diagram showing the parameters of the simulation evaluation of the traffic control device according to the embodiment.
FIG. 8 is a diagram showing the simulation evaluation results of the traffic control device according to the embodiment.
FIG. 9 is a diagram showing a configuration example of the wireless communication system to be controlled.
FIG. 10 is a functional block diagram of a traffic control device according to the prior art.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
The traffic control device of the present embodiment uses deep reinforcement learning to solve the problems of the prior art. It uses the camera image and the traffic buffer as the "state" and acquires control suited to that state through trial-and-error learning. Reinforcement learning is a type of machine learning in which an agent, the acting entity, acts on an environment by trial and error and acquires a better policy by receiving rewards from the environment for its actions. The agent acts according to a value function that represents the reward expected from a state, and updates this value function using the rewards it obtains. In deep reinforcement learning, the value function is approximated by a neural network such as a convolutional neural network (CNN). This makes the method applicable to problems with an enormous number of states, and the use of convolutional layers makes it effective for problems that take images as input.
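The act-then-update cycle described above can be illustrated with a tabular sketch. This is a generic Q-learning agent with epsilon-greedy exploration, used here as a stand-in: the embodiment approximates the value function with a neural network instead of a table, and the class name, hyperparameters, and the on/off encoding of per-STA traffic are assumptions for illustration.

```python
import itertools
import random
from collections import defaultdict


class TabularAgent:
    """Generic Q-learning agent (illustrative stand-in for the deep value function)."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor for future rewards
        self.epsilon = epsilon  # probability of taking a random exploratory action
        self.q = defaultdict(float)  # (state, action) -> estimated value
        self.rng = random.Random(seed)

    def act(self, state):
        """Explore with probability epsilon, otherwise pick the highest-valued action."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """Move the estimate toward reward plus discounted best next value."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


# Assumed action encoding: one on/off traffic flag per STA, here for two STAs.
ACTIONS = list(itertools.product((0, 1), repeat=2))
```

A table like this becomes impractical once the state includes camera images, which is the motivation the text gives for replacing it with a neural network.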

FIG. 1 is a diagram showing a communication system 1 according to an embodiment of the present invention. The communication system 1 includes an access point (AP) 2, stations (STAs) 3, a proxy server 4, a traffic control device 5, and an imaging device 6. Of the N STAs 3 (N is an integer of 1 or more), the n-th STA 3 (n is an integer from 1 to N) is referred to as STA-n. In the figure, the traffic control device 5 is installed in the proxy server 4. The communication system 1 shown in the figure has a configuration in which the conventional traffic control device shown in FIG. 10 is replaced with the traffic control device 5.

The AP 2 communicates wirelessly with one or more STAs 3. The AP 2 wirelessly transmits packets addressed to an STA 3 that the proxy server 4 has received from a communication device connected via the Internet 7. The AP 2 also wirelessly receives, from an STA 3, packets addressed to a communication device connected via the Internet 7 and forwards them to the proxy server 4. The proxy server 4 communicates via the Internet 7 on behalf of the STAs 3. The imaging device 6 is, for example, an RGB-D camera. An RGB-D camera captures an RGB image (color image) and a depth image. The imaging device 6 captures, at a predetermined period, images of the environment including the wireless line-of-sight communication paths between the AP 2 and the plurality of STAs 3 and their surroundings. The imaging device 6 transmits the camera image, which is the data of the captured image, to the traffic control device 5.

The proxy server 4 includes a first communication unit 41, a storage unit 42, a second communication unit 43, and the traffic control device 5. The first communication unit 41 receives packets of files addressed to the STAs 3 via the Internet 7 and writes them to the storage unit 42 separately for each STA 3. The storage unit 42 has a plurality of file buffers. A file addressed to an STA 3 is stored in a file buffer allocated to that STA 3. A plurality of file buffers can be allocated to one STA 3, and an upper limit may be placed on the number of file buffers that can be allocated to one STA 3. In the present embodiment, up to three file buffers can be allocated to one STA 3. The second communication unit 43, under the control of the traffic control device 5, reads files addressed to the STAs 3 from the storage unit 42 and transmits them to the AP 2.
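The per-STA file buffers can be pictured with a small data-structure sketch. The class and method names, the byte-count representation of a file, and the oldest-first sending order are assumptions for illustration; only the limit of three buffers per STA comes from the embodiment.

```python
from collections import deque


class FileBufferStore:
    """Hypothetical model of the storage unit: a bounded set of file buffers per STA."""

    def __init__(self, max_buffers_per_sta: int = 3):
        self.max_buffers_per_sta = max_buffers_per_sta
        self._buffers = {}  # STA id -> deque of remaining byte counts, one per file

    def enqueue(self, sta: str, size_bytes: int) -> bool:
        """Store a new file for `sta`; return False if all of its buffers are in use."""
        queue = self._buffers.setdefault(sta, deque())
        if len(queue) >= self.max_buffers_per_sta:
            return False
        queue.append(size_bytes)
        return True

    def send(self, sta: str, budget_bytes: int) -> int:
        """Forward up to `budget_bytes` toward the AP, oldest file first; return bytes sent."""
        queue = self._buffers.get(sta, deque())
        sent = 0
        while queue and sent < budget_bytes:
            chunk = min(queue[0], budget_bytes - sent)
            queue[0] -= chunk
            sent += chunk
            if queue[0] == 0:
                queue.popleft()  # file fully delivered, free its buffer
        return sent

    def remaining(self, sta: str) -> int:
        """Total untransmitted bytes currently buffered for `sta`."""
        return sum(self._buffers.get(sta, ()))
```

The value returned by `remaining` corresponds to the untransmitted data amount that the traffic control device reads as part of its state.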

The traffic control device 5 includes a reinforcement learning unit 51, a reward calculation unit 52, and a communication control unit 53. The reinforcement learning unit 51 includes a processing unit 511, an action determination unit 512, and a learning unit 513. The action determination unit 512 and the learning unit 513 are the processing units of the deep reinforcement learning algorithm. The processing unit 511 converts the camera image input from the imaging device 6 and the traffic buffer information into a data format suitable for processing, and outputs the result to the processing units of the deep reinforcement learning algorithm. The action determination unit 512 outputs a traffic control signal as the "action" based on the "state", which includes the converted camera image and traffic buffer information. The traffic buffer information is the amount of untransmitted data addressed to each STA 3 that is accumulated in the proxy server 4. In the present embodiment, the remaining file amount is used as the traffic buffer information. The remaining file amount is the size of the untransmitted files addressed to each STA 3 stored in the storage unit 42. The learning unit 513 learns a better control method based on the reward that the reward calculation unit 52 calculates for the output "action".

The reward calculation unit 52 outputs a reward designed for the intended objective from the throughput of each STA 3 and the traffic buffer information, or from a part of them. The communication control unit 53 controls the second communication unit 43 of the proxy server 4 so that files addressed to the STAs 3 are delivered while the traffic between the AP 2 and each STA 3 is scheduled. This is because, in millimeter-wave communication, a practical use case is expected in which large files are transmitted by taking advantage of its high speed.

Note that the traffic control device 5 may include any one or more of the first communication unit 41, the storage unit 42, and the second communication unit 43 of the proxy server 4. The first communication unit 41 and the communication control unit 53 may be the same functional unit, the second communication unit 43 and the communication control unit 53 may be the same functional unit, or the first communication unit 41, the second communication unit 43, and the communication control unit 53 may all be the same functional unit. The traffic control device 5 may also be an external device connected to the proxy server 4 through a communication network. Furthermore, any one or more of the first communication unit 41, the storage unit 42, the second communication unit 43, the reinforcement learning unit 51, the reward calculation unit 52, and the communication control unit 53 may be realized by the proxy server 4 and the traffic control device 5 working together.

図２は、トラヒック制御装置５の処理の流れを示すフロー図である。
トラヒック制御装置５が起動すると、撮像装置６は、一定時間間隔で通信環境を撮影してカメラ画像を生成し、強化学習部５１へ送信する（ステップＳ１）。一方で、通信制御部５３は、各ＳＴＡ３のファイルバッファ内のファイル残量を取得し、強化学習部５１へ送信する（ステップＳ２）。加工部５１１は、撮像装置６及び通信制御部５３のそれぞれから受信したデータを深層強化学習の設計に合わせて事前処理した後、行動決定部５１２に入力する（ステップＳ３）。 FIG. 2 is a flow chart showing a processing flow of the traffic control device 5.
When the traffic control device 5 is activated, the image pickup device 6 captures a communication environment at regular time intervals, generates a camera image, and transmits the camera image to the reinforcement learning unit 51 (step S1). On the other hand, the communication control unit 53 acquires the remaining amount of files in the file buffer of each STA 3 and transmits them to the reinforcement learning unit 51 (step S2). The processing unit 511 preprocesses the data received from each of the image pickup device 6 and the communication control unit 53 according to the design of the deep reinforcement learning, and then inputs the data to the action determination unit 512 (step S3).

深層強化学習では価値関数にニューラルネットワークを用いるため、加工部５１１は、カメラ画像とファイル残量情報を、設計されたニューラルネットワークに適した入力データに加工する。この価値関数のニューラルネットワークの例として、全結合層のみの単純なものや、画像認識の分野でよく用いられる畳込み層を含んだものが挙げられる。例として、価値関数が全結合層のみのニューラルネットワークの場合、加工部５１１は、カメラ画像のうち深度画像の解像度を低くした後に１次元のデータにして、各深度値を０から１までの値に正規化する。また、加工部５１１は、各ＳＴＡ３のファイルバッファに残っているファイルの容量を離散化してＯｎｅ－Ｈｏｔ表現化したファイル残量情報を生成し、入力データとする。Ｏｎｅ－Ｈｏｔ表現とは、ある要素のみが１であり、それ以外の要素が０となるベクトル表現のことである。ファイル容量を表すベクトルの各要素はそれぞれファイル容量の範囲に対応しており、ファイルバッファに残っているファイル容量に対応した要素に１が設定され、他の要素には０が設定される。 Since deep reinforcement learning uses a neural network as the value function, the processing unit 511 processes the camera image and the file remaining amount information into input data suited to the designed neural network. Examples of the neural network for this value function include a simple one with only fully connected layers and one that includes convolutional layers, which are often used in the field of image recognition. As an example, when the value function is a neural network with only fully connected layers, the processing unit 511 lowers the resolution of the depth image contained in the camera image, flattens it into one-dimensional data, and normalizes each depth value to a value from 0 to 1. Further, the processing unit 511 discretizes the size of the files remaining in the file buffer for each STA3, generates file remaining amount information in One-Hot representation, and uses it as input data. The One-Hot representation is a vector representation in which exactly one element is 1 and all other elements are 0. Each element of the vector representing the file size corresponds to a range of file sizes; the element corresponding to the file size remaining in the file buffer is set to 1, and the other elements are set to 0.
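
The depth-image preprocessing for the fully connected case can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `preprocess_depth`, block averaging as the way of lowering the resolution, and the normalization constant `max_depth` are all hypothetical. The text only states that the resolution is lowered, the data is made one-dimensional, and each depth value is normalized to the range 0 to 1.

```python
def preprocess_depth(depth, out_h, out_w, max_depth):
    """Downsample a 2D depth map by block averaging, flatten, scale to [0, 1].

    depth: list of rows (lists of numbers). Its height and width are assumed
    to be integer multiples of out_h and out_w (simplifying assumption).
    """
    in_h, in_w = len(depth), len(depth[0])
    bh, bw = in_h // out_h, in_w // out_w  # block size per output pixel
    flat = []
    for r in range(out_h):
        for c in range(out_w):
            block = [depth[r * bh + i][c * bw + j]
                     for i in range(bh) for j in range(bw)]
            mean = sum(block) / len(block)
            # Clip to the valid range, then normalize to 0..1.
            flat.append(min(max(mean, 0.0), max_depth) / max_depth)
    return flat
```

The returned list is already one-dimensional, so it could be concatenated with the One-Hot file remaining amount vectors to form the network input.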

行動決定部５１２は、深層強化学習アルゴリズムを用いて、価値関数の出力結果に基づいて各ＳＴＡ３の通信のトラヒック（強化学習の「行動」）を決定する（ステップＳ４）。具体的には、行動決定部５１２は、カメラ画像とファイルバッファのファイル残量情報という「状態」において、とりうる「行動」のうち、それら各「行動」によって最も価値が高くなるような状態遷移を起こす「行動」（各ＳＴＡ３のトラフィック）を優先的に採用する。行動決定部５１２は、決定した各ＳＴＡ３の通信のトラヒック制御情報を通信制御部５３に送信する。これを受信した通信制御部５３は、そのトラヒック制御情報に従って、ファイルバッファに保持していたファイルをパケットに設定してＡＰ２へ送信するようプロキシサーバ４の第２通信部４３を制御する（ステップＳ５）。 The action determination unit 512 uses a deep reinforcement learning algorithm to determine the communication traffic of each STA3 (the "action" in reinforcement learning) based on the output of the value function (step S4). Specifically, in the "state" given by the camera image and the file remaining amount information of the file buffer, the action determination unit 512 preferentially adopts, among the possible "actions", the "action" (the traffic of each STA3) that causes the state transition with the highest value. The action determination unit 512 transmits the traffic control information for the determined communication of each STA3 to the communication control unit 53. Upon receiving it, the communication control unit 53 controls the second communication unit 43 of the proxy server 4, in accordance with the traffic control information, so that the files held in the file buffer are packetized and transmitted to AP2 (step S5).
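
One common way to "preferentially adopt" the highest-valued action while still exploring is an epsilon-greedy rule. The sketch below assumes that rule; the text does not name the exploration strategy, and `choose_action` and `epsilon` are hypothetical names introduced for illustration only.

```python
import random

def choose_action(action_values, epsilon, rng=random):
    """Return the index of the highest-valued action, exploring with
    probability epsilon (epsilon-greedy; an assumed selection rule)."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(action_values))
    # Exploit: pick the action whose estimated value is largest.
    return max(range(len(action_values)), key=lambda a: action_values[a])
```

With `epsilon = 0.0` the rule is purely greedy; a small positive `epsilon` lets the learner occasionally try traffic patterns whose value estimate is still poor.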

パケット送信後、通信制御部５３は、各ＳＴＡ３宛てのバッファ内のファイル残量とその時点での各ＳＴＡ３のスループットを取得し、報酬計算部５２へ送信する（ステップＳ６）。報酬計算部５２は、受信したファイル残量及びスループット情報を用いて報酬を計算する（ステップＳ７）。報酬は、トラヒック制御の詳細な目的に合わせて設計される。詳細な目的の例としては、ＡＰ２の合計スループットの最大化、ファイル送信時間の合計の最小化等が挙げられる。ＡＰ２の合計スループットの最大化が目的の場合、報酬計算部５２は、行動決定部５１２が行動を決定し、その決定に基づいて通信制御部５３が行動する度に毎回、その時点でのＡＰ２の合計スループットを報酬として与える。ファイル送信時間の合計の最小化が目的の場合、報酬計算部５２は、行動決定部５１２が行動を決定し、その決定に基づいて通信制御部５３が行動する度に毎回、ファイルがプロキシサーバ４に到着してからＳＴＡ３へファイルの送信を完了するまでの間、負の定数を報酬として与える。つまり、報酬の累積和が、ファイル送信時間の合計に比例した値になる。 After the packets are transmitted, the communication control unit 53 acquires the remaining amount of files in the buffer addressed to each STA3 and the throughput of each STA3 at that time, and transmits them to the reward calculation unit 52 (step S6). The reward calculation unit 52 calculates the reward using the received file remaining amount and throughput information (step S7). The reward is designed according to the specific objective of the traffic control. Examples of specific objectives include maximizing the total throughput of AP2 and minimizing the total file transmission time. When the objective is to maximize the total throughput of AP2, the reward calculation unit 52 gives the total throughput of AP2 at that time as the reward each time the action determination unit 512 determines an action and the communication control unit 53 acts on that determination. When the objective is to minimize the total file transmission time, the reward calculation unit 52 gives a negative constant as the reward each time the action determination unit 512 determines an action and the communication control unit 53 acts on that determination, from the time a file arrives at the proxy server 4 until its transmission to the STA3 is completed. In other words, the cumulative sum of the rewards is proportional to the total file transmission time.
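
The proportionality between the cumulative reward and the total file transmission time can be checked with a small sketch. It assumes that the negative constant is applied once per pending file at every control step; the text does not say whether the constant is counted per file or once overall. The name `time_penalty_reward` and the value `step_penalty = -1.0` are hypothetical.

```python
def time_penalty_reward(remaining, step_penalty=-1.0):
    """Reward for one control step: a negative constant for every file
    that still has data left to send (per-file counting is an assumption)."""
    return step_penalty * sum(1 for r in remaining if r > 0)

# Remaining file sizes (Mbit) for two STAs observed at three control steps.
trace = [[700, 300], [400, 0], [0, 0]]
rewards = [time_penalty_reward(step) for step in trace]
cumulative = sum(rewards)
```

In this trace the first file is pending for two steps and the second for one step, so the total transmission time is three steps and the cumulative reward is three times the step penalty.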

例えば、ＡＰ２の合計スループットの最大化が目的の場合、時間ステップｔにおける報酬ｒ_ｔは、以下の式（１）のように算出される。 For example, when the objective is to maximize the total throughput of AP2, the reward r_t at time step t is calculated by the following equation (1).

Ｔ_ｔは時間ステップｔにおける合計スループット、ｃ（ｔ）は時間パラメータｔに応じた係数である。Σの項はこれまでの合計スループットを時間等のパラメータにより加重平均した値である。例えば、各ｃ（ｉ）を、式（１）の第２項において時間に応じた加重平均スループットが得られるように決定してもよい。また、ｃ（ｉ）＝１（ｉはｔ以下の整数）とすると、報酬ｒ_ｔは、以下の式（２）により算出される。 T_t is the total throughput at time step t, and c(t) is a coefficient that depends on the time parameter t. The Σ term is a weighted average of the total throughput so far, weighted by parameters such as time. For example, each c(i) may be chosen so that the second term of equation (1) yields a time-dependent weighted average throughput. Further, when c(i) = 1 (i is an integer less than or equal to t), the reward r_t is calculated by the following equation (2).

また、報酬を、式（３）に示すようにＡＰ２全体の平均のスループットＴ_ｔ￣で正規化したスループットとしてもよく、式（４）に示すように、正規化したスループットの差分としてもよい。 Further, the reward may be the throughput normalized by the average throughput T̄_t of the entire AP2, as shown in equation (3), or the difference between normalized throughputs, as shown in equation (4).

また、以下の式（５）のように、スループットの平均からの減衰率が一定値αを下回ったときに大きな負の報酬を与えるようにしてもよい。 Further, as in the following equation (5), a large negative reward may be given when the attenuation rate from the average throughput falls below a certain value α.

また、式（１）～式（５）におけるスループットを、ミリ波通信の物理伝送速度に置き換えてもよい。 Further, the throughput in the equations (1) to (5) may be replaced with the physical transmission speed of millimeter wave communication.
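
The reward variants described for equations (1) to (5) can be sketched as below. The functions follow only the verbal description and the wording of the claims, so their exact form is an assumption: in particular the normalization by the sum of the coefficients, the inclusion of the current step in the weighted average, the use of the previous step in the difference, and the constants `bonus` and `penalty` are hypothetical choices, as are all function names.

```python
def reward_weighted(throughputs, coeffs):
    """Current total throughput minus a weighted average of the
    throughputs so far. throughputs = [T_1..T_t], coeffs = [c(1)..c(t)].
    Dividing by sum(coeffs) is an assumption of this sketch."""
    avg = sum(c * T for c, T in zip(coeffs, throughputs)) / sum(coeffs)
    return throughputs[-1] - avg

def reward_normalized(T_t, T_mean):
    """Throughput normalized by the average throughput of the whole AP."""
    return T_t / T_mean

def reward_normalized_diff(T_t, T_prev, T_mean):
    """Difference between the current and previous normalized throughput."""
    return T_t / T_mean - T_prev / T_mean

def reward_threshold(T_t, T_mean, alpha, bonus=1.0, penalty=-10.0):
    """Positive constant while the ratio to the average exceeds alpha,
    otherwise a negative constant with a larger absolute value."""
    return bonus if T_t / T_mean > alpha else penalty
```

With all coefficients equal to 1, `reward_weighted` reduces to the current throughput minus the plain mean of the throughputs so far. As the text notes, the physical transmission rate could be passed in place of the throughput without changing the functions.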

報酬計算部５２は、計算した報酬を強化学習部５１に送信する。強化学習部５１は、通知された報酬に基づいて、深層強化学習アルゴリズムによって価値関数を更新していくことで学習を進める（ステップＳ８）。 The reward calculation unit 52 transmits the calculated reward to the reinforcement learning unit 51. The reinforcement learning unit 51 advances learning by updating the value function by the deep reinforcement learning algorithm based on the notified reward (step S8).
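
The text does not name the deep reinforcement learning algorithm used in step S8. As one possibility, the sketch below assumes a Q-learning style update, in which the value network is trained toward a temporal-difference target; `td_target`, `td_error`, and the discount factor `gamma` are hypothetical names and parameters.

```python
def td_target(reward, next_action_values, gamma=0.99, terminal=False):
    """Q-learning target: r + gamma * max over next-state action values,
    or just r when the episode has ended (assumed algorithm)."""
    if terminal:
        return reward
    return reward + gamma * max(next_action_values)

def td_error(current_value, reward, next_action_values,
             gamma=0.99, terminal=False):
    """Gap between the target and the current estimate; a gradient step
    on the value network would aim to shrink this quantity."""
    target = td_target(reward, next_action_values, gamma, terminal)
    return target - current_value
```

Under this assumption, the reward from the reward calculation unit 52 enters only through `reward`, and the network outputs for the next state supply `next_action_values`.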

この一連の動作を繰り返すことにより、強化学習部５１は、入力された報酬の累積和が最大となるように学習を進めながら各ＳＴＡ３のトラヒックのトラヒックを決定していく。従って、学習が進むに連れてトラヒック制御装置５を設置した環境に適応したトラヒック制御方法を自動的に獲得する。 By repeating this series of operations, the reinforcement learning unit 51 determines the traffic of each STA3 while proceeding with learning so that the cumulative sum of the input rewards is maximized. Therefore, as the learning progresses, the traffic control method adapted to the environment in which the traffic control device 5 is installed is automatically acquired.

トラヒック制御装置５は、複数エピソードを実施した結果に基づいて、上記の処理を行い、行動評価関数を学習する。図３は、エピソードを説明するための図である。エピソードとは、記憶部４２におけるファイルバッファ内のファイルが全て送信完了するまでの一連の流れを表す。プロキシサーバ４は、トラヒック制御装置５の通信制御部５３の制御に従って、ファイルバッファに記憶されるファイルを、ＡＰ２を介して各ＳＴＡ３へ送信していき、ファイルバッファ内のファイルを全て送信し終えた時点で１エピソードの終了とする。１エピソードの途中ではファイルは追加されない。エピソードが進むに連れて、本実施形態のトラヒック制御装置５の学習も進む。なお、学習する上限数をあらかじめ決めておき、エピソードが上限数に達した場合には学習を終了してもよい。 The traffic control device 5 performs the above processing based on the result of performing the plurality of episodes, and learns the behavior evaluation function. FIG. 3 is a diagram for explaining an episode. The episode represents a series of flows until all the files in the file buffer in the storage unit 42 are transmitted. The proxy server 4 transmits the files stored in the file buffer to each STA3 via the AP2 under the control of the communication control unit 53 of the traffic control device 5, and completes the transmission of all the files in the file buffer. At this point, one episode ends. No files are added in the middle of one episode. As the episode progresses, so does the learning of the traffic control device 5 of the present embodiment. The maximum number of learning may be determined in advance, and the learning may be terminated when the maximum number of episodes is reached.

価値関数として用いられる深層ニューラルネットワーク（ＣＮＮ）の入力データ及び層設計の例を説明する。
図４は、ステップＳ３におけるカメラ画像から入力データへの加工を示す図である。強化学習部５１は、１秒間における過去５枚分のカメラ画像に含まれる深度画像データをそれぞれ２０×２０ピクセルの二次元画像データに圧縮する。強化学習部５１は、５枚の深度画像データそれぞれを圧縮して得られた５チャネルの二次元画像をＣＮＮへの入力データとする。 An example of input data and layer design of a deep neural network (CNN) used as a value function will be described.
FIG. 4 is a diagram showing processing from a camera image to input data in step S3. The reinforcement learning unit 51 compresses the depth image data included in the past five camera images in one second into two-dimensional image data of 20 × 20 pixels, respectively. The reinforcement learning unit 51 uses a 5-channel two-dimensional image obtained by compressing each of the five depth image data as input data to the CNN.

図５は、ステップＳ３におけるファイル残量情報から入力データへの加工を示す図である。まず、各ファイルの残量を複数段階に離散化する。ここでは、ファイル容量の最大値が２０００Ｍｂｉｔ（メガビット）であり、１０段階に離散化する場合を例とする。この場合、ファイル残量情報として用いられるＯｎｅ－Ｈｏｔ表現のベクトルの各要素を、[（０－２００Ｍｂｉｔ），（２００－４００Ｍｂｉｔ），（４００－６００Ｍｂｉｔ），（６００－８００Ｍｂｉｔ），…，（１８００－２０００Ｍｂｉｔ）]と定める。記憶部４２から取得したＳＴＡ－ｎ（ｎは１以上Ｎ以下の整数）のファイル残量が容量７００Ｍｂｉｔである場合、ファイル残量情報はベクトル［０，０，０，１，０，０，０，０，０，０］と表される。強化学習部５１は、ＳＴＡ－１、ＳＴＡ－２、…、ＳＴＡ－Ｎについて生成したファイル残量情報を表すベクトルを並べて結合し、入力データとする。 FIG. 5 is a diagram showing the processing of the file remaining amount information into input data in step S3. First, the remaining amount of each file is discretized into multiple levels. Here, the case where the maximum file size is 2000 Mbit (megabits) and the remaining amount is discretized into 10 levels is taken as an example. In this case, the elements of the One-Hot vector used as the file remaining amount information are defined as [(0-200 Mbit), (200-400 Mbit), (400-600 Mbit), (600-800 Mbit), ..., (1800-2000 Mbit)]. When the file remaining amount of STA-n (n is an integer from 1 to N) acquired from the storage unit 42 is 700 Mbit, the file remaining amount information is represented by the vector [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. The reinforcement learning unit 51 concatenates the vectors representing the file remaining amount information generated for STA-1, STA-2, ..., STA-N in order and uses the result as input data.
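
The One-Hot encoding in this example can be sketched as follows. The function names are hypothetical, and the treatment of bin boundaries (a value exactly on a boundary goes to the upper bin, and the maximum size is clamped into the last bin) is an assumption, since the text does not specify it.

```python
def one_hot_remaining(remaining_mbit, max_mbit=2000, levels=10):
    """Encode one STA's remaining file size as a One-Hot vector."""
    width = max_mbit / levels  # 200 Mbit per bin in the example
    # Clamp so that exactly max_mbit falls into the last bin (assumption).
    index = min(int(remaining_mbit // width), levels - 1)
    vec = [0] * levels
    vec[index] = 1
    return vec

def encode_all(remaining_per_sta, max_mbit=2000, levels=10):
    """Concatenate the One-Hot vectors of STA-1 ... STA-N in order."""
    out = []
    for r in remaining_per_sta:
        out.extend(one_hot_remaining(r, max_mbit, levels))
    return out
```

For a remaining amount of 700 Mbit, `one_hot_remaining(700)` sets the fourth element, matching the vector given in the text. The concatenated vector has N times 10 elements.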

図６は、ＣＮＮの層設計を示す図である。なお、「Ａｆｆｉｎｅ，ａ－ｂ」は、ａ次元ベクトルを全結合層に入力し、ｂ次元ベクトルを出力する演算を表す。「ｋ×ｌ２ＤＣｏｎｖｅｒｓｉｏｎ，ａ－ｂ」は、ｋ×ｌの二次元フィルタにより、ａチャネルの入力を畳み込み、ｂチャネルにして出力する演算を表す。また、「ｋ×ｌ２ＤＭａｘＰｏｏｌｉｎｇ」は、サイズがｋ×ｌのグリッドに入力を分割し、各グリッドの最大値を代表値として出力する演算を表す。「ＲｅＬＵ」は、活性化関数ＲｅＬＵ（Rectified Linear Units）に入力する演算を表す。活性化関数ＲｅＬＵは、マイナスの値を０に変換する。 FIG. 6 is a diagram showing the layer design of the CNN. "Affine, a-b" represents an operation that inputs an a-dimensional vector to a fully connected layer and outputs a b-dimensional vector. "k×l 2D Conversion, a-b" represents an operation that convolves an a-channel input with a k×l two-dimensional filter and outputs b channels. "k×l 2D MaxPooling" represents an operation that divides the input into grids of size k×l and outputs the maximum value of each grid as its representative value. "ReLU" represents an operation that passes the input through the activation function ReLU (Rectified Linear Units). The activation function ReLU converts negative values to 0.

入力層では、図４に示した処理により５チャネルの二次元画像（5 Channels 2D Image）を生成する。さらに、入力層では、図５に示した処理により各ＳＴＡ３のファイル残量をＯｎｅ－Ｈｏｔ表現のベクトルに変換し、結合して６０次元ベクトルを生成する。 In the input layer, a five-channel two-dimensional image (5 Channels 2D Image) is generated by the processing shown in FIG. 4. Further, in the input layer, the file remaining amount of each STA3 is converted into a One-Hot vector by the processing shown in FIG. 5, and the vectors are concatenated to generate a 60-dimensional vector.

隠れ層には、１ａ層～８ａ層と、１ｂ層～２ｂ層と、８ａ層及び２ｂ層の出力を入力とする９層とがある。
１ａ層では、５チャネルの二次元画像（5 Channels 2D Image）を、５×５の二次元フィルタにより畳み込み、２０チャネルにして出力する。２ａ層では、２０チャネルの１ａ層の出力を活性化関数ＲｅＬＵに入力し、マイナスの値を取り除く。３ａ層では、２０チャネルの２ａ層の出力を２×２のグリッドに分割し、各グリッドの最大値を出力する。４ａ層では、２０チャネルの３ａ層の出力を、５×５の二次元フィルタにより畳み込み、５０チャネルにして出力する。５ａ層では、５０チャネルの４ａ層の出力を活性化関数ＲｅＬＵに入力し、マイナスの値を取り除く。６ａ層では、５０チャネルの５ａ層の出力を２×２のグリッドに分割し、各グリッドの最大値を出力する。７ａ層では、６ａ層の１２５０次元ベクトルを全結合層に入力し、５００次元ベクトルを出力する。８ａ層では、７ａ層の出力を活性化関数ＲｅＬＵに入力し、マイナスの値を取り除く。 The hidden layers consist of layers 1a to 8a, layers 1b to 2b, and layer 9, which takes the outputs of layers 8a and 2b as its input.
In layer 1a, the five-channel two-dimensional image (5 Channels 2D Image) is convolved with a 5×5 two-dimensional filter and output as 20 channels. In layer 2a, the 20-channel output of layer 1a is passed through the activation function ReLU, which sets negative values to 0. In layer 3a, the 20-channel output of layer 2a is divided into 2×2 grids, and the maximum value of each grid is output. In layer 4a, the 20-channel output of layer 3a is convolved with a 5×5 two-dimensional filter and output as 50 channels. In layer 5a, the 50-channel output of layer 4a is passed through the activation function ReLU, which sets negative values to 0. In layer 6a, the 50-channel output of layer 5a is divided into 2×2 grids, and the maximum value of each grid is output. In layer 7a, the 1250-dimensional vector from layer 6a is input to a fully connected layer, and a 500-dimensional vector is output. In layer 8a, the output of layer 7a is passed through the activation function ReLU, which sets negative values to 0.
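
The 1250-dimensional input of layer 7a can be cross-checked with simple shape arithmetic. The sketch assumes that the 5×5 convolutions preserve the spatial size (for example through zero padding) and that the 2×2 pooling does not overlap; the text states neither, but these assumptions reproduce the stated dimension from the 20×20 input. The helper names are hypothetical.

```python
def conv_same(h, w, out_ch):
    """Size-preserving convolution: only the channel count changes."""
    return h, w, out_ch

def max_pool(h, w, ch, k=2):
    """Non-overlapping k x k max pooling halves each side for k = 2."""
    return h // k, w // k, ch

h, w, ch = 20, 20, 5            # input: five 20x20 depth images
h, w, ch = conv_same(h, w, 20)  # layer 1a -> 20 x 20 x 20
h, w, ch = max_pool(h, w, ch)   # layer 3a -> 10 x 10 x 20
h, w, ch = conv_same(h, w, 50)  # layer 4a -> 10 x 10 x 50
h, w, ch = max_pool(h, w, ch)   # layer 6a ->  5 x  5 x 50
flattened = h * w * ch          # input size of layer 7a
```

Without padding, the sides would shrink as 20, 16, 8, 4, 2, giving only 2 × 2 × 50 = 200 values, which does not match the 1250 dimensions in the text.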

一方、１ｂ層では、各ＳＴＡ３のファイル残量に基づいて得られた６０次元ベクトルを全結合層に入力し、１００次元ベクトルを出力する。なお、ＳＴＡ３の台数Ｎと、Ｏｎｅ－Ｈｏｔ表現のベクトルの要素数との乗算が６０であるとする。２ｂ層では、１ｂ層の出力を活性化関数ＲｅＬＵに入力し、マイナスの値を取り除く。 On the other hand, in the 1b layer, the 60-dimensional vector obtained based on the remaining amount of the file of each STA3 is input to the fully connected layer, and the 100-dimensional vector is output. It is assumed that the multiplication of the number N of the STA3 and the number of elements of the vector of the One-Hot expression is 60. In the 2b layer, the output of the 1b layer is input to the activation function ReLU, and the negative value is removed.

９層では、８ａ層の出力及び２ｂ層の出力を併せた６００次元ベクトルを全結合層に入力し、各行動の評価値を得る。出力層は、各行動の評価値を出力する。各行動は、各ＳＴＡ３との通信をＯＮにするかＯＦＦにするかの組み合わせでもよく、Ｎ台のＳＴＡ３それぞれのトラヒック量の組み合わせでもよい。同図では、２台のＳＴＡ３それぞれとの通信をＯＮにするかＯＦＦにするかの組み合わせから、２台ともＯＦＦの組み合わせを除いたものである。つまり、（ＳＴＡ－１，ＳＴＡ－２）を（ＯＮ，ＯＮ）、（ＯＮ，ＯＦＦ）、（ＯＦＦ，ＯＮ）とする３種類の行動である。この３種類の行動それぞれの評価値を得るため、９層からは３次元ベクトルが出力される。 In layer 9, the 600-dimensional vector formed by combining the output of layer 8a and the output of layer 2b is input to a fully connected layer to obtain the evaluation value of each action. The output layer outputs the evaluation value of each action. The actions may be the combinations of turning communication with each STA3 ON or OFF, or the combinations of traffic amounts for each of the N STA3s. In the figure, the actions are the ON/OFF combinations for communication with each of two STA3s, excluding the combination in which both are OFF. That is, there are three actions in which (STA-1, STA-2) is (ON, ON), (ON, OFF), or (OFF, ON). To obtain the evaluation value of each of these three actions, layer 9 outputs a three-dimensional vector.
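
The ON/OFF action set can be enumerated as in the following sketch, which also shows how the number of network outputs grows with the number of STAs. The function name `on_off_actions` is hypothetical.

```python
from itertools import product

def on_off_actions(num_sta):
    """All ON/OFF combinations over the STAs, excluding the all-OFF one."""
    return [combo for combo in product(("ON", "OFF"), repeat=num_sta)
            if any(state == "ON" for state in combo)]

actions = on_off_actions(2)
num_outputs = len(actions)  # size of the layer 9 output vector
```

For N STAs this gives 2^N - 1 actions, so the output dimension is 3 for two STAs and would be 7 for three.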

なお、Ｃｏｎｖｅｒｓｉｏｎ層については、入力層に近いところにおいては画像から特徴量抽出するフィルタが学習されることが期待され、出力層に近いところでは特徴量から値を予測するフィルタが学習されることを期待される。ＲｅＬＵは、活性化関数として広く用いられる。ＲｅＬＵは、他の活性化関数（シグモイド関数など）とくらべて、経験的に学習速度が早く、性能が高くなることが知られている。また、ＭａｘＰｏｏｌｉｎｇ層は、Ｃｏｎｖｅｒｓｉｏｎ層を通すことにより増大したパラメータ数を削減することで学習時間を短縮するために使用される。Ａｆｆｉｎｅ層は、ＣＮＮにより抽出された特徴量から値を予測することを期待して使用される。ＣＮＮのみで構成するような層設計と比較して、学習時間の短縮が期待できることが経験的に知られている。 For the Conversion layers, it is expected that filters that extract features from the image are learned near the input layer, and that filters that predict values from the features are learned near the output layer. ReLU is widely used as an activation function. ReLU is known empirically to learn faster and perform better than other activation functions (such as the sigmoid function). The MaxPooling layers are used to shorten the learning time by reducing the number of parameters, which increases after passing through a Conversion layer. The Affine layers are used with the expectation that they predict values from the features extracted by the CNN. It is empirically known that this shortens the learning time compared with a layer design consisting only of CNN layers.

学習部５１３は、価値関数として用いられるＣＮＮを更新する。具体的には、学習部５１３は、報酬計算部５２により計算される報酬に基づいて、全結合層における重みを更新する。例えば、行動決定部５１２において、ＡＰ２とＳＴＡ－１の通信ＯＮ、ＡＰ２とＳＴＡ－２の通信ＯＦＦという結果が得られた場合、通信制御部５３は、ＡＰ２とＳＴＡ－１との通信のみをＯＮにするよう制御を行う。例えば、通信制御部５３は、ＳＴＡ－１宛てのファイルをＡＰ２に出力し、ＳＴＡ－２宛てのファイルをＡＰ２に出力しないようにプロキシサーバ４の第２通信部４３を制御する。あるいは、プロキシサーバ４の第２通信部４３を介して、ＡＰ２に対してＳＴＡ－１との通信を行い、ＳＴＡ－２との通信を行わないよう制御信号を送信してもよい。しかしながら、このような制御を行っても、ＡＰ２とＳＴＡ－１間で遮蔽が発生している、マルチパスで反射が発生しているなど、実際はＡＰ２とＳＴＡ－１間の伝搬路の状態が悪い場合、通信速度は低くなる。極端な例として、ＡＰ２とＳＴＡ－１間に金属の壁があり、ＳＴＡ－１にまったく電波が届かない場合は、通信がＯＮの状態でもスループットは０Ｍｂｉｔ／ｓとなる。学習部５１３は、そのようなことが発生しないように、各ＳＴＡ３のＯＮ／ＯＦＦを制御するための学習を行うことができる。 The learning unit 513 updates the CNN used as a value function. Specifically, the learning unit 513 updates the weights in the fully connected layer based on the reward calculated by the reward calculation unit 52. For example, when the action determination unit 512 obtains the result that the communication between AP2 and STA-1 is turned on and the communication between AP2 and STA-2 is turned off, the communication control unit 53 turns on only the communication between AP2 and STA-1. Control to. For example, the communication control unit 53 controls the second communication unit 43 of the proxy server 4 so as to output the file addressed to STA-1 to AP2 and not to output the file addressed to STA-2 to AP2. Alternatively, the control signal may be transmitted to the AP2 via the second communication unit 43 of the proxy server 4 so as to communicate with the STA-1 and not to communicate with the STA-2. However, even with such control, the state of the propagation path between AP2 and STA-1 is actually poor, such as shielding occurring between AP2 and STA-1 and reflection occurring in multipath. In that case, the communication speed becomes low. As an extreme example, if there is a metal wall between AP2 and STA-1 and no radio waves reach STA-1, the throughput will be 0 Mbit / s even when communication is ON. 
The learning unit 513 can perform learning to control ON / OFF of each STA3 so that such a situation does not occur.

本実施形態のトラヒック制御装置５によれば、カメラ画像を入力とした深層強化学習によりトラヒック制御を行い、様々な通信環境に自動的に適応して無線帯域を有効利用することが可能となる。また、通信端末やカメラの設置環境が変化した際にも、変化した環境に適応して自動的にトラヒックを制御することが可能となる。特に、ミリ波通信機能を搭載した無線ＬＡＮ（Local Area Network）ルータと、複数のミリ波通信端末とが接続された通信システムにおいて、人体遮蔽が起こりうる状況に有用である。また、無線ＬＡＮルータやミリ波通信端末の設置環境が変化する場合にも対応可能である。 According to the traffic control device 5 of the present embodiment, traffic control can be performed by deep reinforcement learning using a camera image as an input, and it is possible to automatically adapt to various communication environments and effectively use the radio band. In addition, even when the installation environment of the communication terminal or camera changes, it becomes possible to automatically control the traffic by adapting to the changed environment. In particular, it is useful in a situation where human body shielding may occur in a communication system in which a wireless LAN (Local Area Network) router equipped with a millimeter wave communication function and a plurality of millimeter wave communication terminals are connected. It is also possible to deal with changes in the installation environment of wireless LAN routers and millimeter-wave communication terminals.

トラヒック制御装置５の実測データを用いたシミュレーション評価について述べる。図７は、シミュレーション評価の諸元を示す図である。このシミュレーション評価では、１台のＡＰ２に、２台のＳＴＡ３を接続した場合を想定し、本実施形態のトラヒック制御を行った場合と、ファイル送信完了ごとに交互に送信宛先を切り替えるラウンドロビン方式で制御を行った場合のＡＰにおける合計スループットを得た。ＡＰ２は、ミリ波ＡＰである。シミュレーションで用いるミリ波通信の見通し通信時、遮蔽時のスループット及びカメラ画像は実機実験から測定した値を用いた。カメラ画像は、ＲＧＢ－Ｄカメラで撮影した画像のデータを用いた。また、ＡＰ２及びＳＴＡ３も市販のものを用いた。 A simulation evaluation of the traffic control device 5 using measured data is described below. FIG. 7 is a diagram showing the specifications of the simulation evaluation. The simulation assumes that two STA3s are connected to one AP2, and obtains the total throughput at the AP for two cases: when the traffic control of the present embodiment is performed, and when control is performed by a round-robin method that switches the transmission destination alternately each time a file transmission completes. AP2 is a millimeter-wave AP. The line-of-sight and blocked throughputs of millimeter-wave communication and the camera images used in the simulation are values measured in experiments with actual devices. The camera images are data captured with an RGB-D camera. Commercially available products were used for AP2 and STA3.

図８は、シミュレーション評価結果を示す図である。同図は、本実施形態のトラヒック制御を行った場合とラウンドロビン方式で制御を行った場合のエピソード数に対する合計スループットの推移を示す。同図のグラフにおけるＡＰ２の合計スループットとして、各エピソードにおけるＡＰ２の合計スループットの時間平均として表示している。このシミュレーションでは、プロキシサーバ４のファイルバッファには最初、ファイルがランダムなサイズで与えられ、ＡＰ２を通して各ＳＴＡ３へファイルを送信していく。ファイルバッファ内のファイルを全て送信し終えた時点で１エピソードが終了する。同図に示す評価結果から、エピソードが進み、トラヒック制御装置５の学習が進むに連れて、ラウンドロビン方式による制御を行った場合のスループットよりも、本実施形態のトラヒック制御を行った場合の合計スループットが上回っていることがわかる。 FIG. 8 is a diagram showing the simulation evaluation results. The figure shows how the total throughput changes with the number of episodes when the traffic control of the present embodiment is performed and when control is performed by the round-robin method. The total throughput of AP2 plotted in the graph is the time average of the total throughput of AP2 in each episode. In this simulation, files of random sizes are initially placed in the file buffer of the proxy server 4 and are transmitted to each STA3 through AP2. One episode ends when all the files in the file buffer have been transmitted. The evaluation results in the figure show that, as the episodes progress and the learning of the traffic control device 5 advances, the total throughput with the traffic control of the present embodiment comes to exceed the throughput with round-robin control.

以上説明した実施形態によれば、通信システムは、第１通信装置と、第１通信装置と無線により通信する１台以上の第２通信装置と、第１通信装置から第２通信装置に送信するデータを取得する第３通信装置と、トラヒック制御装置とを有する。例えば、第１通信装置はＡＰ２であり、第２通信装置はＳＴＡ３であり、第３通信装置はプロキシサーバ４である。 According to the embodiment described above, the communication system transmits from the first communication device, one or more second communication devices that wirelessly communicate with the first communication device, and the first communication device to the second communication device. It has a third communication device for acquiring data and a traffic control device. For example, the first communication device is AP2, the second communication device is STA3, and the third communication device is proxy server 4.

トラヒック制御装置は、行動決定部と、通信制御部と、報酬計算部と、学習部とを有する。行動決定部は、第１通信装置と第２通信装置との間の通信環境を撮像した画像データと、第３通信装置が記憶する第２通信装置宛ての未送信のデータのデータ量の情報とを用いて、第２通信装置それぞれのトラヒックの組み合わせにより表される行動の価値を算出する価値関数により、複数種類の行動それぞれの価値を算出する。価値関数は、深層ニューラルネットワークにより近似されてもよい。この場合、深層ニューラルネットワークに入力される画像データは、異なるタイミングにおいて撮影された複数の画像データそれぞれの解像度を低減したのちにピクセル値を正規化したデータである。また、深層ニューラルネットワークに入力される未送信の第２通信装置宛てのデータ量の情報は、複数の第２通信装置それぞれ宛ての未送信のデータ量をＯｎｅ－Ｈｏｔ表現により表したベクトルを並べた情報である。行動決定部は、算出した価値に基づいて行動を決定する。 The traffic control device has an action determination unit, a communication control unit, a reward calculation unit, and a learning unit. The action determination unit uses image data capturing the communication environment between the first communication device and the second communication device, together with information on the amount of untransmitted data addressed to the second communication device stored in the third communication device, to calculate the value of each of a plurality of types of actions by a value function that calculates the value of an action represented by a combination of the traffic of each of the second communication devices. The value function may be approximated by a deep neural network. In this case, the image data input to the deep neural network is data obtained by reducing the resolution of each of a plurality of images captured at different times and then normalizing the pixel values. The information on the amount of untransmitted data addressed to the second communication device that is input to the deep neural network is a concatenation of vectors, each representing in One-Hot representation the amount of untransmitted data addressed to one of the second communication devices. The action determination unit determines the action based on the calculated values.

通信制御部は、行動決定部が決定した行動が表す第２通信装置それぞれのトラヒックに従って、第２通信装置宛てのデータを第１通信装置に送信するよう第３通信装置を制御する。報酬計算部は、通信制御部による制御が行われたことによる第２通信装置の通信状況を取得し、取得した通信状況が過去の通信状況から向上した程度を表す報酬を計算する。第２通信装置の通信状況は、第２通信装置におけるスループット、又は、第２通信装置宛てのデータの送信にかかった時間を表す。学習部は、計算された報酬に基づいて価値関数を更新する。第１通信装置は、第３通信装置から受信した第２通信装置宛てのデータを無線により第２通信装置に送信する。 The communication control unit controls the third communication device so that data addressed to the second communication device is transmitted to the first communication device in accordance with the traffic of each of the second communication devices represented by the action determined by the action determination unit. The reward calculation unit acquires the communication status of the second communication device resulting from the control by the communication control unit, and calculates a reward representing the degree to which the acquired communication status has improved over the past communication status. The communication status of the second communication device represents the throughput at the second communication device or the time taken to transmit the data addressed to the second communication device. The learning unit updates the value function based on the calculated reward. The first communication device wirelessly transmits to the second communication device the data addressed to the second communication device received from the third communication device.

上述した実施形態におけるトラヒック制御装置５の機能をコンピュータで実現するようにしてもよい。その場合、トラヒック制御装置５はこの機能を実現するためのプログラムをコンピュータ読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行することによって実現してもよい。なお、ここでいう「コンピュータシステム」とは、ＯＳや周辺機器等のハードウェアを含むものとする。また、「コンピュータ読み取り可能な記録媒体」とは、フレキシブルディスク、光磁気ディスク、ＲＯＭ、ＣＤ－ＲＯＭ等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置のことをいう。さらに「コンピュータ読み取り可能な記録媒体」とは、インターネット等のネットワークや電話回線等の通信回線を介してプログラムを送信する場合の通信線のように、短時間の間、動的にプログラムを保持するもの、その場合のサーバやクライアントとなるコンピュータシステム内部の揮発性メモリのように、一定時間プログラムを保持しているものも含んでもよい。また上記プログラムは、前述した機能の一部を実現するためのものであってもよく、さらに前述した機能をコンピュータシステムにすでに記録されているプログラムとの組み合わせで実現できるものであってもよい。 The functions of the traffic control device 5 in the above-described embodiment may be realized by a computer. In that case, they may be realized by recording a program for realizing these functions on a computer-readable recording medium and causing a computer system to read and execute the program recorded on the recording medium. The term "computer system" here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. Furthermore, the "computer-readable recording medium" may also include a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as a volatile memory inside a computer system serving as the server or client in that case. The program may realize only some of the functions described above, or may realize them in combination with a program already recorded in the computer system.

以上、この発明の実施形態について図面を参照して詳述してきたが、具体的な構成はこの実施形態に限られるものではなく、この発明の要旨を逸脱しない範囲の設計等も含まれる。 Although the embodiment of the present invention has been described in detail with reference to the drawings, the specific configuration is not limited to this embodiment, and the design and the like within a range not deviating from the gist of the present invention are also included.

無線通信を行う通信システムに利用可能である。 It can be used for communication systems that perform wireless communication.

１…通信システム、２…アクセスポイント、３…無線局、４…プロキシサーバ、５…トラヒック制御装置、６…撮像装置、７…インターネット、４１…第１通信部、４２…記憶部、４３…第２通信部、５１…強化学習部、５２…報酬計算部、５３…通信制御部、５１１…加工部、５１２…行動決定部、５１３…学習部 1 ... Communication system, 2 ... Access point, 3 ... Radio station, 4 ... Proxy server, 5 ... Traffic control device, 6 ... Imaging device, 7 ... Internet, 41 ... First communication unit, 42 ... Storage unit, 43 ... Second communication unit, 51 ... Reinforcement learning unit, 52 ... Reward calculation unit, 53 ... Communication control unit, 511 ... Processing unit, 512 ... Action determination unit, 513 ... Learning unit

Claims

A communication system comprising a first communication device, one or more second communication devices that wirelessly communicate with the first communication device, a third communication device that acquires data to be transmitted from the first communication device to the second communication device, and a traffic control device,
wherein the traffic control device comprises:
an action determination unit that, using image data capturing the communication environment between the first communication device and the second communication device and information on the amount of untransmitted data addressed to the second communication device stored by the third communication device, calculates the value of each of a plurality of types of actions by a value function that calculates the value of an action represented by a combination of the traffic of each of the second communication devices, and determines an action based on the calculated values;
a communication control unit that controls the third communication device so as to transmit the data addressed to the second communication device to the first communication device according to the traffic of each of the second communication devices represented by the action determined by the action determination unit;
a reward calculation unit that acquires the communication status of the second communication device resulting from the control by the communication control unit and calculates a reward indicating the degree to which the acquired communication status has improved over the past communication status; and
a learning unit that updates the value function so that the cumulative sum of the rewards calculated by the reward calculation unit for different time intervals is maximized,
wherein the reward in a time interval is
a first value obtained by subtracting, from the total throughput of the first communication device in the time interval, a weighted average value obtained by multiplying the total throughputs of the first communication device from a past time interval up to the time interval by coefficients corresponding to time,
or a second value obtained by normalizing the total throughput of the first communication device in the time interval by the average throughput of the first communication device,
or a third value that is a positive constant value when the ratio of the total throughput of the first communication device in the time interval to the average throughput of the first communication device exceeds a predetermined value, and is a negative constant value whose absolute value is larger than the positive constant value when the ratio is equal to or less than the predetermined value, and
wherein the first communication device wirelessly transmits, to the second communication device, the data addressed to the second communication device received from the third communication device.
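
The three alternative reward definitions recited in this claim can be sketched as follows. This is an illustration only, not the patented implementation: the function names, the geometric weighting `decay ** age`, the reading of "normalizing" as division, and the numeric constants are all assumptions; the claim itself only requires time-dependent coefficients and a penalty whose magnitude exceeds the bonus.

```python
from typing import Sequence


def reward_weighted_baseline(history: Sequence[float], decay: float = 0.9) -> float:
    """First variant: current total throughput minus a time-weighted average.

    `history` holds the total throughput of the first communication device per
    time interval, oldest first, with the current interval last. The geometric
    weights are an assumption; the claim only says "coefficients corresponding
    to time".
    """
    current = history[-1]
    n = len(history)
    weights = [decay ** (n - 1 - i) for i in range(n)]  # newest sample has weight 1
    baseline = sum(w * x for w, x in zip(weights, history)) / sum(weights)
    return current - baseline


def reward_normalized(current: float, average: float) -> float:
    """Second variant: total throughput normalized (here: divided) by the average."""
    return current / average


def reward_thresholded(current: float, average: float, threshold: float = 1.0,
                       bonus: float = 1.0, penalty: float = -2.0) -> float:
    """Third variant: +bonus if current/average exceeds the threshold, else penalty.

    The claim requires |penalty| > bonus; the values here are illustrative.
    """
    assert penalty < 0 < bonus and abs(penalty) > bonus
    return bonus if current / average > threshold else penalty
```

With these assumed defaults, a ratio exactly equal to the threshold yields the penalty, matching the "equal to or less than" wording of the claim.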

A traffic control device comprising:
an action determination unit that, using image data capturing the communication environment between a first communication device and one or more second communication devices and information on the amount of untransmitted data addressed to the second communication device, calculates the value of each of a plurality of types of actions by a value function that calculates the value of an action represented by a combination of the traffic of each of the second communication devices, and determines an action based on the calculated values;
a communication control unit that controls communication so that the data from the first communication device to the second communication device is delivered according to the traffic of each of the second communication devices represented by the action determined by the action determination unit;
a reward calculation unit that acquires the communication status of the second communication device resulting from the control by the communication control unit and calculates a reward indicating the degree to which the acquired communication status has improved over the past communication status; and
a learning unit that updates the value function so that the cumulative sum of the rewards calculated by the reward calculation unit for different time intervals is maximized,
wherein the reward in a time interval is
a first value obtained by subtracting, from the total throughput of the first communication device in the time interval, a weighted average value obtained by multiplying the total throughputs of the first communication device from a past time interval up to the time interval by coefficients corresponding to time,
or a second value obtained by normalizing the total throughput of the first communication device in the time interval by the average throughput of the first communication device,
or a third value that is a positive constant value when the ratio of the total throughput of the first communication device in the time interval to the average throughput of the first communication device exceeds a predetermined value, and is a negative constant value whose absolute value is larger than the positive constant value when the ratio is equal to or less than the predetermined value.
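
As a rough illustration of the action determination recited in this claim, the sketch below enumerates actions as combinations of per-station traffic levels and picks the one with the highest value. The discretized `levels`, the `value_fn` callable, and the epsilon-greedy exploration are assumptions added for the example; the claim only states that an action is determined based on the calculated values.

```python
import itertools
import random
from typing import Callable, List, Optional, Sequence, Tuple

Action = Tuple[float, ...]  # one traffic level per second communication device


def enumerate_actions(num_stas: int, levels: Sequence[float]) -> List[Action]:
    """All combinations of per-station traffic levels (hypothetical discretization)."""
    return list(itertools.product(levels, repeat=num_stas))


def decide_action(state: object, actions: Sequence[Action],
                  value_fn: Callable[[object, Action], float],
                  epsilon: float = 0.0,
                  rng: Optional[random.Random] = None) -> Action:
    """Return the highest-valued action; explore at random with probability epsilon."""
    rng = rng or random.Random()
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    return max(actions, key=lambda a: value_fn(state, a))


# Toy usage: a stand-in value function that prefers full rate for station 1
# and half rate for station 2.
actions = enumerate_actions(2, [0.0, 0.5, 1.0])
toy_value = lambda s, a: -abs(a[0] - 1.0) - abs(a[1] - 0.5)
best = decide_action(None, actions, toy_value)
```

Note that the number of actions grows as `len(levels) ** num_stas`, so a real system would need a small level set or a different action parameterization.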

The traffic control device according to claim 2, wherein the communication status of the second communication device is information representing the throughput of the second communication device or the time required to transmit the data to the second communication device.

The traffic control device according to claim 2 or 3, wherein the value function is approximated by a deep neural network.

The traffic control device according to claim 4, wherein the image data used in the value function is data obtained by reducing the resolution of each of a plurality of image data captured at different timings and then normalizing the pixel values.
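
The preprocessing recited in this claim (resolution reduction followed by pixel normalization, applied to several frames captured at different times) might look like the sketch below. Block averaging as the downsampling method, the `factor` parameter, and scaling by a maximum pixel value of 255 are assumptions; the claim does not specify how the resolution is reduced or how the pixels are normalized.

```python
from typing import List

Image = List[List[float]]  # rows of pixel values


def downsample(image: Image, factor: int) -> Image:
    """Reduce resolution by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(image) // factor, len(image[0]) // factor
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            block = [image[r * factor + i][c * factor + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def normalize(image: Image, max_value: float = 255.0) -> Image:
    """Scale pixel values into the range [0, 1]."""
    return [[p / max_value for p in row] for row in image]


def preprocess_frames(frames: List[Image], factor: int = 2,
                      max_value: float = 255.0) -> List[Image]:
    """Downsample, then normalize, each frame; the list order preserves capture time."""
    return [normalize(downsample(f, factor), max_value) for f in frames]
```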

The traffic control device according to claim 4, wherein the information on the amount of untransmitted data addressed to the second communication device used in the value function is information in which vectors, each representing in one-hot representation the amount of untransmitted data addressed to one of a plurality of the second communication devices, are arranged.
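
One way to read the one-hot arrangement in this claim is to quantize each station's backlog into bins, encode the bin index as a one-hot vector, and concatenate the vectors in station order. The sketch below follows that reading; the bin edges, the overflow bin, and the function names are hypothetical and are not given in the source.

```python
from typing import List, Sequence


def one_hot_backlog(backlog_bytes: float, bin_edges: Sequence[float]) -> List[int]:
    """One-hot vector whose hot index is the bin containing the backlog.

    bin_edges are ascending upper bounds (assumed); a backlog above the last
    edge falls into an extra overflow bin.
    """
    index = sum(1 for edge in bin_edges if backlog_bytes > edge)
    vec = [0] * (len(bin_edges) + 1)
    vec[index] = 1
    return vec


def encode_backlogs(backlogs: Sequence[float], bin_edges: Sequence[float]) -> List[int]:
    """Concatenate the per-station one-hot vectors in station order."""
    out: List[int] = []
    for b in backlogs:
        out.extend(one_hot_backlog(b, bin_edges))
    return out
```

For example, with edges `[0, 1000, 1000000]`, an empty queue maps to `[1, 0, 0, 0]` and a 500-byte queue maps to `[0, 1, 0, 0]`.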

The traffic control device according to any one of claims 2 to 6, wherein the image data is depth image data.

A traffic control method performed by a traffic control device that controls wireless communication between a first communication device and one or more second communication devices, the method comprising:
an action determination step of calculating, using image data capturing the communication environment between the first communication device and the second communication device and information on the amount of untransmitted data addressed to the second communication device, the value of each of a plurality of types of actions by a value function that calculates the value of an action represented by a combination of the traffic of each of the second communication devices, and determining an action based on the calculated values;
a communication control step of controlling communication so that the data from the first communication device to the second communication device is delivered according to the traffic of each of the second communication devices represented by the action determined in the action determination step;
a reward calculation step of acquiring the communication status of the second communication device resulting from the control performed in the communication control step and calculating a reward indicating the degree to which the acquired communication status has improved over the past communication status; and
a learning step of updating the value function so that the cumulative sum of the rewards calculated for different time intervals in the reward calculation step is maximized,
wherein the reward in a time interval is
a first value obtained by subtracting, from the total throughput of the first communication device in the time interval, a weighted average value obtained by multiplying the total throughputs of the first communication device from a past time interval up to the time interval by coefficients corresponding to time,
or a second value obtained by normalizing the total throughput of the first communication device in the time interval by the average throughput of the first communication device,
or a third value that is a positive constant value when the ratio of the total throughput of the first communication device in the time interval to the average throughput of the first communication device exceeds a predetermined value, and is a negative constant value whose absolute value is larger than the positive constant value when the ratio is equal to or less than the predetermined value.
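
The learning step, which updates the value function so as to maximize the cumulative reward, could be illustrated with a one-step Q-learning update. This is a swapped-in simplification: the sketch uses a lookup table rather than the deep neural network mentioned in the dependent claims, and the Q-learning rule, the learning rate `alpha`, and the discount factor `gamma` are assumptions that the claims do not confirm.

```python
from collections import defaultdict
from typing import Hashable, Sequence


class TabularValueFunction:
    """Tabular stand-in for the value function (assumed; the claims allow a DNN)."""

    def __init__(self, alpha: float = 0.1, gamma: float = 0.9):
        self.q = defaultdict(float)  # (state, action) -> estimated value
        self.alpha = alpha           # learning rate (assumed)
        self.gamma = gamma           # discount factor (assumed)

    def value(self, state: Hashable, action: Hashable) -> float:
        return self.q[(state, action)]

    def update(self, state: Hashable, action: Hashable, reward: float,
               next_state: Hashable, actions: Sequence[Hashable]) -> None:
        """Move Q(s, a) toward reward + gamma * max over a' of Q(s', a')."""
        best_next = max(self.value(next_state, a) for a in actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


# Toy usage with two hypothetical states and two single-station actions.
vf = TabularValueFunction(alpha=0.5, gamma=0.9)
acts = [(0,), (1,)]
vf.update("s0", (1,), 1.0, "s1", acts)
vf.update("s1", (0,), 2.0, "s0", acts)
```

Repeating the action determination, communication control, reward calculation, and this update once per time interval gives the overall loop the method claim describes.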