JP7510661B2

JP7510661B2 - Control quantity calculation device and control quantity calculation method

Info

Publication number: JP7510661B2
Application number: JP2020058499A
Authority: JP
Inventors: 博史峰野; 悠安孫子
Original assignee: Shizuoka University NUC
Current assignee: Shizuoka University NUC
Filing date: 2020-03-27
Publication date: 2024-07-04
Anticipated expiration: 2040-03-27

Description

Article 30, paragraph 2 of the Patent Act applies to the following disclosures. (1) Publication date: October 15, 2019. Publication: Proceedings of the 2019 IEEE 8th Global Conference on Consumer Electronics (GCCE), pp. 129-130. Publisher: IEEE. (2) Date held: October 15-18, 2019 (presentation date: October 15, 2019). Meeting name and venue: IEEE 8th Global Conference on Consumer Electronics (GCCE 2019), Senri Life Science Center. (3) Publication date: January 7, 2020. Publication: Proceedings of the 34th International Conference on Information Networking (ICOIN 2020), pp. 420-425. Publisher: IEEE. (4) Date held: January 7-10, 2020 (presentation date: January 9, 2020). Meeting name and venue: The 34th International Conference on Information Networking (ICOIN 2020), AC Hotel Barcelona Forum.

The present invention relates to a control amount calculation device and a control amount calculation method for calculating a control amount used to control the amount of communication resources allocated to a slice.

The commercialization of fifth-generation (5G) mobile communication systems enables communication networks with larger capacity, higher speed, and large numbers of simultaneous connections, which in turn facilitates the provision of diverse services. One technology under consideration for 5G mobile communication systems is network slicing, which provides a network environment optimized for each of these diverse services. In this technology, a resource management mechanism is important: it allocates communication resources (hereinafter also simply called resources), such as base stations, communication paths, switches, routers, and server CPUs, to each slice set up in the network.

Resource management methods under study for allocating resources to each slice in a network include a method using a genetic algorithm (see Non-Patent Document 1 below) and a method using deep reinforcement learning (see Non-Patent Document 2 below).

Non-Patent Document 1: B. Han et al., “Slice as an Evolutionary Service: Genetic Optimization for Inter-Slice Resource Management in 5G Networks,” IEEE Access, vol. 6, pp. 33137-33147, 2018.
Non-Patent Document 2: R. Li et al., “Deep Reinforcement Learning for Resource Management in Network Slicing,” arXiv:1805.06591 [cs], May 2018.

With the methods described in Non-Patent Documents 1 and 2, the model must be retrained whenever the number of slices to be controlled changes, which makes it difficult to handle dynamic changes in the number of slices.

The present invention has been made in view of the above problem, and its object is to provide a control amount calculation device and a control amount calculation method that can allocate resources to multiple slices even when the number of slices changes dynamically.

To solve the above problem, a control amount calculation device according to one aspect of the present invention calculates control amounts for controlling the amounts of communication resources allocated to multiple slices, which are virtualized networks on a communication network. The device includes: multiple control amount derivation units, provided one for each of the multiple slices, each of which acquires a state value and a reward value for its slice and inputs the state value into a reinforcement learning model to determine and output an action that is the control amount for the slice; a learning data storage unit that stores learning data, each item being a combination of a state value and a reward value acquired by the control amount derivation units and the action determined for that state value; and a training unit that uses the learning data stored in the learning data storage unit to optimize, by training, the reinforcement learning model shared by the multiple control amount derivation units.

Alternatively, a control amount calculation method according to another aspect of the present invention is executed by a control amount calculation device that calculates control amounts for controlling the amounts of communication resources allocated to multiple slices, which are virtualized networks on a communication network. The method includes: multiple control amount derivation steps, executed one for each of the multiple slices, each of which acquires a state value and a reward value for its slice and inputs the state value into a reinforcement learning model to determine and output an action that is the control amount for the slice; a learning data storage step of storing learning data, each item being a combination of a state value and a reward value acquired in the control amount derivation steps and the action determined for that state value; and a training step of using the learning data stored in the learning data storage step to optimize, by training, the reinforcement learning model shared by the multiple control amount derivation steps.

According to the above aspects, for each of the multiple slices, the state value is input into the reinforcement learning model to determine and output an action for controlling the resource allocation amount of that slice, and the combination of the state value and action used at that time and the reward value for that action is stored as learning data. The reinforcement learning model shared across the resource allocation control of the multiple slices is then optimized by training on this previously stored learning data. Consequently, even when the number of slices in the controlled network changes dynamically, the reinforcement learning model used for resource allocation control of the multiple slices can be optimized using the control results of the individual slices as learning data. As a result, resource allocation to the multiple slices can be controlled appropriately.

Here, the state value for a slice preferably includes at least the degree of satisfaction with the slice requirements and the usage rate of the communication resources allocated to the slice. In this case, resource allocation to the multiple slices can be controlled so that the requirements of each slice are met while resource utilization efficiency is improved.

The reward value for a slice is also preferably a value that reflects both the degree of satisfaction with the slice requirements and the usage rate of the communication resources allocated to the slice. In this case, the reinforcement learning model for controlling resource allocation to the multiple slices can be optimized both to meet the requirements of each slice and to improve resource utilization efficiency.

The device preferably further includes a control unit that controls the allocation amounts for the multiple slices based on the control amounts output by the multiple control amount derivation units. The resource allocation amounts for the multiple slices can then be determined from the control amounts output by the multiple control amount derivation units, which makes possible, for example, priority control based on a predetermined criterion.

The control unit also preferably determines priorities for the multiple slices based on their immediately preceding resource allocation amounts and their degrees of satisfaction with the slice requirements, and executes control for the multiple slices in the order indicated by the priorities. In this case, resource allocation can be prioritized according to priorities determined from the immediately preceding resource allocation amounts and the satisfaction of the slice requirements, which enables smooth resource allocation to the multiple slices.

The control amount for a slice also preferably relates to the number of resource blocks in the radio access network. In this case, the allocation of resource blocks to the multiple slices can be controlled appropriately.

According to the present invention, resources can be allocated to multiple slices even when the number of slices changes dynamically.

FIG. 1 is a diagram showing a schematic configuration of a control system according to a preferred embodiment of the present invention.
FIG. 2 is a block diagram showing the hardware configuration of the control amount calculation device of FIG. 1.
FIG. 3 is a block diagram showing the functional configuration of the control amount calculation device of FIG. 1.
FIG. 4 is a diagram showing an example of the network structure of the learning model used by the control amount derivation units 31_1 to 31_N of FIG. 3.
FIG. 5 is a diagram explaining the overall picture of RB allocation to slices by the control amount calculation device 3 and the control device 5.
FIG. 6 is a diagram showing the change over time in the RB amount allocated to each slice through RB allocation by the control amount calculation device 3 and the control device 5.
FIG. 7 is a flowchart showing the procedure of the RB allocation process in a control amount calculation method according to a preferred embodiment of the present invention.
FIG. 8 is a flowchart showing the procedure of the learning process in a control amount calculation method according to a preferred embodiment of the present invention.
FIG. 9 is a graph showing the results of evaluating, by simulation, the performance of RB allocation control by the control system 1 according to the present embodiment.
FIG. 10 is a graph showing the results of evaluating, by simulation, the performance of RB allocation control by the control system 1 according to the present embodiment.

A preferred embodiment of the control system according to the present invention is described in detail below with reference to the drawings. In the description of the drawings, identical or corresponding parts are given the same reference numerals, and redundant description is omitted.

The control system 1 shown in FIG. 1, a preferred embodiment of the present invention, is a computer system that controls resource allocation to slices in a mobile communication system such as a fifth-generation (5G) mobile communication system. A slice is one of the virtualized networks obtained by dividing the resources of the entire communication network into multiple parts.

The control system 1 controls multiple slices set up in the mobile communication system, and the number of these slices can change dynamically. For example, this embodiment assumes (N+1) slices, namely slices S_1, S_2, ..., S_N (N is any natural number) and slice S_B, whose slice requirements, defined by throughput, delay, and reliability, differ from one another. Each of the slices S_1 to S_N and S_B is formed by dividing the resources of a radio access network (RAN) that includes a base station (BS) and the like, and is shared by multiple user terminals (UE: User Equipment).

The control system 1 includes a control amount calculation device 3, which calculates control amounts for controlling the allocation of resource blocks (the resources in question) to the slices S_1, S_2, ..., S_N, and a control device (control unit) 5, which controls the amounts of resource blocks (hereinafter also called RBs) allocated to the slices S_1, S_2, ..., S_N, and S_B based on the control amounts calculated by the control amount calculation device 3. In this embodiment the control amount calculation device 3 and the control device 5 are separate devices, but they may be integrated into a single device. The control device 5 is configured to exchange various data with the mobile communication system so that it can control that system. The control amount calculation device 3 is configured to exchange data with the control device 5 and receives various data, such as the state values and reward values described later, via the control device 5. An RB, whose allocation the control system 1 controls, is one type of resource in the RAN: it is a block obtained by dividing the two-dimensional region spanned by the frequency axis and the time axis, where a given frequency band is divided in frequency and the time axis is divided as well.

FIG. 2 shows the hardware configuration of the computer 20 that constitutes the control amount calculation device 3. As shown in FIG. 2, the computer 20 is physically a computer or the like that includes a CPU (Central Processing Unit) 101 serving as a processor, a RAM (Random Access Memory) 102 or ROM (Read Only Memory) 103 serving as a recording medium, a communication module 104, an input/output module 106, and the like, which are electrically connected to one another. The computer 20 may include a display, keyboard, mouse, touch panel display, or the like as the input/output module 106, and may include a data recording device such as a hard disk drive or semiconductor memory. The computer 20 may also be composed of multiple computers. The control device 5 has a similar hardware configuration.

FIG. 3 is a block diagram showing the functional configuration of the control amount calculation device 3. The control amount calculation device 3 includes N control amount derivation units 31_1 to 31_N, a training unit 32, and a learning data storage unit 33. The number of control amount derivation units 31_1 to 31_N matches the number of slices S_1 to S_N to be controlled. Each functional unit of the control amount calculation device 3 shown in FIG. 3 is realized by loading a program onto hardware such as the CPU 101 and the RAM 102, thereby operating the communication module 104, the input/output module 106, and the like under the control of the CPU 101, and reading and writing data in the RAM 102. By executing this computer program, the CPU 101 causes the control amount calculation device 3 to function as each functional unit in FIG. 3 and sequentially executes the processing corresponding to the control amount calculation method described later. All data required to execute this computer program, and all data generated by its execution, are stored in built-in memory such as the ROM 103 and the RAM 102, or in a storage medium such as a hard disk drive.

Each of the control amount derivation units 31_1 to 31_N determines a control amount (RB amount) for controlling the amount of RBs allocated to its corresponding slice S_1 to S_N and outputs that control amount to the control device 5. Specifically, each unit acquires the state value of its corresponding slice from the mobile communication system via the control device 5, inputs the state value into the common deep reinforcement learning model, uses the model to calculate action values, and selects the action with the maximum action value. Each unit also acquires, via the control device 5, the reward value of its slice for the selected action. It then calculates the control amount (RB amount) from the selected action and outputs it to the control device 5. In addition, at every control amount calculation step, each unit stores the combination of the state value, the action, and the reward value for that action in the learning data storage unit 33 as learning data.
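The arrangement described here, one derivation unit per slice with a single model and a single learning data store shared by all of them, can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Experience`, `DerivationUnit`, and `make_units` are hypothetical, and the model is represented by any callable that maps a state vector to a list of action values.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Experience:
    """One item of learning data: state value, chosen action, resulting reward."""
    slice_id: int
    state: Tuple[float, ...]
    action: int
    reward: float


class DerivationUnit:
    """One unit per slice; all units hold the same model and the same store."""

    def __init__(self, slice_id: int,
                 model: Callable[[Sequence[float]], Sequence[float]],
                 store: List[Experience]) -> None:
        self.slice_id = slice_id
        self.model = model   # shared: maps a state vector to action values
        self.store = store   # shared learning data storage

    def select_action(self, state: Sequence[float]) -> int:
        """Pick the action (numbered from 1) with the largest action value."""
        values = self.model(state)
        return 1 + max(range(len(values)), key=values.__getitem__)

    def record(self, state: Sequence[float], action: int, reward: float) -> None:
        """Store one (state, action, reward) combination as learning data."""
        self.store.append(Experience(self.slice_id, tuple(state), action, reward))


def make_units(n_slices: int, model, store) -> List[DerivationUnit]:
    """Create one unit per slice, numbered 1..n_slices (hypothetical helper)."""
    return [DerivationUnit(i + 1, model, store) for i in range(n_slices)]
```

Because the model takes the state of a single slice as input, adding or removing a slice in this sketch only means adding or removing a `DerivationUnit`; the shape of the model input does not change.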

In detail, the control amount derivation units 31_1 to 31_N acquire the following as state values: the throughput value and the delay amount, which indicate the requirements of the slices S_1 to S_N; the NSRS (NS requirement satisfaction) value, which indicates user satisfaction with the slice requirements; the RBUR (RB usage ratio) value, which indicates the usage rate of the slices S_1 to S_N; the number of RBs actually allocated to the slices S_1 to S_N; and the number of arriving packets, the number of transmitted packets, and the number of buffered (waiting to be transmitted) packets on the slices S_1 to S_N at the controlled base station BS. Of these, the throughput value, delay amount, and NSRS value describe the state with respect to the slice requirements; the RBUR value and the number of RBs describe the state with respect to slice usage; and the packet counts are auxiliary state values acquired to remove ambiguity from the other state values.
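The eight state values listed above could be gathered into a single per-slice record before being fed to the learning model. The sketch below is only illustrative: the class name `SliceState`, the field names, the field order, and the absence of any scaling or normalization are all assumptions that the source does not specify.

```python
from dataclasses import dataclass, astuple
from typing import List


@dataclass(frozen=True)
class SliceState:
    """State values of one slice at one time step (field order is an assumption)."""
    throughput: float        # slice requirement: throughput value
    delay: float             # slice requirement: delay amount
    nsrs: float              # NSRS value, between 0 and 1
    rbur: float              # RBUR value, between 0 and 1
    allocated_rbs: int       # number of RBs actually allocated
    arrived_packets: int     # auxiliary: packets that arrived
    sent_packets: int        # auxiliary: packets transmitted
    buffered_packets: int    # auxiliary: packets waiting in the buffer

    def as_vector(self) -> List[float]:
        """Flatten the record into the input vector of the learning model."""
        return [float(v) for v in astuple(self)]
```

For example, `SliceState(10.0, 5.0, 0.8, 0.5, 6, 100, 90, 10).as_vector()` yields an eight-element list whose third entry is the NSRS value.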

Among the above state values, the NSRS value is generated by the base station BS or the like by collecting and aggregating data from the user terminals UE, and is defined by the following formula (1):

$$\mathrm{NSRS} = \frac{1}{ues} \sum_{u=1}^{ues} S_u \tag{1}$$

Here, S_u indicates with "0" or "1" whether the slice requirements are satisfied for user u, and ues is the number of users accommodated in the slice. The NSRS value therefore represents the average satisfaction of the users; the closer it is to 1, the better the slice requirements are met.
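Since the NSRS value is described as the average of per-user 0/1 satisfaction flags over the users accommodated in the slice, it can be computed as in the following minimal sketch. The function name `nsrs` and the choice to raise an error for an empty slice or a non-binary flag are assumptions made for illustration.

```python
from typing import Sequence


def nsrs(satisfied: Sequence[int]) -> float:
    """Average of per-user satisfaction flags (each 0 or 1) for one slice."""
    if not satisfied:
        # Assumption: a slice with no users has no defined NSRS value.
        raise ValueError("slice accommodates no users")
    if any(flag not in (0, 1) for flag in satisfied):
        raise ValueError("each satisfaction flag must be 0 or 1")
    return sum(satisfied) / len(satisfied)
```

With four users of whom three have their requirements met, `nsrs([1, 1, 0, 1])` evaluates to 0.75.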

Among the above state values, the RBUR value is generated by the base station BS or the like and is defined by the following formula (2):

$$\mathrm{RBUR} = \frac{URB}{MRB} \tag{2}$$

Here, URB is the number of RBs actually used, and MRB is the number of RBs actually allocated to the slice in question. The RBUR value represents the usage rate of that slice; the closer it is to 1, the fewer RBs are allocated in excess.
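The RBUR value is the ratio of RBs actually used to RBs actually allocated, which the following sketch computes. The function name `rbur` is hypothetical, and returning 0.0 when no RBs are allocated is an assumption; the source does not say how that case is handled.

```python
def rbur(used_rbs: int, allocated_rbs: int) -> float:
    """Ratio of RBs actually used (URB) to RBs actually allocated (MRB)."""
    if used_rbs < 0 or allocated_rbs < 0:
        raise ValueError("RB counts must be non-negative")
    if used_rbs > allocated_rbs:
        raise ValueError("a slice cannot use more RBs than it was allocated")
    if allocated_rbs == 0:
        # Assumption: with nothing allocated, report a usage ratio of 0.
        return 0.0
    return used_rbs / allocated_rbs
```

For instance, a slice that was allocated 6 RBs and used 3 of them has `rbur(3, 6)` equal to 0.5, meaning half of the allocation was excess.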

As the learning model to which state values are input, the control amount derivation units 31_1 to 31_N use Ape-X, a technique that applies distributed learning to DQN. That is, each control amount derivation unit applies the state value of its controlled slice to the learning model to determine an action, obtains the reward value for that action, and collects and accumulates the state value, action, and reward value as experience (learning data). The training unit 32, described later, trains on this learning data to optimize the parameters of the learning model.

FIG. 4 shows an example of the network structure of the learning model used by the control amount derivation units 31_1 to 31_N. As shown in FIG. 4, the learning model includes: an input layer N_IN to which the state value is input; a fully connected layer N_1, batch normalization layer N_2, fully connected layer N_3, batch normalization layer N_4, fully connected layer N_5, and batch normalization layer N_6, through which the state value from the input layer N_IN propagates in that order; two branches connected to the batch normalization layer N_6, one consisting of a fully connected layer N_7, batch normalization layer N_8, and fully connected layer N_9, and the other of a fully connected layer N_10, batch normalization layer N_11, and fully connected layer N_12; and an output layer N_OUT connected to the fully connected layers N_9 and N_12. When a state value is input to the input layer N_IN, the output layer N_OUT outputs the action value, that is, the reward expected as a result of an action.
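The layer ordering described for FIG. 4 can be illustrated with a small forward pass in plain Python. Several details here are assumptions that the text does not state: the layer widths, the ReLU activation after each batch normalization layer, the use of stored statistics for batch normalization, and in particular the way the two branches are merged at the output layer, which is written below in the style of a dueling network (a scalar value plus mean-centered advantages). The class names are hypothetical.

```python
import random
from typing import List

Vector = List[float]


class Dense:
    """Fully connected layer with small random weights (illustrative only)."""

    def __init__(self, n_in: int, n_out: int, rng: random.Random) -> None:
        self.w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        self.b = [0.0] * n_out

    def __call__(self, x: Vector) -> Vector:
        return [sum(wi * xi for wi, xi in zip(row, x)) + b
                for row, b in zip(self.w, self.b)]


class BatchNorm:
    """Inference-mode batch normalization using stored statistics."""

    def __init__(self, n: int, eps: float = 1e-5) -> None:
        self.mean, self.var = [0.0] * n, [1.0] * n
        self.gamma, self.beta = [1.0] * n, [0.0] * n
        self.eps = eps

    def __call__(self, x: Vector) -> Vector:
        return [g * (xi - m) / (v + self.eps) ** 0.5 + b
                for xi, m, v, g, b in zip(x, self.mean, self.var, self.gamma, self.beta)]


def run(layers, x: Vector) -> Vector:
    """Apply layers in order; ReLU after each BatchNorm is an assumption."""
    for layer in layers:
        x = layer(x)
        if isinstance(layer, BatchNorm):
            x = [max(0.0, xi) for xi in x]
    return x


class BranchedQNetwork:
    """Trunk N_1..N_6 followed by two branches N_7..N_9 and N_10..N_12."""

    def __init__(self, n_state: int, n_actions: int, hidden: int = 16, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.trunk = [Dense(n_state, hidden, rng), BatchNorm(hidden),    # N_1, N_2
                      Dense(hidden, hidden, rng), BatchNorm(hidden),     # N_3, N_4
                      Dense(hidden, hidden, rng), BatchNorm(hidden)]     # N_5, N_6
        self.branch_a = [Dense(hidden, hidden, rng), BatchNorm(hidden),  # N_7, N_8
                         Dense(hidden, 1, rng)]                          # N_9
        self.branch_b = [Dense(hidden, hidden, rng), BatchNorm(hidden),  # N_10, N_11
                         Dense(hidden, n_actions, rng)]                  # N_12

    def action_values(self, state: Vector) -> Vector:
        h = run(self.trunk, state)
        value = run(self.branch_a, h)[0]
        advantages = run(self.branch_b, h)
        mean_adv = sum(advantages) / len(advantages)
        # Assumed dueling-style merge at N_OUT: value plus centered advantage.
        return [value + a - mean_adv for a in advantages]
```

Calling `BranchedQNetwork(8, 5).action_values([0.1] * 8)` returns five action values, one per candidate action; a trained model would of course use learned weights and statistics rather than the random initial values used here.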

Furthermore, the control amount derivation units 31_1 to 31_N select the action with the maximum action value, based on the action values obtained by inputting the state value into the learning model (greedy method). Specifically, using the learning model shown in FIG. 4, the control amount derivation units 31_1 to 31_N obtain the action value Q(s_t, a_t, θ) for the state value s_t and action a_t at time t (θ denotes parameters such as the weights of the neural network) and determine the action a (a is an integer of 1 or more) that maximizes Q(s_t, a_t, θ). Based on the determined action a, each control amount derivation unit then calculates the relative amount of RBs (IDRB) to be allocated to the corresponding slice at time t by formula (3), which involves a floor function ⌊·⌋.

[Formula (3) is missing from this text.]

Further, the control amount derivation units 31_1 to 31_N calculate the allocated RB amount (ARB) at time t by adding the relative amount (IDRB) to the allocated RB amount at time t-1, one step earlier, according to the following formula (4):

$$ARB_t = ARB_{t-1} + IDRB_t \tag{4}$$

The calculated allocated RB amount is output to the control device 5.
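The greedy selection and the update of formula (4) can be sketched as two small functions. Because formula (3), which maps the chosen action to the relative amount IDRB, is not available in this text, the sketch takes IDRB as an input rather than guessing that mapping. The function names are hypothetical, and treating IDRB as a signed integer (so that the allocation can shrink as well as grow) is an assumption.

```python
from typing import Sequence


def greedy_action(action_values: Sequence[float]) -> int:
    """Return the action, numbered from 1, whose action value is largest."""
    best_index = max(range(len(action_values)), key=action_values.__getitem__)
    return best_index + 1


def next_arb(prev_arb: int, idrb: int) -> int:
    """Formula (4): ARB_t = ARB_{t-1} + IDRB_t."""
    return prev_arb + idrb
```

For example, with action values `[0.2, 0.7, 0.1]` the greedy choice is action 2, and if the relative amount obtained for that action were -2, a previous allocation of 5 RBs would become `next_arb(5, -2)`, that is, 3 RBs.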

The training unit 32 uses the learning data stored in the learning data storage unit 33 to update the parameters θ of the learning model shared by the control amount derivation units 31_1 to 31_N toward optimal values (training). Specifically, the training unit 32 obtains, from one set of learning data, the reward value at time t+1 (one step later) and the state value at time t+1, and from these computes the target value y_t that the learning model should output at time t. The reward value reflects both the NSRS value and the RBUR value; for example, it is the product of the NSRS value and the RBUR value. The training unit 32 then updates the parameters θ so that the action value Q(s_t, a_t, θ), determined from the state value s_t at time t, approaches the target value y_t.
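The reward and the training target described here can be sketched as follows. The product form of the reward is the example given in the text. The expression for the target value is not given, so the sketch assumes the standard one-step Q-learning form (reward plus a discounted maximum of the next action values); the discount factor `gamma` and all function names are assumptions.

```python
from typing import Sequence


def reward(nsrs_value: float, rbur_value: float) -> float:
    """Example reward from the text: product of the NSRS and RBUR values."""
    return nsrs_value * rbur_value


def td_target(reward_next: float, next_action_values: Sequence[float],
              gamma: float = 0.99) -> float:
    """Assumed one-step target: y_t = r_{t+1} + gamma * max_a Q(s_{t+1}, a)."""
    return reward_next + gamma * max(next_action_values)


def squared_error(q_value: float, target: float) -> float:
    """Loss that shrinks as Q(s_t, a_t, theta) approaches the target y_t."""
    return (target - q_value) ** 2
```

A product reward is high only when both factors are high: a slice that satisfies every user but uses half of its RBs gets `reward(1.0, 0.5)`, the same 0.5 as a slice that uses all of its RBs while satisfying only half of its users.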

Here, the training unit 32 randomly samples the learning data to be used for training from the learning data stored in the learning data storage unit 33, according to priority. As training proceeds, the training unit 32 also updates the priorities of the stored learning data so that older experiences have lower priority. Furthermore, the training unit 32 periodically copies the parameters θ updated by training to the control amount derivation units 31_1 to 31_N, thereby updating the parameters θ held in each unit. Alternatively, each of the control amount derivation units 31_1 to 31_N may itself fetch the parameters θ updated by the training unit 32 and update its internal parameters θ.
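Priority-weighted random sampling with priorities that fall as experiences age, together with the periodic parameter copy, can be sketched as below. The multiplicative decay factor, the point at which aging is applied (here, each time a new item is added), sampling with replacement, and the names `PrioritizedStore` and `copy_parameters` are all assumptions; the text only states that older experiences receive lower priority and that sampling is random and priority-based.

```python
import copy
import random
from typing import Any, Dict, List


class PrioritizedStore:
    """Learning data store whose older entries have lower sampling priority."""

    def __init__(self, decay: float = 0.9, seed: int = 0) -> None:
        self.items: List[Any] = []
        self.priorities: List[float] = []
        self.decay = decay
        self._rng = random.Random(seed)

    def add(self, item: Any) -> None:
        """Age existing entries, then insert the new one at full priority."""
        self.priorities = [p * self.decay for p in self.priorities]
        self.items.append(item)
        self.priorities.append(1.0)

    def sample(self, k: int) -> List[Any]:
        """Draw k items at random (with replacement), weighted by priority."""
        return self._rng.choices(self.items, weights=self.priorities, k=k)


def copy_parameters(trained: Dict[str, Any], units: List[Dict[str, Any]]) -> None:
    """Overwrite each unit's parameters with an independent copy of the trained ones."""
    for unit_params in units:
        unit_params.clear()
        unit_params.update(copy.deepcopy(trained))
```

With `decay=0.5`, adding three items in sequence leaves priorities of 0.25, 0.5, and 1.0, so the newest item is four times as likely to be drawn as the oldest.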

Next, the functions of the control device 5 are described.

The control device 5 has a function of periodically acquiring state values and reward values from the mobile communication system and forwarding them to the control amount calculation device 3, and a function of determining the RB amount (MRB) to be allocated at time t to each of the slices S_1, S_2, ..., S_N, and S_B, based on the allocated RB amounts (ARB) at time t output from the control amount derivation units 31_1, 31_2, ..., 31_N of the control amount calculation device 3. Specifically, for the slices S_1, S_2, ..., S_N other than slice S_B, the control device 5 uses as the priority the product of the MRB and the NSRS value at the immediately preceding time t-1 and, proceeding through the slices in ascending order of this priority, sets the MRB of each slice at time t equal to its ARB. If the ARB exceeds the number of RBs remaining, the number of remaining RBs is used as the MRB. With this priority control, slices that occupy fewer RBs are allocated first, so that slices requiring many RBs do not monopolize the RBs. For slice S_B, the control device 5 sets the MRB to the number of RBs remaining after the MRBs allocated to the slices S_1, S_2, ..., S_N are totaled.
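The priority-ordered allocation performed by the control device 5 can be sketched as a single function. The priority (previous MRB multiplied by previous NSRS), the ascending processing order, the cap at the remaining RBs, and the assignment of the leftover to slice S_B follow the description above. The function and parameter names, the clamping of a negative ARB to zero, and the tie-breaking by dictionary order are assumptions.

```python
from typing import Dict, Hashable, Tuple


def allocate_rbs(arb: Dict[Hashable, int],
                 prev_mrb: Dict[Hashable, int],
                 prev_nsrs: Dict[Hashable, float],
                 total_rbs: int) -> Tuple[Dict[Hashable, int], int]:
    """Return the MRB of each slice S_1..S_N and the remainder given to S_B."""
    # Priority of each slice: MRB at time t-1 multiplied by NSRS at time t-1.
    priority = {s: prev_mrb[s] * prev_nsrs[s] for s in arb}
    remaining = total_rbs
    mrb: Dict[Hashable, int] = {}
    # Process slices in ascending order of priority (stable for ties).
    for s in sorted(arb, key=lambda name: priority[name]):
        requested = max(arb[s], 0)   # assumption: negative requests count as 0
        mrb[s] = min(requested, remaining)
        remaining -= mrb[s]
    return mrb, remaining            # the leftover goes to slice S_B
```

As a worked example with 10 RBs in total, requests of 4, 5, and 3 RBs for slices S1, S2, and S3, previous allocations of 2, 6, and 1 RBs, and previous NSRS values of 1.0, 0.5, and 1.0, the priorities are 2.0, 3.0, and 1.0. The slices are therefore served in the order S3, S1, S2: S3 receives 3 RBs, S1 receives 4, S2 is capped at the remaining 3, and slice S_B receives 0.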

図５は、制御量算出装置３及び制御装置５によるスライスへのＲＢ割り当ての全体像を説明するための図であり、図６は、制御量算出装置３及び制御装置５によるスライスへのＲＢ割り当てによってスライスＳ_１，Ｓ_２，Ｓ_Ｎに割り当てられるＲＢ量の時間変化を示す図である。このように、制御量算出装置３のトレーニング部３２によって、学習データ格納部３３に各制御量導出部３１_１，３１_２，…３１_Ｎの経験として蓄積された学習データを基に、学習モデルが学習される。そして、学習によって最適化された学習モデルのパラメータが制御量算出装置３内の各制御量導出部３１_１，３１_２，…３１_Ｎに複製され、各制御量導出部３１_１，３１_２，…３１_Ｎでは、共通の学習モデルに状態値を入力することによって、各スライスＳ_１，Ｓ_２，…，Ｓ_Ｎを対象にした行動が選択される。制御装置５では、各スライスＳ_１，Ｓ_２，…，Ｓ_Ｎを対象にして選択された行動を基に、各スライスＳ_１，Ｓ_２，…，Ｓ_Ｎ，Ｓ_Ｂに割り当てるＲＢ量が決定され、それらのＲＢ量のリソースが各スライスＳ_１，Ｓ_２，…，Ｓ_Ｎ，Ｓ_Ｂに割り当てられる。図６に示す例によれば、時間の経過に伴って、スライスＳ_１にはＲＢ量として“０”、“３”、“０”が順に割り当てられ、スライスＳ_２にはＲＢ量として“３”、“２”、“６”が順に割り当てられ、スライスＳ_ＮにはＲＢ量として“２”、“１”、“０”が順に割り当てられる。 Fig. 5 is a diagram giving an overview of RB allocation to slices by the control amount calculation device 3 and the control device 5, and Fig. 6 is a diagram showing how the RB amounts allocated to slices S_1, S_2, and S_N by that allocation change over time. The training unit 32 of the control amount calculation device 3 trains the learning model on the learning data accumulated in the learning data storage unit 33 as the experience of the control amount derivation units 31_1, 31_2, ..., 31_N. The parameters of the learning model optimized by this training are copied to each of the control amount derivation units 31_1, 31_2, ..., 31_N in the control amount calculation device 3, and each control amount derivation unit inputs a state value to the common learning model to select an action for its slice S_1, S_2, ..., S_N. The control device 5 determines the RB amount to be allocated to each of the slices S_1, S_2, ..., S_N, S_B on the basis of the actions selected for the slices S_1, S_2, ..., S_N, and resources of those RB amounts are allocated to the slices S_1, S_2, ..., S_N, S_B. In the example shown in Fig. 6, as time passes, slice S_1 is allocated RB amounts of "0", "3", and "0" in order, slice S_2 is allocated RB amounts of "3", "2", and "6" in order, and slice S_N is allocated RB amounts of "2", "1", and "0" in order.

次に、上述した制御システム１によって実行される制御量算出方法の手順を説明する。図７は、制御量算出方法のうちのＲＢ割り当て処理の手順を示すフローチャートであり、図８は、制御量算出方法のうちの学習処理の手順を示すフローチャートである。図７に示すＲＢ割り当て処理は、複数のステップの時刻t,t+1,…で繰り返し実行され、図８に示す学習処理は、定期的なタイミング、あるいは、ＲＢ割り当て処理によって蓄積された経験数が一定数に達したタイミングで繰り返し実行される。 Next, the procedure of the control amount calculation method executed by the control system 1 described above will be explained. FIG. 7 is a flowchart showing the procedure of the RB allocation process in the control amount calculation method, and FIG. 8 is a flowchart showing the procedure of the learning process in the control amount calculation method. The RB allocation process shown in FIG. 7 is executed repeatedly at successive time steps t, t+1, ..., and the learning process shown in FIG. 8 is executed repeatedly either at regular intervals or whenever the number of experiences accumulated by the RB allocation process reaches a fixed number.

まず、図７を参照して、ＲＢ割り当て処理が起動されると、制御量算出装置３の制御量導出部３１_１によって、スライスＳ_１に関する状態値が取得される（ステップＳ１０１）。次に、制御量導出部３１_１によって、取得した状態値を学習モデルに入力することにより行動価値が算出され、その行動価値が最大となる行動が選択される（ステップＳ１０２）。その後、制御量導出部３１_１によって、選択した行動を基に、スライスＳ_１に対する割り当てＲＢ量（ＡＲＢ）が決定され、そのＲＢ量が制御装置５に出力される（ステップＳ１０３）。さらに、制御量導出部３１_１によって、前回のＲＢ割り当て処理に対するスライスＳ_１に関する報酬値が取得され、状態値、行動、報酬値の組み合わせが学習データとして学習データ格納部３３に格納される（ステップＳ１０４）。このようなステップＳ１０１～Ｓ１０４の一連の処理は、制御量導出部３１_２～制御量導出部３１_Ｎにより、残りのスライスＳ_２～Ｓ_Ｎを対象に繰り返される（ステップＳ１０５）。 First, referring to Fig. 7, when the RB allocation process is started, the control amount derivation unit 31_1 of the control amount calculation device 3 acquires a state value for slice S_1 (step S101). Next, the control amount derivation unit 31_1 inputs the acquired state value into the learning model to calculate action values, and selects the action with the maximum action value (step S102). The control amount derivation unit 31_1 then determines the allocated RB amount (ARB) for slice S_1 on the basis of the selected action and outputs that RB amount to the control device 5 (step S103). Furthermore, the control amount derivation unit 31_1 acquires the reward value for slice S_1 resulting from the previous RB allocation process, and the combination of state value, action, and reward value is stored in the learning data storage unit 33 as learning data (step S104). This series of steps S101 to S104 is repeated by the control amount derivation units 31_2 to 31_N for the remaining slices S_2 to S_N (step S105).
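Steps S101 to S105 can be sketched roughly as below, for illustration only. The text does not state in this passage how a selected action maps to an ARB, so the table `ACTION_DELTAS` (an action decreases, keeps, or increases the current RB amount) is purely an assumption, as are the names `derive_control_amount`, `run_derivation_step`, and the representation of the shared learning model as a callable `q_function` that returns one action value per action.

```python
from typing import Callable, Dict, List, Sequence, Tuple

State = Tuple[float, ...]
Experience = Tuple[State, int, float]  # (state value, action, reward value)

# Assumption: each action index changes the current RB amount by this delta.
ACTION_DELTAS = [-1, 0, 1]


def derive_control_amount(
    q_function: Callable[[State], Sequence[float]],
    state: State,
    current_rbs: int,
) -> Tuple[int, int]:
    """S102-S103: pick the action with the largest action value, derive ARB."""
    values = q_function(state)
    action = max(range(len(values)), key=lambda a: values[a])
    arb = max(0, current_rbs + ACTION_DELTAS[action])
    return action, arb


def run_derivation_step(
    q_function: Callable[[State], Sequence[float]],
    states: Dict[str, State],
    current_rbs: Dict[str, int],
    prev_rewards: Dict[str, float],
    replay: List[Experience],
) -> Dict[str, int]:
    """S101-S105: apply the one shared model to every slice in turn."""
    arbs: Dict[str, int] = {}
    for slice_id, state in states.items():  # S101: state value per slice
        action, arb = derive_control_amount(q_function, state, current_rbs[slice_id])
        arbs[slice_id] = arb  # S103: ARB handed to the control device
        # S104: store state value, action, and the acquired reward value.
        replay.append((state, action, prev_rewards[slice_id]))
    return arbs
```

Because every slice is evaluated with the same `q_function`, adding or removing entries in `states` requires no change to the model, which mirrors the slice-count independence claimed for the embodiment.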

次に、制御装置５により、直前の時刻t-1におけるＲＢ量（ＭＲＢ）及びＮＳＲＳ値を基にスライスＳ_１～Ｓ_Ｎに関する優先度が決定される（ステップＳ１０６）。その後、制御装置５により、決定された優先度の順番で、スライスＳ_１～Ｓ_Ｎに関するＲＢ量（ＭＲＢ）が、スライスＳ_１～Ｓ_Ｎに関する割り当てＲＢ量（ＡＲＢ）を基に決定され、その優先度の順番でスライスＳ_１～Ｓ_ＮへのＲＢ割り当てが制御される（ステップＳ１０７）。最後に、制御装置５により、スライスＳ_Ｂに対して残りのＲＢを割り当てるように制御が実行される（ステップＳ１０８）。 Next, the control device 5 determines the priorities of slices S_1 to S_N on the basis of the RB amount (MRB) and the NSRS value at the immediately preceding time t-1 (step S106). The control device 5 then determines, in the order given by those priorities, the RB amount (MRB) for each of slices S_1 to S_N on the basis of the allocated RB amount (ARB) for that slice, and controls RB allocation to slices S_1 to S_N in that order (step S107). Finally, the control device 5 executes control to allocate the remaining RBs to slice S_B (step S108).

次に、図８を参照して、学習処理が起動されると、制御量算出装置３のトレーニング部３２によって、学習データ格納部３３に格納された学習データの中から、優先度を基にした重み付けを行いながら、トレーニングに用いる学習データがランダムに抽出される（ステップＳ２０１）。次に、トレーニング部３２により、抽出した学習データを用いたトレーニングによって、ニューラルネットワークの学習モデルのパラメータθが更新される（ステップＳ２０２）。その後、トレーニング部３２により、更新したパラメータθが各制御量導出部３１_１，３１_２，…３１_Ｎ内に複製されることにより、各制御量導出部３１_１，３１_２，…３１_Ｎ内の学習モデルが更新される（ステップＳ２０３）。最後に、トレーニング部３２により、学習データ格納部３３に格納された学習データの優先度が、古い経験の学習データほど低くなるように更新される（ステップＳ２０４）。 Next, referring to Fig. 8, when the learning process is started, the training unit 32 of the control amount calculation device 3 randomly extracts the learning data to be used for training from the learning data stored in the learning data storage unit 33, weighting the extraction by priority (step S201). Next, the training unit 32 updates the parameter θ of the neural-network learning model by training on the extracted learning data (step S202). The training unit 32 then copies the updated parameter θ into each of the control amount derivation units 31_1, 31_2, ..., 31_N, thereby updating the learning model in each of them (step S203). Finally, the training unit 32 updates the priorities of the learning data stored in the learning data storage unit 33 so that learning data from older experiences has lower priority (step S204).
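Steps S201 and S204 might look like the following sketch, given for illustration only. The source states only that extraction is random and weighted by priority, and that older experiences receive lower priority; the multiplicative `decay` factor, the function names `sample_batch` and `decay_priorities`, and sampling with replacement are assumptions. The parameter update of step S202 and the copy of step S203 are omitted because the training algorithm is not specified in this passage.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_batch(
    replay: Sequence[T],
    priorities: Sequence[float],
    batch_size: int,
    rng: random.Random,
) -> List[T]:
    """S201: draw learning data at random, weighted by priority.

    Sampling is with replacement (an assumption of this sketch).
    """
    return rng.choices(replay, weights=priorities, k=batch_size)


def decay_priorities(priorities: Sequence[float], decay: float = 0.9) -> List[float]:
    """S204: lower every stored priority by a constant factor.

    If new experiences enter with a fixed initial priority, repeated decay
    leaves older experiences with lower priority than newer ones.
    """
    return [p * decay for p in priorities]
```

A learning round under these assumptions would call `sample_batch`, train on the result, copy the updated parameters to every derivation unit, and then call `decay_priorities` before new experiences are appended.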

上述した実施形態の制御システム１の作用効果について説明する。 The effects of the control system 1 of the above embodiment will be described.

本実施形態によれば、複数のスライスＳ_１～Ｓ_Ｎ毎に、状態値を強化学習モデルに入力することによって、スライスＳ_１～Ｓ_Ｎ毎にリソース割り当て量を制御するための行動が決定および出力され、その際に用いられた状態値及び行動とその行動に対する報酬値との組み合わせが学習データとして格納される。このとき、複数のスライスＳ_１～Ｓ_Ｎのリソース割り当ての制御で共用される強化学習モデルは、予め格納された上記の学習データを用いてトレーニングにより最適化される。これにより、制御対象のネットワークにおけるスライスの数が動的に変化した場合であっても、個々のスライスに関する制御結果を学習データとして、複数のスライスＳ_１～Ｓ_Ｎに対するリソース割り当て制御に用いられる強化学習モデルを最適化することができる。その結果、スライス数に依存しない複数のスライスＳ_１～Ｓ_Ｎへのリソース割り当てを適切に制御することができる。 According to this embodiment, a state value is input into the reinforcement learning model for each of the slices S_1 to S_N, so that an action for controlling the resource allocation amount is determined and output for each slice, and the combination of the state value and action used at that time and the reward value for that action is stored as learning data. The reinforcement learning model shared across the resource allocation control of the slices S_1 to S_N is optimized by training on this previously stored learning data. Consequently, even if the number of slices in the controlled network changes dynamically, the reinforcement learning model used for resource allocation control of the slices S_1 to S_N can be optimized using the control results of individual slices as learning data. As a result, resource allocation to the slices S_1 to S_N can be controlled appropriately, independently of the number of slices.

ここで、スライスＳ_１～Ｓ_Ｎに関する状態値は、ＮＳＲＳ値及びＲＢＵＲ値が含まれている。この場合、複数のスライスＳ_１～Ｓ_Ｎに対してリソース割り当てを制御する際に、各スライスＳ_１～Ｓ_Ｎの要件を満たすように制御することができるとともに、リソースＲＢの利用効率を向上させることができる。 Here, the state values for the slices S_1 to S_N include an NSRS value and an RBUR value. In this case, when resource allocation is controlled for the slices S_1 to S_N, the control can satisfy the requirements of each of the slices S_1 to S_N while also improving the utilization efficiency of the resource RB.

また、スライスＳ_１～Ｓ_Ｎに関する報酬値は、ＮＳＲＳ値及びＲＢＵＲ値を加味した値となっている。この場合、複数のスライスＳ_１～Ｓ_Ｎに対してリソース割り当てを制御するための学習モデルを最適化する際に、各スライスＳ_１～Ｓ_Ｎの要件を満たすように最適化することができるとともに、リソースＲＢの利用効率を向上させるように最適化することができる。 In addition, the reward value for the slices S_1 to S_N is a value that takes into account the NSRS value and the RBUR value. In this case, when the learning model for controlling resource allocation to the slices S_1 to S_N is optimized, it can be optimized both to satisfy the requirements of each of the slices S_1 to S_N and to improve the utilization efficiency of the resource RB.

また、制御装置５の機能には、複数の制御量導出部３１_１，…３１_Ｎによって出力されたＲＢ量（ＡＲＢ）を基に、複数のスライスＳ_１～Ｓ_Ｎに対する割り当てＲＢ量（ＭＲＢ）を制御する機能が含まれている。このような機能により、複数のスライスＳ_１～Ｓ_Ｎに対するリソース割り当て量を、複数の制御量導出部３１_１，…３１_Ｎによって出力された制御量を基に決定することができる。その結果、例えば、所定の判断基準を基にした優先制御等が可能となる。 The functions of the control device 5 also include a function of controlling the allocated RB amount (MRB) for the slices S_1 to S_N on the basis of the RB amounts (ARB) output by the control amount derivation units 31_1, ..., 31_N. With this function, the resource allocation amounts for the slices S_1 to S_N can be determined on the basis of the control amounts output by the control amount derivation units 31_1, ..., 31_N. As a result, priority control based on a predetermined criterion, for example, becomes possible.

具体的には、本実施形態では、制御装置５による複数のスライスＳ_１～Ｓ_Ｎに対するリソース割り当ての優先度は、直前の割り当てＲＢ量（ＭＲＢ）とＮＳＲＳ値とを基に決定されている。この場合には、多くのＲＢを必要とするスライスがＲＢを占有しないようにすることで、複数のスライスＳ_１～Ｓ_Ｎに対する円滑なリソース割り当てが可能となる。 Specifically, in this embodiment, the priority of resource allocation to the slices S_1 to S_N by the control device 5 is determined on the basis of the immediately preceding allocated RB amount (MRB) and the NSRS value. In this case, preventing slices that require many RBs from monopolizing the RBs enables smooth resource allocation to the slices S_1 to S_N.

ここで、本実施形態にかかる制御システム１によるＲＢ割り当て制御の性能をシミュレーション計算により評価した結果を、比較例と比較して示す。図９は、平均ＮＳＲＳ値を比較例と比較して示すグラフであり、図１０は、平均ＲＢＵＲ値を比較例と比較して示すグラフである。比較例１は、ネットワークスライシングを実行しないで全てのＲＢをスライス間で共有した場合の例であり、比較例２は、ＲＢをスライス間で等分割するように制御した場合の例であり、比較例３は、基地局ＢＳに到着したパケット数で重み付けをしてＲＢを分割するように制御した場合の例であり、比較例４は、深層強化学習を用いた既存手法（“R. Li et al., “Deep Reinforcement Learning for Resource Management in Network Slicing,” arXiv:1805.06591[cs], May 2018.”に記載の手法）によって制御した場合の例である。これらの評価結果により、本実施形態によれば、全てのユーザに対して様々なサービスの要件の満足度が高い制御を実現できるとともに、様々なサービスにおける過剰なＲＢの割り当てを少なくして、ＲＢの使用率を高く維持できることが分かった。 Here, the results of evaluating, by simulation, the performance of the RB allocation control by the control system 1 according to the present embodiment are shown in comparison with comparative examples. FIG. 9 is a graph showing the average NSRS value in comparison with the comparative examples, and FIG. 10 is a graph showing the average RBUR value in comparison with the comparative examples. Comparative Example 1 is a case where network slicing is not performed and all RBs are shared among the slices; Comparative Example 2 is a case where the RBs are divided equally among the slices; Comparative Example 3 is a case where the RBs are divided with weighting by the number of packets arriving at the base station BS; and Comparative Example 4 is a case where control is performed by an existing method using deep reinforcement learning (the method described in “R. Li et al., “Deep Reinforcement Learning for Resource Management in Network Slicing,” arXiv:1805.06591[cs], May 2018.”). These evaluation results show that the present embodiment achieves control with a high degree of satisfaction of the requirements of various services for all users, while reducing excessive RB allocation across those services and thereby keeping the RB usage rate high.

本発明は、上述した実施形態に限定されるものではない。上記実施形態の構成は様々変更されうる。 The present invention is not limited to the above-described embodiment. The configuration of the above embodiment may be modified in various ways.

１…制御システム、３…制御量算出装置、５…制御装置（制御部）、３１_１～３１_Ｎ…制御量導出部、３２…トレーニング部、３３…学習データ格納部、ＲＢ…リソース、Ｓ_１～Ｓ_Ｎ，Ｓ_Ｂ…スライス。 Reference signs list: 1: control system; 3: control amount calculation device; 5: control device (control unit); 31_1 to 31_N: control amount derivation units; 32: training unit; 33: learning data storage unit; RB: resource; S_1 to S_N, S_B: slices.

Claims

A control amount calculation device that calculates a control amount for controlling an allocation amount of communication resources to a plurality of slices that are virtualized networks on a communication network, comprising:
a plurality of control amount derivation units provided in correspondence with the plurality of slices, each of which acquires a state value and a reward value relating to its slice and includes a reinforcement learning model that receives the state value as input and determines and outputs an action that is the control amount for that slice;
a learning data storage unit that stores learning data that is a combination of the state value and the reward value acquired by the plurality of control amount derivation units and the action determined for that state value; and
a training unit that optimizes, by training, a parameter shared by the reinforcement learning models included in the plurality of control amount derivation units, using the learning data stored in the learning data storage unit.

The control amount calculation device according to claim 1, wherein the state value for the slice includes at least a satisfaction level for a requirement of the slice and a utilization rate of the communication resources allocated to the slice.

The control amount calculation device according to claim 1 or 2, wherein the reward value for the slice is a value that takes into account the satisfaction level for the requirement of the slice and the utilization rate of the communication resources allocated to the slice.

The control amount calculation device according to any one of claims 1 to 3, further comprising a control unit that controls the allocation amounts for the plurality of slices based on the control amounts output by the plurality of control amount derivation units.

The control amount calculation device according to claim 4, wherein the control unit determines priorities for the slices based on the immediately preceding allocation amount of the communication resources for the slices and the satisfaction level for the requirements of the slices, and executes the control for the slices in the order indicated by the priorities.

The control amount calculation device according to any one of claims 1 to 5, wherein the control amount for the slice relates to a number of resource blocks in a radio access network.

A control amount calculation method executed by a control amount calculation device that calculates a control amount for controlling an allocation amount of communication resources to a plurality of slices that are virtualized networks on a communication network, comprising:
a plurality of control amount derivation steps executed in correspondence with the plurality of slices, each of which acquires a state value and a reward value relating to its slice and determines and outputs the control amount by using a reinforcement learning model that receives the state value as input and determines and outputs an action that is the control amount for that slice;
a learning data storage step of storing learning data that is a combination of the state value and the reward value acquired in the plurality of control amount derivation steps and the action determined for that state value; and
a training step of optimizing, by training, a parameter shared by the reinforcement learning models used in the plurality of control amount derivation steps, using the learning data stored in the learning data storage step.