JP2021158591A

JP2021158591A - Control amount calculation device and control amount calculation method

Info

Publication number: JP2021158591A
Application number: JP2020058499A
Authority: JP
Inventors: 博史峰野; Hiroshi Mineno; 悠安孫子; Yu Abiko
Original assignee: Shizuoka University NUC
Current assignee: Shizuoka University NUC
Priority date: 2020-03-27
Filing date: 2020-03-27
Publication date: 2021-10-07

Abstract

To realize allocation of resources to a plurality of slices even when the number of slices changes dynamically. SOLUTION: A control amount calculation device 3 calculates a control amount for controlling the amount of resource blocks (RBs) allocated to a plurality of slices S_1 to S_N, and comprises: a plurality of control amount derivation units 31_1 to 31_N that are provided in correspondence with the respective slices S_1 to S_N, acquire state values and reward values related to the respective slices, and determine and output actions for the slices by inputting the state values into a learning model; a learning data storage unit 33 that stores learning data, each item being a combination of the state value and reward value acquired by the control amount derivation units 31_1 to 31_N and the action determined for that state value; and a training unit that optimizes the learning model shared by the control amount derivation units 31_1 to 31_N through training using the learning data stored in the learning data storage unit 33. SELECTED DRAWING: Figure 3

Description

The present invention relates to a control amount calculation device and a control amount calculation method for calculating a control amount used to control the amount of communication resources allocated to slices.

The commercialization of fifth-generation (5G) mobile communication systems enables communication networks with larger capacity, higher speed, and massive simultaneous connectivity, which in turn facilitates the provision of diverse services. One technology being considered for 5G mobile communication systems is network slicing, which provides an optimal network environment for each of these services. In this technology, the resource management mechanism is important: it allocates communication resources (hereinafter also simply called resources), such as base stations, communication paths, exchanges, routers, and server CPUs, to each slice set up in the network.

As resource management methods for allocating resources to each slice in a network, a method using a genetic algorithm (see Non-Patent Document 1 below) and a method using deep reinforcement learning (see Non-Patent Document 2 below) have been studied.

Non-Patent Document 1: B. Han et al., “Slice as an Evolutionary Service: Genetic Optimization for Inter-Slice Resource Management in 5G Networks,” IEEE Access, vol. 6, pp. 33137-33147, 2018.
Non-Patent Document 2: R. Li et al., “Deep Reinforcement Learning for Resource Management in Network Slicing,” arXiv:1805.06591 [cs], May 2018.

With the methods described in Non-Patent Documents 1 and 2, the model must be retrained whenever the number of slices to be controlled changes, which makes it difficult to handle dynamic changes in the number of slices.

The present invention has been made in view of the above problem, and its object is to provide a control amount calculation device and a control amount calculation method that can allocate resources to a plurality of slices even when the number of slices changes dynamically.

To solve the above problem, a control amount calculation device according to one aspect of the present invention calculates a control amount for controlling the amount of communication resources allocated to a plurality of slices, which are virtualized networks on a communication network. The device comprises: a plurality of control amount derivation units, provided in correspondence with the respective slices, each of which acquires a state value and a reward value related to its slice and, by inputting the state value into a reinforcement learning model, determines and outputs an action that is the control amount for the slice; a learning data storage unit that stores learning data, each item being a combination of a state value and a reward value acquired by the control amount derivation units and the action determined for that state value; and a training unit that uses the learning data stored in the learning data storage unit to optimize, by training, the reinforcement learning model shared by the control amount derivation units.

Alternatively, a control amount calculation method according to another aspect of the present invention is executed by a control amount calculation device that calculates a control amount for controlling the amount of communication resources allocated to a plurality of slices, which are virtualized networks on a communication network. The method comprises: a plurality of control amount derivation steps, executed in correspondence with the respective slices, each of which acquires a state value and a reward value related to its slice and, by inputting the state value into a reinforcement learning model, determines and outputs an action that is the control amount for the slice; a learning data storage step of storing learning data, each item being a combination of a state value and a reward value acquired in the control amount derivation steps and the action determined for that state value; and a training step of using the learning data stored in the learning data storage step to optimize, by training, the reinforcement learning model shared by the control amount derivation steps.

According to either of the above aspects, a state value is input into the reinforcement learning model for each of the plurality of slices, so that an action for controlling the resource allocation amount is determined and output for each slice, and the combination of the state value and action used at that time with the reward value for that action is stored as learning data. The reinforcement learning model shared across the resource allocation control of the plurality of slices is then optimized by training on this previously stored learning data. Consequently, even when the number of slices in the controlled network changes dynamically, the reinforcement learning model used for resource allocation control over the plurality of slices can be optimized using the control results for individual slices as learning data. As a result, resource allocation to the plurality of slices can be controlled appropriately.

Here, the state value related to a slice preferably includes at least the degree of satisfaction of the slice's requirements and the usage ratio of the communication resources allocated to the slice. In this case, resource allocation to the plurality of slices can be controlled so that the requirements of each slice are met while the efficiency of resource use is improved.

The reward value related to a slice is also preferably a value that takes into account both the degree of satisfaction of the slice's requirements and the usage ratio of the communication resources allocated to the slice. In this case, the reinforcement learning model for controlling resource allocation to the plurality of slices can be optimized so that the requirements of each slice are met and the efficiency of resource use is improved.

It is also preferable to further include a control unit that controls the allocation amounts for the plurality of slices based on the control amounts output by the plurality of control amount derivation units. The resource allocation amounts for the plurality of slices can then be determined from the control amounts output by those units, which enables, for example, priority control based on a predetermined criterion.

It is also preferable that the control unit determine a priority for each of the plurality of slices based on the immediately preceding resource allocation amount for the slice and the degree of satisfaction of the slice's requirements, and execute control of the plurality of slices in the order indicated by the priorities. This enables priority control of resource allocation based on priorities determined from the immediately preceding allocation amounts and the satisfaction of the slice requirements, which in turn enables smooth resource allocation to the plurality of slices.

It is also preferable that the control amount for a slice relate to the number of resource blocks in a radio access network. In this case, the allocation of resource blocks to the plurality of slices can be controlled appropriately.

According to the present invention, resources can be allocated to a plurality of slices even when the number of slices changes dynamically.

FIG. 1 is a diagram showing the schematic configuration of a control system according to a preferred embodiment of the present invention.
FIG. 2 is a block diagram showing the hardware configuration of the control amount calculation device of FIG. 1.
FIG. 3 is a block diagram showing the functional configuration of the control amount calculation device of FIG. 1.
FIG. 4 is a diagram showing an example of the network structure of the learning model used by the control amount derivation units 31_1 to 31_N of FIG. 3.
FIG. 5 is a diagram for explaining the overall picture of RB allocation to slices by the control amount calculation device 3 and the control device 5.
FIG. 6 is a diagram showing the change over time of the RB amount allocated to each slice through RB allocation by the control amount calculation device 3 and the control device 5.
FIG. 7 is a flowchart showing the procedure of the RB allocation process in a control amount calculation method according to a preferred embodiment of the present invention.
FIG. 8 is a flowchart showing the procedure of the learning process in the control amount calculation method according to a preferred embodiment of the present invention.
FIG. 9 is a graph showing the results of evaluating, by simulation, the performance of RB allocation control by the control system 1 according to the present embodiment.
FIG. 10 is a graph showing the results of evaluating, by simulation, the performance of RB allocation control by the control system 1 according to the present embodiment.

Preferred embodiments of the control system according to the present invention are described in detail below with reference to the drawings. In the description of the drawings, identical or corresponding parts are given the same reference numerals, and duplicate descriptions are omitted.

The control system 1 shown in FIG. 1, a preferred embodiment of the present invention, is a computer system that controls resource allocation to slices in a mobile communication system such as a fifth-generation (5G) mobile communication system. A slice is one virtualized network obtained by dividing the resources of the entire communication network into a plurality of parts.

The control system 1 controls a plurality of slices set up in the mobile communication system, and the number of these slices can change dynamically. For example, the present embodiment assumes (N+1) slices, namely slices S_1, S_2, ..., S_N (N is any natural number) and slice S_B, whose requirements, defined by throughput, delay, and reliability, differ from one another. Each of the slices S_1 to S_N and S_B is formed by dividing the resources of a radio access network (RAN) that includes a base station (BS) and the like, and is shared by a plurality of user terminals (UE: user equipment).

The control system 1 includes a control amount calculation device 3, which calculates control amounts for controlling the allocation of resource blocks, the resource in question, to slices S_1, S_2, ..., S_N, and a control device (control unit) 5, which controls the amount of resource blocks (hereinafter also called RBs) allocated to slices S_1, S_2, ..., S_N and S_B based on the control amounts calculated by the control amount calculation device 3. In the present embodiment the control amount calculation device 3 and the control device 5 are configured as separate devices, but they may be integrated into one device. The control device 5 is configured to exchange various data with the mobile communication system so that it can control that system. The control amount calculation device 3 is configured to exchange data with the control device 5 and receives various data, such as the state values and reward values described later, via the control device 5. An RB, whose allocation the control system 1 controls, is one type of RAN resource: a block obtained by dividing a two-dimensional region spanned by a frequency axis and a time axis, that is, by dividing a given frequency band in frequency and also dividing it in time.

FIG. 2 shows the hardware configuration of the computer 20 that constitutes the control amount calculation device 3. As shown in FIG. 2, the computer 20 is physically a computer or the like that includes a CPU (Central Processing Unit) 101 as a processor, a RAM (Random Access Memory) 102 or a ROM (Read Only Memory) 103 as recording media, a communication module 104, an input/output module 106, and so on, all of which are electrically connected. The computer 20 may include a display, keyboard, mouse, touch panel display, and the like as the input/output module 106, and may include a data recording device such as a hard disk drive or semiconductor memory. The computer 20 may also consist of a plurality of computers. The control device 5 has a similar hardware configuration.

FIG. 3 is a block diagram showing the functional configuration of the control amount calculation device 3. The device comprises N control amount derivation units 31_1 to 31_N, a training unit 32, and a learning data storage unit 33. The number of control amount derivation units 31_1 to 31_N matches the number of slices S_1 to S_N to be controlled. Each functional unit of the control amount calculation device 3 shown in FIG. 3 is realized by loading a program onto hardware such as the CPU 101 and the RAM 102, thereby operating the communication module 104, the input/output module 106, and the like under the control of the CPU 101, and by reading and writing data in the RAM 102. By executing this computer program, the CPU 101 of the control amount calculation device 3 makes the device function as each functional unit of FIG. 3 and sequentially executes the processing corresponding to the control amount calculation method described later. All data needed to execute this computer program, and all data generated by executing it, are stored in built-in memory such as the ROM 103 and the RAM 102, or in a storage medium such as a hard disk drive.

Each of the control amount derivation units 31_1 to 31_N determines a control amount (RB amount) for controlling the amount of RBs allocated to its corresponding slice among S_1 to S_N, and outputs that control amount to the control device 5. Specifically, each unit acquires the state value of its corresponding slice from the mobile communication system via the control device 5, inputs the state value into a common deep reinforcement learning model, uses the model to calculate action values, and selects the action with the largest action value. Each unit also acquires, via the control device 5, the reward value of its slice for the selected action. It then calculates the control amount (RB amount) from the selected action and outputs it to the control device 5. In addition, at every control amount calculation step, each unit stores the combination of the state value, the action, and the reward value for that action in the learning data storage unit 33 as learning data.

In detail, each of the control amount derivation units 31_1 to 31_N acquires the following as state values: the throughput value and the delay, which represent the requirements of slices S_1 to S_N; the NSRS (NS requirement satisfaction) value, which indicates user satisfaction with respect to the slice requirements; the RBUR (RB usage ratio) value, which indicates the usage ratio of the slice; the number of RBs actually allocated to the slice; and the numbers of arrived packets, transmitted packets, and buffered (awaiting transmission) packets on the slice at the base station BS under control. Of these state values, the throughput value, the delay, and the NSRS value describe the state of the slice with respect to its requirements; the RBUR value and the number of RBs describe the state of the slice with respect to its usage ratio; and the numbers of arrived, transmitted, and buffered packets are auxiliary state values acquired to remove ambiguity from the others.
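The state values listed above can be pictured as one fixed-length record per slice. The following is a minimal sketch, assuming Python; the class name `SliceState` and all field names are illustrative and do not appear in the source.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class SliceState:
    """Per-slice observation. All names here are hypothetical."""
    throughput: float       # throughput value of the slice
    delay: float            # delay of the slice
    nsrs: float             # NS requirement satisfaction, between 0 and 1
    rbur: float             # RB usage ratio, between 0 and 1
    allocated_rbs: int      # RBs actually allocated to the slice
    arrived_packets: int    # packets that arrived at the base station
    sent_packets: int       # packets transmitted
    buffered_packets: int   # packets waiting in the buffer

    def to_vector(self) -> list[float]:
        """Flatten the record into the input vector of the shared model."""
        return [float(value) for value in astuple(self)]
```

Because each slice contributes its own record of the same length, the input size of the shared model does not depend on how many slices exist, which fits the stated goal of handling a changing slice count without retraining from scratch.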

Among the above state values, the NSRS value is generated by the base station BS or the like collecting and aggregating data from the user terminals UE, and is defined by the following equation (1):

$$\mathrm{NSRS} = \frac{1}{ues}\sum_{u=1}^{ues} S_u \quad (1)$$

Here, S_u indicates whether the slice requirements are satisfied for user u, taking the value 0 or 1, and ues is the number of users accommodated in the slice. The NSRS value is thus the average user satisfaction, and the closer it is to 1, the better the slice requirements are being met.

The RBUR value among the above state values is generated by the base station BS or the like, and is defined by the following equation (2):

$$\mathrm{RBUR} = \frac{\mathrm{URB}}{\mathrm{MRB}} \quad (2)$$

Here, URB is the number of RBs actually used, and MRB is the number of RBs actually allocated to the slice in question. The RBUR value is the usage ratio of the slice, and the closer it is to 1, the less excess RB allocation there is.
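As a worked illustration of the two metrics, the sketch below computes NSRS as the mean of the per-user satisfaction flags and RBUR as used RBs divided by allocated RBs, following the verbal definitions in the text. The function names are hypothetical, and the handling of an empty slice or a zero allocation is an assumption that the source does not address.

```python
def nsrs(satisfied_flags):
    """Mean of the per-user flags S_u (0 or 1) over the users of one slice."""
    if not satisfied_flags:
        # Assumption: the source does not define NSRS for a slice with no users.
        raise ValueError("NSRS is undefined for a slice with no users")
    return sum(1 for flag in satisfied_flags if flag) / len(satisfied_flags)


def rbur(used_rbs, allocated_rbs):
    """Ratio of RBs actually used (URB) to RBs actually allocated (MRB)."""
    if allocated_rbs <= 0:
        # Assumption: report 0.0 when nothing was allocated.
        return 0.0
    return used_rbs / allocated_rbs
```

For example, a slice in which three of four users have their requirements met has an NSRS of 0.75, and a slice that uses 3 of its 4 allocated RBs has an RBUR of 0.75.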

As the learning model into which the state values are input, the control amount derivation units 31_1 to 31_N use the Ape-X method, which applies distributed learning to DQN (deep Q-network). That is, each control amount derivation unit applies the state value of its controlled slice to the learning model to determine an action, obtains the reward value for that action, and collects and accumulates the state value, action, and reward value as experience (learning data). The training unit 32, described later, then trains on this learning data to optimize the parameters of the learning model.

FIG. 4 shows an example of the network structure of the learning model used by the control amount derivation units 31_1 to 31_N. As shown in FIG. 4, the learning model includes: an input layer N_IN into which the state values are input; a fully connected layer N_1, a batch normalization layer N_2, a fully connected layer N_3, a batch normalization layer N_4, a fully connected layer N_5, and a batch normalization layer N_6, through which the state values from the input layer N_IN propagate in that order; two branches connected to the batch normalization layer N_6, one consisting of a fully connected layer N_7, a batch normalization layer N_8, and a fully connected layer N_9, and the other of a fully connected layer N_10, a batch normalization layer N_11, and a fully connected layer N_12; and an output layer N_OUT connected to the fully connected layers N_9 and N_12. In this learning model, inputting state values into the input layer N_IN causes the output layer N_OUT to output the action value, which is the reward expected as a result of an action.
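The layer sequence of FIG. 4 can be sketched as a small forward pass in plain Python. Several points are assumptions rather than statements from the source: the two branches are merged in the dueling style (a scalar state value plus mean-centred advantages), the hidden width is arbitrary, the weights are random placeholders, batch normalization runs in inference mode with identity statistics, and no activation functions are included because the text names none.

```python
import random


def dense(x, weights, bias):
    """Fully connected layer; weights[i][j] links input i to output j."""
    return [sum(xi * wij for xi, wij in zip(x, column)) + b
            for column, b in zip(zip(*weights), bias)]


def batch_norm(x, mean, var, eps=1e-5):
    """Inference-mode batch normalization using stored statistics."""
    return [(xi - m) / (v + eps) ** 0.5 for xi, m, v in zip(x, mean, var)]


def make_layer(n_in, n_out, rng):
    """Random dense weights plus identity batch-norm statistics (placeholders)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return {"w": weights, "b": [0.0] * n_out,
            "mean": [0.0] * n_out, "var": [1.0] * n_out}


def make_model(n_state, n_hidden, n_actions, seed=0):
    rng = random.Random(seed)
    trunk = [make_layer(n_state, n_hidden, rng),      # N_1, N_2
             make_layer(n_hidden, n_hidden, rng),     # N_3, N_4
             make_layer(n_hidden, n_hidden, rng)]     # N_5, N_6
    branch_a = [make_layer(n_hidden, n_hidden, rng),  # N_7, N_8
                make_layer(n_hidden, 1, rng)]         # N_9 (assumed scalar value)
    branch_b = [make_layer(n_hidden, n_hidden, rng),  # N_10, N_11
                make_layer(n_hidden, n_actions, rng)] # N_12 (assumed advantages)
    return {"trunk": trunk, "a": branch_a, "b": branch_b}


def dense_bn(x, layer):
    return batch_norm(dense(x, layer["w"], layer["b"]), layer["mean"], layer["var"])


def q_values(model, state):
    """Return one action value per action for a single state vector."""
    h = state
    for layer in model["trunk"]:
        h = dense_bn(h, layer)
    heads = []
    for key in ("a", "b"):
        first, last = model[key]
        heads.append(dense(dense_bn(h, first), last["w"], last["b"]))
    (value,), advantage = heads
    mean_advantage = sum(advantage) / len(advantage)
    # Assumed dueling merge at N_OUT; the source only says both branches join there.
    return [value + adv - mean_advantage for adv in advantage]
```

Calling `q_values(make_model(8, 16, 5), state)` with an 8-element state vector returns five action values, one per candidate action.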

Further, based on the action values obtained by inputting the state value into the learning model, each of the control amount derivation units 31_1 to 31_N selects the action with the largest action value (greedy method). In detail, using the learning model shown in FIG. 4, each unit obtains the action value Q(s_t, a_t, θ) for the state value s_t and action a_t at time t (θ denotes parameters such as the weights of the neural network), and determines the action a (a is an integer of 1 or more) that maximizes Q(s_t, a_t, θ). Based on the determined action a, the unit then calculates IDRB, the relative amount of RBs at time t to be allocated to the corresponding slice, by equation (3). (Equation (3) is missing from this text; according to the description, it includes the floor function.) Each unit then calculates the allocated RB amount (ARB) at time t by adding the relative amount (IDRB) to the allocated RB amount at time t-1, one step earlier, according to the following equation (4):

$$\mathrm{ARB}_t = \mathrm{ARB}_{t-1} + \mathrm{IDRB}_t \quad (4)$$

and outputs the calculated allocated RB amount to the control device 5.
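The greedy selection and the update of equation (4) can be sketched as follows. Because equation (3), which maps the action a to IDRB, is not available in this text, the sketch substitutes a hypothetical lookup table `IDRB_TABLE`; the table values and function names are illustrative assumptions only.

```python
def greedy_action(q_values):
    """Return the 1-based index of the action with the largest action value."""
    best_index = max(range(len(q_values)), key=lambda i: q_values[i])
    return best_index + 1  # the text describes a as an integer of 1 or more


# Hypothetical stand-in for equation (3): action number -> relative RB amount.
IDRB_TABLE = {1: -2, 2: -1, 3: 0, 4: 1, 5: 2}


def next_arb(prev_arb, action, idrb_table=IDRB_TABLE):
    """Equation (4): ARB_t = ARB_{t-1} + IDRB_t."""
    return prev_arb + idrb_table[action]
```

With action values `[0.1, 0.9, 0.3, 0.2, 0.0]` the greedy choice is action 2, and under the illustrative table an ARB of 6 at time t-1 becomes 5 at time t. The sketch does not clamp ARB to a valid range, since the source does not state how out-of-range requests are treated at this stage.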

The training unit 32 uses the learning data stored in the learning data storage unit 33 to update the parameter θ of the learning model shared by the control amount derivation units 31_1 to 31_N toward its optimal value (training). That is, the training unit 32 takes, from one set of learning data, the reward value at time t+1, one step later, and the state value at time t+1, and from these calculates the target value y_t that the learning model should output at time t. The reward value takes both the NSRS value and the RBUR value into account; for example, it is the product of the NSRS value and the RBUR value. The training unit 32 then updates the parameter θ so that the action value Q(s_t, a_t, θ), determined from the state value s_t at time t, approaches the target value y_t.
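A minimal sketch of the reward and target calculation follows. The reward as the product of NSRS and RBUR is the example given in the text. The target formula is an assumption: the text says only that y_t is computed from the reward and state value at time t+1, so the sketch uses the standard one-step Q-learning target with a discount factor `gamma`, which the source does not mention. Ape-X as published uses multi-step double Q-learning targets, which are not shown here.

```python
def reward(nsrs_value, rbur_value):
    """Example reward from the text: the product of NSRS and RBUR."""
    return nsrs_value * rbur_value


def td_target(next_reward, next_q_values, gamma=0.99):
    """Assumed one-step target: y_t = r_{t+1} + gamma * max_a Q(s_{t+1}, a)."""
    return next_reward + gamma * max(next_q_values)


def squared_td_error(q_taken, target):
    """Loss term that shrinks as Q(s_t, a_t) approaches the target y_t."""
    return (target - q_taken) ** 2
```

Multiplying the two metrics means the reward is high only when the slice requirements are satisfied and few allocated RBs sit idle; if either factor is near zero, the reward is too.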

Here, the training unit 32 randomly draws the learning data to be used for training from the learning data stored in the learning data storage unit 33, with the draw weighted by priority. As training proceeds, the training unit 32 also updates the priorities of the stored learning data so that older experiences receive lower priority. Further, the training unit 32 periodically copies the parameter θ updated by training to the control amount derivation units 31_1 to 31_N, thereby updating the parameter θ held in each unit. Alternatively, each control amount derivation unit may itself fetch the parameter θ updated by the training unit 32 and update its internal parameter θ.
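Priority-weighted random sampling with age-based decay can be sketched as below. The sampling rule (probability proportional to priority) and the multiplicative decay factor are assumptions; the source states only that sampling is random and based on priority, and that older experiences are given lower priority.

```python
import random


def sample_batch(experiences, priorities, batch_size, rng):
    """Draw experiences with probability proportional to their priority."""
    return rng.choices(experiences, weights=priorities, k=batch_size)


def age_priorities(priorities, factor=0.9):
    """Hypothetical aging rule: scale down every stored priority each round."""
    return [p * factor for p in priorities]
```

If `age_priorities` is applied to the stored entries before each new experience is appended, newer entries keep relatively higher priority and are drawn more often, which matches the behaviour described in the text.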

Next, the functions of the control device 5 are described.

The control device 5 has two functions: it periodically acquires state values and reward values from the mobile communication system and transfers them to the control amount calculation device 3, and it determines the RB amount (MRB) to be allocated at time t to each of the slices S_1, S_2, ..., S_N, S_B based on the allocated RB amounts (ARB) at time t output from the control amount derivation units 31_1, 31_2, ..., 31_N of the control amount calculation device 3. Specifically, for the slices S_1, S_2, ..., S_N other than the slice S_B, the control device 5 uses the product of the MRB and the NSRS value at the immediately preceding time t-1 as a priority and, proceeding in ascending order of priority, sets the MRB at time t for each slice equal to its ARB. If the ARB exceeds the number of RBs remaining, the number of remaining RBs is used as the MRB. With this priority control, slices that occupy fewer RBs are allocated first, so that slices requiring many RBs do not monopolize the RBs. For the slice S_B, the control device 5 sets the MRB to the RB amount left over after the MRBs allocated to the slices S_1, S_2, ..., S_N are totaled.
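The priority control described above can be sketched as a short function. The function name, the dictionary-based interface, the key `"S_B"` for the remainder slice, and the tie-breaking rule (input order) are assumptions made for illustration.

```python
from typing import Dict


def allocate_rbs(arb: Dict[str, int],
                 prev_mrb: Dict[str, int],
                 prev_nsrs: Dict[str, float],
                 total_rbs: int) -> Dict[str, int]:
    """Return the MRB per slice at time t, including the remainder slice 'S_B'."""
    # Priority = MRB(t-1) * NSRS(t-1); smaller values are served first.
    priority = {s: prev_mrb[s] * prev_nsrs[s] for s in arb}
    remaining = total_rbs
    mrb: Dict[str, int] = {}
    for s in sorted(arb, key=lambda name: priority[name]):
        # Grant the requested ARB, capped by what is still available.
        mrb[s] = min(arb[s], remaining)
        remaining -= mrb[s]
    # Slice S_B receives whatever is left over.
    mrb["S_B"] = remaining
    return mrb


result = allocate_rbs(
    arb={"S_1": 6, "S_2": 5, "S_3": 2},
    prev_mrb={"S_1": 5, "S_2": 1, "S_3": 2},
    prev_nsrs={"S_1": 1.0, "S_2": 0.5, "S_3": 1.0},
    total_rbs=10,
)
```

In this example the priorities are 5.0, 0.5, and 2.0, so the slices are served in the order S_2, S_3, S_1. S_1 requested 6 RBs but only 3 remain, so it receives 3 and S_B receives none.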

FIG. 5 is a diagram illustrating an overview of RB allocation to slices by the control amount calculation device 3 and the control device 5, and FIG. 6 is a diagram showing how the RB amounts allocated to the slices S_1, S_2, and S_N by this RB allocation change over time. As shown, the training unit 32 of the control amount calculation device 3 trains the learning model on the learning data accumulated in the learning data storage unit 33 as the experience of the control amount derivation units 31_1, 31_2, ..., 31_N. The parameters of the learning model optimized by this training are copied to each of the control amount derivation units 31_1, 31_2, ..., 31_N in the control amount calculation device 3, and each unit selects an action for its slice S_1, S_2, ..., S_N by inputting the state value into the common learning model. The control device 5 determines the RB amounts to be allocated to the slices S_1, S_2, ..., S_N, S_B based on the actions selected for the slices S_1, S_2, ..., S_N, and resources of those RB amounts are allocated to the slices S_1, S_2, ..., S_N, S_B. In the example shown in FIG. 6, as time passes, the slice S_1 is allocated RB amounts of "0", "3", and "0" in that order, the slice S_2 is allocated "3", "2", and "6" in that order, and the slice S_N is allocated "2", "1", and "0" in that order.

Next, the procedure of the control amount calculation method executed by the control system 1 described above will be described. FIG. 7 is a flowchart showing the procedure of the RB allocation process in the control amount calculation method, and FIG. 8 is a flowchart showing the procedure of the learning process in the control amount calculation method. The RB allocation process shown in FIG. 7 is executed repeatedly at the times t, t+1, ... of successive steps, and the learning process shown in FIG. 8 is executed repeatedly either at periodic timing or each time the number of experiences accumulated by the RB allocation process reaches a fixed number.

First, referring to FIG. 7, when the RB allocation process is started, the control amount derivation unit 31_1 of the control amount calculation device 3 acquires the state value for the slice S_1 (step S101). Next, the control amount derivation unit 31_1 inputs the acquired state value into the learning model to calculate action values, and selects the action with the largest action value (step S102). The control amount derivation unit 31_1 then determines the allocated RB amount (ARB) for the slice S_1 based on the selected action and outputs that RB amount to the control device 5 (step S103). Furthermore, the control amount derivation unit 31_1 acquires the reward value for the slice S_1 resulting from the previous RB allocation process, and stores the combination of state value, action, and reward value in the learning data storage unit 33 as learning data (step S104). The series of steps S101 to S104 is repeated by the control amount derivation units 31_2 to 31_N for the remaining slices S_2 to S_N (step S105).
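Steps S101 to S105 can be sketched as a loop over slices that all query one shared model. This is a minimal illustration under assumptions: the function and parameter names are hypothetical, and the lookup table `action_to_arb` that converts an action index into an ARB is assumed, because this passage does not specify how an action maps to an RB amount.

```python
from typing import Callable, Dict, List, Sequence, Tuple

State = Tuple[float, ...]


def derive_control_amounts(
    states: Dict[str, State],
    prev_rewards: Dict[str, float],
    q_values: Callable[[State], Sequence[float]],
    action_to_arb: Sequence[int],
    store: List[tuple],
) -> Dict[str, int]:
    """Select an action per slice with the shared model and record the experience."""
    arb: Dict[str, int] = {}
    for slice_id, state in states.items():               # S105: repeat per slice
        q = q_values(state)                              # S102: action values
        action = max(range(len(q)), key=lambda a: q[a])  # S102: greedy choice
        arb[slice_id] = action_to_arb[action]            # S103: ARB for the slice
        # S104: store (state, action, reward for the previous allocation).
        store.append((slice_id, state, action, prev_rewards[slice_id]))
    return arb


def shared_q(state: State) -> Sequence[float]:
    """Hypothetical stand-in for the shared learning model (three actions)."""
    return [state[0], state[1], 0.5]


store: List[tuple] = []
arb = derive_control_amounts(
    states={"S_1": (0.9, 0.1), "S_2": (0.2, 0.8)},
    prev_rewards={"S_1": 0.4, "S_2": 0.7},
    q_values=shared_q,
    action_to_arb=[1, 2, 3],
    store=store,
)
```

Because every slice uses the same `q_values` function, adding or removing an entry in `states` requires no change to the model, which is the property the embodiment relies on when the number of slices changes.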

Next, the control device 5 determines the priorities of the slices S_1 to S_N based on the RB amount (MRB) and the NSRS value at the immediately preceding time t-1 (step S106). The control device 5 then determines, in the order of the determined priorities, the RB amount (MRB) for each of the slices S_1 to S_N based on the allocated RB amount (ARB) for that slice, and controls the RB allocation to the slices S_1 to S_N in that priority order (step S107). Finally, the control device 5 performs control so that the remaining RBs are allocated to the slice S_B (step S108).

Next, referring to FIG. 8, when the learning process is started, the training unit 32 of the control amount calculation device 3 randomly extracts the learning data to be used for training from the learning data stored in the learning data storage unit 33, weighting the extraction by priority (step S201). Next, the training unit 32 updates the parameter θ of the neural network learning model by training with the extracted learning data (step S202). The training unit 32 then copies the updated parameter θ into each of the control amount derivation units 31_1, 31_2, ..., 31_N, thereby updating the learning model in each of them (step S203). Finally, the training unit 32 updates the priorities of the learning data stored in the learning data storage unit 33 so that learning data from older experiences has lower priority (step S204).

The operation and effects of the control system 1 of the embodiment described above will now be described.

According to the present embodiment, for each of the plurality of slices S_1 to S_N, a state value is input into the reinforcement learning model to determine and output an action for controlling the resource allocation amount of that slice, and the combination of the state value and action used at that time and the reward value for that action is stored as learning data. The reinforcement learning model shared in controlling the resource allocation of the plurality of slices S_1 to S_N is optimized by training using this previously stored learning data. As a result, even when the number of slices in the network under control changes dynamically, the reinforcement learning model used for resource allocation control of the plurality of slices S_1 to S_N can be optimized using the control results for individual slices as learning data. Resource allocation to the plurality of slices S_1 to S_N can therefore be controlled appropriately, independently of the number of slices.

Here, the state values for the slices S_1 to S_N include the NSRS value and the RBUR value. In this case, when resource allocation to the plurality of slices S_1 to S_N is controlled, the control can be performed so as to satisfy the requirements of each of the slices S_1 to S_N, and the utilization efficiency of the resource RBs can be improved.

In addition, the reward values for the slices S_1 to S_N take both the NSRS value and the RBUR value into account. In this case, when the learning model for controlling resource allocation to the plurality of slices S_1 to S_N is optimized, it can be optimized both to satisfy the requirements of each of the slices S_1 to S_N and to improve the utilization efficiency of the resource RBs.

In addition, the functions of the control device 5 include a function of controlling the RB amounts (MRB) allocated to the plurality of slices S_1 to S_N based on the RB amounts (ARB) output by the plurality of control amount derivation units 31_1, ..., 31_N. With this function, the resource allocation amounts for the plurality of slices S_1 to S_N can be determined based on the control amounts output by the plurality of control amount derivation units 31_1, ..., 31_N. As a result, for example, priority control based on a predetermined criterion becomes possible.

Specifically, in the present embodiment, the priority of resource allocation to the plurality of slices S_1 to S_N by the control device 5 is determined based on the immediately preceding allocated RB amount (MRB) and the NSRS value. In this case, slices that require many RBs are prevented from monopolizing the RBs, which enables smooth resource allocation to the plurality of slices S_1 to S_N.

Here, the results of evaluating, by simulation, the performance of the RB allocation control by the control system 1 according to the present embodiment are shown in comparison with comparative examples. FIG. 9 is a graph comparing the average NSRS value with the comparative examples, and FIG. 10 is a graph comparing the average RBUR value with the comparative examples. Comparative Example 1 is a case in which network slicing is not performed and all RBs are shared among the slices. Comparative Example 2 is a case in which the RBs are divided equally among the slices. Comparative Example 3 is a case in which the RBs are divided with weighting by the number of packets arriving at the base station BS. Comparative Example 4 is a case in which control is performed by an existing method using deep reinforcement learning (the method described in R. Li et al., "Deep Reinforcement Learning for Resource Management in Network Slicing," arXiv:1805.06591 [cs], May 2018). These evaluation results show that the present embodiment achieves control that satisfies the requirements of various services to a high degree for all users, while reducing excessive RB allocation across the services and keeping the RB usage rate high.

The present invention is not limited to the embodiment described above. The configuration of the above embodiment can be modified in various ways.

1 ... control system, 3 ... control amount calculation device, 5 ... control device (control unit), 31_1 to 31_N ... control amount derivation units, 32 ... training unit, 33 ... learning data storage unit, RB ... resource, S_1 to S_N, S_B ... slices.

Claims

A control amount calculation device that calculates a control amount for controlling the allocation amount of communication resources for a plurality of slices, which are virtualized networks on a communication network, the device comprising:
a plurality of control amount derivation units, provided so as to correspond respectively to the plurality of slices, each of which acquires a state value related to the slice and a reward value related to the slice, and determines and outputs an action, which is the control amount for the slice, by inputting the state value into a reinforcement learning model;
a learning data storage unit that stores learning data that is a combination of the state value and the reward value acquired by the plurality of control amount derivation units and the action determined in response to the state value; and
a training unit that optimizes the reinforcement learning model shared by the plurality of control amount derivation units by training using the learning data stored in the learning data storage unit.

The state value for the slice includes at least satisfaction with the slice requirements and utilization of communication resources allocated to the slice.
The control amount calculation device according to claim 1.

The reward value for the slice is a value that takes into account the satisfaction level regarding the slice requirement and the usage rate of the communication resource allocated to the slice.
The control amount calculation device according to claim 1 or 2.

Further comprising a control unit that controls the allocation amounts for the plurality of slices based on the control amounts output by the plurality of control amount derivation units,
The control amount calculation device according to any one of claims 1 to 3.

The control unit determines priorities for the plurality of slices based on the amount of the resource allocated to the plurality of slices immediately before and the satisfaction level regarding the requirements of the plurality of slices, and performs the control on the plurality of slices in the order indicated by the priorities,
The control amount calculation device according to claim 4.

The control amount for the slice relates to the number of resource blocks in the radio access network.
The control amount calculation device according to any one of claims 1 to 5.

A control amount calculation method executed by a control amount calculation device that calculates a control amount for controlling the allocation amount of communication resources for a plurality of slices, which are virtualized networks on a communication network, the method comprising:
a plurality of control amount derivation steps, executed so as to correspond respectively to the plurality of slices, each of which acquires a state value related to the slice and a reward value related to the slice, and determines and outputs an action, which is the control amount for the slice, by inputting the state value into a reinforcement learning model;
a learning data storage step of storing learning data that is a combination of the state value and the reward value acquired in the plurality of control amount derivation steps and the action determined in response to the state value; and
a training step of optimizing the reinforcement learning model shared by the plurality of control amount derivation steps by training using the learning data stored in the learning data storage step.