CN110381541B - Smart grid slice distribution method and device based on reinforcement learning - Google Patents
- Publication number: CN110381541B
- Application number: CN201910452242.7A
- Authority: CN (China)
- Prior art keywords: slice, reinforcement learning, state, smart grid
- Prior art date
- Legal status: Active (an assumption, not a legal conclusion; Google has not performed a legal analysis)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W28/00—Network traffic management; Network resource management
- H04W28/16—Central resource management; Negotiation of resources or communication parameters, e.g. negotiating bandwidth or QoS [Quality of Service]
- H04W28/24—Negotiating SLA [Service Level Agreement]; Negotiating QoS [Quality of Service]
Abstract
The invention discloses a smart grid slice distribution method based on reinforcement learning, comprising the following steps: classifying the power services of the smart grid according to service type; mapping the classes to different slices; and constructing a reinforcement learning model of the smart grid slices according to the service indexes of the smart grid, then completing the distribution of the smart grid slices through the reinforcement learning model to realize resource scheduling management of the smart grid. By classifying the service types of the smart grid, mapping the classes to different slices, and completing the distribution of the smart grid slices through the constructed reinforcement learning model of the smart grid slices, the method solves the problem of integrating the 5G network slicing technology with the smart grid on the basis of reinforcement learning.
Description
Technical Field
The application relates to the field of network resource distribution for power wireless communication, and in particular to a smart grid slice distribution method based on reinforcement learning, as well as a smart grid slice distribution device based on reinforcement learning.
Background
With the advent of the 5G era of high speed, ubiquity, low power consumption and low latency, communication across human society is becoming ever smoother. Network slicing is considered one of the key technologies of 5G networks: it divides a single physical network into multiple independent logical networks to support various vertical multi-service networks, which are allocated to different service scenarios according to their characteristics so as to adapt to different service requirements. Network slicing can greatly reduce deployment cost and network occupancy.
Driven by growing energy and power demands, the world's grids have stepped from traditional networks into the smart grid era. In combination with the new energy revolution, developments in the communication field and the global Internet strategic conception, 5G network slicing can for the first time be applied to smart grid services. For carrying wireless power grid services, 5G network slices are customizable, provide safe and reliable isolation between slices, and support unified slice management; they offer quick networking with high efficiency and economy, and have wide application prospects in power systems. Therefore, the integration of reinforcement-learning-based 5G network slicing technology with the smart grid is a problem to be solved.
Disclosure of Invention
The application provides a smart grid slice distribution method based on reinforcement learning, which solves the problem of integration of a 5G network slice technology and a smart grid based on reinforcement learning.
The application provides a smart grid slice distribution method based on reinforcement learning, which is characterized by comprising the following steps:
classifying the power services of the smart grid according to service type;
mapping the classes to different slices;
and constructing a reinforcement learning model of the smart grid slices according to the service indexes of the smart grid, then completing the distribution of the smart grid slices through the reinforcement learning model to realize resource scheduling management of the smart grid.
Preferably, classifying the power services of the smart grid according to service type includes:
classifying the power services of the smart grid into a control class, an information acquisition class and a mobile application class according to service type.
Preferably, mapping the classes to different slices includes:
the control class corresponds to the URLLC slice, the information acquisition class corresponds to the mMTC slice, and the mobile application class corresponds to the eMBB slice.
Preferably, constructing the reinforcement learning model of the smart grid specifically uses the Q-learning algorithm to build the reinforcement learning model of the smart grid.
Preferably, constructing the reinforcement learning model of the smart grid slices includes: constructing reinforcement learning models of the wireless access side and the core network side respectively.
Preferably, the constructing the reinforcement learning model of the smart grid slice includes:
defining the state space as S = {s_1, s_2, ..., s_n};
defining the action space as A = {a_1, a_2, ..., a_n};
defining the reward function as R(s, a), with P(s, s') representing the transition probability of transitioning from state s to s';
at any time, the slice controller in state s can select an action a to obtain an instant reward R_t while also transitioning to the next state s'; the process of the Q-learning algorithm can be expressed by the update equation Q(s, a) ← (1 - α)Q(s, a) + α[R_t + γ max_{a'} Q(s', a')],
where α is the learning rate and γ is the discount factor weighting the accumulation of all instant rewards R_t;
by updating the Q value for a sufficiently long duration and by adjusting the values of α and γ, it is ensured that Q(s, a) eventually converges to its value under the optimal strategy, Q*(s, a).
The application also provides a smart grid slice distribution device based on reinforcement learning, comprising:
a classifying unit, which classifies the power services of the smart grid according to service type;
a class-to-slice correspondence unit, which maps the classes to different slices;
a model construction unit, which constructs a reinforcement learning model of the smart grid slices according to the service indexes of the smart grid, and completes the distribution of the smart grid slices through the reinforcement learning model to realize resource scheduling management of the smart grid.
The application provides a smart grid slice distribution method based on reinforcement learning that classifies the service types of the smart grid, maps the classes to different slices, and completes the distribution of the smart grid slices through the constructed reinforcement learning model of the smart grid slices. This solves the problem of integrating the 5G network slicing technology with the smart grid on the basis of reinforcement learning.
Drawings
Fig. 1 is a schematic flow chart of a smart grid slice allocation method based on reinforcement learning according to an embodiment of the present application;
fig. 2 is a schematic view of a slice architecture in a smart grid scenario according to an embodiment of the present application;
fig. 3 is a schematic diagram of a relationship between three types of services of a slice and a smart grid according to an embodiment of the present application;
fig. 4 shows the QoS indexes of typical service slices of a smart grid according to an embodiment of the present application;
FIG. 5 shows the mapping of the smart grid slice resource management mechanism to RL according to an embodiment of the present application;
fig. 6 is a schematic diagram of a smart grid slice allocation device based on reinforcement learning according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be embodied in many other ways than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application; the application is therefore not limited to the specific embodiments disclosed below.
Referring to fig. 1, fig. 1 is a schematic diagram of a smart grid slice allocation method based on reinforcement learning according to an embodiment of the present application, and the method provided in the present application is described in detail below with reference to fig. 1.
Step S101, classifying the power services of the smart grid according to service type.
First, a slice architecture under a smart grid scenario on which the present application is based is described, as shown in fig. 2.
Network slicing achieves control/data-plane decoupling of the network by means of SDN technology and defines open interfaces between the planes, enabling flexible definition of network functions within a network slice. To meet the needs of a given service, a network slice contains only the network functions that support that particular service. Power services can be divided into three major classes: control classes (such as distribution automation and precise load control), information acquisition classes (such as power consumption information acquisition and power transmission line monitoring) and mobile application classes (such as intelligent inspection and mobile operation).
Step S102, mapping the classes to different slices.
Fig. 3 shows the relationship between the three general classes of slices and the three classes of smart grid services: the control class corresponds to the URLLC slice, the information acquisition class corresponds to the mMTC slice, and the mobile application class corresponds to the eMBB slice.
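The class-to-slice correspondence above can be sketched as a simple lookup; the dictionary keys are illustrative identifiers for the three service classes named in the text, not names used in the patent:

```python
# Mapping of smart grid service classes to 5G slice types, as described above.
SERVICE_CLASS_TO_SLICE = {
    "control": "URLLC",                 # e.g. distribution automation, precise load control
    "information_acquisition": "mMTC",  # e.g. power consumption info acquisition, line monitoring
    "mobile_application": "eMBB",       # e.g. intelligent inspection, mobile operation
}

def slice_for_service(service_class: str) -> str:
    """Return the 5G slice type that carries the given power service class."""
    try:
        return SERVICE_CLASS_TO_SLICE[service_class]
    except KeyError:
        raise ValueError(f"unknown service class: {service_class}")
```

A classifier front-end (Step S101) would emit one of these class labels per traffic flow, and the lookup then fixes which slice carries it.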
And step S103, constructing a reinforcement learning model of the intelligent power grid slice according to the service index of the intelligent power grid, and completing distribution of the intelligent power grid slice through the reinforcement learning model to realize resource scheduling management of the intelligent power grid.
Fig. 4 shows the QoS (Quality of Service) indexes of typical service slices of a smart grid. The present application considers a service plane, an orchestration control plane, and a data plane. The service plane divides services into elastic applications and real-time applications. Elastic applications can tolerate relatively large delays and have no minimum bandwidth requirement; specific examples are distributed power, video surveillance, user metering, and so on. Real-time applications require the network to provide a minimum level of performance guarantee; the main representative type is the URLLC slice service, with typical examples being distribution automation, emergency communication, and so on. The data plane stores the data generated by the interaction of the power devices with the physical layer.
The control plane is mainly considered in this application. An access-network SDN (software-defined network) controller and a core-network SDN controller are introduced, responsible respectively for Network Function (NF) management and coordination (such as service migration and deployment) of the access network and the core network. They are equivalent to two different agents and can communicate with each other to complete coordination work together. The slice orchestration controller of the orchestration control plane completes the division of the slice network into Radio Access Network (RAN) side slices and Core Network (CN) side slices, based on prior knowledge of the service types, channel conditions and user requirements of the service plane. The network slices on the RAN side and the CN side are managed by their respective SDN controllers, which are responsible for executing the algorithm on their network side, namely the reinforcement-learning-based smart grid slice distribution method.
The reinforcement learning model on the RAN side and the CN side proposed in the present application is described below.
(1) RAN side radio resource slice
Given a series of existing slices χ_1, χ_2, ..., χ_n, the vector χ = {χ_1, χ_2, ..., χ_n} represents the set of existing slices; the slices share an aggregate bandwidth B. There is also a series of traffic flows, represented by the vector D = {d_1, d_2, ..., d_m}; D is the set of smart grid traffic flows. Given the multi-service character of the smart grid, each slice service must meet different QoS requirements. However, which kind of smart grid traffic a given flow carries is not known in advance, and the real-time demand of the traffic is non-stationary in the smart grid context. Each d_i (i ∈ M = {1, 2, ..., m}) obeys a specific flow model.
First, the system state space, action space, and reward function of the RAN-side network need to be defined. The interaction of the slice controller with the wireless environment is represented by the tuple [S, A, P(s, s'), R(s, a)], where S represents the set of possible states, A represents the set of possible actions, P(s, s') represents the transition probability of transitioning from state s to s', and R(s, a) is the reward associated with triggering action a in state s, which is fed back to the slice controller. The mapping of wireless-access-side slice resource management to RL is as follows.
A. State space:
The state space is defined as S = {s_slice}, where s_slice is a vector representing the status of all currently available slices carrying the related power services; its n-th element describes the state of the n-th slice.
B. Action space:
Facing a time-varying, unknown traffic flow model, the reinforcement learning agent must allocate appropriate slice resources for the corresponding power services. The agent decides how to act at the next moment based on the current slice state and the reward function. The action space is defined as A = {a_bandwidth}, where a_bandwidth denotes the agent allocating an appropriate bandwidth for each logically independent slice to carry the corresponding traffic.
Since network slices share network resources between virtual networks, the virtual network slices must be isolated from each other, so that if the resources on one slice are insufficient to carry its current traffic and congestion or failure occurs, the other slices are not affected. Therefore, to maximize slice isolation while retaining the utility of resource allocation, it is stipulated that each slice can carry at most one kind of traffic, and binary indicator variables are defined accordingly.
C. Reward function
After the agent assigns a specific slice to a certain smart grid service, a composite benefit is obtained, which is taken as the reward of the system. Control-class power services have very strict requirements on communication delay and bit error rate: communication failures or errors can affect the control execution of the power grid and cause faults in grid operation. Some mobile application services (such as inspection video transmission and high-definition video playback) must guarantee a certain transmission rate and place high demands on communication bandwidth. Power supply reliability means continuous, sufficient, high-quality power supply: for example, a reliability of 99.999% ("five nines") means that the annual average outage time of power consumers in the region does not exceed 5 minutes, while at 99.9999% ("six nines") the annual average outage time drops to about 30 seconds. Because spectrum resources on the RAN side are limited, an optimal policy should be chosen when assigning slices so as to best satisfy the QoS requirements of the users.
Mainly the downlink is considered, with Spectral Efficiency (SE) and Delay as the evaluation indexes; the spectral efficiency of the system is the achievable rate per unit of bandwidth. According to the Shannon formula R = B log_2(1 + g_{BS→UE} P / σ^2), the actual base station (BS)-to-user rate can be derived, where g_{BS→UE} is the channel state (CSI) between the base station and the device, subject to Rayleigh fading.
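The Shannon rate can be evaluated directly; a minimal sketch, where the channel gain, transmit power, and noise variance values are illustrative assumptions rather than numbers from the patent:

```python
import math

def achievable_rate(bandwidth_hz: float, channel_gain: float,
                    tx_power: float, noise_var: float) -> float:
    """Shannon rate R = B * log2(1 + g_BS_UE * P / sigma^2) from base station to user."""
    snr = channel_gain * tx_power / noise_var
    return bandwidth_hz * math.log2(1.0 + snr)

# 1 MHz of slice bandwidth at an SNR of 15 (about 11.8 dB) -> 4 Mbit/s
rate = achievable_rate(1e6, channel_gain=0.15, tx_power=1.0, noise_var=0.01)
```

In a simulation, `channel_gain` would be redrawn per step from a Rayleigh-fading distribution, as the text notes for g_{BS→UE}.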
In describing the QoS requirements of a user, we introduce a utility function (utility function), i.e. a curve mapping between the bandwidth to which the slice traffic is allocated and the performance perceived by the user. In this context we assume that the traffic carried by a slice can be divided into elastic applications and real-time applications.
(a) Elastic application
For this type of application there is no minimum bandwidth requirement, as it can tolerate relatively large delays. The elastic flow utility model employs the following functions:
where k is an adjustable parameter that determines the shape of the utility function and ensures that the utility approaches 1 when the maximum requested bandwidth is received. Even when very high bandwidth is provided, it is very difficult for user satisfaction with this application type to reach 1; therefore, even when network bandwidth is plentiful, the bandwidth allocated to this application type should not exceed the maximum bandwidth b_max.
(b) Real-time application
This type of application traffic requires its network to provide a minimum level of performance guarantee. If the allocated bandwidth falls below a certain threshold, qoS will become unacceptable. Real-time applications are modeled using the following utility functions:
where k_1 and k_2 are adjustable parameters that determine the shape of the utility function.
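The patent gives only the qualitative shapes of the two utility curves (the formula images are not reproduced in this text). A common choice consistent with the description — a saturating curve for elastic traffic and a sigmoid for real-time traffic — can be sketched as follows; the exact functional forms and the roles assigned to k, k_1 and k_2 here are assumptions:

```python
import math

def elastic_utility(b: float, k: float, b_max: float) -> float:
    """Saturating utility for elastic traffic: no minimum bandwidth requirement,
    rises steeply at first and reaches 1 at the maximum requested bandwidth b_max."""
    return (1.0 - math.exp(-k * b / b_max)) / (1.0 - math.exp(-k))

def realtime_utility(b: float, k1: float, k2: float) -> float:
    """Sigmoid utility for real-time traffic: QoS collapses once the allocated
    bandwidth b falls below the threshold region around k2; k1 sets the steepness."""
    return 1.0 / (1.0 + math.exp(-k1 * (b - k2)))
```

Both functions map an allocated bandwidth to a perceived performance in [0, 1], which is the curve-mapping role the text assigns to the utility function.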
The reward of the learning agent is defined as:
R = λ·SE + μ·U_e + ξ·U_rt,
where λ, μ and ξ are the weights of SE, U_e and U_rt respectively.
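The weighted RAN-side reward R = λ·SE + μ·U_e + ξ·U_rt is a direct linear combination; a minimal sketch, with the default weight values being illustrative assumptions:

```python
def ran_reward(se: float, u_elastic: float, u_realtime: float,
               lam: float = 0.4, mu: float = 0.3, xi: float = 0.3) -> float:
    """RAN-side composite reward R = lambda*SE + mu*U_e + xi*U_rt."""
    return lam * se + mu * u_elastic + xi * u_realtime

# e.g. SE = 2.0 bit/s/Hz, U_e = 0.8, U_rt = 0.9 -> R = 0.8 + 0.24 + 0.27 = 1.31
```

Tuning the weights trades spectral efficiency against the satisfaction of the two application types.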
Thus, from a mathematical perspective, our problem can be formulated as maximizing the expected reward subject to the constraints above, where each d_i (i ∈ M = {1, 2, ..., m}) obeys a specific flow model.
The key difficulty in solving this problem is that, because of the flow model, the change in service demand is non-stationary and not known in advance; that is, the real-time service demand in the smart grid scenario is unknown.
(2) Core network slice based on priority scheduling
Similarly, if we virtualize the computational resources as per-slice VNFs, then the problem of allocating computational resources to each VNF can be solved like the radio resource slicing problem. In this section we therefore discuss another important issue: priority-based core network slicing over generic VNFs. The mapping we use differs slightly from that of the radio resource slices, illustrating the flexibility of RL. As before, the interaction of the slice controller with the core network side is represented by the four-tuple [S, A, P(s, s'), R(s, a)]; the mapping of the RL elements onto this slicing problem is defined below.
A. State space
On the core network side there are related Service Function Chains (SFCs) which have the same basic functions but consume different computational processing units (CPUs) and produce different results, such as different queuing times for the traffic. For example, based on business value or other smart-grid-related features, the traffic flows may be classified into three classes (class A, class B, class C), with priority decreasing from class A to class C, and priority-based scheduling rules defined as follows: SFC I processes class A traffic flows preferentially; SFC II treats class A and class B traffic flows equally, but serves class C traffic flows with the lowest priority; SFC III treats all traffic flows equally. Queuing time arises when traffic is scheduled according to priority.
The state space may be defined as T = {T_q}, where T_q is a vector characterizing the queuing state of each element in the service set D. When N CPUs are used to compute service d_i, the i-th element T_{q,i} represents the queuing state of service d_i, where i ∈ M = {1, 2, ..., m}.
B. Action space
The CPUs that each SFC ultimately uses depend on the number of traffic flows it has processed. With a limited number of CPUs, each type of traffic flow needs to be scheduled to an appropriate SFC so that the queuing time remains acceptable. Thus, when processing traffic d_i, the number of CPUs N_CPU must be selected on the core network side. The action space is therefore defined as A_CPU = {a_CPU}, where a_CPU denotes selecting the number of CPUs required to compute an incoming service d_i (i ∈ M = {1, 2, ..., m}).
C. Reward function
When defining the reward function, we first need a utility function U to characterize the sensitivity of the current traffic to latency, and then define a new metric, the "network request value" function W, to characterize traffic priority.
As mentioned above, in describing elastic and real-time applications we use utility functions to characterize the QoS requirements of each service d_i. In contrast to the RAN side, the argument here is changed to the number of CPUs n required to compute service d_i on the core network side. However, this only reflects the QoS requirements of the different services. Because computational resources are limited, a reasonable scheduling rule is needed after resources are allocated to decide which service is prioritized; therefore the "network request value" function W is introduced to characterize service priority. For any application service d_i, the network request value to be satisfied is defined as:
W_i = 2^p · U_i,
where p is the priority level of traffic d_i, and U_i is an element of the set formed by the elastic and real-time application utilities, i.e. U_i ∈ {U_e, U_rt}. The weight 2^p of a service request indicates the importance of the request relative to other requests. The reward function is defined as:
R = W_i.
The above reward covers only a single service d_i; for priority queuing over a series of services we need the long-term reward, so the discounted accumulation of rewards must be maximized.
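The network request value W_i = 2^p · U_i can be used to order pending services before CPU scheduling; a minimal sketch, where the numeric priority encoding (class A = 2, B = 1, C = 0) and the sample utilities are assumptions:

```python
def network_request_value(priority: int, utility: float) -> float:
    """W_i = 2**p * U_i: priority-weighted value of satisfying request i."""
    return (2 ** priority) * utility

# Pending requests as (class, priority p, utility U_i); priorities A=2, B=1, C=0.
requests = [("A", 2, 0.6), ("B", 1, 0.9), ("C", 0, 0.95)]
order = sorted(requests, key=lambda r: network_request_value(r[1], r[2]),
               reverse=True)
# W values: A -> 2.4, B -> 1.8, C -> 0.95, so class A is scheduled first.
```

Note that the exponential weight 2^p lets a high-priority request outrank a lower-priority one even when its raw utility is smaller, which is the intent of the scheduling rule.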
FIG. 5 shows the mapping of the smart grid slice resource management mechanism to RL.
Next, the reinforcement-learning-based slice allocation method proposed in the present application under the above model is described.
Q-learning-based reinforcement learning algorithms are used on both the RAN and CN sides. Since the expressions of the RAN-side and CN-side state sets, action sets and reward functions differ only slightly above, and the Q-learning algorithm based on our proposed mapping of RL onto the RAN and CN is general, for convenience of representation this section unifies the state space as S = {s_1, s_2, ..., s_n} and the action space as A = {a_1, a_2, ..., a_n}; the reward function is R(s, a), and P(s, s') represents the transition probability of transitioning from state s to s'.
The final goal of the slice controller is to find the optimal slicing strategy π*; a strategy is a mapping from the state set to the action set and must maximize the expected long-term discounted reward of each state:
the long-term discount rewards for state s is the sum of the discounts for rewards obtained on the state trajectory and is given by:
R(s,π(s))+γR(s 1 ,π(s 1 ))+γ 2 R(s 2 ,π(s 2 ))+...
where γ is the discount factor (0 < γ < 1), determining the present value of future rewards. The optimization objective above is the state value function of an arbitrary policy, which can be expressed as follows:
there is at least one optimization strategy in a single environment setting according to the optimality criteria of Bellman. Thus, the state value function of the optimal strategy is given by:
the state transition probabilities depend on many factors, such as traffic load, traffic arrival and departure rates, decision algorithms, etc., and thus may not be readily available either on the radio side or on the core network side. Model-free reinforcement learning is therefore well suited to deriving an optimal strategy because it does not require the expectation of rewards and the state transition probabilities can be known as a priori knowledge. Among the various existing RL algorithms, we choose Q -learning 。
Taking the RAN side as an example, the slice controller interacts with the wireless environment over very short discrete time periods. The action-value function (also referred to as the Q value) of a state–action pair (s, π(s)) is written Q(s, π(s)) and is defined as the expected long-term discounted reward of state s when using policy π. Our goal is to find an optimal strategy that maximizes the Q value of each state s:
According to the Q-learning algorithm, the slice controller can iteratively learn the optimal Q value from the existing information. At any time, the slice controller in state s may select an action a. This yields an instant reward R_t and, at the same time, a transition to the next state s'. The process of the Q-learning algorithm can be expressed by the update equation
Q(s, a) ← (1 - α)Q(s, a) + α[R_t + γ max_{a'} Q(s', a')],
where α is the learning rate and the bracketed target is the discounted accumulation of the instant rewards R_t.
By updating the Q value for a sufficiently long duration and by adjusting the values of α and γ, it is ensured that Q(s, a) eventually converges to its value under the optimal strategy, Q*(s, a).
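The tabular Q-learning update described above can be sketched as follows. The toy single-state environment, the two bandwidth actions, and the hyperparameter values are illustrative assumptions; the patent's actual states and actions are the slice, bandwidth and CPU definitions given earlier:

```python
import random
from collections import defaultdict

def q_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*[R + gamma*max_a' Q(s',a')]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (reward + gamma * best_next)

random.seed(0)                    # reproducible toy run
Q = defaultdict(float)            # Q values start at 0, as in the algorithm
actions = ["narrow", "wide"]      # toy bandwidth actions for a single slice

# Toy environment: allocating "wide" bandwidth always earns reward 1, "narrow"
# earns 0, and the controller stays in the single state "s0".
for _ in range(500):
    a = random.choice(actions)
    q_update(Q, "s0", a, reward=1.0 if a == "wide" else 0.0,
             s_next="s0", actions=actions)
```

After enough iterations the stored Q value for "wide" dominates, so a greedy controller would allocate the wider bandwidth, matching the convergence claim above.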
The whole slicing strategy is given by the following algorithm. Initially, the Q value is set to 0. Before the Q-learning algorithm is applied, the slice controller performs an initial slice allocation over the different slices based on the power traffic flow demand estimate for each slice; this initializes the states of the different slices. Existing radio resource slicing solutions use bandwidth-based or resource-based provisioning to allocate radio resources to different slices.
Since Q-learning is an online iterative learning algorithm, it performs two different types of operation. In explore mode, the slice controller randomly selects one possible action so as to improve its future decisions. In exploit mode, by contrast, the slice controller prefers actions that it has tried in the past and found effective. We assume that the slice controller in state s explores with probability ε and uses the previously stored Q values with probability 1 − ε. Not all actions are possible in every state: to maintain slice-to-slice isolation, the slice controller must ensure that the same physical resource blocks (PRBs) are not assigned to two different slices (on the RAN side).
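The explore/exploit behaviour just described — explore with probability ε, otherwise reuse the stored Q values — is the standard ε-greedy rule; a minimal sketch, where the `feasible` predicate stands in for constraints such as PRB isolation and is an assumption:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon, feasible=lambda s, a: True):
    """Explore with probability epsilon, otherwise exploit the stored Q values.
    `feasible` filters out actions barred by constraints such as PRB isolation."""
    candidates = [a for a in actions if feasible(state, a)]
    if random.random() < epsilon:
        return random.choice(candidates)                          # explore mode
    return max(candidates, key=lambda a: Q.get((state, a), 0.0))  # exploit mode
```

In practice ε is often decayed over time so that the controller explores widely early on and settles into exploitation as the Q table converges.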
Corresponding to the method provided above, the application also provides a reinforcement-learning-based smart grid slice distribution device 600, comprising:
the classifying unit 610 classifies the power service of the smart grid according to the service type;
a classification and slice correspondence unit 620 that corresponds the classification to different slices;
the model construction unit 630 constructs a reinforcement learning model of the smart grid slice according to the service index of the smart grid; and the distribution of the intelligent power grid slices is completed through the reinforcement learning model, so that the resource scheduling management of the intelligent power grid is realized.
The present application provides a reinforcement-learning-based smart grid slice distribution method, in which the service types of the smart grid are classified, the classifications are corresponded to different slices, and the distribution of smart grid slices is completed through a constructed reinforcement learning model of the smart grid slices. This solves the problem of integrating 5G network slicing technology with the smart grid on the basis of reinforcement learning.
The above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art may modify the specific embodiments or substitute equivalents, and any such modification or equivalent that does not depart from the spirit and scope of the present invention falls within the scope of the claims.
Claims (6)
1. A smart grid slice distribution method based on reinforcement learning, characterized by comprising the following steps:
classifying the power services of the smart grid according to service type;
corresponding the classifications to different slices; and
constructing a reinforcement learning model of the smart grid slices according to the service indexes of the smart grid, and completing the distribution of the smart grid slices through the reinforcement learning model to realize resource scheduling management of the smart grid, wherein the reinforcement learning model of the smart grid slices comprises reinforcement learning models of a radio access side and a core network side;
the reinforcement learning model on the radio access network (RAN) side comprises: given a series of existing slices χ_1, χ_2, ..., χ_n, the vector χ = {χ_1, χ_2, ..., χ_n} represents the set of existing slices, and the slices share an aggregate bandwidth B; there is a series of traffic flows, represented by the vector D = {d_1, d_2, ..., d_m}; the vector D is in fact the set of smart grid services; given the multi-service character of the smart grid, the QoS requirements to be met by each slice service differ; each traffic flow d_i in the vector D, where i ∈ M = {1, 2, ..., m}, obeys a specific flow model;
first, the system state space, action space and reward function of the RAN-side network are defined; the interaction of the slice controller with the wireless environment is represented by the four-tuple (S, A, P, R), where S represents the state set, A represents the action set, P(s, s') represents the transition probability from state s to s', and R(s, a) is the reward associated with triggering action a in state s, which is fed back to the slice controller;
the reinforcement learning model on the core network (CN) side comprises: the interaction between the slice controller and the core network side is represented by the four-tuple (S, A, P, R); the state space is defined as S = {T_q}, where T_q is a vector characterizing the queuing state of each element of the vector D; when N CPUs are used to process service d_i, the i-th element T_qi represents the queuing state of service d_i, where i ∈ M = {1, 2, ..., m};
when processing service d_i, the number of CPUs N_CPU must be selected on the core network side; the action space is therefore defined as A = {a_CPU}, where a_CPU indicates the number of CPUs selected to perform the computation for an incoming service d_i, where i ∈ M = {1, 2, ..., m};
in defining the reward function, the utility functions U_e(x) and U_rt(x) are used to characterize the traffic d_i, where U_e(x) represents the elastic-application utility model, U_rt(x) represents the real-time-application utility model, and k_1 and k_2 are adjustable parameters.
2. The method of claim 1, wherein classifying the power services of the smart grid according to service type comprises:
classifying the power services of the smart grid into a control class, an information acquisition class and a mobile application class according to service type.
3. The method of claim 1, wherein corresponding the classifications to different slices comprises:
corresponding the control class to uRLLC slices, the information acquisition class to mMTC slices, and the mobile application class to eMBB slices.
4. The method of claim 1, wherein the reinforcement learning model of the smart grid is constructed using the Q-learning algorithm.
5. The method as recited in claim 4, further comprising: constructing the reinforcement learning model of the smart grid based on the Q-learning algorithm and distributing the smart grid slices, comprising:
the state set is S;
the action set is A;
the reward function R(s, a) is the reward associated with triggering action a in state s, and P(s, s') represents the transition probability from state s to s';
at any time, the slice controller in state s can select an action a to obtain an immediate reward R_t and, at the same time, transition to the next state s'; the Q-learning process can be expressed by the following update equation:

Q(s, a) ← Q(s, a) + α[R + γ^t · max_{a'∈A(s')} Q(s', a') − Q(s, a)]

where γ represents the discount factor, t represents the time experienced from state s to s', a' represents an action in state s', A(s') represents the action space in state s', α is the learning rate, and R is the discounted accumulation of all immediate rewards R_t:

R = Σ_{t=0}^{T} γ^t · R_t

where T represents the time elapsed from t = 0 to the T-th instant; by updating the Q value over a sufficient duration and by adjusting the values of α and γ, Q(s, a) is ensured to eventually converge under the optimal strategy, the converged value being Q*(s, a).
6. A smart grid slice distribution device based on reinforcement learning, characterized by comprising:
a classifying unit, which classifies the power services of the smart grid according to service type;
a classification-to-slice correspondence unit, which corresponds the classifications to different slices; and
a model construction unit, which constructs a reinforcement learning model of the smart grid slices according to the service indexes of the smart grid and completes the distribution of the smart grid slices through the reinforcement learning model to realize resource scheduling management of the smart grid, wherein the reinforcement learning model of the smart grid slices comprises reinforcement learning models of a radio access side and a core network side;
the reinforcement learning model on the radio access network (RAN) side comprises: given a series of existing slices χ_1, χ_2, ..., χ_n, the vector χ = {χ_1, χ_2, ..., χ_n} represents the set of existing slices, and the slices share an aggregate bandwidth B; there is a series of traffic flows, represented by the vector D = {d_1, d_2, ..., d_m}; the vector D is in fact the set of smart grid services; given the multi-service character of the smart grid, the QoS requirements to be met by each slice service differ; each traffic flow d_i in the vector D, where i ∈ M = {1, 2, ..., m}, obeys a specific flow model;
first, the system state space, action space and reward function of the RAN-side network are defined; the interaction of the slice controller with the wireless environment is represented by the four-tuple (S, A, P, R), where S represents the state set, A represents the action set, P(s, s') represents the transition probability from state s to s', and R(s, a) is the reward associated with triggering action a in state s, which is fed back to the slice controller;
the reinforcement learning model on the core network (CN) side comprises: the interaction between the slice controller and the core network side is represented by the four-tuple (S, A, P, R); the state space is defined as S = {T_q}, where T_q is a vector characterizing the queuing state of each element of the vector D; when N CPUs are used to process service d_i, the i-th element T_qi represents the queuing state of service d_i, where i ∈ M = {1, 2, ..., m};
when processing service d_i, the number of CPUs N_CPU must be selected on the core network side; the action space is therefore defined as A = {a_CPU}, where a_CPU indicates the number of CPUs selected to perform the computation for an incoming service d_i, where i ∈ M = {1, 2, ..., m};
in defining the reward function, the utility functions U_e(x) and U_rt(x) are used to characterize the traffic d_i, where U_e(x) represents the elastic-application utility model, U_rt(x) represents the real-time-application utility model, and k_1 and k_2 are adjustable parameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910452242.7A CN110381541B (en) | 2019-05-28 | 2019-05-28 | Smart grid slice distribution method and device based on reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110381541A CN110381541A (en) | 2019-10-25 |
CN110381541B true CN110381541B (en) | 2023-12-26 |
Family
ID=68248856
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910452242.7A Active CN110381541B (en) | 2019-05-28 | 2019-05-28 | Smart grid slice distribution method and device based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110381541B (en) |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113255347B (en) * | 2020-02-10 | 2022-11-15 | 阿里巴巴集团控股有限公司 | Method and equipment for realizing data fusion and method for realizing identification of unmanned equipment |
CN111292570B (en) * | 2020-04-01 | 2021-09-17 | 广州爱浦路网络技术有限公司 | Cloud 5GC communication experiment teaching system and teaching method based on project type teaching |
CN111953510B (en) * | 2020-05-15 | 2024-02-02 | 中国电力科学研究院有限公司 | Smart grid slice wireless resource allocation method and system based on reinforcement learning |
CN111726811B (en) * | 2020-05-26 | 2023-11-14 | 国网浙江省电力有限公司嘉兴供电公司 | Slice resource allocation method and system for cognitive wireless network |
CN111711538B (en) * | 2020-06-08 | 2021-11-23 | 中国电力科学研究院有限公司 | Power network planning method and system based on machine learning classification algorithm |
CN112365366B (en) * | 2020-11-12 | 2023-05-16 | 广东电网有限责任公司 | Micro-grid management method and system based on intelligent 5G slice |
CN112383427B (en) * | 2020-11-12 | 2023-01-20 | 广东电网有限责任公司 | 5G network slice deployment method and system based on IOTIPS fault early warning |
CN112737813A (en) * | 2020-12-11 | 2021-04-30 | 广东电力通信科技有限公司 | Power business management method and system based on 5G network slice |
CN113316188B (en) * | 2021-05-08 | 2022-05-17 | 北京科技大学 | AI engine supporting access network intelligent slice control method and device |
CN113225759B (en) * | 2021-05-28 | 2022-04-15 | 广东电网有限责任公司广州供电局 | Network slice safety and decision management method for 5G smart power grid |
CN113329414B (en) * | 2021-06-07 | 2023-01-10 | 深圳聚创致远科技有限公司 | Smart power grid slice distribution method based on reinforcement learning |
CN113630733A (en) * | 2021-06-29 | 2021-11-09 | 广东电网有限责任公司广州供电局 | Network slice distribution method and device, computer equipment and storage medium |
CN113840333B (en) * | 2021-08-16 | 2023-11-10 | 国网河南省电力公司信息通信公司 | Power grid resource allocation method and device, electronic equipment and storage medium |
CN114531403A (en) * | 2021-11-15 | 2022-05-24 | 海盐南原电力工程有限责任公司 | Power service network distinguishing method and system |
CN115460613A (en) * | 2022-04-14 | 2022-12-09 | 国网福建省电力有限公司 | Safe application and management method for power 5G slice |
CN115913966A (en) * | 2022-12-06 | 2023-04-04 | 中国联合网络通信集团有限公司 | Virtual network function deployment method and device, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102238631A (en) * | 2011-08-17 | 2011-11-09 | 南京邮电大学 | Method for managing heterogeneous network resources based on reinforcement learning |
CN108965024A (en) * | 2018-08-01 | 2018-12-07 | 重庆邮电大学 | A kind of virtual network function dispatching method of the 5G network slice based on prediction |
CN109495907A (en) * | 2018-11-29 | 2019-03-19 | 北京邮电大学 | A kind of the wireless access network-building method and system of intention driving |
CN109600262A (en) * | 2018-12-17 | 2019-04-09 | 东南大学 | Resource self-configuring and self-organization method and device in URLLC transmission network slice |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110381541B (en) | Smart grid slice distribution method and device based on reinforcement learning | |
Abiko et al. | Flexible resource block allocation to multiple slices for radio access network slicing using deep reinforcement learning | |
CN113254197B (en) | Network resource scheduling method and system based on deep reinforcement learning | |
Sun et al. | Autonomous resource slicing for virtualized vehicular networks with D2D communications based on deep reinforcement learning | |
Qian et al. | Survey on reinforcement learning applications in communication networks | |
CN111953510B (en) | Smart grid slice wireless resource allocation method and system based on reinforcement learning | |
CN104572307B (en) | The method that a kind of pair of virtual resource carries out flexible scheduling | |
Kim et al. | Multi-agent reinforcement learning-based resource management for end-to-end network slicing | |
Dai et al. | Psaccf: Prioritized online slice admission control considering fairness in 5g/b5g networks | |
Fan et al. | Multi-objective optimization of container-based microservice scheduling in edge computing | |
Rezazadeh et al. | On the specialization of fdrl agents for scalable and distributed 6g ran slicing orchestration | |
Zhou et al. | Learning from peers: Deep transfer reinforcement learning for joint radio and cache resource allocation in 5G RAN slicing | |
Othman et al. | Efficient admission control and resource allocation mechanisms for public safety communications over 5G network slice | |
Hlophe et al. | QoS provisioning and energy saving scheme for distributed cognitive radio networks using deep learning | |
Grasso et al. | Smart zero-touch management of uav-based edge network | |
Shen et al. | Goodbye to fixed bandwidth reservation: Job scheduling with elastic bandwidth reservation in clouds | |
CN114938372B (en) | Federal learning-based micro-grid group request dynamic migration scheduling method and device | |
Zhou et al. | Digital twin-empowered network planning for multi-tier computing | |
Balasubramanian et al. | Reinforcing cloud environments via index policy for bursty workloads | |
Shokrnezhad et al. | Double deep q-learning-based path selection and service placement for latency-sensitive beyond 5g applications | |
Ren et al. | A memetic algorithm for cooperative complex task offloading in heterogeneous vehicular networks | |
Lotfi et al. | Attention-based open RAN slice management using deep reinforcement learning | |
Zhang et al. | Vehicular multi-slice optimization in 5G: Dynamic preference policy using reinforcement learning | |
Guo et al. | Delay-based packet-granular QoS provisioning for mixed traffic in industrial internet of things | |
Rashtian et al. | Balancing message criticality and timeliness in IoT networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||