CN117852621A - Module combination model-free computing offloading method and device in multi-environment MEC - Google Patents

Info

Publication number
CN117852621A
Authority
CN
China
Prior art keywords
environment
strategy
model
description information
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410017052.3A
Other languages
Chinese (zh)
Inventor
牛建伟
任涛
冯纬坤
谷宁波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202410017052.3A
Publication of CN117852621A

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/08 — Learning methods
    • G06N3/092 — Reinforcement learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides a module-combination model-free computing offloading method and device in a multi-environment MEC, and relates to the field of machine learning. In the method, a policy device acquires environment description information and current state description information of a target edge computing environment, and calls a pre-trained policy model to process the environment description information and the state description information, obtaining a task offloading policy that takes both the target edge computing environment and the state description information into account. Thus, unlike a conventional policy model that generates a task offloading policy from the current state description information alone, the method also incorporates the current environment description information, so that the generated task offloading policy accounts for both the target edge computing environment and the state within that environment.

Description

多环境MEC中模块组合型无模型计算卸载方法及装置Module combination model-free computing offloading method and device in multi-environment MEC

技术领域Technical Field

本申请涉及机器学习领域,具体而言,涉及一种多环境MEC中模块组合型无模型计算卸载方法及装置。The present application relates to the field of machine learning, and more specifically, to a module-combined model-free computing offloading method and device in a multi-environment MEC.

背景技术Background Art

用户设备(User Equipment,UE)的激增使得移动应用得到了广泛普及,例如一些计算密集型与注重计算延迟的应用:移动支付、在线游戏、智能医疗和增强现实等。用户设备大多配备有限的计算资源和能量预算,这导致用户设备的能力与应用需求之间存在较大的差距。如今在快速发展的高速无线通信技术的帮助下,移动边缘计算(Mobile EdgeComputing,MEC)作为一种有效的方法来缓解该问题,方法是将用户设备的任务卸载到附近部署有强大边缘计算资源的基站(Base Station,BS)进行计算。The surge in user equipment (UE) has led to the widespread popularity of mobile applications, such as some computationally intensive and latency-sensitive applications: mobile payments, online games, smart healthcare, and augmented reality. Most UEs are equipped with limited computing resources and energy budgets, which leads to a large gap between the capabilities of UEs and application requirements. Today, with the help of the rapid development of high-speed wireless communication technology, Mobile Edge Computing (MEC) is an effective way to alleviate this problem by offloading the tasks of UEs to nearby base stations (BS) with powerful edge computing resources.

MEC的关键问题之一是任务卸载,根据时变的MEC状态(例如任务要求、能量预算、无线条件)做出动态决策(例如卸载任务、传输功率),提高移动应用程序的计算效率。为了做出计算卸载的最佳决策,传统方法主要基于数学编程开发,这在很大程度上依赖于MEC系统模型的可靠性。当系统模型不可用或不可靠时,基于启发式搜索的方法(例如遗传算法和粒子群优化)是在不预知系统模型的情况下实现接近最优计算卸载的方法。然而,随着无线网络速度的快速发展,这些方法逐渐难以在短时间内找到高效的卸载解决方案。因此,当MEC状态和卸载决策具有高维度时,使用强化学习或更常见的深度强化学习来开发无模型计算卸载方法成为了主流。One of the key issues of MEC is task offloading, which makes dynamic decisions (e.g., offloading tasks, transmission power) based on the time-varying MEC state (e.g., task requirements, energy budget, wireless conditions) to improve the computational efficiency of mobile applications. In order to make the best decision for computational offloading, traditional methods are mainly developed based on mathematical programming, which relies heavily on the reliability of the MEC system model. When the system model is unavailable or unreliable, heuristic search-based methods (e.g., genetic algorithms and particle swarm optimization) are methods that achieve near-optimal computational offloading without knowing the system model in advance. However, with the rapid development of wireless network speeds, these methods have gradually become difficult to find efficient offloading solutions in a short period of time. Therefore, when the MEC state and offloading decisions are of high dimensionality, it has become mainstream to use reinforcement learning or more commonly deep reinforcement learning to develop model-free computational offloading methods.

实践过程中发现,目前开发的各种场景下高效的基于深度强化学习的计算卸载方法,例如多智能用户设备MEC、多基站MEC、车辆辅助MEC和卫星集成MEC等;这些方法大多数只考虑具有恒定带宽、边缘服务器(ES)容量、任务类型等单个MEC环境下的计算卸载。然而,这些方法无法适应现实场景中各种高度多样性MEC环境。In practice, it is found that currently developed efficient computation offloading methods based on deep reinforcement learning in various scenarios, such as multi-intelligent user equipment MEC, multi-base station MEC, vehicle-assisted MEC, and satellite-integrated MEC, etc.; most of these methods only consider computation offloading in a single MEC environment with constant bandwidth, edge server (ES) capacity, task type, etc. However, these methods cannot adapt to various highly diverse MEC environments in real scenarios.

发明内容Summary of the invention

为了克服现有技术中的至少一个不足,本申请提供一种多环境MEC中模块组合型无模型计算卸载方法及装置,具体包括:In order to overcome at least one of the deficiencies in the prior art, the present application provides a module combination model-free computing offloading method and device in a multi-environment MEC, specifically including:

第一方面,本申请提供一种多环境MEC中模块组合型无模型计算卸载方法,所述方法包括:In a first aspect, the present application provides a module combination model-free computing offloading method in a multi-environment MEC, the method comprising:

获取目标边缘计算环境的环境描述信息以及当前的状态描述信息;Obtain the environment description information and current status description information of the target edge computing environment;

调用预先训练的策略模型对所述环境描述信息与所述状态描述信息进行处理,得到兼顾所述目标边缘计算环境与所述状态描述信息的任务卸载策略。A pre-trained policy model is called to process the environment description information and the state description information to obtain a task offloading policy that takes into account both the target edge computing environment and the state description information.
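As a non-limiting sketch of this inference step (the model file name, calling convention and example tensor values below are assumptions for illustration, not the implementation of this application):

```python
import torch

# Hypothetical pre-trained policy model (file name and interface are assumptions).
policy = torch.load("pretrained_policy_model.pt")
policy.eval()

# Environment description of the target edge computing environment (example values only):
# task-size distribution parameter, bandwidth B, f_UE, f_ES.
env_desc = torch.tensor([[1.5, 10.0, 1.0, 8.0]])
# Current state description s_n of the target edge computing environment (example values only).
state = torch.tensor([[0.8, 10.0, 1.0, 8.0, 0.6, 0.2, 0.1]])

with torch.no_grad():
    offload_policy = policy(state, env_desc)  # task offloading decision for this period
```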

结合第一方面的可选实施方式,所述策略模型包括串联的多个策略层,每个策略层包括彼此独立的多个子模型,所述调用预先训练的策略模型对所述环境描述信息与所述状态描述信息进行处理,得到兼顾所述目标边缘计算环境与所述状态描述信息的任务卸载策略,包括:In combination with the optional implementation manner of the first aspect, the policy model includes a plurality of policy layers connected in series, each policy layer includes a plurality of sub-models independent of each other, and the calling of the pre-trained policy model processes the environment description information and the state description information to obtain a task offloading strategy that takes into account the target edge computing environment and the state description information, including:

将所述状态描述信息的状态嵌入特征输入所述多个策略层;Inputting the state embedding features of the state description information into the multiple strategy layers;

对于多个策略层中的任意相邻层,根据所述环境描述信息的环境嵌入特征以及所述状态嵌入特征对上一策略层中每个子模型输出的特征进行筛选,确定出入下一策略层中每个子模型的输入特征;For any adjacent layers in the multiple strategy layers, the features output by each sub-model in the previous strategy layer are screened according to the environment embedding features of the environment description information and the state embedding features to determine the input features of each sub-model in the next strategy layer;

根据所述多个策略层的输出结果,确定兼顾所述目标边缘计算环境与所述状态描述信息的任务卸载策略。According to the output results of the multiple policy layers, a task offloading strategy that takes into account both the target edge computing environment and the state description information is determined.

结合第一方面的可选实施方式,所述根据所述环境描述信息的环境嵌入特征以及所述状态嵌入特征对上一策略层中每个子模型的输出特征进行筛选,确定出下一策略层中每个子模型的输入特征,包括:In combination with an optional implementation manner of the first aspect, the output features of each sub-model in the previous strategy layer are screened according to the environment embedding features of the environment description information and the state embedding features to determine the input features of each sub-model in the next strategy layer, including:

根据所述环境描述信息的环境嵌入特征以及所述状态嵌入特征生成所述下一策略层中每个子模型的权重向量;Generate a weight vector for each sub-model in the next strategy layer according to the environment embedding feature of the environment description information and the state embedding feature;

分别根据所述下一策略层中每个子模型的权重向量对所述上一策略层中的每个子模型的输出特征进行加权,得到所述下一策略层中每个子模型的输入特征。The output features of each sub-model in the previous strategy layer are weighted according to the weight vector of each sub-model in the next strategy layer to obtain the input features of each sub-model in the next strategy layer.

结合第一方面的可选实施方式,所述策略模型还包括串联的多个权重层,所述多个权重层与所述多个策略层中的部分一一对应;所述根据所述环境描述信息的环境嵌入特征以及所述状态嵌入特征生成所述下一策略层中每个子模型的权重向量,包括:In combination with an optional implementation manner of the first aspect, the strategy model further includes a plurality of weight layers connected in series, and the plurality of weight layers correspond one-to-one to parts of the plurality of strategy layers; generating a weight vector for each sub-model in the next strategy layer according to the environment embedding feature of the environment description information and the state embedding feature, comprises:

获取所述状态嵌入特征与环境嵌入特征之间的融合特征;Acquire a fusion feature between the state embedding feature and the environment embedding feature;

将所述融合特征输入所述多个权重层;Inputting the fused features into the multiple weight layers;

对于所述多个权重层中的任意相邻层,将上一权重层输出的权重向量与所述融合特征相乘后输入下一层权重层,得到与下一权重层相对应策略层的权重向量。For any adjacent layers among the multiple weight layers, a weight vector output by a previous weight layer is multiplied by the fusion feature and then input into a next weight layer to obtain a weight vector of a strategy layer corresponding to the next weight layer.

结合第一方面的可选实施方式,所述获取所述状态嵌入特征与环境嵌入特征之间的融合特征,包括:In combination with the optional implementation manner of the first aspect, the acquiring a fusion feature between the state embedding feature and the environment embedding feature includes:

将所述状态嵌入特征与环境嵌入特征逐元素相乘,得到所述融合特征。The state embedding feature and the environment embedding feature are multiplied element by element to obtain the fusion feature.

结合第一方面的可选实施方式,所述策略模型还包括第一编码器与第二编码器,所述方法还包括:In combination with the optional implementation manner of the first aspect, the strategy model further includes a first encoder and a second encoder, and the method further includes:

通过所述第一编码器对所述状态描述信息进行处理,得到所述状态描述信息的状态嵌入特征;Processing the state description information by the first encoder to obtain a state embedding feature of the state description information;

通过所述第二编码器对所述环境描述信息进行处理,得到所述环境描述信息的环境嵌入特征。The environment description information is processed by the second encoder to obtain environment embedding features of the environment description information.

结合第一方面的可选实施方式,所述方法还包括所述策略模型的训练方法,所述训练方法包括:In combination with the optional implementation manner of the first aspect, the method further includes a training method for the policy model, and the training method includes:

获取多个待训练策略模型以及每个所述待训练策略模型的待训练评价模型,其中,所述多个待训练策略模型分别对应有不同的边缘计算环境;Acquire multiple strategy models to be trained and an evaluation model to be trained for each of the strategy models to be trained, wherein the multiple strategy models to be trained correspond to different edge computing environments respectively;

对于每个待训练策略模型,通过所述待训练策略模型与对应的边缘计算环境进行交互,得到针对所述边缘计算环境当前状态的任务卸载经验,并缓存至经验池中;For each strategy model to be trained, the strategy model to be trained interacts with the corresponding edge computing environment to obtain task offloading experience for the current state of the edge computing environment, and caches it in the experience pool;

待所述经验池收集的任务卸载经验满足预设条件后,从所述经验池采样出采样经验,并根据所述采样经验对所述多个待训练策略模型以及各自对应的待训练评价模型进行更新;After the task offloading experience collected by the experience pool meets a preset condition, sampling experience is sampled from the experience pool, and the multiple strategy models to be trained and their corresponding evaluation models to be trained are updated according to the sampling experience;

若所述多个待训练策略模型以及各自对应的待训练评价模型未达到收敛条件,则返回至对于每个待训练策略模型,通过所述待训练策略模型与对应的边缘计算环境进行交互,得到针对所述边缘计算环境当前状态的任务卸载经验,直至满足所述收敛条件后,将本次迭代后的待训练策略模型作为所述预先训练的策略模型。If the multiple strategy models to be trained and their corresponding evaluation models to be trained do not meet the convergence conditions, then return to each strategy model to be trained, interact with the corresponding edge computing environment through the strategy model to be trained, and obtain the task offloading experience for the current state of the edge computing environment, until the convergence conditions are met, and use the strategy model to be trained after this iteration as the pre-trained strategy model.
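A schematic sketch of this training procedure is given below; the Actor/Critic/environment interfaces (act, update, converged, step, descriptor) are assumed names used only to illustrate the loop of interaction, experience pooling, sampling and updating described above:

```python
import random

def train(actors, critics, envs, buffer, batch_size=256, min_buffer=1000, max_iters=10000):
    """Schematic loop: each policy model to be trained explores its own MEC environment,
    experiences are cached in a shared experience pool, and all actor/critic pairs are
    updated from sampled experience until the convergence condition is met."""
    for _ in range(max_iters):
        # 1. Interaction: one step per (policy model, edge computing environment) pair.
        for actor, env in zip(actors, envs):
            state = env.current_state()
            action = actor.act(state, env.descriptor())            # uses υ_env as well as s_n
            next_state, reward = env.step(action)
            buffer.append((env.descriptor(), state, action, reward, next_state))

        # 2. Update: once enough experience is cached, sample and update every model.
        if len(buffer) >= min_buffer:
            batch = random.sample(buffer, batch_size)
            for actor, critic in zip(actors, critics):
                critic.update(batch)
                actor.update(batch, critic)

        # 3. Convergence check over all policy and evaluation models.
        if all(a.converged() for a in actors) and all(c.converged() for c in critics):
            break
    return actors
```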

结合第一方面的可选实施方式,根据所述采样经验对所述多个待训练策略模型以及各自对应的待训练评价模型进行更新,包括:In combination with the optional implementation manner of the first aspect, the multiple strategy models to be trained and their corresponding evaluation models to be trained are updated according to the sampling experience, including:

根据所述采样经验分别对每个所述待训练策略模型中的多个策略层以及多个权重层进行交替更新,以及对每个所述待训练策略模型对应的待训练评价模型进行更新。According to the sampling experience, multiple strategy layers and multiple weight layers in each of the strategy models to be trained are updated alternately, and the evaluation model to be trained corresponding to each of the strategy models to be trained is updated.

结合第一方面的可选实施方式,所述任务卸载经验包括所述待训练策略模型对应边缘计算环境的环境描述信息。In combination with an optional implementation manner of the first aspect, the task offloading experience includes environment description information of the edge computing environment corresponding to the strategy model to be trained.

第二方面，本申请还提供一种多环境MEC中模块组合型无模型计算卸载装置，所述装置包括：In a second aspect, the present application also provides a module-combined model-free computing offloading device in a multi-environment MEC, the device comprising:

信息获取模块,用于获取目标边缘计算环境的环境描述信息以及当前的状态描述信息;An information acquisition module is used to obtain the environment description information and current status description information of the target edge computing environment;

策略生成模块,用于调用预先训练的策略模型对所述环境描述信息与所述状态描述信息进行处理,得到兼顾所述目标边缘计算环境与所述状态描述信息的任务卸载策略。The strategy generation module is used to call a pre-trained strategy model to process the environment description information and the state description information to obtain a task offloading strategy that takes into account both the target edge computing environment and the state description information.

结合第二方面的可选实施方式,所述策略模型包括串联的多个策略层,每个策略层包括彼此独立的多个子模型,所述策略生成模块还具体用于:In conjunction with the optional implementation manner of the second aspect, the strategy model includes a plurality of strategy layers connected in series, each strategy layer includes a plurality of sub-models independent of each other, and the strategy generation module is further specifically used for:

将所述状态描述信息的状态嵌入特征输入所述多个策略层;Inputting the state embedding features of the state description information into the multiple strategy layers;

对于多个策略层中的任意相邻层,根据所述环境描述信息的环境嵌入特征以及所述状态嵌入特征对上一策略层中每个子模型输出的特征进行筛选,确定出入下一策略层中每个子模型的输入特征;For any adjacent layers in the multiple strategy layers, the features output by each sub-model in the previous strategy layer are screened according to the environment embedding features of the environment description information and the state embedding features to determine the input features of each sub-model in the next strategy layer;

根据所述多个策略层的输出结果,确定兼顾所述目标边缘计算环境与所述状态描述信息的任务卸载策略。According to the output results of the multiple policy layers, a task offloading strategy that takes into account both the target edge computing environment and the state description information is determined.

结合第二方面的可选实施方式,所述策略生成模块还具体用于:In conjunction with the optional implementation manner of the second aspect, the strategy generation module is further specifically configured to:

根据所述环境描述信息的环境嵌入特征以及所述状态嵌入特征生成所述下一策略层中每个子模型的权重向量;Generate a weight vector for each sub-model in the next strategy layer according to the environment embedding feature of the environment description information and the state embedding feature;

分别根据所述下一策略层中每个子模型的权重向量对所述上一策略层中的每个子模型的输出特征进行加权,得到所述下一策略层中每个子模型的输入特征。The output features of each sub-model in the previous strategy layer are weighted according to the weight vector of each sub-model in the next strategy layer to obtain the input features of each sub-model in the next strategy layer.

结合第二方面的可选实施方式,所述策略模型还包括串联的多个权重层,所述多个权重层与所述多个策略层中的部分一一对应;所述策略生成模块还具体用于:In conjunction with the optional implementation manner of the second aspect, the strategy model further includes a plurality of weight layers connected in series, and the plurality of weight layers correspond one-to-one to parts of the plurality of strategy layers; and the strategy generation module is further specifically configured to:

获取所述状态嵌入特征与环境嵌入特征之间的融合特征;Acquire a fusion feature between the state embedding feature and the environment embedding feature;

将所述融合特征输入所述多个权重层;Inputting the fused features into the multiple weight layers;

对于所述多个权重层中的任意相邻层,将上一权重层输出的权重向量与所述融合特征相乘后输入下一层权重层,得到与下一权重层相对应策略层的权重向量。For any adjacent layers among the multiple weight layers, a weight vector output by a previous weight layer is multiplied by the fusion feature and then input into a next weight layer to obtain a weight vector of a strategy layer corresponding to the next weight layer.

结合第二方面的可选实施方式,所述策略生成模块还具体用于:In conjunction with the optional implementation manner of the second aspect, the strategy generation module is further specifically configured to:

将所述状态嵌入特征与环境嵌入特征逐元素相乘,得到所述融合特征。The state embedding feature and the environment embedding feature are multiplied element by element to obtain the fusion feature.

结合第二方面的可选实施方式,所述策略模型还包括第一编码器与第二编码器,所述策略生成模块还用于:In conjunction with the optional implementation manner of the second aspect, the strategy model further includes a first encoder and a second encoder, and the strategy generation module is further used to:

通过所述第一编码器对所述状态描述信息进行处理,得到所述状态描述信息的状态嵌入特征;Processing the state description information by the first encoder to obtain a state embedding feature of the state description information;

通过所述第二编码器对所述环境描述信息进行处理,得到所述环境描述信息的环境嵌入特征。The environment description information is processed by the second encoder to obtain environment embedding features of the environment description information.

结合第二方面的可选实施方式,所述方法还包括所述策略模型的训练方法,所述装置还包括:In combination with the optional implementation manner of the second aspect, the method further includes a training method for the policy model, and the device further includes:

经验收集模块，用于获取多个待训练策略模型以及每个所述待训练策略模型的待训练评价模型，其中，所述多个待训练策略模型分别对应有不同的边缘计算环境；An experience collection module is used to obtain multiple strategy models to be trained and an evaluation model to be trained for each strategy model to be trained, wherein the multiple strategy models to be trained correspond to different edge computing environments respectively;

所述经验收集模块，还用于对于每个待训练策略模型，通过所述待训练策略模型与对应的边缘计算环境进行交互，得到针对所述边缘计算环境当前状态的任务卸载经验，并缓存至经验池中；The experience collection module is also used, for each strategy model to be trained, to interact with the corresponding edge computing environment through the strategy model to be trained, obtain task offloading experience for the current state of the edge computing environment, and cache it in the experience pool;

模型更新模块,用于待所述经验池收集的任务卸载经验满足预设条件后,从所述经验池采样出采样经验,并根据所述采样经验对所述多个待训练策略模型以及各自对应的待训练评价模型进行更新;A model updating module, configured to sample sampled experience from the experience pool after the task offloading experience collected by the experience pool meets a preset condition, and update the multiple strategy models to be trained and their corresponding evaluation models to be trained according to the sampled experience;

所述模型更新模块，还用于若所述多个待训练策略模型以及各自对应的待训练评价模型未达到收敛条件，则返回至对于每个待训练策略模型，通过所述待训练策略模型与对应的边缘计算环境进行交互，得到针对所述边缘计算环境当前状态的任务卸载经验，直至满足所述收敛条件后，将本次迭代后的待训练策略模型作为所述预先训练的策略模型。The model updating module is also used, if the multiple strategy models to be trained and their corresponding evaluation models to be trained do not meet the convergence condition, to return to the step of interacting, for each strategy model to be trained, with the corresponding edge computing environment through the strategy model to be trained to obtain task offloading experience for the current state of the edge computing environment, until the convergence condition is met, and then to use the strategy model to be trained after this iteration as the pre-trained strategy model.

结合第二方面的可选实施方式,所述模型更新模块还具体用于:In conjunction with the optional implementation manner of the second aspect, the model updating module is further specifically configured to:

根据所述采样经验分别对每个所述待训练策略模型中的多个策略层以及多个权重层进行交替更新,以及对每个所述待训练策略模型对应的待训练评价模型进行更新。According to the sampling experience, multiple strategy layers and multiple weight layers in each of the strategy models to be trained are updated alternately, and the evaluation model to be trained corresponding to each of the strategy models to be trained is updated.

结合第二方面的可选实施方式,所述任务卸载经验包括所述待训练策略模型对应边缘计算环境的环境描述信息。In combination with an optional implementation of the second aspect, the task offloading experience includes environment description information of the edge computing environment corresponding to the strategy model to be trained.

相对于现有技术而言,本申请具有以下有益效果:Compared with the prior art, this application has the following beneficial effects:

本实施例提供一种多环境MEC中模块组合型无模型计算卸载方法及装置。该方法中,策略设备获取目标边缘计算环境的环境描述信息以及当前的状态描述信息;调用预先训练的策略模型对所述环境描述信息与所述状态描述信息进行处理,得到兼顾所述目标边缘计算环境与所述状态描述信息的任务卸载策略。如此,相较于常规策略模型仅依据当前的状态描述信息生成任务卸载策略,本申请还结合了当前的环境描述信息,使得生成的任务卸载策略兼顾了目标边缘计算环境与该环境中的状态。The present embodiment provides a module-combined model-free computing offloading method and device in a multi-environment MEC. In this method, the policy device obtains the environment description information of the target edge computing environment and the current state description information; calls a pre-trained policy model to process the environment description information and the state description information, and obtains a task offloading strategy that takes into account both the target edge computing environment and the state description information. In this way, compared to the conventional policy model that only generates a task offloading strategy based on the current state description information, the present application also combines the current environment description information, so that the generated task offloading strategy takes into account both the target edge computing environment and the state in the environment.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

为了更清楚地说明本申请实施例的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,应当理解,以下附图仅示出了本申请的某些实施例,因此不应被看作是对范围的限定,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他相关的附图。In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for use in the embodiments will be briefly introduced below. It should be understood that the following drawings only show certain embodiments of the present application and therefore should not be regarded as limiting the scope. For ordinary technicians in this field, other related drawings can be obtained based on these drawings without paying creative work.

图1为本申请实施例提供的场景示意图;FIG1 is a schematic diagram of a scenario provided by an embodiment of the present application;

图2为本申请实施例提供的方法流程示意图;FIG2 is a schematic diagram of a method flow chart provided in an embodiment of the present application;

图3为本申请实施例提供的模型结构示意图;FIG3 is a schematic diagram of a model structure provided in an embodiment of the present application;

图4为本申请实施例提供的训练原理示意图;FIG4 is a schematic diagram of the training principle provided in an embodiment of the present application;

图5为本申请实施例提供的装置结构示意图;FIG5 is a schematic diagram of the structure of a device provided in an embodiment of the present application;

图6为本申请实施例提供的电子设备结构示意图。FIG. 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.

图标:101-基站;102-边缘服务器;103-用户设备;201-信息获取模块;202-策略生成模块;301-存储器;302-处理器;303-通信单元;304-系统总线。Icon: 101 - base station; 102 - edge server; 103 - user equipment; 201 - information acquisition module; 202 - policy generation module; 301 - memory; 302 - processor; 303 - communication unit; 304 - system bus.

具体实施方式DETAILED DESCRIPTION

为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。通常在此处附图中描述和示出的本申请实施例的组件可以以各种不同的配置来布置和设计。In order to make the purpose, technical solution and advantages of the embodiments of the present application clearer, the technical solution in the embodiments of the present application will be clearly and completely described below in conjunction with the drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, rather than all the embodiments. The components of the embodiments of the present application described and shown in the drawings here can be arranged and designed in various different configurations.

因此,以下对在附图中提供的本申请的实施例的详细描述并非旨在限制要求保护的本申请的范围,而是仅仅表示本申请的选定实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。Therefore, the following detailed description of the embodiments of the present application provided in the accompanying drawings is not intended to limit the scope of the present application for which protection is sought, but merely represents selected embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by ordinary technicians in the field without creative work are within the scope of protection of the present application.

应注意到:相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步定义和解释。It should be noted that similar reference numerals and letters denote similar items in the following drawings, and therefore, once an item is defined in one drawing, further definition and explanation thereof is not required in subsequent drawings.

在本申请的描述中,术语“第一”、“第二”、“第三”等仅用于区分描述,而不能理解为指示或暗示相对重要性。此外,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。In the description of the present application, the terms "first", "second", "third", etc. are only used to distinguish the description and cannot be understood as indicating or implying relative importance. In addition, the terms "comprise", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements, but also includes other elements not explicitly listed, or also includes elements inherent to such process, method, article or device. In the absence of further restrictions, an element defined by the sentence "comprises a ..." does not exclude the presence of other identical elements in the process, method, article or device including the element.

基于以上声明,正如背景技术中所介绍的,目前开发的各种场景下高效的基于深度强化学习的计算卸载方法,例如多智能用户设备MEC、多基站MEC、车辆辅助MEC和卫星集成MEC等;这些方法大多数只考虑具有恒定带宽、边缘服务器(Edge Server,ES)容量、任务类型等单个MEC环境下的计算卸载。然而,这些方法无法适应现实场景中各种高度多样性MEC环境。具体表现为以下两方面的问题:Based on the above statement, as introduced in the background technology, efficient deep reinforcement learning-based computation offloading methods are currently developed in various scenarios, such as multi-intelligent user device MEC, multi-base station MEC, vehicle-assisted MEC, and satellite-integrated MEC; most of these methods only consider computation offloading in a single MEC environment with constant bandwidth, edge server (ES) capacity, task type, etc. However, these methods cannot adapt to various highly diverse MEC environments in real scenarios. Specifically, there are two problems:

(1)经验探索和学习效率低下,因为即使只在单个MEC环境中学习基于深度强化学习的卸载策略也可能需要大量经验。(1) Inefficient empirical exploration and learning, as learning a deep reinforcement learning-based offloading policy even in a single MEC environment may require a large amount of experience.

(2)训练基于深度强化学习的卸载策略中的经验学习干扰,其中来自不同MEC环境的探索体验的梯度可能相互矛盾,导致计算卸载的性能显着下降。(2) Experience learning interference in training deep reinforcement learning-based offloading strategies, where the gradients of exploration experiences from different MEC environments may contradict each other, resulting in a significant performance degradation of computation offloading.

基于上述技术问题的发现,发明人经过创造性劳动提出下述技术方案以解决或者改善上述问题。需要注意的是,以上现有技术中的方案所存在的缺陷,是发明人在经过实践并仔细研究后得出的结果,因此,上述问题的发现过程以及下文中本申请实施例针对上述问题所提出的解决方案,都应该是发明人在发明创造过程中对本申请做出的贡献,而不应当理解为本领域技术人员所公知的技术内容。Based on the discovery of the above technical problems, the inventors have proposed the following technical solutions to solve or improve the above problems through creative work. It should be noted that the defects in the solutions in the above prior art are the results obtained by the inventors after practice and careful research. Therefore, the discovery process of the above problems and the solutions proposed in the embodiments of the present application for the above problems below should all be the contributions made by the inventors to the present application in the process of invention and creation, and should not be understood as technical contents known to those skilled in the art.

鉴于本实施例涉及到机器学习领域中的强化学习,为使得接下介绍的方案更易于理解,下面先对可能涉及到的专业概念进行解释。Since this embodiment involves reinforcement learning in the field of machine learning, in order to make the solution introduced next easier to understand, the professional concepts that may be involved are explained below.

强化学习,是一种机器学习方法,旨在让智能系统与环境交互来学习如何做出最优的决策。在强化学习中,智能系统被称为“代理”,通过观察环境的状态和奖励信号,采取一系列的行动,以最大化未来的累积奖励。可以理解为,强化学习的核心思想是代理通过试错过程来学习。即代理开始时对环境一无所知,通过与环境交互并观察结果,逐渐学习到哪些动作能够获得更高的奖励。代理使用一个叫作“策略”的函数来决定在给定状态下应该采取哪个行动。策略可以是确定性的,也可以是概率性的。综上所述,强化学习的核心组成部分包括:Reinforcement learning is a machine learning method that aims to enable intelligent systems to interact with the environment to learn how to make optimal decisions. In reinforcement learning, the intelligent system is called an "agent", which observes the state of the environment and the reward signal and takes a series of actions to maximize the future cumulative rewards. It can be understood that the core idea of reinforcement learning is that the agent learns through a trial and error process. That is, the agent knows nothing about the environment at the beginning, and gradually learns which actions can obtain higher rewards by interacting with the environment and observing the results. The agent uses a function called a "policy" to decide which action should be taken in a given state. The strategy can be deterministic or probabilistic. In summary, the core components of reinforcement learning include:

环境(Environment):代理与其交互的外部环境,它可以是真实的物理环境,也可以是虚拟的模拟环境。Environment: The external environment with which the agent interacts, which can be a real physical environment or a virtual simulated environment.

状态(State):环境的当前观测值,用于描述环境的特征。State: The current observation of the environment, used to describe the characteristics of the environment.

行动(Action):代理在给定状态下采取的动作。Action: The action that the agent takes in a given state.

奖励(Reward):代理根据其行动从环境中获得的信号,用于评估行动的好坏。Reward: The signal that the agent gets from the environment based on its actions, used to evaluate how good the actions were.

策略(Policy):代理根据当前状态选择行动的方式,可以是确定性的映射或概率性的分布。Policy: The way the agent chooses actions based on the current state, which can be a deterministic mapping or a probabilistic distribution.

值函数(Value Function):衡量代理在某个状态或状态-行动对上的长期价值的函数。它用来评估策略的好坏。Value Function: A function that measures the long-term value of an agent in a state or state-action pair. It is used to evaluate the quality of a policy.

模型(Model):环境的内部模型,用于预测环境的状态转移和奖励信号。Model: An internal model of the environment that is used to predict the environment's state transitions and reward signals.

强化学习的目标是通过优化策略或值函数来使代理在与环境交互的过程中获得最大的累积奖励。强化学习的发展历程经历了值函数方法、策略梯度方法、Actor-Critic架构、深度强化学习,下面对上述这些方法进行示例性介绍:The goal of reinforcement learning is to optimize the strategy or value function so that the agent can obtain the maximum cumulative reward in the process of interacting with the environment. The development of reinforcement learning has gone through the value function method, policy gradient method, actor-critic architecture, and deep reinforcement learning. The following is an exemplary introduction to these methods:

(1)值函数方法,早期的强化学习最为主要的方法。最为经典算法是Q-learning,它通过迭代地更新状态-动作对的值函数来学习最优策略。值函数方法关注于评估动作的价值,并通过选择具有最高价值的动作来改进决策。(1) Value function method, the most important method in early reinforcement learning. The most classic algorithm is Q-learning, which learns the optimal strategy by iteratively updating the value function of the state-action pair. The value function method focuses on evaluating the value of actions and improving decisions by selecting the action with the highest value.
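For reference, the classic Q-learning update mentioned here can be written as

$$Q(s_n, a_n) \leftarrow Q(s_n, a_n) + \alpha\Big[R(s_n, a_n) + \gamma \max_{a'} Q(s_{n+1}, a') - Q(s_n, a_n)\Big],$$

where α is the learning rate and γ the discount factor.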

(2)策略梯度方法,在值函数方法的基础上,本领域技术人员开始关注如何直接优化策略。与值函数方法不同的是,策略梯度方法直接学习策略的参数化表示,而不是通过学习值函数来间接地推断策略。策略梯度方法的核心思想是通过梯度下降法来更新策略的参数,使得策略能够在环境中产生高回报的行为。具体而言,策略梯度方法通过以下步骤进行训练:(2) Policy gradient method: Based on the value function method, technicians in this field began to focus on how to directly optimize the policy. Unlike the value function method, the policy gradient method directly learns the parameterized representation of the policy, rather than indirectly inferring the policy by learning the value function. The core idea of the policy gradient method is to update the parameters of the policy through the gradient descent method so that the policy can produce high-reward behaviors in the environment. Specifically, the policy gradient method is trained through the following steps:

定义策略:首先,选择一个参数化的策略函数,例如神经网络。策略函数接收环境状态作为输入,并输出在每个可能动作上选择的概率分布。Define the policy: First, choose a parameterized policy function, such as a neural network. The policy function receives the state of the environment as input and outputs a probability distribution over each possible action.

收集样本:使用当前策略与环境进行交互,生成一系列的状态、动作和奖励的轨迹。通常采用蒙特卡洛方法来进行采样,即通过多次与环境交互来收集样本。Collect samples: Use the current strategy to interact with the environment to generate a series of states, actions, and reward trajectories. Monte Carlo methods are usually used for sampling, that is, to collect samples by interacting with the environment multiple times.

计算梯度:对于每个样本轨迹,计算其对应的梯度。梯度的计算使用了重要性采样技术,以根据轨迹中选择的动作和策略中的动作概率来调整梯度。Compute gradients: For each sample trajectory, calculate its corresponding gradient. The gradient calculation uses importance sampling techniques to adjust the gradient based on the actions selected in the trajectory and the action probabilities in the policy.

更新策略:将所有样本轨迹的梯度进行平均,并使用梯度下降法来更新策略的参数。目标是最大化奖励的期望值。Update strategy: Average the gradients of all sample trajectories and use gradient descent to update the parameters of the strategy. The goal is to maximize the expected value of the reward.

迭代训练:重复执行步骤2至步骤4,通过与环境的交互和参数更新来逐渐改进策略,直到达到预定的停止条件(如收敛或达到最大迭代次数)。Iterative training: Repeat steps 2 to 4 to gradually improve the strategy through interaction with the environment and parameter updates until a predetermined stopping condition is reached (such as convergence or reaching the maximum number of iterations).
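To illustrate the five steps above, a minimal REINFORCE-style sketch is given below (a generic member of the policy gradient family, not the specific method of this application); the environment is assumed to follow the Gymnasium reset/step convention:

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Parameterized policy: maps a state to a probability distribution over actions."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

def reinforce_update(policy, optimizer, env, gamma=0.99):
    """Collect one trajectory with the current policy and apply one policy-gradient step."""
    state, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        dist = policy(torch.as_tensor(state, dtype=torch.float32))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted return for each step of the trajectory.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.as_tensor(returns)

    # Gradient ascent on expected return = gradient descent on the negative objective.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

For example, the optimizer could be torch.optim.Adam(policy.parameters(), lr=1e-3), and env any environment exposing the assumed reset/step interface.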

(3)Actor-Critic架构,在值函数方法与策略梯度方法基础上,本领域技术人员提出了Actor-Critic架构的强化学习方法,该方法结合了值函数方法和策略梯度方法的优点。Actor-Critic架构中将代理分为两个部分:一个是演员(Actor),负责生成动作,另一个是评论家(Critic),负责评估动作的价值。演员根据策略生成动作,评论家则通过估计值函数来提供反馈信号。在Actor-Critic架构中,演员使用梯度上升法来更新策略参数,以最大化长期累积奖励。评论家使用值函数来评估动作的价值,并通过值函数的更新来提供梯度信号给演员。这种架构允许代理在学习过程中同时估计值函数和优化策略,从而更有效地进行决策改进。(3) Actor-Critic architecture. Based on the value function method and the policy gradient method, technicians in this field have proposed a reinforcement learning method based on the actor-critic architecture, which combines the advantages of the value function method and the policy gradient method. In the actor-critic architecture, the agent is divided into two parts: one is the actor, which is responsible for generating actions, and the other is the critic, which is responsible for evaluating the value of actions. The actor generates actions according to the strategy, and the critic provides feedback signals by estimating the value function. In the actor-critic architecture, the actor uses the gradient ascent method to update the policy parameters to maximize the long-term cumulative reward. The critic uses the value function to evaluate the value of the action and provides gradient signals to the actor by updating the value function. This architecture allows the agent to estimate the value function and optimize the strategy simultaneously during the learning process, thereby making decision improvements more effectively.
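In equation form, a typical one-step actor-critic update (a generic example; the exact update used in this application may differ) is

$$\delta_n = R(s_n, a_n) + \gamma V_\phi(s_{n+1}) - V_\phi(s_n),\qquad \phi \leftarrow \phi + \alpha_c\,\delta_n\,\nabla_\phi V_\phi(s_n),\qquad \theta \leftarrow \theta + \alpha_a\,\delta_n\,\nabla_\theta \log \pi_\theta(a_n \mid s_n),$$

where the critic V_φ supplies the temporal-difference error δ_n that guides the update of the actor's policy parameters θ.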

(4)深度强化学习(Deep Reinforcement Learning,DRL)，随着深度学习方法的兴起，强化学习也开始与深度神经网络相结合，形成了深度强化学习。深度强化学习使用深度神经网络来近似值函数或策略函数，从而能够处理高维、复杂的状态空间和动作空间。典型的算法包括深度Q网络(DQN)和深度确定性策略梯度(DDPG)等。(4) Deep Reinforcement Learning (DRL). With the rise of deep learning methods, reinforcement learning has begun to be combined with deep neural networks, forming deep reinforcement learning. Deep reinforcement learning uses deep neural networks to approximate value functions or policy functions, so that it can handle high-dimensional, complex state spaces and action spaces. Typical algorithms include the Deep Q-Network (DQN) and Deep Deterministic Policy Gradient (DDPG).

基于上述实施例中对强化学习的介绍，本实施例主要涉及对现有深度强化学习方法进行改进，在具体介绍本实施例的改进方式之前，下面进一步对本实施例涉及到的边缘计算环境进行介绍。如图1所示，该边缘计算环境包括基站101，该基站101与一个计算频率为fES的边缘服务器102高度关联，还与U个计算频率为fUE的用户设备103高度关联。其中，该用户设备103可以是，但不限于，智能手机、笔记本电脑、智能手表等。用户设备103可以将计算任务卸载到边缘服务器102进行计算，或者在本地进行计算。对此，可以由策略设备基于预设约束条件为每个用户设备103制定在本地执行或者卸载到边缘服务器102执行的任务卸载策略。Based on the introduction of reinforcement learning in the above embodiment, this embodiment mainly involves improving existing deep reinforcement learning methods. Before specifically introducing the improvement of this embodiment, the edge computing environment involved in this embodiment is further introduced below. As shown in Figure 1, the edge computing environment includes a base station 101, which is closely associated with an edge server 102 having computing frequency fES and with U user devices 103 each having computing frequency fUE. The user device 103 may be, but is not limited to, a smart phone, a laptop, a smart watch, etc. The user device 103 may offload its computing task to the edge server 102 for computing, or perform the computing locally. In this regard, a policy device may formulate, for each user device 103 and based on preset constraints, a task offloading policy specifying whether a task is executed locally or offloaded to the edge server 102 for execution.

假定系统的通信带宽为BMHz，系统时间分成N个等长时间段，每个时间段的时长为s。将每个用户设备103表示为u，其在第n时间段的开始产生一个计算任务，表示为：Assume that the communication bandwidth of the system is B MHz, the system time is divided into N equal-length time periods, and the length of each time period is s. Each user device 103 is denoted as u and generates a computing task at the beginning of the n-th time period, expressed as:

式中，第一项表示该计算任务的数据大小，c表示完成该计算任务所需的CPU周期数，τ表示该计算任务的最大可容忍延迟。In the formula, the first term denotes the data size of the computing task, c denotes the number of CPU cycles required to complete the computing task, and τ denotes the maximum tolerable delay of the computing task.
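A plausible rendering of this task definition, with the task symbol λ and data-size symbol d assumed here purely for illustration, is

$$\lambda_u^n = \big(d_u^n,\; c,\; \tau\big),$$

where d_u^n is the data size, c the required CPU cycles and τ the maximum tolerable delay of the task.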

此外，本实施例还规定计算任务是不可分割的，并提供一个二元变量来表示计算任务是在用户设备u本地执行还是卸载到边缘服务器进行边缘计算。并且，用户设备u在第n个时间段的可用能量预算被单独记录，eUE表示在每个时间段进行本地计算或者将计算任务无线传输时的能量消耗。用户设备u和边缘服务器都部署了任务队列以防止任务丢失，相应的变量分别表示在第n个时间段开始时的队列大小。In addition, this embodiment specifies that a computing task is indivisible and introduces a binary variable to indicate whether the computing task is executed locally on user device u or offloaded to the edge server for edge computing. Moreover, the available energy budget of user device u in the n-th time period is recorded separately, and eUE denotes the energy consumed in each time period for local computation or for wirelessly transmitting the computing task. Both user device u and the edge server deploy task queues to prevent task loss, and the corresponding variables denote their queue sizes at the beginning of the n-th time period.

(1)基于上述边缘计算环境,本实施例提供了该边缘计算环境的如下通讯模型:(1) Based on the above edge computing environment, this embodiment provides the following communication model of the edge computing environment:

本实施例使用的块衰落模型表示第n个时间段从用户设备u到基站的信道增益,计算公式为:The block fading model used in this embodiment represents the channel gain from user equipment u to the base station in the nth time period, and the calculation formula is:

式中，|·|表示取模运算，小尺度衰落分量可以根据Jakes模型建模为一阶高斯马尔科夫过程，大尺度衰落分量包括路径损失和对数正态阴影。根据LTE标准，大尺度衰落建模为：In the formula, |·| denotes the modulus operation; the small-scale fading component can be modeled as a first-order Gauss-Markov process according to the Jakes model, and the large-scale fading component includes path loss and log-normal shadowing. According to the LTE standard, the large-scale fading is modeled as:

式中，第一项表示第n个时间段用户设备u到基站的距离，z表示服从对数正态分布的随机变量。In the formula, the first term denotes the distance from user device u to the base station in the n-th time period, and z denotes a log-normally distributed random variable.

进一步的，本实施例将第n个时间段从用户设备u到基站的分流链路的信号与干扰加噪声比的计算公式表示为：Furthermore, in this embodiment, the signal-to-interference-plus-noise ratio (SINR) of the offloading link from user device u to the base station in the n-th time period is computed as:

式中，第一项表示用户设备u在第n个时间段的无线发射功率，其不应超过最大发射功率PUE，另一项表示加性高斯白噪声。In the formula, the first term denotes the wireless transmit power of user device u in the n-th time period, which must not exceed the maximum transmit power PUE, and the noise term denotes additive white Gaussian noise.

在第n个时间段，从用户设备u到基站的传输速率的计算公式为：In the n-th time period, the transmission rate from user device u to the base station is computed as:
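These quantities are commonly modeled as follows; the sketch below is an assumption consistent with the surrounding description rather than the exact formulas of this application:

$$h_u^n = \big|\tilde h_u^n\big|^2\, \beta_u^n,\qquad \beta_u^n\,[\mathrm{dB}] = -\big(128.1 + 37.6\log_{10} d_u^n + z\big),\qquad \Gamma_u^n = \frac{p_u^n\, h_u^n}{\sigma^2},\qquad r_u^n = \frac{B}{U}\log_2\!\big(1 + \Gamma_u^n\big),$$

where \(\tilde h_u^n\) is the small-scale (Jakes) fading term, β_u^n the large-scale term with distance d_u^n in kilometres, p_u^n the transmit power, σ² the noise power, and B/U an assumed per-UE bandwidth share.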

(2)基于上述通讯模型,本实施例还提供了该边缘计算环境的边缘计算模型:(2) Based on the above communication model, this embodiment also provides an edge computing model of the edge computing environment:

假定上述定义的二元变量取值为卸载，则计算任务会从用户设备u卸载到边缘服务器。卸载到边缘服务器的任务延迟时长包括传输时长、边缘服务器的排队时长以及执行时长，各自的计算公式为：Assume that the binary variable defined above indicates offloading; the computing task is then offloaded from user device u to the edge server. The delay of a task offloaded to the edge server comprises the transmission time, the queuing time at the edge server, and the execution time, each computed as follows:

式中，排队项表示位于该计算任务之前等待边缘服务器执行的任务总量，其计算公式为：In the formula, the queuing term denotes the total amount of tasks queued ahead of the current computing task and awaiting execution at the edge server, computed as:

式中，该项包括了第n个时间段开始时的队列大小以及在该计算任务之前完成卸载的任务数据大小。对于上述表达式中的指示项，若条件成立，则其取值为1，反之取值为0。This term includes the queue size at the beginning of the n-th time period and the data size of the tasks offloaded before the current computing task. For the indicator in the above expression, its value is 1 if the condition holds and 0 otherwise.

基于上述卸载到边缘服务器的传输时长、排队时长以及执行时长，边缘计算的任务延迟的计算公式为：Based on the above transmission time, queuing time and execution time of offloading to the edge server, the task delay of edge computing is computed as:

相应的，用户设备u的能量消耗为：Correspondingly, the energy consumption of user device u is:
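A hedged sketch of these edge-offloading quantities, using assumed symbols consistent with the description above (not necessarily the exact formulas of this application):

$$t_{u,\mathrm{tx}}^n = \frac{d_u^n}{r_u^n},\qquad t_{u,\mathrm{exe}}^n = \frac{c}{f^{ES}},\qquad t_{u,\mathrm{off}}^n = t_{u,\mathrm{tx}}^n + t_{u,\mathrm{que}}^n + t_{u,\mathrm{exe}}^n,\qquad e_{u,\mathrm{off}}^n = p_u^n\, t_{u,\mathrm{tx}}^n,$$

i.e. the transmission time is the data size over the rate, the execution time is the required cycles over the edge-server frequency, and the UE's energy is the transmit power times the transmission time.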

(3)基于上述通讯模型以及边缘计算模型,本实施还提供了该边缘计算环境的如下本地计算模型:(3) Based on the above communication model and edge computing model, this implementation also provides the following local computing model of the edge computing environment:

假定上述二元变量取值为本地执行，则计算任务会在用户终端u本地运行。同理，本地计算的任务延迟包括了排队时长和执行时长，各自的计算公式为：Assume instead that the binary variable indicates local execution; the computing task then runs locally on user terminal u. Similarly, the task delay of local computing comprises the queuing time and the execution time, each computed as follows:

结合上述本地的排队时长以及执行时长,本地计算的任务延迟计算公式的表达式为:Combined with the above local queue time and execution time, the local calculation task delay calculation formula is expressed as:

相应的，用户设备u的能量消耗为：Correspondingly, the energy consumption of user device u is:

式中，ξ为取决于用户设备u芯片架构的能量效率系数。In the formula, ξ is an energy-efficiency coefficient determined by the chip architecture of user device u.

综合上述计算模型，在第n个时间段中，计算任务的任务延迟与用户设备u的能量消耗的计算公式为：Combining the above models, in the n-th time period the task delay of the computing task and the energy consumption of user device u are computed as:
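A hedged sketch of the local-computing quantities and of the combined delay/energy, with the convention x_u^n = 1 for offloading assumed here for illustration:

$$t_{u,\mathrm{loc}}^n = t_{u,\mathrm{que,loc}}^n + \frac{c}{f^{UE}},\qquad e_{u,\mathrm{loc}}^n = \xi\,(f^{UE})^2\, c,$$

$$t_u^n = x_u^n\, t_{u,\mathrm{off}}^n + (1 - x_u^n)\, t_{u,\mathrm{loc}}^n,\qquad e_u^n = x_u^n\, e_{u,\mathrm{off}}^n + (1 - x_u^n)\, e_{u,\mathrm{loc}}^n.$$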

综上,本实施例涉及到的变量可以分为时不变的外部变量与时变的内部变量。其中,时不变的外部变量包括环境描述信息υenvIn summary, the variables involved in this embodiment can be divided into time-invariant external variables and time-variant internal variables. Among them, the time-invariant external variables include environment description information υ env :

υenv = {Ptask, B, fUE, fES}

式中，Ptask表示任务输入数据大小的概率分布。本实施例将包括V个不同边缘计算环境的集合用一个集合符号进行表示，并用下标对不同的边缘计算环境进行区分。Where Ptask denotes the probability distribution of the task input data size. This embodiment denotes the set of V different edge computing environments by a set symbol and uses subscripts to distinguish the different edge computing environments.

时变的内部变量的集合用一个集合符号进行表示，其中的每个元素与时间相关，并在各时间段内进行传递，公式如下：The set of time-varying internal variables is represented by a set symbol; each of its elements is time-dependent and is carried across the time periods, as follows:

(4)基于上述通讯模型、边缘计算模型以及本地计算模型，下面将本实施例中边缘计算环境的任务卸载转化为以下问题：(4) Based on the above communication model, edge computing model and local computing model, the task offloading of the edge computing environment in this embodiment is formulated as the following problem:

在不失一般性的情况下,将计算卸载问题的目标定义为最小化多个时间段上的平均任务延迟tn和UE的能量消耗en的加权成本。需要说明的是,由于本实施例涉及到不同的边缘任务计算场景,此处为了便于表述,此处省略了tn与en的上标υenv,即对于每个边缘计算环境而言,平均任务延迟与能量消耗可以表示为:Without loss of generality, the goal of the computation offloading problem is defined as minimizing the weighted cost of the average task delay tn and the energy consumption e n of the UE over multiple time periods. It should be noted that since this embodiment involves different edge task computing scenarios, the superscripts υ env of tn and e n are omitted here for ease of description, that is, for each edge computing environment, the average task delay and energy consumption can be expressed as:

将用户设备u在第n个时间段的任务卸载策略作为本实施例中的优化变量，则对于每个边缘计算环境υenv，其优化问题可以表示为：Taking the task offloading decision of user device u in the n-th time period as the optimization variable of this embodiment, for each edge computing environment υenv the optimization problem can be expressed as:

P1: P1:

C1: C1:

C2: C2:

C3: C3:

C4: C4:

因此，对于多个边缘计算环境，该优化问题被转换为：Therefore, for multiple edge computing environments, the optimization problem is transformed into:

P2: P2:

对于多个边缘计算环境的优化问题，其同样受到上述约束条件C1~C4的约束。The optimization problem over multiple edge computing environments is likewise subject to constraints C1 to C4 above.
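One plausible reading of P1/P2 and constraints C1-C4, assuming a weighted delay-energy cost and the limits stated earlier (binary offloading decision, maximum transmit power, maximum tolerable delay, per-period energy budget); this is an illustrative sketch, not necessarily the exact formulation of this application:

$$\mathrm{P1:}\;\min_{\{x_u^n,\,p_u^n\}} \frac{1}{N}\sum_{n=1}^{N}\big(\omega_t\, t_n + \omega_e\, e_n\big)
\quad\text{s.t.}\quad
\mathrm{C1:}\, x_u^n \in \{0,1\},\;
\mathrm{C2:}\, 0 \le p_u^n \le P^{UE},\;
\mathrm{C3:}\, t_u^n \le \tau,\;
\mathrm{C4:}\, e_u^n \le \hat e_u^n,$$

$$\mathrm{P2:}\;\min \frac{1}{V}\sum_{\upsilon_{env}\in\mathcal{V}} \frac{1}{N}\sum_{n=1}^{N}\big(\omega_t\, t_n^{\upsilon_{env}} + \omega_e\, e_n^{\upsilon_{env}}\big),$$

where ω_t and ω_e are weighting coefficients, t_n and e_n the per-period averages over the U user devices, and \(\hat e_u^n\) the available energy budget.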

需要说明的是，当前任务卸载策略严格取决于之前的任务卸载策略产生的内部变量(n′<n)。因此，可以将上述问题P1转化为MDP(马尔可夫决策过程)问题。对此，本实施例定义一个五元组<S,A,Pr,R,γ>用于表示MDP。五元组中的S表示状态空间，每个时间段包括状态空间中的一个状态sn∈S，其定义为：It should be noted that the current task offloading decision depends strictly on the internal variables produced by previous task offloading decisions (n′<n). Therefore, problem P1 above can be transformed into an MDP (Markov decision process) problem. To this end, this embodiment defines a five-tuple <S, A, Pr, R, γ> to represent the MDP. S in the five-tuple denotes the state space; each time period has a state sn∈S, defined as:

五元组中的A表示动作空间,每个时间段包括一个动作an∈A,其定义为:A in the quintuple represents the action space, and each time segment includes an action a n ∈ A, which is defined as:

五元组中的Pr表示非先验的状态转移概率,所谓状态转移概率用于确定执行an时从sn到sn+1的转移概率Pr(sn+1|sn,an)。Pr in the quintuple represents a non-a priori state transition probability, which is used to determine the transition probability Pr(s n+1 |s n ,a n ) from sn to sn +1 when executing a n .

五元组中的R表示即时奖励函数，用于根据在sn下采取an对优化目标和约束条件的影响，在每个时间段中生成即时奖励R(sn,an)。R in the five-tuple denotes the immediate reward function, which generates an immediate reward R(sn,an) in each time period according to the effect of taking an in sn on the optimization objective and constraints.

五元组中的γ表示折扣因子,取值范围为γ∈(0,1],用于确定对未来奖励的影响。The γ in the quintuple represents the discount factor, which ranges from γ∈(0,1] and is used to determine the impact on future rewards.

应理解的是,在MDP问题下,对于上述单个环境中的P1问题可以转化为以下P3问题:It should be understood that under the MDP problem, the P1 problem in the above single environment can be transformed into the following P3 problem:

在P3问题中，其目标是获得一个策略π，使得单个边缘计算环境υenv中的预期长期回报最大化。同理，对于上述多个边缘计算环境，P2问题可以转化为以下P4问题：In problem P3, the goal is to obtain a policy π that maximizes the expected long-term return in a single edge computing environment υenv. Similarly, for the multiple edge computing environments above, problem P2 can be transformed into the following problem P4:

在P4问题中，其目标是获得一个策略π，使得多个边缘计算环境中的预期长期回报最大化，而非仅针对单个环境。In problem P4, the goal is to obtain a policy π that maximizes the expected long-term return across the multiple edge computing environments, rather than in a single environment alone.
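In standard notation, the objectives of P3 and P4 can be sketched as follows (the uniform averaging over environments in P4 is an assumption for illustration):

$$\mathrm{P3:}\;\max_{\pi}\; J_{\upsilon_{env}}(\pi) = \mathbb{E}_{\pi}\Big[\sum_{n=1}^{N}\gamma^{\,n-1} R(s_n, a_n)\,\Big|\,\upsilon_{env}\Big],\qquad
\mathrm{P4:}\;\max_{\pi}\; \frac{1}{V}\sum_{\upsilon_{env}\in\mathcal{V}} J_{\upsilon_{env}}(\pi).$$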

基于上述待优化问题,为了能够适应不同的边缘计算环境,本实施例同样基于DRL的Actor-Critic框架,并提出一种包括多层级、多子模型且能够随意组合的待训练策略模型,将其作为Actor-Critic框架中的Actor。并以此构建包括多个待训练策略模型以及各自对应的待训练评价模型的训练框架,进行强化学习训练。其中,在训练过程中,多个待训练策略模型在训练期间分别与不同的边缘计算环境进行交互,从而学习到不同边缘计算环境中的任务卸载经验,使得训练出的策略模型能够适应多种边缘计算环境。Based on the above-mentioned problem to be optimized, in order to adapt to different edge computing environments, this embodiment is also based on the Actor-Critic framework of DRL, and proposes a strategy model to be trained that includes multiple levels, multiple sub-models and can be arbitrarily combined, and uses it as the Actor in the Actor-Critic framework. A training framework including multiple strategy models to be trained and their corresponding evaluation models to be trained is constructed to perform reinforcement learning training. In the training process, multiple strategy models to be trained interact with different edge computing environments respectively during training, so as to learn the task offloading experience in different edge computing environments, so that the trained strategy model can adapt to a variety of edge computing environments.

本实施例中,将待训练策略模型训练出的结果称为策略模型,基于训练得到的策略模型,本实施例提供一种多环境MEC中模块组合型无模型计算卸载方法。该方法中,策略设备获取目标边缘计算环境的环境描述信息以及当前的状态描述信息;调用预先训练的策略模型对环境描述信息与状态描述信息进行处理,得到兼顾目标边缘计算环境与状态描述信息的任务卸载策略。如此,相较于常规策略模型仅依据当前的状态描述信息生成任务卸载策略,本申请还结合了当前的环境描述信息,使得生成的任务卸载策略兼顾了目标边缘计算环境与该环境中的状态。In this embodiment, the result of training the policy model to be trained is called a policy model. Based on the policy model obtained through training, this embodiment provides a module-combined model-free computing unloading method in a multi-environment MEC. In this method, the policy device obtains the environment description information of the target edge computing environment and the current state description information; calls the pre-trained policy model to process the environment description information and the state description information, and obtains a task unloading strategy that takes into account the target edge computing environment and the state description information. In this way, compared to the conventional policy model that only generates a task unloading strategy based on the current state description information, the present application also combines the current environment description information, so that the generated task unloading strategy takes into account the target edge computing environment and the state in the environment.

其中,实施该多环境MEC中模块组合型无模型计算卸载方法的策略设备,可以是,但不限于,移动终端、平板计算机、膝上型计算机、台式计算机以及服务器等。一些实施方式中,该服务器可以是单个服务器,也可以是服务器组。服务器组可以是集中式的,也可以是分布式的(例如,服务器可以是分布式系统)。在一些实施例中,服务器相对于用户终端,可以是本地的、也可以是远程的。在一些实施例中,服务器可以在云平台上实现;仅作为示例,云平台可以包括私有云、公有云、混合云、社区云(Community Cloud)、分布式云、跨云(Inter-Cloud)、多云(Multi-Cloud)等,或者它们的任意组合。在一些实施例中,服务器可以在具有一个或多个组件的电子设备上实现。Among them, the policy device for implementing the module combination model-free computing offloading method in the multi-environment MEC can be, but is not limited to, a mobile terminal, a tablet computer, a laptop computer, a desktop computer, and a server, etc. In some implementations, the server can be a single server or a server group. The server group can be centralized or distributed (for example, the server can be a distributed system). In some embodiments, the server can be local or remote relative to the user terminal. In some embodiments, the server can be implemented on a cloud platform; as an example only, the cloud platform can include a private cloud, a public cloud, a hybrid cloud, a community cloud (Community Cloud), a distributed cloud, an inter-cloud (Inter-Cloud), a multi-cloud (Multi-Cloud), etc., or any combination thereof. In some embodiments, the server can be implemented on an electronic device having one or more components.

为使本实施例提供的方案更加清楚,下面结合图2对该方法的各个步骤进行详细阐述。但应该理解,流程图的操作可以不按顺序实现,没有逻辑的上下文关系的步骤可以反转顺序或者同时实施。此外,本领域技术人员在本申请内容的指引下,可以向流程图添加一个或多个其他操作,也可以从流程图中移除一个或多个操作。如图2所示,该方法包括:To make the solution provided by this embodiment clearer, the various steps of the method are described in detail below in conjunction with Figure 2. However, it should be understood that the operations of the flowchart can be implemented in a non-sequential manner, and steps without logical contextual relationships can be reversed in order or implemented simultaneously. In addition, those skilled in the art can add one or more other operations to the flowchart under the guidance of the content of this application, or remove one or more operations from the flowchart. As shown in Figure 2, the method includes:

SA101,获取目标边缘计算环境的环境描述信息以及当前的状态描述信息。SA101, obtain the environment description information and current status description information of the target edge computing environment.

其中，该环境描述信息包括目标边缘计算环境中多个用户设备对应任务输入数据大小的概率分布、通信带宽、边缘服务器的计算频率、用户设备的计算频率。状态描述信息则包括每个用户设备的计算任务、带宽、用户设备的计算频率、边缘服务器的计算频率、当前剩余的能量、用户设备中的队列长度、边缘服务器中的队列长度。The environment description information includes the probability distribution of the task input data size of the multiple user devices in the target edge computing environment, the communication bandwidth, the computing frequency of the edge server, and the computing frequency of the user devices. The state description information includes the computing task of each user device, the bandwidth, the computing frequency of the user device, the computing frequency of the edge server, the current remaining energy, the queue length in the user device, and the queue length in the edge server.
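A minimal sketch of these two inputs as data structures; the field names below are assumptions used only to make the contents concrete:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EnvDescriptor:
    """Time-invariant environment description υ_env (field names assumed)."""
    task_size_distribution: List[float]  # parameters of P_task
    bandwidth_mhz: float                 # communication bandwidth B
    f_ue_ghz: float                      # UE computing frequency f_UE
    f_es_ghz: float                      # ES computing frequency f_ES

@dataclass
class StateDescriptor:
    """Time-varying state description s_n for one user device (field names assumed)."""
    task_size: float          # data size of the current computing task
    task_cycles: float        # CPU cycles required by the task
    bandwidth_mhz: float
    f_ue_ghz: float
    f_es_ghz: float
    remaining_energy: float   # currently remaining energy budget
    ue_queue_len: float       # queue length in the user device
    es_queue_len: float       # queue length in the edge server
```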

SA102: invoke a pre-trained policy model to process the environment description information and the state description information, obtaining a task offloading policy that accounts for both the target edge computing environment and the state description information.

The policy model comprises multiple policy layers connected in series, and each policy layer comprises multiple sub-models that are independent of one another. In other words, by dynamically managing how the sub-models of the policy layers are arranged and combined, this embodiment obtains sub-policy models that can adapt to different edge computing environments. The way the sub-models of the policy layers are combined is described below. In an optional implementation, step SA102 may include:

SA102-1: input the state embedding feature of the state description information into the multiple policy layers.

In an optional implementation, the policy model includes a first encoder and a second encoder. The policy device processes the state description information with the first encoder to obtain the state embedding feature of the state description information, and processes the environment description information with the second encoder to obtain the environment embedding feature of the environment description information.

Exemplarily, the policy model is referred to in this embodiment as the MC-Actor, whose structure is shown in FIG. 3. In the figure, the state description information is denoted s_n and the first encoder is denoted Encoder1; feeding s_n into Encoder1 yields the state embedding feature e_s of s_n. The environment description information is denoted υ_env and the second encoder is denoted Encoder2; feeding υ_env into Encoder2 yields the environment embedding feature e_v.
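A minimal sketch of the two encoders, assuming each is a small fully connected network; the hidden sizes, input dimensions, and the PyTorch implementation are illustrative choices, not prescribed by this embodiment:

```python
import torch
import torch.nn as nn

class MLPEncoder(nn.Module):
    """Maps a raw descriptor vector to a D-dimensional embedding."""
    def __init__(self, in_dim: int, embed_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

D = 64                                            # embedding dimension (illustrative)
encoder1 = MLPEncoder(in_dim=32, embed_dim=D)     # state encoder: s_n   -> e_s
encoder2 = MLPEncoder(in_dim=8,  embed_dim=D)     # environment encoder: υ_env -> e_v
```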

Building on the above introduction of the state embedding feature and the environment embedding feature, step SA102 further includes:

SA102-2: for any pair of adjacent policy layers, filter the features output by each sub-model of the preceding policy layer according to the environment embedding feature of the environment description information and the state embedding feature, thereby determining the input feature of each sub-model of the following policy layer.

It should be understood that the model architecture of the policy model provided in this embodiment is fixed; that is, for any pair of adjacent policy layers, the data-transfer relationship between their sub-models is fixed when the model is designed. Under this constraint, in order to still be able to adjust the connections between sub-models, this embodiment adjusts the data fed into each sub-model, thereby indirectly adjusting how the sub-models are combined. Accordingly, an optional implementation of step SA102-2 includes:

SA102-2-1: generate a weight vector for each sub-model of the following policy layer according to the environment embedding feature of the environment description information and the state embedding feature.

The policy model further includes multiple weight layers connected in series, and these weight layers correspond one-to-one to a subset of the policy layers. The policy device obtains a fused feature of the state embedding feature and the environment embedding feature; for example, the state embedding feature and the environment embedding feature are multiplied element-wise to obtain the fused feature.

Further, the policy device feeds the fused feature into the multiple weight layers. For any pair of adjacent weight layers, the weight vector output by the preceding weight layer is multiplied by the fused feature and then fed into the following weight layer, yielding the weight vector of the policy layer that corresponds to the following weight layer.
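A sketch of this weighting chain, under the assumption that each weight layer produces both an M×M combination-weight matrix for its policy layer and a D-dimensional carry vector; only the multiply-then-feed-forward pattern is taken from the description above, everything else is an assumed implementation:

```python
import torch
import torch.nn as nn

class WeightLayer(nn.Module):
    """One weight layer: combination weights for its policy layer plus a carry vector."""
    def __init__(self, dim: int, num_modules: int):
        super().__init__()
        self.carry_head = nn.Linear(dim, dim)                       # D-dimensional output
        self.weight_head = nn.Linear(dim, num_modules * num_modules)
        self.num_modules = num_modules

    def forward(self, x: torch.Tensor):
        carry = torch.relu(self.carry_head(x))
        w = self.weight_head(x).view(-1, self.num_modules, self.num_modules)
        return torch.softmax(w, dim=-1), carry

def run_weight_chain(e_s, e_v, weight_layers):
    fused = e_s * e_v                    # element-wise fusion of the two embeddings (SA102-2-1)
    x, all_weights = fused, []
    for layer in weight_layers:
        w, carry = layer(x)              # combination weights for one policy layer
        all_weights.append(w)
        x = fused * carry                # previous output times the fused feature feeds the next layer
    return all_weights
```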

SA102-2-2: weight the output features of each sub-model of the preceding policy layer according to the weight vector of each sub-model of the following policy layer, thereby obtaining the input feature of each sub-model of the following policy layer.

Exemplarily, with continued reference to FIG. 3, assume the policy model includes L policy layers and each policy layer includes M sub-models. Denote each sub-model of the l-th layer as F_{l,m}, m ∈ {1,2,…,M}, its input dimension as D_l^in, and its output dimension as D_l^out. In this example, each policy layer satisfies the following conventions:

(1) the sub-models F_{l,m}, m ∈ {1,2,…,M}, of the same layer should have the same dimensions;

(2) the output dimension D_l^out of layer l should equal the input dimension D_{l+1}^in of the next layer l+1.

Based on the above conventions, since the state embedding feature e_s is generated by the first encoder (the first encoder takes the state description information s_n as input and outputs e_s), the input dimension of the first policy layer equals D, the dimension of e_s. For the last policy layer, the action taken by each user device comprises an offloading-position decision variable and a wireless transmit power, each of which follows a Gaussian distribution, and a Gaussian distribution is determined by the two parameters μ and σ; therefore, when there are U user devices in total, the output dimension of the last policy layer is 4U.
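A small sketch of such an action head, assuming the last policy layer emits, per user device, (μ, σ) for the position-decision variable and (μ, σ) for the transmit power (4U values overall); the clamping range and library choices are assumptions:

```python
import torch
from torch.distributions import Normal

def sample_actions(last_layer_out: torch.Tensor, num_devices: int):
    """last_layer_out: tensor of shape (batch, 4 * num_devices)."""
    out = last_layer_out.view(-1, num_devices, 4)
    mu, log_sigma = out[..., 0::2], out[..., 1::2]       # (batch, U, 2) each
    sigma = log_sigma.clamp(-5.0, 2.0).exp()             # keep σ positive and bounded
    dist = Normal(mu, sigma)
    a = dist.sample()                                    # [:, :, 0] position decision, [:, :, 1] tx power
    return a, dist.log_prob(a).sum(dim=(-1, -2))

# e.g., with U = 3 user devices the head outputs 4 * 3 = 12 values per state
```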

The above example has introduced the policy layers of the policy model. With continued reference to FIG. 3, the policy model further includes a dynamic management model composed of multiple weight layers. Because the weight layers are used to weight the results output by the policy layers, the number of weight layers is smaller than the number of policy layers: in this example, L−1 weight layers are provided for L policy layers. The output of the l-th of the L−1 weight layers is denoted w^l_{i,j}, i, j ∈ {1,2,…,M}; these outputs dynamically combine the outputs of the sub-models F_{l,j}, j ∈ {1,2,…,M}, of the l-th policy layer into the inputs of the sub-models F_{l+1,i}, i ∈ {1,2,…,M}, of the (l+1)-th policy layer. Taking the k-th input of the sub-model F_{l+1,i} of the (l+1)-th policy layer as an example, it can be expressed as

x^{l+1,i}_k = Σ_{j=1}^{M} w^l_{i,j} · o^{l,j}_k,

where o^{l,j}_k denotes the k-th output of the j-th module F_{l,j} of the l-th layer.
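As an illustration of this weighted combination, a sketch under the assumption that the combination weights form an M×M matrix per layer, matching the formula above:

```python
import torch

def combine_modules(prev_outputs: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """prev_outputs: (batch, M, D_out) outputs o^{l,j} of layer l's sub-models.
    weights: (batch, M, M) combination weights w^l_{i,j} from the l-th weight layer.
    Returns the inputs x^{l+1,i} of layer l+1's sub-models, shape (batch, M, D_out)."""
    # x^{l+1,i}_k = sum_j w^l_{i,j} * o^{l,j}_k  ->  a batched matrix product
    return torch.bmm(weights, prev_outputs)
```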

With continued reference to FIG. 3, for the first weight layer W_1, the state embedding feature e_s and the environment embedding feature e_v are multiplied element-wise and the product is fed into W_1, which generates the combination weights of the first policy layer as its output. This output is fed into the second weight layer W_2, yielding a D-dimensional output; the element-wise product of e_s and e_v is then multiplied element-wise with the output of W_2 to form the input of the third weight layer W_3. Continuing in this manner, the input and output of every weight layer are obtained. In this way, whatever the environment description information υ_env of the target edge computing environment may be, the output weights of the weight layers extract the reusable sub-models of the corresponding policy layers, so that appropriate sub-models are combined according to the environment description information of the target edge computing environment.

It should be noted that each sub-model F_{l,m} of the l-th policy layer may be implemented by mathematical functions with different equations or by neural networks with different hidden architectures, as long as the dimensions of its input and output satisfy the constraints on D_l^in and D_l^out described above.

Based on the above description of the policy model, step SA102 further includes:

SA102-3: obtain, from the outputs of the multiple policy layers, a task offloading policy that accounts for both the target edge computing environment and the state description information.

Thus, with the policy model provided by this embodiment, the environment description information controls how the policy layers process the state description information, adapting the processing to the current task computing environment and thereby obtaining the best task offloading policy.

In addition, this embodiment also provides a training method for the above policy model, which may be carried out by the above policy device. In some implementations, the training method may also be carried out by another electronic device that can provide sufficient computing power. The training method is described in detail below with reference to specific implementations, and specifically includes:

SB101: obtain multiple policy models to be trained and, for each policy model to be trained, an evaluation model to be trained.

The multiple policy models to be trained correspond to different edge computing environments.

SB102: for each policy model to be trained, interact with the corresponding edge computing environment through the policy model to be trained, obtain task offloading experience for the current state of that edge computing environment, and cache it in an experience pool.

The task offloading experience includes the environment description information of the edge computing environment corresponding to the policy model to be trained.

SB103: once the task offloading experience collected in the experience pool satisfies a preset condition, sample experience from the experience pool and update the multiple policy models to be trained and their corresponding evaluation models to be trained according to the sampled experience.

SB104: determine whether the multiple policy models to be trained and their corresponding evaluation models to be trained have reached the convergence condition; if so, execute step SB105; otherwise, return to step SB102.

SB105: use the policy model to be trained obtained after this iteration as the pre-trained policy model.

As shown in FIG. 4, in a specific implementation, as in existing deep reinforcement learning models based on the Actor-Critic architecture, each policy model to be trained serves as an Actor to be trained and each evaluation model to be trained serves as a Critic to be trained. Each Actor to be trained is responsible for interacting with its own edge computing environment, and the corresponding Critic to be trained evaluates the influence of the task offloading policy generated by that Actor on the long-term reward.

Unlike existing deep reinforcement learning models based on the Actor-Critic architecture, during training V Actors to be trained are activated simultaneously to interact with their respective edge computing environments in order to improve the efficiency of experience exploration; the interaction experience (s_n, a_n, r_n, s_{n+1}), together with the corresponding environment description information υ_env, is stored in the experience pool, then sampled as a batch ε, and the model parameters of the Actors to be trained and the Critics to be trained are updated according to the preset loss functions. After training is completed, the resulting Actor can adapt to a variety of edge computing environments and generate the best task offloading policy.
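A highly simplified sketch of this training loop; the buffer size, batch size, and update schedule are illustrative assumptions, and `actor_step`, `compute_losses`, and the environment objects are hypothetical placeholders for the components described above:

```python
import random
from collections import deque

replay = deque(maxlen=100_000)   # experience pool ξ
BATCH = 256

def train(actors, critics, envs, env_descs, steps=10_000):
    states = [env.reset() for env in envs]
    for _ in range(steps):
        # V actors explore their own environments (written sequentially here for clarity)
        for i, (actor, env) in enumerate(zip(actors, envs)):
            a = actor_step(actor, states[i], env_descs[i])            # hypothetical helper
            s_next, r, done = env.step(a)
            replay.append((states[i], a, r, s_next, env_descs[i]))    # (s_n, a_n, r_n, s_{n+1}, υ_env)
            states[i] = env.reset() if done else s_next
        if len(replay) >= BATCH:                                      # preset condition met
            batch = random.sample(replay, BATCH)
            compute_losses(actors, critics, batch)                    # hypothetical: update actors and critics
```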

The above embodiment has described the model training framework constructed to train a policy model able to adapt to a variety of edge computing environments. On this basis, this embodiment next describes the loss functions used to make the models converge.

First, the process by which an Actor to be trained generates a task offloading policy is regarded as the evaluation of a function denoted π_φ, whose parameters to be optimized are denoted φ. The process by which a Critic to be trained evaluates a task offloading policy is regarded as the evaluation of a function denoted Q_θ (the Q-value function), whose parameters to be optimized are denoted θ. The result computed by Q_θ is the long-term reward ultimately attainable when tasks are offloaded according to the task offloading policy provided by the current π_φ, i.e., the expected discounted cumulative reward Q_θ(s_n, a_n) = E[ Σ_{t≥0} γ^t · r_{n+t} | s_n, a_n ].

The purpose of training the Actor to be trained is to maximize the long-term task reward ultimately obtained when tasks are offloaded according to the task policies generated by the final policy model.

It should be noted that the result generated by π_φ follows a multidimensional Gaussian distribution N_φ(μ, σ), meaning that the mean μ and covariance σ of each element of the task offloading policy a_n are given by the output of π_φ; that is, π_φ(a_n | s_n) denotes the Gaussian probability of selecting the task offloading policy a_n in the state represented by s_n. Accordingly, the Actor to be trained is provided with the following loss function J_π(φ):

J_π(φ) = E_{s_n∼ε}[ D_KL( π_φ(· | s_n) ‖ exp(Q_θ(s_n, ·)) / Z_θ(s_n) ) ],

and the Critic to be trained is provided with the following loss function J_π(θ):

J_π(θ) = E_{(s_n,a_n)∼ε}[ ½ ( Q_θ(s_n, a_n) − ( r_n + γ · E[ V_ψ̄(s_{n+1}) ] ) )² ],

where D_KL denotes the KL divergence, Z_θ(s_n) denotes the partition function that normalizes exp(Q_θ(s_n, a_n)), and V_ψ(s_n) denotes an independent soft state-value function parameterized by ψ (ψ̄ denotes the target network of V_ψ used for training stability). ψ can be optimized by minimizing

J_V(ψ) = E_{s_n∼ε}[ ½ ( V_ψ(s_n) − E_{a_n∼π_φ}[ Q_θ(s_n, a_n) − log π_φ(a_n | s_n) ] )² ].
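A compact sketch of how these updates could be computed with PyTorch; the discount factor, the exact soft-actor-critic-style targets, and the network objects (`q_net`, `v_net`, `v_target`, `policy`) are assumptions for illustration only:

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor (assumed)

def critic_and_value_losses(batch, q_net, v_net, v_target, policy):
    s, a, r, s_next = batch                      # tensors sampled from the experience pool
    # Critic: soft Bellman residual, using the target value network for stability
    with torch.no_grad():
        q_target = r + GAMMA * v_target(s_next)
    loss_q = F.mse_loss(q_net(s, a), q_target)

    # Value network: regress toward Q - log π under the current policy
    a_new, log_pi = policy.sample(s)             # reparameterized sample and its log-probability
    with torch.no_grad():
        v_ref = q_net(s, a_new) - log_pi
    loss_v = F.mse_loss(v_net(s), v_ref)
    return loss_q, loss_v
```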

Further, since the Actor to be trained includes multiple policy layers and each policy layer includes multiple sub-models, and in order to learn to decide dynamically whether to reuse a given sub-model for different edge computing environments, the Actor to be trained also includes multiple weight layers, and its input includes both the environment description information and the state description information.

Because the Actor to be trained contains both multiple policy layers and multiple weight layers, the parameters of both must be updated during training. The Actor network to be trained outputs probability distributions and then samples from them to decide concrete actions. The sampling operation itself, however, is not differentiable, which means the back-propagation algorithm cannot be used directly to update the Actor network's parameters, since back-propagation relies on differentiability to compute gradients. For this reason, this example adopts the Gumbel-Softmax reparameterization technique to overcome the back-propagation difficulty caused by the non-differentiable sampling of the combination weights: additional noise variables are introduced to approximate the sampling process, making the sampling from the probability distribution differentiable so that the model can be trained end to end.
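For reference, a minimal example of the Gumbel-Softmax trick using the function available in PyTorch; the temperature value and tensor shapes are illustrative choices:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 4, requires_grad=True)    # unnormalized scores over 4 candidate sub-models
# Differentiable "soft" sample: gradients flow back to the logits through the Gumbel noise
soft_pick = F.gumbel_softmax(logits, tau=1.0, hard=False)
# Straight-through variant: one-hot in the forward pass, soft gradients in the backward pass
hard_pick = F.gumbel_softmax(logits, tau=1.0, hard=True)
soft_pick.sum().backward()                        # works: the sampling path is differentiable
```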

For ease of description, let φ denote the neural-network parameters of the policy layers and let the neural-network parameters of the weight layers be denoted by a separate parameter vector; the loss function of the offloading policy π can then be rewritten so that it is optimized jointly over both sets of parameters.

Further, it was also found during development that some sub-modules are selected and used many times during training while others are hardly ever selected or trained. To avoid module degradation, this embodiment therefore also designs a regularization term R and adds it to the rewritten loss function of the Actor to be trained.

The design of R takes into account, for every module j of the l-th layer, the long-term sum of its combination weights accumulated over the time horizon T. To prevent certain sub-models from being exclusively selected and trained, R is defined with the goal of minimizing the disparity between these accumulated weights for any two modules j and j′ of any layer l.

This embodiment also mitigates overfitting by introducing dropout. Dropout is a regularization technique for training neural networks that reduces overfitting by randomly removing a portion of the neurons during training. In this embodiment, dropout is used to randomly discard a sub-model with a certain probability, so that in every training iteration each sub-model has some probability of being selected and some probability of being switched off. In a concrete implementation, a mask vector is maintained whose elements are independent Bernoulli random variables governed by the probability p_drop; the k-th input of the sub-model F_{l+1,i} can then be expressed as x̃^{l+1,i}_k = Σ_{j=1}^{M} β_{l,j} · w^l_{i,j} · o^{l,j}_k, where β_{l,j} is the mask element for module j of layer l.
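A sketch of this sub-model-level dropout, assuming p_drop is the drop probability, the mask is drawn once per forward pass, and the shape conventions follow the earlier combination sketch:

```python
import torch

def dropped_combine(prev_outputs, weights, p_drop: float, training: bool = True):
    """prev_outputs: (batch, M, D); weights: (batch, M, M) combination weights."""
    if training:
        m = prev_outputs.shape[1]
        # Bernoulli mask over source sub-models: 1 keeps module j, 0 drops it this iteration
        beta = torch.bernoulli(torch.full((1, 1, m), 1.0 - p_drop)).to(prev_outputs.device)
        weights = weights * beta
    return torch.bmm(weights, prev_outputs)
```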

Based on the above loss functions, the V Actors to be trained interact with their respective edge computing environments in parallel to improve the efficiency of experience exploration. The task offloading experience gathered during exploration is cached in the experience pool ξ; once the cached task offloading experience satisfies the preset condition, samples are drawn from ξ and the obtained sampled experience is used to update the parameters of the Actors to be trained and the Critics to be trained.

It is worth noting that, to improve training stability, the policy layers and the weight layers of the Actors to be trained are updated alternately according to preset periods during training: the network parameters φ of the policy layers are updated with a period denoted Γ_φ, and the network parameters of the weight layers are updated with their own, separately configured period.

Based on the same inventive concept as the module-combination model-free computing offloading method in a multi-environment MEC provided by this embodiment, this embodiment further provides a module-combination model-free computing offloading apparatus in a multi-environment MEC. The apparatus includes at least one software functional module that can be stored in the memory 301 in the form of software or solidified in the policy device. The processor 302 of the policy device executes the executable modules stored in the memory 301, for example the software functional modules and computer programs included in the apparatus. Referring to FIG. 5, divided by function, the module-combination model-free computing offloading apparatus in a multi-environment MEC may include:

an information acquisition module 201, configured to obtain environment description information and current state description information of a target edge computing environment; and

a policy generation module 202, configured to invoke a pre-trained policy model to process the environment description information and the state description information, obtaining a task offloading policy that accounts for both the target edge computing environment and the state description information.

In this embodiment, the information acquisition module 201 implements step SA101 of FIG. 2 and the policy generation module 202 implements step SA102 of FIG. 2. For details of each module, reference may therefore be made to the specific implementation of the corresponding step, which is not repeated here.

In addition, the functional modules in the embodiments of this application may be integrated together to form an independent part, each module may exist separately, or two or more modules may be integrated to form an independent part.

It should also be understood that, if implemented in the form of software functional modules and sold or used as an independent product, the above implementations may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, the part that contributes to the prior art, or a part of the technical solution may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods of the embodiments of this application.

Accordingly, this embodiment further provides a storage medium storing a computer program which, when executed by a processor, implements the module-combination model-free computing offloading method in a multi-environment MEC provided by this embodiment. The storage medium may be a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or any other medium capable of storing program code.

This embodiment provides a policy device for carrying out the module-combination model-free computing offloading method in a multi-environment MEC. As shown in FIG. 6, the policy device may include a processor 302 and a memory 301; the memory 301 stores a computer program, and the processor implements the method provided by this embodiment by reading and executing the computer program in the memory 301 that corresponds to the above implementations.

With continued reference to FIG. 6, the policy device further includes a communication unit 303. The memory 301, the processor 302, and the communication unit 303 are electrically connected to one another, directly or indirectly, through a system bus 304 to enable data transmission or interaction.

The memory 301 may be an information recording apparatus based on any electronic, magnetic, optical, or other physical principle and is used to record execution instructions, data, and the like. In some implementations, the memory 301 may be, but is not limited to, a volatile memory, a non-volatile memory, a storage drive, or the like.

In some implementations, the volatile memory may be a random access memory (RAM); in some implementations, the non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory, or the like; in some implementations, the storage drive may be a magnetic disk drive, a solid-state drive, any type of storage disc (such as a CD or DVD), a similar storage medium, or a combination thereof.

The communication unit 303 is used to send and receive data over a network. In some implementations, the network may include a wired network, a wireless network, an optical fiber network, a telecommunication network, an intranet, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, a ZigBee network, a near field communication (NFC) network, or the like, or any combination thereof. In some embodiments, the network may include one or more network access points. For example, the network may include wired or wireless network access points, such as base stations and/or network switching nodes, through which one or more components of the service request processing system may connect to the network to exchange data and/or information.

The processor 302 may be an integrated circuit chip having signal processing capability and may include one or more processing cores (for example, a single-core or multi-core processor). Merely by way of example, the processor may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), an application-specific instruction-set processor (ASIP), a graphics processing unit (GPU), a physics processing unit (PPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a controller, a microcontroller unit, a reduced instruction set computer (RISC), a microprocessor, or the like, or any combination thereof.

It will be appreciated that the structure shown in FIG. 6 is merely illustrative. The policy device may have more or fewer components than shown in FIG. 6, or a configuration different from that shown in FIG. 6. The components shown in FIG. 6 may be implemented in hardware, software, or a combination thereof.

It should be understood that the apparatus and method disclosed in the above implementations may also be realized in other ways. The apparatus embodiments described above are merely illustrative; for example, the flowcharts and block diagrams in the drawings show the possible architectures, functions, and operations of apparatuses, methods, and computer program products according to several embodiments of this application. In this regard, each block of a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that shown in the drawings; for example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should further be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified function or action, or by a combination of dedicated hardware and computer instructions.

The above are merely various implementations of this application, and the protection scope of this application is not limited thereto. Any change or substitution readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be determined by the protection scope of the claims.

Claims (10)

1. A module-combination model-free computing offloading method in a multi-environment MEC, characterized in that the method comprises:
obtaining environment description information and current state description information of a target edge computing environment; and
invoking a pre-trained policy model to process the environment description information and the state description information to obtain a task offloading policy that accounts for both the target edge computing environment and the state description information.

2. The method according to claim 1, characterized in that the policy model comprises multiple policy layers connected in series, each policy layer comprising multiple mutually independent sub-models, and that invoking the pre-trained policy model to process the environment description information and the state description information to obtain the task offloading policy comprises:
inputting the state embedding feature of the state description information into the multiple policy layers;
for any adjacent layers among the multiple policy layers, filtering the features output by each sub-model of the preceding policy layer according to the environment embedding feature of the environment description information and the state embedding feature, to determine the input feature of each sub-model of the following policy layer; and
determining, from the outputs of the multiple policy layers, the task offloading policy that accounts for both the target edge computing environment and the state description information.

3. The method according to claim 2, characterized in that filtering the output features of each sub-model of the preceding policy layer according to the environment embedding feature of the environment description information and the state embedding feature, to determine the input feature of each sub-model of the following policy layer, comprises:
generating a weight vector for each sub-model of the following policy layer according to the environment embedding feature of the environment description information and the state embedding feature; and
weighting the output features of each sub-model of the preceding policy layer according to the weight vector of each sub-model of the following policy layer, to obtain the input feature of each sub-model of the following policy layer.

4. The method according to claim 3, characterized in that the policy model further comprises multiple weight layers connected in series, the multiple weight layers corresponding one-to-one to a subset of the multiple policy layers, and that generating the weight vector of each sub-model of the following policy layer according to the environment embedding feature of the environment description information and the state embedding feature comprises:
obtaining a fused feature of the state embedding feature and the environment embedding feature;
inputting the fused feature into the multiple weight layers; and
for any adjacent layers among the multiple weight layers, multiplying the weight vector output by the preceding weight layer by the fused feature and feeding the result into the following weight layer, to obtain the weight vector of the policy layer corresponding to the following weight layer.

5. The method according to claim 4, characterized in that obtaining the fused feature of the state embedding feature and the environment embedding feature comprises:
multiplying the state embedding feature and the environment embedding feature element-wise to obtain the fused feature.

6. The method according to claim 2, characterized in that the policy model further comprises a first encoder and a second encoder, and the method further comprises:
processing the state description information with the first encoder to obtain the state embedding feature of the state description information; and
processing the environment description information with the second encoder to obtain the environment embedding feature of the environment description information.

7. The method according to claim 1, characterized in that the method further comprises a training method for the policy model, the training method comprising:
obtaining multiple policy models to be trained and an evaluation model to be trained for each policy model to be trained, wherein the multiple policy models to be trained correspond to different edge computing environments;
for each policy model to be trained, interacting with the corresponding edge computing environment through the policy model to be trained to obtain task offloading experience for the current state of that edge computing environment, and caching it in an experience pool;
after the task offloading experience collected in the experience pool satisfies a preset condition, sampling experience from the experience pool and updating the multiple policy models to be trained and their corresponding evaluation models to be trained according to the sampled experience; and
if the multiple policy models to be trained and their corresponding evaluation models to be trained have not reached the convergence condition, returning to the step of interacting with the corresponding edge computing environment through each policy model to be trained to obtain task offloading experience for the current state of that edge computing environment, until the convergence condition is satisfied, and then using the policy model to be trained obtained after the current iteration as the pre-trained policy model.

8. The method according to claim 7, characterized in that updating the multiple policy models to be trained and their corresponding evaluation models to be trained according to the sampled experience comprises:
alternately updating the multiple policy layers and the multiple weight layers of each policy model to be trained according to the sampled experience, and updating the evaluation model to be trained corresponding to each policy model to be trained.

9. The method according to claim 7, characterized in that the task offloading experience includes the environment description information of the edge computing environment corresponding to the policy model to be trained.

10. A module-combination model-free computing offloading apparatus in a multi-environment MEC, characterized in that the apparatus comprises:
an information acquisition module, configured to obtain environment description information and current state description information of a target edge computing environment; and
a policy generation module, configured to invoke a pre-trained policy model to process the environment description information and the state description information to obtain a task offloading policy that accounts for both the target edge computing environment and the state description information.

