CN117811907A - Satellite network micro-service deployment method and device based on multi-agent reinforcement learning

Info

Publication number: CN117811907A
Authority: CN (China)
Prior art keywords: model, information, agent, satellite, deployment
Legal status: Pending
Application number: CN202311360363.1A
Other languages: Chinese (zh)
Inventors: 吴胜, 段皓月, 纪哲, 虞志刚, 丁文慧, 陆洲
Current Assignee: Beijing University of Posts and Telecommunications
Original Assignee: Beijing University of Posts and Telecommunications
Application filed by Beijing University of Posts and Telecommunications; priority to CN202311360363.1A; publication of CN117811907A.


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/08: Configuration management of networks or network elements
    • H04L 41/0803: Configuration setting
    • H04L 41/16: Arrangements for maintenance, administration or management of data switching networks using machine learning or artificial intelligence


Abstract

The embodiments of the application provide a satellite network micro-service deployment method and device based on multi-agent reinforcement learning. The method comprises the following steps: acquiring resource demand information of the micro-service; determining resource utilization rate information and time delay information of the satellite nodes according to a pre-established resource utilization rate model and time delay model together with configuration information of the satellite nodes; when the resource utilization rate information is smaller than a first preset value or the time delay information is smaller than a second preset value, determining the deployment strategy of the satellite nodes corresponding to the resource demand information of the micro-service with a pre-trained multi-agent strategy deployment model; and configuring the server terminal according to the deployment strategy of the satellite nodes. Resources can thus be configured on each satellite node according to the resource demand of the micro-services on the server and the remaining resources of each satellite node, which improves the resource utilization balance of the satellite nodes, reduces the invocation delay, and improves configuration efficiency.

Description

Satellite network micro-service deployment method and device based on multi-agent reinforcement learning
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a satellite network micro-service deployment method and device based on multi-agent reinforcement learning.
Background
With the continuous evolution of software construction and operation modes, the traditional centralized architecture, which lacks flexibility and is difficult to expand and migrate, can no longer meet diverse application requirements and has gradually evolved into the distributed micro-service architecture. Under the micro-service architecture, a complex application is split by logical relationship into several relatively independent small applications. These micro-services can be independently developed, updated, expanded and deployed without affecting one another, communicate over lightweight protocols, and can be deployed onto different satellite edge nodes, so engineering projects based on the micro-service architecture have high scalability, high reliability and flexible distributed deployment capability.
At present, with the growing demand for communication services, the drawbacks of terrestrial communication networks, such as insufficient spectrum resources and limited coverage, have become apparent. Compared with terrestrial communication, satellite communication has unique advantages, such as wide coverage, high system reliability, large communication capacity and immunity to natural disasters such as earthquakes. However, each satellite has a different amount of resources, different micro-services make different requests for satellite resources, and the remaining resources of the satellites differ. How to deploy the different micro-services onto the satellites so that satellite resources are configured reasonably is a problem that currently needs to be solved.
Disclosure of Invention
The aim of some embodiments of the present application is to provide a satellite network micro-service deployment method and device based on multi-agent reinforcement learning. In the technical scheme of the embodiments, resource demand information of the micro-service is acquired; resource utilization rate information and time delay information of the satellite nodes are determined according to a pre-established resource utilization rate model and time delay model together with configuration information of the satellite nodes; and when the resource utilization rate information is smaller than a first preset value or the time delay information is smaller than a second preset value, a pre-trained multi-agent strategy deployment model is used to determine the deployment strategy of the satellite node corresponding to the resource demand information of the micro-service, where the pre-trained multi-agent strategy deployment model is obtained by training each parameter of an agent network model with a multi-agent deep deterministic policy gradient algorithm. In the embodiments of the application, the resource utilization rate model and the time delay model are established based on the micro-service architecture, the resource utilization rate information and time delay information of the satellite nodes are determined from the configuration information of the satellite nodes, the deployment strategy of the satellite nodes corresponding to the resource demand information of the micro-service is then determined with the pre-trained multi-agent strategy deployment model, and the server terminal is configured according to that deployment strategy. Resources can thus be configured on each satellite node according to the resource demand of the micro-services and the remaining resources of each satellite node, which improves the resource utilization balance of the satellite nodes, reduces the invocation delay, and improves configuration efficiency.
In a first aspect, some embodiments of the present application provide a satellite network micro-service deployment method based on multi-agent reinforcement learning, including:
acquiring resource demand information of the micro service;
determining resource utilization rate information and time delay information of the satellite node according to a pre-established resource utilization rate model and time delay model and configuration information of the satellite node;
when the resource utilization rate information is smaller than a first preset value or the time delay information is smaller than a second preset value, determining the deployment strategy of the satellite node corresponding to the resource demand information of the micro-service with a pre-trained multi-agent strategy deployment model, wherein the pre-trained multi-agent strategy deployment model is obtained by training each parameter of an agent network model with a multi-agent deep deterministic policy gradient algorithm;
and configuring the server terminal according to the deployment strategy of the satellite node.
According to some embodiments of the application, the resource utilization rate model and the time delay model are established, the resource utilization rate information and time delay information of the satellite nodes are determined from the configuration information of the satellite nodes, the deployment strategy of the satellite nodes corresponding to the resource demand information of the micro-service is then determined with a pre-trained multi-agent strategy deployment model, and the server terminal is configured according to that deployment strategy. Resources can thus be configured on each satellite node according to the resource demand of the micro-services and the remaining resources of each satellite node, i.e. micro-services with different resource demands are deployed onto suitable satellite nodes, which improves the resource utilization balance of the satellite nodes, reduces the invocation delay, and improves configuration efficiency.
Optionally, the multi-agent strategy deployment model is obtained by:
obtaining agent sample parameters, wherein the agent sample parameters at least comprise the agent's observed environment;
acquiring an agent network model, wherein the agent network model at least comprises an actor network model and a critic network model;
inputting the agent's observed environment into the actor network model, and outputting the deployment action of the agent;
inputting the deployment action and the global state of the agent into the critic network model, and outputting an action evaluation value;
establishing a replay pool from the agent's current state information, action information, reward information and next-moment state information;
updating the network parameters in the actor network model and the critic network model with a multi-agent deep deterministic policy gradient algorithm, according to the current state information, action information, reward information and next-moment state information sampled from the replay pool;
and, when the actor network model and the critic network model have converged, determining the converged actor network model and critic network model as the multi-agent strategy deployment model.
In some embodiments of the application, the micro-service deployment problem is converted into a partially observable Markov decision process and solved with a multi-agent reinforcement learning method, using centralized training and distributed execution. In the training stage, the container instances of the micro-services act as agents and acquire global information to obtain the optimal deployment scheme; in the execution stage, a micro-service can be deployed relying only on its own observation space, which greatly reduces the communication overhead between micro-services.
Optionally, updating the network parameters in the actor network model and the critic network model includes:
acquiring a first loss function of the actor network model and a second loss function of the critic network model;
performing gradient calculation on the first loss function and the second loss function respectively;
and updating the network parameters in the actor network model and the critic network model using the gradient descent method.
Some embodiments of the application adopt a fixed-network method: the target network is held fixed, and the original network parameters are transferred to the target network at intervals, which avoids a constantly changing update target and ensures training stability.
Optionally, the configuration information of the satellite nodes at least includes the number of satellite nodes, the total number of resource types and the heterogeneous resource capacities.
Optionally, the resource utilization model is obtained by:
acquiring resource balance information of a first resource utilization rate model of different types of resources on the same satellite node and node balance information of a second resource utilization rate model of the same type of resources on different satellite nodes;
and determining the resource utilization rate model according to the resource balance degree information and the weight value corresponding to the resource balance degree information, and the node balance degree information and the weight value corresponding to the node balance degree information.
Optionally, the delay model includes at least a transmission delay sub-model, a propagation delay sub-model, and a migration delay sub-model.
Some embodiments of the present application build a resource utilization model and a latency model, minimize resource utilization variance and latency, and represent the micro-service deployment problem as a multi-objective optimization problem.
In a second aspect, some embodiments of the present application provide a satellite network micro-service deployment apparatus based on multi-agent reinforcement learning, including:
The acquisition module is used for acquiring resource demand information of the micro service;
the first determining module is used for determining the resource utilization rate information and the time delay information of the satellite node according to a pre-established resource utilization rate model and a time delay model and configuration information of the satellite node;
the second determining module is configured to determine the deployment strategy of the satellite node corresponding to the resource demand information of the micro-service using a pre-trained multi-agent strategy deployment model when the resource utilization rate information is smaller than a first preset value or the time delay information is smaller than a second preset value, where the pre-trained multi-agent strategy deployment model is obtained by training each parameter of an agent network model with a multi-agent deep deterministic policy gradient algorithm;
and the configuration module is used for configuring the server terminal according to the deployment strategy of the satellite node.
According to some embodiments of the application, a resource utilization rate model and a time delay model are established based on a micro-service architecture, resource utilization rate information and time delay information of satellite nodes are determined according to configuration information of the satellite nodes, then a pre-trained multi-agent strategy deployment model is adopted to determine a deployment strategy of the satellite nodes corresponding to resource demand information of the micro-service, and a server terminal is configured according to the deployment strategy of the satellite nodes.
Optionally, the apparatus further comprises a model training module for:
obtaining agent sample parameters, wherein the agent sample parameters at least comprise the agent's observed environment;
acquiring an agent network model, wherein the agent network model at least comprises an actor network model and a critic network model;
inputting the agent's observed environment into the actor network model, and outputting the deployment action of the agent;
inputting the deployment action and the global state of the agent into the critic network model, and outputting an action evaluation value;
establishing a replay pool from the agent's current state information, action information, reward information and next-moment state information;
updating the network parameters in the actor network model and the critic network model with a multi-agent deep deterministic policy gradient algorithm, according to the current state information, action information, reward information and next-moment state information sampled from the replay pool;
and, when the actor network model and the critic network model have converged, determining the converged actor network model and critic network model as the multi-agent strategy deployment model.
In some embodiments of the application, the micro-service deployment problem is converted into a partially observable Markov decision process and solved with a multi-agent reinforcement learning method, using centralized training and distributed execution. In the training stage, the container instances of the micro-services act as agents and acquire global information to obtain the optimal deployment scheme; in the execution stage, a micro-service can be deployed relying only on its own observation space, which greatly reduces the communication overhead between micro-services.
Optionally, the model training module is configured to:
acquire a first loss function of the actor network model and a second loss function of the critic network model;
perform gradient calculation on the first loss function and the second loss function respectively;
and update the network parameters in the actor network model and the critic network model using the gradient descent method.
Some embodiments of the application adopt a fixed-network method: the target network is held fixed, and the original network parameters are transferred to the target network at intervals, which avoids a constantly changing update target and ensures training stability.
Optionally, the configuration information of the satellite nodes at least includes the number of satellite nodes, the total number of resource types and the heterogeneous resource capacities.
Optionally, the model training module is configured to:
acquiring resource balance information of a first resource utilization rate model of different types of resources on the same satellite node and node balance information of a second resource utilization rate model of the same type of resources on different satellite nodes;
and determining the resource utilization rate model according to the resource balance degree information and the weight value corresponding to the resource balance degree information, and the node balance degree information and the weight value corresponding to the node balance degree information.
Optionally, the delay model includes at least a transmission delay sub-model, a propagation delay sub-model, and a migration delay sub-model.
Some embodiments of the present application build a resource utilization model and a latency model, minimize resource utilization variance and latency, and represent the micro-service deployment problem as a multi-objective optimization problem.
In a third aspect, some embodiments of the present application provide an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, may implement the satellite network micro-service deployment method based on multi-agent reinforcement learning according to any of the embodiments of the first aspect.
In a fourth aspect, some embodiments of the present application provide a computer readable storage medium having stored thereon a computer program, which when executed by a processor, may implement a satellite network micro-service deployment method based on multi-agent reinforcement learning according to any of the embodiments of the first aspect.
In a fifth aspect, some embodiments of the present application provide a computer program product, where the computer program product includes a computer program, where the computer program when executed by a processor may implement the satellite network micro-service deployment method based on multi-agent reinforcement learning according to any of the embodiments of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of some embodiments of the present application, the drawings that are required to be used in some embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort to a person having ordinary skill in the art.
Fig. 1 is a schematic flow chart of a satellite network micro-service deployment method based on multi-agent reinforcement learning according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of another satellite network micro-service deployment method based on multi-agent reinforcement learning according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a micro-service deployment scenario provided in an embodiment of the present application;
FIG. 4 is a network architecture diagram of model training provided by an embodiment of the present application;
FIG. 5 is a schematic flow chart of model training according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a satellite network micro-service deployment device based on multi-agent reinforcement learning according to an embodiment of the present application;
fig. 7 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in some embodiments of the present application will be described below with reference to the drawings in some embodiments of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
With the continuous evolution of software construction and operation modes, the traditional centralized architecture, which lacks flexibility and is difficult to expand and migrate, can no longer meet diverse application requirements and has gradually evolved into the distributed micro-service architecture. Under the micro-service architecture, a complex application is split by logical relationship into several relatively independent small applications. These micro-services can be independently developed, updated, expanded and deployed without affecting one another, communicate over lightweight protocols, and can be deployed onto different satellite edge nodes, so engineering projects based on the micro-service architecture have high scalability, high reliability and flexible distributed deployment capability.
At present, with the growing demand for communication services, the drawbacks of terrestrial communication networks, such as insufficient spectrum resources and limited coverage, have become apparent. Compared with terrestrial communication, satellite communication has unique advantages, such as wide coverage, high system reliability, large communication capacity and immunity to natural disasters such as earthquakes. However, each satellite has a different amount of resources, different micro-services make different requests for satellite resources, and the remaining resources of the satellites differ. Therefore, some embodiments of the application provide a satellite network micro-service deployment method based on multi-agent reinforcement learning, which comprises: obtaining resource demand information of the micro-service; determining resource utilization rate information and time delay information of the satellite nodes according to a pre-established resource utilization rate model and time delay model together with configuration information of the satellite nodes; when the resource utilization rate information is smaller than a first preset value or the time delay information is smaller than a second preset value, determining the deployment strategy of the satellite node corresponding to the resource demand information of the micro-service with a pre-trained multi-agent strategy deployment model, where the pre-trained multi-agent strategy deployment model is obtained by training each parameter of an agent network model with a multi-agent deep deterministic policy gradient algorithm; and configuring the server terminal according to the deployment strategy of the satellite node. A resource utilization rate model and a time delay model are established based on the micro-service architecture, the resource utilization rate information and time delay information of the satellite nodes are determined from the configuration information of the satellite nodes, the deployment strategy of the satellite nodes corresponding to the micro-service resource demand information is then determined with the pre-trained multi-agent strategy deployment model, and the server terminal is configured according to that deployment strategy.
As shown in fig. 1, an embodiment of the present application provides a satellite network micro-service deployment method based on multi-agent reinforcement learning, which includes:
s101, acquiring resource demand information of a micro service;
the server terminal is used for executing micro services, each micro service corresponds to at least one container instance, and the container technology is a lightweight resource virtualization technology, which is a technology for abstracting, converting and dividing computing resources and presenting one or more computing resources. Among them, docker is the most popular container technology at present, and is widely applied to micro service deployment and cloud computing platforms.
The scheduling platform obtains the resource demand information of each micro-service; for example, the resource demand information records how much of a satellite node's resources, such as CPU, memory and disk IO, a given micro-service requests.
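As an illustration only (the patent does not prescribe a data format, so the field names below are hypothetical), the resource demand information gathered by the scheduling platform might be represented as:

```python
from dataclasses import dataclass

@dataclass
class ResourceDemand:
    """Resource demand of one micro-service, as gathered by the scheduling platform."""
    microservice: str
    cpu_cores: float      # CPU request
    memory_mb: float      # memory request
    disk_io_mbps: float   # disk IO request

demands = [ResourceDemand("ms-imaging", cpu_cores=2.0, memory_mb=4096, disk_io_mbps=50.0),
           ResourceDemand("ms-routing", cpu_cores=1.0, memory_mb=1024, disk_io_mbps=20.0)]
```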
s102, determining resource utilization rate information and time delay information of a satellite node according to a pre-established resource utilization rate model and time delay model and configuration information of the satellite node;
the configuration information of the satellite nodes at least comprises the number of the satellite nodes, the total quantity of resource types and heterogeneous resource capacity.
Specifically, a resource utilization rate model and a time delay model are pre-established on the scheduling platform. The resource utilization rate model is expressed in terms of variance and falls into two categories. One is the variance of different resource types on the same node, which prevents one type of resource from being consumed excessively, causing a short-board effect and wasting resources; the other is the variance of the same resource type across different nodes, which prevents satellite node resources from sitting idle.
The time delay is divided into transmission delay, propagation delay and migration delay. The transmission delay can be expressed as the quotient of the data size and the transmission rate, where the transmission rate can be determined by the Shannon formula. The propagation delay is proportional to the physical distance between nodes. The migration delay is determined by the migration frequency of the micro-service.
The scheduling platform determines the resource utilization rate information and the time delay information of the satellite nodes according to a pre-established resource utilization rate model and a time delay model and configuration information of the satellite nodes.
S103, when the resource utilization rate information is smaller than a first preset value or the time delay information is smaller than a second preset value, determining the deployment strategy of the satellite node corresponding to the resource demand information of the micro-service with a pre-trained multi-agent strategy deployment model, wherein the pre-trained multi-agent strategy deployment model is obtained by training each parameter of an agent network model with a multi-agent deep deterministic policy gradient algorithm;
Specifically, the scheduling platform acquires global state information in advance. The global state information comprises the resource occupancy of the satellite nodes, the position information of the satellites and the deployment status of the containers. Since the satellite positions change continuously over time, an agent takes corresponding actions in response to the position changes, causing the state to change. A reward function is established so that the resource utilization variance and the time delay are kept as small as possible, an agent network model is built from this information, and each parameter of the agent network model is then trained with the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm to obtain the multi-agent strategy deployment model.
And under the condition that the resource utilization rate information is smaller than a first preset value or the time delay information is smaller than a second preset value, determining a deployment strategy of the satellite node corresponding to the resource demand information of the micro-service by adopting a pre-trained multi-agent strategy deployment model.
S104, configuring the server terminal according to the deployment strategy of the satellite node.
Specifically, the scheduling platform configures each micro service with the obtained deployment strategy of each satellite node, so that the satellite nodes execute the deployment strategy, the resource utilization balance of the satellite nodes is improved, the calling time delay is reduced, and the configuration efficiency is improved.
According to some embodiments of the application, a resource utilization rate model and a time delay model are established based on the micro-service architecture, the resource utilization rate information and time delay information of the satellite nodes are determined from the configuration information of the satellite nodes, the deployment strategy of the satellite nodes corresponding to the resource demand information of the micro-services is then determined with a pre-trained multi-agent strategy deployment model, and the server terminal is configured according to that deployment strategy. Resources can thus be configured on each satellite node according to the resource demand of the micro-services on the server and the remaining resources of each satellite node, i.e. micro-services with different resource demands are deployed onto suitable satellite nodes, which improves the resource utilization balance of the satellite nodes, reduces the invocation delay, and improves configuration efficiency.
The satellite network micro-service deployment method based on multi-agent reinforcement learning provided by the above embodiment is described further in the following embodiment.
Fig. 2 is a flow chart of another satellite network micro-service deployment method based on multi-agent reinforcement learning according to an embodiment of the present application, as shown in fig. 2, where the satellite network micro-service deployment method based on multi-agent reinforcement learning includes:
step 1: constructing a micro-service deployment model;
specifically, determining the network structure, the number of satellite nodes and the resource types of the satellite edge computing system, and constructing a micro-service deployment model, as shown in fig. 3, including steps 101 to 104, as follows:
step 101: consider a satellite edge computation scenario, as shown in figure one, comprising a set of satellite edge nodes s= { S 1 ,s 2 ,...,s N Where N is the number of satellite nodes. In the edge computing scenario, the total amount of resource types is R (CPU, memory, disk IO, etc.), denoted as r= { R 1 ,r 2 ,...,r R }. For node s i Its heterogeneous resource capacity is represented as vector V i ={V i 1 ,V i 2 ,...,V i R }, wherein V i j Representing node s i Upper resource r j Is used to determine the available capacity of the battery.
Step 102: the micro-service set of the target deployment application in the satellite edge computing platform is ms= { MS 1 ,ms 2 ,...,ms M Where M is the number of micro services deployed in the form of containers into edge nodes. Resource requests for different microservices are set as vectorsWherein->Representing microservices ms i For resource r j Is a request amount of (a) to be used.
Step 103: each micro service can have multiple copies, namely multiple containers are deployed on different nodes, and the number of the containers of each micro service is set as Q= { Q 1 ,q 2 ,...,q M }, where q i Representing microservices ms i The number of copies of the container. The total amount of containers to be deployed isΣq, which can be expressed as a set Representing microservices ms i A corresponding jth container copy. Defining the container instance scheduling decision variable +.>When->Time-indicating container instanceDeployed at node s i And otherwise, the value is 0.
Step 104: the call relationship between micro services is represented by directed acyclic graph, and the adjacency matrix Y represents the call relationship when Y ij Representing micro-service ms when=1 i The next micro-service invoked is ms j Otherwise the value is 0.
Step 2: the optimization problem is represented by establishing a resource utilization rate model and a time delay model. Minimizing the resource utilization variance and latency represents the micro-service deployment problem as a multi-objective optimization problem.
Alternatively, the resource utilization model is obtained by:
Acquiring resource balance information of a first resource utilization rate model of different types of resources on the same satellite node and node balance information of a second resource utilization rate model of the same type of resources on different satellite nodes;
and determining a resource utilization rate model according to the resource balance degree information and the weight value corresponding to the resource balance degree information, and the node balance degree information and the weight value corresponding to the node balance degree information.
In this step, the resource utilization rate model is expressed in terms of variance and falls into two categories. One is the variance of different resource types on the same node, which prevents one type of resource from being consumed excessively, causing a short-board effect and wasting resources; the other is the variance of the same resource type across different nodes, which prevents satellite node resources from sitting idle.
Step 201: and establishing a satellite node resource utilization rate model.
Node s i Upper resource r j Utilization u of (2) i.j Can be expressed as:
the resource utilization rate model is divided into two types, wherein one type is the resource utilization rate of different types of resources on the same node. When a plurality of micro services with the same resource type are deployed to the same node, other micro services cannot be deployed on the node, so that a 'short-board effect' is formed, and resource waste is caused. The resource balance is expressed by standard deviation, node s i Balance epsilon of all resources on i Can be expressed as:
the other type is the resource utilization of the same type of resource on different nodes. When the number of micro services increases, all satellite edge nodes are hoped to be utilized, so that resource idling is prevented, and resource waste is caused. Resource r j Equalization on different nodesCan be expressed as:
the calculation formula of the resource utilization ratio U of the cluster is as follows, wherein α is a weight factor:
optionally, the delay model comprises at least a transmission delay sub-model, a propagation delay sub-model, and a migration delay sub-model.
The time delay is divided into transmission delay, propagation delay and migration delay. The transmission delay can be expressed as the quotient of the data size and the transmission rate, where the transmission rate can be determined by the Shannon formula. The propagation delay is proportional to the physical distance between nodes. The migration delay is determined by the migration frequency of the micro-service.
Step 202: and establishing a time delay model, wherein the time delay model is divided into transmission time delay, propagation time delay and migration time delay.
In particular, depending on the service deployment scenario, both the communication links between ground stations and satellite nodes and the communication links between satellite nodes need to be considered. According to Shannon's theorem, the data transmission rate from a ground station to the destination satellite node can be expressed as:

$v_{g\_s} = W_{g\_s} \log_2\!\left(1 + \dfrac{p_g\, g_{g\_s}}{N_0 + \sum I}\right)$

where $W_{g\_s}$ is the channel bandwidth, $p_g$ is the transmit power of the ground station, $g_{g\_s}$ is the channel gain between the ground station and the destination satellite, $N_0$ represents the background noise, and $\sum I$ represents the sum of the other noise and interference powers on the ground-station-to-satellite link.
Let $W_{i,j}$ be the channel bandwidth of the inter-satellite link between nodes $s_i$ and $s_j$, and $SNR_{i,j}$ the signal-to-noise ratio between the two nodes; the data transmission rate between the two satellite nodes is then:

$v_{i,j} = W_{i,j} \log_2\big(1 + SNR_{i,j}\big)$
From the adjacency matrix, the data transmission delay of the complete scheduling chain can be calculated as:

$\tau_{trans} = \dfrac{d_{g\_s}}{v_{g\_s}} + \sum_{i,j} Y_{ij}\, \dfrac{d_{i,j}}{v_{i,j}}$

where $d_{g\_s}$ represents the size of the data transmitted from the ground station to the satellite, and $d_{i,j}$ represents the size of the data transmitted between satellite nodes $s_i$ and $s_j$.
The information propagation speed is the speed of light $c$, and the distance between nodes $s_i$ and $s_j$ is denoted $dis_{i,j}$; the propagation delay can then be expressed as:

$\tau_{prop} = \dfrac{dis_{g\_s}}{c} + \sum_{i,j} Y_{ij}\, \dfrac{dis_{i,j}}{c}$

where $dis_{g\_s}$ is the distance from the ground station to the satellite.
When a satellite moves out of the visible range of the serving cell, a migration action is required, and when an agent performs a migration action, a migration delay is incurred. The migration actions of an agent are modeled as a directed weighted graph whose edge weights are the migration delays from the source satellite to the target satellite. The overall migration cost is the sum of the weights, expressed as

$\tau_{mig} = \sum_{(i,j) \in Tr} w_{i,j}$

where $Tr$ represents the whole migration chain produced by the agent's actions within the time period, and $w_{i,j}$ represents the edge weight.
The overall delay $D$ is then calculated as:

$D = \tau_{trans} + \tau_{prop} + \tau_{mig}$
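A minimal sketch of the three delay components under the formulas above; the function names and unit conventions are assumptions, not from the patent:

```python
from math import log2

C_LIGHT = 3.0e8  # propagation speed (speed of light), m/s

def shannon_rate(bandwidth_hz, power_w, gain, noise_w, interference_w=0.0):
    """Link rate v = W * log2(1 + p*g / (N0 + sum of interference))."""
    return bandwidth_hz * log2(1.0 + power_w * gain / (noise_w + interference_w))

def total_delay(d_gs, v_gs, dist_gs, hops, migration_weights):
    """hops: (data_bits, rate_bps, distance_m) per inter-satellite link on the
    scheduling chain; migration_weights: w_ij of the migration edges in Tr."""
    tau_trans = d_gs / v_gs + sum(d / v for d, v, _ in hops)
    tau_prop = dist_gs / C_LIGHT + sum(dist / C_LIGHT for _, _, dist in hops)
    tau_mig = sum(migration_weights)
    return tau_trans + tau_prop + tau_mig
```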
Some embodiments of the present application build a resource utilization model and a latency model, minimize resource utilization variance and latency, and represent the micro-service deployment problem as a multi-objective optimization problem.
Step 203: the optimization problem is represented.
Based on the above models, a joint optimization problem can be established that minimizes the resource utilization index $U$ and the overall delay $D$, expressed as:

$P:\quad \min U,\;\; \min D$

$\text{s.t.}\quad C1:\ \sum_{k=1}^{q_i} \sum_{n=1}^{N} x_{c_i^k, n} \ge 1 \quad \forall\, ms_i \in MS, \qquad C2:\ \sum_{c \in C} x_{c,n}\, d_c^j \le V_n^j \quad \forall\, s_n \in S,\ r_j \in \mathcal{R}$

where C1 states that each micro-service deploys at least one container instance so that its function is realized, and C2 states that the amount of resources requested on a node cannot exceed the node's maximum resource capacity.
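For illustration, the constraints C1 and C2 can be checked against the structures from the earlier sketch (hypothetical helper):

```python
import numpy as np

def feasible(x, d_container, V, q):
    """C1: every micro-service has at least one deployed container instance;
    C2: the resources requested on a node never exceed its capacity."""
    owner = np.array([i for i, cnt in enumerate(q) for _ in range(cnt)])
    placed = x.sum(axis=1)                      # placements per container instance
    c1 = all(placed[owner == i].sum() >= 1 for i in range(len(q)))
    c2 = bool(((x.T @ d_container) <= V).all())
    return c1 and c2
```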
Step 3: the micro-service deployment problem is expressed as a partially observable Markov decision process, and is solved by adopting a multi-agent reinforcement learning method.
As shown in fig. 4, the agents cannot acquire all of the state information, so each agent has its own separate observation space. Moreover, the relative positions of the satellites change over time, which affects the communication delay, so the environment state changes with both time and the actions of the agents.
Optionally, the multi-agent strategy deployment model is obtained by:
obtaining agent sample parameters, wherein the agent sample parameters at least comprise the agent's observed environment;
acquiring an agent network model, wherein the agent network model at least comprises an actor network model and a critic network model;
inputting the agent's observed environment into the actor network model, and outputting the deployment action of the agent;
inputting the deployment action and the global state of the agent into the critic network model, and outputting an action evaluation value;
establishing a replay pool from the agent's current state information, action information, reward information and next-moment state information;
updating the network parameters in the actor network model and the critic network model with a multi-agent deep deterministic policy gradient algorithm, according to the current state information, action information, reward information and next-moment state information sampled from the replay pool;
and, when the actor network model and the critic network model have converged, determining the converged actor network model and critic network model as the multi-agent strategy deployment model.
In some embodiments of the application, the micro-service deployment problem is converted into a partially observable Markov decision process and solved with a multi-agent reinforcement learning method, using centralized training and distributed execution. In the training stage, the container instances of the micro-services act as agents and acquire global information to obtain the optimal deployment scheme; in the execution stage, a micro-service can be deployed relying only on its own observation space, which greatly reduces the communication overhead between micro-services.
Optionally, updating the network parameters in the actor network model and the critic network model includes:
acquiring a first loss function of the actor network model and a second loss function of the critic network model;
performing gradient calculation on the first loss function and the second loss function respectively;
and updating the network parameters in the actor network model and the critic network model using the gradient descent method.
Some embodiments of the application adopt a fixed-network method: the target network is held fixed, and the original network parameters are transferred to the target network at intervals, which avoids a constantly changing update target and ensures training stability.
Specifically, the partially observable Markov decision process in some embodiments of the present application, i.e. the node pre-selection, is represented as follows:
step 301: state space representation. The global state information includes the resource occupancy of the satellite nodes, the location information of the satellites, and the deployment of the containers, and may be represented as s= [ u, p, c ], where:
u=[u 1,1 ,u 1,2 ,...,u 1,R ,u 2,1 ,u 2,2 ,...,u 2,R ,...,u N,1 ,u N,2 ,...,u N,R ]
p=[x 1 ,y 1 ,z 1 ,x 2 ,y 2 ,z 2 ,...,x N ,y N ,z N ]
u i,j for node s i Upper resource r j Availability of [ x ] i ,y i ,z i ]For node s i Is provided with a coordinate of the position of (c),the index sequence number of the node is deployed for the container.
During deployment a container, acting as an agent, cannot acquire the global state information; the observation space of container instance $j$ of micro-service $i$ is denoted $o_i^j$.
Step 302: and (5) representing the action space. The action space of container instance j on microservice i is expressed ask is the number of nodes meeting the resource requirement in the observation space, when +.>When it indicates that the container is deployed to that node, otherwise 0./>
Step 303: state transfer function representation. The satellite position changes continuously with time, the intelligent body can make corresponding action according to the change of the position, the state changes, and the state transfer function can be expressed as
Step 304: the bonus function is represented. It is desirable that the resource utilization variance and delay be as small as possible, so the reward function can be expressed as
reward=-(βU+(1-β)D)
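A minimal sketch of the state vector and the reward computation, assuming $U$ and $D$ are obtained as in the earlier sketches (names hypothetical):

```python
import numpy as np

def global_state(u, positions, deploy_idx):
    """s = [u, p, c]: flattened utilization matrix, node coordinates, placements."""
    return np.concatenate([u.ravel(), positions.ravel(), deploy_idx.astype(float)])

def reward(U, D, beta=0.5):
    """Negative weighted sum of the utilization index U and the total delay D."""
    return -(beta * U + (1 - beta) * D)
```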
Step 4: model training is carried out on each parameter of an intelligent agent network model by adopting a multi-intelligent agent depth determination strategy gradient algorithm MADDPG, including building a neural network and updating network parameters;
the method comprises the following steps:
an agent network is built as shown in fig. 4. The container is considered as one agent, each agent comprising four networks, an Actor network μ (o i ;θ i ) Target actor network t_μ (o i ;θ i ) And Critic networks c (s, a; omega i ) A Targetcritic network t_c (s, a; omega i ). Comprising steps 401 to 402. The method of fixing the network is adopted, the Target network is fixed, the original network parameters are transmitted to the Target network at intervals, the continuous change of the updated Target is avoided, and the training stability is ensured.
Step 401: two network settings. The input of the Actor network, i.e. the Actor network model, is the local observation information o of the current intelligent agent i The resource occupation condition of the node is contained, and the output is deployment action a. The input of the Critic network, namely the criticizer network model, is the action and the global state output by the Actor network, namely the global state information s and the action a, and the output is the corresponding Q value for judging the quality of the action executed by the agent in the current state.
Step 402: network parameter delivery process. In parameter updating, if the updating target is changed continuously, the updating is difficult. Therefore, a fixed network method is adopted, the parameters of the Target network are fixed, the original network parameters are transmitted to the Target network at intervals, and the training stability is ensured.
Step five: building an experience playback pool D, randomly taking deployment actions by the agent according to the noise setting, generating a quadruple, namely recording state, action of the agent, next time state and rewards, and recording as(s) t ,a t ,r t ,s t+1 ). Since MADDPG algorithm is an exclusive strategy, the channel can be utilizedThe pool is played back to eliminate the correlation of the historical experience, the historical experience is broken up, and a batch of experience data is randomly selected when the neural network is trained, so that the neural network is trained better.
Step six: and executing MADDPG algorithm to update network parameters for centralized training. And randomly selecting four elements in the playback pool, and updating the Actor and Critic network parameters until convergence.
The updating process mainly comprises steps 601-602,
step 601: and updating the Actor network parameters. The loss function of the Actor network is-Q, -Q needs to be obtained by inputting the output action of the Actor network into the current Critic network, -Q is smaller and better. Observation space o for playback of agent i in pool i An Actor network μ (o i ;θ i ) In (a), a deployment action a is obtained i Then the global state information s and a i Inputting to Critic network to obtain Q value of the action, and updating network parameter theta by gradient descent with-Q as loss function i . In particular, the loss function may be expressed as
Wherein x= (o) 1 ,o 2 ,...,o N ) Representing the observation space of all agents, a i Representing agent i in its policy μ i The following actions. According to the chain law, its gradient can be expressed as
Updating the parameter θ using gradient descent i
Step 602: critic network parameters are updated. Critic network needs to make the predicted Q value as accurate as possible, so its loss function is Critic networkOutput Q(s) 0 ,a 0 ;ω i ) Sum of value (predicted value) and output Q value of Targetcritc network and prize r 1 +γQ(s 1 ,a 1 ;ω i ) The smaller the difference between (actual values), the better. The difference can be represented by MSE, and the network parameter omega is updated by gradient descent method i . In particular, the loss function may be expressed as
Calculating gradients
Updating parameter omega using gradient descent method i
The gradients of the two networks are computed from their respective loss functions, and the network parameters are updated using the gradient descent method.
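Putting steps 601 and 602 together, one update of a single agent's networks might be sketched as follows (illustrative; a full MADDPG critic takes all agents' observations and actions, which is elided here, and the batch entries are assumed to be stacked tensors):

```python
import torch
import torch.nn.functional as F

def maddpg_update(batch, actor, critic, target_actor, target_critic,
                  actor_opt, critic_opt, gamma=0.95):
    """One update step for a single agent from a sampled batch."""
    obs, state, action, reward_t, obs_next, state_next = batch

    # Step 602: Critic loss = MSE between predicted Q and r + gamma * target Q
    with torch.no_grad():                          # targets are held fixed
        y = reward_t + gamma * target_critic(state_next, target_actor(obs_next))
    critic_loss = F.mse_loss(critic(state, action), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Step 601: Actor loss = -Q of the action currently proposed by the actor
    actor_loss = -critic(state, actor(obs)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# optimizers, e.g.: actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
```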
Fig. 5 is a schematic flow chart of model training provided in an embodiment of the present application, as shown in fig. 5, including:
1) Initializing the networks and the satellite nodes to obtain the candidate node set;
2) The container agents randomly generate actions to obtain quadruples;
3) Storing the quadruples into the experience replay pool;
4) Randomly sampling quadruples from the pool;
5) Updating the Actor network and the Critic network according to the loss functions;
6) Updating the Target network parameters, i.e. the target network parameters;
7) Outputting the trained strategy network, i.e. the multi-agent strategy deployment model.
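Assembled into one loop (hypothetical env and agent interfaces, reusing the ReplayPool above), the flow of fig. 5 might read:

```python
def train(env, agents, pool, episodes=1000, batch_size=64, sync_every=100):
    """Centralized-training loop mirroring fig. 5; env/agent APIs are assumptions."""
    step = 0
    for _ in range(episodes):
        state, done = env.reset(), False              # 1) init networks and nodes
        while not done:
            actions = [ag.act(o, noise=True)          # 2) noisy exploratory actions
                       for ag, o in zip(agents, env.observations())]
            next_state, reward_t, done = env.step(actions)
            pool.push(state, actions, reward_t, next_state)   # 3) store quadruple
            if len(pool.buf) >= batch_size:
                batch = pool.sample(batch_size)       # 4) sample quadruples
                for ag in agents:
                    ag.update(batch)                  # 5) update Actor/Critic by loss
            if step % sync_every == 0:
                for ag in agents:
                    ag.sync_targets()                 # 6) copy params to Target nets
            state, step = next_state, step + 1
    return [ag.actor for ag in agents]                # 7) trained strategy networks
```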
Step 5: policy network deployment.
The trained policy network is deployed to the container agent, and the microservices can independently make optimal decisions based on local observations.
The satellite network micro-service deployment method based on multi-agent reinforcement learning is provided to solve the micro-service deployment problem. The neural network part adopts a fixed-network method and is divided into an original network and a target network: the target network is first held fixed, and the original network parameters are transferred to the target network at intervals, which avoids the update difficulty caused by a constantly changing update target and ensures training stability. After training is complete, an agent only needs its own observation space to take the best action with the strategy network, which reduces the overhead caused by frequent interaction between micro-services.
It should be noted that, in this embodiment, each of the possible embodiments may be implemented separately, or may be implemented in any combination without conflict, which is not limited to the implementation of the present application.
Another embodiment of the present application provides a satellite network micro-service deployment device based on multi-agent reinforcement learning, which is configured to execute the satellite network micro-service deployment method based on multi-agent reinforcement learning provided in the foregoing embodiment.
Fig. 6 is a schematic structural diagram of a satellite network micro-service deployment device based on multi-agent reinforcement learning according to an embodiment of the present application. The satellite network micro-service deployment device based on multi-agent reinforcement learning comprises an acquisition module 601, a first determination module 602, a second determination module 603 and a configuration module 604, wherein:
the acquisition module 601 is configured to acquire resource requirement information of a micro service;
the first determining module 602 is configured to determine resource utilization information and delay information of the satellite node according to a pre-established resource utilization model and delay model and configuration information of the satellite node;
the second determining module 603 is configured to determine the deployment strategy of the satellite node corresponding to the resource demand information of the micro-service using a pre-trained multi-agent strategy deployment model when the resource utilization rate information is smaller than a first preset value or the time delay information is smaller than a second preset value, where the pre-trained multi-agent strategy deployment model is obtained by training each parameter of an agent network model with a multi-agent deep deterministic policy gradient algorithm;
The configuration module 604 is configured to configure the server terminal according to a deployment policy of the satellite node.
The specific manner in which the individual modules perform the operations of the apparatus of this embodiment has been described in detail in connection with embodiments of the method and will not be described in detail herein.
According to some embodiments of the application, a resource utilization rate model and a time delay model are established based on the micro-service architecture, the resource utilization rate information and time delay information of the satellite nodes are determined from the configuration information of the satellite nodes, the deployment strategy of the satellite nodes corresponding to the resource demand information of the micro-services is then determined with a pre-trained multi-agent strategy deployment model, and the server terminal is configured according to that deployment strategy. Resources can thus be configured on each satellite node according to the resource demand of the micro-services on the server and the remaining resources of each satellite node, i.e. micro-services with different resource demands are deployed onto suitable satellite nodes, which improves the resource utilization balance of the satellite nodes, reduces the invocation delay, and improves configuration efficiency.
The satellite network micro-service deployment device based on multi-agent reinforcement learning provided by the above embodiment is described further in the following embodiment.
Optionally, the apparatus further comprises a model training module for:
obtaining agent sample parameters, wherein the agent sample parameters at least comprise the agent's observed environment;
acquiring an agent network model, wherein the agent network model at least comprises an actor network model and a critic network model;
inputting the agent's observed environment into the actor network model, and outputting the deployment action of the agent;
inputting the deployment action and the global state of the agent into the critic network model, and outputting an action evaluation value;
establishing a replay pool from the agent's current state information, action information, reward information and next-moment state information;
updating the network parameters in the actor network model and the critic network model with a multi-agent deep deterministic policy gradient algorithm, according to the current state information, action information, reward information and next-moment state information sampled from the replay pool;
and, when the actor network model and the critic network model have converged, determining the converged actor network model and critic network model as the multi-agent strategy deployment model.
According to the method and apparatus of the embodiments, the micro-service deployment problem is converted into a partially observable Markov decision process and solved with multi-agent reinforcement learning under a centralized-training, distributed-execution paradigm. In the training stage, each container instance of a micro-service acts as an agent with access to global information, from which an optimal deployment scheme is obtained; in the execution stage, each micro-service can be deployed relying only on its own observation space, which greatly reduces the communication overhead among micro-services. A minimal sketch of this training loop, under assumed interfaces, follows.
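The sketch below illustrates only the centralized-training, distributed-execution structure; the toy environment, placeholder networks, and all sizes are assumptions for illustration and are not specified in the disclosure:

```python
import random
from collections import deque

NUM_AGENTS, OBS_DIM, BATCH, EPISODES, STEPS = 3, 4, 32, 10, 20

class Net:
    """Placeholder network with a no-op update; a real MADDPG
    implementation would use neural networks here."""
    def __call__(self, x):
        return random.random()          # placeholder forward pass (an action)
    def update(self, batch):
        pass                            # placeholder gradient step

class ToyEnv:
    def reset(self):
        return [[0.0] * OBS_DIM for _ in range(NUM_AGENTS)]
    def step(self, actions):
        next_obs = [[random.random()] * OBS_DIM for _ in range(NUM_AGENTS)]
        rewards = [random.random() for _ in range(NUM_AGENTS)]
        global_state = [x for o in next_obs for x in o]   # concatenated views
        return next_obs, rewards, global_state

actors = [Net() for _ in range(NUM_AGENTS)]
critics = [Net() for _ in range(NUM_AGENTS)]
env, replay_pool = ToyEnv(), deque(maxlen=100_000)

for _ in range(EPISODES):
    obs = env.reset()
    for _ in range(STEPS):
        # Distributed execution: each agent acts on its own observation only.
        actions = [actors[i](obs[i]) for i in range(NUM_AGENTS)]
        next_obs, rewards, global_state = env.step(actions)
        # Store (s, a, r, s') plus the global state for the centralized critics.
        replay_pool.append((obs, actions, rewards, next_obs, global_state))
        obs = next_obs
    # Centralized training: critics may use joint actions and the global state.
    if len(replay_pool) >= BATCH:
        batch = random.sample(list(replay_pool), BATCH)
        for i in range(NUM_AGENTS):
            critics[i].update(batch)    # minimize TD error (second loss)
            actors[i].update(batch)     # policy gradient via critic (first loss)
```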
Optionally, the model training module is configured to:
obtain a first loss function of the actor network model and a second loss function of the critic network model;
compute gradients of the first loss function and the second loss function respectively;
and update the network parameters of the actor network model and the critic network model by gradient descent.
Some embodiments of the application adopt a fixed-network method: a target network is held fixed, and the parameters of the original network are copied into it at intervals. This prevents the update target from changing continuously and keeps the training stable.
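As a minimal sketch, assuming plain Python lists stand in for the network parameter vectors and an assumed refresh interval K, the periodic hard update described above might read:

```python
def hard_update(target_params, online_params):
    """Copy the online-network parameters into the fixed target network."""
    for i, p in enumerate(online_params):
        target_params[i] = p

K = 100                                   # refresh interval (assumed value)
online, target = [0.5, -0.2], [0.0, 0.0]  # toy parameter vectors
for step in range(1, 1001):
    online = [p + 0.001 for p in online]  # stand-in for one gradient step
    if step % K == 0:
        hard_update(target, online)       # TD targets stay fixed in between
```

A soft update (Polyak averaging), which blends the two parameter sets a little each step instead of copying periodically, is a common alternative in MADDPG-style implementations.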
Optionally, the configuration information of the satellite nodes includes at least the number of satellite nodes, the total number of resource types, and the heterogeneous resource capacities.
Optionally, the model training module is configured to:
obtain resource balance information from a first resource utilization model for different types of resources on the same satellite node, and node balance information from a second resource utilization model for the same type of resource on different satellite nodes;
and determine the resource utilization model from the resource balance information and its corresponding weight value, together with the node balance information and its corresponding weight value.
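For illustration only, one plausible reading of this weighted combination, using variance as the balance measure (consistent with the variance-minimization objective stated below) and assumed weight values, is:

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def utilization_score(node_usage, w_resource=0.5, w_node=0.5):
    """node_usage[n][r]: utilization of resource type r on satellite node n.

    Resource balance: spread of different resource types within one node.
    Node balance: spread of one resource type across different nodes.
    Both terms are combined with (assumed) weight values into one score.
    """
    resource_balance = sum(variance(node) for node in node_usage) / len(node_usage)
    num_types = len(node_usage[0])
    node_balance = sum(
        variance([node[r] for node in node_usage]) for r in range(num_types)
    ) / num_types
    return w_resource * resource_balance + w_node * node_balance

# Example: 3 nodes, 2 resource types (e.g. CPU, memory), utilizations in [0, 1].
print(utilization_score([[0.6, 0.4], [0.5, 0.5], [0.9, 0.1]]))
```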
Optionally, the delay model comprises at least a transmission delay sub-model, a propagation delay sub-model, and a migration delay sub-model.
Some embodiments of the present application build a resource utilization model and a delay model, minimize the resource utilization variance and the delay, and cast the micro-service deployment problem as a multi-objective optimization problem.
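For illustration only, the three delay sub-models might compose as follows; the formulas are standard networking approximations assumed here, not quoted from the disclosure:

```python
SPEED_OF_LIGHT = 3e8  # m/s, free-space propagation between satellites

def transmission_delay(data_bits, link_rate_bps):
    """Time to push the data onto the inter-satellite link."""
    return data_bits / link_rate_bps

def propagation_delay(distance_m):
    """Time for the signal to traverse the link distance."""
    return distance_m / SPEED_OF_LIGHT

def migration_delay(image_bits, link_rate_bps, startup_s):
    """Time to move a micro-service container image and restart it."""
    return image_bits / link_rate_bps + startup_s

def total_delay(data_bits, image_bits, link_rate_bps, distance_m, startup_s=1.0):
    return (transmission_delay(data_bits, link_rate_bps)
            + propagation_delay(distance_m)
            + migration_delay(image_bits, link_rate_bps, startup_s))

# Example: 1 MB request, 200 MB image, 100 Mbps link, 2000 km hop.
print(total_delay(8e6, 1.6e9, 1e8, 2e6))
```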
It should be noted that the optional embodiments above may be implemented separately or in any combination where no conflict arises; the present application is not limited in this respect.
The embodiment of the application also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program can implement the operations of the method corresponding to any embodiment of the satellite network micro-service deployment method based on multi-agent reinforcement learning provided above.
The embodiment of the application also provides a computer program product comprising a computer program; when executed by a processor, the computer program can implement the operations of the method corresponding to any embodiment of the satellite network micro-service deployment method based on multi-agent reinforcement learning provided above.
As shown in fig. 7, some embodiments of the present application provide an electronic device 700, the electronic device 700 comprising a memory 710, a processor 720, and a computer program stored on the memory 710 and executable on the processor 720; when the processor 720 reads the program from the memory 710 over the bus 730 and executes it, the method of any of the embodiments included in the satellite network micro-service deployment method based on multi-agent reinforcement learning described above can be implemented.
Processor 720 may process digital signals and may include various computing architectures, such as a complex instruction set computer (CISC) architecture, a reduced instruction set computer (RISC) architecture, or an architecture implementing a combination of instruction sets. In some examples, processor 720 may be a microprocessor.
Memory 710 may be used to store instructions to be executed by processor 720, or data related to the execution of the instructions. Such instructions and/or data may include code implementing some or all of the functions of one or more of the modules described in the embodiments of the present application. The processor 720 of the disclosed embodiments may be configured to execute the instructions in memory 710 to implement the methods shown above. Memory 710 may include dynamic random access memory, static random access memory, flash memory, optical memory, or other memory known to those skilled in the art.
It should be noted that like reference numerals and letters denote like items in the figures, so that once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
The foregoing is merely specific embodiments of the present application, and the scope of protection of the present application is not limited thereto. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application, and any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed herein, shall fall within the scope of protection of the present application, which shall be subject to the scope of the claims.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

Claims (10)

1. A satellite network micro-service deployment method based on multi-agent reinforcement learning, characterized by comprising the following steps:
acquiring resource demand information of the micro service;
determining resource utilization information and delay information of the satellite node according to a pre-established resource utilization model, a pre-established delay model, and configuration information of the satellite node;
when the resource utilization information is less than a first preset value or the delay information is less than a second preset value, determining a deployment policy of a satellite node corresponding to the resource demand information of the micro service by using a pre-trained multi-agent policy deployment model, wherein the pre-trained multi-agent policy deployment model is obtained by training the parameters of an agent network model with a multi-agent deep deterministic policy gradient (MADDPG) algorithm;
and configuring the server terminal according to the deployment policy of the satellite node.
2. The satellite network micro-service deployment method based on multi-agent reinforcement learning according to claim 1, wherein the multi-agent policy deployment model is obtained by:
obtaining agent sample parameters, wherein the agent sample parameters include at least the agent's observed environment;
obtaining an agent network model, wherein the agent network model comprises at least an actor network model and a critic network model;
inputting the agent's observed environment into the actor network model and outputting the agent's deployment action;
inputting the agent's deployment action and the global state into the critic network model and outputting an action evaluation value;
building a replay pool from the agent's current state information, action information, reward information, and next-moment state information;
updating the network parameters of the actor network model and the critic network model with a multi-agent deep deterministic policy gradient algorithm, using the current state information, action information, reward information, and next-moment state information sampled from the replay pool;
and, when the actor network model and the critic network model converge, determining the converged actor network model and critic network model as the multi-agent policy deployment model.
3. The satellite network micro-service deployment method based on multi-agent reinforcement learning according to claim 2, wherein updating the network parameters of the actor network model and the critic network model comprises:
obtaining a first loss function of the actor network model and a second loss function of the critic network model;
computing gradients of the first loss function and the second loss function respectively;
and updating the network parameters of the actor network model and the critic network model by gradient descent.
4. The satellite network micro-service deployment method based on multi-agent reinforcement learning according to claim 1, wherein the configuration information of the satellite nodes includes at least the number of satellite nodes, the total number of resource types, and the heterogeneous resource capacities.
5. The satellite network micro-service deployment method based on multi-agent reinforcement learning according to claim 1, wherein the resource utilization model is obtained by:
obtaining resource balance information from a first resource utilization model for different types of resources on the same satellite node, and node balance information from a second resource utilization model for the same type of resource on different satellite nodes;
and determining the resource utilization model from the resource balance information and its corresponding weight value, together with the node balance information and its corresponding weight value.
6. The satellite network micro-service deployment method based on multi-agent reinforcement learning of claim 1, wherein the delay model comprises at least a transmission delay sub-model, a propagation delay sub-model, and a migration delay sub-model.
7. A satellite network micro-service deployment device based on multi-agent reinforcement learning, the device comprising:
the acquisition module is configured to acquire resource demand information of the micro service;
the first determining module is configured to determine resource utilization information and delay information of the satellite node according to a pre-established resource utilization model, a pre-established delay model, and configuration information of the satellite node;
the second determining module is configured to determine, when the resource utilization information is less than a first preset value or the delay information is less than a second preset value, a deployment policy of a satellite node corresponding to the resource demand information of the micro service by using a pre-trained multi-agent policy deployment model, wherein the pre-trained multi-agent policy deployment model is obtained by training the parameters of an agent network model with a multi-agent deep deterministic policy gradient algorithm;
and the configuration module is configured to configure the server terminal according to the deployment policy of the satellite node.
8. The satellite network micro-service deployment device based on multi-agent reinforcement learning according to claim 7, further comprising a model training module configured to:
obtain agent sample parameters, wherein the agent sample parameters include at least the agent's observed environment;
obtain an agent network model, wherein the agent network model comprises at least an actor network model and a critic network model;
input the agent's observed environment into the actor network model and output the agent's deployment action;
input the agent's deployment action and the global state into the critic network model and output an action evaluation value;
build a replay pool from the agent's current state information, action information, reward information, and next-moment state information;
update the network parameters of the actor network model and the critic network model with a multi-agent deep deterministic policy gradient algorithm, using the current state information, action information, reward information, and next-moment state information sampled from the replay pool;
and, when the actor network model and the critic network model converge, determine the converged actor network model and critic network model as the multi-agent policy deployment model.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the multi-agent reinforcement learning-based satellite network micro-service deployment method of any one of claims 1-6.
10. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the program, when executed by a processor, implements the satellite network micro-service deployment method based on multi-agent reinforcement learning according to any one of claims 1 to 6.
CN202311360363.1A 2023-10-19 2023-10-19 Satellite network micro-service deployment method and device based on multi-agent reinforcement learning Pending CN117811907A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311360363.1A CN117811907A (en) 2023-10-19 2023-10-19 Satellite network micro-service deployment method and device based on multi-agent reinforcement learning

Publications (1)

Publication Number Publication Date
CN117811907A true CN117811907A (en) 2024-04-02

Family

ID=90432359

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118331591A (en) * 2024-06-11 2024-07-12 之江实验室 Method, device, storage medium and equipment for deploying intelligent algorithm on satellite


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination