CN117632505A - Heterogeneous computing power intelligent scheduling system and method

Publication number
CN117632505A
CN117632505A CN202311666674.0A CN202311666674A CN117632505A CN 117632505 A CN117632505 A CN 117632505A CN 202311666674 A CN202311666674 A CN 202311666674A CN 117632505 A CN117632505 A CN 117632505A
Authority
CN
China
Prior art keywords
cluster
crd
scheduling
policy
computing power
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311666674.0A
Other languages
Chinese (zh)
Inventor
范彬
谭哲
洪晓生
于顺治
吴荣兵
李超
钱丽丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Information Technology Co Ltd
Application filed by China Mobile Communications Group Co Ltd, China Mobile Information Technology Co Ltd
Priority to CN202311666674.0A
Publication of CN117632505A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005: Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027: Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5061: Partitioning or combining of resources
    • G06F9/5072: Grid computing
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Power Sources (AREA)

Abstract

The invention belongs to the technical field of computing power task scheduling and discloses a heterogeneous computing power intelligent scheduling system and method. The system includes a computing power scheduling module, a cloud-native storage ETCD module, and a third-party extension policy module. The cloud-native storage ETCD module stores a heterogeneous computing power resource CRD, a scheduling policy CRD, a computing power task CRD, and a cluster CRD. The third-party extension policy module registers a third-party extension policy with the cloud-native storage ETCD module through the computing power task interface provided by the heterogeneous computing power resource CRD. The cloud-native storage ETCD module updates the scheduling policy CRD according to the third-party extension policy to obtain an updated scheduling policy CRD. The computing power scheduling module determines a target cluster in the cluster CRD according to the updated scheduling policy CRD. The invention makes heterogeneous computing power scheduling compatible and extensible.

Description

Heterogeneous computing power intelligent scheduling system and method

Technical field

The present invention relates to the technical field of computing power task scheduling, and in particular to a heterogeneous computing power intelligent scheduling system and method.

Background

As digital transformation deepens across the production and operation of every industry, industry terminals generate massive amounts of raw data, and large amounts of computing power are urgently needed to process it. To meet business-layer requirements for second-level response, low latency, and business continuity, computing power clusters are commonly used to provide a highly available, high-throughput, and highly scalable technical architecture. As computing power clusters are built out year after year, large numbers of heterogeneous computing power cluster resources come to exist within the same data center or across different data centers. How to manage and schedule these resources on a unified platform, so that users can conveniently invoke computing power according to their needs and usage habits, is a problem that urgently needs to be solved. For intelligent scheduling of heterogeneous computing power, current mainstream technical solutions each customize a set of intelligent scheduling methods for a specific scenario. Such a solution cannot meet the heterogeneous computing power scheduling needs of other scenarios, and existing solutions do not provide extensibility. Existing scheduling solutions therefore lack compatibility and extensibility.

Summary of the invention

The main purpose of the present invention is to provide a heterogeneous computing power intelligent scheduling system and method, aiming to solve the technical problem that scheduling solutions in the prior art lack compatibility and extensibility.

To achieve the above purpose, the present invention provides a heterogeneous computing power intelligent scheduling system. The system includes a computing power scheduling module, a cloud-native storage ETCD module, and a third-party extension policy module. The cloud-native storage ETCD module stores a heterogeneous computing power resource CRD, a scheduling policy CRD, a computing power task CRD, and a cluster CRD, and the scheduling policy CRD is built into the system by means of the unified model and interface library provided by the heterogeneous computing power resource CRD; wherein,

the third-party extension policy module is used to register a third-party extension policy with the cloud-native storage ETCD module through the computing power task interface provided by the heterogeneous computing power resource CRD;

the cloud-native storage ETCD module is used to update the scheduling policy CRD according to the third-party extension policy to obtain an updated scheduling policy CRD;

the computing power scheduling module is used to determine a target cluster in the cluster CRD according to the updated scheduling policy CRD and to bind the computing power task to the target cluster.

Optionally, the heterogeneous computing power resource CRD is defined based on cloud-native technology and includes an interface abstraction layer that uniformly handles unit adaptation of computing power resources.

Optionally, the interface abstraction layer includes a Deep Copy interface, a Canonicalize base-normalization conversion interface, an ADD addition interface, a SUB subtraction interface, a CMP magnitude-comparison interface, and an Equal equality-test interface.

Optionally, the heterogeneous computing power resource CRD exposes unified create, delete, update, and query interfaces through the k8s apiserver.

Optionally, the updated scheduling policy CRD includes built-in policies and third-party extension policies. The built-in policies are loaded into the Cache during the running phase of the computing power scheduling module, and the third-party extension policies are read from the cloud-native storage ETCD module by dynamic loading during the running phase of the computing power scheduling module.

In addition, to achieve the above purpose, the present invention also proposes a heterogeneous computing power intelligent scheduling method, which includes:

registering a third-party extension policy with the cloud-native storage ETCD module through the computing power task interface provided by the heterogeneous computing power resource CRD;

updating the scheduling policy CRD in the cloud-native storage ETCD module according to the third-party extension policy to obtain an updated scheduling policy CRD;

determining a target cluster in the cluster CRD according to the updated scheduling policy CRD, and binding the computing power task to the target cluster.

Optionally, the updated scheduling policy CRD includes built-in policies and third-party extension policies, and determining the target cluster in the cluster CRD according to the updated scheduling policy CRD and binding the computing power task to the target cluster includes:

after a computing power task is created according to the computing power task CRD in the cloud-native storage ETCD module, triggering the computing power scheduling module to start a scheduling workflow, where the scheduling workflow includes a pre-selection stage, a prioritization stage, and a binding stage;

in the pre-selection stage, traversing the built-in policies and filtering out, from the cluster CRD, a first cluster set that satisfies all built-in policies;

in the prioritization stage, filtering out a second cluster set from the first cluster set according to the third-party extension policies;

in the binding stage, determining the target cluster in the second cluster set and binding the computing power task to the target cluster.
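
The three-stage workflow above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Cluster` and `Task` fields, the policy signatures, and the way scores are combined (a plain sum) are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Cluster:
    # Hypothetical cluster attributes; the source does not list any fields.
    name: str
    free_gpus: int
    region: str


@dataclass
class Task:
    name: str
    gpus: int
    bound_to: Optional[str] = None


BuiltInPolicy = Callable[[Task, Cluster], bool]     # pass/fail filter
ExtensionPolicy = Callable[[Task, Cluster], float]  # numeric score


def preselect(task: Task, clusters: List[Cluster],
              built_in: List[BuiltInPolicy]) -> List[Cluster]:
    """Pre-selection: keep clusters that satisfy every built-in policy."""
    return [c for c in clusters if all(p(task, c) for p in built_in)]


def prioritize(task: Task, first: List[Cluster],
               extensions: List[ExtensionPolicy],
               threshold: float) -> Dict[str, float]:
    """Prioritization: score each cluster, keep scores above the threshold."""
    scores = {c.name: sum(p(task, c) for p in extensions) for c in first}
    return {name: s for name, s in scores.items() if s > threshold}


def bind(task: Task, second: Dict[str, float]) -> Optional[str]:
    """Binding: pick the highest-scoring cluster, if any remain."""
    if not second:
        return None
    task.bound_to = max(second, key=second.get)
    return task.bound_to


def schedule(task: Task, clusters: List[Cluster],
             built_in: List[BuiltInPolicy],
             extensions: List[ExtensionPolicy],
             threshold: float = 0.0) -> Optional[str]:
    first = preselect(task, clusters, built_in)
    second = prioritize(task, first, extensions, threshold)
    return bind(task, second)
```

In this sketch a task that no cluster can host simply stays unbound (`schedule` returns `None`); the source does not say how that case is handled.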

Optionally, filtering out the second cluster set from the first cluster set according to the third-party extension policies includes:

S301. Calling the /schedule HTTPS interface of the third-party Operator to execute a target policy among the third-party extension policies, and scoring the clusters in the first cluster set that satisfy the target policy to obtain a scheduling result;

S302. Determining whether a new target policy exists among the third-party extension policies;

S303. If so, taking the new target policy as the target policy of step S301 and repeating steps S301 to S303; if not, filtering out the second cluster set from the first cluster set based on the scheduling result.
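
Steps S301 to S303 amount to a loop over the registered extension policies. The sketch below shows one way to write that loop in Python. Each policy is represented by a plain callable standing in for the HTTPS call to a third-party `/schedule` endpoint, and accumulating per-cluster scores by addition is an assumption, since the source does not say how scores from several policies are combined.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-in for an HTTPS POST to a third-party /schedule endpoint:
# it receives the candidate cluster names and returns a score for every
# cluster that satisfies the policy (other clusters are simply absent).
ScheduleCall = Callable[[List[str]], Dict[str, float]]


def run_extension_policies(first_clusters: List[str],
                           policies: Iterable[ScheduleCall]) -> Dict[str, float]:
    """Execute each extension policy in turn and accumulate the scores."""
    result: Dict[str, float] = {}
    pending = iter(policies)
    target = next(pending, None)
    while target is not None:
        # S301: execute the target policy and score the satisfying clusters.
        for name, score in target(first_clusters).items():
            if name in first_clusters:  # ignore clusters outside the first set
                result[name] = result.get(name, 0.0) + score
        # S302/S303: continue with the next policy, if one exists.
        target = next(pending, None)
    return result
```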

Optionally, filtering out the second cluster set from the first cluster set based on the scheduling result includes:

setting a preset threshold;

determining the weight score of each cluster in the first cluster set according to the scheduling result;

taking the clusters in the first cluster set whose weight scores are greater than the preset threshold as the second cluster set.
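
As a small illustration of the threshold step, the following Python sketch keeps only clusters whose score is strictly greater than the threshold, matching the "greater than" wording above. The function name and the dictionary representation of the scheduling result are assumptions.

```python
from typing import Dict, List


def select_second_clusters(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the names of clusters whose weight score exceeds the threshold."""
    # Strict comparison: a cluster scoring exactly the threshold is dropped.
    return [name for name, score in scores.items() if score > threshold]
```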

Optionally, determining the target cluster in the second cluster set and binding the computing power task to the target cluster includes:

determining the weight score of each cluster in the second cluster set according to the scheduling result;

taking the cluster with the highest weight score in the second cluster set as the target cluster, and binding the computing power task to the target cluster.
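
The binding step reduces to selecting the maximum score. The Python sketch below assumes the task is a plain dictionary and records the binding under a hypothetical `cluster` key; breaking ties by cluster name is also an assumption, since the source does not address ties.

```python
from typing import Dict


def bind_task(task: Dict[str, str], scores: Dict[str, float]) -> Dict[str, str]:
    """Bind the task to the highest-scoring cluster in the second cluster set."""
    if not scores:
        raise ValueError("no candidate cluster left to bind to")
    # max() returns the first maximal element, so iterating over sorted names
    # makes ties resolve to the alphabetically first cluster.
    target = max(sorted(scores), key=scores.__getitem__)
    bound = dict(task)  # leave the caller's task object untouched
    bound["cluster"] = target
    return bound
```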

In the heterogeneous computing power intelligent scheduling system and method proposed by the present invention, the system includes a computing power scheduling module, a cloud-native storage ETCD module, and a third-party extension policy module. The cloud-native storage ETCD module stores a heterogeneous computing power resource CRD, a scheduling policy CRD, a computing power task CRD, and a cluster CRD, and the scheduling policy CRD is built into the system by means of the unified model and interface library provided by the heterogeneous computing power resource CRD. The third-party extension policy module registers a third-party extension policy with the cloud-native storage ETCD module through the computing power task interface provided by the heterogeneous computing power resource CRD; the cloud-native storage ETCD module updates the scheduling policy CRD according to the third-party extension policy to obtain an updated scheduling policy CRD; and the computing power scheduling module determines a target cluster in the cluster CRD according to the updated scheduling policy CRD and binds the computing power task to the target cluster. With this system, a unified framework for intelligent scheduling of heterogeneous computing power, combined with Kubernetes Operators, CRDs, and built-in policies, makes heterogeneous computing power scheduling policies compatible and extensible.

Brief description of the drawings

Figure 1 is a structural block diagram of the first embodiment of the heterogeneous computing power intelligent scheduling system of the present invention;

Figure 2 is a schematic diagram of heterogeneous computing power scheduling in the first embodiment of the heterogeneous computing power intelligent scheduling system of the present invention;

Figure 3 is a schematic flowchart of the first embodiment of the heterogeneous computing power intelligent scheduling method of the present invention;

Figure 4 is a schematic flowchart of determining the second cluster set in the first embodiment of the heterogeneous computing power intelligent scheduling method of the present invention;

Figure 5 is a scheduling flowchart in the first embodiment of the heterogeneous computing power intelligent scheduling method of the present invention.

The realization of the purpose, the functional features, and the advantages of the present invention will be further described with reference to the embodiments and the accompanying drawings.

Detailed description

It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

It should be noted that ETCD denotes a distributed key-value storage system. A CRD (CustomResourceDefinition) is a Kubernetes mechanism for extending the Kubernetes API that allows users to define custom resources. An Operator Controller is a custom controller used to manage and automate a specific type of application or resource in a Kubernetes cluster.

The present invention proposes a heterogeneous computing power intelligent scheduling system.

Referring to Figure 1, in an embodiment of the present invention, the heterogeneous computing power intelligent scheduling system includes a computing power scheduling module 10, a cloud-native storage ETCD module 20, and a third-party extension policy module 30. The cloud-native storage ETCD module 20 stores a heterogeneous computing power resource CRD, a scheduling policy CRD, a computing power task CRD, and a cluster CRD, and the scheduling policy CRD is built into the system by means of the unified model and interface library provided by the heterogeneous computing power resource CRD. The third-party extension policy module 30 is used to register a third-party extension policy with the cloud-native storage ETCD module 20 through the computing power task interface provided by the heterogeneous computing power resource CRD. The cloud-native storage ETCD module 20 is used to update the scheduling policy CRD according to the third-party extension policy to obtain an updated scheduling policy CRD. The computing power scheduling module 10 is used to determine a target cluster in the cluster CRD according to the updated scheduling policy CRD and to bind the computing power task to the target cluster.

It should be noted that scheduling policies include built-in policies and third-party extension policies, and are stored in the cloud-native storage ETCD module in the CRD data model. The updated scheduling policy CRD refers to the scheduling policy CRD after it has been updated. Computing power tasks are stored in the cloud-native storage ETCD module in the CRD data model, and multi-cluster information is stored there in the CRD data model after it is reported. The computing power scheduling module refers to the computing power scheduling Operator Controller. The third-party extension policy module refers to the third-party extension policy Operator Controller, which must implement the /schedule HTTP interface; this module is mainly used to register third-party extension policies in the scheduling policy CRD of the cloud-native storage ETCD module.
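
To make the `/schedule` requirement concrete, here is a minimal Python sketch of a third-party policy endpoint built on the standard library's `http.server`. The source names only the `/schedule` path; the JSON request and response shapes, the field names (`task`, `clusters`, `free_gpus`, `scores`), and the scoring rule are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler


def score_clusters(payload: dict) -> dict:
    """Hypothetical policy: score = free GPUs left after placing the task."""
    need = payload["task"]["gpus"]
    scores = {}
    for cluster in payload["clusters"]:
        if cluster["free_gpus"] >= need:  # clusters that cannot fit get no score
            scores[cluster["name"]] = float(cluster["free_gpus"] - need)
    return {"scores": scores}


class ScheduleHandler(BaseHTTPRequestHandler):
    """Serves POST /schedule; every other path returns 404."""

    def do_POST(self):
        if self.path != "/schedule":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(score_clusters(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

The handler can be served with `http.server.HTTPServer`; wrapping the listening socket with an `ssl.SSLContext` would give the HTTPS transport the description calls for. Keeping the scoring logic in a plain function separate from the handler makes the policy testable without a network.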

It can be understood that third-party extension policies can be registered at any time while the heterogeneous computing power intelligent scheduling system is running, which triggers dynamic loading. The target cluster may be the best-matching cluster in the cluster CRD and serves as the cluster on which the current computing power task is to run.

It should be noted that the scheduling policy CRD is updated dynamically and in real time according to the third-party extension policy, which can be implemented with an asynchronous watch mechanism. Specifically, a third-party user updates the scheduling policy CRD through the third-party extension policy module in two steps: 1. Using the unified model and interface library provided by the heterogeneous computing power resource CRD, the third party implements an extension computing power scheduling Operator Controller that exposes the standard HTTPS computing power task scheduling interface (/schedule); HTTPS protects the communication and the data. 2. The third party registers the extension policy with the cloud-native storage ETCD, which triggers an update of the policy Cache of the computing power scheduling Operator Controller. The registration includes not only the extension policy but also the callback access address, so that the computing power scheduling Operator Controller knows the address at which to call the /schedule interface.
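
The registration and watch flow described above can be sketched in Python as follows. `PolicyStore` is an in-memory stand-in for the scheduling policy CRD kept in ETCD, and `SchedulerCache` stands in for the policy Cache of the computing power scheduling Operator Controller; both names are hypothetical. For brevity the notification is delivered synchronously, whereas the description refers to an asynchronous watch.

```python
from typing import Callable, Dict, List

Watcher = Callable[[str, str], None]  # (policy name, callback URL)


class PolicyStore:
    """In-memory stand-in for the scheduling policy CRD stored in ETCD."""

    def __init__(self) -> None:
        self._policies: Dict[str, str] = {}
        self._watchers: List[Watcher] = []

    def watch(self, callback: Watcher) -> None:
        """Subscribe to policy registrations."""
        self._watchers.append(callback)

    def register(self, name: str, callback_url: str) -> None:
        """Store the policy with its callback address and notify watchers."""
        self._policies[name] = callback_url
        for notify in self._watchers:
            notify(name, callback_url)


class SchedulerCache:
    """Stand-in for the scheduler's policy cache, refreshed via the watch."""

    def __init__(self, store: PolicyStore) -> None:
        self.extension_endpoints: Dict[str, str] = {}
        store.watch(self._on_policy)

    def _on_policy(self, name: str, callback_url: str) -> None:
        # Remember where to call /schedule for this extension policy.
        self.extension_endpoints[name] = callback_url
```

Registering the callback address together with the policy is what lets the scheduler reach the third party's `/schedule` endpoint without any static configuration.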

In one embodiment, the heterogeneous computing power resource CRD is defined based on cloud-native technology and includes an interface abstraction layer that uniformly handles unit adaptation of computing power resources.

It should be noted that the interface abstraction layer uniformly handles unit adaptation for computing power resources expressed in binary, decimal, floating-point, and similar forms, and covers the adaptation of all known computing power resource units.

In this embodiment, support for different numeral bases and units makes all heterogeneous computing power resources compatible, which solves the problem of a unified model and metric for heterogeneous computing power.

In one embodiment, the interface abstraction layer includes a Deep Copy interface, a Canonicalize base-normalization conversion interface, an ADD addition interface, a SUB subtraction interface, a CMP magnitude-comparison interface, and an Equal equality-test interface.

In this embodiment, the Deep Copy, Canonicalize, ADD, SUB, CMP, and Equal interfaces of the interface abstraction layer solve the problems of converting, adding, subtracting, and comparing heterogeneous computing power resources.
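
A possible shape for such an interface abstraction layer is sketched below in Python. The six method names mirror the interfaces listed above; everything else is an assumption, including the suffix table (binary `Ki`/`Mi`/`Gi` and decimal `k`/`M`/`G`), the exact `Fraction` representation, and the choice to canonicalize toward binary suffixes.

```python
from __future__ import annotations

import copy
from dataclasses import dataclass
from fractions import Fraction

# Assumed suffix table: binary multiples of 1024 and decimal multiples of 1000.
_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,
             "k": 1000, "M": 1000**2, "G": 1000**3}


@dataclass
class Quantity:
    value: Fraction  # exact amount in base units

    @classmethod
    def parse(cls, text: str) -> Quantity:
        """Parse strings such as '512Mi', '1.5Gi', '2k' or '1000'."""
        for suffix in sorted(_SUFFIXES, key=len, reverse=True):
            if text.endswith(suffix):
                return cls(Fraction(text[:-len(suffix)]) * _SUFFIXES[suffix])
        return cls(Fraction(text))

    def deep_copy(self) -> Quantity:
        return copy.deepcopy(self)

    def canonicalize(self) -> str:
        """Render with the largest binary suffix that gives an integer."""
        for suffix in ("Gi", "Mi", "Ki"):
            scaled = self.value / _SUFFIXES[suffix]
            if scaled.denominator == 1 and scaled != 0:
                return f"{scaled.numerator}{suffix}"
        return str(self.value)

    def add(self, other: Quantity) -> Quantity:
        return Quantity(self.value + other.value)

    def sub(self, other: Quantity) -> Quantity:
        return Quantity(self.value - other.value)

    def cmp(self, other: Quantity) -> int:
        """Return -1, 0 or 1, like a three-way comparison."""
        return (self.value > other.value) - (self.value < other.value)

    def equal(self, other: Quantity) -> bool:
        return self.value == other.value
```

Storing every amount as an exact fraction of base units means values written with different suffixes or as decimals can be added and compared without rounding error.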

In one embodiment, the heterogeneous computing power resource CRD exposes unified create, delete, update, and query interfaces through the k8s apiserver.

In one embodiment, the updated scheduling policy CRD includes built-in policies and third-party extension policies. The built-in policies are loaded into the Cache during the running phase of the computing power scheduling module, and the third-party extension policies are read from the cloud-native storage ETCD module by dynamic loading during the running phase of the computing power scheduling module.
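
The difference between cached built-in policies and dynamically loaded extension policies can be sketched as follows in Python. `PolicyLoader` and the `read_extensions` callable (a stand-in for reading the scheduling policy CRD from ETCD) are hypothetical names; loading the built-in policies once at construction time, and never letting an extension replace a built-in policy of the same name, are assumptions.

```python
from typing import Callable, Dict


class PolicyLoader:
    """Combines cached built-in policies with freshly read extension policies."""

    def __init__(self, built_in: Dict[str, str],
                 read_extensions: Callable[[], Dict[str, str]]) -> None:
        self._cache = dict(built_in)             # loaded once, then reused
        self._read_extensions = read_extensions  # queried on every call

    def policies(self) -> Dict[str, str]:
        merged = dict(self._cache)
        for name, endpoint in self._read_extensions().items():
            merged.setdefault(name, endpoint)    # built-in names take precedence
        return merged
```

Because the extension policies are re-read on each call, a policy registered while the scheduler is running becomes visible on the next scheduling pass without a restart.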

In a specific implementation, as shown in the heterogeneous computing power scheduling schematic of Figure 2, the heterogeneous computing power intelligent scheduling system includes the computing power scheduling Operator Controller, the cloud-native storage ETCD, and the third-party extension policy Operator Controller. The cloud-native storage ETCD stores the heterogeneous computing power resource CRD, the scheduling policy CRD, the computing power task CRD, and the cluster CRD. The unified interface abstraction layer of the heterogeneous computing power resource CRD includes the Deep Copy interface, the Canonicalize base-normalization conversion interface, the ADD addition interface, the SUB subtraction interface, the CMP magnitude-comparison interface, and the Equal equality-test interface.

It should be noted that the reference model of the heterogeneous computing power resource CRD is the YAML configuration file of a Kubernetes CustomResourceDefinition object. It defines a custom resource named "kuberesources.resource.kubearena.io". The CustomResourceDefinition object uses API version "apiextensions.k8s.io/v1", and the custom resource belongs to the "resource.kubearena.io" group. Its object kind is "KubeResource", its list kind is "KubeResourceList", and it has the short name "res". The custom resource is accessible throughout the cluster and is served in a version named "v1", whose definition contains the specification (spec) and status information of the "KubeResource" object.
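
The description gives the reference model only in prose, so the Python dictionary below is a reconstruction of what such a manifest might look like, not the patent's actual YAML. The plural `kuberesources` is inferred from the resource name, since Kubernetes requires a CRD name of the form `<plural>.<group>`. The `singular` name, the `Cluster` scope (read from "accessible throughout the cluster"), and the open `spec`/`status` schema are assumptions.

```python
import json

crd = {
    "apiVersion": "apiextensions.k8s.io/v1",
    "kind": "CustomResourceDefinition",
    "metadata": {"name": "kuberesources.resource.kubearena.io"},
    "spec": {
        "group": "resource.kubearena.io",
        "scope": "Cluster",  # assumption: cluster-scoped rather than namespaced
        "names": {
            "kind": "KubeResource",
            "listKind": "KubeResourceList",
            "plural": "kuberesources",    # inferred from the CRD name
            "singular": "kuberesource",   # assumption
            "shortNames": ["res"],
        },
        "versions": [{
            "name": "v1",
            "served": True,
            "storage": True,
            # Assumed permissive schema: the source only says that the version
            # carries spec and status information.
            "schema": {"openAPIV3Schema": {
                "type": "object",
                "properties": {
                    "spec": {"type": "object",
                             "x-kubernetes-preserve-unknown-fields": True},
                    "status": {"type": "object",
                               "x-kubernetes-preserve-unknown-fields": True},
                },
            }},
        }],
    },
}

# JSON is a subset of YAML, so this text could be applied as a manifest.
manifest = json.dumps(crd, indent=2)
```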

In this embodiment, the heterogeneous computing power intelligent scheduling system includes a computing power scheduling module, a cloud-native storage ETCD module, and a third-party extension policy module. The cloud-native storage ETCD module stores a heterogeneous computing power resource CRD, a scheduling policy CRD, a computing power task CRD, and a cluster CRD, and the scheduling policy CRD is built into the system by means of the unified model and interface library provided by the heterogeneous computing power resource CRD. The third-party extension policy module registers a third-party extension policy with the cloud-native storage ETCD module through the computing power task interface provided by the heterogeneous computing power resource CRD; the cloud-native storage ETCD module updates the scheduling policy CRD according to the third-party extension policy to obtain an updated scheduling policy CRD; and the computing power scheduling module determines a target cluster in the cluster CRD according to the updated scheduling policy CRD and binds the computing power task to the target cluster. With this system, a unified framework for intelligent scheduling of heterogeneous computing power, combined with Kubernetes Operators, CRDs, and built-in policies, makes heterogeneous computing power scheduling policies compatible and extensible.

In addition, as shown in Figure 3, an embodiment of the present invention also provides a heterogeneous computing power intelligent scheduling method.

In this embodiment, the heterogeneous computing power intelligent scheduling method includes the following steps:

Step S10: Register a third-party extension policy with the cloud-native storage ETCD module through the computing power task interface provided by the heterogeneous computing power resource CRD.

It should be noted that the execution subject of this embodiment may be a computing service device with data processing, network communication, and program running functions, such as a mobile phone, tablet computer, or personal computer, or an electronic device or heterogeneous computing power intelligent scheduling system capable of implementing the above functions. This embodiment and the following embodiments are described below using the heterogeneous computing power intelligent scheduling system as an example.

It should be noted that the computing power task interface provided by the heterogeneous computing power resource CRD refers to the standard HTTPS computing power task scheduling interface (/schedule).

Step S20: Update the scheduling policy CRD in the cloud-native storage ETCD module according to the third-party extension policy to obtain an updated scheduling policy CRD.

It can be understood that a third-party extension policy can be registered at any time while the heterogeneous computing power intelligent scheduling system is running, and each registration triggers dynamic loading. The target cluster may be the cluster in the cluster CRD with the highest degree of conformity, and it serves as the cluster on which the current computing power task is to run.

It should be noted that the updated scheduling policy CRD is the scheduling policy CRD after the update, and it includes both the built-in policies and the third-party extension policies. The scheduling policy CRD is updated from the third-party extension policy dynamically and in real time, which can be implemented with an asynchronous watch mechanism. Specifically, a third-party user updates the scheduling policy CRD through the third-party extension policy module in two steps: (1) using the unified model and interface library provided by the heterogeneous computing power resource CRD, the third party implements its own extended computing power scheduling Operator Controller and exposes the standard HTTPS computing power task scheduling interface (/schedule), where HTTPS secures the communication and the data; (2) the third-party extension policy is registered with the cloud-native storage ETCD, which triggers an update of the policy Cache of the computing power scheduling Operator Controller. The registration carries not only the extension policy itself but also the callback access address, so that the computing power scheduling Operator Controller knows the address at which to call the /schedule interface.
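The registration step can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: `ExtensionPolicy`, `PolicyStore`, and `SchedulerPolicyCache` are hypothetical names, the store is a plain Python object standing in for ETCD, the URL is made up, and watchers are notified synchronously rather than through an asynchronous watch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExtensionPolicy:
    """A third-party policy together with the callback address of its /schedule endpoint."""
    name: str
    callback_url: str


@dataclass
class PolicyStore:
    """Hypothetical in-memory stand-in for the policy records kept in ETCD."""
    policies: list = field(default_factory=list)
    watchers: list = field(default_factory=list)

    def watch(self, callback):
        # Every subscriber is called on each registration (synchronously in this sketch).
        self.watchers.append(callback)

    def register(self, policy):
        self.policies.append(policy)
        for notify in self.watchers:
            notify(policy)


class SchedulerPolicyCache:
    """Policy cache of the scheduling controller, kept in registration order."""

    def __init__(self, store):
        self.extension_policies = []
        store.watch(self.extension_policies.append)

    def callback_for(self, name):
        # Look up where the /schedule call for a given policy should be sent.
        for policy in self.extension_policies:
            if policy.name == name:
                return policy.callback_url
        raise KeyError(name)


store = PolicyStore()
cache = SchedulerPolicyCache(store)
store.register(ExtensionPolicy("gpu-affinity", "https://ext.example.invalid/schedule"))
```

Registering the policy and its callback address in one record is what lets the cache answer "where do I call /schedule for this policy" without a second lookup.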

Step S30: Determine the target cluster in the cluster CRD according to the updated scheduling policy CRD, and bind the computing power task to the target cluster.

It should be noted that once the computing power task has been bound to the target cluster, scheduling of the current computing power task is complete; a new computing power task can then be created from the computing power task CRD stored in the cloud-native storage ETCD module and scheduled in turn.

In one embodiment, the updated scheduling policy CRD includes built-in policies and third-party extension policies, and determining the target cluster in the cluster CRD according to the updated scheduling policy CRD and binding the computing power task to the target cluster includes:

after a computing power task is created from the computing power task CRD in the cloud-native storage ETCD module, triggering the computing power scheduling module to start a scheduling workflow, wherein the scheduling workflow includes a pre-selection stage, a preferred-selection stage, and a binding stage;

in the pre-selection stage, traversing the built-in policies and selecting from the cluster CRD a first cluster that satisfies all of the built-in policies;

in the preferred-selection stage, selecting a second cluster from the first cluster according to the third-party extension policies;

in the binding stage, determining the target cluster in the second cluster and binding the computing power task to the target cluster.

It should be noted that the scheduling workflow (Workflow) has three stages, executed in order: the pre-selection stage, the preferred-selection stage, and the binding stage. In the pre-selection stage, all built-in policies are traversed and executed; for a single computing power task, the built-in policies are executed in their configured order (executing a built-in policy can be understood as checking whether a cluster satisfies it). The first cluster refers to the clusters in the cluster CRD that satisfy all built-in policies. In the preferred-selection stage, the third-party extension policies are executed (which can be understood as checking whether a cluster satisfies each third-party extension policy), and the second cluster refers to the clusters in the first cluster that satisfy the third-party extension policies. In the binding stage, the cluster in the second cluster with the highest degree of conformity is taken as the target cluster, that is, the cluster on which the current computing power task is to run.
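A minimal sketch of the pre-selection stage follows, assuming each built-in policy can be modelled as a yes/no check on a (task, cluster) pair. The two policies, the dictionary fields, and the cluster names are hypothetical and serve only to show the "must satisfy all built-in policies, checked in configured order" rule.

```python
def preselect(task, clusters, builtin_policies):
    """Return the clusters that satisfy every built-in policy, checked in configured order."""
    first_cluster = []
    for cluster in clusters:
        # all() stops at the first policy the cluster fails.
        if all(policy(task, cluster) for policy in builtin_policies):
            first_cluster.append(cluster)
    return first_cluster


# Hypothetical built-in policies: each is a predicate over (task, cluster).
def has_enough_gpus(task, cluster):
    return cluster["free_gpus"] >= task["gpus"]


def architecture_matches(task, cluster):
    return task["arch"] in cluster["archs"]


task = {"gpus": 4, "arch": "x86_64"}
clusters = [
    {"name": "cluster-a", "free_gpus": 8, "archs": ["x86_64"]},
    {"name": "cluster-b", "free_gpus": 2, "archs": ["x86_64", "arm64"]},
    {"name": "cluster-c", "free_gpus": 16, "archs": ["arm64"]},
]
first_cluster = preselect(task, clusters, [has_enough_gpus, architecture_matches])
print([c["name"] for c in first_cluster])  # prints ['cluster-a']
```

Here cluster-b fails the GPU check and cluster-c fails the architecture check, so only cluster-a moves on to the next stage.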

It should be noted that executing a third-party extension policy requires calling the third-party Operator's /schedule HTTPS interface and receiving the scheduling result. The parameters of the /schedule interface are the computing power task (Compute Task) and the list of clusters (cluster CRD) selected so far, and the return value is the list of clusters selected in this call (that is, the second cluster).
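The shape of a /schedule exchange might look like the sketch below. The source names only the two parameters and the returned cluster list; the JSON encoding, the field names `computeTask` and `clusters`, and the sample response are assumptions made for illustration, and no network call is performed.

```python
import json


def build_schedule_request(task, candidate_clusters):
    """Serialize the two /schedule parameters: the task and the candidate cluster list."""
    return json.dumps({"computeTask": task, "clusters": candidate_clusters})


def parse_schedule_response(body, candidate_clusters):
    """Return the cluster names the third party kept, ignoring any name that was not offered."""
    offered = {c["name"] for c in candidate_clusters}
    kept = json.loads(body)["clusters"]
    return [name for name in kept if name in offered]


candidates = [{"name": "cluster-a"}, {"name": "cluster-b"}]
request_body = build_schedule_request({"name": "train-job", "gpus": 4}, candidates)

# A response a third-party Operator might send back (made up for illustration).
response_body = '{"clusters": ["cluster-b", "cluster-z"]}'
selected = parse_schedule_response(response_body, candidates)
```

Dropping names that were never offered is a defensive choice in this sketch, so that a third-party policy can only narrow the candidate list and never widen it.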

In this embodiment, the target cluster is determined from the cluster CRD through the three stages of the scheduling workflow, which both identifies the cluster with the highest degree of conformity quickly and improves scheduling accuracy.

In one embodiment, as shown in Figure 4, selecting the second cluster from the first cluster according to the third-party extension policies includes:

Step S301: Call the third-party Operator's /schedule HTTPS interface to execute the target policy among the third-party extension policies, and score the clusters in the first cluster that satisfy the target policy to obtain a scheduling result.

Step S302: Determine whether a new target policy exists among the third-party extension policies.

Step S303: If so, take the new target policy as the target policy of step S301 and repeat steps S301 to S303; if not, select the second cluster from the first cluster based on the scheduling result.

It can be understood that the new target policy is the third-party extension policy that follows the current target policy among the third-party extension policies.

It should be noted that the target policy can be determined by the order in which the third-party extension policies were registered; for example, in the first round of cluster scoring, the target policy is the first-registered third-party extension policy. The scheduling result is updated after every round of scoring. For example, in the first round, cluster A does not satisfy that round's target policy and cluster B does, so the scheduling result after the first round is: cluster A has 0 points and cluster B has 1 point. In the second round, both cluster A and cluster B satisfy that round's target policy, so the updated scheduling result after the second round is: cluster A has 1 point and cluster B has 2 points.
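The round-by-round scoring in this example can be reproduced with a few lines of code. In this sketch each round is represented simply by the set of clusters that satisfy that round's target policy, which stands in for one /schedule call; the function name and the one-point-per-round increment are taken from the worked example rather than from any stated formula.

```python
def score_rounds(first_cluster, policies_in_registration_order):
    """Run one scoring round per policy and return the accumulated scheduling result."""
    result = {name: 0 for name in first_cluster}
    for satisfied_by in policies_in_registration_order:
        # Stand-in for one /schedule call: the set of clusters satisfying this policy.
        for name in first_cluster:
            if name in satisfied_by:
                result[name] += 1
    return result


round_1 = {"B"}        # only cluster B satisfies the first-registered policy
round_2 = {"A", "B"}   # both clusters satisfy the second policy
print(score_rounds(["A", "B"], [round_1]))           # prints {'A': 0, 'B': 1}
print(score_rounds(["A", "B"], [round_1, round_2]))  # prints {'A': 1, 'B': 2}
```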

It can be understood that the scheduling result obtained when no new target policy remains among the third-party extension policies is the final scheduling result.

In one embodiment, selecting the second cluster from the first cluster based on the scheduling result includes:

setting a preset threshold;

determining the weight score of each cluster in the first cluster according to the scheduling result;

taking the clusters in the first cluster whose weight score is greater than the preset threshold as the second cluster.

It should be noted that the scheduling result here is the final scheduling result, from which the weight score of each cluster in the first cluster is determined. The preset threshold can be set in advance and is preferably 0: because a cluster's weight score is always greater than or equal to 0, any cluster whose weight score is not 0 can be placed in the second cluster, while a cluster with a weight score of 0 is one that satisfies none of the target policies among the third-party extension policies.
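As a small illustration of the threshold rule, the sketch below keeps every cluster whose weight score is strictly greater than the threshold, with 0 as the default. The function name and the sample scores (including cluster C) are hypothetical.

```python
def select_second_cluster(scheduling_result, threshold=0):
    """Keep the clusters whose weight score is strictly greater than the threshold."""
    return [name for name, score in scheduling_result.items() if score > threshold]


# Cluster C satisfied none of the third-party policies, so its score stayed at 0.
final_result = {"A": 1, "B": 2, "C": 0}
second_cluster = select_second_cluster(final_result)
print(second_cluster)  # prints ['A', 'B']
```

With the default threshold of 0 the filter removes exactly the clusters that matched no third-party policy; a higher threshold would demand more matches.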

In this embodiment, determining the second cluster by applying a threshold to the scores of the clusters in the first cluster makes that determination more efficient.

In one embodiment, determining the target cluster in the second cluster and binding the computing power task to the target cluster includes:

determining the weight score of each cluster in the second cluster according to the scheduling result;

taking the cluster with the highest weight score in the second cluster as the target cluster, and binding the computing power task to the target cluster.

It should be noted that the scheduling result here is the final scheduling result, from which the weight score of each cluster in the second cluster is determined. For example, if the scheduling result gives cluster A a weight score of 1 and cluster B a weight score of 2, then cluster B is the target cluster in the second cluster.
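The binding step for this example can be sketched as picking the highest-scoring cluster and recording the assignment. The `bindings` dictionary is a hypothetical stand-in for whatever record the system keeps, and the tie-breaking behaviour (the first cluster with the top score wins) is an assumption of this sketch, since the source does not say how ties are resolved.

```python
def bind_to_best(task_name, second_cluster_scores, bindings):
    """Bind the task to the highest-scoring cluster and return that cluster's name."""
    if not second_cluster_scores:
        raise ValueError("no candidate cluster left to bind to")
    # max() returns the first key with the top score, so ties go to the earlier entry.
    target = max(second_cluster_scores, key=second_cluster_scores.get)
    bindings[task_name] = target
    return target


bindings = {}
target = bind_to_best("train-job", {"A": 1, "B": 2}, bindings)
print(target)  # prints B
```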

It should be noted that once the computing power task has been bound to the target cluster, scheduling of the current computing power task is complete; a new computing power task can then be created from the computing power task CRD stored in the cloud-native storage ETCD module and scheduled in turn.

In this embodiment, selecting the cluster with the highest weight score as the cluster on which the current computing power task runs improves the accuracy of computing power scheduling.

In a specific implementation, as shown in Figure 5, the heterogeneous computing power resource CRD is first defined on the basis of cloud-native technology, and a unified interface abstraction layer is implemented through it, which completes the unified model and measurement framework for heterogeneous computing power. The built-in scheduling policies are then implemented on the unified model and interface library provided by the heterogeneous computing power resource CRD. After that, creating a computing power task triggers the scheduling Workflow, and the three stages of the scheduling workflow complete the scheduling of that task.

It should be understood that the above is merely illustrative and does not limit the technical solution of the present invention in any way; in specific applications, those skilled in the art can configure it as needed, and the present invention places no restriction on this.

It should be noted that the workflow described above is merely illustrative and does not limit the scope of protection of the present invention. In practice, those skilled in the art may select some or all of it according to actual needs to achieve the purpose of this embodiment, and no limitation is imposed here.

In addition, for technical details not described exhaustively in this embodiment, refer to the heterogeneous computing power intelligent scheduling method provided by any embodiment of the present invention; they are not repeated here.

Furthermore, it should be noted that, as used herein, the terms "include", "comprise", and any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or system that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or system that includes that element.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium (such as read-only memory (ROM)/RAM, a magnetic disk, or an optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1. A heterogeneous computing power intelligent scheduling system, characterized in that the heterogeneous computing power intelligent scheduling system includes a computing power scheduling module, a cloud-native storage ETCD module, and a third-party extension policy module; the cloud-native storage ETCD module stores a heterogeneous computing power resource CRD, a scheduling policy CRD, a computing power task CRD, and a cluster CRD; and the unified model and interface library provided by the heterogeneous computing power resource CRD are used to build the scheduling policy CRD into the heterogeneous computing power intelligent scheduling system; wherein:
the third-party extension policy module is configured to register a third-party extension policy with the cloud-native storage ETCD module through the computing power task interface provided by the heterogeneous computing power resource CRD;
the cloud-native storage ETCD module is configured to update the scheduling policy CRD according to the third-party extension policy to obtain an updated scheduling policy CRD;
the computing power scheduling module is configured to determine a target cluster in the cluster CRD according to the updated scheduling policy CRD and to bind a computing power task to the target cluster.

2. The system of claim 1, characterized in that the heterogeneous computing power resource CRD is defined based on cloud-native technology and includes an interface abstraction layer, and the interface abstraction layer uniformly handles unit adaptation of computing power resources.

3. The system of claim 2, characterized in that the interface abstraction layer includes a Deep Copy interface for deep copying, a Canonicalize interface for numeral-base normalization and conversion, an ADD interface for addition, a SUB interface for subtraction, a CMP interface for magnitude comparison, and an Equal interface for equality testing.

4. The system of claim 1, characterized in that the heterogeneous computing power resource CRD exposes create, delete, update, and query interfaces uniformly through the k8s apiserver.

5. The system of claim 1, characterized in that the updated scheduling policy CRD includes built-in policies and third-party extension policies; the built-in policies are loaded into the Cache during the running phase of the computing power scheduling module, and the third-party extension policies are read from the cloud-native storage ETCD module by dynamic loading during the running phase of the computing power scheduling module.

6. A heterogeneous computing power intelligent scheduling method, characterized in that the method is applied to the heterogeneous computing power intelligent scheduling system of any one of claims 1-5 and includes:
registering a third-party extension policy with the cloud-native storage ETCD module through the computing power task interface provided by the heterogeneous computing power resource CRD;
updating the scheduling policy CRD in the cloud-native storage ETCD module according to the third-party extension policy to obtain an updated scheduling policy CRD;
determining a target cluster in the cluster CRD according to the updated scheduling policy CRD, and binding a computing power task to the target cluster.

7. The method of claim 6, characterized in that the updated scheduling policy CRD includes built-in policies and third-party extension policies, and determining the target cluster in the cluster CRD according to the updated scheduling policy CRD and binding the computing power task to the target cluster includes:
after a computing power task is created from the computing power task CRD in the cloud-native storage ETCD module, triggering the computing power scheduling module to start a scheduling workflow, wherein the scheduling workflow includes a pre-selection stage, a preferred-selection stage, and a binding stage;
in the pre-selection stage, traversing the built-in policies and selecting from the cluster CRD a first cluster that satisfies all of the built-in policies;
in the preferred-selection stage, selecting a second cluster from the first cluster according to the third-party extension policies;
in the binding stage, determining the target cluster in the second cluster and binding the computing power task to the target cluster.

8. The method of claim 7, characterized in that selecting the second cluster from the first cluster according to the third-party extension policies includes:
S301: calling the third-party Operator's /schedule HTTPS interface to execute the target policy among the third-party extension policies, and scoring the clusters in the first cluster that satisfy the target policy to obtain a scheduling result;
S302: determining whether a new target policy exists among the third-party extension policies;
S303: if so, taking the new target policy as the target policy of step S301 and repeating steps S301 to S303; if not, selecting the second cluster from the first cluster based on the scheduling result.

9. The method of claim 8, characterized in that selecting the second cluster from the first cluster based on the scheduling result includes:
setting a preset threshold;
determining the weight score of each cluster in the first cluster according to the scheduling result;
taking the clusters in the first cluster whose weight score is greater than the preset threshold as the second cluster.

10. The method of claim 8, characterized in that determining the target cluster in the second cluster and binding the computing power task to the target cluster includes:
determining the weight score of each cluster in the second cluster according to the scheduling result;
taking the cluster with the highest weight score in the second cluster as the target cluster, and binding the computing power task to the target cluster.
CN202311666674.0A 2023-12-06 2023-12-06 Heterogeneous computing power intelligent scheduling system and method Pending CN117632505A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311666674.0A CN117632505A (en) 2023-12-06 2023-12-06 Heterogeneous computing power intelligent scheduling system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311666674.0A CN117632505A (en) 2023-12-06 2023-12-06 Heterogeneous computing power intelligent scheduling system and method

Publications (1)

Publication Number Publication Date
CN117632505A true CN117632505A (en) 2024-03-01

Family

ID=90037382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311666674.0A Pending CN117632505A (en) 2023-12-06 2023-12-06 Heterogeneous computing power intelligent scheduling system and method

Country Status (1)

Country Link
CN (1) CN117632505A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination