WO2023125493A1 - Resource management method, device, and resource management platform - Google Patents

Resource management method, device, and resource management platform

Info

Publication number
WO2023125493A1
WO2023125493A1 (application PCT/CN2022/142208)
Authority
WO
WIPO (PCT)
Prior art keywords
resource
network
resources
computing
cluster
Prior art date
Application number
PCT/CN2022/142208
Other languages
English (en)
French (fr)
Inventor
折楠
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023125493A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources to service a request
    • G06F9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F9/5061 Partitioning or combining of resources
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0896 Bandwidth or capacity management, i.e. automatically increasing or decreasing capacities
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of computer technology, in particular to a resource management method, device and resource management platform.
  • the dynamically distributed computing resources and storage resources are often fully connected through the computing power network to realize the unified collaborative scheduling of network resources, storage resources, computing resources and other resources.
  • the existing computing power network includes different types of devices such as general servers, heterogeneous servers, edge servers, network devices (such as switches) and storage devices.
  • on one hand, each device is connected to other devices in the computing power network in different ways; moreover, the delay and bandwidth of the network between devices vary greatly due to factors such as the location and type of the access network; on the other hand, the storage capacity of the storage resources used by different devices also varies due to factors such as the medium type and the network.
  • platforms that deploy applications on computing power networks also consider the above factors to varying degrees during user application deployment; for example, the elastic computing services of cloud computing vendors allocate virtual central processing units (vCPUs), memory, and network bandwidth to users based on their needs for running software.
  • the existing resource management methods cannot fully exploit the resources of the computing power network; therefore, how to provide a better resource management method has become an urgent technical problem to be solved.
  • the present application provides a resource management method, device, and resource management platform, which can quantify the hardware resources of resource objects in a computing power network according to the resource data of those resource objects, obtain a more accurate measure of each resource object's efficiency in processing scheduling requests, and then schedule jobs according to the quantification results and user needs.
  • the present application provides a resource management method for a computing power network including multiple resource objects.
  • the method includes: the resource management platform acquires resource data of a resource object, where the resource data indicates attribute information of the various hardware resources; it quantifies the various hardware resources to obtain corresponding quantification results, and then allocates resources to process a scheduling request according to the quantification results, where the quantification results include the result obtained by quantifying the computing resources in units of the smallest independently runnable unit in the resource object.
  • in this way, the computing power of each resource object can be evaluated more accurately before resources are allocated, which makes resource scheduling more reasonable and improves the resource utilization of the computing power network.
  • the above-mentioned hardware resources include computing resources, and the above-mentioned resource data includes hardware attribute data of the computing resources: at least one of the compute-power type of a processor, the compute width of a processor, the number of independently runnable units in a single processor, and the computing frequency of an independently runnable unit, where the compute-power type covers integer operations and floating-point operations. The above quantification results include a static quantification result of the computing resources, which indicates the basic computing capability of the resource object, that is, the computing capability of the resource object when it is idle;
  • quantifying the resource data to obtain a quantification result then includes: determining the static quantification result of the computing resources according to the hardware attribute data of the computing resources, in units of the smallest independently runnable unit.
  • the smallest independently runnable unit is a physical core, a logical core, or a stream processor.
  • when the computing resources of resource objects are quantified in units of the smallest independently runnable unit, processors of the same compute-power type but different compute widths are quantified against the same standard; for example, the computing power of processors with different compute widths is converted into the computing power of processors with the same compute-power type and the same compute width. In this way, the computing power of each resource object can be evaluated more accurately before resources are allocated, which makes resource scheduling more reasonable and improves the resource utilization of the computing power network.
  • the above-mentioned static quantification results of the computing resources include a quantified result for integer-operation processors and a quantified result for floating-point processors;
  • determining the static quantification result of the computing resources according to the hardware attribute data, in units of the smallest independently runnable unit, includes: converting the computing frequencies of integer-operation processors with different compute widths into the quantized value of an integer-operation processor with a target compute width, to obtain the quantified result for integer-operation processors; and converting the computing frequencies of floating-point processors with different compute widths into the quantized value of a floating-point processor with the target compute width, to obtain the quantified result for floating-point processors.
  • even when the computing resources of resource objects are quantified in units of the smallest independently runnable unit, the computing frequency of the smallest independently runnable unit is not necessarily the same across processors. Converting, per smallest independently runnable unit, the computing frequencies of processors of the same compute-power type but different compute widths into the quantized value of a processor with a single compute width therefore allows the computing capabilities of different resource objects to be evaluated and compared more accurately, so that when resources are allocated, scheduling is more reasonable and the resource utilization of the computing power network improves.
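As a concrete illustration of the conversion described above, the sketch below normalizes per-unit computing frequencies at different compute widths to a single target width. The linear width scaling, the function name, and all figures are assumptions for illustration; the application itself does not fix a formula.

```python
# Illustrative sketch only: we ASSUME the quantized value of a core scales
# linearly with the ratio of its compute width to a chosen target width.

def quantize_static_compute(cores, target_width):
    """Aggregate quantized compute value at the target compute width.

    cores: list of (num_units, freq_ghz, width_bits) tuples, all of the
    same compute-power type (integer or floating point).
    """
    total = 0.0
    for num_units, freq_ghz, width_bits in cores:
        # A wider unit is assumed to do proportionally more work per cycle,
        # so its frequency is scaled by width_bits / target_width.
        total += num_units * freq_ghz * (width_bits / target_width)
    return total

# Hypothetical example: 8 units at 3.0 GHz with 256-bit width plus 4 units
# at 2.5 GHz with 512-bit width, normalized to a 256-bit target width.
fp_score = quantize_static_compute([(8, 3.0, 256), (4, 2.5, 512)], 256)
```

With this assumed scaling, the 512-bit units count double relative to the 256-bit target, so the two groups contribute 24.0 and 20.0 respectively.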
  • the above-mentioned hardware resources include storage resources, and the above-mentioned resource data includes hardware attribute data of the storage devices: the type, capacity, and input/output rate of each storage device, where different storage devices use different storage media. The above quantification results include a static quantification result of the storage resources, which indicates the basic storage capability of the resource object. Quantifying the resource data to obtain the quantification result includes: determining the static quantification result of the storage resources according to the hardware attribute data of the storage devices.
  • storage devices are not only used to store data: computing nodes continuously read from and write to them while processing tasks. Different storage devices have different capacities, and their read/write rates (that is, their input/output rates) also differ. Quantifying the storage resources of a resource object by combining the capacities and input/output rates of its different storage devices therefore reflects the performance of those storage resources more accurately, so that resource scheduling is more reasonable when resources are allocated.
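The storage quantification described above can be sketched as follows. The application names capacity and input/output rate as inputs but gives no formula, so this sketch simply aggregates both per medium type; all names and numbers are illustrative assumptions.

```python
# Illustrative sketch: aggregate capacity and I/O rate per storage medium,
# since both enter the static quantification per the description above.

def quantize_static_storage(devices):
    """devices: list of (media_type, capacity_gb, io_rate_mbps) tuples.

    Returns per-medium totals of capacity and input/output rate.
    """
    result = {}
    for media_type, capacity_gb, io_rate_mbps in devices:
        entry = result.setdefault(media_type,
                                  {"capacity_gb": 0, "io_rate_mbps": 0})
        entry["capacity_gb"] += capacity_gb
        entry["io_rate_mbps"] += io_rate_mbps
    return result

# Hypothetical resource object with two SSDs and one HDD.
storage = quantize_static_storage([
    ("SSD", 960, 3500),
    ("SSD", 960, 3500),
    ("HDD", 8000, 200),
])
```

Keeping media types separate preserves the point made above that a fast, small SSD and a slow, large HDD are not interchangeable, rather than collapsing them into one number.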
  • the above-mentioned hardware resources also include network resources, and the above-mentioned resource data includes hardware attribute data of the network resource.
  • the hardware attribute data of the network resource includes the bus bandwidth in the device.
  • the above quantification results include a static quantification result of the network resources, which indicates the basic data transmission capability of the resource object; quantifying the resource data to obtain the quantification result then includes: taking the bus bandwidth of the device as the static quantification result of the network resources.
  • the resource object of the computing power network can be a single device. When a single device processes data, the data is transmitted between the modules in the device through its internal bus; therefore, when the resource object is a single device, the bus bandwidth of the device is an important criterion for evaluating the network transmission capability within the device.
  • the above-mentioned hardware resources include network resources, and the resource data includes hardware attribute data of the network resources.
  • the hardware attribute data of the network resources includes the cluster's network topology, the port bandwidths of the network devices within the cluster, and the network bandwidth between the cluster and the external network;
  • the above quantification results also include the static quantification results of network resources;
  • the static quantification results of network resources are used to indicate the basic data transmission capabilities of resource objects;
  • quantifying the resource data to obtain the quantification result includes: determining the static quantification result of the network resources according to the network topology of the cluster and the port bandwidth of each network device inside the cluster.
  • the resource object of the computing power network can also be a cluster including multiple devices. Multiple devices in the cluster are connected to each other through network devices.
  • the port bandwidth of the network device is a factor that affects the data interaction rate between different devices.
  • the rate of data interaction between nodes affects how efficiently the cluster processes tasks; the port bandwidths of different network devices differ, the topology of the network devices differs between clusters, and that topology also affects the data interaction rate between nodes. Determining the data transmission capabilities of different clusters from the network topology and the port bandwidths of the network devices therefore evaluates each resource object's data transmission capability more accurately, which makes resource scheduling more reasonable and improves the resource utilization of the computing power network.
  • hardware acceleration technologies, such as remote direct memory access and/or in-network computing, may also be used between the devices in the cluster to improve data transmission capability; when hardware acceleration technologies are used among the devices in a cluster, the static quantification result of the network resources can also be determined according to the port bandwidths of the network devices and the hardware acceleration technologies in use.
  • hardware acceleration can also speed up data processing itself, so when the data transmission capability of a resource object is quantified, the effect of the hardware acceleration technology can be quantified as well, yielding a more accurate estimate of the resource object's data processing efficiency.
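One plausible reading of the above is that the effective bandwidth between two cluster nodes is bounded by the smallest port bandwidth along the path between them, optionally scaled when hardware acceleration is in use. The bottleneck rule and the acceleration factor below are assumptions for illustration, not the application's formula.

```python
# Illustrative sketch: effective inter-node bandwidth modeled as the
# bottleneck (minimum) port bandwidth along the path, scaled by an
# ASSUMED acceleration factor when RDMA / in-network computing is enabled.

def path_bandwidth(port_bandwidths_gbps, accel_factor=1.0):
    """port_bandwidths_gbps: port bandwidths along one path between two
    nodes, e.g. node NIC -> leaf switch port -> spine switch port."""
    return min(port_bandwidths_gbps) * accel_factor

# Hypothetical path: a 25G NIC through 100G leaf and spine ports, with
# hardware acceleration modeled as a 1.2x effective-throughput factor
# (the factor is a placeholder, not a figure from the application).
bw = path_bandwidth([25, 100, 100], accel_factor=1.2)
```

Under this model the 25G NIC is the bottleneck, so the leaf and spine port bandwidths do not raise the result; only the acceleration factor does.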
  • allocating resources for the scheduling request according to the quantification results includes: obtaining the resource requirements in the scheduling request, which cover the scheduling request's demand for hardware resources, for example any one or more of computing requirements on computing resources, storage requirements on storage resources, or network requirements on network resources; the resource management platform then determines the target resource object for processing the scheduling request based on the static quantification results of the various hardware resources of the multiple resource objects in the computing power network system and on the resource requirements in the scheduling request.
  • once the computing capability of each resource object has been evaluated more accurately, the resource object that processes the scheduling request is determined in combination with the resource requirements in the scheduling request, so the resources of the resource objects are scheduled more reasonably and the resource utilization of the computing power network improves.
  • allocating resources for the scheduling request according to the quantification results includes: determining the available resources of the resource object; determining, according to the quantification results, the resource requirements, and the available resources of the resource object, a dynamic quantification result, which indicates the resource object's ability to process the scheduling request; and determining the target resource object for processing the scheduling request according to the dynamic quantification result and the resource requirements in the scheduling request, where the resource requirements include the scheduling request's demands on hardware resources, such as demands on computing resources, demands on storage resources, and so on.
  • the dynamic quantization result is used to indicate the capability of the first resource object to process the scheduling request.
  • the dynamic quantification result is obtained by acquiring, after the scheduling request is received, the resource data of the first resource object's currently available resources; that is, by re-quantifying resources such as the computing, storage, and network resources of the computing power network based on the scheduling request's resource demand information and on each resource object's currently available resources, the efficiency with which each resource object can supply the various resources the scheduling request needs is obtained more accurately.
  • the resource management platform can first quantify the various resources included in each resource object in the computing power network to obtain the basic data processing capabilities of each resource object;
  • the available resources of the resource object include available computing resources, available storage resources, and available network resources. Determining the resource object's dynamic quantification result with respect to the scheduling request according to the quantification results, the resource requirements, and the available resources then includes: determining the matching degree of the computing resources according to the hardware attribute data of the resource object's computing resources, the resource data of the available computing resources, and the computing requirements in the resource requirements, where the matching degree of the computing resources is the degree to which the available computing resources match the computing requirements, and the computing requirements refer to the computing resources needed to process the scheduling request;
  • determining the matching degree of the storage resources according to the hardware attribute data of the resource object's storage resources, the resource data of the available storage resources, and the storage requirements in the resource requirements, where the matching degree of the storage resources is the degree to which the available storage resources match the storage requirements, and the storage requirements refer to the storage resources needed to process the scheduling request;
  • determining the matching degree of the cluster-internal network according to the port bandwidths of the network devices in the cluster-internal network and the available port bandwidths of those network devices, where the matching degree of the cluster-internal network is the degree to which the available cluster-internal network resources match the network requirements in the resource requirements;
  • determining the matching degree of the cluster-external network according to the network bandwidth between the cluster and the cluster-external network and the available network bandwidth between them, where the matching degree of the cluster-external network is the degree to which the available cluster-external network resources match the external-network requirements in the resource requirements; and
  • determining the above dynamic quantification result according to the matching degrees of the computing resources, the storage resources, the cluster-internal network, and the cluster-external network.
  • determining the target resource object for processing the scheduling request includes: when the resource requirement is efficiency priority, determining the resource object with the largest dynamic quantification result as the target resource object; or, when the resource requirement is cost priority, determining the resource object with the smallest dynamic quantification result as the target resource object.
  • based on the dynamic quantification results and user needs, such as efficiency priority or price priority, the job scheduling platform can assign the scheduling request to a resource object that meets the user's requirements.
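The selection rule stated above (largest dynamic quantification result under efficiency priority, smallest under cost priority) can be sketched directly; the function name, the zero-result exclusion, and the sample data are hypothetical.

```python
# Illustrative sketch of the stated selection rule: efficiency priority
# picks the largest dynamic quantification result, cost priority the
# smallest. Objects with a zero result are ASSUMED unable to serve the
# request and are excluded.

def pick_target(dynamic_results, priority):
    """dynamic_results: dict mapping resource-object name -> result."""
    candidates = {k: v for k, v in dynamic_results.items() if v > 0}
    if not candidates:
        return None  # no resource object can process the request
    if priority == "efficiency":
        return max(candidates, key=candidates.get)
    if priority == "cost":
        return min(candidates, key=candidates.get)
    raise ValueError("unknown priority: " + priority)

# Hypothetical results for three resource objects from Figure 2.
results = {"hpc-cluster": 0.9, "ai-cluster": 0.6, "mec-server": 0.0}
```

With these hypothetical numbers, efficiency priority selects `hpc-cluster` and cost priority selects `ai-cluster`, since `mec-server` cannot serve the request at all.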
  • the resource management platform obtaining the resource data of the resource object includes: obtaining the resource data through a resource manager of the resource object, where the resource manager acquires the resource data through at least one of a baseboard management controller (BMC), a cluster discovery protocol, or a data collection interface.
  • the present application provides a resource management device, where the resource management device includes various modules for executing the resource management method in the first aspect or any possible implementation manner of the first aspect.
  • the present application provides a resource management system, the resource management system includes a processor and a memory; wherein the memory is used to store instructions, the processor is used to execute the instructions, and when the processor executes the instructions, The processor executes the resource management method in the first aspect or in any possible implementation manner of the first aspect.
  • the foregoing resource management system is located in a physical device of the computing power network system.
  • the foregoing resource management system is deployed in a virtual device of a computing power network system, and the foregoing virtual device includes a virtual machine or a container.
  • the processor of the resource management system is included in the processors assigned to the virtual device by the computing power network system, and the memory of the resource management system is included in the memory allocated to that virtual device by the computing power network system.
  • the present application provides a computing device, including a processor and a memory; the memory is used to store instructions, the processor is used to execute the instructions, and when the processor executes the instructions, the computing device performs the resource management method in the first aspect or any possible implementation manner of the first aspect.
  • the present application provides a computer-readable storage medium in which instructions are stored; when the instructions are run on a server, the server is caused to execute the resource management method in the first aspect or any possible implementation manner of the first aspect.
  • the present application provides a computer program product, which, when running on a server, causes the server to execute the resource management method in the first aspect or any possible implementation manner of the first aspect.
  • Figure 1 is a schematic diagram of a computing power network provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a system for implementing a resource management method provided by an embodiment of the present application
  • FIG. 3 is a schematic flowchart of a static resource quantification method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a network topology within a cluster provided by an embodiment of the present application.
  • Fig. 5 is a schematic flowchart of a dynamic resource quantification method provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a resource management device provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • a computing power network connects dynamically distributed computing and storage resources through the network and, through unified, coordinated scheduling of computing, storage, network, and other multi-dimensional resources, lets massive numbers of applications call on the network's various resources on demand and in real time.
  • a heterogeneous cluster refers to a cluster that uses processors of different architectures for joint computing.
  • the processors in the cluster include any two or more of a central processing unit (CPU), a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), a data processing unit (DPU), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other chips.
  • high performance computing (HPC) refers to using effective algorithms to quickly complete data-intensive, computing-intensive, and data input/output (I/O) intensive calculations.
  • Multi-access edge computing migrates traffic and service computing from centralized data centers to the edge of the network, closer to customers. All data is analyzed, processed, and stored at the edge of the network instead of being sent to the data center for processing, reducing latency as data is collected and processed, enabling real-time performance for high-bandwidth applications.
  • remote direct memory access is a technology that bypasses the remote host's operating system kernel to access data in its memory. Because it does not go through the operating system, it saves substantial CPU resources, improves system throughput, and reduces the system's network communication delay, which makes it especially suitable for wide use in large-scale parallel computer clusters.
  • in-network computing is a distributed parallel computing architecture that uses network cards, switches, and other network devices to perform data calculations online during data transmission, reducing communication delay and improving overall computational efficiency.
  • this application provides a resource management method for each resource object in the computing power network.
  • the resource management method acquires the hardware attribute data of hardware resources such as the computing resources, storage resources, and network resources.
  • the computing resources are quantified in units of the smallest independently runnable unit of the computing resources; the storage resources are quantified based on the capacities and input/output rates of the various storage devices included in the resource object; and for the network resources, the internal network and the external network of the resource object are quantified separately. In this way the capabilities of the various resources included in the resource object can be evaluated more accurately, and resources are scheduled based on the quantification results of the resource object's various resources and the job's demand for them. Quantitatively managing the various resources of resource objects through the method provided in this application evaluates the capabilities of resource objects more accurately and makes resource scheduling more reasonable.
  • Figure 1 is a schematic diagram of a computing power network provided by an embodiment of this application
  • the computing power network includes multiple resource objects; different resource objects are connected to each other through a network, for example a carrier network provided by an operator.
  • a resource object may be a single device including at least one of computing resources, storage resources, and network resources, such as a multi-access edge computing server.
  • the resource object can also be a cluster including multiple devices, where each cluster includes computing resources, network resources, and storage resources, for example a high performance computing (HPC) cluster, an artificial intelligence (AI) computing cluster, a heterogeneous cluster, or a data center.
  • the processors of the above-mentioned computing resources can be any one or a combination of multiple of a CPU, GPU, NPU, TPU, DPU, ASIC, complex programmable logic device (CPLD), FPGA, generic array logic (GAL), or system on chip (SoC).
  • the above storage resources can be mechanical hard disks such as hard disk drives (Hard Disk Drive, HDD), tapes, solid state disks (Solid State Disk, SSD), other types of storage media, or a combination of two or more of the above types of storage media.
  • the aforementioned network resources include internal network resources and external network resources.
  • when the resource object is a single device (for example, a computing node), the internal network resource is the bus bandwidth of the device, and the external network resource is the network bandwidth between the device and the external network.
  • when the resource object is a cluster including multiple devices, the internal network resources of the cluster include the port bandwidth of each network device in the cluster, and the external network resources are the network bandwidth between the cluster and the external network.
  • FIG. 2 is a schematic diagram of a system for implementing a resource management method provided by an embodiment of the present application.
  • the system includes a resource management platform 100 and multiple resource objects.
  • multiple resource objects constitute the computing power network as shown in Figure 1
  • the resource object 200 may be a cluster including multiple devices, for example, the AI computing cluster 201, the HPC cluster 202, the heterogeneous cluster 203 shown in Figure 2, and so on.
  • the resource object may also be a single device, for example, the MEC server 204 or other types of devices.
  • the resource management platform 100 is used to obtain the hardware attribute data of the hardware resources of each resource object, and then perform quantitative evaluation on various resources of each resource object.
  • the resource management platform 100 can be deployed in any resource object constituting the computing power network, for example, the resource management platform 100 is deployed in a device.
  • the resource management platform 100 can also be deployed in a device dedicated to resource management other than the resource objects constituting the computing power network, or it can be deployed in the form of virtual resources, for example, in virtual machines or containers.
  • a resource manager 210 is deployed in each resource object; the resource manager 210 is used to collect hardware attribute data of the various hardware resources of the resource object 200 and send it to the resource management platform 100, where the hardware resources of each resource object 200 include computing resources, storage resources, and network resources.
  • when the resource object is a cluster, the resource manager 210 can be deployed on any device in the cluster, or on a device dedicated to collecting the cluster's resource data; when the resource object is a single device, the resource manager 210 is deployed on that device.
  • after receiving the hardware attribute data of the various resources sent by the resource manager 210 of each resource object 200, the resource management platform 100 analyzes and quantifies the resource data of each resource object 200 to obtain the static quantification result of each resource object. The resource management platform 100 then stores the resource data of the various resources of each resource object 200 and the corresponding static quantification results in the resource directory.
  • the resource data is used to indicate the attribute information of the hardware resource of the resource object associated therewith.
  • the resource manager 210 can collect resource data of the various resources of each device in the resource object through the intelligent platform management interface (Intelligent Platform Management Interface, IPMI) of the baseboard management controller (BMC). A device can also collect resource data of the various resources of each computing node in the resource object through a cluster discovery protocol or a data collection interface; in this case, each device needs to be deployed with an agent that supports the data collection service.
  • the above-mentioned computing power network further includes a job scheduling platform 300. The resource management platform 100 is also used to obtain, through the resource manager 210 of each resource object 200, the current usage or remaining status of the various resources of that resource object; then, according to the static quantification results of each resource object 200 in the resource directory, the currently available resources of each resource object, and the scheduling request, the resource management platform 100 quantifies the available resources of each resource object 200 to obtain a dynamic quantification result for each resource object 200.
  • the dynamic quantification result is used to indicate the ability of the resource object to process the scheduling request.
  • the dynamic quantization result is used to indicate the efficiency of the resource object to process the scheduling request.
  • the job scheduling platform 300 allocates scheduling requests to target resource objects according to the dynamic quantification results of each resource object 200 .
  • the above-mentioned job scheduling platform 300 can be deployed on any device in the cluster, or can be deployed on a device dedicated to collecting various resource data of the cluster.
  • the job scheduling platform 300 may be deployed on the same device as the resource management platform 100, or may not be deployed on the same device as the resource management platform 100, which is not specifically limited in this embodiment of the present application.
  • the resource management method provided by the present application mainly includes resource quantification and resource allocation.
  • the resource quantification method of the present application will be introduced in detail with reference to the accompanying drawings.
  • the resource quantification method of the present application can be further divided into a static resource quantification method and a dynamic resource quantification method according to the quantization operation execution process.
  • the static resource quantification method can obtain the static quantification results of various hardware resources of each resource object, and the static quantification results are used to indicate the basic capabilities of resource objects.
  • the static quantification results of computing resources indicate the basic computing capabilities of resource objects.
  • the static quantification result of the storage resource indicates the basic storage capability of the resource object, and the static quantification result of the network resource indicates the basic data transmission capability of the resource object.
  • the dynamic resource quantification method obtains the dynamic quantification result of each resource object, and the dynamic quantification result is obtained according to the currently available resources of the resource object, and is used to indicate the processing capability of the resource object to process the scheduling request.
  • Fig. 3 is a schematic flowchart of a static resource quantification method provided by an embodiment of the present application.
  • the following takes a resource object in a computing power network as a single device as an example to introduce the static resource quantification method provided by the embodiment of the present application in detail.
  • the above resource object is referred to as the first resource object.
  • the method comprises the steps of:
  • the resource management platform acquires resource data of a first resource object.
  • the first resource object can obtain its resource data through the IPMI of the BMC, a cluster discovery protocol, or a data collection interface, and report the resource data of the first resource object to the resource management platform 100.
  • the resource data is used to indicate the attribute information of the hardware resources of the first resource object, and the hardware resources of the resource object include computing resources, network resources, and storage resources.
  • the resource data includes hardware attribute data of computing resources, hardware attribute data of storage resources, and hardware attribute data of network resources.
  • the hardware attribute data of the above-mentioned computing resources includes the computing power type of the processor, the computing width of the processor, the number of processors, the number of smallest independently runnable units included in each processor, and the computing frequency of the independently runnable units.
  • the type of processor includes any one or more of CPU, GPU, TPU, DPU, or ASIC; the computing power type of the processor includes integer (Integer, INT) operations and floating point (Floating Point, FP) operations; the calculation width includes 64 bits, 32 bits, 16 bits, 8 bits, etc. Accordingly, the operation mode of the processor includes 64-bit integer (INT64), 64-bit floating-point (FP64), INT32, FP32, INT16, FP16, etc.; the smallest independently runnable unit can be a physical core (Core), a logical core, or a stream processor.
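As a rough illustration of how these attributes combine, the sketch below derives a raw capability figure for one processor. The function name and the capability measure (runnable-unit count times frequency) are illustrative assumptions, not the patent's exact formula:

```python
# Hypothetical sketch: a processor's raw capability taken as the number of
# smallest independently runnable units (physical cores, logical cores, or
# stream processors) times their computing frequency in GHz.

def processor_capability(num_units: int, freq_ghz: float) -> float:
    """Raw capability of one processor: runnable units x frequency."""
    return num_units * freq_ghz

# e.g. a 64-core CPU at 2.6 GHz versus a 4096-stream-processor GPU at 1.5 GHz
cpu = processor_capability(64, 2.6)
gpu = processor_capability(4096, 1.5)
```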
  • the hardware attribute data of the storage resources includes the types of storage devices, the capacities of the various storage devices, and the input/output (Input/Output, I/O) rates of the various storage devices. Storage device types include mechanical hard disks such as hard disk drives (Hard Disk Drive, HDD), magnetic tape, or solid state disks (Solid State Disk, SSD).
  • the hardware attribute data of the network resource includes the bus bandwidth inside the computing node and the network bandwidth between the computing node and the external network.
  • the above-mentioned hardware attribute data of the various hardware resources is only an example and does not constitute a limitation on the resource data obtained by the resource manager 210; the resource manager 210 can also obtain more or less than the above-listed resource data.
  • when obtaining the hardware attribute data of computing resources, the resource manager can obtain the model of each processor and, according to the model, determine the type of processor, the computing power type of the processor, the computing width of the processor, the number of smallest independently runnable units included in the processor, and the computing frequency of the processor.
  • the resource management platform performs resource quantification according to the resource data of the first resource object, and obtains a quantification result corresponding to the first resource object.
  • the hardware resources of resource objects include at least one of computing resources, storage resources, or network resources.
  • the resource data includes hardware attribute data of computing resources, hardware attribute data of storage resources, or hardware attribute data of network resources. Therefore, the resource management platform 100 needs to quantify the computing resources of a resource object according to the hardware attribute data of its computing resources, quantify its storage resources according to the hardware attribute data of its storage resources, and quantify its network resources according to the hardware attribute data of its network resources.
  • the quantification results corresponding to the first resource object include static quantification results of computing resources, static quantification results of storage resources, and static quantification results of network resources.
  • the static quantification results of computing resources are used to indicate the basic computing capabilities of a resource object, which can be understood as the computing capabilities determined by the configuration or attributes of the resource object itself; the static quantification results of storage resources are used to indicate the resource object's basic storage capability; and the static quantification results of network resources are used to indicate the resource object's basic data transmission capability.
  • for the quantification of computing resources, the type of processor, the computing frequency of the processor, and the computing power type and computing width of the processor may differ within the same resource object or across different resource objects.
  • for example, when the resource object is a heterogeneous device, a CPU and a GPU can both be present in the resource object; or the operation mode of some processors in the same resource object is INT64 while that of others is INT32; or the computing frequency of some processors in a resource object is 3.4 gigahertz (GHz) while that of others is 2.1 GHz.
  • across different resource objects, the processors of some resource objects include only CPUs while other resource objects are heterogeneous devices; or the computing width of some resource objects is 64 bits while that of others is 32 bits. Therefore, different processors have different computing capabilities, and it is necessary to quantify the computing capabilities of the various processors according to a unified standard.
  • in this application, the smallest independently runnable unit in the processor is used as the unit of quantification, and the processor is quantified according to its computing power type and computing width.
  • for example, a processor performing integer operations with a width of a and a processor performing floating-point operations with a width of b are used as the quantification standards. The computing frequencies of processors with different integer calculation widths (including INT64, INT32, INT16, INT8, etc.) are converted into the quantification result of a processor performing integer operations with width a, and the computing frequencies of processors with different floating-point calculation widths (including FP64, FP32, FP16, FP8, etc.) are converted into the quantification result of a processor performing floating-point operations with width b.
  • for example, the computing capability of a processor with a calculation width of t is taken to be p times that of a processor with a calculation width of a; for instance, the computing power of a processor whose operation mode is INT32 is half that of a processor whose operation mode is INT64, and the computing power of a processor whose operation mode is FP16 is a quarter of that of a processor whose operation mode is FP64.
  • in this way, the computing power of processors performing integer operations with different calculation widths is converted into the computing power of a processor whose operation mode is INT a, and the computing power of processors performing floating-point operations with different calculation widths is converted into the computing power of a processor whose operation mode is FP b.
  • the computing capability of each processor can be quantified through the above method, and then a static quantification result of the computing resource of the entire resource object can be obtained according to the computing capability of each processor.
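The width-conversion rule just described can be sketched as follows. INT64/FP64 are assumed here as the base widths a and b, and the linear factor t/64 reproduces the stated examples (INT32 counts as half of INT64, FP16 as a quarter of FP64); the names are illustrative:

```python
# Sketch of the width-conversion rule: with a base calculation width of 64
# (assumed values for a and b), a processor of width t contributes t/64 of
# the capability of a same-frequency base-width processor.
BASE_WIDTH = 64  # assumed base width for both integer and floating-point

def converted_capability(num_units: int, freq_ghz: float, width: int) -> float:
    """Capability of one processor normalized to the base width."""
    return num_units * freq_ghz * (width / BASE_WIDTH)

# An INT32 processor counts for half of an INT64 one at equal frequency:
half = converted_capability(8, 2.0, 32)   # 8 units * 2.0 GHz * 1/2
full = converted_capability(8, 2.0, 64)   # 8 units * 2.0 GHz * 1
```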
  • the static quantification result of the computing resource of the resource object can be determined by the following formula 1, or the static quantization result of the computing resource of the resource object can be determined by the following formula 2.
  • c is the static quantification result of the computing resources of the resource object; ΣF_INT is the static quantification result of the computing power of all integer-operation processors in the resource object; ΣF_FP is the static quantification result of the computing power of all floating-point-operation processors in the resource object.
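The bodies of formulas 1 and 2 are not reproduced in this text. Given the symbol definitions above, one plausible reconstruction of formula 1 (an assumption, not the patent's verbatim equation) is:

```latex
% Assumed reconstruction: the overall compute quantification c as the sum of
% the integer-operation and floating-point-operation quantifications.
c = \sum F_{\mathrm{INT}} + \sum F_{\mathrm{FP}}
```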
  • the storage resources are quantified according to the capacity of different storage devices and the IO rate of the storage devices, and the static quantification result of the storage resources of resource objects can be determined according to the following formula 3.
  • M is the static quantification result of the storage resource of the resource object; R_i is the capacity of the i-th type of storage device; R is the total capacity of the storage devices included in the resource object; and v_i is the I/O rate of the i-th type of storage device.
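Since the body of formula 3 is not reproduced in this text, the sketch below implements one plausible reading of it from the symbols defined above: the capacity-weighted sum of per-type I/O rates, sum over i of (R_i / R) · v_i. This weighting is an assumption, not the patent's verbatim formula:

```python
# Assumed reading of formula 3: storage quantification M as the
# capacity-weighted sum of per-type I/O rates.

def storage_quantification(devices):
    """devices: list of (capacity, io_rate) pairs, one per storage-device type."""
    total = sum(cap for cap, _ in devices)        # R: total capacity
    return sum((cap / total) * rate               # sum_i (R_i / R) * v_i
               for cap, rate in devices)

# Half HDD (200 MB/s) and half SSD (2000 MB/s) by capacity:
m = storage_quantification([(4000, 200), (4000, 2000)])
```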
  • when the resource object is a single computing node, the static quantification result of its network resources is the bus bandwidth inside the computing node; that is, the static quantification result of the internal network satisfies formula 4, in which the internal-network quantification result n_in equals the bus bandwidth inside the computing node.
  • the above-mentioned embodiment corresponding to FIG. 3 introduces a method for performing static resource quantification on various resources of a resource object when the resource object is a single computing node.
  • the resource object may also be a cluster, and each cluster includes multiple computing nodes.
  • the hardware resources of the first resource object also include computing resources, storage resources, and network resources; the resource data of the first resource object includes hardware attribute data of computing resources, hardware attribute data of storage resources, and hardware attribute data of network resources.
  • the hardware attribute data of the cluster's computing resources likewise includes the computing power type of the processors, the computing width of the processors, the number of processors, the number of smallest independently runnable units included in each processor, and the computing frequency of the independently runnable units; in this case, the hardware attribute data is that of the multiple computing nodes included in the cluster.
  • the hardware attribute data of storage resources also includes the type of storage device, the capacity of each type of storage device, and the I/O rate of each type of storage device.
  • when the first resource object is a cluster, the multiple computing nodes in the cluster are connected to each other through network devices (such as switches or routers), and the hardware attribute data of the network resources includes the network topology within the cluster, the port bandwidth of the internal network devices (switches and/or routers, etc.), and the network bandwidth between the cluster and the external network.
  • the network topology within the cluster can be a Spine-Leaf topology, a traditional three-layer topology, a Fat-Tree topology, a Dragonfly topology, or a Dragonfly+ topology, etc.
  • the quantification method for computing resources may refer to the quantification method when the resource object is a computing node above.
  • for the quantification method of storage resources, refer to the quantification method above for the case where the resource object is a computing node.
  • the computing nodes in the cluster are connected through one or more layers of network devices (such as switches); the network topology in the cluster can be a leaf-spine (Spine-Leaf) topology or a traditional three-tier topology.
  • FIG. 4 is a schematic diagram of an intra-cluster network topology provided by an embodiment of the present application.
  • the network devices directly connected to the computing nodes are used as leaf devices, and the network devices at the other layers are used as spine devices.
  • switches at the access layer are used as leaf devices, and switches at the aggregation layer and switches at the core layer are used as spine devices.
  • the resource management platform 100 obtains the port bandwidth of each leaf device and determines the average bandwidth or the minimum bandwidth of all leaf devices, obtains the port bandwidth of each spine device and determines the average bandwidth or the minimum bandwidth of all spine devices, and then determines the static quantification result of the internal network of the resource object according to the average or minimum bandwidth of the leaf devices and the average or minimum bandwidth of the spine devices.
  • the static quantification result of the internal network when the resource object is a cluster can be determined by the following formula 5.
  • n_in = η · min(min{W_spine}, avg{W_leaf}) (Formula 5)
  • n_in is the static quantification result of the internal network of the resource object; min{W_spine} represents the minimum bandwidth among the port bandwidths of all spine devices in the resource object; avg{W_leaf} represents the average of the port bandwidths of all leaf devices in the resource object; η represents the number of independent computing units in the cluster.
  • in Formula 5, the static quantification result of the internal network is calculated based on the minimum bandwidth of all spine devices and the average bandwidth of all leaf devices. The static quantification result of the internal network can also be calculated based on the average bandwidth of all spine devices and the average bandwidth of all leaf devices, based on the minimum bandwidth of all spine devices and the minimum bandwidth of all leaf devices, or based on the average bandwidth of all spine devices and the minimum bandwidth of all leaf devices.
  • the computing nodes may support hardware acceleration technologies; for example, a computing node may use remote direct memory access (RDMA) technology or in-network computing (INC) technology to improve data transmission efficiency between computing nodes within the cluster.
  • in this case, the resource manager 210 obtains the hardware acceleration information of the computing nodes in the cluster and sends the hardware acceleration information to the resource management platform 100.
  • the static quantification result of the internal network when the resource object is a cluster can also be determined by the following formula 6.
  • n_in = (1 + j·c) · η · min(min{W_spine}, avg{W_leaf}) (Formula 6)
  • j represents the number of hardware acceleration methods that the computing node has, and c is a weight coefficient.
  • the computing node may also include other hardware acceleration technologies, which will not be detailed here.
  • the weight coefficients corresponding to different hardware acceleration technologies may be different or the same; Formula 6 takes the case where the weight coefficients corresponding to different hardware acceleration technologies are the same as an example.
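Formulas 5 and 6 can be sketched directly from the definitions above. The variable names are illustrative, and η is taken as the number of independent computing units in the cluster:

```python
# Sketch of formulas 5 and 6: the internal-network static quantification
# n_in for a cluster, optionally boosted by hardware acceleration
# (j acceleration methods, weight coefficient c).

def n_in(num_units, spine_bw, leaf_bw, j=0, c=0.0):
    """(1 + j*c) * eta * min(min{W_spine}, avg{W_leaf})."""
    base = min(min(spine_bw), sum(leaf_bw) / len(leaf_bw))
    return (1 + j * c) * num_units * base

# 100-unit cluster, spine ports of 100/40 Gbps, leaf ports of 25/25 Gbps:
plain = n_in(100, [100, 40], [25, 25])              # formula 5
accel = n_in(100, [100, 40], [25, 25], j=2, c=0.1)  # formula 6
```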
  • the resource management platform 100 can respectively quantify the computing resource, storage resource and network resource of the first resource object through the above method, and obtain the quantification result corresponding to the first resource object.
  • the quantification results include the above static quantification results of computing resources, storage resources and network resources.
  • after determining the static quantification results of the various resources of the first resource object, the resource management platform 100 stores the resource data of the first resource object and the static quantification results of its various resources in the resource directory. The resource directory records the resource data of each resource object in the computing power network and the static quantification results of the various resources of each resource object. After storing these, the resource management platform 100 returns information indicating successful access to the computing power network to the resource manager 210 of the first resource object.
  • the resource management platform 100 can quantify the other resource objects connected to the computing power network through the above static resource quantification method, obtain the quantification results corresponding to each resource object, and store the resource data of each resource object and the static quantification results of the various resources of each resource object in the resource directory.
  • a user can submit a scheduling request through the network (web) interface of the computing power network. The scheduling request includes resource requirements, the resource requirements include the hardware resource requirements of the scheduling request, and the hardware resources include any one or more of computing resources, storage resources, or network resources.
  • the resource management platform 100 determines the target resource object for processing the scheduling request according to the quantification results of each resource object in the computing power network and the resource requirements in the scheduling request.
  • for example, if the resource requirement in the scheduling request is efficiency priority, the target resource object is the resource object with the largest static quantification result of computing resources in the computing power network. If the resource requirements in the scheduling request include efficiency priority and a storage capacity requirement, the target resource object is, among the resource objects in the computing power network whose storage capacity is greater than the storage capacity requirement in the resource requirements, the one with the largest static quantification result of computing resources. If the resource requirement in the scheduling request is price priority, the target resource object is the resource object with the smallest static quantification result of computing resources in the computing power network.
  • FIG. 5 is a schematic flow chart of a dynamic resource quantification method provided by an embodiment of the present application. The method comprises the steps of:
  • the resource management platform acquires a scheduling request.
  • the scheduling request is used to request a resource object for executing a job to be scheduled, and the scheduling request includes resource requirements, and the resource requirements include computing requirements and storage requirements of the scheduling request.
  • the calculation requirement is used to indicate the calculation resources required for processing the scheduling request, that is, the number of minimum independently runnable units required for processing the scheduling request.
  • the storage requirement refers to the size of the storage space required to execute the scheduling request.
  • the resource management platform 100 in the computing power network can obtain the above scheduling request through an application programming interface (application programming interface, API).
  • the above scheduling request also includes a job type, and job types include heavy computing power scenarios, general computing power scenarios, and mixed computing power scenarios.
  • HPC jobs or AI model training are usually heavy-computing scenarios, big data processing and cloud services are usually general-purpose computing scenarios, and mixed-computing scenarios involve jobs that include both heavy-computing and general-purpose computing workloads.
  • the job type is used to indicate the proportion of integer computing resources and the proportion of floating point computing resources required to process the scheduling request.
  • before submitting a scheduling request, the user can configure the job type, computing requirements, and storage requirements on the user interface, so that the resource management platform 100 can perform dynamic resource quantification on each resource object according to the computing and storage requirements. The user can also set the computing power ratio before submitting the scheduling request; that is, the scheduling request also includes the computing power ratio, which refers to the proportion of integer computing resources and the proportion of floating-point computing resources required to execute the scheduling request.
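The fields a scheduling request carries, as described above, might be modeled like this. The class and field names are illustrative assumptions, not the patent's API:

```python
# Hypothetical model of a scheduling request: compute demand (in smallest
# independently runnable units), storage demand, job type, and an optional
# user-set computing power ratio (integer share, floating-point share).
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SchedulingRequest:
    compute_units: int                 # minimum independently runnable units needed
    storage_gb: int                    # storage space needed
    job_type: str                      # "heavy" | "general" | "mixed"
    power_ratio: Optional[Tuple[float, float]] = None

req = SchedulingRequest(compute_units=512, storage_gb=100, job_type="mixed")
```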
  • the resource management platform acquires available resource data of the first resource object.
  • the above available resource data includes resource data of available computing resources, resource data of available network resources, and resource data of available storage resources.
  • the resource data of available computing resources includes the type of processor, the number of available processors, and the number of independent computing units, computing frequency, computing width, and computing power type of each available processor; the resource data of available storage resources includes the available storage capacity.
  • if the first resource object is a cluster, the resource data of the available network resources includes the available port bandwidth of the network devices inside the cluster and the available bandwidth between the cluster and the external network; if the first resource object is a single computing node, the resource data of the available network resources includes the available bandwidth between the computing node and the external network.
  • the resource management platform 100 can send a query request to each resource object 200 at a first time interval, where the query request is used to instruct the resource object 200 that receives it to report its currently available resource data.
  • alternatively, each resource object 200 reports its available resource data to the resource management platform 100 at a second time interval after successfully accessing the computing power network.
  • Each resource object 200 obtains current available resource data through its respective resource manager 210 , and the method for resource manager 210 to obtain available resource data is the same as the method for obtaining resource data in S301 above, which will not be repeated here.
  • the resource management platform determines the matching degree between various types of available resources of the first resource object and various types of resource requirements in resource requirements according to the scheduling request and the available resource data of the first resource object.
  • the matching degree between the various available resources and the various resource requirements includes any one or more of the following: the matching degree between the available computing resources and the computing requirements in the resource requirements, and the matching degree between the available storage resources and the storage requirements in the resource requirements.
  • the computing node where the resource management platform 100 is located records a resource directory, and the resource directory records hardware attribute data of each resource object in the computing power network.
  • after acquiring the available resource data of the first resource object, the resource management platform 100 first determines, according to the hardware attribute data, the scheduling request, and the available resource data of the first resource object, the matching degree between each type of available resource of the first resource object and the corresponding resource requirement; then, according to these matching degrees and the static quantification results, it determines the dynamic quantification result of the first resource object.
  • the resource management platform 100 determines, according to the scheduling request, the number of minimum independently operable units for integer operations and the number of minimum independently operable units for floating-point operations required by the computing requirements of the scheduling request.
  • the resource management platform 100 is pre-configured with computing power ratios associated with different application scenarios.
  • the computing power network supports scenarios such as heavy computing power scenarios, general computing power scenarios, and mixed computing power scenarios.
  • the proportion of integer computing resources required in a heavy computing power scenario is 30% and the proportion of floating-point computing resources is 70%; in a general computing power scenario, the proportion of integer computing resources is 60% and the proportion of floating-point computing resources is 40%; in a mixed computing power scenario, the proportion of integer computing resources is 50% and the proportion of floating-point computing resources is 50%.
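The pre-configured ratios described above can be sketched as a simple lookup. Only the percentages are taken from the text; the scenario keys, function name, and data layout are illustrative assumptions:

```python
# Hypothetical pre-configured computing power ratios per application scenario.
# Percentages follow the text; names and structure are assumptions.
COMPUTING_POWER_RATIOS = {
    "heavy": {"integer": 0.30, "floating_point": 0.70},
    "general": {"integer": 0.60, "floating_point": 0.40},
    "mixed": {"integer": 0.50, "floating_point": 0.50},
}

def split_compute_requirement(scenario: str, total_units: int) -> tuple[int, int]:
    """Split a total compute requirement into integer-operation and
    floating-point-operation minimum independently operable units,
    according to the scenario's configured ratio."""
    ratio = COMPUTING_POWER_RATIOS[scenario]
    int_units = round(total_units * ratio["integer"])
    fp_units = total_units - int_units
    return int_units, fp_units
```

For example, a heavy-computing-power job needing 100 units would be split into 30 integer and 70 floating-point units.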
  • the resource management platform 100 determines the minimum number of independently operable units for integer operations and the minimum number of independently operable units for floating-point operations required to execute scheduling requests according to job types and computing requirements.
  • the scheduling request includes a computing power ratio
  • the resource management platform 100 determines, according to the computing power ratio and the computing requirements in the scheduling request, the minimum number of independently operable units for integer operations and the minimum number of independently operable units for floating-point operations required to execute the scheduling request.
  • r_c is the matching degree between the available computing resources and the computing requirements in the resource requirements;
  • INT_t represents the total number of minimum independently operable units for integer operations in the first resource object;
  • FP_t represents the total number of minimum independently operable units for floating-point operations in the first resource object;
  • INT_job represents the number of minimum independently operable units for integer operations required to execute the scheduling request;
  • FP_job represents the number of minimum independently operable units for floating-point operations required to execute the scheduling request;
  • INT_a represents the number of minimum independently operable units for integer operations currently available in the first resource object;
  • FP_a represents the number of minimum independently operable units for floating-point operations currently available in the first resource object.
  • the resource management platform 100 can calculate the resource availability rate of the computing resources according to the following formula 8.
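Formula 8 itself is not reproduced in this excerpt, so the sketch below only illustrates one plausible matching-degree calculation consistent with the symbol definitions above; taking the minimum of the per-type availability ratios and capping it at 1 is an assumption, not the patent's actual formula:

```python
def compute_matching_degree(int_avail: int, fp_avail: int,
                            int_job: int, fp_job: int) -> float:
    """Hypothetical matching degree r_c between available computing
    resources (INT_a, FP_a) and the units required by the scheduling
    request (INT_job, FP_job). A job is fully matched (r_c = 1) when
    both unit types are available in sufficient quantity; otherwise the
    scarcer type bounds the match."""
    ratios = []
    if int_job > 0:
        ratios.append(int_avail / int_job)
    if fp_job > 0:
        ratios.append(fp_avail / fp_job)
    # No compute requirement at all counts as fully matched.
    return min(1.0, min(ratios)) if ratios else 1.0
```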
  • when the first resource object 200 is a cluster, the available resource data of the network resources includes the available bandwidth of the ports of the network devices inside the cluster and the available bandwidth between the cluster and the external network. If the first resource object 200 is a single computing node, the available resource data of the network resources includes the available bandwidth between the computing node and the external network.
  • the matching degree of the available network resources includes the matching degree of the available network resources of the internal network and the matching degree of the available network resources of the external network.
  • the resource management platform 100 obtains the port bandwidth of each Leaf device from the resource directory; according to the available port bandwidth of each Leaf device and its port bandwidth, it determines the ratio of the available port bandwidth of each Leaf device to its port bandwidth, thereby obtaining the multiple port bandwidth ratios corresponding to all Leaf devices, and then determines the average or minimum value of the port bandwidth ratios of all Leaf devices.
  • Similarly, the resource management platform 100 obtains the port bandwidth of each Spine device from the resource directory, determines the ratio of the available port bandwidth of each Spine device to its port bandwidth, obtains the multiple port bandwidth ratios corresponding to all Spine devices, and determines their average or minimum value. Then, according to the average or minimum value of the port bandwidth ratios corresponding to all Leaf devices and the average or minimum value of the port bandwidth ratios corresponding to all Spine devices, it determines the matching degree of the available network resources of the internal network when the first resource object is a cluster. In the embodiment of the present application, the following formula 9 can be used to determine this matching degree.
  • r_in represents the matching degree of the available network resources of the internal network of the first resource object;
  • min{P_spine} represents the minimum value of the port bandwidth ratios corresponding to all Spine devices in the first resource object;
  • avg{A_leaf} represents the average value of the port bandwidth ratios of all Leaf devices in the first resource object.
  • the matching degree of network resources available in the internal network of the first resource object is 1.
  • the resource availability rate of the internal network is calculated based on the minimum value of the bandwidth ratios of multiple ports corresponding to all spine devices and the average value of the bandwidth ratios of multiple ports corresponding to all leaf devices.
  • the resource availability rate of the internal network can also be calculated based on the average of the port bandwidth ratios corresponding to all Spine devices and the average of the port bandwidth ratios corresponding to all Leaf devices; or based on the minimum of the port bandwidth ratios corresponding to all Spine devices and the minimum of the port bandwidth ratios corresponding to all Leaf devices; or based on the average of the port bandwidth ratios corresponding to all Spine devices and the minimum of the port bandwidth ratios corresponding to all Leaf devices.
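The Leaf/Spine combination just described can be sketched as follows. The multiplicative combination of the Spine aggregate and the Leaf aggregate is an assumption (formula 9 is not reproduced in this excerpt); the aggregator arguments cover the avg/min variants the text mentions:

```python
def internal_network_matching(spine_ratios, leaf_ratios,
                              spine_agg=min, leaf_agg=None):
    """Hypothetical matching degree r_in of a cluster's internal network:
    combine an aggregate of the Spine port-bandwidth ratios with an
    aggregate of the Leaf port-bandwidth ratios. Defaults follow the
    min{P_spine} / avg{A_leaf} description; multiplying the two
    aggregates is an assumption."""
    if leaf_agg is None:
        leaf_agg = lambda xs: sum(xs) / len(xs)  # average by default
    return spine_agg(spine_ratios) * leaf_agg(leaf_ratios)
```

Passing `leaf_agg=min` or `spine_agg=lambda xs: sum(xs) / len(xs)` selects the alternative aggregations described above.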
  • the resource management platform 100 can calculate the resource availability rate of the external network of the first resource object according to the following formula 10.
  • r_out represents the matching degree of the available network resources of the external network of the first resource object;
  • W_a represents the available bandwidth between the first resource object and the external network;
  • W represents the bandwidth between the first resource object and the external network.
  • through the above method, the resource management platform 100 can determine, according to the scheduling request and the available resource data of each resource object, the matching degree between each type of available resource of each resource object and the corresponding type of resource requirement in the resource requirements.
  • the resource management platform dynamically quantifies the first resource object according to the static quantification result of the first resource object and the matching degrees between its various available resources and the various resource requirements, and obtains the dynamic quantification result of the first resource object relative to the scheduling request.
  • the dynamic quantization result is used to indicate the ability of the first resource object to process the scheduling request.
  • the dynamic quantification result is obtained from the resource data of the currently available resources of the first resource object, acquired after the scheduling request is received; that is, it reflects both the currently available resources of each resource object and the various resources required by the scheduling request, so the dynamic quantification result can more accurately reflect the current ability of each resource object to process the scheduling request.
  • the network delay between the data source of the data to be processed by the scheduling request and the resource object is also an important parameter of the external network of the resource object.
  • the resource management platform 100 can also determine the static quantification result of the external network of the resource object according to the network delay between the data source and the resource object, and the bandwidth between the resource object and the external network.
  • the static quantification result of the external network of the resource object can be determined by the following formula 11.
  • n out represents the static quantification result of the external network of the first resource object
  • W represents the network bandwidth between the first resource object and the external network
  • T d represents the network delay between the data source and the first resource object.
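Formula 11 is not reproduced in this excerpt; the ratio below is a hypothetical form that merely captures the stated intent that more bandwidth between the resource object and the external network, and less delay to the data source, yield a stronger external-network score:

```python
def external_network_static_quant(bandwidth_gbps: float,
                                  delay_ms: float) -> float:
    """Hypothetical static quantification n_out of a resource object's
    external network, from the bandwidth W to the external network and
    the network delay T_d to the data source. The W / T_d form is an
    assumption, not the patent's formula 11."""
    return bandwidth_gbps / delay_ms
```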
  • the resource management platform 100 can perform dynamic quantization on the first resource object using the corresponding static quantification result of the first resource object and the matching degrees between the various available resources of the first resource object and the resource requirements, to obtain the dynamic quantification result of the first resource object.
  • jobs in scenarios with heavy computing power usually require more computing resources and process a large amount of data.
  • the computing power of the resource object and the bandwidth of the internal network of the resource object have an impact on the job processing efficiency of the heavy computing power scenario.
  • the resource management platform 100 can calculate the dynamic quantification result of the first resource object relative to the scheduling request through the following formula 12.
  • d is the dynamic quantification result of the first resource object relative to the scheduling request
  • α is the proportion of heavy computing power in the scheduling request, and is a value greater than or equal to 0 and less than or equal to 1.
  • the value of the heavy computing power ratio α can be configured by the user and carried in the above job scheduling request.
  • the resource management platform 100 can also calculate the dynamic quantification result of the first resource object relative to the scheduling request through the following formula 13.
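Formulas 12 and 13 are likewise not reproduced in this excerpt; the weighted combination below is only a hypothetical illustration of how the heavy computing power ratio α could weight compute- and internal-network-related terms against the remaining terms when forming d:

```python
def dynamic_quantification(alpha: float, compute_score: float,
                           internal_net_score: float,
                           other_score: float = 0.0) -> float:
    """Hypothetical dynamic quantification d of a resource object
    relative to a scheduling request. alpha (0 <= alpha <= 1) is the
    heavy computing power ratio carried in the scheduling request; the
    averaging and the specific weighting are assumptions, not the
    patent's formulas 12/13."""
    assert 0.0 <= alpha <= 1.0
    heavy_part = (compute_score + internal_net_score) / 2
    return alpha * heavy_part + (1 - alpha) * other_score
```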
  • the resource management platform 100 can calculate and obtain the dynamic quantification result of each resource object in the computing power network relative to the scheduling request through the above method.
  • the dynamic quantification result of each resource object relative to the scheduling request can reflect the ability of the resource object to execute the scheduling request.
  • the larger the value of d, the higher the efficiency of the resource object when executing the scheduling request; the smaller the value of d, the lower that efficiency.
  • After the resource management platform 100 obtains the dynamic quantification result of each resource object in the computing power network relative to the scheduling request, it sends the dynamic quantification results and the scheduling request to the job scheduling platform 300.
  • the job scheduling platform 300 allocates the scheduling request to the target resource object for processing according to the dynamic quantification result of each resource object relative to the scheduling request.
  • the above job scheduling request further includes user requirements.
  • the user can select a resource scheduling strategy on the user interface, for example, a strategy that prioritizes efficiency or one that prioritizes price. If the user chooses efficiency priority, the job scheduling platform 300 allocates the scheduling request to the resource object with the largest dynamic quantification result for processing; if the user chooses price priority, the job scheduling platform 300 allocates the scheduling request to the resource object with the smallest dynamic quantification result for processing.
  • the job scheduling platform 300 can estimate the duration for each resource object to process the scheduling request according to the dynamic quantification result of each resource object relative to the scheduling request and the scheduling request.
  • the user can also configure the range of execution time while selecting price priority, and the job scheduling platform 300 can allocate the scheduling request to the resource object that meets the execution time and has the smallest dynamic quantification result for execution.
  • the job scheduling platform 300 can estimate the duration and price of processing the scheduling request for each resource object according to the dynamic quantification result of each resource object relative to the scheduling request and the scheduling request.
  • the job scheduling platform 300 displays the duration and price of each resource object processing the scheduling request on the user interface, and the user selects the resource object that processes the scheduling request.
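The efficiency-priority and price-priority strategies described above amount to picking the resource object with the largest or smallest dynamic quantification result; a minimal sketch, where the mapping from object identifiers to d values is an assumed data shape:

```python
def pick_target(objects: dict, strategy: str) -> str:
    """Select the target resource object per the user's scheduling
    strategy: 'efficiency' picks the object with the largest dynamic
    quantification result d, 'price' picks the smallest.
    `objects` maps resource object id -> d (illustrative shape)."""
    if strategy == "efficiency":
        return max(objects, key=objects.get)
    if strategy == "price":
        return min(objects, key=objects.get)
    raise ValueError(f"unknown strategy: {strategy}")
```

A duration constraint, as described above, would simply filter `objects` before the min/max selection.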
  • FIG. 6 is a schematic diagram of a resource management device provided by an embodiment of the present application.
  • the resource management device 600 includes an acquisition module 110 and a processing module 120 .
  • the obtaining module 110 is used to obtain the resource data of a resource object, where the resource data indicates the attribute information of the hardware resources of the resource object; the processing module 120 is used to quantify the resource data to obtain a quantification result, and to allocate resources for a scheduling request according to the quantification result.
  • the quantization result includes the result obtained by quantifying computing resources in the smallest independently runnable unit in the resource object.
  • the above-mentioned hardware resources include computing resources, and the above-mentioned resource data includes hardware attribute data of the computing resources, where the hardware attribute data of the computing resources includes at least one of the computing power type of the processors, the computing width of the processors, the number of independently operable units in a single processor, and the computing frequency of the independently operable units; the computing power types include integer operations and floating-point operations. The above quantification results include a static quantification result of the computing resources, which is used to indicate the basic computing capability of the resource object, that is, the computing capability of the resource object when it is idle;
  • the above-mentioned processing module 120 quantifying the resource data to obtain a quantification result specifically includes: determining the static quantification result of the computing resources according to the hardware attribute data of the computing resources, taking the minimum independently operable unit as the unit.
  • the minimum independent operating unit is a physical core, a logical core or a stream processor.
  • the computing resources of resource objects are quantified in units of the minimum independently operable unit, so that processors of the same computing power type but different computing widths are quantified by the same standard; for example, the computing power of processors of the same computing power type with different computing widths is converted into the computing power of processors of that type with a single common computing width. In this way, the computing power of each resource object can be evaluated more accurately before resources are allocated, which makes resource scheduling more reasonable and improves the resource utilization rate of the computing power network.
  • the static quantification results of the above-mentioned computing resources include the quantification result of the processors for integer operations and the quantification result of the processors for floating-point operations. Determining the static quantification result of the computing resources according to the hardware attribute data, taking the minimum independently operable unit as the unit, specifically includes: converting the computing frequency of processors for integer operations with different computing widths into the quantified value of a processor for integer operations with a target computing width, to obtain the quantification result of the processors for integer operations; and converting the computing frequency of processors for floating-point operations with different computing widths into the quantified value of a processor for floating-point operations with the target computing width, to obtain the quantification result of the processors for floating-point operations.
  • the computing resources of the resource object are quantified in units of the minimum independently operable unit, and the computing frequencies of the minimum independently operable units of different processors are quantified on the same basis. Quantifying the computing frequency of processors with the same computing power type but different computing widths into the quantified value of processors with the same computing width makes it possible to evaluate and compare the computing capabilities of different resource objects more accurately, so that when resources are allocated, resource scheduling is more reasonable and the resource utilization of the computing power network is improved.
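The width-conversion idea above can be sketched as follows; the linear count × frequency × (width / target width) scaling is an assumption about how units of different computing widths are normalized to a target width:

```python
def quantize_units(units, target_width: int = 64) -> float:
    """Convert minimum independently operable units of one computing
    power type (e.g. integer operations) but different computing widths
    into equivalent units of a target computing width.
    Each entry in `units` is (unit_count, width_bits, freq_ghz); the
    64-bit default target and the linear scaling are assumptions."""
    total = 0.0
    for count, width, freq in units:
        # A narrower unit contributes proportionally less per cycle.
        total += count * freq * (width / target_width)
    return total
```

For instance, four 64-bit units and two 32-bit units, all at 2.0 GHz, quantize to the equivalent of ten 64-bit GHz-units under this assumed scaling.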
  • the above-mentioned hardware resources include storage resources, and the above-mentioned resource data includes hardware attribute data of the storage device, and the hardware attribute data of the storage device includes the type, capacity, and input/output rate of the storage device, wherein different storage devices have different storage media;
  • the above quantitative results include static quantitative results of storage resources, which are used to indicate the basic storage capabilities of resource objects;
  • the processing module 120 quantifies the resource data to obtain a quantization result, including: determining the static quantization result of the storage resource according to the hardware attribute data of the storage device.
  • Storage devices are not only used to store data: computing nodes continuously read from and write to storage devices when processing tasks. Different storage devices have different storage capacities, and their read and write rates (that is, the input and output rates of the storage devices) also differ. Quantifying the storage resources of a resource object in combination with the capacity and input/output rates of its different storage devices therefore reflects the performance of the resource object's storage resources more accurately, so that resource scheduling during allocation is more reasonable.
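A hypothetical static quantification of storage resources that combines each device's capacity and input/output rate, as the text suggests; the equal weighting, the units, and the additive form are illustrative assumptions, not the patent's formula:

```python
def storage_static_quant(devices, capacity_weight: float = 0.5) -> float:
    """Hypothetical static quantification of a resource object's storage
    resources. Each entry in `devices` is (capacity_tb, io_rate_gbps);
    capacity and I/O rate are blended per device and summed across the
    object's storage devices (assumed scheme)."""
    score = 0.0
    for capacity_tb, io_rate_gbps in devices:
        score += (capacity_weight * capacity_tb
                  + (1 - capacity_weight) * io_rate_gbps)
    return score
```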
  • the aforementioned hardware resources further include network resources, and the aforementioned resource data includes hardware attribute data of the network resource.
  • the resource object is a computing node
  • the hardware attribute data of the network resource includes the bus bandwidth within the computing node
  • the aforementioned quantification results include the static quantification result of the network resources, which is used to indicate the basic data transmission capability of the resource object;
  • the processing module 120 quantifies the resource data to obtain a quantization result, including: taking the bus bandwidth of the computing node as the static quantization result of the network resource.
  • the resource object of the computing power network can be a single computing node. When a single computing node processes data, the data is transmitted between the modules in the node through the node's bus; therefore, when the resource object is a single node, the bus bandwidth of the node is an important criterion for evaluating the network transmission capability within the node.
  • the aforementioned hardware resources include network resources, and the resource data includes hardware attribute data of the network resources.
  • the hardware attribute data of the network resources includes the network topology of the cluster, the port bandwidth of the network devices inside the cluster, and the network bandwidth between the cluster and the external network; the above quantification results also include the static quantification result of the network resources, which is used to indicate the basic data transmission capability of the resource object;
  • the processing module 120 quantifies the resource data to obtain the quantified result, including: determining the static quantified result of the network resource according to the network topology of the cluster and the port bandwidth of each network device inside the cluster.
  • the resource object of the computing power network can also be a cluster including multiple computing nodes. Multiple computing nodes in the cluster are connected to each other through network devices.
  • the port bandwidth of the network devices is an important factor affecting the speed of data interaction between different computing nodes, and the speed of data interaction between nodes affects the efficiency with which the cluster processes tasks. The port bandwidth of different network devices differs, and the topology of the network devices differs between clusters; the topology also affects the data interaction between nodes. Therefore, the data transmission capability of each cluster can be determined according to its network topology and the port bandwidth of its network devices, which evaluates the data transmission capability of each resource object more accurately; allocating resources on this basis makes resource scheduling more reasonable and improves the resource utilization of the computing power network.
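One plausible way to turn the topology and port bandwidths into a static score is to bound the cluster's transmission capability by the aggregate port bandwidth of each switching tier; this min-of-tiers rule is an assumption consistent with the Leaf/Spine description, not the patent's formula:

```python
def cluster_network_static_quant(leaf_port_bw, spine_port_bw) -> float:
    """Hypothetical static quantification of a cluster's internal
    network from its Leaf/Spine topology: cluster-wide transmission is
    limited by whichever tier has less aggregate port bandwidth, so the
    minimum of the two aggregates is taken (assumed rule).
    Arguments are lists of per-device port bandwidths, e.g. in Gbps."""
    return min(sum(leaf_port_bw), sum(spine_port_bw))
```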
  • hardware acceleration technologies such as remote direct memory access technology and/or in-network computing technology
  • the hardware acceleration technology of computing nodes can also improve the efficiency of data processing. In the process of quantifying the data transmission capability of a resource object, the effect brought by hardware acceleration technology can also be quantified, so that the data processing efficiency of the resource object is obtained more accurately.
  • the processing module 120 allocating resources for the scheduling request according to the quantification result includes: obtaining the resource requirements in the scheduling request, where the resource requirements include the hardware resource requirements of the scheduling request, for example any one or more of the computing requirements for computing resources, the storage requirements for storage resources, or the network requirements for network resources; the resource management platform then determines the target resource object for processing the scheduling request according to the static quantification results of the various hardware resources of the multiple resource objects in the computing power network system and the resource requirements in the scheduling request.
  • the computing capabilities of each resource object are more accurately evaluated.
  • Determining the resource object that processes the scheduling request in combination with the resource requirements in the scheduling request makes it possible to schedule the resources of resource objects more reasonably and improve the resource utilization rate of the computing power network.
  • the processing module 120 allocating resources for the scheduling request according to the quantification result includes: determining the available resources of the resource object; determining the dynamic quantification result of the resource object relative to the scheduling request according to the quantification result, the resource requirements, and the available resources of the resource object, where the dynamic quantification result is used to indicate the ability of the resource object to process the scheduling request; and determining the target resource object for processing the scheduling request according to the dynamic quantification result and the resource requirements in the scheduling request, where the resource requirements include the hardware resource requirements of the scheduling request, for example the demand for computing resources, the demand for storage resources, and so on.
  • the dynamic quantization result is used to indicate the capability of the first resource object to process the scheduling request.
  • the dynamic quantification result is obtained from the resource data of the currently available resources of the first resource object, acquired after the scheduling request is received; that is, by re-quantifying the computing resources, storage resources, and network resources of the computing power network based on the resource requirement information of the scheduling request and the currently available resources of each resource object, the efficiency with which each resource object can execute the scheduling request can be obtained more accurately.
  • the resource management device 600 can first quantify the various resources included in each resource object in the computing power network to obtain the basic data processing capability of each resource object; and, when receiving a scheduling request, dynamically quantify each resource object according to the resource requirements of the scheduling request and the currently available resources of the resource object.
  • the available resources of the resource object include available computing resources, available storage resources, and available network resources; the processing module 120 then determines the dynamic quantification result of the resource object relative to the scheduling request, which includes:
  • determining the matching degree of the computing resources, where the matching degree of the computing resources refers to the matching degree between the available computing resources and the computing requirements in the resource requirements, and the computing requirements in the resource requirements refer to the computing resources required to process the scheduling request;
  • determining the matching degree of the storage resources, where the matching degree of the storage resources refers to the matching degree between the available storage resources and the storage requirements in the resource requirements, and the storage requirements in the resource requirements refer to the storage resources required to process the scheduling request;
  • determining the matching degree of the cluster internal network, which refers to the matching degree between the available network resources of the cluster's internal network and the internal network requirements in the resource requirements; and, based on the network bandwidth between the cluster and the cluster-external network and the available network bandwidth between the cluster and the cluster-external network, determining the matching degree of the cluster external network, which is the matching degree between the available network resources of the cluster's external network and the external network requirements in the resource requirements;
  • the above dynamic quantification results are determined according to the matching degree of computing resources, the matching degree of storage resources, the matching degree of the internal network of the cluster, and the matching degree of the external network of the cluster.
  • the processing module 120 determines the target resource object for processing the scheduling request according to the dynamic quantization result and the resource requirement in the scheduling request, specifically including:
  • the job scheduling platform can assign the scheduling request, based on the dynamic quantification results and the user's needs (such as efficiency priority or price priority), to a resource object that meets the user's requirements for processing.
  • the acquisition module 110 acquiring the resource data of the resource object specifically includes: acquiring the resource data of the resource object through the resource manager of the resource object, where the resource manager obtains the resource data through at least one of the baseboard management controller (BMC), a cluster discovery protocol, or a data collection interface.
  • the resource management device 600 in this embodiment of the present application may be implemented by a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD), where the above PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the resource management method shown in FIG. 3 or FIG. 5 can also be realized by software
  • the resource management device and its modules can also be software modules.
  • the resource management device 600 can be used to implement the static resource quantification and dynamic resource quantification of resource objects in the above method embodiments; for details, refer to the relevant descriptions in the method embodiments corresponding to FIG. 3 or FIG. 5, which will not be repeated here.
  • FIG. 7 is a schematic diagram of a computing device 700 provided by an embodiment of the present application.
  • the computing device 700 includes: one or more processors 710, a communication interface 720, and a memory 730.
  • the processor 710, the communication interface 720, and the memory 730 are connected to each other through the bus 740, wherein:
  • for the specific implementation of the various operations performed by the processor 710, reference may be made to the specific operations performed by the resource management platform 100 in the method embodiment corresponding to FIG. 3 or FIG. 5 above.
  • the processor 710 is configured to implement the operations in S501-S503 in FIG. 5 above, or implement the operations in S301-S302 in FIG. 3 above, which will not be repeated here.
  • the processor 710 may have multiple specific implementation forms, for example, the processor 710 may be a CPU or a GPU, and the processor 710 may also be a single-core processor or a multi-core processor.
  • the processor 710 may be a combination of a CPU and a hardware chip.
  • the aforementioned hardware chip may be an ASIC, a programmable logic device (programmable logic device, PLD) or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the processor 710 may also be implemented solely by a logic device with built-in processing logic, such as an FPGA or a digital signal processor (digital signal processor, DSP).
  • The communication interface 720 may be a wired interface or a wireless interface for communicating with other modules or devices.
  • The wired interface may be an Ethernet interface, a local interconnect network (LIN) interface, etc.; the wireless interface may be a cellular network interface, a wireless LAN interface, etc.
  • The communication interface 720 may specifically be used to obtain the hardware attribute data of the various hardware resources of resource objects and the available resource data, or to obtain scheduling requests uploaded by users, etc.
  • The memory 730 may be a non-volatile memory, for example, a read-only memory (ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory.
  • The memory 730 may also be a volatile memory, such as a random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
  • The memory 730 may also be used to store program codes and data, so that the processor 710 invokes the program codes stored in the memory 730 to perform the operation steps in the method embodiment corresponding to FIG. 3 or FIG. 5 above. In addition, the computing device 700 may contain more or fewer components than shown in FIG. 7, or have its components arranged in a different manner.
  • The bus 740 may be a peripheral component interconnect express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX), etc.
  • The bus 740 can be divided into an address bus, a data bus, a control bus, and the like.
  • The bus 740 may also include a power bus, a control bus, a status signal bus, and the like. However, for clarity, only one thick line is shown in FIG. 7, which does not mean that there is only one bus or only one type of bus.
  • The computing device 700 may further include an input/output interface 750 connected with an input/output device for receiving input information and outputting operation results.
  • FIG. 7 shows the schematic structure of a computing device when the resource management platform 100 is deployed on a physical device (for example, a server) in a computing power network.
  • The resource management platform 100 may also be deployed in a virtual device, for example, in a virtual machine or container running on a single physical device installed with virtualization software or in a cluster formed by multiple physical devices.
  • In that case, the resource management platform 100 uses the processor assigned to the virtual device by the computing power network to complete the resource management method in the above-mentioned embodiments corresponding to FIG. 3 and FIG. 5.
  • The present application also provides a computing power network system as shown in FIG. 1. The system includes the above-mentioned resource management platform 100 and the job scheduling platform 300, and is used to execute the operation steps of the methods shown in FIG. 3 to FIG. 5; for brevity, details are not repeated here.
  • An embodiment of the present application also provides a computer-readable storage medium that stores instructions which, when run on a processor, implement the method steps in the above method embodiments.
  • For the specific implementation by which the processor reads the storage medium and executes the above method steps, reference may be made to the specific operations in the method embodiment corresponding to FIG. 3 or FIG. 5, and details are not repeated here.
  • The above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware, or any other combination thereof.
  • The above-described embodiments may be implemented in whole or in part in the form of computer program products.
  • The computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions according to the embodiments of the present invention are generated in whole or in part.
  • The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wired (for example, coaxial cable, optical fiber, DSL) or wireless (for example, infrared, radio, microwave) means.
  • The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that includes one or more sets of available media.
  • The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium, or a semiconductor medium, and the semiconductor medium may be a solid state drive (SSD).
  • The steps in the methods of the embodiments of the present application may be reordered, combined, or deleted according to actual needs; the modules in the system of the embodiments of the present application may be divided, combined, or deleted according to actual needs.


Abstract

A resource management method, apparatus, and resource management platform. The method is applied to a computing power network system comprising multiple resource objects, and includes: a resource management platform obtains resource data of a resource object, the resource data indicating attribute information of each type of hardware resource of the resource object; quantifies each type of hardware resource to obtain a corresponding quantification result; and then allocates, according to the quantification result, resources for processing a scheduling request, where the quantification result includes a result obtained by quantifying the computing resources of the resource object in units of the smallest independently operable unit. By quantifying the computing resources of each resource object in units of its smallest independently operable unit, the capability of each resource object can be evaluated more accurately, so that resource allocation is performed on that basis and resource scheduling becomes more reasonable.

Description

Resource management method, apparatus, and resource management platform
This application claims priority to Chinese Patent Application No. 202111611446.4, filed with the China National Intellectual Property Administration on December 27, 2021 and entitled "Resource management method", and to Chinese Patent Application No. 202210467575.9, filed on April 29, 2022 and entitled "Resource management method, apparatus, and resource management platform", both of which are incorporated herein by reference in their entireties.
Technical Field
This application relates to the field of computer technologies, and in particular, to a resource management method, apparatus, and resource management platform.
Background
In the field of cloud computing, dynamically distributed computing and storage resources are often interconnected through a computing power network, so that network, storage, computing, and other resources can be scheduled in a unified and coordinated manner. An existing computing power network includes devices of different types, such as general-purpose servers, heterogeneous servers, edge servers, network devices (for example, switches), and storage devices. The computing capabilities of these devices differ, and they may connect to other devices in the computing power network through the networks of different operators; moreover, the latency and bandwidth of the networks between devices vary considerably with region, access network type, and other factors, and the storage capabilities of different devices are likewise uneven because of differences in storage media and networks. Platforms that deploy computing power networks do take these factors into account to some extent when deploying user applications; for example, the elastic computing services of cloud vendors allocate virtual central processing units (vCPUs), memory, and network bandwidth according to the requirements of the software a user runs. However, as user requirements diversify and the number of resource types keeps increasing, existing resource management methods cannot fully exploit the resource utilization of a computing power network. Therefore, how to provide a better resource management method has become an urgent technical problem.
Summary
This application provides a resource management method, apparatus, and resource management platform, which can quantify the hardware resources of the resource objects in a computing power network according to their resource data, obtain more accurately the efficiency with which each resource object processes a scheduling request, and then schedule jobs according to the quantification results and user requirements.
According to a first aspect, this application provides a resource management method for a computing power network including multiple resource objects. The method includes: a resource management platform obtains resource data of a resource object, the resource data indicating attribute information of each type of hardware resource of the resource object; quantifies each type of hardware resource to obtain a corresponding quantification result; and allocates, according to the quantification result, resources for processing a scheduling request, where the quantification result includes a result obtained by quantifying the computing resources of the resource object in units of the smallest independently operable unit.
By quantifying the computing resources of a resource object in units of its smallest independently operable unit, the computing capability of each resource object can be evaluated more accurately, so that resource allocation is performed on that basis, resource scheduling becomes more reasonable, and the resource utilization of the computing power network is improved.
In a possible implementation, the hardware resources include computing resources, and the resource data includes hardware attribute data of the computing resources, which includes at least one of the computing power type of a processor, the computing width of a processor, the number of independently operable units in a single processor, and the computing frequency of an independently operable unit, where the computing power type includes integer operations and floating-point operations. The quantification result includes a static quantification result of the computing resources, which indicates the basic computing capability of the resource object, that is, its computing capability when unloaded.
Quantifying the resource data to obtain the quantification result then includes: determining the static quantification result of the computing resources according to the hardware attribute data of the computing resources, in units of the smallest independently operable unit, where the smallest independently operable unit is a physical core, a logical core, or a stream processor.
For the computing resources of a resource object, quantifying them in units of the smallest independently operable unit measures processors of the same computing power type but different computing widths against the same standard; for example, the computing capabilities of processors of the same computing power type but different computing widths are converted into the computing capability of a processor of that computing power type with one common computing width. The computing capability of each resource object can thus be evaluated more accurately, resource allocation and scheduling become more reasonable, and the resource utilization of the computing power network is improved.
In a possible implementation, the static quantification result of the computing resources includes a quantified result for integer-operation processors and a quantified result for floating-point-operation processors. Determining the static quantification result in units of the smallest independently operable unit then includes: converting the computing frequencies of integer-operation processors of different computing widths into quantified values of an integer-operation processor of a target computing width, to obtain the quantified result for integer-operation processors; and converting the computing frequencies of floating-point-operation processors of different computing widths into quantified values of a floating-point-operation processor of a target computing width, to obtain the quantified result for floating-point-operation processors.
Because the computing frequencies of the smallest independently operable units differ across processors, quantifying the computing frequencies of processors of the same computing power type but different computing widths into quantified values at a common computing width makes it possible to evaluate and compare the computing capabilities of different resource objects more accurately, so that resource scheduling becomes more reasonable and the resource utilization of the computing power network is improved.
In another possible implementation, the hardware resources include storage resources, and the resource data includes hardware attribute data of the storage devices, including the type, capacity, and input/output rate of each storage device, where different storage devices use different storage media. The quantification result includes a static quantification result of the storage resources, which indicates the basic storage capability of the resource object. Quantifying the resource data then includes: determining the static quantification result of the storage resources according to the hardware attribute data of the storage devices.
A storage device is not only used to store data: a computing node continuously reads from and writes to storage devices while processing tasks, and different storage devices differ in capacity and in read/write rate (that is, the input/output rate of the storage device). Quantifying the storage resources of a resource object by combining the capacities and input/output rates of its different storage devices reflects the performance of those storage resources more accurately, so that resource scheduling becomes more reasonable.
In another possible implementation, the hardware resources further include network resources, and the resource data includes hardware attribute data of the network resources. When the resource object is a single device, the hardware attribute data of the network resources includes the bus bandwidth inside the device. The quantification result includes a static quantification result of the network resources, which indicates the basic data transmission capability of the resource object. Quantifying the resource data then includes: using the bus bandwidth of the device as the static quantification result of the network resources.
A resource object of the computing power network may be a single device. When a single device processes data, the data is transferred between modules inside the device through its internal bus, so when the resource object is a single device, the bus bandwidth is an important criterion for evaluating the network transmission capability inside the device.
In another possible implementation, when the resource object is a cluster including multiple devices, the hardware attribute data of the network resources includes the network topology of the cluster, the port bandwidths of the network devices inside the cluster, and the network bandwidth between the cluster and external networks. The quantification result further includes a static quantification result of the network resources, indicating the basic data transmission capability of the resource object. Quantifying the resource data then includes: determining the static quantification result of the network resources according to the network topology of the cluster and the port bandwidths of the network devices inside the cluster.
A resource object of the computing power network may also be a cluster of multiple devices interconnected through network devices. When tasks are processed by a cluster, the port bandwidth of the network devices is an important factor affecting the data exchange rate between devices, which in turn affects the efficiency with which the cluster processes tasks; the port bandwidths of different network devices differ, and the topology of the network devices in a cluster likewise affects the data exchange rate between nodes. Determining the data transmission capability of different clusters from the network topology and the port bandwidths of the network devices therefore evaluates the data transmission capability of each resource object more accurately, so that resource scheduling becomes more reasonable and the resource utilization of the computing power network is improved.
It should be noted that hardware acceleration technologies, such as remote direct memory access and/or in-network computing, may also be used between the devices in a cluster to improve data transmission capability. When such technologies are used, the static quantification result of the network resources may also be determined according to the port bandwidths of the network devices and the hardware acceleration technologies. Hardware acceleration likewise speeds up data processing, and quantifying its effect when quantifying the data transmission capability of a resource object yields a more accurate measure of the efficiency with which the resource object processes data.
In another possible implementation, allocating resources for the scheduling request according to the quantification result includes: obtaining the resource requirement in the scheduling request, the resource requirement including the scheduling request's demand for hardware resources, for example, any one or more of a computing requirement for computing resources, a storage requirement for storage resources, and a network requirement for network resources; and then determining, by the resource management platform, the target resource object for processing the scheduling request according to the static quantification results of each type of hardware resource of the multiple resource objects in the computing power network system and the resource requirement in the scheduling request.
After the computing, storage, and network resources of the resource objects are quantified by the above method, the computing capability of each resource object is evaluated more accurately; on that basis, determining the resource object for processing the scheduling request in combination with the resource requirement in the scheduling request makes the scheduling of the resource objects' resources more reasonable and improves the resource utilization of the computing power network.
In another possible implementation, allocating resources for the scheduling request according to the quantification result includes: determining the available resources of the resource object; determining, according to the quantification result, the resource requirement, and the available resources of the resource object, a dynamic quantification result of the resource object relative to the scheduling request, the dynamic quantification result indicating the capability of the resource object to process the scheduling request; and determining the target resource object for processing the scheduling request according to the dynamic quantification result and the resource requirement in the scheduling request, the resource requirement including the scheduling request's demand for hardware resources, for example, demand for computing resources, demand for storage resources, and so on.
Optionally, the dynamic quantification result indicates the capability of a first resource object to process the scheduling request. The dynamic quantification result is obtained after the scheduling request is received and the resource data of the currently available resources of the first resource object is obtained; that is, obtaining the dynamic quantification result takes into account both the resources currently available on each resource object and the various resources required by the scheduling request. Re-quantifying the computing, storage, network, and other resources of the computing power network on the basis of the resource requirement information of the scheduling request and the available resources yields more accurately the efficiency with which each resource object can currently process the scheduling request.
By the above method, the resource management platform can first quantify each type of resource of each resource object in the computing power network once to obtain the basic data processing capability of each resource object; upon receiving a scheduling request, it quantifies the resource object again according to the currently available resource data of its various hardware resources, combined with the resource requirement in the scheduling request and the static quantification result of the resource object. The efficiency with which each resource object processes the scheduling request can thus be obtained more precisely, and jobs can then be scheduled according to this second quantification result and the user's requirements.
In another possible implementation, the available resources of the resource object include available computing resources, available storage resources, and available network resources. Determining the dynamic quantification result of the resource object relative to the scheduling request according to the quantification result, the resource requirement, and the available resources then includes: determining a matching degree of the computing resources according to the hardware attribute data of the computing resources of the resource object, the resource data of the available computing resources, and the computing requirement in the resource requirement, where the matching degree of the computing resources is the matching degree between the available computing resources and the computing requirement, and the computing requirement refers to the computing resources needed to process the scheduling request; determining a matching degree of the storage resources according to the hardware attribute data of the storage resources of the resource object, the resource data of the available storage resources, and the storage requirement in the resource requirement, where the matching degree of the storage resources is the matching degree between the available storage resources and the storage requirement, and the storage requirement refers to the storage resources needed to process the scheduling request; determining a matching degree of the cluster-internal network according to the port bandwidths of the network devices of the cluster-internal network and their available port bandwidths, where this matching degree is the matching degree between the cluster's available network resources and the internal-network requirement in the resource requirement; determining a matching degree of the cluster-external network according to the network bandwidth between the cluster and the external network and the available network bandwidth between them, where this matching degree is the matching degree between the available network resources of the cluster-external network and the external-network requirement in the resource requirement; and determining the dynamic quantification result according to the matching degrees of the computing resources, the storage resources, the cluster-internal network, and the cluster-external network.
In another possible implementation, determining the target resource object for processing the scheduling request according to the dynamic quantification result and the resource requirement in the scheduling request includes: when the resource requirement is efficiency-first, determining the resource object with the largest dynamic quantification result as the target resource object; or, when the resource requirement is cost-first, determining the resource object with the smallest dynamic quantification result as the target resource object.
By the above method, the efficiency with which every resource object in the computing power network processes the scheduling request can be obtained, and the job scheduling platform can allocate the scheduling request, according to the dynamic quantification results and user requirements such as efficiency-first or price-first, to a resource object that satisfies the user's requirements.
In another possible implementation, the resource management platform obtaining the resource data of the resource object includes: obtaining the resource data of the resource object through the resource manager of the resource object, where the resource manager obtains the resource data through at least one of a baseboard management controller (BMC), a cluster discovery protocol, or a data collection interface.
According to a second aspect, this application provides a resource management apparatus that includes the modules for performing the resource management method in the first aspect or any possible implementation thereof.
According to a third aspect, this application provides a resource management system including a processor and a memory, where the memory stores instructions and the processor executes the instructions; when the processor executes the instructions, it performs the resource management method in the first aspect or any possible implementation thereof.
In a possible implementation, the resource management system is located in one physical device of the computing power network system.
In another possible implementation, the resource management system is deployed in a virtual device of the computing power network system, the virtual device including a virtual machine or a container. When the resource management system is deployed in a virtual device of the computing power network system, the processor of the resource management system is among the processors allocated by the computing power network system to the virtual device, and the memory of the resource management system is among the memory allocated by the computing power network system to the virtual device.
According to a fourth aspect, this application provides a computing device including a processor and a memory, where the memory stores instructions and the processor executes the instructions; when the processor executes the instructions, it performs the resource management method in the first aspect or any possible implementation thereof.
According to a fifth aspect, this application provides a computer-readable storage medium storing instructions that, when run on a server, cause the server to perform the resource management method in the first aspect or any possible implementation thereof.
According to a sixth aspect, this application provides a computer program product that, when run on a server, causes the server to perform the resource management method in the first aspect or any possible implementation thereof.
On the basis of the implementations provided in the above aspects, this application may be further combined to provide more implementations.
Brief Description of Drawings
FIG. 1 is a schematic diagram of a computing power network according to an embodiment of this application;
FIG. 2 is a schematic diagram of a system implementing a resource management method according to an embodiment of this application;
FIG. 3 is a schematic flowchart of a static resource quantification method according to an embodiment of this application;
FIG. 4 is a schematic diagram of a network topology inside a cluster according to an embodiment of this application;
FIG. 5 is a schematic flowchart of a dynamic resource quantification method according to an embodiment of this application;
FIG. 6 is a schematic diagram of a resource management apparatus according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of a computing device according to an embodiment of this application.
Description of Embodiments
For ease of understanding, the terms used in this application are explained first.
A computing power network (computing network) connects dynamically distributed computing and storage resources through a network and, through unified and coordinated scheduling of computing, storage, network, and other multidimensional resources, enables massive numbers of applications to invoke the various resources in the computing power network on demand and in real time.
A heterogeneous cluster is a cluster in which processors of different architectures perform joint computation; for example, the processors in the cluster include any two or more of central processing units (CPUs), graphics processing units (GPUs), neural-network processing units (NPUs), tensor processing units (TPUs), data processing units (DPUs), application-specific integrated circuits (ASICs), and field-programmable gate arrays (FPGAs).
High performance computing (HPC) uses efficient algorithms to rapidly complete data-intensive, compute-intensive, and input/output (I/O)-intensive computations in fields such as scientific research, engineering design, finance, industry, and social administration.
Multi-access edge computing (MEC) moves traffic and service computation from centralized data centers to the network edge, closer to the customer. The network edge analyzes, processes, and stores all data instead of sending it to a data center for processing, which reduces latency when collecting and processing data and can provide real-time performance for high-bandwidth applications.
Remote direct memory access (RDMA) is a technology for accessing data in the memory of a remote host while bypassing its operating system kernel. Because the operating system is not involved, it saves a large amount of CPU resources, improves system throughput, and reduces the network communication latency of the system, and is therefore widely used in large-scale parallel computing clusters.
In-network computing (INC) is a distributed parallel computing architecture in which network devices such as network interface cards and switches perform online computation on data while it is being transmitted, so as to reduce communication latency and improve overall computing efficiency.
To make better use of the resources of each resource object in a computing power network, this application provides a method for managing the resources of the resource objects in a computing power network. The method obtains hardware attribute data related to hardware resources such as the computing, storage, and network resources of a resource object. Computing resources are quantified in units of the smallest independently operable unit of the computing resources; storage resources are quantified based on the capacities and input/output rates of the various storage devices included in the resource object; and for network resources, the internal network and the external network of the resource object are quantified separately. The capabilities of the various resources of a resource object can thus be evaluated more accurately, and resources are scheduled based on the quantification results of the resource object's resources and the job's demand for each type of resource. Managing the resources of resource objects quantitatively by the method provided in this application evaluates the capabilities of resource objects more accurately and makes resource scheduling more reasonable.
The system structure of a computing power network provided in this application is first described with reference to FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram of a computing power network according to an embodiment of this application. The computing power network includes multiple resource objects interconnected through networks, for example, operator networks provided by carriers. A resource object may be a single device that includes at least one of computing, storage, and network resources, for example, a multi-access edge computing server. A resource object may also be a cluster of multiple devices, each cluster including computing, network, and storage resources, for example, a high performance computing (HPC) cluster, an artificial intelligence (AI) computing cluster, a diverse heterogeneous cluster, or a data center.
The processor type of the computing resources may be any one or a combination of CPU, GPU, NPU, TPU, DPU, ASIC, complex programmable logic device (CPLD), FPGA, generic array logic (GAL), and system on chip (SoC).
The storage resources may be mechanical hard disks such as hard disk drives (HDDs) or magnetic tape, solid state disks (SSDs), other types of storage media, or a combination of two or more of the above types of storage media.
The network resources include internal network resources and external network resources. When the resource object is a single device (for example, a computing node), the internal network resource is the bus bandwidth of the device, and the external network resource is the network bandwidth between the device and external networks. When the resource object is a cluster of multiple devices interconnected through network devices, the internal network resources of the cluster include the port bandwidths of the network devices in the cluster, and the external network resource is the network bandwidth between the cluster and external networks.
FIG. 2 is a schematic diagram of a system implementing a resource management method according to an embodiment of this application. The system includes a resource management platform 100 and multiple resource objects. The multiple resource objects form the computing power network shown in FIG. 1. A resource object 200 may be a cluster of multiple devices, for example, the AI computing cluster 201, HPC cluster 202, or heterogeneous cluster 203 shown in FIG. 2; optionally, a resource object may also be a single device, for example, the MEC server 204 or a device of another type. The resource management platform 100 obtains the hardware attribute data of the hardware resources of each resource object and then quantitatively evaluates the various resources of each resource object. The resource management platform 100 may be deployed in any resource object forming the computing power network, for example, in one device; optionally, it may be deployed in a device dedicated to resource management outside the resource objects forming the computing power network, or it may be deployed using virtual resources, for example, in a virtual machine or a container.
A resource manager 210 is deployed in each resource object. The resource manager 210 collects the hardware attribute data of each type of hardware resource of the resource object 200 and sends it to the resource management platform 100, where the hardware resources of each resource object 200 include computing, storage, and network resources. When the resource object 200 is a cluster, the resource manager 210 may be deployed on any device in the cluster or on a device dedicated to collecting the cluster's resource data; when the resource object 200 is a single device, the resource manager 210 is deployed on that device. After receiving the hardware attribute data of the various resources sent by the resource managers 210 of the resource objects 200, the resource management platform 100 analyzes and quantifies the resource data of each resource object 200 to obtain a static quantification result for each resource object, and then stores the resource data of the various resources of each resource object 200 and the corresponding static quantification results in a resource catalog, where the resource data indicates the attribute information of the hardware resources of the associated resource object.
In the embodiments of this application, if the device on which the resource manager 210 is deployed includes a baseboard management controller (BMC), the resource manager 210 can collect the resource data of the various resources of each device in the resource object through the intelligent platform management interface (IPMI) of the BMC. Devices can also collect the resource data of the various resources of each computing node in the resource object through a cluster discovery protocol or a data collection interface, in which case an agent supporting the data collection service must be deployed on each device.
In a possible implementation, the computing power network further includes a job scheduling platform 300. After receiving a scheduling request, the resource management platform 100 obtains, through the resource managers 210 of the resource objects 200, the current usage or remaining amount of each type of resource of each resource object 200. It then quantifies again, according to the static quantification results of the resource objects 200 in the resource catalog, the currently available resources of each resource object, and the scheduling request, to obtain a dynamic quantification result for each resource object 200, which indicates the capability of the resource object to process the scheduling request, for example, the efficiency with which the resource object processes the scheduling request. The job scheduling platform 300 then allocates the scheduling request to a target resource object according to the dynamic quantification results of the resource objects 200.
The job scheduling platform 300 may be deployed on any device in a cluster or on a device dedicated to collecting the cluster's resource data. For example, the job scheduling platform 300 may or may not be deployed on the same device as the resource management platform 100; this is not specifically limited in the embodiments of this application.
As can be seen from the above description, the resource management method provided in this application mainly includes two aspects: resource quantification and resource allocation. The resource quantification method of this application is first described in detail with reference to the drawings.
The resource quantification method of this application can be divided, by when the quantification is performed, into a static resource quantification method and a dynamic resource quantification method. The static resource quantification method yields a static quantification result for each type of hardware resource of each resource object, indicating the basic capability of the resource object: the static quantification result of the computing resources indicates the basic computing capability of the resource object, that of the storage resources indicates its basic storage capability, and that of the network resources indicates its basic data transmission capability. The dynamic resource quantification method yields a dynamic quantification result for each resource object, obtained from the resource object's currently available resources and indicating its capability to process a scheduling request.
FIG. 3 is a schematic flowchart of a static resource quantification method according to an embodiment of this application. The static resource quantification method provided in the embodiments of this application is described below in detail, taking as an example the case where each resource object in the computing power network is a single device; for ease of description, the resource object is referred to as the first resource object. The method includes the following steps.
S301. The resource management platform obtains the resource data of the first resource object.
After the first resource object accesses the computing power network, it can obtain its resource data through the IPMI of a BMC, a cluster discovery protocol, a data collection interface, or the like, and report the resource data of the first resource object to the resource management platform 100. The resource data indicates the attribute information of the hardware resources of the first resource object; the hardware resources of a resource object include computing, network, and storage resources, so the resource data includes the hardware attribute data of the computing resources, the storage resources, and the network resources.
The hardware attribute data of the computing resources includes the computing power type of each processor, the computing width of each processor, the number of processors, the number of smallest independently operable units in each processor, and the computing frequency of those units. The processor type includes any one or more of CPU, GPU, TPU, DPU, ASIC, and so on; the computing power type includes integer (INT) operations and floating-point (FP) operations; and the computing width includes 64-bit, 32-bit, 16-bit, 8-bit, and so on. The operation modes of a processor therefore include 64-bit integer (INT64), 64-bit floating point (FP64), INT32, FP32, INT16, FP16, and so on. The smallest independently operable unit may be a physical core (core), a logical core, or a stream processor. The hardware attribute data of the storage resources includes the type of each storage device, the capacity of each type of storage device, and the input/output (I/O) rate of each type of storage device; storage device types include hard disk drives (HDDs), magnetic tape, mechanical hard disks, and solid state disks (SSDs). The hardware attribute data of the network resources includes the bus bandwidth inside the computing node and the network bandwidth between the computing node and external networks.
It should be noted that the hardware attribute data of the various hardware resources listed above serves only as an example and does not limit the resource data obtained by the resource manager 210, which may obtain more or less resource data than listed above. For example, when obtaining the hardware data of the computing resources, the resource manager can obtain the processor model, from which the processor type, the computing power type, the computing width, the number of smallest independently operable units in the processor, and the computing frequency of the processor can be determined.
S302. The resource management platform performs resource quantification according to the resource data of the first resource object to obtain the quantification result corresponding to the first resource object.
The hardware resources of a resource object include at least one of computing, storage, and network resources, and the resource data correspondingly includes the hardware attribute data of the computing resources, the storage resources, or the network resources. The resource management platform 100 therefore quantifies the computing resources of the resource object according to the hardware attribute data of the computing resources, the storage resources according to the hardware attribute data of the storage resources, and the network resources according to the hardware attribute data of the network resources. The quantification result corresponding to the first resource object includes the static quantification results of the computing, storage, and network resources, where the static quantification result of the computing resources indicates the basic computing capability of the resource object — the computing capability determined by the resource object's own configuration or attributes; the static quantification result of the storage resources indicates its basic storage capability; and the static quantification result of the network resources indicates its basic data transmission capability.
For the quantification of computing resources: the processor type, computing frequency, computing power type, computing width, and so on may differ within one resource object or across resource objects. For example, when a resource object is a heterogeneous device, it may contain both CPUs and GPUs; or some processors in the same resource object operate in INT64 mode while others operate in INT32 mode; or some processors in the same resource object compute at 3.4 gigahertz (GHz) while others compute at 2.1 GHz. Across resource objects, the processors of some resource objects include only CPUs while other resource objects are heterogeneous devices; or the computing width of some resource objects is 64-bit while that of others is 32-bit. Because the computing capabilities of different processors differ, the computing capabilities of all types of processors need to be quantified against a unified standard.
For ease of description, the following embodiments of this application quantify processors by computing power type and computing width, in units of the smallest independently operable unit in a processor, taking an integer-operation processor of computing width a and a floating-point-operation processor of computing width b as the quantification standard. The computing frequencies of integer-operation processors of different computing widths (including INT64, INT32, INT16, INT8, and so on) are converted into quantified values of an integer-operation processor of computing width a, yielding the quantified result for integer-operation processors; the computing frequencies of floating-point-operation processors of different computing widths (including FP64, FP32, FP16, FP8, and so on) are converted into quantified values of a floating-point-operation processor of computing width b, yielding the quantified result for floating-point-operation processors. If a processor's operation mode is INT t, that is, integer operations of computing width t, its conversion factor is p = t/a, meaning that at the same computing frequency and the same computing power type, the computing capability of a processor of width t is p times that of a processor of width a; if a processor's operation mode is FP t, that is, floating-point operations of computing width t, its conversion factor is q = t/b, meaning that at the same computing frequency and the same computing power type, the computing capability of a processor of width t is q times that of a processor of width b. That is, at the same computing frequency, a processor whose smallest independently operable unit operates in INT32 mode has half the computing capability of one operating in INT64 mode, and a processor operating in FP16 mode has one quarter the computing capability of one operating in FP64 mode.
For example, in the embodiments of this application, the computing capabilities of integer-operation processors of different computing widths are converted into the computing capability of a processor operating in INT a mode, and those of floating-point-operation processors of different computing widths into that of a processor operating in FP b mode, with a = b. For a single processor, the quantified computing capability of an integer-operation processor is F_INT = p·m·f, and that of a floating-point-operation processor is F_FP = q·n·f, where p and q are the conversion factors, m and n are the numbers of smallest independently operable units included in the processor, and f is the computing frequency of the smallest independently operable unit. For a resource object, the computing capability of each processor can be quantified by the above method, and the static quantification result of the computing resources of the whole resource object is then obtained from the computing capabilities of the individual processors. In the embodiments of this application, the static quantification result of the computing resources of a resource object can be determined by Formula 1 or Formula 2 below.
c = ΣF_INT + ΣF_FP        (Formula 1)
where c is the static quantification result of the computing resources of the resource object, ΣF_INT is the static quantification result of the computing capabilities of all integer-operation processors in the resource object, and ΣF_FP is the static quantification result of the computing capabilities of all floating-point-operation processors in the resource object.
c = αΣF_INT + (1 − α)ΣF_FP      (Formula 2)
where α is the proportion of integer-operation smallest independently operable units in the resource object, α = ΣF_INT/(ΣF_INT + ΣF_FP).
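The static quantification of computing resources described above (the conversion factors p = t/a and q = t/b, and Formulas 1 and 2) can be sketched as follows. This is an illustrative example only; the function names, the reference widths a = b = 64, and the sample processor list are assumptions, not part of the patent.

```python
def quantify_processor(width, units, freq, ref_width=64):
    """F = p * m * f, where p = width / ref_width is the conversion
    factor, m = units is the number of smallest independently operable
    units, and f = freq is their computing frequency."""
    return (width / ref_width) * units * freq

def static_compute_result(processors, weighted=False):
    """processors: list of (op_type, width, units, freq) tuples, where
    op_type is 'INT' or 'FP'. Returns c per Formula 1, or per the
    alpha-weighted Formula 2 when weighted=True."""
    f_int = sum(quantify_processor(w, m, f)
                for t, w, m, f in processors if t == 'INT')
    f_fp = sum(quantify_processor(w, m, f)
               for t, w, m, f in processors if t == 'FP')
    if not weighted:
        return f_int + f_fp                       # Formula 1
    alpha = f_int / (f_int + f_fp)                # integer share
    return alpha * f_int + (1 - alpha) * f_fp     # Formula 2

# Example: one INT64 CPU (8 cores at 3.0 GHz) and one FP32 GPU
# (1024 stream processors at 1.5 GHz)
procs = [('INT', 64, 8, 3.0), ('FP', 32, 1024, 1.5)]
print(static_compute_result(procs))  # 24.0 + 768.0 = 792.0
```

Note how the FP32 GPU's 1536 GHz-units are halved by the conversion factor 32/64, so heterogeneous processors become comparable on one scale.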
For the quantification of storage resources: in the embodiments of this application, the storage resources are quantified according to the capacities of the different storage devices and the I/O rates of the storage devices. The static quantification result of the storage resources of a resource object can be determined by Formula 3 below.
M = Σ_i (R_i / R) · v_i        (Formula 3)
where M is the static quantification result of the storage resources of the resource object, R_i is the capacity of the i-th type of storage device, R is the total capacity of the storage devices included in the resource object, and v_i is the I/O rate of the i-th type of storage device.
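The storage quantification can be sketched as a capacity-weighted combination of per-device I/O rates. The original publication gives the formula only as an image, so the exact weighting below (M = Σ (R_i/R)·v_i, consistent with the variables defined in the text) is a reconstruction; names and units are illustrative assumptions.

```python
def static_storage_result(devices):
    """devices: list of (capacity, io_rate) pairs, one per storage
    device type. Returns M = sum_i (R_i / R) * v_i, i.e. the
    capacity-share-weighted sum of I/O rates."""
    total = sum(cap for cap, _ in devices)
    return sum((cap / total) * rate for cap, rate in devices)

# Example: a 4000 GB HDD at 0.2 GB/s and a 1000 GB SSD at 3.0 GB/s
print(static_storage_result([(4000, 0.2), (1000, 3.0)]))  # 0.76
```

A large slow device thus pulls the result toward its rate in proportion to its share of total capacity.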
For the quantification of network resources: the static quantification result of the network resources is the bus bandwidth inside the computing node, that is, the static quantification result of the network resources satisfies Formula 4 below:
n_in = W_bus       (Formula 4)
The embodiment corresponding to FIG. 3 above describes the method for static resource quantification of the various resources of a resource object when the resource object is a single computing node.
As a possible embodiment, a resource object may also be a cluster, each cluster including multiple computing nodes. When the first resource object is a cluster, its hardware resources likewise include computing, storage, and network resources, and its resource data includes the hardware attribute data of the computing resources, the storage resources, and the network resources.
When the first resource object is a cluster, the hardware attribute data of the cluster's computing resources likewise includes the computing power type of each processor, the computing width of each processor, the number of processors, the number of smallest independently operable units in each processor, and the computing frequency of those units; in this case, the hardware attribute data covers the multiple computing nodes in the cluster. For storage resources, the hardware attribute data likewise includes the types of storage devices, the capacities of each type of storage device, and the I/O rates of each type of storage device.
For network resources, when the first resource object is a cluster, the multiple computing nodes in the cluster are interconnected through network devices (for example, switches or routers), and the hardware data of the network resources includes the network topology inside the cluster, the port bandwidths of the network devices (switches and/or routers) inside the cluster, and the network bandwidth between the cluster and external networks. The network topology inside the cluster may be a spine-leaf topology, a traditional three-layer topology, a fat-tree topology, a dragonfly topology, a dragonfly+ topology, or the like.
For static resource quantification when the first resource object is a cluster, the computing resources and the storage resources can be quantified by referring to the quantification methods described above for the case where the resource object is a single computing node.
For the quantification of network resources: the computing nodes in a cluster are connected through one or more layers of network devices (for example, switches); for example, the network topology inside the cluster may be a spine-leaf topology or a traditional three-layer topology. FIG. 4 is a schematic diagram of a network topology inside a cluster according to an embodiment of this application. In the embodiments of this application, network devices directly connected to computing nodes are treated as leaf devices, and network devices in the other layers as spine devices; for example, in a traditional three-layer topology, the access-layer switches are leaf devices, and the aggregation-layer and core-layer switches are spine devices. The resource management platform 100 obtains the port bandwidth of each leaf device and determines the average or minimum bandwidth of all leaf devices, obtains the port bandwidth of each spine device and determines the average or minimum bandwidth of all spine devices, and then determines the static quantification result of the resource object's internal network from the average or minimum bandwidth of the leaf devices and the average or minimum bandwidth of the spine devices. In the embodiments of this application, the static quantification result of the internal network when the resource object is a cluster can be determined by Formula 5 below.
n_in = θ · min(min{W_spine}, avg{W_leaf})     (Formula 5)
where n_in is the static quantification result of the internal network of the resource object, min{W_spine} is the minimum of the port bandwidths of all spine devices in the resource object, avg{W_leaf} is the average of the port bandwidths of all leaf devices in the resource object, and θ is the number of independent computing units in the cluster.
It should be noted that Formula 5 computes the static quantification result of the internal network from the minimum bandwidth of all spine devices and the average bandwidth of all leaf devices. In practice, it may also be computed from the average bandwidth of all spine devices and the average bandwidth of all leaf devices, from the minimum bandwidth of all spine devices and the minimum bandwidth of all leaf devices, or from the average bandwidth of all spine devices and the minimum bandwidth of all leaf devices.
In a possible implementation, the computing nodes may support hardware acceleration technologies; for example, a computing node may use RDMA or INC technology to improve the data transmission efficiency between computing nodes inside the cluster. If the resource manager 210 obtained the hardware acceleration information of the computing nodes in the cluster in S301 above and sent it to the resource management platform 100, the static quantification result of the internal network when the resource object is a cluster can also be determined by Formula 6 below.
n_in = (1 + j·c) · θ · min(min{W_spine}, avg{W_leaf})     (Formula 6)
where j is the number of hardware acceleration technologies the computing nodes possess and c is a weighting coefficient. For example, if a computing node has only RDMA acceleration or only INC acceleration, the value of j is 1; if it has both RDMA and INC acceleration, the value of j is 2. It should be understood that computing nodes may also include other hardware acceleration technologies, which are not enumerated here. In addition, the weighting coefficients of different hardware acceleration technologies may be the same or different; Formula 6 takes the case where they are the same as an example.
The resource management platform 100 can quantify the computing, storage, and network resources of the first resource object separately by the above method to obtain the quantification result corresponding to the first resource object, which includes the static quantification results of the computing resources, the storage resources, and the network resources. After determining the static quantification results of the various resources of the first resource object, the resource management platform 100 stores the resource data of the first resource object and those static quantification results in the resource catalog, which records the resource data of every resource object in the computing power network and the static quantification results of their various resources. After storing them, the resource management platform 100 returns a message to the resource manager 210 of the first resource object indicating that it has successfully accessed the computing power network.
By the above method, the resource management platform 100 can quantify every other resource object accessing the computing power network using the static resource quantification method described above, obtain the quantification result corresponding to each resource object, and store the resource data of each resource object and the static quantification results of its various resources in the resource catalog.
After the resource management platform 100 has quantified each resource object accessing the computing power network by the static resource quantification method and obtained the corresponding quantification results, a scheduling request can be submitted through the web interface of the computing power network. The scheduling request includes a resource requirement, which includes the scheduling request's demand for hardware resources, the hardware resources including any one or more of computing, storage, and network resources. The resource management platform 100 determines the target resource object for processing the scheduling request according to the quantification results of the resource objects in the computing power network and the resource requirement in the scheduling request.
For example, if the resource requirement in the scheduling request is efficiency-first, the target resource object is the resource object with the largest static quantification result of computing resources in the computing power network. If the resource requirement includes efficiency-first together with a storage capacity requirement, the target resource object is, among the resource objects whose storage capacity exceeds the storage capacity requirement, the one with the largest static quantification result of computing resources. If the resource requirement is price-first, the target resource object is the resource object with the smallest static quantification result of computing resources.
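The target selection from static quantification results described above can be sketched as follows. This is a hypothetical illustration: the dictionary keys, the policy names, and the sample clusters are assumptions for the example only.

```python
def pick_target(objects, policy, min_storage=0):
    """objects: list of dicts with 'name', 'compute' (the static
    quantification result c of the computing resources), and
    'storage_capacity'. Efficiency-first picks the largest compute
    result among objects meeting the storage capacity requirement;
    price-first picks the smallest."""
    eligible = [o for o in objects if o['storage_capacity'] >= min_storage]
    chooser = max if policy == 'efficiency' else min
    return chooser(eligible, key=lambda o: o['compute'])['name']

clusters = [
    {'name': 'hpc', 'compute': 900.0, 'storage_capacity': 500},
    {'name': 'edge', 'compute': 120.0, 'storage_capacity': 50},
]
print(pick_target(clusters, 'efficiency', min_storage=100))  # hpc
print(pick_target(clusters, 'price'))                        # edge
```

The storage filter is applied first, mirroring the example in which only objects whose capacity exceeds the storage requirement compete on compute.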
The embodiment corresponding to FIG. 3 above describes in detail the method of static resource quantification of resource objects; the dynamic resource quantification method provided in the embodiments of this application is described below with reference to the drawings. As shown in FIG. 5, FIG. 5 is a schematic flowchart of a dynamic resource quantification method according to an embodiment of this application. The method includes the following steps.
S501. The resource management platform obtains a scheduling request.
The scheduling request is used to request a resource object to execute a job to be scheduled, and includes a resource requirement comprising the computing requirement and the storage requirement of the scheduling request. The computing requirement indicates the computing resources needed to process the scheduling request, that is, the number of smallest independently operable units needed to process the scheduling request. The storage requirement is the amount of storage space needed to execute the scheduling request.
After a user submits a scheduling request through the web interface of the computing power network using a user device, the resource scheduling platform 100 in the computing power network can obtain the scheduling request through an application programming interface (API).
The scheduling request further includes a job type, where the job types include heavy-computing scenarios, general-computing scenarios, and mixed-computing scenarios. For example, HPC jobs or AI model training are usually heavy-computing scenarios, big data processing and cloud services are usually general-computing scenarios, and a mixed-computing scenario includes both heavy-computing jobs and general-computing jobs. The job type indicates the proportions of integer computing resources and floating-point computing resources needed to process the scheduling request.
Before submitting a scheduling request, the user can configure the job type, the computing requirement, and the storage requirement in the user interface, so that the resource management platform 100 can perform dynamic resource quantification on each resource object according to the computing requirement and the storage requirement.
In a possible implementation, before submitting a scheduling request, the user can also set a computing power ratio; that is, the scheduling request further includes a computing power ratio, which is the proportion of integer computing resources and the proportion of floating-point computing resources needed to execute the scheduling request.
S502. The resource management platform obtains the available resource data of the first resource object.
The available resource data includes the resource data of the available computing resources, the resource data of the available network resources, and the resource data of the available storage resources. The resource data of the available computing resources includes the processor types, the number of available processors, and, for each available processor, the number of independent computing units, the computing frequency, the computing width, and the computing power type; the resource data of the available storage resources includes the available storage capacity. For network resources, if the first resource object is a cluster, the resource data of the available network resources includes the available port bandwidths of the network devices inside the cluster and the available bandwidth between the cluster and external networks; if the first resource object is a single computing node, the resource data of the available network resources includes the available bandwidth between the computing node and external networks.
After performing static resource quantification on each resource object 200, the resource management platform 100 can send query requests to the resource objects 200 at a first time interval, the query request instructing a resource object 200 that receives it to report its currently available resource data. Alternatively, the resource management platform 100 sends query requests to the resource objects 200 after receiving a scheduling request, instructing the resource objects 200 that receive them to report their currently available resource data. Alternatively, each resource object 200, after successfully accessing the computing power network, reports its available resource data to the resource management platform 100 at a second time interval. Each resource object 200 obtains its currently available resource data through its resource manager 210; the method by which the resource manager 210 obtains the available resource data is the same as the method for obtaining resource data in S301 above and is not repeated here.
S503. The resource management platform determines, according to the scheduling request and the available resource data of the first resource object, the matching degree between each type of available resource of the first resource object and the corresponding requirement in the resource requirement.
The matching degrees between the available resources and the requirements include any one or more of the following: the matching degree between the available computing resources and the computing requirement, the matching degree between the available storage resources and the storage requirement, and the matching degree of the available network resources. The computing node where the resource management platform 100 resides records a resource catalog containing the hardware attribute data of every resource object in the computing power network. After obtaining the available resource data of the first resource object, the resource management platform 100 first determines, from the first resource object's hardware attribute data, the scheduling request, and the available resource data, the matching degree between each type of available resource of the first resource object and the corresponding requirement; it then determines the dynamic quantification result of the first resource object from those matching degrees and the static quantification results.
The resource management platform 100 determines, from the scheduling request, the number of integer-operation smallest independently operable units and the number of floating-point-operation smallest independently operable units in the computing requirement of the scheduling request. When the scheduling request includes a job type, the resource management platform 100 is preconfigured with the computing power ratio associated with each application scenario; for example, the computing power network supports heavy-computing, general-computing, and mixed-computing scenarios, where a heavy-computing scenario requires 30% integer and 70% floating-point computing resources, a general-computing scenario requires 60% integer and 40% floating-point computing resources, and a mixed-computing scenario requires 50% of each.
The resource management platform 100 determines, from the job type and the computing requirement, the numbers of integer-operation and floating-point-operation smallest independently operable units needed to execute the scheduling request; when the scheduling request includes a computing power ratio, it determines them from the computing power ratio and the computing requirement in the scheduling request. Once the numbers of integer-operation and floating-point-operation smallest independently operable units needed to execute the scheduling request are determined, the matching degree between the available computing resources and the computing requirement can be determined by Formula 7 below.
(Formula 7 — the quantified expression for r_c is given as an image in the original publication and is not reproduced here; it is defined in terms of the quantities below.)
where r_c is the matching degree between the available computing resources and the computing requirement; INT_t is the number of integer-operation smallest independently operable units in the first resource object; FP_t is the number of floating-point-operation smallest independently operable units in the first resource object; INT_job and FP_job are the numbers of integer-operation and floating-point-operation smallest independently operable units needed to execute the scheduling request; and INT_a and FP_a are the numbers of integer-operation and floating-point-operation smallest independently operable units currently available in the first resource object.
For the matching degree between the available storage resources and the storage requirement, the resource management platform 100 can compute it according to Formula 8 below.
(Formula 8 — the quantified expression for s is given as an image in the original publication and is not reproduced here; it is defined in terms of the quantities below.)
where β indicates whether the scheduling request requires persistent storage: β = 1 if persistent storage is required and β = 0 otherwise; γ indicates, when persistent storage is required, whether the available storage capacity of the first resource object is greater than or equal to the storage requirement: γ = 1 if it is and γ = 0 if it is less; s is the matching degree between the available storage resources and the storage requirement; and M is the static quantification result of the first resource object's storage resources from Formula 3 above.
For the matching degree of the available network resources: if the first resource object 200 is a cluster, the available network resource data includes the available port bandwidths of the network devices inside the cluster and the available bandwidth between the cluster and external networks; if the first resource object 200 is a single computing node, the available network resource data includes the available bandwidth between the computing node and external networks. The matching degree of the available network resources includes the matching degree of the internal network's available network resources and the matching degree of the external network's available network resources.
For the matching degree of the internal network's available network resources: when the first resource object 200 is a cluster, the resource management platform 100 obtains the port bandwidth of each leaf device from the resource catalog and determines, for each leaf device, the ratio of its available port bandwidth to its port bandwidth; by the same method it obtains the port bandwidth ratios of all leaf devices and determines the average or minimum of those ratios. It likewise obtains the port bandwidth of each spine device from the resource catalog, determines for each spine device the ratio of its available port bandwidth to its port bandwidth, obtains the ratios of all spine devices by the same method, and determines the average or minimum of the spine ratios. It then determines, from the average or minimum of the leaf ratios and the average or minimum of the spine ratios, the matching degree of the available network resources of the first resource object's internal network when the first resource object is a cluster. In the embodiments of this application, this matching degree can be determined by Formula 9 below.
r_in = min(min{P_spine}, avg{A_leaf})     (Formula 9)
where r_in is the matching degree of the available network resources of the first resource object's internal network, min{P_spine} is the minimum of the port bandwidth ratios of all spine devices in the first resource object, and avg{A_leaf} is the average of the port bandwidth ratios of all leaf devices in the first resource object.
It should be noted that when the first resource object is a single computing node, the matching degree of its internal network's available network resources is 1.
It should also be noted that Formula 9 computes the internal network resource availability from the minimum of the spine device ratios and the average of the leaf device ratios. In practice, the averages of both, the minimums of both, or the average of the spine ratios together with the minimum of the leaf ratios may also be used.
For the matching degree of the external network's available network resources, the resource management platform 100 can compute the external network resource availability of the first resource object according to Formula 10 below.
r_out = W_a / W     (Formula 10)
where r_out is the matching degree of the available network resources of the first resource object's external network, W_a is the available bandwidth between the first resource object and external networks, and W is the bandwidth between the first resource object and external networks.
By the above method, the resource management platform 100 can determine, according to the scheduling request and the available resource data of each resource object, the matching degree between each type of available resource of each resource object and the corresponding requirement in the resource requirement.
S504. The resource management platform dynamically quantifies the first resource object according to the first resource object's quantification result and the matching degrees between its various available resources and the requirements in the resource requirement, to obtain the dynamic quantification result of the first resource object relative to the scheduling request.
The dynamic quantification result indicates the capability of the first resource object to process the scheduling request. It is obtained after the scheduling request is received and the resource data of the first resource object's currently available resources is obtained; that is, obtaining the dynamic quantification result takes into account both the resources currently available on each resource object and the various resources the scheduling request needs, so the dynamic quantification result reflects more accurately the current capability of each resource object to process the scheduling request.
The network latency between the resource object and the data source of the data to be processed by the scheduling request is also an important parameter of the resource object's external network. The resource management platform 100 can also determine the static quantification result of the resource object's external network from the network latency between the data source and the resource object and from the bandwidth between the resource object and external networks. In the embodiments of this application, it can be determined by Formula 11 below.
(Formula 11 — the quantified expression for n_out is given as an image in the original publication and is not reproduced here; it is defined in terms of the quantities below.)
where n_out is the static quantification result of the first resource object's external network, W is the network bandwidth between the first resource object and external networks, and T_d is the network latency between the data source and the first resource object.
After determining the static quantification result of the first resource object's external network from the data source, the resource management platform 100 can dynamically quantify the first resource object according to its corresponding static quantification results and the matching degrees between its various available resources and the resource requirement, to obtain the first resource object's dynamic quantification result. In practice, jobs in heavy-computing scenarios usually need more computing resources and process larger amounts of data, so the computing capability of the resource object and the bandwidth of its internal network have a large influence on the processing efficiency of heavy-computing jobs; for general-computing scenarios, those factors matter less, and the network bandwidth and latency from the data source to the resource object are more critical. In the embodiments of this application, the resource management platform 100 can therefore compute the dynamic quantification result of the first resource object relative to the scheduling request by Formula 12 below.
d = λ·c·r_c·n_in·r_in + (1 − λ)·s·n_out·r_out           (Formula 12)
where d is the dynamic quantification result of the first resource object relative to the scheduling request, and λ is the heavy-computing proportion of the scheduling request, a value greater than or equal to 0 and less than or equal to 1. The value of λ may be configured by the user and carried in the above job scheduling request.
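Formula 12 can be sketched as a one-line combination of the quantities derived earlier. The function name and the sample values are illustrative assumptions; the inputs c, r_c, n_in, r_in, s, n_out, and r_out would come from the static quantification and matching-degree steps above.

```python
def dynamic_result(lam, c, r_c, n_in, r_in, s, n_out, r_out):
    """d = lam*c*r_c*n_in*r_in + (1-lam)*s*n_out*r_out (Formula 12).
    lam is the heavy-computing share of the job (0 <= lam <= 1): the
    first term weights compute capability and the internal-network
    match, the second weights the storage match and the external
    network (data-source bandwidth/latency) side."""
    return lam * c * r_c * n_in * r_in + (1 - lam) * s * n_out * r_out

# Example: a mostly heavy-computing job (lam = 0.7) with
# c = 792, r_c = 0.5, n_in = 440, r_in = 0.8, s = 0.76, n_out = 10, r_out = 0.9
print(dynamic_result(0.7, 792, 0.5, 440, 0.8, 0.76, 10, 0.9))
```

With lam = 1 the result depends only on compute and the internal network; with lam = 0 only on storage and the external network, matching the heavy-computing versus general-computing distinction in the text.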
In a possible implementation, the resource management platform 100 can also compute the dynamic quantification result of the first resource object relative to the scheduling request by Formula 13 below.
d = λ·s·c·r_c·n_in·r_in + (1 − λ)·s·n_out·r_out           (Formula 13)
By the above method, the resource management platform 100 can compute the dynamic quantification result of every resource object in the computing power network relative to the scheduling request. The dynamic quantification result of a resource object relative to the scheduling request reflects its capability when executing the scheduling request: the larger the value of d, the more efficiently the resource object executes the scheduling request; the smaller the value of d, the less efficiently.
After obtaining the dynamic quantification result of every resource object in the computing power network relative to the scheduling request, the resource management platform 100 sends the dynamic quantification results and the scheduling request to the job scheduling platform 300. The job scheduling platform 300 allocates the scheduling request to a target resource object for processing according to the dynamic quantification results of the resource objects relative to the scheduling request.
In a possible implementation, the job scheduling request further includes a user requirement. Before submitting a job scheduling request, the user can select a resource scheduling policy in the user interface, for example, an efficiency-first or a price-first resource scheduling policy. If the user selects efficiency-first, the job scheduling platform 300 allocates the scheduling request, according to the efficiency-first user requirement, to the resource object with the largest dynamic quantification result; if the user selects price-first, the job scheduling platform 300 allocates the scheduling request, according to the price-first user requirement, to the resource object with the smallest dynamic quantification result.
In a possible implementation, the job scheduling platform 300 can estimate, from each resource object's dynamic quantification result relative to the scheduling request and from the scheduling request, the time each resource object would take to process the scheduling request. The user can also configure a range of execution times while selecting price-first, in which case the job scheduling platform 300 allocates the scheduling request, among the resource objects satisfying the execution time, to the one with the smallest dynamic quantification result.
In a possible implementation, the job scheduling platform 300 can estimate, from each resource object's dynamic quantification result relative to the scheduling request and from the scheduling request, the time and price of processing the scheduling request on each resource object, display them in the user interface, and let the user select the resource object to process the scheduling request.
For the above method embodiments, for simplicity of description they are expressed as a series of action combinations, but those skilled in the art should know that the present invention is not limited by the described order of actions. Those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the present invention. Other reasonable combinations of steps that those skilled in the art can conceive from the above description also fall within the protection scope of the present invention.
The static resource quantification method and dynamic resource quantification method provided in this application are described in detail above with reference to the drawings. The apparatus and computing device for resource management provided in the embodiments of this application are described below with reference to FIG. 6 and FIG. 7. Referring to FIG. 6, FIG. 6 is a schematic diagram of a resource management apparatus according to an embodiment of this application; the resource management apparatus 600 includes an obtaining module 110 and a processing module 120.
The obtaining module 110 is configured to obtain resource data of a resource object, the resource data indicating attribute information of the hardware resources of the resource object. The processing module 120 is configured to quantify the resource data to obtain a quantification result and to allocate resources for a scheduling request according to the quantification result, where the quantification result includes a result obtained by quantifying the computing resources of the resource object in units of the smallest independently operable unit. By quantifying the computing resources of a resource object in units of its smallest independently operable unit, the computing capability of each resource object can be evaluated more accurately, so that resource allocation and scheduling become more reasonable and the resource utilization of the computing power network is improved.
可选地,上述硬件资源包括计算资源,上述资源数据包括计算资源的硬件属性数据,计算资源的硬件属性数据包括处理器的算力类型、处理器的计算宽度、单个处理器中可独立运行单元的数量和可独立运行单元的计算频率中的至少一种;其中,算力类型包括整型运算和浮点型运算;上述量化结果包括计算资源的静态量化结果,该计算资源的静态量化结果用于指示资源对象的基础计算能力,即资源对象在空载是的计算能力;
则上述处理模块120量化所述资源数据获得量化结果,集体包括:以最小的可独立运行单元为单位,按照计算资源的硬件属性数据确定计算资源的静态量化结果。其中,最小可独立运行单元为物理核、逻辑核或者流处理器。对于资源对象的计算资源,将计算资源以最小的可独立运行单元进行量化,将相同算力类型、不同计算宽度的各个处理器根据最小的可独立运行单位进行量化,将相同算力类型不同计算宽度的处理器的计算能力以相同的标准进行量化,例如将相同算力类型的处理器不同计算宽度的处理器的计算能力转换为相同算力类型相同计算宽度的处理器的计算能力。从而能够更精确评估各个资源对象的计算能力,进而进行资源的分配,能够使资源的调度更加合理,提高算力网络的资源利用率。
可选地,上述计算资源的静态量化结果包括整型运算的处理器的量化后的结果和浮点型运算的处理器的量化后的结果;则上述处理模块120以最小的可独立运行单元为单位,按照计算资源的硬件属性数据确定计算资源的静态量化结果,具体包括:将不同计算宽度的整型运算的处理器的计算频率转换为目标计算宽度的整型运算的处理器的量化值,得到整型运算的处理器的量化后的结果;将不同计算宽度的浮点型运算的处理器的计算频率转换为目标计算宽度的浮点型运算的处理器的量化值,得到的浮点数运算的处理器的量化后的结果。
对于资源对象的计算资源,将计算资源以最小的可独立运行单元进行量化,不同处理器的最小可独立运行单元的计算频率也各部相同,将各个处理器根据最小的可独立运行单元为单位,将相同算力类型不同计算宽度的处理器的计算频率量化为相同计算宽度的处理器的量化值,能够更准确的评估和对比不同资源对象的计算能力,进而在进行资源的分配时,能够使资源的调度更加合理,提高算力网络的资源利用率。
可选地,上述硬件资源包括存储资源,上述资源数据包括存储设备的硬件属性数据,存储设备的硬件属性数据包括存储设备的类型、容量和输入输出速率,其中,不同存储设备的存储介质不同;上述量化结果包括存储资源的静态量化结果,存储资源的静态量化结果用于指示资源对象的基础存储能力;
则处理模块120量化所述资源数据获得量化结果,包括:根据所述存储设备的硬件属性数据确定所述存储资源的静态量化结果。存储设备不仅用于存储数据,计算节点在处理任务时会不断的对存储设备进行读写,而不同存储设备的存储容量不同,不同存储设备的读写速率(即存储设备的输入输出速率)也不同,结合资源对象中不同存储设备的容量和输入输出速率对资源对象的存储资源进行量化后,能够更加准确的反映一个资源对象的存储资源的性能,进而在进行资源的分配时,能够使资源的调度更加合理。
可选地,上述硬件资源还包括网络资源,上述资源数据包括网络资源的硬件属性数据,当资源对象为一个计算节点时,网络资源的硬件属性数据包括该计算节点内的总线带宽;上述量化结果包括网络资源的静态量化结果,网络资源的静态量化结果用于指示资源对象的基础数据传输能力;
则上述处理模块120量化所述资源数据获得量化结果,包括:将计算节点的总线带宽作为网络资源的静态量化结果。算力网络的资源对象可以是单个的计算节点,单个计算节点在处理数据时,数据通过节点内的总线在节点内的各个模块之间进行传输,因此在资源对象是单个节点时,节点的总线带宽是评价节点内网络传输能力的重要标准。
可选地,上述硬件资源包括网络资源,所述资源数据包括网络资源的硬件属性数据,当资源对象为包括多个计算节点的集群时,网络资源的硬件属性数据包括集群的网络拓扑、集群内部网络设备的端口带宽以及集群与外部网络之间的网络带宽;上述量化结果还包括网络资源的静态量化结果;网络资源的静态量化结果用于指示资源对象的基础数据传输能力;
则上述处理模块120量化资源数据获得量化结果,包括:根据集群的网络拓扑和集群内部各个网络设备的端口带宽,确定网络资源的静态量化结果。
算力网络的资源对象还可以是包括多个计算节点的集群,集群内的多个计算节点通过网络设备互相连接,通过集群处理任务时,网络设备的端口带宽是影响不同计算节点之间数据交互速率的一个重要因素,不同节点之间数据交互的速率影响集群处理任务的效率,而不同网络设备的端口带宽不同,不同集群之间网络设备的拓扑结构,拓扑结构同样影响节点之间的数据交互速率,因此根据网络拓扑和网络设备的端口带宽确定不同集群的数据传输能力,能够更精确评估各个资源对象的数据传输能力,进而进行资源的分配,能够使资源的调度更加合理,提高算力网络的资源利用率。
需要说明的是,集群内各个计算节点之间还可以采用硬件加速技术以提高数据传输能力,例如远程直接内存访问技术和/或网内计算技术。因此当集群内各个计算节点之间采用硬件加速技术时,还可以根据各个网络设备的端口带宽以及硬件加速技术确定网络资源的静态量化结果。计算节点的硬件加速技术同样能够加速处理数据的效率,在对资源对象的数据传输能力进行量化的过程中,将硬件加速技术带来的效果同样进行量化,能够更加精确的得到资源对象处理数据的效率。
可选地,处理模块120根据量化结果为调度请求分配资源,包括:上述获取调度请求中的资源需求,该资源需求包括调度请求对硬件资源的需求;例如对计算资源的计算需求、存储资源的存储需求或网络资源的网络需求中的任意一种或多种,然后资源管理平台根据算力网络系统中多个资源对象的各类硬件资源的静态量化结果和调度请求中的资源需求,确定处理调度请求的目标资源对象。
对于上述方法对资源对象的计算资源、存储资源和网络资源进行量化后,更精确地评估了各个资源对象的计算能力,在此基础上,结合调度请求中的资源需求确定处理调度请求的资源对象,能够对资源对象的资源的调度更加合理,提高算力网络的资源利用率。
Optionally, the processing module 120 allocating resources for the scheduling request according to the quantization result includes: determining the available resources of the resource object; determining, according to the quantization result, the resource requirement, and the available resources of the resource object, a dynamic quantization result of the resource object with respect to the scheduling request, where the dynamic quantization result indicates the resource object's ability to process the scheduling request; and determining, according to the dynamic quantization result and the resource requirement in the scheduling request, the target resource object for processing the scheduling request, where the resource requirement includes the scheduling request's demand for hardware resources, for example demand for computing resources, demand for storage resources, and so on.
The dynamic quantization result indicates the first resource object's ability to process the scheduling request. It is obtained after the scheduling request is received and the resource data of the first resource object's currently available resources has been collected; that is, deriving the dynamic quantization result takes into account each resource object's currently available resources as well as the various resources the scheduling request needs. Re-quantizing the computing, storage, and network resources of the computing power network on the basis of the request's resource requirement information and the available resources yields a more precise measure of how efficiently each resource object can currently process the scheduling request.
As the above description shows, the resource management apparatus 600 can first quantize each type of resource of every resource object in the computing power network once, to obtain each resource object's basic data processing capability. Then, when a scheduling request is received, it quantizes a resource object again according to the current available-resource data of its various hardware resources, combined with the resource requirement in the scheduling request and that resource object's static quantization result. This yields a more precise measure of how efficiently each resource object can process the scheduling request, so that jobs can be scheduled according to the second quantization result and the user's requirements.
Optionally, the available resources of the resource object include available computing resources, available storage resources, and available network resources. The processing module 120 determining the dynamic quantization result of the resource object with respect to the scheduling request according to the quantization result, the resource requirement, and the available resources of the resource object then includes:
determining a matching degree of the computing resources according to the hardware attribute data of the resource object's computing resources, the resource data of the available computing resources, and the computing requirement in the resource requirement, where the matching degree of the computing resources is the degree to which the available computing resources match the computing requirement in the resource requirement, and the computing requirement is the computing resources needed to process the scheduling request;
determining a matching degree of the storage resources according to the hardware attribute data of the resource object's storage resources, the resource data of the available storage resources, and the storage requirement in the resource requirement, where the matching degree of the storage resources is the degree to which the available storage resources match the storage requirement in the resource requirement, and the storage requirement is the storage resources needed to process the scheduling request;
determining a matching degree of the cluster-internal network according to the port bandwidths of the network devices of the cluster-internal network and the available port bandwidths of those network devices, where the matching degree of the cluster-internal network is the degree to which the cluster's available network resources match the internal-network requirement in the resource requirement; and determining a matching degree of the cluster-external network according to the network bandwidth between the cluster and the cluster-external network and the available network bandwidth between the cluster and the cluster-external network, where the matching degree of the cluster-external network is the degree to which the available network resources of the cluster-external network match the external-network requirement in the resource requirement; and
determining the dynamic quantization result according to the matching degree of the computing resources, the matching degree of the storage resources, the matching degree of the cluster-internal network, and the matching degree of the cluster-external network.
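The combination of matching degrees described above can be sketched as follows. This is a hypothetical sketch: defining each matching degree as available/required capped at 1.0, and combining the four degrees with a weighted sum, are illustrative assumptions; the application does not fix a particular formula, and all names and weights below are invented for illustration.

```python
# Hypothetical sketch: combine the compute, storage, internal-network, and
# external-network matching degrees into one dynamic quantization result.

def matching_degree(available, required):
    if required <= 0:
        return 1.0  # no demand of this kind: fully matched
    return min(available / required, 1.0)

def dynamic_quantization(avail, req, weights=(0.4, 0.2, 0.2, 0.2)):
    """avail/req: dicts keyed by 'compute', 'storage', 'net_internal',
    'net_external'; weights are an assumed relative importance."""
    degrees = (
        matching_degree(avail["compute"], req["compute"]),
        matching_degree(avail["storage"], req["storage"]),
        matching_degree(avail["net_internal"], req["net_internal"]),
        matching_degree(avail["net_external"], req["net_external"]),
    )
    return sum(w * d for w, d in zip(weights, degrees))

score = dynamic_quantization(
    {"compute": 64, "storage": 500, "net_internal": 40, "net_external": 10},
    {"compute": 32, "storage": 200, "net_internal": 50, "net_external": 10},
)
```

In this example only the internal network falls short (40 of 50 available), so the score is pulled below 1.0 by that one term.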
Optionally, the processing module 120 determining the target resource object for processing the scheduling request according to the dynamic quantization result and the resource requirement in the scheduling request specifically includes:
when the resource requirement is efficiency-first, determining the resource object with the largest dynamic quantization result as the target resource object; or, when the resource requirement is cost-first, determining the resource object with the smallest dynamic quantization result as the target resource object.
The above method yields, for every resource object in the computing power network, a measure of how efficiently it can process the scheduling request. The job scheduling platform can then, according to the dynamic quantization results and the user's requirements, such as efficiency-first or price-first, assign the scheduling request to a resource object that satisfies the user's requirements.
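The selection rule above reduces to picking the maximum or minimum dynamic quantization result. The following is a minimal sketch; the dictionary layout and object identifiers are assumptions for illustration.

```python
# Hypothetical sketch: choose the target resource object from the dynamic
# quantization results. Efficiency-first picks the largest result,
# cost-first the smallest, per the method described above.

def select_target(scores, policy):
    """scores: {object_id: dynamic_quantization_result}."""
    if policy == "efficiency":
        return max(scores, key=scores.get)
    if policy == "cost":
        return min(scores, key=scores.get)
    raise ValueError(f"unknown policy: {policy}")

scores = {"node-a": 0.92, "cluster-b": 0.61, "node-c": 0.78}
select_target(scores, "efficiency")  # -> "node-a"
select_target(scores, "cost")        # -> "cluster-b"
```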
Optionally, the acquisition module 110 obtaining the resource data of the resource object specifically includes: obtaining the resource data of the resource object through the resource manager of the resource object, where the resource manager obtains the resource data of the resource object through at least one of a baseboard management controller (BMC), a cluster discovery protocol, or a data collection interface.
It should be understood that the resource management apparatus 600 of the embodiments of this application may be implemented by a central processing unit (CPU), by an application-specific integrated circuit (ASIC), or by a programmable logic device (PLD), where the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. When the resource management method shown in FIG. 3 or FIG. 5 is implemented in software, the resource management apparatus and its modules may also be software modules.
In addition, for the process of static resource quantization performed by the resource management apparatus 600, reference may be made to the relevant description in the embodiment corresponding to FIG. 3 above. The resource management apparatus 600 can be used to implement the static and dynamic resource quantization of resource objects in the above method embodiments; for details, reference may be made to the relevant descriptions in the method embodiments corresponding to FIG. 3 or FIG. 5, which are not repeated here.
Referring to FIG. 7, FIG. 7 is a schematic diagram of a computing device 700 provided by an embodiment of this application. The computing device 700 includes one or more processors 710, a communication interface 720, and a memory 730, which are interconnected through a bus 740.
For the specific implementation of the operations performed by the processor 710, reference may be made to the specific operations performed by the resource management platform 100 in the method embodiments corresponding to FIG. 3 or FIG. 5 above. For example, the processor 710 is configured to perform operations S501 to S503 in FIG. 5, or operations S301 to S302 in FIG. 3, which are not repeated here.
The processor 710 may take many specific forms; for example, it may be a CPU or a GPU, and it may be a single-core or multi-core processor. The processor 710 may also be implemented by a combination of a CPU and a hardware chip, where the hardware chip may be an ASIC, a programmable logic device (PLD), or a combination thereof, and the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The processor 710 may also be implemented solely by a logic device with built-in processing logic, such as an FPGA or a digital signal processor (DSP).
The communication interface 720 may be a wired or wireless interface for communicating with other modules or devices. A wired interface may be an Ethernet interface, a local interconnect network (LIN) interface, or the like; a wireless interface may be a cellular network interface, a wireless local area network interface, or the like. In the embodiments of this application, the communication interface 720 may specifically be used to obtain the hardware attribute data and available-resource data of each type of hardware resource of a resource object, or to obtain a scheduling request uploaded by a user.
The memory 730 may be a non-volatile memory, for example a read-only memory (ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory. The memory 730 may also be a volatile memory, which may be a random access memory (RAM) used as an external cache. By way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The memory 730 may also be used to store program code and data, so that the processor 710 invokes the program code stored in the memory 730 to perform the operation steps in the method embodiments corresponding to FIG. 3 or FIG. 5 above. In addition, the computing device 700 may include more or fewer components than shown in FIG. 7, or have a different component configuration.
The bus 740 may be a peripheral component interconnect express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX), or the like. The bus 740 may be divided into an address bus, a data bus, a control bus, and so on; besides the data bus, it may also include a power bus, a control bus, a status signal bus, and the like. For clarity of illustration, only one thick line is drawn in FIG. 7, but this does not mean there is only one bus or one type of bus.
Optionally, the computing device 700 may further include an input/output interface 750 connected to an input/output device, for receiving input information and outputting operation results.
Specifically, for the implementation of the operations performed by the computing device 700, reference may be made to the specific operations of S301 to S303 and of user queries in the above method embodiments, which are not repeated here.
It should be noted that FIG. 7 above is a schematic structural diagram of the computing device when the resource management platform 100 is deployed on a physical device (for example, a server) in the computing power network. It should be understood that the resource management platform 100 may also be deployed in a virtual device, for example a virtual machine or container running on a single physical device, or on a cluster of physical devices, on which virtualization software is installed. When the resource management platform 100 is deployed in a virtual device, it performs the resource management methods of the embodiments corresponding to FIG. 3 and FIG. 5 through the processor that the computing power network allocates to the virtual device. This application further provides a computing power network system as shown in FIG. 1, which includes the resource management platform 100 and the job scheduling platform 300 and is configured to perform the operation steps of the methods shown in FIG. 3 to FIG. 5; for brevity, details are not repeated here.
An embodiment of this application further provides a computer-readable storage medium storing instructions which, when run on a processor, can implement the method steps in the above method embodiments. For the specific implementation of those method steps by the processor of the computer-readable storage medium, reference may be made to the specific operations of the method embodiments corresponding to FIG. 3 or FIG. 5, which are not repeated here.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the relevant descriptions of other embodiments.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product, which includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wired (for example, coaxial cable, optical fiber, digital subscriber line) or wireless (for example, infrared, radio, microwave) means. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center containing one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium, or a semiconductor medium, which may be a solid state drive (SSD).
The steps of the methods in the embodiments of this application may be reordered, combined, or deleted according to actual needs; the modules of the systems in the embodiments of this application may be divided, combined, or deleted according to actual needs.
The embodiments of this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the description of the above embodiments is only intended to help understand the method of this application and its core idea. Meanwhile, a person of ordinary skill in the art may, based on the idea of this application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting this application.

Claims (26)

  1. A resource management method, wherein the method is applicable to a computing power network system and is performed by a resource management platform in the computing power network system, and the method comprises:
    obtaining resource data of a resource object, wherein the resource data indicates attribute information of hardware resources of the resource object;
    quantizing the resource data to obtain a quantization result, wherein the quantization result comprises a result obtained by quantizing computing resources in the resource object in units of a smallest independently operable unit; and
    allocating resources for a scheduling request according to the quantization result.
  2. The method according to claim 1, wherein the hardware resources comprise computing resources, the resource data comprises hardware attribute data of the computing resources, and the hardware attribute data comprises at least one of a computing power type of a processor, a computing width, a quantity of the independently operable units in a single processor, or a computing frequency of the independently operable units; the computing power type comprises integer operation and floating-point operation;
    the quantization result comprises a static quantization result of the computing resources, and the static quantization result of the computing resources indicates the basic computing capability of the resource object; and
    the quantizing the resource data to obtain a quantization result comprises:
    determining, in units of the smallest independently operable unit, the static quantization result of the computing resources according to the hardware attribute data of the computing resources.
  3. The method according to claim 2, wherein the static quantization result of the computing resources comprises a quantized result of integer-operation processors and a quantized result of floating-point-operation processors; and
    the determining, in units of the smallest independently operable unit, the static quantization result of the computing resources according to the hardware attribute data of the computing resources comprises:
    converting the computing frequencies of integer-operation processors of different computing widths into quantized values of integer-operation processors of a target computing width, to obtain the quantized result of the integer-operation processors; and
    converting the computing frequencies of floating-point-operation processors of different computing widths into quantized values of floating-point-operation processors of a target computing width, to obtain the quantized result of the floating-point-operation processors.
  4. The method according to claim 1, wherein the hardware resources comprise storage resources, the resource data comprises hardware attribute data of storage devices, and the hardware attribute data of the storage devices comprises the type, capacity, and input/output rate of the storage devices;
    the quantization result comprises a static quantization result of the storage resources, and the static quantization result of the storage resources indicates the basic storage capability of the resource object; and
    the quantizing the resource data to obtain a quantization result comprises:
    determining the static quantization result of the storage resources according to the hardware attribute data of the storage devices.
  5. The method according to claim 1, wherein the hardware resources further comprise network resources, the resource data comprises hardware attribute data of the network resources, and when the resource object is a single device, the hardware attribute data of the network resources comprises the bus bandwidth inside the device;
    the quantization result comprises a static quantization result of the network resources, and the static quantization result of the network resources indicates the basic data transmission capability of the resource object; and
    the quantizing the resource data to obtain a quantization result comprises:
    taking the bus bandwidth inside the device as the static quantization result of the network resources.
  6. The method according to claim 1, wherein the hardware resources further comprise network resources, the resource data comprises hardware attribute data of the network resources, and when the resource object is a cluster comprising multiple devices, the hardware attribute data of the network resources comprises the network topology of the cluster, the port bandwidths of network devices inside the cluster, and the network bandwidth between the cluster and an external network;
    the quantization result further comprises a static quantization result of the network resources, and the static quantization result of the network resources indicates the basic data transmission capability of the resource object; and
    the quantizing the resource data to obtain a quantization result comprises:
    determining the static quantization result of the network resources according to the network topology of the cluster and the port bandwidths of the network devices inside the cluster.
  7. The method according to any one of claims 2 to 6, wherein the allocating resources for a scheduling request according to the quantization result comprises:
    obtaining a resource requirement in the scheduling request, wherein the resource requirement comprises the scheduling request's demand for hardware resources; and
    determining, according to the static quantization results of each type of hardware resource of multiple resource objects in the computing power network system and the resource requirement in the scheduling request, a target resource object for processing the scheduling request.
  8. The method according to claim 5 or 6, wherein the allocating resources for a scheduling request according to the quantization result comprises:
    determining available resources of the resource object;
    determining, according to the quantization result, the resource requirement, and the available resources of the resource object, a dynamic quantization result of the resource object with respect to the scheduling request, wherein the dynamic quantization result indicates the resource object's capability to process the scheduling request; and
    determining, according to the dynamic quantization result and the resource requirement in the scheduling request, a target resource object for processing the scheduling request, wherein the resource requirement comprises the scheduling request's demand for the hardware resources.
  9. The method according to claim 8, wherein the available resources of the resource object comprise available computing resources, available storage resources, and available network resources; and
    the determining, according to the quantization result, the resource requirement, and the available resources of the resource object, a dynamic quantization result of the resource object with respect to the scheduling request comprises:
    determining a matching degree of the computing resources according to the hardware attribute data of the resource object's computing resources, the resource data of the available computing resources, and a computing requirement in the resource requirement, wherein the matching degree of the computing resources indicates the degree to which the available computing resources match the computing requirement in the resource requirement;
    determining a matching degree of the storage resources according to the hardware attribute data of the resource object's storage resources, the resource data of the available storage resources, and a storage requirement in the resource requirement, wherein the matching degree of the storage resources indicates the degree to which the available storage resources match the storage requirement in the resource requirement;
    determining a matching degree of the cluster-internal network according to the port bandwidths of the network devices of the cluster-internal network and the available port bandwidths of those network devices, wherein the matching degree of the cluster-internal network is the degree to which the cluster's available network resources match an internal-network requirement in the resource requirement; and determining a matching degree of the cluster-external network according to the network bandwidth between the cluster and the cluster-external network and the available network bandwidth between the cluster and the cluster-external network, wherein the matching degree of the cluster-external network indicates the degree to which the available network resources of the cluster-external network match an external-network requirement in the resource requirement; and
    determining the dynamic quantization result according to the matching degree of the computing resources, the matching degree of the storage resources, the matching degree of the cluster-internal network, and the matching degree of the cluster-external network.
  10. The method according to claim 8 or 9, wherein the determining, according to the dynamic quantization result and the resource requirement in the scheduling request, a target resource object for processing the scheduling request comprises:
    when the resource requirement is efficiency-first, determining the resource object with the largest dynamic quantization result as the target resource object; or
    when the resource requirement is cost-first, determining the resource object with the smallest dynamic quantization result as the target resource object.
  11. The method according to any one of claims 1 to 10, wherein the obtaining resource data of a resource object comprises:
    obtaining the resource data of the resource object through a resource manager of the resource object, wherein the resource manager obtains the resource data of the resource object through at least one of a baseboard management controller (BMC), a cluster discovery protocol, or a data collection interface.
  12. A resource management apparatus, comprising:
    an acquisition module, configured to obtain resource data of a resource object, wherein the resource data indicates attribute information of hardware resources of the resource object; and
    a processing module, configured to quantize the resource data to obtain a quantization result, wherein the quantization result comprises a result obtained by quantizing computing resources in the resource object in units of a smallest independently operable unit; and
    to allocate resources for a scheduling request according to the quantization result.
  13. The apparatus according to claim 12, wherein the hardware resources comprise computing resources, the resource data comprises hardware attribute data of the computing resources, and the hardware attribute data comprises at least one of a computing power type of a processor, a computing width, a quantity of the independently operable units in a single processor, or a computing frequency of the independently operable units; the computing power type comprises integer operation and floating-point operation;
    the quantization result comprises a static quantization result of the computing resources, and the static quantization result of the computing resources indicates the basic computing capability of the resource object; and
    the processing module is specifically configured to:
    determine, in units of the smallest independently operable unit, the static quantization result of the computing resources according to the hardware attribute data of the computing resources.
  14. The apparatus according to claim 13, wherein the static quantization result of the computing resources comprises a quantized result of integer-operation processors and a quantized result of floating-point-operation processors; and
    the processing module is specifically configured to:
    convert the computing frequencies of integer-operation processors of different computing widths into quantized values of integer-operation processors of a target computing width, to obtain the quantized result of the integer-operation processors; and
    convert the computing frequencies of floating-point-operation processors of different computing widths into quantized values of floating-point-operation processors of a target computing width, to obtain the quantized result of the floating-point-operation processors.
  15. The apparatus according to claim 12, wherein the hardware resources comprise storage resources, the resource data comprises hardware attribute data of storage devices, and the hardware attribute data of the storage devices comprises the type, capacity, and input/output rate of the storage devices; the quantization result comprises a static quantization result of the storage resources, and the static quantization result of the storage resources indicates the basic storage capability of the resource object; and
    the processing module is specifically configured to:
    determine the static quantization result of the storage resources according to the hardware attribute data of the storage devices.
  16. The apparatus according to claim 12, wherein the hardware resources further comprise network resources,
    the resource data comprises hardware attribute data of the network resources, and when the resource object is a single device, the hardware attribute data of the network resources comprises the bus bandwidth inside the device;
    the quantization result comprises a static quantization result of the network resources, and the static quantization result of the network resources indicates the basic data transmission capability of the resource object; and
    the processing module is specifically configured to:
    take the bus bandwidth inside the device as the static quantization result of the network resources.
  17. The apparatus according to claim 12, wherein the hardware resources further comprise network resources, the resource data comprises hardware attribute data of the network resources, and when the resource object is a cluster comprising multiple devices, the hardware attribute data of the network resources comprises the network topology of the cluster, the port bandwidths of network devices inside the cluster, and the network bandwidth between the cluster and an external network;
    the quantization result further comprises a static quantization result of the network resources, and the static quantization result of the network resources indicates the basic data transmission capability of the resource object; and
    the processing module is specifically configured to:
    determine the static quantization result of the network resources according to the network topology of the cluster and the port bandwidths of the network devices inside the cluster.
  18. The apparatus according to any one of claims 12 to 17, wherein
    the acquisition module is further configured to obtain a resource requirement in the scheduling request, wherein the resource requirement comprises the scheduling request's demand for hardware resources; and
    the processing module is specifically configured to:
    obtain, from among multiple resource objects of the computing power network system, one or more resource objects whose available hardware resources satisfy the scheduling request; and
    determine, according to the static quantization results of each type of hardware resource of the one or more resource objects and the resource requirement in the scheduling request, a target resource object for processing the scheduling request.
  19. The apparatus according to claim 16 or 17, wherein the processing module is specifically configured to:
    determine available resources of the resource object;
    determine, according to the quantization result and the available resources of the resource object, a dynamic quantization result of the resource object with respect to the scheduling request, wherein the dynamic quantization result indicates the resource object's capability to process the scheduling request; and
    determine, according to the dynamic quantization result and the resource requirement in the scheduling request, a target resource object for processing the scheduling request, wherein the resource requirement comprises the scheduling request's demand for the hardware resources.
  20. The apparatus according to claim 19, wherein the available resources of the resource object comprise available computing resources, available storage resources, and available network resources; and
    the processing module is specifically configured to:
    determine a matching degree of the computing resources according to the hardware attribute data of the resource object's computing resources, the resource data of the available computing resources, and a computing requirement in the resource requirement, wherein the matching degree of the computing resources indicates the degree to which the available computing resources match the computing requirement in the resource requirement;
    determine a matching degree of the storage resources according to the hardware attribute data of the resource object's storage resources, the resource data of the available storage resources, and a storage requirement in the resource requirement, wherein the matching degree of the storage resources indicates the degree to which the available storage resources match the storage requirement in the resource requirement;
    determine a matching degree of the cluster-internal network according to the port bandwidths of the network devices of the cluster-internal network and the available port bandwidths of those network devices, wherein the matching degree of the cluster-internal network is the degree to which the cluster's available network resources match an internal-network requirement in the resource requirement; and determine a matching degree of the cluster-external network according to the network bandwidth between the cluster and the cluster-external network and the available network bandwidth between the cluster and the cluster-external network, wherein the matching degree of the cluster-external network indicates the degree to which the available network resources of the cluster-external network match an external-network requirement in the resource requirement; and
    determine the dynamic quantization result according to the matching degree of the computing resources, the matching degree of the storage resources, the matching degree of the cluster-internal network, and the matching degree of the cluster-external network.
  21. The apparatus according to claim 19 or 20, wherein the processing module is specifically configured to:
    when the resource requirement is efficiency-first, determine the resource object with the largest dynamic quantization result as the target resource object; or
    when the resource requirement is cost-first, determine the resource object with the smallest dynamic quantization result as the target resource object.
  22. The apparatus according to any one of claims 12 to 21, wherein the acquisition module is specifically configured to:
    obtain the resource data of the resource object through a resource manager of the resource object, wherein the resource manager obtains the resource data of the resource object through at least one of a baseboard management controller (BMC), a cluster discovery protocol, or a data collection interface.
  23. A resource management platform, comprising a processor and a memory, wherein the memory is configured to store instructions and the processor is configured to execute the instructions; when the processor executes the instructions, the processor performs the method according to any one of claims 1 to 11.
  24. The resource management platform according to claim 23, wherein the resource management platform is deployed in a physical device of the computing power network system.
  25. The resource management platform according to claim 23, wherein the resource management platform is deployed in a virtual device of the computing power network system, and the virtual device comprises a virtual machine or a container.
  26. A computer-readable storage medium storing a computer program, wherein, when the computer program is executed by a processor, the processor performs the method according to any one of claims 1 to 11.
PCT/CN2022/142208 2021-12-27 2022-12-27 Resource management method and apparatus, and resource management platform WO2023125493A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202111611446 2021-12-27
CN202111611446.4 2021-12-27
CN202210467575.9 2022-04-29
CN202210467575.9A CN116360972A (zh) 2021-12-27 2022-04-29 Resource management method and apparatus, and resource management platform

Publications (1)

Publication Number Publication Date
WO2023125493A1 true WO2023125493A1 (zh) 2023-07-06

Family

ID=86925645

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/142208 WO2023125493A1 (zh) 2021-12-27 2022-12-27 资源管理方法、装置及资源管理平台

Country Status (2)

Country Link
CN (1) CN116360972A (zh)
WO (1) WO2023125493A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117349026A (zh) * 2023-12-04 2024-01-05 A distributed computing power scheduling system for AIGC model training
CN117640410A (zh) * 2024-01-26 2024-03-01 Function unit destructuring method and device based on computing-power adaptation of a function network family
CN117851075A (zh) * 2024-03-08 2024-04-09 A resource optimization management method for a data monitoring system
CN117971502A (zh) * 2024-03-29 2024-05-03 A method and apparatus for online optimized scheduling of AI inference clusters

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116775313B (zh) * 2023-08-18 2023-12-08 浪潮(山东)计算机科技有限公司 A resource allocation method, apparatus, device, and medium
CN117370135B (zh) * 2023-10-18 2024-04-02 方心科技股份有限公司 Supercomputing platform performance evaluation method and system based on elasticity testing of power applications
CN117421108A (zh) 2023-12-15 2024-01-19 企商在线(北京)数据技术股份有限公司 A heterogeneous computing power platform design method, platform, and resource scheduling method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609314A (zh) * 2012-01-18 2012-07-25 浪潮(北京)电子信息产业有限公司 A virtual machine quantitative management method and system
CN103699440A (zh) * 2012-09-27 2014-04-02 北京搜狐新媒体信息技术有限公司 Method and apparatus for a cloud computing platform system to allocate resources to tasks
CN107133098A (zh) * 2017-04-24 2017-09-05 东莞中国科学院云计算产业技术创新与育成中心 Cloud-computing-based human resources data processing platform
CN109669774A (zh) * 2018-11-14 2019-04-23 新华三技术有限公司成都分公司 Hardware resource quantization method, orchestration method, apparatus, and network device
WO2021051772A1 (en) * 2019-09-19 2021-03-25 Huawei Technologies Co., Ltd. Method and apparatus for vectorized resource scheduling in distributed computing systems using tensors


Also Published As

Publication number Publication date
CN116360972A (zh) 2023-06-30

Similar Documents

Publication Publication Date Title
WO2023125493A1 (zh) Resource management method and apparatus, and resource management platform
US10325343B1 (en) Topology aware grouping and provisioning of GPU resources in GPU-as-a-Service platform
US10728091B2 (en) Topology-aware provisioning of hardware accelerator resources in a distributed environment
US20200241927A1 (en) Storage transactions with predictable latency
US8949847B2 (en) Apparatus and method for managing resources in cluster computing environment
CN103176849B A virtual machine cluster deployment method based on resource classification
CN110221920B Deployment method, apparatus, storage medium, and system
Seth et al. Dynamic heterogeneous shortest job first (DHSJF): a task scheduling approach for heterogeneous cloud computing systems
CN117632361A Resource scheduling method and apparatus, and related device
CN113590307B Edge computing node optimal configuration method and apparatus, and cloud computing center
WO2014114072A1 Method and apparatus for adjusting I/O channels on a virtualization platform
CN115718644A Cross-region migration method and system for computing tasks oriented to cloud data centers
CN110990154A Big data application optimization method and apparatus, and storage medium
US20220300323A1 (en) Job Scheduling Method and Job Scheduling Apparatus
Xu et al. vPFS: Bandwidth virtualization of parallel storage systems
US20230109396A1 (en) Load balancing and networking policy performance by a packet processing pipeline
CN107590000B Secondary random resource management method/system, computer storage medium, and device
Li et al. Improving spark performance with zero-copy buffer management and RDMA
US20210004658A1 (en) System and method for provisioning of artificial intelligence accelerator (aia) resources
WO2021231848A1 (en) System and method for creating on-demand virtual filesystem having virtual burst buffers created on the fly
JP2012038275A (ja) 取引計算シミュレーションシステム、方法及びプログラム
WO2024087663A1 Job scheduling method and apparatus, and chip
WO2023159652A1 AI system, memory access control method, and related device
Yu et al. Analysis of CPU pinning and storage configuration in 100 Gbps network data transfer
Lang et al. Implementation of load balancing algorithm based on flink cluster

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22914747

Country of ref document: EP

Kind code of ref document: A1