WO2021258861A1 - Job processing method and related equipment - Google Patents

Job processing method and related equipment

Info

Publication number
WO2021258861A1
Authority
WO
WIPO (PCT)
Prior art keywords
cloud computing
task
computing instance
roce
cloud
Prior art date
Application number
PCT/CN2021/091717
Inventor
肖磊
孙宏伟
孙克勇
阮涵
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2021258861A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs

Definitions

  • The embodiments of the present application relate to the field of cloud computing, and in particular to a job processing method and related equipment.
  • Remote direct memory access (RDMA) is a technology that bypasses the operating system kernel of a remote host to access data in its memory, which can save processing resources, increase system throughput, and reduce network communication latency.
  • RDMA has multiple implementations, one of which is RDMA over Converged Ethernet (RoCE).
  • RoCE technology is often used in high-performance computing (HPC) based on a unified cluster.
  • When the computing resources of a local HPC cluster based on a RoCE network are insufficient to support high-performance computing, the local HPC cluster applies to a cloud platform for computing resources. The cloud platform then builds a separate HPC cluster on the cloud and sends the tasks from the local HPC cluster, together with the data corresponding to those tasks, to the cloud resource control node of the cloud HPC cluster. The cloud resource control node allocates the tasks and their data to the cloud computing nodes, which perform high-performance computing according to those tasks.
  • In this solution, the HPC cluster on the cloud and the local HPC cluster are two independent clusters.
  • The local HPC cluster must transmit the tasks and the corresponding data to the cloud HPC cluster before the cloud HPC cluster can perform high-performance computing.
  • The local resource control node in the local HPC cluster is also required to divide the local tasks.
  • However, high-performance computing tasks are heavy and difficult to divide, and transmitting the corresponding data to the cloud platform is limited by the transmission bandwidth. Therefore, this solution cannot effectively improve the efficiency of high-performance computing.
  • The embodiments of the present application provide a job processing method and related equipment, which are used to configure a cloud computing instance with a RoCE function for a local network, so as to improve the efficiency of high-performance computing.
  • An embodiment of the present application provides a job processing method, which can be applied to a high-performance computing scenario.
  • Under preset conditions, the cloud platform receives a resource application.
  • The resource application is used to instruct the cloud platform to create a cloud computing instance with the RDMA over Converged Ethernet (RoCE) function.
  • The cloud platform creates a cloud computing instance with the RoCE function according to the resource application and sets the cloud computing instance to access the local network, so that the cloud computing instance processes the data corresponding to the tasks in the local network.
  • The preset conditions may include one or more of the following: the computing resources of the local network are not enough to support high-performance computing; or the user applies to the cloud platform for resources according to the user's own needs. The details are not limited here.
  • Because the cloud platform can create a cloud computing instance with the RoCE function according to the resource application and connect that instance to the local network, the cloud computing instance can receive tasks from the local network and bypass the operating system to access the data corresponding to those tasks in the local network, and can thus process that data. Therefore, the cloud platform does not need to establish an HPC cluster on the cloud, and the local network does not need to transmit tasks and the corresponding data to the cloud platform, which helps improve the efficiency of high-performance computing.
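  • The flow described above (receive a resource application, create a RoCE-capable instance, connect it to the local network) can be sketched as follows. This is a minimal illustration only; the class names, field names, and instance-id format are assumptions and do not come from the application.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CloudInstance:
    """Hypothetical model of a cloud computing instance."""
    instance_id: str
    roce_enabled: bool = False
    attached_network: Optional[str] = None


@dataclass
class CloudPlatform:
    """Hypothetical model of the cloud platform."""
    instances: List[CloudInstance] = field(default_factory=list)

    def handle_resource_application(self, local_network: str) -> CloudInstance:
        # Create a cloud computing instance with the RoCE function.
        instance = CloudInstance(
            instance_id=f"instance-{len(self.instances) + 1}",
            roce_enabled=True,
        )
        # Set the instance to access the local network, so that it can
        # receive tasks and read task data there.
        instance.attached_network = local_network
        self.instances.append(instance)
        return instance


platform = CloudPlatform()
created = platform.handle_resource_application("local-hpc-network")
print(created)
```

Note that no task or task data is sent to the platform in this sketch: only the instance is attached to the local network, which mirrors the point made in the text.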
  • The foregoing local network is provided with a task issuing node and a task data storage node.
  • The cloud computing instance can receive a task sent by the task issuing node; then, according to the task, the cloud computing instance obtains the task data corresponding to the task from the task data storage node through RDMA and performs task processing.
  • The task issuing node is used to issue tasks to the local computing nodes. The task data storage node is used to store the data corresponding to the tasks, so that a local computing node can obtain that data from the task data storage node. Because the cloud computing instance has been connected to the local network, it is in effect connected to the task issuing node and the task data storage node. Therefore, the task issuing node can issue tasks to the cloud computing instance.
  • The cloud computing instance can bypass the operating system to access the data corresponding to the task in the task data storage node, and can then process that data.
  • The cloud computing instance is in effect added to the local HPC cluster as a computing node. Because the cloud computing instance and the local computing nodes form one HPC cluster, the task issuing node can assign tasks to the cloud computing instance as if it were a local computing node. The task issuing node therefore does not need to transmit both the task and the corresponding data to the cloud platform, which helps improve the efficiency of high-performance computing.
  • Before the cloud platform receives the resource application, the method further includes: the task issuing node confirms that the number of tasks to be processed exceeds a threshold.
  • The task issuing node can monitor the number of tasks to be processed and weigh it against the computing power of the local computing nodes.
  • When the task issuing node confirms that the number of tasks to be processed exceeds the threshold, that is, when the computing power (or computing resources) of the local computing nodes is not enough to support the tasks to be processed, the task issuing node triggers the step of sending a resource application to the cloud platform.
  • The tasks to be processed can be the tasks that the task issuing node has not yet assigned to a local computing node, or all the tasks that the task issuing node needs to process within a certain time range (that is, including the tasks that have already been allocated to the local computing nodes), which is not specifically limited here.
  • The value of the threshold may differ from case to case, and the task issuing node can adjust it according to actual needs, which is not specifically limited here.
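  • The threshold check can be sketched as below. The function names and the dictionary fields of the resource application are hypothetical; the text only states that exceeding the threshold triggers the application.

```python
from typing import Optional


def exceeds_threshold(pending_tasks: int, threshold: int) -> bool:
    """True when the number of tasks to be processed exceeds the threshold."""
    return pending_tasks > threshold


def maybe_build_resource_application(pending_tasks: int, threshold: int) -> Optional[dict]:
    """Return a resource application if the threshold is exceeded, else None.

    The field names below are assumptions made for illustration.
    """
    if not exceeds_threshold(pending_tasks, threshold):
        return None
    return {"required_function": "RoCE", "pending_tasks": pending_tasks}


print(maybe_build_resource_application(pending_tasks=120, threshold=100))
print(maybe_build_resource_application(pending_tasks=80, threshold=100))
```

Because the threshold is a plain parameter, the task issuing node can adjust it at run time, as the text allows.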
  • The method further includes: the cloud platform receives a resource cancellation request sent by the task issuing node, where the resource cancellation request indicates that the task has been completed or does not need to be executed. The cloud platform then cancels the cloud computing instance according to the resource cancellation request.
  • The task issuing node can send a resource cancellation request to the cloud platform, so that the cloud platform cancels the cloud computing instance according to that request.
  • Because the task issuing node can release the cloud computing instance, the cloud platform can allocate it to other clusters. This makes the cloud computing instances configured by the cloud platform more flexible and helps improve their utilization.
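  • One way to picture cancellation and reallocation is a simple instance pool. This is a sketch under the assumption that a cancelled instance returns to a free pool; the class and method names are hypothetical.

```python
class InstancePool:
    """Hypothetical pool of cloud computing instances managed by the platform."""

    def __init__(self, instance_ids):
        self.free = set(instance_ids)
        self.allocated = {}  # instance id -> name of the cluster using it

    def allocate(self, cluster: str) -> str:
        if not self.free:
            raise RuntimeError("no free cloud computing instance")
        instance_id = min(self.free)  # deterministic choice for the example
        self.free.remove(instance_id)
        self.allocated[instance_id] = cluster
        return instance_id

    def cancel(self, instance_id: str) -> None:
        # Handle a resource cancellation request: detach the instance from
        # its cluster and make it available to other clusters.
        del self.allocated[instance_id]
        self.free.add(instance_id)


pool = InstancePool(["a", "b"])
first = pool.allocate("hpc-cluster-1")
pool.cancel(first)
print(pool.allocate("hpc-cluster-2"))
```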
  • The cloud platform creating a cloud computing instance with the RoCE function according to the resource application includes: the cloud platform obtains a RoCE software package; the cloud platform then sends the RoCE software package to an initial cloud computing instance and triggers its installation in the initial cloud computing instance, to obtain the cloud computing instance with the RoCE function.
  • The RoCE software package can be pre-stored in the cloud platform or in the initial cloud computing instance, which is not specifically limited here.
  • The cloud platform may install the RoCE software package in the initial cloud computing instance, so that the initial cloud computing instance emulates a RoCE network card by running the RoCE software.
  • The cloud computing instance can therefore use an ordinary network card instead of a RoCE network card, which saves configuration cost for cloud computing instances and improves the feasibility of the solution.
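  • The application does not name the RoCE software package. As an illustration only, Linux offers a software RoCE implementation (the `rdma_rxe` kernel module, configured through the iproute2 `rdma` tool) that emulates RoCE on an ordinary Ethernet interface. The sketch below only builds the command lines and does not run them; the interface name `eth0` and device name `rxe0` are assumptions, and whether the patent's package corresponds to this implementation is not stated in the source.

```python
from typing import List


def soft_roce_commands(netdev: str = "eth0", rdma_dev: str = "rxe0") -> List[List[str]]:
    """Build (but do not run) commands that enable Linux Soft-RoCE on netdev."""
    return [
        # Load the software RoCE kernel module.
        ["modprobe", "rdma_rxe"],
        # Create an RDMA link of type rxe on top of the ordinary network card.
        ["rdma", "link", "add", rdma_dev, "type", "rxe", "netdev", netdev],
    ]


for command in soft_roce_commands("eth0"):
    print(" ".join(command))
```

In a real deployment these commands would need root privileges inside the initial cloud computing instance.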
  • The resource application includes configuration information, which is used to indicate the configuration of the cloud computing instance required by the local network.
  • The method further includes: the cloud platform creates an initial cloud computing instance according to the configuration information; or the cloud platform selects, according to the configuration information, an initial cloud computing instance corresponding to the configuration information from a plurality of cloud computing instances.
  • In other words, the cloud platform may create an initial cloud computing instance based on the configuration information, or it may search the existing initial cloud computing instances for one that matches the configuration information.
  • The cloud computing instance is a virtual machine, a container, or a bare metal server.
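  • The two options (create a new initial instance, or select an existing one that matches the configuration information) can be sketched as follows. The configuration fields and the exact-match rule are assumptions; the application does not specify what the configuration information contains.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class InstanceConfig:
    """Hypothetical configuration information carried by a resource application."""
    kind: str        # "vm", "container", or "bare_metal"
    cpu_cores: int
    memory_gib: int


def obtain_initial_instance(
    requested: InstanceConfig, existing: Dict[str, InstanceConfig]
) -> Tuple[str, bool]:
    """Return (instance id, created flag).

    An existing instance is reused when its configuration equals the request;
    otherwise a new instance id is returned and created is True.
    """
    for instance_id, config in existing.items():
        if config == requested:
            return instance_id, False
    return f"new-{requested.kind}", True


inventory = {"bms-1": InstanceConfig("bare_metal", 64, 512)}
print(obtain_initial_instance(InstanceConfig("bare_metal", 64, 512), inventory))
print(obtain_initial_instance(InstanceConfig("vm", 8, 32), inventory))
```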
  • An embodiment of the present application provides a cloud platform, which includes a processor, a network interface, and a memory.
  • The memory is used to store data and program code.
  • The network interface is used to receive a resource application, where the resource application is used to instruct the cloud platform to create a cloud computing instance with the RDMA over Converged Ethernet (RoCE) function.
  • The processor is used to create a cloud computing instance with the RoCE function according to the resource application, and to set the cloud computing instance to access the local network.
  • Because the cloud platform can create a cloud computing instance with the RoCE function according to the resource application and connect that instance to the local network, the cloud computing instance can receive tasks from the local network and bypass the operating system to access the data corresponding to those tasks in the local network, and can thus process that data. Therefore, the cloud platform does not need to establish an HPC cluster on the cloud, and the local network does not need to transmit tasks and the corresponding data to the cloud platform, which helps improve the efficiency of high-performance computing.
  • A task issuing node and a task data storage node are provided in the local network.
  • The task issuing node is used to send tasks to the cloud computing instance; the task data storage node is used to provide the task data corresponding to a task to the cloud computing instance through RDMA.
  • The network interface is specifically used to receive the resource application from the task issuing node when the task issuing node confirms that the number of tasks to be processed exceeds a threshold.
  • The network interface is also used to receive a resource cancellation request from the task issuing node, where the resource cancellation request indicates that the task has been completed or does not need to be executed; the processor is also used to cancel the cloud computing instance according to the resource cancellation request.
  • The processor is specifically configured to: obtain the RoCE software package; and control the network interface to send the RoCE software package to the initial cloud computing instance and trigger its installation in the initial cloud computing instance, to obtain the cloud computing instance with the RoCE function.
  • The resource application includes configuration information, which is used to indicate the configuration of the cloud computing instance required by the local network.
  • The processor is further configured to: create an initial cloud computing instance according to the configuration information; or select, according to the configuration information, an initial cloud computing instance corresponding to the configuration information from a plurality of cloud computing instances.
  • The cloud computing instance is a virtual machine, a container, or a bare metal server.
  • An embodiment of the present application provides a cloud platform, which includes a receiving module and a resource configuration module.
  • The receiving module is used to receive a resource application, where the resource application is used to instruct the cloud platform to create a cloud computing instance with the RDMA over Converged Ethernet (RoCE) function.
  • The resource configuration module is used to create a cloud computing instance with the RoCE function according to the resource application, and to set the cloud computing instance to access the local network.
  • Because the cloud platform can create a cloud computing instance with the RoCE function according to the resource application and connect that instance to the local network, the cloud computing instance can receive tasks from the local network and bypass the operating system to access the data corresponding to those tasks in the local network, and can thus process that data. Therefore, the cloud platform does not need to establish an HPC cluster on the cloud, and the local network does not need to transmit tasks and the corresponding data to the cloud platform, which helps improve the efficiency of high-performance computing.
  • A task issuing node and a task data storage node are provided in the local network.
  • The task issuing node is used to send tasks to the cloud computing instance; the task data storage node is used to provide the task data corresponding to a task to the cloud computing instance through RDMA.
  • The receiving module is specifically configured to receive the resource application from the task issuing node when the task issuing node confirms that the number of tasks to be processed exceeds a threshold.
  • The receiving module is further configured to receive a resource cancellation request from the task issuing node, where the resource cancellation request indicates that the task has been completed or does not need to be executed; the resource configuration module is also used to cancel the cloud computing instance according to the resource cancellation request.
  • The resource configuration module is specifically used to: obtain the RoCE software package; and control the transceiver to send the RoCE software package to the initial cloud computing instance and trigger its installation in the initial cloud computing instance, to obtain the cloud computing instance with the RoCE function.
  • The resource application includes configuration information, which is used to indicate the configuration of the cloud computing instance required by the local network.
  • The resource configuration module is also used to: create an initial cloud computing instance according to the configuration information; or find, according to the configuration information, an initial cloud computing instance corresponding to the configuration information among multiple initial cloud computing instances.
  • The cloud computing instance is a virtual machine, a container, or a bare metal server.
  • An embodiment of the present application provides a job processing system, which includes a cloud platform, a cloud computing instance, and a local network, where the local network includes a task issuing node and a task data storage node.
  • The cloud platform is used to receive a resource application, create a cloud computing instance with the RoCE function based on the resource application, and set the cloud computing instance to connect to the local network. The resource application is used to instruct the cloud platform to create a cloud computing instance with the RDMA over Converged Ethernet (RoCE) function.
  • The cloud computing instance is used to receive a task sent by the task issuing node, obtain the task data corresponding to the task from the task data storage node through RDMA according to the task, and perform task processing.
  • The embodiments of the present application provide a bare metal server.
  • The bare metal server is a computing server that combines the flexibility of a virtual machine with the performance of a physical machine, and is used to provide computing performance and data security for services such as core databases, key application systems, high-performance computing, and big data.
  • The bare metal server includes a processing module and a transceiver module.
  • The processing module may be a processor, and the transceiver module may be an input/output device or a network interface.
  • The bare metal server may further include a storage module, which may be a memory. The storage module is used to store instructions, and the processing module executes the instructions stored in the storage module, so that the bare metal server performs the functions involved in the cloud computing instance in the first aspect described above.
  • An embodiment of the present application provides a physical machine, which is used to create a virtual machine or container based on the resource application in the foregoing first aspect or second aspect.
  • The physical machine includes a processing module and a transceiver module.
  • The processing module may be a processor, and the transceiver module may be an input/output device or a network interface.
  • The physical machine may also include a storage module, which may be a memory. The storage module is used to store instructions, and the processing module executes the instructions stored in the storage module, so that the physical machine performs the functions involved in the cloud computing instance in the aforementioned first aspect or second aspect.
  • Because the cloud platform can set up a cloud computing instance with the RoCE function according to the resource application and connect that instance to the local network, the cloud computing instance can receive tasks from the local network and bypass the operating system to access the data corresponding to those tasks in the local network, and can thus process that data. Therefore, the cloud platform does not need to establish an HPC cluster on the cloud, and the local network does not need to transmit tasks and the corresponding data to the cloud platform, which helps improve the efficiency of high-performance computing.
  • FIG. 1 is a system architecture diagram of a job processing method in an embodiment of this application;
  • FIG. 2 is a flowchart of a job processing method in an embodiment of this application;
  • FIG. 3A is another flowchart of a job processing method in an embodiment of this application;
  • FIG. 3B is a schematic diagram of a logical connection between a local HPC cluster and a cloud computing instance in an embodiment of this application;
  • FIG. 4 is a schematic diagram of an embodiment of a cloud platform in an embodiment of this application;
  • FIG. 5 is a schematic diagram of another embodiment of a cloud platform in an embodiment of this application.
  • The embodiments of the present application provide a job processing method and related equipment, which are used to configure a cloud computing instance with a RoCE function for a local network, so as to improve the efficiency of high-performance computing.
  • Remote direct memory access (RDMA) is a data transmission technology that quickly moves data from the memory of one machine or device to the memory of another machine or device without passing through the operating system kernel protocol stack, so that data is transmitted over the network without affecting the operating system.
  • Common RDMA implementations include the Virtual Interface Architecture, RDMA over Converged Ethernet (RoCE), InfiniBand (IB), and iWARP.
  • RoCE technology avoids copying data between user space and kernel space and avoids processing data in the kernel protocol stack, which reduces memory and CPU consumption as well as data transmission latency.
  • High-performance computing (HPC) refers to the use of aggregated computing power to process data-intensive computing tasks that standard workstations cannot complete, including simulation, modeling, and rendering.
  • The high-performance computing in the embodiments of the present application is RoCE-based high-performance computing.
  • A cluster composed of devices that implement such high-performance computing is called an HPC cluster.
  • Cloud platform: an entity that provides services based on hardware resources and/or software resources to remote devices.
  • The cloud platform in the embodiments of the present application may be a storage cloud platform focusing on data storage, a computing cloud platform focusing on data processing, or a comprehensive cloud computing platform covering both computing and data storage.
  • Cloud computing instance: in the embodiments of the present application, the computing resources created by the cloud platform to support task processing of a local network (for example, an HPC cluster).
  • The job processing method proposed in the embodiments of the present application is mainly applied to a scenario in which an HPC cluster based on a RoCE network applies for computing resources from a cloud platform.
  • The local network (i.e., the HPC cluster) mainly includes a task issuing node 101, a task data storage node 102, and multiple local task processing nodes 103.
  • The task issuing node 101 is used to allocate computing tasks to each local task processing node 103 in the HPC cluster.
  • Each local task processing node 103 in the local network has the RoCE function. Therefore, a local task processing node 103 can bypass the operating system to access task data in the task data storage node 102.
  • The task issuing node 101 in the local network can communicate with the cloud platform 111.
  • The task issuing node 101 may apply to the cloud platform 111 for computing resources.
  • However, a cloud platform 111 that can only provide ordinary computing instances cannot meet the resource requirements of an HPC cluster based on a RoCE network.
  • The job processing method proposed in the embodiments of the present application addresses this scenario: it enables the cloud platform 111 to configure a cloud computing instance 112 with the RoCE function for the HPC cluster, and enables that instance to be added to the HPC cluster as a local task processing node. Because the cloud computing instance 112 and the local task processing nodes 103 form a new HPC cluster, the task issuing node 101 can assign tasks to the cloud computing instance 112 as if it were a local task processing node. The task issuing node 101 therefore does not need to transmit both the task and the corresponding data to the cloud platform 111, which helps improve the efficiency of high-performance computing.
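  • The idea that the task issuing node treats cloud computing instances exactly like local task processing nodes can be sketched as a single worker pool. The round-robin policy and all names are assumptions; the application does not specify a scheduling policy.

```python
from itertools import cycle
from typing import Dict, List


def assign_tasks(
    tasks: List[str], local_nodes: List[str], cloud_instances: List[str]
) -> Dict[str, List[str]]:
    """Assign tasks round-robin over one pool of local and cloud workers."""
    # Cloud instances join the same pool as local task processing nodes.
    workers = list(local_nodes) + list(cloud_instances)
    assignment: Dict[str, List[str]] = {worker: [] for worker in workers}
    for task, worker in zip(tasks, cycle(workers)):
        assignment[worker].append(task)
    return assignment


print(assign_tasks(["t1", "t2", "t3"], ["node-103a"], ["instance-112"]))
```

Only task descriptors move through this scheduler; the task data stays in the task data storage node and is fetched by whichever worker receives the task.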
  • Step 201: The cloud platform receives a resource application.
  • The local network may apply for computing resources from the cloud platform, so the cloud platform can receive a resource application from the local network.
  • The resource application can be triggered by insufficient computing resources in the local network or by the user's resource requirements.
  • The resource application is used to instruct the cloud platform to create a cloud computing instance with the RoCE function, where the RoCE function refers to the ability to access a storage device while bypassing the operating system.
  • The local network may be the HPC cluster described in FIG. 1 above.
  • The resource application also includes other configuration information about the cloud computing instance, so that the cloud platform can create a cloud computing instance that not only has the RoCE function but also matches the local network. For details, refer to the detailed introduction in step 302 below.
  • Step 202: The cloud platform creates a cloud computing instance with the RoCE function according to the resource application.
  • The cloud computing instance refers to the computing resources configured by the cloud platform, and can also be understood as the computing resources created by the cloud platform to support task processing of the local network.
  • Different cloud computing instances can provide different computing capabilities, storage space, and network performance.
  • The cloud computing instance may be a bare metal server (BMS), that is, a physical server that is physically isolated from the servers of other users and that combines the flexibility of a virtual machine with the performance of a physical machine.
  • The cloud computing instance may also be a virtual machine or container created by a physical host, which is not specifically limited here. This embodiment and subsequent embodiments use the general term cloud computing instance for the introduction.
  • After the cloud platform receives the resource application, it creates a cloud computing instance with the RoCE function based on the resource application. Specifically, the cloud platform can directly create a cloud computing instance with the RoCE function, or it can create an ordinary cloud computing instance and then configure the RoCE function for it, which is not specifically limited here.
  • the cloud platform sets the aforementioned cloud computing instance with the RoCE function to access the local network.
  • After the cloud platform creates a cloud computing instance with the RoCE function, the cloud platform also needs to set that cloud computing instance to access the local network, so that the local network can communicate with the cloud computing instance.
  • When the local network is the HPC cluster described in FIG. 1, the local network may include task issuing nodes, task data storage nodes, and computing nodes.
  • When the aforementioned cloud computing instance with the RoCE function is connected to the local network, the cloud computing instance can communicate with the aforementioned task issuing node, task data storage node, and computing node.
  • the cloud platform may only create one cloud computing instance with RoCE function, or it may create multiple cloud computing instances with RoCE function.
  • When the cloud platform creates multiple cloud computing instances with the RoCE function, each of those cloud computing instances executes the following step 204 and step 205 separately.
  • the cloud computing instance receives the task from the local network.
  • the cloud computing instance can receive tasks from the local network, and the tasks refer to high-performance computing tasks.
  • the task can be sent to the cloud computing instance in a message, instruction or other form.
  • the task carries first indication information, and the first indication information is used to indicate data corresponding to the task.
  • the cloud computing instance can access data corresponding to the task located in the local network based on the first indication information carried in the task. Specifically, refer to the detailed introduction in step 306 below.
  • the cloud computing instance obtains task data corresponding to the task from the task data storage node through RDMA according to the task, and performs task processing.
  • Since the cloud computing instance is configured with the RoCE function, the cloud computing instance can obtain the task data corresponding to the task from the task data storage node through RDMA according to the first indication information carried by the aforementioned task. The cloud computing instance then performs task processing on that task data.
  • The task processing includes high-performance computing tasks, for example tasks in high-performance computing scenarios such as supercomputing centers and gene sequencing, or other tasks that process large amounts of data and place high demands on computing performance, stability, and real-time performance.
  • Because the cloud platform can create a cloud computing instance with the RoCE function according to the resource application and connect that cloud computing instance to the local network, the cloud computing instance can bypass the operating system both to receive tasks from the local network and to access the data corresponding to those tasks in the local network, and can thereby process that data. Therefore, the cloud platform does not need to establish an HPC cluster on the cloud, and the local network does not need to transmit both the tasks and the corresponding data to the cloud platform, which helps improve the efficiency of high-performance computing.
  • each node, cloud platform, and cloud computing instance in the local network will perform the following steps:
  • the task issuing node confirms that the number of tasks to be processed exceeds the threshold.
  • step 301 is an optional step.
  • The task issuing node has the function of monitoring the number of tasks. Specifically, the task issuing node can count the number of tasks to be processed, where the number of tasks to be processed can be the number of tasks that the task issuing node has not yet assigned to a local task processing node, or the total number of tasks that the task issuing node needs to process within a certain time range, which is not limited here.
  • When the task issuing node detects that the number of tasks to be processed in the HPC cluster exceeds the threshold, the task issuing node sends a resource request to the cloud platform. The cloud platform then executes step 302.
  • The task issuing node includes an HPC controller and a bursting controller.
  • The bursting controller monitors the number of tasks to be processed according to the job queue information controlled by the HPC controller. When the number of tasks to be processed exceeds the threshold, the bursting controller triggers the step of sending a resource request to the cloud platform.
  • the time range during which the aforementioned number of tasks to be processed exceeds the threshold is also referred to as the peak demand period.
  • the process of the task issuing node applying for resources from the cloud platform during the peak demand period is also called cloud bursting.
  • the aforementioned threshold can be set by the task issuing node according to the computing capability of the HPC cluster, and the specific threshold is not limited here.
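  • The threshold check performed by the bursting controller can be sketched as follows. This is a minimal illustration, not the implementation of this application: the class name `BurstingController`, the field `pending_threshold`, the callback `send_resource_request`, and the choice to request one instance per excess task are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BurstingController:
    """Hypothetical bursting controller that watches the HPC controller's job queue."""

    pending_threshold: int                         # assumed to reflect the cluster's capacity
    send_resource_request: Callable[[int], None]   # stands in for the request to the cloud platform
    requests_sent: int = 0

    def check(self, job_queue: List[str]) -> bool:
        """Send a resource request when the pending-task count exceeds the threshold."""
        pending = len(job_queue)
        if pending > self.pending_threshold:
            # Assumption: ask for one cloud computing instance per excess task.
            self.send_resource_request(pending - self.pending_threshold)
            self.requests_sent += 1
            return True
        return False


sent: List[int] = []
controller = BurstingController(pending_threshold=3, send_resource_request=sent.append)
quiet = controller.check(["job-1", "job-2"])                             # below the threshold
burst = controller.check(["job-1", "job-2", "job-3", "job-4", "job-5"])  # peak demand period
```

  • In this sketch the second call falls in the peak demand period and therefore triggers cloud bursting, while the first call leaves the cloud platform untouched.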
  • the cloud platform receives the resource application.
  • the resource application is used to instruct the cloud platform to create a cloud computing instance with RoCE function.
  • The RoCE function refers to the function of accessing a storage device while bypassing the operating system.
  • the cloud computing instance can be a bare metal server, or a virtual machine or container created by a physical host, which is not specifically limited here.
  • the resource application includes first identification information, and the first identification information is used to indicate that the requested cloud computing instance needs to have a RoCE function.
  • the cloud platform can receive the aforementioned resource application in a variety of different implementation manners:
  • The resource application comes from a task issuing node, and is triggered when the task issuing node detects that the number of tasks to be processed exceeds a threshold.
  • the cloud platform can receive the resource application from the aforementioned task issuing node.
  • the resource request may be triggered by a user-defined requirement.
  • Users can purchase or rent the cloud platform's service for configuring cloud computing instances.
  • the cloud platform can provide users with an interface for configuring cloud computing instances through a client or a web browser.
  • the cloud platform can receive a resource application from the aforementioned client or web browser.
  • the aforementioned resource application also includes configuration information, which is used to indicate the basic configuration of the cloud computing instance required by the HPC cluster.
  • The basic configuration includes the type of host, the number and capacity of hard disks, the type of network card, and the type of application required to form the aforementioned cloud computing instance.
  • The cloud platform includes multiple templates of initial cloud computing instances. Each template has a fixed basic configuration and a template number that uniquely identifies the template.
  • the configuration information in the aforementioned resource application is the template number.
  • the cloud platform can learn which cloud computing instance needs to be configured in the HPC cluster.
  • the cloud platform does not have a template of the initial cloud computing instance, or the templates of the multiple initial cloud computing instances in the cloud platform are inconsistent with the cloud computing instance required by the HPC cluster.
  • the configuration information in the aforementioned resource application includes detailed basic configuration.
  • For example, the configuration information specifies a bare metal server as the host type, two 16 TB hard drives, 128 GB of memory, a 10G ordinary network card, a data analysis application, and a data prediction application.
  • the cloud platform can configure the initial cloud computing instance based on the aforementioned configuration information.
  • When the resource application indicates that multiple cloud computing instances need to be applied for, the resource application includes the configuration information of each of those cloud computing instances.
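  • The contents of a resource application described above can be sketched as a small data model. All class and field names below (`ResourceApplication`, `InstanceConfig`, `BasicConfig`, `requires_roce`, and so on) and the template number `T-001` are hypothetical; the application does not prescribe a message format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BasicConfig:
    """Detailed basic configuration, using the fields named in the example above."""

    host_type: str
    disks_tb: List[int]
    memory_gb: int
    nic_type: str
    applications: List[str]


@dataclass
class InstanceConfig:
    """Configuration information: a template number or a detailed basic configuration."""

    template_number: Optional[str] = None
    basic_config: Optional[BasicConfig] = None

    def __post_init__(self) -> None:
        # Exactly one of the two forms must be present.
        if (self.template_number is None) == (self.basic_config is None):
            raise ValueError("provide exactly one of template_number or basic_config")


@dataclass
class ResourceApplication:
    requires_roce: bool              # plays the role of the first identification information
    instances: List[InstanceConfig]  # one entry per requested cloud computing instance


application = ResourceApplication(
    requires_roce=True,
    instances=[
        InstanceConfig(template_number="T-001"),  # hypothetical template number
        InstanceConfig(basic_config=BasicConfig(
            host_type="bare_metal",
            disks_tb=[16, 16],
            memory_gb=128,
            nic_type="10G-ordinary",
            applications=["data-analysis", "data-prediction"],
        )),
    ],
)
```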
  • The cloud platform creates a cloud computing instance with the RoCE function according to the resource application.
  • After the cloud platform receives the aforementioned resource application, the cloud platform determines the initial cloud computing instance based on the aforementioned configuration information. Specifically, the way in which the cloud platform determines the initial cloud computing instance depends on the form the configuration information takes.
  • When the configuration information in the aforementioned resource application is a template number, the cloud platform selects, from the multiple initial cloud computing instances, the initial cloud computing instance corresponding to the template number.
  • When the configuration information in the aforementioned resource application includes a detailed basic configuration, the cloud platform creates an initial cloud computing instance based on that basic configuration.
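  • The two ways of determining the initial cloud computing instance can be sketched as one function. The template catalogue `TEMPLATES`, its keys, and the function name `determine_initial_instance` are assumptions made for illustration only.

```python
from typing import Dict, Optional

# Hypothetical template catalogue: template number -> fixed basic configuration.
TEMPLATES: Dict[str, Dict[str, object]] = {
    "T-001": {"host_type": "bare_metal", "memory_gb": 128, "nic_type": "ordinary"},
    "T-002": {"host_type": "virtual_machine", "memory_gb": 64, "nic_type": "ordinary"},
}


def determine_initial_instance(
    template_number: Optional[str] = None,
    basic_config: Optional[Dict[str, object]] = None,
) -> Dict[str, object]:
    """Return the configuration of the initial cloud computing instance."""
    if template_number is not None:
        if template_number not in TEMPLATES:
            raise KeyError(f"unknown template number: {template_number}")
        # Copy so that callers cannot modify the shared template.
        return dict(TEMPLATES[template_number])
    if basic_config is not None:
        # No matching template: build the instance from the detailed configuration.
        return dict(basic_config)
    raise ValueError("configuration information is missing")


from_template = determine_initial_instance(template_number="T-002")
from_config = determine_initial_instance(
    basic_config={"host_type": "container", "memory_gb": 32}
)
```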
  • The cloud platform uses a RoCE software package to set the RoCE function for the aforementioned initial cloud computing instance, obtaining a cloud computing instance with the RoCE function.
  • the RoCE software package is pre-stored in a storage device of the cloud platform, or pre-stored in a database managed by the cloud platform.
  • the cloud platform obtains the RoCE software package from the aforementioned storage device or database, and sends the RoCE software package to the aforementioned initial cloud computing instance. Then, the cloud platform triggers the installation of the RoCE software package in the initial cloud computing instance to obtain the cloud computing instance with the RoCE function.
  • the RoCE software package may be written into the storage device of the initial cloud computing instance when the cloud platform configures the initial cloud computing instance.
  • the cloud platform then triggers the initial cloud computing instance to install the RoCE software package.
  • When the cloud platform triggers the initial cloud computing instance to install the RoCE software package, the cloud platform first triggers the startup disk in the cloud computing instance, and then copies the RoCE software package to the startup disk in the cloud computing instance.
  • the startup disk is located in the startup program list in the computing node, and the startup disk is used to trigger the installation of the RoCE software package when the operating system in the cloud computing instance starts.
  • a specific implementation method for configuring the RoCE function for an initial cloud computing instance without a RoCE network card is proposed.
  • The aforementioned initial cloud computing instance simulates the RoCE network card by running the RoCE software. Because the cloud computing instance can use a common network card instead of a RoCE network card, the configuration cost of cloud computing instances is reduced and the feasibility of the solution is improved.
  • The cloud platform also configures intermediate adaptation software in the aforementioned cloud computing instance with the RoCE function. The intermediate adaptation software is used to change the calls made by applications in the cloud computing instance into remote direct memory access (RDMA) calls.
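  • The installation flow through the startup disk can be illustrated with a small in-memory simulation. Nothing here touches a real operating system or network card: `InitialInstance`, `configure_roce`, and the package name `roce-package` are hypothetical, and the ordering (copy to the startup disk, then start the operating system) is one reading of the steps above.

```python
from dataclasses import dataclass, field
from typing import List

ROCE_PACKAGE = "roce-package"  # hypothetical package name


@dataclass
class InitialInstance:
    """Simulated initial cloud computing instance with an ordinary network card."""

    startup_disk: List[str] = field(default_factory=list)  # items handled when the OS starts
    installed: List[str] = field(default_factory=list)
    running: bool = False

    def boot(self) -> None:
        """Start the operating system and install whatever is on the startup disk."""
        self.running = True
        for package in self.startup_disk:
            if package not in self.installed:
                self.installed.append(package)

    @property
    def has_roce(self) -> bool:
        return ROCE_PACKAGE in self.installed


def configure_roce(instance: InitialInstance) -> InitialInstance:
    """Copy the package to the startup disk, then boot so that installation is triggered."""
    instance.startup_disk.append(ROCE_PACKAGE)
    instance.boot()
    return instance


plain = InitialInstance()
roce_instance = configure_roce(InitialInstance())
```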
  • the cloud platform sets the aforementioned cloud computing instance with the RoCE function to access the local network.
  • The cloud platform needs to set the aforementioned cloud computing instance with the RoCE function to access the local network (that is, the local HPC cluster).
  • the local network is provided with a task issuing node and a task data storage node.
  • the aforementioned resource application also includes information about the HPC cluster.
  • The HPC cluster information is used to indicate the address of each node in the HPC cluster and the connection between the HPC cluster and the gateway, so that the cloud platform can connect the cloud computing instance with the RoCE function to the local HPC cluster based on the HPC cluster information.
  • The information of the HPC cluster includes: the Internet Protocol (IP) address of the task issuing node, the port number of the task issuing node, user identification information (also called the tenant ID), and the universally unique identifier (UUID) of the bridge device.
  • The aforementioned bridge device is a Layer 2 bridge (L2BR) supporting the RoCE protocol.
  • The information of the HPC cluster further includes a virtual local area network (VLAN) range.
  • the cloud platform connects the cloud computing instance to the local HPC cluster according to the aforementioned information of the HPC cluster.
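  • The HPC cluster information listed above can be sketched as a validated record. The class `HpcClusterInfo`, its field names, and the sample values (an address from the documentation range 192.0.2.0/24, port 8080, and an arbitrary UUID) are assumptions for illustration; the VLAN bounds follow the usable IEEE 802.1Q ID range of 1 to 4094.

```python
import ipaddress
import uuid
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class HpcClusterInfo:
    issuer_ip: str      # IP address of the task issuing node
    issuer_port: int    # port number of the task issuing node
    tenant_id: str      # user identification information
    bridge_uuid: str    # UUID of the bridge device
    vlan_range: Optional[Tuple[int, int]] = None  # optional VLAN range

    def validate(self) -> None:
        """Raise ValueError if any field is malformed."""
        ipaddress.ip_address(self.issuer_ip)  # raises ValueError on a bad address
        uuid.UUID(self.bridge_uuid)           # raises ValueError on a bad UUID
        if not 1 <= self.issuer_port <= 65535:
            raise ValueError("port out of range")
        if self.vlan_range is not None:
            low, high = self.vlan_range
            if not 1 <= low <= high <= 4094:
                raise ValueError("invalid VLAN range")


info = HpcClusterInfo(
    issuer_ip="192.0.2.10",
    issuer_port=8080,
    tenant_id="tenant-1",
    bridge_uuid="123e4567-e89b-12d3-a456-426614174000",
    vlan_range=(100, 199),
)
info.validate()
```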
  • the task issuing node and multiple local task processing nodes are connected through a first switch, and multiple cloud computing instances configured on the cloud platform are also connected through a second switch.
  • the aforementioned first switch and second switch are connected to the bridge device through a gateway.
  • the bridge device may be the aforementioned Layer 2 bridge L2BR.
  • The local HPC cluster includes local task processing node 1 (node1), local task processing node 2 (node2), and local task processing node 3 (node3), where each local task processing node is configured with a RoCE network interface controller (NIC), that is, a RoCE network card.
  • Each cloud computing instance is configured with a common network card, but each cloud computing instance runs RoCE software, which can simulate the RoCE network card to realize the function of the RoCE network card.
  • the multiple cloud computing instances are connected through a second switch.
  • the aforementioned first switch and second switch are connected to the Layer 2 bridge L2BR through a gateway. Therefore, the aforementioned multiple cloud computing instances are added to the local network (ie, the local HPC cluster) as task processing nodes.
  • FIG. 3B is only a schematic diagram of logical connections, and some physical gateways are not shown.
  • the cloud platform sends the first notification to the task issuing node.
  • step 305 is an optional step.
  • the first notification is used to indicate that the configuration of the cloud computing instance is completed and the cloud computing instance has been connected to the local network (ie, the local HPC cluster).
  • After the task issuing node receives the aforementioned first notification, the task issuing node executes step 306.
  • the task issuing node can detect idle computing resources in the HPC cluster. Since the cloud computing instance that has just accessed the local network is in an idle state, when the task issuing node detects that there are computing resources in an idle state, the task issuing node will execute step 306.
  • the task issuing node sends a task to the cloud computing instance.
  • the task issuing node may send the tasks in the task queue that are not allocated to the local task processing node to the cloud computing instance.
  • The task carries first indication information, which is used to indicate the data corresponding to the task.
  • the cloud computing instance can access data corresponding to the task located in the local network based on the first indication information carried in the task.
  • the first indication information includes a task identifier and/or an address of data corresponding to the task.
  • the task identifier is used to uniquely identify a task in the task queue.
  • The task identifier can be a queue number, or another character or string.
  • The data corresponding to the task also carries the same task identifier.
  • the data corresponding to the task is stored in the task data storage node, and the head of the memory block where the data corresponding to the task is located contains the aforementioned task identifier.
  • When the cloud computing instance obtains the task identifier from the received task, the cloud computing instance can traverse the memory in the task data storage node.
  • When the cloud computing instance finds a memory block whose head contains the task identifier, it can read the data in that memory block and thereby obtain the data corresponding to the task.
  • the first indication information may also include the address of the data corresponding to the task.
  • The address of the data corresponding to the task may be a physical address or a logical address. It may be the address of a memory block in the task data storage node, or the address of a memory block in a local task processing node in the local HPC cluster, which is not specifically limited here.
  • the cloud computing instance may directly obtain the data corresponding to the task from the address of the data corresponding to the task based on the first indication information.
  • When the first indication information includes the address of the data corresponding to the task, the first indication information may also include the processing result address.
  • the processing result address is used to store the processing result obtained after the cloud computing instance processes the data corresponding to the aforementioned task.
  • the address of the data corresponding to the task and the address of the processing result can be addresses in the same node or device.
  • For example, the address of the data corresponding to the task indicates one memory block in the task data storage node, and the processing result address indicates another memory block in the task data storage node.
  • the address of the data corresponding to the task and the address of the processing result may also be addresses in different nodes or devices.
  • For example, the address of the data corresponding to the task indicates a memory block in the task data storage node, and the processing result address indicates a memory block in the local task processing node. The details are not limited here.
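  • The two forms of the first indication information, a task identifier or a data address, can be sketched against a simulated task data storage node. The names `FirstIndication`, `MemoryBlock`, and `locate_task_data` are hypothetical, and a Python dictionary stands in for remotely accessible memory; no real RDMA access is performed.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MemoryBlock:
    head_task_id: str  # task identifier stored in the head of the memory block
    payload: bytes     # data corresponding to the task


@dataclass
class FirstIndication:
    task_id: Optional[str] = None
    data_address: Optional[int] = None
    result_address: Optional[int] = None  # only meaningful together with data_address


def locate_task_data(indication: FirstIndication, blocks: Dict[int, MemoryBlock]) -> bytes:
    """Find the data for a task in a simulated task data storage node."""
    if indication.data_address is not None:
        # Address form: read the indicated memory block directly.
        return blocks[indication.data_address].payload
    if indication.task_id is not None:
        # Identifier form: traverse the memory and compare each block head.
        for block in blocks.values():
            if block.head_task_id == indication.task_id:
                return block.payload
    raise LookupError("no data found for the task")


blocks = {
    0x10: MemoryBlock("job-7", b"input-7"),
    0x20: MemoryBlock("job-8", b"input-8"),
}
by_id = locate_task_data(FirstIndication(task_id="job-8"), blocks)
by_address = locate_task_data(FirstIndication(data_address=0x10, result_address=0x30), blocks)
```

  • The address form avoids the traversal entirely, which is one reason a task might carry both an identifier and an address.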
  • the cloud computing instance obtains task data corresponding to the task from the task data storage node through remote direct memory access RDMA according to the task, and performs task processing.
  • After the cloud computing instance receives the aforementioned task, it obtains the task data corresponding to the task from the task data storage node through RDMA according to the first indication information carried by the task.
  • The cloud computing instance can also access the data in the storage device of a computing node while bypassing the operating system.
  • the cloud computing instance obtains task data through RDMA in different ways. For details, please refer to the relevant introduction in the foregoing step 306, which will not be repeated here.
  • the cloud computing instance will perform task processing on the aforementioned task data.
  • the task processing includes high-performance computing tasks.
  • Examples are tasks in high-performance computing scenarios such as supercomputing centers and gene sequencing, or other tasks that process large amounts of data and place high demands on computing performance, stability, and real-time performance.
  • the aforementioned task data storage node may be independent of the computing node in the local HPC cluster.
  • the task data storage node is a database in the local HPC cluster.
  • the task data storage node can also be integrated with the aforementioned task issuing node.
  • the task data storage node may be located in the task issuing node as a storage device in the task issuing node. At this time, the task data storage node can not only store the data corresponding to the task, but also store the task queue formulated by the task issuing node.
  • After the cloud computing instance with the RoCE function configured by the cloud platform is connected to the local network, the cloud computing instance joins the local HPC cluster as a computing node. The cloud computing instance can therefore receive the task from the task issuing node, access the data corresponding to the task in the task data storage node while bypassing the operating system, and then process that data. As a result, the task issuing node does not need to transmit both the task and the data corresponding to the task to the cloud platform, which helps improve the efficiency of high-performance computing.
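  • The processing step on the cloud computing instance (read the task data, compute, write the result) can be sketched with an in-memory stand-in for the task data storage node. The methods `rdma_read` and `rdma_write` below are hypothetical placeholders that only mimic one-sided reads and writes; a real deployment would go through an RDMA verbs library, which this sketch does not attempt to show.

```python
from typing import Callable, Dict


class TaskDataStorageNode:
    """In-memory stand-in for the task data storage node."""

    def __init__(self) -> None:
        self.memory: Dict[int, bytes] = {}

    def rdma_read(self, address: int) -> bytes:
        # Placeholder for a one-sided RDMA read of a remote memory block.
        return self.memory[address]

    def rdma_write(self, address: int, data: bytes) -> None:
        # Placeholder for a one-sided RDMA write to a remote memory block.
        self.memory[address] = data


def run_task(
    storage: TaskDataStorageNode,
    data_address: int,
    result_address: int,
    compute: Callable[[bytes], bytes],
) -> bytes:
    """Fetch the task data, process it, and store the result at the result address."""
    data = storage.rdma_read(data_address)
    result = compute(data)
    storage.rdma_write(result_address, result)
    return result


storage = TaskDataStorageNode()
storage.memory[0x1000] = b"ACGTAC"  # toy sequence standing in for task data
# Toy computation: count occurrences of the base "A".
result = run_task(storage, 0x1000, 0x2000, lambda seq: str(seq.count(b"A")).encode())
```

  • Only the task itself travels to the cloud computing instance; the data stays in the local network and is read in place, which is the point of the scheme.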
  • When the local HPC cluster no longer needs the foregoing cloud computing instance, the local HPC cluster revokes the cloud computing instance through the following steps.
  • the task issuing node sends a resource cancellation request to the cloud platform.
  • the resource cancellation request may request the cancellation of all cloud computing instances that the cloud platform has applied for, or only a certain cloud computing instance.
  • the resource cancellation request may only carry the identification information of the user.
  • the resource cancellation request includes second identification information.
  • the second identification information is used to indicate the cloud computing instance that needs to be cancelled.
  • The second identification information may be an identifier of the cloud computing instance.
  • the second identification information may be set by the cloud platform when the cloud computing instance is created, or may be carried by the task issuing node in the aforementioned resource application, and the specific information is not limited here.
  • the task issuing node may execute step 308.
  • The user may no longer need the cloud computing instance, for example because the user's lease term expires or the user suspends renting the cloud computing instance.
  • the task issuing node may also trigger the aforementioned step 308.
  • the cloud platform cancels the cloud computing instance according to the resource cancellation request.
  • the cloud platform After the cloud platform receives the aforementioned resource cancellation request, the cloud platform will cancel one or more cloud computing instances corresponding to the second identification information according to the second identification information in the resource cancellation request.
  • The cloud platform also sends a second notification to the task issuing node. The second notification is used to notify that the cloud computing instance corresponding to the second identification information has been cancelled.
  • the task issuing node may send a resource cancellation request to the cloud platform, so that the cloud platform cancels the aforementioned cloud computing instance according to the cancellation request.
  • the task issuing node can revoke the aforementioned cloud computing instance, and the cloud platform can also allocate the aforementioned cloud computing instance to other clusters. Therefore, the cloud computing instance configured by the aforementioned cloud platform can be made more flexible, which is beneficial to improve the utilization rate of the cloud computing instance.
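  • The two cancellation modes, all instances of a user or only the instances named by the second identification information, can be sketched as follows. The function `cancel_instances` and the identifiers used are hypothetical; the returned list stands in for the content of the second notification.

```python
from typing import Dict, List, Optional


def cancel_instances(
    instances: Dict[str, str],
    tenant_id: str,
    instance_ids: Optional[List[str]] = None,
) -> List[str]:
    """Cancel cloud computing instances and return the cancelled identifiers.

    `instances` maps an instance identifier to the tenant that owns it.
    When `instance_ids` is None, every instance of the tenant is cancelled.
    """
    if instance_ids is None:
        targets = [iid for iid, owner in instances.items() if owner == tenant_id]
    else:
        # Ignore identifiers that do not exist or belong to another tenant.
        targets = [iid for iid in instance_ids if instances.get(iid) == tenant_id]
    for iid in targets:
        del instances[iid]  # the instance can now be allocated to another cluster
    return targets


instances = {"i-1": "tenant-a", "i-2": "tenant-a", "i-3": "tenant-b"}
one = cancel_instances(instances, "tenant-a", ["i-2"])  # cancel a single instance
rest = cancel_instances(instances, "tenant-a")          # cancel everything that remains
```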
  • As shown in FIG. 4, this embodiment of the application provides a schematic structural diagram of a cloud platform 40.
  • the cloud platform 40 may be a server, a large-scale computing device, or a large-scale management device, which is not specifically limited here.
  • the cloud platforms in the foregoing method embodiments corresponding to FIG. 2 and FIG. 3A may be based on the structure of the cloud platform 40 shown in FIG. 4.
  • the cloud platform 40 includes at least one processor 401 and at least one memory 402. It should be understood that FIG. 4 only shows one processor 401 and one memory 402.
  • The processor 401 may be a general-purpose central processing unit (CPU), a microprocessor, a network processor (NP), an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the program of this application.
  • the aforementioned processor 401 may be a single-CPU processor or a multi-CPU processor.
  • the processor 401 may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).
  • The processor 401 can be a separate semiconductor chip, or it can be integrated with other circuits into one semiconductor chip, for example as part of a system-on-a-chip (SoC) or an application-specific integrated circuit (ASIC).
  • The aforementioned memory 402 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions; a random access memory (RAM) or another type of dynamic storage device that can store information and instructions; or an electrically erasable programmable read-only memory (EEPROM), which is not specifically limited here.
  • the memory 402 may exist independently, but is connected to the aforementioned processor 401.
  • The memory 402 may also be integrated with the aforementioned processor 401, for example in one or more chips.
  • the memory 402 is also used to store program codes for executing the technical solutions of the embodiments of the present application.
  • the foregoing program codes can be controlled and executed by the processor 401, and various types of computer program codes that are executed can also be regarded as drivers of the processor 401. Therefore, the aforementioned processor 401 may analyze the received resource application, set up a cloud computing instance with RoCE function according to the resource application, and set the cloud computing instance to access the local network.
  • the processor 401 may also create an initial cloud computing instance, and configure the RoCE function for the initial cloud computing instance.
  • the processor 401 may also analyze the resource cancellation request, and cancel the cloud computing instance configured for the HPC cluster according to the resource cancellation request.
  • the cloud platform 40 further includes a communication interface 403, which is used to communicate with other servers or network devices, so that the cloud platform can receive instructions or data from other devices.
  • the communication interface 403 may receive a resource application or a resource cancellation request from the task transceiver device.
  • the communication interface 403 is also used to send data or instructions to other devices.
  • the communication interface 403 may send the RoCE software package to the initial cloud computing instance, so that the initial cloud computing instance can install the RoCE software program according to the RoCE software package.
  • As shown in FIG. 5, this embodiment of the application provides a schematic structural diagram of a cloud platform 50.
  • the cloud platform 50 may be a server, a large-scale computing device, or a large-scale management device, which is not specifically limited here.
  • the cloud platforms in the foregoing method embodiments corresponding to FIG. 2 and FIG. 3A may be based on the structure of the cloud platform 50 shown in FIG. 5.
  • the cloud platform 50 includes multiple functional modules.
  • the aforementioned functional modules may be integrated into one processing unit, or each module may exist alone physically, or two or more modules may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the cloud platform 50 includes:
  • The receiving module 501 is configured to receive a resource application, where the resource application is used to instruct the cloud platform to create a cloud computing instance with the RDMA over Converged Ethernet (RoCE) function;
  • the resource configuration module 502 is configured to set up a cloud computing instance with RoCE function according to the resource application, and set the cloud computing instance to access the local network.
  • the local network is provided with a task issuing node and a task data storage node.
  • the task issuing node is used to send tasks to the cloud computing instance; the task data storage node is used to provide task data corresponding to the task to the cloud computing instance through remote direct memory access RDMA.
  • Because the cloud platform 50 can create a cloud computing instance with the RoCE function according to the resource application and connect that cloud computing instance to the local network, the cloud computing instance can bypass the operating system both to receive tasks from the local network and to access the data corresponding to those tasks in the local network, and can thereby process that data. Therefore, the cloud platform does not need to establish an HPC cluster on the cloud, and the local network does not need to transmit both the tasks and the corresponding data to the cloud platform, which helps improve the efficiency of high-performance computing.
  • The receiving module 501 is specifically configured to: receive, via the transceiver, the resource application from the task issuing node when the task issuing node confirms that the number of tasks to be processed exceeds a threshold.
  • The receiving module 501 is further configured to receive a resource cancellation request from the task issuing node, where the resource cancellation request is used to indicate that the task has been completed or does not need to be executed; the resource configuration module 502 is also used to revoke the cloud computing instance according to the resource cancellation request.
  • The resource configuration module 502 is specifically used to: obtain the RoCE software package; control the transceiver to send the RoCE software package to the initial cloud computing instance; and trigger the installation of the RoCE software package in the initial cloud computing instance, to obtain the cloud computing instance with the RoCE function.
  • The resource configuration module 502 is further configured to: create an initial cloud computing instance according to the configuration information; or search, according to the configuration information, the multiple initial cloud computing instances for the initial cloud computing instance corresponding to the configuration information.
  • each functional module in the cloud platform 50 may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • each functional module in the cloud platform 50 may be implemented in the form of a computer program product in whole or in part.
  • If each functional module in the cloud platform 50 is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The technical solution of the present application in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product.
  • The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage media include USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

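Embodiments of this application disclose a job processing method and a related device, used to configure a cloud computing instance with a RoCE function for a local network so as to improve the efficiency of high-performance computing. In the job processing method, a cloud platform creates a cloud computing instance with the RoCE function according to a received resource application, and sets the cloud computing instance to access the local network, so that the cloud computing instance processes the data corresponding to tasks in the local network. Because the cloud computing instance can bypass the operating system to receive tasks from the local network and to access the data corresponding to those tasks in the local network, the cloud platform does not need to build an HPC cluster on the cloud, and the local network does not need to transmit both the tasks and their corresponding data to the cloud platform. This helps improve the efficiency of high-performance computing.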

Description

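Job processing method and related device
Technical Field
Embodiments of this application relate to the field of cloud computing, and in particular to a job processing method and a related device.
Background
Remote direct memory access (RDMA) is a technology for accessing data in the memory of a remote host while bypassing that host's operating system kernel. It saves processing resources, increases system throughput, and reduces network communication latency. RDMA has several implementations, one of which is RDMA over Converged Ethernet (RoCE). RoCE is commonly used in high-performance computing (HPC) based on a unified cluster.
In the prior art, when the computing resources of a local HPC cluster based on a RoCE network are insufficient to support high-performance computing, the local HPC cluster applies to a cloud platform for computing resources. The cloud platform then rebuilds an HPC cluster on the cloud and sends the tasks from the local HPC cluster, together with the data corresponding to the tasks, to a cloud resource control node in the cloud HPC cluster. The cloud resource control node distributes the tasks and the corresponding data to cloud computing nodes, so that the cloud computing nodes perform high-performance computing according to the tasks.
However, the HPC cluster on the cloud and the local HPC cluster are two mutually independent clusters. The local HPC cluster must transmit both the tasks and the corresponding data to the cloud HPC cluster before the cloud HPC cluster can perform high-performance computing. In this process, the local resource control node in the local HPC cluster must partition the local tasks. But the tasks in high-performance computing are heavy and hard to partition, and transmitting the corresponding data to the cloud platform is limited by transmission bandwidth. The foregoing solution therefore cannot effectively improve the efficiency of high-performance computing.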
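Summary
Embodiments of this application provide a job processing method and a related device, used to configure a cloud computing instance with a RoCE function for a local network so as to improve the efficiency of high-performance computing.
According to a first aspect, an embodiment of this application provides a job processing method that can be applied to high-performance computing scenarios. In the job processing method, a cloud platform receives a resource application under a preset condition. The resource application instructs the cloud platform to create a cloud computing instance with a remote direct memory access over Converged Ethernet (RoCE) function. The cloud platform then creates a cloud computing instance with the RoCE function according to the resource application and sets the cloud computing instance to access the local network, so that the cloud computing instance processes the data corresponding to tasks in the local network.
The preset condition may include one or more of the following:
the computing resources of the local network are insufficient to support high-performance computing; or a user submits a resource application to the cloud platform according to the user's own needs. This is not specifically limited here.
In the embodiments of this application, because the cloud platform can create a cloud computing instance with the RoCE function according to the resource application and connect the cloud computing instance to the local network, the cloud computing instance can bypass the operating system to receive tasks from the local network and to access the data corresponding to those tasks in the local network, and can thus process the data corresponding to the tasks in the local network. The cloud platform therefore does not need to build an HPC cluster on the cloud, and the local network does not need to transmit both the tasks and their corresponding data to the cloud platform. This helps improve the efficiency of high-performance computing.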
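Based on the first aspect, in an optional implementation, a task issuing node and a task data storage node are provided in the local network. In this case, in the job processing method, the cloud computing instance may receive a task sent by the task issuing node; the cloud computing instance then obtains, according to the task, the task data corresponding to the task from the task data storage node through remote direct memory access (RDMA), and performs task processing.
This implementation proposes that the local network contains two nodes with different functions. The task issuing node issues tasks to local computing nodes. The task data storage node stores the data corresponding to the tasks, and local computing nodes can obtain that data from it. Because the cloud computing instance has been connected to the local network, it is in effect connected to the task issuing node and the task data storage node, so the task issuing node can issue tasks to the cloud computing instance. Because the cloud computing instance has the RoCE function, it can bypass the operating system to access the data corresponding to a task in the task data storage node and then process that data. In such an implementation, the cloud computing instance actually joins the local HPC cluster as a computing node. Because the cloud computing instance and the local computing nodes form one HPC cluster, the task issuing node can assign tasks to the cloud computing instance just as it does to a local computing node. The task issuing node therefore does not need to transmit both the tasks and the corresponding data to the cloud platform, which helps improve the efficiency of high-performance computing.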
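Based on the first aspect or the foregoing optional implementation, in another optional implementation, before the cloud platform receives the resource application, the method further includes: the task issuing node confirms that the number of tasks to be processed exceeds a threshold.
This implementation proposes that the task issuing node can monitor the number of tasks to be processed and weigh that number against the computing capability of the local computing nodes. When the task issuing node confirms that the number of tasks to be processed exceeds the threshold, that is, when the computing capability (or computing resources) of the local computing nodes is insufficient to support the tasks to be processed, the task issuing node triggers the step of sending a resource application to the cloud platform.
It should be understood that the tasks to be processed may be the tasks that the task issuing node has not yet assigned to local computing nodes, or the total set of tasks that the task issuing node needs to complete within a certain time range (that is, including tasks already assigned to local computing nodes). This is not specifically limited here. The threshold takes different values depending on which meaning applies, and the task issuing node can adjust it according to actual needs. This is not specifically limited here either.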
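Based on the first aspect or the foregoing optional implementations, in another optional implementation, the method further includes: the cloud platform receives a resource revocation request sent by the task issuing node, where the resource revocation request indicates that the task has been completed or does not need to be executed. The cloud platform then revokes the cloud computing instance according to the resource revocation request.
This implementation proposes that, when the task has been completed or does not need to be executed, the task issuing node can send a resource revocation request to the cloud platform, so that the cloud platform revokes the cloud computing instance according to the request. In such a solution, the task issuing node can revoke the cloud computing instance, and the cloud platform can also allocate the cloud computing instance to other clusters. The cloud computing instances configured by the cloud platform thus become more flexible, which helps improve their utilization.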
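Based on the first aspect or the foregoing optional implementations, in another optional implementation, the cloud platform creating a cloud computing instance with the RoCE function according to the resource application includes: the cloud platform obtains a RoCE software package. The cloud platform then sends the RoCE software package to an initial cloud computing instance and triggers installation of the RoCE software package in the initial cloud computing instance, to obtain the cloud computing instance with the RoCE function.
The RoCE software package may be pre-stored in the cloud platform or in the initial cloud computing instance. This is not specifically limited here.
This implementation proposes a specific way to configure the RoCE function for an initial cloud computing instance that has no RoCE network interface card. Specifically, the cloud platform can install the RoCE software package in the initial cloud computing instance, so that the initial cloud computing instance emulates a RoCE network interface card by running RoCE software. In such an implementation, the cloud computing instance can use an ordinary network interface card and does not need a RoCE network interface card. This saves the configuration cost of the cloud computing instance and improves the implementability of the solution.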
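Based on the first aspect or the foregoing optional implementations, in another optional implementation, the resource application includes configuration information, and the configuration information indicates the configuration of the cloud computing instance required by the local network. The method further includes: the cloud platform creates an initial cloud computing instance according to the configuration information; or the cloud platform selects, according to the configuration information, the initial cloud computing instance corresponding to the configuration information from among multiple cloud computing instances.
This implementation proposes that the resource application contains configuration information. The cloud platform can create an initial cloud computing instance based on the configuration information, or search the existing initial cloud computing instances for one that matches the configuration information.
Based on the first aspect or the foregoing optional implementations, in another optional implementation, the cloud computing instance is a virtual machine, a container, or a bare metal server.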
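According to a second aspect, an embodiment of this application provides a cloud platform that includes a processor, a network interface, and a memory. The memory is configured to store data and program code. The network interface is configured to receive a resource application, where the resource application instructs the cloud platform to create a cloud computing instance with a remote direct memory access over Converged Ethernet (RoCE) function. The processor is configured to create a cloud computing instance with the RoCE function according to the resource application and to set the cloud computing instance to access the local network.
In the embodiments of this application, because the cloud platform can create a cloud computing instance with the RoCE function according to the resource application and connect the cloud computing instance to the local network, the cloud computing instance can bypass the operating system to receive tasks from the local network and to access the data corresponding to those tasks in the local network, and can thus process the data corresponding to the tasks in the local network. The cloud platform therefore does not need to build an HPC cluster on the cloud, and the local network does not need to transmit both the tasks and their corresponding data to the cloud platform. This helps improve the efficiency of high-performance computing.
Based on the second aspect, in an optional implementation, a task issuing node and a task data storage node are provided in the local network. The task issuing node is configured to send tasks to the cloud computing instance; the task data storage node is configured to provide the task data corresponding to a task to the cloud computing instance through remote direct memory access (RDMA).
Based on the second aspect or the foregoing optional implementation, in another optional implementation, the network interface is specifically configured so that, when the task issuing node confirms that the number of tasks to be processed exceeds a threshold, the network interface receives the resource application from the task issuing node.
Based on the second aspect or the foregoing optional implementations, in another optional implementation, the network interface is further configured to receive a resource revocation request from the task issuing node, where the resource revocation request indicates that the task has been completed or does not need to be executed; the processor is further configured to revoke the cloud computing instance according to the resource revocation request.
Based on the second aspect or the foregoing optional implementations, in another optional implementation, the processor is specifically configured to: obtain a RoCE software package; control the network interface to send the RoCE software package to an initial cloud computing instance; and trigger installation of the RoCE software package in the initial cloud computing instance, to obtain the cloud computing instance with the RoCE function.
Based on the second aspect or the foregoing optional implementations, in another optional implementation, the resource application includes configuration information, and the configuration information indicates the configuration of the cloud computing instance required by the local network. In addition, the processor is further configured to: create an initial cloud computing instance according to the configuration information; or select, according to the configuration information, the initial cloud computing instance corresponding to the configuration information from among multiple cloud computing instances.
Based on the second aspect or the foregoing optional implementations, in another optional implementation, the cloud computing instance is a virtual machine, a container, or a bare metal server.
It should be noted that the embodiments of this application have various other specific implementations. For details, refer to the specific implementations of the first aspect and their beneficial effects, which are not repeated here.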
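According to a third aspect, an embodiment of this application provides a cloud platform that includes:
a receiving module, configured to receive a resource application, where the resource application instructs the cloud platform to create a cloud computing instance with a remote direct memory access over Converged Ethernet (RoCE) function; and
a resource configuration module, configured to create a cloud computing instance with the RoCE function according to the resource application and to set the cloud computing instance to access the local network.
In the embodiments of this application, because the cloud platform can create a cloud computing instance with the RoCE function according to the resource application and connect the cloud computing instance to the local network, the cloud computing instance can bypass the operating system to receive tasks from the local network and to access the data corresponding to those tasks in the local network, and can thus process the data corresponding to the tasks in the local network. The cloud platform therefore does not need to build an HPC cluster on the cloud, and the local network does not need to transmit both the tasks and their corresponding data to the cloud platform. This helps improve the efficiency of high-performance computing.
Based on the third aspect, in an optional implementation, a task issuing node and a task data storage node are provided in the local network. The task issuing node is configured to send tasks to the cloud computing instance; the task data storage node is configured to provide the task data corresponding to a task to the cloud computing instance through remote direct memory access (RDMA).
Based on the third aspect or the foregoing optional implementation, in another optional implementation, the receiving module is specifically configured so that, when the task issuing node confirms that the number of tasks to be processed exceeds a threshold, the transceiver receives the resource application from the task issuing node.
Based on the third aspect or the foregoing optional implementations, in another optional implementation, the receiving module is further configured to receive a resource revocation request from the task issuing node, where the resource revocation request indicates that the task has been completed or does not need to be executed; the resource configuration module is further configured to revoke the cloud computing instance according to the resource revocation request.
Based on the third aspect or the foregoing optional implementations, in another optional implementation, the resource configuration module is specifically configured to: obtain a RoCE software package; control the transceiver to send the RoCE software package to an initial cloud computing instance; and trigger installation of the RoCE software package in the initial cloud computing instance, to obtain the cloud computing instance with the RoCE function.
Based on the third aspect or the foregoing optional implementations, in another optional implementation, the resource application includes configuration information, and the configuration information indicates the configuration of the cloud computing instance required by the local network. In addition, the resource configuration module is further configured to: create an initial cloud computing instance according to the configuration information; or search, according to the configuration information, among multiple initial cloud computing instances for the initial cloud computing instance corresponding to the configuration information.
Based on the third aspect or the foregoing optional implementations, in another optional implementation, the cloud computing instance is a virtual machine, a container, or a bare metal server.
It should be noted that the embodiments of this application have various other specific implementations. For details, refer to the specific implementations of the first aspect and their beneficial effects, which are not repeated here.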
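According to a fourth aspect, an embodiment of this application provides a job processing system. The job processing system includes a cloud platform, a cloud computing instance, and a local network, where the local network includes a task issuing node and a task data storage node. In the job processing system, the cloud platform is configured to receive a resource application, create a cloud computing instance with the RoCE function according to the resource application, and set the cloud computing instance to access the local network, where the resource application instructs the cloud platform to create a cloud computing instance with a remote direct memory access over Converged Ethernet (RoCE) function. The cloud computing instance is configured to receive a task sent by the task issuing node, obtain, according to the task, the task data corresponding to the task from the task data storage node through remote direct memory access (RDMA), and perform task processing.
In addition, for other functions of the cloud platform in the job processing system, refer to the implementations of the first aspect or the second aspect; for other functions of the cloud computing instance in the job processing system, refer to the implementations of the first aspect or the second aspect.
According to a fifth aspect, an embodiment of this application provides a bare metal server. The bare metal server is a computing server that combines the elasticity of a virtual machine with the performance of a physical machine, and is used to provide excellent computing performance and data security for services such as core databases, key application systems, high-performance computing, and big data. The bare metal server includes a processing module and a transceiver module. The processing module may be a processor, and the transceiver module may be an input/output device or a network interface. The bare metal server may further include a storage module, which may be a memory. The storage module is configured to store instructions, and the processing module executes the instructions stored in the storage module, so that the bare metal server performs the functions of the cloud computing instance in the first aspect or the second aspect.
According to a sixth aspect, an embodiment of this application provides a physical machine, which is used to create a virtual machine or a container based on the resource application in the first aspect or the second aspect. The physical machine includes a processing module and a transceiver module. The processing module may be a processor, and the transceiver module may be an input/output device or a network interface. The physical machine may further include a storage module, which may be a memory. The storage module is configured to store instructions, and the processing module executes the instructions stored in the storage module, so that the physical machine performs the functions of the cloud computing instance in the first aspect or the second aspect.
It can be seen from the foregoing technical solutions that the embodiments of this application have the following advantages:
In the embodiments of this application, because the cloud platform can set up a cloud computing instance with the RoCE function according to the resource application and connect the cloud computing instance to the local network, the cloud computing instance can bypass the operating system to receive tasks from the local network and to access the data corresponding to those tasks in the local network, and can thus process the data corresponding to the tasks in the local network. The cloud platform therefore does not need to build an HPC cluster on the cloud, and the local network does not need to transmit both the tasks and their corresponding data to the cloud platform. This helps improve the efficiency of high-performance computing.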
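Brief Description of the Drawings
To describe the technical solutions of the embodiments of this application more clearly, the following briefly introduces the drawings needed for the description of the embodiments. Obviously, the drawings described below are merely some embodiments of this application.
Figure 1 is a system architecture diagram of the job processing method in an embodiment of this application;
Figure 2 is a flowchart of the job processing method in an embodiment of this application;
Figure 3A is another flowchart of the job processing method in an embodiment of this application;
Figure 3B is a schematic diagram of the logical connection between the local HPC cluster and the cloud computing instances in an embodiment of this application;
Figure 4 is a schematic diagram of one embodiment of the cloud platform in an embodiment of this application;
Figure 5 is a schematic diagram of another embodiment of the cloud platform in an embodiment of this application.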
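Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some, not all, of the embodiments of this application.
The terms "first", "second", "third", "fourth", and so on (if present) in the specification, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments described here can be implemented in an order other than the one illustrated or described here. In addition, the terms "include" and "have" and any variations of them are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units clearly listed, and may include other steps or units that are not clearly listed or that are inherent to the process, method, product, or device.
Embodiments of this application provide a job processing method and a related device, used to configure a cloud computing instance with a RoCE function for a local network so as to improve the efficiency of high-performance computing.
For ease of understanding, the technical terms involved in the embodiments of this application are explained first:
Remote direct memory access (RDMA): a data transmission technology that can quickly move data from the memory of one machine or device to the memory of another, without passing the data through the operating system kernel protocol stack during network transmission and without any impact on the operating system. Common RDMA implementations include the Virtual Interface Architecture, RDMA over Converged Ethernet (RoCE), InfiniBand (IB), and iWARP. RoCE avoids copying data between user space and kernel space and avoids processing data in the kernel protocol stack, which reduces memory and CPU consumption and lowers data transmission latency.
High-performance computing (HPC): the use of aggregated computing power to handle data-intensive computing tasks that a standard workstation cannot complete, including simulation, modeling, and rendering. The high-performance computing in the embodiments of this application is RoCE-based high-performance computing. In the embodiments of this application, the cluster formed by the devices that implement this high-performance computing is called an HPC cluster.
Cloud platform: an entity that provides services based on hardware resources and/or software resources to remote devices. The cloud platform in the embodiments of this application may be a storage-oriented cloud platform that focuses on data storage, a computing-oriented cloud platform that focuses on data processing, or a comprehensive cloud computing platform that handles both computing and data storage.
Cloud computing instance: in the embodiments of this application, a computing resource created by the cloud platform to support task processing of the local network (for example, an HPC cluster).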
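The application scenario and system architecture to which the job processing method proposed in the embodiments of this application applies are introduced below:
The job processing method proposed in the embodiments of this application is mainly applied to the scenario in which an HPC cluster based on a RoCE network applies to a cloud platform for computing resources. As shown in Figure 1, the local network in this scenario (that is, the HPC cluster) mainly includes a task issuing node 101, a task data storage node 102, and multiple local task processing nodes 103. The task issuing node 101 assigns computing tasks to each local task processing node 103 in the HPC cluster. Each local task processing node 103 in the local network has the RoCE function, so each local task processing node 103 can bypass the operating system to access the task data in the task data storage node 102. In addition, the task issuing node 101 in the local network can communicate with the cloud platform 111. When the computing resources in the HPC cluster are insufficient, or the user needs additional computing resources, the task issuing node 101 can apply to the cloud platform 111 for computing resources. However, with current technology, the cloud platform 111 can only provide ordinary computing instances, and such instances cannot meet the resource requirements of an HPC cluster based on a RoCE network.
To address this scenario, the job processing method proposed in the embodiments of this application enables the cloud platform 111 to configure a cloud computing instance 112 with the RoCE function for the HPC cluster, and enables the cloud computing instance 112 with the RoCE function to join the HPC cluster as a local task processing node. Because the cloud computing instance 112 and the local task processing nodes 103 form a new HPC cluster, the task issuing node 101 can assign tasks to the cloud computing instance 112 just as it does to a local task processing node. The task issuing node 101 therefore does not need to transmit both the tasks and the corresponding data to the cloud platform 111, which helps improve the efficiency of high-performance computing.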
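For ease of understanding, the main flow of the job processing method proposed in the embodiments of this application is introduced first. As shown in Figure 2, the cloud platform and the cloud computing instance perform the following steps:
201. The cloud platform receives a resource application.
In this embodiment, when the computing resources of the local network are insufficient or the user needs additional computing resources, the local network can apply to the cloud platform for computing resources, and the cloud platform then receives the resource application from the local network. In other words, the resource application may be triggered by insufficient computing resources in the local network or by the user's resource requirements. The resource application instructs the cloud platform to create a cloud computing instance with the RoCE function, where the RoCE function refers to the ability to access a storage device while bypassing the operating system. The local network may be the HPC cluster introduced in Figure 1.
Optionally, the resource application also contains other configuration information about the cloud computing instance, so that the cloud platform can create, according to the configuration information, a cloud computing instance that not only has the RoCE function but also matches the local network. For details, refer to the description of step 302 below.
202. The cloud platform creates a cloud computing instance with the RoCE function according to the resource application.
The cloud computing instance is a computing resource configured by the cloud platform; it can also be understood as a computing resource created by the cloud platform to support task processing of the local network. Different cloud computing instances can provide different computing capabilities, storage space, network performance, and so on. Specifically, the cloud computing instance may be a bare metal server (BMS), that is, a physical server that is physically isolated from other users' servers and that combines the elasticity of a virtual machine with the performance of a physical machine. The cloud computing instance may also be a virtual machine or a container created by a physical host. This is not specifically limited here. In this embodiment and subsequent embodiments, the description simply refers to a cloud computing instance.
After receiving the resource application, the cloud platform creates a cloud computing instance with the RoCE function based on it. Specifically, the cloud platform may directly create a cloud computing instance with the RoCE function, or it may create an ordinary cloud computing instance and then configure the RoCE function for it. This is not specifically limited here.
203. The cloud platform sets the cloud computing instance with the RoCE function to access the local network.
After creating the cloud computing instance with the RoCE function, the cloud platform also needs to set it to access the local network, so that the local network can communicate with the cloud computing instance.
Optionally, when the local network is the HPC cluster introduced in Figure 1, the local network may include a task issuing node, a task data storage node, and computing nodes. Once the cloud computing instance with the RoCE function is connected to the local network, the cloud computing instance can be considered able to communicate with the task issuing node, the task data storage node, and the computing nodes.
It should be understood that the cloud platform may create only one cloud computing instance with the RoCE function, or several. When the cloud platform creates multiple cloud computing instances with the RoCE function, each of them separately performs steps 204 and 205 below.
204. The cloud computing instance receives a task from the local network.
After the cloud computing instance is connected to the local network, it can receive tasks from the local network; a task here is a high-performance computing task. The task may be sent to the cloud computing instance as a message, an instruction, or in another form. The task carries first indication information, which indicates the data corresponding to the task. Based on the first indication information carried in the task, the cloud computing instance can access the corresponding data located in the local network. For details, refer to the description of step 306 below.
205. The cloud computing instance obtains, according to the task, the task data corresponding to the task from the task data storage node through RDMA, and performs task processing.
In this implementation, because the cloud computing instance is configured with the RoCE function, it can obtain the task data corresponding to the task from the task data storage node through RDMA according to the first indication information carried in the task. The cloud computing instance then performs task processing on the task data. The task processing includes high-performance computing tasks, for example, tasks in high-performance computing scenarios such as supercomputing centers and gene sequencing, or other tasks with large data volumes and high requirements for computing performance, stability, and real-time behavior.
In this embodiment, because the cloud platform can create a cloud computing instance with the RoCE function according to the resource application and connect the cloud computing instance to the local network, the cloud computing instance can bypass the operating system to receive tasks from the local network and to access the data corresponding to those tasks in the local network, and can thus process the data corresponding to the tasks in the local network. The cloud platform therefore does not need to build an HPC cluster on the cloud, and the local network does not need to transmit both the tasks and their corresponding data to the cloud platform. This helps improve the efficiency of high-performance computing.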
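Steps 201 to 205 can be pictured with a small in-memory model. The sketch below is illustrative only: the class names `LocalNetwork`, `CloudInstance`, and `CloudPlatform`, the `need_roce` and `data_key` fields, and the dictionary lookup that stands in for an RDMA read are all assumptions made for this example and do not come from the application.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LocalNetwork:
    """Stand-in for the local network (hypothetical model, not from the source)."""
    task_data: Dict[str, bytes]                       # contents of the task data storage node
    members: List["CloudInstance"] = field(default_factory=list)


@dataclass(eq=False)
class CloudInstance:
    roce_enabled: bool
    network: Optional[LocalNetwork] = None

    def handle_task(self, task: dict) -> int:
        # Steps 204-205: receive a task, fetch its data, process it.
        if not self.roce_enabled or self.network is None:
            raise RuntimeError("instance lacks RoCE or is not attached to a network")
        data = self.network.task_data[task["data_key"]]  # stands in for an RDMA read
        return sum(data)                                  # placeholder "computation"


class CloudPlatform:
    def handle_application(self, application: dict, network: LocalNetwork) -> CloudInstance:
        # Step 201: the application must ask for a RoCE-capable instance.
        if not application.get("need_roce"):
            raise ValueError("application does not request the RoCE function")
        instance = CloudInstance(roce_enabled=True)       # step 202
        instance.network = network                        # step 203
        network.members.append(instance)
        return instance
```

The point of the model is that the task data never leaves `LocalNetwork`: the instance reads it in place once it has been attached, which mirrors the claim that the local network does not have to upload tasks and data to the cloud platform.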
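Based on the foregoing embodiment and with reference to the application scenario and system architecture shown in Figure 1, the job processing method is described further below. As shown in Figure 3A, the nodes in the local network, the cloud platform, and the cloud computing instance perform the following steps:
301. The task issuing node confirms that the number of tasks to be processed exceeds a threshold.
In this embodiment, step 301 is optional.
In this embodiment, the task issuing node has the function of monitoring the number of tasks. Specifically, the task issuing node can count the number of tasks to be processed, which may be the number of tasks that the task issuing node has not yet assigned to local task processing nodes, or the total number of tasks that the task issuing node needs to process within a certain time range. This is not specifically limited here. When the task issuing node detects that the number of tasks to be processed in the HPC cluster exceeds the threshold, it sends a resource application to the cloud platform, and the cloud platform then performs step 302.
In some optional implementations, the task issuing node includes an HPC controller and a bursting controller. The bursting controller monitors the number of tasks to be processed according to the job queue information controlled by the HPC controller; when the number of tasks to be processed reaches the threshold, the bursting controller triggers the step of sending a resource application to the cloud platform. In HPC scenarios, the time range during which the number of tasks to be processed exceeds the threshold is also called the peak demand period, and the process in which the task issuing node applies to the cloud platform for resources during the peak demand period is also called cloud bursting. In this process, the threshold can be set by the task issuing node according to the computing capability of the HPC cluster. This is not specifically limited here.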
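The threshold check performed by the bursting controller can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `BurstingController`, the `need_roce` and `pending` fields, and the use of a strict "greater than" comparison are choices made for this example; the application itself speaks both of "exceeding" and of "reaching" the threshold and does not fix a message format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BurstingController:
    """Hypothetical bursting controller; names are illustrative only."""
    threshold: int                                   # set from the cluster's computing capability
    sent_applications: List[dict] = field(default_factory=list)

    def check(self, pending_tasks: int) -> bool:
        """Record a resource application when the pending count exceeds the threshold."""
        if pending_tasks > self.threshold:
            # Assumed payload: a flag asking for a RoCE-capable instance.
            self.sent_applications.append({"need_roce": True, "pending": pending_tasks})
            return True
        return False
```

In a real deployment `pending_tasks` would come from the HPC controller's job queue, and the recorded dictionary would be replaced by an actual request to the cloud platform.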
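302. The cloud platform receives a resource application.
The resource application instructs the cloud platform to create a cloud computing instance with the RoCE function, where the RoCE function refers to the ability to access a storage device while bypassing the operating system. The cloud computing instance may be a bare metal server, or a virtual machine or container created by a physical host. This is not specifically limited here. Optionally, the resource application includes first identification information, which indicates that the requested cloud computing instance needs to have the RoCE function.
Specifically, the cloud platform may receive the resource application in several different ways:
In one optional implementation, the resource application comes from the task issuing node and is triggered when the task issuing node detects that the number of tasks to be processed exceeds the threshold. That is, after the task issuing node performs step 301, the cloud platform can receive the resource application from the task issuing node.
In another optional implementation, the resource application may be triggered by a requirement predefined by the user. Specifically, the user can purchase or rent the cloud platform's service for configuring cloud computing instances. The cloud platform can provide the user with an interface for configuring cloud computing instances through a client or a web browser; when the user submits the configuration of the required cloud computing instance, the cloud platform receives the resource application from the client or web browser.
In practice, either of the foregoing implementations may be used. This is not specifically limited in this embodiment.
In addition, the resource application also includes configuration information, which indicates the basic configuration of the cloud computing instance required by the HPC cluster. The basic configuration includes the host type, the number and capacity of hard disks, the network interface card type, the application types, and so on that make up the cloud computing instance.
In one optional implementation, the cloud platform includes templates for multiple initial cloud computing instances. Each initial cloud computing instance template has a fixed basic configuration and a template number that uniquely identifies it. In this case, the configuration information in the resource application is a template number. After obtaining the template number from the resource application, the cloud platform knows which kind of cloud computing instance to configure for the HPC cluster.
In another optional implementation, the cloud platform has no initial cloud computing instance templates, or none of the initial cloud computing instance templates in the cloud platform matches the cloud computing instance required by the HPC cluster. In this case, the configuration information in the resource application contains a detailed basic configuration. For example, the configuration information specifies a bare metal server as the host type, two 16 TB hard disks, 128 GB of memory, an ordinary 10G network interface card, a data analysis application, and a data prediction application. The cloud platform can then configure the initial cloud computing instance based on this configuration information.
Optionally, when the resource application indicates that multiple cloud computing instances are needed, the resource application contains the configuration information of each of those cloud computing instances.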
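303. The cloud platform creates a cloud computing instance with the RoCE function according to the resource application.
After receiving the resource application, the cloud platform determines the initial cloud computing instance based on the configuration information. Specifically, the way the cloud platform determines the initial cloud computing instance differs depending on the form of the configuration information.
In one optional implementation, when the configuration information in the resource application is a template number, the cloud platform selects, according to the template number, the initial cloud computing instance corresponding to the template number from among multiple initial cloud computing instances.
In another optional implementation, when the configuration information in the resource application contains a detailed basic configuration, the cloud platform creates the initial cloud computing instance according to that basic configuration.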
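The two ways of determining the initial cloud computing instance (by template number, or from a detailed basic configuration) can be sketched as a single dispatch function. Everything named here is hypothetical: the `BaseConfig` fields are a simplified subset of the basic configuration described above, and the template number `"T-001"` and the returned dictionary layout are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class BaseConfig:
    host_type: str       # e.g. "bare_metal", "vm", "container"
    disks: int           # number of hard disks
    disk_tb: int         # capacity per disk, in TB
    memory_gb: int
    nic: str             # network interface card type


# Hypothetical template catalogue kept by the cloud platform.
TEMPLATES: Dict[str, BaseConfig] = {
    "T-001": BaseConfig("bare_metal", 2, 16, 128, "10G"),
}


def determine_initial_instance(config_info: Union[str, BaseConfig]) -> dict:
    """Return a description of the initial (not yet RoCE-enabled) instance."""
    if isinstance(config_info, str):
        # Case 1: the configuration information is a template number.
        if config_info not in TEMPLATES:
            raise KeyError(f"unknown template number: {config_info}")
        return {"source": "template", "config": TEMPLATES[config_info], "roce": False}
    # Case 2: the configuration information carries a detailed basic configuration.
    return {"source": "created", "config": config_info, "roce": False}
```

The `roce` flag stays `False` in both branches because, in the flow described here, the RoCE function is added to the initial instance in a later step.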
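After creating the initial cloud computing instance, the cloud platform uses a RoCE software package (software RoCE) to set up the RoCE function for the initial cloud computing instance, to obtain the cloud computing instance with the RoCE function.
In one optional implementation, the RoCE software package is pre-stored in a storage device of the cloud platform or in a database managed by the cloud platform. The cloud platform obtains the RoCE software package from the storage device or database and sends it to the initial cloud computing instance. The cloud platform then triggers installation of the RoCE software package in the initial cloud computing instance, to obtain the cloud computing instance with the RoCE function.
In another optional implementation, the RoCE software package may be written into the storage device of the initial cloud computing instance when the cloud platform configures the initial cloud computing instance. When the RoCE function needs to be configured for the initial cloud computing instance, the cloud platform triggers the initial cloud computing instance to install the RoCE software package.
Based on the foregoing two optional implementations, when the cloud platform triggers the initial cloud computing instance to install the RoCE software package, the cloud platform first triggers the boot disk in the cloud computing instance and then copies the RoCE software package to that boot disk. The boot disk is in the boot program list of the computing node and is used to trigger installation of the RoCE software package when the operating system in the cloud computing instance starts.
This implementation proposes a specific way to configure the RoCE function for an initial cloud computing instance that has no RoCE network interface card. By installing the RoCE software package, the initial cloud computing instance emulates a RoCE network interface card by running RoCE software. Because the cloud computing instance can use an ordinary network interface card and does not need a RoCE network interface card, the configuration cost of the cloud computing instance is reduced and the implementability of the solution is improved.
In addition, the cloud platform also configures intermediate adaptation software in the cloud computing instance with the RoCE function. The intermediate adaptation software is used to change the calling mode of the applications in the cloud computing instance to remote direct memory access (RDMA) calls.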
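304. The cloud platform sets the cloud computing instance with the RoCE function to access the local network.
In this embodiment, after the cloud platform has configured the cloud computing instance with the RoCE function, the cloud platform needs to set that cloud computing instance to access the local network (that is, the local HPC cluster) so that the cloud computing instance can serve the local HPC cluster. A task issuing node and a task data storage node are provided in the local network.
Specifically, in addition to the configuration information, the resource application also includes HPC cluster information. The HPC cluster information indicates the addresses of the nodes in the HPC cluster and how the HPC cluster is connected to the gateway, so that the cloud platform can connect the cloud computing instance with the RoCE function to the local HPC cluster according to the HPC cluster information.
Optionally, the HPC cluster information includes: the Internet Protocol (IP) address of the task issuing node, the port number of the task issuing node, user identification information (also called the tenant ID), the universally unique identifier (UUID) of the bridging device, and the port number of the bridging device. The bridging device is a layer 2 bridge (level 2 bridge, L2BR) that supports the RoCE protocol.
Optionally, when the cloud computing instance is a virtual machine, the HPC cluster information also includes a virtual local area network (VLAN) range.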
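The HPC cluster information listed above maps naturally onto a small record type that the cloud platform could validate before attaching an instance. The sketch below is an assumption-laden illustration: the field names, the requirement that a VLAN range be present for virtual machines (the application describes it as optional), and the 1 to 4094 VLAN ID bounds (the usable range under IEEE 802.1Q) are choices made for this example. Only the standard-library `ipaddress` and `uuid` modules are used.

```python
import ipaddress
import uuid
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class HpcClusterInfo:
    issuer_ip: str                                   # IP address of the task issuing node
    issuer_port: int                                 # port number of the task issuing node
    tenant_id: str                                   # user identification information
    bridge_uuid: str                                 # UUID of the L2BR bridging device
    bridge_port: int                                 # port number of the bridging device
    vlan_range: Optional[Tuple[int, int]] = None     # used when the instance is a VM

    def validate(self, instance_is_vm: bool) -> None:
        ipaddress.ip_address(self.issuer_ip)         # raises ValueError if malformed
        uuid.UUID(self.bridge_uuid)                  # raises ValueError if malformed
        if not 0 < self.issuer_port < 65536:
            raise ValueError(f"port out of range: {self.issuer_port}")
        if instance_is_vm:
            # Assumption for this sketch: VM attachment needs a VLAN range.
            if self.vlan_range is None:
                raise ValueError("a VLAN range is expected for virtual machines")
            low, high = self.vlan_range
            if not 1 <= low <= high <= 4094:
                raise ValueError("VLAN IDs must lie in 1..4094")
```

Validating the record up front keeps malformed cluster information from reaching the step in which the instance is actually bridged into the local network.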
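Specifically, the cloud platform connects the cloud computing instance to the local HPC cluster according to the HPC cluster information. The task issuing node and the multiple local task processing nodes are connected through a first switch, the multiple cloud computing instances configured by the cloud platform are connected through a second switch, and the first switch and the second switch are connected to the bridging device through gateways. The bridging device may be the layer 2 bridge L2BR mentioned above. Taking Figure 3B as an example, the local HPC cluster includes local task processing node 1 (node1), local task processing node 2 (node2), and local task processing node 3 (node3), each of which is configured with a RoCE network interface controller (NIC), that is, a RoCE network interface card. The multiple local task processing nodes, the task issuing node (master node), and the task data storage node in the local HPC cluster are connected through the first switch. In addition, the cloud platform has created cloud computing instance 1 (node1'), cloud computing instance 2 (node2'), and cloud computing instance 3 (node3'). Each cloud computing instance is configured with an ordinary network interface card, but each runs RoCE software that emulates a RoCE network interface card to provide its functions. The multiple cloud computing instances are connected through the second switch. The first switch and the second switch are connected to the layer 2 bridge L2BR through gateways. The multiple cloud computing instances thus join the local network (that is, the local HPC cluster) as task processing nodes.
It should be understood that Figure 3B is only a logical connection diagram, and some physical gateways are not shown.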
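305. The cloud platform sends a first notification to the task issuing node.
In this embodiment, step 305 is optional.
The first notification indicates that the cloud computing instance has been configured and has been connected to the local network (that is, the local HPC cluster). After receiving the first notification, the task issuing node performs step 306.
In addition, when the cloud platform does not perform step 305, the task issuing node can detect idle computing resources in the HPC cluster. Because a cloud computing instance that has just been connected to the local network is idle, the task issuing node performs step 306 when it detects idle computing resources.
306. The task issuing node sends a task to the cloud computing instance.
In this embodiment, after receiving the first notification, the task issuing node can send to the cloud computing instance a task in the task queue that has not been assigned to a local task processing node.
The task carries first indication information, which indicates the data corresponding to the task. Based on the first indication information carried in the task, the cloud computing instance can access the corresponding data located in the local network.
Optionally, the first indication information includes a task identifier and/or the address of the data corresponding to the task.
The task identifier uniquely identifies a task in the task queue. For example, the task identifier may be a queue sequence number or another character or character string. The data corresponding to the task carries the same task identifier. For example, the data corresponding to the task is stored in the task data storage node, and the head of the memory block that holds the data contains the task identifier. In this implementation, after obtaining the task identifier from the received task, the cloud computing instance can traverse the memory in the task data storage node. When it detects that the head of a memory block contains the task identifier, the cloud computing instance can read the data in that memory block and thereby obtain the data corresponding to the task.
In addition, the first indication information may include the address of the data corresponding to the task. The address may be a physical address or a logical address; it may be the address of a memory block in the task data storage node, or the address of a memory block in a local task processing node of the local HPC cluster. This is not specifically limited here. In this implementation, the cloud computing instance can obtain the data corresponding to the task directly from that address based on the first indication information.
Optionally, when the first indication information includes the address of the data corresponding to the task, it may also include a processing result address. The processing result address is used to store the result obtained after the cloud computing instance processes the data corresponding to the task. The address of the data corresponding to the task and the processing result address may be addresses in the same node or device. For example, the address of the data corresponding to the task indicates one memory block in the task data storage node, and the processing result address indicates another memory block in the task data storage node. The two addresses may also be in different nodes or devices. For example, the address of the data corresponding to the task indicates a memory block in the task data storage node, and the processing result address indicates a memory block in a local task processing node. This is not specifically limited here.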
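The two lookup variants for the first indication information (scanning memory-block headers for a task identifier, or reading directly from a given address) can be contrasted in a short sketch. The types `MemoryBlock` and `FirstIndication`, and the in-memory list that stands in for the task data storage node, are hypothetical; a real cloud computing instance would perform these reads over RDMA rather than on local Python objects.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MemoryBlock:
    address: int
    header_task_id: str      # task identifier stored at the head of the block
    payload: bytes


@dataclass
class FirstIndication:
    task_id: Optional[str] = None
    data_address: Optional[int] = None
    result_address: Optional[int] = None   # where the processing result would be written


def locate_task_data(ind: FirstIndication, blocks: List[MemoryBlock]) -> bytes:
    if ind.data_address is not None:
        # Address variant: go straight to the block at that address.
        by_address: Dict[int, MemoryBlock] = {b.address: b for b in blocks}
        return by_address[ind.data_address].payload
    if ind.task_id is not None:
        # Identifier variant: traverse the blocks and compare headers.
        for block in blocks:
            if block.header_task_id == ind.task_id:
                return block.payload
        raise LookupError(f"no block tagged with task {ind.task_id!r}")
    raise ValueError("first indication information is empty")
```

The address variant avoids the traversal entirely, which is one reason a task issuing node might prefer to include the data address when it is known.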
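307. The cloud computing instance obtains, according to the task, the task data corresponding to the task from the task data storage node through remote direct memory access (RDMA), and performs task processing.
In this embodiment, after receiving the task, the cloud computing instance obtains the task data corresponding to the task from the task data storage node through RDMA according to the first indication information carried in the task. Optionally, if the task needs data produced by a particular computing node, the cloud computing instance can also bypass the operating system to access the data in the storage device of that computing node. Specifically, the way the cloud computing instance obtains the task data through RDMA varies with the first indication information. For details, refer to the description of step 306, which is not repeated here.
The cloud computing instance then performs task processing on the task data. The task processing includes high-performance computing tasks, for example, tasks in high-performance computing scenarios such as supercomputing centers and gene sequencing, or other tasks with large data volumes and high requirements for computing performance, stability, and real-time behavior.
Optionally, the task data storage node may be independent of the computing nodes in the local HPC cluster; for example, the task data storage node is a database in the local HPC cluster. The task data storage node may also be integrated with the task issuing node. For example, the task data storage node may be located in the task issuing node as a storage device of the task issuing node. In this case, the task data storage node can store not only the data corresponding to the tasks but also the task queue formulated by the task issuing node.
In this embodiment, after the cloud computing instance with the RoCE function configured by the cloud platform is connected to the local network, the cloud computing instance joins the local HPC cluster as a computing node. The cloud computing instance can therefore receive tasks from the task issuing node and can bypass the operating system to access the data corresponding to a task in the task data storage node, and can then process that data. The task issuing node therefore does not need to transmit both the tasks and the corresponding data to the cloud platform, which helps improve the efficiency of high-performance computing.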
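Based on the foregoing implementation, as shown in Figure 3A, when the local HPC cluster no longer needs the cloud computing instance, the local HPC cluster revokes it through the following steps.
308. The task issuing node sends a resource revocation request to the cloud platform.
The resource revocation request may request revocation of all the cloud computing instances that have been applied for from the cloud platform, or of only one particular cloud computing instance.
When the resource revocation request requests revocation of all the cloud computing instances that have been applied for from the cloud platform, the request may carry only the user's identification information. When the resource revocation request requests revocation of one particular cloud computing instance, the request contains second identification information, which indicates the cloud computing instance to be revoked. The second identification information may be the template number corresponding to the cloud computing instance or a character string that uniquely identifies it. Optionally, the second identification information may be set by the cloud platform when it creates the cloud computing instance, or carried by the task issuing node in the resource application. This is not specifically limited here.
It should be understood that the task issuing node can perform step 308 when it confirms that the tasks of the HPC cluster have been completed. The task issuing node can also trigger step 308 if the user no longer needs the cloud computing instance, for example, because the user's lease has expired or the user has suspended the lease of the cloud computing instance.
309. The cloud platform revokes the cloud computing instance according to the resource revocation request.
After receiving the resource revocation request, the cloud platform revokes, according to the second identification information in the request, the one or more cloud computing instances corresponding to the second identification information.
Optionally, after revoking the cloud computing instance, the cloud platform also sends a second notification to the task issuing node. The second notification indicates that the cloud computing instance corresponding to the second identification information has been revoked.
In this embodiment, when the task has been completed or does not need to be executed, the task issuing node can send a resource revocation request to the cloud platform, so that the cloud platform revokes the cloud computing instance according to the request. In such a solution, the task issuing node can revoke the cloud computing instance, and the cloud platform can also allocate the cloud computing instance to other clusters. The cloud computing instances configured by the cloud platform thus become more flexible, which helps improve their utilization.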
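The revocation logic of steps 308 and 309 distinguishes a request that names specific instances from one that carries only the user's identification. A minimal sketch follows; the `Inventory` mapping, the `revoke` function, and the instance identifiers are hypothetical, and the tenant ownership check is an added safeguard that the application does not spell out.

```python
from typing import Dict, List, Optional

# instance id -> tenant id; a stand-in for the platform's inventory (hypothetical).
Inventory = Dict[str, str]


def revoke(inventory: Inventory, tenant_id: str,
           second_ids: Optional[List[str]] = None) -> List[str]:
    """Remove instances from the inventory and return the revoked ids.

    With no second identification information, every instance owned by the
    tenant is revoked; otherwise only the listed instances owned by the tenant.
    """
    if second_ids is None:
        targets = [i for i, t in inventory.items() if t == tenant_id]
    else:
        targets = [i for i in second_ids if inventory.get(i) == tenant_id]
    for instance_id in targets:
        inventory.pop(instance_id, None)   # released for allocation to other clusters
    return targets
```

The returned list corresponds to the content of the optional second notification, which tells the task issuing node which cloud computing instances have been revoked.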
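The devices involved in the job processing method in the embodiments of this application are introduced below:
Figure 4 is a schematic structural diagram of a cloud platform 40 provided in an embodiment of this application. The cloud platform 40 may be a server, a large computing device, or a large management device. This is not specifically limited here. The cloud platform in the method embodiments corresponding to Figure 2 and Figure 3A may be based on the structure of the cloud platform 40 shown in Figure 4.
The cloud platform 40 includes at least one processor 401 and at least one memory 402. It should be understood that Figure 4 shows only one processor 401 and one memory 402.
The processor 401 may be a general-purpose central processing unit (CPU), a microprocessor, a network processor (NP), an application-specific integrated circuit, or one or more integrated circuits used to control the execution of the programs of the solution of this application. The processor 401 may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. The processor 401 may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions). In addition, the processor 401 may be a separate semiconductor chip, or it may be integrated with other circuits into one semiconductor chip. For example, it may form a system-on-a-chip (SoC) with other circuits (such as codec circuits, hardware acceleration circuits, or various bus and interface circuits), or it may be integrated into an application-specific integrated circuit (ASIC) as a built-in processor of the ASIC. The ASIC with the integrated processor may be packaged separately or together with other circuits.
In addition, the memory 402 may be a read-only memory (ROM), another type of static storage device that can store static information and instructions, a random access memory (RAM), another type of dynamic storage device that can store information and instructions, or an electrically erasable programmable read-only memory (EEPROM). This is not specifically limited here. The memory 402 may exist independently and be connected to the processor 401. Optionally, the memory 402 may also be integrated with the processor 401, for example, within one or more chips.
In addition, the memory 402 is also used to store the program code for executing the technical solutions of the embodiments of this application. The program code can be executed under the control of the processor 401, and the various kinds of computer program code executed may also be regarded as drivers of the processor 401. The processor 401 can thus analyze a received resource application, set up a cloud computing instance with the RoCE function according to the resource application, and set the cloud computing instance to access the local network. Optionally, the processor 401 can also create an initial cloud computing instance and configure the RoCE function for it. Optionally, the processor 401 can also analyze a resource revocation request and revoke, according to the request, the cloud computing instance configured for the HPC cluster.
Optionally, the cloud platform 40 further includes a communication interface 403, which is used to communicate with other servers or network devices so that the cloud platform can receive instructions or data from other devices. For example, the communication interface 403 can receive a resource application or a resource revocation request from a task transceiver device. The communication interface 403 is also used to send data or instructions to other devices. For example, the communication interface 403 can send a RoCE software package to an initial cloud computing instance, so that the initial cloud computing instance can install the RoCE software program according to the RoCE software package.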
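Figure 5 is a schematic structural diagram of a cloud platform 50 provided in an embodiment of this application. The cloud platform 50 may be a server, a large computing device, or a large management device. This is not specifically limited here. The cloud platform in the method embodiments corresponding to Figure 2 and Figure 3A may be based on the structure of the cloud platform 50 shown in Figure 5.
The cloud platform 50 includes multiple functional modules. The functional modules may be integrated in one processing unit, each module may exist physically on its own, or two or more modules may be integrated in one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
Specifically, the cloud platform 50 includes:
a receiving module 501, configured to receive a resource application, where the resource application instructs the cloud platform to create a cloud computing instance with a remote direct memory access over Converged Ethernet (RoCE) function; and
a resource configuration module 502, configured to set up a cloud computing instance with the RoCE function according to the resource application and to set the cloud computing instance to access the local network. A task issuing node and a task data storage node are provided in the local network. The task issuing node is configured to send tasks to the cloud computing instance; the task data storage node is configured to provide the task data corresponding to a task to the cloud computing instance through remote direct memory access (RDMA).
In this embodiment, because the cloud platform 50 can create a cloud computing instance with the RoCE function according to the resource application and connect the cloud computing instance to the local network, the cloud computing instance can bypass the operating system to receive tasks from the local network and to access the data corresponding to those tasks in the local network, and can thus process the data corresponding to the tasks in the local network. The cloud platform therefore does not need to build an HPC cluster on the cloud, and the local network does not need to transmit both the tasks and their corresponding data to the cloud platform. This helps improve the efficiency of high-performance computing.
In another optional implementation, the receiving module 501 is specifically configured so that, when the task issuing node confirms that the number of tasks to be processed exceeds a threshold, the transceiver receives the resource application from the task issuing node.
In another optional implementation, the receiving module 501 is further configured to receive a resource revocation request from the task issuing node, where the resource revocation request indicates that the task has been completed or does not need to be executed; the resource configuration module 502 is further configured to revoke the cloud computing instance according to the resource revocation request.
In another optional implementation, the resource configuration module 502 is specifically configured to: obtain a RoCE software package; control the transceiver to send the RoCE software package to an initial cloud computing instance; and trigger installation of the RoCE software package in the initial cloud computing instance, to obtain the cloud computing instance with the RoCE function.
In another optional implementation, the resource configuration module 502 is further configured to: create an initial cloud computing instance according to the configuration information; or search, according to the configuration information, among multiple initial cloud computing instances for the initial cloud computing instance corresponding to the configuration information.
In the foregoing embodiments, each functional module in the cloud platform 50 may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, each functional module in the cloud platform 50 may be implemented in whole or in part in the form of a computer program product. In this case, if each functional module in the cloud platform 50 is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. For example, the computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage media include a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and modules described above can refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
The foregoing embodiments are only used to illustrate the technical solutions of this application and not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of the technical features, and that these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.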

Claims (19)

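  1. A job processing method, characterized by comprising:
    a cloud platform receiving a resource application, the resource application being used to instruct the cloud platform to create a cloud computing instance with a remote direct memory access over Converged Ethernet (RoCE) function;
    the cloud platform creating a cloud computing instance with the RoCE function according to the resource application, and setting the cloud computing instance to access a local network.
  2. The method according to claim 1, characterized in that a task issuing node and a task data storage node are provided in the local network, and the method further comprises:
    the cloud computing instance receiving a task sent by the task issuing node;
    the cloud computing instance obtaining, according to the task, the task data corresponding to the task from the task data storage node through remote direct memory access (RDMA), and performing task processing.
  3. The method according to claim 1 or 2, characterized in that, before the cloud platform receives the resource application, the method further comprises:
    the task issuing node confirming that the number of tasks to be processed exceeds a threshold.
  4. The method according to any one of claims 1 to 3, characterized in that the method further comprises:
    the cloud platform receiving a resource revocation request sent by the task issuing node, the resource revocation request being used to indicate that the task has been completed or does not need to be executed;
    the cloud platform revoking the cloud computing instance according to the resource revocation request.
  5. The method according to any one of claims 1 to 4, characterized in that the cloud platform creating a cloud computing instance with the RoCE function according to the resource application comprises:
    the cloud platform obtaining a RoCE software package;
    the cloud platform sending the RoCE software package to an initial cloud computing instance, and triggering installation of the RoCE software package in the initial cloud computing instance, to obtain the cloud computing instance with the RoCE function.
  6. The method according to claim 5, characterized in that the resource application includes configuration information of the cloud computing instance;
    the method further comprises:
    the cloud platform creating an initial cloud computing instance according to the configuration information;
    or,
    the cloud platform selecting, according to the configuration information, the cloud computing instance corresponding to the configuration information from among multiple cloud computing instances.
  7. The method according to any one of claims 1 to 6, characterized in that the cloud computing instance is a virtual machine, a container, or a bare metal server.
  8. A cloud platform, characterized by comprising:
    a receiving module, configured to receive a resource application, the resource application being used to instruct the cloud platform to create a cloud computing instance with a remote direct memory access over Converged Ethernet (RoCE) function;
    a resource configuration module, configured to create a cloud computing instance with the RoCE function according to the resource application, and to set the cloud computing instance to access a local network.
  9. The cloud platform according to claim 8, characterized in that
    the receiving module is further configured to receive a resource revocation request sent by the task issuing node, the resource revocation request being used to indicate that the task has been completed or does not need to be executed;
    the resource configuration module is further configured to revoke the cloud computing instance according to the resource revocation request.
  10. The cloud platform according to claim 8 or 9, characterized in that the resource configuration module is specifically configured to:
    obtain a RoCE software package;
    send the RoCE software package to an initial cloud computing instance, and trigger installation of the RoCE software package in the initial cloud computing instance, to obtain the cloud computing instance with the RoCE function.
  11. The cloud platform according to claim 10, characterized in that the resource application includes configuration information of the cloud computing instance;
    the resource configuration module is further configured to:
    create an initial cloud computing instance according to the configuration information;
    or,
    select, according to the configuration information, the cloud computing instance corresponding to the configuration information from among multiple cloud computing instances.
  12. A cloud platform, characterized by comprising a processor and a memory, the memory storing program instructions, and the processor executing the program instructions to implement the method according to any one of claims 1 and 4 to 7.
  13. A job processing system, characterized by comprising:
    the cloud platform, configured to receive a resource application, create a cloud computing instance with the RoCE function according to the resource application, and set the cloud computing instance to access a local network, the resource application being used to instruct the cloud platform to create a cloud computing instance with a remote direct memory access over Converged Ethernet (RoCE) function;
    the cloud computing instance, configured to receive a task from the local network, obtain the task data corresponding to the task in the local network through remote direct memory access (RDMA), and perform task processing.
  14. The job processing system according to claim 13, characterized in that a task issuing node and a task data storage node are provided in the local network;
    the cloud computing instance is specifically configured to receive a task sent by the task issuing node, obtain, according to the task, the task data corresponding to the task from the task data storage node through remote direct memory access (RDMA), and perform task processing.
  15. The job processing system according to claim 13 or 14, characterized in that the task issuing node is further configured to confirm that the number of tasks to be processed exceeds a threshold.
  16. The job processing system according to any one of claims 13 to 15, characterized in that
    the cloud platform is further configured to receive a resource revocation request sent by the task issuing node, the resource revocation request being used to indicate that the task has been completed or does not need to be executed;
    the cloud platform is further configured to revoke the cloud computing instance according to the resource revocation request.
  17. The job processing system according to any one of claims 13 to 16, characterized in that the cloud platform is specifically configured to:
    obtain a RoCE software package;
    send the RoCE software package to an initial cloud computing instance, and trigger installation of the RoCE software package in the initial cloud computing instance, to obtain the cloud computing instance with the RoCE function.
  18. The job processing system according to claim 17, characterized in that the resource application includes configuration information of the cloud computing instance;
    the cloud platform is further configured to:
    create an initial cloud computing instance according to the configuration information;
    or,
    select, according to the configuration information, the cloud computing instance corresponding to the configuration information from among multiple cloud computing instances.
  19. The job processing system according to any one of claims 13 to 18, characterized in that the cloud computing instance is a virtual machine, a container, or a bare metal server.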
PCT/CN2021/091717 2020-06-22 2021-04-30 Job processing method and related device WO2021258861A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010573552.7A CN113900791A (zh) 2020-06-22 2020-06-22 Job processing method and related device
CN202010573552.7 2020-06-22

Publications (1)

Publication Number Publication Date
WO2021258861A1 true WO2021258861A1 (zh) 2021-12-30

Family

ID=79186219

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/091717 WO2021258861A1 (zh) 2020-06-22 2021-04-30 Job processing method and related device

Country Status (2)

Country Link
CN (1) CN113900791A (zh)
WO (1) WO2021258861A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115426259A (zh) * 2022-08-29 2022-12-02 浪潮电子信息产业股份有限公司 Network access control method, apparatus, device, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107113298A (zh) * 2014-12-29 2017-08-29 Nicira股份有限公司 Method for providing multi-tenancy support for RDMA
US20190101974A1 (en) * 2017-09-29 2019-04-04 Intel Corporation Techniques to predict memory bandwidth demand for a memory device
CN110063051A (zh) * 2016-12-13 2019-07-26 亚马逊技术股份有限公司 Reconfigurable server
CN111221758A (zh) * 2019-09-30 2020-06-02 华为技术有限公司 Method and computer device for processing remote direct memory access requests

Also Published As

Publication number Publication date
CN113900791A (zh) 2022-01-07

Similar Documents

Publication Publication Date Title
US11500670B2 (en) Computing service with configurable virtualization control levels and accelerated launches
JP6224846B2 (ja) プロバイダ定義インターフェイスを介したクライアント構内リソース制御
US9264296B2 (en) Continuous upgrading of computers in a load balanced environment
US7502850B2 (en) Verifying resource functionality before use by a grid job submitted to a grid environment
WO2017157156A1 (zh) 一种用户请求的处理方法和装置
JP5132770B2 (ja) 最善のdhcpサーバを見出すためのルータの動的な構成
WO2020233120A1 (zh) 一种调度方法、装置及相关设备
JP5644150B2 (ja) サービス提供システム、仮想マシンサーバ、サービス提供方法及びサービス提供プログラム
WO2016155394A1 (zh) 一种虚拟网络功能间链路建立方法及装置
US20080025297A1 (en) Facilitating use of generic addresses by network applications of virtual servers
US20130036213A1 (en) Virtual private clouds
US9910687B2 (en) Data flow affinity for heterogenous virtual machines
US10992526B1 (en) Hyper-converged infrastructure networking configuration system
WO2013086861A1 (zh) 一种多路径访问i/o设备的方法、i/o多路径管理器及系统
US20200067838A1 (en) Layer 2 load balancing system
WO2021258861A1 (zh) 一种作业处理方法以及相关设备
TW201426553A (zh) 虛擬機管理系統及方法
US11005782B2 (en) Multi-endpoint adapter/multi-processor packet routing system
US11838149B2 (en) Time division control of virtual local area network (vlan) to accommodate multiple virtual applications
US11474827B1 (en) Reboot migration between bare-metal servers
WO2022089291A1 (zh) 一种数据流镜像方法及装置
JP7212158B2 (ja) プロバイダネットワークサービス拡張
WO2021179556A1 (zh) 一种存储系统和请求处理方法以及交换机
US20230171189A1 (en) Virtual network interfaces for managed layer-2 connectivity at computing service extension locations
WO2022141293A1 (zh) 一种弹性伸缩的方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21829345

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21829345

Country of ref document: EP

Kind code of ref document: A1