WO2024088263A1 - Heterogeneous server system and method of using same - Google Patents

Heterogeneous server system and method of using same

Info

Publication number
WO2024088263A1
Authority
WO
WIPO (PCT)
Prior art keywords
switch
computing
port
service
node
Prior art date
2022-10-25
Application number
PCT/CN2023/126246
Other languages
English (en)
French (fr)
Inventor
李志兵
Original Assignee
杭州阿里云飞天信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-10-25
Filing date
2023-10-24
Publication date
Application filed by 杭州阿里云飞天信息技术有限公司
Publication of WO2024088263A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5072Grid computing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure relates to a hardware computing device for artificial intelligence, and in particular to a heterogeneous server system that provides composite services for artificial intelligence.
  • the training and inference services of artificial intelligence models usually require different computing capabilities, so different heterogeneous (for example, central processing unit (CPU) + graphics processing unit (GPU)) servers are usually designed to meet the different service requirements.
  • mainstream training services usually use GPU training servers or OAM (Open Application Model)-based UBB (Universal Baseboard) boards to provide computing power, while inference services usually use single-GPU-card server models.
  • the heterogeneous server hardware designed for training services (for example, the ratio of CPUs, GPUs, and memory) does not match the demands of inference services. If a training server is used for inference services, the CPU and GPU computing power cannot be fully utilized, which results in wasted computing power.
  • moreover, current training servers and inference servers cannot switch flexibly between training, inference, and other services, so they cannot follow the peaks and troughs of training and inference demand or fully schedule GPU computing power to match service needs. Therefore, in order to meet training and inference service needs at the same time, users currently usually have to purchase both training servers and inference servers. However, this easily wastes the computing power of both servers during the troughs of their respective service demands.
  • a technical problem to be solved by the present disclosure is to provide a heterogeneous server system that can provide at least two artificial intelligence services efficiently.
  • a heterogeneous server system comprising: a first computing node configured to provide a first service; a second computing node configured to provide a second service; and a computing resource node, comprising a switch and a computing processing unit connected to the switch.
  • the computing processing unit is used to perform at least part of the computing tasks of the first service or the second service.
  • the switch is connected to the first computing node and the second computing node and can switch between a first state and a second state, wherein in the first state the switch connects the computing processing unit to the first computing node, and in the second state the switch connects the computing processing unit to the second computing node.
  • the switch is a PCIe (Peripheral Component Interconnect Express) switch
  • the computing processing unit is connected to a downstream port of the switch via a PCIe cable
  • the first computing node and the second computing node are respectively connected to a first port and a second port of the switch via a PCIe cable, and in the first state the first port is set as an upstream port of the switch and the second port is closed, while in the second state the second port is set as an upstream port of the switch and the first port is closed.
  • the computing resource node further includes a baseboard management controller, and the switching of the switch between the first state and the second state is achieved by the baseboard management controller changing the firmware of the switch.
  • the computing resource node further includes a baseboard management controller
  • the switch further includes an internal processor.
  • the switching of the switch between the first state and the second state is achieved as follows: the switch is configured to enable the baseboard management controller to communicate with the internal processor; the internal processor obtains and saves the PCIe topology of the downstream port; in the first state, the baseboard management controller configures the first port as the upstream port of the switch and closes the second port, and the switch provides the PCIe topology to the first computing node; and in the second state, the baseboard management controller configures the second port as the upstream port of the switch and closes the first port, and the switch provides the PCIe topology to the second computing node.
  • the heterogeneous server system includes multiple second computing nodes
  • the computing resource node includes multiple switches and multiple computing processing units, wherein the multiple computing processing units are divided into multiple groups, and each group is respectively connected to one of the multiple switches.
  • the first computing node is connected to at least two of the multiple switches.
  • Each of the multiple second computing nodes is connected to at least one of the multiple switches, and the number of switches connected to the second computing node is calculated based on the number of computing processing units required for the second service and the connection architecture between the computing processing units and the switches.
  • the number of switches connected to the first computing node is greater than the number of switches connected to the second computing node.
  • the computing processing unit is a GPU; and/or the first computing node and the second computing node each include a CPU and memory; and/or the first service is an artificial intelligence training service; and/or the second service is an artificial intelligence inference service.
  • the computing resource node also includes: a first interface for connecting the first computing node to the first port of the switch; and/or a second interface for connecting the second computing node to the second port of the switch; and/or a storage interface for connecting a storage device to a third port of the switch.
  • a method for performing computing tasks using the heterogeneous server system comprising: determining whether a computing processing unit connected to a switch is used for a first service or a second service; in a case where the computing processing unit connected to the switch is used for the first service, configuring the switch to a first state; and in a case where the computing processing unit connected to the switch is used for the second service, configuring the switch to a second state.
  • the switch is a PCIe switch
  • the computing processing unit is connected to the downstream port of the switch via a PCIe cable
  • the first computing node and the second computing node are respectively connected to the first port and the second port of the switch via a PCIe cable
  • the computing resource node further includes a baseboard management controller.
  • the step of configuring the switch to the first state includes: setting the firmware of the switch to a first firmware through the baseboard management controller, where the first firmware sets the first port as the upstream port of the switch so as to connect to the downstream port, and closes the second port; and/or the step of configuring the switch to the second state includes: setting the firmware of the switch to a second firmware through the baseboard management controller, where the second firmware sets the second port as the upstream port of the switch so as to connect to the downstream port, and closes the first port.
  • the switch is a PCIe switch
  • the computing processing unit is connected to the downstream port of the switch via a PCIe cable
  • the first computing node and the second computing node are respectively connected to the first port and the second port of the switch via a PCIe cable
  • the computing resource node further includes a baseboard management controller
  • the switch further includes an internal processor.
  • the step of configuring the switch to the first state includes: configuring the switch so that the baseboard management controller communicates with the internal processor, obtaining and saving the PCIe topology of the downstream port through the internal processor, and configuring the first port as the upstream port of the switch and closing the second port through the baseboard management controller, and providing the PCIe topology to the first computing node through the switch; and/or, the step of configuring the switch to the second state includes: configuring the switch so that the baseboard management controller communicates with the internal processor, obtaining and saving the PCIe topology of the downstream port through the internal processor, and configuring the second port as the upstream port of the switch and closing the first port through the baseboard management controller, and providing the PCIe topology to the second computing node through the switch.
  • the heterogeneous server system includes a plurality of second computing nodes
  • the computing resource node includes a plurality of switches and a plurality of computing processing units
  • the plurality of computing processing units are divided into a plurality of groups, and each group is respectively connected to one of the plurality of switches
  • the first computing node is connected to at least two of the plurality of switches
  • each of the plurality of second computing nodes is connected to at least one of the plurality of switches
  • the number of switches connected to the second computing node is calculated based on the number of computing processing units required for the second service and the connection architecture between the computing processing units and the switches
  • the number of switches connected to the first computing node is greater than the number of switches connected to the second computing node.
  • the step of determining whether the computing processing unit connected to the switch is used for the first service or the second service includes: determining the number of computing processing units used for the first service and the second service respectively among the plurality of computing processing units according to the number of the first service and the second service to be provided by the heterogeneous server system, and allocating the plurality of computing processing units to the first service and the second service respectively according to the connection architecture between the computing processing unit and the switch, the connection architecture between the first computing node, the second computing node and the switch, and the number of computing processing units required for each service.
  • a computing device comprising: a processor; and a memory on which executable code is stored, and when the executable code is executed by the processor, the processor executes the method described in the second aspect above.
  • a computer program product comprising an executable code, which, when executed by a processor of an electronic device, causes the processor to execute the method described in the second aspect above.
  • a non-transitory machine-readable storage medium on which executable code is stored; when the executable code is executed by a processor of an electronic device, the processor executes the method described in the second aspect above.
  • the present disclosure merges at least two kinds of service nodes and a computing resource node into an integral composite physical machine model through hybrid networking, so it can provide at least two services, and it can use the flexible switching scheme of the switch to improve the utilization of the computing power of the computing resource nodes, effectively improving total cost of ownership (TCO) benefits.
  • FIG. 1 shows a schematic block diagram of a heterogeneous server system according to an embodiment of the present disclosure.
  • FIG. 2 shows a schematic flow chart of a method for using a heterogeneous server system according to an embodiment of the present disclosure.
  • FIG. 3 shows a schematic block diagram of a specific example of a heterogeneous server system according to an embodiment of the present disclosure.
  • FIG. 4 shows a schematic diagram of the structure of a computing device according to an embodiment of the present disclosure.
  • the present invention realizes flexible allocation of computing power of a computing processing unit by connecting at least two computing nodes providing different services to the computing processing unit via a switch, thereby efficiently utilizing computing resources (i.e., computing processing units) to provide at least two different services.
  • FIG. 1 is a schematic block diagram showing a basic architecture of a heterogeneous server system according to an embodiment of the present disclosure.
  • the heterogeneous server system 100 includes a first computing node 110, a second computing node 120 and a computing resource node 130.
  • the computing resource node 130 includes a switch 140 and a computing processing unit 150.
  • the first computing node 110, the second computing node 120 and the computing processing unit 150 are connected to three ports 1-3 of the switch 140 respectively.
  • the solid line connecting ports 1 and 3 in the switch 140 shown in FIG. 1 schematically represents the first state of the switch 140, which connects the computing processing unit 150 to the first computing node 110.
  • the dotted line connecting ports 2 and 3 shown in FIG. 1 schematically represents the second state of the switch 140, which connects the computing processing unit 150 to the second computing node 120.
  • the switch 140 can switch between the first state and the second state, thereby selecting whether the computing processing unit 150 is connected to the first computing node 110 or to the second computing node 120.
  • the structure of the switch 140 shown in the figure is only a simple illustration of its function and does not represent the physical structure of the switch disclosed in the present invention; the connections represented by all the lines in the figure are not limited to direct physical connections, but may also include indirect connections via intermediate interfaces, or wireless connections, etc.
  • the first computing node 110 and the second computing node 120 are configured to provide a first service and a second service, respectively, such as a training service and an inference service of artificial intelligence.
  • the first computing node 110 and the second computing node 120 may be general-purpose computers or servers, both of which may include a CPU and a memory to perform the operation of the first/second service. Since the services of artificial intelligence generally require higher computing power, these general-purpose computers or servers require additional computing resources to meet the computing power required for their services, that is, to connect to the computing resource node 130 to utilize the computing processing unit 150 therein to perform at least part of the computing tasks of the first or second service.
  • the first computing node 110 and the second computing node 120 may also be specially designed hardware architectures to utilize the computing power of the computing resource node 130 to perform at least part of the computing tasks of the first/second service.
  • the computing processing unit 150 may be a GPU, but the present disclosure is not limited thereto, but includes various computing processing hardware that can provide the required computing power for various artificial intelligence services, such as ASIC or FPGA.
  • the switch 140 may be a PCIe switch, and the first computing node 110, the second computing node 120, and the computing processing unit 150 are respectively connected to ports 1-3 of the switch 140 via PCIe cables.
  • port 3 may be set as a downstream port, and one of ports 1 and 2 may be set as the upstream port while the other is closed as needed, thereby achieving flexible switching of the switch between the two states.
  • the present invention is not limited thereto; the network connection between each computing node and the computing processing unit may also be implemented through, for example, a network interface controller (NIC), and GPU computing power may be flexibly provided to the first or second computing node through, for example, remote direct memory access (RDMA) technology between the NIC and the GPU.
  • compared with the solution using the RDMA technology of NICs, the solution using a PCIe switch for interconnection has lower system latency and lower software complexity.
  • the present invention is not limited thereto.
  • the system 100 may also include another computing node to provide another service, and/or multiple first/second computing nodes, and/or multiple switches and computing processing units.
  • a computing node may be connected to two or more switches, a switch may be connected to two or more computing processing units, and a switch may also be connected to two or more computing nodes.
  • the number of components and the connection architecture included in the system may be designed based on conditions such as the types of services that the system needs to provide, the number of service requirements, and the size of the computing power requirements of each service for computing resources.
  • in some cases, the first service requires more computing power than the second service, so the first computing node may be connected to more switches, or the first computing node may be connected to every switch so that all of the computing power can be used to perform the computing tasks of the first service.
  • FIG. 2 is a schematic flow chart showing a method for executing a computing task using a heterogeneous server system according to an embodiment of the present disclosure.
  • in step S210, it is determined whether the computing processing unit 150 is used for the first service or the second service. If it is determined to be used for the first service, step S220 is performed to configure the switch 140 to the first state, that is, the connection relationship shown by the solid line in FIG. 1. If it is determined to be used for the second service, step S230 is performed to configure the switch 140 to the second state, that is, the connection relationship shown by the dotted line in FIG. 1.
  • through this method, which service's computing tasks the computing processing unit 150 performs can be flexibly scheduled as needed.
  • the method can be implemented by the heterogeneous server system itself (for example, by the CPUs of the computing nodes and the computing resource node in the system, or by another controller in the system independent of these nodes), or by a control device outside the heterogeneous server system.
  • the present invention connects each computing node that provides different services to the required computing resources via a switch, and flexibly provides computing resources to each computing node as needed, thereby being able to meet multiple service requirements with a unified composite physical machine and improve the utilization of computing resources.
  • Figure 3 shows a schematic block diagram of a specific example of a heterogeneous server system according to an embodiment of the present disclosure.
  • the first computing node is a training node that provides model training services
  • the second computing node is an inference node that provides inference services
  • the switch is a PCIe switch
  • the computing resource node is a GPU node
  • the computing processing unit is a GPU.
  • the heterogeneous server system 300 includes one training node 310, four inference nodes 320, four PCIe switches 340, and eight GPUs 350.
  • the training node 310 is connected to four PCIe switches 340, each inference node 320 is connected to one PCIe switch 340, and each PCIe switch 340 is connected to two GPUs 350.
  • the present invention is not limited to the hardware quantities and architecture shown in FIG. 3; a composite system with a reasonable ratio of CPUs to GPUs can be configured according to the hardware requirements of the training and inference service scenarios.
  • the heterogeneous server system disclosed in the present invention may also be referred to as a "heterogeneous server composite system".
  • the composite system networks the training node with the required number of GPUs in the GPU node via switches, turning the system into a composite physical machine suitable for training services, and networks an inference node with the required number of GPUs via a switch, turning the system into a composite physical machine suitable for inference services, thereby achieving the technical effect of combining the inference nodes, the training node, and the GPU node into a composite physical machine that meets both inference and training needs.
  • the training node 310 and each inference node 320 may include the same or different numbers of CPUs.
  • the training node 310 and each inference node 320 may be set to include 1 CPU respectively, the ratio of CPU to GPU computing power required by the training service is 1:8, and the ratio of CPU to GPU computing power required by the inference service is 1:2. Therefore, the GPU node 330 provides 8 GPUs, which are divided into 4 groups and connected to 4 PCIe switches respectively. In this way, 1 group of GPUs connected to each PCIe switch can be allocated to an inference node for use, and all GPUs connected to all PCIe switches can be allocated to the training node for use at the same time.
  • These 8 GPUs can be flexibly allocated to training nodes or inference nodes according to the number of training services and inference services to be provided by the current system.
  • for example, in some embodiments, when there are only training services and no inference services, all GPUs (i.e., all computing power) can be allocated to the training node; when a higher-priority inference service request arrives during training, one of the switches can be switched to an inference node to serve the inference computing task without pausing the current training service, so the system can run training and inference services at the same time.
  • in other embodiments, the training and inference services can be managed by priority or in other ways, and all GPUs can be allocated to the training and inference services with maximum efficiency. For example, GPU computing power can be fully scheduled to match service demand according to the peak and trough periods of training and inference service requirements.
  • each PCIe switch 340 has six PCIe ports PE1-PE6, which are connected to various PCIe devices, including the GPUs 350, the interface 360, the MCIO 380, and the slot 390, through PCIe cables (the "PCIe X16/X8" indicated in the figure denotes x16/x8, i.e., 16-lane/8-lane, PCIe cables).
  • the PCIe switch of the present invention is not limited thereto; PCIe ports can be added or removed as needed, and the number of connected PCIe devices can be increased or decreased.
  • the training node 310 and the inference nodes 320 are not directly connected to the PCIe switches; instead, they are relayed through the interface 360, that is, they are physically connected to the interface 360 by cables and then relayed to the PCIe switch 340 via the interface 360.
  • the interface 360 is located in the GPU node 330, but the present invention does not limit the location of the interface, that is, the interface can also be independent of each node or installed in each computing node.
  • the MCIO (Mini Cool Edge I/O) 380 can be used as a PCIe-capable storage interface for connecting a storage device (such as an SSD or hard disk) to a PCIe switch.
  • the present invention is not limited to this kind of storage interface, and the storage required by a service can also be provided in other ways, not limited to being connected to a switch as shown in the figure.
  • Slot 390 can be connected to other required PCIe devices, or leave room for PCIe devices that need to be connected in the future.
  • FIG. 3 shows that the GPU node 330 also includes a baseboard management controller (BMC) 370, which is connected to each PCIe switch 340.
  • a PCIe switch adopts a tree connection structure in which there is only one upstream port, which connects to one or more downstream ports. Therefore, according to the present invention, the ports connected to the GPUs 350 are set as downstream ports, and the upstream port is flexibly switched between PE1 and PE2 to switch GPU computing power flexibly between the training node and the inference nodes.
  • in some embodiments, the switching of the upstream port of the PCIe switch 340 is achieved by the BMC 370 changing the firmware of the PCIe switch 340.
  • the BMC 370 directly reflashes the firmware of the switch, and the firmware sets each port as required to achieve the required connections.
  • for example, the system 300 outputs to the BMC 370 a first firmware for connecting the training node to the respective GPUs and a second firmware for connecting an inference node to the respective GPUs. Then, according to the GPU scheduling plan generated from the service requirements, the BMC 370 selects whether to load the first firmware or the second firmware to each PCIe switch 340.
  • the first and second firmwares can both set ports PE3-PE6 as downstream ports of the switch. The difference between them is that the first firmware sets port PE1 as the upstream port of the switch and closes port PE2, while the second firmware sets port PE2 as the upstream port of the switch and closes port PE1.
  • in other embodiments, the switching of the upstream port of the PCIe switch 340 is implemented using the internal processor 341 of the switch.
  • for example, the mode of the PCIe switch 340 is configured as ssw mode (synthetic switch mode), and the secrouting library is enabled.
  • here, the secrouting library is a library of enhanced switch features that supports debugging of the switch's advanced modes.
  • then, through the secrouting library interface, the BMC 370 can communicate with the internal processor 341 for the related configuration and modification.
  • the internal processor 341 obtains the PCIe topology of the lower layer of the PCIe switch (i.e., of each downstream port) and stores it in the internal processor's cache.
  • the BMC 370 then configures ports PE1 and PE2 through an IIC (Inter-Integrated Circuit) out-of-band channel, setting one of ports PE1 and PE2 as the upstream port and closing the other as needed.
  • the PCIe switch 340 then synchronizes resources such as the virtual PCIe tree to the training node or inference node connected to the upstream port to complete the system's PCIe driver resource configuration.
  • the PCIe tree describes the tree connection structure of the switch and includes the PCIe topology of the downstream ports.
  • in this way, the training node, the inference nodes, and the GPU node are physically networked through PCIe cables according to service requirements, and GPU computing power is flexibly switched between the training node and the inference nodes through the configuration of the PCIe switches, thereby realizing on-demand scheduling and switching of GPU computing power to integrate training and inference services and improving the utilization of GPU computing power.
  • FIG. 4 shows a schematic diagram of the structure of a computing device that can be used to implement the method for using the heterogeneous server system according to an embodiment of the present disclosure.
  • the computing device that implements the method of the present disclosure may be one of the computing nodes or the computing resource node in the heterogeneous server system serving concurrently in that role, another computing device in the system independent of these nodes, or a computing device outside the system.
  • computing device 400 includes memory 410 and processor 420 .
  • Processor 420 may be a multi-core processor or may include multiple processors.
  • processor 420 may include a general-purpose main processor and one or more special coprocessors, such as a graphics processing unit (GPU), a digital signal processor (DSP), etc.
  • processor 420 may be implemented using a customized circuit, such as an application-specific integrated circuit (ASIC) or a field programmable gate array (FPGA).
  • the memory 410 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage.
  • ROM can store static data or instructions required by the processor 420 or other modules of the computer.
  • the permanent storage device can be a readable and writable storage device, i.e., a non-volatile storage device that does not lose the stored instructions and data even when the computer is powered off.
  • in some embodiments, a large-capacity storage device (such as a magnetic or optical disk, or flash memory) is used as the permanent storage device.
  • in other embodiments, the permanent storage device may be a removable storage device (such as a floppy disk or optical drive).
  • the system memory may be a readable and writable storage device or a volatile readable and writable storage device, such as a dynamic random access memory.
  • the system memory may store some or all instructions and data required by the processor at run time.
  • the memory 410 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), and magnetic disks and/or optical disks may also be used.
  • the memory 410 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (such as a DVD-ROM or a double-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (such as an SD card, a mini SD card, a Micro-SD card, etc.), a magnetic floppy disk, etc.
  • the computer-readable storage medium does not contain carrier waves and transient electronic signals transmitted wirelessly or wired.
  • the memory 410 stores executable codes, and when the executable codes are processed by the processor 420 , the processor 420 can execute the method for using the heterogeneous server system mentioned above.
  • the method according to the present invention may also be implemented as a computer program or computer program product, which includes computer program code instructions for executing the steps defined in the above method of the present invention.
  • the present invention may also be implemented as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) on which executable code (or a computer program, or computer instruction code) is stored; when the executable code is executed by a processor of an electronic device, the processor executes the various steps of the above method according to the present invention.
  • each block in the flow charts or block diagrams may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function.
  • the functions noted in the blocks may also occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved.
  • each block in the block diagrams and/or flow charts, and combinations of blocks in the block diagrams and/or flow charts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Multi Processors (AREA)
  • Power Sources (AREA)
  • Hardware Redundancy (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The present disclosure relates to a heterogeneous server system and a method of using the same. The heterogeneous server system includes: a first computing node configured to provide a first service; a second computing node configured to provide a second service; and a computing resource node including a switch and a computing processing unit connected to the switch. The computing processing unit is used to perform at least part of the computing tasks of the first service or the second service. The switch is connected to the first computing node and the second computing node and can switch between a first state and a second state. In the first state the switch connects the computing processing unit to the first computing node, and in the second state the switch connects the computing processing unit to the second computing node. Thus, the heterogeneous server system can provide at least two services, and the flexible switching scheme of the switch can be used to improve the utilization of the computing power of the computing resource node, effectively improving TCO benefits.

Description

Heterogeneous server system and method of using same
This application claims priority to the Chinese patent application filed with the China Patent Office on October 25, 2022, with application number 202211311808.2 and entitled "Heterogeneous server system and method of using same", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to hardware computing devices for artificial intelligence, and in particular to a heterogeneous server system that provides composite services for artificial intelligence.
Background Art
The training services and inference services of artificial intelligence models usually require different computing capabilities, so different heterogeneous (for example, central processing unit (CPU) + graphics processing unit (GPU)) servers are usually designed to meet the different service requirements. For example, mainstream training services usually use GPU training servers or OAM (Open Application Model)-based UBB (Universal Baseboard) boards to provide computing power, while inference services usually use single-GPU-card server models.
At present, the heterogeneous server hardware designed for training services (for example, the ratio of CPUs, GPUs, and memory) does not match the demands of inference services. If a training server is used for inference services, the CPU and GPU computing power cannot be fully utilized, resulting in wasted computing power. Moreover, current training servers and inference servers cannot switch flexibly between training, inference, and other services, so they cannot follow the peaks and troughs of training and inference demand or fully schedule GPU computing power to match service needs. Therefore, to meet training and inference service needs at the same time, users currently usually have to purchase both training servers and inference servers. However, this easily wastes the computing power of both servers during the troughs of their respective service demands.
Therefore, a new type of heterogeneous server system is needed that makes full use of GPU computing power to simultaneously meet the needs of at least two major categories of artificial intelligence services (for example, training and inference).
Summary of the Invention
A technical problem to be solved by the present disclosure is to provide a heterogeneous server system that can provide at least two artificial intelligence services cost-effectively.
According to a first aspect of the present disclosure, a heterogeneous server system is provided, including: a first computing node configured to provide a first service; a second computing node configured to provide a second service; and a computing resource node including a switch and a computing processing unit connected to the switch. The computing processing unit is used to perform at least part of the computing tasks of the first service or the second service. The switch is connected to the first computing node and the second computing node and can switch between a first state and a second state, wherein in the first state the switch connects the computing processing unit to the first computing node, and in the second state the switch connects the computing processing unit to the second computing node.
Optionally, the switch is a PCIe (Peripheral Component Interconnect Express) switch, the computing processing unit is connected to a downstream port of the switch via a PCIe cable, the first computing node and the second computing node are respectively connected to a first port and a second port of the switch via PCIe cables, and in the first state the first port is set as the upstream port of the switch and the second port is closed, while in the second state the second port is set as the upstream port of the switch and the first port is closed.
Optionally, the computing resource node further includes a baseboard management controller, and the switching of the switch between the first state and the second state is achieved by the baseboard management controller changing the firmware of the switch.
Optionally, the computing resource node further includes a baseboard management controller, and the switch further includes an internal processor. The switching of the switch between the first state and the second state is achieved as follows: the switch is configured to enable the baseboard management controller to communicate with the internal processor; the internal processor obtains and saves the PCIe topology of the downstream port; in the first state, the baseboard management controller configures the first port as the upstream port of the switch and closes the second port, and the switch provides the PCIe topology to the first computing node; and in the second state, the baseboard management controller configures the second port as the upstream port of the switch and closes the first port, and the switch provides the PCIe topology to the second computing node.
Optionally, the heterogeneous server system includes multiple second computing nodes, and the computing resource node includes multiple switches and multiple computing processing units, wherein the multiple computing processing units are divided into multiple groups, each group being connected to one of the multiple switches. The first computing node is connected to at least two of the multiple switches. Each of the multiple second computing nodes is connected to at least one of the multiple switches, and the number of switches connected to a second computing node is calculated from the number of computing processing units required for the second service and the connection architecture between the computing processing units and the switches. The number of switches connected to the first computing node is greater than the number of switches connected to a second computing node.
Optionally, the computing processing unit is a GPU; and/or the first computing node and the second computing node each include a CPU and memory; and/or the first service is an artificial intelligence training service; and/or the second service is an artificial intelligence inference service.
Optionally, the computing resource node further includes: a first interface for connecting the first computing node to the first port of the switch; and/or a second interface for connecting the second computing node to the second port of the switch; and/or a storage interface for connecting a storage device to a third port of the switch.
According to a second aspect of the present disclosure, a method for performing computing tasks using the heterogeneous server system according to the first aspect of the present disclosure is provided, including: determining whether the computing processing unit connected to a switch is used for the first service or the second service; in the case where the computing processing unit connected to the switch is used for the first service, configuring the switch to the first state; and in the case where the computing processing unit connected to the switch is used for the second service, configuring the switch to the second state.
Optionally, the switch is a PCIe switch, the computing processing unit is connected to a downstream port of the switch via a PCIe cable, the first computing node and the second computing node are respectively connected to a first port and a second port of the switch via PCIe cables, and the computing resource node further includes a baseboard management controller. The step of configuring the switch to the first state includes: setting the firmware of the switch to a first firmware through the baseboard management controller, where the first firmware sets the first port as the upstream port of the switch so as to connect to the downstream port, and closes the second port; and/or the step of configuring the switch to the second state includes: setting the firmware of the switch to a second firmware through the baseboard management controller, where the second firmware sets the second port as the upstream port of the switch so as to connect to the downstream port, and closes the first port.
Optionally, the switch is a PCIe switch, the computing processing unit is connected to a downstream port of the switch via a PCIe cable, the first computing node and the second computing node are respectively connected to a first port and a second port of the switch via PCIe cables, the computing resource node further includes a baseboard management controller, and the switch further includes an internal processor. The step of configuring the switch to the first state includes: configuring the switch so that the baseboard management controller communicates with the internal processor, obtaining and saving the PCIe topology of the downstream port through the internal processor, configuring the first port as the upstream port of the switch and closing the second port through the baseboard management controller, and providing the PCIe topology to the first computing node through the switch; and/or the step of configuring the switch to the second state includes: configuring the switch so that the baseboard management controller communicates with the internal processor, obtaining and saving the PCIe topology of the downstream port through the internal processor, configuring the second port as the upstream port of the switch and closing the first port through the baseboard management controller, and providing the PCIe topology to the second computing node through the switch.
Optionally, the heterogeneous server system includes multiple second computing nodes; the computing resource node includes multiple switches and multiple computing processing units; the multiple computing processing units are divided into multiple groups, each group being connected to one of the multiple switches; the first computing node is connected to at least two of the multiple switches; each of the multiple second computing nodes is connected to at least one of the multiple switches, and the number of switches connected to a second computing node is calculated from the number of computing processing units required for the second service and the connection architecture between the computing processing units and the switches; and the number of switches connected to the first computing node is greater than the number of switches connected to a second computing node. The step of determining whether the computing processing unit connected to a switch is used for the first service or the second service includes: determining, according to the numbers of first services and second services to be provided by the heterogeneous server system, the numbers of computing processing units among the multiple computing processing units to be used for the first service and the second service respectively, and allocating the multiple computing processing units to the first service and the second service respectively according to the connection architecture between the computing processing units and the switches, the connection architecture between the first computing node, the second computing nodes, and the switches, and the number of computing processing units required for each service.
According to a third aspect of the present disclosure, a computing device is provided, including: a processor; and a memory on which executable code is stored, where the executable code, when executed by the processor, causes the processor to execute the method described in the second aspect above.
According to a fourth aspect of the present disclosure, a computer program product is provided, including executable code which, when executed by a processor of an electronic device, causes the processor to execute the method described in the second aspect above.
According to a fifth aspect of the present disclosure, a non-transitory machine-readable storage medium is provided, on which executable code is stored; when the executable code is executed by a processor of an electronic device, the processor executes the method described in the second aspect above.
Thus, the present disclosure merges at least two kinds of service nodes and a computing resource node into an integral composite physical machine model through hybrid networking, so it can provide at least two services and can use the flexible switching scheme of the switch to improve the utilization of the computing power of the computing resource node, effectively improving total cost of ownership (TCO) benefits.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent through a more detailed description of exemplary embodiments of the present disclosure in conjunction with the accompanying drawings, in which the same reference numerals generally represent the same components throughout the exemplary embodiments of the present disclosure.
FIG. 1 shows a schematic block diagram of a heterogeneous server system according to an embodiment of the present disclosure.
FIG. 2 shows a schematic flow chart of a method for using a heterogeneous server system according to an embodiment of the present disclosure.
FIG. 3 shows a schematic block diagram of a specific example of a heterogeneous server system according to an embodiment of the present disclosure.
FIG. 4 shows a schematic diagram of the structure of a computing device according to an embodiment of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure will be more thorough and complete, and so that the scope of the present disclosure can be fully conveyed to those skilled in the art.
To solve the aforementioned technical problem, the present disclosure connects at least two computing nodes that provide different services to a computing processing unit via a switch to achieve flexible allocation of the computing power of the computing processing unit, thereby efficiently utilizing computing resources (i.e., computing processing units) to provide at least two different services.
The basic concept of the present invention is described below with reference to FIGS. 1-2.
FIG. 1 is a schematic block diagram showing a basic architecture of a heterogeneous server system according to an embodiment of the present disclosure.
As shown in FIG. 1, the heterogeneous server system 100 includes a first computing node 110, a second computing node 120, and a computing resource node 130. The computing resource node 130 includes a switch 140 and a computing processing unit 150. The first computing node 110, the second computing node 120, and the computing processing unit 150 are connected to three ports 1-3 of the switch 140, respectively.
The solid line connecting ports 1 and 3 in the switch 140 shown in FIG. 1 schematically represents the first state of the switch 140, which connects the computing processing unit 150 to the first computing node 110. The dotted line connecting ports 2 and 3 shown in FIG. 1 schematically represents the second state of the switch 140, which connects the computing processing unit 150 to the second computing node 120. As indicated by the large arrow in FIG. 1, the switch 140 can switch between the first state and the second state, thereby selecting whether the computing processing unit 150 is connected to the first computing node 110 or to the second computing node 120.
Those skilled in the art should understand that the structure of the switch 140 shown in the figure is only a simple illustration of its function and does not represent the physical structure of the switch of the present disclosure; the connections represented by the lines in the figure are not limited to direct physical connections, but may also include indirect connections via intermediate interfaces, wireless connections, and so on.
The first computing node 110 and the second computing node 120 are configured to provide a first service and a second service, respectively, such as an artificial intelligence training service and inference service. In some embodiments of the present disclosure, the first computing node 110 and the second computing node 120 may be general-purpose computers or servers, each of which may include a CPU and memory to perform the operations of the first/second service. Since artificial intelligence services generally require higher computing power, these general-purpose computers or servers need additional computing resources to meet the computing power required by their services; that is, they connect to the computing resource node 130 to use the computing processing unit 150 therein to perform at least part of the computing tasks of the first or second service. In other embodiments, the first computing node 110 and the second computing node 120 may also be specially designed hardware architectures that use the computing power of the computing resource node 130 to perform at least part of the computing tasks of the first/second service.
The computing processing unit 150 may be a GPU, but the present disclosure is not limited thereto and includes various computing processing hardware that can provide the computing power required by various artificial intelligence services, such as an ASIC or FPGA.
In some embodiments of the present disclosure, the switch 140 may be a PCIe switch, and the first computing node 110, the second computing node 120, and the computing processing unit 150 are connected to ports 1-3 of the switch 140, respectively, via PCIe cables. By configuring the switch 140, port 3 can be set as a downstream port, and one of ports 1 and 2 can be set as the upstream port while the other is closed as needed, thereby achieving flexible switching of the switch between the two states. Of course, the present invention is not limited thereto; the network connection between each computing node and the computing processing unit can also be implemented through, for example, a network interface controller (NIC), and GPU computing power can be flexibly provided to the first or second computing node through, for example, remote direct memory access (RDMA) technology between the NIC and the GPU. Compared with the solution using the RDMA technology of NICs, the solution using a PCIe switch for interconnection has lower system latency and lower software complexity.
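To make the two switch states concrete, here is a minimal Python sketch (illustrative only, not part of the patent disclosure; all names are ours) that models the port-role assignment of the switch 140 in FIG. 1, with port numbers following the figure:

```python
from enum import Enum

class PortRole(Enum):
    UPSTREAM = "upstream"      # faces a host (computing node)
    DOWNSTREAM = "downstream"  # faces an endpoint (computing processing unit)
    CLOSED = "closed"          # port disabled

# Port-role assignments for the two states of switch 140 in FIG. 1:
# port 1 -> first computing node, port 2 -> second computing node,
# port 3 -> computing processing unit (always downstream).
FIRST_STATE  = {1: PortRole.UPSTREAM, 2: PortRole.CLOSED,   3: PortRole.DOWNSTREAM}
SECOND_STATE = {1: PortRole.CLOSED,   2: PortRole.UPSTREAM, 3: PortRole.DOWNSTREAM}
```

Switching states is then just swapping which of ports 1 and 2 is the single upstream port, which is exactly the constraint that a PCIe switch's tree topology imposes.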
Those skilled in the art should understand that although FIG. 1 shows only two computing nodes, one switch, and one computing processing unit, the present invention is not limited thereto. As needed, the system 100 may also include another kind of computing node to provide another service, and/or multiple first/second computing nodes, and/or multiple switches and computing processing units. One computing node may be connected to two or more switches, one switch may be connected to two or more computing processing units, and one switch may also be connected to two or more computing nodes. In practical applications, the number of components included in the system and the connection architecture can be designed according to conditions such as the kinds of services the system needs to provide, the demand for each service, and the computing power each service requires of the computing resources. In some cases, the first service requires more computing power than the second service, so the first computing node can be connected to more switches, or the first computing node can be connected to every switch so that all of the computing power can be used to perform the computing tasks of the first service.
A method of implementing the basic concept of the present invention using the heterogeneous server system 100 of FIG. 1 described above is described below with reference to FIG. 2.
FIG. 2 is a schematic flow chart showing a method for executing a computing task using a heterogeneous server system according to an embodiment of the present disclosure.
Referring to FIG. 2, in step S210, it is determined whether the computing processing unit 150 is used for the first service or the second service. If it is determined that it is used for the first service, step S220 is performed to configure the switch 140 to the first state, that is, the connection relationship shown by the solid line in FIG. 1. If it is determined that it is used for the second service, step S230 is performed to configure the switch 140 to the second state, that is, the connection relationship shown by the dotted line in FIG. 1. Through this method, which service's computing tasks the computing processing unit 150 performs can be flexibly scheduled as needed.
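Steps S210-S230 can be sketched as follows (our own illustration, reusing the state tables from the previous sketch; the `set_state` management call is a hypothetical stand-in for the BMC-driven mechanisms described later with reference to FIG. 3):

```python
def configure_switch_for_service(switch, service: str) -> None:
    """Steps S210-S230: configure the switch according to which service
    the attached computing processing unit should serve."""
    if service == "first":       # e.g., a training service (step S220)
        switch.set_state(FIRST_STATE)
    elif service == "second":    # e.g., an inference service (step S230)
        switch.set_state(SECOND_STATE)
    else:
        raise ValueError(f"unknown service: {service!r}")
```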
The method can be implemented by the heterogeneous server system itself (for example, by the CPUs of the computing nodes and the computing resource node in the system, or by another controller in the system independent of these nodes), or by a control device outside the heterogeneous server system.
Summarizing the above description with reference to FIGS. 1-2, the present invention connects the computing nodes that provide different services to the required computing resources via switches and flexibly provides the computing resources to the computing nodes as needed, so that a unified composite physical machine can meet multiple service requirements and the utilization of the computing resources is improved.
Below, for a fuller understanding of the present invention, a specific example of the present disclosure and its operation are described with reference to FIG. 3.
FIG. 3 shows a schematic block diagram of a specific example of a heterogeneous server system according to an embodiment of the present disclosure. In this example, the above-mentioned first computing node is a training node that provides model training services, the second computing node is an inference node that provides inference services, the switch is a PCIe switch, the computing resource node is a GPU node, and the computing processing unit is a GPU.
As shown in FIG. 3, the heterogeneous server system 300 includes 1 training node 310, 4 inference nodes 320, 4 PCIe switches 340, and 8 GPUs 350. The training node 310 is connected to the 4 PCIe switches 340, each inference node 320 is connected to one PCIe switch 340, and each PCIe switch 340 is connected to 2 GPUs 350.
Those skilled in the art should understand that the present invention is not limited to the hardware quantities and architecture shown in FIG. 3; a composite system with a reasonable ratio of CPUs to GPUs can be configured according to the hardware requirements of the training and inference service scenarios. The heterogeneous server system of the present disclosure may also be called a "heterogeneous server composite system". By networking the training node with the required number of GPUs in the GPU node via switches, the composite system becomes a composite physical machine suitable for training services; by networking an inference node with the required number of GPUs via a switch, the system becomes a composite physical machine suitable for inference services. This achieves the technical effect of combining the inference nodes, the training node, and the GPU node into a composite physical machine that meets both inference and training needs.
The training node 310 and each inference node 320 may include the same or different numbers of CPUs. In this example, the training node 310 and each inference node 320 can be set to include 1 CPU each, the CPU-to-GPU computing power ratio required by the training service is 1:8, and the CPU-to-GPU computing power ratio required by the inference service is 1:2. The GPU node 330 therefore provides 8 GPUs, which are divided into 4 groups connected to the 4 PCIe switches, respectively. In this way, the 1 group of GPUs connected to each PCIe switch can be allocated to an inference node, and all of the GPUs connected to all of the PCIe switches can be allocated to the training node at the same time.
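The sizing arithmetic in this example can be checked with a small sketch (assumptions: 1 CPU per node and the ratios stated above; the function name is ours):

```python
def size_gpu_groups(total_gpus: int = 8,
                    inference_gpus_per_cpu: int = 2,
                    training_gpus_per_cpu: int = 8) -> dict:
    """With a 1:2 CPU:GPU ratio, each 1-CPU inference node needs 2 GPUs,
    so 8 GPUs form 4 groups of 2 (one group per PCIe switch); with a 1:8
    ratio, the 1-CPU training node can absorb all 8 GPUs at once."""
    group_size = inference_gpus_per_cpu            # 2 GPUs behind each switch
    num_groups = total_gpus // group_size          # 4 switches / GPU groups
    assert total_gpus <= training_gpus_per_cpu     # training node can take all GPUs
    return {"group_size": group_size, "num_groups": num_groups}

print(size_gpu_groups())  # {'group_size': 2, 'num_groups': 4}
```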
According to the numbers of training services and inference services that the system currently needs to provide, these 8 GPUs can be flexibly allocated to the training node or the inference nodes.
For example, in some embodiments, when there are only training services and no inference services, all GPUs (i.e., all computing power) can be allocated to the training node; when a higher-priority inference service request arrives during training, one of the switches can be switched to an inference node to serve the inference computing task without pausing the current training service, so the system can run training and inference services at the same time. In other embodiments, the training and inference services can be managed by priority or in other ways, and all GPUs can be allocated to the training and inference services with maximum efficiency. For example, GPU computing power can be fully scheduled to match service demand according to the peak and trough periods of training and inference service requirements.
After the allocation scheme of GPUs to the training node and/or inference nodes is confirmed, the drivers of the PCIe switches are configured according to the scheme, so that the upstream port of each relevant PCIe switch is connected to one of the training node and the inference nodes and the PCIe channel to the other node is closed, thereby combining GPU computing power with each computing node to meet the needs of the corresponding services.
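One way to picture such an allocation scheme, for the FIG. 3 topology only (4 switches with 2 GPUs each; inference node i behind switch i; the training node behind every switch), is the following sketch; it is purely illustrative and not the patent's allocation algorithm:

```python
def plan_switch_states(num_inference_services: int, num_switches: int = 4) -> dict:
    """Give the first `num_inference_services` switches to inference nodes
    and leave the remaining switches (and their GPUs) to the training node."""
    if not 0 <= num_inference_services <= num_switches:
        raise ValueError("invalid number of inference services")
    return {i: ("inference" if i < num_inference_services else "training")
            for i in range(num_switches)}

# Two concurrent inference services leave 2 switches (4 GPUs) for training:
# plan_switch_states(2) -> {0: 'inference', 1: 'inference', 2: 'training', 3: 'training'}
```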
The configuration operation of the PCIe switch 340 is described in detail below with reference to FIG. 3.
As shown in FIG. 3, each PCIe switch 340 has six PCIe ports PE1-PE6, which are connected to various PCIe devices, including the GPUs 350, the interface 360, the MCIO 380, and the slot 390, through PCIe cables (the "PCIe X16/X8" indicated in the figure denotes x16/x8, i.e., 16-lane/8-lane, PCIe cables). Of course, the PCIe switch of the present invention is not limited thereto; PCIe ports can be added or removed as needed, and the number of connected PCIe devices can be increased or decreased.
In this example, the training node 310 and the inference nodes 320 are not directly connected to the PCIe switches; instead, they are relayed through the interface 360, that is, they are physically connected to the interface 360 by cables and then relayed to the PCIe switch 340 via the interface 360. In FIG. 3, the interface 360 is located in the GPU node 330, but the present invention does not limit the location of the interface; that is, the interface can also be independent of the nodes or installed in each computing node.
The MCIO (Mini Cool Edge I/O) 380 can serve as a PCIe-capable storage interface for connecting a storage device (such as an SSD or hard disk) to a PCIe switch. Of course, the present invention is not limited to this kind of storage interface, and the storage required by a service can also be provided in other ways, not limited to being connected to a switch as shown in the figure.
The slot 390 can be connected to other required PCIe devices, or it can leave room for PCIe devices that need to be connected in the future.
In addition, FIG. 3 shows that the GPU node 330 also includes a baseboard management controller (BMC) 370, which is connected to each PCIe switch 340.
As known to those skilled in the art, a PCIe switch adopts a tree connection structure in which there is only one upstream port, which connects to one or more downstream ports. Therefore, according to the present invention, the ports connected to the GPUs 350 are set as downstream ports, and the upstream port is flexibly switched between PE1 and PE2 to switch GPU computing power flexibly between the training node and the inference nodes.
In some embodiments, the switching of the upstream port of the PCIe switch 340 is achieved by the BMC 370 changing the firmware of the PCIe switch 340. The BMC 370 directly reflashes the firmware of the switch, and the firmware sets each port as required to achieve the required connections.
For example, the system 300 outputs to the BMC 370 a first firmware for connecting the training node to the respective GPUs and a second firmware for connecting an inference node to the respective GPUs. Then, according to the GPU scheduling plan generated from the service requirements, the BMC 370 selects whether to load the first firmware or the second firmware to each PCIe switch 340. The first and second firmwares can both set ports PE3-PE6 as downstream ports of the switch. The difference between them is that the first firmware sets port PE1 as the upstream port of the switch and closes port PE2, while the second firmware sets port PE2 as the upstream port of the switch and closes port PE1.
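A sketch of this firmware-based flow follows; the image names and the `flash_switch_firmware` call are hypothetical stand-ins for whatever out-of-band update interface the BMC actually exposes:

```python
FIRMWARE_IMAGES = {
    # First firmware: PE1 upstream (training node), PE2 closed, PE3-PE6 downstream.
    "training": "pcie_sw_fw_pe1_upstream.bin",
    # Second firmware: PE2 upstream (inference node), PE1 closed, PE3-PE6 downstream.
    "inference": "pcie_sw_fw_pe2_upstream.bin",
}

def apply_gpu_schedule(bmc, plan: dict) -> None:
    """Flash each PCIe switch with the firmware matching the scheduling plan,
    e.g. the output of plan_switch_states() above."""
    for switch_id, service in plan.items():
        bmc.flash_switch_firmware(switch_id, FIRMWARE_IMAGES[service])
```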
In other embodiments, the switching of the upstream port of the PCIe switch 340 is implemented using the internal processor 341 of the switch.
For example, the mode of the PCIe switch 340 is configured as ssw mode (synthetic switch mode), and the secrouting library is enabled. Here, the secrouting library is a library of enhanced switch features that supports debugging of the switch's advanced modes.
Then, through the secrouting library interface, the BMC 370 can communicate with the internal processor 341 for the related configuration and modification.
Then, the internal processor 341 obtains the PCIe topology of the lower layer of the PCIe switch (i.e., of each downstream port) and stores it in the internal processor's cache.
Then, the BMC 370 configures ports PE1 and PE2 through an IIC (Inter-Integrated Circuit) out-of-band channel, setting one of ports PE1 and PE2 as the upstream port and closing the other as needed.
Then, the PCIe switch 340 synchronizes resources such as the virtual PCIe tree to the training node or inference node connected to the upstream port to complete the system's PCIe driver resource configuration. The PCIe tree describes the tree connection structure of the switch and includes the PCIe topology of the downstream ports.
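The internal-processor flow can be summarized by the following sequence sketch; every call name here is hypothetical (the real interface is the vendor's ssw-mode/secrouting tooling), so treat it as pseudocode of the four steps above rather than a usable driver:

```python
def switch_upstream_via_internal_cpu(bmc, switch, node, upstream_port: str,
                                     closed_port: str) -> None:
    """(1) enable ssw mode and the secrouting library so the BMC can talk to
    the switch's internal processor; (2) capture and cache the downstream
    PCIe topology; (3) reassign PE1/PE2 over the IIC out-of-band channel;
    (4) sync the virtual PCIe tree to the newly attached node."""
    switch.set_mode("ssw")                         # synthetic switch mode
    switch.enable_library("secrouting")            # enhanced-feature/debug library
    topology = switch.internal_cpu.capture_downstream_topology()  # cached on-chip
    bmc.iic_configure_port(switch, upstream_port, role="upstream")
    bmc.iic_configure_port(switch, closed_port, role="closed")
    switch.sync_virtual_pcie_tree(node, topology)  # completes PCIe driver setup

# e.g., hand this switch's GPUs to an inference node:
# switch_upstream_via_internal_cpu(bmc370, pcie_switch340, inference_node,
#                                  upstream_port="PE2", closed_port="PE1")
```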
As described above, the training node, the inference nodes, and the GPU node are physically networked through PCIe cables according to service requirements, and GPU computing power is flexibly switched between the training node and the inference nodes through the configuration of the PCIe switches, thereby realizing on-demand scheduling and switching of GPU computing power to integrate training and inference services and improving the utilization of GPU computing power.
FIG. 4 shows a schematic diagram of the structure of a computing device that can be used to implement the above method for using a heterogeneous server system according to an embodiment of the present disclosure. As mentioned above, the computing device that implements the method of the present disclosure may be one of the computing nodes or the computing resource node in the heterogeneous server system serving concurrently in that role, another computing device in the system independent of these nodes, or a computing device outside the system.
Referring to FIG. 4, the computing device 400 includes a memory 410 and a processor 420.
The processor 420 may be a multi-core processor or may include multiple processors. In some embodiments, the processor 420 may include a general-purpose main processor and one or more special coprocessors, such as a graphics processing unit (GPU), a digital signal processor (DSP), and so on. In some embodiments, the processor 420 may be implemented using customized circuits, such as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
The memory 410 may include various types of storage units, such as system memory, read-only memory (ROM), and a permanent storage device. The ROM may store static data or instructions required by the processor 420 or other modules of the computer. The permanent storage device may be a readable and writable storage device, i.e., a non-volatile storage device that does not lose the stored instructions and data even when the computer is powered off. In some embodiments, a large-capacity storage device (such as a magnetic or optical disk, or flash memory) is used as the permanent storage device. In other embodiments, the permanent storage device may be a removable storage device (such as a floppy disk or optical drive). The system memory may be a readable and writable storage device or a volatile readable and writable storage device, such as dynamic random access memory. The system memory may store some or all of the instructions and data required by the processor at run time. In addition, the memory 410 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical disks may also be used. In some embodiments, the memory 410 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (such as a DVD-ROM or a double-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (such as an SD card, a mini SD card, a Micro-SD card, etc.), a magnetic floppy disk, and so on. Computer-readable storage media do not include carrier waves or transient electronic signals transmitted wirelessly or over wires.
The memory 410 stores executable code; when the executable code is processed by the processor 420, the processor 420 can be made to execute the method for using a heterogeneous server system described above.
The heterogeneous server system according to the present invention and the method of using it have been described in detail above with reference to the accompanying drawings.
In addition, the method according to the present invention may also be implemented as a computer program or computer program product, which includes computer program code instructions for executing the steps defined in the above method of the present invention.
Alternatively, the present invention may also be implemented as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) on which executable code (or a computer program, or computer instruction code) is stored; when the executable code (or computer program, or computer instruction code) is executed by a processor of an electronic device (or computing device, server, etc.), the processor executes the various steps of the above method according to the present invention.
Those skilled in the art will also understand that the various exemplary logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein can be implemented as electronic hardware, computer software, or a combination of both.
The flow charts and block diagrams in the accompanying drawings show the possible architectures, functions, and operations of systems and methods according to multiple embodiments of the present invention. In this regard, each block in the flow charts or block diagrams may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flow charts, and combinations of blocks in the block diagrams and/or flow charts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The embodiments of the present invention have been described above. The above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The choice of terms used herein is intended to best explain the principles of the embodiments, the practical application, or the improvement over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (14)

  1. A heterogeneous server system, comprising:
    a first computing node configured to provide a first service;
    a second computing node configured to provide a second service; and
    a computing resource node comprising a switch and a computing processing unit connected to the switch,
    wherein the computing processing unit is used to perform at least part of the computing tasks of the first service or the second service,
    wherein the switch is connected to the first computing node and the second computing node and can switch between a first state and a second state, wherein in the first state the switch connects the computing processing unit to the first computing node, and in the second state the switch connects the computing processing unit to the second computing node.
  2. The heterogeneous server system according to claim 1, wherein
    the switch is a PCIe switch,
    the computing processing unit is connected to a downstream port of the switch via a PCIe cable,
    the first computing node and the second computing node are respectively connected to a first port and a second port of the switch via PCIe cables, and in the first state the first port is set as the upstream port of the switch and the second port is closed, while in the second state the second port is set as the upstream port of the switch and the first port is closed.
  3. The heterogeneous server system according to claim 2, wherein
    the computing resource node further comprises a baseboard management controller, and the switching of the switch between the first state and the second state is achieved by the baseboard management controller changing the firmware of the switch.
  4. The heterogeneous server system according to claim 2, wherein
    the computing resource node further comprises a baseboard management controller, the switch further comprises an internal processor, and the switching of the switch between the first state and the second state is achieved as follows:
    the switch is configured to enable the baseboard management controller to communicate with the internal processor;
    the internal processor obtains and saves the PCIe topology of the downstream port;
    in the first state, the baseboard management controller configures the first port as the upstream port of the switch and closes the second port, and the switch provides the PCIe topology to the first computing node; and
    in the second state, the baseboard management controller configures the second port as the upstream port of the switch and closes the first port, and the switch provides the PCIe topology to the second computing node.
  5. The heterogeneous server system according to claim 1 or 2, wherein
    the heterogeneous server system comprises multiple second computing nodes,
    the computing resource node comprises multiple switches and multiple computing processing units,
    wherein the multiple computing processing units are divided into multiple groups, each group being connected to one of the multiple switches,
    wherein the first computing node is connected to at least two of the multiple switches,
    wherein each of the multiple second computing nodes is connected to at least one of the multiple switches, and the number of switches connected to a second computing node is calculated from the number of computing processing units required for the second service and the connection architecture between the computing processing units and the switches, and
    wherein the number of switches connected to the first computing node is greater than the number of switches connected to a second computing node.
  6. The heterogeneous server system according to claim 1, wherein
    the computing processing unit is a GPU; and/or
    the first computing node and the second computing node each comprise a CPU and memory; and/or
    the first service is an artificial intelligence training service; and/or
    the second service is an artificial intelligence inference service.
  7. The heterogeneous server system according to claim 1 or 2, wherein
    the computing resource node further comprises:
    a first interface for connecting the first computing node to the first port of the switch; and/or
    a second interface for connecting the second computing node to the second port of the switch; and/or
    a storage interface for connecting a storage device to a third port of the switch.
  8. A method for performing computing tasks using the heterogeneous server system according to any one of claims 1 to 7, comprising:
    determining whether the computing processing unit connected to a switch is used for the first service or the second service;
    in the case where the computing processing unit connected to the switch is used for the first service, configuring the switch to the first state; and
    in the case where the computing processing unit connected to the switch is used for the second service, configuring the switch to the second state.
  9. The method according to claim 8, wherein
    the switch is a PCIe switch,
    the computing processing unit is connected to a downstream port of the switch via a PCIe cable,
    the first computing node and the second computing node are respectively connected to a first port and a second port of the switch via PCIe cables,
    the computing resource node further comprises a baseboard management controller,
    the step of configuring the switch to the first state comprises:
    setting the firmware of the switch to a first firmware through the baseboard management controller, wherein the first firmware sets the first port as the upstream port of the switch so as to connect to the downstream port, and closes the second port,
    and/or the step of configuring the switch to the second state comprises:
    setting the firmware of the switch to a second firmware through the baseboard management controller, wherein the second firmware sets the second port as the upstream port of the switch so as to connect to the downstream port, and closes the first port.
  10. The method according to claim 8, wherein
    the switch is a PCIe switch,
    the computing processing unit is connected to a downstream port of the switch via a PCIe cable,
    the first computing node and the second computing node are respectively connected to a first port and a second port of the switch via PCIe cables,
    the computing resource node further comprises a baseboard management controller, and the switch further comprises an internal processor,
    the step of configuring the switch to the first state comprises:
    configuring the switch so that the baseboard management controller communicates with the internal processor;
    obtaining and saving the PCIe topology of the downstream port through the internal processor; and
    configuring the first port as the upstream port of the switch and closing the second port through the baseboard management controller, and providing the PCIe topology to the first computing node through the switch,
    and/or the step of configuring the switch to the second state comprises:
    configuring the switch so that the baseboard management controller communicates with the internal processor;
    obtaining and saving the PCIe topology of the downstream port through the internal processor; and
    configuring the second port as the upstream port of the switch and closing the first port through the baseboard management controller, and providing the PCIe topology to the second computing node through the switch.
  11. The method according to claim 8, wherein
    the heterogeneous server system comprises multiple second computing nodes,
    the computing resource node comprises multiple switches and multiple computing processing units,
    the multiple computing processing units are divided into multiple groups, each group being connected to one of the multiple switches,
    the first computing node is connected to at least two of the multiple switches,
    each of the multiple second computing nodes is connected to at least one of the multiple switches, and the number of switches connected to a second computing node is calculated from the number of computing processing units required for the second service and the connection architecture between the computing processing units and the switches,
    the number of switches connected to the first computing node is greater than the number of switches connected to a second computing node, and
    the step of determining whether the computing processing unit connected to a switch is used for the first service or the second service comprises:
    determining, according to the numbers of first services and second services to be provided by the heterogeneous server system, the numbers of computing processing units among the multiple computing processing units to be used for the first service and the second service respectively, and
    allocating the multiple computing processing units to the first service and the second service respectively according to the connection architecture between the computing processing units and the switches, the connection architecture between the first computing node, the second computing nodes, and the switches, and the number of computing processing units required for each service.
  12. A computing device, comprising:
    a processor; and
    a memory on which executable code is stored, wherein the executable code, when executed by the processor, causes the processor to execute the method according to any one of claims 8 to 11.
  13. A computer program product comprising executable code which, when executed by a processor of an electronic device, causes the processor to execute the method according to any one of claims 8 to 11.
  14. A non-transitory machine-readable storage medium on which executable code is stored, wherein the executable code, when executed by a processor of an electronic device, causes the processor to execute the method according to any one of claims 8 to 11.
PCT/CN2023/126246 2022-10-25 2023-10-24 Heterogeneous server system and method of using same WO2024088263A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211311808.2A CN116185599A (zh) 2022-10-25 2022-10-25 Heterogeneous server system and method of using same
CN202211311808.2 2022-10-25

Publications (1)

Publication Number Publication Date
WO2024088263A1 (zh)

Family

ID=86431392

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/126246 WO2024088263A1 (zh) 2022-10-25 2023-10-24 Heterogeneous server system and method of using same

Country Status (2)

Country Link
CN (1) CN116185599A (zh)
WO (1) WO2024088263A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116185599A (zh) * 2022-10-25 2023-05-30 阿里巴巴(中国)有限公司 异构服务器系统及其使用方法
CN117687956B (zh) * 2024-01-31 2024-05-07 苏州元脑智能科技有限公司 多加速卡异构服务器及资源链路重构方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1967517A (zh) * 2005-10-27 2007-05-23 国际商业机器公司 Method and system for a distributed computing system
US20170322899A1 (en) * 2016-05-06 2017-11-09 Quanta Computer Inc. Dynamic pcie switch reconfiguration mechanism
CN109240832A (zh) * 2018-09-25 2019-01-18 中国电子科技集团公司电子科学研究院 Hardware reconfiguration system and method
CN113849431A (zh) * 2021-09-24 2021-12-28 山东云海国创云计算装备产业创新中心有限公司 System topology switching method, apparatus, and medium
CN116185599A (zh) 2022-10-25 2023-05-30 阿里巴巴(中国)有限公司 Heterogeneous server system and method of using same

Also Published As

Publication number Publication date
CN116185599A (zh) 2023-05-30

Similar Documents

Publication Publication Date Title
WO2024088263A1 (zh) Heterogeneous server system and method of using same
US20220103446A1 (en) Techniques to configure physical compute resources for workloads via circuit switching
US20220350483A1 (en) Method and apparatus to enable individual non volatile memory express (nvme) input/output (io) queues on differing network addresses of an nvme controller
US10339047B2 (en) Allocating and configuring persistent memory
US10254987B2 (en) Disaggregated memory appliance having a management processor that accepts request from a plurality of hosts for management, configuration and provisioning of memory
US20200142752A1 (en) Physical partitioning of computing resources for server virtualization
WO2020078470A1 (zh) 片上网络数据处理方法及装置
US10209890B2 (en) Near memory accelerator
US7577755B2 (en) Methods and apparatus for distributing system management signals
CN111630505A (zh) 深度学习加速器系统及其方法
CN116389542A (zh) 具有可配置的池化资源的平台
WO2019067929A1 (en) MULTIPLE CRITERIA POWER MANAGEMENT SYSTEM WITH ACCROGRESSED ACCELERATOR ARCHITECTURES
US20230051825A1 (en) System supporting virtualization of sr-iov capable devices
CN117493237B (zh) 计算设备、服务器、数据处理方法和存储介质
US10496565B2 (en) Micro-architectural techniques to minimize companion die firmware loading times in a server platform
JP2021509240A (ja) システム全体の低電力管理
CN109324899B (zh) 基于PCIe池化硬件资源的编址方法、装置及主控节点
US7418517B2 (en) Methods and apparatus for distributing system management signals
US20220121481A1 (en) Switch for managing service meshes
US20220197819A1 (en) Dynamic load balancing for pooled memory
CN116010331A (zh) 到多个定时域的访问
US20230153157A1 (en) Inter-node communication method and device based on multiple processing nodes
CN111078623B (zh) 片上网络处理系统和片上网络数据处理方法
CN111078625B (zh) 片上网络处理系统和片上网络数据处理方法
CN111078624B (zh) 片上网络处理系统和片上网络数据处理方法