WO2024098291A1 - Software defined rack - Google Patents

Software defined rack

Info

Publication number
WO2024098291A1
Authority
WO
WIPO (PCT)
Prior art keywords
registry
pcie
driver
processors
processor
Prior art date
Application number
PCT/CN2022/130915
Other languages
English (en)
Inventor
Shuotao XU
Peng Cheng
Yuqing Yang
Derek Tsungkai CHIOU
Mark Donald HILL
Ran SHU
Ziyue YANG
Lei QU
Yongqiang Xiong
Original Assignee
Microsoft Technology Licensing, Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing, Llc
Priority to PCT/CN2022/130915
Publication of WO2024098291A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/42 Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F13/4282 Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus
    • G06F13/4295 Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus using an embedded synchronisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/40 Bus structure
    • G06F13/4004 Coupling between buses
    • G06F13/4022 Coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00 Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0026 PCI express

Definitions

  • The present disclosure relates generally to computing, and in particular, to software defined partitions of compute resources.
  • Contemporary computer systems typically include server computers comprising one or more processors coupled to peripheral devices over a dedicated bus. Multiple servers may be placed in racks (i.e., physical structures to hold the circuit boards). Virtualization technologies may run on top of such physical computer structures. With the growth and cost of compute demands, it would be desirable to have more flexibility in how processors and other compute resources are interconnected to perform various functions.
  • Fig. 1 illustrates a computer system for partitioning processors and peripherals according to an embodiment.
  • Fig. 2 illustrates a method of configuring partitioned devices according to an embodiment.
  • Fig. 3 illustrates an example partition and software system according to an embodiment.
  • Fig. 4 illustrates an example registry according to an embodiment.
  • Fig. 5 depicts a simplified block diagram of an example system according to some embodiments.
  • Fig. 6 illustrates an example device allocation process according to an embodiment.
  • Fig. 7 illustrates an example link failure recovery process according to an embodiment.
  • Described herein is a software defined rack system in which multiple processors and peripheral devices are connected together over a high speed off-chip switch fabric.
  • The system may be partitioned into different compute domains as defined by software.
  • One or more processors and one or more peripheral devices may be in each domain for example.
  • Different domains may be configured by software to have different numbers of processors and different numbers and types of peripheral devices according to unique user application requirements, for example.
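  • The domain model above can be sketched in a few lines of Python. This is only an illustration: the `Endpoint` and `Partition` classes, the `validate` helper, and the rule that an endpoint belongs to at most one domain are assumptions made for the sketch, not structures defined by the disclosure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Endpoint:
    """A device on the switch fabric (hypothetical model)."""
    name: str  # illustrative label, e.g. "cpu0"
    kind: str  # "processor", or a peripheral type such as "ssd", "gpu", "fpga"


@dataclass
class Partition:
    """A software defined compute domain: a set of endpoints."""
    name: str
    endpoints: set = field(default_factory=set)

    def processors(self):
        return {e for e in self.endpoints if e.kind == "processor"}

    def peripherals(self):
        return self.endpoints - self.processors()


def validate(partitions):
    """Check the assumed rules: every domain has at least one processor and
    one peripheral, and no endpoint appears in two domains."""
    seen = set()
    for p in partitions:
        if not p.processors() or not p.peripherals():
            return False
        if seen & p.endpoints:
            return False
        seen |= p.endpoints
    return True
```

  • For example, one domain holding `cpu0` and `ssd0` and another holding `cpu1` and `fpga0` would pass `validate`, while two domains that both claim `cpu0` would not.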
  • Fig. 1 illustrates a computer system for partitioning processors and peripherals according to an embodiment.
  • Processors 101 may include x86 processors, ARM processors, artificial intelligence or machine learning optimized processors, or a variety of other types of processors, and may be directly coupled to random access memories (RAM) over high speed memory buses (not shown).
  • x86 processors are a class of central processing units (CPUs).
  • ARM processors are a class of CPUs based on a reduced instruction set computer (RISC) architecture (ARM typically stands for Advanced RISC Machine).
  • Peripheral devices may include solid state memory devices (SSDs), field programmable gate arrays (FPGAs), and/or graphics processing units (GPUs), for example.
  • Switch fabric 150 may be a peripheral component interconnect express (PCIe) switch network comprising a plurality of PCIe switches arranged in a variety of configurations to move data between any of the devices connected to the network, for example.
  • The system includes a PCIe registry 111 and a PCIe driver 112 executing on one or more of the processors. While registry 111 and driver 112 are shown here executing on processor 110, it is to be understood that these software components may execute on different processors in some embodiments.
  • PCIe driver 112 stores state information (also known as driver context).
  • A driver is a software component that allows the operating system (OS) to communicate with the device (e.g., a PCIe switch fabric system).
  • Driver state information may be information the driver uses to support connectivity over the particular partition the driver belongs to.
  • PCIe driver 112 may include state information for (or relevant to) a particular partition and may store partition specific state information (e.g., relevant to a particular partition configuration of devices) .
  • State information may include, for example, routing information. Routing information may include information specifying paths between different endpoint devices (processors or peripherals) on a switch fabric to implement the partitions.
  • PCIe registry 111 may also store the state information.
  • An operating system managing the partitions of the hardware resources (also known as a "Rack OS") may send state information to registry 111, and driver 112 may receive driver state information from the registry.
  • The Rack OS may manage multiple drivers over multiple partitions, for example, and manage the routes between the hardware resources in multiple partitions. For example, when a path between a processor and a peripheral device through switch fabric 150 changes (e.g., a link fails), driver 112 may read new state information from registry 111 to define a new path.
  • Registry 111 comprises a plurality of routing paths between the processors and peripheral devices.
  • Fig. 2 illustrates a method of configuring partitioned devices according to an embodiment.
  • The switch fabric is a PCIe switch network.
  • The PCIe switch network illustrated here is used as an example fabric; those skilled in the art may choose other fabrics for other embodiments.
  • A plurality of processors and a plurality of peripheral devices are coupled together over a peripheral component interconnect express (PCIe) switch network, wherein the plurality of processors and plurality of peripheral devices are configured in a plurality of partitions.
  • A PCIe driver executes on a first processor of the processors, the PCIe driver storing state information.
  • A PCIe registry executes on the first processor of the processors, the PCIe registry storing the state information.
  • The PCIe registry may receive the state information from a Rack OS, and the state information is then read by the PCIe driver, for example.
  • The PCIe driver reads new state information from the PCIe registry when a path between a first processor and a first peripheral device through the PCIe switch network changes. The new state information may be received from the Rack OS, for example.
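  • The interaction in this method (the Rack OS writes state into the registry, and the driver re-reads it when a path changes) might look roughly like the Python sketch below. The `Registry` and `Driver` classes, their method names, the string key, and the representation of a route as a list of switch names are all hypothetical and chosen only for illustration.

```python
class Registry:
    """Stand-in for the PCIe registry: holds state supplied by the Rack OS."""

    def __init__(self):
        self._state = {}

    def write(self, key, route):
        # Called on behalf of the Rack OS (assumed interface).
        self._state[key] = list(route)

    def read(self, key):
        return list(self._state[key])


class Driver:
    """Stand-in for the PCIe driver: keeps its own copy of the state."""

    def __init__(self, registry, key):
        self.registry = registry
        self.key = key
        self.route = registry.read(key)  # initial state information

    def on_path_change(self):
        # Re-read state from the registry to pick up the new path.
        self.route = self.registry.read(self.key)
        return self.route
```

  • In this sketch the driver keeps using its cached route until `on_path_change` is called, which mirrors the idea that the registry, not the driver, is where the Rack OS publishes updated routing.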
  • In the example of Fig. 3, a Rack OS comprises an OS manager 302 and OS kernel 303 executing on one or more processors 301.
  • The Rack OS has created two (2) partitions 390 and 391 of endpoint devices on a PCIe switch fabric 350.
  • Partition 390 comprises two x86 processors 340-341, two SSDs 342-343, and GPU 344.
  • Partition 391 comprises two ARM processors 360-361, SSD 362, and two FPGAs 363-364.
  • Rack OS stores state information in PCIe registry 311, and the state information is read by PCIe driver 312 to control each partition.
  • PCIe registry 311 and PCIe driver 312 may run on the same host processor 310, for example. While only one registry and driver are shown in this example, it is to be understood that there may be two pairs of registries and drivers, one for each partition.
  • OS manager 302 may centralize control of device assignments and routing for each partition. For example, OS manager 302 may partition resources into isolated PCIe domains, configure the OS kernel based on partition definitions, detect failures, manage the failover process, and perform telemetry operations (e.g., hardware status, resource monitoring, etc.).
  • OS kernel 303 may control device assignments and switch fabric failures by interfacing with partition registries, for example.
  • OS kernel 303 may further allow for virtualization to run on top of the OS (e.g., allowing user applications to run on hardware resources in a particular partition).
  • A driver may be extended to include a state manager, which manages driver context.
  • State manager may store driver context in the registry for unplanned driver removal, for example. The state manager may read the context from the registry so the driver can recover when the endpoint resource is back up and running.
  • PCIe driver 312 includes such a state manager 313.
  • The Rack OS may reserve and unreserve registry entries and move driver state information between PCIe bus/device/function (BDF) addresses.
  • The registry may store driver state information (also known as context).
  • Context may include, for example, software states of the device driver, routing/path information, and outstanding requests in the driver that need to be serviced.
  • The Rack OS uses APIs to control the allocation, deallocation, and movement of the driver context.
  • When PCIe device driver 312 initializes, it fetches the driver context from PCIe registry 311, which may include information on how to plug in and how to plug out.
  • When a link fails, a PCIe route may be dead, and the path between two endpoint resources may be moved to another route. Accordingly, the Rack OS reconfigures the link.
  • The Rack OS moves the PCIe driver context stored in the PCIe registry 311 into the PCIe driver 312.
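  • A state manager of the kind described above could be sketched as follows. This is a minimal illustration under stated assumptions: the registry is modeled as a plain dict keyed by a BDF string, and the `StateManager` class, its `save` and `restore` methods, and the context field names are hypothetical.

```python
import copy


class StateManager:
    """Saves driver context to the registry and restores it after recovery."""

    def __init__(self, registry, bdf):
        self.registry = registry  # plain dict standing in for the PCIe registry
        self.bdf = bdf            # BDF address string used as the registry key

    def save(self, context):
        # Store a snapshot so later changes in the driver do not alter it.
        self.registry[self.bdf] = copy.deepcopy(context)

    def restore(self):
        # Returns None when no context was saved for this BDF.
        return copy.deepcopy(self.registry.get(self.bdf))


# Example context with the three kinds of state named in the text.
example_context = {
    "software_state": {"queue_depth": 32},
    "route": ["sw1", "sw2"],
    "outstanding_requests": [101, 102],
}
```

  • Copying the context on both `save` and `restore` is a design choice of this sketch: it keeps the registry copy intact if the driver is removed unexpectedly while it is still mutating its own state.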
  • Fig. 4 illustrates an example registry 400 according to an embodiment.
  • Each entry in registry 400 may correspond to a bus, device, function (BDF) associated with an endpoint (e.g., for read/write) .
  • Each BDF may define a path through the PCIe network to access the remote device.
  • Registry 400 may comprise a plurality of fields 401-403 for storing information related to a particular BDF.
  • Registry 400 includes a reserved field 401 (R) indicating whether or not an entry is reserved for the device, a valid field 402 (V) indicating whether or not an entry is inserted by the driver, and a driver data field 403 storing driver state information (context).
  • PCIe registry 400 comprises a plurality of routing paths between the processors and peripheral devices. Each routing path may be stored at a unique address in PCIe registry 400. Thus, processors and peripheral devices in each partition may be mapped to the unique addresses in the PCIe registry 400.
  • Reserved field 401 designates that the routing path is reserved. Accordingly, in response to detecting a reserved designation, the routing path may be probed to initialize the PCIe driver, for example.
  • Valid field 402 designates that the routing path is valid or invalid. A valid designation indicates that the routing path has been initialized by the PCIe driver, and an invalid designation indicates that it has not.
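  • The registry layout of Fig. 4 can be pictured with a small Python sketch. The `BDF` and `Entry` classes and the `needs_probe` helper are hypothetical names for illustration. The 8-bit bus, 5-bit device, and 3-bit function ranges follow standard PCI addressing; reading "reserved but not yet valid" as "should be probed" is one interpretation of the text, not a rule stated by the disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BDF:
    """Bus/device/function address used as the key of a registry entry."""
    bus: int       # 8 bits
    device: int    # 5 bits
    function: int  # 3 bits

    def __post_init__(self):
        if not (0 <= self.bus < 256 and 0 <= self.device < 32
                and 0 <= self.function < 8):
            raise ValueError("BDF component out of range")

    def __str__(self):
        # Conventional bb:dd.f notation in hexadecimal.
        return f"{self.bus:02x}:{self.device:02x}.{self.function:x}"


@dataclass
class Entry:
    reserved: bool = False              # field 401 (R)
    valid: bool = False                 # field 402 (V)
    driver_data: Optional[dict] = None  # field 403 (driver context)


def needs_probe(entry):
    """Assumed rule: a reserved entry the driver has not yet initialized."""
    return entry.reserved and not entry.valid
```

  • A registry in this sketch would simply be a dict mapping `BDF` keys to `Entry` values, so each routing path sits at a unique address as described above.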
  • Fig. 5 depicts a simplified block diagram of an example system 500, which can be used to implement the techniques described in the foregoing disclosure.
  • System 500 includes one or more processors 502 that communicate with a number of devices via a switch fabric 504. These devices may include a storage subsystem 506 (e.g., comprising a memory subsystem 508 and a file storage subsystem 510) and a network interface subsystem 516.
  • Switch fabric 504 can provide a mechanism for letting the various components and subsystems of system 500 communicate with each other as intended. Although switch fabric 504 is shown schematically as a single component, alternative embodiments of the switch fabric can utilize multiple components.
  • Network interface subsystem 516 can serve as an interface for communicating data between system 500 and other computer systems or networks.
  • Embodiments of network interface subsystem 516 can include, e.g., Ethernet, a Wi-Fi and/or cellular adapter, a modem (telephone, satellite, cable, etc.), and/or the like.
  • Storage subsystem 506 includes a memory subsystem 508 and a file/disk storage subsystem 510.
  • Subsystems 508 and 510 as well as other memories described herein are examples of non-transitory computer-readable storage media that can store executable program code and/or data that provide the functionality of embodiments of the present disclosure.
  • Memory subsystem 508 comprises one or more memories including a main random access memory (RAM) 518 for storage of instructions and data during program execution and a read-only memory (ROM) 520 in which fixed instructions are stored.
  • File storage subsystem 510 can provide persistent (e.g., non-volatile) storage for program and data files, and can include a magnetic or solid-state hard disk drive, an optical drive along with associated removable media (e.g., CD-ROM, DVD, Blu-Ray, etc.), a removable flash memory-based drive or card, and/or other types of storage media known in the art.
  • System 500 is illustrative and many other configurations having more or fewer components than system 500 are possible.
  • Fig. 6 illustrates an example device allocation process according to an embodiment.
  • In this example, a device is assigned to a machine.
  • Switch 601 performs PCI registry reserves, route calculations, and device assignments.
  • OS 602 interfaces with the switch 601 and host machine 603 running the registry and driver to coordinate the routing information.
  • OS 602 may start a routing calculation and reserve a BDF in a registry on host 603 as shown at 610.
  • Host 603 sets the reserved field in the BDF entry to true and sends an ACK.
  • OS 602 completes the routing calculation at 611 and sends a device assignment message to switch 601.
  • Switch 601 assigns the device at 612.
  • A message is sent to the host to perform a probe of the device. When the probe is finished, the entry in the registry is set to valid at 613.
  • A probe done confirmation message is sent to the OS 602 and the OS sets the device allocation as done at 614.
  • Device deallocation works in a similar manner.
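  • The allocation handshake of Fig. 6 can be traced with a short Python sketch. Everything here is an illustrative assumption: the registry is a dict of entries keyed by a BDF string, the `allocate` function and its `probe` callback are hypothetical, the messages between OS, switch, and host are collapsed into direct steps, and the handling of an already reserved entry or a failed probe is not described in the text.

```python
def allocate(registry, bdf, probe):
    """Walk one device allocation through steps 610 to 614.

    registry maps a BDF string to {"reserved": bool, "valid": bool}.
    probe is a callable standing in for the host's device probe.
    """
    steps = []
    entry = registry.setdefault(bdf, {"reserved": False, "valid": False})
    if entry["reserved"]:
        # Assumed behavior: refuse to reserve the same BDF twice.
        raise RuntimeError(f"{bdf} is already reserved")

    entry["reserved"] = True               # 610: host reserves the BDF, ACKs
    steps.append("610 BDF reserved")
    steps.append("611 route calculated")   # OS finishes routing, tells switch
    steps.append("612 device assigned")    # switch assigns the device

    if not probe(bdf):                     # host probes the device
        entry["reserved"] = False          # assumed rollback on failure
        raise RuntimeError(f"probe of {bdf} failed")
    entry["valid"] = True                  # 613: entry marked valid
    steps.append("613 probe done")
    steps.append("614 allocation done")    # OS records completion
    return steps
```

  • Calling `allocate({}, "03:00.0", lambda bdf: True)` walks the five steps in order and leaves the entry both reserved and valid.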
  • Fig. 7 illustrates an example link failure recovery process according to an embodiment.
  • Host machine 703, running a host software system including the PCIe registry and PCIe driver, may detect a route failure and send a request for a re-route to OS 702 at 710.
  • OS 702 may release the device, start a re-route calculation, and notify switch 701. The release is completed at 711.
  • Host 703 clears the failed route in the PCIe driver.
  • Host 703 may perform a Pci_stateful_remove and set the entry to false, for example, at 712.
  • OS 702 completes the re-route calculation at 713 and signals the host 703 to move the BDF entry, which is performed at 714.
  • Host 703 may move at least a portion of the state information describing the failed route from a first BDF address to a second BDF address in the PCIe registry, for example.
  • Upon receiving an ACK from host 703, OS 702 sends a message to switch 701 to reassign the device at 715.
  • Host 703 may initialize the PCIe driver with the new available route from the second address in the PCIe registry. For example, host 703 may recover a probe at 716.
  • Host 703 may send a message to OS 702 indicating that the re-route is complete at 717.
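  • The host side of the recovery flow in Fig. 7 might be sketched as below. This is a simplified illustration, not the disclosed implementation: the `reroute` function, the dict-based registry, the `probe` callback, and the choice to clear the old entry after the move are assumptions, and the messages exchanged with the OS and the switch are omitted.

```python
def reroute(registry, old_bdf, new_bdf, probe):
    """Move driver context from a failed route's entry to a new entry.

    registry maps a BDF string to
    {"reserved": bool, "valid": bool, "driver_data": dict or None}.
    """
    failed = registry[old_bdf]
    failed["valid"] = False                 # 712: stateful remove

    context = failed["driver_data"]
    registry[new_bdf] = {                   # 714: move the BDF entry
        "reserved": True,
        "valid": False,
        "driver_data": context,
    }
    failed["reserved"] = False              # assumed cleanup of the old entry
    failed["driver_data"] = None

    if not probe(new_bdf):                  # 716: probe over the new route
        raise RuntimeError(f"probe of {new_bdf} failed")
    registry[new_bdf]["valid"] = True
    return registry[new_bdf]
```

  • Because the context travels with the entry, outstanding requests recorded before the failure remain available to the driver once it is initialized on the second BDF address.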
  • The present disclosure may be implemented as a system (e.g., an electronic computation system), method (e.g., carried out on one or more systems), or a non-transitory computer-readable medium (CRM) storing a program executable by one or more processors, the program comprising sets of instructions for performing certain processes described above or hereinafter.
  • The present disclosure includes a computer system comprising: a plurality of processors; a plurality of peripheral devices; a switch network coupled between the plurality of processors and the plurality of peripheral devices, wherein the plurality of processors and plurality of peripheral devices are configured in a plurality of partitions.
  • The present disclosure includes a driver executing on a processor, the driver storing state information, and a registry executing on a processor, the registry storing the state information, wherein the driver reads new state information from the registry when a path between a first processor and a first peripheral device through the switch network changes.
  • The present disclosure includes a method of configuring partitioned devices.
  • The present disclosure includes a non-transitory computer-readable medium storing a program executable by one or more processors, the program comprising sets of instructions. The method and/or sets of instructions comprise: coupling a plurality of processors and a plurality of peripheral devices together over a peripheral component interconnect express (PCIe) switch network, wherein the plurality of processors and plurality of peripheral devices are configured in a plurality of partitions; executing a PCIe driver on a first processor of the processors, the PCIe driver storing state information; and executing a PCIe registry on the first processor of the processors, the PCIe registry storing the state information, wherein the PCIe driver reads new state information from the PCIe registry when a path between a first processor and a first peripheral device through the PCIe switch network changes.
  • The switch network is a peripheral component interconnect express (PCIe) switch network, the driver is a PCIe driver, and the registry is a PCIe registry.
  • The state information is routing information.
  • The driver reads new routing information from the registry when a route between the first processor and the first peripheral device, through the switch network, in a first partition changes.
  • The registry receives routing information from an operating system managing a plurality of partitions of the plurality of processors and the plurality of peripheral devices.
  • Each partition comprises at least one processor and a first plurality of peripheral devices coupled together through a portion of the switch network.
  • The registry comprises a plurality of routing paths between the processors and peripheral devices.
  • Each routing path of the plurality of routing paths is stored at a unique address in the registry.
  • The plurality of processors and the plurality of peripheral devices in each partition are mapped to the unique addresses in the registry.
  • Each entry in the registry comprises a field designating that the routing path is reserved, wherein in response to detecting a reserved designation, the routing path is probed to initialize the driver.
  • Each entry in the registry comprises a field designating that the routing path is valid or invalid, wherein a valid designation has been initialized by the driver and an invalid designation has not been initialized by the driver.
  • One or more of the plurality of processors comprise one or more of an x86 processor, an ARM processor, or an artificial intelligence optimized processor.
  • One or more of the plurality of peripheral devices comprise one or more of a solid state drive, a graphics processing unit, or a field programmable gate array.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Hardware Redundancy (AREA)

Abstract

Embodiments of the present disclosure include techniques for managing partitions of hardware resources in a computer system. In one embodiment, the present disclosure includes a plurality of processors and a plurality of peripheral devices coupled together over a switch fabric. A host for a partition may include a driver and a registry. An OS and the driver may communicate with the registry. State information stored in the driver and the registry may be coordinated for recovery from link failures and other path changes in the partition, for example.
PCT/CN2022/130915 2022-11-09 2022-11-09 Software defined rack WO2024098291A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/130915 WO2024098291A1 (fr) 2022-11-09 2022-11-09 Software defined rack

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/130915 WO2024098291A1 (fr) 2022-11-09 2022-11-09 Software defined rack

Publications (1)

Publication Number Publication Date
WO2024098291A1 2024-05-16

Family

ID=84361879

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/130915 WO2024098291A1 (fr) 2022-11-09 2022-11-09 Software defined rack

Country Status (1)

Country Link
WO (1) WO2024098291A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8055934B1 (en) * 2010-06-22 2011-11-08 International Business Machines Corporation Error routing in a multi-root communication fabric
US20170177528A1 (en) * 2015-12-22 2017-06-22 Intel Corporation Architecture for software defined interconnect switch
US11474916B2 (en) * 2018-08-22 2022-10-18 Intel Corporation Failover of virtual devices in a scalable input/output (I/O) virtualization (S-IOV) architecture

Similar Documents

Publication Publication Date Title
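JP5305848B2 Method, data processing system, and computer program for managing input/output (I/O) virtualization within a data processing system
CN110998523B Physical partitioning of computing resources for server virtualization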
US8141093B2 (en) Management of an IOV adapter through a virtual intermediary in an IOV management partition
US8141094B2 (en) Distribution of resources for I/O virtualized (IOV) adapters and management of the adapters through an IOV management partition via user selection of compatible virtual functions
US10248468B2 (en) Using hypervisor for PCI device memory mapping
CN106776159B Peripheral component interconnect express network system with failover and operation method
US8359415B2 (en) Multi-root I/O virtualization using separate management facilities of multiple logical partitions
US9465760B2 (en) Method and apparatus for delivering MSI-X interrupts through non-transparent bridges to computing resources in PCI-express clusters
US8930507B2 (en) Physical memory shared among logical partitions in a VLAN
US20170102952A1 (en) Accessing data stored in a remote target using a baseboard management controler (bmc) independently of the status of the remote target's operating system (os)
US9304849B2 (en) Implementing enhanced error handling of a shared adapter in a virtualized system
US20120151265A1 (en) Supporting cluster level system dumps in a cluster environment
US20090249366A1 (en) Method, device, and system for seamless migration of a virtual machine between platforms with different i/o hardware
US9423958B2 (en) System and method for managing expansion read-only memory and management host thereof
US8527666B2 (en) Accessing a configuration space of a virtual function
US10956189B2 (en) Methods for managing virtualized remote direct memory access devices
US9378103B2 (en) Coordination techniques for redundant array of independent disks storage controllers
US20140006767A1 (en) Boot strap processor assignment for a multi-core processing unit
US8880582B2 (en) User access to a partitionable server
CN116382913A Resource allocation apparatus and method, electronic device, and storage medium
US9146863B2 (en) Address translation table to enable access to virtualized functions
CN111247508B Network storage architecture
US10747450B2 (en) Dynamic virtual machine memory allocation
US20240037026A1 (en) Memory pooling, provisioning, and sharing
WO2024098291A1 (fr) Software defined rack

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22809638

Country of ref document: EP

Kind code of ref document: A1