CN114090186A - System for managing PCIE (peripheral component interconnect express) equipment based on Openstack platform - Google Patents

System for managing PCIE (peripheral component interconnect express) equipment based on Openstack platform

Info

Publication number
CN114090186A
CN114090186A (application number CN202111421929.8A)
Authority
CN
China
Prior art keywords
pcie
equipment
management
virtual machine
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111421929.8A
Other languages
Chinese (zh)
Inventor
郭尧
蒿杰
吕志丰
赵良田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Guangdong Institute of Artificial Intelligence and Advanced Computing
Original Assignee
Institute of Automation of Chinese Academy of Science
Guangdong Institute of Artificial Intelligence and Advanced Computing
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science, Guangdong Institute of Artificial Intelligence and Advanced Computing filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202111421929.8A
Publication of CN114090186A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533 Hypervisors; Virtual machine monitors
    • G06F 9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/38 Information transfer, e.g. on bus
    • G06F 13/40 Bus structure
    • G06F 13/4004 Coupling between buses
    • G06F 13/4022 Coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533 Hypervisors; Virtual machine monitors
    • G06F 9/45558 Hypervisor-specific management and integration aspects
    • G06F 2009/45562 Creating, deleting, cloning virtual machine instances
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2213/00 Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 2213/0026 PCI express

Abstract

The invention belongs to the technical field of computers, and particularly provides a system for managing PCIE (peripheral component interconnect express) equipment based on an Openstack platform, comprising a management node and one or more computing nodes connected through a communication link. The management node comprises an equipment information base and a first management module; the equipment information base comprises a whitelist device list and a corresponding PCIE equipment state. The computing node comprises one or more PCIE devices and a second management module. The first management module is configured to create a KVM virtual machine based on the PCIE devices in the device information base; before the virtual machine is created, the second management module virtualizes the PCIE device selected and configured by the first management module, and transparently transmits the virtualized device to the KVM virtual machine. The invention realizes unified scheduling of heterogeneous devices in a heterogeneous computing platform.

Description

System for managing PCIE (peripheral component interconnect express) equipment based on Openstack platform
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a system for managing PCIE equipment based on an Openstack platform.
Background
With the development of the Internet of Things and artificial intelligence technology, cloud computing requirements have become more diversified and the hardware facilities of cloud computing platforms more complex. Users need to build dedicated computing clusters on the cloud computing platform; a high-performance heterogeneous computing platform comprises GPU (graphics processing unit) cloud servers, FPGA (field programmable gate array) servers, NPU (neural processing unit) servers, and the like, and hundred-gigabit high-speed network cards are needed to build the internal virtual network. GPUs, FPGAs, NPUs, high-speed network cards, and so on are installed in the physical server through the PCIE interface, so the heterogeneous computing platform needs to manage PCIE devices uniformly.
Libvirt api is a unified management interface running in the computing node's server system; through it, other applications can implement configuration management, life-cycle management, network access management, and storage management of virtual machines.
An Openstack cloud computing platform is a mainstream IaaS cloud computing solution, and native Openstack supports adding some PCIE devices and creating virtual machines. Openstack is a system made up of multiple loosely coupled components, in which the Nova component manages compute instances through libvirt api. To add and manage a PCIE device in native Openstack, an administrator must first virtualize the selected device on the server and then modify the configuration file of the Nova component; after the modification, the physical server must be restarted and the instance re-created, and a new PCIE device cannot be added to an already-created virtual machine instance. Different PCIE devices include GPUs from different manufacturers, FPGA boards, NPU boards, and so on, and because their interface parameters differ, the configuration operations to be performed also differ. To manage multiple heterogeneous devices in a heterogeneous computing platform, a unified scheduling method must be implemented. This document therefore provides a method for managing PCIE devices based on an Openstack cloud platform.
Disclosure of Invention
In order to solve the above problems in the prior art, that is, to solve the problem that the management of multiple heterogeneous PCIE devices in a heterogeneous computing platform cannot realize unified scheduling, the present invention provides a system for managing PCIE devices based on an Openstack platform, including a management node and one or more computing nodes; the management node and the computing node are connected through a communication link;
the management node comprises an equipment information base and a first management module; the equipment information base comprises a whitelist device list and a corresponding PCIE equipment state;
the computing node comprises one or more PCIE devices and a second management module;
the first management module is configured to create a KVM virtual machine based on the PCIE devices in the device information base; before the virtual machine is created, the second management module virtualizes the PCIE device selected and configured by the first management module, and transparently transmits the virtualized device to the KVM virtual machine.
In some preferred embodiments, the whitelist device list includes a plurality of pieces of entered PCIE device identification information; the PCIE device identification information includes vendor ID information, device ID information, and a device alias.
In some preferred embodiments, the PCIE device status includes slot location information and usage status information of the device.
In some preferred embodiments, the "creating a KVM virtual machine based on the PCIE devices in the device information base" includes:
selecting additional PCIE equipment from the equipment information base through the first management module;
performing transparent transmission virtualization on the selected additional PCIE equipment through the second management module;
and generating a KVM virtual machine through libvirt tool scheduling and mounting the corresponding PCIE equipment.
In some preferred embodiments, after the virtual machine is created, the state of the PCIE devices added to it is updated in the device information base to the used state;
when the virtual machine is deleted, the management node schedules the deletion of the virtual machine through the libvirt tool and at the same time updates the state of the released PCIE devices to the unused state.
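The used/unused bookkeeping described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the class and method names (`DeviceInfoBase`, `allocate`, `release`) are assumptions.

```python
# Illustrative sketch of the device information base's state tracking:
# a device becomes "used" when a virtual machine mounts it and "unused"
# again when the virtual machine is deleted.

class DeviceInfoBase:
    def __init__(self):
        # key: (compute_node, slot), value: device record
        self.devices = {}

    def register(self, node, slot, vendor_id, device_id, alias):
        self.devices[(node, slot)] = {
            "vendor_id": vendor_id, "device_id": device_id,
            "alias": alias, "state": "unused",
        }

    def allocate(self, node, slot):
        # Called when a VM that mounts this device is created.
        dev = self.devices[(node, slot)]
        if dev["state"] == "used":
            raise RuntimeError("device already mounted on a virtual machine")
        dev["state"] = "used"
        return dev

    def release(self, node, slot):
        # Called when the VM holding this device is deleted.
        self.devices[(node, slot)]["state"] = "unused"

base = DeviceInfoBase()
base.register("node-1", "0000:05:00.0", "10de", "1e07", "gpu-a")
base.allocate("node-1", "0000:05:00.0")   # VM created  -> used
base.release("node-1", "0000:05:00.0")    # VM deleted -> unused
```

A second `allocate` on an already-used device raises, which mirrors the exclusive nature of passthrough devices.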
In some preferred embodiments, the management node further includes a human-computer interaction module, configured to enter information and display the information according to an entered instruction.
In some preferred embodiments, in "performing information presentation according to an entry instruction," the presented information includes the computing node, slot position information, and usage state information of each PCIE device in the device information base.
In some preferred embodiments, after the whitelist device list is updated, the management node sends a message through RabbitMQ to push the updated whitelist to all the computing nodes. RabbitMQ is an open-source implementation of AMQP (Advanced Message Queuing Protocol) developed in Erlang. The world of synchronous message communication has many published standards (such as CORBA's IIOP, or SOAP), but asynchronous message processing long had only commercial implementations from large vendors (such as Microsoft's MSMQ or IBM's WebSphere MQ); the open AMQP standard was therefore established jointly by Cisco, Red Hat, iMatix, and others in response to broad demand.
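The broadcast step can be sketched as serializing the whitelist before publishing. The patent does not specify the wire format, so the JSON payload and field names below are assumptions for illustration.

```python
import json

def build_whitelist_message(entries):
    """Serialize the whitelist for broadcast to all computing nodes.
    (Illustrative payload shape; not specified by the patent.)"""
    return json.dumps({"type": "whitelist_update", "entries": entries})

msg = build_whitelist_message([
    {"vendor_id": "10de", "device_id": "1e07", "alias": "gpu-a"},
])
# The management node would then publish `msg` through RabbitMQ, e.g. to a
# fanout exchange so every computing node's queue receives the update
# (with pika: channel.basic_publish(exchange="whitelist", routing_key="", body=msg)).
```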
In some preferred embodiments, the device information base obtains a device name and a device manufacturer based on the device ID and the manufacturer ID of the PCIE device, and performs classification management according to a preset classification rule.
In some preferred embodiments, the preset classification rule includes classification based on the device name, or classification based on the device manufacturer, or comprehensive classification based on the device name and the device manufacturer.
The invention has the beneficial effects that:
according to the method, the computing node can count the PCIE equipment information and send the PCIE equipment information to the management node, and a user and an administrator can observe the PCIE equipment information and the use state on the whole cloud computing platform system in real time through a human-computer interaction interface; when a user creates a virtual machine, PCIE equipment can be selectively added to the virtual machine on a user interface; when a user creates a virtual machine, the computing node performs virtualization on the selected PCIE device, then creates the virtual machine, and updates the state of the PCIE device to be used. The invention can carry out unified scheduling management on various heterogeneous devices in the heterogeneous computing platform.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
fig. 1 is a schematic diagram of a system for managing PCIE devices based on an Openstack platform according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a system for managing PCIE devices based on an Openstack platform according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention relates to a system for managing PCIE equipment based on an Openstack platform, which comprises a management node and one or more computing nodes as shown in figure 1; the management node and the computing node are connected through a communication link;
the management node comprises an equipment information base and a first management module; the equipment information base comprises a whitelist device list and a corresponding PCIE equipment state;
the computing node comprises one or more PCIE devices and a second management module;
the first management module is configured to create a KVM virtual machine based on the PCIE devices in the device information base; before the virtual machine is created, the second management module virtualizes the PCIE device selected and configured by the first management module, and transparently transmits the virtualized device to the KVM virtual machine.
In order to more clearly describe the system for managing PCIE devices based on the Openstack platform, an embodiment of the present invention is described in detail below with reference to fig. 2.
The system for managing PCIE equipment based on the Openstack platform comprises a management node and one or more computing nodes, wherein the management node comprises a first network node and a second network node; the management node and the compute node are connected by a communication link. The management node and the computing node respectively have own operating systems. The management node comprises an equipment information base, a first management module and a man-machine interaction module; the computing node comprises one or more PCIE devices and a second management module.
In this embodiment, the management node may be a computer system, the human-computer interaction module is a web service system, and the input and display content of human-computer interaction can be operated and shown through a web page. The computing nodes are provided with high-performance memory, CPUs, storage, and PCIE device resources, and run the computing service.
The equipment information base comprises a whitelist device list and a corresponding PCIE equipment state; the human-computer interaction module is used for entering information and displaying information according to an entered instruction; the first management module is configured to create a KVM virtual machine based on the PCIE devices in the device information base, and before the virtual machine is created, the second management module virtualizes the PCIE devices selected and allocated by the first management module and transparently transmits the virtualized devices to the KVM virtual machine.
In this embodiment, a whitelist policy is adopted; the user may add whitelist entries on the operation interface of the human-computer interaction module, each including: device ID, vendor ID, and device alias (an alias the user defines for the device). After the user confirms, the management node issues the whitelist to all the computing nodes. The computing service running on each computing node periodically reads the PCIE devices on the server through libvirt, filters and screens them according to the whitelist, and writes the devices' position information, ID information, and usage state into the equipment information base of the management node.
The white list adding process is described in detail as follows:
(1) the user adds the PCIE equipment to be added to the white list on the web page. The data structure of the white list includes the vendor ID (vendor ID) of the PCIE device, the device product ID (device ID), and the alias label defined by the user for the device.
(2) After the user adds entries to the whitelist, the management node sends a message through RabbitMQ to update the whitelist on all the computing nodes. RabbitMQ is open-source message broker software (also known as message-oriented middleware) that implements the Advanced Message Queuing Protocol (AMQP).
(3) All the computing nodes periodically collect PCIE device information, including each device's slot position, vendor ID, device product ID, label (the alias the user has defined for the device), and the NUMA node of the associated CPU (central processing unit), and update this information into the management node's equipment information base. numa_node is a parameter associated with a CPU slot; to optimize device performance, the PCIE device, the CPU cores, and the memory should be in the same NUMA node.
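The periodic statistics step above amounts to filtering the node's enumerated PCIE devices against the whitelist and attaching the user's label. The sketch below illustrates that logic; all field names, the function name, and the sample data are assumptions, not the patent's code.

```python
def filter_by_whitelist(devices, whitelist):
    """Keep only PCIE devices whose (vendor_id, device_id) appears in the
    whitelist, attaching the user-defined alias as `label`."""
    allowed = {(w["vendor_id"], w["device_id"]): w["alias"] for w in whitelist}
    report = []
    for dev in devices:
        key = (dev["vendor_id"], dev["device_id"])
        if key in allowed:
            report.append({
                "slot": dev["slot"],            # PCI slot position
                "vendor_id": dev["vendor_id"],
                "device_id": dev["device_id"],
                "label": allowed[key],          # user-defined alias
                "numa_node": dev["numa_node"],  # CPU-affinity hint
                "state": "unused",
            })
    return report

# Sample enumeration of a node's PCIE devices (hypothetical data).
devices = [
    {"slot": "0000:05:00.0", "vendor_id": "10de", "device_id": "1e07", "numa_node": 0},
    {"slot": "0000:06:00.0", "vendor_id": "8086", "device_id": "10d3", "numa_node": 1},
]
whitelist = [{"vendor_id": "10de", "device_id": "1e07", "alias": "gpu-a"}]
report = filter_by_whitelist(devices, whitelist)  # only the GPU survives
```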
The management node displays the PCIE devices in the device information base on the operation interface of the human-computer interaction module and classifies them according to their ID information; for example, they may be divided into types such as GPU, FPGA, NPU, and network card, shown together with each device's type, its computing node and slot position, and whether it is occupied. The user can see the information of all whitelisted devices on the operation interface. For classification, the device vendor can be looked up by vendor ID and the product name by device product ID, and the classification then performed by device vendor and/or product name.
In this embodiment, the equipment information base obtains the device name and the device manufacturer from the PCIE device's device ID and vendor ID, respectively, and performs classification management according to a preset classification rule. The preset classification rule classifies by device name, by device manufacturer, or by both together. For example, devices may be divided into types such as GPU, FPGA, NPU, and network card; into manufacturer categories A, B, and C; or each manufacturer's different network-card models may be classified separately.
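The three preset classification rules can be sketched as follows. The lookup tables stand in for a real vendor/device database such as pci.ids and contain only hypothetical sample entries.

```python
# Hypothetical lookup tables; a real system would resolve names from the
# PCI ID database (pci.ids) using vendor ID and device product ID.
VENDORS = {"10de": "NVIDIA", "1022": "AMD", "8086": "Intel"}
DEVICE_TYPES = {("10de", "1e07"): "GPU", ("8086", "1572"): "network card"}

def classify(dev, rule="type"):
    """Classify a device by name/type, by manufacturer, or by both,
    mirroring the three preset classification rules."""
    vendor = VENDORS.get(dev["vendor_id"], "unknown vendor")
    kind = DEVICE_TYPES.get((dev["vendor_id"], dev["device_id"]), "other")
    if rule == "type":        # classification based on the device name
        return kind
    if rule == "vendor":      # classification based on the manufacturer
        return vendor
    return f"{vendor} {kind}" # comprehensive classification
```

Grouping a node's device report by `classify(dev, rule)` then yields the GPU/FPGA/NPU/network-card categories shown on the operation interface.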
Based on the system for managing the PCIE equipment based on the Openstack platform, the process of creating the virtual machine is as follows:
step S100, selecting, by the first management module, an additional PCIE device from the device information base.
In this embodiment, a user selects an additional PCIE device through a web page of the management node human-computer interaction module, and the additional PCIE device is used for generating a virtual machine.
The management node then searches all the computing nodes for one that meets the requirements through the Nova component of Openstack. Specifically: after creation starts, the management node reads the PCIE device statistics and the CPU, memory, and storage information from the equipment information base, and selects a suitable computing node according to the scheduling strategy to generate the virtual machine instance.
Step S200, performing transparent transmission virtualization on the selected additional PCIE device through the second management module.
In the computing node selected in step S100, the second management module performs passthrough virtualization on the selected additional PCIE device; virtualization is complete once the device has correctly loaded the vfio-pci driver.
The purpose of virtualizing the PCIE device is to make it usable inside a virtual machine, providing users with heterogeneous computing cloud services such as GPU cloud, FPGA cloud, and NPU cloud. Passthrough (PCI passthrough) is a virtualization mode that detaches a device from the physical server and assigns it directly and exclusively to a virtual machine. Its advantage is that the PCIE device works in the virtual-machine environment with efficiency close to bare metal, making it suitable for high-performance heterogeneous computing scenarios.
Step S300, generating a KVM virtual machine through libvirt api scheduling (libvirt is an existing virtual-machine management tool) and mounting the corresponding PCIE device.
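In libvirt's domain XML, mounting a passthrough PCIE device means appending a `<hostdev>` element to the guest's `<devices>` section. The helper below builds that element; the function name and sample address are illustrative, while the XML shape follows libvirt's documented PCI-passthrough format.

```python
def hostdev_xml(domain, bus, slot, function):
    """Build the libvirt <hostdev> element that passes a PCIE device
    through to a KVM guest (appended to the domain XML's <devices>)."""
    return (
        "<hostdev mode='subsystem' type='pci' managed='yes'>"
        "<source>"
        f"<address domain='0x{domain:04x}' bus='0x{bus:02x}' "
        f"slot='0x{slot:02x}' function='0x{function:x}'/>"
        "</source>"
        "</hostdev>"
    )

# Device at physical location 05:00.0 (the example used later in this text).
xml = hostdev_xml(0x0000, 0x05, 0x00, 0x0)
```

A management service would attach this fragment via `virsh attach-device` or include it in the domain definition passed to libvirt api when the virtual machine is created.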
After the virtual machine is created, the method further comprises the following steps:
after the virtual machine is created, the state of the PCIE devices added to it is updated in the equipment information base to the used state;
when the virtual machine is deleted, the management node schedules the deletion through libvirt and at the same time updates the state of the released PCIE devices to the unused state.
In this embodiment, the computing nodes periodically upload PCIE device statistics according to the user-defined whitelist; on a request from the control node, a computing node meeting the conditions is found and the required virtual machine instance is created; before the virtual machine is created, the selected PCIE device is virtualized and passed through to the KVM virtual machine. This achieves unified management of PCIE devices without differentiated management or system software configuration for different types of PCIE devices, providing a way to build heterogeneous computing clusters and offer heterogeneous computing cloud services.
In order to further explain the technical solution of the present invention, the following is further illustrated by a specific example.
The premise preparation for virtualization is as follows:
and starting hardware auxiliary virtualization in the BIOS, and installing a linux operating system.
An IOMMU (i/o memory management unit) module is launched in the kernel.
If an Intel CPU is used, add the following in /etc/default/grub: GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"
If an AMD CPU is used: GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1"
Regenerate GRUB's boot-menu configuration file with the sudo update-grub command, then restart the server.
The virtualization method specifically comprises the following steps:
and separating PCIE equipment from a physical server, loading a PCIE virtualization driver VFIO-PCI, and using a virsh detach command in a virsh tool for a virtual machine based on kvm virtualization.
GPU devices and other PCIE devices often carry additional functions such as audio devices and USB interfaces, and a PCIE device belongs to an IOMMU group by default. When using passthrough, a single device in a shared IOMMU group cannot be passed through alone to the KVM virtual machine; doing so causes the passthrough to fail. Therefore, the method separates multiple devices in one IOMMU group into different IOMMU groups.
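The isolation requirement above is easy to check programmatically: on Linux, group membership is visible under `/sys/kernel/iommu_groups/<n>/devices/`. The sketch below applies the rule to sample data rather than reading sysfs; the function name and sample groups are illustrative.

```python
def passthrough_safe(groups, device):
    """A device can be passed through alone only if it is the sole member
    of its IOMMU group. `groups` maps group number -> list of PCI addresses
    (the shape of /sys/kernel/iommu_groups/<n>/devices/ on Linux)."""
    for members in groups.values():
        if device in members:
            return len(members) == 1
    return False  # unknown device: not safe to pass through

# Hypothetical sample: group 23 is isolated; group 24 holds a GPU plus
# its audio function, so neither member may be passed through alone.
groups = {
    "23": ["0000:05:00.0"],
    "24": ["0000:06:00.0", "0000:06:00.1"],
}
```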
Assume the device ID is 10de:1e07 and the physical location is 05:00.0.
Load the VFIO-PCI driver for the designated PCIE device: sudo sh -c "echo 10de 1e07 > /sys/bus/pci/drivers/vfio-pci/new_id"
Unbind the device from the physical machine: virsh nodedev-detach pci_0000_05_00_0
After virtualization succeeds, run lspci -nnv in the Linux system to check whether the vfio-pci driver has loaded normally. If the displayed information includes "Kernel driver in use: vfio-pci", the driver is loaded normally.
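This driver check can be automated by parsing the lspci -nnv output. The sample output below is an abbreviated illustration of the real format, and the parser is a sketch, not part of the patented system.

```python
# Abbreviated sample of `lspci -nnv` output for one device (illustrative).
SAMPLE = """\
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1e07]
\tSubsystem: NVIDIA Corporation Device [10de:12fa]
\tKernel driver in use: vfio-pci
"""

def driver_in_use(lspci_output):
    """Return the 'Kernel driver in use' value, or None if absent."""
    for line in lspci_output.splitlines():
        line = line.strip()
        if line.startswith("Kernel driver in use:"):
            return line.split(":", 1)[1].strip()
    return None
```

A computing node's management service could run this check after binding and report failure if the result is not `vfio-pci`.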
After confirming that the driver in use is vfio-pci and that the device's IOMMU group contains only that one device, Openstack calls libvirt api to create the KVM virtual machine and mount the selected PCIE device.
After the creation is completed, the device information base will modify the use state of the PCIE device to be used.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program, when executed by a Central Processing Unit (CPU), performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical solutions of the present invention have thus been described with reference to the preferred embodiments shown in the drawings; however, those skilled in the art will readily understand that the scope of the present invention is obviously not limited to these specific embodiments. Those skilled in the art may make equivalent changes or substitutions to the related technical features without departing from the principle of the present invention, and the technical solutions after such changes or substitutions shall fall within the protection scope of the present invention.

Claims (10)

1. A system for managing PCIE devices based on an OpenStack platform, characterized by comprising a management node and one or more computing nodes, the management node and the computing nodes being connected through a communication link;
the management node comprises a device information base and a first management module; the device information base comprises a whitelist device list and the corresponding PCIE device states;
each computing node comprises one or more PCIE devices and a second management module;
the first management module is configured to create a KVM virtual machine based on the PCIE devices in the device information base; before the virtual machine is created, the second management module virtualizes the PCIE devices selected and configured by the first management module and passes the virtualized devices through to the KVM virtual machine.
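As an illustrative, non-limiting sketch of the node roles described in claim 1 — all class and field names here are hypothetical, not drawn from the patent — the management node's device information base and a compute node's PCIE device inventory could be modeled as:

```python
from dataclasses import dataclass, field

@dataclass
class PcieDevice:
    device_id: str       # PCIE device ID, e.g. "1db6"
    vendor_id: str       # PCIE vendor ID, e.g. "10de"
    slot: str            # slot position on the host, e.g. "0000:3b:00.0"
    in_use: bool = False

@dataclass
class ComputeNode:
    name: str
    devices: list = field(default_factory=list)  # PCIE devices on this node

@dataclass
class ManagementNode:
    # the "device information base": whitelist plus per-device state
    whitelist: list = field(default_factory=list)
    device_states: dict = field(default_factory=dict)

    def register(self, node: ComputeNode) -> None:
        """Record each compute node's PCIE devices keyed by (node, slot)."""
        for dev in node.devices:
            self.device_states[(node.name, dev.slot)] = dev

node = ComputeNode("compute-1", [PcieDevice("1db6", "10de", "0000:3b:00.0")])
mgmt = ManagementNode()
mgmt.register(node)
```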
2. The system for managing PCIE devices based on an OpenStack platform according to claim 1, wherein the whitelist device list comprises a plurality of entered PCIE device identification entries, each comprising a device ID, a vendor ID, and a device alias.
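A minimal sketch of a whitelist entry with the three fields named in claim 2 (device ID, vendor ID, device alias). The field names and the validation helper are assumptions for illustration; the structure merely resembles, and is not taken from, OpenStack Nova's `[pci] alias` configuration.

```python
def make_whitelist_entry(device_id: str, vendor_id: str, alias: str) -> dict:
    """Build one whitelist device entry; validates the two PCI IDs."""
    for name, value in (("device_id", device_id), ("vendor_id", vendor_id)):
        # PCI device and vendor IDs are 16-bit values written as 4 hex digits
        if len(value) != 4:
            raise ValueError(f"{name} must be 4 hex digits, got {value!r}")
        int(value, 16)  # raises ValueError if not hexadecimal
    return {"device_id": device_id, "vendor_id": vendor_id, "alias": alias}

entry = make_whitelist_entry("1db6", "10de", "gpu-v100")
```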
3. The system for managing PCIE devices based on an OpenStack platform according to claim 1, wherein the PCIE device state comprises slot position information and usage state information of the device.
4. The system for managing PCIE devices based on an OpenStack platform according to claim 1, wherein creating a KVM virtual machine based on the PCIE devices in the device information base comprises:
selecting the PCIE devices to be attached from the device information base through the first management module;
performing passthrough virtualization on the selected PCIE devices through the second management module; and
generating a KVM virtual machine through the libvirt API and mounting the corresponding PCIE devices.
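To mount a PCIE device into a KVM virtual machine, libvirt expresses PCI passthrough as a `<hostdev mode='subsystem' type='pci'>` element in the domain XML. The helper below, a sketch with an illustrative name and simplified address parsing, builds that fragment from a slot address; a real implementation would embed it in the full domain XML passed to the libvirt API.

```python
def hostdev_xml(pci_addr: str) -> str:
    """Turn a PCI address 'dddd:bb:ss.f' into a libvirt <hostdev> element."""
    domain, bus, rest = pci_addr.split(":")
    slot, function = rest.split(".")
    return (
        "<hostdev mode='subsystem' type='pci' managed='yes'>"
        "<source>"
        f"<address domain='0x{domain}' bus='0x{bus}' "
        f"slot='0x{slot}' function='0x{function}'/>"
        "</source>"
        "</hostdev>"
    )

xml = hostdev_xml("0000:3b:00.0")
```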
5. The system for managing PCIE devices based on an OpenStack platform according to claim 4, wherein after a virtual machine is created, the state of each PCIE device attached to the created virtual machine is updated in the device information base to an in-use state;
when the virtual machine is deleted, the management node deletes the virtual machine through the libvirt API and updates the state of each released PCIE device to an unused state.
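The state bookkeeping in claim 5 can be sketched as a two-state transition per device: attached devices become in-use at VM creation, and revert to unused at VM deletion. Class, method, and state names below are illustrative assumptions.

```python
UNUSED, IN_USE = "unused", "in-use"

class DeviceInfoBase:
    """Tracks per-slot usage state of PCIE devices."""

    def __init__(self, slots):
        self.state = {slot: UNUSED for slot in slots}

    def attach(self, slot: str) -> None:
        # called when a VM is created with this device mounted
        if self.state[slot] != UNUSED:
            raise RuntimeError(f"device {slot} already in use")
        self.state[slot] = IN_USE

    def release(self, slot: str) -> None:
        # called when the VM holding this device is deleted
        self.state[slot] = UNUSED

db = DeviceInfoBase(["0000:3b:00.0", "0000:5e:00.0"])
db.attach("0000:3b:00.0")    # VM created with this device
db.release("0000:3b:00.0")   # VM deleted, device freed
db.attach("0000:5e:00.0")    # second device still held by a running VM
```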
6. The system for managing PCIE devices based on an OpenStack platform according to any one of claims 1-5, wherein the management node further comprises a human-computer interaction module for entering information and presenting information according to an input instruction.
7. The system for managing PCIE devices based on an OpenStack platform according to claim 6, wherein the information presented according to an input instruction comprises, for each PCIE device in the device information base, the computing node where the device is located, its slot position information, and its usage state information.
8. The system for managing PCIE devices based on an OpenStack platform according to any one of claims 1-5, wherein after the whitelist device list is updated, the management node sends a message through RabbitMQ to synchronize the updated whitelist to all the computing nodes.
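A sketch of the broadcast in claim 8, limited to constructing the message body. The payload layout and event name are assumptions; an implementation might publish the bytes with pika's `channel.basic_publish()` on a fanout exchange so that every computing node receives the updated whitelist.

```python
import json
import time

def whitelist_update_message(whitelist: list) -> bytes:
    """Serialize a whitelist update for broadcast over the message queue."""
    payload = {
        "event": "whitelist.updated",   # illustrative event name
        "timestamp": int(time.time()),
        "whitelist": whitelist,
    }
    return json.dumps(payload).encode("utf-8")

msg = whitelist_update_message(
    [{"device_id": "1db6", "vendor_id": "10de", "alias": "gpu-v100"}]
)
```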
9. The system for managing PCIE devices based on an OpenStack platform according to claim 2, wherein a device name and a device vendor are obtained based on the device ID and the vendor ID of each PCIE device in the device information base, and classification management is performed according to a preset classification rule.
10. The system for managing PCIE devices based on an OpenStack platform according to claim 9, wherein the preset classification rule comprises classification by device name, classification by device vendor, or combined classification by both device name and device vendor.
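The three classification rules of claim 10 can be sketched as grouping device records by a chosen key. The device records and rule names below are illustrative assumptions.

```python
from collections import defaultdict

def classify(devices: list, rule: str) -> dict:
    """Group devices by name, by vendor, or by (name, vendor) combined."""
    key_funcs = {
        "name": lambda d: d["name"],
        "vendor": lambda d: d["vendor"],
        "name+vendor": lambda d: (d["name"], d["vendor"]),
    }
    key = key_funcs[rule]
    groups = defaultdict(list)
    for dev in devices:
        groups[key(dev)].append(dev)
    return dict(groups)

devices = [
    {"name": "V100", "vendor": "NVIDIA"},
    {"name": "V100", "vendor": "NVIDIA"},
    {"name": "X710", "vendor": "Intel"},
]
by_vendor = classify(devices, "vendor")
```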
CN202111421929.8A 2021-11-26 2021-11-26 System for managing PCIE (peripheral component interface express) equipment based on Openstack platform Pending CN114090186A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111421929.8A CN114090186A (en) 2021-11-26 2021-11-26 System for managing PCIE (peripheral component interface express) equipment based on Openstack platform


Publications (1)

Publication Number Publication Date
CN114090186A true CN114090186A (en) 2022-02-25

Family

ID=80304997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111421929.8A Pending CN114090186A (en) 2021-11-26 2021-11-26 System for managing PCIE (peripheral component interface express) equipment based on Openstack platform

Country Status (1)

Country Link
CN (1) CN114090186A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105242957A (en) * 2015-09-28 2016-01-13 广州云晫信息科技有限公司 Method and system for cloud computing system to allocate GPU resources to virtual machine
CN108055327A (en) * 2017-12-15 2018-05-18 佛山三维二次方科技有限公司 Cloud computing experiment porch based on OpenStack
CN111049686A (en) * 2019-12-20 2020-04-21 北京科东电力控制系统有限责任公司 Safety protection virtual laboratory of power monitoring system and construction method thereof
CN111309440A (en) * 2020-02-16 2020-06-19 苏州浪潮智能科技有限公司 Method and equipment for managing and scheduling multiple types of GPUs
CN112416522A (en) * 2020-11-24 2021-02-26 北京华胜天成科技股份有限公司 Virtual machine control method and device
CN112596859A (en) * 2020-12-25 2021-04-02 创新科技术有限公司 USB transparent transmission method and device
CN113672344A (en) * 2021-08-05 2021-11-19 浪潮云信息技术股份公司 Method for creating flexible distribution of multi-specification vGPU virtual machines based on OpenStack platform


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FAN_LULU: "FPGA PCI passthrough to KVM virtual machines via VFIO", 《HTTPS://BLOG.CSDN.NET/FAN_LULU/ARTICLE/DETAILS/84258780》, 19 November 2018 (2018-11-19), pages 1 - 5 *
糖醋花椒: "Original | GPU virtualization passthrough technology based on the OpenStack cloud platform", 《HTTPS://WWW.SOHU.COM/A/332445559_814235》, 8 August 2019 (2019-08-08), pages 2 - 6 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220225