WO2023226601A1 - Heterogeneous acceleration resource exception handling method and apparatus, storage medium, and electronic apparatus
- Publication number
- WO2023226601A1 (PCT/CN2023/086292)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- resources
- resource
- hardware
- heterogeneous acceleration
- health
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0709—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3051—Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/32—Monitoring with visual or acoustical indication of the functioning of the machine
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/32—Monitoring with visual or acoustical indication of the functioning of the machine
- G06F11/324—Display of status information
- G06F11/327—Alarm or error message display
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
Definitions
- Embodiments of the present disclosure relate to the field of cloud computing, and specifically to a method and apparatus for handling heterogeneous acceleration resource exceptions, a storage medium, and an electronic device.
- Heterogeneous acceleration resources of cloud computing platforms usually include graphics processing units (GPUs), AI accelerator cards (neural-network processing units, NPUs), programmable acceleration cards (field-programmable gate arrays, FPGAs), and smart network cards (Smart NICs). Compared with traditional hardware, these resources come in many types, are easily pluggable, support multiple virtualization methods, are allocated and reclaimed in a unified manner, are used frequently, and carry special services.
- When an abnormality occurs in heterogeneous acceleration hardware and cannot be identified, reported, and recovered in a timely manner, it causes serious losses to the customer services carried on the cloud computing platform.
- For heterogeneous acceleration resources allocated in a virtualized manner, such as GPUs, NPUs, and FPGAs, communication abnormalities may cause reclamation information to be lost or resources to be reclaimed late. The registered allocation of the heterogeneous acceleration resources then tends to become inconsistent with their actual usage, which leads to abnormal resource allocation on the cloud platform and causes losses to the cloud computing platform and its customers.
- Embodiments of the present disclosure provide a method and apparatus for handling heterogeneous acceleration resource exceptions, a storage medium, and an electronic device, to at least solve the problem that related technologies focus only on detecting the ordinary hardware resources of traditional servers and cannot identify inconsistencies between the registration and the actual use of virtualized heterogeneous acceleration resources managed by the cloud computing platform, which causes losses to the cloud computing platform and its users.
- When an abnormality occurs in the heterogeneous acceleration resources, their unhealthy status can be sensed quickly, and timely alarms and recovery ensure the reliability, stability, and timeliness of the cloud platform's management of heterogeneous acceleration resources.
- a method for handling heterogeneous acceleration resource exceptions includes:
- Allocation exception processing is performed on the allocated faulty resource.
- a device for handling heterogeneous acceleration resource exceptions includes:
- the first monitoring module is configured to determine whether each heterogeneous acceleration resource is a hardware-healthy resource or a hardware-unhealthy resource by performing hardware health monitoring on the heterogeneous acceleration resources of the cloud computing platform;
- the second monitoring module is configured to determine whether each heterogeneous acceleration resource is a usage-healthy resource or an allocation-fault resource by performing device usage health monitoring on the heterogeneous acceleration resources;
- the first response module is configured to perform hardware exception processing on the hardware-unhealthy resources;
- the second response module is configured to perform allocation exception processing on the allocation-fault resources.
- a computer-readable storage medium is also provided, and a computer program is stored in the storage medium, wherein the computer program is configured to execute the steps in any of the above method embodiments when run.
- an electronic device including a memory and a processor.
- a computer program is stored in the memory, and the processor is configured to run the computer program to perform the steps in any of the above method embodiments.
- Figure 1 is a hardware structure block diagram of a computer terminal of a method for handling heterogeneous acceleration resource exceptions according to an embodiment of the present disclosure
- Figure 2 is a flow chart of a heterogeneous acceleration resource exception handling method according to an embodiment of the present disclosure
- Figure 3 is a flow chart of a heterogeneous acceleration resource hardware health monitoring method according to an embodiment of the present disclosure
- Figure 4 is a flow chart of a health monitoring method for heterogeneous acceleration resource equipment usage according to an embodiment of the present disclosure
- Figure 5 is a sequence diagram of equipment usage health monitoring and processing according to an optional embodiment of the present disclosure
- Figure 6 is a sequence diagram of heterogeneous acceleration resource exception recovery processing according to an optional embodiment of the present disclosure
- Figure 7 is a block diagram of a heterogeneous acceleration resource exception handling device according to an embodiment of the present disclosure.
- Figure 8 is a heterogeneous acceleration resource health monitoring and exception handling architecture according to an embodiment of the present disclosure.
- Figure 1 is a hardware structure block diagram of a computer terminal of a method for handling heterogeneous acceleration resource exceptions according to an embodiment of the present disclosure.
- the computer terminal may include one or more processors 102 (only one is shown in Figure 1; the processor 102 may include, but is not limited to, a processing device such as a microcontroller unit (MCU) or a field-programmable gate array (FPGA)) and a memory 104 for storing data. The computer terminal may also include a transmission device 106 for communication functions and an input/output device 108.
- Figure 1 is only illustrative, and it does not limit the structure of the above-mentioned computer terminal.
- the computer terminal may also include more or fewer components than shown in FIG. 1 , or have a different configuration than shown in FIG. 1 .
- the memory 104 can be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the heterogeneous acceleration resource exception handling method in the embodiment of the present disclosure.
- the processor 102 runs the computer program stored in the memory 104 , thereby executing various functional applications and business chain address pool slicing processing, that is, implementing the above method.
- Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
- the memory 104 may further include memory located remotely relative to the processor 102, and these remote memories may be connected to the computer terminal through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
- Transmission device 106 is used to receive or send data via a network.
- Specific examples of the above-mentioned network may include a wireless network provided by a communication provider of the computer terminal.
- the transmission device 106 includes a network adapter (Network Interface Controller, NIC for short), which can be connected to other network devices through a base station to communicate with the Internet.
- the transmission device 106 may be a radio frequency (Radio Frequency, RF for short) module, which is used to communicate with the Internet wirelessly.
- FIG. 2 is a flow chart of the method for handling heterogeneous acceleration resource exceptions according to an embodiment of the present disclosure. As shown in Figure 2, the process includes the following steps:
- Step S202: Determine whether each heterogeneous acceleration resource is a hardware-healthy resource or a hardware-unhealthy resource by performing hardware health monitoring on the heterogeneous acceleration resources of the cloud computing platform;
- Step S204: Determine whether each heterogeneous acceleration resource is a usage-healthy resource or an allocation-fault resource by performing device usage health monitoring on the heterogeneous acceleration resources;
- Step S206: Perform hardware exception processing on the hardware-unhealthy resources;
- Step S208: Perform allocation exception processing on the allocation-fault resources.
- In connection with step S202, whether a heterogeneous acceleration resource exists is determined by scanning the PCI slots; if the heterogeneous acceleration resource exists, its resource information is obtained. Specifically, the resource information of the heterogeneous acceleration resources can be identified in combination with the configuration of the cloud computing platform.
- the heterogeneous acceleration resources include: GPU, NPU, FPGA, and Smart NIC.
- the resource information of the heterogeneous acceleration resources may include the PCI address, manufacturer information, device model, device ID, etc., where the PCI address includes the slot number.
- the above step S202 may specifically include: calling the corresponding hardware health detection interface according to the resource information of the heterogeneous acceleration resource; determining the hardware status of the heterogeneous acceleration resource through the hardware health detection interface; if the hardware status is healthy, determining that the heterogeneous acceleration resource is a hardware-healthy resource; and if the hardware status is unhealthy, determining that the heterogeneous acceleration resource is a hardware-unhealthy resource.
- the hardware health detection interfaces of the security-certified heterogeneous acceleration resources in the cloud computing platform can be called cyclically based on the type, manufacturer information, and device model of each heterogeneous acceleration resource, and the hardware status of the heterogeneous acceleration resource is determined through the hardware health detection interface.
- the hardware health monitoring method in step S202 can be performed on each heterogeneous acceleration resource according to a preset hardware health detection cycle.
- FIG. 3 is a flow chart of a heterogeneous acceleration resource hardware health monitoring method according to an embodiment of the present disclosure. As shown in Figure 3, the method specifically includes the following steps:
- Step S302: Scan each heterogeneous acceleration resource in the PCI slots of the computing node and obtain the PCI address of each acceleration resource;
- Step S304: In combination with the cloud platform configuration, identify the manufacturer and model of each specific acceleration resource (GPU, NPU, FPGA, SmartNIC);
- Step S306: Using the PCI address, manufacturer, and model as the core identification parameters, cyclically call the hardware health detection interfaces approved by the cloud platform to determine the hardware status of each heterogeneous acceleration resource;
- Step S308: Determine whether the hardware status of the heterogeneous acceleration resource is healthy; if yes, execute step S310a; if no, execute step S310b;
- Step S310a: Determine that the heterogeneous acceleration resource is a hardware-healthy resource;
- Step S310b: Determine that the heterogeneous acceleration resource is a hardware-unhealthy resource;
- Step S312: Determine whether the current node still has heterogeneous acceleration resources whose hardware health has not yet been checked;
- Step S314: Output the hardware-healthy resources and the hardware-unhealthy resources.
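The loop in steps S302 to S314 can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the `AccelResource` record, the `probes` registry keyed by (vendor, model), and the choice to treat a resource without a registered probe as unhealthy are all assumptions made for illustration, and the PCI scan itself is represented by an already-built list.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class AccelResource:
    """Hypothetical record for one scanned acceleration resource."""
    pci_address: str  # includes the slot number
    kind: str         # "GPU", "NPU", "FPGA" or "SmartNIC"
    vendor: str
    model: str


# Hypothetical registry: (vendor, model) -> platform-approved health probe.
HealthProbe = Callable[[AccelResource], bool]


def monitor_hardware_health(
    resources: List[AccelResource],
    probes: Dict[Tuple[str, str], HealthProbe],
) -> Tuple[List[AccelResource], List[AccelResource]]:
    """Split scanned resources into (hardware-healthy, hardware-unhealthy)."""
    healthy: List[AccelResource] = []
    unhealthy: List[AccelResource] = []
    for res in resources:  # S312: continue until every resource is checked
        probe = probes.get((res.vendor, res.model))  # S306: pick the interface
        # Assumption: a resource with no registered probe counts as unhealthy.
        is_healthy = probe(res) if probe is not None else False  # S308
        (healthy if is_healthy else unhealthy).append(res)  # S310a / S310b
    return healthy, unhealthy  # S314
```

A caller would typically run this function once per preset hardware health detection cycle and pass the unhealthy list to the exception response logic.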
- the above-mentioned step S302 may specifically include: scanning each PCI slot on the computing node to determine whether a physical acceleration resource is installed in the slot; if a physical acceleration resource is installed, obtaining the PCI address corresponding to the acceleration resource. Specifically, only one physical acceleration resource can be installed in each PCI slot.
- the PCI address includes the slot number.
- the types of physical acceleration resources can include: GPU, NPU, FPGA, SmartNIC, etc.
- With the method in this embodiment, it is possible to solve the problem in related technologies that traditional hardware can only be detected by relying on the system itself, so that the detection results for a wide variety of heterogeneous acceleration resources are inaccurate and the resources are inconvenient to manage.
- Calling the interface that corresponds to the manufacturer information and device model not only improves the accuracy of the detection results, but also enables unified management of a wide variety of heterogeneous acceleration resources.
- the above-mentioned step S204 may specifically include: obtaining the allocation data of the heterogeneous acceleration resources; and determining the usage-healthy resources and the allocation-fault resources according to the allocation data.
- Determining the usage-healthy resources and the allocation-fault resources according to the allocation data includes: determining the actual usage data of the heterogeneous acceleration resources; and comparing, in sequence, the allocation data and the actual usage data of each heterogeneous acceleration resource. If the allocation data is consistent with the actual usage data, the heterogeneous acceleration resource is determined to be a usage-healthy resource; otherwise, it is determined to be an allocation-fault resource.
- the device usage health monitoring method in step S204 can be performed according to a preset device usage health detection cycle.
- each heterogeneous acceleration resource can be virtualized and allocated to multiple customers.
- the allocation data includes the allocated customer and the allocated quantity
- the actual usage data includes the using customer and the usage quantity.
- heterogeneous acceleration resources are allocated fault resources.
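The comparison of allocation data with actual usage data can be sketched as follows. This is a minimal illustration under stated assumptions: both data sets are modelled as hypothetical dictionaries mapping a resource ID to a `{customer: quantity}` mapping, and "consistent" is taken to mean that customers and quantities match exactly.

```python
from typing import Dict, List, Tuple

# Hypothetical shape: customer id -> number of virtualized units.
Usage = Dict[str, int]


def classify_by_usage(
    allocation: Dict[str, Usage],
    actual: Dict[str, Usage],
) -> Tuple[List[str], List[str]]:
    """Return (usage-healthy resource ids, allocation-fault resource ids)."""
    usage_healthy: List[str] = []
    allocation_fault: List[str] = []
    for resource_id, allocated in allocation.items():
        used = actual.get(resource_id, {})
        # Consistent only if the same customers hold the same quantities.
        if allocated == used:
            usage_healthy.append(resource_id)
        else:
            allocation_fault.append(resource_id)
    return usage_healthy, allocation_fault
```

Exact equality is the simplest reading of "consistent"; a real platform might tolerate transient differences while an allocation request is still in flight.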
- FIG. 4 is a flow chart of a device usage health monitoring method for heterogeneous acceleration resources according to an embodiment of the present disclosure. As shown in Figure 4, the method specifically includes the following steps:
- Step S402: Call the cloud platform heterogeneous acceleration resource interface to obtain the detailed allocation data of the heterogeneous acceleration resources (including allocated customers, allocated quantity, etc.);
- Step S404: Check each allocated acceleration resource;
- Step S406: Determine whether the corresponding customer exists. If yes, directly execute step S410. If no, execute step S408;
- Step S408: Add the heterogeneous acceleration resource to the allocation fault list, and record the customers with abnormal allocation;
- Step S410: Determine whether there are still heterogeneous acceleration resources that have not yet been checked. If yes, return to step S404. If no, execute step S412;
- Step S412: Output the heterogeneous acceleration resources with allocation faults.
- each heterogeneous acceleration resource can be virtualized and allocated to multiple customers for use.
- Customer types usually include: virtual machines, bare metal, containers, etc.
- the above-mentioned step S406 may specifically include determining whether each virtual machine, bare metal instance, or container to which the heterogeneous acceleration resource has been allocated exists. If all of them exist, it is further determined whether the customers to which the heterogeneous acceleration resource has been allocated are using it normally.
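The existence check in steps S402 to S412 can be sketched as below. The data shapes are assumptions: `allocations` maps a hypothetical resource ID to the list of customers (virtual machines, bare metal instances, containers) registered against it, and `existing_customers` stands in for whatever lookup the cloud platform provides.

```python
from typing import Dict, List, Set


def find_allocation_faults(
    allocations: Dict[str, List[str]],
    existing_customers: Set[str],
) -> Dict[str, List[str]]:
    """Map each faulty resource id to the allocated customers that no longer exist."""
    fault_list: Dict[str, List[str]] = {}
    for resource_id, customers in allocations.items():  # S404 / S410 loop
        # S406: a customer that cannot be found indicates a stale allocation.
        missing = [c for c in customers if c not in existing_customers]
        if missing:
            fault_list[resource_id] = missing  # S408: record the abnormal customers
    return fault_list  # S412
```

Recording the missing customers alongside the resource keeps enough context for the later allocation exception processing to correct the registration.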
- With the method in this embodiment, it is possible to solve the problem in related technologies that, when heterogeneous acceleration resources are allocated through virtualization, the registered resource allocation is inconsistent with the customers' actual usage. Heterogeneous acceleration resources with allocation faults can be identified in a timely manner, which avoids allocating the same virtualized heterogeneous acceleration resources repeatedly to multiple customers and ensures the security and stability of the cloud computing platform.
- performing allocation exception processing on the allocation-fault resources includes updating the allocation data of the allocation-fault resources according to the actual usage data. Specifically, the allocated customers in the allocation data are updated with the using customers in the actual usage data, and the allocated quantity in the allocation data is updated with the usage quantity in the actual usage data.
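A minimal sketch of this update is shown below, assuming the same hypothetical `{resource_id: {customer: quantity}}` layout for both the allocation data and the actual usage data. The function returns a new mapping rather than mutating its input, which is a design choice of the sketch and not something the disclosure specifies.

```python
from typing import Dict, Iterable

# Hypothetical shape: customer id -> number of virtualized units.
Usage = Dict[str, int]


def reconcile_allocation(
    allocation: Dict[str, Usage],
    actual: Dict[str, Usage],
    fault_ids: Iterable[str],
) -> Dict[str, Usage]:
    """Return allocation data in which faulty entries mirror actual usage."""
    updated = {rid: dict(customers) for rid, customers in allocation.items()}
    for rid in fault_ids:
        # Overwrite both the allocated customers and the allocated quantities.
        updated[rid] = dict(actual.get(rid, {}))
    return updated
```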
- FIG. 5 is a sequence diagram of device usage health monitoring and processing according to an optional embodiment of the present disclosure. As shown in Figure 5, the device usage health monitoring and processing method for heterogeneous acceleration resources specifically includes the following steps:
- Step S502: Output the allocation-fault resources according to the device usage health monitoring method;
- Step S504: Call the response module to perform allocation exception processing on the allocation-fault resources;
- Step S506: Update the heterogeneous acceleration resource information of the allocation-fault resources;
- Step S508: Return the update result;
- Step S510: Return.
- the method for handling exceptions of heterogeneous acceleration resources further includes providing an exception alarm for the hardware unhealthy resources and the allocation fault resources.
- hardware exception handling is performed on hardware unhealthy resources, which specifically includes the following steps:
- the usage status of heterogeneous acceleration resources is divided into available and unavailable, and the recovery status of heterogeneous acceleration resources is divided into recoverable and unrecoverable.
- the system automatically records the source that set the usage status. If the usage status is set by an administrator, the source is marked as administrator and the corresponding recovery status is unrecoverable; if the usage status is set automatically by the exception response module, the source is marked as response module and the corresponding recovery status is recoverable.
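The relationship between the setting source and the recovery status can be captured in a few lines. The sketch below is illustrative only; the names `SetSource`, `ResourceState`, and `mark_unavailable` are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SetSource(Enum):
    """Who set the usage status of a resource."""
    ADMINISTRATOR = "administrator"
    RESPONSE_MODULE = "response_module"


@dataclass
class ResourceState:
    available: bool = True               # usage status: available / unavailable
    set_by: Optional[SetSource] = None   # recorded source of the last change
    recoverable: bool = False            # recovery status


def mark_unavailable(state: ResourceState, source: SetSource) -> ResourceState:
    """Mark a resource unavailable and derive its recovery status from the source."""
    state.available = False
    state.set_by = source
    # Only changes made by the exception response module may be undone automatically.
    state.recoverable = source is SetSource.RESPONSE_MODULE
    return state
```

Keeping administrator-set states unrecoverable prevents the automatic recovery path from overriding a deliberate manual decision.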
- notifying the cloud computing platform to migrate the customers to which hardware-unhealthy resources have been allocated may specifically include notifying the administrator associated with the cloud computing platform to determine the usage of the hardware-unhealthy resources in a timely manner and to perform live migration (reallocating normal heterogeneous acceleration resources to the customers) or other actions on all customers, such as virtual machines, bare metal instances, and containers, that have used the hardware-unhealthy resources.
- abnormal resource information corresponding to the hardware-unhealthy resources and the allocation-fault resources can be obtained; the abnormal resource information is standardized to obtain standardized abnormal information; and the standardized abnormal information is reported to the cloud computing platform so that relevant personnel can be notified in time to handle it. The standardized abnormal information can also be stored in the cloud computing platform for subsequent retrieval.
- standardized abnormal information can also be obtained from the cloud computing platform; healthy resource information corresponding to the hardware-healthy resources and the usage-healthy resources can be obtained; the healthy resource information can be standardized to obtain standardized health information; recoverable resources are determined from the standardized abnormal information according to the standardized health information; and if the recovery status of a recoverable resource is recoverable, recovery processing is performed on that resource.
- the standardized exception information and standardized health information at least include the PCI address, manufacturer information, device model, device ID, etc. of the heterogeneous acceleration resource, where the PCI address includes the slot number.
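One way to picture the standardization step is as a projection of each raw record onto a fixed set of normalized fields. The sketch below is an assumption-laden illustration: the field names in `STANDARD_FIELDS` and the normalization rule (trim whitespace, lowercase) are hypothetical, since the disclosure lists the fields but not their encoding.

```python
from typing import Any, Dict, Mapping

# Hypothetical field names for the items the standardized information contains.
STANDARD_FIELDS = ("pci_address", "vendor", "model", "device_id")


def standardize(raw: Mapping[str, Any]) -> Dict[str, str]:
    """Project a raw resource record onto the standard fields as normalized strings."""
    return {
        field: str(raw.get(field, "")).strip().lower()
        for field in STANDARD_FIELDS
    }
```

Applying the same function to both abnormal and healthy resource information makes the two sets directly comparable in the later matching step.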
- determining recoverable resources from the standardized abnormal information based on the standardized health information includes: matching the standardized health information against the standardized abnormal information according to preset matching rules, where the preset matching rules include matching at least one of the following items of resource information: PCI address, manufacturer information, and model; and determining the heterogeneous acceleration resources corresponding to the successfully matched standardized abnormal information as recoverable resources.
- the recovery processing of the recoverable resources may specifically include: if there is an abnormal alarm corresponding to the recoverable resource, canceling the abnormal alarm; and setting the usage status of the recoverable resource to available.
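The matching and recovery steps can be sketched together. This is a hedged illustration, not the disclosed algorithm: records are hypothetical dictionaries, the matching rule is a configurable tuple of keys, alarms are modelled as a set of PCI addresses, and the `recoverable` and `available` flags are assumed to be stored on each abnormal record.

```python
from typing import Any, Dict, List, Sequence, Set

Record = Dict[str, Any]


def find_recoverable(
    healthy: List[Record],
    abnormal: List[Record],
    keys: Sequence[str] = ("pci_address", "vendor", "model"),
) -> List[Record]:
    """Return abnormal records that match a currently healthy record on all keys."""
    healthy_keys = {tuple(h[k] for k in keys) for h in healthy}
    return [a for a in abnormal if tuple(a[k] for k in keys) in healthy_keys]


def recover(resource: Record, active_alarms: Set[str]) -> bool:
    """Cancel any alarm and mark the resource available, if recovery is allowed."""
    if not resource.get("recoverable", False):
        return False  # e.g. set unavailable by an administrator
    active_alarms.discard(resource["pci_address"])  # no error if no alarm exists
    resource["available"] = True
    return True
```

Passing a shorter `keys` tuple, for example only `("pci_address",)`, corresponds to a rule that matches on a single item of resource information.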
- FIG. 6 is a sequence diagram of heterogeneous acceleration resource exception recovery processing according to an optional embodiment of the present disclosure. As shown in Figure 6, the heterogeneous acceleration resource exception recovery processing method specifically includes the following steps:
- Step S601: Output the healthy heterogeneous acceleration resources according to the hardware health monitoring method;
- Step S602: Send the healthy heterogeneous acceleration resource information;
- Step S603: Obtain the reported unhealthy heterogeneous acceleration resource information;
- Step S604: Return the reported unhealthy heterogeneous acceleration resource information;
- Step S605: Identify the recoverable heterogeneous acceleration resources through a specific method;
- Step S606: Standardize the heterogeneous acceleration resource information and call the alarm recovery interface of the cloud computing platform;
- Step S607: Return;
- Step S608: Determine whether the heterogeneous acceleration resource needs to be restored to availability;
- Step S609: Return.
- the specific method in the above step S605 may specifically include: performing data comparison based on PCI address, manufacturer information, device model, device ID, official interface, etc., or identifying through a specific algorithm.
- determining in the above step S608 whether the heterogeneous acceleration resource needs to be restored to availability may specifically include: making the judgment based on the recovery status of the heterogeneous acceleration resource and, if the recovery status is recoverable, restoring the heterogeneous acceleration resource to availability.
- the heterogeneous acceleration resource exception recovery processing method in the above steps S601 to S609 can be executed according to a preset recovery cycle.
- With the heterogeneous acceleration resource exception recovery processing method in this embodiment, when an abnormality in a heterogeneous acceleration resource is detected, an alarm prompt can be issued to the customer and the cloud computing platform administrator in a timely manner to avoid serious losses.
- the abnormal heterogeneous acceleration resource may have been restored to a healthy state.
- This embodiment can automatically restore the heterogeneous acceleration resources to an available state, respond promptly, and process quickly, reducing adverse effects on customers and improving the reliability of the cloud computing platform.
- FIG. 7 is a block diagram of the device for handling exceptions of heterogeneous acceleration resources according to an embodiment of the present disclosure. As shown in Figure 7, the device includes:
- the first monitoring module 702 is configured to determine whether each heterogeneous acceleration resource is a hardware-healthy resource or a hardware-unhealthy resource by performing hardware health monitoring on the heterogeneous acceleration resources of the cloud computing platform;
- the second monitoring module 704 is configured to determine whether each heterogeneous acceleration resource is a usage-healthy resource or an allocation-fault resource by performing device usage health monitoring on the heterogeneous acceleration resources;
- the first response module 706 is configured to perform hardware exception processing on the hardware-unhealthy resources;
- the second response module 708 is configured to perform allocation exception processing on the allocation-fault resources.
- In an embodiment, the device further includes:
- a scanning module, configured to determine whether the heterogeneous acceleration resource exists by scanning PCI slots;
- a first acquisition module, configured to acquire the resource information of the heterogeneous acceleration resource if the heterogeneous acceleration resource exists.
- In an embodiment, the first monitoring module 702 further includes:
- a calling unit, configured to call the corresponding hardware health detection interface according to the resource information of the heterogeneous acceleration resource;
- a detection unit, configured to determine the hardware status of the heterogeneous acceleration resource through the hardware health detection interface;
- a first judgment unit, configured to determine that the heterogeneous acceleration resource is a hardware healthy resource if the hardware status is healthy, and a hardware unhealthy resource if the hardware status is unhealthy.
- In an embodiment, the device further includes:
- an exception alarm module, configured to raise exception alarms for the hardware unhealthy resources and the allocation fault resources.
- In an embodiment, the second monitoring module 704 further includes:
- a first acquisition unit, configured to acquire the allocation data of the heterogeneous acceleration resources;
- a second judgment unit, configured to determine the usage healthy resources and the allocation fault resources according to the allocation data.
- In an embodiment, the second judgment unit further includes:
- a second acquisition unit, configured to determine the actual usage data of the heterogeneous acceleration resources;
- a data comparison unit, configured to compare the allocation data and the actual usage data of each heterogeneous acceleration resource in turn; if the allocation data and the actual usage data are consistent, the heterogeneous acceleration resource is determined to be a usage healthy resource; otherwise, it is determined to be an allocation fault resource.
- In an embodiment, the second response module 708 is further configured to:
- update the allocation data of the allocation fault resource according to the actual usage data.
- In an embodiment, the first response module 706 further includes:
- a setting unit, configured to determine whether the usage status of the hardware unhealthy resource is unavailable; if the judgment result is no, set the usage status of the hardware unhealthy resource to unavailable and set its recovery status to recoverable;
- a processing unit, configured to determine whether the hardware unhealthy resource has been allocated to customers; if the judgment result is yes, notify the cloud computing platform to migrate the customers to which the hardware unhealthy resource has been allocated, and/or set the recovery status of the hardware unhealthy resource to non-recoverable.
- In an embodiment, the device further includes:
- a second acquisition module, configured to acquire the abnormal resource information corresponding to the hardware unhealthy resources and the allocation fault resources;
- a first standardization module, configured to standardize the abnormal resource information to obtain standardized abnormal information;
- a reporting module, configured to report the standardized abnormal information to the cloud computing platform.
- In an embodiment, the device further includes:
- a third acquisition module, configured to acquire the standardized abnormal information from the cloud computing platform;
- a fourth acquisition module, configured to acquire the health resource information corresponding to the hardware healthy resources and the usage healthy resources;
- a second standardization module, configured to standardize the health resource information to obtain standardized health information;
- a recovery judgment module, configured to determine recoverable resources from the standardized abnormal information based on the standardized health information;
- a recovery processing module, configured to perform recovery processing on a recoverable resource if its recovery status is recoverable.
- In an embodiment, the recovery judgment module includes:
- a matching unit, configured to match the standardized health information against the standardized abnormal information according to preset matching rules, wherein the preset matching rules include matching at least one of the following items of resource information: PCI address, manufacturer information, model;
- a recovery judgment unit, configured to determine the heterogeneous acceleration resource corresponding to successfully matched standardized abnormal information as the recoverable resource.
- In an embodiment, the recovery processing module includes:
- a cancellation unit, configured to cancel the exception alarm if an exception alarm corresponding to the recoverable resource exists;
- a recovery unit, configured to set the usage status of the recoverable resource to available.
- According to another aspect of the embodiments of the present disclosure, a heterogeneous acceleration resource health monitoring and exception handling architecture is also provided.
- FIG. 8 shows a heterogeneous acceleration resource health monitoring and exception handling architecture according to an embodiment of the present disclosure. As shown in FIG. 8, the architecture includes:
- a health identification module 81, including: a hardware health monitoring module 811, a device usage health monitoring module 812, and a cloud platform heterogeneous resource usage interface 813;
- an exception processing module 82, including: an exception alarm module 821, an exception response module 822, an exception recovery module 823, a cloud platform alarm interface 824, and a cloud platform heterogeneous resource management interface 825.
- In this embodiment, the hardware health monitoring module 811 is configured to implement part or all of the functions of the first monitoring module 702; the device usage health monitoring module 812 is configured to implement part or all of the functions of the second monitoring module 704; and the cloud platform heterogeneous resource usage interface 813 is used to implement part or all of the functions of the second acquisition unit.
- Specifically, the hardware health monitoring module 811 is configured to determine whether each heterogeneous acceleration resource is a hardware healthy resource or a hardware unhealthy resource by performing hardware health monitoring on the heterogeneous acceleration resources of the cloud computing platform; the device usage health monitoring module 812 is configured to determine whether each heterogeneous acceleration resource is a usage healthy resource or an allocation fault resource by performing device usage health monitoring on the heterogeneous acceleration resources; and the cloud platform heterogeneous resource usage interface 813 is used to determine the actual usage data of the heterogeneous acceleration resources.
- In another embodiment, the exception alarm module 821 is used to raise exception alarms for the hardware unhealthy resources and the allocation fault resources; the exception response module 822 is used to implement part or all of the functions of the first response module 706 and the second response module 708, including exception processing for hardware unhealthy resources and allocation fault resources; the cloud platform alarm interface 824 is used to notify the cloud computing platform of exception alarm information; and the cloud platform heterogeneous resource management interface 825 is used to manage the heterogeneous acceleration resources, including setting their usage status.
- The embodiments of the present disclosure address the problem in the related art that only the detection of ordinary hardware resources of traditional servers is considered and an inconsistency between the registration and the actual use of virtualized heterogeneous acceleration resources managed by the cloud computing platform cannot be identified, which causes losses to the cloud computing platform and its users.
- When an abnormality occurs in a heterogeneous acceleration resource, its unhealthy state can be sensed quickly, and alarms and recovery can be carried out in time, ensuring the reliability, stability, and timeliness of the cloud platform's management of heterogeneous acceleration resources.
- Embodiments of the present disclosure also provide a computer-readable storage medium that stores a computer program, wherein the computer program is configured to execute the steps in any of the above method embodiments when running.
- In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, and other media that can store computer programs.
- Embodiments of the present disclosure also provide an electronic device including a memory and a processor, wherein a computer program is stored in the memory and the processor is configured to run the computer program to perform the steps in any of the above method embodiments.
- In an exemplary embodiment, the electronic device may further include a transmission device and an input-output device, wherein the transmission device is connected to the processor and the input-output device is connected to the processor.
- Obviously, those skilled in the art should understand that the modules or steps of the present disclosure described above can be implemented using general-purpose computing devices, and that they can be concentrated on a single computing device or distributed across a network composed of multiple computing devices. They may be implemented in program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device, and in some cases the steps shown or described may be executed in a sequence different from that given here. Alternatively, they can be made into individual integrated circuit modules, or multiple modules or steps among them can be made into a single integrated circuit module. As such, the present disclosure is not limited to any specific combination of hardware and software.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Debugging And Monitoring (AREA)
Abstract
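The present disclosure provides a heterogeneous acceleration resource exception handling method and device, a storage medium, and an electronic device. The method includes: determining, by performing hardware health monitoring on the heterogeneous acceleration resources of a cloud computing platform, whether each heterogeneous acceleration resource is a hardware healthy resource or a hardware unhealthy resource; determining, by performing device usage health monitoring on the heterogeneous acceleration resources, whether each heterogeneous acceleration resource is a usage healthy resource or an allocation fault resource; performing hardware exception processing on the hardware unhealthy resources; and performing allocation exception processing on the allocation fault resources. The method addresses the problem in the related art that only the detection of ordinary hardware resources of traditional servers is considered and an inconsistency between the registration and the actual use of virtualized heterogeneous acceleration resources managed by the cloud computing platform cannot be identified, which causes losses to the cloud computing platform and its users, and it ensures the reliability, stability, and timeliness of the cloud platform's management of heterogeneous acceleration resources.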
Description
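Cross-Reference to Related Applications
The present disclosure is based on, and claims priority to, Chinese patent application CN202210563855.X, filed on May 23, 2022 and entitled "Heterogeneous acceleration resource exception handling method and device, storage medium, and electronic device", the entire disclosure of which is incorporated herein by reference.
Embodiments of the present disclosure relate to the field of cloud computing, and in particular to a heterogeneous acceleration resource exception handling method and device, a storage medium, and an electronic device.
With the development of deep learning and other AI technologies, users' demand for computing power and performance is increasingly urgent. More and more users want to obtain heterogeneous computing capability through cloud computing platforms to accelerate their services, and the heterogeneous computing services provided by cloud computing platforms have become indispensable.
The heterogeneous acceleration resources of a cloud computing platform typically include graphics processing units (Graphics Processing Unit, GPU), AI accelerator cards (Neural-Network Processing Unit, NPU), programmable accelerator cards (Field Programmable Gate Array, FPGA), and smart network interface cards (Smart NIC). Compared with traditional hardware, these resources come in many types, are easy to plug and unplug, can be virtualized in many ways, are allocated and reclaimed centrally, are used frequently, and carry special workloads.
When heterogeneous acceleration hardware becomes abnormal, failure to identify, report, and recover from the abnormality in time causes serious losses to the customer services carried on the cloud computing platform. This applies especially to heterogeneous acceleration resources allocated through virtualization, such as GPUs, NPUs, and FPGAs: while resources are frequently allocated and reclaimed, communication failures may cause reclamation information to be lost or resources to be reclaimed late, so the registration of a heterogeneous acceleration resource easily becomes inconsistent with its actual use. This leads to abnormal resource allocation on the cloud platform and causes losses to the cloud computing platform and its customers.
At present, most traditional hardware detection methods rely on the server's own system for detection and judgment. On the one hand, the judgment is inaccurate; on the other hand, as the number of device types grows, the devices cannot be managed well. Most critically, these methods cannot identify an inconsistency between the registration and the actual use of virtualized heterogeneous acceleration resources managed by the cloud computing platform.
The related art provides no exception detection and exception handling method for the heterogeneous acceleration resources of a cloud computing platform. In particular, when a registration exception of virtualized acceleration hardware (GPU, NPU), a virtualization allocation exception during an administrator's maintenance of an acceleration device, an abnormal health condition of the device itself, or a misoperation of the device occurs, the exception cannot be sensed and handled in time, which affects the normal use of the cloud computing platform and causes losses to the platform and its users.
No solution has yet been proposed for the problem in the related art that only the detection of ordinary hardware resources of traditional servers is considered and an inconsistency between the registration and the actual use of virtualized heterogeneous acceleration resources managed by the cloud computing platform cannot be identified, which causes losses to the cloud computing platform and its users.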
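Summary
Embodiments of the present disclosure provide a heterogeneous acceleration resource exception handling method and device, a storage medium, and an electronic device, to at least solve the problem in the related art that only the detection of ordinary hardware resources of traditional servers is considered and an inconsistency between the registration and the actual use of virtualized heterogeneous acceleration resources managed by the cloud computing platform cannot be identified, which causes losses to the cloud computing platform and its users. When an abnormality occurs in a heterogeneous acceleration resource, its unhealthy state can be sensed quickly, and alarms and recovery can be carried out in time, ensuring the reliability, stability, and timeliness of the cloud platform's management of heterogeneous acceleration resources.
According to an embodiment of the present disclosure, a heterogeneous acceleration resource exception handling method is provided, the method including:
determining, by performing hardware health monitoring on the heterogeneous acceleration resources of a cloud computing platform, whether each heterogeneous acceleration resource is a hardware healthy resource or a hardware unhealthy resource;
determining, by performing device usage health monitoring on the heterogeneous acceleration resources, whether each heterogeneous acceleration resource is a usage healthy resource or an allocation fault resource;
performing hardware exception processing on the hardware unhealthy resources;
performing allocation exception processing on the allocation fault resources.
According to another embodiment of the present disclosure, a heterogeneous acceleration resource exception handling device is also provided, the device including:
a first monitoring module, configured to determine whether each heterogeneous acceleration resource is a hardware healthy resource or a hardware unhealthy resource by performing hardware health monitoring on the heterogeneous acceleration resources of a cloud computing platform;
a second monitoring module, configured to determine whether each heterogeneous acceleration resource is a usage healthy resource or an allocation fault resource by performing device usage health monitoring on the heterogeneous acceleration resources;
a first response module, configured to perform hardware exception processing on the hardware unhealthy resources;
a second response module, configured to perform allocation exception processing on the allocation fault resources.
According to yet another embodiment of the present disclosure, a computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program is configured to execute, when run, the steps in any of the above method embodiments.
According to yet another embodiment of the present disclosure, an electronic device is also provided, including a memory and a processor, wherein a computer program is stored in the memory and the processor is configured to run the computer program to execute the steps in any of the above method embodiments.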
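FIG. 1 is a block diagram of the hardware structure of a computer terminal for the heterogeneous acceleration resource exception handling method according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of the heterogeneous acceleration resource exception handling method according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of the heterogeneous acceleration resource hardware health monitoring method according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of the heterogeneous acceleration resource device usage health monitoring method according to an embodiment of the present disclosure;
FIG. 5 is a sequence diagram of device usage health monitoring and processing according to an optional embodiment of the present disclosure;
FIG. 6 is a sequence diagram of heterogeneous acceleration resource exception recovery processing according to an optional embodiment of the present disclosure;
FIG. 7 is a block diagram of the heterogeneous acceleration resource exception handling device according to an embodiment of the present disclosure;
FIG. 8 shows the heterogeneous acceleration resource health monitoring and exception handling architecture according to an embodiment of the present disclosure.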
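Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
The method embodiments provided in the embodiments of the present disclosure can be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a computer terminal as an example, FIG. 1 is a block diagram of the hardware structure of a computer terminal for the heterogeneous acceleration resource exception handling method according to an embodiment of the present disclosure. As shown in FIG. 1, the computer terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data. The computer terminal may also include a transmission device 106 for communication functions and an input-output device 108. Those of ordinary skill in the art will understand that the structure shown in FIG. 1 is only illustrative and does not limit the structure of the computer terminal. For example, the computer terminal may include more or fewer components than shown in FIG. 1, or have a configuration different from that shown in FIG. 1.
The memory 104 can be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the heterogeneous acceleration resource exception handling method in the embodiments of the present disclosure. The processor 102 runs the computer program stored in the memory 104 to execute various functional applications and service chain address pool slicing processing, that is, to implement the above method. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and such remote memory may be connected to the mobile terminal through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. Specific examples of the network may include a wireless network provided by the communication provider of the computer terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.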
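This embodiment provides a heterogeneous acceleration resource exception handling method that runs on the above computer terminal or network architecture. FIG. 2 is a flowchart of the method according to an embodiment of the present disclosure. As shown in FIG. 2, the flow includes the following steps:
Step S202: determine, by performing hardware health monitoring on the heterogeneous acceleration resources of a cloud computing platform, whether each heterogeneous acceleration resource is a hardware healthy resource or a hardware unhealthy resource;
Step S204: determine, by performing device usage health monitoring on the heterogeneous acceleration resources, whether each heterogeneous acceleration resource is a usage healthy resource or an allocation fault resource;
Step S206: perform hardware exception processing on the hardware unhealthy resources;
Step S208: perform allocation exception processing on the allocation fault resources.
In an embodiment, before step S202, whether a heterogeneous acceleration resource exists is determined by scanning PCI slots; if it exists, its resource information is acquired. Specifically, the resource information can be identified in combination with the configuration of the cloud computing platform. The heterogeneous acceleration resources include GPU, NPU, FPGA, and Smart NIC, and the resource information may include the PCI address, manufacturer information, device model, device ID, and so on, where the PCI address includes the slot number.
In this embodiment, step S202 may specifically include: calling the corresponding hardware health detection interface according to the resource information of the heterogeneous acceleration resource; determining the hardware status of the heterogeneous acceleration resource through the hardware health detection interface; if the hardware status is healthy, determining that the heterogeneous acceleration resource is a hardware healthy resource; and if the hardware status is unhealthy, determining that it is a hardware unhealthy resource.
Specifically, the hardware health detection interfaces that have passed the security certification of the cloud computing platform can be called in a loop according to the type, manufacturer information, and device model of each heterogeneous acceleration resource, and the hardware health detection interface determines the hardware status of the resource.
In another embodiment, the hardware health monitoring method of step S202 can be executed for each heterogeneous acceleration resource according to a preset hardware health detection cycle.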
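FIG. 3 is a flowchart of the heterogeneous acceleration resource hardware health monitoring method according to an embodiment of the present disclosure. As shown in FIG. 3, the method specifically includes the following steps:
Step S302: scan each heterogeneous acceleration resource in the PCI slots of the compute node and obtain the PCI address of each acceleration resource;
Step S304: in combination with the cloud platform configuration, identify the manufacturer and model of each specific acceleration resource (GPU, NPU, FPGA, SmartNIC);
Step S306: using the PCI address, manufacturer, and model as the core identification parameters, call the hardware health detection interfaces approved by the cloud platform in a loop and determine the hardware status of each heterogeneous acceleration resource;
Step S308: determine whether the hardware status of the heterogeneous acceleration resource is healthy; if the judgment result is yes, execute step S310a; if the judgment result is no, execute step S310b;
Step S310a: determine that the heterogeneous acceleration resource is a hardware healthy resource;
Step S310b: determine that the heterogeneous acceleration resource is a hardware unhealthy resource;
Step S312: determine whether any heterogeneous acceleration resource on the current node has not yet undergone hardware health detection;
Step S314: output the hardware healthy resources and the hardware unhealthy resources.
In this embodiment, step S302 may specifically include: scanning each PCI slot to determine whether a physical acceleration resource is installed in it and, if so, obtaining the PCI address corresponding to that acceleration resource. Specifically, each PCI slot can hold only one physical acceleration resource, the PCI address includes the slot number, and the types of physical acceleration resources may include GPU, NPU, FPGA, SmartNIC, and so on.
The method in this embodiment addresses the problem in the related art that detection can only rely on the server's own system and targets traditional hardware, so that detection results for the many types of heterogeneous acceleration resources are inaccurate and the resources are hard to manage. Calling the corresponding interface according to the manufacturer information and device model of the detected heterogeneous acceleration resource not only improves the accuracy of the detection results but also enables unified management of the many types of heterogeneous acceleration resources.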
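The loop of steps S306 to S314 can be sketched as a lookup from device identity to a health detection interface. This is a minimal sketch under stated assumptions: the `Accelerator` class, the `HEALTH_CHECKS` registry, and `classify_hardware` are hypothetical names introduced for illustration, and treating a device without a registered interface as unhealthy is an assumption, not something the disclosure specifies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Accelerator:
    pci_address: str  # includes the slot number
    kind: str         # "GPU", "NPU", "FPGA" or "SmartNIC"
    vendor: str
    model: str


# Hypothetical registry: (kind, vendor, model) -> platform-approved health probe.
HealthCheck = Callable[[Accelerator], bool]
HEALTH_CHECKS: Dict[Tuple[str, str, str], HealthCheck] = {}


def classify_hardware(
    resources: Iterable[Accelerator],
) -> Tuple[List[Accelerator], List[Accelerator]]:
    """Return (hardware healthy, hardware unhealthy) resources."""
    healthy: List[Accelerator] = []
    unhealthy: List[Accelerator] = []
    for res in resources:
        check = HEALTH_CHECKS.get((res.kind, res.vendor, res.model))
        # Assumption: no approved probe means the device cannot be vouched for.
        if check is not None and check(res):
            healthy.append(res)
        else:
            unhealthy.append(res)
    return healthy, unhealthy
```

Keying the registry on type, manufacturer, and model keeps vendor-specific probing out of the loop itself, which is one way to obtain the unified management described above.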
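In another embodiment, step S204 may specifically include: acquiring the allocation data of the heterogeneous acceleration resources; and determining the usage healthy resources and the allocation fault resources according to the allocation data.
In this embodiment, determining the usage healthy resources or the allocation fault resources according to the allocation data includes: determining the actual usage data of the heterogeneous acceleration resources; and comparing the allocation data and the actual usage data of each heterogeneous acceleration resource in turn. If the allocation data and the actual usage data are consistent, the heterogeneous acceleration resource is determined to be a usage healthy resource; otherwise, it is determined to be an allocation fault resource.
In an embodiment, the device usage health monitoring method of step S204 can be executed according to a preset device usage health detection cycle.
Specifically, each heterogeneous acceleration resource can be virtualized and allocated to multiple customers. The allocation data includes the allocated customers and the allocated quantity, and the actual usage data includes the using customers and the used quantity.
Further, for each heterogeneous acceleration resource, the allocated customers are compared with the using customers, and the allocated quantity is compared with the used quantity. If all the data are consistent, the heterogeneous acceleration resource is determined to be a usage healthy resource; otherwise, it is determined to be an allocation fault resource.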
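The comparison described above can be sketched as follows. This is an illustrative sketch only: `UsageRecord` and `classify_usage` are hypothetical names, resources are keyed by PCI address as an assumption, and a resource with no actual usage record is treated as used by nobody.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class UsageRecord:
    customers: FrozenSet[str]  # allocated customers or using customers
    count: int                 # allocated quantity or used quantity


def classify_usage(
    allocated: Dict[str, UsageRecord],
    actual: Dict[str, UsageRecord],
) -> Tuple[List[str], List[str]]:
    """Return (usage healthy, allocation fault) resource keys."""
    usage_healthy: List[str] = []
    allocation_fault: List[str] = []
    for pci, alloc in allocated.items():
        # Assumption: a missing usage record means nobody uses the resource.
        used = actual.get(pci, UsageRecord(frozenset(), 0))
        # Both the customers and the quantity must agree.
        if alloc == used:
            usage_healthy.append(pci)
        else:
            allocation_fault.append(pci)
    return usage_healthy, allocation_fault
```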
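FIG. 4 is a flowchart of the heterogeneous acceleration resource device usage health monitoring method according to an embodiment of the present disclosure. As shown in FIG. 4, the method specifically includes the following steps:
Step S402: call the cloud platform heterogeneous acceleration resource interface to obtain the detailed allocation data of the heterogeneous acceleration resources (including the allocated customers, the allocated quantity, and so on);
Step S404: check each allocated acceleration resource;
Step S406: determine whether the corresponding customer exists; if the judgment result is yes, go directly to step S410; if the judgment result is no, execute step S408;
Step S408: add the heterogeneous acceleration resource to the allocation fault list and record the abnormally allocated customer;
Step S410: determine whether any heterogeneous acceleration resource has not yet been checked; if the judgment result is yes, return to step S404; if the judgment result is no, execute step S412;
Step S412: output the heterogeneous acceleration resources with allocation faults.
In this embodiment, each heterogeneous acceleration resource can be virtualized and allocated to multiple customers. Customer types typically include virtual machines, bare-metal machines, containers, and so on.
Step S406 may specifically include determining whether the virtual machines, bare-metal machines, and containers to which the heterogeneous acceleration resource has been allocated exist; if they all exist, the customers to which the heterogeneous acceleration resource is allocated are determined to be using it normally.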
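The loop of steps S404 to S412 might look like the sketch below. The function name `find_allocation_faults` and the representation of customers as plain identifiers are assumptions made for illustration; in practice the set of existing customers would come from the cloud platform's inventory of virtual machines, bare-metal machines, and containers.

```python
from typing import Dict, List, Set


def find_allocation_faults(
    allocations: Dict[str, List[str]],
    existing_customers: Set[str],
) -> Dict[str, List[str]]:
    """Map each faulty resource to the allocated customers that no longer exist."""
    faults: Dict[str, List[str]] = {}
    for pci, customers in allocations.items():  # S404: visit each allocated resource
        missing = [c for c in customers if c not in existing_customers]  # S406
        if missing:
            faults[pci] = missing  # S408: record the abnormally allocated customers
    return faults  # S412: resources with allocation faults
```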
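The method in this embodiment addresses the problem in the related art that, during virtualized allocation, the registered allocation of a heterogeneous acceleration resource easily becomes inconsistent with the customers' actual use. Heterogeneous acceleration resources with allocation faults can be identified in time, which avoids allocating a virtualized heterogeneous acceleration resource to multiple customers repeatedly and ensures the security and stability of the cloud computing platform.
In an embodiment, performing allocation exception processing on the allocation fault resource includes: updating the allocation data of the allocation fault resource according to the actual usage data. Specifically, the allocated customers in the allocation data are updated with the using customers in the actual usage data, and the allocated quantity in the allocation data is updated with the used quantity in the actual usage data.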
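The data update described above can be sketched as below. `Allocation` and `repair_allocations` are hypothetical names, and recording a faulty resource with no actual usage as unallocated is an assumption of this sketch.

```python
from typing import Dict, Iterable, List, TypedDict


class Allocation(TypedDict):
    customers: List[str]
    count: int


def repair_allocations(
    allocated: Dict[str, Allocation],
    actual: Dict[str, Allocation],
    faulty: Iterable[str],
) -> Dict[str, Allocation]:
    """Return a copy of the allocation data with faulty entries overwritten by actual usage."""
    repaired: Dict[str, Allocation] = {
        pci: {"customers": list(rec["customers"]), "count": rec["count"]}
        for pci, rec in allocated.items()
    }
    for pci in faulty:
        used = actual.get(pci)
        if used is None:
            # Assumption: no actual usage means the resource is unallocated.
            repaired[pci] = {"customers": [], "count": 0}
        else:
            repaired[pci] = {"customers": list(used["customers"]), "count": used["count"]}
    return repaired
```

Returning a copy rather than mutating the input keeps the original registration available for the exception report.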
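FIG. 5 is a sequence diagram of device usage health monitoring and processing according to an optional embodiment of the present disclosure. As shown in FIG. 5, the device usage health monitoring and processing method for heterogeneous acceleration resources specifically includes the following steps:
Step S502: output the allocation fault resources according to the device usage health monitoring method;
Step S504: call the response module to perform allocation exception processing on the allocation fault resources;
Step S506: update the heterogeneous acceleration resource information of the allocation fault resources;
Step S508: return the update result;
Step S510: return.
In another embodiment, the heterogeneous acceleration resource exception handling method further includes raising exception alarms for the hardware unhealthy resources and the allocation fault resources.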
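In an embodiment, performing hardware exception processing on the hardware unhealthy resource specifically includes the following steps:
determining whether the usage status of the hardware unhealthy resource is unavailable; if the judgment result is no, setting the usage status of the hardware unhealthy resource to unavailable and setting its recovery status to recoverable;
determining whether the hardware unhealthy resource has been allocated to customers; if the judgment result is yes, notifying the cloud computing platform to migrate the customers to which the hardware unhealthy resource has been allocated, and/or setting the recovery status of the hardware unhealthy resource to non-recoverable.
Specifically, the usage status of a heterogeneous acceleration resource is either available or unavailable, and its recovery status is either recoverable or non-recoverable. When the usage status of a heterogeneous acceleration resource is set, the system automatically records the source of the setting. If the usage status was set by an administrator, the source is marked as administrator and the corresponding recovery status is non-recoverable; if the usage status was set automatically by the exception response module, the source is marked as response module and the corresponding recovery status is recoverable.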
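One possible reading of this state handling is sketched below. The names `SetBy`, `ResourceState`, `handle_hardware_fault`, and the `lock_recovery` flag (which stands in for the optional "and/or" branch) are all hypothetical, and the notification callback is a placeholder for whatever interface the cloud platform actually exposes.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class SetBy(Enum):
    ADMIN = "admin"                      # recovery status: non-recoverable
    RESPONSE_MODULE = "response_module"  # recovery status: recoverable


@dataclass
class ResourceState:
    pci_address: str
    available: bool = True
    recoverable: bool = True
    set_by: Optional[SetBy] = None
    customers: List[str] = field(default_factory=list)


def handle_hardware_fault(
    state: ResourceState,
    notify_migration: Callable[[str, List[str]], None],
    lock_recovery: bool = False,
) -> None:
    """Apply hardware exception processing to one hardware unhealthy resource."""
    if state.available:
        # Only touch resources that are not already unavailable, so that a
        # status set earlier by an administrator is preserved.
        state.available = False
        state.set_by = SetBy.RESPONSE_MODULE
        state.recoverable = True
    if state.customers:
        notify_migration(state.pci_address, list(state.customers))
        if lock_recovery:
            state.recoverable = False
```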
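In this embodiment, notifying the cloud computing platform to migrate the customers to which the hardware unhealthy resource has been allocated may specifically include notifying the administrator associated with the cloud computing platform to assess the use of the hardware unhealthy resource promptly and to perform live migration (reallocating a normal heterogeneous acceleration resource to the customer) or other actions for all customers, such as virtual machines, bare-metal machines, and containers, that use the hardware unhealthy resource.
In an embodiment, the abnormal resource information corresponding to the hardware unhealthy resources and the allocation fault resources can be acquired and standardized to obtain standardized abnormal information, which is then reported to the cloud computing platform so that the relevant personnel can be notified promptly to handle it. The standardized abnormal information can be stored in the cloud computing platform for later lookup.
In another embodiment, the standardized abnormal information can also be acquired from the cloud computing platform; the health resource information corresponding to the hardware healthy resources and the usage healthy resources is acquired and standardized to obtain standardized health information; recoverable resources are determined from the standardized abnormal information according to the standardized health information; and if the recovery status of a recoverable resource is recoverable, recovery processing is performed on it. Specifically, the standardized abnormal information and the standardized health information include at least the PCI address, manufacturer information, device model, device ID, and so on of the heterogeneous acceleration resource, where the PCI address includes the slot number.
In this embodiment, determining recoverable resources from the standardized abnormal information according to the standardized health information includes: matching the standardized health information against the standardized abnormal information according to preset matching rules, wherein the preset matching rules include matching at least one of the following items of resource information: PCI address, manufacturer information, model; and determining the heterogeneous acceleration resource corresponding to successfully matched standardized abnormal information as a recoverable resource.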
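The matching step can be sketched as a key comparison between the two sets of standardized records. The record layout (dictionaries with `pci_address`, `vendor`, and `model` fields) and the function name `find_recoverable` are assumptions for illustration; the `keys` parameter models the configurable preset matching rule.

```python
from typing import Dict, Iterable, List, Sequence


def find_recoverable(
    healthy: Iterable[Dict[str, str]],
    abnormal: Iterable[Dict[str, str]],
    keys: Sequence[str] = ("pci_address", "vendor", "model"),
) -> List[Dict[str, str]]:
    """Return the abnormal records whose key fields match a healthy record."""
    healthy_keys = {tuple(rec[k] for k in keys) for rec in healthy}
    return [rec for rec in abnormal if tuple(rec[k] for k in keys) in healthy_keys]
```

Matching on the PCI address as well as the manufacturer and model is the strictest rule; dropping the address matches any reported device of the same make, which is usually too loose when a node carries several identical cards.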
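In this embodiment, performing recovery processing on the recoverable resource may specifically include: if an exception alarm corresponding to the recoverable resource exists, cancelling the exception alarm; and setting the usage status of the recoverable resource to available.
FIG. 6 is a sequence diagram of heterogeneous acceleration resource exception recovery processing according to an optional embodiment of the present disclosure. As shown in FIG. 6, the exception recovery processing method specifically includes the following steps:
Step S601: output the healthy heterogeneous acceleration resources according to the hardware health monitoring method;
Step S602: send the information of the healthy heterogeneous acceleration resources;
Step S603: acquire the information of the unhealthy heterogeneous acceleration resources that have been reported;
Step S604: return the information of the unhealthy heterogeneous acceleration resources that have been reported;
Step S605: identify the recoverable heterogeneous acceleration resources by a specific method;
Step S606: standardize the heterogeneous acceleration resource information and call the alarm recovery interface of the cloud computing platform;
Step S607: return;
Step S608: determine whether the heterogeneous acceleration resource needs to be restored to available;
Step S609: return.
In this embodiment, the specific method in step S605 may include: comparing data such as the PCI address, manufacturer information, device model, device ID, and official interfaces, or performing the identification with a specific algorithm.
In this embodiment, determining in step S608 whether the heterogeneous acceleration resource needs to be restored to available may specifically include: making the judgment based on the recovery status of the heterogeneous acceleration resource; if the recovery status is recoverable, the resource is restored to available.
In another embodiment, the heterogeneous acceleration resource exception recovery processing method of steps S601 to S609 can be executed according to a preset recovery cycle.
With the heterogeneous acceleration resource exception recovery processing method in this embodiment, when an abnormality in a heterogeneous acceleration resource is detected, an alarm can be issued promptly to the customer and the cloud computing platform administrator to avoid serious losses. In addition, through manual intervention or automatic system processing, the abnormal heterogeneous acceleration resource may already have been restored to a healthy state. In that case, this embodiment can automatically restore the resource to an available state, responding promptly and processing quickly, which reduces the adverse effects on customers and improves the reliability of the cloud computing platform.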
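According to another aspect of the embodiments of the present disclosure, a heterogeneous acceleration resource exception handling device is also provided. FIG. 7 is a block diagram of the device according to an embodiment of the present disclosure. As shown in FIG. 7, the device includes:
a first monitoring module 702, configured to determine whether each heterogeneous acceleration resource is a hardware healthy resource or a hardware unhealthy resource by performing hardware health monitoring on the heterogeneous acceleration resources of the cloud computing platform;
a second monitoring module 704, configured to determine whether each heterogeneous acceleration resource is a usage healthy resource or an allocation fault resource by performing device usage health monitoring on the heterogeneous acceleration resources;
a first response module 706, configured to perform hardware exception processing on the hardware unhealthy resources;
a second response module 708, configured to perform allocation exception processing on the allocation fault resources.
In an embodiment, the device further includes:
a scanning module, configured to determine whether the heterogeneous acceleration resource exists by scanning PCI slots;
a first acquisition module, configured to acquire the resource information of the heterogeneous acceleration resource if the heterogeneous acceleration resource exists.
In an embodiment, the first monitoring module 702 further includes:
a calling unit, configured to call the corresponding hardware health detection interface according to the resource information of the heterogeneous acceleration resource;
a detection unit, configured to determine the hardware status of the heterogeneous acceleration resource through the hardware health detection interface;
a first judgment unit, configured to determine that the heterogeneous acceleration resource is a hardware healthy resource if the hardware status is healthy, and a hardware unhealthy resource if the hardware status is unhealthy.
In an embodiment, the device further includes:
an exception alarm module, configured to raise exception alarms for the hardware unhealthy resources and the allocation fault resources.
In an embodiment, the second monitoring module 704 further includes:
a first acquisition unit, configured to acquire the allocation data of the heterogeneous acceleration resources;
a second judgment unit, configured to determine the usage healthy resources and the allocation fault resources according to the allocation data.
In an embodiment, the second judgment unit further includes:
a second acquisition unit, configured to determine the actual usage data of the heterogeneous acceleration resources;
a data comparison unit, configured to compare the allocation data and the actual usage data of each heterogeneous acceleration resource in turn; if the allocation data and the actual usage data are consistent, the heterogeneous acceleration resource is determined to be a usage healthy resource; otherwise, it is determined to be an allocation fault resource.
In an embodiment, the second response module 708 is further configured to:
update the allocation data of the allocation fault resource according to the actual usage data.
In an embodiment, the first response module 706 further includes:
a setting unit, configured to determine whether the usage status of the hardware unhealthy resource is unavailable; if the judgment result is no, set the usage status of the hardware unhealthy resource to unavailable and set its recovery status to recoverable;
a processing unit, configured to determine whether the hardware unhealthy resource has been allocated to customers; if the judgment result is yes, notify the cloud computing platform to migrate the customers to which the hardware unhealthy resource has been allocated, and/or set the recovery status of the hardware unhealthy resource to non-recoverable.
In an embodiment, the device further includes:
a second acquisition module, configured to acquire the abnormal resource information corresponding to the hardware unhealthy resources and the allocation fault resources;
a first standardization module, configured to standardize the abnormal resource information to obtain standardized abnormal information;
a reporting module, configured to report the standardized abnormal information to the cloud computing platform.
In an embodiment, the device further includes:
a third acquisition module, configured to acquire the standardized abnormal information from the cloud computing platform;
a fourth acquisition module, configured to acquire the health resource information corresponding to the hardware healthy resources and the usage healthy resources;
a second standardization module, configured to standardize the health resource information to obtain standardized health information;
a recovery judgment module, configured to determine recoverable resources from the standardized abnormal information according to the standardized health information;
a recovery processing module, configured to perform recovery processing on a recoverable resource if its recovery status is recoverable.
In an embodiment, the recovery judgment module includes:
a matching unit, configured to match the standardized health information against the standardized abnormal information according to preset matching rules, wherein the preset matching rules include matching at least one of the following items of resource information: PCI address, manufacturer information, model;
a recovery judgment unit, configured to determine the heterogeneous acceleration resource corresponding to successfully matched standardized abnormal information as the recoverable resource.
In an embodiment, the recovery processing module includes:
a cancellation unit, configured to cancel the exception alarm if an exception alarm corresponding to the recoverable resource exists;
a recovery unit, configured to set the usage status of the recoverable resource to available.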
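According to another aspect of the embodiments of the present disclosure, a heterogeneous acceleration resource health monitoring and exception handling architecture is also provided.
FIG. 8 shows the heterogeneous acceleration resource health monitoring and exception handling architecture according to an embodiment of the present disclosure. As shown in FIG. 8, the architecture includes:
a health identification module 81, including: a hardware health monitoring module 811, a device usage health monitoring module 812, and a cloud platform heterogeneous resource usage interface 813;
an exception processing module 82, including: an exception alarm module 821, an exception response module 822, an exception recovery module 823, a cloud platform alarm interface 824, and a cloud platform heterogeneous resource management interface 825.
In this embodiment, the hardware health monitoring module 811 is configured to implement part or all of the functions of the first monitoring module 702; the device usage health monitoring module 812 is configured to implement part or all of the functions of the second monitoring module 704; and the cloud platform heterogeneous resource usage interface 813 is used to implement part or all of the functions of the second acquisition unit.
Specifically, the hardware health monitoring module 811 is configured to determine whether each heterogeneous acceleration resource is a hardware healthy resource or a hardware unhealthy resource by performing hardware health monitoring on the heterogeneous acceleration resources of the cloud computing platform; the device usage health monitoring module 812 is configured to determine whether each heterogeneous acceleration resource is a usage healthy resource or an allocation fault resource by performing device usage health monitoring on the heterogeneous acceleration resources; and the cloud platform heterogeneous resource usage interface 813 is used to determine the actual usage data of the heterogeneous acceleration resources.
In another embodiment, the exception alarm module 821 is used to raise exception alarms for the hardware unhealthy resources and the allocation fault resources; the exception response module 822 is used to implement part or all of the functions of the first response module 706 and the second response module 708, including exception processing for hardware unhealthy resources and allocation fault resources; the cloud platform alarm interface 824 is used to notify the cloud computing platform of exception alarm information; and the cloud platform heterogeneous resource management interface 825 is used to manage the heterogeneous acceleration resources, including setting their usage status.
The embodiments of the present disclosure address the problem in the related art that only the detection of ordinary hardware resources of traditional servers is considered and an inconsistency between the registration and the actual use of virtualized heterogeneous acceleration resources managed by the cloud computing platform cannot be identified, which causes losses to the cloud computing platform and its users. When an abnormality occurs in a heterogeneous acceleration resource, its unhealthy state can be sensed quickly, and alarms and recovery can be carried out in time, ensuring the reliability, stability, and timeliness of the cloud platform's management of heterogeneous acceleration resources.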
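Embodiments of the present disclosure also provide a computer-readable storage medium in which a computer program is stored, wherein the computer program is configured to execute, when run, the steps in any of the above method embodiments.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, and other media that can store computer programs.
Embodiments of the present disclosure also provide an electronic device including a memory and a processor, wherein a computer program is stored in the memory and the processor is configured to run the computer program to execute the steps in any of the above method embodiments.
In an exemplary embodiment, the electronic device may further include a transmission device and an input-output device, wherein the transmission device is connected to the processor and the input-output device is connected to the processor.
For specific examples of this embodiment, reference may be made to the examples described in the above embodiments and exemplary implementations, which are not repeated here.
Obviously, those skilled in the art should understand that the modules or steps of the present disclosure described above can be implemented using general-purpose computing devices, and that they can be concentrated on a single computing device or distributed across a network composed of multiple computing devices. They may be implemented in program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device, and in some cases the steps shown or described may be executed in a sequence different from that given here. Alternatively, they can be made into individual integrated circuit modules, or multiple modules or steps among them can be made into a single integrated circuit module. As such, the present disclosure is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present disclosure and are not intended to limit it. For those skilled in the art, various modifications and changes may be made to the present disclosure. Any modification, equivalent replacement, improvement, and the like made within the principles of the present disclosure shall fall within its scope of protection.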
Claims (15)
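- A heterogeneous acceleration resource exception handling method, the method comprising: determining, by performing hardware health monitoring on the heterogeneous acceleration resources of a cloud computing platform, whether each heterogeneous acceleration resource is a hardware healthy resource or a hardware unhealthy resource; determining, by performing device usage health monitoring on the heterogeneous acceleration resources, whether each heterogeneous acceleration resource is a usage healthy resource or an allocation fault resource; performing hardware exception processing on the hardware unhealthy resources; and performing allocation exception processing on the allocation fault resources.
- The method according to claim 1, wherein, before determining whether each heterogeneous acceleration resource is a hardware healthy resource or a hardware unhealthy resource by performing hardware health monitoring on the heterogeneous acceleration resources of the cloud computing platform, the method further comprises: determining whether the heterogeneous acceleration resource exists by scanning PCI slots; and, if the heterogeneous acceleration resource exists, acquiring the resource information of the heterogeneous acceleration resource.
- The method according to claim 2, wherein determining whether each heterogeneous acceleration resource is a hardware healthy resource or a hardware unhealthy resource by performing hardware health monitoring on the heterogeneous acceleration resources of the cloud computing platform comprises: calling the corresponding hardware health detection interface according to the resource information of the heterogeneous acceleration resource; determining the hardware status of the heterogeneous acceleration resource through the hardware health detection interface; if the hardware status is healthy, determining that the heterogeneous acceleration resource is the hardware healthy resource; and if the hardware status is unhealthy, determining that the heterogeneous acceleration resource is the hardware unhealthy resource.
- The method according to claim 1, wherein the method further comprises: raising exception alarms for the hardware unhealthy resources and the allocation fault resources.
- The method according to claim 1, wherein determining whether each heterogeneous acceleration resource is a usage healthy resource or an allocation fault resource by performing device usage health monitoring on the heterogeneous acceleration resources comprises: acquiring the allocation data of the heterogeneous acceleration resources; and determining the usage healthy resources and the allocation fault resources according to the allocation data.
- The method according to claim 5, wherein determining the usage healthy resources or the allocation fault resources according to the allocation data comprises: determining the actual usage data of the heterogeneous acceleration resources; and comparing the allocation data and the actual usage data of each heterogeneous acceleration resource in turn, wherein if the allocation data and the actual usage data are consistent, the heterogeneous acceleration resource is determined to be a usage healthy resource, and otherwise the heterogeneous acceleration resource is determined to be an allocation fault resource.
- The method according to claim 6, wherein performing allocation exception processing on the allocation fault resource comprises: updating the allocation data of the allocation fault resource according to the actual usage data.
- The method according to claim 1, wherein performing hardware exception processing on the hardware unhealthy resource comprises: determining whether the usage status of the hardware unhealthy resource is unavailable, and if the judgment result is no, setting the usage status of the hardware unhealthy resource to unavailable and setting the recovery status of the hardware unhealthy resource to recoverable; and determining whether the hardware unhealthy resource has been allocated to customers, and if the judgment result is yes, notifying the cloud computing platform to migrate the customers to which the hardware unhealthy resource has been allocated, and/or setting the recovery status of the hardware unhealthy resource to non-recoverable.
- The method according to claim 1, wherein the method further comprises: acquiring the abnormal resource information corresponding to the hardware unhealthy resources and the allocation fault resources; standardizing the abnormal resource information to obtain standardized abnormal information; and reporting the standardized abnormal information to the cloud computing platform.
- The method according to claim 9, wherein the method further comprises: acquiring the standardized abnormal information from the cloud computing platform; acquiring the health resource information corresponding to the hardware healthy resources and the usage healthy resources; standardizing the health resource information to obtain standardized health information; determining recoverable resources from the standardized abnormal information according to the standardized health information; and, if the recovery status of a recoverable resource is recoverable, performing recovery processing on the recoverable resource.
- The method according to claim 10, wherein determining recoverable resources from the standardized abnormal information according to the standardized health information comprises: matching the standardized health information against the standardized abnormal information according to preset matching rules, wherein the preset matching rules include matching at least one of the following items of resource information: PCI address, manufacturer information, model; and determining the heterogeneous acceleration resource corresponding to successfully matched standardized abnormal information as the recoverable resource.
- The method according to claim 10, wherein performing recovery processing on the recoverable resource if the recovery status of the recoverable resource is recoverable comprises: if an exception alarm corresponding to the recoverable resource exists, cancelling the exception alarm; and setting the usage status of the recoverable resource to available.
- A heterogeneous acceleration resource exception handling device, the device comprising: a first monitoring module, configured to determine whether each heterogeneous acceleration resource is a hardware healthy resource or a hardware unhealthy resource by performing hardware health monitoring on the heterogeneous acceleration resources of a cloud computing platform; a second monitoring module, configured to determine whether each heterogeneous acceleration resource is a usage healthy resource or an allocation fault resource by performing device usage health monitoring on the heterogeneous acceleration resources; a first response module, configured to perform hardware exception processing on the hardware unhealthy resources; and a second response module, configured to perform allocation exception processing on the allocation fault resources.
- A computer-readable storage medium in which a computer program is stored, wherein the computer program is configured to execute, when run, the method according to any one of claims 1 to 12.
- An electronic device, comprising a memory and a processor, wherein a computer program is stored in the memory and the processor is configured to run the computer program to execute the method according to any one of claims 1 to 12.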
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210563855.XA CN117149474A (zh) | 2022-05-23 | 2022-05-23 | Heterogeneous acceleration resource exception handling method and device, storage medium, and electronic device |
CN202210563855.X | 2022-05-23 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023226601A1 (zh) | 2023-11-30 |
Family
ID=88885425
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/086292 WO2023226601A1 (zh) | 2023-04-04 | Heterogeneous acceleration resource exception handling method and device, storage medium, and electronic device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN117149474A (zh) |
WO (1) | WO2023226601A1 (zh) |
-
2022
- 2022-05-23 CN CN202210563855.XA patent/CN117149474A/zh active Pending
-
2023
- 2023-04-04 WO PCT/CN2023/086292 patent/WO2023226601A1/zh unknown
Also Published As
Publication number | Publication date |
---|---|
CN117149474A (zh) | 2023-12-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23810676 Country of ref document: EP Kind code of ref document: A1 |