WO2022055496A1 - GPU health scores - Google Patents

GPU health scores

Info

Publication number
WO2022055496A1
Authority
WO
WIPO (PCT)
Prior art keywords
gpu
operational data
health score
subset
processor
Prior art date
Application number
PCT/US2020/050390
Other languages
French (fr)
Inventor
Aleksei SHELAEV
Amit Kumar Singh
Lorri JEFFERSON
George GUEORGUIEV
Byron A Alcorn
Abhishek Ghosh
Original Assignee
Hewlett-Packard Development Company, L.P.
Application filed by Hewlett-Packard Development Company, L.P.
Priority to PCT/US2020/050390
Publication of WO2022055496A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F 11/0721 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0751 Error or fault detection not based on redundancy
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0766 Error or fault reporting or storing
    • G06F 11/0787 Storage of error reports, e.g. persistent data storage, storage using memory protection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F 11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F 11/1008 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F 11/1048 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices using arrangements adapted for a specific error detection or correction feature
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/30 Monitoring
    • G06F 11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F 11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/30 Monitoring
    • G06F 11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F 11/3457 Performance evaluation by simulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F 2201/81 Threshold
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F 2201/815 Virtual


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

In an example implementation according to aspects of the present disclosure, a system, method, and storage medium comprise a processor, memory, and instructions to receive a set of graphics processor unit (GPU) operational data. The system identifies a subset of the set of GPU operational data corresponding to a likelihood of failure and creates a health score for a GPU corresponding to the set of GPU operational data. The system maps the health score to a remediation action and executes the remediation action.

Description

GPU HEALTH SCORES
BACKGROUND
[0001] Graphics processing units (GPUs) execute the graphical rendering pipeline in a modern computing device. For specific workloads, graphics processing units may be utilized as coprocessor accelerators to increase computational throughput.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] FIG. 1 illustrates a system for generating GPU health scores, according to an example;
[0003] FIG. 2 is a block diagram corresponding to a GPU health score system, according to an example;
[0004] FIG. 3 is a flow diagram for generating GPU health scores, according to an example; and
[0005] FIG. 4 is a computing device for supporting instructions to create GPU health scores, according to an example.
DETAILED DESCRIPTION
[0006] Graphics processing units (GPUs) have become important components within a modern computing device. Computing devices ranging from small mobile handhelds (e.g. smart phones) to artificial intelligence rack-mounted servers utilize GPUs for accelerating specific workloads. The workloads may include traditional graphics pipeline rendering for applications, such as video games, all the way up to highly complex artificial intelligence modeling. During these workloads, GPUs may be pushed to their limits both thermally and electrically. GPU overload may result in computational errors, display errors, or complete GPU failure. Described herein is a system, method and computer readable medium for monitoring, detecting and remediating GPU failure conditions prior to failure by producing a GPU health score.
[0007] In one implementation, a system is coupled to memory with instructions for producing a GPU health score stored within. The instructions include instructions to receive a set of graphics processor unit (GPU) operational data and identify a subset of the set of GPU operational data corresponding to a likelihood of failure. The instructions also create a health score for a GPU corresponding to the set of GPU operational data, map the health score to a remediation action and execute the remediation action.
[0008] In another implementation, a method including receiving a set of graphics processor unit (GPU) operational data and identifying a subset of the set of GPU operational data corresponding to likelihood of failure. The method also includes creating a health score for a GPU corresponding to the set of GPU operational data, mapping the health score to a remediation action, sending an alert comprising the remediation action to a client system, and rendering a visualization of the health score.
[0009] Another example is a computer readable medium including instructions to receive a first set and a second set of graphics processor unit (GPU) operational data. The instructions also identify a first subset of the first set of GPU operational data corresponding to a likelihood of failure, identify a second subset of the second set of GPU operational data corresponding to a likelihood of failure and create a first health score and a second health score corresponding to the first subset and the second subset respectively. The instructions also create a difference between the first health score and the second health score, and render a visualization of the difference.
[0010] FIG. 1 illustrates a system for generating GPU health scores, according to an example.
[0011] The processor 102 of the system 100 may be implemented as dedicated hardware circuitry or a virtualized logical processor. The dedicated hardware circuitry may be implemented as a central processing unit (CPU). A dedicated hardware CPU may be implemented as a single to many-core general purpose processor. A dedicated hardware CPU may also be implemented as a multi-chip solution, where more than one CPU is linked through a bus and processing tasks are scheduled across the more than one CPU.
[0012] A virtualized logical processor may be implemented across a distributed computing environment. A virtualized logical processor may not have a dedicated piece of hardware supporting it. Instead, the virtualized logical processor may have a pool of resources supporting the task for which it was provisioned. In this implementation, the virtualized logical processor may actually be executed on hardware circuitry; however, the hardware circuitry is not dedicated. The hardware circuitry may be in a shared environment where utilization is time sliced. In some implementations, the virtualized logical processor includes a software layer between any executing application and the hardware circuitry to handle the abstraction, which also monitors and saves the application state. Virtual machines (VMs) may be implementations of virtualized logical processors.
[0013] A memory 104 may be implemented in the system 100. The memory 104 may be dedicated hardware circuitry to host instructions for the processor 102 to execute. In another implementation, the memory 104 may be virtualized logical memory. Analogous to the processor 102, dedicated hardware circuitry may be implemented with dynamic RAM (DRAM) or other hardware implementations for storing processor instructions. Additionally, the virtualized logical memory may be implemented in a software abstraction which allows the instructions 106 to be executed on a virtualized logical processor, independent of any dedicated hardware implementation.
[0014] The system 100 may also include instructions 106. The instructions 106 may be implemented in a platform specific language that the processor 102 may decode and execute. The instructions 106 may be stored in the memory 104 during execution. The instructions 106 may be encoded to perform operations such as receiving a set of GPU operational data 108, identifying a subset of the set of GPU operational data corresponding to a likelihood of failure 110, creating a health score for a GPU corresponding to the set of GPU operational data 112, mapping the health score to a remediation action 114 and executing the remediation action 116.
[0015] FIG. 2 is a block diagram corresponding to a GPU health score system 200, according to an example. The system 200 may include a number of connected devices illustrated as device A 202A, device B 202B, device N 202N, a telemetry receiver 206, a database 208, a model 210, an alerts system 212 and a remediation system 214.
[0016] The GPU operational data may be collected at device A 202A, device B 202B and device N 202N. Each device, illustrated separately, may be a computing device with different computing capabilities. For example, device A 202A may be a smartphone, device B 202B may be a notebook computer, and device N 202N may be a workstation computer. Each device may incorporate a GPU, illustrated as GPU A 204A, GPU B 204B, and GPU N 204N. In each corresponding device, the GPU may be of varying sophistication and computational power. Utilizing the previous device example, GPU A 204A may correspond to an integrated GPU built into the central processing unit of a smartphone, GPU B 204B may correspond to a discrete mobile GPU built into a gaming notebook computer, and GPU N 204N may correspond to a discrete add-in card.
[0017] During operation, GPU A 204A, GPU B 204B, and GPU N 204N log operational data regarding the states of the corresponding GPUs. For example, static information pertaining to the GPU may be logged including but not limited to the GPU card name, manufacturer, serial number, and installed driver version. Dynamic GPU information relevant to GPU health may also be logged including but not limited to error correction code (ECC) errors, blue screen error codes, the number of timeout detection and recovery (TDR) events, and operating system configuration errors. Table 1 illustrates an example of collected data.
[Table 1: example of collected GPU operational data; table image not reproduced in this text extraction]
Table 1
[0018] Device A 202A, device B 202B and device N 202N may include a telemetry agent (not shown). The telemetry agent collects, aggregates, and packages the GPU operational data for transmission. The telemetry agent may be logically connected to a network via an operating system interface to transmit the GPU operational data. The logical connection may be implemented utilizing an application programming interface (API) allowing the telemetry agent to access a network socket.
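As a concrete illustration of the kind of record a telemetry agent might package for transmission, the following is a minimal Python sketch of one row of Table 1 serialized to JSON. The class name, field names, and the package_record helper are illustrative assumptions, not part of the described system.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class GpuTelemetryRecord:
    # Static GPU information
    card_name: str
    manufacturer: str
    serial: str
    driver: str
    # Dynamic health-related counters for the reporting period
    ecc: int              # error correction code (ECC) errors
    tdr: int              # timeout detection and recovery (TDR) events
    thermal_issue: int    # logged thermal issues
    os_failure_code: int  # e.g. blue screen (BSOD) occurrences
    timestamp: str = ""

def package_record(record: GpuTelemetryRecord) -> str:
    """Package one GPU operational-data record as JSON for transmission."""
    if not record.timestamp:
        record.timestamp = datetime.now(timezone.utc).isoformat()
    return json.dumps(asdict(record))

# Example record a telemetry agent could send to the telemetry receiver.
payload = package_record(GpuTelemetryRecord(
    card_name="ExampleCard X1000", manufacturer="ExampleCorp",
    serial="SN-0001", driver="27.21.14.5671",
    ecc=3, tdr=1, thermal_issue=0, os_failure_code=0))
print(payload)
```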
[0019] The telemetry receiver 206, the database 208 and the model 210 may be implemented as the system 100 of FIG. 1. The processor 102 may be utilized as the execution unit to support the functionality of the telemetry receiver 206, the database 208 and the model 210. The instructions 106 may correspond to the data structure and operations of the telemetry receiver 206, the database 208 and the model 210. As such, the instructions 106 may mirror the functionality described for the telemetry receiver 206, the database 208, and the model 210.
[0020] A telemetry receiver 206 may be connected to a network utilizing a similar logical connection as the telemetry agents. The telemetry receiver 206 may be scalable to support many devices reporting GPU operational data. The telemetry receiver 206 may aggregate GPU operational data based on which device the operational data was received from. By aggregating the operational data, the telemetry receiver 206 may create a historical record of the GPU operational data of any attached device. The telemetry receiver 206 may package the GPU operational data into a format to be stored in a database 208.
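A minimal sketch of the receiver-side aggregation, assuming an in-memory dictionary keyed by device serial number as a stand-in for the database 208; the function and variable names are illustrative assumptions.

```python
from collections import defaultdict

# Keyed by device serial number; each value is that device's historical record.
history_by_serial: dict[str, list[dict]] = defaultdict(list)

def receive(record: dict) -> None:
    """Aggregate one GPU operational-data record by the reporting device."""
    history_by_serial[record["serial"]].append(record)

receive({"serial": "SN-0001", "week": "2020-W30", "ecc": 3, "tdr": 1})
receive({"serial": "SN-0001", "week": "2020-W31", "ecc": 5, "tdr": 0})
print(len(history_by_serial["SN-0001"]))  # 2 records retained for this device
```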
[0021] The database 208 may include the tables and records operable to store historical GPU operational data as well as the data used to derive a GPU health score. The database 208 may be a single database in one implementation. In another implementation, the database 208 may include more than one relational database communicatively coupled together. The relational databases may be linked by record identifiers corresponding to data inclusive to more than one of the databases. In another implementation, the database 208 may be a data lake whereby the data undergoes very little transformation or aggregation by the telemetry receiver 206. The structure of the database 208 in this implementation may vary and may not structurally resemble a traditional relational database. The database 208 provides the GPU operational data for the model 210 to create a GPU health score.
[0022] The model 210 executes logic to create the GPU health score. The model 210 may be communicatively coupled to the database 208 to receive GPU operational data. In one implementation, the model 210 may count or sum the occurrences of operational data in a given period. A count may refer to a totaling of a singular operational data, whereas a sum may include adding more than one operational data together. Referring back to the example in Table 1, the occurrences of 'Thermal issue' and 'OS Failure code' may be counted on a per-week basis for each 'Serial' value. Likewise, for the 'ECC' and 'TDR' operational data, the sum of the values may be calculated per week and per 'Serial'. The 'Driver' operational data is compared to the database 208 to determine the latest driver version for each 'Card name'.
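The weekly counting and summing step could be sketched with pandas as below. The input column names follow Table 1, but pandas itself, the sample values, and the output column names are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Serial":          ["SN-0001", "SN-0001", "SN-0002"],
    "Week":            ["2020-W30", "2020-W30", "2020-W30"],
    "ECC":             [3, 2, 0],
    "TDR":             [1, 0, 0],
    "Thermal issue":   [1, 0, 0],
    "OS Failure code": [0, 1, 0],
})

weekly = df.groupby(["Serial", "Week"]).agg(
    ecc_sum=("ECC", "sum"),                  # values summed per week and Serial
    tdr_sum=("TDR", "sum"),
    thermal_count=("Thermal issue", "sum"),  # occurrences counted (1 per logged event)
    bsod_count=("OS Failure code", "sum"),
).reset_index()
print(weekly)
```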
[0023] In one implementation, the model 210 may compare calculated parameters to critical thresholds. Referring back to the example in Table 1, the calculated count or sum values for the parameters may be compared to critical thresholds. The critical thresholds may be predefined based on subject matter expertise and statistical analysis. The exact values of the predefined thresholds may be stored in a table inclusive to the model or may be hard coded into the model explicitly. The resulting comparisons may be stored within the model with '_number' appended to the name so as to differentiate them from the operational data originally collected.
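One possible reading of the '_number' convention is a per-parameter comparison of the weekly count or sum against its critical threshold, as in the sketch below; the threshold values shown are placeholders, not the predefined values referred to above.

```python
# Placeholder critical thresholds per parameter (the real values would come
# from subject-matter expertise and statistical analysis).
CRITICAL_THRESHOLDS = {"ecc": 2, "tdr": 1, "thermal_issues": 1, "bsod": 1}

def to_number_flags(weekly_values: dict) -> dict:
    """Compare weekly counts/sums to thresholds; store results as '<name>_number'."""
    return {
        f"{name}_number": int(weekly_values.get(name, 0) >= threshold)
        for name, threshold in CRITICAL_THRESHOLDS.items()
    }

print(to_number_flags({"ecc": 5, "tdr": 0, "thermal_issues": 0, "bsod": 1}))
# {'ecc_number': 1, 'tdr_number': 0, 'thermal_issues_number': 0, 'bsod_number': 1}
```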
[0024] Continuing, the calculated count or sum values from the previous comparison may be averaged on a rolling three-week basis. If the three-week average value is increasing for three weeks in a row, a new variable with '_growth' added to the name is created, and its value may be stored within the model as well. For 'ECC', an operational data ('ECC_growth_high') equal to 'ECC_number' multiplied by 'ECC_growth' is calculated, since the presence of both of these operational data in a given week per device may be a strong sign of graphics card malfunction.
[0025] In an implementation, a health score may be calculated based on the '_number' and '_growth' variables in accordance with Equation 1:
Health Score = 5 * ecc_number + 5 * ecc_growth + 10 * ecc_growth_high + 5 * thermal_issues_number + 5 * thermal_issues_growth + 5 * bsod_number + 5 * bsod_growth + 5 * tdr_number + 5 * tdr_growth
Equation 1
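A minimal Python sketch of the rolling growth check and Equation 1, assuming the '_growth' flag means three consecutive week-over-week increases of the three-week rolling average; that reading, the helper names, and the sample input are assumptions made for illustration.

```python
def growth_flag(weekly_numbers: list[float]) -> int:
    """1 if the rolling 3-week average has increased for 3 consecutive weeks."""
    avg = [sum(weekly_numbers[i - 2:i + 1]) / 3.0
           for i in range(2, len(weekly_numbers))]
    recent = avg[-4:]  # three week-over-week increases require four averages
    return int(len(recent) == 4 and all(b > a for a, b in zip(recent, recent[1:])))

WEIGHTS = {  # Equation 1 weights
    "ecc_number": 5, "ecc_growth": 5, "ecc_growth_high": 10,
    "thermal_issues_number": 5, "thermal_issues_growth": 5,
    "bsod_number": 5, "bsod_growth": 5,
    "tdr_number": 5, "tdr_growth": 5,
}

def health_score(variables: dict) -> int:
    """Equation 1: weighted sum of the '_number' and '_growth' variables."""
    variables = dict(variables)
    # ECC_growth_high: an ECC threshold breach and ECC growth in the same week.
    variables["ecc_growth_high"] = (
        variables.get("ecc_number", 0) * variables.get("ecc_growth", 0))
    return sum(weight * variables.get(name, 0) for name, weight in WEIGHTS.items())

print(health_score({"ecc_number": 1, "ecc_growth": 1, "tdr_number": 1}))  # 5+5+10+5 = 25
```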
[0026] Note that certain operational data may be weighted more strongly in the health score as they may be more indicative of GPU failure. Based on the resulting health score, each device (from device A 202A, device B 202B and device N 202N) may be mapped to a color-coded health category. A health score of ten (10) or more maps to a red category, which categorizes the devices requiring immediate attention. A health score of more than zero (0) and less than ten (10) may be categorized to a yellow category. The yellow category may include devices with some health/performance issue present. A health score of zero (0) maps to the green category, which indicates no health or performance issues observed.
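A minimal sketch of the color-coded category mapping described above.

```python
def health_category(score: float) -> str:
    """Map a health score to the red/yellow/green categories described above."""
    if score >= 10:
        return "red"     # requires immediate attention
    if score > 0:
        return "yellow"  # some health/performance issue present
    return "green"       # no health or performance issues observed

print([health_category(s) for s in (25, 5, 0)])  # ['red', 'yellow', 'green']
```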
[0027] Additionally, each device (from device A 202A, device B 202B and device N 202N) may be assigned a criticality score for each week. The criticality score may be calculated based on the value of the health score and the total number of devices within the set for that given week. For calculating the criticality score, the devices may be sorted by the value of the health score in descending order. The device with the highest health score may be scored with a criticality equal to one (1). The device with the second highest health score may be scored with a criticality equal to two (2), and so on. The criticality score of the device with the lowest health score may be equal to the total number of devices.
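The weekly criticality ranking could be sketched as a descending sort of health scores, as below; tie handling is not specified in the description, so the sketch simply follows sort order.

```python
def criticality_scores(health_by_device: dict[str, float]) -> dict[str, int]:
    """Rank devices by health score (descending); 1 = highest health score."""
    ordered = sorted(health_by_device, key=health_by_device.get, reverse=True)
    return {device: rank for rank, device in enumerate(ordered, start=1)}

print(criticality_scores({"SN-0001": 25, "SN-0002": 0, "SN-0003": 5}))
# {'SN-0001': 1, 'SN-0003': 2, 'SN-0002': 3}
```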
[0028] Continuing the implementation described above, each device (from device A 202A, device B 202B and device N 202N) may be mapped to a detailed description of the observed performance or health problems and the list of suggested remediation steps. The descriptions of the problems and suggested remediation steps may be created based on the domain knowledge and publicly available troubleshooting steps for the devices. A sample remediation action list is presented in Table 2.
[Table 2: sample remediation action list mapping health score components to suggested remediation steps; table images not reproduced in this text extraction]
Table 2
[0029] The sample remediation list in Table 2 is not limited to the remediations included and is used to illustrate the relationship between health scores and remediation actions.
[0030] Upon mapping a health score or criticality score to a remediation, the model 210 may send a message to an alert system 212 and/or a remediation system 214. The alert system 212 and the remediation system 214 may be client systems to receive the alerts and remediation actions. The alert system 212 may take the form of a dashboard interface for a fleet manager responsible for monitoring the devices (device A 202A, device B 202B and device N 202N). The alert system 212 may indicate to the fleet manager the categories of each of the devices within the fleet (e.g. red, yellow, and green). The alert system 212 may present a user-activated remediation step to the fleet manager (e.g. replace graphics card). In another implementation, the alert system 212 may be communicatively coupled to the telemetry agent on the device. The alert system 212 may propagate an alert to the user of the device that a remediation step may be due. For example, a graphics card failure based on 'ECC_growth' may be determined. The alert system 212 may present the fleet manager with a category "red" representation of device A 202A. The alert system 212 may propagate an alert to the telemetry agent executing on device A 202A indicating to the user that the graphics card needs to be replaced and presenting actions on how to accomplish that.
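A sketch of how a category might be fanned out to the alert paths described above, with a placeholder remediation mapping in the spirit of Table 2. The mapping entries and the notify_* helpers are illustrative assumptions, not the actual Table 2 contents.

```python
# Placeholder remediation mapping in the spirit of Table 2 (not the actual table).
REMEDIATION_BY_CATEGORY = {
    "red":    ("Repeated ECC errors with sustained growth", "Replace the graphics card"),
    "yellow": ("Outdated or mismatched driver detected", "Update the graphics driver"),
    "green":  ("No health or performance issues observed", "No action needed"),
}

def notify_fleet_manager(device: str, category: str, description: str, action: str) -> None:
    print(f"[dashboard] {device}: {category} - {description} -> {action}")

def notify_device_user(device: str, action: str) -> None:
    print(f"[device {device}] Suggested remediation: {action}")

def dispatch(device: str, category: str) -> None:
    """Send the category, description, and remediation to both alert paths."""
    description, action = REMEDIATION_BY_CATEGORY[category]
    notify_fleet_manager(device, category, description, action)
    if category != "green":
        notify_device_user(device, action)

dispatch("device A", "red")
```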
[0031] In another implementation, remediation system 214 may be utilized to accomplish a remediation step as identified by the model 210. In situations where the remediation step may be automated, the remediation system 214 may take action. For example, a remediation system 214 may take the form of an endpoint management system. When a remediation step corresponds to a driver update, or another task that can be automated by the remediation system 214, the remediation system 214 may push the fix to the affected device.
[0032] FIG. 3 is a flow diagram for generating GPU health scores, according to an example. For the purposes of describing the features of FIG. 3, references may be made to FIG. 1 and FIG. 2.
[0033] At 302, the processor 102 receives a set of GPU operational data. As described in reference to FIG. 2, the GPU operational data may correspond to the information retrieved from a telemetry agent and illustrated in Table 1. The GPU operational data may include the graphics card model name, the graphics card manufacturer, the graphics card serial number, a time date stamp, ECC errors, logged thermal issues, operating system (OS) failures (e.g. blue screen of death, BSOD), TDRs, OS compatibility problems, and graphics card driver versions.
[0034] In another implementation, the processor 102 may receive a second set of GPU operational data. The second set of GPU operational data may correspond to the same device reporting for a later period in time.
[0035] At 304, the processor 102 identifies a subset of the set of GPU operational data corresponding to a likelihood of failure. In one implementation, ECC errors and TDRs may be identified as corresponding to a likelihood of failure. As such, the ECC error and TDR values may be identified as a subset of the set of GPU operational data corresponding to a likelihood of failure. While Equation 1 illustrates the same weighted value applied to each of the operational data of the first subset of the set of GPU operational data, varying weighted values may also be used to emphasize which operational data corresponds to the greatest likelihood of failure.
[0036] The processor 102 may identify a second subset of the second set of GPU operational data corresponding to a likelihood of failure. The second subset may correspond to the subset identified in the set of GPU operational data. In one example, the targeted operational data of the subset and the second subset would be the same (e.g. ECC errors, TDRs). The values of the operational data may be different, however.
[0037] At 306, the processor 102 creates a health score for a GPU corresponding to the set of GPU operational data. As described in relation to Equation 1, a health score may be determined based on the GPU operational data. The processor 102 may assign a weighted value to each operational data of the first subset. As such, the health score may be adjusted based on the weighted value.
[0038] The processor 102 may create a second health score for a GPU corresponding to the second set of GPU operational data. For purposes of meaningful metrics, the second health score may be calculated utilizing the same weighting as the health score.
[0039] The processor 102 may calculate a difference between the health score and the second health score. The difference may illustrate an improvement in the GPU health score over time, or a deterioration of the health score over time.
[0040] At 308, the processor 102 maps the health score to a remediation action. Upon the health score passing a threshold, and the criticality score passing a threshold, the processor 102 may map remediation actions to the more problematic components of the health score. Referring to the implementation described in reference to FIG. 2, where the health score may be 25 and the criticality may be 4, the processor 102 may map individual components of the health score to remediation steps. In this example, when 'ECC_growth' surpasses two (2) over a three-week period, the mapped remediation corresponds to a recommendation to replace the graphics card.
[0041] In another implementation, the processor 102 may send an alert to a client system based on the difference surpassing a threshold. The client system may correspond to the affected device from device A 202A, device B 202B and device N 202N. The alert system 212 may receive the alert message from the model 210 and transmit it to the client system corresponding to the GPU.
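A minimal sketch of the difference-based alert, assuming a higher health score indicates worse health (consistent with the red category above); the threshold value and helper names are assumptions.

```python
DIFFERENCE_ALERT_THRESHOLD = 10  # placeholder threshold, not from the description

def score_difference(first_score: float, second_score: float) -> float:
    """Positive when the later score is higher, i.e. health is deteriorating."""
    return second_score - first_score

def maybe_alert(device: str, first_score: float, second_score: float) -> None:
    diff = score_difference(first_score, second_score)
    if diff > DIFFERENCE_ALERT_THRESHOLD:
        print(f"ALERT {device}: health score worsened by {diff}")

maybe_alert("device A", first_score=5, second_score=25)  # triggers the alert
```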
[0042] At 310, the processor 102 sends an alert comprising a remediation action to a client system. The alert system 212 may receive a notification from the model 210 to notify the fleet manager. The same notification may also be transmitted from the alert system 212 to a client system, which may be the affected device from device A 202A, device B 202B and device N 202N. In another example, the notification transmitted to the client system may be different from that transmitted to notify the fleet manager. The notification transmitted to the client system may be more verbose so as to provide detailed instruction for a lay person to effect the change prescribed by the remediation.
[0043] In another embodiment, the processor 102 may identify an actionable remediation action from the remediation action. An actionable remediation action may include actions that may be implemented through autonomous means. For example, software installation may be an actionable remediation action. Replacing a graphics card may not be an actionable remediation action. The processor 102 may push the actionable remediation action to the client system with an endpoint management system. For example, if the actionable remediation action includes installing a new graphics card driver, the processor 102 may utilize a remediation system 214 (e.g. endpoint management) to push the software fix to the client system, where the client system is the affected device.
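A sketch separating actionable (automatable) remediation actions from manual ones and pushing the actionable ones through an endpoint-management-style interface; the ACTIONABLE set and push_to_device stand-in are illustrative assumptions.

```python
# Remediations that can be carried out autonomously (e.g. software installs).
ACTIONABLE = {"Update the graphics driver", "Reinstall the graphics driver"}

def push_to_device(device: str, action: str) -> None:
    """Stand-in for an endpoint management push (remediation system 214)."""
    print(f"Pushing '{action}' to {device}")

def execute_remediation(device: str, action: str) -> None:
    if action in ACTIONABLE:
        push_to_device(device, action)  # automated fix
    else:
        print(f"{device}: '{action}' requires manual intervention")

execute_remediation("device A", "Update the graphics driver")
execute_remediation("device A", "Replace the graphics card")
```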
[0044] At 312, the processor 102 renders a visualization of the health score. The processor 102 may render a dashboard to illustrate to a fleet manager the health of all devices (from device A 202A, device B 202B and device N 202N). In one implementation, the processor 102 may render a difference between the health score and the second health score. The rendering may include a visualization of a trend in the health score. The visualization may include an arrow indicating the direction of the trend, corresponding to the affected device.
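A minimal sketch of the trend indicator, assuming the arrow direction is derived from the sign of the change in the health score.

```python
def trend_arrow(previous_score: float, current_score: float) -> str:
    """Arrow indicating the direction of the health-score trend."""
    if current_score > previous_score:
        return "\u2191"  # score rising: health deteriorating
    if current_score < previous_score:
        return "\u2193"  # score falling: health improving
    return "\u2192"      # unchanged

print(f"device A health: 5 -> 25 {trend_arrow(5, 25)}")
```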
[0045] FIG. 4 is a computing device for supporting instructions to create GPU health scores, according to an example. The computing device 400 depicts a processor 102 and a storage medium 404 and, as an example of the computing device 400 performing its operations, the storage medium 404 may include instructions 406-418 that are executable by the processor 102. The processor 102 may be synonymous with the processor 102 referenced in FIG. 1. Additionally, the processor 102 may include but is not limited to central processing units (CPUs). The storage medium 404 can be said to store program instructions that, when executed by processor 102, implement the components of the computing device 400.
[0046] The executable program instructions stored in the storage medium 404 include, as an example, instructions to receive a first set and a second set of GPU operational data 406, instructions to identify a first subset of the first set of GPU operation data corresponding to a likelihood of failure 408, instructions to identify a second subset of the second set of GPU operational data corresponding to a likelihood of failure 410, instructions to create a first health score for the GPU corresponding to the first subset 412, instructions to create a second health score for the GPU corresponding to the second subset 414, instructions to create a difference between the first health score and the second health score 416, and render a visualization of the difference 418.
[0047] Storage medium 404 represents generally any number of memory components capable of storing instructions that can be executed by processor 102. Storage medium 404 is non-transitory in the sense that it does not encompass a transitory signal but instead is made up of at least one memory component configured to store the relevant instructions. As a result, the storage medium 404 may be a non-transitory computer-readable storage medium. Storage medium 404 may be implemented in a single device or distributed across devices. Likewise, processor 102 represents any number of processors capable of executing instructions stored by storage medium 404. Processor 102 may be integrated in a single device or distributed across devices. Further, storage medium 404 may be fully or partially integrated in the same device as processor 102, or it may be separate but accessible to that computing device 400 and the processor 102.
[0048] In one example, the program instructions 406-418 may be part of an installation package that, when installed, can be executed by processor 102 to implement the components of the computing device 400. In this case, storage medium 404 may be a portable medium such as a CD, DVD, or flash drive, or a memory maintained by a server from which the installation package can be downloaded and installed. In another example, the program instructions may be part of an application or applications already installed. Here, storage medium 404 can include integrated memory such as a hard drive, solid state drive, or the like.
[0049] It is appreciated that examples described may include various components and features. It is also appreciated that numerous specific details are set forth to provide a thorough understanding of the examples. However, it is appreciated that the examples may be practiced without limitations to these specific details. In other instances, well known methods and structures may not be described in detail to avoid unnecessarily obscuring the description of the examples. Also, the examples may be used in combination with each other.
[0050] Reference in the specification to “an example” or similar language means that a particular feature, structure, or characteristic described in connection with the example is included in at least one example, but not necessarily in other examples. The various instances of the phrase “in one example” or similar phrases in various places in the specification are not necessarily all referring to the same example.
[0051] It is appreciated that the previous description of the disclosed examples is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these examples will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other examples without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the examples shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

WHAT IS CLAIMED IS:
1. A system comprising:
   a memory comprising instructions; and
   a processor communicatively coupled to the memory wherein the instructions when executed cause the processor to:
      receive a set of graphics processor unit (GPU) operational data;
      identify a subset of the set of GPU operational data corresponding to likelihood of failure;
      create a health score for a GPU corresponding to the set of GPU operational data;
      map the health score to a remediation action; and
      execute the remediation action.

2. The system of claim 1, the instructions when executed further cause the processor to:
   send an alert to a client system; and
   render a visualization of the health score.

3. The system of claim 1, the instructions when executed further cause the processor to:
   identify a first subset of the subset of the set of GPU operational data corresponding to a critical likelihood of failure;
   assign a weighted value to each operational data of the first subset; and
   adjust the health score based on the weighted value.

4. The system of claim 1 wherein the set of GPU operational data comprises recorded production device values from a plurality of client systems.

5. The system of claim 1, the instructions when executed further cause the processor to:
   receive a second set of GPU operational data;
   identify a second subset of the set of GPU operational data corresponding to a likelihood of failure;
   create a second health score for a GPU corresponding to the second set of GPU operational data;
   calculate a difference between the health score and the second health score; and
   send an alert to a client system based on the difference surpassing a threshold.

6. A method comprising:
   receiving a set of graphics processor unit (GPU) operational data;
   identifying a subset of the set of GPU operational data corresponding to likelihood of failure;
   creating a health score for a GPU corresponding to the set of GPU operational data;
   mapping the health score to a remediation action;
   sending an alert comprising the remediation action to a client system; and
   rendering a visualization of the health score.

7. The method of claim 6 further comprising:
   identifying an actionable remediation action from the remediation action;
   push the actionable remediation action to the client system with an endpoint management system.

8. The method of claim 6, further comprising:
   identifying a first subset of the subset of the set of GPU operational data corresponding to a critical likelihood of failure;
   assigning a weighted value to each operational data of the first subset; and
   adjusting the health score based on the weighted value.

9. The method of claim 6 wherein the set of GPU operational data comprises recorded production device values from a plurality of client systems.

10. The method of claim 6, further comprising:
    receiving a second set of GPU operational data;
    identifying a second subset of the set of GPU operational data corresponding to a likelihood of failure;
    creating a second health score for a GPU corresponding to the second set of GPU operational data;
    calculating a difference between the health score and the second health score; and
    sending an alert to a client system based on the difference surpassing a threshold.

11. A non-transitory computer readable medium comprising instructions executable by a processor to:
    receive a first set and a second set of graphics processor unit (GPU) operational data;
    identify a first subset of the first set of GPU operational data corresponding to likelihood of failure;
    identify a second subset of the second set of GPU operational data corresponding to a likelihood of failure;
    create a first health score for a GPU corresponding to the first subset;
    create a second health score for the GPU corresponding to the second subset;
    create a difference between the first health score and the second health score; and
    render a visualization of the difference.

12. The medium of claim 11, the instructions when executed further cause the processor to send an alert to a client system.

13. The medium of claim 11, wherein the second set of GPU operational data and the first set of GPU operational data correspond to operational information of a single GPU.

14. The medium of claim 11, the instructions when executed further cause the processor to:
    identify an actionable remediation action corresponding to the difference; and
    push the actionable remediation action to a client system with an endpoint management system.
PCT/US2020/050390 2020-09-11 2020-09-11 Gpu health scores WO2022055496A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2020/050390 WO2022055496A1 (en) 2020-09-11 2020-09-11 Gpu health scores

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2020/050390 WO2022055496A1 (en) 2020-09-11 2020-09-11 Gpu health scores

Publications (1)

Publication Number Publication Date
WO2022055496A1 true WO2022055496A1 (en) 2022-03-17

Family

ID=80629762

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/050390 WO2022055496A1 (en) 2020-09-11 2020-09-11 Gpu health scores

Country Status (1)

Country Link
WO (1) WO2022055496A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105393252A (en) * 2013-04-18 2016-03-09 数字标记公司 Physiologic data acquisition and analysis
US20190057006A1 (en) * 2016-02-14 2019-02-21 Dell Products, Lp System and method to assess information handling system health and resource utilization
US10069710B2 (en) * 2016-03-01 2018-09-04 Dell Products, Lp System and method to identify resources used by applications in an information handling system
US20190068627A1 (en) * 2017-08-28 2019-02-28 Oracle International Corporation Cloud based security monitoring using unsupervised pattern recognition and deep learning

Similar Documents

Publication Publication Date Title
US8892960B2 (en) System and method for determining causes of performance problems within middleware systems
US10055275B2 (en) Apparatus and method of leveraging semi-supervised machine learning principals to perform root cause analysis and derivation for remediation of issues in a computer environment
US11803546B2 (en) Selecting interruptible resources for query execution
JP6373482B2 (en) Interface for controlling and analyzing computer environments
US11263071B2 (en) Enabling symptom verification
US10613970B1 (en) Method and system for managing deployment of software application components based on software performance data
US9176759B1 (en) Monitoring and automatically managing applications
US11734100B2 (en) Edge side filtering in hybrid cloud environments
AU2013344538B2 (en) Dynamic graph performance monitoring
US20130197863A1 (en) Performance and capacity analysis of computing systems
KR102404170B1 (en) Dynamic component performance monitoring
US20150324245A1 (en) Intermediate database management layer
CN112380089A (en) Data center monitoring and early warning method and system
US20190026174A1 (en) Integrated statistical log data mining for mean time auto-resolution
US20200327036A1 (en) Topology aware real time gpu-to-gpu traffic monitoring method and analyzing tools
WO2020027931A1 (en) Real time telemetry monitoring tool
CN111897696A (en) Server cluster hard disk state detection method and device, electronic equipment and storage medium
WO2022055496A1 (en) Gpu health scores
EP3993353A2 (en) System and method for managing clusters in an edge network
US20220107858A1 (en) Methods and systems for multi-resource outage detection for a system of networked computing devices and root cause identification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20953479

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20953479

Country of ref document: EP

Kind code of ref document: A1