WO2022055496A1 - GPU health scores - Google Patents

GPU health scores

Info

Publication number
WO2022055496A1
Authority
WO
WIPO (PCT)
Prior art keywords
gpu
operational data
health score
subset
processor
Prior art date
Application number
PCT/US2020/050390
Other languages
French (fr)
Inventor
Aleksei SHELAEV
Amit Kumar Singh
Lorri JEFFERSON
George GUEORGUIEV
Byron A Alcorn
Abhishek Ghosh
Original Assignee
Hewlett-Packard Development Company, L.P.
Application filed by Hewlett-Packard Development Company, L.P.
Priority to PCT/US2020/050390
Publication of WO2022055496A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F 11/0721 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0751 Error or fault detection not based on redundancy
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0766 Error or fault reporting or storing
    • G06F 11/0787 Storage of error reports, e.g. persistent data storage, storage using memory protection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F 11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F 11/1008 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F 11/1048 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices using arrangements adapted for a specific error detection or correction feature
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/30 Monitoring
    • G06F 11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F 11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/30 Monitoring
    • G06F 11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F 11/3457 Performance evaluation by simulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F 2201/81 Threshold
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F 2201/815 Virtual


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

In an example implementation according to aspects of the present disclosure, a system, method, and storage medium comprise a processor, memory, and instructions to receive a set of graphics processor unit (GPU) operational data. The system identifies a subset of the set of GPU operational data corresponding to a likelihood of failure and creates a health score for a GPU corresponding to the set of GPU operational data. The system maps the health score to a remediation action and executes the remediation action.

Description

GPU HEALTH SCORES
BACKGROUND
[0001] Graphics processing units (GPUs) execute the graphical rendering pipeline in a modern computing device. For specific workloads, graphics processing units may be utilized as coprocessor accelerators to increase computational throughput.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] FIG. 1 illustrates a system for generating GPU health scores, according to an example;
[0003] FIG. 2 is a block diagram corresponding to a GPU health score system, according to an example;
[0004] FIG. 3 is a flow diagram for generating GPU health scores, according to an example; and
[0005] FIG. 4 is a computing device for supporting instructions to create GPU health scores, according to an example.
DETAILED DESCRIPTION
[0006] Graphics processing units (GPUs) have become important components within a modern computing device. Computing devices ranging from small mobile handhelds (e.g. smart phones) to artificial intelligence rack-mounted servers utilize GPUs for accelerating specific workloads. The workloads may include traditional graphics pipeline rendering for applications, such as video games, all the way up to highly complex artificial intelligence modeling. During these workloads, GPUs may be pushed to their limits both thermally and electrically. GPU overload may result in computational errors, display errors, or complete GPU failure. Described herein is a system, method and computer readable medium for monitoring, detecting and remediating GPU failure conditions prior to failure by producing a GPU health score.
[0007] In one implementation, a system is coupled to memory with instructions for producing a GPU health score stored within. The instructions include instructions to receive a set of graphics processor unit (GPU) operational data and identify a subset of the set of GPU operational data corresponding to a likelihood of failure. The instructions also create a health score for a GPU corresponding to the set of GPU operational data, map the health score to a remediation action and execute the remediation action.
[0008] In another implementation, a method including receiving a set of graphics processor unit (GPU) operational data and identifying a subset of the set of GPU operational data corresponding to likelihood of failure. The method also includes creating a health score for a GPU corresponding to the set of GPU operational data, mapping the health score to a remediation action, sending an alert comprising the remediation action to a client system, and rendering a visualization of the health score.
[0009] Another example is a computer readable medium including instructions to receive a first set and a second set of graphics processor unit (GPU) operational data. The instructions also identify a first subset of the first set of GPU operational data corresponding to a likelihood of failure, identify a second subset of the second set of GPU operational data corresponding to a likelihood of failure and create a first health score and a second health score corresponding to the first subset and the second subset respectively. The instructions also create a difference between the first health score and the second health score, and render a visualization of the difference.
[0010] FIG. 1 illustrates a system for generating GPU health scores, according to an example.
[0011] The processor 102 of the system 100 may be implemented as dedicated hardware circuitry or a virtualized logical processor. The dedicated hardware circuitry may be implemented as a central processing unit (CPU). A dedicated hardware CPU may be implemented as a single to many-core general purpose processor. A dedicated hardware CPU may also be implemented as a multi-chip solution, where more than one CPU is linked through a bus and processing tasks are scheduled across the more than one CPU.
[0012] A virtualized logical processor may be implemented across a distributed computing environment. A virtualized logical processor may not have a dedicated piece of hardware supporting it. Instead, the virtualized logical processor may have a pool of resources supporting the task for which it was provisioned. In this implementation, the virtualized logical processor may actually be executed on hardware circuitry; however, the hardware circuitry is not dedicated. The hardware circuitry may be in a shared environment where utilization is time sliced. In some implementations, the virtualized logical processor includes a software layer between any executing application and the hardware circuitry to handle the abstraction, which also monitors and saves the application state. Virtual machines (VMs) may be implementations of virtualized logical processors.
[0013] A memory 104 may be implemented in the system 100. The memory 104 may be dedicated hardware circuitry to host instructions for the processor 102 to execute. In another implementation, the memory 104 may be virtualized logical memory. Analogous to the processor 102, dedicated hardware circuitry may be implemented with dynamic RAM (DRAM) or other hardware implementations for storing processor instructions. Additionally, the virtualized logical memory may be implemented in a software abstraction which allows the instructions 106 to be executed on a virtualized logical processor, independent of any dedicated hardware implementation.
[0014] The system 100 may also include instructions 106. The instructions 106 may be implemented in a platform specific language that the processor 102 may decode and execute. The instructions 106 may be stored in the memory 104 during execution. The instructions 106 may be encoded to perform operations such as receiving a set of GPU operational data 108, identifying a subset of the set of GPU operational data corresponding to a likelihood of failure 110, creating a health score for a GPU corresponding to the set of GPU operational data 112, mapping the health score to a remediation action 114 and executing the remediation action 116.
[0015] FIG. 2 is a block diagram corresponding to a GPU health score system 200, according to an example. The system 200 may include a number of connected devices illustrated as device A 202A, device B 202B, device N 202N, a telemetry receiver 206, a database 208, a model 210, an alerts system 212 and a remediation system 214.
[0016] The GPU operational data may be collected at device A 202A, device B 202B and device N 202N. Each device, illustrated separately, may be a computing device with different computing capabilities. For example, device A 202A may be a smartphone, device B 202B may be a notebook computer, and device N 202N may be a workstation computer. Each device may incorporate a GPU, illustrated as GPU A 204A, GPU B 204B, and GPU N 204N. In each corresponding device, the GPU may be of varying sophistication and computational power. Utilizing the previous device example, GPU A 204A may correspond to an integrated GPU built into the central processing unit of a smartphone, GPU B 204B may correspond to a discrete mobile GPU built into a gaming notebook computer, and GPU N 204N may correspond to a discrete add-in card.
[0017] During operation, GPU A 204A, GPU B 204B, and GPU N 204N log operational data regarding the states of the corresponding GPUs. For example, static information pertaining to the GPU may be logged including but not limited to the GPU card name, manufacturer, serial number, and installed driver version. Dynamic GPU information relevant to GPU health may also be logged including but not limited to error correction code (ECC) errors, blue screen error codes, the number of timeout detection and recovery (TDR) events, and operating system configuration errors. Table 1 illustrates an example of collected data.
[Table 1: example of collected GPU operational data; table image not reproduced in this text extraction]
Table 1
[0018] Device A 202A, device B 202B and device N 202N may include a telemetry agent (not shown). The telemetry agent collects, aggregates, and packages the GPU operational data for transmission. The telemetry agent may be logically connected to a network via an operating system interface to transmit the GPU operational data. The logical connection may be implemented utilizing an application programming interface (API) allowing the telemetry agent to access a network socket.
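As a concrete illustration of the kind of record a telemetry agent might package for transmission, the following is a minimal Python sketch of one row of Table 1 serialized to JSON. The class name, field names, and the package_record helper are illustrative assumptions, not part of the described system.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class GpuTelemetryRecord:
    # Static GPU information
    card_name: str
    manufacturer: str
    serial: str
    driver: str
    # Dynamic health-related counters for the reporting period
    ecc: int              # error correction code (ECC) errors
    tdr: int              # timeout detection and recovery (TDR) events
    thermal_issue: int    # logged thermal issues
    os_failure_code: int  # e.g. blue screen (BSOD) occurrences
    timestamp: str = ""

def package_record(record: GpuTelemetryRecord) -> str:
    """Package one GPU operational-data record as JSON for transmission."""
    if not record.timestamp:
        record.timestamp = datetime.now(timezone.utc).isoformat()
    return json.dumps(asdict(record))

# Example record a telemetry agent could send to the telemetry receiver.
payload = package_record(GpuTelemetryRecord(
    card_name="ExampleCard X1000", manufacturer="ExampleCorp",
    serial="SN-0001", driver="27.21.14.5671",
    ecc=3, tdr=1, thermal_issue=0, os_failure_code=0))
print(payload)
```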
[0019] The telemetry receiver 206, the database 208 and the model 210 may be implemented as the system 100 of FIG. 1. The processor 102 may be utilized as the execution unit to support the functionality of the telemetry receiver 206, the database 208 and the model 210. The instructions 106 may correspond to the data structure and operations of the telemetry receiver 206, the database 208 and the model 210. As such, the instructions 106 may mirror the functionality described for the telemetry receiver 206, the database 208, and the model 210.
[0020] A telemetry receiver 206 may be connected to a network utilizing a similar logical connection as the telemetry agents. The telemetry receiver 206 may be scalable to support many devices reporting GPU operational data. The telemetry receiver 206 may aggregate GPU operational data based on which device the operational data was received from. By aggregating the operational data, the telemetry receiver 206 may create a historical record of the GPU operational data of any attached device. The telemetry receiver 206 may package the GPU operational data into a format to be stored in a database 208.
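A minimal sketch of the receiver-side aggregation, assuming an in-memory dictionary keyed by device serial number as a stand-in for the database 208; the function and variable names are illustrative assumptions.

```python
from collections import defaultdict

# Keyed by device serial number; each value is that device's historical record.
history_by_serial: dict[str, list[dict]] = defaultdict(list)

def receive(record: dict) -> None:
    """Aggregate one GPU operational-data record by the reporting device."""
    history_by_serial[record["serial"]].append(record)

receive({"serial": "SN-0001", "week": "2020-W30", "ecc": 3, "tdr": 1})
receive({"serial": "SN-0001", "week": "2020-W31", "ecc": 5, "tdr": 0})
print(len(history_by_serial["SN-0001"]))  # 2 records retained for this device
```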
[0021] The database 208 may include the tables and records operable to store historical GPU operational data as well as the data used to derive a GPU health score. The database 208 may be a single database in one implementation. In another implementation, the database 208 may include more than one relational database communicatively coupled together. The relational databases may be linked by record identifiers corresponding to data inclusive to more than one of the databases. In another implementation, the database 208 may be a data lake whereby the data undergoes very little transformation or aggregation by the telemetry receiver 206. The structure of the database 208 in this implementation may vary and may not structurally resemble a traditional relational database. The database 208 provides the GPU operational data for the model 210 to create a GPU health score.
[0022] The model 210 executes logic to create the GPU health score. The model 210 may be communicatively coupled to the database 208 to receive GPU operational data. In one implementation, the model 210 may count or sum the occurrences of operational data in a given period. A count may refer to a totaling of a singular operational data, whereas a sum may include adding more than one operational data together. Referring back to the example in Table 1, the occurrences of 'Thermal issue' and 'OS Failure code' may be counted on a per-week basis for each 'Serial' value. Likewise, for the 'ECC' and 'TDR' operational data, the sum of the values may be calculated per week and per 'Serial'. The 'Driver' operational data is compared to the database 208 to determine the latest driver version for each 'Card name'.
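The weekly counting and summing step could be sketched with pandas as below. The input column names follow Table 1, but pandas itself, the sample values, and the output column names are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Serial":          ["SN-0001", "SN-0001", "SN-0002"],
    "Week":            ["2020-W30", "2020-W30", "2020-W30"],
    "ECC":             [3, 2, 0],
    "TDR":             [1, 0, 0],
    "Thermal issue":   [1, 0, 0],
    "OS Failure code": [0, 1, 0],
})

weekly = df.groupby(["Serial", "Week"]).agg(
    ecc_sum=("ECC", "sum"),                  # values summed per week and Serial
    tdr_sum=("TDR", "sum"),
    thermal_count=("Thermal issue", "sum"),  # occurrences counted (1 per logged event)
    bsod_count=("OS Failure code", "sum"),
).reset_index()
print(weekly)
```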
[0023] In one implementation, the model 210 may compare calculated parameters to critical thresholds. Referring back to the example in Table 1, the calculated count or sum values for the parameters may be compared to critical thresholds. The critical thresholds may be predefined based on subject matter expertise and statistical analysis. The exact values of the predefined thresholds may be stored in a table inclusive to the model or may be hard coded into the model explicitly. The resulting comparisons may be stored within the model with '_number' appended to the name so as to differentiate them from the operational data originally collected.
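One possible reading of the '_number' convention is a per-parameter comparison of the weekly count or sum against its critical threshold, as in the sketch below; the threshold values shown are placeholders, not the predefined values referred to above.

```python
# Placeholder critical thresholds per parameter (the real values would come
# from subject-matter expertise and statistical analysis).
CRITICAL_THRESHOLDS = {"ecc": 2, "tdr": 1, "thermal_issues": 1, "bsod": 1}

def to_number_flags(weekly_values: dict) -> dict:
    """Compare weekly counts/sums to thresholds; store results as '<name>_number'."""
    return {
        f"{name}_number": int(weekly_values.get(name, 0) >= threshold)
        for name, threshold in CRITICAL_THRESHOLDS.items()
    }

print(to_number_flags({"ecc": 5, "tdr": 0, "thermal_issues": 0, "bsod": 1}))
# {'ecc_number': 1, 'tdr_number': 0, 'thermal_issues_number': 0, 'bsod_number': 1}
```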
[0024] Continuing, the calculated count or sum values from the previous comparison may be averaged on a rolling three-week basis. If the three-week average value is increasing for three weeks in a row, a new variable with '_growth' added to the name is created, and its value may be stored within the model as well. For 'ECC', an operational data ('ECC_growth_high') equal to 'ECC_number' multiplied by 'ECC_growth' is calculated, since the presence of both of these operational data in a given week per device may be a strong sign of graphics card malfunction.
[0025] In an implementation, a health score may be calculated based on the '_number' and '_growth' variables in accordance with Equation 1:
Health Score = 5 * ecc_number + 5 * ecc_growth + 10 * ecc_growth_high + 5 * thermal_issues_number + 5 * thermal_issues_growth + 5 * bsod_number + 5 * bsod_growth + 5 * tdr_number + 5 * tdr_growth
Equation 1
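A minimal Python sketch of the rolling growth check and Equation 1, assuming the '_growth' flag means three consecutive week-over-week increases of the three-week rolling average; that reading, the helper names, and the sample input are assumptions made for illustration.

```python
def growth_flag(weekly_numbers: list[float]) -> int:
    """1 if the rolling 3-week average has increased for 3 consecutive weeks."""
    avg = [sum(weekly_numbers[i - 2:i + 1]) / 3.0
           for i in range(2, len(weekly_numbers))]
    recent = avg[-4:]  # three week-over-week increases require four averages
    return int(len(recent) == 4 and all(b > a for a, b in zip(recent, recent[1:])))

WEIGHTS = {  # Equation 1 weights
    "ecc_number": 5, "ecc_growth": 5, "ecc_growth_high": 10,
    "thermal_issues_number": 5, "thermal_issues_growth": 5,
    "bsod_number": 5, "bsod_growth": 5,
    "tdr_number": 5, "tdr_growth": 5,
}

def health_score(variables: dict) -> int:
    """Equation 1: weighted sum of the '_number' and '_growth' variables."""
    variables = dict(variables)
    # ECC_growth_high: an ECC threshold breach and ECC growth in the same week.
    variables["ecc_growth_high"] = (
        variables.get("ecc_number", 0) * variables.get("ecc_growth", 0))
    return sum(weight * variables.get(name, 0) for name, weight in WEIGHTS.items())

print(health_score({"ecc_number": 1, "ecc_growth": 1, "tdr_number": 1}))  # 5+5+10+5 = 25
```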
[0026] Note that certain operational data may be weighted more strongly in the health score as they may be more indicative of GPU failure. Based on the resulting health score, each device (from device A 202A, device B 202B and device N 202N) may be mapped to a color-coded health category. A health score of ten (10) or more maps to a red category, which categorizes the devices requiring immediate attention. A health score of more than zero (0) and less than ten (10) may be categorized to a yellow category. The yellow category may include devices with some health/performance issue present. A health score of zero (0) maps to the green category, which indicates no health or performance issues observed.
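A minimal sketch of the color-coded category mapping described above.

```python
def health_category(score: float) -> str:
    """Map a health score to the red/yellow/green categories described above."""
    if score >= 10:
        return "red"     # requires immediate attention
    if score > 0:
        return "yellow"  # some health/performance issue present
    return "green"       # no health or performance issues observed

print([health_category(s) for s in (25, 5, 0)])  # ['red', 'yellow', 'green']
```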
[0027] Additionally, each device (from device A 202A, device B 202B and device N 202N) may be assigned a criticality score for each week. The criticality score may be calculated based on the value of the health score and the total number of devices within the set for that given week. For calculating the criticality score, the devices may be sorted by the value of the health score in descending order. The device with the highest health score may be scored with a criticality equal to one (1). The device with the second highest health score may be scored with a criticality equal to two (2), and so on. The criticality score of the device with the lowest health score may be equal to the total number of devices.
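The weekly criticality ranking could be sketched as a descending sort of health scores, as below; tie handling is not specified in the description, so the sketch simply follows sort order.

```python
def criticality_scores(health_by_device: dict[str, float]) -> dict[str, int]:
    """Rank devices by health score (descending); 1 = highest health score."""
    ordered = sorted(health_by_device, key=health_by_device.get, reverse=True)
    return {device: rank for rank, device in enumerate(ordered, start=1)}

print(criticality_scores({"SN-0001": 25, "SN-0002": 0, "SN-0003": 5}))
# {'SN-0001': 1, 'SN-0003': 2, 'SN-0002': 3}
```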
[0028] Continuing the implementation described above, each device (from device A 202A, device B 202B and device N 202N) may be mapped to a detailed description of the observed performance or health problems and the list of suggested remediation steps. The descriptions of the problems and suggested remediation steps may be created based on the domain knowledge and publicly available troubleshooting steps for the devices. A sample remediation action list is presented in Table 2.
[Table 2: sample remediation action list mapping health score components to suggested remediation steps; table images not reproduced in this text extraction]
Table 2
[0029] The sample remediation list in Table 2 is not limited to the remediations included and is used to illustrate the relationship between health scores and remediation actions.
[0030] Upon mapping a health score or criticality score to a remediation, the model 210 may send a message to an alert system 212 and/or a remediation system 214. The alert system 212 and the remediation system 214 may be client systems to receive the alerts and remediation actions. The alert system 212 may take the form of a dashboard interface for a fleet manager responsible for monitoring the devices (device A 202A, device B 202B and device N 202N). The alert system 212 may indicate to the fleet manager the categories of each of the devices within the fleet (e.g. red, yellow, and green). The alert system 212 may present a user-activated remediation step to the fleet manager (e.g. replace graphics card). In another implementation, the alert system 212 may be communicatively coupled to the telemetry agent on the device. The alert system 212 may propagate an alert to the user of the device that a remediation step may be due. For example, a graphics card failure based on 'ECC_growth' may be determined. The alert system 212 may present the fleet manager with a category "red" representation of device A 202A. The alert system 212 may propagate an alert to the telemetry agent executing on device A 202A indicating to the user that the graphics card needs to be replaced and presenting actions on how to accomplish that.
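A sketch of how a category might be fanned out to the alert paths described above, with a placeholder remediation mapping in the spirit of Table 2. The mapping entries and the notify_* helpers are illustrative assumptions, not the actual Table 2 contents.

```python
# Placeholder remediation mapping in the spirit of Table 2 (not the actual table).
REMEDIATION_BY_CATEGORY = {
    "red":    ("Repeated ECC errors with sustained growth", "Replace the graphics card"),
    "yellow": ("Outdated or mismatched driver detected", "Update the graphics driver"),
    "green":  ("No health or performance issues observed", "No action needed"),
}

def notify_fleet_manager(device: str, category: str, description: str, action: str) -> None:
    print(f"[dashboard] {device}: {category} - {description} -> {action}")

def notify_device_user(device: str, action: str) -> None:
    print(f"[device {device}] Suggested remediation: {action}")

def dispatch(device: str, category: str) -> None:
    """Send the category, description, and remediation to both alert paths."""
    description, action = REMEDIATION_BY_CATEGORY[category]
    notify_fleet_manager(device, category, description, action)
    if category != "green":
        notify_device_user(device, action)

dispatch("device A", "red")
```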
[0031] In another implementation, remediation system 214 may be utilized to accomplish a remediation step as identified by the model 210. In situations where the remediation step may be automated, the remediation system 214 may take action. For example, a remediation system 214 may take the form of an endpoint management system. When a remediation step corresponds to a driver update, or another task that can be automated by the remediation system 214, the remediation system 214 may push the fix to the affected device.
[0032] FIG. 3 is a flow diagram for generating GPU health scores, according to an example. For the purposes of describing the features of FIG. 3, references may be made to FIG. 1 and FIG. 2.
[0033] At 302, the processor 102 receives a set of GPU operational data. As described in reference to FIG. 2, the GPU operational data may correspond to the information retrieved from a telemetry agent and illustrated in Table 1. The GPU operational data may include the graphics card model name, the graphics card manufacturer, the graphics card serial number, a time date stamp, ECC errors, logged thermal issues, operating system (OS) failures (e.g. blue screen of death, BSOD), TDRs, OS compatibility problems, and graphics card driver versions.
[0034] In another implementation, the processor 102 may receive a second set of GPU operational data. The second set of GPU operational data may correspond to the same device reporting for a later period in time.
[0035] At 304, the processor 102 identifies a subset of the set of GPU operational data corresponding to a likelihood of failure. In one implementation, ECC errors and TDRs may be identified as corresponding to a likelihood of failure. As such, the ECC error and TDR values may be identified as a subset of the set of GPU operational data corresponding to a likelihood of failure. While Equation 1 illustrates the same weighted value applied to each of the operational data of the first subset of the set of GPU operational data, varying weighted values may also be used to emphasize which operational data corresponds to the greatest likelihood of failure.
[0036] The processor 102 may identify a second subset of the second set of GPU operational data corresponding to a likelihood of failure. The second subset may correspond to the subset identified in the set of GPU operational data. In one example, the targeted operational data of the subset and the second subset would be the same (e.g. ECC errors, TDRs). The values of the operational data may be different, however.
[0037] At 306, the processor 102 creates a health score for a GPU corresponding to the set of GPU operational data. As described in relation to Equation 1, a health score may be determined based on the GPU operational data. The processor 102 may assign a weighted value to each operational data of the first subset. As such, the health score may be adjusted based on the weighted value.
[0038] The processor 102 may create a second health score for a GPU corresponding to the second set of GPU operational data. For purposes of meaningful metrics, the second health score may be calculated utilizing the same weighting as the health score.
[0039] The processor 102 may calculate a difference between the health score and the second health score. The difference may illustrate an improvement in the GPU health score over time, or a deterioration of the health score over time.
[0040] At 308, the processor 102 maps the health score to a remediation action. Upon the health score passing a threshold, and the criticality score passing a threshold, the processor 102 may map remediation actions to the more problematic components of the health score. Referring to the implementation described in reference to FIG. 2, where the health score may be 25 and the criticality may be 4, the processor 102 may map individual components of the health score to remediation steps. In this example, when 'ECC_growth' surpasses two (2) over a three-week period, the mapped remediation corresponds to a recommendation to replace the graphics card.
[0041] In another implementation, the processor 102 may send an alert to a client system based on the difference surpassing a threshold. The client system may correspond to the affected device from device A 202A, device B 202B and device N 202N. The alert system 212 may receive the alert message from the model 210 and transmit it to the client system corresponding to the GPU.
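A minimal sketch of the difference-based alert, assuming a higher health score indicates worse health (consistent with the red category above); the threshold value and helper names are assumptions.

```python
DIFFERENCE_ALERT_THRESHOLD = 10  # placeholder threshold, not from the description

def score_difference(first_score: float, second_score: float) -> float:
    """Positive when the later score is higher, i.e. health is deteriorating."""
    return second_score - first_score

def maybe_alert(device: str, first_score: float, second_score: float) -> None:
    diff = score_difference(first_score, second_score)
    if diff > DIFFERENCE_ALERT_THRESHOLD:
        print(f"ALERT {device}: health score worsened by {diff}")

maybe_alert("device A", first_score=5, second_score=25)  # triggers the alert
```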
[0042] At 310, the processor 102 sends an alert comprising a remediation action to a client system. The alert system 212 may receive a notification from the model 210 to notify the fleet manager. The same notification may also be transmitted from the alert system 212 to a client system, which may be the affected device from device A 202A, device B 202B and device N 202N. In another example, the notification transmitted to the client system may be different from that transmitted to notify the fleet manager. The notification transmitted to the client system may be more verbose so as to provide detailed instruction for a lay person to effect the change prescribed by the remediation.
[0043] In another embodiment, the processor 102 may identify an actionable remediation action from the remediation action. An actionable remediation action may include actions that may be implemented through autonomous means. For example, software installation may be an actionable remediation action. Replacing a graphics card may not be an actionable remediation action. The processor 102 may push the actionable remediation action to the client system with an endpoint management system. For example, if the actionable remediation action includes installing a new graphics card driver, the processor 102 may utilize a remediation system 214 (e.g. endpoint management) to push the software fix to the client system, where the client system is the affected device.
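A sketch separating actionable (automatable) remediation actions from manual ones and pushing the actionable ones through an endpoint-management-style interface; the ACTIONABLE set and push_to_device stand-in are illustrative assumptions.

```python
# Remediations that can be carried out autonomously (e.g. software installs).
ACTIONABLE = {"Update the graphics driver", "Reinstall the graphics driver"}

def push_to_device(device: str, action: str) -> None:
    """Stand-in for an endpoint management push (remediation system 214)."""
    print(f"Pushing '{action}' to {device}")

def execute_remediation(device: str, action: str) -> None:
    if action in ACTIONABLE:
        push_to_device(device, action)  # automated fix
    else:
        print(f"{device}: '{action}' requires manual intervention")

execute_remediation("device A", "Update the graphics driver")
execute_remediation("device A", "Replace the graphics card")
```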
[0044] At 312, the processor 102 renders a visualization of the health score. The processor 102 may render a dashboard to illustrate to a fleet manager the health of all devices (from device A 202A, device B 202B and device N 202N). In one implementation, the processor 102 may render a difference between the health score and the second health score. The rendering may include a visualization of a trend in the health score. The visualization may include an arrow indicating the direction of the trend, corresponding to the affected device.
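A minimal sketch of the trend indicator, assuming the arrow direction is derived from the sign of the change in the health score.

```python
def trend_arrow(previous_score: float, current_score: float) -> str:
    """Arrow indicating the direction of the health-score trend."""
    if current_score > previous_score:
        return "\u2191"  # score rising: health deteriorating
    if current_score < previous_score:
        return "\u2193"  # score falling: health improving
    return "\u2192"      # unchanged

print(f"device A health: 5 -> 25 {trend_arrow(5, 25)}")
```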
[0045] FIG. 4 is a computing device for supporting instructions to create GPU health scores, according to an example. The computing device 400 depicts a processor 102 and a storage medium 404 and, as an example of the computing device 400 performing its operations, the storage medium 404 may include instructions 406-418 that are executable by the processor 102. The processor 102 may be synonymous with the processor 102 referenced in FIG. 1. Additionally, the processor 102 may include but is not limited to central processing units (CPUs). The storage medium 404 can be said to store program instructions that, when executed by processor 102, implement the components of the computing device 400.
[0046] The executable program instructions stored in the storage medium 404 include, as an example, instructions to receive a first set and a second set of GPU operational data 406, instructions to identify a first subset of the first set of GPU operation data corresponding to a likelihood of failure 408, instructions to identify a second subset of the second set of GPU operational data corresponding to a likelihood of failure 410, instructions to create a first health score for the GPU corresponding to the first subset 412, instructions to create a second health score for the GPU corresponding to the second subset 414, instructions to create a difference between the first health score and the second health score 416, and render a visualization of the difference 418.
[0047] Storage medium 404 represents generally any number of memory components capable of storing instructions that can be executed by processor 102. Storage medium 404 is non-transitory in the sense that it does not encompass a transitory signal but instead is made up of at least one memory component configured to store the relevant instructions. As a result, the storage medium 404 may be a non-transitory computer-readable storage medium. Storage medium 404 may be implemented in a single device or distributed across devices. Likewise, processor 102 represents any number of processors capable of executing instructions stored by storage medium 404. Processor 102 may be integrated in a single device or distributed across devices. Further, storage medium 404 may be fully or partially integrated in the same device as processor 102, or it may be separate but accessible to that computing device 400 and the processor 102.
[0048] In one example, the program instructions 406-418 may be part of an installation package that, when installed, can be executed by processor 102 to implement the components of the computing device 400. In this case, storage medium 404 may be a portable medium such as a CD, DVD, or flash drive, or a memory maintained by a server from which the installation package can be downloaded and installed. In another example, the program instructions may be part of an application or applications already installed. Here, storage medium 404 can include integrated memory such as a hard drive, solid state drive, or the like.
[0049] It is appreciated that examples described may include various components and features. It is also appreciated that numerous specific details are set forth to provide a thorough understanding of the examples. However, it is appreciated that the examples may be practiced without limitations to these specific details. In other instances, well known methods and structures may not be described in detail to avoid unnecessarily obscuring the description of the examples. Also, the examples may be used in combination with each other.
[0050] Reference in the specification to “an example” or similar language means that a particular feature, structure, or characteristic described in connection with the example is included in at least one example, but not necessarily in other examples. The various instances of the phrase “in one example” or similar phrases in various places in the specification are not necessarily all referring to the same example.
[0051] It is appreciated that the previous description of the disclosed examples is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these examples will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other examples without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the examples shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

WHAT IS CLAIMED IS:
1. A system comprising:
   a memory comprising instructions; and
   a processor communicatively coupled to the memory wherein the instructions when executed cause the processor to:
      receive a set of graphics processor unit (GPU) operational data;
      identify a subset of the set of GPU operational data corresponding to likelihood of failure;
      create a health score for a GPU corresponding to the set of GPU operational data;
      map the health score to a remediation action; and
      execute the remediation action.

2. The system of claim 1, the instructions when executed further cause the processor to:
   send an alert to a client system; and
   render a visualization of the health score.

3. The system of claim 1, the instructions when executed further cause the processor to:
   identify a first subset of the subset of the set of GPU operational data corresponding to a critical likelihood of failure;
   assign a weighted value to each operational data of the first subset; and
   adjust the health score based on the weighted value.

4. The system of claim 1 wherein the set of GPU operational data comprises recorded production device values from a plurality of client systems.

5. The system of claim 1, the instructions when executed further cause the processor to:
   receive a second set of GPU operational data;
   identify a second subset of the set of GPU operational data corresponding to a likelihood of failure;
   create a second health score for a GPU corresponding to the second set of GPU operational data;
   calculate a difference between the health score and the second health score; and
   send an alert to a client system based on the difference surpassing a threshold.

6. A method comprising:
   receiving a set of graphics processor unit (GPU) operational data;
   identifying a subset of the set of GPU operational data corresponding to likelihood of failure;
   creating a health score for a GPU corresponding to the set of GPU operational data;
   mapping the health score to a remediation action;
   sending an alert comprising the remediation action to a client system; and
   rendering a visualization of the health score.

7. The method of claim 6 further comprising:
   identifying an actionable remediation action from the remediation action;
   push the actionable remediation action to the client system with an endpoint management system.

8. The method of claim 6, further comprising:
   identifying a first subset of the subset of the set of GPU operational data corresponding to a critical likelihood of failure;
   assigning a weighted value to each operational data of the first subset; and
   adjusting the health score based on the weighted value.

9. The method of claim 6 wherein the set of GPU operational data comprises recorded production device values from a plurality of client systems.

10. The method of claim 6, further comprising:
    receiving a second set of GPU operational data;
    identifying a second subset of the set of GPU operational data corresponding to a likelihood of failure;
    creating a second health score for a GPU corresponding to the second set of GPU operational data;
    calculating a difference between the health score and the second health score; and
    sending an alert to a client system based on the difference surpassing a threshold.

11. A non-transitory computer readable medium comprising instructions executable by a processor to:
    receive a first set and a second set of graphics processor unit (GPU) operational data;
    identify a first subset of the first set of GPU operational data corresponding to likelihood of failure;
    identify a second subset of the second set of GPU operational data corresponding to a likelihood of failure;
    create a first health score for a GPU corresponding to the first subset;
    create a second health score for the GPU corresponding to the second subset;
    create a difference between the first health score and the second health score; and
    render a visualization of the difference.

12. The medium of claim 11, the instructions when executed further cause the processor to send an alert to a client system.

13. The medium of claim 11, wherein the second set of GPU operational data and the first set of GPU operational data correspond to operational information of a single GPU.

14. The medium of claim 11, the instructions when executed further cause the processor to:
    identify an actionable remediation action corresponding to the difference; and
    push the actionable remediation action to a client system with an endpoint management system.
PCT/US2020/050390 2020-09-11 2020-09-11 Gpu health scores WO2022055496A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2020/050390 WO2022055496A1 (en) 2020-09-11 2020-09-11 Gpu health scores

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2020/050390 WO2022055496A1 (en) 2020-09-11 2020-09-11 Gpu health scores

Publications (1)

Publication Number Publication Date
WO2022055496A1 true WO2022055496A1 (en) 2022-03-17

Family

ID=80629762

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/050390 WO2022055496A1 (en) 2020-09-11 2020-09-11 Gpu health scores

Country Status (1)

Country Link
WO (1) WO2022055496A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105393252A (en) * 2013-04-18 2016-03-09 数字标记公司 Physiologic data acquisition and analysis
US20190057006A1 (en) * 2016-02-14 2019-02-21 Dell Products, Lp System and method to assess information handling system health and resource utilization
US10069710B2 (en) * 2016-03-01 2018-09-04 Dell Products, Lp System and method to identify resources used by applications in an information handling system
US20190068627A1 (en) * 2017-08-28 2019-02-28 Oracle International Corporation Cloud based security monitoring using unsupervised pattern recognition and deep learning

Similar Documents

Publication Publication Date Title
US8892960B2 (en) System and method for determining causes of performance problems within middleware systems
US10055275B2 (en) Apparatus and method of leveraging semi-supervised machine learning principals to perform root cause analysis and derivation for remediation of issues in a computer environment
US11803546B2 (en) Selecting interruptible resources for query execution
JP6373482B2 (en) Interface for controlling and analyzing computer environments
US11263071B2 (en) Enabling symptom verification
US10613970B1 (en) Method and system for managing deployment of software application components based on software performance data
US9176759B1 (en) Monitoring and automatically managing applications
US11734100B2 (en) Edge side filtering in hybrid cloud environments
AU2013344538B2 (en) Dynamic graph performance monitoring
US20130197863A1 (en) Performance and capacity analysis of computing systems
KR102404170B1 (en) Dynamic component performance monitoring
US20150324245A1 (en) Intermediate database management layer
CN112380089A (en) Data center monitoring and early warning method and system
US20190026174A1 (en) Integrated statistical log data mining for mean time auto-resolution
US20200327036A1 (en) Topology aware real time gpu-to-gpu traffic monitoring method and analyzing tools
WO2020027931A1 (en) Real time telemetry monitoring tool
CN111897696A (en) Server cluster hard disk state detection method and device, electronic equipment and storage medium
WO2022055496A1 (en) Gpu health scores
EP3993353A2 (en) System and method for managing clusters in an edge network
US20220107858A1 (en) Methods and systems for multi-resource outage detection for a system of networked computing devices and root cause identification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20953479

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20953479

Country of ref document: EP

Kind code of ref document: A1