WO2022126140A1 - Heterogeneous computing platform (HCP) for vertex shader processing - Google Patents


Info

Publication number
WO2022126140A1 (PCT/US2022/011163)
Authority
WIPO (PCT)
Prior art keywords
language code, shader language, shader, graphics, gpu
Application number
PCT/US2022/011163
Other languages
French (fr)
Inventors
Gabriel HUAU, Abhishek Girish SAXENA, Xiaohan Wang, Vladislav LEVENFELD, Hongyu Sun, Chen Li
Original Assignee
Innopeak Technology, Inc.
Application filed by Innopeak Technology, Inc.
Priority to PCT/US2022/011163
Publication of WO2022126140A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/52 Binary to binary

Definitions

  • the present disclosure relates, in general, to methods, systems, and apparatuses for implementing two-dimensional (“2D”) and/or three-dimensional (“3D") rendering, and, more particularly, to methods, systems, and apparatuses for implementing heterogeneous computing platform ("HCP”) for vertex shader processing.
  • the techniques of this disclosure generally relate to tools and techniques for implementing 2D and/or 3D rendering, and, more particularly, to methods, systems, and apparatuses for implementing HCP for vertex shader processing.
  • a method may comprise receiving, using a computing system, a first shader language code associated with one or more three-dimensional ("3D") image elements among a plurality of 3D image elements extracted from one or more graphics library commands from a 3D software application ("app"); translating, using the computing system, the first shader language code into at least one second shader language code each corresponding to one of at least one non-graphics processing unit (“non-GPU") processor; sending, using the computing system, each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app; and sending, using the computing system, the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources.
  • an apparatus might comprise at least one processor and a non-transitory computer readable medium communicatively coupled to the at least one processor.
  • the non-transitory computer readable medium might have stored thereon computer software comprising a set of instructions that, when executed by the at least one processor, causes the apparatus to: receive a first shader language code associated with one or more three-dimensional ("3D") image elements among a plurality of 3D image elements extracted from one or more graphics library commands from a 3D software application ("app"); translate the first shader language code into at least one second shader language code each corresponding to one of at least one non-graphics processing unit (“non-GPU") processor; send each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app; and send the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources.
  • a system might comprise a computing system, which might comprise at least one first processor and a first non-transitory computer readable medium communicatively coupled to the at least one first processor.
  • the first non-transitory computer readable medium might have stored thereon computer software comprising a first set of instructions that, when executed by the at least one first processor, causes the computing system to: receive a first shader language code associated with one or more three-dimensional ("3D") image elements among a plurality of 3D image elements extracted from one or more graphics library commands from a 3D software application ("app"); translate the first shader language code into at least one second shader language code each corresponding to one of at least one non-graphics processing unit (“non-GPU") processor; send each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app; and send the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources.
  • Fig. 1 is a schematic diagram illustrating a system for implementing heterogeneous computing platform ("HCP") for vertex shader processing, in accordance with various embodiments.
  • FIGs. 2A-2D are schematic block flow diagrams illustrating various non-limiting examples of processes that may be used for implementing HCP for vertex shader processing, in accordance with various embodiments.
  • Fig. 3 is a schematic block flow diagram illustrating a non-limiting example of auto-translation of one shader language code to another shader language code during implementation of HCP for vertex shader processing, in accordance with various embodiments.
  • Figs. 4A-4C are flow diagrams illustrating a method for implementing HCP for vertex shader processing, in accordance with various embodiments.
  • FIG. 5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments.
  • Fig. 6 is a block diagram illustrating a networked system of computers, computing systems, or system hardware architecture, which can be used in accordance with various embodiments.
  • Various embodiments provide tools and techniques for implementing two-dimensional (“2D”) and/or three-dimensional (“3D”) rendering, and, more particularly, to methods, systems, and apparatuses for implementing heterogeneous computing platform ("HCP”) for vertex shader processing.
  • a computing system may receive a first shader language code associated with one or more 3D image elements among a plurality of 3D image elements extracted from one or more graphics library commands from a 3D software application ("app").
  • the computing system may automatically translate the first shader language code into at least one second shader language code each corresponding to one of at least one non-graphics processing unit (“non-GPU") processor.
  • the computing system may send each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app, instead of vertex shader processing by a graphics processing unit ("GPU").
  • the computing system may send the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources.
  • the computing system may comprise at least one of a HCP, a graphics engine, a graphics rendering engine, a game engine, a 3D game engine, a processor on the user device, or at least one central processing unit (“CPU") core on the user device, and/or the like.
  • the at least one non-GPU processor may each comprise one of a CPU, a multiprocessor, a digital signal processor ("DSP"), a media processor, or another non-GPU processor, and/or the like.
  • the 3D app may comprise one of a 3D game app, a 3D user interface ("UI")-based app, or a 3D guidance app configured for use by a user, and/or the like.
  • the user may comprise one of a medical professional, a scientist, an engineer, an architect, a construction worker, a factory worker, a warehouse worker, a shipping or delivery worker, an app developer, or a graphics designer, and/or the like.
  • the graphics resources may comprise at least one of vertex buffer object (“VBO”) data, element buffer object (“EBO”) data, or uniform buffer object data, and/or the like.
  • the first shader language code may be optimized for per-pixel processing architecture, and the first shader language code may comprise a shader language code based on graphics library shader language ("GLSL").
  • Each of the at least one second shader language code may be optimized for vectorized processing of contiguous pixels, and each of the at least one second shader language code may comprise a shader language code based on domain specific language of the corresponding one of the at least one non-GPU processor.
  • automatically translating the first shader language code into the at least one second shader language code may comprise: automatically translating the first shader language code into at least one intermediate shader language code, wherein the at least one intermediate shader language code each comprises a shader language code based on one of standard portable intermediate representation ("SPIR-V") processing, intermediate generic language, or vectorized algorithm; wherein based on a determination that the at least one intermediate shader language code comprises multiple successive intermediate shader language codes, automatically translating each intermediate shader language code into each successive intermediate shader language code until the last successive intermediate shader language code has been produced, and automatically translating the last successive intermediate shader language code into the second shader language code; and wherein based on a determination that the at least one intermediate shader language code comprises a single intermediate shader language code, automatically translating the single intermediate shader language code into the second shader language code.
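The staged translation described above can be sketched as follows. The stage names, translator functions, and data shapes below are illustrative assumptions, not the patent's actual toolchain: with a single stage the source is translated directly into the target code, and with multiple stages each successive intermediate code is produced until the last one yields the second shader language code.

```python
# Hypothetical sketch of the multi-stage shader translation chain.
# Real translators would emit SPIR-V, a vectorized IR, and finally the
# non-GPU processor's domain-specific code; here each stage just tags
# the previous representation so the chain is visible.

def translate_shader(source_code, stages):
    """Translate `source_code` through each (name, translate_fn) stage in order."""
    code = source_code
    for name, translate_fn in stages:
        code = translate_fn(code)
    return code

# Illustrative stand-in translators (assumed names, not from the patent).
glsl_to_spirv = lambda c: ("spirv", c)
spirv_to_vectorized_ir = lambda c: ("vec_ir", c)
vectorized_ir_to_dsp = lambda c: ("dsp_code", c)

pipeline = [
    ("SPIR-V", glsl_to_spirv),
    ("vectorized IR", spirv_to_vectorized_ir),
    ("DSP target", vectorized_ir_to_dsp),
]

result = translate_shader("void main() { gl_Position = pos; }", pipeline)
```

With a single-stage pipeline the same function degenerates to the direct translation case the passage describes.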
  • receiving the first shader language code may comprise intercepting, using the computing system, the first shader language code as the first shader language code is being sent from the 3D app to one of a graphics library application programming interface (“API”) or the GPU.
  • the computing system may receive the one or more graphics library commands from the 3D app, the one or more graphics library commands comprising commands for rendering image elements, the image elements comprising the plurality of 3D image elements.
  • the computing system may analyze the received one or more graphics library commands to extract the first shader language code.
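The interception described above can be illustrated with a small sketch. The class names below are hypothetical and `FakeGLApi` merely stands in for a real graphics-library API; only the enum value for vertex shaders is taken from OpenGL:

```python
# Sketch of intercepting shader source as the 3D app submits it to a
# graphics-library API: vertex shader source is captured for HCP
# translation, while other shader types are forwarded to the GPU path.

VERTEX_SHADER = 0x8B31  # GL_VERTEX_SHADER enum value in OpenGL

class FakeGLApi:
    """Stand-in for the real graphics library / GPU path."""
    def __init__(self):
        self.submitted = []
    def shader_source(self, shader_type, source):
        self.submitted.append((shader_type, source))

class ShaderInterceptor:
    """Wraps a GL-like API and captures vertex shader source on the way in."""
    def __init__(self, gl_api):
        self.gl_api = gl_api
        self.captured = []
    def shader_source(self, shader_type, source):
        if shader_type == VERTEX_SHADER:
            # Capture for translation to a non-GPU processor's language.
            self.captured.append(source)
        else:
            # Non-vertex shaders still go to the GPU path untouched.
            self.gl_api.shader_source(shader_type, source)

gl = FakeGLApi()
hooked = ShaderInterceptor(gl)
hooked.shader_source(VERTEX_SHADER, "void main() { gl_Position = a_pos; }")
hooked.shader_source(0x8B30, "void main() { frag_color = u_color; }")  # fragment shader
```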
  • the computing system may analyze the first shader language code to extract size information for each of an input buffer and an output buffer for each vertex shader; may allocate the calculated graphics resources to a shared memory based at least in part on the extracted size information for the input buffer for each vertex shader; and may allocate a copy of the calculated graphics resources to a GPU memory via a passthrough shader based at least in part on the extracted size information for the output buffer for each vertex shader, the passthrough shader performing no calculations.
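One way to picture the size-information extraction above is to scan a vertex shader's input/output declarations and total their byte sizes. The parsing and the type-size table below are simplified assumptions for illustration, not the patent's actual analysis:

```python
# Hypothetical sketch: derive per-vertex input/output buffer sizes from a
# GLSL-like vertex shader's `in`/`out` declarations.

import re

# Assumed byte sizes for a few common GLSL types.
TYPE_SIZES = {"float": 4, "vec2": 8, "vec3": 12, "vec4": 16, "mat4": 64}

def buffer_sizes(shader_src):
    """Return (input_bytes, output_bytes) per vertex for a shader source."""
    in_size = sum(TYPE_SIZES[t] for t in re.findall(r"\bin\s+(\w+)\s+\w+;", shader_src))
    out_size = sum(TYPE_SIZES[t] for t in re.findall(r"\bout\s+(\w+)\s+\w+;", shader_src))
    return in_size, out_size

src = """
in vec3 a_position;
in vec2 a_uv;
out vec4 v_color;
void main() { v_color = vec4(1.0); }
"""
sizes = buffer_sizes(src)
```

The resulting sizes would then drive the shared-memory allocation for inputs and the GPU-memory allocation (via the passthrough shader) for outputs.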
  • the computing system may cache all the calculated graphics resources associated with 3D objects in a shared memory prior to allocation of the graphics resources on a GPU.
  • sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources may comprise sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources after all the graphics resources have been allocated.
  • the computing system may store at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources in shared memory that is accessible by each of the at least one non-GPU processor and the GPU, and/or the like, without the at least one non-GPU processor or the GPU copying any of the at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources, and/or the like.
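The zero-copy sharing described above can be sketched with a minimal store whose consumers receive handles onto the same underlying buffer rather than copies. The class name is an assumption; `memoryview` stands in for whatever shared-memory mechanism the platform provides:

```python
# Minimal sketch of a shared-memory store accessible by both the non-GPU
# processor path and the GPU path without copying: a bytearray is stored
# once, and each consumer gets a zero-copy memoryview onto it.

class SharedMemoryStore:
    def __init__(self):
        self._buffers = {}
    def put(self, key, data: bytearray):
        self._buffers[key] = data
    def view(self, key):
        # memoryview exposes the buffer without duplicating its bytes.
        return memoryview(self._buffers[key])

store = SharedMemoryStore()
store.put("vbo", bytearray(b"\x00" * 8))

cpu_view = store.view("vbo")   # handle used by the non-GPU processor
gpu_view = store.view("vbo")   # handle used by the GPU path

cpu_view[0] = 0xFF             # a write through one view...
```

A write through one view is immediately visible through the other, demonstrating that neither side holds a private copy.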
  • HCP may be implemented for vertex shader processing using non-GPU processors instead of using a GPU.
  • This allows some tasks (e.g., vertex shader processing tasks of 3D elements of a game scene or other 3D app) to be moved or delegated from the GPU to other processors (e.g., CPU, DSP, etc.), which reduces GPU usage, giving the GPU more bandwidth to process the 3D rendering scene and potentially reducing the overall power consumption.
  • the tasks assigned to a CPU, a DSP, or other processor may be run in parallel with the GPU, improving the performance of the game or other 3D app and allowing it to run on older hardware and/or mobile platforms.
  • Various embodiments as described herein - while embodying (in some cases) software products, computer-performed methods, and/or computer systems - represent tangible, concrete improvements to existing technological areas, including, without limitation, 3D image element or object rendering technology, 2D image element or object rendering technology, UI element or object rendering technology, game image element or object rendering technology, game UI element or object rendering technology, heterogeneous computing platform (“HCP”) technology, vertex shader processing technology, mobile platform technology, and/or the like.
  • some embodiments can improve the functioning of user equipment or systems themselves (e.g., 3D image element or object rendering systems, 2D image element or object rendering systems, UI element or object rendering systems, game image element or object rendering systems, game UI element or object rendering systems, HCP systems, vertex shader processing systems, mobile platform systems, etc.), for example, by receiving, using a computing system, a first shader language code associated with one or more three-dimensional ("3D") image elements among a plurality of 3D image elements extracted from one or more graphics library commands from a 3D software application (“app”); automatically translating, using the computing system, the first shader language code into at least one second shader language code each corresponding to one of at least one non-graphics processing unit (“non-GPU") processor; sending, using the computing system, each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app, instead of vertex shader processing by a graphics processing unit ("GPU"); and sending, using the computing system, the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources; and/or the like.
  • These functionalities can produce tangible results outside of the implementing computer system, including, merely by way of example, providing a heterogeneous computing platform that delegates vertex shader processing tasks of 3D elements from a GPU to non-GPU processors (e.g., CPU, DSP, etc.), which reduces GPU usage, giving the GPU more bandwidth to process the 3D rendering scene and potentially reducing the overall power consumption, and allowing 2D rendering tasks by the non-GPU processors to be run in parallel with the GPU, thereby improving the performance of the 3D app (e.g., game or other 3D app) and allowing it to run on older hardware and/or mobile platforms, at least some of which may be observed or measured by users, game/content developers, and/or user device manufacturers.
  • Figs. 1-6 illustrate some of the features of the method, system, and apparatus for implementing two-dimensional ("2D") and/or three-dimensional ("3D") rendering, and, more particularly, to methods, systems, and apparatuses for implementing heterogeneous computing platform ("HCP") for vertex shader processing, as referred to above.
  • the methods, systems, and apparatuses illustrated by Figs. 1-6 refer to examples of different embodiments that include various components and steps, which can be considered alternatives or which can be used in conjunction with one another in the various embodiments.
  • the description of the illustrated methods, systems, and apparatuses shown in Figs. 1-6 is provided for purposes of illustration and should not be considered to limit the scope of the different embodiments.
  • Fig. 1 is a schematic diagram illustrating a system 100 for implementing heterogeneous computing platform ("HCP") for vertex shader processing, in accordance with various embodiments.
  • system 100 may comprise user device 105, which may include, but is not limited to, one of a portable gaming device, a smart phone, a tablet computer, a laptop computer, a desktop computer, or a server computer, and/or the like.
  • user device 105 may include, without limitation, at least one of a graphic application 110 (e.g., a 3D software application ("app"), or the like), a computing system 115, and computing hardware 145.
  • the graphic application 110 may include, without limitation, one of a 3D game app, a 3D user interface ("UI")-based app, or a 3D guidance app configured for use by a user, and/or the like.
  • the user may include, but is not limited to, one of a medical professional, a scientist, an engineer, an architect, a construction worker, a factory worker, a warehouse worker, a shipping or delivery worker, an app developer, or a graphics designer, and/or the like.
  • the computing system may include, without limitation, at least one of a HCP (e.g., HCP 120, or the like), a graphics engine, a graphics rendering engine, a game engine, a 3D game engine, a processor on the user device, or at least one central processing unit (“CPU”) core on the user device, and/or the like.
  • HCP 120 may include, without limitation, an auto-translation system(s) 125, a vertex shader(s) 130, a passthrough shader 135, and shared memory 140, or the like.
  • Computing hardware 145 may comprise GPU 150 and one or more non-GPU processors 155.
  • the one or more non-GPU processors 155 may each include, but are not limited to, one of a CPU, a multiprocessor, a digital signal processor ("DSP"), a media processor, or another non-GPU processor, and/or the like.
  • User device 105 may further include, without limitation, at least one of data storage device 160a, communications system 160b, display screen 160c, and audio playback device 160d (optional).
  • One or more of the CPU cores and/or one or more computing systems 115 may be used to perform orchestration, management, render engine coordination, and/or operating system ("OS") processing functionalities.
  • Other CPU cores and/or other non-GPU processors 155 may be used for rendering or image processing of 2D image elements, while the GPU 150 may be used to render 3D image elements.
  • the data storage 160a may include, but is not limited to, at least one of read-only memory (“ROM”), programmable read-only memory (“PROM”), erasable programmable read-only memory (“EPROM”), electrically erasable programmable read-only memory (“EEPROM”), flash memory, other non-volatile memory devices, random-access memory (“RAM”), static random-access memory (“SRAM”), dynamic random-access memory (“DRAM”), synchronous dynamic random-access memory (“SDRAM”), virtual memory, a RAM disk, or other volatile memory devices, non-volatile RAM devices, and/or the like.
  • the communications system 160b may include wireless communications devices capable of communicating using protocols including, but not limited to, at least one of Bluetooth™ communications protocol, WiFi communications protocol, or other 802.11 suite of communications protocols, ZigBee communications protocol, Z-wave communications protocol, or other 802.15.4 suite of communications protocols, cellular communications protocol (e.g., 3G, 4G, 4G LTE, 5G, etc.), or other suitable communications protocols, and/or the like.
  • Some user devices may each include at least one integrated display screen 160c (in some cases, including a non-touchscreen display screen(s), while, in other cases, including a touchscreen display screen(s), and, in still other cases, including a combination of at least one non-touchscreen display screen and at least one touchscreen display screen) and at least one integrated audio playback device 160d (e.g., built-in speakers or the like).
  • Some user devices (e.g., some desktop computers, or some server computers, or the like) may each include at least one external display screen or monitor (e.g., display devices 195a-195n, or the like, which may each be a non-touchscreen display device or a touchscreen display device, or the like) and at least one integrated audio playback device 160d (e.g., built-in speakers, etc.) and/or at least one external audio playback device (not shown; e.g., external or peripheral speakers, wired earphones, wired earbuds, wired headphones, wireless earphones, wireless earbuds, wireless headphones, or the like).
  • System 100 may further comprise one or more content sources 170 and corresponding database(s) 175 that communicatively couple with user device 105 via network(s) 165 (and via communications system 160b) to provide image data and/or graphics library commands 190a for the computing system(s) 115 to render or process 3D image data or the like, as described in detail below.
  • the resultant rendered images 190b may be sent to content distribution system 180 and corresponding database(s) 185 (via network(s) 165 and via communications system 160b) for storage and/or distribution to other devices (e.g., display devices 195a-195n).
  • user device 105 may directly send the rendered images 190b to one or more display devices 195a-195n (collectively, "display devices 195" or the like), which may each include, but are not limited to, at least one of a smart television (directly), a television (indirectly via a set-top box or other intermediary media player), a monitor or digital display panel (directly), or a monitor or digital display panel (indirectly via an externally connected user device (e.g., desktop computer, server computer, etc.)), and/or the like.
  • the lightning bolt symbols are used to denote wireless communications between communications system 160b and network(s) 165 (in some cases, via network access points or the like (not shown)), between communications system 160b and at least one of the one or more display devices 195a-195n, and between network(s) 165 and at least one of the one or more display devices 195a-195n (in some cases, via network access points or the like (not shown)).
  • a computing system 115, HCP 120, and/or computing hardware 145 may receive a first shader language code associated with one or more 3D image elements among a plurality of 3D image elements extracted from one or more graphics library commands (e.g., graphics library ("GL”) commands 190a, or the like) from a 3D app (e.g., graphic app or 3D app 110, or the like).
  • the computing system may automatically translate the first shader language code into at least one second shader language code each corresponding to one of at least one non-graphics processing unit (“non-GPU”) processor (e.g., non-GPU processor(s) 155, or the like).
  • the computing system may send each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app, instead of vertex shader processing by a GPU (e.g., GPU 150, or the like).
  • the computing system may send the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources, which will be used for generating rendered images (e.g., rendered image(s) 190b, or the like).
  • the computing system may comprise at least one of a HCP (e.g., HCP 120, or the like), a graphics engine, a graphics rendering engine, a game engine, a 3D game engine, a processor on the user device (e.g., user device 105, or the like), or at least one central processing unit (“CPU") core on the user device, and/or the like.
  • the at least one non-GPU processor may each comprise one of a CPU, a multiprocessor, a digital signal processor ("DSP"), a media processor, or another non-GPU processor, and/or the like.
  • the 3D app may comprise one of a 3D game app, a 3D user interface ("UI")-based app, or a 3D guidance app configured for use by a user, and/or the like.
  • the user may comprise one of a medical professional, a scientist, an engineer, an architect, a construction worker, a factory worker, a warehouse worker, a shipping or delivery worker, an app developer, or a graphics designer, and/or the like.
  • the graphics resources may comprise at least one of vertex buffer object (“VBO”) data, element buffer object (“EBO”) data, or uniform buffer object data, and/or the like.
  • the first shader language code may be optimized for per-pixel processing architecture, and the first shader language code may comprise a shader language code based on graphics library shader language ("GLSL").
  • Each of the at least one second shader language code may be optimized for vectorized processing of contiguous pixels, and each of the at least one second shader language code may include a shader language code based on domain specific language of the corresponding one of the at least one non-GPU processor.
  • automatically translating (e.g., using auto-translation system(s) 125, or the like) the first shader language code into the at least one second shader language code may comprise automatically translating the first shader language code into at least one intermediate shader language code.
  • the at least one intermediate shader language code may each include a shader language code based on one of standard portable intermediate representation ("SPIR-V") processing, intermediate generic language, or vectorized algorithm.
  • the computing system may automatically translate each intermediate shader language code into each successive intermediate shader language code until the last successive intermediate shader language code has been produced, and may automatically translate the last successive intermediate shader language code into the second shader language code.
  • the computing system may automatically translate the single intermediate shader language code into the second shader language code.
  • receiving the first shader language code may comprise intercepting, using the computing system, the first shader language code as the first shader language code is being sent from the 3D app to one of a graphics library application programming interface (“API”) or the GPU.
  • the computing system may receive the one or more graphics library commands from the 3D app, the one or more graphics library commands comprising commands for rendering image elements, the image elements comprising the plurality of 3D image elements.
  • the computing system may analyze the received one or more graphics library commands to extract the first shader language code.
  • the computing system may analyze the first shader language code to extract size information for each of an input buffer and an output buffer for each vertex shader; may allocate the calculated graphics resources to a shared memory based at least in part on the extracted size information for the input buffer for each vertex shader; and may allocate a copy of the calculated graphics resources to a GPU memory via a passthrough shader (e.g., passthrough shader 135, or the like) based at least in part on the extracted size information for the output buffer for each vertex shader, the passthrough shader performing no calculations.
  • the computing system may cache all the calculated graphics resources associated with 3D objects in a shared memory (e.g., shared memory 140, or the like) prior to allocation of the graphics resources on a GPU.
  • sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources may comprise sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources after all the graphics resources have been allocated.
  • the computing system may store at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources in shared memory that is accessible by each of the at least one non-GPU processor and the GPU, and/or the like, without the at least one non-GPU processor or the GPU copying any of the at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources, and/or the like.
  • system 100 may be used to improve the performance and/or power consumption of games running on mobile devices (e.g., user device 105, including, but not limited to smart phones, mobile phones, tablet computers, laptop computers, portable gaming devices, and/or the like) by leveraging the different processors to compute the vertex shader stage of a GPU pipeline while using a Heterogeneous Computing Platform (e.g., HCP 120, or the like).
  • HCP provides one integrated system designed to maximize efficiency while extracting the most value from the diverse types of data that the operation uses and generates.
  • HCP uses two or more types of computing cores, for example, a GPU and a CPU or a DSP.
  • a GPU may be used to perform computation in applications traditionally handled by the CPU. Although GPUs operate at lower frequencies, they typically have a larger number of cores, so a GPU can process a larger number of images and graphical data per second than a CPU. Transferring data to a graphical form and then using the GPU to scan and analyze the data creates greater acceleration.
  • In HCP, diverse types of computing cores work together in a wide range of applications.
  • the system moves some tasks from the GPU to other processors, allowing the GPU to have more bandwidth to process 3D rendering scenes and potentially reducing the overall power consumption. Furthermore, the tasks assigned to a DSP or CPU could be run in parallel with the GPU, thus improving the performance of the game (or other graphic application or 3D application), and allowing them to run on older hardware or mobile platforms.
  • cloud computing may be utilized to push the calculation over a network (rather than the GPU), at the cost of higher latency.
  • a multi-GPU system may be utilized, which may require changes to hardware similar to ARM big.LITTLE, having 2 GPUs, the first one with high performance and the second one with low performance reserved for small tasks (but with greater efficiency).
  • this may be achieved via a profiler that examines API calls that are directed to low level system libraries, or the like [referred to herein as "methodology 1" or the like].
  • this may be indirectly verified by removing suspected system libraries from the operating system, and re-running OpenGL apps (or similar graphics library API apps, or the like) and observing any crashes or irregular behaviors [referred to herein as "methodology 2" or the like].
  • this may be verified via decompiling of suspected system libraries (in some cases, subject to terms of service associated with the observed services or systems) [referred to herein as "methodology 3" or the like].
  • methodologies 1 - 3 may be employed.
  • the various embodiments provide a heterogeneous computing platform that delegates the vertex shader processing tasks of 3D elements from a GPU to non-GPU processors (e.g., CPU, DSP, etc.), in some cases, by converting the original shader code (e.g., GLSL code, or the like) to an intermediate code and executing the intermediate code using other processors. This reduces GPU usage, allowing the GPU to have more bandwidth to process 3D rendering scenes and potentially reducing the overall power consumption, and allows 2D rendering tasks by the non-GPU processors to be run in parallel with the GPU, thereby improving the performance of the 3D app (e.g., game or other graphic application or 3D app) and allowing it to run on older hardware and/or mobile platforms.
  • Figs. 2A-2D are schematic block flow diagrams illustrating various non-limiting examples 200 of processes that may be used for implementing HCP for vertex shader processing, in accordance with various embodiments.
  • user device 105 may include, without limitation, at least one of a graphic application 110 (e.g., a 3D application, or the like), HCP pipeline 120, and computing hardware 145.
  • user device 105 may include, but is not limited to, one of a portable gaming device, a smart phone, a tablet computer, a laptop computer, a desktop computer, or a server computer, and/or the like.
  • the graphic application may include data including, but not limited to, at least one of geometry data 205, texture data 210, buffer data 215, and/or the like.
  • the user may include, but is not limited to, one of a medical professional, a scientist, an engineer, an architect, a construction worker, a factory worker, a warehouse worker, a shipping or delivery worker, an app developer, or a graphics designer, and/or the like.
  • HCP pipeline 120 may include, without limitation, graphics library commands 190a, filtering or detection system 225, auto-translation system(s) 125, vertex shader(s) 130, passthrough shader 135, shared memory 140, shader language codes 230a-230c, and one or more languages 235a-235n (collectively, "languages 235" or the like).
  • the one or more languages 235 may include, without limitation, at least one of Halide 235a, HVX 235b, NEON 235c, SVE2 235d, or other language 235n, or the like.
  • computing hardware 145 may comprise GPU 150 and one or more non-GPU processors 155.
  • the one or more non-GPU processors 155 may each include, but is not limited to, one of a CPU 155a, a digital signal processor ("DSP") 155b, a media processor 155c, or another non-GPU processor 155n (e.g., a multiprocessor, or the like), and/or the like.
  • the graphic app (or 3D app) 110 may send 3D image elements or 3D image element data 220a to HCP pipeline 120, in some cases, in the form of graphics library commands 190a, or the like.
  • the 3D image element data 220a and/or the graphics library commands 190a may include, without limitation, the at least one of geometry data 205, texture data 210, buffer data 215, and/or the like.
  • Filtering or detection system 225 may analyze the graphics library commands to extract the first shader language code 230a.
  • Auto-translation system(s) 125 may automatically translate the first shader language code into at least one intermediate shader language code 230b.
  • the at least one intermediate shader language code may each include a shader language code based on one of standard portable intermediate representation ("SPIR-V") processing, intermediate generic language, or vectorized algorithm, and/or the like.
  • auto-translation system(s) 125 may automatically translate each intermediate shader language code 230b into each successive intermediate shader language code 230b until the last successive intermediate shader language code 230b has been produced, and may automatically translate the last successive intermediate shader language code 230b into the second shader language code 230c.
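The chained translation described above can be modeled as an ordered list of stage functions applied in sequence until the last intermediate form is produced, followed by a final lowering to the target shader language. This is an illustrative sketch, not the patent's implementation; the stage contents are placeholders:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

using Translate = std::function<std::string(const std::string&)>;

// Applies each successive intermediate stage in order, then the final
// lowering stage that produces the second shader language code.
std::string translateChain(std::string code,
                           const std::vector<Translate>& intermediates,
                           const Translate& toTarget) {
    for (const auto& stage : intermediates) code = stage(code);
    return toTarget(code);
}
```

With an empty `intermediates` list this degenerates to the single-intermediate case handled elsewhere in the disclosure.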
  • auto-translation system(s) 125 may automatically translate the single intermediate shader language code into the second shader language code.
  • the first shader language code 230a may be optimized for per-pixel processing architecture, and the first shader language code 230a may include a shader language code based on graphics library shader language ("GLSL").
  • Each of the at least one second shader language code 230c may be optimized for vectorized processing of contiguous pixels, and each of the at least one second shader language code 230c may include a shader language code based on domain specific language of the corresponding one of the at least one non-GPU processor (e.g., CPU 155a, DSP 155b, media processor 155c, or other non-GPU processor 155n, or the like).
  • the HCP Pipeline 120 may analyze the first shader language code to extract size information for each of an input buffer and an output buffer for each vertex shader; may allocate the calculated graphics resources to a shared memory based at least in part on the extracted size information for the input buffer for each vertex shader; and may allocate a copy of the calculated graphics resources to a GPU memory via a passthrough shader (e.g., passthrough shader 135, or the like) based at least in part on the extracted size information for the output buffer for each vertex shader, the passthrough shader performing no calculations.
  • the HCP Pipeline 120 may cache all the calculated graphics resources associated with 3D objects in a shared memory (e.g., shared memory 140, or the like) prior to allocation of the graphics resources on a GPU.
  • sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources may comprise sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources after all the graphics resources have been allocated.
  • the HCP Pipeline 120 may store at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources in shared memory that is accessible by each of the at least one non-GPU processor and the GPU, and/or the like, without the at least one non-GPU processor or the GPU copying any of the at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources, and/or the like.
  • HCP pipeline 120 may send the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources, which will be used for generating rendered images (e.g., rendered image(s) 190b, or the like).
  • HCP pipeline 120 may display, on a display screen of user device 105 (e.g., display screen 160c as shown in Fig. 1, or the like) and/or one or more display devices 195a-195n, the rendered graphics resources, in some cases, as part of one or more merged and rendered 2D/3D images 190b.
  • a process block flow is shown in which all graphics library commands 190a (e.g., OpenGL commands, or the like) from graphic app (or 3D app) 110 may be intercepted and processed by the HCP module 120.
  • All graphics library API calls may be asynchronous, and allocation of resources may be separated from the graphics library API calls to trigger rendering of 3D objects or image elements.
  • the first steps of the HCP pipeline may be to cache (e.g., using memory caching system or memory cache 240, or the like) all the resources associated with the 3D objects into a shared memory (e.g., shared memory 140, ION Memory, etc.).
  • Requesting information about a resource already allocated on the GPU is not efficient as it forces it to flush all the previous GPU calls that could have run asynchronously. To avoid such a situation, the caching may be performed prior to allocation on the GPU when the call to the graphics library API is done.
  • the 3D application may trigger the rendering through a draw call (such as glDrawElements or glDrawArrays, etc.).
  • graphics resources that may be used may include, but are not limited to, (a) vertex buffer object ("VBO") data, (b) element buffer object (“EBO") data, or (c) uniform buffer object data, and/or the like.
  • VBO data may contain information regarding the vertices including, without limitation, position, texture coordinate, color, and/or the like.
  • EBO data may contain the indices of the VBO data to execute.
  • Uniform buffer object data (in some cases, referred to as "uniforms" or the like) may contain constant values defined prior to the call to the vertex shader.
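As an illustrative sketch only (not part of the disclosed embodiments), the three buffer kinds described above can be modeled as plain data structures; the field names are assumptions for illustration:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// One VBO entry: per-vertex attributes (position, texture coordinate, color).
struct Vertex {
    std::array<float, 3> position;
    std::array<float, 2> texCoord;
    std::array<float, 4> color;
};

// The three graphics resources named above, grouped per draw call.
struct DrawData {
    std::vector<Vertex>   vbo;      // vertex buffer object data
    std::vector<uint32_t> ebo;      // element buffer object: indices into vbo
    std::array<float, 16> uniforms; // constants fixed before the vertex shader runs
};
```

Because the EBO holds indices into the VBO, shared vertices are stored once and referenced several times, which is what makes the rectangle-from-two-triangles case below compact.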
  • the HCP system architecture is responsible for multiple tasks, including, but not limited to: (1) Registering graphics library API (e.g., OpenGL) trigger functions, in which all graphics library API functions may be intercepted to cache the proper information - e.g., in memory caching system or memory cache 240, or the like; (2) Allocating shared memory, in which shared memory (e.g., shared memory 140) may be allocated to avoid unnecessary memory copying between processors (in some cases, relying on the ION framework provided by Android, or the like); (3) Extracting the first shader language code 230a (e.g., using filtering or detection system 225, or the like); (4) Automatically translating (e.g., using auto-translation system 125, or the like) the first shader language code 230a into at least one intermediate shader language code 245, including, but not limited to, one of standard portable intermediate representation (“SPIR-V”) processing, intermediate generic language, or vectorized algorithm, and/or the like; (4A) automatically translating the at least one intermediate shader language code
  • graphics library resource functions 260 may include, but are not limited to, at least one of graphics library commands for generating buffers 260a (e.g., "glGenBuffers(...)" function that generates data buffers, etc.), graphics library commands for generating texture as 2D image 260b (e.g., "glTexImage2D(...)" function that generates textures as 2D images, etc.), graphics library commands for creating a shader 260c (e.g., "glCreateShader(...)" function that creates space to contain code for shaders, etc.), graphics library commands for copying buffer data 260d (e.g., "glBufferData(...)" function that copies buffer data, etc.), and/or the like.
  • HCP 120 may intercept the one or more graphics library resource functions 260 (e.g., using interception system 265, or the like), may cache the one or more graphics library resource functions 260 (e.g., using memory caching system or memory cache 240, etc.) into shared memory (e.g., shared memory 140, or the like).
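A minimal sketch of the intercept-then-cache pattern described above, under the assumption that each resource call is wrapped so its payload is recorded in a cache standing in for shared memory before being forwarded; the function and cache names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using Blob = std::vector<unsigned char>;

// Stand-in for the HCP cache (memory cache 240 / shared memory 140).
std::map<unsigned, Blob> g_bufferCache;

// Intercepted stand-in for a glBufferData-style call: the upload is cached
// first, then would be forwarded to the real graphics library.
void hcpBufferData(unsigned buffer, const void* data, std::size_t size) {
    const auto* p = static_cast<const unsigned char*>(data);
    g_bufferCache[buffer].assign(p, p + size);
    // ... forward to the real glBufferData(...) here ...
}
```

Caching at interception time, rather than querying the GPU later, is what lets the pipeline avoid the flush described below.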
  • the HCP may utilize shared memory (e.g., shared memory 140, or the like, which may, in some cases, be based on a memory management framework named ION that is provided for Android OS by Google and other manufacturers).
  • This framework provides an API to allocate memory chunks that can be shared across multiple processors through a uniform memory access (“UMA”) architecture, which is suitable for general purpose and time sharing applications by multiple users. Different subsystems may have different ways to allocate those memory chunks.
  • For example, an AHardwareBuffer object and a remote procedure call ("RPC") memory allocation function (e.g., "rpcmem_alloc" or the like) may be used to allocate such shared memory.
  • AHardwareBuffer_lock may return an ION memory that may be shared to a DSP, a network, or any other subsystem available on the same hardware.
  • rpcmem_alloc may be an API provided for a DSP.
  • Vertex shaders are typically written in a shader language such as OpenGL shader language ("GLSL").
  • HCP may use a different language to transform the original GLSL source code optimized for per-pixel processing architecture to a vectorized processing architecture.
  • Rendering of a rectangle using graphics library API is generally done with 2 triangles.
  • Each triangle may contain three vertices that indicate a position in a 3D space. Of the six vertices, only four are unique, as the triangles have two vertices in common to form a rectangle.
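The rectangle case above can be made concrete with a small worked example (corner coordinates chosen arbitrarily for illustration): six indexed vertices describe the two triangles, but only four corners are unique because the shared diagonal edge reuses two of them.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Four unique corners (x, y): 0=bottom-left, 1=bottom-right,
// 2=top-right, 3=top-left.
constexpr std::array<std::array<float, 2>, 4> kCorners = {
    {{0.f, 0.f}, {1.f, 0.f}, {1.f, 1.f}, {0.f, 1.f}}};

// Six indices forming two triangles; corners 0 and 2 lie on the shared edge.
constexpr std::array<uint16_t, 6> kIndices = {0, 1, 2, 0, 2, 3};

// Number of distinct vertices actually referenced by the index buffer.
inline std::size_t uniqueVertexCount() {
    return std::set<uint16_t>(kIndices.begin(), kIndices.end()).size();
}
```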
  • the processing may be performed pixel by pixel. There is a phase to generate an area in which the 2D or UI element may be rendered with the texture and the GPU may call a fragment shader for each pixel within this area. This means it will access the pixel memory individually by processing them simultaneously on multiple cores.
  • the memory is not accessed per pixel, but accessed as a set of contiguous pixels (i.e., a vector of pixels, or the like).
  • the hardware provides features to run a same operation on a vector of pixels at the same time, which is known as a Single Instruction Multiple Data (“SIMD") instruction set, and this kind of code is often referred as a vectorized algorithm.
  • On a CPU, the SIMD instruction set is named NEON, while its alternative on a DSP is HVX, the main differences being the size of the vector that they can process at once and the assembly instructions available.
  • One of the main requirements to be efficient is that the memory accesses need to be contiguous.
  • the support of GPU shader on HCP may be handled by identifying: (a) the top left corner out of the six vertices; and (b) the type of rotation applied to the 2D or UI elements (e.g., none, 90, 180, or 270 degrees). This type of rotation may be performed through standardizing the input coordinates and generating all the transformation of a texture in memory.
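Step (a) above can be sketched as a small selection routine; this is an illustrative assumption about the selection rule, using screen-style coordinates (y grows downward), so "top-left" means smallest x, then smallest y:

```cpp
#include <array>
#include <cassert>
#include <vector>

using Vec2 = std::array<float, 2>;

// Pick the top-left corner out of the six rectangle vertices
// (smallest x, ties broken by smallest y; screen coordinates assumed).
Vec2 topLeft(const std::vector<Vec2>& verts) {
    Vec2 best = verts.front();
    for (const auto& v : verts)
        if (v[0] < best[0] || (v[0] == best[0] && v[1] < best[1]))
            best = v;
    return best;
}
```

Step (b), classifying the rotation as 0, 90, 180, or 270 degrees, could then compare the texture coordinates attached to that corner against the standardized input coordinates.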
  • vertex shader GLSL 230a may be auto-translated (e.g., by SPIR-V Translation system 125a, or the like) into an intermediate shader language (e.g., standard portable intermediate representation ("SPIR-V”) processing or other shader language code such as intermediate generic language or vectorized algorithm, etc.).
  • HCP needs to allocate two VBOs and/or EBOs.
  • an analysis of the GLSL shader source code may be performed to extract that information.
  • HCP 120 or analysis system 270 may analyze at least one of the vertex shader GLSL 230a (e.g., first shader language code, etc.) or the SPIR-V code (e.g., intermediate shader language, etc.) to extract size information for each of an input buffer (e.g., at block 275) and an output buffer (e.g., at block 285) for each vertex shader.
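The size-extraction analysis described above could, under the assumption of a deliberately minimal parser, scan the shader source's interface declarations and sum standard GLSL type sizes (attributes assumed tightly packed); the function name and parsing approach are illustrative only:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Size in bytes of common GLSL interface types (float-based, tightly packed).
static const std::map<std::string, std::size_t> kTypeSize = {
    {"float", 4}, {"vec2", 8}, {"vec3", 12}, {"vec4", 16}, {"mat4", 64}};

// Per-vertex size, in bytes, of all declarations with the given qualifier
// ("in" for the input buffer, "out" for the output buffer).
std::size_t perVertexSize(const std::string& glslSource,
                          const std::string& qualifier) {
    std::regex decl("\\b" + qualifier +
                    "\\s+(float|vec2|vec3|vec4|mat4)\\s+\\w+\\s*;");
    std::size_t total = 0;
    for (std::sregex_iterator it(glslSource.begin(), glslSource.end(), decl),
         end; it != end; ++it)
        total += kTypeSize.at((*it)[1].str());
    return total;
}
```

The two totals then size the shared-memory allocation (input side) and the GPU-memory allocation behind the passthrough shader (output side).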
  • HCP 120 or analysis system 270 may allocate the calculated graphics resources (e.g., VBO or EBO, or the like) to a shared memory (e.g., shared memory 140, or the like) based at least in part on the extracted size information for the input buffer for each vertex shader (e.g., at block 280) and may allocate a copy of the calculated graphics resources to a GPU memory (e.g., GPU memory 255, or the like) via a passthrough shader based at least in part on the extracted size information for the output buffer for each vertex shader (e.g., at block 290), respectively.
  • the passthrough shader would be configured to pass the copy of the calculated graphics resources without performing any calculations.
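For concreteness, a passthrough vertex shader of the kind described above might look like the following GLSL source (held here as a string constant; the attribute names and GLSL version are assumptions, not taken from the patent). It forwards pre-computed data unchanged, performing no calculation:

```cpp
#include <cassert>
#include <string>

// Illustrative passthrough vertex shader source.
const std::string kPassthroughVS = R"(#version 300 es
in vec4 aPrecomputedPosition;  // already transformed by HCP on the CPU/DSP
in vec2 aTexCoord;
out vec2 vTexCoord;
void main() {
    gl_Position = aPrecomputedPosition;  // copy only, no math
    vTexCoord = aTexCoord;
})";
```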
  • Fig. 3 is a schematic block flow diagram illustrating a non-limiting example 300 of auto-translation of one shader language code to another shader language code during implementation of HCP for vertex shader processing, in accordance with various embodiments.
  • a filtering step may be run to make sure all the resources needed for a valid execution are available and that all of the processors (e.g., non-GPU processors, etc.) are in a state to accept new tasks.
  • the instructions to execute the vertex shader may be located on the GPU and may be described in graphics library shader language ("GLSL"), optimized for GPU execution. This language is not efficient for vectorized computation, as it was developed for GPU architecture, which relies on a scalar execution that could be parallelized on multiple internal co-processors.
  • auto-translation system(s) 125 may auto-translate the GLSL code 230a into an intermediate language that could then be translated into domain specific language (e.g., Halide 235a, HVX 235b, NEON 235c, or SVE2 235d, etc.) of the processor (e.g., non-GPU processor(s) including, but not limited to, CPU 155a, DSP 155b, etc.).
  • Standard portable intermediate representation ("SPIR-V") is an intermediate language that may be chosen, to which GLSL may be converted.
  • Halide or similar languages provide a generic path of translation to other processors, as Halide is a third-party library with a custom domain-specific language supporting generation of vectorized code for multiple processors (e.g., non-GPU processor(s) including, but not limited to, CPU 155a, DSP 155b, etc.). According to some embodiments, some optimizations may not be possible or efficient and may require a custom translation to those processors. As the GPU pipeline for a specific 3D object cannot be completely skipped, the various embodiments may replace elements running in HCP by a passthrough shader (e.g., passthrough shader 135 of Fig. 2B, or the like). All the data that would previously have been calculated on a GPU, but is now calculated in HCP, may simply be copied onto the GPU, avoiding any calculation.
  • Figs. 4A and 4B are flow diagrams illustrating a method 400 for implementing HCP for vertex shader processing, in accordance with various embodiments.
  • Method 400 of Fig. 4A continues onto Fig. 4C following the circular marker denoted, "A,” and may return to Fig. 4A following the circular marker denoted, "B.”
  • method 400 may comprise receiving, using a computing system, one or more graphics library commands from a three-dimensional ("3D") software application (“app"), the one or more graphics library commands comprising commands for rendering image elements, the image elements comprising the plurality of 3D image elements.
  • receiving the first shader language code may comprise intercepting, using the computing system, the first shader language code as the first shader language code is being sent from the 3D app to one of a graphics library application programming interface ("API") or a graphics processing unit (“GPU").
  • the computing system may comprise at least one of a HCP, a graphics engine, a graphics rendering engine, a game engine, a 3D game engine, a processor on the user device, or at least one central processing unit (“CPU") core on the user device, and/or the like.
  • the 3D app may comprise one of a 3D game app, a 3D user interface ("UI") -based app, or a 3D guidance app configured for use by a user, and/or the like.
  • the user may comprise one of a medical professional, a scientist, an engineer, an architect, a construction worker, a factory worker, a warehouse worker, a shipping or delivery worker, an app developer, or a graphics designer, and/or the like.
  • method 400 may comprise analyzing, using the computing system, the received one or more graphics library commands to extract a first shader language code associated with one or more 3D image elements among a plurality of 3D image elements extracted from one or more graphics library commands from the 3D app.
  • Method 400 may further comprise, at block 415, receiving, using the computing system, the first shader language code associated with the one or more 3D image elements among the plurality of 3D image elements extracted from the one or more graphics library commands from the 3D app. Method 400 may continue onto the process at block 420 or may continue onto the process at block 470 in Fig. 4C following the circular marker denoted, "A."
  • At block 420, method 400 may comprise automatically translating, using the computing system, the first shader language code into at least one second shader language code each corresponding to one of at least one non-graphics processing unit ("non-GPU") processor. In some instances, the at least one non-GPU processor may each comprise one of a CPU, a multiprocessor, a digital signal processor ("DSP"), a media processor, or another non-GPU processor, and/or the like.
  • Method 400 may further comprise sending, using the computing system, each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app, instead of vertex shader processing by a GPU (block 425); and sending, using the computing system, the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources (block 430).
  • the graphics resources may comprise at least one of vertex buffer object (“VBO”) data, element buffer object (“EBO”) data, or uniform buffer object data, and/or the like.
  • Method 400 may comprise, at block 435, rendering, using the renderer, the graphics resources.
  • method 400 may comprise storing, using the computing system, at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources in shared memory that is accessible by each of the at least one non-GPU processor and the GPU, and/or the like, without the at least one non-GPU processor or the GPU copying any of the at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources, and/or the like.
  • the first shader language code may be optimized for per-pixel processing architecture, and the first shader language code may comprise a shader language code based on graphics library shader language ("GLSL").
  • Each of the at least one second shader language code may be optimized for vectorized processing of contiguous pixels, and each of the at least one second shader language code may comprise a shader language code based on domain specific language of the corresponding one of the at least one non-GPU processor.
  • automatically translating the first shader language code into the at least one second shader language code may comprise automatically translating, using the computing system, the first shader language code into at least one intermediate shader language code (block 445).
  • the at least one intermediate shader language code may each include a shader language code based on one of standard portable intermediate representation ("SPIR-V") processing, intermediate generic language, or vectorized algorithm, and/or the like.
  • method 400 may comprise automatically translating, using the computing system, each intermediate shader language code into each successive intermediate shader language code (block 450).
  • method 400 may comprise determining whether the last successive intermediate shader language code has been produced based on the automatic translation. If not, method 400 may return to the process at block 450. If so, method 400 may continue onto the process at block 460.
  • method 400 may comprise automatically translating, using the computing system, the last successive intermediate shader language code into the second shader language code.
  • Alternatively, method 400 may comprise automatically translating, using the computing system, the single intermediate shader language code into the second shader language code (block 465).
  • method 400 may comprise analyzing, using the computing system, the first shader language code to extract size information for each of an input buffer and an output buffer for each vertex shader.
  • Method 400 may further comprise caching, using the computing system, all the calculated graphics resources associated with 3D objects in a shared memory prior to allocation of the graphics resources on a GPU (block 475); allocating, using the computing system, the calculated graphics resources to a shared memory based at least in part on the extracted size information for the input buffer for each vertex shader (block 480); allocating, using the computing system, a copy of the calculated graphics resources to a GPU memory via a passthrough shader based at least in part on the extracted size information for the output buffer for each vertex shader, the passthrough shader performing no calculations (block 485); and sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources after all the graphics resources have been allocated (block 490).
  • Method 400 may return to the process at block 435 in Fig. 4A following the circular marker denoted, "B.”
  • Fig. 5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments.
  • Fig. 5 provides a schematic illustration of one embodiment of a computer system 500 of the service provider system hardware that can perform the methods provided by various other embodiments, as described herein, and/or can perform the functions of computer or hardware system (i.e., user device 105, computing system 115, heterogeneous computing platform ("HCP") 120, computing hardware 145, display screen 160c, audio playback device 160d, content source(s) 170, content distribution system 180, and display devices 195a-195n, etc.), as described above.
  • Fig. 5 is meant only to provide a generalized illustration of various components, of which one or more (or none) of each may be utilized as appropriate.
  • Fig. 5, therefore, broadly illustrates how individual system elements may be implemented in a relatively separated or relatively more integrated manner.
  • the computer or hardware system 500 - which might represent an embodiment of the computer or hardware system (i.e., user device 105, computing system 115, HCP 120, computing hardware 145, display screen 160c, audio playback device 160d, content source(s) 170, content distribution system 180, and display devices 195a-195n, etc.), described above with respect to Figs. 1-4 - is shown comprising hardware elements that can be electrically coupled via a bus 505 (or may otherwise be in communication, as appropriate).
  • the hardware elements may include one or more processors 510, including, without limitation, one or more general-purpose processors and/or one or more special-purpose processors (such as microprocessors, digital signal processing chips, graphics acceleration processors, and/or the like); one or more input devices 515, which can include, without limitation, a mouse, a keyboard, and/or the like; and one or more output devices 520, which can include, without limitation, a display device, a printer, and/or the like.
  • the computer or hardware system 500 may further include (and/or be in communication with) one or more storage devices 525, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, or a solid-state storage device such as a random access memory ("RAM") and/or a read-only memory ("ROM"), which can be programmable, flash-updateable, and/or the like.
  • Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like.
  • the computer or hardware system 500 might also include a communications subsystem 530, which can include, without limitation, a modem, a network card (wireless or wired), an infra-red communication device, a wireless communication device and/or chipset (such as a BluetoothTM device, an 802.11 device, a WiFi device, a WiMax device, a WWAN device, cellular communication facilities, etc.), and/or the like.
  • the communications subsystem 530 may permit data to be exchanged with a network (such as the network described below, to name one example), with other computer or hardware systems, and/or with any other devices described herein.
  • the computer or hardware system 500 will further comprise a working memory 535, which can include a RAM or ROM device, as described above.
  • the computer or hardware system 500 also may comprise software elements, shown as being currently located within the working memory 535, including an operating system 540, device drivers, executable libraries, and/or other code, such as one or more application programs 545, which may comprise computer programs provided by various embodiments (including, without limitation, hypervisors, VMs, and the like), and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein.
  • one or more procedures described with respect to the method(s) discussed above might be implemented as code and/or instructions executable by a computer (and/or a processor within a computer); in an aspect, then, such code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.
  • a set of these instructions and/or code might be encoded and/or stored on a non-transitory computer readable storage medium, such as the storage device(s) 525 described above. In some cases, the storage medium might be incorporated within a computer system, such as the system 500.
  • the storage medium might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon.
  • These instructions might take the form of executable code, which is executable by the computer or hardware system 500 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer or hardware system 500 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.) then takes the form of executable code.
  • some embodiments may employ a computer or hardware system (such as the computer or hardware system 500) to perform methods in accordance with various embodiments of the invention.
  • some or all of the procedures of such methods are performed by the computer or hardware system 500 in response to processor 510 executing one or more sequences of one or more instructions (which might be incorporated into the operating system 540 and/or other code, such as an application program 545) contained in the working memory 535.
  • Such instructions may be read into the working memory 535 from another computer readable medium, such as one or more of the storage device(s) 525.
  • execution of the sequences of instructions contained in the working memory 535 might cause the processor(s) 510 to perform one or more procedures of the methods described herein.
  • The terms "machine readable medium" and "computer readable medium," as used herein, refer to any medium that participates in providing data that causes a machine to operate in some fashion.
  • various computer readable media might be involved in providing instructions/code to processor(s) 510 for execution and/or might be used to store and/or carry such instructions/code (e.g., as signals).
  • a computer readable medium is a non-transitory, physical, and/or tangible storage medium.
  • a computer readable medium may take many forms, including, but not limited to, non-volatile media, volatile media, or the like.
  • Non-volatile media includes, for example, optical and/or magnetic disks, such as the storage device(s) 525.
  • Volatile media includes, without limitation, dynamic memory, such as the working memory 535.
  • a computer readable medium may take the form of transmission media, which includes, without limitation, coaxial cables, copper wire, and fiber optics, including the wires that comprise the bus 505, as well as the various components of the communication subsystem 530 (and/or the media by which the communications subsystem 530 provides communication with other devices).
  • transmission media can also take the form of waves (including without limitation radio, acoustic, and/or light waves, such as those generated during radiowave and infra-red data communications).
  • Common forms of physical and/or tangible computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code.
  • Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the processor(s) 510 for execution.
  • the instructions may initially be carried on a magnetic disk and/or optical disc of a remote computer.
  • a remote computer might load the instructions into its dynamic memory and send the instructions as signals over a transmission medium to be received and/or executed by the computer or hardware system 500.
  • These signals, which might be in the form of electromagnetic signals, acoustic signals, optical signals, and/or the like, are all examples of carrier waves on which instructions can be encoded, in accordance with various embodiments of the invention.
  • the communications subsystem 530 (and/or components thereof) generally will receive the signals, and the bus 505 then might carry the signals (and/or the data, instructions, etc. carried by the signals) to the working memory 535, from which the processor(s) 510 retrieves and executes the instructions.
  • the instructions received by the working memory 535 may optionally be stored on a storage device 525 either before or after execution by the processor(s) 510.
  • a set of embodiments comprises methods and systems for implementing two-dimensional ("2D") and/or three-dimensional ("3D") rendering, and, more particularly, methods, systems, and apparatuses for implementing heterogeneous computing platform ("HCP") for vertex shader processing.
  • Fig. 6 illustrates a schematic diagram of a system 600 that can be used in accordance with one set of embodiments.
  • the system 600 can include one or more user computers, user devices, or customer devices 605.
  • a user computer, user device, or customer device 605 can be a general purpose personal computer (including, merely by way of example, desktop computers, tablet computers, laptop computers, handheld computers, and the like, running any appropriate operating system, several of which are available from vendors such as Apple, Microsoft Corp., and the like), cloud computing devices, a server(s), and/or a workstation computer(s) running any of a variety of commercially-available UNIXTM or UNIX-like operating systems.
  • a user computer, user device, or customer device 605 can also have any of a variety of applications, including one or more applications configured to perform methods provided by various embodiments (as described above, for example), as well as one or more office applications, database client and/or server applications, and/or web browser applications.
  • a user computer, user device, or customer device 605 can be any other electronic device, such as a thin-client computer, Internet-enabled mobile telephone, and/or personal digital assistant, capable of communicating via a network (e.g., the network(s) 610 described below) and/or of displaying and navigating web pages or other types of electronic documents.
  • Although the system 600 is shown with two user computers, user devices, or customer devices 605, any number of user computers, user devices, or customer devices can be supported.
  • Some embodiments operate in a networked environment, which can include a network(s) 610.
  • the network(s) 610 can be any type of network familiar to those skilled in the art that can support data communications using any of a variety of commercially-available (and/or free or proprietary) protocols, including, without limitation, TCP/IP, SNATM, IPXTM, AppleTalkTM, and the like.
  • Merely by way of example, the network(s) 610 can include a local area network ("LAN"); a wide-area network ("WAN"); a wireless wide area network ("WWAN"); a virtual private network ("VPN"); a public switched telephone network ("PSTN"); a wireless network, including, without limitation, a network operating under any of the IEEE 802.11 suite of protocols, the BluetoothTM protocol known in the art, and/or any other wireless protocol; and/or any combination of these and/or other networks.
  • the network might include an access network of the service provider (e.g., an Internet service provider (“ISP”)).
  • the network might include a core network of the service provider, and/or the Internet.
  • Embodiments can also include one or more server computers 615.
  • Each of the server computers 615 may be configured with an operating system, including, without limitation, any of those discussed above, as well as any commercially (or freely) available server operating systems.
  • Each of the servers 615 may also be running one or more applications, which can be configured to provide services to one or more clients 605 and/or other servers 615.
  • one of the servers 615 might be a data server, a web server, a cloud computing device(s), or the like, as described above.
  • the data server might include (or be in communication with) a web server, which can be used, merely by way of example, to process requests for web pages or other electronic documents from user computers 605.
  • the web server can also run a variety of server applications, including HTTP servers, FTP servers, CGI servers, database servers, Java servers, and the like.
  • the web server may be configured to serve web pages that can be operated within a web browser on one or more of the user computers 605 to perform methods of the invention.
  • the server computers 615 might include one or more application servers, which can be configured with one or more applications accessible by a client running on one or more of the client computers 605 and/or other servers 615.
  • the server(s) 615 can be one or more general purpose computers capable of executing programs or scripts in response to the user computers 605 and/or other servers 615, including, without limitation, web applications (which might, in some cases, be configured to perform methods provided by various embodiments).
  • a web application can be implemented as one or more scripts or programs written in any suitable programming language, such as JavaTM, C, C#TM or C++, and/or any scripting language, such as Perl, Python, or TCL, as well as combinations of any programming and/or scripting languages.
  • the application server(s) can also include database servers, including, without limitation, those commercially available from OracleTM, MicrosoftTM, SybaseTM, IBMTM, and the like, which can process requests from clients (including, depending on the configuration, dedicated database clients, API clients, web browsers, etc.) running on a user computer, user device, or customer device 605 and/or another server 615.
  • an application server can perform one or more of the processes for implementing 2D and/or 3D rendering and, more particularly, for implementing HCP for vertex shader processing, as described in detail above.
  • Data provided by an application server may be formatted as one or more web pages (comprising HTML, JavaScript, etc., for example) and/or may be forwarded to a user computer 605 via a web server (as described above, for example).
  • a web server might receive web page requests and/or input data from a user computer 605 and/or forward the web page requests and/or input data to an application server.
  • a web server may be integrated with an application server.
  • one or more servers 615 can function as a file server and/or can include one or more of the files (e.g., application code, data files, etc.) necessary to implement various disclosed methods, incorporated by an application running on a user computer 605 and/or another server 615.
  • a file server can include all necessary files, allowing such an application to be invoked remotely by a user computer, user device, or customer device 605 and/or server 615.
  • the system can include one or more databases 620a-620n (collectively, "databases 620").
  • The location of each of the databases 620 is discretionary: merely by way of example, a database 620a might reside on a storage medium local to (and/or resident in) a server 615a (and/or a user computer, user device, or customer device 605).
  • a database 620n can be remote from any or all of the computers 605, 615, so long as it can be in communication (e.g., via the network 610) with one or more of these.
  • a database 620 can reside in a storage-area network ("SAN") familiar to those skilled in the art.
  • the database 620 can be a relational database, such as an Oracle database, that is adapted to store, update, and retrieve data in response to SQL-formatted commands.
  • the database might be controlled and/or maintained by a database server, as described above, for example.
  • user device 605 may comprise a graphic application 625 (similar to graphic applications 110 of Figs. 1 and 2, or the like), a computing system(s) 630 (similar to computing system 115 of Fig. 1, or the like), and computing hardware 660 (similar to computing hardware 145 of Figs. 1 and 2, or the like).
  • Computing system(s) 630 may comprise heterogeneous computing platform ("HCP") 635 (similar to HCP 120 of Figs. 1 and 2, or the like), which may comprise an auto-translation system(s) 640 (similar to auto-translation system(s) 125 of Figs. 1 and 2, or the like).
  • Computing hardware 660 may comprise a graphics processing unit ("GPU") 665 (similar to GPUs 150 of Figs. 1 and 2, or the like) and one or more non-GPU processors 670 (similar to non-GPU processors 155 or 155a-155n of Figs. 1 and 2, or the like).
  • User device 605 may further comprise data storage device 675a (similar to data storage device 160a of Fig. 1, or the like), communications system 675b (similar to communications system 160b of Fig. 1, or the like), display screen 675c (similar to display screen 160c of Fig. 1, or the like), and audio playback device 675d (optional; similar to audio playback device 160d of Fig. 1, or the like).
  • System 600 may further comprise one or more content sources 680 and corresponding database(s) 685 (similar to content source(s) 170 and corresponding database(s) 175 of Fig. 1, or the like) and, in some cases, one or more content distribution systems 690 and corresponding database(s) 695 (similar to content distribution system(s) 180 and corresponding database(s) 185 of Fig. 1, or the like).
  • a computing system 630, HCP 635, and/or computing hardware 660 may receive a first shader language code associated with one or more 3D image elements among a plurality of 3D image elements extracted from one or more graphics library commands from a 3D app (e.g., 3D app 625, or the like).
  • the computing system may automatically translate the first shader language code into at least one second shader language code each corresponding to one of at least one non-graphics processing unit (“non-GPU”) processor (e.g., non-GPU processor(s) 670, or the like).
  • the computing system may send each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app, instead of vertex shader processing by a GPU (e.g., GPU 665, or the like).
  • the computing system may send the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources, which will be used for generating rendered images.
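The offload flow in the preceding bullets (receive the first shader language code, translate it per target non-GPU processor, dispatch for vertex shader processing, and package the results into a draw call) can be sketched as a toy model. All names here (translate, vertex_shade, hcp_process) are illustrative assumptions for exposition, not any real API, and the doubling transform merely stands in for arbitrary vertex-shader work:

```python
def translate(glsl_source, target):
    """Stand-in translator: produce target-specific shader code.

    A real translator would lower GLSL to the target processor's
    domain-specific language; here we just tag the source."""
    return f"[{target}] {glsl_source}"


def vertex_shade(translated_code, vertices):
    """Stand-in for vertex shader processing on a non-GPU processor.

    The trivial doubling transform marks where per-vertex work
    (the calculation of graphics resources) would happen."""
    return [(x * 2.0, y * 2.0, z * 2.0) for (x, y, z) in vertices]


def hcp_process(first_shader_code, vertices, non_gpu_targets):
    # 1) Translate the first shader language code once per target processor.
    second_codes = {t: translate(first_shader_code, t) for t in non_gpu_targets}
    # 2) Dispatch each translated shader to its processor to calculate
    #    graphics resources (e.g., VBO/EBO data) instead of the GPU.
    resources = {t: vertex_shade(code, vertices)
                 for t, code in second_codes.items()}
    # 3) Package the calculated resources into a draw call for the renderer.
    return {"resources": resources, "primitive": "triangles"}
```

For example, `hcp_process("void main() { ... }", [(1.0, 0.0, 0.0)], ["cpu", "dsp"])` would return a draw-call dictionary whose resources were computed entirely off the GPU.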
  • the computing system may comprise at least one of a HCP (e.g., HCP 635, or the like), a graphics engine, a graphics rendering engine, a game engine, a 3D game engine, a processor on the user device (e.g., user device 605, or the like), or at least one central processing unit (“CPU") core on the user device, and/or the like.
  • the at least one non-GPU processor may each comprise one of a CPU, a multiprocessor, a digital signal processor ("DSP"), a media processor, or another non-GPU processor, and/or the like.
  • the 3D app may comprise one of a 3D game app, a 3D user interface ("UI")-based app, or a 3D guidance app configured for use by a user, and/or the like.
  • the user may comprise one of a medical professional, a scientist, an engineer, an architect, a construction worker, a factory worker, a warehouse worker, a shipping or delivery worker, an app developer, or a graphics designer, and/or the like.
  • the graphics resources may comprise at least one of vertex buffer object (“VBO”) data, element buffer object (“EBO”) data, or uniform buffer object data, and/or the like.
  • the first shader language code may be optimized for per-pixel processing architecture, and the first shader language code may comprise a shader language code based on graphics library shader language ("GLSL").
  • Each of the at least one second shader language code may be optimized for vectorized processing of contiguous pixels, and each of the at least one second shader language code may include a shader language code based on domain specific language of the corresponding one of the at least one non-GPU processor.
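The contrast drawn above, between code optimized for per-pixel (per-element) processing and code optimized for vectorized processing of contiguous data, can be illustrated with a toy model. No real shader language appears here; both functions are hypothetical stand-ins that compute the same result in the two styles:

```python
def per_vertex(vertex, scale):
    """One invocation per element, mirroring the GPU-oriented model
    of the first shader language code."""
    x, y, z = vertex
    return (x * scale, y * scale, z * scale)


def vectorized(flat_positions, scale):
    """A single pass over a contiguous buffer, mirroring the vectorized
    processing of contiguous pixels targeted by the second shader
    language code on a DSP or CPU with SIMD units."""
    return [v * scale for v in flat_positions]
```

The translation step described above must preserve this equivalence: applying `per_vertex` to each vertex yields the same values, element for element, as one `vectorized` pass over the flattened buffer.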
  • automatically translating (e.g., using auto-translation system(s) 640, or the like) the first shader language code into the at least one second shader language code may comprise automatically translating the first shader language code into at least one intermediate shader language code.
  • the at least one intermediate shader language code may each include a shader language code based on one of standard portable intermediate representation ("SPIR-V") processing, intermediate generic language, or vectorized algorithm.
  • the computing system may automatically translate each intermediate shader language code into each successive intermediate shader language code until the last successive intermediate shader language code has been produced, and may automatically translate the last successive intermediate shader language code into the second shader language code.
  • the computing system may automatically translate the single intermediate shader language code into the second shader language code.
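The chained translation described above (first shader language code lowered through one or more intermediate shader language codes, such as a SPIR-V-like representation, until the target-specific second shader language code is produced) reduces to a simple driver loop. The stage functions below are illustrative placeholders, not real compiler stages:

```python
def translate_through(source, stages):
    """Apply each translation stage in order; each stage consumes the
    previous (intermediate) code, and the last stage yields the second
    shader language code for the target processor."""
    code = source
    for stage in stages:
        code = stage(code)
    return code


# Hypothetical two-stage chain: GLSL-like source -> intermediate IR -> target DSL.
to_ir = lambda src: ("IR", src)          # stand-in for GLSL -> SPIR-V-like IR
to_target = lambda ir: ("DSP", ir[1])    # stand-in for IR -> DSP-specific code
```

With a single intermediate representation the list has two stages; with several successive intermediates, additional stage functions are simply appended to the list.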
  • receiving the first shader language code may comprise intercepting, using the computing system, the first shader language code as the first shader language code is being sent from the 3D app to one of a graphics library application programming interface (“API”) or the GPU.
  • the computing system may receive the one or more graphics library commands from the 3D app, the one or more graphics library commands comprising commands for rendering image elements, the image elements comprising the plurality of 3D image elements.
  • the computing system may analyze the received one or more graphics library commands to extract the first shader language code.
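The interception and extraction steps above can be sketched as a shim that sits between the 3D app and the graphics library. The command name is modeled loosely on the real OpenGL `glShaderSource` call, but the interceptor class itself is a hypothetical construct for illustration:

```python
class CommandInterceptor:
    """Captures shader source from graphics library commands before they
    reach the graphics API or the GPU, forwarding everything unchanged."""

    def __init__(self):
        self.shader_sources = []  # extracted first shader language code

    def on_command(self, name, *args):
        # Analyze each command; when the app submits shader source,
        # record a copy for translation.
        if name == "glShaderSource":
            shader_id, source = args
            self.shader_sources.append(source)
        # Pass the command through untouched to the real graphics library.
        return (name, args)
```

A stream of commands for rendering image elements would flow through `on_command` unchanged, while the first shader language code is siphoned off for the auto-translation step.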
  • the computing system may analyze the first shader language code to extract size information for each of an input buffer and an output buffer for each vertex shader; may allocate the calculated graphics resources to a shared memory based at least in part on the extracted size information for the input buffer for each vertex shader; and may allocate a copy of the calculated graphics resources to a GPU memory via a passthrough shader (e.g., passthrough shader 650, or the like) based at least in part on the extracted size information for the output buffer for each vertex shader, the passthrough shader performing no calculations.
  • the computing system may cache all the calculated graphics resources associated with 3D objects in a shared memory (e.g., shared memory 655 or the like) prior to allocation of the graphics resources on a GPU.
  • sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources may comprise sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources after all the graphics resources have been allocated.
  • the computing system may store at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources in shared memory that is accessible by each of the at least one non-GPU processor and the GPU, and/or the like, without the at least one non-GPU processor or the GPU copying any of the at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources, and/or the like.
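The buffer-size extraction and shared-memory bookkeeping described in the bullets above can be sketched as follows. This is a toy model only: the `in[N]` / `out[N]` markers are an invented convention standing in for whatever size information the real analysis derives from each vertex shader, and the shared-memory class is an illustrative placeholder:

```python
import re


def extract_buffer_sizes(shader_code):
    """Extract input/output buffer sizes for a vertex shader, assuming
    the (hypothetical) markers 'in[N]' and 'out[N]' appear in the text."""
    def size(kind):
        m = re.search(rf"{kind}\[(\d+)\]", shader_code)
        return int(m.group(1)) if m else 0
    return size("in"), size("out")


class SharedMemory:
    """Region visible to both the non-GPU processors and the GPU, so
    calculated graphics resources need not be copied between them."""

    def __init__(self):
        self.regions = {}

    def allocate(self, key, size):
        self.regions[key] = bytearray(size)
        return self.regions[key]
```

In this sketch the input-buffer size drives the shared-memory allocation for the calculated resources, while the output-buffer size would govern the GPU-memory copy reserved via the pass-through (no-calculation) shader.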

Abstract

Novel tools and techniques are provided for implementing heterogeneous computing platform ("HCP") for vertex shader processing. In various embodiments, a computing system may receive a first shader language code associated with 3D image elements among a plurality of 3D image elements extracted from graphics library commands from a 3D app, and may translate the first shader language code into at least one second shader language code each corresponding to one of at least one non-GPU processor, and may send each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app, instead of vertex shader processing by GPU. The computing system may send the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources.

Description

HETEROGENEOUS COMPUTING PLATFORM (HCP) FOR VERTEX SHADER PROCESSING
COPYRIGHT STATEMENT
[0001] A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
FIELD
[0002] The present disclosure relates, in general, to methods, systems, and apparatuses for implementing two-dimensional ("2D") and/or three-dimensional ("3D") rendering, and, more particularly, to methods, systems, and apparatuses for implementing heterogeneous computing platform ("HCP") for vertex shader processing.
BACKGROUND
[0003] Mobile platforms have evolved drastically over the past few years. System on Chip ("SoC") solutions contain multiple processors dedicated to specific tasks to optimize overall performance, such as audio processing using a hardware decoder, or a camera using a hardware-accelerated image signal processor ("ISP"), or the like.
[0004] A game's development on a mobile phone heavily relies on OpenGL or other graphics library API for its development. This graphics application programming interface ("API") mainly focuses on the graphics processing unit ("GPU") processor.
[0005] Conventional systems provide for heterogeneous computing for graphics pipelines or shared virtual memory. However, none of the conventional systems specifically relate to vertex shader processing and auto-detection of resources in 3D apps (e.g., games).
[0006] Hence, there is a need for more robust and scalable solutions for implementing two-dimensional ("2D") and/or three-dimensional ("3D") rendering, and, more particularly, methods, systems, and apparatuses for implementing heterogeneous computing platform ("HCP") for vertex shader processing.
SUMMARY
[0007] The techniques of this disclosure generally relate to tools and techniques for implementing 2D and/or 3D rendering, and, more particularly, to methods, systems, and apparatuses for implementing HCP for vertex shader processing.
[0008] In an aspect, a method may comprise receiving, using a computing system, a first shader language code associated with one or more three-dimensional ("3D") image elements among a plurality of 3D image elements extracted from one or more graphics library commands from a 3D software application ("app"); translating, using the computing system, the first shader language code into at least one second shader language code each corresponding to one of at least one non-graphics processing unit ("non-GPU") processor; sending, using the computing system, each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app; and sending, using the computing system, the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources.
[0009] In another aspect, an apparatus might comprise at least one processor and a non-transitory computer readable medium communicatively coupled to the at least one processor. The non-transitory computer readable medium might have stored thereon computer software comprising a set of instructions that, when executed by the at least one processor, causes the apparatus to: receive a first shader language code associated with one or more three-dimensional ("3D") image elements among a plurality of 3D image elements extracted from one or more graphics library commands from a 3D software application ("app"); translate the first shader language code into at least one second shader language code each corresponding to one of at least one non-graphics processing unit ("non-GPU") processor; send each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app; and send the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources.
[0010] In yet another aspect, a system might comprise a computing system, which might comprise at least one first processor and a first non-transitory computer readable medium communicatively coupled to the at least one first processor. The first non-transitory computer readable medium might have stored thereon computer software comprising a first set of instructions that, when executed by the at least one first processor, causes the computing system to: receive a first shader language code associated with one or more three-dimensional ("3D") image elements among a plurality of 3D image elements extracted from one or more graphics library commands from a 3D software application ("app"); translate the first shader language code into at least one second shader language code each corresponding to one of at least one non-graphics processing unit ("non-GPU") processor; send each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app; and send the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources.
[0011] Various modifications and additions can be made to the embodiments discussed without departing from the scope of the invention. For example, while the embodiments described above refer to particular features, the scope of this invention also includes embodiments having different combination of features and embodiments that do not include all of the above-described features.
[0012] The details of one or more aspects of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the techniques described in this disclosure will be apparent from the description and drawings, and from the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] A further understanding of the nature and advantages of particular embodiments may be realized by reference to the remaining portions of the specification and the drawings, in which like reference numerals are used to refer to similar components. In some instances, a sub-label is associated with a reference numeral to denote one of multiple similar components. When reference is made to a reference numeral without specification to an existing sub-label, it is intended to refer to all such multiple similar components.
[0014] Fig. 1 is a schematic diagram illustrating a system for implementing heterogeneous computing platform ("HCP") for vertex shader processing, in accordance with various embodiments.
[0015] Figs. 2A-2D are schematic block flow diagrams illustrating various non-limiting examples of processes that may be used for implementing HCP for vertex shader processing, in accordance with various embodiments.
[0016] Fig. 3 is a schematic block flow diagram illustrating a non-limiting example of auto-translation of one shader language code to another shader language code during implementation of HCP for vertex shader processing, in accordance with various embodiments.
[0017] Figs. 4A-4C are flow diagrams illustrating a method for implementing HCP for vertex shader processing, in accordance with various embodiments.
[0018] Fig. 5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments.
[0019] Fig. 6 is a block diagram illustrating a networked system of computers, computing systems, or system hardware architecture, which can be used in accordance with various embodiments.
DETAILED DESCRIPTION
[0020] Overview
[0021] Various embodiments provide tools and techniques for implementing two-dimensional ("2D") and/or three-dimensional ("3D") rendering, and, more particularly, to methods, systems, and apparatuses for implementing heterogeneous computing platform ("HCP") for vertex shader processing.
[0022] In various embodiments, a computing system may receive a first shader language code associated with one or more 3D image elements among a plurality of 3D image elements extracted from one or more graphics library commands from a 3D software application ("app"). The computing system may automatically translate the first shader language code into at least one second shader language code each corresponding to one of at least one non-graphics processing unit ("non-GPU") processor. The computing system may send each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app, instead of vertex shader processing by a graphics processing unit ("GPU"). The computing system may send the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources.
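The overall flow just described (receive shader code, translate it per non-GPU processor, run the vertex stage off-GPU, then issue a draw call with the calculated resources) can be pictured with a simplified, hypothetical sketch. All class and function names here (e.g., `NonGpuProcessor`, `translate_for`) are illustrative placeholders, not part of the disclosure or of any real API:

```python
# Hypothetical sketch of the HCP vertex-shader offload flow; names are illustrative.

class NonGpuProcessor:
    """Stand-in for a CPU/DSP core that can execute a translated vertex shader."""
    def __init__(self, name, dialect):
        self.name = name
        self.dialect = dialect  # domain-specific language this core accepts

    def run_vertex_shader(self, code, vertices):
        # Placeholder "vertex shader": promote each position to homogeneous coords.
        return [(x, y, z, 1.0) for (x, y, z) in vertices]

def translate_for(glsl_source, dialect):
    # A real system would compile GLSL -> SPIR-V -> target DSL; we only annotate.
    return f"// translated to {dialect}\n{glsl_source}"

def offload_vertex_stage(glsl_source, vertices, processors):
    """Translate once per processor, run the vertex stage off-GPU, and return
    the draw-call payloads (the calculated graphics resources)."""
    draw_calls = []
    for proc in processors:
        code = translate_for(glsl_source, proc.dialect)
        resources = proc.run_vertex_shader(code, vertices)
        draw_calls.append({"processor": proc.name, "vbo": resources})
    return draw_calls

calls = offload_vertex_stage(
    "void main() { gl_Position = vec4(position, 1.0); }",
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [NonGpuProcessor("dsp0", "hexagon-dsl"), NonGpuProcessor("cpu0", "neon-dsl")],
)
```

In this toy form, each entry of `calls` corresponds to one draw call carrying resources computed entirely off the GPU.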
[0023] In some embodiments, the computing system may comprise at least one of a HCP, a graphics engine, a graphics rendering engine, a game engine, a 3D game engine, a processor on the user device, or at least one central processing unit ("CPU") core on the user device, and/or the like. In some instances, the at least one non-GPU processor may each comprise one of a CPU, a multiprocessor, a digital signal processor ("DSP"), a media processor, or another non-GPU processor, and/or the like. In some cases, the 3D app may comprise one of a 3D game app, a 3D user interface ("UI")-based app, or a 3D guidance app configured for use by a user, and/or the like. In such cases, the user may comprise one of a medical professional, a scientist, an engineer, an architect, a construction worker, a factory worker, a warehouse worker, a shipping or delivery worker, an app developer, or a graphics designer, and/or the like. In some instances, the graphics resources may comprise at least one of vertex buffer object ("VBO") data, element buffer object ("EBO") data, or uniform buffer object data, and/or the like.
[0024] According to some embodiments, the first shader language code may be optimized for a per-pixel processing architecture, and the first shader language code may comprise a shader language code based on graphics library shader language ("GLSL"). Each of the at least one second shader language code may be optimized for vectorized processing of contiguous pixels, and each of the at least one second shader language code may comprise a shader language code based on a domain specific language of the corresponding one of the at least one non-GPU processor. In some instances, automatically translating the first shader language code into the at least one second shader language code may comprise automatically translating the first shader language code into at least one intermediate shader language code, where the at least one intermediate shader language code each comprises a shader language code based on one of standard portable intermediate representation ("SPIR-V") processing, an intermediate generic language, or a vectorized algorithm. Based on a determination that the at least one intermediate shader language code comprises multiple successive intermediate shader language codes, each intermediate shader language code may be automatically translated into the next successive intermediate shader language code until the last successive intermediate shader language code has been produced, and the last successive intermediate shader language code may then be automatically translated into the second shader language code. Based on a determination that the at least one intermediate shader language code comprises a single intermediate shader language code, the single intermediate shader language code may be automatically translated into the second shader language code.
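The multi-stage translation described in this paragraph amounts to applying a chain of translators, one per intermediate language, until the target code is produced. The following minimal sketch models that chain; the stage names (`glsl_to_spirv`, etc.) and the tuple-based intermediate representation are hypothetical simplifications:

```python
# Hypothetical translation chain: GLSL -> SPIR-V -> vectorized IR -> target DSL.
# Each stage is modeled as a pure function over a (kind, source) pair.

def glsl_to_spirv(src):
    return ("spirv", src)

def spirv_to_vectorized(ir):
    kind, src = ir
    assert kind == "spirv"
    return ("vectorized", src)

def vectorized_to_dsp_dsl(ir):
    kind, src = ir
    assert kind == "vectorized"
    return ("dsp-dsl", src)

def translate(src, stages):
    """Apply each successive stage until the final (target) code is produced.
    With a single stage the loop degenerates to one translation, matching the
    single-intermediate case described in the text."""
    out = src
    for stage in stages:
        out = stage(out)
    return out

kind, _ = translate(
    "void main() {}",
    [glsl_to_spirv, spirv_to_vectorized, vectorized_to_dsp_dsl],
)
```

A per-processor pipeline would simply select a different list of stages ending in that processor's domain-specific language.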
[0025] In some embodiments, receiving the first shader language code may comprise intercepting, using the computing system, the first shader language code as the first shader language code is being sent from the 3D app to one of a graphics library application programming interface ("API") or the GPU.
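Interception, as described in [0025], can be pictured as a shim placed in front of the graphics-library entry point so shader source is captured before it reaches the driver. The sketch below is purely illustrative: `FakeGL` and its `shader_source` method stand in for a real GL binding (e.g., a `glShaderSource`-style call) and are not actual API names:

```python
# Illustrative shim: wrap a GL-like shader_source entry point so the HCP
# sees every shader before the driver does. FakeGL is a stand-in object.

captured = []  # shader sources handed to the HCP for translation

class FakeGL:
    def shader_source(self, shader_id, source):
        # Stand-in for the real driver-side compile path.
        return ("compiled", shader_id)

class InterceptingGL:
    def __init__(self, inner):
        self._inner = inner

    def shader_source(self, shader_id, source):
        captured.append(source)  # hand the GLSL to the HCP first...
        # ...then pass the call through unchanged, so the app is unaware.
        return self._inner.shader_source(shader_id, source)

gl = InterceptingGL(FakeGL())
result = gl.shader_source(7, "void main() { gl_Position = vec4(0.0); }")
```

On a real platform this shim would typically be a preloaded dynamic library exporting the same symbols as the system GL library.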
[0026] According to some embodiments, the computing system may receive the one or more graphics library commands from the 3D app, the one or more graphics library commands comprising commands for rendering image elements, the image elements comprising the plurality of 3D image elements. The computing system may analyze the received one or more graphics library commands to extract the first shader language code.
[0027] In some embodiments, the computing system may analyze the first shader language code to extract size information for each of an input buffer and an output buffer for each vertex shader; may allocate the calculated graphics resources to a shared memory based at least in part on the extracted size information for the input buffer for each vertex shader; and may allocate a copy of the calculated graphics resources to a GPU memory via a passthrough shader based at least in part on the extracted size information for the output buffer for each vertex shader, the passthrough shader performing no calculations. In some cases, the computing system may cache all the calculated graphics resources associated with 3D objects in a shared memory prior to allocation of the graphics resources on a GPU. In some instances, sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources may comprise sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources after all the graphics resources have been allocated.
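The size extraction in [0027] can be approximated by scanning the shader's `in`/`out` declarations and summing per-vertex byte sizes. The sketch below covers only a few GLSL types and ignores layout qualifiers, arrays, and interface blocks; it is a simplification, not a full GLSL parser:

```python
import re

# Byte sizes for a few common GLSL types (simplified; real GLSL has many more).
TYPE_SIZES = {"float": 4, "vec2": 8, "vec3": 12, "vec4": 16, "mat4": 64}

def buffer_sizes(glsl_source):
    """Return (input_bytes, output_bytes) per vertex, from in/out declarations."""
    sizes = {"in": 0, "out": 0}
    for qualifier, gltype in re.findall(r"\b(in|out)\s+(\w+)\s+\w+\s*;", glsl_source):
        sizes[qualifier] += TYPE_SIZES.get(gltype, 0)
    return sizes["in"], sizes["out"]

shader = """
in vec3 position;
in vec2 uv;
out vec4 color;
void main() { }
"""
in_bytes, out_bytes = buffer_sizes(shader)
```

The input total would size the shared-memory staging buffer, and the output total would size the GPU-side buffer fed through the pass-through shader.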
[0028] According to some embodiments, the computing system may store at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources in shared memory that is accessible by each of the at least one non-GPU processor and the GPU, and/or the like, without the at least one non-GPU processor or the GPU copying any of the at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources, and/or the like.
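The zero-copy sharing in [0028] corresponds, on a general-purpose OS, to placing the resources in a memory region that both producers and consumers map by name. A rough analogue using Python's standard shared-memory module is shown below; a mobile HCP would instead use a platform mechanism such as ION/dmabuf, which this sketch only approximates:

```python
from multiprocessing import shared_memory
import struct

# Writer side: place calculated vertex data into a named shared region
# instead of copying it between processors.
verts = [0.0, 1.0, 2.0, 3.0]
payload = struct.pack(f"{len(verts)}f", *verts)
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# Reader side (e.g., the pass-through stage): attach to the same region by
# name and read the data in place -- the buffer contents are never duplicated.
reader = shared_memory.SharedMemory(name=shm.name)
readback = list(struct.unpack(f"{len(verts)}f", bytes(reader.buf[:len(payload)])))

reader.close()
shm.close()
shm.unlink()
```

The synchronization concern noted later in the text (multiple processors writing the same area) would be handled with locks or fences around such a region; that is omitted here for brevity.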
[0029] In the various aspects described herein, HCP may be implemented for vertex shader processing using non-GPU processors instead of using a GPU. This allows some tasks (e.g., vertex shader processing tasks of 3D elements of a game scene or other 3D app) to be moved or delegated from the GPU to other processors (e.g., CPU, DSP, etc.), which reduces GPU usage, allowing the GPU to have more bandwidth to process the 3D rendering scene and potentially reducing overall power consumption. Furthermore, the tasks assigned to a CPU, a DSP, or another processor may run in parallel with the GPU, improving the performance of the game or other 3D app and allowing it to run on older hardware and/or mobile platforms.
[0030] Drawbacks to conventional systems as described above include the following. CPU, DSP, and other processors are not as powerful or as optimized as the GPU for graphics processing, but they can run in parallel, providing greater bandwidth to process data for a scene and compensating for the loss in performance. In addition, memory sharing is one of the biggest technical problems related to the usage of multiple processors to render a game scene: memory copying or duplication should be avoided to keep performance high, yet memory synchronization mechanisms must be in place to prevent multiple processors from modifying the same area at the same time. Further, replacing only the vertex shader stage of the GPU pipeline is not possible with conventional OpenGL APIs, so some work will still remain on the GPU; in general, that work is limited to copying the data in a transparent way.
[0031] These and other aspects of the system and method for implementing HCP for vertex shader processing are described in greater detail with respect to the figures.
[0032] The following detailed description illustrates a few embodiments in further detail to enable one of skill in the art to practice such embodiments. The described examples are provided for illustrative purposes and are not intended to limit the scope of the invention.
[0033] In the following description, for the purposes of explanation, numerous details are set forth in order to provide a thorough understanding of the described embodiments. It will be apparent to one skilled in the art, however, that other embodiments of the present invention may be practiced without some of these details. In other instances, some structures and devices are shown in block diagram form. Several embodiments are described herein, and while various features are ascribed to different embodiments, it should be appreciated that the features described with respect to one embodiment may be incorporated with other embodiments as well. By the same token, however, no single feature or features of any described embodiment should be considered essential to every embodiment of the invention, as other embodiments of the invention may omit such features.
[0034] Unless otherwise indicated, all numbers used herein to express quantities, dimensions, and so forth should be understood as being modified in all instances by the term "about." In this application, the use of the singular includes the plural unless specifically stated otherwise, and use of the terms "and" and "or" means "and/or" unless otherwise indicated. Moreover, the use of the term "including," as well as other forms, such as "includes" and "included," should be considered non-exclusive. Also, terms such as "element" or "component" encompass both elements and components comprising one unit and elements and components that comprise more than one unit, unless specifically stated otherwise.
[0035] Various embodiments as described herein - while embodying (in some cases) software products, computer-performed methods, and/or computer systems - represent tangible, concrete improvements to existing technological areas, including, without limitation, 3D image element or object rendering technology, 2D image element or object rendering technology, UI element or object rendering technology, game image element or object rendering technology, game UI element or object rendering technology, heterogeneous computing platform ("HCP") technology, vertex shader processing technology, mobile platform technology, and/or the like. In other aspects, some embodiments can improve the functioning of user equipment or systems themselves (e.g., 3D image element or object rendering systems, 2D image element or object rendering systems, UI element or object rendering systems, game image element or object rendering systems, game UI element or object rendering systems, HCP systems, vertex shader processing systems, mobile platform systems, etc.), for example, by receiving, using a computing system, a first shader language code associated with one or more three-dimensional ("3D") image elements among a plurality of 3D image elements extracted from one or more graphics library commands from a 3D software application ("app"); automatically translating, using the computing system, the first shader language code into at least one second shader language code each corresponding to one of at least one non-graphics processing unit ("non-GPU") processor; sending, using the computing system, each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app, instead of vertex shader processing by a graphics processing unit ("GPU"); and sending, using the computing system, the calculated graphics resources in at least one draw call to a 
renderer for rendering the graphics resources; and/or the like.
[0036] In particular, to the extent any abstract concepts are present in the various embodiments, those concepts can be implemented as described herein by devices, software, systems, and methods that involve novel functionality (e.g., steps or operations), such as, implementing HCP to delegate vertex shader processing tasks of 3D elements from a GPU to non-GPU processors (e.g., CPU, DSP, etc.) to perform vertex shader processing of 3D elements, in some cases, by converting the original shader code (e.g., GLSL code, or the like) to an intermediate code and executing the intermediate code using other processors, and/or the like, to name a few examples, that extend beyond mere conventional computer processing operations. These functionalities can produce tangible results outside of the implementing computer system, including, merely by way of example, providing a heterogeneous computing platform that delegates vertex shader processing tasks of 3D elements from a GPU to non-GPU processors (e.g., CPU, DSP, etc.) to perform vertex shader processing of 3D elements, which reduces GPU usage, allowing the GPU to have more bandwidth to process the 3D rendering scene and potentially reducing overall power consumption, and allowing 2D rendering tasks by the non-GPU processors to run in parallel with the GPU, thereby improving the performance of the 3D app (e.g., a game or other 3D app) and allowing it to run on older hardware and/or mobile platforms, at least some of which may be observed or measured by users, game/content developers, and/or user device manufacturers.
[0037] Some Embodiments
[0038] We now turn to the embodiments as illustrated by the drawings. Figs. 1-6 illustrate some of the features of the method, system, and apparatus for implementing two-dimensional ("2D") and/or three-dimensional ("3D") rendering, and, more particularly, to methods, systems, and apparatuses for implementing heterogeneous computing platform ("HCP") for vertex shader processing, as referred to above. The methods, systems, and apparatuses illustrated by Figs. 1-6 refer to examples of different embodiments that include various components and steps, which can be considered alternatives or which can be used in conjunction with one another in the various embodiments. The description of the illustrated methods, systems, and apparatuses shown in Figs. 1-6 is provided for purposes of illustration and should not be considered to limit the scope of the different embodiments.
[0039] With reference to the figures, Fig. 1 is a schematic diagram illustrating a system 100 for implementing heterogeneous computing platform ("HCP") for vertex shader processing, in accordance with various embodiments.
[0040] In the non-limiting embodiment of Fig. 1, system 100 may comprise user device 105, which may include, but is not limited to, one of a portable gaming device, a smart phone, a tablet computer, a laptop computer, a desktop computer, or a server computer, and/or the like. In some embodiments, user device 105 may include, without limitation, at least one of a graphic application 110 (e.g., a 3D software application ("app"), or the like), a computing system 115, and computing hardware 145. In some cases, the graphic application 110 (e.g., 3D app, or the like) may include, without limitation, one of a 3D game app, a 3D user interface ("UI")-based app, or a 3D guidance app configured for use by a user, and/or the like. In some cases, the user may include, but is not limited to, one of a medical professional, a scientist, an engineer, an architect, a construction worker, a factory worker, a warehouse worker, a shipping or delivery worker, an app developer, or a graphics designer, and/or the like. In some embodiments, the computing system may include, without limitation, at least one of a HCP (e.g., HCP 120, or the like), a graphics engine, a graphics rendering engine, a game engine, a 3D game engine, a processor on the user device, or at least one central processing unit ("CPU") core on the user device, and/or the like.
[0041] In some instances, HCP 120 may include, without limitation, an auto-translation system(s) 125, a vertex shader(s) 130, a passthrough shader 135, and shared memory 140, or the like. Computing hardware 145 may comprise GPU 150 and one or more non-GPU processors 155. In some instances, the one or more non-GPU processors 155 may each include, but are not limited to, one of a CPU, a multiprocessor, a digital signal processor ("DSP"), a media processor, or another non-GPU processor, and/or the like. User device 105 may further include, without limitation, at least one of data storage device 160a, communications system 160b, display screen 160c, and audio playback device 160d (optional). One or more of the CPU cores and/or one or more computing systems 115 may be used to perform orchestration, management, render engine coordination, and/or operating system ("OS") processing functionalities. Other CPU cores and/or other non-GPU processors 155 may be used for rendering or image processing of 2D image elements, while the GPU 150 may be used to render 3D image elements.
[0042] The data storage 160a may include, but is not limited to, at least one of read-only memory ("ROM"), programmable read-only memory ("PROM"), erasable programmable read-only memory ("EPROM"), electrically erasable programmable read-only memory ("EEPROM"), flash memory, other non-volatile memory devices, random-access memory ("RAM"), static random-access memory ("SRAM"), dynamic random-access memory ("DRAM"), synchronous dynamic random-access memory ("SDRAM"), virtual memory, a RAM disk, or other volatile memory devices, non-volatile RAM devices, and/or the like.
[0043] The communications system 160b may include wireless communications devices capable of communicating using protocols including, but not limited to, at least one of Bluetooth™ communications protocol, WiFi communications protocol, or other 802.11 suite of communications protocols, ZigBee communications protocol, Z-wave communications protocol, or other 802.15.4 suite of communications protocols, cellular communications protocol (e.g., 3G, 4G, 4G LTE, 5G, etc.), or other suitable communications protocols, and/or the like.
[0044] Some user devices (e.g., a portable gaming device, a smart phone, a tablet computer, a laptop computer, etc.) may each include at least one integrated display screen 160c (in some cases, including a non-touchscreen display screen(s), while, in other cases, including a touchscreen display screen(s), and, in still other cases, including a combination of at least one non-touchscreen display screen and at least one touchscreen display screen) and at least one integrated audio playback device 160d (e.g., built-in speakers or the like). Some user devices (e.g., a desktop computer, or a server computer, or the like) may each include at least one external display screen or monitor (e.g., display devices 195a-195n, or the like, which may each be a non-touchscreen display device or a touchscreen display device, or the like) and at least one integrated audio playback device 160d (e.g., built-in speakers, etc.) and/or at least one external audio playback device (not shown; e.g., external or peripheral speakers, wired earphones, wired earbuds, wired headphones, wireless earphones, wireless earbuds, wireless headphones, or the like). Some user devices (e.g., some desktop computers, or some server computers, or the like) may have neither an integrated display screen nor an external display screen.
[0045] System 100 may further comprise one or more content sources 170 and corresponding database(s) 175 that communicatively couple with user device 105 via network(s) 165 (and via communications system 160b) to provide image data and/or graphics library commands 190a for the computing system(s) 115 to render or process 3D image data or the like, as described in detail below. The resultant rendered images 190b may be sent to content distribution system 180 and corresponding database(s) 185 (via network(s) 165 and via communications system 160b) for storage and/or distribution to other devices (e.g., display devices 195a-195n). In some cases, user device 105 may directly send the rendered images 190b to one or more display devices 195a-195n (collectively, "display devices 195" or the like), which may each include, but are not limited to, at least one of a smart television (directly), a television (indirectly) via a set-top box or other intermediary media player, a monitor or digital display panel (directly), a monitor or digital display panel (indirectly) via an externally connected user device (e.g., desktop computer, server computer, etc.), etc. The lightning bolt symbols are used to denote wireless communications between communications system 160b and network(s) 165 (in some cases, via network access points or the like (not shown)), between communications system 160b and at least one of the one or more display devices 195a-195n, and between network(s) 165 and at least one of the one or more display devices 195a-195n (in some cases, via network access points or the like (not shown)).
[0046] In operation, a computing system 115, HCP 120, and/or computing hardware 145 (collectively, "computing system") may receive a first shader language code associated with one or more 3D image elements among a plurality of 3D image elements extracted from one or more graphics library commands (e.g., graphics library ("GL") commands 190a, or the like) from a 3D app (e.g., graphic app or 3D app 110, or the like). The computing system may automatically translate the first shader language code into at least one second shader language code each corresponding to one of at least one non-graphics processing unit ("non-GPU") processor (e.g., non-GPU processor(s) 155, or the like). The computing system may send each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app, instead of vertex shader processing by a GPU (e.g., GPU 150, or the like). The computing system may send the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources, which will be used for generating rendered images (e.g., rendered image(s) 190b, or the like).
[0047] In some embodiments, the computing system may comprise at least one of a HCP (e.g., HCP 120, or the like), a graphics engine, a graphics rendering engine, a game engine, a 3D game engine, a processor on the user device (e.g., user device 105, or the like), or at least one central processing unit ("CPU") core on the user device, and/or the like. In some instances, the at least one non-GPU processor may each comprise one of a CPU, a multiprocessor, a digital signal processor ("DSP"), a media processor, or another non-GPU processor, and/or the like. In some cases, the 3D app may comprise one of a 3D game app, a 3D user interface ("UI")-based app, or a 3D guidance app configured for use by a user, and/or the like. In such cases, the user may comprise one of a medical professional, a scientist, an engineer, an architect, a construction worker, a factory worker, a warehouse worker, a shipping or delivery worker, an app developer, or a graphics designer, and/or the like. In some instances, the graphics resources may comprise at least one of vertex buffer object ("VBO") data, element buffer object ("EBO") data, or uniform buffer object data, and/or the like.
[0048] According to some embodiments, the first shader language code may be optimized for per-pixel processing architecture, and the first language code may comprise a shader language code based on graphics library shader language ("GLSL"). Each of the at least one second shader language code may be optimized for vectorized processing of contiguous pixels, and each of the at least one second shader language code may include a shader language code based on domain specific language of the corresponding one of the at least one non-GPU processor. In some instances, automatically translating (e.g., using auto-translation system(s) 125, or the like) the first shader language code into the at least one second shader language code may comprise automatically translating the first shader language code into at least one intermediate shader language code. In such cases, the at least one intermediate shader language code may each include a shader language code based on one of standard portable intermediate representation ("SPIR-V") processing, intermediate generic language, or vectorized algorithm. Based on a determination that the at least one intermediate shader language code comprises multiple successive intermediate shader language codes, the computing system may automatically translate each intermediate shader language code into each successive intermediate shader language code until the last successive intermediate shader language code has been produced, and may automatically translate the last successive intermediate shader language code into the second shader language code. Based on a determination that the at least one intermediate shader language code comprises a single intermediate shader language code, the computing system may automatically translate the single intermediate shader language code into the second shader language code.
[0049] In some embodiments, receiving the first shader language code may comprise intercepting, using the computing system, the first shader language code as the first shader language code is being sent from the 3D app to one of a graphics library application programming interface ("API") or the GPU.
[0050] According to some embodiments, the computing system may receive the one or more graphics library commands from the 3D app, the one or more graphics library commands comprising commands for rendering image elements, the image elements comprising the plurality of 3D image elements. The computing system may analyze the received one or more graphics library commands to extract the first shader language code.
[0051] In some embodiments, the computing system may analyze the first shader language code to extract size information for each of an input buffer and an output buffer for each vertex shader; may allocate the calculated graphics resources to a shared memory based at least in part on the extracted size information for the input buffer for each vertex shader; and may allocate a copy of the calculated graphics resources to a GPU memory via a passthrough shader (e.g., passthrough shader 135, or the like) based at least in part on the extracted size information for the output buffer for each vertex shader, the passthrough shader performing no calculations. In some cases, the computing system may cache all the calculated graphics resources associated with 3D objects in a shared memory (e.g., shared memory 140, or the like) prior to allocation of the graphics resources on a GPU. In some instances, sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources may comprise sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources after all the graphics resources have been allocated.
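The ordering described in [0051] (accumulate every computed resource in shared memory, copy it to GPU memory through the pass-through shader without any per-vertex calculation, and only then issue draw calls) can be sketched as follows. The `ResourcePipeline` class and its methods are hypothetical names introduced for illustration only:

```python
# Illustrative cache-then-allocate ordering: resources are staged in a shared
# cache and draw calls are issued only after every allocation completes.

class ResourcePipeline:
    def __init__(self):
        self.shared_cache = {}   # stands in for the shared memory region
        self.gpu_memory = {}     # stands in for GPU-side buffers
        self.draw_calls = []

    def cache(self, object_id, vertices):
        self.shared_cache[object_id] = vertices

    def allocate_all(self):
        # Pass-through "shader": copy each cached buffer to GPU memory
        # without performing any per-vertex calculation.
        for object_id, vertices in self.shared_cache.items():
            self.gpu_memory[object_id] = list(vertices)

    def draw(self):
        if set(self.gpu_memory) != set(self.shared_cache):
            raise RuntimeError("draw issued before all resources were allocated")
        self.draw_calls = sorted(self.gpu_memory)
        return self.draw_calls

pipe = ResourcePipeline()
pipe.cache("tree", [(0, 0, 0)])
pipe.cache("rock", [(1, 1, 1)])
pipe.allocate_all()
order = pipe.draw()
```

The guard in `draw` mirrors the text's requirement that draw calls are sent only after all graphics resources have been allocated.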
[0052] According to some embodiments, the computing system may store at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources in shared memory that is accessible by each of the at least one non-GPU processor and the GPU, and/or the like, without the at least one non-GPU processor or the GPU copying any of the at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources, and/or the like.
[0053] In some embodiments, system 100 (and corresponding methods described herein) may be used to improve the performance and/or power consumption of games running on mobile devices (e.g., user device 105, including, but not limited to, smart phones, mobile phones, tablet computers, laptop computers, portable gaming devices, and/or the like) by leveraging the different processors to compute the vertex shader stage of a GPU pipeline while using a heterogeneous computing platform (e.g., HCP 120, or the like). The HCP provides one integrated system to best maximize its efficiency as well as to extract the most value from the diverse types of data that the operation uses and generates. The HCP uses two or more types of computing cores, for example, a GPU and a CPU or a DSP. Using multiple cores provides the system with capabilities that single-core processors are not able to perform. For example, a GPU may be used to perform computation in applications normally handled by the CPU. While GPUs operate at lower frequencies, they typically have a larger number of cores, so GPUs can process more images and graphical data per second than a CPU.
Transforming data into a graphical form and then using the GPU to scan and analyze the data creates greater acceleration. In addition, with HCP, diverse types of computing cores work together in a wide range of applications.
[0054] In the various embodiments, the system moves some tasks from the GPU to other processors, allowing the GPU to have more bandwidth to process 3D rendering scenes and potentially reducing the overall power consumption. Furthermore, the tasks assigned to a DSP or CPU could be run in parallel with the GPU, thus improving the performance of the game (or other graphic application or 3D application) and allowing it to run on older hardware or mobile platforms.
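The parallelism described in [0054], vertex work on CPU/DSP cores overlapping GPU rendering, can be illustrated coarsely with a thread pool. This is only an analogue: an on-device HCP schedules heterogeneous hardware cores, not Python threads, and the function names below are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

# Coarse illustration: the vertex stage for several objects runs on worker
# threads (standing in for CPU/DSP cores) while the "GPU" task proceeds.

def vertex_stage(obj):
    name, verts = obj
    # Placeholder per-vertex calculation: uniform scale by 2.
    return name, [(x * 2.0, y * 2.0, z * 2.0) for (x, y, z) in verts]

def gpu_render_pass():
    # Stand-in for GPU work happening concurrently with the vertex stage.
    return "frame"

objects = [("a", [(1.0, 0.0, 0.0)]), ("b", [(0.0, 1.0, 0.0)])]
with ThreadPoolExecutor(max_workers=3) as pool:
    vertex_futures = [pool.submit(vertex_stage, o) for o in objects]
    gpu_future = pool.submit(gpu_render_pass)
    results = dict(f.result() for f in vertex_futures)
    frame = gpu_future.result()
```

The key point mirrored here is that the off-GPU vertex tasks and the GPU pass are submitted together and complete independently, rather than serializing the whole pipeline on one processor.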
[0055] The following are several key components of the various embodiments and techniques described herein: (A) Intercepting graphics library API (e.g., OpenGL) resources and storing them in a shared memory between (or accessible by) the different processors; (B) Deferring the work to other processors, such as a CPU or a DSP, instead of the GPU; and (C) Auto-translating graphics library shader language ("GLSL") code to intermediate code usable by the HCP; and/or the like.
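The three components above can be sketched as a toy pipeline. This is a minimal illustration only: every class, function, and string name below (e.g., `HCPPipeline`, `SPIRV(...)`) is hypothetical and not part of the disclosed system.

```python
from dataclasses import dataclass, field

@dataclass
class SharedMemory:
    """Simulated processor-shared store: every processor sees one copy."""
    resources: dict = field(default_factory=dict)

class HCPPipeline:
    """Toy model of the three key components: (A) intercept and cache
    graphics-library resources, (B) defer vertex work to a non-GPU
    processor, (C) auto-translate GLSL to an intermediate form."""
    def __init__(self):
        self.shared = SharedMemory()

    def intercept(self, command, payload):          # component (A)
        self.shared.resources[command] = payload
        return payload

    def translate(self, glsl_source):               # component (C)
        return f"SPIRV({glsl_source})"              # placeholder IR

    def defer(self, ir, processor="DSP"):           # component (B)
        return f"{processor}:executed:{ir}"

pipeline = HCPPipeline()
pipeline.intercept("glBufferData", [1.0, 2.0, 3.0])
ir = pipeline.translate("void main() {}")
result = pipeline.defer(ir)
```

In this sketch, interception populates the shared store before any translation or deferral happens, mirroring the ordering described for the HCP pipeline.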
[0056] Alternatively, or additionally, cloud computing may be utilized to push the calculation over a network (rather than to the GPU), at the cost of higher latency. Alternatively, or additionally, a multi-GPU system may be utilized, which may require hardware changes similar to ARM big.LITTLE: two GPUs, the first with high performance and the second with low performance reserved for small tasks (but with greater efficiency).

[0057] To observe the process in mobile devices, because the process, in some embodiments, depends on intercepting GPU communication, one can examine dynamic linking libraries that appear to intercept OpenGL API calls (or similar graphics library draw calls, or the like). In some instances, this may be achieved via a profiler that examines API calls that are directed to low-level system libraries, or the like [referred to herein as "methodology 1" or the like]. Alternatively, or additionally, this may be indirectly verified by removing suspected system libraries from the operating system, re-running OpenGL apps (or similar graphics library API apps, or the like), and observing any crashes or irregular behaviors [referred to herein as "methodology 2" or the like]. Alternatively, or additionally, this may be verified via decompiling of suspected system libraries (in some cases, subject to terms of service associated with the observed services or systems) [referred to herein as "methodology 3" or the like]. Subsequently, one may check to see if there is a subsystem collecting geometry, texture, and/or scene information from a graphics application. This detects the "data-driven" aspect of the techniques and embodiments described herein. The above three methodologies (i.e., methodologies 1-3) may be employed.
[0058] As the techniques and embodiments described herein also depend on use of other processors (such as a CPU or a DSP, or the like), two additional techniques may also be employed: (i) Monitoring the activities of these other processors in parallel with running of the game on the mobile device; and/or (ii) Disabling these other processors (or hardware features) from the system so no applications can use them, and observing a change in performance or stability in the game (or other graphic application or 3D app).
[0059] In some aspects, the various embodiments provide a heterogeneous computing platform that delegates vertex shader processing tasks of 3D elements from a GPU to non-GPU processors (e.g., CPU, DSP, etc.), in some cases, by converting the original shader code (e.g., GLSL code, or the like) to an intermediate code and executing the intermediate code using other processors. This reduces GPU usage, allowing the GPU to have more bandwidth to process the 3D rendering scene and potentially reducing the overall power consumption, and allows 2D rendering tasks on the non-GPU processors to be run in parallel with the GPU, thereby improving the performance of the 3D app (e.g., game or other graphic application or 3D app) and allowing it to run on older hardware and/or mobile platforms.
[0060] These and other functions of the system 100 (and its components) are described in greater detail below with respect to Figs. 2-4.

[0061] Figs. 2A-2D (collectively, "Fig. 2") are schematic block flow diagrams illustrating various non-limiting examples 200 of processes that may be used for implementing HCP for vertex shader processing, in accordance with various embodiments.
[0062] With reference to the non-limiting example 200 of Fig. 2A, user device 105 may include, without limitation, at least one of a graphic application 110 (e.g., a 3D application, or the like), HCP pipeline 120, and computing hardware 145. In some embodiments, user device 105 may include, but is not limited to, one of a portable gaming device, a smart phone, a tablet computer, a laptop computer, a desktop computer, or a server computer, and/or the like.
According to some embodiments, the graphic application (or 3D app, or the like) may include data including, but not limited to, at least one of geometry data 205, texture data 210, buffer data 215, and/or the like. In some cases, the graphic application 110 (e.g., a 3D app, or the like) may include, without limitation, one of a 3D game app, a 3D user interface ("UI")-based app, or a 3D guidance app configured for use by a user, and/or the like. In some cases, the user may include, but is not limited to, one of a medical professional, a scientist, an engineer, an architect, a construction worker, a factory worker, a warehouse worker, a shipping or delivery worker, an app developer, or a graphics designer, and/or the like.
[0063] In some embodiments, HCP pipeline 120 may include, without limitation, graphics library commands 190a, filtering or detection system 225, auto-translation system(s) 125, vertex shader(s) 130, passthrough shader 135, shared memory 140, shader language codes 230a-230c, and one or more languages 235a-235n (collectively, "languages 235" or the like). In some instances, the one or more languages 235 may include, without limitation, at least one of Halide 235a, HVX 235b, NEON 235c, SVE2 235d, or other language 235n, or the like.
[0064] According to some embodiments, computing hardware 145 may comprise GPU 150 and one or more non-GPU processors 155. In some instances, the one or more non-GPU processors 155 may each include, but is not limited to, one of a CPU 155a, a digital signal processor ("DSP") 155b, a media processor 155c, or another non-GPU processor 155n (e.g., a multiprocessor, or the like), and/or the like.
[0065] In some aspects, the graphic app (or 3D app) 110 may send 3D image elements or 3D image element data 220a to HCP pipeline 120, in some cases, in the form of graphics library commands 190a, or the like. In some instances, the 3D image element data 220a and/or the graphics library commands 190a may include, without limitation, the at least one of geometry data 205, texture data 210, buffer data 215, and/or the like. Filtering or detection system 225 may analyze the graphics library commands to extract the first shader language code 230a. Auto-translation system(s) 125 may automatically translate the first shader language code into at least one intermediate shader language code 230b. In some instances, the at least one intermediate shader language code may each include a shader language code based on one of standard portable intermediate representation ("SPIR-V") processing, intermediate generic language, or vectorized algorithm, and/or the like. In the case (or based on a determination) that the at least one intermediate shader language code comprises multiple successive intermediate shader language codes, auto-translation system(s) 125 may automatically translate each intermediate shader language code 230b into each successive intermediate shader language code 230b until the last successive intermediate shader language code 230b has been produced, and may automatically translate the last successive intermediate shader language code 230b into the second shader language code 230c. Alternatively, in the case (or based on a determination) that the at least one intermediate shader language code comprises a single intermediate shader language code, auto-translation system(s) 125 may automatically translate the single intermediate shader language code into the second shader language code.
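The translation chain described in paragraph [0065] — first shader language code through zero or more successive intermediate codes into the second shader language code — can be sketched as a simple fold. The stage functions below are hypothetical stand-ins for real translators, not actual compiler APIs:

```python
def auto_translate(first_code, stages):
    """Translate through zero or more successive intermediate stages,
    then into the final (second) shader language code. Each stage is a
    function taking the previous code and returning the next."""
    code = first_code
    for stage in stages:            # successive intermediate translations
        code = stage(code)
    return code

# Hypothetical stage functions standing in for real translators; each
# simply wraps its input so the translation order is visible.
glsl_to_spirv   = lambda c: ("spirv",  c)
spirv_to_halide = lambda c: ("halide", c)
halide_to_hvx   = lambda c: ("hvx",    c)

# Multiple successive intermediates (GLSL -> SPIR-V -> Halide -> HVX):
second = auto_translate("vertex_glsl",
                        [glsl_to_spirv, spirv_to_halide, halide_to_hvx])

# Single intermediate: the same loop handles the one-stage case.
single = auto_translate("vertex_glsl", [glsl_to_spirv])
```

The same loop covers both branches of paragraph [0065]: a list of several stages produces the successive-translation case, and a one-element list produces the single-intermediate case.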
[0066] According to some embodiments, the first shader language code 230a may be optimized for per-pixel processing architecture, and the first shader language code 230a may include a shader language code based on graphics library shader language ("GLSL"). Each of the at least one second shader language code 230c may be optimized for vectorized processing of contiguous pixels, and each of the at least one second shader language code 230c may include a shader language code based on domain specific language of the corresponding one of the at least one non-GPU processor (e.g., CPU 155a, DSP 155b, media processor 155c, or other non-GPU processor 155n, or the like).
[0067] In some embodiments, the HCP Pipeline 120 may analyze the first shader language code to extract size information for each of an input buffer and an output buffer for each vertex shader; may allocate the calculated graphics resources to a shared memory based at least in part on the extracted size information for the input buffer for each vertex shader; and may allocate a copy of the calculated graphics resources to a GPU memory via a passthrough shader (e.g., passthrough shader 135, or the like) based at least in part on the extracted size information for the output buffer for each vertex shader, the passthrough shader performing no calculations. In some cases, the HCP Pipeline 120 may cache all the calculated graphics resources associated with 3D objects in a shared memory (e.g., shared memory 140, or the like) prior to allocation of the graphics resources on a GPU. In some instances, sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources may comprise sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources after all the graphics resources have been allocated.
[0068] According to some embodiments, the HCP Pipeline 120 may store at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources in shared memory that is accessible by each of the at least one non-GPU processor and the GPU, and/or the like, without the at least one non-GPU processor or the GPU copying any of the at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources, and/or the like. HCP pipeline 120 may send the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources, which will be used for generating rendered images (e.g., rendered image(s) 190b, or the like). In some embodiments, HCP pipeline 120 may display, on a display screen of user device 105 (e.g., display screen 160c as shown in Fig. 1, or the like) and/or one or more display devices 195a-195n, the rendered graphics resources, in some cases, as part of one or more merged and rendered 2D/3D images 190b.
[0069] Turning to the non-limiting example 200 of Fig. 2B, a process block flow is shown in which all graphics library commands 190a (e.g., OpenGL commands, or the like) from graphic app (or 3D app) 110 may be intercepted and processed by the HCP module 120. All graphics library API calls may be asynchronous, and allocation of resources may be separated from the graphics library API calls that trigger rendering of 3D objects or image elements. The first steps of the HCP pipeline may be to cache (e.g., using memory caching system or memory cache 240, or the like) all the resources associated with the 3D objects into a shared memory (e.g., shared memory 140, ION Memory, etc.). Requesting information about a resource already allocated on the GPU is not efficient, as it forces the GPU to flush all the previous GPU calls that could have run asynchronously. To avoid such a situation, the caching may be performed prior to allocation on the GPU, when the call to the graphics library API is made.
[0070] Once all the graphics resources have been allocated, the 3D application (e.g., games, etc.) may trigger the rendering through a draw call (such as glDrawElements or glDrawArrays, etc.). For a proper execution of the vertex shader 130, graphics resources that may be used may include, but are not limited to: (a) vertex buffer object ("VBO") data; (b) element buffer object ("EBO") data; or (c) uniform buffer object data; and/or the like. VBO data may contain information regarding the vertices including, without limitation, position, texture coordinate, color, and/or the like. EBO data may contain the indices of the VBO data to execute. Uniform buffer object data (in some cases, referred to as "uniforms" or the like) may contain constant values defined prior to the call to the vertex shader.
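As a rough illustration of how these three resource types feed a vertex stage, consider the sketch below. The field names and the trivial scaling "shader" are assumptions for illustration only, not the disclosed buffer formats:

```python
from dataclasses import dataclass

@dataclass
class VBO:
    """Per-vertex attributes (position, texture coordinate, color, ...)."""
    positions: list
    uvs: list

@dataclass
class EBO:
    """Indices into the VBO, in the order they should be executed."""
    indices: list

@dataclass
class UniformBlock:
    """Constants fixed before the vertex shader runs ('uniforms')."""
    mvp_scale: float

def run_vertex_stage(vbo, ebo, uniforms):
    """Apply a trivial 'vertex shader' (uniform scaling) in EBO order."""
    return [tuple(c * uniforms.mvp_scale for c in vbo.positions[i])
            for i in ebo.indices]

out = run_vertex_stage(
    VBO(positions=[(0, 0), (1, 0), (1, 1), (0, 1)], uvs=[]),
    EBO(indices=[0, 1, 2, 2, 3, 0]),
    UniformBlock(mvp_scale=2.0),
)
```

Note how the EBO lets the four unique positions in the VBO drive six vertex-shader invocations (two triangles), which is exactly why the indices rather than the raw vertices are stored.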
[0071] The HCP system architecture is responsible for multiple tasks, including, but not limited to: (1) Registering graphics library API (e.g., OpenGL) trigger functions, in which all graphics library API functions may be intercepted to cache the proper information - e.g., in memory caching system or memory cache 240, or the like; (2) Allocating shared memory, in which shared memory (e.g., shared memory 140) may be allocated to avoid unnecessary memory copying between processors (in some cases, relying on the ION framework provided by Android, or the like); (3) Extracting the first shader language code 230a (e.g., using filtering or detection system 225, or the like); (4) Automatically translating (e.g., using auto-translation system 125, or the like) the first shader language code 230a into at least one intermediate shader language code 245, including, but not limited to, one of standard portable intermediate representation ("SPIR-V") processing, intermediate generic language, or vectorized algorithm, and/or the like; (4A) Automatically translating the at least one intermediate shader language code 230b (e.g., SPIR-V 245, Halide 235a, or the like) into a second shader language code 230c (e.g., HVX 235b, NEON 235c, SVE2 235d, or the like), each second shader language code being domain specific language of the corresponding one of the at least one non-GPU processor (e.g., CPU 155a, DSP 155b, media processor 155c, or other non-GPU processor 155n, or the like); (5) Performing vertex shader functionalities (in some cases, using vertex shader 130, or the like), including, but not limited to, calculating graphics resources associated with 3D objects on one of the at least one non-GPU processor (e.g., CPU 155a, DSP 155b, media processor 155c, or other non-GPU processor 155n, or the like), instead of vertex shader processing by a graphics processing unit ("GPU"); (6) Sending the calculated graphics resources in at least one draw call to a renderer (e.g., rendering system or renderer 250, or the like) for rendering the graphics resources; (7) Allocating the calculated graphics resources to shared memory (e.g., shared memory 140); and (8) Allocating a copy of the calculated graphics resources to a GPU memory (e.g., GPU memory 255, or the like) via a passthrough shader (e.g., passthrough shader 135, or the like), the passthrough shader performing no calculations.
[0072] Referring to Fig. 2C, graphics library resource functions 260 may include, but are not limited to, at least one of graphics library commands for generating buffers 260a (e.g., "glGenBuffers(...)" function that generates data buffers, etc.), graphics library commands for generating texture as 2D image 260b (e.g., "glTexImage2D(...)" function that generates textures as 2D images, etc.), graphics library commands for creating a shader 260c (e.g., "glCreateShader(...)" function that creates space to contain code for shaders, etc.), graphics library commands for copying buffer data 260d (e.g., "glBufferData(...)" function that copies data from CPU to GPU, etc.), and/or the like. HCP 120 may intercept the one or more graphics library resource functions 260 (e.g., using interception system 265, or the like) and may cache the one or more graphics library resource functions 260 (e.g., using memory caching system or memory cache 240, etc.) into shared memory (e.g., shared memory 140, or the like).
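The intercept-and-cache step might look like the following decorator sketch. The `glBufferData` defined here is a stand-in with a made-up body, not a binding to the real driver entry point:

```python
cache = {}  # simulated shared-memory cache of intercepted calls

def intercepted(fn):
    """Wrap a graphics-library entry point so its arguments are cached
    (here, in a dict standing in for shared memory) before the wrapped
    call proceeds."""
    def wrapper(*args):
        cache.setdefault(fn.__name__, []).append(args)
        return fn(*args)
    return wrapper

@intercepted
def glBufferData(target, data):
    # Stand-in for the real driver call; just reports the element count.
    return len(data)

n = glBufferData("GL_ARRAY_BUFFER", [1.0, 2.0, 3.0])
```

After the call, the cache holds the exact arguments that were headed for the driver, which is the information the HCP later needs without having to query the GPU back.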
[0073] To avoid impacting performance with unnecessary memory copying between each processor (e.g., CPU, DSP, GPU, etc.), the HCP may utilize shared memory (e.g., shared memory 140, or the like, which may, in some cases, be based on a memory management framework named ION that is provided for Android OS by Google and other manufacturers). This framework provides an API to allocate memory chunks that can be shared across multiple processors through a uniform memory access ("UMA") architecture, which is suitable for general purpose and time sharing applications by multiple users. Different subsystems may have different ways to allocate those memory chunks. The two most common ways used in HCP include the "AHardwareBuffer" object and a remote procedure call ("RPC") memory allocation function (e.g., "rpcmem_alloc" or the like). For example, AHardwareBuffer_lock may return an ION memory that may be shared with a DSP, a network, or any other subsystem available on the same hardware. Alternatively, rpcmem_alloc may be an API provided for a DSP.
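The zero-copy idea — one allocation, multiple views, no copying between consumers — can be illustrated with Python's standard shared-memory facility. This is only an analogy: the embodiments rely on ION/AHardwareBuffer, not on this API.

```python
from multiprocessing import shared_memory

# One allocation, two names referring to the same underlying buffer;
# writes through one view are visible through the other with no copy.
shm = shared_memory.SharedMemory(create=True, size=16)
view_cpu = shm.buf            # "CPU-side" view
view_dsp = shm.buf            # "DSP-side" view of the same chunk

view_cpu[0] = 42              # producer writes one byte...
value = view_dsp[0]           # ...consumer reads the same byte, no copy

shm.close()
shm.unlink()
```

The ION framework plays the role of `SharedMemory` here: it hands each processor a mapping of the same physical chunk, so the vertex data computed on a CPU or DSP never has to be serialized and copied to be seen by the GPU.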
[0074] For non-GPU 2D shaders (e.g., non-GPU 2D UI shaders, or the like), shader language (such as OpenGL shader language ("GLSL")), which was developed with a per-pixel processing architecture to match the GPU hardware architecture, may be used. On a DSP, CPU, or other architecture, the number of processors running in parallel is heavily limited. However, other techniques exist, such as Single Instruction Multiple Data ("SIMD"), with which one can vectorize the processing of data. Instead of processing one word of 32/64 bits per clock, one can fetch and process multiple words (for example, 1024 bits (128 bytes)) per clock by using special instructions. In some embodiments, HCP may use a different language to transform the original GLSL source code optimized for per-pixel processing architecture to a vectorized processing architecture.
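The per-pixel versus vectorized contrast can be sketched as follows. `lane_width` stands in for the hardware vector width (e.g., many pixels per HVX vector); both paths compute the same result, only the access pattern differs:

```python
def per_pixel(pixels, f):
    """GPU-style: conceptually one shader invocation per pixel."""
    return [f(p) for p in pixels]

def vectorized(pixels, f, lane_width=4):
    """SIMD-style: fetch and process `lane_width` contiguous pixels at a
    time, as a DSP/CPU vector unit would with one wide load per lane."""
    out = []
    for i in range(0, len(pixels), lane_width):
        lane = pixels[i:i + lane_width]      # one contiguous vector load
        out.extend(f(p) for p in lane)       # one vector ALU op in HW
    return out

brighten = lambda p: min(p + 16, 255)        # saturating add, per pixel
src = [0, 100, 200, 250, 255, 10, 20, 30]
result = vectorized(src, brighten)
assert per_pixel(src, brighten) == result    # same output either way
```

In real SIMD code the inner `f` would be a single wide instruction over the whole lane rather than a Python loop; the sketch only shows why contiguous memory access is the precondition for that to work.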
[0075] Rendering of a rectangle using a graphics library API (e.g., OpenGL) is generally done with 2 triangles. Each triangle may contain three vertices that indicate a position in a 3D space. Of the six vertices, only four are unique, as the triangles have two vertices in common to form a rectangle. There is no requirement with regard to the order of the vertices, and the graphics library API (e.g., OpenGL) may handle the rotation of 2D or UI elements by changing the information within those vertices (e.g., position, texture space coordinates (or UVs), etc.). Although a rectangle is described here, the various embodiments are not so limited, and any polygon (e.g., triangle, square, rectangle, or other polygon, or the like) may be used for rendering.
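The six-vertices/four-unique-corners observation can be checked with a short deduplication sketch; the vertex ordering chosen below is one arbitrary option, since, as noted, no particular order is required:

```python
def unique_vertices(triangle_list):
    """Deduplicate the six vertices of a two-triangle rectangle into the
    four unique corners plus an index list (as an EBO would hold)."""
    corners, indices = [], []
    for v in triangle_list:
        if v not in corners:
            corners.append(v)
        indices.append(corners.index(v))
    return corners, indices

# Two triangles sharing the diagonal between (0, 0) and (1, 1):
tris = [(0, 0), (1, 0), (1, 1),   # triangle 1
        (1, 1), (0, 0), (0, 1)]   # triangle 2
corners, indices = unique_vertices(tris)
```

The output is exactly the VBO/EBO split of paragraph [0070]: `corners` plays the VBO role (four unique positions), `indices` the EBO role (six entries, two repeats).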
[0076] When the GPU renders the texture, the processing may be performed pixel by pixel. There is a phase to generate an area in which the 2D or UI element may be rendered with the texture, and the GPU may call a fragment shader for each pixel within this area. This means the GPU accesses the pixel memory individually, processing pixels simultaneously on multiple cores.
[0077] In the case of the DSP, the memory is not accessed per pixel, but as a set of contiguous pixels (i.e., a vector of pixels, or the like). The hardware provides features to run the same operation on a vector of pixels at the same time, which is known as a Single Instruction Multiple Data ("SIMD") instruction set, and this kind of code is often referred to as a vectorized algorithm. On a CPU, the SIMD instruction set is named NEON, while its alternative on a DSP is HVX, the main differences being the size of the vector that they can process at once and the assembly instructions available. One of the main requirements for efficiency is that the memory accesses need to be contiguous.
[0078] The support of GPU shaders on HCP may be handled by identifying: (a) the top left corner out of the six vertices; and (b) the type of rotation applied to the 2D or UI elements (e.g., none, 90, 180, or 270 degrees). This rotation may be handled by standardizing the input coordinates and generating all the transformations of a texture in memory.
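A sketch of the two identification steps follows. The screen-coordinate convention (y grows downward) and the UV-to-rotation mapping are assumptions chosen for illustration, not the disclosed convention:

```python
def top_left(vertices):
    """Step (a): pick the top-left corner — smallest y, then smallest x
    (screen coordinates with y growing downward are assumed here)."""
    return min(vertices, key=lambda v: (v[1], v[0]))

def rotation_of(uv_at_top_left):
    """Step (b): map the texture coordinate found at the top-left corner
    to one of the four supported rotations (illustrative convention)."""
    return {(0, 0): 0, (1, 0): 90, (1, 1): 180, (0, 1): 270}[uv_at_top_left]

quad = [(10, 10), (20, 10), (20, 20), (10, 20)]
corner = top_left(quad)
angle = rotation_of((1, 1))   # e.g., the UV found at that corner
```

With the corner and rotation known, all four pre-rotated copies of the texture can be generated once in memory and selected from, as the text describes.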
[0079] Turning to Fig. 2D, vertex shader GLSL 230a may be auto-translated (e.g., by SPIR-V Translation system 125a, or the like) into an intermediate shader language (e.g., standard portable intermediate representation ("SPIR-V") processing or other shader language code such as intermediate generic language or vectorized algorithm, etc.).
[0080] Because the size of an input buffer provided through the VBO to the GPU can be different from the output of the vertex shader, HCP needs to allocate two VBOs and/or EBOs. In order to know the size of the input and output buffer for each vertex shader, an analysis of the GLSL shader source code may be performed to extract that information. In such a case, HCP 120 or analysis system 270 may analyze at least one of the vertex shader GLSL 230a (e.g., first shader language code, etc.) or the SPIR-V code (e.g., intermediate shader language, etc.) to extract size information for each of an input buffer (e.g., at block 275) and an output buffer (e.g., at block 285) for each vertex shader. HCP 120 or analysis system 270 may allocate the calculated graphics resources (e.g., VBO or EBO, or the like) to a shared memory (e.g., shared memory 140, or the like) based at least in part on the extracted size information for the input buffer for each vertex shader (e.g., at block 280) and may allocate a copy of the calculated graphics resources to a GPU memory (e.g., GPU memory 255, or the like) via a passthrough shader based at least in part on the extracted size information for the output buffer for each vertex shader (e.g., at block 290), respectively. The passthrough shader would be configured to pass the copy of the calculated graphics resources without performing any calculations.
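The size-extraction analysis can be approximated by scanning the shader source for per-vertex input and output declarations. This is a toy parser over a tiny subset of GLSL; a real analysis would have to handle far more of the grammar (layout qualifiers, arrays, structs, and so on):

```python
import re

# Bytes per GLSL type (illustrative subset).
GLSL_SIZES = {"float": 4, "vec2": 8, "vec3": 12, "vec4": 16}

def buffer_sizes(glsl_source):
    """Sum per-vertex input ('attribute'/'in') and output ('varying'/
    'out') declarations to size the input and output buffers."""
    in_bytes = sum(GLSL_SIZES[t] for _, t in
                   re.findall(r"\b(attribute|in)\s+(\w+)\s+\w+\s*;",
                              glsl_source))
    out_bytes = sum(GLSL_SIZES[t] for _, t in
                    re.findall(r"\b(varying|out)\s+(\w+)\s+\w+\s*;",
                               glsl_source))
    return in_bytes, out_bytes

shader = """
attribute vec3 aPos;
attribute vec2 aUV;
varying vec2 vUV;
void main() { gl_Position = vec4(aPos, 1.0); vUV = aUV; }
"""
sizes = buffer_sizes(shader)   # (input bytes, output bytes) per vertex
```

Here the per-vertex input is vec3 + vec2 = 20 bytes and the output is vec2 = 8 bytes, which is exactly the asymmetry that forces HCP to allocate separate input and output buffers.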
[0081] The process for implementing HCP for vertex shader processing may otherwise be similar, if not identical to that as described with respect to Figs. 1, 3, and 4.
[0082] Fig. 3 is a schematic block flow diagram illustrating a non-limiting example 300 of auto-translation of one shader language code to another shader language code during implementation of HCP for vertex shader processing, in accordance with various embodiments.

[0083] Before executing anything through the HCP pipeline, a filtering step may be run to make sure that all the resources needed for a valid execution are available and that all of the processors (e.g., non-GPU processors, etc.) are in a state to accept new tasks. The instructions to execute the vertex shader may be located on the GPU and may be described in graphics library shader language ("GLSL"), optimized for GPU execution. This language is not efficient for vectorized computation, as it was developed for GPU architecture, which relies on a scalar execution that could be parallelized on multiple internal co-processors.
[0084] With reference to Fig. 3, to overcome this issue, auto-translation system(s) 125 may auto-translate the GLSL code 230a into an intermediate language that could then be translated into the domain specific language (e.g., Halide 235a, HVX 235b, NEON 235c, or SVE2 235d, etc.) of the processor (e.g., non-GPU processor(s) including, but not limited to, CPU 155a, DSP 155b, etc.). Standard portable intermediate representation ("SPIR-V") is an intermediate language that may be chosen, to which GLSL may be converted. Halide or a similar language provides a generic path of translation to other processors, as it is a third-party library with a custom domain specific language supporting generation of vectorized code for multiple processors (e.g., non-GPU processor(s) including, but not limited to, CPU 155a, DSP 155b, etc.). According to some embodiments, some optimizations may not necessarily be possible or efficient and may require a custom translation to those processors. As the GPU pipeline for a specific 3D object cannot be completely skipped, the various embodiments may replace elements running in HCP by a passthrough shader (e.g., passthrough shader 135 of Fig. 2B, or the like). All the data that would previously have been calculated on a GPU, but is now calculated in HCP, may only be copied onto the GPU, avoiding any calculation.
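The passthrough replacement reduces, conceptually, to an identity copy: the GPU path and the HCP-plus-passthrough path must produce identical vertex data. A deliberately trivial sketch (the scaling "shader" is an illustrative stand-in):

```python
def vertex_shader(v, scale):
    """The original GPU-side calculation (illustrative stand-in)."""
    return tuple(c * scale for c in v)

def passthrough(v):
    """Passthrough shader: no calculation; data arrives precomputed."""
    return v

verts = [(0.0, 0.0), (1.0, 0.5)]

# GPU path: compute on the GPU.
gpu_result = [vertex_shader(v, 2.0) for v in verts]

# HCP path: precompute on a CPU/DSP, then copy through unchanged.
precomputed = [vertex_shader(v, 2.0) for v in verts]
hcp_result = [passthrough(v) for v in precomputed]

assert gpu_result == hcp_result   # identical output; no GPU math in HCP path
```

The equality at the end is the correctness condition for the whole scheme: moving the vertex calculation off the GPU must be invisible to the rest of the pipeline.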
[0085] The process for implementing HCP for vertex shader processing may otherwise be similar, if not identical to that as described with respect to Figs. 1, 2, and 4.
[0086] Figs. 4A-4C (collectively, "Fig. 4") are flow diagrams illustrating a method 400 for implementing HCP for vertex shader processing, in accordance with various embodiments. Method 400 of Fig. 4A continues onto Fig. 4C following the circular marker denoted, "A," and may return to Fig. 4A following the circular marker denoted, "B."
[0087] While the techniques and procedures are depicted and/or described in a certain order for purposes of illustration, it should be appreciated that certain procedures may be reordered and/or omitted within the scope of various embodiments. Moreover, while the method 400 illustrated by Fig. 4 can be implemented by or with (and, in some cases, are described below with respect to) the systems, examples, or embodiments 100, 200, and 300 of Figs. 1, 2, and 3, respectively (or components thereof), such methods may also be implemented using any suitable hardware (or software) implementation. Similarly, while each of the systems, examples, or embodiments 100, 200, and 300 of Figs. 1, 2, and 3, respectively (or components thereof), can operate according to the method 400 illustrated by Fig. 4 (e.g., by executing instructions embodied on a computer readable medium), the systems, examples, or embodiments 100, 200, and 300 of Figs. 1, 2, and 3 can each also operate according to other modes of operation and/or perform other suitable procedures.
[0088] In the non-limiting embodiment of Fig. 4A, method 400, at block 405, may comprise receiving, using a computing system, one or more graphics library commands from a three-dimensional ("3D") software application ("app"), the one or more graphics library commands comprising commands for rendering image elements, the image elements comprising the plurality of 3D image elements. According to some embodiments, receiving the first shader language code may comprise intercepting, using the computing system, the first shader language code as the first shader language code is being sent from the 3D app to one of a graphics library application programming interface ("API") or a graphics processing unit ("GPU").
[0089] In some embodiments, the computing system may comprise at least one of a HCP, a graphics engine, a graphics rendering engine, a game engine, a 3D game engine, a processor on the user device, or at least one central processing unit ("CPU") core on the user device, and/or the like. In some cases, the 3D app may comprise one of a 3D game app, a 3D user interface ("UI")-based app, or a 3D guidance app configured for use by a user, and/or the like. In such cases, the user may comprise one of a medical professional, a scientist, an engineer, an architect, a construction worker, a factory worker, a warehouse worker, a shipping or delivery worker, an app developer, or a graphics designer, and/or the like.
[0090] At block 410, method 400 may comprise analyzing, using the computing system, the received one or more graphics library commands to extract a first shader language code associated with one or more 3D image elements among a plurality of 3D image elements extracted from one or more graphics library commands from the 3D app.
[0091] Method 400 may further comprise, at block 415, receiving, using the computing system, the first shader language code associated with the one or more 3D image elements among the plurality of 3D image elements extracted from the one or more graphics library commands from the 3D app. Method 400 may continue onto the process at block 420 or may continue onto the process at block 470 in Fig. 4C following the circular marker denoted, "A."

[0092] At block 420, method 400 may comprise automatically translating, using the computing system, the first shader language code into at least one second shader language code each corresponding to one of at least one non-graphics processing unit ("non-GPU") processor. In some instances, the at least one non-GPU processor may each comprise one of a CPU, a multiprocessor, a digital signal processor ("DSP"), a media processor, or another non-GPU processor, and/or the like.
[0093] Method 400 may further comprise sending, using the computing system, each of the at least one second shader language code to a corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app, instead of vertex shader processing by a GPU (block 425); and sending, using the computing system, the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources (block 430). In some cases, the graphics resources may comprise at least one of vertex buffer object ("VBO") data, element buffer object ("EBO") data, or uniform buffer object data, and/or the like.
[0094] Method 400 may comprise, at block 435, rendering, using the renderer, the graphics resources. At block 440, method 400 may comprise storing, using the computing system, at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources in shared memory that is accessible by each of the at least one non-GPU processor and the GPU, and/or the like, without the at least one non-GPU processor or the GPU copying any of the at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources, and/or the like.
[0095] According to some embodiments, the first shader language code may be optimized for per-pixel processing architecture, and the first shader language code may comprise a shader language code based on graphics library shader language ("GLSL"). Each of the at least one second shader language code may be optimized for vectorized processing of contiguous pixels, and each of the at least one second shader language code may comprise a shader language code based on domain specific language of the corresponding one of the at least one non-GPU processor.
[0096] With reference to Fig. 4B, automatically translating the first shader language code into the at least one second shader language code (at block 420) may comprise automatically translating, using the computing system, the first shader language code into at least one intermediate shader language code (block 445). In some cases, the at least one intermediate shader language code may each include a shader language code based on one of standard portable intermediate representation ("SPIR-V") processing, intermediate generic language, or vectorized algorithm, and/or the like.
[0097] In the case (or based on a determination) that the at least one intermediate shader language code comprises multiple successive intermediate shader language codes, method 400 may comprise automatically translating, using the computing system, each intermediate shader language code into each successive intermediate shader language code (block 450). At block 455, method 400 may comprise determining whether the last successive intermediate shader language code has been produced based on the automatic translation. If not, method 400 may return to the process at block 450. If so, method 400 may continue onto the process at block 460. At block 460, method 400 may comprise automatically translating, using the computing system, the last successive intermediate shader language code into the second shader language code.
[0098] Alternatively, in the case (or based on a determination) that the at least one intermediate shader language code comprises a single intermediate shader language code, method 400 may comprise automatically translating, using the computing system (e.g., using the auto-translation system(s), or the like), the single intermediate shader language code into the second shader language code (block 465).
[0099] At block 470 in Fig. 4C (following the circular marker denoted, "A"), method 400 may comprise analyzing, using the computing system, the first shader language code to extract size information for each of an input buffer and an output buffer for each vertex shader. Method 400 may further comprise caching, using the computing system, all the calculated graphics resources associated with 3D objects in a shared memory prior to allocation of the graphics resources on a GPU (block 475); allocating, using the computing system, the calculated graphics resources to a shared memory based at least in part on the extracted size information for the input buffer for each vertex shader (block 480); allocating, using the computing system, a copy of the calculated graphics resources to a GPU memory via a passthrough shader based at least in part on the extracted size information for the output buffer for each vertex shader, the passthrough shader performing no calculations (block 485); and sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources after all the graphics resources have been allocated (block 490).
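As a hedged, hypothetical sketch of blocks 470-490 (the declaration syntax, size heuristic, and buffer sizes below are illustrative assumptions, not the disclosed parser), size information for the input and output buffers of a vertex shader might be extracted from its source and used to drive allocation before any draw call is issued:

```python
# Hypothetical sketch: count "in"/"out" declarations in a vertex shader
# source (block 470), allocate input-side shared memory (block 480) and
# output-side GPU memory via a no-op passthrough shader (block 485), then
# batch a draw call only after allocation completes (block 490).
import re

def extract_buffer_sizes(shader_source):
    """Return (input_count, output_count) from hypothetical declarations."""
    inputs = len(re.findall(r"^\s*in\s", shader_source, re.M))
    outputs = len(re.findall(r"^\s*out\s", shader_source, re.M))
    return inputs, outputs

shader = """\
in vec4 position;
in vec4 normal;
out vec4 clip_pos;
"""

in_count, out_count = extract_buffer_sizes(shader)

BYTES_PER_VEC4 = 16                         # illustrative element size
shared_memory = bytearray(in_count * BYTES_PER_VEC4)   # input side (block 480)
gpu_memory = bytearray(out_count * BYTES_PER_VEC4)     # output side (block 485)
draw_calls = [("draw", len(shared_memory), len(gpu_memory))]  # block 490
```

The passthrough shader of block 485 is represented here only by the copy target; the point of the sketch is the ordering, in which all allocations precede the draw call.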
[0100] Method 400 may return to the process at block 435 in Fig. 4A following the circular marker denoted, "B."
[0101] Examples of System and Hardware Implementation
[0102] Fig. 5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments. Fig. 5 provides a schematic illustration of one embodiment of a computer system 500 of the service provider system hardware that can perform the methods provided by various other embodiments, as described herein, and/or can perform the functions of computer or hardware system (i.e., user device 105, computing system 115, heterogeneous computing platform ("HCP") 120, computing hardware 145, display screen 160c, audio playback device 160d, content source(s) 170, content distribution system 180, and display devices 195a-195n, etc.), as described above. It should be noted that Fig. 5 is meant only to provide a generalized illustration of various components, of which one or more (or none) of each may be utilized as appropriate. Fig. 5, therefore, broadly illustrates how individual system elements may be implemented in a relatively separated or relatively more integrated manner.
[0103] The computer or hardware system 500 - which might represent an embodiment of the computer or hardware system (i.e., user device 105, computing system 115, HCP 120, computing hardware 145, display screen 160c, audio playback device 160d, content source(s) 170, content distribution system 180, and display devices 195a-195n, etc.), described above with respect to Figs. 1-4 - is shown comprising hardware elements that can be electrically coupled via a bus 505 (or may otherwise be in communication, as appropriate). The hardware elements may include one or more processors 510, including, without limitation, one or more general-purpose processors and/or one or more special-purpose processors (such as microprocessors, digital signal processing chips, graphics acceleration processors, and/or the like); one or more input devices 515, which can include, without limitation, a mouse, a keyboard, and/or the like; and one or more output devices 520, which can include, without limitation, a display device, a printer, and/or the like.
[0104] The computer or hardware system 500 may further include (and/or be in communication with) one or more storage devices 525, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, solid-state storage device such as a random access memory ("RAM") and/or a read-only memory ("ROM"), which can be programmable, flash-updateable, and/or the like. Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like.
[0105] The computer or hardware system 500 might also include a communications subsystem 530, which can include, without limitation, a modem, a network card (wireless or wired), an infra-red communication device, a wireless communication device and/or chipset (such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a WWAN device, cellular communication facilities, etc.), and/or the like. The communications subsystem 530 may permit data to be exchanged with a network (such as the network described below, to name one example), with other computer or hardware systems, and/or with any other devices described herein. In many embodiments, the computer or hardware system 500 will further comprise a working memory 535, which can include a RAM or ROM device, as described above.
[0106] The computer or hardware system 500 also may comprise software elements, shown as being currently located within the working memory 535, including an operating system 540, device drivers, executable libraries, and/or other code, such as one or more application programs 545, which may comprise computer programs provided by various embodiments (including, without limitation, hypervisors, VMs, and the like), and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein. Merely by way of example, one or more procedures described with respect to the method(s) discussed above might be implemented as code and/or instructions executable by a computer (and/or a processor within a computer); in an aspect, then, such code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.

[0107] A set of these instructions and/or code might be encoded and/or stored on a non-transitory computer readable storage medium, such as the storage device(s) 525 described above. In some cases, the storage medium might be incorporated within a computer system, such as the system 500. In other embodiments, the storage medium might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon. These instructions might take the form of executable code, which is executable by the computer or hardware system 500 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer or hardware system 500 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.) then takes the form of executable code.
[0108] It will be apparent to those skilled in the art that substantial variations may be made in accordance with particular requirements. For example, customized hardware (such as programmable logic controllers, field-programmable gate arrays, application- specific integrated circuits, and/or the like) might also be used, and/or particular elements might be implemented in hardware, software (including portable software, such as applets, etc.), or both. Further, connection to other computing devices such as network input/output devices may be employed.
[0109] As mentioned above, in one aspect, some embodiments may employ a computer or hardware system (such as the computer or hardware system 500) to perform methods in accordance with various embodiments of the invention. According to a set of embodiments, some or all of the procedures of such methods are performed by the computer or hardware system 500 in response to processor 510 executing one or more sequences of one or more instructions (which might be incorporated into the operating system 540 and/or other code, such as an application program 545) contained in the working memory 535. Such instructions may be read into the working memory 535 from another computer readable medium, such as one or more of the storage device(s) 525. Merely by way of example, execution of the sequences of instructions contained in the working memory 535 might cause the processor(s) 510 to perform one or more procedures of the methods described herein.
[0110] The terms "machine readable medium" and "computer readable medium," as used herein, refer to any medium that participates in providing data that causes a machine to operate in some fashion. In an embodiment implemented using the computer or hardware system 500, various computer readable media might be involved in providing instructions/code to processor(s) 510 for execution and/or might be used to store and/or carry such instructions/code (e.g., as signals). In many implementations, a computer readable medium is a non-transitory, physical, and/or tangible storage medium. In some embodiments, a computer readable medium may take many forms, including, but not limited to, non-volatile media, volatile media, or the like. Non-volatile media includes, for example, optical and/or magnetic disks, such as the storage device(s) 525. Volatile media includes, without limitation, dynamic memory, such as the working memory 535. In some alternative embodiments, a computer readable medium may take the form of transmission media, which includes, without limitation, coaxial cables, copper wire, and fiber optics, including the wires that comprise the bus 505, as well as the various components of the communication subsystem 530 (and/or the media by which the communications subsystem 530 provides communication with other devices). In an alternative set of embodiments, transmission media can also take the form of waves (including without limitation radio, acoustic, and/or light waves, such as those generated during radiowave and infra-red data communications).
[0111] Common forms of physical and/or tangible computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code.
[0112] Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the processor(s) 510 for execution. Merely by way of example, the instructions may initially be carried on a magnetic disk and/or optical disc of a remote computer. A remote computer might load the instructions into its dynamic memory and send the instructions as signals over a transmission medium to be received and/or executed by the computer or hardware system 500. These signals, which might be in the form of electromagnetic signals, acoustic signals, optical signals, and/or the like, are all examples of carrier waves on which instructions can be encoded, in accordance with various embodiments of the invention.
[0113] The communications subsystem 530 (and/or components thereof) generally will receive the signals, and the bus 505 then might carry the signals (and/or the data, instructions, etc. carried by the signals) to the working memory 535, from which the processor(s) 510 retrieves and executes the instructions. The instructions received by the working memory 535 may optionally be stored on a storage device 525 either before or after execution by the processor(s) 510.
[0114] As noted above, a set of embodiments comprises methods and systems for implementing two-dimensional ("2D") and/or three-dimensional ("3D") rendering, and, more particularly, to methods, systems, and apparatuses for implementing heterogeneous computing platform ("HCP") for vertex shader processing. Fig. 6 illustrates a schematic diagram of a system 600 that can be used in accordance with one set of embodiments. The system 600 can include one or more user computers, user devices, or customer devices 605. A user computer, user device, or customer device 605 can be a general purpose personal computer (including, merely by way of example, desktop computers, tablet computers, laptop computers, handheld computers, and the like, running any appropriate operating system, several of which are available from vendors such as Apple, Microsoft Corp., and the like), cloud computing devices, a server(s), and/or a workstation computer(s) running any of a variety of commercially-available UNIX™ or UNIX-like operating systems. A user computer, user device, or customer device 605 can also have any of a variety of applications, including one or more applications configured to perform methods provided by various embodiments (as described above, for example), as well as one or more office applications, database client and/or server applications, and/or web browser applications. Alternatively, a user computer, user device, or customer device 605 can be any other electronic device, such as a thin-client computer, Internet-enabled mobile telephone, and/or personal digital assistant, capable of communicating via a network (e.g., the network(s) 610 described below) and/or of displaying and navigating web pages or other types of electronic documents. Although the system 600 is shown with two user computers, user devices, or customer devices 605, any number of user computers, user devices, or customer devices can be supported.
[0115] Some embodiments operate in a networked environment, which can include a network(s) 610. The network(s) 610 can be any type of network familiar to those skilled in the art that can support data communications using any of a variety of commercially-available (and/or free or proprietary) protocols, including, without limitation, TCP/IP, SNA™, IPX™, AppleTalk™, and the like. Merely by way of example, the network(s) 610 (similar to network(s) 165 of Fig. 1, or the like) can each include a local area network ("LAN"), including, without limitation, a fiber network, an Ethernet network, a Token-Ring™ network, and/or the like; a wide-area network ("WAN"); a wireless wide area network ("WWAN"); a virtual network, such as a virtual private network ("VPN"); the Internet; an intranet; an extranet; a public switched telephone network ("PSTN"); an infra-red network; a wireless network, including, without limitation, a network operating under any of the IEEE 802.11 suite of protocols, the Bluetooth™ protocol known in the art, and/or any other wireless protocol; and/or any combination of these and/or other networks. In a particular embodiment, the network might include an access network of the service provider (e.g., an Internet service provider ("ISP")). In another embodiment, the network might include a core network of the service provider, and/or the Internet.
[0116] Embodiments can also include one or more server computers 615. Each of the server computers 615 may be configured with an operating system, including, without limitation, any of those discussed above, as well as any commercially (or freely) available server operating systems. Each of the servers 615 may also be running one or more applications, which can be configured to provide services to one or more clients 605 and/or other servers 615.
[0117] Merely by way of example, one of the servers 615 might be a data server, a web server, a cloud computing device(s), or the like, as described above. The data server might include (or be in communication with) a web server, which can be used, merely by way of example, to process requests for web pages or other electronic documents from user computers 605. The web server can also run a variety of server applications, including HTTP servers, FTP servers, CGI servers, database servers, Java servers, and the like. In some embodiments of the invention, the web server may be configured to serve web pages that can be operated within a web browser on one or more of the user computers 605 to perform methods of the invention.
[0118] The server computers 615, in some embodiments, might include one or more application servers, which can be configured with one or more applications accessible by a client running on one or more of the client computers 605 and/or other servers 615. Merely by way of example, the server(s) 615 can be one or more general purpose computers capable of executing programs or scripts in response to the user computers 605 and/or other servers 615, including, without limitation, web applications (which might, in some cases, be configured to perform methods provided by various embodiments). Merely by way of example, a web application can be implemented as one or more scripts or programs written in any suitable programming language, such as Java™, C, C#™ or C++, and/or any scripting language, such as Perl, Python, or TCL, as well as combinations of any programming and/or scripting languages. The application server(s) can also include database servers, including, without limitation, those commercially available from Oracle™, Microsoft™, Sybase™, IBM™, and the like, which can process requests from clients (including, depending on the configuration, dedicated database clients, API clients, web browsers, etc.) running on a user computer, user device, or customer device 605 and/or another server 615. In some embodiments, an application server can perform one or more of the processes for implementing 2D and/or 3D rendering, and, more particularly, to methods, systems, and apparatuses for implementing HCP for vertex shader processing, as described in detail above. Data provided by an application server may be formatted as one or more web pages (comprising HTML, JavaScript, etc., for example) and/or may be forwarded to a user computer 605 via a web server (as described above, for example). Similarly, a web server might receive web page requests and/or input data from a user computer 605 and/or forward the web page requests and/or input data to an application server. 
In some cases, a web server may be integrated with an application server.
[0119] In accordance with further embodiments, one or more servers 615 can function as a file server and/or can include one or more of the files (e.g., application code, data files, etc.) necessary to implement various disclosed methods, incorporated by an application running on a user computer 605 and/or another server 615. Alternatively, as those skilled in the art will appreciate, a file server can include all necessary files, allowing such an application to be invoked remotely by a user computer, user device, or customer device 605 and/or server 615.

[0120] It should be noted that the functions described with respect to various servers herein (e.g., application server, database server, web server, file server, etc.) can be performed by a single server and/or a plurality of specialized servers, depending on implementation-specific needs and parameters.
[0121] In some embodiments, the system can include one or more databases 620a-620n (collectively, "databases 620"). The location of each of the databases 620 is discretionary: merely by way of example, a database 620a might reside on a storage medium local to (and/or resident in) a server 615a (and/or a user computer, user device, or customer device 605). Alternatively, a database 620n can be remote from any or all of the computers 605, 615, so long as it can be in communication (e.g., via the network 610) with one or more of these. In a particular set of embodiments, a database 620 can reside in a storage-area network ("SAN") familiar to those skilled in the art. (Likewise, any necessary files for performing the functions attributed to the computers 605, 615 can be stored locally on the respective computer and/or remotely, as appropriate.) In one set of embodiments, the database 620 can be a relational database, such as an Oracle database, that is adapted to store, update, and retrieve data in response to SQL-formatted commands. The database might be controlled and/or maintained by a database server, as described above, for example.
[0122] According to some embodiments, user device 605 (similar to user devices 105 of Figs. 1-3, or the like) may comprise a graphic application 625 (similar to graphic applications 110 of Figs. 1 and 2, or the like), a computing system(s) 630 (similar to computing system 115 of Fig. 1, or the like), and computing hardware 660 (similar to computing hardware 145 of Figs. 1 and 2, or the like). Computing system(s) 630 may comprise heterogeneous computing platform ("HCP") 635 (similar to HCP 120 of Figs. 1 and 2, or the like), which may comprise an auto-translation system(s) 640 (similar to auto-translation system(s) 125 of Figs. 1 and 2, or the like), a vertex shader 645 (similar to vertex shader 130 of Figs. 1 and 2, or the like), a passthrough shader 650 (similar to passthrough shaders 135 of Figs. 1 and 2, or the like), and shared memory 655 (similar to shared memory 140 of Figs. 1 and 2, or the like). Computing hardware 660 may comprise a graphics processing unit ("GPU") 665 (similar to GPUs 150 of Figs. 1 and 2, or the like) and one or more non-GPU processors 670 (similar to non-GPU processors 155 or 155a-155n of Figs. 1 and 2, or the like; e.g., a central processing unit ("CPU") 155a, a multiprocessor, a digital signal processor ("DSP") 155b, a media processor 155c, or another non-GPU processor 155n, and/or the like). User device 605 may further comprise data storage device 675a (similar to data storage device 160a of Fig. 1, or the like), communications system 675b (similar to communications system 160b of Fig. 1, or the like), display screen 675c (similar to display screen 160c of Fig. 1, or the like), and audio playback device 675d (optional; similar to audio playback device 160d of Fig. 1, or the like). System 600 may further comprise one or more content sources 680 and corresponding database(s) 685 (similar to content source(s) 170 and corresponding database(s) 175 of Fig. 
1, or the like) and, in some cases, one or more content distribution systems 690 and corresponding database(s) 690 (similar to content distribution system(s) 180 and corresponding database(s) 185 of Fig. 1, or the like).
[0123] In operation, a computing system 630, HCP 635, and/or computing hardware 660 (collectively, "computing system") may receive a first shader language code associated with one or more 3D image elements among a plurality of 3D image elements extracted from one or more graphics library commands from a 3D app (e.g., 3D app 625, or the like). The computing system may automatically translate the first shader language code into at least one second shader language code each corresponding to one of at least one non-graphics processing unit ("non-GPU") processor (e.g., non-GPU processor(s) 670, or the like). The computing system may send each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app, instead of vertex shader processing by a GPU (e.g., GPU 665, or the like). The computing system may send the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources, which will be used for generating rendered images.
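The operational flow of the preceding paragraph may be sketched, purely as a hypothetical illustration (the processor names, translation stand-in, and resource format below are assumptions, not the disclosed implementation), as a loop that produces one translated shader per non-GPU processor, dispatches vertex processing, and batches the results into a draw call for the renderer:

```python
# Hypothetical sketch of paragraph [0123]: receive first shader language
# code, translate one copy per non-GPU processor, perform vertex shader
# processing on each, and batch the calculated resources into a draw call.

def hcp_pipeline(first_code, non_gpu_processors):
    """Return the calculated graphics resources batched for one draw call."""
    draw_call = []
    for proc in non_gpu_processors:
        second_code = f"{proc}:{first_code}"     # stand-in for auto-translation
        resources = f"resources<{second_code}>"  # stand-in for vertex processing
        draw_call.append(resources)
    return draw_call                             # sent to the renderer

result = hcp_pipeline("glsl_vertex", ["cpu", "dsp"])
```

The sketch shows only the dispatch shape; in the disclosure the vertex shader processing runs on the non-GPU processors themselves rather than the GPU.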
[0124] In some embodiments, the computing system may comprise at least one of a HCP (e.g., HCP 635, or the like), a graphics engine, a graphics rendering engine, a game engine, a 3D game engine, a processor on the user device (e.g., user device 605, or the like), or at least one central processing unit ("CPU") core on the user device, and/or the like. In some instances, the at least one non-GPU processor may each comprise one of a CPU, a multiprocessor, a digital signal processor ("DSP"), a media processor, or another non-GPU processor, and/or the like. In some cases, the 3D app may comprise one of a 3D game app, a 3D user interface ("UI")-based app, or a 3D guidance app configured for use by a user, and/or the like. In such cases, the user may comprise one of a medical professional, a scientist, an engineer, an architect, a construction worker, a factory worker, a warehouse worker, a shipping or delivery worker, an app developer, or a graphics designer, and/or the like. In some instances, the graphics resources may comprise at least one of vertex buffer object ("VBO") data, element buffer object ("EBO") data, or uniform buffer object data, and/or the like.
[0125] According to some embodiments, the first shader language code may be optimized for per-pixel processing architecture, and the first shader language code may comprise a shader language code based on graphics library shader language ("GLSL"). Each of the at least one second shader language code may be optimized for vectorized processing of contiguous pixels, and each of the at least one second shader language code may include a shader language code based on domain specific language of the corresponding one of the at least one non-GPU processor. In some instances, automatically translating (e.g., using auto-translation system(s) 640, or the like) the first shader language code into the at least one second shader language code may comprise automatically translating the first shader language code into at least one intermediate shader language code. In such cases, the at least one intermediate shader language code may each include a shader language code based on one of standard portable intermediate representation ("SPIR-V") processing, intermediate generic language, or vectorized algorithm. Based on a determination that the at least one intermediate shader language code comprises multiple successive intermediate shader language codes, the computing system may automatically translate each intermediate shader language code into each successive intermediate shader language code until the last successive intermediate shader language code has been produced, and may automatically translate the last successive intermediate shader language code into the second shader language code. Based on a determination that the at least one intermediate shader language code comprises a single intermediate shader language code, the computing system may automatically translate the single intermediate shader language code into the second shader language code.
[0126] In some embodiments, receiving the first shader language code may comprise intercepting, using the computing system, the first shader language code as the first shader language code is being sent from the 3D app to one of a graphics library application programming interface ("API") or the GPU.
[0127] According to some embodiments, the computing system may receive the one or more graphics library commands from the 3D app, the one or more graphics library commands comprising commands for rendering image elements, the image elements comprising the plurality of 3D image elements. The computing system may analyze the received one or more graphics library commands to extract the first shader language code.
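A minimal, hypothetical sketch of the interception and analysis described in paragraphs [0126]-[0127] follows; the command tuple format and the command name used to carry shader source are illustrative assumptions, not the disclosed command set:

```python
# Hypothetical sketch: intercept graphics library commands on their way from
# the 3D app to the graphics library API or GPU, and extract the first
# shader language code from any shader-carrying command.

def extract_first_shader_code(commands):
    """Return shader sources found among intercepted graphics library commands."""
    extracted = []
    for name, payload in commands:
        if name == "glShaderSource":   # illustrative shader-carrying command
            extracted.append(payload)
    return extracted

commands = [
    ("glBindBuffer", "vbo0"),
    ("glShaderSource", "void main() { gl_Position = pos; }"),
    ("glDrawArrays", "0..36"),
]
codes = extract_first_shader_code(commands)
```

In the disclosure, the analysis step distinguishes the shader language code from the remaining rendering commands, which continue to the renderer unchanged.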
[0128] In some embodiments, the computing system may analyze the first shader language code to extract size information for each of an input buffer and an output buffer for each vertex shader; may allocate the calculated graphics resources to a shared memory based at least in part on the extracted size information for the input buffer for each vertex shader; and may allocate a copy of the calculated graphics resources to a GPU memory via a passthrough shader (e.g., passthrough shader 650, or the like) based at least in part on the extracted size information for the output buffer for each vertex shader, the passthrough shader performing no calculations. In some cases, the computing system may cache all the calculated graphics resources associated with 3D objects in a shared memory (e.g., shared memory 655 or the like) prior to allocation of the graphics resources on a GPU. In some instances, sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources may comprise sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources after all the graphics resources have been allocated.
[0129] According to some embodiments, the computing system may store at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources in shared memory that is accessible by each of the at least one non-GPU processor and the GPU, and/or the like, without the at least one non-GPU processor or the GPU copying any of the at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources, and/or the like.
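The zero-copy property of paragraph [0129] can be illustrated, by loose analogy only, with an OS-level shared memory block: a producer (standing in for a non-GPU processor) writes calculated resources, and a consumer (standing in for the GPU side) attaches to the same block by name and reads it without any copy. Python's `multiprocessing.shared_memory` here is an assumption chosen for illustration, not the platform mechanism of the disclosure:

```python
# Analogy for paragraph [0129]: two parties access one shared memory block
# by name; the data is written once and read in place, never copied between
# the parties' private memories.
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=64)
try:
    # "Producer" (e.g., a non-GPU processor) writes calculated resources.
    shm.buf[:4] = b"VBO\x00"
    # "Consumer" (e.g., the GPU side) attaches to the same block -- no copy.
    view = shared_memory.SharedMemory(name=shm.name)
    payload = bytes(view.buf[:3])
    view.close()
finally:
    shm.close()
    shm.unlink()
```

The named attachment is what makes the read zero-copy between the two parties; only the final `bytes()` call materializes a private copy for inspection.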
[0130] These and other functions of the system 600 (and its components) are described in greater detail above with respect to Figs. 1-4.
[0131] While particular features and aspects have been described with respect to some embodiments, one skilled in the art will recognize that numerous modifications are possible. For example, the methods and processes described herein may be implemented using hardware components, software components, and/or any combination thereof. Further, while various methods and processes described herein may be described with respect to particular structural and/or functional components for ease of description, methods provided by various embodiments are not limited to any particular structural and/or functional architecture but instead can be implemented on any suitable hardware, firmware and/or software configuration. Similarly, while particular functionality is ascribed to particular system components, unless the context dictates otherwise, this functionality need not be limited to such and can be distributed among various other system components in accordance with the several embodiments.
[0132] Moreover, while the procedures of the methods and processes described herein are described in a particular order for ease of description, unless the context dictates otherwise, various procedures may be reordered, added, and/or omitted in accordance with various embodiments. Moreover, the procedures described with respect to one method or process may be incorporated within other described methods or processes; likewise, system components described according to a particular structural architecture and/or with respect to one system may be organized in alternative structural architectures and/or incorporated within other described systems. Hence, while various embodiments are described with — or without — particular features for ease of description and to illustrate some aspects of those embodiments, the various components and/or features described herein with respect to a particular embodiment can be substituted, added and/or subtracted from among other described embodiments, unless the context dictates otherwise. Consequently, although several embodiments are described above, it will be appreciated that the invention is intended to cover all modifications and equivalents within the scope of the following claims.

Claims

WHAT IS CLAIMED IS:
1. A method, comprising: receiving, using a computing system, a first shader language code associated with one or more three-dimensional ("3D") image elements among a plurality of 3D image elements extracted from one or more graphics library commands from a 3D software application ("app"); translating, using the computing system, the first shader language code into at least one second shader language code each corresponding to one of at least one non-graphics processing unit ("non-GPU") processor; sending, using the computing system, each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app; and sending, using the computing system, the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources.
2. The method of claim 1, wherein the computing system comprises at least one of a heterogeneous computing platform ("HCP"), a graphics engine, a graphics rendering engine, a game engine, a 3D game engine, a processor on the user device, or at least one central processing unit ("CPU") core on the user device.
3. The method of claim 1 or 2, wherein the at least one non-GPU processor each comprises one of a central processing unit ("CPU"), a multiprocessor, a digital signal processor ("DSP"), a media processor, or another non-GPU processor.
4. The method of any of claims 1-3, wherein the 3D app comprises one of a 3D game app, a 3D user interface ("UI") -based app, or a 3D guidance app configured for use by a user, wherein the user comprises one of a medical professional, a scientist, an engineer, an architect, a construction worker, a factory worker, a warehouse worker, a shipping or delivery worker, an app developer, or a graphics designer.
5. The method of any of claims 1-4, wherein the graphics resources comprise at least one of vertex buffer object ("VBO") data, element buffer object ("EBO") data, or uniform buffer object data.
6. The method of any of claims 1-5, wherein the first shader language code is optimized for per-pixel processing architecture, wherein the first language code comprises a shader language code based on graphics library shader language ("GLSL"), wherein each of the at least one second shader language code is optimized for vectorized processing of contiguous
pixels, and wherein each of the at least one second shader language code comprises a shader language code based on domain specific language of the corresponding one of the at least one non-GPU processor.
7. The method of claim 6, wherein translating the first shader language code into the at least one second shader language code comprises: automatically translating, using the computing system, the first shader language code into at least one intermediate shader language code, wherein the at least one intermediate shader language code each comprises a shader language code based on one of standard portable intermediate representation ("SPIR-V") processing, intermediate generic language, or vectorized algorithm; wherein based on a determination that the at least one intermediate shader language code comprises multiple successive intermediate shader language codes, automatically translating, using the computing system, each intermediate shader language code into each successive intermediate shader language code until the last successive intermediate shader language code has been produced, and automatically translating, using the computing system, the last successive intermediate shader language code into the second shader language code; and wherein based on a determination that the at least one intermediate shader language code comprises a single intermediate shader language code, automatically translating, using the computing system, the single intermediate shader language code into the second shader language code.
8. The method of any of claims 1-7, wherein receiving the first shader language code comprises intercepting, using the computing system, the first shader language code as the first shader language code is being sent from the 3D app to one of a graphics library application programming interface ("API") or the GPU.
9. The method of any of claims 1-8, further comprising: receiving, using the computing system, the one or more graphics library commands from the 3D app, the one or more graphics library commands comprising commands for rendering image elements, the image elements comprising the plurality of 3D image elements; and analyzing, using the computing system, the received one or more graphics library commands to extract the first shader language code.
10. The method of any of claims 1-9, further comprising:
analyzing, using the computing system, the first shader language code to extract size information for each of an input buffer and an output buffer for each vertex shader; allocating, using the computing system, the calculated graphics resources to a shared memory based at least in part on the extracted size information for the input buffer for each vertex shader; and allocating, using the computing system, a copy of the calculated graphics resources to a GPU memory via a passthrough shader based at least in part on the extracted size information for the output buffer for each vertex shader, the passthrough shader performing no calculations.
11. The method of claim 10, further comprising: caching, using the computing system, all the calculated graphics resources associated with 3D objects in a shared memory prior to allocation of the graphics resources on a GPU.
12. The method of claim 11, wherein sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources comprises sending the calculated graphics resources in at least one draw call to the renderer for rendering the graphics resources after all the graphics resources have been allocated.
13. The method of any of claims 1-12, further comprising: storing, using the computing system, at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources in shared memory that is accessible by each of the at least one non-GPU processor and the GPU, without the at least one non-GPU processor or the GPU copying any of the at least one of the one or more graphics library commands, the first shader language code associated with the one or more 3D image elements, the plurality of 3D image elements, the calculated graphics resources, or the rendered graphics resources.
14. An apparatus, comprising: at least one processor; and a non-transitory computer readable medium communicatively coupled to the at least one processor, the non-transitory computer readable medium having stored thereon computer software comprising a set of instructions that, when executed by the at least one processor, causes the apparatus to: receive a first shader language code associated with one or more three- dimensional ("3D") image elements among a plurality of 3D image elements extracted from one or more graphics library commands from a 3D software application ("app"); translate the first shader language code into at least one second shader language code each corresponding to one of at least one non-graphics processing unit ("non-GPU") processor; send each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app; and send the calculated graphics resources in at least one draw call to a renderer for rendering the graphics resources.
15. A system, comprising: a computing system, comprising: at least one first processor; and a first non-transitory computer readable medium communicatively coupled to the at least one first processor, the first non-transitory computer readable medium having stored thereon computer software comprising a first set of instructions that, when executed by the at least one first processor, causes the computing system to: receive a first shader language code associated with one or more three- dimensional ("3D") image elements among a plurality of 3D image elements extracted from one or more graphics library commands from a 3D software application ("app"); translate the first shader language code into at least one second shader language code each corresponding to one of at least one nongraphics processing unit ("non-GPU") processor; send each of the at least one second shader language code to corresponding one of the at least one non-GPU processor for vertex shader processing of the one or more 3D image elements to calculate graphics resources in the 3D app; and send the calculated graphics resources in at least one draw call to a
renderer for rendering the graphics resources.
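As a non-limiting illustration of the successive translation recited in claim 7 (first shader language code, through zero or more intermediate representations such as SPIR-V, into the domain-specific language of a non-GPU processor), the cascade can be sketched as follows. All function and variable names below are hypothetical stand-ins for illustration only and form no part of the claimed subject matter; a real implementation would invoke actual shader compilers rather than string substitutions:

```python
# Hypothetical sketch of the claim-7 translation cascade: the first shader
# language code is passed through each successive intermediate translation
# stage in order, and the last stage emits the second shader language code
# for the target non-GPU processor (e.g., a CPU core or DSP).

def translate_shader(first_code: str, stages: list) -> str:
    """Apply each translation stage (a callable str -> str) in sequence.

    With multiple stages, each output feeds the next stage (the "multiple
    successive intermediate shader language codes" branch of claim 7); with
    a single stage, one translation produces the second shader language code.
    """
    code = first_code
    for stage in stages:
        code = stage(code)
    return code

# Illustrative stand-in stages (placeholders for real translators):
glsl_to_spirv = lambda src: src.replace("glsl", "spirv")    # GLSL -> SPIR-V-like IR
spirv_to_dsp = lambda src: src.replace("spirv", "dsp_dsl")  # IR -> DSP-specific DSL

dsp_code = translate_shader("glsl:vertex_shader", [glsl_to_spirv, spirv_to_dsp])
print(dsp_code)  # dsp_dsl:vertex_shader
```

The same driver handles the single-intermediate-code branch simply by passing a one-element stage list, which is why the claim's two determinations collapse into one loop in this sketch.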
PCT/US2022/011163 2022-01-04 2022-01-04 Heterogeneous computing platform (hcp) for vertex shader processing WO2022126140A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2022/011163 WO2022126140A1 (en) 2022-01-04 2022-01-04 Heterogeneous computing platform (hcp) for vertex shader processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2022/011163 WO2022126140A1 (en) 2022-01-04 2022-01-04 Heterogeneous computing platform (hcp) for vertex shader processing

Publications (1)

Publication Number Publication Date
WO2022126140A1 true WO2022126140A1 (en) 2022-06-16

Family

ID=81974041

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/011163 WO2022126140A1 (en) 2022-01-04 2022-01-04 Heterogeneous computing platform (hcp) for vertex shader processing

Country Status (1)

Country Link
WO (1) WO2022126140A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150161757A1 (en) * 2012-09-29 2015-06-11 Intel Corporation Load balancing and merging of tessellation thread workloads
US20200151936A1 (en) * 2018-11-13 2020-05-14 Intel Corporation Techniques to manage execution of shaders


Similar Documents

Publication Publication Date Title
US9928637B1 (en) Managing rendering targets for graphics processing units
US10074206B1 (en) Network-optimized graphics library for virtualized graphics processing
US8675000B2 (en) Command buffers for web-based graphics rendering
US9069549B2 (en) Machine processor
US10482567B2 (en) Apparatus and method for intelligent resource provisioning for shadow structures
TWI614685B (en) System for efficient graphics processing in a virtual execution environment
US20220249948A1 (en) Image processing method and apparatus, server, and medium
US11281500B2 (en) Apparatus and method for cloud-based graphics validation
US9207919B2 (en) System, method, and computer program product for bulk synchronous binary program translation and optimization
US10180825B2 (en) System and method for using ubershader variants without preprocessing macros
CN117078832A (en) More efficient ray tracing methods and apparatus for embodied geometries
US20130198325A1 (en) Provision and running a download script
Potluri et al. Extending openSHMEM for GPU computing
TW201706956A (en) Facilitating efficient graphics command generation and execution for improved graphics performance at computing devices
US20180341526A1 (en) Facilitating efficient communication and data processing across clusters of computing machines in heterogeneous computing environment
US10831625B2 (en) Method and apparatus periodic snapshotting in a graphics processing environment
US20140059114A1 (en) Application service providing system and method and server apparatus and client apparatus for application service
US9348676B2 (en) System and method of processing buffers in an OpenCL environment
US8854368B1 (en) Point sprite rendering in a cross platform environment
CN107408293B (en) Supporting multiple levels of nesting of command buffers in a graphics command stream at a computing device
JP6271812B2 (en) Transparent pixel format converter
US20130103931A1 (en) Machine processor
US9448823B2 (en) Provision of a download script
CN112416303A (en) Software development kit thermal restoration method and device and electronic equipment
Kennedy et al. AVEC: accelerator virtualization in cloud-edge computing for deep learning libraries

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22729017

Country of ref document: EP

Kind code of ref document: A1