CN114972607A - Data transmission method, device and medium for accelerating image display - Google Patents


Info

Publication number
CN114972607A
Authority
CN
China
Prior art keywords
image data
cache
color
binding
frame
Prior art date
Legal status
Granted
Application number
CN202210903364.5A
Other languages
Chinese (zh)
Other versions
CN114972607B (en)
Inventor
马超 (Ma Chao)
陈成 (Chen Cheng)
Current Assignee
Nanjing Sietium Semiconductor Co ltd
Original Assignee
Yantai Xintong Semiconductor Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Yantai Xintong Semiconductor Technology Co ltd
Priority to CN202210903364.5A
Publication of CN114972607A
Application granted
Publication of CN114972607B
Status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 - 3D [Three Dimensional] image rendering
    • G06T 15/005 - General purpose rendering architectures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 - General purpose image data processing
    • G06T 1/20 - Processor architectures; Processor configuration, e.g. pipelining
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 - General purpose image data processing
    • G06T 1/60 - Memory management
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2200/00 - Indexing scheme for image data processing or generation, in general
    • G06T 2200/28 - Indexing scheme for image data processing or generation, in general, involving image processing hardware

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Image Generation (AREA)

Abstract

The embodiment of the invention discloses a data transmission method, device, and medium for accelerating image display. The method may comprise the following steps: binding the rendered color cache image data with the color cache texture object to form a first binding relationship; transmitting the color cache texture object bound with the color cache image data to a display server process; the display server binding the frame cache texture object with the frame cache image data to form a second binding relationship; and the display server calling a glDispatchCompute function through the compute shader CopyImage according to the color cache texture object bound with the color cache image data, the frame cache image data, and the first and second binding relationships, so that the GPU copies the rendered color cache image data in parallel to a frame cache as the frame cache image data.

Description

Data transmission method, device and medium for accelerating image display
Technical Field
The embodiment of the invention relates to the technical field of image processing, in particular to a data transmission method, a device and a medium for accelerating image display.
Background
Currently, computing devices typically employ dedicated graphics hardware, such as a GPU, for graphics rendering. In the graphics rendering process, taking the client-server mode represented by X11 Server as an example, before GPU-rendered data reaches the frame buffer where the display controller can access it, the CPU must move that data from the video memory to the frame buffer through three copy operations. Each copy occupies a large number of CPU clock cycles on high-latency access instructions, and each copy travels over the system bus, so this large volume of frequent copying increases the bandwidth pressure on the system bus.
Disclosure of Invention
In view of the above, embodiments of the present invention are directed to a method, an apparatus, and a medium for data transmission to accelerate image display; the GPU completes the transmission of rendered data from the video memory to the frame buffer area by utilizing the advantage of parallel execution, thereby reducing the occupation of a system bus and releasing CPU resources.
The technical scheme of the embodiment of the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a data transmission method for accelerating image display, where the method includes:
binding the rendered color cache image data with the color cache texture object to form a first binding relationship;
transmitting the color cache texture object bound with the color cache image data to a display server process;
the display server binds the frame cache texture object with the frame cache image data to form a second binding relationship;
and the display server calls a glDispatchCompute function through the compute shader CopyImage according to the color cache texture object bound with the color cache image data, the frame cache image data, and the first and second binding relationships, so that the GPU copies the rendered color cache image data in parallel to a frame cache as the frame cache image data.
In a second aspect, an embodiment of the present invention provides a data transmission apparatus for accelerating image display, where the data transmission apparatus includes: a first binding part, a transmission part, a second binding part, and a copy part; wherein:
the first binding part is configured to bind the rendered color cache image data and the color cache texture object to form a first binding relationship;
the transmission part is configured to transmit the color cache texture object bound with the color cache image data to a display server process;
the second binding part is configured to bind the frame cache texture object with the frame cache image data to form a second binding relationship;
and the copying part is configured to call a glDispatchCompute function through the compute shader CopyImage according to the color cache texture object bound with the color cache image data, the frame cache image data, and the first and second binding relationships, so that the GPU copies the rendered color cache image data in parallel to the frame cache as the frame cache image data.
In a third aspect, an embodiment of the present invention provides a graphics driver architecture based on a Linux system, where the architecture includes a GPU driver and a display server; wherein:
the GPU driver comprises a first binding part and a transmission part in the data transmission device for accelerating image display of the second aspect;
the display server includes a second binding part and a copy part in the data transmission apparatus for accelerating image display according to the second aspect.
In a fourth aspect, an embodiment of the present invention provides a computer storage medium, where a data transmission program for accelerating image display is stored, and the data transmission program for accelerating image display is executed by at least one processor to implement the data transmission method steps for accelerating image display according to the first aspect.
The embodiment of the invention provides a data transmission method, a device and a medium for accelerating image display; and copying the rendered color cache image data to a frame cache by utilizing a glDispatchCompute function in the compute shader and the binding relationship between the cache image data and the texture object. Therefore, the data copying process from the rendering video memory to the frame buffer memory is realized under the control of the GPU, the use frequency of the CPU and the system bus is reduced to the maximum extent, the CPU resource can be released, and the occupation of the system bus bandwidth can be reduced.
Drawings
Fig. 1 is a schematic diagram of a computing device capable of implementing the technical solution of the embodiment of the present invention.
FIG. 2 is a block diagram of an example implementation of the processor, GPU and system memory of FIG. 1.
Fig. 3 is a block diagram of a graphics driver architecture based on a Linux system according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of the conventional process of copying rendered data from the rendering video memory to the frame buffer.
Fig. 5 is a flowchart illustrating a data transmission method for accelerating image display according to an embodiment of the present invention.
FIG. 6 is a schematic diagram of a SIMT model according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of a pixel model for performing copy according to an embodiment of the present invention.
Fig. 8 is a schematic diagram illustrating a data transmission apparatus for accelerating image display according to an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Referring to fig. 1, a computing device 2 capable of implementing aspects of embodiments of the present invention is shown, examples of the computing device 2 including, but not limited to: wireless devices, mobile or cellular telephones, including so-called smart phones, Personal Digital Assistants (PDAs), video game consoles, including video displays, mobile video gaming devices, mobile video conferencing units, laptop computers, desktop computers, television set-top boxes, tablet computing devices, electronic book readers, fixed or mobile media players, and the like. In the example of fig. 1, the computing device 2 may include: processor 6, system memory 10, and GPU 12. Computing device 2 may also include display processor 14, transceiver module 3, user interface 4, and display 8. Transceiver module 3 and display processor 14 may both be part of the same Integrated Circuit (IC) as processor 6 and/or GPU 12, both may be external to one or more ICs that include processor 6 and/or GPU 12, or may be formed in an IC that is external to the IC that includes processor 6 and/or GPU 12.
For clarity, computing device 2 may include additional modules or units not shown in fig. 1. For example, computing device 2 may include a speaker and microphone (both not shown in fig. 1) to enable telephonic communications in instances in which it is a mobile wireless telephone or, alternatively, computing device 2 includes a speaker in the case of a media player. Computing device 2 may also include a camera. Moreover, the various modules and units shown in computing device 2 may not be necessary in every instance of computing device 2. For example, in examples where computing device 2 is a desktop computer or other device equipped to connect with an external user interface or display, user interface 4 and display 8 may be external to computing device 2.
Examples of user interface 4 include, but are not limited to, a trackball, a mouse, a keyboard, and other types of input devices. The user interface 4 may also be a touch screen and may be incorporated as part of the display 8. Transceiver module 3 may include circuitry to allow wireless or wired communication between computing device 2 and another device or a network. Transceiver module 3 may include modulators, demodulators, amplifiers and other such circuitry for wired or wireless communication.
The processor 6 may be a microprocessor, such as a Central Processing Unit (CPU), configured to process instructions of a computer program for execution. Processor 6 may comprise a general-purpose or special-purpose processor that controls operations of computing device 2. A user may provide input to computing device 2 to cause processor 6 to execute one or more software applications. The software applications executing on processor 6 may include, for example, an operating system, a word processor application, an email application, a spreadsheet application, a media player application, a video game application, a graphical user interface application, or another program. Additionally, processor 6 may execute a GPU driver 22 for controlling the operations of GPU 12. A user may provide input to computing device 2 via one or more input devices not shown in the figure, such as a keyboard, a mouse, a microphone, a touch pad, or another input device coupled to computing device 2 via user input interface 4.
A software application executing on processor 6 may include one or more graphics rendering instructions that instruct processor 6 to cause graphics data to be rendered to display 8. In some examples, the software instructions may conform to a graphics Application Programming Interface (API), such as the open graphics library OpenGL API, the open graphics library embedded system (OpenGL ES) API, a Direct3D API, an X3D API, a RenderMan API, a WebGL API, the open computing language (OpenCL™), RenderScript, or any other heterogeneous computing API, or any other public or proprietary standard graphics or computing API. The software instructions may also be instructions for non-rendering algorithms such as computational photography, convolutional neural networks, video processing, scientific applications, and the like. To process the graphics rendering instructions, processor 6 may issue one or more graphics rendering commands to GPU 12 (e.g., by GPU driver 22) to cause GPU 12 to perform some or all of the rendering of the graphics data. In some examples, the graphics data to be rendered may include a list of graphics primitives, such as points, lines, triangles, quadrilaterals, triangle strips, and so forth.
GPU 12 may be configured to perform graphics operations to render one or more graphics primitives to display 8. Thus, when one of the software applications executing on processor 6 requires graphics processing, processor 6 may provide graphics commands and graphics data to GPU 12 for rendering to display 8. Graphics data may include, for example, draw commands, state information, primitive information, texture information, and so forth. In some cases, GPU 12 may be built in with a highly parallel structure that provides more efficient processing of complex graphics related operations than processor 6. For example, GPU 12 may include a plurality of processing elements, such as shader units, that are configured to operate on multiple vertices or pixels in a parallel manner. In some cases, the highly parallel nature of GPU 12 allows GPU 12 to draw graphics images (e.g., GUIs and two-dimensional (2D) and/or three-dimensional (3D) graphics scenes) onto display 8 more quickly than drawing the scenes directly to display 8 using processor 6.
In some cases, GPU 12 may be integrated into the motherboard of computing device 2. In other cases, GPU 12 may be present on a graphics card that is mounted in a port in the motherboard of computing device 2, or may be otherwise incorporated within a peripheral device configured to interoperate with computing device 2. GPU 12 may include one or more processors, such as one or more microprocessors, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), Digital Signal Processors (DSPs), or other equivalent integrated or discrete logic circuitry. GPU 12 may also include one or more processor cores, such that GPU 12 may be referred to as a multicore processor.
In some examples, GPU 12 may store the fully formed image in system memory 10. Display processor 14 may retrieve an image from system memory 10 and output values that cause pixels of display 8 to illuminate to display the image. Display 8 may be a display of computing device 2 that displays image content generated by GPU 12. The display 8 may be a Liquid Crystal Display (LCD), an organic light emitting diode display (OLED), a Cathode Ray Tube (CRT) display, a plasma display, or another type of display device.
Fig. 2 is a block diagram illustrating an example implementation of processor 6, GPU 12, and system memory 10 in fig. 1 in further detail. As shown in fig. 2, processor 6 may execute at least one software application 18, a graphics API 20, and a GPU driver 22, each of which may be one or more software applications or services. In some examples, graphics API 20 and GPU driver 22 may be implemented as hardware units of CPU 6.
Memory available for use by GPU 12 may include a video memory 16, which may store rendered image data, such as pixel data, as well as any other data. In one implementation, the video memory 16 may be part of the system memory 10 or may be separate from the system memory 10.
Video memory 16 stores target pixels for GPU 12. Each target pixel may be associated with a unique screen pixel location. In some examples, the video memory 16 may store the color components and the destination alpha value for each target pixel. For example, the video memory 16 may store red, green, blue, alpha (RGBA) components for each pixel, where the "RGB" components correspond to color values and the "A" component corresponds to the destination alpha value (e.g., an opacity value for image compositing). Although the video memory 16 and the system memory 10 are illustrated as separate memory units, in other examples the video memory 16 may be part of the system memory 10. The video memory 16 may also be capable of storing any suitable data other than pixels.
Software application 18 may be any application that utilizes the functionality of GPU 12. For example, the software application 18 may be a graphics application, an operating system, a portable graphics application, a computer-aided design program for engineering or artistic applications, a video game application, or another type of software application that uses 2D or 3D graphics.
Software application 18 may include one or more drawing instructions that instruct GPU 12 to render a Graphical User Interface (GUI) and/or a graphical scene. For example, the draw instructions may include instructions that define a set of one or more graphics primitives to be rendered by GPU 12. In some examples, the drawing instructions may collectively define all or part of a plurality of windowing surfaces for use in the GUI. In additional examples, the drawing instructions may collectively define all or part of a graphics scene that includes one or more graphics objects within a model space or world space defined by an application.
Software application 18 may invoke GPU driver 22 via graphics API 20 to issue one or more commands to GPU 12 for rendering one or more graphics primitives into a displayable graphics image. For example, software application 18 may invoke GPU driver 22 to provide GPU 12 with primitive definitions. In some cases, the primitive definitions may be provided to GPU 12 in the form of a list of drawing primitives, such as triangles, rectangles, triangle fans, triangle strips, and so forth. The primitive definition may include a vertex specification that specifies one or more vertices associated with the primitive to be rendered. The vertex specification may include location coordinates for each vertex, and in some cases other attributes associated with the vertex, such as color attributes, normal vectors, and texture coordinates. The primitive definition may also include primitive type information (e.g., triangle, rectangle, triangle fan, triangle strip, etc.), scaling information, rotation information, and the like.
Based on the instructions issued by software application 18 to GPU driver 22, GPU driver 22 may formulate one or more commands that specify one or more operations for GPU 12 to perform in order to render the primitive. When GPU 12 receives the command from CPU 6, GPU 12 may execute a graphics processing pipeline using processor cluster 46 in order to decode the command and configure the graphics processing pipeline to perform the operation specified in the command.
For the computing device 2 described in conjunction with fig. 1 and fig. 2 above, fig. 3 shows a block diagram of a graphics driver architecture based on the Linux system according to an embodiment of the present invention. In fig. 3, the graphics driver architecture is divided into a User Space, a Linux Kernel Space, and a hardware level. The user space is further divided into application programs, a rendering API, a display server program, a display driver, and a window manager; the application programs include a 3D engine and a graphics application. The rendering API may be any compliant graphics Application Programming Interface (API) as described above, such as the open graphics library OpenGL API, the open graphics library embedded system (OpenGL ES) API, a Direct3D API, an X3D API, a RenderMan API, a WebGL API, the open computing language (OpenCL™), RenderScript, or any other heterogeneous computing API, or any other public or proprietary standard graphics or computing API; embodiments of the invention are described hereinafter with the OpenGL API as an example. The Linux kernel space includes a Direct Rendering Manager (DRM) driver, i.e., the GPU driver 22, and a Kernel Mode Setting (KMS) driver. The hardware level comprises the processor 6, the system memory 10, and related registers, and also includes GPU 12, video memory 16, and associated registers.
The video memory 16 is generally divided into two parts. One part is used in the GPU rendering process; in the embodiment of the present invention this part may be referred to as the "rendering video memory", and a graphics application, based on a 3D engine, generally manages it by sending instructions through the rendering API to the DRM driver. The other part is managed as a frame buffer by the display driver through the KMS driver; the rendered data stored in this frame buffer part of video memory 16 is transmitted through KMS to the display server program, and the window manager then completes the display-sending work, finishing the whole flow from completion of drawing to display. Here, "display sending" refers to the operation of presenting the image on display 8.
Based on the block diagram shown in fig. 3, in the conventional scheme, the display server and the rendering API belong to two different system processes with independent process virtual address spaces. These address spaces are isolated and cannot access each other, so data can only move between the processes through a network protocol or inter-process communication (IPC). Therefore, when an application program renders data on the GPU through a 3D engine, at least 3 data copies are required to move that data from the rendering video memory to a frame buffer the display server can access, as shown in fig. 4. Specifically: the CPU first transfers the rendered data from the rendering video memory shown by label 1 into the part of system memory 10, shown by label 2, mapped into the virtual address space of the application process; the CPU then transfers the rendered data to the portion of system memory 10, shown by label 3, corresponding to the virtual address space of the display server process; finally, the CPU copies the rendered data from system memory 10 to the frame buffer portion of video memory 16, shown by label 4, via the KMS driver. In this conventional scheme all three data copies are completed by the CPU, which means each copy occupies a large number of CPU clock cycles on high-latency access instructions, usually Load/Store instructions. Meanwhile, both the system memory 10 and the video memory 16 (including the rendering video memory and the frame buffer accessible to the display server) are connected to the CPU through the system bus, so this large volume of frequent copy operations also occupies valuable system bus bandwidth and increases power consumption.
To reduce copy operations between the system memory 10 and the video memory 16, an existing solution utilizes the DMA-BUF shared-buffer technology of the Linux kernel. Its core idea is to encapsulate a buffer: the driver exposes the buffer to user-mode programs as a file descriptor (a DMA-BUF FD), and this FD can be transferred (exported/imported) between processes. The driver also provides user-mode programs with a necessary set of operations on the DMA-BUF; for example, a dma_buf_map operation allows an importer of the DMA-BUF to perform address mapping on the imported buffer, mapping it into the importer's virtual address space so that the importer process can operate on it. Although the DMA-BUF shared-buffer technology eliminates two of the data copies, the remaining copy is still executed by the CPU; and because the physical video memory is connected to the CPU through the system bus, this scheme still occupies system bus resources.
Analysis of the two conventional schemes shows that current GPUs are capable of executing massively programmable parallel tasks, and that a GPU accesses the video memory 16 with lower latency than the CPU does. In view of this, the embodiments of the present invention use the GPU to implement the data copy from the rendering video memory to the frame buffer, reducing the usage frequency of the CPU and the system bus to the greatest extent: CPU resources are released, and the occupation of system bus bandwidth is reduced.
Based on the above, referring to fig. 5, a data transmission method for accelerating image display according to an embodiment of the present invention is shown, where the method may be applied to the Linux system-based graphics driver architecture shown in fig. 3, and the method may include:
S501: binding the rendered color cache image data with the color cache texture object to form a first binding relationship;
S502: transmitting the color cache texture object bound with the color cache image data to a display server process;
S503: the display server binds the frame cache texture object with the frame cache image data to form a second binding relationship;
S504: the display server calls a glDispatchCompute function through the compute shader CopyImage according to the color cache texture object bound with the color cache image data, the frame cache image data, the first binding relationship, and the second binding relationship, so that the GPU copies the rendered color cache image data in parallel to a frame cache as the frame cache image data.
According to the technical scheme, the rendered color cache image data is copied to the frame cache by utilizing the glDispatchCompute function in the compute shader and the binding relationships between the cache image data and the texture objects. The data copying process from the rendering video memory to the frame buffer is thus carried out under the control of the GPU, reducing the usage frequency of the CPU and the system bus to the greatest extent: CPU resources are released, and the occupation of system bus bandwidth is reduced.
For the technical solution shown in fig. 5, in combination with the Linux-based graphics driver architecture shown in fig. 3, in some examples, S501 and S502 may be implemented by OpenGL invoking GPU driver 22; s503 and S504 may be implemented by the display server.
For the technical solution shown in fig. 5, in some possible implementations, the method further includes:
encapsulating the color buffer ColorBuffer into a color buffer texture object ColorBufTex.
For the above implementation, it should be noted that when the glDispatchCompute function is called to perform the copy in step S504, a texture object is usually used as the execution object. Therefore, data such as cache data must be encapsulated as texture objects before the glDispatchCompute function is called, so that the call in S504 can be carried out correctly; accordingly, the above implementation may still be carried out by GPU driver 22. Based on this, in some possible implementations, the method further includes: encapsulating the frame buffer data FrameBuffer into a frame buffer texture object FrameBufTex; this encapsulation may instead be carried out by the display server.
For the technical solution shown in fig. 5, in some possible implementations, binding the rendered color cache image data with the color cache texture object to form a first binding relationship includes:
and calling a glBindImageTexture function to bind the color cache texture object ColorBufTex with the rendered color cache image data ColorBufImg to form the first binding relationship.
Specifically, the color cache image data ColorBufImg in the GPU video memory 16 is bound to the color cache texture object ColorBufTex, and the frame cache texture object is bound to the frame cache image data. According to these two binding relationships, and using the characteristic that the glDispatchCompute function executes on texture objects, the color cache image data ColorBufImg can be copied from the GPU video memory 16 to the frame cache as the frame cache image data bound to the frame cache texture object.
For the technical solution shown in fig. 5, in some possible implementations, the transmitting the color cache texture object bound to the color cache image data to the display server process includes:
the color cache texture object bound to the color cache image data colorbuff is exported by the 3D driver process in the GPU driver 22 and imported to the display server process.
For the technical solution shown in fig. 5, in some possible implementations, the display server binding the frame cache texture object with the frame cache image data to form a second binding relationship includes:
and calling a glBindImageTexture function to bind the frame cache texture object FrameBufTex and the frame cache image data FrameBufImg to form the second binding relationship.
For the technical solution shown in fig. 5, in some possible implementations, the display server calling a glDispatchCompute function through the compute shader CopyImage according to the color cache image data, the frame cache image data, and the first and second binding relationships, so that the GPU copies the rendered color cache image data in parallel to the frame cache, includes:
the display server takes the color cache image data ColorBufImg and the frame cache image data FrameBufImg as input parameters for calculating the shader copy image;
and calling a glDispatchCompute function through the compute shader CopyImage to enable the GPU to execute the rendered color cache image data in parallel as frame cache image data to be copied to a frame cache according to the input parameters, the first binding relation and the second binding relation.
With respect to the technical solution shown in fig. 5 and its implementations, note in particular that most GPU architectures, as processing units capable of running massive computations in parallel, are based on the Single Instruction Multiple Threads (SIMT) execution model, in which the minimum unit of parallel hardware scheduling is called a warp in some examples and a wavefront in others. The SIMT model is illustrated in fig. 6, which shows a schematic of a processing core 300 suitable for the SIMT model. In some examples, the processing core 300 can serve as one of the programmable execution cores in the processor cluster 46 of the GPU shown in fig. 2, implementing highly parallel computation so that a large number of threads execute in parallel, where each thread is an instance of a program. In other examples, the processing core 300 may be implemented as a Streaming Multiprocessor (SM) in a GPU. Within the processing core 300 there may be multiple thread processors, referred to as cores, organized into warps 304, where each core may correspond to a thread. In some examples, corresponding to the processing core 300 being implemented as an SM, a core may be implemented as a Streaming Processor (SP), also referred to as a Compute Unified Device Architecture (CUDA) core. The processing core 300 may contain J warps 304-1 through 304-J, with each warp 304 having K cores 306-1 through 306-K. In some examples, the warps 304-1 through 304-J may be further organized into one or more thread blocks 302.
In some examples, each warp 304 may have 32 cores; in other examples, each warp 304 may have 4, 8, 16, or tens of thousands of cores. It should be understood that these settings are only used to illustrate the technical solution and do not limit its protection scope; those skilled in the art can easily adapt the technical solution explained on the basis of these settings to other situations, which will not be described again here. In some alternative examples, the processing core 300 may organize the cores only into warps 304, omitting the thread-block level of organization.
Further, the processing core 300 may also include a pipeline control unit 308, a shared memory 310, and an array of local memories 312-1 through 312-J associated with the warps 304-1 through 304-J. The pipeline control unit 308 distributes tasks to the warps 304-1 through 304-J over a data bus 314, and it creates, manages, schedules, executes, and provides mechanisms to synchronize the warps 304-1 through 304-J. With continued reference to the processing core 300 shown in fig. 6, the cores 306 within a warp execute in parallel with one another. The warps 304-1 through 304-J communicate with the shared memory 310 over a memory bus 316, and with the local memories 312-1 through 312-J over local buses 318-1 through 318-J, respectively; for example, as shown in fig. 6, warp 304-J communicates with local memory 312-J over local bus 318-J. Some embodiments of the processing core 300 allocate a shared portion of the shared memory 310 to each thread block 302 and allow all warps 304 within a thread block 302 to access that shared portion. Some embodiments have the warps 304 use only the local memories 312; many other embodiments have the warps 304 balance the use of the local memories 312 and the shared memory 310. When fig. 5 and its implementations are put into practice, they may be based on the SIMT execution model shown in fig. 6, which the embodiment of the present invention will not describe in detail.
For the foregoing technical solution, at the software level the implementation may be written in any high-level shading language or assembly language, as long as the written program can be compiled into machine instructions executable by the underlying GPU. The technical scheme of the embodiment of the present invention can be implemented in the OpenGL Shading Language (GLSL); GLSL is a high-level shading language based on the C language and established by the OpenGL ARB, which gives developers more direct control over the GPU without resorting to assembly language or a hardware specification language. Moreover, GLSL is rich in features and is in fact a collective term for several related shading languages, including: Vertex, Tessellation Control, Tessellation Evaluation, Geometry, Fragment, and the Compute shading language introduced in OpenGL GLSL 4.3. The technical scheme can be implemented in the Compute shading language.
It should be noted that, since a GLSL shader is not an independent application program, it requires the support of the OpenGL API, which is generally provided by GPU hardware vendors through a GPU driver. For the above technical solution, compared with the architecture shown in fig. 3, two components are modified: first, the GPU driver controlled by the OpenGL library is modified to execute S501 and S502 shown in fig. 5; second, the display server program is modified to execute S503 and S504 shown in fig. 5.
For a specific implementation of the compute shader CopyImage, a pseudo-code source program is as follows:
#version 430 core
// Parameters related to the GPU architecture
layout(local_size_x = 32, local_size_y = 32, local_size_z = 1) in;
// Image variable bound to ColorBufTex as copy source
layout (binding = 0, rgba32f) readonly uniform image2D ColorBufImg;
// Image variable bound to FrameBufTex as copy destination
layout (binding = 1, rgba32f) writeonly uniform image2D FrameBufImg;
void main()
{
ivec2 pixel_coord = ivec2(gl_GlobalInvocationID.xy);
vec4 pixel;
pixel = imageLoad(ColorBufImg, pixel_coord);
imageStore(FrameBufImg, pixel_coord, pixel);
}
After the source program is compiled, the copy can be executed in parallel using the SIMT model shown in fig. 6. The degree of parallelism depends on the architecture of the GPU itself, that is, on the number of execution cores in one execution cluster, and on the height and width of the rendered result picture (in pixels; the height and width are assumed here to be powers of 2). The dispatch call is as follows:
glDispatchCompute(width / 32, height / 32, 1);
It should be noted that this call may execute the copies of pixels at different positions concurrently with width × height GPU threads, depending on the number of execution cores in an execution cluster of the specific GPU; for example, if one execution cluster of the system's GPU has 32 × 32 execution cores, then 32 × 32 GPU threads execute pixel copies at different positions in parallel. The pixel model during execution is shown schematically in fig. 7: each square represents a pixel, the parameter num_workgroup_x represents width, and num_workgroup_y represents height.
Based on the same inventive concept as the foregoing technical solution, referring to fig. 8, a data transmission apparatus 80 for accelerating image display according to an embodiment of the present invention is shown. The data transmission apparatus 80 includes: a first binding part 801, a transmission part 802, a second binding part 803, and a copy part 804; the first binding part 801 and the transmission part 802 may be implemented by the GPU driver, and the second binding part 803 and the copy part 804 may be implemented by the display server. Specifically,
the first binding part 801 is configured to bind the rendered color cache image data with the color cache texture object to form a first binding relationship;
the transmission part 802 is configured to transmit the color cache texture object bound with the color cache image data to a display server process;
the second binding part 803 is configured to bind the frame cache texture object with the frame cache image data to form a second binding relationship;
the copy part 804 is configured to call the glDispatchCompute function through the compute shader CopyImage according to the color cache texture object bound with the color cache image data, the frame cache image data, and the first and second binding relationships, so that the GPU copies the rendered color cache image data in parallel to the frame cache.
In the above scheme, the first binding part 801 is further configured to:
package the color buffer into a color cache texture object.
In the above solution, the first binding portion 801 is configured to:
call the glBindImageTexture function to bind the color cache texture object ColorBufTex with the rendered color cache image data ColorBufImg, forming the first binding relationship.
In the above scheme, the transmitting part 802 is configured to:
export the color cache texture object bound with the color cache image data ColorBufImg through the 3D driver process in the GPU driver, and import it into the display server process.
In the above solution, the second binding part 803 is configured to:
call the glBindImageTexture function to bind the frame cache texture object FrameBufTex with the frame cache image data FrameBufImg, forming the second binding relationship.
In the above scheme, the copy part 804 is configured to:
take the color cache image data ColorBufImg and the frame cache image data FrameBufImg as input parameters of the compute shader CopyImage;
call the glDispatchCompute function through the compute shader CopyImage, so that the GPU, according to the input parameters, the first binding relationship and the second binding relationship, copies the rendered color cache image data in parallel to the frame cache as the frame cache image data.
It is understood that in this embodiment a "part" may be part of a circuit, part of a processor, part of a program or software, and so on; it may also be a unit, and it may be modular or non-modular.
In addition, the components in this embodiment may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional module.
Based on such understanding, the technical solution of the present embodiment, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method of the present embodiment. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Therefore, the present embodiment provides a computer storage medium, which stores a data transmission program for accelerating image display, and when the data transmission program is executed by at least one processor, the data transmission program for accelerating image display implements the steps of the data transmission method for accelerating image display in the above technical solution.
It is understood that the technical solution exemplified by the data transmission apparatus 80 for accelerating image display belongs to the same concept as the technical solution of the data transmission method for accelerating image display; therefore, for details of the apparatus 80 not described in full, reference may be made to the description of the method. The embodiment of the present invention will not elaborate further here.
It should be noted that: the technical schemes described in the embodiments of the present invention can be combined arbitrarily without conflict.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A data transmission method for accelerating image display, the method comprising:
binding the rendered color cache image data with the color cache texture object to form a first binding relationship;
transmitting the color cache texture object bound with the color cache image data to a display server process;
the display server binds the frame cache texture object with the frame cache image data to form a second binding relationship;
the display server calls the glDispatchCompute function through the compute shader CopyImage according to the color cache texture object bound with the color cache image data, the frame cache image data, and the first binding relationship and the second binding relationship, so that the color cache image data rendered by the GPU is copied in parallel to a frame cache as the frame cache image data.
2. The method of claim 1, further comprising:
packaging the color buffer into a color cache texture object.
3. The method according to claim 1, wherein the binding the rendered color cache image data with the color cache texture object to form the first binding relationship comprises:
calling a glBindImageTexture function to bind the color cache texture object ColorBufTex with the rendered color cache image data ColorBufImg to form the first binding relationship.
4. The method of claim 1, wherein transmitting the color cache texture object bound with the color cache image data to a display server process comprises:
exporting the color cache texture object bound with the color cache image data ColorBufImg through a 3D driver process in a GPU driver, and importing the color cache texture object into a display server process.
5. The method of claim 1, wherein the display server binds the frame buffer texture object with the frame buffer image data to form a second binding relationship, comprising:
calling a glBindImageTexture function to bind the frame cache texture object FrameBufTex with the frame cache image data FrameBufImg to form the second binding relationship.
6. The method of claim 1, wherein the display server calling the glDispatchCompute function through the compute shader CopyImage according to the color cache texture object bound with the color cache image data, the frame cache image data, and the first and second binding relationships, to copy the color cache image data rendered by the GPU in parallel to a frame cache as the frame cache image data, comprises:
the display server taking the color cache image data ColorBufImg and the frame cache image data FrameBufImg as input parameters of the compute shader CopyImage;
calling the glDispatchCompute function through the compute shader CopyImage, so that the GPU, according to the input parameters, the first binding relationship and the second binding relationship, copies the rendered color cache image data in parallel to the frame cache as the frame cache image data.
7. A data transmission apparatus for accelerating image display, the data transmission apparatus comprising: a first binding part, a transmission part, a second binding part and a copy part; wherein,
the first binding part is configured to bind the rendered color cache image data and the color cache texture object to form a first binding relationship;
the transmission part is configured to transmit the color cache texture object bound with the color cache image data to a display server process;
the second binding part is configured to bind the frame cache texture object with the frame cache image data to form a second binding relationship;
the copy part is configured to call the glDispatchCompute function through the compute shader CopyImage according to the color cache texture object bound with the color cache image data, the frame cache image data, and the first and second binding relationships, so that the GPU copies the rendered color cache image data in parallel to the frame cache.
8. The apparatus of claim 7, wherein the copy portion is configured to:
taking the color cache image data ColorBufImg and the frame cache image data FrameBufImg as input parameters of the compute shader CopyImage;
calling the glDispatchCompute function through the compute shader CopyImage, so that the GPU, according to the input parameters, the first binding relationship and the second binding relationship, copies the rendered color cache image data in parallel to the frame cache as the frame cache image data.
9. A Linux system-based graphics driver architecture, characterized in that the architecture comprises a GPU driver and a display server; wherein,
the GPU driver comprising a first binding part and a transmitting part in the data transmitting device for accelerating image display according to claim 7;
the display server includes the second binding part and the copy part in the data transmission apparatus for accelerating display of an image according to claim 7.
10. A computer storage medium, characterized in that the computer storage medium stores a data transfer program for accelerating image display, which when executed by at least one processor implements the data transfer method steps for accelerating image display according to any one of claims 1 to 6.
CN202210903364.5A 2022-07-29 2022-07-29 Data transmission method, device and medium for accelerating image display Active CN114972607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210903364.5A CN114972607B (en) 2022-07-29 2022-07-29 Data transmission method, device and medium for accelerating image display


Publications (2)

Publication Number Publication Date
CN114972607A true CN114972607A (en) 2022-08-30
CN114972607B CN114972607B (en) 2022-10-21

Family

ID=82970282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210903364.5A Active CN114972607B (en) 2022-07-29 2022-07-29 Data transmission method, device and medium for accelerating image display

Country Status (1)

Country Link
CN (1) CN114972607B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115665342A (en) * 2022-09-09 2023-01-31 维沃移动通信有限公司 Image processing method, image processing circuit, electronic device, and readable storage medium
CN117435112A (en) * 2023-12-20 2024-01-23 摩尔线程智能科技(成都)有限责任公司 Data processing method, system and device, electronic equipment and storage medium
CN117453170A (en) * 2023-12-25 2024-01-26 西安芯云半导体技术有限公司 Display control method, device and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102509336A (en) * 2011-10-24 2012-06-20 克拉玛依红有软件有限责任公司 GPU (graphic processing unit) acceleration real-time three-dimensional rendering method
US20140071141A1 (en) * 2012-09-13 2014-03-13 Ati Technologies Ulc Rendering settings in a multi-graphics processing unit system
US20170116701A1 (en) * 2015-10-23 2017-04-27 Qualcomm Incorporated Gpu operation algorithm selection based on command stream marker
WO2018086295A1 (en) * 2016-11-08 2018-05-17 华为技术有限公司 Application interface display method and apparatus
CN109194960A (en) * 2018-11-13 2019-01-11 北京奇艺世纪科技有限公司 A kind of picture frame rendering method, device and electronic equipment
CN110555234A (en) * 2019-07-25 2019-12-10 北京中水科水电科技开发有限公司 real-time interactive flood routing simulation visualization method for Web end
CN112381918A (en) * 2020-12-03 2021-02-19 腾讯科技(深圳)有限公司 Image rendering method and device, computer equipment and storage medium
CN113034339A (en) * 2020-10-26 2021-06-25 中国人民解放军92942部队 Method for improving vibration data transmission bandwidth based on GPU acceleration
CN114570020A (en) * 2022-03-03 2022-06-03 阿里巴巴(中国)有限公司 Data processing method and system

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102509336A (en) * 2011-10-24 2012-06-20 克拉玛依红有软件有限责任公司 GPU (graphic processing unit) acceleration real-time three-dimensional rendering method
US20140071141A1 (en) * 2012-09-13 2014-03-13 Ati Technologies Ulc Rendering settings in a multi-graphics processing unit system
US20170116701A1 (en) * 2015-10-23 2017-04-27 Qualcomm Incorporated Gpu operation algorithm selection based on command stream marker
CN108140234A (en) * 2015-10-23 2018-06-08 高通股份有限公司 GPU operation algorithms selection based on order flow label
WO2018086295A1 (en) * 2016-11-08 2018-05-17 华为技术有限公司 Application interface display method and apparatus
CN109194960A (en) * 2018-11-13 2019-01-11 北京奇艺世纪科技有限公司 A kind of picture frame rendering method, device and electronic equipment
CN110555234A (en) * 2019-07-25 2019-12-10 北京中水科水电科技开发有限公司 real-time interactive flood routing simulation visualization method for Web end
CN113034339A (en) * 2020-10-26 2021-06-25 中国人民解放军92942部队 Method for improving vibration data transmission bandwidth based on GPU acceleration
CN112381918A (en) * 2020-12-03 2021-02-19 腾讯科技(深圳)有限公司 Image rendering method and device, computer equipment and storage medium
CN114570020A (en) * 2022-03-03 2022-06-03 阿里巴巴(中国)有限公司 Data processing method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
C.M. MILLER; M.W. JONES: ""Texturing and hypertexturing of volumetric objects"", 《IEEE》 *
于平: "基于GPU加速的辐射度光照算法的研究及应用", 《国外电子测量技术》 *
江帆等: "SurfaceFlinger在X Window系统环境下的运行方案", 《计算机系统应用》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115665342A (en) * 2022-09-09 2023-01-31 维沃移动通信有限公司 Image processing method, image processing circuit, electronic device, and readable storage medium
CN115665342B (en) * 2022-09-09 2024-03-05 维沃移动通信有限公司 Image processing method, image processing circuit, electronic device, and readable storage medium
CN117435112A (en) * 2023-12-20 2024-01-23 摩尔线程智能科技(成都)有限责任公司 Data processing method, system and device, electronic equipment and storage medium
CN117435112B (en) * 2023-12-20 2024-04-05 摩尔线程智能科技(成都)有限责任公司 Data processing method, system and device, electronic equipment and storage medium
CN117453170A (en) * 2023-12-25 2024-01-26 西安芯云半导体技术有限公司 Display control method, device and storage medium
CN117453170B (en) * 2023-12-25 2024-03-29 西安芯云半导体技术有限公司 Display control method, device and storage medium

Also Published As

Publication number Publication date
CN114972607B (en) 2022-10-21

Similar Documents

Publication Publication Date Title
CN114972607B (en) Data transmission method, device and medium for accelerating image display
US10949944B2 (en) System and method for unified application programming interface and model
KR101563098B1 (en) Graphics processing unit with command processor
US9715750B2 (en) System and method for layering using tile-based renderers
US11798123B2 (en) Mechanism to accelerate graphics workloads in a multi-core computing architecture
CN111737019B (en) Method and device for scheduling video memory resources and computer storage medium
US10636112B2 (en) Graphics processor register data re-use mechanism
CN110928695A (en) Management method and device for video memory and computer storage medium
CN108027955B (en) Techniques for storage of bandwidth-compressed graphics data
CN108027956B (en) Dynamic switching between late depth testing and conservative depth testing
CN113342703B (en) Rendering effect real-time debugging method and device, development equipment and storage medium
US11120591B2 (en) Variable rasterization rate
CN116185743B (en) Dual graphics card contrast debugging method, device and medium of OpenGL interface
CN111311478B (en) Pre-reading method and device for GPU rendering core data and computer storage medium
CN116821040B (en) Display acceleration method, device and medium based on GPU direct memory access
TW201926239A (en) Tile-based low-resolution depth storage
CN112686797A (en) Target frame data acquisition method and device for GPU (graphics processing Unit) function verification and storage medium
CN113467959A (en) Method, device and medium for determining task complexity applied to GPU
CN117170883B (en) Method, device, equipment and storage medium for rendering display
CN112991143A (en) Method and device for assembling graphics primitives and computer storage medium
CN116909511A (en) Method, device and storage medium for improving double-buffer display efficiency of GPU (graphics processing Unit)
CN113256764A (en) Rasterization device and method and computer storage medium
CN111127620B (en) Method, device and computer storage medium for generating hemispherical domain sampling mode
US10678553B2 (en) Pro-active GPU hardware bootup
CN115375821A (en) Image rendering method and device and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room T1 301, Taiwei intelligent chain center, No. 8, Tangyan South Road, high tech Zone, Xi'an, Shaanxi 710065

Patentee after: Nanjing Sietium Semiconductor Co.,Ltd.

Address before: 265503 No. 402, No. 7, No. 300, Changjiang Road, economic and Technological Development Zone, Yantai City, Shandong Province

Patentee before: Yantai Xintong Semiconductor Technology Co.,Ltd.