WO2015067043A1 - Gpu virtualization realization method as well as vertex data caching method and related device - Google Patents

GPU virtualization realization method as well as vertex data caching method and related device

Info

Publication number
WO2015067043A1
WO2015067043A1 PCT/CN2014/079557 CN2014079557W WO2015067043A1 WO 2015067043 A1 WO2015067043 A1 WO 2015067043A1 CN 2014079557 W CN2014079557 W CN 2014079557W WO 2015067043 A1 WO2015067043 A1 WO 2015067043A1
Authority
WO
WIPO (PCT)
Prior art keywords
vertex
data
buffer area
graphics
array
Prior art date
Application number
PCT/CN2014/079557
Other languages
French (fr)
Chinese (zh)
Other versions
WO2015067043A9 (en
Inventor
徐利成
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2015067043A1 publication Critical patent/WO2015067043A1/en
Publication of WO2015067043A9 publication Critical patent/WO2015067043A9/en

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/005 General purpose rendering architectures

Definitions

  • GPU virtualization implementation method and vertex data caching method and related device. This application claims priority to Chinese Patent Application No. 201310554845.0, filed with the Chinese Patent Office on November 8, 2013 and entitled "GPU virtualization implementation method and vertex data caching method and related device", the entire contents of which are incorporated herein by reference.
  • The present invention relates to the field of virtualization technologies, and in particular, to a GPU virtualization implementation method and a vertex data caching method and related apparatus.
  • Background
  • GPU Graphic Processing Unit
  • GPU virtualization technology allows virtualized instances running on data center servers to share one or more GPUs for graphics operations. Among products that have already been implemented, virtualization solutions based on DirectX 3D are relatively mature, and their performance and user experience are close to those of a physical machine. In the more widely used high-definition graphics field, however, most 3D software is based on the OpenGL (Open Graphics Library) specification, and supporting it remains the most difficult application problem for enterprises.
  • OpenGL Open Graphics Library
  • Chromium essentially implements a cross-network remote rendering process.
  • Vertex arrays allow OpenGL drivers to fetch attributes such as vertices, colors, and normal vectors directly from the application's memory.
  • The use of vertex arrays minimizes function call overhead and reduces the amount of data that must be packed into the command cache in the display driver.
  • the vertex array pointers intercepted from the application layer are allocated on the graphics client. If the vertex array pointer is directly transmitted to the graphics server for use, an error will be generated.
  • Chromium decomposes a glArrayElement instruction call into equivalent glVertex3f, glNormal3f, glColor3f, or glTexCoord2f calls, which converts the pointer-passing instruction glArrayElement into a series of value-passing instructions.
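  • The decomposition described above can be sketched as follows. This is a minimal illustration, not Chromium's actual code: instructions are modelled as plain tuples, and the helper names and the 50-element array are assumptions made for the example. The normal and color calls are emitted before glVertex3f because, in immediate-mode OpenGL, the vertex call is the one that submits the vertex.

```python
# Illustrative only: instruction records are plain tuples, not real OpenGL calls.

def decompose_array_element(index, vertices, normals, colors):
    """Expand one pointer-passing glArrayElement(index) call into
    value-passing calls that carry the attribute values themselves."""
    return [
        ("glNormal3f", *normals[index]),
        ("glColor3f", *colors[index]),
        ("glVertex3f", *vertices[index]),  # the vertex call comes last
    ]


def decompose_draw(count, vertices, normals, colors):
    """Decompose a draw that walks `count` elements of the arrays."""
    calls = []
    for i in range(count):
        calls.extend(decompose_array_element(i, vertices, normals, colors))
    return calls


# Hypothetical application-side arrays (50 elements chosen arbitrarily).
vertices = [(float(i), 0.0, 0.0) for i in range(50)]
normals = [(0.0, 0.0, 1.0)] * 50
colors = [(1.0, 1.0, 1.0)] * 50

decomposed = decompose_draw(50, vertices, normals, colors)
print(len(decomposed))  # number of value-passing instructions to transmit
```

  With three attributes per element, a single draw over 50 elements already turns into 150 value-passing instructions, which shows how the instruction count grows with array size.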
  • The number of instructions after decomposition is more than 100 times the number before decomposition, so the amount of data transmitted over the network increases sharply, which introduces substantial delay, occupies the bandwidth of the transmission channel, increases the CPU consumption of memory sharing, and results in low VM (Virtual Machine) density and high cost.
  • Summary of the invention
  • Embodiments of the present invention provide a GPU virtualization implementation method and a vertex data caching method and related apparatus, which can greatly reduce delay and the bandwidth occupied on the transmission channel, reduce the CPU consumption of memory sharing, increase VM density, and reduce cost.
  • A first aspect provides a GPU virtualization implementation method, comprising: intercepting, by a graphics client, a vertex array class instruction; performing vertex data caching to create a first buffer area, and sending a synchronization instruction to a graphics server to create a second buffer area, where the second buffer area and the first buffer area form a mapping relationship of vertex data, and the vertex data is obtained from the vertex array class instruction and includes a vertex array pointer and a vertex array length; and querying local data: if vertex data consistent with the intercepted vertex data exists in the local data, packaging the vertex array class instruction and sending it to the graphics server, so that the graphics server renders an image according to the vertex data of the second buffer area and the packaged vertex array class instruction; if not, decomposing the vertex array class instruction and sending it to the graphics server, so that the graphics server renders the image according to the decomposed vertex array class instruction.
  • The method further includes: receiving, by the graphics client through the data channel, the image sent by the graphics server, and pasting it to the graphics device interface; and redirecting the vertex array class instruction to the TC end through the graphics device interface, so as to execute the vertex array class instruction and generate the screen image.
  • Performing vertex data caching to create the first buffer area includes: if the newly added vertex data is historical data, but the cached first buffer area has been released or its vertex array length needs to be updated to a larger value, creating a temporary buffer area; copying the newly added vertex data into the temporary buffer area; and copying the vertex data from the temporary buffer area to the first buffer area.
  • Performing vertex data caching to create the first buffer area, sending the synchronization instruction to the graphics server to create the second buffer area, and forming the mapping relationship of vertex data between the second buffer area and the first buffer area includes: performing vertex data caching and creating the first buffer area; and sending a synchronization instruction to the graphics server to create the second buffer area, where the synchronization instruction includes a vertex array pointer, and the second buffer area forms the mapping relationship of vertex data with the first buffer area through the vertex array pointer.
  • the first buffer area is located in the graphics client.
  • the first buffer area is located in the shared memory.
  • A second aspect provides a GPU virtualization implementation method, including: receiving a synchronization instruction and creating a second buffer area for vertex data caching, where the second buffer area and a first buffer area of the graphics client form a mapping relationship of vertex data, and the vertex data includes a vertex array pointer and a vertex array length; and determining, according to the vertex array pointer, whether corresponding vertex data is cached in the second buffer area: if so, receiving the packaged vertex array class instruction sent by the graphics client through the data channel, and rendering an image according to the vertex data of the second buffer area and the packaged vertex array class instruction, for transmission to the graphics client; if not, receiving the decomposed vertex array class instruction sent by the graphics client, and rendering the image according to the decomposed vertex array class instruction, for transmission to the graphics client.
  • Receiving the synchronization instruction and creating the second buffer area for vertex data caching, where the second buffer area and the first buffer area of the graphics client form a mapping relationship of vertex data, includes: receiving a synchronization instruction sent by the graphics client, where the synchronization instruction includes a vertex array pointer; and creating the second buffer area according to the synchronization instruction to perform vertex data caching, where the second buffer area forms the mapping relationship of vertex data with the first buffer area of the graphics client through the vertex array pointer.
  • the second buffer is located in the graphics server.
  • the second buffer is located in the shared memory.
  • A third aspect provides a method for caching vertex data in a GPU, comprising: creating, by a graphics client, a first buffer area to perform vertex data caching, where the vertex data includes a vertex array pointer and a vertex array length; sending a synchronization instruction to the graphics server, where the synchronization instruction includes the vertex array pointer; and creating, by the graphics server according to the synchronization instruction, a second buffer area to perform vertex data caching, where the second buffer area forms a mapping relationship of vertex data with the first buffer area through the vertex array pointer.
  • The vertex data caching uses a cache unit mode as the carrier for learning, prediction, and correction, which covers learning, prediction, and correction of the vertex array pointer and of the vertex array length.
  • The cache unit mode includes: indicating a first address of the vertex array and a length of each byte; and drawing the geometric unit according to the offset from the first address.
  • The learning, prediction, and correction of the vertex array pointer includes: obtaining a vertex array class instruction; using the vertex array pointer for a hash lookup; determining whether the lookup hits; if so, setting the current cached data pointer as the vertex pointer used for drawing; if not, adding the vertex array pointer and related feature information to the Hashtable; and transparently transmitting the cached data pointer.
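  • The pointer lookup above can be sketched as follows, assuming a Python dict stands in for the Hashtable and that the "related feature information" is just the array length and a hit counter. Both of those choices, and the class and method names, are illustrative assumptions rather than details from the source.

```python
class PointerCache:
    """Hash table keyed by vertex array pointer (hypothetical structure)."""

    def __init__(self):
        self.table = {}  # vertex array pointer -> feature information

    def lookup(self, pointer, length):
        """Return (hit, entry). On a miss, the pointer and its feature
        information are learned so that a later lookup can hit."""
        entry = self.table.get(pointer)
        if entry is not None:
            entry["hits"] += 1
            return True, entry
        entry = {"length": length, "hits": 0}
        self.table[pointer] = entry
        return False, entry


cache = PointerCache()
first_hit, _ = cache.lookup(0x1000, 36)    # first sighting: learn the pointer
second_hit, _ = cache.lookup(0x1000, 36)   # same pointer again: lookup hits
```

  On a hit, the caller would use the cached data for drawing; on a miss, the instruction would still have to take the slower path for this frame.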
  • The learning, prediction, and correction of the vertex array length includes: obtaining a vertex instruction; determining whether the vertex data has been cached; if so, determining whether the cached vertex data exists in the local data, and if yes, transparently transmitting the vertex pointer, or if not, decomposing the vertex pointer; if the vertex data has not been cached, determining whether the vertex array length needs to be updated, and if so, updating the vertex array length, or if not, decomposing the vertex pointer. The local data is vertex data pre-existing in the graphics client, which can be sent to and used by the graphics server without being decomposed.
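  • The length-handling branches above can be read as a small decision function. The sketch below is one interpretation under stated assumptions: "cached and present in the local data" is modelled as a byte-for-byte match against a cached copy, and "length needs to be updated" is modelled as the new length exceeding the largest length learned so far for that pointer. The function and constant names are hypothetical.

```python
PASS_THROUGH = "pass_through"    # transmit the vertex pointer transparently
DECOMPOSE = "decompose"          # fall back to value-passing instructions
UPDATE_LENGTH = "update_length"  # learn a larger vertex array length


def decide(pointer, length, data, cached, learned_length):
    """Choose how to handle one vertex instruction.

    cached:         dict pointer -> bytes already cached on the client
    learned_length: dict pointer -> largest length seen so far
    """
    if pointer in cached:
        # Cached: pass through only if the cached copy matches the data.
        if cached[pointer] == data:
            return PASS_THROUGH
        return DECOMPOSE
    # Not cached: check whether the learned length must grow.
    if length > learned_length.get(pointer, 0):
        learned_length[pointer] = length
        return UPDATE_LENGTH
    return DECOMPOSE


cached = {0x1000: b"abc"}
learned = {}
print(decide(0x1000, 3, b"abc", cached, learned))
```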
  • A fourth aspect provides a GPU graphics client, including an instruction acquisition module, a first cache module, a query module, and a sending module, where: the instruction acquisition module is configured to intercept a vertex array class instruction; the first cache module is configured to perform vertex data caching to create a first buffer area, and a synchronization instruction is sent to the graphics server to create a second buffer area, where the second buffer area forms a mapping relationship of vertex data with the first buffer area, and the vertex data is obtained from the vertex array class instruction and includes the vertex array pointer and the vertex array length; and the query module is configured to query local data. If vertex data consistent with the intercepted vertex data exists in the local data, the sending module packages the vertex array class instruction and sends it to the graphics server, so that the graphics server renders the image according to the vertex data of the second buffer area and the packaged vertex array class instruction. If not, the sending module decomposes the vertex array class instruction and sends it to the graphics server, so that the graphics server renders the image according to the decomposed vertex array class instruction. The local data is vertex data pre-existing in the graphics client, which can be sent to and used by the graphics server without decomposition.
  • The graphics client further includes a first receiving module and a graphics device interface, where: the first receiving module is configured to receive the image through the data channel and paste it to the graphics device interface; and the graphics device interface redirects the vertex array class instruction to the TC end to execute the vertex array class instruction and generate the screen image.
  • The sending module further sends a synchronization instruction to the graphics server, where the synchronization instruction includes a vertex array pointer, and the first buffer area forms a mapping relationship of vertex data with the second buffer area of the graphics server through the vertex array pointer.
  • The first cache module is further configured to: create a temporary buffer area; copy the newly added vertex data into the temporary buffer area; and copy the vertex data from the temporary buffer area to the first buffer area.
  • A fifth aspect provides a GPU graphics server, including a second cache module, a second receiving module, and a rendering module, where: the second cache module is configured to create a second buffer area for vertex data caching, the second buffer area and the first buffer area of the graphics client form a mapping relationship of vertex data, and the vertex data includes a vertex array pointer and a vertex array length; and the second receiving module is configured to determine, according to the vertex array pointer, whether corresponding vertex data is cached in the second buffer area. If so, the second receiving module receives the packaged vertex array class instruction sent by the graphics client, and the rendering module renders the image according to the vertex data of the second buffer area and the packaged vertex array class instruction, for sending to the graphics client. If not, the second receiving module receives the decomposed vertex array class instruction sent by the graphics client, and the rendering module renders the image according to the decomposed vertex array class instruction, for sending to the graphics client.
  • The second cache module further receives a synchronization instruction sent by the graphics client, where the synchronization instruction includes a vertex array pointer; the second cache module creates the second buffer area according to the synchronization instruction to cache the vertex data, and the second buffer area forms a mapping relationship of vertex data with the first buffer area of the graphics client through the vertex array pointer.
  • A sixth aspect provides an apparatus for caching vertex data in a GPU, comprising: a first cache module, configured to create a first buffer area on a graphics client and perform vertex data caching, where the vertex data includes a vertex array pointer and a vertex array length; a sending module, configured to send a synchronization instruction to the graphics server, where the synchronization instruction includes the vertex array pointer; and a second cache module, configured to create, on the graphics server according to the synchronization instruction, a second buffer area and perform vertex data caching, where the second buffer area forms a mapping relationship of vertex data with the first buffer area through the vertex array pointer.
  • The first cache module uses the cache unit mode as the carrier for learning, prediction, and correction of the vertex array pointer and the vertex array length.
  • The cache unit mode includes: indicating a first address of the vertex array and a length of each byte; and drawing the geometric unit according to the offset from the first address.
  • The first cache module is configured to: obtain a vertex array class instruction; use the vertex array pointer for a hash lookup; determine whether the lookup hits; if so, set the current cached data pointer as the vertex pointer used for drawing; if not, add the vertex array pointer and related feature information to the Hashtable; and transparently transmit the cached data pointer.
  • The first cache module is configured to: obtain a vertex instruction; determine whether the vertex data has been cached; if so, determine whether the cached vertex data exists in the local data, and if yes, transparently transmit the vertex pointer, or if not, decompose the vertex pointer; if the vertex data has not been cached, determine whether the vertex array length needs to be updated, and if so, update the vertex array length, or if not, decompose the vertex pointer. The local data is vertex data pre-existing in the graphics client, which can be sent to and used by the graphics server without decomposition.
  • In the present invention, the graphics client intercepts the vertex array class instruction; performs vertex data caching to create the first buffer area, and sends the synchronization instruction to the graphics server to create the second buffer area, where the second buffer area and the first buffer area form a mapping relationship of vertex data; and queries the local data. If vertex data consistent with the intercepted vertex data exists in the local data, the vertex array class instruction is packaged and sent to the graphics server, so that the graphics server renders the image according to the vertex data of the second buffer area and the packaged vertex array class instruction. If not, the vertex array class instruction is decomposed and sent to the graphics server, so that the graphics server renders the image according to the decomposed vertex array class instruction.
  • Once the second buffer area and the first buffer area form the mapping relationship of vertex data, such vertex array class instructions no longer need to be decomposed, which solves the problem that a vertex array class instruction passed directly to the graphics server would generate an error. Even though some vertex array class instructions still need to be decomposed, the total number of instructions to be transmitted is greatly reduced, which shortens the time required to transmit all instructions and lowers bandwidth consumption. Delay and transmission channel bandwidth can therefore be greatly reduced, the CPU consumption of memory sharing is reduced, VM density is increased, and cost is reduced.
  • FIG. 1 is a schematic structural diagram of a system for implementing GPU virtualization according to a first embodiment of the present invention
  • FIG. 2 is a schematic flowchart of a method for implementing GPU virtualization according to a first embodiment of the present invention
  • FIG. 3 is a schematic flowchart of a method for implementing GPU virtualization according to a second embodiment of the present invention
  • FIG. 4 is a schematic flowchart of a method for buffering vertex data in a GPU according to a first embodiment of the present invention
  • FIG. 5 is a schematic diagram showing a structure of a cache unit mode of a method for buffering vertex data in a GPU according to a first embodiment of the present invention
  • FIG. 6 is a flow chart showing a method for learning, predicting, and correcting a vertex array pointer in a method for buffering vertex data in a GPU according to a first embodiment of the present invention
  • FIG. 7 is a flow chart showing a method for learning, predicting, and correcting a vertex array length in a method for buffering vertex data in a GPU according to a first embodiment of the present invention
  • FIG. 8 is a flow chart showing the process of updating the vertex data in the method for buffering vertex data in a GPU according to the first embodiment of the present invention
  • FIG. 9 is a schematic structural diagram of a GPU graphics client according to a first embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of a GPU graphics server according to a first embodiment of the present invention
  • FIG. 11 is a schematic structural diagram of an apparatus for buffering vertex data in a GPU according to a first embodiment of the present invention
  • FIG. 12 is a schematic structural diagram of a GPU graphics client according to a second embodiment of the present invention
  • FIG. 13 is a schematic structural diagram of a GPU graphics server according to a second embodiment of the present invention
  • FIG. 14 is a schematic structural diagram of a system for implementing GPU virtualization according to a second embodiment of the present invention.
  • Detailed description
  • FIG. 1 is a schematic structural diagram of a system for implementing GPU virtualization according to a first embodiment of the present invention.
  • The GPU virtualization implementation system 10 includes a graphics client 11, a graphics server 12, a data channel 13, a graphics card 14, and a TC (Thin Client) end 15, where the graphics client 11 includes a GDI (Graphics Device Interface) 110.
  • the graphics client 11 is connected to the graphics server 12 via a data channel 13, the graphics card 14 is connected to the graphics server 12, and the TC terminal 15 is connected to the graphics device interface 110 of the graphics client 11.
  • GDI Graphic Device Interface
  • the graphics client 11 intercepts the vertex array class instructions, creates the first buffer area 111, performs vertex data buffering, and sends synchronization instructions to the graphics server 12 via the data channel 13.
  • the vertex data is obtained from the vertex array class instruction, including the vertex array pointer and the vertex array length, and the synchronization instruction includes the vertex array pointer and the content of the vertex array.
  • After receiving the synchronization instruction, the graphics server 12 creates a second buffer area 121, and the second buffer area 121 establishes a mapping relationship of vertex data with the first buffer area 111 through the vertex array pointer.
  • the creation of the first buffer area 111 and the second buffer area 121 is finally performed according to the intercepted vertex array class instruction, which is a continuous process.
  • The graphics client 11 also queries the local data. If vertex data consistent with the intercepted vertex data exists in the local data, the vertex array class instruction is cache-optimized, that is, it is packaged and sent to the graphics server 12, and the graphics server 12 renders the image according to the vertex data of the second buffer area and the packaged vertex array class instruction. If not, the vertex array class instruction is decomposed and sent to the graphics server 12, and the graphics server 12 renders the picture according to the decomposed vertex array class instruction. The local data is vertex data pre-existing in the graphics client 11, which can be sent to and used by the graphics server 12 without decomposition.
  • the rendered image can be, but is not limited to, a three-dimensional image, or a two-dimensional image, and the image can be a combination of one or more images or a part of a complete image.
  • The graphics client 11 uses the cache unit mode as a carrier to learn, predict, and correct the vertex array pointer and the vertex array length, and thereby determines whether the cached vertex data exists in the local data. If it exists, the vertex array class instruction is cache-optimized. If it does not exist, the vertex array class instruction is decomposed, that is, value-passing vertex instructions are used, and the vertex data is saved in the Hashtable for the next cache optimization.
  • the number of decomposed instructions is more than 100 times the number of pre-decomposition instructions, which causes the amount of data transmitted by the network to increase abruptly, which in turn generates a large amount of delay and occupies the bandwidth of the transmission channel.
  • When the vertex array class instruction is cache-optimized, it does not need to be decomposed, which solves the problem that a vertex array class instruction passed directly to the graphics server 12 would generate an error. Even though some vertex array class instructions still need to be decomposed, the total number of instructions to be transmitted is greatly reduced, which shortens the time required to transmit all instructions and lowers bandwidth usage. Therefore, while the consistency of the cached vertex data is ensured, delay and transmission channel bandwidth can be greatly reduced, the CPU consumption of memory sharing is reduced, VM density is increased, and cost is reduced.
  • When the intercepted vertex data exists in the local data, the graphics client 11 packages the vertex array class instruction and sends it to the graphics server 12 through the data channel 13, and the graphics server 12 unpacks the vertex array class instruction and sends it to the graphics card 14 to render the image. When the intercepted vertex data does not exist in the local data, the graphics client 11 sends the decomposed vertex array class instruction to the graphics server 12 through the data channel 13, and the graphics server 12 then sends it to the graphics card 14 to render the picture.
  • The graphics server 12 copies the picture into memory by screen capture and sends it to the graphics client 11 through the data channel 13. The graphics client 11 receives the picture and pastes it to the graphics device interface 110, and the graphics device interface 110 redirects the vertex array class instruction to the TC end 15 to execute the vertex array class instruction and generate the screen image.
  • The data channel 13 may be any of TCP/IP (Transmission Control Protocol/Internet Protocol), SR-IOV (Single-Root I/O Virtualization), RDMA (Remote Direct Memory Access), and shared memory.
  • FIG. 2 is a schematic flowchart of a method for implementing GPU virtualization according to a first embodiment of the present invention. The method of FIG. 2 is described with the graphics client 11 shown in FIG. 1 as the executing entity.
  • the GPU virtualization implementation method in this embodiment includes:
  • The graphics client 11 intercepts the vertex array class instruction. Specifically, the TC end 15 sends 3D instructions to the graphics device interface 110 of the graphics client 11 through mouse and keyboard redirection, and the graphics client 11 can intercept the 3D instructions through the OpenGL ICD (Installable Client Driver) of the graphics device interface 110.
  • the 3D instructions include glGet* return-transfer instructions, glSwapBuffer and other instructions that need to be sent immediately, vertex array-like instructions with pointer parameters, and instructions that can be aggregated and packaged. In this embodiment, it is mainly processed for vertex array class instructions with pointer parameters.
  • The graphics client 11 creates a first buffer area 111 for caching vertex data, and at the same time sends a synchronization instruction to the graphics server 12 through the data channel 13. The synchronization instruction includes the vertex array pointer and the contents of the vertex array, and the vertex array pointer establishes a mapping relationship with the vertex data of the second buffer area of the graphics server 12.
  • The creation of the first buffer area 111 and the second buffer area 121 is ultimately driven by the intercepted vertex array class instructions, and is a continuous process. If the newly added vertex data is historical data, but the cached first buffer area has been released or its vertex array length needs to be updated to a larger value, the graphics client 11 updates the vertex array length, creates a temporary buffer area, copies the newly added vertex data into the temporary buffer area, and then copies the entire vertex data from the temporary buffer area to the first buffer area 111. Upon receiving the synchronization instruction, the graphics server 12 immediately creates the second buffer area 121, copies the contents of the vertex array from the synchronization instruction, and caches the vertex data.
  • the first buffer area 111 and the second buffer area 121 establish a mapping relationship by the vertex array pointer, thereby ensuring the consistency of the cached vertex data.
  • the first buffer area may be located in the graphics client 11 or in the shared memory.
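  • The buffer creation and synchronization described above can be sketched as a pair of mirrored tables keyed by the vertex array pointer. This is a minimal sketch under assumptions: both buffer areas are modelled as Python dicts, the synchronization instruction is a direct method call rather than a message on the data channel, and the class and method names are hypothetical.

```python
class GraphicsServer:
    def __init__(self):
        self.second_buffer = {}  # vertex array pointer -> cached contents

    def on_sync(self, pointer, content):
        """Handle a synchronization instruction: create or refresh the
        second-buffer entry by copying the array contents it carries."""
        self.second_buffer[pointer] = bytes(content)


class GraphicsClient:
    def __init__(self, server):
        self.first_buffer = {}   # vertex array pointer -> cached contents
        self.server = server

    def cache_vertex_data(self, pointer, content):
        old = self.first_buffer.get(pointer)
        if old is not None and len(content) > len(old):
            # Known pointer whose length grew: stage the new data in a
            # temporary buffer, then copy the whole array across.
            temp = bytearray(content)
            self.first_buffer[pointer] = bytes(temp)
        else:
            self.first_buffer[pointer] = bytes(content)
        # The pointer is the key that maps the two buffer areas together.
        self.server.on_sync(pointer, content)


server = GraphicsServer()
client = GraphicsClient(server)
client.cache_vertex_data(0x2000, b"\x01\x02")
client.cache_vertex_data(0x2000, b"\x01\x02\x03\x04")  # length grows
```

  After each call the two tables hold the same contents under the same pointer, which is the consistency property the mapping relationship is meant to provide.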
  • S12: Query the local data. If vertex data consistent with the intercepted vertex data exists in the local data, the vertex array class instruction is packaged and sent to the graphics server 12, so that the graphics server 12 renders the image according to the vertex data of the second buffer area and the packaged vertex array class instruction. If not, the vertex array class instruction is decomposed and sent to the graphics server 12, so that the graphics server 12 renders the image according to the decomposed vertex array class instruction.
  • the rendered image can be, but is not limited to, a three-dimensional image, or a two-dimensional image, and the image can be a combination of one or more images or a part of a complete image.
  • the local data is the vertex data pre-existing in the graphics client 11, and the vertex data can be sent and used for the graphics server 12 without being decomposed.
  • the process of vertex array caching is a process of predicting data, and the prediction result may be correct or wrong, so the data verification process is indispensable.
  • The graphics client 11 uses the cache unit mode as a carrier to learn, predict, and correct the vertex array pointer and the vertex array length, and thereby determines whether the cached vertex data exists in the local data. If it exists, the intercepted vertex data can be cache-optimized, that is, packaged according to the characteristics of the vertex array class instruction. If it does not exist, cache optimization cannot be performed: the vertex array class instruction must be decomposed into value-passing vertex instructions, and the vertex data is saved as historical data in the Hashtable for the next cache optimization.
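  • The client-side choice between packaging and decomposing can be sketched as follows. The message layout (a dict with a "kind" field) and the function name are assumptions for illustration, and the step that synchronizes the newly saved data to the graphics server is omitted to keep the sketch short.

```python
def build_message(pointer, vertices, local_data):
    """Return the message the client would send for one vertex array draw.

    local_data: dict pointer -> list of vertex tuples seen previously
    """
    if local_data.get(pointer) == vertices:
        # Consistent with local data: send only the pointer and a count.
        return {"kind": "packed", "pointer": pointer, "count": len(vertices)}
    # Miss or mismatch: remember the data for next time, then fall back
    # to value-passing calls that carry every coordinate.
    local_data[pointer] = list(vertices)
    return {
        "kind": "decomposed",
        "calls": [("glVertex3f", *v) for v in vertices],
    }


local_data = {}
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
first = build_message(0x3000, triangle, local_data)   # decomposed
second = build_message(0x3000, triangle, local_data)  # packed
```

  The second message carries no coordinates at all, which is where the reduction in transmitted data comes from once the cache has learned an array.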
  • the data channel 13 can be any of TCP/IP, SR-IOV, RDMA, and shared memory.
  • The graphics server 12 compresses the picture to generate a compressed code stream, and the graphics client 11 receives the compressed code stream through the data channel 13 and decompresses it.
  • The graphics client 11 then calls the bitblt() interface to paste the image into the graphics area of the 3D application on the graphics device interface 110, and redirects the vertex array class instructions to the TC end 15 through the graphics device interface 110 to execute the vertex array class instructions and generate the screen image.
  • In this embodiment, the second buffer area 121 is created in the graphics server 12, and the second buffer area 121 and the first buffer area 111 form a mapping relationship of vertex data through the vertex array pointer. When the intercepted vertex data exists in the local data, the vertex data is cache-optimized, so the vertex array class instruction does not need to be decomposed, which solves the problem that a vertex array class instruction passed directly to the graphics server 12 would generate an error. Even though some vertex array class instructions still need to be decomposed, the total number of instructions to be transmitted is greatly reduced, which shortens the time required to transmit all instructions and lowers bandwidth usage. Delay and transmission channel bandwidth can therefore be greatly reduced, the CPU consumption of memory sharing is reduced, VM density is increased, and cost is reduced.
  • FIG. 3 is a schematic flowchart diagram of a method for implementing GPU virtualization according to a second embodiment of the present invention.
  • this embodiment is described with the graphics server 12 shown in FIG. 1 as the executing body.
  • the GPU virtualization implementation method in this embodiment includes:
  • S20: Receive a synchronization instruction and create a second buffer area 121 for vertex data caching; the second buffer area 121 and the first buffer area 111 of the graphics client 11 form a mapping relationship of vertex data, and the vertex data includes a vertex array pointer and a vertex array length.
  • the graphics server 12 receives the synchronization instructions sent by the graphics client 11.
  • the synchronization instruction includes the vertex array pointer and the contents of the vertex array.
  • the graphics server 12 creates a second buffer area 121 according to the synchronization instruction to perform vertex data buffering, and forms a mapping relationship between the vertex data and the first buffer area 111 of the graphics client 11 through the vertex array pointer, so that the vertex array class instruction can be cached.
  • the creation of the first buffer area 111 and the second buffer area 121 is ultimately performed based on the intercepted vertex array class instructions, which is a continuous process.
  • the second buffer area may be located in the graphics server 12 or in the shared memory.
  • S21: Determine, according to the vertex array pointer, whether the second buffer area 121 has cached the corresponding vertex data. If so, receive the packed vertex array class instruction sent by the graphics client 11, and render the image according to the vertex data of the second buffer area 121 and the packed vertex array class instruction, for sending to the graphics client 11. If not, receive the decomposed vertex array class instruction sent by the graphics client 11, and render the image according to the decomposed vertex array class instruction, for sending to the graphics client 11.
  • when the second buffer area 121 caches the vertex data corresponding to the vertex array pointer, the graphics server 12 receives the vertex array class instruction sent by the graphics client 11 through the data channel 13 and unpacks it according to the characteristics of the vertex array class instruction itself. The graphics server 12 then sends the unpacked vertex array class instruction to the graphics card 14. When the second buffer area 121 does not cache the vertex data corresponding to the vertex array pointer, the graphics server 12 receives the decomposed vertex array class instruction sent by the graphics client 11 and sends it to the graphics card 14. The graphics card 14 executes the vertex array class instruction, renders the image, and saves it in the video memory.
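  • The two branches of step S21 can be sketched as follows. This is a toy model under stated assumptions: the message dictionaries, the field names `kind`, `pointer`, `first`, `count` and `vertices`, and the class itself are hypothetical, and a plain list stands in for the graphics card 14.

```python
class GraphicsServerSketch:
    """Toy model of step S21; every name here is hypothetical."""

    def __init__(self, second_buffer):
        # second_buffer maps a vertex array pointer to the cached vertex list.
        self.second_buffer = second_buffer
        self.sent_to_card = []  # stands in for the graphics card 14

    def on_instruction(self, msg):
        if msg["kind"] == "packed":
            # Cached path: resolve the pointer against the second buffer area.
            cached = self.second_buffer.get(msg["pointer"])
            if cached is None:
                raise LookupError("packed instruction without cached vertex data")
            vertices = cached[msg["first"]:msg["first"] + msg["count"]]
        else:
            # Uncached path: the vertices arrive by value in the message.
            vertices = list(msg["vertices"])
        self.sent_to_card.append(vertices)
        return vertices
```

  • In this sketch a packed message carries only a pointer and a range, while a decomposed message carries every vertex, which mirrors the bandwidth argument made above.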
  • the rendered image may be, but not limited to, a three-dimensional image or a two-dimensional image, and the image may be a combination of one or more images or a part of a complete image.
  • the graphics server 12 copies the picture into memory through screen capture. Since the picture is relatively large, the graphics server 12 compresses it and then sends the compressed code stream to the graphics client 11 through the data channel 13, so that the graphics client 11 decompresses the compressed code stream and, through the graphics device interface 110, redirects the vertex array class instructions to the TC terminal 15 to execute the vertex array class instructions and generate the screen image.
  • FIG. 4 is a flow chart showing a method of vertex data buffering in a GPU according to a first embodiment of the present invention. As shown in FIG. 4, the method for buffering vertex data in the GPU of this embodiment includes:
  • S30 Create a first buffer area 111 through the graphics client 11 to perform vertex data caching, wherein the vertex data includes a vertex array pointer and a vertex array length.
  • the vertex data cache is learned, predicted, and corrected using the cache unit mode as a carrier, which covers the learning, prediction, and correction of both the vertex array pointer and the vertex array length. Therefore, the choice of cache unit mode is the primary problem to solve for vertex data caching, and it is mainly a question of granularity.
  • a coarse-grained mode could cache in units of frames, so that not only vertex data but also 3D instructions can be cached; however, the data always differs between frames, often substantially, and handling the differences degrades performance.
  • the structure of the cache unit mode is as shown in FIG.
  • the role of gl*Pointer is to indicate the first address of the vertex array and the byte length of each element.
  • the subsequent vertex instructions glDrawArrays/glDrawElements all draw geometric units based on offsets from the first address of the vertex array, until the next gl*Pointer instruction appears, which indicates the end of a cache unit mode.
  • the gl*Pointer instruction is the glVertexPointer/glNormalPointer or glInterleavedArrays in Figure 5. Using this mode for vertex data caching gives moderate granularity, small overhead, and good stability of the cached content.
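  • The boundary rule for a cache unit mode can be sketched as a small stream splitter. This is only an illustration: the intercepted instruction stream is modelled as (name, args) tuples, and the function and dictionary keys are hypothetical. Following the text literally, every gl*Pointer-type instruction closes the current unit and opens a new one.

```python
# Instructions that open a new cache unit mode, as named in the text.
POINTER_OPS = {"glVertexPointer", "glNormalPointer", "glInterleavedArrays"}


def split_cache_units(instructions):
    """Group an intercepted instruction stream into cache unit modes."""
    units = []
    current = None
    for name, args in instructions:
        if name in POINTER_OPS:
            # The next pointer instruction marks the end of the previous unit.
            if current is not None:
                units.append(current)
            current = {"pointer_op": (name, args), "draws": []}
        elif current is not None:
            # Draw calls are interpreted relative to the current first address.
            current["draws"].append((name, args))
    if current is not None:
        units.append(current)
    return units
```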
  • the methods for learning, predicting, and correcting vertex array pointers include:
  • S40: Intercept the gl*Pointer instruction.
  • the vertex array pointer can be obtained from the gl*Pointer instruction.
  • S41: Use the vertex array pointer as the key for a hash lookup.
  • the correction of a vertex array pointer in a cache unit mode is completed.
  • the above process is repeated until the correction of all vertex array pointers in the cache unit mode is completed.
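  • The pointer lookup can be sketched with a dictionary in place of the Hashtable. The recorded feature fields (`size`, `stride`, `length`) and the class name are assumptions chosen for illustration; the source only says that the pointer and "related feature information" are stored on a miss. Setting the current pointer on a miss as well as on a hit is also an assumption of this sketch.

```python
class PointerTable:
    """Hypothetical Hashtable keyed by vertex array pointer."""

    def __init__(self):
        self.table = {}
        self.current = None  # pointer used by the following draw instructions

    def on_pointer_instruction(self, pointer, size, stride):
        """Return True on a hit; on a miss, record the pointer and its features."""
        hit = pointer in self.table
        if not hit:
            # Miss: start learning this vertex array; its length is not known yet.
            self.table[pointer] = {"size": size, "stride": stride, "length": 0}
        self.current = pointer
        return hit
```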
  • the learning, prediction and correction of the vertex array length are performed, that is, the correction of the vertex instruction is completed, so that the geometric unit is drawn based on the offset of the vertex array first address.
  • the method of learning, predicting, and correcting the vertex array length includes:
  • the vertex instruction is the glDrawArrays/glDrawElements instruction in Figure 5.
  • the length of the vertex array can be obtained from the glDrawArrays/glDrawElements instruction.
  • the local data is the vertex data pre-existing in the graphics client, and this vertex data can be sent to and used by the graphics server 12 without being decomposed.
  • Decompose the glDrawArrays instruction. That is, if the intercepted vertex data does not exist in the local data, or the intercepted vertex data has not been cached, cache optimization cannot be performed; the glDrawArrays instruction can only be decomposed into pass-by-value vertex drawing instructions, and the vertex data is saved as historical data in the Hashtable for the next cache optimization.
  • Pass the glDrawArrays instruction through transparently. That is, if the intercepted vertex data exists in the local data, cache optimization can be performed. The above process is repeated until the correction of all the vertex instructions of the cache unit mode is completed. The learning, prediction, and correction of vertex array pointers and vertex array lengths of Figures 6 and 7 are then repeated to complete the caching of vertex data for all cache unit modes. During the learning, prediction, and correction of the vertex array pointer and the vertex array length, it is judged whether the cached vertex data exists in the local data; if so, the vertex array class instruction is cache-optimized, and if not, the vertex array class instruction is decomposed.
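  • The decision flow for one vertex instruction can be sketched as below. This is one possible reading of the flow, not the patented implementation: the record layout (`cached`, `length`, `data`), the returned labels, and the rule that a draw call reading past the learned length triggers a length update are all assumptions.

```python
def handle_draw(entry, first, count, client_vertices):
    """Decide how to forward one glDrawArrays-style call.

    entry           -- hypothetical Hashtable record:
                       {"cached": bool, "length": int, "data": list}
    first, count    -- range of vertices the draw call reads
    client_vertices -- the application's current vertex array (a list)
    """
    end = first + count
    wanted = client_vertices[first:end]
    if entry["cached"]:
        if entry["data"][first:end] == wanted:
            return "pass_through"  # local data matches: forward the call as-is
        entry["data"] = list(client_vertices)  # keep as history for next time
        return "decompose"
    if end > entry["length"]:
        entry["length"] = end  # learn a larger vertex array length
        return "update_length"
    return "decompose"
```

  • In this reading, a changed vertex array costs one decomposed call, after which the saved history lets the same call pass through again.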
  • the vertex array class instruction does not need to be decomposed, which solves the problem that a vertex array class instruction passed directly through to the graphics server 12 produces errors. Even if some vertex array class instructions still need to be decomposed, the total number of instructions to be transferred is greatly reduced, which shortens the time needed to transfer all instructions and lowers bandwidth usage. The latency and the bandwidth of the transmission channel can therefore be greatly reduced, the CPU consumption of memory sharing is lowered, VM density is increased, and cost is reduced.
  • the second buffer area 121 forms a mapping relationship between the vertex data and the first buffer area 111 by using the vertex array pointer.
  • the vertex array pointer and the vertex array length can be learned, so that the second buffer area 121 can be created.
  • the graphics server 12 also copies the contents of the vertex array from the synchronization instructions in the second buffer area 121.
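  • A minimal sketch of the synchronization step follows, assuming the synchronization instruction is a small record holding the vertex array pointer and a copy of the array contents. The dictionary keys and function names are hypothetical, and both buffer areas are modelled as dictionaries keyed by the pointer.

```python
def make_sync_instruction(pointer, first_buffer):
    """Client side: build a synchronization instruction for one vertex array."""
    return {"pointer": pointer, "contents": bytes(first_buffer[pointer])}


def apply_sync(second_buffer, sync):
    """Server side: create or refresh the matching entry in the second buffer area."""
    # The shared key (the vertex array pointer) is what maps the two areas together.
    second_buffer[sync["pointer"]] = bytes(sync["contents"])
```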
  • the vertex array length needs to be updated to a larger value, including:
  • Update the vertex array length. Specifically, when traversing the (k-1)th cache unit mode, the vertex array pointer of the cache unit mode is first recorded in the first buffer area, and the length is updated when the vertex array length needs to be updated to a larger value.
  • for the previous cache unit mode, the transfer of the vertex data in the temporary buffer area must be completed before the traversal of the (k)th cache unit mode. Therefore, at the beginning of the (k)th cache unit mode, the buffer area of the previous cache unit mode, that is, the buffer area of the (k-1)th cache unit mode, is created.
  • the vertex data of the temporary buffer area is copied as a whole to the buffer area of the (k-1)th cache unit mode.
  • the buffer area of the (k-1)th cache unit mode and the buffer area of the (k)th cache unit mode are referred to as the first buffer area 111.
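  • The temporary buffer handling can be sketched as follows, assuming vertex data arrives as byte chunks. The class and method names are hypothetical; the point illustrated is that data collected during the (k-1)th cache unit mode is committed as a whole when the (k)th cache unit mode begins.

```python
class FirstBufferArea:
    """Hypothetical client-side buffer with a temporary staging area."""

    def __init__(self):
        self.buffers = {}        # pointer -> committed vertex data
        self.temp = bytearray()  # temporary buffer area
        self.pending_pointer = None

    def begin_unit(self, pointer):
        """Start the k-th unit: first commit what the (k-1)th unit staged."""
        self.flush()
        self.pending_pointer = pointer

    def append(self, data):
        """Copy newly added vertex data into the temporary buffer area."""
        self.temp.extend(data)

    def flush(self):
        """Copy the staged data as a whole into the previous unit's buffer."""
        if self.pending_pointer is not None and self.temp:
            self.buffers[self.pending_pointer] = bytes(self.temp)
        self.temp = bytearray()
```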
  • the graphics server 12 creates the second buffer area 121 according to the synchronization instruction, and forms a mapping relationship with the first buffer area 111 of the graphics client 11 through the vertex array pointer of the graphics client 11, thereby ensuring the consistency of the cached vertex data.
  • the first buffer area 111 is created by the graphics client 11, the vertex data is cached, and the synchronization instruction is sent to the graphics server 12 to create the second buffer area 121; the first buffer area 111 and the second buffer area 121 form a mapping relationship of vertex data through the vertex array pointer, which ensures the consistency of the cached vertex data. The vertex array class instruction can therefore be cache-optimized and does not need to be decomposed, which solves the problem that a vertex array class instruction passed directly through to the graphics server 12 produces errors. Even if some vertex array class instructions still need to be decomposed, the total number of instructions to be transferred is greatly reduced, which shortens the time needed to transfer all instructions and lowers bandwidth usage. The latency and the bandwidth of the transmission channel can therefore be greatly reduced, the CPU consumption of memory sharing is lowered, VM density is increased, and cost is reduced.
  • the creation of the first buffer area 111 and the second buffer area 121 is finally performed according to the intercepted vertex array class instruction, which is a continuous process.
  • FIG. 9 is a schematic structural diagram of a GPU graphics client according to a first embodiment of the present invention. As shown in FIG. 9, the GPU virtualization implementation method of the first embodiment is described.
  • the graphics client 11 includes a graphics device interface 110, a first buffer area 111, an instruction acquisition module 112, and a first cache module 113.
  • the instruction acquisition module 112 is configured to intercept vertex array class instructions.
  • the first cache module 113 is configured to create a first buffer area 111, perform vertex data caching, and send a synchronization instruction to the graphics server 12 to create the second buffer area 121, where the second buffer area 121 and the first buffer area 111 form a mapping relationship of vertex data. The vertex data is obtained from the vertex array class instructions and includes the vertex array pointer and the vertex array length. In this embodiment, the creation of the first buffer area 111 and the second buffer area 121 is ultimately performed according to the intercepted vertex array class instruction, which is a continuous process.
  • the query module 114 is configured to perform a query in the local data.
  • if the local data contains vertex data consistent with the intercepted vertex data, the sending module 115 packages the vertex array class instruction and sends it to the graphics server 12, so that the graphics server 12 renders the image according to the vertex data of the second buffer area 121 and the packed vertex array class instruction, that is, the vertex array class instruction is cache-optimized. If not, the sending module 115 decomposes the vertex array class instruction into pass-by-value vertex drawing instructions, saves the vertex data in the Hashtable for the next cache optimization, and sends the decomposed instruction to the graphics server 12, so that the graphics server 12 renders the image according to the decomposed vertex array class instruction.
  • the rendered image can be, but is not limited to, a three-dimensional image, or a two-dimensional image, and the image can be a combination of one or more images or a part of a complete image.
  • the local data is the vertex data pre-existing in the graphics client 11, and the vertex data can be sent and used for the graphics server 12 without being decomposed.
  • the first receiving module 116 is configured to receive a picture and paste it to the graphics device interface 110. The graphics device interface 110 redirects the vertex array class instructions to the TC terminal 15 to execute the vertex array class instructions and generate the screen image.
  • the sending module 115 further sends a synchronization instruction to the graphics server 12 to create a second buffer area 121.
  • the synchronization instruction includes a vertex array pointer, and the second buffer area 121 and the first buffer area 111 form a mapping relationship of vertex data through the vertex array pointer, which ensures the consistency of the cached vertex data. The vertex array class instruction can therefore be cache-optimized and does not need to be decomposed, which solves the problem that a vertex array class instruction passed directly through to the graphics server 12 produces errors. Even if some vertex array class instructions still need to be decomposed, the total number of instructions to be transferred is greatly reduced, which shortens the time needed to transfer all instructions and lowers bandwidth usage.
  • the first cache module 113 is further configured to create a temporary buffer area. Copy the newly added vertex data to the temporary buffer. The vertex data is then copied from the temporary buffer area to the first buffer area 111 as a whole.
  • the picture is compressed by the graphics server 12 to generate a compressed code stream and sent to the graphics client 11.
  • the first receiving module 116 receives the compressed code stream through the data channel 13 and decompresses it, and then calls the bitblt() interface to paste the picture into the graphics area of the 3D application of the graphics device interface 110; the vertex array class instructions are redirected to the TC terminal 15 via the graphics device interface 110 to execute the vertex array class instructions and generate the screen image.
  • FIG. 10 is a schematic structural diagram of a GPU graphics server according to a first embodiment of the present invention.
  • the graphics server 12 includes a second buffer area 121, a second cache module 122, a second receiving module 123, and a rendering module 124.
  • the second cache module 122 is configured to create a second buffer area 121 for vertex data caching; the second buffer area 121 and the first buffer area 111 of the graphics client 11 form a mapping relationship of vertex data, and the vertex data includes the vertex array pointer and the vertex array length.
  • the creation of the first buffer area 111 and the second buffer area 121 is finally performed according to the intercepted vertex array class instruction, which is a continuous process.
  • the second receiving module 123 is configured to determine, according to the vertex array pointer, whether the second buffer area 121 has cached the corresponding vertex data. If so, the second receiving module 123 receives the packed vertex array class instruction sent by the graphics client 11, and the rendering module 124 renders the image according to the vertex data of the second buffer area 121 and the packed vertex array class instruction, for sending to the graphics client 11. If not, the second receiving module 123 receives the decomposed vertex array class instruction sent by the graphics client 11, and the rendering module 124 renders the image according to the decomposed vertex array class instruction, for sending to the graphics client 11.
  • the second receiving module 123 further receives the synchronization instruction sent by the graphics client 11 through the data channel 13, wherein the synchronization instruction includes a vertex array pointer.
  • the second cache module 122 creates a second buffer area 121 according to the synchronization instruction to perform vertex data caching, and the second buffer area 121 forms a mapping relationship of vertex data with the first buffer area 111 of the graphics client 11 through the vertex array pointer, thereby ensuring the consistency of the cached vertex data.
  • the vertex data is cache-optimized, so the vertex array class instruction does not need to be decomposed, which solves the problem that a vertex array class instruction passed directly through to the graphics server 12 produces errors. Even if some vertex array class instructions still need to be decomposed, the total number of instructions to be transferred is greatly reduced, which shortens the time needed to transfer all instructions and lowers bandwidth usage. The latency and the bandwidth of the transmission channel can therefore be greatly reduced, the CPU consumption of memory sharing is lowered, VM density is increased, and cost is reduced.
  • when the second buffer area 121 caches the vertex data corresponding to the vertex array pointer, the second receiving module 123 receives the vertex array class instruction sent by the graphics client 11 through the data channel 13, unpacks it according to the characteristics of the vertex array class instruction itself, and sends the unpacked vertex array class instruction to the graphics card 14.
  • when the second buffer area 121 does not cache the vertex data corresponding to the vertex array pointer, the second receiving module 123 receives the decomposed vertex array class instruction sent by the graphics client 11 and sends it to the graphics card 14.
  • the graphics card 14 executes the vertex array class instructions and renders the image, saving it in the video memory.
  • the rendered image may be, but not limited to, a three-dimensional image, or may be a two-dimensional image, and the image may be a combination of one or more images, or may be part of a complete image.
  • the rendering module 124 copies the picture into memory through screen capture. Since the picture is relatively large, the rendering module 124 compresses it and then sends the compressed code stream to the graphics client 11 through the data channel 13, so that the graphics client 11 decompresses the compressed code stream and, through the graphics device interface 110, redirects the vertex array class instructions to the TC terminal 15 to execute the vertex array class instructions and generate the screen image.
  • FIG. 11 is a schematic structural diagram of an apparatus for vertex data caching in a GPU according to a first embodiment of the present invention. With reference to FIG. 9 and FIG. 10, as shown in FIG. 11, the apparatus 100 for vertex data caching includes: a first cache module 113, a first buffer area 111, a sending module 115, a second buffer area 121, and a second cache module 122.
  • the first cache module 113 is configured to create a first buffer area 111 for vertex data caching, wherein the vertex data includes a vertex array pointer and a vertex array length.
  • the sending module 115 is configured to send a synchronization instruction to the graphics server 12, wherein the synchronization instruction includes a vertex array pointer.
  • the second cache module 122 is configured to create a second buffer area 121 according to the synchronization instruction, and perform vertex data buffering.
  • the second buffer area 121 forms a mapping relationship with the first buffer area 111 by the vertex array pointer.
  • the creation of the first buffer area 111 and the second buffer area 121 is finally performed according to the intercepted vertex array class instruction, which is an ongoing process.
  • the first cache module 113 uses the cache unit mode as a carrier to learn, predict, and correct the vertex array pointer and the vertex array length.
  • the cache unit mode includes indicating the first address of the vertex array and the byte length of each element, and drawing geometric units according to offsets from the first address.
  • the first cache module 113 is configured to obtain the vertex array class instruction; use the vertex array pointer for a hash lookup; and determine whether the lookup hits. If so, the current cached array pointer is set for use by the vertex drawing instructions; if not, the vertex array pointer and related feature information are added to the Hashtable. The cached data pointer is then passed through transparently.
  • the first cache module 113 is configured to obtain a vertex instruction and determine whether the intercepted vertex data has been cached. If it has, the module determines whether the intercepted vertex data exists in the local data: if so, the vertex instruction is passed through transparently; if not, the vertex instruction is decomposed into pass-by-value vertex drawing instructions, and the vertex data is saved in the Hashtable for the next cache optimization. If the vertex data has not been cached, the module determines whether the vertex array length needs to be updated: if so, the vertex array length is updated; if not, the vertex instruction is decomposed into pass-by-value vertex drawing instructions.
  • the local data is the vertex data pre-existing in the graphics client 11, and this vertex data can be sent to and used by the graphics server 12 without being decomposed. Therefore, if the intercepted vertex data does not exist in the local data, or the intercepted vertex data has not been cached, cache optimization cannot be performed, and the vertex instruction can only be decomposed into pass-by-value vertex drawing instructions. If the intercepted vertex data exists in the local data, that is, if the local data contains vertex data consistent with the intercepted vertex data, cache optimization can be performed, which eliminates the need to decompose the vertex array class instruction. The latency and the bandwidth of the transmission channel can therefore be greatly reduced, the CPU consumption of memory sharing is lowered, VM density is increased, and cost is reduced.
  • when the vertex array length is updated, the first cache module 113 first creates a temporary buffer area and immediately copies the newly added data into it.
  • the temporary buffer area is used when the historical data has already been cached; the buffer area of the previous mode is created, and the vertex data of the temporary buffer area is transferred as a whole to the buffer area of the previous mode before the next cache unit mode is traversed.
  • the first buffer area 111 is created by the first cache module 113, and the vertex data is cached.
  • the sending module 115 sends a synchronization instruction to the graphics server 12, and the second cache module 122 creates a second buffer area 121 according to the synchronization instruction.
  • the second cache module 122 forms a mapping relationship of vertex data with the first buffer area 111 through the vertex array pointer, thereby ensuring the consistency of the cached vertex data. When the intercepted vertex data exists in the local data, the vertex data is cache-optimized, so the vertex array class instruction does not need to be decomposed, which solves the problem that a vertex array class instruction passed directly through to the graphics server 12 produces errors. Even if some vertex array class instructions still need to be decomposed, the total number of instructions to be transferred is greatly reduced. The latency and the bandwidth of the transmission channel can therefore be greatly reduced, the CPU consumption of memory sharing is lowered, VM density is increased, and cost is reduced.
  • FIG. 12 is a schematic structural diagram of a GPU graphics client according to a second embodiment of the present invention.
  • the GPU graphics client 20 includes a processor 201, a memory 202, a receiver 203, a bus 204, and a transmitter 205.
  • the processor 201, the memory 202, the transmitter 205, and the receiver 203 are connected by the bus 204 to communicate with each other.
  • the receiver 203 is configured to intercept vertex array class instructions.
  • the processor 201 is configured to create a first buffer area, the memory 202 caches the vertex data, and the transmitter 205 sends a synchronization instruction to the graphics server to create a second buffer area; the second buffer area and the first buffer area form a mapping relationship of vertex data.
  • Vertex data is obtained from vertex array class instructions, including vertex array pointers and vertex array lengths.
  • the creation of the first buffer area and the second buffer area is ultimately performed according to the intercepted vertex array class instruction, which is a continuous process.
  • the processor 201 is further configured to perform a query in the local data.
  • if the local data contains vertex data consistent with the intercepted vertex data, the transmitter 205 packages the vertex array class instruction and sends it to the graphics server, so that the picture is rendered according to the vertex data of the second buffer area and the packed vertex array class instruction, that is, the vertex array class instruction is cache-optimized. If it does not exist, the vertex array class instruction is decomposed into pass-by-value vertex drawing instructions, the vertex data is saved in the Hashtable for the next cache optimization, and the transmitter 205 sends the decomposed instruction to the graphics server 12, so that the picture is rendered according to the decomposed vertex array class instruction.
  • the rendered image can be, but is not limited to, a three-dimensional image, or a two-dimensional image, and the image can be a combination of one or more images or a part of a complete image.
  • the local data is vertex data pre-existing in the graphics client, and the vertex data can be sent and used for the graphics server without being decomposed.
  • the receiver 203 is further configured to receive a picture and paste it to a graphics device interface.
  • the graphics device interface redirects the vertex array class instructions to the TC terminal to execute the vertex array class instructions and generate the screen image. If the newly added vertex data is historical data, but its cached first buffer area has been released or its vertex array length needs to be updated to a larger value, the processor 201 also creates a temporary buffer area, copies the newly added vertex data to the temporary buffer area, and then copies the vertex data from the temporary buffer area to the first buffer area.
  • the transmitter 205 sends a synchronization command to the graphics server to create a second buffer.
  • the synchronization instruction includes a vertex array pointer, and the second buffer area forms a mapping relationship with the first buffer area through the vertex array pointer, which ensures the consistency of the cached vertex data. The vertex array class instruction can therefore be cache-optimized and does not need to be decomposed, which solves the problem that a vertex array class instruction passed directly through to the graphics server produces errors. Even if some vertex array class instructions still need to be decomposed, the total number of instructions to be transferred is greatly reduced, which shortens the time needed to transfer all instructions and lowers bandwidth usage. The latency and the bandwidth of the transmission channel can therefore be greatly reduced, the CPU consumption of memory sharing is lowered, VM density is increased, and cost is reduced.
  • FIG. 13 is a schematic structural diagram of a GPU graphics server according to a second embodiment of the present invention.
  • the GPU graphics server 30 includes a processor 301, a memory 302, a receiver 303, and a bus 304.
  • the processor 301, the memory 302, and the receiver 303 are connected by a bus 304 to communicate with each other.
  • the processor 301 is configured to create a second buffer area.
  • the memory 302 caches the vertex data, and the second buffer area and the first buffer area of the graphics client form a mapping relationship of vertex data.
  • Vertex data includes vertex array pointers and vertex array lengths.
  • the creation of the first buffer area and the second buffer area is finally performed according to the intercepted vertex array class instruction, which is an ongoing process.
  • the processor 301 determines, according to the vertex array pointer, whether the second buffer area has cached the corresponding vertex data. If so, the receiver 303 receives the packed vertex array class instruction sent by the graphics client, and the processor 301 renders the image according to the vertex data of the second buffer area and the packed vertex array class instruction, for sending to the graphics client. If not, the receiver 303 receives the decomposed vertex array class instruction sent by the graphics client, and the processor 301 renders the image according to the decomposed vertex array class instruction, for sending to the graphics client.
  • the receiver 303 also receives the synchronization instruction sent by the graphics client through the data channel, wherein the synchronization instruction includes a vertex array pointer.
  • the processor 301 creates a second buffer area according to the synchronization instruction to perform vertex data caching, and the second buffer area forms a mapping relationship of vertex data with the first buffer area of the graphics client through the vertex array pointer, to ensure vertex data consistency.
  • the vertex data is cache-optimized, so the vertex array class instruction does not need to be decomposed, which solves the problem that a vertex array class instruction passed directly through to the graphics server produces errors. Even if some vertex array class instructions still need to be decomposed, the total number of instructions to be transferred is greatly reduced. The latency and the bandwidth of the transmission channel can therefore be greatly reduced, the CPU consumption of memory sharing is lowered, VM density is increased, and cost is reduced.
  • FIG. 14 is a schematic structural diagram of an implementation system of GPU virtualization according to a second embodiment of the present invention.
  • the GPU virtualization implementation system 40 of the second embodiment includes a graphics client 41, a graphics server 42, a data channel 43, a graphics card 44, and a TC terminal 45.
  • the graphics client 41 includes a graphics device interface 410.
  • the data channel 43 includes a vertex data buffer 431.
  • the graphics client 41 is connected to the graphics server 42 via a data channel 43, the graphics card 44 is coupled to the graphics server 42, and the TC terminal 45 is coupled to the graphics device interface 410 of the graphics client 41.
  • the data channel 43 is a shared memory, and the graphics client 41 and the graphics server 42 share the vertex data buffer 431 in the shared memory to implement vertex data buffering.
  • The TC terminal 45 sends a 3D instruction to the graphics device interface 410 of the graphics client 41 through mouse and keyboard redirection. The graphics client 41 can intercept the 3D instruction through the OpenGL ICD driver of the graphics device interface 410; the 3D instruction includes a vertex array class instruction.
  • the graphics client 41 performs vertex data caching in the vertex data buffer 431, and sends synchronization instructions to the graphics server 42 through the data channel 43.
  • The graphics server 42 performs vertex data caching in the vertex data buffer 431 to ensure the consistency of the cached vertex data.
  • The vertex data buffer 431 is created according to the intercepted vertex array class instructions, and its creation is a continuous process.
  • The graphics client 41 queries the local data. If the local data contains vertex data consistent with the intercepted vertex data, the vertex array class instruction is packaged and sent to the graphics server 42, so that the graphics server 42 renders the picture according to the vertex data in the vertex data buffer 431 and the packed vertex array class instruction; that is, the vertex array class instruction is cache-optimized. If not, the vertex array class instruction is decomposed, that is, value-passing vertex instructions are used, the vertex data is saved in the Hashtable for the next cache optimization, and the decomposed instruction is sent to the graphics server 42, so that the graphics server 42 renders the picture according to the decomposed vertex array class instruction.
  • The local data is vertex data pre-stored in the graphics client; this vertex data can be sent to and used by the graphics server 42 without being decomposed. Specifically, when the local data contains vertex data consistent with the intercepted vertex data, the graphics client 41 packages the vertex array class instruction and sends it to the graphics server 42 through the data channel 43, and the graphics server 42 unpacks the vertex array class instruction and sends it to the graphics card 44 to render the picture. When the intercepted vertex data does not exist in the local data, the graphics client 41 sends the decomposed vertex array class instruction to the graphics server 42 through the data channel 43, and the graphics server 42 sends it to the graphics card 44 to render the picture.
  • the rendered image can be, but is not limited to, a three-dimensional image, or a two-dimensional image, and the image can be a combination of one or more images or a part of a complete image.
  • the graphics server 42 copies the picture into the memory through screen capture and sends it to the graphics client 41 via the data channel 43.
  • the graphics client 41 receives the picture and pastes it to the graphics device interface 410.
  • The graphics device interface 410 redirects the vertex array class instruction to the TC terminal 45 to execute the vertex array class instruction and generate the screen image. The vertex data is obtained from the vertex array class instruction and includes the vertex array pointer and the vertex array length.
  • Vertex data caching is implemented by the graphics client 41 and the graphics server 42 sharing the vertex data buffer 431 in the shared memory, which ensures the consistency of the cached vertex data. When the intercepted vertex data exists in the local data, cache optimization of the vertex data is performed, so the vertex array class instruction does not need to be decomposed. This solves the problem that using a directly passed-through vertex array class instruction in the graphics server 42 causes an error. Even if some vertex array class instructions still need to be decomposed, the total number of instructions to be transmitted is greatly reduced, which shortens the time needed to transmit all instructions and lowers bandwidth usage. The delay and the bandwidth used on the transmission channel are therefore greatly reduced, the CPU consumed by memory sharing is reduced, VM density is increased, and cost is lowered; in addition, less cache memory is used, which simplifies the complexity of maintaining the consistency of graphics
  • In the present invention, a graphics client intercepts a vertex array class instruction; caches vertex data to create a first buffer area and sends a synchronization instruction to a graphics server to create a second buffer area, where the second buffer area and the first buffer area form a mapping relationship of the vertex data; and queries the local data. If the local data contains vertex data consistent with the intercepted vertex data, the vertex array class instruction is packaged and sent to the graphics server, so that the graphics server renders the picture according to the vertex data in the second buffer area and the packed vertex array class instruction. If not, the vertex array class instruction is decomposed and sent to the graphics server, so that the graphics server renders the picture according to the decomposed vertex array class instruction.
  • Once the second buffer area and the first buffer area form the mapping relationship of the vertex data, the vertex array class instruction does not need to be decomposed, which solves the problem that using a directly passed-through vertex array class instruction in the graphics server causes an error. Even if some vertex array class instructions still need to be decomposed, the total number of instructions to be transmitted is greatly reduced, which greatly reduces delay and the bandwidth used on the transmission channel, reduces the CPU consumed by memory sharing, increases VM density, and reduces cost.
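The shared vertex data buffer described for the second embodiment can be pictured with a small single-process sketch. It is only an illustration: a `bytearray` stands in for the shared memory of the data channel, and the names `shared_memory`, `client_view`, and `server_view` are hypothetical, not taken from the patent.

```python
# Stand-in for the data channel's shared memory (assumption: 64 bytes is enough).
shared_memory = bytearray(64)

# Both sides hold a view onto the same bytes instead of private copies.
client_view = memoryview(shared_memory)  # graphics client's view
server_view = memoryview(shared_memory)  # graphics server's view

vertex_bytes = b"\x01\x02\x03\x04\x05\x06"

# The client caches the vertex data once in the shared buffer.
client_view[0:len(vertex_bytes)] = vertex_bytes

# The server reads the same bytes without any transfer of the vertex data.
seen_by_server = bytes(server_view[0:len(vertex_bytes)])
```

Because both views alias one buffer, the two sides cannot drift apart, which is the consistency property the text attributes to sharing the vertex data buffer.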

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Generation (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Disclosed in the present invention are a GPU (Graphic Processing Unit) virtualization realization method as well as a vertex data caching method and a related device. The method comprises: a graphics client intercepting vertex array class commands; caching vertex data to create a first cache area, sending a synchronization command to a graphics server to create a second cache area, and the second cache area and the first cache area forming a mapping relationship of the vertex data; querying in local data, packing and sending the vertex array class commands to the graphics server to render a picture according to the vertex data of the second cache area and the packed vertex array class commands if one piece of vertex data consistent with the intercepted vertex data exists in the local data, and if there is no vertex data consistent with the intercepted vertex data in the local data, resolving and sending the vertex array class commands to the graphics server to render the picture according to the resolved vertex array class commands. By doing as above, the present invention enables a drastic reduction of the delay and the bandwidth of transmission paths, and also reduces CPU (Central Processing Unit) consumption of memory sharing, increases VM (Virtual Machine) density and lowers cost.

Description

GPU virtualization implementation method, vertex data caching method, and related device
This application claims priority to Chinese Patent Application No. 201310554845.0, filed with the Chinese Patent Office on November 8, 2013 and entitled "GPU virtualization implementation method and vertex data caching method and related device", the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to the field of virtualization technologies, and in particular, to a GPU virtualization implementation method, a vertex data caching method, and related apparatus.
Background
A GPU (Graphics Processing Unit) mainly performs floating-point and parallel computation and is commonly used for professional graphics operations. GPU virtualization technology lets virtualized instances running on data center servers share one or more GPUs for graphics operations. Among the products implemented so far, virtualization solutions based on DirectX 3D are relatively mature, with performance and user experience close to those of a physical machine. In the more widely used field of high-definition graphics, however, most 3D software is implemented on the OpenGL (Open Graphics Library) specification, and this is the application problem that enterprises most urgently want solved.
The open-source project Chromium is an existing implementation of GPU virtualization based on OpenGL instructions; in essence, Chromium implements remote rendering across a network. In the Chromium architecture, vertex arrays allow the OpenGL driver to fetch attributes such as vertices, colors, and normal vectors directly from the application's memory. Using vertex arrays minimizes function call overhead and reduces the amount of data that must be packed into the command buffer of the display driver. In remote rendering, however, the vertex array pointer intercepted from the application layer is allocated on the graphics client, and passing it through directly to the graphics server for use causes an error. Chromium therefore decomposes a glArrayElement call into equivalent glVertex3f, glNormal3f, glColor3f, or glTexCoord2f calls; that is, it converts the pointer-passing instruction glArrayElement into a series of value-passing instructions. The number of instructions after decomposition is more than 100 times the number before decomposition, so the amount of data transmitted over the network rises sharply, which causes considerable delay, occupies the bandwidth of the transmission channel, increases the CPU consumed by memory sharing, lowers VM (virtual machine) density, and raises cost.
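To make the cost of decomposition concrete, the following minimal sketch expands one array-based draw into per-vertex value-passing commands. It is not Chromium's code: the function name `decompose_draw` and the tuple encoding of a command are assumptions for illustration, and only the OpenGL call names come from the text above.

```python
def decompose_draw(positions, normals, colors):
    """Expand one array-based draw into per-vertex value-passing commands."""
    commands = []
    for pos, nrm, col in zip(positions, normals, colors):
        # Each attribute that the array draw fetched through a pointer
        # becomes its own command carrying the values inline.
        commands.append(("glNormal3f", *nrm))
        commands.append(("glColor3f", *col))
        commands.append(("glVertex3f", *pos))
    return commands

# Hypothetical mesh of 100 vertices with three attributes each.
positions = [(float(i), 0.0, 0.0) for i in range(100)]
normals = [(0.0, 0.0, 1.0)] * 100
colors = [(1.0, 1.0, 1.0)] * 100

commands = decompose_draw(positions, normals, colors)
print(len(commands))  # 300 commands replace a single array-based draw
```

The command count grows linearly with the vertex count, which is why the text reports a blow-up of two orders of magnitude for realistic meshes.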
Summary of the invention
Embodiments of the present invention provide a GPU virtualization implementation method, a vertex data caching method, and related apparatus, which can greatly reduce delay and the bandwidth used on the transmission channel, reduce the CPU consumed by memory sharing, increase VM density, and reduce cost.
A first aspect provides a GPU virtualization implementation method, including: a graphics client intercepts a vertex array class instruction; caches vertex data to create a first buffer area and sends a synchronization instruction to a graphics server to create a second buffer area, where the second buffer area and the first buffer area form a mapping relationship of the vertex data, and the vertex data is obtained from the vertex array class instruction and includes a vertex array pointer and a vertex array length; and queries local data. If the local data contains vertex data consistent with the intercepted vertex data, the vertex array class instruction is packaged and sent to the graphics server, so that the graphics server renders a picture according to the vertex data in the second buffer area and the packed vertex array class instruction. If not, the vertex array class instruction is decomposed and sent to the graphics server, so that the graphics server renders the picture according to the decomposed vertex array class instruction. The local data is vertex data pre-stored in the graphics client; this vertex data can be sent to and used by the graphics server without being decomposed.
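The client-side choice between packing and decomposing can be sketched as follows. This is a simplified illustration under stated assumptions: `dispatch_draw`, the dictionary used as local data, and the message tuples are hypothetical, and sending the synchronization instruction that fills the second buffer area is omitted.

```python
def dispatch_draw(local_data, pointer, vertices):
    """Return the message the client would send for one vertex array draw.

    local_data maps a vertex array pointer to the vertex tuple cached earlier.
    """
    cached = local_data.get(pointer)
    if cached is not None and cached == vertices:
        # Consistent data found: only the pointer and length travel.
        return ("packed", pointer, len(vertices))
    # No consistent data: remember it for next time and send values inline.
    local_data[pointer] = vertices
    return ("decomposed", [("glVertex3f", *v) for v in vertices])

local_data = {}
triangle = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))

first = dispatch_draw(local_data, 0x1000, triangle)   # miss: decomposed
second = dispatch_draw(local_data, 0x1000, triangle)  # hit: packed
```

The first draw pays the full decomposition cost, while every later draw of unchanged data shrinks to a fixed-size message, which is where the claimed saving comes from.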
In a first possible implementation of the first aspect, the method further includes: the graphics client receives, through a data channel, the picture sent by the graphics server and pastes it to a graphics device interface; and the vertex array class instruction is redirected through the graphics device interface to the TC terminal, to execute the vertex array class instruction and generate the screen image.
In a second possible implementation of the first aspect, caching vertex data to create the first buffer area includes: if newly added vertex data is historical data but its first buffer area has been released, or its vertex array length needs to be updated to a larger value, creating a temporary buffer area; copying the newly added vertex data into the temporary buffer area; and copying the vertex data from the temporary buffer area to the first buffer area.
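The three copy steps can be sketched as below. The helper `recache`, the dictionary of first buffer areas keyed by pointer, and the use of raw bytes for vertex data are assumptions for illustration; the patent does not give this code.

```python
def recache(first_buffers, pointer, new_vertices):
    """Refresh the first buffer area for `pointer` through a temporary buffer.

    Returns True when the buffer was (re)created, False when the existing
    buffer was already large enough.
    """
    old = first_buffers.get(pointer)  # None models a released buffer
    if old is not None and len(old) >= len(new_vertices):
        return False
    temp = bytearray(new_vertices)                 # 1. create temp buffer, copy data in
    first_buffers[pointer] = bytearray(len(temp))  # 2. recreate first buffer at new length
    first_buffers[pointer][:] = temp               # 3. copy temp buffer into first buffer
    return True

buffers = {0x2000: bytearray(b"abcd")}
grew = recache(buffers, 0x2000, b"abcdefgh")     # length must grow
unchanged = recache(buffers, 0x2000, b"ab")      # existing buffer suffices
recreated = recache(buffers, 0x2400, b"xy")      # buffer was released
```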
In a third possible implementation of the first aspect, caching vertex data to create the first buffer area and sending the synchronization instruction to the graphics server to create the second buffer area, where the second buffer area and the first buffer area form the mapping relationship of the vertex data, includes: caching the vertex data and creating the first buffer area; and sending the synchronization instruction to the graphics server to create the second buffer area, where the synchronization instruction includes the vertex array pointer, and the second buffer area forms the mapping relationship of the vertex data with the first buffer area through the vertex array pointer.
In a fourth possible implementation of the first aspect, the first buffer area is located in the graphics client.
In a fifth possible implementation of the first aspect, the first buffer area is located in shared memory.
A second aspect provides a GPU virtualization implementation method, including: receiving a synchronization instruction and creating a second buffer area to cache vertex data, where the second buffer area forms a mapping relationship of the vertex data with a first buffer area of a graphics client, and the vertex data includes a vertex array pointer and a vertex array length; and determining, according to the vertex array pointer, whether the second buffer area holds the corresponding vertex data. If so, the packed vertex array class instruction sent by the graphics client through a data channel is received, and a picture is rendered according to the vertex data in the second buffer area and the packed vertex array class instruction, for sending to the graphics client. If not, the decomposed vertex array class instruction sent by the graphics client is received, and the picture is rendered according to the decomposed vertex array class instruction, for sending to the graphics client.
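A rough server-side counterpart is sketched below. The class `GraphicsServerCache` and its method names are hypothetical, and the sketch assumes the synchronization instruction carries the vertex bytes together with the client's vertex array pointer, which the text does not spell out. Rendering itself is replaced by returning the bytes that would be drawn.

```python
class GraphicsServerCache:
    """Second buffer areas keyed by the client's vertex array pointer."""

    def __init__(self):
        self.second_buffers = {}

    def on_sync(self, pointer, data):
        # Synchronization instruction: create or refresh the second buffer.
        self.second_buffers[pointer] = bytes(data)

    def has_vertex_data(self, pointer):
        return pointer in self.second_buffers

    def on_instruction(self, message):
        """Return the vertex bytes to render for one incoming message."""
        if message[0] == "packed":
            _, pointer, length = message
            # The client pointer is never dereferenced; it is only a key.
            return self.second_buffers[pointer][:length]
        _, payload = message  # decomposed: the values travel in the message
        return bytes(payload)

server = GraphicsServerCache()
server.on_sync(0x1000, b"\x01\x02\x03\x04")
from_cache = server.on_instruction(("packed", 0x1000, 4))
from_wire = server.on_instruction(("decomposed", b"\x09"))
```

Using the client pointer purely as a lookup key is what sidesteps the error the background section describes, since the server never treats it as an address in its own memory.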
In a first possible implementation of the second aspect, receiving the synchronization instruction and creating the second buffer area to cache vertex data, where the second buffer area forms the mapping relationship of the vertex data with the first buffer area of the graphics client, includes: receiving the synchronization instruction sent by the graphics client, where the synchronization instruction includes the vertex array pointer; and creating the second buffer area according to the synchronization instruction to cache vertex data, where the second buffer area forms the mapping relationship of the vertex data with the first buffer area of the graphics client through the vertex array pointer.
In a second possible implementation of the second aspect, the second buffer area is located in the graphics server.
In a third possible implementation of the second aspect, the second buffer area is located in shared memory.
A third aspect provides a method for caching vertex data in a GPU, including: creating a first buffer area through a graphics client to cache vertex data, where the vertex data includes a vertex array pointer and a vertex array length; sending a synchronization instruction to a graphics server, where the synchronization instruction includes the vertex array pointer; and creating, through the graphics server and according to the synchronization instruction, a second buffer area to cache vertex data, where the second buffer area forms a mapping relationship of the vertex data with the first buffer area through the vertex array pointer.
In a first possible implementation of the third aspect, vertex data caching uses the cache unit mode as the carrier for learning, prediction, and correction, including learning, prediction, and correction of the vertex array pointer and of the vertex array length.
In a second possible implementation of the third aspect, the cache unit mode includes: specifying the first address of the vertex array and the length of each byte; and drawing geometric units according to offsets from the first address.
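One way to picture the cache unit mode is a base address plus a fixed step, with each vertex located by its offset from the base. The sketch below reads the "length" in the text as a per-vertex byte stride, which is an interpretation rather than something the source confirms; `vertex_at` and the buffer layout are hypothetical. The pattern is loosely analogous to declaring an array with `glVertexPointer` and then referencing elements by index.

```python
import struct

def vertex_at(buffer, base, stride, index):
    """Read the three floats of vertex `index` from a packed vertex array."""
    # The offset from the first address selects the vertex to draw.
    return struct.unpack_from("<3f", buffer, base + index * stride)

# Two vertices of three little-endian floats each, after an 8-byte header.
buffer = bytes(8) + struct.pack("<6f", 0.0, 1.0, 2.0, 3.0, 4.0, 5.0)

print(vertex_at(buffer, base=8, stride=12, index=1))  # (3.0, 4.0, 5.0)
```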
In a third possible implementation of the third aspect, learning, prediction, and correction of the vertex array pointer includes: obtaining a vertex array class instruction; performing a hash lookup with the vertex array pointer; determining whether the lookup hits, and if so, setting the result as the current cache data pointer for use by the draw-vertex pointer; if not, adding the vertex array pointer and related feature information to the Hashtable; and passing the cache data pointer through transparently.
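The pointer-learning step amounts to a lookup that records unseen pointers. In this sketch a Python dictionary stands in for the Hashtable; the function `learn_pointer` and the shape of the feature information are assumptions for illustration only.

```python
def learn_pointer(hashtable, pointer, features):
    """Look up a vertex array pointer and record it on a miss.

    Returns (hit, current), where `current` is the entry to use as the
    current cache data pointer.
    """
    entry = hashtable.get(pointer)
    if entry is not None:
        return True, entry           # hit: reuse what was learned earlier
    hashtable[pointer] = features    # miss: remember pointer and features
    return False, hashtable[pointer]

table = {}
first_hit, first_entry = learn_pointer(table, 0x3000, {"stride": 12})
second_hit, second_entry = learn_pointer(table, 0x3000, {"stride": 12})
```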
In a fourth possible implementation of the third aspect, learning, prediction, and correction of the vertex array length includes: obtaining a draw-vertex instruction; and determining whether the vertex data has been cached. If it has, determining whether the cached vertex data exists in the local data: if so, the draw-vertex pointer is passed through transparently; if not, the draw-vertex pointer is decomposed. If the vertex data has not been cached, determining whether the vertex array length needs to be updated: if so, the vertex array length is updated; if not, the draw-vertex pointer is decomposed. The local data is vertex data pre-stored in the graphics client; this vertex data can be sent to and used by the graphics server without being decomposed.
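The branches above form a small decision tree, sketched here. The function `handle_draw`, the entry dictionary, and the rule that the length needs updating when a draw touches more vertices than recorded are all assumptions; the text does not say how the need for an update is detected.

```python
def handle_draw(entry, needed_length, is_consistent):
    """Decide how to treat one draw-vertex instruction.

    entry: dict with 'cached' (bool) and 'length' (int) for the pointer.
    needed_length: number of vertices this draw touches.
    is_consistent: whether the cached copy matches the local data.
    """
    if entry["cached"]:
        # Cached data may be sent as-is only if it is still consistent.
        return "pass_through" if is_consistent else "decompose"
    if needed_length > entry["length"]:
        entry["length"] = needed_length  # learn the larger array length
        return "update_length"
    return "decompose"

cached_entry = {"cached": True, "length": 3}
fresh_entry = {"cached": False, "length": 3}

a = handle_draw(cached_entry, 3, True)
b = handle_draw(cached_entry, 3, False)
c = handle_draw(fresh_entry, 6, False)
d = handle_draw(fresh_entry, 6, False)
```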
A fourth aspect provides a GPU graphics client, including an instruction acquisition module, a first cache module, a query module, and a sending module. The instruction acquisition module is configured to intercept a vertex array class instruction. The first cache module is configured to cache vertex data to create a first buffer area and to send a synchronization instruction to a graphics server to create a second buffer area, where the second buffer area and the first buffer area form a mapping relationship of the vertex data, and the vertex data is obtained from the vertex array class instruction and includes a vertex array pointer and a vertex array length. The query module is configured to query local data. If the local data contains vertex data consistent with the intercepted vertex data, the sending module packages the vertex array class instruction and sends it to the graphics server, so that the graphics server renders a picture according to the vertex data in the second buffer area and the packed vertex array class instruction. If not, the sending module decomposes the vertex array class instruction and sends it to the graphics server, so that the graphics server renders the picture according to the decomposed vertex array class instruction. The local data is vertex data pre-stored in the graphics client; this vertex data can be sent to and used by the graphics server without being decomposed.
In a first possible implementation of the fourth aspect, the graphics client further includes a first receiving module and a graphics device interface. The first receiving module is configured to receive the picture through a data channel and paste it to the graphics device interface; the graphics device interface redirects the vertex array class instruction to the TC terminal to execute the vertex array class instruction and generate the screen image.
In a second possible implementation of the fourth aspect, the sending module further sends the synchronization instruction to the graphics server, where the synchronization instruction includes the vertex array pointer, and the first buffer area forms the mapping relationship of the vertex data with the second buffer area of the graphics server through the vertex array pointer.
In a third possible implementation of the fourth aspect, if newly added vertex data is historical data but its first buffer area has been released, or its vertex array length needs to be updated to a larger value, the first cache module is further configured to: create a temporary buffer area; copy the newly added vertex data into the temporary buffer area; and copy the vertex data from the temporary buffer area to the first buffer area.
A fifth aspect provides a GPU graphics server, including a second cache module, a second receiving module, and a rendering module. The second cache module is configured to create a second buffer area to cache vertex data, where the second buffer area forms a mapping relationship of the vertex data with a first buffer area of a graphics client, and the vertex data includes a vertex array pointer and a vertex array length. The second receiving module is configured to determine, according to the vertex array pointer, whether the second buffer area holds the corresponding vertex data. If so, the second receiving module receives the packed vertex array class instruction sent by the graphics client, and the rendering module renders a picture according to the vertex data in the second buffer area and the packed vertex array class instruction, for sending to the graphics client. If not, the second receiving module receives the decomposed vertex array class instruction sent by the graphics client, and the rendering module renders the picture according to the decomposed vertex array class instruction, for sending to the graphics client.
In a first possible implementation of the fifth aspect, the second cache module further receives the synchronization instruction sent by the graphics client, where the synchronization instruction includes the vertex array pointer; the second cache module creates the second buffer area according to the synchronization instruction to cache vertex data, and the second buffer area forms the mapping relationship of the vertex data with the first buffer area of the graphics client through the vertex array pointer.
A sixth aspect provides an apparatus for caching vertex data in a GPU, including: a first cache module, configured to create a first buffer area in a graphics client to cache vertex data, where the vertex data includes a vertex array pointer and a vertex array length; a sending module, configured to send a synchronization instruction to a graphics server, where the synchronization instruction includes the vertex array pointer; and a second cache module, configured to create, through the graphics server and according to the synchronization instruction, a second buffer area to cache vertex data, where the second buffer area forms a mapping relationship of the vertex data with the first buffer area through the vertex array pointer.
In a first possible implementation of the sixth aspect, the first cache module uses the cache unit mode as the carrier for learning, prediction, and correction of the vertex array pointer and the vertex array length.
In a second possible implementation of the sixth aspect, the cache unit mode includes: specifying the first address of the vertex array and the length of each byte; and drawing geometric units according to offsets from the first address.
In a third possible implementation of the sixth aspect, for learning, prediction, and correction of the vertex array pointer, the first cache module is configured to: obtain a vertex array class instruction; perform a hash lookup with the vertex array pointer; determine whether the lookup hits, and if so, set the result as the current cache data pointer for use by the draw-vertex pointer; if not, add the vertex array pointer and related feature information to the Hashtable; and pass the cache data pointer through transparently.
In a fourth possible implementation of the sixth aspect, for learning, prediction, and correction of the vertex array length, the first cache module is configured to: obtain a draw-vertex instruction; and determine whether the vertex data has been cached. If it has, the module determines whether the cached vertex data exists in the local data: if so, the draw-vertex pointer is passed through transparently; if not, the draw-vertex pointer is decomposed. If the vertex data has not been cached, the module determines whether the vertex array length needs to be updated: if so, the vertex array length is updated; if not, the draw-vertex pointer is decomposed. The local data is vertex data pre-stored in the graphics client; this vertex data can be sent to and used by the graphics server without being decomposed.
In the present invention, a graphics client intercepts a vertex array class instruction; caches vertex data to create a first buffer area and sends a synchronization instruction to a graphics server to create a second buffer area, where the second buffer area and the first buffer area form a mapping relationship of the vertex data; and queries the local data. If the local data contains vertex data consistent with the intercepted vertex data, the vertex array class instruction is packaged and sent to the graphics server, so that the graphics server renders a picture according to the vertex data in the second buffer area and the packed vertex array class instruction. If not, the vertex array class instruction is decomposed and sent to the graphics server, so that the graphics server renders the picture according to the decomposed vertex array class instruction. Once the second buffer area and the first buffer area form the mapping relationship of the vertex data, the vertex array class instruction does not need to be decomposed, which solves the problem that using a directly passed-through vertex array class instruction in the graphics server causes an error. Even if some vertex array class instructions still need to be decomposed, the total number of instructions to be transmitted is greatly reduced, which shortens the time needed to transmit all instructions and lowers bandwidth usage. The invention can therefore greatly reduce delay and the bandwidth used on the transmission channel, reduce the CPU consumed by memory sharing, increase VM density, and reduce cost.
Description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort. In the drawings:
FIG. 1 is a schematic structural diagram of a system for implementing GPU virtualization according to a first embodiment of the present invention;
FIG. 2 is a schematic flowchart of a GPU virtualization implementation method according to the first embodiment of the present invention;
FIG. 3 is a schematic flowchart of a GPU virtualization implementation method according to a second embodiment of the present invention;
FIG. 4 is a schematic flowchart of a method for caching vertex data in a GPU according to the first embodiment of the present invention;
FIG. 5 is a schematic structural diagram of the cache unit mode used in the method for caching vertex data in a GPU according to the first embodiment of the present invention;
FIG. 6 is a schematic flowchart of the method for learning, predicting, and correcting the vertex array pointer in the method for caching vertex data in a GPU according to the first embodiment of the present invention;
FIG. 7 is a schematic flowchart of the method for learning, predicting, and correcting the vertex array length in the method for caching vertex data in a GPU according to the first embodiment of the present invention;
FIG. 8 is a schematic flowchart of updating the vertex array length in the method for caching vertex data in a GPU according to the first embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a GPU graphics client according to the first embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a GPU graphics server according to the first embodiment of the present invention;
FIG. 11 is a schematic structural diagram of an apparatus for caching vertex data in a GPU according to the first embodiment of the present invention;
FIG. 12 is a schematic structural diagram of a GPU graphics client according to the second embodiment of the present invention;
FIG. 13 is a schematic structural diagram of a GPU graphics server according to the second embodiment of the present invention;
FIG. 14 is a schematic structural diagram of a system for implementing GPU virtualization according to the second embodiment of the present invention.

Detailed description
The present invention is described in detail below with reference to the accompanying drawings and embodiments.
Refer first to FIG. 1, which is a schematic structural diagram of a system for implementing GPU virtualization according to the first embodiment of the present invention. As shown in FIG. 1, the system 10 for implementing GPU virtualization includes a graphics client 11, a graphics server 12, a data channel 13, a graphics card 14, and a TC (Thin Client) end 15, where the graphics client 11 includes a GDI (Graphics Device Interface) 110. The graphics client 11 is connected to the graphics server 12 through the data channel 13, the graphics card 14 is connected to the graphics server 12, and the TC end 15 is connected to the graphics device interface 110 of the graphics client 11.
In this embodiment, the graphics client 11 intercepts a vertex array class instruction, creates a first buffer area 111, caches vertex data, and sends a synchronization instruction to the graphics server 12 through the data channel 13. The vertex data is obtained from the vertex array class instruction and includes a vertex array pointer and a vertex array length; the synchronization instruction includes the vertex array pointer and the content of the vertex array. After receiving the synchronization instruction, the graphics server 12 creates a second buffer area 121, which establishes a mapping relationship of the vertex data with the first buffer area 111 through the vertex array pointer. In this embodiment, the first buffer area 111 and the second buffer area 121 are ultimately created according to the intercepted vertex array class instructions, and their creation is a continuous process.

The graphics client 11 also queries the local data. If vertex data consistent with the intercepted vertex data exists in the local data, cache optimization is applied to the vertex array class instruction, that is, the vertex array class instruction is packed and sent to the graphics server 12, and the graphics server 12 renders a picture according to the vertex data in the second buffer area 121 and the packed vertex array class instruction. If no such vertex data exists, the vertex array class instruction is decomposed and sent to the graphics server 12, and the graphics server 12 renders the picture according to the decomposed vertex array class instruction. The local data is vertex data pre-stored in the graphics client 11; this vertex data can be sent to and used by the graphics server 12 without decomposition. The rendered picture may be, but is not limited to, a three-dimensional picture; it may also be a two-dimensional picture, and it may be a single picture, a combination of several pictures, or a part of a complete picture.

Specifically, the graphics client 11 uses the cache unit mode as a carrier to learn, predict, and correct the vertex array pointer and the vertex array length, and thereby determines whether the cached vertex data exists in the local data. If it exists, cache optimization is applied to the vertex array class instruction. If it does not exist, the vertex array class instruction is decomposed, that is, pass-by-value draw-vertex instructions are used, and the vertex data is saved in a Hashtable so that cache optimization can be applied next time. In GPU virtualization, the number of instructions after decomposition is more than 100 times the number before decomposition, which sharply increases the amount of data transmitted over the network, introduces substantial delay, and occupies the bandwidth of the transmission channel. In this embodiment, when the intercepted vertex data is consistent with the local data, cache optimization is applied to the vertex array class instruction, so the instruction does not need to be decomposed. This solves the problem that a vertex array class instruction passed through directly produces errors on the graphics server 12. Even though some vertex array class instructions still need to be decomposed, the total number of instructions to be transmitted is greatly reduced, which reduces both the time required to transmit all instructions and the bandwidth occupied. Therefore, while the consistency of the cached vertex data is ensured, the latency and the bandwidth demand on the transmission channel can be greatly reduced, the CPU consumption caused by memory sharing is lowered, the VM density is increased, and the cost is reduced.
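To make the cost of decomposition concrete, the following minimal sketch expands one array-style draw call into pass-by-value vertex calls. The tuple encoding of an instruction and the function name `decompose_draw_arrays` are illustrative assumptions and do not come from the patent; only the OpenGL call names are real.

```python
# Hypothetical sketch: expand one array-style draw call into
# pass-by-value (immediate-mode style) calls, as a client might do
# when the vertex data cannot be served from the cache.

def decompose_draw_arrays(mode, first, count, vertices):
    """Return the list of calls that replaces glDrawArrays(mode, first, count).

    `vertices` is the client-side array of (x, y, z) tuples that the
    vertex array pointer refers to.
    """
    calls = [("glBegin", mode)]
    for x, y, z in vertices[first:first + count]:
        calls.append(("glVertex3f", x, y, z))  # vertex passed by value
    calls.append(("glEnd",))
    return calls


if __name__ == "__main__":
    vertices = [(float(i), 0.0, 0.0) for i in range(300)]
    calls = decompose_draw_arrays("GL_TRIANGLES", 0, 300, vertices)
    print(len(calls))  # 302 calls replace a single draw call
```

With 300 vertices, a single draw call becomes 302 calls, which is the kind of growth that the cache optimization is meant to avoid.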
In this embodiment, when vertex data consistent with the intercepted vertex data exists in the local data, that is, when the intercepted vertex data exists in the local data, the graphics client 11 packs the vertex array class instruction and sends it to the graphics server 12 through the data channel 13; the graphics server 12 unpacks the vertex array class instruction and sends it to the graphics card 14 to render a picture. When the intercepted vertex data does not exist in the local data, the graphics client 11 sends the decomposed vertex array class instruction to the graphics server 12 through the data channel 13, and the graphics server 12 forwards it to the graphics card 14 to render the picture. The graphics server 12 copies the picture into memory by screen capture and sends it to the graphics client 11 through the data channel 13. The graphics client 11 receives the picture and pastes it to the graphics device interface 110, and the graphics device interface 110 redirects the vertex array class instruction to the TC end 15 to execute the vertex array class instruction and generate the screen image. The data channel 13 may be any one of TCP/IP (Transmission Control Protocol/Internet Protocol), SR-IOV (Single-Root I/O Virtualization), RDMA (Remote Direct Memory Access), and shared memory.
FIG. 2 is a schematic flowchart of the GPU virtualization implementation method according to the first embodiment of the present invention. As shown in FIG. 2, the method is described with the graphics client 11 shown in FIG. 1 as the executing entity. The GPU virtualization implementation method of this embodiment includes:
S10: The graphics client 11 intercepts a vertex array class instruction. Specifically, the TC end 15 sends 3D instructions to the graphics device interface 110 of the graphics client 11 through mouse and keyboard redirection, and the graphics client 11 can intercept the 3D instructions through the OpenGL ICD (Installable Client Driver) of the graphics device interface 110. The 3D instructions include return-type instructions such as glGet*, instructions that need to be sent immediately such as glSwapBuffer, vertex array class instructions with pointer parameters, and instructions that can be aggregated and packed. This embodiment mainly deals with the vertex array class instructions with pointer parameters.
S11: Perform vertex data caching to create the first buffer area 111, and send a synchronization instruction to the graphics server 12 to create the second buffer area 121, where the second buffer area 121 and the first buffer area 111 form a mapping relationship of the vertex data. The vertex data is obtained from the vertex array class instruction and includes a vertex array pointer and a vertex array length. Specifically, the graphics client 11 creates the first buffer area 111 and caches the vertex data, and at the same time sends the synchronization instruction to the graphics server 12 through the data channel 13. The synchronization instruction includes the vertex array pointer and the content of the vertex array, and the mapping relationship with the vertex data in the second buffer area 121 of the graphics server 12 is established through the vertex array pointer. In this embodiment, the first buffer area 111 and the second buffer area 121 are ultimately created according to the intercepted vertex array class instructions, and their creation is a continuous process. If newly added vertex data is historical data, but the first buffer area that cached it has been released or its vertex array length needs to be updated to a larger value, the graphics client 11 also updates the vertex array length, creates a temporary buffer area, copies the newly added vertex data into the temporary buffer area, and then copies the vertex data as a whole from the temporary buffer area to the first buffer area 111. On receiving the synchronization instruction, the graphics server 12 immediately creates the second buffer area 121, copies the content of the vertex array out of the synchronization instruction, and caches the vertex data. In this way, the first buffer area 111 and the second buffer area 121 establish a mapping relationship through the vertex array pointer, which ensures the consistency of the cached vertex data. In this embodiment, the first buffer area may be located in the graphics client 11 or in shared memory.
S12: Query the local data. If vertex data consistent with the intercepted vertex data exists in the local data, pack the vertex array class instruction and send it to the graphics server 12, so that the graphics server 12 renders a picture according to the vertex data in the second buffer area and the packed vertex array class instruction; if no such vertex data exists, decompose the vertex array class instruction and send it to the graphics server 12, so that the graphics server 12 renders the picture according to the decomposed vertex array class instruction. The rendered picture may be, but is not limited to, a three-dimensional picture; it may also be a two-dimensional picture, and it may be a single picture, a combination of several pictures, or a part of a complete picture. The local data is vertex data pre-stored in the graphics client 11; this vertex data can be sent to and used by the graphics server 12 without decomposition. Specifically, vertex array caching is a process of predicting data, and the prediction may be right or wrong, so data verification is indispensable. Each time before the vertex data is used, the local data must be queried; that is, the graphics client 11 uses the cache unit mode as a carrier to learn, predict, and correct the vertex array pointer and the vertex array length, in order to determine whether the cached vertex data exists in the local data. If it exists, cache optimization can be applied to the intercepted vertex data, that is, the instruction is packed according to the characteristics of the vertex array class instruction. If it does not exist, cache optimization cannot be applied; the vertex array class instruction can only be decomposed into pass-by-value draw-vertex instructions, and the vertex data is saved in the Hashtable as historical data so that cache optimization can be applied next time.
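The verification step in S12 can be sketched as follows. This is a minimal illustration under stated assumptions: the patent does not say how consistency is checked, so comparing a SHA-1 digest of the bytes is an assumption made here, and the names `digest` and `choose_path` and the plain dictionary standing in for the Hashtable are hypothetical. Synchronization to the server is assumed to happen elsewhere.

```python
# Hypothetical sketch of the client-side check performed before each use
# of cached vertex data: the cache is a prediction, so it is verified.
import hashlib


def digest(data):
    """Fingerprint the vertex bytes (assumed consistency check)."""
    return hashlib.sha1(bytes(data)).hexdigest()


def choose_path(local_data, pointer, current_bytes):
    """Return ("packed", pointer) on a verified hit, else ("decomposed", pointer).

    `local_data` maps a vertex array pointer to the digest of the vertex
    data recorded for that pointer (a stand-in for the Hashtable).
    """
    seen = digest(current_bytes)
    if local_data.get(pointer) == seen:
        return ("packed", pointer)      # server can use its second buffer area
    local_data[pointer] = seen          # remember as historical data
    return ("decomposed", pointer)      # fall back to pass-by-value calls
```

The first sighting of a pointer, or any change in its content, takes the decomposition path; only an unchanged repeat takes the packed path.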
In this embodiment, the data channel 13 may be any one of TCP/IP, SR-IOV, RDMA, and shared memory. The picture is compressed by the graphics server 12 to generate a compressed code stream, and the graphics client 11 receives the compressed code stream through the data channel 13 and decompresses it. The graphics client 11 then calls the BitBlt() interface to paste the picture into the graphics area of the 3D application of the graphics device interface 110, and the vertex array class instruction is redirected through the graphics device interface 110 to the TC end 15 to execute the vertex array class instruction and generate the screen image.
In this embodiment, the first buffer area 111 is created in the graphics client 11 and the second buffer area 121 is created in the graphics server 12; the second buffer area 121 and the first buffer area 111 form a mapping relationship of the vertex data through the vertex array pointer; and cache optimization of the vertex data is applied when the intercepted vertex data exists in the local data, so that the vertex array class instruction does not need to be decomposed. This solves the problem that a directly passed-through vertex array class instruction produces errors when used on the graphics server 12. Even though some vertex array class instructions still need to be decomposed, the total number of instructions to be transmitted is greatly reduced, which reduces both the time required to transmit all instructions and the bandwidth occupied. The latency and the bandwidth demand on the transmission channel can therefore be greatly reduced, the CPU consumption caused by memory sharing is lowered, the VM density is increased, and the cost is reduced.
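The mapping between the two buffer areas can be pictured with a small sketch. It assumes, for illustration only, that each buffer area is a dictionary keyed by the vertex array pointer and that the synchronization instruction is a direct method call; the class and method names are hypothetical, and the temporary-buffer step is reduced to a single copy.

```python
# Hypothetical sketch of two caches kept consistent through
# synchronization instructions keyed by the vertex array pointer.

class GraphicsServer:
    def __init__(self):
        self.second_buffer = {}  # pointer -> bytes (second buffer area)

    def on_sync(self, pointer, content):
        # Copy the array content out of the synchronization instruction.
        self.second_buffer[pointer] = bytes(content)


class GraphicsClient:
    def __init__(self, server):
        self.server = server
        self.first_buffer = {}  # pointer -> bytearray (first buffer area)

    def cache(self, pointer, data):
        """Cache vertex data; return True if a sync instruction was sent."""
        if self.first_buffer.get(pointer) == bytearray(data):
            return False                       # unchanged, nothing to send
        temp = bytearray(data)                 # temporary buffer area
        self.first_buffer[pointer] = bytearray(temp)  # copy back as a whole
        self.server.on_sync(pointer, temp)     # synchronization instruction
        return True
```

Because both sides index by the same pointer value, a later packed instruction only needs to carry the pointer for the server to find the matching data.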
FIG. 3 is a schematic flowchart of the GPU virtualization implementation method according to the second embodiment of the present invention. As shown in FIG. 3, the method is described with the graphics server 12 shown in FIG. 1 as the executing entity. The GPU virtualization implementation method of this embodiment includes:
S20: Receive a synchronization instruction and create the second buffer area 121 to cache vertex data, where the second buffer area 121 and the first buffer area 111 of the graphics client 11 form a mapping relationship of the vertex data, and the vertex data includes a vertex array pointer and a vertex array length. Specifically, the graphics server 12 receives the synchronization instruction sent by the graphics client 11. The synchronization instruction includes the vertex array pointer and the content of the vertex array. The graphics server 12 creates the second buffer area 121 according to the synchronization instruction to cache the vertex data, and forms the mapping relationship of the vertex data with the first buffer area 111 of the graphics client 11 through the vertex array pointer. Cache optimization can thus be applied to the vertex array class instruction, so that the instruction does not need to be decomposed, which solves the problem that a directly passed-through vertex array class instruction produces errors when used on the graphics server 12. Even though some vertex array class instructions still need to be decomposed, the total number of instructions to be transmitted is greatly reduced, which reduces both the time required to transmit all instructions and the bandwidth occupied. Therefore, while the consistency of the cached vertex data is ensured, the latency and the bandwidth demand on the transmission channel can be greatly reduced, the CPU consumption caused by memory sharing is lowered, the VM density is increased, and the cost is reduced. In this embodiment, the first buffer area 111 and the second buffer area 121 are ultimately created according to the intercepted vertex array class instructions, and their creation is a continuous process. The second buffer area may be located in the graphics server 12 or in shared memory.
S21: Determine, according to the vertex array pointer, whether the second buffer area 121 has cached the corresponding vertex data. If so, receive the packed vertex array class instruction sent by the graphics client 11, and render a picture according to the vertex data in the second buffer area 121 and the packed vertex array class instruction, to be sent to the graphics client 11. If not, receive the decomposed vertex array class instruction sent by the graphics client 11, and render the picture according to the decomposed vertex array class instruction, to be sent to the graphics client 11.
In this embodiment, when the second buffer area 121 has cached the vertex data corresponding to the vertex array pointer, the graphics server 12 receives the vertex array class instruction sent by the graphics client 11 through the data channel 13 and unpacks it according to the characteristics of the vertex array class instruction. The graphics server 12 then sends the unpacked vertex array class instruction to the graphics card 14. When the second buffer area 121 has not cached the vertex data corresponding to the vertex array pointer, the graphics server 12 receives the decomposed vertex array class instruction sent by the graphics client 11 and forwards it to the graphics card 14. The graphics card 14 executes the vertex array class instruction, renders the picture, and stores it in video memory. The rendered picture may be, but is not limited to, a three-dimensional picture; it may also be a two-dimensional picture, and it may be a single picture, a combination of several pictures, or a part of a complete picture. The graphics server 12 copies the picture into memory by screen capture. Because the picture is relatively large, the graphics server 12 compresses it and sends the compressed code stream to the graphics client 11 through the data channel 13, so that the graphics client 11 decompresses the compressed code stream and, through the graphics device interface 110, redirects the vertex array class instruction to the TC end 15 to execute the vertex array class instruction and generate the screen image.
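The server-side branch in S21 can be sketched as a small dispatcher. The message shapes `("packed", pointer, draw_call)` and `("decomposed", calls)` and the placeholder instruction `bind_vertex_data` are hypothetical, chosen only to make the two paths visible; the patent does not define a wire format.

```python
# Hypothetical sketch of the server-side branch in S21.

def handle_draw(second_buffer, message):
    """Return the list of instructions forwarded to the graphics card.

    message is either
      ("packed", pointer, draw_call)  - a packed vertex array instruction, or
      ("decomposed", calls)           - already decomposed instructions.
    """
    kind = message[0]
    if kind == "packed":
        _, pointer, draw_call = message
        if pointer not in second_buffer:
            raise KeyError("no cached vertex data for pointer %#x" % pointer)
        # Unpack: re-attach the cached vertex data to the draw call.
        return [("bind_vertex_data", second_buffer[pointer]), draw_call]
    if kind == "decomposed":
        return list(message[1])  # forwarded unchanged
    raise ValueError("unknown message kind: %r" % (kind,))
```

A packed message is only valid when the pointer is already present in the second buffer area, which is why the sketch raises an error instead of guessing.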
FIG. 4 is a schematic flowchart of the method for caching vertex data in a GPU according to the first embodiment of the present invention. As shown in FIG. 4, the method for caching vertex data in a GPU of this embodiment includes:
S30: Create the first buffer area 111 through the graphics client 11 and cache vertex data, where the vertex data includes a vertex array pointer and a vertex array length.
In this embodiment, vertex data caching uses the cache unit mode as a carrier for learning, prediction, and correction, which covers the learning, prediction, and correction of both the vertex array pointer and the vertex array length. The choice of cache unit mode is therefore the first problem to solve in vertex data caching, and it is mainly a question of granularity. With a coarse-grained mode, the extra overhead of lookup and correction is small, but the content changes easily and overall performance suffers. For example, a coarse-grained mode could cache by frame, which would cache not only vertex data but also 3D instructions; however, the data always differs between frames, the differences are large, and handling them degrades performance. With a fine-grained mode, the cached content changes little and is relatively stable, but the extra overhead of lookup and correction is larger. In the embodiments of the present invention, the structure of the cache unit mode is shown in FIG. 5. In the OpenGL specification, a gl*Pointer instruction specifies the first address of the vertex array and the size in bytes of each element; the subsequent draw-vertex instructions glDrawArrays/glDrawElements draw geometric primitives based on offsets from the first address of the vertex array, until the next gl*Pointer instruction appears, which marks the end of one cache unit mode. Here gl*Pointer stands for glVertexPointer/glNormalPointer or glInterleavedArrays() in FIG. 5. Caching vertex data in this mode gives a moderate granularity, a small extra overhead, and stable cached content.
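A minimal sketch of how an intercepted instruction stream might be split into cache units follows. It assumes that a unit is a run of gl*Pointer instructions followed by a run of draw instructions, and that a gl*Pointer instruction appearing after a draw instruction starts the next unit; FIG. 5 is not reproduced here, so this reading, the function name, and the representation of instructions as bare names are all assumptions.

```python
# Hypothetical sketch: split an intercepted instruction stream into cache units.
POINTER_CALLS = {"glVertexPointer", "glNormalPointer", "glInterleavedArrays"}
DRAW_CALLS = {"glDrawArrays", "glDrawElements"}


def split_cache_units(stream):
    """Group instruction names into units: pointer calls, then draw calls.

    Instructions that are neither pointer nor draw calls are ignored.
    """
    units, current, seen_draw = [], [], False
    for name in stream:
        if name in POINTER_CALLS and seen_draw:
            units.append(current)      # a pointer call after draws closes the unit
            current, seen_draw = [], False
        if name in POINTER_CALLS or name in DRAW_CALLS:
            current.append(name)
            seen_draw = seen_draw or name in DRAW_CALLS
    if current:
        units.append(current)
    return units
```

Each returned unit is then small enough to verify cheaply, yet large enough that its pointers and lengths tend to repeat from frame to frame.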
As shown in FIG. 6, the method for learning, predicting, and correcting the vertex array pointer includes:
S40: Intercept a gl*Pointer instruction. The vertex array pointer can be obtained from the gl*Pointer instruction.
S41: Perform a hash lookup using the vertex array pointer.
S42: Determine whether there is a hit. If yes, perform S43; if no, perform S44. Specifically, it is determined whether the obtained vertex array pointer is the same as a vertex array pointer pre-stored in the Hashtable.
S43: Set it as the current vertex array pointer, for use by the draw-vertex instructions.
S44: Add the vertex array pointer and its related feature information to the Hashtable.
S45: Pass the gl*Pointer instruction through.
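Steps S41 to S44 amount to a dictionary lookup, as the following minimal sketch shows. The function name, the shape of `features`, and the choice to report no current pointer on a miss are assumptions made for illustration; the pass-through in S45 is not modelled.

```python
# Hypothetical sketch of steps S41-S44 for one gl*Pointer instruction.

def on_pointer_instruction(table, pointer, features):
    """Look up `pointer` in `table` (a stand-in for the Hashtable).

    Returns (hit, current_pointer). `features` is whatever descriptive
    information is stored with the pointer, for example (size, type, stride).
    """
    if pointer in table:
        return True, pointer        # S43: becomes the current vertex array pointer
    table[pointer] = features       # S44: learn the new pointer
    return False, None              # assumption: nothing is set as current on a miss
```

A pointer is therefore learned on its first appearance and recognized on every later one.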
At this point, the correction of one vertex array pointer in a cache unit mode is complete. The above process is repeated until all vertex array pointers in the cache unit mode have been corrected. The vertex array length is then learned, predicted, and corrected, that is, the draw-vertex instructions are corrected, so that geometric primitives can be drawn based on offsets from the first address of the vertex array. Specifically, as shown in FIG. 7, the method for learning, predicting, and correcting the vertex array length includes:
S50: Intercept a glDrawArrays instruction. Here the glDrawArrays instruction stands for the glDrawArrays/glDrawElements instructions in FIG. 5, from which the vertex array length can be obtained.
S51: Determine whether the vertex data has been cached. If no, perform S52; if yes, perform S53.
S52: Determine whether the vertex array length needs to be updated. If yes, perform S54; if no, perform S55.
S53: Determine whether the vertex data exists in the local data. If no, perform S55; if yes, perform S56. The local data is vertex data pre-stored in the graphics client; this vertex data can be sent to and used by the graphics server 12 without decomposition.
S54: Update the vertex array length. The specific method is shown in FIG. 8, described later.
S55: Decompose the glDrawArrays instruction. That is, if the intercepted vertex data does not exist in the local data, or the intercepted vertex data has not been cached, cache optimization cannot be applied; the glDrawArrays instruction can only be decomposed into pass-by-value draw-vertex instructions, and the vertex data is saved in the Hashtable as historical data so that cache optimization can be applied next time.
S56: Pass the glDrawArray instruction through transparently. If the intercepted vertex data exists in the local data, cache optimization can be performed. The above process is repeated until all draw-vertex instructions of the cache unit mode have been corrected. The learning, prediction, and correction of the vertex array pointer and the vertex array length shown in FIG. 6 and FIG. 7 are then repeated until the vertex data of all cache unit modes has been cached. During this learning, prediction, and correction, it is determined whether the cached vertex data exists in the local data. If so, cache optimization is applied to the vertex array class instruction; if not, the vertex array class instruction is decomposed, that is, draw-vertex instructions of the pass-by-value class are used, and the vertex data is saved in the Hashtable so that cache optimization can be applied the next time. In this embodiment, once a vertex array class instruction has been cache-optimized it no longer needs to be decomposed, which solves the problem that a vertex array class instruction passed through directly produces errors on the graphics server 12. Even if some vertex array class instructions still have to be decomposed, the total number of instructions to be transmitted is greatly reduced, which reduces the time needed to transmit all instructions and the bandwidth occupied. Latency and the bandwidth required on the transmission channel can therefore be reduced significantly, the CPU consumption caused by memory sharing is lowered, VM density is increased, and cost is reduced.
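The choice between decomposing and passing through can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name handle_draw_array, the tuple format of the forwarded instructions, and the use of a SHA-1 digest of the vertex bytes as the Hashtable key are all assumptions made for illustration.

```python
import hashlib


def handle_draw_array(vertex_bytes, first, count, stride, local_data):
    """Return the instructions to forward for one intercepted draw call.

    local_data is a dict standing in for the Hashtable of historical
    vertex data (hypothetical layout: content digest -> vertex bytes).
    """
    key = hashlib.sha1(vertex_bytes).hexdigest()
    if key in local_data:
        # Hit: the array-class instruction can be passed through unchanged.
        return [("glDrawArray", first, count)]
    # Miss: remember the data as history so the next call can hit ...
    local_data[key] = bytes(vertex_bytes)
    # ... and decompose into one by-value instruction per vertex.
    return [
        ("draw_vertex_by_value", vertex_bytes[i * stride:(i + 1) * stride])
        for i in range(first, first + count)
    ]
```

The first call with a given vertex array yields one by-value instruction per vertex; a second call with the same data yields a single pass-through instruction.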
S31: Send a synchronization instruction to the graphics server 12, where the synchronization instruction includes the vertex array pointer.
S32: The graphics server 12 creates the second buffer area 121 according to the synchronization instruction and caches the vertex data. The second buffer area 121 forms a vertex data mapping relationship with the first buffer area 111 through the vertex array pointer.
As can be seen from the above, a single traversal following the structure of the cache unit mode is enough to learn the vertex array pointer and the vertex array length, so that the second buffer area 121 can be created. The graphics server 12 also copies the contents of the vertex array out of the synchronization instruction and stores them in the second buffer area 121.
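A minimal sketch of the server side of this synchronization step follows, assuming the synchronization instruction carries the client's vertex array pointer, the vertex array length, and the array contents. The names SyncInstruction and GraphicsServerCache and the dictionary layout are hypothetical; the source only states that the pointer establishes the mapping and that the contents are copied out of the instruction.

```python
from dataclasses import dataclass


@dataclass
class SyncInstruction:
    pointer: int    # client-side vertex array pointer, used as the mapping key
    length: int     # vertex array length in bytes
    payload: bytes  # vertex array contents carried by the instruction


class GraphicsServerCache:
    """Hypothetical stand-in for the second buffer area on the server."""

    def __init__(self):
        self.second_buffer = {}  # client pointer -> cached vertex bytes

    def on_sync(self, sync):
        # Copy the array contents out of the instruction and store them
        # under the client pointer, which forms the mapping.
        self.second_buffer[sync.pointer] = bytes(sync.payload[:sync.length])

    def lookup(self, pointer):
        # Returns None when nothing has been cached for this pointer.
        return self.second_buffer.get(pointer)
```

Keying the server-side store by the client pointer means a later packed instruction only has to carry the pointer, not the vertex data itself.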
In this embodiment, if newly added vertex data is historical data but the first buffer area that cached it has been released, or its vertex array length needs to be updated to a larger value, the vertex array length must be updated to keep the learned, predicted, and corrected vertex array pointer and vertex array length reliable. As shown in FIG. 8, assuming that the vertex array length needs to be updated to a larger value while the (k-1)th cache unit mode is being traversed, the method includes:
S60: Update the vertex array length. Specifically, while the (k-1)th cache unit mode is traversed, the vertex array pointer of that cache unit mode is first recorded in the first buffer area, and the vertex array length is updated when it needs to be changed to a larger value.
S61: Copy the newly added vertex data into a temporary buffer area. Specifically, a temporary buffer area is created first and the newly added data is copied into it immediately. By the time the traversal of the (k-1)th cache unit mode finishes, the temporary buffer area already holds the historical data. Because the copy is made immediately, the copying process is reliable.
S62: Create the buffer area of the previous mode. Specifically, to prevent the data in the temporary buffer area from being overwritten, the vertex data of the previous cache unit mode must be moved out of the temporary buffer area before the traversal of the kth cache unit mode begins. The buffer area of the previous cache unit mode, that is, of the (k-1)th cache unit mode, is therefore created at the start of the kth cache unit mode, and the vertex data in the temporary buffer area is copied to it as a whole. The buffer areas of the (k-1)th and kth cache unit modes both refer to the first buffer area 111.
S63: Send a synchronization instruction to the graphics server 12. Steps S60 to S63 above are all performed by the graphics client 11.
S64: Create the second buffer area 121. Specifically, the graphics server 12 creates the second buffer area 121 according to the synchronization instruction and, through the vertex array pointer of the graphics client 11, forms a mapping relationship with the first buffer area 111 of the graphics client 11, which ensures the consistency of the cached vertex data.
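The staging through a temporary buffer area in steps S60 to S63 can be sketched as follows. This is an illustrative model only: the class ClientVertexCache, its method names, and the list standing in for synchronization instructions sent to the server are assumptions, not names from the source.

```python
class ClientVertexCache:
    """Hypothetical client-side model of steps S60 to S63."""

    def __init__(self):
        self.first_buffer = {}     # vertex array pointer -> cached bytes
        self.temp = bytearray()    # temporary buffer area
        self.pending_pointer = None
        self.sync_queue = []       # stand-in for sync instructions sent

    def begin_mode(self, pointer):
        # At the start of mode k, move the staged data of mode k-1 out of
        # the temporary buffer before it can be overwritten (S62).
        self._flush()
        self.pending_pointer = pointer  # record the pointer (S60)

    def add_vertices(self, data):
        # Copy new vertex data immediately into the temporary buffer (S61).
        self.temp += data

    def _flush(self):
        if self.pending_pointer is not None and self.temp:
            data = bytes(self.temp)
            # Whole-buffer copy into the previous mode's buffer area; the
            # stored length is the updated vertex array length.
            self.first_buffer[self.pending_pointer] = data
            # Queue the synchronization instruction for the server (S63).
            self.sync_queue.append((self.pending_pointer, len(data), data))
        self.temp = bytearray()
        self.pending_pointer = None
```

Because the flush happens at the start of the next mode, the temporary buffer is always empty before new data for that mode is staged.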
In this embodiment, the graphics client 11 creates the first buffer area 111 to cache vertex data and sends a synchronization instruction to the graphics server 12 so that the second buffer area 121 is created. The first buffer area 111 and the second buffer area 121 form a vertex data mapping relationship through the vertex array pointer, so that vertex array class instructions can be cache-optimized and no longer need to be decomposed. This solves the problem that a vertex array class instruction passed through directly produces errors on the graphics server 12. Even if some vertex array class instructions still have to be decomposed, the total number of instructions to be transmitted is greatly reduced, which reduces the time needed to transmit all instructions and the bandwidth occupied. The consistency of the cached vertex data is thus ensured, latency and the bandwidth required on the transmission channel can be reduced significantly, the CPU consumption caused by memory sharing is lowered, VM density is increased, and cost is reduced. In this embodiment, the first buffer area 111 and the second buffer area 121 are ultimately created according to the intercepted vertex array class instructions, which is a continuous process.
FIG. 9 is a schematic structural diagram of the GPU graphics client according to the first embodiment of the present invention. The description builds on the GPU virtualization implementation method of the first embodiment. As shown in FIG. 9, the graphics client 11 includes a graphics device interface 110, a first buffer area 111, an instruction acquisition module 112, a first cache module 113, a query module 114, a sending module 115, and a first receiving module 116.
In this embodiment, the instruction acquisition module 112 is configured to intercept vertex array class instructions. The first cache module 113 is configured to create the first buffer area 111, cache vertex data, and send a synchronization instruction to the graphics server 12 so that the second buffer area 121 is created. The second buffer area 121 forms a vertex data mapping relationship with the first buffer area 111. The vertex data is obtained from the vertex array class instruction and includes the vertex array pointer and the vertex array length. In this embodiment, the first buffer area 111 and the second buffer area 121 are ultimately created according to the intercepted vertex array class instructions, which is a continuous process. The query module 114 is configured to query the local data. If the local data contains vertex data that matches the intercepted vertex data, that is, the intercepted vertex data exists in the local data, the sending module 115 packs the vertex array class instruction and sends it to the graphics server 12, so that the graphics server 12 renders a picture according to the vertex data in the second buffer area 121 and the packed vertex array class instruction; in other words, the vertex array class instruction is cache-optimized. If no such vertex data exists, the sending module 115 decomposes the vertex array class instruction, that is, uses draw-vertex instructions of the pass-by-value class, saves the vertex data in the Hashtable so that cache optimization can be applied the next time, and sends the decomposed instruction to the graphics server 12, so that the graphics server 12 renders a picture according to the decomposed vertex array class instruction. The rendered picture may be, but is not limited to, a three-dimensional picture; it may also be two-dimensional, and it may be a single picture, a combination of several pictures, or part of a complete picture. The local data is vertex data pre-stored on the graphics client 11 that can be sent to and used by the graphics server 12 without decomposition. The first receiving module 116 is configured to receive the picture and paste it to the graphics device interface 110. The graphics device interface 110 redirects the vertex array class instruction to the TC end 15 to execute the vertex array class instruction and generate the screen image.
Further, the sending module 115 also sends the synchronization instruction to the graphics server 12 so that the second buffer area 121 is created. The synchronization instruction includes the vertex array pointer, and the second buffer area 121 forms a vertex data mapping relationship with the first buffer area 111 through the vertex array pointer. Vertex array class instructions can thus be cache-optimized and no longer need to be decomposed, which solves the problem that a vertex array class instruction passed through directly produces errors on the graphics server 12. Even if some vertex array class instructions still have to be decomposed, the total number of instructions to be transmitted is greatly reduced, which reduces the time needed to transmit all instructions and the bandwidth occupied. The consistency of the cached vertex data is thus ensured, latency and the bandwidth required on the transmission channel can be reduced significantly, the CPU consumption caused by memory sharing is lowered, VM density is increased, and cost is reduced.
Optionally, if newly added vertex data is historical data but the first buffer area that cached it has been released, or its vertex array length needs to be updated to a larger value, the first cache module 113 is further configured to create a temporary buffer area and copy the newly added vertex data into it. The vertex data is then copied as a whole from the temporary buffer area to the first buffer area 111.
In this embodiment, the graphics server 12 compresses the picture to generate a compressed code stream and sends it to the graphics client 11. The first receiving module 116 receives the compressed code stream through the data channel 13, decompresses it, and then calls the bitblt() interface to paste the picture into the graphics area of the 3D application of the graphics device interface 110. The graphics device interface 110 redirects the vertex array class instruction to the TC end 15 to execute the vertex array class instruction and generate the screen image.
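The compress, transmit, and decompress round trip can be sketched as below. The source does not name the compression scheme, so zlib is used here purely as a stand-in assumption; the function names are hypothetical, and the bitblt() paste step is not modelled.

```python
import zlib


def server_compress(pixels: bytes) -> bytes:
    # Stand-in for the server-side compression of the rendered picture.
    return zlib.compress(pixels)


def client_decompress(stream: bytes) -> bytes:
    # Stand-in for the client-side decompression before the picture is
    # pasted to the graphics device interface.
    return zlib.decompress(stream)
```

With a lossless codec such as this one, the client recovers exactly the bytes the server captured, and a highly repetitive frame compresses to a much smaller stream.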
FIG. 10 is a schematic structural diagram of the GPU graphics server according to the first embodiment of the present invention. The description builds on the GPU virtualization implementation method of the first embodiment. As shown in FIG. 10, the graphics server 12 includes a second buffer area 121, a second cache module 122, a second receiving module 123, and a rendering module 124.
In this embodiment, the second cache module 122 is configured to create the second buffer area 121 to cache vertex data. The second buffer area 121 forms a vertex data mapping relationship with the first buffer area 111 of the graphics client 11, and the vertex data includes the vertex array pointer and the vertex array length. In this embodiment, the first buffer area 111 and the second buffer area 121 are ultimately created according to the intercepted vertex array class instructions, which is a continuous process. The second receiving module 123 is configured to determine, according to the vertex array pointer, whether the second buffer area 121 has cached the corresponding vertex data. If so, it receives the packed vertex array class instruction sent by the graphics client 11, and the rendering module 124 renders a picture according to the vertex data in the second buffer area 121 and the packed vertex array class instruction and sends the picture to the graphics client 11. If not, the second receiving module 123 receives the decomposed vertex array class instruction sent by the graphics client 11, and the rendering module 124 renders a picture according to the decomposed vertex array class instruction and sends the picture to the graphics client 11.
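How the server might obtain the vertices for each of the two instruction forms can be sketched as follows. The tuple layouts ("packed", pointer, first, count, stride) and ("decomposed", vertices) and the function name are hypothetical; rendering itself is not modelled.

```python
def server_resolve_vertices(instruction, second_buffer):
    """Return the per-vertex byte strings the renderer would consume.

    second_buffer maps a client vertex array pointer to cached bytes
    (hypothetical layout).
    """
    kind = instruction[0]
    if kind == "packed":
        # Packed form: only the pointer and draw range travel over the
        # channel; the vertex bytes come from the second buffer area.
        _, pointer, first, count, stride = instruction
        data = second_buffer[pointer]
        return [data[i * stride:(i + 1) * stride]
                for i in range(first, first + count)]
    if kind == "decomposed":
        # Decomposed form: the vertices were sent by value.
        return list(instruction[1])
    raise ValueError("unknown instruction kind: %r" % (kind,))
```

Both paths produce the same vertex list for the same geometry; the packed path simply avoids resending the data.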
Optionally, the second receiving module 123 also receives, through the data channel 13, the synchronization instruction sent by the graphics client 11, where the synchronization instruction includes the vertex array pointer. The second cache module 122 creates the second buffer area 121 according to the synchronization instruction to cache vertex data, and the second buffer area 121 forms a vertex data mapping relationship with the first buffer area 111 of the graphics client 11 through the vertex array pointer, which ensures the consistency of the cached vertex data. When the vertex data exists in the local data, cache optimization of the vertex data is performed, so that vertex array class instructions do not need to be decomposed. This solves the problem that a vertex array class instruction passed through directly produces errors on the graphics server 12. Even if some vertex array class instructions still have to be decomposed, the total number of instructions to be transmitted is greatly reduced, which reduces the time needed to transmit all instructions and the bandwidth occupied. Latency and the bandwidth required on the transmission channel can therefore be reduced significantly, the CPU consumption caused by memory sharing is lowered, VM density is increased, and cost is reduced.
In this embodiment, when the second buffer area 121 has cached the vertex data corresponding to the vertex array pointer, the second receiving module 123 receives the vertex array class instruction sent by the graphics client 11 through the data channel 13, unpacks it according to the characteristics of the instruction itself, and sends the unpacked vertex array class instruction to the graphics card 14. When the second buffer area 121 has not cached the vertex data corresponding to the vertex array pointer, the second receiving module 123 receives the decomposed vertex array class instruction sent by the graphics client 11 and sends it to the graphics card 14. The graphics card 14 executes the vertex array class instruction, renders a picture, and stores it in video memory. The rendered picture may be, but is not limited to, a three-dimensional picture; it may also be two-dimensional, and it may be a single picture, a combination of several pictures, or part of a complete picture. The rendering module 124 copies the picture into memory by screen capture. Because the picture is relatively large, the rendering module 124 compresses it and sends the compressed code stream to the graphics client 11 through the data channel 13, so that the graphics client 11 decompresses the compressed code stream and, through the graphics device interface 110, redirects the vertex array class instruction to the TC end 15 to execute the vertex array class instruction and generate the screen image.
FIG. 11 is a schematic structural diagram of the apparatus for caching vertex data in a GPU according to the first embodiment of the present invention. The description builds on FIG. 9 and FIG. 10. As shown in FIG. 11, the vertex data caching apparatus 100 includes a first cache module 113, a first buffer area 111, a sending module 115, a second buffer area 121, and a second cache module 122.
In this embodiment, the first cache module 113 is configured to create the first buffer area 111 to cache vertex data, where the vertex data includes the vertex array pointer and the vertex array length. The sending module 115 is configured to send a synchronization instruction to the graphics server 12, where the synchronization instruction includes the vertex array pointer. The second cache module 122 is configured to create the second buffer area 121 according to the synchronization instruction to cache vertex data, and the second buffer area 121 forms a vertex data mapping relationship with the first buffer area 111 through the vertex array pointer. In this embodiment, the first buffer area 111 and the second buffer area 121 are ultimately created according to the intercepted vertex array class instructions, which is a continuous process.
Further, the first cache module 113 uses the cache unit mode as the carrier for learning, predicting, and correcting the vertex array pointer and the vertex array length. The cache unit mode specifies the start address of the vertex array and the byte length of each element, and geometric units are drawn according to offsets from the start address. When learning, predicting, and correcting the vertex array pointer, the first cache module 113 is configured to: obtain the vertex array class instruction; perform a hash lookup with the vertex array pointer; and determine whether the lookup hits. If it hits, the pointer is set as the current cached array pointer for use by the draw-vertex instruction; if it misses, the vertex array pointer and the related feature information are added to the Hashtable. The cached data pointer is then passed through transparently. When learning, predicting, and correcting the vertex array length, the first cache module 113 is configured to: obtain the draw-vertex instruction and determine whether the intercepted vertex data has already been cached. If it has been cached, the module determines whether the intercepted cached vertex data exists in the local data; if so, the draw-vertex instruction is passed through transparently; if not, the draw-vertex instruction is decomposed, that is, draw-vertex instructions of the pass-by-value class are used, and the vertex data is saved in the Hashtable so that cache optimization can be applied the next time. If the vertex data has not been cached, the module determines whether the vertex array length needs to be updated; if so, the vertex array length is updated; if not, the draw-vertex instruction is decomposed, that is, draw-vertex instructions of the pass-by-value class are used. The local data is vertex data pre-stored on the graphics client 11 that can be sent to and used by the graphics server 12 without decomposition. Therefore, if the intercepted vertex data does not exist in the local data, or has not been cached, cache optimization cannot be performed and the draw-vertex instruction can only be decomposed into draw-vertex instructions of the pass-by-value class. If the intercepted vertex data exists in the local data, that is, the local data contains vertex data that matches the intercepted vertex data, cache optimization can be performed and the vertex array class instruction does not need to be decomposed. Latency and the bandwidth required on the transmission channel can thus be reduced significantly, the CPU consumption caused by memory sharing is lowered, VM density is increased, and cost is reduced.
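The two decision procedures described above, the hash lookup on the vertex array pointer and the branch taken for each draw-vertex instruction, can be sketched as follows. The class ModeLearner, the returned decision strings, and the per-pointer record fields (stride, length, cached) are hypothetical names chosen for illustration; the source does not specify what the related feature information contains.

```python
class ModeLearner:
    """Hypothetical model of pointer and length learning on the client."""

    def __init__(self):
        self.table = {}     # stand-in for the Hashtable, keyed by pointer
        self.active = None  # pointer of the array currently being drawn

    def on_pointer(self, pointer, stride):
        # Hash lookup with the vertex array pointer.
        self.active = pointer
        if pointer in self.table:
            return "hit"
        # Miss: record the pointer and its feature information.
        self.table[pointer] = {"stride": stride, "length": 0, "cached": False}
        return "miss"

    def mark_cached(self, pointer):
        # Called once the buffer areas for this pointer have been created.
        self.table[pointer]["cached"] = True

    def on_draw(self, first, count, in_local_data):
        entry = self.table[self.active]
        if entry["cached"]:
            # Cached: pass through on a local-data match, else decompose
            # and save the data for next time.
            return "pass_through" if in_local_data else "decompose_and_save"
        needed = (first + count) * entry["stride"]
        if needed > entry["length"]:
            # Not cached and the draw reaches past the known length.
            entry["length"] = needed
            return "update_length"
        return "decompose"
```

In this model the learned length only ever grows, which matches the case in the text where the length is updated to a larger value.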
In this embodiment, when the vertex array length is updated, the first cache module 113 first creates a temporary buffer area and copies the newly added data into it immediately. By the time the traversal of the previous cache unit mode finishes, the temporary buffer area already holds the historical data. The buffer area of the previous mode is then created, and the vertex data in the temporary buffer area is transferred to it as a whole before the next cache unit mode is traversed.
In this embodiment, the first cache module 113 creates the first buffer area 111 to cache vertex data, the sending module 115 sends a synchronization instruction to the graphics server 12, and the second cache module 122 creates the second buffer area 121 according to the synchronization instruction to cache vertex data. The second cache module 122 forms a vertex data mapping relationship with the first buffer area 111 through the vertex array pointer, which ensures the consistency of the cached vertex data. When the intercepted vertex data exists in the local data, cache optimization of the vertex data is performed, so that vertex array class instructions do not need to be decomposed. Even if some vertex array class instructions still have to be decomposed, the total number of instructions to be transmitted is greatly reduced. This solves the problem that a vertex array class instruction passed through directly produces errors on the graphics server 12, significantly reduces latency and the bandwidth required on the transmission channel, lowers the CPU consumption caused by memory sharing, increases VM density, and reduces cost.
FIG. 12 is a schematic structural diagram of the GPU graphics client according to the second embodiment of the present invention. As shown in FIG. 12, the GPU graphics client 20 includes a processor 201, a memory 202, a receiver 203, a bus 204, and a transmitter 205. The processor 201, the memory 202, the transmitter 205, and the receiver 203 are connected through the bus 204 so that they can communicate with one another.
Specifically, the receiver 203 is configured to intercept vertex array class instructions. The processor 201 is configured to create the first buffer area, the memory 202 caches the vertex data, and the transmitter 205 sends a synchronization instruction to the graphics server so that the second buffer area is created. The second buffer area forms a vertex data mapping relationship with the first buffer area. The vertex data is obtained from the vertex array class instruction and includes the vertex array pointer and the vertex array length. In this embodiment, the first buffer area and the second buffer area are ultimately created according to the intercepted vertex array class instructions, which is a continuous process. The processor 201 is further configured to query the local data. If the local data contains vertex data that matches the intercepted vertex data, the transmitter 205 packs the vertex array class instruction and sends it to the graphics server, and the processor 201 renders a picture according to the vertex data in the second buffer area and the packed vertex array class instruction; in other words, the vertex array class instruction is cache-optimized. If no such vertex data exists, the vertex array class instruction is decomposed, that is, draw-vertex instructions of the pass-by-value class are used, the vertex data is saved in the Hashtable so that cache optimization can be applied the next time, the transmitter 205 sends the decomposed instruction to the graphics server, and the processor 201 renders a picture according to the decomposed vertex array class instruction. The rendered picture may be, but is not limited to, a three-dimensional picture; it may also be two-dimensional, and it may be a single picture, a combination of several pictures, or part of a complete picture. The local data is vertex data pre-stored on the graphics client that can be sent to and used by the graphics server without decomposition.
In this embodiment, the receiver 203 is further configured to receive the picture and paste it to the graphics device interface. The graphics device interface redirects the vertex array class instruction to the TC end to execute the vertex array class instruction and generate the screen image. If newly added vertex data is historical data but the first buffer area that cached it has been released, or its vertex array length needs to be updated to a larger value, the processor 201 also creates a temporary buffer area, copies the newly added vertex data into it, and then copies the vertex data from the temporary buffer area to the first buffer area.
In this embodiment, the transmitter 205 sends the synchronization instruction to the graphics server so that the second buffer area is created. The synchronization instruction includes the vertex array pointer, and the second buffer area forms a vertex data mapping relationship with the first buffer area through the vertex array pointer. Vertex array class instructions can thus be cache-optimized and no longer need to be decomposed, which solves the problem that a vertex array class instruction passed through directly produces errors on the graphics server. Even if some vertex array class instructions still have to be decomposed, the total number of instructions to be transmitted is greatly reduced, which reduces the time needed to transmit all instructions and the bandwidth occupied. The consistency of the cached vertex data is thus ensured, latency and the bandwidth required on the transmission channel can be reduced significantly, the CPU consumption caused by memory sharing is lowered, VM density is increased, and cost is reduced.
FIG. 13 is a schematic structural diagram of the GPU graphics server according to the second embodiment of the present invention. As shown in FIG. 13, the GPU graphics server 30 includes a processor 301, a memory 302, a receiver 303, and a bus 304. The processor 301, the memory 302, and the receiver 303 are connected through the bus 304 so that they can communicate with one another.
Specifically, the processor 301 is configured to create the second buffer area. The memory 302 caches the vertex data, and the second buffer area forms a vertex data mapping relationship with the first buffer area of the graphics client. The vertex data includes the vertex array pointer and the vertex array length. In this embodiment, the first buffer area and the second buffer area are ultimately created according to the intercepted vertex array class instructions, which is a continuous process. The processor 301 determines, according to the vertex array pointer, whether the second buffer area has cached the corresponding vertex data. If so, the receiver 303 receives the packed vertex array class instruction sent by the graphics client, and the processor 301 renders a picture according to the vertex data in the second buffer area and the packed vertex array class instruction and sends the picture to the graphics client. If not, the receiver 303 receives the decomposed vertex array class instruction sent by the graphics client, and the processor 301 renders a picture according to the decomposed vertex array class instruction and sends the picture to the graphics client.
In this embodiment, the receiver 303 also receives, through the data channel, a synchronization instruction sent by the graphics client, where the synchronization instruction includes the vertex array pointer. The processor 301 creates the second buffer area according to the synchronization instruction in order to cache vertex data. The second buffer area forms a vertex data mapping relationship with the first buffer area of the graphics client through the vertex array pointer, which ensures the consistency of the cached vertex data. When the intercepted vertex data exists in the local data, cache optimization is applied to the vertex data, so the vertex array class instruction does not need to be decomposed. Even though some vertex array class instructions still have to be decomposed, the total number of instructions to be transmitted is greatly reduced. This solves the problem that vertex array class instructions passed through directly produce errors on the graphics server, and it substantially reduces latency and the bandwidth used on the transmission channel, reduces the CPU cost of memory sharing, increases VM density, and lowers cost.
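The synchronization step can be pictured with the following minimal sketch. The source states only that the synchronization instruction includes the vertex array pointer; the `length` and `payload` fields, and the class names `SyncInstruction`, `ClientCache`, and `ServerCache`, are assumptions made here so that the example is self-contained.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SyncInstruction:
    pointer: int     # vertex array pointer (stated in the source)
    length: int      # vertex array length in bytes (assumed to travel with it)
    payload: bytes   # copy of the vertex bytes (assumption)


class ClientCache:
    """Stands in for the first buffer area on the graphics client."""

    def __init__(self) -> None:
        self.first_buffer: Dict[int, bytes] = {}

    def cache(self, pointer: int, data: bytes) -> SyncInstruction:
        """Store data in the first buffer area and build the sync instruction."""
        self.first_buffer[pointer] = bytes(data)
        return SyncInstruction(pointer, len(data), bytes(data))


class ServerCache:
    """Stands in for the second buffer area on the graphics server."""

    def __init__(self) -> None:
        self.second_buffer: Dict[int, bytes] = {}

    def on_sync(self, sync: SyncInstruction) -> None:
        """Create or refresh the second-buffer entry keyed by the same pointer."""
        if len(sync.payload) != sync.length:
            raise ValueError("sync payload does not match declared length")
        self.second_buffer[sync.pointer] = sync.payload


client, server = ClientCache(), ServerCache()
server.on_sync(client.cache(0x2000, b"\x00\x01\x02\x03"))
```

Because both sides key their entries by the same vertex array pointer, a later instruction only needs to name the pointer for the server to find the matching data.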
FIG. 14 is a schematic structural diagram of a system for implementing GPU virtualization according to the second embodiment of the present invention. As shown in FIG. 14, the GPU virtualization implementation system 40 of the second embodiment includes a graphics client 41, a graphics server 42, a data channel 43, a graphics card 44, and a TC end 45. The graphics client 41 includes a graphics device interface 410, and the data channel 43 includes a vertex data buffer area 431. The graphics client 41 is connected to the graphics server 42 through the data channel 43, the graphics card 44 is connected to the graphics server 42, and the TC end 45 is connected to the graphics device interface 410 of the graphics client 41.
In this embodiment, the data channel 43 is a shared memory, and the graphics client 41 and the graphics server 42 share the vertex data buffer area 431 in the shared memory to implement vertex data caching. Specifically, the TC end 45 sends 3D instructions to the graphics device interface 410 of the graphics client 41 through mouse and keyboard redirection, and the graphics client 41 intercepts the 3D instructions, which include vertex array class instructions, through the OpenGL ICD driver of the graphics device interface 410. The graphics client 41 caches vertex data in the vertex data buffer area 431 and sends a synchronization instruction to the graphics server 42 through the data channel 43; the graphics server 42 caches vertex data in the vertex data buffer area 431, which ensures the consistency of the cached vertex data. In this embodiment, the vertex data buffer area 431 is ultimately created according to the intercepted vertex array class instructions, and its creation is a continuous process.
The graphics client 41 queries the local data. If the local data contains vertex data consistent with the intercepted vertex data, the graphics client 41 packs the vertex array class instruction and sends it to the graphics server 42, so that the graphics server 42 renders a picture according to the vertex data in the vertex data buffer area 431 and the packed vertex array class instruction; that is, cache optimization is applied to the vertex array class instruction. If no such vertex data exists, the graphics client 41 decomposes the vertex array class instruction, that is, it uses pass-by-value draw-vertex instructions, saves the vertex data in the Hashtable so that cache optimization can be applied next time, and sends the decomposed instructions to the graphics server 42, so that the graphics server 42 renders a picture according to the decomposed vertex array class instructions. The local data is vertex data pre-stored in the graphics client; this vertex data can be sent to and used by the graphics server 42 without being decomposed.
Specifically, when the intercepted vertex data exists in the local data, the graphics client 41 packs the vertex array class instruction and sends it through the data channel 43 to the graphics server 42, which unpacks the instruction and sends it to the graphics card 44 to render a picture. When the intercepted vertex data does not exist in the local data, the graphics client 41 sends the decomposed vertex array class instructions through the data channel 43 to the graphics server 42, which forwards them to the graphics card 44 to render a picture. The rendered picture may be, but is not limited to, three-dimensional; it may also be two-dimensional, and it may be a single picture, a combination of several pictures, or part of a complete picture. The graphics server 42 copies the picture into memory by screen capture and sends it through the data channel 43 to the graphics client 41. The graphics client 41 receives the picture and pastes it to the graphics device interface 410, and the graphics device interface 410 redirects the vertex array class instructions to the TC end 45 to execute the vertex array class instructions and generate the screen image. The vertex data is obtained from the vertex array class instruction and includes the vertex array pointer and the vertex array length.
In this embodiment, vertex data caching is implemented by having the graphics client 41 and the graphics server 42 share the vertex data buffer area 431 in the shared memory, which ensures the consistency of the cached vertex data. When the intercepted vertex data exists in the local data, cache optimization is applied to the vertex data, so the vertex array class instruction does not need to be decomposed. This solves the problem that vertex array class instructions passed through directly produce errors on the graphics server 42. Even though some vertex array class instructions still have to be decomposed, the total number of instructions to be transmitted is greatly reduced, which shortens the time needed to transmit all instructions and reduces bandwidth usage. Latency and the bandwidth used on the transmission channel are therefore substantially reduced, the CPU cost of memory sharing is lowered, VM density is increased, and cost is lowered. The approach also reduces the use of cache memory and simplifies maintaining cache consistency between the graphics client 41 and the graphics server 42.
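The client-side choice between packing and decomposing, and its effect on the number of transmitted instructions, might look like the sketch below. It is an illustration under stated assumptions rather than the implementation of the embodiment: the class `GraphicsClient`, the tuple encodings `("packed", ...)` and `("vertex", ...)`, and the use of a plain dictionary for the Hashtable are all hypothetical.

```python
from typing import Dict, List, Tuple

Vertex = Tuple[float, float, float]
Instruction = Tuple  # ("packed", pointer, count) or ("vertex", x, y, z)


class GraphicsClient:
    def __init__(self) -> None:
        # Local data: vertex arrays already known on both sides, keyed by pointer.
        self.hashtable: Dict[int, List[Vertex]] = {}

    def draw(self, pointer: int, vertices: List[Vertex]) -> List[Instruction]:
        """Return the instructions that would be sent over the data channel."""
        if self.hashtable.get(pointer) == vertices:
            # Hit: one packed instruction that refers to the cached array.
            return [("packed", pointer, len(vertices))]
        # Miss: remember the data for next time, then pass each vertex by value.
        self.hashtable[pointer] = list(vertices)
        return [("vertex", *v) for v in vertices]


client = GraphicsClient()
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
first_frame = client.draw(0x3000, triangle)    # decomposed: one instruction per vertex
second_frame = client.draw(0x3000, triangle)   # packed: a single instruction
```

In this sketch the first draw of an array costs one instruction per vertex, while every later draw of the same unchanged array costs a single instruction, which mirrors the reduction in transmitted instructions described above.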
In summary, in the present invention the graphics client intercepts a vertex array class instruction; performs vertex data caching to create a first buffer area and sends a synchronization instruction to the graphics server to create a second buffer area, the second buffer area forming a vertex data mapping relationship with the first buffer area; and queries the local data. If the local data contains vertex data consistent with the intercepted vertex data, the vertex array class instruction is packed and sent to the graphics server, so that the graphics server renders a picture according to the vertex data in the second buffer area and the packed vertex array class instruction. If not, the vertex array class instruction is decomposed and sent to the graphics server, so that the graphics server renders a picture according to the decomposed vertex array class instructions. Once the second buffer area and the first buffer area have formed the vertex data mapping relationship, the vertex array class instruction no longer needs to be decomposed, which solves the problem that vertex array class instructions passed through directly produce errors on the graphics server. Even though some vertex array class instructions still have to be decomposed, the total number of instructions to be transmitted is greatly reduced, which substantially reduces latency and the bandwidth used on the transmission channel, reduces the CPU cost of memory sharing, increases VM density, and lowers cost.
The above are merely embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A method for implementing GPU virtualization, wherein the method comprises:
a graphics client intercepting a vertex array class instruction;
performing vertex data caching to create a first buffer area, and sending a synchronization instruction to a graphics server to create a second buffer area, wherein the second buffer area forms a vertex data mapping relationship with the first buffer area, and the vertex data is obtained from the vertex array class instruction and comprises a vertex array pointer and a vertex array length; and
querying local data, and if the local data contains vertex data consistent with the intercepted vertex data, packing the vertex array class instruction and sending it to the graphics server, so that the graphics server renders a picture according to the vertex data in the second buffer area and the packed vertex array class instruction; if not, decomposing the vertex array class instruction and sending it to the graphics server, so that the graphics server renders a picture according to the decomposed vertex array class instruction, wherein the local data is vertex data pre-stored in the graphics client, and this vertex data can be sent to and used by the graphics server without being decomposed.
2. The method according to claim 1, wherein the method further comprises:
the graphics client receiving, through a data channel, the picture sent by the graphics server and pasting it to a graphics device interface; and
redirecting, through the graphics device interface, the vertex array class instruction to a TC end to execute the vertex array class instruction and generate a screen image.
3. The method according to claim 1, wherein performing vertex data caching to create the first buffer area comprises: if newly added vertex data is historical data, but the cached first buffer area has been released or its vertex array length needs to be updated to a larger value, then:
creating a temporary buffer area;
copying the newly added vertex data into the temporary buffer area; and
copying the vertex data from the temporary buffer area to the first buffer area.
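By way of non-limiting illustration only, the three copy steps of claim 3 might be sketched as follows. The function name `refresh_first_buffer`, the dictionary standing in for the first buffer area, and the early return for entries that are still usable are assumptions of this sketch and form no part of the claim.

```python
from typing import Dict


def refresh_first_buffer(first_buffer: Dict[int, bytearray],
                         pointer: int, new_data: bytes) -> bool:
    """Re-create a released entry, or enlarge one that is too short.

    Returns True when the entry was rebuilt through a temporary buffer.
    """
    entry = first_buffer.get(pointer)
    released = entry is None
    must_grow = entry is not None and len(new_data) > len(entry)
    if not (released or must_grow):
        return False  # existing entry is still usable (assumed behavior)
    temp = bytearray(len(new_data))           # 1. create the temporary buffer area
    temp[:] = new_data                        # 2. copy the new vertex data into it
    first_buffer[pointer] = bytearray(temp)   # 3. copy from it into the first buffer
    return True


first_buffer = {0x10: bytearray(b"abcd")}
grew = refresh_first_buffer(first_buffer, 0x10, b"abcdefgh")   # length 4 -> 8
recreated = refresh_first_buffer(first_buffer, 0x20, b"xy")    # entry was absent
```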
4. The method according to claim 1, wherein performing vertex data caching to create the first buffer area, and sending the synchronization instruction to the graphics server to create the second buffer area, the second buffer area forming the vertex data mapping relationship with the first buffer area, comprises:
performing the vertex data caching and creating the first buffer area; and
sending the synchronization instruction to the graphics server to create the second buffer area, wherein the synchronization instruction comprises the vertex array pointer, and the second buffer area forms the vertex data mapping relationship with the first buffer area through the vertex array pointer.
5. The method according to claim 1, wherein the first buffer area is located in the graphics client.
6. The method according to claim 1, wherein the first buffer area is located in a shared memory.
7. A method for implementing GPU virtualization, wherein the method comprises:
creating a second buffer area according to a received synchronization instruction to perform vertex data caching, wherein the second buffer area forms a vertex data mapping relationship with a first buffer area of a graphics client, and the vertex data comprises a vertex array pointer and a vertex array length; and
determining, according to the vertex array pointer, whether the second buffer area has cached corresponding vertex data; if so, receiving a packed vertex array class instruction sent by the graphics client, and rendering a picture according to the vertex data in the second buffer area and the packed vertex array class instruction, for sending to the graphics client; if not, receiving a decomposed vertex array class instruction sent by the graphics client, and rendering a picture according to the decomposed vertex array class instruction, for sending to the graphics client.
8. The method according to claim 7, wherein receiving the synchronization instruction and creating the second buffer area to perform vertex data caching, the second buffer area forming the vertex data mapping relationship with the first buffer area of the graphics client, comprises:
receiving the synchronization instruction sent by the graphics client, wherein the synchronization instruction comprises the vertex array pointer; and
creating the second buffer area according to the synchronization instruction to perform vertex data caching, wherein the second buffer area forms the vertex data mapping relationship with the first buffer area of the graphics client through the vertex array pointer.
9. The method according to claim 7, wherein the second buffer area is located in the graphics server.
10. The method according to claim 7, wherein the second buffer area is located in a shared memory.
11. A method for caching vertex data in GPU virtualization, wherein the method comprises:
creating, by a graphics client, a first buffer area and performing vertex data caching, wherein the vertex data comprises a vertex array pointer and a vertex array length;
sending a synchronization instruction to a graphics server, wherein the synchronization instruction comprises the vertex array pointer; and
creating, by the graphics server, a second buffer area according to the synchronization instruction and performing vertex data caching, wherein the second buffer area forms a vertex data mapping relationship with the first buffer area through the vertex array pointer.
12. The method according to claim 11, wherein the vertex data caching performs learning, prediction, and correction with a cache unit mode as the carrier, including learning, prediction, and correction of the vertex array pointer and of the vertex array length.
13. The method according to claim 12, wherein the cache unit mode comprises:
specifying the first address of the vertex array and the length of each byte; and
drawing geometric units according to offsets from the first address.
14. The method according to claim 12, wherein the learning, prediction, and correction of the vertex array pointer comprises:
obtaining the vertex array class instruction;
performing a hash lookup using the vertex array pointer;
determining whether the lookup hits; if so, setting the entry as the current cached data pointer for use by the draw-vertex pointer; if not, adding the vertex array pointer and related feature information to the Hashtable; and
passing through the cached data pointer.
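By way of non-limiting illustration only, the hash lookup of claim 14 might be sketched as follows. The class names `CacheEntry` and `PointerLearner`, the choice of `size` and `stride` as the "related feature information", and the decision to clear the current pointer on a miss are assumptions of this sketch and are not recited in the claim.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CacheEntry:
    pointer: int
    size: int        # components per vertex (assumed feature information)
    stride: int      # bytes between consecutive vertices (assumed feature information)
    length: int = 0  # vertex array length, not yet learned


class PointerLearner:
    def __init__(self) -> None:
        self.hashtable: Dict[int, CacheEntry] = {}
        self.current: Optional[CacheEntry] = None

    def on_vertex_pointer(self, pointer: int, size: int, stride: int) -> bool:
        """Handle an intercepted vertex array class instruction; True on a hit."""
        entry = self.hashtable.get(pointer)
        if entry is not None:
            self.current = entry   # hit: becomes the current cached data pointer
            return True
        # Miss: learn the pointer and its feature information for later lookups.
        self.hashtable[pointer] = CacheEntry(pointer, size, stride)
        self.current = None        # assumption: nothing usable until it is cached
        return False


learner = PointerLearner()
first_seen = learner.on_vertex_pointer(0x4000, size=3, stride=12)   # miss
seen_again = learner.on_vertex_pointer(0x4000, size=3, stride=12)   # hit
```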
15. The method according to claim 12, wherein the learning, prediction, and correction of the vertex array length comprises:
obtaining the draw-vertex instruction; and
determining whether the vertex data has been cached; if so, determining whether the cached vertex data exists in the local data, and if so, passing through the draw-vertex pointer, and if not, decomposing the draw-vertex pointer; if the vertex data has not been cached, determining whether the vertex array length needs to be updated, and if so, updating the vertex array length, and if not, decomposing the draw-vertex pointer, wherein the local data is vertex data pre-stored in the graphics client, and this vertex data can be sent to and used by the graphics server without being decomposed.
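By way of non-limiting illustration only, the branching recited in claim 15 can be written as a small decision function. The names `Action`, `on_draw`, and `needs_length_update` are hypothetical, and the rule used to decide that the length needs updating (the draw call reaches past the known number of vertices) is an assumption of this sketch; the claim does not state how that determination is made.

```python
from enum import Enum


class Action(Enum):
    PASS_THROUGH = "pass through the draw-vertex instruction"
    DECOMPOSE = "decompose the draw-vertex instruction"
    UPDATE_LENGTH = "update the vertex array length"


def needs_length_update(known_length: int, first: int, count: int) -> bool:
    """Assumed rule: the draw call touches vertices beyond the known length."""
    return first + count > known_length


def on_draw(cached: bool, in_local_data: bool, length_update_needed: bool) -> Action:
    """Mirror the branch order of the claim for one draw-vertex instruction."""
    if cached:
        return Action.PASS_THROUGH if in_local_data else Action.DECOMPOSE
    return Action.UPDATE_LENGTH if length_update_needed else Action.DECOMPOSE


# A draw of vertices 0..5 against an array whose known length is 3 vertices.
decision = on_draw(cached=False, in_local_data=False,
                   length_update_needed=needs_length_update(3, first=0, count=6))
```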
16. A GPU graphics client, wherein the graphics client comprises an instruction acquisition module, a first cache module, a query module, and a sending module, wherein:
the instruction acquisition module is configured to intercept a vertex array class instruction;
the first cache module is configured to perform vertex data caching to create a first buffer area, and to send a synchronization instruction to a graphics server to create a second buffer area, wherein the second buffer area forms a vertex data mapping relationship with the first buffer area, and the vertex data is obtained from the vertex array class instruction and comprises a vertex array pointer and a vertex array length; and
the query module is configured to query local data; if the local data contains vertex data consistent with the intercepted vertex data, the sending module packs the vertex array class instruction and sends it to the graphics server, so that the graphics server renders a picture according to the vertex data in the second buffer area and the packed vertex array class instruction; if not, the sending module decomposes the vertex array class instruction and sends it to the graphics server, so that the graphics server renders a picture according to the decomposed vertex array class instruction, wherein the local data is vertex data pre-stored in the graphics client, and this vertex data can be sent to and used by the graphics server without being decomposed.
17. The graphics client according to claim 16, wherein the graphics client further comprises a first receiving module and a graphics device interface, wherein:
the first receiving module is configured to receive the picture through the data channel and paste it to the graphics device interface; and
the graphics device interface redirects the vertex array class instruction to a TC end to execute the vertex array class instruction and generate a screen image.
18. The graphics client according to claim 16, wherein the sending module further sends a synchronization instruction to the graphics server, the synchronization instruction comprises the vertex array pointer, and the first buffer area forms the vertex data mapping relationship with the second buffer area of the graphics server through the vertex array pointer.
19. The graphics client according to claim 16, wherein, if newly added vertex data is historical data, but the cached first buffer area has been released or its vertex array length needs to be updated to a larger value, the first cache module is further configured to:
create a temporary buffer area;
copy the newly added vertex data into the temporary buffer area; and
copy the vertex data from the temporary buffer area to the first buffer area.
20. A GPU graphics server, wherein the graphics server comprises a second cache module, a second receiving module, and a rendering module, wherein:
the second cache module is configured to create a second buffer area according to a received synchronization instruction to perform vertex data caching, wherein the second buffer area forms a vertex data mapping relationship with a first buffer area of a graphics client, and the vertex data comprises a vertex array pointer and a vertex array length; and
the second receiving module is configured to determine, according to the vertex array pointer, whether the second buffer area has cached corresponding vertex data; if so, the second receiving module receives a packed vertex array class instruction sent by the graphics client, and the rendering module renders a picture according to the vertex data in the second buffer area and the packed vertex array class instruction, for sending to the graphics client; if not, the second receiving module receives a decomposed vertex array class instruction sent by the graphics client, and the rendering module renders a picture according to the decomposed vertex array class instruction, for sending to the graphics client.
21. The graphics server according to claim 20, wherein the second receiving module further receives the synchronization instruction sent by the graphics client, wherein the synchronization instruction comprises the vertex array pointer; and
the second cache module creates the second buffer area according to the synchronization instruction to perform vertex data caching, wherein the second buffer area forms the vertex data mapping relationship with the first buffer area of the graphics client through the vertex array pointer.
22. An apparatus for caching vertex data in GPU virtualization, wherein the apparatus comprises:
a first cache module, configured to create a first buffer area in the graphics client and perform vertex data caching, wherein the vertex data comprises a vertex array pointer and a vertex array length;
a sending module, configured to send a synchronization instruction to a graphics server, wherein the synchronization instruction comprises the vertex array pointer; and
a second cache module, configured to create, through the graphics server, a second buffer area according to the synchronization instruction and perform vertex data caching, wherein the second buffer area forms a vertex data mapping relationship with the first buffer area through the vertex array pointer.
23. The apparatus according to claim 22, wherein the first cache module performs learning, prediction, and correction of the vertex array pointer and the vertex array length with a cache unit mode as the carrier.
24. The apparatus according to claim 23, wherein the cache unit mode comprises: specifying the first address of the vertex array and the length of each byte; and drawing geometric units according to offsets from the first address.
25. The apparatus according to claim 23, wherein, when learning, predicting, and correcting the vertex array pointer, the first cache module is configured to:
obtain the vertex array class instruction;
perform a hash lookup using the vertex array pointer;
determine whether the lookup hits; if so, set the entry as the current cached data pointer for use by the draw-vertex pointer; if not, add the vertex array pointer and related feature information to the Hashtable; and
pass through the cached data pointer.
26. The apparatus according to claim 23, wherein, when learning, predicting, and correcting the vertex array length, the first cache module is configured to:
obtain the draw-vertex instruction; and
determine whether the vertex data has been cached; if so, determine whether the cached vertex data exists in the local data, and if so, pass through the draw-vertex pointer, and if not, decompose the draw-vertex pointer; if the vertex data has not been cached, determine whether the vertex array length needs to be updated, and if so, update the vertex array length, and if not, decompose the draw-vertex pointer, wherein the local data is vertex data pre-stored in the graphics client, and this vertex data can be sent to and used by the graphics server without being decomposed.
PCT/CN2014/079557 2013-11-08 2014-06-10 Gpu virtualization realization method as well as vertex data caching method and related device WO2015067043A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310554845.0 2013-11-08
CN201310554845.0A CN103559078B (en) 2013-11-08 2013-11-08 GPU (Graphics Processing Unit) virtualization realization method as well as vertex data caching method and related device

Publications (2)

Publication Number Publication Date
WO2015067043A1 true WO2015067043A1 (en) 2015-05-14
WO2015067043A9 WO2015067043A9 (en) 2015-09-03

Family

ID=50013331

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/079557 WO2015067043A1 (en) 2013-11-08 2014-06-10 Gpu virtualization realization method as well as vertex data caching method and related device

Country Status (2)

Country Link
CN (1) CN103559078B (en)
WO (1) WO2015067043A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210243444A1 (en) * 2018-05-01 2021-08-05 Nvidia Corporation Managing virtual machine density by controlling server resource

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559078B (en) * 2013-11-08 2017-04-26 华为技术有限公司 GPU (Graphics Processing Unit) virtualization realization method as well as vertex data caching method and related device
WO2015154226A1 (en) * 2014-04-08 2015-10-15 华为技术有限公司 Method, device and processor for data communication in virtualized environment
CN105139356B (en) * 2015-08-25 2018-06-22 北京锤子数码科技有限公司 The frosted glass effect processing method and device of a kind of image data
CN108346126B (en) * 2017-01-24 2023-01-06 深圳博十强志科技有限公司 Method and device for drawing mobile phone picture based on memory copy mode
CN109509139B (en) * 2017-09-14 2023-06-27 龙芯中科技术股份有限公司 Vertex data processing method, device and equipment
CN108415854A (en) * 2018-02-11 2018-08-17 中国神华能源股份有限公司 Data collecting system based on shared buffer memory and method
CN110580674B (en) * 2019-07-24 2024-01-16 西安万像电子科技有限公司 Information processing method, device and system
CN111309649B (en) * 2020-02-11 2021-05-25 支付宝(杭州)信息技术有限公司 Data transmission and task processing method, device and equipment
CN112669428B (en) * 2021-01-06 2024-06-25 南京亚派软件技术有限公司 BIM model rendering method based on cooperation of server and client
CN116230006A (en) * 2023-05-09 2023-06-06 成都力比科技有限公司 Sound effect visualization method based on GPU

Citations (5)

Publication number Priority date Publication date Assignee Title
EP0595066A2 (en) * 1992-10-29 1994-05-04 International Business Machines Corporation Context management in a graphics system
CN102394935A (en) * 2011-11-10 2012-03-28 方正国际软件有限公司 Wireless shared storage system and wireless shared storage method thereof
CN102819819A (en) * 2012-08-14 2012-12-12 长沙景嘉微电子股份有限公司 Implementation method for quickly reading peak in GPU (graphics processing unit)
CN103200128A (en) * 2013-04-01 2013-07-10 华为技术有限公司 Method, device and system for network package processing
CN103559078A (en) * 2013-11-08 2014-02-05 华为技术有限公司 GPU (Graphics Processing Unit) virtualization realization method as well as vertex data caching method and related device

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN101551761A (en) * 2009-04-30 2009-10-07 浪潮电子信息产业股份有限公司 Method for sharing stream memory of heterogeneous multi-processor

Cited By (2)

Publication number Priority date Publication date Assignee Title
US20210243444A1 (en) * 2018-05-01 2021-08-05 Nvidia Corporation Managing virtual machine density by controlling server resource
US11722671B2 (en) * 2018-05-01 2023-08-08 Nvidia Corporation Managing virtual machine density by controlling server resource

Also Published As

Publication number Publication date
CN103559078B (en) 2017-04-26
CN103559078A (en) 2014-02-05
WO2015067043A9 (en) 2015-09-03

Similar Documents

Publication Publication Date Title
WO2015067043A1 (en) Gpu virtualization realization method as well as vertex data caching method and related device
US11792307B2 (en) Methods and apparatus for single entity buffer pool management
JP5060489B2 (en) Multi-user terminal service promotion device
US7996569B2 (en) Method and system for zero copy in a virtualized network environment
US9535871B2 (en) Dynamic routing through virtual appliances
US9026615B1 (en) Method and apparatus for caching image data transmitted over a lossy network
US9454392B2 (en) Routing data packets between virtual machines using shared memory without copying the data packet
US10635474B2 (en) Systems and methods for virtio based optimization of data packet paths between a virtual machine and a network device for live virtual machine migration
US9363172B2 (en) Managing a configurable routing scheme for virtual appliances
US10355997B2 (en) System and method for improving TCP performance in virtualized environments
WO2017000580A1 (en) Media content rendering method, user equipment, and system
US7926067B2 (en) Method and system for protocol offload in paravirtualized systems
US20180063555A1 (en) Network-enabled graphics processing module
Laufer et al. Climb: Enabling network function composition with click middleboxes
US9300818B2 (en) Information processing apparatus and method
CN113285931B (en) Streaming media transmission method, streaming media server and streaming media system
TWI486787B (en) Method and system of displaying frame
US20120136988A1 (en) Dynamic bandwidth optimization for remote input
WO2023216621A1 (en) Cloud desktop image processing method and apparatus, server and storage medium
US20180300844A1 (en) Data processing
EP3547132B1 (en) Data processing system
Nguyen et al. Reducing data copies between gpus and nics
Heo et al. FleXR: A System Enabling Flexibly Distributed Extended Reality
US8046698B1 (en) System and method for providing collaboration of a graphics session
Jang et al. Design and implementation of a protocol offload engine for TCP/IP and remote direct memory access based on hardware/software coprocessing

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 14859469; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 14859469; Country of ref document: EP; Kind code of ref document: A1)