CN115421787A - Instruction execution method, apparatus, device, system, program product, and medium


Info

Publication number: CN115421787A
Authority: CN (China)
Prior art keywords: gpu, target, memory, instruction, storage space
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202211034224.5A
Priority application: CN202211034224.5A
Other languages: Chinese (zh)
Inventors: 龙毅, 付斌章
Current assignee: Alibaba China Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Alibaba China Co Ltd
Application filed by Alibaba China Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/3004 Arrangements for executing specific machine instructions to perform operations on memory
    • G06F 9/30043 LOAD or STORE instructions; Clear instruction

Abstract

One or more embodiments of the present specification provide an instruction execution method, apparatus, device, system, program product, and medium for a GPU resource pool, the GPU resource pool including a plurality of GPUs and a memory corresponding to each of the GPUs; the method comprises the following steps: determining a usable target GPU and a target memory corresponding to the target GPU; determining a target storage space in the target memory for storing instructions; obtaining at least one first instruction to be executed by the target GPU; and sending the at least one first instruction to the target memory, so that after the at least one first instruction is written into the target storage space, the target GPU reads the first instruction from the target storage space and executes the first instruction.

Description

Instruction execution method, apparatus, device, system, program product, and medium
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method, an apparatus, a device, a system, a program product, and a medium for executing an instruction for a GPU resource pool.
Background
Resource pooling abstracts a class of resources, through technologies such as distributed software and virtualization, into resources that can be shared by users and services across an entire data center according to management requirements. It breaks the model in which a resource is used exclusively by one user or service, and breaks the fixed ratio of physical resources, such as the Central Processing Unit (CPU), Graphics Processing Unit (GPU), disks, or network cards, within a single computer device, so that resources can be dynamically requested and released according to the needs of users and services. A GPU resource pool is one example.
However, compared with configuring a GPU inside a single computer device, using a GPU resource pool lengthens the connection link between the CPU in the computer device and the GPU in the GPU resource pool. How to improve the efficiency with which the GPU executes instructions is therefore an urgent technical problem to be solved.
Disclosure of Invention
To overcome the problems in the related art, embodiments of the present specification provide an instruction execution method, apparatus, device, distributed system, and storage medium for a GPU resource pool.
According to a first aspect of embodiments herein, there is provided an instruction execution method for a GPU resource pool comprising a plurality of GPUs and a memory corresponding to each of the GPUs; the method comprises the following steps:
determining a usable target GPU and a target memory corresponding to the target GPU;
determining a target storage space in the target memory for storing instructions;
obtaining at least one first instruction to be executed by the target GPU;
and sending the at least one first instruction to the target memory, so that after the at least one first instruction is written into the target storage space, the target GPU reads the first instruction from the target storage space and executes the first instruction.
According to a second aspect of embodiments herein, there is provided an instruction execution apparatus for a GPU resource pool, the GPU resource pool comprising a plurality of GPUs and a memory corresponding to each of the GPUs; the device comprises:
a first determination module configured to: determine a usable target GPU and a target memory corresponding to the target GPU;
a second determination module configured to: determine a target storage space in the target memory for storing instructions;
an acquisition module configured to: obtain at least one first instruction to be executed by the target GPU;
a sending module configured to: send the at least one first instruction to the target memory, so that after the at least one first instruction is written into the target storage space, the target GPU reads the first instruction from the target storage space and executes the first instruction.
According to a third aspect of embodiments herein, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method embodiments for instruction execution for a GPU resource pool of the first aspect when executing the computer program.
According to a fourth aspect of embodiments herein, there is provided a distributed system comprising one or more computer devices including a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the instruction execution method embodiments for the GPU resource pool of the first aspect when executing the computer program.
According to a fifth aspect of embodiments herein, there is provided a computer program product comprising a computer program that, when executed by a processor, implements the steps of an embodiment of the method for instruction execution for a GPU resource pool of the first aspect.
According to a sixth aspect of embodiments herein, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the instruction execution method embodiments for the GPU resource pool of the first aspect.
The technical scheme provided by the embodiment of the specification can have the following beneficial effects:
in this embodiment of the present disclosure, a target GPU and a corresponding target memory may be determined, so as to determine a target storage space for storing an instruction in the target memory, where an instruction executed by the GPU is not placed in a memory on a computer device side, but is placed in the target memory corresponding to the target GPU on a GPU resource pool side, and thus, the target GPU does not need to read the instruction across a longer distance, which reduces time consumption for reading the instruction by the target GPU, thereby improving efficiency for executing the instruction by the target GPU.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the specification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and together with the description, serve to explain the principles of the specification.
Fig. 1A is a schematic diagram of an application scenario of a GPU resource pool according to an exemplary embodiment of the present disclosure.
FIG. 1B is a schematic diagram of instruction execution according to an exemplary embodiment of the present specification.
FIG. 1C is a schematic diagram of another instruction execution according to an exemplary embodiment of the present specification.
FIG. 2A is a flowchart illustrating a method for instruction execution for a GPU resource pool in accordance with an exemplary embodiment.
FIG. 2B is another schematic diagram illustrating execution of instructions for a GPU resource pool according to an exemplary embodiment.
FIG. 3A is a flowchart illustrating another method for instruction execution for a GPU resource pool according to an exemplary embodiment.
FIG. 3B1 is a diagram illustrating an allocation of storage according to an exemplary embodiment.
FIG. 3B2 is a schematic diagram illustrating interaction of a computer device with a GPU, according to an exemplary embodiment.
FIG. 3C is a schematic diagram illustrating another allocation of storage space according to an example embodiment of the present description.
FIG. 4 is a block diagram of a computer device shown in the present specification according to an exemplary embodiment.
FIG. 5 is a block diagram illustrating an instruction execution apparatus for GPU resource pools, according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the specification, as detailed in the appended claims.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present specification. The word "if," as used herein, may be interpreted as "when" or "upon" or "in response to determining," depending on the context.
With the rapid popularization of artificial intelligence and the convenience provided by cloud services, the demand for GPUs in cloud services is increasing. Typically, a GPU is configured within a single computer device and physically connected with the other hardware; for example, the GPU is plugged into the motherboard of the computer device and connected to other devices via the motherboard bus. However, this fixed connection relationship is very unfriendly to resource utilization and to equipment upgrade and maintenance. In addition, when a user of a service has performance and scale requirements, those requirements are limited by the GPU resources of a single server.
In order to solve the problems in the above scenarios, a GPU resource pool solution may be adopted. Fig. 1A is a schematic diagram of an application scenario of a GPU resource pool according to an exemplary embodiment of this specification. Fig. 1A includes a device cluster comprising a plurality of computer devices, taking computer device 0 to computer device n as an example. Fig. 1A also shows a GPU resource pool including a plurality of GPUs, exemplified by GPU0 to GPUn, and a management device for managing the GPU resource pool. In some examples, the device cluster, the GPU resource pool, and the management device are connected by a network.
The purpose of GPU pooling is to logically group multiple GPUs into a GPU resource pool, from which the management device can allocate GPUs to different computer devices according to users' needs. When a user stops using the cloud service, the GPU resources can be released back to the resource pool and conveniently allocated to other computer devices in need next time, which greatly improves the convenience and elasticity of the cloud service. Besides cloud service scenarios, scenarios such as AI (Artificial Intelligence) also need GPU resource pools. In the GPU resource pool scheme, however, the transmission distance grows: the link over which a CPU in a computer device accesses a GPU in the pool is lengthened, and the delay of information interaction between the computer device and the GPU increases, causing a loss in the GPU's instruction execution performance.
Fig. 1B is a schematic diagram of an instruction execution system according to an exemplary embodiment, where the computer device includes a CPU (not shown); the management device allocates a GPU to the computer device, the GPU is connected to the CPU, and the GPU can access the memory of the server. Fig. 1B illustrates the interaction between the computer device and the GPU based on a heterogeneous computing software framework: the CPU may create a buffer in memory for storing instructions to be executed by the GPU.
In some examples, a first network card (Network Interface Card, NIC) may be inserted into the motherboard of the computer device; the GPU resource pool includes one or more second network cards, each GPU is connected to one of the second network cards, and the CPU and the GPU communicate directly via the first network card and the second network card. The first network card and the second network card may be common network cards, smart network cards (Smart NICs), PCIE switches (Peripheral Component Interconnect Express, PCIE), or the like. The communication channel between the CPU and the network card, and between the GPU and the network card, is a high-speed serial bus, including but not limited to a Peripheral Component Interconnect (PCI) bus, a PCIE bus, and the like. A network card provides a communication interface: on the computer device side, the communication interface of the network card is connected with the CPU of the computer device; on the GPU side, the communication interface of the network card is connected with the GPU. Thus, information is forwarded between the CPU of the computer device and the GPU of the GPU resource pool through the respective network cards.
For example, the smart network card may serve as an accelerator for the CPU and help the processor interconnected with it implement at least part of its functions; that is, the processor may offload some or all of its functions to the accelerator to achieve performance acceleration. The accelerator may have its own computing resources, including a processor and storage, and may include local storage resources such as memory and a hard disk, or cloud storage resources such as a cloud disk and NAS.
In other examples, the motherboard of the computer device may further include other devices connected to the CPU, such as a DPU (Data Processing Unit). Illustratively, the DPU is a programmable multi-core processor, i.e., an SoC (System on Chip), and also has local storage resources such as memory. In addition to the functions of a Smart NIC, the DPU can offload storage, security, virtualization, and other workloads from the CPU to itself. Similarly, the GPU resource pool also includes a DPU connected with the GPU; thus, information is forwarded between the CPU of the computer device and the GPU of the GPU resource pool through the respective DPUs.
As can be seen from the foregoing embodiments, in a GPU resource pool scenario the CPU cannot communicate with the GPU directly through a motherboard bus, but must communicate through forwarding nodes, for example the aforementioned PCIE switch, DPU, or Smart NIC; information sent by either the CPU or the GPU is forwarded to the other through its forwarding node over the network. In the embodiment shown in Fig. 1B, the CPU of the server is connected to forwarding node 1, the GPU is connected to forwarding node 2, and forwarding node 1 and forwarding node 2 are directly connected. In other embodiments, a device such as a switch or router may sit between forwarding node 1 and forwarding node 2, and the two may communicate over a local area network or a public network.
In a GPU resource pool scenario, the CPU needs to send instructions to the GPU for execution. Some solutions have the CPU store the instructions in CPU memory and have the GPU access the CPU memory. As can be seen from the above embodiments, the link between the CPU and the GPU is long. Fig. 1C is a schematic diagram of an instruction execution method according to an exemplary embodiment of this specification, where host denotes the host side of a computer device, including the CPU and the memory on the computer device side. Seven steps of instruction processing are shown (a minimal code model of this flow follows the step list):
step 1, the instructions generated by an application program are stored into the host's instruction queue (host queue);
step 2, the host queue is copied into the memory, that is, into the instruction queue to be executed by the GPU;
step 3, after determining that the instruction queue has been written into the memory, the host notifies the GPU;
step 4, the GPU fetches an instruction from the CPU memory through the read pointer;
step 5, the GPU updates the read pointer after fetching the instruction;
step 6, the GPU executes the fetched instruction;
step 7, the GPU sends a signal representing the execution result to the host.
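To make these steps concrete, the following is a minimal single-process C++ model of the Fig. 1C flow; the GPU side is simulated by an ordinary function, and all names (Ring, host_submit, gpu_consume) are illustrative rather than taken from any real driver API:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Simplified model of the Fig. 1C flow. In the real system the queue lives in
// CPU-side memory that the GPU reads over the network; here both sides share
// one address space, so only the protocol itself is shown.
struct Ring {
    std::vector<uint32_t> slots;  // instruction storage (step 2 writes here)
    size_t write_ptr = 0;         // advanced by the host
    size_t read_ptr = 0;          // advanced by the GPU (step 5)
};

// Steps 1-3: the host stores instructions and notifies the GPU.
void host_submit(Ring& ring, const std::vector<uint32_t>& instrs) {
    for (uint32_t ins : instrs) {          // step 1: app-generated instructions
        ring.slots.push_back(ins);         // step 2: write into the queue memory
    }
    ring.write_ptr = ring.slots.size();
    // step 3: a doorbell/notification to the GPU would be sent here.
}

// Steps 4-7: the GPU fetches, updates the read pointer, executes, and signals.
void gpu_consume(Ring& ring) {
    while (ring.read_ptr < ring.write_ptr) {
        uint32_t ins = ring.slots[ring.read_ptr];       // step 4: fetch via read pointer
        ring.read_ptr++;                                // step 5: update read pointer
        std::printf("executing instruction %u\n",
                    static_cast<unsigned>(ins));        // step 6: execute
    }
    std::printf("done\n");  // step 7: signal the execution result to the host
}

int main() {
    Ring ring;
    host_submit(ring, {1, 2, 3});
    gpu_consume(ring);
    return 0;
}
```

In the pooled setting, steps 4 and 5 of this loop are exactly the accesses that cross the lengthened link, which is what the embodiments below avoid.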
As can be seen from the above, in the GPU pooling scenario the interaction between the GPU and the CPU in steps 4 to 5 must span a longer distance, which degrades the GPU's performance.
Based on this, in this embodiment the target GPU and its corresponding target memory may be determined, and a target storage space for storing instructions is determined in the target memory. Since the instructions executed by the GPU are placed not in the memory on the computer device side but in the target memory corresponding to the target GPU on the GPU resource pool side, the target GPU does not need to read instructions across a long distance; this reduces the time the target GPU spends reading instructions and thereby improves its instruction execution efficiency.
FIG. 2A is a flowchart illustrating a method of instruction execution for a GPU resource pool that is applicable to any one computer device in a device cluster, the GPU resource pool including a plurality of GPUs and a memory corresponding to each of the GPUs, according to one exemplary embodiment; the method comprises the following steps:
in step 202, a target GPU that can be used and a target memory corresponding to the target GPU are determined.
In step 204, a target storage space in the target memory for storing instructions is determined.
In step 206, at least one first instruction that needs to be executed by the target GPU is obtained.
In step 208, the at least one first instruction is sent to the target memory, so that after the at least one first instruction is written into the target memory space, the target GPU reads the first instruction from the target memory space and executes the first instruction.
In this embodiment, the device form of the computer device is not limited; the computer device may be any physical device with certain computing, storage, and communication capabilities, for example a terminal device such as a desktop computer, a notebook computer, a smartphone, or an IoT device, or a server-side device such as a server.
Illustratively, the software hierarchy of the computer device may be, from bottom to top: an operating system, a GPU driver, a GPU computing platform toolkit, and applications developed based on the GPU computing platform. Based on this, as one embodiment, a new function module may be added to the toolkit of the GPU computing platform when implementing this embodiment, to send the instruction queue that needs to be executed by the GPU to the memory corresponding to the GPU as specified in this embodiment.
This embodiment may be carried out at various times, for example after the computer device starts up. In other examples, not all tasks during the operation of the computer device need to be executed by the GPU; for instance, certain instructions generated by a first application program developed based on the GPU computing platform are specified to be executed by the GPU, and on this basis the embodiment may be carried out after the first application program starts. Illustratively, the first application includes an AI application or the like.
Illustratively, for step 202, the number of GPUs available to the computer device may be one or more; this embodiment refers to such a GPU as the target GPU. For example, in a GPU resource pool scenario, the computer device may send a request for GPU allocation to the management device as needed, and the management device allocates a GPU to the computer device based on the request. For example, the management device may send the allocated GPU's information, such as its identification and access address, to the computer device, so that the computer device can control the GPU according to that information. In some examples, after GPUs are pooled, the GPU cards of different physical nodes may be abstracted into a virtual GPU resource pool; the GPU allocated to the computer device may be a virtual GPU or, of course, a physical GPU, which this embodiment does not limit. For the process of allocating GPUs, reference may be made to the related art, which this embodiment does not limit.
In some examples, each GPU in the GPU resource pool is configured with an internal memory (i.e., a video memory), and the memory corresponding to each GPU may be the internal memory of each GPU.
In other examples, as mentioned above, the GPU resource pool further comprises a forwarding node connected to each of the GPUs, the forwarding node comprising a memory, the memory corresponding to each of the GPUs comprising: a memory of a forwarding node to which each of the GPUs is connected. In practical applications, one GPU may be connected to one forwarding node, or one GPU may be connected to multiple forwarding nodes, and in this case, the memory corresponding to the GPU may be a memory of each forwarding node in the connected multiple forwarding nodes, or a memory of one of the forwarding nodes. Alternatively, multiple GPUs may be connected to one forwarding node, that is, multiple GPUs may correspond to the same memory.
In this embodiment, the at least one first instruction is sent to the target memory. In practical applications, multiple first instructions may be sent in batches, and the batch size may be flexibly configured as needed, which this embodiment does not limit. The at least one first instruction is sent by the computer device to the target memory and written into the target storage space, so that the target GPU can access the target memory and read and execute the first instructions from the target storage space.
For example, the number of instructions generated by an application program in a computer device is typically large, and the CPU may manage multiple instructions with an instruction queue: it creates an instruction queue that includes a plurality of first instructions, and sends the instruction queue to a target storage space of the target memory corresponding to the target GPU for storage; that is, one instruction queue corresponds to one target storage space. In practical applications, one application program may have multiple instruction queues, and their number can be flexibly configured according to actual needs; that is, one or more target storage spaces may be created for one application program in this embodiment. For example, some GPUs have multiple independent compute engines (e.g., 4) and support multithreaded parallel processing. For the first application program, the number of target storage spaces created as needed may be less than or equal to the number of compute engines of the GPU, so that each compute engine can execute the instructions in its target storage space simultaneously.
When the GPU reads instructions from a target storage space, as mentioned in the foregoing embodiment of Fig. 1C, the instructions stored in the target storage space may be read in a specified order through the read pointer; after each instruction is read, the read pointer is advanced to the next address in that order. Fig. 2B is a schematic diagram of another instruction processing according to an exemplary embodiment of this specification; in this embodiment the instructions are placed in the GPU memory. Compared with Fig. 1C, because the target memory storing the instruction queue is closer to the GPU, the delay of this embodiment in steps 4 to 6 is lower, which improves the efficiency of the GPU in executing instructions.
The following table compares experimental results of the scheme of Fig. 1C with those of this embodiment:
Procedure        Time consumed, Fig. 1C scheme    Time consumed, this embodiment
Step 1           1.7314                           1.6377
Steps 2, 3       42.1451                          41.8794
Steps 4, 5, 6    14.8266                          8.64
Step 7           11.8187                          12.1046
Compare the time consumption of steps 2 and 3: in the scheme of Fig. 1C, step 2 places the instruction queue in the nearer CPU memory, whereas in this embodiment step 2 sends the instruction queue to the farther GPU memory; nevertheless, the total time of steps 2 and 3 is basically the same in both schemes.
For steps 4 to 6, the time consumption of this embodiment is reduced by 41.73%. Step 6 is the execution of the instruction and takes the same time in both schemes, so the reduction comes from steps 4 and 5: because the GPU fetches instructions and updates the read pointer in the nearer memory, this embodiment improves instruction execution efficiency.
In some examples, the computer device may determine the target storage space for storing instructions in the target memory based on its own usage of the storage space in the target memory. In other examples, the computer device may request the target memory to allocate the target storage space it requires; for example, the computer device sends a request to the target memory, and the target memory allocates the target storage space according to the request. Based on this, the target storage space can be determined more quickly.
In some examples, determining a target storage space in the target memory for storing instructions may include: calling a storage space allocation interface of the target memory to determine the target storage space for storing instructions in the target memory. For example, if the target memory is the memory of a forwarding node connected to the target GPU, a memory allocation interface provided by the forwarding node is called to determine a target storage space for storing instructions in the target memory; if the target memory is the internal memory of the target GPU, an internal memory allocation interface provided by the computing platform of the target GPU is called to determine a target storage space for storing instructions in the target memory. Based on this, the target storage space can be determined quickly through an interface call.
In some examples, after the step of determining a target storage space in the target memory for storing instructions, the method further comprises: acquiring the address of the target storage space; and sending the address to the target GPU, so that the target GPU can access the target storage space according to the address. In this embodiment, after determining the target storage space dedicated to storing instructions in the target memory, the computer device may send the address pointing to the target storage space to the target GPU, so that the target GPU can access the target storage space and then read and execute the first instruction. Based on this, the target GPU can quickly access the target storage space to read instructions.
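Putting the above together, the host-side flow of steps 202 to 208 plus the address sharing just described might be sketched as follows; every type and function here is a hypothetical placeholder (stubbed so the sketch compiles and runs), not an API defined by this embodiment:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical stand-ins: GpuHandle identifies the allocated target GPU, and
// MemHandle stands in for its corresponding target memory (GPU-internal
// memory or forwarding-node memory).
struct GpuHandle { int id; };
struct MemHandle { std::vector<uint8_t> backing; };

GpuHandle pick_target_gpu() { return GpuHandle{0}; }               // step 202
MemHandle& target_memory_of(GpuHandle) { static MemHandle m; return m; }

// Step 204: call the allocation interface of the target memory and get back
// the address (modeled here as an offset) of the target storage space.
size_t alloc_instruction_space(MemHandle& m, size_t bytes) {
    size_t addr = m.backing.size();
    m.backing.resize(addr + bytes);
    return addr;
}

// Share the address with the target GPU so it can access the target space.
void send_queue_address(GpuHandle g, size_t addr) {
    std::printf("GPU %d will read its queue at address %zu\n", g.id, addr);
}

// Step 208: send the first instructions so they are written into the space.
void write_instructions(MemHandle& m, size_t addr, const std::vector<uint8_t>& ins) {
    std::copy(ins.begin(), ins.end(), m.backing.begin() + addr);
}

int main() {
    GpuHandle gpu = pick_target_gpu();
    MemHandle& mem = target_memory_of(gpu);
    size_t queue = alloc_instruction_space(mem, 4096);
    send_queue_address(gpu, queue);
    write_instructions(mem, queue, {1, 2, 3});  // step 206's first instructions
    // The target GPU would now read and execute from the target storage space.
    return 0;
}
```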
As can be seen from the foregoing embodiments, the target memory may be implemented as the internal memory of the GPU or as the memory of forwarding node 2; the two embodiments are described below in turn.
Fig. 3A is a schematic diagram of another instruction execution method for a GPU resource pool according to an exemplary embodiment of this specification. This embodiment takes as an example a first application program whose instructions are executed by a GPU; the initialization process and the specific interaction with the GPU may include the following steps:
301. initializing and running an application;
302. judging whether a new queue needs to be created; if yes, go to step 303; if not, go to step 305. As an example, it is determined that a new queue needs to be created when the application initializes; instructions are continuously generated while the application runs, and in practical applications an application may create one or more queues, each with a fixed storage space, so if one queue is full of instructions, the application determines that a new queue needs to be created.
303. Creating an instruction queue;
304. sending the queue information to a GPU;
305. placing the instructions in an instruction queue;
306. informing the GPU to execute the instruction;
307. the GPU takes the instruction from the instruction queue for execution;
308. the GPU judges whether the instruction queue is empty; if not, go to step 307 again; if yes, go to step 309;
309. and finishing execution and sending a completion signal to the application.
In this embodiment, when the instruction queue for storing instructions is created, the space for storing it is moved to the internal memory of the GPU or to the memory of forwarding node 2.
For example, in the embodiment, after the first application program is started, the instruction, which is generated after the first application program is run and needs to be executed by the GPU, may be sent to the memory corresponding to the GPU to be written into the target memory space.
The process of requesting the GPU memory to allocate storage space may be as shown in Fig. 3B1. In this embodiment, an internal memory allocation interface provided by the GPU computing platform, such as a video memory allocation interface, may be invoked to request allocation of the target storage space. In practical applications, some instructions of the first application program are executed by the CPU, and those space allocation processes do not need to change, for example the process of calling a CPU memory allocation interface. In this way, the instruction queue of the first application program that needs to be executed by the GPU can be moved to the video memory of the GPU, as shown in Fig. 3B2, where the buffer area in the GPU represents the target storage space in the GPU's video memory. Based on this, the computer device can send multiple first instructions in batches to the target storage space in the GPU video memory, and the GPU can fetch instructions directly from its own memory space without accessing the memory on the computer device side over the network.
In this embodiment, the GPU needs to allocate an extra memory space of a certain size for the computer device to use, and the computer device can directly access the memory space allocated by the GPU. For example, an internal memory allocation interface provided by the computing platform of the GPU may be invoked to determine the target storage space for storing instructions in the target memory. For example, some GPU parallel computing frameworks provide GPU UM (Unified Memory) technology, which allows applications to allocate video memory through an interface. In this embodiment, a video memory allocation interface provided by the computing platform of the GPU may be called to request the GPU's video memory to allocate a storage space for the computer device to store the first instructions. In practical applications, this can be implemented flexibly according to the computing platform actually used by the GPU, which this embodiment does not limit. Optionally, when the first application finishes running, each target storage space corresponding to the first application may be released, for example by requesting the target memory to release the target storage space.
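As a minimal sketch of this allocation path, assuming the computing platform in use is CUDA (the embodiment leaves the platform open), the unified-memory interface of the CUDA runtime can produce a buffer that the CPU writes and the GPU reads locally; the queue size below is an arbitrary illustrative choice:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t kQueueBytes = 64 * 1024;  // illustrative instruction-queue size
    void* queue = nullptr;

    // Allocate managed (unified) memory: a single pointer valid on both the
    // host and the device, so the CPU can write instructions into space that
    // the GPU then reads without crossing back to the computer-device side.
    cudaError_t err = cudaMallocManaged(&queue, kQueueBytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "allocation failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("target storage space for the instruction queue at %p\n", queue);

    // Release the target storage space when the application finishes running.
    cudaFree(queue);
    return 0;
}
```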
In some examples, the instruction queue created by the application, i.e., the aforementioned target storage space, may be a circular queue (command queue), and the CPU and GPU each maintain some data structures to ensure the proper operation of the circular queue, such as the base address of the queue, the size of the queue, the write pointer, and the read pointer. The write pointer points to the address where the CPU will write an instruction, and the read pointer points to the address where the GPU will read a command. When the CPU finishes writing a command or the GPU finishes reading one, the corresponding pointer moves forward; when a pointer reaches the end of the queue, it moves back to the head of the queue and continues.
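A minimal sketch of this bookkeeping follows, with illustrative names; keeping one slot unused so that a full queue can be distinguished from an empty one is a common convention, not something this embodiment mandates:

```cpp
#include <cstddef>
#include <cstdint>

// Circular command queue: the CPU advances write_ptr after writing a command,
// the GPU advances read_ptr after reading one, and each pointer wraps to the
// head of the queue when it reaches the end.
struct CommandQueue {
    uint32_t* base;           // base address of the queue in the target memory
    size_t    size;           // number of command slots in the queue
    size_t    write_ptr = 0;  // next slot the CPU will write
    size_t    read_ptr  = 0;  // next slot the GPU will read

    bool empty() const { return read_ptr == write_ptr; }
    bool full()  const { return (write_ptr + 1) % size == read_ptr; }

    bool push(uint32_t cmd) {                // CPU side: write a command
        if (full()) return false;
        base[write_ptr] = cmd;
        write_ptr = (write_ptr + 1) % size;  // wrap at the end of the queue
        return true;
    }

    bool pop(uint32_t* cmd) {                // GPU side: read a command
        if (empty()) return false;
        *cmd = base[read_ptr];
        read_ptr = (read_ptr + 1) % size;    // wrap at the end of the queue
        return true;
    }
};
```

In practice each side's pointer would live where the other side can observe it and be updated with appropriate memory ordering; the sketch omits synchronization for brevity.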
In this embodiment, the CPU may send the plurality of first instructions to the video memory of the GPU in batches, so that the GPU reads the first instructions locally when fetching them, and the GPU's performance when executing the first instructions is high. Implementing this scheme requires no modification of GPU hardware and no support from the GPU manufacturer, so the implementation difficulty is low.
In other examples, as shown in Fig. 3C, this embodiment moves the storage space for storing instructions to forwarding node 2, the node closest to the GPU. In this embodiment, the memory of forwarding node 2 needs to be accessible to both the CPU of the computer device and the GPU, and a certain amount of storage space needs to be made available to the computer device. As mentioned above, forwarding node 2 may be a device such as a DPU, and a memory allocation interface provided by the forwarding node may be called to determine the target storage space for storing instructions in the target memory.
As an example, the forwarding node 2 is provided with an allocation interface that allocates memory space and a release interface that releases memory space.
The allocation interface may be invoked when determining a target storage space for storing instructions in the target memory; the parameters of the call may include the size of the storage space, so that forwarding node 2 can determine, according to the parameters, a target storage space in the target memory that satisfies that size. The return value of the allocation interface may be an address pointer pointing to the allocated target storage space.
The release interface may be invoked when it is determined that the target storage space is no longer needed; the parameters of the call may include the address pointer described above, so that forwarding node 2 can release the target storage space pointed to by the address pointer. The return value of the release interface may include an identifier indicating release success or release failure.
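Following the parameter and return-value descriptions above, the two interfaces of forwarding node 2 might look like the sketch below; the names fwd_node_alloc and fwd_node_release are assumptions, and malloc/free merely stand in for the node carving space out of its own memory:

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical allocation interface: takes the size of the needed space and
// returns an address pointer to the allocated target storage space.
void* fwd_node_alloc(size_t size) {
    return std::malloc(size);  // stand-in for forwarding-node memory
}

// Hypothetical release interface: takes the address pointer and returns an
// identifier for release success (true) or release failure (false).
bool fwd_node_release(void* addr) {
    if (addr == nullptr) return false;
    std::free(addr);
    return true;
}

int main() {
    // Called when the first application initializes...
    void* space = fwd_node_alloc(64 * 1024);
    if (space == nullptr) return 1;
    // ...the CPU writes the instruction queue into `space`, the GPU reads it,
    // and the release interface is called when the application finishes.
    return fwd_node_release(space) ? 0 : 1;
}
```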
Illustratively, the computer device may call the allocation interface provided by forwarding node 2 to request the target storage space when the first application initializes, and call the release interface to release the space when the application releases it.
Illustratively, because the space-allocation interface provided by the forwarding node is invoked, the space for storing instructions is moved to the forwarding node closest to the GPU; when reading instructions, the GPU obtains them from the forwarding node. Again, the GPU requires no modification.
In practical application, only a storage space in an internal memory of the GPU may be used, or only a memory of a forwarding node connected to the GPU may be used; alternatively, both embodiments may be used. For example, the computer device may determine a target storage space of an internal memory of the target GPU and also determine a target storage space of a memory of a forwarding node to which the target GPU is connected; according to the requirement, some first instructions are placed in a target storage space of an internal memory of the target GPU, and other first instructions are placed in a target storage space of a memory of the forwarding node; in practical application, the configuration may be flexible according to needs, and this embodiment does not limit this.
Corresponding to the foregoing embodiments of the instruction execution method for the GPU resource pool, the present specification also provides embodiments of an instruction execution apparatus for the GPU resource pool and of a computer device to which it is applied. The computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like. The computer device may be any electronic product capable of human-computer interaction with a user, for example a personal computer, a tablet computer, a smartphone, a Personal Digital Assistant (PDA), a game machine, an Internet Protocol Television (IPTV), an intelligent wearable device, and the like. The computer device may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group consisting of multiple network servers, or a Cloud Computing based cloud consisting of a large number of hosts or network servers. The network in which the computer device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a Virtual Private Network (VPN), and the like.
As shown in fig. 4, a schematic diagram of a computer device shown in the present specification according to an exemplary embodiment may include: a processor 401, a memory 402, an input/output interface 403, a communication interface 404, and a bus 405. Wherein the processor 401, the memory 402, the input/output interface 403 and the communication interface 404 are communicatively connected to each other within the device by a bus 405.
The processor 401 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present specification. Processor 401 may also include a GPU or the like.
The Memory 402 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 402 may store data of an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 402 and called to be executed by the processor 401.
The input/output interface 403 is used for connecting an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input module may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output module may include a display, a speaker, a vibrator, an indicator light, etc. If the computer device is a server or the like, the above-described part of the input/output module may not be included.
The communication interface 404 is used to connect a communication module (not shown in the figure) to implement communication interaction between the present device and other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, bluetooth and the like).
Bus 405 includes a path that transfers information between the various components of the device, such as the processor 401, the memory 402, the input/output interface 403, and the communication interface 404.
It should be noted that although the above-described device only shows the processor 401, the memory 402, the input/output interface 403, the communication interface 404, and the bus 405, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described device may also include only the components necessary to implement the embodiments of the present specification, and not necessarily all of the components shown in the figure.
The memory is a computer-readable storage medium, including volatile and non-volatile, removable and non-removable media, which may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computer device.
As an example, the embodiment of the instruction execution apparatus for a GPU resource pool in this specification can be applied to a computer device as shown in Fig. 4. The apparatus embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, as a logical apparatus it is formed by the processor of the computer device in which it is located reading the corresponding computer program instructions from non-volatile storage into memory and running them.
FIG. 5 is a block diagram of an instruction execution device for a GPU resource pool including a plurality of GPUs and a memory corresponding to each GPU shown in the present specification in accordance with an exemplary embodiment; the device comprises:
a first determining module 51, configured to: determine a usable target GPU and a target memory corresponding to the target GPU;
a second determining module 52, configured to: determine a target storage space in the target memory for storing instructions;
an obtaining module 53, configured to: obtain at least one first instruction to be executed by the target GPU;
a sending module 54, configured to: send the at least one first instruction to the target memory, so that after the at least one first instruction is written into the target storage space, the target GPU reads the first instruction from the target storage space and executes the first instruction.
In some examples, the GPU resource pool further comprises a forwarding node connected to each of the GPUs, the forwarding node comprising a memory;
the memory corresponding to the GPU comprises: a memory of a forwarding node to which the GPU is connected; or, an internal memory of the GPU.
In some examples, the second determining module is further configured to:
if the target memory is a memory of a forwarding node connected with a target GPU, calling a memory allocation interface provided by the forwarding node to determine a target storage space for storing instructions in the target memory;
if the target memory is the internal memory of the target GPU, calling an internal memory allocation interface provided by a computing platform of the target GPU to determine a target storage space used for storing instructions in the target memory.
In some examples, the sending module is further configured to: obtaining the address of a target storage space used for storing instructions in the determined target storage; and sending the address to the target GPU so that the target GPU can access the target storage space according to the address.
In some examples, the apparatus further comprises a notification module to:
and in response to the at least one first instruction being written into the target storage space, notifying the target GPU to read the first instruction from the target storage space.
In some examples, the second determining module is further configured to:
in response to a first application program launching, one or more target memory spaces in the target memory for storing instructions are determined, each of the target memory spaces corresponding to the first application program.
In some examples, the apparatus further comprises a release module to:
and releasing each target storage space corresponding to the first application program in response to the first application program ending running.
Embodiments of the present specification further provide a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the foregoing instruction execution method embodiment for the GPU resource pool when executing the computer program.
Embodiments of the present specification also provide a distributed system comprising one or more computer devices including a first memory, a processor, and a computer program stored on the memory and executable on the processor; the distributed system also comprises a GPU resource pool and a management device for managing the GPU resource pool, the computer device is connected with the management device, and the GPU resource pool comprises a plurality of GPUs and a memory corresponding to each GPU; wherein the processor implements the steps of the aforementioned instruction execution method embodiments for the GPU resource pool when executing the computer program.
Embodiments of the present specification further provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the foregoing instruction execution method embodiments for GPU resource pool.
Embodiments of the present specification further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the foregoing instruction execution method embodiments for a GPU resource pool.
The implementation processes of the functions and actions of each module in the instruction execution device for the GPU resource pool are specifically described in the implementation processes of corresponding steps in the instruction execution method for the GPU resource pool, and are not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, wherein the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The steps of the above methods are divided for clarity of description; in implementation they may be combined into one step, or a step may be split into multiple steps, and as long as the same logical relationship is included, they are within the protection scope of this patent. Adding insignificant modifications to an algorithm or process, or introducing insignificant design changes, without changing its core design, is also within the protection scope of this application.
The description herein of "particular examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the specification. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
From the above description of the embodiments, it is clear to those skilled in the art that the embodiments of the present disclosure can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the embodiments of the present specification may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments or some parts of the embodiments of the present specification.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
Other embodiments of the present description will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.
It will be understood that the present description is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.
The above description is only a preferred embodiment of the present disclosure, and should not be taken as limiting the present disclosure, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (12)

1. An instruction execution method for a GPU resource pool, the GPU resource pool comprising a plurality of GPUs and a memory corresponding to each of the GPUs; the method comprises the following steps:
determining a usable target GPU and a target memory corresponding to the target GPU;
determining a target storage space in the target memory for storing instructions;
obtaining at least one first instruction to be executed by the target GPU;
and sending the at least one first instruction to the target memory, so that after the at least one first instruction is written into the target storage space, the target GPU reads the first instruction from the target storage space and executes the first instruction.
2. The method of claim 1, wherein the GPU resource pool further comprises a forwarding node connected to each of the GPUs, the forwarding node comprising a memory;
the memory corresponding to the GPU comprises: a memory of the forwarding node connected to the GPU; or an internal memory of the GPU.
3. The method of claim 1, wherein the determining a target storage space in the target memory for storing instructions comprises:
calling a storage space allocation interface of the target memory to determine the target storage space in the target memory for storing instructions.
4. The method of claim 1, wherein, after the step of determining a target storage space in the target memory for storing instructions, the method further comprises:
acquiring the address of the target storage space;
and sending the address to the target GPU so that the target GPU can access the target storage space according to the address.
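Claim 4's address hand-off can be pictured as follows; the names are illustrative, and a real system would transmit the address over a control path to the GPU rather than calling a method on it.

class Gpu:
    def __init__(self):
        self.instruction_base = None  # filled in by the address hand-off

    def receive_address(self, address):
        # The GPU records the address so it can access the target storage
        # space (e.g., read instructions from it) later on.
        self.instruction_base = address

class TargetMemory:
    def __init__(self, size):
        self.buffer = bytearray(size)
        self._next = 0

    def alloc(self, size):
        offset = self._next
        self._next += size
        return offset

mem = TargetMemory(4096)
gpu = Gpu()
address = mem.alloc(256)      # acquire the address of the target storage space
gpu.receive_address(address)  # send the address to the target GPU
assert gpu.instruction_base == address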
5. The method of claim 1, further comprising:
in response to the at least one first instruction being written into the target storage space, notifying the target GPU to read the first instruction from the target storage space.
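Claim 5 leaves the notification mechanism open; the sketch below stands in a threading.Event for whatever doorbell or interrupt a real system would use, and is purely illustrative.

import threading

storage = bytearray(64)
written = threading.Event()

def gpu_side():
    written.wait()  # the GPU blocks until it is notified
    print("GPU reads:", bytes(storage[:13]))

def host_side():
    storage[:13] = b"KERNEL_LAUNCH"  # write the first instruction
    written.set()                    # notify the target GPU

t = threading.Thread(target=gpu_side)
t.start()
host_side()
t.join()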
6. The method of claim 1, wherein the determining a target storage space in the target memory for storing instructions comprises:
in response to a first application program being launched, determining one or more target storage spaces in the target memory for storing instructions, each target storage space corresponding to the first application program.
7. The method of claim 6, further comprising:
releasing each target storage space corresponding to the first application program in response to the first application program finishing running.
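Claims 6 and 7 together describe a per-application lifecycle for the instruction storage spaces; below is a minimal sketch, assuming a simple dictionary keyed by a hypothetical application id.

class TargetMemory:
    def __init__(self):
        self.spaces_by_app = {}

    def on_app_launch(self, app_id, count, size):
        # Claim 6: allocate one or more target storage spaces, each
        # corresponding to the launching application.
        self.spaces_by_app[app_id] = [bytearray(size) for _ in range(count)]

    def on_app_exit(self, app_id):
        # Claim 7: release every target storage space that corresponds
        # to the application once it stops running.
        self.spaces_by_app.pop(app_id, None)

mem = TargetMemory()
mem.on_app_launch("app-1", count=2, size=512)
mem.on_app_exit("app-1")
assert "app-1" not in mem.spaces_by_app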
8. An instruction execution device for a GPU resource pool, the GPU resource pool comprising a plurality of GPUs and a memory corresponding to each of the GPUs; the device comprises:
a first determination module configured to determine an available target GPU and a target memory corresponding to the target GPU;
a second determination module configured to determine a target storage space in the target memory for storing instructions;
an acquisition module configured to obtain at least one first instruction to be executed by the target GPU;
a sending module configured to send the at least one first instruction to the target memory, so that after the at least one first instruction is written into the target storage space, the target GPU reads the first instruction from the target storage space and executes the first instruction.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 7.
10. A distributed system comprising one or more computer devices, each comprising a first memory, a processor, and a computer program stored on the first memory and executable on the processor; the distributed system further comprises a GPU resource pool and a management device for managing the GPU resource pool, each computer device being connected to the management device, and the GPU resource pool comprising a plurality of GPUs and a memory corresponding to each of the GPUs;
wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 7.
11. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
12. A computer-readable storage medium on which a computer program is stored which, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
CN202211034224.5A 2022-08-26 2022-08-26 Instruction execution method, apparatus, device, system, program product, and medium Pending CN115421787A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211034224.5A CN115421787A (en) 2022-08-26 2022-08-26 Instruction execution method, apparatus, device, system, program product, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211034224.5A CN115421787A (en) 2022-08-26 2022-08-26 Instruction execution method, apparatus, device, system, program product, and medium

Publications (1)

Publication Number Publication Date
CN115421787A true CN115421787A (en) 2022-12-02

Family

ID=84199745

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211034224.5A Pending CN115421787A (en) 2022-08-26 2022-08-26 Instruction execution method, apparatus, device, system, program product, and medium

Country Status (1)

Country Link
CN (1) CN115421787A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115964167A (en) * 2022-12-16 2023-04-14 摩尔线程智能科技(北京)有限责任公司 Resource pooling method, apparatus, device, medium, and product for heterogeneous computing platforms
CN115964167B (en) * 2022-12-16 2023-09-01 摩尔线程智能科技(北京)有限责任公司 Resource pooling method, device, equipment, medium and product of heterogeneous computing platform
CN116450055A (en) * 2023-06-15 2023-07-18 支付宝(杭州)信息技术有限公司 Method and system for distributing storage area between multi-processing cards
CN116450055B (en) * 2023-06-15 2023-10-27 支付宝(杭州)信息技术有限公司 Method and system for distributing storage area between multi-processing cards

Similar Documents

Publication Publication Date Title
EP3876100A2 (en) Method and apparatus for sharing gpu, electronic device and readable storage medium
US10180843B2 (en) Resource processing method and device for a multi-core operating system
CN115421787A (en) Instruction execution method, apparatus, device, system, program product, and medium
JP2016508647A (en) Method and system for supporting resource separation in a multi-core architecture
CN109144619B (en) Icon font information processing method, device and system
CN102667714B (en) Support the method and system that the function provided by the resource outside operating system environment is provided
CN113641457A (en) Container creation method, device, apparatus, medium, and program product
CN107861691B (en) Load balancing method and device of multi-control storage system
KR20210095690A (en) Resource management method and apparatus, electronic device and recording medium
CN104104705B (en) The cut-in method and equipment of distributed memory system
US20140006767A1 (en) Boot strap processor assignment for a multi-core processing unit
EP4184324A1 (en) Efficient accelerator offload in multi-accelerator framework
CN113010265A (en) Pod scheduling method, scheduler, memory plug-in and system
CN113835887A (en) Video memory allocation method and device, electronic equipment and readable storage medium
CN113485791B (en) Configuration method, access method, device, virtualization system and storage medium
CN105677481B (en) A kind of data processing method, system and electronic equipment
CN114168301A (en) Thread scheduling method, processor and electronic device
CN104714792A (en) Multi-process shared data processing method and device
US9405470B2 (en) Data processing system and data processing method
CN112596669A (en) Data processing method and device based on distributed storage
CN113535087B (en) Data processing method, server and storage system in data migration process
US20220066827A1 (en) Disaggregated memory pool assignment
CN115481052A (en) Data exchange method and device
CN116166572A (en) System configuration and memory synchronization method and device, system, equipment and medium
CN116974736A (en) Equipment virtualization method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination