WO2023134735A1 - Computing device, data processing method and system, and related devices - Google Patents

Computing device, data processing method and system, and related devices

Info

Publication number
WO2023134735A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
memory space
heterogeneous device
heterogeneous
memory
Prior art date
Application number
PCT/CN2023/071994
Other languages
English (en)
French (fr)
Inventor
刘晓
余洲
Original Assignee
华为云计算技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为云计算技术有限公司
Publication of WO2023134735A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/50 Controlling the output signals based on the game progress
    • A63F13/52 Controlling the output signals based on the game progress involving aspects of the displayed game scene
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/55 Controlling game characters or game objects based on the game progress
    • A63F13/56 Computing the motion of game characters with respect to other game characters, game objects or elements of the game scene, e.g. for simulating the behaviour of a group of virtual soldiers or for path finding
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/70 Game security or game management aspects
    • A63F13/77 Game security or game management aspects involving data related to game devices or game servers, e.g. configuration data, software version or amount of memory
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/80 Special adaptations for executing a specific game genre or game mode
    • A63F13/822 Strategy games; Role-playing games
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5044 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/30 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by output arrangements for receiving control signals generated by the game device
    • A63F2300/308 Details of the user interface
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/50 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by details of game servers
    • A63F2300/55 Details of game data or player data management
    • A63F2300/552 Details of game data or player data management for downloading to client devices, e.g. using OS version, hardware or software profile of the client device
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60 Methods for processing data by generating or executing the game program
    • A63F2300/65 Methods for processing data by generating or executing the game program for computing the condition of a game character
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/80 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game specially adapted for executing a specific type of game
    • A63F2300/807 Role playing or strategy games

Definitions

  • The present application relates to the technical field of data processing, and in particular to a computing device, a data processing method, a data processing system, and related devices.
  • Cloud rendering refers to moving storage, computing, and rendering to the cloud, so that large-scale scenes can be rendered in the cloud and high-quality images generated in real time.
  • Cloud rendering services can include image rendering, artificial intelligence (AI) noise reduction, encoding and streaming, and other processes, so the cloud can integrate central processing unit (CPU), graphics processing unit (GPU), and other types of computing power and cascade them into a pipeline, using different types of computing power to perform the different stages of the cloud rendering service.
  • Multiple types of processors can be integrated into a computing device to process services, and each processor has its own separately configured memory, so the computing device can use the multiple types of computing power provided by these processors to process services.
  • However, such computing devices consume a lot of resources when processing services, and the service processing delay is relatively high.
  • The present application provides a computing device intended to reduce the resources the computing device consumes in processing services and to reduce the service processing delay.
  • The present application also provides a data processing method, a data processing system, a computer-readable storage medium, and a computer program product.
  • In a first aspect, the present application provides a computing device. The computing device includes a central processing unit, at least one heterogeneous device, and a shared memory pool; the at least one heterogeneous device includes a first heterogeneous device, the shared memory pool includes multiple shared memories (each shared memory can be implemented, for example, by a memory module), and the central processing unit, the at least one heterogeneous device, and the multiple shared memories are coupled through a bus. The central processing unit is configured to divide the shared memory pool into multiple memory spaces, store first to-be-processed data that is provided by a client and associated with a service in a first memory space of the multiple memory spaces, and notify the first heterogeneous device of the address of the first to-be-processed data in the first memory space and of a first operation that the first heterogeneous device needs to perform on the first to-be-processed data. The first heterogeneous device is configured to perform the first operation on the first to-be-processed data in the first memory space and store the resulting first data in a second memory space of the multiple memory spaces.
  • In this way, when the heterogeneous device processes the first to-be-processed data, it does not need to move that data between different memories; it can process the data directly in the shared memory pool. This avoids the large resource consumption and high service processing delay caused by moving data between different memories, thereby reducing the resources required for service processing and shortening the service processing delay.
  • Moreover, because the multiple shared memories in the shared memory pool are coupled with the central processing unit and the heterogeneous devices through the bus, the shared memory pool can be configured in the computing device without being constrained by the central processing unit or the heterogeneous devices (for example, it is not affected by the physical size of the chips where the CPU and the heterogeneous devices are located). The local memory of the computing device can therefore reach a much larger capacity, for example a terabyte-scale memory pool can be configured in the computing device, so that the computing device can load large-scale data into local memory for simultaneous processing and meet the real-time processing requirements of large-scale data in practical application scenarios.
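To make the first aspect concrete, the following minimal C++ sketch models the described interaction under simplifying assumptions: a single in-process byte array stands in for the shared memory pool, a plain function call stands in for the bus notification, and every name (g_pool, OpInstruction, device_execute) is invented for illustration rather than taken from the patent.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the shared memory pool, divided by the CPU into memory spaces.
std::vector<std::uint8_t> g_pool(1 << 20);

// What the central processing unit sends to a heterogeneous device:
// where the input lives, which operation to run, where to put the output.
struct OpInstruction {
    std::size_t in_offset;   // first memory space: address of the to-be-processed data
    std::size_t in_len;      // length of the to-be-processed data
    int         op;          // the first operation the device must perform
    std::size_t out_offset;  // second memory space for the generated first data
};

// The first heterogeneous device: reads and writes the pool directly, no copying.
void device_execute(const OpInstruction& ins) {
    for (std::size_t i = 0; i < ins.in_len; ++i)
        g_pool[ins.out_offset + i] =
            std::uint8_t(g_pool[ins.in_offset + i] + 1);  // stand-in for the real op
}

int main() {
    const std::size_t first_space = 0, second_space = 4096;
    // CPU: store the client's to-be-processed data in the first memory space...
    for (std::size_t i = 0; i < 1024; ++i) g_pool[first_space + i] = std::uint8_t(i);
    // ...then notify the device of the address and the first operation.
    device_execute({first_space, 1024, /*op=*/1, second_space});
    // The first data now sits in the second memory space, still inside the pool.
}
```

The point of the sketch is that the device reads its input and writes its output at offsets inside the same pool, so no data ever crosses between private memories.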
  • The second memory space may be specified to the heterogeneous device by the central processing unit, or may be determined independently by the heterogeneous device; this is not limited here.
  • In a possible implementation, the central processing unit is further configured to notify the first heterogeneous device that the storage location of the data generated by the first operation to be performed on the to-be-processed data is the second memory space among the multiple memory spaces. In this way, unified allocation and management of the memory spaces by the central processing unit is realized.
  • In a possible implementation, the at least one heterogeneous device in the computing device further includes a second heterogeneous device. The central processing unit is further configured to store second to-be-processed data that is provided by the client and associated with the service in a third memory space of the multiple memory spaces, and to notify the second heterogeneous device of the address of the second to-be-processed data in the third memory space, of a second operation that the second heterogeneous device needs to perform on the second to-be-processed data, and that the storage location of the data generated by the second operation is a fourth memory space among the multiple memory spaces. The second heterogeneous device is configured to perform the second operation on the second to-be-processed data in the third memory space to obtain second data, and to store the second data in the fourth memory space.
  • In this way, the computing device can use multiple heterogeneous devices to process the to-be-processed data of a service in parallel, improving data processing efficiency and shortening the service response time.
  • In another possible implementation, the at least one heterogeneous device in the computing device further includes a second heterogeneous device. The central processing unit is further configured to notify the second heterogeneous device of the address of the first data in the second memory space, of a second operation to be performed by the second heterogeneous device on the first data, and that the storage location of the data generated by the second operation is a fourth memory space among the multiple memory spaces. The second heterogeneous device is configured to perform the second operation on the first data in the second memory space and store the resulting second data in the fourth memory space.
  • In a possible implementation, the first heterogeneous device and the second heterogeneous device are graphics processing units (GPUs).
  • In a possible implementation, the at least one heterogeneous device in the computing device further includes a third heterogeneous device. The central processing unit is further configured to provide the addresses of the second memory space and the fourth memory space to the third heterogeneous device, and to notify the third heterogeneous device of a third operation to be performed on the first data and the second data and that the storage location of the data generated by the third operation is a fifth memory space among the multiple memory spaces. The third heterogeneous device is configured to perform the third operation on the first data in the second memory space and the second data in the fourth memory space to obtain third data, and to store the third data in the fifth memory space.
  • In this way, the computing device can use additional heterogeneous devices to continue processing the data, and the processed data does not need to be moved between different memories, which improves data processing efficiency.
  • In a possible implementation, the third heterogeneous device is a graphics processing unit (GPU), a neural-network processing unit (NPU), or a video processing unit (VPU).
  • In a possible implementation, the at least one heterogeneous device in the computing device further includes a third heterogeneous device that is connected to other computing devices over a network. The central processing unit is further configured to provide the addresses of the second memory space and the fourth memory space to the third heterogeneous device, to notify the third heterogeneous device of a third operation to be performed on the first data and the second data, and to instruct it to send the data generated by the third operation to the other computing devices. The third heterogeneous device is configured to perform the third operation on the first data in the second memory space and the second data in the fourth memory space to obtain third data, and to send the third data to the other computing devices. In this way, the third heterogeneous device can output the processed service data (that is, the third data) to other computing devices to meet service requirements, or send the service data to other computing devices for further processing.
  • In a possible implementation, the third heterogeneous device is a network card configured to forward the third data to the other computing devices.
  • In a possible implementation, the service is an image rendering task, and the first to-be-processed data associated with the service is image data. In this way, the computing device can process the image rendering task based on multiple heterogeneous devices and improve the processing efficiency of the service data through the shared memory pool.
  • In a possible implementation, the bus coupling the shared memory pool, the central processing unit, and the at least one heterogeneous device is a Generation Z (Gen-Z) bus, a cache coherent interconnect for accelerators (CCIX) bus, or a compute express link (CXL) bus.
  • In a possible implementation, the capacity of the shared memory pool is not less than 1 TB.
  • In a second aspect, the present application provides a data processing method. The data processing method is applied to a computing device that includes a central processing unit, at least one heterogeneous device, and a shared memory pool; the at least one heterogeneous device includes a first heterogeneous device, the shared memory pool includes multiple shared memory modules, and the central processing unit, the at least one heterogeneous device, and the multiple shared memory modules are coupled through a bus. The method includes: the central processing unit divides the shared memory pool into multiple memory spaces; the central processing unit stores first to-be-processed data that is provided by a client and associated with a service in a first memory space of the multiple memory spaces; the central processing unit notifies the first heterogeneous device of the address of the first to-be-processed data in the first memory space and of a first operation to be performed by the first heterogeneous device on the first to-be-processed data; and the first heterogeneous device performs the first operation on the first to-be-processed data in the first memory space to obtain first data and stores the first data in a second memory space of the multiple memory spaces.
  • In a possible implementation, the at least one heterogeneous device in the computing device further includes a second heterogeneous device, and the method further includes: the central processing unit stores second to-be-processed data that is provided by the client and associated with the service in a third memory space of the multiple memory spaces; the central processing unit notifies the second heterogeneous device of the address of the second to-be-processed data in the third memory space, of a second operation to be performed by the second heterogeneous device on the second to-be-processed data, and that the storage location of the data generated by the second operation is a fourth memory space among the multiple memory spaces; and the second heterogeneous device performs the second operation on the second to-be-processed data in the third memory space to obtain second data and stores the second data in the fourth memory space.
  • In another possible implementation, the at least one heterogeneous device in the computing device further includes a second heterogeneous device, and the method further includes: the central processing unit notifies the second heterogeneous device of the address of the first data in the second memory space, of a second operation to be performed by the second heterogeneous device on the first data, and that the storage location of the data generated by the second operation is a fourth memory space among the multiple memory spaces; and the second heterogeneous device performs the second operation on the first data in the second memory space to obtain second data and stores the second data in the fourth memory space.
  • In a possible implementation, the first heterogeneous device and the second heterogeneous device are graphics processing units (GPUs).
  • In a possible implementation, the at least one heterogeneous device in the computing device further includes a third heterogeneous device, and the method further includes: the central processing unit provides the addresses of the second memory space and the fourth memory space to the third heterogeneous device; the central processing unit notifies the third heterogeneous device of a third operation to be performed on the first data and the second data and that the storage location of the data generated by the third operation is a fifth memory space among the multiple memory spaces; and the third heterogeneous device performs the third operation on the first data in the second memory space and the second data in the fourth memory space to obtain third data and stores the third data in the fifth memory space.
  • In a possible implementation, the third heterogeneous device is a graphics processing unit (GPU), a neural-network processing unit (NPU), or a video processing unit (VPU).
  • In a possible implementation, the at least one heterogeneous device in the computing device further includes a third heterogeneous device that is connected to other computing devices over a network, and the method further includes: the central processing unit provides the addresses of the second memory space and the fourth memory space to the third heterogeneous device; the central processing unit notifies the third heterogeneous device of a third operation to be performed on the first data and the second data and instructs it to send the data generated by the third operation to the other computing devices; and the third heterogeneous device performs the third operation on the first data in the second memory space and the second data in the fourth memory space to obtain third data and sends the third data to the other computing devices.
  • In a possible implementation, the third heterogeneous device is a network card.
  • In a possible implementation, the method further includes: the central processing unit notifies the first heterogeneous device that the storage location of the data generated by the first operation is the second memory space among the multiple memory spaces.
  • In a possible implementation, the service is an image rendering task, and the first to-be-processed data is image data.
  • In a possible implementation, the bus coupling the shared memory pool, the central processing unit, and the at least one heterogeneous device is a Gen-Z bus, a CCIX bus, or a CXL bus.
  • In a possible implementation, the capacity of the shared memory pool is not less than 1 TB.
  • Because the data processing method provided in the second aspect corresponds to the computing device provided in the first aspect, for the technical effects of the second aspect and of any possible implementation of the second aspect, refer to the first aspect and the corresponding implementations of the first aspect; they are not repeated here.
  • In a third aspect, the present application provides a data processing system. The data processing system includes at least one computing device, and the computing device is the computing device described in the first aspect or any implementation of the first aspect.
  • In a fourth aspect, the present application provides a computer-readable storage medium. The computer-readable storage medium stores instructions that, when run on a computing device, cause the computing device to execute the method described in the second aspect or any implementation of the second aspect.
  • In a fifth aspect, the present application provides a computer program product containing instructions that, when run on a computing device, cause the computing device to execute the method described in the second aspect or any implementation of the second aspect.
  • FIG. 1 is a schematic structural diagram of a computing device.
  • FIG. 2 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of another computing device provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of another computing device provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of yet another computing device provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of two computing devices interconnected through a high-speed interface provided by the embodiment of the present application.
  • FIG. 7 is a schematic flow diagram of a data processing method provided in an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of another computing device provided in the embodiment of the present application.
  • FIG. 9 is a schematic diagram of processing an image rendering service provided by an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of another data processing method provided by the embodiment of the present application.
  • Refer to FIG. 1, which is a schematic structural diagram of a computing device.
  • The computing device 100 includes multiple processors, for example, n processors: processor 1 to processor n.
  • Each processor may be configured with its own memory; for example, processor 1 is configured with memory 1 and processor 2 is configured with memory 2, and processor 1 and processor 2 may be coupled through a bus.
  • Among these processors there are at least two of different types, which provide different types of computing power for the computing device 100; for example, processor 1 and processor 2 are processors of different types, such as processor 1 being a CPU and processor 2 being a GPU.
  • In this way, the computing device 100 may use different types of computing power to process services, such as a cloud rendering service requested by the client 1.
  • When processing such a service, processor 1 first writes the to-be-processed data of the service into memory 1, processes that data in memory 1, and stores the resulting intermediate data in memory 1.
  • Then, processor 2 reads the intermediate data from memory 1 into memory 2 through the bus, processes it in memory 2, and stores the final data in memory 2. Because the same data (the intermediate data) has to be moved between different memories while the service is being processed, this not only consumes significant resources but also reduces the efficiency with which the computing device 100 processes service data.
  • Further, when the computing device uses three or more processors to process service data sequentially, the service data may be moved frequently among multiple different memories, which seriously degrades the service processing performance of the computing device 100.
  • In addition, the memory capacity configured individually for each processor is usually limited.
  • Because the size of the chip on which a processor resides is usually limited, the physical size of the memory deployed on that chip is also limited, so the memory capacity that can be configured for the processor is usually small, for example no more than 32 GB (gigabytes).
  • As a result, it is difficult for the computing device 100 to load large-scale to-be-processed data into local memory for simultaneous processing.
  • The service data can only be handled by processing different parts of it sequentially, so it is difficult to meet the need for real-time processing of large-scale data; a minimal sketch of this copy-bound flow follows.
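As a rough, purely illustrative contrast, the following C++ sketch mimics the FIG. 1 arrangement (all names are hypothetical): the intermediate data produced in processor 1's private memory has to be copied over the bus into processor 2's private memory before the second processing step can run.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

int main() {
    // FIG. 1 style: each processor owns a private memory.
    std::vector<std::uint8_t> memory1(1 << 20);  // memory 1, attached to processor 1
    std::vector<std::uint8_t> memory2(1 << 20);  // memory 2, attached to processor 2

    // Step 1: processor 1 writes the to-be-processed data into memory 1
    // and processes it there, producing the intermediate data in memory 1.
    for (std::size_t i = 0; i < memory1.size(); ++i) memory1[i] = std::uint8_t(i);
    for (auto& b : memory1) b = std::uint8_t(b + 1);

    // Step 2: the intermediate data must cross the bus into memory 2 before
    // processor 2 can touch it -- this copy is exactly the resource cost and
    // latency that the shared-memory-pool design below removes.
    std::memcpy(memory2.data(), memory1.data(), memory1.size());

    // Step 3: processor 2 processes the copied data in memory 2.
    for (auto& b : memory2) b = std::uint8_t(b * 2);
}
```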
  • As shown in FIG. 2, the computing device 200 includes a CPU 201, a heterogeneous device 202 (the CPU 201 and the heterogeneous device 202 may form a computing resource pool), and a shared memory pool 203; FIG. 2 takes one heterogeneous device as an example for illustration.
  • The CPU 201 and the heterogeneous device 202 can provide different types of computing power for the computing device 200.
  • For example, the heterogeneous device 202 may be a GPU, or it may be a neural-network processing unit (NPU), etc., which is not limited in this embodiment.
  • The CPU 201 and the heterogeneous device 202 in the computing device 200 may be located on different base boards/chips.
  • The shared memory pool 203 includes multiple shared memories, such as the first shared memory and the second shared memory in FIG. 2, and each shared memory can be implemented by a memory controller and a storage medium.
  • This embodiment takes shared memory of this kind implementing the shared memory pool 203 as an example: each shared memory includes a memory controller and a corresponding storage medium, and may be, for example, a memory module.
  • The shared memory pool 203 can be scaled out horizontally; that is, its capacity can be expanded as the number of memory controllers and storage media increases.
  • The memory controller is a bus circuit controller inside the computing device 200 that controls the shared memory pool 203 and manages and schedules data transmission between the shared memory pool 203 and the CPU 201 or the heterogeneous device 202. Data is exchanged between the shared memory pool 203 and the CPU 201 or the heterogeneous device 202 through the memory controller.
  • The memory controller can be a separate chip that controls the logic necessary to write data into or read data from the shared memory pool 203.
  • Alternatively, the memory controller can be implemented by a general-purpose processor, a dedicated accelerator, a GPU, an FPGA, an embedded processor, and the like.
  • The storage medium in the shared memory pool 203 may be implemented by dynamic random access memory (DRAM), for example as dual in-line memory modules (DIMMs).
  • A DIMM can generally be regarded as a physical memory module; each module can have two sides, both of which carry memory chips.
  • Each side can be called a rank; that is to say, a memory module can have two ranks, and each rank can include multiple memory chips.
  • The memory controller and the storage medium may be connected through a double data rate (DDR) bus or through other buses.
  • Certainly, the shared memory pool 203 may also be implemented in other ways; for example, its storage medium may be another type of storage medium, which is not limited in this embodiment.
  • The CPU 201, the heterogeneous device 202, and the shared memory pool 203 in the computing device 200 can be coupled through a bus; for example, the CPU 201 can access data in the shared memory pool 203 through the bus, or send data (such as operation instructions) to the heterogeneous device 202 through the bus.
  • The bus may be a peripheral component interconnect express (PCIe) bus, or another type of bus such as a cache coherent interconnect for accelerators (CCIX) bus, a Generation Z (Gen-Z) bus, or a compute express link (CXL) bus, which is not limited in this embodiment.
  • The heterogeneous device 202 may be, for example, any type of heterogeneous processor, such as a GPU, an NPU, or a VPU, or another device.
  • The computing device 200 may be deployed in a cloud data center.
  • This description takes a cloud data center that includes one computing device 200 as an example.
  • In practice, the cloud data center may also include multiple computing devices.
  • Both the cloud data center and the client 1 are connected to the Internet, so that the client 1 and each computing device in the cloud data center can communicate over the network.
  • During data processing, the CPU 201 can divide the shared memory pool 203 into multiple memory spaces, so that when the computing device 200 uses the computing power provided by the heterogeneous device to process a service, the CPU 201 can receive the to-be-processed data associated with that service from the client 1 and write it into the first memory space allocated to the service in the shared memory pool 203, as shown in FIG. 2. The CPU 201 can then notify the heterogeneous device 202 of the address of the to-be-processed data in the first memory space and of the first operation that the heterogeneous device 202 needs to perform on that data.
  • In this way, the heterogeneous device 202 can directly access the to-be-processed data in the first memory space, perform the corresponding first operation on it within the shared memory pool 203 to obtain the first data, and store the first data in the second memory space in the shared memory pool 203, as shown in FIG. 2.
  • If the first data is the final result of processing, the computing device 200 can send the first data to the client 1; if the first data needs further processing, other heterogeneous devices in the computing device 200 may continue to process the first data in the shared memory pool 203 and send the finally obtained data to the client 1.
  • Because the heterogeneous device 202 does not need to move the to-be-processed data between different memories and can process it directly in the shared memory pool 203, the large resource consumption and high service processing delay caused by moving data between different memories are avoided, which reduces the resources required for service processing and shortens the service processing delay.
  • Moreover, the multiple shared memories in the shared memory pool 203 are interconnected with the CPU 201 and the heterogeneous device 202 through the bus, so the configuration of the shared memory pool 203 in the computing device 200 is not constrained by the CPU 201 or the heterogeneous device 202 (for example, it is not affected by the physical size of the chips where the CPU 201 and the heterogeneous device 202 are located). The local memory of the computing device 200 can therefore reach a much larger capacity; for example, a TB (terabyte)-level memory pool can be configured in the computing device 200.
  • In this way, the computing device 200 can load large-scale data into local memory (that is, the shared memory pool 203) for simultaneous processing, so as to meet the real-time processing requirements for large-scale data in practical application scenarios.
  • It should be noted that the computing device 200 shown in FIG. 2 is only an exemplary illustration and is not intended to limit its specific implementation.
  • For example, separate memory independent of the shared memory pool 203 can also be configured for the CPU 201 and the heterogeneous device 202, so that the CPU 201 and the heterogeneous device 202 may use the separately configured memory to process other services; or the computing device 200 may include a greater number or more types of heterogeneous devices, where the number of each type of heterogeneous device may be one or more; or the computing device 200 may include more devices with other functions, which is not limited in this embodiment.
  • In actual deployment, the computing device 200 can be deployed on the user side, that is, used as a local device to provide users with local data processing services; or it can be deployed in the cloud, such as a public cloud, an edge cloud, or a distributed cloud, to provide cloud data processing services such as cloud rendering services.
  • During actual application, the client 1 may request the computing device 200 to process a service, and the service may be, for example, an image rendering service or another cloud service.
  • Specifically, the client 1 may generate a service processing request that includes a service identifier and the to-be-processed data associated with the service, and send the service processing request to the computing device 200.
  • The client 1 may be, for example, a web browser through which the computing device 200 interacts with the user, or an application program running on a user terminal, such as one built with a software development kit (SDK).
  • In this way, the CPU 201 can receive the service processing request sent by the client 1, parse the service identifier and the to-be-processed data associated with the service (such as image data in an image rendering service) from the request, and determine, according to the service identifier, that the heterogeneous device 202 in the computing device 200 is to process the service.
  • The one or more heterogeneous devices that the computing device 200 uses to process different services may be configured in advance according to actual application requirements.
  • For example, the computing device 200 can use the heterogeneous device 202 to process this service, and use heterogeneous devices other than the heterogeneous device 202 to process other services requested by the client 1.
  • Below, the processing of the service by the computing device 200 based on the heterogeneous device 202 is taken as an example for illustration.
  • In a specific implementation, the CPU 201 may allocate the first memory space for the service from the shared memory pool 203 according to the amount of to-be-processed data associated with the service carried in the service processing request. Specifically, the CPU 201 may divide the shared memory pool 203 into multiple memory spaces and allocate the first memory space among them to the service. The size of the first memory space may be determined according to the amount of to-be-processed data, or it may be pre-configured for the service by technicians, in which case the CPU 201 can determine the size of the first memory space to allocate in the shared memory pool 203 by querying the configuration information with the service identifier.
  • The CPU 201 then writes the to-be-processed data associated with the service into the first memory space and records the address at which the to-be-processed data is stored in the first memory space.
  • The address may be represented, for example, by the start address of the to-be-processed data in the first memory space together with the length of the to-be-processed data.
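This allocation step can be pictured as a trivial allocator over the pool. The sketch below is only an assumption-laden illustration (SharedMemoryPool, MemorySpace, and the bump-allocation strategy are invented, not the patent's design); a real implementation would also handle free lists, alignment, and concurrent allocation:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

struct MemorySpace { std::size_t base; std::size_t size; };

class SharedMemoryPool {
public:
    explicit SharedMemoryPool(std::size_t bytes) : mem_(bytes) {}

    // CPU-side division of the pool: carve off one memory space of the
    // requested size (sized from the data amount, or from configuration
    // looked up by service identifier).
    MemorySpace allocate(std::size_t size) {
        MemorySpace space{next_, size};
        next_ += size;  // bump allocation; no reclamation in this sketch
        return space;
    }

    // Write the client's data into the space and return the (start address,
    // length) pair that the CPU records and later reports to the device.
    std::pair<std::size_t, std::size_t> store(const MemorySpace& space,
                                              const std::uint8_t* data,
                                              std::size_t len) {
        std::memcpy(mem_.data() + space.base, data, len);
        return {space.base, len};
    }

private:
    std::vector<std::uint8_t> mem_;
    std::size_t next_ = 0;
};
```

A size pre-configured for a service would simply replace the size argument with a value looked up from the configuration by service identifier.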
  • Next, the CPU 201 may notify the heterogeneous device 202 of the address of the to-be-processed data in the first memory space and of the first operation to be performed by the heterogeneous device 202 on that data.
  • In a specific implementation, the CPU 201 may generate an operation instruction for the to-be-processed data and send the operation instruction to the heterogeneous device 202.
  • The operation instruction may include the address at which the to-be-processed data is stored in the first memory space.
  • For example, the operation instruction may include a pointer and a data length: the pointer indicates the start address of the to-be-processed data in the first memory space, and the data length indicates the length of the to-be-processed data.
  • The operation instruction may also carry an indication of the first operation that the heterogeneous device 202 is to perform on the to-be-processed data.
  • The type of the first operation is related to the computing power of the heterogeneous device 202.
  • For example, when the heterogeneous device 202 is a GPU, the first operation may be a rendering operation on the to-be-processed data; when the heterogeneous device 202 is specifically an NPU, the first operation may be, for example, an AI noise reduction operation on the to-be-processed data.
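Gathering the fields named above, one could model an operation instruction as the following C++ struct. This is a sketch under assumptions: the names OperationInstruction and OpType are invented, and the actual encoding of operation instructions is not specified in the text.

```cpp
#include <cstddef>

// Kinds of "first operation"; which ones exist depends on the computing
// power of the target device (these two values mirror the examples above).
enum class OpType { Render /* e.g. on a GPU */, AiDenoise /* e.g. on an NPU */ };

// One operation instruction as described above: a pointer giving the start
// address of the to-be-processed data in the first memory space, the data
// length, and the operation to perform. The output fields are optional in
// spirit, since the second memory space may instead be chosen by the device.
struct OperationInstruction {
    const void* data_ptr;   // start address in the first memory space
    std::size_t data_len;   // length of the to-be-processed data
    OpType      op;         // first operation to perform on the data
    void*       out_ptr;    // second memory space, when designated by the CPU
    std::size_t out_size;   // size of that second memory space
};
```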
  • In addition, the heterogeneous device 202 may be configured with a message queue, through which the heterogeneous device 202 buffers the operation instructions sent by the CPU 201.
  • In this way, the heterogeneous device 202 can read an operation instruction from the message queue, parse out the location of the to-be-processed data in the first memory space, and locate that data in the first memory space, so that the heterogeneous device 202 can execute the first operation on the to-be-processed data based on the operation instruction and store the resulting first data in the second memory space in the shared memory pool 203. The to-be-processed data is thus processed entirely within the shared memory pool 203.
  • The second memory space may be designated by the CPU 201; for example, the operation instruction sent by the CPU 201 to the heterogeneous device 202 may also include the address of the second memory space, represented by the start address of the second memory space and its size.
  • Alternatively, the second memory space may be determined by the heterogeneous device 202 itself; that is, the heterogeneous device 202 may choose the second memory space from the remaining available memory space in the shared memory pool 203.
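Putting these pieces together, the device side can be sketched as a loop that drains the message queue and executes each instruction in place in the pool. All names are again illustrative, and a real heterogeneous device would implement this logic in hardware or firmware rather than in C++:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <queue>
#include <vector>

std::vector<std::uint8_t> pool(1 << 20);      // stands in for shared memory pool 203

struct Instr {
    std::size_t in;                           // address of the data in the first memory space
    std::size_t len;                          // length of the to-be-processed data
    std::optional<std::size_t> out;           // second memory space, if designated by the CPU
};

std::queue<Instr> message_queue;              // filled by CPU 201 over the bus

// If the CPU did not designate a second memory space, the device picks one
// from the remaining available region of the pool (a bare cursor here).
std::size_t free_cursor = 1 << 19;
std::size_t choose_output(const std::optional<std::size_t>& designated, std::size_t len) {
    if (designated) return *designated;
    std::size_t out = free_cursor;
    free_cursor += len;
    return out;
}

// Device-side loop: drain the queue and execute each instruction in place.
void device_loop() {
    while (!message_queue.empty()) {
        Instr ins = message_queue.front();
        message_queue.pop();
        std::size_t out = choose_output(ins.out, ins.len);
        for (std::size_t i = 0; i < ins.len; ++i)
            pool[out + i] = std::uint8_t(pool[ins.in + i] ^ 0xFF);  // stand-in for the first operation
    }
}

int main() {
    message_queue.push({0, 1024, std::nullopt});  // CPU enqueues one instruction
    device_loop();
}
```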
  • When the computing device 200 uses only the computing power provided by the heterogeneous device 202 to process the to-be-processed data, it can feed back the first data to the client 1 as the final processed data; when the computing device 200 also uses the computing power provided by other heterogeneous devices to continue processing the first data, it can feed back the data finally obtained by those heterogeneous devices to the client 1. This is not limited in this embodiment.
  • Because the heterogeneous device 202 can process the to-be-processed data directly in the shared memory pool 203, the data does not need to be moved between different memories, which reduces the resources the computing device 200 consumes in processing services and shortens the service processing delay.
  • Moreover, the computing device 200 can support writing large-scale to-be-processed data into the shared memory pool 203 at once, so that the CPU 201 and the heterogeneous device 202 in the computing device 200 can each process large-scale service data at a time, which improves service processing efficiency.
  • The above uses the computing device 200 processing the to-be-processed data associated with a service based on one heterogeneous device 202 as an example for illustration.
  • In other embodiments, the computing device 200 also includes other heterogeneous devices, and when a service requires processing by multiple heterogeneous devices, the computing device 200 can use the heterogeneous device 202 together with the other heterogeneous devices to process the service data.
  • For example, the computing device 200 shown in FIG. 3 further includes a heterogeneous device 204.
  • The heterogeneous device 204 can cooperate with the heterogeneous device 202 to process the to-be-processed data associated with the first service.
  • This embodiment provides the following two implementation examples for description.
  • Implementation example 1: the heterogeneous device 202 and the heterogeneous device 204 process in parallel the to-be-processed data associated with the service requested by the client 1.
  • For example, both the heterogeneous device 202 and the heterogeneous device 204 are GPUs, and the computing device 200 can use the multiple GPUs to process in parallel different service data generated by an image rendering service.
  • In a specific implementation, the CPU 201 may receive the first to-be-processed data and the second to-be-processed data for the service sent by the client 1. The CPU 201 can write the first to-be-processed data into the first memory space in the shared memory pool 203 and record the address of the first to-be-processed data in the first memory space, and write the second to-be-processed data into the third memory space in the shared memory pool 203 and record the address of the second to-be-processed data in the third memory space.
  • Then, the CPU 201 may generate an operation instruction 1, which may carry the address of the first to-be-processed data stored in the first memory space, the first operation to be performed by the heterogeneous device 202 on the first to-be-processed data, and an indication that the storage location of the data generated by the first operation is the second memory space in the shared memory pool 203.
  • Similarly, the CPU 201 may generate an operation instruction 2, which may carry the address of the second to-be-processed data stored in the third memory space, the second operation that the heterogeneous device 204 needs to perform on the second to-be-processed data, and an indication that the storage location of the data generated by the second operation is the fourth memory space in the shared memory pool 203.
  • The CPU 201 may send operation instruction 1 to the message queue in the heterogeneous device 202 through the interconnection bus, and send operation instruction 2 to the message queue in the heterogeneous device 204 through the interconnection bus.
  • The heterogeneous device 202 reads operation instruction 1 from its message queue, locates the first to-be-processed data in the first memory space according to the instruction, performs the first operation on it to obtain the corresponding first data, stores the first data in the second memory space indicated by the CPU 201, and obtains the storage address of the first data in the second memory space.
  • Meanwhile, the heterogeneous device 204 reads operation instruction 2 from its message queue, locates the second to-be-processed data in the third memory space according to the instruction, performs the second operation on it to obtain the corresponding second data, stores the second data in the fourth memory space indicated by the CPU 201, and obtains the storage address of the second data in the fourth memory space.
  • In this way, the computing device 200 processes the first to-be-processed data and the second to-be-processed data in parallel on multiple heterogeneous devices, which improves the data processing efficiency of the computing device 200.
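A sketch of this parallel pattern follows, with std::thread standing in for the two physical devices and all other names invented for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

std::vector<std::uint8_t> pool(1 << 20);  // stands in for shared memory pool 203

// One heterogeneous device executing one operation instruction in place.
void run_device(std::size_t in, std::size_t len, std::size_t out, std::uint8_t delta) {
    for (std::size_t i = 0; i < len; ++i)
        pool[out + i] = std::uint8_t(pool[in + i] + delta);
}

int main() {
    const std::size_t first = 0,          second = 256 * 1024;  // spaces for device 202
    const std::size_t third = 512 * 1024, fourth = 768 * 1024;  // spaces for device 204

    // CPU 201 is assumed to have already written the first and second
    // to-be-processed data into the first and third memory spaces, and now
    // issues operation instructions 1 and 2 to the two devices.
    std::thread dev202(run_device, first, std::size_t{1024}, second, std::uint8_t{1});
    std::thread dev204(run_device, third, std::size_t{1024}, fourth, std::uint8_t{2});
    dev202.join();
    dev204.join();
    // The first data is now in the second space and the second data in the
    // fourth space, produced in parallel with no inter-memory copies.
}
```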
  • Implementation example 2: the heterogeneous device 202 and the heterogeneous device 204 serially process the to-be-processed data associated with the service requested by the client 1.
  • For example, the heterogeneous device 202 is a GPU and the heterogeneous device 204 is a video processor (or an NPU), and the GPU and the video processor (or NPU) serially process the service data generated by an image rendering service; alternatively, both the heterogeneous device 202 and the heterogeneous device 204 are GPUs and sequentially process the service data generated by image rendering services.
  • In a specific implementation, the CPU 201 may receive the first to-be-processed data for the service sent by the client 1, write the first to-be-processed data into the first memory space in the shared memory pool 203, and record the address of the first to-be-processed data in the first memory space. Then, the CPU 201 may generate an operation instruction 1 and send it to the message queue in the heterogeneous device 202 through the interconnection bus. Operation instruction 1 may carry the address of the first to-be-processed data stored in the first memory space, the first operation to be performed by the heterogeneous device 202 on the first to-be-processed data, and an indication that the storage location of the data generated by the first operation is the second memory space in the shared memory pool 203.
  • The heterogeneous device 202 reads operation instruction 1 from its message queue, locates the first to-be-processed data in the first memory space according to the instruction, performs the first operation on it to obtain the corresponding first data, stores the first data in the second memory space indicated by the CPU 201, and obtains the storage address of the first data in the second memory space. Then, the heterogeneous device 202 may feed back the address of the first data in the second memory space to the CPU 201.
  • Next, the CPU 201 may generate an operation instruction 2 and send it to the message queue in the heterogeneous device 204 through the interconnection bus.
  • Operation instruction 2 may carry the address at which the first data is stored in the second memory space, the second operation that the heterogeneous device 204 needs to perform on the first data, and an indication that the storage location of the data generated by the second operation is the fourth memory space in the shared memory pool 203.
  • The heterogeneous device 204 reads operation instruction 2 from its message queue, locates the first data in the second memory space according to the instruction, performs the second operation on the first data to obtain the corresponding second data, and stores the second data in the fourth memory space indicated by the CPU 201.
  • Alternatively, in other implementations, the heterogeneous device 202 may itself generate operation instruction 2 and send it to the message queue in the heterogeneous device 204, so as to control the heterogeneous device 204 to perform the second operation on the first data in the second memory space and store the generated second data in the fourth memory space.
  • Finally, the computing device 200 may feed back the second data to the client 1 as the final processing result.
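The serial variant can be sketched as chaining the output address of one instruction into the input of the next; nothing is copied between stages because both addresses refer to the same pool (names are illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<std::uint8_t> pool(1 << 20);  // stands in for shared memory pool 203

struct Instr { std::size_t in, len, out; };  // offsets into the pool

// One pipeline stage on one heterogeneous device; returns the address of
// its output so the CPU (or the device itself) can chain the next stage.
std::size_t run_stage(const Instr& ins, std::uint8_t delta) {
    for (std::size_t i = 0; i < ins.len; ++i)
        pool[ins.out + i] = std::uint8_t(pool[ins.in + i] + delta);
    return ins.out;
}

int main() {
    const std::size_t first = 0, second = 4096, fourth = 8192;
    for (std::size_t i = 0; i < 1024; ++i) pool[first + i] = std::uint8_t(i);  // CPU writes input

    // Operation instruction 1 -> device 202: first space in, second space out.
    std::size_t first_data_addr = run_stage({first, 1024, second}, 1);
    // Operation instruction 2 -> device 204: its input address is exactly the
    // address reported back for the first data; nothing is copied between stages.
    run_stage({first_data_addr, 1024, fourth}, 2);
    // The second data now sits in the fourth memory space.
}
```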
  • The above mainly uses the computing device 200 processing services based on the CPU 201 and one or two heterogeneous devices as an example for illustration.
  • In other embodiments, the computing device 200 can support processing the services requested by the client with more combinations of computing power.
  • That is, the computing device 200 may use different computing power combinations when processing different services, where different computing power combinations include different types of computing power or different computing power specifications.
  • For example, the client 1 may request the computing device 200 to process a service and may also require the computing device 200 to use the CPU 201, the heterogeneous device 202, the heterogeneous device 204, and the heterogeneous device 205 for the processing.
  • FIG. 4 shows a schematic structural diagram of another computing device 200.
  • The computing device 200 shown in FIG. 4 may further include a heterogeneous device 205.
  • The type of the heterogeneous device 205 may be different from the types of the heterogeneous device 202 and the heterogeneous device 204.
  • For example, the heterogeneous device 202 is a GPU, the heterogeneous device 204 is an NPU, and the heterogeneous device 205 is a video processing unit (VPU); alternatively, the heterogeneous device 205 may be a device that forwards data, such as a network card.
  • Alternatively, the type of the heterogeneous device 205 may be the same as that of the heterogeneous device 202, for example both are GPUs; or the type of the heterogeneous device 205 may be the same as that of the heterogeneous device 204, for example both are NPUs.
  • In a specific implementation, the CPU 201 can generate an operation instruction 3, which includes the address of the first data in the second memory space, the address of the second data in the fourth memory space, the third operation to be executed by the heterogeneous device 205 on the first data and the second data, and an indication that the storage location of the data generated by the third operation is the fifth memory space in the shared memory pool 203.
  • The CPU 201 may then send operation instruction 3 to the message queue in the heterogeneous device 205.
  • The second data here may be the data generated by the heterogeneous device 204 performing the second operation on the second to-be-processed data, or may be the data generated by the heterogeneous device 204 performing the second operation on the first data; this is not limited in this embodiment.
  • The heterogeneous device 205 reads operation instruction 3 from its message queue and, according to the instruction, obtains the first data from the second memory space and the second data from the fourth memory space, performs the third operation on the first data and the second data to obtain the corresponding third data, and stores the third data in the fifth memory space indicated by the CPU 201.
  • When the computing device 200 uses only the computing power provided by the heterogeneous device 202, the heterogeneous device 204, and the heterogeneous device 205 to process the service data, it can feed back the third data to the client 1 as the final processing result.
  • In other embodiments, the heterogeneous device 205 may be a device capable of data forwarding, such as a network card.
  • In this case, the heterogeneous device 205 may, under the instruction of the CPU 201, send the third data to other computing devices connected to it over the network.
  • Specifically, the CPU 201 can generate an operation instruction 3 that includes the address of the first data in the second memory space, the address of the second data in the fourth memory space, and the third operation to be performed by the heterogeneous device 205 on the first data and the second data, and that also includes indication information instructing the heterogeneous device 205 to send the data generated by the third operation to the other computing devices.
  • In this way, after the heterogeneous device 205 performs the third operation on the first data and the second data and obtains the corresponding third data, it sends the third data directly to the other computing devices without storing the third data in the shared memory pool 203.
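A sketch of this fan-in-and-forward step follows, with the network send reduced to a placeholder and all names (device205_forward, send_to_peer) invented for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<std::uint8_t> pool(1 << 20);  // stands in for shared memory pool 203

// Placeholder for the network card pushing bytes to another computing device.
void send_to_peer(const std::uint8_t* data, std::size_t len) {
    std::printf("forwarding %zu bytes to the peer computing device\n", len);
}

// Operation instruction 3, forwarding variant: combine the first data (in the
// second memory space) and the second data (in the fourth memory space) into
// the third data, then send it out instead of writing it back to the pool.
void device205_forward(std::size_t second, std::size_t fourth, std::size_t len) {
    std::vector<std::uint8_t> third(len);  // device-local staging for the third data
    for (std::size_t i = 0; i < len; ++i)
        third[i] = std::uint8_t(pool[second + i] + pool[fourth + i]);  // stand-in third op
    send_to_peer(third.data(), third.size());
}

int main() { device205_forward(4096, 8192, 1024); }
```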
  • In this way, the computing device 200 can not only flexibly select different combinations of computing power from the computing resource pool to process different services according to service requirements, but can also, in further implementations, flexibly select shared memory of different capacities from the shared memory pool 203 according to service needs, so as to support multiple heterogeneous devices of the same or different computing power types in processing service data in the shared memory pool. For example, when the computing device 200 processes services, the capacity of the total memory space allocated from the shared memory pool 203 for service 1 may be 1 TB, while the capacity of the total memory space allocated for another service may be 10 TB.
  • In the above implementations, each heterogeneous device in the computing device 200 uses memory space in the shared memory pool 203 to process services. In other implementation examples, one or more heterogeneous devices in the computing device 200 can also each have a separately configured local memory, and a heterogeneous device can use its separately configured local memory to store the corresponding service data, so as to meet the higher memory access bandwidth requirements that the computing device 200 has for some services.
  • For example, memory 1 can be deployed on the chip/base board where the heterogeneous device 202 is located, memory 2 on the chip/base board where the heterogeneous device 204 is located, and memory 3 on the chip/base board where the heterogeneous device 205 is located.
  • Generally, the speed at which the heterogeneous device 202 accesses memory 1 on its own chip or base board is higher than the speed at which it accesses the shared memory in the shared memory pool 203 through the PCIe bus, and the same applies to the other heterogeneous devices accessing their corresponding local memories.
  • Therefore, the heterogeneous device 202 in the computing device 200 can access memory 1 with relatively high bandwidth and use memory 1 to process service data.
  • The memory capacity configured individually for the heterogeneous devices in the computing device 200 may be at the GB (gigabyte) level; for example, each heterogeneous device is configured with 32 GB of its own memory, which is not limited in this embodiment.
  • In addition, the memory configured separately for each heterogeneous device can use a coarse-grained 4 KB (kilobyte) page table for data caching, thereby reducing the performance impact of going through the high-speed interconnection bus between heterogeneous devices.
  • Similarly, the CPU 201 in the computing device 200 may also have a separately configured memory, such as memory 4 in FIG. 5, so that the CPU 201 executes the corresponding data processing based on memory 4.
  • the embodiments above build the computing resource pool and the memory pool inside the computing device 200 as an example; in other embodiments, the computing device 200 can also interconnect with other computing devices through high-speed interconnect interfaces to build a larger memory pool spanning multiple computing devices.
  • computing device 200 may be interconnected with computing device 300 through a high-speed interconnection interface (such as high-speed interconnection interface 1 and high-speed interconnection interface 2 in FIG. 6 ).
  • the computing device 300 may have one or more heterogeneous devices (FIG. 6 takes the computing device 300 including the heterogeneous devices 301 and 302 as an example) and a shared memory pool 303; the shared memory pool 303 includes one or more shared memories, which may be implemented by, for example, at least one memory controller and at least one storage medium (such as a DIMM).
  • after the two devices are interconnected, each heterogeneous device in computing device 200 (such as the heterogeneous device 202) can access the shared memory pool 303 in computing device 300 through the high-speed interconnect interface;
  • likewise, the heterogeneous device 301 (and the other heterogeneous devices) in computing device 300 can access the shared memory pool 203 in computing device 200 through the high-speed interconnect interface.
  • the shared memory pool 203 in the computing device 200 and the shared memory pool 303 in the computing device 300 can thus form a logically larger unified memory pool, shared by the heterogeneous devices in the computing device 200 and the computing device 300.
  • when sufficient memory is available, the computing device 200 preferentially allocates memory space for a service from the shared memory pool 203; when available memory in the shared memory pool 203 runs low, the computing device 200 may use the shared memory pool 303 in the computing device 300 to process service data.
  • the computing device 200 can also use the heterogeneous devices in the computing device 300 to expand its computing power, including expansion of computing-power specifications, computing-power types, and the like.
  • for example, when the computing device 200 needs to use three processors to process a service, it may use the heterogeneous devices 202, 301, and 302 to process the service in sequence, thereby expanding the computing power of the computing device 200.
  • FIG. 7 is a schematic flow chart of a data processing method provided in the embodiment of the present application
  • FIG. 8 is a schematic structural diagram of another computing device provided in the embodiment of the present application
  • the data processing method shown in FIG. 7 can be applied to the computing device 800 shown in FIG. 8; the computing device 800 includes a CPU 801 and multiple heterogeneous devices (a GPU 802, an NPU 803, and a VPU 804), and the CPU 801, the heterogeneous devices, and the shared memory pool 806 in the computing device 800 are coupled through the bus 807.
  • the data processing method shown in FIG. 7 may specifically include:
  • the client 1 encodes image data based on user input operations to generate an image data stream to be rendered, where the image data stream includes image data and user input operations.
  • Client 1 sends the image data stream to computing device 800 .
  • in one possible application scenario, the user performs operations on the interactive interface provided by client 1, such as clicking a control in the interface that moves the player character in a game; client 1 can then encode the image data presented by the current interface (such as the game screen) according to the user's input operation and generate an image data stream that includes the user input operation and the image data.
  • client 1 can then generate a cloud rendering request carrying the image data stream to be rendered and send it to the computing device 800 deployed in the cloud, so as to request the computing device 800 to perform the corresponding rendering on the stream according to the user's input operation, for example so that the position of the player character in the rendered game screen changes according to that input.
  • the computing device 800 may receive the image data stream to be rendered sent by the client 1 through the network card 805 .
  • the network card, which may also be called a network interface controller (NIC), is responsible for receiving data sent by external devices and for sending data to external devices.
  • the CPU 801 decodes the received image data stream to be rendered, obtains image data and user input operations, and writes the image data into the first memory space in the shared memory pool 806 .
  • the CPU 801 may write the decoded image data into the first memory space in the shared memory pool 806 with the "write-only" permission.
  • the "write-only" permission means that the read-write permission of the decoded image data to the CPU 801 is only capable of performing a write operation on the image data.
  • the CPU 801 may include a high-speed interface 8011, a memory management unit (memory management unit, MMU) 8012, a message queue 8013, and a processing unit 8014.
  • the high-speed interface 8011 can be, for example, a serializer/deserializer (SerDes) interface, through which the processing unit 8014 can write the image data into the first memory space in the shared memory pool 806; the first memory space may be allocated for storing the image data of the cloud rendering service requested by client 1.
  • the MMU 8012 can be used to manage the shared memory pool 806, including capacity expansion of the shared memory pool 806, health status monitoring, memory resource allocation, and the like.
  • the message queue 8013 may be used to cache the operation instructions generated by the CPU 801 so that the processing unit 8014 sends the operation instructions in the message queue 8013 to other processors.
  • the shared memory pool 806 includes a high-speed interface 8061 (there may be one or more), a home agent unit 8062, at least one memory controller 8063, and a storage medium 8064.
  • when constructing the shared memory pool 806, one or more memory controllers 8063 and storage media 8064 can be attached to the bus 807, and by configuring the home agent unit 8062, cache-coherent non-uniform memory access (CCNUMA) is supported among the multiple processors in the computing device 800; that is, multiple memory units (such as memory units built from multiple storage media) are connected to form a single memory of larger capacity.
  • the shared memory pool 806 can also provide the high-speed interface 8061 for external communication, so as to receive through it the data (such as the above image data) sent by the processors in the computing device 800 and write that data into the storage medium 8064 through the memory controller 8063.
  • each processor in the computing device 800 is interconnected with the shared memory pool 806 through the bus 807; for example, the image data the CPU 801 sends to the shared memory pool 806 is transmitted to the shared memory over the bus 807.
  • the bus 807 may include a high-speed interface 8071 for connecting the processors, a high-speed interface 8072 for connecting the shared memory pool 806, and switch units 8073 for data exchange; because a single switch unit 8073 can interconnect only a limited number of hardware components in practice, the number of switch units 8073 in the bus 807 can be determined from the number of processors attached to the bus 807 in the computing device 800 and the number of memory controllers 8063 in the shared memory pool 806.
  • the CPU 801 generates a rendering instruction for the image data according to user input operations and processing logic, and sends the rendering instruction to the GPU 802 .
  • the rendering instruction generated and sent by the CPU 801 may include the storage address of the image data to be rendered in the first memory space (which may be represented by a pointer corresponding to the first address plus the data length) and the processing operations to be performed by the GPU 802, NPU 803, and VPU 804, which process the image data in sequence. Further, the rendering instruction may also include the addresses of the memory spaces in the shared memory pool 806 where the data generated by each processor is to be stored.
  • the GPU 802 determines image data in the shared memory according to the rendering instruction, and performs a rendering operation on the image data to obtain first data, and writes the first data into a second memory space in the shared memory pool 806 .
  • the GPU 802 can receive the rendering instruction sent by the CPU 801 through the high-speed interface 8021, cache the rendering instruction (or the storage location of the image data indicated in it) in the message queue 8022, and parse the rendering instruction in the message queue 8022 with the microcontroller 8023 to determine the storage location of the image data and the processing operations the GPU 802 is to perform on it.
  • the processing unit 8024 can then use an input-output memory management unit (IOMMU) 8025 to access the image data in the shared memory, perform the rendering operation on the image data there to obtain the first data, write the first data into the second memory space through the IOMMU 8025, and record the storage address of the first data in the second memory space, for example its first address and data length.
  • the IOMMU 8025 keeps the page tables of the GPU 802 and the CPU 801 consistent, so that the GPU 802 can use virtual addresses to access the shared memory pool 806 managed by the MMU 8012 in the CPU 801.
  • multiple different processors in the computing device 800 may share the same page table; alternatively, according to actual business requirements, they may not, for example some processors having permission to read and write data in the shared memory pool 806 while the other processors may only read it, which is not limited in this embodiment.
  • FIG. 7 uses a single GPU 802 processing the image data as an example.
  • the computing device 800 may also use multiple GPUs to process image data serially or in parallel.
  • the computing device 800 may include 16 GPUs, namely GPU0 to GPU15 , and the computing device may use the 16 GPUs to accelerate the processing of image data.
  • the CPU 801 can send a rendering instruction to GPU0 that includes the storage address of the image data in the first memory space and the operations to be performed by GPU0 to GPU15; GPU0 can then access the image data in the first memory space with "read-only" permission, provide it to the remaining GPUs (GPU1 to GPU15), instruct them to perform the corresponding rendering operations on the image data in parallel, and write the generated first data into the second memory space designated by the CPU 801 in a "write-only" manner.
  • the CPU 801 may also send rendering instructions to each GPU respectively, and instruct each GPU to perform corresponding rendering operations on different image data stored in the first memory space.
  • GPU802 generates a noise reduction instruction, and sends the noise reduction instruction to NPU803.
  • the GPU 802 can determine, from the sequence of processors carried in the rendering instruction, that the next processor to handle the image data is the NPU 803; the GPU 802 can therefore direct the NPU 803 to continue processing the first data in the second memory space.
  • the GPU 802 may generate and send a noise reduction instruction to the NPU 803, so that the NPU 803 implements an AI noise reduction process on the first data.
  • the noise reduction instruction generated by the GPU802 may carry information such as the storage address of the first data in the second memory space, the processing operations performed by the NPU803 and the VPU804 that sequentially process the image data, and the like.
  • the noise reduction instruction may also include a storage address of the third memory space in the shared memory pool 806 for the data generated by the corresponding processing operations performed by the NPU 803 and the VPU 804 .
  • the NPU 803 determines the first data in the second memory space according to the noise reduction instruction, performs a noise reduction operation on the first data to obtain the second data, and writes the second data into the third memory space in the shared memory pool 806.
  • the NPU 803 can parse, from the received noise reduction instruction, the address at which the first data is stored in the second memory space, and access the first data from the second memory space in a "read-only" manner, so as to process the first data according to the noise reduction operation indicated by the instruction and obtain the second data. Then, the NPU 803 may write the second data into the third memory space in a "write-only" manner, at the address of the third memory space indicated by the noise reduction instruction, as shown in FIG. 9.
  • the NPU803 may include a high-speed interface 8031, a message queue 8032, a microcontroller 8033, a processing unit 8034, and an IOMMU8035.
  • the NPU 803 can receive the noise reduction instruction through the high-speed interface 8031, cache the instruction (or the storage location of the first data indicated in it) in the message queue 8032, and parse the noise reduction instruction in the message queue 8032 with the microcontroller 8033 to determine the storage location of the first data in the second memory space and that the processing operation the NPU 803 is to perform on the first data is a noise reduction operation.
  • the processing unit 8034 can then use the IOMMU 8035 to access the first data in the shared memory and perform the noise reduction operation on it there, for example removing the noise data in the first data and applying super-resolution processing to the denoised first data (that is, constructing high-resolution image data from low-resolution image data) to obtain the second data; it writes the second data into the third memory space through the IOMMU 8035 and records the storage location of the second data in the third memory space.
  • NPU803 generates an encoding instruction, and sends the encoding instruction to VPU804.
  • the NPU 803 can determine, from the sequence of processors carried in the noise reduction instruction, that the next processor to continue processing the image data is the VPU 804; the NPU 803 can therefore direct the VPU 804 to continue processing the image data.
  • specifically, the NPU 803 may generate and send an encoding instruction to the VPU 804, so that the VPU 804 encodes the second data.
  • the encoding instruction generated by the NPU 803 may carry information such as the storage address of the second data in the third memory space and the processing operation to be performed by the VPU 804. Further, the encoding instruction may also include the storage address of the fourth memory space in the shared memory pool 806 for the data generated by the VPU 804 performing the corresponding processing operation.
  • the VPU 804 determines the second data stored in the third memory space according to the encoding instruction, performs an encoding operation on the second data to obtain the encoded data, and writes the encoded data into the fourth memory space in the shared memory pool 806.
  • the VPU 804 can parse, from the received encoding instruction, the address at which the second data is stored in the third memory space, and access the second data from the third memory space in a "read-only" manner, so as to process the second data according to the encoding operation indicated by the instruction and obtain the encoded data. Then, the VPU 804 may write the encoded data into the fourth memory space in a "write-only" manner, at the address of the fourth memory space indicated by the encoding instruction, as shown in FIG. 9.
  • the VPU 804 may include a high-speed interface 8041, a message queue 8042, a microcontroller 8043, a processing unit 8044, and an IOMMU 8045, with which it determines the second data in the third memory space, performs the corresponding encoding operation on the second data to obtain the encoded data, and continues to cache the encoded data through the fourth memory space in the shared memory pool 806.
  • for the specific implementation of the VPU 804 encoding the second data according to the encoding instruction, refer to the description above of the NPU 803 performing the noise reduction operation on the first data according to the noise reduction instruction; details are not repeated here.
  • the CPU 801 can send a transfer instruction to the network card 805; the transfer instruction can include the storage address of the encoded data in the fourth memory space, so that the network card 805 can, based on the transfer instruction, read the encoded data from the fourth memory space in a "read-only" manner and send the encoded data to client 1, as shown in FIG. 9.
  • in this way, the computing device 800 can implement the image rendering service requested by client 1 and provide that service for client 1.
  • the image rendering service can be a cloud service or a local service, which is not limited in this embodiment.
  • it is worth noting that, in the embodiment shown in FIG. 7, although the NPU 803 and the VPU 804 each receive their operation instruction from the previous processor that handles the service data and determine the data storage addresses and the operations to perform from the instructions they respectively receive,
  • the address information and operation information included in those instructions all originate from the rendering instruction of the CPU 801; the GPU 802, NPU 803, and VPU 804 therefore actually process and store the service data under the coordination and notification of the CPU 801. In this way, the number of interactions between the CPU 801 and the GPU 802, NPU 803, and VPU 804 can be reduced, thereby reducing the load of the CPU 801 and improving its control performance.
  • in other embodiments, the instructions received by heterogeneous processors such as the GPU 802, NPU 803, and VPU 804 may also be issued directly by the CPU 801.
  • this is described in detail below in conjunction with FIG. 8, FIG. 9, and FIG. 10.
  • FIG. 10 shows a schematic flowchart of another data processing method provided by the embodiment of the present application.
  • the method may specifically include:
  • the client 1 encodes image data based on user input operations to generate an image data stream to be rendered, where the image data stream includes image data and user input operations.
  • Client 1 sends the image data stream to computing device 800 .
  • the CPU 801 decodes the received image data stream to be rendered, obtains image data and user input operations, and writes the image data into the first memory space in the shared memory pool 806 .
  • for the specific implementation of steps S1001 to S1003, refer to the descriptions of steps S701 to S703 in the embodiment shown in FIG. 7; details are not repeated here.
  • the CPU 801 generates a rendering instruction for the image data according to user input operations and processing logic, and sends the rendering instruction to the GPU 802 .
  • unlike the embodiment of FIG. 7, in this embodiment the CPU 801 controls the heterogeneous processors to perform the corresponding operations by issuing instructions to them one by one.
  • to this end, the rendering instruction generated by the CPU 801 for the GPU 802 may include the storage address of the image data to be rendered in the first memory space (which may be represented by a pointer corresponding to the first address plus the data length), the rendering operation the GPU 802 is to perform, and the second memory space in the shared memory pool 806 where the data generated by the GPU 802's rendering operation is to be stored.
  • the GPU 802 determines the image data in the shared memory according to the rendering instruction, and performs a rendering operation on the image data to obtain first data, and writes the first data into the second memory space in the shared memory pool 806 .
  • after completing the rendering operation and obtaining the first data, the GPU 802 may notify the CPU 801, so that the CPU 801 instructs other heterogeneous processors to continue processing the first data.
  • the CPU 801 generates a noise reduction instruction, and sends the noise reduction instruction to the NPU 803 .
  • the noise reduction instruction generated by the CPU 801 can include the storage address of the first data in the second memory space (which may be represented by a pointer corresponding to the first address plus the data length of the first data), the noise reduction operation the NPU 803 is to perform on the first data, and the third memory space in the shared memory pool 806 where the data generated by the NPU 803's noise reduction operation is to be stored.
  • the NPU 803 determines the first data in the second memory space according to the noise reduction instruction, performs a noise reduction operation on the first data to obtain the second data, and writes the second data into the third memory space in the shared memory pool 806.
  • after completing the noise reduction operation and obtaining the second data, the NPU 803 may notify the CPU 801, so that the CPU 801 instructs other heterogeneous processors to continue processing the second data.
  • the CPU 801 generates an encoding instruction, and sends the encoding instruction to the VPU 804 .
  • the encoding instruction generated by the CPU 801 can include the storage address of the second data in the third memory space (which may be represented by a pointer corresponding to the first address plus the data length of the second data), the encoding operation the VPU 804 is to perform on the second data, and the fourth memory space in the shared memory pool 806 where the data generated by the VPU 804's encoding operation is to be stored.
  • the VPU 804 determines the second data stored in the third memory space according to the encoding instruction, performs an encoding operation on the second data to obtain the encoded data, and writes the encoded data into the fourth memory space in the shared memory pool 806.
  • after the VPU 804 completes the encoding operation and obtains the encoded data, it can notify the CPU 801.
  • the CPU 801 feeds back the encoded data in the fourth memory space to the client 1 through the network card 805 .
  • the CPU 801 can send a transfer instruction to the network card 805; the transfer instruction can include the storage address of the encoded data in the fourth memory space, so that the network card 805 can, based on the transfer instruction, read the encoded data from the fourth memory space in a "read-only" manner and send the encoded data to client 1, as shown in FIG. 9.
  • the CPU 801 can sequentially control the processing of the image data by issuing instructions to the GPU 802 , the NPU 803 , the VPU 804 and the network card 805 one by one.
  • for the specific implementation of how heterogeneous processors such as the GPU 802, NPU 803, and VPU 804 perform the corresponding operations according to the received instructions and store data, refer to the descriptions of the embodiment shown in FIG. 7; details are not repeated here.
  • the embodiment of the present application also provides a data processing system, and the data processing system may include one or more computing devices.
  • the computing device in the data processing system may be any of the computing devices in FIG. 2 to FIG. 6 and FIG. 8 above, or another applicable computing device adapted from those examples; this embodiment does not limit it.
  • the data processing system may constitute a computing device cluster including one or more computing devices.
  • the data processing system can be deployed on a backplane, and the backplane can integrate multiple memory modules implementing a shared memory pool, at least one central processing unit, and at least one heterogeneous device.
  • the backplane may also include more devices with other functions, and various devices on the backplane may be coupled through interfaces.
  • embodiments of the present application also provide a computer-readable storage medium storing instructions; when the instructions are run on a computing device, the computing device executes the methods performed by the computing device in each of the above embodiments.
  • the embodiment of the present application also provides a computer program product; when the computer program product is executed by the computing device in the above embodiments, the computing device executes the aforementioned data processing method.
  • the computer program product may be a software installation package, which can be downloaded and executed on a computer whenever any of the aforementioned data processing methods needs to be used.
  • the essence of the technical solution of this application, or the part that contributes to the prior art, can be embodied in the form of a software product; the computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disc, and includes several instructions that cause a computing device (which may be a personal computer, training device, network device, etc.) to execute the methods described in the various embodiments of the present application.
  • the above embodiments may be implemented, in whole or in part, by software, hardware, firmware, or any combination thereof.
  • when software is used, they may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions; when the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part.
  • the computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
  • the computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a training device or data center integrating one or more available media.
  • the available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), among others.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Business, Economics & Management (AREA)
  • Computer Security & Cryptography (AREA)
  • General Business, Economics & Management (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A computing device, data processing method, and system. The computing device includes a central processing unit (CPU), at least one heterogeneous device, and a shared memory pool. The CPU partitions the shared memory pool into multiple memory spaces, stores the first to-be-processed data associated with a service and provided by a client in a first memory space among them, and notifies a first heterogeneous device of the address of the first to-be-processed data in the first memory space and of the first operation the first heterogeneous device is to perform on that data; the first heterogeneous device performs the first operation on the first to-be-processed data and stores the resulting first data in a second memory space. The method spares the heterogeneous device from moving the first to-be-processed data between different memories: it can process the data directly in the shared memory pool, avoiding the high resource consumption and high service-processing latency that such data movement causes.

Description

Computing device, data processing method, system, and related devices
This application claims priority to the Chinese patent application No. 202210041883.5, entitled "Data processing method and computer", filed with the China National Intellectual Property Administration on January 14, 2022, and to the Chinese patent application No. 202210801224.7, entitled "Computing device, data processing method, system, and related devices", filed with the China National Intellectual Property Administration on July 8, 2022, both of which are incorporated herein by reference in their entirety.
Technical Field
This application relates to the field of data processing technologies, and in particular to a computing device, a data processing method, a system, and related devices.
Background
As services become more complex, the data volume and computing scale required to process them keep growing. For example, with technical breakthroughs in scenarios such as the Metaverse, the 3D development and collaboration platform Omniverse, and digital twins, cloud rendering has become one of the mainstream services. Cloud rendering moves storage, computing, and rendering to the cloud, enabling large-scale scene rendering and real-time generation of high-quality images there. A cloud rendering service typically includes image rendering, artificial intelligence (AI) noise reduction, encoding and streaming, and other stages, so the cloud can integrate multiple kinds of computing power, such as central processing units (CPUs) and graphics processing units (GPUs), and pipeline them so that different types of computing power handle different stages of the cloud rendering service.
In practice, a computing device that processes services may integrate multiple types of processors, each with its own separately configured memory, so that the device can process services with the multiple kinds of computing power those processors provide. However, such a computing device consumes considerable resources in processing services and exhibits high service-processing latency.
Summary
This application provides a computing device that reduces the resource consumption and latency of service processing. This application further provides a data processing method, apparatus, system, computer-readable storage medium, and computer program product.
In a first aspect, this application provides a computing device that includes a central processing unit, at least one heterogeneous device, and a shared memory pool. The at least one heterogeneous device includes a first heterogeneous device, and the shared memory pool includes multiple shared memories, which may be implemented, for example, by memory modules; the central processing unit, the at least one heterogeneous device, and the multiple shared memories are coupled through a bus. The central processing unit is configured to partition the shared memory pool into multiple memory spaces, store the first to-be-processed data associated with a service and provided by a client in a first memory space among them, and notify the first heterogeneous device of the address of the first to-be-processed data in the first memory space and of the first operation the first heterogeneous device is to perform on that data; the first heterogeneous device is configured to perform the first operation on the first to-be-processed data in the first memory space and store the resulting first data in a second memory space.
In this way, when processing the first to-be-processed data, the heterogeneous device does not need to move it between different memories; it can process the data directly in the shared memory pool. This avoids the high resource consumption and high service-processing latency caused by moving data between memories, reducing both the resources required and the latency. Moreover, because the multiple shared memories in the pool are coupled to the central processing unit and the heterogeneous devices through a bus, the shared memory pool can be configured in the computing device without being constrained by the central processing unit or the heterogeneous devices (for example, by the physical size of the chips they reside on). The computing device's local memory can therefore reach a high level, such as a memory pool with terabyte-scale capacity, allowing the computing device to load large-scale data into local memory all at once and meet the real-time processing requirements for large-scale data in practical application scenarios.
The second memory space may be indicated to the heterogeneous device by the central processing unit, or determined autonomously by the heterogeneous device; this is not limited.
In one possible implementation, the central processing unit is further configured to notify the first heterogeneous device that the storage location for the data produced by the first operation is the second memory space among the multiple memory spaces, thereby achieving unified allocation and management of memory space by the central processing unit.
In one possible implementation, the at least one heterogeneous device further includes a second heterogeneous device. The central processing unit is further configured to store second to-be-processed data associated with the service and provided by the client in a third memory space among the multiple memory spaces, and to notify the second heterogeneous device of the address of the second to-be-processed data in the third memory space, of the second operation to perform on it, and that the storage location for the data produced by the second operation is a fourth memory space among the multiple memory spaces; the second heterogeneous device is configured to perform the second operation on the second to-be-processed data in the third memory space to obtain second data and store the second data in the fourth memory space. The computing device can thus process the to-be-processed data associated with the service in parallel on multiple heterogeneous devices, improving data-processing efficiency and shortening the time the service takes.
In one possible implementation, the at least one heterogeneous device further includes a second heterogeneous device, and the central processing unit is further configured to notify the second heterogeneous device of the address of the first data in the second memory space, of the second operation to perform on the first data, and that the storage location for the data produced by the second operation is a fourth memory space among the multiple memory spaces; the second heterogeneous device is configured to perform the second operation on the first data in the second memory space and store the resulting second data in the fourth memory space. The computing device can thus process a service sequentially across multiple heterogeneous devices, each processing data directly in the shared memory pool without moving it between memories, which improves data-processing efficiency.
In one possible implementation, the first heterogeneous device and the second heterogeneous device are graphics processing units (GPUs).
In one possible implementation, the at least one heterogeneous device further includes a third heterogeneous device. The central processing unit is further configured to provide the addresses of the second and fourth memory spaces to the third heterogeneous device and to notify it of the third operation to perform on the first and second data and that the storage location for the data produced by the third operation is a fifth memory space among the multiple memory spaces; the third heterogeneous device is configured to perform the third operation on the first data in the second memory space and the second data in the fourth memory space to obtain third data and store the third data in the fifth memory space. The computing device can thus continue processing data with a larger number of heterogeneous devices, without moving the processed data between memories, improving data-processing efficiency.
In one possible implementation, the third heterogeneous device is a graphics processing unit (GPU), a neural-network processing unit (NPU), or a video processing unit (VPU).
In one possible implementation, the at least one heterogeneous device further includes a third heterogeneous device connected over a network to other computing devices. The central processing unit is further configured to provide the addresses of the second and fourth memory spaces to the third heterogeneous device, notify it of the third operation to perform on the first and second data, and instruct it to send the data produced by the third operation to the other computing devices; the third heterogeneous device is configured to perform the third operation on the first data in the second memory space and the second data in the fourth memory space to obtain third data and send the third data to the other computing devices. The third heterogeneous device can thus output the processed service data (the third data) to other computing devices to meet service requirements, or hand that service data to other computing devices for further processing.
In one possible implementation, the third heterogeneous device is a network card configured to forward the third data to the other computing devices.
In one possible implementation, the service is an image rendering task and the first to-be-processed data associated with the service is image data, so that the computing device can process the image rendering task on multiple heterogeneous devices and raise the efficiency of service-data processing through the shared memory pool.
In one possible implementation, the bus coupling the shared memory modules, the central processing unit, and the at least one heterogeneous device is a Generation Z (Gen-Z) bus, a cache coherent interconnect for accelerators (CCIX) bus, or a compute express link (CXL) bus.
In one possible implementation, the capacity of the shared memory pool is no less than 1 terabyte (TB).
In a second aspect, this application further provides a data processing method applied to a computing device, the computing device including a central processing unit, at least one heterogeneous device, and a shared memory pool, the at least one heterogeneous device including a first heterogeneous device, the shared memory pool including multiple shared memory modules, and the central processing unit, the at least one heterogeneous device, and the multiple shared memory modules being coupled through a bus. The method includes: the central processing unit partitions the shared memory pool into multiple memory spaces; the central processing unit stores first to-be-processed data associated with a service and provided by a client in a first memory space among the multiple memory spaces; the central processing unit notifies the first heterogeneous device of the address of the first to-be-processed data in the first memory space and of the first operation to perform on it; the first heterogeneous device performs the first operation on the first to-be-processed data in the first memory space to obtain first data, and stores the first data in a second memory space.
In one possible implementation, the at least one heterogeneous device further includes a second heterogeneous device, and the method further includes: the central processing unit stores second to-be-processed data associated with the service and provided by the client in a third memory space among the multiple memory spaces; the central processing unit notifies the second heterogeneous device of the address of the second to-be-processed data in the third memory space, of the second operation to perform on it, and that the storage location for the data produced by the second operation is a fourth memory space among the multiple memory spaces; the second heterogeneous device performs the second operation on the second to-be-processed data in the third memory space to obtain second data and stores the second data in the fourth memory space.
In one possible implementation, the at least one heterogeneous device further includes a second heterogeneous device, and the method further includes: the central processing unit notifies the second heterogeneous device of the address of the first data in the second memory space, of the second operation to perform on the first data, and that the storage location for the data produced by the second operation is a fourth memory space among the multiple memory spaces; the second heterogeneous device performs the second operation on the first data in the second memory space to obtain second data and stores the second data in the fourth memory space.
In one possible implementation, the first heterogeneous device and the second heterogeneous device are graphics processing units (GPUs).
In one possible implementation, the at least one heterogeneous device further includes a third heterogeneous device, and the method further includes: the central processing unit provides the addresses of the second and fourth memory spaces to the third heterogeneous device; the central processing unit notifies the third heterogeneous device of the third operation to perform on the first and second data and that the storage location for the data produced by the third operation is a fifth memory space among the multiple memory spaces; the third heterogeneous device performs the third operation on the first data in the second memory space and the second data in the fourth memory space to obtain third data and stores the third data in the fifth memory space.
In one possible implementation, the third heterogeneous device is a graphics processing unit (GPU), a neural-network processing unit (NPU), or a video processing unit (VPU).
In one possible implementation, the at least one heterogeneous device further includes a third heterogeneous device connected over a network to other computing devices, and the method further includes: the central processing unit provides the addresses of the second and fourth memory spaces to the third heterogeneous device; the central processing unit notifies the third heterogeneous device of the third operation to perform on the first and second data and instructs it to send the data produced by the third operation to the other computing devices; the third heterogeneous device performs the third operation on the first data in the second memory space and the second data in the fourth memory space to obtain third data and sends the third data to the other computing devices.
In one possible implementation, the third heterogeneous device is a network card.
In one possible implementation, the method further includes: the central processing unit notifies the first heterogeneous device that the storage location for the data produced by the first operation is the second memory space among the multiple memory spaces.
In one possible implementation, the service is an image rendering task, and the first to-be-processed data is image data.
In one possible implementation, the bus coupling the shared memory modules, the central processing unit, and the at least one heterogeneous device is a Gen-Z bus, a CCIX bus, or a CXL bus.
In one possible implementation, the capacity of the shared memory pool is no less than 1 TB.
The data processing method of the second aspect corresponds to the computing device of the first aspect; for the technical effects of the second aspect and of any of its possible implementations, refer to those of the first aspect and its corresponding implementations, which are not repeated here.
In a third aspect, this application provides a data processing system that includes at least one computing device as described in the first aspect or any implementation thereof.
In a fourth aspect, this application provides a computer-readable storage medium storing instructions that, when run on a computing device, cause the computing device to perform the method of the second aspect or any implementation thereof.
In a fifth aspect, this application provides a computer program product containing instructions that, when run on a computing device, cause the computing device to perform the method of the second aspect or any implementation thereof.
On the basis of the implementations provided in the above aspects, this application may be further combined to provide more implementations.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the accompanying drawings used in describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this application, and a person of ordinary skill in the art may further derive other drawings from them.
FIG. 1 is a schematic structural diagram of a computing device;
FIG. 2 is a schematic structural diagram of a computing device provided by an embodiment of this application;
FIG. 3 is a schematic structural diagram of another computing device provided by an embodiment of this application;
FIG. 4 is a schematic structural diagram of yet another computing device provided by an embodiment of this application;
FIG. 5 is a schematic structural diagram of still another computing device provided by an embodiment of this application;
FIG. 6 is a schematic diagram of two computing devices interconnected through high-speed interfaces according to an embodiment of this application;
FIG. 7 is a schematic flowchart of a data processing method provided by an embodiment of this application;
FIG. 8 is a schematic structural diagram of yet another computing device provided by an embodiment of this application;
FIG. 9 is a schematic diagram of processing an image rendering service according to an embodiment of this application;
FIG. 10 is a schematic flowchart of another data processing method provided by an embodiment of this application.
Detailed Description
The terms "first", "second", and the like in the specification, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; this is merely the way objects with the same properties are distinguished when the embodiments of this application are described.
Refer to FIG. 1, a schematic structural diagram of a computing device. As shown in FIG. 1, the computing device 100 includes multiple processors; FIG. 1 takes n processors (processor 1 to processor n) as an example. Each processor can be configured with its own memory, for example processor 1 with memory 1 and processor 2 with memory 2, and processor 1 and processor 2 can be coupled through a bus. At least two of the n processors are of different types and can provide the computing device 100 with different types of computing power; for example, processor 1 and processor 2 are processors of different types, with processor 1 being a CPU and processor 2 a GPU.
In practical scenarios, the computing device 100 may process a service, such as the cloud rendering service requested by client 1, with different types of computing power. In that case, processor 1 first writes the to-be-processed data of the service into memory 1 and processes it there, storing the resulting intermediate data in memory 1. Processor 2 then reads the intermediate data from memory 1 into memory 2 over the bus and processes it in memory 2, storing the resulting final data in memory 2. Because the same data (the intermediate data above) must be moved between different memories while the service data is processed, considerable resources are consumed and the efficiency with which the computing device 100 processes service data suffers. In particular, when the computing device uses three or more processors to process service data in sequence, the data may be moved frequently among multiple different memories, severely degrading the service-processing performance of the computing device 100.
Moreover, in practical scenarios, hardware constraints usually limit the memory capacity that can be configured separately for each processor. Specifically, since the chip a processor resides on is usually limited in size, the physical size of the memory deployed on that chip is also limited, so the memory that can be configured for the processor on the chip is usually small, for example no more than 32 GB (gigabytes). As a result, the computing device 100 can hardly load large-scale to-be-processed data into local memory all at once; it usually has to process different portions of the service data in turn, making it difficult to meet the real-time processing requirements for large-scale data.
Based on this, an embodiment of this application provides a computing device that improves service-processing performance and can further meet the need for real-time processing of large-scale data. As shown in FIG. 2, the computing device 200 includes a CPU 201, a heterogeneous device 202 (the CPU 201 and the heterogeneous device 202 can form a computing resource pool), and a shared memory pool 203; FIG. 2 takes the case that includes the CPU 201 and the heterogeneous device 202 as an example.
The CPU 201 and the heterogeneous device 202 can provide the computing device 200 with different types of computing power. For example, the heterogeneous device 202 may be a GPU or a neural-network processing unit (NPU); this embodiment does not limit it. The CPU 201 and the heterogeneous device 202 in the computing device 200 may be located on different base boards/chips.
The shared memory pool 203 includes multiple shared memories, such as the first shared memory and the second shared memory in FIG. 2. Each shared memory can be implemented with a memory controller and a storage medium; FIG. 2 takes n shared memories implementing the shared memory pool 203 as an example, each shared memory comprising a memory controller and a corresponding storage medium, and a shared memory may be, for example, a memory module. In actual deployment, the shared memory pool 203 can scale out: its capacity can grow as the number of memory controllers and storage media increases. The memory controller is the bus-circuit controller inside the computing device 200 that controls the shared memory pool 203 and manages and schedules data transfers between the shared memory pool 203 and the CPU 201 or the heterogeneous device 202; through the memory controller, the shared memory pool 203 can exchange data with the CPU 201 or the heterogeneous device 202. The memory controller can be a separate chip that controls the logic necessary to write data to and read data from the shared memory pool 203, and can be implemented with a general-purpose processor, a dedicated accelerator, a GPU, an FPGA, an embedded processor, and so on. The storage media in the shared memory pool 203 can be implemented with dynamic random access memory (DRAM), or can be dual in-line memory modules (DIMMs), among others. A DIMM can usually be treated as one memory-module entity; each entity can have two sides, both carrying memory chips. Each side can be called a rank, that is, one memory-module entity can have two ranks, and each rank can include multiple memory chips. For example, the memory controller and the storage medium can be connected through a double data rate (DDR) bus, or through another bus. In practice, the shared memory pool 203 can also be implemented in other ways, for example with other types of storage media; this embodiment does not limit it.
The CPU 201, the heterogeneous device 202, and the shared memory pool 203 in the computing device 200 can be coupled through a bus; for example, the CPU 201 can access data in the shared memory pool 203 over the bus, or send data (such as operation instructions) to the heterogeneous device 202 over it. The bus can be a peripheral component interconnect express (PCIe) bus, or another type of bus such as a cache coherent interconnect for accelerators (CCIX) bus, a Generation Z (Gen-Z) bus, or a compute express link (CXL) bus; this embodiment does not limit it.
The heterogeneous device 202 can be any type of heterogeneous processor, such as a GPU, NPU, or VPU, or another device.
For example, the computing device 200 can be deployed in a cloud data center. FIG. 2 takes a cloud data center including one computing device 200 as an example; in practice, a cloud data center can also include multiple computing devices. Both the cloud data center and client 1 are connected to the Internet, so that client 1 can communicate with each computing device in the cloud data center over the network.
The CPU 201 can partition the shared memory pool 203 into multiple memory spaces. When the computing device 200 processes a service with the computing power provided by a heterogeneous device, the CPU 201 can receive the to-be-processed data associated with the service from client 1 and write it into the first memory space allocated to the service in the shared memory pool 203, as shown in FIG. 2; the CPU 201 can then notify the heterogeneous device 202 of the address of the to-be-processed data in the first memory space and of the first operation the heterogeneous device 202 is to perform on it.
The heterogeneous device 202 can then perform the first operation on the to-be-processed data in the first memory space: it can directly access the data there, execute the corresponding first operation on it within the shared memory pool to obtain the first data, and store the first data in the second memory space of the shared memory pool 203, as shown in FIG. 2. If the first data is the final output of the computing device 200 for this service, the computing device 200 can send it to client 1; if the first data needs further processing, other heterogeneous devices in the computing device 200 can continue processing it in the shared memory pool 203, and the data finally obtained from processing the first data is sent to client 1.
In this way, when processing the to-be-processed data, the heterogeneous device 202 does not need to move it between different memories but can process it directly in the shared memory pool 203, avoiding the high resource consumption and high service-processing latency caused by moving data between memories and reducing both the resources the service requires and its latency.
Moreover, the multiple shared memories in the shared memory pool 203 are interconnected with the CPU 201 and the heterogeneous device 202 through a bus, so configuring the shared memory pool 203 in the computing device 200 is not constrained by the CPU 201 or the heterogeneous device 202 (for example, by the physical size of the chips they reside on). The local memory of the computing device 200 can therefore reach a high level, such as a memory pool with TB (terabyte)-scale capacity. The computing device 200 can thus load large-scale data into local memory (the shared memory pool 203) all at once, to meet the real-time processing requirements for large-scale data in practical application scenarios.
Note that the computing device 200 shown in FIG. 2 is only an example and does not limit its concrete implementation. For example, in other embodiments, the computing device 200 may, in addition to the shared memory pool 203, configure separate memories (independent of the shared memory pool 203) for the CPU 201 and the heterogeneous device 202, so that the CPU 201 and the heterogeneous device 202 can process other services with those separately configured memories; or the computing device 200 may include a larger number or more types of heterogeneous devices, with one or more devices of each type; or it may include more devices with other functions; this embodiment does not limit it.
In actual deployment, the computing device 200 can be deployed on the user side, serving as a local device that provides local data processing services; or it can be deployed in the cloud, such as a public cloud, edge cloud, or distributed cloud, to provide cloud data processing services such as cloud rendering.
For ease of understanding, the process by which the computing device 200 processes service data is described in detail below, based on the computing device 200 shown in FIG. 2.
In this embodiment, client 1 (or another client) can request the computing device 200 to process a service, for example an image rendering service or another cloud service. Specifically, client 1 can generate a service processing request that includes the identifier of the service and the to-be-processed data associated with the service, and send the request to the computing device 200. Client 1 may be, for example, a web browser exposed by the computing device 200 for interacting with the user, or an application running on the user's terminal, such as a software development kit (SDK).
The CPU 201 can receive the service processing request sent by client 1, parse out the service identifier and the associated to-be-processed data (for example, the image data of an image rendering service), and determine from the identifier which heterogeneous device 202 in the computing device 200 is to process the service. The one or more heterogeneous devices the computing device 200 uses for different services can be configured in advance according to actual application needs; for example, the computing device 200 can process one service with the heterogeneous device 202, and process the service requested by client 1 with heterogeneous devices other than the device 202. For ease of description, this embodiment takes the computing device 200 processing the service on the heterogeneous device 202 as an example.
The CPU 201 can allocate the first memory space for the service from the shared memory pool 203 according to the volume of the to-be-processed data carried in the service processing request. The CPU 201 can partition the shared memory pool 203 into multiple memory spaces and assign the first memory space among them to the service. The size of the first memory space can be determined by the volume of the to-be-processed data, or configured in advance by technicians for the service, in which case the CPU 201 looks up the configuration information by the service identifier to determine the size of the first memory space allocated for the service in the shared memory pool 203.
The CPU 201 then writes the to-be-processed data associated with the service into the first memory space and records the address at which it is stored there. For example, the address can be represented by the first (start) address at which the data is stored in the first memory space plus the length of the to-be-processed data.
Next, the CPU 201 can notify the heterogeneous device 202 of the address of the to-be-processed data in the first memory space and of the first operation to perform on it. As one implementation example, the CPU 201 can generate an operation instruction for the to-be-processed data and send it to the heterogeneous device 202. The operation instruction can include the address at which the data is stored in the first memory space; for example, it can include a pointer indicating the start address of the data in the first memory space and a data length indicating the length of the data. The operation instruction can also carry an indication of the first operation the heterogeneous device 202 is to perform on the data. The type of the first operation depends on the computing power of the heterogeneous device 202: when the device is a GPU, the first operation can be, for example, a rendering operation on the data; when the device is an NPU, the first operation can be, for example, an AI noise-reduction operation.
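By way of illustration only, one possible C encoding of such an operation instruction is sketched below. The patent text specifies only the logical fields (a pointer to the start address, a data length, the operation to perform, and optionally the output memory space); the struct name, field names, and opcode values here are assumptions of this sketch, not part of the disclosed design.

```c
#include <stdint.h>

/* Hypothetical encoding of one operation instruction; the patent only
 * enumerates the logical fields, so this layout is an assumption. */
enum op_code {
    OP_RENDER  = 1,   /* e.g., first operation when the device is a GPU */
    OP_DENOISE = 2,   /* e.g., first operation when the device is an NPU */
};

struct op_instr {
    uint64_t src_addr;  /* start address of the data in the shared pool  */
    uint64_t src_len;   /* length of the to-be-processed data, in bytes  */
    uint32_t op;        /* which operation the heterogeneous device runs */
    uint64_t dst_addr;  /* start of the output memory space, if CPU-set  */
    uint64_t dst_len;   /* capacity of the output memory space           */
};
```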
In this embodiment, the heterogeneous device 202 can be configured with a message queue, through which it caches the operation instructions sent by the CPU 201. The heterogeneous device 202 can read an operation instruction from the queue and parse out the location of the to-be-processed data in the first memory space, so that the device can locate the data in the first memory space; based on the instruction, the heterogeneous device 202 then performs the first operation on the data and stores the resulting first data in the second memory space of the shared memory pool 203, thereby processing the to-be-processed data within the shared memory pool 203. The second memory space can be designated by the CPU 201; for example, the operation instruction the CPU 201 sends to the heterogeneous device 202 can also include the address of the second memory space, represented by its start address and size. Alternatively, the second memory space can be determined by the heterogeneous device 202 itself, that is, the device can choose it from the remaining available memory space in the shared memory pool 203.
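The cooperation just described can be modeled end to end in a few lines. The following self-contained C program is a minimal sketch, assuming a trivial ring buffer for the message queue and a byte-wise transform standing in for the first operation; none of these names or behaviors come from the patent itself.

```c
/* Minimal model: the "CPU" enqueues an operation instruction, the
 * "device" pops it and operates on the shared buffer in place. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct op_instr { uint64_t src_off, src_len, dst_off; uint32_t op; };

static uint8_t shared_pool[4096];       /* stands in for the memory pool */
static struct op_instr queue[8];        /* the device's message queue    */
static int q_head, q_tail;

static void queue_push(struct op_instr in) { queue[q_tail++ % 8] = in; }
static int queue_pop(struct op_instr *out)
{
    if (q_head == q_tail) return 0;
    *out = queue[q_head++ % 8];
    return 1;
}

int main(void)
{
    /* "CPU" writes the to-be-processed data into the first memory space */
    memcpy(shared_pool, "pixels", 6);
    /* ... and notifies the device: address, length, operation, output.  */
    queue_push((struct op_instr){ .src_off = 0, .src_len = 6,
                                  .dst_off = 2048, .op = 1 });

    /* "Device" processes directly in the shared pool -- no copy into a
     * private memory; here the "operation" just uppercases the bytes.  */
    struct op_instr in;
    while (queue_pop(&in))
        for (uint64_t i = 0; i < in.src_len; i++)
            shared_pool[in.dst_off + i] = shared_pool[in.src_off + i] - 32;

    printf("%.6s\n", (char *)shared_pool + 2048);  /* prints PIXELS */
    return 0;
}
```

Compiled and run, the program prints PIXELS: the "device" consumed the instruction and transformed the data in place in the shared pool, with no copy into a separately configured memory.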
In practice, when the computing device 200 processes the to-be-processed data using only the computing power of the heterogeneous device 202, it can feed the first data back to client 1 as the final result; when the computing device 200 also uses the computing power of other heterogeneous devices to continue processing the first data, it can feed back to client 1 the data finally obtained from that processing. This embodiment does not limit this.
Understandably, because the heterogeneous device 202 can process the to-be-processed data directly in the shared memory pool 203 without moving it between different memories, the resources the computing device 200 needs to process the service and the service-processing latency are both reduced. Furthermore, the computing device 200 can support writing large-scale to-be-processed data into the shared memory pool 203 at once, so the CPU 201 and the heterogeneous device 202 can each process large batches of service data at a time, improving service-processing efficiency.
Notably, the embodiment shown in FIG. 2 takes the computing device 200 processing the service's to-be-processed data on a single heterogeneous device 202 as an example. In other embodiments, the computing device 200 also includes other heterogeneous devices, and when the service requires processing by multiple heterogeneous devices, the computing device 200 can process the service's data with the heterogeneous device 202 together with the other heterogeneous devices.
For example, see the computing device 200 shown in FIG. 3. On the basis of the computing device 200 in FIG. 2, the computing device 200 in FIG. 3 further includes a heterogeneous device 204. Unlike the embodiment of FIG. 2, in the computing device 200 of FIG. 3 the heterogeneous device 204 can cooperate with the heterogeneous device 202 in processing the to-be-processed data associated with the first service. For ease of understanding, this embodiment presents the following two implementation examples.
In the first implementation example, the heterogeneous device 202 and the heterogeneous device 204 can process, in parallel, the to-be-processed data associated with the service requested by client 1. For example, both are GPUs, and the computing device 200 can use multiple GPUs to process in parallel different service data produced by an image rendering service.
Specifically, the CPU 201 can receive the first to-be-processed data and the second to-be-processed data of the service from client 1. The CPU 201 can write the first to-be-processed data into the first memory space of the shared memory pool 203 and record its address there, and write the second to-be-processed data into the third memory space and record its address there. The CPU 201 can then generate operation instruction 1, which can carry the address at which the first to-be-processed data is stored in the first memory space, the first operation the heterogeneous device 202 is to perform on it, and the second memory space of the shared memory pool 203 as the storage location for the data the first operation produces. The CPU 201 can also generate operation instruction 2, which can carry the address at which the second to-be-processed data is stored in the third memory space, the second operation the heterogeneous device 204 is to perform on it, and the fourth memory space of the shared memory pool 203 as the storage location for the data the second operation produces. The CPU 201 can then send operation instruction 1 over the interconnect bus to the message queue in the heterogeneous device 202, and operation instruction 2 to the message queue in the heterogeneous device 204.
The heterogeneous device 202 reads operation instruction 1 from its message queue and determines from it the first to-be-processed data in the first memory space, performs the first operation on that data to obtain the corresponding first data, stores the first data in the second memory space indicated by the CPU 201, and obtains its storage address in the second memory space.
Likewise, the heterogeneous device 204 reads operation instruction 2 from its message queue and determines from it the second to-be-processed data in the third memory space, performs the second operation on that data to obtain the corresponding second data, stores the second data in the fourth memory space indicated by the CPU 201, and obtains its storage address in the fourth memory space.
The computing device 200 thus processes the service's first and second to-be-processed data in parallel on multiple heterogeneous devices, which improves the data-processing efficiency of the computing device 200.
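A minimal sketch of this parallel dispatch follows, assuming a hypothetical dev_queue_push() primitive that writes an instruction into a device's message queue over the interconnect bus; the device identifiers simply reuse the reference numerals of FIG. 3 for readability.

```c
/* Hypothetical CPU-side dispatch of the two instructions above. */
struct op_instr;  /* as sketched earlier */
extern void dev_queue_push(int device_id, const struct op_instr *in);

void dispatch_parallel(const struct op_instr *instr1,
                       const struct op_instr *instr2)
{
    dev_queue_push(202, instr1); /* first data  -> heterogeneous device 202 */
    dev_queue_push(204, instr2); /* second data -> heterogeneous device 204 */
    /* Both devices now run concurrently; each writes its result to the
     * memory space the CPU named in its instruction (second/fourth). */
}
```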
In the second implementation example, the heterogeneous device 202 and the heterogeneous device 204 can process, serially, the to-be-processed data associated with the service requested by client 1. For example, the heterogeneous device 202 is a GPU and the heterogeneous device 204 a video processor (or NPU), and the GPU and the video processor (or NPU) serially process the service data produced by an image rendering service; or both are GPUs, used to process the service data produced by an image rendering service in sequence.
Specifically, the CPU 201 can receive the first to-be-processed data of the service from client 1, write it into the first memory space of the shared memory pool 203, and record the address at which it is stored there. The CPU 201 can then generate operation instruction 1 and send it over the interconnect bus to the message queue in the heterogeneous device 202; the instruction can carry the address at which the first to-be-processed data is stored in the first memory space, the first operation the heterogeneous device 202 is to perform on it, and the second memory space of the shared memory pool 203 as the storage location for the data the first operation produces.
The heterogeneous device 202 reads operation instruction 1 from its message queue, determines from it the first to-be-processed data in the first memory space, performs the first operation on it to obtain the corresponding first data, stores the first data in the second memory space indicated by the CPU 201, and obtains its storage address there. The heterogeneous device 202 can then feed the address of the first data in the second memory space back to the CPU 201.
Next, the CPU 201 can generate operation instruction 2 and send it over the interconnect bus to the message queue in the heterogeneous device 204; this instruction can carry the address at which the first data is stored in the second memory space, the second operation the heterogeneous device 204 is to perform on the first data, and the fourth memory space of the shared memory pool 203 as the storage location for the data the second operation produces.
The heterogeneous device 204 reads operation instruction 2 from its message queue, determines from it the first data in the second memory space, performs the second operation on the first data to obtain the corresponding second data, and stores the second data in the fourth memory space indicated by the CPU 201.
上述图2以及图3所示的实施例中,主要是以计算设备200基于CPU201以及一个或者多个异构设备处理业务为例进行示例性说明,实际应用场景中,计算设备200可以支持处理客户端所请求的多种不同的业务,并且,计算设备200在处理不同业务时可以采用不同的算力组合,不同算力组合所包括的算力类型不同或者算力规格不同。比如,如图4所示,客户端1向计算设备200所请求处理业务,也可以要求计算设备200采用CPU201、异构设备202、异构设备204以及异构设备205进行处理。
具体地,参见图4,图4示出了另一种计算设备200的结构示意图。在图3所示的计算设备200的基础上,图4所示的计算设备200中还可以包括异构设备205。其中,异构设备205的类型,可以与异构设备202的类型以及异构设备204的类型均不相同。比如,以后设备202为GPU、异构设备204为NPU、异构设备205为视频处理器(video processing unit,VPU);或者,异构设备205为具有数据转发功能的网卡,用于向其它计算设备转发数据等。或者,异构设备205的类型,可以与异构设备202的类型相同,如均为GPU;或者,异构设备205的类型,可以与异构设备204的类型相同等,如均为NPU。
与图3所示实施例不同的是,CPU201可以生成操作指令3,该操作指令3包括上述第一数据在第二内存空间的地址、第二数据在第四内存空间的地址、异构设备205针对第一数据以及针对第二数据所需执行的第三操作、以及第三操作产生的数据的存储位置为共享内存池203中的第五内存空间。然后,CPU201可以将该操作指令3发送至异构设备205中的消息队列。其中,第二数据,可以是异构设备204对第二待处理数据执行第二操作所生成的数据,或者可以是异构设备204对第一数据执行第二操作所生成的数据,本实施例对此并不进行限定。
异构设备204从消息队列中读出操作指令3,并根据该操作指令3从第二内存空间中获得第一数据、从第四内存空间中获得第二数据,从而异构设备204可以对第一数据以及第二数据执行第三操作,得到相应的第三数据,并在CPU201所指示的第五内存空间中存储该第三数据。
进一步地,当计算设备200仅利用异构设备202、异构设备204以及异构设备205所提供的算力处理业务数据时,计算设备200可以将第三数据作为最终的处理结果反馈给客户端1。
或者,异构设备203可以是具有数据转发能力的设备,如网卡等,此时,异构设备205可以是在CPU201的指示下向与该异构设备网络连接的其它计算设备发送第三数据。
具体地,,CPU201可以生成操作指令3,并且,该操作指令3除了包括上述第一数据在第二内存空间的地址、第二数据在第四内存空间的地址、异构设备205针对第一数据以及针对第二数据所需执行的第三操作之外,还包括通知异构设备205将第三操作产生的数据发送给其它计算设备的指示信息。如此,异构设备204在对第一数据以及第二数据执行第三操作并得到相应的第三数据后,直接将该第三数据发送给其它计算设备,可以无需在共享内存池203中存储该第三数据。
本实施例中,计算设备200不仅可以根据业务需求从计算资源池中灵活选择不同的算力组合处理不同的业务,在进一步的实施方式中,计算设备200还可以根据业务需求从共享内存池203中灵活选择不同容量的共享内存,以支持多个相同或者不同算力类型的异构设备在该共享内存池中处理业务数据。比如,计算设备200在处理业务时,从共享内存池203中为该业务1分配的总的内存空间的容量为1TB,而在处理业务2时,从共享内存池203中为该业务分配的总的内存空间的容量为10TB等。
In the embodiments of FIG. 2 to FIG. 4 above, every heterogeneous device in the computing device 200 processes services with memory space in the shared memory pool 203. In other implementation examples, one or more heterogeneous devices in the computing device 200 can also have separately configured local memory and use that local memory to hold the corresponding service data, to satisfy the higher memory-access-bandwidth requirements the computing device 200 has for some services.
Specifically, see the computing device 200 shown in FIG. 5: on the basis of the computing device 200 in FIG. 4, the heterogeneous devices 202, 204, and 205 in FIG. 5 each have their own separately configured local memory, shown as memory 1, memory 2, and memory 3. Memory 1 can be deployed on the chip/base board where the heterogeneous device 202 is located, memory 2 on that of the device 204, and memory 3 on that of the device 205. In general, the speed at which the heterogeneous device 202 accesses memory 1 on its chip or base board is higher than the speed at which it accesses the shared memory in the shared memory pool 203 over the PCIe bus, and the same holds for the other heterogeneous devices accessing their respective local memories.
Thus, when processing services that demand high memory-access speed, the heterogeneous device 202 in the computing device 200 can access memory 1 with higher bandwidth and use memory 1 to process the service data. For example, the capacity of the memory separately configured for each heterogeneous device in the computing device 200 can be at the GB (gigabyte) level, such as 32 GB per device, which this embodiment does not limit. Further, the memory separately configured for each heterogeneous device can use coarse-grained 4 KB (kilobyte) page tables for data caching, reducing the overhead introduced by the high-speed interconnect bus between heterogeneous devices.
In a further possible implementation, the CPU 201 in the computing device 200 can also have its own separately configured memory, such as memory 4 in FIG. 5, so that the CPU 201 performs the corresponding data processing based on memory 4.
The embodiments of FIG. 2 to FIG. 5 take building the computing resource pool and the memory pool inside the computing device 200 as an example. In other embodiments, the computing device 200 can also interconnect with other computing devices through high-speed interconnect interfaces, to build a larger memory pool across computing devices.
For example, as shown in FIG. 6, the computing device 200 can be interconnected with the computing device 300 through high-speed interconnect interfaces (high-speed interconnect interface 1 and high-speed interconnect interface 2 in FIG. 6). The computing device 300 can have one or more heterogeneous devices (FIG. 6 takes the computing device 300 including the heterogeneous devices 301 and 302 as an example) and a shared memory pool 303; the shared memory pool 303 includes one or more shared memories, which can be implemented, for example, with at least one memory controller and at least one storage medium (such as DIMMs).
After the computing devices 200 and 300 are interconnected, each heterogeneous device in the computing device 200 (such as the device 202) can access the shared memory pool 303 in the computing device 300 through the high-speed interconnect interface; likewise, the heterogeneous device 301 (and the other heterogeneous devices) in the computing device 300 can access the shared memory pool 203 in the computing device 200 through that interface. The shared memory pool 203 in the computing device 200 and the shared memory pool 303 in the computing device 300 can thus form a logically larger unified memory pool, shared by the heterogeneous devices of both computing devices. In practice, while sufficient memory is available in the shared memory pool 203, the computing device 200 preferentially allocates memory space for a service from the shared memory pool 203; when the available memory in the shared memory pool 203 runs low, the computing device 200 can use the shared memory pool 303 in the computing device 300 to process service data.
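The allocation preference described here amounts to a two-level fallback. A minimal sketch, assuming hypothetical pool_avail()/pool_alloc() management primitives and pool identifiers that reuse the figure's reference numerals:

```c
/* Prefer the local shared pool 203; fall back to the remote pool 303
 * reached over the high-speed interconnect when local memory runs low. */
#include <stddef.h>

extern size_t pool_avail(int pool_id);
extern void  *pool_alloc(int pool_id, size_t bytes);

#define POOL_LOCAL  203
#define POOL_REMOTE 303

void *unified_alloc(size_t bytes)
{
    if (pool_avail(POOL_LOCAL) >= bytes)
        return pool_alloc(POOL_LOCAL, bytes);   /* preferred: local pool */
    return pool_alloc(POOL_REMOTE, bytes);      /* fallback: remote pool */
}
```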
The computing device 200 can also use the heterogeneous devices in the computing device 300 to expand its computing power, including expansion of computing-power specifications, computing-power types, and the like. For example, when the computing device 200 needs three processors to process a service, it can use the heterogeneous devices 202, 301, and 302 to process the service in sequence, thereby expanding the computing power of the computing device 200.
To facilitate further understanding of the technical solutions of the embodiments of this application, the embodiments are described below with a concrete application scenario: an image rendering service.
Refer to FIG. 7, FIG. 8, and FIG. 9. FIG. 7 is a schematic flowchart of a data processing method provided by an embodiment of this application, FIG. 8 is a schematic structural diagram of another computing device provided by an embodiment of this application, and FIG. 9 is a schematic diagram of processing an image rendering service provided by an embodiment of this application. The data processing method shown in FIG. 7 can be applied to the computing device 800 shown in FIG. 8; the computing device 800 includes a CPU 801 and multiple heterogeneous devices (a GPU 802, an NPU 803, and a VPU 804), and the CPU 801, the multiple heterogeneous devices, and the shared memory pool 806 in the computing device 800 are coupled through the bus 807.
Based on the computing device 800 shown in FIG. 8, the data processing method shown in FIG. 7 can specifically include:
S701: Client 1 encodes image data based on a user input operation and generates an image data stream to be rendered, where the image data stream includes the image data and the user input operation.
S702: Client 1 sends the image data stream to the computing device 800.
In one possible application scenario, the user can perform operations on the interactive interface provided by client 1, such as clicking a control in the interface that controls the movement of the player character in a game; client 1 can then encode the image data presented by the current interactive interface (such as the game screen) according to the user's input operation, and generate an image data stream that includes the user input operation and the image data. Client 1 can then generate a cloud rendering request carrying the image data stream to be rendered and send it to the computing device 800 deployed in the cloud, to request the computing device 800 to perform the corresponding rendering on the stream according to the user's input operation, for example so that the position of the player character in the rendered game screen changes according to that input.
Accordingly, the computing device 800 can receive the image data stream to be rendered from client 1 through the network card 805. A network card, which may also be called a network interface controller (NIC), is responsible for receiving data sent by external devices and for sending data to external devices.
S703: The CPU 801 decodes the received image data stream to be rendered, obtains the image data and the user input operation, and writes the image data into the first memory space of the shared memory pool 806.
In practice, the CPU 801 can write the decoded image data into the first memory space of the shared memory pool 806 with "write-only" permission, where "write-only" permission means that, with respect to the decoded image data, the CPU 801 may only perform write operations on it.
Specifically, as shown in FIG. 8, the CPU 801 can include a high-speed interface 8011, a memory management unit (MMU) 8012, a message queue 8013, and a processing unit 8014. The high-speed interface 8011 can be, for example, a serializer/deserializer (SerDes) interface, through which the processing unit 8014 can write the image data into the first memory space of the shared memory pool 806; the first memory space can be allocated to store the image data of the cloud rendering service requested by client 1. The MMU 8012 can be used to manage the shared memory pool 806, including capacity expansion of the pool, health-status monitoring, memory-resource allocation, and the like. The message queue 8013 can be used to cache the operation instructions generated by the CPU 801, so that the processing unit 8014 can dispatch the instructions in the message queue 8013 to other processors.
The shared memory pool 806 includes a high-speed interface 8061 (there may be one or more), a home agent unit 8062, at least one memory controller 8063, and a storage medium 8064. In this embodiment, when constructing the shared memory pool 806, one or more memory controllers 8063 and storage media 8064 can be attached to the bus 807, and by configuring the home agent unit 8062, cache-coherent non-uniform memory access (CCNUMA) is supported among the multiple processors in the computing device 800; that is, multiple memory units (such as memory units built from multiple storage media) are connected to form a single memory of larger capacity. At the same time, the shared memory pool 806 can also provide the high-speed interface 8061 for external communication, so as to receive through it the data (such as the above image data) sent by the processors in the computing device 800, and write that data into the storage medium 8064 through the memory controller 8063.
Each processor in the computing device 800 is interconnected with the shared memory pool 806 through the bus 807; for example, the image data the CPU 801 sends to the shared memory pool 806 is transmitted to the shared memory over the bus 807. As shown in FIG. 8, the bus 807 can include a high-speed interface 8071 for connecting the processors, a high-speed interface 8072 for connecting the shared memory pool 806, and switch units 8073 for data exchange; because a single switch unit 8073 can interconnect only a limited number of hardware components in practical application scenarios, the number of switch units 8073 in the bus 807 can be determined from the number of processors attached to the bus 807 in the computing device 800 and the number of memory controllers 8063 in the shared memory pool 806.
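The sizing rule for the switch units can be written as a one-line ceiling division. PORTS_PER_SWITCH below is an assumed per-switch endpoint limit; the text states only that the number of endpoints per switch is finite.

```c
/* Back-of-the-envelope sizing for the switch units in bus 807: the count
 * scales with attached processors plus memory controllers. */
#define PORTS_PER_SWITCH 16   /* assumed limit, not given in the text */

static inline int switch_units_needed(int processors, int mem_controllers)
{
    int endpoints = processors + mem_controllers;
    return (endpoints + PORTS_PER_SWITCH - 1) / PORTS_PER_SWITCH; /* ceil */
}
```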
S704: The CPU 801 generates a rendering instruction for the image data according to the user input operation and the processing logic, and sends the rendering instruction to the GPU 802.
For example, the rendering instruction generated and sent by the CPU 801 can include the storage address of the image data to be rendered in the first memory space (which can be represented by a pointer corresponding to the first address plus the data length) and the processing operations to be performed by the GPU 802, NPU 803, and VPU 804, which process the image data in sequence. Further, the rendering instruction can also include the addresses of the memory spaces in the shared memory pool 806 where the data generated by each processor performing its corresponding operation is to be stored.
S705: The GPU 802 determines the image data in the shared memory according to the rendering instruction, performs the rendering operation on the image data to obtain the first data, and writes the first data into the second memory space of the shared memory pool 806.
Specifically, the GPU 802 can receive the rendering instruction sent by the CPU 801 through the high-speed interface 8021, cache the rendering instruction (or the storage location of the image data indicated in it) in the message queue 8022, and parse the rendering instruction in the message queue 8022 with the microcontroller 8023 to determine the storage location of the image data and the processing operations the GPU 802 is to perform on it. The processing unit 8024 can then use the input-output memory management unit (IOMMU) 8025 to access the image data in the shared memory, perform the rendering operation on the image data there to obtain the first data, write the first data into the second memory space through the IOMMU 8025, and record the storage address of the first data in the second memory space, for example its first address and data length. The IOMMU 8025 keeps the page tables of the GPU 802 and the CPU 801 consistent, so that the GPU 802 can use virtual addresses to access the shared memory pool 806 managed by the MMU 8012 in the CPU 801. In practice, multiple different processors in the computing device 800 can share the same page table; alternatively, according to actual business requirements, they may not, for example some processors having permission to read and write data in the shared memory pool 806 while the other processors may only read it; this embodiment does not limit it.
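The per-processor permission scheme sketched in this paragraph could be programmed roughly as follows; iommu_map() and the flag names are stand-ins for whatever interface the IOMMU 8025 actually exposes, and the processor ids reuse the reference numerals of FIG. 8 purely for readability.

```c
/* Illustrative per-processor permissions on one shared-pool region. */
#include <stdint.h>

#define PERM_R (1u << 0)
#define PERM_W (1u << 1)

extern int iommu_map(int proc_id, uint64_t va, uint64_t pa,
                     uint64_t len, unsigned perms);

void share_first_memory_space(uint64_t va, uint64_t pa, uint64_t len)
{
    iommu_map(801, va, pa, len, PERM_W);          /* CPU: "write-only"   */
    iommu_map(802, va, pa, len, PERM_R);          /* GPU: "read-only"    */
    iommu_map(803, va, pa, len, PERM_R | PERM_W); /* e.g., full access   */
}
```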
Notably, FIG. 7 takes a single GPU 802 processing the image data as an example. In practical scenarios, the computing device 800 can also process image data with multiple GPUs serially or in parallel. For example, see FIG. 9: the computing device 800 can include 16 GPUs, GPU0 to GPU15, and can use them to accelerate the processing of the image data. The CPU 801 can send GPU0 a rendering instruction that includes the storage address of the image data in the first memory space and the operations GPU0 to GPU15 are to perform; GPU0 can then access the image data in the first memory space with "read-only" permission, provide the image data to the remaining GPUs (GPU1 to GPU15), instruct them to perform the corresponding rendering operations on the image data in parallel, and write the generated first data into the second memory space designated by the CPU 801 in a "write-only" manner. In other implementations, the CPU 801 can instead send rendering instructions to each GPU individually, instructing each GPU to perform the corresponding rendering operations on different image data stored in the first memory space.
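A sketch of the GPU0 fan-out just described, with an even split of the input range as an illustrative policy (the helper functions are assumptions, and a real implementation would also handle remainders and per-GPU scheduling):

```c
/* Hypothetical fan-out of the rendering work across GPU0..GPU15. */
#include <stdint.h>

#define NGPU 16
extern void gpu_render_async(int gpu, uint64_t src, uint64_t len,
                             uint64_t dst);
extern void gpu_wait(int gpu);

void fan_out_render(uint64_t first_space, uint64_t len,
                    uint64_t second_space)
{
    uint64_t chunk = len / NGPU;               /* even split, len % NGPU
                                                  ignored in this sketch */
    for (int g = 0; g < NGPU; g++)
        gpu_render_async(g, first_space + g * chunk, chunk,
                         second_space + g * chunk);
    for (int g = 0; g < NGPU; g++)
        gpu_wait(g);  /* all partial results now sit in the second space */
}
```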
S706: The GPU 802 generates a noise reduction instruction and sends it to the NPU 803.
For example, the GPU 802 can determine, from the sequence of processors carried in the rendering instruction, that the next processor to handle the image data is the NPU 803; the GPU 802 can therefore direct the NPU 803 to continue processing the first data in the second memory space.
Specifically, the GPU 802 can generate and send a noise reduction instruction to the NPU 803, so that the NPU 803 performs AI noise reduction on the first data. The noise reduction instruction generated by the GPU 802 can carry information such as the storage address of the first data in the second memory space and the processing operations to be performed by the NPU 803 and the VPU 804, which process the image data in sequence. Further, the instruction can also include the storage address of the third memory space in the shared memory pool 806 for the data generated by the NPU 803 and the VPU 804 performing their corresponding processing operations.
S707: The NPU 803 determines the first data in the second memory space according to the noise reduction instruction, performs the noise reduction operation on the first data to obtain the second data, and writes the second data into the third memory space of the shared memory pool 806.
The NPU 803 can parse, from the received noise reduction instruction, the address at which the first data is stored in the second memory space, and access the first data from the second memory space in a "read-only" manner, so as to process the first data according to the noise reduction operation indicated by the instruction and obtain the second data. The NPU 803 can then write the second data into the third memory space in a "write-only" manner, at the address of the third memory space indicated by the noise reduction instruction, as shown in FIG. 9.
Specifically, like the GPU 802, the NPU 803 can include a high-speed interface 8031, a message queue 8032, a microcontroller 8033, a processing unit 8034, and an IOMMU 8035.
The NPU 803 can receive the noise reduction instruction through the high-speed interface 8031, cache the instruction (or the storage location of the first data indicated in it) in the message queue 8032, and parse the noise reduction instruction in the message queue 8032 with the microcontroller 8033 to determine the storage location of the first data in the second memory space and that the processing operation the NPU 803 is to perform on the first data is a noise reduction operation. The processing unit 8034 can then use the IOMMU 8035 to access the first data in the shared memory and perform the noise reduction operation on it there, for example removing the noise data in the first data and applying super-resolution processing to the denoised first data (that is, constructing high-resolution image data from low-resolution image data) to obtain the second data; it writes the second data into the third memory space through the IOMMU 8035 and records the storage location of the second data in the third memory space.
S708: The NPU 803 generates an encoding instruction and sends it to the VPU 804.
The NPU 803 can determine, from the sequence of processors carried in the noise reduction instruction, that the next processor to continue processing the image data is the VPU 804; the NPU 803 can therefore direct the VPU 804 to continue processing the image data.
Specifically, the NPU 803 can generate and send an encoding instruction to the VPU 804, so that the VPU 804 encodes the second data. The encoding instruction generated by the NPU 803 can carry information such as the storage address of the second data in the third memory space and the processing operation to be performed by the VPU 804. Further, the instruction can also include the storage address of the fourth memory space in the shared memory pool 806 for the data generated by the VPU 804 performing the corresponding processing operation.
S709: The VPU 804 determines the second data stored in the third memory space according to the encoding instruction, performs the encoding operation on the second data to obtain the encoded data, and writes the encoded data into the fourth memory space of the shared memory pool 806.
The VPU 804 can parse, from the received encoding instruction, the address at which the second data is stored in the third memory space, and access the second data from the third memory space in a "read-only" manner, so as to process the second data according to the encoding operation indicated by the instruction and obtain the encoded data. The VPU 804 can then write the encoded data into the fourth memory space in a "write-only" manner, at the address of the fourth memory space indicated by the encoding instruction, as shown in FIG. 9.
For example, the VPU 804 can include a high-speed interface 8041, a message queue 8042, a microcontroller 8043, a processing unit 8044, and an IOMMU 8045, with which it determines the second data in the third memory space, performs the corresponding encoding operation on the second data to obtain the encoded data, and continues to cache the encoded data through the fourth memory space in the shared memory pool 806. For the specific implementation of the VPU 804 encoding the second data according to the encoding instruction, refer to the description above of the NPU 803 performing the noise reduction operation on the first data according to the noise reduction instruction; it is not repeated here.
S710: The CPU 801 feeds the encoded data in the fourth memory space back to client 1 through the network card 805.
For example, the CPU 801 can send a transfer instruction to the network card 805; the transfer instruction can include the storage address of the encoded data in the fourth memory space, so that the network card 805 can, based on the transfer instruction, read the encoded data from the fourth memory space in a "read-only" manner and send the encoded data to client 1, as shown in FIG. 9.
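The transfer step reduces to handing the NIC an (address, length) pair into the fourth memory space. A sketch, with nic_send_from_pool() as an assumed primitive:

```c
/* The NIC streams the encoded data out of the shared pool; the CPU never
 * touches the payload itself. All names here are illustrative. */
#include <stdint.h>

struct xfer_instr { uint64_t addr, len; uint32_t client_id; };
extern void nic_send_from_pool(const struct xfer_instr *x);

void feed_back_encoded(uint64_t fourth_space_addr, uint64_t encoded_len)
{
    struct xfer_instr x = { fourth_space_addr, encoded_len,
                            /* client */ 1 };
    nic_send_from_pool(&x);  /* NIC reads "read-only" from the pool */
}
```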
In this way, the computing device 800 can implement the image rendering service requested by client 1 and provide client 1 with an image rendering service, which can be a cloud service or a local service; this embodiment does not limit it.
Notably, in the embodiment shown in FIG. 7, although the NPU 803 and the VPU 804 receive their operation instructions from the previous processor that handles the service data and determine the data storage addresses and the operations to perform from the instructions they respectively receive, the address information and operation information included in those instructions all originate from the rendering instruction of the CPU 801. The GPU 802, NPU 803, and VPU 804 therefore actually process and store the service data under the coordination and notification of the CPU 801. In this way, the number of interactions between the CPU 801 and the GPU 802, NPU 803, and VPU 804 can be reduced, thereby reducing the load of the CPU 801 and improving its control performance.
In other embodiments, the instructions received by heterogeneous processors such as the GPU 802, NPU 803, and VPU 804 can also be issued directly by the CPU 801. For ease of understanding, this is described in detail below in conjunction with FIG. 8, FIG. 9, and FIG. 10.
Refer to FIG. 10, which shows a schematic flowchart of another data processing method provided by an embodiment of this application; the method can specifically include:
S1001: Client 1 encodes image data based on a user input operation and generates an image data stream to be rendered, where the image data stream includes the image data and the user input operation.
S1002: Client 1 sends the image data stream to the computing device 800.
S1003: The CPU 801 decodes the received image data stream to be rendered, obtains the image data and the user input operation, and writes the image data into the first memory space of the shared memory pool 806.
In this embodiment, for the specific implementation of steps S1001 to S1003, refer to the descriptions of steps S701 to S703 in the embodiment shown in FIG. 7; they are not repeated here.
S1004: The CPU 801 generates a rendering instruction for the image data according to the user input operation and the processing logic, and sends the rendering instruction to the GPU 802.
Unlike the embodiment shown in FIG. 7, in this embodiment the CPU 801 controls the heterogeneous processors to perform the corresponding operations by issuing instructions to them one by one. To this end, the rendering instruction the CPU 801 generates for the GPU 802 can include the storage address of the image data to be rendered in the first memory space (which can be represented by a pointer corresponding to the first address plus the data length), the rendering operation the GPU 802 is to perform on the image data, and the second memory space in the shared memory pool 806 where the data generated by the GPU 802 performing the rendering operation is to be stored.
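The FIG. 10 control style, one instruction per stage with a completion notice before the next, can be condensed into a loop over a stage table. Everything named below (the primitives, device ids, opcodes, and example space addresses) is illustrative, not taken from the patent.

```c
/* CPU-orchestrated pipeline in miniature: issue, wait, chain the output. */
#include <stdint.h>

struct stage { int device; int op; uint64_t out_space; };
extern void issue(int device, int op, uint64_t in, uint64_t out);
extern void wait_notify(int device);

uint64_t run_pipeline(uint64_t first_space)
{
    struct stage stages[] = {
        { 802, 1, 0x2000 },  /* GPU 802: render  -> "second" space */
        { 803, 2, 0x3000 },  /* NPU 803: denoise -> "third" space  */
        { 804, 3, 0x4000 },  /* VPU 804: encode  -> "fourth" space */
    };
    uint64_t in = first_space;
    for (unsigned i = 0; i < sizeof stages / sizeof stages[0]; i++) {
        issue(stages[i].device, stages[i].op, in, stages[i].out_space);
        wait_notify(stages[i].device);  /* notify-then-next, per FIG. 10 */
        in = stages[i].out_space;       /* each output feeds the next    */
    }
    return in;  /* address of the encoded data, ready for the NIC */
}
```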
S1005: The GPU 802 determines the image data in the shared memory according to the rendering instruction, performs the rendering operation on the image data to obtain the first data, and writes the first data into the second memory space of the shared memory pool 806.
In this embodiment, after completing the rendering operation and obtaining the first data, the GPU 802 can notify the CPU 801, so that the CPU 801 instructs other heterogeneous processors to continue processing the first data.
S1006: The CPU 801 generates a noise reduction instruction and sends it to the NPU 803.
The noise reduction instruction generated by the CPU 801 can include the storage address of the first data in the second memory space (which can be represented by a pointer corresponding to the first address plus the data length of the first data), the noise reduction operation the NPU 803 is to perform on the first data, and the third memory space in the shared memory pool 806 where the data generated by the NPU 803 performing the noise reduction operation is to be stored.
S1007: The NPU 803 determines the first data in the second memory space according to the noise reduction instruction, performs the noise reduction operation on the first data to obtain the second data, and writes the second data into the third memory space of the shared memory pool 806.
After completing the noise reduction operation and obtaining the second data, the NPU 803 can notify the CPU 801, so that the CPU 801 instructs other heterogeneous processors to continue processing the second data.
S1008: The CPU 801 generates an encoding instruction and sends it to the VPU 804.
The encoding instruction generated by the CPU 801 can include the storage address of the second data in the third memory space (which can be represented by a pointer corresponding to the first address plus the data length of the second data), the encoding operation the VPU 804 is to perform on the second data, and the fourth memory space in the shared memory pool 806 where the data generated by the VPU 804 performing the encoding operation is to be stored.
S1009: The VPU 804 determines the second data stored in the third memory space according to the encoding instruction, performs the encoding operation on the second data to obtain the encoded data, and writes the encoded data into the fourth memory space of the shared memory pool 806.
After completing the encoding operation and obtaining the encoded data, the VPU 804 can notify the CPU 801.
S1010: The CPU 801 feeds the encoded data in the fourth memory space back to client 1 through the network card 805.
For example, the CPU 801 can send a transfer instruction to the network card 805; the transfer instruction can include the storage address of the encoded data in the fourth memory space, so that the network card 805 can, based on the transfer instruction, read the encoded data from the fourth memory space in a "read-only" manner and send the encoded data to client 1, as shown in FIG. 9.
In this embodiment, by issuing instructions to the GPU 802, NPU 803, VPU 804, and network card 805 one by one, the CPU 801 can control the processing of the image data in an orderly way. For the specific implementation of how heterogeneous processors such as the GPU 802, NPU 803, and VPU 804 perform the corresponding operations according to the received instructions and store data, refer to the descriptions of the embodiment shown in FIG. 7; they are not repeated here.
In addition, an embodiment of this application further provides a data processing system, which can include one or more computing devices. The computing device in the data processing system can be any of the computing devices in FIG. 2 to FIG. 6 and FIG. 8 above, or another applicable computing device adapted from those examples; this embodiment does not limit it.
Optionally, the data processing system can form a computing-device cluster including one or more computing devices. Alternatively, the data processing system can be deployed on a backplane that integrates the multiple memory modules implementing the shared memory pool, at least one central processing unit, and at least one heterogeneous device. In practice, when the data processing system is deployed on a backplane, the backplane can also include more devices with other functions, and the devices on the backplane can be coupled through interfaces.
In addition, an embodiment of this application further provides a computer-readable storage medium storing instructions that, when run on a computing device, cause the computing device to perform the methods performed by the computing device in the embodiments above.
In addition, an embodiment of this application further provides a computer program product; when the computer program product is executed by the computing device in the embodiments above, the computing device performs the foregoing data processing method. The computer program product can be a software installation package, which can be downloaded and executed on a computer whenever any of the foregoing data processing methods is needed.
From the description of the implementations above, those skilled in the art can clearly understand that this application can be implemented with software plus the necessary general-purpose hardware, and of course also with dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memory, dedicated components, and so on. In general, any function performed by a computer program can easily be implemented with corresponding hardware, and the concrete hardware structures that implement the same function can also vary, for example analog circuits, digital circuits, or dedicated circuits. For this application, however, a software implementation is in most cases the better choice. Based on this understanding, the essence of this application's technical solution, or the part that contributes to the prior art, can be embodied in the form of a software product; the computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disc, and includes several instructions that cause a computing device (which may be a personal computer, training device, network device, etc.) to perform the methods described in the embodiments of this application.
The embodiments above can be implemented wholly or partly by software, hardware, firmware, or any combination thereof; when software is used, they can be implemented wholly or partly in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced wholly or partly. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions can be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions can be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired (for example, coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (for example, infrared, radio, microwave) means. The computer-readable storage medium can be any available medium a computer can store, or a data storage device such as a training device or data center integrating one or more available media. The available medium can be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), among others.

Claims (27)

  1. 一种计算设备,其特征在于,所述计算设备包括中央处理器、至少一个异构设备以及共享内存池,所述至少一个异构设备包括第一异构设备,所述共享内存池包括多个共享内存,所述中央处理器、所述至少一个异构设备与所述多个共享内存条通过总线进行耦合;
    所述中央处理器,用于在所述共享内存池中划分多个内存空间,将由客户端提供的与业务关联的第一待处理数据存储在所述多个内存空间的第一内存空间中,并通知所述第一异构设备所述第一待处理数据在所述第一内存空间的地址、所述第一异构设备针对所述第一待处理数据执行的第一操作;
    所述第一异构设备,用于对所述第一内存空间中的所述第一待处理数据执行所述第一操作,得到第一数据,并将所述第一数据存储在所述第二内存空间中。
  2. 根据权利要求1所述的计算设备,其特征在于,所述中央处理器,还用于通知所述第一异构设备所述第一操作产生的数据的存储位置为所述多个内存空间中的第二内存空间。
  3. 根据权利要求1或2所述的计算设备,其特征在于,所述至少一个异构设备还包括第二异构设备;
    所述中央处理器,还用于将由客户端提供的与业务关联的第二待处理数据存储在所述多个内存空间的第三内存空间中,并通知所述第二异构设备所述第二待处理数据在所述第三内存空间的地址、所述第二异构设备针对所述第二待处理数据执行的第二操作、以及所述第二操作产生的数据的存储位置为所述多个内存空间中的第四内存空间;
    所述第二异构设备,用于对所述第三内存空间中的所述第二待处理数据执行所述第二操作,得到第二数据,并将所述第二数据存储在所述第四内存空间中。
  4. 根据权利要求1或2所述的计算设备,其特征在于,所述至少一个异构设备还包括第二异构设备;
    所述中央处理器,还用于通知所述第二异构设备所述第一数据在所述第二内存空间的地址、所述第二异构设备针对所述第一数据执行的第二操作、以及所述第二操作产生的数据的存储位置为所述多个内存空间中的第四内存空间;
    所述第二异构设备,用于对所述第二内存空间中的所述第一数据执行所述第二操作,得到第二数据,并将所述第二数据存储在所述第四内存空间中。
  5. 根据权利要求3或4所述的计算设备,其特征在于,所述第一异构设备和所述第二异构设备为图像处理器GPU。
  6. 根据权利要求3至5任一项所述的计算设备,其特征在于,所述至少一个异构设备还包括第三异构设备;
    所述中央处理器,还用于将所述第二内存空间和所述第四内存空间的地址提供给所述第三异构设备,并通知所述第三异构设备针对所述第一数据以及所述第二数据执行的第三操作、以及所述第三操作产生的数据的存储位置为所述多个内存空间中的第五内存空间;
    所述第三异构设备,用于对所述第二内存空间中的所述第一数据和所述第四内存空间中的所述第二数据执行所述第三操作,得到第三数据,并将所述第三数据存储在所述第五内存空间中。
  7. 根据权利要求6所述的计算设备,其特征在于,所述第三异构设备为图形处理器GPU 或神经网络处理器NPU或视频处理器VPU。
  8. The computing device according to any one of claims 3 to 5, wherein the at least one heterogeneous device further comprises a third heterogeneous device, and the third heterogeneous device is network-connected to another computing device;
    the central processing unit is further configured to provide the addresses of the second memory space and the fourth memory space to the third heterogeneous device, and notify the third heterogeneous device of a third operation to be performed by the third heterogeneous device on the first data and the second data and to send the data produced by the third operation to the other computing device;
    the third heterogeneous device is configured to perform the third operation on the first data in the second memory space and the second data in the fourth memory space to obtain third data, and send the third data to the other computing device.
  9. The computing device according to claim 8, wherein the third heterogeneous device is a network interface card.
  10. The computing device according to any one of claims 1 to 9, wherein the service is an image rendering task, and the first to-be-processed data is image data.
  11. The computing device according to any one of claims 1 to 10, wherein the bus is a Gen-Z bus, a cache coherent interconnect for accelerators (CCIX) bus, or a compute express link (CXL) bus.
  12. The computing device according to any one of claims 1 to 11, wherein a capacity of the shared memory pool is not less than 1 terabyte (TB).
  13. A data processing method, wherein the data processing method is applied to a computing device, the computing device comprises a central processing unit, at least one heterogeneous device and a shared memory pool, the at least one heterogeneous device comprises a first heterogeneous device, the shared memory pool comprises a plurality of shared memory modules, and the central processing unit, the at least one heterogeneous device and the plurality of shared memory modules are coupled through a bus; the method comprises:
    partitioning, by the central processing unit, the shared memory pool into a plurality of memory spaces;
    storing, by the central processing unit, first to-be-processed data associated with a service and provided by a client in a first memory space of the plurality of memory spaces;
    notifying, by the central processing unit, the first heterogeneous device of an address of the first to-be-processed data in the first memory space and of a first operation to be performed by the first heterogeneous device on the first to-be-processed data;
    performing, by the first heterogeneous device, the first operation on the first to-be-processed data in the first memory space to obtain first data, and storing the first data in a second memory space of the plurality of memory spaces.
  14. The method according to claim 13, wherein the method further comprises:
    notifying, by the central processing unit, the first heterogeneous device that a storage location of the data produced by the first operation is the second memory space of the plurality of memory spaces.
  15. The method according to claim 13 or 14, wherein the at least one heterogeneous device further comprises a second heterogeneous device, and the method further comprises:
    storing, by the central processing unit, second to-be-processed data associated with the service and provided by the client in a third memory space of the plurality of memory spaces;
    notifying, by the central processing unit, the second heterogeneous device of an address of the second to-be-processed data in the third memory space, of a second operation to be performed by the second heterogeneous device on the second to-be-processed data, and that a storage location of the data produced by the second operation is a fourth memory space of the plurality of memory spaces;
    performing, by the second heterogeneous device, the second operation on the second to-be-processed data in the third memory space to obtain second data, and storing the second data in the fourth memory space.
  16. The method according to claim 13 or 14, wherein the at least one heterogeneous device further comprises a second heterogeneous device, and the method further comprises:
    notifying, by the central processing unit, the second heterogeneous device of an address of the first data in the second memory space, of a second operation to be performed by the second heterogeneous device on the first data, and that a storage location of the data produced by the second operation is a fourth memory space of the plurality of memory spaces;
    performing, by the second heterogeneous device, the second operation on the first data in the second memory space to obtain second data, and storing the second data in the fourth memory space.
  17. The method according to claim 15 or 16, wherein the first heterogeneous device and the second heterogeneous device are graphics processing units (GPUs).
  18. The method according to any one of claims 15 to 17, wherein the at least one heterogeneous device further comprises a third heterogeneous device, and the method further comprises:
    providing, by the central processing unit, the addresses of the second memory space and the fourth memory space to the third heterogeneous device;
    notifying, by the central processing unit, the third heterogeneous device of a third operation to be performed by the third heterogeneous device on the first data and the second data, and that a storage location of the data produced by the third operation is a fifth memory space of the plurality of memory spaces;
    performing, by the third heterogeneous device, the third operation on the first data in the second memory space and the second data in the fourth memory space to obtain third data, and storing the third data in the fifth memory space.
  19. The method according to claim 18, wherein the third heterogeneous device is a graphics processing unit (GPU), a neural-network processing unit (NPU) or a video processing unit (VPU).
  20. The method according to any one of claims 15 to 17, wherein the at least one heterogeneous device further comprises a third heterogeneous device, the third heterogeneous device is network-connected to another computing device, and the method further comprises:
    providing, by the central processing unit, the addresses of the second memory space and the fourth memory space to the third heterogeneous device;
    notifying, by the central processing unit, the third heterogeneous device of a third operation to be performed by the third heterogeneous device on the first data and the second data and to send the data produced by the third operation to the other computing device;
    performing, by the third heterogeneous device, the third operation on the first data in the second memory space and the second data in the fourth memory space to obtain third data, and sending the third data to the other computing device.
  21. The method according to claim 20, wherein the third heterogeneous device is a network interface card.
  22. The method according to any one of claims 13 to 21, wherein the service is an image rendering task, and the first to-be-processed data is image data.
  23. The method according to any one of claims 13 to 22, wherein the bus is a Gen-Z bus, a cache coherent interconnect for accelerators (CCIX) bus, or a compute express link (CXL) bus.
  24. The method according to any one of claims 13 to 23, wherein a capacity of the shared memory pool is not less than 1 terabyte (TB).
  25. A data processing system, wherein the data processing system comprises at least one computing device according to any one of claims 1 to 12.
  26. A computer-readable storage medium, wherein the computer-readable storage medium stores instructions which, when run on a computing device, cause the computing device to perform the method according to any one of claims 13 to 24.
  27. A computer program product containing instructions which, when run on a computing device, cause the computing device to perform the method according to any one of claims 13 to 24.
PCT/CN2023/071994 2022-01-14 2023-01-13 Computing device, data processing method and system, and related device WO2023134735A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210041883 2022-01-14
CN202210041883.5 2022-01-14
CN202210801224.7 2022-07-08
CN202210801224.7A CN116483553A (zh) 2022-01-14 2022-07-08 Computing device, data processing method and system, and related device

Publications (1)

Publication Number Publication Date
WO2023134735A1 true WO2023134735A1 (zh) 2023-07-20

Family

ID=87225524

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/071994 WO2023134735A1 (zh) 2022-01-14 2023-01-13 Computing device, data processing method and system, and related device

Country Status (2)

Country Link
CN (1) CN116483553A (zh)
WO (1) WO2023134735A1 (zh)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9529706B1 (en) * 2014-05-29 2016-12-27 Amdocs Software Systems Limited System, method, and computer program for performing software application operations concurrently with memory compaction operations
CN109547531A (zh) * 2018-10-19 2019-03-29 华为技术有限公司 Data processing method and apparatus, and computing device
CN113515483A (zh) * 2020-04-10 2021-10-19 华为技术有限公司 Data transmission method and apparatus
CN112214444A (zh) * 2020-09-24 2021-01-12 深圳云天励飞技术股份有限公司 Inter-core communication method, ARM, DSP and terminal

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116886751A (zh) * 2023-09-04 2023-10-13 浪潮(北京)电子信息产业有限公司 High-speed communication method and apparatus for heterogeneous devices, and heterogeneous communication system
CN116886751B (zh) * 2023-09-04 2024-01-19 浪潮(北京)电子信息产业有限公司 High-speed communication method and apparatus for heterogeneous devices, and heterogeneous communication system
CN118519753A (zh) * 2024-07-23 2024-08-20 天翼云科技有限公司 Pooled-memory-based computing resource aggregation method and system

Also Published As

Publication number Publication date
CN116483553A (zh) 2023-07-25


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23740076

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2023740076

Country of ref document: EP

Effective date: 20240726

NENP Non-entry into the national phase

Ref country code: DE