WO2021213209A1 - Data processing method and apparatus, and heterogeneous system - Google Patents

Data processing method and apparatus, and heterogeneous system

Info

Publication number
WO2021213209A1
WO2021213209A1 (PCT/CN2021/086703; CN2021086703W)
Authority
WO
WIPO (PCT)
Prior art keywords
accelerator
processor
auxiliary
data
processing
Prior art date
Application number
PCT/CN2021/086703
Other languages
English (en)
French (fr)
Inventor
李涛
林伟彬
刘昊程
许利霞
李生
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to EP21792489.3A (published as EP4120094A4)
Publication of WO2021213209A1
Priority to US18/046,151 (published as US20230114242A1)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/167: Interprocessor communication using a common memory, e.g. mailbox
    • G06F12/0815: Cache consistency protocols
    • G06F13/1668: Details of memory controller
    • G06F9/5044: Allocation of resources to service a request, the resource being a machine, considering hardware capabilities
    • G06F2209/509: Offload
    • G06F2212/1016: Performance improvement
    • G06F2212/621: Coherency control relating to peripheral accessing, e.g. from DMA or I/O device

Definitions

  • This application relates to the field of computer technology, in particular to a data processing method and device, and a heterogeneous system.
  • Heterogeneous systems usually include processors and accelerators connected via a peripheral component interconnect express (PCIE) bus.
  • The accelerator can assist the processor in executing certain data processing procedures, so that the heterogeneous system has strong data processing capability.
  • the processor is connected to the main memory, and the accelerator is connected to the auxiliary memory.
  • When the processor needs the accelerator to process data, it must first notify the accelerator to move the data to be processed from the main memory to the auxiliary memory by direct memory access (DMA).
  • After that, the processor must also notify the accelerator to process the data in the auxiliary memory.
  • After the accelerator has processed the data, it writes the processing result into the auxiliary memory and notifies the processor that the data has been processed.
  • Finally, the processor must notify the accelerator to move the processing result from the auxiliary memory to the main memory by DMA, so that the processor can obtain the processing result from the main memory.
  • This application provides a data processing method and device, and a heterogeneous system, which can solve the problem of low data processing efficiency.
  • the technical solution is as follows:
  • In a first aspect, a heterogeneous system is provided, including: a first processor, a first accelerator connected to the first processor, and a first auxiliary memory connected to the first accelerator. The first processor is configured to write data to be processed into the first auxiliary memory and to trigger the first accelerator to process the data to be processed in the first auxiliary memory according to a processing instruction. The first accelerator is configured to write the processing result of the data to be processed into the first auxiliary memory and to trigger the first processor to read the processing result from the first auxiliary memory.
  • the first accelerator can assist the first processor to process the data to be processed, so the data processing capability of the entire heterogeneous system is relatively high.
  • The first processor can write the data to be processed directly into the auxiliary memory connected to the first accelerator. This avoids both the step in which the first processor notifies the first accelerator to move the data to be processed from the main memory connected to the first processor to the auxiliary memory, and the move itself performed by the first accelerator.
  • the first accelerator can directly write the processing result into the auxiliary memory, and the first processor can obtain the processing result from the auxiliary memory.
  • This avoids the first accelerator notifying the first processor that the data has been processed, and the first processor notifying the first accelerator to move the processing result from the auxiliary memory to the main memory. Therefore, in the embodiments of this application, the number of interactions between the first processor and the first accelerator is small and the data processing procedure is simple, so data processing efficiency is high.
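As a rough illustration of the interaction savings described above, the following sketch simply counts the processor-accelerator notifications in each flow (the message names are illustrative, not from this application):

```python
# Illustrative sketch: count processor<->accelerator interactions in the
# legacy PCIE/DMA flow versus the cache-coherent flow described here.

def legacy_pcie_flow(log):
    log.append("CPU->ACC: move data to auxiliary memory by DMA")  # move-in notice
    log.append("CPU->ACC: process the data")                      # processing request
    log.append("ACC->CPU: data has been processed")               # completion notice
    log.append("CPU->ACC: move result to main memory by DMA")     # move-out notice
    return log

def cache_coherent_flow(log):
    # The processor writes data into auxiliary memory directly, so no
    # DMA move-in or move-out notifications are needed.
    log.append("CPU->ACC: processing instruction")
    log.append("ACC->CPU: processing response")
    return log

legacy = legacy_pcie_flow([])
coherent = cache_coherent_flow([])
print(len(legacy), len(coherent))  # 4 2
```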
  • the first processor and the first accelerator are connected through a cache coherency bus.
  • the cache coherency bus is a bus that uses the cache coherency protocol.
  • When the processor and the accelerator are connected by a cache coherency bus, the storage space of the main memory, the storage space on the accelerator, and the storage space of the auxiliary memory in the heterogeneous system can all be visible to the processor. These storage spaces are uniformly addressed by the processor, so the processor can read and write them based on their addresses.
  • the cache coherency bus includes: a CCIX bus or a CXL bus.
  • When the cache coherency bus includes a CCIX bus, the first processor includes an ARM architecture processor; or, when the cache coherency bus includes a CXL bus, the first processor includes an x86 architecture processor.
  • Optionally, the auxiliary memory includes HBM. Since HBM can provide a higher-bandwidth storage function, it can improve the data processing efficiency of heterogeneous systems. In addition, HBM is smaller in size and lower in operating power.
  • Optionally, the accelerator includes a GPU, an FPGA, or an ASIC; the accelerator may also be any other device with data processing functions, which is not limited in this application.
  • In a possible implementation, the heterogeneous system includes multiple accelerators connected to each other, and the first accelerator is any one of the multiple accelerators. The processing instruction carries an accelerator identifier, which is the identifier of the accelerator in the heterogeneous system that is to execute the processing instruction.
  • The first accelerator processes the data to be processed in the first auxiliary memory when the accelerator identifier is the identifier of the first accelerator. It can be seen that when the heterogeneous system includes multiple accelerators, the first accelerator can determine from the accelerator identifier whether it is the accelerator designated by the processor to process the data to be processed.
  • In a possible implementation, the heterogeneous system includes multiple auxiliary memories connected to the multiple accelerators in one-to-one correspondence, and multiple processors connected to each other; the first processor is any processor among them that is connected to the first accelerator. The processing instruction also carries the identifier of the first processor. When the accelerator identifier is not the identifier of the first accelerator, the first accelerator writes the data to be processed into the auxiliary memory connected to the auxiliary accelerator indicated by the accelerator identifier and triggers that auxiliary accelerator to process the data according to the processing instruction. After processing the data according to the processing instruction, the auxiliary accelerator writes the processing result into its connected auxiliary memory and, according to the identifier of the first processor carried in the processing instruction, triggers the first processor to read the processing result from that auxiliary memory.
  • In this way, the first accelerator can forward the processing instruction to the auxiliary accelerator, so that the auxiliary accelerator performs the processing of the data to be processed without the first processor having to deliver the instruction to the auxiliary accelerator itself.
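The forwarding decision described above can be sketched as follows (a simplified illustration; the field names and callbacks are hypothetical, since this application does not define an instruction format):

```python
# Hypothetical sketch of the accelerator-identifier check: the receiving
# accelerator either processes the instruction itself or forwards the data
# and instruction to the auxiliary accelerator named by the identifier.
def dispatch(instruction, self_id, peers, process, forward):
    target = instruction["accelerator_id"]
    if target == self_id:
        return process(instruction)          # first accelerator handles it
    # Otherwise, hand off to the auxiliary accelerator the identifier names.
    return forward(peers[target], instruction)

result = dispatch(
    {"accelerator_id": "acc2", "processor_id": "cpu1", "data_addr": "E"},
    self_id="acc1",
    peers={"acc2": "auxiliary-accelerator-2"},
    process=lambda ins: ("processed-by-self", ins["data_addr"]),
    forward=lambda peer, ins: ("forwarded-to", peer),
)
print(result)  # ('forwarded-to', 'auxiliary-accelerator-2')
```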
  • the multiple accelerators are connected through a cache coherency bus, and the multiple processors are connected through a cache coherency bus.
  • The processor can trigger the accelerator to process the data to be processed in the auxiliary memory according to the processing instruction in various ways, and the accelerator can likewise trigger the processor to read the processing result from the auxiliary memory connected to the accelerator in various ways.
  • For example, the processor triggers the accelerator to process the data by sending the processing instruction to the accelerator, and the accelerator triggers the processor to read the processing result by sending a processing response to the processor.
  • Alternatively, the processor can trigger the accelerator to process the data by changing the state value of a register, and the accelerator can trigger the processor to read the processing result in the same way.
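One common way to realize "changing the state value of a register" as a trigger is a doorbell-style register that each side polls; the following is a minimal sketch under that assumption (the state encoding is invented for illustration):

```python
# Minimal doorbell sketch: a shared "register" whose state value signals
# pending work (1) or a ready result (2); each side polls for its value.
class DoorbellRegister:
    def __init__(self):
        self.state = 0           # 0 = idle, 1 = work pending, 2 = result ready

    def ring(self, value):
        self.state = value       # writer changes the state value

    def poll(self, expected):
        return self.state == expected

reg = DoorbellRegister()
reg.ring(1)                      # processor triggers the accelerator
assert reg.poll(1)               # accelerator observes pending work
reg.ring(2)                      # accelerator triggers the processor
print(reg.poll(2))               # True: processor sees the result is ready
```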
  • In a second aspect, a data processing method is provided for a first accelerator in a heterogeneous system.
  • The heterogeneous system further includes: a first processor, and a first auxiliary memory connected to the first accelerator.
  • The method includes: under the trigger of the first processor, processing the data to be processed in the first auxiliary memory according to a processing instruction, then writing the processing result of the data to be processed into the first auxiliary memory, and triggering the first processor to read the processing result from the first auxiliary memory.
  • In a possible implementation, the heterogeneous system includes multiple accelerators connected to each other, and the first accelerator is any one of the multiple accelerators. The processing instruction carries an accelerator identifier, which is the identifier of the accelerator in the heterogeneous system that is to execute the processing instruction. When the accelerator identifier is the identifier of the first accelerator, the first accelerator processes the data to be processed in the first auxiliary memory according to the processing instruction.
  • In a possible implementation, the heterogeneous system includes multiple auxiliary memories connected to the multiple accelerators in one-to-one correspondence, and multiple processors connected to each other; the first processor is any processor among them that is connected to the first accelerator. The processing instruction also carries the identifier of the first processor. The method further includes: when the accelerator identifier is not the identifier of the first accelerator, writing the data to be processed into the auxiliary memory connected to the auxiliary accelerator indicated by the accelerator identifier, and triggering the auxiliary accelerator to process the data according to the processing instruction.
  • the first accelerator may trigger the first processor to read the processing result from the auxiliary memory connected to the first accelerator in various ways.
  • the first accelerator triggers the first processor to read the foregoing processing result by sending a processing response to the first processor.
  • the first accelerator may also trigger the first processor to read the foregoing processing result by changing the state value of the register.
  • In a third aspect, a data processing method is provided for use in an auxiliary accelerator in a heterogeneous system.
  • The heterogeneous system includes: multiple processors connected to each other, multiple accelerators connected to each other, and multiple auxiliary memories connected to the multiple accelerators in one-to-one correspondence; the auxiliary accelerator and the first accelerator are any two connected accelerators among the multiple accelerators. The method includes: under the trigger of the first accelerator, processing the data to be processed in the auxiliary memory connected to the auxiliary accelerator according to a processing instruction, where the processing instruction carries the identifier of the first processor connected to the first accelerator; writing the processing result of the data to be processed into the connected auxiliary memory; and then, according to the identifier of the first processor carried in the processing instruction, triggering the first processor to read the processing result from the auxiliary memory connected to the auxiliary accelerator.
  • the auxiliary accelerator may trigger the first processor to read the processing result from the auxiliary memory connected to the auxiliary accelerator in various ways.
  • the auxiliary accelerator triggers the first processor to read the foregoing processing result by sending a processing response to the first processor.
  • the auxiliary accelerator may also trigger the first processor to read the foregoing processing result by changing the state value of the register.
  • In a fourth aspect, a data processing method is provided for a first processor in a heterogeneous system.
  • The heterogeneous system further includes: a first accelerator connected to the first processor, and a first auxiliary memory connected to the first accelerator. The method includes: writing data to be processed into the first auxiliary memory, and triggering the first accelerator to process the data to be processed in the first auxiliary memory according to a processing instruction; then, under the trigger of the first accelerator, reading the processing result of the data to be processed from the first auxiliary memory.
  • the heterogeneous system includes: multiple processors connected to each other, multiple accelerators connected to each other, and multiple auxiliary memories connected to the multiple accelerators one by one;
  • The processing instruction carries an accelerator identifier and the identifier of the first processor, where the accelerator identifier is the identifier of the accelerator in the heterogeneous system that is to execute the processing instruction.
  • When the accelerator identifier is the identifier of the first accelerator, the first processor reads the processing result of the data to be processed from the first auxiliary memory under the trigger of the first accelerator; when the accelerator identifier is the identifier of an auxiliary accelerator connected to the first accelerator, the first processor reads the processing result from the auxiliary memory connected to that auxiliary accelerator under the trigger of the auxiliary accelerator.
  • the first processor may trigger the first accelerator to process the to-be-processed data in the first auxiliary memory according to the processing instruction in various ways.
  • the first processor triggers the first accelerator to process the aforementioned data to be processed by sending a processing instruction to the first accelerator.
  • the first processor may also trigger the first accelerator to process the aforementioned data to be processed by changing the state value of the register.
  • In a fifth aspect, a data processing device is provided for use in a first accelerator in a heterogeneous system, where the heterogeneous system further includes: a first processor, and a first auxiliary memory connected to the first accelerator.
  • The data processing device includes the various modules for executing the data processing method provided in the second aspect.
  • In a sixth aspect, a data processing device is provided for use in an auxiliary accelerator in a heterogeneous system.
  • The heterogeneous system includes: multiple processors connected to each other, multiple accelerators connected to each other, and multiple auxiliary memories connected to the multiple accelerators in one-to-one correspondence; the auxiliary accelerator and the first accelerator are any two connected accelerators among the multiple accelerators. The data processing device includes the various modules for executing the data processing method provided in the third aspect.
  • In a seventh aspect, a data processing device is provided for a first processor in a heterogeneous system, where the heterogeneous system further includes: a first accelerator connected to the first processor, and a first auxiliary memory connected to the first accelerator. The data processing device includes the various modules for executing the data processing method provided in the fourth aspect.
  • In an eighth aspect, a computer storage medium is provided, and a computer program is stored in the storage medium.
  • When the computer program runs on a computer device, the computer device executes any of the data processing methods provided in the second, third, or fourth aspect.
  • In a ninth aspect, a computer program product containing instructions is provided.
  • When the computer program product runs on a computer device, the computer device executes any of the data processing methods provided in the second, third, or fourth aspect.
  • FIG. 1 is a schematic structural diagram of a heterogeneous system provided by an embodiment of this application.
  • FIG. 2 is a schematic structural diagram of another heterogeneous system provided by an embodiment of the application.
  • FIG. 3 is a schematic structural diagram of another heterogeneous system provided by an embodiment of the application.
  • FIG. 4 is a flowchart of a data processing method provided by an embodiment of this application.
  • FIG. 5 is a flowchart of another data processing method provided by an embodiment of the application.
  • FIG. 6 is a schematic diagram of functional modules of a heterogeneous system provided by an embodiment of this application.
  • FIG. 7 is a block diagram of a data processing device provided by an embodiment of this application.
  • FIG. 8 is a block diagram of another data processing device provided by an embodiment of the application.
  • FIG. 9 is a block diagram of another data processing device provided by an embodiment of the application.
  • Heterogeneous systems can achieve higher-efficiency data processing, such as online prediction processing based on deep learning, video transcoding processing in live broadcast, image compression or decompression processing, and so on.
  • FIG. 1 is a schematic structural diagram of a heterogeneous system provided by an embodiment of the application.
  • the heterogeneous system usually includes: at least one processor and at least one accelerator.
  • FIG. 1 takes a heterogeneous system including one processor 011 and one accelerator 021 as an example, but the number of processors and accelerators in a heterogeneous system may also be greater than 1, which is not limited in this application.
  • At least one processor in a heterogeneous system may include a processor 011 and a processor 012, and the at least one accelerator includes an accelerator 021 and an accelerator 022.
  • at least one processor in a heterogeneous system may include: a processor 011, a processor 012, a processor 013, and a processor 014, and the at least one accelerator includes: an accelerator 021, an accelerator 022, an accelerator 023 and accelerator 024.
  • When a heterogeneous system includes multiple processors, the processors are connected to each other by a cache coherency bus (some processors are directly connected, and some are indirectly connected).
  • a heterogeneous system includes multiple accelerators, each accelerator is connected to each other by a cache coherent bus (some accelerators are directly connected, and some accelerators are indirectly connected).
  • Each processor is connected to at least one accelerator.
  • the processors and accelerators in the heterogeneous system both have data processing functions, and the accelerator is used to assist the processor to perform some data processing to strengthen the data processing capabilities of the heterogeneous system.
  • The processor may be any type of processor, such as an advanced RISC machine (ARM) architecture processor or an x86 architecture processor.
  • ARM and x86 are the names of two different processor architectures.
  • The protocols used by the two types of processors differ, as do their power consumption, performance, and cost.
  • The accelerator can be any device with data processing functions, such as a graphics processing unit (GPU), a field programmable gate array (FPGA), or an application-specific integrated circuit (ASIC).
  • each processor in the heterogeneous system can be connected to a main memory, and each accelerator can be connected to a secondary memory.
  • the heterogeneous system also includes: the main memory connected to each processor and the auxiliary memory connected to each accelerator.
  • the heterogeneous system also includes: a main memory 031 connected to the processor 011, and a secondary memory 041 connected to the accelerator 021.
  • the heterogeneous system also includes: the main memory 031 connected to the processor 011, the main memory 032 connected to the processor 012, the auxiliary memory 041 connected to the accelerator 021, and the auxiliary memory 042 connected to the accelerator 022.
  • In FIG. 3, the heterogeneous system also includes: main memory 031 connected to processor 011, main memory 032 connected to processor 012, main memory 033 connected to processor 013, main memory 034 connected to processor 014, and an auxiliary memory connected to each of accelerators 021 to 024.
  • Both the main memory and the auxiliary memory in a heterogeneous system can be any type of memory, for example, double data rate synchronous dynamic random access memory (DDR) or high bandwidth memory (HBM).
  • In this embodiment, the main memory being DDR and the auxiliary memory being HBM is taken as an example.
  • Since HBM can provide a higher-bandwidth storage function, it can improve the data processing efficiency of the heterogeneous system.
  • In addition, HBM is smaller in size and lower in operating power.
  • main memory in a heterogeneous system may be independent of the connected processor, and the auxiliary memory may be independent of the connected accelerator; alternatively, the main memory may also be integrated on the connected processor, and the auxiliary memory may be integrated on the connected accelerator.
  • main memory is independent of the connected processor, and the auxiliary memory is integrated on the connected accelerator as an example (the integration relationship is not shown in FIG. 1).
  • In the related art, the processor and the accelerator in the heterogeneous system are connected through the PCIE bus.
  • the storage space on the main memory connected to the processor and the storage space on the accelerator connected to the processor can both be visible to the processor, and the processor can read and write these storage spaces.
  • the storage space on the auxiliary memory connected to the accelerator is not visible to the processor, and the processor cannot read and write to the storage space. Therefore, in the related art, when the processor needs to control the accelerator to process data, it first needs to write the data into the main memory, and notify the accelerator to move the data to be processed in the main memory to the auxiliary memory through DMA. After that, the processor also needs to notify the accelerator to process the data in the auxiliary memory.
  • After the accelerator has processed the data, it writes the processing result into the auxiliary memory and notifies the processor that the data has been processed. Finally, the processor must notify the accelerator to move the processing result from the auxiliary memory to the main memory through DMA before the processor can read the result from the main memory. It can be seen that when the accelerator assists the processor in this way, the number of information interactions between the processor and the accelerator is large, which reduces data processing efficiency.
  • the processor and the accelerator in the heterogeneous system are connected through a cache coherency bus.
  • the cache coherency bus is a bus that uses the cache coherency protocol.
  • the storage space on the main memory, the storage space on the accelerator, and the storage space on the auxiliary memory in a heterogeneous system can all be visible to the processor. These storage spaces will be uniformly addressed in the processor, so that the processor can read and write these storage spaces based on the addresses of these storage spaces.
  • The cache coherency bus may be any bus that adopts a cache coherency protocol, such as a Cache Coherent Interconnect for Accelerators (CCIX) bus or a Compute Express Link (CXL) bus.
  • When the cache coherency bus is a CCIX bus, the processor may be the foregoing ARM architecture processor; when the cache coherency bus is a CXL bus, the processor may be the foregoing x86 architecture processor.
  • the embodiment of the present application does not limit the type of the cache coherency bus and the type of the processor.
  • the embodiment of the present application provides a data processing method for the heterogeneous system.
  • the data processing method not only enables the accelerator to assist the processor to perform data processing, but also reduces the number of interactions between the processor and the accelerator, thereby improving the efficiency of data processing.
  • the data processing method provided in the embodiment of the present application can be used in the heterogeneous system provided in the embodiment of the present application (the heterogeneous system shown in any one of FIG. 1 to FIG. 3).
  • The method involves a first processor, a first accelerator, and a first auxiliary memory in the heterogeneous system, where the first processor is any processor in the heterogeneous system, the first accelerator is an accelerator connected to the first processor, and the first auxiliary memory is the auxiliary memory connected to the first accelerator.
  • FIG. 4 is a flowchart of a data processing method provided by an embodiment of the application.
  • FIG. 4 takes the case where the first processor is processor 011 in FIG. 1, the first accelerator is accelerator 021 in FIG. 1, and the auxiliary memory connected to the first accelerator is auxiliary memory 041 in FIG. 1 as an example.
  • the data processing method may include:
  • S401: The processor 011 writes the data to be processed into the auxiliary memory 041 connected to the accelerator 021.
  • Since the processor 011 and the accelerator 021 are connected through a cache coherency bus, the storage spaces of the main memory, the accelerator, and the auxiliary memory in the heterogeneous system are all visible to the processor 011.
  • All processors in the heterogeneous system (for example, the basic input/output system (BIOS) in each processor) perform unified addressing of these storage spaces. In this way, each processor in the heterogeneous system holds the address of each storage unit (the smallest storage unit in a storage space) in these storage spaces, and can then directly read and write data at these addresses.
  • For example, assume that the storage space of the main memory 031 in FIG. 1 includes storage unit 1 and storage unit 2, the storage space on the accelerator 021 (for example, the storage space of some I/O (input/output) registers on the accelerator 021) includes storage unit 3 and storage unit 4, and the storage space of the auxiliary memory 041 includes storage unit 5 and storage unit 6.
  • After the processor 011 performs unified addressing on these storage spaces, it obtains the addresses A, B, C, D, E, and F of the storage units, as shown in Table 1.
  • Table 1: storage unit 1: address A; storage unit 2: address B; storage unit 3: address C; storage unit 4: address D; storage unit 5: address E; storage unit 6: address F.
  • The processor 011 may write the data to be processed into at least one storage unit of the auxiliary memory 041 according to the address of each storage unit on the auxiliary memory 041. It should be noted that the data to be processed may be data generated by the processor 011, data sent by another device outside the heterogeneous system, or data already stored in the main memory 031 connected to the processor 011 before S401; this is not limited in this embodiment of the application.
  • the processor 011 sends a processing instruction of the data to be processed to the accelerator 021.
  • The processing instruction is used to instruct the accelerator 021 to perform certain processing on the data to be processed. Therefore, the processing instruction may carry the storage address, in the auxiliary memory connected to the accelerator 021, of the data to be processed, as well as indication information of the processing to be performed.
  • the certain processing may be processing based on a machine learning algorithm, a deep learning algorithm, or a financial risk control algorithm, etc.
  • the embodiment of the present application does not limit the processing indicated by the processing instruction.
  • the accelerator 021 processes the to-be-processed data in the auxiliary memory 041 according to the processing instruction.
  • After the accelerator 021 receives the processing instruction, it can parse the processing instruction to determine the address of the data to be processed and the processing that the data needs to undergo. After that, the accelerator 021 can read the to-be-processed data from the connected auxiliary memory 041 and execute the processing indicated in the processing instruction on it, obtaining the processing result of the data to be processed.
  • the accelerator 021 writes the processing result of the data to be processed into the connected auxiliary memory 041.
  • the accelerator 021 sends a processing response of the to-be-processed data to the processor 011.
  • the above processing response is used to instruct the processor 011 to obtain the processing result of the data to be processed. Therefore, the processing response needs to carry the storage address of the processing result in the auxiliary memory connected to the accelerator 021.
  • the processor 011 reads the processing result from the auxiliary memory 041 connected to the accelerator 021 according to the processing response.
  • the processing response carries the storage address of the processing result in the auxiliary memory 041 connected to the accelerator 021. Therefore, the processor 011 can read the processing result according to the storage address.
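The six steps just described (write data, send instruction, process, write result, send response, read result) can be pictured end to end. The sketch below is an illustrative simulation under assumed class and field names, not the patented hardware; its point is that the data never passes through main memory:

```python
# Illustrative simulation of S401-S406: the processor writes directly into
# the accelerator's auxiliary memory, so no staging in main memory and no
# data-moving step is needed. All names here are hypothetical.

class AuxiliaryMemory:
    def __init__(self):
        self.cells = {}

class Accelerator:
    def __init__(self, aux_mem):
        self.aux_mem = aux_mem

    def handle(self, instruction):
        # S403: parse the instruction, read the data, run the processing.
        data = self.aux_mem.cells[instruction["data_addr"]]
        result = instruction["operation"](data)
        # S404: write the result back into the connected auxiliary memory.
        result_addr = instruction["data_addr"] + "_result"
        self.aux_mem.cells[result_addr] = result
        # S405: the processing response carries the result's storage address.
        return {"result_addr": result_addr}

aux_041 = AuxiliaryMemory()
accel_021 = Accelerator(aux_041)

# S401: processor 011 writes the data to be processed into aux memory 041.
aux_041.cells["unit5"] = [3, 1, 2]
# S402: processor 011 sends the processing instruction (address + operation).
response = accel_021.handle({"data_addr": "unit5", "operation": sorted})
# S406: processor 011 reads the result at the address in the response.
assert aux_041.cells[response["result_addr"]] == [1, 2, 3]
```

Note the two interactions only: one instruction from processor to accelerator, one response back.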
  • In summary, the first accelerator (such as the aforementioned accelerator 021) can assist the first processor (such as the aforementioned processor 011) in processing the data to be processed, so the data processing capability of the entire heterogeneous system is relatively high.
  • Moreover, the first processor (such as the above-mentioned processor 011) can directly write the data to be processed into the auxiliary memory (such as the above-mentioned auxiliary memory 041) connected to the first accelerator (such as the above-mentioned accelerator 021). This avoids the process in which the first processor notifies the first accelerator to move the data to be processed from the main memory connected to the first processor to the auxiliary memory, and also avoids the process in which the first accelerator moves that data.
  • the first accelerator can directly write the processing result into the auxiliary memory, and the first processor can obtain the processing result from the auxiliary memory. Therefore, a process in which the first accelerator notifies the first processor that the data to be processed has been processed, and the first processor notifies the first accelerator to move the processing result from the auxiliary memory to the main memory is avoided.
  • the number of interactions between the first processor and the first accelerator is relatively small, and the process of the data processing method is relatively simple, so that the efficiency of data processing is high.
  • A cache coherency bus with a higher transmission bandwidth may be used in the embodiments of the present application, for example, a cache coherency bus with a transfer rate of 25 gigatransfers per second (GT/s).
  • the embodiment shown in FIG. 4 takes the heterogeneous system shown in FIG. 1 as an example.
  • When the heterogeneous system includes multiple accelerators connected to each other using a cache coherency bus (such as the heterogeneous system shown in FIG. 2 or FIG. 3), the data processing method also involves an auxiliary accelerator in the heterogeneous system and the auxiliary memory connected to that auxiliary accelerator. The data processing method at this time may be as shown in FIG. 5.
  • the first processor is the processor 011 in FIG. 2;
  • the first accelerator is the accelerator 021 in FIG. 2;
  • the auxiliary memory connected to the first accelerator (which may be referred to as the first auxiliary memory) is the auxiliary memory 041 in FIG. 2;
  • the auxiliary accelerator is the accelerator 022 in FIG. 2;
  • the auxiliary memory connected to the auxiliary accelerator is the auxiliary memory 042 in FIG. 2, as an example.
  • the data processing method may include:
  • the processor 011 writes the data to be processed into the auxiliary memory 041 connected to the accelerator 021. Go to S502.
  • the processor 011 sends a processing instruction of the data to be processed to the accelerator 021, where the processing instruction carries an accelerator identifier and an identifier of the processor 011, and the accelerator identifier is an identifier of an accelerator used to execute the processing instruction in a heterogeneous system. Go to S503.
  • Since the heterogeneous system includes multiple accelerators, in order to associate the processing instruction with the accelerator used to execute it, the processing instruction needs to carry the identifier of that accelerator. Moreover, when the heterogeneous system includes multiple processors, in order to associate the processing instruction with the processor that issued it, the processing instruction needs to carry the identifier of the processor 011 that issued the processing instruction.
  • the accelerator 021 detects whether the accelerator identifier in the processing instruction is the identifier of the accelerator 021. If the accelerator identifier in the processing instruction is the identifier of accelerator 021, execute S504; if the accelerator identifier in the processing instruction is not the identifier of accelerator 021, execute S508.
  • After receiving the processing instruction sent by the processor 011, the accelerator 021 needs to detect whether the accelerator identifier carried in the processing instruction is the same as its own identifier, so as to determine whether it is the accelerator designated by the processor 011 to execute the processing instruction.
  • If the accelerator identifier is its own identifier, the accelerator 021 may determine that it is the accelerator designated by the processor 011 to execute the processing instruction. At this time, the accelerator 021 can execute S504 to perform the corresponding data processing according to the processing instruction.
  • If the accelerator identifier is not its own identifier, the accelerator 021 may determine that it is not the designated accelerator. At this time, the accelerator 021 may execute S508 to trigger the accelerator 022 designated by the processor 011 to perform the corresponding data processing according to the processing instruction.
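The identifier check in S503 amounts to simple routing logic: execute locally on a match, otherwise copy the data into the designated accelerator's auxiliary memory and forward the instruction. A hedged sketch with invented class and field names:

```python
# Sketch of S503/S508-S510 routing. An accelerator executes the instruction
# itself only if the carried accelerator identifier matches its own;
# otherwise it writes the data into the designated peer's auxiliary memory
# and forwards the instruction. All names are illustrative assumptions.

class RoutingAccelerator:
    def __init__(self, ident, peers=None):
        self.ident = ident
        self.aux_mem = {}          # the connected auxiliary memory
        self.peers = peers or {}   # identifier -> directly connected accelerator
        self.executed = []

    def receive(self, instruction, data):
        if instruction["accel_id"] == self.ident:
            # S504: this accelerator was designated; process locally.
            self.aux_mem["data"] = data
            self.executed.append(instruction)
        else:
            # S508/S509: write data into the peer's auxiliary memory and
            # forward the instruction; the instruction keeps the processor
            # identifier so the peer can respond to it directly (S512).
            peer = self.peers[instruction["accel_id"]]
            peer.receive(instruction, data)

accel_022 = RoutingAccelerator("022")
accel_021 = RoutingAccelerator("021", peers={"022": accel_022})

accel_021.receive({"accel_id": "022", "proc_id": "011"}, b"payload")
assert accel_022.executed and not accel_021.executed
```

Because the forwarded instruction still carries the identifier of processor 011, the accelerator 022 can later send its processing response to that processor without involving accelerator 021 again.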
  • the accelerator 021 processes the to-be-processed data in the auxiliary memory 041 according to the processing instruction. Go to S505.
  • the accelerator 021 writes the processing result of the data to be processed into the connected auxiliary memory 041. Go to S506.
  • the accelerator 021 sends a processing response of the to-be-processed data to the processor 011. Go to S507.
  • the processor 011 reads the processing result from the auxiliary memory 041 connected to the accelerator 021 according to the processing response sent by the accelerator 021.
  • the accelerator 021 writes the data to be processed into the auxiliary memory 042 connected to the accelerator 022 indicated by the accelerator identifier.
  • the accelerator 021 forwards the processing instruction of the data to be processed to the accelerator 022. Go to S510.
  • Because the accelerator 021 is connected to the accelerator 022, the accelerator 021 can, based on this connection, write the data to be processed into the auxiliary memory 042 connected to the accelerator 022 and send the processing instruction to the accelerator 022.
  • the accelerator 022 processes the data to be processed in the connected auxiliary storage 042 according to the processing instruction. Go to S511.
  • the accelerator 022 writes the processing result of the data to be processed into the connected auxiliary memory 042. Perform S512.
  • the process of writing the processing result in S511 can refer to the process of writing the processing result in S404, which is not described in detail in the embodiment of the present application.
  • the accelerator 022 sends a processing response of the data to be processed to the processor 011 according to the identifier of the processor 011 carried in the processing instruction. Go to S513.
  • Since the processing instruction carries the identifier of the processor 011 that issued it, after the accelerator 022 has finished executing the processing instruction forwarded by the accelerator 021, it can send a processing response to the processor 011 according to that identifier, to instruct the processor 011 to obtain the processing result of the data to be processed. At this time, the processing response needs to carry the storage address of the processing result in the auxiliary memory 042 connected to the accelerator 022.
  • the processor 011 reads the processing result from the auxiliary memory 042 connected to the accelerator 022 according to the processing response sent by the accelerator 022.
  • In summary, the first accelerator (such as the above accelerator 021) or the auxiliary accelerator (such as the above accelerator 022) can assist the first processor (such as the above processor 011) in processing the data to be processed, so the data processing capability of the entire heterogeneous system is relatively high.
  • the first processor can directly write the data to be processed into the auxiliary memory connected to the first accelerator (such as the auxiliary memory 041 described above). Therefore, the process in which the first processor notifies the first accelerator to move the data to be processed from the main memory connected to the first processor to the auxiliary memory is avoided, and the process in which the data to be processed is moved by the first accelerator is also avoided.
  • the first accelerator or the auxiliary accelerator can directly write the processing result into the auxiliary memory, and the first processor can obtain the processing result from the auxiliary memory. Therefore, it is avoided that the first accelerator or the auxiliary accelerator notifies the first processor that the data to be processed has been processed, and the first processor notifies the first accelerator or the auxiliary accelerator to move the processing result from the auxiliary memory to the main memory.
  • the number of interactions between the first processor and the first accelerator or auxiliary accelerator is relatively small, and the process of the data processing method is relatively simple, so that the efficiency of data processing is high.
  • A cache coherency bus with a higher transmission bandwidth may be used in the embodiments of the present application, for example, a cache coherency bus with a transfer rate of 25 gigatransfers per second (GT/s).
  • the processor in the heterogeneous system can control multiple accelerators to perform data processing in parallel based on the data processing method shown in FIG. 5, thereby further improving the entire heterogeneous system Data processing efficiency.
  • As described above, the first processor can trigger the first accelerator to process the to-be-processed data in the connected auxiliary memory, and the first accelerator can trigger the first processor to read the processing result from the auxiliary memory connected to the first accelerator.
  • The above embodiments take as an example that the first processor triggers the first accelerator to perform data processing by sending it a processing instruction, and that the first accelerator triggers the first processor to read the processing result by sending it a processing response. In some embodiments, however, the first processor may trigger the first accelerator to perform data processing without sending a processing instruction, and the first accelerator may trigger the first processor to read the processing result without sending a processing response.
  • the storage space on the auxiliary memory may include three types of storage units, namely: a data storage unit for storing data, an instruction storage unit for storing processing instructions, and a result storage unit for storing processing results.
  • the I/O register in the first accelerator may have a corresponding relationship with the data storage unit, the instruction storage unit, and the result storage unit on the auxiliary memory connected to the first accelerator. Both the first processor and the first accelerator can obtain the corresponding relationship, and execute the above-mentioned data processing method based on the corresponding relationship.
  • When the first processor writes the data to be processed into a certain data storage unit in the auxiliary memory connected to the first accelerator, it can, according to this correspondence, write the processing instruction into the instruction storage unit corresponding to that data storage unit, and modify the state value of the I/O register corresponding to that data storage unit.
  • the I/O register may have multiple state values, and the multiple state values may include: a first state value and a second state value.
  • Initially, the state value of the I/O register is the first state value; after the first processor changes the state value of a certain I/O register in the first accelerator, the state value of that I/O register becomes the second state value.
  • When the first accelerator detects that the state value of a certain I/O register has changed to the second state value, it can, according to the above-mentioned correspondence, obtain the processing instruction from the instruction storage unit corresponding to that I/O register and read the data to be processed from the corresponding data storage unit. After that, the first accelerator can process the data to be processed according to the processing instruction to obtain the processing result.
  • the first accelerator can modify the state value of the I/O register according to the corresponding relationship, and write the processing result of the data to be processed into the result storage unit corresponding to the I/O register.
  • the multiple state values of the I/O register may also include: a third state value.
  • the first accelerator may change the state value of the I/O register to the third state value.
  • The first processor may detect whether any I/O register in the first accelerator has the third state value. When the first accelerator changes the state value of a certain I/O register to the third state value, the first processor can, according to the above-mentioned correspondence, read the processing result of the data to be processed from the result storage unit corresponding to that I/O register.
  • For example, initially the state value of each I/O register is the first state value 0, as shown in Table 2. If the processor 011 writes the data to be processed into the data storage unit 1.1 and the processing instruction into the instruction storage unit 2.1, the processor 011 can also change the state value of the I/O register 3.1 from the first state value 0 to the second state value 1, as shown in Table 3.
  • The accelerator 021 can then detect that the state value of the I/O register 3.1 has changed to the second state value 1, obtain the data to be processed from the data storage unit 1.1 corresponding to the I/O register 3.1, obtain the processing instruction from the corresponding instruction storage unit 2.1, and process the data according to the processing instruction to obtain the processing result.
  • After that, the accelerator 021 can write the processing result into the result storage unit 4.1 corresponding to the data storage unit 1.1, and change the state value of the I/O register 3.1 corresponding to the data storage unit 1.1 to the third state value 2, as shown in Table 4.
  • When the processor 011 detects that the state value of the I/O register 3.1 is the third state value 2, it can obtain the processing result of the data to be processed from the result storage unit 4.1 corresponding to the I/O register 3.1.
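The three state values behave like a doorbell register driving a small state machine: 0 means the slot is free, 1 means the processor has posted work, 2 means the result is ready. A minimal sketch under those assumptions, using the unit and register names from the example (the dictionaries standing in for storage units are illustrative):

```python
# Doorbell-style sketch of the I/O-register handshake described above.
# State values: 0 (first/free), 1 (second/work posted), 2 (third/result ready).

FREE, POSTED, DONE = 0, 1, 2

registers = {"3.1": FREE}
data_units, instr_units, result_units = {}, {}, {}

# Processor side: write data + instruction, then ring the doorbell (0 -> 1).
data_units["1.1"] = [4, 2]
instr_units["2.1"] = sum
assert registers["3.1"] == FREE
registers["3.1"] = POSTED

# Accelerator side: sees state 1, runs the instruction on the data, writes
# the result into the corresponding result unit, then sets state 2.
if registers["3.1"] == POSTED:
    result_units["4.1"] = instr_units["2.1"](data_units["1.1"])
    registers["3.1"] = DONE

# Processor side: sees state 2, reads the result; the register can then be
# returned to 0 so the slot is reusable for the next request.
assert registers["3.1"] == DONE
assert result_units["4.1"] == 6
registers["3.1"] = FREE
```

No instruction message or response message crosses the bus here; the state value of the I/O register alone carries the "work posted" and "result ready" signals.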
  • It can be seen that the first processor in the embodiment of the present application can also trigger the first accelerator to perform data processing by changing the state value of an I/O register, and the first accelerator can likewise trigger the first processor to read the processing result by changing the state value of the I/O register.
  • In addition, the first accelerator may change the state value of the I/O register corresponding to the result storage unit where the processing result is located back to the first state value. This makes it convenient for the first processor to again trigger the first accelerator to perform data processing by changing the state value of the I/O register the next time, and for the first accelerator to again trigger the first processor to read the processing result in the same manner.
  • the first accelerator can trigger the auxiliary accelerator to process the data to be processed in the connected auxiliary storage according to the processing instruction.
  • the auxiliary accelerator can trigger the first processor to read the processing result from the auxiliary memory connected to the auxiliary accelerator.
  • The above embodiments take as an example that the first accelerator triggers the auxiliary accelerator to perform data processing by sending it the processing instruction, and that the auxiliary accelerator triggers the first processor to read the processing result by sending it a processing response. In some embodiments, however, the first accelerator may trigger the auxiliary accelerator to perform data processing without sending a processing instruction, and the auxiliary accelerator may trigger the first processor to read the processing result without sending a processing response.
  • For example, in S509 and S510, the manner in which the first accelerator triggers the auxiliary accelerator to perform data processing may refer to the process in which the first processor triggers the first accelerator to perform data processing by changing the state value of an I/O register, and the manner in which the auxiliary accelerator triggers the first processor to read the processing result may refer to the process in which the first accelerator triggers the first processor to read the processing result by changing the state value of an I/O register.
  • FIG. 6 is a schematic diagram of the functional modules of a heterogeneous system provided by an embodiment of the application; FIG. 6 takes one connected group of processor, accelerator, and auxiliary memory in a heterogeneous system as an example.
  • the functional modules in each group structure can refer to FIG. 6.
  • the processor may include: an application adaptation layer, an accelerated application programming interface (Application Programming Interface, API), inter-process shared memory, and cache coherent memory.
  • the accelerator may include: a cache coherency module and a processing module. Among them, the cache coherent memory in the processor is connected to the cache coherency module in the accelerator through a cache coherency bus. Both the cache coherency module and the processing module are connected to the auxiliary memory.
  • the application software running in the processor can call the acceleration API by calling the application adaptation layer.
  • Acceleration API is used to realize data conversion and control between application software and accelerator.
  • Inter-process shared memory is used for communication between multiple processes running in the processor.
  • Cache coherent memory is used to implement the communication between the processor and the accelerator.
  • the processing module is used to perform the processing operations performed by the accelerator in the foregoing data processing method, and the processing module can also trigger the cache coherency module to perform the read and write operations performed by the accelerator in the foregoing data processing method.
  • The reading and writing of data in the auxiliary memory by the processor and by the processing module in the accelerator both need to be implemented through the above-mentioned cache coherency module. When the processor or the processing module in the accelerator needs to read or write data in the auxiliary memory, it can send a read/write request to the cache coherency module.
  • For each read/write request received, the cache coherency module may generate a request agent (RA) (not shown in FIG. 6), and the RA performs the corresponding read/write operation.
  • Each time an RA reads data from the auxiliary memory, it caches a copy of the data, so that the next time the data can be obtained by reading the local copy without reading it from the auxiliary memory again.
  • The cache coherency module also includes a home agent (HA) (not shown in FIG. 6), which is used to manage all the RAs in the cache coherency module so as to maintain cache coherence.
  • When an RA needs to read or write data in the auxiliary memory, it must first send the corresponding read/write request to the HA.
  • For an RA that is used to read data (such as a processing instruction, data to be processed, or a processing result), after receiving the read request sent by that RA, the HA grants it permission to read the data in the auxiliary memory, and the RA can then read the data.
  • For an RA that is used to write data (such as a processing instruction, data to be processed, or a processing result) to an address in the auxiliary memory, after the HA receives the write request sent by that RA, a consistency check is required to ensure that this RA has exclusive authority over the address. For example, the HA can detect whether other RAs currently hold cached copies of the data at this address; if they do and an RA writes new data to the address, those cached copies would become inconsistent with the actual data at the address.
  • Therefore, during the consistency check the HA invalidates these copies, and then grants the writing RA the right to write at the address, after which the RA can write the data. In this way, the data at the address read by each RA is guaranteed to be consistent. It should be noted that after an RA's cached copy is invalidated, if that RA needs to read the data again, it will re-initiate a read request to the HA because the copy is invalid.
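The HA's consistency check is essentially a write-invalidate protocol: before a write is granted, every cached copy of that address held by other RAs is invalidated, so a later read must refetch from the HA. The sketch below is a simplified illustration with invented names; real CCIX/CXL coherence protocols are considerably more involved:

```python
# Write-invalidate sketch of the HA/RA behavior described above.
# One home agent (HA) manages the auxiliary memory; each request agent (RA)
# may hold a cached copy of data it has read.

class HomeAgent:
    def __init__(self):
        self.memory = {}
        self.sharers = {}          # address -> set of RAs holding a copy

    def read(self, ra, addr):
        self.sharers.setdefault(addr, set()).add(ra)
        return self.memory.get(addr)

    def write(self, ra, addr, value):
        # Consistency check: invalidate every other RA's cached copy first,
        # giving the writing RA exclusive authority over the address.
        for other in self.sharers.get(addr, set()):
            if other is not ra:
                other.cache.pop(addr, None)
        self.sharers[addr] = {ra}
        self.memory[addr] = value

class RequestAgent:
    def __init__(self, ha):
        self.ha, self.cache = ha, {}

    def read(self, addr):
        if addr not in self.cache:          # copy absent or invalidated
            self.cache[addr] = self.ha.read(self, addr)
        return self.cache[addr]

    def write(self, addr, value):
        self.ha.write(self, addr, value)
        self.cache[addr] = value

ha = HomeAgent()
ra1, ra2 = RequestAgent(ha), RequestAgent(ha)
ra1.write("E", "old")
assert ra2.read("E") == "old"   # ra2 now holds a cached copy
ra1.write("E", "new")           # HA invalidates ra2's copy before the write
assert ra2.read("E") == "new"   # ra2's copy is gone, so it refetches
```

The key property matches the text: after the invalidation, no RA can read a stale copy, because the stale copy no longer exists in its cache.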
  • When the processor and the accelerator communicate by reading and writing the state values of the I/O registers in the accelerator, that read/write process can also be implemented by the above-mentioned cache coherency module, so as to ensure cache coherence of the state values of the I/O registers. This process is similar to the read/write process for the auxiliary memory implemented by the cache coherency module, and is not described in detail in this embodiment of the application.
  • In summary, the heterogeneous system provided by the embodiment of this application includes: a first processor and a first accelerator connected to each other, and a first auxiliary memory connected to the first accelerator.
  • the first processor is used to write the data to be processed into the first auxiliary memory; the first processor is also used to trigger the first accelerator to process the data to be processed in the first auxiliary memory according to the processing instruction; the first accelerator is used To write the processing result of the data to be processed into the first auxiliary memory; the first accelerator is used to trigger the first processor to read the processing result from the first auxiliary memory.
  • the first processor and the first accelerator are connected through a cache coherency bus.
  • the cache coherency bus includes: a CCIX bus or a CXL bus.
  • the cache coherency bus includes: a CCIX bus, and the first processor includes: an ARM architecture processor; or, the cache coherency bus includes: a CXL bus, and the first processor includes: an x86 architecture processor.
  • the auxiliary memory includes: HBM.
  • the accelerator includes: GPU, FPGA or ASIC.
  • Optionally, the heterogeneous system includes multiple accelerators connected to each other, and the first accelerator is any one of the multiple accelerators. The processing instruction carries an accelerator identifier, which is the identifier of the accelerator used to execute the processing instruction among the multiple accelerators. The first accelerator is used to process the data to be processed in the first auxiliary memory according to the processing instruction when the accelerator identifier is the identifier of the first accelerator.
  • the heterogeneous system includes: multiple auxiliary memories connected to multiple accelerators in a one-to-one correspondence, and multiple processors connected to each other, and the first processor is any one of the multiple processors connected to the first accelerator.
  • the processing instruction also carries the identifier of the first processor;
  • the first accelerator is used to: when the accelerator identifier is not the identifier of the first accelerator, write the data to be processed into the auxiliary memory connected to the auxiliary accelerator indicated by the accelerator identifier, and trigger the auxiliary accelerator to process the data to be processed according to the processing instruction;
  • the auxiliary accelerator is used to: after processing the data to be processed according to the processing instruction, write the processing result of the data to be processed into the connected auxiliary memory, and, according to the identifier of the first processor, trigger the first processor to read the processing result from the auxiliary memory connected to the auxiliary accelerator.
  • multiple accelerators are connected through a cache coherency bus, and multiple processors are connected through a cache coherency bus.
  • the first accelerator can assist the first processor to process the data to be processed, so the data processing capability of the entire heterogeneous system is relatively high.
  • the first processor can directly write the data to be processed into the auxiliary memory connected to the first accelerator. Therefore, the process in which the first processor notifies the first accelerator to move the data to be processed from the main memory connected to the first processor to the auxiliary memory is avoided, and the process in which the data to be processed is moved by the first accelerator is also avoided.
  • the first accelerator can directly write the processing result into the auxiliary memory, and the first processor can obtain the processing result from the auxiliary memory. Therefore, a process in which the first accelerator notifies the first processor that the data to be processed has been processed, and the first processor notifies the first accelerator to move the processing result from the auxiliary memory to the main memory is avoided.
  • the number of interactions between the first processor and the first accelerator is relatively small, and the process of the data processing method is relatively simple, so that the efficiency of data processing is high.
  • FIG. 7 is a block diagram of a data processing device provided by an embodiment of the application, and the data processing device may be the first accelerator in the data processing system provided by the embodiment of the application.
  • the data processing device includes:
  • the processing module 701 is configured to process the data to be processed in the first auxiliary memory according to the processing instruction under the trigger of the first processor; for the operation performed by the processing module 701, refer to S403 or S504 above (or the description related to S403 or S504), which is not repeated here in the embodiments of the present application.
  • the writing module 702 is used to write the processing result of the data to be processed into the first auxiliary memory; for the operation performed by the writing module 702, refer to S404 or S505 above (or the description related to S404 or S505), which is not repeated here in the embodiments of the present application.
  • the trigger module 703 is configured to trigger the first processor to read the processing result from the first auxiliary memory.
  • operations performed by the trigger module 703 refer to the foregoing S405 or S506 (or the description related to S405 or S506), which is not described in detail in the embodiment of the present application.
  • the above-mentioned data processing apparatus is further configured to perform other operations in the data processing method shown in FIG. 5.
  • the processing module 701 is also used to execute S503 in FIG. 5
  • the writing module 702 is also used to execute S508 in FIG. 5
  • the trigger module 703 is also used to execute S509 in FIG. 5. For the specific process in which each module in the data processing device performs each step, please refer to the descriptions of FIG. 4 and FIG. 5 above, which are not repeated here.
  • FIG. 8 is a block diagram of another data processing device provided in an embodiment of this application.
  • the data processing device may be an auxiliary accelerator in the data processing system provided in this embodiment of the application.
  • The heterogeneous system includes: multiple processors connected to each other, multiple accelerators connected to each other, and multiple auxiliary memories connected to the multiple accelerators in a one-to-one correspondence; the auxiliary accelerator and the first accelerator are any two of the multiple accelerators connected to each other.
  • the data reading and writing device includes:
  • the processing module 801 is configured to, under the trigger of the first accelerator, process the to-be-processed data in the auxiliary memory connected to the auxiliary accelerator according to the processing instruction, where the processing instruction carries the identifier of the first processor connected to the first accelerator.
  • for operations performed by the processing module 801, reference may be made to the foregoing S510 (or the description related to S510), which is not described in detail in the embodiment of the present application.
  • the writing module 802 is used to write the processing result of the data to be processed into the connected auxiliary memory; for operations performed by the writing module 802, refer to the foregoing S511 (or the description related to S511), which is not repeated here.
  • the trigger module 803 is configured to trigger the first processor to read the processing result from the auxiliary memory connected to the auxiliary accelerator according to the identifier of the first processor carried in the processing instruction.
  • for operations performed by the trigger module 803, refer to the foregoing S512 (or the description related to S512), which is not described in detail in the embodiment of the present application.
  • FIG. 9 is a block diagram of another data processing device provided by an embodiment of the application, and the data processing device may be the first processor in the data processing system provided by the embodiment of the application.
  • the heterogeneous system further includes: a first accelerator connected to the first processor, and a first auxiliary memory connected to the first accelerator.
  • the data reading and writing device includes:
  • the writing module 901 is used to write the to-be-processed data into the first auxiliary memory; for operations performed by the writing module 901, refer to the foregoing S401 or S501 (or the description related to S401 or S501), which is not repeated here in the embodiment of the present application.
  • the trigger module 902 is used to trigger the first accelerator to process the data to be processed in the first auxiliary memory according to the processing instruction; for operations performed by the trigger module 902, refer to the foregoing S402 or S502 (or the description related to S402 or S502), which is not repeated here in the embodiments of this application.
  • the reading module 903 is configured to read the processing result of the data to be processed from the first auxiliary memory under the trigger of the first accelerator.
  • for operations performed by the reading module 903, reference may be made to the foregoing S406 or S507 (or the description related to S406 or S507), which is not described in detail in the embodiment of the present application.
  • the above-mentioned data processing apparatus is further configured to perform other operations in the data processing method shown in FIG. 5.
  • the reading module 903 is also used to execute S513 in FIG. 5. Please refer to the descriptions of FIG. 4 and FIG. 5 above for the specific flow in which each module in the data processing device performs each step; details are not repeated here.
  • An embodiment of the present application provides a computer storage medium in which a computer program is stored, and the computer program is used to execute any data processing method provided in the present application.
  • the embodiments of the present application provide a computer program product containing instructions; when the computer program product runs on a computer device, the computer device executes any data processing method provided in the embodiments of the present application.
  • the foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • when implemented in software, they may be implemented in whole or in part in the form of a computer program product, and the computer program product includes one or more computer instructions.
  • when the computer program instructions are loaded and executed on the computer, the processes or functions described in the embodiments of the present application are generated in whole or in part.
  • the computer may be a general-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center through wired (such as coaxial cable, optical fiber, or digital subscriber line) or wireless (such as infrared, radio, or microwave) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium, or a semiconductor medium (for example, a solid-state drive).
  • the terms "first" and "second" are only used for descriptive purposes, and cannot be understood as indicating or implying relative importance.
  • the term “at least one” refers to one or more, and “multiple” refers to two or more, unless expressly defined otherwise.
  • the disclosed device and the like can be implemented in other structural manners.
  • the device embodiments described above are only illustrative.
  • the division of modules is only a division by logical function; in actual implementation there may be other division methods, for example, multiple modules may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
  • the modules described as separate components may or may not be physically separate, and the components described as modules may or may not be physical units, and may be located in one place or distributed on multiple devices. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.


Abstract

This application discloses a data processing method and apparatus, and a heterogeneous system, and belongs to the field of computer technology. The heterogeneous system includes: a first processor and a first accelerator that are connected to each other, and a first auxiliary memory connected to the first accelerator. The first processor is configured to write to-be-processed data into the first auxiliary memory, and to trigger the first accelerator to process the to-be-processed data in the first auxiliary memory according to a processing instruction; the first accelerator is configured to write the processing result of the to-be-processed data into the first auxiliary memory, and to trigger the first processor to read the processing result from the first auxiliary memory. In the embodiments of this application, the first processor and the first accelerator interact only a few times, so the data processing procedure is simple and data processing efficiency is high.

Description

Data processing method and apparatus, and heterogeneous system — Technical Field
This application relates to the field of computer technology, and in particular to a data processing method and apparatus, and a heterogeneous system.
Background
A heterogeneous system typically includes a processor and an accelerator connected over a Peripheral Component Interconnect Express (PCIe) bus. The accelerator can assist the processor in performing certain data processing flows, giving the heterogeneous system strong data processing capability.
Typically, the processor is connected to a main memory and the accelerator to an auxiliary memory. When the processor needs the accelerator to process data, it first notifies the accelerator to move the to-be-processed data from the main memory to the auxiliary memory by direct memory access (DMA). The processor then notifies the accelerator to process the data in the auxiliary memory. After processing the data, the accelerator writes the processing result into the auxiliary memory and notifies the processor that processing is complete. Finally, the processor notifies the accelerator to move the result from the auxiliary memory back to the main memory by DMA, so that the processor can obtain the result from the main memory.
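The conventional PCIe/DMA round trip described above can be counted step by step. The following is a hypothetical Python simulation, not real driver code; the class and method names are illustrative only. It shows the three explicit notifications (DMA-in, process, DMA-out) that the processor must send per request:

```python
# Toy model of the conventional PCIe + DMA flow (illustrative names only;
# real systems use hardware DMA engines, not Python dicts).
class PcieAccelerator:
    def __init__(self):
        self.aux_mem = {}
        self.messages = 0  # notifications received from the processor

    def dma_in(self, host_mem, key):
        self.messages += 1
        self.aux_mem[key] = host_mem[key]               # move data into aux memory

    def process(self, key):
        self.messages += 1
        self.aux_mem["result"] = self.aux_mem[key] * 2  # placeholder computation

    def dma_out(self, host_mem):
        self.messages += 1
        host_mem["result"] = self.aux_mem["result"]     # move result back out

host = {"data": 21}
acc = PcieAccelerator()
acc.dma_in(host, "data")   # notification 1: DMA data into aux memory
acc.process("data")        # notification 2: process the data
acc.dma_out(host)          # notification 3: DMA result back to main memory
```

Each request thus costs three notifications (plus a completion notice from the accelerator), which is the interaction overhead the cache-coherent design below removes.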
As can be seen, while the accelerator assists the processor with data processing, the processor and the accelerator exchange a large number of messages, which reduces data processing efficiency.
Summary
This application provides a data processing method and apparatus, and a heterogeneous system, which can solve the problem of low data processing efficiency. The technical solutions are as follows:
According to a first aspect, a heterogeneous system is provided. The heterogeneous system includes: a first processor and a first accelerator that are connected to each other, and a first auxiliary memory connected to the first accelerator. The first processor is configured to write to-be-processed data into the first auxiliary memory, and to trigger the first accelerator to process the to-be-processed data in the first auxiliary memory according to a processing instruction; the first accelerator is configured to write the processing result of the to-be-processed data into the first auxiliary memory, and to trigger the first processor to read the processing result from the first auxiliary memory.
It can be seen that in the data processing method provided in the embodiments of this application, the first accelerator can assist the first processor in processing the to-be-processed data, so the data processing capability of the whole heterogeneous system is high. Moreover, in this method the first processor can write the to-be-processed data directly into the auxiliary memory connected to the first accelerator. This avoids both the step in which the first processor notifies the first accelerator to move the data from the main memory connected to the first processor to the auxiliary memory, and the step in which the first accelerator performs that move. In addition, the first accelerator can write the processing result directly into the auxiliary memory, and the first processor can obtain the result from that auxiliary memory. This avoids the steps in which the first accelerator notifies the first processor that processing is complete and the first processor notifies the first accelerator to move the result from the auxiliary memory to the main memory. Therefore, in the embodiments of this application the first processor and the first accelerator interact fewer times, the data processing procedure is simpler, and data processing efficiency is higher.
Optionally, the first processor and the first accelerator are connected over a cache coherence bus, that is, a bus that uses a cache coherence protocol. When the processor and the accelerator are connected over a cache coherence bus, the storage space of the main memory, of the accelerator, and of the auxiliary memory in the heterogeneous system is all visible to the processor. These storage spaces are uniformly addressed by the processor, so the processor can read and write them based on their addresses.
Optionally, the cache coherence bus includes a CCIX bus or a CXL bus. Optionally, when the cache coherence bus is a CCIX bus, the first processor includes an ARM-architecture processor; or, when the cache coherence bus is a CXL bus, the first processor includes an x86-architecture processor.
Optionally, the auxiliary memory includes an HBM. Because HBM provides high-bandwidth storage, it improves the data processing efficiency of the heterogeneous system; HBM is also small and consumes little power.
Optionally, the accelerator includes a GPU, an FPGA, or an ASIC. The accelerator may also be any other device with data processing capability, which is not limited in this application.
Optionally, the heterogeneous system includes multiple interconnected accelerators, and the first accelerator is any one of them. The processing instruction carries an accelerator identifier, namely the identifier of the accelerator among the multiple accelerators that is to execute the processing instruction. The first accelerator is configured to process the to-be-processed data in the first auxiliary memory according to the processing instruction when the accelerator identifier is the identifier of the first accelerator. It can be seen that when the heterogeneous system includes multiple accelerators, the first accelerator can determine from the accelerator identifier whether it is the accelerator designated by the processor to process the data.
Optionally, the heterogeneous system includes multiple auxiliary memories connected to the multiple accelerators in one-to-one correspondence, and multiple interconnected processors, where the first processor is any processor among them that is connected to the first accelerator. The processing instruction further carries the identifier of the first processor. The first accelerator is configured to, when the accelerator identifier is not its own identifier, write the to-be-processed data into the auxiliary memory connected to the auxiliary accelerator indicated by the accelerator identifier, and trigger the auxiliary accelerator to process the data according to the processing instruction. The auxiliary accelerator is configured to: after processing the data according to the processing instruction, write the processing result into its connected auxiliary memory, and, according to the identifier of the first processor carried in the processing instruction, trigger the first processor to read the result from the auxiliary memory connected to the auxiliary accelerator. Thus, when the first processor sends a processing instruction to the first accelerator by mistake, the first accelerator can forward the instruction to the auxiliary accelerator so that the auxiliary accelerator performs the processing, avoiding the adverse consequences of the misdirected instruction.
Optionally, the multiple accelerators are connected over a cache coherence bus, and the multiple processors are connected over a cache coherence bus.
Optionally, in this heterogeneous system the processor may trigger the accelerator to process the to-be-processed data in the auxiliary memory in various ways, and the accelerator may trigger the processor to read the processing result from its connected auxiliary memory in various ways. For example, the processor may trigger the accelerator by sending it the processing instruction, and the accelerator may trigger the processor by sending it a processing response. Alternatively, the processor may trigger the accelerator by changing the state value of a register, and the accelerator may likewise trigger the processor by changing the state value of a register.
According to a second aspect, a data processing method is provided for a first accelerator in a heterogeneous system, where the heterogeneous system further includes a first processor and a first auxiliary memory connected to the first accelerator. The method includes: under the trigger of the first processor, processing the to-be-processed data in the first auxiliary memory according to a processing instruction; then writing the processing result of the to-be-processed data into the first auxiliary memory; and triggering the first processor to read the processing result from the first auxiliary memory.
Optionally, the heterogeneous system includes multiple interconnected accelerators, and the first accelerator is any one of them. The processing instruction carries an accelerator identifier, namely the identifier of the accelerator in the heterogeneous system that is to execute the instruction; the first accelerator may process the to-be-processed data in the first auxiliary memory according to the processing instruction when the accelerator identifier is the identifier of the first accelerator.
Optionally, the heterogeneous system includes multiple auxiliary memories connected to the multiple accelerators in one-to-one correspondence, and multiple interconnected processors, where the first processor is any processor among them connected to the first accelerator. The processing instruction further carries the identifier of the first processor, and the method further includes: when the accelerator identifier is not the identifier of the first accelerator, writing the to-be-processed data into the auxiliary memory connected to the auxiliary accelerator indicated by the accelerator identifier, and triggering the auxiliary accelerator to process the data according to the processing instruction.
Optionally, the first accelerator may trigger the first processor to read the processing result from the auxiliary memory connected to the first accelerator in various ways, for example by sending the first processor a processing response, or by changing the state value of a register.
According to a third aspect, a data processing method is provided for an auxiliary accelerator in a heterogeneous system, where the heterogeneous system includes multiple interconnected processors, multiple interconnected accelerators, and multiple auxiliary memories connected to the accelerators one to one; the auxiliary accelerator and a first accelerator are any two connected accelerators among them. The method includes: under the trigger of the first accelerator, processing the to-be-processed data in the auxiliary memory connected to the auxiliary accelerator according to a processing instruction, where the instruction carries the identifier of the first processor connected to the first accelerator, and writing the processing result into the connected auxiliary memory; then, according to the identifier of the first processor carried in the instruction, triggering the first processor to read the result from the auxiliary memory connected to the auxiliary accelerator.
Optionally, the auxiliary accelerator may trigger the first processor to read the processing result in various ways, for example by sending the first processor a processing response, or by changing the state value of a register.
According to a fourth aspect, a data processing method is provided for a first processor in a heterogeneous system, where the heterogeneous system further includes a first accelerator connected to the first processor and a first auxiliary memory connected to the first accelerator. The method includes: writing to-be-processed data into the first auxiliary memory, and triggering the first accelerator to process the data in the first auxiliary memory according to a processing instruction; then, under the trigger of the first accelerator, reading the processing result of the to-be-processed data from the first auxiliary memory.
Optionally, the heterogeneous system includes multiple interconnected processors, multiple interconnected accelerators, and multiple auxiliary memories connected to the accelerators one to one. The processing instruction carries an accelerator identifier and the identifier of the first processor, the accelerator identifier being that of the accelerator in the heterogeneous system that is to execute the instruction. When the accelerator identifier is the identifier of the first accelerator, the first processor may, under the trigger of the first accelerator, read the processing result from the first auxiliary memory; when the accelerator identifier is that of an auxiliary accelerator connected to the first accelerator, the first processor may, under the trigger of the auxiliary accelerator, read the result from the auxiliary memory connected to the auxiliary accelerator.
Optionally, the first processor may trigger the first accelerator to process the to-be-processed data in the first auxiliary memory in various ways, for example by sending the first accelerator the processing instruction, or by changing the state value of a register.
According to a fifth aspect, a data processing apparatus is provided for a first accelerator in a heterogeneous system, where the heterogeneous system further includes a first processor and a first auxiliary memory connected to the first accelerator; the apparatus includes modules for performing the data processing method provided in the second aspect.
According to a sixth aspect, a data processing apparatus is provided for an auxiliary accelerator in a heterogeneous system, where the heterogeneous system includes multiple interconnected processors, multiple interconnected accelerators, and multiple auxiliary memories connected to the accelerators one to one, and the auxiliary accelerator and a first accelerator are any two connected accelerators among them; the apparatus includes modules for performing the data processing method provided in the third aspect.
According to a seventh aspect, a data processing apparatus is provided for a first processor in a heterogeneous system, where the heterogeneous system further includes a first accelerator connected to the first processor and a first auxiliary memory connected to the first accelerator; the apparatus includes modules for performing the data processing method provided in the fourth aspect.
According to an eighth aspect, a computer storage medium is provided, storing a computer program that, when run on a computer device, causes the computer device to perform any of the data processing methods provided in the second, third, or fourth aspect.
According to a ninth aspect, a computer program product containing instructions is provided; when the computer program product runs on a computer device, the computer device performs any of the data processing methods provided in the second, third, or fourth aspect.
For the beneficial effects of the second through ninth aspects, refer to the corresponding descriptions in the first aspect; they are not repeated here.
Brief Description of Drawings
FIG. 1 is a schematic structural diagram of a heterogeneous system according to an embodiment of this application;
FIG. 2 is a schematic structural diagram of another heterogeneous system according to an embodiment of this application;
FIG. 3 is a schematic structural diagram of another heterogeneous system according to an embodiment of this application;
FIG. 4 is a flowchart of a data processing method according to an embodiment of this application;
FIG. 5 is a flowchart of another data processing method according to an embodiment of this application;
FIG. 6 is a schematic diagram of the functional modules of a heterogeneous system according to an embodiment of this application;
FIG. 7 is a block diagram of a data processing apparatus according to an embodiment of this application;
FIG. 8 is a block diagram of another data processing apparatus according to an embodiment of this application;
FIG. 9 is a block diagram of yet another data processing apparatus according to an embodiment of this application.
Detailed Description
To make the principles and technical solutions of this application clearer, the following further describes the implementations of this application in detail with reference to the accompanying drawings.
With the development of computer technology, heterogeneous systems with strong data processing capability have seen wide adoption. Heterogeneous systems enable highly efficient data processing, such as online prediction based on deep learning, video transcoding for live streaming, and image compression or decompression.
By way of example, FIG. 1 is a schematic structural diagram of a heterogeneous system according to an embodiment of this application. As shown in FIG. 1, the heterogeneous system typically includes at least one processor and at least one accelerator. FIG. 1 takes a system with one processor 011 and one accelerator 021 as an example, but the number of processors and accelerators in the system may also be greater than one, which this application does not limit.
For example, as shown in FIG. 2, the at least one processor in the heterogeneous system may include processor 011 and processor 012, and the at least one accelerator may include accelerator 021 and accelerator 022. As another example, as shown in FIG. 3, the at least one processor may include processors 011, 012, 013, and 014, and the at least one accelerator may include accelerators 021, 022, 023, and 024. From the connection relationships shown in FIG. 2 and FIG. 3, when the heterogeneous system includes multiple processors, the processors are interconnected over a cache coherence bus (some directly, some indirectly); when it includes multiple accelerators, the accelerators are likewise interconnected over a cache coherence bus (some directly, some indirectly). Each processor is connected to at least one accelerator.
Both the processors and the accelerators in the heterogeneous system have data processing capability, and the accelerators assist the processors with some data processing to strengthen the system's overall data processing capability. The processor may be any kind of processor, such as an Advanced RISC Machines (ARM) architecture processor or an x86-architecture processor. ARM and x86 are the names of two processor architectures; the two kinds of processors use different protocols and differ in power consumption, performance, and cost. The accelerator may be any device with data processing capability, such as a graphics processing unit (GPU), a field-programmable gate array (FPGA), or an application-specific integrated circuit (ASIC).
Further, each processor in the heterogeneous system may be connected to a main memory, and each accelerator may be connected to an auxiliary memory. In that case, the heterogeneous system further includes the main memory connected to each processor and the auxiliary memory connected to each accelerator. For example, referring again to FIG. 1, the system further includes main memory 031 connected to processor 011 and auxiliary memory 041 connected to accelerator 021. Referring to FIG. 2, the system further includes main memory 031 connected to processor 011, main memory 032 connected to processor 012, auxiliary memory 041 connected to accelerator 021, and auxiliary memory 042 connected to accelerator 022. Referring to FIG. 3, the system further includes main memories 031 to 034 connected to processors 011 to 014 respectively, and auxiliary memories 041 to 044 connected to accelerators 021 to 024 respectively.
Both the main memory and the auxiliary memory may be any kind of memory, such as double data rate synchronous dynamic random access memory (DDR) or high bandwidth memory (HBM). The embodiments of this application take the main memory being DDR and the auxiliary memory being HBM as an example. When the heterogeneous system includes HBM, the high-bandwidth storage that HBM provides improves the system's data processing efficiency; HBM is also small and consumes little power.
In addition, the main memory may be separate from its connected processor and the auxiliary memory separate from its connected accelerator; alternatively, the main memory may be integrated into the connected processor and the auxiliary memory integrated into the connected accelerator. The embodiments of this application take the main memory being separate from the processor and the auxiliary memory being integrated into the accelerator as an example (this integration is not shown in FIG. 1).
In the related art, the processor and the accelerator in a heterogeneous system are connected over a PCIe bus. The storage space of the main memory connected to the processor, and the storage space on the accelerator connected to the processor, are visible to the processor, which can read and write them; but the storage space of the auxiliary memory connected to the accelerator is not visible to the processor, which therefore cannot read or write it. So in the related art, when the processor needs the accelerator to process data, it must first write the data into the main memory and notify the accelerator to move the to-be-processed data from the main memory to the auxiliary memory by DMA. The processor must then notify the accelerator to process the data in the auxiliary memory. After processing the data, the accelerator writes the result into the auxiliary memory and notifies the processor that processing is complete. Finally, the processor must notify the accelerator to move the result from the auxiliary memory to the main memory by DMA before the processor can read the result from the main memory. As can be seen, while the accelerator assists the processor with data processing, the two exchange many messages, which reduces data processing efficiency.
In the embodiments of this application, the processor and the accelerator in the heterogeneous system are connected over a cache coherence bus, that is, a bus that uses a cache coherence protocol. With such a connection, the storage space of the main memory, of the accelerator, and of the auxiliary memory in the heterogeneous system is all visible to the processor. These storage spaces are uniformly addressed by the processor, so the processor can read and write them based on their addresses.
The cache coherence bus may be any bus that uses a cache coherence protocol, such as a Cache Coherent Interconnect for Accelerators (CCIX) bus or a Compute Express Link (CXL) bus. Optionally, when the cache coherence bus is a CCIX bus, the processor may be the ARM-architecture processor mentioned above; when the cache coherence bus is a CXL bus, the processor may be the x86-architecture processor mentioned above. The embodiments of this application do not limit the type of cache coherence bus or the type of processor.
On the basis of connecting the processor and the accelerator over a cache coherence bus, the embodiments of this application provide a data processing method for this heterogeneous system. The method not only enables the accelerator to assist the processor with data processing, but also reduces the number of interactions between the processor and the accelerator, thereby improving data processing efficiency.
The data processing method provided in the embodiments of this application may be used in the heterogeneous systems provided herein (such as any of the systems shown in FIG. 1 to FIG. 3). The method involves a first processor, a first accelerator, and a first auxiliary memory in the heterogeneous system, where the first processor is any processor in the system, the first accelerator is any accelerator connected to the first processor, and the first auxiliary memory is the auxiliary memory connected to the first accelerator.
By way of example, FIG. 4 is a flowchart of a data processing method according to an embodiment of this application. In FIG. 4, the first processor is processor 011 in FIG. 1, the first accelerator is accelerator 021 in FIG. 1, and the auxiliary memory connected to the first accelerator is auxiliary memory 041 in FIG. 1. As shown in FIG. 4, the data processing method may include:
S401. Processor 011 writes the to-be-processed data into auxiliary memory 041 connected to accelerator 021.
In this embodiment, because processor 011 and accelerator 021 are connected over a cache coherence bus, the storage space of the main memory, the accelerator, and the auxiliary memory in the heterogeneous system is all visible to processor 011. When the heterogeneous system starts, all the processors in the system (for example, the Basic Input Output System (BIOS) in each processor) uniformly address the storage space of the main memories, accelerators, and auxiliary memories in the system. As a result, every processor in the heterogeneous system holds the address of every storage unit (the smallest unit of storage) in these storage spaces, and can thereafter read and write data directly at these addresses.
For example, assume that in FIG. 1 the storage space of main memory 031 includes storage units 1 and 2, the storage space on accelerator 021 (for example, that of some I/O (input/output) registers on accelerator 021) includes storage units 3 and 4, and the storage space of auxiliary memory 041 includes storage units 5 and 6. After uniformly addressing these storage spaces, processor 011 obtains the addresses A, B, C, D, E, and F of the storage units, as shown in Table 1.
Table 1

    Storage space           Storage unit    Address
    Main memory 031         Unit 1          A
    Main memory 031         Unit 2          B
    Accelerator 021         Unit 3          C
    Accelerator 021         Unit 4          D
    Auxiliary memory 041    Unit 5          E
    Auxiliary memory 041    Unit 6          F
In S401, processor 011 may write the to-be-processed data into at least one storage unit of auxiliary memory 041 according to the addresses of the storage units of auxiliary memory 041. Note that the to-be-processed data may be data generated by processor 011, data sent by a device outside the heterogeneous system, or data that processor 011 stored in main memory 031 before S401; this embodiment does not limit this.
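The unified addressing described above can be pictured as one flat address map covering all three storage spaces. The following Python sketch is illustrative only (the device and unit names mirror the example around Table 1 but are hypothetical):

```python
# Flat address map as seen by the processor after unified addressing
# (hypothetical layout; addresses A-F follow the Table 1 example).
address_map = {
    "A": ("main_memory_031", "unit1"),
    "B": ("main_memory_031", "unit2"),
    "C": ("accelerator_021", "unit3"),
    "D": ("accelerator_021", "unit4"),
    "E": ("aux_memory_041", "unit5"),
    "F": ("aux_memory_041", "unit6"),
}

# Backing stores for the three devices.
storage = {
    "main_memory_031": {},
    "accelerator_021": {},
    "aux_memory_041": {},
}

def write(addr, value):
    # The processor writes through the flat address, regardless of which
    # physical device backs that address.
    device, unit = address_map[addr]
    storage[device][unit] = value

def read(addr):
    device, unit = address_map[addr]
    return storage[device][unit]

write("E", b"to-be-processed")   # S401: write directly into aux memory 041
```

The point is that writing to address E (aux memory) is syntactically identical to writing to address A (main memory): the processor needs no DMA notification to place data in the accelerator's auxiliary memory.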
S402. Processor 011 sends accelerator 021 a processing instruction for the to-be-processed data.
The processing instruction instructs accelerator 021 to perform a certain kind of processing on the to-be-processed data. It may therefore carry the storage address of the to-be-processed data in the auxiliary memory connected to accelerator 021, and an indication of that processing. For example, the processing may be based on a machine learning algorithm, a deep learning algorithm, or a financial risk-control algorithm; this embodiment does not limit the processing indicated by the instruction.
S403. Accelerator 021 processes the to-be-processed data in auxiliary memory 041 according to the processing instruction.
After receiving the processing instruction, accelerator 021 may parse it to determine the address of the to-be-processed data indicated by the instruction and the processing the data requires. Accelerator 021 may then read the data from its connected auxiliary memory 041 and perform the processing indicated in the instruction, obtaining the processing result.
S404. Accelerator 021 writes the processing result of the to-be-processed data into its connected auxiliary memory 041.
S405. Accelerator 021 sends processor 011 a processing response for the to-be-processed data.
The processing response instructs processor 011 to obtain the processing result, so it needs to carry the storage address of the result in the auxiliary memory connected to accelerator 021.
S406. Processor 011 reads the processing result from auxiliary memory 041 connected to accelerator 021 according to the processing response.
For example, the processing response carries the storage address of the result in auxiliary memory 041 connected to accelerator 021, so processor 011 can read the result at that address.
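Taken together, S401 to S406 reduce to one write, one processing instruction, and one response. The following Python sketch simulates that exchange with toy objects; the doubling operation and all names are hypothetical placeholders for whatever processing the instruction actually indicates:

```python
class Accelerator:
    """Toy model of accelerator 021 with its attached aux memory 041."""
    def __init__(self):
        self.aux_mem = {}

    def handle_instruction(self, instr):
        # S403: read the data at the address named in the instruction and
        # apply the requested processing (doubling, for illustration only).
        data = self.aux_mem[instr["data_addr"]]
        result = [x * 2 for x in data]
        # S404: write the result back into the attached aux memory.
        self.aux_mem["result_addr"] = result
        # S405: the response carries the result's storage address.
        return {"result_addr": "result_addr"}

class Processor:
    """Toy model of processor 011; writes aux memory directly (S401)."""
    def run(self, acc, data):
        acc.aux_mem["data_addr"] = data                  # S401
        resp = acc.handle_instruction(                   # S402
            {"data_addr": "data_addr", "op": "double"})
        return acc.aux_mem[resp["result_addr"]]          # S406

out = Processor().run(Accelerator(), [1, 2, 3])
```

Compared with the PCIe/DMA baseline, there is no data-movement notification in either direction: the processor writes and reads the aux memory itself.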
It can be seen that in the data processing method provided in this embodiment, the first accelerator (such as accelerator 021) can assist the first processor (such as processor 011) in processing the to-be-processed data, so the data processing capability of the whole heterogeneous system is high.
Moreover, in this method the first processor (such as processor 011) can write the to-be-processed data directly into the auxiliary memory (such as auxiliary memory 041) connected to the first accelerator (such as accelerator 021). This avoids both the step in which the first processor notifies the first accelerator to move the data from the main memory connected to the first processor to the auxiliary memory, and the step in which the first accelerator performs that move.
In addition, the first accelerator can write the processing result directly into the auxiliary memory, and the first processor can obtain the result from that auxiliary memory. This avoids the steps in which the first accelerator notifies the first processor that processing is complete and the first processor notifies the first accelerator to move the result from the auxiliary memory to the main memory.
Therefore, in this embodiment the first processor and the first accelerator interact fewer times, the data processing procedure is simpler, and data processing efficiency is higher.
In addition, to further improve the data processing efficiency of the heterogeneous system, a cache coherence bus with high transfer bandwidth may be used in the embodiments of this application, for example one with a transfer bandwidth of 25 gigatransfers per second (GT/s).
The embodiment shown in FIG. 4 takes the heterogeneous system of FIG. 1 as an example. When the heterogeneous system includes multiple accelerators interconnected over a cache coherence bus (such as the systems shown in FIG. 2 or FIG. 3), the data processing method further involves an auxiliary accelerator in the system and the auxiliary memory connected to the auxiliary accelerator.
By way of example, the method may then be as shown in FIG. 5, where the first processor is processor 011 in FIG. 2, the first accelerator is accelerator 021 in FIG. 2, the auxiliary memory connected to the first accelerator (which may be called the first auxiliary memory) is auxiliary memory 041 in FIG. 2, the auxiliary accelerator is accelerator 022 in FIG. 2, and the auxiliary memory connected to the auxiliary accelerator is auxiliary memory 042 in FIG. 2. Referring to FIG. 5, the data processing method may include:
S501. Processor 011 writes the to-be-processed data into auxiliary memory 041 connected to accelerator 021. Proceed to S502.
For S501, refer to S401; details are not repeated here.
S502. Processor 011 sends accelerator 021 a processing instruction for the to-be-processed data, where the instruction carries an accelerator identifier and the identifier of processor 011, the accelerator identifier being that of the accelerator in the heterogeneous system that is to execute the instruction. Proceed to S503.
For S502, refer to S402; details are not repeated here.
In addition, because the heterogeneous system in this embodiment includes multiple accelerators, the processing instruction needs to carry the identifier of the accelerator that is to execute it, so as to associate the instruction with that accelerator. And when the system includes multiple processors, the instruction needs to carry the identifier of processor 011 that issued it, so as to associate the instruction with its issuing processor.
S503. Accelerator 021 checks whether the accelerator identifier in the processing instruction is the identifier of accelerator 021. If it is, S504 is performed; if it is not, S508 is performed.
Because the heterogeneous system in this embodiment includes multiple accelerators, to guard against the first processor sending the processing instruction to the wrong accelerator, accelerator 021, after receiving the instruction from processor 011, needs to check whether the accelerator identifier carried in the instruction matches its own identifier, to determine whether it is the accelerator designated by processor 011 to execute the instruction.
When the accelerator identifier in the processing instruction is the identifier of accelerator 021, accelerator 021 can determine that it is the accelerator designated by processor 011 to execute the instruction. It may then perform S504 to carry out the corresponding data processing according to the instruction.
When the accelerator identifier in the processing instruction is not the identifier of accelerator 021, accelerator 021 can determine that it is not the designated accelerator. It may then perform S508 to trigger accelerator 022, the accelerator designated by processor 011, to carry out the corresponding data processing according to the instruction.
S504. Accelerator 021 processes the to-be-processed data in auxiliary memory 041 according to the processing instruction. Proceed to S505.
For S504, refer to S403; details are not repeated here.
S505. Accelerator 021 writes the processing result into its connected auxiliary memory 041. Proceed to S506.
For S505, refer to S404; details are not repeated here.
S506. Accelerator 021 sends processor 011 a processing response for the to-be-processed data. Proceed to S507.
For S506, refer to S405; details are not repeated here.
S507. Processor 011 reads the processing result from auxiliary memory 041 connected to accelerator 021 according to the processing response sent by accelerator 021.
For S507, refer to S406; details are not repeated here.
S508. Accelerator 021 writes the to-be-processed data into auxiliary memory 042 connected to accelerator 022 indicated by the accelerator identifier.
S509. Accelerator 021 forwards the processing instruction for the to-be-processed data to accelerator 022. Proceed to S510.
Because the accelerators in the heterogeneous system in this embodiment are interconnected, accelerator 021 is connected to accelerator 022 and can, over that connection, write the to-be-processed data into auxiliary memory 042 connected to accelerator 022 and send the processing instruction to accelerator 022.
S510. Accelerator 022 processes the to-be-processed data in its connected auxiliary memory 042 according to the processing instruction. Proceed to S511.
For the processing in S510, refer to the processing in S403; details are not repeated here.
S511. Accelerator 022 writes the processing result into its connected auxiliary memory 042. Proceed to S512.
For writing the result in S511, refer to writing the result in S404; details are not repeated here.
S512. Accelerator 022 sends processor 011 a processing response for the to-be-processed data according to the identifier of processor 011 carried in the processing instruction. Proceed to S513.
Because the processing instruction carries the identifier of processor 011 that issued it, accelerator 022, after executing the instruction forwarded by accelerator 021, can send processor 011 a processing response according to that identifier, instructing processor 011 to obtain the processing result. The response needs to carry the storage address of the result in auxiliary memory 042 connected to accelerator 022.
S513. Processor 011 reads the processing result from auxiliary memory 042 connected to accelerator 022 according to the processing response sent by accelerator 022.
For reading the result in S513, refer to reading the result in S406; details are not repeated here.
It can be seen that in the data processing method provided in this embodiment, the first accelerator (such as accelerator 021) or the auxiliary accelerator (such as accelerator 022) can assist the first processor (such as processor 011) in processing the to-be-processed data, so the data processing capability of the whole heterogeneous system is high.
Moreover, in this method the first processor can write the to-be-processed data directly into the auxiliary memory connected to the first accelerator (such as auxiliary memory 041). This avoids both the step in which the first processor notifies the first accelerator to move the data from the main memory connected to the first processor to the auxiliary memory, and the step in which the first accelerator performs that move.
In addition, the first accelerator or the auxiliary accelerator can write the processing result directly into the auxiliary memory, and the first processor can obtain the result from that auxiliary memory. This avoids the steps in which the first accelerator or the auxiliary accelerator notifies the first processor that processing is complete and the first processor notifies the first accelerator or the auxiliary accelerator to move the result from the auxiliary memory to the main memory.
Therefore, in this embodiment the first processor and the first accelerator or auxiliary accelerator interact fewer times, the data processing procedure is simpler, and data processing efficiency is higher.
In addition, to further improve the data processing efficiency of the heterogeneous system, a cache coherence bus with high transfer bandwidth may be used, for example one with a transfer bandwidth of 25 gigatransfers per second (GT/s).
Optionally, when the heterogeneous system includes multiple accelerators, a processor in the system may, based on the data processing method shown in FIG. 5, control multiple accelerators to perform data processing in parallel, further improving the data processing efficiency of the whole system.
From the above embodiments (S402 and S403 in the embodiment of FIG. 4, or S502 and S504 in the embodiment of FIG. 5), the first processor may trigger the first accelerator to process the to-be-processed data in the connected auxiliary memory according to the processing instruction. From the above embodiments (S405 and S406 in the embodiment of FIG. 4, or S506 and S507 in the embodiment of FIG. 5), the first accelerator may trigger the first processor to read the processing result from the auxiliary memory connected to the first accelerator.
The above embodiments take as an example the first processor triggering the first accelerator to perform data processing by sending it a processing instruction, and the first accelerator triggering the first processor to read the result by sending it a processing response. Optionally, the first processor may also trigger the first accelerator without sending a processing instruction, and the first accelerator may trigger the first processor without sending a processing response.
For example, the storage space of the auxiliary memory may include three kinds of storage units: data storage units for storing data, instruction storage units for storing processing instructions, and result storage units for storing processing results. Furthermore, the I/O registers in the first accelerator may each correspond to a data storage unit, an instruction storage unit, and a result storage unit in the auxiliary memory connected to the first accelerator. Both the first processor and the first accelerator can obtain this correspondence and perform the above data processing method based on it.
By way of example, when the first processor writes the to-be-processed data into some data storage unit of the auxiliary memory connected to the first accelerator, it may, according to the correspondence, write the processing instruction into the instruction storage unit corresponding to that data storage unit and modify the state value of the I/O register corresponding to that data storage unit. The I/O register may have multiple state values, including a first state value and a second state value. Before the first processor changes the state value of an I/O register in the first accelerator, that register holds the first state value; after the change, it holds the second state value. When the first accelerator detects that an I/O register's state value has become the second state value, it may, according to the correspondence, fetch the processing instruction from the instruction storage unit corresponding to that register and read the to-be-processed data from the corresponding data storage unit. The first accelerator may then process the data according to the instruction to obtain the processing result.
After obtaining the processing result, the first accelerator may, according to the correspondence, modify the state value of that I/O register and write the result into the result storage unit corresponding to the register. The register's state values may further include a third state value: after obtaining the result, the first accelerator may change the register's state value to the third state value. The first processor may check whether any I/O register in the first accelerator holds the third state value; when one does, the first processor may, according to the correspondence, read the processing result from the result storage unit corresponding to that register.
For example, assume the correspondence is as shown in Table 2, and that before processor 011 changes any I/O register state in accelerator 021, every register holds the first state value 0 shown in Table 2. If processor 011 writes the to-be-processed data into data storage unit 1.1 and the processing instruction into instruction storage unit 2.1, it may also change the state value of I/O register 3.1 from the first state value 0 to the second state value 1 shown in Table 3. Accelerator 021 can then detect that register 3.1's state value has become the second state value 1, fetch the data from data storage unit 1.1 and the instruction from instruction storage unit 2.1 corresponding to register 3.1, and process the data according to the instruction to obtain the result. Accelerator 021 may then write the result into result storage unit 4.1 corresponding to data storage unit 1.1 and change register 3.1's state value to the third state value 2 shown in Table 4. When processor 011 detects that register 3.1's state value is the third state value 2, it can read the processing result from result storage unit 4.1 corresponding to that register.
Table 2

    Data storage unit    Instruction storage unit    I/O register    Register state value    Result storage unit
    1.1                  2.1                         3.1             0                       4.1

Table 3

    Data storage unit    Instruction storage unit    I/O register    Register state value    Result storage unit
    1.1                  2.1                         3.1             1                       4.1

Table 4

    Data storage unit    Instruction storage unit    I/O register    Register state value    Result storage unit
    1.1                  2.1                         3.1             2                       4.1
It can be seen that in the embodiments of this application, the first processor may also trigger the first accelerator to perform data processing by changing the state value of an I/O register, and the first accelerator may also trigger the first processor to read the processing result by changing the state value of an I/O register.
Optionally, after the first processor finishes reading the processing result, the first accelerator may change the state value of the I/O register corresponding to the result storage unit holding that result back to the first state value. This allows the first processor, the next time, to again trigger the first accelerator to perform data processing by changing that register's state value, and the first accelerator to again trigger the first processor to read the result in the same way.
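The three register states described above form a small state machine: idle, request pending, result ready. The following Python sketch is a hypothetical model of one register channel (the 0/1/2 encoding follows the Tables 2 to 4 example; all names are illustrative):

```python
# First, second, and third state values from the example (illustrative).
IDLE, REQUEST, DONE = 0, 1, 2

class IoRegisterChannel:
    """One I/O register with its corresponding data/instruction/result units."""
    def __init__(self):
        self.state = IDLE
        self.data = None          # data storage unit
        self.instruction = None   # instruction storage unit
        self.result = None        # result storage unit

    # Processor side: stage data and instruction, then flip the state.
    def submit(self, data, instruction):
        assert self.state == IDLE
        self.data, self.instruction = data, instruction
        self.state = REQUEST      # first -> second state value

    # Accelerator side: poll for REQUEST, process, flip to DONE.
    def service(self):
        if self.state == REQUEST:
            self.result = f"{self.instruction}({self.data})"  # placeholder work
            self.state = DONE     # second -> third state value

    # Processor side: poll for DONE, read the result, release the channel.
    def collect(self):
        if self.state == DONE:
            out, self.result = self.result, None
            self.state = IDLE     # reset so the channel can be reused
            return out

ch = IoRegisterChannel()
ch.submit("payload", "compress")
ch.service()
```

In the document's scheme it is the accelerator that resets the register to the first state value after the processor has read the result; the sketch folds that reset into `collect()` only to keep the model in one class.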
From S509 and S510 of the embodiment shown in FIG. 5, the first accelerator may trigger the auxiliary accelerator to process the to-be-processed data in its connected auxiliary memory according to the processing instruction. From S512 and S513 of that embodiment, the auxiliary accelerator may trigger the first processor to read the processing result from the auxiliary memory connected to the auxiliary accelerator.
S509 and S510 take the first accelerator triggering the auxiliary accelerator to perform data processing by sending it a processing instruction as an example, and S512 and S513 take the auxiliary accelerator triggering the first processor by sending it a processing response as an example. Optionally, the first accelerator may trigger the auxiliary accelerator without sending a processing instruction, and the auxiliary accelerator may trigger the first processor without sending a processing response.
For example, in S509 and S510 the first accelerator may trigger the auxiliary accelerator by changing the state value of an I/O register, following the same procedure by which the first processor triggers the first accelerator; in S512 and S513 the auxiliary accelerator may trigger the first processor to read the result the same way. Details are not repeated here.
The above embodiments briefly describe the functions of the devices in the heterogeneous system through the data processing method for the heterogeneous system; the functional modules of those devices are further described below.
By way of example, FIG. 6 is a schematic diagram of the functional modules of a heterogeneous system according to an embodiment of this application, taking one connected group of processor, accelerator, and auxiliary memory as an example; when the system includes multiple such groups, the functional modules of each group may refer to FIG. 6.
As shown in FIG. 6, the processor may include an application adaptation layer, an acceleration application programming interface (API), inter-process shared memory, and cache-coherent memory. The accelerator may include a cache coherence module and a processing module. The cache-coherent memory in the processor is connected to the cache coherence module in the accelerator over the cache coherence bus, and both the cache coherence module and the processing module are connected to the auxiliary memory.
In the processor, application software running on the processor can call the acceleration API through the application adaptation layer. The acceleration API implements data conversion and control between the application software and the accelerator. The inter-process shared memory is used for communication among the multiple processes running on the processor. The cache-coherent memory is used for communication between the processor and the accelerator.
In the accelerator, the processing module performs the processing operations performed by the accelerator in the above data processing method, and it may also trigger the cache coherence module to perform the read and write operations performed by the accelerator in that method.
In the above data processing method, reads and writes of data in the auxiliary memory by the processor and by the accelerator's processing module are all carried out through the cache coherence module. For example, when the processor or the processing module needs to read or write data in the auxiliary memory, it may send a read/write request to the cache coherence module. The cache coherence module may generate a request agent (RA) (not shown in FIG. 6) for each received read/write request, and that RA performs the corresponding read or write operation. In addition, when an RA reads data from the auxiliary memory, it caches a copy of the data, so that the next read can be served from the local copy without reading the auxiliary memory again.
The cache coherence module further includes a home agent (HA) (not shown in FIG. 6), which manages all the RAs in the cache coherence module to maintain cache coherence.
For example, before reading or writing data in the auxiliary memory, each RA must first send the HA a request to read or write that data.
For an RA that reads data (for example a processing instruction, to-be-processed data, or a processing result), the HA, upon receiving the RA's read request, grants the RA permission to read the data in the auxiliary memory, after which the RA can read it.
For an RA that writes data (for example a processing instruction, to-be-processed data, or a processing result) to some address in the auxiliary memory, the HA, upon receiving the RA's request to write to that address, must perform a consistency check to guarantee the RA exclusive access to the address. For example, the HA may check whether any other RA currently caches a copy of the data at that address. If other RAs do cache such copies, a write to the address by one RA would leave those copies inconsistent with the real data at the address. The HA therefore invalidates all such copies during the consistency check, then grants the writing RA permission to write at that address, after which it can do so. This guarantees that the data every RA reads at the address is consistent. Note that when an RA's cached copy is invalidated, if that RA needs to read the data again it must, because the copy is invalid, issue a new read request to the HA.
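The HA's consistency check amounts to invalidating every cached copy of an address before granting write permission. A minimal Python sketch, with hypothetical RA/HA objects standing in for the hardware agents:

```python
class RequestAgent:
    def __init__(self):
        self.cache = {}   # local copies of data read from the aux memory

class HomeAgent:
    """Toy HA: tracks which RAs cache which address, invalidates on write."""
    def __init__(self, memory):
        self.memory = memory
        self.sharers = {}            # addr -> set of RAs holding a copy

    def read(self, ra, addr):
        if addr not in ra.cache:     # miss: fetch and record the sharer
            ra.cache[addr] = self.memory[addr]
            self.sharers.setdefault(addr, set()).add(ra)
        return ra.cache[addr]        # hit: served from the local copy

    def write(self, ra, addr, value):
        # Consistency check: invalidate every cached copy first, so no
        # agent can observe a stale value after this write completes.
        for sharer in self.sharers.pop(addr, set()):
            sharer.cache.pop(addr, None)
        self.memory[addr] = value    # writer now has exclusive access

mem = {"X": 1}
ha = HomeAgent(mem)
ra1, ra2 = RequestAgent(), RequestAgent()
ha.read(ra1, "X")          # ra1 caches X == 1
ha.write(ra2, "X", 2)      # invalidates ra1's copy, then writes
```

After the write, ra1's copy is gone, so its next read misses and re-fetches the new value from memory via the HA, exactly the re-request behavior described above.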
Further, if the above data processing method involves reading and writing the state values of I/O registers in the accelerator, those reads and writes may also be carried out through the cache coherence module, to keep the register state values cache-coherent. For reading and writing the I/O register state values through the cache coherence module, refer to reading and writing the auxiliary memory through the cache coherence module; details are not repeated here.
The data processing method provided in this application has been described in detail above with reference to FIG. 1 to FIG. 6. From that method, the heterogeneous system provided in the embodiments of this application includes: a first processor and a first accelerator that are connected to each other, and a first auxiliary memory connected to the first accelerator.
The first processor is configured to write the to-be-processed data into the first auxiliary memory; the first processor is further configured to trigger the first accelerator to process the to-be-processed data in the first auxiliary memory according to a processing instruction; the first accelerator is configured to write the processing result of the to-be-processed data into the first auxiliary memory; the first accelerator is configured to trigger the first processor to read the processing result from the first auxiliary memory.
Optionally, the first processor and the first accelerator are connected over a cache coherence bus.
Optionally, the cache coherence bus includes a CCIX bus or a CXL bus.
Optionally, the cache coherence bus includes a CCIX bus and the first processor includes an ARM-architecture processor; or the cache coherence bus includes a CXL bus and the first processor includes an x86-architecture processor.
Optionally, the auxiliary memory includes an HBM.
Optionally, the accelerator includes a GPU, an FPGA, or an ASIC.
Optionally, the heterogeneous system includes multiple interconnected accelerators, the first accelerator being any one of them; the processing instruction carries an accelerator identifier, namely that of the accelerator among the multiple accelerators that is to execute the instruction; the first accelerator is configured to process the to-be-processed data in the first auxiliary memory according to the instruction when the accelerator identifier is the identifier of the first accelerator.
Optionally, the heterogeneous system includes multiple auxiliary memories connected to the multiple accelerators in one-to-one correspondence, and multiple interconnected processors, the first processor being any processor among them connected to the first accelerator; the processing instruction further carries the identifier of the first processor; the first accelerator is configured to, when the accelerator identifier is not its own identifier, write the to-be-processed data into the auxiliary memory connected to the auxiliary accelerator indicated by the accelerator identifier and trigger the auxiliary accelerator to process the data according to the instruction; the auxiliary accelerator is configured to, after processing the data according to the instruction, write the processing result into its connected auxiliary memory, and, according to the identifier of the first processor carried in the instruction, trigger the first processor to read the result from the auxiliary memory connected to the auxiliary accelerator.
Optionally, the multiple accelerators are connected over a cache coherence bus, and the multiple processors are connected over a cache coherence bus.
It can be seen that in the data processing method provided in the embodiments of this application, the first accelerator can assist the first processor in processing the to-be-processed data, so the data processing capability of the whole heterogeneous system is high.
Moreover, the first processor can write the to-be-processed data directly into the auxiliary memory connected to the first accelerator, avoiding both the step in which the first processor notifies the first accelerator to move the data from the main memory connected to the first processor to the auxiliary memory, and the step in which the first accelerator performs that move.
In addition, the first accelerator can write the processing result directly into the auxiliary memory, and the first processor can obtain the result from that auxiliary memory, avoiding the steps in which the first accelerator notifies the first processor that processing is complete and the first processor notifies the first accelerator to move the result from the auxiliary memory to the main memory.
Therefore, the first processor and the first accelerator interact fewer times, the data processing procedure is simpler, and data processing efficiency is higher.
Further, the data processing apparatuses in the data processing system provided in this application are described below with reference to FIG. 7 to FIG. 9.
By way of example, FIG. 7 is a block diagram of a data processing apparatus according to an embodiment of this application; the apparatus may be the first accelerator in the data processing system provided in the embodiments of this application. As shown in FIG. 7, the data read/write apparatus includes:
a processing module 701, configured to, under the trigger of the first processor, process the to-be-processed data in the first auxiliary memory according to the processing instruction; for its operations, refer to S403 or S504 above (or the description related to S403 or S504), not repeated here;
a writing module 702, configured to write the processing result of the to-be-processed data into the first auxiliary memory; for its operations, refer to S404 or S505 above (or the description related to S404 or S505), not repeated here;
a trigger module 703, configured to trigger the first processor to read the processing result from the first auxiliary memory; for its operations, refer to S405 or S506 above (or the description related to S405 or S506), not repeated here.
Optionally, the above data processing apparatus is further configured to perform other operations in the data processing method shown in FIG. 5. For example, the processing module 701 is further configured to perform S503 in FIG. 5, the writing module 702 is further configured to perform S508 in FIG. 5, and the trigger module 703 is further configured to perform S509 in FIG. 5. For the specific flow of each module performing each step, see the descriptions of FIG. 4 and FIG. 5 above; details are not repeated here.
As another example, FIG. 8 is a block diagram of another data processing apparatus according to an embodiment of this application; the apparatus may be the auxiliary accelerator in the data processing system provided in the embodiments of this application. The heterogeneous system includes multiple processors connected to each other, multiple accelerators connected to each other, and multiple auxiliary memories connected to the multiple accelerators one to one; the auxiliary accelerator and the first accelerator are any two connected accelerators among the multiple accelerators. As shown in FIG. 8, the data read/write apparatus includes:
a processing module 801, configured to, under the trigger of the first accelerator, process the to-be-processed data in the auxiliary memory connected to the auxiliary accelerator according to the processing instruction, where the instruction carries the identifier of the first processor connected to the first accelerator; for its operations, refer to S510 above (or the description related to S510), not repeated here;
a writing module 802, configured to write the processing result of the to-be-processed data into the connected auxiliary memory; for its operations, refer to S511 above (or the description related to S511), not repeated here;
a trigger module 803, configured to, according to the identifier of the first processor carried in the processing instruction, trigger the first processor to read the processing result from the auxiliary memory connected to the auxiliary accelerator; for its operations, refer to S512 above (or the description related to S512), not repeated here.
As yet another example, FIG. 9 is a block diagram of yet another data processing apparatus according to an embodiment of this application; the apparatus may be the first processor in the data processing system provided in the embodiments of this application. The heterogeneous system further includes a first accelerator connected to the first processor, and a first auxiliary memory connected to the first accelerator. As shown in FIG. 9, the data read/write apparatus includes:
a writing module 901, configured to write the to-be-processed data into the first auxiliary memory; for its operations, refer to S401 or S501 above (or the description related to S401 or S501), not repeated here;
a trigger module 902, configured to trigger the first accelerator to process the to-be-processed data in the first auxiliary memory according to the processing instruction; for its operations, refer to S402 or S502 above (or the description related to S402 or S502), not repeated here;
a reading module 903, configured to, under the trigger of the first accelerator, read the processing result of the to-be-processed data from the first auxiliary memory; for its operations, refer to S406 or S507 above (or the description related to S406 or S507), not repeated here.
Optionally, the above data processing apparatus is further configured to perform other operations in the data processing method shown in FIG. 5; for example, the reading module 903 is further configured to perform S513 in FIG. 5. For the specific flow of each module performing each step, see the descriptions of FIG. 4 and FIG. 5 above; details are not repeated here.
An embodiment of this application provides a computer storage medium storing a computer program used to perform any of the data processing methods provided in this application.
An embodiment of this application provides a computer program product containing instructions; when the computer program product runs on a computer device, the computer device performs any of the data processing methods provided in the embodiments of this application.
The above embodiments may be implemented wholly or partly in software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product, which includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (for example coaxial cable, optical fiber, or digital subscriber line) or wirelessly (for example infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage apparatus such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example a floppy disk, hard disk, or magnetic tape), an optical medium, or a semiconductor medium (for example a solid-state drive).
In this application, the terms "first" and "second" are used only for description and should not be understood as indicating or implying relative importance. The term "at least one" means one or more, and "multiple" means two or more, unless expressly defined otherwise.
The different types of embodiments provided herein, such as the method embodiments and the apparatus embodiments, may refer to one another; this is not limited in the embodiments of this application. The order of operations of the method embodiments can be adjusted as appropriate, and operations can be added or removed as the situation requires; any variation readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within its protection scope, and is therefore not described further.
In the corresponding embodiments provided in this application, it should be understood that the disclosed apparatus and the like may be implemented in other structural manners. For example, the apparatus embodiments described above are merely illustrative; the division into modules is merely a division by logical function, and in actual implementation there may be other divisions, for example multiple modules may be combined or integrated into another system, or some features may be ignored or not performed. Furthermore, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical or take other forms.
Modules described as separate components may or may not be physically separate, and components described as modules may or may not be physical units; they may be located in one place or distributed across multiple devices. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
The above are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any equivalent modification or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (17)

  1. A heterogeneous system, wherein the heterogeneous system comprises: a first processor and a first accelerator that are connected to each other, and a first auxiliary memory connected to the first accelerator;
    the first processor is configured to write to-be-processed data into the first auxiliary memory;
    the first processor is further configured to trigger the first accelerator to process the to-be-processed data in the first auxiliary memory according to a processing instruction;
    the first accelerator is configured to write a processing result of the to-be-processed data into the first auxiliary memory;
    the first accelerator is configured to trigger the first processor to read the processing result from the first auxiliary memory.
  2. The heterogeneous system according to claim 1, wherein the first processor and the first accelerator are connected over a cache coherence bus.
  3. The heterogeneous system according to claim 2, wherein the cache coherence bus comprises a CCIX bus and the first processor comprises an Advanced RISC Machines (ARM) architecture processor;
    or, the cache coherence bus comprises a CXL bus and the first processor comprises an x86-architecture processor.
  4. The heterogeneous system according to any one of claims 1 to 3, wherein the heterogeneous system comprises multiple interconnected accelerators, and the first accelerator is any one of the multiple accelerators;
    the processing instruction carries an accelerator identifier, the accelerator identifier being the identifier of the accelerator among the multiple accelerators that is to execute the processing instruction;
    the first accelerator is configured to process the to-be-processed data in the first auxiliary memory according to the processing instruction when the accelerator identifier is the identifier of the first accelerator.
  5. The heterogeneous system according to claim 4, wherein the heterogeneous system comprises: multiple auxiliary memories connected to the multiple accelerators in one-to-one correspondence, and multiple processors connected to each other, the first processor being any processor among the multiple processors that is connected to the first accelerator; the processing instruction further carries an identifier of the first processor;
    the first accelerator is configured to, when the accelerator identifier is not the identifier of the first accelerator, write the to-be-processed data into the auxiliary memory connected to the auxiliary accelerator indicated by the accelerator identifier, and trigger the auxiliary accelerator to process the to-be-processed data according to the processing instruction;
    the auxiliary accelerator is configured to:
    after processing the to-be-processed data according to the processing instruction, write the processing result of the to-be-processed data into its connected auxiliary memory;
    according to the identifier of the first processor carried in the processing instruction, trigger the first processor to read the processing result from the auxiliary memory connected to the auxiliary accelerator.
  6. A data processing method, for a first accelerator in a heterogeneous system, the heterogeneous system further comprising: a first processor and a first auxiliary memory connected to the first accelerator; the method comprising:
    under the trigger of the first processor, processing the to-be-processed data in the first auxiliary memory according to a processing instruction;
    writing the processing result of the to-be-processed data into the first auxiliary memory;
    triggering the first processor to read the processing result from the first auxiliary memory.
  7. The method according to claim 6, wherein the heterogeneous system comprises multiple interconnected accelerators, the first accelerator being any one of the multiple accelerators; the processing instruction carries an accelerator identifier, the accelerator identifier being the identifier of the accelerator in the heterogeneous system that is to execute the processing instruction; and the processing of the to-be-processed data in the first auxiliary memory according to the processing instruction comprises:
    when the accelerator identifier is the identifier of the first accelerator, processing the to-be-processed data in the first auxiliary memory according to the processing instruction.
  8. The method according to claim 7, wherein the heterogeneous system comprises: multiple auxiliary memories connected to the multiple accelerators in one-to-one correspondence, and multiple processors connected to each other, the first processor being any processor among the multiple processors connected to the first accelerator; the processing instruction further carries an identifier of the first processor; and the method further comprises:
    when the accelerator identifier is not the identifier of the first accelerator, writing the to-be-processed data into the auxiliary memory connected to the auxiliary accelerator indicated by the accelerator identifier, and triggering the auxiliary accelerator to process the to-be-processed data according to the processing instruction.
  9. A data processing method, for an auxiliary accelerator in a heterogeneous system, the heterogeneous system comprising: multiple processors connected to each other, multiple accelerators connected to each other, and multiple auxiliary memories connected to the multiple accelerators one to one; the auxiliary accelerator and a first accelerator being any two connected accelerators among the multiple accelerators;
    the method comprising:
    under the trigger of the first accelerator, processing the to-be-processed data in the auxiliary memory connected to the auxiliary accelerator according to a processing instruction, the processing instruction carrying an identifier of the first processor connected to the first accelerator;
    writing the processing result of the to-be-processed data into the connected auxiliary memory;
    according to the identifier of the first processor carried in the processing instruction, triggering the first processor to read the processing result from the auxiliary memory connected to the auxiliary accelerator.
  10. A data processing method, for a first processor in a heterogeneous system, the heterogeneous system further comprising: a first accelerator connected to the first processor, and a first auxiliary memory connected to the first accelerator; the method comprising:
    writing to-be-processed data into the first auxiliary memory;
    triggering the first accelerator to process the to-be-processed data in the first auxiliary memory according to a processing instruction;
    under the trigger of the first accelerator, reading the processing result of the to-be-processed data from the first auxiliary memory.
  11. The method according to claim 10, wherein the heterogeneous system comprises: multiple processors connected to each other, multiple accelerators connected to each other, and multiple auxiliary memories connected to the multiple accelerators one to one; the processing instruction carries an accelerator identifier and an identifier of the first processor, the accelerator identifier being the identifier of the accelerator in the heterogeneous system that is to execute the processing instruction;
    the reading, under the trigger of the first accelerator, of the processing result of the to-be-processed data from the first auxiliary memory comprises:
    when the accelerator identifier is the identifier of the first accelerator, under the trigger of the first accelerator, reading the processing result of the to-be-processed data from the first auxiliary memory;
    and the method further comprises:
    when the accelerator identifier is the identifier of an auxiliary accelerator connected to the first accelerator, under the trigger of the auxiliary accelerator, reading the processing result from the auxiliary memory connected to the auxiliary accelerator.
  12. A data processing apparatus, for a first accelerator in a heterogeneous system, the heterogeneous system further comprising: a first processor and a first auxiliary memory connected to the first accelerator; the data processing apparatus comprising:
    a processing module, configured to, under the trigger of the first processor, process the to-be-processed data in the first auxiliary memory according to a processing instruction;
    a writing module, configured to write the processing result of the to-be-processed data into the first auxiliary memory;
    a trigger module, configured to trigger the first processor to read the processing result from the first auxiliary memory.
  13. The data processing apparatus according to claim 12, wherein the heterogeneous system comprises multiple interconnected accelerators, the first accelerator being any one of the multiple accelerators; the processing instruction carries an accelerator identifier, the accelerator identifier being the identifier of the accelerator in the heterogeneous system that is to execute the processing instruction;
    the processing module is configured to process the to-be-processed data in the first auxiliary memory according to the processing instruction when the accelerator identifier is the identifier of the first accelerator.
  14. The data processing apparatus according to claim 13, wherein the heterogeneous system comprises: multiple auxiliary memories connected to the multiple accelerators in one-to-one correspondence, and multiple processors connected to each other, the first processor being any processor among the multiple processors connected to the first accelerator; the processing instruction further carries an identifier of the first processor;
    the writing module is further configured to, when the accelerator identifier is not the identifier of the first accelerator, write the to-be-processed data into the auxiliary memory connected to the auxiliary accelerator indicated by the accelerator identifier;
    the trigger module is further configured to trigger the auxiliary accelerator to process the to-be-processed data according to the processing instruction.
  15. A data processing apparatus, for an auxiliary accelerator in a heterogeneous system, the heterogeneous system comprising: multiple processors connected to each other, multiple accelerators connected to each other, and multiple auxiliary memories connected to the multiple accelerators one to one; the auxiliary accelerator and a first accelerator being any two connected accelerators among the multiple accelerators; the data processing apparatus comprising:
    a processing module, configured to, under the trigger of the first accelerator, process the to-be-processed data in the auxiliary memory connected to the auxiliary accelerator according to a processing instruction, the processing instruction carrying an identifier of the first processor connected to the first accelerator;
    a writing module, configured to write the processing result of the to-be-processed data into the connected auxiliary memory;
    a trigger module, configured to, according to the identifier of the first processor carried in the processing instruction, trigger the first processor to read the processing result from the auxiliary memory connected to the auxiliary accelerator.
  16. A data processing apparatus, for a first processor in a heterogeneous system, the heterogeneous system further comprising: a first accelerator connected to the first processor, and a first auxiliary memory connected to the first accelerator; the data processing apparatus comprising:
    a writing module, configured to write to-be-processed data into the first auxiliary memory;
    a trigger module, configured to trigger the first accelerator to process the to-be-processed data in the first auxiliary memory according to a processing instruction;
    a reading module, configured to, under the trigger of the first accelerator, read the processing result of the to-be-processed data from the first auxiliary memory.
  17. The data processing apparatus according to claim 16, wherein the heterogeneous system comprises: multiple processors connected to each other, multiple accelerators connected to each other, and multiple auxiliary memories connected to the multiple accelerators one to one; the processing instruction carries an accelerator identifier and an identifier of the first processor, the accelerator identifier being the identifier of the accelerator in the heterogeneous system that is to execute the processing instruction;
    the reading module is configured to:
    when the accelerator identifier is the identifier of the first accelerator, under the trigger of the first accelerator, read the processing result of the to-be-processed data from the first auxiliary memory;
    when the accelerator identifier is the identifier of an auxiliary accelerator connected to the first accelerator, under the trigger of the auxiliary accelerator, read the processing result from the auxiliary memory connected to the auxiliary accelerator.
PCT/CN2021/086703 2020-04-22 2021-04-12 数据处理方法及装置、异构系统 WO2021213209A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21792489.3A EP4120094A4 (en) 2020-04-22 2021-04-12 DATA PROCESSING METHOD AND DEVICE AND HETEROGENEUS SYSTEM
US18/046,151 US20230114242A1 (en) 2020-04-22 2022-10-12 Data processing method and apparatus and heterogeneous system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010323587.5 2020-04-22
CN202010323587.5A CN113535611A (zh) 2020-04-22 2020-04-22 数据处理方法及装置、异构系统

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/046,151 Continuation US20230114242A1 (en) 2020-04-22 2022-10-12 Data processing method and apparatus and heterogeneous system

Publications (1)

Publication Number Publication Date
WO2021213209A1 true WO2021213209A1 (zh) 2021-10-28

Family

ID=78094115

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/086703 WO2021213209A1 (zh) 2020-04-22 2021-04-12 Data processing method and apparatus, and heterogeneous system

Country Status (4)

Country Link
US (1) US20230114242A1 (zh)
EP (1) EP4120094A4 (zh)
CN (1) CN113535611A (zh)
WO (1) WO2021213209A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230185740A1 (en) * 2021-12-10 2023-06-15 Samsung Electronics Co., Ltd. Low-latency input data staging to execute kernels
US11989142B2 (en) 2021-12-10 2024-05-21 Samsung Electronics Co., Ltd. Efficient and concurrent model execution

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794100A * 2015-05-06 2015-07-22 Xidian University Heterogeneous multi-core processing system based on network-on-chip
CN106502782A * 2015-09-04 2017-03-15 MediaTek Inc. Heterogeneous computing system and method thereof
US9648102B1 * 2012-12-27 2017-05-09 Iii Holdings 2, Llc Memcached server functionality in a cluster of data processing nodes
CN107122243A * 2017-04-12 2017-09-01 Hangzhou Yuansuan Cloud Computing Co., Ltd. Heterogeneous cluster system for CFD simulation computing and CFD computing method
CN109308280A * 2017-07-26 2019-02-05 Hangzhou Huawei Digital Technologies Co., Ltd. Data processing method and related device
WO2019095154A1 * 2017-11-15 2019-05-23 Huawei Technologies Co., Ltd. Method and apparatus for scheduling acceleration resources, and acceleration system
CN110069439A * 2018-01-24 2019-07-30 Intel Corporation Device authentication
CN111026363A * 2018-10-09 2020-04-17 Intel Corporation Hardware/software co-design of heterogeneous computing architecture for autonomous driving
CN111198839A * 2018-11-16 2020-05-26 Samsung Electronics Co., Ltd. Storage device, method of operating the same, and controller

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220056986A * 2020-10-29 2022-05-09 Samsung Electronics Co., Ltd. Memory expander, heterogeneous computing device, and method of operating heterogeneous computing device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4120094A4 *

Also Published As

Publication number Publication date
US20230114242A1 (en) 2023-04-13
CN113535611A (zh) 2021-10-22
EP4120094A1 (en) 2023-01-18
EP4120094A4 (en) 2023-09-13

Similar Documents

Publication Publication Date Title
US9720714B2 (en) Accelerator functionality management in a coherent computing system
EP3028162B1 (en) Direct access to persistent memory of shared storage
US8131814B1 (en) Dynamic pinning remote direct memory access
US11500797B2 (en) Computer memory expansion device and method of operation
WO2021213209A1 (zh) Data processing method and apparatus, and heterogeneous system
JP2001167077A (ja) Data access method in network system, network system, and recording medium
TWI703501B (zh) Multi-processor system with distributed mailbox architecture and communication method thereof
JP6514329B2 (ja) Memory access method, switch, and multiprocessor system
US10635589B2 (en) System and method for managing transactions
US9864687B2 (en) Cache coherent system including master-side filter and data processing system including same
JP2008503003A (ja) Direct processor cache access in a system having a coherent multiprocessor protocol
US9170963B2 (en) Apparatus and method for generating interrupt signal that supports multi-processor
WO2023165319A1 (zh) Memory access method and apparatus, and input/output memory management unit
WO2017101080A1 (zh) Method for processing write request, processor, and computer
JP2006260159A (ja) Information processing apparatus and data control method in information processing apparatus
WO2020038466A1 (zh) Data prefetching method and apparatus
JP2003281079A (ja) Bus interface selection based on page table attributes
WO2022133656A1 (zh) Data processing apparatus and method, and related device
US11275707B2 (en) Multi-core processor and inter-core data forwarding method
WO2021082877A1 (zh) Method and apparatus for accessing solid-state drive
US11874783B2 (en) Coherent block read fulfillment
CN115174673B (zh) Data processing apparatus having low-latency processor, data processing method, and device
US11803470B2 (en) Multi-level cache coherency protocol for cache line evictions
US11341069B2 (en) Distributed interrupt priority and resolution of race conditions
EP4086774A1 (en) Coherent memory system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21792489

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021792489

Country of ref document: EP

Effective date: 20221014

NENP Non-entry into the national phase

Ref country code: DE