WO2023103391A1 - Stream processing method, processing circuit and electronic device - Google Patents

Stream processing method, processing circuit and electronic device

Info

Publication number
WO2023103391A1
WO2023103391A1 PCT/CN2022/107083 CN2022107083W WO2023103391A1 WO 2023103391 A1 WO2023103391 A1 WO 2023103391A1 CN 2022107083 W CN2022107083 W CN 2022107083W WO 2023103391 A1 WO2023103391 A1 WO 2023103391A1
Authority
WO
WIPO (PCT)
Prior art keywords
processing
segment
stream
dimension
program
Prior art date
Application number
PCT/CN2022/107083
Other languages
English (en)
French (fr)
Inventor
王磊
李甲
徐立宝
葛建明
彭永超
袁红岗
仇小钢
Original Assignee
海飞科(南京)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 海飞科(南京)信息技术有限公司
Publication of WO2023103391A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781On-chip cache; Off-chip memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10Address translation
    • G06F12/1009Address translation using page tables, e.g. page table structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/20Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal

Definitions

  • Embodiments of the present disclosure generally relate to the field of electronics, and more specifically relate to a method for stream processing, a processing circuit, an electronic device, a computer-readable storage medium, and a computer program product.
  • Parallel high-performance multi-threaded multi-core processing systems, such as graphics processing units (GPUs), process data much faster than before. These processing systems can decompose complex computations into smaller tasks that are processed in parallel by multiple cores, increasing processing efficiency and reducing processing time.
  • In some cases, processing circuitry such as a GPU is particularly well suited to processing tensors that contain large amounts of data of the same or similar form.
  • In the computer field, tensor data usually represents one-dimensional or multi-dimensional array data.
  • For example, image data is a typical two-dimensional tensor, which can be represented by a two-dimensional array.
  • When processing image data, multiple processing circuits, or multiple processing cores (or processing engines) within a processing circuit, may process different parts of the image data in parallel to reduce processing time.
  • For processing circuits such as GPUs, on-chip memory typically provides faster access speeds.
  • However, the storage space of on-chip memory is often limited. As a result, tensor operations performed by the processing circuit may require frequent accesses to an external memory (also called off-chip memory), which severely affects the efficiency of the tensor operations.
  • Embodiments of the present disclosure provide a solution for stream processing.
  • In a first aspect, a method for stream processing is provided. The method includes running, by a stream processor, a stream processing program to cause a set of pages to be loaded into an on-chip memory as a segment having at least one dimension, where one of the at least one dimension is designated by the stream processing program as a stream processing dimension; and running, by a processing engine, a kernel program to process at least part of the segment, where the at least part is determined based on an offset position of the segment in the stream processing dimension.
  • In some embodiments, running the stream processing program by the stream processor to cause the set of pages to be loaded into the on-chip memory as a segment having at least one dimension comprises: running the stream processing program by the stream processor to send a first group of load instructions to a direct memory access (DMA) controller, the first group of load instructions being used to load the set of pages into the on-chip memory.
  • In some embodiments, a target page in the set of pages is associated with a first counter indicating the loading status of the target page and a second counter indicating the number of processing engines referencing the target page.
  • In some embodiments, the method further includes: in response to completion of loading the target page into the on-chip memory, updating a value of the first counter to indicate that loading of the target page is complete.
  • In some embodiments, the method further includes: in response to the first counter of the target page indicating that loading of the target page is complete and the second counter indicating that no processing engine references the target page, determining that the space in the on-chip memory corresponding to the target page can be used to load new data from the off-chip memory.
  • In some embodiments, the at least one dimension comprises a plurality of dimensions, and the method further comprises: determining the at least part of the segment based on the offset position and a starting anchor point of the processing engine in a non-stream processing dimension of the plurality of dimensions.
  • In some embodiments, the set of pages is a first set of pages, and the method further includes: in response to an update instruction in the kernel program, running the stream processing program by the stream processor to cause a second set of pages to be loaded into the space in the on-chip memory corresponding to the at least part of the segment.
  • In some embodiments, the segment is a first segment, the offset position is a first offset position, and the update instruction indicates an update offset position in the stream processing dimension determined by the kernel program, the method further comprising: in response to determining that the update offset position exceeds a boundary of the segment in the stream processing dimension, running the stream processing program by the stream processor to define a second segment, the second segment comprising at least the second set of pages; and running, by the processing engine, the kernel program to process at least part of the second segment based on a second offset position of the second segment in the stream processing dimension.
  • In some embodiments, the method further comprises: in response to determining that the update offset position does not exceed the boundary of the segment in the stream processing dimension, running, by the processing engine, the kernel program to process another portion of the first segment, wherein the other portion is determined based on the update offset position.
  • In some embodiments, the method further comprises: in response to determining that the segment is marked as a terminating segment in the stream processing program, terminating the stream processing process associated with the stream processing dimension after execution of the target instruction associated with the terminating segment in the stream processing program is complete.
  • In a second aspect of the present disclosure, a processing circuit is provided, comprising an on-chip memory, a stream processor, and a processing engine. The processing circuit is configured to perform any method of the first aspect and its implementations.
  • In a third aspect of the present disclosure, an electronic device is provided, which includes a processing circuit configured to perform any method of the first aspect and its implementations.
  • In a fourth aspect of the present disclosure, a computer-readable storage medium is provided, which stores instructions that, when executed by a processing circuit, cause the processing circuit to perform any method of the first aspect and its implementations.
  • In a fifth aspect of the present disclosure, a computer program product is provided, comprising instructions that, when executed by a processing circuit, cause the processing circuit to perform any method of the first aspect and implementations thereof.
  • It can be understood that the processing circuit of the second aspect, the electronic device of the third aspect, the computer storage medium of the fourth aspect, and the computer program product of the fifth aspect can all be used to perform the method provided in the first aspect. Therefore, the explanations or illustrations regarding the first aspect also apply to the second, third, fourth, and fifth aspects.
  • In addition, for the beneficial effects achievable by the second, third, fourth, and fifth aspects, reference may be made to the beneficial effects of the corresponding methods, which are not repeated here.
  • Figure 1 shows a schematic diagram of an example environment in which various embodiments of the present disclosure can be implemented
  • Figure 2 shows a schematic block diagram of a processing circuit according to some embodiments of the present disclosure
  • Figure 3 shows a schematic block diagram of a three-dimensional tensor according to some embodiments of the present disclosure
  • Figure 4 shows a flowchart of an example process of a stream processing method according to some embodiments of the present disclosure
  • Figure 5 shows a schematic diagram of processing segments according to a stream processing scheme according to some embodiments of the present disclosure.
  • Figures 6A and 6B show schematic diagrams of data loading according to some embodiments of the present disclosure.
  • As used herein, the term "comprise" and its variants denote open-ended inclusion, i.e., "including but not limited to".
  • the term “or” means “and/or” unless otherwise stated.
  • the term “based on” means “based at least in part on”.
  • the terms “one example embodiment” and “one embodiment” mean “at least one example embodiment.”
  • the term “another embodiment” means “at least one further embodiment”.
  • the terms “first”, “second”, etc. may refer to different or the same object. Other definitions, both express and implied, may also be included below.
  • As mentioned above, although the on-chip memory of a processing circuit provides faster access, its size is usually limited.
  • Traditional memory management relies on developers to manage the loading of data into the on-chip memory, which may result in a large number of off-chip memory accesses for certain operations of the processing circuit. This in turn may significantly reduce the operational efficiency of the processing circuit and lead to higher power consumption.
  • In some embodiments of the present disclosure, loading of the on-chip memory is implemented using stream processing, which allows the memory-loading process to overlap with the computation performed by the processing engines. This can significantly improve the utilization of the on-chip memory and thus effectively reduce off-chip memory accesses.
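  • As a rough illustration of this overlap, the sketch below is a minimal, hypothetical C++ model (the names, the use of std::async, and the single-transfer-ahead policy are assumptions for exposition, not the disclosed hardware interface): while the kernel-like process() works on the current page, the load of the next page is already in flight, which is the load/compute overlap described above.

```cpp
// overlap_sketch.cpp -- illustrative double-buffering model of load/compute overlap.
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

using Page = std::vector<int>;

Page load_from_off_chip(int page_index) {            // stands in for a DMA transfer
    return Page(1024, page_index);
}

long long process(const Page& p) {                   // stands in for the kernel's work
    return std::accumulate(p.begin(), p.end(), 0LL);
}

int main() {
    const int num_pages = 8;
    long long total = 0;
    // Prefetch page 0, then keep one transfer in flight while computing.
    std::future<Page> next = std::async(std::launch::async, load_from_off_chip, 0);
    for (int i = 0; i < num_pages; ++i) {
        Page current = next.get();                    // waits only if the load lags behind
        if (i + 1 < num_pages)
            next = std::async(std::launch::async, load_from_off_chip, i + 1);
        total += process(current);                    // compute overlaps with the next load
    }
    std::cout << "total = " << total << "\n";
}
```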
  • FIG. 1 shows a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented.
  • Example environment 100 may be, for example, an electronic device with computing capabilities, such as a computer.
  • example environment 100 includes, for example, central processing unit (CPU) 20 , system memory 10 , north/memory bridge 30 , accelerator subsystem 40 , device memory 50 , and south/input-output (IO) bridge 60 .
  • System memory 10 may be, for example, a volatile memory such as dynamic random access memory (DRAM).
  • The north bridge/memory bridge 30, for example, integrates a memory controller, a PCIe controller, and the like, and is responsible for data exchange between the CPU 20 and high-speed interfaces as well as for bridging the CPU 20 and the south bridge/IO bridge 60.
  • The south bridge/IO bridge 60 is used for the computer's low-speed interfaces, such as Serial Advanced Technology Attachment (SATA) controllers and the like.
  • the accelerator subsystem 40 may include, for example, devices or chips such as a graphics processing unit (GPU) and an artificial intelligence (AI) accelerator for accelerated processing of data such as graphics and video. In this disclosure, accelerator subsystem 40 may also be referred to as "processing circuitry.”
  • device memory 50 may be, for example, volatile memory, such as DRAM, external to accelerator subsystem 40 .
  • device memory 50 is also referred to as off-chip memory, ie, memory located outside the chip of accelerator subsystem 40 .
  • In contrast, the chip of the accelerator subsystem 40 also contains volatile memory, such as a first-level (L1) cache and an optional second-level (L2) cache, which can be collectively referred to as "on-chip memory".
  • It should be understood that although FIG. 1 shows an example environment 100 in which embodiments of the disclosure can be implemented, the disclosure is not limited thereto. Some embodiments of the present disclosure may also be used in other application environments that have an accelerator subsystem such as a GPU, for example ARM architectures and RISC-V architectures.
  • FIG. 2 shows a schematic block diagram of a processing circuit 200 according to an embodiment of the present disclosure.
  • the processing circuit 200 may be, for example, a specific implementation of the chip of the accelerator subsystem 40 in FIG. 1 .
  • the processing circuit 200 is, for example, a processing circuit chip such as a GPU.
  • In one embodiment, the processing circuit 200 includes a stream processor (SP) 210, a page table device 220, a processing engine (PE) unit 230, a direct memory access (DMA) controller 240, an L1 cache 260, and an L2 cache 250.
  • the processing circuit 200 is controlled by a host device such as the CPU 20, and receives instructions from the CPU 20.
  • the SP 210 analyzes instructions from the CPU 20, and assigns the analyzed operations to the PE unit 230, the page table device 220, and the DMA controller 240 for processing.
  • the page table device 220 is used to manage the on-chip virtual storage of the processing circuit 200 .
  • L2 cache 250 and off-chip memory such as device memory 50 in FIG. 1 constitute a virtual storage system.
  • the page table device 220 is jointly maintained by the SP 210, the PE unit 230 and the DMA controller 240.
  • the PE unit 230 includes a plurality of processing engines (processing engine, PE) PE_1, PE_2...PE_N, where N represents an integer greater than 1.
  • Each PE in PE unit 230 may be a single instruction multiple thread (SIMT) device.
  • In a PE, each thread can have its own register file, and all threads of each PE also share a uniform register file.
  • Multiple PEs can perform the same or different processing tasks in parallel, and can perform, in parallel, the address translation described below and the access to target data in memory, thereby reducing processing time. It can be understood that the target elements processed by different PEs are not the same, and the segment, page, and cache line in which a target element is located, as well as the element's attributes, size, and dimension ordering, may differ, as described in detail below.
  • Each thread can perform thread-level data exchange between its own register file and the memory subsystem.
  • Each thread has its own arithmetic logic execution unit and uses its own storage address, which adopts a typical register access architecture (load-store architecture).
  • Each execution unit includes a floating-point/fixed-point unit supporting multiple data types and an arithmetic logic unit.
  • In one embodiment, the processing circuit 200 in FIG. 2 may, for example, perform the following operations: 1) construct the page table entry contents and initial state; 2) move data from off-chip memory, such as the device memory 50 in FIG. 1, to on-chip memory, such as the L2 cache 250; 3) start and execute a program; 4) define each segment and describe the properties of the tensor and of the storage; and 5) when program execution is complete, write the execution result data to the off-chip memory.
  • It can be understood that, in the disclosed embodiments, the data processed by the processing circuit 200 is mainly multi-dimensional tensors.
  • For example, in one embodiment the tensor may be a four-dimensional tensor having four dimensions D1, D2, D3, and D4, and the tensor may have a different size in each dimension.
  • In other embodiments, the tensor may be a one-dimensional, two-dimensional, three-dimensional, or higher-dimensional tensor, which the present disclosure does not limit.
  • In addition, in the disclosed embodiments, the tensor may internally support element types such as uint8, int8, bfloat16, float16, uint16, int16, float32, int32, uint32, and other custom element types, which the present disclosure also does not limit.
  • For tensor addressing, the element is the basic unit. For example, if the element type is int8, the basic addressing unit is one byte; if the element type is int16, the basic addressing unit is two bytes; and so on.
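  • As a concrete illustration of this element-based addressing, the following minimal C++ sketch (the function name and the choice of types are illustrative assumptions, not part of the disclosure) shows how the byte offset of an element scales with the element type's size:

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>

// Byte offset of the element at linear index `i` when addressing is done in
// elements of type T: the offset scales with sizeof(T).
template <typename T>
std::size_t byte_offset(std::size_t i) { return i * sizeof(T); }

int main() {
    std::cout << byte_offset<int8_t>(10)  << "\n";   // 10: int8 elements are one byte
    std::cout << byte_offset<int16_t>(10) << "\n";   // 20: int16 elements are two bytes
    std::cout << byte_offset<float>(10)   << "\n";   // 40: float32 elements are four bytes
}
```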
  • In some cases the amount of data contained in a tensor may be large while the capacity of the L2 cache 250 is limited, so the whole tensor cannot be loaded into the on-chip L2 cache 250. In some embodiments of the present disclosure, to facilitate parallel processing of the tensor, the tensor may be divided into at least one segment. In the case where the tensor contains only one segment, the tensor is the segment; in the case where the tensor contains multiple segments, each segment is a part of the tensor.
  • The CPU 20 can specify, by an instruction, which PE processes each part of a segment.
  • FIG. 3 shows a schematic block diagram of a three-dimensional tensor 300 according to an embodiment of the present disclosure.
  • the three-dimensional tensor 300 has three dimensions D1, D2, and D3, and includes a first segment S1, a second segment S2, and a third segment S3.
  • CPU 20 may specify that the tensor elements of segment S1 be processed by PE_1, PE_2, PE_3, PE_4, PE_5, PE_6, PE_7, and PE_8.
  • the CPU 20 also specifies that the tensor elements of the second segment S2 are processed by PE_1-PE_4.
  • each segment may have a different size, so programmers can flexibly configure segments based on design needs.
  • page division can be implemented on any one or more dimensions, and the number of pages divided on each dimension is independent of each other.
  • In one embodiment, tensor data may be stored in on-chip high-speed memory, such as the L2 cache 250.
  • However, because the capacity of the on-chip high-speed memory is small, the programmer may divide a large tensor into multiple segments, each describing a part of the tensor. The kernel program (kernel) can be started multiple times, and each time the DMA controller 240 moves one segment of the tensor from off-chip storage to on-chip storage in advance for the kernel to operate on. After the kernel has been started multiple times, all segments contained in the tensor have been processed and the entire run ends.
  • When the on-chip high-speed memory is sufficient to accommodate all the tensors the kernel will access, a tensor needs only one segment description, and the kernel needs to be started only once.
  • Further, within a segment, at least one page may also be set to further subdivide the tensor.
  • For example, in the first segment S1 there are four pages P[1], P[2], P[3], and P[4].
  • The second segment S2 has only one page.
  • In the disclosed embodiments, the number of pages in each segment can differ, so programmers can flexibly configure the size of the pages in a segment based on design requirements; for example, a page may be configured so that it fits into the L2 cache 250 in its entirety.
  • As noted above, the smallest addressing unit is the element, and a page can usually contain multiple elements.
  • The page in which a target element is located is referred to herein as a "target element page".
  • a page may include multiple cache lines.
  • When the target element page is located in the L2 cache 250, if a PE reads the target element via the L1 cache 260, the L2 cache 250 needs to transfer a small portion of physically contiguous data that includes the target element to the L1 cache 260 as a whole. This small portion of data is also called cache line data, and this caching mechanism is based on the principle of spatial locality. It takes only a few clock cycles for a PE to read data from the L1 cache 260, but it may take dozens or even hundreds of clock cycles for the L1 cache 260 to read data from the L2 cache 250. Therefore, it is desirable to reduce the number of times the L1 cache 260 reads data from the L2 cache 250.
  • Although the term "cache line" is used here to describe the minimum unit of data transferred from the L2 cache 250 to the L1 cache 260, in this disclosure this portion of data is not necessarily arranged in rows or columns; the data inside a "cache line" may be distributed over multiple dimensions, and the size of the data along each dimension is not limited to 1.
  • The PEs process the data within a segment in parallel, and the allocation of PEs is carried out in the logical address space of the data, independently of the physical storage structure of the segment, as described below.
  • the first group of cache lines in the first page P[1] is designated to be processed by PE_1, and the second group of cache lines is designated to be processed by PE_2.
  • Although the tensor is shown here as being processed by multiple PEs in sequence, it can be understood that the processing of tensor data is independent of the PE order, which the present disclosure does not limit. For example, the portion shown for PE_2 in FIG. 3 may instead be processed by PE_M, where M represents any integer not greater than N.
  • FIG. 4 shows a flowchart of a stream processing method 400 according to some embodiments of the present disclosure.
  • the method 400 may be implemented, for example, by the processing circuit 200 (or the accelerator subsystem 40) such as a GPU, so the various aspects described above with respect to FIGS. 1 to 3 may be selectively applied to the method 400.
  • At block 402, the stream processing program is run by the stream processor 210 to cause a set of pages to be loaded into the on-chip memory as a segment having at least one dimension, where one of the at least one dimension is designated by the stream processing program as the stream processing dimension.
  • stream processor 210 may run a stream processing program, which may include, for example, a set of SP instructions.
  • the set of SP instructions may include, for example, load instructions (LOAD instructions) for loading pages from off-chip memory.
  • A LOAD instruction may, for example, specify the off-chip memory address of the page to be loaded and the on-chip memory address to which the page is to be written.
  • execution of the LOAD instruction is non-blocking in a stream processing program.
  • the stream processor 210 may send an instruction to the DMA controller 240 to load data from the corresponding off-chip memory address to the corresponding on-chip memory address. After the stream processor 210 finishes sending the instruction, the stream processor 210 can execute the next instruction in the stream processing program without waiting for the completion of loading of data.
  • In some embodiments, when executing the LOAD instruction for a target page, the processing circuit 200 may also set a corresponding first counter for the target page.
  • When the LOAD instruction is executed, the first counter may be set to a first value (for example, 1) to indicate that the target page has not finished loading.
  • When the DMA controller 240 completes loading of the target page, the value of the first counter may be updated to a second value (for example, 0) to indicate that the corresponding target page has finished loading.
  • In some embodiments, when executing the LOAD instruction for the target page, the processing circuit 200 may also set a second counter for the target page to indicate the number of processing engines (PEs) referencing the target page. For example, when multiple PEs are used to jointly process the target page, the initial value of the second counter may be set to the number of those PEs.
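  • The sketch below models this bookkeeping in plain C++; the structure, the field names, and the use of a host thread as a stand-in for the DMA engine are assumptions made purely for illustration and do not reflect the disclosed circuit's actual interface. It shows the non-blocking LOAD (the issuing side returns immediately), the first counter cleared on completion of the transfer, and the second counter decremented as processing engines release the page:

```cpp
// page_counters_sketch.cpp -- toy model of the per-page load/reference counters.
#include <atomic>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <thread>
#include <vector>

struct PageDescriptor {
    const uint8_t* off_chip = nullptr;    // source region (off-chip memory)
    uint8_t*       on_chip  = nullptr;    // destination region (on-chip memory)
    std::size_t    bytes    = 0;
    std::atomic<int> load_pending{1};     // "first counter": 1 = load in flight, 0 = loaded
    std::atomic<int> pe_refs{0};          // "second counter": PEs still referencing the page
};

// Non-blocking LOAD: hand the copy to a background "DMA" thread and return at once;
// the completion path clears the first counter.
std::thread issue_load(PageDescriptor& p, int referencing_pes) {
    p.load_pending.store(1);
    p.pe_refs.store(referencing_pes);
    return std::thread([&p] {
        std::memcpy(p.on_chip, p.off_chip, p.bytes);
        p.load_pending.store(0);          // loading of the target page is complete
    });
}

// Called when a PE's update instruction advances past this page.
void release_reference(PageDescriptor& p) { p.pe_refs.fetch_sub(1); }

// The page's on-chip space may be reused for new data once it has finished loading
// and no processing engine references it any longer.
bool reusable(const PageDescriptor& p) {
    return p.load_pending.load() == 0 && p.pe_refs.load() == 0;
}

int main() {
    std::vector<uint8_t> off_chip(4096, 7), on_chip(4096, 0);
    PageDescriptor page;
    page.off_chip = off_chip.data();
    page.on_chip  = on_chip.data();
    page.bytes    = off_chip.size();

    auto dma = issue_load(page, /*referencing_pes=*/2);
    // ... the stream processor would execute its next instruction here ...
    dma.join();
    release_reference(page);
    release_reference(page);
    std::cout << "reusable = " << reusable(page) << "\n";   // prints: reusable = 1
}
```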
  • In some embodiments, the stream processing program may include a segment definition instruction for defining a segment.
  • a segment definition instruction may indicate one or more pages constituting the segment. Additionally, the segment definition instruction may also indicate one or more dimensions included in the segment and the size on the one or more dimensions.
  • the segment definition instruction may also indicate a dimension as a stream processing dimension.
  • FIG. 5 shows a schematic diagram of processing a segment 500 according to a stream processing scheme according to some embodiments of the present disclosure.
  • As shown in FIG. 5, the defined segment 500 may include, for example, 64 pages spanning dimension 0, dimension 1, and dimension 2.
  • the stream processor 210 may specify dimension 1 as the stream processing dimension through a segment definition instruction.
  • Continuing with FIG. 4, at block 404, the processing engine PE runs a kernel program to process at least part of the segment, where the at least part is determined based on the offset position of the segment in the stream processing dimension.
  • the stream processor 210 may send the information of the defined segment to the processing engine to run a kernel program to process at least part of the segment.
  • the information may include, for example, the stream processing dimension indicated by the stream processing program, and the segment's offset position in the stream processing dimension.
  • each processing engine PE may be assigned a starting anchor in a non-stream processing dimension. Further, the processing engine may determine the initial page processed by the processing engine based on the corresponding starting anchor point and the starting offset indicated by the segment definition instruction in the stream processing program.
  • Taking FIG. 5 as an example, if the starting anchor points of the processing engine in dimension 0 and dimension 2 are, for example, (0, 0), the processing engine can determine from the offset position "0" that its initial page is (0, 0, 0), that is, "page P".
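  • A minimal sketch of this index computation is shown below (the field names and the 3-D layout are assumptions chosen to match the FIG. 5 example, not the actual segment descriptor format): the PE's anchor supplies the coordinates in the non-stream dimensions, and the segment's current offset supplies the coordinate in the stream processing dimension.

```cpp
// starting_page_sketch.cpp -- combining the PE anchor with the stream offset.
#include <array>
#include <iostream>

struct SegmentDef {
    std::array<int, 3> pages_per_dim;   // e.g. {4, 4, 4} for the 64-page segment of FIG. 5
    int stream_dim;                     // which dimension is the stream processing dimension
    int stream_offset;                  // current offset position along that dimension
};

// Build the 3-D index of the PE's starting page from its anchor in the two
// non-stream dimensions and the segment's offset in the stream dimension.
std::array<int, 3> starting_page(const SegmentDef& seg, std::array<int, 2> anchor) {
    std::array<int, 3> idx{};
    int a = 0;
    for (int d = 0; d < 3; ++d)
        idx[d] = (d == seg.stream_dim) ? seg.stream_offset : anchor[a++];
    return idx;
}

int main() {
    SegmentDef seg{{4, 4, 4}, /*stream_dim=*/1, /*stream_offset=*/0};
    auto idx = starting_page(seg, {0, 0});    // anchor (0, 0) in dimensions 0 and 2
    std::cout << "(" << idx[0] << "," << idx[1] << "," << idx[2] << ")\n";  // (0,0,0)
}
```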
  • the processing engine may determine whether the page has finished loading based on the value of the first counter for the initial page. If the page has finished loading, the processing engine can perform operations using the page's data in the on-chip memory.
  • the processing engine can run a kernel program and can sequentially execute operations on one or more pages by defining a loop program.
  • In some embodiments, the kernel program may also include an update instruction, which may, for example, be located before the start of the loop in the kernel program or after the loop ends.
  • The update instruction may, for example, indicate the update offset position, in the stream processing dimension, of the next page to be processed by the kernel program.
  • Taking FIG. 5 as an example, if the pages to be processed by the processing engine are all the pages whose offset position in the stream processing dimension is "0", then when the next page to be processed is "page P+4", the kernel program in the processing engine may execute an update instruction indicating an update offset position of "1" in the stream processing dimension.
  • It should be understood that any appropriate processing step size may be defined in the kernel program, which may also result in the offset position in the stream processing dimension not being updated.
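  • The following C++ sketch is one hypothetical way a kernel pass could be structured (the loop order, the callbacks, and the fixed "advance by one" step are illustrative assumptions; how work is actually split among PEs is not modeled): process the pages that lie at the current stream offset, then report the next offset through the update path.

```cpp
// kernel_pass_sketch.cpp -- one pass over the pages at the current stream offset.
#include <array>
#include <functional>
#include <iostream>

struct StreamView {
    std::array<int, 3> pages_per_dim{4, 4, 4};   // segment shape in pages (FIG. 5 example)
    int stream_dim = 1;                          // dimension 1 is the stream processing dimension
    int stream_offset = 0;                       // current offset along the stream dimension
};

// Walk the pages at the current stream offset starting from the PE's anchor,
// then signal the update offset position (the "update instruction").
void kernel_pass(const StreamView& v, std::array<int, 2> anchor,
                 const std::function<void(std::array<int, 3>)>& process_page,
                 const std::function<void(int)>& update_offset) {
    std::array<int, 3> idx{};
    idx[v.stream_dim] = v.stream_offset;
    for (int d0 = anchor[0]; d0 < v.pages_per_dim[0]; ++d0)
        for (int d2 = anchor[1]; d2 < v.pages_per_dim[2]; ++d2) {
            idx[0] = d0;
            idx[2] = d2;
            process_page(idx);
        }
    update_offset(v.stream_offset + 1);          // ask for the next slice of pages
}

int main() {
    kernel_pass(StreamView{}, {0, 0},
                [](std::array<int, 3> p) {
                    std::cout << "process page (" << p[0] << "," << p[1] << "," << p[2] << ")\n";
                },
                [](int next) { std::cout << "update offset -> " << next << "\n"; });
}
```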
  • In some embodiments, in response to an update instruction in the kernel program, the stream processor may update the second counters of the prior pages whose processing has been completed.
  • In the present disclosure, a "prior page" means a page whose offset position in the stream processing dimension is smaller than the update offset position.
  • Taking FIG. 5 as an example, when the update offset position is "1", the stream processor may update the second counters of all pages whose offset positions in the stream processing dimension lie between "0" and "1".
  • As discussed above, the initial value of the second counter is set to the number of processing engines referencing the page; when one of those processing engines issues an update instruction, the value of the second counter for these pages is decremented.
  • When the value of the second counter has been decremented to zero, this may indicate that no processing engine still references the page.
  • In some embodiments, when the first counter of a page indicates that the page has finished loading and the second counter indicates that no processing engine still references the page, the stream processor 210 may determine that the space corresponding to that page in the on-chip memory can be used to load new data from the off-chip memory.
  • Specifically, the stream processing program in the stream processor 210 may, after the instructions defining the segment 500 shown in FIG. 5 (referred to as the first set of instructions for ease of description), also include instructions for defining other segments (referred to as the second set of instructions). Similar to the first set of instructions defining segment 500, the second set of instructions may include, for example, one or more LOAD instructions and a segment definition instruction.
  • In some embodiments, when the first counter of a particular page indicates that the page has finished loading and the second counter indicates that no processing engine still references the page, the LOAD instruction corresponding to that page in the second set of instructions may instruct the DMA controller to load new data into the space in the on-chip memory corresponding to that particular page.
  • Thereby, the stream processor 210 may run the stream processing program to cause the second set of pages to be loaded into the space in the on-chip memory corresponding to the at least part of the segment that the kernel program has already processed.
  • In some embodiments, the stream processor can determine whether the update offset position specified by the update instruction in the kernel program exceeds the boundary of the defined segment in the stream processing dimension. If the update offset position does not exceed that boundary, the processing engine can continue running the kernel program to process another part of the segment. Specifically, the processing engine can determine a new starting page according to the update offset position and its starting anchor point in the non-stream processing dimensions, so that one or more further pages in the segment can be processed by the loop in the program.
  • Taking FIG. 5 as an example, if the starting anchor points of the processing engine in dimension 0 and dimension 2 are, for example, (0, 0), the processing engine can determine from the update offset position "1" that the new starting page is (0, 1, 0), that is, "page P+4".
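  • Schematically, the decision made when an update offset arrives might look like the following C++ sketch (a toy model under assumed names; the real stream processor works on segment definition instructions rather than function calls): if the offset stays within the segment it becomes the new starting offset, and if it crosses the boundary the overshoot is carried over as the second offset position of the next segment.

```cpp
// boundary_check_sketch.cpp -- deciding between "continue in this segment"
// and "switch to the next segment" when an update offset is reported.
#include <iostream>

struct Segment {
    int stream_extent;   // number of offset positions along the stream processing dimension
};

// Returns true if processing continues inside the current segment; otherwise the
// caller must define/switch to the next segment, whose starting offset is written
// to `local_offset` (the overshoot past the boundary).
bool handle_update(const Segment& seg, int update_offset, int& local_offset) {
    if (update_offset < seg.stream_extent) {
        local_offset = update_offset;                 // new starting page within this segment
        return true;
    }
    local_offset = update_offset - seg.stream_extent; // second offset position in the next segment
    return false;
}

int main() {
    Segment seg500{4};                                // offsets 0..3, as in FIG. 5
    int offset = 0;
    std::cout << handle_update(seg500, 1, offset) << " offset=" << offset << "\n"; // 1 offset=1
    std::cout << handle_update(seg500, 4, offset) << " offset=" << offset << "\n"; // 0 offset=0
}
```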
  • In this way, embodiments of the present disclosure allow the computation in which a processing engine processes a new page to be at least partially parallelized with the loading process for pages that are no longer referenced, thereby improving the processing efficiency of the processing circuit.
  • In addition, the on-chip memory is loaded in a streaming manner, so that the data in the on-chip memory is dynamically updated for use by the processing engines.
  • From the perspective of a processing engine, the perceived size of the on-chip memory is therefore much larger than the actual size of the on-chip memory.
  • In some embodiments, if the update offset position exceeds the boundary of the segment (also referred to as the first segment) in the stream processing dimension, the stream processing program may be run by the stream processor to define a new segment including the second set of pages (also referred to as the second segment).
  • Taking FIG. 5 as an example, when the update offset position is "4", it exceeds the boundary "3" of the segment 500 in the stream processing dimension. At this point, the stream processor 210 may run a stream processing instruction to define a new segment.
  • FIG. 6A shows a schematic diagram 600A of data loading according to some embodiments of the present disclosure.
  • As shown in FIG. 6A, a matrix 600A in the off-chip memory 610 is stored discretely as three segments 612, 614, and 616.
  • Illustratively, the stream processor can define the three segments through segment definition instructions.
  • When the update offset position indicated by the update instruction in the kernel program exceeds the boundary of the first segment 612 in the stream processing dimension, the stream processor can run the instruction defining the second segment 614 and send the information of the second segment 614 to the kernel program of the processing engine for cyclic execution.
  • In this case, the processing engine will process the second segment 614 based on its new offset position in the stream processing dimension (also referred to as the second offset position).
  • In some embodiments, the second offset position may be determined from the offset of the indicated update offset position relative to the boundary of the first segment 612 in the stream processing dimension.
  • Further, as the kernel program continues to execute, it may again indicate an update offset position through an update instruction.
  • When that update offset position exceeds the boundary of the second segment 614 in the stream processing dimension, the stream processor 210 can further run the segment definition instruction defining the third segment 616 and send the information of the third segment 616 to the processing engine for processing by the kernel program.
  • In some embodiments, if the third segment 616 is the terminating segment of the stream processing process, the stream processor 210 may mark the third segment 616 as a terminating segment in the segment definition instruction. After the processing engine has run the kernel program to complete the operations on the third segment 616, the stream processor 210 terminates the stream processing process associated with the stream processing dimension. Specifically, the stream processing program will no longer provide information about a new segment, so that any further access by the processing engine to the segment will always obtain out-of-bounds data.
  • As a result, developers do not need to attend to the specific details of on-chip memory loading when developing kernel programs. Moreover, embodiments of the present disclosure allow the kernel program to keep executing without exiting, thereby further improving processing efficiency.
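  • The terminating-segment behavior can be pictured with the small C++ sketch below (the segment names and the boolean flag are assumptions for illustration; the actual marking is carried in the segment definition instruction): segments are handed to the kernel in order, and once the segment marked as terminating has been processed, no further segment information is provided.

```cpp
// terminating_segment_sketch.cpp -- sequencing segments until the terminating one.
#include <iostream>
#include <string>
#include <vector>

struct SegmentDefinition {
    std::string name;
    bool terminating;    // set by the stream processing program for the last segment
};

int main() {
    // Mirrors FIG. 6A: segments 612 and 614, then 616 marked as the terminating segment.
    std::vector<SegmentDefinition> program = {{"612", false}, {"614", false}, {"616", true}};

    for (const auto& seg : program) {
        std::cout << "kernel processes segment " << seg.name << "\n";
        if (seg.terminating) {
            std::cout << "terminating segment finished; stream processing ends\n";
            break;       // no information about any further segment is provided
        }
    }
}
```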
  • FIG. 6B shows a schematic diagram 600B of data loading according to some embodiments of the present disclosure.
  • In this example, the size of the on-chip memory 610 is limited, and the matrix 630 in the off-chip memory 620 cannot, for example, be loaded into the on-chip memory 610 all at once.
  • According to the stream processing approach of the present disclosure, the stream processor 210 can define three segments 632, 634, and 636 through the stream processing program, for processing by the processing engine using the kernel program.
  • Thereby, the three segments 632, 634, and 636 can be loaded into the on-chip memory in turn.
  • In this way, embodiments of the present disclosure can process data of any size, without regard to whether it can be loaded into the on-chip memory all at once.
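  • A sketch of such a split is shown below (all sizes, the row-wise tiling, and the names are invented for illustration; the disclosure does not prescribe how segments are chosen): a matrix larger than the on-chip budget is divided into consecutive segments, each small enough to be resident on-chip while it is processed.

```cpp
// segment_tiling_sketch.cpp -- dividing an oversized matrix into on-chip-sized segments.
#include <algorithm>
#include <iostream>

int main() {
    const long long matrix_bytes  = 3LL * 1024 * 1024;   // example size of matrix 630
    const long long on_chip_bytes = 1LL * 1024 * 1024;   // example on-chip budget per segment
    const long long row_bytes     = 4096;                // example size of one matrix row

    const long long rows_total       = matrix_bytes / row_bytes;
    const long long rows_per_segment = on_chip_bytes / row_bytes;

    int segment_id = 0;
    for (long long first = 0; first < rows_total; first += rows_per_segment) {
        const long long last = std::min(first + rows_per_segment, rows_total);
        // Each iteration corresponds to one segment definition (632, 634, 636, ...):
        // the DMA loads rows [first, last) on-chip, the kernel processes them, and
        // the same on-chip space is then reused for the next segment.
        std::cout << "segment " << segment_id++ << " covers rows [" << first << ", " << last << ")\n";
    }
}
```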
  • the present disclosure may be a method, a processing circuit, an electronic device, a computer storage medium, and/or a computer program product.
  • a computer program product may include a computer readable storage medium having computer readable program instructions thereon for carrying out various aspects of the present disclosure.
  • a computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device.
  • a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
  • Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .
  • Computer program instructions for carrying out the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • Computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, via the Internet using an Internet service provider).
  • In some embodiments, an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), may be personalized by utilizing state information of the computer-readable program instructions, and the electronic circuit may execute the computer-readable program instructions, thereby implementing various aspects of the present disclosure.
  • These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, produce an apparatus for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other devices to work in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • Each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • It should also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or can be implemented by a combination of dedicated hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Stored Programmes (AREA)

Abstract

This document describes a method for stream processing, a processing circuit, an electronic device, a computer-readable storage medium, and a computer program product. The method proposed herein includes: running, by a stream processor, a stream processing program to cause a set of pages to be loaded into an on-chip memory as a segment having at least one dimension, where one of the at least one dimension is designated by the stream processing program as a stream processing dimension; and running, by a processing engine, a kernel program to process at least part of the segment, where the at least part is determined based on an offset position of the segment in the stream processing dimension. In this way, data can be loaded into the on-chip memory in a streaming manner, improving the efficiency of data processing.

Description

流处理方法、处理电路和电子设备
本申请要求于2021年12月06日提交中国专利局、申请号为202111479635.0、发明名称为“流处理方法、处理电路和电子设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本公开的实施例一般地涉及电子领域,更具体而言涉及一种用于流处理的方法、处理电路、电子设备、计算机可读存储介质和计算机程序产品。
背景技术
诸如图形处理器(GPU)之类的并行高性能多线程多核处理系统处理数据的速度比过去快得多。这些处理系统可以将复杂的计算分解为较小的任务,并且由多核并行处理以增加处理效率并且减少处理时间。
在一些情形下,诸如GPU之类的处理电路对具有大量相同或相似形式的数据的张量的处理尤为有利。张量数据在计算机领域通常表示一维或多维数组的数据,例如图像数据就是一种常规的二维张量数据,其可以由二维数组表示。对图像数据进行处理时,可以由多个处理电路或者处理电路中的多个处理核(或处理引擎)对图像数据中的不同部分并行处理以减少处理时间。
对于诸如GPU等处理电路而言,片上存储器通常具有更快的访问速度。然而,片上存储器的存储空间往往是受限的。这使得在利用处理电路进行张量运算的过程中,可能需要进行频繁地访问外部存储器(也称为片外存储器),这将严重影响张量运算的效率。
发明内容
本公开的实施例提供了一种用于流处理的方案。
在第一方面,提供了一种用于流处理的方法。该方法包括由流 处理器运行流处理程序,以使一组页被加载到片上存储器中以作为具有至少一个维度的段,其中至少一个维度中的一个维度由流处理程序指定为流处理维度;以及由处理引擎运行内核程序以处理段的至少部分,其中至少部分是基于段在流处理维度的偏移位置而被确定。
在一些实施例中,由所述流处理器运行流处理程序以使一组页被加载到所述片上存储器中以作为具有至少一个维度的段包括:由所述流处理器运行流处理程序,以向直接存储器访问DMA控制器发送第一组加载指令,所述第一组加载指令用于将所述一组页加载到所述片上存储器中。
在一些实施例中,所述一组页中的目标页与第一计数器和第二计数器相关联,所述第一计数器用于指示所述目标页的加载状态,所述第二计数器用于指示引用所述目标页的处理引擎的数目。
在一些实施例中,方法还包括:响应于完成将所述目标页加载到所述片上存储器器中,更新所述第一计数器的值,以指示所述目标页加载完成。
在一些实施例中,方法还包括:响应于所述目标页的所述第一计数器指示所述目标页完成加载且所述第二计数器指示没有处理引擎引用所述目标页,确定所述片上存储器中与所述目标页对应的空间能够被用于加载来自片外存储器的新的数据。
在一些实施例中,所述至少一个维度包括多个维度,所述方法还包括:基于所述偏移位置和所述处理引擎在所述多个维度中的非流处理维度的起始锚点,确定所述段的所述至少部分。
在一些实施例中,所述一组页为第一组页,所述方法还包括:响应于所述内核程序中的更新指令,由所述流处理器运行所述流处理程序以使第二组页被加载到所述片上存储器中与所述段的所述至少部分对应的空间。
在一些实施例中,所述段为第一段,所述偏移位置为第一偏移位置,所述更新指令指示由所述内核程序确定的、在所述流处理维 度的更新偏移位置,所述方法还包括:响应于确定所述更新偏移位置超出所述段在所述流处理维度的边界,由所述流处理器运行所述流处理程序以定义第二段,所述第二段包括至少所述第二组页;以及由所述处理引擎运行所述内核程序以基于所述第二段在所述流处理维度的第二偏移位置来处理所述第二段的至少部分。
在一些实施例中,方法还包括:响应于确定所述更新偏移位置未超出所述段在所述流处理维度的边界,由所述处理引擎运行所述内核程序以处理所述第一段的另一部分,其中所述另一部分是基于所述更新偏移位置而被确定。
在一些实施例中,方法还包括:响应于确定所述段在所述流处理程序中被标记为终止段,当所述流处理程序中与所述终止段相关联的目标指令被执行完成后,终止与所述流处理维度相关联的流处理过程。
在本公开的第二方面,提供了一种处理电路,该处理电路包括片上存储器、流处理器和处理引擎。该处理电路被配置为执行第一方面及其实现方式的任一方法。
在本公开的第三方面,提供了一种电子设备。该电子设备包括处理电路,该处理电路被配置为执行第一方面及其实现方式的任一方法。
在本公开的第四方面,提供了一种计算机可读存储介质。计算机可读存储介质存储有指令,指令在被处理电路执行时使得处理电路执行第一方面及其实现方式的任一方法。
在本公开的第五方面,提供了一种计算机程序产品。计算机程序产品包括指令,指令在被处理电路执行时使得处理电路执行第一方面及其实现方式的任一方法。
可以理解地,上述提供的第二方面的处理电路、第三方面的电子设备、第四方面的计算机存储介质或第五方面的计算机程序产品均可以用于执行第一方面所提供的方法。因此,关于第一方面的解释或者说明同样适用于第二方面、第三方面、第四方面和第五方面。 此外,第二方面、第三方面、第四方面和第五方面所能达到的有益效果可参考对应方法中的有益效果,此处不再赘述。
应当理解,发明内容部分中所描述的内容并非旨在限定本公开的关键或重要特征,亦非用于限制本公开的范围。本公开的其他特征通过以下的描述将变得容易理解。
附图说明
通过结合附图对本公开示例性实施例进行更详细的描述,本公开的上述以及其他目的、特征和优势将变得更加明显,其中,在本公开示例性实施例中,相同的参考标号通常代表相同部件。
图1示出了本公开的多个实施例能够在其中实现的示例环境的示意图;
图2示出了根据本公开的一些实施例的处理电路的示意框图;
图3示出了根据本公开的一些实施例的三维张量的示意框图;
图4示出了根据本公开的一些实施例的流处理方法的示例过程的流程图;
图5示出了根据本公开的一些实施例的根据流处理方案来处理段的示意图;以及
图6A和图6B示出了根据本公开的一些实施例的数据加载的示意图。
具体实施方式
下面将参照附图更详细地描述本公开的优选实施例。虽然附图中示出了本公开的优选实施例,然而应该理解,本公开可以以各种形式实现而不应被这里阐述的实施例限制。相反,提供这些实施例是为了使本公开更加透彻和完整,并且能够将本公开的范围完整地传达给本领域的技术人员。
在本文中使用的术语“包括”及其变形表示开放性包括,即“包括但不限于”。除非特别申明,术语“或”表示“和/或”。术语“基 于”表示“至少部分地基于”。术语“一个示例实施例”和“一个实施例”表示“至少一个示例实施例”。术语“另一实施例”表示“至少一个另外的实施例”。术语“第一”、“第二”等等可以指代不同的或相同的对象。下文还可能包括其他明确的和隐含的定义。
如前文所提及的,虽然处理电路的片上存储器能够提供更快的访问效率,但其尺寸通常是受限的。传统的存储管理支持开发者管理片上存储器中数据的加载,这导致处理电路的某些运算可能会发生大量的访问片外存储器的操作。这进而可能显著地降低处理电路的运算效率并带来更大的能耗。
在本公开的一些实施例中,通过利用流处理的方式来实现片上存储器的加载,这使得存储加载的过程可以与处理引擎的运算过程重叠。这能够显著地提升片上存储器的利用率,从而有效地降低片外存储器的访问。
示例环境
图1示出了本公开的多个实施例能够在其中实现的示例环境100的示意图。示例环境100例如可以是诸如计算机之类的具有计算能力的电子设备。在一个实施例中,示例环境100例如包括中央处理器(CPU)20、系统存储器10、北桥/存储器桥30、加速器子系统40、设备存储器50和南桥/输入输出(IO)桥60。系统存储器10例如可以是诸如动态随机存取存储器(DRAM)之类的易失性存储器。北桥/存储器桥30例如集成了内存控制器、PCIe控制器等,其负责CPU 20和高速接口之间的数据交换以及桥接CPU 20和南桥/IO桥60。南桥/IO桥60用于计算机的低速接口,例如串行高级技术接口(SATA)控制器等。加速器子系统40例如可以包括诸如图形处理器(GPU)和人工智能(AI)加速器等用于对图形、视频等数据进行加速处理的装置或芯片。在本公开中,加速器子系统40也可以被称为“处理电路”。
继续参考图1,设备存储器50例如可以是诸如DRAM之类的位 于加速器子系统40外部的易失性存储器。在本公开中,设备存储器50也被称为片外存储器,即,位于加速器子系统40的芯片外部的存储器。相对而言,加速器子系统40的芯片内部也具有易失性存储器,例如一级(L1)高速缓存(cache)以及可选的二级(L2)高速缓存,其可以被统一称为“片上存储器”
应当理解,虽然在图1中示出了本公开的多个实施例能够在其中实现的一种示例环境100,但是本公开不限于此。本公开的一些实施例也可以在诸如ARM架构和RISC-V架构之类的具有诸如GPU之类的加速器子系统的一些应用环境中使用。
图2示出了根据本公开的一个实施例的处理电路200的示意框图。处理电路200例如可以是图1中加速器子系统40的芯片的一种具体实现方式。处理电路200例如是诸如GPU之类的处理电路芯片。在一个实施例中,处理电路200包括流处理器(SP)210、页表装置220、处理引擎(PE)单元230、直接存储器访问(DMA)控制器240、L1高速缓存(cache)260和L2高速缓存250。
处理电路200由诸如CPU 20之类的主机设备控制,并且接收来自CPU 20的指令。SP 210对来自CPU 20的指令进行分析,并且将经分析的操作指派给PE单元230、页表装置220和DMA控制器240进行处理。页表装置220用于管理处理电路200的片上虚拟存储。在本公开中,L2高速缓存250和诸如图1中的设备存储器50之类的片外存储器构成虚拟存储系统。页表装置220由SP 210、PE单元230和DMA控制器240共同维护。
PE单元230包括多个处理引擎(processing engine,PE)PE_1、PE_2……PE_N,其中N表示大于1的整数。PE单元230中的每个PE可以是单指令多线程(SIMT)装置。在PE中,每个线程可以具有自己的寄存器堆(register file),并且每个PE的所有线程还共享一个统一寄存器堆(uniform register file)。多个PE可以并行地执行相同或不同的处理工作,可以并行地进行下文所述的地址转换和存储器中目标数据的访问,从而减少处理时间。可以理解,多个PE处理的 目标元素并不相同,并且目标元素所在的段、页、缓存行和元素的属性、尺寸、维度排序等可以有所不同,如下文具体描述。
每个线程可以在自己的寄存器堆与存储器子系统之间做线程级的数据交换。每个线程有自己的算数逻辑执行单元并使用自己的存储地址,其采用典型的寄存器存取架构(load-store architecture)。每个执行单元包括一个支持多种数据类型的浮点/定点单元以及一个算数逻辑单元。
大多数的指令执行算数和逻辑运算,例如,浮点和定点数的加、减、乘、除,或者逻辑与、或、非等。操作数来自于寄存器。存储器读写指令可以提供寄存器与片上/片外存储器之间的数据交换。一般地,PE中所有的执行单元可以同步地执行相同指令。通过使用谓词(predicate)寄存器,可以屏蔽部分执行单元,从而实现分支指令的功能。
在一个实施例中,图2的处理电路200可以例如执行如下操作:1)组建页表项内容和初始状态;2)将诸如图1中的设备存储器50之类的片外存储器上的数据搬运至片上存储器,例如L2高速缓存250;3)启动和执行程序;4)定义各个段并对张量以及存储的属性进行描述;5)在程序执行完成时,将执行结果的数据写入至片外存储器。
可以理解,在公开的实施例中,处理电路200所处理的数据主要针对多维张量。例如,在一个实施例中,张量可以是四维张量,其具有四个维度D1、D2、D3和D4,并且张量在各维上的尺寸可以不同。在另一些实施例中,张量可以是一维、二维、三维或更多维张量,本公开对此不进行限制。
此外,在本公开的实施例中,张量内部可以支持诸如uint8、int8、bfloat16、float16、uint16、int16、float32、int32、uint32以及其他自定义元素类型,本公开对此也不进行限制。对于张量的寻址而言,其以元素为基本单位。例如,如果元素类型为int8,则元素以字节为单位。再例如,如果元素类型为int16,则寻址基本单位为双字节, 依此类推。
在一些情形中,张量所包含的数据量可能较大,而L2高速缓存250的容量有限,因此无法将张量整体加载至片上的L2高速缓存250。在本公开的一些实施例中,为了便于张量的并行处理,可以将张量划分为至少一个段。在张量仅包括一个段的情形下,张量即为段。而在张量包括多个段的情形下,段为张量的一部分。CPU 20可以通过指令指定段的各个部分由哪个PE进行处理。
张量的存储结构
图3示出了根据本公开的一个实施例的三维张量300的示意框图。三维张量300具有三个维度D1、D2和D3,并且包括第一段S1、第二段S2和第三段S3。CPU 20可以指定段S1的张量元素由PE_1、PE_2、PE_3、PE_4、PE_5、PE_6、PE_7和PE_8处理。此外,CPU20还指定了第二段S2的张量元素由PE_1-PE_4处理。在本公开的实施例中,每个段所具有的尺寸可以不同,因此编程人员可以基于设计需要灵活配置段。实际上,页的划分可以在任意一个或多个维上实施,并且各维上划分的页数是相互独立的。
在一个实施例中,可以将张量数据存储于片上的高速存储器,例如L2高速缓存250。但由于片上的高速存储器的容量较少,因此在张量规模较大时,编程人员可以将张量划分为多个段,每个段描述张量一部分。核心程序(kernel)可以分多次启动,每次由DMA控制器240提前将张量的一个段由片外存储搬运到片内存储,并供kernel操作使用。在多次启动kernel后,张量包含的所有段均被处理,整个运行过程结束。当片上的高速存储器足以容纳kernel所要访问的所有张量时,一个张量仅需要一个段描述即可,kernel也只需要启动一次。
进一步地,在本公开的一些实施例中,在一个段内,还可以设置至少一个页以进一步细分张量。例如,在第一段S1中,具有4个页P[1]、P[2]、P[3]和P[4]。第二段S2仅具有一个页。在本公开的 实施例中,每个段所具有的页的数目可以不同,因此编程人员可以基于设计需要灵活配置段内页的尺寸。例如,将页配置为适于整体存入L2高速缓存250。
如上所述,当对张量寻址时,最小的寻址单元是以元素为单元。一个页通常可以包括多个元素。目标元素所在的页在本文中被称为“目标元素页”。在本公开的一些实施例中,页可以包括多个缓存行。目标元素页可以位于L2高速缓存250中时,如果PE经由L1高速缓存260读取目标元素,则L2高速缓存250需要将L2高速缓存250中的包括目标元素在内的一小部分的物理地址连续的数据整体传输至L1高速缓存260。这一小部分数据也被称为缓存行(cache line)数据,而这种缓存机制基于空间邻近性原理。PE从L1高速缓存260读取数据仅需几个时钟周期,而L1高速缓存260从L2高速缓存250读取数据可能需要几十个甚至上百个时钟周期。因此,期望减少L1高速缓存260从L2高速缓存250读取数据的次数。虽然在此以“缓存行”来描述从L2高速缓存250到L1高速缓存260的最小传输数据单位,但在本公开中,这部分数据可以并不必然按行或列排列,一个“缓存行”里面的数据分布在多个维上,且各维上分布的数据尺寸不限于1。PE对一个段内的数据进行并行处理,PE的分配在数据的逻辑地址空间展开,独立于段的物理存储结构,具体如下文描述。
在图3中,第一页P[1]中的第一组缓存行被指定由PE_1处理,第二组缓存行被指定由PE_2处理。虽然在此以顺序示出了张量由多个PE依序处理,但是可以理解张量数据的处理独立于PE的顺序,本公开对此不进行限制。例如图3中的PE_2表示部分的张量数据可以由PE_M处理,其中M表示不大于N的任意整数。
流处理的示例过程
图4示出了根据本公开的一些实施例的流处理方法400的流程图。在一个实施例中,方法400例如可以由诸如GPU之类的处理电 路200(或加速器子系统40)实施,因此上面针对图1至图3描述的各个方面可以选择性地适用于方法400。
在框402,由流处理器210运行流处理程序,以使一组页被加载到片上存储器中以作为具有至少一个维度的段,其中至少一个维度中的一个维度由流处理程序指定为流处理维度。
在一些实施例中,流处理器210可以运行流处理程序,其例如可以包括一组SP指令。该组SP指令例如可以包括用于从片外存储器加载页的加载指令(LOAD指令)。LOAD指令例如可以指定待加载的页的片外存储器地址,以及待写入的片上存储器地址。
在一些实施例中,在流处理程序中,LOAD指令的执行是非阻塞的。在执行LOAD指令的过程中,流处理器210可以向DMA控制器240发送指示,以将数据从对应的片外存储器地址加载到对应的片上存储器地址。在流处理器210完成指示的发送后,流处理器210可以执行流处理程序中的下一指令,而无需等待对于数据的加载完成。
在一些实施例中,在执行LOAD目标页的指令时,处理电路200还可以为目标页设置对应的第一计数器。在执行LOAD指令时,第一计数器例如可以被设置为第一值(例如,1),以指示目标页未完成加载。此外,当DMA控制器240完成目标页的加载时,第一计数器的值可以被更新为第二值(例如,0),以指示对应目标页已经完成加载。
在一些实施例中,在执行LOAD目标页的指令时,处理电路200还可以为目标页设置第二计数器,以指示引用目标页的处理引擎PE的数目。例如,当多个PE用于共同处理目标页时,第二计数器的初始数目可以被设置为多个PE的数目。
在一些实施例中,流处理程序可以包括用于定义段的段定义指令。具体地,段定义指令可以指示构成该段的一个或多个页。附加地,段定义指令还可以指示段所包括的一个或多个维度以及在该一个或多个维度上的尺寸。
在一些实施例中,为了实现流处理运算,段定义指令还可以指示一个维度以作为流处理维度。图5示出了根据本公开的一些实施例的根据流处理方案来处理段500的示意图。如图5所示,所定义的段500例如可以包括64个页,其包括维度0、维度1和维度2。示例性地,流处理器210可以通过段定义指令来将维度1指定为流处理维度。
继续参考4,在框404,由处理引擎PE运行内核程序以处理段的至少部分,其中该至少部分是基于段在流处理维度的偏移位置而被确定。
在一些实施例中,流处理器210可以将所定义的段的信息发送至处理引擎,以运行内核程序来处理该段的至少部分。该信息例如可以包括由流处理程序所指示的流处理维度,以及该段在流处理维度的偏移位置。
在一些实施例中,如果段具有多个维度,则可以为每个处理引擎PE指定在非流处理维度的起始锚点。进一步地,处理引擎可以基于对应的起始锚点和由流处理程序中段定义指令所指示的起始偏移来确定该处理引擎所处理的初始页。
以图5为示例,处理引擎的在维度0和维度2的起始锚点例如为(0,0),则处理引擎可以根据偏移位置“0”来确定新的起始页为(0,0,0),即“页”。
在一些实施例中,如上文所讨论的,处理引擎可以根据初始页的第一计数器的值来确定该页是否完成加载。如果该页已经完成加载,则处理引擎可以利用片上存储器中的页的数据来执行运算。
进一步地,处理引擎可以运行内核程序可以通过定义循环程序来依次执行对一个或多个页的运算。
在一些实施例中,内核程序还可以包括更新指令,该更新指令例如可以位于在内核程序中的循环开始之前,或者循环结束之后。该更新指令例如可以指示内核程序下一待处理的页在流处理维度的更新偏移位置。
以图作为示例,如果处理引擎待处理的页是流处理维度的偏移位置为“0”的全部页,则当下一待处理的页为“页P+4”时,处理引擎中的内核程序可以执行更新指令,以指示在流处理维度的更新偏移位置为“1”。
应当理解,可以在内核程序中定义任何适当的处理步长,其也可能导致流处理维度上偏移位置不被更新。
在一些实施例中,响应于内核程序中的更新指令,流处理器可以更新已经完成的在先页的第二计数器。在本公开中,“在先页”表示流处理维度的偏移位置小于更新偏移位置的页。
以图5作为示例,当更新偏移位置为“1”时,流处理器可以更新流处理维度的偏移位置在“0”到“1”之间的全部页的第二计数器。
如上文所讨论的,第二计数器的初始值被设置为引用该页的处理引擎的数目,当多个处理引擎中一个处理引擎发送了更新指令后,这些页的第二计数器的值可以被递减。当第二计数器的值例如可以被减至为零时,其可以指示没有处理引擎再引用该页。
在一些实施例中,当一个页的第一计数器指示该页完成加载,并且第二计数器指示没有处理引擎再引用该页时,流处理器210可以确定片上存储器中与目标页对应的空间能够被用于加载来自片外存储器的新的数据。
具体地,流处理器210中的流处理程序例如可以在定义如图5所示的段500的指令(为了方便描述,称为第一组指令)后还包括用于定义其他段的指令(为了描述,也称为第二组指令)。与定义段500的第一组指令类似,第二组指令例如可以包括一个或多个LOAD指令和一个段定义指令。
在一些实施例中,当特定页的第一计数器指示该页完成加载,并且第二计数器指示没有处理引擎再引用该页时,第二组指令中与该页所对应的LOAD指令可以由DMA控制器指令,以将新的数据加载到片上存储器中与该特定页所对应的空间。由此,流处理器210 可以运行流处理程序以使第二组页被加载到片上存储器中与由内核程序已经处理的段的至少部分对应的空间。
在一些实施例中,流处理器可以确定内核程序中的更新指令所指定的更新偏移位置是否超出所定义的段在流处理维度的边界。如果更新偏移位置未超出流处理维度的边界,则处理引擎可以继续运行程序以处理该段的另一部分。具体地,处理引擎可以根据更新偏移位置和在非流处理维度的起始锚点来确定新的起始页,从而可以利用循环程序来进一步处理该段中的一个或多个页。
以图5为示例,处理引擎的在维度0和维度2的起始锚点例如为(0,0),则处理引擎可以根据更新偏移位置“1”来确定新的起始页为(0,1,0),即“页P+4”。
基于这样的方式,本公开的实施例能够使得处理引擎处理新的页的运算过程可以与已经没有引用的页的加载过程被至少部分地并行,从而提高处理电路的处理效率。此外,片上存储器通过流式的方式被加载,使得片上存储器的数据被动态地更新以用于由处理引擎使用。对于处理引擎而言,其感知的片上存储器大小将远大于实际的片上存储器的大小。
在一些实施例中,如果更新偏移位置超出了该段(也称为第一段)在流处理维度的边界,则可以由流处理器运行流处理程序以定义包括第二组页的新的段(也称为第二段)。
以图5为示例,当更新偏移位置为“4”时,其超出段500在流处理维度的边界“3”。此时,流处理器210可以运行流处理命令以定义新的段。
图6A示出了根据本公开的一些实施例的数据加载的示意图600A。如图6A所示,片外存储器610中的一个矩阵600A被离散的存储为三个段612、614和616。示例性地,流处理器可以通过段定义指令定义三个段。
当处理引擎通过内核程序中的更新指令指示的更新偏移位置超出第一个段612在流处理维度的边界时,流处理器可以运行定义第 二个段614的指令,并将第二个段614的信息发送至处理引擎的内核程序以用于循环执行。在这种情况下,处理引擎将基于第二个段614在流处理维度的新的偏移位置(也称为第二偏移位置)来处理第二个段614。
在一些实施例中,第二偏移位置可以是根据所指示的该更新偏移位置相对于第一个段612在流处理维度的边界的偏移量所确定的。
进一步地,随着内核程序的执行,其可以通过更新指令例如指示更新偏移位置。当更新偏移位置超出第二个段614在流处理维度的边界时,流处理器210可以进一步运行定义第三个段616的段定义指令,并将第三个段616的信息发送至处理引擎,以利用内核程序进行处理。
在一些实施例中,第三个段616如果是流处理过程中的终止段,则流处理器210可以在段定义指令中将第三个段616标记为终止段。在处理引擎运行内核程序完成第三个段616的运算后,流处理器210将终止与所述流处理维度相关联的流处理过程。具体地,流处理程序将不会提供新的段的信息,这将使得处理引擎对该段的访问始终获得越界的数据。
由此,开发人员在开发内核程序时不需要关注片上存储器加载的具体细节。另一方面,本公开的实施例可以使得内核程序可以被新欢执行而无需退出,从而进一步提高处理效率。
图6B示出了根据本公开的一些实施例的数据加载的示意图600B。在该示例中,片上存储器610的尺寸是有限的,片外存储器620中的矩阵630例如无法一次加载到片上存储器610中。
根据本公开的流处理方式,流处理器210可以通过流处理程序定义三个段632、634和636,以用于由处理引擎利用内核程序处理。由此,三个段632、634和636可以轮流装载到片上存储器中。基于这样的方式,本公开的实施例可以处理任意大小的数据,而无需考虑其是否能够一次性被加载到片上存储器中。
本公开可以是方法、处理电路、电子设备、计算机存储介质和/ 或计算机程序产品。计算机程序产品可以包括计算机可读存储介质,其上载有用于执行本公开的各个方面的计算机可读程序指令。
计算机可读存储介质可以是可以保持和存储由指令执行设备使用的指令的有形设备。计算机可读存储介质例如可以是――但不限于――电存储设备、磁存储设备、光存储设备、电磁存储设备、半导体存储设备或者上述的任意合适的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括:便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、静态随机存取存储器(SRAM)、便携式压缩盘只读存储器(CD-ROM)、数字多功能盘(DVD)、记忆棒、软盘、机械编码设备、例如其上存储有指令的打孔卡或凹槽内凸起结构、以及上述的任意合适的组合。这里所使用的计算机可读存储介质不被解释为瞬时信号本身,诸如无线电波或者其他自由传播的电磁波、通过波导或其他传输媒介传播的电磁波(例如,通过光纤电缆的光脉冲)、或者通过电线传输的电信号。
这里所描述的计算机可读程序指令可以从计算机可读存储介质下载到各个计算/处理设备,或者通过网络、例如因特网、局域网、广域网和/或无线网下载到外部计算机或外部存储设备。网络可以包括铜传输电缆、光纤传输、无线传输、路由器、防火墙、交换机、网关计算机和/或边缘服务器。每个计算/处理设备中的网络适配卡或者网络接口从网络接收计算机可读程序指令,并转发该计算机可读程序指令,以供存储在各个计算/处理设备中的计算机可读存储介质中。
用于执行本公开操作的计算机程序指令可以是汇编指令、指令集架构(ISA)指令、机器指令、机器相关指令、微代码、固件指令、状态设置数据、或者以一种或多种编程语言的任意组合编写的源代码或目标代码,编程语言包括面向对象的编程语言—诸如Smalltalk、C++等,以及常规的过程式编程语言—诸如“C”语言或类似的编程语言。计算机可读程序指令可以完全地在用户计算机上执行、部分 地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络—包括局域网(LAN)或广域网(WAN)—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。在一些实施例中,通过利用计算机可读程序指令的状态信息来个性化定制电子电路,例如可编程逻辑电路、现场可编程门阵列(FPGA)或可编程逻辑阵列(PLA),该电子电路可以执行计算机可读程序指令,从而实现本公开的各个方面。
这里参照根据本公开实施例的方法、装置(系统)和计算机程序产品的流程图和/或框图描述了本公开的各个方面。应当理解,流程图和/或框图的每个方框以及流程图和/或框图中各方框的组合,都可以由计算机可读程序指令实现。
这些计算机可读程序指令可以提供给通用计算机、专用计算机或其它可编程数据处理装置的处理单元,从而生产出一种机器,使得这些指令在通过计算机或其它可编程数据处理装置的处理单元执行时,产生了实现流程图和/或框图中的一个或多个方框中规定的功能/动作的装置。也可以把这些计算机可读程序指令存储在计算机可读存储介质中,这些指令使得计算机、可编程数据处理装置和/或其他设备以特定方式工作,从而,存储有指令的计算机可读介质则包括一个制造品,其包括实现流程图和/或框图中的一个或多个方框中规定的功能/动作的各个方面的指令。
也可以把计算机可读程序指令加载到计算机、其它可编程数据处理装置、或其它设备上,使得在计算机、其它可编程数据处理装置或其它设备上执行一系列操作步骤,以产生计算机实现的过程,从而使得在计算机、其它可编程数据处理装置、或其它设备上执行的指令实现流程图和/或框图中的一个或多个方框中规定的功能/动作。
附图中的流程图和框图显示了根据本公开的多个实施例的系 统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段或指令的一部分,模块、程序段或指令的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个连续的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或动作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
以上已经描述了本公开的各实施方式,上述说明是示例性的,并非穷尽性的,并且也不限于所披露的各实施方式。在不偏离明的各实施方式的范围和精神的情况下,对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。本文中所用术语的选择,旨在最好地解释各实施方式的原理、实际应用或对市场中的技术的改进,或者使本技术领域的其他普通技术人员能理解本文披露的各实施方式。

Claims (14)

  1. A method for stream processing, performed by a processing circuit, the processing circuit comprising an on-chip memory, a stream processor, and a processing engine, the method comprising:
    running, by the stream processor, a stream processing program to cause a set of pages to be loaded into the on-chip memory as a segment having at least one dimension, wherein one of the at least one dimension is designated by the stream processing program as a stream processing dimension; and
    running, by the processing engine, a kernel program to process at least part of the segment, wherein the at least part is determined based on an offset position of the segment in the stream processing dimension.
  2. The method of claim 1, wherein running, by the stream processor, the stream processing program to cause the set of pages to be loaded into the on-chip memory as a segment having at least one dimension comprises:
    running, by the stream processor, the stream processing program to send a first group of load instructions to a direct memory access (DMA) controller, the first group of load instructions being used to load the set of pages into the on-chip memory.
  3. The method of claim 2, wherein a target page in the set of pages is associated with a first counter and a second counter, the first counter indicating a loading status of the target page, and the second counter indicating the number of processing engines referencing the target page.
  4. The method of claim 3, further comprising:
    in response to completion of loading the target page into the on-chip memory, updating a value of the first counter to indicate that loading of the target page is complete.
  5. The method of claim 3, further comprising:
    in response to the first counter of the target page indicating that loading of the target page is complete and the second counter indicating that no processing engine references the target page, determining that the space in the on-chip memory corresponding to the target page can be used to load new data from an off-chip memory.
  6. The method of claim 1, wherein the at least one dimension comprises a plurality of dimensions, the method further comprising:
    determining the at least part of the segment based on the offset position and a starting anchor point of the processing engine in a non-stream processing dimension of the plurality of dimensions.
  7. The method of claim 1, wherein the set of pages is a first set of pages, the method further comprising:
    in response to an update instruction in the kernel program, running, by the stream processor, the stream processing program to cause a second set of pages to be loaded into a space in the on-chip memory corresponding to the at least part of the segment.
  8. The method of claim 7, wherein the segment is a first segment, the offset position is a first offset position, and the update instruction indicates an update offset position in the stream processing dimension determined by the kernel program, the method further comprising:
    in response to determining that the update offset position exceeds a boundary of the segment in the stream processing dimension, running, by the stream processor, the stream processing program to define a second segment, the second segment comprising at least the second set of pages; and
    running, by the processing engine, the kernel program to process at least part of the second segment based on a second offset position of the second segment in the stream processing dimension.
  9. The method of claim 8, further comprising:
    in response to determining that the update offset position does not exceed the boundary of the segment in the stream processing dimension, running, by the processing engine, the kernel program to process another portion of the first segment, wherein the other portion is determined based on the update offset position.
  10. The method of claim 1, further comprising:
    in response to determining that the segment is marked as a terminating segment in the stream processing program, terminating the stream processing process associated with the stream processing dimension after the processing engine has completed the operations on the terminating segment.
  11. A processing circuit, comprising an on-chip memory, a stream processor, and a processing engine, wherein the processing circuit is configured to perform the method of any one of claims 1 to 10.
  12. An electronic device, comprising an off-chip memory and a processing circuit, wherein the processing circuit is configured to perform the method of any one of claims 1 to 10.
  13. A computer-readable storage medium having one or more computer instructions stored thereon, wherein the one or more computer instructions are executed by a processing circuit to implement the method of any one of claims 1 to 10.
  14. A computer program product comprising computer-executable instructions, wherein the computer-executable instructions, when executed by a processing circuit, implement the method of any one of claims 1 to 10.
PCT/CN2022/107083 2021-12-06 2022-07-21 流处理方法、处理电路和电子设备 WO2023103391A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111479635.0A CN114218152B (zh) 2021-12-06 2021-12-06 流处理方法、处理电路和电子设备
CN202111479635.0 2021-12-06

Publications (1)

Publication Number Publication Date
WO2023103391A1 true WO2023103391A1 (zh) 2023-06-15

Family

ID=80699855

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/107083 WO2023103391A1 (zh) 2021-12-06 2022-07-21 流处理方法、处理电路和电子设备

Country Status (2)

Country Link
CN (1) CN114218152B (zh)
WO (1) WO2023103391A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114218152B (zh) * 2021-12-06 2023-08-15 海飞科(南京)信息技术有限公司 流处理方法、处理电路和电子设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021831A (zh) * 2007-03-19 2007-08-22 中国人民解放军国防科学技术大学 面向科学计算的64位流处理器芯片体系结构
CN101739242A (zh) * 2009-11-27 2010-06-16 宇盛通信科技(深圳)有限公司 一种流数据处理方法及流处理器
US20130219103A1 (en) * 2012-02-17 2013-08-22 Netronome Systems, Inc. Configurable Mesh Data Bus In An Island-Based Network Flow Processor
CN114218152A (zh) * 2021-12-06 2022-03-22 海飞科(南京)信息技术有限公司 流处理方法、处理电路和电子设备

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9489997B2 (en) * 2013-07-03 2016-11-08 Crossbar, Inc. Hardware assisted meta data lookup
CN104317751B (zh) * 2014-11-18 2017-03-01 郑州云海信息技术有限公司 一种gpu上数据流处理系统及其数据流处理方法
EP3203400A1 (en) * 2016-02-03 2017-08-09 Universitat Rovira I Virgili A computer implemented method of generation of statistically uncorrelated molecule's conformations and computer programs
WO2018219452A1 (en) * 2017-05-31 2018-12-06 Huawei Technologies Co., Ltd. Cross platform stream dataflows
US10489056B2 (en) * 2017-11-09 2019-11-26 Nvidia Corporation Queue manager for streaming multiprocessor systems
CN109117949A (zh) * 2018-08-01 2019-01-01 南京天数智芯科技有限公司 用于人工智能设备的灵活数据流处理器和处理方法
CN110941789B (zh) * 2018-09-21 2023-12-15 北京地平线机器人技术研发有限公司 张量运算方法和装置
WO2020181670A1 (en) * 2019-03-11 2020-09-17 Huawei Technologies Co., Ltd. Control flow optimization in graphics processing unit
EP3938888A1 (en) * 2019-03-15 2022-01-19 INTEL Corporation Systolic disaggregation within a matrix accelerator architecture
US11934308B2 (en) * 2019-04-01 2024-03-19 Wave Computing, Inc. Processor cluster address generation
US11494608B2 (en) * 2019-08-14 2022-11-08 Intel Corporation Methods and apparatus to tile walk a tensor for convolution operations
CN111145076B (zh) * 2019-12-27 2023-04-07 深圳鲲云信息科技有限公司 数据并行化处理方法、系统、设备及存储介质
US10970619B1 (en) * 2020-08-21 2021-04-06 Moffett Technologies Co., Limited Method and system for hierarchical weight-sparse convolution processing
CN113159285B (zh) * 2021-04-14 2023-09-05 广州放芯科技有限公司 神经网络加速器

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021831A (zh) * 2007-03-19 2007-08-22 中国人民解放军国防科学技术大学 面向科学计算的64位流处理器芯片体系结构
CN101739242A (zh) * 2009-11-27 2010-06-16 宇盛通信科技(深圳)有限公司 一种流数据处理方法及流处理器
US20130219103A1 (en) * 2012-02-17 2013-08-22 Netronome Systems, Inc. Configurable Mesh Data Bus In An Island-Based Network Flow Processor
CN114218152A (zh) * 2021-12-06 2022-03-22 海飞科(南京)信息技术有限公司 流处理方法、处理电路和电子设备

Also Published As

Publication number Publication date
CN114218152B (zh) 2023-08-15
CN114218152A (zh) 2022-03-22

Similar Documents

Publication Publication Date Title
US8301672B2 (en) GPU assisted garbage collection
JP3802042B2 (ja) キャッシュメモリ実装方法および装置、キャッシュメモリシステム
RU2636675C2 (ru) Команды, процессоры, способы и системы доступа множественных регистров к памяти
US8327109B2 (en) GPU support for garbage collection
US7685601B2 (en) Methods and apparatus for segmented stack management in a processor system
CN114667508B (zh) 为加速器取回数据的方法和系统
WO2023040460A1 (zh) 存储器访问方法和电子装置
WO2023173642A1 (zh) 指令调度的方法、处理电路和电子设备
US11947821B2 (en) Methods and systems for managing an accelerator's primary storage unit
WO2023103392A1 (zh) 用于存储管理的方法、介质、程序产品、系统和装置
JP2023518833A (ja) ハードウェアアクセラレーションリソースを有効にするためのコンパイラ主導のタイル置換
WO2023103391A1 (zh) 流处理方法、处理电路和电子设备
WO2023103397A1 (zh) 用于存储管理的方法、介质、程序产品、系统和装置
WO2023065748A1 (zh) 加速器和电子装置
WO2023077875A1 (zh) 用于并行执行核心程序的方法和装置
CN114035980B (zh) 基于便笺存储器来共享数据的方法和电子装置
US11900142B2 (en) Improving memory access handling for nested virtual machines
WO2006085636A1 (en) Methods and apparatus for processing instructions in a multi-processor system
Du et al. Breaking the interaction wall: A DLPU-centric deep learning computing system
US20240168719A1 (en) Dual vector arithmetic logic unit
US20230115542A1 (en) Programmable matrix multiplication engine
US20160342527A1 (en) Deferring registration for dma operations
JP2023552789A (ja) 算術論理演算ユニット用のソフトウェアベースの命令スコアボード
CN115145837A (zh) 预取数据的方法、装置和介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22902826

Country of ref document: EP

Kind code of ref document: A1