CN116578340A - Low-overhead thread switching method for tensor operation - Google Patents
Low-overhead thread switching method for tensor operation
- Publication number
- CN116578340A (application number CN202310395659.0A)
- Authority
- CN
- China
- Prior art keywords
- tensor
- instruction
- thread
- priority
- operand
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
- G06F9/30123—Organisation of register space, e.g. banked or distributed register file according to context, e.g. thread buffers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
- G06F9/30138—Extension of register space, e.g. register cache
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application relates generally to the technical field of semiconductor chips and provides a low-overhead thread switching method for tensor operations, in which a processing unit performs the following actions: running a first tensor instruction; receiving a data packet, the data packet comprising a hash index and data; parsing the data packet, wherein, when a second tensor instruction is obtained by parsing and an idle thread register exists in the processing unit, the operands and priority of the second tensor instruction are determined; and executing the second tensor instruction when all operands of the second tensor instruction are ready and its priority is higher than that of the first tensor instruction. By taking the tensor as the basic operand, the application greatly reduces the number of instructions in a PE, avoiding extra power consumption so that chip power consumption is lower and the energy-efficiency ratio is high, reducing circuit area, avoiding instruction cache misses caused by a large number of instruction-fetch operations, and improving performance.
Description
Technical Field
The present application relates generally to the field of semiconductor chip technology. In particular, the present application relates to a low overhead thread switching method for tensor operations.
Background
Traditionally, the operands of a semiconductor chip's processing units have a single datum or vector as their granularity, which makes instruction fetch (Instruction Fetch) costly: performing tensor operations requires a large number of instruction fetches, introduces substantial fetch (Fetch) and decode (Decode) overhead, consumes a large amount of power, and yields a poor energy-efficiency ratio. The corresponding instruction cache (Instruction Cache) must be correspondingly larger in capacity, resulting in greater area overhead, and the large number of instruction-fetch operations makes instruction cache misses (Cache Miss) more likely, reducing performance.
In particular, when many different tensor calculations are performed, a single cache replacement policy (Cache Replacement Policy) is prone to data cache misses (Data Cache Miss), so different replacement policies are needed for different tensor sizes and calculation types. When switching between tensor-calculation threads, tensor data not yet fully consumed by another thread is easily evicted from the cache, so that when a thread is switched back in, its required data is found to have been evicted from its cache line (Cache Line), further reducing performance.
Because a wafer-level chip has a large number of processing elements (PE, Processing Element) and long routing paths, the problems above, such as the miss penalty (Miss Penalty) caused by cache misses, are amplified compared with conventional computing systems; they have become a bottleneck and constraint that severely limits the utilization of the computing units. In a wafer-level chip, data that will be used many times should be cached locally as much as possible; however, a processing element (PE) cannot by itself know whether the data it has cached (Cache) is actually the data required. If this is guaranteed entirely by the compiler, the compiler's computational complexity becomes too high, and any error leads to incorrect calculation results.
Disclosure of Invention
To at least partially solve the above-mentioned problems in the prior art, the present application proposes a low-overhead thread switching method for tensor operation, which is characterized by comprising the following actions performed by a processing unit:
running a first tensor instruction;
receiving a data packet, wherein the data packet comprises a hash index and data;
parsing the data packet, wherein, when a second tensor instruction is obtained by parsing and an idle thread register exists in the processing unit, the operands and priority of the second tensor instruction are determined; and
the second tensor instruction is executed when all operands of the second tensor instruction are ready and the priority of the second tensor instruction is higher than the priority of the first tensor instruction.
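The switching rule claimed above can be summarized as a single predicate. The sketch below is illustrative only (the helper name is not from the patent): the second tensor instruction preempts the first only when every operand is ready AND its priority is strictly higher.

```python
def should_switch(running_priority, new_priority, operands_ready):
    """True when the second tensor instruction may preempt the first."""
    return all(operands_ready) and new_priority > running_priority
```

The running instruction's state stays in its own thread register, so execution can later resume from the saved pointers.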
In one embodiment of the application, it is provided that parsing the data packet further comprises:
and determining, when an immediate of the second tensor instruction is obtained by parsing, the priority and all operands of the second tensor instruction other than the immediate, wherein the second tensor instruction is executed when all of its operands other than the immediate are ready and its priority is higher than that of the first tensor instruction.
In one embodiment of the application, it is provided that the processing unit comprises:
a plurality of thread registers, comprising:
a plurality of secondary pointers Root Tensor DSDs, each pointing to primary pointers Tensor DSDs and to storage space in the static random access memory, wherein the processing unit uses the secondary pointers Root Tensor DSDs as its basic operands;
a plurality of primary pointers Tensor DSDs, each representing a dimension of a tensor; and
static random access memory.
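The two-level pointer scheme above can be modelled roughly as follows. Field names are illustrative transliterations of the labels used in this application, not an authoritative layout:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TensorDSD:
    """Primary pointer: describes one tensor dimension."""
    length: int              # Dimension Length
    width: int               # Data Width
    stride: int              # Stride
    valid: bool = True       # Valid

@dataclass
class RootTensorDSD:
    """Secondary pointer: the PE's basic operand."""
    start_addr: int          # Start Addr in SRAM
    end_addr: int            # End Addr in SRAM
    hash_index: int          # Hash Index identifying the tensor
    dims: List[TensorDSD] = field(default_factory=list)  # 0-4 Tensor DSDs
    protect: bool = False    # Protection: keeps unconsumed data resident
    ready: bool = False      # Ready

# Example: a hypothetical 4x4 tensor of 16-bit elements held at SRAM 0x0000-0x3fff
a = RootTensorDSD(0x0000, 0x3fff, 0x1acf,
                  dims=[TensorDSD(4, 16, 1), TensorDSD(4, 16, 4)])
```

Because the Root Tensor DSD carries both the SRAM range and a protection flag, un-consumed tensor data can be pinned against replacement while a thread is switched out.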
In one embodiment of the application, it is provided that the thread register stores instructions and states including one or more of the following:
Valid, which indicates whether the current thread is valid;
an interrupt enable Interrupt Enable, which indicates whether interrupts are enabled while the current thread executes;
a Priority, which indicates the priority of the current thread in scheduling;
a source operand Source Operand, which represents the data required when the thread executes;
a destination operand Dst Operand, which represents the data to be written or modified when the thread executes;
an operation code Opcode, which indicates the kind of operation performed by the instruction;
an instruction type Instruction Type, which represents the type of instruction being executed by the current thread;
a source register pointer Source Register ptr, which represents the address of the tensor to which a source operand points;
a source stride Source Stride, which represents the pattern in which a source operand tensor is read;
a destination register pointer Dst Register ptr, which represents the address of the tensor to which the destination operand points;
a destination stride Dst Stride, which represents the pattern in which the destination operand tensor is read;
a source operand hash index Source Operand Hash Index, which represents the hash value of a source operand; and
a destination index Dst Index, which represents the routing information the destination operand ultimately needs for forwarding.
In one embodiment of the application, it is provided that the state and linker stored in the secondary pointer Root Tensor DSD comprises one or more of the following:
the start address Start Addr, the end address End Addr, the hash index Hash Index, the tensor dimension Tensor Dimension, the valid flag Valid, the destination transfer Dst Transfer, the link tensor index Link Tensor DSD Index, the data length Data Length, the data width Data Width, the protection flag Protection, the ready flag Ready, and the type Type.
In one embodiment of the application, it is provided that the state stored in the primary pointer Tensor_DSD includes one or more of the following:
the dimension length Dimension Length, the data width Data Width, the Stride, and the valid flag Valid.
In one embodiment of the application, it is provided that the decoded instructions of the processing unit comprise:
a computing instruction, comprising:
three operands, wherein the three operands are an immediate, an immediate, and an SPM operand; an immediate, an SPM operand, and an SPM operand; or an SPM operand, an SPM operand, and an SPM operand; or alternatively
two operands, wherein the two operands are an immediate and an immediate, or an immediate and an SPM operand;
SPM stores instructions; and
SPM loads instructions.
In one embodiment of the application, it is provided that the state of the current calculation is stored via the source register pointer Source Register ptr and the destination register pointer Dst Register ptr.
The application also proposes a processing unit which performs the steps of the method.
The application also provides a wafer-level chip, which is provided with the processing unit.
The application has at least the following beneficial effects. The application uses the tensor (Root Tensor DSD) as the basic operand, so that the number of instructions in a PE is greatly reduced, avoiding the extra fetch and decode power consumption caused by a large number of instructions; chip power consumption is therefore lower and the energy-efficiency ratio is high. Further, the high instruction cache overhead caused by a large number of instructions is avoided, reducing circuit area, and instruction cache misses caused by a large number of instruction-fetch operations are avoided, improving performance.
The application uses the tensor calculation thread Thread X as the basic instruction: the secondary pointer Root Tensor DSD points to the primary pointers Tensor DSD and to the corresponding SRAM space, and the SRAM address space of a tensor whose calculation has not yet finished is kept in a protected state and is not easily replaced, reducing cache misses caused by thread switching. Source Register 0 ptr-Source Register 2 ptr and Dst Register 0 ptr in the corresponding thread register Thread X store the current calculation state, so the overhead of thread switching is very low and calculation can resume directly from the state address stored in the corresponding pointer.
Different threads may be assigned different priority levels in Thread X, so the compiler can allocate different registers to tasks of different importance to shorten critical paths. The compiler can assign different tensor calculation tasks at different times according to the resource readiness of each processing element (PE, Processing Element) in the wafer-level chip; since each PE replaces data at tensor (Root Tensor DSD) granularity, the compiler's scheduling can minimize the probability of cache misses.
In a wafer-level chip, because of the large number of PEs and long routing, data that will be used many times should be cached locally as much as possible.
Drawings
To further clarify the advantages and features present in various embodiments of the present application, a more particular description of various embodiments of the present application will be rendered by reference to the appended drawings. It is appreciated that these drawings depict only typical embodiments of the application and are therefore not to be considered limiting of its scope. In the drawings, for clarity, the same or corresponding parts will be designated by the same or similar reference numerals.
FIG. 1 shows a schematic diagram of a tensor-granularity processing unit in one embodiment of the application.
Fig. 2 shows a schematic diagram of information stored in Thread, Root_DSD, and Tensor_DSD in an embodiment of the present application.
FIG. 3 is a schematic diagram of a decoded Tensor PE instruction in accordance with an embodiment of the application.
Fig. 4 shows a schematic diagram of a packet (Package) according to an embodiment of the application.
FIG. 5 is a schematic diagram illustrating the processing flow after receiving a packet (Package) according to an embodiment of the present application.
Fig. 6A-B show a flow diagram of thread switching in an embodiment of the application.
Figures 7A-L illustrate schematic diagrams of the operation of a processing unit in one embodiment of the application.
Detailed Description
It should be noted that the components in the figures may be shown exaggerated for illustrative purposes and are not necessarily to scale. In the drawings, identical or functionally identical components are provided with the same reference numerals.
In the present application, unless specifically indicated otherwise, "disposed on …", "disposed over …" and "disposed over …" do not preclude the presence of an intermediate therebetween. Furthermore, "disposed on or above" … merely indicates the relative positional relationship between the two components, but may also be converted to "disposed under or below" …, and vice versa, under certain circumstances, such as after reversing the product direction.
In the present application, the embodiments are merely intended to illustrate the scheme of the present application, and should not be construed as limiting.
In the present application, the adjectives "a" and "an" do not exclude a scenario of a plurality of elements, unless specifically indicated.
It should also be noted herein that in embodiments of the present application, only a portion of the components or assemblies may be shown for clarity and simplicity, but those of ordinary skill in the art will appreciate that the components or assemblies may be added as needed for a particular scenario under the teachings of the present application. In addition, features of different embodiments of the application may be combined with each other, unless otherwise specified. For example, a feature of the second embodiment may be substituted for a corresponding feature of the first embodiment, or may have the same or similar function, and the resulting embodiment may fall within the scope of disclosure or description of the application.
It should also be noted herein that, within the scope of the present application, the terms "identical", "equal" and the like do not mean that two values are absolutely equal, but allow for some reasonable error; that is, the terms also encompass "substantially identical" and "substantially equal". By analogy, in the present application, the direction-related terms "perpendicular", "parallel" and the like also cover the meanings of "substantially perpendicular" and "substantially parallel".
The numbers of the steps of the respective methods of the present application are not limited to the order of execution of the steps of the methods. The method steps may be performed in a different order unless otherwise indicated.
The application is further elucidated below in connection with the embodiments with reference to the drawings.
FIG. 1 shows a schematic diagram of a tensor-granularity processing unit in one embodiment of the application. As shown in fig. 1, the processing unit (Tensor PE) may include a plurality of Threads, a plurality of Root_DSDs, a plurality of Tensor_DSDs, and an SRAM. A Thread register may store an instruction's source operands, the Root_DSD of its destination operand, its interrupt enable, its priority, the pointer state of the current execution, and the like. A Root_DSD is a root DSD, similar to a secondary pointer, and contains 0-4 Tensor_DSDs. A Tensor_DSD describes the corresponding dimension.
Fig. 2 shows a schematic diagram of information stored in Thread, Root_DSD, and Tensor_DSD in an embodiment of the present application.
As shown in fig. 2, the Instruction (Instruction) and State (State) stored in Thread may include:
valid, whether Valid, indicate the current thread is Valid; the Interrupt Enable Interrupt is started to indicate whether the Interrupt is started when the current thread is executed; priority, which indicates the Priority of the current thread in scheduling; source operations 0-2, source operands 0-2, represent the data needed by a thread when executing, support at least one and at most three Source operands; dst Operand, destination Operand, which represents the data to be written or modified when a thread executes; opcode, the type of operation performed by an instruction, such as addition, multiplication, comparison, etc.; an Instruction Type, indicating the Type of Instruction being executed by the current thread, e.g., only one source operand or two source operands, etc.; state Regs, status registers, indicating the status of the current instruction execution, e.g., in-execution, where to execute, etc.; source Register 0-2 ptr, source Register pointer 0-2, indicates the address of the tensor pointed to by the Source operand; source 0-2 stride, source step size 0-2, indicates the mode when the Source operand tensor is read, e.g., continuous read or read at certain interval values; dst Register 0ptr, destination Register pointer, address of tensor pointed to by destination operand; dst 0 Stride, destination step size 0, indicates the mode in which the destination operand tensor is read, e.g., continuous read or read at a certain interval value; source Operand Hash Index, a source operand hash index indicating the source operand hash value to determine if a tensor already exists in the SPM of a core; dst Index, destination Index, indicates the routing information that the destination operand ultimately needs to forward.
The state (State) and linker (Linker) stored in the Root_DSD may include: the start address Start Addr, the end address End Addr, the hash index Hash Index, the tensor dimension Tensor Dimension, the valid flag Valid, the destination transfer Dst Transfer, the link tensor index Link Tensor DSD Index, the data length Data Length, the data width Data Width, the protection flag Protection, the ready flag Ready, and the type Type.
The state (State) stored in the Tensor_DSD may include: the dimension length Dimension Length, the data width Data Width, the Stride, and the valid flag Valid.
In the present application, instruction decoding may accept at most one immediate as a source operand. FIG. 3 is a schematic diagram of a decoded Tensor PE instruction in accordance with an embodiment of the application.
As shown in FIG. 3, the decoded Tensor PE instructions include calculation instructions and SPM L/S instructions. Calculation instructions comprise three-operand and two-operand calculation instructions. A three-operand calculation instruction may take an immediate, an immediate, and an SPM operand; an immediate, an SPM operand, and an SPM operand; or an SPM operand, an SPM operand, and an SPM operand. A two-operand calculation instruction may take an immediate and an immediate, an immediate and an SPM operand, or an SPM operand and an SPM operand. SPM L/S instructions include the SPM STORE and SPM LOAD instructions.
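One possible reading of the operand combinations (the exact list is partially garbled in the source, so the table below is a reconstruction, and all names are invented) can be checked mechanically:

```python
# Allowed source-operand multisets for a compute instruction, stored in
# sorted order so that operand position does not matter.
ALLOWED = {
    ('imm', 'imm'), ('imm', 'spm'), ('spm', 'spm'),                       # two-operand
    ('imm', 'imm', 'spm'), ('imm', 'spm', 'spm'), ('spm', 'spm', 'spm'),  # three-operand
}

def valid_compute_operands(kinds):
    """kinds: tuple of 'imm'/'spm' tags for a compute instruction's sources."""
    return tuple(sorted(kinds)) in ALLOWED
```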
The hash values of the SPM tensors used by a decoded Tensor PE instruction are compared. For an SPM LOAD instruction, when the hash values are equal (Equal), the instruction is invalidated, feedback is sent back upstream, and the subsequent transmission is cancelled; when the hash values are unequal (Unequal), the instruction's ready flag is set to 1, whether to switch threads is determined by priority, the Protection flag of the corresponding Tensor_DSD is set to 0, and after execution completes the hash value of the corresponding Tensor_DSD is recalculated. For other decoded Tensor PE instructions, when the hash values are equal, the instruction's ready flag is set to 1, whether to switch threads is determined by priority, and the Protection flag of the corresponding Tensor_DSD is set to 1; when the hash values are unequal, the instruction's ready flag is set to 0, the instruction is temporarily held, and feedback is sent back to request that the corresponding data be read.
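The asymmetry between SPM LOAD and other instructions above can be captured in a small dispatcher. This is a hedged sketch (the function and return labels are invented): for a load, an equal hash means the tensor is already resident, so the load is cancelled; for other instructions, an equal hash means the operand data is present.

```python
def on_hash_compare(kind, equal):
    """Dispatch on instruction kind and hash-comparison result."""
    if kind == 'spm_load':
        # tensor already cached locally: cancel the load, notify upstream
        return 'invalidate_and_feedback' if equal else 'set_ready'
    # other instructions: equal hash means the data is already present
    return 'set_ready_and_protect' if equal else 'hold_and_request_data'
```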
Fig. 4 shows a schematic diagram of a packet (Package) according to an embodiment of the application. As shown in fig. 4, a 32-bit packet may comprise a 16-bit hash_index and 16 bits of data, and the data packet may be parsed into instruction/data, forwarding-rule, and control/data information.
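The 16b + 16b split can be sketched with simple shift-and-mask helpers. Which half holds the hash_index is an assumption here; the source fixes only the split, not the ordering.

```python
HASH_SHIFT = 16      # assumed: hash_index in the high half of the 32-bit word
MASK16 = 0xFFFF

def pack(hash_index, data):
    """Build a 32-bit packet from a 16-bit hash_index and 16 bits of data."""
    return ((hash_index & MASK16) << HASH_SHIFT) | (data & MASK16)

def unpack(packet):
    """Split a 32-bit packet back into (hash_index, data)."""
    return (packet >> HASH_SHIFT) & MASK16, packet & MASK16
```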
FIG. 5 is a schematic diagram illustrating a process flow after accepting a packet (Package) according to an embodiment of the present application. As shown in fig. 5, after receiving the 32bit packet, it is first determined whether it is instruction information. When the package is not instruction information, comparing the hash_index in the package with the hash_index of the root_dsd with the ready of 0 in the existing read instruction, and comparing the hash_index in the package with the hash_index with the operand of the immediate in one calculation instruction. When the same hash_index (Match Not Found) is Not matched, the hash_index is compared again with the hash_index of the root_DSD with the ready of 0 in the existing read instruction; when the same hash_index is matched in the read instruction (Match Found in Read Inst), writing data to the root_dsd address of the corresponding hash_index; when the same hash_index is matched in the immediate of the calculation instruction (Match Found in Compute inst.'s Immediate Operand), the corresponding calculation thread is switched, and the immediate is accepted for calculation.
When the packet is instruction information, it is determined whether an idle thread exists. When no idle thread exists, back pressure (Back Pressure) is triggered and failure information for the instruction is returned; when an idle thread exists, the complete instruction is received and parsed over several cycles, and the corresponding thread register's information (such as priority, interrupt enable, source operands, destination operand, hash_index, and forwarding information of the destination operand) is written.
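The Fig. 5 flow above amounts to a two-way dispatch: instruction packets need a free thread register, data packets are matched by hash_index. A hedged sketch with invented helper and outcome names:

```python
def handle_packet(is_instruction, hash_index, free_thread,
                  pending_read_hashes, compute_imm_hashes):
    """Route one received packet; returns a label for the action taken."""
    if is_instruction:
        if not free_thread:
            return 'back_pressure_fail'       # no idle thread: report failure
        return 'parse_into_thread_register'   # receive over several cycles
    if hash_index in pending_read_hashes:
        return 'write_to_root_dsd'            # data for a pending tensor read
    if hash_index in compute_imm_hashes:
        return 'switch_thread_use_immediate'  # immediate for a compute thread
    return 'no_match_recompare'               # retry against pending reads
```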
Fig. 6A-B show a flow diagram of thread switching in an embodiment of the application.
As shown in fig. 6A, while tensor instruction A is running, it is first detected whether Interrupt Enable is set. If no B instruction is received while the Tensor PE has an empty thread register, the A instruction continues to run until it finishes or stalls; when a B instruction is received and the Tensor PE has an empty thread register, it is determined whether all operands of the B instruction are ready and whether B has a higher priority. When this condition is not satisfied, the A instruction continues to run until it finishes or stalls, after which the valid thread with the highest priority is selected for execution; when priorities are equal, the thread is selected randomly or by a round-robin algorithm. When all operands of the B instruction are ready and it has a higher priority, execution switches to the B instruction, and the running state, pointers, and so on of the A instruction are kept in their original registers.
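The selection step described above (valid threads only, highest priority wins, ties broken round-robin) can be sketched as follows; the record layout and round-robin bookkeeping are assumptions for illustration:

```python
def pick_thread(threads, last_picked=-1):
    """threads: list of (tid, valid, priority). Returns the tid to run next,
    or None when no thread is valid. Equal priorities rotate after last_picked."""
    live = [t for t in threads if t[1]]
    if not live:
        return None
    best = max(p for _, _, p in live)
    tids = sorted(tid for tid, _, p in live if p == best)
    for tid in tids:
        if tid > last_picked:     # round-robin: next id after the last winner
            return tid
    return tids[0]                # wrap around
```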
As shown in fig. 6B, while tensor instruction A is running, it is first detected whether Interrupt Enable is set. When it is not set and a received data packet is parsed to yield a hash_index that is the immediate of another calculation instruction B, the A instruction continues to run and the received data packet is discarded with fail returned; when it is set and the parsed hash_index is the immediate of another calculation instruction B, it is determined whether all operands of the B instruction other than the immediate are ready and whether B has a higher priority. When this condition is not satisfied, the A instruction continues to run and the received data packet is discarded with fail returned; when all operands of the B instruction other than the immediate are ready and B has a higher priority, execution switches to the B instruction and the running state, pointers, and so on of the A instruction are left in their original registers.
Figures 7A-L illustrate schematic diagrams of the operation of a processing unit in one embodiment of the application.
Wherein the operation of the processing unit may comprise:
as shown in fig. 7A, the processing unit Tensor PE includes a plurality of registers Thread. The Thread comprises a primary pointer Tensor DSD, a secondary pointer Root Tensor DSD pointing to the primary pointer Tensor DSD, and a 48KB SRAM space.
As shown in fig. 7B, the processing unit Tensor PE reads the tensor A instruction (Read Tensor A Instruction). Tensor A is loaded in the register Thread (Load Tensor A); the hash index of tensor A (A hash index) may be 0x1acf, with ptr=0000. The hash index of tensor A is stored in the secondary pointer, the dimension of tensor A (A dimension 4) is stored in the primary pointer Tensor DSD, and the first space is occupied in the SRAM.
As shown in fig. 7C, the processing unit Tensor PE receives a 32-bit packet. The packet comprises a 16-bit index=0x1acf and 16-bit data=[15:0] data; ptr=1000 in the register Thread, and storage is occupied in the first space. Packets are received until ptr=3fff in the register Thread, at which point the first space is fully occupied, as shown in fig. 7D.
As shown in fig. 7E, the processing unit Tensor PE receives a C=A×B instruction. The register Thread is not ready to execute the C=A×B instruction (C=A×B not ready); the secondary pointer Root Tensor DSD stores the hash index of tensor C (C hash index 0x3d2e), and tensor B is needed (need Tensor B).
As shown in fig. 7F, the processing unit Tensor PE reads the load Tensor B instruction. Tensor B is loaded in the register Thread with Tensor B ptr=4000; the hash index of tensor B (B hash index 0xff71) is stored in the secondary pointer Root Tensor DSD, the dimension of tensor B (B dimension 4) is stored in the primary pointer Tensor DSD, and the second space is occupied in the SRAM.
As shown in fig. 7G, the processing unit Tensor PE receives a 32-bit packet. The packet comprises a 16-bit index=0xff71 and 16-bit data=[15:0] data; the register Thread loads tensor B until Tensor B ptr=7fff, completing reception of the packets, and the second space is fully occupied.
As shown in fig. 7H, the processing unit Tensor PE reads the load Tensor D instruction. Tensor D (priority 3) is loaded in the register Thread, alongside tensor C=A×B (ptr=8000, priority 2); the hash index of tensor D (D hash index 0xd59a) is stored in the secondary pointer Root Tensor DSD, the dimension of tensor D (D dimension 2) is stored in the primary pointer Tensor DSD, C hash index 0x3d2e points to the third space in the SRAM, and D hash index 0xd59a points to the fourth space in the SRAM.
As shown in fig. 7I, the processing unit Tensor PE receives an unrelated data packet. Tensor C=A×B (ptr=9000, priority 2) is in the register Thread, and storage is occupied in the third space.
As shown in fig. 7J, the processing unit Tensor PE receives a 32-bit packet. The packet includes a 16-bit index=0xd59a and 16-bit data=[15:0] data, and storage is occupied in the fourth space of the SRAM.
As shown in fig. 7K, the processing unit Tensor PE receives an unrelated data packet. In the register Thread, tensor C=A×B (ptr=c999, priority 2); reception of the packets is complete, and the third space is fully occupied.
As shown in fig. 7L, the processing unit Tensor PE receives an E=C+D instruction; the register Thread is not ready to execute the E=C+D instruction (E=C+D not ready), and the hash index of tensor E is stored in the secondary pointer (E hash index 0x37b9).
The application takes Tensor (Root Tensor DSD) as a basic operand, so that the instruction number in PE is greatly reduced, thereby avoiding the extra power consumption caused by Fetch and Decode caused by a large number of instructions, and having lower power consumption and high energy efficiency ratio. Further, high Instruction Cache overhead caused by a large number of instructions is avoided, the circuit area is reduced, instruction cache miss caused by a large number of instruction fetching operations is avoided, and the performance is improved.
The application takes the Tensor computation Thread X as the basic instruction. The secondary pointer Root Tensor DSD points to the primary pointer Tensor DSD and to the corresponding SRAM space, and the SRAM address space of a tensor that has not yet been computed is kept in a protected state and is not easily replaced, which reduces cache misses caused by thread switching. Source Register 0 ptr through Source Register 2 ptr and Dst Register 0 ptr in the corresponding thread register Thread X store the current computation state, so the overhead of a thread switch is very low: computation can continue directly from the address saved in the corresponding pointer.
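The low-overhead switch described above can be sketched as follows. This is a simplified software model under assumed field names (`src_ptr`, `dst_ptr`): because each thread register keeps its current position in its source and destination pointers, resuming a thread is just continuing from the saved addresses, with no separate save/restore step.

```python
from dataclasses import dataclass

@dataclass
class ThreadRegister:
    priority: int
    ready: bool
    src_ptr: int = 0   # Source Register ptr: next source element address
    dst_ptr: int = 0   # Dst Register ptr: next destination address
    done: bool = False

def run_some(t: ThreadRegister, steps: int, length: int):
    """Advance a thread's tensor computation by up to `steps` elements."""
    for _ in range(steps):
        if t.src_ptr >= length:
            t.done = True
            return
        t.src_ptr += 1   # the computation state lives in the pointers
        t.dst_ptr += 1   # themselves, so a switch costs nothing extra
```

A thread can be preempted between any two calls to `run_some` and later resumed; the pointers alone carry the state forward.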
Different threads may be assigned different priorities in Thread X, so the compiler can allocate different registers to tasks of different importance to shorten critical paths. The compiler can assign different tensor computation tasks at different times according to the resource readiness of each processing unit (PE, Processing Element) in the wafer-level chip; since each PE can exchange data at the granularity of a Tensor (Root Tensor DSD), the compiler's scheduling can minimize the probability of cache misses. In a wafer-level chip, because PEs are numerous and routes are long, data that will be used many times should be cached locally as much as possible.
While various embodiments of the present application have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to those skilled in the relevant art that various combinations, modifications, and variations can be made therein without departing from the spirit and scope of the application. Thus, the breadth and scope of the present application as disclosed herein should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
Claims (10)
1. A method of low overhead thread switching for tensor operations, comprising the steps of, by a processing unit:
running a first tensor instruction;
receiving a data packet, wherein the data packet comprises a hash index and data;
parsing the data packet, wherein when a second tensor instruction is obtained by parsing and an idle thread register exists in the processing unit, an operand and a priority of the second tensor instruction are determined; and
the second tensor instruction is executed when all operands of the second tensor instruction are ready and the priority of the second tensor instruction is higher than the priority of the first tensor instruction.
2. The low overhead thread switching method for tensor operations of claim 1, wherein parsing the data packet further comprises:
and determining the operands of the second tensor instruction other than the immediate, and its priority, when an immediate of the second tensor instruction is obtained by parsing, wherein the second tensor instruction is executed when all operands of the second tensor instruction other than the immediate are ready and the priority of the second tensor instruction is higher than the priority of the first tensor instruction.
3. The low overhead thread switching method for tensor operations of claim 1, wherein the processing unit comprises:
a plurality of thread registers, comprising:
a plurality of secondary pointers Root Tensor DSDs pointing to the primary pointers Tensor DSDs and the storage space of the static random access memory, wherein the processing unit takes the secondary pointers Root Tensor DSDs as basic operands;
a plurality of primary pointers, tensor DSDs, representing the dimensions of the Tensor; and
static random access memory.
4. The low overhead thread switching method for tensor operations according to claim 3, wherein the instructions and states stored in the thread register include one or more of:
a valid flag Valid, which indicates whether the current thread is valid;
an interrupt enable Interrupt Enable, which indicates whether interrupts are enabled while the current thread executes;
a Priority, which indicates the priority of the current thread in scheduling;
a source operand Source Operand, which represents the data required for thread execution;
a destination operand Dst Operand, which represents the data to be written or modified when the thread executes;
an operation code Opcode, which indicates the kind of operation performed by the instruction;
an instruction type Instruction Type, which represents the type of instruction the current thread is executing;
a source register pointer Source Register ptr, which represents the address of the tensor to which a source operand points;
a source stride Source Stride, which represents the pattern in which a source operand tensor is read;
a destination register pointer Dst Register ptr, which represents the address of the tensor to which the destination operand points;
a destination stride Dst Stride, which represents the pattern in which the destination operand tensor is read;
a source operand hash index Source Operand Hash Index, which represents the hash value of a source operand; and
a destination index DstIndex, which represents the routing information by which the destination operand is eventually forwarded.
5. The low overhead thread switching method for tensor operations of claim 4, wherein the states and links stored in the secondary pointer Root Tensor DSD include one or more of:
the start address Start Addr, the end address End Addr, the hash index Hash Index, the tensor dimension Tensor Dimension, the valid flag Valid, the destination transfer Dst Transfer, the link tensor index Link Tensor DSD Index, the data length Data Length, the data width Data Width, the protection enable Protection Enable, the ready flag Ready, and the type Type.
6. The low overhead thread switching method for tensor operations of claim 5, wherein the states stored in the primary pointer Tensor DSD include one or more of:
the dimension length Dimension Length, the data width Data Width, the stride Stride, and the valid flag Valid.
7. The low overhead thread switching method for tensor operations of claim 6, wherein the decoded instructions of the processing unit comprise:
a computing instruction, comprising:
three operands, wherein the three operands include an immediate, an immediate, and an SPM; or an immediate, an SPM, and an SPM; or an SPM, an SPM, and an SPM; or alternatively
two operands, wherein the two operands include an immediate and an immediate, or an immediate and an SPM;
SPM stores instructions; and
SPM loads instructions.
8. The low overhead thread switching method for tensor operations of claim 4, wherein the current computation state is stored in the source register pointer Source Register ptr and the destination register pointer Dst Register ptr.
9. A processing unit, characterized in that it performs the steps of the method according to any one of claims 1-8.
10. A wafer-level chip, comprising the processing unit of claim 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310395659.0A CN116578340A (en) | 2023-04-13 | 2023-04-13 | Low-overhead thread switching method for tensor operation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116578340A true CN116578340A (en) | 2023-08-11 |
Family
ID=87538542
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310395659.0A Pending CN116578340A (en) | 2023-04-13 | 2023-04-13 | Low-overhead thread switching method for tensor operation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116578340A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||