CN116578340A - Low-overhead thread switching method for tensor operation - Google Patents

Low-overhead thread switching method for tensor operation

Info

Publication number
CN116578340A
CN116578340A CN202310395659.0A CN202310395659A
Authority
CN
China
Prior art keywords
tensor
instruction
thread
priority
operand
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310395659.0A
Other languages
Chinese (zh)
Inventor
姜申飞
莫志文
胡杨
潘岳
李霞
王立华
朱小云
王磊
郝培霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Shanghai AI Innovation Center
Original Assignee
Tsinghua University
Shanghai AI Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University, Shanghai AI Innovation Center filed Critical Tsinghua University
Priority to CN202310395659.0A
Publication of CN116578340A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/3012Organisation of register space, e.g. banked or distributed register file
    • G06F9/30123Organisation of register space, e.g. banked or distributed register file according to context, e.g. thread buffers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/3012Organisation of register space, e.g. banked or distributed register file
    • G06F9/30138Extension of register space, e.g. register cache
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

The application relates generally to the technical field of semiconductor chips and provides a low-overhead thread switching method for tensor operations, in which a processing unit performs the following actions: running a first tensor instruction; receiving a data packet, where the data packet comprises a hash index and data; parsing the data packet, where an operand and a priority of a second tensor instruction are determined when the second tensor instruction is obtained by parsing and an idle thread register exists in the processing unit; and executing the second tensor instruction when all operands of the second tensor instruction are ready and the priority of the second tensor instruction is higher than that of the first tensor instruction. By taking the tensor as the basic operand, the application greatly reduces the number of instructions inside a PE, avoiding extra power consumption, lowering chip power consumption, achieving a high energy-efficiency ratio, reducing circuit area, avoiding instruction cache misses caused by a large number of instruction fetch operations, and improving performance.

Description

Low-overhead thread switching method for tensor operation
Technical Field
The present application relates generally to the field of semiconductor chip technology. In particular, the present application relates to a low overhead thread switching method for tensor operations.
Background
Traditionally, the operands of a semiconductor chip's processing units have single data elements or vectors as their granularity. This makes instruction fetching (Instruction Fetch) costly: performing tensor operations requires a large number of instruction fetches, introduces substantial additional Fetch and Decode overhead, consumes a large amount of power, and yields a low energy-efficiency ratio. The corresponding instruction Cache (Instruction Cache) is more demanding in capacity, resulting in greater area overhead, and the large number of instruction fetch operations makes instruction Cache misses (Cache Miss) more likely, degrading performance.
In particular, when performing tensor calculations, using the same Cache replacement policy (Cache Replacement Policy) for many different tensor calculations is prone to Data Cache misses (Data Cache Miss), so different Cache replacement policies need to be used for different tensor sizes and calculation types. When switching between different tensor calculation threads, tensor data that another thread has not yet finished computing is easily evicted from the Cache, so that when a thread is switched back it finds that its required data have been evicted from the Cache Line (Cache Line), further reducing performance.
Because a wafer-level chip has a large number of processing units (PE, Processing Element) and long routing paths, the above problems, such as the Miss Penalty caused by cache misses, are further amplified compared with conventional computing systems; they have become a bottleneck and constraint of the computing system and severely restrict the utilization of the computing units. In a wafer-level chip, data that will be used many times should be cached locally as much as possible; however, each processing unit (PE) does not know whether the data it has cached (Cache) is actually the corresponding data. If this problem is left entirely to the compiler, the computational complexity of the compiler becomes too high, and any error causes computational correctness problems.
Disclosure of Invention
To at least partially solve the above-mentioned problems in the prior art, the present application proposes a low-overhead thread switching method for tensor operation, which is characterized by comprising the following actions performed by a processing unit:
running a first tensor instruction;
receiving a data packet, wherein the data packet comprises a hash index and data;
parsing the data packet, wherein when a second tensor instruction is obtained by parsing and an idle thread register exists in the processing unit, an operand and a priority of the second tensor instruction are determined; and
the second tensor instruction is executed when all operands of the second tensor instruction are ready and the priority of the second tensor instruction is higher than the priority of the first tensor instruction.
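For illustration only, the following C sketch models the switching decision described above; the type and field names (tensor_thread_t, operands_ready, priority) are hypothetical and the numeric priority scale is an assumption, not part of the application.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one thread register slot in a Tensor PE
 * (names are illustrative, not taken from the application). */
typedef struct {
    bool     valid;           /* slot holds a live tensor instruction   */
    bool     operands_ready;  /* all operands of the instruction ready  */
    uint8_t  priority;        /* scheduling priority, larger = higher   */
} tensor_thread_t;

/* Decide whether to preempt the running first instruction with a newly
 * parsed second instruction: switch only if the second instruction is
 * valid in an idle thread register, all of its operands are ready, and
 * its priority is strictly higher than that of the first instruction. */
static bool should_switch(const tensor_thread_t *first,
                          const tensor_thread_t *second)
{
    if (!second->valid || !second->operands_ready)
        return false;
    return second->priority > first->priority;
}

In this sketch the switch is taken only when the second instruction strictly outranks the first, matching the "higher than" condition above.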
In one embodiment of the application, it is provided that parsing the data packet further comprises:
determining the operands of the second tensor instruction other than the immediate and its priority when the immediate of the second tensor instruction is obtained by parsing, wherein the second tensor instruction is executed when all operands of the second tensor instruction other than the immediate are ready and the priority of the second tensor instruction is higher than the priority of the first tensor instruction.
In one embodiment of the application, it is provided that the processing unit comprises:
a plurality of thread registers, comprising:
a plurality of secondary pointers Root Tensor DSDs pointing to the primary pointers Tensor DSDs and the storage space of the static random access memory, wherein the processing unit takes the secondary pointers Root Tensor DSDs as basic operands;
a plurality of primary pointers, tensor DSDs, representing the dimensions of the Tensor; and
static random access memory.
In one embodiment of the application, it is provided that the instructions and states stored in the thread register include one or more of the following:
whether Valid, which indicates whether the current thread is valid;
Interrupt Enable, which indicates whether interrupts are enabled while the current thread is executing;
Priority, which indicates the priority of the current thread in scheduling;
a Source Operand, which represents the data required when the thread executes;
a destination Operand Dst Operand, which represents the data to be written or modified when the thread executes;
an operation code Opcode, which indicates the kind of operation performed by the instruction;
an Instruction Type, which represents the type of instruction being executed by the current thread;
a source register pointer Source Register ptr, which represents the address of the tensor to which the source operand points;
a source stride Source Stride, which represents the pattern in which the source operand tensor is read;
a destination register pointer Dst Register ptr, which represents the address of the tensor to which the destination operand points;
a destination stride Dst Stride, which represents the pattern in which the destination operand tensor is read;
a source operand hash value index Source Operand Hash Index, which represents the hash value of the source operand; and
a destination index Dst Index, which represents the routing information to which the destination operand ultimately needs to be forwarded.
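As a non-authoritative illustration, the thread register contents listed above could be modeled by the following C structure; all field widths, and the limit of three source operands, are assumptions based on the description of FIG. 2 rather than values specified by the application.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative C model of the instruction and state held in one thread
 * register; widths are assumed, not specified by the application. */
typedef struct {
    bool     valid;                  /* whether the current thread is valid        */
    bool     interrupt_enable;       /* interrupts enabled while executing         */
    uint8_t  priority;               /* scheduling priority                        */
    uint8_t  src_operand[3];         /* Root Tensor DSD indices of source operands */
    uint8_t  dst_operand;            /* Root Tensor DSD index of the destination   */
    uint8_t  opcode;                 /* kind of operation                          */
    uint8_t  instruction_type;       /* e.g. number of source operands             */
    uint32_t src_register_ptr[3];    /* addresses of the source tensors            */
    uint32_t src_stride[3];          /* read pattern of each source tensor         */
    uint32_t dst_register_ptr;       /* address of the destination tensor          */
    uint32_t dst_stride;             /* read pattern of the destination tensor     */
    uint16_t src_operand_hash_index; /* hash value of the source operand           */
    uint16_t dst_index;              /* routing information for forwarding         */
} thread_register_t;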
In one embodiment of the application, it is provided that the state and linker stored in the secondary pointer Root Tensor DSD comprises one or more of the following:
the start address Start Addr, the end address End Addr, the hash index Hash Index, the tensor dimension Tensor Dimension, whether Valid, the destination transfer Dst Transfer, the link tensor index Link Tensor DSD Index, the data length Data Length, the data width Data Width, the protection enable Protection, whether Ready, and the Type.
In one embodiment of the application, it is provided that the state stored in the primary pointer Tensor_DSD includes one or more of the following:
dimension Length, data Width, stride, and whether Valid.
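For illustration, the Root Tensor DSD and Tensor DSD states above might be modeled as the following C structures; field widths and the limit of four linked Tensor DSDs per Root Tensor DSD are assumptions.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of a primary pointer (Tensor DSD): one tensor dimension. */
typedef struct {
    uint32_t dimension_length;  /* number of elements along this dimension */
    uint8_t  data_width;        /* element width                           */
    uint32_t stride;            /* step between consecutive elements       */
    bool     valid;
} tensor_dsd_t;

/* Illustrative model of a secondary pointer (Root Tensor DSD) that links
 * up to four Tensor DSDs and an SRAM region; sizes are assumptions. */
typedef struct {
    uint32_t start_addr;               /* start address in SRAM            */
    uint32_t end_addr;                 /* end address in SRAM              */
    uint16_t hash_index;               /* hash of the tensor               */
    uint8_t  tensor_dimension;         /* number of dimensions             */
    bool     valid;
    uint16_t dst_transfer;             /* destination transfer information */
    uint8_t  link_tensor_dsd_index[4]; /* indices of linked Tensor DSDs    */
    uint32_t data_length;
    uint8_t  data_width;
    bool     protection;               /* protected from replacement       */
    bool     ready;
    uint8_t  type;
} root_tensor_dsd_t;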
In one embodiment of the application, it is provided that the decoded instructions of the processing unit comprise:
a computing instruction, comprising:
three operands, wherein the three operands include an immediate, and an SPM, or an immediate, an SPM, and an SPM, or an SPM, and an SPM; or alternatively
Two operands, wherein the two operands include an immediate and an immediate, or an immediate and an SPM;
SPM stores instructions; and
SPM loads instructions.
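Purely as a sketch, the decoded instruction classes listed above could be enumerated as follows in C; the enumerator names are hypothetical and carry no encoding information from the application.

/* Illustrative classification of decoded Tensor PE instructions. */
typedef enum {
    TPE_COMPUTE_3OP,   /* calculation instruction with three operands */
    TPE_COMPUTE_2OP,   /* calculation instruction with two operands   */
    TPE_SPM_STORE,     /* SPM store instruction                       */
    TPE_SPM_LOAD       /* SPM load instruction                        */
} tpe_instruction_class_t;

/* Operand kinds that can appear in a calculation instruction. */
typedef enum {
    TPE_OPERAND_IMMEDIATE,
    TPE_OPERAND_SPM
} tpe_operand_kind_t;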
In one embodiment of the application, it is provided that the currently calculated state is stored by the source Register pointer Source Register ptr and the destination Register pointer Dst Register ptr.
The application also proposes a processing unit which performs the steps of the method.
The application also provides a wafer-level chip, which is provided with the processing unit.
The application has at least the following beneficial effects. The application takes the Tensor (Root Tensor DSD) as the basic operand, so that the number of instructions inside a PE is greatly reduced, avoiding the extra power consumption of Fetch and Decode caused by a large number of instructions; the power consumption is therefore lower and the energy-efficiency ratio is high. Further, the high Instruction Cache overhead caused by a large number of instructions is avoided, the circuit area is reduced, instruction cache misses caused by a large number of instruction fetch operations are avoided, and performance is improved. The application takes the tensor calculation thread Thread X as the basic instruction; the secondary pointer Root Tensor DSD points to the primary pointer Tensor DSD and to the corresponding SRAM space, and the SRAM address space corresponding to a tensor that has not yet been computed is in a protected state and is not easily replaced, thereby reducing cache misses caused by thread switching. Source Register 0 ptr to Source Register 2 ptr and Dst Register 0 ptr in the corresponding thread register Thread X store the current calculation state, so the overhead of thread switching is very low and calculation can continue directly from the state addresses stored in the corresponding pointers. Different threads may be assigned different priority levels in Thread X, so that the compiler can allocate different registers to tasks of different importance to shorten critical paths. The compiler can allocate different tensor calculation tasks at different times according to the resource readiness of each processing unit (PE, Processing Element) in the wafer-level chip, and since each PE can replace data at the granularity of a Tensor (Root Tensor DSD), the compiler's scheduling can reduce the probability of cache misses as much as possible. In a wafer-level chip, due to the large number of PEs and long routing, data that will be used multiple times should be cached locally as much as possible.
Drawings
To further clarify the advantages and features present in various embodiments of the present application, a more particular description of various embodiments of the present application will be rendered by reference to the appended drawings. It is appreciated that these drawings depict only typical embodiments of the application and are therefore not to be considered limiting of its scope. In the drawings, for clarity, the same or corresponding parts will be designated by the same or similar reference numerals.
FIG. 1 shows a schematic diagram of a tensor-granularity processing unit in one embodiment of the application.
Fig. 2 shows a schematic diagram of information stored in Thread, Root_DSD, and Tensor_DSD in an embodiment of the present application.
FIG. 3 is a schematic diagram of a decoded Tensor PE instruction in accordance with an embodiment of the application.
Fig. 4 shows a schematic diagram of a packet (Package) according to an embodiment of the application.
FIG. 5 is a schematic diagram illustrating a processing flow after receiving a packet (Package) according to an embodiment of the present application.
Fig. 6A-B show a flow diagram of a thread switch in an embodiment of the application.
Figures 7A-L illustrate a schematic diagram of the operation of a processing unit in one embodiment of the application.
Detailed Description
It should be noted that the components in the figures may be shown exaggerated for illustrative purposes and are not necessarily to scale. In the drawings, identical or functionally identical components are provided with the same reference numerals.
In the present application, unless specifically indicated otherwise, "disposed on", "disposed over" and "disposed above" do not preclude the presence of an intermediate element between the two. Furthermore, "disposed on or above" merely indicates the relative positional relationship between the two components; in certain circumstances, such as after reversing the product orientation, it may also be converted to "disposed under or below", and vice versa.
In the present application, the embodiments are merely intended to illustrate the scheme of the present application, and should not be construed as limiting.
In the present application, the adjectives "a" and "an" do not exclude a scenario of a plurality of elements, unless specifically indicated.
It should also be noted herein that in embodiments of the present application, only a portion of the components or assemblies may be shown for clarity and simplicity, but those of ordinary skill in the art will appreciate that the components or assemblies may be added as needed for a particular scenario under the teachings of the present application. In addition, features of different embodiments of the application may be combined with each other, unless otherwise specified. For example, a feature of the second embodiment may be substituted for a corresponding feature of the first embodiment, or may have the same or similar function, and the resulting embodiment may fall within the scope of disclosure or description of the application.
It should also be noted herein that, within the scope of the present application, the terms "identical", "equal" and the like do not mean that the two values are absolutely equal, but rather allow for some reasonable error; that is, the terms also encompass "substantially identical" and "substantially equal". By analogy, in the present application, terms indicating direction such as "perpendicular" and "parallel" also cover the meanings of "substantially perpendicular" and "substantially parallel".
The numbers of the steps of the respective methods of the present application are not limited to the order of execution of the steps of the methods. The method steps may be performed in a different order unless otherwise indicated.
The application is further elucidated below in connection with the embodiments with reference to the drawings.
FIG. 1 shows a schematic diagram of a tensor-granularity processing unit in one embodiment of the application. As shown in fig. 1, the processing unit (Tensor PE) may include a plurality of Threads, a plurality of Root_DSDs, a plurality of Tensor_DSDs, and an SRAM. A Thread register may store, for a given instruction, the Root_DSDs of its source and destination operands, the interrupt enable, the priority, the pointer state of the current execution, and the like. A Root_DSD is a root DSD, similar to a secondary pointer, containing 0-4 Tensor_DSDs. A Tensor_DSD describes the corresponding dimension.
Fig. 2 shows a schematic diagram of information stored in Thread, Root_DSD, and Tensor_DSD in an embodiment of the present application.
As shown in fig. 2, the Instruction (Instruction) and State (State) stored in Thread may include:
valid, whether Valid, indicate the current thread is Valid; the Interrupt Enable Interrupt is started to indicate whether the Interrupt is started when the current thread is executed; priority, which indicates the Priority of the current thread in scheduling; source operations 0-2, source operands 0-2, represent the data needed by a thread when executing, support at least one and at most three Source operands; dst Operand, destination Operand, which represents the data to be written or modified when a thread executes; opcode, the type of operation performed by an instruction, such as addition, multiplication, comparison, etc.; an Instruction Type, indicating the Type of Instruction being executed by the current thread, e.g., only one source operand or two source operands, etc.; state Regs, status registers, indicating the status of the current instruction execution, e.g., in-execution, where to execute, etc.; source Register 0-2 ptr, source Register pointer 0-2, indicates the address of the tensor pointed to by the Source operand; source 0-2 stride, source step size 0-2, indicates the mode when the Source operand tensor is read, e.g., continuous read or read at certain interval values; dst Register 0ptr, destination Register pointer, address of tensor pointed to by destination operand; dst 0 Stride, destination step size 0, indicates the mode in which the destination operand tensor is read, e.g., continuous read or read at a certain interval value; source Operand Hash Index, a source operand hash index indicating the source operand hash value to determine if a tensor already exists in the SPM of a core; dst Index, destination Index, indicates the routing information that the destination operand ultimately needs to forward.
The State (State) and Linker (Linker) stored in the Root_DSD may include: the start address Start Addr, the end address End Addr, the hash index Hash Index, the tensor dimension Tensor Dimension, whether Valid, the destination transfer Dst Transfer, the link tensor index Link Tensor DSD Index, the data length Data Length, the data width Data Width, the protection enable Protection, whether Ready, and the Type.
The State (State) stored in the Tensor_DSD may include: the Dimension Length, the Data Width, the Stride, and whether Valid.
In the present application, instruction decoding may accept at most one immediate as a source operand. FIG. 3 is a schematic diagram of a decoded Tensor PE instruction in accordance with an embodiment of the application.
As shown in FIG. 3, the decoded Tensor PE instruction includes a calculation instruction and SPM L/S instruction. The calculation instruction includes a three-operand calculation instruction and a two-operand calculation instruction. The three operand calculation instructions may include an immediate, and an SPM, or an immediate, an SPM, and an SPM, or an SPM, and an SPM. The two operand calculation instruction may include an immediate and an immediate, or an immediate and an SPM, or an SPM and an SPM. SPM L/S instructions include SPM STORE instructions and SPM LOAD instructions.
The hash values of the SPM Tensors used by a decoded Tensor PE instruction are compared. For an SPM LOAD instruction, when the hash values are equal (Equal), the instruction is invalidated, feedback is sent back along the path, and the subsequent transmission is cancelled; when the hash values are unequal (Unequal), the instruction's ready is set to 1, whether to switch to this thread is determined by its Priority, the Protection of the corresponding Tensor_DSD is set to 0, and after execution completes, the hash value of the corresponding Tensor_DSD is computed. For other decoded Tensor PE instructions, when the hash values are equal (Equal), the instruction's ready is set to 1, whether to switch to this thread is determined by its Priority, and the Protection of the corresponding Tensor_DSD is set to 1; when the hash values are unequal (Unequal), the instruction's ready is set to 0, the instruction is temporarily held, and feedback is sent back to request that the corresponding data be read.
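A minimal C sketch of this hash-comparison decision follows; the types, field names, and the request_data flag are hypothetical, and the priority-based switching decision is left to the caller.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     ready;
    bool     is_spm_load;   /* true for SPM LOAD, false for other instructions */
    uint16_t hash_index;    /* hash of the SPM tensor the instruction uses     */
} decoded_inst_t;

typedef struct {
    bool     protection;
    uint16_t hash_index;
} dsd_state_t;

/* Returns true if the decoded instruction stays live, false if it is
 * invalidated; *request_data is set when the data must be fetched. */
static bool handle_hash_compare(decoded_inst_t *inst, dsd_state_t *dsd,
                                uint16_t spm_hash, bool *request_data)
{
    bool equal = (inst->hash_index == spm_hash);
    *request_data = false;

    if (inst->is_spm_load) {
        if (equal)                 /* tensor already present: cancel the load */
            return false;
        inst->ready = true;        /* thread becomes schedulable by priority  */
        dsd->protection = false;
        return true;
    }
    if (equal) {                   /* required tensor already in SPM */
        inst->ready = true;
        dsd->protection = true;    /* protect it until the compute finishes */
    } else {
        inst->ready = false;       /* hold and request the corresponding data */
        *request_data = true;
    }
    return true;
}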
Fig. 4 shows a schematic diagram of a packet (Package) according to an embodiment of the application. As shown in fig. 4, a 32-bit packet may include a 16-bit hash_index and 16 bits of data, and the data packet may be parsed into instruction/data, forwarding rule, and control/data information.
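For illustration, the 32-bit packet could be packed and unpacked as in the following C helpers; placing hash_index in the upper 16 bits is an assumption, since the figure does not fix the bit order.

#include <stdint.h>

/* Assumed layout: hash_index in bits [31:16], data in bits [15:0]. */
static inline uint32_t make_package(uint16_t hash_index, uint16_t data)
{
    return ((uint32_t)hash_index << 16) | data;
}

static inline uint16_t package_hash_index(uint32_t pkg) { return (uint16_t)(pkg >> 16); }
static inline uint16_t package_data(uint32_t pkg)       { return (uint16_t)(pkg & 0xFFFF); }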
FIG. 5 is a schematic diagram illustrating the processing flow after receiving a packet (Package) according to an embodiment of the present application. As shown in fig. 5, after a 32-bit packet is received, it is first determined whether it carries instruction information. When the package is not instruction information, the hash_index in the package is compared with the hash_index of each Root_DSD whose ready is 0 in the existing read instructions, and with the hash_index of any immediate operand in a calculation instruction. When no matching hash_index is found (Match Not Found), the comparison against the hash_index of the Root_DSDs whose ready is 0 in the existing read instructions is repeated; when a matching hash_index is found in a read instruction (Match Found in Read Inst), the data is written to the Root_DSD address of the corresponding hash_index; when a matching hash_index is found in the immediate of a calculation instruction (Match Found in Compute Inst.'s Immediate Operand), the corresponding calculation thread is switched to and the immediate is accepted for calculation.
When the package is instruction information, it is judged whether an idle thread exists. When no idle thread exists, back pressure (Back Pressure) is triggered and failure information for the instruction is returned; when an idle thread exists, the complete instruction is received and parsed over several cycles, and the information of the corresponding thread register (such as priority, interrupt enable, source operands, destination operand, hash_index, and forwarding information of the destination operand) is written.
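The following C sketch traces the FIG. 5 flow at a high level under stated assumptions; every helper function (find_pending_read, find_immediate_match, find_idle_thread, and so on) is hypothetical and declared only to keep the sketch self-contained.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical handlers; declared here only to make the sketch self-contained. */
int  find_pending_read(uint16_t hash_index);     /* Root_DSD with ready==0, or -1         */
int  find_immediate_match(uint16_t hash_index);  /* compute thread awaiting this immediate, or -1 */
int  find_idle_thread(void);                     /* idle thread register, or -1           */
void write_to_root_dsd(int dsd, uint16_t data);
void switch_to_thread_with_immediate(int thread, uint16_t data);
void receive_and_parse_instruction(int thread, uint32_t first_word);
void back_pressure_fail(void);

/* Sketch of the FIG. 5 flow after a 32-bit package arrives. */
void on_package(uint32_t pkg, bool is_instruction)
{
    uint16_t hash_index = (uint16_t)(pkg >> 16);
    uint16_t data       = (uint16_t)(pkg & 0xFFFF);

    if (!is_instruction) {
        int dsd = find_pending_read(hash_index);
        if (dsd >= 0) {                      /* Match Found in Read Inst */
            write_to_root_dsd(dsd, data);
            return;
        }
        int thr = find_immediate_match(hash_index);
        if (thr >= 0)                        /* Match Found in Compute Inst.'s Immediate */
            switch_to_thread_with_immediate(thr, data);
        /* otherwise: Match Not Found, compare again on a later pass */
        return;
    }

    int idle = find_idle_thread();
    if (idle < 0)
        back_pressure_fail();                /* no idle thread: back pressure + failure */
    else
        receive_and_parse_instruction(idle, pkg);
}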
Fig. 6A-B show a flow diagram of a thread switch in an embodiment of the application.
As shown in fig. 6A, while tensor instruction A is running, it is first detected whether Interrupt Enable is set. When the condition is not met, that is, no instruction B is received while the Tensor PE has an empty thread register, instruction A continues to run until it finishes or stalls; when an instruction B is received and the Tensor PE has an empty thread register, it is determined whether all operands of instruction B are ready and whether B has a higher priority. When not all operands of instruction B are ready or B does not have a higher priority, instruction A continues to run until it finishes or stalls, and then the valid thread with the highest priority is selected for execution, the thread being chosen randomly or by a round-robin algorithm when priorities are equal. When all operands of instruction B are ready and B has a higher priority, execution switches to instruction B, and the running state, pointers and so on of instruction A are kept in the original registers.
As shown in fig. 6B, while tensor instruction A is running, it is first detected whether Interrupt Enable is set. When it is not set and the hash_index obtained by parsing a received data packet is the immediate of another calculation instruction B, instruction A continues to run, the received data packet is discarded, and fail is returned; when it is set and the hash_index obtained by parsing is the immediate of another calculation instruction B, it is determined whether all operands of instruction B other than the immediate are ready and whether B has a higher priority. When not all operands of instruction B other than the immediate are ready or B does not have a higher priority, instruction A continues to run, the received data packet is discarded, and fail is returned; when all operands of instruction B other than the immediate are ready and B has a higher priority, execution switches to instruction B, and the running state, pointers and so on of instruction A are kept in the original registers.
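As a sketch only, the FIG. 6B decision might be written as the following C function; all names are illustrative, "higher priority" is compared numerically on an assumed scale, and the packet discard and fail reply are left to the caller.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool    interrupt_enable;     /* of the currently running instruction A */
    uint8_t priority;
} running_inst_t;

typedef struct {
    bool    non_immediate_operands_ready;  /* all operands of B except the immediate */
    uint8_t priority;
} pending_inst_t;

/* Returns true if execution should switch to instruction B after a packet
 * carrying B's immediate arrives; returning false means instruction A keeps
 * running, the packet is discarded, and fail is returned by the caller. */
static bool switch_on_immediate(const running_inst_t *a, const pending_inst_t *b)
{
    if (!a->interrupt_enable)
        return false;
    if (!b->non_immediate_operands_ready)
        return false;
    return b->priority > a->priority;  /* switch and keep A's state in its registers */
}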
Figures 7A-L illustrate a schematic diagram of the operation of a processing unit in one embodiment of the application.
Wherein the operation of the processing unit may comprise:
as shown in fig. 7A, the processing unit Tensor PE includes a plurality of registers Thread. The Thread comprises a primary pointer Tensor DSD, a secondary pointer Root Tensor DSD pointing to the primary pointer Tensor DSD, and a 48KB SRAM space.
As shown in fig. 7B, the processing unit Tensor PE reads the Tensor A instruction (Read Tensor A Instruction). The tensor A is loaded (Load Tensor A) in the register Thread, the hash index of tensor A (A hash index) may be 0x1acf, and ptr=0000. The hash index of tensor A is stored in the secondary pointer Root Tensor DSD, the dimension of tensor A (A dimension 4) is stored in the primary pointer Tensor DSD, and the first space is occupied in the SRAM.
As shown in fig. 7C, the processing unit Tensor PE receives a 32-bit packet. The data packet includes a 16-bit index=0x1acf and 16-bit data=[15:0] data, ptr=1000 in the register Thread, and storage space is occupied in the first space. When reception of the packets is completed with ptr=3fff in the register Thread, the first space is fully occupied, as shown in fig. 7D.
As shown in fig. 7E, the processing unit Tensor PE receives a C=A×B instruction. The register Thread is not ready to execute the C=A×B instruction (C=A×B not ready), the secondary pointer Root Tensor DSD stores the hash index of tensor C (C hash index 0x3d2e), and tensor B is required (need Tensor B).
As shown in fig. 7F, the processing unit Tensor PE reads the load Tensor B instruction. Tensor B is loaded in the register Thread with Tensor B ptr=4000, the hash index of tensor B (B hash index 0xff71) is stored in the secondary pointer Root Tensor DSD, the dimension of tensor B (B dimension 4) is stored in the primary pointer Tensor DSD, and the second space is occupied in the SRAM.
As shown in fig. 7G, the processing unit Tensor PE receives a 32-bit packet. The data packet includes a 16-bit index=0xff71 and 16-bit data=[15:0] data; the register Thread loads tensor B up to Tensor B ptr=7fff, completing reception of the data packets, and the second space is fully occupied.
As shown in fig. 7H, the processing unit Tensor PE reads the load Tensor D instruction. Tensor D (priority 3) is loaded in the register Thread alongside tensor C=A×B (ptr=8000, priority 2), the hash index of tensor D (D hash index 0xd59a) is stored in the secondary pointer Root Tensor DSD, the dimension of tensor D (D dimension 2) is stored in the primary pointer Tensor DSD, C hash index 0x3d2e points to the third space in the SRAM, and D hash index 0xd59a points to the fourth space in the SRAM.
As shown in fig. 7I, the processing unit Tensor PE receives a packet (unrelated data package). Tensor C=A×B (ptr=9000, priority 2) is in the register Thread, and storage space is occupied in the third space.
As shown in fig. 7J, the processing unit Tensor PE receives a 32bit packet. The data packet includes 16bit index=0xd59a and 16bit data= [15:0] data, wherein the memory space is occupied in the fourth space of the SRAM.
As shown in fig. 7K, the processing unit Tensor PE receives a packet (unrelated data package). In the register Thread, tensor C=A×B (ptr=c999, priority 2); reception of the packets is completed and the third space is fully occupied.
As shown in fig. 7L, the processing unit Tensor PE receives the E=C+D instruction; the register Thread is not ready to execute the E=C+D instruction (E=C+D not ready), and the hash index of tensor E is stored in the secondary pointer (E hash index 0x37b9).
The application takes the Tensor (Root Tensor DSD) as the basic operand, so that the number of instructions inside a PE is greatly reduced, avoiding the extra power consumption of Fetch and Decode caused by a large number of instructions; the power consumption is therefore lower and the energy-efficiency ratio is high. Further, the high Instruction Cache overhead caused by a large number of instructions is avoided, the circuit area is reduced, instruction cache misses caused by a large number of instruction fetch operations are avoided, and performance is improved.
The application takes the tensor calculation thread Thread X as the basic instruction; the secondary pointer Root Tensor DSD points to the primary pointer Tensor DSD and to the corresponding SRAM space, and the SRAM address space corresponding to a tensor that has not yet been computed is in a protected state and is not easily replaced, thereby reducing cache misses caused by thread switching. Source Register 0 ptr to Source Register 2 ptr and Dst Register 0 ptr in the corresponding thread register Thread X store the current calculation state, so the overhead of thread switching is very low and calculation can continue directly from the state addresses stored in the corresponding pointers.
Different threads may be assigned different priority levels in Thread X, so that the compiler can allocate different registers to tasks of different importance to shorten critical paths. The compiler can allocate different tensor calculation tasks at different times according to the resource readiness of each processing unit (PE, Processing Element) in the wafer-level chip, and since each PE can replace data at the granularity of a Tensor (Root Tensor DSD), the compiler's scheduling can reduce the probability of cache misses as much as possible. In a wafer-level chip, due to the large number of PEs and long routing, data that will be used multiple times should be cached locally as much as possible.
While various embodiments of the present application have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to those skilled in the relevant art that various combinations, modifications, and variations can be made therein without departing from the spirit and scope of the application. Thus, the breadth and scope of the present application as disclosed herein should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims (10)

1. A method of low overhead thread switching for tensor operations, comprising the steps of, by a processing unit:
running a first tensor instruction;
receiving a data packet, wherein the data packet comprises a hash index and data;
parsing the data packet, wherein when a second tensor instruction is obtained by parsing and an idle thread register exists in the processing unit, an operand and a priority of the second tensor instruction are determined; and
the second tensor instruction is executed when all operands of the second tensor instruction are ready and the priority of the second tensor instruction is higher than the priority of the first tensor instruction.
2. The low overhead thread switching method for tensor operations of claim 1, wherein parsing the data packet further comprises:
determining the operands of the second tensor instruction other than the immediate and its priority when the immediate of the second tensor instruction is obtained by parsing, wherein the second tensor instruction is executed when all operands of the second tensor instruction other than the immediate are ready and the priority of the second tensor instruction is higher than the priority of the first tensor instruction.
3. The low overhead thread switching method for tensor operations of claim 1, wherein the processing unit comprises:
a plurality of thread registers, comprising:
a plurality of secondary pointers Root Tensor DSDs pointing to the primary pointers Tensor DSDs and the storage space of the static random access memory, wherein the processing unit takes the secondary pointers Root Tensor DSDs as basic operands;
a plurality of primary pointers, tensor DSDs, representing the dimensions of the Tensor; and
static random access memory.
4. A low overhead thread switching method for tensor operations according to claim 3, wherein the instructions and states stored in the thread register include one or more of the following:
whether Valid, which indicates whether the current thread is valid;
Interrupt Enable, which indicates whether interrupts are enabled while the current thread is executing;
Priority, which indicates the priority of the current thread in scheduling;
a Source Operand, which represents the data required when the thread executes;
a destination Operand Dst Operand, which represents the data to be written or modified when the thread executes;
an operation code Opcode, which indicates the kind of operation performed by the instruction;
an Instruction Type, which represents the type of instruction being executed by the current thread;
a source register pointer Source Register ptr, which represents the address of the tensor to which the source operand points;
a source stride Source Stride, which represents the pattern in which the source operand tensor is read;
a destination register pointer Dst Register ptr, which represents the address of the tensor to which the destination operand points;
a destination stride Dst Stride, which represents the pattern in which the destination operand tensor is read;
a source operand hash value index Source Operand Hash Index, which represents the hash value of the source operand; and
a destination index Dst Index, which represents the routing information to which the destination operand ultimately needs to be forwarded.
5. The low overhead thread switching method for Tensor operations of claim 4, wherein the state and linker stored in the secondary pointer Root Tensor DSD includes one or more of:
the start address Start Addr, the end address End Addr, the hash index Hash Index, the tensor dimension Tensor Dimension, whether Valid, the destination transfer Dst Transfer, the link tensor index Link Tensor DSD Index, the data length Data Length, the data width Data Width, the protection enable Protection, whether Ready, and the Type.
6. The low overhead thread switching method for Tensor operations of claim 5, wherein the state stored in the primary pointer tensor_dsd includes one or more of:
dimension Length, data Width, stride, and whether Valid.
7. The low overhead thread switching method for tensor operations of claim 6, wherein the decoded instructions of the processing unit comprise:
a computing instruction, comprising:
three operands, wherein the three operands include an immediate, and an SPM, or an immediate, an SPM, and an SPM, or an SPM, and an SPM; or alternatively
Two operands, wherein the two operands include an immediate and an immediate, or an immediate and an SPM;
SPM stores instructions; and
SPM loads instructions.
8. The low overhead thread switching method for tensor operations of claim 4, wherein the currently computed state is stored by the source Register pointer Source Register ptr and destination Register pointer Dst Register ptr.
9. A processing unit, characterized in that it performs the steps of the method according to one of claims 1-8.
10. A wafer level chip having the processing unit of claim 9.
CN202310395659.0A 2023-04-13 2023-04-13 Low-overhead thread switching method for tensor operation Pending CN116578340A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310395659.0A CN116578340A (en) 2023-04-13 2023-04-13 Low-overhead thread switching method for tensor operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310395659.0A CN116578340A (en) 2023-04-13 2023-04-13 Low-overhead thread switching method for tensor operation

Publications (1)

Publication Number Publication Date
CN116578340A true CN116578340A (en) 2023-08-11

Family

ID=87538542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310395659.0A Pending CN116578340A (en) 2023-04-13 2023-04-13 Low-overhead thread switching method for tensor operation

Country Status (1)

Country Link
CN (1) CN116578340A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination