CN220773595U - Reconfigurable processing circuit and processing core - Google Patents

Reconfigurable processing circuit and processing core

Info

Publication number
CN220773595U
CN220773595U (Application CN202321824956.4U)
Authority
CN
China
Prior art keywords
output
sum
previous
multiplexer
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202321824956.4U
Other languages
Chinese (zh)
Inventor
孙晓宇
拉万·恩心
穆拉特·凯雷姆·阿卡尔瓦达尔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiwan Semiconductor Manufacturing Co TSMC Ltd
Original Assignee
Taiwan Semiconductor Manufacturing Co TSMC Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiwan Semiconductor Manufacturing Co TSMC Ltd
Application granted
Publication of CN220773595U

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The embodiment of the utility model discloses a reconfigurable processing circuit and a processing core. In one aspect, the reconfigurable processing circuit includes: a first memory configured to store an input activation state; a second memory configured to store weights; a multiplier configured to multiply the weight with the input activation state and output a product; a first multiplexer (mux) configured to output a previous sum from a previous reconfigurable processing element based on a first selector; a third memory configured to store a first sum; a second multiplexer configured to output the previous sum or the first sum based on a second selector; an adder configured to add the product with the previous sum or the first sum to output a second sum; and a third multiplexer configured to output the second sum or the previous sum based on a third selector.

Description

Reconfigurable processing circuit and processing core
Technical Field
Embodiments of the present utility model relate to reconfigurable processing elements for artificial intelligence accelerators and methods of operation thereof.
Background
Artificial Intelligence (AI) is a powerful tool that can be used to simulate human intelligence in machines programmed to think and behave like humans. AI may be used in a variety of applications and industries. An AI accelerator is a hardware device for efficiently handling AI workloads, such as neural networks. One type of AI accelerator includes a systolic array that can perform operations on inputs via multiply and accumulate operations.
Disclosure of Invention
According to an embodiment of the present utility model, a reconfigurable processing circuit for an Artificial Intelligence (AI) accelerator includes: a first memory configured to store an input activation state; a second memory configured to store weights; a multiplier configured to multiply the weight with the input activation state and output a product; a first multiplexer (mux) configured to output a previous sum from a previous reconfigurable processing element based on a first selector; a third memory configured to store a first sum; a second multiplexer configured to output the previous sum or the first sum based on a second selector; an adder configured to add the product with the previous sum or the first sum to output a second sum; and a third multiplexer configured to output the second sum or the previous sum based on a third selector.
According to an embodiment of the present utility model, a method of operating a reconfigurable processing element of an artificial intelligence accelerator includes: selecting, by a first multiplexer (mux), a previous column or a previous sum of previous rows of a matrix of reconfigurable processing elements from the artificial intelligence accelerator based on a first selector; multiplying the input activation state by a weight to output a product; selecting, by a second multiplexer, the previous sum or the current sum based on a second selector; adding the product to the selected previous sum or the selected current sum to output an updated sum; selecting, by a third multiplexer, the updated sum or the previous sum based on a third selector; and outputting the selected updated sum or the selected previous sum.
According to an embodiment of the present utility model, a processing core for an Artificial Intelligence (AI) accelerator includes: an input buffer configured to store a plurality of input activation states; a weight buffer configured to store a plurality of weights; a matrix array of processing elements arranged in a plurality of rows and a plurality of columns, wherein each processing element of the matrix array of processing elements includes: a first memory configured to store an input activation state from the input buffer; a second memory configured to store weights from the weight buffer; a multiplier configured to multiply the weight with the input activation state and output a product; a first multiplexer (mux) configured to output a previous sum of processing elements from a previous row or a previous column based on a first selector; a third memory configured to store a first sum and output the first sum to a processing element of a next row or column; a second multiplexer configured to output the previous sum or the first sum based on a second selector; an adder configured to add the product with the previous sum or the first sum to output a second sum; and a third multiplexer configured to output the second sum or the previous sum based on a third selector; a plurality of accumulators configured to receive outputs from a last row of the plurality of rows and to sum one or more of the received outputs from the last row; and an output buffer configured to receive outputs from the plurality of accumulators.
Drawings
Aspects of the disclosure are better understood from the following detailed description when read in conjunction with the accompanying drawings. It should be noted that the various components are not drawn to scale according to standard practice in the industry. In fact, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.
FIG. 1 illustrates an example block diagram of a processing core of an AI accelerator in accordance with some embodiments.
FIG. 2 illustrates an example block diagram of a PE in accordance with some embodiments.
Fig. 3, 4, and 5 illustrate a PE configured for an output fixed flow, according to some embodiments.
Fig. 6, 7, and 8 illustrate a PE configured for an input fixed flow, according to some embodiments.
Fig. 9, 10, and 11 illustrate a PE configured for a weight fixed flow, according to some embodiments.
FIG. 12 illustrates a block diagram of a processing core including a 2x2 PE array in accordance with some embodiments.
Fig. 13 illustrates a block diagram of an AI accelerator including a processing core array, in accordance with some embodiments.
FIG. 14 illustrates a graph of accuracy loss as a function of accumulator bit width, according to some embodiments.
Fig. 15 illustrates a flowchart of an example method of operating a reconfigurable processing element for an AI accelerator, in accordance with some embodiments.
Detailed Description
The following disclosure provides many different embodiments, or examples, for implementing different features of the provided subject matter. Specific examples of components and arrangements are described below to simplify the present disclosure. Of course, these are merely examples and are not intended to be limiting. For example, in the following description, the formation of a first member over or on a second member may include embodiments in which the first member and the second member are formed in direct contact, and may also include embodiments in which additional members may be formed between the first member and the second member such that the first member and the second member may not be in direct contact. In addition, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
Moreover, for ease of description, spatially relative terms such as "under …," "under …," "lower," "above …," "upper," "top," "bottom," and the like may be used herein to describe one element or member's relationship to another element(s) or member(s) as illustrated in the figures. Spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. The device may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein interpreted accordingly.
AI accelerators are a class of specialized hardware used to accelerate machine learning workloads such as Deep Neural Network (DNN) processing, which typically involves massive memory access and highly parallel but simple operations. AI accelerators may be based on Application Specific Integrated Circuits (ASICs) that include multiple Processing Elements (PEs) (or processing circuits) arranged spatially or temporally to perform Multiply and Accumulate (MAC) operations. MAC operations are performed based on input activation states (inputs) and weights, and the products are then summed together to provide output activation states (outputs). Typical AI accelerators are customized to support one fixed data stream, such as an output fixed, input fixed, or weight fixed workflow. However, AI workloads contain various layer types and shapes that favor different data streams; given the diversity of workloads in layer type, layer shape, and batch size, a data stream adapted to one workload or one layer may not be the best solution for others, thus limiting performance.
The present embodiments include novel systems and methods for reconfiguring Processing Elements (PEs) within an AI accelerator to support various data streams and to better accommodate different workloads, improving the efficiency of the AI accelerator. The PE may include a number of multiplexers (mux) that may be used to route inputs, weights, and partial/total sums for the various data streams. The multiplexers may be controlled using various control signals such that the multiplexers output data to support each of the data streams. In particular, there are practical applications in which an AI accelerator with a reconfigurable architecture can support various data streams, which can lead to a more energy-efficient system and faster computation performed by the AI accelerator. For example, the approximate accumulation of the output fixed data stream may reduce area and energy consumption without degrading accuracy by using lower-precision adders and buffers inside the PE. Moreover, by reusing the separate accumulators designated for the weight fixed and input fixed data streams to collect partial sums from each core, the disclosed techniques also provide technical advantages over conventional systems due to reduced area and energy consumption when performing computations.
FIG. 1 illustrates an example block diagram of a processing core 100 of an AI accelerator in accordance with some embodiments. The processing core 100 may be used as a building block for an AI accelerator. Processing core 100 includes weight buffer 102, input buffer 104, output buffer 108, PE array 110, and accumulators 120, 122, and 124. Although certain components are shown in fig. 1, embodiments are not so limited and more or fewer components may be included in the processing core 100.
The inner layers of a neural network may be considered to a large extent as layers of neurons, each of which receives weighted outputs from neurons of other (e.g., previous) layers of neurons in a mesh interconnect structure between the layers. The weight of a connection from the output of a particular previous neuron to the input of another subsequent neuron is set according to the influence or effect that the previous neuron has on the subsequent neuron. The output value of the previous neuron is multiplied by the weight of its connection to the subsequent neuron to determine the particular stimulus that the previous neuron presents to the subsequent neuron.
The total input stimulus of a neuron corresponds to the combined stimulus of its all weighted input connections. According to various implementations, if the total input stimulus of a neuron exceeds a certain threshold, the neuron is triggered to perform a linear or nonlinear mathematical function on its input stimulus. The output of the mathematical function corresponds to the output of a neuron, which is then multiplied by the corresponding weight of the output connection of the neuron to its succeeding neuron.
In general, the more connections between neurons, the more neurons per layer and/or the more layers of neurons, the more intelligence the network can achieve. Thus, neural networks for practical, real-world artificial intelligence applications are typically characterized by a large number of neurons and a large number of connections between neurons. Thus, in processing information through a neural network, an extremely large amount of computation is involved (not only for the neuron output function, but also for the weighted connection).
As mentioned above, while neural networks may be implemented entirely in software as program code instructions executing on one or more conventional general-purpose Central Processing Unit (CPU) or Graphics Processing Unit (GPU) processing cores, the read/write activities between the CPU/GPU cores and system memory required to perform all of these computations are extremely intensive. Given the millions or billions of computations required to execute a neural network, the overhead and energy associated with repeatedly moving large amounts of data from system memory, processing the data in the CPU/GPU cores, and then writing the results back to system memory is not entirely satisfactory in many respects.
Referring to fig. 1, a processing core 100 represents a building block of a systolic array-based AI accelerator that models a neural network. In systolic array based systems, data is processed in waves by processing cores 100 that perform the operations. Such operations may sometimes rely on dot product and vector absolute difference operations, typically using multiply-accumulate (MAC) operations performed on parameters, input data, and weights. MAC operations typically involve multiplication of two values and accumulation of a series of such multiplications. One or more processing cores 100 may be connected together to form a neural network, which may form a systolic array-based system that forms an AI accelerator.
The input buffer 104 includes one or more memories (e.g., buffers) that can receive and store inputs (e.g., input activation data) of the neural network. Such inputs may be received as outputs from, for example, different processing cores 100 (not shown), global buffers (not shown), or different devices. Inputs from the input buffer 104 may be provided to the PE array 110 for processing, as described below.
The weight buffer 102 includes one or more memories (e.g., buffers) that can receive and store weights for the neural network. The weight buffer 102 may receive and store weights from, for example, different processing cores 100 (not shown), a global buffer (not shown), or different devices. The weights from the weight buffer 102 may be provided to the PE array 110 for processing, as described below.
PE array 110 comprises PEs 111, 112, 113, 114, 115, 116, 117, 118, and 119 arranged in rows and columns. The first row includes PEs 111-113, the second row includes PEs 114-116, and the third row includes PEs 117-119. The first column contains PEs 111, 114, 117, the second column contains PEs 112, 115, 118, and the third column contains PEs 113, 116, 119. Although processing core 100 includes 9 PEs 111-119, embodiments are not so limited and processing core 100 may include more or fewer PEs. PEs 111-119 may perform multiply and accumulate (e.g., sum) operations based on inputs and weights received and/or stored in input buffer 104, weight buffer 102, or received from different PEs (e.g., PEs 111-119). The output of a PE (e.g., PE 111) may be provided to one or more different PEs (e.g., PEs 112, 114) in the same PE array 110 for multiplication and/or summation operations.
For example, PE 111 may receive a first input from input buffer 104 and a first weight from weight buffer 102, and perform a multiplication and/or summation operation based on the first input and the first weight. PE 112 may receive the output of PE 111 and the second weight from weight buffer 102 and perform a multiplication and/or summation operation based on the output of PE 111 and the second weight. PE 113 may receive the output of PE 112 and a third weight from weight buffer 102 and perform a multiplication and/or summation operation based on the output of PE 112 and the third weight. PE 114 may receive the output of PE 111, the second input from input buffer 104, and the fourth weight from weight buffer 102, and perform a multiplication and/or summation operation based on the output of PE 111, the second input, and the fourth weight. PE 115 may receive the outputs of PEs 112 and 114 and a fifth weight from weight buffer 102 and perform multiplication and/or summation operations based on the outputs of PEs 112 and 114 and the fifth weight. PE 116 may receive the outputs of PEs 113 and 115 and a sixth weight from weight buffer 102 and perform multiplication and/or summation operations based on the outputs of PEs 113 and 115 and the sixth weight. PE 117 may receive the output of PE 114, the third input from input buffer 104, and the seventh weight from weight buffer 102, and perform a multiplication and/or summation operation based on the output of PE 114, the third input, and the seventh weight. PE 118 may receive the outputs of PEs 115 and 117 and the eighth weight from weight buffer 102 and perform multiplication and/or summation operations based on the outputs of PEs 115 and 117 and the eighth weight. PE 119 may receive the outputs of PEs 116 and 118 and a ninth weight from weight buffer 102 and perform multiplication and/or summation operations based on the outputs of PEs 116 and 118 and the ninth weight. For the bottom row of the PE array (e.g., PEs 117-119), the outputs may also be provided to one or more of accumulators 120-124. Depending on the embodiment, the first, second, and/or third inputs and/or first through ninth weights and/or outputs of PEs 111 through 119 may be forwarded to some or all of PEs 111 through 119. These operations may be performed in parallel such that output from PEs 111 through 119 is provided at each cycle.
Accumulators 120-124 may sum up the partial-sum values of the results of PE array 110. For example, accumulator 120 may sum the three outputs provided by PE 117 for a set of inputs provided by input buffer 104. Each of accumulators 120-124 may include one or more buffers storing outputs from PEs 117-119 and a counter that tracks how many times an accumulation operation has been performed before outputting the sum to output buffer 108. For example, before accumulator 120 provides the sum to output buffer 108, accumulator 120 may perform three summations on the output of PE 117 (e.g., accounting for the outputs from the three PEs 111, 114, and 117). Once the accumulators 120-124 have completed summing all partial values, the output may be provided to the output buffer 108.
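As a rough illustration of the arithmetic this array carries out, the following Python sketch multiplies one activation per row by per-PE weights and reduces the partial products down each column, mirroring the roles of PE array 110 and accumulators 120-124. It is a behavioral sketch only; the function name and the 3x3 sizing are assumptions for illustration, not the patented circuit.

```python
# Behavioral sketch only: per-PE multiply followed by a per-column reduction,
# as performed by PE array 110 and accumulators 120-124. `pe_array_mac` and
# the 3x3 sizing are illustrative assumptions.
import numpy as np

def pe_array_mac(inputs, weights):
    """inputs: one activation per row; weights: one weight per PE (rows x cols).
    Returns the per-column sums that the accumulators would collect."""
    rows, cols = weights.shape
    partial = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            # each PE multiplies its weight by the activation on its row
            partial[r, c] = inputs[r] * weights[r, c]
    # partial sums are reduced down each column toward the accumulators
    return partial.sum(axis=0)

if __name__ == "__main__":
    acts = np.array([1.0, 2.0, 3.0])
    w = np.arange(9, dtype=float).reshape(3, 3)
    print(pe_array_mac(acts, w))  # same result as acts @ w
```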
Output buffer 108 may store the outputs of accumulators 120-124 and provide such outputs as inputs to different processing cores 100, or to a global output buffer (not shown) for further processing and/or analysis and/or prediction.
FIG. 2 illustrates an example block diagram of a PE 200 in accordance with some embodiments. Each of PEs 111-119 of PE array 110 of fig. 1 may include (or be implemented as) PE 200. PE 200 may include buffers (or memories) 220, 222, 224, multiplexers (MUX) MUX1, MUX2, MUX3, multiplier 230, and adder 240. PE 200 may also receive data signals including an input 202, a previous output 204, a weight 206, and a previous output 208. PE 200 may also receive control signals including a write enable WE1, a write enable WE2, a first selector ISS, a second selector OSS, and a third selector OS_OUT. Although certain components and signals are shown and described in PE 200, embodiments are not so limited, and various components and signals may be added and/or removed depending on the embodiment. A controller (not shown) may generate and transmit the control signals.
PE 200 may be configured for various workflows (or streams or modes) of operation. For example, the PE 200 can be configured for input fixed, output fixed, and weight fixed AI workflows. The operation of PE 200 and how PE 200 may be configured for various AI workflows is further described below with reference to FIGS. 3-11.
The buffer 220 may receive inputs 202 (e.g., first, second, and third inputs) from the input buffer 104. The buffer 220 may also receive a write enable WE1 that enables writing the input 202 into the buffer 220. The output of the buffer 220 may be provided to the PE in the next column (if any) and to the multiplier 230.
The buffer 222 may receive weights 206 (e.g., first through ninth weights) from the weight buffer 102. The buffer 222 may also receive a write enable WE2 that enables writing the weight 206 into the buffer 222. The output of the buffer 222 may be provided to the PE in the next row (if any) and to the multiplier 230.
Multiplexer MUX1 may receive as inputs a previous output 204 from a PE of a previous column (if any) and a previous output 208 from a PE of a previous row (if any). The output of multiplexer MUX1 may be provided to multiplexer MUX2 and multiplexer MUX3. The first selector ISS is operable to select which input of multiplexer MUX1 is provided at the output of multiplexer MUX1. The previous output 204 may be selected when the first selector ISS is 0, and the previous output 208 may be selected when the first selector ISS is 1. Embodiments are not so limited, and the encoding of the first selector ISS may be switched (e.g., 1 for selecting the previous output 204 and 0 for selecting the previous output 208).
Multiplier 230 may perform a multiplication of the output of buffer 220 and the output of buffer 222. The output of multiplier 230 may be provided to adder 240.
Multiplexer MUX2 may receive as inputs the output of multiplexer MUX1 and the output of buffer 224. The output of multiplexer MUX2 may be provided to adder 240. The second selector OSS may be used to select which input of multiplexer MUX2 is provided at the output of multiplexer MUX2. The output of multiplexer MUX1 may be selected when the second selector OSS is 0, and the output of buffer 224 may be selected when the second selector OSS is 1. The embodiments are not limited thereto, and the encoding of the second selector OSS may be switched (e.g., 1 for selecting the output of multiplexer MUX1 and 0 for selecting the output of buffer 224).
Adder 240 may perform addition operations. Adder 240 may add the output of multiplier 230 to the output of multiplexer MUX2. The sum (output) of adder 240 may be provided to multiplexer MUX3.
Multiplexer MUX3 may receive as inputs the output of adder 240 and the output of multiplexer MUX1. The output of multiplexer MUX3 may be provided to buffer 224. The third selector OS_OUT may be used to select which input of multiplexer MUX3 is provided to buffer 224. The output of adder 240 may be selected when the third selector OS_OUT is 0, and the output of multiplexer MUX1 may be selected when the third selector OS_OUT is 1. The embodiments are not so limited, and the encoding of the third selector OS_OUT may be switched (e.g., 1 for selecting the output of adder 240 and 0 for selecting the output of multiplexer MUX1).
The buffer 224 may receive the output of the multiplexer MUX 3. The output of the buffer 224 may be provided to the PE in the next row (if any), the PE in the next column (if any), and the multiplexer MUX2.
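The datapath described above can be summarized in a short behavioral sketch. The class below models one cycle of PE 200 under the mux encodings given in the preceding paragraphs (ISS, OSS, and OS_OUT selecting between the listed inputs); the class and argument names are hypothetical, and the sketch ignores bit widths and timing.

```python
# Behavioral sketch of one PE 200 cycle; names and the single-cycle model are
# assumptions made for illustration, not the patent's implementation.
class ReconfigurablePE:
    def __init__(self):
        self.input_reg = 0   # buffer 220: input activation state
        self.weight_reg = 0  # buffer 222: weight
        self.sum_reg = 0     # buffer 224: stored (partial) sum

    def cycle(self, inp, weight, prev_col_sum, prev_row_sum,
              we1, we2, iss, oss, os_out):
        if we1:                                       # write enable WE1
            self.input_reg = inp
        if we2:                                       # write enable WE2
            self.weight_reg = weight
        product = self.input_reg * self.weight_reg    # multiplier 230
        mux1 = prev_row_sum if iss else prev_col_sum  # MUX1, first selector ISS
        mux2 = self.sum_reg if oss else mux1          # MUX2, second selector OSS
        new_sum = product + mux2                      # adder 240
        self.sum_reg = mux1 if os_out else new_sum    # MUX3, third selector OS_OUT
        return self.sum_reg                           # forwarded to the next PE
```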
PE 200 may be reconfigured to support various data streams, such as weight fixed, input fixed, and output fixed data streams. In a weight fixed data stream, weights are pre-filled and stored in each PE before the operation begins, so that the weights of a given filter are allocated along a PE column. The Input Feature Map (IFMAP) then streams in from the left edge of the array while the weights stay fixed in each PE, and each PE produces a partial sum at each cycle. The resulting partial sums are reduced in parallel across the rows along each column to produce one Output Feature Map (OFMAP) pixel per column. The input fixed data stream is similar to the weight fixed data stream except for the mapping order: instead of pre-populating the array with weights, the expanded IFMAP is stored in each PE. Weights then stream in from the edge, each PE generates a partial sum at each cycle, and the resulting partial sums are again reduced in parallel across the rows along each column to produce one OFMAP pixel per column. The output fixed data stream refers to a mapping in which each PE performs all operations for one OFMAP pixel while the weights and IFMAP are fed from the edges of the array and distributed to the PEs over the PE-to-PE interconnect. Partial sums are generated and reduced within each PE. Once all PEs in the array complete their OFMAP generation, the results are drained out of the array through the PE-to-PE interconnect.
As described with reference to fig. 3-11, PE 200 may be reconfigured for different data streams such that the same PE may be used for various data streams.
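Before turning to the figures, the selector settings used in the examples of figs. 3-11 can be collected in one place. The table below is a convenience summary under the 0/1 encodings assumed in those examples (which, as noted above, may be switched in other embodiments); entries marked None are don't-cares.

```python
# Selector settings per data flow, as used in the examples of figs. 3-11.
# The 0/1 encodings are the ones assumed in those examples; None = don't-care.
DATAFLOW_CONTROL = {
    "output_fixed": {
        "accumulate": {"ISS": None, "OSS": 1, "OS_OUT": 0},  # add into buffer 224
        "outgoing":   {"ISS": 1, "OSS": None, "OS_OUT": 1},  # pass sums down the column
    },
    "input_fixed":  {"mac": {"ISS": 0, "OSS": 0, "OS_OUT": 0}},  # sum from previous column
    "weight_fixed": {"mac": {"ISS": 1, "OSS": 0, "OS_OUT": 0}},  # sum from previous row
}
```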
Fig. 3-5 illustrate a PE 300 configured for an output fixed flow, according to some embodiments. PE 300 is similar to PE 200 except that PE 300 is configured for an output fixed operation flow.
FIG. 3 illustrates a multiplication operation of PE 300 in accordance with some embodiments. When write enable WE1 is high, input 202 is saved to buffer 220. The output 302 of the buffer 220 is then forwarded to another PE (e.g., the PE of the next column) and also provided as an input to the multiplier 230. When write enable WE2 is high, weights 206 are saved to buffer 222. The output 306 of the buffer 222 is then forwarded to another PE (e.g., the PE of the next row) or an output buffer (e.g., output buffer 108), and is also provided as an input to multiplier 230. Multiplier 230 performs a multiplication operation on output 302 and output 306 and provides the product as an input to adder 240. During an output fixed data stream, each time a MAC operation is performed, buffers 220 and 222 may be updated with a new input activation state (from input buffer 104) and a new weight (from weight buffer 102).
FIG. 4 illustrates an accumulation operation of PE 300 in accordance with some embodiments. At the end of the multiplication operation shown in fig. 3, the output 402 of the multiplication is provided as an input to adder 240. The partial sum stored in the buffer 224 is provided to multiplexer MUX2 as output 406. The second selector OSS is set to "1" such that output 406 is provided as the output 408 of multiplexer MUX2. Output 408 is provided to adder 240 and added to output 402, and the resulting output 410 is provided to multiplexer MUX3. When the third selector OS_OUT is "0," multiplexer MUX3 provides output 410 as output 404 to the buffer 224. Buffer 224 may then store output 404 as an updated MAC result.
The multiply operation of fig. 3 and the accumulate operation of fig. 4 may be combined and are referred to as a MAC operation, as discussed above. The MAC operation is repeated for the entire PE array 110. For example, the MAC operation is performed for all input activation states and all weights stored in the input buffer 104 and the weight buffer 102. Depending on the embodiment, the bit width of the buffer 224 may be varied to accommodate the length of the result of the MAC operation for higher accuracy.
Fig. 5 illustrates an outgoing operation of PE 300 in accordance with some embodiments. In general, during an outgoing operation, the sum stored in the buffer 224 of each PE 300 is transferred vertically along its column, ultimately reaching the accumulators 120-124. For example, the first selector ISS is set to "1" so that multiplexer MUX1 outputs the previous output 208 as output 502. Output 502 is then provided to multiplexer MUX3. The third selector OS_OUT is set to "1" so that multiplexer MUX3 provides output 502 as output 504 to the buffer 224. After the operation for the entire array is complete (e.g., when the MAC operations for all currently stored input activation states and weights are completed), the sums stored in the buffers 224 across the array are transferred vertically to the PEs 300 in the row below until all stored outputs have been provided to the accumulators 120-124 between the PE array 110 and the output buffer 108, as shown in fig. 1.
Thus, the PE 200 can be reconfigured to support AI workloads with an output fixed data stream.
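A minimal sketch of this output fixed sequence for one column of PEs is shown below: each PE accumulates into its own buffer 224 during the MAC phase (OSS = 1, OS_OUT = 0), then the stored sums are shifted down the column toward the accumulator during the outgoing phase (ISS = 1, OS_OUT = 1). The loop structure and function name are illustrative assumptions, not the patented control sequence.

```python
# Illustrative sketch of the output fixed MAC and outgoing phases for one
# column of PEs; not the patented control sequence.
def output_fixed_column(activation_stream, weight_stream, num_pes=3):
    sums = [0] * num_pes                      # one buffer-224 value per PE
    for acts, wts in zip(activation_stream, weight_stream):
        for pe in range(num_pes):             # OSS=1, OS_OUT=0: accumulate locally
            sums[pe] += acts[pe] * wts[pe]
    drained = []
    for _ in range(num_pes):                  # ISS=1, OS_OUT=1: shift sums downward
        drained.append(sums[-1])
        sums = [0] + sums[:-1]
    return drained

if __name__ == "__main__":
    acts = [[1, 2, 3], [4, 5, 6]]             # one activation per PE per cycle
    wts = [[1, 1, 1], [2, 2, 2]]
    print(output_fixed_column(acts, wts))     # [15, 12, 9]
```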
Fig. 6-8 illustrate a PE 600 configured for an input fixed flow, according to some embodiments. PE 600 is similar to PE 200 except that PE 600 is configured for an input fixed operation flow.
FIG. 6 illustrates a preload input activation operation of PE 600 configured for an input fixed stream, in accordance with some embodiments. An input (e.g., input activation) 202 is provided to the buffer 220. Write enable WE1 is high so that input 202 is stored in buffer 220. Once the input 202 is written into the buffer 220, the write enable WE1 is set low so that the stored input 202 remains stored in the buffer 220 throughout the MAC operation. The buffer 220 may output the previously stored input 202 as output 602.
FIG. 7 illustrates a multiplication operation of PE 600 configured for an input fixed stream, in accordance with some embodiments. The output 602 is provided to the multiplier 230. The weights 206 are provided to the buffer 222. Write enable WE2 is set high so that a weight 206 is written to buffer 222 every cycle. The stored weight 206 may then be output as output 604, which is provided as an input to multiplier 230. Output 602 may be multiplied by output 604 using multiplier 230.
Fig. 8 illustrates a summation operation of PE 600 configured for an input fixed stream, in accordance with some embodiments. The previous output 204 may be provided to multiplexer MUX1. The first selector ISS may be set to "0" such that the previous output 204 is provided as the output 702 of multiplexer MUX1. The output 702 may be input to multiplexer MUX2, and when the second selector OSS is set to "0," the output 704 of multiplexer MUX2 may be provided to adder 240. The output 706 from multiplier 230 may also be provided as an input to adder 240. Output 706 and output 704 may be summed to provide output 708 to multiplexer MUX3 as a MAC result. The third selector OS_OUT may be set to "0" such that output 708 is provided to and stored in the buffer 224. Output 712 may then be provided to the PE 600 of the next row and/or the accumulators 120-124.
Thus, the PE 200 can be reconfigured to support AI workloads with an input fixed data stream.
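The input fixed behavior of a single PE can be sketched as follows: the activation is pre-loaded once (WE1 high for one cycle), a new weight streams in every cycle (WE2 high), and the partial sum arrives from the PE in the previous column (ISS = 0, OSS = 0, OS_OUT = 0). The function name and the example values are illustrative assumptions.

```python
# Illustrative sketch of one input fixed PE: stationary activation, streaming
# weights, partial sums arriving from the previous column.
def input_fixed_pe(preloaded_activation, weight_stream, prev_col_sums):
    outputs = []
    for weight, prev_sum in zip(weight_stream, prev_col_sums):
        outputs.append(prev_sum + preloaded_activation * weight)  # adder 240
    return outputs

if __name__ == "__main__":
    print(input_fixed_pe(2, [1, 3, 5], [0, 10, 20]))  # [2, 16, 30]
```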
Fig. 9-11 illustrate a PE 900 configured for weight-fixed flows, according to some embodiments. PE 900 is similar to PE 200 except that PE 900 is configured for a weight fixed operation flow.
FIG. 9 illustrates a pre-load weight operation of PE 900 configured for a weight fixed flow, according to some embodiments. Weights 206 may be provided to the buffer 222, and write enable WE2 may be high such that weights 206 are loaded into the buffer 222. The weights may then be provided by the buffer 222 as output 902 to the multiplier 230 for subsequent MAC operations until the weights in the buffer 222 are updated. For example, write enable WE2 may be set to "0" such that buffer 222 retains the weights 206 for all MAC operations in PE array 110 until write enable WE2 is activated again and the weights are updated with a new set of weights for a new MAC operation.
FIG. 10 illustrates a multiplication operation of PE 900 configured for weight-fixed flows, according to some embodiments. The input activation 202 may be provided and stored in the buffer 220, with the write enable WE1 activated. The output 1002 of the buffer 220 may then be provided as an input to the multiplier 230 along with the output 902. The output 902 may be multiplied with the output 1002 using the multiplier 230.
FIG. 11 illustrates an accumulation operation of PE 900 configured for a weight fixed flow, according to some embodiments. The previous output 208 may be provided as an input to multiplexer MUX1. The first selector ISS may be set to "1" such that the previous output 208 is provided as the output 1102 of multiplexer MUX1. The output 1102 may be input to multiplexer MUX2, and when the second selector OSS is set to "0," the output 1104 of multiplexer MUX2 may be provided to adder 240. The output 1106 from multiplier 230 may also be provided as an input to adder 240. Output 1106 and output 1104 may be summed to provide output 1108 to multiplexer MUX3 as a MAC result. The third selector OS_OUT may be set to "0" such that output 1108 is provided to and stored in the buffer 224. Output 1112 may then be provided to the PE 900 of the next row and/or the accumulators 120-124.
Thus, the PE 200 can be reconfigured to support AI workloads with a weight fixed data stream.
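For comparison with the input fixed sketch above, the weight fixed behavior of a single PE differs only in which operand is stationary and where the partial sum comes from: the weight is pre-loaded once (WE2 high for one cycle), activations stream in every cycle (WE1 high), and the partial sum arrives from the PE in the previous row (ISS = 1, OSS = 0, OS_OUT = 0). The function name and the example values are illustrative assumptions.

```python
# Illustrative sketch of one weight fixed PE: stationary weight, streaming
# activations, partial sums arriving from the previous row.
def weight_fixed_pe(preloaded_weight, activation_stream, prev_row_sums):
    outputs = []
    for act, prev_sum in zip(activation_stream, prev_row_sums):
        outputs.append(prev_sum + act * preloaded_weight)  # adder 240
    return outputs

if __name__ == "__main__":
    print(weight_fixed_pe(3, [1, 2, 4], [0, 5, 10]))  # [3, 11, 22]
```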
FIG. 12 illustrates a block diagram of a processing core 1200 including a 2x2 PE array in accordance with some embodiments. The processing core 1200 includes an input buffer 1204 (e.g., input buffer 104), a weight buffer 1202 (e.g., weight buffer 102), an output buffer 1208 (e.g., output buffer 108), and accumulators 1220, 1222 (e.g., accumulators 120, 122). PEs 1210 and 1212 form a first row, and PEs 1214 and 1216 form a second row; PEs 1210 and 1214 form a first column, and PEs 1212 and 1216 form a second column. Fig. 12 shows how the various inputs and outputs of PEs 1210-1216 are connected to each other, to buffers 1202-1208, and to accumulators 1220-1222. Processing core 1200 is similar to processing core 100 of FIG. 1 except that processing core 1200 includes a 2x2 PE array instead of the 3x3 PE array 110 shown in FIG. 1. Therefore, duplicate descriptions are omitted for clarity and simplicity. Furthermore, although processing core 1200 includes a 2x2 PE array, embodiments are not so limited, and additional PEs may be present in each column and/or row.
PEs 1210 and 1212 may receive weights from the weight buffer 1202 via weight lines WL1 and WL2. The weights may be stored in the buffers 222 of PEs 1210 and 1212. The stored weights may be transferred to PEs 1214 and 1216 via the weight transfer lines WTL1 and WTL2 of the corresponding columns (e.g., from PE 1210 to PE 1214 via weight transfer line WTL1, and from PE 1212 to PE 1216 via weight transfer line WTL2).
PEs 1210 and 1214 may receive input activations from the input buffer 1204 via input lines IL1 and IL2. The input activations may be stored in the buffers 220 of PEs 1210 and 1214. The input activations may be transferred to PEs 1212 and 1216 in the corresponding rows via input transfer lines ITL1 and ITL2 (e.g., from PE 1210 to PE 1212 via input transfer line ITL1, and from PE 1214 to PE 1216 via input transfer line ITL2).
PEs 1210 and 1212 may provide partial and/or full sums from the corresponding buffers 224 to PEs 1214 and 1216 via vertical sum transfer lines VSTL1 and VSTL2 (e.g., from PE 1210 to PE 1214 via vertical sum transfer line VSTL1, and from PE 1212 to PE 1216 via vertical sum transfer line VSTL2). PEs 1210 and 1214 may provide partial and/or full sums from the corresponding buffers 224 to PEs 1212 and 1216 via horizontal sum transfer lines HSTL1 and HSTL2 (e.g., from PE 1210 to PE 1212 via horizontal sum transfer line HSTL1, and from PE 1214 to PE 1216 via horizontal sum transfer line HSTL2).
PEs 1214 and 1216 may provide partial and/or full sums from their buffers 224 to the corresponding accumulators 1220 and 1222 via accumulator lines AL1 and AL2. For example, PE 1214 may communicate a partial/full sum to accumulator 1220 via accumulator line AL1, and PE 1216 may communicate a partial/full sum to accumulator 1222 via accumulator line AL2.
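The wiring of fig. 12 can be restated compactly as a connectivity map. The dictionary below lists, for each named line, its source and destination as described above; the dictionary form is just a convenient restatement, not a netlist from the patent.

```python
# Connectivity of the 2x2 core of FIG. 12, restated from the text above.
CONNECTIONS = {
    "WL1": ("weight_buffer_1202", "PE_1210"),  "WL2": ("weight_buffer_1202", "PE_1212"),
    "IL1": ("input_buffer_1204", "PE_1210"),   "IL2": ("input_buffer_1204", "PE_1214"),
    "WTL1": ("PE_1210", "PE_1214"),            "WTL2": ("PE_1212", "PE_1216"),
    "ITL1": ("PE_1210", "PE_1212"),            "ITL2": ("PE_1214", "PE_1216"),
    "VSTL1": ("PE_1210", "PE_1214"),           "VSTL2": ("PE_1212", "PE_1216"),
    "HSTL1": ("PE_1210", "PE_1212"),           "HSTL2": ("PE_1214", "PE_1216"),
    "AL1": ("PE_1214", "accumulator_1220"),    "AL2": ("PE_1216", "accumulator_1222"),
}
```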
Fig. 13 illustrates a block diagram of an AI accelerator 1300 including a processing core array, in accordance with some embodiments. For example, the AI accelerator 1300 may include a 4x4 array of the processing cores 100 of fig. 1. With a multi-core architecture as shown in fig. 13, the operation for one output feature may be divided into multiple segments, which may then be distributed to multiple cores. In some embodiments, different processing cores 100 may generate partial sums corresponding to one output feature. Thus, by interconnecting the cores, the accumulators (i.e., adders and buffers) can be reused to sum the partial sums from each core. The global buffer 1302 may be used to provide input activations and/or weights for the entire AI accelerator 1300, which may then be stored in the respective weight buffer 102 and/or input buffer 104 of the corresponding processing core 100. In some embodiments, the global buffer 1302 may include the input buffer 104 and/or the weight buffer 102.
In some embodiments, for an output fixed data stream, the PEs of PE array 110 may only need to accumulate a small number of MAC results in the worst case (e.g., for the highest accuracy), because accumulators 120-124 may be used to perform the full accumulation operation that sums the partial sums provided from each column. In some embodiments, the bit widths of the buffers (e.g., buffers 220-224) and adders (e.g., adder 240) internal to the PEs may therefore be made smaller.
FIG. 14 illustrates a graph 1400 of accuracy loss as a function of accumulator bit width in accordance with some embodiments. The x-axis of graph 1400 is the accumulator bit width in bits, and the y-axis is the loss of accuracy in percent (%). Graph 1400 is merely an example showing how the disclosed techniques may provide the benefits of reconfiguration, area reduction, and energy savings without significant loss of accuracy.
Considering the output fixed workflow with the modified partial-sum accumulation, no loss of accuracy is observed down to a 23-bit accumulator bit width. On the other hand, a typical AI accelerator may have a 30-bit wide accumulator to accommodate the maximum number of MAC results to be accumulated. Thus, rather than increasing the bit width of the weight fixed accumulator to accommodate the original worst case of the output fixed workflow, various embodiments may reduce the bit width of the buffers and adders to some extent. In some embodiments, the bit width may be aligned with the accumulator bit width of the input fixed and output fixed workflows. Thus, AI accelerators embodying the disclosed technology may have reduced area and energy consumption.
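A standard back-of-the-envelope sizing rule illustrates why the accumulator bit width can shrink when fewer products are accumulated inside a PE: the width needed to avoid overflow is roughly the product width plus log2 of the number of accumulations. The operand widths and accumulation counts below are assumptions chosen so the results land near the 23- and 30-bit figures mentioned above; they are not values taken from the patent.

```python
# Back-of-the-envelope accumulator sizing; parameter values are illustrative
# assumptions, not figures from the patent.
import math

def accumulator_bits(input_bits, weight_bits, num_accumulations):
    product_bits = input_bits + weight_bits              # width of one product
    return product_bits + math.ceil(math.log2(num_accumulations))

if __name__ == "__main__":
    print(accumulator_bits(8, 8, 16384))  # 30 bits for ~16k accumulations
    print(accumulator_bits(8, 8, 128))    # 23 bits for ~128 accumulations
```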
Fig. 15 illustrates a flowchart of an example method 1500 of operating a reconfigurable processing element for an AI accelerator, in accordance with some embodiments. Example method 1500 may be performed with processing core 100 and/or processing elements 111-119 or 200. Briefly, the method 1500 begins with operation 1502 of selecting, by a first multiplexer (e.g., multiplexer MUX1), a previous sum (e.g., previous output 204 or 208) from a previous column or previous row of a matrix of reconfigurable processing elements (e.g., PE array 110) based on a first selector (e.g., first selector ISS). The method 1500 continues with operation 1504 of multiplying the input activation state (e.g., the input 202 or the output of the buffer 220) with a weight (e.g., the weight 206 or the output of the buffer 222) to output a product. The method 1500 continues with operation 1506 of selecting a previous sum (e.g., the output of multiplexer MUX1) or a current sum (e.g., the output of buffer 224) by a second multiplexer (e.g., multiplexer MUX2) based on a second selector (e.g., second selector OSS). The method 1500 continues with operation 1508 of adding the product (e.g., the output of multiplier 230) to the selected previous sum or the selected current sum (e.g., the output of multiplexer MUX2) to output an updated sum. The method 1500 continues with operation 1510 of selecting the updated sum (e.g., the output of adder 240) or the previous sum (e.g., the output of multiplexer MUX1) by a third multiplexer (e.g., multiplexer MUX3) based on a third selector (e.g., third selector OS_OUT). The method 1500 continues with operation 1512 of outputting the selected updated sum or the selected previous sum to a next column or row of the matrix of reconfigurable processing elements.
With respect to operation 1502, selecting a previous sum from a previous column or a previous row depends on the mode of the reconfigurable PE. For example, when the reconfigurable PE is in the output fixed mode, the first selector selects the previous sum from the PEs in the previous row. When the reconfigurable PE is in the input fixed mode, the first selector selects the previous sum of PEs from the previous column. When the reconfigurable PE is in the weight fixed mode, the first selector selects the previous sum of PEs from the previous row.
With respect to operation 1504, for each mode, multiplication is performed on the input activation state and weights.
With respect to operation 1506, the selection of the previous sum or the current sum depends on the mode of the reconfigurable PE. For example, when the reconfigurable PE is in the output fixed mode, the second selector selects the current sum. When the reconfigurable PE is in the input fixed mode, the second selector selects the previous sum of PEs from the previous column. When the reconfigurable PE is in the weight fixed mode, the second selector selects the previous sum of PEs from the previous row.
With respect to operation 1508, an addition is performed based on the product from operation 1504 and the selected output of the second multiplexer and the mode of the reconfigurable PE. For example, in the output fixed pattern, the product is added to the current sum. In the input and weight fixed mode, the product is added to the previous sum.
With respect to operation 1510, the selection of the updated sum or the previous sum depends on the mode of the reconfigurable PE. For example, when the reconfigurable PE is in the output fixed mode, the third selector selects (1) the output of the adder when performing the accumulation operation of the partial sums, and (2) the previous sum when performing the outgoing operation. The third selector selects the output of the adder when the reconfigurable PE is in the input and weight fixed mode.
With respect to operation 1512, the output of the updated sum or the previous sum depends on the mode of the reconfigurable PE. For example, when the reconfigurable PE is in the output fixed mode, the previous sum is output. When the reconfigurable PE is in the input or weight fixed mode, the updated sum is output.
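Operations 1502-1512 and the mode-dependent selections described above can be condensed into a single behavioral function, shown below; the function and mode names are hypothetical, and the sketch assumes the selector behavior described with reference to figs. 3-11.

```python
# Behavioral sketch of one pass through operations 1502-1512; names and the
# single-call model are illustrative assumptions.
def reconfigurable_pe_step(mode, activation, weight, prev_col_sum, prev_row_sum,
                           current_sum, phase="mac"):
    # 1502: the first multiplexer selects the previous sum from the previous
    # column (input fixed) or the previous row (output fixed, weight fixed)
    prev_sum = prev_col_sum if mode == "input_fixed" else prev_row_sum
    # 1504: multiply the input activation state by the weight
    product = activation * weight
    # 1506: the second multiplexer selects the current sum (output fixed) or
    # the previous sum (input fixed, weight fixed)
    addend = current_sum if mode == "output_fixed" else prev_sum
    # 1508: the adder outputs the updated sum
    updated = product + addend
    # 1510/1512: the third multiplexer outputs the updated sum, or the previous
    # sum when an output fixed PE performs its outgoing operation
    if mode == "output_fixed" and phase == "outgoing":
        return prev_sum
    return updated
```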
In one aspect of the present disclosure, a reconfigurable processing circuit for an AI accelerator is disclosed. The reconfigurable processing circuit includes: a first memory configured to store an input activation state; a second memory configured to store weights; a multiplier configured to multiply the weight with the input activation state and output a product; a first multiplexer (mux) configured to output a previous sum from a previous reconfigurable processing element based on a first selector; a third memory configured to store a first sum; a second multiplexer configured to output the previous sum or the first sum based on a second selector; an adder configured to add the product with the previous sum or the first sum to output a second sum; and a third multiplexer configured to output the second sum or the previous sum based on a third selector.
In another aspect of the disclosure, a method of operating a reconfigurable processing element for an AI accelerator is disclosed. The method comprises the following steps: selecting, by a first multiplexer (mux), a previous sum of previous columns or previous rows from the matrix of reconfigurable processing elements based on a first selector; multiplying the input activation state by a weight to output a product; selecting, by a second multiplexer, the previous sum or the current sum based on a second selector; adding the product to the selected previous sum or the selected current sum to output an updated sum; selecting, by a third multiplexer, the updated sum or the previous sum based on a third selector; and outputting the selected updated sum or the selected previous sum.
In yet another aspect of the present disclosure, a processing core for an AI accelerator is disclosed. The processing core includes: an input buffer configured to store a plurality of input activation states; a weight buffer configured to store a plurality of weights; a matrix array of processing elements arranged in a plurality of rows and a plurality of columns; a plurality of accumulators configured to receive outputs from a last row of the plurality of rows and to sum one or more of the received outputs from the last row; and an output buffer configured to receive outputs from the plurality of accumulators. Each processing element of the matrix array of processing elements includes: a first memory configured to store an input activation state from the input buffer; a second memory configured to store weights from the weight buffer; a multiplier configured to multiply the weight with the input activation state and output a product; a first multiplexer (mux) configured to output a previous sum of processing elements from a previous row or a previous column based on a first selector; a third memory configured to store a first sum and output the first sum to a processing element of a next row or column; a second multiplexer configured to output the previous sum or the first sum based on a second selector; an adder configured to add the product with the previous sum or the first sum to output a second sum; and a third multiplexer configured to output the second sum or the previous sum based on a third selector.
As used herein, the terms "about" and "approximately" generally mean plus or minus 10% of the stated value. For example, about 0.5 would include 0.45 and 0.55, about 10 would include 9 to 11, and about 1000 would include 900 to 1100.
The foregoing outlines features of several embodiments so that those skilled in the art may better understand the aspects of the disclosure. Those skilled in the art should appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from the spirit and scope of the present disclosure.
Symbol description
100 processing core
102 weight buffer
104 input buffer
108 output buffer
110 Processing Element (PE) array
111 to 119 Processing Elements (PE)
120 accumulator
122 accumulator
124 accumulator
200 Processing Element (PE)
202 input(s)
204 previous output
206 weight
208 previous output
220 buffer (or memory)
222 buffer (or memory)
224 buffer (or memory)
230 multiplier(s)
240 adder
300 Processing Element (PE)
302 output
306 output
402 output
404 output
406 output
408 output
410 output
502 output
504 output
600 Processing Element (PE)
602 output
604 output/weight
702 output
704 output
706 output
708 output
712 output
900 Processing Element (PE)
902 output
1002 output
1102 output
1104 output
1106 output
1108 output
1112 output
1200 processing core
1202 weight buffer
1204 input buffer
1208 output buffer
1210 Processing Element (PE)
1212 Processing Element (PE)
1214 Processing Element (PE)
1216 Processing Element (PE)
1220 accumulator
1222 accumulator
1300 Artificial Intelligence (AI) accelerator
1302 global buffer
1400 chart
1500 method
1502 operation
1504 operation
1506 operation
1508 operation
1510 operation
1512 operation
AL1 to AL2 accumulator line
HSTL 1-HSTL 2 horizontal sum transfer line
IL1 to IL2 input line
ISS first selector
ITL1 to ITL2 input transmission lines
MUX1 to MUX3 multiplexer (MUX)
OSS second selector
OS_OUT third selector
VSTL 1-VSTL 2 vertical sum transmission line
WE1 to WE2 write Enable
WL1 to WL2 weight lines
WTL1 to WTL2, weight transmission line.

Claims (10)

1. A reconfigurable processing circuit, characterized in that the reconfigurable processing circuit comprises:
a first memory configured to store an input activation state;
a second memory configured to store weights;
a multiplier configured to multiply the weight with the input activation state and output a product;
a first multiplexer (mux) configured to output a previous sum from a previous reconfigurable processing element based on a first selector;
a third memory configured to store a first sum;
a second multiplexer configured to output the previous sum or the first sum based on a second selector;
an adder configured to add the product with the previous sum or the first sum to output a second sum; and
A third multiplexer configured to output the second sum or the previous sum based on a third selector.
2. The reconfigurable processing circuit of claim 1, wherein the first multiplexer is further configured to:
receiving as a first input a first previous sum from a first reconfigurable processing circuit of a first column;
Receiving as a second input a second previous sum from a second reconfigurable processing circuit of a different row; and
The first previous sum or the second previous sum is output as the previous sum based on a first selector.
3. The reconfigurable processing circuit of claim 2, wherein in a first mode, the first and second memories are further configured to update the stored input activation state and the stored weights, respectively, at each cycle.
4. The reconfigurable processing circuit of claim 3, wherein in the first mode, during an accumulation operation, the second multiplexer is further configured to output the first sum, and the third multiplexer is further configured to output the second sum during an accumulation operation.
5. The reconfigurable processing circuit of claim 4, wherein in the first mode, during an outgoing operation, the first multiplexer is further configured to output the second previous sum as the previous sum, and the third multiplexer is further configured to output the previous sum.
6. The reconfigurable processing circuit of claim 2, wherein in a second mode, only the second memories of the first and second memories are configured to update the weights stored per cycle.
7. The reconfigurable processing circuit of claim 6, wherein in the second mode:
the first multiplexer is further configured to output the first previous sum as the previous sum;
the second multiplexer is further configured to output the previous sum; and is also provided with
The third multiplexer is further configured to output the second sum.
8. The reconfigurable processing circuit of claim 2, wherein in a third mode, only the first memories of the first and second memories are configured to update the stored input activation state every cycle.
9. A processing core, the processing core comprising:
an input buffer configured to store a plurality of input activation states;
a weight buffer configured to store a plurality of weights;
a matrix array of processing elements arranged in a plurality of rows and a plurality of columns, wherein each processing element of the matrix array of processing elements includes:
A first memory configured to store an input activation state from the input buffer;
a second memory configured to store weights from the weight buffer;
a multiplier configured to multiply the weight with the input activation state and output a product;
a first multiplexer (mux) configured to output a previous sum of processing elements from a previous row or a previous column based on a first selector;
a third memory configured to store a first sum and output the first sum to a processing element of a next row or column;
a second multiplexer configured to output the previous sum or the first sum based on a second selector;
an adder configured to add the product with the previous sum or the first sum to output a second sum; and
A third multiplexer configured to output the second sum or the previous sum based on a third selector;
a plurality of accumulators configured to receive outputs from a last row of the plurality of rows and to sum one or more of the received outputs from the last row; and
An output buffer configured to receive outputs from the plurality of accumulators.
10. The processing core of claim 9, wherein a first row of the matrix array comprises a first processing element and a second processing element, and a second row of the matrix array comprises a third processing element and a fourth processing element,
wherein the first processing element is configured to output the first sum of the first processing element to the second processing element and the third processing element as the previous sums in the second and third processing elements, and
wherein the first multiplexer of the fourth processing element is configured to receive the first sum from the second processing element as a first input and the first sum from the third processing element as a second input.
CN202321824956.4U 2022-07-21 2023-07-12 Reconfigurable processing circuit and processing core Active CN220773595U (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US17/870,053 US20240028869A1 (en) 2022-07-21 2022-07-21 Reconfigurable processing elements for artificial intelligence accelerators and methods for operating the same
US17/870,053 2022-07-21

Publications (1)

Publication Number Publication Date
CN220773595U true CN220773595U (en) 2024-04-12

Family

ID=89576613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202321824956.4U Active CN220773595U (en) 2022-07-21 2023-07-12 Reconfigurable processing circuit and processing core

Country Status (2)

Country Link
US (1) US20240028869A1 (en)
CN (1) CN220773595U (en)

Also Published As

Publication number Publication date
US20240028869A1 (en) 2024-01-25

Similar Documents

Publication Publication Date Title
CN107578098B (en) Neural network processor based on systolic array
Venkataramanaiah et al. Automatic compiler based FPGA accelerator for CNN training
Liu et al. An FPGA-based processor for training convolutional neural networks
CN111897579A (en) Image data processing method, image data processing device, computer equipment and storage medium
CN107085562B (en) Neural network processor based on efficient multiplexing data stream and design method
Zhu et al. Mixed size crossbar based RRAM CNN accelerator with overlapped mapping method
US20230297819A1 (en) Processor array for processing sparse binary neural networks
Ding et al. A FPGA-based accelerator of convolutional neural network for face feature extraction
US9965343B2 (en) System and method for determining concurrency factors for dispatch size of parallel processor kernels
Liu Parallel and scalable sparse basic linear algebra subprograms
CN115803754A (en) Hardware architecture for processing data in a neural network
CN114675805A (en) In-memory calculation accumulator
Asadikouhanjani et al. A real-time architecture for pruning the effectual computations in deep neural networks
Xiong et al. Accelerating deep neural network computation on a low power reconfigurable architecture
Iliev et al. Low latency CMOS hardware acceleration for fully connected layers in deep neural networks
CN220773595U (en) Reconfigurable processing circuit and processing core
Asadikouhanjani et al. Enhancing the utilization of processing elements in spatial deep neural network accelerators
TW202405701A (en) Reconfigurable processing elements for artificial intelligence accelerators and methods for operating the same
US11687831B1 (en) Method, product, and apparatus for a multidimensional processing array for hardware acceleration of convolutional neural network inference
KR102372869B1 (en) Matrix operator and matrix operation method for artificial neural network
Clere et al. FPGA based reconfigurable coprocessor for deep convolutional neural network training
Dai et al. An energy-efficient bit-split-and-combination systolic accelerator for nas-based multi-precision convolution neural networks
Hossain et al. Energy efficient computing with heterogeneous DNN accelerators
Qu et al. A Coordinated Model Pruning and Mapping Framework for RRAM-based DNN Accelerators
CN220569161U (en) Adaptive in-memory arithmetic circuit

Legal Events

Date Code Title Description
GR01 Patent grant