TWI796977B - Memory device and operation method thereof

Info

Publication number: TWI796977B
Application number: TW111110903A
Authority: TW
Inventors: 胡瀚文; 李永駿; 林柏榕; 王淮慕; 王韋程
Original assignee: 旺宏電子股份有限公司
Priority date: 2021-11-22
Filing date: 2022-03-23
Publication date: 2023-03-21
Also published as: TW202321952A

Abstract

A memory device and an operation method thereof are provided. The operation method includes: encoding input data, sending the encoded input data to at least one page buffer, and reading out the encoded input data in parallel; encoding a first part and a second part of weight data into an encoded first part and an encoded second part of the weight data, respectively, writing the encoded first part and the encoded second part of the weight data into a plurality of memory cells of the memory device, and reading out the encoded first part and the encoded second part of the weight data in parallel; multiplying the encoded input data by the encoded first part and the encoded second part of the weight data, respectively, to generate a plurality of partial products in parallel; and accumulating the partial products to generate an operation result.

Description

Memory device and method of operation thereof

The present invention relates to a memory device with in-memory computing (IMC) and an operation method thereof.

Artificial intelligence (AI) has become a highly effective solution in many fields. The key operation in AI is the multiply-and-accumulate (MAC) operation performed on large amounts of input data (such as input feature maps) and weight values.

However, current AI architectures are prone to an input/output (IO) bottleneck and an inefficient MAC operation flow.

To achieve high accuracy, MAC operations with multi-bit inputs and multi-bit weight values can be performed. However, the IO bottleneck then becomes more severe and efficiency drops further.

In-memory computing (IMC) can be used to accelerate MAC operations, because IMC reduces the need for the complex arithmetic logic units (ALUs) required in a central processing architecture and provides high parallelism for MAC operations inside the memory.

Unsigned integer multiplication and signed integer multiplication in IMC are described as follows.

For example, consider multiplying two unsigned numbers, both 8 bits wide: a[7:0] and b[7:0]. Eight single-bit multiplications can be performed to generate eight partial products p0[7:0] to p7[7:0], each of which corresponds to one bit of the multiplicand a. The eight partial products can be expressed as follows:

p0[7:0] = a[0] × b[7:0] = {8{a[0]}} & b[7:0]
p1[7:0] = a[1] × b[7:0] = {8{a[1]}} & b[7:0]
p2[7:0] = a[2] × b[7:0] = {8{a[2]}} & b[7:0]
p3[7:0] = a[3] × b[7:0] = {8{a[3]}} & b[7:0]
p4[7:0] = a[4] × b[7:0] = {8{a[4]}} & b[7:0]
p5[7:0] = a[5] × b[7:0] = {8{a[5]}} & b[7:0]
p6[7:0] = a[6] × b[7:0] = {8{a[6]}} & b[7:0]
p7[7:0] = a[7] × b[7:0] = {8{a[7]}} & b[7:0]

Here, {8{a[0]}} means that a[0] is repeated 8 times, and likewise for the other terms.

To obtain the product, the eight partial products p0[7:0] to p7[7:0] are added together, as shown in FIG. 1A, which shows the multiplication of two 8-bit unsigned numbers.

Here, P0 = p0[0] + 0 + 0 + 0 + 0 + 0 + 0 + 0, P1 = p0[1] + p1[0] + 0 + 0 + 0 + 0 + 0 + 0, and so on.

The product P[15:0] is obtained from P0 to P15. It is the 16-bit unsigned product of the two 8-bit unsigned numbers.
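The partial-product scheme above can be sketched in Python as follows. This is an illustrative model only, not circuitry from the patent; the function names `partial_products` and `multiply_unsigned` are hypothetical, and the shift by k models the column alignment of FIG. 1A.

```python
def partial_products(a: int, b: int, width: int = 8) -> list:
    """Return p0..p(width-1), where p_k = {width{a[k]}} & b."""
    mask = (1 << width) - 1
    products = []
    for k in range(width):
        a_k = (a >> k) & 1
        replicated = mask if a_k else 0  # {8{a[k]}}: bit a[k] repeated `width` times
        products.append(replicated & (b & mask))
    return products


def multiply_unsigned(a: int, b: int, width: int = 8) -> int:
    """Add the partial products, each shifted left by its bit position k."""
    return sum(p << k for k, p in enumerate(partial_products(a, b, width)))


print(multiply_unsigned(200, 123))  # 24600
```

Shifting p_k left by k places it in the columns that FIG. 1A sums into P0 to P15, so the plain integer sum already includes the carries between columns.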

If b is a signed number, the partial products must be sign-extended to the width of the product before they are summed. If a is also a signed number, the partial product p7 must be subtracted from the final sum rather than added to it.

FIG. 1B shows the multiplication of two 8-bit signed numbers. In FIG. 1B, the symbol "~" denotes the complement; for example, ~p1[7] is the complement of p1[7].
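As a rough illustration of the signed case, the sketch below models two's-complement multiplication by sign-extending b to the product width and subtracting the partial product that belongs to the sign bit a[7]. It follows the sign-extend-and-subtract description rather than the complemented-bit arrangement drawn in FIG. 1B, and the helper names `sign_extend` and `multiply_signed` are hypothetical.

```python
def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Sign-extend a `from_bits`-wide two's-complement value to `to_bits` bits."""
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):  # sign bit set
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value


def multiply_signed(a: int, b: int, width: int = 8) -> int:
    """Multiply two signed `width`-bit integers using sign-extended partial products."""
    out_bits = 2 * width
    out_mask = (1 << out_bits) - 1
    a_bits = a & ((1 << width) - 1)
    b_ext = sign_extend(b, width, out_bits)
    total = 0
    for k in range(width):
        if (a_bits >> k) & 1:
            term = (b_ext << k) & out_mask
            # The sign bit a[width-1] has negative weight, so its partial
            # product is subtracted instead of added.
            total += -term if k == width - 1 else term
    total &= out_mask
    # Interpret the 2*width-bit result as a signed integer.
    return total - (1 << out_bits) if total >> (out_bits - 1) else total


print(multiply_signed(-3, 2))  # -6
```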

In IMC, performance benefits if the operation speed can be increased and the capacity requirement reduced.

According to one example of the present application, a memory device is provided, including: a plurality of memory dies, each of the memory dies including a plurality of memory planes, a plurality of page buffers and an accumulation circuit, and each of the memory planes including a plurality of memory cells. Input data is encoded, the encoded input data is sent to at least one page buffer, and the encoded input data is read out in parallel from the at least one page buffer. A first part and a second part of weight data are encoded into an encoded first part and an encoded second part of the weight data, respectively, which are written into the memory cells of the memory device and read out in parallel. The encoded input data is multiplied by the encoded first part and by the encoded second part of the weight data, respectively, to generate a plurality of partial products in parallel. The partial products are accumulated to generate an operation result.

According to another example of the present application, an operation method of a memory device is provided, including: encoding input data, sending the encoded input data to at least one page buffer, and reading out the encoded input data in parallel from the at least one page buffer; encoding a first part and a second part of weight data into an encoded first part and an encoded second part of the weight data, respectively, writing the encoded first part and the encoded second part of the weight data into a plurality of memory cells of the memory device, and reading out the encoded first part and the encoded second part of the weight data in parallel; multiplying the encoded input data by the encoded first part and by the encoded second part of the weight data, respectively, to generate a plurality of partial products in parallel; and accumulating the partial products to generate an operation result.

For a better understanding of the above and other aspects of the present invention, embodiments are described in detail below with reference to the accompanying drawings.

The technical terms in this specification follow the customary usage in this technical field. Where this specification explains or defines a term, that explanation or definition prevails. Each embodiment of the present disclosure has one or more technical features. Where implementation is possible, a person having ordinary skill in the art may selectively implement some or all of the technical features of any embodiment, or selectively combine some or all of the technical features of these embodiments.

FIG. 2 shows a flowchart of an operation method of a memory device according to an embodiment of the present application. In step 210, input data is encoded, and the encoded input data (a vector) is sent to a plurality of page buffers and read out from the page buffers in parallel. Details of how the input data is encoded are described below.

In step 220, weight data is encoded, the encoded weight data (a vector) is written into a plurality of memory cells of the memory device, and the encoded weight data is read out in parallel. Details of how the weight data is encoded are described below. During encoding, the most significant bits (MSB) and the least significant bits (LSB) of the weight data are encoded separately.

In step 230, the input data is multiplied by the MSB and by the LSB of the encoded weight data, respectively, to generate a plurality of partial products in parallel.

In step 240, the partial products are summed (accumulated) to generate a multiply-and-accumulate (MAC) result or a Hamming distance result.

An embodiment of the present application discloses a memory device capable of digital MAC operations, with error-bit-tolerance data encoding to tolerate error bits and reduce area requirements. The error-bit-tolerance data encoding uses input data duplication and weight data flattening. In addition, the sensing technique of this embodiment includes a standard single-level cell (SLC) read and a logic AND function, which perform bit multiplication to generate the partial products. In other possible embodiments, if the page buffer does not remove the input data stored in the latch units during sensing, the SLC read can be replaced by a selected bit read, or by a multi-level cell (MLC), triple-level cell (TLC) or quad-level cell (QLC) read operation. Furthermore, the digital MAC operation of an embodiment uses a high-bandwidth weighted accumulator to generate the output result; the high-bandwidth weighted accumulator can implement weighted accumulation by reusing the fail-bit-count (FBC) circuit.

Another embodiment of the present application discloses a memory device capable of Hamming distance computation, with error-bit-tolerance data encoding to tolerate error bits. The error-bit-tolerance data encoding uses input data duplication and weight data flattening. In addition, the sensing technique of this embodiment includes a standard single-level cell (SLC) read and a logic exclusive-OR (EXOR) function, which perform bit multiplication to generate the partial products. In other possible embodiments, if the page buffer does not remove the input data stored in the latch units during sensing, the SLC read can be replaced by a selected bit read, or by an MLC, TLC or QLC read operation. The logic EXOR function can be replaced by a logic exclusive-NOR (XNOR) function followed by a logic inversion. Furthermore, the digital Hamming distance operation of an embodiment uses a high-bandwidth unweighted accumulator to generate the output result; the high-bandwidth unweighted accumulator can implement unweighted accumulation by reusing the fail-bit-count (FBC) circuit.
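The Hamming distance path described above (bitwise EXOR followed by an unweighted count of the 1 bits) can be sketched as follows. This is a minimal software model; the function names and the list-of-bits representation are illustrative assumptions, not taken from the patent.

```python
def hamming_distance(x_bits, w_bits):
    """Count the positions where two equal-length bit vectors differ (EXOR + count)."""
    if len(x_bits) != len(w_bits):
        raise ValueError("bit vectors must have the same length")
    return sum(x ^ w for x, w in zip(x_bits, w_bits))


def hamming_distance_xnor(x_bits, w_bits):
    """Same result using XNOR followed by inversion, as the text allows."""
    if len(x_bits) != len(w_bits):
        raise ValueError("bit vectors must have the same length")
    total = 0
    for x, w in zip(x_bits, w_bits):
        xnor = 1 - (x ^ w)   # 1 when the bits are equal
        total += 1 - xnor    # inversion restores the EXOR value
    return total


x = [1, 0, 1, 1, 0, 0, 1, 0]
w = [1, 1, 1, 0, 0, 1, 1, 0]
print(hamming_distance(x, w))  # 3
```

Every differing position contributes exactly 1 regardless of where it sits in the vector, which is why an unweighted accumulator is enough for this operation.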

FIG. 3A and FIG. 3B show the error-bit-tolerance data encoding of an embodiment of the present application. For example, but not limited to, the input data and the weight data are 32-bit floating-point (floating point 32) data. In FIG. 3A, the input data and the weight data are quantized into 8-bit binary integers; both are 8-bit vectors with N dimensions (N being a positive integer). The input data and the weight data can be expressed as X_i(7:0) and W_i(7:0), respectively.

In FIG. 3B, each of the N-dimensional 8-bit weight vectors is split into an MSB vector and an LSB vector. The MSB vector of an 8-bit weight vector consists of the 4 bits W_i(7:4), and the LSB vector consists of the 4 bits W_i(3:0).

Next, each bit of the MSB vector and of the LSB vector of the 8-bit weight vector is represented in unary coding (that is, value format). For example, bit W_{i=0}(7) of the MSB vector is represented by 8 bits (duplicated 8 times), bit W_{i=0}(6) by 4 bits (duplicated 4 times), bit W_{i=0}(5) by 2 bits (duplicated 2 times), and bit W_{i=0}(4) by 1 bit (duplicated once), and a spare bit (0) is appended after bit W_{i=0}(4). In this way, the 4-bit MSB vector of the 8-bit weight vector is encoded into a 16-bit unary coding format.

Similarly, the 4-bit LSB vector of the 8-bit weight vector can be encoded into a 16-bit unary coding format.
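The weight flattening just described can be sketched as follows: each 4-bit half of the weight becomes a 16-bit string in which the bits are duplicated 8, 4, 2 and 1 times, followed by one spare 0. The function names are hypothetical and the bit-string representation is chosen only for readability.

```python
def encode_nibble_unary(nibble: int) -> str:
    """Encode a 4-bit value: bit 3 x8, bit 2 x4, bit 1 x2, bit 0 x1, then a spare 0."""
    if not 0 <= nibble <= 0xF:
        raise ValueError("nibble must be in the range 0..15")
    parts = []
    for pos in (3, 2, 1, 0):
        bit = (nibble >> pos) & 1
        parts.append(str(bit) * (1 << pos))  # duplicate the bit 2**pos times
    parts.append("0")                        # spare bit
    return "".join(parts)


def encode_weight(weight: int):
    """Split an 8-bit weight into MSB and LSB halves and encode each half."""
    msb = (weight >> 4) & 0xF
    lsb = weight & 0xF
    return encode_nibble_unary(msb), encode_nibble_unary(lsb)


print(encode_nibble_unary(0b1010))  # 1111111100001100
```

With this duplication pattern, the number of 1 bits in the 16-bit string equals the value of the 4-bit half (here 8 + 2 = 10), so counting 1 bits recovers the value.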

In an embodiment of the present application, the above encoding improves the tolerance of error bits.

FIG. 4A shows an 8-bit unsigned integer multiplication in an embodiment of the present application, and FIG. 4B shows an 8-bit signed integer multiplication in an embodiment of the present application.

As shown in FIG. 4A, in an 8-bit unsigned multiplication, in cycle 0, bit X_i(7) of the input data (the input data having been encoded into the unary coding format) is multiplied by the MSB vector W_i(7:4) of the weight data (the MSB vector having been encoded into the unary coding format) to obtain a first MSB partial product. Similarly, X_i(7) of the input data is multiplied by the LSB vector W_i(3:0) of the weight data (the LSB vector having been encoded into the unary coding format) to obtain a first LSB partial product. The first MSB partial product is shifted by 4 bits and added to the first LSB partial product to obtain a first partial product.

In cycle 1, X_i(6) of the input data is multiplied by the MSB vector W_i(7:4) of the weight data (the MSB vector having been encoded into the unary coding format) to obtain a second MSB partial product. Similarly, X_i(6) of the input data is multiplied by the LSB vector W_i(3:0) of the weight data (the LSB vector having been encoded into the unary coding format) to obtain a second LSB partial product. The second MSB partial product is shifted by 4 bits and added to the second LSB partial product to obtain a second partial product. In addition, the first partial product is shifted by 1 bit and added to the second partial product to obtain an updated second partial product. The operations of the remaining cycles (cycle 2 to cycle 7) follow the same pattern and are not repeated here.

That is, an 8-bit unsigned multiplication is completed in 8 cycles.
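The 8-cycle unsigned flow of FIG. 4A can be modeled roughly as below. As a simplification, the unary-coded weight halves are represented by their integer values (which equal the number of 1 bits in the unary code), and the function name `multiply_bit_serial` is hypothetical.

```python
def multiply_bit_serial(x: int, w: int) -> int:
    """Multiply two 8-bit unsigned values one input bit per cycle, MSB first."""
    w_msb = (w >> 4) & 0xF   # W(7:4)
    w_lsb = w & 0xF          # W(3:0)
    acc = 0
    for cycle in range(8):
        x_bit = (x >> (7 - cycle)) & 1       # cycle 0 uses X(7), cycle 1 uses X(6), ...
        msb_pp = x_bit * w_msb               # MSB partial product
        lsb_pp = x_bit * w_lsb               # LSB partial product
        partial = (msb_pp << 4) + lsb_pp     # shift MSB part by 4 bits, then add
        acc = (acc << 1) + partial           # shift the running result by 1 bit, then add
    return acc


print(multiply_bit_serial(200, 123))  # 24600
```

Because the input bits arrive most significant first, shifting the running result by one bit per cycle gives each earlier partial product its correct weight by the final cycle.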

As shown in FIG. 4B, in an 8-bit signed multiplication, in cycle 0, X_i(7) of the input data is multiplied by bit W_i(7) of the MSB vector of the weight data (the MSB vector having been encoded into the unary coding format), and X_i(7) of the input data is multiplied by bits W_i(6:4) of the MSB vector of the weight data (the MSB vector having been encoded into the unary coding format) and the result is inverted; the two are added to obtain a first MSB partial product. X_i(7) of the input data is multiplied by the LSB vector W_i(3:0) of the weight data (the LSB vector having been encoded into the unary coding format) and the result is inverted to obtain a first LSB partial product. The first MSB partial product is shifted by 4 bits and added to the first LSB partial product to obtain a first partial product.

In cycle 1, X_i(6) of the input data is multiplied by bit W_i(7) of the MSB vector of the weight data (the MSB vector having been encoded into the unary coding format) and the result is inverted, and X_i(6) of the input data is multiplied by bits W_i(6:4) of the MSB vector of the weight data (the MSB vector having been encoded into the unary coding format); the two are added to obtain a second MSB partial product. Similarly, X_i(6) of the input data is multiplied by the LSB vector W_i(3:0) of the weight data (the LSB vector having been encoded into the unary coding format) to obtain a second LSB partial product. The second MSB partial product is shifted by 4 bits and added to the second LSB partial product to obtain a second partial product. In addition, the first partial product is shifted by 1 bit and added to the second partial product to obtain an updated second partial product. The operations of the remaining cycles (cycle 2 to cycle 7) follow the same pattern and are not repeated here.

That is, an 8-bit signed multiplication is completed in 8 cycles.

The above approach needs 8 cycles to complete an 8-bit unsigned multiplication or an 8-bit signed multiplication.

FIG. 5A shows a schematic diagram of an unsigned multiplication according to an embodiment of the present application. FIG. 5B shows a schematic diagram of a signed multiplication according to an embodiment of the present application. FIG. 5A and FIG. 5B use 8-bit input data and 8-bit weight data as an example, but the present application is not limited thereto.

In FIG. 5A and FIG. 5B, the input data is also encoded, and the MSB vector and the LSB vector of the weight data have been encoded into the unary coding format.

In FIG. 5A and FIG. 5B, the input data is input into the page buffers, and the weight data is written into a plurality of memory cells.

In FIG. 5A, the input data is read out in parallel from the page buffers, the weight data is read out in parallel from the memory cells, and the multiplications are performed in parallel to obtain the partial products.

In detail, bit X_i(7) of the input data is multiplied by the MSB vector W_i(7:4) of the weight data to obtain a first MSB partial product. Bit X_i(6) of the input data is multiplied by the MSB vector W_i(7:4) of the weight data to obtain a second MSB partial product. The rest follow in the same way, until bit X_i(0) of the input data is multiplied by the MSB vector W_i(7:4) of the weight data to obtain an eighth MSB partial product. For example, in FIG. 5A, bit X_i(7) of the input data is duplicated 15 times and a spare bit is added, forming the 16-bit multiplier "0000000000000000". This 16-bit multiplier "0000000000000000" is multiplied by the MSB vector W_i(7:4) of the weight data, "1111111100001100", to obtain the first MSB partial product "0000000000000000". The rest follow in the same way. All of the MSB partial products can be merged into an input stream M.

Similarly, X_i(7) of the input data is multiplied by the LSB vector W_i(3:0) of the weight data to obtain a first LSB partial product. X_i(6) of the input data is multiplied by the LSB vector W_i(3:0) of the weight data to obtain a second LSB partial product. The rest follow in the same way, until bit X_i(0) of the input data is multiplied by the LSB vector W_i(3:0) of the weight data to obtain an eighth LSB partial product. All of the LSB partial products can be merged into an input stream L.

Afterwards, the first to eighth MSB partial products and the first to eighth LSB partial products are merged, and the number of 1 bits in the merged value is counted to obtain the MAC result of the unsigned multiplication.
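The parallel flow of FIG. 5A can be sketched as below: every input bit is duplicated into a 16-bit multiplier, ANDed bit by bit with the unary-coded weight halves, and the 1 bits are counted. The weighting applied to each count (2^k for input bit k and a further factor of 16 for the MSB half) and the value 0 for the input spare bit are assumptions made for this sketch, and all function names are hypothetical.

```python
def encode_nibble_unary(nibble: int) -> str:
    """Bit 3 x8, bit 2 x4, bit 1 x2, bit 0 x1, then a spare 0 (16 bits total)."""
    bits = [str((nibble >> pos) & 1) * (1 << pos) for pos in (3, 2, 1, 0)]
    return "".join(bits) + "0"


def duplicate_input_bit(bit: int) -> str:
    """Duplicate one input bit 15 times and append a spare bit (assumed to be 0)."""
    return str(bit) * 15 + "0"


def bitwise_and(a: str, b: str) -> str:
    """Bit-by-bit AND of two equal-length bit strings."""
    return "".join("1" if p == "1" and q == "1" else "0" for p, q in zip(a, b))


def mac_single(x: int, w: int) -> int:
    """Product of two 8-bit unsigned values from AND results and weighted bit counts."""
    w_msb = encode_nibble_unary((w >> 4) & 0xF)
    w_lsb = encode_nibble_unary(w & 0xF)
    total = 0
    for k in range(8):  # in hardware these 8 products are formed in parallel
        x_k = duplicate_input_bit((x >> k) & 1)
        msb_count = bitwise_and(x_k, w_msb).count("1")
        lsb_count = bitwise_and(x_k, w_lsb).count("1")
        total += ((msb_count << 4) + lsb_count) << k  # assumed weighting
    return total


print(mac_single(3, 0xA5))  # 495
```

Summing `mac_single` over all N dimensions of the input and weight vectors would give the full MAC result for the vector pair.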

In FIG. 5B, the input data is read out in parallel from the page buffers, the weight data is read out in parallel from the memory cells, and the multiplications are performed in parallel to obtain the partial products.

In detail, bit X_i(7) of the input data is multiplied by the MSB vector W_i(7:4) of the weight data to obtain a first MSB partial product. Bit X_i(6) of the input data is multiplied by the MSB vector W_i(7:4) of the weight data to obtain a second MSB partial product. The rest follow in the same way, until bit X_i(0) of the input data is multiplied by the MSB vector W_i(7:4) of the weight data to obtain an eighth MSB partial product.

Similarly, X_i(7) of the input data is multiplied by the LSB vector W_i(3:0) of the weight data to obtain a first LSB partial product. X_i(6) of the input data is multiplied by the LSB vector W_i(3:0) of the weight data to obtain a second LSB partial product. The rest follow in the same way, until bit X_i(0) of the input data is multiplied by the LSB vector W_i(3:0) of the weight data to obtain an eighth LSB partial product.

Afterwards, the first to eighth MSB partial products and the first to eighth LSB partial products are merged, and the number of 1 bits in the merged value is counted to obtain the MAC result of the signed multiplication.

FIG. 6 shows a functional block diagram of a memory device according to an embodiment of the present application. The memory device 600 includes a plurality of memory dies 615. FIG. 6 uses a memory device 600 with four memory dies 615 as an example, but the present application is not limited thereto.

The memory die 615 includes a plurality of memory planes (MP) 620, a plurality of page buffers 625 and an accumulation circuit 630. FIG. 6 uses a memory die 615 with four memory planes 620 and four page buffers 625 as an example, but the present application is not limited thereto. The memory plane 620 includes a plurality of memory cells (not shown). The weight data is stored in the memory cells.

Within each memory die 615, the accumulation circuit 630 is shared by the memory planes 620, so the accumulation circuit 630 performs the accumulation operations of the memory planes 620 sequentially. In addition, each memory die 615 can independently perform the digital MAC operation and the digital Hamming distance operation of the above embodiments.

The input data can be input into the page buffers 625 through a plurality of word lines.

The page buffer 625 includes a sensing circuit 631, a plurality of latch units 633-641, and a plurality of logic gates 643 and 645.

The sensing circuit 631 is coupled to the bit line BL to sense the current on the bit line BL.

The latch units 633-641 are, for example but not limited to, a data latch (DL) 633, a latch (L1) 635, a latch (L2) 637, a latch (L3) 639 and a common data latch (CDL) 641, respectively. The latch units 633-641 are, for example but not limited to, single-bit latches.

The data latch 633 latches the weight data and outputs the weight data to the logic gates 643 and 645.

The latch (L1) 635 and the latch (L3) 639 are used for decoding.

The latch (L2) 637 latches the input data and outputs the input data to the logic gates 643 and 645.

The common data latch 641 latches the data sent from the logic gate 643 or 645.

The logic gates 643 and 645 are, for example but not limited to, a logic AND gate and a logic XOR gate, respectively. The logic gate 643 performs a logic AND operation on the input data and the weight data and writes the result into the common data latch 641. The logic gate 645 performs a logic XOR operation on the input data and the weight data and writes the result into the common data latch 641. The logic gates 643 and 645 are controlled by the enable signals AND_EN and XOR_EN, respectively. For example, when a digital MAC operation is performed, the logic gate 643 is enabled by the enable signal AND_EN; when a digital Hamming distance operation is performed, the logic gate 645 is enabled by the enable signal XOR_EN.
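The per-bit behaviour of the two gates can be modeled with a small function. The parameter names `and_en` and `xor_en` mirror the AND_EN and XOR_EN signals; the function name and the rule that exactly one enable must be active at a time are assumptions of this sketch.

```python
def page_buffer_bit(input_bit: int, weight_bit: int,
                    and_en: bool = False, xor_en: bool = False) -> int:
    """Return the value latched in the common data latch for one bit line."""
    if and_en == xor_en:
        # Assumption of this sketch: exactly one gate is enabled per operation.
        raise ValueError("exactly one of and_en / xor_en must be set")
    if and_en:
        return input_bit & weight_bit   # digital MAC: logic gate 643 (AND)
    return input_bit ^ weight_bit       # Hamming distance: logic gate 645 (XOR)


print(page_buffer_bit(1, 1, and_en=True))  # 1
print(page_buffer_bit(1, 1, xor_en=True))  # 0
```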

Taking FIG. 5A or FIG. 5B as an example, one bit of the bit X_i(7) of the input data is input to the latch (L2) 637, and one bit of the MSB vector W_i(7:4) of the weight data, already encoded into the unary coding format, is input to the data latch 633. The input data in the latch (L2) 637 and the weight data in the data latch 633 are logically operated on by the logic gate 643 or 645, and the common data latch 641 latches the data sent from the logic gate 643 or 645. The common data latch 641 can also be regarded as the data output path of the bit line.

The accumulation circuit 630 includes a partial product accumulation unit 651, a single-dimension product generation unit 653, a first multi-dimension accumulation unit 655, a second multi-dimension accumulation unit 657 and a weight accumulation control unit 659.

The partial product accumulation unit 651 is coupled to the page buffer 625 and receives a plurality of logic operation results from the plurality of common data latches 641 of the page buffer 625 to generate a plurality of partial products.

For example, in FIG. 5A or FIG. 5B, the partial product accumulation unit 651 generates the first to eighth MSB partial products and the first to eighth LSB partial products.

The single-dimensional product generation unit 653 is coupled to the partial product accumulation unit 651 and accumulates the partial products generated by the partial product accumulation unit 651 to generate a single-dimensional product.

For example, in FIG. 5A or FIG. 5B, the single-dimensional product generation unit 653 accumulates the first to eighth MSB partial products and the first to eighth LSB partial products generated by the partial product accumulation unit 651 to generate a single-dimensional product.

For example, after the <0>-th dimension product is generated in cycle 0, the <1>-th dimension product can be generated in cycle 1, and so on.

The first multi-dimensional accumulation unit 655 is coupled to the single-dimensional product generation unit 653 and accumulates a plurality of single-dimensional products generated by the single-dimensional product generation unit 653 to obtain a multi-dimensional multiply-accumulate result.

For example but not limited thereto, the first multi-dimensional accumulation unit 655 accumulates the <0>-th to <7>-th dimension products generated by the single-dimensional product generation unit 653 to obtain an 8-dimensional <0:7> multiply-accumulate result. Next, the first multi-dimensional accumulation unit 655 accumulates the <8>-th to <15>-th dimension products generated by the single-dimensional product generation unit 653 to obtain another 8-dimensional <8:15> multiply-accumulate result.

The second multi-dimensional accumulation unit 657 is coupled to the first multi-dimensional accumulation unit 655 and accumulates a plurality of multi-dimensional multiply-accumulate results generated by the first multi-dimensional accumulation unit 655 to obtain an output accumulation value. For example but not limited thereto, the second multi-dimensional accumulation unit 657 accumulates 64 8-dimensional multiply-accumulate results generated by the first multi-dimensional accumulation unit 655 to obtain a 512-dimensional output accumulation value.
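The two-level accumulation performed by units 655 and 657 can be sketched in software as follows. This is only an illustrative model: the function name `hierarchical_accumulate` and the `group_size` parameter are hypothetical, and the input values are arbitrary placeholders standing in for single-dimensional products.

```python
def hierarchical_accumulate(single_dim_products, group_size=8):
    """Sum single-dimensional products in groups, then sum the group results.

    The first level mimics the first multi-dimensional accumulation unit
    (e.g. 8 dimensions per group); the second level mimics the second
    multi-dimensional accumulation unit (e.g. 64 groups -> 512 dimensions).
    """
    if len(single_dim_products) % group_size != 0:
        raise ValueError("number of products must be a multiple of group_size")
    group_sums = [
        sum(single_dim_products[i:i + group_size])
        for i in range(0, len(single_dim_products), group_size)
    ]
    return group_sums, sum(group_sums)


# Placeholder products for 512 dimensions (values chosen only for the demo).
products = list(range(512))
group_sums, output_value = hierarchical_accumulate(products, group_size=8)
```

With 512 single-dimensional products and a group size of 8, the first level yields 64 intermediate results, matching the 64 x 8 = 512 arrangement given in the example above.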

The weight accumulation control unit 659 is coupled to the partial product accumulation unit 651, the single-dimensional product generation unit 653 and the first multi-dimensional accumulation unit 655. The weight accumulation control unit 659 is enabled or disabled depending on whether a digital MAC operation or a digital Hamming distance operation is performed. For example but not limited thereto, the weight accumulation control unit 659 is enabled when a digital MAC operation is performed and disabled when a digital Hamming distance operation is performed. When enabled, the weight accumulation control unit 659 outputs control signals to the partial product accumulation unit 651, the single-dimensional product generation unit 653 and the first multi-dimensional accumulation unit 655 according to the weight accumulation enable signal WACC_EN.
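The difference between weighted and unweighted accumulation can be sketched as follows. The weighting scheme shown here (each count shifted by its bit position, least significant first) is an assumption chosen for illustration; the text above states only that weighted accumulation is enabled for MAC and disabled for Hamming distance. The function name `accumulate` is hypothetical.

```python
def accumulate(partial_counts, wacc_en):
    """Accumulate per-bit-position counts with or without binary weighting.

    partial_counts[i] is assumed to be the count of ones obtained for bit
    position i (index 0 = least significant).
    """
    if wacc_en:
        # Weighted accumulation (MAC): position i contributes count * 2**i.
        return sum(count << i for i, count in enumerate(partial_counts))
    # Unweighted accumulation (Hamming distance): every one counts equally.
    return sum(partial_counts)


weighted = accumulate([1, 2, 3], wacc_en=True)     # 1*1 + 2*2 + 3*4
unweighted = accumulate([1, 2, 3], wacc_en=False)  # 1 + 2 + 3
```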

A single page buffer 625 in FIG. 6 is coupled to a plurality of bit lines BL. For example but not limited thereto, each page buffer 625 is coupled to 131072 bit lines BL, and in each cycle the data results on 128 bit lines BL are selected and sent to the accumulation circuit 630 for accumulation. In this case, 1024 cycles (131072 / 128) are needed to send out the data on all 131072 bit lines BL.
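The cycle count follows directly from the bit-line count and the number of bit lines selected per cycle. A small sketch of the arithmetic, using a hypothetical helper name `cycles_to_drain`:

```python
def cycles_to_drain(num_bit_lines, bits_per_cycle):
    """Number of cycles needed to send all bit-line results to the accumulator."""
    if bits_per_cycle <= 0:
        raise ValueError("bits_per_cycle must be positive")
    # Ceiling division, in case the counts do not divide evenly.
    return -(-num_bit_lines // bits_per_cycle)


cycles = cycles_to_drain(131072, 128)
```

For the figures given in the example, 131072 bit lines at 128 per cycle, the helper returns 1024.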

In the above description, the partial product accumulation unit 651 receives 128 bits at a time, the first multi-dimensional accumulation unit 655 generates an 8-dimensional multiply-accumulate result, and the second multi-dimensional accumulation unit 657 generates a 512-dimensional output accumulation value. The present application is not limited thereto. In another possible embodiment, the partial product accumulation unit 651 receives 64 bits at a time (in groups of 2 bits), the first multi-dimensional accumulation unit 655 generates a 16-dimensional multiply-accumulate result, and the second multi-dimensional accumulation unit 657 generates a 512-dimensional output accumulation value.

FIG. 7 is a timing diagram comparing the MAC operation flow of an embodiment of the present application with that of the prior art. As shown in FIG. 7, the input data is received during the input broadcasting time. Then, bit multiplication and bit accumulation are performed on the input data and the weight data as described above to generate the MAC operation result.

The prior art requires a longer operation time. In contrast, in the embodiment of the present application, parallel multiplication generates (1) the partial products of the input vector and the MSB vector of the weight data, and (2) the partial products of the input vector and the LSB vector of the weight data. In this way, an unsigned multiplication and/or a signed multiplication can be completed within one cycle. Therefore, the embodiment of the present application operates faster than the prior art.
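The reason the MSB and LSB partial products can be formed independently is a simple arithmetic identity. The sketch below checks it for unsigned 8-bit values only, assuming the weight is split into two 4-bit halves (consistent with the MSB vector W_i(7:4) mentioned earlier); the signed case and the hardware timing are not modeled, and the function name `split_multiply` is hypothetical.

```python
def split_multiply(x, w):
    """Multiply unsigned 8-bit x and w via independent MSB/LSB partial products."""
    w_msb = w >> 4      # upper 4 bits of the weight
    w_lsb = w & 0xF     # lower 4 bits of the weight
    # The two products do not depend on each other, so in hardware they
    # could be produced side by side; here they are simply computed in turn.
    msb_partial = x * w_msb
    lsb_partial = x * w_lsb
    # The MSB partial product carries a weight of 2**4.
    return (msb_partial << 4) + lsb_partial


result = split_multiply(200, 173)
```

Because w = 16 * w_msb + w_lsb, the recombined value always equals the direct product x * w.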

FIG. 8 shows an operation method of a memory device according to an embodiment of the present application, including: encoding an input data, transmitting an encoded input data to at least one page buffer, and reading out the encoded input data in parallel from the at least one page buffer (810); encoding a first part and a second part of a weight data into an encoded first part of the weight data and an encoded second part of the weight data, respectively, writing the encoded first part and the encoded second part into a plurality of memory cells of the memory device, and reading out the encoded first part and the encoded second part of the weight data in parallel (820); multiplying the encoded input data by the encoded first part and the encoded second part of the weight data, respectively, to generate a plurality of partial products in parallel (830); and accumulating the partial products to generate an operation result (840).
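Steps 810 to 840 can be sketched end to end for a single unsigned 8-bit input and weight. The bit layout used here is an assumption for illustration: each weight bit of significance 2^k within a 4-bit half is repeated 2^k times (15 bits per half), each input bit is repeated 15 times to line up with it, and the spare bit mentioned in the claims is omitted. All function names are hypothetical.

```python
def unary_encode_nibble(nibble):
    """Unary-style code: bit k of the 4-bit value is repeated 2**k times."""
    bits = []
    for k in range(3, -1, -1):
        bits += [(nibble >> k) & 1] * (1 << k)
    return bits  # 15 bits; the number of ones equals the nibble value


def encode_input(x, copies=15):
    """Step 810: repeat each of the 8 input bits (index = bit position)."""
    return [[(x >> i) & 1] * copies for i in range(8)]


def mac_one_dimension(x, w):
    enc_x = encode_input(x)                    # step 810
    enc_msb = unary_encode_nibble(w >> 4)      # step 820: first part (MSB)
    enc_lsb = unary_encode_nibble(w & 0xF)     # step 820: second part (LSB)
    total = 0
    for i, x_copies in enumerate(enc_x):
        # Step 830: bitwise AND, then count ones to get each partial product.
        msb_count = sum(a & b for a, b in zip(x_copies, enc_msb))
        lsb_count = sum(a & b for a, b in zip(x_copies, enc_lsb))
        # Step 840: weighted accumulation (MSB half x16, input bit i x 2**i).
        total += ((msb_count << 4) + lsb_count) << i
    return total


product = mac_one_dimension(37, 201)
```

Under this layout, multiplication reduces to bitwise AND followed by counting ones, so a single flipped bit changes a count by at most one rather than corrupting a high-order binary digit.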

As described above, in the embodiment of the present application, the bit-error-tolerant coding reduces error bits, improves accuracy and reduces the demand for memory capacity.

In addition, the digital MAC operation of an embodiment of the present application uses a high-bandwidth weighted accumulator to generate the output result. The high-bandwidth weighted accumulator can implement weighted accumulation by reusing the fail-bit counting circuit, which improves the accumulation speed.

The digital Hamming distance operation of an embodiment of the present application uses a high-bandwidth unweighted accumulator to generate the output result. The high-bandwidth unweighted accumulator can implement unweighted accumulation by reusing the fail-bit counting circuit, which improves the accumulation speed.

The above embodiments of the present application can be applied to NAND flash memory, or to memory devices that are sensitive to error bits, such as but not limited to NOR flash memory, phase-change memory (PCM), magnetic random access memory (magnetic RAM) or resistive RAM.

In the above embodiments, the accumulation circuit 630 receives 128 partial products from the page buffer 625. In other embodiments of the present application, the accumulation circuit 630 may receive 2, 4, 8, 16, ..., 512 partial products (a power of 2) from the page buffer 625, which is also within the spirit of the present application.

In the above embodiments, the accumulation circuit 630 supports addition. In other embodiments of the present application, the accumulation circuit 630 may support subtraction, which is also within the spirit of the present application.

Although the above embodiments are described using INT8 or UINT8 MAC operations as examples, other embodiments of the present application may also support INT2, UINT2, INT4 and UINT4 MAC operations, which is also within the spirit of the present application.

Although the weight data is divided into an MSB vector and an LSB vector (two vectors) in the above embodiments, the present application is not limited thereto. In other possible embodiments, the weight data may be divided into more vectors, which is also within the spirit of the present application.

The above embodiments of the present application can be applied to AI model designs that require MAC operations, such as but not limited to fully-connected layers, convolution layers, multilayer perceptrons and support vector machines.

The above embodiments can be applied not only to computing usage but also to similarity search, analysis usage, clustering analysis and the like.

In summary, although the present invention has been disclosed above by way of embodiments, the embodiments are not intended to limit the present invention. Those having ordinary knowledge in the technical field of the present invention can make various changes and modifications without departing from the spirit and scope of the present invention. Therefore, the scope of protection of the present invention shall be defined by the appended claims.

210-240: steps
600: memory device
615: memory die
620: memory plane
625: page buffer
630: accumulation circuit
631: sensing circuit
633-641: latch units
643, 645: logic gates
651: partial product accumulation unit
653: single-dimensional product generation unit
655: first multi-dimensional accumulation unit
657: second multi-dimensional accumulation unit
659: weight accumulation control unit
810-840: steps

FIG. 1A shows the multiplication of two unsigned numbers.
FIG. 1B shows the multiplication of two signed numbers.
FIG. 2 shows a flow chart of an operation method of a memory device according to an embodiment of the present application.
FIG. 3A and FIG. 3B show the error-bit-tolerant data encoding in an embodiment of the present application.
FIG. 4A shows an 8-bit unsigned multiplication in an embodiment of the present application, and FIG. 4B shows an 8-bit signed multiplication in an embodiment of the present application.
FIG. 5A shows a schematic diagram of an unsigned multiplication according to an embodiment of the present application.
FIG. 5B shows a schematic diagram of a signed multiplication according to an embodiment of the present application.
FIG. 6 shows a functional block diagram of a memory device according to an embodiment of the present application.
FIG. 7 shows a timing diagram comparing the MAC operation flow of an embodiment of the present application with that of the prior art.
FIG. 8 shows an operation method of a memory device according to an embodiment of the present application.

810-840: steps

Claims

A memory device, comprising: a plurality of memory dies, each of the memory dies including a plurality of memory planes, a plurality of page buffers and an accumulation circuit, and each of the memory planes including a plurality of memory cells, wherein: an input data is encoded, an encoded input data is transmitted to at least one page buffer, and the encoded input data is read out in parallel from the at least one page buffer; a first part and a second part of a weight data are encoded into an encoded first part of the weight data and an encoded second part of the weight data, respectively, the encoded first part and the encoded second part of the weight data are written into the memory cells of the memory device, and the encoded first part and the encoded second part of the weight data are read out in parallel; the encoded input data is multiplied by the encoded first part of the weight data and the encoded second part of the weight data, respectively, to generate a plurality of partial products in parallel; and the partial products are accumulated to generate an operation result.

The memory device as claimed in claim 1, wherein the first part of the weight data is a most significant bit (MSB) part, and the second part of the weight data is a least significant bit (LSB) part.

The memory device as claimed in claim 1, wherein, during encoding: the input data and the weight data are each quantized into a binary integer vector; each bit of the input data is duplicated a plurality of times and a spare bit is added; the weight data is separated into the first part and the second part; and each bit of the first part and the second part of the weight data is represented by a unary code, so as to obtain the encoded first part of the weight data and the encoded second part of the weight data.

The memory device as claimed in claim 1, wherein the operation result includes a multiply-and-accumulate (MAC) operation result or a Hamming distance operation result; the partial products belonging to the same dimension are accumulated to obtain a single-dimensional product; a plurality of single-dimensional products are accumulated to obtain a multi-dimensional multiply-accumulate result; and a plurality of multi-dimensional multiply-accumulate results are accumulated to generate the operation result.

The memory device as claimed in claim 4, wherein, when the multiply-accumulate operation is performed, a logical AND operation is performed on each bit of the encoded input data and each bit of the encoded first part of the weight data; and when the Hamming distance operation is performed, a logical exclusive OR operation is performed on each bit of the encoded input data and each bit of the encoded first part of the weight data.

A method of operating a memory device, comprising: encoding an input data, transmitting an encoded input data to at least one page buffer, and reading out the encoded input data in parallel from the at least one page buffer; encoding a first part and a second part of a weight data into an encoded first part of the weight data and an encoded second part of the weight data, respectively, writing the encoded first part and the encoded second part of the weight data into a plurality of memory cells of the memory device, and reading out the encoded first part and the encoded second part of the weight data in parallel; multiplying the encoded input data by the encoded first part of the weight data and the encoded second part of the weight data, respectively, to generate a plurality of partial products in parallel; and accumulating the partial products to generate an operation result.

The method of operating a memory device according to claim 6, wherein the first part of the weight data is a most significant bit (MSB) part, and the second part of the weight data is a least significant bit (LSB) part.

The method of operating a memory device according to claim 6, wherein, during encoding: the input data and the weight data are each quantized into a binary integer vector; each bit of the input data is duplicated a plurality of times and a spare bit is added; the weight data is separated into the first part and the second part; and each bit of the first part and the second part of the weight data is represented by a unary code, so as to obtain the encoded first part of the weight data and the encoded second part of the weight data.

The method of operating a memory device according to claim 6, wherein the operation result includes a multiply-and-accumulate (MAC) operation result or a Hamming distance operation result; the partial products belonging to the same dimension are accumulated to obtain a single-dimensional product; a plurality of single-dimensional products are accumulated to obtain a multi-dimensional multiply-accumulate result; and a plurality of multi-dimensional multiply-accumulate results are accumulated to generate the operation result.

The method of operating a memory device according to claim 9, wherein, when the multiply-accumulate operation is performed, a logical AND operation is performed on each bit of the encoded input data and each bit of the encoded first part of the weight data; and when the Hamming distance operation is performed, a logical exclusive OR operation is performed on each bit of the encoded input data and each bit of the encoded first part of the weight data.