CN115879530A - Method for optimizing array structure of RRAM (resistive random access memory) memory computing system - Google Patents
- Publication number
- CN115879530A (application CN202310186971.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- rram
- array
- quantization
- calculation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y — General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
- Y02 — Technologies or applications for mitigation or adaptation against climate change
- Y02D — Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a method for optimizing the array structure of an RRAM (resistive random access memory) in-memory computing system. Corresponding formulas of a post-training quantization algorithm are used to optimize the array structure of the RRAM-based in-memory computing system, reducing the array area and the system power consumption while maintaining computing accuracy and precision. The beneficial effects of the invention are: the method applies to multiple neural networks such as the multilayer perceptron and the convolutional neural network; by halving the 1T1R array scale under identical computing conditions, it effectively reduces system area, lowers system energy consumption and improves computing efficiency, and given that the RRAM device fabrication process is not yet mature, it is better suited to commercial deployment. When the number of convolution kernels in a CNN convolutional layer grows, the array scale is half that of the conventional technique while the extra X_Z·W_Q and X_Q·W_Z terms require an unchanged number of multipliers, so the overall performance advantage of the system is very significant.
Description
Technical Field
The invention relates to the technical field of in-memory computing, and in particular to a method for optimizing the array structure of an RRAM in-memory computing system.
Background
With the rapid development of science and technology, the era of big data has arrived; all kinds of information data grow explosively, placing higher demands on storage and computing technology. Traditional computers adopt the von Neumann architecture, in which memory and processor are independent of each other. Data exchange between the two is frequent, but the exchange channel is narrow and power-hungry, forming a "memory wall" between computation and storage that severely limits the performance of advanced processors. It is therefore important to develop new storage-and-computing systems: the in-memory computing architecture proposed in recent years merges storage and computation, effectively reducing the load of data transfer, lowering the energy cost of data computation and improving the efficiency of information processing.
In-memory computing techniques generally exploit the physical and electrical characteristics of non-volatile memory to perform computation directly inside the memory while preserving non-volatile storage. This avoids the high-frequency data exchange between storage and computation, breaking through the memory-wall limitation and greatly improving data-processing capability. Among non-volatile memories, resistive random access memory (RRAM) has attracted attention for its simple structure, fast read/write, low power consumption and good CMOS process compatibility. An RRAM cell switches resistance state under a specific voltage stimulus; exploiting this, high and low voltage levels represent digital 1 and 0, the device's high and low resistance states represent 0 and 1, and by Ohm's law the current through the device can be collected and quantized to yield a digital multiplication result. Extended to a crossbar array structure, RRAM can perform multiply-accumulate (MAC) operations and hence matrix operations. The matrix storage-and-computing capability of RRAM arrays fits the compute-intensive requirements of neural networks very well, giving RRAM broad application prospects in the field of neural-network accelerators.
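The crossbar MAC principle described above can be sketched in a few lines of idealized Python (an illustration, not part of the patent: perfect devices, binary conductances, no wire resistance or read noise):

```python
# Illustrative sketch: an ideal crossbar computes a matrix-vector product
# because each column current is the sum of V_i * G_ij
# (Ohm's law for each cell, Kirchhoff's current law along the column).
import numpy as np

def crossbar_mac(voltages, conductances):
    """Column currents of an ideal crossbar: I_j = sum_i V_i * G_ij."""
    return voltages @ conductances

# Binary example: low/high resistance states encode weight bits 1/0.
G = np.array([[1, 0],
              [0, 1],
              [1, 1]], dtype=float)   # 3 rows (inputs) x 2 columns (outputs)
V = np.array([1.0, 0.0, 1.0])         # read voltages encode input bits
print(crossbar_mac(V, G))             # column currents = dot products
```

In hardware the column currents would then be digitized by ADCs; here the floating-point dot product stands in for that step.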
However, in current RRAM-based in-memory-computing neural-network accelerators, a 1T1R array can store only unsigned numbers when holding the weight matrix, so two device structures are usually combined to represent signed numbers, i.e. the positive and negative weights of a neural network. Three treatments are common:
1. In the 2T2R structure array shown in FIG. 1, two 1T1R cells form a positive/negative pair that together stores a signed number, thus representing positive and negative weight values; equal-magnitude positive and negative voltage pulses are applied to the pair, and the accumulated current collected at the end of the column is quantized to obtain the final multiply-accumulate result;
2. In the 1T1R positive/negative-row structure array shown in FIG. 2, all weights of the weight matrix are mapped to two 1T1R conductance rows: one row holds the positive weights and receives a positive pulse input, the other holds the negative weights and receives an equal-magnitude negative pulse input; after the encoded pulses are applied to the bit lines, the two rows' accumulated output currents are collected and subtracted to obtain the computation result;
3. In the 1T1R positive/negative dual-array structure shown in FIG. 3, two 1T1R arrays are built to store the positive and negative weights separately; equal-magnitude voltage signals are input, and the two partial results are finally subtracted to obtain the final computation result.
Because RRAM device fabrication technology is not yet mature, building large-scale arrays still faces many challenges, and the three array structures above are also adopted to suppress crosstalk; however, representing signed numbers this way requires double the 1T1R resources, greatly increasing the area and energy consumption of the in-memory computing system.
Disclosure of Invention
The present invention aims to provide a method for optimizing the array structure of an RRAM in-memory computing system, so as to solve the problems raised in the background art.

To achieve the above object, the present invention provides the following technical solution: a method for optimizing the array structure of an RRAM in-memory computing system, comprising the following steps:
Step one: perform post-training quantization on the image data and the neural-network weight data; through the quantization formula, compute the image quantized data X_Q, the image zero offset X_Z, the weight quantized data W_Q and the weight zero offset W_Z. X_Z and W_Z are fixed once quantization is complete, and the value X_Z·W_Z is computed in software;
Step two: the signed-number calculation formula of neural-network forward propagation is

Y = X·W = (X_Q − X_Z)·(W_Q − W_Z)
Expanding it gives Y = X_Q·W_Q − X_Q·W_Z − X_Z·W_Q + X_Z·W_Z. The positive-integer terms X_Q·W_Z, X_Z·W_Q and X_Q·W_Q are computed with adders, multipliers and the RRAM array circuit respectively, and finally the value X_Z·W_Z computed in step one is substituted into the expansion to obtain Y. Through this split computation the RRAM array stores and computes only positive integers, avoiding the excessive RRAM device consumption that direct computation of signed numbers would cause;
Step three: store the Y-value results in a buffer and perform the subsequent activation-function, quantization and other operations; when processing is finished, the complete feature-map data are obtained and serve as the input of the next network layer.
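The split computation of steps one and two can be checked numerically. The sketch below uses hypothetical uint8 data (the names X_Q, W_Q, X_Z, W_Z follow the patent's formula; the values are made up) to verify that the four-term expansion equals the direct signed product:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical quantized data and per-tensor zero offsets.
XQ = rng.integers(0, 256, size=(4, 9)).astype(np.int64)
WQ = rng.integers(0, 256, size=(9, 3)).astype(np.int64)
XZ, WZ = 128, 120                     # fixed after quantization

# Direct signed computation: Y = (X_Q - X_Z)(W_Q - W_Z).
Y_direct = (XQ - XZ) @ (WQ - WZ)

# Split computation: only the positive-integer term X_Q W_Q would go to
# the RRAM array; the other terms need only adders and multipliers.
n = XQ.shape[1]                       # inner dimension of the matmul
Y_split = (XQ @ WQ                                    # array term X_Q W_Q
           - WZ * XQ.sum(axis=1, keepdims=True)       # X_Q W_Z: row sums * W_Z
           - XZ * WQ.sum(axis=0, keepdims=True)       # X_Z W_Q: column sums * X_Z
           + n * XZ * WZ)                             # constant X_Z W_Z term
assert np.array_equal(Y_direct, Y_split)
```

Note how X_Q·W_Z reduces to a row sum times a scalar and X_Z·W_Q to a column sum times a scalar, which is why cheap adders and a multiplier suffice for them.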
As a preferred technical solution of the present invention, the quantization formula in step one is: Q = round(R/S) + Z,
where R is the original data value, Q is the quantized data value, S is the scale factor, representing the proportional relation between the original and quantized data, and Z is the zero offset, the integer to which 0 in the original data maps after quantization;
setting the quantization precision as n bits, randomly extracting part of test set data, calculating a quantization scale factor S and a zero offset Z in each layer of network according to the following formula, and substituting the quantization scale factor S and the zero offset Z into the formula to obtain a quantized data value;
where R_max and R_min, Q_max and Q_min are the maximum and minimum of the original data values and of the quantized data values respectively; Q_max and Q_min are determined by the quantization precision n, while R_max and R_min are determined by the randomly drawn sample data.
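A minimal calibration sketch, assuming the common asymmetric-quantization convention Q = round(R/S) + Z (the patent's own formula images are not reproduced in this text, so the exact convention is an assumption):

```python
import numpy as np

def calibrate(samples, n_bits=8):
    """Compute scale S and zero offset Z from sample data
    (common asymmetric uint quantization: Q in [0, 2^n - 1])."""
    q_min, q_max = 0, 2**n_bits - 1
    r_min, r_max = float(samples.min()), float(samples.max())
    S = (r_max - r_min) / (q_max - q_min)
    Z = int(round(q_min - r_min / S))
    return S, Z

def quantize(r, S, Z, n_bits=8):
    q = np.round(r / S) + Z
    return np.clip(q, 0, 2**n_bits - 1).astype(np.int64)

x = np.linspace(-1.0, 3.0, 11)        # made-up calibration samples
S, Z = calibrate(x)
# 0 in the original data maps to the zero offset Z, as the text defines.
assert quantize(np.array([0.0]), S, Z)[0] == Z
```

With these samples, S = 4/255 and Z = 64; any real value is then stored as an unsigned integer, which is what lets the 1T1R array hold only non-negative weights.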
As a preferred technical solution of the present invention, in step two the RRAM uses devices with only two stable states, high resistance and low resistance, each forming a 1T1R structure with an NMOS transistor, from which a 1T1R crossbar array is built; n 1T1R structures in a row form one n-bit weight value, and the corresponding shift weighting is applied when the output current is collected to obtain the computation result.
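The row-of-n-cells encoding can be illustrated as follows (a sketch under ideal-device assumptions, not the patent's circuit): each 8-bit weight is split into binary bit planes, one 1T1R cell per bit, and the per-bit dot products are recombined by shift weighting:

```python
import numpy as np

def bit_slice(w, n_bits=8):
    """Split unsigned weights into binary planes (one 1T1R cell per bit)."""
    return [(w >> b) & 1 for b in range(n_bits)]   # LSB-first bit planes

def shift_weighted_sum(x, planes):
    """Collect the per-bit column results, then shift-weight:
    sum_b 2^b * (x . w_b), which reconstructs x . w exactly."""
    return sum(int(x @ plane) << b for b, plane in enumerate(planes))

rng = np.random.default_rng(2)
w = rng.integers(0, 256, size=9)      # one quantized 3x3 kernel column
x = rng.integers(0, 256, size=9)      # one input window
assert shift_weighted_sum(x, bit_slice(w)) == int(x @ w)
```

Because w = Σ_b 2^b·w_b, the shift-weighted recombination is exact; in hardware each plane's dot product is a quantized column current.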
As a preferred embodiment of the present invention, the weight data W_Q are written row by row into the 1T1R array for storage under the control of a peripheral digital logic circuit; the quantization parameter X_Z is then converted into a read-voltage signal and input to the 1T1R array, and the output current is collected, quantized and shift-weighted to obtain the computation result X_Z·W_Q.
As a preferred technical solution of the present invention, a digital control circuit performs the scheduling: the corresponding picture data are taken in turn from the buffer, summed, and multiplied by the corresponding W_Z to obtain X_Q·W_Z; meanwhile, the picture data are converted into voltage signals and input at the row heads of the 1T1R array, the output currents are read at the column ends, and X_Q·W_Q is obtained after quantization and shift weighting.
Compared with the prior art, the beneficial effects of the invention are: under identical computing conditions the invention greatly reduces the number of RRAM devices, and given that the RRAM fabrication process is not yet mature, it is better suited to commercial deployment; through the split computation the RRAM array stores and computes only positive integers, avoiding the excessive RRAM device consumption that direct computation of signed numbers would cause; halving the 1T1R array scale effectively reduces system area, lowers system energy consumption and improves computing efficiency; the invention applies to multiple neural networks such as the multilayer perceptron and the convolutional neural network, and because the local receptive field of a CNN convolutional layer is small, the X_Q·W_Z term needs few adders, so the advantage is more pronounced in CNNs; when the number of convolution kernels in a CNN convolutional layer grows, the array scale remains half that of the conventional technique while the number of multipliers added for the extra X_Z·W_Q and X_Q·W_Z terms stays unchanged, so for large CNNs the impact of the introduced computing units is limited and the overall performance advantage of the system is very significant.
Drawings
FIG. 1 is a diagram of a 2T2R structure representing positive and negative weights in the background art;
FIG. 2 is a diagram showing the structure of positive and negative rows of 1T1R in the background art to represent positive and negative weights;
FIG. 3 is a diagram showing the positive and negative weights of a 1T1R positive and negative double array structure in the background art;
FIG. 4 shows the optimized 1T1R single array representing quantized weights according to the present invention;
FIG. 5 is a diagram of the computation architecture optimized with the PTQ formulas according to the present invention;
FIG. 6 is a diagram of the single-layer network accelerator architecture based on a 1T1R array according to the present invention.
Detailed Description of the Preferred Embodiments
Example 1
As shown in FIGS. 4 to 6, the invention discloses a method for optimizing the array structure of an RRAM-based in-memory computing system. It mainly uses the corresponding formulas of a post-training quantization algorithm to optimize the array structure, reducing the array area and the system power consumption while maintaining computing accuracy and precision, and providing a reliable solution for accelerating large-scale neural-network computation.
Quantization converts the 32-bit floating-point numbers in a neural network to 8-bit or other low-bit fixed-point numbers, greatly reducing the computational cost and helping deploy the network on edge intelligent devices with strict power and latency requirements. The post-training quantization (PTQ) algorithm used by this invention obtains the network's quantization parameters without retraining (i.e., without updating the network weights). Taking a convolutional neural network (CNN) as an example, after the network is trained normally, the data produced by each convolution, pooling and fully-connected layer are quantized in the inference stage, effectively saving storage and computation cost.
The quantization formula is Q = round(R/S) + Z, where R is the original data value, Q is the quantized data value, S is the scale, representing the proportional relation between the original and quantized data, and Z is the zero offset, the integer to which 0 in the original data maps after quantization.
The calibration formulas are S = (R_max − R_min)/(Q_max − Q_min) and Z = round(Q_min − R_min/S), where Q_max and Q_min are determined by the quantization bit precision, and R_max and R_min are determined by the randomly drawn sample data.
Thus all the quantization parameters needed for post-training quantization can be determined after training and before inference; during inference the data are quantized simply by substituting the corresponding parameters into the quantization formula.
The invention mainly uses the quantization formulas to optimize the array structure. Taking the convolution computation of the first CNN convolutional layer as an example, the actual computation formula after quantizing the original picture data and the convolution kernels is:

Y = X·W = (X_Q − X_Z)·(W_Q − W_Z),

and expanding it gives:

Y = X_Q·W_Q − X_Q·W_Z − X_Z·W_Q + X_Z·W_Z,

where X_Z and W_Z are fixed values.
During the entire convolution computation, X_Z·W_Z is therefore a known, fixed value, and X_Z·W_Q needs to be computed only once, after the 1T1R array weights are initialized and before the convolution sliding begins; at each sliding step only the two terms X_Q·W_Z and X_Q·W_Q actually need to be computed. The X_Q·W_Z term is the sum of the picture input data in each sliding window multiplied by W_Z; since a convolution kernel is typically 3×3 or 5×5, only a few adders and multipliers are required. The X_Q·W_Q term is the core of the computation: the quantized, positive convolution-kernel weight matrix W_Q is stored in the 1T1R array, and in each convolution sliding step the corresponding feature-map data X_Q are input, realizing the in-memory matrix multiplication. The structure of this computation scheme is shown in FIG. 5. In summary, the invention trades a few additional multipliers and adders for a large reduction in the 1T1R array resources of current common techniques. In particular, the number of additional adders and multipliers depends only on the size of the convolution kernel, not on the number of kernels. When the number of convolution kernels in a CNN is very large, the required array scale grows greatly while the number of additional adders and multipliers stays unchanged, so the array-scale advantage of this technique becomes even more pronounced, making it well suited to accelerating large-scale neural networks.
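As a numerical check of the claim that the extra arithmetic depends only on the kernel size, the sketch below (hypothetical uint8 data; the 200-kernel 3×3 configuration is the patent's own example) computes one sliding-window step both ways — note that the X_Q·W_Z correction is a single scalar shared by all 200 output channels:

```python
import numpy as np

rng = np.random.default_rng(1)
k, n_kernels = 3, 200                  # 3x3 kernels, 200 of them
XQ_win = rng.integers(0, 256, size=k * k)           # one 9-element window
WQ = rng.integers(0, 256, size=(k * k, n_kernels))  # flattened kernels as columns
XZ, WZ = 128, 120                                   # hypothetical zero offsets

# X_Q W_Z: sum the 9 window inputs once, multiply once by W_Z --
# a scalar shared by all 200 columns, independent of n_kernels.
xqwz = XQ_win.sum() * WZ

# X_Z W_Q: computed once per channel after array initialization.
xzwq = XZ * WQ.sum(axis=0)             # shape (200,)

Y = XQ_win @ WQ - xqwz - xzwq + k * k * XZ * WZ
Y_ref = (XQ_win - XZ) @ (WQ - WZ)      # direct signed reference
assert np.array_equal(Y, Y_ref)
```

Doubling `n_kernels` doubles the array columns but leaves the window-sum adder tree and the single W_Z multiplier untouched, which is the scaling argument made above.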
Taking the first convolutional layer as an example and assuming 200 convolution kernels of size 3×3, the specific implementation steps of the invention are as follows:
1. Design and train the neural-network system. Set the quantization bit width (for example, 8 bits), randomly draw part of the test-set data, and compute the quantization scale factor S and zero offset Z of each network layer;
2. Using the quantization formula, quantize the original picture data, the weights of all network layers and the intermediate results of each layer to obtain the quantized data and parameters X_Q, W_Q, X_Z and W_Z of every network layer, and compute X_Z·W_Z for each layer;
3. The RRAM uses devices with two stable states, high resistance and low resistance, each forming a 1T1R structure with an NMOS transistor. Because each RRAM cell here can only represent the binary logic values "0" and "1", a row of eight 1T1R structures is needed to form one 8-bit weight value, and a corresponding shift weighting must be applied when the output current is collected to obtain the actual computation result. The convolution kernels are handled with the conventional technique: each kernel is unrolled row by row into a column vector and stored in a column of 1T1R structures; the column-end output current is collected and quantized to give a single convolution result. Building the 1T1R array by this logic, the array for 200 kernels of size 3×3 has size 9×1600 (this technique needs only 14,400 1T1R structures, whereas the conventional technique needs 28,800 because of the negative-weight problem);
4. Under the control of the peripheral digital logic circuit, write the quantized convolution-kernel weight data of step 2 into the array row by row for storage; then convert the quantization parameter X_Z into a read-voltage signal, input it to the 1T1R array, collect the output current, and obtain the computation result X_Z·W_Q after quantization and shift weighting;
5. Store the original picture data in a buffer and control the sliding operation through the digital control circuit. According to the convolution sliding settings, take the 9 picture data of the current sliding window from the buffer in turn, sum them, and multiply by the corresponding W_Z to obtain X_Q·W_Z. Meanwhile, convert the 9 picture data into read voltages, input them at the array row heads, read the output currents, and obtain X_Q·W_Q after quantization and shift weighting;
6. Connect a digital logic circuit at the array output to compute the final convolution result:

Y = X_Q·W_Q − X_Q·W_Z − X_Z·W_Q + X_Z·W_Z
Store each single convolution result in a register and perform the subsequent activation-function and pooling operations; after the convolution sliding computation finishes, the complete feature-map data are obtained as the input of the next network layer.
The steps above describe the computation of the first CNN convolutional layer; the other network layers operate similarly, and each layer's quantized output serves as the next layer's input. As can be seen from FIG. 6, the 1T1R array of the invention stores only the quantized positive weight values and, for the same weights, needs only half the number of 1T1R cells of the conventional technique. Although additional multipliers and adders are introduced, few are needed: in the example above only 1 multiplier and 3 + 200 + 9 adders are added, while 14,400 1T1R cells are saved. Considering that the conventional CMOS process is far more mature than the RRAM process, the technical advantage of the invention is very clear.
Although the present invention has been described in detail with reference to specific embodiments, it is not limited to the above embodiments; various changes may be made within the knowledge of those skilled in the art without departing from the gist of the invention.
Claims (5)
1. A method for optimizing the array structure of an RRAM (resistive random access memory) in-memory computing system, characterized by comprising the following steps:
Step one: perform post-training quantization on the image data and the neural-network weight data; through the quantization formula, compute the image quantized data X_Q, the image zero offset X_Z, the weight quantized data W_Q and the weight zero offset W_Z. X_Z and W_Z are fixed once quantization is complete, and the value X_Z·W_Z is computed in software;
Step two: the signed-number calculation formula of neural-network forward propagation is:

Y = X·W = (X_Q − X_Z)·(W_Q − W_Z)
Expanding it gives Y = X_Q·W_Q − X_Q·W_Z − X_Z·W_Q + X_Z·W_Z. The positive-integer terms X_Q·W_Z, X_Z·W_Q and X_Q·W_Q are computed with adders, multipliers and the RRAM array circuit respectively, and finally the value X_Z·W_Z computed in step one is substituted into the expansion to obtain Y. Through this split computation the RRAM array stores and computes only positive integers, avoiding the excessive RRAM device consumption that direct computation of signed numbers would cause;
Step three: store the Y-value results in a buffer and perform the subsequent activation-function, quantization and other operations; when processing is finished, the complete feature-map data are obtained and serve as the input of the next network layer.
2. The method for optimizing the array structure of an RRAM in-memory computing system of claim 1, wherein the quantization formula in step one is: Q = round(R/S) + Z,
where R is the original data value, Q is the quantized data value, S is the scale factor, representing the proportional relation between the original and quantized data, and Z is the zero offset, the integer to which 0 in the original data maps after quantization;
setting the quantization precision as n bits, randomly extracting part of test set data, calculating a quantization scale factor S and a zero offset Z in each layer of network according to the following formula, and substituting the quantization scale factor S and the zero offset Z into the formula to obtain a quantized data value;
where R_max and R_min, Q_max and Q_min are the maximum and minimum of the original data values and of the quantized data values respectively; Q_max and Q_min are determined by the quantization precision n, while R_max and R_min are determined by the randomly drawn sample data.
3. The method for optimizing the array structure of an RRAM in-memory computing system of claim 1, wherein in step two the RRAM uses devices with only two stable states, high resistance and low resistance, each forming a 1T1R structure with an NMOS transistor, from which a 1T1R crossbar array is built; n 1T1R structures in a row form one n-bit weight value, and the corresponding shift weighting is applied when the output current is collected to obtain the computation result.
4. The method of claim 3, wherein the weight data W_Q are written row by row into the 1T1R array for storage under the control of a peripheral digital logic circuit; the quantization parameter X_Z is then converted into a read-voltage signal and input to the 1T1R array, and the output current is collected, quantized and shift-weighted to obtain the computation result X_Z·W_Q.
5. The method of claim 3, wherein a digital control circuit performs the scheduling: the corresponding picture data are taken in turn from the buffer, summed, and multiplied by the corresponding W_Z to obtain X_Q·W_Z; meanwhile, the picture data are converted into voltage signals and input at the row heads of the 1T1R array, the output currents are read at the column ends, and X_Q·W_Q is obtained after quantization and shift weighting.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310186971.9A (granted as CN115879530B) | 2023-03-02 | 2023-03-02 | Method for optimizing the array structure of an RRAM (resistive random access memory) in-memory computing system |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN115879530A | 2023-03-31 |
| CN115879530B | 2023-05-05 |
Family

ID=85761720

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310186971.9A (CN115879530B, active) | Method for optimizing the array structure of an RRAM (resistive random access memory) in-memory computing system | 2023-03-02 | 2023-03-02 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN115879530B |
Cited By (1)

| Publication number | Priority date | Publication date | Title |
|---|---|---|---|
| CN116561050A | 2023-04-07 | 2023-08-08 | Fine-grained mapping method and device for an RRAM (resistive random access memory) compute-in-memory integrated chip |
Patent Citations (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109146070A (en) * | 2017-06-16 | 2019-01-04 | Huawei Technologies Co., Ltd. | Peripheral circuit and system supporting RRAM-based neural network training |
US20200098428A1 (en) * | 2018-09-20 | 2020-03-26 | University Of Utah Research Foundation | Digital RRAM-based convolutional block |
CN109472353A (en) * | 2018-11-22 | 2019-03-15 | Jinan Inspur Hi-Tech Investment and Development Co., Ltd. | Convolutional neural network sampling circuit and quantization method |
CN109993297A (en) * | 2019-04-02 | 2019-07-09 | Nanjing Jixiang Sensing and Imaging Technology Research Institute Co., Ltd. | Load-balanced sparse convolutional neural network accelerator and acceleration method |
CN110378468A (en) * | 2019-07-08 | 2019-10-25 | Zhejiang University | Neural network accelerator based on structured pruning and low-bit quantization |
CN110569962A (en) * | 2019-08-08 | 2019-12-13 | Huazhong University of Science and Technology | Convolution calculation accelerator based on a 1T1R memory array and operation method thereof |
CN110427171A (en) * | 2019-08-09 | 2019-11-08 | Fudan University | Scalable in-memory computing structure and method for fixed-point matrix multiply-accumulate operations |
CN110647983A (en) * | 2019-09-30 | 2020-01-03 | Nanjing University | Self-supervised learning acceleration system and method based on a computing-in-memory device array |
CN110852429A (en) * | 2019-10-28 | 2020-02-28 | Huazhong University of Science and Technology | Convolutional neural network based on 1T1R and operation method thereof |
CN113065632A (en) * | 2019-12-16 | 2021-07-02 | Samsung Electronics Co., Ltd. | Method and apparatus for validating training of neural networks for image recognition |
US20210200513A1 (en) * | 2019-12-30 | 2021-07-01 | Samsung Electronics Co., Ltd. | Method and apparatus with floating point processing |
CN113126953A (en) * | 2019-12-30 | 2021-07-16 | Samsung Electronics Co., Ltd. | Method and apparatus for floating point processing |
CN111242289A (en) * | 2020-01-19 | 2020-06-05 | Tsinghua University | Scalable convolutional neural network acceleration system and method |
CN111832719A (en) * | 2020-07-28 | 2020-10-27 | University of Electronic Science and Technology of China | Computation circuit for a fixed-point quantized convolutional neural network accelerator |
CN111738427A (en) * | 2020-08-14 | 2020-10-02 | University of Electronic Science and Technology of China | Operation circuit of a neural network |
CN112183739A (en) * | 2020-11-02 | 2021-01-05 | University of Science and Technology of China | Hardware architecture of a memristor-based low-power spiking convolutional neural network |
CN112633477A (en) * | 2020-12-28 | 2021-04-09 | University of Electronic Science and Technology of China | Quantized neural network acceleration method based on a field-programmable gate array |
CN113033794A (en) * | 2021-03-29 | 2021-06-25 | Chongqing University | Lightweight neural network hardware accelerator based on depthwise separable convolution |
CN113762491A (en) * | 2021-08-10 | 2021-12-07 | Nanjing Tech University | FPGA-based convolutional neural network accelerator |
CN113705803A (en) * | 2021-08-31 | 2021-11-26 | Nanjing University | Hardware image recognition system based on a convolutional neural network, and deployment method |
Non-Patent Citations (2)
Title |
---|
PENG YAO et al.: "Fully hardware-implemented memristor convolutional neural network", Nature *
JI Yuan et al.: "Stochastic logic with a two-dimensional state transition structure and its application in neural networks", Journal of Electronics & Information Technology *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116561050A (en) * | 2023-04-07 | 2023-08-08 | Tsinghua University | Fine-grained mapping method and device for an RRAM (resistive random access memory) integrated chip |
Also Published As
Publication number | Publication date |
---|---|
CN115879530B (en) | 2023-05-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109063825B (en) | Convolutional neural network accelerator | |
CN107169563B (en) | Processing system and method applied to binary-weight convolutional networks | |
Wang et al. | Low power convolutional neural networks on a chip | |
CN108665063B (en) | Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator | |
CN107256424B (en) | Ternary-weight convolutional network processing system and method | |
WO2022037257A1 (en) | Convolution calculation engine, artificial intelligence chip, and data processing method | |
CN113222133B (en) | FPGA-based compressed LSTM accelerator and acceleration method | |
CN111652360B (en) | Convolution operation device based on a systolic array | |
CN115879530B (en) | Method for optimizing array structure of RRAM (resistive random access memory) memory computing system | |
CN112636745B (en) | Logic unit, adder and multiplier | |
CN115423081A (en) | FPGA-based neural network accelerator for the CNN_LSTM algorithm | |
KR20220114519A (en) | Quantum error correction decoding system and method, fault-tolerant quantum error correction system and chip | |
WO2023116923A1 (en) | Computing-in-memory device and computing method | |
CN111931925A (en) | FPGA-based binary neural network acceleration system | |
CN113762493A (en) | Neural network model compression method and device, acceleration unit and computing system | |
Shahshahani et al. | Memory optimization techniques for FPGA-based CNN implementations | |
TWI737228B (en) | Quantization method based on hardware of in-memory computing and system thereof | |
KR20230084449A (en) | Neural processing unit | |
Guan et al. | Recursive binary neural network training model for efficient usage of on-chip memory | |
WO2022062391A1 (en) | System and method for accelerating rnn network, and storage medium | |
CN111627479B (en) | Coding type flash memory device, system and coding method | |
CN113378115B (en) | Near-memory sparse vector multiplier based on magnetic random access memory | |
CN115495152A (en) | Memory computing circuit with variable length input | |
US20230047364A1 (en) | Partial sum management and reconfigurable systolic flow architectures for in-memory computation | |
CN115222028A (en) | One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||