CN115879530B - Array structure optimization method for an RRAM-oriented in-memory computing system - Google Patents

Array structure optimization method for an RRAM-oriented in-memory computing system

Info

Publication number
CN115879530B
Authority
CN
China
Prior art keywords
quantization
array
data
rram
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310186971.9A
Other languages
Chinese (zh)
Other versions
CN115879530A (en)
Inventor
王浩
郑精
吕琳
汪汉斌
万厚钊
马国坤
袁晓旭
高浩浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University
Hubei Jiangcheng Laboratory
Original Assignee
Hubei University
Hubei Jiangcheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University, Hubei Jiangcheng Laboratory filed Critical Hubei University
Priority to CN202310186971.9A priority Critical patent/CN115879530B/en
Publication of CN115879530A publication Critical patent/CN115879530A/en
Application granted granted Critical
Publication of CN115879530B publication Critical patent/CN115879530B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method for optimizing the array structure of an RRAM-oriented in-memory computing system, which mainly uses the corresponding formulas of a post-training quantization algorithm to optimize the array structure of an RRAM-based in-memory computing system, reducing the array area and the system power consumption while preserving the computing accuracy and precision. The beneficial effects of the invention are as follows: the invention is applicable to multiple neural networks such as the multilayer perceptron and the convolutional neural network; under the same calculation conditions, halving the 1T1R array scale effectively reduces the system area, lowers the system energy consumption and improves the computing efficiency, and, given that the RRAM device fabrication process is not yet mature, makes the system better suited to commercial deployment; as the number of convolution kernels of a CNN convolutional layer increases, the array scale of the invention remains half that of the conventional techniques while the number of additional adders and multipliers used to compute X_Z·W_Q and X_Q·W_Z stays unchanged, so the overall performance advantage of the system is quite remarkable.

Description

Array structure optimization method for an RRAM-oriented in-memory computing system
Technical Field
The invention relates to the technical field of in-memory computing, and in particular to a method for optimizing the array structure of an RRAM in-memory computing system.
Background
With the rapid development of science and technology we have entered the "big data" era, and information data of all kinds are growing explosively, which places higher demands on storage and computing technology. Traditional computers adopt the von Neumann architecture, in which the memory and the processor are independent of each other. Data exchange between the two is frequent, but the exchange channel is narrow and power-hungry, forming a "memory wall" between computation and storage that greatly limits the performance of advanced processors. Developing new memory-computing systems is therefore important; the recently proposed in-memory computing architecture combines storage and computation, which can effectively reduce the data-transfer load, lower the energy consumed by computation, and improve information-processing efficiency.
In-memory computing technology typically exploits the physical and electrical characteristics of non-volatile memory to perform computation directly inside the memory while preserving non-volatile storage. This avoids the high-frequency data exchange between storage and computation, thereby breaking through the memory-wall limit and greatly improving data-processing capability. Among non-volatile memories, resistive random-access memory (RRAM) has attracted wide attention for its simple structure, fast read/write, low power consumption and good CMOS-process compatibility. An RRAM device switches resistance states under specific voltage excitation; using this electrical property, high and low voltage levels represent the digits 1 and 0, the high- and low-resistance states of the device represent the digits 0 and 1, and by Ohm's law the current flowing through the device can be collected and quantized to obtain the result of a digital multiplication. When RRAM devices are extended into a crossbar array structure, multiply-accumulate (MAC) operations and hence matrix operations can be performed. The matrix storage-and-compute capability of the RRAM array fits the compute-intensive requirements of neural networks very well, so it has broad application prospects in the field of neural-network accelerators.
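As an illustration of the crossbar principle just described, the following minimal sketch (not taken from the patent; the conductance values, voltage encoding and array size are assumptions) shows how column currents obtained from Ohm's law and Kirchhoff's current law amount to a vector-matrix product:

```python
# Minimal sketch of crossbar multiply-accumulate via Ohm's law (illustrative values only).
import numpy as np

G = np.array([[1.0, 0.0, 1.0],   # device conductances (arbitrary units);
              [0.0, 1.0, 1.0],   # low resistance acts as "1", high resistance as "0"
              [1.0, 1.0, 0.0]])

V = np.array([1.0, 0.0, 1.0])    # read voltages on the rows encode the input bits

# Each column accumulates the currents of its devices: I_j = sum_i V_i * G_ij,
# which is one entry of the vector-matrix product V @ G.
I = V @ G
print(I)                          # quantizing these currents yields the MAC results
```

Quantizing each column current then gives one multiply-accumulate result, which is the operation the 1T1R crossbar performs in analog form.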
However, in current RRAM-based in-memory computing neural-network accelerators, a 1T1R array can only store unsigned numbers when storing the weight matrix, so two device structures are usually combined to represent a signed number, i.e. the positive and negative weights of the neural network. The common processing methods are as follows:
1. In the 2T2R structure array shown in FIG. 1, two 1T1R cells are combined into a pair storing the positive and negative parts of a signed value to represent positive and negative weights; equal-magnitude positive and negative voltage pulse signals are applied, and the accumulated current collected at the last column is quantized to obtain the final multiply-accumulate result;
2. In the 1T1R positive/negative-row structure array shown in FIG. 2, all the weights of one weight matrix are mapped onto two 1T1R lines: one line holds the positive weights and receives the positive pulse input, the other holds the negative weights and receives an equal-magnitude negative pulse input; after the coded pulses are applied to the bit lines, the accumulated output currents of the two lines are collected and subtracted to obtain the calculation result;
3. In the 1T1R positive/negative dual-array structure shown in FIG. 3, two 1T1R arrays are built to store the positive and negative weights respectively, equal-magnitude voltage signals are input, and the two results are finally subtracted to obtain the final calculation result.
Because RRAM device technology is not yet mature, fabricating large-scale arrays still faces many challenges. To suppress crosstalk, storage and computation are carried out with one of the three array structures above, but representing a single signed number then requires twice the 1T1R resources, which greatly increases the area and energy consumption of the in-memory computing system.
Disclosure of Invention
The invention aims to provide a method for optimizing an array structure of an RRAM-oriented in-memory computing system, which aims to solve the problems in the background technology.
In order to achieve the above purpose, the present invention provides a method for optimizing an array structure of an RRAM-oriented in-memory computing system, comprising the following steps:
Step one: perform post-training quantization on the image data and the neural-network weight data, and use the quantization formula to obtain the quantized image data X_Q, the image zero offset X_Z, the quantized weight data W_Q and the weight zero offset W_Z; X_Z and W_Z are fixed after quantization, and the value X_Z·W_Z is computed in software;
Step two: the signed forward-propagation calculation formula of the neural network is

Y = Σ S_X·(X_Q − X_Z) · S_W·(W_Q − W_Z)

which expands to

Y = S_X·S_W · Σ (X_Q·W_Q − X_Q·W_Z − X_Z·W_Q + X_Z·W_Z)

The positive-integer terms X_Q·W_Z, X_Z·W_Q and X_Q·W_Q are calculated with adders, multipliers and the RRAM array circuit respectively; finally, together with the X_Z·W_Z value obtained in step one, they are substituted into the expanded formula to calculate the Y value. This splitting method lets the RRAM array store and compute only positive integers, avoiding the excessive consumption of RRAM device resources caused by directly computing signed numbers;
Step three: store the Y-value calculation results in a buffer, carry out the subsequent activation function and the remaining quantization operations, and after processing use the complete feature-map data obtained as the input of the next network layer.
As a preferred technical solution of the present invention, the quantization formula in the first step is:
Q = round(R / S) + Z
where R is the original data value, Q is the quantized data value, S is the scale factor, which represents the proportional relation between the original data and the quantized data, and Z is the zero offset, i.e. the quantized integer to which the value 0 of the original data is mapped;
with the quantization precision set to n bits, part of the test-set data is randomly extracted, the quantization scale factor S and the zero offset Z of each network layer are calculated with the following formulas, and substituting them into the quantization formula yields the quantized data values;
the calculation formula of the scaling factor S is as follows:
S = (R_max − R_min) / (Q_max − Q_min)
the zero offset Z calculation formula is:
Z = Q_max − round(R_max / S)
where R_max and R_min, Q_max and Q_min are the maximum and minimum values of the original data and of the quantized data respectively; Q_max and Q_min are determined by the quantization precision n, while R_max and R_min are determined from the randomly drawn sample data.
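For clarity, here is a minimal numerical sketch of the calibration and quantization formulas above (an illustration under assumptions, not the patent's implementation): it assumes unsigned n-bit quantization with Q_min = 0 and Q_max = 2^n − 1, and the function names are hypothetical.

```python
# Minimal sketch, assuming unsigned asymmetric quantization to n bits (Q_min = 0, Q_max = 2^n - 1);
# variable names follow the formulas above.
import numpy as np

def calibrate(samples, n_bits=8):
    """Compute scale S and zero offset Z from randomly drawn sample data."""
    q_min, q_max = 0, 2 ** n_bits - 1
    r_min, r_max = samples.min(), samples.max()
    S = (r_max - r_min) / (q_max - q_min)          # scale factor
    Z = int(round(q_max - r_max / S))              # zero offset (integer)
    return S, Z

def quantize(R, S, Z, n_bits=8):
    Q = np.round(R / S) + Z                        # Q = round(R/S) + Z
    return np.clip(Q, 0, 2 ** n_bits - 1).astype(np.int32)

def dequantize(Q, S, Z):
    return S * (Q - Z)                             # R is approximately S * (Q - Z)

rng = np.random.default_rng(0)
calib = rng.normal(size=1000).astype(np.float32)   # randomly drawn calibration samples
S, Z = calibrate(calib)
x = np.array([-1.5, 0.0, 1.5], dtype=np.float32)
print(quantize(x, S, Z), dequantize(quantize(x, S, Z), S, Z))
```

As described above, S and Z are fixed once from the sample statistics; afterwards quantization is a single round, scale and offset operation performed during inference.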
In the second step, the RRAM uses devices with only two stable states, high resistance and low resistance, each combined with an NMOS transistor into a 1T1R structure so as to build a 1T1R crossbar array; a row of n 1T1R structures forms an n-bit weight value, and the corresponding shift weighting is applied when the output currents are collected to obtain the calculation result.
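The bit slicing and shift weighting described in this step can be illustrated with the short sketch below (an illustrative model rather than the patent's circuit; the specific weights, inputs and helper names are assumptions):

```python
# Minimal sketch: an 8-bit unsigned weight is bit-sliced across eight binary 1T1R cells;
# the eight column currents are then shift-weighted (x 2^k) to recover the dot product X_Q . W_Q.
import numpy as np

def bit_slice(weights, n_bits=8):
    """Decompose unsigned integer weights into a (len(weights) x n_bits) 0/1 matrix."""
    return np.array([[(w >> k) & 1 for k in range(n_bits)] for w in weights])

W_Q = np.array([200, 3, 87, 19, 255, 0, 64, 128, 11])   # one 3x3 kernel, quantized, unsigned
X_Q = np.array([10, 0, 4, 7, 1, 9, 2, 5, 3])            # quantized inputs applied as read pulses

G = bit_slice(W_Q)                 # conductance pattern: one column per bit plane
I = X_Q @ G                        # per-column accumulated "currents"
result = sum(I[k] << k for k in range(G.shape[1]))       # shift weighting of the columns

assert result == int(X_Q @ W_Q)    # matches the full-precision dot product
print(result)
```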
As a preferred technical solution of the invention, the weight data W_Q are written row by row into the 1T1R array for storage under the control of a peripheral digital logic circuit; the quantization parameter X_Z is then converted into a read-voltage signal and input to the 1T1R array, and the output currents are collected, quantized and shift-weighted to obtain the calculation result X_Z·W_Q.
As a preferred technical solution of the invention, under the scheduling of a digital control circuit the corresponding image data are taken from the buffer in sequence, summed and multiplied by the corresponding W_Z to obtain X_Q·W_Z; at the same time, the image data are converted into voltage signals and input at the row heads of the 1T1R array, the output currents at the column ends are read, and X_Q·W_Q is obtained after quantization and shift weighting.
Compared with the prior art, the beneficial effects of the invention are as follows: under the same calculation conditions the invention greatly reduces the number of RRAM devices required, which, given that the RRAM fabrication process is not yet mature, makes it better suited to commercial deployment; the splitting method lets the RRAM array store and compute only positive integers, avoiding the excessive consumption of RRAM device resources caused by directly computing signed numbers; by halving the 1T1R array scale, the system area is effectively reduced, the system energy consumption is lowered, and the computing efficiency is improved; the invention is applicable to multiple neural networks such as the multilayer perceptron and the convolutional neural network, and because the local receptive field of a CNN convolutional layer is small, few adders are needed to compute the X_Q·W_Z term, so the advantages of the technique are even more pronounced in CNNs; as the number of convolution kernels in a CNN convolutional layer grows, the array scale of the invention remains half that of the conventional techniques while the number of additional adders and multipliers used to compute X_Z·W_Q and X_Q·W_Z stays unchanged, so the impact of the extra computing units is limited even for large CNNs and the overall performance advantage of the system is quite remarkable.
Drawings
FIG. 1 is a diagram showing the positive and negative weights of a background art 2T2R structure;
FIG. 2 is a diagram showing positive and negative weights of a positive and negative row structure of a background art 1T 1R;
FIG. 3 is a diagram showing the positive and negative weights of a 1T1R positive and negative double array structure in the prior art;
FIG. 4 is a graph showing the quantization weights of the optimized 1T1R single array according to the present invention;
FIG. 5 is a computational block diagram of the present invention optimized using PTQ formulas;
FIG. 6 is a diagram of a single-layer network accelerator architecture based on a 1T1R array in accordance with the present invention;
Detailed Description
Example 1
As shown in FIGS. 4 to 6, the invention discloses a method for optimizing the array structure of an RRAM (resistive random access memory)-oriented in-memory computing system, which mainly uses the corresponding formulas of a post-training quantization algorithm to optimize the array structure of an RRAM-based in-memory computing system, reducing the array area and the system power consumption while preserving the computing accuracy and precision, and providing a reliable solution for accelerating the computation of large-scale neural networks.
The quantization operation quantizes the 32-bit floating-point numbers in the neural network to 8-bit or other low-bit fixed-point numbers, which greatly reduces the calculation cost and helps integrate the neural network into edge intelligent devices with strict power and latency requirements. The post-training quantization (PTQ) algorithm used by the technique of the invention obtains the quantization parameters of the network without retraining it (i.e. without updating the network weights). Taking a convolutional neural network (CNN) as an example, after the network has been trained normally, the data produced by each convolution, pooling and fully connected layer are quantized in the inference stage, which effectively saves storage and computation cost.
The quantization formula is:
Q = round(R / S) + Z
where R is the original data value, Q is the quantized data value, S is the scale factor, which represents the proportional relation between the original data and the quantized data, and Z is the zero offset, i.e. the quantized integer to which the value 0 of the original data is mapped.
The calculation formula of the scaling factor S is as follows:
S = (R_max − R_min) / (Q_max − Q_min)
the zero offset Z calculation formula is:
Z = Q_max − round(R_max / S)
where Q_max and Q_min are determined by the quantization bit precision, while R_max and R_min are determined from the randomly extracted sample data.
Therefore, all quantization parameters required by the post-training quantization process can be determined after the network has been trained and before inference; during inference, quantization of the data is completed simply by substituting the corresponding parameters into the quantization formula.
The technique of the invention mainly uses the quantization calculation formula to optimize the array structure. Taking the convolution calculation in the first convolutional layer of a CNN as an example, the actual calculation formula after the original image data and the convolution kernel have been quantized is:
Y = Σ S_X·(X_Q − X_Z) · S_W·(W_Q − W_Z)

which expands to:

Y = S_X·S_W · Σ (X_Q·W_Q − X_Q·W_Z − X_Z·W_Q + X_Z·W_Z)
where X_Z and W_Z are fixed values.
Therefore, throughout the convolution calculation X_Z·W_Z is a known, fixed value, and X_Z·W_Q likewise only needs to be calculated once, after the 1T1R array weights have been initialized and before the convolution sliding operation begins; during each sliding step only the two terms X_Q·W_Z and X_Q·W_Q actually have to be calculated. The X_Q·W_Z term sums the picture input data of the current sliding window and multiplies the sum by W_Z; since a convolution kernel is typically 3×3 or 5×5, only a few adders and multipliers are required to complete this computation. The X_Q·W_Q term is the key computation: the quantized, positive-valued convolution-kernel weight matrix W_Q is stored in the 1T1R array, and during each convolution sliding step the corresponding feature-map data X_Q are input, so the matrix multiplication is performed in memory. The structure of this calculation method is shown in FIG. 5. In summary, the invention can halve the 1T1R array resources of the current techniques at the cost of a few additional multipliers and adders. In particular, the number of additional adders and multipliers depends only on the size of the convolution kernels and not on their number. When the number of convolution kernels in a CNN is very large, the required array scale grows greatly while the number of additional adders and multipliers stays the same, so the advantage of the technique in array-scale optimization becomes even more remarkable, making it very suitable for accelerating large-scale neural networks.
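The term splitting described above can be checked numerically with the following sketch (illustrative only; the scale factors, zero offsets and data are assumed values, not results from the patent):

```python
# Minimal sketch of the term splitting for a single convolution-window MAC with quantized data:
# the 1T1R array only has to evaluate the unsigned product X_Q . W_Q.
import numpy as np

S_X, X_Z = 0.05, 12        # activation scale and zero offset (assumed calibration results)
S_W, W_Z = 0.02, 9         # weight scale and zero offset (assumed)

X_Q = np.random.default_rng(1).integers(0, 256, size=9)   # one 3x3 input window, unsigned
W_Q = np.random.default_rng(2).integers(0, 256, size=9)   # one 3x3 kernel, unsigned

# Reference: dequantize first, then multiply-accumulate.
Y_ref = np.sum(S_X * (X_Q - X_Z) * S_W * (W_Q - W_Z))

# Split form: X_Q.W_Q comes from the 1T1R array; X_Q.W_Z needs a few adders and a multiplier
# per window; X_Z.W_Q and X_Z.W_Z are computed once after initialization.
xq_wq = int(X_Q @ W_Q)                 # in-memory MAC on the array
xq_wz = int(X_Q.sum()) * W_Z           # adders plus one multiplier per sliding step
xz_wq = X_Z * int(W_Q.sum())           # computed once after the weights are written
xz_wz = len(X_Q) * X_Z * W_Z           # known constant, computed in software

Y_split = S_X * S_W * (xq_wq - xq_wz - xz_wq + xz_wz)
assert np.isclose(Y_ref, Y_split)
print(Y_ref, Y_split)
```

The assertion confirms that the split form is mathematically identical to dequantizing first and then multiplying, which is why only the unsigned term X_Q·W_Q needs to be mapped onto the 1T1R array.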
Taking the first convolutional layer as an example and assuming 200 convolution kernels of size 3×3, the specific implementation steps of the method are as follows:
1. Design and train the neural-network system. Set the quantization bit width (for example 8 bits), randomly extract part of the test-set data, and calculate the quantization scale factor S and the zero offset Z of each network layer:

S = (R_max − R_min) / (Q_max − Q_min), Z = Q_max − round(R_max / S);
2. According to the formula

Q = round(R / S) + Z

quantize the original image data, the weights of all network layers and the intermediate calculation results of each layer, obtaining the quantized data and parameters X_Q, W_Q, X_Z and W_Z of each network layer, and calculate X_Z·W_Z for each layer;
3. The RRAM uses devices with only two stable states, high resistance and low resistance, each combined with an NMOS transistor into a 1T1R structure. Because an RRAM cell can only represent the binary logic values "0" and "1", a row of eight 1T1R structures is required to form one 8-bit weight value, and the corresponding shift weighting must be applied when the output currents are collected to obtain the actual calculation result. The convolution kernels are handled with the existing conventional technique: each kernel is unrolled row by row into a column vector and stored in a column of 1T1R structures, the output currents at the column ends are collected, and a single convolution result is obtained after quantization. Building the array according to this logic, the array size for 200 convolution kernels of size 3×3 is 9 × 1600 (only 14400 1T1R structures are needed with the technique of the invention, whereas the prior art needs 28800 1T1R structures because of the negative-weight problem);
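The array-size figures quoted in this step follow from a short calculation (a sketch of the arithmetic; the factor of two for the prior-art schemes is taken from the comparison with the structures of FIGS. 1 to 3):

```python
# Worked arithmetic for the example above: 200 kernels of size 3x3 at 8-bit precision.
# Each kernel column is bit-sliced into 8 physical columns, so the optimized single array
# needs kernel_size x (num_kernels x n_bits) cells; the signed-weight schemes need twice as many.
kernel_size, num_kernels, n_bits = 3 * 3, 200, 8

rows = kernel_size                        # 9
cols = num_kernels * n_bits               # 1600
optimized_cells = rows * cols             # 14400 1T1R cells (this method)
conventional_cells = 2 * optimized_cells  # 28800 1T1R cells (positive/negative dual representation)

print(rows, cols, optimized_cells, conventional_cells)
```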
4. Under the control of the peripheral digital logic circuit, write the quantized convolution-kernel weight data from step 2 into the array row by row for storage; then convert the quantization parameter X_Z into a read-voltage signal and input it to the 1T1R array, collect the output currents, and obtain the calculation result X_Z·W_Q after quantization and shift weighting;
5. Store the original image data in a buffer and control the sliding operation with the digital control circuit. According to the convolution sliding setting, take the 9 image data of the current sliding window from the buffer in sequence, sum them and multiply by the corresponding W_Z to obtain X_Q·W_Z. At the same time, convert the 9 image data into read voltages applied at the row heads of the array, read the output currents, and obtain X_Q·W_Q after quantization and shift weighting;
6. Connect a digital logic circuit at the output of the array and compute the final convolution result:

Y = S_X·S_W·(X_Q·W_Q − X_Q·W_Z − X_Z·W_Q + X_Z·W_Z)
Store the single convolution result in a register, carry out the subsequent activation-function and pooling operations, and after the convolution sliding calculation is finished use the complete feature-map data obtained as the input of the next network layer.
The steps above describe the calculation procedure of the first convolutional layer of the CNN; the other network layers operate similarly, and the quantized calculation output of each network layer becomes the data input of the next layer. As can be seen from FIG. 6, the 1T1R array of the invention only needs to store the quantized positive weight values, requiring only half the number of 1T1R cells of the prior art for the same weights. Although additional multipliers and adders are introduced into the calculation, only a small number is required: in the example above, only 1 multiplier and 3×200+9 adders are added, while 14400 1T1R cells are saved. Considering that the RRAM process is far less mature than the conventional CMOS process, the technical advantage of the invention is obvious.
Although specific embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments; various changes and modifications that require no inventive effort and do not depart from the spirit of the present invention fall within the scope of the present invention.

Claims (2)

1. A method for optimizing the array structure of an RRAM-oriented in-memory computing system, characterized by comprising the following steps:
Step one: perform post-training quantization on the image data and the neural-network weight data, and calculate through the quantization formula the quantized image data X_Q, the image zero offset X_Z, the quantized weight data W_Q and the weight zero offset W_Z; X_Z and W_Z are fixed after quantization, and the value X_Z·W_Z is computed in software;
Step two: the signed forward-propagation calculation formula of the neural network is:

Y = Σ S_X·(X_Q − X_Z) · S_W·(W_Q − W_Z)

which expands to:

Y = S_X·S_W · Σ (X_Q·W_Q − X_Q·W_Z − X_Z·W_Q + X_Z·W_Z)

the positive-integer terms X_Q·W_Z, X_Z·W_Q and X_Q·W_Q are calculated with adders, multipliers and the RRAM array circuit respectively, and finally, together with the X_Z·W_Z value calculated in step one, they are substituted into the expanded formula to calculate the Y value; this splitting method lets the RRAM array store and compute only positive integers, avoiding the excessive consumption of RRAM device resources caused by directly computing signed numbers; the RRAM uses devices with only two stable states, high resistance and low resistance, each combined with an NMOS transistor into a 1T1R structure so as to build a 1T1R crossbar array; n 1T1R structures in a row form an n-bit weight value, and the corresponding shift weighting is applied when the output currents are collected to obtain the calculation result; the weight data W_Q are written row by row into the 1T1R array for storage under the control of a peripheral digital logic circuit, the quantization parameter X_Z is then converted into a read-voltage signal and input to the 1T1R array, and the output currents are collected, quantized and shift-weighted to obtain the calculation result X_Z·W_Q; under the scheduling of a digital control circuit, the corresponding image data are taken from the buffer in sequence, summed and multiplied by the corresponding W_Z to obtain X_Q·W_Z; at the same time, the image data are converted into voltage signals and input at the row heads of the 1T1R array, the output currents at the column ends are read, and X_Q·W_Q is obtained after quantization and shift weighting;
Step three: store the Y-value calculation results in a buffer, carry out the subsequent activation function and the remaining quantization operations, and after processing use the complete feature-map data obtained as the input of the next network layer.
2. The method for optimizing the array structure of an RRAM-oriented in-memory computing system according to claim 1, characterized in that the quantization formula in step one is:
Q = round(R / S) + Z
where R is the original data value, Q is the quantized data value, S is the scale factor, which represents the proportional relation between the original data and the quantized data, and Z is the zero offset, i.e. the quantized integer to which the value 0 of the original data is mapped;
with the quantization precision set to n bits, part of the test-set data is randomly extracted, the quantization scale factor S and the zero offset Z of each network layer are calculated with the following formulas, and substituting them into the quantization formula yields the quantized data values;
the calculation formula of the scaling factor S is as follows:
S = (R_max − R_min) / (Q_max − Q_min)
the zero offset Z calculation formula is:
Z = Q_max − round(R_max / S)
where R_max and R_min, Q_max and Q_min are the maximum and minimum values of the original data and of the quantized data respectively; Q_max and Q_min are determined by the quantization precision n, while R_max and R_min are determined from the randomly drawn sample data.
CN202310186971.9A 2023-03-02 2023-03-02 Array structure optimization method for an RRAM-oriented in-memory computing system Active CN115879530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310186971.9A CN115879530B (en) 2023-03-02 2023-03-02 Array structure optimization method for an RRAM-oriented in-memory computing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310186971.9A CN115879530B (en) 2023-03-02 2023-03-02 Array structure optimization method for an RRAM-oriented in-memory computing system

Publications (2)

Publication Number Publication Date
CN115879530A (en) 2023-03-31
CN115879530B (en) 2023-05-05

Family

ID=85761720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310186971.9A Active CN115879530B (en) 2023-03-02 2023-03-02 Array structure optimization method for an RRAM-oriented in-memory computing system

Country Status (1)

Country Link
CN (1) CN115879530B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116561050A * 2023-04-07 2023-08-08 清华大学 Fine granularity mapping method and device for RRAM (resistive random access memory) integrated chip

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109146070B (en) * 2017-06-16 2021-10-22 华为技术有限公司 Peripheral circuit and system for supporting neural network training based on RRAM
US11450385B2 (en) * 2018-09-20 2022-09-20 University Of Utah Research Foundation Digital RRAM-based convolutional block
CN109472353B (en) * 2018-11-22 2020-11-03 浪潮集团有限公司 Convolutional neural network quantization circuit and method
CN109993297A (en) * 2019-04-02 2019-07-09 南京吉相传感成像技术研究院有限公司 A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing
CN110378468B (en) * 2019-07-08 2020-11-20 浙江大学 Neural network accelerator based on structured pruning and low bit quantization
CN110569962B (en) * 2019-08-08 2022-02-15 华中科技大学 Convolution calculation accelerator based on 1T1R memory array and operation method thereof
CN110427171B (en) * 2019-08-09 2022-10-18 复旦大学 In-memory computing device and method for expandable fixed-point matrix multiply-add operation
CN110647983B (en) * 2019-09-30 2023-03-24 南京大学 Self-supervision learning acceleration system and method based on storage and calculation integrated device array
CN110852429B (en) * 2019-10-28 2022-02-18 华中科技大学 1T 1R-based convolutional neural network circuit and operation method thereof
KR20210076691A (en) * 2019-12-16 2021-06-24 삼성전자주식회사 Method and apparatus for verifying the learning of neural network between frameworks
KR20210085461A (en) * 2019-12-30 2021-07-08 삼성전자주식회사 Processing apparatus and method for processing floating point operation thereof
CN111242289B (en) * 2020-01-19 2023-04-07 清华大学 Convolutional neural network acceleration system and method with expandable scale
CN111832719A (en) * 2020-07-28 2020-10-27 电子科技大学 Fixed point quantization convolution neural network accelerator calculation circuit
CN111738427B (en) * 2020-08-14 2020-12-29 电子科技大学 Operation circuit of neural network
CN112183739B (en) * 2020-11-02 2022-10-04 中国科学技术大学 Hardware architecture of memristor-based low-power-consumption pulse convolution neural network
CN112633477A (en) * 2020-12-28 2021-04-09 电子科技大学 Quantitative neural network acceleration method based on field programmable array
CN113033794B (en) * 2021-03-29 2023-02-28 重庆大学 Light weight neural network hardware accelerator based on deep separable convolution
CN113762491B (en) * 2021-08-10 2023-06-30 南京工业大学 Convolutional neural network accelerator based on FPGA
CN113705803A (en) * 2021-08-31 2021-11-26 南京大学 Image hardware identification system based on convolutional neural network and deployment method

Also Published As

Publication number Publication date
CN115879530A (en) 2023-03-31

Similar Documents

Publication Publication Date Title
Wang et al. Low power convolutional neural networks on a chip
CN110647983B (en) Self-supervision learning acceleration system and method based on storage and calculation integrated device array
CN107169563B (en) Processing system and method applied to two-value weight convolutional network
CN108665063B (en) Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator
CN107229967A (en) A kind of hardware accelerator and method that rarefaction GRU neutral nets are realized based on FPGA
WO2021088248A1 (en) Memristor-based neural network parallel acceleration method, processor and device
CN115879530B (en) Array structure optimization method for an RRAM-oriented in-memory computing system
CN113222133B (en) FPGA-based compressed LSTM accelerator and acceleration method
CN110569962B (en) Convolution calculation accelerator based on 1T1R memory array and operation method thereof
CN112636745B (en) Logic unit, adder and multiplier
CN115423081A (en) Neural network accelerator based on CNN _ LSTM algorithm of FPGA
CN115390789A (en) Magnetic tunnel junction calculation unit-based analog domain full-precision memory calculation circuit and method
CN111048135A (en) CNN processing device based on memristor memory calculation and working method thereof
CN111931925A (en) FPGA-based binary neural network acceleration system
US20220269483A1 (en) Compute in memory accumulator
WO2022062391A1 (en) System and method for accelerating rnn network, and storage medium
CN113378115B (en) Near-memory sparse vector multiplier based on magnetic random access memory
TWI771014B (en) Memory circuit and operating method thereof
CN114758699A (en) Data processing method, system, device and medium
CN113988279A (en) Output current reading method and system of storage array supporting negative value excitation
Chen et al. An efficient ReRAM-based inference accelerator for convolutional neural networks via activation reuse
CN109416757B (en) Method, apparatus and computer-readable storage medium for processing numerical data
CN217933180U (en) Memory computing circuit
Chang et al. HDSuper: Algorithm-Hardware Co-design for Light-weight High-quality Super-Resolution Accelerator
CN115858999B (en) Combined optimization problem processing circuit based on improved simulated annealing algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant