CN112580793A - Neural network accelerator based on time domain memory computing and acceleration method

Info

Publication number: CN112580793A
Application number: CN202011548012.XA
Authority: CN (China)
Prior art keywords: result, weight, sub, time domain, convolution
Legal status: Granted; currently Active
Other languages: Chinese (zh)
Other versions: CN112580793B
Inventors: 尹首一, 杨建勋, 刘壮志, 韩慧明, 刘雷波, 魏少军
Current assignee: Tsinghua University
Original assignee: Tsinghua University
Events: application filed by Tsinghua University; priority to CN202011548012.XA; publication of CN112580793A; application granted; publication of CN112580793B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a neural network accelerator based on time domain memory computing and an acceleration method. The neural network accelerator comprises a main controller, a weight storage unit, an activation value storage unit, a sub-convolution kernel generator, a pulse quantizer and a complementary dual-mode predictor. The sub-convolution kernel generator compares the quantization weights bit-wise according to a preset high-low bit cross-flip encoding table, generates a plurality of sub-convolution kernels from the result of the bit-wise comparison, and sends the generated sub-convolution kernels to the time domain memory computing module; the pulse quantizer receives the result computed by the time domain memory computing module from the feature map information and the sub-convolution kernels, and quantizes that result; the complementary dual-mode predictor receives the quantized results from the pulse quantizer, sorts them, and determines whether to terminate the computation early according to the prediction result under the set prediction mode. The invention supports non-uniform quantization, reduces memory accesses, eliminates redundant computation, and lowers quantization energy consumption and error.

Description

Neural network accelerator based on time domain memory computing and acceleration method
Technical Field
The invention relates to the technical field of artificial intelligence computation on edge devices, and in particular to a neural network accelerator and an acceleration method based on time domain memory computing.
Background
With the development of Internet of Things technology, the volume of data has outgrown the traditional terminal-cloud model, and many IoT devices cannot perform their data processing in the cloud. In this situation, edge AI can perform AI computation directly near the sensor, without the Internet. Because the system load and network latency are reduced, edge AI devices can be applied in more places. In a driverless car, for example, an edge AI device can sample the conditions around the vehicle through on-board radar and automatically plan the driving route; in an emergency, it can brake the car in a shorter time, because it does not need to exchange data with the cloud.
Meanwhile, different application scenarios impose different requirements on edge AI devices. In small devices such as smart watches or smart speakers, edge AI must consume less power in pursuit of longer battery life; a driverless car demands higher computation speed and accuracy, so that it can handle emergencies faster; a device such as a mobile phone, which must process many kinds of data, needs to adapt flexibly to different neural network bit widths while balancing speed and accuracy. These application scenarios pose new challenges for edge AI devices in computational complexity, user experience, device memory and device energy consumption.
An in-memory computing architecture performs computation directly in memory. Under the traditional von Neumann architecture, the storage unit and the computing unit are separate, and convolution requires a large amount of data movement between them. Meanwhile, as Moore's law has progressed, shrinking feature sizes have improved the performance of computing units faster than that of storage units; today, storing a 32-bit datum consumes two to three orders of magnitude more energy than computing with it. This has become a major bottleneck for the energy efficiency of neural network accelerators. In-memory computing reduces storage time and energy by performing the computation directly in the storage unit. Convolution in particular, with its high parallelism, benefits from in-memory computing because large-scale data movement is avoided, making the approach well suited to accelerating convolution.
An in-memory computing system places the arithmetic unit inside the memory unit. Since the memory unit is designed in the analog domain, the computing unit also works in the analog domain. Unlike the digital domain, the analog domain can compute with voltage, that is, use the magnitude of a voltage to represent the magnitude of a datum. This method has the advantage of a low device toggle rate and thus low power consumption, but because of noise within the device it cannot represent very precise results and covers only a small range of data. Data magnitude can also be represented by frequency in the analog domain, which is likewise affected by noise and cannot represent very precise results. Both methods require a dedicated digital-to-analog/analog-to-digital converter, which occupies a certain amount of power and area.
Besides the voltage domain and the frequency domain, the magnitude of a datum can be represented by the width of a pulse. Compared with voltage-domain or frequency-domain computation, this time-domain computation is less susceptible to noise, more robust, and more precise. Moreover, because a single pulse represents a data value, the toggle rate is low, giving lower energy consumption and better stability.
High-precision application scenarios require high-bit-width neural network computation. For a time domain convolution module of fixed bit width, bit-splitting convolution is needed: weight data of large bit width is split into several sub-weights of small bit width, and each sub-weight is convolved with the activation values separately. A k-bit vector is split into several h-bit vectors (k > h), the convolution is divided into several inner products, and each partial result is multiplied by a scale factor and summed to obtain the final result.
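As an illustration of the bit-splitting scheme just described, here is a minimal Python sketch (our own; the function name, the use of unsigned integer weights, and the shapes are assumptions for illustration, not taken from the patent). Each k-bit weight is cut into k/h slices of h bits, each slice contributes one low-bit-width inner product, and the partial results are recombined with power-of-two scale factors:

```python
import numpy as np

def bit_split_dot(weights, activations, k=8, h=2):
    """Compute <weights, activations> by splitting each k-bit weight
    into k//h slices of h bits and recombining scaled partial sums."""
    assert k % h == 0
    result = 0
    for s in range(k // h):
        # Extract the s-th h-bit slice of every weight.
        slice_vals = (weights >> (s * h)) & ((1 << h) - 1)
        # Low-bit-width inner product (what a fixed-width module computes).
        partial = np.dot(slice_vals, activations)
        # Scale factor 2^(s*h) restores the slice's significance.
        result += partial << (s * h)
    return result

w = np.array([173, 42, 91], dtype=np.int64)   # unsigned 8-bit weights
a = np.array([3, 7, 1], dtype=np.int64)
assert bit_split_dot(w, a, k=8, h=2) == np.dot(w, a)
```

Note that every slice is scaled by a fixed power of two, which is exactly why this scheme presumes uniform quantization intervals.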
Although bit-splitting convolution can accelerate multi-bit neural networks, its computational nature raises three major problems. 1. Because of how the bits are split, only uniformly quantized networks can be accelerated; for non-uniformly quantized networks, which offer relatively higher accuracy, bit-splitting convolution cannot handle the unequal quantization intervals. 2. Bit-splitting convolution cannot skip the computation made redundant by the activation function. After a convolution finishes, the result passes through an activation function, generally a ReLU, which sets negative results to 0 and leaves positive results unchanged. If, over the partial products, the positive contribution is smaller than the negative one, the final result will be set to 0, so part of the computation that produces a negative result is in fact redundant. Under bit splitting, the activation values are always multiplied by 1 or 0, so every partial result is non-negative and the redundant computation cannot be skipped. 3. Because of how the time domain conversion module works, a standard pulse is needed to measure the magnitude of the result, and the results of bit-splitting convolution cannot be predicted to fall in a common interval, so they can only be quantized with one and the same pulse. Quantizing a long pulse signal with a narrow pulse wastes considerable energy, while quantizing a short pulse signal with a wide pulse produces a large error; the energy and the error of the quantization pulse therefore cannot be balanced.
Therefore, there is a need for a time domain memory computation based neural network acceleration scheme that can overcome the above problems.
Disclosure of Invention
The embodiment of the invention provides a neural network accelerator based on time domain memory computing, which accelerates a neural network based on time domain memory computing and, while supporting non-uniform quantization, reduces memory accesses, eliminates redundant computation, and lowers quantization energy consumption and error. The accelerator comprises: a main controller, a weight storage unit, an activation value storage unit, a sub-convolution kernel generator, a pulse quantizer and a complementary dual-mode predictor;
the main controller is used for writing the quantization weights into the weight storage unit and writing the activation values into the activation value storage unit;
the weight storage unit is used for sending the quantization weights to the sub-convolution kernel generator;
the activation value storage unit is used for sending feature map information to the time domain memory computing module according to the activation values;
the sub-convolution kernel generator is used for receiving the quantization weights sent by the weight storage unit, comparing the quantization weights bit-wise according to a preset high-low bit cross-flip encoding table, generating a plurality of sub-convolution kernels from the result of the bit-wise comparison, and sending the generated sub-convolution kernels to the time domain memory computing module;
the pulse quantizer is used for receiving the result computed by the time domain memory computing module from the feature map information and the plurality of sub-convolution kernels, and quantizing that result according to a preset weight threshold and a preset sub-convolution result threshold;
the complementary dual-mode predictor is used for receiving the quantized results from the pulse quantizer, sorting the quantized results, and determining whether to terminate the computation early according to the prediction result under the set prediction mode.
The embodiment of the invention further provides a neural network acceleration method based on time domain memory computing, which accelerates a neural network based on time domain memory computing and, while supporting non-uniform quantization, reduces memory accesses, eliminates redundant computation, and lowers quantization energy consumption and error. The method comprises the following steps:
the main controller writes the quantization weights into the weight storage unit and writes the activation values into the activation value storage unit;
the weight storage unit sends the quantization weights to the sub-convolution kernel generator;
the activation value storage unit sends feature map information to the time domain memory computing module according to the activation values;
the sub-convolution kernel generator receives the quantization weights sent by the weight storage unit, compares them bit-wise according to the preset high-low bit cross-flip encoding table, generates a plurality of sub-convolution kernels from the result of the bit-wise comparison, and sends the generated sub-convolution kernels to the time domain memory computing module;
the pulse quantizer receives the result computed by the time domain memory computing module from the feature map information and the plurality of sub-convolution kernels, and quantizes that result according to the preset weight threshold and sub-convolution result threshold;
and the complementary dual-mode predictor receives the quantized results from the pulse quantizer, sorts them, and determines whether to terminate the computation early according to the prediction result under the set prediction mode.
The embodiment of the invention also provides a computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, performs the above neural network acceleration method based on time domain memory computing.
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program for executing the above neural network acceleration method based on time domain memory computing.
Compared with prior-art schemes that use bit-splitting convolution for neural network acceleration based on time domain memory computing, the neural network accelerator based on time domain memory computing provided by the embodiment of the invention comprises: a main controller, a weight storage unit, an activation value storage unit, a sub-convolution kernel generator, a pulse quantizer and a complementary dual-mode predictor. The main controller writes the quantization weights into the weight storage unit and the activation values into the activation value storage unit; the weight storage unit sends the quantization weights to the sub-convolution kernel generator; the activation value storage unit sends feature map information to the time domain memory computing module according to the activation values; the sub-convolution kernel generator receives the quantization weights sent by the weight storage unit, compares them bit-wise according to a preset high-low bit cross-flip encoding table, generates a plurality of sub-convolution kernels from the result of the bit-wise comparison, and sends the generated sub-convolution kernels to the time domain memory computing module; the pulse quantizer receives the result computed by the time domain memory computing module from the feature map information and the sub-convolution kernels, and quantizes that result according to a preset weight threshold and a preset sub-convolution result threshold; the complementary dual-mode predictor receives the quantized results from the pulse quantizer, sorts them, and determines whether to terminate the computation early according to the prediction result under the set prediction mode. In the embodiment of the invention, the high-low bit cross-flip encoding of the sub-convolution kernel generator reduces memory accesses, the complementary dual-mode predictor eliminates redundant computation, the pulse quantizer lowers quantization energy consumption and error, and the time domain memory computing module performs single-weight convolution on the feature map information and the sub-convolution kernels, so that the computation of the neural network is effectively accelerated, the number of operations is reduced, and non-uniform quantization can be supported.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort. In the drawings:
FIG. 1 is a diagram of a neural network accelerator based on time domain memory computing according to an embodiment of the present invention;
FIG. 2 is a diagram of an overall architecture of a neural network accelerator based on time domain memory computations according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a single-weight split convolution scheme according to an embodiment of the present invention;
FIGS. 4-6 are schematic diagrams illustrating operation of a sub-convolution kernel generator according to an embodiment of the present invention;
FIGS. 7-9 are schematic diagrams illustrating the operation of the pulse quantizer according to an embodiment of the present invention;
FIGS. 10-11 are schematic diagrams illustrating operation of a complementary dual-mode predictor according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of a neural network acceleration method based on time domain memory computing according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
In order to accelerate a neural network based on time domain memory computing and, while supporting non-uniform quantization, reduce memory accesses, eliminate redundant computation, and lower quantization energy consumption and error, an embodiment of the present invention provides a neural network accelerator based on time domain memory computing. As shown in FIG. 1, the neural network accelerator may include: a main controller, a weight storage unit, an activation value storage unit, a sub-convolution kernel generator, a pulse quantizer and a complementary dual-mode predictor;
the main controller is used for writing the quantization weights into the weight storage unit and writing the activation values into the activation value storage unit;
the weight storage unit is used for sending the quantization weights to the sub-convolution kernel generator;
the activation value storage unit is used for sending feature map information to the time domain memory computing module according to the activation values;
the sub-convolution kernel generator is used for receiving the quantization weights sent by the weight storage unit, comparing the quantization weights bit-wise according to a preset high-low bit cross-flip encoding table, generating a plurality of sub-convolution kernels from the result of the bit-wise comparison, and sending the generated sub-convolution kernels to the time domain memory computing module;
the pulse quantizer is used for receiving the result computed by the time domain memory computing module from the feature map information and the plurality of sub-convolution kernels, and quantizing that result according to a preset weight threshold and a preset sub-convolution result threshold;
the complementary dual-mode predictor is used for receiving the quantized results from the pulse quantizer, sorting the quantized results, and determining whether to terminate the computation early according to the prediction result under the set prediction mode.
As shown in FIG. 1, the neural network accelerator based on time domain memory computing according to the embodiment of the present invention includes: a main controller, a weight storage unit, an activation value storage unit, a sub-convolution kernel generator, a pulse quantizer and a complementary dual-mode predictor. The main controller writes the quantization weights into the weight storage unit and the activation values into the activation value storage unit; the weight storage unit sends the quantization weights to the sub-convolution kernel generator; the activation value storage unit sends feature map information to the time domain memory computing module according to the activation values; the sub-convolution kernel generator receives the quantization weights sent by the weight storage unit, compares them bit-wise according to a preset high-low bit cross-flip encoding table, generates a plurality of sub-convolution kernels from the result of the bit-wise comparison, and sends the generated sub-convolution kernels to the time domain memory computing module; the pulse quantizer receives the result computed by the time domain memory computing module from the feature map information and the sub-convolution kernels, and quantizes that result according to a preset weight threshold and a preset sub-convolution result threshold; the complementary dual-mode predictor receives the quantized results from the pulse quantizer, sorts them, and determines whether to terminate the computation early according to the prediction result under the set prediction mode. In the embodiment of the invention, the high-low bit cross-flip encoding of the sub-convolution kernel generator reduces memory accesses, the complementary dual-mode predictor eliminates redundant computation, the pulse quantizer lowers quantization energy consumption and error, and the time domain memory computing module performs single-weight convolution on the feature map information and the sub-convolution kernels, so that the computation of the neural network is effectively accelerated, the number of operations is reduced, and non-uniform quantization can be supported.
FIG. 2 is a diagram of the overall architecture of a neural network accelerator based on time domain memory computing according to an embodiment of the present invention. It comprises a main controller, a weight storage unit with a capacity of 72 KB, an activation value storage unit with a capacity of 128 KB, and 3 auxiliary modules. The 3 auxiliary modules are: (1) the sub-convolution kernel generator, which decomposes a whole convolution kernel into several sub-convolution kernels more efficiently; (2) the complementary dual-mode predictor, which determines from the current weight and the accumulated sum whether to finish the computation early; (3) the pulse quantizer, which, guided by the current weight and the results of previous computations, generates pulses as long as possible while preserving precision, so as to reduce power consumption. These 3 auxiliary modules allow the time domain memory computing module to be used more effectively for convolution.
The inventors found that previous time domain memory computing architectures compute in a bit-splitting manner. Because the data can be expanded bit by bit, variable bit width can be achieved, but the original data can only be scaled proportionally, so non-uniform quantization cannot be supported.
In an embodiment, to support variable bit width and non-uniform data quantization at the same time, the embodiment of the present invention uses a single-weight split convolution method, in which a convolution kernel whose quantization weights have bit width k is split into 2^k sub-convolution kernels, and each quantization weight is convolved with its own sub-convolution kernel. The standard convolution is thereby decomposed into sparse convolutions requiring only 2^k multiplication operations and n addition operations, where n is the number of activation values in the feature map and n is much greater than 2^k. As shown in FIG. 3, for a kernel with 4 quantized values the bit width of the quantization weights is 2, and the whole kernel is split into 4 sub-kernels, each containing only the two values 0 and 1. A 1 marks the position in the original convolution kernel where the actual weight corresponding to this sub-convolution kernel appears, and a 0 means the original datum at that position is not the corresponding actual weight. The feature map is then convolved with the sub-convolution kernels. Since a sub-convolution kernel contains only 0s and 1s, the multiplication of conventional convolution reduces to a selection operation, and the whole convolution becomes the summation of all activation values at the positions marked 1; SX in FIG. 3 denotes the result of such a sub-convolution. After each sub-convolution finishes, its result is multiplied by the actual weight corresponding to the quantization weight, and these products are accumulated in turn to obtain the final result.
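The single-weight split convolution described above can be sketched in a few lines of Python (an illustrative model of the arithmetic only; the real module operates on pulse widths in memory, and all names here are ours). Each of the 2^k quantization codes yields a binary sub-kernel, each sub-convolution reduces to summing the activations selected by the 1s, and the final result accumulates each sub-result times its actual weight, which may be non-uniformly spaced:

```python
import numpy as np

def single_weight_conv(q_kernel, actual_weights, feature):
    """q_kernel: integer quantization codes (same shape as the feature window);
    actual_weights[c]: real weight that code c stands for (may be non-uniform)."""
    result = 0.0
    for code, w in enumerate(actual_weights):
        sub_kernel = (q_kernel == code)          # binary sub-convolution kernel
        sx = feature[sub_kernel].sum()           # "multiplication" is mere selection
        result += w * sx                         # scale by the actual weight, accumulate
    return result

q = np.array([[0, 3], [1, 2]])                   # 2-bit codes -> 4 sub-kernels
w_lut = np.array([-1.5, -0.25, 0.75, 2.0])       # non-uniform actual weights
x = np.array([[3.0, 1.0], [2.0, 4.0]])
assert np.isclose(single_weight_conv(q, w_lut, x), (w_lut[q] * x).sum())
```

Because the actual weight for each code is looked up rather than reconstructed from bit significance, the codes can map to arbitrary real values, which is what makes non-uniform quantization possible here.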
The inventors found that because the single-weight convolution method convolves the feature map with the sub-kernels of the convolution kernel, the convolution kernel can be decomposed freely, so the method supports the more accurate non-uniform weight quantization; that is, a quantization code can represent any actual weight, better improving the precision of the convolution result. However, relatively large power consumption and latency are incurred, because the convolution kernel must be read every time a sub-convolution kernel is generated. To compensate for this drawback, the embodiment of the invention proposes high-low bit cross-flip encoding, which makes full use of the information from previously generated sub-convolution kernels to reduce the amount of memory access.
In an embodiment, the sub-convolution kernel generator is further configured to: compare, bit by bit and according to the preset high-low bit cross-flip encoding table, the quantization weight to be compared with the weight corresponding to the sub-convolution kernel to be generated in the single-weight computation; perform an AND operation and a NOR operation on the results of the bit-wise comparison; and generate a plurality of sub-convolution kernels based on the results of the AND operation and the NOR operation.
In this embodiment, since a sub-convolution kernel is determined only by the quantized convolution kernel, and a quantization weight is merely an index, the actual weights can be encoded arbitrarily and mapped to the true weights. To reduce memory accesses as much as possible, the previous comparison results should be reused as much as possible. Meanwhile, noting that the comparison is an XNOR-style operation, a fast sub-convolution kernel generator based on high-low bit cross-flip can perform the comparison so that several sub-convolution kernels are generated with a single memory access.
FIG. 4 shows the comparison process using high-low bit cross-flip encoding according to an embodiment of the present invention, where w denotes the quantization weight to be compared and e denotes the weight corresponding to the sub-convolution kernel to be generated in the single-weight computation. During the comparison, w is first compared with e bit by bit; C_A denotes the AND of the bit-wise comparison results and C_O denotes their NOR. Comparing the high and low halves separately yields the four results HC_A, HC_O, LC_A and LC_O. By combining these four results, the results for four sub-convolution kernels are obtained from a single comparison.
FIG. 5 shows the high-low bit cross-flip encoding table according to an embodiment of the present invention, in which the high and low halves within each group are exactly complementary, which guarantees that each group needs only one comparison to obtain its results. Meanwhile, because the high halves of adjacent groups, such as group0 and group1, are close, the high-half results can be reused, further reducing the amount of memory access.
FIG. 6 is the architecture diagram of the sub-convolution kernel generator according to an embodiment of the present invention, which adopts a bit-serial, word-parallel structure: the weight storage unit sends 1 bit of all words in parallel to the cross-flip encoding module, and the bits then enter the comparator. Four registers inside the comparator store the four results HC_A, HC_O, LC_A and LC_O, which are combined and output in turn according to the data to be compared, so the results of 4 sub-convolution kernels can be output while the weight storage unit is read only once. For 1-to-8-bit neural networks, both memory accesses and latency are reduced by a factor of 1.91 to 3.83.
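One way to picture the four-results-per-read mechanism is the following sketch, based on our reading of FIGS. 4-6 (the exact encoding table of FIG. 5 is not reproduced, and the group layout here is an assumption): XNOR-comparing the stored codes w with a single reference code e and then testing whether the high half and the low half of the comparison are all ones (the AND results HC_A, LC_A) or all zeros (the NOR results HC_O, LC_O) simultaneously yields match masks for e itself and for e with its low half, high half, or both halves inverted:

```python
import numpy as np

def four_subkernels_one_read(w, e, k=4):
    """w: array of k-bit quantization codes; e: one reference code.
    Returns match masks for the four codes {e, e^lo, e^hi, e^(hi|lo)}."""
    half = k // 2
    lo_mask = (1 << half) - 1                  # low-half bit positions
    hi_mask = lo_mask << half                  # high-half bit positions
    eq = ~(w ^ e)                              # bitwise XNOR: 1 where bits agree
    HC_A = (eq & hi_mask) == hi_mask           # high halves fully equal
    HC_O = (eq & hi_mask) == 0                 # high halves fully opposite
    LC_A = (eq & lo_mask) == lo_mask
    LC_O = (eq & lo_mask) == 0
    return {e:                       HC_A & LC_A,   # w == e
            e ^ lo_mask:             HC_A & LC_O,   # low half flipped
            e ^ hi_mask:             HC_O & LC_A,   # high half flipped
            e ^ (hi_mask | lo_mask): HC_O & LC_O}   # both halves flipped

w = np.array([0b0101, 0b0110, 0b1010, 0b1001])
masks = four_subkernels_one_read(w, e=0b0101)
assert list(masks[0b0101]) == [True, False, False, False]
assert list(masks[0b1010]) == [False, False, True, False]
```

If the encoding table assigns the four actual weights of a group to these four mutually cross-flipped codes, one read of the weight memory indeed suffices to produce the group's four binary sub-kernels.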
The inventors found that for a time domain memory computing architecture, quantizing a time-domain result into a digital-domain result requires a pulse quantizer, which measures the result against a standard quantization pulse. As shown in FIG. 7, finding the most suitable frequency for the result pulse reduces both error and power consumption: quantizing a narrow pulse with a wide quantization pulse produces a large error, while quantizing a wide pulse with a narrow quantization pulse keeps the error small but consumes much power. For the single-weight convolution method, both the width of the quantization pulse and the magnitude of the true weight affect the error of the final result and the energy of quantization; since the final accumulation multiplies pulse widths by true weights and adds them, a larger true weight amplifies the error of the pulse quantization result.
In an embodiment, the pulse quantizer is further configured to: determine the quantization pulse width according to the preset weight threshold and sub-convolution result threshold; and quantize the result computed by the time domain memory computing module using that quantization pulse width.
In this embodiment, both the magnitude of the sub-convolution result and the true weight of the single-weight convolution affect the final accumulated result, as shown in FIG. 8, so the ranges of the two parts need to be estimated. The sum of all sub-convolutions of a single-weight convolution is a fixed value: the convolution of the input feature map with the all-1s kernel, i.e. the sum of all activation values. The upper bound on the remaining sub-convolution results therefore decreases after every step, and the true weight of each sub-convolution is known in advance, so the quantization pulse width of the time domain memory computing module can be configured dynamically, cutting energy as much as possible while preserving precision. On this basis we propose the activation-value-weight-adaptive pulse quantizer: for each neural network the quantizer is tested offline with four sampling frequencies, and the best frequency is searched for to reduce the loss of energy and accuracy. As shown in FIG. 8, the width of the quantization pulse is determined by two thresholds, a weight threshold and a sub-convolution result threshold. SX_th denotes half of the average sum of all activation values of the input feature map over 1000 convolution computations, and w_th denotes half of the sum of the absolute values of the quantized true weights. The four intervals determined by these two thresholds are quantized with different frequencies. The frequencies for the intervals are found offline by computing 1000 feature maps with 16 candidate frequencies, simulating each with an energy model and a neural network convolution model, and searching for the optimal frequency, so that energy loss is reduced as far as possible while precision is guaranteed. Different intervals also impose different requirements: when the sub-convolution result is small and the true weight is large, the precision must be high; when the sub-convolution result is large and the true weight is small, the precision can be lower and the energy consumption must be lower.
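The interval selection reduces to a small lookup, sketched below in Python (the threshold names follow the text; the four frequencies are invented placeholders standing in for the offline-searched values):

```python
def pick_quant_frequency(sx_upper, w_actual, sx_th, w_th, freqs):
    """Select the quantization-pulse frequency from the four intervals
    defined by the sub-convolution-result threshold sx_th and the weight
    threshold w_th. A large |weight| amplifies quantization error, so it
    calls for fine (fast, narrow-pulse) quantization; a small expected
    sub-result paired with a small weight tolerates coarse, cheap pulses."""
    return freqs[(sx_upper > sx_th, abs(w_actual) > w_th)]

# Placeholder frequencies in Hz; the patent searches 16 candidates offline
# over 1000 feature maps per network -- these numbers are invented.
freqs = {(False, False): 50e6, (False, True): 200e6,
         (True, False): 100e6, (True, True): 400e6}
f = pick_quant_frequency(sx_upper=30.0, w_actual=-1.75,
                         sx_th=64.0, w_th=1.0, freqs=freqs)  # -> 200e6
```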
FIG. 9 is a block diagram of the pulse quantizer according to an embodiment of the present invention. The actual-weight decoder and the prediction unit send to the quantizer the actual weight of the current computation, the threshold of the actual weight, the accumulated sum of the remaining activation values, and the threshold of the sub-convolution kernel. With this information, the quantizer adjusts the width of the quantization pulse by adjusting the voltage and the delay-chain length. For a 1-to-8-bit VGG16 neural network, energy is reduced by 1.41x to 1.92x and accuracy improves by 0.37% to 3.49%.
The inventors found that because of the activation function ReLU, any result less than the threshold (th) becomes 0 after passing through the activation function. If, during the accumulation of sub-convolution results, the final sign of the whole result can be estimated from the values already computed and the numerical characteristics of those not yet computed, then the uncomputed part need not be computed at all; this part of the workload can be called redundant computation.
In an embodiment, the prediction modes include a current-accumulated-value prediction mode and a hypothetical-extremum prediction mode. The complementary dual-mode predictor is further configured to: in the current-accumulated-value prediction mode, compute the positive weights first and the negative weights afterwards, and terminate the computation early if the accumulated value falls below the set threshold; in the hypothetical-extremum prediction mode, compute the negative weights first and the positive weights afterwards, subtract the activation values already used from the sum of all activation values, multiply the difference by the largest remaining positive weight to obtain the predicted maximum of the remaining result, and terminate the computation early if this predicted maximum added to the current accumulated value is still below the threshold.
In this embodiment, by the commutativity of addition, changing the order in which the sub-convolutions of a single-weight convolution are accumulated has no effect on the final result. The sub-convolutions can therefore be sorted, and the current accumulated value can be inspected at any time, or negative final results can be ruled out in advance from the maximum still attainable, to reduce the amount of computation. The computation of the single-weight convolution proceeds as follows:
SUM_1 + SUM_2 = Σ_{i=1}^{t} w_i · SX_i + Σ_{i=t+1}^{2^k} w_i · SX_i
where SX denotes the result of computing the feature map with a sub-convolution kernel, SUM_1 denotes the sub-convolution results accumulated so far, and SUM_2 denotes the sub-convolution results not yet accumulated. Since the ReLU function turns results below the threshold into 0, every activation value entering a computation can be taken as non-negative, so SX is always positive, while the weights take both positive and negative values; the weights can therefore be sorted for prediction. We designed two prediction modes: current-accumulated-value prediction and hypothetical-extremum prediction.
FIG. 10 (a) shows the computation process of current-accumulated-value prediction according to an embodiment of the present invention. Current-accumulated-value prediction computes the positive weights w_p first and the negative weights w_n afterwards, so during the accumulation the running value SUM_1 first rises and then falls. If at some step t SUM_1 drops below the threshold, then because every subsequent SX is multiplied by a negative weight, SUM_2 is certain to be negative, and the final result SUM_1 + SUM_2 must also be below the threshold; the computations after the t-th step are therefore redundant and can be terminated early.
FIG. 10 (b) shows the computation process of hypothetical-extremum prediction according to an embodiment of the present invention. Hypothetical-extremum prediction computes the negative weights first and the positive weights afterwards. During the computation, the sum of all activation values SX_all, i.e. the result of convolving the feature map with a special all-1s sub-convolution kernel, is computed first; then each computed SX_i is subtracted from it, and the remainder is multiplied by the largest remaining positive weight w_pmax to obtain the predicted maximum of the remaining results, max(SUM_2). If this maximum added to the current accumulated value, SUM_1 + max(SUM_2), is still below the threshold, it is confirmed that the final result SUM_1 + SUM_2 must be below the threshold; the remaining work is therefore redundant computation and can be skipped.
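Both early-termination tests fit in one short Python sketch (illustrative only; the ordering of weights and the threshold handling follow the description above, and all names are ours):

```python
def predict_conv(sub_results, weights, threshold, mode="accumulate"):
    """Early-terminating accumulation of a single-weight convolution.
    sub_results[i] is SX for weight i (non-negative, since the activations
    come out of a ReLU); weights may be positive or negative.
    Returns (post-ReLU result, terminated_early)."""
    if mode == "accumulate":
        # Current-accumulated-value mode: positive weights first, so the
        # running sum SUM_1 first rises and then only falls.
        order = sorted(range(len(weights)), key=lambda i: -weights[i])
        sum1 = 0.0
        for i in order:
            sum1 += weights[i] * sub_results[i]
            # On the falling slope every remaining term is <= 0, so once
            # SUM_1 < threshold the final result must stay below it.
            if weights[i] < 0 and sum1 < threshold:
                return 0.0, True
        return (sum1 if sum1 >= threshold else 0.0), False
    # Hypothetical-extremum mode: negative weights first.
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    sx_all = sum(sub_results)          # conv with the all-1s sub-kernel
    sum1, sx_done = 0.0, 0.0
    for step, i in enumerate(order):
        sum1 += weights[i] * sub_results[i]
        sx_done += sub_results[i]
        w_pmax = max((weights[j] for j in order[step + 1:]), default=0.0)
        # Optimistic bound: every remaining activation gets weight w_pmax.
        if sum1 + (sx_all - sx_done) * max(w_pmax, 0.0) < threshold:
            return 0.0, True
    return (sum1 if sum1 >= threshold else 0.0), False

# A negative-dominated kernel terminates early under either mode.
sx = [5.0, 2.0, 1.0, 4.0]
w = [-2.0, -0.25, 0.75, 1.5]
print(predict_conv(sx, w, threshold=0.0, mode="accumulate"))
print(predict_conv(sx, w, threshold=0.0, mode="extremum"))
```

On this example the extremum mode stops after two of the four sub-convolutions, since even crediting all remaining activations with the largest remaining positive weight cannot lift the sum above the threshold.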
FIG. 11 is a block diagram of the complementary dual-mode predictor. The computing module sends each computation result to the predictor, which, depending on the prediction mode, checks whether the result is the total SX_all of all activation values and, if so, stores it. In the current-accumulated-value prediction mode, the predictor only needs to check after every computation whether the accumulated result is below the threshold; as soon as it is, computation stops and the result is output directly. In the hypothetical-extremum prediction mode, each newly computed result is subtracted from the stored SX_all, the remainder is multiplied by the largest weight not yet computed to obtain the expected maximum of the remaining results, this maximum is added to the accumulated result, and the sum is tested against the threshold; if the test succeeds, computation stops and the result is output directly. Tested on 1-to-8-bit VGG16 networks, dual-mode prediction reduces the amount of computation by a factor of 2.32 to 5.46.
The embodiment of the invention aims to improve the energy efficiency of the time domain memory computing architecture, overcome its original drawbacks, and carry out convolution at any bit width. We first propose the single-weight convolution method to speed up neural network computation, reduce the number of operations, and make support for non-uniform quantization possible. We then propose three techniques to improve energy efficiency: high-low bit cross-flip encoding reduces memory accesses, dual-mode prediction eliminates redundant computation, and the activation-value-sparsity-adaptive dynamic quantizer reduces quantization energy and error. Under a 28 nm process the energy efficiency reaches 60.2 TOPS/W on uniformly quantized networks and 62.1 TOPS/W on non-uniformly quantized networks, about 1.29x and 10.6x higher than the state of the art.
Based on the same inventive concept, the embodiment of the present invention further provides a neural network acceleration method based on time domain memory computing, as described in the following embodiments. Since the principle by which this method solves the problem is similar to that of the neural network accelerator based on time domain memory computing, its implementation can refer to the implementation of the accelerator, and repeated details are not described again.
FIG. 12 is a schematic diagram of a neural network acceleration method based on time domain memory computing according to an embodiment of the present invention. As shown in FIG. 12, the method includes:
step 1201, the main controller writes the quantization weights into the weight storage unit and writes the activation values into the activation value storage unit;
step 1202, the weight storage unit sends the quantization weights to the sub-convolution kernel generator;
step 1203, the activation value storage unit sends feature map information to the time domain memory computing module according to the activation values;
step 1204, the sub-convolution kernel generator receives the quantization weights sent by the weight storage unit, compares them bit-wise according to the preset high-low bit cross-flip encoding table, generates a plurality of sub-convolution kernels from the result of the bit-wise comparison, and sends the generated sub-convolution kernels to the time domain memory computing module;
step 1205, the pulse quantizer receives the result computed by the time domain memory computing module from the feature map information and the plurality of sub-convolution kernels, and quantizes that result according to the preset weight threshold and sub-convolution result threshold;
and step 1206, the complementary dual-mode predictor receives the quantized results from the pulse quantizer, sorts them, and determines whether to terminate the computation early according to the prediction result under the set prediction mode.
In one embodiment, comparing the quantization weights bit-wise according to the preset high-low bit cross-flip encoding table and generating a plurality of sub-convolution kernels from the result of the bit-wise comparison includes:
comparing, bit by bit and according to the preset high-low bit cross-flip encoding table, the quantization weight to be compared with the weight corresponding to the sub-convolution kernel to be generated in the single-weight computation;
performing an AND operation and a NOR operation on the results of the bit-wise comparison;
and generating a plurality of sub-convolution kernels based on the results of the AND operation and the NOR operation.
In one embodiment, quantizing the result computed by the time domain memory computing module according to the preset weight threshold and sub-convolution result threshold includes:
determining the quantization pulse width according to the preset weight threshold and sub-convolution result threshold;
and quantizing the result computed by the time domain memory computing module using that quantization pulse width.
In one embodiment, the prediction modes include a current-accumulated-value prediction mode and a hypothetical-extremum prediction mode;
determining whether to terminate the computation early according to the prediction result under the set prediction mode includes: in the current-accumulated-value prediction mode, computing the positive weights first and the negative weights afterwards, and terminating the computation early if the accumulated value falls below the set threshold; in the hypothetical-extremum prediction mode, computing the negative weights first and the positive weights afterwards, subtracting the activation values already used from the sum of all activation values, multiplying the difference by the largest remaining positive weight to obtain the predicted maximum of the remaining result, and terminating the computation early if this predicted maximum added to the current accumulated value is still below the threshold.
To sum up, the neural network accelerator based on time domain memory computing provided by the embodiments of the present invention comprises: a main controller, a weight storage unit, an activation value storage unit, a sub-convolution kernel generator, a pulse quantizer and a complementary dual-mode predictor. The main controller writes the quantization weights into the weight storage unit and the activation values into the activation value storage unit; the weight storage unit sends the quantization weights to the sub-convolution kernel generator; the activation value storage unit sends feature map information to the time domain memory computing module according to the activation values; the sub-convolution kernel generator receives the quantization weights sent by the weight storage unit, compares them bit-wise according to a preset high-low bit cross-flip encoding table, generates a plurality of sub-convolution kernels from the result of the bit-wise comparison, and sends the generated sub-convolution kernels to the time domain memory computing module; the pulse quantizer receives the result computed by the time domain memory computing module from the feature map information and the sub-convolution kernels, and quantizes that result according to a preset weight threshold and a preset sub-convolution result threshold; the complementary dual-mode predictor receives the quantized results from the pulse quantizer, sorts them, and determines whether to terminate the computation early according to the prediction result under the set prediction mode. In the embodiment of the invention, the high-low bit cross-flip encoding of the sub-convolution kernel generator reduces memory accesses, the complementary dual-mode predictor eliminates redundant computation, the pulse quantizer lowers quantization energy consumption and error, and the time domain memory computing module performs single-weight convolution on the feature map information and the sub-convolution kernels, so that the computation of the neural network is effectively accelerated, the number of operations is reduced, and non-uniform quantization can be supported.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate the technical solutions of the present invention rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the art may still modify the technical solutions described in the foregoing embodiments, easily conceive of changes, or make equivalent substitutions for some of their technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention and shall all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A neural network accelerator based on time domain memory computing, comprising: a main controller, a weight storage unit, an activation value storage unit, a sub-convolution kernel generator, a pulse quantizer and a complementary dual-mode predictor;
the main controller is used for writing the quantization weights into the weight storage unit and writing the activation values into the activation value storage unit;
the weight storage unit is used for sending the quantization weights to the sub-convolution kernel generator;
the activation value storage unit is used for sending feature map information to the time domain memory computing module according to the activation values;
the sub-convolution kernel generator is used for receiving the quantization weights sent by the weight storage unit, comparing the quantization weights bit-wise according to a preset high-low bit cross-flip encoding table, generating a plurality of sub-convolution kernels from the result of the bit-wise comparison, and sending the generated sub-convolution kernels to the time domain memory computing module;
the pulse quantizer is used for receiving the result computed by the time domain memory computing module from the feature map information and the plurality of sub-convolution kernels, and quantizing that result according to a preset weight threshold and a preset sub-convolution result threshold;
the complementary dual-mode predictor is used for receiving the quantized results from the pulse quantizer, sorting the quantized results, and determining whether to terminate the computation early according to the prediction result under the set prediction mode.
2. The time domain memory computing-based neural network accelerator of claim 1, wherein the sub-convolution kernel generator is further configured to:
compare, bit by bit and according to the preset high-low bit cross-flip encoding table, the quantization weight to be compared with the weight corresponding to the sub-convolution kernel to be generated in the single-weight computation;
perform an AND operation and a NOR operation on the results of the bit-wise comparison;
and generate a plurality of sub-convolution kernels based on the results of the AND operation and the NOR operation.
3. The time domain memory computing-based neural network accelerator of claim 1, wherein the pulse quantizer is further configured to:
determine the quantization pulse width according to the preset weight threshold and sub-convolution result threshold;
and quantize the result computed by the time domain memory computing module using that quantization pulse width.
4. The time domain memory computing-based neural network accelerator of claim 1, wherein the prediction modes comprise a current-accumulated-value prediction mode and a hypothetical-extremum prediction mode;
and the complementary dual-mode predictor is further configured to: in the current-accumulated-value prediction mode, compute the positive weights first and the negative weights afterwards, and terminate the computation early if the accumulated value falls below the set threshold; in the hypothetical-extremum prediction mode, compute the negative weights first and the positive weights afterwards, subtract the activation values already used from the sum of all activation values, multiply the difference by the largest remaining positive weight to obtain the predicted maximum of the remaining result, and terminate the computation early if this predicted maximum added to the current accumulated value is still below the threshold.
5. A method for neural network acceleration based on time domain memory computing, performed with the neural network accelerator of any one of claims 1 to 4, comprising:
the main controller writes the quantized weights into the weight storage unit and the activation values into the activation value storage unit;
the weight storage unit sends the quantized weights to the sub-convolution kernel generator;
the activation value storage unit sends feature map information, derived from the activation values, to the time domain memory computing module;
the sub-convolution kernel generator receives the quantized weights from the weight storage unit, compares them bit by bit according to the preset high-low-bit cross-flip encoding table, generates a plurality of sub-convolution kernels from the comparison results, and sends the generated sub-convolution kernels to the time domain memory computing module;
the pulse quantizer receives a result computed by the time domain memory computing module from the feature map information and the plurality of sub-convolution kernels, and quantizes that result according to the preset weight threshold and the preset sub-convolution result threshold; and
the complementary dual-mode predictor receives the quantized results from the pulse quantizer, sorts them, and determines, in the selected prediction mode, whether to terminate the computation early based on the prediction result.
6. The method of claim 5, wherein comparing the quantized weights bit by bit according to the preset high-low-bit cross-flip encoding table and generating a plurality of sub-convolution kernels from the comparison results comprises:
comparing, bit by bit and according to the preset high-low-bit cross-flip encoding table, the quantized weight under comparison with the weight corresponding to the sub-convolution kernel to be generated in a single weight computation;
applying AND and NOR operations to the results of the bit-wise comparison; and
generating the plurality of sub-convolution kernels from the results of the AND and NOR operations.
7. The method of claim 5, wherein quantizing the result computed by the time domain memory computing module according to the preset weight threshold and the sub-convolution result threshold comprises:
determining a quantization pulse width from the preset weight threshold and the preset sub-convolution result threshold; and
quantizing the result computed by the time domain memory computing module using the quantization pulse width.
8. The method of claim 5, wherein the prediction modes comprise a current-accumulated-value prediction mode and an assumed-extreme-value prediction mode; and
determining whether to terminate the computation early according to the prediction result in the selected prediction mode comprises: in the current-accumulated-value prediction mode, computing the positive weights first and the negative weights afterwards, and terminating the computation early if the accumulated value falls below a set threshold; and in the assumed-extreme-value prediction mode, computing the negative weights first and the positive weights afterwards, subtracting the current activation value from the sum of all activation values, multiplying the resulting difference by the largest remaining positive weight to obtain a predicted maximum of the remaining result, and terminating the computation early if the sum of this predicted maximum and the current accumulated value is still below the threshold.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any one of claims 5 to 8 when executing the computer program.
10. A computer-readable storage medium storing a computer program which, when executed, performs the method of any one of claims 5 to 8.
CN202011548012.XA 2020-12-24 2020-12-24 Neural network accelerator based on time domain memory computing and acceleration method Active CN112580793B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011548012.XA CN112580793B (en) 2020-12-24 2020-12-24 Neural network accelerator based on time domain memory computing and acceleration method

Publications (2)

Publication Number Publication Date
CN112580793A 2021-03-30
CN112580793B 2022-08-12

Family

ID=75139546

Country Status (1)

Country Link
CN (1) CN112580793B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180157969A1 (en) * 2016-12-05 2018-06-07 Beijing Deephi Technology Co., Ltd. Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network
CN109598338A (en) * 2018-12-07 2019-04-09 东南大学 A kind of convolutional neural networks accelerator of the calculation optimization based on FPGA
CN110288086A (en) * 2019-06-13 2019-09-27 天津大学 A kind of configurable convolution array accelerator structure based on Winograd

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011569A (en) * 2021-04-07 2021-06-22 开放智能机器(上海)有限公司 Offline quantitative parameter filling method and device, electronic equipment and storage medium
CN113703718A (en) * 2021-10-14 2021-11-26 中科南京智能技术研究院 Multi-bit memory computing device with variable weight
CN113703718B (en) * 2021-10-14 2022-02-22 中科南京智能技术研究院 Multi-bit memory computing device with variable weight
CN115994561A (en) * 2023-03-22 2023-04-21 山东云海国创云计算装备产业创新中心有限公司 Convolutional neural network acceleration method, system, storage medium, device and equipment

Also Published As

Publication number Publication date
CN112580793B (en) 2022-08-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant