WO2021000469A1 - Binary neural network accumulator circuit based on analogue delay chain - Google Patents


Info

Publication number
WO2021000469A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
input
analog delay
delay
delay chain
Prior art date
Application number
PCT/CN2019/114252
Other languages
French (fr)
Chinese (zh)
Inventor
单伟伟
商新超
Original Assignee
东南大学
Priority date
Filing date
Publication date
Application filed by 东南大学
Publication of WO2021000469A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation using electronic means
    • G06N3/065 Analogue means
    • G06N3/08 Learning methods

Definitions

  • The invention relates to a binary neural network accumulator circuit based on an analog delay chain, i.e. a circuit that uses mixed-signal (digital-analog) techniques to implement neural network accumulation, and belongs to the technical field of basic electronic circuits.
  • Mainstream neural network architectures use a 32-bit data width, with a trend toward 16-bit, 8-bit, or even 4-bit. To reduce power consumption, dynamic precision adjustment can select the operating bit width according to the needs of the application; more aggressive than 4-bit is a 2-bit width, and the most extreme is 1 bit. When the bit width becomes 1 bit, the neural network becomes a special network: the Binary Neural Network (BNN).
  • Because of the particularity of its arithmetic, the multiplication of a binary neural network is equivalent to an XNOR operation, so in an actual chip implementation an XNOR gate can realize the multiplication of the binary neural network.
  • The overall operation of the binary neural network consists of multiplication and accumulation, and the number of 1s in the accumulated result determines the final output. The accumulated result can therefore be evaluated by analog computation.
  • The invention is mainly used for the accumulation calculation of the binary neural network, thereby reducing the power consumption of neural network computation. The analog delay unit designed in this application controls the delay from A to Y through the state of the data terminal D.
  • The purpose of the present invention is to address the shortcomings of the above background art and provide a binary neural network accumulator circuit based on an analog delay chain, which replaces traditional digital accumulation with analog computation, effectively reducing the power consumption of binary neural network accumulation, realizing energy-efficient accumulation for binary neural networks, and solving the technical problem that the energy consumption of binary neural network accumulation needs to be reduced.
  • A binary neural network system based on an analog delay chain comprises a delay chain module and a pulse generating circuit. The delay chain module consists of two delay chains and a D flip-flop, where each delay chain is composed of N analog delay units. The analog delay unit uses 6 MOS transistors and distinguishes input data "0" from "1" by the difference in delay time. The delay chain connects the N analog delay units in series, accumulating the multi-bit input data and determining the number of "1"s.
  • The design method of the binary neural network accumulator circuit based on the analog delay chain of the present invention includes the following steps:
  • (1) Analog delay unit design: first complete the transistor sizing of the analog delay unit, then draw its layout according to the layout design rules for digital standard cells;
  • (2) Delay chain module: after the analog delay unit is designed, join it with cells from the standard cell library to complete the delay chain module.
  • The delay unit is composed of 3 NMOS transistors, 1 PMOS transistor, and an inverter. The peripheral input data A is connected to the gates of the PMOS transistor and the first NMOS transistor, and the peripheral input data D is connected to the gate of the second NMOS transistor. The source of the first NMOS transistor and the drains of the second and third NMOS transistors are connected at node n; the sources of the second and third NMOS transistors are grounded. The drains of the PMOS transistor and the first NMOS transistor are connected at node m, which serves as the input of the inverter; the inverter output is the delayed output signal. The source of the PMOS transistor and the gate of the third NMOS transistor are both connected to the power supply.
  • The delay chain module is composed of two delay chains and a D flip-flop, each delay chain consisting of n analog delay units. The data input terminals of the analog delay units are connected to the peripheral input data; the signal output terminals of delay chain 1 and delay chain 2 are connected to the data terminal and clock terminal of the D flip-flop, respectively; the clock signal output of the pulse generating circuit is connected to the clock signal input of the delay chain module; and the output signal of the D flip-flop is the judgment signal for the delay comparison.
  • The input signals of the delay chain module are a clock input signal and n peripheral input data; the output signal is a data output flag. The n peripheral input data are connected to the data input terminals of the n delay units, the clock input signal is applied to the signal input of the first delay unit, and the output terminal of each delay unit is connected to the input terminal of the next. The output of the nth delay unit of delay chain 1 is connected to the D terminal of the D flip-flop, the output of the nth delay unit of delay chain 2 is connected to the clock (CLK) terminal of the D flip-flop, and the output signal Flag of the D flip-flop is the delay flag signal.
  • The overall operation of the binary neural network includes multiplication and accumulation, and the number of 1s in the accumulated result determines the final output. In practice, only whether the accumulated result is greater or less than 0 matters, so analog computation can be used to judge it.
  • The present invention adopts the above technical solution and has the following beneficial effects: it realizes the neural network's accumulation by analog computation, converting digital signals into analog (time-domain) signals, which effectively reduces the overall power consumption of the chip and works stably over a wide voltage range; at the same time, the proposed delay unit has a small area overhead, so a high power saving can be obtained.
  • Figure 1 is a structural diagram of the delay unit of the present invention.
  • Figure 2 is a working timing diagram of the delay unit of the present invention.
  • Figure 3 is a circuit diagram of the analog delay chain of the present invention.
  • Figure 4 is an overall structure diagram of the delay chain module of the present invention.
  • Figure 5 is a working timing diagram of the delay chain module of the present invention.
  • Figure 6 is an HSPICE simulation timing diagram of the delay chain module of the present invention.
  • Figure 7 is a circuit diagram of the pulse generating circuit of the present invention.
  • Figure 8 is a working timing diagram of the pulse generating circuit of the present invention.
  • The delay unit of the present invention is shown in Figure 1 and is composed of 3 NMOS transistors, 1 PMOS transistor, and an inverter. The peripheral input data A is connected to the gates of PMOS transistor M1 and NMOS transistor M2; the peripheral input data D is connected to the gate of NMOS transistor M3; the source of NMOS transistor M2 and the drains of NMOS transistors M3 and M4 are connected at node n; the sources of NMOS transistors M3 and M4 are grounded; the source of PMOS transistor M1 and the gate of NMOS transistor M4 are both connected to the power supply; and the drains of PMOS transistor M1 and NMOS transistor M2 are connected at node m as the input of inverter U1, whose output is the delayed output signal.
  • The working timing diagram of the delay unit of the present invention is shown in Figure 2. The data input terminal D controls whether MOS transistor M3 conducts. When D is "1", transistor M3 is on, and when input A changes from "0" to "1", node n discharges through transistors M3 and M4 in parallel. When D is "0", transistor M3 is off, and when A changes from "0" to "1", node n can only discharge through transistor M4, increasing the delay from A to Y. The delay from A to Y can therefore be controlled through the data input terminal D.
  • The analog delay chain circuit of the present invention includes two parts: a delay chain module and a pulse generating circuit. The delay chain module is composed of n delay units and a D flip-flop. The data input terminals D of the delay units are connected to the peripheral input data; the signal output terminals Y1 and Y2 of delay chain 1 and delay chain 2 are connected to the data terminal and clock terminal of the D flip-flop, respectively; the clock signal output of the pulse generating circuit is connected to the clock signal input of the delay chain module; and the output signal Flag of the D flip-flop is the judgment signal for the delay comparison.
  • The overall structure of the delay chain module of the present invention is shown in Figure 4. The weights W1, W2, ..., Wn of the neural network are XNORed with the input data X1, X2, ..., Xn, and the results D1, D2, ..., Dn are fed to the data inputs of the delay chain module. The delay chain module is composed of two delay chains and a D flip-flop, each delay chain consisting of n analog delay units. The input data of delay chain 1 are the XNOR results D1, D2, ..., Dn of the weights and the image data; delay chain 2 is the reference chain, whose input data are configured according to the computational needs of each layer of the network. The signal output terminals Y1 and Y2 of delay chain 1 and delay chain 2 are connected to the data terminal and clock terminal of the D flip-flop, respectively; the clock signal output of the pulse generating circuit is connected to the clock signal input of the delay chain module; and the output signal Flag of the D flip-flop is the judgment signal for the delay comparison.
  • During training, the data of each layer is standardized so that the outputs follow a normal distribution N(0, 1), i.e. batch normalization (BN). In the BN formula, γ and β are the scaling factor and bias coefficient, parameters learned during training that apply an affine transformation to the activations so the original input's representation can be restored; x is the input data set; μ_B is the mean of the input data set; σ_B is the standard deviation of the input data set; and ε is a small positive constant added to prevent the denominator from being 0.
  • The binarized neural network binarizes every weight in its weight matrix and every activation value (to +1 or -1). Because of this special structure, batch normalization in a binarized neural network can be simplified; the simplified batch normalization formula for the binary neural network is given as formula 1.2. Since the batch normalization of the binarized network reduces to adding a bias, the bias value can be added directly to reference delay chain 2, whose inputs are configured according to the results of network training.
  • The working timing diagram of the delay chain module is shown in Figure 5. The signals Y1 and Y2 are connected to the data terminal and clock terminal of the D flip-flop, respectively. In the first clock cycle, delay chain 1 contains more "1"s than delay chain 2, so Y1 arrives first and the D flip-flop samples a "1"; in the second clock cycle, delay chain 1 contains fewer "1"s than delay chain 2, so Y2 arrives first and the D flip-flop samples a "0".
  • The HSPICE simulation timing of the delay chain module is shown in Figure 6: when delay chain 1 contains fewer "1"s than delay chain 2, signal Y2 arrives first and the sampled data (Flag) is "0"; when delay chain 1 contains more "1"s than delay chain 2, signal Y1 arrives first and the sampled data (Flag) is "1".
  • The pulse generating circuit of the present invention is shown in Figure 7. It is composed of 3 NAND gates, an inverter, and a delay module. The delay module can be configured with different delay values, thereby adjusting the pulse width.
  • The working timing diagram of the pulse generating circuit of the present invention is shown in Figure 8. The basic principle is as follows: when CLK is low, nodes X and Qb are both high and node Y stays low; when CLK goes from low to high, node Qb first falls from high to low, which drives node Y from low to high and node X from high to low, after which node Qb returns from low to high. The time taken to complete this whole loop is the width of the pulse generated by the circuit; the pulse width is therefore determined by the delay of the delay module and the three NAND gates.
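This relationship between the configurable delay and the output pulse can be sketched behaviorally. All delay values below are assumed placeholders, not figures from the patent; only the structural relationship (pulse width = delay-module delay plus three NAND gate delays) comes from the text:

```python
# Behavioral model of the pulse generator's width. T_NAND is an assumed
# placeholder for a single NAND gate delay; only the structural relation
# stated in the text is modeled.
T_NAND = 0.05  # assumed delay of one NAND gate, in ns

def pulse_width(t_delay_module: float) -> float:
    """Width of the generated pulse for a given delay-module setting:
    the edge must traverse the delay module and three NAND gates."""
    return t_delay_module + 3 * T_NAND

# Configuring a larger delay module widens the pulse:
assert pulse_width(1.0) > pulse_width(0.5)
```

Widening the delay module is thus the single knob for tuning the clock pulse fed to the delay chains.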

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Pulse Circuits (AREA)

Abstract

A binary neural network accumulator circuit based on an analogue delay chain, belonging to the technical field of basic electronic circuits. The circuit comprises a delay chain module with two delay chains and a pulse generation circuit. Each analogue delay chain is composed of multiple analogue delay cells connected in series; each cell uses six MOS transistors and distinguishes "0" from "1" by the magnitude of its delay. Analogue computation replaces the accumulation of a traditional digital circuit design. The accumulator works stably over a wide voltage range and is simple to implement, effectively reducing the power consumption of binary neural network accumulation and greatly improving the energy efficiency of the neural network circuit.

Description

A Binary Neural Network Accumulator Circuit Based on an Analog Delay Chain

Technical field
The invention relates to a binary neural network accumulator circuit based on an analog delay chain, i.e. a circuit that uses mixed-signal (digital-analog) techniques to implement neural network accumulation, and belongs to the technical field of basic electronic circuits.
Background
In recent years, artificial intelligence technology has demonstrated unique advantages in image recognition, face detection, speech recognition, word processing, and game playing. In developed countries, artificial intelligence has become a development priority; the most prominent progress has come in the field of deep learning. The research practice of companies such as Baidu, Google, Microsoft, and Facebook shows that deep learning can reach or exceed human-level performance in image perception. One of the main challenges in deploying deep learning networks is that the large number of operations consumes too much energy and too many hardware resources.
Mainstream neural network architectures use a 32-bit data width, with a trend toward 16-bit, 8-bit, or even 4-bit. To reduce power consumption, dynamic precision adjustment can be used to select the operating bit width according to the needs of the application; more aggressive than 4-bit is a 2-bit width, and the most extreme is 1 bit. When the bit width becomes 1 bit, the neural network becomes a special network: the Binary Neural Network (BNN).
Power consumption is a major bottleneck limiting the application of neural networks, and binary neural networks are an important direction in the "miniaturization" of neural networks. Two parts of a neural network can be binarized: the network's coefficients and its intermediate results. By reducing single-precision floating-point coefficients to +1 or -1, coefficient binarization shrinks storage to 1/32 (about 3%) of the original. If the intermediate results are also binarized, floating-point arithmetic can be replaced by integer bit operations, since most calculations are performed on values of ±1. Compared with a non-binarized network, a binary neural network turns a large number of arithmetic operations into bit operations, greatly reducing the amount of computation and storage and lowering the barrier to applying neural networks.
Because of this special arithmetic, the multiplication of a binary neural network is equivalent to an XNOR operation, so in an actual chip implementation an XNOR gate can realize the multiplication of the binary neural network. The overall operation of the binary neural network consists of multiplication and accumulation, and the number of 1s in the accumulated result determines the final output. The accumulated result can therefore be evaluated by analog computation.
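As an illustration of this mapping (hypothetical helper names, not code from the patent), a minimal Python sketch of a binarized multiply-accumulate: ±1 values are stored as 0/1 bits, multiplication becomes XNOR, and the output sign is decided by counting 1s:

```python
# Illustrative sketch: store -1 as bit 0 and +1 as bit 1, so the product
# of two +/-1 values equals the XNOR of their bit encodings.

def xnor(a: int, b: int) -> int:
    """XNOR of two bits: 1 when they are equal, 0 when they differ."""
    return 1 - (a ^ b)

def bnn_neuron(weights, inputs):
    """Binarized multiply-accumulate: XNOR each weight/input pair,
    then output the sign of the sum by comparing counts of 1s and 0s."""
    products = [xnor(w, x) for w, x in zip(weights, inputs)]
    ones = sum(products)                  # accumulation = counting the 1s
    return 1 if ones > len(products) - ones else 0

# Consistency with +/-1 arithmetic: (-1)*(-1) = +1 maps to xnor(0, 0) = 1,
# and (+1)*(-1) = -1 maps to xnor(1, 0) = 0.
```

The final comparison of 1s against 0s is exactly the judgment the delay-chain accumulator performs in the analog domain.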
The invention is mainly used for the accumulation calculation of the binary neural network, thereby reducing the power consumption of neural network computation. The analog delay unit designed in this application controls the delay from A to Y through the state of the data terminal D.
Summary of the invention
The purpose of the present invention is to address the shortcomings of the above background art and provide a binary neural network accumulator circuit based on an analog delay chain, which replaces traditional digital accumulation with analog computation, effectively reducing the power consumption of binary neural network accumulation, realizing energy-efficient accumulation for binary neural networks, and solving the technical problem that the energy consumption of binary neural network accumulation needs to be reduced.
To achieve the above purpose, the present invention adopts the following technical solution:
A binary neural network system based on an analog delay chain, comprising a delay chain module and a pulse generating circuit. The delay chain module consists of two delay chains and a D flip-flop, where each delay chain is composed of N analog delay units. The analog delay unit uses 6 MOS transistors and distinguishes input data "0" from "1" by the difference in delay time. The delay chain connects the N analog delay units in series, accumulating the multi-bit input data and determining the number of "1"s.
The design method of the binary neural network accumulator circuit based on the analog delay chain of the present invention includes the following steps:
(1) Analog delay unit design: first complete the transistor sizing of the analog delay unit, then draw its layout according to the layout design rules for digital standard cells;
(2) Delay chain module: after the analog delay unit is designed, join it with cells from the standard cell library to complete the delay chain module.
The delay unit is composed of 3 NMOS transistors, 1 PMOS transistor, and an inverter. The peripheral input data A is connected to the gates of the PMOS transistor and the first NMOS transistor, and the peripheral input data D is connected to the gate of the second NMOS transistor. The source of the first NMOS transistor and the drains of the second and third NMOS transistors are connected at node n; the sources of the second and third NMOS transistors are grounded. The drains of the PMOS transistor and the first NMOS transistor are connected at node m, which serves as the input of the inverter; the inverter output is the delayed output signal. The source of the PMOS transistor and the gate of the third NMOS transistor are both connected to the power supply.
The delay chain module is composed of two delay chains and a D flip-flop, each delay chain consisting of n analog delay units. The data input terminals of the analog delay units are connected to the peripheral input data; the signal output terminals of delay chain 1 and delay chain 2 are connected to the data terminal and clock terminal of the D flip-flop, respectively; the clock signal output of the pulse generating circuit is connected to the clock signal input of the delay chain module; and the output signal of the D flip-flop is the judgment signal for the delay comparison.
The input signals of the delay chain module are a clock input signal and n peripheral input data; the output signal is a data output flag. The n peripheral input data are connected to the data input terminals of the n delay units, the clock input signal is applied to the signal input of the first delay unit, and the output terminal of each delay unit is connected to the input terminal of the next. The output of the nth delay unit of delay chain 1 is connected to the D terminal of the D flip-flop, the output of the nth delay unit of delay chain 2 is connected to the clock (CLK) terminal of the D flip-flop, and the output signal Flag of the D flip-flop is the delay flag signal.
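The race between the two chains can be sketched behaviorally in Python. The per-cell delay values below are arbitrary placeholders, not figures from the patent; the only property the model relies on is that a cell is faster when its data bit is 1:

```python
# Behavioral model of the two-chain race. T_FAST / T_SLOW are assumed
# placeholder delays: a cell is fast when its data bit is 1 (M3 and M4
# discharge node n in parallel) and slow when it is 0.
T_FAST, T_SLOW = 1.0, 2.0

def chain_delay(bits):
    """Total propagation delay of an n-cell delay chain."""
    return sum(T_FAST if d == 1 else T_SLOW for d in bits)

def flag(data_bits, ref_bits):
    """D flip-flop decision: Flag = 1 when delay chain 1 (wired to the
    flip-flop's data terminal) beats reference chain 2 (wired to its
    clock terminal), i.e. when the data word holds more 1s."""
    y1 = chain_delay(data_bits)
    y2 = chain_delay(ref_bits)
    return 1 if y1 < y2 else 0
```

With equal chain lengths, Y1 arrives first exactly when chain 1 holds more 1s than the reference, so Flag reports whether the accumulated count exceeds the configured reference count.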
The overall operation of the binary neural network includes multiplication and accumulation, and the number of 1s in the accumulated result determines the final output. In the actual computation, only the binarized accumulation result is needed, i.e. whether the accumulated result is greater or less than 0, so analog computation can be used to judge the accumulated result.
The present invention adopts the above technical solution and has the following beneficial effects: it realizes the neural network's accumulation by analog computation, converting digital signals into analog (time-domain) signals, which effectively reduces the overall power consumption of the chip and works stably over a wide voltage range; at the same time, the proposed delay unit has a small area overhead, so a high power saving can be obtained.
Description of the drawings
Figure 1 is a structural diagram of the delay unit of the present invention.
Figure 2 is a working timing diagram of the delay unit of the present invention.
Figure 3 is a circuit diagram of the analog delay chain of the present invention.
Figure 4 is an overall structure diagram of the delay chain module of the present invention.
Figure 5 is a working timing diagram of the delay chain module of the present invention.
Figure 6 is an HSPICE simulation timing diagram of the delay chain module of the present invention.
Figure 7 is a circuit diagram of the pulse generating circuit of the present invention.
Figure 8 is a working timing diagram of the pulse generating circuit of the present invention.
Detailed description
The technical solution of the invention is described in detail below with reference to the accompanying drawings, but the protection scope of the invention is not limited to the described embodiments.
The delay unit of the present invention is shown in Figure 1 and is composed of 3 NMOS transistors, 1 PMOS transistor, and an inverter. The peripheral input data A is connected to the gates of PMOS transistor M1 and NMOS transistor M2; the peripheral input data D is connected to the gate of NMOS transistor M3; the source of NMOS transistor M2 and the drains of NMOS transistors M3 and M4 are connected at node n; the sources of NMOS transistors M3 and M4 are grounded; the source of PMOS transistor M1 and the gate of NMOS transistor M4 are both connected to the power supply; and the drains of PMOS transistor M1 and NMOS transistor M2 are connected at node m as the input of inverter U1, whose output is the delayed output signal.
The working timing diagram of the delay unit of the present invention is shown in Figure 2. The data input terminal D controls whether MOS transistor M3 conducts. When D is "1", transistor M3 is on, and when input A changes from "0" to "1", node n discharges through transistors M3 and M4 in parallel. When D is "0", transistor M3 is off, and when A changes from "0" to "1", node n can only discharge through transistor M4, increasing the delay from A to Y. The delay from A to Y can therefore be controlled through the data input terminal D.
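A toy RC view of this cell (with assumed, purely illustrative component values) shows why enabling M3 shortens the A-to-Y delay: the two discharge paths act as parallel resistances at node n:

```python
# Toy RC model of the delay cell. R_M3, R_M4, C_NODE are assumed,
# purely illustrative values; only the parallel-resistance effect
# described in the text is being modeled.
R_M3 = 10e3     # assumed on-resistance of M3 (ohms)
R_M4 = 10e3     # assumed on-resistance of M4 (ohms)
C_NODE = 5e-15  # assumed capacitance at node n (farads)

def discharge_delay(d: int) -> float:
    """Approximate the A->Y delay as the RC constant of node n:
    with D=1, M3 and M4 discharge in parallel; with D=0, only M4."""
    r = R_M4 if d == 0 else (R_M3 * R_M4) / (R_M3 + R_M4)
    return r * C_NODE

# D=0 disables M3, so the discharge is slower and the delay larger.
assert discharge_delay(0) > discharge_delay(1)
```

This delay contrast between D=1 and D=0 is the per-cell quantity the chain sums, turning a bit count into a propagation time.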
As shown in Figure 3, the analog delay chain circuit of the present invention includes two parts: a delay chain module and a pulse generating circuit. The delay chain module is composed of n delay units and a D flip-flop. The data input terminals D of the delay units are connected to the peripheral input data; the signal output terminals Y1 and Y2 of delay chain 1 and delay chain 2 are connected to the data terminal and clock terminal of the D flip-flop, respectively; the clock signal output of the pulse generating circuit is connected to the clock signal input of the delay chain module; and the output signal Flag of the D flip-flop is the judgment signal for the delay comparison.
The overall structure of the delay chain module is shown in Fig. 4. The neural network weights W1, W2, ..., Wn and the input data X1, X2, ..., Xn are combined by XNOR operations, and the results D1, D2, ..., Dn are applied to the data inputs of the delay chain module. The module consists of two delay chains and a D flip-flop, each chain comprising n analog delay units. The inputs of delay chain 1 are the XNOR results D1, D2, ..., Dn of the weights and the image data; delay chain 2 is the reference chain, whose inputs are configured according to the computation required by each layer of the network. The outputs Y1 and Y2 of delay chain 1 and delay chain 2 are connected to the data input and clock input of the D flip-flop, respectively; the clock output of the pulse generating circuit is connected to the clock input of the delay chain module; and the D flip-flop output Flag is the decision signal for the delay comparison. During training, the data of each layer of the network is standardized so that the outputs follow the normal distribution N(0, 1), i.e., batch normalization (BN). Batch normalization is computed as shown in Equation 1.1:
y = γ · (x − μ_B) / √(σ_B² + ε) + β        (1.1)
where γ and β are the scaling factor and bias coefficient, parameters learned during training that apply an affine transformation to the activations so that the original input can be recovered; x is the input data set; μ_B is the mean of the input data set; σ_B is its standard deviation; and ε is a parameter added to keep the denominator from being zero, usually a very small positive constant.
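Equation 1.1 can be checked with a short numerical sketch (NumPy-based; the data values are arbitrary illustrative inputs, not from the patent):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Eq. 1.1: y = gamma * (x - mu_B) / sqrt(sigma_B^2 + eps) + beta
    mu_b = x.mean()
    var_b = x.var()
    return gamma * (x - mu_b) / np.sqrt(var_b + eps) + beta

x = np.array([1.0, 2.0, 3.0, 4.0])
y = batch_norm(x, gamma=1.0, beta=0.0)
# With gamma = 1 and beta = 0 the output is normalized to (approximately)
# zero mean and unit standard deviation, i.e. N(0, 1).
```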
A binarized neural network binarizes both the values in its weight matrices and all activation values (to +1 or −1). Because of this special form of computation, its batch normalization can be simplified; the batch normalization of a binarized network is computed as shown in Equation 1.2:
sign(γ · (x − μ_B) / √(σ_B² + ε) + β) = sign(x − b),  where b = μ_B − (β / γ) · √(σ_B² + ε), γ > 0        (1.2)
After this transformation, the batch normalization of the binarized network is absorbed into the bias. The bias value can therefore be applied directly to reference delay chain 2, whose inputs are configured according to the network training results.
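The folding of batch normalization into a single bias can be illustrated numerically. This sketch assumes γ > 0, and all parameter values are hypothetical, not trained results:

```python
import math

gamma, beta = 0.8, 0.3           # assumed trained BN parameters, gamma > 0
mu_b, var_b, eps = 5.0, 4.0, 1e-5

# Folded bias: sign(gamma*(x - mu_B)/sqrt(var_B + eps) + beta) == sign(x - b)
b = mu_b - beta * math.sqrt(var_b + eps) / gamma

def bn_sign(x):
    # Sign of the full batch-normalized activation (Eq. 1.1 followed by sign).
    return 1 if gamma * (x - mu_b) / math.sqrt(var_b + eps) + beta >= 0 else -1

def folded_sign(x):
    # Sign after folding BN into the single bias b (Eq. 1.2).
    return 1 if x - b >= 0 else -1
```

Because only the sign of the activation survives binarization, comparing the accumulation result against the folded bias b is equivalent to applying the full batch normalization, which is why the bias can be loaded onto reference delay chain 2.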
The operating timing diagram of the delay chain module is shown in Fig. 5. To compare the numbers of "1"s in delay chain 1 and delay chain 2, it suffices to compare the order in which signals Y1 and Y2 arrive; Y1 and Y2 are connected to the data input and clock input of the D flip-flop, respectively. In the first clock cycle, delay chain 1 contains more "1"s than delay chain 2, so Y1 arrives first and the D flip-flop captures "1". In the second clock cycle, delay chain 1 contains fewer "1"s than delay chain 2, so Y2 arrives first and the D flip-flop captures "0".
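The comparison mechanism can be modeled behaviorally. As before, the per-unit delay values are illustrative assumptions; a "1" bit selects the faster parallel discharge path:

```python
T_FAST, T_SLOW = 10e-12, 15e-12  # assumed per-unit delays for D = 1 / D = 0

def chain_delay(bits):
    # Total propagation delay of a chain: each '1' contributes the fast delay.
    return sum(T_FAST if b else T_SLOW for b in bits)

def flag(chain1_bits, chain2_bits):
    # Flag = 1 iff Y1 arrives before Y2, i.e. chain 1 holds more '1's
    # than the reference chain 2.
    return 1 if chain_delay(chain1_bits) < chain_delay(chain2_bits) else 0
```

The D flip-flop thus acts as an arbiter: the race between the two chains converts the popcount comparison into a single captured bit.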
The HSPICE simulation timing diagram of the delay chain module is shown in Fig. 6. When the number of "1"s in delay chain 1 is smaller than in delay chain 2, signal Y2 arrives first and the D flip-flop captures Flag = "0"; when it is larger, signal Y1 arrives first and the D flip-flop captures Flag = "1".
The pulse generating circuit of the present invention is shown in Fig. 7. It consists of three NAND gates, an inverter, and a configurable delay module; configuring the delay module with different delays adjusts the pulse width.
The operating timing diagram of the pulse generating circuit is shown in Fig. 8. Its principle is as follows: when CLK is low, nodes X and Qb are both high and node Y remains low. When CLK goes from low to high, node Qb first falls from high to low, which drives node Y from low to high and node X from high to low; node Qb then returns from low to high. The time taken to complete this sequence is the width of the generated pulse, so the pulse width is determined jointly by the delay module and the delays of the three NAND gates.
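As a first-order sketch, the pulse width can be estimated as the sum of the configurable delay and the gate delays along the feedback loop. Both `t_delay` and `t_nand` are hypothetical values, and the single-traversal assumption is an approximation, not a statement from the patent:

```python
def pulse_width(t_delay, t_nand, n_nand=3):
    # First-order estimate: the Qb -> Y -> X -> Qb sequence traverses the
    # configurable delay module once and each of the three NAND gates once.
    return t_delay + n_nand * t_nand

# Enlarging the configurable delay widens the generated pulse.
```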
In a concrete implementation, to demonstrate the advantage in computational power consumption, the design was compared with a conventional adder structure (synthesized from the full adders in the standard cell library provided by the foundry). The accumulation of 64 single-bit values was implemented with both the conventional digital adder structure and the structure designed here, and the number of "1"s after accumulation was determined. Table 1 compares the two structures: for the same 64 single-bit accumulation, the proposed design saves 57% of the power and improves performance by 33.3%.
Table 1. Comparison of the conventional 64-bit single-bit digital accumulation structure with this design (0.81 V, 125 °C, SS corner)
Figure PCTCN2019114252-appb-000003
Although the present invention has been shown and described above with reference to certain preferred embodiments, this should not be construed as limiting the invention itself. Various changes in form and detail may be made without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

  1. An analog delay unit, characterized in that a digital input signal controls the delay of a clock input signal, comprising: a PMOS transistor (M1), a first NMOS transistor (M2), a second NMOS transistor (M3), a third NMOS transistor (M4), and an inverter (U1); the gate of the PMOS transistor (M1) and the gate of the first NMOS transistor (M2) are connected together to the clock input signal; the drain of the PMOS transistor (M1) and the drain of the first NMOS transistor (M2) are connected together to the input of the inverter (U1); the gate of the second NMOS transistor (M3) is connected to the digital input signal; the drain of the second NMOS transistor (M3) and the drain of the third NMOS transistor (M4) are connected together to the source of the first NMOS transistor (M2); the source of the PMOS transistor (M1) and the gate of the third NMOS transistor (M4) are both connected to the power supply; and the source of the second NMOS transistor (M3) and the source of the third NMOS transistor (M4) are connected together to ground.
  2. An analog delay chain, characterized in that it is formed by connecting in series a plurality of analog delay units according to claim 1, the digital signal input of each subsequent analog delay unit being connected to the output of the preceding analog delay unit.
  3. A binarized neural network accumulator circuit, characterized by comprising two analog delay chains according to claim 2 and a D flip-flop; the clock signal inputs of the two analog delay chains receive the same pulsed clock signal; the digital data input of each analog delay unit in the first analog delay chain receives the convolution result of the binarized neural network layer's weight parameters and input feature map data; the digital data input of each delay unit in the second analog delay chain receives a reference value corresponding to the computation result of each convolution unit in the binarized neural network layer; the data input of the D flip-flop is connected to the output of the first analog delay chain, and the clock input of the D flip-flop is connected to the output of the second analog delay chain; the D flip-flop compares the arrival order of the output signals of the two analog delay chains and outputs a flag signal.
  4. The binarized neural network accumulator circuit according to claim 3, characterized in that the convolution result of the binarized neural network layer's weight parameters and the input feature map data is obtained by performing an XNOR operation on the weight data and the input feature map data; the digital data input of each analog delay unit in the first analog delay chain is connected to the output of an XNOR gate, the two inputs of which are connected to the weight data and the input feature map data of one convolution unit, respectively.
  5. The binarized neural network accumulator circuit according to claim 3, characterized in that the reference value of the computation result of each convolution unit in the binarized neural network layer is the bias value of each network layer obtained by training.
  6. The binarized neural network accumulator circuit according to claim 3, characterized in that the same pulsed clock signal applied to the clock signal inputs of the two analog delay chains is provided by a pulse generating circuit, the pulse generating circuit comprising first to third NAND gates, a delay module, and an inverter; the two inputs of the first NAND gate are connected to the clock signal and the output of the third NAND gate, respectively; the input of the delay module and the input of the inverter are both connected to the output of the first NAND gate; the two inputs of the second NAND gate are connected to the output of the delay module and the output of the third NAND gate, respectively; the two inputs of the third NAND gate are connected to the output of the second NAND gate and the clock signal, respectively; and the inverter outputs the pulsed clock signal to the clock signal inputs of the two analog delay chains.
PCT/CN2019/114252 2019-07-01 2019-10-30 Binary neural network accumulator circuit based on analogue delay chain WO2021000469A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910584269.1 2019-07-01
CN201910584269.1A CN110428048B (en) 2019-07-01 2019-07-01 Binaryzation neural network accumulator circuit based on analog delay chain

Publications (1)

Publication Number Publication Date
WO2021000469A1 true WO2021000469A1 (en) 2021-01-07

Family

ID=68409900

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/114252 WO2021000469A1 (en) 2019-07-01 2019-10-30 Binary neural network accumulator circuit based on analogue delay chain

Country Status (2)

Country Link
CN (1) CN110428048B (en)
WO (1) WO2021000469A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115167093A (en) * 2022-07-20 2022-10-11 星汉时空科技(长沙)有限公司 Time interval precision measurement method and system based on FPGA
CN116720468A (en) * 2023-06-12 2023-09-08 南京邮电大学 Method for constructing unit library time sequence model by combining neural network

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115051700A (en) * 2021-03-09 2022-09-13 长鑫存储技术(上海)有限公司 Interleaved signal generating circuit
EP4203319A1 (en) 2021-03-09 2023-06-28 Changxin Memory Technologies, Inc. Interleaved signal generating circuit
EP4203316A4 (en) 2021-03-09 2024-08-14 Changxin Memory Tech Inc Signal output circuit and delay signal output circuit

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5155699A (en) * 1990-04-03 1992-10-13 Samsung Electronics Co., Ltd. Divider using neural network
CN106909970A (en) * 2017-01-12 2017-06-30 南京大学 A kind of two-value weight convolutional neural networks hardware accelerator computing module based on approximate calculation
CN107657312A (en) * 2017-09-18 2018-02-02 东南大学 Towards the two-value real-time performance system of voice everyday words identification
CN110414677A (en) * 2019-07-11 2019-11-05 东南大学 It is a kind of to deposit interior counting circuit suitable for connect binaryzation neural network entirely

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1761153B (en) * 2005-11-04 2010-05-05 清华大学 High-speed master-slave type D trigger in low power consumption
US20150269482A1 (en) * 2014-03-24 2015-09-24 Qualcomm Incorporated Artificial neural network and perceptron learning using spiking neurons
CN107194462B (en) * 2016-03-15 2020-05-19 清华大学 Three-value neural network synapse array and neuromorphic computing network system using same
WO2019032870A1 (en) * 2017-08-09 2019-02-14 Google Llc Accelerating neural networks in hardware using interconnected crossbars
CN109635943B (en) * 2018-12-13 2022-03-18 佛山眼图科技有限公司 Digital-analog hybrid neuron circuit

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115167093A (en) * 2022-07-20 2022-10-11 星汉时空科技(长沙)有限公司 Time interval precision measurement method and system based on FPGA
CN115167093B (en) * 2022-07-20 2024-02-20 星汉时空科技(长沙)有限公司 Time interval precise measurement method and system based on FPGA
CN116720468A (en) * 2023-06-12 2023-09-08 南京邮电大学 Method for constructing unit library time sequence model by combining neural network
CN116720468B (en) * 2023-06-12 2024-01-19 南京邮电大学 Method for constructing unit library time sequence model by combining neural network

Also Published As

Publication number Publication date
CN110428048B (en) 2021-11-09
CN110428048A (en) 2019-11-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19936231

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19936231

Country of ref document: EP

Kind code of ref document: A1
