WO2020029551A1 - Multiply-add calculation method and calculation circuit for neural networks (一种适用于神经网络的乘加计算方法和计算电路) - Google Patents


Info

Publication number
WO2020029551A1
WO2020029551A1 (application PCT/CN2019/072892)
Authority
WO
WIPO (PCT)
Prior art keywords
input
multiplication
product
group
neural network
Prior art date
Application number
PCT/CN2019/072892
Other languages
English (en)
French (fr)
Inventor
刘波
龚宇
葛伟
杨军
时龙兴
Original Assignee
东南大学 (Southeast University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 东南大学 (Southeast University)
Priority to US16/757,421 (granted as US10984313B2)
Publication of WO2020029551A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using non-contact-making devices, for evaluating functions by calculation
    • G06F7/5443 Sum of products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50 Adding; Subtracting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 Multiplying; Dividing
    • G06F7/523 Multiplying only
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00 Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38 Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/48 Indexing scheme relating to groups G06F7/48 - G06F7/575
    • G06F2207/4802 Special implementations
    • G06F2207/4818 Threshold devices
    • G06F2207/4824 Neural networks

Definitions

  • The invention relates to the technical field of analog integrated circuits, and in particular to a multiply-add calculation method and circuit for neural networks.
  • A neural network is a computing model consisting of a large number of interconnected nodes. Each node represents an excitation (activation) function, and each connection between two nodes carries a weight.
  • The output of an artificial neural network depends on the network's connection pattern, weight values, and excitation functions.
  • Great progress has been made in artificial neural networks and deep learning.
  • The large-scale parallelism of neural networks makes them potentially capable of processing certain tasks quickly, so they have shown good intelligent characteristics in pattern recognition, intelligent robotics, automatic control, prediction and estimation, biology, medicine, economics, and other fields.
  • Neural networks involve a large number of complex, parallel multiply-add operations, which demand substantial computing resources.
  • General-purpose computers are typically used to process the data.
  • This data-processing approach largely limits the application of neural networks in real-time systems.
  • Taking a convolutional neural network as an example, a single inference involves a very large number of multiplication and addition operations.
  • When traditional digital methods implement multiplication on wide multiplier operands, they occupy a large area and consume considerable power. It is therefore important to propose a multiply-add method suited to neural networks.
  • The technical problem to be solved by the present invention is to propose a multiply-add calculation method and circuit suitable for neural networks that meet the computation-scale and accuracy requirements of the network while completing calculation tasks at low power and high speed.
  • To this end, the present invention adopts the following technical solutions:
  • A multiply-add calculation circuit suitable for a neural network, including a multiplication calculation circuit array and an accumulation calculation circuit.
  • The multiplication calculation circuit array is formed by cascading M identical multiplication calculation circuits. Each multiplication calculation circuit multiplies the input data of the neural network by a weight coefficient of the neural network and feeds the data on each bit of the resulting product, in order, into the accumulation calculation circuit.
  • The accumulation calculation circuit accumulates, in the time domain, the data on corresponding bits of each group of products output by the multiplication calculation circuit array; a TDC (time-to-digital converter) circuit converts the time-quantized result into a digital quantity, and an addition-shift operation then yields the output data of the neural network.
  • The multiplication calculation circuit array has 2M groups of input terminals and M groups of output terminals; the accumulation calculation circuit has M groups of input terminals and one group of output terminals. The M groups of inputs of the accumulation calculation circuit are connected to the M groups of outputs of the multiplication calculation circuit array, respectively.
  • Each multiplication calculation circuit has two groups of inputs and one group of outputs. Each group of input data of the neural network enters through the first input of its multiplication calculation circuit; each group of weight coefficients enters through the second input; the output data of the neural network is delivered from the output of the accumulation calculation circuit.
  • The first input of the m-th multiplication calculation circuit receives the m-th input data, which is an 8-bit value.
  • The second input of the m-th multiplication calculation circuit receives the m-th group of weight coefficients, comprising eight 8-bit values: the 1st through 8th weight coefficients of group m.
  • The output of the m-th multiplication calculation circuit delivers the m-th group of products, comprising eight groups of data: the 1st through 8th products of group m.
  • Each multiplication calculation circuit consists of one multiplication array unit and eight independent selection-shift units. The multiplication array unit multiplies the input data of the neural network by the feature multipliers to obtain the feature products. Each selection-shift unit selects and shifts a feature product according to a weight coefficient of the neural network, thereby obtaining the product of the input data and that weight coefficient, and feeds the data on each bit of the product, in order, into the accumulation calculation circuit.
  • The eight selection-shift units are referred to as the first through eighth selection-shift units.
  • The multiplication array unit has one group of inputs and one group of outputs; each selection-shift unit has two groups of inputs and one group of outputs. The input of the multiplication array unit is connected to the first input of the multiplication calculation circuit; the output of the multiplication array unit is shared by the first inputs of all eight selection-shift units; the second input of each selection-shift unit is connected to the second input of the multiplication calculation circuit; and the outputs of the selection-shift units are connected to the outputs of the multiplication calculation circuit, respectively.
  • The multiplication array unit outputs n feature products, each an 8-bit value, where n is 8, 4, 2, or 1. The n feature products serve as shared input data for the first through eighth selection-shift units, entering through their first inputs; the second inputs of the first through eighth selection-shift units receive the first through eighth weight coefficients, respectively; and the first through eighth selection-shift units output the data on each bit of the first through eighth products, in order.
  • The multiplication array unit is composed of eight, four, two, or one calculation sub-units, configurations referred to as eighth-order, fourth-order, second-order, and first-order quantization, respectively.
  • The feature multipliers of the eighth-order quantization multiplication array unit are 1, 3, 5, 7, 9, 11, 13, and 15; those of the fourth-order unit are 1, 3, 5, and 7; those of the second-order unit are 1 and 3; that of the first-order unit is 1.
  • The feature products of the eighth-order quantization multiplication array unit are 1×Input, 3×Input, 5×Input, 7×Input, 9×Input, 11×Input, 13×Input, and 15×Input; those of the fourth-order unit are 1×Input, 3×Input, 5×Input, and 7×Input; those of the second-order unit are 1×Input and 3×Input; that of the first-order unit is 1×Input.
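The quantization orders and their feature products described above can be sketched as follows (a behavioural model, not the circuit implementation; the function and constant names are illustrative, not from the patent):

```python
# Feature multipliers per quantization order, as listed in the text.
FEATURE_MULTIPLIERS = {
    8: [1, 3, 5, 7, 9, 11, 13, 15],  # eighth-order quantization
    4: [1, 3, 5, 7],                 # fourth-order quantization
    2: [1, 3],                       # second-order quantization
    1: [1],                          # first-order quantization
}

def feature_products(input_value, order):
    """Return the n feature products k*Input for the chosen order."""
    return [k * input_value for k in FEATURE_MULTIPLIERS[order]]

print(feature_products(10, 4))  # [10, 30, 50, 70]
```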
  • The accumulation calculation circuit is composed of a delay accumulation circuit, a TDC conversion circuit, and an addition-shift circuit connected in series.
  • The delay accumulation circuit has M groups of inputs and eight groups of outputs; the TDC conversion circuit has eight groups of inputs and eight groups of outputs; the addition-shift circuit has eight groups of inputs and one group of outputs.
  • The M groups of inputs of the delay accumulation circuit are connected to the M groups of outputs of the multiplication calculation circuit array, respectively; the eight groups of outputs of the delay accumulation circuit are connected to the eight groups of inputs of the TDC conversion circuit; the eight groups of outputs of the TDC conversion circuit are connected correspondingly to the eight inputs of the addition-shift circuit; the output of the addition-shift circuit is the output of the accumulation calculation circuit.
  • The delay accumulation circuit is a summing array composed of eight independent controllable delay chains; the number of delay chains equals the bit width of the products output by the multiplication calculation circuit array. Each controllable delay chain accumulates, in the time domain, the data on one particular bit across the M groups of products.
  • Each controllable delay chain is formed by connecting M controllable delay blocks in series. Delay blocks in odd positions are triggered on the rising clock edge; those in even positions are triggered on the falling edge. Each controllable delay block has two inputs: the first receives the time reference signal, and the second is connected to one of the M groups of inputs of the delay accumulation circuit to receive the corresponding bit of one product. Each controllable delay block has an output that delivers the time signal with the delay amount superimposed.
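A minimal behavioural sketch of one controllable delay chain, under the delay rule stated later in the method (0 adds Δt, 1 adds 2Δt): the total chain delay then encodes the number of 1 bits, which the TDC can recover. The names and the unit delay `dt` are illustrative assumptions.

```python
def delay_chain(bits, dt=1.0):
    """Accumulate one bit position of M products in the time domain."""
    delay = 0.0
    for b in bits:
        delay += 2 * dt if b else dt  # input 0 -> dt, input 1 -> 2*dt
    return delay

def tdc(delay, m, dt=1.0):
    """Time-to-digital conversion: recover the count of 1 bits."""
    return round((delay - m * dt) / dt)

bits = [1, 0, 1, 1, 0]
d = delay_chain(bits)
assert tdc(d, len(bits)) == sum(bits)  # popcount recovered from delay
```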
  • the multiplication and addition calculation method includes a multiplication calculation method and an accumulation calculation method. The specific steps are as follows:
  • Step 101: An on-chip training method quantizes, in real time, the order of the multiplication array unit in each multiplication calculation circuit, and the resulting n-th order quantization multiplication array unit is used for the actual neural network calculation.
  • Step 102: In each multiplication calculation circuit, the n-th order quantization multiplication array unit supplies its n feature products as shared input to the first through eighth selection-shift units.
  • Step 103: In each multiplication calculation circuit, within one calculation period, each selection-shift unit compares the decimal values of the upper four bits and of the lower four bits of its weight coefficient with the feature multipliers of the n-th order quantization multiplication array unit, and outputs its group's product as follows:
  • Step 103-A: When the decimal value of the upper four bits (or of the lower four bits) of the weight coefficient matches a feature multiplier and is not 0, the corresponding feature product is selected and the data on each of its bits is output directly; proceed to step 104.
  • Step 103-B: When the decimal value of the upper four bits (or of the lower four bits) of the weight coefficient does not match any feature multiplier and is not 0, a shift operation is performed on a feature product and the data on each bit of the result is output; proceed to step 104.
  • Step 103-C: When the decimal value of the upper four bits (or of the lower four bits) of the weight coefficient is 0, 0 is output directly; proceed to step 104.
  • Step 104: The data on bit i of the products of the first through M-th groups is input, in order, into the M controllable delay blocks of the (i+1)-th delay chain, where i is any natural number from 0 to 7.
  • The controllable delay block outputs a delay amount that depends on its input data, as follows: when the input data is 0, the output delay is Δt; when the input data is 1, the output delay is 2Δt; outside the trigger edge, the output delay is Δt regardless of whether the input data is 0 or 1.
  • Step 105: The (i+1)-th controllable delay chain accumulates, in the time domain, the data on bit i of the products of the first through M-th groups. When the calculation scale of the neural network exceeds the number M of cascaded controllable delay blocks, the number of iterations of each controllable delay chain is controlled dynamically.
  • Step 106: A TDC conversion circuit converts the delay amount output by each controllable delay chain into a decimal digital quantity.
  • Step 107: The addition-shift circuit performs addition and right-shift operations on these digital quantities to obtain the output data of the neural network.
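The end-to-end flow of steps 102 through 107 can be sketched numerically. This is a simplifying model: the products are formed digitally here, and the time-domain accumulation plus TDC conversion of steps 104-106 is modelled as a per-bit count of 1s across the M groups, followed by the shift-and-add recombination of step 107. All names are illustrative.

```python
def multiply_add(inputs, weights, bits=16):
    """Model of the multiply-add circuit for M (input, weight) pairs."""
    # Steps 102-103: one product per input/weight pair.
    products = [x * w for x, w in zip(inputs, weights)]
    # Steps 104-106: delay chain i accumulates bit i of every product;
    # the TDC output is this per-bit count of 1s across the M groups.
    counts = [sum((p >> i) & 1 for p in products) for i in range(bits)]
    # Step 107: addition-shift recombines the per-bit counts.
    return sum(c << i for i, c in enumerate(counts))

# The per-bit accumulation reproduces the exact sum of products.
assert multiply_add([3, 5], [7, 2]) == 3 * 7 + 5 * 2
```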
  • Step 201: Initialize the neural network weight coefficients and set up the training data set to obtain a pre-trained neural network NN0.
  • Step 202 comprises two parts carried out simultaneously, as follows:
  • Step 202-A: Test the pre-trained neural network NN0 with the test data set to obtain the initial network accuracy A.
  • Step 202-B: Initialize the quantization order of the multiplication array unit to first order, and replace the standard multiplication calculation circuit in the pre-trained network NN0 with the first-order quantized multiplication calculation circuit to obtain a first-order quantized neural network NN1; train NN1 with the test data set to obtain the actual network accuracy B.
  • Step 203: Introduce the network accuracy limit Q and compare B with A×Q: if B > A×Q, proceed to step 209; if B ≤ A×Q, proceed to step 204.
  • Step 204: Increase the quantization order of the multiplication array unit to second order, and replace the standard multiplication calculation circuit in NN0 with the second-order quantized multiplication calculation circuit to obtain a second-order quantized neural network NN2; train NN2 with the test data set to obtain the actual network accuracy B.
  • Step 205: Compare B with A×Q: if B > A×Q, proceed to step 209; if B ≤ A×Q, proceed to step 206.
  • Step 206: Increase the quantization order to fourth order, and replace the standard multiplication calculation circuit in NN0 with the fourth-order quantized multiplication calculation circuit to obtain a fourth-order quantized neural network NN4; train NN4 with the test data set to obtain the actual network accuracy B.
  • Step 207: Compare B with A×Q: if B > A×Q, proceed to step 209; if B ≤ A×Q, proceed to step 208.
  • Step 208: Increase the quantization order to eighth order, and replace the standard multiplication calculation circuit in NN0 with the eighth-order quantized multiplication calculation circuit to obtain an eighth-order quantized neural network NN8; proceed to step 209.
  • Step 209: End the on-chip training of the quantization order, and perform the actual neural network calculation with the multiplication array unit of the current quantization order.
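The order-search loop of steps 201-209 amounts to trying orders 1, 2, 4, 8 in turn and stopping at the first one whose accuracy B clears the limit. In this sketch the accuracy evaluation is a placeholder callback, the accuracy model in the example is fabricated for illustration, and the acceptance test is interpreted as the multiplicative bound B > A×Q (an assumption about the garbled symbol in the source).

```python
def choose_quantization_order(eval_accuracy, baseline_a, q):
    """Return the lowest order whose accuracy B satisfies B > A*Q."""
    for order in (1, 2, 4, 8):
        b = eval_accuracy(order)             # steps 202-B/204/206/208
        if b > baseline_a * q or order == 8: # steps 203/205/207, stop at 8
            return order

# Fabricated accuracy-per-order model, for illustration only.
fake_acc = {1: 0.80, 2: 0.88, 4: 0.93, 8: 0.95}
order = choose_quantization_order(lambda n: fake_acc[n], 0.95, 0.97)
print(order)  # 4: first order with accuracy above 0.95 * 0.97
```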
  • Shifting the feature product 1×Input left by 1 bit gives 2×Input; left by 2 bits gives 4×Input; left by 3 bits gives 8×Input. The shifted value 2×Input also approximates 3×Input; 4×Input approximates 5×Input, 6×Input, and 7×Input; and 8×Input approximates 9×Input, 10×Input, 11×Input, 12×Input, 13×Input, 14×Input, and 15×Input.
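The shift approximation above maps every 4-bit multiplier value onto the power of two at or below it, which a single left shift can produce from 1×Input. A sketch (the helper name is illustrative):

```python
def approx_multiply(inp, w):
    """Approximate w*inp with one left shift of the feature product.

    Per the mapping above: w=1 -> inp, w=2..3 -> 2*inp,
    w=4..7 -> 4*inp, w=8..15 -> 8*inp.
    """
    if w == 0:
        return 0
    # bit_length()-1 is floor(log2(w)), i.e. the shift amount.
    return inp << (w.bit_length() - 1)

assert approx_multiply(10, 3) == 20  # 3*Input approximated by 2*Input
assert approx_multiply(10, 7) == 40  # 7*Input approximated by 4*Input
```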
  • In step 105, the number of iterations of each controllable delay chain is dynamically controlled; the specific steps are as follows:
  • Step 301: The calculation scale of the neural network is W; the W data are divided into K blocks of M data plus one block of N data, where K is an integer greater than or equal to 1 and N is an integer greater than or equal to 1 and less than M.
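The partitioning of step 301 is a division with remainder: K full passes of the M-block delay chain plus one partial pass of N data. A sketch (function name assumed; the source requires N ≥ 1, whereas plain divmod can also return a remainder of 0):

```python
def partition(w, m):
    """Split a layer of W values into (K, N): K full blocks of M, N left over."""
    k, n = divmod(w, m)
    return k, n

print(partition(100, 32))  # (3, 4): 3 full chain passes of 32, then 4
```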
  • Adopting the above technical solutions, the present invention has the following technical effects:
  • In each multiplication calculation circuit, the eight selection-shift units share the output of a single multiplication array unit. Within the same calculation cycle, the eight selection-shift units can therefore perform multiplication calculations simultaneously, which significantly raises the circuit's operation rate. Moreover, since each multiplication calculation circuit contains only one multiplication array unit, the power consumption of the multiplication circuit is significantly reduced compared with the prior art, and circuit throughput is improved.
  • The delay accumulation circuit is a summing array built from controllable delay chains. Based on the superposition of delay signals, the number of iterations of the summing array is adjusted dynamically according to the calculation scale of each layer of the neural network. Controlling the iteration count of the delay chains accommodates differences in calculation scale between network layers, saves hardware storage space, reduces calculation complexity, and reduces data scheduling.
  • Digital quantities are converted directly into time quantities for the accumulation operation. After the desired number of iterations, the TDC circuit converts the final accumulated time back into a digital quantity.
  • The entire accumulation process is completed in the time domain, eliminating the influence of non-ideal effects in the external circuit; while ensuring accumulation accuracy, the circuit complexity is reduced and the circuit is easy to implement.
  • The fast switching speed and high efficiency of time-domain circuits enable the delay accumulation circuit to operate at low power and high speed, meeting the needs of practical applications.
  • FIG. 1 is a schematic block diagram of a multiply-add calculation circuit suitable for a neural network according to the present invention;
  • FIG. 2 is a block diagram of one multiplication calculation circuit in the multiply-add calculation circuit according to the present invention;
  • FIG. 3 is a control logic diagram for dynamically adjusting the quantization order of the multiplication array unit in real time using the on-chip training method of the present invention, taking a DNN network as an example;
  • FIG. 4 is a schematic diagram of the delay accumulation circuit in the multiply-add calculation circuit according to the present invention.
  • FIG. 1 is a schematic block diagram of a multiply-add calculation circuit suitable for a neural network according to the present invention.
  • The multiply-add calculation circuit shown in the figure includes a multiplication calculation circuit array and an accumulation calculation circuit.
  • The multiplication calculation circuit array is formed by cascading M identical multiplication calculation circuits. Each multiplication calculation circuit multiplies the input data of the neural network by a weight coefficient of the neural network and feeds the data on each bit of the resulting product, in order, into the accumulation calculation circuit.
  • The accumulation calculation circuit accumulates, in the time domain, the data on corresponding bits of each group of products output by the multiplication calculation circuit array; a TDC circuit converts the time-quantized result into a digital quantity, and an addition-shift operation then yields the output data of the neural network.
  • The multiplication calculation circuit array has 2M groups of input terminals and M groups of output terminals; the accumulation calculation circuit has M groups of input terminals and one group of output terminals. The M groups of inputs of the accumulation calculation circuit are connected to the M groups of outputs of the multiplication calculation circuit array, respectively.
  • Each multiplication calculation circuit has two groups of inputs and one group of outputs. Each group of input data of the neural network enters through the first input of its multiplication calculation circuit; each group of weight coefficients enters through the second input; the output data of the neural network is delivered from the output of the accumulation calculation circuit.
  • The accumulation calculation circuit is composed of a delay accumulation circuit, a TDC conversion circuit, and an addition-shift circuit connected in series.
  • The delay accumulation circuit has M groups of inputs and eight groups of outputs; the TDC conversion circuit has eight groups of inputs and eight groups of outputs; the addition-shift circuit has eight groups of inputs and one group of outputs.
  • The M groups of inputs of the delay accumulation circuit are connected to the M groups of outputs of the multiplication calculation circuit array, respectively; the eight groups of outputs of the delay accumulation circuit are connected to the eight groups of inputs of the TDC conversion circuit; the eight groups of outputs of the TDC conversion circuit are connected correspondingly to the eight inputs of the addition-shift circuit; the output of the addition-shift circuit is the output of the accumulation calculation circuit.
  • This embodiment describes the output data of the multiplication calculation circuit with reference to FIG. 1, as follows:
  • The first input of the m-th multiplication calculation circuit receives the m-th input data, which is an 8-bit value. The second input of the m-th multiplication calculation circuit receives the m-th group of weight coefficients, comprising eight 8-bit values: the 1st through 8th weight coefficients of group m.
  • The output of the m-th multiplication calculation circuit delivers the m-th group of products, comprising eight groups of data, as follows:
  • 1st product of group m: P_m-1 = (m-th input data) × (1st weight coefficient of group m);
  • 2nd product of group m: P_m-2 = (m-th input data) × (2nd weight coefficient of group m);
  • 3rd product of group m: P_m-3 = (m-th input data) × (3rd weight coefficient of group m);
  • 4th product of group m: P_m-4 = (m-th input data) × (4th weight coefficient of group m);
  • 5th product of group m: P_m-5 = (m-th input data) × (5th weight coefficient of group m);
  • 6th product of group m: P_m-6 = (m-th input data) × (6th weight coefficient of group m);
  • 7th product of group m: P_m-7 = (m-th input data) × (7th weight coefficient of group m);
  • 8th product of group m: P_m-8 = (m-th input data) × (8th weight coefficient of group m).
  • The first through eighth products of group m each contain eight 1-bit data, so the m-th group of products forms an 8×8 bit matrix.
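This 8×8 bit-matrix view can be sketched as follows (a behavioural model with illustrative names; 8-bit products are assumed, as in the text, so each product is truncated to its low eight bits here):

```python
def product_bit_matrix(input_value, weight_coeffs, bits=8):
    """Rows: the 8 products of one input with its 8 weights; columns: bit i."""
    mask = (1 << bits) - 1
    products = [(input_value * w) & mask for w in weight_coeffs]
    # Row j, column i holds bit i of product P_{m-(j+1)}.
    return [[(p >> i) & 1 for i in range(bits)] for p in products]

m = product_bit_matrix(3, [1, 2, 3, 4, 5, 6, 7, 8])
assert len(m) == 8 and len(m[0]) == 8  # an 8x8 matrix of 1-bit data
```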
  • FIG. 2 is a block diagram of one multiplication calculation circuit in the multiply-add calculation circuit suitable for a neural network according to the present invention.
  • each set of multiplication calculation circuits is composed of a set of multiplication array units and eight independent selection shift units; the multiplication array unit is used to multiply the input data of the neural network and the feature multiplier. Get the feature product; select the shift unit, select and shift the feature product based on the weight coefficient of the neural network, get the product of the input data of the neural network and the weight coefficient of the neural network, and place the data on each bit of the product in order Input to the accumulation calculation circuit.
  • the eight sets of selection shift units include a first selection shift unit, a second selection shift unit, a third selection shift unit, a fourth selection shift unit, a fifth selection shift unit, a sixth selection shift unit, a first Seven selection shift units and eighth selection shift unit.
  • the multiplication array unit has one set of inputs and one set of outputs; each selection-shift unit has two sets of inputs and one set of outputs; the input of the multiplication array unit is connected to the first input of the multiplication calculation circuit; the output of the multiplication array unit forms a shared connection with the first input of every selection-shift unit; the second input of each selection-shift unit is connected to the second input of the multiplication calculation circuit; the outputs of the selection-shift units are connected to the outputs of the multiplication calculation circuit.
  • the multiplication array unit outputs n feature products, each an 8-bit value, where n is 8, 4, 2, or 1; the n feature products serve as shared input data, entering through the first input of each of the first through eighth selection-shift units; the second inputs of the first through eighth selection-shift units receive the first through eighth weight coefficients in corresponding order; the first through eighth selection-shift units output, in order, the data on each bit of the first through eighth products.
  • the multiplication array unit consists of eight, four, two, or one calculation sub-units, referred to as eighth-order quantization, fourth-order quantization, second-order quantization, and first-order quantization, respectively.
  • the feature multipliers of the eighth-order quantized multiplication array unit are 1, 3, 5, 7, 9, 11, 13, 15; those of the fourth-order unit are 1, 3, 5, 7; those of the second-order unit are 1, 3; and that of the first-order unit is 1.
  • correspondingly, the feature products of the eighth-order quantized multiplication array unit are 1×Input, 3×Input, 5×Input, 7×Input, 9×Input, 11×Input, 13×Input, 15×Input; those of the fourth-order unit are 1×Input, 3×Input, 5×Input, 7×Input; those of the second-order unit are 1×Input, 3×Input; and that of the first-order unit is 1×Input.
  • the feature products are shifted according to the quantization order of the multiplication array unit. For the first-order unit, for example: shifting the feature product 1×Input left by 1 bit gives 2×Input, which also approximates 3×Input; shifting it left by 2 bits gives 4×Input, which approximates 5×Input, 6×Input, and 7×Input; shifting it left by 3 bits gives 8×Input, which approximates 9×Input, 10×Input, 11×Input, 12×Input, 13×Input, 14×Input, and 15×Input.
  • as a result, the eighth-order quantized multiplication circuit computes all 16 products exactly; the fourth-order circuit computes 12 products exactly and approximates 4; the second-order circuit computes 8 exactly and approximates 8; the first-order circuit, which effectively eliminates the multiplication array and feeds the input data directly to the selection-shift units, computes 5 products exactly and approximates 11.
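A minimal Python sketch of this select-and-shift scheme (the feature-multiplier sets and approximation mappings are transcribed from the description above; the helper names and the two-nibble recombination are my own assumptions) reproduces the exact/approximate counts stated here:

```python
# Feature multipliers per quantization order, as given in the description.
FEATURES = {8: [1, 3, 5, 7, 9, 11, 13, 15],
            4: [1, 3, 5, 7],
            2: [1, 3],
            1: [1]}

def exact_values(order):
    """4-bit values representable exactly as feature << shift, plus 0."""
    vals = {0}
    for f in FEATURES[order]:
        for s in range(4):
            if f << s <= 15:
                vals.add(f << s)
    return vals

# Approximation tables transcribed from the description, for values that
# are not exactly representable at the given quantization order.
APPROX = {1: {3: 2, 5: 4, 6: 4, 7: 4, 9: 8, 10: 8, 11: 8,
              12: 8, 13: 8, 14: 8, 15: 8},
          2: {5: 4, 7: 6, 9: 8, 10: 8, 11: 12, 13: 12, 14: 12, 15: 12},
          4: {9: 8, 11: 12, 13: 12, 15: 12},
          8: {}}

def nibble_product(v, order):
    """Value actually produced by the select-shift unit for nibble v."""
    return v if v in exact_values(order) else APPROX[order][v]

def approx_multiply(inp, weight, order):
    """Split the 8-bit weight into high/low nibbles, form each partial
    product by select-and-shift, and combine (high nibble weighs 16)."""
    hi, lo = weight >> 4, weight & 0xF
    return (nibble_product(hi, order) * 16 + nibble_product(lo, order)) * inp
```

Evaluating `len(exact_values(n))` for n = 8, 4, 2, 1 gives 16, 12, 8, and 5, matching the counts in the paragraph above; eighth-order multiplication is always exact.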
  • FIG. 3 is a control logic diagram for dynamically adjusting the quantization order of the multiplication array unit in real time by using an on-chip training method according to the present invention.
  • on-chip training of a DNN is taken as the preferred example; on-chip training quantizes the order of each multiplication array unit to fourth order, as follows:
  • Step 201: initialize the DNN weight coefficients; set up the training data set; obtain the pre-trained DNN.
  • Step 202 comprises two parts carried out simultaneously, as follows:
  • Step 202-A: test the pre-trained DNN with the test data set to obtain the initial network accuracy A.
  • Step 202-B: initialize the quantization order of the multiplication array unit to first order, and replace the standard multiplication circuits in the pre-trained DNN with first-order quantized multiplication circuits to obtain DNN 1 ; train DNN 1 with the test data set to obtain the actual network accuracy B.
  • Step 203: introduce the network accuracy limit Q; here B < A×Q, so proceed to step 204.
  • Step 204: raise the quantization order of the multiplication array unit to second order, and replace the standard multiplication circuits in the pre-trained DNN with second-order quantized multiplication circuits to obtain DNN 2 ; train DNN 2 with the test data set to obtain the actual network accuracy B.
  • Step 205: introduce the network accuracy limit Q; here B < A×Q, so proceed to step 206.
  • Step 206: raise the quantization order of the multiplication array unit to fourth order, and replace the standard multiplication circuits in the pre-trained DNN with fourth-order quantized multiplication circuits to obtain DNN 4 ; train DNN 4 with the test data set to obtain the actual network accuracy B.
  • Step 207: introduce the network accuracy limit Q; now B > A×Q, so end the on-chip training of the quantization order and perform the actual DNN computation with the fourth-order quantized multiplication array unit.
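The training loop above can be sketched as a simple search over quantization orders (a minimal model; `test_accuracy` stands in for training and testing the quantized network, and the accuracy numbers are made up to mimic the embodiment in which fourth order is the first to pass the bound):

```python
def select_quantization_order(test_accuracy, baseline_acc, q_limit):
    """Raise the quantization order 1 -> 2 -> 4 -> 8, stopping at the
    first order whose accuracy B exceeds the bound A * Q
    (steps 203, 205, 207 above)."""
    for order in (1, 2, 4):
        b = test_accuracy(order)
        if b > baseline_acc * q_limit:
            return order
    return 8  # eighth order is the final fallback

# Hypothetical accuracy per order; baseline A = 0.95, limit Q = 0.95.
accuracy = {1: 0.62, 2: 0.80, 4: 0.93, 8: 0.97}
chosen = select_quantization_order(lambda n: accuracy[n],
                                   baseline_acc=0.95, q_limit=0.95)
```

With these illustrative numbers the bound A×Q is 0.9025, and fourth order is the first to exceed it, so `chosen` is 4, as in the embodiment.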
  • FIG. 4 is a schematic diagram of a delay accumulation circuit in a multiply-add calculation circuit suitable for a neural network according to the present invention. With reference to Figure 4, the delay accumulation circuit is specifically described as follows:
  • the delay accumulation circuit is a summing array composed of eight mutually independent controllable delay chains; as described in the first embodiment, the number of controllable delay chains equals the number of bits in the products output by the multiplication calculation circuit array.
  • each controllable delay chain is formed by M controllable delay blocks connected in series; the trigger signal of the delay blocks del1 at odd positions is the rising clock edge, and that of the delay blocks del2 at even positions is the falling clock edge; each delay block has two inputs, the first receiving the time reference signal and the second connected to one of the M inputs of the delay accumulation circuit to receive one bit of the M group products; each delay block has one output, which emits the time signal with the delay amount superimposed.
  • the controllable delay block outputs a delay that depends on the value of the input data, as follows: when the input data is 0, the output delay is Δt; when the input data is 1, the output delay is 2Δt; outside the trigger instant, the output delay is Δt regardless of whether the input is 0 or 1.
  • the data fed into the fifth controllable delay chain for accumulation are the data on bit 4 of the products of the first through M-th groups, where 1 ≤ j ≤ 8;
  • the number of input data is 8M; according to this calculation scale, the iteration count of the fifth controllable delay chain is dynamically set to 8.
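The time-domain accumulation can be modeled in a few lines (a behavioral sketch, not the circuit: Δt is an illustrative unit, and the intrinsic-delay subtraction stands in for the TDC conversion stage):

```python
DT = 1.0  # unit delay Δt (illustrative)

def delay_chain_pass(bits, t_in=0.0):
    """One pass through an M-block chain: each block adds Δt for a 0
    input and 2Δt for a 1 input, on top of the incoming time signal."""
    return t_in + sum(2 * DT if b else DT for b in bits)

def count_ones_in_time_domain(segments):
    """Iterate the chain over successive bit segments, then recover the
    number of 1s from the final delay, as the TDC stage would:
    Y = Y0 + total*Δt + ones*Δt, so ones = (Y - Y0 - total*Δt) / Δt."""
    y, total = 0.0, 0
    for seg in segments:
        y = delay_chain_pass(seg, y)
        total += len(seg)
    return round((y - total * DT) / DT)
```

The recovered count is exactly the number of 1s in the input stream, which is what the final delay T delay = Y - Y 0 encodes in the description.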


Abstract

The present invention proposes a multiply-accumulate calculation method and circuit suitable for neural networks, relating to the field of analog integrated circuit technology, and achieves large-scale neural network multiply-accumulate computation with low power consumption and high speed. The multiply-accumulate circuit comprises a multiplication calculation circuit array and an accumulation calculation circuit. The multiplication circuit array consists of M groups of multiplication circuits; each group consists of one multiplication array unit and eight selection-shift units. On-chip training quantizes the order of the multiplication array unit in real time, and the unit provides shared inputs to the selection-shift units, increasing operation speed and reducing power consumption. The accumulation circuit is formed by a delay accumulation circuit, a TDC conversion circuit, and an add-shift circuit connected in series. The delay accumulation circuit contains 8 controllable delay chains whose iteration counts are dynamically controlled, accumulating data repeatedly in the time domain; this accommodates the differing computation scales of different network layers while saving hardware storage, reducing computational complexity, and reducing data scheduling.

Description

Multiply-Accumulate Calculation Method and Circuit Suitable for Neural Networks. Technical Field
The present invention relates to the field of analog integrated circuit technology, and in particular to a multiply-accumulate calculation method and circuit applied to neural networks.
Background Art
A neural network is a computational model composed of a large number of interconnected nodes. Each node represents an activation function, and each connection between two nodes represents a weight. The output of an artificial neural network depends on the network's connection pattern, weight values, and activation functions. In recent years, artificial neural networks and deep learning have made great progress; the massive parallelism of neural networks gives them the potential to process certain tasks quickly, and they have therefore shown good intelligent characteristics in pattern recognition, intelligent robotics, automatic control, prediction and estimation, biology, medicine, economics, and other fields.
As neural networks penetrate ever more application domains, the problems to be solved grow increasingly complex. Neural networks involve a large number of complex, parallel multiply-accumulate operations and consume substantial computing resources. Traditionally, data are processed and analyzed offline on computers, which greatly limits the use of neural networks in real-time systems. Taking the most common convolutional neural network as an example, a single run involves a huge number of multiply-accumulate operations; with conventional digital methods, multiplication suffers from large area and power consumption as the multiplier bit width grows. It is therefore of great significance to propose a multiply-accumulate method suitable for neural networks that completes large-scale multiply-accumulate tasks while keeping hardware resources low-power and fast.
Summary of the Invention
The technical problem to be solved by the present invention is to propose a multiply-accumulate calculation method and circuit suitable for neural networks that meets the computation scale and accuracy requirements of neural networks and completes computing tasks with low power consumption and high speed.
To solve the above technical problem, the present invention adopts the following technical solutions:
一方面,提出一种适用于神经网络的乘加计算电路,包括乘法计算电路阵列和累加计算电路。
乘法计算电路阵列由M组相同的乘法计算电路级联而成;各组乘法计算电路用于将神经网络的输入数据与神经网络权重系数相乘,并将所得乘积各bit位上的数据依次输入到累加计算电路中。
累加计算电路,用于在时间域内将乘法计算电路阵列输出的各组乘积中各bit位上的数据完成累加,并将所得时间量化的结果经TDC电路转化成数字量后再进行相加移位操作,得到 神经网络的输出数据。
乘法计算电路阵列有2M组输入端和M组输出端;累加计算电路有M组输入端和一组输出端;累加计算电路的M组输入端分别和乘法计算电路阵列的M组输出端对应相连。
每组乘法计算电路有两组输入端和一组输出端;神经网络的各组输入数据分别从各组乘法计算电路的第一输入端送入;神经网络的各组权重系数分别从各组乘法计算电路的第二输入端送入;神经网络的输出数据从累加计算电路的输出端送出。
第m组乘法计算电路的第一输入端输入第m输入数据,这是1个8bit位数据;第m组乘法计算电路的第二输入端输入第m组权重系数,包括8个8bit位数据,分别是第m组第一权重系数、第m组第二权重系数、第m组第三权重系数、第m组第四权重系数、第m组第五权重系数、第m组第六权重系数、第m组第七权重系数和第m组第八权重系数;第m组乘法计算电路的输出端输出第m组乘积,包括8组数据,分别是第m组第一乘积、第m组第二乘积、第m组第三乘积、第m组第四乘积、第m组第五乘积、第m组第六乘积、第m组第七乘积,第m组第八乘积;第m组第一乘积到第八乘积中都包含8个1bit位数据;累加计算电路的输出端输出8个8bit位数据。
进一步提出,各组乘法计算电路均由一组乘法阵列单元和八组相互独立的选择移位单元组成;乘法阵列单元,用于将神经网络的输入数据和特征乘数相乘,得到特征乘积;选择移位单元,根据神经网络的权重系数对特征乘积进行选择和移位操作,得到神经网络的输入数据与神经网络权重系数的乘积,并将所得乘积各bit位上的数据依次输入到累加计算电路中。
八组选择移位单元包括第一选择移位单元、第二选择移位单元、第三选择移位单元、第四选择移位单元、第五选择移位单元、第六选择移位单元、第七选择移位单元和第八选择移位单元。
乘法阵列单元有一组输入端和一组输出端;每组选择移位单元都有两组输入端和一组输出端;乘法阵列单元的输入端与乘法计算电路的第一输入端相连;乘法阵列单元的输出端和各组选择移位单元的第一输入端形成共享连接;各组选择移位单元的第二输入端分别和乘法计算电路的第二输入端相连;每组选择移位单元的输出端分别与乘法计算电路的输出端相连。
乘法阵列单元输出n个特征乘积、每个特征乘积是一个8bit位数据,其中n是8或者4或者2或者1;n个特征乘积作为第一到第八选择移位单元的共享输入数据,自第一到第八选择移位单元的第一输入端输入;第一到第八选择移位单元的第二输入端依照顺序对应输入第一到第八权重系数;第一到第八选择移位单元依照顺序分别输出第一到第八乘积各bit位上的数据。
进一步提出,乘法阵列单元由8个或者4个或者2个或者1个计算子单元组成,分别称 为八阶量化、四阶量化、二阶量化和一阶量化。
八阶量化乘法阵列单元的特征乘数分别是1,3,5,7,9,11,13,15;四阶量化乘法阵列单元的特征乘数分别是1,3,5,7;二阶量化乘法阵列单元的特征乘数分别是1,3;一阶量化乘法阵列单元的特征乘数是1。
八阶量化乘法阵列单元的特征乘积是1×Input,3×Input,5×Input,7×Input,9×Input,11×Input,13×Input,15×Input;四阶量化乘法阵列单元的特征乘积是1×Input,3×Input,5×Input,7×Input;二阶量化乘法阵列单元的特征乘积是1×Input,3×Input;一阶量化乘法阵列单元的特征乘积是1×Input。
进一步提出,累加计算电路由延时累加电路、TDC转换电路和相加移位电路依次串联构成。延时累加电路有M组输入端和八组输出端;TDC转换电路有八组输入端和八组输出端;相加移位电路有八组输入端和一组输出端;延时累加电路的M组输入端分别和乘法计算电路阵列的M组输出端对应相连;延时累加电路的八组输出端分别和TDC转换电路的八组输入端对应相连;TDC转换电路的八组输出端分别和相加移位电路的八组输入端对应相连;相加移位电路的输出端就是累加计算电路的输出端。
进一步提出,延时累加电路是8条相互独立的可控延时链构成的求和阵列;可控延时链的数量等于乘法计算电路阵列输出乘积的比特位的数量;任意一条可控延时链完成M组乘积某bit位数据在时间域内的一次累加。
每条可控延时链由M个可控延时块顺序串联而成;奇数位置上的可控延时块的触发信号是时钟上升沿;偶数位置上的可控延时块的触发信号是时钟下降沿;每个可控延时块都有两个输入端,第一输入端均用于接收时间参考信号,第二输入端分别与延时累加电路的M组输入端对应连接,用于接收M组乘积某bit位数据;每个可控延时块有一个输出端,输出叠加延时量后的时间信号。
另一方面,提出一种适用于神经网络的乘加计算方法,借助于前面所述的乘加计算电路而实现。乘加计算方法包括一种乘法计算方法和一种累加计算方法,具体步骤如下:
步骤101:采用片上训练方式对每组乘法计算电路中乘法阵列单元的阶数进行实时量化,以n阶量化乘法阵列单元进行神经网络实际计算;
步骤102:每组乘法计算电路中,n阶量化乘法阵列单元向第一至第八选择移位单元提供n个特征乘积作为共享输入;
步骤103:每组乘法计算电路中,一个计算周期内,每个选择移位单元将权重系数高四位的十进制数值、权重系数低四位的十进制数值分别与n阶量化乘法阵列单元的特征乘数进行比较并输出各组乘积,具体如下:
步骤103-A:权重系数高四位的十进制数值与特征乘数一致且不为0时,则直接输出对应的特征乘积各bit位的数据;权重系数低四位的十进制数值与特征乘数一致且不为0时,选择对应的特征乘积,则直接输出对应的特征乘积各bit位的数据;并进入步骤104;
步骤103-B:权重系数高四位的十进制数值与特征乘数不一致且不为0时,对特征乘积进行移位操作,输出所得结果各bit位的数据;权重系数低四位的十进制数值与特征乘数不一致且不为0时,对特征乘积进行移位操作,输出所得结果各bit位的数据;并进入步骤104;
步骤103-C:权重系数高四位的十进制数值为0时,则直接输出0;权重系数低四位的十进制数值为0时,则直接输出0;并进入步骤104;
步骤104:第一组到第M组乘积中i bit位上的数据依次输入到第i+1条延时链的M个可控延时块中,其中i为0到7之中的任意一个自然数;
在触发信号时刻,可控延时块根据所输入数据的不同数值而输出不同的延时量,具体如下:当输入数据是0时,可控延时块输出延时量为Δt;当输入数据是1时,可控延时块输出延时量为2Δt;在非触发信号时刻,无论输入数据是0或者1,可控延时块均输出延时量为Δt;
步骤105:第i+1条可控延时链完成第一组到第M组乘积中i bit位上的数据在时间域内的一次累加;当神经网络计算规模超过可控延时块级联数量M时,动态控制各条可控延时链迭代计算次数;
步骤106:使用TDC转换电路,将每条可控延时链输出的延时量转化成十进制的数字量;
步骤107:使用相加移位电路对数字量进行相加和右移位操作,得到神经网络的输出数据。
进一步提出,每组乘法计算电路中,片上训练方式具体步骤如下:
步骤201:神经网络权重系数初始化;训练数据集设置;得到预训练神经网络NN 0
步骤202,包括两个同时开展的部分,具体如下:
步骤202-A:采用测试数据集对预训练神经网络NN 0进行测试,得到网络精度初始值A;
步骤202-B:初始化乘法阵列单元的量化阶数为一阶,并采用一阶量化后的乘法计算电路替换预训练神经网络NN 0中的标准乘法计算电路,得到一阶量化神经网络NN 1;采用测试数据集对一阶量化神经网络NN 1进行训练,得到网络精度实际值B;
步骤203:引入网络精度限值Q,判断B和A×Q的大小关系:若B>A×Q,则进入步骤209;若B<A×Q,则进入步骤204;
步骤204:提高乘法阵列单元的量化阶数为二阶,并采用二阶量化后的乘法计算电路替换预训练神经网络NN 0中的标准乘法计算电路,得到二阶量化神经网络NN 2;采用测试数据集对二阶量化神经网络NN 2进行训练,得到网络精度实际值B;
步骤205:引入网络精度限值Q,判断B和A×Q的大小关系:若B>A×Q,则进入步骤209;若B<A×Q,则进入步骤206;
步骤206:提高乘法阵列单元的量化阶数为四阶,并采用四阶量化后的乘法计算电路替换预训练神经网络NN 0中的标准乘法计算电路,得到四阶量化神经网络NN 4;采用测试数据集对四阶量化神经网络NN 4进行训练,得到网络精度实际值B;
步骤207:引入网络精度限值Q,判断B和A×Q的大小关系:若B>A×Q,则进入步骤209;若B<A×Q,则进入步骤208;
步骤208:提高乘法阵列单元的量化阶数为八阶,并采用八阶量化后的乘法计算电路替换预训练神经网络NN 0中的标准乘法计算电路,得到八阶量化神经网络NN 8;进入步骤209;
步骤209:结束乘法阵列单元量化阶数的片上训练,以当前量化阶数的乘法阵列单元进行神经网络实际计算。
进一步提出,每组乘法计算电路中,根据乘法阵列单元的量化阶数,步骤103-B中对特征乘积进行移位操作,具体如下:
当乘法阵列单元的量化阶数为八时:对特征乘积1×Input左移1位得到2×Input;对特征乘积1×Input左移2位得到4×Input;对特征乘积3×Input左移1位得到6×Input;对特征乘积1×Input左移3位得到8×Input;对特征乘积5×Input左移1位得到10×Input;对特征乘积3×Input左移2位得到12×Input;对特征乘积7×Input左移1位得到14×Input。
当乘法阵列单元的量化阶数为四时:对特征乘积1×Input左移1位得到2×Input;对特征乘积1×Input左移2位得到4×Input;对特征乘积3×Input左移1位得到6×Input;对特征乘积1×Input左移3位得到8×Input;对特征乘积3×Input左移2位得到12×Input;对特征乘积1×Input左移2位得到4×Input,近似表示5×Input;对特征乘积3×Input左移1位得到6×Input,近似表示7×Input;对特征乘积1×Input左移3位得到8×Input,近似表示9×Input和10×Input;对特征乘积3×Input左移2位得到12×Input,近似表示11×Input、13×Input、14×Input和15×Input。
当乘法阵列单元的量化阶数为二时:对特征乘积1×Input左移1位得到2×Input;对特征乘积1×Input左移2位得到4×Input;对特征乘积3×Input左移1位得到6×Input;对特征乘积1×Input左移3位得到8×Input;对特征乘积3×Input左移2位得到12×Input;对特征乘积1×Input左移2位得到4×Input,近似表示5×Input;对特征乘积3×Input左移1位得到6×Input,近似表示7×Input;对特征乘积1×Input左移3位得到8×Input,近似表示9×Input和10×Input;对特征乘积3×Input左移2位得到12×Input,近似表示11×Input、13×Input、14×Input和15×Input。
当乘法阵列单元的量化阶数为一时:对特征乘积1×Input左移1位得到2×Input;对特征乘积1×Input左移2位得到4×Input;对特征乘积1×Input左移3位得到8×Input;对特征乘积1×Input左移1位得到2×Input,近似表示3×Input;对特征乘积1×Input左移2位得到4×Input,近似表示5×Input、6×Input、7×Input;对特征乘积1×Input左移3位得到8×Input,近似表示9×Input、10×Input、11×Input、12×Input、13×Input、14×Input、15×Input。
进一步提出,步骤105中动态控制各条可控延时链迭代计算次数,具体步骤如下:
步骤301:神经网络的计算规模为W,将W组数据分成K段M组数据和一段N组数据,其中K是大于等于1的整数,N是大于等于1且小于M的整数;
步骤302:以时间信号Y j-1为输入,各条可控延时链对第j段M组数据开展第j次累加计算,输出时间量Y j=Y j-1+ΔT j,其中j是从1到K的自然数;
步骤303:以时间信号Y K为输入,各条可控延时链对N组数据开展累加计算,输出时间量Y=Y K+ΔT。
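Steps 301-303 above (splitting the workload W into segments of M and accumulating the time signal pass by pass) can be sketched as follows; Δt and the bit-stream data are illustrative assumptions, and the function is a behavioral model rather than the circuit:

```python
import math

DT = 1.0  # unit delay Δt (illustrative)

def iterate_accumulate(data, m):
    """Consume W = len(data) input bits M at a time: pass j feeds one
    segment through the delay chain and adds its delay increment ΔT_j
    to the running time signal Y (steps 301-303)."""
    iterations = math.ceil(len(data) / m)
    y = 0.0
    for j in range(iterations):
        segment = data[j * m:(j + 1) * m]
        y += sum(2 * DT if bit else DT for bit in segment)
    return iterations, y
```

For example, 8M input bits with chain length M yield 8 iterations, matching the iteration count given in the fourth embodiment.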
本发明采用以上技术方案与现有技术相比,具有以下技术效果;
1、针对神经网络不同的计算规模和精度要求,提出一种乘加计算方法,采用片上训练方式动态实时调整乘法阵列单元的量化阶数,以当前n阶量化乘法阵列单元进行神经网络实际计算,使得精确度和功耗达到最优。
2、每组乘法计算电路中,八组选择移位单元共享一组乘法阵列单元的输出,因此在同一个计算周期内,这八组选择移位单元可同时开展乘法计算,显著提高了电路的运算速率;并且每组乘法计算电路中仅有一组乘法阵列单元,因此相比现有技术,显著降低了乘法电路的功耗,提高了电路吞吐率。
3、延时累加电路是可控延时链构建的求和阵列,基于延时信号具有可叠加性,因此根据神经网络的每层网络计算规模,动态调整求和阵列的迭代次数,即各可控延时链的迭代次数,从而满足不同网络层计算规模的差异性,节省硬件存储空间、降低计算的复杂度、减小数据调度。
4、延时累加电路中,数字量被直接转换为时间量进行累加操作,在完成期望的迭代次数后,由TDC电路将最终得到的时间里进行数字转换。全部累加过程在时间域内完成,消除了外电路非理想效应的影响;在保证累加精度的同时,可以减小电路的复杂度,使其电路易于实现。并且,时域内电路转换速度快、效率高,使得延时累加电路在低功耗高速环境下运行,满足了实际应用中的需求。
附图说明
图1是本发明提出的一种适用于神经网络的乘加计算电路的方框示意图;
图2是本发明提出的一种适用于神经网络的乘加计算电路中的一组乘法计算电路的方框示意图;
图3是本发明提出的采用片上训练方式动态实时调整所述乘法阵列单元的量化阶数的控制逻辑图,以DNN网络为例;
图4是本发明提出的一种适用于神经网络的乘加计算电路中的延时累加电路的原理示意图。
具体实施方式
具体实施例一:
图1是本发明提出的一种适用于神经网络的乘加计算电路的方框示意图,图中所示的乘加计算电路包括乘法计算电路阵列和累加计算电路。
乘法计算电路阵列由M组相同的乘法计算电路级联而成;各组乘法计算电路用于将神经网络的输入数据与神经网络权重系数相乘,并将所得乘积各bit位上的数据依次输入到累加计算电路中。
累加计算电路,用于在时间域内将乘法计算电路阵列输出的各组乘积中各bit位上的数据完成累加,并将所得时间量化的结果经TDC电路转化成数字量后再进行相加移位操作,得到神经网络的输出数据。
乘法计算电路阵列有2M组输入端和M组输出端;累加计算电路有M组输入端和一组输出端;累加计算电路的M组输入端分别和乘法计算电路阵列的M组输出端对应相连。
每组乘法计算电路有两组输入端和一组输出端;神经网络的各组输入数据分别从各组乘法计算电路的第一输入端送入;神经网络的各组权重系数分别从各组乘法计算电路的第二输入端送入;神经网络的输出数据从累加计算电路的输出端送出。
累加计算电路由延时累加电路、TDC转换电路和相加移位电路依次串联构成。延时累加电路有M组输入端和八组输出端;TDC转换电路有八组输入端和八组输出端;相加移位电路有八组输入端和一组输出端;延时累加电路的M组输入端分别和乘法计算电路阵列的M组输出端对应相连;延时累加电路的八组输出端分别和TDC转换电路的八组输入端对应相连;TDC转换电路的八组输出端分别和相加移位电路的八组输入端对应相连;相加移位电路的输出端就是累加计算电路的输出端。
本实施例结合图1对乘法计算电路的输出数据给出具体说明,如下:
第m组乘法计算电路的第一输入端输入第m输入数据,这是1个8bit位数据;第m组乘法计算电路的第二输入端输入第m组权重系数,包括8个8bit位数据,分别是第m组第一权重系数、第m组第二权重系数、第m组第三权重系数、第m组第四权重系数、第m组第五权重系数、第m组第六权重系数、第m组第七权重系数和第m组第八权重系数。
第m组乘法计算电路的输出端输出第m组乘积,包括8组数据,如下:
第m组第一乘积P m-1=第m输入数据×第m组第一权重系数;
第m组第二乘积P m-2=第m输入数据×第m组第二权重系数;
第m组第三乘积P m-3=第m输入数据×第m组第三权重系数;
第m组第四乘积P m-4=第m输入数据×第m组第四权重系数;
第m组第五乘积P m-5=第m输入数据×第m组第五权重系数;
第m组第六乘积P m-6=第m输入数据×第m组第六权重系数;
第m组第七乘积P m-7=第m输入数据×第m组第七权重系数;
第m组第八乘积P m-8=第m输入数据×第m组第八权重系数;
第m组第一乘积到第八乘积中都包含8个1bit位数据,第m组乘积P m是一个8×8的矩阵,如下:
Figure PCTCN2019072892-appb-000001
具体实施例二:
图2是本发明提出的一种适用于神经网络的乘加计算电路中的一组乘法计算电路的方框示意图。
如图2所示,每一组乘法计算电路均由一组乘法阵列单元和八组相互独立的选择移位单 元组成;乘法阵列单元,用于将神经网络的输入数据和特征乘数相乘,得到特征乘积;选择移位单元,根据神经网络的权重系数对特征乘积进行选择和移位操作,得到神经网络的输入数据与神经网络权重系数的乘积,并将所得乘积各bit位上的数据依次输入到累加计算电路中。
八组选择移位单元包括第一选择移位单元、第二选择移位单元、第三选择移位单元、第四选择移位单元、第五选择移位单元、第六选择移位单元、第七选择移位单元和第八选择移位单元。
乘法阵列单元有一组输入端和一组输出端;每组选择移位单元都有两组输入端和一组输出端;乘法阵列单元的输入端与乘法计算电路的第一输入端相连;乘法阵列单元的输出端和各组选择移位单元的第一输入端形成共享连接;各组选择移位单元的第二输入端分别和乘法计算电路的第二输入端相连;每组选择移位单元的输出端分别与乘法计算电路的输出端相连。
乘法阵列单元输出n个特征乘积、每个特征乘积是一个8bit位数据,其中n是8或者4或者2或者1;n个特征乘积作为第一到第八选择移位单元的共享输入数据,自第一到第八选择移位单元的第一输入端输入;第一到第八选择移位单元的第二输入端依照顺序对应输入第一到第八权重系数;第一到第八选择移位单元依照顺序分别输出第一到第八乘积各bit位上的数据。
乘法阵列单元由8个或者4个或者2个或者1个计算子单元组成,分别称为八阶量化、四阶量化、二阶量化和一阶量化。
八阶量化乘法阵列单元的特征乘数分别是1,3,5,7,9,11,13,15;四阶量化乘法阵列单元的特征乘数分别是1,3,5,7;二阶量化乘法阵列单元的特征乘数分别是1,3;一阶量化乘法阵列单元的特征乘数是1。
八阶量化乘法阵列单元的特征乘积是1×Input,3×Input,5×Input,7×Input,9×Input,11×Input,13×Input,15×Input;四阶量化乘法阵列单元的特征乘积是1×Input,3×Input,5×Input,7×Input;二阶量化乘法阵列单元的特征乘积是1×Input,3×Input;一阶量化乘法阵列单元的特征乘积是1×Input。
每组乘法计算电路中,根据乘法阵列单元的量化阶数,对特征乘积进行移位操作,具体如下:
当乘法阵列单元的量化阶数为八时:对特征乘积1×Input左移1位得到2×Input;对特征乘积1×Input左移2位得到4×Input;对特征乘积3×Input左移1位得到6×Input;对特征乘积1×Input左移3位得到8×Input;对特征乘积5×Input左移1位得到10×Input;对特征乘积3×Input左移2位得到12×Input;对特征乘积7×Input左移1位得到14×Input。
当乘法阵列单元的量化阶数为四时:对特征乘积1×Input左移1位得到2×Input;对特 征乘积1×Input左移2位得到4×Input;对特征乘积3×Input左移1位得到6×Input;对特征乘积1×Input左移3位得到8×Input;对特征乘积3×Input左移2位得到12×Input;对特征乘积1×Input左移2位得到4×Input,近似表示5×Input;对特征乘积3×Input左移1位得到6×Input,近似表示7×Input;对特征乘积1×Input左移3位得到8×Input,近似表示9×Input和10×Input;对特征乘积3×Input左移2位得到12×Input,近似表示11×Input、13×Input、14×Input和15×Input。
当乘法阵列单元的量化阶数为二时:对特征乘积1×Input左移1位得到2×Input;对特征乘积1×Input左移2位得到4×Input;对特征乘积3×Input左移1位得到6×Input;对特征乘积1×Input左移3位得到8×Input;对特征乘积3×Input左移2位得到12×Input;对特征乘积1×Input左移2位得到4×Input,近似表示5×Input;对特征乘积3×Input左移1位得到6×Input,近似表示7×Input;对特征乘积1×Input左移3位得到8×Input,近似表示9×Input和10×Input;对特征乘积3×Input左移2位得到12×Input,近似表示11×Input、13×Input、14×Input和15×Input。
当乘法阵列单元的量化阶数为一时:对特征乘积1×Input左移1位得到2×Input;对特征乘积1×Input左移2位得到4×Input;对特征乘积1×Input左移3位得到8×Input;对特征乘积1×Input左移1位得到2×Input,近似表示3×Input;对特征乘积1×Input左移2位得到4×Input,近似表示5×Input、6×Input、7×Input;对特征乘积1×Input左移3位得到8×Input,近似表示9×Input、10×Input、11×Input、12×Input、13×Input、14×Input、15×Input。
通过对本实施例的分析,可见八阶量化的乘法计算电路可以精确计算出16个乘积;四阶量化的乘法计算电路可以精确计算出12个乘积和近似计算出4个乘积;二阶量化的乘法计算电路可以精确计算出8个乘积和近似计算出8个乘积;一阶量化的乘法计算电路,相当于取消了乘法阵列结构,直接将输入数据输入到选择移位单元,可以精确计算出5个乘积和近似计算出11个乘积。
可见,共享乘法器中的乘法阵列实时量化的阶数越多,乘法器计算的精度越高,但是速度降低、损耗增加;而乘法阵列量化的阶数越少,乘法器计算的精度越低,但是功耗显著降低,并且电路的吞吐率和速度显著提高。尤其是一阶实时量化后,实现了完全解放共享乘法器,取消了共享乘法阵列,将乘法用移位和相加的方法替换实现,对于精确度要求不高的应用场景,完全解放乘法器具有很大的优势,实现了乘法的超低功耗近似计算,大大提高了电路的工作效率。
具体实施例三:
图3是本发明提出的采用片上训练方式动态实时调整所述乘法阵列单元的量化阶数的控制逻辑图。结合图3,本实施例以DNN的片上训练为优选,采用片上训练方式将各乘法阵列单元的阶数量化为四阶,具体如下:
步骤201:DNN权重系数初始化;训练数据集设置;得到预训练DNN;
步骤202,包括两个同时开展的部分,具体如下:
步骤202-A:采用测试数据集对预训练DNN进行测试,得到网络精度初始值A;
步骤202-B:初始化乘法阵列单元的量化阶数为一阶,并采用一阶量化后的乘法计算电路替换预训练DNN中的标准乘法计算电路,得到DNN 1;采用测试数据集对DNN 1进行训练,得到网络精度实际值B;
步骤203:引入网络精度限值Q,此时B<A×Q,则进入步骤4;
步骤204:提高乘法阵列单元的量化阶数为二阶,并采用二阶量化后的乘法计算电路替换预训练DNN中的标准乘法计算电路,得到DNN 2;采用测试数据集对DNN 2进行训练,得到网络精度实际值B;
步骤205:引入网络精度限值Q,此时B<A×Q,则进入步骤6;
步骤206:提高乘法阵列单元的量化阶数为四阶,并采用四阶量化后的乘法计算电路替换预训练DNN中的标准乘法计算电路,得到DNN 4;采用测试数据集对DNN 4进行训练,得到网络精度实际值B;
步骤207:引入网络精度限值Q,此时B>A×Q,则结束乘法阵列单元量化阶数的片上训练,以四阶量化的乘法阵列单元进行DNN实际计算。
具体实施例四:
图4是本发明提出的一种适用于神经网络的乘加计算电路中的延时累加电路的原理示意图。结合图4,对延时累加电路开展具体说明,如下:
延时累加电路是8条相互独立的可控延时链构成的求和阵列;根据具体实施例一可知,可控延时链的数量等于乘法计算电路阵列输出乘积的比特位的数量。
每条可控延时链由M个可控延时块顺序串联而成;奇数位置上的可控延时块del1的触发信号是时钟上升沿;偶数位置上的可控延时块del2的触发信号是时钟下降沿;每个可控延时块都有两个输入端,第一输入端均用于接收时间参考信号,第二输入端分别与延时累加电路的M组输入端对应连接,用于接收M组乘积某bit位数据;每个可控延时块有一个输出端,输出叠加延时量后的时间信号。
在触发信号时刻,可控延时块根据所输入数据的不同数值而输出不同的延时量,具体如下:当输入数据是0时,可控延时块输出延时量为Δt;当输入数据是1时,可控延时块输出 延时量为2Δt;在非触发信号时刻,无论输入数据是0或者1,可控延时块均输出延时量为Δt
以四阶量化的乘法计算电路阵列为例,对延时累加电路中的第三条可控延时链的累加过程,详细说明如下:
输入到第五条可控延时链进行累加计算的数据,是第一组到第M组乘积中的第4bit位上的数据,具体:
Figure PCTCN2019072892-appb-000002
其中1≤j≤8;
输入数据数量为8M个,根据该计算规模,动态控制第五条可控延时链的迭代计算次数为8。
第一次迭代时,以时间信号Y 0为输入,第五条可控延时链上各延时块依次输入
Figure PCTCN2019072892-appb-000003
则延时链输出时间量Y 1=Y 0+ΔT 1;第二次迭代时,以时间信号Y 1为输入,第五条可控延时链上各延时块依次输入
Figure PCTCN2019072892-appb-000004
则延时链输出时间量Y 2=Y 1+ΔT 2=Y 0+ΔT 1+ΔT 2;经过8次迭代后,第五条延时链输出时间量为
Figure PCTCN2019072892-appb-000005
此时要对该输出时间量里包含的本征延时量T in进行消除,即
Figure PCTCN2019072892-appb-000006
通过多次迭代计算后,第五条可控延时链只输出一个时间延时量T delay=Y-Y 0,该延时量的长短代表了输入的数据信号中“1”的数量。
以上所述仅是本发明的部分实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本发明原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本发明的保护范围。

Claims (9)

  1. 一种适用于神经网络的乘加计算电路,其特征在于:包括乘法计算电路阵列和累加计算电路;
    所述乘法计算电路阵列由M组结构相同的乘法计算电路级联而成;所述各组乘法计算电路用于将神经网络的输入数据与神经网络权重系数相乘,并将所得乘积各bit位上的数据依次输入到累加计算电路中;
    所述累加计算电路,用于在时间域内对乘法计算电路阵列输出的各组乘积中各bit位上的数据进行累加,并对所得时间量化的结果进行模数转换相加移位操作得到神经网络的输出数据;
    所述乘法计算电路阵列有2M组输入端和M组输出端;所述累加计算电路有M组输入端和一组输出端;累加计算电路的M组输入端分别和乘法计算电路阵列的M组输出端对应相连;
    第m组乘法计算电路的第一输入端输入第m输入数据,这是1个8bit位数据;第m组乘法计算电路的第二输入端输入第m组权重系数,包括8个8bit位数据,分别是第m组第一权重系数、第m组第二权重系数、第m组第三权重系数、第m组第四权重系数、第m组第五权重系数、第m组第六权重系数、第m组第七权重系数和第m组第八权重系数;第m组乘法计算电路的输出端输出第m组乘积,包括8组数据,分别是第m组第一乘积、第m组第二乘积、第m组第三乘积、第m组第四乘积、第m组第五乘积、第m组第六乘积、第m组第七乘积,第m组第八乘积;第m组第一乘积到第八乘积中都包含8个1bit位数据;n<M,M为正整数;第m组乘法计算电路的输出端与累加计算电路的输入端连接,,所述累加计算电路输出端输出的8个8bit位数据为神经网络的输出数据。
  2. 根据权利要求1所述的一种适用于神经网络的乘加计算电路,其特征在于:所述各组乘法计算电路均由一组乘法阵列单元和八组相互独立的选择移位单元组成;所述乘法阵列单元,用于将神经网络的输入数据和特征乘数相乘,得到特征乘积;所述选择移位单元,根据神经网络的权重系数对特征乘积进行选择和移位操作,得到神经网络的输入数据与神经网络权重系数的乘积,并将所得乘积各bit位上的数据依次输入到累加计算电路中;
    所述八组选择移位单元包括第一选择移位单元、第二选择移位单元、第三选择移位单元、第四选择移位单元、第五选择移位单元、第六选择移位单元、第七选择移位单元和第八选择移位单元;
    所述乘法阵列单元的输入端与乘法计算电路的第一输入端相连;乘法阵列单元的输出端和各组选择移位单元的第一输入端形成共享连接;各组选择移位单元的第二输入端分别和乘法计算电路的第二输入端相连;每组选择移位单元的输出端分别与乘法计算电路的输出端相 连;
    所述乘法阵列单元输出n个特征乘积、每个特征乘积是一个8bit位数据,其中n是8或者4或者2或者1;n个特征乘积作为第一到第八选择移位单元的共享输入数据,自第一到第八选择移位单元的第一输入端输入;第一到第八选择移位单元的第二输入端依照顺序对应输入第一到第八权重系数;第一到第八选择移位单元依照顺序分别输出第一到第八乘积各bit位上的数据。
  3. 根据权利要求2所述的一种适用于神经网络的乘加计算电路,其特征在于:所述乘法阵列单元由8个或者4个或者2个或者1个计算子单元组成,分别称为八阶量化乘法阵列单元、四阶量化乘法阵列单元、二阶量化乘法阵列单元和一阶量化乘法阵列单元;
    所述八阶量化乘法阵列单元的特征乘数分别是1,3,5,7,9,11,13,15;所述四阶量化乘法阵列单元的特征乘数分别是1,3,5,7;所述二阶量化乘法阵列单元的特征乘数分别是1,3;所述一阶量化乘法阵列单元的特征乘数是1;
    所述八阶量化乘法阵列单元的特征乘积是1×Input,3×Input,5×Input,7×Input,9×Input,11×Input,13×Input,15×Input;所述四阶量化乘法阵列单元的特征乘积是1×Input,3×Input,5×Input,7×Input;所述二阶量化乘法阵列单元的特征乘积是1×Input,3×Input;所述一阶量化乘法阵列单元的特征乘积是1×Input。
  4. 根据权利要求1所述的一种适用于神经网络的乘加计算电路,其特征在于:所述累加计算电路由延时累加电路、TDC转换电路和相加移位电路依次串联构成;
    所述延时累加电路有M组输入端和八组输出端;所述TDC转换电路有八组输入端和八组输出端;所述相加移位电路有八组输入端和一组输出端;延时累加电路的M组输入端分别和乘法计算电路阵列的M组输出端对应相连;延时累加电路的八组输出端分别和TDC转换电路的八组输入端对应相连;TDC转换电路的八组输出端分别和相加移位电路的八组输入端对应相连;相加移位电路的输出端就是累加计算电路的输出端。
  5. 根据权利要求4所述的一种适用于神经网络的乘加计算电路,其特征在于:所述延时累加电路是8条相互独立的可控延时链构成的求和阵列;所述可控延时链的数量等于乘法计算电路阵列输出乘积的比特位的数量;任意一条可控延时链完成M组乘积某bit位数据在时间域内的一次累加;
    每条可控延时链由M个可控延时块顺序串联而成;奇数位置上的可控延时块的触发信号是时钟上升沿;偶数位置上的可控延时块的触发信号是时钟下降沿;
    每个可控延时块都有两个输入端和一个输出端,第一输入端均用于接收时间参考信号,第二输入端与延时累加电路的M组输入端中的一组输出端连接用于接收M组乘积某bit位数 据,输出端输出叠加延时量后的时间信号。
  6. 一种适用于神经网络的乘加计算方法,借助于权利要求5所述的一种适用于神经网络的乘加计算电路而实现,其特征在于:所述乘加计算方法,具体包括一种乘法计算方法和一种累加计算方法;所述乘加计算方法包括如下步骤:
    步骤101:采用片上训练方式对每组乘法计算电路中乘法阵列单元的阶数进行实时量化,以n阶量化乘法阵列单元进行神经网络实际计算;
    步骤102:每组乘法计算电路中,n阶量化乘法阵列单元向第一至第八选择移位单元提供n个特征乘积作为共享输入;
    步骤103:每组乘法计算电路中,一个计算周期内,每个选择移位单元将权重系数高四位的十进制数值、权重系数低四位的十进制数值分别与n阶量化乘法阵列单元的特征乘数进行比较并输出各组乘积,具体如下:
    步骤103-A:权重系数高四位的十进制数值与特征乘数一致且不为0时,则直接输出对应的特征乘积各bit位的数据;权重系数低四位的十进制数值与特征乘数一致且不为0时,选择对应的特征乘积,则直接输出对应的特征乘积各bit位的数据;并进入步骤104;
    步骤103-B:权重系数高四位的十进制数值与特征乘数不一致且不为0时,对特征乘积进行移位操作,输出所得结果各bit位的数据;权重系数低四位的十进制数值与特征乘数不一致且不为0时,对特征乘积进行移位操作,输出所得结果各bit位的数据;并进入步骤104;
    步骤103-C:权重系数高四位的十进制数值为0时,则直接输出0;权重系数低四位的十进制数值为0时,则直接输出0;并进入步骤104;
    步骤104:第一组到第M组乘积中i bit位上的数据依次输入到第i+1条延时链的M个可控延时块中,其中i为0到7之中的任意一个自然数;
    在触发信号时刻,可控延时块根据所输入数据的不同数值而输出不同的延时量,具体如下:当输入数据是0时,可控延时块输出延时量为Δt;当输入数据是1时,可控延时块输出延时量为2Δt;在非触发信号时刻,无论输入数据是0或者1,可控延时块均输出延时量为Δt;
    步骤105:第i+1条可控延时链完成第一组到第M组乘积中i bit位上的数据在时间域内的一次累加;当神经网络计算规模超过可控延时块级联数量M时,动态控制各条可控延时链迭代计算次数;
    步骤106:使用TDC转换电路,将每条可控延时链输出的延时量转化成十进制的数字量;
    步骤107:使用相加移位电路对数字量进行相加和右移位操作,得到神经网络的输出数据。
  7. 根据权利要求6所述的一种适用于神经网络的乘加计算方法,其特征在于:每组乘法 计算电路中,所述片上训练方式具体步骤如下:
    步骤201:神经网络权重系数初始化;训练数据集设置;得到预训练神经网络NN 0
    步骤202,包括两个同时开展的部分,具体如下:
    步骤202-A:采用测试数据集对预训练神经网络NN 0进行测试,得到网络精度初始值A;
    步骤202-B:初始化乘法阵列单元的量化阶数为一阶,并采用一阶量化后的乘法计算电路替换预训练神经网络NN 0中的标准乘法计算电路,得到一阶量化神经网络NN 1;采用测试数据集对一阶量化神经网络NN 1进行训练,得到网络精度实际值B;
    步骤203:引入网络精度限值Q,判断B和A×Q的大小关系:若B>A×Q,则进入步骤209;若B<A×Q,则进入步骤204;
    步骤204:提高乘法阵列单元的量化阶数为二阶,并采用二阶量化后的乘法计算电路替换预训练神经网络NN 0中的标准乘法计算电路,得到二阶量化神经网络NN 2;采用测试数据集对二阶量化神经网络NN 2进行训练,得到网络精度实际值B;
    步骤205:引入网络精度限值Q,判断B和A×Q的大小关系:若B>A×Q,则进入步骤209;若B<A×Q,则进入步骤206;
    步骤206:提高乘法阵列单元的量化阶数为四阶,并采用四阶量化后的乘法计算电路替换预训练神经网络NN 0中的标准乘法计算电路,得到四阶量化神经网络NN 4;采用测试数据集对四阶量化神经网络NN 4进行训练,得到网络精度实际值B;
    步骤207:引入网络精度限值Q,判断B和A×Q的大小关系:若B>A×Q,则进入步骤209;若B<A×Q,则进入步骤208;
    步骤208:提高乘法阵列单元的量化阶数为八阶,并采用八阶量化后的乘法计算电路替换预训练神经网络NN 0中的标准乘法计算电路,得到八阶量化神经网络NN 8;进入步骤209;
    步骤209:结束乘法阵列单元量化阶数的片上训练,以当前量化阶数的乘法阵列单元进行神经网络实际计算。
  8. 根据权利要求6所述的一种适用于神经网络的乘加计算方法,其特征在于:每组乘法计算电路中,根据乘法阵列单元的量化阶数,所述步骤103-B中对特征乘积进行移位操作,具体如下:
    当乘法阵列单元的量化阶数为八时:对特征乘积1×Input左移1位得到2×Input;对特征乘积1×Input左移2位得到4×Input;对特征乘积3×Input左移1位得到6×Input;对特征乘积1×Input左移3位得到8×Input;对特征乘积5×Input左移1位得到10×Input;对特征乘积3×Input左移2位得到12×Input;对特征乘积7×Input左移1位得到14×Input;
    当乘法阵列单元的量化阶数为四时:对特征乘积1×Input左移1位得到2×Input;对特 征乘积1×Input左移2位得到4×Input;对特征乘积3×Input左移1位得到6×Input;对特征乘积1×Input左移3位得到8×Input;对特征乘积3×Input左移2位得到12×Input;对特征乘积1×Input左移2位得到4×Input,近似表示5×Input;对特征乘积3×Input左移1位得到6×Input,近似表示7×Input;对特征乘积1×Input左移3位得到8×Input,近似表示9×Input和10×Input;对特征乘积3×Input左移2位得到12×Input,近似表示11×Input、13×Input、14×Input和15×Input;
    当乘法阵列单元的量化阶数为二时:对特征乘积1×Input左移1位得到2×Input;对特征乘积1×Input左移2位得到4×Input;对特征乘积3×Input左移1位得到6×Input;对特征乘积1×Input左移3位得到8×Input;对特征乘积3×Input左移2位得到12×Input;对特征乘积1×Input左移2位得到4×Input,近似表示5×Input;对特征乘积3×Input左移1位得到6×Input,近似表示7×Input;对特征乘积1×Input左移3位得到8×Input,近似表示9×Input和10×Input;对特征乘积3×Input左移2位得到12×Input,近似表示11×Input、13×Input、14×Input和15×Input;
    当乘法阵列单元的量化阶数为一时:对特征乘积1×Input左移1位得到2×Input;对特征乘积1×Input左移2位得到4×Input;对特征乘积1×Input左移3位得到8×Input;对特征乘积1×Input左移1位得到2×Input,近似表示3×Input;对特征乘积1×Input左移2位得到4×Input,近似表示5×Input、6×Input、7×Input;对特征乘积1×Input左移3位得到8×Input,近似表示9×Input、10×Input、11×Input、12×Input、13×Input、14×Input、15×Input。
  9. 根据权利要求6所述的一种适用于神经网络的乘加计算方法,其特征在于:所述步骤105中动态控制各条可控延时链迭代计算次数,具体步骤如下:
    步骤301:神经网络的计算规模为W,将W组数据分成K段M组数据和一段N组数据,其中K是大于等于1的整数,N是大于等于1且小于M的整数;
    步骤302:以时间信号Y j-1为输入,各条可控延时链对第j段M组数据开展第j次累加计算,输出时间量Y j=Y j-1+ΔT j,其中j是从1到K的自然数;
    步骤303:以时间信号Y K为输入,各条可控延时链对N组数据开展累加计算,输出时间量Y=Y K+ΔT。
PCT/CN2019/072892 2018-08-08 2019-01-24 一种适用于神经网络的乘加计算方法和计算电路 WO2020029551A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/757,421 US10984313B2 (en) 2018-08-08 2019-01-24 Multiply-accumulate calculation method and circuit suitable for neural network

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810894109.2A CN109344964B (zh) 2018-08-08 2018-08-08 一种适用于神经网络的乘加计算方法和计算电路
CN201810894109.2 2018-08-08

Publications (1)

Publication Number Publication Date
WO2020029551A1 true WO2020029551A1 (zh) 2020-02-13

Family

ID=65296795

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2019/072892 WO2020029551A1 (zh) 2018-08-08 2019-01-24 一种适用于神经网络的乘加计算方法和计算电路
PCT/CN2019/078378 WO2020029583A1 (zh) 2018-08-08 2019-03-15 一种适用于神经网络的乘加计算方法和计算电路

Family Applications After (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/078378 WO2020029583A1 (zh) 2018-08-08 2019-03-15 一种适用于神经网络的乘加计算方法和计算电路

Country Status (3)

Country Link
US (1) US10984313B2 (zh)
CN (1) CN109344964B (zh)
WO (2) WO2020029551A1 (zh)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344964B (zh) * 2018-08-08 2020-12-29 东南大学 一种适用于神经网络的乘加计算方法和计算电路
US11573792B2 (en) 2019-09-03 2023-02-07 Samsung Electronics Co., Ltd. Method and computing device with a multiplier-accumulator circuit
CN110750231B (zh) * 2019-09-27 2021-09-28 东南大学 一种面向卷积神经网络的双相系数可调模拟乘法计算电路
CN111091190A (zh) * 2020-03-25 2020-05-01 光子算数(北京)科技有限责任公司 数据处理方法及装置、光子神经网络芯片、数据处理电路
CN111694544B (zh) * 2020-06-02 2022-03-15 杭州知存智能科技有限公司 多位复用乘加运算装置、神经网络运算系统以及电子设备
CN111738427B (zh) * 2020-08-14 2020-12-29 电子科技大学 一种神经网络的运算电路
CN111988031B (zh) * 2020-08-28 2022-05-20 华中科技大学 一种忆阻存内矢量矩阵运算器及运算方法
CN112988111B (zh) * 2021-03-05 2022-02-11 唐山恒鼎科技有限公司 一种单比特乘法器
CN115935878B (zh) * 2023-01-06 2023-05-05 上海后摩智能科技有限公司 基于模拟信号的多比特数据计算电路、芯片及计算装置
CN116700670B (zh) * 2023-08-08 2024-04-05 深圳比特微电子科技有限公司 乘累加电路、包含该乘累加电路的处理器和计算装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341544A (zh) * 2017-06-30 2017-11-10 清华大学 一种基于可分割阵列的可重构加速器及其实现方法
CN107402905A (zh) * 2016-05-19 2017-11-28 北京旷视科技有限公司 基于神经网络的计算方法及装置
CN107797962A (zh) * 2017-10-17 2018-03-13 清华大学 基于神经网络的计算阵列
CN107818367A (zh) * 2017-10-30 2018-03-20 中国科学院计算技术研究所 用于神经网络的处理系统和处理方法
CN107844826A (zh) * 2017-10-30 2018-03-27 中国科学院计算技术研究所 神经网络处理单元及包含该处理单元的处理系统

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2703010B2 (ja) * 1988-12-23 1998-01-26 株式会社日立製作所 ニユーラルネツト信号処理プロセツサ
US5216751A (en) * 1990-10-22 1993-06-01 Motorola, Inc. Digital processing element in an artificial neural network
CN101382882B (zh) * 2008-09-28 2010-08-11 宁波大学 一种基于CTGAL的Booth编码器及绝热补码乘累加器
CN101706712B (zh) * 2009-11-27 2011-08-31 北京龙芯中科技术服务中心有限公司 浮点向量乘加运算装置和方法
US9292297B2 (en) * 2012-09-14 2016-03-22 Intel Corporation Method and apparatus to process 4-operand SIMD integer multiply-accumulate instruction
KR102063557B1 (ko) * 2013-05-14 2020-01-09 한국전자통신연구원 시간할당 알고리즘 기반의 인터폴레이션 필터
CN109496306B (zh) * 2016-07-13 2023-08-29 莫鲁米有限公司 多功能运算装置及快速傅里叶变换运算装置
US10891538B2 (en) * 2016-08-11 2021-01-12 Nvidia Corporation Sparse convolutional neural network accelerator
US10360163B2 (en) * 2016-10-27 2019-07-23 Google Llc Exploiting input data sparsity in neural network compute units
CN107657312B (zh) * 2017-09-18 2021-06-11 东南大学 面向语音常用词识别的二值网络实现系统
CN107862374B (zh) * 2017-10-30 2020-07-31 中国科学院计算技术研究所 基于流水线的神经网络处理系统和处理方法
CN107967132B (zh) * 2017-11-27 2020-07-31 中国科学院计算技术研究所 一种用于神经网络处理器的加法器和乘法器
CN109344964B (zh) * 2018-08-08 2020-12-29 东南大学 一种适用于神经网络的乘加计算方法和计算电路

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107402905A (zh) * 2016-05-19 2017-11-28 北京旷视科技有限公司 基于神经网络的计算方法及装置
CN107341544A (zh) * 2017-06-30 2017-11-10 清华大学 一种基于可分割阵列的可重构加速器及其实现方法
CN107797962A (zh) * 2017-10-17 2018-03-13 清华大学 基于神经网络的计算阵列
CN107818367A (zh) * 2017-10-30 2018-03-20 中国科学院计算技术研究所 用于神经网络的处理系统和处理方法
CN107844826A (zh) * 2017-10-30 2018-03-27 中国科学院计算技术研究所 神经网络处理单元及包含该处理单元的处理系统

Also Published As

Publication number Publication date
CN109344964A (zh) 2019-02-15
CN109344964B (zh) 2020-12-29
WO2020029583A1 (zh) 2020-02-13
US10984313B2 (en) 2021-04-20
US20200342295A1 (en) 2020-10-29

Similar Documents

Publication Publication Date Title
WO2020029551A1 (zh) 一种适用于神经网络的乘加计算方法和计算电路
CN110263925B (zh) 一种基于fpga的卷积神经网络前向预测的硬件加速实现装置
Huddar et al. Novel high speed vedic mathematics multiplier using compressors
Wang et al. Low power convolutional neural networks on a chip
Li et al. Merging the interface: Power, area and accuracy co-optimization for rram crossbar-based mixed-signal computing system
CN108090560A (zh) 基于fpga的lstm递归神经网络硬件加速器的设计方法
CN110543939B (zh) 一种基于fpga的卷积神经网络后向训练的硬件加速实现装置
CN104145281A (zh) 神经网络计算装置和系统及其方法
CA2125244A1 (en) Artificial neuron and method of using same
CN112068798B (zh) 一种实现网络节点重要性排序的方法及装置
US20220164663A1 (en) Activation Compression Method for Deep Learning Acceleration
Kung et al. A power-aware digital feedforward neural network platform with backpropagation driven approximate synapses
CN112101517A (zh) 基于分段线性脉冲神经元网络的fpga实现方法
CN110309904B (zh) 一种神经网络压缩方法
CN109271695A (zh) 基于神经网络的多目标天线设计方法
CN109508784A (zh) 一种神经网络激活函数的设计方法
Waris et al. AxRMs: Approximate recursive multipliers using high-performance building blocks
Bao et al. LSFQ: A low precision full integer quantization for high-performance FPGA-based CNN acceleration
CN115145536A (zh) 一种低位宽输入-低位宽输出的加法器树单元及近似乘加方法
CN111275167A (zh) 一种用于二值卷积神经网络的高能效脉动阵列架构
Ranganath et al. Design of MAC unit in artificial neural network architecture using Verilog HDL
US20210374509A1 (en) Modulo Operation Unit
Hong et al. NN compactor: Minimizing memory and logic resources for small neural networks
JPH04229362A (ja) 学習機械
Ogbogu et al. Energy-Efficient ReRAM-Based ML Training via Mixed Pruning and Reconfigurable ADC

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19846526

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19846526

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 19846526

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07.09.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19846526

Country of ref document: EP

Kind code of ref document: A1