CN113095468B - Neural network accelerator and data processing method thereof

Info

Publication number: CN113095468B
Application number: CN201911337168.0A
Authority: CN (China)
Prior art keywords: network, ReRAM, quantization, value, unit
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN113095468A
Inventor: 王佩琪
Assignee (original and current): Shanghai Sensetime Intelligent Technology Co Ltd
Events: application filed by Shanghai Sensetime Intelligent Technology Co Ltd; priority to CN201911337168.0A; publication of CN113095468A; application granted; publication of CN113095468B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Semiconductor Memories (AREA)

Abstract

Embodiments of the present disclosure provide a neural network accelerator and a data processing method thereof. The neural network accelerator comprises a network operation unit and a hardware quantization unit, wherein the network operation unit comprises a first resistive random-access memory (ReRAM) circuit and the hardware quantization unit comprises a second ReRAM circuit. The network operation unit is configured to process the t-th input data of a target network layer in a recurrent neural network to obtain the t-th network output value of the target network layer; the hardware quantization unit is configured to perform ternary quantization on the t-th network output value of the target network layer to obtain a quantization result of the t-th network output value.

Description

Neural network accelerator and data processing method thereof
Technical Field
The present disclosure relates to machine learning techniques, and in particular to neural network accelerators and data processing methods thereof.
Background
Recurrent Neural Networks (RNNs) are widely used in natural language processing, machine translation, speech recognition, and other practical applications. RNN-based architectures account for roughly 30% of the workload running in data centers, whereas Convolutional Neural Networks (CNNs) account for only about 5%, and the computationally intensive nature of RNN tasks is more pronounced than that of CNNs. To handle the temporal characteristics of the input data, an RNN must store part of the history of its output sequence and perform the corresponding computation on that information, so practical RNN computation demands substantial computing and storage resources. How to accelerate the RNN computation process is therefore an important research problem.
Disclosure of Invention
Embodiments of the present disclosure provide at least a neural network accelerator and a corresponding data processing method.
In a first aspect, there is provided a neural network accelerator comprising a network operation unit and a hardware quantization unit, wherein the network operation unit comprises a first resistive random-access memory (ReRAM) circuit and the hardware quantization unit comprises a second ReRAM circuit;
the network operation unit is configured to process the t-th input data of a target network layer in a recurrent neural network to obtain a t-th network output value of the target network layer;
and the hardware quantization unit is configured to perform ternary quantization on the t-th network output value of the target network layer to obtain a quantization result of the t-th network output value.
In combination with any embodiment of the present disclosure, the second ReRAM circuit includes a first comparator configured to obtain the quantization result of the t-th network output value by comparing the t-th network output value with a quantization reference value.
In combination with any embodiment of the present disclosure, the second ReRAM circuit further includes a first ReRAM array connected to the first comparator, the first ReRAM array being configured to store the t-th network output value.
In combination with any embodiment of the present disclosure, the first comparator is located in an analog-to-digital converter in a first bit line peripheral circuit of the first ReRAM array, and the quantization reference value is a quantization threshold preset in the first comparator.
In combination with any embodiment of the present disclosure, the second ReRAM circuit further includes a random number generator for generating a random number, the quantization reference value being a random number generated by the random number generator. The random number generator comprises a ReRAM cell and a second comparator; the ReRAM cell is configured to output a current value corresponding to the resistance value it stores, and the second comparator is configured to obtain the random number by comparing a standard value with the current value output by the ReRAM cell.
In combination with any embodiment of the present disclosure, the accelerator further comprises a first ReRAM array configured to store the quantization result of the t-th network output value.
In combination with any embodiment of the present disclosure, the accelerator further comprises a first word line peripheral circuit configured to split data to be stored into two non-negative data values, wherein the data comprises the t-th network output value or the quantization result of the t-th network output value, and the first ReRAM array of the accelerator is used to store the two non-negative data values.
In combination with any embodiment of the present disclosure, the first ReRAM circuit includes a second word line peripheral circuit and a plurality of second ReRAM arrays. The second word line peripheral circuit is configured to acquire the t-th input data from a memory and input the t-th input data to the plurality of second ReRAM arrays; each second ReRAM array is configured to store, in the form of resistance values, the ternary-quantized network parameters of the target network layer, and to perform matrix multiply-add computation on the t-th input data according to those network parameters.
In combination with any embodiment of the present disclosure, the t-th input data includes two non-negative data values; each of the plurality of second ReRAM arrays is configured to store one of two non-negative parameter values obtained by splitting the ternary-quantized network parameters, and to compute the product of that non-negative parameter value and the one non-negative data value input to it; the first ReRAM circuit further includes a second bit line peripheral circuit configured to fuse the products output by the plurality of second ReRAM arrays to obtain the multiplication result of the t-th input data and the network parameters.
In combination with any embodiment of the present disclosure, the second word line peripheral circuit is configured, in the case where the t-th input data is multi-bit data, to input the multi-bit data bit by bit into the second ReRAM arrays for computation; the first ReRAM circuit includes a second bit line peripheral circuit configured to merge the computation results corresponding to each bit of the multi-bit data.
In combination with any embodiment of the present disclosure, the neural network accelerator further includes a control unit, and the network operation unit includes a matrix calculation unit comprising the first ReRAM circuit, a nonlinear unit, and a vector multiplication unit; the control unit is configured to allow the next input data to enter the network operation unit when any one of the matrix calculation unit, the nonlinear unit, and the vector multiplication unit finishes processing the current input data.
In a second aspect, there is provided a data processing method, the method comprising:
acquiring the t-th input data of a target network layer in a recurrent neural network;
processing the t-th input data through a network operation unit to obtain a t-th network output value of the target network layer;
and performing, through a hardware quantization unit, ternary quantization on the t-th network output value of the target network layer to obtain a quantization result of the t-th network output value.
In combination with any embodiment of the present disclosure, performing, through the hardware quantization unit, ternary quantization on the t-th network output value of the target network layer to obtain the quantization result of the t-th network output value includes: comparing, through the hardware quantization unit, the t-th network output value with a quantization reference value to obtain the quantization result of the t-th network output value.
In combination with any embodiment of the present disclosure, comparing the t-th network output value with a quantization reference value through the hardware quantization unit includes: comparing the t-th network output value with the quantization reference value in the process of reading the stored t-th network output value from the hardware quantization unit.
In combination with any embodiment of the present disclosure, before comparing the t-th network output value with a quantization reference value through the hardware quantization unit, the method further comprises: outputting a current value corresponding to the resistance value stored in a ReRAM cell of the hardware quantization unit; and comparing a standard value with the current value to obtain a random number, the random number serving as the quantization reference value.
In combination with any embodiment of the present disclosure, the method further comprises: splitting data to be stored into two non-negative data values, wherein the data to be stored comprises the t-th network output value or the quantization result of the t-th network output value; and storing the two non-negative data values.
In combination with any embodiment of the present disclosure, processing the t-th input data through the network operation unit to obtain the t-th network output value of the target network layer includes: inputting the acquired t-th input data into a plurality of second ReRAM arrays included in the network operation unit; and performing, through the plurality of second ReRAM arrays, matrix multiply-add computation on the t-th input data based on the ternary-quantized network parameters of the target network layer stored in the form of resistance values.
In combination with any embodiment of the present disclosure, the t-th input data includes two non-negative data values; performing the matrix multiply-add computation on the t-th input data through the plurality of second ReRAM arrays based on the ternary-quantized network parameters of the target network layer stored as resistance values includes: computing, through each of the plurality of second ReRAM arrays, the product of one stored non-negative parameter value, obtained by splitting the ternary-quantized network parameters, and the one non-negative data value input to that second ReRAM array; and fusing the products output by the plurality of second ReRAM arrays to obtain the multiplication result of the t-th input data and the network parameters.
In combination with any embodiment of the present disclosure, the t-th input data is multi-bit data; performing the matrix multiply-add computation on the t-th input data through the plurality of second ReRAM arrays based on the ternary-quantized network parameters of the recurrent neural network stored as resistance values includes: inputting the multi-bit data bit by bit into the second ReRAM arrays for computation to obtain a computation result for each bit of the multi-bit data; and merging the computation results corresponding to each bit of the multi-bit data to obtain the computation result of the t-th input data.
In a third aspect, there is provided a data processing apparatus comprising means for performing the steps of the data processing method according to any embodiment of the present disclosure.
The neural network accelerator provided by the embodiments of the present disclosure comprises a network operation unit containing a first ReRAM circuit and a hardware quantization unit containing a second ReRAM circuit. The network operation unit processes the t-th input data of a target network layer in a recurrent neural network to obtain the t-th network output value of the target network layer, and the hardware quantization unit performs ternary quantization on that t-th network output value to obtain its quantization result. The recurrent neural network therefore computes on quantized data, which reduces the amount of computation, accelerates the network computation process, and improves computational performance.
Drawings
To illustrate the technical solutions of one or more embodiments of the present disclosure or of the related art more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments recorded in one or more embodiments of the present disclosure; other drawings can be obtained from them by those of ordinary skill in the art without inventive effort.
FIG. 1 illustrates a general architecture schematic of a neural network accelerator provided by at least one embodiment of the present disclosure;
FIG. 2 illustrates a schematic diagram of a neuron computation of an LSTM provided in accordance with at least one embodiment of the present disclosure;
FIG. 3 illustrates the storage and computation of a ReRAM array provided by at least one embodiment of the present disclosure;
FIG. 4 illustrates a ReRAM array and its peripheral circuitry provided in accordance with at least one embodiment of the present disclosure;
FIG. 5 illustrates a wordline peripheral circuit of a ReRAM array provided in accordance with at least one embodiment of the present disclosure;
FIG. 6 illustrates a bit line peripheral circuit of a ReRAM array provided in accordance with at least one embodiment of the present disclosure;
FIG. 7 illustrates a quantization flow diagram of a fixed quantization provided by at least one embodiment of the present disclosure;
FIG. 8 illustrates a structure of a random quantization unit provided by at least one embodiment of the present disclosure;
FIG. 9 illustrates a quantization flow diagram for random quantization provided by at least one embodiment of the present disclosure;
FIG. 10 illustrates a flow of a fixed quantization and splitting provided by at least one embodiment of the present disclosure;
FIG. 11 illustrates a flow of random quantization and splitting provided by at least one embodiment of the present disclosure;
FIG. 12 illustrates a flow chart of a data processing method provided by at least one embodiment of the present disclosure.
Detailed Description
In order that those skilled in the art may better understand the technical solutions in one or more embodiments of the present disclosure, these technical solutions are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by one of ordinary skill in the art based on one or more embodiments of the present disclosure without inventive effort fall within the scope of the present disclosure.
Embodiments of the present disclosure provide a neural network accelerator that can accelerate the computation process of a recurrent neural network during the inference deployment phase of the neural network.
The neural network accelerator may include a network operation unit and a hardware quantization unit.
The network operation unit may include, for example, a Resistive Random-Access Memory (ReRAM) circuit, and the hardware quantization unit may also include a ReRAM circuit. For convenience of distinction, the ReRAM circuit in the network operation unit may be referred to as a first ReRAM circuit, and the ReRAM circuit included in the hardware quantization unit may be referred to as a second ReRAM circuit.
In the computation process of the recurrent neural network, the network operation unit can process the t-th input data of the target network layer in the recurrent neural network to obtain the t-th network output value of the target network layer. The recurrent neural network may include multiple layers (e.g., Layer 1, Layer 2, ...), and the target network layer may be one of them. The t-th input data may be the input data of a neuron at time t, and the t-th network output value of the target network layer is the output value obtained after computing on that input data.
After the network operation unit computes the network output value of a neuron of the recurrent neural network, the hardware quantization unit may perform ternary quantization on the t-th network output value. For example, the quantization result may be one of 0, 1, and -1. In the embodiments of the present disclosure, the ternary quantization of the network output value may be implemented based on the second ReRAM circuit, i.e., it is ternary quantization performed by hardware.
As described above, the neural network accelerator realizes ternary quantization of the network output value in hardware, so the quantization process does not interrupt the hardware-controlled computation flow: hardware-controlled ternary quantization is executed within the hardware-controlled computation of the recurrent neural network, which improves computational performance.
A specific implementation of the accelerator structure will be described below by way of an exemplary neural network accelerator structure. Fig. 1 illustrates a general architecture schematic of a neural network accelerator, as shown in fig. 1, which may include: the control unit 11, the I/O interface 12, the on-chip storage 13 and the computing unit 14, the number of the computing units 14 may be one or more. The computing unit 14 is a core part of the accelerator, and is responsible for the computing process of the neural network.
Alternatively, the calculation unit 14 may be responsible for different phases of the neural network calculation process by different parts when calculating the neural network, whereby the calculation unit 14 may be divided into different calculation function modules. For example, referring to the example of fig. 1, the computing unit 14 may include: a matrix calculation unit 141, a nonlinear unit 142, a vector multiplication unit 143, and a buffer unit 144.
In the case of a calculation of the recurrent neural network, different functional modules of the calculation unit 14 can be responsible for different calculation tasks. For example, the buffer unit 144 may be used to store intermediate data generated during the calculation. The matrix calculation unit 141 may be responsible for performing matrix multiply-add calculations in the recurrent neural network; the nonlinear unit 142 may be responsible for performing an activation function calculation on the result of the matrix multiply-add calculation. The vector multiplication unit 143 may perform a vector multiplication and addition operation according to the calculation result of the activation function.
Taking Long Short-Term Memory (LSTM) as an example, the computation of a neuron in an LSTM is given by:

$i_t, f_t, o_t, g_t = \sigma(W x_t + U h_{t-1} + b)$   (1-1)

$c_t = f_t \cdot c_{t-1} + i_t \cdot g_t$   (1-2)

$h_t = o_t \cdot \tanh(c_t)$   (1-3)
Referring to FIG. 2, FIG. 2 is a schematic diagram of the neuron computation corresponding to the LSTM described above. In equation (1-1), the symbol $\sigma$ denotes the corresponding activation function computation, and $W x_t + U h_{t-1} + b$ is a matrix multiply-add computation. Equation (1-2) combines the gate outputs to obtain the cell state $c_t$ of the neuron at the current time t; this part is a vector multiply-add computation. Finally, the network output value $h_t$ at the current time is obtained through equation (1-3), which again involves an activation function computation and a vector multiply-add computation.
As above, in the LSTM example, the computation process of the recurrent neural network involves: matrix multiply-add computation, activation function computation, and vector multiply-add computation. As can be seen from the exemplary description of fig. 1, the matrix multiply-add calculation may be performed by the matrix calculation unit 141 in the neural network accelerator, the activation function calculation may be performed by the nonlinear unit 142, and the vector multiply-add calculation may be performed by the vector multiplication unit 143.
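For illustration, the following is a minimal NumPy sketch of one LSTM time step following equations (1-1) to (1-3). It is a software reference only, not the accelerator's implementation; the stacked-gate layout of W, U, and b, the use of tanh for the candidate g, and all names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step per equations (1-1) to (1-3).
    Assumed layout: the four gates i, f, o, g are stacked row-wise in
    W (4n x d), U (4n x n), b (4n,); g uses tanh as is conventional."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b        # matrix multiply-add, eq. (1-1)
    i = sigmoid(z[0 * n:1 * n])         # input gate
    f = sigmoid(z[1 * n:2 * n])         # forget gate
    o = sigmoid(z[2 * n:3 * n])         # output gate
    g = np.tanh(z[3 * n:4 * n])         # candidate cell input
    c_t = f * c_prev + i * g            # vector multiply-add, eq. (1-2)
    h_t = o * np.tanh(c_t)              # network output value, eq. (1-3)
    return h_t, c_t

# Illustrative sizes: d = 3 inputs, n = 2 hidden units.
rng = np.random.default_rng(0)
d, n = 3, 2
h_t, c_t = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n),
                     rng.normal(size=(4 * n, d)),
                     rng.normal(size=(4 * n, n)), np.zeros(4 * n))
```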
To accelerate recurrent neural network computation, the embodiments of this specification use ReRAM to store and compute data. The material characteristics of ReRAM allow it to store and compute data at the same time; it also offers high integration density, low static power consumption, and good access performance, so it can efficiently accelerate the neural network.
For example, the matrix calculation unit 141 or the nonlinear unit 142 may be implemented by a ReRAM array or ReRAM cell together with peripheral circuits, and the buffer unit 144 may also employ ReRAM as its storage medium.
As can be seen from the above description, the calculation process of the recurrent neural network may be performed by calculation functional modules such as the matrix calculation unit 141, the nonlinear unit 142, and the vector multiplication unit 143, and these units responsible for the calculation of the recurrent neural network may be referred to as network operation units. As can be further seen from the above description, the network operation unit includes a first ReRAM circuit, for example, where the matrix calculation unit 141 and the nonlinear unit 142 may include a ReRAM array and its corresponding peripheral circuit. The present embodiments may refer to the ReRAM array and its corresponding peripheral circuitry as ReRAM circuitry, e.g., the circuitry may be first ReRAM circuitry.
Furthermore, as can be seen from equation (1-1), network weights such as W, U, and b are also used in the computation of the recurrent neural network. In some embodiments, the network weights have already undergone ternary quantization during the training phase of the recurrent neural network; e.g., W, U, and b may take values 0, +1, or -1. In the network inference deployment stage of the embodiments of the present disclosure, the ternary-quantized network weights can be used directly.
In some implementations, the network weights may be stored in a ReRAM array, which uses resistance values to store them. For example, a resistance in the ReRAM array may take a high resistance state or a low resistance state: the high resistance state stores a network weight of 0, and the low resistance state stores a network weight of 1. A network weight of -1 can likewise be realized through the resistance storage of the ReRAM array; the embodiments of the present disclosure do not limit the specific way a ReRAM array stores network weights as resistances.
With continued reference to FIG. 3, FIG. 3 illustrates the storage and computation of a ReRAM array. The multiplier corresponding to the bias parameter b can be fixed to 1, and W, U, and b are stored as resistance values in the ReRAM array. Input data enter the array as voltage signals, and by Kirchhoff's law the matrix multiply-add result of the network weights and the input data is output as a current signal. The input data may be the output of a neuron in the previous layer, or the output of the same neuron at the previous time step.
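To make the crossbar computation of FIG. 3 concrete, the following sketch models a ReRAM array in the idealized way described above: weights held as cell conductances, inputs applied as word line voltages, and each bit line summing cell currents by Kirchhoff's law. This is an idealized software model under illustrative values, not a device model.

```python
import numpy as np

def crossbar_mvm(G, v):
    """Idealized crossbar: cell conductance G[i, j] stores a weight,
    word line i carries voltage v[i], and bit line j outputs the
    current I_j = sum_i v[i] * G[i, j] (Kirchhoff's current law)."""
    return v @ G

# Map W, U, and b onto one array; the multiplier of b is fixed to 1.
W = np.array([[1.0, 0.0], [0.0, 1.0]])    # illustrative ternary weights
U = np.array([[0.0, 1.0], [1.0, 0.0]])
b = np.array([0.0, 1.0])
G = np.vstack([W, U, b[None, :]])         # word lines: x rows, h rows, bias row

x_t = np.array([1.0, 0.0])
h_prev = np.array([0.0, 1.0])
v = np.concatenate([x_t, h_prev, [1.0]])  # bias word line driven with 1

print(crossbar_mvm(G, v))                 # equals x_t @ W + h_prev @ U + b
```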
As described above, the computation of the recurrent neural network by the neural network accelerator illustrated in FIG. 1 can proceed through a process that includes an initial stage and a calculation stage:
the initial stage: in the initial phase of the calculation, the network weights and the input data required for the calculation may be loaded into the accelerator through the I/O interface 12 of the accelerator. The neural network accelerator may divide the computational tasks according to the size of the computational tasks and the actual hardware computational resources. For example, computing tasks that do not have computing dependencies with respect to each other may be processed by parallel computing units or ReRAM arrays, while computing tasks that have computing dependencies with respect to each other may be assigned to the same ReRAM array. And, the network weights may be stored in a ReRAM array awaiting subsequent computation.
Calculation stage: the neural network accelerator reads the corresponding input data from a storage unit (e.g., the on-chip storage or the buffer unit) according to the divided computation tasks, and performs the corresponding computation using the device characteristics of ReRAM. For example, the matrix calculation unit may perform the matrix multiply-add computation, the nonlinear unit the activation function computation, and the vector multiplication unit the vector multiply-add computation. The network output value $h_t$ may be stored, for example, in the buffer unit; this output value $h_t$ may then serve as input data for the same neuron's computation at the next time. When all computation tasks are completed, the computation of the recurrent neural network ends.
Because the ternary-quantized network weights are used in the network computation, weight quantization reduces the amount of computation and accelerates the computation process of the recurrent neural network. On this basis, the embodiments of this specification further perform ternary quantization on the network output values, implemented in hardware by the hardware quantization unit; specifically, the network output value $h_t$ computed for each neuron in the recurrent neural network is quantized.
Ternary quantization of the network output values computed by the recurrent neural network further reduces the amount of computation. For example, in the matrix multiply-add computation $W x_t + U h_{t-1} + b$ of equation (1-1), the network weights W, U, b, the input data $x_t$ (the network output value of the previous layer), and $h_{t-1}$ (the network output value of the previous time step) have all been ternary quantized, so the complex matrix multiply-add computation is converted into simple addition and subtraction, accelerating the network computation. In addition, because the ternary quantization of the network output value is implemented in hardware, the quantization process does not interrupt the hardware-controlled computation flow: hardware-controlled ternary quantization is executed within the hardware-controlled network computation, improving computational performance.
In the embodiments of the present disclosure, when the hardware quantization unit containing the second ReRAM circuit performs ternary quantization on the t-th network output value, a specific approach is as follows. The second ReRAM circuit may include a comparator, referred to as the first comparator to distinguish it from comparators appearing later. The first comparator obtains the quantization result of the t-th network output value by comparing it with a quantization reference value. For example, if the t-th network output value is greater than the quantization reference value, the quantization result may be set to 1; if it is less than the quantization reference value, the quantization result may be set to 0; and so on. The embodiments of the present disclosure do not limit the specific implementation of the comparator.
Two exemplary hardware ternary quantization approaches are introduced below: "fixed quantization" and "random quantization"; the hardware may be configured to use either of them. In both cases, the first comparator described above performs the ternary quantization of the t-th network output value.
Fixed quantization
Referring to fig. 4, fig. 4 illustrates a ReRAM array and its peripheral circuitry. Fig. 5 illustrates a word line peripheral circuit of a ReRAM array, and fig. 6 illustrates a bit line peripheral circuit of a ReRAM array.
After the input data of the ReRAM array are processed by the word line peripheral circuit, they enter the ReRAM array through the word lines (WL) and undergo analog computation with the network weights stored in the array. For example, the network weights W, U, b in equation (1-1) are stored as individual resistance values in the ReRAM array, while $x_t$, $h_{t-1}$, and 1 are transmitted to the array through the word lines; the ReRAM array can then perform the matrix multiply-add computation of the input data and the network weights.
The result of the matrix multiply-add computation may be output through the bit lines (BL) of the ReRAM array, and the output may be processed by the bit line peripheral circuit corresponding to the ReRAM array. The results of the activation function computation and of the vector multiply-add computation may likewise be output after processing by the bit lines and the bit line peripheral circuits.
For example, during the neuron computation of the recurrent neural network, the product results such as $W x_t$ and $U h_{t-1}$ computed by the ReRAM array are output on the bit lines and summed by the bit line peripheral circuit to obtain the intermediate result $W x_t + U h_{t-1} + b$, which can be converted into a digital signal by an 8-bit analog-to-digital converter (ADC) in the bit line peripheral circuit. The converted intermediate result is further processed by the subsequent activation function computation and vector multiply-add computation, after which the final network output value $h_t$ of the neuron computation is obtained. The network output value $h_t$ may be stored in the buffer unit 144 of the computing unit. Because, in the time dimension, the data computation of a recurrent neural network at time t+1 depends on the computation result at time t, the network output value $h_t$ serves as one of the input data for the computation at time t+1 and is to be input into the ReRAM array to participate in the matrix multiply-add computation.
In the embodiments of the present disclosure, the network output value $h_t$ may be referred to as the t-th network output value, and it may be stored in the second ReRAM circuit, specifically in the first ReRAM array included in the second ReRAM circuit. For example, as can be seen from FIGS. 1 and 4 together, the buffer unit 144 may contain a ReRAM circuit similar to that shown in FIG. 4, i.e., the second ReRAM circuit, which includes a ReRAM array and its surrounding peripheral circuits. That ReRAM array is the first ReRAM array described above, and the t-th network output value may be stored in it.
Before the network output value $h_t$ is input to the ReRAM array in the matrix calculation unit to participate in the data computation at time t+1, ternary quantization may be performed on $h_t$. Specifically, $h_t$ may be read from the first ReRAM array in the buffer unit 144 before the computation is performed. When used as a storage element, a ReRAM cell stores data through different resistance values, so the network output value $h_t$ is stored as the resistance of a ReRAM cell. When the data need to be read, the peripheral circuit applies the read-control voltage signal to the word line, and the bit line outputs the corresponding current value.
The current value output on the bit line, corresponding to $h_t$, can enter the first comparator and be compared against a quantization threshold preset in the first comparator, thereby realizing ternary quantization of $h_t$. Illustratively, the first comparator may be located in the bit line peripheral circuit corresponding to the first ReRAM array, referred to as the first bit line peripheral circuit (see the bit line peripheral in FIG. 4). The ADC in the first bit line peripheral circuit (responsible for converting the output analog current signal into a digital signal) may include the first comparator, provided with a quantization threshold $\theta$; by comparing the current value with the quantization threshold $\theta$ during analog-to-digital conversion, the comparator can output the ternary quantization result of $h_t$ according to the comparison.
It should be noted that, in the above procedure, quantizing the network output value $h_t$ is actually achieved by modifying the comparator in the bit line peripheral circuit of the buffer unit 144 where $h_t$ is stored, so that quantization is completed at the same time as the data is read. In this process, based on the characteristics of ReRAM cells, the input voltage signal controls the data stored in a given ReRAM cell to be read out as a corresponding current signal; since this current signal corresponds to $h_t$, the comparator can be said to perform ternary quantization of $h_t$ through the comparison between the network output value and the quantization threshold.
The following formula (2-1) performs ternary quantization of x according to the quantization threshold $\theta$, where x may be the output current value of the ReRAM cell corresponding to the network output value $h_t$; consistent with the description above, it takes the thresholded form:

$\mathrm{Quant}(x) = \begin{cases} 1, & x > \theta \\ 0, & |x| \le \theta \\ -1, & x < -\theta \end{cases}$   (2-1)
As described above, in this fixed quantization manner the second ReRAM circuit of the hardware quantization unit may be the circuit located in the buffer unit 144, so that the buffer unit simultaneously implements both the storage and the ternary quantization of the t-th network output value.
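A minimal sketch of the fixed quantization of formula (2-1), modeling the threshold comparison that the first comparator performs on the read-out value; the threshold of 0.5 and the function name are illustrative assumptions.

```python
import numpy as np

def fixed_ternary_quant(x, theta=0.5):
    """Formula (2-1): 1 if x > theta, -1 if x < -theta, else 0."""
    return np.where(x > theta, 1, np.where(x < -theta, -1, 0))

h_t = np.array([0.9, -0.2, -0.8, 0.3])
print(fixed_ternary_quant(h_t))  # -> [ 1  0 -1  0]
```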
FIG. 7 illustrates the flow of fixed quantization. As shown in FIG. 7, it may include the following steps, described briefly below (the detailed quantization flow is given above):
In step 700, the t-th input data is processed by the network operation unit to obtain the t-th network output value. For example, the input data of a neuron of the recurrent neural network enters the network operation unit, and after the matrix multiply-add computation, activation function computation, vector multiply-add computation, and other processing in the network operation unit, the network output value $h_t$ of that neuron is finally obtained.
In step 702, the t-th network output value is output and stored into the second ReRAM circuit of the buffer unit. The network output value $h_t$ computed by the network operation unit may be stored in the second ReRAM circuit of the buffer unit, specifically in the first ReRAM array included in that circuit.
In step 704, the stored t-th network output value is read, and the read output value is quantized by the first comparator.

When the network output value $h_t$ is to participate in the next computation (e.g., the neuron computation at time t+1), it can be read out from the first ReRAM array. The read network output value $h_t$ may enter the first bit line peripheral circuit corresponding to the first ReRAM array. The ADC in that bit line peripheral circuit includes the first comparator, which obtains the quantization result of the network output value $h_t$ by comparing it with the quantization threshold $\theta$.
Random quantization
In the fixed quantization manner above, ternary quantization of the network output value may be implemented by the bit line peripheral circuit included in the buffer unit 144 in FIG. 1. Alternatively, the random quantization unit 145 in FIG. 1 may randomly quantize the network output value.
In the random quantization mode, besides the first comparator, the second ReRAM circuit included in the hardware quantization unit may include a random number generator for generating a random number, and the random number serves as the quantization reference value for the first comparator. The first comparator obtains the quantization result of the t-th network output value by comparing it with the random number.
FIG. 8 illustrates a structure of the random quantization unit 145, which may contain a second ReRAM circuit. As shown in FIG. 8, the random quantization unit 145 may include a random number generator 71 and a comparator 72, where the comparator 72 is the first comparator.

The random number generator 71 may comprise a plurality of ReRAM cells; for example, in the embodiments of the present disclosure it may comprise 8 ReRAM cells. A characteristic of the ReRAM cell as a memory device is that when the stored data is read, the resistance value actually obtained follows a normal distribution whose mean is the stored resistance R. Therefore, when the data is read by applying the read voltage signal Vread (Vread indicating a read operation), the output current corresponds to the resistance stored in the ReRAM cell; this current varies with the resistance and likewise obeys a normal distribution, which is equivalent to a 1-bit random output value.

The output value of each ReRAM cell may be compared with a standard (std) current value: if the output of a cell is greater than the standard current value, the random number generator outputs 1 at the position corresponding to that cell; otherwise, if it is less than the standard current value, it outputs 0 at that position. The random number generator thus obtains a string of random 0/1 bits. For example, FIG. 8 shows that the comparator included in the random number generator 71, referred to as the second comparator 73, generates the random number 10010110 according to the comparison between the current values output by the ReRAM cells and the standard value.

The comparator 72 (i.e., the first comparator) may compare the random number generated by the random number generator with the absolute value of the input data x, and output the ternary quantization result of x according to the comparison. For example, when x is positive, the quantization result is 1 if |x| is greater than the random number and 0 if |x| is less than the random number; when x is negative, the quantization result is -1 if |x| is greater than the random number and 0 if |x| is less than the random number. Here the input data x can be the network output value $h_t$ of the recurrent neural network.
When ternary quantization is performed on the network output value $h_t$ according to this random quantization method, the probability of each quantization result obeys formula (2-2); consistent with the comparison rule above (with $|h_t|$ normalized into $[0, 1]$), it takes the form:

$P\big(\hat{h}_t = \mathrm{sign}(h_t)\big) = p = |h_t|, \qquad P\big(\hat{h}_t = 0\big) = 1 - p$   (2-2)

wherein the probability p obeys a Bernoulli distribution.
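The following sketch mirrors the random quantization described above: the read currents of 8 ReRAM cells are modeled as normally distributed samples, the second comparator turns them into random bits, and the first comparator compares |x| against the resulting random number, so that the sign of x is kept with probability |x| as in formula (2-2). The normalization of the random number into [0, 1), the distribution parameters, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def reram_random_number(n_bits=8, std_current=1.0, sigma=0.1):
    """Model of random number generator 71: each cell's read current is
    ~ Normal(std_current, sigma); the second comparator outputs 1 where
    the current exceeds the standard (std) current value, else 0."""
    currents = rng.normal(loc=std_current, scale=sigma, size=n_bits)
    bits = (currents > std_current).astype(int)            # e.g. 10010110
    return int("".join(map(str, bits)), 2) / 2 ** n_bits   # scale to [0, 1)

def random_ternary_quant(x):
    """First comparator 72: keep sign(x) if |x| beats the random number,
    else quantize to 0 (x assumed normalized so that |x| <= 1)."""
    return int(np.sign(x)) if abs(x) > reram_random_number() else 0

print([random_ternary_quant(v) for v in (0.9, -0.7, 0.1)])
```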
FIG. 9 illustrates the flow of random quantization. As shown in FIG. 9, it may include the following steps, described briefly below (the detailed quantization flow is given above):
In step 900, the t-th input data is processed by the network operation unit to obtain the t-th network output value. For example, the input data of a neuron of the recurrent neural network enters the network operation unit, and after the matrix multiply-add computation, activation function computation, vector multiply-add computation, and other processing in the network operation unit, the network output value $h_t$ of that neuron is finally obtained.
In step 902, the t-th network output value is output to a random quantization unit, and the second ReRAM circuit in the random quantization unit performs quantization processing on the t-th network output value.
In this step, the network output value $h_t$ computed by the network operation unit may be output to the random quantization unit. As described in the embodiment above, the random number generator 71 in the random quantization unit 145 generates a random number, and the comparator 72 compares the random number with the network output value $h_t$ to obtain the quantization result of $h_t$.
In step 904, the quantization result of the t-th network output value is stored in the buffer unit.

For example, the quantization result of the network output value $h_t$ obtained by the random quantization unit can be stored in the second ReRAM circuit of the buffer unit, specifically in the first ReRAM array included in that circuit. When the network output value $h_t$ is to participate in the next computation (e.g., the neuron computation at time t+1), the stored quantization result of $h_t$ can be read out directly to participate in that computation.
Ternary quantization of the network output values computed for the neurons of the recurrent neural network can thus be achieved by either the fixed quantization or the random quantization described above, both implemented as hardware quantization. In addition, the quantization result of the t-th network output value processed by the random quantization unit shown in FIG. 8 may be stored in the first ReRAM array, which may be located, for example, in the buffer unit.
Further, in the embodiments herein, when computing the recurrent neural network, the input data of a neuron may be negative, e.g., -1. Likewise, even though the ternary-quantized network parameters (e.g., the network weights) are stored in advance in the ReRAM array of the matrix calculation unit, a network weight may also be a negative number such as -1. To realize the representation and computation of negative numbers (e.g., -1) in the ReRAM array, the neural network accelerator of this embodiment may include a first word line peripheral circuit that splits the data to be stored into two non-negative data values. The data to be stored may be the t-th network output value or the quantization result of the t-th network output value.
For example, in the matrix multiply-add computation, the network weights W, U, b may be split in advance and stored directly in the ReRAM array of the matrix calculation unit during the initial stage of the recurrent neural network computation. Input data (e.g., $x_t$ or $h_{t-1}$), on the other hand, reside in the buffer unit 144 (also a ReRAM array): before being stored into the first ReRAM array of the buffer unit 144, the data to be stored may first be split by the first word line peripheral circuit into two non-negative data values, e.g., $x^+$ and $x^-$, which are then input into the first ReRAM array for storage. At the next computation, the data read out from the buffer unit are already split into two non-negative numbers, which are then input into the corresponding matrix calculation unit.
For the fixed quantization approach, refer to the example of FIG. 10, which depicts the process flow when both fixed quantization and splitting are performed:
in step 1000, the network operation unit calculates the t input data to obtain the t network output value.
In step 1002, the t-th network output value is split by the first word line peripheral circuit corresponding to the first ReRAM array, so as to obtain two non-negative data values.
For example, the output value of the t-th network may be split by the first word line peripheral circuit to obtain two non-negative data values before being stored in the first ReRAM array.
In step 1004, the two non-negative data values split from the t-th network output value are stored in the first ReRAM array. In this way, at the next computation, the two non-negative data values obtained by splitting the t-th network output value are read from the first ReRAM array.
In step 1006, the two non-negative data values are read and quantized by a first comparator. For example, the non-negative data value may be quantized by a first bitline peripheral circuit corresponding to the first ReRAM array.
As described above, in fixed quantization the t-th network output value is split before storage and the resulting non-negative data values are stored; quantization is then performed when the stored non-negative data values are later read.
For the random quantization approach, refer to the example of FIG. 11, which depicts the process flow when both random quantization and splitting are performed:
In step 1100, the t-th input data is calculated by the network operation unit, so as to obtain a t-th network output value.
In step 1102, the t-th network output value is output to a random quantization unit, and the second ReRAM circuit in the random quantization unit performs quantization processing on the t-th network output value.
In step 1104, the quantization result of the output value of the t-th network is split by the first word line peripheral circuit corresponding to the first ReRAM array, so as to obtain two non-negative data values.
In this embodiment, the quantization result of the output value of the t-th network may be first split by the first word line peripheral circuit to obtain two non-negative data values before being stored in the first ReRAM array.
In step 1106, the two non-negative data values obtained by splitting the quantization result of the t-th network output value are stored in the first ReRAM array. In this way, what is read from the first ReRAM array at the next computation is directly these two non-negative data values.

As described above, in the random quantization mode the quantized $h_t$ output by the random quantization unit may be split by the first word line peripheral circuit of the buffer unit; what is read at the next computation is directly the split quantization result, and no further quantization is performed.
Taking the multiplication of the input data $x_t$ of the recurrent neural network with a network weight W as an example, the splitting of W and $x_t$ is described below. For example, the computation of $W \cdot x_t$:
$W \cdot x_t = (W^+ - W^-)(x^+ - x^-) = W^+ x^+ - W^+ x^- - W^- x^+ + W^- x^-$   (2-3)
the above formula (2-3) can be applied to the input data x t Splitting the network weight W into a difference value of two non-negative data values. For example, w=w + -W - Wherein W is + And W is - Are all non-negative numbers; x is x t =x + -x - Wherein x is + And x - Are non-negative numbers. For example, taking W as an example, when w=1, it can be decomposed into 1 to 0; when w=0, it can be decomposed into 0-0, and when w= -1, it can be decomposed into 0-1. Similarly, for Uh t-1 Or b may perform the data de-fragmentation described above. In addition, the network weight W after the three-value quantization processing can be split to obtain W + And W is - Referred to as non-negative parameter values, will input data x t Resolving the obtained x + And x - Referred to as a non-negative data value.
The four multiplications $W^+ x^+$, $W^+ x^-$, $W^- x^+$, and $W^- x^-$ may be performed by the matrix calculation unit. For example, the matrix calculation unit includes a first ReRAM circuit, which may include a plurality of second ReRAM arrays and their corresponding second word line peripheral circuits. The second word line peripheral circuit may obtain the t-th input data from memory (e.g., the first ReRAM array in the buffer unit), which may include the two split non-negative data values $x^+$ and $x^-$, and input the t-th input data to the plurality of second ReRAM arrays. Each second ReRAM array stores, in the form of resistance values, the ternary-quantized network parameters of the target network layer; the stored network parameters may be the non-negative parameter values obtained after ternary quantization and splitting, such as $W^+$ and $W^-$. Taking $W^+ x^+$ as an example, the non-negative parameter value $W^+$ is stored in a ReRAM array, whose cells can be of various types; for example, a 2-level ReRAM cell can represent 1-bit data, i.e., a high resistance state and a low resistance state representing 0 or 1. When a resistance in the ReRAM array takes the high resistance state it represents a $W^+$ value of 0, and when it takes the low resistance state it represents a $W^+$ value of 1.
Matrix multiply-add computation is then performed by the second ReRAM arrays based on the network parameters and the t-th input data. Specifically, the four multiplications $W^+ x^+$, $W^+ x^-$, $W^- x^+$, and $W^- x^-$ may each be completed by one second ReRAM array; e.g., one array is responsible for computing $W^+ x^+$, another for $W^+ x^-$, and so on, so four groups of second ReRAM arrays are required. For example, $x^+$ is fed as input data into a second ReRAM array, which computes the product of the incoming $x^+$ and the stored $W^+$; the product is output as a current through the bit lines to a bit line peripheral circuit, which may be referred to as the second bit line peripheral circuit. Referring to FIG. 6, the second bit line peripheral circuit may include an analog add/subtract module (Add & Sub) that fuses the products output by the four groups of second ReRAM arrays to obtain the multiplication result $W \cdot x_t$ of the network parameter W and the t-th input data $x_t$. The multiplication result is converted into a digital signal by the 8-bit ADC.
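A sketch of the non-negative splitting of formula (2-3) and of the four-array product fusion described above; each second ReRAM array is modeled simply as a matrix-vector product, and all names are illustrative assumptions.

```python
import numpy as np

def split_nonneg(a):
    """Split ternary data into two non-negative parts, a = a_pos - a_neg:
    1 -> (1, 0), 0 -> (0, 0), -1 -> (0, 1)."""
    return np.maximum(a, 0), np.maximum(-a, 0)

W = np.array([[1, -1], [0, 1]])   # ternary-quantized network weights
x_t = np.array([-1, 1])           # ternary input data

W_pos, W_neg = split_nonneg(W)    # non-negative parameter values
x_pos, x_neg = split_nonneg(x_t)  # non-negative data values

# Four groups of second ReRAM arrays, one non-negative product each.
p1, p2 = W_pos @ x_pos, W_pos @ x_neg
p3, p4 = W_neg @ x_pos, W_neg @ x_neg

# The second bit line peripheral (Add & Sub) fuses them per (2-3).
result = p1 - p2 - p3 + p4
assert np.array_equal(result, W @ x_t)
print(result)
```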
Furthermore, as described above, the input data of the ReRAM array, whether the network output value of the previous layer or the network output value of the same neuron at the previous time step, have undergone ternary quantization and the non-negative decomposition shown in formula (2-3), so every input datum is either 0 or 1. Such 1-bit data can participate directly in the computation without converting a digital signal into an analog signal, so the word line peripheral circuit can be simplified in the hardware design and the Digital-to-Analog Converter (DAC) can be eliminated. For example, the word line peripheral circuit in FIG. 5 contains no DAC, which reduces hardware overhead.
For "Wx" in matrix multiply-add computation t +Uh t-1 +b ", as mentioned above, each multiplication of the input data and the network weights can be split into four sets of calculated products as shown in equation (2-3), and Wx can be performed separately by the ReRAM array t Split four-group multiplication, uh t-1 The four divided groups of multiplication and the four divided groups of multiplication 1*b are added, subtracted and integrated through the corresponding bit line peripheral circuits.
Furthermore, although most input data of the ReRAM arrays are 1-bit 0/1 data, the input data of the first layer of the network come from outside and are represented as 8-bit values. For the circuit of the embodiments of this specification to support both the 8-bit input of the first layer and the ternary input of subsequent layers, the following approach may be used:
the second word line peripheral circuit in the matrix calculation unit may, when the received t-th input data is multi-bit, input the multi-bit data bit by bit into the second ReRAM arrays in the first ReRAM circuit for calculation. For example, assuming the multi-bit data is 8-bit input data, the 8-bit value may be decomposed into two 7-bit non-negative values x⁺ and x⁻. For example, in an 8-bit sign-magnitude representation, the negative number '-8' may be represented as '10001000', where the most significant bit is the sign bit (1 indicating a negative number) and the remaining 7 bits represent the magnitude. According to the data splitting method described above, '-8' is split into x⁺ = 0 and x⁻ = 8, with their difference representing the value of the data, i.e., -8 = 0 - 8. Because the split data are both non-negative, no most significant bit is needed for the sign, and the split values x⁺ and x⁻ can be represented with 7-bit binary data as '0000000' and '0001000', respectively.
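The splitting step can be illustrated with a small Python sketch (illustrative only; the function name and value range are assumptions, not from the patent):

```python
def split_signed_8bit(value):
    """Split a signed value from an 8-bit sign-magnitude representation into two
    non-negative 7-bit magnitudes (x_pos, x_neg) with value == x_pos - x_neg."""
    assert -127 <= value <= 127  # representable range of 8-bit sign-magnitude
    if value >= 0:
        return value, 0
    return 0, -value

x_pos, x_neg = split_signed_8bit(-8)
print(f"x+ = {x_pos:07b}, x- = {x_neg:07b}")  # x+ = 0000000, x- = 0001000
assert x_pos - x_neg == -8
```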
Next, referring to the word line peripheral circuit shown in FIG. 5, which may be the second word line peripheral circuit described above: the second word line peripheral circuit may include a latch (Latch) and a shifter (Shift). The latch receives the 7-bit non-negative number (Input data) obtained by the decomposition, and the shifter feeds it into the second ReRAM array over 7 cycles, bit by bit, so that only 1 bit is still input per cycle. Finally, referring to FIG. 6, the bit line peripheral circuit may be the second bit line peripheral circuit included in the first ReRAM circuit, and its shift accumulator (Shift&Add) combines the bit-by-bit calculation results obtained over the 7 cycles.
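A minimal sketch of this bit-serial scheme (illustrative; it assumes MSB-first feeding, which the patent does not specify):

```python
import numpy as np

def bit_serial_matvec(W, x, n_bits=7):
    """Feed each bit of the non-negative operand x into the array over n_bits
    cycles and recombine the per-bit results with shift-and-add."""
    acc = np.zeros(W.shape[0], dtype=np.int64)
    for i in range(n_bits - 1, -1, -1):   # one cycle per bit, MSB first
        bit = (x >> i) & 1                # the 1-bit word line input for this cycle
        acc = (acc << 1) + W @ bit        # Shift & Add in the bit line circuit
    return acc

W = np.array([[1, 0, 1], [0, 1, 1]])
x = np.array([5, 3, 8])                   # 7-bit non-negative operands
assert np.array_equal(bit_serial_matvec(W, x), W @ x)
```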
In this way, the embodiments of this specification can support both the 8-bit input of the first layer and the ternary input of the subsequent layers with the same set of circuits, so the circuit not only effectively reduces hardware overhead but also flexibly supports higher-precision computation.
The calculation result of the matrix calculation unit is output to the nonlinear unit, which implements the computation of the activation function. To reduce the cost of the hardware implementation, and because the quantized network outputs take only a limited set of values, a memory implemented with ReRAM is used as a lookup table to compute activation functions such as sigmoid or tanh. Furthermore, because the nonlinear unit uses the ReRAM to store the corresponding results as a lookup table, there is no need here to split 8-bit values into 7-bit non-negative pairs.
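As a software analogue of the lookup-table approach (a sketch under assumed parameters; the grid size and input range are not specified in the patent):

```python
import numpy as np

# Precompute sigmoid outputs for a limited grid of quantized inputs; at run time,
# results are read out instead of computed (mimicking the ReRAM lookup table).
levels = np.linspace(-4.0, 4.0, 256)          # illustrative 8-bit input grid
sigmoid_lut = 1.0 / (1.0 + np.exp(-levels))

def sigmoid_lookup(x):
    """Nearest-entry lookup in place of evaluating the activation function."""
    idx = np.clip(np.searchsorted(levels, x), 0, len(levels) - 1)
    return sigmoid_lut[idx]

print(sigmoid_lookup(0.0))                     # approximately 0.5
```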
In a neural network accelerator for recurrent neural network computation, a plurality of computation units may be provided to perform the computation of the recurrent neural network and improve computational throughput. Model parallelism and data parallelism can both be employed to accelerate the computation. Model parallelism means that multiple ReRAM arrays compute multiple neurons of the same neural network in parallel, i.e., parallelism across neurons; data parallelism means that different input data are processed simultaneously, i.e., parallelism across data.
For example, for small-scale models, a single ReRAM array in the matrix calculation unit can meet the required computation scale, and the utilization and throughput of the neural network accelerator can be improved through data parallelism: the network weights are replicated into multiple copies, different input data are fed to the copies, and the copies compute simultaneously. For medium-scale models, a single ReRAM array cannot meet the required computation scale, but one computation unit can; in this case, data parallelism and model parallelism can be applied at the same time. Specifically, model parallelism is used within one computation unit, while computation tasks with no data dependence on each other can be executed by different computation units. Within the same computation unit, because same-layer neuron computations have no data dependence, the corresponding computation tasks can be assigned to different ReRAM arrays in the unit; that is, computation tasks without data dependence run in parallel, while computation tasks with data dependence are grouped together to simplify scheduling. Across different computation units, data parallelism is achieved by replicating the network weights into multiple copies. For large-scale models, the degrees of data parallelism and model parallelism are chosen according to the relationship between the computing resources of the whole accelerator and the computing resources required by the whole network, as sketched below.
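A hypothetical decision rule capturing this strategy (the thresholds and return strings are illustrative, not from the patent):

```python
def choose_parallelism(model_size, array_capacity, unit_capacity):
    """Pick a parallelism strategy from how the model fits the hardware
    (a sketch of the small/medium/large cases discussed above)."""
    if model_size <= array_capacity:
        # Small model: one array suffices; replicate weights, feed different data.
        return "data parallelism"
    if model_size <= unit_capacity:
        # Medium model: split neurons across arrays inside one unit, and
        # replicate weights across units for data parallelism.
        return "data + model parallelism"
    # Large model: balance both degrees against total accelerator resources.
    return "balanced data and model parallelism across units"

print(choose_parallelism(model_size=512, array_capacity=1024, unit_capacity=8192))
```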
In addition, within the same ReRAM array, because the computation of a recurrent neural network has data dependence in the time dimension, the overall computational throughput can be improved with multi-stage pipelining. As shown in Table 1 below, x₀, y₀, z₀ and s₀ denote different input data. The control unit 11 in FIG. 1 may control the next input data to enter a unit of the network operation unit (such as any one of the matrix calculation unit 141, the nonlinear unit 142 and the vector multiplication unit 143) once that unit has finished processing the current input data.
For example, a neuron's calculation of the input data is completed after processing by several computational function modules, i.e., the matrix calculation unit 141, the nonlinear unit 142 and the vector multiplication unit 143. When the controller determines that the calculation of some input data in the matrix calculation unit 141 has been completed and the data has entered the nonlinear unit 142, the matrix calculation unit 141 is idle, and the control unit 11 may then let the input data for the next time step enter the matrix calculation unit 141.
For example, the input data x₀ shown in Table 1 is processed by the matrix calculation unit at time T₀ and moves to the nonlinear unit at time T₁; the matrix calculation unit is then idle, so the control unit 11 can let the next input data y₀ enter the matrix calculation unit and start its computation. Similarly, at time T₂, when x₀ moves to the vector multiplication unit, y₀ can enter the nonlinear unit. Continuing in this manner, the processing resembles a multi-stage pipeline and improves the computation speed.
TABLE 1  Multi-stage pipeline schedule

Unit / Time                  T₀    T₁    T₂    T₃    T₄    T₅
Matrix calculation unit      x₀    y₀    z₀    s₀
Nonlinear unit                     x₀    y₀    z₀    s₀
Vector multiplication unit               x₀    y₀    z₀    s₀
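The schedule in Table 1 can be reproduced with a short sketch (illustrative; the stage names are assumptions):

```python
STAGES = ["matrix", "nonlinear", "vector_mul"]

def pipeline_schedule(inputs):
    """For each time step, report which input occupies each stage: once an
    input leaves a stage, the next input enters it (cf. Table 1)."""
    n_steps = len(inputs) + len(STAGES) - 1
    schedule = []
    for t in range(n_steps):
        schedule.append({stage: (inputs[t - s] if 0 <= t - s < len(inputs) else None)
                         for s, stage in enumerate(STAGES)})
    return schedule

for t, row in enumerate(pipeline_schedule(["x0", "y0", "z0", "s0"])):
    print(f"T{t}: {row}")
```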
The neural network accelerator of the embodiments of the present disclosure also adopts a three-level memory hierarchy: the on-chip memory 13, the cache unit 144 in the computation unit, and the internal buffer of the ReRAM array (e.g., the Buffer in FIG. 4, which may use SRAM (Static Random-Access Memory) as its storage medium). Intermediate results generated during the computation of the recurrent neural network may be stored in the corresponding storage level according to the data flow; for example, following a nearest-storage-first principle, they are stored in the array Buffer first, then in the cache unit 144 once the Buffer is full, and then in the on-chip memory 13 once the storage capacity of the cache unit 144 is full. During data storage and reading, on-chip data is transferred at high speed through dedicated on-chip ports, while on-chip and off-chip data are exchanged through an I/O interface.
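A sketch of the nearest-storage-first policy (the capacities and level names are illustrative assumptions):

```python
def store_intermediate(value, hierarchy, capacity):
    """Place an intermediate result in the nearest level with free space:
    array buffer first, then cache unit, then on-chip memory."""
    for level in ("array_buffer", "cache_unit", "on_chip_memory"):
        if len(hierarchy[level]) < capacity[level]:
            hierarchy[level].append(value)
            return level
    raise MemoryError("all on-chip levels full; exchange off-chip via the I/O interface")

hierarchy = {"array_buffer": [], "cache_unit": [], "on_chip_memory": []}
capacity = {"array_buffer": 2, "cache_unit": 4, "on_chip_memory": 8}
print([store_intermediate(v, hierarchy, capacity) for v in range(7)])
# ['array_buffer', 'array_buffer', 'cache_unit', 'cache_unit',
#  'cache_unit', 'cache_unit', 'on_chip_memory']
```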
For example, in a practical mobile-terminal application scenario, if a recurrent neural network model needs to be deployed, the neural network acceleration approach of this scheme can be used to accelerate the network and reduce hardware overhead.
A data processing method for performing computation on a recurrent neural network is described below, taking as an example the data processing involved in the neural network accelerator described in any embodiment of this specification; it should be understood, however, that the device executing the data processing method is not limited to the structure of the neural network accelerator of any embodiment of this specification.
Fig. 12 provides a data processing method that may be used to perform computational processing on input data in a recurrent neural network. As shown in fig. 12, the method may include:
In step 1200, the t-th input data of a target network layer in a recurrent neural network is acquired.
In step 1202, the network computing unit performs computation processing on the t-th input data to obtain a t-th network output value of the target network layer.
The network operation unit in the neural network accelerator may be responsible for performing the computation of the recurrent neural network; for example, the network operation unit may include a matrix calculation unit, a nonlinear unit and a vector multiplication unit. The t-th network output value may be the neuron output value hₜ obtained by computing the t-th input data of a certain neuron in the recurrent neural network.
In step 1204, a hardware quantization unit performs ternary quantization on the t-th network output value of the target network layer to obtain a quantization result of the t-th network output value. For example, the t-th network output value may be quantized by the hardware quantization unit to one of three values, and the quantization result may be 0, 1 or -1.
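A minimal software sketch of this ternary quantization in its fixed mode (the threshold value is an illustrative assumption, not taken from the patent):

```python
def ternary_quantize(h, threshold=0.5):
    """Quantize a network output value to {-1, 0, 1} by comparing it
    against a fixed quantization reference value."""
    if h > threshold:
        return 1
    if h < -threshold:
        return -1
    return 0

print([ternary_quantize(v) for v in (-0.9, -0.2, 0.1, 0.7)])  # [-1, 0, 0, 1]
```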
With the above data processing method, ternary quantization implemented in hardware after obtaining the t-th network output value converts the complex matrix multiply-add computation of the recurrent neural network into simple addition and subtraction, accelerating the network's computation. In addition, because the ternary quantization of the network output value is implemented in hardware, the quantization step does not interrupt the hardware-controlled computation flow; that is, the ternary quantization is executed within the hardware-controlled network computation, which improves computational performance.
In one example, to perform ternary quantization on the t-th network output value, the hardware quantization unit may compare the t-th network output value with a quantization reference value to obtain the quantization result of the t-th network output value. In practical implementations, the hardware quantization may take the following forms:
For example, the hardware quantization unit may compare the t-th network output value with the quantization reference value while reading the stored t-th network output value from the hardware quantization unit, thereby obtaining the quantization result. This allows the hardware quantization unit both to store the t-th network output value and to quantize it, which is convenient, fast and low-cost. This quantization mode may be referred to as the fixed quantization mode.
For another example, before the hardware quantization unit compares the t-th network output value with the quantization reference value, the current value corresponding to the resistance value of a ReRAM cell in the hardware quantization unit may first be output, and a random number is obtained by comparing a standard value with this current value. The hardware quantization unit then uses the random number as the quantization reference value and obtains the quantization result of the t-th network output value by comparing the t-th network output value with that reference value. This quantization mode may be referred to as the random quantization mode.
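A sketch of the random quantization mode (here a uniform random draw stands in for the comparator output driven by the ReRAM cell's current; the value range is an assumption):

```python
import random

def stochastic_ternary_quantize(h, scale=1.0):
    """Quantize to {-1, 0, 1} against a random quantization reference value,
    so outputs near a level boundary round up or down probabilistically."""
    reference = random.uniform(0.0, scale)  # stands in for the ReRAM-derived random number
    magnitude = 1 if abs(h) > reference else 0
    return magnitude if h >= 0 else -magnitude

random.seed(0)
print([stochastic_ternary_quantize(0.3) for _ in range(8)])  # a mix of 0s and 1s
```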
In yet another example, the data processing method may further split data to be stored to obtain two non-negative data values, where the data to be stored includes: and outputting a value by the t network, or quantifying a result of the value by the t network. And after splitting, the two non-negative data values may be stored.
For example, in the fixed quantization mode, the t-th network output value may be split into two non-negative data values before storage, and the two non-negative data values are then stored. When the two non-negative data values are read to participate in the next computation, they may undergo the hardware fixed quantization.
For another example, in the random quantization mode, the hardware quantization unit may split the quantization result of the t-th network output value and store the split result. The split quantization result can then be read directly to participate in the next computation.
In yet another example, computing the t-th input data by the network operation unit to obtain the t-th network output value of the target network layer includes: inputting the acquired t-th input data into a plurality of second ReRAM arrays included in the network operation unit; and performing matrix multiply-add computation on the t-th input data through the plurality of second ReRAM arrays based on the ternary-quantized network parameters of the target network layer stored in the form of resistance values.
In yet another example, the t-th input data includes two non-negative data values; the performing, by the plurality of second ReRAM arrays, matrix multiply-add computation on the t-th input data based on the network parameters of the target network layer subjected to the three-value quantization processing, which are stored in a resistive value manner, includes:
Calculating, by each of the plurality of second ReRAM arrays, a product of one of two stored non-negative parameter values obtained by splitting the network parameter subjected to the three-value quantization processing, and one non-negative data value input by each of the second ReRAM arrays; and fusing products respectively output by the plurality of second ReRAM arrays to obtain a multiplication result of the t-th input data and the network parameter.
In yet another example, the t-th input data is multi-bit data; the performing, by the plurality of second ReRAM arrays, matrix multiply-add computation on the t-th input data based on the network parameters of the cyclic neural network, which are stored in a resistive value manner and are subjected to three-value quantization processing, includes:
inputting the multi-bit data bit by bit into the second ReRAM array for calculation to obtain a calculation result corresponding to each bit of data in the multi-bit data; and combining the calculation results corresponding to the data of each bit in the multi-bit data to obtain the calculation result of the t-th input data.
The embodiment of the present disclosure further provides a data processing apparatus for implementing any of the above method embodiments, and accordingly, the data processing apparatus includes a unit for executing any step or flow in the above method embodiments, where the unit may be implemented by software, hardware, or a combination of software and hardware, which is not limited by the embodiment of the present disclosure.
One skilled in the art will appreciate that one or more embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present disclosure may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program may be stored; when executed by a processor, the program implements the steps of the method described in any of the embodiments of the present disclosure.
Wherein "and/or" as described in embodiments of the present disclosure means at least one of the two, for example, "multiple and/or B" includes three schemes: many, B, and "many and B".
The various embodiments in this disclosure are described in a progressive manner, and identical and similar parts of the various embodiments are all referred to each other, and each embodiment is mainly described as different from other embodiments. In particular, for data processing apparatus embodiments, the description is relatively simple, as it is substantially similar to method embodiments, with reference to the description of method embodiments in part.
The foregoing has described certain embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Embodiments of the subject matter and functional operations described in this disclosure may be implemented in the following: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this disclosure and structural equivalents thereof, or a combination of one or more of them. Embodiments of the subject matter described in this disclosure can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on a manually-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this disclosure can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
Computers suitable for executing computer programs include, for example, general purpose and/or special purpose microprocessors, or any other type of central processing unit. Typically, the central processing unit will receive instructions and data from a read only memory and/or a random access memory. The essential elements of a computer include a central processing unit for carrying out or executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices. Furthermore, the computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disk or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
Although this disclosure contains many specific implementation details, these should not be construed as limiting the scope of any disclosure or of what is claimed, but rather as descriptions of features of particular embodiments of particular disclosures. Certain features that are described in this disclosure in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may act in certain combinations and even be initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Furthermore, the processes depicted in the accompanying drawings are not necessarily required to be in the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The foregoing description of the preferred embodiment(s) of the present disclosure is merely intended to illustrate the embodiment(s) of the present disclosure, and any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the embodiment(s) of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (18)

1. A neural network accelerator is characterized in that,
the neural network accelerator includes: the device comprises a network operation unit and a hardware quantization unit, wherein the network operation unit comprises a first resistance change type memory ReRAM circuit, and the hardware quantization unit comprises a second ReRAM circuit;
the network operation unit is used for calculating and processing the t-th input data of a target network layer in the cyclic neural network to obtain a t-th network output value of the target network layer;
the hardware quantization unit is used for performing three-value quantization processing on a t-th network output value of the target network layer to obtain a quantization result of the t-th network output value;
the accelerator further includes:
the first word line peripheral circuit is used for splitting data to be stored to obtain two non-negative data values, wherein the data to be stored comprises: the output value of the t network, or the quantized result of the output value of the t network;
wherein the first ReRAM array of the accelerator is to store the two non-negative data values; the first wordline peripheral circuit corresponds to the first ReRAM array.
2. The accelerator according to claim 1, wherein
The second ReRAM circuit includes: and the first comparator is used for obtaining the quantized result of the t network output value by comparing the t network output value with the quantized reference value.
3. The accelerator according to claim 2, wherein
the second ReRAM circuit further includes: a first ReRAM array, the first ReRAM array connected to the first comparator;
the first ReRAM array is configured to store the t-th network output value.
4. The accelerator of claim 3, wherein the first comparator is located within an analog-to-digital converter in a first bitline peripheral circuit of the first ReRAM array, and wherein the quantization reference value is a quantization threshold preset in the first comparator.
5. The accelerator according to claim 2, wherein
the second ReRAM circuit further includes: a random number generator for generating a random number;
wherein the quantization reference value is a random number generated by the random number generator;
the random number generator comprises a ReRAM unit and a second comparator, wherein,
the ReRAM unit is used for outputting a current value corresponding to the resistance value stored by the ReRAM unit;
The second comparator is used for obtaining the random number by comparing a standard value with a current value output by the ReRAM unit.
6. The accelerator of claim 5, further comprising:
and the first ReRAM array is used for storing the quantized result of the output value of the t-th network.
7. The accelerator according to any one of claims 1 to 6, wherein the first resistance change type memory ReRAM circuit includes: a second word line peripheral circuit and a plurality of second ReRAM arrays;
a second word line peripheral circuit for acquiring the t-th input data from a memory and inputting the t-th input data to the plurality of second ReRAM arrays;
and the second ReRAM array is used for storing the network parameters of the target network layer after three-value quantization processing in a resistance value mode and executing matrix multiplication and addition calculation on the t-th input data according to the network parameters.
8. The accelerator according to claim 7, wherein
the t-th input data comprises two non-negative data values;
each of the plurality of second ReRAM arrays is configured to store one of two non-negative parameter values obtained by splitting the network parameter after the ternary quantization processing, and calculate a product of the non-negative parameter value and one non-negative data value input by each of the second ReRAM arrays;
The first resistance change type memory ReRAM circuit further includes: and the second bit line peripheral circuit is used for fusing products respectively output by the plurality of second ReRAM arrays to obtain a multiplication result of the t-th input data and the network parameter.
9. The accelerator according to claim 7, wherein
the second word line peripheral circuit is used for inputting the multi-bit data into the second ReRAM array bit by bit for calculation when the t-th input data is multi-bit data;
and the second bit line peripheral circuit is included in the first resistance change type memory ReRAM circuit and is used for combining the calculation results of the data corresponding to each bit in the multi-bit data.
10. The accelerator according to any one of claims 1 to 6, 8 and 9, wherein
the neural network accelerator further comprises: a control unit;
the network operation unit includes: a matrix calculation unit, a nonlinear unit and a vector multiplication unit which comprise the first resistance change type memory ReRAM circuit;
the control unit is used for controlling the next input data to enter the network operation unit when the processing execution of any one unit of the matrix calculation unit, the nonlinear unit and the vector multiplication unit on the current input data is finished.
11. A method of data processing, the method comprising:
acquiring the t-th input data of a target network layer in a cyclic neural network;
calculating the t-th input data through a network operation unit to obtain a t-th network output value of the target network layer;
performing ternary quantization processing on a t-th network output value of the target network layer through a hardware quantization unit to obtain a quantization result of the t-th network output value;
splitting data to be stored to obtain two non-negative data values, wherein the data to be stored comprises: the output value of the t network, or the quantized result of the output value of the t network;
the two non-negative data values are stored.
12. The method according to claim 11, wherein the performing, by the hardware quantization unit, a ternary quantization process on the t-th network output value of the target network layer to obtain a quantized result of the t-th network output value includes:
and comparing the output value of the t network with a quantization reference value through the hardware quantization unit to obtain a quantization result of the output value of the t network.
13. The method of claim 12, wherein the comparing, by the hardware quantization unit, the t-th network output value and a quantization reference value comprises:
In the process of reading the stored t network output value from the hardware quantization unit, comparing the t network output value with a quantization reference value.
14. The method of claim 12, wherein prior to said comparing, by said hardware quantization unit, said t-th network output value and a quantization reference value, said method further comprises:
outputting a current value corresponding to the resistance value stored in a ReRAM unit in the hardware quantization unit;
and comparing a standard value with the current value to obtain a random number, wherein the random number is used as the quantized reference value.
15. The method according to any one of claims 11 to 14, wherein the calculating, by the network operation unit, the t-th input data to obtain the t-th network output value of the target network layer includes:
inputting the obtained t-th input data to a plurality of second ReRAM arrays included in the network operation unit;
and performing matrix multiplication and addition calculation on the t-th input data through the plurality of second ReRAM arrays based on the network parameters of the target network layer, which are stored in a resistance value manner and are subjected to three-value quantization processing.
16. The method of claim 15, wherein the t-th input data comprises two non-negative data values;
the performing, by the plurality of second ReRAM arrays, matrix multiply-add computation on the t-th input data based on the network parameters of the target network layer subjected to the three-value quantization processing, which are stored in a resistive value manner, includes:
calculating, by each of the plurality of second ReRAM arrays, a product of one of two stored non-negative parameter values obtained by splitting the network parameter subjected to the three-value quantization processing, and one non-negative data value input by each of the second ReRAM arrays;
and fusing products respectively output by the plurality of second ReRAM arrays to obtain a multiplication result of the t-th input data and the network parameter.
17. The method of claim 15, wherein the t-th input data is multi-bit data;
the performing, by the plurality of second ReRAM arrays, matrix multiply-add computation on the t-th input data based on the network parameters of the cyclic neural network, which are stored in a resistive value manner and are subjected to three-value quantization processing, includes:
Inputting the multi-bit data bit by bit into the second ReRAM array for calculation to obtain a calculation result corresponding to each bit of data in the multi-bit data;
and combining the calculation results corresponding to the data of each bit in the multi-bit data to obtain the calculation result of the t-th input data.
18. A data processing apparatus, comprising: means for performing the steps in the method of any one of claims 11 to 17.
CN201911337168.0A 2019-12-23 2019-12-23 Neural network accelerator and data processing method thereof Active CN113095468B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911337168.0A CN113095468B (en) 2019-12-23 2019-12-23 Neural network accelerator and data processing method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911337168.0A CN113095468B (en) 2019-12-23 2019-12-23 Neural network accelerator and data processing method thereof

Publications (2)

Publication Number Publication Date
CN113095468A CN113095468A (en) 2021-07-09
CN113095468B true CN113095468B (en) 2024-04-16

Family

ID=76663867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911337168.0A Active CN113095468B (en) 2019-12-23 2019-12-23 Neural network accelerator and data processing method thereof

Country Status (1)

Country Link
CN (1) CN113095468B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163334B (en) * 2018-02-11 2020-10-09 上海寒武纪信息科技有限公司 Integrated circuit chip device and related product

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214509A (en) * 2017-07-05 2019-01-15 中国科学院沈阳自动化研究所 One kind being used for deep neural network high speed real-time quantization structure and operation implementation method
WO2019076095A1 (en) * 2017-10-20 2019-04-25 上海寒武纪信息科技有限公司 Processing method and apparatus
CN108648020A (en) * 2018-05-15 2018-10-12 携程旅游信息技术(上海)有限公司 User behavior quantization method, system, equipment and storage medium
CN109376864A (en) * 2018-09-06 2019-02-22 电子科技大学 A kind of knowledge mapping relation inference algorithm based on stacking neural network
CN110378468A (en) * 2019-07-08 2019-10-25 浙江大学 A kind of neural network accelerator quantified based on structuring beta pruning and low bit

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Design and Optimization of Recurrent Neural Network Training Algorithms on GPU Platforms; Feng Shiying; China Master's Theses Full-text Database (Electronic Journals), Information Science and Technology; Vol. 2019, No. 01; pp. 7, 51-53 *
ReRAM-Based Processing-in-Memory Architecture for Recurrent Neural Network Acceleration; Yun Long et al.; IEEE; full text *
SNrram: an efficient sparse neural network computation architecture based on resistive random-access memory; Peiqi Wang et al.; DBLP; full text *
Training and Software Simulation of a Long Short-Term Memory Network Accelerator for Resistive Random-Access Memory; Liu He; Journal of Computer Research and Development; Vol. 56, No. 6; full text *

Also Published As

Publication number Publication date
CN113095468A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
CN108009640B (en) Training device and training method of neural network based on memristor
Lin et al. Learning the sparsity for ReRAM: Mapping and pruning sparse neural network for ReRAM based accelerator
Sung et al. Resiliency of deep neural networks under quantization
CN107636640B (en) Dot product engine, memristor dot product engine and method for calculating dot product
CN107340993B (en) Arithmetic device and method
Gupta et al. Deep learning with limited numerical precision
US11604960B2 (en) Differential bit width neural architecture search
Li et al. Quantized neural networks with new stochastic multipliers
KR102396447B1 (en) Deep learning apparatus for ANN with pipeline architecture
CN113168310B (en) Hardware module for converting numbers
CN113805842B (en) Integrative device of deposit and calculation based on carry look ahead adder realizes
US20200311511A1 (en) Accelerating neuron computations in artificial neural networks by skipping bits
CN113095468B (en) Neural network accelerator and data processing method thereof
US20220076127A1 (en) Forcing weights of transformer model layers
US11243743B2 (en) Optimization of neural networks using hardware calculation efficiency and adjustment factors
JP2022042467A (en) Artificial neural network model learning method and system
JP2022541144A (en) Methods for interfacing with hardware accelerators
Ma et al. Non-volatile memory array based quantization-and noise-resilient LSTM neural networks
Kim et al. Mapping binary ResNets on computing-in-memory hardware with low-bit ADCs
JP2024506441A (en) Digital circuitry for normalization functions
Zheng et al. Accelerating Sparse Attention with a Reconfigurable Non-volatile Processing-In-Memory Architecture
TW202324205A (en) Computation in memory architecture for phased depth-wise convolutional
CN114154631A (en) Convolutional neural network quantization implementation method and device based on FPGA
WO2020194032A1 (en) Accelerating neuron computations in artificial neural networks by skipping bits
Duan et al. DDC-PIM: Efficient Algorithm/Architecture Co-Design for Doubling Data Capacity of SRAM-Based Processing-in-Memory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant