WO2021115262A1 - Pulse convolutional neural network algorithm, integrated circuit, computing device and storage medium

Info

Publication number: WO2021115262A1
Authority: WO (WIPO (PCT))
Prior art keywords: layer, storage, calculation, neural network, input
Application number: PCT/CN2020/134558
Other languages: English (en), French (fr)
Inventors: 王瑶, 陈轩, 李张南, 王宇宣
Original assignee: 南京惟心光电系统有限公司
Application filed by 南京惟心光电系统有限公司
Publication of WO2021115262A1


Classifications

    • H04N25/00: Circuitry of solid-state image sensors [SSIS]; control thereof
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/063: Physical realisation (hardware implementation) of neural networks, neurons or parts of neurons using electronic means

Definitions

  • The invention relates to a pulse convolutional neural network algorithm, an integrated circuit, a computing device and a storage medium that can convert a conventional convolutional neural network into a spiking neural network, and specifically relates to the field of image classification.
  • The storage-calculation integrated (computing-in-memory) unit solves the above problems because it keeps the data directly on the computing unit and requires no on-chip cache.
  • However, a large number of analog-to-digital converters is needed to convert the output currents into digital signals; these converters occupy most of the area and power consumption of the entire system, and they also cap the calculation speed of the whole system.
  • A spiking (pulse) neural network tries to imitate, as far as possible, the way the human brain computes. Its distinguishing feature is that data flow through the network in the form of pulse signals, so its power consumption is far lower than that of a convolutional neural network.
  • The pulse convolutional neural network combines the characteristics of the convolutional neural network and the spiking neural network. After some modifications to the convolutional neural network model, the ordinary method of training a convolutional neural network can still be used to obtain the weights, and the classification accuracy drops only very slightly relative to the original convolutional neural network. Because the data in the network are in pulse form, hardware resource consumption is small. At present, however, research in this field remains at the algorithm level, and there is no corresponding hardware implementation scheme.
  • One existing input method generates the input pulses from random numbers. By the law of large numbers, a large number of pulses must be generated over a long time before the input may converge to its original value and the classification result approaches that of the convolutional neural network; this costs a great deal of computation time, so the calculation efficiency is very low.
  • Another input method feeds in an analog value rather than pulses. When implemented as a circuit, the accuracy of the input cannot be guaranteed; moreover, in practical applications the input source is likely to be an image sensor whose output is a digital signal, so compatibility must be considered.
  • The batch normalization (BN) layer is commonly used in convolutional neural networks to optimize the network: it improves training accuracy and reduces the dependence of the training result on the initialization method. Merging a BN layer into the preceding convolutional/fully connected layer inevitably generates a bias. In existing pulse convolutional neural network schemes the use of a bias is avoided, so a BN layer cannot be added, which makes the training of large-scale convolutional neural networks difficult.
  • Accordingly, a pulse convolutional neural network algorithm is proposed. The average pooling layer is merged into the next convolutional or fully connected layer; the calculation of convolutional and fully connected layers with bias is supported; BN layers can be added to the network; judgment conditions for the end of calculation are set, with auxiliary judgments added for special situations. These measures greatly reduce the computation time of existing pulse convolutional neural network algorithms, improve image-classification accuracy, add support for the bias and BN layer, and adjust the input mode to increase compatibility.
  • A pulse convolutional neural network computing device is also proposed. The pulse convolutional neural network is implemented on storage-calculation integrated units: the multi-bit digital signals representing the true values of the convolutional neural network are converted into time pulse sequences, and the analog-to-digital converters are replaced by current integration and comparison circuits, thereby greatly reducing area and power consumption. The mapping of the convolutional and fully connected layers is fully unrolled, i.e., all output results of each layer are computed at the same time and fed directly to the next layer as its input; the weight coefficients of every convolutional/fully connected layer are stored in the storage-calculation integrated units, so no data need to be cached during the calculation, and the calculation speed of the whole system is significantly increased. However, the number of storage-calculation units required is proportional to the square of the input image size and to the number of channels of the convolutional layers, which requires a large area. Moreover, the calculation speed of this fully unrolled solution is so high that, for large images, it far exceeds the transmission speed of the input image data; that is, the calculation speed becomes limited because the data transmission cannot keep up.
  • Therefore, a pulse convolutional neural network computing device with a memory is further proposed. Although saving intermediate data makes the theoretical calculation speed much lower than that of the solution that stores none, in practice the data transmission speed is the bottleneck anyway, so the final speed remains within an acceptable range.
  • A pulse convolutional neural network algorithm based on storage-calculation integrated units, each of which includes at least one storage input terminal, at least one calculation input terminal and one output terminal, is characterized by: 1) the weights of the first layer of the pulse convolutional neural network are copied into several copies, the number of copies being at least the smaller of the number of bits of the binary number into which the quantity characterizing the properties of the analyte is converted and the number of storage input terminals of the storage-calculation integrated units; the copied weights are processed so that each successive copy is half the previous one in numerical value, and the resulting values are written into the storage input terminals of a plurality of storage-calculation integrated units, the number of these units being equal to the number of copies; 2) the selected quantity characterizing the properties of the analyte is converted into a binary number, and each digit of that binary number, or a value truncated to the bit width of the system, is fed to the calculation input terminals until the processing of the analyte's attributes is completed; 3) for each such binary number, the input quantity of each storage input terminal corresponds one-to-one to the input quantity of a calculation input terminal, the storage input with the larger absolute value corresponding to the higher-order digit at the calculation input; 4) in each storage-calculation integrated unit, the quantity at the storage input terminal and the quantity at the calculation input terminal are operated upon, and the current obtained at the output terminal represents the result of multiplying the value at the storage input terminal by the value at the calculation input terminal.
  • The pulse convolutional neural network algorithm is further characterized in that: 1) it comprises the operations of the first layer and the operations of the other layers, and in any layer, in addition to the operation on the storage input terminals and the calculation input terminals, an additional accumulation term is added; this accumulation term is a corrected bias value, proportional to the original bias value divided by the cumulative product of the positive thresholds of all layers before that layer, the proportionality ratio being related to the layer in which the bias is located and to the weight scaling ratio of the preceding layers; 2) the output of the storage-calculation integrated units is continuously accumulated; when the accumulated sum exceeds a set positive threshold, the accumulated sum is cleared and an output pulse is released to the calculation input terminal at the corresponding position of the next layer; and when the accumulated sum falls below a set negative threshold, the accumulated sum is held at that negative threshold.
  • The pulse convolutional neural network may include a batch normalization layer, in which case the weights and biases of the convolutional or fully connected layer preceding the batch normalization layer are linearly transformed, the parameters of the linear transformation being those obtained during the prior training process.
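  • As an illustration, the linear transformation that folds a batch normalization layer into the preceding layer can be sketched as follows (a minimal sketch, assuming the standard BN parameters gamma, beta, running mean and variance obtained during training; the array shapes are illustrative):

```python
import numpy as np

def fold_batch_norm(W, B, gamma, beta, mean, var, eps=1e-5):
    # BN(x) = gamma * (x - mean) / sqrt(var + eps) + beta is linear per
    # output channel, so it can be absorbed into the preceding layer's
    # weights and bias before they are written to the storage input terminals.
    scale = gamma / np.sqrt(var + eps)          # per-output-channel scale
    W_folded = W * scale.reshape(-1, 1, 1, 1)   # W: (out_ch, in_ch, kh, kw)
    B_folded = (B - mean) * scale + beta
    return W_folded, B_folded
```

  • The folded weights and biases can then be corrected and scaled like any other weights before being written into the storage-calculation integrated units.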
  • A plurality of counters count the number of pulses of each neuron in the last fully connected layer of the pulse convolutional neural network and the time at which its earliest pulse appears; the number of counters is equal to the number of said neurons, or twice that number. If several counting results are equal, the category value corresponding to the counter that received a pulse earliest is selected as the final result; otherwise, the output is terminated and the final classification result is the category value corresponding to the maximum of the counting results of the plurality of counters.
  • At least one of average pooling, maximum pooling, convolutional layer and fully connected layer operations is also performed.
  • The pulse convolutional neural network algorithm is further characterized by: 1) setting the duration of several clock signals as one analysis period; 2) dividing the object to be analyzed into several partitions; 3) taking the analysis period as the time unit, analyzing the time-series signals of one partition at a time and sending the calculation result representing that partition to a memory; 4) analyzing the signals of the next partition and sending its calculation result to the memory, until the completed signals of the multiple partitions jointly satisfy the analysis conditions of the next layer; 5) sending the partition signals stored in the memory to the next layer for calculation; a behavioral sketch follows below.
  • the memory is at least one of a register, an on-chip cache, an off-chip storage, or a cloud storage, or a combination thereof.
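  • As an illustration only, the partition flow of steps 1) to 5) can be sketched as follows (the function names analyze_partition and next_layer_ready are hypothetical placeholders, not part of the patent):

```python
def run_partitioned_analysis(partitions, analyze_partition, next_layer_ready):
    memory = {}                                   # register / on-chip cache / off-chip / cloud
    for idx, partition in enumerate(partitions):  # one partition per analysis period
        memory[idx] = analyze_partition(partition)        # steps 3) and 4)
        if next_layer_ready(memory):              # completed partitions jointly satisfy
            break                                 # the next layer's analysis conditions
    return [memory[i] for i in sorted(memory)]    # step 5): forward to the next layer
```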
  • An integrated circuit based on a pulse convolutional neural network is characterized in that it executes the above-mentioned pulse convolutional neural network algorithm.
  • A computer-readable recording medium stores computer-readable instructions which, when executed by a computer, cause the computer to execute a pulse convolutional neural network algorithm.
  • That pulse convolutional neural network algorithm is characterized by: 1) the weights of the first layer of the pulse convolutional neural network are copied into several copies, the number of copies being at least the smaller of the number of bits of the binary number into which the quantity characterizing the properties of the analyte is converted and the number of storage input terminals of the storage-calculation integrated units; the copied weights are processed so that each successive copy is half the previous one in numerical value, and the resulting values are written into the storage input terminals of a plurality of storage-calculation integrated units, their number equal to the number of copies; 2) the selected quantity characterizing the properties of the analyte is converted into a binary number, and each digit is fed to the calculation input terminals of the storage-calculation integrated units until the processing of the analyte's attributes is completed; 3) for each basic binary number characterizing the attributes of the analyte, the input quantity of each storage input terminal corresponds one-to-one to the input quantity of a calculation input terminal, the storage input with the larger absolute value corresponding to the higher-order calculation input; 4) in each storage-calculation integrated unit, the quantities at the storage and calculation input terminals are operated upon, and the current obtained at the output terminal represents the result of multiplying the value at the storage input terminal by the value at the calculation input terminal.
  • The computer-readable recording medium is further characterized in that: 1) the pulse convolutional neural network algorithm comprises the operations of the first layer and the operations of the other layers, and in any layer, in addition to the operation on the storage input terminals and the calculation input terminals, an accumulation term is added; this accumulation term is a corrected bias value, proportional to the original bias value divided by the cumulative product of the positive thresholds of all layers before that layer, the proportionality ratio being related to the layer in which the bias is located and to the weight scaling ratio of the preceding layers; 2) the output of the storage-calculation integrated units is continuously accumulated; when the accumulated sum exceeds a set positive threshold, the accumulated sum is cleared and an output pulse is released to the calculation input terminal at the corresponding position of the next layer; and when the accumulated sum falls below a set negative threshold, the accumulated sum is held at that negative threshold.
  • The pulse convolutional neural network may include a batch normalization layer, in which case the weights and biases of the convolutional or fully connected layer preceding the batch normalization layer are linearly transformed, the parameters of the linear transformation being those obtained during the prior training process.
  • A plurality of counters count the number of pulses of each neuron in the last fully connected layer of the pulse convolutional neural network and the time at which its earliest pulse appears; the number of counters is the number of said neurons, or twice that number. If several counting results are equal, the category value corresponding to the counter that received a pulse earliest is selected as the final result. If one counter collects significantly more pulses than the other counters, the output is terminated, and the category value corresponding to the maximum of the counting results of the plurality of counters is output as the final classification result.
  • At least one of average pooling, maximum pooling, convolutional layer and fully connected layer operations is also performed.
  • The pulse convolutional neural network algorithm includes the following: 1) setting the duration of several clock signals as one analysis period; 2) dividing the object to be analyzed into several partitions; 3) taking the analysis period as the time unit, analyzing the time-series signals of one partition at a time and sending the calculation result representing that partition to a memory, the analyzed signals being allowed to be overwritten by subsequent signals; 4) analyzing the signals of the next partition and sending its calculation result to the memory, until the completed signals of the multiple partitions jointly satisfy the analysis conditions of the next layer; 5) sending the partition signals stored in the memory to the next layer for calculation.
  • the memory is at least one of a register, an on-chip cache, an off-chip storage, or a cloud storage, or a combination thereof.
  • An integrated circuit based on a pulse convolutional neural network includes multiple layers of neurons, each layer comprising multiple neuron components; the neurons within a layer are not connected to one another, but are connected to the neurons of the following layer. At least one of the neuron components has at most one digital logic circuit; the digital logic circuit is used for operations that include data distribution and may also include maximum pooling, clock synchronization and data buffering. Each neuron component of the last layer has a counter group to count the number of high-level pulses among the output pulses of the neuron component. Each neuron includes at least one storage-calculation integrated unit and at least one integration comparison circuit; the current output terminals of the multiple storage-calculation integrated units are connected to one another and jointly connected to the integration comparison circuit. Each integration comparison circuit includes at least one integrator and at least one comparator; the integrator accumulates the output of the current output terminals, and the comparator compares the accumulated value against the thresholds. Each storage-calculation integrated unit includes at least one storage input terminal, at least one calculation input terminal and at least one current output terminal; the storage input terminal is set to receive the carriers characterizing the weight issued by the host computer, the calculation input terminal is set to receive the input pulses of the preceding layer, and the current output terminal outputs, in the form of a current, the joint action of the carriers serving as the weight value and as the input pulses.
  • The storage-calculation integrated unit is one of a photoelectric computing unit, a memristor, and a flash memory based on semiconductor principles.
  • The digital logic circuit is configured to find, among the output signals coming from the neuron components of the layer preceding the current pooling layer (their number being the square of the pooling size), the first high-level pulse signal to appear; the digital logic circuit is also configured as a functional device including a multiplexer, so that the high-level pulse signal is held after passing through the multiplexer: the path corresponding to that high-level pulse is opened and connected to the next convolutional or fully connected layer, while the signals of the other paths parallel to it are ignored, or those paths are closed.
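  • Behaviorally, this pooling logic can be modeled as in the following sketch (illustrative only): because larger activations fire earlier in the pulse domain, latching onto the first high-level pulse approximates maximum pooling.

```python
class SpikingMaxPool:
    def __init__(self, n_inputs):        # n_inputs = square of the pooling size
        self.n_inputs = n_inputs
        self.selected = None             # multiplexer select line, latched once set

    def step(self, pulses):
        # pulses: list of 0/1 values, one per parallel path, for this clock cycle
        if self.selected is None:
            for i, p in enumerate(pulses):
                if p == 1:
                    self.selected = i    # first high-level pulse wins and is held
                    break
        # pass only the latched path to the next layer; the others are ignored
        return pulses[self.selected] if self.selected is not None else 0
```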
  • The average pooling operation is merged into the next convolutional or fully connected layer as follows: 1) in that convolutional or fully connected layer, the number of storage-calculation integrated units in each neuron component is several times that required by the layer's original algorithm, the multiple being the square of the pooling size, and each weight of the original algorithm accordingly appears several times in the neuron component, that number again being the square of the pooling size; 2) the output pulses from the previous layer's neuron components that would have been delivered to the pooling layer, their number being the square of the pooling size, are instead used directly as calculation inputs of the storage-calculation integrated units of the convolutional or fully connected layer, those units respectively holding copies of the same weight.
  • Each of the neuron components includes a neuron and has a register; the register is used to synchronize the data operations involved in time.
  • A pulse convolutional neural network computing device for performing pulse convolutional neural network operations includes a host computer and the above-mentioned integrated circuit. The host computer is configured to generate the weights of the first layer: from an initial weight obtained by training, a set of weights is generated through several linear transformations, the set containing multiple weight values in which each later value is 1/2 of the preceding one. The host computer sends this set of weights to the storage input terminals of the storage-calculation integrated units in each neuron component of the first layer of the pulse convolutional neural network, and sends the initial weights, after several linear transformations, to the storage input terminals of the storage-calculation integrated units of the layers after the first. The weights of a convolutional or fully connected layer following a pooling layer are additionally copied according to the pooling size, the number of copies being the square of the pooling size.
  • The device is used to analyze the subject matter partition by partition and then to synthesize the signals of the partitions into complete subject-matter information. The pulse convolutional neural network computing device further includes a memory for storing the signals of at least one partition of the subject matter that have been processed step by step; after all partition signals are processed, they are either synthesized in the device or sent to another processor for synthesis. The memory is at least one of a register, an on-chip cache, an off-chip storage, or a cloud storage.
  • A method for manufacturing the above-mentioned integrated circuit comprises the following steps: 1) forming, by thermal oxidation and deposition, the dielectric layers and gates of the transistors in the digital logic circuits, the integration comparison circuits and the storage-calculation integrated units, the transistors including at least ordinary logic transistors, high-voltage transistors and floating-gate transistors; 2) forming the capacitors in the integration comparison circuits by depositing MIM dielectric layers and metal layers, or by thermal oxidation and deposition processes; 3) forming, by ion implantation, the sources and drains of the transistors in the digital logic circuits, the integration comparison circuits and the storage-calculation integrated units, as well as the P and N regions of the PN junctions; 4) forming, by the metal-layer, inter-metal-dielectric and via processes, the metal interconnections of the overall circuit and the active-area-to-metal-layer and metal-layer-to-metal-layer vias; 5) integrating, through the process steps applied to the memristor or the flash memory, the storage-calculation devices with the CMOS process.
  • The purpose of the present invention is at least to convert the data in the convolutional neural network into time pulse sequences and to replace the analog-to-digital converters, with their large power consumption and area, by current integration and comparison circuits, thereby greatly reducing the area and power consumption of the entire system.
  • Another purpose of the present invention is to connect the output results of each convolutional/fully connected layer directly to the next convolutional/fully connected layer, with the weight data stored directly in the storage-calculation integrated units; the entire system then needs no on-chip cache, which eliminates a great deal of data movement and speeds up calculation.
  • For cases where direct connection would require too many storage-calculation integrated units and too large an area, the present invention further proposes a pulse convolutional neural network computing device with memory: part of the data is stored in on-chip or off-chip memory, trading time for space, which greatly reduces the required hardware resources.
  • Fig. 1 is a block diagram of a multi-function area of a computing unit according to an embodiment.
  • Fig. 2 is a schematic diagram of the structure of an optoelectronic computing array according to an embodiment.
  • Fig. 3 is a cross-sectional view (a) and a perspective view (b) of the structure of the computing unit of Embodiment 1-1.
  • FIG. 4 is a cross-sectional view (a) and a perspective view (b) of the structure of the calculation unit of the embodiment 1-2.
  • Fig. 5 is a schematic diagram (a) and a schematic diagram (b) of the multi-function area of the calculation unit of the embodiment 1-3.
  • Fig. 6 is a schematic diagram of the structure of an RRAM device according to an embodiment and its three-terminal overview.
  • Fig. 7 is a basic cell structure diagram of a flash memory according to an embodiment.
  • Fig. 8 is a schematic diagram of the structure of Spiking-Lenet-5 of Example 4-1 (average pooling).
  • Fig. 9 is a schematic diagram of the structure of Spiking-Lenet-5 of Example 4-1 (maximum pooling).
  • Fig. 10 is a schematic diagram of a neuron composed of an integrated storage-calculation unit of Embodiment 4-1.
  • Fig. 11 is a block diagram of the entire system of the embodiment 4-1 (average pooling).
  • Fig. 12 is a block diagram of the entire system of the embodiment 4-1 (maximum pooling).
  • Fig. 13 is a calculation flowchart (average pooling) of the entire system of the embodiment 4-1.
  • Fig. 14 is a calculation flowchart (maximum pooling) of the entire system of the embodiment 4-1.
  • Fig. 15 is a schematic diagram of a neuron composed of a storage-calculation integrated unit of the embodiment 4-2 (without registers).
  • Fig. 16 is a block diagram of the entire system of the embodiment 4-2 (average pooling, register removal).
  • Fig. 17 is a block diagram of the entire system of the embodiment 4-2 (max pooling, register removal).
  • FIG. 18 is a schematic diagram of the structure of Spiking-Alexnet in Example 4-3.
  • Fig. 19 is a schematic diagram of a neuron composed of an integrated storage-calculation unit in embodiment 4-3.
  • Fig. 20 is a block diagram of the entire system of the embodiment 4-3.
  • Fig. 21 is a calculation flowchart of the entire system of the embodiment 4-3.
  • Fig. 22 is a block diagram of the entire system of the embodiment 4-4.
  • Fig. 23 is a block diagram of the entire system of the embodiment 4-5.
  • Fig. 24 is a block diagram of the entire system of the embodiment 4-6.
  • FIG. 25 is a diagram of the Alexnet network structure of the fifth embodiment.
  • Fig. 26 is a diagram of the Spiking-Alexnet network structure of the fifth embodiment (average pooling).
  • Fig. 27 is a Spiking-Alexnet network structure diagram of Embodiment 5 (maximum pooling).
  • FIG. 28 is a structural diagram of a neuron in Example 5.
  • The storage-calculation integrated unit described in the present invention is not limited to a specific device: any device qualifies as long as it can store data and a plurality of such units can be combined to complete a vector dot-product operation. Each storage-calculation integrated unit has a storage input, a calculation input and an output. The data at the storage input can be stored for a long time; the value at the output is proportional to the product of the calculation input and the storage input; and the outputs of multiple storage-calculation integrated units can be summed.
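  • A behavioral model of this primitive (an illustrative sketch, not the patent's circuit): an array of such units computes a vector dot product in a single step, because tying the current outputs together sums the per-unit products.

```python
def storage_calc_dot(stored_weights, calc_inputs):
    # each unit outputs a current proportional to (storage input) x (calculation
    # input); connecting the current outputs together sums these currents
    return sum(w * x for w, x in zip(stored_weights, calc_inputs))

# example: storage_calc_dot([0.5, -1.0, 2.0], [1, 0, 1]) == 2.5
```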
  • The computing unit in the photoelectric computing unit has a multifunctional-area structure comprising three functional areas, as shown in Figure 1: the carrier control area, the coupling area, and the photo-generated carrier collection and readout area. Their specific functions are as follows:
  • Carrier control area: responsible for controlling and modulating the carriers in the photoelectric computing unit and, as the electrical input port of the computing unit, receiving one of the operands as the electrical input; alternatively, it only controls and modulates the carriers in the computing unit, and the electrical input is applied through other areas.
  • Coupling area: responsible for connecting the photo-generated carrier collection area and the readout area, so that the photo-generated carriers produced by incident photons act on the carriers in the photoelectric computing unit to form a computational relationship.
  • Photo-generated carrier collection and readout area: the collection area absorbs incident photons and collects the generated photo-generated carriers, serving as the optical input port of the computing unit and receiving one of the operands as the light input; the readout area can serve both as an electrical input port receiving one of the operands as the electrical input and as the output port of the computing unit, outputting the carriers acted upon by the light input and the electrical input as the unit output; alternatively, the electrical input is applied through other areas, and the readout area serves only as the output port.
  • In the photoelectric computing unit, the light input is in fact photo-generated carriers stored in the semiconductor device; these carriers can persist for a relatively long period (typically from the order of seconds up to several years). The light input is therefore the storage input of the storage-calculation integrated unit, and the collection area of the photo-generated carrier collection and readout area is its storage input terminal. The electrical input has no long-term storage capability within the unit, so the electrical input is the calculation input of the storage-calculation integrated unit, the corresponding terminal being the readout area or the carrier control area depending on the specific working mode. The final calculation result of the photoelectric computing unit is output in the form of a current in the readout area, so the readout area of the photo-generated carrier collection and readout area is the output terminal of the storage-calculation integrated unit.
  • Fig. 2 is a schematic diagram of the structure of a photoelectric computing array, in which 1 is a light-emitting array and 2 is a computing array. The light-emitting array 1 is composed of a plurality of periodically arranged light-emitting units, and the computing array 2 is composed of a plurality of periodically arranged computing units.
  • The computing unit of Embodiment 1-1 includes: a control gate serving as the carrier control area, a charge-coupled layer serving as the coupling area, and a P-type substrate serving as the photo-generated carrier collection and readout area. The P-type substrate is divided into a collection area on the left and a readout area on the right; the readout area includes shallow trench isolation and N-type source and drain terminals formed by ion implantation. The shallow trench isolation is located in the middle of the semiconductor substrate, between the collection area and the readout area; it is formed by etching and filling with silicon dioxide, and isolates the electrical signals of the collection area and the readout area. The N-type source terminal is located in the readout area on the side close to the bottom dielectric layer and is formed by ion-implantation doping; the N-type drain terminal is located on the other side of the semiconductor substrate close to the bottom dielectric layer, opposite the N-type source terminal, and is likewise formed by ion-implantation doping. It should be understood that the left, right, upper and lower sides mentioned in this text only denote relative positions under the viewing angle shown in the figures, change as the viewing angle changes, and are not to be understood as limiting the specific structure.
  • During collection, a negative-voltage pulse is applied to the substrate of the collection area, or a positive-voltage pulse is applied to the control gate, so that a depletion layer for photoelectron collection is generated in the substrate of the collection area, and the collected photoelectron quantity is read through the readout area on the right as the input quantity of the light input terminal. During readout, a positive voltage is applied to the control gate to form a conductive channel between the N-type source and drain terminals, and a bias pulse voltage applied between source and drain accelerates the electrons in the channel to form a current between source and drain. The carriers forming this channel current are jointly affected by the control gate voltage, the source-drain voltage and the number of photoelectrons collected in the collection area, and the result is output in the form of a current; the control gate voltage and the source-drain voltage can serve as the electrical inputs of the device, and the number of photoelectrons is its light input. The charge-coupled layer of the coupling area connects the collection area and the readout area: once the depletion region in the collection-area substrate begins collecting photoelectrons, the surface potential of the collection-area substrate follows the number of photoelectrons collected, and through this connection the surface potential of the readout-area substrate is affected in turn, which modulates the source-drain current of the readout area; the number of photoelectrons collected can therefore be read by judging the source-drain current of the readout area. The control gate of the carrier control area is used to apply the pulse voltage that generates the depletion region for photoelectron collection in the P-type semiconductor substrate, and can also serve as an electrical input terminal receiving one of the operands.
  • The computing unit of Embodiment 1-2 includes: a control gate as the carrier control area, a charge-coupled layer as the coupling area, and a P-type semiconductor substrate as the photo-generated carrier collection and readout area. The P-type substrate contains N-type source and drain terminals formed by ion implantation, and can undertake both the light-sensing and the readout work. The N-type source terminal is located on the side close to the bottom dielectric layer and is formed by ion-implantation doping; the N-type drain terminal is located on the other side of the semiconductor substrate close to the bottom dielectric layer, opposite the N-type source terminal, and is likewise formed by ion-implantation doping.
  • During collection, a negative-voltage pulse is applied to the P-type semiconductor substrate and a positive-voltage pulse is applied to the control gate serving as the carrier control area, so that a depletion layer for photoelectron collection is generated in the P-type substrate. The electrons generated in the depletion region are accelerated by the electric field between the control gate and the P-type substrate, reach an energy high enough to cross the bottom dielectric barrier between the substrate and the charge-coupled layer, and enter the charge-coupled layer where they are stored. The amount of charge in the charge-coupled layer affects the turn-on threshold of the device and hence the source-drain current during readout. The control gate voltage and the source-drain voltage can serve as the electrical inputs of the device, and the amount of photoelectrons stored in the charge-coupled layer is its light input.
  • The charge-coupled layer of the coupling area stores the photoelectrons that enter it and changes the threshold of the device at readout, which in turn affects the source-drain current of the readout area, so that the number of photoelectrons that entered the charge-coupled layer can be read by judging that current. The control gate of the carrier control area is used to apply the pulse voltage that generates the depletion region for exciting photoelectrons in the P-type semiconductor substrate, and can also serve as an electrical input terminal receiving one of the operands.
  • Fig. 5 is a schematic diagram (a) and a schematic diagram (b) of the multi-function area of the calculation unit of the embodiment 1-3.
  • The computing unit of Embodiment 1-3 includes a photodiode and a readout tube forming the photo-generated carrier collection and readout area; the photodiode is formed by ion doping and is responsible for light sensing. The N region of the photodiode is connected, through the photoelectron coupling lead serving as the coupling area, to the control gate of the readout tube and to the source terminal of the reset tube, and a positive voltage pulse applied to the drain terminal of the readout tube serves as the driving voltage of the readout current. Before exposure, the drain-terminal voltage of the reset tube is applied to the photodiode so that the photodiode serving as the collection area is reverse-biased and a depletion layer is produced; during exposure the reset tube is turned off and the photodiode is electrically isolated, and photons entering the depletion region of the photodiode generate photoelectrons that accumulate in the diode.
  • The readout tube is responsible for reading: a positive pulse voltage is applied to its drain terminal, and its source terminal is connected to the drain terminal of the address selection tube. At readout, the address selection tube is turned on and a current is generated in the readout tube, jointly determined by the drain-terminal voltage of the readout tube and the number of incident photons. The electrons in the readout-tube channel, acted upon by both the light input and the electrical input, are output in the form of a current; the reset-tube drain voltage and the readout-tube drain voltage can serve as the electrical inputs of the device, and the number of incident photons is its light input.
  • The photoelectron coupling lead of the coupling area connects the photodiode serving as the collection area in the photo-generated carrier collection and readout area with the readout tube serving as the readout area, and applies the N-region potential of the photodiode to the control gate of the readout tube. The reset tube receives a positive voltage through its drain terminal, which acts on the photodiode, causing it to produce a depletion region and to sense light; the reset tube can also serve as an electrical input terminal receiving one of the operands. The address selection tube controls whether the output current of the entire computing device is output, and can be used for row and column address selection when the photoelectric computing units form an array.
  • A memristor is a special kind of non-volatile memory (NVM) device that can switch between a "high-resistance state" and a "low-resistance state" and can store its resistance value for a long time.
  • FIG. 6 is a schematic diagram of the RRAM device structure and its three-terminal overview.
  • The device usually consists of two metal electrode layers sandwiching a special via layer in which conductive filaments can be formed; the via layer is mostly composed of a metal oxide such as WOx or TaOx. In its initial mode the RRAM device is in the high-resistance state. When a large bias voltage is applied across the device, it enters the programming state and a conductive channel is formed in the special via layer; after the voltage is reduced the conductive channel persists and stores the current resistance value, until a large negative bias puts the device into the erasing state, in which the conductive channel ruptures and the device returns to the initial high-resistance state.
  • When the RRAM device serves as a storage-calculation integrated device, its ability to store a resistance value for a long time means that its storage input terminal is the two ends of the device in the programming state. After the resistance has been written, the device is in a low-resistance state and behaves as a linear resistor within a certain voltage range; the calculation required of the storage-calculation integrated unit is completed within this linear resistance range, so the calculation input terminal is the two ends of the device within that range. When a bias voltage within the linear range is applied across the device, current flows from one end of the RRAM to the other, so the end from which the current flows out is the output terminal of the storage-calculation integrated device. Since RRAM is usually a two-terminal device, its storage input terminal, calculation input terminal and output terminal are usually the same physical area used in different working modes.
  • Flash memory (FLASH) is currently the most common non-volatile memory (NVM) device. Its basic storage unit is a floating-gate device, with a structure similar to the photoelectric computing unit described in Embodiment 1-2, or the structure shown in Figure 7. Figure 7 is a basic cell structure diagram of a flash memory, to which an erase gate (EG) and a word line (WL) for erasing and selection are added.
  • The basic principle is to add, between the channel of a normal MOSFET and the control gate, a charge storage layer surrounded by an oxide isolation layer; data are stored as charge in this isolated storage layer, and the stored charge is read out by determining the threshold of the transistor. The isolation layer can be a floating gate made of polysilicon, shown as FG (floating gate) in Figure 7, or a nitride layer, etc. The charge stored in the isolation layer is mostly injected through the channel hot electron injection (CHE) mechanism.
  • When the flash device serves as a storage-calculation integrated device, the charge stored in the isolated charge storage layer can be retained in the device for a long time, so the amount of stored charge is the storage input of the storage-calculation integrated device, and the storage input terminal is the hot-electron injection terminal; this mechanism usually occurs in the P-type substrate of the flash device, in the surface channel directly below the charge storage layer, i.e., directly below the FG (floating gate) in Figure 7. At readout, the channel current of the MOSFET is affected by the source-drain voltage Vds, the control gate voltage Vgs and the amount of charge stored in the charge storage layer, so the calculation input terminal can be the control gate of the flash device, shown as CG (coupled gate) and WL (word line) in Figure 7. Because the data produced by the joint action of the calculation input and the stored input finally flow as a current between the source and the drain of the flash device, the output terminals of the flash device as a storage-calculation integrated device are the source terminal and the drain terminal.
  • Any one of the above embodiments can be used as the storage-calculation integrated unit to perform pulse convolutional neural network calculations. The following specific implementations are available:
  • Take the MNIST data set as an example: its size is 10000*28*28, i.e., 10000 groups of test data; the image size is 28*28 with 1 channel; the data are floating-point numbers between 0 and 1; and the number of classes is 10.
  • the convolutional neural network takes Lenet-5 as an example.
  • the pooling layer can be maximum pooling or average pooling.
  • the specific network structure is shown in Figure 8 and Figure 9.
  • Figure 8 is a schematic diagram of the structure of Spiking-Lenet-5 (average pooling)
  • Figure 9 is a schematic diagram of the structure of Spiking-Lenet-5 (maximum pooling).
  • The input image size in Figures 8 and 9 is 28*28, and each pixel value must be converted into a binary number with a bit width of width.
  • The first layer is a convolutional layer. The convolution kernel size is 5*5 and the number of kernels is 6; each weight value is copied into a geometric sequence with ratio 1/2, width copies in total, and these copies are multiplied with the 0/1 values of the corresponding bits of the same pixel value. Each convolution window is 5*5 because the first layer has only one channel; for a multi-channel input, each convolution kernel would also have multiple channels, and the pixel values in each channel would be multiplied by the corresponding weights of the kernel. For each 5*5 convolution window, 5*5 pixel values are selected on the input image and each pixel value is multiplied by the kernel weight at the same position; the accumulated result of all multiplications within the same convolution window corresponds to the increment of the integral value.
  • The 28*28*6 above convolutional layer 1 is the total number of neurons in that layer: 28*28 is the size of the output image (the image is padded at its edges during convolution so that the output size equals the input size), and 6 corresponds to the number of convolution kernels and represents the number of channels of the output image.
  • For average pooling, the 28*28*6 output image is used directly as the input of convolutional layer 2 and a new calculation starts. Because average pooling is used here, the original 28*28 image would be averaged 2*2 to generate a 14*14 image; instead, each pixel of the 14*14 image is directly identified with the corresponding 4 pixels of the 28*28 image, which are integrated within the same convolution window. Correspondingly, the original 5*5 convolution window becomes 10*10, with adjacent 2*2 pixels sharing the same weight. The convolution calculation process is otherwise similar to that of convolutional layer 1.
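  • The merged kernel can be constructed as in the following sketch (illustrative; it assumes the 1/4 averaging factor of 2*2 pooling is absorbed into the weight and threshold scaling described later):

```python
import numpy as np

def merge_avg_pool_into_kernel(kernel, pool=2):
    # repeat each weight over a pool x pool block: a 5*5 kernel that would
    # follow 2*2 average pooling becomes an equivalent 10*10 kernel applied
    # directly to the unpooled 28*28 feature map
    return np.kron(kernel, np.ones((pool, pool)))

# merge_avg_pool_into_kernel(np.ones((5, 5))).shape == (10, 10)
```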
  • For maximum pooling, a maximum pooling layer 1 must be added before convolutional layer 2; its function is to choose 1 out of 4, so that the input image size of convolutional layer 2 is 14*14.
  • The last 10 counters respectively count the number of high-level pulses in the output pulse signals of the 10 neurons of fully connected layer 3. Depending on the specific implementation of the system, 10 further counters can be added to record the earliest time at which each neuron generates a high level.
  • In the host computer, the convolutional neural network must be trained first; the trained convolutional neural network is calculated according to the following formula:

  O(ii, jj, nn) = Σ_{c=1..channel} Σ_{ki=1..kernelsize} Σ_{kj=1..kernelsize} I(ii+ki-1, jj+kj-1, c) × W(ki, kj, c, nn) + B(nn)

  where I is the input of a certain layer of the convolutional neural network, W is the weight, B is the bias, O is the output, channel is the number of input channels, kernelsize is the size of the convolution kernel, ii is the row of the output image, jj is the column of the output image, and nn is the channel of the output image.
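  • A direct, unoptimized implementation of this formula (a sketch assuming a channels-last layout, I of shape (H, W, channel) and W of shape (kernelsize, kernelsize, channel, out), without padding):

```python
import numpy as np

def conv_layer(I, W, B):
    H, Wd, _ = I.shape
    k, _, _, out = W.shape
    O = np.zeros((H - k + 1, Wd - k + 1, out))
    for ii in range(O.shape[0]):
        for jj in range(O.shape[1]):
            for nn in range(out):
                # sum over ki, kj and the input channels, then add the bias
                O[ii, jj, nn] = np.sum(I[ii:ii+k, jj:jj+k, :] * W[:, :, :, nn]) + B[nn]
    return O
```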
  • The weight W and bias B of each layer, and the input data from the data set (i.e., the I of the first layer), are processed as follows. The gray values in the data set are quantized according to the data bit width width, giving binary numbers of width bits, with missing high bits filled with zeros; the original input data are thus expanded width-fold into binary digits, i.e., pulse signals. Since, according to the convolutional neural network's calculation formula, the gray value of the input image should be multiplied by a certain weight W' of a certain convolution kernel and accumulated, the weights of the first layer are copied into width parts which are, in order, kept unchanged, divided by 2, divided by 4, and so on by powers of 2; the corrected weights are recorded as W''. Accordingly, the bias of this layer is corrected by multiplying B' by 2 and recorded as B''. The binary digits obtained from quantizing the gray value are paired, from the high-order bit to the low-order bit, with W', W'/2, W'/4, W'/8, ...
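  • The equivalence behind this pairing can be checked with a small sketch (illustrative only; the factor of 2 that appears below is exactly why B' is multiplied by 2):

```python
def quantize(gray, width):
    # gray value in [0, 1) -> `width` binary digits, most significant bit first
    bits = []
    for _ in range(width):
        gray *= 2
        bits.append(int(gray))
        gray -= int(gray)
    return bits

def first_layer_term(w_prime, gray, width=8):
    # pair bit k (MSB first) with the weight copy w'/2**k and sum the products;
    # the result equals 2 * w' * (quantized gray value), i.e. the product
    # w' * gray up to a constant factor absorbed by the layer's threshold scaling
    bits = quantize(gray, width)
    return sum(b * (w_prime / 2**k) for k, b in enumerate(bits))
```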
  • In the average pooling scheme, each weight W' in the convolutional or fully connected layer that follows the average pooling layer is copied into several copies, the number being the square of the pooling size: for example, with a 2*2 pooling layer each weight is copied into 4 copies, and the corrected weights are recorded as W''. If the layer has a bias, the bias value is magnified 4 times on the basis of B' and recorded as B''.
  • Two further steps follow. First, the user scales the weights of each layer according to actual needs (for example, according to the actual quantization bit width and the maximum absolute value of the layer's weights, to achieve the highest possible accuracy); the scaled weight values are recorded as W''''. Second, the input I is changed into a number of binary-expanded inputs I''''.
  • The weights and biases are arranged in order according to the calculation formula of the convolutional neural network and written into the storage-calculation integrated units. After all storage-input writing is completed, the host computer sends input pulses to the calculation input terminals of the first-layer storage-calculation integrated units, and the device starts its calculation task.
  • Fig. 10 is a schematic diagram of a neuron composed of an integrated storage-calculation unit of Embodiment 4-1.
  • In the hardware, all circulating data are pulse signals, i.e., 0 or 1. The basic calculation unit is the storage-calculation integrated unit, which is responsible for multiplication. A neuron includes multiple storage-calculation integrated units: the inputs of their storage input terminals correspond to the synapses of a human-brain neuron, i.e., to the synaptic connection strengths W'''', and the inputs of their calculation input terminals correspond to the incoming signals, i.e., I''''.
  • The neuron also needs a cell body, which in each clock cycle accumulates the output result Σ I''''×W'''' + 1×B'''' of the storage-calculation integrated units with the cell body's potential v(t-1) at that moment. The formula is:

  v(t) = v(t-1) + Σ I''''×W'''' + 1×B''''
  • The neuron potential then undergoes the following changes, all completed before the arrival of the next clock cycle: if v(t) exceeds the positive threshold vth+, v(t) is cleared and an output pulse is emitted; if v(t) falls below the negative threshold vth-, v(t) is held at vth-. Here vth+ and vth- are hyperparameters that can be set per layer, and the negative threshold can also be set to 0.
  • This function is realized by the current integration and comparison circuit; the output result is saved in a register, aligned with the rising edge of the clock, and transmitted to the neurons of the next layer.
  • For a layer with a bias, the corrected bias value is stored in one storage-calculation integrated unit in each neuron of that layer, and the calculation input of that unit is permanently set to 1.
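  • Putting these pieces together, one neuron's cell body behaves as in the following sketch (a behavioral model of the current integration and comparison circuit, not its transistor-level design):

```python
class Neuron:
    def __init__(self, vth_pos, vth_neg=0.0):
        self.vth_pos = vth_pos     # positive threshold, settable per layer
        self.vth_neg = vth_neg     # negative threshold (may be 0)
        self.v = 0.0               # cell body potential

    def step(self, weighted_sum, bias):
        # v(t) = v(t-1) + sum(I'''' * W'''') + 1 * B''''
        self.v += weighted_sum + bias
        if self.v > self.vth_pos:
            self.v = 0.0           # accumulated sum cleared on firing
            return 1               # output pulse, registered on the clock edge
        if self.v < self.vth_neg:
            self.v = self.vth_neg  # held at the negative threshold
        return 0
```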
  • Scheme 1: the values of the 10 counters are transmitted to the host computer in real time. When the pulse count of one class exceeds that of every other class by a, where a is a set constant (the recommended setting is 4), the calculation can be ended and the class number with the largest pulse count is output. If pulse counts are equal, the class whose counter received a pulse first is output.
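  • Scheme 1's decision rule, as an illustrative sketch (counts and first_spike_times stand for the per-class counter values; a = 4 is the recommended constant):

```python
def calculation_ended(counts, a=4):
    # end when one class counter leads every other counter by at least a
    ordered = sorted(counts, reverse=True)
    return ordered[0] - ordered[1] >= a

def final_class(counts, first_spike_times):
    # the largest pulse count wins; ties go to the earliest first pulse
    best = max(counts)
    tied = [i for i, c in enumerate(counts) if c == best]
    return min(tied, key=lambda i: first_spike_times[i])
```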
  • Afterwards, the host computer sends the corresponding control signals to the control system, clears and resets the parts of the system that need clearing and resetting, and then sends the input pulse signals of the next picture to start the next round of calculation.
  • Scheme 2: the values of the 10 counters are transmitted to a dedicated end-condition judgment module, which realizes the following function: when, at some moment, the pulse count of one class exceeds that of every other class by a, where a is a set constant (recommended setting 4), the calculation can be ended; if the set maximum time length is reached without the end condition being met, the calculation is forcibly ended. "Ending" here means raising an end signal, which is transmitted to the control system and the host computer, and resetting the corresponding parts of the hardware.
  • Afterwards, the host computer transmits new image data to the control system; the control system saves the image data sent by the host computer and distributes the image data to be calculated next to the storage-calculation integrated units. (Depending on the memory capacity available in the actual system, several different data transmission schemes are possible here; they are not limited.)
  • Figure 11 is a block diagram of the entire system (average pooling).
  • Figure 12 is a block diagram of the entire system (max pooling).
  • CONV: convolutional layer; FC: fully connected layer.
  • Spiking-Lenet-5 impulse convolutional neural network structure diagram described in conjunction with Figures 8 and 9
  • Figure 11 and Figure 12 implement each layer in hardware separately, and the data is circulated in different modules; in addition, the hardware part is also
  • Figure 13 is the calculation flow chart of the entire system (average pooling).
  • Figure 14 is a calculation flow chart of the entire system (max pooling).
  • CONV stands for convolutional layer
  • FC stands for fully connected layer.
Before any image is calculated, the trained weights and biases, after correction, must first be written into the storage input terminals of the storage-calculation integrated units. All modules of the entire hardware accelerator are then reset, except for the data written at the storage input terminals. The upper computer next starts transmitting input data to the hardware accelerator; the control system receives these data and, once all input data of the first picture have arrived, begins distributing them to the storage-calculation integrated units. Because the input data of one image stay unchanged until that image has been fully calculated, the system may, depending on its specific design, either wait for one image to finish before transmitting the next image's input data, or save the input data of the next image (or of several images) in the hardware accelerator before the first image finishes, realizing a ping-pong operation.

For every convolutional or fully connected module, the storage-calculation integrated units receive the input signals at their calculation input terminals; the results of all units are summed as currents on a shared line and fed into the current integration and comparison circuit, which integrates, compares against the threshold, generates an output pulse, and aligns it with the rising clock edge in the following register to yield the layer's output. All of these modules run their computations independently and simultaneously, without pause.
In Figure 14 a max pooling module is also added: the outputs of conv1 and conv2 first enter max pooling modules 1 and 2, where the line on which a high level appears earliest is selected and transmitted to the next layer.
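The selection logic can be pictured with the following sketch (a behavioural stand-in for the digital logic plus multiplexer; the class and method names are illustrative):

    class EarliestHighMux:
        # 4-to-1 selector for a 2*2 pooling window: latch the line that
        # goes high first and forward only that line afterwards.
        def __init__(self):
            self.sel = None

        def step(self, lines):            # lines: list of 0/1 pulses, one per input
            if self.sel is None:
                for i, v in enumerate(lines):
                    if v == 1:
                        self.sel = i      # latch the earliest high line
                        break
            return 0 if self.sel is None else lines[self.sel]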
Regarding the counter group, Scheme 1: the counter group counts the high-level pulses in the output of each neuron of the last fully connected layer, and the result is continuously transmitted back to the upper computer by the control system. The upper computer judges, according to the conditions set by the user, whether the calculation of the current picture is complete. If not, the status quo is maintained; if so, the control signals are changed, the current integration and comparison circuits and the other registers and counters in the hardware accelerator are reset, and a new picture is transmitted.
Scheme 2: the counter group counts the high-level pulses in the output of each neuron of the last fully connected layer, together with the earliest time each neuron produced a high-level pulse; the result is sent back to the upper computer by the control system only after the current round of calculation has ended. The end of the round is decided by digital logic from the counter-group statistics: if the round is not over, the status quo is maintained; if it is over, a high end signal is sent to the control system, the current integration and comparison circuits and the other registers and counters in the system are reset, and the storage-calculation units wait for the control system to distribute the next image's data. After receiving the end signal, the upper computer sends a new picture to the control system and processes the returned counter-group data to obtain the final classification result.
According to this embodiment, the current integration and comparison circuit replaces the analog-to-digital converter, whose power consumption and area are large, greatly reducing the area and power consumption of the entire system. Moreover, the outputs of every convolutional or fully connected layer are wired directly to the next convolutional or fully connected layer, and the weight data are stored directly in the storage-calculation units, so the system needs no on-chip cache; a large amount of data movement is saved and the calculation is correspondingly faster.
Embodiment 4-2 builds on Embodiment 4-1 by adding, inside the integration and comparison circuit, the function of synchronizing its output result with the clock signal. Every register connected to an integration and comparison circuit in Embodiment 4-1 is removed, and the circuit's output is connected directly to the next layer of neurons, to the max pooling module, or to a counter. See Figure 15 for a schematic of the register-free neuron, Figure 16 for the block diagram of the entire system (registers removed, average pooling), and Figure 17 for the block diagram of the entire system (registers removed, max pooling).
Embodiment 4-3 takes Cifar-10 as the example data set: its size is 10000*32*32*3, i.e. 10000 groups of test data with image size 32*32 and 3 channels; the data are integers from 0 to 255 and there are 10 classes. The convolutional neural network is Alexnet, with some changes to the model used here: a BN layer immediately follows the first and second convolutional layers, the pooling layers are changed to average pooling, and all convolutional layers use 3*3 kernels. The specific network structure is shown in Figure 18.
In Figure 18 the input image size is 32*32 with 3 channels, and each pixel value must be converted into a binary number of bit width width. The first layer is a convolutional layer with 96 kernels of size 3*3 and 3 channels. Each weight value is copied into a geometric sequence with ratio 1/2, width copies in all, which are multiplied by the 0/1 values of the different bits of the same pixel; the 3 channels of a kernel correspond to the channels of the input image. The convolution window is 3*3: 3*3 pixel values are selected on the input image, and each pixel is multiplied by the kernel weight at the same position. The accumulated sum of all products inside one convolution window corresponds to the increment of the integral value in one neuron's current integration and comparison circuit. Sliding the convolution window over the input image in a fixed order corresponds to different neurons; switching to a different kernel corresponds to a different group of neurons. The 32*32*96 above convolutional layer 1 is that layer's total number of neurons: 32*32 is the output image size (during convolution, the parts of the window that fall outside the image edge are padded with 0), and 96, matching the number of kernels, is the number of channels of the output image.
This 32*32*96 output image serves directly as the input of convolutional layer 2, where a new calculation starts. Because average pooling is used here, the original 32*32 image would be averaged 2*2 into a 16*16 image; instead, the 4 pixels of the 32*32 image that correspond to any one pixel of the 16*16 image are integrated directly into the same convolution window. Accordingly, the original 3*3 convolution window becomes 6*6, with adjacent 2*2 pixels sharing the same weight. The convolution calculation then proceeds as in convolutional layer 1, and likewise for the other convolutional layers. The fully connected layers perform a direct matrix-vector multiplication; the 16384*1024 in Figure 18 results from the 4096*1024 weights being copied because of average pooling.
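How a 3*3 kernel absorbs 2*2 average pooling can be written in a few lines (numpy; the kernel values are stand-ins):

    import numpy as np
    k = np.arange(9.0).reshape(3, 3)            # a stand-in 3*3 kernel W'
    k6 = np.kron(k, np.ones((2, 2)))            # 6*6 window; adjacent 2*2 pixels share a weight
    assert k6.shape == (6, 6) and k6[0, 0] == k6[0, 1] == k6[1, 1] == k[0, 0]

Correspondingly, the merged 6*6 window slides in steps of 2 pixels on the 32*32 image, producing one output per position of the pooled 16*16 image.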
The final 10 counters respectively count the high-level pulses in the output pulse trains of the 10 neurons of fully connected layer 3. Depending on the specific implementation of the system, another 10 counters can be added to record the earliest time each neuron produces a high level.
In the upper computer, the convolutional neural network must first be trained. A trained convolutional layer computes

O[ii][jj][nn] = Σ_{cc=0..channel−1} Σ_{ki=0..kernelsize−1} Σ_{kj=0..kernelsize−1} I[ii+ki][jj+kj][cc]·W[ki][kj][cc][nn] + B[nn]

and a trained fully connected layer computes

O[nn] = Σ_{cc=0..channel−1} I[cc]·W[cc][nn] + B[nn]

where I is the input of the layer, W the weights, B the bias, O the output, channel the number of input channels, and kernelsize the size of the convolution kernel; ii is the row, jj the column, and nn the channel of the output image.
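A direct reading of the convolution formula, for reference (numpy, stride 1, no padding; the array layouts are assumptions):

    import numpy as np

    def conv_forward(I, W, B):
        # I: (H, W, channel); W: (kernelsize, kernelsize, channel, kernels); B: (kernels,)
        H, Wd, _ = I.shape
        k, _, _, N = W.shape
        O = np.zeros((H - k + 1, Wd - k + 1, N))
        for nn in range(N):                      # output channel
            for ii in range(H - k + 1):          # output row
                for jj in range(Wd - k + 1):     # output column
                    O[ii, jj, nn] = np.sum(I[ii:ii + k, jj:jj + k, :] * W[..., nn]) + B[nn]
        return O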
The weights W and biases B of every layer, together with the input data from the data set (the I of the first layer), are then processed as follows. For the input data, the smaller of the storage-input quantization bit width and the data-set input bit width is chosen as the system bit width width; the RGB values in the data set are quantized to width-bit binary numbers, with missing bits zero-filled at the high end, so the original input data are expanded into width times as many binary digits, i.e. pulse signals. For the first layer, suppose an RGB value of the input image should be multiplied by some weight W' of some convolution kernel and accumulated according to the convolution formula; the first layer's weights are then copied into width parts that are, in turn, kept unchanged, divided by 2, divided by 4, and so on through the powers of 2, and the corrected weights are denoted W''. The bias of this layer is multiplied by 2 on top of the correction B' described above and denoted B''. The binary digits obtained from quantizing the RGB value are paired, from the highest bit to the lowest, with W', W'/2, W'/4, W'/8, and so on.
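A worked check of this expansion (Python; the pixel value and the stand-in weight are arbitrary, and the factor 2**(width−1) follows from the scheme above):

    width = 8
    pixel = 200                                   # an example quantized input value
    bits = [(pixel >> (width - 1 - b)) & 1 for b in range(width)]   # high bit first
    w1 = 0.35                                     # a stand-in first-layer weight W'
    copies = [w1 / 2 ** b for b in range(width)]  # W', W'/2, W'/4, ... (the W'' values)
    # the bit-serial sum equals pixel*W' once rescaled by 2**(width-1):
    assert abs(sum(b * c for b, c in zip(bits, copies)) * 2 ** (width - 1) - pixel * w1) < 1e-9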
If the network uses an average pooling layer, every weight W' of the convolutional or fully connected layer that follows the pooling layer is copied a number of times equal to the square of the pooling size; for a 2*2 pooling layer each weight is copied 4 times, and the corrected weights are denoted W''. If that layer has a bias, the bias value is magnified 4 times on the basis of B' and denoted B''.
At the spiking-network level, the user first scales the weights of each layer according to actual needs (for example, according to the actual quantization bit width and the maximum absolute weight of the layer, so as to reach the highest possible accuracy); the scaled weights are denoted W'''. This changes the convolution formula only in that the input I has been expanded into several binary inputs I'' and W'' has been processed into the corresponding multiples W'''. The biases must then be corrected for the spiking dynamics: because each layer's output pulse frequency is its accumulated value divided by that layer's positive threshold, the scalings of all previous layers accumulate, and the corrected bias of layer n works out to

B_n''' = (A_1·A_2·…·A_n) · B_n'' / (vth_1+ · vth_2+ · … · vth_{n−1}+),

where A_i is the overall weight scaling applied to layer i and vth_i+ is the positive threshold of layer i; these threshold hyperparameters are all values set by the user on the upper computer. Finally, all weights and biases are binary-quantized to the bit width width on top of these corrections, giving the values W'''' and B'''' that are actually written into the storage input terminals. All of this is done in the upper computer; afterwards, the weights and biases are ordered according to the convolution formula and written into the storage-calculation integrated units. Once all storage inputs are written, the upper computer sends input pulses to the calculation input terminals of the first layer's units, and the device starts its calculation task.
Fig. 19 is a schematic diagram of a neuron composed of integrated storage-calculation units in Embodiment 4-3. The circulating pulse data, the storage-calculation units responsible for multiplication, the cell body that accumulates Σ I''''·W'''' + 1·B'''' onto the potential v(t−1) in every clock cycle, the generation of the output pulse, and the per-layer thresholds vth+ and vth− (with the negative threshold settable to 0 for a network without biases) are all as described for Embodiment 4-1 above.
Here, however, the function is implemented by a current integration and comparison circuit whose output result is saved in the on-chip buffer rather than in a register. For each neuron, the output pulses of a fixed duration are collected as one data packet, the fixed duration matching the transmission time of the input pulses. Once all the data packets needed by the next layer's calculation have been buffered, they are transmitted to the next layer's neurons in packet form, and the accumulated value in the current integration and comparison circuit is cleared to zero. The capacity of the on-chip buffer and the number of neurons required must be weighed against the actual situation, balancing area, power consumption, speed, and the calculation speeds of the individual layers. As before, for a layer that needs a bias, the corrected bias value is stored in one storage-calculation unit of each neuron of that layer, whose calculation input is always held at 1.
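The packing of the pulse trains can be pictured as follows (plain Python; T is the fixed duration in clock cycles):

    def packetize(spike_train, T):
        # group a neuron's output pulses into fixed-duration packets,
        # matching the duration used when the inputs were sent
        return [spike_train[i:i + T] for i in range(0, len(spike_train), T)]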
Zero padding is needed in many convolutional layers; because the padded positions are fixed, the corresponding inputs are simply held at 0 throughout their transmission time. For an average pooling layer, as before, all inputs to be averaged are connected directly to the corresponding neurons of the next convolutional or fully connected layer with the weights copied (already done in the upper computer). Behind fully connected layer 3, 10 counters keep counting the pulses received for the ten classes and send them to the upper computer through the control system; extra logic is needed to judge when pulses are present, because the last layer runs for a much shorter time than the earlier layers rather than continuously. Another 10 counters can be added to record the earliest time each neuron produces a high level. The upper computer compares the 10 pulse counts received within the fixed duration and selects the maximum; if the earliest-high counters are present and at least two classes have the same pulse count, the class that received a pulse first is output. After the picture is finished, the upper computer sends the corresponding control signals to the control system, the parts of the system that need clearing and resetting are cleared and reset, and the input pulse signals of the next picture are sent to start the next round of calculation.
The block diagram of the entire system is shown in Figure 20. Following the Spiking-Alexnet structure of Figure 18, each layer is implemented in hardware separately, and the data flow between the different modules. The hardware also contains a control system that receives input data and control signals from the upper computer, writes the input data into the on-chip buffer, receives the statistical results from the counter module, and sends them to the upper computer. In particular, the system contains the on-chip buffer and its companion logic circuits: the logic receives control signals from the control system and, following the order in which data are distributed for calculation, generates the buffer control signals and write addresses that save the output pulses received from Conv1-Conv5 into the on-chip buffer; according to the actual buffer capacity, data in the buffer that will no longer be used are overwritten by new data; and, again following the distribution order, it generates the buffer control signals and read addresses that read out the input data required by the Conv1-Conv5 and FC1 calculations.
The calculation flow chart of the whole system is shown in Figure 21. Before an image is calculated, the trained weights and biases, after correction, are first written into the storage input terminals of the storage-calculation integrated units, and all modules of the hardware accelerator, except for those written data, are reset. The upper computer then starts transmitting input data to the hardware accelerator; the control system receives the data and writes them into the on-chip buffer. After all input data needed for one calculation of the Conv1 module have been transmitted, the on-chip buffer begins distributing data to Conv1's storage-calculation units, and once distribution is complete the Conv1 module starts calculating. The transmission speed of the upper computer must be chosen jointly with the calculation speed of each module of the system and the on-chip buffer capacity, but it must be guaranteed that within each fixed duration T the data required for the Conv1 module's next calculation have already been saved in the on-chip buffer.
For every convolutional or fully connected module, the storage-calculation integrated units receive the input signals at their calculation input terminals; the unit results are summed as currents and fed into the current integration and comparison circuit, which integrates, compares with the threshold, generates an output pulse, and aligns it with the rising clock edge in the following register to yield the layer's output. The modules all run independent calculations at the same time, in units of the fixed duration T: for T clock cycles the storage-calculation units and the current integration and comparison circuits compute continuously, after which each circuit receives a control signal from the control system, resets to zero, and waits for the start of the next calculation. For each convolutional module and for FC1, the input signals come from the on-chip buffer; that is, each T-clock-cycle calculation of these modules may begin only once all the inputs it needs have been read out of the on-chip buffer. For the FC2 and FC3 modules, the inputs come from the output signals of the previous fully connected layer. The output signals of each neuron of every convolutional module are packed in units of size T and stored in the on-chip buffer.
The counter group counts the high-level pulses in the output of each neuron of the last fully connected layer; when the fixed calculation time ends, the result is sent back to the upper computer by the control system. The upper computer then changes the control signals, resets the current integration and comparison circuits with their registers and the counters in the hardware accelerator, and starts transmitting a new picture. Depending on the specific implementation, 10 more counters can be added to record the earliest time each neuron produces a high level, to assist in judging the classification result. For a large-scale network, keeping part of the data in the on-chip buffer in this way trades time for space and greatly reduces the hardware resources required.
Embodiment 4-4 builds on Embodiment 4-3 by replacing the on-chip buffer with registers; the logic control circuit must be modified accordingly, because registers are addressed differently from an on-chip buffer. The system block diagram is shown in Figure 22.
Embodiment 4-5 builds on Embodiment 4-3 by replacing the on-chip buffer with off-chip memory. The hardware accelerator part then contains only the storage-calculation integrated units and counters of each layer; the on-chip buffer and its logic control circuit are moved off-chip, and their functions are taken over by an FPGA development board (field-programmable gate array) and DDR (double data rate synchronous dynamic random-access memory). The system block diagram is shown in Figure 23.
Embodiment 4-6 builds on Embodiment 4-3 by replacing the on-chip buffer with off-chip cloud storage. The hardware accelerator part again contains only the storage-calculation integrated units and counters of each layer; the on-chip buffer and its logic control circuit are moved off-chip, and their functions are taken over by the upper computer and cloud storage. The system block diagram is shown in Figure 24.
Embodiment 5 uses any one of the units described above as the storage-calculation integrated unit to perform spiking convolutional neural network calculations, with the following specific implementation. Figure 25 shows the Alexnet network structure of Embodiment 5. The data set is cifar-10: its size is 10000*32*32*3, i.e. 10000 groups of test data, input images of size 32*32 with 3 channels, data being integers between 0 and 255, and 10 classes. The convolutional neural network is Alexnet, with some changes to the model used here: a BN layer immediately follows the first and second convolutional layers, the pooling layers may be either max pooling or average pooling, and all convolutional layers use 3*3 kernels.
The output of each convolutional layer is obtained by the same convolution formula given above, where I is the input of the layer, W the weights, B the bias, O the output, channel the number of input channels, and kernelsize the size of the convolution kernel, both kernel dimensions being 3 here. The output of each fully connected layer is obtained by the same fully connected formula, with the same meanings of I, W, B, O and channel.
On the basis of this convolutional neural network, the spiking convolutional neural network is generated; its most basic computing element is the storage-calculation integrated unit, responsible for multiplication. Figure 28 is the structural diagram of a neuron in Embodiment 5: a neuron comprises multiple storage-calculation integrated units, similar to the description in Embodiment 4 above, which is not repeated here. For a layer (Layer M) followed by a BN layer, bn.weight (γ), bn.bias (β), bn.running_mean (mean), bn.running_var (var) and bn.eps (eps, the small quantity added to the denominator, 1e-5 by default) must be exported during training, and the weights and bias of Layer M are modified as

W' = W·γ/sqrt(var + eps)
B' = (B − mean)·γ/sqrt(var + eps) + β

This merges Layer M (a convolutional or fully connected layer) with the BN layer; at inference time only the Layer M with corrected W and B needs to be kept, and no separate BN computation is required.
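As a software check, the merge is the standard batch-normalization folding identity (numpy; per-output-channel parameter vectors are assumed, so other weight layouts may need an explicit reshape for broadcasting):

    import numpy as np

    def fold_bn(W, B, gamma, beta, mean, var, eps=1e-5):
        # Fold the BN layer into the preceding conv/FC layer:
        # y = gamma*(W*x + B - mean)/sqrt(var + eps) + beta
        scale = gamma / np.sqrt(var + eps)
        return W * scale, (B - mean) * scale + beta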
Consider the input data of the first layer of the spiking network. Suppose a pixel value is 64, which is 01000000 in binary; 0-255 can be represented by an 8-bit binary number, with missing bits zero-filled at the high end. Suppose this pixel value should be multiplied by some weight W of some convolution kernel. The original 64*W must then be converted into 128*(0*W + 1*W/2 + 0*W/4 + 0*W/8 + 0*W/16 + 0*W/32 + 0*W/64 + 0*W/128): the input is expanded to 8 times its original number of digits, and the weight is copied into 7 additional parts divided by the successive powers of 2 before accumulation. This input is held unchanged until the whole round of calculation ends; one round corresponds to one group of test data in the data set, i.e. one 32*32*3 image in Cifar-10. If the layer has a bias and the layer's weights keep their original proportions, the bias value is multiplied by 2 on top of the correction formula above; if the layer's weights are scaled as a whole, the bias is scaled by the same proportion.
By contrast, in existing spiking convolutional network algorithms the input pulses are generated from random numbers: a decimal number between 0 and 1 is drawn and compared with the pixel value/255, and a pulse is generated only if the random number is smaller. That method is highly random, and only after a large number of pulses does it approach the original pixel value. In the algorithm of the present invention the input pulses are exactly equivalent to the original pixel value, and no large number of pulses is needed.
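The difference between the two input schemes can be seen numerically (numpy; the pixel value and run length T are arbitrary):

    import numpy as np
    rng = np.random.default_rng(0)
    pixel, T, width = 64, 1000, 8
    # rate coding: Bernoulli pulses whose mean only converges to pixel/255 over many pulses
    err_random = abs((rng.random(T) < pixel / 255).mean() * 255 - pixel)
    # bit-plane coding (the scheme above): exact with just width constant inputs
    bits = [(pixel >> (width - 1 - b)) & 1 for b in range(width)]
    err_exact = abs(sum(b / 2 ** i for i, b in enumerate(bits)) * 2 ** (width - 1) - pixel)
    print(err_random, err_exact)    # err_exact is 0.0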
In addition, depending on actual needs, the weights may be quantized. Suppose a weight can be represented by at most a WW-bit binary number. If WW is not less than 8 (the 8 bits needed for 0-255; for input data with a larger range the bound is correspondingly larger), the input data and weights are copied as described above. If WW is less than 8, however, the weight copies divided by the larger powers of 2 may quantize directly to 0, so the corresponding inputs have no effect and can simply be omitted: the input keeps the WW bits starting from the high end, and the copied weights keep the WW copies with the larger absolute values.
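Truncation for WW < 8 then looks like this (Python; the values of WW, the pixel and the weight are illustrative):

    width, WW = 8, 4
    pixel, w1 = 200, 0.35
    bits = [(pixel >> (width - 1 - b)) & 1 for b in range(width)]
    kept_bits = bits[:WW]                           # the WW high-order input bits
    kept_copies = [w1 / 2 ** b for b in range(WW)]  # the WW largest-magnitude weight copies
    # copies divided by larger powers of 2 would quantize to 0 under WW bits,
    # so the corresponding low-order input bits are simply dropped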
For the bias correction of each layer, besides the correction due to the thresholds, the weight scalings of all previous layers accumulate in the current layer. For example, if after adjustment the first layer's weights make the result accumulated in the integration and comparison circuit twice the theoretical value of the convolutional network model, then not only must the first layer's bias be doubled accordingly, but at the second layer this factor of 2 is still expressed in the frequency of the input pulses; that is, the I in the second layer's I*W+B is already twice its original value, so B must also be doubled. The same reasoning applies to the later layers; in short, the rule for the correction is that I*W and B are scaled by the same factor.
Besides the convolutional, fully connected and BN layers, a convolutional neural network also contains pooling layers, commonly of two kinds, max pooling and average pooling. If the pooling window is 2*2, pooling turns the original 4 inputs into 1 output, shrinking the image and reducing the amount of computation: max pooling outputs the maximum of the 4 inputs, and average pooling outputs their mean. For an average pooling layer, all inputs to be averaged are connected directly to the corresponding neurons of the next convolutional or fully connected layer and the weights are copied accordingly, as in Embodiment 4. Since scaling all weights and biases of one layer by the same factor does not affect the final result in a spiking network, the next layer can directly compute Σ_{2*2} O·W; if that layer has a bias, the bias value is magnified 4 times, on top of all corrections described above, before being written into the storage input terminals. Compared with adding a dedicated average pooling layer built from neurons, whose threshold-generated pulses deviate somewhat from the theoretical result, merging the average pooling layer into the next layer guarantees no loss of precision in the pooling computation; the resulting Spiking-Alexnet structure (average pooling) is shown in Figure 26. For a max pooling layer, nothing changes in the next layer; instead, between the two adjacent layers, extra judging logic selects, among the inputs of each pooling window, the line that is first to be 1 since the start of the calculation and connects it to the next layer, ignoring the remaining inputs; the resulting Spiking-Alexnet structure (max pooling) is shown in Figure 27.
As shown in the figures, behind fully connected layer 3 there are 10 or 20 counters continuously recording the number of pulses (high levels) received and the earliest time a pulse (high level) was received; each counter corresponds to one neuron and hence to one class of the image classification result. When, at some moment, one class's counter holds a pulses more than every other class, where a is a preset hyperparameter (4 is the recommended setting), the calculation is considered finished and the class number with the largest pulse count is output. If the set maximum duration is reached without this condition being met, the calculation is forced to end and the class with the most pulses among the 10 is selected. If at least two classes have the same pulse count, the class that received a pulse earliest is output.
According to the spiking convolutional neural network algorithm of the above embodiments, optimizations such as changing the input method, merging the average pooling layer into the next convolutional or fully connected layer, supporting convolutional and fully connected layers with bias, supporting BN layers in the network, setting an end-of-calculation criterion, and adding auxiliary judgments for special situations greatly reduce the calculation time of existing spiking convolutional network algorithms and improve the accuracy of image classification. The end of the computation is also considered explicitly, which improves the calculation time.
The storage-calculation integrated unit according to the above embodiments of the present invention can be implemented in an integrated circuit. The manufacturing method of such an integrated circuit includes the following steps: 1) forming, by thermal oxidation and deposition, the dielectric layers and gates of the transistors in the digital logic circuits, the integration and comparison circuits and the storage-calculation integrated units; the transistors include ordinary logic transistors, high-voltage transistors, floating-gate transistors, and the like; 2) forming the capacitors of the integration and comparison circuits by depositing an MIM dielectric layer and a metal layer, or by thermal oxidation and deposition processes; the capacitors may be MIM capacitors or MOS capacitors; 3) forming, by ion implantation, the sources and drains of the transistors and the P and N sides of the PN junctions in those circuits; 4) forming the metal interconnects, the active-area-to-metal vias, and the metal-to-metal vias of the overall circuit by the metal-layer, inter-metal-dielectric, and via processes; 5) generating a CMOS-process storage-calculation integrated unit by the process corresponding to the memristor or flash memory.
In the production process of the integrated circuit based on the spiking convolutional neural network, the digital logic circuits and the integration and comparison circuits in the neurons can all be produced with a standard CMOS process; so can the storage-calculation integrated units in the neurons, if photoelectric computing units or flash memory are used. The standard CMOS flow for semiconductor devices such as transistors, diodes or capacitors is not described in detail here; producing the photoelectric computing unit with a CIS image sensor process yields better device performance. If a memristor is used as the storage-calculation unit in a neuron, a special process compatible with that memristor is required; the integration of the special-process storage-calculation devices with the standard-CMOS digital logic and integration-comparison circuits can be achieved by fabricating the special devices directly on a silicon-based substrate with the special process, or by wafer-level or off-chip integration, for example by the method of producing high-endurance memristors on silicon-based substrates mentioned in Chinese patent CN110098324A, among other memristor fabrication processes.
Those skilled in the art will appreciate that the devices and algorithm steps described in connection with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are executed in hardware or in software depends on the specific application and the design constraints of the technical solution; skilled persons may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of the present disclosure. In the embodiments provided in the present disclosure, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another device, and some features may be omitted or not implemented. Units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments.
If the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product stored in a storage medium and including several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure. The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory, random access memory, a magnetic disk, or an optical disk. The above are only specific implementations of the present disclosure, but the scope of protection of the present disclosure is not limited thereto; any change or substitution readily conceived by a person skilled in the art within the technical scope disclosed here shall be covered by the scope of protection of the present disclosure, which shall be subject to the scope of protection of the claims.


Abstract

A spiking convolutional neural network algorithm and a related integrated circuit, computing device and storage medium, for performing spiking convolutional neural network operations in artificial intelligence in an efficient, low-power way. The algorithm is based on a storage-calculation integrated unit comprising at least one storage input terminal, at least one calculation input terminal and one output terminal, and is characterized in that the weights of the first layer of the spiking convolutional neural network are copied into a number of parts, the number being at least the smaller of the number of bits of the binary number into which the quantity characterizing the attribute of the object to be analyzed is converted and the number of storage input terminals of the storage-calculation integrated unit; the copied weights are processed so that each successive copy is numerically halved, and the resulting values are respectively input into the storage input terminals of a plurality of storage-calculation integrated units, the number of units being the same as the number of parts.

Description

脉冲卷积神经网络算法、集成电路、运算装置及存储介质
本申请要求于2019年12月9日递交的中国专利申请第201911249006.1号的优先权,在此全文引用上述中国专利申请公开的内容以作为本申请的一部分。
技术领域
本发明涉及一种脉冲卷积神经网络脉冲卷积神经网络算法、集成电路、运算装置及存储介质,可以将传统的卷积神经网络转换成脉冲神经网络,具体涉及图像分类领域。
背景技术
传统的计算机大多采取冯诺依曼架构,然而,因为冯诺依曼架构存储单元和运算单元的分立,导致了在数据传输上产生了极大的能量消耗,并且影响运算速度。目前卷积神经网络在图像分类领域具有非常好的效果,拥有大量成熟的训练方法和工具,还有经过大量验证的经典卷积神经网络模型,如lenet-5、alexnet、vgg-16等。如果在采用冯诺依曼架构的硬件上运行,比如CPU、GPU、FPGA,则需要大量的数据传输过程,对于规模很大的矩阵,计算速度比数据传输的速度要快得多,优化计算速度不能够加快整个系统的速度。
存算一体单元由于能够将数据直接保存在计算单元上且不需要片上缓存,从而解决了上述问题,但是卷积神经网络运算中依然存在大量的中间数据需要缓存,且在存算一体单元上实现时,需要使用大量模数转换器将电流转换为数字信号,占用了整个系统大部分的面积和功耗。另外由于模数转换器的频率有限,所以整个系统的计算速度也受到其限制,无法再进行提升。
发明内容
脉冲神经网络试图尽可能模拟人脑的计算方式,明显的特征就是数据都是以脉冲信号的形式在网络里流动,在用硬件实现时,功耗远小于卷积神经网络。
脉冲卷积神经网络结合了卷积神经网络和脉冲神经网络的特点,将卷积神经网络模型进行一些修改后,使得可以用训练卷积神经网络的方法得到权值,并且分类准确率相对于所采用的卷积神经网络,下降幅度很小。由于网络中的数据都是脉冲形式,所以硬件资源消耗小。目前该领域的研究还仅停留在算法层面,没有相关的硬件实现方案。
现有的脉冲卷积神经网络算法中,一种输入方式是,输入脉冲通过随机数生成的方式产生,根据大数定律,需要很长时间生成大量脉冲后,才可能收敛到原始值,贴近卷积神经网络的分类结果,而这样就需要大量的计算时间,计算效率很低。另一种输入方式是,输入并不是脉冲,而是模拟值,在电路实现时,一方面输入的精度得不到保证,另一方面考虑到实际应用,输入源很可能是图像传感器,输出均为数字信号,需要考虑到兼容性。
批标准化(Batch Normalization,BN)层是卷积神经网络中的一种对网络进行优化的常用层,可以提高训练的准确率,减少训练结果对初始化方法的依赖。通过数学推导,如果要在脉冲卷积神经网络算法中添加BN层,经过卷积层/全连接层与BN层的合并后,卷积层/全连接层中一定会不可避免地产生偏置。而现有的脉冲卷积神经网络算法中,都避免了偏置的使用,这样就无法添加BN层,对大规模的卷积神经网络的训练工作带来了困扰。
此外,现有的脉冲卷积神经网络算法中,都没有考虑过结束的问题,然而在实际仿真和电路中,计算时长也是很重要的考量因素,针对这一点也值得改进。
鉴于以上,根据本发明的一方面,提出了一种脉冲卷积神经网络算法,通过改变输入方式、将平均池化层并入下一个卷积层或全连接层、支持带偏置的卷积层和全连接层的计算、支持在网络中添加BN层、设定计算结束判定条件、加入对特殊情况的辅助判断等优化改进方法,可以大大节约现有脉冲卷积神经网络算法的计算时间,并提高图像分类的准确率,增加脉冲卷积神经网络算法对偏置和BN层的功能支持,并调整输入方式增加兼容性。
根据本发明的另一方面,提出了一种脉冲卷积神经网络运算装置,在存算一体单元上实现脉冲卷积神经网络时,通过将代表卷积神经网络中真实值的多位数字信号转换成时间序列脉冲信号的形式,用电流积分比较电路代替模数转换器,从而大大减小了面积和功耗。而且卷积层和全连接层的映射方式是 完全展开,即每一层的所有输出结果同时计算完成,并与作为下一层的输入连接到下一层,且每一层卷积层/全连接层的权值系数均保存在存算一体单元中,从而运算过程中没有数据需要缓存,整个系统的计算速度显著加快。
但是对于大规模的脉冲卷积神经网络,所需的存算一体单元与输入图像尺寸的平方和卷积层通道数成正比,需要占用大量面积。并且这个方案的计算速度非常快,在图像很大的情况下,远远超过了输入图像数据的传输速度,也就是说,会因为数据传输速度跟不上而导致计算速度受限。
鉴于以上,根据本发明的又一方面,提出了一种带有存储器的脉冲卷积神经网络运算装置,通过将脉冲信号按照固定时长进行打包并加入片上或者片外的存储器保存中间数据,大大缩减了所需的存算一体单元数目,从而减小面积和功耗。虽然这样的方法会使得理论上的计算速度比起不用保存中间数据的方案下降很多,但是实际上因为数据传输速度的瓶颈限制,最终的速度也在可以接受的范围内。
根据本发明的一个方面,提供一种脉冲卷积神经网络算法,基于存算一体单元,所述存算一体单元包括至少一个存输入端,至少一个算输入端以及一个输出端,其特征在于:1)将脉冲卷积神经网络的第一层的权值复制至若干份,份数至少为用于表征待分析物属性的量所转换成的二进制数的位数以及所述存算一体单元的存输入端的最小值,并且将复制后的所述份数的权值进行处理,使复制后的各个权值在数值上依次缩小两倍,所得数值被分别输入到多个所述存算一体单元的存输入端,所述存算一体单元的个数与所述份数相同;2)将所选的、集中用于表征待分析物属性的量转换成二进制数,并将待输入的所述二进制数的每一位数值,或者根据系统位宽截位后的数值作为输入脉冲,输入到所述脉冲卷积神经网络的存算一体计算单元中;并且,对于每个表征待分析物属性的输入集合,在对应于所述输入集合的时间周期内,使所述输入脉冲保持不变并不间断地输入到所述脉冲卷积神经网络中相应的计算单元,直到完成对该被分析物的所述属性的处理;3)对于用于表征待分析物属性的、对应于所述一个组中的每个基本的二进制数,使所述每个存输入端的输入量,分别与一个算输入端的输入量相对应,并且绝对值较大的存输入端的输入量与较高位的算输入端的输入量一一对应;4)在每个所述存算一体单元中,使所述存输入端的量与所述算输入端的量进行运算,输出端得到的电流值代表所 述存算一体单元的存输入端的值与算输入端的值进行乘法运算的结果。
此外,根据本发明的一个实施例,所述脉冲卷积神经网络算法,其特征还在于:1)包括所述第一层的运算以及其它层的运算,并且在其中的任意层,在所述存输入端与所述算输入端的运算以外,再加一个运算累加项,所述运算累加项为一个经过修正的偏置值,所述经过修正的偏置值正比于其原始值再除以该层之前所有层的正阈值的累乘,所述正比的比例与该偏置所在的层以及之前的层的权值缩放比例有关;2)所述脉冲卷积神经网络算法,对所述存算一体单元的输出持续地进行累加,当所述累加和超过一个设定的正阈值后,对所述累加和进行清零,并且向下一层相应位置的算输入端释放一个输出脉冲;并且当所述累加和小于一个设定的负阈值之后,使该累加和保持在该负阈值上。
此外,根据本发明的一个实施例,所述脉冲卷积神经网络中包括批标准化层,对该批标准化层之前的一个卷积层或全连接层中的权值和偏置进行线性变换,其中所述线性变换中的参数由前面的训练过程中得到。
此外,根据本发明的一个实施例,其中用多个计数器对所述脉冲卷积神经网络最后一个全连接层中每个神经元的脉冲个数以及最早出现脉冲的时间进行统计,所述计数器个数为所述神经元的数目或其两倍。
此外,根据本发明的一个实施例,如果所述多个计数器中至少两个计数器计数结果均为相同的最大值,则选取最早接收到脉冲的计数器所对应的类别值为最终结果。
此外,根据本发明的一个实施例,计数器显著地多,则输出终止运算,将最终的分类结果作为所述多个计数器计数结果的最大值所对应的类别值进行输出。
此外,根据本发明的一个实施例,在所述第一层的运算之后,还进行平均池化、最大池化、卷积层和全连接层运算中的至少一种。
此外,根据本发明的一个实施例,所述脉冲卷积神经网络算法,其特征还在于:1)设定若干个时钟信号的时长为一个分析周期;2)将待分析的标的物分为若干分区;3)以所述分析周期为时间单位,逐次分析一个分区的时间序列信号,将代表该分区的运算结果送至一个存储器;4)分析下一个分区的信号,将所述代表该分区的运算结果送至所述存储器,直到所完成的多个分区的 信号联合地满足下一层的分析条件;5)将所述存储器存储的各个所述分区的信号送入下一层进行运算。
此外,根据本发明的一个实施例,所述存储器为寄存器、片上缓存、片外存储或者云存储中的至少一种,或者它们的组合。
根据本发明的另一个方面,提供一种基于脉冲卷积神经网络的集成电路,其特征在于,所述集成电路执行如上述的脉冲卷积神经网络算法。
根据本发明的又一个方面,提供一种计算机可读记录介质,其上存储计算机可读指令,当所述计算机可读指令由计算机执行时,使得所述计算机执行脉冲卷积神经网络算法,所述脉冲卷积神经网络算法的特征在于:1)将脉冲卷积神经网络的第一层的权值复制至若干份,份数至少为用于表征待分析物属性的量所转换成的二进制数的位数以及所述存算一体单元的存输入端的最小值,并且将复制后的所述份数的权值进行处理,使复制后的各个权值在数值上依次缩小两倍,所得数值被分别输入到多个所述存算一体单元的存输入端,所述存算一体单元的个数与所述份数相同;2)将所选的、集中用于表征待分析物属性的量转换成二进制数,并将待输入的所述二进制数的每一位数值,或者根据系统位宽截位后的数值作为输入脉冲,输入到所述脉冲卷积神经网络的存算一体计算单元中;并且,对于每个表征待分析物属性的输入集合,在对应于所述输入集合的时间周期内,使所述输入脉冲保持不变并不间断地输入到所述脉冲卷积神经网络中相应的计算单元,直到完成对该被分析物的所述属性的处理;3)对于用于表征待分析物属性的、对应于所述一个组中的每个基本的二进制数,使所述每个存输入端的输入量,分别与一个算输入端的输入量相对应,并且绝对值较大的存输入端的输入量与较高位的算输入端的输入量一一对应;4)在每个所述存算一体单元中,使所述存输入端的量与所述算输入端的量进行运算,输出端得到的电流值代表所述存算一体单元的存输入端的值与算输入端的值进行乘法运算的结果。
此外,根据本发明的一个实施例,所述计算机可读记录介质的特征还在于:1)所述脉冲卷积神经网络算法包括所述第一层的运算以及其它层的运算,并且在其中的任意层,在所述存输入端与所述算输入端的运算以外,再加一个运算累加项,所述运算累加项为一个经过修正的偏置值,所述经过修正的偏置值正比于其原始值再除以该层之前所有层的正阈值的累乘,所述正比的比例 与该偏置所在的层以及之前的层的权值缩放比例有关;2)所述脉冲卷积神经网络算法,对所述存算一体单元的输出持续地进行累加,当所述累加和超过一个设定的正阈值后,对所述累加和进行清零,并且向下一层相应位置的算输入端释放一个输出脉冲;并且当所述累加和小于一个设定的负阈值之后,使该累加和保持在该负阈值上。
此外,根据本发明的一个实施例,所述脉冲卷积神经网络中包括批标准化层,对该批标准化层之前的一个卷积层或全连接层中的权值和偏置进行线性变换,其中所述线性变换中的参数由前面的训练过程中得到。
此外,根据本发明的一个实施例,用多个计数器对所述脉冲卷积神经网络最后一个全连接层中每个神经元的脉冲个数以及最早出现脉冲的时间进行统计,所述计数器个数为所述神经元的数目或其两倍。
此外,根据本发明的一个实施例,如果所述多个计数器中至少两个计数器计数结果均为相同的最大值,则选取最早接收到脉冲的计数器所对应的类别值为最终结果。
此外,根据本发明的一个实施例,在所述多个计数器进行计数的过程中,一个计数器收集的脉冲数比其他计数器显著地多,则输出终止运算,将最终的分类结果作为所述多个计数器计数结果的最大值所对应的类别值进行输出。
此外,根据本发明的一个实施例,在所述第一层的运算之后,还进行平均池化、最大池化、卷积层和全连接层运算中的至少一种。
此外,根据本发明的一个实施例,所述脉冲卷积神经网络算法包括以下:1)设定若干个时钟信号的时长为一个分析周期;2)将待分析的标的物分为若干分区;3)以所述分析周期为时间单位,逐次分析一个分区的、时间序列信号,将代表该分区的运算结果送至一个存储器,已分析的信号可以被后续的信号覆盖;4)分析下一个分区的信号,将所述代表该分区的运算结果送至所述存储器,直到所完成的多个分区的信号联合地满足下一层的分析条件;5)将所述存储器存储的各个所述分区的信号送入下一层进行运算。
此外,根据本发明的一个实施例,所述存储器为寄存器、片上缓存、片外存储或者云存储中的至少一种,或者它们的组合。
根据本发明的又一个方面,提供一种基于脉冲卷积神经网络的集成电路,所述脉冲卷积神经网络包括多层神经元,每层神经元包括多个神经元组件,每 层神经元中的多个神经元彼此不连接,而连接到后层的神经元;至少一个所述神经元组件带有至多一个数字逻辑电路,所述数字逻辑电路被用于操作,所述操作包括数据分发,还可以包括最大池化、时钟同步、以及数据缓存;并且,最后一层的每个神经元组件带有一个计数器组,统计该神经元组件的输出脉冲中具有高电平的脉冲个数;其中,每个神经元包括至少一个存算一体单元和至少一个积分比较电路,所述多个存算一体单元的电流输出端彼此连接,并且集体地连接到所述积分比较电路上;每个所述积分比较电路包括至少一个积分器和至少一个比较器,所述积分器用于累加电流输出端的输出量,所述比较器用于将积分器中被累加的输出量与在先设定的阈值进行比较,并且进行比较器的清零和脉冲输出,所述清零的操作使所述的积分器可以进行下一次的累加操作;并且,每个所述存算一体单元包括至少一个存输入端和至少一个算输入端以及至少一个电流输出端,所述存输入端被设置为接收表征所述上位机所下发的权值的载流子,所述算输入端被设置为接收表征外界或所设定的上层输入脉冲的载流子;所述电流输出端被设置为以电流的形式输出被作为权值的载流子和作为输入脉冲的载流子共同作用后的载流子。
此外,根据本发明的一个实施例,所述存算一体单元为半导体原理的光电计算单元、忆阻器、快闪存储器中的一种。
此外,根据本发明的一个实施例,所述数字逻辑电路被设置为从当前池化层的上一层神经元组件中输出的、数量为池化层尺寸的平方的多个输出信号中,找出最先出现的高电平脉冲信号;并且,所述数字逻辑电路还被设置为包括一个多路选择器的功能器件,使所述高电平脉冲信号经过所述多路选择器后,保持该高电平脉冲信号所对应的通路开启,将所述通路与下一个卷积层或全连接层连通;同时忽略与该高电平脉冲信号所对应的通路相并行的其它通路的信号,或者关闭所述其它通路。
此外,根据本发明的一个实施例,将平均池化运算合并到下一个卷积层或全连接层中进行,包括:1)卷积层或全连接层,所述卷积层或全连接层的每个神经元组件中的存算一体单元数量为该层对应算法的原始尺寸的若干倍,倍数为池化层尺寸的平方,并且所述对应算法中的每一个权值在所述神经元组件中出现若干次,次数为池化层尺寸的平方,2)其中从上一层神经元组件中输出的、待传输到下一个池化层的、数量为池化层尺寸的平方的输出脉冲信 号,直接作为所述卷积层或全连接层中的存算一体单元的算输入量,所述存算一体单元分别与同样的权值对应。
此外,根据本发明的一个实施例,每个所述神经元组件包括一个神经元,并且带有寄存器,所述寄存器用于实现所涉及的数据操作在时间上的同步。
根据本发明的又一个方面,提供一种脉冲卷积神经网络运算装置,用于进行脉冲卷积神经网络运算,包括一个上位机和上述的集成电路;其中,所述上位机被设置为处理并生成第一层的权值,所述生成第一层的权值的过程包括:根据一个训练得出的初始权值经过若干线性变换生成一组权值,该组权值包括多个权值数,其中后一个权值数值为前一个权值数值的1/2;并且,所述上位机将该组权值发送给所述脉冲卷积神经网络的第一层的各个神经元组件中的存算一体单元中的存输入端;并且,所述上位机将初始权值经过若干线性变换后发送给所述第一层之后的其它层的存算一体单元的存输入端中,对于紧接着平均池化层之后的卷积层或全连接层的权值,还根据池化尺寸将权值复制若干份,份数为池化层尺寸的平方。
此外,根据本发明的一个实施例,所述装置被用于按分区来分析标的物,再将各分区的标的物信号合成,构成完整的标的物信息,并且所述脉冲卷积神经网络运算装置还包括存储器,所述存储器用于存储已分步处理过的、代表所述标的物的至少一个分区的信号,并在所有的分区信号处理完以后,将所有的分区信号进行合成,或将所有的分区信号发送至另一个处理器进行合成;所述存储器为寄存器、片上缓存、片外存储或者云存储中的至少一种。
根据本发明的又一个方面,提供一种上述集成电路的制造方法,所述方法包括以下步骤:1)通过热氧化和淀积形成数字逻辑电路、积分比较电路和存算一体单元中晶体管的介质层和栅极;所述晶体管至少包括普通逻辑晶体管,高压晶体管以及浮栅晶体管;2)通过淀积MIM介质层以及淀积金属层,或热氧化和淀积工艺形成积分比较电路中的电容;3)通过离子注入的方式形成数字逻辑电路、积分比较电路和存算一体单元中晶体管的源极和漏极,以及PN结的P级和N级;4)通过金属层工艺、金属层介质工艺以及通孔工艺形成整体电路的金属连线和有源区-金属层以及金属层-金属层通孔;5)通过应用于忆阻器或快闪存储器的工艺,生成一个CMOS工艺的存算一体单元。
本发明的目的至少在于,通过将卷积神经网络中的数据转换为时间脉冲 序列的方式,通过电流积分比较电路代替功耗和面积都很大的模数转换器,大大降低整个系统的面积和功耗。
本发明的另一个目在于,将每一层卷积层/全连接层的输出结果均与下一层卷积层/全连接层直接相连,权值数据可以直接保存在存算一体单元中,整个系统中不需要片上缓存,节省了大量数据搬运的过程,从而加快计算的速度。而对于大规模的网络,本发明提出了一种带有存储器的脉冲卷积神经网络运算装置,每一层卷积层/全连接层的输出结果与下一层卷积层/全连接层直接相连所需要的存算一体单元过多,面积过大,所以通过片上或片外的存储器保存部分数据,用时间换空间的方式将大大减少所需要的硬件资源。
附图说明
图1是根据实施例的计算单元的多功能区框图。
图2是根据实施例的光电计算阵列的结构示意图。
图3是实施例1-1计算单元结构的截面图(a)和立体图(b)。
图4是实施例1-2计算单元结构的截面图(a)和立体图(b)。
图5是实施例1-3计算单元的结构示意图(a)和多功能区示意图(b)。
图6是根据实施例的RRAM器件结构示意图以及其三端概述。
图7是根据实施例的闪存的基本cell单元结构图。
图8是实施例4-1的Spiking-Lenet-5的结构示意图(平均池化)。
图9是实施例4-1的Spiking-Lenet-5的结构示意图(最大池化)。
图10是实施例4-1的由存算一体单元组成的一个神经元示意图。
图11是实施例4-1的整个系统的框图(平均池化)。
图12是实施例4-1的整个系统的框图(最大池化)。
图13是实施例4-1的整个系统的计算流程图(平均池化)。
图14是实施例4-1的整个系统的计算流程图(最大池化)。
图15是实施例4-2的由存算一体单元组成的一个神经元示意图(去除寄存器)。
图16是实施例4-2的整个系统的框图(平均池化、去除寄存器)。
图17是实施例4-2的整个系统的框图(最大池化、去除寄存器)。
图18是实施例4-3的Spiking-Alexnet的结构示意图。
图19是实施例4-3的由存算一体单元组成的一个神经元示意图。
图20是实施例4-3的整个系统的框图。
图21是实施例4-3的整个系统的计算流程图。
图22是实施例4-4的整个系统的框图。
图23是实施例4-5的整个系统的框图。
图24是实施例4-6的整个系统的框图。
图25是实施例5的Alexnet网络结构图。
图26是实施例5的Spiking-Alexnet网络结构图(平均池化)。
图27是实施例5的Spiking-Alexnet网络结构图(最大池化)。
图28是实施例5的神经元的结构图。
具体实施方式
本发明中所述的存算一体单元,并不具体到某一种特定器件,只要存算一体单元中可以保存数据,通过多个存算一体单元组合可以完成向量点乘的运算即可。对于每一个存算一体单元,有存输入端、算输入端和输出端,存输入端的数据可以长时间保存,输出端的值与算输入端和和存输入端的乘积成正比,且多个存算一体单元的输出端可以进行求和。
接下来分别以光电计算单元、忆阻器、快闪存储器为例,描述存算一体单元。
实施例1
光电计算单元中的计算单元为包括三大功能区的多功能区结构,如图1所示,三大功能区为:载流子控制区、耦合区、光生载流子收集区和读出区,具体功能分别如下:
载流子控制区:负责控制并调制光电计算单元内的载流子,并且作为计算单元的电输入端口,输入其中一个运算量作为电输入量;或者只控制并调制计算单元内的载流子,通过其他区域输入电输入量。
耦合区:负责连接光生载流子收集区和读出区,使得光子入射产生的光生载流子作用于光电计算单元内的载流子,形成运算关系。
光生载流子收集区和读出区:其中收集区负责吸收入射的光子并收集产生的光生载流子,并且作为计算单元的光输入端口,输入其中一个运算量作为 光输入量;读出区可以作为计算单元的电输入端口,输入其中一个运算量作为电输入量,并且作为计算单元的输出端口,输出被光输入量和电输入量作用后的载流子作为单元输出量;或者通过其他区域输入电输入量,读出区只作为计算单元的输出端口,输出被光输入量和电输入量作用后的载流子,作为单元输出量。
在上述例子中,因为光输入量实际为存储在半导体器件内的光生载流子,此载流子可以在相对于运算速度较长的时间内(通常为秒级,更长的能到数年)存储在光电计算单元中,因此光输入量即为存算一体单元中的存输入量,光生载流子收集和读出区中的收集区为存算一体单元的存输入端;电输入量不具备长时间保存在单元内的功能,因此点输入量为存算一体单元中的算输入量,光生载流子收集和读出区中的读出区或者载流子控制区为存算一体单元的算输入端,取决于具体工作模式;光电计算单元的最终运算结果在光生载流子收集和读出区中的读出区以电流的形式输出,因此光生载流子收集和读出区中的读出区即为存算一体单元的输出端。
发光单元发出的光作为入射计算单元光生载流子收集和读出区的光子,参与运算。图2是光电计算阵列的结构示意图,其中:1为发光阵列,2为计算阵列。如图2所示,光电计算阵列包括发光阵列1和计算阵列2。发光阵列1由多个发光单元周期性排列组成,计算阵列2由多个计算单元周期性排列组成。
本实施例所述的光电计算单元,有如下三种具体的器件实现形式:
实施例1-1
图3是实施例1-1计算单元结构的截面图(a)和立体图(b)。如图3所示,本实施例的计算单元包括:作为载流子控制区的控制栅极、作为耦合区的电荷耦合层,以及作为光生载流子收集区和读出区的P型衬底,P型衬底中分为左侧收集区和右侧读出区,其中右侧读出区中包括浅槽隔离、通过离子注入形成的N型源端和N型漏端。浅槽隔离位于半导体衬底中部、收集区和读出区的中间,浅槽隔离通过刻蚀并填充入二氧化硅来形成,以用于隔离收集区和读出区的电信号。N型源端位于读出区内靠近底层介质层的一侧,通过离子注入法掺杂而形成。N型漏端位于半导体衬底中靠近底层介质层与N型源端相对的另一侧,同样通过离子注入法进行掺杂法形成。应理解,本文中提及的 左侧、右侧、上方以及下方只代表在通过图中所示视角观察下的相对位置随观察视角变化而变化,并不理解为对具体结构的限制。
在收集区的衬底上施加一个电压范围为负压的脉冲,或在控制栅上施加一个电压范围为正压的脉冲,使得收集区衬底中产生用于光电子收集的耗尽层,并通过右侧读出区读出收集的光电子数量,作为光输入端的输入量。读出时,在控制栅极上施加一正电压,使N型源端和收集区N型漏端间形成导电沟道,再通过在N型源端和N型漏端间施加一个偏置脉冲电压,使得导电沟道内的电子加速形成源漏之间的电流。源漏之间沟道内形成电流的载流子,受到控制栅电压、源漏间电压和收集区收集的光电子数量共同作用,作为被光输入量和电输入量共同作用后的电子,以电流的形式进行输出,其中控制栅电压、源漏间电压可以作为器件的电输入量,光电子数量则为器件的光输入量。
耦合区的电荷耦合层用于连接收集区和读出区,使收集区衬底内耗尽区开始收集光电子以后,收集区衬底表面势就会受到收集的光电子数量影响;通过电荷耦合层的连接,使得读出区半导体衬底表面势受到收集区半导体衬底表面势影响,进而影响读出区源漏间电流大小,从而通过判断读出区源漏间电流来读出收集区收集的光电子数量;
载流子控制区的控制栅,用以在其上施加一个脉冲电压,使得在P型半导体衬底读出区中产生用于激发光电子的耗尽区,同时也可以作为电输入端,输入其中一位运算量。
此外,P型半导体衬底和电荷耦合层之间存在用于隔离的底层介质层;电荷耦合层和控制栅之间亦存在用于隔离的顶层介质层。
实施例1-2
图4是实施例1-2计算单元结构的截面图(a)和立体图(b)。如图4所示,本实施例的计算单元包括:作为载流子控制区的控制栅极、作为耦合区的电荷耦合层,以及作为光生载流子收集区和读出区的P型半导体衬底,其中P型衬底中包含通过离子注入形成的N型源端和漏端。P型半导体衬底可以同时承担感光和读出的工作。N型源端位于读出区内靠近底层介质层的一侧,通过离子注入法掺杂而形成。N型漏端位于半导体衬底中靠近底层介质层与所述N型源端相对的另一侧,同样通过离子注入法进行掺杂法形成。
感光时,在P型半导体衬底上施加一个电压范围为负压的脉冲,同时在 作为载流子控制区的控制栅极上施加一个电压范围为正压的脉冲,使得P型衬底中产生用于光电子收集的耗尽层,产生在耗尽区内的电子在控制栅极和P型衬底两端之间的电场作用下被加速,并在到达获得足够高的能量,穿过P型衬底和电荷耦合层之间的底层介质层势垒,进入电荷耦合层并储存于此,电荷耦合层中的电荷数量,会影响器件开启时的阈值,进而影响读出时的源漏间电流大小;读出时,在控制栅极上施加一脉冲电压,使N型源端和N型漏端间形成导电沟道,再通过在N型源端和N型漏端间施加一个脉冲电压,使得导电沟道内的电子加速形成源漏之间的电流。源漏之间的电流受到控制栅脉冲电压、源漏间电压和电荷耦合层中存储的电子数量共同作用,作为被光输入量和电输入量共同作用后的电子,以电流的形式进行输出,其中控制栅电压、源漏间电压可以作为器件的电输入量,电荷耦合层中存储的光电子数量则为器件的光输入量。
耦合区的电荷耦合层用于储存进入其中的光电子,并改变读出时器件阈值大小,进而影响读出区源漏间电流,从而通过判断读出区源漏间电流来读出感光时产生并且进入电荷耦合层中的光电子数量。
载流子控制区的控制栅,用以在其上施加一个脉冲电压,使得在P型半导体衬底读出区中产生用于激发光电子的耗尽区,同时也可以作为电输入端,输入其中一位运算量。
此外,P型半导体衬底和电荷耦合层之间存在一层用于隔离的底层介质层;电荷耦合层和控制栅之间亦存在一层用于隔离的顶层介质层。
实施例1-3
图5是实施例1-3计算单元的结构示意图(a)和多功能区示意图(b)。如图5所示,本实施例的计算单元包括:作为光生载流子收集和读出区的光电二极管和读出管,其中,光电二极管通过离子掺杂形成,负责感光。光电二极管的N区通过作为耦合区的光电子耦合引线连接到读出管的控制栅和复位管的源端上,读出管的漏端施加一正电压脉冲,作为读出电流的驱动电压;曝光前,复位管打开,复位管漏端电压施加到光电二极管上,使作为收集区的光电二极管处于反偏状态,产生耗尽层;曝光时,复位管关断,光电二极管被电学上隔离,光子入射光电二极管耗尽区后产生光电子,并在二极管中积累,二极管的N区和在电学上通过作为耦合区的光电子耦合引线和N区连接的读出 管控制栅电势开始下降,进而影响读出管沟道内的电子浓度。读出管负责读出,其漏端施加一正脉冲电压,源端和选址管漏端连接,读出时,打开选址管,读出管中产生电流,电流大小受到复位管漏端电压、读出管漏端电压和入射光子数共同影响,读出管沟道内的电子,作为被光输入量和电输入量共同作用后的电子,以电流的形式输出,其中复位管漏端电压、读出管漏端电压可以作为器件的电输入量,电入射光子数则为器件的光输入量。
耦合区的光电子耦合引线用于连接作为光生载流子收集和读出区中收集区的光电二极管和作为读出区的读出管,将光电二极管N区电势施加到读出管控制栅上。
作为载流子控制区的复位管,通过其漏端输入一个正电压作用于光电二极管,当复位管打开时,正电压即会作用在光电二极管上,使光电二极管产生耗尽区并感光,同时也可以作为电输入端,输入其中一位运算量。
此外,选址管用于控制整个运算器件作为输出量的输出电流的输出,可以在光电计算单元组成阵列时行列选址使用。
实施例2
忆阻器(RRAM)全称为记忆电阻器,该器件可以概括为可在“高阻状态”和“低阻状态”之间切换,并可将电阻值长时间存储的一种特殊的非易失性(NVM)存储器件。
图6是RRAM器件结构示意图以及其三端概述。如图6所示,通常该器件由两层金属电极中间夹着可以行成导电通孔的特殊通孔层组,通孔层多由金属氧化物组成,常见的有如WO x,TaO x等。成当RRAM器件处于初始模式时,器件处于高阻态,当有较大偏压加在器件两端时,器件进入编程状态,特殊通孔层中形成导电通道,并在电压降低后继续维持此导电通道的存在并存储当前电阻值,直到施加一较大负偏压后器件进入擦除状态,导电通道管段,使得器件重新回到初始高阻态。
使用RRAM器件作为存算一体器件,因为其具有长时间存储电阻值的功能,因此其存输入端即为处于编程状态时的器件两端;电阻输入完成后器件即处于低阻态并可以在一定电压范围内当作线性电阻使用,利用此线性电阻的范围即可完成存算一体单元所需的运算,因此其算输入端即为处于线性电阻范围内的器件两端;当有线性电阻范围内的偏压加在器件两端时,电流即从 RRAM的一端流到另一端,因此此时器件电流流出的一端即为存算一体器件中的输出端。
因为RRAM通常为两端器件,因此其存输入端,算输入端和输出端通常为不同工作模式下的相同区域。
实施例3
闪存(FLASH)为目前最常见的非易失性(NVM)存储器件,其基本存储单元为浮栅器件,例如和实施例1-2中描述的光电计算单元类似的结构,或如图7所示的结构。
图7是一种闪存的基本cell单元结构图。如图7所示,添加用于擦除和选择的EG和WL。其基本原理为在一正常MOSFET晶体管的沟道和控制栅极之间添加四周被氧化物隔离层包裹的电荷存储层,利用此隔离存储层来存储电荷以存储数据,并通过判断该晶体管的阈值来将存储的电荷量读出。其中所述隔离层可以是使用多晶硅制作的浮栅,如图7中的FG(floating gate),也可以是氮化物层等,电荷存入隔离层多为通过沟道热电子注入(CHE)的机制来实现。
使用flash器件作为存算一体器件,因为存储在被隔离的电荷存储层中的电核可以在长时间内保存在器件当中,因此被存储的电荷量即为存算一体器件中的存输入量,存输入端即为热电子注入端,这一机制通常发生在flash器件P型衬底中的表面沟道的电荷存储层正下方,如图7中的FG(floating gate)正下方;flash器件读出时,MOSFET晶体管的沟道电流受到源漏间电压Vds、控制栅极电压Vgs和电荷存储层中存储的电荷量共同作用,因此算输入端可以为flash器件的控制栅极,如图7中的CG(coupling gate)或WL(word line),或者为源端和漏端;因为最终受电输入量和存输入量共同作用后的数据以电流的形式从flash源漏间流过,因此flash器件作为存算一体器件的输出端为源端和漏端。
实施例4
本实施例使用上述实施例中的任意一种作为存算一体单元,进行脉冲卷积神经网络的计算,有如下具体的实施方式:
实施例4-1
数据集以MNIST为例,数据集大小为10000*28*28,共10000组测试数 据,图像尺寸为28*28,通道数为1,数据为0-1之间的浮点数,分类数目为10。
卷积神经网络以Lenet-5为例,池化层可以是最大池化,也可以是平均池化,具体网络结构见图8和图9。其中图8是Spiking-Lenet-5的结构示意图(平均池化),图9是Spiking-Lenet-5的结构示意图(最大池化)。
具体地,图8和图9中的输入图像大小都是28*28,每一个像素值还需要转化为位宽为width的二进制数。第一层是卷积层,卷积核尺寸为5*5,个数为6个,并且每个权值都需要被复制成比例为1/2的等比数列,一共复制成width个,与同一个像素值的不同位的0/1对应相乘;每个卷积窗口的大小为5*5,因为第一层只有1个通道,如果是多通道的输入,则每一个卷积核也应有多通道,每一个通道里的像素值与卷积核权值对应相乘;关于5*5的卷积窗口,在输入图像上选取5*5个像素值,位置相同的像素值与卷积核权值对应相乘;同一个卷积窗口内所有的乘积累加得到的结果,对应于一个神经元电流积分比较电路中的积分值的增量;将卷积窗口在输入图像上按照固定顺序滑窗,则对应不同的神经元;之后更换不同的卷积核,对应不同的一组神经元。
如图8和图9所示,卷积层1上方的28*28*6,即为卷积层1的神经元总个数,28*28为输出图像的大小(进行卷积运算时,图像边缘在卷积窗口中不足的部分用0进行填补),6与卷积核个数对应,表示输出图像的通道数。
在图8中,该28*28*6的输出图像,直接作为卷积层2的输入,开始进行新的计算。因为这里采用了平均池化的方式,原本28*28的图像应该2*2平均,生成14*14的图像,这里直接将14*14图像中任一个像素点在28*28图像中对应的4个像素点整合在同一个卷积窗口中了,相应地,原本5*5的卷积窗口则变成了10*10,相邻2*2的像素点对应的权值是一样的。卷积计算过程与卷积层1类似。
而在图9中,因为采用了最大池化,所以需要在卷积层2之前加上最大池化层1,功能是4选1,使得卷积层2的输入图像大小为14*14。
对于全连接层,就是直接进行矩阵向量乘的操作,图8中的1600*120是在400*120的基础上因为平均池化复制权值所致。
最后的10个计数器,则分别统计全连接层3的10个神经元的输出脉冲信号中,高电平的个数了。根据系统的具体实现方案,还可以添加10个计数 器,记录每个神经元最早生成高电平的时间。
在上位机中,首先要先训练好卷积神经网络,训练好的卷积神经网络按照如下公式进行计算:
Figure PCTCN2020134558-appb-000001
Figure PCTCN2020134558-appb-000002
其中,I为卷积神经网络某一层的输入,W为权值,B为偏置,O为输出,channel为输入通道数,kernelsize为卷积核尺寸。ii为输出图像的行,jj为输出图像的列,nn为输出图像的通道。
再将得到的每一层的权值W和偏置B以及来自于数据集的输入数据,即第一层的I,进行如下处理:
先考虑来自数据集的输入数据,在存输入端量化位宽和数据集输入位宽之间,选择更小的那个值作为系统位宽width。将数据集中的灰度值按照假数据位宽width进行量化,得到width位的二进制数,不足的位数在高位补零。原来的输入数据即被扩展成width倍的二进制数,即脉冲信号。
再考虑权值和偏置。如果卷积神经网络中有BN层(批标准化,batch normalization),那么在训练的时候,需要导出bn.weight(γ)、bn.bias(β)、bn.running_mean(mean)、bn.running_var(var)和bn.eps(eps,给分母加上的小量,默认为1e-5),其中,bn.weight(γ)表示:训练过程中学习到的缩放系数;bn.bias(β)表示:训练过程中学习到的偏移系数;bn.running_mean(mean)表示:训练过程中得到的,数据的统计平均值;bn.running_var(var)表示:训练过程中得到的,数据的统计方差值。并按照如下公式修改该BN层前一层卷积层或全连接层的权值W和偏置B:
Figure PCTCN2020134558-appb-000003
Figure PCTCN2020134558-appb-000004
这样就完成了卷积层或全连接层与BN层的合并,在进行推断任务的时候,仅需保留修正过的W’和B’的卷积层或全连接层计算即可,无需多余的BN层运算。
还有一些特殊情况,对于第一层,假设输入图像的灰度值本应和某个卷积核中的某个权值W’相乘并按照卷积神经网络的计算公式进行累加,那么将第一层的权值复制为width份,依次保持不变、除以2、除以4等2的指数次幂,将修正后的权值记为W”。该层的偏置应在上述修正B’的基础上再乘以2,记为B”。其中,灰度值量化得到的二进制数,按照高位到低位的顺序,依次与W’、W’/2、W’/4、W’/8……对应起来排序。
如果该卷积神经网络中使用了平均池化层,那么该平均池化层的下一层卷积层或全连接层的中的每个权值W’都将被复制成若干份,该数量为池化层尺寸的平方,比如池化层是2*2的,那么每个权值都被复制成4份,将修正后的权值记为W”。若该层有偏置,则将偏置值在B’的基础上再放大4倍,记为B”。
至此,在卷积神经网络层面上,对于输入、权值和偏置的处理就已经结束了,考虑脉冲卷积神经网络层面。
首先是用户根据实际需要(比如根据实际量化位宽以及该层权值的最大绝对值进行缩放调整,以达到尽可能高的精度),会对每一层的权值进行缩放,令新的权值为W”’。
然后是根据脉冲卷积神经网络的原理,给每层的偏置带来的修正。
对于第一层而言,其计算公式为O”’=I”’*W”’+B”’,此处省略了如上文卷积计算公式中复杂的求和表达式,这里的形式虽然略有不同,但区别仅在于将I改变成了二进制展开的若干输入I”,W”也相应处理成若干倍W”’。由于不论在任何时刻,W”’与B”’的关系都应该能够计算出与卷积神经网络中O”对应的O”’,即如果I”*W”’=A1*(I”*W”),那么B”’=A1*B”,O”’=A1*O”,A1为一个缩放比例。
再考虑第二层,对于第二层而言,其输入I”’是由第一层的O”’按照时间累加,每超过阈值
Figure PCTCN2020134558-appb-000005
后生成一次1,否则为0,假设这个时间为T1,即
Figure PCTCN2020134558-appb-000006
对于I”’,每T1时间内仅包含1次1,其余均为零。假设第二层的I”’*W”’+B”’按照时间累加,每超过阈值
Figure PCTCN2020134558-appb-000007
后生成一次1,否则为0,假设这个时间为T2,即
Figure PCTCN2020134558-appb-000008
Figure PCTCN2020134558-appb-000009
将第一层的公式代入得:
Figure PCTCN2020134558-appb-000010
由于在卷积神经网络中,第二层的输出=O”*W”+B”,其中W”’=A2*W”,O”’=A1*O”,那么
Figure PCTCN2020134558-appb-000011
之后的第n层同理推导可得:
Figure PCTCN2020134558-appb-000012
其中分母的这些超参数vth +均为在上位机上由用户设置的值。
所有的权值和偏置都应在修正过的基础上再按照width的位宽进行二进制量化得到最终写入存算一体单元存输入端的值,记为W””和B””。
上述工作均在上位机中完成,完成后将权值和偏置根据卷积神经网络的计算公式排好顺序,写入存算一体单元中。
存输入端的输入全部完成后,上位机向第一层存算一体单元的算输入端发送输入脉冲,本装置开始进行计算任务。
图10是实施例4-1的由存算一体单元组成的一个神经元示意图。在脉冲卷积神经网络算法中,除了输入与权值之间的对应关系与卷积神经网络算法保持一致,所有的流通数据均为脉冲信号,即0或1,基本的计算单元为存算一体单元,负责乘法。在此基础上,如图10所示,一个神经元包括多个存算一体单元,这些存算一体单元中存输入端的输入对应于人脑中神经元的突触,即W””,算输入端的输入对应于突触连接强度,即I””。此外,神经元中还需有一个胞体,在每一个时钟周期内,负责将这些存算一体单元的输出端结果∑I″″·W″″+1*B″″进行累加,并与该神经元胞体此时的电势v(t-1)进行累加。用公式表示即为:
v(t)=v(t-1)+∑I″″·W″″+1*B″″
输出脉冲的生成公式为:
Figure PCTCN2020134558-appb-000013
输出脉冲生成完之后,神经元电势经过如下变化,这些都在下一个时钟周期到来之前完成:
Figure PCTCN2020134558-appb-000014
其中vth -和vth +均为每层可自行设定的超参数。vth +为正阈值,vth -为负阈值。对于不加偏置的神经网络,负阈值也可以设置为0。
该功能由电流积分比较电路实现,并将输出的结果保存至寄存器中,与时钟上升沿对齐,传送给下一层的神经元。
对于需要加入偏置的某一层卷积层或全连接层,其经过修正的偏置值已经保存在该层每个神经元中的一个存算一体单元中,仅需将该存算一体单元的算输入始终置为1即可。
有的网络在很多卷积层中都会出现需要补零的情况,因为补零的位置是固定的,只要将相应输入一直置0即可。
对于平均池化层,需要在下一个卷积层或全连接层中,将需要平均池化的所有输入直接和相应的神经元连接起来,并将权值复制多份(在上位机中已完成),实现和原来等比例的乘累加。
对于最大池化层,不需要在下一个卷积层或全连接层进行操作,而是需要在这相邻两层卷积层或全连接层中间,加上额外的判断条件,即从计算开始算起,每一个池化窗口所对应的输入信号中,选择最早为1的那一路,与下一层卷积层或全连接层接通,其余的输入信号就可以被忽略了。这里的具体实现方式为一些数字逻辑加上多路选择器。
在全连接层3(即最后一个全连接层)后面,有10个计数器一直在统计这十类接收到的脉冲数目(高电平),并通过控制系统发送给上位机。
方案一:10个计数器的值实时地传送给上位机。
在上位机中,需要进行这样的结束条件判断:当某时刻,有1类计数器中的脉冲数目,要比别的类多a个,a为设定的常数,即认为计算可以结束了,输出脉冲数目最大的该类类别号。建议设置为4。
如果到了设定的最大时长后,还没有满足结束判定条件,就强制结束,找出这10类中脉冲数目最多的那一类。
如果有至少2类中,脉冲数目是一致的,那么就比较谁最先接收到脉冲,输出该类。
该图片计算完成后,上位机发送相应的控制信号给控制系统,将系统中一些需要清零复位的地方进行清零复位,然后再发送下一张图片的输入脉冲信号,开始下一轮计算。
方案二:10个计数器的值没有办法实时地传送给上位机。
在硬件部分中,这10个计数器的值被传输到一个专门的结束条件判断模块,需要实现这样的功能:当某时刻,有1类计数器中的脉冲数目,要比别的类多a个,a为设定的常数(建议设置为4),即认为计算可以结束了。或者到了设定的最大时长后,还没有满足结束判定条件,就强制结束,这里的结束指的是拉高一个输出的结束信号,传输给控制系统和上位机,对硬件部分的相应位置进行复位,上位机向控制系统传输新的图像数据,控制系统保存上位机发送的图像数据,并将接下来需要计算的图像数据分发给存算一体单元。(这里根据实际系统中存储器容量的使用情况,可以有若干种不同的数据传输方案,不做限定)。
除了结束信号外,当结束信号拉高之后,需要将这10个计数器的值通过控制系统传送给上位机。此外,除了这10个计数器,在全连接层3的后面还需设置10个计数器,用来记录这10个神经元最早生成高电平输出的时间,这10个计数器也将被传送给上位机。
在上位机中,需要先在10个统计高电平数目的计数器中找出最大值,若有一样的,则选择最早生成高电平的那一类,作为最终的分类结果。
整个系统的框图见图11和图12。图11是整个系统的框图(平均池化)。图12是整个系统的框图(最大池化)。其中CONV表示卷积层,FC表示全连接层。如结合图8、9所描述的Spiking-Lenet-5脉冲卷积神经网络结构图,图11和图12将每一层都分别用硬件实现,数据在不同的模块中流通;此外,硬件部分还有控制系统,用于从上位机接收输入数据和控制信号,然后分发至Conv1模块中,并从计数器模块中接收统计的结果,再发送给上位机。
整个系统的计算流程图见图13和图14。图13是整个系统的计算流程图 (平均池化)。图14是整个系统的计算流程图(最大池化)。其中CONV表示卷积层,FC表示全连接层。
在对图像进行计算之前,需要先将训练好的权值和偏置,经过修正之后,写入存算一体单元的存输入端。之后对整个硬件加速器的除了存输入端写入的数据之外,所有的模块进行复位操作。接着上位机开始向硬件加速器传输输入数据,控制系统接收到这些数据,等第一幅图的所有输入数据传输完毕后,开始同时对存算一体单元分发数据。由于同一幅图的输入数据在该图像没有计算完毕之前是一直保持不变的,根据系统的具体设计方案,可以等一幅图像算完之后再传输下一张图像的输入数据,或者在第一张图像算完之前,就将下一张或者若干张图像的输入数据保存在硬件加速器中,实现乒乓操作。
对于每一个卷积层或者全连接层模块,存算一体单元接收算输入端的输入信号,所有的存算一体单元的计算结果通过串联的方式将电流相加,输入至电流积分比较电路中,在该电路中经过积分、与阈值比较,然后生成输出脉冲,在紧接着的寄存器中完成与时钟上升沿对齐的操作,得到该层的输出。这些模块都是在同时、一刻不停地进行着独立的运算的。
在图14中,还多了最大池化模块,conv1和conv2的输出先接入最大池化1、2,在最大池化模块中选择高电平最早出现的那一路传输至下一层。
关于计数器组,方案一:计数器组统计最后一层全连接层每个神经元的输出脉冲中,高电平的个数,该结果一直在被控制系统传送回上位机。上位机根据用户设置的条件,判断本张图片的计算是否完成,如果没有完成则继续保持现状,如果完成了,就改变控制信号,将硬件加速器中,电流积分比较电路以及系统中的其它寄存器和计数器进行复位,并开始传输新的图片。
方案二:计数器组统计最后一层全连接层每个神经元的输出脉冲中高电平的个数,以及每个神经元最早生成高电平脉冲的时间,该结果仅在本轮计算结束后才被控制系统传送回上位机。本轮计算结束的判断由数字逻辑根据计数器组统计的结果来完成,如果没有结束则继续保持现状,如果结束了,就向控制系统发送拉高的结束信号,将硬件加速器中,电流积分比较电路以及系统中的其它寄存器和计数器进行复位,存算一体单元等待控制系统分发下一张图像的数据。上位机接收到该结束信号后,向控制系统传送新的图片,并对传回的计数器组的数据进行处理,得到最终的分类结果。
根据上述实施例,通过将卷积神经网络中的数据转换为时间脉冲序列的方式,通过电流积分比较电路代替功耗和面积都很大的模数转换器,大大降低整个系统的面积和功耗。另外,将每一层卷积层/全连接层的输出结果均与下一层卷积层/全连接层直接相连,权值数据可以直接保存在存算一体单元中,整个系统中不需要片上缓存,节省了大量数据搬运的过程,从而加快计算的速度。
实施例4-2
本实施例在实施例4-1的基础上,在积分比较电路中,加入积分比较电路的输出结果与时钟信号同步的功能,实施例4-1中每一个与积分比较电路相连的寄存器被除去,积分比较电路的输出直接被接入下一层神经元、最大池化模块或者计数器。除去寄存器的神经元示意图见图15,整个系统(去除寄存器、平均池化)的框图见图16,整个系统(去除寄存器、最大池化)的框图见图17。
实施例4-3
数据集以Cifar-10为例,数据集大小为10000*32*32*3,共10000组测试数据,图像尺寸为32*32,通道数为3,数据为0-255的整数,分类数目为10。
卷积神经网络以Alexnet为例,这里采用的模型有所变动,在第一层和第二层卷积层后面紧跟着加BN层,池化层改为平均池化,且所有卷积层的卷积核大小均为3*3,具体网络结构见图18。
具体地,图18中的输入图像大小是32*32,通道数为3,每一个像素值需要转化为位宽为width的二进制数。第一层是卷积层,卷积核尺寸为3*3,通道数为3,个数为96个,并且每个权值都需要被复制成比例为1/2的等比数列,一共复制成width个,与同一个像素值的不同位的0/1对应相乘;3个通道的卷积核与输入图像对应;卷积窗口大小为3*3,在输入图像上选取3*3个像素值,位置相同的像素值与卷积核权值对应相乘;同一个卷积窗口内所有的乘积累加得到的结果,对应于一个神经元电流积分比较电路中的积分值的增量;将卷积窗口在输入图像上按照固定顺序滑窗,则对应不同的神经元;之后更换不同的卷积核,对应不同的一组神经元。
如图18所示,卷积层1上方的32*32*96,即为卷积层1的神经元总个数,32*32为输出图像的大小(进行卷积运算时,图像边缘在卷积窗口中不足 的部分用0进行填补),96与卷积核个数对应,表示输出图像的通道数。该3输出图像,直接作为卷积层2的输入,开始进行新的计算。因为这里采用了平均池化的方式,原本32*32的图像应该2*2平均,生成16*16的图像,这里直接将16*16图像中任一个像素点在32*32图像中对应的4个像素点整合在同一个卷积窗口中了,相应地,原本3*3的卷积窗口则变成了6*6,相邻2*2的像素点对应的权值是一样的。卷积计算过程与卷积层1类似。其它卷积层同理。
对于全连接层,就是直接进行矩阵向量乘的操作,图18中的16384*1024是在4096*1024的基础上因为平均池化复制权值所致。
最后的10个计数器,则分别统计全连接层3的10个神经元的输出脉冲信号中,高电平的个数了。根据系统的具体实现方案,还可以添加10个计数器,记录每个神经元最早生成高电平的时间。
在上位机中,首先要先训练好卷积神经网络,训练好的卷积神经网络按照如下公式进行计算:
Figure PCTCN2020134558-appb-000015
Figure PCTCN2020134558-appb-000016
其中,I为卷积神经网络某一层的输入,W为权值,B为偏置,O为输出,channel为输入通道数,kernelsize为卷积核尺寸。ii为输出图像的行,jj为输出图像的列,nn为输出图像的通道。
再将得到的每一层的权值W和偏置B以及来自于数据集的输入数据,即第一层的I,进行如下处理:
先考虑来自数据集的输入数据,在存输入端量化位宽和数据集输入位宽之间,选择更小的那个值作为系统位宽width。将数据集中的RGB值按照假数据位宽width进行量化,得到width位的二进制数,不足的位数在高位补零。 原来的输入数据即被扩展成width倍的二进制数,即脉冲信号。
再考虑权值和偏置。如果卷积神经网络中有BN层(批标准化,batch normalization),那么在训练的时候,需要导出bn.weight(γ)、bn.bias(β)、bn.running_mean(mean)、bn.running_var(var)和bn.eps(eps,给分母加上的小量,默认为1e-5),其中,bn.weight(γ)表示:训练过程中学习到的缩放系数;bn.bias(β)表示:训练过程中学习到的偏移系数;bn.running_mean(mean)表示:训练过程中得到的,数据的统计平均值;bn.running_var(var)表示:训练过程中得到的,数据的统计方差值。并按照如下公式修改该BN层前一层卷积层或全连接层的权值W和偏置B:
Figure PCTCN2020134558-appb-000017
Figure PCTCN2020134558-appb-000018
这样就完成了卷积层或全连接层与BN层的合并,在进行推断任务的时候,仅需保留修正过的W’和B’的卷积层或全连接层计算即可,无需多余的BN层运算。
还有一些特殊情况,对于第一层,假设输入图像的RGB值本应和某个卷积核中的某个权值W’相乘并按照卷积神经网络的计算公式进行累加,那么将第一层的权值复制为width份,依次保持不变、除以2、除以4等2的指数次幂,将修正后的权值记为W”。该层的偏置应在上述修正B’的基础上再乘以2,记为B”。其中,RGB值量化得到的二进制数,按照高位到低位的顺序,依次与W’、W’/2、W’/4、W’/8……对应起来排序。
如果该卷积神经网络中使用了平均池化层,那么该平均池化层的下一层卷积层或全连接层的中的每个权值W’都将被复制成若干份,该数量为池化层尺寸的平方,比如池化层是2*2的,那么每个权值都被复制成4份,将修正后的权值记为W”。若该层有偏置,则将偏置值在B’的基础上再放大4倍,记为B”。
至此,在卷积神经网络层面上,对于输入、权值和偏置的处理就已经结束了,考虑脉冲卷积神经网络层面。
首先是用户根据实际需要(比如根据实际量化位宽以及该层权值的最大 绝对值进行缩放调整,以达到尽可能高的精度),会对每一层的权值进行缩放,令新的权值为W”’。
然后是根据脉冲卷积神经网络的原理,给每层的偏置带来的修正。
对于第一层而言,其计算公式为O”’=I”’*W”’+B”’,此处省略了如上文卷积计算公式中复杂的求和表达式,这里的形式虽然略有不同,但区别仅在于将I改变成了二进制展开的若干输入I”,W”也相应处理成若干倍W”’。由于不论在任何时刻,W”’与B”’的关系都应该能够计算出与卷积神经网络中O”对应的O”’,即如果I”*W”’=A1*(I”*W”),那么B”’=A1*B”,O”’=A1*O”,A1为一个缩放比例。
再考虑第二层,对于第二层而言,其输入I”’是由第一层的O”’按照时间累加,每超过阈值
Figure PCTCN2020134558-appb-000019
后生成一次1,否则为0,假设这个时间为T1,即
Figure PCTCN2020134558-appb-000020
对于I”’,每T1时间内仅包含1次1,其余均为零。假设第二层的I”’*W”’+B”’按照时间累加,每超过阈值
Figure PCTCN2020134558-appb-000021
后生成一次1,否则为0,假设这个时间为T2,即
Figure PCTCN2020134558-appb-000022
Figure PCTCN2020134558-appb-000023
将第一层的公式代入得:
Figure PCTCN2020134558-appb-000024
由于在卷积神经网络中,第二层的输出=O”*W”+B”,其中W”’=A2*W”,O”’=A1*O”,那么
Figure PCTCN2020134558-appb-000025
之后的第n层同理推导可得:
Figure PCTCN2020134558-appb-000026
其中分母的这些超参数vth +均为在上位机上由用户设置的值。
所有的权值和偏置都应在修正过的基础上再按照width的位宽进行二进制量化得到最终写入存算一体单元存输入端的值,记为W””和B””。
上述工作均在上位机中完成,完成后将权值和偏置根据卷积神经网络的计算公式排好顺序,写入存算一体单元中。
存输入端的输入全部完成后,上位机向第一层存算一体单元的算输入端发送输入脉冲,本装置开始进行计算任务。
图19是实施例4-2的由存算一体单元组成的一个神经元示意图。在脉冲 卷积神经网络算法中,除了输入与权值之间的对应关系与卷积神经网络算法保持一致,所有的流通数据均为脉冲信号,即0或1,基本的计算单元为存算一体单元,负责乘法。在此基础上,如图19所示,一个神经元包括多个存算一体单元,这些存算一体单元中存输入端的输入对应于人脑中神经元的突触,即W””,算输入端的输入对应于突触连接强度,即I””。此外,神经元中还需有一个胞体,在每一个时钟周期内,负责将这些存算一体单元的输出端结果∑I″″·W″″+1*B″″进行累加,并与该神经元胞体此时的电势v(t-1)进行累加。用公式表示即为:
v(t)=v(t-1)+∑I″″·W″″+1*B″″
输出脉冲的生成公式为:
Figure PCTCN2020134558-appb-000027
输出脉冲生成完之后,神经元电势经过如下变化,这些都在下一个时钟周期到来之前完成:
Figure PCTCN2020134558-appb-000028
其中vth -和vth +均为每层可自行设定的超参数。vth +为正阈值,vth -为负阈值。对于不加偏置的神经网络,负阈值也可以设置为0。
该功能由电流积分比较电路实现,并将输出的结果保存片上缓存中,对于每一个神经元,需要收集固定时长的输出脉冲信号,作为一个数据包,该固定时长与输入脉冲的发送时长一致。当该输出结果在下一层的计算中所需要的所有数据包均缓存完毕后,就会以数据包的形式被传送给下一层神经元,并且该电流积分比较电路中的累加值会被清零。片上缓存的容量以及所需神经元的个数需要根据实际情况,综合面积、功耗、速度和各层计算速度的平衡这几个方面来考虑。
对于需要加入偏置的某一层卷积层或全连接层,其经过修正的偏置值已经保存在该层每个神经元中的一个存算一体单元中,仅需将该存算一体单元的算输入始终置为1即可。
有的网络在很多卷积层中都会出现需要补零的情况,因为补零的位置是固定的,只要将相应输入在其发送时长内一直置0即可。
对于平均池化层,需要在下一个卷积层或全连接层中,将需要平均池化的所有输入直接和相应的神经元连接起来,并将权值复制多份(在上位机中已完成),实现和原来等比例的乘累加。
在全连接层3后面,有10个计数器一直在统计这十类接收到的脉冲数目,并通过控制系统发送给上位机。这里还需要额外的逻辑电路来判断什么时候有脉冲,因为最后一层的运行时间相比于前面的层来说很短,并不是一直在运行。根据系统的具体实现方案,还可以添加10个计数器,记录每个神经元最早生成高电平的时间。
在上位机中,需要对固定时长内接收到的10个脉冲数目进行比较,选其中的最大值,如果添加了10个计数器,记录每个神经元最早生成高电平的时间,那么还可以进行辅助比较:如果有至少2类中,脉冲数目是一致的,那么就比较谁最先接收到脉冲,输出该类。
该图片计算完成后,上位机发送相应的控制信号给控制系统,将系统中一些需要清零复位的地方进行清零复位,然后再发送下一张图片的输入脉冲信号,开始下一轮计算。
整个系统的框图见图20。如结合图18所描述的Spiking-Alexnet脉冲卷积神经网络结构图,图20中将每一层都分别用硬件实现,数据在不同的模块中流通;此外,硬件部分还有控制系统,用于从上位机接收输入数据和控制信号,将输入数据写入片上缓存,并从计数器模块中接收统计的结果,再发送给上位机。特别地,整个系统中还有片上缓存和与其相对应的逻辑电路,逻辑电路接收控制系统的控制信号,根据数据分发计算的顺序,生成片上缓存的控制信号和存数地址,将从Conv1~Conv5接收到的输出脉冲保存至片上缓存;根据实际缓存容量,使缓存中不再会被使用到的数据被新的数据覆盖;根据数据分发计算的顺序,生成片上缓存的控制信号和读数地址,将Conv1~Conv5以及FC1计算所需的输入数据读出。
整个系统的计算流程图见图21。如图21所示,在对图像进行计算之前,需要先将训练好的权值和偏置,经过修正之后,写入存算一体单元的存输入端。之后对整个硬件加速器的除了存输入端写入的数据之外,所有的模块进行 复位操作。接着上位机开始向硬件加速器传输输入数据,控制系统接收到这些数据,并写入片上缓存中,等Conv1模块进行一次计算所需的所有输入数据传输完毕后,片上缓存开始对Conv1模块的存算一体单元分发数据,全部分发完毕后,Conv1模块开始计算。此时上位机的传输速度,应与整个系统各个模块的计算速度以及片上缓存容量相结合考虑。但应确保固定时长T后,Conv1模块下一次计算所需的数据在T时间内已经保存在片上缓存上了。
对于每一个卷积层或者全连接层模块,存算一体单元接收算输入端的输入信号,所有的存算一体单元的计算结果通过串联的方式将电流相加,输入至电流积分比较电路中,在该电路中经过积分、与阈值比较,然后生成输出脉冲,在紧接着的寄存器中完成与时钟上升沿对齐的操作,得到该层的输出。这些模块都是在同时进行着独立的运算的。每个模块的连续工作时间以固定时长T为单位,在T个时钟周期内,存算一体单元及电流积分比较电路都在连续不间断地进行计算。该T个时钟周期结束后,电流积分比较电路收到来自控制系统的控制信号,进行复位归零操作,等待下一次计算的开始。
对于每一个卷积或者全连接1模块,算输入端的输入信号都来自于片上缓存,即对于这些模块的每一次时长为T个时钟周期的计算,开始的前提条件是该次计算所需要的全部输入已经由片上缓存读取出来了。对于全连接2和全连接3模块,其输入信号来自于上一个全连接层的输出信号。对于每一个卷积模块,其每一个神经元的输出信号,都会按照以T为单位大小进行打包,存储在片上缓存中。
计数器组统计最后一层全连接层每个神经元的输出脉冲中,高电平的个数,当固定时长的计算时间结束后,该结果被控制系统传送回上位机。上位机改变控制信号,将硬件加速器中,电流积分比较电路及其寄存器和计数器进行复位,并开始传输新的图片。根据系统的具体实现方案,还可以添加10个计数器,记录每个神经元最早生成高电平的时间,用于分类结果的辅助判断。
根据上述实施例,对于大规模的网络,通过利用片上缓存保存部分数据,用时间换空间的方式将大大减少所需要的硬件资源。
实施例4-4
本实施例在实施例4-3的基础上,将片上缓存改为寄存器,逻辑控制电路也要相应修改,因为寄存器的定位方式与片上缓存不同。系统框图见图22。
实施例4-5
本实施例在实施例4-3的基础上,将片上缓存改为片外的存储器,对于硬件加速器部分,就仅包含每一层的存算一体单元和计数器部分,实施例4-3中的片上缓存及其逻辑控制电路被移到了片外,由FPGA开发板(现场可编程门阵列)和DDR(双倍速率同步动态随机存储器)代替其功能。系统框图见图23。
实施例4-6
本实施例在实施例4-3的基础上,将片上缓存改为片外的云存储,对于硬件加速器部分,就仅包含每一层的存算一体单元和计数器部分,实施例4-3中的片上缓存及其逻辑控制电路被移到了片外,由上位机和云存储代替其功能。系统框图见图24。
实施例5
本实施例使用上述实施例中的任意一种作为存算一体单元,进行脉冲卷积神经网络的计算,有如下具体的实施方式:
图25是实施例5的Alexnet网络结构图。如图25所示,数据集以cifar-10为例,数据集大小为10000*32*32*3,其中共10000组测试数据,输入图像尺寸为32*32,通道数为3,数据为0-255之间的整数,分类数目为10。
卷积神经网络以Alexnet为例,这里采用的模型有所变动,在第一层和第二层卷积层后面紧跟着加BN层,池化层可以是最大池化,也可以是平均池化,且所有卷积层的卷积核大小均为3*3。
每一层卷积层的输出按照如下公式获得:
Figure PCTCN2020134558-appb-000029
其中I为该层的输入,W为权值,B为偏置,O为输出,channel为输入通道数,kernelsize为卷积核尺寸,此处均为3。
每一层全连接层的输出按照如下公式获得:
Figure PCTCN2020134558-appb-000030
其中I为该层的输入,W为权值,B为偏置,O为输出,channel为输入通道数。
现在上述卷积神经网络的基础上,生成脉冲卷积神经网络。该脉冲卷积神经网络的最基本计算单元为存算一体单元,负责完成乘法。图28是实施例5的神经元的结构图。如图28所示,一个神经元包括多个存算一体单元,所述存算一体单元与在上述实施例4中的描述类似,在此不再赘述。
对于需要加入BN层的某一层Layer M,在训练的时候,需要导出bn.weight(γ)、bn.bias(β)、bn.running_mean(mean)、bn.running_var(var)和bn.eps(eps,给分母加上的小量,默认为1e-5),并按照如下公式修改Layer M的权值和偏置:
W′ = W·γ/√(var + eps)
B′ = (B − mean)·γ/√(var + eps) + β
This completes the merging of Layer M (a convolutional or fully connected layer) with the BN layer; at inference time only Layer M with the corrected W and B needs to be kept, with no extra BN-layer computation.
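The merging can be written compactly; a minimal sketch, assuming per-output-channel BN parameters exported as above (the function name fold_bn is ours):

    import numpy as np

    def fold_bn(W, B, gamma, beta, mean, var, eps=1e-5):
        # W' = W * gamma / sqrt(var + eps)
        # B' = (B - mean) * gamma / sqrt(var + eps) + beta
        scale = gamma / np.sqrt(var + eps)   # one factor per output channel
        return W * scale, (B - mean) * scale + beta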
Consider the input data of the first layer of the spiking convolutional neural network, and suppose a pixel value of 64, which is 01000000 in binary; values 0–255 can be represented by 8 binary bits, with missing bits zero-padded at the high end. Suppose this pixel value was originally to be multiplied by some weight W in some kernel. Then the original 64*W must be converted to 128*(0*W + 1*W/2 + 0*W/4 + 0*W/8 + 0*W/16 + 0*W/32 + 0*W/64 + 0*W/128); that is, the input is expanded to 8 times its original width, and the weight is first copied 7 more times, the copies divided by successive powers of 2, and the results accumulated. This input remains unchanged until the whole round of computation is finished; one round corresponds to one group of test data in the dataset, i.e. one 32*32*3 image for Cifar-10. Furthermore, if this layer has a bias and the layer's weights keep their original proportion, the bias value must be multiplied by a further 2 on top of the correction formulas above; if the layer's weights have been rescaled as a whole, the bias must be rescaled by the same factor.
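A small sketch of this input expansion (Python; the helper names are ours). The n_bits parameter also covers the quantized case discussed below: with WW < 8, passing n_bits=WW keeps only the most significant input bits and the weight copies of largest absolute value:

    def encode_pixel(pixel, n_bits=8):
        # one input line per bit, MSB first, held constant for the whole round
        return [(pixel >> (n_bits - 1 - k)) & 1 for k in range(n_bits)]

    def replicated_weights(w, n_bits=8):
        # one weight becomes n_bits copies: w, w/2, w/4, ..., w/2**(n_bits-1)
        return [w / 2 ** k for k in range(n_bits)]

    # equivalence check for the example above: pixel 64, weight w
    w, pixel = 0.3, 64
    acc = sum(b * wk for b, wk in zip(encode_pixel(pixel), replicated_weights(w)))
    assert abs(128 * acc - pixel * w) < 1e-12   # 128*(0*w + 1*w/2 + ...) == 64*w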
In existing spiking convolutional neural network algorithms, by contrast, the input spikes are generated from random numbers: a random decimal between 0 and 1 is compared with pixel/255, and a spike is generated if and only if the random number is smaller. Because of its large randomness, that method approaches the original pixel value only after a large number of spikes have been computed. In the algorithm of the present invention, the input spikes are exactly equivalent to the original pixel values, and no large number of spikes is needed.
In addition, depending on practical requirements, the weights may have to be quantized. Suppose the weights can be represented by at most WW binary bits. If WW is not less than 8 (the 8 bits needed for 0–255; for input data of larger range it is not 8), the input data and weight copies are handled as described above. If WW is less than 8, however, the weight copies divided by the larger powers of 2 may become exactly 0, so the corresponding inputs have no effect on them and can simply be omitted: the input takes the WW bits starting from the most significant bit, and only the WW weight copies of largest absolute value are kept.
For the bias correction of each layer, besides the correction caused by the thresholds, the weight scalings of all preceding layers accumulate at that layer. For example, if after the adjustment of the first layer's weights the result finally accumulated in the integration and comparison circuit is twice the theoretical value of the convolutional neural network model, then not only must the first layer's bias also be doubled; at the second layer this factor of 2 still shows up in the frequency of the input spikes, i.e. the I in the second layer's I*W + B is already twice its original value, so B must likewise be doubled. The same reasoning applies to the other layers; in short, the correction principle is that I*W and B must be scaled by the same factor.
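The bookkeeping of these accumulated scale factors can be sketched as follows (illustrative only; corrected_biases is our name, and the separate correction involving division by the preceding layers' positive thresholds is not shown here):

    def corrected_biases(biases, layer_scales):
        # layer_scales[n]: extra factor that layer n's own weight adjustment
        # applies to its accumulated result (e.g. 2 for the first layer in the
        # example above, 1 for a layer whose weights are unscaled)
        corrected, cumulative = [], 1.0
        for b, s in zip(biases, layer_scales):
            cumulative *= s                   # scalings of all earlier layers accumulate
            corrected.append(b * cumulative)  # keep I*W and B at the same scale
        return corrected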
Besides convolutional, fully connected and BN layers, a convolutional neural network also contains pooling layers, of which max pooling and average pooling are the most common. Assuming a 2*2 pooling window, pooling turns the original 4 inputs into 1 output, shrinking the image and reducing the amount of computation. Max pooling outputs the maximum of the 4 inputs; average pooling outputs their mean.
For an average pooling layer, all inputs to be average-pooled are connected directly to the corresponding neurons in the next convolutional or fully connected layer, and the weights are replicated the appropriate number of times, realizing a multiply-accumulate proportional to the original one; this processing is similar to that described in Embodiment 4-2 above and is not repeated here.
Since, in a spiking neural network algorithm, scaling all weights and biases of a layer up or down by the same factor has no effect on the final output, the next convolutional or fully connected layer can directly compute ∑_{2*2} O·W; if that layer has a bias, the bias value is multiplied by a further 4 on top of all the corrections described above before being written into the storage input of the storage-calculation integrated unit.
Compared with adding a dedicated average pooling layer made of neurons, whose threshold-generated spike signals would deviate somewhat from the theoretical result, merging the average pooling layer into the next layer guarantees that the average pooling computation loses no precision. The resulting Spiking-Alexnet structure (average pooling) is shown in Figure 26.
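A minimal sketch of folding a 2*2 average pooling layer into the weight matrix of the following layer (Python with numpy; the function name and the assumption that the input vector lists the four outputs of each pooling window contiguously are ours):

    import numpy as np

    def fold_avg_pool(W, pool=2):
        # each weight row is repeated pool*pool times so that every input of a
        # pooling window connects to the neuron with the same weight; the 1/4
        # factor of the average is dropped (a uniform per-layer scaling), so a
        # bias in this layer must additionally be multiplied by pool*pool
        return np.repeat(W, pool * pool, axis=0)   # (n_in, n_out) -> (4*n_in, n_out)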
For a max pooling layer, no operation is needed in the next convolutional or fully connected layer; instead, an extra decision is inserted between the two adjacent convolutional or fully connected layers: among the input signals belonging to each pooling window, the path that is the first to become 1, counted from the start of the computation, is connected through to the next convolutional or fully connected layer, and the remaining input signals can be ignored. The resulting Spiking-Alexnet structure (max pooling) is shown in Figure 27.
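The earliest-spike selection can be sketched in software as follows (illustrative; in the hardware embodiments this is the digital logic circuit with a multiplexer described later in the claims):

    def maxpool_select(window_trains):
        # window_trains: spike trains (lists of 0/1 over time) of the inputs in
        # one pooling window; the first path to emit a 1 is latched and passed
        # through, and all other paths are ignored afterwards
        selected, out = None, []
        for t in range(len(window_trains[0])):
            if selected is None:
                for idx, train in enumerate(window_trains):
                    if train[t] == 1:
                        selected = idx       # latch the earliest-firing path
                        break
            out.append(window_trains[selected][t] if selected is not None else 0)
        return out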
As shown in the figures, after fully connected layer 3 there are 10 or 20 counters continuously recording the number of spikes (high levels) received and the earliest time a spike (high level) was received. Each counter corresponds to one neuron and thus to one image classification result.
When, at some moment, the spike count of one class exceeds that of every other class by a, where a is a configurable hyperparameter, the computation is considered finished and the label of the class with the largest spike count is output; a setting of a = 4 is recommended.
If the set maximum duration is reached without the termination condition being met, the computation is ended forcibly and the class with the most spikes among the 10 is selected.
If at least two classes have identical spike counts, the class that received a spike first is output.
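Taken together, the decision rules can be collected into a short host-side sketch (the names are ours; the margin default follows the recommendation of a = 4 above, and the earliest-spike times are assumed to come from the optional second bank of counters):

    def should_stop(counts, margin=4):
        # stop early once one class leads every other class by at least `margin`
        ordered = sorted(counts, reverse=True)
        return ordered[0] - ordered[1] >= margin

    def classify(counts, earliest):
        # pick the class with the most spikes; break ties by the earliest spike
        best = max(range(len(counts)), key=lambda c: counts[c])
        tied = [c for c in range(len(counts)) if counts[c] == counts[best]]
        if len(tied) > 1:
            with_spike = [c for c in tied if earliest[c] is not None]
            if with_spike:
                best = min(with_spike, key=lambda c: earliest[c])
        return best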
With the spiking convolutional neural network algorithm of the above embodiments, the optimizations and improvements described — changing the input scheme, merging average pooling layers into the next convolutional or fully connected layer, supporting convolutional and fully connected layers with biases, supporting BN layers in the network, setting a termination condition for the computation, and adding auxiliary judgement of special cases — can greatly reduce the computation time of existing spiking convolutional neural network algorithms and raise the accuracy of image classification. Moreover, the termination of the spiking convolutional neural network algorithm has been considered and the computation duration improved accordingly.
Furthermore, the storage-calculation integrated unit according to the above embodiments of the invention can be implemented in an integrated circuit. A method of manufacturing such an integrated circuit is described next and comprises the following steps:
1) forming, by thermal oxidation and deposition, the dielectric layers and gates of the transistors in the digital logic circuits, the integration comparison circuits and the storage-calculation integrated units, the transistors including ordinary logic transistors, high-voltage transistors, floating-gate transistors and the like;
2) forming the capacitors of the integration comparison circuits by depositing an MIM dielectric layer and a metal layer, or by thermal oxidation and deposition; the capacitors may be MIM capacitors or MOS capacitors;
3) forming, by ion implantation, the sources and drains of the transistors in the digital logic circuits, the integration comparison circuits and the storage-calculation integrated units, as well as the P and N regions of the PN junctions;
4) forming the metal interconnect of the overall circuit and the active-area-to-metal and metal-to-metal vias by means of metal layer, metal-layer dielectric and via processes;
5) producing a storage-calculation integrated unit in a CMOS process by means of the process corresponding to memristors or flash memory.
In the production process of the integrated circuit based on a spiking convolutional neural network, the digital logic circuits and the integration comparison circuits in the neurons can all be produced in a standard CMOS process. The storage-calculation integrated units in the neurons can likewise be produced in a standard CMOS process if optoelectronic computing units or flash memory are used; the standard CMOS production flow for devices based on this process, such as transistors, diodes or capacitors, is not described in detail here. Optoelectronic computing units achieve better device performance when produced with a CIS image sensor process. If memristors are used as the storage-calculation integrated units in the neurons, a special process compatible with that type of memristor is required. The integration of storage-calculation devices made in a special process with digital logic circuits and integration comparison circuits made in a standard CMOS process can be achieved by fabricating the special devices directly on the silicon substrate with the special process, or by wafer-level or off-chip integration, among other approaches; many memristor fabrication processes exist, such as the method of forming high-endurance memristors on a silicon substrate mentioned in Chinese patent CN110098324A.
A person of ordinary skill in the art will appreciate that the devices and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are executed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled persons may implement the described functions differently for each particular application, but such implementations should not be considered to go beyond the scope of this disclosure.
In the several embodiments provided in this disclosure, it should be understood that the disclosed devices and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another device, and some features may be omitted or not executed.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they can be stored on a computer-readable storage medium. On this understanding, the technical solution of this disclosure in essence, or the part contributing to the prior art, or a part of the technical solution, can be embodied in the form of a software product, which is stored on a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or some of the steps of the methods described in the embodiments of this disclosure. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory, random-access memory, a magnetic disk or an optical disc.
The above are only specific embodiments of this disclosure, but the scope of protection of this disclosure is not limited to them; any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed here, and all of these shall be covered by the scope of protection of this disclosure. The scope of protection of this disclosure shall therefore be determined by the scope of the claims.

Claims (29)

  1. A spiking convolutional neural network algorithm based on storage-calculation integrated units, each storage-calculation integrated unit comprising at least one storage input, at least one calculation input and one output, characterized by:
    1) duplicating the weights of the first layer of the spiking convolutional neural network into a number of copies, the number of copies being at least the smaller of the number of bits of the binary number into which a quantity characterizing a property of the object to be analyzed is converted and the number of storage inputs of the storage-calculation integrated units, and processing the duplicated weights so that each successive copy is half the value of the previous one, the resulting values being input respectively to the storage inputs of a plurality of the storage-calculation integrated units, the number of storage-calculation integrated units being equal to the number of copies;
    2) converting the selected quantities in the set used to characterize the property of the object to be analyzed into binary numbers, and inputting each bit value of the binary number to be input, or the value truncated according to the system bit width, as an input spike into the storage-calculation computation units of the spiking convolutional neural network; and, for each input set characterizing a property of the object to be analyzed, keeping the input spikes unchanged and inputting them without interruption into the corresponding computation units of the spiking convolutional neural network during the time period corresponding to that input set, until the processing of that property of the analyzed object is completed;
    3) for each elementary binary number, corresponding to said one group, used to characterize the property of the object to be analyzed, making the input quantity of each storage input correspond respectively to the input quantity of one calculation input, with the storage-input quantities of larger absolute value corresponding one-to-one to the calculation-input quantities of higher bit significance;
    4) in each storage-calculation integrated unit, operating on the quantity at the storage input and the quantity at the calculation input, the current value obtained at the output representing the result of multiplying the value at the storage input of the storage-calculation integrated unit by the value at its calculation input.
  2. The spiking convolutional neural network algorithm of claim 1, characterized in that:
    1) it comprises the operations of the first layer and the operations of other layers, and in any of these layers an additional accumulation term is added beyond the operation between the storage input and the calculation input, the accumulation term being a corrected bias value, the corrected bias value being proportional to its original value divided by the cumulative product of the positive thresholds of all layers preceding that layer, the proportionality factor depending on the weight scaling of the layer containing the bias and of the preceding layers;
    2) the spiking convolutional neural network algorithm continuously accumulates the output of the storage-calculation integrated units; when the accumulated sum exceeds a set positive threshold, the accumulated sum is cleared to zero and an output spike is released to the calculation input at the corresponding position in the next layer; and when the accumulated sum falls below a set negative threshold, the accumulated sum is held at that negative threshold.
  3. The spiking convolutional neural network algorithm of claim 2, characterized in that the spiking convolutional neural network includes a batch normalization layer, and a linear transformation is applied to the weights and bias of a convolutional or fully connected layer preceding that batch normalization layer, the parameters of the linear transformation being obtained from the preceding training process.
  4. The spiking convolutional neural network algorithm of any one of claims 1 to 3, characterized in that a plurality of counters is used to record, for each neuron in the last fully connected layer of the spiking convolutional neural network, the number of spikes and the earliest time a spike occurred, the number of counters being the number of neurons or twice that number.
  5. The spiking convolutional neural network algorithm of claim 4, characterized in that, if the counts of at least two of the plurality of counters are the same maximum value, the class value corresponding to the counter that received a spike earliest is selected as the final result.
  6. The spiking convolutional neural network algorithm of claim 4, characterized in that, if, while the plurality of counters is counting, one counter has collected significantly more spikes than the other counters, the operation is terminated and the class value corresponding to the maximum of the counts of the plurality of counters is output as the final classification result.
  7. The spiking convolutional neural network algorithm of claim 1, characterized in that, after the operation of the first layer, at least one of an average pooling, max pooling, convolutional layer or fully connected layer operation is also performed.
  8. The spiking convolutional neural network algorithm of any one of claims 1 to 3 and 7, characterized by:
    1) setting the duration of a number of clock signals as one analysis period;
    2) dividing the object to be analyzed into a number of partitions;
    3) analyzing, one partition at a time and with the analysis period as the time unit, the time-series signals of a partition, and sending the operation result representing that partition to a memory;
    4) analyzing the signals of the next partition and sending the operation result representing that partition to the memory, until the signals of the completed partitions jointly satisfy the analysis condition of the next layer;
    5) sending the signals of each of the partitions stored in the memory to the next layer for operation.
  9. The spiking convolutional neural network algorithm of claim 8, characterized in that the memory is at least one of a register, an on-chip cache, off-chip storage or cloud storage, or a combination thereof.
  10. An integrated circuit based on a spiking convolutional neural network, characterized in that the integrated circuit executes the spiking convolutional neural network algorithm of any one of claims 1 to 3 and 7.
  11. An integrated circuit based on a spiking convolutional neural network, characterized in that the integrated circuit executes the spiking convolutional neural network algorithm of claim 8.
  12. A computer-readable recording medium storing computer-readable instructions which, when executed by a computer, cause the computer to execute a spiking convolutional neural network algorithm, the spiking convolutional neural network algorithm being characterized by:
    1) duplicating the weights of the first layer of the spiking convolutional neural network into a number of copies, the number of copies being at least the smaller of the number of bits of the binary number into which a quantity characterizing a property of the object to be analyzed is converted and the number of storage inputs of the storage-calculation integrated units, and processing the duplicated weights so that each successive copy is half the value of the previous one, the resulting values being input respectively to the storage inputs of a plurality of the storage-calculation integrated units, the number of storage-calculation integrated units being equal to the number of copies;
    2) converting the selected quantities in the set used to characterize the property of the object to be analyzed into binary numbers, and inputting each bit value of the binary number to be input, or the value truncated according to the system bit width, as an input spike into the storage-calculation computation units of the spiking convolutional neural network; and, for each input set characterizing a property of the object to be analyzed, keeping the input spikes unchanged and inputting them without interruption into the corresponding computation units of the spiking convolutional neural network during the time period corresponding to that input set, until the processing of that property of the analyzed object is completed;
    3) for each elementary binary number, corresponding to said one group, used to characterize the property of the object to be analyzed, making the input quantity of each storage input correspond respectively to the input quantity of one calculation input, with the storage-input quantities of larger absolute value corresponding one-to-one to the calculation-input quantities of higher bit significance;
    4) in each storage-calculation integrated unit, operating on the quantity at the storage input and the quantity at the calculation input, the current value obtained at the output representing the result of multiplying the value at the storage input of the storage-calculation integrated unit by the value at its calculation input.
  13. The computer-readable recording medium of claim 12, characterized in that:
    1) the spiking convolutional neural network algorithm comprises the operations of the first layer and the operations of other layers, and in any of these layers an additional accumulation term is added beyond the operation between the storage input and the calculation input, the accumulation term being a corrected bias value, the corrected bias value being proportional to its original value divided by the cumulative product of the positive thresholds of all layers preceding that layer, the proportionality factor depending on the weight scaling of the layer containing the bias and of the preceding layers;
    2) the spiking convolutional neural network algorithm continuously accumulates the output of the storage-calculation integrated units; when the accumulated sum exceeds a set positive threshold, the accumulated sum is cleared to zero and an output spike is released to the calculation input at the corresponding position in the next layer; and when the accumulated sum falls below a set negative threshold, the accumulated sum is held at that negative threshold.
  14. The computer-readable recording medium of claim 12, characterized in that the spiking convolutional neural network includes a batch normalization layer, and a linear transformation is applied to the weights and bias of a convolutional or fully connected layer preceding that batch normalization layer, the parameters of the linear transformation being obtained from the preceding training process.
  15. The computer-readable recording medium of any one of claims 12 to 14, characterized in that a plurality of counters is used to record, for each neuron in the last fully connected layer of the spiking convolutional neural network, the number of spikes and the earliest time a spike occurred, the number of counters being the number of neurons or twice that number.
  16. The computer-readable recording medium of claim 15, characterized in that, if the counts of at least two of the plurality of counters are the same maximum value, the class value corresponding to the counter that received a spike earliest is selected as the final result.
  17. The computer-readable recording medium of claim 15, characterized in that, if, while the plurality of counters is counting, one counter has collected significantly more spikes than the other counters, the operation is terminated and the class value corresponding to the maximum of the counts of the plurality of counters is output as the final classification result.
  18. The computer-readable recording medium of claim 12, characterized in that, after the operation of the first layer, at least one of an average pooling, max pooling, convolutional layer or fully connected layer operation is also performed.
  19. The computer-readable recording medium of any one of claims 12 to 14 and 18, characterized in that the spiking convolutional neural network algorithm comprises the following:
    1) setting the duration of a number of clock signals as one analysis period;
    2) dividing the object to be analyzed into a number of partitions;
    3) analyzing, one partition at a time and with the analysis period as the time unit, the time-series signals of a partition, and sending the operation result representing that partition to a memory, where signals already analyzed may be overwritten by subsequent signals;
    4) analyzing the signals of the next partition and sending the operation result representing that partition to the memory, until the signals of the completed partitions jointly satisfy the analysis condition of the next layer;
    5) sending the signals of each of the partitions stored in the memory to the next layer for operation.
  20. The computer-readable recording medium of claim 19, characterized in that the memory is at least one of a register, an on-chip cache, off-chip storage or cloud storage, or a combination thereof.
  21. An integrated circuit based on a spiking convolutional neural network, the spiking convolutional neural network comprising multiple layers of neurons, each layer of neurons comprising a plurality of neuron assemblies, the plurality of neurons in each layer being connected not to one another but to the neurons of the following layer;
    at least one of the neuron assemblies carries at most one digital logic circuit, the digital logic circuit being used for operations, the operations including data distribution; and each neuron assembly of the last layer carries a counter bank that counts the number of high-level spikes among the output spikes of that neuron assembly; wherein
    each neuron comprises at least one storage-calculation integrated unit and at least one integration comparison circuit, the current outputs of the plurality of storage-calculation integrated units being connected to one another and collectively connected to the integration comparison circuit;
    each integration comparison circuit comprises at least one integrator and at least one comparator, the integrator being used to accumulate the output quantity of the current outputs, and the comparator being used to compare the output quantity accumulated in the integrator with a previously set threshold and to perform the comparator's clearing and spike output, the clearing operation enabling the integrator to perform the next accumulation operation;
    and each storage-calculation integrated unit comprises at least one storage input, at least one calculation input and at least one current output, the storage input being arranged to receive carriers representing the weights issued by the host computer, and the calculation input being arranged to receive carriers representing external or preset upper-layer input spikes;
    the current output being arranged to output, in the form of a current, the carriers resulting from the joint action of the carriers serving as weights and the carriers serving as input spikes.
  22. The integrated circuit of claim 21, characterized in that the storage-calculation integrated unit is one of a semiconductor-based optoelectronic computing unit, a memristor, or a flash memory.
  23. The integrated circuit of claim 21 or 22, characterized in that the operations of the digital logic circuit further include max pooling, clock synchronization and data caching.
  24. The integrated circuit of claim 21, characterized in that the digital logic circuit is arranged to find, among the output signals output from the neuron assemblies of the layer above the current pooling layer, the number of which equals the square of the pooling layer size, the high-level spike signal that appears first; and
    the digital logic circuit is further arranged to include a functional device comprising a multiplexer, such that after the high-level spike signal passes through the multiplexer, the path corresponding to that high-level spike signal is kept open and connected to the next convolutional or fully connected layer, while the signals of the other paths parallel to the path corresponding to that high-level spike signal are ignored, or those other paths are closed.
  25. The integrated circuit of claim 21, characterized in that the average pooling operation is merged into the next convolutional or fully connected layer, comprising:
    1) a convolutional or fully connected layer, in which the number of storage-calculation integrated units in each neuron assembly is a multiple of the original size of the corresponding algorithm for that layer, the multiple being the square of the pooling layer size, and each weight in the corresponding algorithm appears several times in the neuron assembly, the number of times being the square of the pooling layer size,
    2) wherein the output spike signals from the neuron assemblies of the previous layer that are to be transmitted to the next pooling layer, the number of which equals the square of the pooling layer size, serve directly as the calculation inputs of the storage-calculation integrated units in the convolutional or fully connected layer, the storage-calculation integrated units corresponding respectively to the same weight.
  26. The integrated circuit of claim 21, characterized in that each neuron assembly comprises one neuron and carries registers, the registers being used to synchronize in time the data operations involved.
  27. A spiking convolutional neural network computing device for performing spiking convolutional neural network operations, comprising a host computer and the integrated circuit of claim 21; wherein
    the host computer is arranged to process and generate the weights of the first layer, the generation of the first layer's weights comprising: generating a group of weights from a trained initial weight through several linear transformations, the group comprising a plurality of weight values in which each subsequent weight value is 1/2 of the preceding weight value; the host computer sends this group of weights to the storage inputs of the storage-calculation integrated units in each neuron assembly of the first layer of the spiking convolutional neural network; and the host computer sends the initial weights, after several linear transformations, to the storage inputs of the storage-calculation integrated units of the layers after the first layer; for the weights of a convolutional or fully connected layer immediately following an average pooling layer, the weights are additionally replicated into several copies according to the pooling size, the number of copies being the square of the pooling layer size.
  28. The spiking convolutional neural network computing device of claim 27, characterized in that the device is used to analyze an object by partitions and then combine the object signals of the partitions into complete object information, and
    the spiking convolutional neural network computing device further comprises a memory for storing the already stepwise-processed signals representing at least one partition of the object and, after all partition signals have been processed, combining all the partition signals or sending all the partition signals to another processor for combination;
    the memory being at least one of a register, an on-chip cache, off-chip storage or cloud storage.
  29. A method of manufacturing an integrated circuit, characterized in that the integrated circuit is the integrated circuit of claim 21, the method comprising the following steps:
    1) forming, by thermal oxidation and deposition, the dielectric layers and gates of the transistors in the digital logic circuits, the integration comparison circuits and the storage-calculation integrated units, the transistors including at least ordinary logic transistors, high-voltage transistors and floating-gate transistors;
    2) forming the capacitors of the integration comparison circuits by depositing an MIM dielectric layer and a metal layer, or by thermal oxidation and deposition processes;
    3) forming, by ion implantation, the sources and drains of the transistors in the digital logic circuits, the integration comparison circuits and the storage-calculation integrated units, as well as the P and N regions of the PN junctions;
    4) forming the metal interconnect of the overall circuit and the active-area-to-metal and metal-to-metal vias by means of metal layer, metal-layer dielectric and via processes;
    5) producing a storage-calculation integrated unit in a CMOS process by means of a process applied to memristors or flash memory.
PCT/CN2020/134558 2019-12-09 2020-12-08 Spiking convolutional neural network algorithm, integrated circuit, computing device and storage medium WO2021115262A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911249006.1 2019-12-09
CN201911249006.1A CN113033759A (zh) 2019-12-09 2019-12-09 Spiking convolutional neural network algorithm, integrated circuit, computing device and storage medium

Publications (1)

Publication Number Publication Date
WO2021115262A1 true WO2021115262A1 (zh) 2021-06-17

Family

ID=76329546

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/134558 Spiking convolutional neural network algorithm, integrated circuit, computing device and storage medium 2019-12-09 2020-12-08

Country Status (3)

Country Link
CN (1) CN113033759A (zh)
TW (1) TWI774147B (zh)
WO (1) WO2021115262A1 (zh)

Also Published As

Publication number Publication date
TW202123032A (zh) 2021-06-16
CN113033759A (zh) 2021-06-25
TWI774147B (zh) 2022-08-11

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20898013

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20898013

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 27.01.2023)
