CN117273100A - Brain-like chip architecture with configurable network structure - Google Patents

Brain-like chip architecture with configurable network structure

Info

Publication number
CN117273100A
Authority
CN
China
Prior art keywords
register
neuron
computation
adder
aer
Prior art date
Legal status
Pending
Application number
CN202311248937.6A
Other languages
Chinese (zh)
Inventor
石匆
王腾霄
田敏
王海冰
钟正青
何俊贤
蒋颖
Current Assignee
Chongqing University
Original Assignee
Chongqing University
Priority date
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202311248937.6A
Publication of CN117273100A
Legal status: Pending

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/76: Architectures of general purpose stored program computers
    • G06F15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807: System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Neurology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a brain-like chip architecture with a configurable network structure, comprising M1 macro neural computation cores, a meta-crossbar array routing structure, and an output error computation unit; the M1 macro neural computation cores are each connected to the meta-crossbar array routing structure, and the meta-crossbar array routing structure is connected to the output error computation unit. The spike events passed between the macro neural computation cores are encoded as AER-format data packets, each containing the address and timestamp of the neuron emitting the spike. Spike data packets are transferred between the macro neural computation cores through the meta-crossbar array routing structure, and a spike data packet sent out from any macro neural computation core can be routed to any macro neural computation core. The invention can realize high-speed, high-accuracy on-chip real-time learning while reducing spike-routing overhead, and ensures the scalability of the on-chip network scale and a degree of network-mapping flexibility.

Description

Brain-like chip architecture with configurable network structure
Technical Field
The invention belongs to the technical field of artificial intelligence and brain-like intelligent chips, and particularly relates to a brain-like chip architecture with a configurable network structure.
Background
Currently, spiking neural networks (SNNs), with their high computational efficiency, are receiving increasing attention in the field of artificial intelligence. Compared with artificial neural networks (ANNs), which involve large numbers of dense matrix operations, an SNN closely mimics the functional mechanisms of the human cerebral cortex: it represents, transmits, and processes sensory data as sparse electrical impulses and computes on spike events that are sparse in both time and space, which can greatly improve the energy efficiency of an intelligent system. Conventional von Neumann processors cannot efficiently handle such irregular spike events under a conventional instruction-flow mechanism, so various research institutions have designed spike-based neuromorphic brain-like chips, including SpiNNaker from the University of Manchester, TrueNorth from IBM, Neurogrid from Stanford University, the Loihi chip from Intel Corporation, and the like. These chips aim to build large-scale simulation platforms of multi-chip array systems for neuroscience research and data-center cognitive tasks; their areas range from tens to hundreds of square millimeters, and they exhibit high complexity and programmability, but at the cost of large area and relatively high power consumption.
Because edge-side intelligent systems impose strict requirements on chip area, cost, and energy efficiency, researchers have developed a number of small neuromorphic brain-like chips for resource-constrained edge-side intelligent systems, including the MorphIC and ODIN chips from the Catholic University of Louvain and the Darwin chip from Zhejiang University. These chip architectures and circuits are highly tailored and optimized for specific tasks (such as visual recognition), with custom spiking neuron models, SNN topologies, and synaptic plasticity; while they achieve very small area or very high energy efficiency, their scalability and configurability are severely limited. Some of these chips cannot perform on-chip learning at all, even though the on-chip learning function is important in uncontrolled edge-side intelligent application scenarios; some can train only shallow SNNs with simple on-chip learning rules, yielding low visual-recognition accuracy and very limited application scenarios; and although a deep-SNN learning algorithm based on layer-by-layer error back-propagation has been realized on chip and greatly improves visual-recognition accuracy, its computational overhead is huge, its energy consumption is high, and its learning latency is large.
Therefore, there is a need to develop a brain-like chip architecture with a configurable network architecture.
Disclosure of Invention
The invention aims to provide a brain-like chip architecture with a configurable network structure that can realize high-speed, high-accuracy on-chip real-time learning while reducing spike-routing overhead, thereby ensuring the scalability of the on-chip network scale and a degree of network-mapping flexibility.
The invention relates to a brain-like chip architecture with a configurable network structure, comprising M1 macro neural computation cores, a meta-crossbar array routing structure, and an output error computation unit; the M1 macro neural computation cores are each connected to the meta-crossbar array routing structure, and the meta-crossbar array routing structure is connected to the output error computation unit;

the spike events passed between the macro neural computation cores are encoded as AER-format data packets, each containing the address and timestamp of the neuron emitting the spike;

spike data packets are transferred between the macro neural computation cores through the meta-crossbar array routing structure, and a spike data packet sent out from any macro neural computation core can be routed to any macro neural computation core.
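The AER encoding above can be sketched as a small packed word. The 10-bit address field (matching the 1K-neuron limit of the later embodiment) and the word layout are assumptions; the patent only specifies that each packet carries a neuron address and a timestamp.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AERPacket:
    """Address-Event Representation packet: which neuron fired, and when."""
    neuron_index: int  # address (index) of the spiking neuron
    timestamp: int     # spike timestamp

    def encode(self, addr_bits: int = 10) -> int:
        # Pack the timestamp in the high bits and the neuron address in the low bits.
        return (self.timestamp << addr_bits) | (self.neuron_index & ((1 << addr_bits) - 1))

    @staticmethod
    def decode(word: int, addr_bits: int = 10) -> "AERPacket":
        return AERPacket(word & ((1 << addr_bits) - 1), word >> addr_bits)
```

Round-tripping `encode`/`decode` preserves both fields, which is all the routing fabric needs.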
Optionally, each macro neural computation core comprises M2 parallel micro neural computation cores, and each micro neural computation core is responsible for the computation of NL/M2 LIF neurons, accelerating the computation processing task of one layer of the FC-SNN;
the micro neural computation core includes:
a neural processing unit for updating the state of each LIF neuron and emitting spikes;
a weight updating unit responsible for receiving errors and updating the neuron synaptic weights;
a neuron state and weight memory for storing neuron state information and synaptic weights, connected to the neural processing unit and the weight updating unit, respectively;
and an output AER buffer for buffering the spike data emitted by the neural processing unit, connected to the neuron state and weight memory.
Optionally, the macro neural computation core further includes:
a register-lookup-table-based leakage unit for computing and storing the intermediate variables required to compute the neuron membrane-potential leakage, connected to the neural processing units of all micro neural computation cores within the macro neural computation core;
an input AER buffer for buffering input spike data, connected to the register-lookup-table-based leakage unit;
an output AER scheduler for scheduling the input/output spike data, connected to the output AER buffers of the micro neural computation cores within the macro neural computation core;
and a routing node connected, respectively, to the weight updating unit of each micro neural computation core within the macro neural computation core, the input AER buffer, and the output AER scheduler.
Optionally, the processing flow of the entire chip architecture comprises a forward phase and a reverse phase;

in the forward phase, each macro neural computation core completes its computation independently in an event-driven manner, i.e., it processes an input spike event as soon as it is received, without synchronized execution;

in the reverse phase, the macro neural computation cores complete the neuron synaptic-weight updates in a synchronous array-parallel manner;

the micro neural computation cores operate synchronously in an array-parallel manner in both the forward and reverse phases.
Optionally, the meta-crossbar array routing structure includes a routing-node network, M1+1 AER schedulers, and M1+1 switch-connection configuration registers; the routing-node network comprises (M1+1)×(M1+1) routing-node switches, with the routing-node switches in each column connected in sequence and the routing-node switches in each row connected in sequence;

each routing-node switch in a column is connected, through an AND gate, to the AER scheduler and the switch-connection configuration register controlling that column, so that the switch is turned on and off jointly by the switch-connection configuration register and the AER scheduler.
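A minimal behavioral sketch of this switch matrix, assuming row 0 is the chip input port, column 0 the chip output port, and a single boolean grant standing in for the per-column AER scheduler (these conventions are illustrative, not taken from the patent):

```python
M1 = 4  # number of macro-cores in the embodiment; +1 for the chip I/O port

class MetaCrossbar:
    """(M1+1) x (M1+1) switch matrix. config[src][dst] = 1 means packets from
    row `src` (0 = chip input, 1..M1 = MCores) are forwarded to column `dst`
    (0 = chip output, 1..M1 = MCores). The per-column AND gate combines the
    configuration-register bit with the column scheduler's grant."""
    def __init__(self):
        self.config = [[0] * (M1 + 1) for _ in range(M1 + 1)]

    def connect(self, src: int, dst: int):
        self.config[src][dst] = 1  # write the switch-connection configuration register

    def route(self, src: int, grant: bool = True) -> list:
        # A packet is forwarded only where the config bit AND the grant are high.
        return [dst for dst in range(M1 + 1) if self.config[src][dst] and grant]
```

Because `connect(i, i)` is allowed, a core can feed itself, matching the statement that a packet can be routed to any MCore including its source.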
Optionally, the register-lookup-table-based leakage unit comprises a first adder and a register-based exponent lookup table, the first adder being connected to the register-based exponent lookup table; when a packet AER(i, t) reaches the macro neural computation core, the first adder computes Δt = t − t_pre, where i is the source-neuron index, t is the spike timestamp, and t_pre is the spike timestamp of the previous event; Δt is then used as the address into the pre-initialized register-based exponent lookup table to obtain the leakage term.
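The leakage term of a LIF neuron decays exponentially with Δt, so the lookup table can be filled once with fixed-point values of exp(−Δt/τ). A sketch, with the time constant, table depth, and 8-bit fixed-point scale all assumed for illustration:

```python
import math

TAU = 32          # membrane time constant in timestep units (assumed)
LUT_DEPTH = 64    # number of register entries in the table (assumed)
SCALE = 1 << 8    # 8-bit fixed-point fraction (assumed)

# Initialized once at configuration time: exp(-dt/tau) in fixed point.
LEAK_LUT = [round(SCALE * math.exp(-dt / TAU)) for dt in range(LUT_DEPTH)]

def leak_factor(t: int, t_pre: int) -> int:
    """First adder computes dt = t - t_pre; dt then addresses the LUT."""
    dt = min(t - t_pre, LUT_DEPTH - 1)  # saturate beyond the table depth
    return LEAK_LUT[dt]
```

Replacing the exponential with a table read is exactly what lets the hardware avoid a costly exp circuit in the event path.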
Optionally, the neural processing unit includes a first multiplier, a second adder, a first comparator, a fire_flag_regs register, and a second comparator;

the first multiplier is connected to the second adder and to the register-based exponent lookup table, the first comparator is connected to the second adder, the fire_flag_regs register is connected to the first comparator, and the second comparator is connected to the second adder;

the first multiplier multiplies the leakage term by the old membrane potential V_m,j of neuron j to obtain the leaked membrane potential; the second adder adds the leaked membrane potential to the synaptic weight w_ij to obtain the updated membrane potential V_m,j of neuron j; the first comparator compares the updated V_m,j with the threshold V_th and, when updated V_m,j > V_th, emits the spike aer_out; the address of each neuron that has emitted a spike is recorded in the fire_flag_regs register file of the neural processing unit, which is consulted before each spike emission, and a spike is emitted only if the current neuron j has not yet fired;

whenever an event arrives and the membrane potential is updated, the second comparator judges whether the current membrane potential exceeds the previously stored maximum membrane potential V_max,j; if so, V_max,j is updated, and the current timestamp is simultaneously stored as t_max,j in the neuron state and weight memory.
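The datapath above can be sketched one step per hardware block. The threshold value and fixed-point scale are illustrative; the fire flag limits each neuron to a single spike, as described.

```python
V_TH = 1000  # firing threshold V_th (illustrative fixed-point value)

class NeuronState:
    def __init__(self):
        self.v_m = 0        # membrane potential V_m,j
        self.v_max = 0      # largest membrane potential seen, V_max,j
        self.t_max = 0      # timestamp of that maximum, t_max,j
        self.fired = False  # fire_flag_regs bit: at most one spike per neuron

def process_event(n: NeuronState, leak: int, w_ij: int, t: int, scale: int = 256):
    """One event-driven update: leak, integrate the synaptic weight, compare
    with the threshold, and track (V_max,j, t_max,j) for the reverse phase.
    Returns True if the neuron emits a spike (aer_out)."""
    n.v_m = (n.v_m * leak) // scale + w_ij   # first multiplier + second adder
    if n.v_m > n.v_max:                      # second comparator
        n.v_max, n.t_max = n.v_m, t
    if n.v_m > V_TH and not n.fired:         # first comparator + fire-flag check
        n.fired = True
        return True
    return False
```

Tracking (V_max,j, t_max,j) during the forward phase is what later feeds the Δt = t_max,j − t_i computation in the weight updating unit.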
Optionally, the output error computation unit includes an output-layer neuron error computation module and an output error register connected to the output-layer neuron error computation module;

the output-layer neuron error computation module computes the error of each output-layer neuron and stores it in the output error register; once all output errors have been computed, they are broadcast to the micro neural computation cores of every macro neural computation core.
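A behavioral sketch of the computation and broadcast above. The patent does not spell out the output-error formula here, so a simple target-minus-output difference stands in for it, and per-uCore mailbox lists stand in for the broadcast wiring.

```python
def compute_output_errors(outputs, targets):
    """Per-output-neuron error; (target - output) is an assumed stand-in
    for the unspecified error rule."""
    return [t - o for o, t in zip(outputs, targets)]

def broadcast_errors(errors, mcores):
    """Once every output error is latched in the output error register,
    the whole vector is broadcast to each uCore of each MCore at once."""
    for mcore in mcores:                 # mcores: list of MCores,
        for ucore_mailbox in mcore:      # each a list of per-uCore mailboxes
            ucore_mailbox.append(list(errors))
```

Broadcasting one shared error vector, rather than back-propagating per-layer errors, is what keeps the reverse phase a single synchronous step.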
Optionally, the weight updating unit includes a 16-way 2b×1b low-cost multiplier array, a single-cycle 16-to-1 adder tree, and a local error accumulation register;

the 16-way 2b×1b low-cost multiplier array is connected to the single-cycle 16-to-1 adder tree to implement the multiply-accumulate operations;

the local error accumulation register is connected to the single-cycle 16-to-1 adder tree and stores the intermediate results of the multiply-accumulate operations; after all multiply-accumulate operations are completed, the value in the local error accumulation register is the local error local_error(j) of the current neuron j.
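The accumulation can be sketched as follows. The formula local_error(j) = Σ_k w_jk · e_k is an assumption consistent with the 2b×1b operand widths; sixteen lanes are consumed per iteration of the loop, mirroring one cycle of the multiplier array plus adder tree.

```python
def local_error(weights_2b, errors_1b):
    """Hypothetical local-error accumulation: a 16-way 2b x 1b multiplier
    array feeds a single-cycle 16-to-1 adder tree, with the running sum
    kept in the local error accumulation register."""
    acc = 0  # local error accumulation register
    for base in range(0, len(weights_2b), 16):
        lanes = [w * e for w, e in
                 zip(weights_2b[base:base + 16], errors_1b[base:base + 16])]
        acc += sum(lanes)  # single-cycle 16-to-1 adder tree
    return acc  # local_error(j) once all lanes have been consumed
```

With 1-bit errors each lane is just a gated 2-bit weight, which is why the multiplier array can be so small.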
Optionally, the weight updating unit further includes a third adder, a second multiplier, a shifter, an 8-bit linear-feedback shift register (LFSR), a third comparator, a fourth adder, and a fifth adder;

the third adder is connected to the register-based exponent lookup table; the second multiplier is connected to the register-based exponent lookup table, the local error accumulation register, and the shifter; the third adder subtracts the spike timestamp t_i from the timestamp t_max,j at which the membrane potential of neuron j was maximal to obtain Δt; Δt is used as the address into the register-based exponent lookup table to obtain an exponent term, which the second multiplier multiplies by the hidden-layer neuron's local_error(j); the product is shifted right by L bits to obtain the 16-bit weight update Δw_ij;

the third comparator is connected to the shifter and to the 8-bit LFSR, and the fourth adder is connected to the shifter, the third comparator, and the fifth adder; after the 16-bit weight update Δw_ij is obtained, its low 8-bit part Δw_ij[7:0] is compared by the third comparator with an 8-bit unsigned random number R generated by the 8-bit LFSR; the comparison result, together with the high 8-bit part Δw_ij[15:8] and the old weight w read from the neuron state and weight memory, is combined by the fourth and fifth adders to obtain the updated weight w, which is written back to the neuron state and weight memory.
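This random update reads as stochastic rounding of the 16-bit Δw_ij down to the 8-bit stored weight precision: the low byte is compared against a random byte R from the LFSR, and a carry into the high byte is generated accordingly, so the rounded update equals the full-precision update in expectation. A sketch; the LFSR taps, the unsigned arithmetic, and the increment-on-Δw[7:0] > R convention are assumptions.

```python
class LFSR8:
    """8-bit Fibonacci LFSR (taps 8,6,5,4, a maximal-length polynomial);
    the silicon's actual taps are not given, so these are assumed."""
    def __init__(self, seed: int = 0xA5):
        self.state = seed & 0xFF

    def next(self) -> int:
        bit = ((self.state >> 7) ^ (self.state >> 5)
               ^ (self.state >> 4) ^ (self.state >> 3)) & 1
        self.state = ((self.state << 1) | bit) & 0xFF
        return self.state

def stochastic_update(w_old: int, dw16: int, lfsr: LFSR8) -> int:
    """Stochastically round the 16-bit update dw16 to 8 bits: when the low
    byte dw16[7:0] exceeds the random byte R, carry 1 into the high byte
    dw16[15:8] before adding it to the stored 8-bit weight."""
    low, high = dw16 & 0xFF, (dw16 >> 8) & 0xFF
    carry = 1 if low > lfsr.next() else 0          # third comparator
    return (w_old + high + carry) & 0xFF           # fourth + fifth adders
```

This is how high-precision updates coexist with low-precision weight storage: small updates whose high byte is zero still take effect occasionally, with probability proportional to the low byte.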
The invention has the following advantages:
(1) The brain-like chip architecture with a configurable network structure integrates a hierarchical multi-core chip architecture, a hybrid parallel processing mechanism, an error broadcasting mechanism, a Meta-Crossbar (meta-crossbar array) spike-routing mechanism, a fully event-driven mechanism, on-chip stochastic weight updating, and other techniques, so that the chip realizes compute-in-memory integration, can achieve high-speed, high-accuracy on-chip real-time learning, reduces spike-routing overhead, and ensures the scalability of the on-chip network scale and a degree of network-mapping flexibility.
(2) The meta-crossbar array routing structure provided by the invention realizes configurable implicit connections within the core; full connection (FC) in an SNN can be realized by configuring only a small number of parameter registers, without storing all possible connection relationships and weights, greatly reducing on-chip storage-resource consumption as well as chip area and energy consumption. Meanwhile, the Meta-Crossbar spike-routing mechanism allows spike events to be transferred efficiently between processing cores; compared with conventional network-on-chip spike-routing methods, it greatly relieves the routing overhead of inter-chip spike transfer. The design of the meta-crossbar array routing architecture also ensures the scalability of the inter-chip/on-chip network scale and a degree of network-mapping flexibility.
(3) The invention provides a fully event-driven spike processing unit: the neural processing unit is activated only when an input spike event arrives, and the spiking neurons it is responsible for execute their operations in turn; meanwhile, a register-based exponent lookup table replaces the complex exponential operation, minimizing computational overhead and energy consumption.
(4) The on-chip learning circuit module provided by the invention adopts error broadcasting, on-chip stochastic weight-update learning, and other techniques; by combining high-precision stochastic updates with low-precision synaptic-weight storage, it greatly reduces chip area, improves chip storage-resource utilization, and lowers chip cost with almost no loss of recognition accuracy, while enabling fast on-chip learning.
(5) The brain-like chip architecture with a configurable network structure provided by the invention has high-speed, high-accuracy on-chip learning capability and high area and energy efficiency; it overcomes the shortcomings of conventional small edge-side neuromorphic brain-like chips and has extremely high application potential and value in resource-constrained edge-side intelligent systems.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be considered limiting of the scope; other related drawings can be obtained from these drawings by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of the overall scheme of the present embodiment;
FIG. 2 is a schematic diagram of a brain-like chip architecture with configurable network architecture in the present embodiment;
FIG. 3 is a schematic diagram of a routing structure of a meta-crossbar array in the present embodiment;
FIG. 4 is a schematic diagram of a pulse processing unit driven by a complete event in the present embodiment;
fig. 5 is a schematic diagram of an on-chip learning circuit module in the present embodiment.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the overall scheme flow of the brain-like chip architecture with configurable network structure includes:
1) Design of the brain-like chip architecture with a configurable network structure: based on a hierarchical multi-core chip architecture, a hybrid parallel processing mechanism, an error broadcast mechanism, a Meta-Crossbar routing mechanism, a fully event-driven mechanism, on-chip stochastic weight-update learning, and other techniques, the chip can realize high-speed, high-accuracy on-chip real-time learning while reducing spike-routing overhead and ensuring the scalability of the on-chip network scale and a degree of network-mapping flexibility.
2) Design of the meta-crossbar array routing structure: the proposed structure enables spike events to be transferred efficiently between processing cores; compared with conventional network-on-chip (NoC) spike-routing methods, it greatly relieves the routing overhead of inter-chip spike transfer. The design also ensures the scalability of the inter-chip/on-chip network scale and a degree of network-mapping flexibility.
3) Design of the fully event-driven spike processing unit: in this circuit unit, the neural processing unit is activated only when an input spike event arrives, and the spiking neurons it is responsible for execute their operations in turn, minimizing computational overhead and energy consumption.
4) Design of the on-chip learning circuit module: the on-chip learning circuits adopt error broadcasting, on-chip stochastic weight-update learning, and other techniques, greatly improving chip storage-resource utilization, reducing chip area and cost, and enabling fast on-chip learning.
As shown in fig. 2, in this embodiment the brain-like chip architecture with a configurable network structure makes full use of the structural features and regular connections of the FC-SNN (fully connected SNN) to reduce hardware design complexity and cost. The proposed compute-in-memory brain-like chip architecture adopts a two-level Macro-Core (MCore) / Micro-Core (uCore) multi-core architecture, which maps the FC-SNN model onto the chip's compute and storage resources simply and efficiently.
As shown in fig. 2 (upper right), in the present embodiment the brain-like chip architecture with a configurable network structure (chip architecture for short) includes M1 macro neural computation cores (MCores), a meta-crossbar array routing structure, and an output error computation unit (Output Error Unit); the M1 MCores are each connected to the meta-crossbar array routing structure, which in turn is connected to the output error computation unit. Each MCore can be mapped to part of any layer of the FC-SNN, so the maximum number of FC-SNN layers supported by the chip is M1. Each MCore supports the data computation and storage of up to N_L LIF neurons, and each LIF neuron supports up to N_S synaptic fan-ins. The spike events passed between MCores are encoded as AER-format packets containing the address (i.e., index) and timestamp of the spiking neuron. Spike packets are transferred between MCores through the meta-crossbar array routing structure, and a packet sent out by an MCore can be routed to any MCore (including itself), so the chip supports FC-SNNs with a variety of configured structures, greatly improving the scalability of the on-chip network scale and the network-mapping flexibility.
As shown in fig. 2 (upper left), in the present embodiment each MCore contains M2 parallel micro neural computation cores (uCores), each uCore being responsible for the computation of N_L/M2 LIF neurons, accelerating the computation processing task of one layer of the FC-SNN. The uCore comprises a neural processing unit, a weight updating unit, a neuron state and weight memory, and an output AER buffer. The neural processing unit updates the state of each LIF neuron and emits spikes. The weight updating unit receives errors and updates the neuron synaptic weights. The neuron state and weight memory stores neuron state information and synaptic weights. The output AER buffer buffers the spike data emitted by the neural processing unit. Inside the uCore, the neuron state and weight memory is connected to the neural processing unit and the weight updating unit, and the output AER buffer is connected to the neuron state and weight memory. In addition, each MCore further includes an input AER buffer for buffering input spike data; a register-lookup-table-based leakage unit for computing and storing the intermediate variables needed to compute the neuron membrane-potential leakage; and an output AER scheduler and a routing node for scheduling the input/output spike data.
The register-lookup-table-based leakage unit is connected to the neural processing units of the uCores within the MCore; the input AER buffer is connected to the register-lookup-table-based leakage unit; the output AER scheduler is connected to the output AER buffers of the uCores within the MCore; and the routing node is connected, respectively, to the weight updating unit of each uCore within the MCore, the input AER buffer, and the output AER scheduler.
As shown in fig. 2 (bottom half), the hybrid multi-level parallel processing mechanism of the chip architecture is illustrated. The processing flow of the whole chip architecture comprises a forward phase and a reverse phase. In the forward phase, each MCore completes its computation independently in an event-driven manner, i.e., processing starts as soon as an input spike event is received, with no synchronization needed; in the reverse phase, the MCores complete the neuron synaptic-weight updates in synchronous array-parallel fashion. At the uCore level, the uCores operate in simple array-parallel fashion in both the forward and reverse phases. Through this dynamically reconfigurable hybrid multi-level parallel processing mechanism, the on-chip computing resources can be used efficiently for the demands of the different processing stages, realizing high-speed on-chip learning and inference.
The data flow at the forward and reverse phases of the chip processing is briefly described as follows:
In the forward phase, after an input spike AER packet reaches the meta-crossbar array routing structure, the routing structure forwards it to the corresponding MCore according to the routing configuration. Upon receiving the AER packet, the MCore first places it in the input AER buffer, then feeds it to the register-lookup-table-based leakage unit for processing, and then to the neural processing unit in the uCore, which updates the LIF neuron state information (membrane potential, spike time, etc.) by accessing the neuron state and weight memory. If the membrane potential of an LIF neuron exceeds the threshold, the neural processing unit emits an AER packet into the output AER buffer, from which it is sent through the output AER scheduler and the routing node to the next MCore, or output directly off-chip.
In the reverse phase, after the output error computation unit completes the output-error computation, the output error vector is broadcast synchronously to every MCore; upon receiving the output-layer errors, the weight updating units in the MCores begin the neuron synaptic-weight update process, i.e., the on-chip learning process, in synchrony.
In the present embodiment, M1 = M2 = 4 in the chip architecture, and N_L and N_S each have a maximum value of 1024 (1K); however, when N_S = 1K, N_L can be at most 64 (N_L and N_S are jointly limited by the on-chip memory space: N_L × N_S ≤ 64K). Thus, the maximum number of neurons supported by the chip architecture is 4 × 1K = 4K, and the maximum number of synapses is 4 × 64 × 1K = 256K.
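The stated capacity figures follow from a 64K-word per-MCore weight memory; a small check of the arithmetic (the 64K memory size and the 1K cap on N_L are taken from the text, the helper name is ours):

```python
# Memory-driven trade-off in the embodiment: each MCore's neuron state and
# weight memory holds at most N_L x N_S = 64K synaptic weights, so neuron
# count and fan-in trade off against each other.
M1, M2 = 4, 4
MEM_WORDS = 64 * 1024            # 64K weight words per MCore

def max_neurons(n_s: int) -> int:
    """Largest N_L allowed for a given fan-in N_S (capped at 1K)."""
    return min(MEM_WORDS // n_s, 1024)
```

At full fan-in (N_S = 1K) each MCore holds only 64 neurons, while at small fan-in the full 1K neurons per MCore (4K chip-wide) are available.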
As shown in fig. 3, in this embodiment the meta-crossbar array routing structure is used to transfer spike events efficiently between MCores in the feed-forward phase. It includes a routing-node network, M1+1 AER schedulers, and M1+1 switch-connection configuration registers; the routing-node network comprises (M1+1)×(M1+1) routing-node switches, with the switches in each column connected in sequence and the switches in each row connected in sequence. As shown in fig. 3 (upper left), the input side of the meta-crossbar array routing structure comprises the input node (Input) and MCore1-MCore4, and the output side comprises the output node (Output) and MCore1-MCore4. The open circles in fig. 3 represent possible routing paths, which are enabled by configuring the switch-connection configuration registers. Spike packets from the input node may be routed to MCore1-MCore4, and spike packets emitted by MCore1-MCore4 may be routed to MCore1-MCore4 or to the output node. As shown in fig. 3 (upper right), a specific implementation of the Meta-Crossbar routing structure is illustrated, taking column 4 of the array as an example: this column holds the routing nodes connected to MCore1, each open circle corresponds to a switch that controls whether the connection exists, and the switch is turned on and off by configuring the switch-connection configuration register; a register value of 1 indicates that the routing node has a connection.
Note that since there may be multiple paths connected to MCore1, the AER scheduler is required to arbitrate and schedule AER packets among the request signals (input_req, mcore1_req, etc.) in order to prevent pulse data collision and contention.
One of the core innovations of the meta-crossbar array routing structure proposed in this embodiment is the "Meta" concept. Early neuromorphic brain chips generally adopted a NoC + crossbar connection mode in order to support the mapping of various network structures, which offers high flexibility. However, a crossbar must allocate storage space for all possible connection relationships and synaptic weights and cannot realize weight multiplexing; when mapping network structures such as convolution, it suffers from high neural-core occupancy and low utilization of in-core storage resources, and therefore requires a larger chip area. To reduce chip area and energy consumption, many small neuromorphic chips instead use fixed connection rules and support only specific network structures, but this approach limits the flexibility and scalability of the chip architecture. In this embodiment, the meta-crossbar array routing structure implements configurable implicit connections within the core: full connection (FC) in the SNN can be realized by configuring a small number of parameter registers (i.e., the switch connection configuration registers) without storing all possible connection relationships and weights, which greatly reduces on-chip storage resource consumption as well as chip area and energy consumption. At the same time, the network structure is configurable: by combining the MCores and the configuration parameter registers with the meta-crossbar array routing structure, various network structures can be flexibly realized on chip, greatly enhancing the flexibility and scalability of the chip architecture. Fig. 3 (lower part) gives three configuration examples. The lower-left example illustrates how an FC-SNN with a 256-256-256-256-256 structure is implemented by configuring the relevant parameter registers and routing connections.
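The switch-connection-register mechanism can be modeled as a boolean matrix. The following Python sketch (names and matrix orientation are illustrative assumptions, not the chip's actual register map) configures the four-MCore chain of the lower-left example:

```python
# Hedged sketch of meta-crossbar routing: a (M1+1) x (M1+1) switch matrix
# decides which destinations a packet from a given source is routed to.
M1 = 4
SOURCES = ["Input", "MCore1", "MCore2", "MCore3", "MCore4"]
DESTS   = ["MCore1", "MCore2", "MCore3", "MCore4", "Output"]

def make_chain_config():
    """Configure a feed-forward chain: Input->MCore1->...->MCore4->Output."""
    cfg = [[0] * (M1 + 1) for _ in range(M1 + 1)]
    for s in range(M1 + 1):
        cfg[s][s] = 1  # source s feeds destination s (Input->MCore1, MCore1->MCore2, ...)
    return cfg

def route(cfg, source):
    """Return the destinations whose switch register bit is set for this source."""
    s = SOURCES.index(source)
    return [DESTS[d] for d in range(M1 + 1) if cfg[s][d] == 1]
```

With this configuration, a packet injected at the input node is routed only to MCore1, and MCore4's output packets leave through the output node, realizing the cascaded 256-256-256-256-256 mapping.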
In this example, each MCore is responsible for the computation of one layer (256 neurons) of the FC-SNN. The fan-in/fan-out of each MCore is determined by configuring the parameter registers synapses per neuron and number of neurons, and the connection relationships of the routing nodes are configured so that the four MCores are cascaded, realizing the 256-256-256-256-256 network structure. The second example shows how an FC-SNN with a 512-256-128-64 structure is implemented; unlike the first example, MCore1 and MCore2 are jointly responsible for one layer of the network. The third example shows how an FC-SNN with a 1024-256 structure is implemented; in this case, MCore1-MCore4 are jointly responsible for one layer of the network.
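The three mappings follow from the per-core limits stated earlier (N_L ≤ 1024 and N_L × N_S ≤ 64K). A small sketch (an illustrative helper, not part of the patent) reproduces how many MCores each layer needs:

```python
def mcores_needed(layer_size: int, fan_in: int,
                  n_l_max: int = 1024, mem: int = 64 * 1024) -> int:
    """MCores required for one FC layer, assuming each core holds at most
    n_l_max neurons and at most mem synaptic weights (neurons * fan_in)."""
    per_core = min(n_l_max, mem // fan_in)
    return -(-layer_size // per_core)  # ceiling division
```

This matches the figure: a 256-neuron layer with 256 fan-in fits in one MCore (example 1), with 512 fan-in needs two MCores (example 2), and with 1024 fan-in needs all four MCores (example 3).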
Taking the first example in the lower-left corner of fig. 3 as an example: when an AER packet arrives at MCore1, since the connection structure of MCore1 has been determined in advance by the configuration parameter registers, MCore1 can automatically complete the computation and state update of all 256 neurons. That is, one input packet can trigger the update of 256 neurons, and pulse packets do not need to be duplicated and transmitted one by one as in early neuromorphic chips. Such a packet is referred to as a "Meta" packet. Through this mechanism, the meta-crossbar array routing structure fully exploits the regular structure of deep FC-SNNs, improving the utilization of in-core storage resources, realizing deep SNN network mapping with only a moderate chip area, and preserving the scalability of the SNN scale together with a certain mapping flexibility. In addition, because pulse data packets do not need to be duplicated, the Meta-AER mechanism greatly reduces the number of pulse data packets in the routing network and alleviates the congestion problems common in NoC networks.
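The fan-out of one "Meta" packet can be sketched behaviorally as follows (a hedged Python model with illustrative names; leakage is folded into a single precomputed factor for brevity):

```python
def process_meta_packet(weights_row, v_m, leak, v_th):
    """One 'Meta' AER packet triggers the sequential update of every neuron
    in the layer; no packet duplication is needed."""
    fired = []
    for j, w in enumerate(weights_row):
        v_m[j] = v_m[j] * leak + w   # leak old potential, integrate weight
        if v_m[j] > v_th:
            fired.append(j)          # record neurons crossing the threshold
    return fired
```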
As shown in fig. 4, this embodiment proposes a fully event-driven pulse processing unit (comprising a neural processing unit and a leakage unit based on a register lookup table). In the fully event-driven pulse processing unit, the neural processing unit is activated only when an input pulse event arrives, and it sequentially performs operations on the pulse neurons it is responsible for, so that computational overhead and energy consumption are reduced to the greatest extent. Table 1 gives the definition and description of the variables in the figure. When an input AER (i, t) reaches the MCore, the fully event-driven pulse processing unit completes the update of the neuron state according to equation (1):

updated V_m,j = V_m,j · exp(-Δt/τ) + w_ij, with Δt = t - t_pre (1)

wherein: updated V_m,j represents the new membrane potential of the currently processed neuron j, V_m,j represents the old membrane potential of neuron j, τ represents the leakage factor, and w_ij is the synaptic weight between the previous-layer neuron i and the current-layer neuron j. Therefore, when the membrane potential is updated, the leakage operation is completed first, and then the corresponding synaptic weight is added. In addition, the updated membrane potential must be compared with the set neuron threshold V_th; if it is greater than V_th, a pulse is transmitted.
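A hedged behavioral sketch of equation (1) in Python (here `math.exp` stands in for the on-chip register lookup table, and the variable names are illustrative):

```python
import math

def lif_update(v_m, w_ij, t, t_pre, tau, v_th):
    """Event-driven LIF update per equation (1): leak the old membrane
    potential, add the synaptic weight, then compare with the threshold."""
    leak = math.exp(-(t - t_pre) / tau)  # on chip: register-based lookup table
    v_new = v_m * leak + w_ij
    return v_new, v_new > v_th
```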
Table 1: definition and description of variables in pulse processing units
The computation flow of the fully event-driven pulse processing unit is as follows:
as shown in fig. 4 (upper left corner), in the present embodiment, the register-based leakage unit includes a first adder 5 and a register-based index (i.e., exponential) lookup table, the first adder 5 being connected to the register-based index lookup table. When AER (i, t) reaches the MCore, Δt = t - t_pre is first calculated by the first adder 5; the obtained Δt is then used as the address to access the pre-initialized register-based index lookup table, so that the leakage term exp(-Δt/τ) is obtained directly. No complex exponential operation needs to be completed on chip, and because the index lookup table is implemented with registers, it occupies only a small chip area.
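The lookup table can be modeled as a precomputed fixed-point array addressed directly by Δt (a hedged sketch: the 8-bit fractional width and the saturation behavior for large Δt are illustrative assumptions, not taken from the patent):

```python
import math

def build_exp_lut(tau: float, depth: int, frac_bits: int = 8):
    """Precompute exp(-dt/tau) as unsigned fixed-point register values,
    addressed directly by dt."""
    return [round(math.exp(-dt / tau) * (1 << frac_bits)) for dt in range(depth)]

def leak_lookup(lut, dt: int, frac_bits: int = 8) -> float:
    """Hardware-style access: dt is the register address; saturate if large."""
    entry = lut[min(dt, len(lut) - 1)]
    return entry / (1 << frac_bits)
```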
As shown in fig. 4, in the present embodiment, the neural processing unit includes a first multiplier 1, a second adder 2, a first comparator 3, a fire_flag_regs register file, and a second comparator 4. The first multiplier 1 is respectively connected to the second adder 2 and the register-based index lookup table; the first comparator 3 is connected to the second adder 2; the fire_flag_regs register is connected to the first comparator 3; and the second comparator 4 is connected to the second adder 2. After the leakage term exp(-Δt/τ) is obtained, it is first multiplied with the old membrane potential V_m,j of neuron j by the first multiplier 1 to obtain the leaked neuron membrane potential; the leaked membrane potential is then added to the synaptic weight w_ij by the second adder 2 to obtain the updated membrane potential updated V_m,j of neuron j; finally, the updated V_m,j is compared with the threshold V_th by the first comparator 3, and if updated V_m,j > V_th, a pulse aer_out is transmitted. Furthermore, to minimize computational overhead and power consumption, the fully event-driven pulse processing unit allows each neuron to transmit a pulse only once: the address of a neuron that has transmitted a pulse is recorded in the fire_flag_regs register file of the neural processing unit, this file is accessed before each pulse transmission, and the pulse is allowed to be transmitted only if the current neuron j has not yet transmitted a pulse.
Each time an event arrives and the membrane potential is updated, the second comparator 4 must judge whether the current membrane potential is greater than the previously stored maximum membrane potential V_max,j. If it is greater, V_max,j must be updated, the current timestamp must be synchronously written into t_max,j, and both must be stored in the neuron state and weight memory for use by the on-chip learning process in the subsequent reverse phase.
As shown in fig. 4 (upper right corner), a timing diagram of the pulse processing unit's accesses to the neuron state and weight memory is shown: blue represents memory reads, red represents memory writes, and the orange-yellow background portion represents the update of variables related to on-chip learning. The fully event-driven pulse processing unit operates on neurons j = 1, 2, 3 sequentially, and since the on-chip memory is implemented with single-port SRAM that supports only one read or one write per cycle, the neuron-related variables V_m,j, w_ij, V_max,j, and t_max,j must be read/written sequentially.
As shown in fig. 5, the on-chip learning circuit module provided in this embodiment adopts techniques such as error broadcasting and on-chip stochastic weight update learning, which greatly improve the utilization of chip storage resources, reduce chip area and cost, and enable fast on-chip learning. Table 2 gives the definition and description of the variables in the figure. The upper half of fig. 5 shows the circuit diagram of the output error calculation unit, which comprises an output layer neuron error calculation module 14 and an output error register connected to the output layer neuron error calculation module 14; the output layer neuron error calculation module 14 is configured to calculate the error of each output layer neuron and store the calculated error in the output error register. The first step of the on-chip learning process is to calculate the ternary output layer neuron error err. For output layer neuron k: err(k) = 1 if its assigned label equals the sample label but no pulse was transmitted (neuron label = sample label & fire flag = 0); err(k) = -1 if its assigned label does not equal the sample label but a pulse was transmitted (neuron label ≠ sample label & fire flag = 1); otherwise, err(k) = 0. The output error calculation unit calculates the error of each output layer neuron in turn, stores the calculated errors in the output error register, and broadcasts them to the uCore of each MCore.
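The ternary error rule can be written down directly (a minimal sketch; the function name is illustrative):

```python
def output_error(neuron_label: int, sample_label: int, fire_flag: int) -> int:
    """Ternary output-layer error err(k) as computed by the output error
    calculation unit."""
    if neuron_label == sample_label and fire_flag == 0:
        return 1    # should have fired, did not
    if neuron_label != sample_label and fire_flag == 1:
        return -1   # fired, should not have
    return 0        # otherwise
```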
Table 2: Definition and description of variables in the on-chip learning circuit module

| Variable name | Definition | Description |
| i | Source neuron address index | - |
| j | Index of the neuron being processed | - |
| k | Output layer neuron index | - |
| neuron label | Neuron label | - |
| sample label | Sample label | - |
| B | Random projection matrix | Binary (-1, 1) |
| err | Output layer neuron error | Ternary (-1, 0, 1) |
| local_err | Local error of hidden layer neuron | - |
| Δw_ij[15:8] | Upper 8 bits of the weight update amount | Used for stochastic update |
| Δw_ij[7:0] | Lower 8 bits of the weight update amount | Used for stochastic update |
As shown in fig. 5 (lower half), a circuit diagram of the weight updating unit is shown. The weight updating unit further includes a third adder 7, a second multiplier 8, a shifter 9, an 8-bit linear feedback shift register 10, a third comparator 11, a fourth adder 12, and a fifth adder 13. The third adder 7 is connected to the register-based index lookup table; the second multiplier 8 is respectively connected to the register-based index lookup table, the local error accumulation register, and the shifter 9; the third comparator 11 is respectively connected to the shifter 9 and the 8-bit linear feedback shift register 10; and the fourth adder 12 is respectively connected to the shifter 9, the third comparator 11, and the fifth adder 13. After receiving the output error err, the weight updating unit in the uCore of each MCore calculates the local error local_err according to formula (2):

local_err(j) = Σ_k B(j, k) · err(k) (2)
wherein B(j, k) is an element of the random projection matrix B, whose elements are binary, taking only the two values -1 and 1. As can be seen from formula (2), the calculation of the local error requires multiple multiply-add operations. As shown in fig. 5, in the on-chip learning circuit module proposed in this embodiment, the weight updating unit uses a 16-way 2b × 1b low-cost multiplier array and a single-cycle 16-to-1 adder tree to implement the multiply-add operations, so that each uCore can complete 16 multiply-add operations in 2 clock cycles. This design minimizes the resources occupied by the multipliers, reducing chip area and cost while preserving the on-chip learning speed. The intermediate results of the multiply-add operations are stored in the local error accumulation register; when all multiply-add operations are completed, the value in the local error accumulation register is the local error local_err(j) of the current neuron j. After the local error is obtained, the synaptic weight update amount is calculated according to formula (3):

Δw_ij = λ · exp(-Δt/τ) · local_err(j), with Δt = t_max,j - t_i (3)
wherein Δw_ij represents the synaptic weight update amount and λ represents the learning rate, which is set to λ = 2^-U; therefore, in the hardware implementation only a shift operation is needed, avoiding the use of a resource-consuming multiplier. As shown in the lowermost circuit of fig. 5, t_max,j and t_i are first read from the neuron state and weight memory of the uCore and subtracted to obtain Δt; Δt is then used as the address to access the register-based index lookup table to obtain the exponential term, which is multiplied with local_err(j) by the second multiplier 8 and finally right-shifted by U bits to obtain the weight update amount Δw_ij. The invention does not directly add the weight update amount Δw_ij to the synaptic weight w to obtain the updated weight, because Δw_ij is 16 bits; if it were added directly, the synaptic weight w would also need to be stored in 16-bit format, which would occupy a large chip area. To improve the utilization of on-chip memory resources and minimize chip area and cost, the invention proposes an on-chip stochastic weight update technique, which keeps the recognition accuracy of the chip almost unchanged while greatly reducing chip area by storing the synaptic weights w at low precision. The circuit related to the on-chip stochastic weight update is shown in the lower right corner of fig. 5. After the 16-bit Δw_ij is obtained, its lower 8-bit portion Δw_ij[7:0] is first compared by the third comparator 11 with the 8-bit unsigned random number R generated by the 8-bit linear feedback shift register 10 (8b LFSR); if Δw_ij[7:0] ≥ R, a carry is generated into the upper 8-bit portion Δw_ij[15:8] of Δw_ij; if Δw_ij[7:0] < R, Δw_ij[7:0] is discarded.
The upper 8-bit portion Δw_ij[15:8] of the updated Δw_ij is then added, by the fourth adder 12 and the fifth adder 13, to the old synaptic weight w read from the neuron state and weight memory to obtain the updated synaptic weight w, which is written back into the neuron state and weight memory, completing the on-chip learning process. Furthermore, since the pulse timestamps in this design satisfy t_i > 0, if a neuron has t_i = 0 during the weight update, it did not transmit a pulse, and the relevant synapse will skip the weight update process.
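Formulas (2)-(3) and the stochastic rounding step can be modeled end to end in Python. This is a hedged sketch: the fixed-point scaling, the unsigned weight arithmetic, and the function names are illustrative assumptions, not the chip's exact datapath:

```python
def local_err(B_row, err):
    """Formula (2): local_err(j) = sum_k B(j,k) * err(k). Multiplying by the
    binary elements of B (+/-1) is a conditional negation, which is what the
    16-way 2b x 1b multiplier array exploits."""
    return sum(e if b == 1 else -e for b, e in zip(B_row, err))

def delta_w(exp_term_q8: int, local_err_j: int, U: int) -> int:
    """Formula (3) in fixed point: multiply the LUT exponential term by the
    local error, then right-shift U bits (learning rate lambda = 2**-U)."""
    return (exp_term_q8 * local_err_j) >> U

def stochastic_weight_update(w8: int, dw16: int, rand8: int) -> int:
    """Store-side update: add only the upper 8 bits of the 16-bit dw, with a
    probabilistic carry when the discarded low byte >= an LFSR random byte."""
    hi, lo = (dw16 >> 8) & 0xFF, dw16 & 0xFF
    if lo >= rand8:
        hi += 1  # carry from the discarded low byte
    return (w8 + hi) & 0xFF
```

On average the probabilistic carry preserves the value of the discarded low byte, which is why storing only 8-bit weights costs almost no recognition accuracy.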
By applying the on-chip stochastic weight update technique, the invention uses full-precision updates during weight updating while storing only the upper 8 bits of the weight, which greatly reduces on-chip storage resource consumption, improves on-chip storage resource utilization, and reduces chip area and cost. Compared with directly storing high-precision 16-bit weights, the stochastic update technique saves 1.27 mm² (17.6%) of active silicon area. Furthermore, experiments have shown that applying on-chip stochastic weight updates causes only a very small loss of recognition accuracy.
The invention designs a brain-like chip architecture oriented to edge intelligent systems, with high-speed, high-precision on-chip learning capability and a configurable network structure. It integrates a layered multi-core processing mechanism, a hybrid parallel processing mechanism, a meta-crossbar pulse routing mechanism, a fully event-driven mechanism, and on-chip stochastic weight update learning to maximize the processing speed and energy efficiency of the chip, enabling fast, high-precision on-chip learning of deep SNNs. A prototype chip realized on this architecture was fabricated in a 65 nm CMOS process with a core area of only 7.2 mm²; it supports the operation of up to 4K spiking neurons and 260,000 (256K) synapses, completes fast on-chip deep learning at a low power consumption of 61 mW, and achieves on-chip learning and inference speeds of 802 and 2270 frames per second, respectively, in practical applications. The chip consumes only 3.09 pJ per synaptic operation and only 0.43 uJ per recognized image, exhibiting extremely high area and energy efficiency and great application potential and value in resource-constrained edge intelligent systems.
In this embodiment, the on-chip learning circuit module learns by applying the DeepTempo algorithm [1].
[1] C. Shi, T. Wang, J. He, J. Zhang, L. Liu and N. Wu, "DeepTempo: A Hardware-Friendly Direct Feedback Alignment Multi-Layer Tempotron Learning Rule for Deep Spiking Neural Networks," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 68, no. 5, pp. 1581-1585, May 2021.
Therefore, the embodiments of the present invention are not limited by the above examples, and any other changes, modifications, substitutions, combinations, and simplifications that do not depart from the spirit and principles of the present invention should be made and equivalents should be construed as being included in the scope of the present invention.

Claims (10)

1. A brain-like chip architecture with a configurable network structure, characterized in that: it includes M_1 macronerve computation cores, a meta-crossbar array routing structure, and an output error computation unit; the M_1 macronerve computation cores are respectively connected with the meta-crossbar array routing structure, and the meta-crossbar array routing structure is connected with the output error computation unit;
the impulse event transmitted between the macro nerve computation cores is encoded into AER data packets in AER format, wherein the AER data packets contain addresses and time stamps of neurons transmitting impulse signals;
pulse data packets are transmitted between the macronerve computation cores through the meta-crossbar array routing structure, and the pulse data packets sent out from the macronerve computation cores can be routed to any macronerve computation core.
2. The network architecture configurable brain-like chip architecture of claim 1, wherein: the macronerve computation core contains M_2 parallel micronerve computation cores, each micronerve computation core being responsible for the computation of N_L/M_2 LIF neurons, for accelerating the computation processing task of each layer of the FC-SNN;
the micronerve computation core includes:
a nerve processing unit for updating the state of each LIF neuron and transmitting a pulse;
a weight updating unit for taking charge of receiving the error and updating the neuron synaptic weight;
a neuron state and weight memory for storing neuron state information and synaptic weights, the neuron state and weight memory being connected to the neural processing unit and the weight updating unit, respectively;
and the output AER buffer is used for buffering the pulse data transmitted by the nerve processing unit and is connected with the neuron state and weight memory.
3. The network architecture configurable brain-like chip architecture of claim 2, wherein: the macronerve computation core further includes:
the leakage unit based on the register lookup table is used for calculating and storing intermediate variables required in the process of calculating the membrane potential leakage of the neuron, and is respectively connected with the nerve processing units of all the micro nerve calculation cores in the macro nerve calculation core;
An input AER buffer for buffering input pulse data, the input AER buffer being connected with a leakage unit based on a register lookup table;
an output AER scheduler for scheduling the input/output pulse data, the output AER scheduler being connected to output AER buffers of each of the micro-nerve computation cores within the macro-nerve computation core;
and a routing node connected to the weight updating unit, the input AER buffer, and the output AER scheduler of each of the micronerve computation cores within the macronerve computation core, respectively.
4. A network architecture configurable brain-like chip architecture according to claim 3, wherein: the processing process of the whole chip architecture comprises a forward stage and a reverse stage;
the macronerve computation cores independently complete computation in an event-driven mode in the forward stage, namely, each macronerve computation core processes an input pulse event as soon as the event is received, without synchronized execution;
the macro nerve computation core completes the neuron synaptic weight updating in a synchronous array parallel mode in a reverse phase;
the micro-nerve computation core synchronously operates in an array parallel mode in a forward stage and a reverse stage.
5. The network architecture configurable brain-like chip architecture of claim 1, wherein: the meta-crossbar array routing structure comprises a routing node network, M_1 + 1 AER schedulers, and M_1 + 1 switch connection configuration registers;
the routing node network comprises (M_1 + 1) × (M_1 + 1) routing node switches, wherein the routing node switches of each column are sequentially connected, and the routing node switches of each row are sequentially connected;
each routing node switch in each column is respectively connected with an AER scheduler and a switch connection configuration register for controlling the column through an AND gate, and the switching on and off of the routing node switch is controlled through the switch connection configuration register and the AER scheduler.
6. A network architecture configurable brain-like chip architecture according to claim 3, wherein: the leakage unit based on the register lookup table comprises a first adder (5) and a register-based index lookup table, the first adder (5) being connected with the register-based index lookup table; when AER(i, t) reaches the macronerve computation core, Δt = t - t_pre is computed by the first adder, wherein i represents the source neuron index, t is the pulse timestamp, and t_pre is the pulse timestamp of the last event; the obtained Δt is used as an address to access the pre-initialized register-based index lookup table to obtain the leakage term.
7. The network architecture configurable brain-like chip architecture of claim 6, wherein: the nerve processing unit comprises a first multiplier (1), a second adder (2), a first comparator (3), a fire_flag_regs register and a second comparator (4);
the first multiplier (1) is respectively connected with the second adder (2) and the index lookup table based on a register, the first comparator (3) is connected with the second adder (2), and the fire_flag_regs register is connected with the first comparator (3); the second comparator (4) is connected with the second adder (2);
the first multiplier (1) is used for multiplying the leakage term with the old membrane potential V_m,j of neuron j to obtain the leaked neuron membrane potential; the second adder (2) is used for adding the leaked neuron membrane potential and the synaptic weight w_ij to obtain the updated membrane potential V_m,j of neuron j; the first comparator (3) is used for comparing the updated V_m,j with the threshold V_th and transmitting a pulse aer_out when updated V_m,j > V_th; the address of a neuron that has transmitted a pulse is recorded in the fire_flag_regs register file of the nerve processing unit, the fire_flag_regs register is accessed before each pulse transmission, and the pulse is allowed to be transmitted if the current neuron j has not yet transmitted a pulse;
when each event arrives and the membrane potential is updated, the second comparator (4) judges whether the current membrane potential is greater than the previously stored maximum membrane potential V_max,j; if it is greater, V_max,j is updated, the current timestamp is synchronously updated into t_max,j, and both are stored in the neuron state and weight memory.
8. The network architecture configurable brain-like chip architecture of claim 1, wherein: the output error calculation unit comprises an output layer neuron error calculation module (14) and an output error register connected with the output layer neuron error calculation module (14);
the output layer neuron error calculation module (14) calculates the error of each output layer neuron, stores the error in an output error register and broadcasts the error to the micro-nerve calculation core of each macro-nerve calculation core after all the output errors are calculated.
9. The network architecture configurable brain-like chip architecture of claim 8, wherein: the weight updating unit comprises a 16-way 2b multiplied by 1b low-cost multiplier array, a single-period 16-to-1 adder tree and a local error accumulation register;
The 16-way 2b multiplied by 1b low-cost multiplier array is connected with the single-period 16-to-1 adder tree and is used for realizing multiply-add operation;
the local error accumulation register is connected with the single-cycle 16-to-1 adder tree and is used for storing the intermediate results of the multiply-add operations; after all the multiply-add operations are completed, the value in the local error accumulation register is the local error local_err(j) of the current neuron j.
10. The network architecture configurable brain-like chip architecture of claim 9, wherein: the weight updating unit further comprises a third adder (7), a second multiplier (8), a shifter (9), an 8-bit linear feedback shift register (10), a third comparator (11), a fourth adder (12) and a fifth adder (13);
the third adder (7) is connected with the register-based index lookup table; the second multiplier (8) is respectively connected with the register-based index lookup table, the local error accumulation register, and the shifter (9); the third adder (7) subtracts the pulse timestamp t_i from the membrane potential maximum timestamp t_max,j of neuron j to obtain Δt; Δt is used as an address to access the register-based index lookup table to obtain the exponential term, which is multiplied by the local error local_err(j) of the hidden layer neuron by the second multiplier (8) and right-shifted by U bits to obtain the 16-bit weight update amount Δw_ij;
the third comparator (11) is respectively connected with the shifter (9) and the 8-bit linear feedback shift register (10), and the fourth adder (12) is respectively connected with the shifter (9), the third comparator (11), and the fifth adder (13); after the 16-bit weight update amount Δw_ij is obtained, its lower 8 bits Δw_ij[7:0] are compared by the third comparator (11) with the 8-bit unsigned random number R generated by the 8-bit linear feedback shift register (10), and the upper 8 bits Δw_ij[15:8] of the updated Δw_ij are added, by the fourth adder (12) and the fifth adder (13), to the old weight w read from the neuron state and weight memory to obtain the updated weight w, which is written into the neuron state and weight memory.
CN202311248937.6A 2023-09-25 2023-09-25 Brain-like chip architecture with configurable network structure Pending CN117273100A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311248937.6A CN117273100A (en) 2023-09-25 2023-09-25 Brain-like chip architecture with configurable network structure


Publications (1)

Publication Number Publication Date
CN117273100A true CN117273100A (en) 2023-12-22



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination