CN112631983B - Sparse neural network-oriented system-on-chip - Google Patents

Sparse neural network-oriented system-on-chip

Info

Publication number
CN112631983B
CN112631983B
Authority
CN
China
Prior art keywords
sparse
vector
neural network
calculation
coprocessor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011576461.5A
Other languages
Chinese (zh)
Other versions
CN112631983A (en)
Inventor
黄乐天 (Huang Letian)
明小满 (Ming Xiaoman)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202011576461.5A priority Critical patent/CN112631983B/en
Publication of CN112631983A publication Critical patent/CN112631983A/en
Application granted granted Critical
Publication of CN112631983B publication Critical patent/CN112631983B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; Data processing equipment in general
    • G06F 15/16 - Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163 - Interprocessor communication
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a sparse neural network-oriented system-on-chip, which comprises a main processor, a coprocessor, a system slave device and an off-chip memory, wherein the main processor is in communication connection with the coprocessor; the system slave device is in communication connection with the main processor; and the off-chip memory is connected with the system slave device through a slave interface. The main processor decomposes matrix calculation in a neural network algorithm into vector calculation, executes a program that screens and reorganizes the sparse vectors participating in calculation in the sparse neural network, converts those sparse vectors into dense vectors, and sends an acceleration instruction to the coprocessor, which executes the accelerated calculation of the dense vectors. The invention solves the problems of limited accelerator parallelism and low processor utilization efficiency in such systems. Direct-index storage of the sparse neural network solves the difficulty of data multiplexing in the accelerator caused by the uncertain positions of non-zero elements.

Description

Sparse neural network-oriented system-on-chip
Technical Field
The invention relates to the field of communication, in particular to a sparse neural network-oriented system-on-chip.
Background
Neural network sparsification greatly reduces the computation and data storage required by the algorithm, which facilitates deploying large-scale neural network algorithms on embedded devices with limited storage, computing resources, and energy budget. However, the data irregularity caused by sparsification makes sparse networks very inefficient to execute on general-purpose platforms. To let hardware execute sparse neural network algorithms more efficiently, researchers have designed various sparse neural network accelerators, and research on sparse-neural-network-oriented systems-on-chip is gradually developing.
The existing sparse neural network accelerators and systems-on-chip have several problems:
a. the accelerator's computational parallelism and memory access efficiency are constrained by the network's sparsification pattern;
b. the uncertainty of non-zero element positions in a sparse network makes data multiplexing in the accelerator difficult;
c. a dedicated compiler must be designed for the accelerator, so generality is poor.
In existing systems-on-chip, CPU utilization is too low and the overall acceleration effect of the system is mediocre.
Disclosure of Invention
In order to solve the problems, the invention provides a sparse neural network-oriented system on a chip, which is realized by the following technical scheme:
a sparse neural network-oriented system-on-chip is characterized by comprising a main processor, a coprocessor, system slave equipment and an off-chip memory,
the main processor is in communication connection with the main processor;
the system slave device is in communication connection with the main processor;
the off-chip memory slave system interface is in communication connection with the slave system;
wherein, the liquid crystal display device comprises a liquid crystal display device,
the main processor decomposes matrix calculation in a neural network algorithm into vector calculation, executes program screening and reorganizes sparse vectors participating in calculation in the sparse neural network, converts the sparse vectors in the sparse neural network into dense vectors, sends an acceleration instruction to the coprocessor, and executes the acceleration calculation of the dense vectors by the coprocessor.
The main processor comprises an open source processor, a data path, a first-level data cache and a first-level instruction cache, wherein:
the main processor is used for decomposing matrix calculation in a neural network algorithm into vector calculation, expanding the neuron matrix and the weight matrix of the sparse neural network into original sparse vectors, and screening and reorganizing the sparse vectors participating in calculation in the sparse neural network;
the data path is used for data interaction between the main processor and the coprocessor;
the first-level data cache is used for storing calculation data of the main processor and the coprocessor;
the first-level instruction cache is used for storing processing instructions of the main processor;
the main processor and the coprocessor share a level one data cache.
Further, the system slave device comprises an SPI Flash controller and a debugging interface UART;
the SPI Flash controller is used for integrating the off-chip memory into the system on chip by utilizing an SPI interface;
the UART is the debugging interface of the system-on-chip and is used for receiving instructions from a host computer or sending instructions to the host computer.
Further, after the system is powered on, the main processor sequentially reads a block of instructions from the off-chip memory through the SPI Flash controller into the first-level instruction cache, and then executes the program that screens and reorganizes the original sparse vectors participating in calculation in the sparse neural network.
Further, the original sparse vector comprises a sparse neuron vector and a sparse weight vector, and the system-on-chip stores, in a direct-index mode, the non-zero elements of the sparse neuron vector and the sparse weight vector together with the index marks corresponding to the sparse neuron vector and the sparse weight vector.
Further, the main processor performs an AND operation on the index marks of the sparse neuron vector and the sparse weight vector to obtain the index mark of the positions, within the original vectors, of the neurons and weights that participate in calculation; it screens the non-zero neuron count vector and the non-zero weight count vector with this index mark to obtain the position coordinates, within the non-zero neuron component vector and the non-zero weight component vector respectively, of the non-zero neurons and non-zero weights that participate in calculation; finally, it indexes the participating non-zero elements from the non-zero neuron component vector and the non-zero weight component vector and reorganizes the indexed non-zero elements, in order, into dense vectors.
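The screening and recombination just described lends itself to a compact software sketch. The following C fragment is only an illustration of the idea under stated assumptions: the struct layout, the per-position count vector, and all names (sparse_vec_t, screen_and_pack) are hypothetical and are not the patent's code.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define VEC_LEN 8                 /* length of one original sparse vector      */

    typedef struct {
        float    nz[VEC_LEN];         /* non-zero component vector                 */
        uint8_t  mask[VEC_LEN];       /* index mark: 1 = non-zero at this position */
        uint32_t count[VEC_LEN];      /* count vector: coordinate in nz[] of the   */
                                      /* element stored at each original position  */
    } sparse_vec_t;

    /* AND the two index marks, use the count vectors to locate the surviving
     * non-zeros, and pack them, in order, into dense vectors.                     */
    static size_t screen_and_pack(const sparse_vec_t *n, const sparse_vec_t *w,
                                  float *dense_n, float *dense_w)
    {
        size_t out = 0;
        for (size_t p = 0; p < VEC_LEN; ++p) {
            if (n->mask[p] & w->mask[p]) {        /* both non-zero at position p   */
                dense_n[out] = n->nz[n->count[p]];
                dense_w[out] = w->nz[w->count[p]];
                ++out;                            /* keep the original order       */
            }
        }
        return out;   /* number of (neuron, weight) pairs handed to the coprocessor */
    }

    int main(void)
    {
        /* Illustrative data only: neurons 0,3,0,5,0,0,2,0 and weights 4,0,0,6,0,1,7,0. */
        sparse_vec_t neu = { {3, 5, 2},    {0,1,0,1,0,0,1,0}, {0,0,0,1,1,1,2,2} };
        sparse_vec_t wgt = { {4, 6, 1, 7}, {1,0,0,1,0,1,1,0}, {0,1,1,1,2,2,3,3} };
        float dn[VEC_LEN], dw[VEC_LEN];

        size_t k = screen_and_pack(&neu, &wgt, dn, dw);
        for (size_t i = 0; i < k; ++i)
            printf("pair %zu: neuron %.0f * weight %.0f\n", i, dn[i], dw[i]);
        return 0;     /* prints 5*6 and 2*7: only positions 3 and 6 survive the AND */
    }

Only pairs whose product can be non-zero survive the AND, so the coprocessor always receives dense, contiguous operands regardless of the network's sparsity pattern.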
Further, when the bit width of the composed dense vector data reaches the bit width of the interface between the main processor and the coprocessor, the main processor sends an acceleration instruction to the coprocessor through a RoCC interface.
Further, the main processor stores the recombined dense vector into a first-level data cache, and the coprocessor directly accesses the dense vector from the first-level cache after receiving an acceleration instruction.
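For reference, a RoCC command is simply a RISC-V custom-opcode instruction issued by the host core. The sketch below shows one plausible way such an instruction might be emitted from C; the funct7 code, the rs1/rs2 semantics (operand base address and pair count), and the ROCC_CMD/start_accel names are assumptions for illustration and are not taken from the patent. It assembles only for a RISC-V target.

    #include <stdint.h>

    #define ACC_FUNC_RUN 0  /* hypothetical "run dense-vector computation" funct7 */

    /* custom-0 opcode = 0x0b; funct3 = 0b011 encodes xs1=1, xs2=1, xd=0
     * (two source registers, no destination register).                           */
    #define ROCC_CMD(funct7, rs1_val, rs2_val)                                \
        __asm__ volatile (".insn r 0x0b, 0x3, %0, x0, %1, %2"                 \
                          :                                                   \
                          : "i"(funct7), "r"(rs1_val), "r"(rs2_val)           \
                          : "memory")

    /* Hand the coprocessor the dense vectors already sitting in the shared
     * L1 data cache: pass their base address and the number of packed pairs.     */
    static inline void start_accel(const void *dense_base, unsigned long n_pairs)
    {
        ROCC_CMD(ACC_FUNC_RUN, (uintptr_t)dense_base, n_pairs);
    }

In the flow described above, the host would issue something like start_accel once the packed data reaches the RoCC interface width, then continue screening the next batch while the coprocessor works.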
Further, the coprocessor comprises a control decoding unit, an address generator, a dot product calculating unit, a vector adding calculating unit, an activation function calculating unit, a pooling operation unit, a maximum value index calculating unit and an encoder, wherein:
the control decoding unit is used for analyzing the RoCC instruction to obtain a coprocessor access memory starting address, and controlling the flow direction of the data flow in the algorithm execution process;
the address generator is used for calculating the memory access address in the execution process of the convolutional neural network algorithm; the convolutional neural network algorithm comprises a convolutional layer algorithm, a full-connection layer algorithm and an output layer algorithm;
the dot product calculation unit, the vector addition calculation unit, the activation function calculation unit and the maximum value index calculation unit adopt a cascade structure and are used for executing corresponding dot product calculation, vector addition calculation, activation function calculation and maximum value index calculation;
the encoder is used for encoding the calculation results of the dot product calculation unit, the vector addition calculation unit, the activation function calculation unit and the pooling operation unit.
Further, when the coprocessor executes a convolution layer algorithm, the data stream is subjected to maximum pooling calculation and then is sent to an encoder for compression, and then non-zero elements and index marks corresponding to the non-zero elements are written into a first-level data cache;
when the full-connection layer algorithm is executed, the data flow is calculated by an activation function calculation unit, then online coding compression is executed, and then non-zero elements and index marks corresponding to the non-zero elements are written into a first-level data cache; the output of the last full-connection layer is compressed and then sent to a maximum value index unit for calculation;
when the output layer algorithm is executed, the calculation result of the maximum value index calculation unit is directly written into the first-level data cache through the memory access request signal interface.
The invention has the following advantages: the main processor in the system preprocesses the sparse neural network and hands dense vectors to the acceleration unit for execution, so the accelerator can execute a sparse neural network of any sparsity, the hardware resources in the accelerator are fully utilized, and data processing efficiency is improved; in addition, the system adopts a heterogeneous system-on-chip structure in which the main processor and the coprocessor share the first-level data cache, so the data interaction delay between the main processor and the coprocessor is short and the memory access bandwidth is increased; and the calculation units of the accelerator are connected in a cascade, which effectively reuses intermediate calculation results, improves data utilization, and reduces the number of data accesses.
Drawings
The accompanying drawings, which are included to provide a further understanding of embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention. In the drawings:
fig. 1 is a schematic diagram of the sparse neural network-oriented system-on-chip architecture of the present invention, where the labels mean: Rocket: the main processor; Datapath: the data path; L1 ICache: the first-level instruction cache; L1 DCache: the first-level data cache; RoCC Interface: the coprocessor (RoCC) interface; CC Exception: signal with which the processor sends an exception to the coprocessor; CC Interrupt: signal with which the coprocessor sends an interrupt to the processor; Core Cmd: signal with which the processor sends instructions to the coprocessor; MEM Req: memory access request signal; MEM Resp: memory access response signal; SPI Flash Controller: the SPI Flash controller; SPI Flash: the external Flash memory with an SPI interface.
Fig. 2 is a schematic diagram of data screening and recombination in the embodiment of the present invention, where the labels mean: 1: sparse neuron vector; 2: non-zero neuron vector; 3: non-zero neuron count vector; 4: index bit string corresponding to the sparse neuron vector; 5: index bit string corresponding to the sparse weight vector; 6: non-zero weight count vector; 7: non-zero weight vector; 8: sparse weight vector; 9: index bit string marking the neurons and weights of vectors 1 and 8 that participate in calculation; 10: positions, within 2, of the non-zero neurons participating in calculation; 11: vector of non-zero neurons participating in calculation; 12: positions, within 7, of the non-zero weights participating in calculation; 13: vector of non-zero weights participating in calculation.
FIG. 3 is a diagram of the coprocessor accelerator architecture in the embodiment of the present invention, where the labels mean: RoCC Interface: the coprocessor interface; RoCC-IN: coprocessor input port; RoCC-OUT: coprocessor output port; Core Cmd, MEM Req, MEM Resp, CC Exception, CC Interrupt: the same signals as in fig. 1; Decoder Controller: the coprocessor decoding control unit; Addr: the coprocessor address generator; Encoder: the encoder; VDP: the vector dot product calculation unit; VA: the vector addition calculation unit; RELU: the activation function calculation unit; MAX Pool: the maximum pooling calculation unit; ArgMax: the maximum value index calculation unit.
Detailed Description
Hereinafter, the terms "comprises" or "comprising" as used in various embodiments of the present invention indicate the presence of the disclosed functions, operations or elements, and do not preclude the addition of one or more further functions, operations or elements. Furthermore, as used in various embodiments of the invention, the terms "comprises," "comprising," and their cognates are intended to refer to a particular feature, number, step, operation, element, component, or combination thereof, and should not be interpreted as excluding the existence or possible addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.
In various embodiments of the invention, the expression "or" or "at least one of A or/and B" includes any or all combinations of the words listed together. For example, the expression "A or B" or "at least one of A or/and B" may include A, may include B, or may include both A and B.
Expressions such as "first" and "second" used in the various embodiments of the invention may modify various constituent elements of the various embodiments, but do not limit those constituent elements. For example, such expressions do not limit the order and/or importance of the elements; they are only intended to distinguish one element from another. For example, a first user device and a second user device indicate different user devices, although both are user devices. Likewise, a first element could be termed a second element, and a second element could be termed a first element, without departing from the scope of various embodiments of the present invention.
It should be noted that when one constituent element is described as being "connected" to another constituent element, the first constituent element may be directly connected to the second constituent element, or a third constituent element may be "connected" between them. Conversely, when one constituent element is "directly connected" to another constituent element, it is understood that no third constituent element exists between the first constituent element and the second constituent element.
The terminology used in the various embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the various embodiments of the invention. As used herein, the singular is intended to include the plural as well, unless the context clearly indicates otherwise. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the invention belong. The terms (such as those defined in commonly used dictionaries) will be interpreted as having a meaning that is the same as the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in connection with the various embodiments of the invention.
For the purpose of making apparent the objects, technical solutions and advantages of the present invention, the present invention will be further described in detail with reference to the following examples and the accompanying drawings, wherein the exemplary embodiments of the present invention and the descriptions thereof are for illustrating the present invention only and are not to be construed as limiting the present invention.
Example 1
In this embodiment, the sparse neural network-oriented system-on-chip includes an open-source Rocket processor as the main processor, a neural network accelerator as the coprocessor, the TileLink system bus, a debug interface UART, an SPI Flash controller, and an off-chip SPI Flash memory. The processor acts as the master device of the system; the neural network accelerator serves as the coprocessor of the main processor and is connected to it through the RoCC interface; the UART and the SPI Flash controller are slave devices of the system and are connected to the main processor through the TileLink bus; the SPI Flash is integrated into the system as off-chip storage through the SPI interface of the SPI Flash controller. The processor is mainly responsible for decomposing the matrix calculations in the neural network algorithm into vector calculations, screening the non-zero data participating in calculation according to the index marks of the neurons and weights, and reorganizing the screened non-zero elements, in order, into contiguously stored dense vectors. Thereafter, the host processor sends acceleration instructions to the coprocessor via the RoCC interface, and the coprocessor performs the accelerated computation of the dense vectors. Note: the neurons and weights are stored in direct-index mode; only the non-zero elements and the index marks of all elements are stored, the index marks are represented as bit strings, a '0' in the bit string indicates a zero element at the current position, and a '1' indicates a non-zero element at the current position.
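As a concrete illustration of this direct-index storage, the short C sketch below compresses a dense vector into its non-zero elements plus an index bit string. The struct layout and names are assumptions made for clarity, not the patent's data structures.

    #include <stddef.h>
    #include <stdint.h>

    #define VEC_LEN 8

    typedef struct {
        float   nz[VEC_LEN];    /* non-zero elements, kept in original order  */
        uint8_t bits[VEC_LEN];  /* index mark bit string: 1 = non-zero here   */
        size_t  n_nz;           /* number of valid entries in nz[]            */
    } direct_index_t;

    /* Example: dense {0, 3, 0, 0, 7, 2, 0, 0}
     *   -> bits {0, 1, 0, 0, 1, 1, 0, 0}, nz {3, 7, 2}, n_nz = 3.            */
    static void encode_direct_index(const float *dense, direct_index_t *out)
    {
        out->n_nz = 0;
        for (size_t p = 0; p < VEC_LEN; ++p) {
            out->bits[p] = (dense[p] != 0.0f);
            if (out->bits[p])
                out->nz[out->n_nz++] = dense[p];  /* store non-zero values only */
        }
    }

The Encoder unit inside the coprocessor performs essentially this compression in hardware on the accelerator's outputs before they are written back to the L1 DCache.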
Example 2
As shown in FIG. 1, this embodiment is a heterogeneous system-on-chip built around a RISC-V open-source processor. The system integrates a Rocket CPU, a coprocessor accelerator, an SPI Flash controller, a UART, and an off-chip SPI Flash. The Rocket CPU serves as the main processor of the system, the accelerator executing the neural network algorithm serves as the coprocessor, the two are tightly coupled through the RoCC interface, and they share the L1 DCache. The SPI Flash is the off-chip program memory; it is externally connected to the SPI Flash controller and stores the binary file of the program. After the system is powered on and reset, the CPU sequentially reads a block of instructions starting from the SPI Flash base address and caches them in the L1 ICache. The CPU then executes the program to screen and reorganize the non-zero neurons and non-zero weights in the sparse neural network, converting the sparse vectors into dense vectors. When the total data bit width of the composed vector reaches the bit width of the RoCC interface, the CPU sends an acceleration instruction to the coprocessor through the Core Cmd channel of the RoCC interface, and the coprocessor executes the neural network algorithm. At the same time, the CPU continues to screen the not-yet-screened sparse neuron and weight vectors, providing the data source for the coprocessor's next round of accelerated calculation. The coprocessor sends a memory access address, a calculation result, and a valid flag signal to the L1 DCache through the MEM Req interface; the main processor returns the data to be calculated to the coprocessor through the MEM Resp interface. When the CPU executes an exception routine, it notifies the coprocessor through the CC Exception signal; the coprocessor performs the vector calculation and interrupts the CPU through CC Interrupt. The UART serves as the debug interface of the system: it can receive instructions from or send instructions to a host computer and is mainly used to assist prototype verification.
The storage, screening, and recombination principle of sparse neurons and weights is shown in fig. 2. The neuron matrix and the weight matrix of the sparse neural network are expanded into original sparse vectors, corresponding to the vectors labeled 1 and 8 in the figure. The invention stores the sparse vectors in direct-index mode, i.e., only the non-zero elements of each sparse vector (the vectors labeled 2 and 7 in fig. 2) and the corresponding index marks (the bit strings labeled 4 and 5 in fig. 2) are stored. A '1' in an index bit string indicates that the element at the current position of the sparse vector is non-zero, and a '0' indicates that it is zero. In the neural network algorithm, a neuron and a weight contribute to the calculation result only when both are non-zero at the corresponding position. Therefore, the CPU needs to screen and reorganize the sparse neurons and weights before handing them to the coprocessor accelerator for computation. To help the CPU locate the indexed non-zero elements in vectors 2 and 7, two additional count vectors (the vectors labeled 3 and 6 in fig. 2) are added to record the coordinates of the non-zero elements. The specific screening and recombination process is divided into three steps, step1, step2, and step3, as marked in fig. 2. step1 performs an AND operation on the index bit string of the sparse neurons and the index bit string of the sparse weights; the resulting bit string (9 in fig. 2) marks the positions, within the original vectors, of the neurons and weights that participate in calculation. step2 uses the bit string obtained in step1 as the selection signal over the vectors labeled 3 and 6: when a bit is '1', the non-zero element at the current coordinate participates in calculation, otherwise it does not. The screening yields the vectors labeled 10 and 12, whose elements are the coordinates, within 2 and 7, of the non-zero elements that finally participate in calculation. step3 uses these coordinates to index the participating non-zero elements from vectors 2 and 7 and reorganizes them, in order, into dense vectors, shown as the vectors labeled 11 and 13 in the figure. The CPU stores the recombined vectors in the L1 DCache so that the coprocessor accesses the data directly from the L1 DCache, saving memory access time.
The neural network algorithm can be decomposed into dot products, vector additions, activation function calculations, pooling operations, and maximum-value index calculations. A convolutional layer in a convolutional neural network comprises dot product, vector addition, activation function calculation, and pooling; a fully connected layer comprises dot product, vector addition, and activation function calculation; the output layer is an argmax layer, whose main operation is maximum-value indexing. For a general convolutional neural network, this embodiment employs the coprocessor accelerator shown in fig. 3. The accelerator is coupled to the main processor through the RoCC interface and includes a control decoding unit (Decoder Controller), an address generator (Addr), a dot product calculation unit (VDP), a vector addition calculation unit (VA), an activation function calculation unit (RELU), a pooling operation unit (MaxPool), a maximum-value index calculation unit (Argmax), and an Encoder. The control decoding unit parses the RoCC instruction to obtain the accelerator's memory access start address and controls the flow direction of the data stream during algorithm execution. The address generator is mainly used for calculating the memory access addresses during algorithm execution. The calculation units in the accelerator adopt a cascade structure: the result produced by one stage is used directly by the next stage, and intermediate results do not need to be written back to the L1 DCache. The accelerator's final output path differs considerably between the convolutional layers, the fully connected layers, and the output layer. When a convolutional layer is executed, the data stream passes through the MaxPool unit and is then handed to the Encoder for compression, after which the non-zero elements and the corresponding index bit strings are written back to the L1 DCache. When a fully connected layer is executed, the data stream undergoes online coding compression after the RELU unit, and the non-zero elements and the corresponding index bit strings are then written back to the L1 DCache; the output of the last fully connected layer is, after compression, handed directly to the Argmax unit for further processing. When the output layer is executed, the result of the Argmax unit is written back to the L1 DCache directly through the MEM Req interface, without online compression.
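To make the cascaded data flow concrete, the following C fragment is a small software reference model of the arithmetic performed by the units in fig. 3. It is a behavioral sketch under stated assumptions (float data, hypothetical function names), not a description of the actual RTL.

    #include <stddef.h>
    #include <math.h>    /* fmaxf for the RELU stage */

    /* VDP: dot product of a dense neuron vector and a dense weight vector. */
    static float vdp(const float *n, const float *w, size_t len)
    {
        float acc = 0.0f;
        for (size_t i = 0; i < len; ++i)
            acc += n[i] * w[i];
        return acc;
    }

    static float va(float x, float bias) { return x + bias; }         /* VA   */
    static float relu(float x)           { return fmaxf(x, 0.0f); }   /* RELU */

    /* MaxPool: maximum over a pooling window of activated values. */
    static float max_pool(const float *win, size_t len)
    {
        float m = win[0];
        for (size_t i = 1; i < len; ++i)
            if (win[i] > m)
                m = win[i];
        return m;
    }

    /* Argmax: index of the largest output of the last fully connected layer. */
    static size_t arg_max(const float *v, size_t len)
    {
        size_t best = 0;
        for (size_t i = 1; i < len; ++i)
            if (v[i] > v[best])
                best = i;
        return best;
    }

    /* One pre-pooling element of a convolutional layer: each stage feeds the
     * next directly, mirroring the cascade, so nothing is written back until
     * the pooled, encoded result leaves the pipeline.                         */
    static float conv_element(const float *n, const float *w, size_t len, float bias)
    {
        return relu(va(vdp(n, w, len), bias));
    }

Only the per-element arithmetic is modeled here; which tail a result takes (MaxPool plus Encoder for convolutional layers, online encoding after RELU for fully connected layers, a direct write of the Argmax result for the output layer) follows the three output paths described above.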
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing description of the embodiments has been provided to illustrate the objects, technical solutions, and advantages of the invention. It describes only specific embodiments and is not intended to limit the scope of the invention to those embodiments; any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the invention are intended to be included within its scope.

Claims (8)

1. A sparse neural network-oriented system-on-chip is characterized by comprising a main processor, a coprocessor, a system slave device and an off-chip memory,
the main processor is in communication connection with the coprocessor;
the system slave device is in communication connection with the main processor;
the off-chip memory is connected with the system slave device through a slave interface;
wherein:
the main processor decomposes matrix calculation in a neural network algorithm into vector calculation, expands a neuron matrix and a weight matrix of a sparse neural network into an original sparse vector, wherein the original sparse vector comprises a sparse neuron vector and a sparse weight vector, and the system on a chip stores non-zero elements in the sparse neuron vector and the sparse weight vector and index marks corresponding to the sparse neuron vector and the sparse weight vector in a direct index mode;
executes a program to screen and reorganize the non-zero neurons and non-zero weights participating in calculation in the sparse neural network, converts the sparse vectors of the sparse neural network into dense vectors, and sends an acceleration instruction to the coprocessor, which executes the accelerated calculation of the dense vectors, specifically as follows: the main processor performs an AND operation on the index marks of the sparse neuron vector and the sparse weight vector to obtain the index mark of the positions, within the original vectors, of the neurons and weights participating in calculation; it screens the non-zero neuron count vector and the non-zero weight count vector with the obtained index mark to obtain the position coordinates, within the non-zero neuron component vector and the non-zero weight component vector respectively, of the non-zero neurons and non-zero weights participating in calculation; it then indexes the finally participating non-zero elements from the non-zero neuron component vector and the non-zero weight component vector and reorganizes the indexed non-zero elements, in order, into dense vectors.
2. The sparse neural network oriented system on chip of claim 1, wherein the main processor comprises an open source processor, a data path, a level one data cache, and a level one instruction cache, wherein:
the data path is used for data interaction between the main processor and the coprocessor;
the first-level data cache is used for storing calculation data of the main processor and the coprocessor;
the first-level instruction cache is used for storing processing instructions of the main processor;
the main processor and the coprocessor share a level one data cache.
3. The sparse neural network oriented system on chip of claim 1, wherein the system slave device comprises an SPI Flash controller and a debug interface UART;
the SPI Flash controller is used for integrating the off-chip memory into the system-on-chip through an SPI interface;
the UART is the debugging interface of the system-on-chip and is used for receiving instructions from a host computer or sending instructions to the host computer.
4. The sparse neural network oriented system on chip of claim 3, wherein after the system is powered on, the main processor sequentially reads a block of instructions from the off-chip memory through the SPI Flash controller into the first-level instruction cache, and executes the program that screens and reorganizes the original sparse vectors participating in calculation in the sparse neural network.
5. A sparse neural network oriented system on chip according to claim 1, wherein when the composed dense vector data bit width reaches the bit width of the interface between the host processor and the coprocessor, the host processor sends an acceleration instruction to the coprocessor over a RoCC interface.
6. The sparse neural network oriented system on chip of claim 5, wherein the host processor stores the reassembled dense vector into a first level data cache, and wherein the coprocessor directly accesses the dense vector from the first level data cache upon receiving an acceleration instruction.
7. The sparse neural network oriented system on chip of claim 1, wherein the co-processor comprises a control decoding unit, an address generator, a dot product calculation unit, a vector addition calculation unit, an activation function calculation unit, a pooling operation unit, a maximum index calculation unit, and an encoder, wherein:
the control decoding unit is used for analyzing the RoCC instruction to obtain a coprocessor access memory starting address, and controlling the flow direction of the data flow in the algorithm execution process;
the address generator is used for calculating the memory access address in the execution process of the convolutional neural network algorithm; the convolutional neural network algorithm comprises a convolutional layer algorithm, a full-connection layer algorithm and an output layer algorithm;
the dot product calculation unit, the vector addition calculation unit, the activation function calculation unit and the maximum value index calculation unit adopt a cascade structure and are used for executing corresponding dot product calculation, vector addition calculation, activation function calculation and maximum value index calculation;
the encoder is used for encoding the calculation results of the dot product calculation unit, the vector addition calculation unit, the activation function calculation unit and the pooling operation unit.
8. The sparse neural network-oriented system on chip of claim 7, wherein when the coprocessor executes a convolutional layer algorithm, the data stream is subjected to maximum pooling calculation and then sent to an encoder for compression, and then the non-zero elements and index marks corresponding to the non-zero elements are written into a first-level data cache;
when the full-connection layer algorithm is executed, the data flow is calculated by an activation function calculation unit, then online coding compression is executed, and then non-zero elements and index marks corresponding to the non-zero elements are written into a first-level data cache; the output of the last full-connection layer is compressed and then sent to a maximum value index unit for calculation;
when the output layer algorithm is executed, the calculation result of the maximum value index calculation unit is directly written into the first-level data cache through the memory access request signal interface.
CN202011576461.5A 2020-12-28 2020-12-28 Sparse neural network-oriented system-on-chip Active CN112631983B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011576461.5A CN112631983B (en) 2020-12-28 2020-12-28 Sparse neural network-oriented system-on-chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011576461.5A CN112631983B (en) 2020-12-28 2020-12-28 Sparse neural network-oriented system-on-chip

Publications (2)

Publication Number Publication Date
CN112631983A CN112631983A (en) 2021-04-09
CN112631983B true CN112631983B (en) 2023-05-02

Family

ID=75325711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011576461.5A Active CN112631983B (en) 2020-12-28 2020-12-28 Sparse neural network-oriented system-on-chip

Country Status (1)

Country Link
CN (1) CN112631983B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111247537A (en) * 2017-10-06 2020-06-05 深立方有限公司 System and method for compact and efficient sparse neural networks
CN108280514A (en) * 2018-01-05 2018-07-13 中国科学技术大学 Sparse neural network acceleration system based on FPGA and design method
CN109409499A (en) * 2018-09-20 2019-03-01 北京航空航天大学 One kind being based on deep learning and the modified track restoration methods of Kalman filtering
CN109543815A (en) * 2018-10-17 2019-03-29 清华大学 The accelerating method and device of neural network
CN109711532A (en) * 2018-12-06 2019-05-03 东南大学 A kind of accelerated method inferred for hardware realization rarefaction convolutional neural networks
CN110378468A (en) * 2019-07-08 2019-10-25 浙江大学 A kind of neural network accelerator quantified based on structuring beta pruning and low bit
CN110874631A (en) * 2020-01-20 2020-03-10 浙江大学 Convolutional neural network pruning method based on feature map sparsification

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Angshuman Parashar et al. "SCNN: An accelerator for compressed-sparse convolutional neural networks". 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 2017, pp. 27-40. *
Lu Yuntao (鲁云涛). "FPGA-based sparse neural network accelerator" (in Chinese). China Master's Theses Full-text Database, Information Science and Technology, 2018, pp. I140-54. *

Also Published As

Publication number Publication date
CN112631983A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
US11544545B2 (en) Structured activation based sparsity in an artificial neural network
EP3607499B1 (en) Neural network processor incorporating separate control and data fabric
US11615297B2 (en) Structured weight based sparsity in an artificial neural network compiler
Liu et al. Neu-NoC: A high-efficient interconnection network for accelerated neuromorphic systems
US11551028B2 (en) Structured weight based sparsity in an artificial neural network
Stornaiuolo et al. On how to efficiently implement deep learning algorithms on pynq platform
KR20150052108A (en) Methods and systems for power management in a pattern recognition processing system
US20200279133A1 (en) Structured Sparsity Guided Training In An Artificial Neural Network
US11238334B2 (en) System and method of input alignment for efficient vector operations in an artificial neural network
Struharik et al. CoNNa–Hardware accelerator for compressed convolutional neural networks
Gerlinghoff et al. E3NE: An end-to-end framework for accelerating spiking neural networks with emerging neural encoding on FPGAs
Aung et al. DeepFire: Acceleration of convolutional spiking neural network on modern field programmable gate arrays
CN112051981B (en) Data pipeline calculation path structure and single-thread data pipeline system
CN112631983B (en) Sparse neural network-oriented system-on-chip
Chen Architecture design for highly flexible and energy-efficient deep neural network accelerators
Tu et al. Neural approximating architecture targeting multiple application domains
CN111506384B (en) Simulation operation method and simulator
CN108334326A (en) A kind of automatic management method of low latency instruction scheduler
US20230161997A1 (en) System and method of early termination of layer processing in an artificial neural network
Wang et al. Research and Implementation of an Embedded Image Classification Method Based on ZYNQ
Andrade et al. Multi-Processor System-on-Chip 1: Architectures
JP2023024960A (en) Optimization of memory usage for efficiently executing neural network

Legal Events

Code - Description
PB01 - Publication
SE01 - Entry into force of request for substantive examination
GR01 - Patent grant