CN112631983A - Sparse neural network-oriented system on chip


Info

Publication number
CN112631983A
Authority
CN
China
Prior art keywords
sparse
neural network
calculation
vector
main processor
Prior art date
Legal status
Granted
Application number
CN202011576461.5A
Other languages
Chinese (zh)
Other versions
CN112631983B (en)
Inventor
黄乐天
明小满
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202011576461.5A
Publication of CN112631983A
Application granted
Publication of CN112631983B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163 Interprocessor communication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a sparse neural network-oriented system on chip, which comprises a main processor, a coprocessor, a system slave device and an off-chip memory. The coprocessor is in communication connection with the main processor; the system slave device is in communication connection with the main processor; the off-chip memory is in communication connection with the system slave device through the interface of the system slave device. The main processor decomposes the matrix calculations in the neural network algorithm into vector calculations, executes a program that screens and recombines the sparse vectors participating in the calculation in the sparse neural network, converts those sparse vectors into dense vectors, and sends an acceleration instruction to the coprocessor; the coprocessor then performs the accelerated calculation on the dense vectors. The invention addresses the limited parallelism and memory-access efficiency of existing accelerators and the low utilization of the processor in such systems. Storing the sparse neural network with direct indexing resolves the difficulty of data reuse inside the accelerator caused by the uncertain positions of the non-zero elements.

Description

Sparse neural network-oriented system on chip
Technical Field
The invention relates to the field of communication, in particular to a sparse neural network-oriented system on a chip.
Background
Neural network sparsification greatly reduces the computational complexity and data storage requirements of the algorithm, which facilitates deploying large-scale neural network algorithms on embedded devices with limited storage, computing resources and energy budget. However, the data irregularity introduced by sparsification makes sparse networks execute very inefficiently on general-purpose platforms. To let hardware execute sparse neural network algorithms better, researchers have designed various sparse neural network accelerators, and system-on-chip research oriented to sparse neural networks has gradually developed as well.
The existing sparse neural network accelerators and systems on chip have several problems:
a. the calculation parallelism and memory-access efficiency of the accelerator are limited by the sparsification pattern of the neural network;
b. the uncertainty of the non-zero positions in the sparse network makes data reuse inside the accelerator difficult;
c. a special compiler needs to be designed for the accelerator, so the universality is poor;
d. in the system on chip, the utilization rate of the CPU is too low, and the overall acceleration effect of the system is mediocre.
Disclosure of Invention
In order to solve the above problems, the present invention provides a sparse neural network-oriented system on chip, which is achieved through the following technical solution:
a sparse neural network-oriented system on chip is characterized by comprising a main processor, a coprocessor, a system slave device and an off-chip memory,
the coprocessor is in communication connection with the main processor;
the system slave device is in communication connection with the main processor;
the off-chip memory is in communication connection with the system slave device through the interface of the system slave device;
wherein:
the main processor decomposes the matrix calculations in the neural network algorithm into vector calculations, executes a program that screens and recombines the sparse vectors participating in the calculation in the sparse neural network, converts those sparse vectors into dense vectors, and sends an acceleration instruction to the coprocessor; the coprocessor then performs the accelerated calculation on the dense vectors.
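For illustration only, the decomposition of a layer's matrix calculation into vector (dot product) calculations can be modelled in software roughly as follows. This is a hedged Python sketch; the function name matvec_as_dot_products is a hypothetical name introduced for the example and is not part of the patent.

    # Hypothetical sketch: decomposing a layer's matrix calculation (weights W
    # applied to neuron vector x) into independent vector dot products, each of
    # which could be handed to a vector coprocessor as one vector calculation.
    def matvec_as_dot_products(W, x):
        """W: list of weight rows, x: neuron vector; returns W @ x row by row."""
        assert all(len(row) == len(x) for row in W)
        result = []
        for row in W:                    # each row becomes one vector calculation
            acc = 0.0
            for w, a in zip(row, x):     # dot product of one weight row with x
                acc += w * a
            result.append(acc)
        return result

    # Example: a 2x3 weight matrix applied to a 3-element neuron vector.
    W = [[1.0, 0.0, 2.0],
         [0.0, 3.0, 0.0]]
    x = [4.0, 5.0, 6.0]
    print(matvec_as_dot_products(W, x))  # [16.0, 15.0]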
The main processor comprises an open source processor, a data path, a first-level data cache and a first-level instruction cache, wherein:
the open source processor is used for decomposing the matrix calculation in the neural network algorithm into vector calculations, expanding the neuron matrix and the weight matrix of the sparse neural network into original sparse vectors, and screening and recombining the sparse vectors;
the data path is used for data interaction between the main processor and the coprocessor;
the first-level data cache is used for storing the calculation data of the main processor and the coprocessor;
the first-level instruction cache is used for storing a processing instruction of the main processor;
the main processor and the coprocessor share a first level data cache.
Furthermore, the system slave device comprises an SPI Flash controller and a debugging interface UART;
the SPI Flash controller is used for integrating the off-chip memory into the system on chip by utilizing an SPI interface;
the UART is a debugging interface of the system on chip and is used for receiving an instruction of an upper computer or sending the instruction to the upper computer.
Further, after the system is powered on, the main processor sequentially reads a block of instructions from the SPI Flash controller, caches it into the first-level instruction cache, and executes the program to screen and recombine the original sparse vectors participating in the calculation in the sparse neural network.
Further, the original sparse vector comprises a sparse neuron vector and a sparse weight vector, and the system on chip stores non-zero elements in the sparse neuron vector and the sparse weight vector and index marks corresponding to the sparse neuron vector and the sparse weight vector in a direct index mode.
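As a rough software illustration of the direct-index format just described (only the non-zero elements and a bit-string index mark are kept), the following Python sketch uses hypothetical helper names and is not the patent's implementation:

    # Sketch of direct-index storage: keep only the non-zero elements plus an
    # index bit string ('1' = non-zero at this position, '0' = zero element).
    def direct_index_encode(vec):
        nonzeros = [v for v in vec if v != 0]
        bits = ''.join('1' if v != 0 else '0' for v in vec)
        return nonzeros, bits

    def direct_index_decode(nonzeros, bits):
        it = iter(nonzeros)
        return [next(it) if b == '1' else 0 for b in bits]

    # Example: a sparse neuron vector.
    vals, idx = direct_index_encode([0, 7, 0, 0, 3, 1])
    print(vals, idx)                       # [7, 3, 1] '010011'
    print(direct_index_decode(vals, idx))  # [0, 7, 0, 0, 3, 1]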
Further, the main processor performs an AND calculation on the index marks of the sparse neuron vector and the sparse weight vector to obtain an index mark of the positions, in the original vectors, of the neurons and weights participating in the calculation; it screens the non-zero neuron count vector and the non-zero weight count vector with the obtained index mark to obtain the position coordinates, in the non-zero neuron vector and the non-zero weight vector respectively, of the non-zero neurons and non-zero weights participating in the calculation; it then indexes, according to those coordinates, the non-zero elements that finally participate in the calculation and recombines the indexed non-zero elements into dense vectors in order.
Further, when the bit width of the formed dense vector data reaches the bit width of an interface between the main processor and the coprocessor, the main processor sends an acceleration instruction to the coprocessor through a RoCC interface.
Further, the main processor stores the recombined dense vectors into a first-level data cache, and the coprocessor directly accesses the dense vectors from the first-level cache after receiving the acceleration instruction.
Further, the coprocessor comprises a control decoding unit, an address generator, a dot product calculation unit, a vector addition calculation unit, an activation function calculation unit, a pooling operation unit, a maximum index calculation unit and an encoder, wherein:
the control decoding unit is used for analyzing the RoCC instruction to obtain a coprocessor access initial address and controlling the flow direction of data flow in the algorithm execution process;
the address generator is used for calculating the access address in the execution process of the convolutional neural network algorithm; the convolutional neural network algorithm comprises a convolutional layer algorithm, a fully-connected layer algorithm and an output layer algorithm;
the dot product calculation unit, the vector addition calculation unit, the activation function calculation unit and the maximum index calculation unit adopt a cascade structure and are used for executing corresponding dot product calculation, vector addition calculation, activation function calculation and maximum index calculation;
the encoder is used for encoding the calculation results of the dot product calculation unit, the vector addition calculation unit, the activation function calculation unit and the pooling operation unit.
Further, when the coprocessor executes a convolutional layer algorithm, the data stream is sent to an encoder for compression after maximum pooling calculation, and then non-zero elements and index marks corresponding to the non-zero elements are written into a first-level data cache;
when the fully-connected layer algorithm is executed, the data stream is calculated by the activation function calculation unit, then online coding compression is executed, and the non-zero elements and the index marks corresponding to the non-zero elements are written into the first-level data cache; the output of the last fully-connected layer is compressed and then sent to the maximum index calculation unit for calculation;
when the output layer algorithm is executed, the calculation result of the maximum index calculation unit is directly written into the first-level data cache through the access request signal interface.
The invention has the advantages that the main processor in the system preprocesses the sparse neural network and hands dense vectors to the acceleration unit for execution, so the accelerator can execute a sparse neural network of any sparsity, the hardware resources in the accelerator can be fully utilized, and the data processing efficiency is improved. In addition, the system adopts a heterogeneous system-on-chip structure in which the main processor and the coprocessor share the first-level data cache, so the data interaction delay between them is short and the memory-access bandwidth is increased. Moreover, the computing units of the accelerator are connected in a cascade, which effectively reuses intermediate computation results, improves the data utilization rate, and reduces the number of data accesses.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a schematic diagram of the sparse neural network-oriented system-on-chip architecture of the present invention, wherein the labels mean: Rocket CPU: the main processor; Datapath: data path; L1 ICache: first-level instruction cache; L1 DCache: first-level data cache; RoCC Interface: coprocessor (RoCC) interface; CC Exception: exception signal sent by the processor to the coprocessor; CC Interrupt: interrupt signal sent by the coprocessor to the processor; Core Cmd: signal by which the processor sends an instruction to the coprocessor; MEM Req: memory access request signal; MEM Resp: memory access response signal; SPI Flash Controller: SPI Flash controller; SPI Flash: off-chip Flash memory externally connected through the SPI interface.
FIG. 2 is a schematic diagram of data screening and reorganization in an embodiment of the present invention, wherein the labels mean: 1: sparse neuron vector; 2: non-zero neuron vector; 3: non-zero neuron count vector; 4: index bit string corresponding to the sparse neuron vector; 5: index bit string corresponding to the sparse weight vector; 6: non-zero weight count vector; 7: non-zero weight vector; 8: sparse weight vector; 9: index bit string marking the neurons and weights of vectors 1 and 8 that participate in the calculation; 10: positions in 2 of the non-zero neurons participating in the calculation; 11: vector of non-zero neurons participating in the calculation; 12: positions in 7 of the non-zero weights participating in the calculation; 13: vector of non-zero weights participating in the calculation.
FIG. 3 is a diagram of the coprocessor accelerator architecture according to an embodiment of the present invention, where the labels mean: RoCC Interface: coprocessor interface; RoCC-IN: coprocessor input port; RoCC-OUT: coprocessor output port; Core Cmd, MEM Req, MEM Resp, CC Exception, CC Interrupt: RoCC interface signals as defined for FIG. 1; Decoder Controller: coprocessor decoding control unit; Addr: coprocessor address generator; Encoder: encoder; VDP: vector dot product calculation unit; VA: vector addition calculation unit; RELU: activation function calculation unit; MAX Pool: max pooling calculation unit; ArgMax: maximum index calculation unit.
Detailed Description
Hereinafter, the term "include" or "may include" used in various embodiments of the present invention indicates the presence of the disclosed function, operation or element, and does not limit the addition of one or more further functions, operations or elements. Furthermore, as used in various embodiments of the present invention, the terms "comprises," "comprising," "includes," "including," "has," "having" and their derivatives are intended to indicate the presence of the stated features, numbers, steps, operations, elements, components, or combinations of the foregoing, and should not be construed as excluding the presence of, or the possibility of adding, one or more other features, numbers, steps, operations, elements, components, or combinations of the foregoing.
In various embodiments of the invention, the expression "or" or "at least one of A or/and B" includes any or all combinations of the words listed together. For example, the expression "A or B" or "at least one of A or/and B" may include A, may include B, or may include both A and B.
Expressions (such as "first", "second", and the like) used in various embodiments of the present invention may modify various constituent elements in various embodiments, but may not limit the respective constituent elements. For example, the above description does not limit the order and/or importance of the elements described. The foregoing description is for the purpose of distinguishing one element from another. For example, the first user device and the second user device indicate different user devices, although both are user devices. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of various embodiments of the present invention.
It should be noted that: if it is described that one constituent element is "connected" to another constituent element, the first constituent element may be directly connected to the second constituent element, and a third constituent element may be "connected" between the first constituent element and the second constituent element. In contrast, when one constituent element is "directly connected" to another constituent element, it is understood that there is no third constituent element between the first constituent element and the second constituent element.
The terminology used in the various embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the various embodiments of the invention. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the present invention belong. The terms (such as those defined in commonly used dictionaries) should be interpreted as having a meaning that is consistent with their contextual meaning in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in various embodiments of the present invention.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Example 1
A sparse neural network-oriented system on chip comprises an open-source Rocket processor as the main processor, a neural network accelerator as the coprocessor, a TileLink system bus, a debugging interface UART, an SPI Flash controller and an off-chip SPI Flash memory. The processor serves as the master device of the system; the neural network accelerator serves as the coprocessor of the main processor and is connected to it through the RoCC interface; the UART and the SPI Flash controller serve as slave devices of the system and are connected to the main processor through the TileLink bus; the SPI Flash serves as off-chip storage and is integrated into the system through the SPI interface of the SPI Flash controller. The processor is mainly responsible for decomposing the matrix calculations in the neural network algorithm into vector calculations, screening the non-zero data participating in the calculation according to the index marks of the neurons and weights, and recombining the screened non-zero elements in order into densely stored vectors. Thereafter, the main processor sends an acceleration instruction to the coprocessor via the RoCC interface, and the coprocessor performs the accelerated computation of the dense vectors. Note: the neurons and weights are stored in a direct-index manner, i.e. only the non-zero elements and the index marks of all elements are stored; the index marks are represented by bit strings, where '0' indicates that the current position holds a zero element and '1' indicates a non-zero element.
Example 2
As shown in FIG. 1, the present embodiment is a heterogeneous system on chip built on a RISC-V open source processor. A Rocket CPU, a coprocessor accelerator, an SPI Flash controller, a UART and an off-chip SPI Flash are integrated in the system. The Rocket CPU serves as the main processor of the system, the accelerator that executes the neural network algorithm serves as the coprocessor, the main processor and the coprocessor are tightly coupled through the RoCC interface, and they share the L1 DCache. The SPI Flash is the off-chip program memory; it is externally connected to the SPI Flash controller and stores the binary file of the program. After the system is powered on and reset, the CPU sequentially reads a block of instructions starting from the SPI Flash base address and caches it into the L1 ICache. The CPU then executes the program to screen and recombine the non-zero neurons and non-zero weights participating in the calculation in the sparse neural network, converting the sparse vectors into dense vectors; when the total bit width of the formed vector reaches the bit width of the RoCC interface, the CPU sends an acceleration instruction to the coprocessor through the Core Cmd interface of the RoCC interface, and the coprocessor executes the neural network algorithm. Meanwhile, the CPU continues screening the remaining sparse neuron vectors and weight vectors, providing the data source for the coprocessor's next round of accelerated calculation. The coprocessor sends the memory access address, the calculation result and a valid flag signal to the L1 DCache through the MEM Req interface; the main processor writes the data to be computed to the coprocessor via the MEM Resp interface. The CPU executes an exception program and notifies the coprocessor of an exception through CC Exception. The coprocessor interrupts the CPU through CC Interrupt after finishing the vector calculation. The UART serves as the debugging interface of the system; it can receive instructions from the upper computer or send instructions to it, and is mainly used to assist prototype verification.
The principle of sparse neuron and weight storage, screening and recombination is shown in FIG. 2. The neuron matrix and the weight matrix of the sparse neural network are unfolded into original sparse vectors, corresponding to the sparse vectors labeled 1 and 8 in the figure. The invention stores the sparse vectors in a direct-index manner, i.e. only the non-zero elements of each sparse vector (the vectors labeled 2 and 7 in FIG. 2) and the corresponding index marks (the bit strings labeled 4 and 5 in FIG. 2) are stored. A '1' in the index bit string indicates that the element at the current position of the sparse vector is non-zero, and a '0' indicates that it is zero. In the neural network algorithm, a neuron and a weight contribute to the calculation result only when both are non-zero at corresponding positions. Therefore, the CPU needs to screen and recombine the sparse neurons and weights before handing them to the coprocessor accelerator for computation. To help the CPU locate the indexed non-zero elements in 2 and 7, two additional count vectors (the vectors labeled 3 and 6 in FIG. 2) are added to record the coordinates of the non-zero elements. The screening and recombination process is divided into three steps, identified in FIG. 2 as step1, step2 and step3. step1 performs an AND operation on the index bit string of the sparse neurons and the index bit string of the sparse weights; the obtained bit string (9 in FIG. 2) indicates the positions, in the original vectors, of the neurons and weights that participate in the calculation. step2 takes the bit string obtained in step1 as the selection signal for the vectors labeled 3 and 6: when a bit is '1', the non-zero element at the current coordinate participates in the calculation; otherwise it does not. The screening yields the vectors labeled 10 and 12, whose elements are the coordinates, in 2 and 7, of the non-zero elements that finally participate in the calculation. step3 indexes those non-zero elements from the vectors 2 and 7 and recombines them in order into dense vectors, such as the vectors labeled 11 and 13 in the figure. The CPU stores the recombined vectors into the L1 DCache so that the coprocessor can access the data directly from the L1 DCache, saving memory-access time.
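The three steps of FIG. 2 can be summarised in software terms roughly as below. This Python sketch is illustrative only; the function and variable names are assumptions introduced for the example, not the patent's implementation.

    # Illustrative sketch of the step1/step2/step3 screening and recombination
    # of FIG. 2 (software model only; the patent performs this on the main
    # processor before dispatching dense vectors to the coprocessor).
    def screen_and_pack(nz_neurons, neuron_bits, nz_weights, weight_bits):
        # Count vectors (labels 3 and 6): for every '1' bit, the position of
        # that non-zero element inside the compacted vectors (labels 2 and 7).
        def count_vector(bits):
            counts, k = [], 0
            for b in bits:
                counts.append(k if b == '1' else None)
                if b == '1':
                    k += 1
            return counts

        n_counts, w_counts = count_vector(neuron_bits), count_vector(weight_bits)

        # step1: AND the index bit strings -> positions where both are non-zero.
        both = ['1' if a == '1' and b == '1' else '0'
                for a, b in zip(neuron_bits, weight_bits)]

        # step2: use the AND result to select coordinates from the count vectors.
        n_pos = [n_counts[i] for i, b in enumerate(both) if b == '1']
        w_pos = [w_counts[i] for i, b in enumerate(both) if b == '1']

        # step3: gather the surviving non-zero elements and pack them densely.
        dense_neurons = [nz_neurons[p] for p in n_pos]
        dense_weights = [nz_weights[p] for p in w_pos]
        return dense_neurons, dense_weights

    # Example: sparse neuron vector [5,0,2,0] and sparse weight vector [3,1,0,4].
    dn, dw = screen_and_pack([5, 2], '1010', [3, 1, 4], '1101')
    print(dn, dw)   # [5] [3]: only position 0 has both a non-zero neuron and weight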
A neural network algorithm can be decomposed into dot products, vector additions, activation function calculations, pooling operations and maximum-index calculations. A convolutional layer in a convolutional neural network comprises dot product, vector addition, activation function calculation and pooling; a fully-connected layer comprises dot product, vector addition and activation function calculation; the output layer is an argmax layer whose main operation is to find the index of the maximum value. For a general convolutional neural network, the present embodiment employs the coprocessor accelerator shown in FIG. 3. The accelerator is coupled to the main processor through the RoCC interface and internally comprises a control decoding unit (Decoder Controller), an address generator (Addr), a dot product calculation unit (VDP), a vector addition calculation unit (VA), an activation function calculation unit (RELU), a pooling operation unit (MaxPool), a maximum index calculation unit (ArgMax) and an Encoder. The control decoding unit parses the RoCC instruction to obtain the accelerator's initial memory-access address and controls the flow direction of the data stream during algorithm execution. The address generator mainly calculates the memory-access addresses during algorithm execution. The computing units in the accelerator adopt a cascade structure: the result produced by one computing unit is used directly by the next, and intermediate results do not need to be written back to the L1 DCache. The final output paths differ considerably between the convolutional layers, the fully-connected layers and the output layer. When a convolutional layer is executed, the data stream passes through the MaxPool computing unit and is then sent to the Encoder for compression, after which the non-zero elements and the corresponding index bit strings are written back to the L1 DCache. When a fully-connected layer is executed, the data stream undergoes online coding compression after passing through the RELU calculation unit, and the non-zero elements and the corresponding index bit strings are then written back to the L1 DCache; notably, the output of the last fully-connected layer is handed, after compression, directly to the ArgMax unit for further processing. When the output layer is executed, the calculation result of the ArgMax unit is written back to the L1 DCache directly through the MEM Req interface, with no online compression needed.
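For clarity, a behavioural model of the cascaded datapath for one convolutional-layer tile (VDP, then VA, RELU, MaxPool and the Encoder) might look as follows in Python. This is an illustrative sketch with assumed data granularity and unit interfaces, not the accelerator's hardware implementation.

    # Behavioural sketch of the cascaded compute units for one convolutional-
    # layer tile: VDP -> VA -> RELU -> MaxPool -> Encoder, with intermediate
    # results passed directly between stages (never written back to L1 DCache).
    def vdp(neurons, weights):                 # vector dot product unit
        return sum(n * w for n, w in zip(neurons, weights))

    def va(partial_sums, bias):                # vector addition unit
        return [p + bias for p in partial_sums]

    def relu(vec):                             # activation function unit
        return [max(0.0, v) for v in vec]

    def max_pool(vec, window=2):               # max pooling unit
        return [max(vec[i:i + window]) for i in range(0, len(vec), window)]

    def encode(vec):                           # encoder: direct-index compression
        return ([v for v in vec if v != 0],
                ''.join('1' if v != 0 else '0' for v in vec))

    # Example tile: four dot products feeding one pooled, encoded output.
    dense_pairs = [([5.0, 3.0], [0.5, 1.0]),   # screened dense neuron/weight vectors
                   ([2.0, 1.0], [1.0, -4.0]),
                   ([0.0, 1.0], [2.0, 2.0]),
                   ([1.0, 1.0], [-1.0, -1.0])]
    sums = [vdp(n, w) for n, w in dense_pairs]
    out = encode(max_pool(relu(va(sums, bias=0.0))))
    print(out)   # non-zero results plus their index bit string, written to L1 DCache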
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A sparse neural network-oriented system on chip is characterized by comprising a main processor, a coprocessor, a system slave device and an off-chip memory,
the coprocessor is in communication connection with the main processor;
the system slave device is in communication connection with the main processor;
the off-chip memory is in communication connection with the system slave device through the interface of the system slave device;
wherein:
the main processor decomposes the matrix calculations in the neural network algorithm into vector calculations, executes a program that screens and recombines the sparse vectors participating in the calculation in the sparse neural network, converts those sparse vectors into dense vectors, and sends an acceleration instruction to the coprocessor; the coprocessor then performs the accelerated calculation on the dense vectors.
2. The sparse neural network-oriented system on a chip of claim 1, wherein the main processor comprises an open source processor, a data path, a level one data cache, and a level one instruction cache, wherein:
the open source processor is used for decomposing the matrix calculation in the neural network algorithm into vector calculations, expanding the neuron matrix and the weight matrix of the sparse neural network into original sparse vectors, and screening and recombining the sparse vectors;
the data path is used for data interaction between the main processor and the coprocessor;
the first-level data cache is used for storing the calculation data of the main processor and the coprocessor;
the first-level instruction cache is used for storing a processing instruction of the main processor;
the main processor and the coprocessor share a first level data cache.
3. The sparse neural network-oriented system on chip of claim 1, wherein the system slave device comprises an SPI Flash controller and a debugging interface UART;
the SPI Flash controller is used for integrating the off-chip memory into the system on chip by utilizing an SPI interface;
the UART is a debugging interface of the system on chip and is used for receiving an instruction of an upper computer or sending the instruction to the upper computer.
4. The sparse neural network-oriented system on a chip of claim 3, wherein after the system is powered on, the main processor sequentially reads a block of instructions from the SPI Flash controller, caches it into the first-level instruction cache, and executes a program to screen and recombine the original sparse vectors participating in the calculation in the sparse neural network.
5. The sparse neural network-oriented system on chip according to claim 4, wherein the original sparse vectors include sparse neuron vectors and sparse weight vectors, and the system on chip stores nonzero elements in the sparse neuron vectors and the sparse weight vectors and index marks corresponding to the sparse neuron vectors and the sparse weight vectors in a direct indexing manner.
6. The sparse neural network-oriented system-on-chip as claimed in claim 5, wherein the main processor performs an AND calculation on the index marks of the sparse neuron vector and the sparse weight vector to obtain an index mark of the positions, in the original vectors, of the neurons and weights participating in the calculation; screens the non-zero neuron count vector and the non-zero weight count vector with the obtained index mark to obtain the position coordinates, in the non-zero neuron vector and the non-zero weight vector respectively, of the non-zero neurons and non-zero weights participating in the calculation; indexes, according to those coordinates, the non-zero elements that finally participate in the calculation; and recombines the indexed non-zero elements into dense vectors in order.
7. The sparse neural network-oriented system on chip of claim 6, wherein the main processor sends an acceleration instruction to the coprocessor via the RoCC interface when the bit width of the composed dense vector data reaches the bit width of the interface between the main processor and the coprocessor.
8. The sparse neural network-oriented system on a chip of claim 7, wherein the host processor stores the reorganized dense vectors in a primary data cache, and the coprocessor directly accesses the dense vectors from the primary data cache after receiving the acceleration instruction.
9. The sparse neural network-oriented system-on-chip of claim 1, wherein the coprocessor comprises a control decoding unit, an address generator, a dot product calculation unit, a vector addition calculation unit, an activation function calculation unit, a pooling operation unit, a maximum index calculation unit, and an encoder, wherein:
the control decoding unit is used for analyzing the RoCC instruction to obtain a coprocessor access initial address and controlling the flow direction of data flow in the algorithm execution process;
the address generator is used for calculating the access address in the execution process of the convolutional neural network algorithm; the convolutional neural network algorithm comprises a convolutional layer algorithm, a fully-connected layer algorithm and an output layer algorithm;
the dot product calculation unit, the vector addition calculation unit, the activation function calculation unit and the maximum index calculation unit adopt a cascade structure and are used for executing corresponding dot product calculation, vector addition calculation, activation function calculation and maximum index calculation;
the encoder is used for encoding the calculation results of the dot product calculation unit, the vector addition calculation unit, the activation function calculation unit and the pooling operation unit.
10. The sparse neural network-oriented system on a chip according to claim 9, wherein when the coprocessor executes a convolutional layer algorithm, a data stream of the convolutional layer algorithm is sent to an encoder for compression after maximum pooling calculation, and then non-zero elements and index marks corresponding to the non-zero elements are written into a first-level data cache;
when the fully-connected layer algorithm is executed, the data stream is calculated by the activation function calculation unit, then online coding compression is executed, and the non-zero elements and the index marks corresponding to the non-zero elements are written into the first-level data cache; the output of the last fully-connected layer is compressed and then sent to the maximum index calculation unit for calculation;
when the output layer algorithm is executed, the calculation result of the maximum index calculation unit is directly written into the first-level data cache through the access request signal interface.
CN202011576461.5A 2020-12-28 2020-12-28 Sparse neural network-oriented system-on-chip Active CN112631983B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011576461.5A CN112631983B (en) 2020-12-28 2020-12-28 Sparse neural network-oriented system-on-chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011576461.5A CN112631983B (en) 2020-12-28 2020-12-28 Sparse neural network-oriented system-on-chip

Publications (2)

Publication Number Publication Date
CN112631983A true CN112631983A (en) 2021-04-09
CN112631983B CN112631983B (en) 2023-05-02

Family

ID=75325711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011576461.5A Active CN112631983B (en) 2020-12-28 2020-12-28 Sparse neural network-oriented system-on-chip

Country Status (1)

Country Link
CN (1) CN112631983B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280514A (en) * 2018-01-05 2018-07-13 中国科学技术大学 Sparse neural network acceleration system based on FPGA and design method
CN109409499A (en) * 2018-09-20 2019-03-01 北京航空航天大学 One kind being based on deep learning and the modified track restoration methods of Kalman filtering
CN109543815A (en) * 2018-10-17 2019-03-29 清华大学 The accelerating method and device of neural network
US20190108436A1 (en) * 2017-10-06 2019-04-11 Deepcube Ltd System and method for compact and efficient sparse neural networks
CN109711532A (en) * 2018-12-06 2019-05-03 东南大学 A kind of accelerated method inferred for hardware realization rarefaction convolutional neural networks
CN110378468A (en) * 2019-07-08 2019-10-25 浙江大学 A kind of neural network accelerator quantified based on structuring beta pruning and low bit
CN110874631A (en) * 2020-01-20 2020-03-10 浙江大学 Convolutional neural network pruning method based on feature map sparsification

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190108436A1 (en) * 2017-10-06 2019-04-11 Deepcube Ltd System and method for compact and efficient sparse neural networks
CN111247537A (en) * 2017-10-06 2020-06-05 深立方有限公司 System and method for compact and efficient sparse neural networks
CN108280514A (en) * 2018-01-05 2018-07-13 中国科学技术大学 Sparse neural network acceleration system based on FPGA and design method
CN109409499A (en) * 2018-09-20 2019-03-01 北京航空航天大学 One kind being based on deep learning and the modified track restoration methods of Kalman filtering
CN109543815A (en) * 2018-10-17 2019-03-29 清华大学 The accelerating method and device of neural network
CN109711532A (en) * 2018-12-06 2019-05-03 东南大学 A kind of accelerated method inferred for hardware realization rarefaction convolutional neural networks
CN110378468A (en) * 2019-07-08 2019-10-25 浙江大学 A kind of neural network accelerator quantified based on structuring beta pruning and low bit
CN110874631A (en) * 2020-01-20 2020-03-10 浙江大学 Convolutional neural network pruning method based on feature map sparsification

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANGSHUMAN PARASHAR et al.: "SCNN: An accelerator for compressed-sparse convolutional neural networks" *
鲁云涛 (Lu Yuntao): "基于FPGA的稀疏神经网络加速器" (FPGA-based sparse neural network accelerator) *

Also Published As

Publication number Publication date
CN112631983B (en) 2023-05-02

Similar Documents

Publication Publication Date Title
US10372653B2 (en) Apparatuses for providing data received by a state machine engine
US11928590B2 (en) Methods and systems for power management in a pattern recognition processing system
US11741014B2 (en) Methods and systems for handling data received by a state machine engine
US11544545B2 (en) Structured activation based sparsity in an artificial neural network
Liu et al. Neu-NoC: A high-efficient interconnection network for accelerated neuromorphic systems
US11615297B2 (en) Structured weight based sparsity in an artificial neural network compiler
US10671295B2 (en) Methods and systems for using state vector data in a state machine engine
US11551028B2 (en) Structured weight based sparsity in an artificial neural network
EP2891053A1 (en) Results generation for state machine engines
Putic et al. Hierarchical temporal memory on the automata processor
CN112631983B (en) Sparse neural network-oriented system-on-chip
Lewis et al. Strategies to minimise the total run time of cyclic graph based genetic programming with GPUs
Li et al. An application-oblivious memory scheduling system for DNN accelerators

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant