CN109615071B - High-energy-efficiency neural network processor, acceleration system and method

High-energy-efficiency neural network processor, acceleration system and method

Info

Publication number
CN109615071B
CN109615071B
Authority
CN
China
Prior art keywords
data
calculation
convolution
weight data
input data
Prior art date
Legal status
Active
Application number
CN201811592475.9A
Other languages
Chinese (zh)
Other versions
CN109615071A (en)
Inventor
秦刚
姜凯
李朋
Current Assignee
Shandong Inspur Scientific Research Institute Co Ltd
Original Assignee
Shandong Inspur Scientific Research Institute Co Ltd
Priority date
Filing date
Publication date
Application filed by Shandong Inspur Scientific Research Institute Co Ltd
Priority to CN201811592475.9A
Publication of CN109615071A
Application granted
Publication of CN109615071B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Advance Control (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a high-energy-efficiency neural network processor, an acceleration system and a method, which belong to the field of neural network processing devices and aim to solve the technical problem of how to reduce the number of read-write operations of the multipliers and the data memory and accelerate neural network calculation. The processor is a main control chip containing an ARM core and comprises a processor unit and a logic calculation unit, the logic calculation unit being electrically connected with the processor unit through a bus interface. The acceleration system comprises the main control chip and a storage module. The method comprises the following steps: selecting and sorting the weight data; acquiring input data in a data multiplexing mode, and performing convolution and pooling operations on the weight data and the input data with a plurality of PE calculation subunits working in parallel; and acquiring the output data of each PE calculation subunit, performing an addition operation to obtain the final data, and storing the final data in the storage module. The invention can reduce the number of convolution operations and the number of read-write operations on external storage.

Description

High-energy-efficiency neural network processor, acceleration system and method
Technical Field
The invention relates to the field of neural network processing devices, in particular to a high-energy-efficiency neural network processor, an acceleration system and a method.
Background
Deep learning is a major driver of the development of artificial intelligence technology; it uses the topological structure of a deep neural network for training, optimization and inference.
The convolutional neural network is the basis of deep learning. The convolution operation accounts for a large share of the computation of the whole algorithm, requires a large number of multiplier units, and is the bottleneck that limits performance. The method currently adopted is to perform the multiplications and accumulations in parallel, in a structure where the outputs of multiple multipliers are connected to an addition tree. When existing systems and methods perform multiplication and accumulation in parallel, the weight data and the related input data have to be read many times, the wear on the storage units and multipliers is large, and the calculation speed is low.
How to reduce the number of read-write operations of the multipliers and the data memory and accelerate neural network calculation is therefore a technical problem to be solved.
Disclosure of Invention
In view of the above deficiencies, the technical task of the invention is to provide a high-energy-efficiency neural network processor, an acceleration system and a method that reduce the number of read-write operations of the multipliers and the data memory and accelerate neural network calculation.
In a first aspect, an embodiment of the present invention provides an energy-efficient neural network processor, which is a main control chip including an ARM core, and includes:
the processor unit is used for acquiring input data and weight data and generating instruction data according to a model of the neural network;
the logic calculation unit is electrically connected with the processor unit through a bus interface and comprises an instruction FIFO subunit, a data FIFO subunit, a sorting subunit, an addition subunit and a plurality of PE calculation subunits, wherein:
the instruction FIFO subunit is used for realizing FIFO of instruction data and activating a proper number of PE calculation subunits and resources of the PE calculation subunits according to the instruction data;
a data FIFO subunit for implementing FIFO of the weight data and the input data;
the sorting subunit is used for outputting the weight data and the input data in order, based on the principle that weight data which are positive numbers are output first, weight data which are negative numbers are output later, and weight data which are zero are not output;
the PE calculation subunit is used for performing convolution operation and pooling operation on the weight data and the input data and judging whether to automatically terminate the convolution operation;
the PE computing subunits are multiple in number, acquire input data in a data multiplexing mode and perform convolution operation and pooling operation on the weight data and the input data in a parallel computing mode;
and the addition subunit is used for performing addition operation on the data output by the PE calculation subunits.
In this embodiment, the weight data are sorted according to the principle that positive numbers are output first, negative numbers are output later, and zeros are not output, so that the number of convolution operations is reduced when the PE calculation subunits perform convolution; in addition, the input data of the PE calculation subunits are multiplexed, which further reduces the number of read-write operations on the external storage module, thereby reducing external memory accesses and the resources used by the units involved in the internal convolution operation.
Preferably, the PE calculation subunit includes:
the convolution calculation micro units are used for acquiring input data in a data multiplexing mode and carrying out convolution operation on the weight data and the input data in a serial calculation mode;
the activation function, configured as a ReLU function, is used for judging whether to terminate the convolution operation in the convolution calculation micro unit, according to the following principle: if the convolution result of the current M weight data and the related input data in the convolution calculation micro unit is negative, the convolution operation in the convolution calculation micro unit is automatically terminated and zero is output; otherwise, the convolution operation is performed on all N weight data and the related input data and the resulting convolution data, which are positive, are output; wherein M < N and N is the total number of weight data in the convolution calculation micro unit;
and the pooling layer is used for performing pooling operation on the output data of each convolution calculation micro unit.
In this preferred embodiment, the convolution calculation micro units multiplex the input data and the weight data; that is, during the convolution calculation inside each PE calculation subunit, each convolution calculation micro unit forwards its input data to the next convolution calculation micro unit for that unit's convolution calculation, so repeated reads from the external or internal storage area are unnecessary, which reduces the number of read-write operations on the external memory or internal storage area and thereby reduces power consumption. Owing to the characteristics of the activation function, if the first partial convolution results are already negative while a convolution calculation micro unit is working, the remaining convolution calculations can be skipped, which reduces the occupancy of the convolution calculation micro unit.
Preferably, the logic calculation unit further comprises:
the compression/decompression unit is used for compressing the weight data layer by layer, according to the hierarchy of the neural network, with a run-length encoding compression algorithm, or for performing lossless decompression of the compressed weight data.
In this preferred embodiment, the compression of the weight data is performed by the compression/decompression unit and the compressed weight data are stored in the external memory or the internal buffer, which reduces the required storage space; because the compression of the weight data is lossless, it also reduces the number of read-write operations on external storage and thereby reduces energy consumption.
Preferably, the logic calculation unit further comprises:
and the buffer subunit is used for temporarily storing the weight data, the input data and the instruction data.
Preferably, the bus interface is an AXI interface, which supports DMA data transfer.
Preferably, the main control chip is a zynq chip, a PS end of the zynq chip is used as a processor unit, and a PL end of the zynq chip is used as a logic calculation unit.
In a second aspect, an embodiment of the present invention provides an energy-efficient neural network acceleration system, including:
an energy-efficient neural network processor as in the first aspect; and
the storage module, which is electrically connected with the main control chip and used for storing the weight data and the output data of the addition subunit.
The embodiment provides an energy-efficient neural network acceleration system for performing neural network acceleration calculation.
In a third aspect, an embodiment of the present invention provides an energy-efficient neural network acceleration method, including:
storing the weight data in a storage module;
acquiring weight data and input data, and generating instruction data according to a model of the neural network;
acquiring instruction data in an FIFO mode, and activating a proper number of PE calculation subunits and resources of the PE calculation subunits according to the instruction data;
acquiring input data and weight data in an FIFO mode, and selectively sequencing the weight data and the input data based on the principle that weight data which is positive number is preferentially output, weight data which is negative number is output later, and weight data which is zero is not output;
acquiring input data in a data multiplexing mode, performing convolution operation and pooling operation on the weight data and the input data in a mode of parallel calculation of a plurality of PE calculation subunits, and judging whether to terminate the convolution operation according to the result of the convolution operation in the convolution operation process;
and acquiring the output data of each PE calculation subunit, performing addition operation to obtain final data, and storing the final data in a storage module.
Preferably, the PE calculation subunit performs convolution operation and pooling operation on the weight data and the related input data, and includes:
acquiring input data in a data multiplexing mode, and performing convolution operation on the weight data and the related input data in a mode of serial calculation of a plurality of convolution calculation micro units;
in the process of carrying out convolution operation in the convolution calculation micro unit, judging whether to terminate the convolution operation in the current convolution calculation micro unit through an activation function, wherein the method comprises the following steps: if the convolution operation of the current M weight data and the related input data in the convolution calculation micro unit is negative, automatically stopping the convolution operation in the convolution calculation micro unit and outputting zero; otherwise, carrying out convolution operation on the N weight data and the related input data and outputting convolution data which is positive; wherein M < N, N is the total number of weight data located in the convolution calculation microcell;
and performing pooling operation on the output data of each convolution calculation micro unit.
Preferably, before the weight data is stored in the external storage module, the weight data is compressed according to the level of the neural network and a run length coding compression algorithm, and the compressed weight data is stored in the external storage module;
and before the weight data and the input data are selected and sorted, lossless decompression is carried out on the compressed weight data.
The high-energy-efficiency neural network processor, the acceleration system and the method provided by the invention have the following advantages:
1. The weight data and the input data are sorted by the sorting subunit, so that in the subsequent PE calculation subunits the convolution operation is performed first on the weight data that are positive numbers and the related input data, and then on the weight data that are negative numbers and the related input data; during the convolution operation it can be judged, from the intermediate convolution result, whether to terminate the convolution automatically, which reduces the use of the convolution-related calculation units and accelerates the calculation;
2. The input data are acquired among the PE calculation subunits in a data multiplexing mode, and at the same time the input data are acquired among the convolution calculation micro units inside each PE calculation subunit in a data multiplexing mode, which reduces reads and writes of external storage, lowers power consumption, improves the utilization rate of the input data and increases the calculation speed;
3. The weight data are compressed before being stored, which reduces the occupied storage space.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram of an energy-efficient neural network processor according to embodiment 1;
FIG. 2 is a schematic structural diagram of an improved energy-efficient neural network processor according to embodiment 1;
fig. 3 is a schematic structural diagram of an energy-efficient neural network acceleration system according to embodiment 2;
fig. 4 is a flow chart of an energy-efficient neural network acceleration method according to embodiment 3.
Detailed Description
The present invention is further described below with reference to the accompanying drawings and specific embodiments so that those skilled in the art can better understand the present invention and can implement the present invention, but the embodiments are not intended to limit the present invention, and the embodiments and technical features of the embodiments can be combined with each other without conflict.
It is to be understood that "a plurality" in the embodiments of the present invention means two or more.
The embodiment of the invention provides a high-energy-efficiency neural network processor, an acceleration system and a method, which are used for solving the technical problems of reducing the read-write times of a multiplier and a data memory and accelerating the calculation of a neural network.
Example 1:
as shown in fig. 1, the embodiment provides an energy-efficient neural network processor, which is a main control chip including an ARM core, and includes a processor unit and a logic computation unit, where the processor unit and the logic computation unit are electrically connected through a bus interface.
The processor unit comprises a multi-core ARM processor and is used for acquiring input data and weight data and generating instruction data according to a model of the neural network.
The bus interface is an AXI interface and supports a DMA data transmission mode.
The logic computation unit includes an instruction FIFO subunit, a data FIFO subunit, a sorting subunit, an addition subunit, and a plurality of PE computation subunits.
The instruction FIFO subunit is implemented with a FIFO memory and is connected between the bus interface and the plurality of PE calculation subunits; it is used for realizing FIFO of the instruction data and activates an appropriate number of PE calculation subunits and their resources according to the instruction data.
The data FIFO subunit is implemented with a FIFO memory and is electrically connected between the bus interface and the sorting subunit and between the bus interface and the PE calculation subunits; it is used for realizing FIFO of the weight data and the input data.
The sorting subunit is used for acquiring the weight data from the data FIFO subunit, sorting the weight data and the input data based on the principle that weight data which are positive numbers are output first, weight data which are negative numbers are output later and weight data which are zero are not output, and outputting the weight data and the corresponding input data to the PE calculation subunits in order.
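As a concrete illustration of this sorting rule, the following Python sketch orders (weight, input) pairs so that positive weights come first, negative weights follow and zero weights are dropped; the function name and the list-based data layout are illustrative assumptions and not part of the patented hardware.

def sort_weight_input_pairs(weights, inputs):
    """Order (weight, input) pairs: positive weights first, negative weights
    later, zero weights dropped (their products contribute nothing)."""
    pairs = list(zip(weights, inputs))
    positives = [(w, x) for w, x in pairs if w > 0]
    negatives = [(w, x) for w, x in pairs if w < 0]
    # Zero-valued weights are simply not forwarded to the PE calculation subunits.
    return positives + negatives


ordered = sort_weight_input_pairs([3, -2, 0, 5], ["a", "b", "c", "d"])
print(ordered)  # [(3, 'a'), (5, 'd'), (-2, 'b')]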
The PE calculation subunit is used for performing convolution and pooling operations on the weight data and the input data and for judging, according to the intermediate result of the convolution operation, whether to terminate the convolution automatically.
In this embodiment, each PE calculation subunit includes N convolution calculation micro units, an activation function, and a pooling layer.
The N convolution calculation micro units have the following functions: acquiring the sorted weight data and the related input data from the sorting subunit, acquiring the input data among the N convolution calculation micro units in a data multiplexing mode, and performing the convolution operation on the weight data and the input data in a serial calculation mode. The N convolution calculation micro units acquire the input data in a data multiplexing mode as follows:
Each convolution calculation micro unit acquires part of the input data for its convolution operation, and the convolution calculation micro units perform their convolution operations serially in sequence. After the convolution operation of the part of the input data currently held by a convolution calculation micro unit and the weight data is finished, the micro unit registers this part of the input data into the adjacent (next) convolution calculation micro unit, so that the input data are shared among the convolution calculation micro units in a data multiplexing mode and the number of data fetches from the outside is reduced.
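This forwarding scheme can be pictured with the simplified software model below: a one-dimensional, output-stationary convolution in which each micro unit computes one output position and every input sample is fetched from external storage once, then only registered onward to the next unit. The 1-D formulation, the function name and the scheduling details are assumptions made purely for illustration.

def systolic_conv1d(x, w, n_units=None):
    """Output-stationary model of serial convolution calculation micro units.

    Unit u computes y[u] = sum_k w[k] * x[u + k].  Each input sample enters
    the chain once; afterwards it is only forwarded (registered) from one
    micro unit to the next, modelling the data multiplexing in the text.
    """
    K = len(w)
    n_units = n_units or len(x) - K + 1
    held = [None] * n_units      # sample currently registered in each unit
    skip = list(range(n_units))  # unit u ignores its first u samples
    step = [0] * n_units         # index of the weight each unit applies next
    y = [0] * n_units

    for sample in list(x) + [None] * (2 * n_units):   # pad so late units drain
        held = [sample] + held[:-1]                   # forward to the next unit
        for u in range(n_units):
            if held[u] is None:
                continue
            if skip[u] > 0:                           # x[u] has not reached unit u yet
                skip[u] -= 1
            elif step[u] < K:
                y[u] += w[step[u]] * held[u]
                step[u] += 1
    return y


print(systolic_conv1d([1, 2, 4, 8, 16], [1, 0, -1]))  # [-3, -6, -12]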
The activation function, configured as a ReLU function, is used for judging whether to terminate the convolution operation. The judgment principle is: if the convolution result of the current M weight data and the related input data in the convolution calculation micro unit is negative, the convolution operation is automatically terminated and zero is output; otherwise the convolution operation is performed on all N weight data and the related input data and the resulting convolution data, which are positive, are output; where M < N and N is the total number of weight data output by the sorting subunit. The reasoning is as follows: in each convolution calculation micro unit the positive weight data are input first and the negative weight data afterwards, i.e. during the convolution calculation the positive weight data and the corresponding input data are processed first, and only after all positive weight data and the corresponding input data have been processed are the negative weight data and the corresponding input data processed. If, after all positive weight data and part of the negative weight data have been accumulated, the partial convolution result is already negative, the final result is certain to be negative; because of the characteristics of the activation function, the subsequent convolution calculations are therefore terminated automatically, the convolution operation of the micro unit ends, and zero is output.
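The early-termination rule is illustrated by the sketch below. It implicitly relies on the input activations being non-negative (as they are, for example, after a preceding ReLU layer), so that once all positive weights have been accumulated the running sum can only decrease; the function name and data layout are illustrative assumptions rather than the hardware implementation.

def conv_with_relu_early_stop(sorted_weights, sorted_inputs):
    """Multiply-accumulate over weights that arrive already sorted: positive
    weights first, negative weights later, zero weights removed.  Stop as soon
    as the sign of the final result is decided."""
    acc = 0.0
    positives_done = False
    for m, (w, x) in enumerate(zip(sorted_weights, sorted_inputs), start=1):
        if w < 0:
            positives_done = True
        acc += w * x
        # With non-negative inputs, the remaining (negative-weight) terms can
        # only lower the sum, so a negative partial sum fixes the final sign.
        if positives_done and acc < 0:
            return 0.0, m          # ReLU output is zero; only m MACs were spent
    return max(acc, 0.0), len(sorted_weights)


out, macs = conv_with_relu_early_stop([3, 2, -4, -5], [1.0, 1.0, 2.0, 1.0])
print(out, macs)   # 0.0 3  -> the fourth multiply-accumulate was skipped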
And the pooling layer is used for performing pooling operation on the output data of each convolution calculation micro unit.
There are a plurality of PE calculation subunits, and they have the following functions: acquiring input data in a data multiplexing mode and performing convolution and pooling operations on the weight data and the input data in a parallel calculation mode. The working mode of the plurality of PE calculation subunits can be understood as follows:
Each PE calculation subunit acquires part of the input data and performs the convolution and pooling operations on it with the weight data; the PE calculation subunits work synchronously in parallel. After the last convolution operation of the part of the input data currently held by a PE calculation subunit and the weight data is finished, the subunit registers this part of the input data into the adjacent (next) PE calculation subunit, so that the input data are shared among the PE calculation subunits in a data multiplexing mode and the number of data fetches from the outside is reduced.
The addition subunit is electrically connected with the plurality of PE calculation subunits and is used for acquiring the pooled data from the PE calculation subunits and performing an addition operation to obtain the final data.
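One way to picture how the parallel PE calculation subunits and the addition subunit cooperate is the partitioned sketch below, in which each PE accumulates a partial sum over its share of the data and the addition subunit combines the per-PE results; the partitioning scheme, the function names and the omission of pooling are illustrative assumptions rather than the patented implementation.

import numpy as np

def pe_partial_sum(x_slice, w_slice):
    """One PE calculation subunit: multiply-accumulate over its share of the
    data (pooling is omitted here for brevity)."""
    return float(np.sum(x_slice * w_slice))

def accelerate(x, w, n_pe=4):
    """Split the data across n_pe PE subunits working 'in parallel' and let
    the addition subunit combine their partial results into the final value."""
    chunks_x = np.array_split(x, n_pe)
    chunks_w = np.array_split(w, n_pe)
    partials = [pe_partial_sum(cx, cw) for cx, cw in zip(chunks_x, chunks_w)]
    return sum(partials)                    # the addition subunit

x = np.arange(16, dtype=np.float32)
w = np.ones(16, dtype=np.float32)
print(accelerate(x, w))                     # 120.0, identical to np.dot(x, w)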
The high-energy-efficiency neural network processor can be used together with an external storage module to accelerate neural network calculation.
As shown in fig. 2, as a further improvement of this embodiment, the logic calculation unit further includes a compression/decompression unit, which is configured to compress the weight data layer by layer, according to the hierarchy of the neural network, with a run-length encoding compression algorithm, or to losslessly decompress the compressed weight data.
In the neural network acceleration application, the compression/decompression unit works as follows: before the neural network calculation, the weight data of the trained neural network are compressed with a run-length encoding compression algorithm, layer by layer according to the hierarchy of the neural network, which saves storage space. Before the weight data and the input data are operated on, the compressed weight data are losslessly decompressed.
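As an illustration of this layer-wise run-length coding, the sketch below encodes a flat list of weights (which, when sparse, contain long runs of zeros) into (value, run_length) pairs and restores it losslessly; the exact on-chip coding format is not specified in the text, so this pair format is an assumption.

def rle_compress(weights):
    """Run-length encode a sequence of weights as (value, run_length) pairs.
    Long runs of zeros in sparse weight data compress particularly well."""
    if not weights:
        return []
    runs = []
    current, count = weights[0], 1
    for w in weights[1:]:
        if w == current:
            count += 1
        else:
            runs.append((current, count))
            current, count = w, 1
    runs.append((current, count))
    return runs

def rle_decompress(runs):
    """Losslessly restore the original weight sequence."""
    return [value for value, count in runs for _ in range(count)]

layer_weights = [0, 0, 0, 3, 3, 0, 0, 0, 0, -1]
encoded = rle_compress(layer_weights)
print(encoded)                              # [(0, 3), (3, 2), (0, 4), (-1, 1)]
assert rle_decompress(encoded) == layer_weights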
As a further improvement of this embodiment, the logic calculation unit further includes a buffer subunit, which is configured to temporarily store the weight data, the input data and the instruction data. The buffer subunit acquires the weight data, the input data and the instruction data through the bus interface, transmits the instruction data to the instruction FIFO subunit, and transmits the weight data and the input data to the sorting subunit.
In this embodiment, the main control chip is a zynqMP chip; the PS side of the zynqMP chip serves as the processor unit and the PL side serves as the logic calculation unit.
Example 2:
as shown in fig. 3, the present embodiment provides an energy-efficient neural network acceleration system, which includes a storage module and the energy-efficient neural network processor disclosed in embodiment 1. The storage module is used for storing the weight data and the result data output by the addition subunit; the weight data may be stored either in compressed or in decompressed form.
The energy-efficient neural network acceleration system provided by the embodiment can realize acceleration of neural network calculation.
Example 3:
as shown in fig. 4, the present embodiment provides an energy-efficient neural network acceleration method, which is implemented based on the energy-efficient neural network acceleration system disclosed in embodiment 2, and includes the following steps:
S100, storing the weight data in an external storage module;
S200, acquiring weight data and input data through the processor unit, and generating instruction data according to the model of the neural network;
S300, acquiring instruction data through the instruction FIFO subunit, realizing FIFO of the instruction data, and activating a proper number of PE calculation subunits and resources of the PE calculation subunits according to the instruction data;
acquiring input data and weight data through the data FIFO subunit, realizing FIFO of the input data and the weight data, and selectively sorting the weight data and the input data through the sorting subunit based on the principle that weight data which are positive numbers are output first, weight data which are negative numbers are output later, and weight data which are zero are not output;
S400, acquiring input data in a data multiplexing mode, performing convolution and pooling operations on the weight data and the input data in a mode of parallel calculation of a plurality of PE calculation subunits, and judging, during the convolution operation, whether to terminate the convolution according to the intermediate result;
S500, acquiring the output data of each PE calculation subunit, performing an addition operation to obtain the final data, and storing the final data in the storage module.
In step S300, the execution steps are preferably:
S310, acquiring instruction data through the instruction FIFO subunit, realizing FIFO of the instruction data, and activating a proper number of PE calculation subunits and resources of the PE calculation subunits according to the instruction data;
S320, acquiring input data and weight data through the data FIFO subunit to realize FIFO of the input data and the weight data;
S330, selecting and sorting the weight data and the input data through the sorting subunit, based on the principle that weight data which are positive numbers are output first, weight data which are negative numbers are output later, and weight data which are zero are not output.
In step S400, the plurality of PE calculation subunits acquire the input data in a data multiplexing mode and perform convolution and pooling operations on the weight data and the input data in a parallel calculation mode. The working mode is: each PE calculation subunit acquires part of the input data and performs convolution and pooling operations on it with the weight data; the PE calculation subunits work synchronously in parallel; after the last convolution operation of the part of the input data currently held by a PE calculation subunit and the weight data is finished, the subunit registers this part of the input data into the adjacent (next) PE calculation subunit, so that the PE calculation subunits share the input data in a data multiplexing mode and the number of data fetches from the outside is reduced.
The PE calculation subunit performs convolution and pooling operations on the weight data and the related input data, including:
S410, acquiring input data in a data multiplexing mode, and performing convolution operation on the weight data and the related input data in a mode of serial calculation of a plurality of convolution calculation micro units;
in the process of carrying out convolution operation in the convolution calculation micro unit, judging whether to terminate the convolution operation in the current convolution calculation micro unit through an activation function, wherein the method comprises the following steps: if the convolution operation of the current M weight data and the related input data in the convolution calculation micro unit is negative, automatically stopping the convolution operation in the convolution calculation micro unit and outputting zero; otherwise, carrying out convolution operation on the N weight data and the related input data and outputting convolution data which is positive number; wherein M < N, N is the total number of weight data in the convolution calculation micro unit;
S420, performing pooling operation on the output data of each convolution calculation micro unit.
In step S410, the plurality of convolution calculation micro units acquire the input data in a data multiplexing mode and perform the convolution operation on the weight data and the input data in a serial calculation mode. The working mode of the convolution calculation micro units is: each convolution calculation micro unit acquires part of the input data for its convolution operation, and the micro units perform their convolution operations serially in sequence; after the convolution operation of the part of the input data currently held by a micro unit and the weight data is finished, the micro unit registers this part of the input data into the adjacent (next) convolution calculation micro unit, so that the convolution calculation micro units share the input data in a data multiplexing mode and the number of data fetches from the outside is reduced.
As a further improvement of this embodiment, before step S100 is executed, i.e. before the weight data are stored in the external storage module, the weight data are compressed by the compression/decompression subunit. The compression step is: according to the hierarchy of the neural network, the weight data are compressed layer by layer with a run-length encoding algorithm, and the compressed weight data are stored in the external storage module. Before step S330 is executed, i.e. before the weight data are selected and sorted, the compressed weight data are losslessly decompressed.
Compressing the weight data in this way saves storage space.
The above embodiments are merely preferred embodiments given to fully illustrate the present invention, and the scope of the present invention is not limited to them. Equivalent substitutions or modifications made by those skilled in the art on the basis of the present invention all fall within the protection scope of the present invention, which is defined by the claims.

Claims (8)

1. An energy-efficient neural network processor is characterized in that the processor is a master control chip comprising an ARM core, and the processor comprises:
the processor unit is used for acquiring input data and weight data and generating instruction data according to a model of the neural network;
the logic calculation unit is electrically connected with the processor unit through a bus interface and comprises an instruction FIFO subunit, a data FIFO subunit, a sequencing subunit, an addition subunit and a plurality of PE calculation subunits, wherein:
the instruction FIFO subunit is used for realizing FIFO of instruction data and activating a proper number of PE calculation subunits and resources of the PE calculation subunits according to the instruction data;
a data FIFO subunit for implementing FIFO of the weight data and the input data;
the sorting subunit is used for outputting the weight data and the input data in sequence based on the principle that the weight data which is positive number is output preferentially, the weight data which is negative number is output later, and the weight data which is zero is not output;
the PE calculation subunit is used for performing convolution operation and pooling operation on the weight data and the input data and judging whether to automatically terminate the convolution operation;
the PE computing subunits are multiple in number, acquire input data in a data multiplexing mode and perform convolution operation and pooling operation on the weight data and the input data in a parallel computing mode;
an addition subunit, configured to perform addition operation on the data output by the plurality of PE calculation subunits;
the PE calculation subunit includes:
the convolution calculation micro units are used for acquiring input data in a data multiplexing mode and carrying out convolution operation on the weight data and the input data in a serial calculation mode;
and the activation function is configured as a relu function and used for judging whether to terminate convolution operation in the convolution calculation micro unit, and the judgment principle is as follows: if the convolution operation of the current M weight data and the related input data in the convolution calculation micro unit is negative, automatically stopping the convolution operation in the convolution calculation micro unit and outputting zero; otherwise, carrying out convolution operation on the N weight data and the related input data and outputting convolution data which is positive; wherein M < N, N is the total number of weight data located in the convolution calculation microcell;
and the pooling layer is used for performing pooling operation on the output data of each convolution calculation micro unit.
2. The energy efficient neural network processor of claim 1, wherein the logic computation unit further comprises:
and the compression/decompression unit is used for compressing the weight data according to the level of the neural network and according to the run-length coding compression algorithm, or performing lossless decompression on the compressed weight data.
3. The energy efficient neural network processor of claim 1, wherein the logic computation unit further comprises:
and the buffer subunit is used for temporarily storing the weight data, the input data and the instruction data.
4. The energy efficient neural network processor of claim 1, wherein the bus interface is an AXI interface that supports DMA data transfer.
5. The neural network processor with high energy efficiency according to claim 1, wherein the main control chip is a zynq chip, a PS end of the zynq chip is used as a processor unit, and a PL end of the zynq chip is used as a logic calculation unit.
6. An energy-efficient neural network acceleration system characterized by comprising:
an energy efficient neural network processor as claimed in any one of claims 1 to 4;
and the storage module is electrically connected with the main control chip and used for storing the weight data and the output data of the addition unit.
7. An energy-efficient neural network acceleration method, characterized by comprising:
storing the weight data in a storage module;
acquiring weight data and input data, and generating instruction data according to a model of the neural network;
acquiring instruction data in an FIFO mode, and activating a proper number of PE calculation subunits and resources of the PE calculation subunits according to the instruction data;
acquiring input data and weight data in an FIFO mode, and selectively sequencing the weight data and the input data based on the principle that weight data which is positive number is preferentially output, weight data which is negative number is output later, and weight data which is zero is not output;
acquiring input data in a data multiplexing mode, performing convolution operation and pooling operation on the weight data and the input data in a mode of parallel calculation of a plurality of PE calculation subunits, and judging whether to terminate the convolution operation according to the result of the convolution operation in the convolution operation process;
acquiring output data of each PE calculation subunit, performing addition operation to obtain final data, and storing the final data in a storage module;
the PE calculation subunit performs convolution and pooling operations on the weight data and the related input data, including:
acquiring input data in a data multiplexing mode, and performing convolution operation on the weight data and the related input data in a serial calculation mode of a plurality of convolution calculation micro units;
in the process of carrying out convolution operation in the convolution calculation micro unit, judging whether to terminate the convolution operation in the current convolution calculation micro unit through an activation function, wherein the method comprises the following steps: if the convolution operation of the current M weight data and the related input data in the convolution calculation micro unit is negative, automatically stopping the convolution operation in the convolution calculation micro unit and outputting zero; otherwise, carrying out convolution operation on the N weight data and the related input data and outputting convolution data which is positive; wherein M < N, N is the total number of weight data located in the convolution calculation microcell;
and performing pooling operation on the output data of each convolution calculation micro unit.
8. The energy-efficient neural network acceleration method of claim 7, characterized in that before the weight data is stored in the external storage module, the weight data is compressed according to the level of the neural network and according to the run length coding compression algorithm, and the compressed weight data is stored in the external storage module;
and before the weight data and the input data are selected and sorted, carrying out lossless decompression on the compressed weight data.
CN201811592475.9A 2018-12-25 2018-12-25 High-energy-efficiency neural network processor, acceleration system and method Active CN109615071B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811592475.9A CN109615071B (en) 2018-12-25 2018-12-25 High-energy-efficiency neural network processor, acceleration system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811592475.9A CN109615071B (en) 2018-12-25 2018-12-25 High-energy-efficiency neural network processor, acceleration system and method

Publications (2)

Publication Number Publication Date
CN109615071A CN109615071A (en) 2019-04-12
CN109615071B (en) 2023-04-18

Family

ID=66011554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811592475.9A Active CN109615071B (en) 2018-12-25 2018-12-25 High-energy-efficiency neural network processor, acceleration system and method

Country Status (1)

Country Link
CN (1) CN109615071B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781896B (en) * 2019-10-17 2022-07-19 暨南大学 Track garbage identification method, cleaning method, system and resource allocation method
CN111027683A (en) * 2019-12-09 2020-04-17 Oppo广东移动通信有限公司 Data processing method, data processing device, storage medium and electronic equipment
CN113627600B (en) * 2020-05-07 2023-12-29 合肥君正科技有限公司 Processing method and system based on convolutional neural network
CN111738427B (en) * 2020-08-14 2020-12-29 电子科技大学 Operation circuit of neural network
WO2024124808A1 (en) * 2022-12-14 2024-06-20 北京登临科技有限公司 Convolution calculation unit, ai operation array, sparse convolution operation method and related device
CN117391148A (en) * 2022-12-14 2024-01-12 北京登临科技有限公司 Convolution calculation unit, AI operation array and related equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
CN107679621B (en) * 2017-04-19 2020-12-08 赛灵思公司 Artificial neural network processing device

Also Published As

Publication number Publication date
CN109615071A (en) 2019-04-12

Similar Documents

Publication Publication Date Title
CN109615071B (en) High-energy-efficiency neural network processor, acceleration system and method
CN108805267B (en) Data processing method for hardware acceleration of convolutional neural network
CN110991632B (en) Heterogeneous neural network calculation accelerator design method based on FPGA
CN106447034A (en) Neutral network processor based on data compression, design method and chip
CN107027036A (en) A kind of FPGA isomeries accelerate decompression method, the apparatus and system of platform
CN102457283A (en) Data compression and decompression method and equipment
CN109284824B (en) Reconfigurable technology-based device for accelerating convolution and pooling operation
CN114723033B (en) Data processing method, data processing device, AI chip, electronic device and storage medium
CN111880911A (en) Task load scheduling method, device and equipment and readable storage medium
CN110543936B (en) Multi-parallel acceleration method for CNN full-connection layer operation
CN111860773B (en) Processing apparatus and method for information processing
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN108647780B (en) Reconfigurable pooling operation module structure facing neural network and implementation method thereof
CN111008698A (en) Sparse matrix multiplication accelerator for hybrid compressed recurrent neural networks
CN113497627A (en) Data compression and decompression method, device and system
CN116167425A (en) Neural network acceleration method, device, equipment and medium
CN114065923A (en) Compression method, system and accelerating device of convolutional neural network
CN111382853B (en) Data processing device, method, chip and electronic equipment
CN111260046B (en) Operation method, device and related product
CN107831824B (en) Clock signal transmission method and device, multiplexing chip and electronic equipment
CN111679788A (en) NAND memory with auxiliary computing function
CN116187408B (en) Sparse acceleration unit, calculation method and sparse neural network hardware acceleration system
CN115796239B (en) Device for realizing AI algorithm architecture, convolution computing device, and related methods and devices
CN115115018A (en) Acceleration system for long and short memory neural network
CN111260070A (en) Operation method, device and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230327

Address after: 250000 building S02, No. 1036, Langchao Road, high tech Zone, Jinan City, Shandong Province

Applicant after: Shandong Inspur Scientific Research Institute Co.,Ltd.

Address before: 250100 First Floor of R&D Building 2877 Kehang Road, Sun Village Town, Jinan High-tech Zone, Shandong Province

Applicant before: JINAN INSPUR HIGH-TECH TECHNOLOGY DEVELOPMENT Co.,Ltd.

GR01 Patent grant