CN111783966A - Hardware device and method of deep convolutional neural network hardware parallel accelerator - Google Patents

Hardware device and method of deep convolutional neural network hardware parallel accelerator

Info

Publication number
CN111783966A
CN111783966A
Authority
CN
China
Prior art keywords: data, hardware, parallel, neural network, convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910269470.0A
Other languages
Chinese (zh)
Inventor
林森
何一波
李珏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xinqi Technology Co ltd
Original Assignee
Beijing Xinqi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xinqi Technology Co ltd filed Critical Beijing Xinqi Technology Co ltd
Priority to CN201910269470.0A priority Critical patent/CN111783966A/en
Publication of CN111783966A publication Critical patent/CN111783966A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a hardware parallel acceleration device and method for a deep convolutional neural network. The device comprises: a specially designed data loading device that extracts and arranges the data required by the algorithm; an array of parallel computing execution units; a specially designed data output device that extracts, arranges and controls the storage of the output of the execution array; other vector and/or scalar special-purpose execution units with their data access devices, which accelerate various other key operators; and a central control device that controls and dispatches these 4 sub-devices through programmable commands. All 5 sub-devices are provided with dedicated storage and buffer areas for compressing and buffering instructions and intermediate data. Starting from hardware support for the operators specific to the application domain, the invention optimizes the computing mechanism of the parallel computing units, forms a computing scheme organized around data scheduling, and optimizes the implementation of an embedded end-side AI acceleration chip.

Description

Hardware device and method of deep convolutional neural network hardware parallel accelerator
Technical Field
The invention belongs to the field of computer hardware and artificial neural network algorithm deployment hardware acceleration, and particularly relates to a hardware parallel operation device and method of an on-chip deep convolutional neural network accelerator.
Background
A deep convolutional neural network is a kind of artificial neural network. It is a machine learning model composed of several specific neuron algorithm layers and hidden layers. Given a set of data tensor inputs, such as tensors formed from image frames, it produces a set of data vector outputs, such as the labels of a classification result, or labels together with coordinates. Each layer performs operations such as feature extraction, activation and sampling on its input data and passes its output to the input of the next layer. Each layer comprises operators, an algorithm structure and calculation parameters, such as convolution kernels or weight parameters, which are selected and learned when a neural network algorithm is trained on a set of data. In an application scenario facing the same kind of data in a specific field, deploying these operators, the algorithm structure and the calculation parameters in a dedicated acceleration chip realizes the function targeted by training, for example outputting an object classification result or related information about an object to be recognized.
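As an illustration of this layer-by-layer processing, the following minimal sketch (hypothetical shapes and values, plain NumPy rather than any particular accelerator) chains a convolution, an activation and a sampling (pooling) step the way one CNN layer would before handing its output to the next layer:

```python
import numpy as np

def conv2d(x, k):
    """Naive 'valid' 2-D convolution of a single-channel image x with kernel k."""
    H, W = x.shape
    K = k.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+K, j:j+K] * k)   # K*K multiplies + (K*K - 1) adds
    return out

def relu(x):
    return np.maximum(x, 0)                            # activation

def max_pool(x, s=2):
    H, W = x.shape
    return x[:H - H % s, :W - W % s].reshape(H // s, s, W // s, s).max(axis=(1, 3))

# one layer: feature extraction -> activation -> sampling; the result feeds the next layer
frame  = np.random.rand(8, 8)          # input tensor (e.g. a small image tile)
kernel = np.random.rand(3, 3)          # trained calculation parameter (convolution kernel)
layer_out = max_pool(relu(conv2d(frame, kernel)))
```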
In recent years, deep convolutional neural networks have been increasingly studied and widely adopted in a variety of application fields. Such networks are mostly composed of convolution layers, together with some other algorithm layers. At the same time, as industries explore artificial neural network algorithms for their application fields, relatively fixed algorithms have begun to emerge as compromises between required computing power, prediction accuracy and precision. Deploying artificial neural network algorithms on the embedded end side has become a broad demand, but factors such as chip performance and cost have become its bottlenecks. Patent document 1 (publication No. CN105488565A) discloses an arithmetic device and method for an acceleration chip that accelerates deep neural network algorithms; it turns operators fully into hardware and instructions, thereby forming an AI instruction set and an accompanying acceleration processor. However, that device and method pursue too much flexibility in fusing and programming computing tasks, the granularity of data processing is small, and the device must be paired with other high-performance data arrangement devices to be effective, so the overall cost of building the chip and the product is high, and its applications tend toward high-cost, high-performance scenarios such as large-scale servers and the cloud side.
Current research in machine learning shows that theoretical algorithms need to be combined with specific fields and industrial scenarios, that individual algorithms need to be recombined and coordinated online during deployment, and that only the results of several algorithms processed together can be applied in practice. A design that keeps the storage cost of a large amount of calculation data in mind, can switch configurations rapidly among several artificial neural network algorithms, and can be configured and trimmed for particular fields and industrial scenarios therefore fits the design concept of an embedded end-side neural network deployment acceleration chip.
Disclosure of Invention
The invention aims to provide a hardware device and a method for a deep convolutional neural network hardware parallel accelerator that deliver reasonable computing power on low-cost, low-power chip process nodes with low main-memory cost, so that several deep convolutional neural networks can be deployed and switched at the same time and common judging, classifying or detecting functions can be completed within the time required by a specific real-time constraint.
The hardware device of the deep convolutional neural network hardware parallel accelerator comprises:
a parallel multiply-and-add computation execution unit array, which mainly performs parallel multiplication, addition and subtraction on matrix and/or vector data and is built from individual neuron hardware modules, each composed of multiplication, addition, subtraction and other basic operators;
a data loading device, which extracts, arranges and inputs the large amount of data required by the deep convolutional neural network algorithm and performs the related control;
a data output device, which extracts, arranges and stores back the output results of the computation execution unit array and performs the related control;
other vector and/or scalar special-purpose computation execution units and their data access devices, which accelerate the execution of the other key operators in the deep convolutional neural network algorithm;
a central control device, which completes the overall control and scheduling of the above 4 sub-devices through programmable instructions;
all 5 sub-devices are provided with intermediate storage and buffer areas that read from and write to the main memory, cache instructions and intermediate data, and support random access. A structural sketch of these sub-devices and their buffers is given below.
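The following structural sketch is only one way of modelling the five sub-devices and their private buffers in software (the names, buffer size and dispatch interface are assumptions for illustration, not the actual register-transfer design):

```python
from dataclasses import dataclass, field

@dataclass
class SubDevice:
    name: str
    buffer: bytearray = field(default_factory=lambda: bytearray(4096))  # private cache/buffer area

@dataclass
class Accelerator:
    mac_array:       SubDevice = field(default_factory=lambda: SubDevice("parallel multiply-add array"))
    data_loader:     SubDevice = field(default_factory=lambda: SubDevice("data loading device"))
    data_writer:     SubDevice = field(default_factory=lambda: SubDevice("data output device"))
    vector_unit:     SubDevice = field(default_factory=lambda: SubDevice("vector/scalar special unit"))
    central_control: SubDevice = field(default_factory=lambda: SubDevice("central control device"))

    def dispatch(self, command):
        """Central control drives the other four sub-devices by programmable commands."""
        for dev in (self.data_loader, self.mac_array, self.vector_unit, self.data_writer):
            # in hardware this would write the relevant fields of `command`
            # into the sub-device's control/configuration registers
            _ = (dev.name, command)
```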
The hardware task deployment and division method of the deep convolutional neural network hardware parallel accelerator gives particular weight to the limited storage resources of embedded chip hardware and follows this design principle:
for the key operators in the deep convolutional neural network algorithm, such as convolution, pooling and activation, and for key tensor operations such as segmentation, fusion and transposition, the operations are not uniformly turned into hardware operators and instruction-level data operators. Instead, according to the way the artificial neural network algorithm layers are realized and the scale of the device chip, the operation data to be processed are divided into specific hierarchies and macro blocks, this division is fused with the corresponding required hardware functions, and only then is the instruction-level packaging designed, as sketched below.
The hardware device of the deep convolutional neural network hardware parallel accelerator has the following configurable characteristics:
a data scheduling method for the parallel computing operators that can be dynamically configured through software programming, including partial-row residence of input data, residence of the weight data representing neuron synapses, and residence of temporary output data;
the arrangement and data format of the input and output data of the parallel computation can be dynamically configured through software programming;
the parallel efficiency and parallel working mode of the parallel computing execution units can be dynamically configured through software programming; for example, differences in convolution kernel sizes and in the data distance between operations are matched with the scheduling method and arrangement configuration above. A sketch of such a configuration record follows.
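The sketch below models the software-programmable configuration as a plain record written before a layer starts (the field names and example values are assumptions; the real device would expose configuration registers):

```python
from dataclasses import dataclass

@dataclass
class LayerConfig:
    kernel_size: int           # e.g. 1, 3, 5, 7 - the parallel working mode follows the kernel
    stride: int                # operation interval / data distance between windows
    input_rows_resident: int   # how many partial input rows stay resident in the load buffers
    weights_resident: bool     # keep the synapse weight data parked in the weight cache
    keep_partial_sums: bool    # leave temporary outputs resident in the output register sets
    data_layout: str           # arrangement of input/output data, e.g. "row-major"

cfg = LayerConfig(kernel_size=3, stride=1, input_rows_resident=3,
                  weights_resident=True, keep_partial_sums=True, data_layout="row-major")
```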
The method of the invention for operating the hardware device of the deep convolutional neural network hardware parallel accelerator follows this design principle:
steps of matrix operators in the deep convolutional neural network algorithm, such as convolution, pooling, full connection and other algorithms that can be converted into matrix operations, are accelerated by the parallel multiply-and-add computation execution unit array;
non-matrix operators in the algorithm that cannot be converted into matrix operations are accelerated by the other vector and/or scalar special-purpose computation execution units and their data access devices;
random access of data between these two kinds of computation execution units and the main memory is completed by their respective specially designed data access devices. A sketch of this dispatch principle follows.
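A sketch of the operator-dispatch principle (the operator names are examples; in practice the mapping would come from the compiled network description):

```python
MATRIX_OPS = {"conv", "pooling", "fully_connected"}   # convertible to matrix operations

def dispatch_target(op_name):
    """Route an operator either to the array or to the vector/scalar special units."""
    if op_name in MATRIX_OPS:
        return "parallel multiply-add execution unit array"
    return "vector/scalar special computation execution unit"

assert dispatch_target("conv") == "parallel multiply-add execution unit array"
assert dispatch_target("softmax") == "vector/scalar special computation execution unit"
```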
In the method by which the hardware device of the deep convolutional neural network hardware parallel accelerator realizes matrix operations, the parallel multiply-and-add computation execution unit array and the local cache registers in the data loading device are not connected with full addressing that would allow access to the entire cache space; instead, combined with the software-configurable working mode of claim 3, they are designed to access only a local cache address region.
The hardware device of the deep convolutional neural network hardware parallel accelerator of the invention and its data scheduling method are characterized in that:
the data input and output devices have dedicated caches inside the on-chip modules and can perform random access with the caches of the other computation execution units;
a main memory of a certain size serves as the main performance buffer and as space for temporarily storing intermediate results;
the central control device comprises a general-purpose central processing unit and a set of expandable high-performance configuration devices, and it schedules at the algorithm-layer and data macro-block levels of different neural networks to rapidly configure the sub-devices, as sketched below.
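Under the same naming assumptions as the structural sketch above, the central control device could walk a network layer by layer and macro block by macro block, re-configuring the sub-devices between steps; the schedule below is a hypothetical illustration, not the device's actual command sequence:

```python
def run_network(accel, layers, feature_map_tiles):
    """Hypothetical schedule: per layer, program the sub-devices, then stream macro blocks."""
    for layer_cfg in layers:                         # algorithm-layer level scheduling
        accel.dispatch(("configure", layer_cfg))     # write the layer configuration to sub-devices
        for tile in feature_map_tiles(layer_cfg):    # data macro-block level scheduling
            accel.dispatch(("load", tile))           # data loading device fetches the block
            accel.dispatch(("compute", tile))        # array / special units execute the operators
            accel.dispatch(("store", tile))          # data output device writes results back
```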
The invention has the following effects:
1. it simplifies the complexity of the connections between the hardware parallel computing unit array and the input device;
2. it simplifies the spatial complexity of arranging data between the output device and the main storage;
3. it simplifies the address calculation complexity of software-configured data and of data macro-block division;
4. it improves the practical application efficiency of the hardware parallel computing unit array;
5. it is better suited to implementation on a low-cost embedded ASIC chip.
Drawings
FIG. 1 is a diagram of the modules and relationships of the hardware devices of the deep convolutional neural network hardware parallel accelerator according to the present invention;
FIG. 2 is a diagram of a convolution-calculation-oriented hardware basic operator execution unit structure;
FIG. 3 is a diagram showing the structure relationship between the hardware operator execution unit array and the output register array;
FIG. 4 is a structural diagram of a hardware operator execution unit array and a weight input unit;
FIG. 5 is a diagram showing the structure relationship between the hardware operator execution unit array and the input register array;
FIG. 6 is a data flow diagram of a hardware device of the deep convolutional neural network hardware parallel accelerator accelerating tensor convolution operation according to the present invention;
FIG. 7 is a diagram of the configuration of the central control unit and the relationship with other sub-units;
FIG. 8 is a diagram of a hardware accelerated execution unit architecture for special vector or scalar operators;
FIG. 9 is a diagram illustrating a method for performing convolutional layer calculation according to the present invention;
FIG. 10 is a flowchart illustrating the steps of convolutional layer calculation according to the present invention.
Description of the reference numerals
1 parallel hardware computing unit array
101 convolution calculation unit
12 parallel output register array
121 output register set
2 input data device
201 input data cache
202 parallel input register array
205 input weight cache
3 output data device
301 output data buffer
4 central control device
401 data and instruction caching for a central control unit
5 data bus
6 main memory controller and main memory
7 control bus device
8 vector/scalar special computation hardware acceleration unit
800 cache in the vector/scalar special computation hardware acceleration unit
801 Special computing hardware operator
802 input DMA
803 output DMA
804 vector/scalar special computing hardware acceleration controller
909 input tensor data
Detailed Description
The hardware device and method of the deep convolutional neural network hardware parallel accelerator of the present invention are further described in detail with reference to the accompanying drawings.
Fig. 1 is a diagram of the modules of the hardware device of the deep convolutional neural network hardware parallel accelerator and their relationships. The device comprises a parallel hardware matrix computing unit array 1, an input data device 2, an output data device 3, a central control device 4, a dedicated high-throughput high-performance data bus 5, a main memory controller and main memory 6, a high-performance control bus device 7 and a vector/scalar special computation hardware acceleration unit 8.
The data devices 2 and 3, the central control device 4 and the computation acceleration device 8 all have local high-speed storage areas inside the sub-devices, such as 201, 301, 401 and 800, which serve as data or instruction caches, intermediate value storage areas, regional data sharing areas, data fusion areas and so on. This reduces the number of main memory accesses and improves operating efficiency.
The apparatus 1 comprises a parallel accelerated computing array made up of a design-determined number of hardware basic operator execution units 101, each of which contains several basic operators, such as the multiplication, addition and subtraction required for convolution computation, or even other operators or designs required for pooling operations. FIG. 2 is a block diagram of a basic operator execution unit: it has two primary inputs and one secondary input, corresponding respectively to the data to be convolved, the convolution kernel, and the local computation result to be accumulated; its single primary output selects the current final result from register set 121. Each unit 101 is fixedly connected to a register set 121 that contains several registers for storing several results of the corresponding execution unit, as shown in FIG. 2. The acceleration method for deploying convolutional neural network computation introduced by the present invention (FIG. 9) describes in detail the principle and use of the several registers in 121 and their correspondence with a specific unit 101. The parallel output array 12 is built from the same design-determined number of units 121.
The invention simplifies the connection structure between array 1 and array 12: as shown in FIG. 2, each unit 101 is fixedly connected to its corresponding register set 121. Assuming that the number of units 101 in array 1 is P, the number of corresponding output register sets is also P. A behavioral sketch of one such unit and its register set follows.
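The sketch below models one basic operator execution unit 101 with its fixedly connected register set 121; the register count and method names are assumptions based only on the port description above (two primary inputs, one secondary accumulation input, one primary output):

```python
class ExecutionUnit:
    """One unit 101: multiply-accumulate with a private output register set (121)."""

    def __init__(self, num_regs=4):
        self.regs = [0.0] * num_regs     # register set 121, fixedly connected to this unit

    def mac(self, data, weight, reg_idx):
        # primary inputs: data to be convolved and the (broadcast) convolution kernel weight;
        # secondary input: the local partial result already held in the selected register
        self.regs[reg_idx] += data * weight
        return self.regs[reg_idx]

    def result(self, reg_idx):
        # primary output: select the currently finished result from register set 121
        return self.regs[reg_idx]
```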
The invention also simplifies the connection structure between array 1 and the weight parameter cache 205. As shown in FIG. 4, the arranged weight parameters are fed into the weight cache, a stack-like storage structure that can output weights on each beat of the working clock in a given working sequence; each output weight is broadcast to one primary input of all units 101 in array 1. Under the foregoing assumption, the weight input is a 1-to-P broadcast.
Under the above assumptions, let the convolution kernel be of size K x K and let the unit of operation count be OP (operations). The theoretical maximum computing power of the arithmetic device of the invention is then (2K^2 - 1) x P / (K^2 + 1) = (2 - 3/(K^2 + 1)) x P OP per cycle; that is, the effective computing power of a single execution unit approaches 2 (at least 1.94 in typical artificial neural network algorithm applications), and the computing power exerted by the parallel execution unit array is at least 1.94 P. To determine the number of parallel execution acceleration units of the artificial neural network accelerator, P is estimated from the target algorithms of the specific field and the computing requirements of common networks, combined with the real-time requirements of the industrial application and the theoretical computing-power interval of the device; the square root R of P is then rounded up to obtain array 1 as an R x R square array. A sketch of this sizing calculation follows.
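A sketch of the computing-power estimate and array-sizing rule stated above; the utilization formula is taken directly from the text, while the clock rate and throughput target in the example are assumptions for illustration:

```python
import math

def peak_ops_per_cycle(K, P):
    """Theoretical maximum: (2*K*K - 1) * P / (K*K + 1) = (2 - 3/(K*K + 1)) * P."""
    return (2 * K * K - 1) * P / (K * K + 1)

def array_side(required_ops_per_s, clock_hz, K):
    """Estimate P from the required throughput, then round sqrt(P) up to get R (R x R array)."""
    per_unit = peak_ops_per_cycle(K, 1) * clock_hz     # OP/s contributed by one execution unit
    P = math.ceil(required_ops_per_s / per_unit)
    return math.ceil(math.sqrt(P))

# e.g. a hypothetical 50 GOP/s target at 500 MHz with 7x7 kernels -> square array 1 of size R x R
R = array_side(50e9, 500e6, 7)
```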
The invention places a parallel input register array 202 between the input data cache 201 and array 1 for relaying the data to be processed.
The invention can multiply performance by running several sets of the hardware device in parallel; the degree of actual performance improvement is limited only by the storage bottleneck. Higher computing performance means larger data throughput requirements, which in turn raises storage cost.
The invention is operational with numerous general purpose or special purpose computing system environments or configurations, such as: embedded end-side artificial neural acceleration chips, heterogeneous processor systems, microprocessor systems, server systems, hand-held or portable devices, consumer electronics devices, industrial control devices, tablet or personal computers, distributed computing platforms comprising the above systems or devices, data centers, and the like.
The present invention may be described in the general context of general and/or extended instructions executed by a central controller, such as a software program. Software programs generally include routines, objects, components, data structures, reference models, and the like that perform particular tasks or implement particular data types.
The hardware device and method of the deep convolutional neural network hardware parallel accelerator of the present invention are described above. It should be understood that the above description is meant to convey the method and core idea of the invention, not to limit its scope; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (7)

1. A hardware apparatus for a deep convolutional neural network hardware parallel accelerator, comprising:
the parallel multiply-and-add execution unit array, which is mainly used to execute parallel multiplication, addition and subtraction on matrix and/or vector data;
the data loading device, which extracts, arranges and inputs the large amount of data required by the deep convolutional neural network algorithm and performs the related control;
the data output device, which extracts, arranges and stores back the output results of the computation execution unit array and performs the related control;
other vector and/or scalar special-purpose computation execution units and their data access devices, which accelerate the execution of the other key operators in the deep convolutional neural network algorithm;
the central control device, which completes the overall control and scheduling of the above 4 sub-devices through programmable instructions;
all 5 sub-devices are provided with intermediate storage and buffer areas that read from and write to the main memory, cache instructions and intermediate data, and support random access.
2. The hardware device of the deep convolutional neural network hardware parallel accelerator as claimed in claim 1, wherein, for the operations of key operators in the deep convolutional neural network algorithm, such as convolution, pooling and activation, and for key tensor operations such as segmentation, fusion and transposition, the operations are not uniformly turned into hardware operators and instruction-level data operators; instead, according to the way the artificial neural network algorithm layers are realized and the scale of the device chip, the operation data to be processed are divided into specific hierarchies and macro blocks, this division is fused with the corresponding required hardware functions, and only then is the instruction-level packaging designed.
3. The hardware apparatus of the deep convolutional neural network hardware parallel accelerator of claims 1-2, comprising:
a data scheduling method for the parallel computing operators that can be dynamically configured through software programming, including partial-row residence of input data, residence of the weight data representing neuron synapses, and residence of temporary output data;
the arrangement and data format of the input and output data of the parallel computation can be dynamically configured through software programming;
the parallel efficiency and parallel working mode of the parallel computing execution units can be dynamically configured through software programming; for example, differences in convolution kernel sizes and in the data distance between operations are matched with the scheduling method and arrangement configuration above.
4. The hardware design method for the hardware device of the deep convolutional neural network hardware parallel accelerator as claimed in claims 1-3, wherein:
the number of hardware parallel execution units is derived from the computing requirements of the common neural network algorithms of the specific field and industrial scenario, converted through the theoretical computing-power interval that the hardware device can reach;
from this computing power and the characteristics of the algorithms, the scales of the other arrays and the data throughput requirements are calculated in the input and output directions, and the cache scale is then derived to complete the prototype shaping of the device.
5. The method for performing operations by a hardware device of a deep convolutional neural network hardware parallel accelerator as claimed in claims 1-3, wherein:
steps of matrix operators in the deep convolutional neural network algorithm, such as convolution, pooling, full connection and other algorithms that can be converted into matrix operations, are accelerated by the parallel multiply-and-add computation execution unit array;
non-matrix operators in the algorithm that cannot be converted into matrix operations are accelerated by the other vector and/or scalar special-purpose computation execution units and their data access devices;
random access of data between these two kinds of computation execution units and the main memory is completed by their respective specially designed data access devices.
6. The method as claimed in claim 5, wherein the parallel multiply-and-add computation execution unit array and the local cache registers in the data loading device are not connected with full addressing that would allow access to the entire cache space; instead, combined with the software-configurable working mode as claimed in claim 3, they are designed to access only a local cache address region.
7. The hardware device of the deep convolutional neural network hardware parallel accelerator and the data scheduling method thereof as claimed in claim 1, wherein:
the data input and output devices have dedicated caches inside the on-chip modules and can perform random access with the caches of the other computation execution units;
a main memory of a certain size serves as the main performance buffer and as space for temporarily storing intermediate results;
the central control device comprises a general-purpose central processing unit and a set of expandable high-performance configuration devices, and it schedules at the algorithm-layer and data macro-block levels of different neural networks to rapidly configure the sub-devices.
CN201910269470.0A 2019-04-04 2019-04-04 Hardware device and method of deep convolutional neural network hardware parallel accelerator Withdrawn CN111783966A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910269470.0A CN111783966A (en) 2019-04-04 2019-04-04 Hardware device and method of deep convolutional neural network hardware parallel accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910269470.0A CN111783966A (en) 2019-04-04 2019-04-04 Hardware device and method of deep convolutional neural network hardware parallel accelerator

Publications (1)

Publication Number Publication Date
CN111783966A true CN111783966A (en) 2020-10-16

Family

ID=72754899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910269470.0A Withdrawn CN111783966A (en) 2019-04-04 2019-04-04 Hardware device and method of deep convolutional neural network hardware parallel accelerator

Country Status (1)

Country Link
CN (1) CN111783966A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113220630A (en) * 2021-05-19 2021-08-06 西安交通大学 Reconfigurable array optimization method and automatic tuning method of hardware accelerator
CN114819129A (en) * 2022-05-10 2022-07-29 福州大学 Convolution neural network hardware acceleration method of parallel computing unit
CN115203126A (en) * 2022-09-15 2022-10-18 太初(无锡)电子科技有限公司 Operator fusion processing method, device, equipment and storage medium
WO2023071509A1 (en) * 2021-10-25 2023-05-04 深圳鲲云信息科技有限公司 Model compilation method and apparatus, and model running system
CN113220630B (en) * 2021-05-19 2024-05-10 西安交通大学 Reconfigurable array optimization method and automatic optimization method for hardware accelerator


Similar Documents

Publication Publication Date Title
US20230153621A1 (en) Arithmetic unit for deep learning acceleration
CN110197276B (en) Data volume engraving device for deep learning acceleration
Fowers et al. A configurable cloud-scale DNN processor for real-time AI
US10871964B2 (en) Architecture for sparse neural network acceleration
Ma et al. Scalable and modularized RTL compilation of convolutional neural networks onto FPGA
CN111897579B (en) Image data processing method, device, computer equipment and storage medium
Chen et al. NoC-based DNN accelerator: A future design paradigm
US9529590B2 (en) Processor for large graph algorithm computations and matrix operations
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
WO2019010183A1 (en) Deep vision processor
CN109121435A (en) Processing unit and processing method
CN107085562B (en) Neural network processor based on efficient multiplexing data stream and design method
EP3975056A1 (en) Neural network weight distribution using a tree direct-memory access (dma) bus
CN111783966A (en) Hardware device and method of deep convolutional neural network hardware parallel accelerator
Sun et al. A high-performance accelerator for large-scale convolutional neural networks
CN109117949A (en) Flexible data stream handle and processing method for artificial intelligence equipment
CN117642721A (en) Partial and additive schedule aware, dynamically reconfigurable adder tree architecture in machine learning accelerators
Wang et al. SOLAR: Services-oriented deep learning architectures-deep learning as a service
Peng et al. Adaptive runtime exploiting sparsity in tensor of deep learning neural network on heterogeneous systems
CN113095476A (en) Hardware acceleration device and method for universal tensor calculation
EP3971787A1 (en) Spatial tiling of compute arrays with shared control
US11704562B1 (en) Architecture for virtual instructions
CN114595813A (en) Heterogeneous acceleration processor and data calculation method
Jahnke et al. Digital simulation of spiking neural networks
Kang et al. A 24.3 μJ/Image SNN Accelerator for DVS-Gesture with WS-LOS Dataflow and Sparse Methods

Legal Events

Date Code Title Description
PB01 Publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20201016