The present application claims priority from the previously filed Chinese patent application 201610663201.9, "A Method of Optimizing an Artificial Neural Network", and Chinese patent application 201610663563.8, "A Deep Processing Unit for Implementing an ANN".
Detailed Description
A part of the present application was published in the inventor's academic article "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network" (February 2016). The present application makes further improvements on the basis of that method.
In this application, the improvements to CNNs provided by the present invention are mainly described by taking image processing as an example. Deep Neural Networks (DNNs) and Recurrent Neural Networks (RNNs) can be handled similarly to CNNs.
CNN basic concept
CNN achieves state-of-the-art performance in a wide range of vision-related tasks. To help understand the CNN-based image classification algorithms analyzed in this application, we first introduce the basic concepts of CNN, the Image-Net dataset, and existing CNN models.
As shown in fig. 1(a), a typical CNN consists of a series of layers that run in order.
The parameters of the CNN model are called "weights". The first layer of a CNN reads the input image and outputs a series of feature maps. Each lower layer reads the feature maps generated by the layer above it and outputs new feature maps. The final classifier outputs the probability of each class to which the input image may belong. The CONV layer (convolutional layer) and the FC layer (fully-connected layer) are the two basic layer types in a CNN. The CONV layer is usually followed by a Pooling layer.
For example, for one CNN layer, f_j^in denotes the j-th input feature map, f_i^out denotes the i-th output feature map, and b_i denotes the bias term of the i-th output map.
For the CONV layer, n_in and n_out represent the number of input and output feature maps, respectively.
For the FC layer, n_in and n_out represent the lengths of the input and output feature vectors, respectively.
Definition of CONV layers (Convolutional layers): the CONV layer takes a series of feature maps as input and obtains output feature maps by convolving the input with convolution kernels.
A non-linear layer, i.e. a non-linear excitation function, is usually connected to the CONV layer and is applied to each element of the output feature map.
The CONV layer may be represented by expression 1:

f_i^out = sum over j = 1..n_in of ( f_j^in ⊗ g_i,j ) + b_i    (1)

where g_i,j is the convolution kernel applied to the j-th input feature map and the i-th output feature map.
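The computation of expression 1 can be sketched in plain NumPy as follows. This is an illustrative sketch, not the accelerator's implementation: the "valid" (no padding, stride 1) window and the omission of kernel flipping (as in most CNN frameworks, which compute cross-correlation) are assumptions.

```python
import numpy as np

def conv_layer(inputs, kernels, biases):
    """Expression 1: each output map i is the sum over input maps j of the
    sliding-window product of input j with kernel g[i][j], plus bias b_i.
    inputs:  list of n_in 2-D arrays; kernels[i][j]: k x k array; biases: n_out
    """
    n_out, n_in = len(kernels), len(inputs)
    k = kernels[0][0].shape[0]
    h, w = inputs[0].shape
    oh, ow = h - k + 1, w - k + 1          # "valid" convolution (assumption)
    outputs = []
    for i in range(n_out):
        acc = np.full((oh, ow), biases[i], dtype=float)
        for j in range(n_in):
            g = kernels[i][j]
            for y in range(oh):
                for x in range(ow):
                    acc[y, x] += np.sum(inputs[j][y:y + k, x:x + k] * g)
        outputs.append(acc)
    return outputs
```

A non-linear excitation function (e.g. ReLU) would then be applied elementwise to each output map.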
Definition of FC layers (Fully-Connected layers): the FC layer applies a linear transformation to the input feature vector:
f_out = W · f_in + b    (2)
W is an n_out × n_in transform matrix and b is the bias term. It is worth noting that for the FC layer the input is not a combination of several two-dimensional feature maps, but a single feature vector. Thus, in expression 2, the parameters n_in and n_out in effect correspond to the lengths of the input and output feature vectors.
Pooling layer: usually connected to the CONV layer, it outputs the maximum or average value of each sub-area in each feature map. Max pooling can be represented by expression 3:

f_i,j^out = max over 0 ≤ m, n < p of f^in_(i·p+m, j·p+n)    (3)

where p is the size of the pooling kernel. This non-linear "down-sampling" not only reduces the feature map size and the computation for the next layer, but also provides a form of translation invariance.
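The max pooling of expression 3 can be sketched as follows; the non-overlapping windows (stride equal to p) are an assumption consistent with the "down-sampling" description above.

```python
import numpy as np

def max_pool(fmap, p):
    """Expression 3: output the maximum of each non-overlapping p x p
    sub-area of the feature map (stride = p is an assumption)."""
    h, w = fmap.shape
    out = np.empty((h // p, w // p))
    for y in range(h // p):
        for x in range(w // p):
            out[y, x] = fmap[y * p:(y + 1) * p, x * p:(x + 1) * p].max()
    return out
```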
CNN can be used for image classification in the forward inference process. But before a CNN is used for any task, the CNN model should first be trained on a dataset. Recent studies have shown that a CNN model pre-trained on a large dataset for a given task can be transferred to other tasks with high accuracy through small adjustments of the network weights; this is called "fine-tuning". Training of CNNs is mainly performed on large servers. For embedded FPGA platforms, we focus on accelerating the inference process of CNN.
Image-Net data set
The Image-Net dataset is regarded as a standard benchmark to evaluate the performance of image classification and object detection algorithms. To date, the Image-Net dataset has collected more than 14 million images in 21,000 categories. Image-Net releases a subset with 1,000 categories and 1.2 million images for the ILSVRC classification task, which has greatly promoted the development of computer vision techniques. In the present application, all CNN models are trained with the ILSVRC 2014 training set and evaluated with the ILSVRC 2014 validation set.
Existing CNN models
In ILSVRC 2012, the SuperVision team used AlexNet to win first place in the image classification task with a top-5 accuracy of 84.7%. CaffeNet is a replication of AlexNet with minor variations. Both AlexNet and CaffeNet comprise 5 CONV layers and 3 FC layers.
In ILSVRC 2013, the Zeiler-and-Fergus (ZF) network won first place in the image classification task with a top-5 accuracy of 88.8%. The ZF network also has 5 CONV layers and 3 FC layers.
As shown in fig. 1(b), a typical CNN is illustrated from the perspective of the input-output data flow.
The CNN shown in fig. 1(b) includes 5 CONV groups (CONV1, CONV2, CONV3, CONV4, CONV5, each of which includes 3 convolutional layers), 3 FC layers (FC1, FC2, FC3), and one Softmax decision function.
Fig. 2 is a schematic diagram of software optimization and hardware implementation of an artificial neural network.
As shown in fig. 2, in order to accelerate CNN, a whole set of technical solutions is proposed from the perspective of optimization flow and hardware architecture.
The artificial neural network model is shown on the lower side of fig. 2. In the middle of fig. 2 it is shown how the CNN model can be compressed to reduce memory usage and the number of operations while minimizing the loss of accuracy.
The dedicated hardware provided for the compressed CNN is shown on the upper side of fig. 2.
As shown in the upper side of fig. 2, in the hardware architecture, two blocks, PS and PL, are included.
A general Processing System (PS) includes a CPU and an external memory (EXTERNAL MEMORY).
A Programmable Logic module (PL) includes DMA, a computing core, input/output buffers, a controller, etc.
As shown in fig. 2, the PL is provided with: a computing core (Computing Complex), an input buffer, an output buffer, a controller, and Direct Memory Access (DMA).
The computing core includes a plurality of processing elements (PEs) that are responsible for most of the computational tasks of the CONV, pooling, and FC layers of the artificial neural network.
The on-chip buffers include an input buffer and an output buffer, which prepare data for use by the PEs and store the results.
The controller fetches instructions from external memory, decodes them (if needed), and coordinates all modules in the PL except the DMA.
DMAs are used to transfer data and instructions between external memory (e.g., DDR) and PL.
The PS includes a general purpose processor (CPU)8110 and an external memory 8120.
The external memory stores model parameters, data and instructions for all artificial neural networks.
The PS is a hard core, the hardware structure is fixed, and the scheduling is carried out by software.
PL is programmable hardware logic, the hardware architecture of which can vary. The programmable logic module (PL) may be an FPGA, for example.
It should be noted that, according to an embodiment of the present invention, the DMA, although on the PL side, is directly controlled by the CPU and transfers data from the EXTERNAL MEMORY to the PL.
Thus, the hardware architecture shown in FIG. 2 is only a functional partition, and the boundary between PL and PS described above is not absolute. For example, in an actual implementation the PL and the CPU may be implemented on one SoC, such as the Zynq chip of Xilinx, while the external memory may be implemented by a separate memory chip connected to the CPU in the SoC.
FIG. 3 shows an optimization procedure before deploying an artificial neural network to a hardware chip.
The input to fig. 3 is the original artificial neural network.
Step 405: compression
The compression step may include pruning the CNN model. Network pruning has proven to be an effective method to reduce network complexity and overfitting. See, for example, the article by B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon".
In the priority application 201610663201.9, "a method of optimizing an artificial neural network", cited in the present application, a method of compressing a CNN network by pruning is proposed.
First, an initialization step initializes the weights of the CONV layers and FC layers to random values, generating a fully-connected ANN in which each connection has a weight parameter.
Second, a training step trains the ANN, adjusting its weights according to the accuracy of the ANN until the accuracy reaches a predetermined standard.
For example, the training step adjusts the weights of the ANN based on a stochastic gradient descent algorithm, i.e., it randomly adjusts weight values and selects among them based on the resulting variation in the accuracy of the ANN. For an introduction to the stochastic gradient algorithm, see "Learning both Weights and Connections for Efficient Neural Networks".
The accuracy may be quantified as the difference between the predicted and correct outcome of the ANN for the training data set.
Third, a pruning step discovers unimportant connections in the ANN based on a predetermined condition and prunes those unimportant connections. In particular, the weight parameters of pruned connections are no longer saved.
The predetermined condition includes any one of: the weight parameter of the connection is 0; or the weight parameter of the connection is less than a predetermined value.
Fourth, a fine-tuning step resets each pruned connection to a connection whose weight parameter is zero, i.e., the pruned connection is restored and assigned a weight value of 0.
Finally, it is judged whether the accuracy of the ANN reaches the predetermined standard. If not, the second, third, and fourth steps are repeated.
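The pruning loop above can be sketched as follows. The dictionary representation of weights, the magnitude threshold, and all function names are illustrative assumptions; the actual training step would use stochastic gradient descent as described above.

```python
def prune_and_finetune(weights, threshold, train_step, accuracy, target):
    """Sketch of the pruning steps described above (names hypothetical):
    prune connections whose |weight| falls below the threshold (including
    exact zeros), keep pruned connections as zero-valued entries, then
    fine-tune the surviving connections until accuracy reaches the target."""
    mask = {name: abs(w) >= threshold for name, w in weights.items()}
    for name, keep in mask.items():
        if not keep:
            weights[name] = 0.0      # restore pruned connection with weight 0
    while accuracy(weights) < target:
        train_step(weights, mask)    # adjust only connections where mask is True
    return weights, mask
```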
Step 410: fixed point quantization of data
For a fixed-point number, its value is expressed as follows:

value = sum over i = 0..bw-1 of B_i · 2^i · 2^(-f_l)    (4)

where bw is the bit width of the number and f_l is the fractional length, which can be negative.
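The fixed-point format described above can be sketched as a simple quantizer; round-to-nearest and saturation on overflow are assumptions, not prescribed by the original.

```python
def to_fixed(value, bw, fl):
    """Quantize a float to the fixed-point format described above:
    a bw-bit signed integer scaled by 2**(-fl); fl may be negative."""
    scaled = round(value * (2 ** fl))
    lo, hi = -(2 ** (bw - 1)), 2 ** (bw - 1) - 1   # saturate on overflow
    scaled = max(lo, min(hi, scaled))
    return scaled * (2 ** -fl)
```

For example, with bw = 8 and fl = 4 the representable step is 1/16, so 0.1 quantizes to 0.125 and values above 7.9375 saturate.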
In order to convert floating-point numbers into fixed-point numbers while obtaining the highest precision, the inventors propose a dynamic-precision data quantization strategy and an automatic workflow.
Unlike previous static-precision quantization strategies, in the data quantization flow proposed by the inventors, f_l changes dynamically across different layers and feature map sets while remaining static within one layer, so as to minimize the truncation error of each layer.
The proposed quantization procedure mainly consists of two stages.
(1) Weight quantization stage:
The purpose of the weight quantization stage is to find the best f_l for the weights of one layer, as in expression 5:

f_l = argmin over f_l of sum |W_float − W(bw, f_l)|    (5)

where W is the weight and W(bw, f_l) represents the fixed-point format of W under the given bw and f_l.
In one embodiment, the dynamic range of each layer's weights is first analyzed, e.g., estimated by sampling. Then, f_l is initialized so as to avoid data overflow. Furthermore, we search for the optimal f_l in the neighborhood of the initial f_l.
According to another embodiment, in the weight fixed-point quantization step, another way is used to find the best f_l, as in expression 6, where i represents one of the bw bits and k_i is the weight of that bit. In expression 6, different bits are given different weights, and the optimal f_l is then calculated.
(2) A data quantization stage.
The data quantization stage aims at finding the optimal f_l for the feature map sets between two layers of the CNN model.
At this stage, the CNN is exercised using a training data set (benchmark). The training data set may be dataset 0.
According to an embodiment of the present invention, the weights of all of the CNN's CONV layers and FC layers are quantized first, and data quantization is performed afterwards. The training data set is then input into the CNN with quantized weights, and the input feature maps of each layer are obtained through layer-by-layer processing of the CONV and FC layers.
Using a greedy algorithm, the intermediate data of the fixed-point CNN model and the floating-point CNN model are compared layer by layer for each layer's input feature maps, so as to reduce the accuracy loss. The optimization goal of each layer is shown in expression 7:

f_l = argmin over f_l of sum |x+_float − x+(bw, f_l)|    (7)

In expression 7, A represents the computation of one layer (e.g., a CONV layer or FC layer), x represents the input, and x+ = A · x represents the output of the layer. Notably, for the CONV layer or FC layer, the direct result x+ has a longer bit width than the given standard, so truncation is required when the optimal f_l is selected. Finally, the whole data quantization configuration is generated.
According to another embodiment, in the data fixed-point quantization step, another way is used to find the best f_l, as in expression 8, where i represents one of the bw bits and k_i is the weight of that bit. In a manner similar to expression 6, different bits are given different weights, and the optimal f_l is then calculated.
The above data quantization step yields the optimal f_l.
Furthermore, according to another embodiment, the weight quantization and the data quantization may be performed alternately, not sequentially.
Regarding the sequence of data processing, the convolutional layers (CONV layers) and fully-connected layers (FC layers) of the ANN are arranged in series, and the training data set is processed by the feature map sets obtained as the CONV and FC layers of the ANN process it in sequence.
Specifically, the weight quantization step and the data quantization step are performed alternately following this series relationship: after the weight quantization step completes fixed-point quantization of a certain layer, the data quantization step is performed on the feature map set output by that layer.
First embodiment
In the priority application, the inventors proposed a co-design using a general-purpose processor and a dedicated accelerator, but did not discuss how to efficiently exploit the flexibility of the general-purpose processor and the computing power of the dedicated accelerator, e.g., how to transfer instructions, transfer data, perform calculations, and so on. In the present application, the inventors propose a further optimization scheme.
Fig. 4 shows a further modification to the hardware architecture of fig. 2.
In fig. 4, the CPU controls the DMA, which is responsible for scheduling data. Specifically, the CPU controls the DMA to transfer instructions from the external memory (DDR) to the FIFO; the dedicated accelerator then fetches instructions from the FIFO and executes them.
Likewise, the CPU controls the DMA to transfer data from the DDR to the FIFO, from which the data are taken for calculation. Similarly, the CPU also maintains the transport of the accelerator's output data.
In operation, the CPU needs to constantly monitor the status of the DMA. When the Input FIFO is not full, data must be moved from the DDR into the Input FIFO. When the Output FIFO is not empty, data must be carried from the Output FIFO back into the DDR.
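One polling pass of the CPU scheduling loop just described can be modeled on plain lists (a toy model, not real DMA code; the capacity parameter and list roles are assumptions):

```python
def service_fifos(input_fifo, in_capacity, pending_inputs, output_fifo, ddr_out):
    """Model of one CPU polling pass: refill the Input FIFO from pending
    DDR data while it has room, then drain the Output FIFO back to DDR."""
    while len(input_fifo) < in_capacity and pending_inputs:
        input_fifo.append(pending_inputs.pop(0))   # DDR -> Input FIFO
    while output_fifo:
        ddr_out.append(output_fifo.pop(0))         # Output FIFO -> DDR
```

The model makes the cost concrete: the CPU must run this pass continuously, which is exactly the overhead the second embodiment removes.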
In addition, the dedicated accelerator of fig. 4 includes: a controller, a computing core (computation Complex) and a buffer (buffer).
The computing core comprises convolvers, an adder tree, non-linear modules, etc.
The size of the convolution kernel typically has only a few options, such as 3 × 3, 5 × 5, and 7 × 7. For example, the two-dimensional convolver designed for the convolution operation may use a 3 × 3 window.
An adder tree (AD) sums all the results of the convolvers. A non-linear (NL) module applies the non-linear excitation function to the input data stream; for example, the function may be a ReLU function. In addition, a Max-Pooling module (not shown) is used for the pooling operation; for example, it applies a 2 × 2 window to the input data stream and outputs the maximum value within it.
The buffers include an input data buffer, an output data buffer, and a bias shift module.
The bias shift module is used to support conversion between dynamic quantization ranges; for example, shifting may be performed for the weights, or for the data.
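The bias-shift idea can be sketched as follows: before adding a bias to data quantized with a different fractional length, the bias's integer value is shifted so that both operands share the same f_l. The simple arithmetic shift shown is an assumption.

```python
def bias_shift(bias_int, bias_fl, data_fl):
    """Align a fixed-point bias (integer bias_int at fractional length
    bias_fl) to the data's fractional length data_fl by shifting."""
    shift = data_fl - bias_fl
    return bias_int << shift if shift >= 0 else bias_int >> -shift
```

For example, the value 0.75 stored as integer 3 at f_l = 2 becomes integer 12 at f_l = 4, so it can be added directly to data quantized with f_l = 4.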
The input data buffer may further include a data buffer and a weight buffer. The data buffer may be a line buffer, which stores the data required for an operation and delays its release so as to realize data reuse.
Fig. 5 shows the FIFO interaction between the CPU and the dedicated accelerator.
There are 3 classes of FIFOs in the architecture diagram shown in fig. 5. Correspondingly, there are three types of control of the DMA by the CPU.
In the first embodiment, the CPU and the dedicated accelerator communicate entirely through FIFOs, of which there are three types between them: instruction, input data, and output data. Specifically, under the control of the CPU, the DMA is responsible for transferring input data, output data, and instructions between the external memory and the dedicated accelerator, with an input data FIFO, an output data FIFO, and an instruction FIFO provided between the DMA and the dedicated accelerator, respectively.
For the dedicated accelerator, the design is simple: it is concerned only with calculation and not with data movement. The data operations are controlled entirely by the CPU.
However, in some application scenarios, the scheme shown in fig. 5 also has a deficiency.
First, performing the scheduling consumes CPU resources. For example, the CPU needs to constantly monitor the status of each FIFO and be ready to receive and transmit data at any time. Monitoring the states and processing data according to those states consumes CPU time. In some applications, the cost of monitoring the FIFOs and processing data is so large that CPU utilization is almost full and no CPU time remains for other tasks (reading pictures, pre-processing pictures, etc.).
Second, a plurality of FIFOs must be provided in the dedicated accelerator, which occupies PL resources.
Second embodiment
The second embodiment is characterized as follows. First, the dedicated accelerator and the CPU share the external memory, and both can read it. Second, the CPU controls only the instruction input of the dedicated accelerator. In this way, the CPU and the dedicated accelerator operate as a combined system in which the CPU takes on the tasks that the dedicated accelerator cannot accomplish.
As shown in fig. 6, in the second embodiment the dedicated accelerator (PL) interacts directly with the external memory (DDR). Accordingly, the Input FIFO and Output FIFO between the DMA and the dedicated accelerator (shown in fig. 5) are eliminated, and only 1 FIFO is kept for transmitting instructions between the DMA and the dedicated accelerator, thereby saving resources.
The CPU no longer needs to perform complex scheduling of input and output data; instead, the data are accessed directly from the external memory (DDR) by the dedicated accelerator. While the artificial neural network is operating, the CPU may perform other processing, such as reading image data to be processed from a camera.
Therefore, the second embodiment solves the problem of an overloaded CPU and frees the CPU to process more tasks. However, the dedicated accelerator must perform data access control to the external memory (DDR) by itself.
Modifications of the first and second embodiments
In both the first and second embodiments, the CPU controls the accelerator by instructions.
During execution, the accelerator may erroneously "run away" (i.e., the program enters a dead loop or runs meaninglessly out of control). In the schemes described so far, the CPU cannot determine whether the accelerator has run away.
In a modified embodiment based on the first or second embodiment, the inventors have also provided a "state peripheral" in the CPU, passing the state of the Finite State Machine (FSM) in the dedicated accelerator (PL) directly to the CPU.
By detecting the state of the Finite State Machine (FSM), the CPU can know how the accelerator is operating. If the CPU finds that the accelerator has run away or become stuck, it may also send a signal to reset the accelerator directly.
Figure 7 shows an example of adding a "stateful peripheral" to the architecture of the first embodiment shown in figure 4.
Figure 8 shows an example of adding a "stateful peripheral" to the architecture of the second embodiment shown in figure 6.
As shown in figs. 7 and 8, a Finite State Machine (FSM) is provided in the controller of the dedicated accelerator, and the state of the FSM is transmitted directly to a state peripheral (i.e., a state monitoring module) of the CPU, so that the CPU can monitor fault conditions such as the program running away.
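The CPU-side check enabled by the state peripheral can be sketched as a simple watchdog; the callback names, the "DONE" terminal state, and the stuck-state heuristic are all illustrative assumptions, not details from the original.

```python
def watchdog(read_fsm_state, reset_accel, max_same=1000):
    """Sketch of the state-peripheral check described above (names are
    hypothetical): poll the accelerator's FSM state; if the same state is
    reported for too many consecutive polls, assume the accelerator has
    run away or is stuck, and reset it."""
    last, count = None, 0
    while True:
        state = read_fsm_state()
        if state == "DONE":           # assumed terminal state name
            return "finished"
        count = count + 1 if state == last else 1
        last = state
        if count >= max_same:
            reset_accel()             # CPU directly resets the accelerator
            return "reset"
```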
Comparison of the first and second embodiments
Each of the two scheduling strategies of the first and second embodiments has its own advantages.
In the embodiment of FIG. 4, picture data must be transmitted to the dedicated accelerator by CPU-scheduled DMA, so the dedicated accelerator has idle time. However, since the CPU schedules the data handling, the dedicated accelerator is responsible only for calculation; its computing power is fully exploited, and the time to process the data is correspondingly shortened.
In the embodiment of FIG. 6, the dedicated accelerator can access data on its own, without the CPU scheduling data transfers, so data processing may be performed independently on the dedicated accelerator.
The CPU may be responsible only for reading data from and outputting data to external systems. The reading operation means, for example, that the CPU reads picture data from a camera (not shown) and transfers it to the external memory; the output operation means that the CPU outputs the recognition result from the external memory to a screen (not shown).
With the embodiment of fig. 6, tasks may be pipelined, allowing multiple tasks to be processed faster. The corresponding cost is that the dedicated accelerator is responsible for both calculation and data handling, so its efficiency is lower and each processing takes longer.
Fig. 9 comparatively shows the difference in the processing flow of the first and second embodiments.
Application of the second embodiment: face recognition
According to the second embodiment, since there is a shared external memory (DDR), the CPU and the dedicated accelerator can perform one calculation task together.
For example, in a face recognition task, the CPU can read the camera and detect the face in the input picture, while the dedicated neural-network accelerator completes the recognition of the face.
Using the co-design of the CPU and the dedicated accelerator, such neural network computing tasks can be rapidly deployed on embedded devices.
In particular, referring to example 2 of fig. 9, the reading (e.g., reading from a camera) and pre-processing of the picture is run on the CPU, and the processing of the picture is done on a dedicated accelerator.
The method separates the tasks of the CPU and the accelerator, so that the CPU and the accelerator can completely process the tasks in parallel.
Table 1 shows the performance comparison between the second embodiment (CPU + dedicated accelerator co-design) and a CPU-only implementation.
Table 1
The CPU used for comparison was the Tegra K1 manufactured by NVIDIA. It can be seen that with our co-design of CPU + dedicated accelerator there is significant acceleration for each layer, with an overall acceleration of up to 7 times.
The invention has the advantage that the rich functionality of the CPU (general-purpose processor) compensates for the limited flexibility of the dedicated accelerator (programmable logic module PL, such as an FPGA), while the high computing speed of the dedicated accelerator compensates for the CPU's insufficient computing speed for real-time calculation.
Further, the general-purpose processor may be an ARM processor or any other CPU. The programmable logic module may be an FPGA or another programmable dedicated processor (ASIC).
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description covers only specific embodiments of the present invention, but the scope of the present invention is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present invention, and all such changes or substitutions shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.