CN107657316B - Design of cooperative system of general processor and neural network processor - Google Patents

Design of cooperative system of general processor and neural network processor

Info

Publication number
CN107657316B
CN107657316B (application CN201610695285.4A)
Authority
CN
China
Prior art keywords
cpu
data
module
dma
ann
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610695285.4A
Other languages
Chinese (zh)
Other versions
CN107657316A (en)
Inventor
余金城
姚颂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xilinx Technology Beijing Ltd
Original Assignee
Beijing Deephi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Deephi Technology Co Ltd filed Critical Beijing Deephi Technology Co Ltd
Publication of CN107657316A publication Critical patent/CN107657316A/en
Application granted granted Critical
Publication of CN107657316B publication Critical patent/CN107657316B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The present invention relates to Artificial Neural Networks (ANNs), such as Convolutional Neural Networks (CNNs), and more particularly to how to implement artificial neural networks based on a synergistic system design of general purpose processors and neural network specific processors.

Description

Design of cooperative system of general processor and neural network processor
Referenced priority application
The present application claims priority from the previously filed Chinese patent application 201610663201.9, "A method of optimizing artificial neural networks", and Chinese patent application 201610663563.8, "A deep processing unit for implementing ANN".
Technical Field
The present invention relates to Artificial Neural Networks (ANNs), such as Convolutional Neural Networks (CNNs), and more particularly to how to implement artificial neural networks based on a synergistic system design of general purpose processors and neural network processors.
Background
Convolutional neural networks are very widely used in the field of image processing, and neural networks have the advantages of a simple training method and a uniform calculation structure. However, neural networks are both computationally and storage intensive. Many efforts have attempted to build accelerators for neural networks on FPGAs or to directly design dedicated chips. However, such dedicated neural network acceleration hardware is still limited in flexibility and can only perform a single task.
The article "Going deep With Embedded FPGA Platform for conditional Neural Network" (2016.2) published by inventor Yaosong et al describes an acceleration system using an FPGA in which a general purpose processor (e.g., ARM) is used to perform calculations that some FPGAs have not been able to perform. For example, ARM is responsible for transferring instructions and preparing data.
Disclosure of Invention
On the basis of the above-mentioned article, the inventors propose further improvements. The present application proposes to combine a neural network dedicated processor with a general purpose processor (CPU) to provide a flexible system that can be adapted to complex neural networks.
According to an aspect of the present invention, a Deep Processing Unit (DPU) for operating an Artificial Neural Network (ANN) is presented, comprising: a CPU for scheduling a programmable processing module (PL) and a Direct Memory Access (DMA); a Direct Memory Access (DMA), connected with the CPU, the programmable processing module, and the external memory, respectively, and used for communication between the CPU and the programmable processing module; a programmable processor module (PL) comprising: a Controller (Controller) for obtaining instructions and scheduling a Computing core based on the instructions; the Computing core (Computing Complex), comprising a plurality of Computing units (PEs), for performing computing tasks based on the instructions and data; and a buffer for storing the data and instructions used by the programmable processor module; an external memory (DDR), connected to the CPU and the DMA, for storing: instructions for implementing the ANN and data that needs to be processed by the ANN; wherein the CPU controls the DMA to transfer instructions and data between the external memory and the programmable logic module.
In addition, the DMA transfers data between the external memory and the programmable processing module through a FIFO; the DMA transfers instructions between the external memory and the programmable processing module through a FIFO.
According to another aspect of the present invention, a Deep Processing Unit (DPU) for operating an Artificial Neural Network (ANN) is presented, comprising: a CPU for scheduling a programmable processing module (PL) and a Direct Memory Access (DMA); a Direct Memory Access (DMA), connected with the CPU, the programmable processing module, and the external memory, respectively, and used for communication between the CPU and the programmable processing module; a programmable processor module (PL) comprising: a Controller (Controller) for obtaining instructions and scheduling a Computing core based on the instructions; the Computing core (Computing Complex), comprising a plurality of Computing units (PEs), for performing computing tasks based on the instructions and data; and a buffer for storing the data and instructions used by the programmable processor module; an external memory (DDR), connected to the CPU, the DMA, and the programmable logic module, for storing: instructions for implementing the ANN and data that needs to be processed by the ANN; wherein the CPU controls the DMA to transfer instructions between the external memory and the programmable logic module; and wherein the programmable logic module and the external memory transfer data directly.
In addition, the DMA and the programmable processing module transmit instructions through FIFO.
Furthermore, the CPU includes: a state monitoring module to monitor a state of a Finite State Machine (FSM) of the programmable logic module.
Furthermore, the computing unit (PE) comprises: a complex convolution kernel (convoluter complex) coupled to the buffer to receive the weights of the ANN and the input data, for performing convolution calculations in the ANN; an adder tree (adder tree) coupled to the complex convolution kernel, for summing the results of the convolution operations; and a non-linearization module, coupled to the adder tree, for applying a non-linear function operation to the output of the adder tree.
Furthermore, the computing unit (PE) further comprises: a pooling (aggregation) module, connected to the non-linearization module, for performing pooling operations in the ANN.
Further, the buffer includes: an input buffer for preparing the input data and instructions used by the computation of the computing core; and an output buffer for storing and outputting the calculation results.
In addition, the buffer further includes: a bias shifter (bias shift) for shifting the weights, which are quantized fixed-point numbers, to different quantization ranges, and for outputting the shifted weights to the adder tree.
According to one embodiment of the invention, the CPU, the programmable logic module, and the DMA are implemented on one SoC, while the external memory is implemented on a separate chip outside the SoC.
Drawings
Fig. 1a and 1b show a common structure of an artificial neural network model.
FIG. 2 shows a process for deploying an artificial neural network model on dedicated hardware.
Fig. 3 shows the overall flow of optimizing the artificial neural network.
Fig. 4 shows a hardware architecture for implementing an artificial neural network using a co-design of a CPU and a dedicated accelerator (e.g., DPU) according to a first embodiment of the present invention.
Fig. 5 shows a data transfer mechanism using FIFOs for the hardware architecture shown in fig. 4.
Fig. 6 shows a hardware architecture for implementing an artificial neural network using a co-design of a CPU and a dedicated accelerator (e.g., DPU) according to a second embodiment of the present invention.
Fig. 7 shows a further modification of the first embodiment of the present invention.
Fig. 8 shows a further modification of the second embodiment of the present invention.
Fig. 9 shows the difference between the processing flows of the first and second embodiments.
Detailed Description
Part of the present application was published in the academic article "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network" (Feb. 2016) by the inventor Yao Song et al. The present application makes further improvements on that basis.
In this application, the improvement of CNN by the present invention will be mainly described by taking image processing as an example. Deep Neural Networks (DNNs) and Recurrent Neural Networks (RNNs) are similar to CNNs.
CNN basic concept
CNN achieves state-of-the-art performance in a wide range of vision-related tasks. To help understand the CNN-based image classification algorithms analyzed in this application, we first introduce the basic concepts of CNN, the Image-Net dataset, and existing CNN models.
As shown in fig. 1(a), a typical CNN consists of a series of layers that run in order.
The parameters of a CNN model are called "weights" (weights). The first layer of a CNN reads the input image and outputs a series of feature maps (feature maps). A following layer reads the feature maps generated by the preceding layer and outputs new feature maps. The final classifier (classifier) outputs the probability of each class to which the input image may belong. The CONV layer (convolutional layer) and the FC layer (fully connected layer) are the two basic layer types in a CNN. A CONV layer is usually followed by a pooling layer (pooling layer).
For example, for one CNN layer, $f_j^{in}$ denotes the j-th input feature map, $f_i^{out}$ denotes the i-th output feature map (output feature map), and $b_i$ denotes the bias term of the i-th output map.
For the CONV layer, $n_{in}$ and $n_{out}$ represent the number of input and output feature maps, respectively.
For the FC layer, $n_{in}$ and $n_{out}$ represent the lengths of the input and output feature vectors, respectively.
Definition of CONV layers (Convolutional layers): the CONV layer takes a series of feature maps as input and obtains output feature maps by convolving the inputs with convolution kernels.
A non-linear layer, i.e., a non-linear excitation function, is usually connected to the CONV layer and applied to each element of the output feature maps.
The CONV layer may be represented by expression 1:

$$f_i^{out} = \sum_{j=1}^{n_{in}} f_j^{in} \otimes g_{i,j} + b_i \qquad (1)$$

where $g_{i,j}$ is the convolution kernel applied to the j-th input feature map and the i-th output feature map.
Definition of FC layers (Fully-Connected layers): the FC layer applies a linear transformation to the input feature vector:

$$f^{out} = W f^{in} + b \qquad (2)$$

where $W$ is an $n_{out} \times n_{in}$ transformation matrix and $b$ is the bias term. It is worth noting that the input of the FC layer is not a combination of several two-dimensional feature maps but a single feature vector. Consequently, in expression 2, the parameters $n_{in}$ and $n_{out}$ correspond to the lengths of the input and output feature vectors.
Pooling (pooling) layer: usually connected to the CONV layer, it outputs the maximum or average value of each subarea in each feature map. Max pooling can be represented by expression 3:

$$f_{i,j}^{out} = \max_{p \times p}\left(f_{m,n}^{in} \cdots f_{m+p-1,\,n+p-1}^{in}\right) \qquad (3)$$

where $p$ is the size of the pooling kernel. This non-linear "down-sampling" not only reduces the size and computation of the feature maps for the next layer, but also provides translation invariance.
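For illustration, the three layer types defined by expressions 1 to 3 can be restated as a short NumPy sketch. This is only a direct transcription of the formulas under simplifying assumptions (stride-1 "valid" convolution, non-overlapping pooling windows, randomly generated toy data); it is not the hardware implementation described later in this application.

```python
import numpy as np

def conv_layer(fin, g, b):
    """Expression 1: output map i = sum_j conv2d(fin[j], g[i, j]) + b[i] ('valid' convolution, stride 1)."""
    n_out, n_in, k, _ = g.shape
    h, w = fin.shape[1] - k + 1, fin.shape[2] - k + 1
    fout = np.zeros((n_out, h, w))
    for i in range(n_out):
        for j in range(n_in):
            for y in range(h):
                for x in range(w):
                    # implemented as cross-correlation, as is common in CNN frameworks
                    fout[i, y, x] += np.sum(fin[j, y:y + k, x:x + k] * g[i, j])
        fout[i] += b[i]
    return fout

def fc_layer(fin, W, b):
    """Expression 2: f_out = W @ f_in + b."""
    return W @ fin + b

def max_pool(fin, p):
    """Expression 3: maximum over non-overlapping p x p windows of each feature map."""
    n, h, w = fin.shape
    t = fin[:, :h - h % p, :w - w % p]
    return t.reshape(n, t.shape[1] // p, p, t.shape[2] // p, p).max(axis=(2, 4))

# Tiny usage example with random data.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))                                  # 3 input feature maps
y = conv_layer(x, rng.normal(size=(4, 3, 3, 3)), rng.normal(size=4))
z = max_pool(np.maximum(y, 0.0), 2)                             # ReLU, then 2x2 max pooling
print(fc_layer(z.reshape(-1), rng.normal(size=(10, z.size)), rng.normal(size=10)).shape)
```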
CNN can be used for image classification in the forward inference process. However, before being used for any task, the CNN model should first be trained on a dataset. Recent studies have shown that a CNN model trained in advance on a large dataset for a given task can be used for other tasks and achieve high accuracy by slightly adjusting the network weights (network weights); this is called "fine-tuning". The training of CNN is mainly carried out on large servers. For embedded FPGA platforms, we focus on accelerating the inference process of CNN.
Image-Net data set
The Image-Net dataset is regarded as the standard benchmark for evaluating the performance of image classification and object detection algorithms. To date, the Image-Net dataset has collected more than 14 million images in 21,000 categories. Image-Net releases a subset with 1,000 categories and 1.2 million images for the ILSVRC classification task, which has greatly promoted the development of computer vision technology. In the present application, all CNN models are trained with the ILSVRC 2014 training set and evaluated with the ILSVRC 2014 validation set.
Existing CNN models
In ILSVRC 2012, the SuperVision team used AlexNet to win first place in the image classification task with a top-5 accuracy of 84.7%. CaffeNet is a replication of AlexNet with minor changes. Both AlexNet and CaffeNet comprise 5 CONV layers and 3 FC layers.
In ILSVRC 2013, the Zeiler-and-Fergus (ZF) network won first place in the image classification task with a top-5 accuracy of 88.8%. The ZF network also has 5 CONV layers and 3 FC layers.
As shown in fig. 1(b), a typical CNN is illustrated from the perspective of the input-output data flow.
The CNN shown in fig. 1(b) includes 5 CONV groups (CONV1, CONV2, CONV3, CONV4, CONV5), each of which contains 3 convolutional layers, 3 FC layers (FC1, FC2, FC3), and one Softmax decision function.
Fig. 2 is a schematic diagram of software optimization and hardware implementation of an artificial neural network.
As shown in fig. 2, in order to accelerate CNN, a whole set of technical solutions is proposed from the perspective of optimization flow and hardware architecture.
The artificial neural network model is shown on the lower side of fig. 2. In the middle of fig. 2 it is shown how the CNN model can be compressed to reduce memory usage and the number of operations while minimizing the loss of accuracy.
The dedicated hardware provided for the compressed CNN is shown on the upper side of fig. 2.
As shown in the upper side of fig. 2, in the hardware architecture, two blocks, PS and PL, are included.
A general Processing System (PS) includes: a CPU and an external memory (External Memory).
A Programmable Logic module (PL) comprising: DMA and computational cores, input/output buffers, and controllers, etc.
As shown in fig. 2, the PL is provided with: a Computing core (Computing Complex), input buffers, output buffers, a controller, and Direct Memory Access (DMA).
The computing core includes a plurality of processing units (PEs) that are responsible for most of the computational tasks of the CONV, pooling, and FC layers of the artificial neural network.
The on-chip buffers include an input buffer and an output buffer; they prepare data for use by the PEs and store the results.
The controller fetches instructions from external memory, decodes them (if needed), and coordinates all modules in the PL except the DMA.
DMAs are used to transfer data and instructions between external memory (e.g., DDR) and PL.
The PS includes a general purpose processor (CPU)8110 and an external memory 8120.
The external memory stores model parameters, data and instructions for all artificial neural networks.
The PS is a hard core, the hardware structure is fixed, and the scheduling is carried out by software.
PL is programmable hardware logic, the hardware architecture of which can vary. The programmable logic module (PL) may be an FPGA, for example.
It should be noted that according to the embodiment of the present invention, the DMA is directly controlled by the CPU, although it is on the PL side, and transfers data from the EXTERNAL MEMORY to the PL.
Thus, the hardware architecture shown in FIG. 2 is only a functional partition, and the boundary between PL and PS described above is not absolute. For example, in an actual implementation, the PL and the CPU may be implemented on one SoC, such as a Xilinx Zynq chip. The external memory may be implemented by another memory chip and connected to the CPU in the SoC chip.
FIG. 3 shows an optimization procedure before deploying an artificial neural network to a hardware chip.
The input to fig. 3 is the original artificial neural network.
Step 405: compression
The compressing step may include pruning the CNN model. Network pruning has proven to be an effective method to reduce the complexity and overfitting of a network. See, for example, B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon".
In the priority application 201610663201.9, "a method of optimizing an artificial neural network", cited in the present application, a method of compressing a CNN network by pruning is proposed.
First, an initialization step initializes the weights of the convolutional layers and FC layers to random values, generating an ANN with full connections, the connections carrying weight parameters.
Secondly, training the ANN, and adjusting the weight of the ANN according to the accuracy of the ANN until the accuracy reaches a preset standard.
For example, the training step adjusts the weights of the ANN based on a stochastic gradient descent algorithm, i.e., it randomly adjusts the weight values and accepts adjustments based on the resulting change in the accuracy of the ANN. For an introduction to the stochastic gradient descent algorithm, see "Learning both weights and connections for efficient neural networks".
The accuracy may be quantified as the difference between the predicted and correct outcome of the ANN for the training data set.
Third, a pruning step finds unimportant connections in the ANN based on predetermined conditions and prunes these unimportant connections. In particular, the weight parameters of the pruned connections are no longer saved.
The predetermined condition includes any one of: the weight parameter of the connection is 0; or the weight parameter of the connection is less than a predetermined value.
Fourthly, a fine tuning step of resetting the pruned connection to a connection whose weight parameter value is zero, i.e. restoring the pruned connection and assigning a weight value of 0.
Finally, it is judged whether the accuracy of the ANN reaches a preset standard. If not, the second, third, and fourth steps are repeated.
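The train / prune / fine-tune loop described above can be sketched as follows. The sketch operates on a single toy weight vector rather than a real ANN, uses a random hill-climbing step as a stand-in for stochastic gradient descent, and the pruning threshold, accuracy target, and number of cycles are illustrative assumptions rather than values taken from this application.

```python
import numpy as np

rng = np.random.default_rng(0)

def accuracy(W, X, y):
    """Toy accuracy: fraction of samples whose predicted sign matches the label."""
    return float(np.mean(np.sign(X @ W) == y))

def prune(W, threshold):
    """Pruning step: connections whose weight magnitude is below the threshold (or zero) are removed."""
    mask = (np.abs(W) >= threshold).astype(W.dtype)
    return W * mask, mask

def fine_tune(W, mask, X, y, steps=200, lr=0.1):
    """Training / fine-tuning: pruned connections stay at weight 0; surviving weights are randomly
    adjusted and a change is kept only if the accuracy does not drop (stand-in for SGD)."""
    for _ in range(steps):
        candidate = W + lr * rng.normal(size=W.shape) * mask
        if accuracy(candidate, X, y) >= accuracy(W, X, y):
            W = candidate
    return W

# Toy, linearly separable data standing in for the training set.
X = rng.normal(size=(200, 16))
y = np.sign(X @ rng.normal(size=16))

W = rng.normal(size=16)                     # first: initialize weights to random values
W = fine_tune(W, np.ones_like(W), X, y)     # second: train until accuracy is acceptable
for _ in range(5):                          # third/fourth: prune, then fine-tune, repeatedly
    W, mask = prune(W, threshold=0.3)
    W = fine_tune(W, mask, X, y)
    if accuracy(W, X, y) >= 0.95:           # finally: stop once the preset standard is reached
        break
print("accuracy:", accuracy(W, X, y), "remaining connections:", int(np.count_nonzero(W)))
```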
Step 410: fixed point quantization of data
For a fixed-point number, its value can be expressed as follows:

$$\mathrm{value} = \sum_{i=0}^{bw-1} B_i \cdot 2^{-f_l} \cdot 2^{i} \qquad (4)$$

where $bw$ is the bit width of the number, $B_i$ is the i-th bit, and $f_l$ is the fractional length, which can be negative.
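As an illustration of expression 4, the following sketch converts a floating-point array to a bw-bit signed fixed-point representation with fractional length f_l and back. Round-to-nearest and saturation to the representable range are assumptions; the application does not fix these details.

```python
import numpy as np

def to_fixed(x, bw, fl):
    """Quantize x to bw-bit signed fixed-point integers with fractional length fl (cf. expression 4)."""
    lo, hi = -2 ** (bw - 1), 2 ** (bw - 1) - 1          # representable integer range
    return np.clip(np.round(x * 2.0 ** fl), lo, hi).astype(np.int64)

def to_float(q, fl):
    """Recover the real value represented by the fixed-point integer q, i.e. q * 2^(-fl)."""
    return q.astype(np.float64) * 2.0 ** -fl

w = np.array([0.74, -1.32, 0.05])
q = to_fixed(w, bw=8, fl=6)        # 8-bit fixed point with 6 fractional bits
print(q)                           # the quantized integers
print(to_float(q, 6))              # the real values they represent (quantization error is visible)
```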
In order to convert floating point numbers into fixed point numbers and obtain the highest precision, the inventor provides a dynamic precision data quantization strategy and an automatic workflow.
Unlike previous static-precision quantization strategies, in the data quantization flow provided by the inventors, $f_l$ changes dynamically for different layers and feature map sets while remaining static within one layer, so as to minimize the truncation error of each layer.
The proposed quantization procedure mainly consists of two stages.
(1) Weight quantization stage:
The purpose of the weight quantization stage is to find the optimal $f_l$ for the weights of one layer, as in expression 5:

$$f_l = \arg\min_{f_l} \sum \left| W_{float} - W(bw, f_l) \right| \qquad (5)$$

where $W$ is a weight and $W(bw, f_l)$ represents the fixed-point format of $W$ under the given $bw$ and $f_l$.
In one embodiment, the dynamic range of the weights of each layer is first analyzed, for example estimated by sampling. Then, $f_l$ is initialized to a value that avoids data overflow. Finally, the optimal $f_l$ is searched for in the neighborhood of the initial $f_l$.
According to another embodiment, in the weight fixed-point quantization step, another way is used to find the optimal $f_l$, as in expression 6 (a bit-weighted variant of expression 5, shown as an image in the original publication), where $i$ denotes one of the $bw$ bits and $k_i$ is the bit weight. With expression 6, different bits are given different weights, and the optimal $f_l$ is then calculated.
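The neighborhood search of expression 5 can be sketched as follows. The initialization of f_l from the dynamic range and the size of the search neighborhood are assumptions about one possible implementation, not values prescribed by this application.

```python
import numpy as np

def quantize(x, bw, fl):
    """Fixed-point round trip: quantize x to bw bits with fractional length fl, return the real values."""
    lo, hi = -2 ** (bw - 1), 2 ** (bw - 1) - 1
    return np.clip(np.round(x * 2.0 ** fl), lo, hi) * 2.0 ** -fl

def best_weight_fl(W, bw=8, search_radius=2):
    """Expression 5: pick the fl that minimizes sum |W_float - W(bw, fl)|, searching a small
    neighborhood around an initial fl chosen so that max|W| does not overflow the bw-bit range."""
    fl0 = bw - 1 - int(np.ceil(np.log2(np.max(np.abs(W)) + 1e-12)))
    candidates = range(fl0 - search_radius, fl0 + search_radius + 1)
    return min(candidates, key=lambda fl: float(np.sum(np.abs(W - quantize(W, bw, fl)))))

rng = np.random.default_rng(1)
W = rng.normal(scale=0.2, size=(64, 64))   # toy weight matrix of one layer
print("chosen fl for the weights:", best_weight_fl(W, bw=8))
```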
(2) A data quantization stage.
The data quantization stage aims at finding the optimal $f_l$ for the feature map sets between two layers of the CNN model.
At this stage, the CNN is run on a training data set (benchmark). The training data set may be data set 0.
According to an embodiment of the present invention, the weights of all the CONV layers and FC layers of the CNN are quantized first, and then data quantization is performed. At this point, the training data set is input into the weight-quantized CNN, and the input feature maps of each layer are obtained through layer-by-layer processing by the CONV layers and FC layers.
Using a greedy algorithm, the intermediate data of the fixed-point CNN model and the floating-point CNN model are compared layer by layer, for each layer's input feature maps, so as to reduce the accuracy loss. The optimization goal for each layer is shown in expression 7:

$$f_l = \arg\min_{f_l} \sum \left| x_{float}^{+} - x^{+}(bw, f_l) \right| \qquad (7)$$

In expression 7, $A$ represents the computation of one layer (e.g., a CONV layer or an FC layer), $x$ represents the input, and $x^{+} = A \cdot x$ represents the output of the layer. Notably, for a CONV layer or FC layer, the direct result $x^{+}$ has a longer bit width than the given standard, so truncation is needed when the optimal $f_l$ is selected. Finally, the entire data quantization configuration is generated.
According to another embodiment, in the data fixed-point quantization step, another way is used to find the optimal $f_l$, as in expression 8 (a bit-weighted variant of expression 7, shown as an image in the original publication), where $i$ denotes one of the $bw$ bits and $k_i$ is the bit weight. In a manner similar to expression 4, different bits are given different weights, and the optimal $f_l$ is then recalculated.
The above data quantization step yields the optimal $f_l$.
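The greedy, layer-by-layer selection of expression 7 can be sketched as follows: a benchmark input is propagated through a floating-point reference path and a fixed-point path, and for each layer the f_l that minimizes the difference between the two intermediate results is kept before the result is passed to the next layer. The two-layer toy network, the ReLU non-linearity, and the search range are illustrative assumptions.

```python
import numpy as np

def quantize(x, bw, fl):
    """bw-bit fixed-point round trip with fractional length fl (cf. expression 4)."""
    lo, hi = -2 ** (bw - 1), 2 ** (bw - 1) - 1
    return np.clip(np.round(x * 2.0 ** fl), lo, hi) * 2.0 ** -fl

rng = np.random.default_rng(2)
layers = [rng.normal(scale=0.3, size=(32, 32)) for _ in range(2)]   # toy FC-only network
x_float = rng.normal(size=32)    # one benchmark input sample
x_fixed = x_float.copy()
bw, data_fls = 8, []

for W in layers:
    x_float = np.maximum(W @ x_float, 0.0)     # floating-point reference output x+ of this layer (with ReLU)
    direct = np.maximum(W @ x_fixed, 0.0)      # fixed-point path before truncation
    # Expression 7: choose the fl minimizing sum |x_float+ - x+(bw, fl)| for this layer's output.
    fl = min(range(-4, 12), key=lambda f: float(np.sum(np.abs(x_float - quantize(direct, bw, f)))))
    x_fixed = quantize(direct, bw, fl)         # truncate, then feed the result to the next layer
    data_fls.append(fl)

print("per-layer data fl:", data_fls)
```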
Furthermore, according to another embodiment, the weight quantization and the data quantization may be performed alternately, not sequentially.
Regarding the order of data processing, the convolutional layers (CONV layers) and fully connected layers (FC layers) of the ANN are arranged in series, and the training data set is processed through the CONV layers and FC layers of the ANN in sequence, producing a feature map set at each layer.
Specifically, the weight quantization step and the data quantization step are performed alternately following this series relationship: after the weight quantization step completes the fixed-point quantization of a certain layer, the data quantization step is performed on the feature map set output by that layer.
First embodiment
In the priority applications, the inventors proposed a co-design using a general-purpose processor and a dedicated accelerator, but did not discuss how to efficiently exploit the flexibility of the general-purpose processor and the computing power of the dedicated accelerator, for example how to transfer instructions, transfer data, and perform calculations. In the present application, the inventors propose a further optimization scheme.
Fig. 4 shows a further modification to the hardware architecture of fig. 2.
In fig. 4, the CPU controls the DMA, which is responsible for scheduling data. Specifically, the CPU controls the DMA to transfer instructions from the external memory (DDR) to a FIFO. The dedicated accelerator then fetches the instructions from the FIFO and executes them.
The CPU likewise controls the DMA to transfer data from the DDR to a FIFO, from which the data is taken for calculation. Similarly, the CPU also manages the transfer of the accelerator's output data.
In operation, the CPU needs to constantly monitor the status of the DMA. When the Input FIFO is not full, data needs to be moved from the DDR to the Input FIFO. When the Output FIFO is not empty, data needs to be moved from the Output FIFO back into the DDR.
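The CPU-side scheduling of the first embodiment can be illustrated with a small behavioral model in which bounded queues stand in for the hardware FIFOs. This is a software simulation of the control flow only (the real system would use DMA descriptors and memory-mapped registers); it is not driver code for an actual device.

```python
from collections import deque

FIFO_DEPTH = 4
ddr_in = deque(range(16))            # data tiles waiting in external memory (DDR)
ddr_out = []                         # results written back to DDR
input_fifo, output_fifo = deque(), deque()

def accelerator_step():
    """Stand-in for the dedicated accelerator: consume one tile from the Input FIFO, emit one result."""
    if input_fifo:
        output_fifo.append(input_fifo.popleft() * 2)

# CPU scheduling loop: keep polling the FIFO states until every tile has been processed.
while len(ddr_out) < 16:
    if ddr_in and len(input_fifo) < FIFO_DEPTH:   # Input FIFO not full -> DMA one tile from DDR
        input_fifo.append(ddr_in.popleft())
    accelerator_step()                            # in the real system the PL side runs concurrently
    if output_fifo:                               # Output FIFO not empty -> DMA one result back to DDR
        ddr_out.append(output_fifo.popleft())

print(ddr_out)
```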
In addition, the dedicated accelerator of fig. 4 includes: a controller, a computing core (computation Complex) and a buffer (buffer).
The computing core comprises: convolvers, adder trees, non-linear modules, etc.
The size of the convolution kernel is typically only a few options such as 3 x 3, 5 x 5 and 7 x 7. For example, a two-dimensional convolver designed for convolution operation is a 3 × 3 window.
An adder tree (AD) sums all the results of the convolvers. A non-linear (NL) module applies a non-linear excitation function to the input data stream; for example, the function may be a ReLU function. In addition, a Max-Pooling module (not shown) performs the pooling operation: for example, it applies a 2 × 2 window to the input data stream and outputs the maximum value within the window.
The buffer area includes: the device comprises an input data buffer area, an output data buffer area and a bias shift (bias shift) module.
A bias shift (bias shift) module is used to support conversion between dynamic quantization ranges. For example, shifting is performed on the weights; shifting may also be performed on the data.
The input data buffer may further include: input data buffer, weight buffer. The input data buffer may be a line buffer (line buffer) for storing data required for operation and delaying release of the data to realize reuse of the data.
Fig. 5 shows the FIFO interaction between the CPU and the dedicated accelerator.
There are three types of FIFOs in the architecture diagram shown in fig. 5. Correspondingly, there are three types of control of the DMA by the CPU.
In the first embodiment, the CPU and the dedicated accelerator communicate completely through FIFO, and there are three types of cache FIFOs between the CPU and the dedicated accelerator: command, input data, output data. Specifically, under the control of the CPU, the DMA is responsible for the transfer of input data, output data, and instructions between the external memory and the dedicated accelerator, where an input data FIFO, an output data FIFO, and an instruction FIFO are provided between the DMA and the dedicated accelerator, respectively.
For the dedicated accelerator, this design is simple: the accelerator is only concerned with calculation and not with data movement. Data handling is controlled entirely by the CPU.
However, in some application scenarios, the scheme shown in fig. 5 also has a deficiency.
First, performing the scheduling consumes CPU resources. For example, the CPU needs to constantly monitor the status of each FIFO and be ready to receive and transmit data at any time. Monitoring these states and handling data according to the different states consumes CPU time. In some applications, the overhead of monitoring the FIFOs and handling data is so large that CPU utilization is almost full and no CPU time is left for other tasks (reading pictures, pre-processing pictures, etc.).
Second, multiple FIFOs need to be provided for the dedicated accelerator, which occupies PL resources.
Second embodiment
The second embodiment is characterized as follows. First, the dedicated accelerator and the CPU share the external memory, and both can read it. Second, the CPU controls only the instruction input of the dedicated accelerator. In this way, the CPU and the dedicated accelerator operate as a cooperative system in which the CPU takes on the tasks that the dedicated accelerator cannot accomplish.
As shown in fig. 6, in the second embodiment, the dedicated accelerator (PL) interacts directly with the external memory (DDR). Accordingly, the Input FIFO and Output FIFO between the DMA and the dedicated accelerator (as shown in FIG. 5) are eliminated, and only one FIFO is kept between the DMA and the dedicated accelerator for transmitting instructions, thereby saving resources.
For the CPU, complex scheduling of input and output data is not required, but the data is accessed directly from the external memory (DDR) by a dedicated accelerator. When the artificial neural network is operating, the CPU may perform other processing, such as reading image data to be processed from a camera, and the like.
Therefore, the second embodiment solves the problem of the CPU being overloaded and frees the CPU to process more tasks. However, the dedicated accelerator must itself perform data access control to the external memory (DDR).
Modifications of the first and second embodiments
In both the first and second embodiments, the CPU controls the accelerator by instructions.
The accelerator may "fly" in error during execution (i.e., the program enters a dead loop or runs in a meaningless shuffle). In current solutions, the CPU cannot determine whether the accelerator has run away.
In a modified embodiment based on the first or second embodiment, the inventors have also provided a "state peripheral" in the CPU, passing the state of the Finite State Machine (FSM) in the dedicated accelerator (PL) directly to the CPU.
The CPU can learn how the accelerator is operating by checking the state of the Finite State Machine (FSM). If it finds that the accelerator has run away or is stuck, the CPU can also send a signal to reset the accelerator directly.
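The role of the state peripheral can be illustrated with a watchdog sketch: the CPU periodically samples the FSM state exposed by the accelerator and issues a reset if the state stops changing for too long. The functions read_fsm_state and reset_accelerator are hypothetical stand-ins for memory-mapped register accesses, and the polling interval and stuck-detection limit are assumptions.

```python
import random
import time

def read_fsm_state() -> int:
    """Hypothetical read of the FSM state exposed by the state peripheral (stands in for a register read)."""
    return random.choice([0, 1, 2, 3])    # e.g. IDLE, FETCH, COMPUTE, WRITE_BACK

def reset_accelerator() -> None:
    """Hypothetical reset signal sent by the CPU directly to the dedicated accelerator."""
    print("accelerator reset issued")

def watchdog(poll_interval=0.01, stuck_limit=20, cycles=100):
    """Poll the FSM state; if it stays unchanged for stuck_limit polls, assume the accelerator ran away."""
    last_state, unchanged = read_fsm_state(), 0
    for _ in range(cycles):
        time.sleep(poll_interval)
        state = read_fsm_state()
        unchanged = unchanged + 1 if state == last_state else 0
        if unchanged >= stuck_limit:
            reset_accelerator()
            unchanged = 0
        last_state = state

watchdog()
```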
Figure 7 shows an example of adding a "stateful peripheral" to the architecture of the first embodiment shown in figure 4.
Figure 8 shows an example of adding a "stateful peripheral" to the architecture of the second embodiment shown in figure 6.
As shown in fig. 7 and 8, a Finite State Machine (FSM) is provided in the controller of the dedicated accelerator, and the state of the FSM is directly transmitted to a state peripheral (i.e., a state monitoring module) of the CPU, so that the CPU can monitor a fault condition such as a program running crash.
Comparison of the first and second embodiments
The two scheduling strategies of the first and second embodiments each have their own advantages.
In the embodiment of FIG. 4, the picture data requires the CPU to schedule the DMA for transmission to the dedicated accelerator, so the dedicated accelerator has idle time. However, since the CPU schedules the data handling, the dedicated accelerator is only responsible for calculation; its computing power is optimized, and the time to process the data is correspondingly shortened.
In the embodiment of FIG. 6, the dedicated accelerator has the ability to access data independently, without the CPU scheduling data transfers. The data processing can be performed independently on the dedicated accelerator.
The CPU may be responsible only for data reading and output with an external system. The reading operation means, for example, that the CPU reads picture data from a camera (not shown) and transfers it to the external memory; the output operation means that the CPU outputs the recognition result from the external memory to a screen (not shown).
With the embodiment of fig. 6, tasks may be pipelined, allowing multiple tasks to be processed faster. The corresponding cost is that the dedicated accelerator is responsible for both calculation and data handling, so it is less efficient and the processing takes longer.
Fig. 9 comparatively shows the difference in the processing flow of the first and second embodiments.
Application of the second embodiment: face recognition
According to the second embodiment, since there is a shared external memory (DDR), the CPU and the dedicated accelerator can perform one calculation task together.
For example, in a face recognition task: the CPU can read the camera and detect the face in the input picture, while the neural network acceleration core of the dedicated accelerator completes the recognition of the face.
By using the co-design of the CPU and the dedicated accelerator, the neural network computing tasks on the CPU can be rapidly deployed on the embedded device.
In particular, referring to example 2 of fig. 9, the reading (e.g., reading from a camera) and pre-processing of a picture run on the CPU, and the processing of the picture is done on the dedicated accelerator.
This method separates the tasks of the CPU and the accelerator, so that the CPU and the accelerator can process their tasks completely in parallel, as sketched below.
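The parallelism obtained by this task separation can be sketched with two threads and a queue: one thread plays the CPU role (reading and pre-processing frames), the other plays the dedicated accelerator role (recognizing each pre-processed frame). The frame source, the per-stage delays, and the function names are illustrative assumptions; the point is only that the CPU's work on frame N+1 overlaps with the accelerator's work on frame N.

```python
import queue
import threading
import time

N_FRAMES = 5
frames = queue.Queue(maxsize=2)      # pre-processed frames handed from the CPU to the accelerator

def cpu_side():
    """CPU role: read a frame from the camera and pre-process it (e.g. detect and crop the face)."""
    for i in range(N_FRAMES):
        time.sleep(0.02)             # stand-in for read + pre-processing time
        frames.put(i)
    frames.put(None)                 # end-of-stream marker

def accelerator_side():
    """Accelerator role: run the neural network (face recognition) on each pre-processed frame."""
    while True:
        frame = frames.get()
        if frame is None:
            break
        time.sleep(0.03)             # stand-in for inference time on the dedicated accelerator
        print(f"frame {frame} recognized")

start = time.time()
workers = [threading.Thread(target=cpu_side), threading.Thread(target=accelerator_side)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(f"pipelined total: {time.time() - start:.2f}s (sequential would be about {N_FRAMES * 0.05:.2f}s)")
```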
Table 1 shows a performance comparison between the second embodiment (CPU + dedicated accelerator co-design) and a CPU-only implementation.
Table 1
(Table 1 is reproduced as images in the original publication: a per-layer performance comparison between the CPU-only implementation and the CPU + dedicated accelerator co-design.)
The comparison CPU is the Tegra K1 manufactured by NVIDIA. It can be seen that with our CPU + dedicated accelerator co-design there is a significant acceleration for each layer, with an overall acceleration of up to 7 times.
The advantage of the invention is that the rich functionality of the CPU (general-purpose processor) makes up for the limited flexibility of the dedicated accelerator (programmable logic module PL, e.g., an FPGA), while the high computation speed of the dedicated accelerator makes up for the CPU's inability to meet real-time computation requirements.
Further, the general purpose processor may be an ARM processor, or any other CPU. The programmable logic modules may be FPGAs or other programmable application specific processors (ASICs).
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description covers only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed by the present invention, and all such changes or substitutions shall be covered within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A Deep Processing Unit (DPU) for operating an Artificial Neural Network (ANN), comprising:
a CPU for scheduling a programmable processing module (PL) and a Direct Memory Access (DMA);
a Direct Memory Access (DMA) connected to the CPU, the programmable processing module and the external memory, respectively, for communication between the CPU and the programmable processing module;
a programmable processor module (PL) comprising:
a Controller (Controller) to obtain an instruction and schedule a compute core based on the instruction;
a Computing core (Computing Complex) comprising a plurality of Computing elements (PEs) for performing Computing tasks based on instructions and data;
a buffer for storing data and instructions used by the programmable processor module;
an external memory (DDR), connected to the CPU, the DMA, and the programmable logic module, respectively, for storing: instructions to implement the ANN and data to be processed by the ANN;
wherein the CPU controls the DMA to transfer instructions between the external memory and the programmable logic module;
wherein the programmable logic module and the external memory directly transfer data.
2. The deep processing unit of claim 1, wherein instructions are transferred between the DMA and the programmable processing module through a FIFO.
3. The deep processing unit of claim 1, wherein the CPU further comprises:
a state monitoring module to monitor a state of a Finite State Machine (FSM) of the programmable logic module.
4. The deep processing unit according to claim 1, the computation unit (PE) comprising:
a complex convolution kernel (convoluter complex) coupled to the buffer to receive weights of the ANN and input data for performing convolution calculations in the ANN;
an adder tree (adder tree) connected to the complex convolution kernel, for summing the results of the convolution operations;
a non-linearization module, coupled to the adder tree, for applying a non-linear function operation to the output of the adder tree.
5. The deep processing unit of claim 4, the computing unit (PE) further comprising:
a pooling (aggregation) module, connected to the non-linearization module, for performing pooling operations in the ANN.
6. The deep processing unit of claim 1, the buffer comprising:
an input buffer for preparing the input data and instructions used by the computation of the computing core;
an output buffer for storing and outputting the calculation results.
7. The deep processing unit of claim 4, the buffer further comprising:
a bias shifter (bias shift) for shifting the weights, which are quantized fixed-point numbers, to different quantization ranges and outputting the shifted weights to the adder tree.
8. The deep processing unit of claim 1, wherein the CPU, the programmable logic module, and the DMA are implemented on one SOC.
9. The deep processing unit of claim 8, wherein the external memory is implemented on another chip than the SOC.
CN201610695285.4A 2016-08-12 2016-08-19 Design of cooperative system of general processor and neural network processor Active CN107657316B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN2016106632019 2016-08-12
CN201610663563 2016-08-12
CN2016106635638 2016-08-12
CN201610663201 2016-08-12

Publications (2)

Publication Number Publication Date
CN107657316A CN107657316A (en) 2018-02-02
CN107657316B true CN107657316B (en) 2020-04-07

Family

ID=61127258

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201610698184.2A Active CN107688855B (en) 2016-08-12 2016-08-19 Hierarchical quantization method and device for complex neural network
CN201610695285.4A Active CN107657316B (en) 2016-08-12 2016-08-19 Design of cooperative system of general processor and neural network processor

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201610698184.2A Active CN107688855B (en) 2016-08-12 2016-08-19 Hierarchical quantization method and device for complex neural network

Country Status (1)

Country Link
CN (2) CN107688855B (en)

Families Citing this family (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11437032B2 (en) 2017-09-29 2022-09-06 Shanghai Cambricon Information Technology Co., Ltd Image processing apparatus and method
US11709672B2 (en) 2018-02-13 2023-07-25 Shanghai Cambricon Information Technology Co., Ltd Computing device and method
KR102148110B1 (en) 2018-02-13 2020-08-25 상하이 캠브리콘 인포메이션 테크놀로지 컴퍼니 리미티드 Computing device and method
US11630666B2 (en) 2018-02-13 2023-04-18 Shanghai Cambricon Information Technology Co., Ltd Computing device and method
CN110162162B (en) 2018-02-14 2023-08-18 上海寒武纪信息科技有限公司 Control device, method and equipment of processor
CN108564165B (en) * 2018-03-13 2024-01-23 上海交通大学 Method and system for optimizing convolutional neural network by fixed point
EP3770775A4 (en) * 2018-03-23 2021-06-02 Sony Corporation Information processing device and information processing method
CN108491890B (en) * 2018-04-04 2022-05-27 百度在线网络技术(北京)有限公司 Image method and device
CN108510067B (en) * 2018-04-11 2021-11-09 西安电子科技大学 Convolutional neural network quantification method based on engineering realization
CN110413255B (en) * 2018-04-28 2022-08-19 赛灵思电子科技(北京)有限公司 Artificial neural network adjusting method and device
EP3624020A4 (en) 2018-05-18 2021-05-05 Shanghai Cambricon Information Technology Co., Ltd Computing method and related product
CN108805265B (en) * 2018-05-21 2021-03-30 Oppo广东移动通信有限公司 Neural network model processing method and device, image processing method and mobile terminal
KR20190136431A (en) * 2018-05-30 2019-12-10 삼성전자주식회사 Neural network system, Application processor having the same and Operating method of neural network system
CN110555508B (en) * 2018-05-31 2022-07-12 赛灵思电子科技(北京)有限公司 Artificial neural network adjusting method and device
CN110555450B (en) * 2018-05-31 2022-06-28 赛灵思电子科技(北京)有限公司 Face recognition neural network adjusting method and device
CN110598839A (en) * 2018-06-12 2019-12-20 华为技术有限公司 Convolutional neural network system and method for quantizing convolutional neural network
CN109034025A (en) * 2018-07-16 2018-12-18 东南大学 A kind of face critical point detection system based on ZYNQ
CN112449703A (en) * 2018-09-21 2021-03-05 华为技术有限公司 Method and device for quantifying neural network model in equipment
WO2020062392A1 (en) 2018-09-28 2020-04-02 上海寒武纪信息科技有限公司 Signal processing device, signal processing method and related product
US20220004854A1 (en) * 2018-10-08 2022-01-06 Deeper-I Co., Inc. Artificial neural network computation acceleration apparatus for distributed processing, artificial neural network acceleration system using same, and artificial neural network acceleration method therefor
CN109389120A (en) * 2018-10-29 2019-02-26 济南浪潮高新科技投资发展有限公司 A kind of object detecting device based on zynqMP
CN109523016B (en) * 2018-11-21 2020-09-01 济南大学 Multi-valued quantization depth neural network compression method and system for embedded system
CN109740619B (en) * 2018-12-27 2021-07-13 北京航天飞腾装备技术有限责任公司 Neural network terminal operation method and device for target recognition
CN111383638A (en) 2018-12-28 2020-07-07 上海寒武纪信息科技有限公司 Signal processing device, signal processing method and related product
CN110889497B (en) * 2018-12-29 2021-04-23 中科寒武纪科技股份有限公司 Learning task compiling method of artificial intelligence processor and related product
CN109711367B (en) * 2018-12-29 2020-03-06 中科寒武纪科技股份有限公司 Operation method, device and related product
US10592799B1 (en) * 2019-01-23 2020-03-17 StradVision, Inc. Determining FL value by using weighted quantization loss values to thereby quantize CNN parameters and feature values to be used for optimizing hardware applicable to mobile devices or compact networks with high precision
CN110009096A (en) * 2019-03-06 2019-07-12 开易(北京)科技有限公司 Target detection network model optimization method based on embedded device
US11847554B2 (en) 2019-04-18 2023-12-19 Cambricon Technologies Corporation Limited Data processing method and related products
CN111831543A (en) 2019-04-18 2020-10-27 中科寒武纪科技股份有限公司 Data processing method and related product
US11676028B2 (en) 2019-06-12 2023-06-13 Shanghai Cambricon Information Technology Co., Ltd Neural network quantization parameter determination method and related products
CN112085184B (en) 2019-06-12 2024-03-29 上海寒武纪信息科技有限公司 Quantization parameter adjustment method and device and related product
CN110348562B (en) * 2019-06-19 2021-10-15 北京迈格威科技有限公司 Neural network quantization strategy determination method, image identification method and device
CN110309877B (en) * 2019-06-28 2021-12-07 北京百度网讯科技有限公司 Feature map data quantization method and device, electronic equipment and storage medium
CN111344719A (en) * 2019-07-22 2020-06-26 深圳市大疆创新科技有限公司 Data processing method and device based on deep neural network and mobile device
CN110569713B (en) * 2019-07-22 2022-04-08 北京航天自动控制研究所 Target detection system and method for realizing data serial-parallel two-dimensional transmission by using DMA (direct memory access) controller
US11635893B2 (en) * 2019-08-12 2023-04-25 Micron Technology, Inc. Communications between processors and storage devices in automotive predictive maintenance implemented via artificial neural networks
CN112446460A (en) * 2019-08-28 2021-03-05 上海寒武纪信息科技有限公司 Method, apparatus and related product for processing data
CN110837890A (en) * 2019-10-22 2020-02-25 西安交通大学 Weight value fixed-point quantization method for lightweight convolutional neural network
CN110990060B (en) * 2019-12-06 2022-03-22 北京瀚诺半导体科技有限公司 Embedded processor, instruction set and data processing method of storage and computation integrated chip
CN111144511B (en) * 2019-12-31 2020-10-20 上海云从汇临人工智能科技有限公司 Image processing method, system, medium and electronic terminal based on neural network
CN111178522B (en) * 2020-04-13 2020-07-10 杭州雄迈集成电路技术股份有限公司 Software and hardware cooperative acceleration method and system and computer readable storage medium
CN111626414B (en) * 2020-07-30 2020-10-27 电子科技大学 Dynamic multi-precision neural network acceleration unit
CN112561933A (en) * 2020-12-15 2021-03-26 深兰人工智能(深圳)有限公司 Image segmentation method and device
CN113240101B (en) * 2021-05-13 2022-07-05 湖南大学 Method for realizing heterogeneous SoC (system on chip) by cooperative acceleration of software and hardware of convolutional neural network
CN113361695B (en) * 2021-06-30 2023-03-24 南方电网数字电网研究院有限公司 Convolutional neural network accelerator
CN115705482A (en) * 2021-07-20 2023-02-17 腾讯科技(深圳)有限公司 Model quantization method and device, computer equipment and storage medium
CN114708180B (en) * 2022-04-15 2023-05-30 电子科技大学 Bit depth quantization and enhancement method for predistortion image with dynamic range preservation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1776644A (en) * 2005-12-09 2006-05-24 中兴通讯股份有限公司 Method for monitoring internal memory varible rewrite based on finite-state-machine
CN104794102A (en) * 2015-05-14 2015-07-22 哈尔滨工业大学 Embedded system on chip for accelerating Cholesky decomposition
CN105224482A (en) * 2015-10-16 2016-01-06 浪潮(北京)电子信息产业有限公司 A kind of FPGA accelerator card high-speed memory system
CN105630735A (en) * 2015-12-25 2016-06-01 南京大学 Coprocessor based on reconfigurable computational array

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3192015A1 (en) * 2014-09-09 2017-07-19 Intel Corporation Improved fixed point integer implementations for neural networks
CN105760933A (en) * 2016-02-18 2016-07-13 清华大学 Method and apparatus for fixed-pointing layer-wise variable precision in convolutional neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1776644A (en) * 2005-12-09 2006-05-24 中兴通讯股份有限公司 Method for monitoring internal memory varible rewrite based on finite-state-machine
CN104794102A (en) * 2015-05-14 2015-07-22 哈尔滨工业大学 Embedded system on chip for accelerating Cholesky decomposition
CN105224482A (en) * 2015-10-16 2016-01-06 浪潮(北京)电子信息产业有限公司 A kind of FPGA accelerator card high-speed memory system
CN105630735A (en) * 2015-12-25 2016-06-01 南京大学 Coprocessor based on reconfigurable computational array

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Going Deeper with Embedded FPGA Platform for Convolutional Neural Network; Jiantao Qiu et al.; FPGA '16 Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays; 2016-02-29; pp. 30-32 and Fig. 4 *

Also Published As

Publication number Publication date
CN107688855B (en) 2021-04-13
CN107657316A (en) 2018-02-02
CN107688855A (en) 2018-02-13

Similar Documents

Publication Publication Date Title
CN107657316B (en) Design of cooperative system of general processor and neural network processor
CN107862374B (en) Neural network processing system and processing method based on assembly line
CN107239829B (en) Method for optimizing artificial neural network
US10691996B2 (en) Hardware accelerator for compressed LSTM
US20180046913A1 (en) Combining cpu and special accelerator for implementing an artificial neural network
CN107679620B (en) Artificial neural network processing device
US11055063B2 (en) Systems and methods for deep learning processor
US10824939B2 (en) Device for implementing artificial neural network with flexible buffer pool structure
US20180197084A1 (en) Convolutional neural network system having binary parameter and operation method thereof
CN107578099B (en) Computing device and method
US20180307976A1 (en) Device for implementing artificial neural network with separate computation units
CN107766292B (en) Neural network processing method and processing system
US11354570B2 (en) Machine learning network implemented by statically scheduled instructions, with MLA chip
CN110321997B (en) High-parallelism computing platform, system and computing implementation method
CN111694643B (en) Task scheduling execution system and method for graph neural network application
KR20190089685A (en) Method and apparatus for processing data
US11803740B2 (en) Ordering computations of a machine learning network in a machine learning accelerator for efficient memory usage
Wong et al. Low bitwidth CNN accelerator on FPGA using Winograd and block floating point arithmetic
CN110716751A (en) High-parallelism computing platform, system and computing implementation method
US20210133579A1 (en) Neural network instruction streaming
Wang et al. Acceleration and implementation of convolutional neural network based on FPGA
CN112596912B (en) Acceleration operation method and device for convolution calculation of binary or ternary neural network
US11488066B2 (en) Efficient convolution of multi-channel input samples with multiple kernels
CN110765413A (en) Matrix summation structure and neural network computing platform
EP4177731A1 (en) Sparsity uniformity enforcement for multicore processor

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20180606

Address after: 100083, 17 floor, 4 Building 4, 1 Wang Zhuang Road, Haidian District, Beijing.

Applicant after: Beijing deep Intelligent Technology Co., Ltd.

Address before: 100084 Wang Zhuang Road, 1, Haidian District, Beijing, Tsinghua Tongfang Technology Plaza, block D, 1705

Applicant before: Beijing insight Technology Co., Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200907

Address after: Unit 01-19, 10 / F, 101, 6 / F, building 5, yard 5, Anding Road, Chaoyang District, Beijing 100029

Patentee after: Xilinx Electronic Technology (Beijing) Co., Ltd

Address before: 100083, 17 floor, 4 Building 4, 1 Wang Zhuang Road, Haidian District, Beijing.

Patentee before: BEIJING DEEPHI TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right