The present application claims priority from the previously filed Chinese patent application 201610663201.9, "A Method of Optimizing an Artificial Neural Network", and Chinese patent application 201610663563.8, "A Deep Processing Unit for Implementing an ANN".
Detailed Description
A part of the present application was published in the inventor's academic article "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network" (February 2016). The present application makes further improvements on the basis of that method.
In this application, the improvements to CNNs provided by the present invention are mainly described by taking image processing as an example. Deep Neural Networks (DNNs) and Recurrent Neural Networks (RNNs) can be handled similarly to CNNs.
CNN basic concept
CNN achieves state-of-the-art performance in a wide range of vision-related tasks. To help understand the CNN-based image classification algorithms analyzed in this application, we first introduce the basic concepts of CNN, the Image-Net dataset, and existing CNN models.
As shown in fig. 1(a), a typical CNN consists of a series of layers that run in order.
The parameters of the CNN model are called "weights". The first layer of a CNN reads the input image and outputs a series of feature maps. Each lower layer reads the feature maps generated by the layer above it and outputs new feature maps. The final classifier outputs the probability of each class to which the input image may belong. The CONV layer (convolutional layer) and the FC layer (fully-connected layer) are the two basic layer types in a CNN. The CONV layer is usually followed by a Pooling layer.
For example, for one CNN layer, f_j^in denotes the j-th input feature map, f_i^out denotes the i-th output feature map, and b_i denotes the bias term of the i-th output map.
For the CONV layer, n_in and n_out represent the number of input and output feature maps, respectively.
For the FC layer, n_in and n_out represent the lengths of the input and output feature vectors, respectively.
Definition of CONV layers (Convolutional layers): the CONV layer takes a series of feature maps as input and obtains output feature maps by convolving the input with convolution kernels.
A non-linear layer, i.e. a non-linear excitation function, is usually connected to the CONV layer and is applied to each element of the output feature map.
The CONV layer may be represented by expression 1:

f_i^out = sum over j = 1..n_in of ( f_j^in ⊗ g_i,j ) + b_i    (1)

where g_i,j is the convolution kernel applied to the j-th input feature map and the i-th output feature map.
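The computation of expression 1 can be sketched in plain NumPy as follows. This is an illustrative sketch, not the accelerator's implementation: the "valid" (no padding, stride 1) window and the omission of kernel flipping (as in most CNN frameworks, which compute cross-correlation) are assumptions.

```python
import numpy as np

def conv_layer(inputs, kernels, biases):
    """Expression 1: each output map i is the sum over input maps j of the
    sliding-window product of input j with kernel g[i][j], plus bias b_i.
    inputs:  list of n_in 2-D arrays; kernels[i][j]: k x k array; biases: n_out
    """
    n_out, n_in = len(kernels), len(inputs)
    k = kernels[0][0].shape[0]
    h, w = inputs[0].shape
    oh, ow = h - k + 1, w - k + 1          # "valid" convolution (assumption)
    outputs = []
    for i in range(n_out):
        acc = np.full((oh, ow), biases[i], dtype=float)
        for j in range(n_in):
            g = kernels[i][j]
            for y in range(oh):
                for x in range(ow):
                    acc[y, x] += np.sum(inputs[j][y:y + k, x:x + k] * g)
        outputs.append(acc)
    return outputs
```

A non-linear excitation function (e.g. ReLU) would then be applied elementwise to each output map.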
Definition of FC layers (Fully-Connected layers): the FC layer applies a linear transformation to the input feature vector:
f_out = W · f_in + b    (2)
W is an n_out × n_in transform matrix and b is the bias term. It is worth noting that for the FC layer the input is not a combination of several two-dimensional feature maps, but a single feature vector. Thus, in expression 2, the parameters n_in and n_out in effect correspond to the lengths of the input and output feature vectors.
Pooling layer: usually connected to the CONV layer, it outputs the maximum or average value of each sub-area in each feature map. Max pooling can be represented by expression 3:

f_i,j^out = max over 0 ≤ m, n < p of f^in_(i·p+m, j·p+n)    (3)

where p is the size of the pooling kernel. This non-linear "down-sampling" not only reduces the feature map size and the computation for the next layer, but also provides a form of translation invariance.
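The max pooling of expression 3 can be sketched as follows; the non-overlapping windows (stride equal to p) are an assumption consistent with the "down-sampling" description above.

```python
import numpy as np

def max_pool(fmap, p):
    """Expression 3: output the maximum of each non-overlapping p x p
    sub-area of the feature map (stride = p is an assumption)."""
    h, w = fmap.shape
    out = np.empty((h // p, w // p))
    for y in range(h // p):
        for x in range(w // p):
            out[y, x] = fmap[y * p:(y + 1) * p, x * p:(x + 1) * p].max()
    return out
```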
CNN can be used for image classification in the forward inference process. But before a CNN is used for any task, the CNN model should first be trained on a dataset. Recent studies have shown that a CNN model pre-trained on a large dataset for a given task can be transferred to other tasks with high accuracy through small adjustments of the network weights; this is called "fine-tuning". Training of CNNs is mainly performed on large servers. For embedded FPGA platforms, we focus on accelerating the inference process of CNN.
Image-Net data set
The Image-Net dataset is regarded as a standard benchmark to evaluate the performance of image classification and object detection algorithms. To date, the Image-Net dataset has collected more than 14 million images in 21,000 categories. Image-Net releases a subset with 1,000 categories and 1.2 million images for the ILSVRC classification task, which has greatly promoted the development of computer vision techniques. In the present application, all CNN models are trained with the ILSVRC 2014 training set and evaluated with the ILSVRC 2014 validation set.
Existing CNN models
In ILSVRC 2012, the SuperVision team used AlexNet to win first place in the image classification task with a top-5 accuracy of 84.7%. CaffeNet is a replication of AlexNet with minor variations. Both AlexNet and CaffeNet comprise 5 CONV layers and 3 FC layers.
In ILSVRC 2013, the Zeiler-and-Fergus (ZF) network won first place in the image classification task with a top-5 accuracy of 88.8%. The ZF network also has 5 CONV layers and 3 FC layers.
As shown in fig. 1(b), a typical CNN is illustrated from the perspective of the input-output data flow.
The CNN shown in fig. 1(b) includes 5 CONV groups (CONV1, CONV2, CONV3, CONV4, CONV5, each of which includes 3 convolutional layers), 3 FC layers (FC1, FC2, FC3), and one Softmax decision function.
Fig. 2 is a schematic diagram of software optimization and hardware implementation of an artificial neural network.
As shown in fig. 2, in order to accelerate CNN, a whole set of technical solutions is proposed from the perspective of optimization flow and hardware architecture.
The artificial neural network model is shown on the lower side of fig. 2. In the middle of fig. 2 it is shown how the CNN model can be compressed to reduce memory usage and the number of operations while minimizing the loss of accuracy.
The dedicated hardware provided for the compressed CNN is shown on the upper side of fig. 2.
As shown in the upper side of fig. 2, in the hardware architecture, two blocks, PS and PL, are included.
A general Processing System (PS) includes a CPU and an external memory (EXTERNAL MEMORY).
A Programmable Logic module (PL) includes DMA, a computing core, input/output buffers, a controller, etc.
As shown in fig. 2, the PL is provided with: a computing core (Computing Complex), an input buffer, an output buffer, a controller, and Direct Memory Access (DMA).
The computing core includes a plurality of processing elements (PEs) that are responsible for most of the computational tasks of the CONV, pooling, and FC layers of the artificial neural network.
The on-chip buffers include an input buffer and an output buffer, which prepare data for use by the PEs and store the results.
The controller fetches instructions from external memory, decodes them (if needed), and coordinates all modules in the PL except the DMA.
DMAs are used to transfer data and instructions between external memory (e.g., DDR) and PL.
The PS includes a general purpose processor (CPU)8110 and an external memory 8120.
The external memory stores model parameters, data and instructions for all artificial neural networks.
The PS is a hard core, the hardware structure is fixed, and the scheduling is carried out by software.
PL is programmable hardware logic, the hardware architecture of which can vary. The programmable logic module (PL) may be an FPGA, for example.
It should be noted that, according to an embodiment of the present invention, the DMA, although on the PL side, is directly controlled by the CPU and transfers data from the EXTERNAL MEMORY to the PL.
Thus, the hardware architecture shown in FIG. 2 is only a functional partition, and the boundary between PL and PS described above is not absolute. For example, in an actual implementation the PL and the CPU may be implemented on one SoC, such as the Zynq chip of Xilinx, while the external memory may be implemented by a separate memory chip connected to the CPU in the SoC.
FIG. 3 shows an optimization procedure before deploying an artificial neural network to a hardware chip.
The input to fig. 3 is the original artificial neural network.
Step 405: compression
The compression step may include pruning the CNN model. Network pruning has proven to be an effective method to reduce network complexity and overfitting. See, for example, the article by B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon".
In the priority application 201610663201.9, "a method of optimizing an artificial neural network", cited in the present application, a method of compressing a CNN network by pruning is proposed.
First, an initialization step initializes the weights of the CONV layers and FC layers to random values, generating a fully-connected ANN in which each connection has a weight parameter.
Second, a training step trains the ANN, adjusting its weights according to the accuracy of the ANN until the accuracy reaches a predetermined standard.
For example, the training step adjusts the weights of the ANN based on a stochastic gradient descent algorithm, i.e., it randomly adjusts weight values and selects among them based on the resulting variation in the accuracy of the ANN. For an introduction to the stochastic gradient algorithm, see "Learning both Weights and Connections for Efficient Neural Networks".
The accuracy may be quantified as the difference between the predicted and correct outcome of the ANN for the training data set.
Third, a pruning step discovers unimportant connections in the ANN based on a predetermined condition and prunes those unimportant connections. In particular, the weight parameters of pruned connections are no longer saved.
The predetermined condition includes any one of: the weight parameter of the connection is 0; or the weight parameter of the connection is less than a predetermined value.
Fourth, a fine-tuning step resets each pruned connection to a connection whose weight parameter is zero, i.e., the pruned connection is restored and assigned a weight value of 0.
Finally, it is judged whether the accuracy of the ANN reaches the predetermined standard. If not, the second, third, and fourth steps are repeated.
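The pruning loop above can be sketched as follows. The dictionary representation of weights, the magnitude threshold, and all function names are illustrative assumptions; the actual training step would use stochastic gradient descent as described above.

```python
def prune_and_finetune(weights, threshold, train_step, accuracy, target):
    """Sketch of the pruning steps described above (names hypothetical):
    prune connections whose |weight| falls below the threshold (including
    exact zeros), keep pruned connections as zero-valued entries, then
    fine-tune the surviving connections until accuracy reaches the target."""
    mask = {name: abs(w) >= threshold for name, w in weights.items()}
    for name, keep in mask.items():
        if not keep:
            weights[name] = 0.0      # restore pruned connection with weight 0
    while accuracy(weights) < target:
        train_step(weights, mask)    # adjust only connections where mask is True
    return weights, mask
```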
Step 410: fixed point quantization of data
For a fixed-point number, its value is expressed as follows:

value = sum over i = 0..bw-1 of B_i · 2^i · 2^(-f_l)    (4)

where bw is the bit width of the number and f_l is the fractional length, which can be negative.
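The fixed-point format described above can be sketched as a simple quantizer; round-to-nearest and saturation on overflow are assumptions, not prescribed by the original.

```python
def to_fixed(value, bw, fl):
    """Quantize a float to the fixed-point format described above:
    a bw-bit signed integer scaled by 2**(-fl); fl may be negative."""
    scaled = round(value * (2 ** fl))
    lo, hi = -(2 ** (bw - 1)), 2 ** (bw - 1) - 1   # saturate on overflow
    scaled = max(lo, min(hi, scaled))
    return scaled * (2 ** -fl)
```

For example, with bw = 8 and fl = 4 the representable step is 1/16, so 0.1 quantizes to 0.125 and values above 7.9375 saturate.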
In order to convert floating-point numbers into fixed-point numbers while obtaining the highest precision, the inventors propose a dynamic-precision data quantization strategy and an automatic workflow.
Unlike previous static-precision quantization strategies, in the data quantization flow proposed by the inventors, f_l changes dynamically across different layers and feature map sets while remaining static within one layer, so as to minimize the truncation error of each layer.
The proposed quantization procedure mainly consists of two stages.
(1) Weight quantization stage:
The purpose of the weight quantization stage is to find the best f_l for the weights of one layer, as in expression 5:

f_l = argmin over f_l of sum |W_float − W(bw, f_l)|    (5)

where W is the weight and W(bw, f_l) represents the fixed-point format of W under the given bw and f_l.
In one embodiment, the dynamic range of each layer's weights is first analyzed, e.g., estimated by sampling. Then, f_l is initialized so as to avoid data overflow. Furthermore, we search for the optimal f_l in the neighborhood of the initial f_l.
According to another embodiment, in the weight fixed-point quantization step, another way is used to find the best f_l, as in expression 6, where i represents one of the bw bits and k_i is the weight of that bit. In expression 6, different bits are given different weights, and the optimal f_l is then calculated.
(2) A data quantization stage.
The data quantization stage aims at finding the optimal f_l for the feature map sets between two layers of the CNN model.
At this stage, the CNN is exercised using a training data set (benchmark). The training data set may be dataset 0.
According to an embodiment of the present invention, the weights of all of the CNN's CONV layers and FC layers are quantized first, and data quantization is performed afterwards. The training data set is then input into the CNN with quantized weights, and the input feature maps of each layer are obtained through layer-by-layer processing of the CONV and FC layers.
Using a greedy algorithm, the intermediate data of the fixed-point CNN model and the floating-point CNN model are compared layer by layer for each layer's input feature maps, so as to reduce the accuracy loss. The optimization goal of each layer is shown in expression 7:

f_l = argmin over f_l of sum |x+_float − x+(bw, f_l)|    (7)

In expression 7, A represents the computation of one layer (e.g., a CONV layer or FC layer), x represents the input, and x+ = A · x represents the output of the layer. Notably, for the CONV layer or FC layer, the direct result x+ has a longer bit width than the given standard, so truncation is required when the optimal f_l is selected. Finally, the whole data quantization configuration is generated.
According to another embodiment, in the data fixed-point quantization step, another way is used to find the best f_l, as in expression 8, where i represents one of the bw bits and k_i is the weight of that bit. In a manner similar to expression 6, different bits are given different weights, and the optimal f_l is then calculated.
The above data quantization step yields the optimal f_l.
Furthermore, according to another embodiment, the weight quantization and the data quantization may be performed alternately, not sequentially.
Regarding the sequence of data processing, the convolutional layers (CONV layers) and fully-connected layers (FC layers) of the ANN are arranged in series, and the training data set is processed by the feature map sets obtained as the CONV and FC layers of the ANN process it in sequence.
Specifically, the weight quantization step and the data quantization step are performed alternately following this series relationship: after the weight quantization step completes fixed-point quantization of a certain layer, the data quantization step is performed on the feature map set output by that layer.
First embodiment
In the priority application, the inventors proposed a co-design using a general-purpose processor and a dedicated accelerator, but did not discuss how to efficiently exploit the flexibility of the general-purpose processor and the computing power of the dedicated accelerator, e.g., how to transfer instructions, transfer data, perform calculations, and so on. In the present application, the inventors propose a further optimization scheme.
Fig. 4 shows a further modification to the hardware architecture of fig. 2.
In fig. 4, the CPU controls the DMA, which is responsible for scheduling data. Specifically, the CPU controls the DMA to transfer instructions from the external memory (DDR) to the FIFO; the dedicated accelerator then fetches instructions from the FIFO and executes them.
Likewise, the CPU controls the DMA to transfer data from the DDR to the FIFO, from which the data are taken for calculation. Similarly, the CPU also maintains the transport of the accelerator's output data.
In operation, the CPU needs to constantly monitor the status of the DMA. When the Input FIFO is not full, data must be moved from the DDR into the Input FIFO. When the Output FIFO is not empty, data must be carried from the Output FIFO back into the DDR.
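One polling pass of the CPU scheduling loop just described can be modeled on plain lists (a toy model, not real DMA code; the capacity parameter and list roles are assumptions):

```python
def service_fifos(input_fifo, in_capacity, pending_inputs, output_fifo, ddr_out):
    """Model of one CPU polling pass: refill the Input FIFO from pending
    DDR data while it has room, then drain the Output FIFO back to DDR."""
    while len(input_fifo) < in_capacity and pending_inputs:
        input_fifo.append(pending_inputs.pop(0))   # DDR -> Input FIFO
    while output_fifo:
        ddr_out.append(output_fifo.pop(0))         # Output FIFO -> DDR
```

The model makes the cost concrete: the CPU must run this pass continuously, which is exactly the overhead the second embodiment removes.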
In addition, the dedicated accelerator of fig. 4 includes: a controller, a computing core (computation Complex) and a buffer (buffer).
The computing core comprises convolvers, an adder tree, non-linear modules, etc.
The size of the convolution kernel typically has only a few options, such as 3 × 3, 5 × 5, and 7 × 7. For example, the two-dimensional convolver designed for the convolution operation may use a 3 × 3 window.
An adder tree (AD) sums all the results of the convolvers. A non-linear (NL) module applies the non-linear excitation function to the input data stream; for example, the function may be a ReLU function. In addition, a Max-Pooling module (not shown) is used for the pooling operation; for example, it applies a 2 × 2 window to the input data stream and outputs the maximum value within it.
The buffers include an input data buffer, an output data buffer, and a bias shift module.
The bias shift module is used to support conversion between dynamic quantization ranges; for example, shifting may be performed for the weights, or for the data.
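The bias-shift idea can be sketched as follows: before adding a bias to data quantized with a different fractional length, the bias's integer value is shifted so that both operands share the same f_l. The simple arithmetic shift shown is an assumption.

```python
def bias_shift(bias_int, bias_fl, data_fl):
    """Align a fixed-point bias (integer bias_int at fractional length
    bias_fl) to the data's fractional length data_fl by shifting."""
    shift = data_fl - bias_fl
    return bias_int << shift if shift >= 0 else bias_int >> -shift
```

For example, the value 0.75 stored as integer 3 at f_l = 2 becomes integer 12 at f_l = 4, so it can be added directly to data quantized with f_l = 4.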
The input data buffer may further include a data buffer and a weight buffer. The data buffer may be a line buffer, which stores the data required for an operation and delays its release so as to realize data reuse.
Fig. 5 shows the FIFO interaction between the CPU and the dedicated accelerator.
There are 3 classes of FIFOs in the architecture diagram shown in fig. 5. Correspondingly, there are three types of control of the DMA by the CPU.
In the first embodiment, the CPU and the dedicated accelerator communicate entirely through FIFOs, of which there are three types between them: instruction, input data, and output data. Specifically, under the control of the CPU, the DMA is responsible for transferring input data, output data, and instructions between the external memory and the dedicated accelerator, with an input data FIFO, an output data FIFO, and an instruction FIFO provided between the DMA and the dedicated accelerator, respectively.
For the dedicated accelerator, the design is simple: it is concerned only with calculation and not with data movement. The data operations are controlled entirely by the CPU.
However, in some application scenarios, the scheme shown in fig. 5 also has a deficiency.
First, performing the scheduling consumes CPU resources. For example, the CPU needs to constantly monitor the status of each FIFO and be ready to receive and transmit data at any time. Monitoring the states and processing data according to those states consumes CPU time. In some applications, the cost of monitoring the FIFOs and processing data is so large that CPU utilization is almost full and no CPU time remains for other tasks (reading pictures, pre-processing pictures, etc.).
Second, a plurality of FIFOs must be provided in the dedicated accelerator, which occupies PL resources.
Second embodiment
The second embodiment is characterized as follows. First, the dedicated accelerator and the CPU share the external memory, and both can read it. Second, the CPU controls only the instruction input of the dedicated accelerator. In this way, the CPU and the dedicated accelerator operate as a combined system in which the CPU takes on the tasks that the dedicated accelerator cannot accomplish.
As shown in fig. 6, in the second embodiment the dedicated accelerator (PL) interacts directly with the external memory (DDR). Accordingly, the Input FIFO and Output FIFO between the DMA and the dedicated accelerator (shown in fig. 5) are eliminated, and only 1 FIFO is kept for transmitting instructions between the DMA and the dedicated accelerator, thereby saving resources.
The CPU no longer needs to perform complex scheduling of input and output data; instead, the data are accessed directly from the external memory (DDR) by the dedicated accelerator. While the artificial neural network is operating, the CPU may perform other processing, such as reading image data to be processed from a camera.
Therefore, the second embodiment solves the problem of an overloaded CPU and frees the CPU to process more tasks. However, the dedicated accelerator must perform data access control to the external memory (DDR) by itself.
Modifications of the first and second embodiments
In both the first and second embodiments, the CPU controls the accelerator by instructions.
During execution, the accelerator may erroneously "run away" (i.e., the program enters a dead loop or runs meaninglessly out of control). In the schemes described so far, the CPU cannot determine whether the accelerator has run away.
In a modified embodiment based on the first or second embodiment, the inventors have also provided a "state peripheral" in the CPU, passing the state of the Finite State Machine (FSM) in the dedicated accelerator (PL) directly to the CPU.
By detecting the state of the Finite State Machine (FSM), the CPU can know how the accelerator is operating. If the CPU finds that the accelerator has run away or become stuck, it may also send a signal to reset the accelerator directly.
Figure 7 shows an example of adding a "stateful peripheral" to the architecture of the first embodiment shown in figure 4.
Figure 8 shows an example of adding a "stateful peripheral" to the architecture of the second embodiment shown in figure 6.
As shown in figs. 7 and 8, a Finite State Machine (FSM) is provided in the controller of the dedicated accelerator, and the state of the FSM is transmitted directly to a state peripheral (i.e., a state monitoring module) of the CPU, so that the CPU can monitor fault conditions such as the program running away.
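The CPU-side check enabled by the state peripheral can be sketched as a simple watchdog; the callback names, the "DONE" terminal state, and the stuck-state heuristic are all illustrative assumptions, not details from the original.

```python
def watchdog(read_fsm_state, reset_accel, max_same=1000):
    """Sketch of the state-peripheral check described above (names are
    hypothetical): poll the accelerator's FSM state; if the same state is
    reported for too many consecutive polls, assume the accelerator has
    run away or is stuck, and reset it."""
    last, count = None, 0
    while True:
        state = read_fsm_state()
        if state == "DONE":           # assumed terminal state name
            return "finished"
        count = count + 1 if state == last else 1
        last = state
        if count >= max_same:
            reset_accel()             # CPU directly resets the accelerator
            return "reset"
```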
Comparison of the first and second embodiments
Each of the two scheduling strategies of the first and second embodiments has its own advantages.
In the embodiment of FIG. 4, picture data must be transmitted to the dedicated accelerator by CPU-scheduled DMA, so the dedicated accelerator has idle time. However, since the CPU schedules the data handling, the dedicated accelerator is responsible only for calculation; its computing power is fully exploited, and the time to process the data is correspondingly shortened.
In the embodiment of FIG. 6, the dedicated accelerator can access data on its own, without the CPU scheduling data transfers, so data processing may be performed independently on the dedicated accelerator.
The CPU may be responsible only for reading data from and outputting data to external systems. The reading operation means, for example, that the CPU reads picture data from a camera (not shown) and transfers it to the external memory; the output operation means that the CPU outputs the recognition result from the external memory to a screen (not shown).
With the embodiment of fig. 6, tasks may be pipelined, allowing multiple tasks to be processed faster. The corresponding cost is that the dedicated accelerator is responsible for both calculation and data handling, so its efficiency is lower and each processing takes longer.
Fig. 9 comparatively shows the difference in the processing flow of the first and second embodiments.
Application of the second embodiment: face recognition
According to the second embodiment, since there is a shared external memory (DDR), the CPU and the dedicated accelerator can perform one calculation task together.
For example, in a face recognition task, the CPU can read the camera and detect the face in the input picture, while the dedicated neural-network accelerator completes the recognition of the face.
Using the co-design of the CPU and the dedicated accelerator, such neural network computing tasks can be rapidly deployed on embedded devices.
In particular, referring to example 2 of fig. 9, the reading (e.g., reading from a camera) and pre-processing of the picture is run on the CPU, and the processing of the picture is done on a dedicated accelerator.
The method separates the tasks of the CPU and the accelerator, so that the CPU and the accelerator can completely process the tasks in parallel.
Table 1 shows the performance comparison between the second embodiment (CPU + dedicated accelerator co-design) and a CPU-only implementation.
Table 1
The CPU used for comparison was the Tegra K1 manufactured by NVIDIA. It can be seen that with our co-design of CPU + dedicated accelerator there is significant acceleration for each layer, with an overall acceleration of up to 7 times.
The invention has the advantage that the rich functionality of the CPU (general-purpose processor) compensates for the limited flexibility of the dedicated accelerator (programmable logic module PL, such as an FPGA), while the high computing speed of the dedicated accelerator compensates for the CPU's insufficient computing speed for real-time calculation.
Further, the general-purpose processor may be an ARM processor or any other CPU. The programmable logic module may be an FPGA or another programmable dedicated processor (ASIC).
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description covers only specific embodiments of the present invention, but the scope of the present invention is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present invention, and all such changes or substitutions shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.