Detailed Description
To make the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
A pure hardware design is limited by bandwidth and resources, so its contribution to system performance is limited. Some researchers therefore adopt software-hardware co-design to further improve the performance of FPGA-based deep convolutional neural network accelerators. The combination of software and hardware mainly falls into the following two aspects. 1) Alleviating bandwidth pressure: the data is compressed using pruning, quantization and similar techniques. For example, structured pruning can be deployed directly into existing deep convolutional neural network accelerator architectures so that the CNN model is pruned in a regular way; unstructured pruning (for example, randomly clipping weight nodes) can raise the compression rate of the accelerator, but extra position information for the non-zero weights has to be stored, which complicates the hardware circuit design. Quantization is currently a common model compression method: 8-bit fixed-point quantization basically preserves the original accuracy and is widely used, but its compression capability is limited. To compress the model further, some researchers study low-bit/ultra-low-bit quantization (for example, 6-bit or 4-bit), and others adopt binary networks that convert multiplications into logic operations, which greatly improves accelerator performance but loses too much network accuracy. 2) Reducing the amount of computation: the computation can be carried out in a transform domain (for example, the Winograd transform or the FFT). One-dimensional Winograd transformation reduces the number of multiplications by about 1/3; two-dimensional Winograd transformation reduces it by a factor of about 2.25. The FFT-based OaA (Overlap-and-Add) algorithm performs the convolution on transformed data and improves system performance by about 3.3 times compared with time-domain convolution.
Some designs optimize the YOLO and Fast-RCNN target detection algorithms with single-precision floating-point data, but this increases the pressure on system resources and bandwidth, so the utilization of the Digital Signal Processor (DSP) blocks is low. Other designs accelerate the YOLOv1 network on a Xilinx KU115 board but accelerate only the convolutional layers, so the overall performance drops once the fully connected layers are taken into account. A lightweight YOLOv2 design implements the feature extraction part with a binary network and the classification and regression parts with single-precision floating-point numbers; the binary network reduces the amount of computation and the transmission bandwidth, but it also loses data precision, so the accuracy decreases. Because of the high energy consumption of the GPU (Graphics Processing Unit), GPU-based designs cannot meet the requirements of embedded applications.
Fig. 1 is a flowchart illustrating an image processing method that can be applied to a target detection apparatus according to an embodiment of the present application. As shown in Fig. 1, the method includes the following steps.
Step 110, preprocessing an image to be detected to obtain an input feature map, and extracting a first depth parallelism and a longitudinal parallelism of the input feature map.
For example, the input image is processed by any one of normalization, centering and standardization to obtain the input feature map. Normalization scales the pixel values of the input image into the range 0-1; centering subtracts the average pixel value from the pixel values of the input image so that the new pixel values have a mean of 0; standardization treats the pixel values of the input image as a standard Gaussian distribution, i.e., the new pixel values have a mean of 0 and a standard deviation of 1.
It should be noted that, for centering and standardization, the pixel mean and standard deviation may be calculated per color channel, or they may be calculated over a single image, a batch of images, or the entire training data set to obtain the input feature map. Normalization is usually the first method to try: because the pixel values of the input image always lie in the range 0-255, all pixels only need to be divided by 255, which is simple and easy to implement. Centering may be global or local, and the average may be computed over different numbers of images. The above preprocessing manners are only examples and may be set according to the specific situation; other preprocessing manners not described here also fall within the protection scope of the present application and are not repeated.
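As a minimal illustration only (the function names and the externally supplied per-channel statistics are assumptions, not part of the embodiment), the three preprocessing options can be sketched in C as follows:

    #include <stddef.h>

    /* Scale 8-bit pixel values into [0, 1] (normalization). */
    void normalize_pixels(const unsigned char *src, float *dst, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] / 255.0f;
    }

    /* Subtract the mean so that the new pixel values average to 0 (centering). */
    void center_pixels(const unsigned char *src, float *dst, size_t n, float mean) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] - mean;
    }

    /* Map pixels to zero mean and unit standard deviation (standardization). */
    void standardize_pixels(const unsigned char *src, float *dst, size_t n,
                            float mean, float stddev) {
        for (size_t i = 0; i < n; i++)
            dst[i] = (src[i] - mean) / stddev;
    }

In practice the mean and standard deviation would be computed per channel, per image, per batch, or over the whole training set, as described above.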
In a specific implementation, the input feature map may have different numbers of dimensions. For example, when the input feature map is a three-dimensional image, the longitudinal parallelism represents its length along the Y axis, the first depth parallelism represents its length along the Z axis, and its length along the X axis may be set according to the actual situation. The above dimension information is only an example and may be set according to the specific situation; other dimension information not described here also falls within the protection scope of the present application and is not repeated.
Step 120, performing vectorization processing on the input feature map according to the first depth parallelism and the longitudinal parallelism to obtain N pieces of input vector data.
N is an integer greater than or equal to 1. The second depth parallelism of the output feature map is the same as the number of convolution kernels; the first depth parallelism and the second depth parallelism are parallelism values determined according to a parallelism model, and the parallelism model is determined according to hardware resources and a memory bandwidth model.
The depth of the convolution kernel is the same as the longitudinal parallelism, so that when the input feature map is convolved with the convolution kernel, the amount of computation in the longitudinal direction is reduced and the convolution operation is accelerated.
For example, VEC_SIZE denotes the first depth parallelism, PE_NUM_Y denotes the longitudinal parallelism, and PE_NUM_Z denotes the second depth parallelism. The input feature map is vectorized in units of VEC_SIZE × PE_NUM_Y along the depth dimension and the longitudinal dimension to obtain N pieces of input vector data of size VEC_SIZE × PE_NUM_Y, and convolution processing is then performed on each piece of input vector data to increase the speed of the convolution operation.
Because hardware resources differ, the bandwidth available for the convolution operation also differs. The parallelism model is therefore determined from the hardware resources and the memory bandwidth model, and VEC_SIZE and PE_NUM_Z are then determined according to the parallelism model, which ensures that the speed of the convolution operation is improved as much as possible within the limited hardware resources.
Step 130, performing convolution operation on the N pieces of input vector data and the convolution kernel simultaneously to obtain an output feature map.
For example, the N pieces of input vector data may be convolved with one convolution kernel at the same time to improve the sharing of the weight values; or one piece of input vector data may be convolved with a plurality of convolution kernels at the same time to improve the sharing of the input data. Either way, the convolution operation is accelerated and the output feature map is obtained. The above convolution manners are only examples and may be set according to the actual situation; other convolution manners not described here also fall within the protection scope of the present application and are not repeated.
In this embodiment, the input feature map is vectorized according to its first depth parallelism and longitudinal parallelism to obtain N pieces of input vector data, and the N pieces of input vector data are then convolved with the convolution kernel simultaneously to obtain the output feature map. One convolution kernel can therefore operate on N pieces of input vector data at the same time, which accelerates the convolution operation at low energy consumption and improves the processing speed of the input feature map. Meanwhile, the precision of the target object in the output feature map is better than that in the input feature map, so the precision of the target object can be improved and its category in the input feature map can be determined more accurately, which facilitates applications in the field of machine vision.
The embodiment of the present application provides another possible implementation manner, where before step 110, the method further includes: and caching the input feature map into an input cache region, wherein the storage form of the input cache region at least comprises any one of a folding storage form, a double-cache mechanism and a multi-port loading form.
Through various storage forms, the speed of the data access operation of the input buffer area is accelerated, and high-reliability data are conveniently provided for convolution operation. Meanwhile, the input characteristic diagram is firstly cached in the input cache region, so that the pressure of interactive bandwidth with other equipment can be relieved.
In some implementations, caching the input feature map in an input cache area includes: if the storage form of the input buffer area is determined to be a folding form, folding data corresponding to the input characteristic diagram according to a data step length, and storing the folded data in the input buffer area, wherein the data step length is a value determined according to the longitudinal parallelism, the data length required by the convolution operation of the input characteristic diagram on the longitudinal dimension, and the unit step length; and if the storage form of the input cache region is determined to be a multi-port loading form, caching the data corresponding to the input characteristic diagram into the input cache region according to the number of ports and the number of clock cycles required for loading one data buffer region.
For example, the folded data length is calculated using the following formula:
Hrow=(PE_NUM_Y-1)×S+K。
Here, Hrow represents the data length required for PE_NUM_Y convolution operations in the longitudinal dimension (Y dimension) of the input feature map, S represents the data step size, and K represents the size of the convolution kernel. According to the longitudinal parallelism PE_NUM_Y, multiple regions of the input feature map can be processed synchronously, which reduces the number of line buffers and improves the generality of the convolutional neural network structure.
The embodiment of the present application provides another possible implementation manner. Fig. 2 is a schematic flowchart of a method for accelerating the convolution operation with a multiply-accumulate tree. As shown in Fig. 2, step 130 includes the following steps.
Step 131, performing convolution operation on the N input vector data and the convolution kernel simultaneously to obtain output vector data.
For example, N input vector data are convolved with one convolution kernel at the same time, or one input vector data is convolved with a plurality of convolution kernels to obtain output vector data.
Step 132, the multiply-accumulate tree is used to process the output vector data and the corresponding weight parameters to obtain an output feature map.
For example, the output vector data is obtained after a data rearrangement: the data of the input feature map, originally arranged as (W, H, N), is rearranged into (VEC_SIZE, W, H, N/VEC_SIZE), and the VEC_SIZE dimension is vectorized. Here W denotes the horizontal length of the input vector data, H denotes its vertical length, and N denotes its depth. The corresponding weight parameters can be rearranged in the same way, which makes the data more favorable for the convolution operation. Point-to-point multiplication is then performed between the output vector data and the corresponding weight parameters, the resulting products are added together, and the output feature map is obtained after multiple such operations.
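The following sketch illustrates, under assumed memory layouts, the rearrangement from (W, H, N) to (VEC_SIZE, W, H, N/VEC_SIZE) and one node of the multiply-accumulate tree; the identifiers are illustrative and the exact layouts used by the embodiment may differ:

    #include <stddef.h>

    /* Rearrange a (W, H, N) feature map into (VEC_SIZE, W, H, N/VEC_SIZE) so that
     * VEC_SIZE consecutive depth values can be read as one vector.
     * Layout assumption (illustrative): the source index is ((n*H + y)*W + x). */
    void reorder_whn_to_vec(const signed char *src, signed char *dst,
                            int W, int H, int N, int VEC_SIZE) {
        for (int g = 0; g < N / VEC_SIZE; g++)        /* depth group */
            for (int y = 0; y < H; y++)
                for (int x = 0; x < W; x++)
                    for (int v = 0; v < VEC_SIZE; v++) {
                        int n = g * VEC_SIZE + v;
                        dst[((g * H + y) * W + x) * VEC_SIZE + v] =
                            src[(n * H + y) * W + x];
                    }
    }

    /* One node of the multiply-accumulate tree: point-to-point products of a
     * VEC_SIZE data vector and the matching weight vector, then a summation. */
    int mac_tree(const signed char *data, const signed char *weight, int VEC_SIZE) {
        int sum = 0;
        for (int v = 0; v < VEC_SIZE; v++)
            sum += (int)data[v] * (int)weight[v];
        return sum;
    }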
In some specific implementations, performing point-to-point multiply-accumulate operation on M output vector data according to a data bit width and a weight parameter of the output vector data to obtain a first accumulation result, where M is an integer greater than or equal to 1; caching the first accumulation result into a shift cache region; according to the depth of the shift cache region, carrying out local accumulation on the first accumulation result to obtain a second accumulation result; caching the second accumulation result into a delay cache region; and performing addition operation on the data in the delay cache again to obtain an output characteristic diagram.
For example, the output vector data is stored using the char type (data bit width of 8 bits). Point-to-point multiply-accumulate operations are then performed on the M pieces of output vector data, i.e., each of the 8 bits of data is multiplied by the data on the corresponding bit of the corresponding weight parameter to obtain 8 products, and the 8 products are accumulated in turn to obtain the first accumulation result. When the first accumulation result is written into the shift cache region, the depth of the shift cache region needs to be adjusted so that a pipeline with an initiation interval of 1 can be formed, where the depth of the pipeline is determined according to VEC_SIZE. Then, according to the depth of the shift cache region, the first accumulation result is locally accumulated to obtain a second accumulation result, and the second accumulation result is cached in a delay cache region; the data in the delay cache region is added again to obtain the output feature map.
In this embodiment, using different data bit widths avoids unnecessary waste of logic resources. Performing point-to-point multiply-accumulate operations on the M pieces of output vector data according to the data bit width and the weight parameters to obtain the first accumulation result allows data of different bit widths to be processed, improving the data processing capability. Caching the first accumulation result in the shift cache region and the delay cache region for local accumulation allows a pipeline with an initiation interval of 1 to be formed, improving the data processing speed.
Another possible implementation manner is provided in the embodiment of the present application, where after step 132, the method further includes: and caching the output characteristic graph to an output buffer area, wherein the storage form of the output buffer area comprises a multi-port loading mode.
In this embodiment, the output feature map is cached in an output cache region that supports multi-port loading, so that the output feature map can be processed through different ports. For example, when the number of ports is n, each port is responsible for Ceil(number of line cache regions/n) line cache regions, so the data loading time is reduced by a factor of n and the data loading efficiency is improved.
Another possible implementation manner is provided in the embodiment of the present application, where after step 132, the method further includes: rearranging the output characteristic diagram according to the first depth parallelism to obtain an rearranged result; and outputting the rearrangement result to an output buffer area.
Wherein, outputting the rearrangement result to the output buffer area includes: and outputting the rearrangement result to an output buffer area in a multi-port data storage mode.
By rearranging the output characteristic diagram, the parallelism can be provided for the subsequent maximum pooling operation, and a correct data format can be provided for the input of the lower convolution, so that the data can be processed as fast as possible, and the data processing efficiency is improved.
In some implementations, outputting the rearrangement result to the output buffer includes: performing pooling treatment on the rearrangement result according to the first depth parallelism to obtain a pooled result; and outputting the pooled result to an output buffer area in the form of an Open Computing Language (OpenCL) pipeline.
The result after the pooling is output to the output buffer area in the form of an OpenCL pipeline, the OpenCL pipeline can guarantee high-efficiency intercommunication of data, a deep pipeline structure can be formed between the pooling layer and the output buffer area, and data processing efficiency is improved.
In some implementations, the first depth parallelism of the input feature map and the second depth parallelism of the output feature map are obtained so that they match each other and adapt to different hardware environments. In practical use there are many hardware parameters, and if some of them change, the final performance of the convolution acceleration also changes. The parallelism model is therefore determined so that the best matching combination of the first depth parallelism of the input feature map and the second depth parallelism of the output feature map can be found through it to achieve the desired performance index. Fig. 3 is a flowchart illustrating a method for generating the parallelism model and determining the first depth parallelism and the second depth parallelism from it; as shown in Fig. 3, the method may include the following steps.
Step 301, analyzing the hardware parameters to obtain the longitudinal parallelism of the input feature map.
For example, as shown in Table 1, the hardware parameters include the number of ports of the data read kernel, the number of ports of the data write kernel, the first depth parallelism of the input feature map, the longitudinal parallelism of the input feature map, and the second depth parallelism of the output feature map. VEC_SIZE takes a power of 2 (e.g., 2, 4, 8, 16, etc.), and PE_NUM_Y, PE_NUM_Z, n and m take positive integer values.
It should be noted that the value of PE_NUM_Y should divide the per-layer data of the input feature map as evenly as possible (a brief sketch of this choice follows Table 1). Once PE_NUM_Y is known, the optimal value of n can be calculated from the formula for the number of clock cycles required to load one data buffer and the formula for the number of clock cycles required to transmit the data of one data buffer to the max-pooling kernel, and the value of m is then determined by analyzing the output buffer.
TABLE 1 Hardware parameter list

Variable parameter | Meaning
n                  | Number of ports of the data read kernel
m                  | Number of ports of the data write kernel
VEC_SIZE           | First depth parallelism of the input feature map
PE_NUM_Y           | Longitudinal parallelism of the input feature map
PE_NUM_Z           | Second depth parallelism of the output feature map
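As an illustrative sketch of the note above (the helper name and the search over candidate values are assumptions, not part of the embodiment), PE_NUM_Y can be chosen as the largest candidate that divides the height of every convolution layer:

    /* Illustrative helper: pick the largest candidate PE_NUM_Y that divides the
     * height of every convolution layer, so that the vertical parallelism wastes
     * as few rows as possible. Falls back to 1 if no candidate divides all layers. */
    int choose_pe_num_y(const int *layer_heights, int num_layers, int max_pe) {
        for (int pe = max_pe; pe >= 1; pe--) {
            int ok = 1;
            for (int l = 0; l < num_layers; l++)
                if (layer_heights[l] % pe != 0) { ok = 0; break; }
            if (ok) return pe;
        }
        return 1;
    }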
Step 302, hardware resource information is obtained.
Wherein, the hardware resource information comprises: logic resources, the number of DSP chips, and Random Access Memory (RAM) resources.
In one specific implementation, the RAM resources include the number of on-chip cache regions and global memory ports. For example, the data bit width of the data cache region of the data read kernel is 8, S_Line indicates the data length of each line cache, and each line cache requires C_Rd_Line M20K blocks, as shown in formula (1); the total number of M20K blocks needed by the data buffer is C_Rd_f, as shown in formula (2), where the factor 2 represents the double buffer.
The memory space required by the weight cache region is h_w = s_w × PE_NUM_Z, in Bytes. The data bit width of the weight cache region is 8, and the actual number of M20K memory cells is C_Rd_w, as shown in formula (3). The compiler based on Intel FPGA OpenCL opens up memory space in powers of 2 by default, so when h_w is not a power of 2, the memory space actually allocated by the compiler will be greater than h_w.
The line cache of the max-pooling kernel has a length S_pool, and its M20K memory cell count is C_Pool, as shown in formula (4), where the factor 2 denotes the two line caches. The number of line caches is related to the size of the pooling window.
The data read kernel has n data loading ports, and each port needs C_Load_f M20K blocks; the weight loading port requires C_Load_w M20K blocks; the bias loading port requires C_Load_b M20K blocks. In total, the ports of the data read kernel require C_Rd_Port M20K memory cells, as shown in formula (5).
C_Rd_Port = C_Load_f × n + C_Load_w + C_Load_b (5)
The data write kernel has m data storage ports, and each port needs C_Store_f M20K blocks; therefore, C_Wr_Port M20K memory cells are required in total, as shown in formula (6).
C_Wr_Port = C_Store_f × m (6)
Here, C_Load_f, C_Load_w, C_Load_b and C_Store_f are all related to the data type of the global memory access port, as shown in formulas (7) to (9).
C_Load_f = C1 × VEC_SIZE + C0 (7)
C_Load_w = C1 × VEC_SIZE × PE_NUM_Z + C0 (8)
C_Load_b = C_Store_f = C1 + C0 (9)
In summary, the total usage of RAM resources is shown in formula (10), where C0, C1 and C2 are constants associated with the hardware platform.
C_RAM = C_Rd_f + C_Rd_w + C_Pool + C_Rd_Port + C_Wr_Port + C2 (10)
In one specific implementation, the number of DSPs can be calculated as follows. For example, if one DSP supports two 8-bit multiplications, the number of DSPs consumed by the convolution kernel, C_DSP_CONV, can be obtained from formula (11). The total number of DSPs, C_DSP, is shown in formula (12), where C3, C4, C5 and C6 are all constants.
C_DSP = C3 × VEC_SIZE × PE_NUM_Y × PE_NUM_Z + C4 × n + C5 × m + C6 (12)
In one embodiment, the usage of logic resources, C_Logic, is shown in formula (13), where C7, C8 and C9 are all constants.
C_Logic = (C7 + C8 × VEC_SIZE) × PE_NUM_Y × PE_NUM_Z + C9 (13)
Through the above formulas, different types of hardware resource usage can be calculated from the first depth parallelism and the longitudinal parallelism of the input feature map and the second depth parallelism of the output feature map, so that the design can be matched to a variety of hardware resources, which facilitates the subsequent model analysis and improves the portability of the system.
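A sketch of this resource model in C, directly following formulas (5)-(13); the constants C0-C9 are the platform-fit values mentioned above, and the buffer terms from formulas (1)-(4), which are not reproduced here, are passed in as inputs. The structure and function names are illustrative assumptions:

    /* Resource estimates following formulas (5)-(13). All constant values are
     * obtained by fitting compilation results on the target board. */
    typedef struct { double c0, c1, c2, c3, c4, c5, c6, c7, c8, c9; } fit_consts;

    double ram_usage(fit_consts c, int vec_size, int pe_num_z, int n, int m,
                     double c_rd_f, double c_rd_w, double c_pool) {
        double c_load_f  = c.c1 * vec_size + c.c0;              /* (7) */
        double c_load_w  = c.c1 * vec_size * pe_num_z + c.c0;   /* (8) */
        double c_load_b  = c.c1 + c.c0;                         /* (9) */
        double c_store_f = c.c1 + c.c0;                         /* (9) */
        double c_rd_port = c_load_f * n + c_load_w + c_load_b;  /* (5) */
        double c_wr_port = c_store_f * m;                       /* (6) */
        return c_rd_f + c_rd_w + c_pool + c_rd_port + c_wr_port + c.c2; /* (10) */
    }

    double dsp_usage(fit_consts c, int vec_size, int pe_num_y, int pe_num_z,
                     int n, int m) {
        return c.c3 * vec_size * pe_num_y * pe_num_z + c.c4 * n + c.c5 * m + c.c6; /* (12) */
    }

    double logic_usage(fit_consts c, int vec_size, int pe_num_y, int pe_num_z) {
        return (c.c7 + c.c8 * vec_size) * pe_num_y * pe_num_z + c.c9; /* (13) */
    }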
Step 303, determining an average memory bandwidth model according to the theoretical calculation time of convolution operation, the weight value of the training image, the offset value of the training image and the space occupied by the training image.
The theoretical calculation time is calculated from the parallelism information of the training images, the convolution operation amount, and the second depth parallelism of the training images, where the parallelism information of the training images includes their first depth parallelism and longitudinal parallelism.
For example, the theoretical calculation time of the convolution operation is first obtained from formula (14), where Freq represents the clock frequency and Op_l represents the operation amount of the l-th convolution layer.
For a particular network model, the overall performance (FPS) is shown in formula (15). For the l-th convolution layer, the input feature map needs to be prefetched a certain number of times in each of the three dimensions before the complete computation can be finished, and the numbers of prefetches can be obtained by calculation using formulas (16) and (17).
Therefore, for the whole convolutional neural network model, the size H_f of the input feature map data that needs to be loaded from the off-chip global memory is shown in formula (19), in Bytes, where N_Line indicates the number of line buffers and N_col indicates the number of convolutions that can actually be performed within each line buffer. The size H_w of the weights to be loaded from the off-chip global memory is shown in formula (20), in Bytes. The size H_b of the biases to be loaded from the off-chip global memory is shown in formula (21), in Bytes. In summary, the average memory bandwidth H_total is given by formula (22), in Byte/s.
Through the above calculation, the average memory bandwidth model, i.e., formula (22), is obtained, so the average memory bandwidth that needs to be used is known, the input feature map can be processed within this average memory bandwidth, and the convolution operation speed is improved.
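Because formulas (14)-(22) are not reproduced in this text, the following is only a heavily simplified sketch of how the loaded data sizes could be combined into an average bandwidth figure; dividing by a per-frame time is an assumption for illustration, not the exact model of formula (22):

    /* Illustrative only: combine the loaded feature-map bytes h_f, weight bytes
     * h_w and bias bytes h_b into an average bandwidth figure. The division by
     * the per-frame time t_frame (in seconds) is an assumed simplification. */
    double average_bandwidth(double h_f, double h_w, double h_b, double t_frame) {
        return (h_f + h_w + h_b) / t_frame;   /* Byte/s */
    }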
And step 304, determining a parallelism model according to the hardware resource information and the average memory bandwidth model.
By matching the average memory bandwidth model with the hardware resource information, the average memory bandwidth model can meet the convolution operation requirement on the input characteristic diagram under the condition that the hardware resources are limited, and the parallelism model is finally obtained through multiple times of training.
For example, several rounds of fast compilation are first performed on the target board, and basic platform information is obtained from the compilation results; the values of C0~C9 in formulas (10), (12) and (13) are then obtained by function fitting; finally, the available (PE_NUM_Z, VEC_SIZE) combinations are determined by analyzing the above hardware resource information. With PE_NUM_Y, n and m determined, the parallelism model is established, which facilitates its subsequent use for determining the available combinations of PE_NUM_Z and VEC_SIZE.
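Building on the resource-estimate helpers sketched after formula (13), the design-space sweep can be illustrated as follows; the budget structure, the candidate ranges and the printed output are assumptions:

    #include <stdio.h>

    /* Illustrative design-space sweep: with PE_NUM_Y, n and m fixed, enumerate
     * (PE_NUM_Z, VEC_SIZE) pairs and keep those whose estimated RAM/DSP/logic
     * usage (formulas (10), (12), (13)) stays within the board's budget. */
    typedef struct { double ram, dsp, logic; } budget;

    void sweep_parallelism(fit_consts c, budget b, int pe_num_y, int n, int m,
                           double c_rd_f, double c_rd_w, double c_pool,
                           int max_pe_num_z) {
        for (int vec = 2; vec <= 16; vec *= 2)               /* VEC_SIZE: power of 2 */
            for (int pez = 1; pez <= max_pe_num_z; pez++) {  /* PE_NUM_Z candidates  */
                if (ram_usage(c, vec, pez, n, m, c_rd_f, c_rd_w, c_pool) > b.ram) continue;
                if (dsp_usage(c, vec, pe_num_y, pez, n, m) > b.dsp) continue;
                if (logic_usage(c, vec, pe_num_y, pez) > b.logic) continue;
                printf("feasible: VEC_SIZE=%d PE_NUM_Z=%d\n", vec, pez);
            }
    }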
Step 305, inputting the feature map to be verified into the parallelism model for verification to obtain the verified feature map and the second depth parallelism of the verified feature map.
The feature graph to be verified comprises a first depth parallelism and a longitudinal parallelism.
In a specific implementation, pooling and full-connection processing are further performed on the verified feature map to obtain an output image corresponding to the verified feature map.
Step 306, if it is determined that the output image corresponding to the verified feature map meets the performance requirements of the system, obtaining a first depth parallelism and a second depth parallelism.
For example, Fig. 4 is a table comparing the performance parameters of the output image in the present application with those of existing FPGA-based YOLO network accelerators. As shown in Fig. 4, different hardware resources (for example, an FPGA of model Zynq 7045, XILINX KU115, Zynq MPSoC, or Arria-10 GX1150, and their DSP blocks) correspond to different convolutional neural network frameworks (for example, networks built with the YOLOv1 or YOLOv2 algorithm), and the corresponding accuracy, the computing capability of the processor, and the finally obtained network throughput (Throughput) and frames per second (FPS) differ accordingly. The verification image is input into the parallelism model for training; when the output image meets the performance of the hardware system, for example when a higher FPS or throughput is obtained, the output image is determined to meet the system performance requirements, and the (PE_NUM_Z, VEC_SIZE) combination obtained at this time is used as the final first depth parallelism and second depth parallelism.
The network throughput is measured in giga operations per second (GOP/s), and the precision includes fixed-point data and floating-point data.
Through the analysis and the construction of the parallelism model, the first depth parallelism and the second depth parallelism which are matched with the system performance can be obtained, and then convolution operation is carried out according to the first depth parallelism and the second depth parallelism, so that the convolution operation speed is improved, and the processing speed of the input feature map is improved.
The following describes a target detection apparatus according to an embodiment of the present application in detail with reference to the accompanying drawings. Fig. 5 shows a schematic structural diagram of a target detection apparatus according to an embodiment of the present application. The target detection apparatus may be implemented using an FPGA. As shown in Fig. 5, the target detection apparatus includes the following modules.
The preprocessing module 501 is configured to preprocess an image to be detected to obtain an input feature map, and extract a first depth parallelism and a longitudinal parallelism of the input feature map; a vectorization processing module 502, configured to perform vectorization processing on the input feature map according to the first depth parallelism and the longitudinal parallelism, to obtain N input vector data, where N is an integer greater than or equal to 1; and a convolution operation module 503, configured to perform convolution operation on the N input vector data and the convolution kernel simultaneously to obtain an output feature map.
According to the target detection device, vectorization processing is carried out on the input feature diagram through the vectorization processing module according to the first depth parallelism and the longitudinal parallelism of the input feature diagram to obtain N input vector data, then the convolution operation module is used for carrying out convolution operation on the N input vector data and the convolution kernel at the same time to obtain the output feature diagram, so that one convolution kernel can carry out convolution operation on the N input vector data at the same time, the speed of convolution operation is accelerated under the condition of low energy consumption, and the processing speed of the input feature diagram is improved. Meanwhile, the precision of the target object in the output characteristic diagram is superior to that of the target object in the input characteristic diagram, so that the precision of the target object can be improved, the category of the target object in the input characteristic diagram is more accurate, and the method is convenient to apply in the field of machine vision.
It should be apparent that the present application is not limited to the particular configurations and processes described in the above embodiments and shown in the figures. For convenience and brevity of description, detailed description of a known method is omitted here, and for the specific working processes of the system, the module and the unit described above, reference may be made to corresponding processes in the foregoing method embodiments, which are not described herein again.
Fig. 6 shows a schematic structural diagram of a target detection system according to an embodiment of the present application. As shown in Fig. 6, the target detection system may include a host end 61 and a device end 62, which are connected through off-chip Double Data Rate 3 (DDR3) memory. The host end 61 includes a Task Scheduler 601 and a Reorg module 602 (Reorg Function). The device end 62 is implemented with a reconfigurable FPGA; its low power consumption gives it an obvious advantage for deployment at the edge, and it can efficiently implement the YOLOv2 algorithm. The device end 62 includes a data read kernel 6100 (MemRD Kernel), a data write kernel 6200 (MemWR Kernel), a convolution kernel 6300 (Conv Kernel), and a pooling kernel 6400 (MaxPool Kernel). All kernels are constructed as Single Work Items, so the device end achieves efficient pipelining. Meanwhile, the kernels are cascaded through OpenCL pipelines (channels) to form a deep pipeline structure between kernels.
The MemRD Kernel includes an extraction logic module 6110, a weight cache region 6120, and an input cache region 6130 with a double-cache mechanism. The MemWR Kernel includes a reorder module 6210 and an output buffer 6220. The Conv Kernel 6300 includes a plurality of systolic array processing units (PEs), such as PE 6311, PE 6312 and PE 6313, and further includes a plurality of data buffers and weight buffers, such as the data buffer 6331 and weight buffer 6321 connected to PE 6311, the data buffer 6332 and weight buffer 6322 connected to PE 6312, and the data buffer 6333 and weight buffer 6323 connected to PE 6313. The MaxPool Kernel includes a line buffer region 6411 (Line Buffer1), a line buffer region 6412 (Line Buffer2) and comparison logic 6420 (MaxPool Logic), where the comparison logic comprises a plurality of Max modules. The number of line buffers is related to the size of the pooling window: a 3 × 3 pooling window requires two line buffers, while a 2 × 2 pooling window requires only one. A multi-scale pooling mode is adopted to improve the portability of the target detection system.
The MemRD Kernel is responsible for preparing input data for the Conv Kernel: it caches part of the data from the global memory into the local memory and transmits the prepared data to the Conv Kernel through an OpenCL pipeline, which relieves the bandwidth pressure and guarantees the high throughput of the system. The MemWR Kernel is responsible for rearranging the convolution results: it caches the convolution results in the local memory and rearranges them in a certain order; when a pooling layer exists, the convolution results are output to the MaxPool Kernel through the OpenCL pipeline, and when there is no pooling layer, the convolution results are written back to the global cache region. The Conv Kernel is mainly used for accelerating the computation-intensive convolution operations, fully connected operations, and data activation in the convolutional neural network; to improve efficiency, multiple PEs are adopted to perform the convolution in parallel. The MaxPool Kernel downsamples the output feature map: it reads and processes data through an OpenCL pipeline and finally stores the processing result in the global memory. When a pooling layer exists, the MaxPool Kernel outputs data in the OpenCL pipeline mode; when there is no pooling layer, it outputs data in a multi-port data storage mode so as to balance the speed of the convolution operation and of result storage.
For example, the MemWR Kernel transmits data in units of VEC_SIZE to the MaxPool Kernel, which processes them with a 3 × 3 max-pooling window. The MaxPool Kernel caches the first two lines of the input feature map in the on-chip line cache regions; when the first data of the third line arrives, the maximum of the first and second line caches is taken to obtain the maximum of each column in the pooling window, this value is sent into a shift register with a depth of 3 for temporary storage, the maximum is then taken together with the third line cache, the maximum within the pooling window is finally obtained, and the result is stored back to the global memory. The contents of the two line buffers are updated in turn: Line Buffer2 is updated with the data of Line Buffer1, Line Buffer1 is updated with the newly input data, and the process repeats. In a specific implementation, different line caches may instead be averaged before the line cache regions are updated. The above updating operations for the line cache regions of the MaxPool Kernel are only examples and may be set according to the actual situation; other updating operations not described here also fall within the protection scope of the present application and are not repeated.
Through the cooperation of the line caches and the registers, a pipeline with an initiation interval of 1 can be formed inside the MaxPool Kernel, so that one max-pooling result can be output every clock cycle. Because the computation of the pooling layer is small, only the depth parallelism (namely, VEC_SIZE) is set in the MaxPool Kernel, where the value of VEC_SIZE is consistent with that in the Conv Kernel, and the execution time of the pooling layer can be controlled by adjusting VEC_SIZE.
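A simplified C sketch of the line-buffer and shift-register arrangement described above, shown for a 3 × 3 window with stride 1 and a single channel; the row width, the stride handling and the function name are illustrative assumptions:

    #define W 8               /* illustrative row width */
    #define POOL 3            /* 3x3 pooling window     */

    static signed char max3(signed char a, signed char b, signed char c) {
        signed char m = a > b ? a : b;
        return m > c ? m : c;
    }

    /* Streaming 3x3 max pooling: two line buffers hold the previous two rows,
     * a depth-3 shift register holds the last three column maxima, and one
     * pooled value is produced per cycle once the window is full. Line Buffer2
     * is refreshed from Line Buffer1, and Line Buffer1 from the incoming row. */
    void maxpool_stream(const signed char *in, signed char *out, int height) {
        signed char line1[W], line2[W];       /* Line Buffer1 / Line Buffer2 */
        signed char shift[POOL] = {0};
        int out_idx = 0;

        for (int y = 0; y < height; y++) {
            for (int x = 0; x < W; x++) {
                signed char cur = in[y * W + x];
                signed char col_max = (y >= 2) ? max3(line2[x], line1[x], cur) : cur;

                /* shift register of the last three column maxima */
                shift[2] = shift[1];
                shift[1] = shift[0];
                shift[0] = col_max;

                if (y >= 2 && x >= 2)
                    out[out_idx++] = max3(shift[0], shift[1], shift[2]);

                /* update the line buffers: Line Buffer2 <- Line Buffer1 <- input */
                line2[x] = line1[x];
                line1[x] = cur;
            }
        }
    }

With a 2 × 2 pooling window only one line buffer would be needed, matching the description above.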
The task scheduler 601 is mainly responsible for configuring the Open Computing Language (OpenCL) runtime environment and for scheduling the execution and synchronization of the device kernels through the OpenCL application programming interface (API). Following the method shown in Fig. 3, the task scheduler 601 builds a complete OpenCL execution environment. The scheduler creates two memory objects based on the context, which are used to store the input feature map and the output feature map of the convolution, respectively. Each memory object has both input and output attributes, so it can store the output of the current layer and deliver the input of the next layer. At the beginning, the scheduler transmits the preprocessed input data, in the form of a memory object, to the global memory area outside the FPGA chip through a command queue.
Before each convolution layer is executed, the task scheduler 601 configures the parameters of each kernel through the corresponding API, starts the execution of each kernel through the command queue, and monitors whether each kernel has finished through events. The execution of the pooling layer is controlled by a pooling switch: when the pooling switch is set to 1, the MaxPool Kernel starts to execute. After the four kernels on the FPGA have finished, the task scheduler 601 stores the final output result to the host memory through the command queue for subsequent operations.
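A minimal host-side sketch of this scheduling flow; the kernel argument indices, the layer-configuration structure and the use of clEnqueueTask for single work-item kernels are assumptions about details the text does not spell out:

    #include <CL/cl.h>

    /* Illustrative per-layer configuration; the field name is an assumption. */
    typedef struct { cl_uint pool_switch; /* 1 if a pooling layer follows */ } layer_cfg;

    /* Run one convolution layer: set kernel arguments, enqueue the single
     * work-item kernels, optionally start the pooling kernel, then wait. */
    void run_layer(cl_command_queue q, cl_kernel memrd, cl_kernel conv,
                   cl_kernel memwr, cl_kernel maxpool,
                   cl_mem in_buf, cl_mem out_buf, layer_cfg cfg) {
        clSetKernelArg(memrd, 0, sizeof(cl_mem), &in_buf);
        clSetKernelArg(memwr, 0, sizeof(cl_mem), &out_buf);
        clSetKernelArg(memwr, 1, sizeof(cl_uint), &cfg.pool_switch);

        cl_event done[4];
        clEnqueueTask(q, memrd, 0, NULL, &done[0]);
        clEnqueueTask(q, conv,  0, NULL, &done[1]);
        clEnqueueTask(q, memwr, 0, NULL, &done[2]);
        if (cfg.pool_switch)                      /* pooling layer present */
            clEnqueueTask(q, maxpool, 0, NULL, &done[3]);

        clWaitForEvents(cfg.pool_switch ? 4 : 3, done);
    }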
The Reorg module 602 is mainly implemented by a reorg function and is responsible for rearranging the convolution output feature map. For example, by adjusting the execution order of the network, the Reorg module 602 can execute in parallel with the 14th convolution layer. The FPGA executes memory accesses with continuous addresses more efficiently, whereas the reorg function performs reads and writes at jumping memory addresses; its logic is simple, its computation is extremely small, its execution time on the CPU is shorter than that of the 14th convolution layer, and it is used only once in the network, so running it on the host saves resources on the FPGA and improves resource utilization.
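For illustration, a space-to-depth rearrangement with stride 2 in plain C is sketched below; the exact index mapping of the YOLOv2 reorg layer may differ from this simplified version, and the CHW layout is an assumption:

    /* Illustrative reorg (space-to-depth) with stride 2 and CHW layout: every
     * 2x2 spatial block is moved into the channel dimension, turning a
     * (C, H, W) map into (4*C, H/2, W/2). It shows why the access pattern
     * jumps across memory addresses rather than reading contiguously. */
    void reorg_stride2(const float *src, float *dst, int C, int H, int W) {
        int Ho = H / 2, Wo = W / 2;
        for (int c = 0; c < C; c++)
            for (int y = 0; y < H; y++)
                for (int x = 0; x < W; x++) {
                    int co = c + C * ((y % 2) * 2 + (x % 2));  /* target channel */
                    dst[(co * Ho + y / 2) * Wo + x / 2] = src[(c * H + y) * W + x];
                }
    }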
FIG. 7 is a block diagram of a Conv Kernel Kernel convolution operation based on parallelism. In order to realize efficient convolution operation and exert the advantages of an FPGA hardware architecture, a Conv Kernel Kernel adopts the design of three parallelism degrees, specifically comprising a first depth parallelism (VEC _ SIZE) of an input feature map, a longitudinal parallelism (PE _ NUM _ Y) of the input feature map and a second depth parallelism (PE _ NUM _ Z) of an output feature map.
The product of PE_NUM_Y and PE_NUM_Z equals the total number of PEs in the Conv Kernel. PE_NUM_Z equals the number of convolution kernels of the output feature map; VEC_SIZE and PE_NUM_Z are parallelism values determined according to the parallelism model, and the parallelism model is determined according to the hardware resource and memory bandwidth models. Vectorization over VEC_SIZE of the input feature map is achieved by loop unrolling (unroll). One convolution kernel can operate on multiple regions of the input feature map simultaneously to improve the sharing of the weight values; meanwhile, one region of the input feature map can be operated on by multiple convolution kernels simultaneously to improve the sharing of the input data.
The MemRD Kernel rearranges the data of the input feature map, originally arranged as (W, H, N), into (VEC_SIZE, W, H, N/VEC_SIZE) and vectorizes the VEC_SIZE dimension. It then feeds data of size PE_NUM_Y × VEC_SIZE and weights of size PE_NUM_Z × VEC_SIZE into the Conv Kernel, which performs K² × N/VEC_SIZE multiply-accumulate operations on the input data to obtain data of size PE_NUM_Y × PE_NUM_Z and outputs them to the MemWR Kernel. As shown in Fig. 7, for example, the input feature map read by the MemRD Kernel is an image of 160 × 160 × 3 (W × H × N), there are PE_NUM_Z convolution kernels in total, and the size of each convolution kernel is 3 × 3 (i.e., K equals 3). One convolution kernel is applied to each layer of the input feature map: the input feature map is fetched horizontally in sequence according to the data step size (R points in total), the fetched data is then operated on with each convolution kernel in turn, and the output feature map is finally obtained with a size of R × C × M. When fetching the input feature map horizontally, multiple rows may be fetched together to increase the speed of the convolution.
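A simplified OpenCL-C-style sketch of one step of this computation, with the three parallel dimensions fully unrolled; the fixed parameter values, array shapes and function name are illustrative, and buffer management as well as the surrounding K² × N/VEC_SIZE loop are omitted:

    /* Simplified inner step of the convolution kernel: the three fully unrolled
     * dimensions correspond to VEC_SIZE (input depth), PE_NUM_Y (rows of the
     * input feature map processed in parallel) and PE_NUM_Z (convolution
     * kernels processed in parallel). */
    #define VEC_SIZE  8
    #define PE_NUM_Y  4
    #define PE_NUM_Z  8

    void conv_step(const char data[PE_NUM_Y][VEC_SIZE],
                   const char weight[PE_NUM_Z][VEC_SIZE],
                   int acc[PE_NUM_Y][PE_NUM_Z]) {
    #pragma unroll
        for (int z = 0; z < PE_NUM_Z; z++)
    #pragma unroll
            for (int y = 0; y < PE_NUM_Y; y++) {
                int sum = 0;
    #pragma unroll
                for (int v = 0; v < VEC_SIZE; v++)
                    sum += data[y][v] * weight[z][v];  /* point-to-point products */
                acc[y][z] += sum;   /* accumulated over the K*K*N/VEC_SIZE steps */
            }
    }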
FIG. 8 is a diagram of the operation of PEs in the Conv Kernel Kernel. One PE includes: convolution operation logic 810 and activation function logic 820. The convolution operation logic unit 810 includes: a custom MAC subunit 811, and a local accumulation subunit 812.
1) MAC subunit 811: it supports vector data input, so the vectorized input data (vectorized data) and weights (vectorized weight) are fed into the MAC subunit 811, which computes the input data and the corresponding weights with a multiply-accumulate tree. Specifically, the input data is multiplied point to point with the weights, the resulting products are added, and the output of the MAC subunit 811 is obtained after K² × N/VEC_SIZE such operations. The output is then passed to the local accumulation subunit 812 for buffering.
It should be noted that Intel FPGA boards provide variable-precision DSPs, i.e., one DSP can support multiplications of multiple data bit widths, where d0, d1, ..., dn-1, dn represent the bits of the input data, w0, w1, ..., wn-1, wn represent the bits of the weight value, and n is an integer greater than 1. For example, one DSP in the Intel Stratix V GXA7 FPGA development board can perform one 27-bit multiplication or three 9-bit multiplications. In actual use, the data bit width can be changed by reconfiguring the relevant kernel parameters, and a specific DSP can be designated to compute the corresponding data bit width, which improves the compilation frequency.
In the C language, integer data types include char (8-bit), short (16-bit), int (32-bit), long (64-bit) and so on, and the bit width of each type is a power of 2. For example, if fixed-point numbers are stored in the char type, multiplying two 8-bit integers yields a result that needs 16 bits of storage; adding j 1-bit integers yields a result that needs Ceil(Log2(j)) bits of storage; therefore, after a multiply-accumulate operation over j 8-bit integers, the result needs (16 + Ceil(Log2(j))) bits of storage in total, where j is an integer greater than or equal to 2 and Ceil denotes rounding up. Consequently, the bit width of the multiply-accumulate result output by the MAC subunit 811 is (16 + Ceil(Log2(VEC_SIZE))), and the bit width of the delay buffer is (16 + Ceil(Log2(sw/h))), where sw represents the maximum single-convolution operation amount of any layer in the network model and h represents the depth of the delay buffer. For example, in the YOLOv2 algorithm, the maximum single-convolution operation amount, at the 22nd layer, is 32 × 1280, and the depth is 6, so the bit width of the delay buffer is (16 + Ceil(Log2(sw/h))) = 29. Setting the data bit width in this way avoids unnecessary waste of logic resources.
2) Local accumulation subunit 812: it is used to realize manual clock alignment and to keep the pipeline running at high saturation. After receiving the K² × N/VEC_SIZE operation results from the customized MAC subunit 811, the local accumulation subunit adds up the buffered data according to the design of the delay buffer to obtain the convolution result. The bit width of the convolution result is larger than that of the input data (vectorized data), so a truncation operation is performed on the convolution result so that the bit width of the final result is the same as that of the input data.
It should be noted that Fig. 9a shows the multiply-accumulate pipeline when the local accumulation subunit 812 is not added. The Fetch function processing and the MAC and accumulation processing are treated as one processing unit and processed sequentially on the time axis; however, there is a data dependency between successive processing units, so the loop iteration of the next processing unit can only start after the previous one has completed, resulting in an initiation interval greater than 1. Fig. 9b shows the multiply-accumulate pipeline after the local accumulation subunit 812 is added. The output of the MAC subunit 811 is now sent to the local accumulation subunit 812 for data buffering. In a specific implementation, the local accumulation subunit 812 may consist of a shift register; by adjusting the depth of the shift register, a pipeline with an initiation interval of 1 can be formed, where the depth of the pipeline is K² × N/VEC_SIZE. This improves the processing efficiency of the pipeline.
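The shift-register accumulation pattern can be sketched as follows (single work-item style); the register depth, the int accumulator and the identifiers are assumptions chosen for illustration, while in the embodiment the accumulator width would follow the (16 + Ceil(Log2(VEC_SIZE))) rule described above:

    #define SR_DEPTH 8   /* shift-register depth, tuned so the loop reaches an initiation interval of 1 */

    /* Accumulating into a single variable creates a loop-carried dependency and
     * raises the initiation interval above 1. Instead, partial sums rotate
     * through a small shift register and are only combined after the loop. */
    int shift_register_accumulate(const int *products, int count) {
        int shift_reg[SR_DEPTH + 1];
        #pragma unroll
        for (int i = 0; i <= SR_DEPTH; i++)
            shift_reg[i] = 0;

        for (int i = 0; i < count; i++) {
            shift_reg[SR_DEPTH] = shift_reg[0] + products[i];
            #pragma unroll
            for (int j = 0; j < SR_DEPTH; j++)
                shift_reg[j] = shift_reg[j + 1];
        }

        int total = 0;
        #pragma unroll
        for (int i = 0; i < SR_DEPTH; i++)
            total += shift_reg[i];
        return total;
    }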
3) Activation function logic unit 820: it is used to activate the final result provided by the convolution operation logic unit. For example, the final result is sent to the Leaky ReLU logic circuit, which decides whether to perform a shift operation according to the sign bit X (for example, as shown in Fig. 8, if X < 0 a 3-bit shift is applied, and if X >= 0 no shift is needed); the output of the PE is then sent to the MemWR Kernel through the OpenCL pipeline.
There is always a conflict between scarce on-chip memory cells and the high demand for on-chip memory space. On one hand, the bandwidth of the off-chip memory area is limited and the access delay is high, so that the design of the on-chip memory unit can reduce the bandwidth pressure of the off-chip memory area. On the other hand, because the storage space of the on-chip storage unit is very rare, the whole neural network model cannot be cached in the on-chip storage unit, so that the design of the on-chip storage unit becomes the key for ensuring the throughput rate of the system. The device side in the present application includes three buffers, namely an input data buffer, a weight and offset buffer, and an output buffer. Each data cache region can adopt any storage form of a folding storage form, a double-cache mechanism and a multi-port loading mode.
Fig. 10 is a schematic diagram of the structure of a data buffer in the folding storage form, where S represents the data step size and K represents the size of the convolution kernel. Hrow represents the data length required by PE_NUM_Y convolution operations of the input feature map in the longitudinal dimension (Y dimension), as given by formula (23):
Hrow=(PE_NUM_Y-1)×S+K (23)
according to the longitudinal parallelism (PE _ NUM _ Y), a plurality of areas on the input feature map can be synchronously processed. So that the number of line buffers can be reduced while improving the versatility of the convolutional neural network structure. When data are loaded from the global memory to the local data cache region, each line cache stores data with the data length of S each time, so that each line cache can output one data.
Fig. 11 shows a schematic structure of a data buffer. The data cache region comprises a plurality of line caches and is a two-dimensional cache region. The two-dimensional cache region has a higher data reuse rate than the one-dimensional cache region. Increasing the data reuse rate of the local cache means a reduction in bandwidth. In actual use, the two-dimensional cache region can save about 57% of bandwidth compared with the one-dimensional cache region.
Here, S_Line represents the length of each line cache and N_Line represents the number of line caches, as shown in formula (24); the length of each line cache actually used is W_col (W_col ≤ S_Line), and W_col can be dynamically adjusted according to the depth of the input feature map and the convolution step size, as shown in formula (25); the number of convolutions that can be performed within each line cache is N_col, which is determined by W_col, as shown in formula (26). Considering the integrity of the convolution operation, the value of S_Line ensures that at least one convolution region is cached for each convolution layer.
N_col = FLOOR((W_col - K)/S + 1) (26)
Here the FLOOR(X) function rounds down, i.e., it takes the largest integer not greater than X.
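The two sizing relations that are given explicitly, formulas (23) and (26), can be sketched as small helpers; the parameter names are illustrative:

    /* Illustrative sizing helpers based on formulas (23) and (26); S is the
     * stride, K the convolution kernel size, pe_num_y the vertical parallelism
     * and w_col the line-cache length actually used for the current layer. */
    int folded_rows(int pe_num_y, int S, int K) {
        return (pe_num_y - 1) * S + K;          /* Hrow, formula (23) */
    }

    int convs_per_line(int w_col, int K, int S) {
        return (w_col - K) / S + 1;             /* N_col, formula (26); integer division = FLOOR */
    }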
The line cache regions can be designed as a double-cache mechanism: one data buffer loads data from the off-chip global memory while the other transmits pre-stored data to the Conv Kernel, and the two buffers alternate and operate simultaneously. The amount of data the data cache can load from the off-chip global memory at a time is VEC_SIZE, and the amount transmitted to the Conv Kernel is PE_NUM_Y × VEC_SIZE. This saves the time spent waiting for data loading during convolution, improves the data transmission efficiency, and guarantees efficient convolution operation.
In one embodiment, the double-cache mechanism may cause a mismatch between the parallel transmission speed and the serial loading speed of the data. Assuming that one data operation can be completed per clock cycle, the number of clock cycles required to load one data buffer from the global memory is T_load, as shown in formula (27), and the number of clock cycles required to transmit the data of one data buffer to the Conv Kernel is T_trans, as shown in formula (28). If PE_NUM_Y is set large, then T_load > T_trans. Therefore, to balance the speed of the two cache regions and ensure smooth execution of the deep pipeline among the MemRD Kernel, the Conv Kernel and the MemWR Kernel, multi-port loading of data is required.
When the number of ports is n, each port is responsible for Ceil(N_Line/n) of the line caches, so the data loading time is reduced by a factor of n.
In one implementation, the MemRD Kernel pre-reorders the weights according to the first depth parallelism and the vertical parallelism. For example, the weights are rearranged from the original (K, N, M) order into the (VEC_SIZE, PE_NUM_Z, K, N/VEC_SIZE, M/PE_NUM_Z) order. The MemRD Kernel also rearranges the biases into the (PE_NUM_Z, M/PE_NUM_Z) order according to the second depth parallelism. This provides parallelism for the MaxPool Kernel and the correct data format for the input of the next convolution layer.
The memory size s_w of a single convolution kernel takes the maximum value over all convolution layers, as shown in formula (29); the total size h_w of the weight buffer is shown in formula (30), in Bytes; and the memory size h_b required by the bias buffer is shown in formula (31), in Bytes.
s_w = Max(K_l × K_l × N_l) (29)
h_w = s_w × PE_NUM_Z (30)
h_b = PE_NUM_Z (31)
It should be noted that the weight buffer and the bias buffer are also configured with the double-cache mechanism, so that efficient data transmission can be provided for the convolution operation of the Conv Kernel; meanwhile, with manual clock alignment, the input data, weights and biases can be transmitted out through the OpenCL pipeline in the same clock cycle.
In one implementation, if the size of the output buffer is (PE_NUM_Y × PE_NUM_Z) Bytes, the MemWR Kernel needs (PE_NUM_Y × PE_NUM_Z) operations to store the data of the output buffer back to the global memory, while one complete convolution takes (K × K × N/VEC_SIZE) operations. When the depth of the input feature map is shallow or the parallelism values are large, the time the MemWR Kernel needs to write the convolution result back to the global memory is clearly longer than the convolution time. Therefore, a multi-port data storage form is required for the output buffer: for example, when the number of ports is m, each port is responsible for outputting CEIL(PE_NUM_Y × PE_NUM_Z/m) data to the off-chip global memory area.
The target detection system has high portability and expandability. The convolution operation, normalization and activation of the data use the original network structure, so good accuracy is guaranteed while the influence of weight changes on the result is taken into account. In addition, 8-bit fixed-point quantization is adopted to keep the precision loss within an acceptable range.
Fig. 12 shows a schematic structural diagram of a machine vision apparatus in an embodiment of the present application. The machine vision equipment comprises the following devices.
The image acquisition device 1201 is used for acquiring an image to be detected, wherein the image to be detected comprises a target object to be determined; the target detection device 1202 is configured to detect an image to be detected according to the image processing method in the foregoing embodiment.
The precision of the target object in the output feature map is better than that in the input feature map. For example, in a machine vision application, the target objects to be determined in the image to be detected may be a dog and a bicycle; because of the color of the background objects, the position of the objects in the picture and other factors, a robot observing the image to be detected cannot accurately acquire the category, position and other information of the target objects. The image to be detected is vectorized according to its first depth parallelism and longitudinal parallelism to obtain N pieces of input vector data, and the N pieces of input vector data are convolved with the convolution kernel simultaneously to obtain the output feature map, in which the categories of the target objects are clearer. When the robot observes the output feature map, it can recognize that the target objects are a dog and a bicycle and obtain their position information more accurately, so the detection precision of the target objects to be determined is improved.
According to the machine vision equipment, the image to be detected is obtained through the image obtaining device, the target detection device is used for detecting the image to be detected according to the image processing method, the convolution operation speed in the image analysis process is increased, and the processing speed of the image to be detected is improved; and the precision of the image to be detected is improved, so that the category of the target object to be determined is clear and visible, and the method is convenient to apply in the field of machine vision.
It should be apparent that the present application is not limited to the particular configurations and processes described in the above embodiments and shown in the figures. For convenience and brevity of description, detailed description of a known method is omitted here, and for the specific working processes of the system, the module and the unit described above, reference may be made to corresponding processes in the foregoing method embodiments, which are not described herein again.
Fig. 13 is a block diagram illustrating an exemplary hardware architecture of an electronic device capable of implementing methods and apparatus according to embodiments of the application.
As shown in fig. 13, the electronic device 1300 includes an input device 1301, an input interface 1302, a central processor 1303, a memory 1304, an output interface 1305, and an output device 1306. The input interface 1302, the central processor 1303, the memory 1304, and the output interface 1305 are connected to each other via a bus 1307, and the input device 1301 and the output device 1306 are connected to the bus 1307 via the input interface 1302 and the output interface 1305, respectively, and further connected to other components of the electronic device 1300.
Specifically, the input device 1301 receives input information from the outside, and transmits the input information to the central processor 1303 through the input interface 1302; the central processor 1303 processes input information based on computer-executable instructions stored in the memory 1304 to generate output information, stores the output information in the memory 1304 temporarily or permanently, and then transmits the output information to the output device 1306 through the output interface 1305; output device 1306 outputs output information to the exterior of electronic device 1300 for use by a user.
In one embodiment, the electronic device shown in fig. 13 may be implemented as a network device, which may include: a memory configured to store a program; a processor configured to execute the program stored in the memory to perform the image processing method described in the above embodiments.
In general, the various embodiments of the application may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the application is not limited thereto.
Embodiments of the application may be implemented by a data processor of a mobile device executing computer program instructions, for example in a processor entity, or by hardware, or by a combination of software and hardware. The computer program instructions may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages.
Any logic flow block diagrams in the figures of this application may represent program steps, or may represent interconnected logic circuits, modules, and functions, or may represent a combination of program steps and logic circuits, modules, and functions. The computer program may be stored on a memory. The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), optical storage devices and systems (digital versatile disks, DVDs, or CD discs), etc. The computer readable medium may include a non-transitory storage medium. The data processor may be of any type suitable to the local technical environment, such as but not limited to general purpose computers, special purpose computers, microprocessors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), programmable logic devices (FPGAs), and processors based on a multi-core processor architecture.
The foregoing has provided by way of exemplary and non-limiting examples a detailed description of exemplary embodiments of the present application. Various modifications and adaptations to the foregoing embodiments may become apparent to those skilled in the relevant arts in view of the drawings and the following claims without departing from the scope of the invention. Accordingly, the proper scope of the application is to be determined according to the claims.