WO2021249192A1 - Image processing method and apparatus, machine vision device, electronic device, and computer-readable storage medium

Image processing method and apparatus, machine vision device, electronic device, and computer-readable storage medium

Info

Publication number
WO2021249192A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
parallelism
data
input
output
Prior art date
Application number
PCT/CN2021/096062
Other languages
English (en)
French (fr)
Inventor
彭海勇
曹常锋
刘新阳
李火林
田万廷
Original Assignee
中兴通讯股份有限公司
Priority date
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2021249192A1

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/60 Memory management

Definitions

  • This application relates to the field of computer technology.
  • Deep convolutional neural network accelerator architectures based on Field Programmable Gate Arrays (FPGA) can be roughly divided into three types: accelerator architectures based on computing modules, accelerator architectures based on network mapping, and accelerator architectures based on systolic arrays.
  • The accelerator architecture based on computing modules focuses on the basic computing units of Convolutional Neural Networks (CNN), such as convolution, pooling, and full connection; through certain designs, several general computing modules are obtained, and these computing modules can be combined flexibly to realize the forward inference process of networks of different depths and structures.
  • The accelerator architecture based on computing modules needs to cache the inter-layer calculation results off-chip, which increases the bandwidth pressure. The accelerator architecture based on network mapping instead maps all layers to the FPGA circuit according to the network structure, so that only the input data needs to be loaded from off-chip and the final calculation result of the network stored off-chip; this avoids buffering intermediate results, realizes a pipeline structure within and between the CNN layers, and therefore has high efficiency.
  • The systolic-array-based accelerator architecture can achieve a higher clock frequency and lower logic resource consumption, but when deploying a CNN model, the configuration of the systolic array processing elements (PE) is complicated and difficult to implement. In addition, a pure hardware design is limited by bandwidth and resources, so the performance improvement is limited.
  • PE: systolic array processing element
  • One aspect of the embodiments of the present application provides an image processing method, including: preprocessing an image to be detected to obtain an input feature map, and extracting a first depth parallelism and a vertical parallelism of the input feature map; vectorizing the input feature map according to the first depth parallelism and the vertical parallelism to obtain N input vector data, where N is an integer greater than or equal to 1; and performing a convolution operation on the N input vector data and a convolution kernel simultaneously to obtain an output feature map.
  • Another aspect of the embodiments of the present application provides an image processing device, including: a preprocessing module configured to preprocess the image to be detected to obtain an input feature map and extract the first depth parallelism and the vertical parallelism of the input feature map;
  • a vectorization processing module configured to vectorize the input feature map according to the first depth parallelism and the vertical parallelism to obtain N input vector data, where N is an integer greater than or equal to 1; and
  • a convolution operation module configured to perform convolution operations on the N input vector data and the convolution kernel simultaneously to obtain an output feature map.
  • Another aspect of the embodiments of the present application provides a machine vision device, including: an image acquisition device configured to acquire an image to be detected, where the image to be detected includes a target object to be determined; and an image processing device configured to detect the image to be detected according to the image processing method provided in the embodiments of the present application and determine the category of the target object to be determined.
  • Another aspect of the embodiments of the present application provides an electronic device, including: one or more processors; and a memory on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement at least one step of the image processing method provided in the embodiments of the present application.
  • Another aspect of the embodiments of the present application provides a computer-readable storage medium on which a computer program is stored.
  • the computer program is executed by a processor, at least one step of the image processing method provided in the embodiment of the present application is implemented.
  • FIG. 1 shows a schematic flowchart of an image processing method provided by an embodiment of the present application.
  • FIG. 2 shows a schematic flow chart of a method for accelerating convolution operation using a multiply-accumulate tree provided by an embodiment of the present application.
  • FIG. 3 shows a schematic flowchart of a method for generating a parallelism model and determining a first depth parallelism and a second depth parallelism according to the parallelism model provided by an embodiment of the present application.
  • FIG. 4 shows a comparison table of the performance parameters of the output image in the embodiment of the present application and the performance parameters of the related FPGA-accelerated YOLO network.
  • Fig. 5 shows a schematic structural diagram of an image processing device provided by an embodiment of the present application.
  • FIG. 6 shows a schematic structural diagram of a target detection system provided by an embodiment of the present application.
  • FIG. 7 shows a schematic diagram of a module for performing convolution operation based on parallelism in the convolution operation kernel provided by an embodiment of the present application.
  • FIG. 8 shows a schematic diagram of the operation process of the processing unit in the convolution operation kernel provided by the embodiment of the present application.
  • Fig. 9a shows a schematic diagram of a multiply-accumulate operation provided by an embodiment of the present application when no local accumulating unit is added.
  • Fig. 9b shows a schematic diagram of a multiply-accumulate operation after adding a local accumulating unit provided by an embodiment of the present application.
  • FIG. 10 shows a schematic diagram of a structure of a data buffer area in the form of folded storage provided by an embodiment of the present application.
  • FIG. 11 shows a schematic diagram of a structure of a data buffer area provided by an embodiment of the present application.
  • Fig. 12 shows a schematic structural diagram of a machine vision device provided by an embodiment of the present application.
  • FIG. 13 shows a schematic diagram of an exemplary hardware architecture of an electronic device provided by an embodiment of the present application.
  • The FPGA-based deep convolutional neural network accelerator architectures can be roughly divided into accelerator architectures based on computing modules, accelerator architectures based on network mapping, and accelerator architectures based on systolic arrays. These three architectures are mostly designed and implemented using Register Transfer Level (RTL) circuit development methods, and their portability and scalability are relatively poor.
  • RTL: Register Transfer Level
  • the pure hardware design is limited by bandwidth and resources, so that the system performance is limited.
  • Some researchers have adopted the idea of software and hardware co-design to further improve the performance of FPGA deep convolutional neural network accelerators.
  • The combination of software and hardware is mainly divided into the following two aspects. 1) To alleviate the bandwidth pressure, the data is processed using techniques such as pruning and quantization.
  • Structured pruning methods can be deployed directly on the current deep convolutional neural network accelerator architecture, so that the CNN model can be pruned regularly; processing the data through unstructured pruning methods (for example, random cropping of weight nodes) can improve the compression rate of the deep convolutional neural network accelerator, but additional location information for the non-zero weight values needs to be stored, which makes the hardware circuit design difficult.
  • Quantization is currently a relatively common model compression method.
  • 8bit fixed-point quantization can basically keep the original accuracy unchanged and is widely used, but its compression capability is limited.
  • Some researchers study low-bit/ultra-low-bit quantization (for example, 6bit, 4bit, etc.), and some researchers use binary networks to convert multiplication operations into logical operations, which greatly improves the performance of the deep convolutional neural network accelerator, but the network accuracy loss is too large.
  • 2) To reduce the amount of calculation, calculations can be performed in the transform domain (for example, the Winograd transform, the FFT transform, etc.).
  • Through the Winograd transform, the number of multiplications can be reduced by about 1/3; through the two-dimensional Winograd transform, the number of multiplications can be reduced by a factor of about 2.25.
  • The Overlap-and-Add (OaA) algorithm based on the FFT transform performs the convolution operation on the data; compared with performing the convolution operation based on time-domain convolution, the system performance is improved by about 3.3 times.
  • DSP: Digital Signal Processor
  • The application of a binary network reduces the amount of calculation and the transmission bandwidth on the one hand, but on the other hand also loses data precision, resulting in a decrease in accuracy.
  • The Graphics Processing Unit (GPU) has high energy consumption, which makes GPU-based designs unable to meet the needs of embedded applications.
  • This application provides an image processing method and device, machine vision equipment, electronic equipment, and storage medium, which are used to solve the problem of poor portability and scalability of CNN.
  • FIG. 1 is a schematic flowchart of an image processing method provided by an embodiment of the present application, and the method can be applied to an image processing device. As shown in FIG. 1, the image processing method may include step 110-step 130.
  • step 110 the image to be detected is preprocessed to obtain an input feature map, and the first depth parallelism and vertical parallelism of the input feature map are extracted.
  • the input image is processed to obtain the input feature map.
  • For example, the pixel mean and standard deviation can be calculated on different color channels, or the mean and standard deviation of a single image, a batch of images, or the entire training data set can be calculated to obtain the input feature map. Normalization is usually the first method to try, because the pixel values of the input image are always in the range 0-255; simply dividing all pixels of the image by 255 is simple and easy to implement. Centralization can be global or local, and different numbers of images can be selected to calculate the mean value. The above preprocessing methods are only examples and can be set according to specific conditions; other preprocessing methods not described here also fall within the protection scope of this application and will not be repeated.
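  • As an illustration of the preprocessing described above, the following is a minimal Python/NumPy sketch; the function name, the per-channel statistics, and the division by 255 are illustrative choices, not the claimed implementation:

```python
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    """Normalize and centralize an H x W x C uint8 image into an input feature map.

    Illustrative sketch only: the text allows per-channel, per-image, per-batch,
    or whole-dataset statistics; here per-channel statistics of one image are used.
    """
    x = image.astype(np.float32) / 255.0          # normalization: pixel values 0-255 -> 0-1
    mean = x.mean(axis=(0, 1), keepdims=True)     # per-channel mean (centralization)
    std = x.std(axis=(0, 1), keepdims=True) + 1e-8
    return (x - mean) / std                       # standardized input feature map

# usage
feature_map = preprocess(np.random.randint(0, 256, (160, 160, 3), dtype=np.uint8))
print(feature_map.shape)
```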
  • the input feature map can be an image with different dimensions.
  • the vertical parallelism represents the length of the input feature map on the Y axis
  • the first depth parallelism represents the length of the input feature map on the Z axis
  • the length of the input feature map on the X axis can be specifically set according to actual conditions.
  • the above dimensional information of the input feature map is only an example, and can be specifically set according to specific conditions. Other unexplained dimensional information of the input feature map is also within the protection scope of this application, and will not be repeated here.
  • step 120 the input feature map is vectorized according to the first depth parallelism and the vertical parallelism to obtain N input vector data.
  • N is an integer greater than or equal to 1.
  • the second depth of parallelism of the output feature map is the same as the number of convolution kernels.
  • the first depth parallelism and the second depth parallelism are determined according to the parallelism model, and the parallelism model is determined according to the hardware resources and the memory bandwidth model.
  • the depth of the convolution kernel is the same as the vertical parallelism, so that when the input feature map is convolved with the convolution kernel, the amount of vertical calculation can be reduced and the convolution operation speed can be accelerated.
  • For example, VEC_SIZE is used to indicate the first depth parallelism, PE_NUM_Y to indicate the vertical parallelism, and PE_NUM_Z to indicate the second depth parallelism.
  • the input feature map is vectorized in units of VEC_SIZE*PE_NUM_Y, so that N input vector data of size VEC_SIZE*PE_NUM_Y can be obtained, and convolution processing is then performed on each input vector datum to increase the speed of the convolution operations.
  • under different parallelism settings, the bandwidth resources used in the convolution operation are also different.
  • the parallelism model is determined by the hardware resources and the memory bandwidth model, and VEC_SIZE and PE_NUM_Z are then determined according to the parallelism model, which ensures that the speed of the convolution operation is maximized under the given resource environment.
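  • The vectorization of step 120 can be pictured with the following sketch, assuming a (W, H, N)-ordered feature map whose height is a multiple of PE_NUM_Y and whose depth is a multiple of VEC_SIZE; the exact on-chip data layout is an assumption made only for illustration:

```python
import numpy as np

def vectorize(feature_map: np.ndarray, vec_size: int, pe_num_y: int):
    """Rearrange a (W, H, N) feature map into batches of VEC_SIZE*PE_NUM_Y input vector data.

    Sketch under the assumption that H is a multiple of PE_NUM_Y and N a multiple of VEC_SIZE.
    """
    w, h, n = feature_map.shape
    assert h % pe_num_y == 0 and n % vec_size == 0
    # (W, H, N) -> (VEC_SIZE, W, H, N/VEC_SIZE), the layout described for the data read kernel
    x = feature_map.reshape(w, h, n // vec_size, vec_size).transpose(3, 0, 1, 2)
    # split the H axis into groups of PE_NUM_Y rows; each group is one batch of
    # VEC_SIZE*PE_NUM_Y input vector data handed to the convolution kernel
    return np.split(x, h // pe_num_y, axis=2)

fm = np.arange(6 * 8 * 16, dtype=np.float32).reshape(6, 8, 16)   # W=6, H=8, N=16
groups = vectorize(fm, vec_size=8, pe_num_y=4)
print(len(groups), groups[0].shape)   # 2 groups, each (8, 6, 4, 2)
```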
  • step 130 the convolution operation is performed simultaneously with the N input vector data and the convolution kernel to obtain an output feature map.
  • N input vector data can be convolved with one convolution kernel at the same time to increase the degree of weight value sharing; one input vector datum can be convolved with multiple convolution kernels at the same time to increase the degree of input data sharing.
  • the speed of the convolution operation is accelerated, and the output feature map is obtained.
  • the above method of convolution operation is only an example, and can be set according to actual conditions. Other unexplained methods of convolution operation are also within the protection scope of this application, and will not be repeated here.
  • In this way, the input feature map can be vectorized according to the first depth parallelism and the vertical parallelism of the input feature map to obtain N input vector data, and the N input vector data and the convolution kernel then perform the convolution operation at the same time to obtain the output feature map, so that one convolution kernel can perform convolution operations with N input vector data simultaneously, ensuring that the speed of the convolution operation is accelerated under the condition of low energy consumption and improving the processing speed of the input feature map.
  • the accuracy of the target object in the output feature map is better than the accuracy of the target object in the input feature map, so that the accuracy of the target object can be improved and the category of the target object in the input feature map can be determined more accurately, which is convenient for applications in the field of machine vision.
  • the image processing method may further include: buffering the input feature map in an input buffer area, where the storage form of the input buffer area includes at least any one of a folded storage form, a double buffer mechanism, and a multi-port loading mode.
  • the speed of data access operations in the input buffer area is accelerated, and it is convenient to provide highly reliable data for convolution operations.
  • the input feature map is first cached in the input buffer area, so that the bandwidth pressure for interaction with other devices can be relieved.
  • Buffering the input feature map in the input buffer area may include: if it is determined that the storage form of the input buffer area is the folded form, folding the data corresponding to the input feature map according to the data step size and storing the folded data in the input buffer area, where the data step size is a value determined according to the vertical parallelism, the data length required for the convolution operation of the input feature map in the vertical dimension, and the unit step size; and if it is determined that the storage form of the input buffer area is the multi-port loading form, buffering the data corresponding to the input feature map into the input buffer area according to the number of ports and the number of clock cycles required to load one data buffer.
  • H_row = (PE_NUM_Y - 1) * S + K can be used to calculate the length of the folded data.
  • H_row represents the data length required for PE_NUM_Y convolution operations of the input feature map in the vertical dimension (Y dimension)
  • S represents the data step size
  • K represents the size of the convolution kernel.
  • PE_NUM_Y represents the vertical parallelism
  • multiple regions on the input feature map can be processed simultaneously.
  • the number of line buffers can be reduced, and the versatility of the CNN structure can be improved at the same time.
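  • A minimal sketch of the folded storage idea is given below; it uses H_row = (PE_NUM_Y - 1) * S + K to derive how much vertical data must be held so that PE_NUM_Y vertically adjacent convolutions can run in parallel, with array indexing as a software stand-in for the hardware buffer:

```python
import numpy as np

def fold_rows(column: np.ndarray, pe_num_y: int, s: int, k: int) -> np.ndarray:
    """Fold a 1-D column of input data into PE_NUM_Y overlapping row windows.

    H_row = (PE_NUM_Y - 1) * S + K is the data length needed so that PE_NUM_Y
    vertically adjacent convolutions (stride S, kernel size K) can proceed in parallel.
    Sketch only; the buffer in the text is an on-chip hardware structure, not an array.
    """
    h_row = (pe_num_y - 1) * s + k
    assert column.size >= h_row
    # each of the PE_NUM_Y parallel lanes sees K values, offset by the data step size S
    return np.stack([column[i * s : i * s + k] for i in range(pe_num_y)])

col = np.arange(16)
print(fold_rows(col, pe_num_y=4, s=2, k=3))
# rows start at 0, 2, 4, 6 and overlap where S < K
```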
  • step 130 may specifically include step 131 and step 132.
  • step 131 the convolution operation is performed simultaneously on the N input vector data and the convolution kernel to obtain output vector data.
  • perform convolution operation on N input vector data and one convolution kernel at the same time or perform convolution operation on one input vector data and multiple convolution kernels to obtain output vector data.
  • step 132 the output vector data and the corresponding weight parameters are processed using the multiply-accumulate tree to obtain the output feature map.
  • For example, the output vector data is rearranged: the data corresponding to the input feature map, originally arranged as (W, H, N), is rearranged into (VEC_SIZE, W, H, N/VEC_SIZE), and vectorization is performed along the VEC_SIZE dimension.
  • W represents the horizontal length of the input vector data
  • H represents the vertical length of the input vector data
  • N represents the depth of the input vector data.
  • the corresponding weight parameters can also be rearranged accordingly, so that the output feature map is more conducive to the convolution operation. Then, point-to-point multiplication is performed on the output vector data and its corresponding weight parameters, and then the obtained multiplication results are added together. After multiple operations, the output feature map can be obtained.
  • For example, a point-to-point multiply-accumulate operation is performed on M output vector data to obtain a first accumulation result, where M is an integer greater than or equal to 1; the first accumulation result is buffered in the shift buffer area; the first accumulation result is partially accumulated according to the depth of the shift buffer area to obtain a second accumulation result; the second accumulation result is buffered in the delay buffer area; and the data in the delay buffer is added again to obtain the output feature map.
  • the char type data bit width is 8bit
  • in the point-to-point multiply-accumulate operation on the M output vector data, the data on each bit of the 8bit value is multiplied by the data on the corresponding bit of the corresponding weight parameter to obtain 8 product results, and these 8 product results are then accumulated in sequence to obtain the first accumulation result.
  • the depth of the shift buffer area needs to be adjusted so that a pipeline with a start interval of 1 can be formed, and the depth of the pipeline is determined according to VEC_SIZE.
  • the first accumulation result is partially accumulated to obtain the second accumulation result, and then the second accumulation result is buffered in the delay buffer area; the data in the delay buffer is added again , Obtain the output feature map.
  • the first accumulation result enables point-to-point multiplication and accumulation operations on data with different data bit widths, thereby improving data processing capabilities.
  • the first accumulation result is buffered in the shift buffer area and the delay buffer area for partial accumulation, so that a pipeline with a start interval of 1 can be formed, which improves the processing speed of the data.
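  • The interaction of the multiply-accumulate tree with the shift buffer can be modeled in software as below; the shift depth of 6 and the number of iterations are illustrative values, and the point of the sketch is only that spreading the first accumulation results over several partial sums removes the loop-carried dependency before a final re-addition:

```python
import numpy as np

def mac_tree_with_shift_buffer(vectors, weights, shift_depth: int) -> float:
    """Multiply-accumulate tree with a shift-register-style local accumulator.

    Each iteration produces one point-to-point product sum (the "first accumulation
    result"); the results are spread over `shift_depth` partial sums so that
    consecutive iterations have no data dependency (modeling a pipeline with a start
    interval of 1), and the partial sums are added again at the end. This is a
    software sketch, not the hardware circuit itself.
    """
    partial = np.zeros(shift_depth)            # models the shift buffer contents
    for i, (v, w) in enumerate(zip(vectors, weights)):
        first_acc = float(np.dot(v, w))        # point-to-point multiply + tree add
        partial[i % shift_depth] += first_acc  # partial accumulation in the shift buffer
    return float(partial.sum())                # final re-addition (delay buffer stage)

rng = np.random.default_rng(0)
vs = rng.integers(-8, 8, size=(27, 8)).astype(np.float64)   # e.g. K*K*N/VEC_SIZE = 27 steps
ws = rng.integers(-8, 8, size=(27, 8)).astype(np.float64)
assert np.isclose(mac_tree_with_shift_buffer(vs, ws, shift_depth=6),
                  float((vs * ws).sum()))
```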
  • the image processing method may further include: buffering the output feature map in an output buffer area, wherein the storage form of the output buffer area includes a multi-port loading mode.
  • the output feature map can be processed through different ports. For example, when the number of ports is n, the number of line buffers each port is responsible for is Ceil(number of line buffers / n), so that the data loading time can be reduced by a factor of n and the efficiency of data loading is improved.
  • the image processing method may further include: rearranging the output feature map according to the first depth parallelism to obtain the rearrangement result; and outputting the rearrangement result to the output buffer area.
  • outputting the rearrangement result to the output buffer area includes: outputting the rearrangement result to the output buffer area in the form of multi-port storage data.
  • outputting the rearrangement result to the output buffer includes: pooling the rearrangement result according to the first depth parallelism to obtain the pooled result; and outputting the pooled result to the output buffer in the form of an Open Computing Language (OpenCL) channel.
  • the OpenCL pipeline can ensure efficient intercommunication of data, making it possible to form a deep pipeline structure between the pooling layer and the output buffer to improve data processing efficiency.
  • the first depth parallelism of the input feature map and the second depth parallelism of the output feature map can be obtained in the following manner, so that the first depth parallelism and the second depth parallelism can be matched to adapt to different hardware environments. In actual use, because there are many hardware parameters, if some parameters change, the final performance of the convolution acceleration will also change.
  • the following steps are used to determine the parallelism model, so that the parallelism model can be used to determine the best matching combination of the first depth parallelism of the input feature map and the second depth parallelism of the output feature map to achieve the desired performance index.
  • FIG. 3 is a schematic flowchart of a method for generating a parallelism model and determining a first depth parallelism and a second depth parallelism according to the parallelism model provided by an embodiment of the application.
  • generating a parallelism model and determining the first depth parallelism and the second depth parallelism according to the parallelism model may include the following steps 301 to 306.
  • step 301 the hardware parameters are analyzed to obtain the vertical parallelism of the input feature map.
  • the hardware parameters include the number of ports of the data read core, the number of ports of the data write core, the first depth parallelism of the input feature map, the vertical parallelism of the input feature map, and the second depth parallelism of the output feature map.
  • VEC_SIZE is a power of 2 (for example, 2, 4, 8, 16, etc.).
  • PE_NUM_Y, PE_NUM_Z, n, and m can be positive integers greater than or equal to 0.
  • the data height of each layer of the input feature map should be divisible by PE_NUM_Y as much as possible.
  • the formula for the number of clock cycles required to load a data buffer and the formula for the number of clock cycles required to transfer data from the data buffer to the max pooling core can be used to calculate the optimal value of n, and the value of m can then be determined by analyzing the output buffer area.
  • Variable parameters and their meanings:
    n: the number of ports of the data read core
    m: the number of ports of the data write core
    VEC_SIZE: the first depth parallelism of the input feature map
    PE_NUM_Y: the vertical parallelism of the input feature map
    PE_NUM_Z: the second depth parallelism of the output feature map
  • step 302 hardware resource information is obtained.
  • the hardware resource information includes: Logic resources, the number of DSP chips, and random access memory (Random Access Memory, RAM) resources.
  • the RAM resource includes the number of on-chip buffer areas and global memory ports.
  • the data bit width of the data buffer area of the data read core is 8
  • S_Line represents the data length of each line buffer, and each line buffer requires C_Rd_Line M20K memory units, as shown in formula (1); the total number of M20K memory units needed by the data buffer area is C_Rd_f, as shown in formula (2), where the factor 2 represents the double buffer area.
  • the data bit width of the weight buffer area is 8, and the actual number of M20K memory cells required is C_Rd_w, as shown in equation (3).
  • the compiler based on Intel FPGA OpenCL allocates memory space in powers of 2 by default; when the value of h_w is not a power of 2, the actual memory space allocated by the compiler will be greater than h_w.
  • the length of the line buffer of the max pooling layer core is S_pool, which requires C_Pool M20K memory units, as shown in equation (4).
  • here, 2 represents the two line buffers.
  • the number of row buffers is related to the size of the pooling window.
  • the data read core has n data load ports, each of which requires C_Load_f M20K memory units; the weight load port requires C_Load_w M20K; and the bias load port requires C_Load_b M20K.
  • the ports of the data read kernel require a total of C_Rd_Port M20K memory units, as shown in formula (5).
  • the data write core has m data storage ports, and each port requires C_Store_f M20K memory units; therefore, a total of C_Wr_Port M20K memory units are required, as shown in equation (6).
  • C_Load_f, C_Load_w, C_Load_b, and C_Store_f are all related to the data type of the global memory access port, as shown in equations (7) to (9).
  • the number of DSPs can be calculated by the following formulas. For example, if one DSP supports two 8-bit multiplication operations, the number of DSPs consumed in the convolution operation core, C_DSP_CONV, can be calculated by equation (11). The total number of DSPs, C_DSP, is shown in formula (12), where C_3, C_4, C_5, and C_6 are all constants.
  • C_DSP = C_3 * VEC_SIZE * PE_NUM_Y * PE_NUM_Z + C_4 * n + C_5 * m + C_6 (12)
  • the logic resources are as shown in formula (13), where C_RAM represents the number of logic resources, and C_7, C_8, and C_9 are all constants.
  • the average memory bandwidth model is determined according to the theoretical calculation time of the convolution operation, the weight value of the training image, the offset value of the training image, and the size of the space occupied by the training image.
  • the theoretical calculation time is calculated based on the parallelism information of the training image, the amount of convolution, and the second depth parallelism of the training image, where the parallelism information of the training image includes the first depth parallelism of the training image and the vertical parallelism of the training image.
  • the input feature map needs to be prefetched a certain number of times in each of the three dimensions before all calculations can be completed; these prefetch counts can be calculated by formula (16) and formula (17).
  • N_Line represents the number of line buffers
  • N_col represents the number of actually executable convolutions in each line buffer.
  • the average memory bandwidth H_total is shown in formula (22), and the unit is Byte/s.
  • the average memory bandwidth model can be obtained, that is, equation (22), so that the average memory bandwidth to be used can be known, and the input feature map can be processed with the average memory bandwidth, and the convolution operation speed can be improved.
  • a parallelism model is determined according to the hardware resource information and the average memory bandwidth model.
  • so that the average memory bandwidth model can meet the convolution operation requirements of the input feature map under the condition of limited hardware resources; after multiple trainings, the parallelism model is finally obtained.
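  • A very simplified sketch of how such a parallelism model might be searched is given below. The constants standing in for C_3 to C_6, the resource limits, and the bandwidth and throughput expressions are placeholders, since formulas (1) to (22) are not reproduced in this text:

```python
from itertools import product

# Illustrative constants standing in for C_3..C_6 of formula (12) and for the board
# limits; the real values depend on the target FPGA and are not given in this text.
C3, C4, C5, C6 = 1, 2, 2, 10
DSP_LIMIT, BANDWIDTH_LIMIT = 1963, 12.8e9       # assumed example limits

def dsp_cost(vec_size, pe_num_y, pe_num_z, n, m):
    # C_DSP = C3*VEC_SIZE*PE_NUM_Y*PE_NUM_Z + C4*n + C5*m + C6   (formula (12))
    return C3 * vec_size * pe_num_y * pe_num_z + C4 * n + C5 * m + C6

def bandwidth_cost(vec_size, pe_num_y, pe_num_z):
    # placeholder for the average memory bandwidth model of formula (22)
    return 2.0e6 * vec_size * pe_num_y * pe_num_z

def throughput(vec_size, pe_num_y, pe_num_z):
    # placeholder objective: more parallel multipliers -> higher throughput
    return vec_size * pe_num_y * pe_num_z

best = max(
    (cfg for cfg in product((2, 4, 8, 16), range(1, 9), range(1, 33), range(1, 5), range(1, 5))
     if dsp_cost(*cfg) <= DSP_LIMIT and bandwidth_cost(*cfg[:3]) <= BANDWIDTH_LIMIT),
    key=lambda cfg: throughput(*cfg[:3]),
)
print("VEC_SIZE, PE_NUM_Y, PE_NUM_Z, n, m =", best)
```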
  • step 305 the feature map to be verified is input into the parallelism model for verification, and the verified feature map and the second depth parallelism of the verified feature map are obtained.
  • the feature map to be verified includes the first depth parallelism and the vertical parallelism.
  • step 306 if it is determined that the output image corresponding to the verified feature map meets the performance requirements of the system, the first depth parallelism and the second depth parallelism are obtained.
  • FIG. 4 is a comparison table of the performance parameters of the output image in the embodiment of the present application and the performance parameters of the related FPGA-accelerated YOLO network.
  • for different CNN frameworks (for example, network frameworks built using the YOLOv1 algorithm or the YOLOv2 algorithm), the DSP chips used, the corresponding accuracies, and the computing power of the processors are different, and the final network throughput (Throughput), frames per second (Frame Per Second, FPS), and so on also differ.
  • the network throughput is measured in GOP (giga-operations) per second, and the accuracy includes fixed-point data and floating-point data.
  • GOP: giga-operations
  • in this way, the first depth parallelism and the second depth parallelism matching the system performance can be obtained, and the convolution operation is then performed according to the first and second depth parallelisms, in order to increase the convolution operation speed and the processing speed of the input feature map.
  • the input feature map is vectorized according to the first depth parallelism and the vertical parallelism of the input feature map to obtain N input vector data, and the N input vector data and the convolution kernel then perform convolution operations at the same time to obtain the output feature map, so that one convolution kernel can perform convolution operations with N input vector data simultaneously, ensuring that the speed of the convolution operation is accelerated under the condition of low energy consumption and improving the processing speed of the input feature map.
  • the accuracy of the target object in the output feature map is better than the accuracy of the target object in the input feature map, so that the accuracy of the target object can be improved and the category of the target object in the input feature map can be determined more accurately, which is convenient for applications in the field of machine vision.
  • Fig. 5 shows a schematic structural diagram of an image processing device provided by an embodiment of the present application.
  • the image processing device can be implemented using FPGA.
  • the image processing device may include a preprocessing module 501, a vectorization processing module 502, and a convolution operation module 503.
  • the preprocessing module 501 may be configured to preprocess the image to be detected to obtain an input feature map, and extract the first depth parallelism and vertical parallelism of the input feature map.
  • the vectorization processing module 502 can be configured to perform vectorization processing on the input feature map according to the first depth parallelism and the vertical parallelism to obtain N input vector data, where N is an integer greater than or equal to 1.
  • the convolution operation module 503 can be configured to use N input vector data and convolution kernels to perform convolution operations at the same time to obtain an output feature map.
  • the vectorization processing module performs vectorization processing on the input feature map according to the first depth parallelism and the vertical parallelism of the input feature map to obtain N input vector data, and the convolution operation module then performs convolution operations on the N input vector data and the convolution kernel at the same time to obtain the output feature map, so that one convolution kernel can perform convolution operations on N input vector data simultaneously, ensuring that the speed of the convolution operation is accelerated under the condition of low energy consumption and improving the processing speed of the input feature map.
  • the accuracy of the target object in the output feature map is better than the accuracy of the target object in the input feature map, so that the accuracy of the target object can be improved and the category of the target object in the input feature map can be determined more accurately, which is convenient for applications in the field of machine vision.
  • FIG. 6 shows a schematic structural diagram of a target detection system provided by an embodiment of the present application.
  • the target detection system may include a host 61 and a device 62.
  • the host 61 and the device 62 are connected by an off-chip Double Data Rate 3 (off-chip DDR3) bus.
  • the host side 61 includes a task scheduler (Task Scheduler) 601 and a Reorg module (Reorg Function) 602.
  • the device side 62 is implemented by a reconfigurable logic device (FPGA); its low power consumption gives it obvious advantages in edge-side application deployment, and it can efficiently implement the YOLOv2 algorithm.
  • the device side 62 includes a data read kernel (MemRD Kernel) 6100, a data write kernel (MemWR Kernel) 6200, a convolution operation kernel (Conv Kernel) 6300, and a pooling kernel (MaxPool Kernel) 6400.
  • These cores are all constructed in the form of a single work item (Single Work Item), so that the device end 62 realizes an efficient pipeline. At the same time, these cores are cascaded through the OpenCL pipeline to form a deep pipeline structure between the cores.
  • MemRD Kernel 6100 includes an extraction logic module 6110, a weight buffer area 6120, and an input buffer area 6130 with a double buffer mechanism.
  • MemWR Kernel 6200 includes a rearrangement module 6210 and an output buffer area 6220.
  • Conv Kernel 6300 includes multiple systolic array processing units, for example, processing unit 6311, processing unit 6312, processing unit 6313, etc., and also includes multiple data buffer areas and multiple weight buffer areas.
  • MaxPool Kernel 6400 includes line buffer area (Line Buffer 1) 6411, line buffer area (Line Buffer 2) 6412 and comparison logic circuit (MaxPool Logic) 6420.
  • the comparison logic circuit 6420 includes multiple Max modules.
  • the number of row buffers is related to the size of the pooling window. For example, a 3*3 pooling window requires two row buffers, and a 2*2 pooling window requires only one row buffer.
  • the multi-scale pooling method is adopted to improve the portability of the target detection system.
  • MemRD Kernel 6100 is responsible for preparing input data for Conv Kernel 6300.
  • MemRD Kernel 6100 caches a part of data from global memory to local memory, and transmits the prepared data to Conv Kernel 6300 through the OpenCL pipeline.
  • MemWR Kernel 6200 is responsible for rearranging the results of the convolution operation.
  • MemWR Kernel 6200 caches the results of convolution operations in local memory and rearranges them in a certain order. When there is a pooling layer, the convolution operation result is output to MaxPool Kernel 6400 through the OpenCL pipeline.
  • Conv Kernel 6300 is mainly used to accelerate the computationally intensive convolution operations, fully connected operations, and data activation in CNN; to improve computing efficiency, multiple PEs are used to implement parallel computing of the convolution. MaxPool Kernel 6400 downsamples the output feature map; it reads and processes the data through the OpenCL pipeline and finally saves the processing result to the global memory.
  • when there is a pooling layer, MaxPool Kernel 6400 outputs data in the form of an OpenCL pipeline; when there is no pooling layer, it outputs data in a multi-port data storage mode to balance the speed of the convolution operation and of result storage.
  • MemWR Kernel 6200 transmits data of the unit size of VEC_SIZE to MaxPool Kernel 6400, and MaxPool Kernel 6400 uses a maximum pooling window of 3*3 for processing.
  • MemRD Kernel 6100 first caches the first two lines of the input feature map to the on-chip line buffer area.
  • the maximum value is first taken over the first line of cached data and the second line of cached data to obtain the maximum value of each column in the pooling window; this maximum value is then sent to a shift register with a depth of 3 for temporary storage, after which the maximum value is obtained using the second row buffer and the third row of data, and finally the maximum value in the pooling window is obtained and stored back into the global memory.
  • afterwards, the contents of the two line buffers are updated in sequence: Line Buffer 2 is updated with the data of Line Buffer 1, Line Buffer 1 is updated with the newly input data, and this is repeated iteratively. In specific implementations, it is also possible to average the different row buffers and then update the content of the row buffer area.
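  • The following is a software model of the 3*3 max pooling with two line buffers described above; it computes the same result as the described dataflow but makes no claim about the actual circuit:

```python
import numpy as np

def maxpool_3x3_line_buffers(feature: np.ndarray, stride: int = 1) -> np.ndarray:
    """3x3 max pooling using two line buffers, mimicking MaxPool Kernel 6400.

    Software model only: Line Buffer 1 holds the newest cached row, Line Buffer 2
    the one before it; each newly arriving input row completes a 3-row window.
    """
    h, w = feature.shape
    out_rows = []
    line_buf2, line_buf1 = feature[0], feature[1]        # pre-cache the first two rows
    for r in range(2, h):
        new_row = feature[r]
        col_max = np.maximum(np.maximum(line_buf2, line_buf1), new_row)  # per-column max of 3 rows
        # slide a width-3 window (modeled with a depth-3 shift register in the text)
        row_out = [col_max[c:c + 3].max() for c in range(0, w - 2, stride)]
        out_rows.append(row_out)
        line_buf2, line_buf1 = line_buf1, new_row        # update the two line buffers in turn
    return np.array(out_rows)

x = np.arange(36, dtype=np.float32).reshape(6, 6)
print(maxpool_3x3_line_buffers(x))
```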
  • VEC_SIZE: the first depth parallelism
  • the task scheduler 601 is mainly responsible for configuring the operating environment of OpenCL, and scheduling the execution and synchronization of the kernel of the device 62 through an OpenCL specific application program interface (Application Programming Interface, API). According to the method shown in FIG. 3, the task scheduler 601 builds a complete OpenCL execution environment.
  • the task scheduler 601 needs to create two memory objects based on the context, which are respectively used to store the input feature map and the output feature map of the convolution. Each memory object has two attributes of input and output, which can be used to store the output of this layer, and can also be used to transmit the input of the lower layer.
  • the task scheduler 601 needs to transmit the preprocessed input data to the FPGA off-chip global memory area through the command queue in the form of a memory object at the beginning.
  • before the execution of each layer of convolution, the task scheduler 601 first configures the parameters of each core through a specific API, starts the execution of each core through the command queue, and then monitors through an event whether each core has finished executing.
  • the execution of the pooling layer is controlled by the pooling switch. When the pooling switch is set to 1, the execution of MaxPool Kernel 6400 is started. After the execution of the four cores on the FPGA is completed, the task scheduler 601 saves the final output result to the host memory in the form of a command queue, so as to perform subsequent operations.
  • the Reorg module 602 is mainly implemented by the Reorg function and is responsible for rearranging the convolution output feature map. For example, by adjusting the execution order of the network, the Reorg module 602 can be executed in parallel with the 14th-layer convolution. The FPGA performs memory access operations with consecutive addresses more efficiently, whereas the Reorg function reads and writes jumping memory addresses; its logic is simple, it involves very few calculations, its execution time on the CPU is shorter than the execution time of the 14th-layer convolution, and it is used only once in the network. Therefore, placing the Reorg module 602 on the host 61 can save on-chip FPGA resources and thereby improve resource utilization.
  • FIG. 7 is a schematic diagram of a module for performing convolution operation based on parallelism in Conv Kernel 6300 provided by an embodiment of the present application.
  • Conv Kernel 6300 uses three parallelism designs, including the first depth parallelism of the input feature map (VEC_SIZE), the vertical parallelism of the input feature map (PE_NUM_Y), and the second depth parallelism of the output feature map (PE_NUM_Z).
  • the product of PE_NUM_Y and PE_NUM_Z is consistent with the total amount of PE in Conv Kernel 6300.
  • PE_NUM_Z is the same as the number of convolution kernels of the output feature map.
  • VEC_SIZE and PE_NUM_Z are the parallelism determined according to the parallelism model, and the parallelism model is the model determined according to the hardware resource and memory bandwidth model.
  • the VEC_SIZE dimension of the input feature map is vectorized by loop unrolling.
  • one convolution kernel can operate simultaneously with multiple regions of the input feature map to improve the degree of weight value sharing; at the same time, one region of the input feature map can operate with multiple convolution kernels simultaneously to improve the degree of input data sharing.
  • MemRD Kernel 6100 rearranges the data corresponding to the input feature map, originally arranged as (W, H, N), into (VEC_SIZE, W, H, N/VEC_SIZE), and performs vectorization along VEC_SIZE. Data of size PE_NUM_Y*VEC_SIZE and weight values of size PE_NUM_Z*VEC_SIZE are then input to Conv Kernel 6300, so that Conv Kernel 6300 performs K*K*N/VEC_SIZE multiply-accumulate operations on the input data to obtain data of size PE_NUM_Y*PE_NUM_Z, which is output to MemWR Kernel 6200.
  • for example, the input feature map (Input Feature Map) provided by MemRD Kernel 6100 is a 160*160*3 (W*H*N) image, there are PE_NUM_Z convolution kernels in total, and the size of each convolution kernel is 3*3*3 (that is, K is equal to 3). Data is taken from the input feature map sequentially in the horizontal direction (R points in total), the extracted data is then computed with each convolution kernel in turn, and the output feature map of size R*C*M is finally obtained. It should be noted that when the input feature map is fetched horizontally, multiple lines can also be fetched together to increase the calculation speed of the convolution.
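  • For reference, a plain software (time-domain) convolution producing the R*C*M output of the example above can be sketched as follows; the small 16*16*3 input and the 4 kernels are stand-ins for the 160*160*3 image and the PE_NUM_Z kernels, and the parallel hardware dataflow is not modeled:

```python
import numpy as np

def reference_conv(feature: np.ndarray, kernels: np.ndarray, stride: int = 1) -> np.ndarray:
    """Plain reference convolution.

    feature: (W, H, N) input feature map, e.g. 160 x 160 x 3.
    kernels: (M, K, K, N) with M = PE_NUM_Z convolution kernels of size K*K*N.
    Returns an (R, C, M) output feature map; the Conv Kernel of the text computes
    the same result with VEC_SIZE/PE_NUM_Y/PE_NUM_Z parallelism.
    """
    w, h, n = feature.shape
    m, k, _, _ = kernels.shape
    r_out = (w - k) // stride + 1
    c_out = (h - k) // stride + 1
    out = np.zeros((r_out, c_out, m), dtype=np.float32)
    for r in range(r_out):
        for c in range(c_out):
            patch = feature[r * stride:r * stride + k, c * stride:c * stride + k, :]
            out[r, c, :] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

fm = np.random.rand(16, 16, 3).astype(np.float32)       # small stand-in for 160 x 160 x 3
ks = np.random.rand(4, 3, 3, 3).astype(np.float32)      # PE_NUM_Z = 4 kernels, K = 3
print(reference_conv(fm, ks).shape)                     # (14, 14, 4) -> R * C * M
```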
  • FIG. 8 is a schematic diagram of the operation process of the processing unit in the Conv Kernel 6300 provided by an embodiment of the present application.
  • One processing unit may include: a convolution operation logic unit 810 and an activation function logic unit 820.
  • the convolution operation logic unit 810 includes: a custom MAC subunit 811 and a local accumulation subunit 812.
  • The MAC subunit 811 can support the input of vector data and is configured to receive vectorized input data (Vectorized data) and weights (Vectorized weight), so that the MAC subunit 811 calculates the input data and the corresponding weights according to a multiply-accumulate tree. Specifically, a point-to-point multiplication operation may first be performed on the input data, the obtained multiplication results may then be added together, and after K*K*N/VEC_SIZE operations the output result of the MAC subunit 811 is obtained. The output result is input to the local accumulation subunit 812 for buffering.
  • the Intel FPGA board provides variable-precision DSPs, that is, one DSP can support multiplication operations with multiple data widths, where d0, d1, ..., dn-1, dn represent the data on each bit position of the input data, w0, w1, ..., wn-1, wn represent the data on each bit position of the weight value, and n is an integer greater than 1.
  • one DSP in the Intel Stratix V GXA7 FPGA development board can perform one 27bit*27bit multiplication operation, and can also perform three 9bit*9bit multiplication operations.
  • the data bit width can be changed by reconfiguring the relevant parameters of the kernel, and a specific DSP can be designated to calculate the corresponding data bit width, so that the compilation frequency can be increased.
  • the data types of integers include char (8bit) type, short (16bit) type, int (32bit) type and long (64bit) type, etc.
  • the data bit width of each data type is a power of 2.
  • for example, the char type is used to store fixed-point numbers: multiplying two 8-bit integers requires 16 bits of storage space for the result; adding j i-bit integers requires an additional Ceil(Log2(j)) bits of storage space. Therefore, after multiplying and accumulating j pairs of 8-bit integers, the obtained multiply-accumulate result requires a total of (16 + Ceil(Log2(j))) bits of storage space.
  • i and j are integers greater than or equal to 2, and Ceil represents rounding up. Therefore, the data bit width of the multiply-accumulate result output by the MAC subunit 811 is (16 + Ceil(Log2(VEC_SIZE))), and the data bit width of the delay buffer area is (16 + Ceil(Log2(s_w/h))).
  • s_w represents the maximum calculation amount of a single convolution of each layer in the network model
  • h represents the depth of the delay buffer area.
  • for example, if the maximum single-convolution calculation amount, found in the 22nd layer, is 32*1280 and the depth of the delay buffer is 6, then the data bit width of the delay buffer area, (16 + Ceil(Log2(s_w/h))), is 29. By setting the data bit width in this way, unnecessary waste of logic resources can be avoided.
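  • The bit-width bookkeeping can be reproduced with a few lines of Python; the VEC_SIZE = 8 value is only an example, while the second call reproduces the 22nd-layer example (32*1280 with a delay buffer depth of 6 giving 29 bits):

```python
import math

def mac_output_bits(vec_size: int) -> int:
    # multiply-accumulate of VEC_SIZE pairs of 8-bit values: 16-bit products
    # plus Ceil(Log2(VEC_SIZE)) carry bits
    return 16 + math.ceil(math.log2(vec_size))

def delay_buffer_bits(s_w: int, h: int) -> int:
    # s_w: maximum single-convolution calculation amount of a layer, h: delay buffer depth
    return 16 + math.ceil(math.log2(s_w / h))

print(mac_output_bits(8))                 # e.g. VEC_SIZE = 8 -> 19 bits
print(delay_buffer_bits(32 * 1280, 6))    # the 22nd-layer example -> 29 bits
```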
  • the partial accumulation subunit 812 is configured to implement manual clock alignment to ensure high saturation operation of the pipeline.
  • the local accumulation subunit 812 obtains the calculation results input by the custom MAC subunit 811 after K*K*N/VEC_SIZE operations, and then adds the buffered data according to the design of the delay buffer to obtain the convolution result.
  • the data bit width of the convolution result at this time is greater than the bit width of the vectorized input data, and the convolution result needs to be truncated to make the data bit width of the final result the same as the bit width of the vectorized input data.
  • FIG. 9a shows a schematic diagram of a multiply-accumulate operation when the local accumulation subunit 812 is not added according to an embodiment of the present application.
  • the Fetch function processing and the MAC and accumulation processing are treated as one processing unit, which is processed sequentially on the time axis; but there is a data dependency (that is, a dependency relationship between data) between the processing units, so that the loop iteration of the next processing unit can only be started after one processing unit is completed, resulting in a start interval greater than 1.
  • FIG. 9b shows a schematic diagram of a multiply-accumulate operation after adding the local accumulation subunit 812 according to an embodiment of the present application.
  • the output of the MAC subunit 811 is sent to the local accumulation subunit 812 for data buffering.
  • the local accumulation subunit 812 can be composed of shift registers, so that by adjusting the depth of the shift register, a pipeline with a start interval equal to 1 can be formed, where the depth of the pipeline is K*K*N/VEC_SIZE, improving the processing efficiency of the pipeline.
  • the device side 62 in this application includes three buffer areas, namely, input data buffer areas (6130, 6331, 6332, and 6333), weight and bias buffer areas (6120, 6321, 6322, and 6323), and output buffer area 6220.
  • each data buffer area can adopt any storage form among the folded storage form, the double buffer mechanism and the multi-port loading mode.
  • FIG. 10 shows a schematic diagram of a structure of a data buffer area in the form of folded storage provided by an embodiment of the present application.
  • S represents the data step size
  • K represents the size of the convolution kernel.
  • H_row represents the data length required for PE_NUM_Y convolution operations of the input feature map in the vertical dimension (Y dimension), as shown in formula (23): H_row = (PE_NUM_Y - 1) * S + K
  • each row buffer stores data with a data length of S each time, so that each row buffer can output one data.
  • FIG. 11 shows a schematic diagram of a structure of a data buffer area provided by an embodiment of the present application.
  • the data buffer area may include multiple row buffers and is a two-dimensional buffer area.
  • the two-dimensional cache has a higher data reuse rate.
  • Increasing the data reuse rate of the local cache means a reduction in bandwidth.
  • the two-dimensional buffer area can save about 57% of the bandwidth compared to the one-dimensional buffer area.
  • S_Line represents the length of each line buffer
  • N_Line represents the number of line buffers, as shown in equation (24)
  • the actual length of each line buffer is W_col (W_col <= S_Line); W_col can be dynamically adjusted according to the depth of the input feature map and the convolution step length, as shown in equation (25);
  • the actual number of convolutions that can be executed in each row buffer is N_col, which is determined by W_col, as shown in equation (26)
  • the value of S_Line must ensure that at least one convolution area of each convolution layer is cached.
  • N_col = FLOOR((W_col - K)/S + 1) (26)
  • the FLOOR(X) function rounds down, that is, it takes the largest integer not greater than X.
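  • A one-line check of formula (26): with an assumed usable line-buffer length W_col = 13, kernel size K = 3, and stride S = 1, each row buffer can hold 11 convolutions:

```python
import math

def convs_per_line_buffer(w_col: int, k: int, s: int) -> int:
    # N_col = FLOOR((W_col - K)/S + 1)          (formula (26))
    return math.floor((w_col - k) / s + 1)

print(convs_per_line_buffer(w_col=13, k=3, s=1))   # -> 11
```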
  • the row buffer area can be designed as a double buffer mechanism. That is, one data buffer area loads data from the off-chip global memory, and the other buffer area transmits pre-stored data to the Conv Kernel 6300. These two buffer areas alternate and perform data operations at the same time.
  • the size of the data that can be loaded from the off-chip global memory in the data buffer at one time is VEC_SIZE, and the size of the data transmitted to the Conv Kernel 6300 is (PE_NUM_Y*VEC_SIZE). This makes it possible to save convolution waiting time for data loading, improve data transmission efficiency, and provide a guarantee for efficient convolution operations.
  • the double buffer mechanism will bring about the problem of mismatch between the parallel transmission of data and the serial loading speed.
  • T_load represents the number of clock cycles required to load one data buffer from the global memory, as shown in equation (27); T_trans represents the number of clock cycles required to transmit the data in one buffer to Conv Kernel 6300, as shown in equation (28).
  • if PE_NUM_Y is set to be very large, then T_load > T_trans. Therefore, in order to balance the speed of the two buffers and ensure the smooth execution of the deep pipeline between the three cores MemRD Kernel 6100, Conv Kernel 6300, and MemWR Kernel 6200, multi-port data loading is required.
  • with n load ports, each port is responsible for Ceil(N_Line/n) row buffers, and the data loading time is reduced by a factor of n.
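  • The port-count balancing can be sketched as below. Formulas (27) and (28) are not reproduced in this text, so the cost model here is a stated assumption: one port loads VEC_SIZE words per cycle while PE_NUM_Y*VEC_SIZE words per cycle leave the buffer towards Conv Kernel 6300:

```python
import math

def choose_load_ports(buffer_words: int, vec_size: int, pe_num_y: int, max_ports: int = 8) -> int:
    """Pick the smallest number of load ports n with T_load <= T_trans.

    Placeholder cost model (formulas (27)/(28) are not reproduced in this text):
    one port loads VEC_SIZE words per cycle, while PE_NUM_Y * VEC_SIZE words per
    cycle leave the buffer towards the convolution kernel.
    """
    t_trans = math.ceil(buffer_words / (pe_num_y * vec_size))
    for n in range(1, max_ports + 1):
        t_load = math.ceil(buffer_words / (n * vec_size))
        if t_load <= t_trans:
            return n
    return max_ports

# with a large PE_NUM_Y, several ports are needed to keep the double buffers balanced
print(choose_load_ports(buffer_words=4096, vec_size=8, pe_num_y=4))
```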
  • MemRD Kernel 6100 pre-rearranges the weights according to the first depth parallelism and the second depth parallelism. For example, the weights are rearranged from the original (K, K, N, M) order into the (VEC_SIZE, PE_NUM_Z, K, K, N/VEC_SIZE, M/PE_NUM_Z) order. MemRD Kernel 6100 rearranges the offsets into the (PE_NUM_Z, M/PE_NUM_Z) order according to the second depth parallelism. This makes it possible to provide parallelism for MaxPool Kernel 6400 and to provide the correct data format for the input of the lower-layer convolution.
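  • A NumPy sketch of the weight and offset rearrangement described above is shown below; whether channels are grouped contiguously inside each VEC_SIZE or PE_NUM_Z block is an assumption of this illustration:

```python
import numpy as np

def rearrange_weights(weights: np.ndarray, vec_size: int, pe_num_z: int) -> np.ndarray:
    """Rearrange weights from (K, K, N, M) to (VEC_SIZE, PE_NUM_Z, K, K, N/VEC_SIZE, M/PE_NUM_Z).

    Assumes N is a multiple of VEC_SIZE and M a multiple of PE_NUM_Z.
    """
    k1, k2, n, m = weights.shape
    w = weights.reshape(k1, k2, n // vec_size, vec_size, m // pe_num_z, pe_num_z)
    return w.transpose(3, 5, 0, 1, 2, 4)

def rearrange_bias(bias: np.ndarray, pe_num_z: int) -> np.ndarray:
    """Rearrange offsets from (M,) to (PE_NUM_Z, M/PE_NUM_Z)."""
    m = bias.shape[0]
    return bias.reshape(m // pe_num_z, pe_num_z).transpose(1, 0)

w = np.random.rand(3, 3, 16, 32).astype(np.float32)
b = np.random.rand(32).astype(np.float32)
print(rearrange_weights(w, vec_size=8, pe_num_z=4).shape)  # (8, 4, 3, 3, 2, 8)
print(rearrange_bias(b, pe_num_z=4).shape)                 # (4, 8)
```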
  • the memory size s_w of a single convolution kernel is shown in equation (29), taking the maximum value over the convolution layers; the total size of the weight buffer area h_w is shown in equation (30), and the unit is Byte.
  • the memory size h_b required by the offset buffer is shown in formula (31), and the unit is Byte.
  • the weight buffer area and the offset buffer area are also set as a double buffer mechanism, which can provide efficient data transmission for the Conv Kernel 6300 convolution operation.
  • the input data, the weights, and the bias can be transmitted through the OpenCL pipeline in the same clock cycle.
  • MemWR Kernel 6200 needs to perform (PE_NUM_Y*PE_NUM_Z) store operations when writing the data in the output buffer area 6220 back to the global memory, while a complete convolution operation requires (K*K*N/VEC_SIZE) operations.
  • when (PE_NUM_Y*PE_NUM_Z) is larger than (K*K*N/VEC_SIZE), the time required for MemWR Kernel 6200 to store the convolution result back to the global memory becomes the bottleneck. Therefore, it is necessary to adopt a multi-port data storage format for the output buffer area 6220. For example, when the number of ports is m, each port is responsible for outputting CEIL(PE_NUM_Y*PE_NUM_Z/m) data to the off-chip global memory area.
  • the target detection system provided by the embodiment of the application has high portability and scalability.
  • the original network structure is used, so that while ensuring good accuracy, it is possible to consider the impact of the change of the weight value on the result.
  • the use of 8bit fixed-point quantization can ensure that the accuracy loss is within an acceptable range.
  • Fig. 12 shows a schematic structural diagram of a machine vision device provided by an embodiment of the present application.
  • the machine vision equipment may include an image acquisition device 1201 and an image processing device 1202.
  • the image acquisition device 1201 may be configured to acquire the image to be detected, where the image to be detected includes the target object to be determined; the image processing device 1202 may be configured to detect the image to be detected according to the image processing method provided in the embodiments of the present application.
  • the accuracy of the target object in the output feature map is better than the accuracy of the target object in the input feature map.
  • for example, in a machine vision application, the target object to be determined in the image to be detected may be a dog or a bicycle, but due to the color of background objects and the position of the object in the picture, the robot cannot accurately obtain the category and position information of the target object when observing the image to be detected. By vectorizing the image to be detected according to its first depth parallelism and vertical parallelism to obtain N input vector data, and performing the convolution operation with the convolution kernels using the N input vector data simultaneously, an output feature map is obtained in which the category of the target object is clearer. When the robot observes the output feature map, it can determine that the target objects are a dog and a bicycle, and can obtain the position information of the dog and the bicycle more accurately, which improves the detection accuracy of the target object to be determined.
  • the image to be detected is acquired through the image acquisition device, and the image processing device detects the image to be detected according to the image processing method, so as to speed up the convolution operation in the image analysis process and improve the processing speed of the image to be detected; the detection accuracy is also improved, so that the category of the target object to be determined is clearly visible, which facilitates applications in the field of machine vision.
  • FIG. 13 shows a schematic diagram of an exemplary hardware architecture of an electronic device provided by an embodiment of the present application.
  • the electronic device 1300 may include an input device 1301, an input interface 1302, a central processing unit 1303, a memory 1304, an output interface 1305, and an output device 1306.
  • the input interface 1302, the central processing unit 1303, the memory 1304, and the output interface 1305 are connected to each other through the bus 1307
  • the input device 1301 and the output device 1306 are connected to the bus 1307 through the input interface 1302 and the output interface 1305, respectively, and are then connected to the other components of the electronic device 1300.
  • the input device 1301 receives input information from the outside and transmits it to the central processing unit 1303 through the input interface 1302; the central processing unit 1303 processes the input information based on the computer-executable instructions stored in the memory 1304 to generate output information, stores the output information temporarily or permanently in the memory 1304, and then transmits it to the output device 1306 through the output interface 1305; the output device 1306 outputs the output information to the outside of the electronic device 1300 for the user to use.
  • the electronic device 1300 shown in FIG. 13 may be implemented as a network device, and the network device (ie, the electronic device 1300) may include: a memory configured to store a program; and a processor configured to run the program stored in the memory to execute at least one step of the image processing method provided in the embodiments of the present application.
  • the embodiments of the present application can be implemented in hardware or dedicated circuits, software, logic or any combination thereof.
  • some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software that may be executed by a controller, microprocessor, or other computing device, although the application is not limited thereto.
  • computer program instructions can be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages.
  • the block diagram of any logic flow in the drawings of the present application may represent program steps, or may represent interconnected logic circuits, modules, and functions, or may represent a combination of program steps and logic circuits, modules, and functions.
  • the computer program can be stored on the memory.
  • the memory can be of any type suitable for the local technical environment and can be implemented using any suitable data storage technology, such as but not limited to read-only memory (ROM), random access memory (RAM), and optical storage devices and systems (digital versatile discs (DVD) or CDs), etc.
  • Computer-readable media may include non-transitory storage media.
  • the data processor may be of any type suitable for the local technical environment, such as but not limited to general-purpose computers, special-purpose computers, microprocessors, DSPs, application-specific integrated circuits (ASIC), FPGAs, and processors based on a multi-core processor architecture.

Abstract

本申请实施例涉及计算机技术领域,并提供一种图像处理方法及装置、机器视觉设备、电子设备和计算机可读存储介质。图像处理方法包括:对待检测的图像进行预处理获得输入特征图,并提取输入特征图的第一深度并行度和纵向并行度;依据第一深度并行度和纵向并行度对输入特征图进行向量化处理,获得N个输入向量数据,其中,N为大于或等于1的整数;以及,使用N个输入向量数据与卷积核同时进行卷积运算,获得输出特征图。

Description

图像处理方法及装置、机器视觉设备、电子设备和计算机可读存储介质 技术领域
本申请涉及计算机技术领域。
背景技术
当前,基于现场可编程逻辑门阵列(Field Programmable Gate Array,FPGA)的深度卷积神经网络加速器架构大致可分为3种:基于运算模块的加速器架构、基于网络映射的加速器架构和基于脉冲阵列的加速器架构。
基于运算模块的加速器架构的关注点在卷积神经网络(Convolutional Neural Networks,CNN)的基本运算单元(如卷积、池化、全连接等)上,通过一定的设计得到若干个通用的计算模块,这些计算模块可以进行灵活的组合,以实现不同深浅、不同结构网络的正向推断过程。基于网络映射的加速器架构需要将层间计算结果缓存到芯片外部,使得带宽压力增加,若尝试按照网络结构将所有层映射到FPGA电路中,则运算时只需要从芯片外部加载输入数据以及将网络最终计算结果回存到芯片外部,可避免中间结果的缓存,实现CNN层内和层间流水结构,因此具有很高的效率,但当网络层数较深时,会受到硬件资源的限制。基于脉冲阵列的加速器架构能实现较高的时钟频率和较少的逻辑资源消耗,但在对CNN模型进行部署时,脉冲阵列处理单元(Processing Elements,PE)的配置较为复杂,不易实现。并且单纯的硬件设计受带宽、资源的限制,导致性能提升有限。
发明内容
本申请实施例的一个方面提供一种图像处理方法,包括:对待检测的图像进行预处理获得输入特征图,并提取输入特征图的第一深度并行度和纵向并行度;依据第一深度并行度和纵向并行度对输入特 征图进行向量化处理,获得N个输入向量数据,其中,N为大于或等于1的整数;以及,使用N个输入向量数据与卷积核同时进行卷积运算,获得输出特征图(Output Feature Map)。
本申请实施例的另一方面提供一种图像处理装置,包括:预处理模块,被配置为对待检测的图像进行预处理获得输入特征图,并提取输入特征图的第一深度并行度和纵向并行度;以及,向量化处理模块,被配置为根据第一深度并行度和纵向并行度对输入特征图进行向量化处理,获得N个输入向量数据,其中,N为大于或等于1的整数;卷积运算模块,被配置为使用N个输入向量数据与卷积核同时进行卷积运算,获得输出特征图。
本申请实施例的再一方面提供一种机器视觉设备,包括:图像获取装置,被配置为获取待检测的图像,其中,待检测的图像包括待确定的目标物体;以及,图像处理装置,被配置为根据本申请实施例提供的图像处理方法对待检测的图像进行检测,并确定待确定的目标物体的类别。
本申请实施例的又一方面提供一种电子设备,包括:一个或多个处理器;存储器,其上存储有一个或多个程序,当一个或多个程序被一个或多个处理器执行,使得一个或多个处理器实现本申请实施例提供的图像处理方法的至少一个步骤。
本申请实施例的又一方面提供了一种计算机可读存储介质,其上存储有计算机程序,计算机程序被处理器执行时实现本申请实施例提供的图像处理方法的至少一个步骤。
附图说明
图1示出本申请实施例提供的图像处理方法的一种流程示意图。
图2示出本申请实施例提供的利用乘累加树进行卷积加速运算的方法的一种流程示意图。
图3示出本申请实施例提供的生成并行度模型并依据该并行度模型确定第一深度并行度和第二深度并行度的方法的一种流程示意图。
图4示出本申请实施例中的输出图像的性能参数和相关的基于FPGA加速的YOLO网络的性能参数的对比表。
图5示出本申请实施例提供的图像处理装置的一种结构示意图。
图6示出本申请实施例提供的目标检测系统的一种结构示意图。
图7示出本申请实施例提供的卷积运算内核中的依据并行度进行卷积运算的一种模块示意图。
图8示出本申请实施例提供的卷积运算内核中的处理单元的运算过程的一种示意图。
图9a示出本申请实施例提供的没有添加局部累加单元时的乘累加操作的一种示意图。
图9b示出本申请实施例提供的增加了局部累加单元后的乘累加操作的一种示意图。
图10示出本申请实施例提供的采用折叠存储形式的数据缓存区的一种结构示意图。
图11示出本申请实施例提供的数据缓存区的一种结构示意图。
图12示出本申请实施例提供的机器视觉设备的一种结构示意图。
图13示出本申请实施例提供的电子设备的一种示例性硬件架构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚明白,下文中将结合附图对本申请实施例进行详细说明。需要说明的是,在不冲突的情况下,本申请实施例中的可实施方式及可实施方式中的特征可以相互任意组合。
基于FPGA的深度卷积神经网络加速器架构大致可分为基于运算模块的加速器架构、基于网络映射的加速器架构和基于脉冲阵列的加速器架构。这三种架构多是采用寄存器转换级(Register Transfer Level,RTL)电路开发方式设计实现,移植性和扩展性相对较差。
单纯的硬件设计受带宽、资源的限制,使得系统性能提升有限。一些研究人员采用软硬件协同设计的思想,以进一步提高FPGA的深 度卷积神经网络加速器的性能。软硬结合的方式主要分为以下两个方面:1)为了缓解带宽压力,使用剪枝、量化等技术对数据进行处理。例如,将结构化剪枝方式直接部署到当前的深度卷积神经网络加速器架构中,使得可以规则地修剪CNN模型;而通过非结构化剪枝方式对数据进行处理(例如,随机裁剪权重节点等),使得能提高深度卷积神经网络加速器的压缩率,但需存储额外的非零的权重值的位置信息,使得硬件电路设计具有一定难度。量化是目前比较通用的模型压缩方法。8bit定点数量化能基本保持原精度不变,受到了广泛使用,但压缩能力有限。为进一步压缩模型,一些研究人员对低比特/超低比特量化(例如,6bit,4bit等)进行研究,更有一些研究人员采用二值网络的方式,将乘法运算转化为逻辑运算,使得能够极大地提高深度卷积神经网络加速器的性能,但网络精度损失过多。2)为了减少运算量,可考虑在变换域(例如,Winograd变换,FFT变换等)进行运算。通过一维的Winograd变换,可减少约1/3的乘法次数;通过二维的Winograd变换,可减少约2.25倍的乘法次数。基于FFT变换的OaA(Overlap-andAdd)算法对数据进行卷积运算,相比于基于时域卷积对数据进行卷积运算,使得系统性能提升了约3.3倍。
使用单精度浮点数据类型,对YOLO和Faster-RCNN两种目标检测算法进行优化,但提升了系统资源和带宽的压力,致使数字信号处理器(Digital Signal Processor,DSP)的利用率不高。若基于Xilinx KU115板卡对YOLOv1网络进行加速设计,但只对卷积层进行了加速,当综合考虑全连接层时,YOLOv1网络的性能会下降。若采用轻量级的YOLOv2算法,其特征提取部分是采用二值网络实现的,分类和回归器是采用单精度浮点数实现的。二值网络的应用,一方面减少了计算量和传输带宽,另一方面也损失了数据的精度,导致数据的准确率降低。图形处理器(Graphics Processing Unit,GPU)的能耗较高,使得GPU的设计不能满足嵌入式应用的需求。
本申请提供一种图像处理方法及装置、机器视觉设备、电子设备和存储介质,用于解决CNN的移植性和扩展性较差的问题。
图1是本申请实施例提供的图像处理方法的一种流程示意图, 该方法可应用于图像处理装置。如图1所示,图像处理方法可包括步骤110-步骤130。
在步骤110中,对待检测的图像进行预处理获得输入特征图,并提取输入特征图的第一深度并行度和纵向并行度。
例如,采用归一化、中心化和标准化中的任一项预处理方式,对输入的图像进行处理,获得输入特征图。其中,归一化是将输入的图像的像素值缩放到0-1范围内;中心化是将输入的图像的像素值中减去平均像素值,使新像素值的平均值为0;标准化是将输入的图像的像素值处理为标准高斯分布,即新像素值的平均值为0,标准差为1。
需要说明的是,在中心化和标准化中,可以在不同颜色通道上计算像素平均值和标准差,也可以计算一张图像、一批图像或整个训练数据集的平均值和标准差,以获得输入特征图。归一化通常是首先尝试的方法,因为输入的图像的像素值始终在0-255范围内,只需要把图片的所有像素除以255即可,该方法操作简单且易于实现的。中心化可采用全局中心化或局部中心化,也可以选取不同数量的图像进行平均值的计算,以上对于预处理方式仅是举例说明,可根据具体情况进行具体设定,其它未说明的预处理方式也在本申请的保护范围之内,在此不再赘述。
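作为一个仅用于说明的示意(函数划分与参数均为假设,并非本申请实施例的具体实现),归一化与中心化两种预处理方式可用如下C代码表达:

```c
#include <stddef.h>

/* 归一化示意:将0~255的像素值缩放到0~1范围内 */
static void normalize_pixels(const unsigned char *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] / 255.0f;
}

/* 中心化示意:减去像素平均值,使新像素值的平均值为0 */
static void center_pixels(float *data, size_t n)
{
    float mean = 0.0f;
    for (size_t i = 0; i < n; ++i)
        mean += data[i];
    mean /= (float)n;
    for (size_t i = 0; i < n; ++i)
        data[i] -= mean;
}
```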
具体实现时,输入特征图可以是一个具有不同维度的图像,例如,输入特征图是具有三个维度的图像,则纵向并行度表示该输入特征图在Y轴上的长度,第一深度并行度表示该输入特征图在Z轴上的长度,该输入特征图在X轴上的长度可根据实际情况具体设定。以上对于输入特征图的维度信息仅是举例说明,可根据具体情况具体设定,其它未说明的输入特征图的维度信息也在本申请的保护范围之内,在此不再赘述。
在步骤120中,依据第一深度并行度和纵向并行度对输入特征图进行向量化处理,获得N个输入向量数据。
需要说明的是,N为大于或等于1的整数。输出特征图的第二深度并行度和卷积核的数量相同,第一深度并行度和第二深度并行度是 依据并行度模型确定的并行度,并行度模型是依据硬件资源和内存带宽模型确定的模型。
卷积核的深度与纵向并行度相同,使得输入特征图在与卷积核进行卷积运算时,能够减少纵向的运算量,加快卷积运算速度。
例如,使用VEC_SIZE表示第一深度并行度,使用PE_NUM_Y表示纵向并行度,使用PE_NUM_Z表示第二深度并行度。对输入特征图在横向维度和纵向维度上,以VEC_SIZE*PE_NUM_Y为单位,对输入特征图进行向量化处理,使得能够获得N个VEC_SIZE*PE_NUM_Y大小的输入向量数据,然后在对每一个输入向量数据进行卷积处理,以增加卷积运算的速度。
由于硬件资源不同,使得在做卷积运算时,所用到的带宽资源也不同,通过硬件资源和内存带宽模型确定并行度模型,然后根据该并行度模型来确定VEC_SIZE和PE_NUM_Z,使得在有限的硬件资源环境下,保证卷积运算的速度得到最大的提升。
在步骤130中,使用N个输入向量数据与卷积核同时进行卷积运算,获得输出特征图。
例如,可将N个输入向量数据与一个卷积核同时进行卷积运算,以提高权重值的共享度;也可以将一个输入向量数据与多个卷积核同时进行卷积运算,以提高输入数据的共享度。使得卷积运算的速度加快,进而获得输出特征图。以上对于卷积运算的方式仅是举例说明,可根据实际情况具体设定,其它未说明的卷积运算的方式也在本申请的保护范围之内,在此不再赘述。
根据本申请实施例提供的图像处理方法,可通过输入特征图的第一深度并行度和纵向并行度对输入特征图进行向量化处理,获得N个输入向量数据,然后使用N个输入向量数据与卷积核同时进行卷积运算,获得输出特征图,从而使得一个卷积内核能够同时与N个输入向量数据进行卷积运算,保证在低能耗的情况下,加快卷积运算的速度,提升对输入特征图的处理速度。同时,输出特征图中的目标物体的精度优于输入特征图中的目标物体的精度,使得能够提升目标物体的精度,以使输入特征图中的目标物体的类别更准确,方便在机器视 觉领域中的应用。
在一种可实施方式中,在步骤110之前,该图像处理方法还可包括:将输入特征图缓存到输入缓存区,其中,输入缓存区的存储形式至少包括折叠存储形式、双缓存机制和多端口加载形式中的任一项。
通过多种存储形式,使得输入缓存区对数据的存取操作的速度加快,方便为卷积运算提供高可靠的数据。同时,将输入特征图先缓存到输入缓存区,使得能够缓解与其它设备之间交互的带宽的压力。
在一种可实施方式中,将输入特征图缓存到输入缓存区,可包括:若确定输入缓存区的存储形式是折叠形式,则依据数据步长将输入特征图对应的数据进行折叠,并将折叠后的数据存储在输入缓存区中,其中,数据步长是依据纵向并行度、输入特征图在纵向维度上进行卷积运算所需的数据长度和单位步长确定的值;若确定输入缓存区的存储形式是多端口加载形式,依据端口的数量和加载一个数据缓冲区所需的时钟周期数,将输入特征图对应的数据缓存到输入缓存区。
例如,可采用公式H row=(PE_NUM_Y-1)×S+K,计算获得折叠的数据长度。其中,H row表示输入特征图在纵向维度(Y维度)上的PE_NUM_Y个卷积运算所需的数据长度,S表示数据步长,K表示输入特征图的数据长度。依据纵向并行度(PE_NUM_Y),使得输入特征图上的多个区域能够被同步处理。并且,可以减少行缓存的数目,同时提高CNN结构的通用性。
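举一个简单的数值例子(其中的取值仅为便于说明的假设):若纵向并行度PE_NUM_Y=4、数据步长S=1、K=3,则H_row=(4-1)×1+3=6,即只需缓存6行数据即可同时支撑4个纵向相邻的卷积窗口,而无需为每个窗口各自缓存3行数据。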
在一种可实施方式中,如图2所示,其为本申请实施例提供的利用乘累加树进行卷积加速运算的方法的一种流程示意图,步骤130可具体包括步骤131和步骤132。
在步骤131中,使用N个输入向量数据与卷积核同时进行卷积运算,获得输出向量数据。
例如,将N个输入向量数据与一个卷积核同时进行卷积运算,或将一个输入向量数据与多个卷积核进行卷积运算,获得输出向量数据。
在步骤132中,利用乘累加树对输出向量数据和对应的权重参数进行处理,获得输出特征图。
例如,对输出向量数据进行数据重排,将原按(W,H,N)排列的输入特征图对应的数据重排为(VEC_SIZE,W,H,N/VEC_SIZE),并对VEC_SIZE进行向量化处理。其中,W表示输入向量数据的横向长度,H表示输入向量数据的纵向长度,N表示输入向量数据的深度。对应的权重参数也可以进行相应的数据重排,使得输出特征图更有利于卷积运算。然后,先对输出向量数据和其对应的权重参数进行点对点的乘法运算,再将获得的乘法运算结果进行相加运算,经过多次运算,可获得输出特征图。
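下面的C代码片段示意这种数据重排的基本思路,即把深度方向上每VEC_SIZE个数据在内存中排成连续的一组(索引约定与函数名均为示例性假设,并假设N能被VEC_SIZE整除):

```c
/* 将按(N, H, W)顺序存放的特征图重排为(N/VEC_SIZE, H, W, VEC_SIZE)顺序,
 * 使深度方向上的VEC_SIZE个数据连续存放,便于向量化的点对点乘累加 */
void vectorize_feature_map(const signed char *src, signed char *dst,
                           int W, int H, int N, int vec_size)
{
    int groups = N / vec_size;
    for (int g = 0; g < groups; ++g)
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x)
                for (int v = 0; v < vec_size; ++v) {
                    int n = g * vec_size + v;
                    dst[((g * H + y) * W + x) * vec_size + v] =
                        src[(n * H + y) * W + x];
                }
}
```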
在一种可实施方式中,依据输出向量数据的数据位宽和权重参数,对M个输出向量数据进行点对点的乘累加运算,获得第一累加结果,其中,M为大于或等于1的整数;将第一累加结果缓存到移位缓存区中;依据移位缓存区的深度,对第一累加结果进行局部累加,获得第二累加结果;将第二累加结果缓存到延时缓存区中;对延时缓存中的数据再次进行相加运算,获得输出特征图。
例如,采用char类型(数据位宽是8bit)来存储输出向量数据,然后将M个输出向量数据进行点对点的乘累加运算,即8bit上的每个bit位都需要与对应的权重参数的对应bit位上的数据进行乘法运算,获得8个乘积结果,然后再将这8个乘积结果依次进行累加运算,获得第一累加结果。当将第一累加结果缓存到移位缓存区后,需要调节移位缓存区的深度,使得能够形成一条启动间隔为1的流水线,该流水线的深度是依据VEC_SIZE确定的。然后依据移位缓存区的深度,对第一累加结果进行局部累加,获得第二累加结果,再将第二累加结果缓存到延时缓存区中;对延时缓存中的数据再次进行相加运算,获得输出特征图。
根据本申请实施例,通过不同的数据位宽,使得可以节省不必要的逻辑资源的浪费,并且根据输出向量数据的数据位宽和权重参数,对M个输出向量数据进行点对点的乘累加运算,获得第一累加结果,使得能够对具有不同的数据位宽的数据进行点对点的乘累加运算,提升数据的处理能力。并且,将第一累加结果缓存到移位缓存区和延时缓存区中,进行局部累加,使得能够形成一条启动间隔为1的流水线,提升对数据的处理速度。
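下面用C语言给出乘累加树与移位缓存区局部累加的一个简化软件示意(位宽、缓存深度等取值均为说明用的假设,FPGA上对应的是完全展开的并行硬件结构):

```c
/* 乘累加树的一次计算:对VEC_SIZE个8bit数据与权重做点对点乘法后求和,
 * 累加结果使用更宽的位宽(约16+Ceil(Log2(VEC_SIZE))位)存放 */
static int mac_tree(const signed char *data, const signed char *weight, int vec_size)
{
    int sum = 0;
    for (int i = 0; i < vec_size; ++i)
        sum += (int)data[i] * (int)weight[i];
    return sum;
}

/* 移位缓存区局部累加示意:每个周期取出最旧的部分和与新结果相加后压入,
 * 最后再把缓存区内的所有部分和相加,得到完整的累加结果 */
#define SHIFT_DEPTH 8   /* 移位缓存区深度,示例取值 */
static int local_accumulate(const int *partial_sums, int num)
{
    int shift_reg[SHIFT_DEPTH] = {0};
    for (int t = 0; t < num; ++t) {
        int acc = shift_reg[SHIFT_DEPTH - 1] + partial_sums[t];
        for (int k = SHIFT_DEPTH - 1; k > 0; --k)
            shift_reg[k] = shift_reg[k - 1];
        shift_reg[0] = acc;
    }
    int result = 0;
    for (int k = 0; k < SHIFT_DEPTH; ++k)
        result += shift_reg[k];
    return result;
}
```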
在一种可实施方式中,在步骤132之后,图像处理方法还可包括:将输出特征图缓存至输出缓存区,其中,输出缓存区的存储形式包括多端口加载模式。
根据本申请实施例,通过将输出特征图缓存至存储形式为多端口加载模式的输出缓存区,使得能够通过不同的端口对输出特征图进行处理,例如,当端口数为n时,每个端口将负责的行缓存区的个数为Ceil(行缓存区的个数/n),使得数据加载时间可以缩减n倍,提升了数据加载的效率。
在一种可实施方式中,在步骤132之后,图像处理方法还可包括:依据第一深度并行度,对输出特征图进行重排,获得重排结果;将重排结果输出至输出缓存区。
其中,将重排结果输出至输出缓存区,包括:以多端口存储数据的形式,将重排结果输出至输出缓存区。
通过对输出特征图进行重排,可以为后续的最大池化操作提供并行度,并且还可以为下层卷积的输入提供正确的数据格式,使得数据能够得到尽快的处理,提升数据处理效率。
在一种可实施方式中,将重排结果输出至输出缓存区,包括:依据第一深度并行度对重排结果进行池化处理,获得池化后的结果;将池化后的结果以开放运算语言(Open Computing Language,OpenCL)管道(channel)的形式输出至输出缓存区。
通过将池化后的结果以OpenCL管道的形式输出至输出缓存区,其中的OpenCL管道可以保证数据的高效互通,使得能够在池化层与输出缓存区之间的形成深度流水线结构,提升数据处理效率。
在一种可实施方式中,可通过下面的方式获得输入特征图的第一深度并行度和输出特征图的第二深度并行度,使得第一深度并行度和第二深度并行度能够匹配,以适应不同的硬件环境。实际使用时,由于硬件参数较多,若某些参数发生变化时,使得卷积加速的最终性能也随之变化。采用下文中的步骤确定并行度模型,使得可以通过该并行度模型,来确定输入特征图的第一深度并行度和输出特征图的第 二深度并行度的最佳匹配组合,以达到期望的性能指标。图3为本申请实施例提供的生成并行度模型并依据该并行度模型确定第一深度并行度和第二深度并行度的方法的一种流程示意图。如图3所示,生成并行度模型,并依据该并行度模型确定第一深度并行度和第二深度并行度可包括如下步骤301-步骤306。
在步骤301中,对硬件参数进行分析,获得输入特征图的纵向并行度。
例如,如表1所示,硬件参数包括数据读内核的端口数、数据写内核的端口数、输入特征图的第一深度并行度、输入特征图的纵向并行度和输出特征图的第二深度并行度。其中VEC_SIZE的取值是2的幂次(例如,2,4,8,16等)。PE_NUM_Y、PE_NUM_Z、n和m的取值可以是大于或等于1的整数。
需要说明的是,其中,PE_NUM_Y的取值应尽可能的被输入特征图各层数据高整除。在PE_NUM_Y已知后,通过加载一个数据缓冲区所需时钟周期数的公式,以及将一个数据缓存区的数据传输到最大池化内核所需的时钟周期数的计算公式,即可算出n的最佳取值,再通过对输出缓存区的分析确定m的取值。
表1 硬件参数列表
可变参数及其含义:
n:数据读内核的端口数
m:数据写内核的端口数
VEC_SIZE:输入特征图的第一深度并行度
PE_NUM_Y:输入特征图的纵向并行度
PE_NUM_Z:输出特征图的第二深度并行度
在步骤302中,获取硬件资源信息。
其中,硬件资源信息包括:Logic资源、DSP芯片的数量和随机存取存储器(Random Access Memory,RAM)资源。
在一种可实施方式中,RAM资源包括片上缓存区和全局内存端口的数量。例如,数据读内核的数据缓存区的数据位宽为8,S Line表示每条 行缓存的数据长度,每条行缓存需要C Rd_Line个M20K,如式(1)所示;数据缓存区总共需要M20K的个数为C Rd_f,如式(2)所示,其中,2表示双缓存区。
(式(1)和式(2)在原文中以图片形式给出,此处从略。)
权重缓存区需要的内存空间大小为h w=s w*PE_NUM_Z,单位是Byte。权重缓存区的数据位宽为8,实际需M20K内存单元个数为C Rd_w,如式(3)所示。基于Intel FPGA OpenCL的编译器默认按2的幂次开辟内存空间,当h w的取值不是2的幂次时,编译器实际分配的内存空间将大于h w
(式(3)在原文中以图片形式给出,此处从略。)
最大池化层的内核的行缓存的长度为S pool,需M20K内存单元C Pool个,如式(4)所示。其中,2表示两条行缓存。行缓存的个数与池化窗口的大小有关。
(式(4)在原文中以图片形式给出,此处从略。)
数据写内核有n个数据加载端口,每个端口需要C Load_f个M20K;权重加载端口需要C Load_w个M20K;偏置加载端口需要C Load_b个M20K。数据读内核的端口总共需要C Rd_Port个M20K内存单元,如式(5)所示。
C_Rd_Port = C_Load_f*n + C_Load_w + C_Load_b      (5)
数据写内核有m个数据存储端口,每个端口需要C Store_f个M20K,因此,总共需要C Wr_Port个M20K内存单元,如式(6)所示。
C_Wr_Port = C_Store_f*m      (6)
其中C Load_f、C Load_w、C Load_b和C Store_f均与全局内存访问端口的数据类型有关,如式(7)到式(9)所示。
C_Load_f = C_1*VEC_SIZE + C_0      (7)
C_Load_w = C_1*VEC_SIZE*PE_NUM_Z + C_0      (8)
C_Load_b = C_Store_f = C_1 + C_0      (9)
综上所述,RAM资源的总使用情况如式(10)所示:其中,C 0、C 1和 C 2均是与硬件平台有关的常量。
C_RAM = C_Rd_f + C_Rd_w + C_Pool + C_Rd_Port + C_Wr_Port + C_2      (10)
在一种可实施方式中,DSP的数量具体可通过如下公式计算获得。例如,若一个DSP支持两个8bit的乘法运算,则卷积运算内核中消耗的DSP的个数C DSP_CONV可由式(11)计算获得。总的DSP的数量C DSP如式(12)所示,其中,C 3、C 4、C 5和C 6均为常量。
(式(11)在原文中以图片形式给出,此处从略。)
C_DSP = C_3*VEC_SIZE*PE_NUM_Y*PE_NUM_Z + C_4*n + C_5*m + C_6      (12)
在一种可实施方式中,Logic资源的使用情况如式(13)所示,其中,C_Logic表示Logic资源的数量,C_7、C_8和C_9均为常量。
C_Logic = (C_7 + C_8*VEC_SIZE)*PE_NUM_Y*PE_NUM_Z + C_9      (13)
通过以上公式计算可知,依据输入特征图的第一深度并行度和纵向并行度,以及输出特征图的第二深度并行度,使得能够计算获得不同类型的硬件资源信息,将输入特征图与各种不同的硬件资源相匹配,方便后续的模型分析,提升系统移植性。
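式(12)和式(13)可以用如下C代码示意(其中C3~C9为需通过拟合得到的平台相关常量,结构与函数划分均为说明用的假设):

```c
/* 依据并行度参数估算DSP与Logic资源用量(式(12)、式(13)的直接对应) */
typedef struct {
    int vec_size;   /* VEC_SIZE */
    int pe_num_y;   /* PE_NUM_Y */
    int pe_num_z;   /* PE_NUM_Z */
    int n;          /* 数据读内核端口数 */
    int m;          /* 数据写内核端口数 */
} parallel_cfg_t;

static double estimate_dsp(const parallel_cfg_t *p,
                           double c3, double c4, double c5, double c6)
{
    return c3 * p->vec_size * p->pe_num_y * p->pe_num_z
         + c4 * p->n + c5 * p->m + c6;
}

static double estimate_logic(const parallel_cfg_t *p,
                             double c7, double c8, double c9)
{
    return (c7 + c8 * p->vec_size) * p->pe_num_y * p->pe_num_z + c9;
}
```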
在步骤303中,依据卷积运算的理论计算时间、训练图像的权重值、训练图像的偏置值和训练图像所占用的空间大小,确定平均内存带宽模型。
其中,理论计算时间是依据训练图像的并行度信息、卷积的运算量和训练后的图像的第二深度并行度计算获得的时间,其中,训练图像的并行度信息包括训练图像的第一深度并行度和训练图像的纵向并行度。
例如,首先,通过公式(14)计算获得卷积运算的理论计算时间。其中,F_req表示时钟频率,Op_l表示第l层卷积的运算量。
(式(14)在原文中以图片形式给出,此处从略。)
对一个特定网络模型,其整体性能(FPS)如式(15)所示:
(式(15)在原文中以图片形式给出,此处从略。)
对第l层卷积,输入特征图在三个维度上分别需经过若干次预取,才能完成全部计算,各维度所需的预取次数可通过式(16)至式(18)计算获得。(式(16)至式(18)在原文中以图片形式给出,此处从略。)
因此,对整个CNN模型,需从片外全局内存加载的输入特征图的大小H f如式(19)所示,单位是Byte。其中,N Line表示行缓存的个数,N col表示每条行缓存内实际可执行的卷积的个数。
(式(19)在原文中以图片形式给出,此处从略。)
需从片外全局内存加载的权重值的大小H w如式(20)所示,单位是Byte。
(式(20)在原文中以图片形式给出,此处从略。)
需从片外全局内存加载的偏置的大小H b如式(21)所示,单位是Byte。
(式(21)在原文中以图片形式给出,此处从略。)
综上所述,平均内存带宽H total如式(22),单位是Byte/s。
(式(22)在原文中以图片形式给出,此处从略。)
通过以上公式的计算,可获得平均内存带宽模型,即式(22),使得能够获知需使用的平均内存带宽,保证以该平均内存带宽可以对输入特征图进行处理,提升卷积运算速度。
在步骤304中,依据硬件资源信息和平均内存带宽模型,确定并行度模型。
通过将平均内存带宽模型与硬件资源信息相匹配,使得平均内存带宽模型能够在硬件资源受限的情况下,满足对输入特征图的卷积运算要求,通过多次训练,最终获得并行度模型。
例如,首先在目标板卡上进行几组快速编译,根据编译结果得到基础平台信息;然后通过函数拟合近似得到式(10)、式(12)和式(13)中C_0~C_9的取值;然后,通过对以上硬件资源信息的分析,确定可用的(PE_NUM_Z,VEC_SIZE)组合。在PE_NUM_Y、n、m确定的前提下,确定并行度模型,使得方便后续使用该并行度模型,确定可用的PE_NUM_Z和VEC_SIZE的组合。
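下面的C代码示意在资源约束下枚举(PE_NUM_Z,VEC_SIZE)组合的基本流程(资源估算函数、候选取值与资源上限均为示例性假设,实际取值需结合前述拟合结果与板卡参数确定):

```c
#include <stdio.h>

/* 假设:以下两个估算函数分别按式(12)与式(10)实现 */
extern double estimate_dsp_usage(int vec_size, int pe_num_z);
extern double estimate_ram_usage(int vec_size, int pe_num_z);

/* 在DSP与RAM资源上限内枚举可行的(VEC_SIZE, PE_NUM_Z)组合 */
void explore_parallelism(double dsp_limit, double ram_limit)
{
    const int vec_candidates[] = {2, 4, 8, 16};   /* VEC_SIZE取2的幂次 */
    for (int i = 0; i < 4; ++i)
        for (int pe_z = 1; pe_z <= 64; ++pe_z) {
            int vec = vec_candidates[i];
            if (estimate_dsp_usage(vec, pe_z) <= dsp_limit &&
                estimate_ram_usage(vec, pe_z) <= ram_limit)
                printf("feasible: VEC_SIZE=%d, PE_NUM_Z=%d\n", vec, pe_z);
        }
}
```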
在步骤305中,将待验证特征图输入到并行度模型中进行验证,获得验证后的特征图和验证后的特征图的第二深度并行度。
其中,待验证特征图包括第一深度并行度和纵向并行度。
具体实现时,还需要对验证后的特征图继续进行池化处理和全连接处理,获得该验证后的特征图对应的输出图像。
在步骤306中,若确定验证后的特征图对应的输出图像符合系统的性能要求,则获得第一深度并行度和第二深度并行度。
例如,图4是本申请实施例中的输出图像的性能参数和相关的基于FPGA加速的YOLO网络的性能参数的对比表。如图4所示,通过不同的硬件资源,例如,型号为Zynq 7045的FPGA,型号为XILINX KU115的FPGA,型号为Zynq MPSoC的FPGA,或型号为Arria-10GX1150的FPGA等,DSP芯片对应不同的CNN框架(例如,采用YOLOv1算法,或YOLOv2算法等搭建的网络框架),对应的精度不同,处理器的计算能力也不同,最终获得的网络吞吐量(Throughput)、画面每秒传输帧数(Frame Per Second,FPS)等都不尽相同。通过将验证图像输入到并行度模型中进行训练,当输出图像符合硬件系统的性能时,例如,获得较高的FPS或Throughput时,即确定该输出图像符合系统的性能要求,则将此时获得的(PE_NUM_Z,VEC_SIZE)组合作为最终的第一深度并行度和第二深度并行度。
其中,网络吞吐量是采用每秒的关键帧的周期(Group of picture,GOP)来衡量的,精度包括定点型数据和浮点型数据。
通过以上对于并行度模型的分析和搭建,使得可以获得与系统性能相匹配的第一深度并行度和第二深度并行度,然后依据该第一深度并行度和第二深度并行度进行卷积运算,以提高卷积运算速度,提升对输入特征图的处理速度。
根据本申请实施例提供的图像处理方法,通过输入特征图的第一深度并行度和纵向并行度对输入特征图进行向量化处理,获得N个输入向量数据,然后使用N个输入向量数据与卷积核同时进行卷积运算,获得输出特征图,使得一个卷积内核能够同时与N个输入向量数据进行卷积运算,保证在低能耗的情况下,加快卷积运算的速度, 提升对输入特征图的处理速度。同时,输出特征图中的目标物体的精度优于输入特征图中的目标物体的精度,使得能够提升目标物体的精度,以使输入特征图中的目标物体的类别更准确,方便在机器视觉领域中的应用。
下面结合附图,详细介绍本申请实施例提供的节点设备。图5示出了本申请实施例提供的图像处理装置的一种结构示意图。图像处理装置可以使用FPGA来实现。如图5所示,该图像处理装置可包括预处理模块501、向量化处理模块502和卷积运算模块503。
预处理模块501,可被配置为对待检测的图像进行预处理获得输入特征图,并提取输入特征图的第一深度并行度和纵向并行度。向量化处理模块502,可被配置为依据第一深度并行度和纵向并行度,对输入特征图进行向量化处理,获得N个输入向量数据,其中,N为大于或等于1的整数。卷积运算模块503,可被配置为使用N个输入向量数据与卷积核同时进行卷积运算,获得输出特征图。
根据本申请实施例提供的图像处理装置,通过向量化处理模块依据输入特征图的第一深度并行度和纵向并行度对输入特征图进行向量化处理,获得N个输入向量数据,然后使用卷积运算模块将N个输入向量数据与卷积核同时进行卷积运算,获得输出特征图,使得一个卷积内核能够同时与N个输入向量数据进行卷积运算,保证在低能耗的情况下,加快卷积运算的速度,提升对输入特征图的处理速度。同时,输出特征图中的目标物体的精度优于输入特征图中的目标物体的精度,使得能够提升目标物体的精度,以使输入特征图中的目标物体的类别更准确,方便在机器视觉领域中的应用。
需要明确的是,本申请并不局限于上文中所描述并在图中示出的特定配置和处理。为了描述的方便和简洁,这里省略了对已知方法的详细描述,并且上述描述的系统、模块和单元的具体工作过程,可以参考本申请实施例对图像处理方法的相关描述,在此不再赘述。
图6示出本申请实施例提供的目标检测系统的一种结构示意图。如图6所示,目标检测系统可以包括主机端61和设备端62,主机端61与设备端62之间通过离线双倍数据传输总线(off-chip Double  Data Rate 3,off-chip DDR3)连接。其中,主机端61包括任务调度器(Task Scheduler)601和Reorg模块(Reorg Function)602。设备端62采用可重构的逻辑器件FPGA实现,其低功耗的特点使其在边缘端应用部署上具有明显优势,可高效的实现YOLOv2算法。设备端62包括数据读内核(MemRD Kernel)6100、数据写内核(MemWR Kernel)6200、卷积运算内核(Conv Kernel)6300和池化内核(MaxPool Kernel)6400。这些内核都采用单个工作项(Single Work Item)形式构造,使得设备端62实现高效流水线。同时,这些内核又通过OpenCL管道级联,形成内核与内核之间的深度流水线结构。
其中,MemRD Kernel 6100包括提取逻辑模块6110、权重缓存区6120和双缓存机制的输入缓存区6130。MemWR Kernel 6200包括重排模块6210和输出缓存区6220。Conv Kernel 6300包括多个脉冲阵列处理单元,例如,脉冲阵列处理单元6311、脉冲阵列处理单元6312和脉冲阵列处理单元6313等,还包括多个数据缓存区和多个权重缓存区,例如,与脉冲阵列处理单元6311相连接的数据缓存区6331和权重缓存区6321,与脉冲阵列处理单元6312相连接的数据缓存区6332和权重缓存区6322,与脉冲阵列处理单元6313相连接的数据缓存区6333和权重缓存区6323等。MaxPool Kernel 6400包括行缓存区(Line Buffer 1)6411和行缓存区(Line Buffer 2)6412和比较逻辑电路(MaxPool Logic)6420。其中,比较逻辑电路6420包括多个Max模块。行缓存区的个数与池化窗的大小有关,例如,3*3的池化窗需要两条行缓存,2*2的池化窗只需要1条行缓存。采用多尺度池化的方式,以提高该目标检测系统的可移植性。
其中,MemRD Kernel 6100负责为Conv Kernel 6300准备输入数据。MemRD Kernel 6100从全局内存中缓存一部分数据到本地内存,并将准备好的数据通过OpenCL管道传输到Conv Kernel 6300。以缓解带宽压力,保证系统的高吞吐率。MemWR Kernel 6200负责对卷积运算结果进行重排。MemWR Kernel 6200将卷积运算结果缓存到本地内存,并按一定的顺序重新排列。在有池化层时,将卷积运算结果通过OpenCL管道输出到MaxPool Kernel 6400,在没有池化层时,将 卷积运算结果传回全局缓存区。Conv Kernel 6300主要用于加速CNN中计算密集型的卷积运算、全连接操作和数据激活。为了提高运算效率,采用多个PE实现卷积的并行运算。MaxPool Kernel 6400对输出特征图进行下采样操作。MaxPool Kernel 6400通过OpenCL管道读取数据并进行处理,最后将处理结果保存至全局内存。当有池化层时,MaxPool Kernel 6400以OpenCL管道的形式将数据输出;当没有池化层时,MaxPool Kernel 6400以多端口存储数据的方式将数据输出,以平衡卷积运算和结果存储的速度。
例如,MemWR Kernel 6200将VEC_SIZE单位大小的数据传输至MaxPool Kernel 6400,MaxPool Kernel 6400采用3*3的最大池化窗进行处理。MemRD Kernel 6100先将输入特征图的前两行缓存到片上行缓存区,当第三行的第一个数据进入时,通过第一行缓存数据和第二行缓存数据求最大值,获得池化窗口内每一列的最大值,然后将该最大值送入深度为3的移位寄存器暂存,再通过第二行缓存和第三行缓存求最大值,最终求得池化窗内的最大值,并将该最大值存储回全局内存中。依次更新两个行缓存区的内容,Line Buffer 2被Line Buffer 1的数据更新,Line Buffer 1被输入的新数据更新,循环往复。具体实现时,还可以对不同的行缓存求平均值,进而对行缓存区的内容进行更新。以上对于MemWR Kernel 6200的行缓存区的更新操作仅是举例说明,可根据实际情况具体设定,其它未说明的行缓存区的更新操作也在本申请的保护范围之内,在此不再赘述。
通过行缓存与寄存器的配合,使得MaxPool Kernel 6400中能够形成一条启动间隔为1的流水线,每个时钟周期都能输出一个最大池化结果。因池化层的运算量很小,MaxPool Kernel 6400中只设置纵向并行度(即VEC_SIZE),其中,VEC_SIZE的取值与Conv Kernel6300中的纵向并行度保持一致,通过调整VEC_SIZE的大小,使得可以控制池化层的执行时间。
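下面的C代码给出利用两条行缓存与深度为3的移位寄存器实现3*3最大池化的一个简化示意(此处假设步长为1、不做边界填充,数据类型与缓存长度均为示例取值):

```c
static signed char max3(signed char a, signed char b, signed char c)
{
    signed char m = a > b ? a : b;
    return m > c ? m : c;
}

/* 3*3最大池化示意:line1保存上一行,line2保存上上一行,
 * col_max_sr为深度3的移位寄存器,保存最近3列的列最大值 */
void maxpool_3x3(const signed char *in, signed char *out, int W, int H)
{
    signed char line1[4096] = {0}, line2[4096] = {0};   /* 两条行缓存,假设W不超过4096 */
    for (int y = 0; y < H; ++y) {
        signed char col_max_sr[3] = {-128, -128, -128};
        for (int x = 0; x < W; ++x) {
            signed char cur = in[y * W + x];
            signed char col_max = (y >= 2) ? max3(line2[x], line1[x], cur) : cur;
            col_max_sr[2] = col_max_sr[1];          /* 移位寄存器整体后移 */
            col_max_sr[1] = col_max_sr[0];
            col_max_sr[0] = col_max;
            if (y >= 2 && x >= 2)                   /* 每个周期可输出一个池化结果 */
                out[(y - 2) * W + (x - 2)] =
                    max3(col_max_sr[0], col_max_sr[1], col_max_sr[2]);
            line2[x] = line1[x];                    /* Line Buffer 2被Line Buffer 1更新 */
            line1[x] = cur;                         /* Line Buffer 1被新数据更新 */
        }
    }
}
```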
任务调度器601主要负责配置OpenCL的运行环境,以及通过OpenCL特定的应用程序接口(Application Programming Interface,API)来调度设备端62的内核的执行与同步。按照图3所示的方法, 任务调度器601搭建完整的OpenCL执行环境。任务调度器601需基于上下文创建两个内存对象,分别用于存储卷积的输入特征图和输出特征图。每个内存对象都具有输入和输出两种属性,既可用于存储本层的输出,又可用于传输下层的输入。任务调度器601需在开始时将预处理后的输入数据以内存对象的形式,通过命令队列传输到FPGA片外全局内存区。
在每一层卷积执行之前,任务调度器601先通过特定API配置各个内核的参数,通过命令队列启动各个内核的执行,再通过事件监视各个内核是否执行完毕。池化层的执行由池化开关控制,当池化开关设为1时,启动MaxPool Kernel 6400的执行。待FPGA上的四个内核执行完毕后,任务调度器601将最终的输出结果通过命令队列的方式,保存至主机内存,以便执行后续操作。
Reorg模块602主要由Reorg函数实现,负责对卷积输出特征图进行重排。例如,通过调整网络的执行顺序,Reorg模块602可与第14层卷积并行执行。FPGA执行地址连续的内存存取操作更为高效,因Reorg函数是对跳变内存地址进行读写操作,逻辑较为简单,且Reorg函数运算量极少,在CPU上的执行时间比第14层卷积的执行时间要短,且在网络中只使用一次,因此,将Reorg模块602放置于主机端61,可以节省FPGA片上资源,进而提高资源的利用率。
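下面给出Reorg(passthrough)重排的一个C语言示意,采用YOLOv2中常见的stride=2的空间到深度映射方式;具体的地址映射关系以实际网络定义为准,此处仅为说明:

```c
/* Reorg示意:把(C, H, W)的特征图重排为(C*stride*stride, H/stride, W/stride),
 * 即按跳变地址读入、按连续地址写出(假设H、W能被stride整除) */
void reorg(const float *in, float *out, int C, int H, int W, int stride)
{
    int out_c = C * stride * stride;
    int out_h = H / stride;
    int out_w = W / stride;
    for (int c = 0; c < out_c; ++c)
        for (int y = 0; y < out_h; ++y)
            for (int x = 0; x < out_w; ++x) {
                int c_in = c % C;
                int offset = c / C;
                int y_in = y * stride + offset / stride;
                int x_in = x * stride + offset % stride;
                out[(c * out_h + y) * out_w + x] =
                    in[(c_in * H + y_in) * W + x_in];
            }
}
```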
图7是本申请实施例提供的Conv Kernel 6300中的依据并行度进行卷积运算的一种模块示意图。为了实现高效卷积运算,发挥FPGA硬件架构的优势,Conv Kernel 6300采用三种并行度的设计,具体包括输入特征图的第一深度并行度(VEC_SIZE)、输入特征图的纵向并行度(PE_NUM_Y)和输出特征图的第二深度并行度(PE_NUM_Z)。
其中,PE_NUM_Y和PE_NUM_Z的乘积与Conv Kernel 6300中的PE的总量保持一致。并且,输出特征图的PE_NUM_Z和卷积核的数量相同,VEC_SIZE和PE_NUM_Z是依据并行度模型确定的并行度,并行度模型是依据硬件资源和内存带宽模型确定的模型。通过unroll的方式实现输入特征图的VEC_SIZE的向量化。一个卷积内核可以与输 入特征图上的多个区域同时运算,以提高权重值的共享度;同时,输入特征图上的一个区域可以与多个卷积内核同时运算,以提高输入数据的共享度。
MemRD Kernel 6100,将原按(W,H,N)排列的输入特征图对应的数据重排为(VEC_SIZE,W,H,N/VEC_SIZE),并对VEC_SIZE进行向量化处理。然后输入PE_NUM_Y*VEC_SIZE大小的数据和PE_NUM_Z*VEC_SIZE大小的权重值至Conv Kernel 6300,使得Conv Kernel 6300对输入的数据进行K^2*N/VEC_SIZE次乘累加运算,获得PE_NUM_Y*PE_NUM_Z大小的数据,并将该PE_NUM_Y*PE_NUM_Z大小的数据输出至MemWR Kernel 6200。例如,如图7所示,其为本申请实施例提供的Conv Kernel 6300中的依据并行度进行卷积运算的一种模块示意图,MemRD Kernel 6100输入的输入特征图(Input Feature Map)是160*160*3(W*H*N)的图像,共有PE_NUM_Z个卷积核,并且每个卷积核的大小定义为3*3*3(即K等于3,K为大于或等于1的整数)。取一个卷积核与输入特征图的每一层进行运算,并且根据数据步长,对输入特征图依次进行横向取数(共计R个点),然后将取出的数据依次与每一个卷积核进行运算,最后获得输出特征图,该输出特征图的大小为R*C*M。需要说明的是,在对输入特征图进行横向取数时,也可以多行一起取,以提高卷积的运算速度。
图8是本申请实施例提供的Conv Kernel 6300中的处理单元的运算过程的一种示意图。一个处理单元可包括:卷积运算逻辑单元810和激活函数逻辑单元820。其中,卷积运算逻辑单元810包括:自定义的MAC子单元811和局部累加子单元812。
1)MAC子单元811:可支持向量数据的输入,被配置为将向量化的输入数据(Vectorized data)和权重(Vectorized weight)输入至MAC子单元811中,使得MAC子单元811可以依据乘累加树,对输入的数据及对应的权重进行计算。具体地,可先对输入数据进行点对点的乘法运算,然后将获得的乘法运算结果进行相加运算,在经过K^2*N/VEC_SIZE次的运算后,获得MAC子单元811的输出结果。再将该输出结果输入至局部累加子单元812进行缓存。
需要说明的是,Intel FPGA板卡提供可变精度的DSP,即一个DSP可支持多种数据位宽的乘法运算,其中,d0、d1、……、dn-1、dn表示输入数据的每一个bit位上的数据,w0、w1、……、wn-1、wn表示权重值的每一个bit位上的数据,n为大于1的整数。例如,Intel Stratix V GXA7FPGA开发板中的一个DSP可执行1个27bit*27bit的乘法运算,也可执行3个9bit*9bit的乘法运算。实际使用时,可通过重新配置内核的相关参数,来改变数据位宽,并指定某个特定的DSP来进行对应数据位宽的计算,使得编译频率得以提高。
在C语言中,整数的数据类型包括char(8bit)类型、short(16bit)类型、int(32bit)类型和long(64bit)类型等,各个数据类型的数据位宽均为2的幂次。例如,若采用char类型来存储定点数,则两个8bit的整数相乘,获得的结果需用16bit的存储空间;若j个1bit的整数相加,获得的加和结果需用Ceil(Log2(j))位的存储空间,因此,j个8bit的整数在进行乘累加运算后,所获得的乘累加结果共需要(16+Ceil(Log2(j)))位的存储空间。其中,i、j均为大于或等于2的整数,Ceil表示对数据进行上取整。因此,MAC子单元811输出的乘累加结果的数据位宽为(16+Ceil(Log2(VEC_SIZE))),延时缓存区的数据位宽为(16+Ceil(Log2(sw/h)))。其中,sw表示网络模型中各层单个卷积的最大运算量,h表示延时缓存区的深度。例如,在YOLOv2算法中第22层单个卷积运算量最大为32*1280,深度为6,则延时缓存区的数据位宽(16+Ceil(Log2(sw/h)))为29。通过对数据位宽的设置,使得可以节省不必要的逻辑资源浪费。
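按上述公式可对该示例做一个简单的数值验算(sw取32*1280、h取6):sw/h≈6826.7,Ceil(Log2(6826.7))=13,因此延时缓存区的数据位宽为16+13=29位,与上文给出的29一致。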
2)局部累加子单元812:被配置为实现手动时钟对齐,保证流水线的高饱和度的运行。局部累加单元在获得自定义的MAC子单元811输入的经过K2ⅹN/VEC_SIZE次的MAC子单元811的运算结果,会依据延时缓存的设计,对缓存的数据进行相加运算,获得卷积结果,此时的卷积结果的数据位宽大于向量化的输入数据的位宽,需要对卷积结果再进行截断操作,以使最终结果的数据位宽与向量化的输入数 据的位宽相同。
图9a示出本申请实施例提供的没有添加局部累加子单元812时的乘累加操作的一种示意图。其中,以Fetch函数处理和MAC层及累加处理为一个处理单元,在时间轴上进行依次处理;但由于各个处理单元之间存在数据依赖(即数据之间的依赖关系),使得在一个处理单元完成后,才能开始进行下一个处理单元的循环迭代,导致启动间隔Π大于1。而图9b示出本申请实施例提供的增加了局部累加子单元812后的乘累加操作的一种示意图。此时,MAC子单元811的输出被送入局部累加子单元812进行数据缓存。具体实现时,局部累加子单元812可由移位寄存器组成,使得通过调节移位寄存器的深度,就能形成一条启动间隔Π等于1的流水线,其中,流水线的深度为K^2*N/VEC_SIZE,提高了流水线的处理效率。
3)激活函数逻辑单元820,被配置为对卷积运算逻辑单元810输入的最终结果进行激活处理。例如,将最终结果送入Leaky ReLU逻辑电路,根据符号位X选择是否执行移位操作(例如,如图8所示,若X<0,则需要向左移动3位,若X>=0,则无需进行移位),然后,通过OpenCL管道将处理单元的输出结果输入至MemWR Kernel 6200。
稀缺的片上存储单元与片上存储空间的高需求之间总是存在着冲突。一方面,由于片外内存区的带宽有限且访问延时高,使得片上存储单元的设计可以减轻片外内存区的带宽压力。另一方面,由于片上存储单元的存储空间非常稀少,不可能将整个神经网络模型都缓存到片上存储单元中,使得片上存储单元的设计成为保证系统吞吐率的关键。本申请中的设备端62包括三个缓存区,即输入数据缓存区(6130、6331、6332和6333)、权重和偏置缓存区(6120、6321、6322和6323)和输出缓存区6220。其中,各个数据缓存区可采用折叠存储形式、双缓存机制和多端口加载模式中的任一种存储形式。
图10示出本申请实施例提供的采用折叠存储形式的数据缓存区的一种结构示意图。其中,S表示数据步长,K表示输入特征图的数据长度。H row表示输入特征图在纵向维度(Y维度)上的PE_NUM_Y个卷积运算所需的数据长度,具体如式(23)所示:
H_row = (PE_NUM_Y-1)×S + K      (23)
依据纵向并行度(PE_NUM_Y),使得输入特征图上的多个区域能够被同步处理。使得可以减少行缓存的数目,同时提高CNN结构的通用性。当从全局内存加载数据到本地数据缓存区时,每条行缓存每次存储数据长度为S的数据,以便每条行缓存都能输出一个数据。
图11示出本申请实施例提供的数据缓存区的一种结构示意图。该数据缓存区可包括多条行缓存,是一个二维缓存区。与一维缓存区相比,二维缓存区具有更高的数据重用率。提高本地缓存区的数据重用率,也就意味着带宽的减少。实际使用时,该二维缓存区比一维缓存区可节省约57%的带宽。
其中,S Line表示每条行缓存的长度,N Line表示行缓存的个数,如式(24)所示;每条行缓存实际使用的长度为W col(W col≤S Line),W col可根据输入特征图的深度和卷积步长动态调整,如式(25)所示;每条行缓存内实际可执行的卷积个数为N col,由W col决定,如式(26)所示;考虑到卷积运算的完整性,S Line的取值要确保每个卷积层都至少有一个卷积区域被缓存。
(式(24)和式(25)在原文中以图片形式给出,此处从略。)
N_col = FLOOR((W_col-K)/S + 1)      (26)
其中,FLOOR(X)函数的功能是“向下取整”,即向下舍入,取不大于X的最大整数。
其中,行缓存区可设计为双缓存机制。即一个数据缓存区从片外全局内存加载数据,另一个缓存区向Conv Kernel 6300传输预存的数据,这两个缓存区是交替并同时进行数据操作的。数据缓存区一次可从片外全局内存加载的数据大小为VEC_SIZE,并向Conv Kernel 6300传输的数据大小为(PE_NUM_Y*VEC_SIZE)。使得可以节省卷积等待数据加载的时间,提高数据传输效率,为高效卷积运算提供保障。
在一个具体实现中,双缓存机制会带来数据的并行传输和串行加载速度不匹配的问题。假设一个时钟周期能完成一个数据操作,那么从全 局内存加载一个数据缓冲区所需的时钟周期数为T load,如式(27)所示;将一个数据缓存区的数据传输到Conv Kernel 6300所需的时钟周期数为T trans,如式(28)所示;若PE_NUM_Y设置的很大,则此时T load>T trans。因此,为平衡两个缓存区的速度,保证MemRD Kernel 6100、Conv Kernel6300和MemWR Kernel 6200三个内核之间的深度流水线的顺畅执行,需要进行多端口加载数据。
(式(27)和式(28)在原文中以图片形式给出,此处从略。)
如图11所示,当端口数为n时,每个端口将负责Ceil(N_Line/n)条行缓存区,数据加载时间将缩减n倍。
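下面用C代码示意双缓存(乒乓缓存)的交替关系:一个缓存区从片外全局内存加载数据的同时,另一个缓存区向卷积核供数(硬件上两者并行执行,这里用顺序代码表达其轮换逻辑,外部函数与容量取值均为假设):

```c
#include <stdbool.h>

#define BUF_WORDS 1024   /* 单个数据缓存区容量,示例取值 */

extern void load_from_global(signed char *buf, int tile_idx);   /* 假设:从片外全局内存加载 */
extern void feed_conv_kernel(const signed char *buf);           /* 假设:向卷积运算内核供数 */

void double_buffer_loop(int num_tiles)
{
    signed char buf_a[BUF_WORDS], buf_b[BUF_WORDS];
    bool use_a = true;
    load_from_global(buf_a, 0);                     /* 预先加载第一块数据 */
    for (int t = 0; t < num_tiles; ++t) {
        signed char *cur  = use_a ? buf_a : buf_b;  /* 本轮向卷积核供数的缓存区 */
        signed char *next = use_a ? buf_b : buf_a;  /* 本轮同时预加载下一块数据的缓存区 */
        if (t + 1 < num_tiles)
            load_from_global(next, t + 1);
        feed_conv_kernel(cur);
        use_a = !use_a;                             /* 两个缓存区交替使用 */
    }
}
```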
在一种可实施方式中,MemRD Kernel 6100依据第一深度并行度和纵向并行度对权重进行预重排。例如,将权重由原来的(K,K,N,M)顺序,重新排列为(VEC_SIZE,PE_NUM_Z,K,K,N/VEC_SIZE,M/PE_NUM_Z)顺序。MemRD Kernel 6100依据第二深度并行度,将偏置重新排列为(PE_NUM_Z,M/PE_NUM_Z)顺序。使得能够为MaxPool Kernel 6400提供并行度,并为下层卷积的输入提供正确的数据格式。
其中,单个卷积核的内存大小s w如式(29)所示,取各层卷积的最大值;权重缓存区的总大小h w如式(30)所示,单位为Byte。偏置缓存区所需的内存大小h b如式(31)所示,单位为Byte。
s_w = Max(K_l*K_l*N_l)      (29)
h_w = s_w*PE_NUM_Z      (30)
h_b = PE_NUM_Z      (31)
需要说明的是,权重缓存区和偏置缓存区同样设置为双缓存机制,使得能够为Conv Kernel 6300的卷积运算提供高效数据传输,同时,在手动时钟对齐时,可使输入数据、权重和偏置能在同一时钟周期内通过OpenCL管道传出。
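下面的C代码示意把按(K,K,N,M)排列的权重预重排为(VEC_SIZE,PE_NUM_Z,K,K,N/VEC_SIZE,M/PE_NUM_Z)顺序的过程(此处约定排列中靠前的维度在内存中变化最快,并假设N、M分别能被VEC_SIZE、PE_NUM_Z整除,索引约定仅为示例):

```c
/* 权重预重排:使VEC_SIZE与PE_NUM_Z方向上的权重在内存中连续存放 */
void rearrange_weights(const signed char *src, signed char *dst,
                       int K, int N, int M, int vec_size, int pe_num_z)
{
    long idx = 0;
    for (int mo = 0; mo < M / pe_num_z; ++mo)            /* M/PE_NUM_Z,最外层 */
        for (int no = 0; no < N / vec_size; ++no)        /* N/VEC_SIZE */
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    for (int pz = 0; pz < pe_num_z; ++pz)
                        for (int v = 0; v < vec_size; ++v) {   /* VEC_SIZE,最内层 */
                            int n = no * vec_size + v;
                            int m = mo * pe_num_z + pz;
                            /* 源数据按(K,K,N,M)排列,同样约定第一维变化最快 */
                            dst[idx++] = src[(((long)m * N + n) * K + ky) * K + kx];
                        }
}
```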
在一种可实施方式中,若输出缓存区6220的大小为(PE_NUM_Y*PE_NUM_Z)字节,则MemWR Kernel 6200在将输出缓存 区6220的数据存储回全局内存时,需要进行(PE_NUM_Y*PE_NUM_Z)次操作,而进行一次完整卷积运算,需要进行(K*K*N/VEC_SIZE)次操作。当输入特征图的深度较浅或并行度的数值过大时,MemWR Kernel 6200在将卷积结果回存至全局内存时,所需的时间将明显大于卷积运算时间。因此,需要对输出缓存区6220采用多端口数据存储形式。例如,端口数为m时,每个端口需要负责将CEIL(PE_NUM_Y*PE_NUM_Z/m)个数据输出到片外全局内存区。
本申请实施例提供的目标检测系统,具有高移植性和可扩展性。在对数据进行卷积运算、归一化处理和激活处理时,采用原始网络结构,使得在保证很好的准确度的同时,能够考虑到权重值的变化对结果的影响。此外,采用8bit定点数量化,可以保证精度损失在可接受的范围内。
图12示出本申请实施例提供的机器视觉设备的一种结构示意图。该机器视觉设备可包括图像获取装置1201和图像处理装置1202。
图像获取装置1201,可被配置为获取待检测的图像,其中,待检测的图像包括待确定的目标物体;图像处理装置1202,可被配置为根据本申请实施例提供的图像处理方法,对待检测的图像进行检测。
其中,输出特征图中的目标物体的精度优于输入特征图中的目标物体的精度。例如,在机器视觉的应用中,待检测的图像中包括的待确定的目标物体可能是一只狗或一辆自行车,但由于背景物体的颜色,及物体在图片中的位置等原因,使得机器人在观察该待检测的图像时,无法准确获取待检测的图像中的待确定的目标物体的类别和位置信息等,通过依据待检测的图像的第一深度并行度和纵向并行度对待检测的图像进行向量化处理,获得N个输入向量数据,并使用N个输入向量数据与卷积核同时进行卷积运算,获得输出特征图,所获得的输出特征图中的目标物体的类别更清晰,在机器人对该输出特征图进行观察时,可获得输出特征图中的目标物体是一只狗和一辆自行车,并可以更准确的获得狗和自行车的位置信息等,提高了对待确定的目标物体的检测精度。
根据本申请实施例提供的机器视觉设备,通过图像获取装置获 取到待检测的图像,并使用图像处理装置依据图像处理方法对待检测的图像进行检测,加快在图像分析的过程中的卷积运算的速度,提升对待检测的图像的处理速度;并且提高待检测的图像的精度,使得待确定的目标物体的类别清晰可见,方便在机器视觉领域中的应用。
需要明确的是,本申请并不局限于上文中所描述并在图中示出的特定配置和处理。为了描述的方便和简洁,这里省略了对已知方法的详细描述,并且上述描述的系统、模块和单元的具体工作过程,可以参考本申请实施例对图像处理方法的相关描述,在此不再赘述。
图13示出本申请实施例提供的电子设备的一种示例性硬件架构示意图。
如图13所示,电子设备1300可包括输入设备1301、输入接口1302、中央处理器1303、存储器1304、输出接口1305、以及输出设备1306。其中,输入接口1302、中央处理器1303、存储器1304、以及输出接口1305通过总线1307相互连接,输入设备1301和输出设备1306分别通过输入接口1302和输出接口1305与总线1307连接,进而与电子设备1300的其它组件连接。
具体地,输入设备1301接收来自外部的输入信息,并通过输入接口1302将输入信息传送到中央处理器1303;中央处理器1303基于存储器1304中存储的计算机可执行指令对输入信息进行处理以生成输出信息,将输出信息临时或者永久地存储在存储器1304中,然后通过输出接口1305将输出信息传送到输出设备1306;输出设备1306将输出信息输出到电子设备1300的外部供用户使用。
在一种可实施方式中,图13所示的电子设备1300可以被实现为一种网络设备,该网络设备(即电子设备1300)可以包括:存储器,被配置为存储程序;处理器,被配置为运行存储器中存储的程序,以执行本申请实施例提供的图像处理方法的至少一个步骤。
一般来说,本申请实施例可以在硬件或专用电路、软件、逻辑或其任何组合中实现。例如,一些方面可以被实现在硬件中,而其它方面可以被实现在可以被控制器、微处理器或其它计算装置执行的固件或软件中,尽管本申请不限于此。
本申请实施例可以通过移动装置的数据处理器执行计算机程序指令来实现,例如在处理器实体中,或者通过硬件,或者通过软件和硬件的组合。计算机程序指令可以是汇编指令、指令集架构(ISA)指令、机器指令、机器相关指令、微代码、固件指令、状态设置数据、或者以一种或多种编程语言的任意组合编写的源代码或目标代码。
本申请附图中的任何逻辑流程的框图可以表示程序步骤,或者可以表示相互连接的逻辑电路、模块和功能,或者可以表示程序步骤与逻辑电路、模块和功能的组合。计算机程序可以存储在存储器上。存储器可以具有任何适合于本地技术环境的类型并且可以使用任何适合的数据存储技术实现,例如但不限于只读存储器(ROM)、随机访问存储器(RAM)、光存储器装置和系统(数码多功能光碟DVD或CD光盘)等。计算机可读介质可以包括非瞬时性存储介质。数据处理器可以是任何适合于本地技术环境的类型,例如但不限于通用计算机、专用计算机、微处理器、DSP、专用集成电路(ASIC)、FPGA以及基于多核处理器架构的处理器。
通过示范性和非限制性的示例,上文已提供了对本申请的示范实施例的详细描述。但结合附图和权利要求来考虑,对以上实施例的多种修改和调整对本领域技术人员来说是显而易见的,但不偏离本申请的范围。因此,本申请的恰当范围将根据权利要求确定。

Claims (14)

  1. 一种图像处理方法,包括:
    对待检测的图像进行预处理获得输入特征图,并提取所述输入特征图的第一深度并行度和纵向并行度;
    依据所述第一深度并行度和所述纵向并行度对所述输入特征图进行向量化处理,获得N个输入向量数据,其中,N为大于或等于1的整数;以及
    使用所述N个输入向量数据与卷积核同时进行卷积运算,获得输出特征图。
  2. 根据权利要求1所述的方法,其中,所述输出特征图的第二深度并行度和所述卷积核的数量相同,所述第一深度并行度和所述第二深度并行度是依据并行度模型确定的并行度,以及所述并行度模型是依据硬件资源和内存带宽模型确定的模型。
  3. 根据权利要求1所述的方法,其中,所述使用所述N个输入向量数据与卷积核同时进行卷积运算,获得输出特征图,包括:
    使用所述N个输入向量数据与卷积核同时进行卷积运算,获得输出向量数据;以及
    利用乘累加树对所述输出向量数据和对应的权重参数进行处理,获得所述输出特征图。
  4. 根据权利要求3所述的方法,其中,所述利用乘累加树对所述输出向量数据和对应的权重参数进行处理,获得所述输出特征图,包括:
    依据所述输出向量数据的数据位宽和所述权重参数,对M个所述输出向量数据进行点对点的乘累加运算,获得第一累加结果,其中,M为大于或等于1的整数;
    将所述第一累加结果缓存到移位缓存区中;
    依据所述移位缓存区的深度,对所述第一累加结果进行局部累 加,获得第二累加结果;
    将所述第二累加结果缓存到延时缓存区中;以及
    对所述延时缓存中的数据再次进行相加运算,获得所述输出特征图。
  5. 根据权利要求3所述的方法,在所述利用乘累加树对所述输出向量数据和对应的权重参数进行处理,获得所述输出特征图的步骤之后,还包括:
    将所述输出特征图缓存至输出缓存区,其中,所述输出缓存区的存储形式包括多端口加载模式。
  6. 根据权利要求3所述的方法,在所述利用乘累加树对所述输出向量数据和对应的权重参数进行处理,获得所述输出特征图的步骤之后,还包括:
    依据所述第一深度并行度,对所述输出特征图进行重排,获得重排结果;以及
    将所述重排结果输出至输出缓存区。
  7. 根据权利要求6所述的方法,其中,所述将所述重排结果输出至输出缓存区,包括:
    以多端口存储数据的形式,将所述重排结果输出至所述输出缓存区。
  8. 根据权利要求6所述的方法,其中,所述将所述重排结果输出至输出缓存区,包括:
    依据所述第一深度并行度对所述重排结果进行池化处理,获得池化后的结果;以及
    将所述池化后的结果以开放运算语言管道的形式输出至所述输出缓存区。
  9. 根据权利要求1至8中任一项所述的方法,在所述依据所述第一深度并行度和所述纵向并行度对所述输入特征图进行向量化处理,获得N个输入向量数据的步骤之前,还包括:
    将所述输入特征图缓存到输入缓存区,其中,所述输入缓存区的存储形式至少包括折叠存储形式、双缓存机制和多端口加载形式中的任一项。
  10. 根据权利要求9所述的方法,其中,所述将所述输入特征图缓存到输入缓存区,包括:
    响应于确定所述输入缓存区的存储形式是所述折叠形式,依据数据步长将所述输入特征图对应的数据进行折叠,并将折叠后的数据存储在所述输入缓存区中;其中,所述数据步长是依据所述纵向并行度、所述输入特征图在纵向维度上进行卷积运算所需的数据长度和单位步长确定的值;以及
    响应于确定所述输入缓存区的存储形式是所述多端口加载形式,依据端口的数量和加载一个数据缓冲区所需的时钟周期数,将所述输入特征图对应的数据缓存到所述输入缓存区。
  11. 一种图像处理装置,包括:
    预处理模块,被配置为对待检测的图像进行预处理获得输入特征图,并提取所述输入特征图的第一深度并行度和纵向并行度;
    向量化处理模块,被配置为根据所述第一深度并行度和所述纵向并行度对所述输入特征图进行向量化处理,获得N个输入向量数据,其中,N为大于或等于1的整数;以及
    卷积运算模块,被配置为使用所述N个输入向量数据与卷积核同时进行卷积运算,获得输出特征图。
  12. 一种机器视觉设备,包括:
    图像获取装置,被配置为获取待检测的图像,其中,所述待检测的图像包括待确定的目标物体;以及
    图像处理装置,被配置为执行根据权利要求1-10中任一项所述的图像处理方法对所述待检测的图像进行检测,并确定所述待确定的目标物体的类别。
  13. 一种电子设备,包括:
    一个或多个处理器;以及
    存储器,其上存储有一个或多个程序,当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现根据权利要求1-10中任一项所述的图像处理方法。
  14. 一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现根据权利要求1-10中任一项所述的图像处理方法。
PCT/CN2021/096062 2020-06-12 2021-05-26 图像处理方法及装置、机器视觉设备、电子设备和计算机可读存储介质 WO2021249192A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010537684.4 2020-06-12
CN202010537684.4A CN113807998A (zh) 2020-06-12 2020-06-12 图像处理方法、目标检测装置、机器视觉设备和存储介质

Publications (1)

Publication Number Publication Date
WO2021249192A1 true WO2021249192A1 (zh) 2021-12-16

Family

ID=78845231

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096062 WO2021249192A1 (zh) 2020-06-12 2021-05-26 图像处理方法及装置、机器视觉设备、电子设备和计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN113807998A (zh)
WO (1) WO2021249192A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023125838A1 (zh) * 2021-12-30 2023-07-06 深圳云天励飞技术股份有限公司 数据处理方法、装置、终端设备及计算机可读存储介质
CN117391149A (zh) * 2023-11-30 2024-01-12 爱芯元智半导体(宁波)有限公司 神经网络输出数据的处理方法、装置及芯片

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108902B (zh) * 2023-02-22 2024-01-05 成都登临科技有限公司 采样操作实现系统、方法、电子设备及存储介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190102671A1 (en) * 2017-09-29 2019-04-04 Intel Corporation Inner product convolutional neural network accelerator
CN110070178A (zh) * 2019-04-25 2019-07-30 北京交通大学 一种卷积神经网络计算装置及方法
CN110321997A (zh) * 2018-03-31 2019-10-11 北京深鉴智能科技有限公司 高并行度计算平台、系统及计算实现方法

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190102671A1 (en) * 2017-09-29 2019-04-04 Intel Corporation Inner product convolutional neural network accelerator
CN110321997A (zh) * 2018-03-31 2019-10-11 北京深鉴智能科技有限公司 高并行度计算平台、系统及计算实现方法
CN110070178A (zh) * 2019-04-25 2019-07-30 北京交通大学 一种卷积神经网络计算装置及方法

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHAO LI; JUNJIE WU: "Advanced Computer Architecture", 11 August 2018, SPRINGER, SINGAPORE, ISBN: 978-981-13-2422-2, article KE XU; XIAOYUN WANG; SHIHANG FU; DONG WANG: "A Scalable FPGA Accelerator for Convolutional Neural Networks", DOI: 10.1007/978-981-13-2423-9_1 *
WANG DONG; XU KE; JIANG DIANKUN: "PipeCNN: An OpenCL-Based Open-Source FPGA Accelerator for Convolution Neural Networks", 2017 INTERNATIONAL CONFERENCE ON FIELD PROGRAMMABLE TECHNOLOGY (ICFPT), 11 December 2017 (2017-12-11), XP033313512, DOI: 10.1109/FPT.2017.8280160 *
WANG XIAOYUN: "Acceleration Method Research on CNN Related Object Detection Algorithm Based on OpenCL", CHINESE MASTER'S THESES FULL-TEXT DATABASE, 1 May 2019 (2019-05-01), pages 1 - 70, XP055878716 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023125838A1 (zh) * 2021-12-30 2023-07-06 深圳云天励飞技术股份有限公司 数据处理方法、装置、终端设备及计算机可读存储介质
CN117391149A (zh) * 2023-11-30 2024-01-12 爱芯元智半导体(宁波)有限公司 神经网络输出数据的处理方法、装置及芯片
CN117391149B (zh) * 2023-11-30 2024-03-26 爱芯元智半导体(宁波)有限公司 神经网络输出数据的处理方法、装置及芯片

Also Published As

Publication number Publication date
CN113807998A (zh) 2021-12-17

Similar Documents

Publication Publication Date Title
WO2021249192A1 (zh) 图像处理方法及装置、机器视觉设备、电子设备和计算机可读存储介质
CN111459877B (zh) 基于FPGA加速的Winograd YOLOv2目标检测模型方法
JP6977239B2 (ja) 行列乗算器
US11550543B2 (en) Semiconductor memory device employing processing in memory (PIM) and method of operating the semiconductor memory device
Fan et al. A real-time object detection accelerator with compressed SSDLite on FPGA
CN113313243B (zh) 神经网络加速器的确定方法、装置、设备以及存储介质
JP2021521515A (ja) 演算を加速するための方法および加速器装置
JP2021521516A (ja) 演算を加速するための加速器及びシステム
CN109446996B (zh) 基于fpga的人脸识别数据处理装置及处理方法
CN113469350B (zh) 一种适于npu的深度卷积神经网络加速方法和系统
CN112884137A (zh) 神经网络的硬件实现方式
KR20230142355A (ko) 데이터베이스 스캔 가속을 위한 시스템 및 방법
EP3447690A1 (en) Maxout layer operation apparatus and method
Zong-ling et al. The design of lightweight and multi parallel CNN accelerator based on FPGA
CN116888591A (zh) 一种矩阵乘法器、矩阵计算方法及相关设备
CN116301920B (zh) 一种用于部署cnn模型至基于fpga的高性能加速器的编译系统
US20230259578A1 (en) Configurable pooling processing unit for neural network accelerator
CN110716751B (zh) 高并行度计算平台、系统及计算实现方法
CN110659014B (zh) 乘法器及神经网络计算平台
CN116011534A (zh) 一种基于fpga的通用卷积神经网络加速器实现方法
CN115222028A (zh) 基于fpga的一维cnn-lstm加速平台及实现方法
CN115170381A (zh) 一种基于深度学习的视觉slam加速系统及方法
Wang et al. An FPGA-based hardware accelerator for real-time block-matching and 3D filtering
Ordoñez et al. A 3D convolution accelerator implemented on FPGA using SDSoC
WO2021080724A1 (en) Three dimensional convolution in neural network processor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21822369

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 22/05/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21822369

Country of ref document: EP

Kind code of ref document: A1