KR20160133924A - Apparatus and method for convolution operation - Google Patents
- Publication number
- KR20160133924A
- Authority
- KR
- South Korea
- Prior art keywords
- window
- convolution
- image
- pixels
- values
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
Landscapes
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Image Processing (AREA)
- Complex Calculations (AREA)
Abstract
A method and apparatus for convolutional computation are provided. The convolution arithmetic unit performs a convolution operation on the window using the values of the pixels of the window and the kernel coefficients. The window buffer provides the values of the pixels of the window of the image and the convolution core performs the convolution operation on the window using the values of the pixels of the window and the kernel coefficients.
Description
The following embodiments relate to an arithmetic apparatus and method, and more particularly to a convolution arithmetic apparatus and method.
Convolution is a fundamental operation in image signal processing and computer vision. Various basic but crucial signal processing tasks are based on convolution operations.
In image processing, two-dimensional convolution is widely used with different filter coefficients to achieve different results corresponding to the filter to which it is applied. The results from the convolution are often found in the initial stages of the vision algorithm. This vision algorithm can provide the necessary data for higher level vision tasks.
The convolution operation may be computed in the spatial domain or in the frequency domain. The two approaches inherently have different advantages and disadvantages, so each is preferable to the other under different circumstances.
For example, calculations within the frequency domain are less attractive for real-time systems with streaming video inputs. Such systems must convert the image data from the spatial domain to the frequency domain and, once the calculation has been carried out, convert the image data back to the spatial domain. Moreover, convolution within the frequency domain requires the kernel coefficients for a particular filter to be re-computed for every unique image size. Thus, convolution within the frequency domain is inherently inflexible with respect to image size. On the other hand, the kernel coefficients of the filter for convolution in the spatial domain remain the same irrespective of the image size.
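If the functions f and g are each a real one-dimensional function, then the convolution of f and g can be defined as in Equation (1) below, where the asterisk (*) represents the convolution operation.
[Equation 1]
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$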
For the convolution of f and g, one of the two functions is reversed and shifted. The convolution of f and g can be defined as the integral of the product of one function and the reversed and shifted version of the other. This integral effectively measures the amount of overlap between the two functions.
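The reverse-and-shift description above can be illustrated with a discrete one-dimensional analogue. The following is a minimal sketch, assuming finite sequences in place of continuous functions; the function name `convolve1d` is hypothetical and does not appear in the specification.

```python
def convolve1d(f, g):
    """Discrete 1-D convolution of two finite sequences (full output length)."""
    out = [0] * (len(f) + len(g) - 1)
    for n in range(len(out)):
        for m in range(len(f)):
            # g is indexed at (n - m): reversed relative to m and shifted by n
            if 0 <= n - m < len(g):
                out[n] += f[m] * g[n - m]
    return out


# Each output sample sums the overlap between f and the reversed, shifted g.
result = convolve1d([1, 2, 3], [1, 1])
```

Each output sample is the sum of products over the positions where the two sequences overlap, which is the discrete counterpart of the overlap measure described above.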
In the particular domain of computer vision, two-dimensional image data is represented as pixels on spatial coordinates. In this domain, the discrete two-dimensional version of Equation (1) above is defined with respect to spatial coordinates rather than time. The discrete two-dimensional convolution operation is defined as in Equation (2) below for each coordinate (u, v) of the convolved matrix.
[Equation 2]
$$(I * K)(u, v) = \sum_{x=0}^{N_x - 1} \sum_{y=0}^{N_y - 1} I(x, y)\, K(u - x, v - y)$$
Here, I represents the input image, K represents the kernel of the filter function, N_x represents the width of the image, and N_y represents the height of the image.
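A direct spatial-domain evaluation of the two-dimensional convolution might be sketched as follows. This is an illustration only: the function name `convolve2d`, the centering of the kernel on the current pixel, and the zero padding at the image border are assumptions, since the specification does not state how borders are handled.

```python
def convolve2d(image, kernel):
    """Direct 2-D convolution; output has the same size as the input image."""
    height, width = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    cy, cx = kh // 2, kw // 2          # kernel center (assumes odd kernel size)
    out = [[0] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            acc = 0
            for j in range(kh):
                for i in range(kw):
                    # Convolution flips the kernel: offset is (center - index)
                    y, x = v + cy - j, u + cx - i
                    if 0 <= y < height and 0 <= x < width:   # zero padding (assumption)
                        acc += image[y][x] * kernel[j][i]
            out[v][u] = acc
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
same = convolve2d(image, identity)   # an impulse kernel returns the image unchanged
```

The four nested loops make the cost visible: every output pixel needs one multiply-accumulate per kernel coefficient, which is why large frames become a bottleneck on a serial processor.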
The convolution of a large image frame is computationally intensive. Also, for real-time vision applications with high resolution images, the convolution of large image frames is constrained by strict performance and accuracy requirements, and can be a serious performance bottleneck. Rapid improvement in camera technology has led to a significant increase in image resolution, which has further increased the computational burden on the system.
There are a number of computer vision applications based on high definition streaming video feeds. Such computer vision applications include Advanced Driver Assistance System (ADAS) for automotive systems, object recognition and surveillance, and the like. Often, the image processing tasks executed for these vision systems include multiple convolution operations to manipulate the image data into more useful information. These multiple convolution operations are performed before any further high-level processing. Multiple convolution operations for large amounts of pixels of high quality video frames require large amounts of parallel computations and inevitably cause performance bottlenecks.
Programs and instructions based on general-purpose processors are essentially serial. Running algorithms that include multiple convolution operations with such programs and instructions has been found unsuitable for the computational intensity required to provide real-time performance for high resolution video.
To overcome these computational bottlenecks, other hardware platforms have emerged as attractive alternatives. These hardware platforms include Field Programmable Gate Arrays (FPGAs) and General-Purpose Graphics Processing Units (GPGPUs) capable of massively parallel computation. Both FPGAs and GPGPUs have advantages and disadvantages, which determine the preferred platform for a given application.
The embodiments described below may be related to FPGA acceleration. Also, in the embodiment described below, the convolution operation can be used for high resolution video frames for real time vision systems.
U.S. Patent No. 5,922,580, U.S. Patent Publication No. 2011-0138157 and Korean Patent Laid-Open No. 2001-0004946 disclose a method related to convolution calculation.
One embodiment may provide an apparatus and method for performing a convolution operation.
One embodiment may provide an apparatus and method for performing a convolution operation on one current pixel or window per clock cycle.
One embodiment may provide an apparatus and method for performing a convolution operation on a plurality of windows of an image.
In one aspect, a convolution arithmetic unit includes: a window buffer for providing values of pixels of a window of an image; and a convolution core for performing a convolution operation on the window using the values of the pixels of the window and kernel coefficients.
The convolution arithmetic unit may perform sliding window image processing on the image by performing the convolution operation in a predetermined order on a plurality of windows of the image.
The predetermined order may follow the order of the raster scan.
The window buffer may include a plurality of registers.
The plurality of registers may store values of pixels of the window and may provide values of the pixels of the window.
The plurality of registers may constitute a plurality of rows and a plurality of columns.
The number of the plurality of rows may be equal to the height of the kernel.
The number of the plurality of columns may be equal to the width of the kernel.
The last register of each row other than the last row of the plurality of rows may be connected to the input of a first-in-first-out (FIFO) buffer.
The output of the FIFO may be connected to the first register of the row following the row of that last register.
When a new pixel is input to the window buffer, the values of the plurality of registers may be propagated through the sequence of contiguous registers and the FIFO.
The new pixel may be input to the window buffer every clock cycle.
The window buffer may provide pixel values of the new window for each clock cycle.
The FIFO may store the values of pixels needed for windows to be processed later by sliding window image processing rather than the current window being processed.
The sum of the width of the plurality of rows and the length of the FIFO may be equal to the width of the image.
The length of the FIFO may be dynamically configured according to the width of the image processed in the convolution unit.
The maximum length of the FIFO that can be dynamically configured may be a value obtained by subtracting the width of the kernel from the maximum image width that can be processed by the convolution processor.
The window buffer may maintain the number of pixels entering the window buffer to record where the current pixel is in the image.
The current pixel may be the center of the window.
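The register rows and FIFOs described above can be modeled in software to see how one new pixel per clock cycle yields one new window. The following is a behavioral sketch only, not the hardware design: the class name `WindowBuffer`, its method names, the zero initial register contents, and the assumption of an odd kernel size k are all hypothetical.

```python
from collections import deque


class WindowBuffer:
    """Behavioral model: k rows of k registers chained through k-1 FIFOs."""

    def __init__(self, k, image_width):
        self.k = k
        self.width = image_width
        self.rows = [[0] * k for _ in range(k)]
        # Row width plus FIFO length equals the image width.
        self.fifos = [deque([0] * (image_width - k)) for _ in range(k - 1)]
        self.count = 0                      # number of pixels received so far

    def push(self, pixel):
        """Model one clock cycle: shift every register and FIFO by one."""
        carry = pixel
        for i in range(self.k):
            self.rows[i].insert(0, carry)   # value enters the first register
            carry = self.rows[i].pop()      # value leaves the last register
            if i < self.k - 1:
                self.fifos[i].appendleft(carry)
                carry = self.fifos[i].pop()  # FIFO output feeds the next row
        self.count += 1

    def window(self):
        """Register contents in image orientation (top-left pixel first)."""
        return [list(reversed(row)) for row in reversed(self.rows)]

    def center(self):
        """Image coordinates of the current pixel, or None if not yet valid."""
        row, col = divmod(self.count - 1, self.width)
        if row < self.k - 1 or col < self.k - 1:
            return None
        return (row - self.k // 2, col - self.k // 2)


buf = WindowBuffer(k=3, image_width=4)
windows = []
for value in range(16):                     # a 4x4 image streamed in raster order
    buf.push(value)
    if buf.center() is not None:
        windows.append((buf.center(), buf.window()))
```

In this model the pixel counter plays the role described above: it records where the newest pixel lies in the image, so windows that would wrap across a row boundary are not reported as valid.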
The convolution core may include: a plurality of processing elements (PEs); and an accumulation tree.
Each PE of the plurality of PEs may calculate a product of a value of a pixel provided by the window buffer and a kernel coefficient corresponding to the provided pixel.
The accumulation tree may generate the result of the convolution operation by accumulating the values computed by the plurality of PEs.
The plurality of PEs may correspond to the kernel coefficients, respectively.
The PE provided with the value of the pixel from the register of the i-th row and the j-th column of the window buffer among the plurality of PEs may calculate the product of the value of the pixel and the kernel coefficient of the i-th row and the j-th column.
Here, i may be an integer from 1 to k inclusive.
Likewise, j may be an integer from 1 to k inclusive.
Here, k may be the size of the kernel.
Some PEs of the plurality of PEs may each add a first product calculated by another PE to a second product calculated by the PE itself, and may output the sum of the first product and the second product.
These PEs may add the first product to the second product using the post-multiplication adder of the PE.
In another aspect, a convolution operation method includes: providing, by a window buffer, values of pixels of a window of an image; and performing, by a convolution core, a convolution operation on the window using the values of the pixels of the window and kernel coefficients.
In another aspect, a method for performing a convolution operation on a plurality of windows of an image includes: performing a convolution operation on a window using a window buffer and a convolution core, wherein the window buffer provides values of pixels of the window in the image and the convolution core performs the convolution operation on a current pixel using the values of the pixels of the window and kernel coefficients; and setting the window buffer such that the window buffer provides values of pixels of the next window following the window, by inputting a value of a new pixel of the image into the window buffer.
In addition, there is further provided another method, apparatus, system for implementing the invention and a computer readable recording medium for recording a computer program for executing the method.
An apparatus and method for performing a convolution operation are provided.
An apparatus and method for performing a convolution operation on one current pixel or window per clock cycle is provided.
An apparatus and method for performing a convolution operation on a plurality of windows of an image are provided.
FIG. 1 is a block diagram of a convolution arithmetic unit according to an embodiment.
FIG. 2 shows a plurality of windows of an image according to an example.
FIG. 3 illustrates a configuration of a window buffer according to an example.
FIG. 4 shows a configuration of a convolution core according to an example.
FIG. 5 is a flowchart of a convolution operation method according to an embodiment.
FIG. 6 is a flowchart of a method for performing convolution operations on a plurality of windows of an image according to an embodiment.
In the following, embodiments will be described in detail with reference to the accompanying drawings. It should be understood that the embodiments are different, but need not be mutually exclusive.
The terms used in the embodiments are to be interpreted based on their actual meaning and on the content of the specification as a whole, not merely on the names of the terms.
In the embodiments, a connection between one part and another part may be a direct connection or an indirect connection made via a further part between them. Like reference numerals in the drawings denote like elements.
FIG. 1 is a block diagram of a convolution arithmetic unit according to an embodiment.
The
The values of the pixels of the image may be provided to the
The
The
The
A convolution operation on a window may mean a convolution operation on the current pixel. The current pixel may be the center of the window. The terms "current pixel" and "reference pixel" may be used interchangeably.
In the following, each of the
Figure 2 shows a plurality of windows of an image according to an example.
In FIG. 2, a
In the
Convolution operations in the spatial domain may be window-based operations. In a window-based operation, kernel coefficients may be applied to each window of a plurality of windows in an image.
A window in the image may refer to a sub-region of the image; that is, the pixels of the window are the pixels of the sub-region. The size of the sub-region may be the same as the size of the kernel. For example, when the size of the kernel is 3x3, the size of the sub-region is also 3x3, and when the size of the kernel is 5x5, the size of the sub-region is also 5x5. The center of the sub-region or window may be the current pixel being processed.
In incoming video pixel streams, raster scan may be regarded as the standard format. Thus, according to the raster scan, the processing for a plurality of windows may proceed first horizontally and then vertically. That is, the windows of one row are processed from left to right, and when the processing for one row of windows is completed, the processing proceeds to the next row, from top to bottom. In the same manner as described above, the
The
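The raster-scan ordering of windows described in this section can be sketched in software as follows. This is a minimal illustration that assumes an odd kernel size k and visits only windows lying fully inside the image; the function name `raster_windows` is hypothetical.

```python
def raster_windows(image, k):
    """Yield ((row, col), window) for each k x k window in raster-scan order."""
    height, width = len(image), len(image[0])
    r = k // 2                                   # distance from center to window edge
    for row in range(r, height - r):             # top to bottom
        for col in range(r, width - r):          # left to right within a row
            window = [image[y][col - r:col + r + 1]
                      for y in range(row - r, row + r + 1)]
            yield (row, col), window


image = [[4 * y + x for x in range(4)] for y in range(4)]
order = [center for center, _ in raster_windows(image, 3)]
```

Here `(row, col)` is the current pixel at the center of each window, and the order of the yielded centers follows the raster scan.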
FIG. 3 illustrates a configuration of a window buffer according to an example.
When the size of the kernel is k x k pixels and the size of the image is n x m pixels, a subset of k^2 pixels of the image can be defined as the current window being processed. Here, k may be an integer of 2 or more, and each of n and m may be an integer of 2 or more. The center of the subset may be the current pixel.
In the following, it can be assumed that one pixel per clock is introduced into the
The hardware design of the
The
The plurality of registers of the
The plurality of registers of the
The number of rows of the plurality of registers may be equal to the height of the kernel. The number of the plurality of columns of the plurality of registers may be equal to the width of the kernel. That is to say, the dimensions of the plurality of registers may be the same as the size of the kernel.
In the remaining rows of the plurality of rows of the
That is to say, when the plurality of registers constitute k rows, the window buffer may contain k-1 FIFOs. The last register of the i-th row may be connected to the input of the i-th FIFO. The output of the i-th FIFO may be connected to the first register of the (i+1)-th row. Here, i may be an integer of 1 or more and k-1 or less.
When a new pixel is introduced into the input pixel stream of the
When a new pixel is input to the
The propagation may include at least one of the following processes i) through v). i) The value of the new pixel can be input to the first register of the first row among the plurality of rows constituted by the plurality of registers of the
New pixels of the input pixel stream may be input to the
That is to say, the FIFOs can store the values of the pixels needed for the windows to be processed later by the sliding window image processing, rather than the current window being processed.
The sum of the widths of the plurality of rows constituted by the plurality of registers of the
In addition, the length of the FIFO can be dynamically configured according to the width of the image processed in the
The control logic of the
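The relationship between the FIFO length, the kernel width, and the image width stated in this section can be written out as simple arithmetic. The sketch below is an illustration; the function names and the example widths (1920 and 4096 pixels) are assumptions chosen for the example, not values from the specification.

```python
def fifo_length(image_width, kernel_width, max_image_width):
    """FIFO length for a given image width: row width + FIFO length = image width."""
    if not kernel_width <= image_width <= max_image_width:
        raise ValueError("image width outside the supported range")
    return image_width - kernel_width


def max_fifo_length(max_image_width, kernel_width):
    """Largest FIFO length that ever needs to be provisioned."""
    return max_image_width - kernel_width


def buffered_pixels(kernel_size, image_width):
    """Pixels held at once: k rows of k registers plus k-1 FIFOs."""
    k = kernel_size
    return k * k + (k - 1) * (image_width - k)


length = fifo_length(image_width=1920, kernel_width=3, max_image_width=4096)
```

With these example numbers, the buffer holds k-1 full image rows plus k further pixels, which is the minimum span needed to cover one k x k window of a raster-scanned image.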
FIG. 4 shows a configuration of a convolution core according to an example.
As discussed above, in every clock cycle, a new window of valid pixels may be provided by the
In Fig. 4, the
The
PEs can be digital signal processing (DSP) PEs that are commonly found in FPGAs. The PE may comprise an internal register for storing the value of the pixel and an internal register for storing the result of the calculation.
In terms of computation, the value of each pixel in the window can be supplied to the PE. First, the value of the pixel supplied to the PE can be multiplied by the corresponding kernel coefficient. That is to say, each PE of a plurality of PEs may be provided with pixel values by the
The PE provided with the value of the pixel from the i-th row and the j-th column of the
Each of the products from all the pixels in the window can be provided to the accumulation tree from the PE. The products from all the pixels in the window may be accumulated together by the accumulation tree to produce values for the window being processed or the current pixel. That is, the accumulation tree may generate the result of the convolution operation by accumulating the values calculated by the plurality of PEs.
Values from adjacent PEs may be added before entering the accumulation tree to reduce the depth of the accumulation tree. That is, some PEs of the plurality of PEs may each add a first product calculated by another PE to a second product calculated by the PE itself, and may output the sum of the first and second products. For example, in FIG. 4, an odd-numbered PE among the plurality of PEs may add the product calculated by the PE adjacent to its right to the product calculated by itself.
Adding values from adjacent PEs can be done by the post-multiplication adder generally available in an FPGA. That is, some PEs of the plurality of PEs may add the first product computed by another PE to the second product computed by the PE itself using the post-multiplication adder of the PE.
The sum generated by the accumulation tree may be the desired convolution result for the window or the current pixel.
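The data flow through the PEs and the accumulation tree can be modeled in software as follows. This is a behavioral sketch under stated assumptions, not the hardware design: the function names are hypothetical, each PE multiplies the pixel at row i, column j by the coefficient at row i, column j as the text describes, and an unpaired value is simply passed to the next tree level.

```python
def pe_products(window, kernel):
    """One PE per kernel coefficient: pixel value times matching coefficient."""
    return [p * c
            for pixel_row, coeff_row in zip(window, kernel)
            for p, c in zip(pixel_row, coeff_row)]


def pair_adjacent(products):
    """Odd-numbered PEs (1st, 3rd, ...) add the product of the PE to their right."""
    paired = []
    for i in range(0, len(products), 2):
        right = products[i + 1] if i + 1 < len(products) else 0
        paired.append(products[i] + right)
    return paired


def accumulation_tree(values):
    """Sum with two-input adders level by level; returns (total, levels)."""
    levels = 0
    while len(values) > 1:
        next_level = [values[i] + values[i + 1]
                      for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:                 # unpaired value passes through
            next_level.append(values[-1])
        values = next_level
        levels += 1
    return values[0], levels


def convolution_core(window, kernel):
    """Result for one window: products, adjacent pairing, then the tree."""
    total, _ = accumulation_tree(pair_adjacent(pe_products(window, kernel)))
    return total


window = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
value = convolution_core(window, ones)
```

For a 3x3 kernel in this model, pairing adjacent PEs reduces the tree input from nine values to five, so the tree needs three levels instead of four.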
FIG. 5 is a flowchart of a convolution operation method according to an embodiment.
In
In
The current pixel may be the center of the window.
The above description with reference to FIGS. 1 to 4 can also be applied to the present embodiment. Duplicate descriptions are omitted below.
FIG. 6 is a flowchart of a method for performing convolution operations on a plurality of windows of an image according to an embodiment.
In
In
For example, when the
In
After
The above description with reference to FIGS. 1 to 5 can also be applied to the present embodiment. Duplicate descriptions are omitted below.
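The repeated steps of this method (input a new pixel, obtain the next window, perform the convolution operation for the current pixel) can be sketched as a streaming loop. The following is a software illustration under assumptions: the generator name `stream_convolve` is hypothetical, the register rows and FIFOs are flattened into one delay line, the kernel size k is odd, and coefficient (i, j) is applied to the window pixel at (i, j) as in the description of the PEs.

```python
from collections import deque


def stream_convolve(pixels, width, kernel):
    """Consume one pixel per step; yield ((row, col), result) per valid window."""
    k = len(kernel)
    depth = (k - 1) * width + k            # k-1 image rows plus k pixels
    line = deque([0] * depth, maxlen=depth)
    for n, pixel in enumerate(pixels):
        line.appendleft(pixel)             # oldest value drops off the far end
        row, col = divmod(n, width)        # position of the newest pixel
        if row < k - 1 or col < k - 1:
            continue                       # window not yet fully inside the image
        acc = 0
        for i in range(k):
            for j in range(k):
                # line[0] is the newest pixel, i.e. the window's bottom-right corner
                acc += line[(k - 1 - i) * width + (k - 1 - j)] * kernel[i][j]
        yield (row - k // 2, col - k // 2), acc


center_only = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
results = list(stream_convolve(range(16), width=4, kernel=center_only))
```

With a kernel that is 1 only at its center, each result equals the value of the current pixel, which makes it easy to check that the window tracks the raster scan.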
The following describes the DSP required for the
The PE may be a DSP PE, as described above. The number of DSP PEs, D_g, required for the plurality of PEs of the convolution core 120 (excluding the accumulation tree) may be expressed by
[Equation 3]
Also, the accumulation tree can be configured using DSP PEs. The accumulation tree can use DSP PEs as two-input adders to accumulate a given number of elements. Here, the DSP PEs can be arranged in log_2 n levels. The input values of the accumulation tree enter the input ports of the DSP PEs of the accumulation tree and pass through the input registers of the DSP PEs.
When the size of the kernel is k, the number of DSP PEs, D_a, required for the accumulation tree of the
[Equation 4]
The total number of DSP PEs, D, used for the convolution core of the
[Equation 5]
Table 1 below shows the use of DSP PE for various kernel sizes.
In the following, the configurability of the
Table 2 below may illustrate various runtime parameters and compile time parameters of the
(Runtime or compile time)
The number of PEs in the
As described in Table 2, the design of the
The design of the
The method according to an embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be those specially designed and configured for the embodiments or may be known and available to those skilled in the art of computer software. Examples of computer-readable media include magnetic media such as hard disks, floppy disks and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, flash memory, and the like. Examples of program instructions include machine language code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.
While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. For example, appropriate results may be achieved even if the described techniques are performed in a different order than the described methods, and/or if components of the described systems, structures, devices, circuits, and the like are combined in a different form than described, or are replaced or substituted by other components or equivalents.
Therefore, other implementations, other embodiments, and equivalents to the claims are also within the scope of the following claims.
100: convolution arithmetic unit
110: Window buffer
120: Convolution core
Claims (20)
A convolution core for performing a convolution operation on the window using values of pixels of the window and kernel coefficients;
And a convolution arithmetic unit.
Wherein the convolution arithmetic unit performs the sliding window image processing on the image by performing the convolution operation on a plurality of windows of the image in a predetermined order.
Wherein the predetermined order is a raster scan order.
Wherein the window buffer comprises a plurality of registers,
Wherein the plurality of registers store values of pixels of the window and provide values of pixels of the window.
The plurality of registers constituting a plurality of rows and a plurality of columns,
Wherein the number of rows is equal to the height of the kernel,
Wherein the number of the plurality of columns is equal to the width of the kernel.
The last register of each row other than the last row of the plurality of rows is connected to an input of a first-in-first-out (FIFO) buffer, and the output of the FIFO is connected to the first register of the row following the row of that last register.
And when a new pixel is input to the window buffer, the values of the plurality of registers are propagated through the columns of consecutive registers and through the FIFO.
The new pixel is input to the window buffer every clock cycle,
Wherein the window buffer provides pixel values of a new window for each clock cycle.
Wherein the FIFO stores values of pixels required for windows to be processed later by a sliding window image processing rather than the current window being processed.
Wherein the sum of the widths of the plurality of rows and the length of the FIFO is equal to the width of the image.
Wherein the length of the FIFO is dynamically configured according to a width of the image processed in the convolution arithmetic unit.
Wherein the maximum length of the FIFO that can be dynamically configured is a value obtained by subtracting the width of the kernel from a maximum image width that can be processed by at least the convolution arithmetic unit.
Wherein the window buffer maintains the number of pixels entering the window buffer to record where the current pixel is in the image, and wherein the current pixel is the center of the window.
The convolution core includes:
A plurality of processing elements (PEs); and
An accumulation tree,
Wherein each PE of the plurality of PEs calculates a product of a value of a pixel provided by the window buffer and a kernel coefficient corresponding to the provided pixel,
Wherein the accumulation tree generates a result of the convolution operation by accumulating values computed by the plurality of PEs.
Wherein the plurality of PEs correspond to the kernel coefficients, respectively.
Wherein the PE provided with the value of the pixel from the i-th row and the j-th column of the window buffer among the plurality of PEs calculates the product of the pixel value and the kernel coefficient of the i-th row and the j-th column,
i is an integer of 1 or more and k or less,
j is an integer of 1 or more and k or less, and
k is the size of the kernel.
Wherein some PEs of the plurality of PEs each add a first product computed by another PE to a second product computed by the PE itself and output the sum of the first product and the second product.
Wherein those PEs add the first product to the second product using the post-multiplication adder of the PE.
Wherein the convolution core performs a convolution operation on the window using values of pixels of the window and kernel coefficients
Performing a convolution operation on a window using a window buffer and a convolution core, the window buffer providing values of pixels of the window in an image, the convolution core using values of pixels of the window and kernel coefficients to perform a convolution operation on the current pixel; and
Setting the window buffer such that the window buffer provides values of pixels of the next window following the window by inputting a value of a new pixel of the image into the window buffer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020150067113A KR20160133924A (en) | 2015-05-14 | 2015-05-14 | Apparatus and method for convolution operation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020150067113A KR20160133924A (en) | 2015-05-14 | 2015-05-14 | Apparatus and method for convolution operation |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20160133924A true KR20160133924A (en) | 2016-11-23 |
Family
ID=57541773
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020150067113A KR20160133924A (en) | 2015-05-14 | 2015-05-14 | Apparatus and method for convolution operation |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20160133924A (en) |
-
2015
- 2015-05-14 KR KR1020150067113A patent/KR20160133924A/en unknown
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20190052587A (en) * | 2017-11-08 | 2019-05-16 | 삼성전자주식회사 | Neural network device and operation method of the same |
WO2019216513A1 (en) * | 2018-05-10 | 2019-11-14 | 서울대학교산학협력단 | Row-by-row calculation neural processor and data processing method using same |
CN113792868A (en) * | 2021-09-14 | 2021-12-14 | 绍兴埃瓦科技有限公司 | Neural network computing module, method and communication device |
CN113792868B (en) * | 2021-09-14 | 2024-03-29 | 绍兴埃瓦科技有限公司 | Neural network computing module, method and communication equipment |