KR20160133924A

KR20160133924A - Apparatus and method for convolution operation

Info

Publication number: KR20160133924A
Application number: KR1020150067113A
Authority: KR
Inventors: 조용철
Original assignee: 한국전자통신연구원
Priority date: 2015-05-14
Filing date: 2015-05-14
Publication date: 2016-11-23

Abstract

An apparatus and method for convolution operation are provided. The convolution arithmetic unit includes a window buffer and a convolution core. The window buffer provides the values of the pixels of a window of an image, and the convolution core performs a convolution operation on the window using those pixel values and kernel coefficients.

Description

APPARATUS AND METHOD FOR CONVOLUTION OPERATION

The following embodiments relate to an arithmetic apparatus and method, and more particularly to a convolution arithmetic apparatus and method.

Convolution is a fundamental operation in image signal processing and computer vision. Various basic but crucial signal processing tasks are based on convolution operations.

In image processing, two-dimensional convolution is widely used with different filter coefficients to achieve different results corresponding to the filter to which it is applied. The results from the convolution are often found in the initial stages of the vision algorithm. This vision algorithm can provide the necessary data for higher level vision tasks.

The convolution operation may be computed in the spatial domain or in the frequency domain. The two approaches have inherently different advantages and disadvantages, so each is preferable to the other under different circumstances.

For example, computation in the frequency domain is less attractive for real-time systems with streaming video inputs. Such systems must convert the image data from the spatial domain to the frequency domain and, once the computation is done, convert the result back to the spatial domain. Moreover, convolution in the frequency domain requires the kernel coefficients of a particular filter to be recomputed for every distinct image size, so it is inherently inflexible with respect to image size. By contrast, the kernel coefficients of a filter for convolution in the spatial domain remain the same irrespective of the image size.

If f and g are each real one-dimensional functions, the convolution of f and g can be defined as in Equation (1) below. Here, the asterisk (*) denotes the convolution operation.

[Equation 1]

For the convolution of f and g, one of the two functions is reversed and shifted. The convolution can then be defined as the integral of the product of one function and the reversed, shifted copy of the other. This integral measures the amount of overlap between the two functions.

In the particular domain of computer vision, two-dimensional image data is represented as pixels at spatial coordinates. In this domain, the discrete two-dimensional version of Equation (1) is defined over spatial coordinates rather than over time. The convolution operation of this discrete two-dimensional version is defined as in Equation (2) below for each coordinate (u, v) of the convolved matrix.
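For reference, the standard textbook definition that matches this description (one function reversed and shifted, then the product integrated) can be written as below. The equation image is not reproduced in this text, so the variable names t and \tau are assumptions.

```latex
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau
```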

[Equation 2]

Here, I represents the input image, K represents the kernel of the filter function, N_x represents the width of the image, and N_y represents the height of the image.

The convolution of a large image frame is computationally intensive. Also, for real-time vision applications with high-resolution images, the convolution of large image frames is constrained by strict performance and accuracy requirements, and it can become a serious performance bottleneck. Rapid improvement in camera technology has led to a significant increase in image resolution, which has further increased the computational burden on the system.
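Equation (2) is not reproduced in this text. A common form consistent with the symbols defined above is O(u, v) = sum over x in [0, N_x) and y in [0, N_y) of I(x, y) K(u - x, v - y). The following is a minimal Python sketch of that sum, assuming an odd kernel size, a kernel indexed around its center, and pixels outside the image treated as zero. The function name `conv2d_direct` and the border handling are assumptions, not taken from the specification.

```python
def conv2d_direct(image, kernel):
    """Direct 2D convolution; output has the same size as the image.

    image  : list of N_y rows, each a list of N_x pixel values
    kernel : list of k rows of k coefficients, k odd, centered at (k//2, k//2)
    Pixels outside the image are treated as zero (an assumption).
    """
    n_y, n_x = len(image), len(image[0])
    k = len(kernel)
    r = k // 2
    out = [[0] * n_x for _ in range(n_y)]
    for v in range(n_y):
        for u in range(n_x):
            acc = 0
            # Iterate over kernel offsets (i, j); this is the same sum as
            # iterating over all image pixels, with the zero terms skipped.
            for j in range(-r, r + 1):
                for i in range(-r, r + 1):
                    x, y = u - i, v - j  # the kernel is reversed and shifted
                    if 0 <= x < n_x and 0 <= y < n_y:
                        acc += image[y][x] * kernel[j + r][i + r]
            out[v][u] = acc
    return out


if __name__ == "__main__":
    img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
    print(conv2d_direct(img, identity) == img)
```

A kernel whose only nonzero tap lies to the right of the center shifts the image to the right, which is a quick way to see the reversal that distinguishes convolution from correlation.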

There are a number of computer vision applications based on high definition streaming video feeds. Such computer vision applications include Advanced Driver Assistance System (ADAS) for automotive systems, object recognition and surveillance, and the like. Often, the image processing tasks executed for these vision systems include multiple convolution operations to manipulate the image data into more useful information. These multiple convolution operations are performed before any further high-level processing. Multiple convolution operations for large amounts of pixels of high quality video frames require large amounts of parallel computations and inevitably cause performance bottlenecks.

Programs built from general-purpose processor instructions are essentially serial. Running algorithms that include multiple convolution operations as such programs has been found unsuitable for the computational intensity required for real-time performance on high-resolution video.

To overcome these computational bottlenecks, other hardware platforms have emerged as attractive alternatives. These hardware platforms include field-programmable gate arrays (FPGAs) and general-purpose graphics processing units (GPGPUs) capable of massively parallel computation. Both FPGAs and GPGPUs have advantages and disadvantages, which determine the preferred platform for a given application.

The embodiments described below may be related to FPGA acceleration. Also, in the embodiment described below, the convolution operation can be used for high resolution video frames for real time vision systems.

U.S. Patent No. 5,922,580, U.S. Patent Publication No. 2011-0138157 and Korean Patent Laid-Open No. 2001-0004946 disclose a method related to convolution calculation.

One embodiment may provide an apparatus and method for performing a convolution operation.

One embodiment may provide an apparatus and method for performing a convolution operation on one current pixel or window per clock cycle.

One embodiment may provide an apparatus and method for performing a convolution operation on a plurality of windows of an image.

In one aspect, there is provided a convolution arithmetic unit comprising: a window buffer for providing values of pixels of a window of an image; and a convolution core for performing a convolution operation on the window using the values of the pixels of the window and kernel coefficients.

The convolution arithmetic unit may perform sliding window image processing on the image by performing the convolution operation in a predetermined order on a plurality of windows of the image.

The predetermined order may follow the order of the raster scan.

The window buffer may include a plurality of registers.

The plurality of registers may store values of pixels of the window and may provide values of the pixels of the window.

The plurality of registers may constitute a plurality of rows and a plurality of columns.

The number of the plurality of rows may be equal to the height of the kernel.

The number of the plurality of columns may be equal to the width of the kernel.

The last register of each of the remaining rows, excluding the last row of the plurality of rows, may be connected to the input of a first-in first-out buffer (FIFO).

The output of the FIFO may be connected to the first register of the row that follows the row of that last register.

When a new pixel is input to the window buffer, the values of the plurality of registers may be propagated through the sequence of contiguous registers and the FIFO.

The new pixel may be input to the window buffer every clock cycle.

The window buffer may provide pixel values of the new window for each clock cycle.

The FIFO may store the values of pixels needed for windows to be processed later by sliding window image processing rather than the current window being processed.

The sum of the width of one row of the plurality of rows and the length of one FIFO may be equal to the width of the image.

The length of the FIFO may be dynamically configured according to the width of the image processed by the convolution arithmetic unit.

The maximum length to which the FIFO can be dynamically configured may be the value obtained by subtracting the width of the kernel from the maximum image width that can be processed by the convolution arithmetic unit.

The window buffer may keep a count of the pixels that have entered the window buffer in order to track where the current pixel is in the image.

The current pixel may be the center of the window.

The convolution core may comprise a plurality of processing elements (PEs) and an accumulation tree.

Each PE of the plurality of PEs may calculate a product of a value of a pixel provided by the window buffer and a kernel coefficient corresponding to the provided pixel.

The accumulation tree may generate the result of the convolution operation by accumulating the values computed by the plurality of PEs.

The plurality of PEs may correspond to the kernel coefficients, respectively.

The PE provided with the value of the pixel from the register of the i-th row and the j-th column of the window buffer among the plurality of PEs may calculate the product of the value of the pixel and the kernel coefficient of the i-th row and the j-th column.

Here, i may be an integer of 1 or more and k or less.

Likewise, j may be an integer of 1 or more and k or less.

Here, k may be the size of the kernel.

Some PEs of the plurality of PEs may each add a first product calculated by another PE to a second product calculated by the PE itself, and may output the sum of the first product and the second product.

Such a PE may add the first product to the second product using the post-multiplication adder of the PE.

In another aspect, there is provided a convolution operation method comprising: providing, by a window buffer, values of pixels of a window of an image; and performing, by a convolution core, a convolution operation on the window using the values of the pixels of the window and kernel coefficients.

In yet another aspect, there is provided a method for performing a convolution operation on a plurality of windows of an image, the method comprising: performing a convolution operation on a window using a window buffer and a convolution core, wherein the convolution core performs the convolution operation on a current pixel using values of pixels of the window and kernel coefficients; and setting the window buffer, by inputting a value of a new pixel of the image into the window buffer, such that the window buffer provides values of pixels of the window that follows the window.

In addition, there is further provided another method, apparatus, system for implementing the invention and a computer readable recording medium for recording a computer program for executing the method.

An apparatus and method for performing a convolution operation are provided.

An apparatus and method for performing a convolution operation on one current pixel or window per clock cycle is provided.

An apparatus and method for performing a convolution operation on a plurality of windows of an image are provided.

FIG. 1 is a block diagram of a convolution arithmetic unit according to an embodiment.
FIG. 2 shows a plurality of windows of an image according to an example.
FIG. 3 illustrates a configuration of a window buffer according to an example.
FIG. 4 shows a configuration of a convolution core according to an example.
FIG. 5 is a flowchart of a convolution operation method according to an embodiment.
FIG. 6 is a flowchart of a method for performing convolution operations on a plurality of windows of an image according to an embodiment.

In the following, embodiments will be described in detail with reference to the accompanying drawings. It should be understood that the embodiments are different, but need not be mutually exclusive.

The terms used in the embodiments should be interpreted based on their actual meaning and on the content of the specification as a whole, rather than on their names alone.

In the embodiments, a connection between a particular portion and another portion may include, in addition to a direct connection between the two, an indirect connection made via yet another portion between them. Like reference numerals in the drawings denote like elements.

FIG. 1 is a block diagram of a convolution arithmetic unit according to an embodiment.

The convolution arithmetic unit 100 may include a window buffer 110 and a convolution core 120.

The values of the pixels of the image may be provided to the window buffer 110. For example, the values of the pixels of the image may be sequentially input to the window buffer 110, one per clock cycle. Within a row of the image, the values of the pixels may be input in order from the leftmost pixel to the rightmost pixel. Once all the values of the pixels of one row have been input, the values of the pixels of the next row may be input. The rows of the image may be input in order from the top row to the bottom row.

The window buffer 110 may provide values of the pixels of the window of the image.

The convolution core 120 may have kernel coefficients. For example, the kernel coefficients may be input to the convolution core 120.

The convolution core 120 may perform a convolution operation on the window using the kernel coefficients and the values of the pixels of the window provided by the window buffer 110. The convolution core 120 may output the result of the convolution operation for the window.

A convolution operation on a window may mean a convolution operation on the current pixel. The current pixel may be the center of the window. Also, the current pixel and the reference pixel can be used in the same sense.

In the following, each of the window buffer 110 and the convolution core 120 according to one embodiment will be described in detail.

FIG. 2 shows a plurality of windows of an image according to an example.

In FIG. 2, a first window 211, a second window 212, a third window 213, and a first window 221 of a second row are shown as a plurality of windows of the image 200.

In the image 200, one rectangle may represent a pixel. According to Fig. 2, each window of a plurality of windows is shown to have a size of 3x3.

Convolution operations in the spatial domain may be window-based operations. In a window-based operation, kernel coefficients may be applied to each window of a plurality of windows in an image.

A window in the image may refer to a sub-region of the image. That is to say, the pixels of the window may be sub-area pixels. The dimensions of the sub-region may be the same as the size of the kernel. That is to say, the size of the sub-area and the size of the kernel may be the same. For example, when the size of the kernel is 3x3, the size of the sub-region may also be 3x3. Also, when the size of the kernel is 5x5, the size of the sub-region may also be 5x5. The center of the sub-region or window may be the current pixel that is the subject of processing.

For incoming video pixel streams, raster scan may be regarded as the standard format. Thus, following the raster scan, the processing of the plurality of windows may proceed first horizontally and then vertically. That is to say, the windows may be processed from left to right first. When the processing of one row of windows is completed, processing may continue with the next row of windows, in order from top to bottom. In this manner, the convolution arithmetic unit 100 can process the windows sequentially in the order in which they are formed.

The convolution arithmetic unit 100 may perform sliding window image processing on the image 200 by performing a convolution operation on the plurality of windows of the image 200 in a predetermined order. The predetermined order may follow the order of the raster scan.
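The raster-scan visiting order of the windows can be sketched as follows. This is a minimal illustration, assuming that only windows lying fully inside the image are visited (the specification does not state how image borders are handled); the function name `raster_windows` is hypothetical.

```python
def raster_windows(width, height, k):
    """Yield the (top, left) corner of each k x k window in raster-scan order.

    Windows are visited left to right within a row of windows, and rows of
    windows are visited top to bottom. Only windows fully inside the image
    are produced (an assumption).
    """
    for top in range(height - k + 1):
        for left in range(width - k + 1):
            yield top, left


if __name__ == "__main__":
    k = 3
    for top, left in raster_windows(width=5, height=4, k=k):
        center = (top + k // 2, left + k // 2)  # the current pixel
        print("window at", (top, left), "current pixel", center)
```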

FIG. 3 illustrates a configuration of a window buffer according to an example.

When the size of the kernel is k x k pixels and the size of the image is n x m pixels, a sub-set of k² pixels of the image may be defined as the current window being processed. Here, k may be an integer of 2 or more, and each of n and m may be an integer of 2 or more. The center of the sub-set may be the current pixel.

In the following, it can be assumed that one pixel per clock is introduced into the window buffer 110 by the input pixel stream.

The hardware design of the window buffer 110 shown in FIG. 3 may facilitate the sliding window image processing described above. FIG. 3 illustrates an exemplary hardware design of the window buffer 110 when the size of the kernel is k.

The window buffer 110 may include a plurality of registers and a plurality of first in first outs (FIFOs).

The plurality of registers of the window buffer 110 may store the values of the pixels of the window and may provide values of the pixels of the window. Each register of a plurality of registers may store the value of one pixel and may provide a value of the pixel. That is, each register may be a pixel register.

The plurality of registers of the window buffer 110 may constitute a plurality of rows and a plurality of columns. In FIG. 3, P_{i,j} may represent the value of the pixel stored by the register in row i and column j. That is to say, P_{i,j} may be the value of the pixel in the i-th row and j-th column of the window. Each of i and j may be an integer of 1 or more and k or less. In FIG. 3, some registers and pixel values are omitted.

The number of rows of the plurality of registers may be equal to the height of the kernel. The number of the plurality of columns of the plurality of registers may be equal to the width of the kernel. That is to say, the dimensions of the plurality of registers may be the same as the size of the kernel.

For each of the remaining rows of the window buffer 110, excluding the last row, the last register of the row may be connected to the input of a FIFO. The output of that FIFO may be connected to the first register of the next row. In FIG. 3, some FIFOs are omitted.

That is to say, when a plurality of registers constitute k rows, the window buffer may contain k-1 FIFOs. The last register of the i-th row may be connected to the input of the i-th FIFO. The output of the i-th FIFO can be connected to the register at the beginning of the (i + 1) -th row. Here, i may be an integer of 1 or more and k-1 or less.

When a new pixel arrives on the input pixel stream of the window buffer 110, the pixels may be propagated through the chains of cascaded registers and the FIFOs to produce the effect of the sliding window image processing described above. Here, a chain of cascaded registers may be the registers constituting one row of the plurality of rows formed by the plurality of registers of the window buffer 110.

That is, when a new pixel is input to the window buffer 110 by the input pixel stream, the values of the plurality of registers may be propagated through the chains of cascaded registers and the FIFOs.

The propagation may include at least one of the following processes i) through v).
i) The value of the new pixel may be input to the first register of the first row among the plurality of rows formed by the plurality of registers of the window buffer 110.
ii) In each row of the plurality of rows, the value of each register other than the last register may be moved to the register on its right.
iii) In each of the remaining rows, excluding the last row, the value of the last register of the row may be input to the FIFO connected to that last register.
iv) As a new value is input to a FIFO, the value that entered the FIFO earliest may be output. The value output from the FIFO may be input to the first register of the row connected to the output of the FIFO.
v) The value of the last register of the last row among the plurality of rows may be discarded.
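Processes i) through v) can be modeled in software as follows. This is a behavioral sketch only, not the hardware design of FIG. 3: the class name `WindowBuffer`, the zero initial fill, and the helper `window()` are assumptions made for illustration. Each FIFO is given a length of the image width minus k, as described for the FIFO length elsewhere in this specification.

```python
from collections import deque


class WindowBuffer:
    """Behavioral model: k rows of k registers linked by k-1 FIFOs."""

    def __init__(self, k, image_width, fill=0):
        if image_width < k:
            raise ValueError("image width must be at least the kernel size")
        self.k = k
        # rows[r][0] is the first register of row r, rows[r][k-1] the last.
        self.rows = [deque([fill] * k) for _ in range(k)]
        # One FIFO between consecutive rows, each image_width - k deep.
        self.fifos = [deque([fill] * (image_width - k)) for _ in range(k - 1)]

    def push(self, pixel):
        """Advance one clock cycle with a new input pixel."""
        carry = pixel
        for r in range(self.k):
            # i) / iv): a value enters the first register of the row;
            # ii): the other values of the row shift one register right.
            self.rows[r].appendleft(carry)
            last = self.rows[r].pop()  # value leaving the last register
            if r < self.k - 1:
                self.fifos[r].append(last)        # iii) into the FIFO
                carry = self.fifos[r].popleft()   # iv) oldest value out
            # v) for the last row, the value is simply discarded

    def window(self):
        """Registers rearranged into image orientation (top-left first)."""
        return [list(reversed(row)) for row in reversed(self.rows)]


if __name__ == "__main__":
    buf = WindowBuffer(k=3, image_width=5)
    for value in range(13):  # pixels 0..12 of a 5-pixel-wide image
        buf.push(value)
    print(buf.window())
```

In this model the newest pixel always sits in the first register of the first row, so the register array holds the window mirrored in both directions relative to the image; `window()` undoes that mirroring for display.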

New pixels of the input pixel stream may be input to the window buffer 110 every clock cycle and through propagation of the plurality of register values the window buffer 110 may provide pixel values of the new window per clock cycle.

That is to say, the FIFOs can store the values of the pixels needed for the windows to be processed later by the sliding window image processing, rather than the current window being processed.

The sum of the width of one row of the plurality of rows formed by the registers of the window buffer 110 and the length of one FIFO may be equal to the width of the image processed by the convolution arithmetic unit 100.

In addition, the length of each FIFO can be dynamically configured according to the width of the image processed by the convolution arithmetic unit 100. The maximum length of the dynamically configurable FIFOs may be at least the maximum image width that can be processed by the convolution arithmetic unit 100 minus the width of the kernel.

The control logic of the window buffer 110 may keep a count of the pixels that have entered the window buffer 110 in order to track where the current pixel is in the image. When the convolution arithmetic unit 100 processes a plurality of frames as images, the control logic may keep this count for each frame of the plurality of frames. The control logic may be configurable at run time to allow the image size to change between frames.
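As a worked example of this sizing rule, the sketch below computes the FIFO length and the total number of buffered pixels. The function names and the 1920-pixel width are illustrative assumptions, not values from the specification.

```python
def fifo_length(image_width, k):
    """Length of one FIFO so that (row width k) + (FIFO length) = image width."""
    if image_width < k:
        raise ValueError("image width must be at least the kernel size")
    return image_width - k


def buffered_pixels(image_width, k):
    """Pixels held by k*k registers plus k-1 FIFOs."""
    return k * k + (k - 1) * fifo_length(image_width, k)


if __name__ == "__main__":
    width, k = 1920, 5  # hypothetical frame width and kernel size
    print("FIFO length:", fifo_length(width, k))
    print("buffered pixels:", buffered_pixels(width, k))
```

The total equals (k - 1) x width + k, that is, k - 1 full image lines plus k pixels of the newest line.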
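One way such a pixel count could be turned into a position is sketched below. It assumes an odd kernel size, a count that restarts at each frame, and a current pixel defined only once the window lies fully inside the image; the function names are hypothetical.

```python
def input_position(count, width):
    """(row, col) of the most recently received pixel.

    count is the number of pixels received so far in this frame (>= 1).
    """
    return divmod(count - 1, width)


def current_pixel(count, width, k):
    """(row, col) of the window center, or None if no full window exists yet.

    The newest pixel is the bottom-right corner of the window, so the
    center trails it by k // 2 rows and k // 2 columns.
    """
    row, col = input_position(count, width)
    if row < k - 1 or col < k - 1:
        return None
    half = k // 2
    return row - half, col - half


if __name__ == "__main__":
    for count in (5, 11, 13, 14):
        print(count, current_pixel(count, width=5, k=3))
```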

FIG. 4 shows a configuration of a convolution core according to an example.

As discussed above, in every clock cycle a new window of valid pixels may be provided by the window buffer 110. The convolution core 120 may access the registers holding the values of the pixels of the current window. The role of the convolution core 120 may be to perform the necessary calculations for the current window.

In Fig. 4, the window buffer 110 allows the convolution core 120 to be provided with the pixel values of the pixels of the window.

The convolution core 120 may include a plurality of processing elements (PEs) and an accumulation tree. The number of PEs may be k². That is to say, the plurality of PEs may correspond to the plurality of kernel coefficients, respectively.

PEs can be digital signal processing (DSP) PEs that are commonly found in FPGAs. The PE may comprise an internal register for storing the value of the pixel and an internal register for storing the result of the calculation.

In terms of computation, the value of each pixel in the window may be supplied to a PE. First, the value of the pixel supplied to the PE may be multiplied by the corresponding kernel coefficient. That is to say, each PE of the plurality of PEs may be provided with a pixel value by the window buffer 110, and may calculate the product of that pixel value and the kernel coefficient corresponding to the provided pixel. In FIG. 4, the PEs processing P_{1,1}, P_{1,2}, P_{k,k-1} and P_{k,k} are shown, and the remaining PEs are omitted. P_{i,j} may represent the value of the pixel stored in the register of the i-th row and the j-th column. That is to say, P_{i,j} may be the value of the pixel in the i-th row and j-th column of the window. Also, C_{i,j} may represent the kernel coefficient of row i and column j. Each of i and j may be an integer of 1 or more and k or less.

The PE provided with the value of the pixel from the i-th row and the j-th column of the window buffer 110 among the plurality of PEs can calculate the product of the value of the pixel and the kernel coefficient of the i-th row and the j-th column. Here, i may be an integer of 1 or more and k or less. j may be an integer equal to or greater than 1 and equal to or less than k. k may be the size of the kernel.

Each of the products from all the pixels in the window can be provided to the accumulation tree from the PE. The products from all the pixels in the window may be accumulated together by the accumulation tree to produce values for the window being processed or the current pixel. That is, the accumulation tree may generate the result of the convolution operation by accumulating the values calculated by the plurality of PEs.
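The multiply-then-accumulate structure described above can be sketched as follows. This is a software illustration under assumed names (`pe_products`, `accumulation_tree`); it models the tree as levels of two-input adders and reports how many levels were needed.

```python
def pe_products(window, coeffs):
    """One product per PE: P[i][j] * C[i][j] for every register position."""
    return [p * c
            for p_row, c_row in zip(window, coeffs)
            for p, c in zip(p_row, c_row)]


def accumulation_tree(values):
    """Sum the values with levels of two-input adders.

    Returns (total, number_of_levels).
    """
    level = list(values)
    if not level:
        raise ValueError("the accumulation tree needs at least one input")
    levels = 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:        # an unpaired value passes to the next level
            nxt.append(level[-1])
        level = nxt
        levels += 1
    return level[0], levels


if __name__ == "__main__":
    window = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    print(accumulation_tree(pe_products(window, ones)))
```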

Values from adjacent PEs may be added before entering the accumulation tree to reduce the depth of the accumulation tree. That is to say, some PEs of the plurality of PEs may each add a first product calculated by another PE to a second product calculated by the PE itself, and may output the sum of the first and second products. For example, in FIG. 4, each odd-numbered PE among the plurality of PEs may add the product calculated by the PE adjacent to its right to the product calculated by itself.

Adding values from adjacent PEs can be done by the post-multiplication adder that is generally available in an FPGA. That is, some PEs of the plurality of PEs may each add the first product computed by another PE to the second product computed by the PE itself using the post-multiplication adder of the PE.

The sum generated by the accumulation tree may be the desired convolution result for the window or the current pixel.
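The effect of this pairing on tree depth can be illustrated with a short sketch. It assumes a tree of two-input adders and uses hypothetical helper names; it only counts levels and does not model DSP resource usage.

```python
def tree_levels(n_inputs):
    """Levels of two-input adders needed to reduce n_inputs values to one."""
    if n_inputs < 1:
        raise ValueError("need at least one input")
    return (n_inputs - 1).bit_length()  # equals ceil(log2(n_inputs))


def paired_pe_outputs(products):
    """Each odd-numbered PE (1st, 3rd, ...) adds its right neighbor's product.

    With an odd number of products, the last PE outputs its own product.
    """
    out = [products[i] + products[i + 1]
           for i in range(0, len(products) - 1, 2)]
    if len(products) % 2:
        out.append(products[-1])
    return out


if __name__ == "__main__":
    for k in (3, 5, 7):
        n = k * k
        paired = len(paired_pe_outputs([1] * n))
        print(k, "levels without pairing:", tree_levels(n),
              "with pairing:", tree_levels(paired))
```

For a 3 x 3 kernel, the nine products collapse to five tree inputs, so the tree needs three levels instead of four.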

FIG. 5 is a flowchart of a convolution operation method according to an embodiment.

In step 510, the window buffer 110 may provide values of the pixels of the window of the image.

In step 520, the convolution core 120 may perform a convolution operation on the window using the values of the pixels of the window and the kernel coefficients.

The current pixel may be the center of the window.

The above description with reference to FIGS. 1 to 4 can also be applied to the embodiment described with reference to FIG. 5. Duplicate descriptions are omitted.

FIG. 6 is a flowchart of a method for performing convolution operations on a plurality of windows of an image according to an embodiment.

In step 610, the convolution arithmetic unit 100 may perform a convolution operation on a window using the window buffer 110 and the convolution core 120. Step 610 may correspond to steps 510 and 520 described above with reference to FIG. 5.

In step 620, the convolution arithmetic unit 100 may check whether the convolution operations for the plurality of windows have ended. If they have ended, the procedure may terminate. If they have not ended, step 630 may be performed to prepare the convolution operation on the next window.

For example, when the convolution arithmetic unit 100 has performed the convolution operation on the last window of the image, it may determine that the convolution operations on the plurality of windows are completed. If the convolution operation performed in step 610 was not for the last window of the image, the convolution arithmetic unit 100 may determine that the convolution operations on the plurality of windows have not ended.

In step 630, the convolution arithmetic unit 100 may input the value of a new pixel of the image into the window buffer 110, thereby setting the window buffer 110 to provide the values of the pixels of the window that follows the window processed in step 610.

After step 630 is performed, step 610 may be repeated for the new window.
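The loop formed by steps 610 to 630 can be sketched in software as follows. For brevity the sketch replaces the registers and FIFOs with a single shift chain of length (k - 1) x width + k, which holds the same pixels in the same order; the function name `stream_convolve` and the choice to emit results only for windows fully inside the image are assumptions.

```python
from collections import deque


def stream_convolve(pixels, width, height, coeffs):
    """Consume pixels in raster order and emit one result per full window."""
    k = len(coeffs)
    length = (k - 1) * width + k
    chain = deque([0] * length, maxlen=length)  # oldest value drops off the end
    results = []
    for count, pixel in enumerate(pixels, start=1):
        chain.appendleft(pixel)                 # step 630: input a new pixel
        row, col = divmod(count - 1, width)
        if row >= k - 1 and col >= k - 1:
            # Step 610: the register of row i, column j of the window buffer
            # corresponds to chain offset i * width + j.
            acc = sum(chain[i * width + j] * coeffs[i][j]
                      for i in range(k) for j in range(k))
            results.append(acc)
        if count == width * height:             # step 620: last window done
            break
    return results


if __name__ == "__main__":
    ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    print(stream_convolve(range(15), width=5, height=3, coeffs=ones))
```

Because the newest pixel always occupies offset 0, multiplying register (i, j) by coefficient (i, j) applies the kernel in reversed order relative to the image, which is the reversal a convolution calls for.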

The above description with reference to FIGS. 1 to 5 can also be applied to the embodiment described with reference to FIG. 6. Duplicate descriptions are omitted.

The following describes the DSP resources required by the convolution arithmetic unit 100.

As described above, each PE may be a DSP PE. The number D_g of DSP PEs required for the plurality of PEs of the convolution core 120 (excluding the accumulation tree) may be expressed by Equation 3 below.

[Equation 3]

Also, the accumulation tree can be built from DSP PEs. The accumulation tree can use DSP PEs as two-input adders to accumulate a given number of elements. Here, for n input elements, the DSP PEs may be arranged in log₂ n levels. The input values of the accumulation tree may enter the input ports of the DSP PEs of the accumulation tree and may pass through the input registers of those DSP PEs.

When the size of the kernel is k, the number D_a of DSP PEs required for the accumulation tree of the convolution core 120 may be expressed by Equation 4 below.

[Equation 4]

The total number D of DSP PEs used by the convolution core 120 may be expressed by Equation 5 below.

[Equation 5]

Table 1 below shows the number of DSP PEs used for various kernel sizes.

[Table 1]
Kernel size    DSP PEs used
5              49
7              97
9              161
11             241
13             337
15             449
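Equations 3 to 5 are not reproduced in this text. The sketch below checks one set of closed forms against Table 1: k² multiplier PEs (from the statement that the number of PEs is k²) plus k² - 1 two-input adders (the count needed to reduce k² values pairwise), for a total of 2k² - 1. These closed forms are an inference from the table, not a quotation of the original equations, and the function names are hypothetical.

```python
# DSP PE counts copied from Table 1 (kernel size -> DSP PEs used).
TABLE_1 = {5: 49, 7: 97, 9: 161, 11: 241, 13: 337, 15: 449}


def dsp_multipliers(k):
    """Assumed form of D_g: one DSP PE per kernel coefficient."""
    return k * k


def dsp_tree_adders(k):
    """Assumed form of D_a: two-input adders to reduce k*k values to one."""
    return k * k - 1


def dsp_total(k):
    """Assumed form of D = D_g + D_a = 2*k*k - 1."""
    return dsp_multipliers(k) + dsp_tree_adders(k)


def matches_table_1():
    """True if the assumed closed form reproduces every row of Table 1."""
    return all(dsp_total(k) == used for k, used in TABLE_1.items())


if __name__ == "__main__":
    print("closed form matches Table 1:", matches_table_1())
```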

In the following, the configurability of the convolution arithmetic unit 100 is described.

Table 2 below may illustrate various runtime parameters and compile time parameters of the window buffer 110 and the convolution core 120. In addition, Table 2 can illustrate the ranges used for each parameter with the parameters.

[Table 2]
Parameter           Determined at    Hardware impact
Image width         Runtime          Depth of the FIFOs of the window buffer 110
Image height        Runtime          None
Kernel size         Compile time     Number of registers and number of FIFOs in the window buffer 110; number of PEs in the convolution core 120
Pixel bit width     Compile time     Width of registers, width of FIFOs, and width of PEs
Kernel bit width    Compile time     Width of registers and width of PEs
Kernel values       Runtime          None

As described in Table 2, the design of the convolution arithmetic unit 100 is scalable to any image size and kernel size. The width of the image may affect the depth of the FIFOs in the window buffer 110. The height of the image may have no impact on hardware resources. The size of the kernel may determine the number of pixel registers in the window buffer 110. In addition, the size of the kernel may determine the number of FIFOs required in the window buffer 110 and the number of PEs in the convolution core 120 for processing the pixels provided by the window buffer 110.

The design of the convolution arithmetic unit 100 described above can accelerate the calculation of the convolution output for a raster-scan video input stream. The described design can produce the convolution output at the same pixel rate as the input pixel rate.

The method according to an embodiment may be implemented in the form of a program command that can be executed through various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions to be recorded on the medium may be those specially designed and configured for the embodiments or may be available to those skilled in the art of computer software. Examples of computer-readable media include magnetic media such as hard disks, floppy disks and magnetic tape; optical media such as CD-ROMs and DVDs; magnetic media such as floppy disks; Magneto-optical media, and hardware devices specifically configured to store and execute program instructions such as ROM, RAM, flash memory, and the like. Examples of program instructions include machine language code such as those produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. For example, it is to be understood that the described techniques may be performed in a different order than the described methods, and/or that components of the described systems, structures, devices, circuits, and the like may be combined in a different form, or replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims are also within the scope of the following claims.

100: Convolution arithmetic unit
110: Window buffer
120: Convolution core

Claims

1. A convolution arithmetic unit comprising:
a window buffer for providing values of pixels of a window of an image; and
a convolution core for performing a convolution operation on the window using the values of the pixels of the window and kernel coefficients.

2. The convolution arithmetic unit of claim 1,
wherein the convolution arithmetic unit performs sliding window image processing on the image by performing the convolution operation on a plurality of windows of the image in a predetermined order.

3. The convolution arithmetic unit of claim 2,
wherein the predetermined order is a raster scan order.

4. The convolution arithmetic unit of claim 1,
wherein the window buffer comprises a plurality of registers, and
wherein the plurality of registers store the values of the pixels of the window and provide the values of the pixels of the window.

5. The convolution arithmetic unit of claim 4,
wherein the plurality of registers form a plurality of rows and a plurality of columns,
wherein the number of the plurality of rows is equal to the height of the kernel, and
wherein the number of the plurality of columns is equal to the width of the kernel.

6. The convolution arithmetic unit of claim 5,
wherein the last register of each of the remaining rows, excluding the last row of the plurality of rows, is connected to an input of a first-in, first-out (FIFO) buffer, and the output of the FIFO is connected to the first register of the next row.

7. The convolution arithmetic unit of claim 6,
wherein, when a new pixel is input to the window buffer, the values of the plurality of registers are propagated through the columns of consecutive registers and through the FIFO.

8. The convolution arithmetic unit of claim 7,
wherein the new pixel is input to the window buffer every clock cycle, and
wherein the window buffer provides the pixel values of a new window every clock cycle.

9. The convolution arithmetic unit of claim 6,
wherein the FIFO stores values of pixels required for windows to be processed later by sliding window image processing, rather than for the window currently being processed.

10. The convolution arithmetic unit of claim 6,
wherein the sum of the width of a row of the plurality of rows and the length of the FIFO is equal to the width of the image.

11. The convolution arithmetic unit of claim 6,
wherein the length of the FIFO is dynamically configured according to the width of the image processed by the convolution arithmetic unit.

12. The convolution arithmetic unit of claim 11,
wherein the maximum length of the FIFO that can be dynamically configured is at least a value obtained by subtracting the width of the kernel from the maximum image width that can be processed by the convolution arithmetic unit.

13. The convolution arithmetic unit of claim 1,
wherein the window buffer maintains a count of the pixels input to the window buffer in order to track where the current pixel is located in the image, the current pixel being the center of the window.

14. The convolution arithmetic unit of claim 1,
wherein the convolution core comprises:
a plurality of processing elements (PEs); and
an accumulation tree,
wherein each PE of the plurality of PEs calculates a product of a value of a pixel provided by the window buffer and a kernel coefficient corresponding to the provided pixel, and
wherein the accumulation tree generates a result of the convolution operation by accumulating the values calculated by the plurality of PEs.

15. The convolution arithmetic unit of claim 14,
wherein the plurality of PEs correspond to the kernel coefficients, respectively.

16. The convolution arithmetic unit of claim 14,
wherein, among the plurality of PEs, the PE provided with the value of the pixel from the i-th row and the j-th column of the window buffer calculates the product of the pixel value and the kernel coefficient of the i-th row and the j-th column,
wherein i is an integer of 1 or more and k or less,
j is an integer of 1 or more and k or less, and
k is the size of the kernel.

17. The convolution arithmetic unit of claim 14,
wherein some PEs of the plurality of PEs each combine a first product calculated by another PE with a second product calculated by that PE, and output the sum of the first product and the second product.

18. The convolution arithmetic unit of claim 17,
wherein the some PEs combine the first product with the second product using a post-multiplication adder of the PE.

19. A window buffer providing values of pixels of a window of an image; and
a convolution core performing a convolution operation on the window using the values of the pixels of the window and kernel coefficients.

20. A method of performing a convolution operation on a plurality of windows of an image, the method comprising:
performing a convolution operation on a window using a window buffer and a convolution core, wherein the window buffer provides values of pixels of the window of the image, and the convolution core performs the convolution operation on the current pixel using the values of the pixels of the window and kernel coefficients; and
setting the window buffer, by inputting a value of a new pixel of the image to the window buffer, such that the window buffer provides values of pixels of a next window following the window.