DESCRIPTION
MULTI-PURPOSE IMAGE PROCESSING CORE
Field of the invention
This invention relates to an image processing method in FPGA for analyzing video frames using neural network based techniques, operating in real time on embedded platforms.
Background of the invention
Neural network approaches in vision are becoming increasingly popular due to their performance in complex tasks such as large-scale classification [REF.1] or multi-modal fusion [REF.2]. The success is attributed to multiple advantages such as unsupervised feature learning from unlabeled data [REF.3], [REF.4], hierarchical processing via deep architectures [REF.5]-[REF.7] and exploitation of long-range statistical dependencies using recurrent processing [REF.3], [REF.8]. The neural network approach is orthogonal to kernel methods: the input is projected onto a nonlinear high dimensional space of hidden units, after which even a linear hyperplane is able to partition the data [REF.9]. Since this nonlinear projection is a powerful representation of the visual data, it is possible to utilize it for multiple different tasks, such as classification, detection, tracking, clustering, interest point detection etc. Thus, after an image or a video block is "analyzed" by a neural network via multi-layer processing, the hidden layer activities that represent the visual input can be multiplexed to many different tasks according to the needs, as is done in cortical processing [REF.10].
Real-time embedded visual processing needs are growing, with increased demands in intelligent robotic platforms such as Unmanned Aerial Vehicles (UAVs). These systems are expected to navigate and operate in an autonomous fashion, and this entails successful implementations of image and video understanding functions. Scene recognition, detection of specific objects in an image, classification of moving objects and object tracking are some of the essential visual functions that are required in an autonomous robotic system. The weight and energy specifications of such systems restrict both the number and the complexity of visual processing functions, diminishing the operational capacity. A visual processing core that is common to at least a subset of these functions can relax these restrictions.
In this invention, we show that a sparse and overcomplete image representation is formed in the neural network hidden layers, providing versatility and discriminative power [REF.4], [REF.11]. Specifically, we present an FPGA implementation of this representation, which can be embedded in a UAV platform for surveillance and reconnaissance missions.
Objects of the invention
The object of the invention is to provide an FPGA implementation of a neural network based image processing core.
Detailed description of the invention
The multi-purpose image processing (IP) core that fulfills the objects of the present invention is illustrated in the attached figures, where:
Figure 1 is the schematic of the IP core in FPGA with the external components.
Figure 2 is the video frame and patch structure.
Figure 3 is the flow of the feature extractor.
Figure 4 is the structure of the take patch process.
Figure 5 is the construction of P vector.
Figure 6 is the construction of binary PB vector.
Figure 7 is the dictionary D.
Figure 8 is the construction of distance vector DV.
Figure 9 is the computation of pixel feature vector PFV.
Figure 10 is the structure of the feature summer.
Figure 11 is the structure of quadrants.
Figure 12 is the computation of feature vector FV.
Figure 13 is the computation of class label CL.
In the preferred embodiment of the invention, the multi-purpose image processing core (101) is implemented in FPGA (100). The core consists of two main sub-blocks: image analyzer (102) and memory interface (103).
Memory interface (103) is responsible for data transfer between image analyzer (102) and the external memories (113). Image analyzer (102) block consists of three sub-blocks: feature extractor (104), feature summer (105) and classifier (106). Image analyzer (102) block receives five types of inputs from outside of the FPGA (100): video frames (107), feature dictionary (108), class matrix (110), feature calculation requests (109) and sparsity multiplier (114). The video frames (107) can be defined by two parameters: resolution and frame rate. The resolution is M (row) (201) by N (column) (202), and the frame rate is the number of frames (203) captured in a second. The other inputs, namely feature dictionary (108), class matrix (110), feature calculation requests (109) and sparsity multiplier (114), will be detailed in the following sections.
Feature extractor (104) block starts with the take patch (301) process. This process captures the related pixels, which are at the selected coordinates of the patch (204), from the video frames (107). To capture the related pixels, the incoming video line (row (201) of the video frames (107)) is written to a line FIFO (401). According to the patch (204) dimension (K), the take patch (301) process uses K line FIFOs (401). Each incoming video line is first written to the bottom line FIFO (401); then, when the next video line is coming, the previous one is read from the bottom line FIFO (401) and written to the upper one. These steps continue until all line FIFOs (401) are filled with the lines necessary to construct the patch (204). When all lines are available, with the next line coming, pixel values are read from the line FIFOs (401). After K read operations, the patch is ready for further operations. The (K+1)-th read from the line FIFOs (401) gives the next pixel patch. These steps continue until all patches (204) are captured through a line. During the patch reads from the line FIFOs (401), new lines continue to move to the upper line FIFOs (401). This movement generates the downward movement of the patch (204) through the video frames (107). The P vector (501) is constructed (302) by using the captured patch (204) pixel values. This construction process is a simple register assignment: there are KxK registers from L1P1 (402) to LKPK (403), and every register keeps the related pixel value. The bit size of the registers is determined by the maximum possible pixel value.
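As an illustration of the take patch (301) process, the following C sketch models the K line FIFOs (401) with K line buffers and extracts one KxK patch per read position; the function name, the 8-bit pixel type and the concrete values of K and N are our assumptions, not part of the invention.

#include <stdint.h>
#include <string.h>

#define K 8            /* patch dimension, assumed value */
#define N 640          /* frame width (columns), assumed */

/* Once K video lines are buffered (lines[0] holds the oldest line,
 * lines[K-1] the newest), one KxK patch is read per column position;
 * advancing col by one models the next read from the line FIFOs. */
static void take_patch(const uint8_t lines[K][N],
                       int col,              /* left column of the patch */
                       uint8_t patch[K][K])  /* registers L1P1..LKPK     */
{
    for (int r = 0; r < K; r++)              /* one read per line FIFO   */
        memcpy(patch[r], &lines[r][col], K); /* K pixel values per read  */
}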
To calculate the mean value (Pμ (602)) of the P vector (501) (303), every pixel value in the patch (204) is added and the sum is then divided by the total number of pixels. The addition is realized by adders; the number of adder inputs can differ according to the FPGA capability. The adder input number affects the pipeline clock latency and the number of adders used. After all pixel values are added, the total is divided by K*K.
After calculating Pμ (602), each entry of the P vector (501) is compared (601) with Pμ (602) and binarized to construct the vector PB (603) (304). The binarization step is essential for realizing this image processing algorithm in currently available FPGAs. For the values that are less than Pμ (602), "0" is assigned. For the values that are equal or greater, "1" is assigned. After all values are compared (601) with the mean value, the binary version of the P vector (501), PB (603), is obtained. PB (603) is a T (604) by 1 bit vector, where T (604) equals K*K.
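A minimal C sketch of the mean computation (303) and binarization (304), assuming 8-bit pixels and K = 8 (so T = 64); an FPGA realization would use an adder tree and comparators instead.

#include <stdint.h>

#define T 64  /* T = K*K, assumed K = 8 */

/* Compute the patch mean Pmu (602) and binarize the P vector (501)
 * entry-wise against it to obtain PB (603). */
static void binarize_patch(const uint8_t P[T], uint8_t PB[T])
{
    unsigned sum = 0;
    for (int i = 0; i < T; i++)        /* adder tree in hardware       */
        sum += P[i];
    unsigned Pmu = sum / T;            /* mean value Pmu (602)         */

    for (int i = 0; i < T; i++)        /* comparison (601) against Pmu */
        PB[i] = (P[i] >= Pmu) ? 1 : 0; /* ">=" maps to "1", "<" to "0" */
}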
Every binary vector PB (603) constructed from the patches (204) in an image is transformed into a feature vector using a pre-computed dictionary that has Z (703) visual words. The dictionary D (701) is a T (604) by Z (703) bit matrix. The entries of D (701) are binary values, "1" or "0". The columns of the D (701) matrix (DC1 - DCZ (702)) are stored in internal registers of the FPGA (100). The dictionary is loaded to the FPGA by means of communication interfaces like PCI, VME etc. The entries of the dictionary can be updated at any time since they are stored in internal registers.
Bit flipping (or Hamming) distance calculation (305) computes the similarity between two vectors: PB (603) and every column (DC1-DCZ (702)) of D (701). If the entries of PB (603) and DCX (702) are the same, "0" is assigned; otherwise "1" is assigned. This operation is realized by xor (801) blocks. The total number of "1" values after the xor (801) operation is a measure of dissimilarity between the two binary vectors. DV (804) contains the Hamming distance of a single PB (603) vector to all the visual words (columns (702)) in the dictionary. The entries (805) of DV (804) keep the numbers of "1"s, so they are integer values and can be represented by fewer bits when compared with PB (603) or DCX (702). DV (804) is an H (806) by Z (703) bit vector, where H (806) is the minimum number of bits that can represent the scalar value T (604).
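The following C sketch illustrates the Hamming distance computation (305) under the assumption that T = 64, so PB (603) and each dictionary column fit in one 64-bit word; Z and the bit-packed layout are likewise our assumptions.

#include <stdint.h>

#define Z 256  /* number of visual words, assumed */

/* XOR (801) the packed PB vector with every dictionary column
 * DC1..DCZ (702) and count the "1"s to fill DV (804). */
static void hamming_distances(uint64_t PB, const uint64_t DC[Z],
                              uint16_t DV[Z])
{
    for (int z = 0; z < Z; z++) {
        uint64_t x = PB ^ DC[z];       /* xor (801) block              */
        uint16_t ones = 0;
        while (x) {                    /* count differing bit places   */
            ones += (uint16_t)(x & 1u);
            x >>= 1;
        }
        DV[z] = ones;                  /* entry (805): distance to DCz */
    }
}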
The mean value of DV (804) (DVμ) is computed (306) similarly to Pμ (602). To calculate the standard deviation of DV (804) (DVσ) (307), DVμ is subtracted from each entry (805) of DV (804). Then the square of each difference is calculated and all the squares are added. The total is divided by Z (703) and, finally, the square root is taken to obtain DVσ. The activation threshold AT (901) is calculated (307) by EQ.1. This threshold is used to construct a sparse representation via nullifying the distance values larger than a specified value.
AT = DVμ - (sparsity multiplier x DVσ)    (EQ.1)
To construct the pixel feature vector (309), each entry (805) of DV (804) is compared (902) with AT (901). If the entry (805) is greater than AT (901), "0" is assigned to the related entry (905) of PFV (904); if it is less, "1" is assigned. The result is a 1 by Z (703) pixel feature vector PFV (904).
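Putting steps (306)-(309) together, the C sketch below computes DVμ, DVσ, the threshold AT (901) of EQ.1, and the sparse PFV (904); floating point is used here only for clarity, whereas a hardware pipeline would use fixed-point arithmetic.

#include <stdint.h>
#include <math.h>

#define Z 256  /* number of visual words, assumed */

static void pixel_feature_vector(const uint16_t DV[Z],
                                 double sparsity_multiplier,
                                 uint8_t PFV[Z])
{
    double sum = 0.0, sq = 0.0;
    for (int z = 0; z < Z; z++) sum += DV[z];
    double DVmu = sum / Z;                     /* mean of DV (306)      */

    for (int z = 0; z < Z; z++) {
        double d = DV[z] - DVmu;
        sq += d * d;
    }
    double DVsigma = sqrt(sq / Z);             /* std. deviation (307)  */

    double AT = DVmu - sparsity_multiplier * DVsigma;  /* EQ.1          */

    for (int z = 0; z < Z; z++)                /* comparison (902)      */
        PFV[z] = (DV[z] > AT) ? 0 : 1;         /* nullify large values  */
}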
As a result, for each pixel of a video frame (107), a 1 by Z (703) bit vector (pixel feature vector PFV (904)) is obtained. These PFVs (904) are sent to the memory interface (103) to be written to the external memories (113). The feature calculation requests (109) are written to the feature calculation request FIFO (1003); the requests are written as pixel coordinates. The CPU sends the coordinates of two border pixels (upper-left and lower-right, black dots (1101)) and the FPGA calculates the remaining coordinates (white dots (1102)) of the sub-regions. The main idea is to divide a region into four equal sub-regions, the quadrants (1103, 1104, 1105, 1106), pool the pixel feature vectors (PFVs (904)) inside the quadrants for dimensionality reduction, and concatenate the integral feature vectors to obtain a feature vector (FV (111)).
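A sketch of the sub-region coordinate calculation: given the two border pixels sent by the CPU, the four quadrant rectangles Q1-Q4 are derived. The struct names and the integer midpoint rounding convention are our assumptions, since the text does not specify them.

typedef struct { int row, col; } Pixel;
typedef struct { Pixel ul, lr; } Region;

static void split_quadrants(Pixel ul, Pixel lr, Region q[4])
{
    int mr = (ul.row + lr.row) / 2;   /* midpoint row (assumed rounding) */
    int mc = (ul.col + lr.col) / 2;   /* midpoint column                 */

    q[0] = (Region){ ul, {mr, mc} };                    /* Q1 (1103)     */
    q[1] = (Region){ {ul.row, mc + 1}, {mr, lr.col} };  /* Q2 (1104)     */
    q[2] = (Region){ {mr + 1, ul.col}, {lr.row, mc} };  /* Q3 (1105)     */
    q[3] = (Region){ {mr + 1, mc + 1}, lr };            /* Q4 (1106)     */
}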
According to the pixel coordinates, the internal RAM (1004) addresses are calculated by the address calculator (1002) block. This block knows the content of the RAM, namely which line coordinates are stored. To make the calculations faster, the PFV (904) values are read from the external memory (113) and written to the internal RAM. The RAM can store R x N (202) x Z (703) bits of data, where R is the maximum number of lines that can be processed at a time.
Integral vector calculator (1001) reads the necessary PFVs (904) from the internal RAM (1004) to calculate the integral vector (1201). An integral vector IV (1201) entry is the summation of all the entries of the previous PFVs (904) on both the horizontal and vertical dimensions. For example, IV11 (1201-11) is equal to PFV11 (904-11), IV12 (1201-12) is equal to IV11 (1201-11) plus PFV12 (904-12), and IV21 (1201-21) is equal to PFV11 (904-11) plus PFV21 (904-21). The final result is the quadrant integral vector IV22 (1201-22). The pool feature operation requests the difference between IV22 (1201-22) and IV11 (1201-11); this is equivalent to taking PFV11 (904-11) as all "0". The final integral vector IV22 (1201-22) is then this difference and equals QIV (1103-1).
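One common realization of such an integral accumulation is the two-dimensional prefix sum shown below, applied entry-wise over the Z components of the PFVs (904); the recurrence and memory layout here are our assumptions for illustration, not necessarily the exact hardware scheme.

#include <stdint.h>

#define Z 256  /* entries per PFV, assumed */

/* Build IV (1201) as a running sum of PFVs (904) over both the
 * horizontal and vertical dimensions; a quadrant sum then reduces
 * to a difference of corner entries, as described in the text. */
static void integral_vectors(const uint8_t (*PFV)[Z], uint32_t (*IV)[Z],
                             int rows, int cols)
{
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            for (int z = 0; z < Z; z++) {
                uint32_t up   = r ? IV[(r - 1) * cols + c][z] : 0;
                uint32_t left = c ? IV[r * cols + (c - 1)][z] : 0;
                uint32_t diag = (r && c) ? IV[(r - 1) * cols + (c - 1)][z] : 0;
                IV[r * cols + c][z] = PFV[r * cols + c][z] + up + left - diag;
            }
}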
Since there exist four quadrants (Q1 (1103), Q2 (1104), Q3 (1105) and Q4 (1106)), all quadrant results (1103-1, 1104-1, 1105-1 and 1106-1) are concatenated and the final feature vector FV (1202) is obtained. The FV (1202) is a G x S bit vector, where S is the minimum number of bits that can store all the "1"s in a quadrant and G is equal to 4*Z (703). The vector is stored in the internal RAM of the FPGA. This feature vector (FV) represents the image region defined by the border coordinates, and it can be used for classification and clustering purposes, executed either in the FPGA or in the CPU via memory transfer. After the completion of an image region, namely when the pooling on the requested coordinates in that region is finished, the internal RAM (1004) is updated with new lines and new pooling calculations are started. These processes are controlled by the integral vector calculator (1001) with the aid of the address calculator (1002).
Classifier block (106) generates a class label likelihood vector using a linear classification method. It performs matrix-vector multiplication of the class matrix C (1301) with FV (111). The class matrix C (1301) is loaded to the FPGA like the feature dictionary D (701). Row arbiter (1303) controls the C (1301) matrix row management for the FV (111) multiplication. The C (1301) matrix is a J (1302) x G x S bit matrix. The result is the class label vector CL (112). The entries of CL (112) are the sums of the multiplications (1304) of FV (111) with the C (1301) rows. The CL (112) is sent to the CPU for further processing, classification, detection etc.
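Finally, a minimal C sketch of the classifier block (106): each row of the class matrix C (1301) multiplies FV (111), and the accumulated products form the class label vector CL (112). The sizes J and G and the integer widths are our assumptions.

#include <stdint.h>

#define J 16    /* number of classes, assumed     */
#define G 1024  /* FV length, G = 4*Z for Z = 256 */

static void classify(const int32_t C[J][G],  /* rows served by arbiter (1303) */
                     const uint32_t FV[G],
                     int64_t CL[J])
{
    for (int j = 0; j < J; j++) {
        int64_t acc = 0;
        for (int g = 0; g < G; g++)    /* multiply (1304) and accumulate */
            acc += (int64_t)C[j][g] * (int64_t)FV[g];
        CL[j] = acc;                   /* likelihood for class j         */
    }
}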