DESCRIPTION
MULTI-PURPOSE IMAGE PROCESSING CORE
Field of the invention
This invention relates to an image processing method in FPGA for analyzing video frames using neural network based techniques, operating in real time on embedded platforms.
Background of the invention
Neural network approaches in vision are becoming increasingly popular due to their performance in complex tasks such as large-scale classification [REF.1] or multi-modal fusion [REF.2]. The success is attributed to multiple advantages such as unsupervised feature learning from unlabeled data [REF.3], [REF.4], hierarchical processing via deep architectures [REF.5]-[REF.7] and exploitation of long-range statistical dependencies using recurrent processing [REF.3], [REF.8]. The neural network approach is orthogonal to kernel methods: the input is projected onto a nonlinear high dimensional space of hidden units, after which even a linear hyperplane is able to partition the data [REF.9]. Since this nonlinear projection is a powerful representation of the visual data, it is possible to utilize it for multiple different tasks, such as classification, detection, tracking, clustering, interest point detection etc. Thus, after an image or a video block is "analyzed" by a neural network via multi-layer processing, the hidden layer activities that represent the visual input can be multiplexed to many different tasks according to the needs, as is done in cortical processing [REF.10].
Real-time embedded visual processing needs are growing, with increased demands in intelligent robotic platforms such as Unmanned Aerial Vehicles (UAVs). These systems are expected to navigate and operate in an autonomous fashion, and this entails successful implementations of image and video understanding functions. Scene recognition, detection of specific objects in an image, classification of moving objects and object tracking are some of the essential visual functions that are required in an autonomous robotic system. The weight and energy specifications of such systems restrict both the number and the complexity of visual processing functions, diminishing the operational capacity. A visual processing core that is common to at least a subset of these functions can relax these restrictions.
In this invention, we show that a sparse and overcomplete image representation is formed in the neural network hidden layers, providing versatility and discriminative power [REF.4], [REF.11]. Specifically, we present an FPGA implementation of this representation, which can be embedded in a UAV platform for surveillance and reconnaissance missions.
Objects of the invention
The object of the invention is to provide an FPGA implementation of a neural network based image processing core.
Detailed description of the invention
The multi-purpose image processing (IP) core that fulfills the objects of the present invention is illustrated in the attached figures, where:
Figure 1 is the schematic of the IP core in FPGA with the external components.
Figure 2 is the video frame and patch structure.
Figure 3 is the flow of the feature extractor.
Figure 4 is the structure of the take patch process.
Figure 5 is the construction of P vector.
Figure 6 is the construction of binary PB vector.
Figure 7 is the dictionary D.
Figure 8 is the construction of distance vector DV.
Figure 9 is the computation of pixel feature vector PFV.
Figure 10 is the structure of the feature summer.
Figure 11 is the structure of quadrants.
Figure 12 is the computation of feature vector FV.
Figure 13 is the computation of class label CL.
In the preferred embodiment of the invention, the multi-purpose image processing core (101) is implemented in FPGA (100). The core consists of two main sub-blocks: image analyzer (102) and memory interface (103).
Memory interface (103) is responsible for data transfer between image analyzer (102) and the external memories (113). Image analyzer (102) block consists of three sub-blocks: feature extractor (104), feature summer (105) and classifier (106). Image analyzer (102) block receives five types of inputs from outside of the FPGA (100): video frames (107), feature dictionary (108), class matrix (110), feature calculation requests (109) and sparsity multiplier (114). The video frames (107) can be defined by two parameters: resolution and frame rate. The resolution is M (row) (201) by N (column) (202), and the frame rate is the number of frames (203) captured in a second. The other inputs, namely feature dictionary (108), class matrix (110), feature calculation requests (109) and sparsity multiplier (114), will be detailed in the following sections.
Feature extractor (104) block starts with the take patch (301) process. This process captures the related pixels, which are at the selected coordinates of the patch (204), from the video frames (107). To capture the related pixels, the incoming video line (row (201) of the video frames (107)) is written to a line FIFO (401). According to the patch (204) dimension (K), the take patch (301) process uses K line FIFOs (401). Each incoming video line is first written to the bottom line FIFO (401); then, when the next video line is coming, the previous one is read from the bottom line FIFO (401) and written to the upper one. These steps continue until all line FIFOs (401) are filled with the lines necessary to construct the patch (204). When all lines are available, with the next line coming, pixel values are read from the line FIFOs (401). After K read operations, the patch is ready for further operations. The (K+1)-th read from the line FIFOs (401) gives the next pixel patch. These steps continue until all patches (204) are captured through a line. During the patch reads from the line FIFOs (401), new lines continue to move to the upper line FIFOs (401). This movement generates the downward movement of the patch (204) through the video frames (107). The P vector (501) is constructed (302) by using the captured patch (204) pixel values. This construction process is a simple register assignment: there are KxK registers from L1P1 (402) to LKPK (403), and every register keeps the related pixel value. The bit size of the registers is determined by the maximum possible pixel value.
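As an illustration of the take patch (301) process, the following C sketch models the K line FIFOs (401) with K line buffers and extracts one KxK patch per read position; the function name, the 8-bit pixel type and the concrete values of K and N are our assumptions, not part of the invention.

#include <stdint.h>
#include <string.h>

#define K 8            /* patch dimension, assumed value */
#define N 640          /* frame width (columns), assumed */

/* Once K video lines are buffered (lines[0] holds the oldest line,
 * lines[K-1] the newest), one KxK patch is read per column position;
 * advancing col by one models the next read from the line FIFOs. */
static void take_patch(const uint8_t lines[K][N],
                       int col,              /* left column of the patch */
                       uint8_t patch[K][K])  /* registers L1P1..LKPK     */
{
    for (int r = 0; r < K; r++)              /* one read per line FIFO   */
        memcpy(patch[r], &lines[r][col], K); /* K pixel values per read  */
}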
To calculate the mean value (Pμ (602)) of the P vector (501) (303), every pixel value in the patch (204) is added and the sum is then divided by the total number of pixels. The addition is realized by adders; the number of adder inputs can differ according to the FPGA capability. The adder input number affects the pipeline clock latency and the number of adders used. After all pixel values are added, the total is divided by K*K.
After calculating Pμ (602), each entry of the P vector (501) is compared (601) with Pμ (602) and binarized to construct the vector PB (603) (304). The binarization step is essential for realizing this image processing algorithm in currently available FPGAs. For the values that are less than Pμ (602), "0" is assigned. For the values that are equal or greater, "1" is assigned. After all values are compared (601) with the mean value, the binary version of the P vector (501), PB (603), is obtained. PB (603) is a T (604) by 1 bit vector, where T (604) equals K*K.
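A minimal C sketch of the mean computation (303) and binarization (304), assuming 8-bit pixels and K = 8 (so T = 64); an FPGA realization would use an adder tree and comparators instead.

#include <stdint.h>

#define T 64  /* T = K*K, assumed K = 8 */

/* Compute the patch mean Pmu (602) and binarize the P vector (501)
 * entry-wise against it to obtain PB (603). */
static void binarize_patch(const uint8_t P[T], uint8_t PB[T])
{
    unsigned sum = 0;
    for (int i = 0; i < T; i++)        /* adder tree in hardware       */
        sum += P[i];
    unsigned Pmu = sum / T;            /* mean value Pmu (602)         */

    for (int i = 0; i < T; i++)        /* comparison (601) against Pmu */
        PB[i] = (P[i] >= Pmu) ? 1 : 0; /* ">=" maps to "1", "<" to "0" */
}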
Every binary vector PB (603) constructed from the patches (204) in an image is transformed into a feature vector using a pre-computed dictionary that has Z (703) visual words. The dictionary D (701) is a T (604) by Z (703) bit matrix. The entries of D (701) are binary values, "1" or "0". The columns of the D (701) matrix (DC1 - DCZ (702)) are stored in internal registers of the FPGA (100). The dictionary is loaded to the FPGA by means of communication interfaces like PCI, VME etc. The entries of the dictionary can be updated at any time since they are stored in internal registers.
Bit flipping (or Hamming) distance calculation (305) computes the similarity between two vectors: PB (603) and every column (DC1-DCZ (702)) of D (701). If the entries of PB (603) and DCX (702) are the same, "0" is assigned; otherwise "1" is assigned. This operation is realized by xor (801) blocks. The total number of "1" values after the xor (801) operation is a measure of dissimilarity between the two binary vectors. DV (804) contains the Hamming distance of a single PB (603) vector to all the visual words (columns (702)) in the dictionary. The entries (805) of DV (804) keep the numbers of "1"s, so they are integer values and can be represented by fewer bits when compared with PB (603) or DCX (702). DV (804) is an H (806) by Z (703) bit vector, where H (806) is the minimum number of bits that can represent the scalar value T (604).
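The following C sketch illustrates the Hamming distance computation (305) under the assumption that T = 64, so PB (603) and each dictionary column fit in one 64-bit word; Z and the bit-packed layout are likewise our assumptions.

#include <stdint.h>

#define Z 256  /* number of visual words, assumed */

/* XOR (801) the packed PB vector with every dictionary column
 * DC1..DCZ (702) and count the "1"s to fill DV (804). */
static void hamming_distances(uint64_t PB, const uint64_t DC[Z],
                              uint16_t DV[Z])
{
    for (int z = 0; z < Z; z++) {
        uint64_t x = PB ^ DC[z];       /* xor (801) block              */
        uint16_t ones = 0;
        while (x) {                    /* count differing bit places   */
            ones += (uint16_t)(x & 1u);
            x >>= 1;
        }
        DV[z] = ones;                  /* entry (805): distance to DCz */
    }
}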
The mean value of DV (804) (DVμ) is computed (306) similarly to Pμ (602). To calculate the standard deviation of DV (804) (DVσ) (307), DVμ is subtracted from each entry (805) of DV (804). Then the square of each difference is calculated and all the squares are added. The total is divided by Z (703) and, finally, the square root is taken to obtain DVσ. The activation threshold AT (901) is calculated (307) by EQ.1. This threshold is used to construct a sparse representation via nullifying the distance values larger than a specified value.
AT = DVμ - (sparsity multiplier x DVσ)    (EQ.1)
To construct the pixel feature vector (309), each entry (805) of DV (804) is compared (902) with AT (901). If the entry (805) is greater than AT (901), "0" is assigned to the related entry (905) of PFV (904); if it is less, "1" is assigned. The result is a 1 by Z (703) pixel feature vector PFV (904).
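Putting steps (306)-(309) together, the C sketch below computes DVμ, DVσ, the threshold AT (901) of EQ.1, and the sparse PFV (904); floating point is used here only for clarity, whereas a hardware pipeline would use fixed-point arithmetic.

#include <stdint.h>
#include <math.h>

#define Z 256  /* number of visual words, assumed */

static void pixel_feature_vector(const uint16_t DV[Z],
                                 double sparsity_multiplier,
                                 uint8_t PFV[Z])
{
    double sum = 0.0, sq = 0.0;
    for (int z = 0; z < Z; z++) sum += DV[z];
    double DVmu = sum / Z;                     /* mean of DV (306)      */

    for (int z = 0; z < Z; z++) {
        double d = DV[z] - DVmu;
        sq += d * d;
    }
    double DVsigma = sqrt(sq / Z);             /* std. deviation (307)  */

    double AT = DVmu - sparsity_multiplier * DVsigma;  /* EQ.1          */

    for (int z = 0; z < Z; z++)                /* comparison (902)      */
        PFV[z] = (DV[z] > AT) ? 0 : 1;         /* nullify large values  */
}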
As a result, for each pixel of a video frame (107), a 1 by Z (703) bit vector (pixel feature vector PFV (904)) is obtained. These PFVs (904) are sent to the memory interface (103) to be written to the external memories (113). The feature calculation requests (109) are written to the feature calculation request FIFO (1003); the requests are written as pixel coordinates. The CPU sends the coordinates of two border pixels (upper-left and lower-right, black dots (1101)) and the FPGA calculates the remaining coordinates (white dots (1102)) of the sub-regions. The main idea is to divide a region into four equal sub-regions, the quadrants (1103, 1104, 1105, 1106), pool the pixel feature vectors (PFVs (904)) inside the quadrants for dimensionality reduction, and concatenate the integral feature vectors to obtain a feature vector (FV (111)).
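A sketch of the sub-region coordinate calculation: given the two border pixels sent by the CPU, the four quadrant rectangles Q1-Q4 are derived. The struct names and the integer midpoint rounding convention are our assumptions, since the text does not specify them.

typedef struct { int row, col; } Pixel;
typedef struct { Pixel ul, lr; } Region;

static void split_quadrants(Pixel ul, Pixel lr, Region q[4])
{
    int mr = (ul.row + lr.row) / 2;   /* midpoint row (assumed rounding) */
    int mc = (ul.col + lr.col) / 2;   /* midpoint column                 */

    q[0] = (Region){ ul, {mr, mc} };                    /* Q1 (1103)     */
    q[1] = (Region){ {ul.row, mc + 1}, {mr, lr.col} };  /* Q2 (1104)     */
    q[2] = (Region){ {mr + 1, ul.col}, {lr.row, mc} };  /* Q3 (1105)     */
    q[3] = (Region){ {mr + 1, mc + 1}, lr };            /* Q4 (1106)     */
}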
According to the pixel coordinates, the internal RAM (1004) addresses are calculated by the address calculator (1002) block. This block knows the content of the RAM, namely which line coordinates are stored. To make the calculations faster, the PFV (904) values are read from the external memory (113) and written to the internal RAM. The RAM can store R x N (202) x Z (703) bits of data, where R is the maximum number of lines that can be processed at a time.
Integral vector calculator (1001) reads the necessary PFVs (904) from the internal RAM (1004) to calculate the integral vector (1201). An integral vector IV (1201) entry is the summation of all the entries of the previous PFVs (904) on both the horizontal and vertical dimensions. For example, IV11 (1201-11) is equal to PFV11 (904-11), IV12 (1201-12) is equal to IV11 (1201-11) plus PFV12 (904-12), and IV21 (1201-21) is equal to PFV11 (904-11) plus PFV21 (904-21). The final result is the quadrant integral vector IV22 (1201-22). The pool feature operation requests the difference between IV22 (1201-22) and IV11 (1201-11); this is equivalent to taking PFV11 (904-11) as all "0". The final integral vector IV22 (1201-22) is then this difference and equals QIV (1103-1).
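One common realization of such an integral accumulation is the two-dimensional prefix sum shown below, applied entry-wise over the Z components of the PFVs (904); the recurrence and memory layout here are our assumptions for illustration, not necessarily the exact hardware scheme.

#include <stdint.h>

#define Z 256  /* entries per PFV, assumed */

/* Build IV (1201) as a running sum of PFVs (904) over both the
 * horizontal and vertical dimensions; a quadrant sum then reduces
 * to a difference of corner entries, as described in the text. */
static void integral_vectors(const uint8_t (*PFV)[Z], uint32_t (*IV)[Z],
                             int rows, int cols)
{
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            for (int z = 0; z < Z; z++) {
                uint32_t up   = r ? IV[(r - 1) * cols + c][z] : 0;
                uint32_t left = c ? IV[r * cols + (c - 1)][z] : 0;
                uint32_t diag = (r && c) ? IV[(r - 1) * cols + (c - 1)][z] : 0;
                IV[r * cols + c][z] = PFV[r * cols + c][z] + up + left - diag;
            }
}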
Since there exist four quadrants (Q1 (1103), Q2 (1104), Q3 (1105) and Q4 (1106)), all quadrant results (1103-1, 1104-1, 1105-1 and 1106-1) are concatenated and the final feature vector FV (1202) is obtained. The FV (1202) is a G x S bit vector, where S is the minimum number of bits that can store all the "1"s in a quadrant and G is equal to 4*Z (703). The vector is stored in the internal RAM of the FPGA. This feature vector (FV) represents the image region defined by the border coordinates, and it can be used for classification and clustering purposes, executed either in the FPGA or in the CPU via memory transfer. After the completion of an image region, namely when the pooling on the requested coordinates in that region is finished, the internal RAM (1004) is updated with new lines and new pooling calculations are started. These processes are controlled by the integral vector calculator (1001) with the aid of the address calculator (1002).
Classifier block (106) generates a class label likelihood vector using a linear classification method. It performs matrix-vector multiplication of the class matrix C (1301) with FV (111). The class matrix C (1301) is loaded to the FPGA like the feature dictionary D (701). Row arbiter (1303) controls the C (1301) matrix row management for the FV (111) multiplication. The C (1301) matrix is a J (1302) x G x S bit matrix. The result is the class label vector CL (112). The entries of CL (112) are the sums of the multiplications (1304) of FV (111) with the C (1301) rows. The CL (112) is sent to the CPU for further processing, classification, detection etc.
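Finally, a minimal C sketch of the classifier block (106): each row of the class matrix C (1301) multiplies FV (111), and the accumulated products form the class label vector CL (112). The sizes J and G and the integer widths are our assumptions.

#include <stdint.h>

#define J 16    /* number of classes, assumed     */
#define G 1024  /* FV length, G = 4*Z for Z = 256 */

static void classify(const int32_t C[J][G],  /* rows served by arbiter (1303) */
                     const uint32_t FV[G],
                     int64_t CL[J])
{
    for (int j = 0; j < J; j++) {
        int64_t acc = 0;
        for (int g = 0; g < G; g++)    /* multiply (1304) and accumulate */
            acc += (int64_t)C[j][g] * (int64_t)FV[g];
        CL[j] = acc;                   /* likelihood for class j         */
    }
}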