WO2009054683A1 - System and method for real-time stereo image matching - Google Patents

System and method for real-time stereo image matching

Info

Publication number
WO2009054683A1
WO2009054683A1 (WO 2009/054683 A1) · PCT/KR2008/006267
Authority
WO
WIPO (PCT)
Prior art keywords
value
decision value
processing elements
pixel
disparity
Prior art date
Application number
PCT/KR2008/006267
Other languages
French (fr)
Inventor
Hong Jeong
Sung Chan Park
Young Su Kim
Original Assignee
Postech Academy-Industry Foundation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Postech Academy-Industry Foundation filed Critical Postech Academy-Industry Foundation
Publication of WO2009054683A1 publication Critical patent/WO2009054683A1/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00: Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20: Image signal generators
    • H04N13/204: Image signal generators using stereoscopic image cameras
    • H04N13/239: Image signal generators using stereoscopic image cameras using two 2D image sensors having a relative position equal to or related to the interocular distance
    • H04N13/243: Image signal generators using stereoscopic image cameras using three or more 2D image sensors
    • H04N13/296: Synchronisation thereof; Control thereof
    • H04N2213/00: Details of stereoscopic systems
    • H04N2213/001: Constructional or mechanical details


Abstract

A system for real-time stereo image matching includes: an image processor for converting image signals taken by each of a plurality of cameras into digital images; a plurality of input buffers for extracting interpixel data from the digital images provided from the image processor and reordering the interpixel data; a processing element array having a plurality of processing elements to produce an optimal decision value or a disparity value; and an encoder for encoding the disparity value or the optimal decision value. Each of the processing elements sequentially receives the reordered interpixel data and performs image matching by using a disparity value outputted from the processing element of the upper stage at a previous clock.

Description

Description
SYSTEM AND METHOD FOR REAL-TIME STEREO IMAGE
MATCHING
Technical Field
[1] The present invention relates to a system and method for real-time stereo image matching using multiple cameras; and, more particularly, to a system and method which can accurately extract 3-dimensional (3D) distance information on a thin object spaced apart from a background by finding corresponding points in multiple scan lines of multiple images in real time.
Background Art
[2] As well-known in the art, stereo matching is a core process of a real-time stereo image processing system. The stereo matching refers to a re-creating process of 3D spatial information from a pair of 2D images by using geometric relations therebetween as shown in Fig. 1.
[3] In Fig. 1, reference symbols $F$, $B$ and $Z$ represent a focal length, a baseline and a depth, respectively.
[4] Referring to Fig. 1, the stereo matching employs a method for finding a pixel on an image line corresponding to an epipolar line in each of a left and a right image, respectively, the pixels thus found corresponding to an identical point (X, Y, Z) in a 3D space. In this instance, a disparity d for the conjugate pixel pair is calculated as in Equation 1:
[5]
[6] MathFigure 1
[Math.1] $d = x^{l} - x^{r}$
[7]
[8] The depth Z is a geometric characteristic calculated from the disparity. That is, the disparity has distance information. Hence, 3D distance and shape information on an object in an observation space can be measured by calculating the disparity in real time from left and right images.
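To make the geometry of Fig. 1 concrete, here is a minimal Python sketch that evaluates the disparity of Equation 1 and the depth it implies, assuming the standard pinhole-stereo relation Z = FB/d; the text above only states that Z is calculated from the disparity, so the function names and example numbers are purely illustrative.

```python
# Hedged sketch of the Fig. 1 geometry: disparity d = x^l - x^r (Equation 1)
# and the usual pinhole-stereo depth relation Z = F * B / d.

def disparity(x_left: float, x_right: float) -> float:
    """Disparity of a conjugate pixel pair on one epipolar line (Equation 1)."""
    return x_left - x_right

def depth(focal_length: float, baseline: float, d: float) -> float:
    """Depth Z from disparity d, assuming Z = F * B / d."""
    if d == 0:
        return float("inf")  # zero disparity corresponds to a point at infinity
    return focal_length * baseline / d

# Example with illustrative numbers: F = 8 mm, B = 120 mm, conjugate pixels at
# x_l = 412 and x_r = 404 (in practice pixel coordinates must be converted to
# metric units before applying the relation).
d = disparity(412.0, 404.0)
print(depth(8.0, 120.0, d))
```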
[9] The real-time stereo image matching system serves, for example, as a vision device for industrial robots, a road recognition device for autonomous vehicles, and a vision device for toy robots and similar home-electronics products. Further, the stereo image matching system can be used together with satellites for producing 3D maps.
[10] However, when a thin object is positioned nearer to the cameras than a background containing a great deal of texture information, the above-described prior art system may extract only a 3D distance value for the background, failing to extract a 3D distance value for the object, because the background information dominates the matching.
[11] Further, since only two cameras are used for the matching, the prior art system suffers from horizontal noise and does not achieve high matching reliability.
[12]
Disclosure of Invention
Technical Problem
[13] In view of the above, the present invention provides a system and method for realtime stereo image matching using multiple cameras. By using multiple cameras, 3D distance information on a thin object spaced apart from a background can be extracted accurately, and also, horizontal noises can be reduced. Accordingly, 3D distance and shape information on an object in an observation space can be measured more accurately, thereby improving matching reliability.
[14]
Technical Solution
[15] In accordance with an aspect of the present invention, there is provided a system for real-time stereo image matching, including: an image processor for converting image signals taken by each of a plurality of cameras into digital images; a plurality of input buffers for extracting interpixel data from the digital images provided from the image processor and reordering the interpixel data; a processing element array having a plurality of processing elements, wherein each of the processing elements sequentially receives the reordered interpixel data and performs image matching by using a disparity value outputted from the processing element of the upper line at a previous clock to produce an optimal decision value or a disparity value; and an encoder for encoding the disparity value or the optimal decision value.
[16] In accordance with another aspect of the present invention, there is provided a method for real-time stereo image matching in a system including a processing element array having a plurality of processing elements, the method including: converting image signals taken by each of a plurality of cameras into digital images; extracting interpixel data from the digital images and reordering the interpixel data; producing an optimal decision value or a disparity value by sequentially providing to each of the processing elements the reordered interpixel data to perform image matching using a disparity value provided from the processing element of the upper stage at a previous clock; and encoding the disparity value or the optimal decision value by using differential coding.
Advantageous Effects
[17] In accordance with the present invention, by using multiple cameras, 3D distance information on a thin object spaced apart from a background can be extracted accurately, and also, horizontal noises can be reduced. Accordingly, 3D distance and shape information on an object in an observation space can be measured more accurately, thereby improving matching reliability.
[18] Further, since the system in accordance with the present invention employs a high-speed parallel-processing matching technique and a high-compression encoding technique, it can be implemented as a small device and applied competitively to various applications.
[19]
Brief Description of the Drawings
[20] The above and other objects and features of the present invention will become apparent from the following description of embodiments, given in conjunction with the accompanying drawings, in which:
[21] Fig. 1 illustrates an explanatory view of a stereo matching using two cameras;
[22] Fig. 2 illustrates a block diagram of a systolic architecture for image matching using multiple cameras in accordance with an embodiment of the present invention;
[23] Fig. 3 illustrates a detailed block diagram of the image matching unit in Fig. 2;
[24] Fig. 4 illustrates a detailed block diagram of the input buffers in Fig. 3;
[25] Fig. 5 illustrates a detailed block diagram of the forward processor in Fig. 3;
[26] Fig. 6 illustrates a detailed block diagram of the backward processor in Fig. 3;
[27] Fig. 7 illustrates a table for explaining the operation of the encoder in Fig. 3;
[28] Fig. 8 illustrates an exemplary view of the input buffer in a case where three cameras are employed;
[29] Fig. 9 illustrates a structure of a processing element array in a case where three cameras are employed;
[30] Fig. 10 illustrates an exemplary view of the forward processor in Fig. 9 in a case where three cameras are employed;
[31] Fig. 11 illustrates a structure of a processing element array in a case where an N-number of cameras are employed; and
[32] Fig. 12 illustrates a flowchart for explaining parallel processing performed by an M-number of processing elements as clock t increments from 1 to 2(M-1).
[33]
Best Mode for Carrying Out the Invention
[34] Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings, which form a part hereof.
[35] Fig. 2 illustrates a block diagram of a systolic architecture for image matching using multiple cameras in accordance with an embodiment of the present invention. An image matching system of the present invention includes an N-number of cameras 10-1 to 10-N, an image processor 20, an image matching unit 30 and a user system 40.
[36] Each of the N-number of cameras 10-1 to 10-N takes an image of an object, and provides an image signal to the image processor 20.
[37] The image processor 20 converts image signals of objects from the respective N- number of cameras 10-1 to 10-N into digital signals to provide an N-number of image data to the image matching unit 30. Under the control of the image processor 20, the N-number of image data are also stored in a storage unit 50.
[38] The image matching unit 30 sequentially receives pixel data on image lines corresponding to an identical epipolar line in the N-number of digital images provided from the image processor 20 to calculate disparity values. The calculated disparity values are provided to the user system 40.
[39] The image matching unit 30 repeatedly performs the above-described operations to thereby provide to the user system 40 the calculated disparity values for all epipolar lines.
[40] The user system 40 is a system that uses distance data based on the disparity values provided from the image matching unit 30. The user system 40 may be, for example, a vision device for an industrial robot, a road recognition device for an autonomous vehicle, a vision device for a toy robot or other home-electronics product, or a 3D map building system working with satellites.
[41] As shown in Fig. 3, the image matching unit 30 includes an N-number of input buffers 31-1 to 31-N, a processing element unit 33 and an encoder 35.
[42] Each of the N-number of input buffers 31-1 to 31-N extracts interpixel data from each of the N-number of digital images provided from the image processor 20 and reorders the interpixel data, respectively. As shown in Fig. 4, each of the input buffers 31-1 to 31-N includes a control unit 31a, two D flip-flops 31b and 31c and a calculator 31d.
[43] The control unit 31a generates enable signals for determining whether or not the D flip-flops 31b and 31c will receive the pixel data, and provides the enable signals to the D flip-flops 31b and 31c. Further, the control unit 31a generates a pixel index and provides the pixel index to the calculator 31d.
[44] Each of the D flip-flops 31b and 31c stores one pixel value based on the enable signal from the control unit 31a, and provides the pixel value to the calculator 31d.
[45] The calculator 31d obtains a weighted sum of two pixel values (i.e., pixel 1 and pixel 2 as shown in Fig. 3) received from the D flip-flops 31b and 31c according to the pixel index from the control unit 31a, and divides the weighted sum by the sum of the weights. For example, division by two, four and eight may be implemented by removing one, two and three least significant bit(s), respectively. The calculator 31d then provides the division result to the processing element unit 33. Here, each of the D flip-flops 31b and 31c may be a register, and the calculator 31d may comprise a multiplier, an adder and a divider.
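As a software illustration of the calculator 31d just described, the following sketch forms the weighted sum of the two buffered pixel values and divides by the sum of the weights, using a right shift for power-of-two divisors as the text suggests. The source does not give the concrete weights per pixel index, so equal weights (a simple midpoint) are assumed here.

```python
# Hedged sketch of the input-buffer calculator 31d: weighted sum of the two
# pixel values held in the D flip-flops 31b/31c, divided by the sum of weights.
# The right shift stands in for the "remove least significant bits" division
# described above (divide by 2, 4, 8 = drop 1, 2, 3 low bits).

def interpixel(pixel1: int, pixel2: int, w1: int = 1, w2: int = 1) -> int:
    weighted_sum = w1 * pixel1 + w2 * pixel2
    total = w1 + w2
    shift = total.bit_length() - 1
    if (1 << shift) == total:      # power-of-two weight sum: shift out low bits
        return weighted_sum >> shift
    return weighted_sum // total   # general case, not needed in the hardware

# Midpoint between two consecutive pixels, i.e. division by two via a 1-bit shift:
print(interpixel(100, 104))  # -> 102
```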
[46] The processing element unit 33 may be implemented with a number of processing elements in the form of a processing element array, the number of processing elements corresponding to a specific maximum disparity value. Each of the processing elements, which includes a forward processor 33a, a trellis queue (e.g., a stack) 33b and a backward processor 33c, can exchange information with its adjacent processing elements. The processing element unit 33 is configured to operate at the maximum speed regardless of the number of processing elements.
[47] The processing element unit 33 sequentially receives the reordered interpixel data and performs image matching by using the disparity values of a processing element on the upper line processed at a previous clock to produce decision values (to be described later) or the disparity values of the current processing line.
[48] The forward processor 33a operates in synchronism with a clock. The forward processor 33a receives the proper pixel values from the image line corresponding to the epipolar line of each of the multiple images to calculate a decision value $v_j^*(t)$ (where $v_j^*(t)$ denotes the decision value calculated by the forward processor 33a of the j-th processing element at the t-th clock), and then stores the decision value $v_j^*(t)$ in the trellis queue 33b. As shown in Fig. 5, the forward processor 33a includes an absolute value calculator 33aa, a multiplexer 33ab, an adder 33ac and a flip-flop 33ad.
[49] The absolute value calculator 33aa calculates a matching cost using the differences between the N-number of pixel data values.
[50] The multiplexer 33ab determines, assuming that processing elements vertically adjacent in the processing element unit 33 are referred to as a top, a middle and a bottom processing element respectively, a minimum value among the accumulated cost $U_{j+1}(t-1)$ at the previous (t-1)-th clock stored in the top processing element, the accumulated cost $U_{j-1}(t-1)$ at the previous (t-1)-th clock stored in the bottom processing element, and the accumulated cost $U_j(t-1)$ at the previous (t-1)-th clock stored in the middle processing element and fed back via the flip-flop 33ad (where $U_j(t)$ denotes the accumulated cost of the j-th processing element at the t-th clock). The minimum value determined by the multiplexer 33ab is then provided to the adder 33ac.
[51] Further, the multiplexer 33ab provides, to the trellis queue 33b, the decision value $v_j^*(t)$ (where the decision values for the top, middle and bottom processing elements are 1, 0 and -1, respectively); the decision value $v_j^*(t)$ represents the originating path of the minimum value (i.e., the least cost) among the top, middle and bottom processing elements.
[52] The adder 33ac adds the minimum value from the multiplexer 33ab and the matching cost at the current clock t from the absolute value calculator 33aa to calculate the accumulated cost $U_j(t)$ of the current element. The accumulated cost $U_j(t)$ is then provided to the flip-flop 33ad.
[53] The flip-flop 33ad outputs the current accumulated cost $U_j(t)$ provided from the adder 33ac at the next clock. In other words, the output data of the forward processor 33a of the j-th processing element at clock t comprise the accumulated cost $U_j(t)$, to be transmitted to the top and bottom elements, and the decision value $v_j^*(t)$, to be stored in the trellis queue 33b and then transmitted to the backward processor 33c.
[54] The trellis queue 33b stores the decision value $v_j^*(t)$ from the forward processor 33a and then transmits it to the backward processor 33c.
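The forward recursion of paragraphs [48] to [53] can be modelled in software as follows. This is a sketch, not the hardware: it updates all M elements in one call, uses the decision values +1/0/-1 for top/middle/bottom as defined above, and assumes infinite cost outside the array boundary.

```python
# Hedged software model of one forward-processor step (Fig. 5): element j adds
# its current matching cost u[j] to the least of the accumulated costs of the
# top (j+1), middle (j) and bottom (j-1) elements at the previous clock, and
# records which neighbour won as the decision value v* in {+1, 0, -1}.

def forward_step(U_prev: list[float], u: list[float]):
    """One synchronous update of all M processing elements.

    U_prev: accumulated costs U_j(t-1); u: matching costs at clock t.
    Returns (U_next, decisions) where decisions[j] is v*_j(t).
    """
    INF = float("inf")
    M = len(U_prev)
    U_next, decisions = [0.0] * M, [0] * M
    for j in range(M):
        candidates = {
            +1: U_prev[j + 1] if j + 1 < M else INF,   # top neighbour
             0: U_prev[j],                             # middle, fed back via 33ad
            -1: U_prev[j - 1] if j - 1 >= 0 else INF,  # bottom neighbour
        }
        v_star = min(candidates, key=candidates.get)
        decisions[j] = v_star                  # pushed into the trellis queue
        U_next[j] = candidates[v_star] + u[j]  # adder 33ac output
    return U_next, decisions
```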
[55] The backward processor 33c carries out an operation on the decision value read from the trellis queue 33b to calculate an optimal disparity value. The optimal disparity value is provided to the encoder 35 in synchronism with the clock. As shown in Fig. 6, the backward processor 33c includes an OR gate 33ca, a one-bit activation D flip-flop 33cb, a demultiplexer 33cc and a tri-state buffer 33cd.
[56] The OR gate 33ca receives the activation signals $a_{j-1}(t-1)\,\delta(1 - v_{t-1,j-1}^*)$ and $a_{j+1}(t-1)\,\delta(1 + v_{t-1,j+1}^*)$ from the adjacent (j-1)-th and (j+1)-th processing elements, respectively, and the activation signal $a_j(t-1)\,\delta(v_{t-1,j}^*)$ fed back via the demultiplexer 33cc (where $a_j(t)$ indicates the activation bit value of the backward processor 33c of the j-th processing element at the t-th clock, and $\delta(\cdot)$ equals 1 when its argument is zero and 0 otherwise). The OR gate 33ca then performs a logical OR operation of these three activation signals to provide the activation bit $a_j(t)$ to the D flip-flop 33cb.
[57] The D flip-flop 33cb temporarily stores the activation bit $a_j(t)$ from the OR gate 33ca while providing the activation bit $a_j(t-1)$ of the previous clock to each of the tri-state buffer 33cd and the demultiplexer 33cc.
[58] The demultiplexer 33cc outputs the activation signals corresponding to the activation bit $a_j(t-1)$ provided from the D flip-flop 33cb to the backward processors of the adjacent processing elements, based on the decision value $v_{t-1,j}^*$ provided from the trellis queue 33b (where $v_{t-1,j}^*$ represents the decision value at the current matching point). That is, the forwarding direction of the activation bit is set using the decision value representing the path stored at the immediately previous matching point. Further, the demultiplexer 33cc outputs the activation signal $a_j(t-1)\,\delta(v_{t-1,j}^*)$ to the OR gate 33ca.
[59] The tri-state buffer 33cd provides, to the encoder 35, an optimal decision value $\hat{v}_t$ which represents the increment/decrement of the disparity, depending on the activation bit from the D flip-flop 33cb. Preferably, when the activation bit is "1" the tri-state buffer 33cd outputs the input value as it is; otherwise the tri-state buffer 33cd is in a high-impedance state and produces no output.
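The backward pass of paragraphs [55] to [59] can likewise be modelled as a single activation token walking back through the stored decisions. The starting element of the token is an assumption of this sketch; in the hardware it is set by the activation logic.

```python
# Hedged software model of the backward pass (Fig. 6): an activation bit
# travels back through the stored decision values; at each step the active
# element's decision v* in {+1, 0, -1} both selects the neighbour to activate
# next and is emitted as the increment/decrement of the disparity.

def backward_pass(trellis: list[list[int]], start: int = 0) -> list[int]:
    """trellis[t][j] = decision v*_j stored at step t; returns emitted decisions."""
    j, emitted = start, []
    for decisions in reversed(trellis):  # the trellis queue is read LIFO
        v = decisions[j]
        emitted.append(v)                # tri-state buffer output of the active PE
        j += v                           # demultiplexer routes the activation bit
    return emitted

# The disparity track is the running sum of the emitted decisions:
# d_{t+1} = d_t + v_t, read back along the image line.
```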
[60] Fig. 7 illustrates a table for explaining the operation of the encoder 35 in Fig. 3.
[61] Concatenation of the optimal decision values output from the backward processor 33c produces a disparity value. However, since the disparity value changes only gradually, depending on the variations of the optimal decision values, the encoder 35 employs differential coding to obtain its output, thereby increasing the compression rate. That is, the encoder 35 does not output the concatenation of the optimal decision values from the backward processor 33c, but produces its output from the variations of the optimal decision values, as shown in Fig. 7, for example.
[62] Basically, representation of two optimal decision values requires four bits. However, if dummy data and nearly nonexistent pairs of optimal decision values are excluded, the two optimal decision values can be represented with only three bits. Further, one of eight binary representations using three bits can be assigned as a flag. As such, the encoder 35 compresses two optimal disparity values into three bits, thereby showing high compression effect.
[63] To be specific, the decision value representing variations of the path has only three values, i.e., "01", "10" and "00", representing "upward", "downward" and "no change" respectively; thus "11" is dummy data. Further, since "0110" and "1001" are geometrically nearly nonexistent pairs of optimal decision values, they can be considered the same as "0000". In consideration of these characteristics of the output disparity value, a high compression rate can be obtained by differential coding and exclusion of the dummy data.
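The following sketch mimics this differential coding. The 2-bit variation codes and the folding rules come from the text above, but the concrete 3-bit code assignments are assumptions of this sketch, since the Fig. 7 table itself is not reproduced in this text; they merely respect the stated rules ("11" never occurs, "0110" and "1001" fold into "0000", one code remains free as a flag).

```python
# Hedged sketch of the encoder 35's differential coding: each optimal decision
# is expressed as a 2-bit variation ("01" = upward, "10" = downward,
# "00" = no change; "11" is dummy), then a pair of variations is packed into
# 3 bits. Mapping +1 to "upward" is an assumption of this sketch.

VARIATION = {+1: "01", -1: "10", 0: "00"}  # upward / downward / no change

PAIR_CODE = {
    "0000": "000", "0001": "001", "0010": "010",
    "0100": "011", "1000": "100", "0101": "101", "1010": "110",
}  # seven valid pairs after folding; "111" reserved as a flag (illustrative)

def encode_pair(v1: int, v2: int) -> str:
    pair = VARIATION[v1] + VARIATION[v2]
    if pair in ("0110", "1001"):  # geometrically (near-)impossible pairs
        pair = "0000"
    return PAIR_CODE[pair]

print(encode_pair(+1, 0))  # "0100" -> "011" under this illustrative table
```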
[64] Fig. 8 illustrates an exemplary view of the input buffer in a case where three cameras are employed. In Fig. 8, a pixel value is inputted to the input buffer once every two clocks. The D flip-flop 31c is always in an enable state, and the D flip-flop 31b is enabled one clock after the pixel value has changed.
[65] Fig. 9 illustrates a structure of a processing element array in a case where three cameras are employed. As shown in Fig. 9, images of first to third cameras are inputted to each element. Specifically, the images are sequentially inputted in a manner that the image of the first camera is inputted toward a top processing element producing a low disparity value, and the image of the third camera is inputted toward a bottom processing element producing a high disparity value.
[66] Fig. 10 illustrates an exemplary view of the forward processor fp in Fig. 9 in a case where three cameras are employed. Pixel values such as $x_1^t$, $x_2^t$ and $x_3^t$ from the three cameras are inputted to the absolute value calculator 33aa, and the sum of the differences of these pixel values is outputted as a matching cost.
[67] Fig. 11 illustrates a structure of a processing element array in a case where an N-number of cameras are employed. As shown in Fig. 11, each of the processing elements has the same structure, having a forward processor fp, a stack and a backward processor bp, as in Fig. 9.
[68] One pixel of each of the N-number of images is inputted to the forward processor in each processing element. An input pixel value $x_i$ from the i-th camera has an index of $x_c + \frac{N+1-2i}{2(N-1)}D$, where $D$ is a disparity level ($D$ equals zero for an infinitely distant point ($Z = \infty$) and equals $M$, the maximum disparity level, for the closest point), $N$ is the number of cameras (or images) and $x_c$ is the index of the camera at center. Since the index of $x_c$ is the same as the index of the resulting disparity, the index $x_c$ is used as the reference for obtaining the index of $x_i$ even if a center camera does not exist.
[69] In this structure, one processing element array performs parallel computation in synchronism with a time clock to produce a result corresponding to one disparity index for every two time clocks. That is, the index $x_c$ at time clock t becomes $t/2$. When the index $x_c$ is $t/2$, the index of the input pixel value $x_i$ from the i-th camera in the j-th processing element at time clock t becomes $\frac{t}{2} + \frac{N+1-2i}{2(N-1)}j$ (where $x_i^t\left[\frac{t}{2} + \frac{N+1-2i}{2(N-1)}j\right]$ is the image register value, i.e., the interpixel value, of the j-th processing element of the i-th camera at clock t).
[70] A pixel value $x_i$ having the above-described index is inputted to the forward processor fp. In operation, a total of N pixels, i.e., one pixel on the image line corresponding to an epipolar line for each of the N camera images, are inputted to each processing element. Each processing element exchanges the cost values $U$ with its adjacent processing elements. Further, in each processing element, the forward processor fp provides to the stack a direction value of -1, 0 or 1. The direction value represents which processing element among the (j-1)-th, j-th and (j+1)-th processing elements has the least cost at the (t-1)-th clock; the values -1, 0 and 1 mean that the processing element having the least cost is the (j-1)-th, j-th and (j+1)-th element, respectively. Therefore, the direction value informs from which processing element the cost is provided.
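The index arithmetic of paragraphs [68] and [69] is easy to check in software. The sketch below evaluates $\frac{t}{2} + \frac{N+1-2i}{2(N-1)}j$; for the centre camera the offset term vanishes, reproducing the $x_c = t/2$ behaviour stated above. Rounding policy is left open, since the fractional indices are exactly what the input buffers' interpixel interpolation supplies.

```python
# Hedged sketch of the input-pixel indexing: camera i, processing element j,
# clock t. Fractional indices correspond to the interpixel (half-pixel)
# values produced by the input buffers.

def pixel_index(t: int, i: int, j: int, n_cameras: int) -> float:
    """Index of the interpixel value x_i required by PE j at clock t."""
    return t / 2 + (n_cameras + 1 - 2 * i) * j / (2 * (n_cameras - 1))

# Three cameras (N = 3), PE j = 4, clock t = 20: cameras 1, 2 and 3 read
# indices 12.0, 10.0 and 8.0, i.e. a symmetric spread around the centre
# camera's reference index t/2 = 10.
for cam in (1, 2, 3):
    print(cam, pixel_index(20, cam, 4, 3))
```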
[71] Fig. 12 illustrates a flowchart for explaining the parallel processing performed by an M-number of processing elements as the clock t increments from 1 to 2(M-1). As shown in Fig. 12, first of all, the processing elements are initialized in step S1201. To be specific, all cost registers and image registers in the processing element array are initialized, and then the clock is set to 1. Thereafter, an image pixel from each of the N cameras is inputted to each processing element in step S1203. The M-number of processing elements operate in parallel in step S1205 (e.g., the operation of the j-th processing element is shown in Fig. 12). In each processing element, the forward processor 33a processes the interpixel data of the current image line provided thereto, and the backward processor 33c reads from the trellis queue 33b a decision value of the previous image line processed by the forward processor 33a and processes it. It is judged, in step S1207, whether the parallel processing of step S1205 has been performed until clock 2(M-1). If not, the clock is incremented by one in step S1209 and the process returns to step S1203. If it has been performed until clock 2(M-1), the processed result is provided to the encoder 35 in step S1211.
[72] A multi-camera cost $u_j(t)$ of the forward processor 33a of the j-th processing element at the t-th clock is calculated as in Equation 2, so that the effect of all the cameras is exerted uniformly on the cost:
[73]
[74] MathFigure 2
[Math.2] (the cost equation is given only as an image in the source text; per the surrounding description, it combines, with uniform weight per camera, the N interpixel values $x_i^t\left[\frac{t}{2} + \frac{N+1-2i}{2(N-1)}j\right]$, $i = 1, \ldots, N$, through the absolute value calculator)
[75]
[76] In Fig. 12, $U_j(t)$ is the cost register value of the forward processor 33a of the j-th processing element at the t-th clock, and $x_i^t\left[\frac{t}{2} + \frac{N+1-2i}{2(N-1)}j\right]$ indicates the image register value of the j-th processing element of the i-th camera at the t-th clock (i.e., the interpixel value having an index of $\frac{t}{2} + \frac{N+1-2i}{2(N-1)}j$). In addition, $v_j^*(t)$ denotes the decision value which is provided from the forward processor 33a in the j-th processing element and stored in the trellis queue 33b at the t-th clock, and $a_j(t)$ represents the activation bit value of the backward processor 33c in the j-th processing element at the t-th clock. Further, $a_j(t-1)\,\delta(p + v_{t-1,j}^*)$ denotes an activation signal of the backward processor to be transmitted to the adjacent processing elements, where $p = 1$ and $p = -1$ when the activation signal is transmitted to an upper and a lower processing element, respectively. Furthermore, $\hat{v}_t$ indicates the disparity value at the t-th clock, M denotes the number of pixels on an image line (e.g., M = 1024 in a 1024x768 image) and $\operatorname{argmin} func(x)$ produces the parameter x that minimizes the function $func(x)$.
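Putting the pieces together, a compact software emulation of the Fig. 12 loop might look as follows. It reuses forward_step and backward_pass from the sketches above, abstracts the clocking to one forward step per pixel position, and substitutes uniform random costs for the multi-camera cost of Equation 2, whose exact form is given only as an image in the source.

```python
# Hedged end-to-end emulation: forward pass over the line (S1203-S1209),
# trellis stacking, backward pass and disparity integration (S1211).
# Requires forward_step() and backward_pass() defined in the earlier sketches.

import random

def match_line(n_pe: int, n_steps: int, seed: int = 0) -> list[int]:
    random.seed(seed)
    U = [0.0] * n_pe                   # cost registers after initialisation (S1201)
    trellis = []                       # trellis queues, one column per step
    for _ in range(n_steps):           # forward pass along the image line
        costs = [random.random() for _ in range(n_pe)]  # stand-in for Eq. 2
        U, decisions = forward_step(U, costs)
        trellis.append(decisions)
    emitted = backward_pass(trellis)   # backward pass feeding the encoder
    profile, d = [], 0
    for v in emitted:                  # disparity = running sum of decisions
        d += v
        profile.append(d)
    return profile

print(match_line(n_pe=16, n_steps=30))
```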
[77] While the invention has been shown and described with respect to the embodiments, it will be understood by those skilled in the art that various changes and modification may be made without departing from the scope of the invention as defined in the following claims.

Claims

Claims
[1] A system for real-time stereo image matching, comprising: an image processor for converting image signals taken by each of a plurality of cameras into digital images; a plurality of input buffers for extracting interpixel data from the digital images provided from the image processor and reordering the interpixel data; a processing element array having a plurality of processing elements, wherein each of the processing elements sequentially receives the reordered interpixel data and performs image matching by using a disparity value outputted from the processing element of the upper stage at a previous clock to produce an optimal decision value or a disparity value; and an encoder for encoding the disparity value or the optimal decision value.
[2] The system of claim 1, wherein each of the input buffers includes: a control unit for generating two enable signals and a pixel index; two D flip-flops for storing one pixel value, respectively, in response to the enable signals; and a calculator for calculating a pixel value to be provided to the processing elements by using the pixel index and two pixel values outputted from the two D flip-flops.
[3] The system of claim 2, wherein each of the D flip-flops includes a register.
[4] The system of claim 2, wherein the calculator includes a multiplier, an adder and a divider to obtain a weighted sum of the two pixel values outputted from the two D flip-flops according to the pixel index and divide the weighted sum by the sum of the weights.
[5] The system of claim 1, wherein each of the processing elements includes: a forward processor for receiving a pixel value from an image line corresponding to an epipolar line of each of digital images to calculate a decision value; a trellis queue for storing the calculated decision value; and a backward processor for calculating the disparity value based on the decision value stored in the trellis queue, and the calculated disparity value being provided to the encoder.
[6] The system of claim 5, wherein the forward processor includes: an absolute value calculator for calculating a matching cost by using the difference between the pixel values from the input buffers; a multiplexer for determining a minimum value between accumulated matching costs of adjacent processing elements from the absolute value calculator and a previous accumulated matching cost to produce the decision value representing a path of the minimum value; an adder for adding the decision value from the multiplexer and the matching cost calculated by the absolute value calculator to calculate an accumulated matching cost; and a flip-flop for temporarily storing the calculated accumulated matching cost, wherein the accumulated matching cost is provided to the multiplexer as the previous accumulated matching cost at a next clock.
[7] The system of claim 5, wherein the backward processor includes: an OR gate for calculating logical sum of activation signals from adjacent processing elements and a feed-backed activation signal to produce an activation bit; a flip-flop for temporarily storing the activation bit from the OR gate until a next clock; a demultiplexer for outputting an activation signal corresponding to the activation bit from the flip-flop to backward processors of the adjacent processing elements according to the decision value from the trellis queue and to the OR gate as the feed-backed activation signal; and a buffer for outputting the optimal decision value representing increment/ decrement of the disparity value depending on the activation bit from the flip- flop.
[8] The system of claim 7, wherein the buffer outputs the input value as it is when the input value is "1", and becomes a high impedance state and outputs no value when the input value is not "1".
[9] The system of claim 1, wherein the encoder encodes the disparity value or the optimal decision value by using differential coding.
[10] A method for real-time stereo image matching in a system including a processing element array having a plurality of processing elements, the method comprising: converting image signals taken by each of a plurality of cameras into digital images; extracting interpixel data from the digital images and reordering the interpixel data; producing an optimal decision value or a disparity value by sequentially providing to each of the processing elements the reordered interpixel data to perform image matching using a disparity value produced from the processing element of the upper stage at a previous clock; and encoding the disparity value or the optimal decision value by using differential coding.
[11] The method of claim 10, wherein producing the optimal decision value or the disparity value includes: calculating a decision value by receiving a pixel value from an image line corresponding to an epipolar line of each of digital images; temporarily storing the calculated decision value; and calculating the disparity value based on the stored decision value.
[12] The method of claim 11, wherein calculating the decision value includes: calculating a matching cost by using the difference between the pixel values; determining a minimum value between accumulated matching costs of adjacent processing elements and a previous accumulated matching cost; producing the decision value, wherein the decision value represents a path of the minimum value; adding the decision value and the matching cost to calculate an accumulated matching cost; and temporarily storing the calculated accumulated matching cost until providing the accumulated matching cost as the previous accumulated matching cost at a next clock.
[13] The method of claim 11, wherein calculating the disparity value includes: calculating a logical sum of activation signals from adjacent processing elements and a fed-back activation signal to produce an activation bit; temporarily storing the activation bit; outputting an activation signal corresponding to the activation bit stored at a previous clock to the adjacent processing elements according to the decision value and feeding back the activation signal; and outputting the optimal decision value representing increment/decrement of the disparity value depending on the activation bit.
[14] The method of claim 10, wherein extracting interpixel data from the digital images and reordering the interpixel data includes: generating a pixel index; storing two pixel values; and calculating a pixel value to be provided to the processing elements by using the pixel index and the stored pixel values.
[15] The method of claim 14, wherein calculating the pixel value includes: obtaining a weighted sum of the two pixel values according to the pixel index; and dividing the weighted sum by the sum of the weights.
PCT/KR2008/006267 2007-10-25 2008-10-23 System and method for real-time stereo image matching WO2009054683A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020070107551A KR100926127B1 (en) 2007-10-25 2007-10-25 Real-time stereo matching system by using multi-camera and its method
KR10-2007-0107551 2007-10-25

Publications (1)

Publication Number Publication Date
WO2009054683A1 true WO2009054683A1 (en) 2009-04-30

Family

ID=40579721

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2008/006267 WO2009054683A1 (en) 2007-10-25 2008-10-23 System and method for real-time stereo image matching

Country Status (2)

Country Link
KR (1) KR100926127B1 (en)
WO (1) WO2009054683A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103238339B (en) 2010-12-02 2015-12-09 尤特瑞登特生产公司 Check and follow the tracks of the system and method for stereoscopic video images
AU2013202775B2 (en) * 2012-06-01 2015-09-17 Ultradent Products, Inc. Stereoscopic video imaging

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040069623A (en) * 2003-01-30 2004-08-06 학교법인 포항공과대학교 A multilayered real-time stereo matching system using the systolic array and method thereof
JP2005009883A (en) * 2003-06-16 2005-01-13 Calsonic Kansei Corp Stereoscopic camera apparatus
KR20060023714A (en) * 2004-09-10 2006-03-15 학교법인 포항공과대학교 System and method for matching stereo image
KR20060041060A (en) * 2004-11-08 2006-05-11 한국전자통신연구원 Apparatus and method for production multi-view contents

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9232123B2 (en) 2010-10-06 2016-01-05 Hewlett-Packard Development Company, L.P. Systems and methods for acquiring and processing image data produced by camera arrays
WO2012056437A1 (en) 2010-10-29 2012-05-03 École Polytechnique Fédérale De Lausanne (Epfl) Omnidirectional sensor array system
US10362225B2 (en) 2010-10-29 2019-07-23 Ecole Polytechnique Federale De Lausanne (Epfl) Omnidirectional sensor array system

Also Published As

Publication number Publication date
KR20090041843A (en) 2009-04-29
KR100926127B1 (en) 2009-11-11

Similar Documents

Publication Publication Date Title
CN109598754B (en) Binocular depth estimation method based on depth convolution network
EP0918439B1 (en) Device for converting two-dimensional video into three-dimensional video
US9661227B2 (en) Method, circuit and system for stabilizing digital image
WO2009101798A1 (en) Compound eye imaging device, distance measurement device, parallax calculation method and distance measurement method
WO1999053681A2 (en) Method and apparatus for measuring similarity using matching pixel count
JP5197683B2 (en) Depth signal generation apparatus and method
US20020012459A1 (en) Method and apparatus for detecting stereo disparity in sequential parallel processing mode
US20060055703A1 (en) Stereo image matching method and system using image multiple lines
Ttofis et al. High-quality real-time hardware stereo matching based on guided image filtering
US8427524B2 (en) Message propagation-based stereo image matching system
CN112070821A (en) Low-power-consumption stereo matching system and method for acquiring depth information
US5214751A (en) Method for the temporal interpolation of images and device for implementing this method
WO2009054683A1 (en) System and method for real-time stereo image matching
EP1445964A2 (en) Multi-layered real-time stereo matching method and system
CN111028273B (en) Light field depth estimation method based on multi-stream convolution neural network and implementation system thereof
US20220148143A1 (en) Image fusion method based on gradient domain mapping
Ustukov et al. Implementing one of stereovision algorithms on FPGA
CN112785634A (en) Computer device and synthetic depth map generation method
US20040228521A1 (en) Real-time three-dimensional image processing system for non-parallel optical axis and method thereof
JP4862004B2 (en) Depth data generation apparatus, depth data generation method, and program thereof
EP1830562A1 (en) Learning device, learning method, and learning program
CN108401125B (en) Video data processing method, device and storage medium
KR100795974B1 (en) Apparatus for realtime-generating a depth-map by processing streaming stereo images
CN107194334B (en) Video satellite image dense Stereo Matching method and system based on optical flow estimation
CN107590857A (en) For generating the apparatus and method of virtual visual point image

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08840970

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08840970

Country of ref document: EP

Kind code of ref document: A1