WO2009014314A1

WO2009014314A1 - Belief propagation based fast systolic array and method thereof

Info

Publication number: WO2009014314A1
Application number: PCT/KR2008/003280
Authority: WO
Inventors: Hong Jeong; Sung Chan Park
Original assignee: Postech Academy-Industry Foundation
Priority date: 2007-06-29
Filing date: 2008-06-12
Publication date: 2009-01-29
Also published as: KR100920227B1; KR20090001026A

Abstract

In a belief propagation based fast systolic array, a hierarchical dynamic Bayesian network of nodes corresponding to pixels of input left and right image pixel data is generated in consideration of an iteration axis and scale levels. Further, messages on the generated dynamic Bayesian network are updated in a specific axis direction on a Markov random field.

Description

BELIEF PROPAGATION BASED FAST SYSTOLIC ARRAY AND METHOD THEREOF

Field of the Invention

The present invention relates to a belief propagation (BP) based fast systolic array and a method thereof; and, more particularly, to a systolic array that can perform parallel computation with a compact memory by using hierarchical BP based characteristics while reducing a total memory size when the number of iterations is small, and a method thereof.

Background of the Invention

Fig. 1 illustrates an MRF (Markov Random Field) network for stereo matching and a conventional BP update rule.

In the conventional BP technique, when nodes corresponding to image pixels are regularly connected with each other as shown in Fig. 1, a 2-dimensional (2D) MRF network having a size of N₁ x N₀ is defined. Referring to P. F. Felzenszwalb and D. P. Huttenlocher, "Efficient Belief Propagation for Early Vision", in Proc. 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, No. 1, pages 261-268, 2004 (hereinafter, referred to as "Patent Reference 1"), when BP is performed on an MRF using hierarchical data costs, a low error rate can be obtained with a small number of iterations. However, if an image is large, it takes a long time to process the image due to the large number of nodes, and a large message memory is also needed.

In the N₁ x N₀ MRF network, a 2D vector is represented by x = [x₀ x₁]^T using elements x₀ and x₁, and a position of a node is represented by a 2D vector p = [p₀ p₁]^T. Further, a data cost D_p(d_p) and an edge cost V(d_p, d_q) are allocated to a hidden state d_p of each node and a hidden state pair d_p, d_q of each edge on the MRF graph, respectively. Then, an approximate solution for a MAP (Maximum A Posteriori) solution, i.e., a state that minimizes the sum of all of the costs on the MRF network, can be computed by using BP, as in Equation 1.

Equation 1 shows an energy cost model in the MRF network.

[Equation 1]

$$\hat{d} = \arg\min_{d} E(d),$$

$$E(d) = \sum_{(p,q) \in N} V(d_p, d_q) + \sum_{p \in P} D_p(d_p)$$
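As a rough illustration of the energy model in Equation 1, the following sketch evaluates E(d) for a small disparity map. It assumes a 4-connected grid in which each edge is counted once, and it assumes the truncated linear edge cost with parameters α_v and K_v that the description mentions later; the function and argument names are hypothetical.

```python
def mrf_energy(disparity, data_cost, alpha_v, k_v):
    """Sum of data costs and edge costs over a 4-connected grid (sketch).

    disparity[p0][p1]    -> chosen state d_p of node p
    data_cost[p0][p1][d] -> D_p(d)
    Edge cost (assumed): V(d_p, d_q) = min(alpha_v * |d_p - d_q|, k_v)
    """
    rows, cols = len(disparity), len(disparity[0])
    energy = 0
    for p0 in range(rows):
        for p1 in range(cols):
            d = disparity[p0][p1]
            energy += data_cost[p0][p1][d]
            # Count each edge once: the lower and the right neighbour.
            if p0 + 1 < rows:
                energy += min(alpha_v * abs(d - disparity[p0 + 1][p1]), k_v)
            if p1 + 1 < cols:
                energy += min(alpha_v * abs(d - disparity[p0][p1 + 1]), k_v)
    return energy
```

BP does not evaluate this sum directly; it searches for a state assignment that approximately minimizes it.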

A message calculation process is as in Equation 2.

[Equation 2]

Here, N_b represents neighboring nodes, N_b(p)\q represents the nodes neighboring a node p except for a node q, and m^t_pq(d_q) represents a message transmitted from the node p to the node q. Like the BP update rule shown in Fig. 1, the message m^t_pq(d_q) is calculated by adding the messages transmitted to the node p from the nodes neighboring the node p except for the node q, the data cost of the node p, and the edge cost of the edge from the node p to the node q. In Equation 2, α is a normalization parameter and corresponds to an average of all state costs of the messages of each node. The message m^t_pq(d_q) is calculated at each iteration and transmitted from the node p to the node q.

As in Equation 3, the MAP state d_p, which is a disparity value, can be estimated by adding messages transmitted to the node q from the neighboring nodes at the final iteration T and determining a state having a minimum cost for each node p.

[Equation 3]

As described above, in the conventional BP update rule, not all messages are stored in the memory; instead, the MRF network is scanned in a specific axis direction.
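For reference, the message update and state estimation described for Equations 2 and 3 match the standard min-sum BP form of Patent Reference 1. The block below is a sketch of that standard form in the notation of the surrounding text. It is an assumption about how the two equations read, not a copy of the patent's own typesetting.

```latex
% Assumed standard min-sum form; symbols follow the surrounding text.
% Message from node p to node q at iteration t (cf. Equation 2):
m^{t}_{pq}(d_q) = \min_{d_p}\Big( V(d_p, d_q) + D_p(d_p)
    + \sum_{s \in N_b(p)\setminus q} m^{t-1}_{sp}(d_p) \Big) - \alpha

% State estimation at the final iteration T (cf. Equation 3):
\hat{d}_p = \arg\min_{d_p}\Big( D_p(d_p)
    + \sum_{q \in N_b(p)} m^{T}_{qp}(d_p) \Big)
```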

Fig. 2 illustrates a layer structure in which a layer corresponds to an iteration of a message at each node shown in Fig. 1. As shown in Fig. 2, the MRF network structure shown in Fig. 1 is constructed as a dynamic Bayesian network in which a layer is stacked each time a message is repeatedly calculated at each node, the number of iterations 't' corresponding to a layer index 'l'. When p(l) represents the coordinate of a node at the l-th iteration layer, a layer transformation equation of the dynamic Bayesian network that tilts the position of a node for each iteration in a scanning axis direction b = [1 0]^T is represented as Equation 4.

[Equation 4]

(b = [1 0]^T)

Figs. 3A and 3B show the vertical rearrangement result of the p₀(l) nodes according to Equation 4, i.e., a layer-transformed FBP (Fast Belief Propagation) structure and a message update sequence.

[Equation 5]

p(l) = p(l-1) - [1 0]^T

As shown in Equation 5, the node p(l) and the node p(l-1) differ from each other by an offset -[1 0]^T on the layer structure.

In the dynamic Bayesian network structure, nodes are grouped, and nodes at the same layer in a group are then parallel-processed to obtain messages thereof. To be specific, messages of nodes at the previous layer in a group are read from a local buffer of the group, and messages of nodes within the adjacent group are read from a layer buffer in which messages of nodes in the previously processed group are stored.

As shown in Figs. 3A and 3B, the final iteration message is calculated as the layer buffer is right-shifted, i.e., as the layer buffer moves in a positive direction of the p₀ axis. That is, messages of nodes in a group are calculated in parallel and stored in the local buffer. The messages stored in the local buffer are used to process nodes at the next (upper) layer in the group. Further, the messages stored in the local buffer are also stored in the layer buffer for use in processing messages of nodes in a next group. Accordingly, the messages can be processed by using a small layer buffer and a small local buffer.

However, as described above, when the conventional BP technique is applied to stereo matching, a large number of iterations is needed, and thus the size of the layer buffer for fast BP (FBP) grows with the number of iterations. Considering that semiconductor and information communications technologies are being rapidly developed at present, there is a need for a fast systolic array for BP based stereo matching that can reduce the size of the layer buffer using the characteristics of the hierarchical BP structure, and a method thereof.

In the hierarchical BP structure, messages converge rapidly from a coarse level to a fine level within a short iteration time by using data costs at K different scale levels. However, even if rapid convergence is achieved due to the hierarchical structure, a large memory is still needed. In N₀ x N₁ left and right images, when the number of iterations at each level, the number of states, and the size of the state cost are L^k, S, and B bits, respectively, the size of a message memory becomes 4N₁N₀SB bits and the size of a data cost memory becomes N₁N₀SB bits. Therefore, the total memory size becomes 5N₁N₀SB bits. Further, since the number of nodes at the scale level k is (N₁/2^k) x (N₀/2^k), the total computation amount becomes

$$\sum_{k=0}^{K-1} S L^k (N_1/2^k)(N_0/2^k).$$

Summary of the Invention

In view of the above, the present invention provides a BP based fast systolic array that can perform parallel computation with a compact memory by using hierarchical BP based characteristics while reducing a total memory size when the number of iterations is small, and a method thereof.

In accordance with an aspect of the present invention, there is provided a belief propagation based fast systolic array, wherein a hierarchical dynamic Bayesian network of nodes corresponding to pixels of input left and right image pixel data is generated in consideration of an iteration axis and scale levels, and messages on the generated dynamic Bayesian network are updated in a specific axis direction on a Markov random field.

In accordance with another aspect of the present invention, there is provided a belief propagation based fast systolic array method, including:

(a) storing left and right image pixel data input by raster scanning; and

(b) outputting a disparity image fast and in parallel using the pixel data stored at the step (a).

In accordance with the present invention, parallel calculation is performed with a compact memory by using hierarchical BP based characteristics while the total memory size is reduced when the number of iterations is small. With the reduction in the memory size, fast parallel processing can be carried out by using a compact distributed memory provided in an existing VLSI (Very Large Scale Integration) chip and a parallel processor for accessing the memory. In addition to the fast parallel processing being carried out with a small amount of memory resources, the message update of the present invention can be performed by simple integer calculation, and thus the fast systolic array of the present invention can be manufactured as a compact parallel VLSI chip, e.g., an FPGA (Field-Programmable Gate Array) or an ASIC (Application-Specific Integrated Circuit), having a small memory size. Therefore, a complex image processing system can be manufactured as an inexpensive and small device which performs fast real-time processing.

Brief Description of the Drawings

The above features of the present invention will become apparent from the following description of embodiments, given in conjunction with the accompanying drawings, in which:

Fig. 1 illustrates an MRF network for stereo matching and a conventional BP update rule;

Fig. 2 illustrates a layer structure in which a layer corresponds to an iteration of a message at each node shown in Fig. 1;

Figs. 3A and 3B respectively illustrate an explanatory view of a layer-transformed FBP (Fast Belief Propagation) structure and a message update sequence;

Fig. 4 illustrates a dynamic Bayesian network having a hierarchical BP structure;

Figs. 5A to 5D respectively illustrate a layer-transformed hierarchical structure;

Figs. 6A to 6D respectively illustrate a sequence diagram when Figs. 5A to 5D are viewed from a different angle;

Fig. 7 illustrates a detailed structure of a layer buffer and a local buffer;

Figs. 8A and 8B illustrate the message update processes in the structure before layer transformation and in the layer-transformed structure, respectively;

Fig. 9 illustrates a data cost reading process in the layer-transformed hierarchical FBP structure shown in Figs. 5A to 5D;

Fig. 10 illustrates a configuration of a BP based fast systolic array for use in stereo matching in accordance with the present invention;

Fig. 11 illustrates systolic array architecture of an FBP stereo matching module;

Fig. 12 illustrates a detail view of a processing element (PE) group;

Fig. 13 illustrates architecture of a data cost module;

Fig. 14 illustrates a detail view of a module A in the data cost module shown in Fig. 13;

Fig. 15 illustrates a detail view of a module B in the data cost module shown in Fig. 13;

Fig. 16 illustrates a buffer distribution structure of PEs in a PE group;

Fig. 17 illustrates a detail view of a PE module;

Figs. 18A and 18B respectively illustrate a forward processor in the PE module shown in Fig. 17;

Figs. 19A and 19B respectively illustrate a backward processor in the PE module shown in Fig. 17;

Fig. 20 illustrates an FBP stereo matching sequence on nodes of a FBP stereo matching module;

Fig. 21 illustrates a flowchart of the FBP stereo matching sequence;

Fig. 22 illustrates a sequential calculation procedure for message update within a group in the FBP stereo matching sequence of Fig. 21;

Fig. 23 illustrates a parallel calculation procedure for message update within a group in the FBP stereo matching sequence of Fig. 21;

Fig. 24 illustrates a calculation sequence in the data cost module;

Fig. 25 illustrates a local index for buffer update in Figs. 22 and 23; and

Fig. 26 illustrates a comparison result of an error rate with other real-time stereo matching systems.

Detailed Description of the Invention

Embodiments of the present invention will be described in detail with reference to the accompanying drawings, which form a part hereof.

Fig. 4 illustrates a dynamic Bayesian network having a hierarchical BP structure. As shown in Fig. 4, as the number of nodes increases from a coarse level to a fine level, iteration layers are formed. The number of nodes at a level k is N₁/2^k x N₀/2^k on an N₁ x N₀ MRF network.

If the coordinate of a node at the level k and a k-level iteration layer on the dynamic Bayesian network are represented by p^k = [p₀^k p₁^k]^T and l^k ∈ [0, L^k-1], respectively, a layer transformation equation to tilt the position of a node at each iteration in a scanning axis direction b = [1 0]^T is as in Equation 6, when the different scale characteristics at a previous level are taken into consideration.

[Equation 6]

$$p_0^k(l^k) = a^k - l^k + 2\,p_0^{k+1}(L^{k+1})$$

Here, a^k is an offset generated by a scale difference between levels k and k-1 with respect to p^(K-1) at the coarsest level.

If the nodes on p₀^k(l^k) are vertically re-arrayed in the layer structure according to Equation 6, a sequence in the layer-transformed hierarchical structure shown in Figs. 5A to 5D is obtained. Figs. 6A to 6D respectively illustrate a sequence diagram when Figs. 5A to 5D are viewed from a different angle. Fig. 9 illustrates a data cost reading process in the layer-transformed hierarchical FBP structure shown in Figs. 5A to 5D.

Next, as shown in Fig. 7, all nodes on the p₁^k(l^k) axis are grouped and a processor is assigned to each node. Then, the processors within a group perform parallel processing, such that the MRF network is scanned in the p₀^k(l^k) axis direction. To be specific, messages of nodes at the previous layer in a group are read from a local buffer of the previous layer, and messages of nodes within a group of a previous line are read from a layer buffer. Further, as shown in Figs. 5A to 5D, a message at the final iteration is calculated as the layer buffer is right-shifted, i.e., as the layer buffer moves in a positive direction of the p₀^k(l^k) axis. The messages of the nodes in the group are calculated in parallel and stored in the local buffer to be then used in processing a next (upper) layer. Further, the messages are also stored in the layer buffer to be then used in processing messages of a next group. As a result, the same result as that of the hierarchical BP structure can be obtained using a small layer buffer and a small local buffer.

Fig. 10 illustrates a configuration of a BP based fast systolic array for use in stereo matching in accordance with the present invention.

Referring to Fig. 10, the BP based fast systolic array includes: an image buffer 10 that receives and temporarily stores left and right image pixel data input by a raster scan method; and an FBP stereo matching module 13 that outputs a disparity image fast and in parallel by using the left and right pixel data output from the image buffer 10.

As shown in Fig. 11, the FBP stereo matching module 13 includes a plurality of PE (Processing Element) groups which exchange messages and pixel data with each other. The FBP stereo matching module 13 supports fast parallel processing. As shown in Fig. 12, each PE group includes: a data cost module 13a that receives pixel data and calculates data costs; a plurality of multiplexers (MUX) 13b that receive the data costs from the data cost module 13a and messages from neighboring PE groups and select a desired output; a plurality of processing elements PE 13c that calculate a new message by using the output of the MUXs 13b; a plurality of local buffers 13d that store a result value of the PEs 13c; and a layer buffer 13e that stores the result value of the local buffers 13d again.

Each PE 13c includes: an adder that sequentially reads and adds the data costs and the messages of the previous layer by states; a forward processor that receives the output of the adder and outputs a forward processor cost; a forward stack that receives and stores the forward processor cost; a backward processor that receives an output value of the forward stack and outputs a backward processor cost; a backward stack that stores an output value of the backward processor; a normalizer that receives an output value of the backward stack and calculates a final message; and a buffer that stores an output value of the normalizer.

The forward processor includes: a first forward processor that initializes a first delay buffer, reads an input cost value at each step, compares the input cost value with a value obtained by adding a constant value to a previous value of the first delay buffer, stores the minimum of the two in the first delay buffer, and outputs the minimum value; and a second forward processor that initializes a second delay buffer, calculates a minimum value of the input cost by using the second delay buffer, and outputs a value obtained by adding a constant value to the minimum value.

The backward processor includes: a first backward processor that initializes a first delay buffer, reads an input cost value at each step, compares the input cost value with a value obtained by adding a constant value to the value of the first delay buffer, stores the minimum of the two in the first delay buffer, compares an output value of the first delay buffer with an output value of the forward processor, and outputs the minimum of the two; and a second backward processor that initializes a second delay buffer to "0", stores in the second delay buffer a value obtained by adding the output value of the first delay buffer to the value of the second delay buffer at each step, and shifts and outputs the value of the second delay buffer by a specific value.

The normalizer outputs a value obtained by subtracting the value calculated by the second backward processor from the value calculated by the first backward processor, thereby calculating a message.

Specifically, if the number of levels is K, the PE group has 2^(K-1) PEs 13c in total (see Figs. 11 and 12). Accordingly, N₁/2^(K-1) PE groups are needed for an N₁ x N₀ image. Since the number of nodes at each level varies according to the coarse-to-fine scale characteristics, when the FBP stereo matching sequence operates at level k within the PE group, only N₁/2^k PEs operate in parallel. As shown in Fig. 16, each PE has a local buffer for each level and a layer buffer, and accesses these buffers through the MUX. As shown in Fig. 13, the data cost module is a logic for calculating data costs and performs a function as in Equation 7.

[Equation 7]

As shown in Fig. 14, a module A in the data cost module includes: registers that store right and left pixel data g^r(p₀, p₁+d) and g^l(p₀, p₁); and a logic that calculates an absolute difference between the register values. That is, a data cost D'_p(d) is an output value of the module A. The right pixel data is shifted by a shift logic to a neighboring register by a value d in Equation 7, and thus the data cost D'_p(d) is output for each value d. As shown in Fig. 15, a module B in the data cost module is a logic that performs an operation of Equation 8 to calculate the data cost D_p(d). As shown in Fig. 13, two neighboring data costs D'_p(d) are added at each level k, and thus the final data cost D_p^k(d) is calculated by adding 2^k scan lines using a register and an accumulator.

[Equation 8]

To be specific, as shown in Fig. 24, the data cost D_p^k(d) in the accumulator is initialized first, and left and right scan lines corresponding to each e₀ ∈ [0, 2^k-1] are loaded. Then, the value of Equation 9 is accumulated to the data cost D_p^k(d). Here, each data cost for the value d needs to be calculated.

[Equation 9]

B₁=O
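The data cost calculation described above can be sketched as follows. This is an illustrative reading, not the patent's Equations 7 to 9: it assumes the pixel cost is the plain absolute difference between g^r(p₀, p₁+d) and g^l(p₀, p₁), that the level-k cost sums the pixel costs of a 2^k x 2^k block, and that out-of-range comparisons cost zero. All function names are hypothetical.

```python
def pixel_cost(left_row, right_row, p1, d):
    """D'_p(d): absolute difference between g^l(p0, p1) and g^r(p0, p1 + d)."""
    if p1 + d >= len(right_row):
        return 0  # assumption: comparisons past the image border cost nothing
    return abs(right_row[p1 + d] - left_row[p1])


def level_cost(left, right, p0, p1, d, k):
    """D_p^k(d): pixel costs accumulated over a 2^k x 2^k block (assumed form)."""
    size = 2 ** k
    total = 0  # the accumulator is initialized first
    for e0 in range(size):  # one pass per loaded scan line, e0 in [0, 2^k - 1]
        row = p0 * size + e0
        for e1 in range(size):  # add the neighbouring pixel costs on that line
            total += pixel_cost(left[row], right[row], p1 * size + e1, d)
    return total
```

At level 0 the block is a single pixel, so `level_cost` reduces to `pixel_cost`.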

Meanwhile, the FBP stereo matching sequence in the FBP stereo matching module using the data costs on the N₀ x N₁ MRF network is as follows.

for node p₀^(K-1) from 0 to N₀^(K-1)

    Message_update(p₀^(K-1)(0), 0, K-1)

    for a^(K-2) from 0 to 1

        p₀^(K-2)(0) = a^(K-2) + 2(p₀^(K-1) - L^(K-1))

        Message_update(p₀^(K-2)(0), a^(K-2), K-2)

        ...

            for a^0 from 0 to 1

                Message_update(p₀^0(0), a^0, 0)

                State_estimation(p₀^0(0), L^0)

As described above, like the FBP stereo matching sequence on a node of the FBP stereo matching module shown in Fig. 20, even the finest level node above a node p₀^(K-1) can be processed via a depth-first tree sequence due to the coarse-to-fine scale characteristics.
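The depth-first order of the sequence above can be sketched as a small generator. This is an illustrative model only: it assumes the same child-coordinate rule, p₀^k(0) = a^k + 2(p₀^(k+1) - L^(k+1)), at every level below the coarsest one, it loops over `coarse_nodes` coarsest-level positions, and it only emits the order of calls without computing any message. The name `fbp_sequence` and the tuple layout are hypothetical.

```python
def fbp_sequence(num_levels, coarse_nodes, iterations):
    """Yield call records in depth-first order.

    iterations[k] is L^k, with k = 0 the finest level.
    Message_update records are (name, level, p0, a);
    State_estimation records are (name, 0, p0, L^0).
    """
    def descend(k, p0_parent):
        for a in (0, 1):
            # Child coordinate at layer 0, tilted by the parent's L^(k+1) layers.
            p0 = a + 2 * (p0_parent - iterations[k + 1])
            yield ("Message_update", k, p0, a)
            if k > 0:
                yield from descend(k - 1, p0)
            else:
                yield ("State_estimation", 0, p0, iterations[0])

    top = num_levels - 1
    for p0 in range(coarse_nodes):
        yield ("Message_update", top, p0, 0)
        if top > 0:
            yield from descend(top - 1, p0)
        else:
            yield ("State_estimation", 0, p0, iterations[0])
```

With two levels and one iteration per level, one coarse node expands into two fine-level updates, each followed by its state estimation.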

That is, the processors at all of the nodes on the p₁^k(l^k) axis are grouped for each p₀^k(l^k), and perform the Message_update function in parallel within the group via the depth-first tree sequence. Then, the State_estimation function is performed at a final layer L⁰ to determine the disparity value.

Fig. 21 illustrates a flowchart of the FBP stereo matching sequence. Fig. 22 and Fig. 23 respectively illustrate a sequential calculation procedure and a parallel calculation procedure for message update within a group in the FBP stereo matching sequence of Fig. 21. Fig. 25 illustrates a local index for buffer update in Figs. 22 and 23.

Each function in the FBP stereo matching sequence will be described below.

1. Message_update(p₀^k(0), a^k, k)

for each layer l^k from 1 to L^k

for each parallel processor h ∈ [(0,0), (0,N^k-1)]

a. Message_calculation in local buffer

if l^k = 1, then

N_h = N_b(h + (a^k-1)[1 0]^T) \ (s + (a^k-1)[1 0]^T)

otherwise

N_h = N_b(h + (a^k-1)[1 0]^T) \ (s + (a^k-1)[1 0]^T)

b. Buffer_update in layer buffer, for next group processing

a) in case of data cost

b) in case of message

M^k_(c,d)(d, l^k-1) = M^k_b(d, l^k-1)

(1) Downward propagation message: propagation offset [1 0]^T, for a from 1 to 0

(2) Leftward and rightward propagation message: propagation offset a = h, a = h + n_b

2. State_estimation(p₀^0(0), L⁰)

for each parallel processor h ∈ [(0,0), (0,N^0-1)]

In the layer structure, if a message in a layer l^k and a local index s within a group is expressed by M_s^k(d_s, l^k), this message corresponds to a message m^(l^k)_hs(d_s) at level k in the MRF network. In order to calculate the estimated MAP state d_p(L⁰+1) and the message M_s^k(d_s, l^k), the edge cost V_hs(d_h, d_s), the data cost D_h^k(d_h), and the neighboring messages M_h^k(d_h, l^k-1) are needed.

As described in Patent Reference 1, the edge cost V_hs(d_h, d_s) can be calculated without using a memory when a truncated linear function V_hs(d_h, d_s) = min(α_v|d_h - d_s|, K_v) with parameters α_v and K_v is used.

A case where the condition l^k ≠ 1 is satisfied at each level is first taken into consideration.

In the message update and state estimation, it should be noted that a node on the MRF network corresponds to nodes having an offset -[1 0]^T between different iteration layers in the layer-transformed structure, as shown in Equation 5. Accordingly, N_b(h)\s is changed to N_b(h-[1 0]^T)\(s-[1 0]^T) by the layer transformation.

The Message_calculation function accesses the layer buffer and the local buffer as follows. A local index u₀ of a node is in a range -2 ≤ u₀ ≤ 0. If a node is within a group, i.e., u₀ = 0, the data cost and the message of the node at the previous layer are read from the local buffer. If -2 ≤ u₀ < 0, the node belongs to the previous group and the data cost and the message of the node at the previous layer are read from the layer buffer.

Here, a message that is calculated by the function is stored in the local buffer.
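The buffer selection rule above amounts to a small dispatch on the local index. The sketch below only restates that rule; the function name and the returned labels are hypothetical.

```python
def message_source(u0):
    """Return which buffer holds the previous-layer data for local index u0."""
    if u0 == 0:
        return "local"  # node is inside the current group
    if -2 <= u0 < 0:
        return "layer"  # node belongs to the previously processed group
    raise ValueError("local index u0 must satisfy -2 <= u0 <= 0")
```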

D_h(d) may be read from the layer buffer of a previous layer into the local buffer by p(l) = p(l-1) - [1 0]^T.

When the condition l^k = 1 is satisfied at each level, the previous layer has a different scale level, and thus special care needs to be taken as follows.

That is, if the level k is the coarsest level K-1, a message is initialized to "0". If k is not K-1, a previous level message M_h^(k+1)(d_h, L^(k+1)) is read from the local buffer. Meanwhile, if the condition l^k = 1 is satisfied, a data cost D_h^k(d_h) is read from the data cost module. Next, the Buffer_update function performs layer buffer update in such a manner that a local buffer entry satisfying the condition u₀ = 0 is shifted to the next smallest index, i.e., to the layer buffer, like the data cost module calculation sequence shown in Fig. 24. The State_estimation function outputs the estimated state d_p(L⁰+1) using a message M_h^0(d_h, L⁰) of the L⁰ layer at level 0. Here, d_p(L⁰+1) becomes a disparity value corresponding to the position

$$\Big(p_0 - \big(L^k + 1 + \sum_{j=k+1}^{K-1} L^j 2^{j-k}\big),\; p_1\Big)$$

of an output disparity image.

The messages of nodes satisfying -2 ≤ u₀ < 0 are read from the layer buffer. As shown in Figs. 5A to 5D, when u₀ = -1, messages toward three neighboring nodes in three directions are required for N₁ nodes in total, and when u₀ = -2, a message in one direction is required. Accordingly, the number of messages that are stored for each layer is 4N₁ in total. When the number of states is S and the size of the state cost is B bits, the size of the layer buffer for all of the messages is

$$\sum_{k=0}^{K-1} 4SB\,L^k N_1^k \text{ bits.}$$

Since the local buffer only stores messages in all directions of the current layer, the message memory size is

$$\sum_{k=0}^{K-1} 4SB\,N_1^k \text{ bits.}$$

As for the data cost size, only the case where u₀ = -1 needs to be considered, as shown in Figs. 6A to 6D. Accordingly, the layer buffer is $\sum_{k=0}^{K-1} SB\,L^k N_1^k$ bits, and the local buffer is $\sum_{k=0}^{K-1} SB\,N_1^k$ bits. Therefore, the total memory size of the FBP stereo matching module is $\sum_{k=0}^{K-1} 5SB(L^k+1)N_1^k$ bits. Here, the existing hierarchical BP memory size is 5N₁N₀SB bits. Accordingly, when L^k is sufficiently small, the memory size of the FBP stereo matching module becomes smaller by a factor of $N_0 / \sum_{k=0}^{K-1} ((L^k+1)/2^k)$. The calculation speed becomes faster by N₁^k times by using N₁^k parallel processors at the level k, and thus the calculation speed is faster by approximately N₁ times in total.

Meanwhile, an FBP scanning sequence may be implemented by a VLSI chip in which a plurality of processors read messages from neighboring processors to perform parallel calculation, as shown in Figs. 8A and 8B. Alternatively, a PC may sequentially read the messages to perform calculation.

Below, a PE calculation architecture will be described.

[Equation 10]

$$m_o(d_s) = \min_{d_h}\big(V_{hs}(d_h, d_s) + m_{sum}(d_h)\big)$$

As shown in Equation 10, the PE is a logic that calculates a new message m_o(d_s) by using V_hs(d_h, d_s) and m_sum(d_h). The present invention suggests a new PE architecture which can reduce, when a message has the state size of "A", the calculation amount from O(A²) to O(3A) by using the distance transform characteristics disclosed in Patent Reference 1. Here, the PE has the forward processor, the backward processor, and the normalizer, as shown in Fig. 17. Further, the new PE architecture is suitable for VLSI implementation because of its simple calculation procedure using only an adder, a subtractor, a shifter, and a comparator.

Below, "B" represents an allowable maximum value.

Forward Processor: D₁(-1) = B, D₂(-1) = B

For t from 0 to A-1,

m_f(t) = D₁(t), D₁(t) = min(m_sum(t), D₁(t-1) + C_v)

m_f(-1) = D₂(A-1) + K_v, D₂(t) = min(m_sum(t), D₂(t-1))

Backward Processor: D₃(-1) = B, D₄(-1) = 0

For t from 0 to A-1,

m_b(t) = min(D₃(t), m_f(-1)), D₃(t) = min(m_f(A-1-t), D₃(t-1) + C_v)

m_b(-1) = D₄(A-1)/A, D₄(t) = m_b(t) + D₄(t-1)

Normalizer:

For t from 0 to A-1,

m_o(t) = m_b(t) - m_b(-1)

As described above, in the PE module shown in Fig. 17, the forward processor receives m_sum(t), which is the sum of the message and the data cost, and outputs m_f(t). Here, m_f(t) is stored in the stack and used by the backward processor to output m_b(t). The normalizer receives m_b(t) and calculates m_o(t).
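The forward, backward, and normalizer stages can be checked against a direct evaluation of Equation 10 with a short software model. The sketch below is not the hardware design: it assumes the truncated linear edge cost min(C_v·|d_h - d_s|, K_v), indexes the backward pass by state rather than by step, uses floating-point infinity in place of the maximum value "B", and normalizes with an integer mean instead of a right shift. The function names are hypothetical.

```python
def pe_message(m_sum, c_v, k_v):
    """Linear-time min-sum message for V = min(c_v * |d_h - d_s|, k_v)."""
    a = len(m_sum)
    big = float("inf")  # plays the role of "B", the allowable maximum value

    # First forward processor: running lower envelope, left to right.
    m_f = []
    d1 = big
    for t in range(a):
        d1 = min(m_sum[t], d1 + c_v)
        m_f.append(d1)
    # Second forward processor: global minimum plus the truncation constant.
    trunc = min(m_sum) + k_v

    # First backward processor: same recurrence right to left, then truncation.
    m_b = [0] * a
    d3 = big
    for t in range(a - 1, -1, -1):
        d3 = min(m_f[t], d3 + c_v)
        m_b[t] = min(d3, trunc)

    # Second backward processor and normalizer: subtract the integer mean.
    mean = sum(m_b) // a
    return [v - mean for v in m_b]


def brute_force_message(m_sum, c_v, k_v):
    """Direct O(A^2) evaluation of Equation 10, for comparison."""
    a = len(m_sum)
    raw = [min(min(c_v * abs(h - s), k_v) + m_sum[h] for h in range(a))
           for s in range(a)]
    mean = sum(raw) // a
    return [v - mean for v in raw]
```

Both functions return the same message for integer inputs; the first touches each state a constant number of times, which is the source of the O(3A) figure quoted above.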

Figs. 18A and 18B respectively illustrate a forward processor in the PE module shown in Fig. 17. In Figs. 18A and 18B, the input cost m_sum(t) represents sequential input of a vector where t is in a range from 0 to A-1. A first forward processor shown in Fig. 18A initializes a delay buffer D₁(-1) to "B" and reads the input cost at each step. The input cost m_sum(t) is compared with D₁(t-1)+C_v calculated at the previous step, and the minimum value is stored as D₁(t) and output as m_f(t) at the current step. A second forward processor shown in Fig. 18B calculates the minimum value of m_sum(t) by using a delay buffer D₂(t), and adds K_v to the minimum value to output m_f(-1).

Figs. 19A and 19B respectively illustrate a backward processor in the PE module shown in Fig. 17.

A first backward processor shown in Fig. 19A initializes a delay buffer D₃(-1) to "B" and reads the state value m_f(t) of the forward cost at each step. m_f(t) is compared with D₃(t-1)+C_v calculated at the previous step, and the minimum value is set as D₃(t) at the current step. D₃(t) is compared with an input parameter m_f(-1) again, and the smaller value is output as m_b(t). In a second backward processor shown in Fig. 19B, a delay buffer D₄(t) is initialized to "0" at the beginning, and m_b(t) is added thereto at each step. At the final step, D₄(A-1) is divided by "A" through a right shift and then output. The normalizer in the PE module shown in Fig. 17 subtracts the output value m_b(-1) of the second backward processor from the output value m_b(t) of the first backward processor, and finally outputs m_o(t).

Accordingly, in case of a Middlebury test image, when the total number of scale levels is 4 and L^k is allocated with (5, 5, 10, 5) in a coarse-to-fine manner, the present invention shows an excellent low error result, as shown in Fig. 26. Particularly, in case of a 436 x 383 image, the memory size is reduced by a factor of $N_0 / \sum_{k=0}^{K-1} ((L^k+1)/2^k) = 28$, and the calculation speed becomes 436 times faster by using 436 parallel processors.
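The quoted reduction factor can be reproduced with a few lines of arithmetic. The sketch assumes N₀ = 383 (with N₁ = 436 matching the 436 parallel processors) and reads the iteration counts (5, 5, 10, 5) as ordered from the coarsest level k = 3 down to the finest level k = 0; the function name is hypothetical.

```python
def memory_reduction(n0, iterations_by_level):
    """N0 / sum_k((L^k + 1) / 2^k), with iterations_by_level[k] = L^k."""
    denom = sum((l_k + 1) / 2 ** k for k, l_k in enumerate(iterations_by_level))
    return n0 / denom


# The text lists L^k coarse-to-fine; index k = 0 must be the finest level.
l_coarse_to_fine = (5, 5, 10, 5)
factor = memory_reduction(383, tuple(reversed(l_coarse_to_fine)))
print(round(factor))  # approximately 28
```

Under these assumptions the denominator is 6 + 11/2 + 6/4 + 6/8 = 13.75, so the factor is 383 / 13.75, which rounds to the stated value of 28.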

While the invention has been shown and described with respect to the embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the scope of the invention as defined in the following claims.

Claims

What is claimed is:

1. A belief propagation based fast systolic array, wherein a hierarchical dynamic Bayesian network of nodes corresponding to pixels of input left and right image pixel data is generated in consideration of an iteration axis and scale levels, and messages on the generated dynamic Bayesian network are updated in a specific axis direction on a Markov random field.

2. The belief propagation based fast systolic array of claim 1, comprising: an image buffer that stores the left and right image pixel data input by raster scanning; and a fast belief propagation (FBP) stereo matching module that outputs a disparity image fast and in parallel by using the pixel data output from the image buffer.

3. The belief propagation based fast systolic array of claim 2, wherein the FBP stereo matching module has a systolic array architecture including a plurality of parallel processing element groups, each processing element group calculating the messages and disparity values in parallel while transmitting the messages and the pixel data to a neighboring processing element group.

4. The belief propagation based fast systolic array of claim 3, wherein each of the processing element groups includes: a data cost module that receives the pixel data and calculates data costs; a plurality of multiplexers that receive the data costs and the messages from the data cost module and from a neighboring processing element group, respectively, and select desired messages; a plurality of processing elements that calculate new messages by using the desired messages selected by the multiplexers; a plurality of local buffers that store the new messages calculated by the processing elements; and a plurality of layer buffers that store the new messages stored in the local buffers.

5. The belief propagation based fast systolic array of claim 4, wherein the data cost module includes: a plurality of first modules, each first module calculating and outputting an absolute difference between the left and the right pixel data corresponding to each disparity value; and a plurality of second modules, each second module calculating final data costs at each scale level by using outputs of the first modules.

6. The belief propagation based fast systolic array of claim 5, wherein each of the first modules includes: a series of left registers that stores the left pixel data; a series of right registers that stores the right pixel data; and a logic that calculates the absolute difference by using outputs of the left and the right registers, wherein the right registers are shifted to make the output thereof.

7. The belief propagation based fast systolic array of claim 5, wherein each of the second modules includes: an adder that adds two data costs; a register that stores the addition result of the adder; and an accumulator that accumulates an output of the register.

8. The belief propagation based fast systolic array of claim 4, wherein the data cost module includes: a plurality of first modules, each calculating and outputting an absolute difference between the left and the right pixel data; and a plurality of second modules, each calculating final data costs for each scale level, wherein each first module calculates the absolute difference by sequentially reading left and right scan lines required to calculate a data cost of a specific node at a specific scale level and storing a series of left and right pixel data of the left and right scan lines in registers of the first module, and wherein each second module calculates the final data costs by adding outputs of neighboring first modules according to the specific scale level and accumulating the addition result for each scan line.

9. The belief propagation based fast systolic array of claim 2, wherein the FBP stereo matching module has an FBP stereo matching sequence in which, in a layer-transformed hierarchical dynamic Bayesian network which is obtained by tilting in a scanning axis direction the positions of the nodes at each iteration on the dynamic Bayesian network, nodes on a line corresponding to the same coordinate on an axis in the Markov random field are processed in parallel and sequentially processed in the axis direction.

10. The belief propagation based fast systolic array of claim 9, wherein, in the FBP stereo matching sequence, for memory scanning of the layer-transformed hierarchical dynamic Bayesian network, upper layer messages are processed by a message update function in a depth-first tree order while nodes on the same coordinate of the scanning axis are processed in parallel at the coarsest level, and a disparity value is calculated by a state estimation function.

11. The belief propagation based fast systolic array of claim 10, wherein the message update function is performed as many times as the number of iteration layers at each level, and calls a message calculation function, which calculates messages and stores the messages in a local buffer, and a buffer_update function, which stores the messages held in the local buffer into a layer buffer for processing a group of a next line.

12. The belief propagation based fast systolic array of claim 11, wherein data costs and messages of a previous layer read by the message calculation function are those processed in the previous layer or the group of the previous line.

13. The belief propagation based fast systolic array of claim 11, wherein the message calculation function reads data costs or messages of the previous layer in a group from the local buffer, and reads those out of the group from the layer buffer.

14. The belief propagation based fast systolic array of claim 11, wherein, when a message of a first layer at each level is calculated, the message calculation function sets messages of the coarsest level to zero, reads messages at a previous coarser level from the local buffer for other levels, and reads data costs from data cost modules that receive the pixel data and calculate the data cost.

15. The belief propagation based fast systolic array of claim 11, wherein the buffer update function stores the messages and data costs held in the local buffer into the layer buffer such that the messages and data costs of a current group are read from the layer buffer when the message calculation function is performed for a group of the next line on the network.

16. The belief propagation based fast systolic array of claim 10, wherein the state estimation function reads messages and data costs from the local buffer and the layer buffer after a final iteration at the finest level, adds them for each state, and estimates a state corresponding to the minimum cost as the disparity value.

17. The belief propagation based fast systolic array of claim 3, wherein, when the number of levels is K, the processing element group has 2^(K-1) processing elements in total, N₁/2^k processing elements operate in parallel in an FBP stereo matching sequence at level k, and wherein each processing element has a local buffer and a layer buffer at each level and accesses the buffers via a multiplexer.

18. The belief propagation based fast systolic array of claim 4, wherein the local buffer stores currently calculated messages in a group to allow the messages of the previous layer to be accessed for message calculation of a next layer.

19. The belief propagation based fast systolic array of claim 4, wherein the layer buffer stores, on a layer basis, messages of a current group required for message calculation of a group of a next line on the network.

20. The belief propagation based fast systolic array of claim 2, wherein the FBP stereo matching module performs an FBP stereo matching sequence by sequentially accessing buffers with a single processor.

21. The belief propagation based fast systolic array of claim 4, wherein each processing element includes: an adder that sequentially reads and adds the data costs and the messages of the previous layer on a state basis; a forward processor that receives an output of the adder to output a forward processor cost; a forward stack that receives and stores the forward processor cost; a backward processor that receives an output of the forward stack to output a backward processor cost; a backward stack that stores an output of the backward processor; a normalizer that receives an output of the backward stack and calculates a final message; and a buffer that stores an output of the normalizer.

22. The belief propagation based fast systolic array of claim 21, wherein the forward processor includes: a first forward processor that initializes a first delay buffer, reads an input cost at each step, compares the input cost with a value obtained by adding a constant value to a previous value of the first delay buffer, stores the minimum value in the first delay buffer, and outputs the minimum value; and a second forward processor that initializes a second delay buffer to calculate a minimum value of an input cost, and outputs a value obtained by adding a constant value to the minimum value.

23. The belief propagation based fast systolic array of claim 21, wherein the backward processor includes: a first backward processor that initializes a first delay buffer, reads an input cost at each step, compares the input cost with a value obtained by adding a constant value to the value of the first delay buffer, stores a minimum value in the first delay buffer, compares an output of the first delay buffer with an output of the forward processor, and outputs a minimum value; and a second backward processor that initializes a second delay buffer to zero, stores in the second delay buffer a value obtained by adding the output of the first delay buffer at each step, shifts the value of the second delay buffer by a specific number of bits, and outputs the shifted value.

24. The belief propagation based fast systolic array of claim 21, wherein the normalizer calculates the message by outputting a value obtained by subtracting the value calculated by the second backward processor from the value calculated by the first backward processor.

25. The belief propagation based fast systolic array of claim 23, wherein the normalizer calculates the message by outputting a value obtained by subtracting the value calculated by the second backward processor from the value calculated by the first backward processor.

26. The belief propagation based fast systolic array of claim 2, wherein the FBP stereo matching module is a VLSI chip that operates only with multiplexers, adders and subtracters for integer operation, comparators, and shifters by using systolic array architecture.

27. A belief propagation based fast systolic array method, comprising:

(a) storing left and right image pixel data input by raster scanning; and

(b) outputting a disparity image fast and in parallel using the pixel data stored at the step (a).

28. The belief propagation based fast systolic array method of claim 27, wherein the step (b) is performed by a plurality of parallel processing element groups, each processing element group calculating the messages and disparity values in parallel while transmitting the messages and the pixel data to a neighboring processing element group.

29. The belief propagation based fast systolic array method of claim 28, wherein the step (b) includes:

(a1) receiving the pixel data and calculating data costs;

(b1) receiving the data costs calculated at the step (a1) and receiving messages from neighboring processing element groups to select desired messages;

(c1) calculating new messages by using the messages selected at the step (b1);

(d1) storing the calculation result of the step (c1) in a local buffer; and

(e1) storing the result stored at the step (d1) in a layer buffer.

30. The belief propagation based fast systolic array method of claim 29, wherein the step (c1) includes:

(c11) sequentially reading and adding the data costs and messages of a previous layer on a state basis;

(c12) receiving the addition result of the step (c11) and outputting a forward processor cost;

(c13) receiving the forward processor cost output at the step (c12) and storing the received forward processor cost in a forward stack;

(c14) receiving an output of the forward stack and outputting a backward processor cost;

(c15) storing an output of the step (c14) in a backward stack;

(c16) receiving an output of the backward stack and calculating a final message; and

(c17) storing an output calculated at the step (c16) in a buffer.

31. The belief propagation based fast systolic array method of claim 30, wherein the step (c12) includes: initializing a first delay buffer; reading an input cost at each step; comparing the input cost with a value obtained by adding a constant value to a previous value of the first delay buffer; storing a minimum value in the first delay buffer; outputting the minimum value; initializing a second delay buffer; calculating a minimum value of the input cost by using the second delay buffer; and outputting a value obtained by adding a constant value to the minimum value.

32. The belief propagation based fast systolic array method of claim 30, wherein the step (c14) includes: initializing a first delay buffer; reading an input cost at each step; comparing the input cost with a value obtained by adding a constant value to the value of the first delay buffer; storing a minimum value in the first delay buffer; comparing an output of the first delay buffer with an output of the forward processor; outputting a minimum value; initializing a second delay buffer to zero; storing in the second delay buffer a value obtained by adding the output of the first delay buffer at each step; shifting the value of the second delay buffer by a specific number of bits; and outputting the shifted value.
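
For illustration only, and forming no part of the claims, the message computation recited in claims 21 to 24 and 30 to 32 and the state estimation of claim 16 may be sketched in software as follows. This is a minimal sketch under stated assumptions, not the disclosed hardware. The function and parameter names (`compute_message`, `estimate_disparity`, `step_cost`, `trunc_cost`, `shift_bits`) are hypothetical, the delay buffers are assumed to be initialized to a large value, the number of states is assumed to be a power of two so that the bit shift yields a mean, and the accumulated value follows the literal wording of claim 23 (the output of the first delay buffer).

```python
INF = 1 << 30  # assumed "large" initial value for the delay buffers


def compute_message(costs, step_cost, trunc_cost, shift_bits):
    """costs[d]: per-state sum of data cost and previous-layer messages
    (the adder output of claim 21). Returns the normalized message."""
    # Forward processor (claim 22): running minimum with a constant step.
    forward_stack = []
    delay = INF
    for cost in costs:
        delay = min(cost, delay + step_cost)
        forward_stack.append(delay)
    # Second forward processor: global minimum plus a constant.
    trunc = min(costs) + trunc_cost

    # Backward processor (claim 23): read the forward stack in reverse.
    backward_stack = []
    delay = INF
    total = 0  # second delay buffer, initialized to zero
    while forward_stack:
        delay = min(forward_stack.pop(), delay + step_cost)
        backward_stack.append(min(delay, trunc))
        total += delay  # literal reading: accumulate the delay buffer output
    offset = total >> shift_bits  # shift by a specific number of bits

    # Normalizer (claim 24): popping restores ascending state order.
    message = []
    while backward_stack:
        message.append(backward_stack.pop() - offset)
    return message


def estimate_disparity(data_cost, messages):
    """State estimation (claim 16): add data cost and all incoming
    messages per state and return the state with the minimum cost."""
    totals = [
        data_cost[d] + sum(m[d] for m in messages)
        for d in range(len(data_cost))
    ]
    return min(range(len(totals)), key=totals.__getitem__)
```

For example, with four states, `compute_message([5, 1, 4, 9], 1, 2, 2)` returns `[0, -1, 0, 1]`. The two stacks exist only to reverse the traversal order between passes, which is why the sketch needs nothing beyond additions, comparisons, a subtraction, and a shift, in line with claim 26.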