WO2001096982A2 - System for the estimation of optical flow - Google Patents


Info

Publication number
WO2001096982A2
WO2001096982A2 (PCT/US2001/019012)
Authority
WO
WIPO (PCT)
Prior art keywords
brightness
motion field
dense
dense motion
motion
Prior art date
Application number
PCT/US2001/019012
Other languages
French (fr)
Other versions
WO2001096982A3 (en)
Inventor
Siegfried Wonneberger
Max Griessl
Markus Wittkop
Original Assignee
Dynapel Systems, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dynapel Systems, Inc. filed Critical Dynapel Systems, Inc.
Publication of WO2001096982A2 publication Critical patent/WO2001096982A2/en
Publication of WO2001096982A3 publication Critical patent/WO2001096982A3/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/207Analysis of motion for motion estimation over a hierarchy of resolutions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/269Analysis of motion using gradient-based methods



Abstract

In a system for generating a dense motion field (13) representing the motion of image components in a motion picture, a correction to the initial estimate of the dense motion field is directly calculated by determining partial derivatives with respect to time and space from brightness values in two successive frames in the motion picture. The brightness partial derivatives are determined by calculating temporal and spatial differences in brightness values at positions in the successive frames determined by the vectors of the initial estimate of the motion field (13). The resulting brightness partial derivatives are used to calculate the dense motion field using a motion flow algorithm. The calculated correction to the initial estimate of the dense motion field is then added to the initial estimate to provide a new estimate of the dense motion field. The calculation of the estimated dense motion field is used in the hierarchical pyramid (51, 53, 55) wherein the calculations are carried out on successively finer grids.

Description

SYSTEM FOR THE ESTIMATION OF OPTICAL FLOW
Field of the Invention
This invention relates to the estimation of dense motion fields in sequences of images, e.g., video images, by a gradient based optical flow computation.
Background of the Invention
A dense motion field, also called a dense motion vector field, is a set of vectors, one for each pixel of a frame from a set of motion picture frames, wherein the vectors represent the frame to frame movement of pixel-sized image components of the objects depicted in the set of sequential motion picture frames. For example, as shown in Fig. 1, if pixels A, B, C and D represent image components of a depicted square object 11 in a first motion picture frame and the square 11' represents where the square 11 has moved in a second motion picture frame, the vectors 13, representing the change of position of the image components A, B, C and D, are vectors of a dense motion vector field. The example of Fig. 1 is a special simplified case in which the square 11 moves without changing its size or orientation. As a result, the dense motion field vectors representing the motion of the image components of the square 11 are parallel and are of equal length. Typically, the motion of objects in a motion picture is more complex than that represented in the example shown in Fig. 1 and the dense motion field vectors will often not be parallel and will not be of equal length. It should be noted that the pixel-sized image components are technically not pixels since pixels by definition do not move. The image components, on the other hand, are components of the objects depicted in the
motion picture and these image components change position from frame to frame
when the corresponding objects change position from frame to frame to represent
motion of these objects in the motion picture. The computation of a dense motion
field is called optical flow computation.
Motion estimation by gradient-based optical flow computation between two
consecutive images of a sequence has the drawback that the amplitude of motion which can be determined is very limited. To overcome this deficiency, the gradient-based
method is used as a motion estimation kernel within a hierarchical pyramid as
described in PCT application Serial No. WO/9907156, which is hereby incorporated
by reference. The skeleton of this pyramid consists of a number of image pairs of
decreasing spatial resolution, derived by reducing (down sampling) the original
images. At each resolution, a correction to the motion field, determined before at the
coarser resolution, is calculated in the motion estimation kernel. After the correction
has been added, the refined motion field is filtered and expanded for use in the next
finer resolution.
The classical gradient-based motion estimation kernel is the algorithm introduced
by Horn and Schunck in an article entitled "Determining Optical Flow", published in
Artificial Intelligence, 1981, Vol. 17, pp. 185-203, which is hereby incorporated by
reference. In the Horn and Schunck approach as described in this article, the image
sequence is interpreted as a discrete part of the brightness field E(t, x, y), dense in time
t and space (x, y), with the pixels being located at integer positions. The basic assumption of the Horn and Schunck article is that all brightness changes from frame
to frame of a motion picture are caused by motion. In other words, each image component of a depicted object is assumed to stay at the same brightness from frame to frame. This assumption which is called optical flow constraint, can formally be
expressed by the vanishing total temporal derivative of the brightness field E(t, x, y),
that is:
$\frac{d}{dt}E = 0.$ (1)
Applying the chain rule of differentiation yields the equation
$E_t + E_x u + E_y v = 0,$ (2)
where $u := \frac{dx}{dt}$ and $v := \frac{dy}{dt},$
u and v representing the local components of the motion vector (flow velocity), and
the indices $E_t$, $E_x$ and $E_y$ expressing the partial derivatives with respect to time t and
space (x, y), that is, $E_t = \frac{\partial E}{\partial t}$, $E_x = \frac{\partial E}{\partial x}$ and $E_y = \frac{\partial E}{\partial y}$. As it is impossible to get
local solutions for both components of the motion vector from a single algebraic
equation, Horn and Schunck proposed an additional smoothness constraint for the
motion vector field. In order to let the motion vector field approximately fulfill both
constraints almost everywhere, they minimize the functional
$F(u,v) := \int dx\,dy\,\left((E_t + E_x u + E_y v)^2 + \alpha^2(u_x^2 + u_y^2 + v_x^2 + v_y^2)\right),$ (3)
wherein the integration extends over the whole space (frame) and the positive constant
$\alpha^2$ controls the relative contributions of the optical flow deviation term and the non-smoothness term. The constant $\alpha^2$ should be roughly equivalent to the expected noise in the estimate of $E_x^2 + E_y^2$. The minimization is performed using the calculus of
variation, such as disclosed in Methods of Mathematical Physics, R. Courant and D.
Hilbert, published by Interscience, New York, 1937, 1953. The pertaining Euler-Lagrange differential equations
$(E_t + E_x u + E_y v)E_x - \alpha^2(u_{xx} + u_{yy}) = 0$ (4)
$(E_t + E_x u + E_y v)E_y - \alpha^2(v_{xx} + v_{yy}) = 0$ (5)
are re-discretised by replacing the derivatives by discrete difference masks, as
explained in the Horn and Schunck article. The resulting system of linear algebraic
equations can be solved by standard numerical methods, such as disclosed in Matrix
Computations, 2d Ed., by G. H. Golub and C. F. Van Loan, published by Johns
Hopkins University Press, Baltimore, MD 1989.
The motion refinement method, used in the above-cited PCT patent
application, at each resolution of a hierarchical pyramid, first uses an estimated
preliminary motion field (U, V) to warp one of the images called the source image, in
order to make a prediction for a second image called the target image. Then, the
correction field (u, v) to the preliminary motion field is calculated by estimating the
motion (displacement) field between this prediction and the target image. Within the motion estimation kernel of the Horn & Schunck article, partial brightness derivatives
$E_t$, $E_x$ and $E_y$ are calculated as:
$E_{t,x,y}(t, x, y) \approx \{$
$(+,+,+)\; E(T^+, X+1, Y+1)$
$(+,-,+)\; E(T^+, X, Y+1)$
$(+,-,-)\; E(T^+, X, Y)$
$(+,+,-)\; E(T^+, X+1, Y)$
$(-,+,+)\; E(T^-, X+1, Y+1)$
$(-,-,+)\; E(T^-, X, Y+1)$
$(-,-,-)\; E(T^-, X, Y)$
$(-,+,-)\; E(T^-, X+1, Y)\; \}/4,$ (6)
wherein
• the bracketed alternative signs correspond with the equally ordered alternative variables in the index of the partial derivative (this means that the first symbol, "+" or "-", inside the parenthesis is used when determining $E_t$, the middle symbol inside the parenthesis is used when determining $E_x$ and the last symbol inside the parenthesis is used when determining $E_y$),
• the brightness values E are calculated at points, expressed with the abbreviations
$T^\pm := t \pm \tfrac{1}{2}, \quad X := x - \tfrac{1}{2}, \quad Y := y - \tfrac{1}{2}.$ (7)
It is emphasized that the time t lies in the middle between the two images located at
consecutive integer times $T^-$ and $T^+$, that is, $t \in \mathbb{N} + 0.5$. The space points (x, y) lie
on the lattice $(\mathbb{N} + 0.5)^2$; therefore, X and Y are pixel positions.
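The eight-corner averaging of Equations (6) and (7) can be sketched in code. The following is an illustrative Python sketch, not part of the patent: the function name, the frame[y, x] indexing convention, and anchoring the 2x2 neighborhood at integer indices (x, y) are assumptions.

```python
import numpy as np

def brightness_derivatives(frame0, frame1, x, y):
    """Estimate Et, Ex, Ey per Equation (6): each derivative is the average
    of four finite differences over the 2x2x2 cube of brightness samples at
    times T-/T+ and pixel offsets (X, X+1) x (Y, Y+1)."""
    # Gather the eight cube corners E(T, X+i, Y+j) with i, j in {0, 1};
    # t = 0 denotes the first frame (T-), t = 1 the second frame (T+).
    c = {(t, i, j): f[y + j, x + i]
         for t, f in ((0, frame0), (1, frame1))
         for i in (0, 1) for j in (0, 1)}
    # The bracketed-sign convention of Equation (6): a corner enters with
    # "+" for the axis on whose far side it lies, "-" otherwise.
    Et = sum(c[1, i, j] - c[0, i, j] for i in (0, 1) for j in (0, 1)) / 4.0
    Ex = sum(c[t, 1, j] - c[t, 0, j] for t in (0, 1) for j in (0, 1)) / 4.0
    Ey = sum(c[t, i, 1] - c[t, i, 0] for t in (0, 1) for i in (0, 1)) / 4.0
    return Et, Ex, Ey
```

For a static brightness ramp that increases by one per pixel in x, this sketch yields Et = 0, Ex = 1 and Ey = 0, as the averaging of Equation (6) predicts.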
Summary of the Invention
As explained above, in the system described in the above-cited PCT application, initial estimates of the motion field are used to predict a target image and then the motion field between the actual target image and the predicted target image is calculated as a correction field. The correction field will be a dense motion field between the predicted target image and the actual target image. This correction field will then be added by vector addition to the initial estimated dense motion field to obtain a new estimate of the dense motion field. In accordance with the present invention, instead of calculating a predicted target image and then determining the correction field from the predicted target image and the actual target image, the system of the invention calculates the correction motion field directly. In this calculation, the partial derivatives Et , Ex and Ey are determined from the brightness
values in the two successive frames for which the dense motion field is being
calculated. More specifically, the brightness partial derivatives are determined by
calculating temporal and spatial differences in brightness values at positions in the
successive frames determined by the vectors of the initial estimate of the motion field.
The resulting brightness partial derivatives are then used to calculate the correction to
the dense motion field by means of the algorithm set forth in the Horn and Schunck
article.
Brief Description of the Drawings
Fig. 1 is a diagram used to explain a dense motion field which is calculated by
the present invention.
Fig. 2 is a block diagram illustrating the system of the present invention.
Fig. 3 is a flow chart of the method of the present invention.
Figs. 4 A, 4B and 4C illustrate graphically an example of the coordinate points
at which mean brightness values are calculated in accordance with the invention.
Fig. 5 graphically illustrates how the mean brightness value is calculated at a given coordinate point.
Fig. 6 graphically illustrates how an estimate of a Laplacian is calculated.
Fig. 7 is a schematic diagram illustrating the pyramid grid calculating used in
the system of the present invention.
Description of a Preferred Embodiment
In the system of the invention, as shown in Fig. 2, a source of successive pixel-
based motion picture frames is fed to pixel frame buffer memories 21 and 22,
wherein the first frame of the sequence is received in the pixel frame buffer memory
21 and the second frame of the sequence is received in pixel frame buffer memory 22.
The data processor 24 computes the dense motion field from the brightness values of
the pixels in the buffer memories 21 and 22.
As shown in Fig. 3, in the system of the present invention, the data processing unit computes the dense motion field between successive frames of a motion picture
by first computing a set of brightness derivatives determined as a function of changes
in brightness with space and time and also in accordance with an initial estimate of the
dense motion field U and V. Following the computation of these brightness
derivatives, the correction to the initial estimate of the dense motion field is calculated
using the Horn and Schunck equations. Following this calculation, the correction to
the dense motion field is added to the initial estimate to provide a new estimate of the
dense motion field.
In accordance with the invention, the CPU 24 calculates the partial brightness
derivatives $E_t$, $E_x$ and $E_y$, taking the preliminary motion field (U, V) into account.
They are calculated as follows:
$\overline{E}_{t,x,y}(t, x, y, U, V) \approx \{$
$(+,+,+)\; \overline{E}(T^+, X_U^+ + 1, Y_V^+ + 1)$
$(+,-,+)\; \overline{E}(T^+, X_U^+, Y_V^+ + 1)$
$(+,-,-)\; \overline{E}(T^+, X_U^+, Y_V^+)$
$(+,+,-)\; \overline{E}(T^+, X_U^+ + 1, Y_V^+)$
$(-,+,+)\; \overline{E}(T^-, X_U^- + 1, Y_V^- + 1)$
$(-,-,+)\; \overline{E}(T^-, X_U^-, Y_V^- + 1)$
$(-,-,-)\; \overline{E}(T^-, X_U^-, Y_V^-)$
$(-,+,-)\; \overline{E}(T^-, X_U^- + 1, Y_V^-)\; \}/4.$ (8)
The bracketed alternative signs correspond with the equally ordered alternative
variables in the index of the partial derivative in the same manner as in Equation (6).
The values $\overline{E}$ are mean brightness values calculated at coordinate points,
expressed with the abbreviations
$X_U^\pm := x - \tfrac{1}{2} + U(x,y)\cdot\left(\tfrac{1}{2}(1 \pm 1) - \lambda\right), \quad Y_V^\pm := y - \tfrac{1}{2} + V(x,y)\cdot\left(\tfrac{1}{2}(1 \pm 1) - \lambda\right),$ (9)
where the parameter $\lambda \in [0,1]$ fixes the time $t = T^- + \lambda(T^+ - T^-)$, at which the
derivatives are calculated, at an arbitrary point between the two times $T^- < T^+$
belonging to consecutive original images. The space point (x, y) is a point in the
frame corresponding to an image component or vector in the initial estimated field.
Normally (x, y) will lie either on the lattice $(\mathbb{N} + 0.5)^2$ or on the lattice $\mathbb{N}^2$.
Thus, in Equation (8), each of the eight mean brightness values $\overline{E}$ is
determined for a specifically identified point in the first motion picture frame or the
second motion picture frame. For example, $T^+$ in the parenthetical portion of a
brightness value $\overline{E}$ means that the brightness value is determined for a point in the
second frame and $T^-$ means that the brightness value of $\overline{E}$ is determined for a
coordinate point in the first frame. The coordinates of the point at which the brightness value is determined are indicated by the second and third terms in the
parenthetical expression. Thus, $\overline{E}(T^+, X_U^+ + 1, Y_V^+ + 1)$ means the brightness value is
determined for a point in the second frame at the coordinates $X_U^+ + 1$, $Y_V^+ + 1$, and
$\overline{E}(T^-, X_U^-, Y_V^- + 1)$ means the brightness value is determined for a coordinate point in
the first frame at the coordinates $X_U^-$, $Y_V^- + 1$. The coordinates $X_U^+$, $Y_V^+$, $X_U^-$ and $Y_V^-$ are
determined from Equations (9) by using the plus (+) sign for the plus or minus
indicator (±) to compute $X_U^+$ and $Y_V^+$ and using the minus (-) sign for the plus or
minus indicator (±) to compute $X_U^-$ and $Y_V^-$.
The points at which the mean brightness values $\overline{E}$ are calculated in Equation
(8) are graphically illustrated in Figs. 4A, 4B and 4C for the vector (U, V), which is
positioned to pass through the point (x, y), dividing the vector in two parts of relative
lengths λ and 1-λ. In these figures, λ is about 0.37, and the eight points at
which the mean brightness values are determined are designated 31 through 38. Fig.
4A represents the calculation of $E_t$. In this figure, the plus (+) signs are on the points
31-34 to indicate that mean brightness values at these points are added in Equation (8)
and the minus (-) signs are on the points 35-38 to indicate that mean brightness values
at these points are subtracted. In a manner similar to Fig. 4A, Fig. 4B illustrates the
calculation of the partial derivative Ex and Fig. 4C illustrates the calculation of the
partial derivative Ey. In Figs. 4A-4C, the coordinates at the initial point of the vector
and the points 35-38 are in the first frame of the two sequential frames and the
coordinates at the terminal point of the vector and the points 31-34 are in the second
of the two motion picture frames. As shown in Figs. 4A-4C, the brightness
differentials Et , Ex and Ey are determined by the differences in the mean brightnesses at locations in the sequential motion picture frames determined in accordance with the initial estimate vector. Et is determined by the difference between the mean
brightness values between the two frames at the initial and terminal points of the
corresponding vector. Ex is determined by adding the brightness values at points 31
and 34 in the second frame and at points 35 and 38 in the first frame and subtracting
the mean brightness values at the points 32 and 33 in the second frame and at the
points 36 and 37 in the first frame. Thus, Ex is determined by differences in mean
brightness values at points incrementally spaced in the X direction at the initial point
of the vector in the first frame and at the terminal point of the vector in the second
frame. Similarly, Ey is determined by adding the mean brightness values at the points 33 and 34 in the second frame and at the points 37 and 38 is the first frame and by
subtracting the mean brightness values at the points 31 and 32 in the second frame and
at the points 35 and 36 in the first frame. Thus, Ey is determined by the differences in the mean brightness values at incrementally spaced points in the Y direction at the
initial and terminal points of the vector in the first and second frames, respectively.
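The displaced sampling of Equations (8) and (9) can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: the function and parameter names are assumptions, and `sample` stands in for any mean-brightness interpolator of the kind the patent defines (for instance the convex combination of Equation (10)).

```python
import numpy as np

def warped_derivatives(frame0, frame1, x, y, U, V, lam, sample):
    """Estimate Et, Ex, Ey along the initial-estimate vector (U, V) through
    (x, y), per Equations (8)-(9).  `sample(frame, X, Y)` returns a mean
    brightness at a (generally non-integer) coordinate; lam is the
    parameter lambda in [0, 1] that splits the vector at (x, y)."""
    # Equation (9): (1/2)(1±1) is 1 for the "+" case and 0 for the "-" case,
    # so the base point is displaced by -U*lam in the first frame and by
    # +U*(1-lam) in the second frame (likewise for V).
    Xm, Ym = x - 0.5 - U * lam, y - 0.5 - V * lam            # minus sign
    Xp, Yp = x - 0.5 + U * (1 - lam), y - 0.5 + V * (1 - lam)  # plus sign
    c = {(t, i, j): sample(f, X + i, Y + j)
         for t, f, X, Y in ((0, frame0, Xm, Ym), (1, frame1, Xp, Yp))
         for i in (0, 1) for j in (0, 1)}
    # Same eight-corner averaging as in Equation (6), now on warped points.
    Et = sum(c[1, i, j] - c[0, i, j] for i in (0, 1) for j in (0, 1)) / 4.0
    Ex = sum(c[t, 1, j] - c[t, 0, j] for t in (0, 1) for j in (0, 1)) / 4.0
    Ey = sum(c[t, i, 1] - c[t, i, 0] for t in (0, 1) for i in (0, 1)) / 4.0
    return Et, Ex, Ey
```

With a zero initial estimate (U = V = 0) this reduces to the classical scheme of Equation (6), which is one way to check the sketch against the prior-art kernel.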
The mean brightness values $\overline{E}$ are arbitrary convex combinations of the brightness values of
the neighboring pixels, and each mean brightness value $\overline{E}$ is an approximation of the
brightness at the corresponding coordinate point. An approximation is needed
because the initial point and termination point of a vector will not be expected to fall
at the centers of pixels. One reasonable definition for the mean brightness E is given
by $\overline{E}(T, X, Y) := \{\,(1-(X-[X]))\,(1-(Y-[Y]))\; E(T, [X], [Y])$
$+\; (X-[X])\,(1-(Y-[Y]))\; E(T, [X]+1, [Y])$
$+\; (X-[X])\,(Y-[Y])\; E(T, [X]+1, [Y]+1)$
$+\; (1-(X-[X]))\,(Y-[Y])\; E(T, [X], [Y]+1)\,\},$ (10)
wherein the coefficients of the convex combination can be interpreted as the
intersection areas of a pixel-sized unit square, centered at (X, Y), with the four unit
squares representing its neighboring pixels. The integer positions of the pixels are
expressed with the help of clipping brackets, indicating that [α] is the integral part of
α, i.e., the largest integer not exceeding α.
The calculation of Equation (10) computes the $\overline{E}$ brightness
approximation as the weighted average of four pixels neighboring the coordinate point
for which the E brightness approximation is being computed. Fig. 5 graphically
illustrates an example of the computation of Equation (10). As shown in Fig. 5, unit
square 41 surrounds the coordinate point at (X,Y) for which E is being computed.
The unit square 41 overlaps the boundaries of four neighboring pixels 43, 45, 47 and
49. E is the weighted average of the brightness of the pixels 43, 45, 47 and 49 with
each brightness being weighted in accordance with how much it is overlapped by the
unit square 41. In this manner, an approximation of the brightness at the coordinate
point (X,Y) is determined.
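A minimal sketch of the convex combination of Equation (10) in Python. The border clamping is an assumption of the sketch; the patent does not specify behavior at frame edges.

```python
import numpy as np

def mean_brightness(frame, X, Y):
    """Bilinearly interpolated mean brightness E-bar(T, X, Y) per Equation
    (10).  The weights are the overlap areas of a unit square centered at
    (X, Y) with its four neighboring pixels (Fig. 5)."""
    x0, y0 = int(np.floor(X)), int(np.floor(Y))  # clipping brackets [X], [Y]
    fx, fy = X - x0, Y - y0                      # fractional parts
    # Clamp indices so points near the frame border stay valid (assumed
    # border handling, not specified by the patent).
    h, w = frame.shape
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    x0, y0 = min(max(x0, 0), w - 1), min(max(y0, 0), h - 1)
    return ((1 - fx) * (1 - fy) * frame[y0, x0]
            + fx * (1 - fy) * frame[y0, x1]
            + fx * fy * frame[y1, x1]
            + (1 - fx) * fy * frame[y1, x0])
```

At the exact center of four pixels the four overlap weights are each 1/4, so the result is the plain average of the four neighboring brightness values.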
Following the computation of the brightness partial derivatives Et , Ex and
Ey as described above, the equations of Horn and Schunck as set forth in the above-
cited article are used to calculate a dense motion field. Because the brightness partial derivatives are determined as an appropriate function of the initial estimate of the dense motion field, the Horn and Schunck equations will yield a dense motion field which is a correction to the initial estimate and which, when added to the initial
estimate, will provide a new estimate of the dense motion field. As described in Horn
and Schunck, the partial brightness derivatives can be related to the dense motion field
u and v as follows:
$E_x^2 u + E_x E_y v = \alpha^2 \nabla^2 u - E_x E_t,$ (11)
$E_x E_y u + E_y^2 v = \alpha^2 \nabla^2 v - E_y E_t.$ (12)
In these equations, $\nabla^2 u$ and $\nabla^2 v$ are the Laplacians of u and v. As explained by Horn
and Schunck, the Laplacians of u and v can be approximated by subtracting the
magnitudes of u and v from weighted averages of the surrounding magnitudes of u
and v as follows:
$\nabla^2 u \approx \kappa(\overline{u} - u) \quad \text{and} \quad \nabla^2 v \approx \kappa(\overline{v} - v),$ (13)
in which $\overline{u}$ and $\overline{v}$ are the weighted averages of the values of u and v surrounding the
pixels at which the Laplacians of u and v are being calculated. The weighted
averages $\overline{u}$ and $\overline{v}$ at the coordinates (x, y) can be calculated as follows (time dependence
suppressed):
$\overline{u}(x,y) = \tfrac{1}{6}\{u(x-1,y) + u(x,y+1) + u(x+1,y) + u(x,y-1)\} + \tfrac{1}{12}\{u(x-1,y-1) + u(x-1,y+1) + u(x+1,y+1) + u(x+1,y-1)\},$ (14)
$\overline{v}(x,y) = \tfrac{1}{6}\{v(x-1,y) + v(x,y+1) + v(x+1,y) + v(x,y-1)\} + \tfrac{1}{12}\{v(x-1,y-1) + v(x-1,y+1) + v(x+1,y+1) + v(x+1,y-1)\}.$ (15)
Fig. 6 illustrates the weighting carried out by the above equations for the values at the coordinates (x, y). With the approximations of the Laplacians substituted in Equations
(11) and (12) and solving for u and v the following equations result:
$(\alpha^2 + E_x^2 + E_y^2)\,u = (\alpha^2 + E_y^2)\,\overline{u} - E_x E_y \overline{v} - E_x E_t,$ (16)
$(\alpha^2 + E_x^2 + E_y^2)\,v = -E_x E_y \overline{u} + (\alpha^2 + E_x^2)\,\overline{v} - E_y E_t.$ (17)
The above equations provide an expression for u and v at each point in the image.
These equations can be solved iteratively as follows:
$u^{n+1} = \overline{u}^{\,n} - E_x\,[E_x \overline{u}^{\,n} + E_y \overline{v}^{\,n} + E_t]\,/\,(\alpha^2 + E_x^2 + E_y^2),$ (18)
$v^{n+1} = \overline{v}^{\,n} - E_y\,[E_x \overline{u}^{\,n} + E_y \overline{v}^{\,n} + E_t]\,/\,(\alpha^2 + E_x^2 + E_y^2).$ (19)
The calculations represented by the above iterative equations are repeated until they
converge to provide a dense motion field u and v for each image component. The
calculated motion field (u, v) will be a correction to the initial estimate (U, V) and
when added to the initial estimate will provide a new estimate of the dense motion
field.
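One iteration of Equations (14), (15), (18) and (19) might be sketched as follows. This is a Python sketch under the assumption of periodic border handling via np.roll; the patent and the Horn and Schunck article instead use boundary-aware difference masks.

```python
import numpy as np

def horn_schunck_step(Et, Ex, Ey, u, v, alpha2):
    """One iteration of Equations (18)-(19).  Et, Ex, Ey are the brightness
    partial derivative fields, (u, v) the current correction-field estimate,
    and alpha2 the smoothness constant alpha^2."""
    def neighborhood_avg(f):
        # Weighted averages of Equations (14)-(15): 1/6 for the four direct
        # neighbors, 1/12 for the four diagonal neighbors.
        direct = (np.roll(f, 1, 0) + np.roll(f, -1, 0)
                  + np.roll(f, 1, 1) + np.roll(f, -1, 1)) / 6.0
        diag = (np.roll(np.roll(f, 1, 0), 1, 1) + np.roll(np.roll(f, 1, 0), -1, 1)
                + np.roll(np.roll(f, -1, 0), 1, 1) + np.roll(np.roll(f, -1, 0), -1, 1)) / 12.0
        return direct + diag
    u_bar, v_bar = neighborhood_avg(u), neighborhood_avg(v)
    # Shared bracketed term of Equations (18)-(19).
    common = (Ex * u_bar + Ey * v_bar + Et) / (alpha2 + Ex**2 + Ey**2)
    return u_bar - Ex * common, v_bar - Ey * common
```

Repeating the step until the updates become negligibly small corresponds to the convergence criterion described above.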
The calculation of the estimated dense motion field is then used in a
hierarchical pyramid as shown in Fig. 7. In this pyramid, the finest grid 51 corresponds to the pixel display, wherein each square of the grid represents one pixel.
The other grids 53 and 55 of the pyramid represent progressively coarser grids
representing the same image with larger pixels. In accordance with the invention, an
initial estimate of the dense motion field is determined for the coarsest grid 55 in the pyramid and the above-described method is then used to calculate a new estimate of the dense motion field for this coarsest grid 55. This new estimate of the coarsest grid
then becomes the initial estimate for the middle grid 53 and a new motion field is
calculated by the method described above for the middle grid 53. This new motion
field estimate then becomes the initial estimate for the finest grid 51 and the
calculation is repeated for the finest grid to produce an estimate of the dense motion
field for the finest grid 51. The number of grids shown in the pyramid is only an
example, and a greater number of grids can be used if desired.
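The coarse-to-fine refinement over the pyramid can be sketched as follows. This is an illustrative Python sketch: the callback `estimate_correction`, the 2x2 block-average reduction, and the factor-of-two vector expansion are assumptions standing in for the motion estimation kernel and the filtering and expansion steps described above.

```python
import numpy as np

def downsample(frame):
    """Halve resolution by averaging 2x2 pixel blocks (one simple choice of
    reduction; the down-sampling filter is not fixed by the patent)."""
    h, w = frame.shape[0] // 2 * 2, frame.shape[1] // 2 * 2
    f = frame[:h, :w]
    return 0.25 * (f[0::2, 0::2] + f[0::2, 1::2] + f[1::2, 0::2] + f[1::2, 1::2])

def pyramid_flow(frame0, frame1, levels, estimate_correction):
    """Coarse-to-fine refinement over the hierarchical pyramid (Fig. 7).
    `estimate_correction(f0, f1, U, V)` returns the correction field (u, v)
    for the given initial estimate at one resolution."""
    pairs = [(frame0, frame1)]
    for _ in range(levels - 1):
        pairs.append((downsample(pairs[-1][0]), downsample(pairs[-1][1])))
    # Start from a zero initial motion field on the coarsest grid 55.
    U = np.zeros_like(pairs[-1][0])
    V = np.zeros_like(pairs[-1][0])
    for f0, f1 in reversed(pairs):
        if U.shape != f0.shape:
            # Expand to the next finer grid: pixel replication, with the
            # vector lengths doubled to match the finer coordinate scale.
            U = 2.0 * np.kron(U, np.ones((2, 2)))[:f0.shape[0], :f0.shape[1]]
            V = 2.0 * np.kron(V, np.ones((2, 2)))[:f0.shape[0], :f0.shape[1]]
        u, v = estimate_correction(f0, f1, U, V)
        U, V = U + u, V + v  # add the correction by vector addition
    return U, V
```

Each new estimate on a coarser grid becomes the initial estimate for the next finer grid, exactly as in the grid 55 to grid 53 to grid 51 progression described above.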
In the above-described systems, the equations for solving for the dense motion
field have been expressed in the rectangular coordinate system. It will be apparent
that the system is not limited to calculations employing rectangular coordinates and
other coordinate systems could be used in the solution, such as, for example, polar
coordinates.
The above description is of a preferred embodiment of the invention and
modification may be made thereto without departing from the spirit and scope of the
invention, as defined in the appended claims.

Claims

Claims
1. In a method for generating a dense motion field to represent the motion of
image components from frame to frame in a motion picture wherein an initial estimate
of said dense motion field is made, a correction dense motion field is calculated and
said correction dense motion field is added to said initial estimate of said dense
motion field to provide a new estimate of said dense motion field, the improvement
wherein said correction dense motion field is determined by estimating partial
derivatives of the brightness of the images in said motion picture with respect to time
and space by the difference of pixel brightness at positions in said motion picture
frames determined in accordance with said initial estimate of said dense motion field,
and using said partial derivatives of brightness to calculate said correction dense
motion field.
2. A method as recited in claim 1, wherein said dense motion field represents the change in position of image components between sequential frames of said motion
picture and wherein said differences in pixel brightness are determined for each vector
of said initial estimate in the proximity of the corresponding image component at
locations in said sequential frames displaced from each other by the corresponding
vector.
3. A method as recited in claim 2, wherein the partial derivatives of brightness
with respect to time are determined for each vector of said initial estimate by the
differences in brightness of pixels between said sequential frames and the partial
derivatives of brightness with respect to space being determined by incremental differences in the brightnesses between pixels in both of said sequential frames.
4. A method as recited in claim 3, wherein the brightness partial derivatives
with respect to time Et and with respect to space Ex and Ey are calculated as:
E_t,x,y(t, x, y, U, V) = { (+, +, +) Ē(T+, X_U+1, Y_V+1)
                           (+, -, +) Ē(T+, X_U,   Y_V+1)
                           (+, -, -) Ē(T+, X_U,   Y_V)
                           (+, +, -) Ē(T+, X_U+1, Y_V)
                           (-, +, +) Ē(T-, X_U^-+1, Y_V^-+1)
                           (-, -, +) Ē(T-, X_U^-,   Y_V^-+1)
                           (-, -, -) Ē(T-, X_U^-,   Y_V^-)
                           (-, +, -) Ē(T-, X_U^-+1, Y_V^-) } / 4,
in which each sign triple gives the signs with which the corresponding term enters
Et, Ex and Ey, respectively, and the values Ē are mean brightness values estimated at
coordinate points in successive frames of said motion picture, and wherein
T+ = t + 1,   T- = t,
X_U = x + (1 - λ)U,   X_U^- = x - λU,
Y_V = y + (1 - λ)V,   Y_V^- = y - λV,
in which λ is a value between 0 and 1 and x and y are the coordinates of a coordinate
point in the motion picture frames.
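The eight-term estimator of claim 4 can be realized in code roughly as follows. This is a sketch under stated assumptions: the mean brightness values Ē are taken here as bilinearly interpolated samples with border clamping, and the λ-split of the displacement between the two frames (forward in frame t+1, backward in frame t) is an assumption consistent with claim 2's requirement that the sample locations in the two frames be displaced from each other by the corresponding vector:

```python
import numpy as np

def bilinear(img, x, y):
    """Sample img at real-valued coordinates by bilinear interpolation,
    clamping coordinates at the image borders (one way to realize Ē)."""
    h, w = img.shape
    x = np.clip(x, 0.0, w - 1.0)
    y = np.clip(y, 0.0, h - 1.0)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    return (img[y0, x0] * (1 - fx) * (1 - fy) + img[y0, x1] * fx * (1 - fy)
            + img[y1, x0] * (1 - fx) * fy + img[y1, x1] * fx * fy)

def brightness_derivatives(frame0, frame1, u, v, lam=0.5):
    """Estimate Et, Ex, Ey over a 2x2x2 cube whose spatial corners in the
    two frames are displaced by the current motion estimate (u, v):
    frame1 (time t+1) is sampled at x + (1-lam)*u, frame0 (time t) at
    x - lam*u, so the two sample positions differ by the full vector u."""
    h, w = frame0.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    xp, yp = xs + (1 - lam) * u, ys + (1 - lam) * v  # X_U, Y_V at T+
    xm, ym = xs - lam * u, ys - lam * v              # X_U^-, Y_V^- at T-
    # Eight cube-corner values, ordered (x,y), (x+1,y), (x,y+1), (x+1,y+1):
    Ep = [bilinear(frame1, xp + dx, yp + dy) for dy in (0, 1) for dx in (0, 1)]
    Em = [bilinear(frame0, xm + dx, ym + dy) for dy in (0, 1) for dx in (0, 1)]
    # Apply the sign triples: first sign -> Et, second -> Ex, third -> Ey.
    Et = (sum(Ep) - sum(Em)) / 4.0
    Ex = (Ep[1] + Ep[3] - Ep[0] - Ep[2] + Em[1] + Em[3] - Em[0] - Em[2]) / 4.0
    Ey = (Ep[2] + Ep[3] - Ep[0] - Ep[1] + Em[2] + Em[3] - Em[0] - Em[1]) / 4.0
    return Et, Ex, Ey
```

For a static image with a unit horizontal brightness ramp and a zero motion estimate, the estimator yields Ex ≈ 1, Ey = 0 and Et = 0 away from the borders, matching the averaged forward differences of Horn and Schunck.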
5. A system for generating a dense motion field comprising storage means
connected to receive and store successive frames of a motion picture and a data
processor connected to receive the data representing said motion picture frames and
programmed to generate a dense motion field representing the motion of image
components from frame to frame, wherein an initial estimate of said dense motion field is made and wherein said data processor is programmed to calculate a correction
dense motion field which is added to the initial estimate of said dense motion field to provide a new estimate of said dense motion field, the improvement wherein said data processor is programmed to determine said correction dense motion field by
estimating partial derivatives of brightness of the images in said motion picture with
respect to time and space by the difference of the pixel brightness at positions in said
motion picture frames determined in accordance with said initial estimate of said
dense motion field and using said partial derivatives of brightness to calculate said
correction dense motion field.
6. A system as recited in claim 5, wherein said dense motion field represents a
change in position of image components between sequential frames of said motion
picture stored in said storage means and wherein said data processor determines
said differences in pixel brightness for each vector of said initial estimate in the proximity of the corresponding image component at locations in said sequential
frames displaced from each other by the corresponding vector.
7. A system as recited in claim 6, wherein the partial derivatives of brightness with respect to time are determined for each vector of said initial estimate by the
differences in brightness of pixels between said sequential frames and the partial
derivatives of brightness with respect to space are determined by incremental
differences in the brightnesses between pixels in both of said sequential frames.
8. A system as recited in claim 7, wherein said data processor is programmed to determine the brightness partial derivative with respect to time Et and the brightness
derivatives with respect to space Ex and Ey in accordance with
E_t,x,y(t, x, y, U, V) = { (+, +, +) Ē(T+, X_U+1, Y_V+1)
                           (+, -, +) Ē(T+, X_U,   Y_V+1)
                           (+, -, -) Ē(T+, X_U,   Y_V)
                           (+, +, -) Ē(T+, X_U+1, Y_V)
                           (-, +, +) Ē(T-, X_U^-+1, Y_V^-+1)
                           (-, -, +) Ē(T-, X_U^-,   Y_V^-+1)
                           (-, -, -) Ē(T-, X_U^-,   Y_V^-)
                           (-, +, -) Ē(T-, X_U^-+1, Y_V^-) } / 4,
in which the values Ē are mean brightness values calculated at coordinate points in successive frames of said motion picture, expressed
with the abbreviations
T+ = t + 1,   T- = t,
X_U = x + (1 - λ)U,   X_U^- = x - λU,
Y_V = y + (1 - λ)V,   Y_V^- = y - λV,
in which λ is a value between 0 and 1 and x and y are the coordinates of coordinate points in the motion picture frames.
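Claims 1 and 5 leave open how the partial derivatives are turned into a correction field. Purely as an illustrative sketch (the claims do not prescribe a particular solver), a gradient-based choice is the iteration of Horn and Schunck's "Determining optical flow" (cited in the search report), which balances the brightness-constancy residual Ex·dU + Ey·dV + Et against a smoothness term weighted by alpha:

```python
import numpy as np

def neighbor_mean(f):
    """4-neighbour average with replicated borders."""
    g = np.pad(f, 1, mode='edge')
    return (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]) / 4.0

def correction_field(Et, Ex, Ey, alpha=1.0, num_iter=60):
    """Compute a correction field (dU, dV) from brightness derivatives by
    a Horn-Schunck-style iteration (an assumed, illustrative solver)."""
    dU = np.zeros_like(Et)
    dV = np.zeros_like(Et)
    for _ in range(num_iter):
        Ua, Va = neighbor_mean(dU), neighbor_mean(dV)
        # Shared update factor from the gradient constraint:
        t = (Ex * Ua + Ey * Va + Et) / (alpha ** 2 + Ex ** 2 + Ey ** 2)
        dU = Ua - Ex * t
        dV = Va - Ey * t
    return dU, dV
```

Fed with the warped derivative estimates of claims 4 and 8, this correction is what gets added back to the initial estimate to produce the new dense motion field.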
PCT/US2001/019012 2000-06-14 2001-06-14 System for the estimation of optical flow WO2001096982A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US59352100A 2000-06-14 2000-06-14
US09/593,521 2000-06-14

Publications (2)

Publication Number Publication Date
WO2001096982A2 true WO2001096982A2 (en) 2001-12-20
WO2001096982A3 WO2001096982A3 (en) 2002-03-21

Family

ID=24375046

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/019012 WO2001096982A2 (en) 2000-06-14 2001-06-14 System for the estimation of optical flow

Country Status (1)

Country Link
WO (1) WO2001096982A2 (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5265172A (en) * 1989-10-13 1993-11-23 Texas Instruments Incorporated Method and apparatus for producing optical flow using multi-spectral images
US5500904A (en) * 1992-04-22 1996-03-19 Texas Instruments Incorporated System and method for indicating a change between images
US5881170A (en) * 1995-03-24 1999-03-09 Matsushita Electric Industrial Co., Ltd. Contour extraction apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HORN BERTHOLD K.P. ET AL.: 'Determining optical flow' ARTIFICIAL INTELLIGENCE vol. 17, pages 185 - 203, XP000195787 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002077920A2 (en) * 2001-03-26 2002-10-03 Dynapel Systems, Inc. Method and system for the estimation and compensation of brightness changes for optical flow calculations
WO2002077920A3 (en) * 2001-03-26 2003-09-18 Dynapel Systems Inc Method and system for the estimation and compensation of brightness changes for optical flow calculations
US6959118B2 (en) 2001-03-26 2005-10-25 Dynapel Systems, Inc. Method and system for the estimation and compensation of brightness changes for optical flow calculations
WO2005006762A3 (en) * 2003-07-02 2005-02-10 Queen Mary & Westfield College Optical flow estimation method
US7822231B2 (en) 2003-07-02 2010-10-26 Queen Mary And Westfield College, University Of London Optical flow estimation method
US8170109B2 (en) 2006-03-23 2012-05-01 Nds Limited System for analysis of motion

Also Published As

Publication number Publication date
WO2001096982A3 (en) 2002-03-21

Similar Documents

Publication Publication Date Title
CN109859126B (en) Video noise reduction method and device, electronic equipment and storage medium
US8768069B2 (en) Image enhancement apparatus and method
KR101498532B1 (en) Digital processing method and system for determination of optical flow
Wu et al. Image registration using wavelet-based motion model
JP5385969B2 (en) Method and apparatus for super-resolution of images
Ghosal et al. A fast scalable algorithm for discontinuous optical flow estimation
KR20190065287A (en) Prediction of depth from image data using statistical model
Wu et al. Optical flow estimation using wavelet motion model
EP0652536B1 (en) Method and apparatus for estimating motion vector fields by rejecting local outliers
JP5107409B2 (en) Motion detection method and filtering method using nonlinear smoothing of motion region
KR20080063770A (en) Extracting a moving object boundary
US20110091074A1 (en) Moving object detection method and moving object detection apparatus
CN108986150B (en) Image optical flow estimation method and system based on non-rigid dense matching
US20120062548A1 (en) Reducing viewing discomfort
US20090180032A1 (en) Method and system for hierarchical motion estimation with multi-layer sub-pixel accuracy and motion vector smoothing
Adarve et al. A filter formulation for computing real time optical flow
Whitaker A level-set approach to image blending
Vazquez et al. Joint multiregion segmentation and parametric estimation of image motion by basis function representation and level set evolution
CN113259605A (en) Video matting method, system and storage medium based on prediction foreground mask prediction
Al Ismaeil et al. Real-time enhancement of dynamic depth videos with non-rigid deformations
WO2001096982A2 (en) System for the estimation of optical flow
Al-Regib et al. Hierarchical motion estimation with content-based meshes
CN103618904B (en) Motion estimation method and device based on pixels
CN113240705A (en) 3D attitude estimation method and device, electronic equipment and storage medium
Nomura Spatio-temporal optimization method for determining motion vector fields under non-stationary illumination

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): JP NO

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

121 Ep: the epo has been informed by wipo that ep was designated in this application
AK Designated states

Kind code of ref document: A3

Designated state(s): JP NO

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP