WO2015158570A1 - System, method for computing depth from video - Google Patents

System, method for computing depth from video

Info

Publication number
WO2015158570A1
Authority
WO
WIPO (PCT)
Prior art keywords
current
depth value
previous
image
pixel location
Prior art date
Application number
PCT/EP2015/057534
Other languages
French (fr)
Inventor
Wilhelmus Hendrikus Alfonsus Bruls
Meindert Onno Wildeboer
Original Assignee
Koninklijke Philips N.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke Philips N.V. filed Critical Koninklijke Philips N.V.
Publication of WO2015158570A1 publication Critical patent/WO2015158570A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/579Depth or shape recovery from multiple images from motion
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/261Image signal generators with monoscopic-to-stereoscopic image conversion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20172Image enhancement details
    • G06T2207/20182Noise reduction or smoothing in the temporal domain; Spatio-temporal filtering

Definitions

  • the invention relates to computing depth values from a video sequence.
  • the video sequence comprises video frames, each video frame being an image having two dimensions: a horizontal and a vertical dimension. Depth values represent a third dimension in addition to said horizontal and vertical dimension.
  • the video sequence and corresponding depth values may be converted to a three-dimensional (3D) format for being displayed on a 3D display.
  • EP 2629531 describes a method for computing a depth map from a video sequence by using motion-based depth cues. The method determines motion vectors for an image of the video sequence, wherein each of the motion vectors corresponds to an image pixel of the image. A depth value corresponding to an image pixel is then computed as being proportional to the length of the motion vector corresponding to that image pixel. The depth values combined for the entire image then form a depth map.
  • a drawback of said method is that meaningful depth values can only be obtained for an image of the video sequence when motion is present in said image. In contrast, when the image presents a static scene, a depth map cannot be computed.
  • An aspect of the invention is a system for computing a depth value from a video sequence, the video sequence comprising a sequence of images, each of the images comprising a two-dimensional array of image pixels, each of the image pixels having a respective pixel location in the two-dimensional array, the depth value representing depth of image content at a current pixel location, the current pixel location being a pixel location in a current image in the video sequence, the system comprising: a current motion unit arranged for determining a current local motion vector representing motion of image content at the current pixel location; a current depth unit arranged for determining a current dynamic depth value based on the current local motion vector; a previous pixel location unit arranged for determining a previous pixel location being a pixel location in a previous image of the video sequence, the previous pixel location comprising image content corresponding to the image content at the current pixel location; a previous motion unit arranged for determining a previous local motion vector representing motion of the image content at the previous pixel location; a previous depth unit arranged for determining a previous dynamic depth value based on the previous local motion vector; and a depth unit arranged for determining the depth value based on the current dynamic depth value and the previous dynamic depth value.
  • the video sequence comprises a sequence of images (video frames), each image comprising a two-dimensional array of image pixels.
  • Output of the system is a depth value for a current image, being one of the images of the video sequence.
  • the depth value represents depth of image content at the current pixel location, which is a pixel location in the current image.
  • the depth value may be one of the depth values of a depth map, wherein the depth map holds all depth values corresponding to the respective image pixels of the current image.
  • the depth value is based on two dynamic depth values from two respective images, being (a) the current dynamic depth value from the current image, and (b) the previous dynamic depth value from the previous image.
  • the current depth value represents depth of image content at said current pixel location
  • the previous depth value represents depth of image content at the previous pixel location.
  • the previous pixel location is a pixel location in the previous image, and image content at that previous pixel location corresponds to image content at the current pixel location.
  • the current local motion vector is determined.
  • the current local motion vector represents motion of image content at the current pixel location.
  • the current local motion vector may have a horizontal and vertical component indicating a two-dimensional displacement of the image content at the current pixel location.
  • the local motion vector may refer to a displacement of said image content between the current image and an image adjacent to the current image in the video sequence.
  • determining the current local motion vector may be done using a motion estimation algorithm that estimates motion of the image content at the current pixel location; alternatively, the current local motion vector may be selected from local motion vectors already provided with the video sequence.
  • the current dynamic depth value is a motion-based depth value from the current image, and is based on the current local motion vector.
  • the length of the current local motion vector may be used as an indicator of the current dynamic depth value.
  • the previous pixel location is determined as a pixel location in the previous image where the image content matches the image content at the current pixel location (which is in the current image).
  • the image content at the current pixel location and the image content at the previous pixel location are related.
  • the image content may portray a portion of a face of an actor, for example.
  • image content at the current and previous pixel locations corresponds to different moments in time, namely the moments in time of the current and previous image, respectively.
  • Determining the previous pixel location may be done, for example, by applying a motion estimation algorithm (or other image matching algorithm) to locate a pixel location where the image content matches the image content at the current pixel location, and to determine the previous pixel location as that located pixel location.
  • the 'previous pixel location' consistently refers to (a pixel location in) the previous image
  • the 'current pixel location' consistently refers to (a pixel location in) the current image
  • the previous local motion vector is determined.
  • the previous local motion vector represents motion of image content at the previous pixel location, analogous to the current local motion vector at the current pixel location.
  • the previous local motion vector may refer to a displacement of said image content between the previous image and an image adjacent to the previous image in the video sequence.
  • the previous dynamic depth value is based on the previous local motion vector, which represents motion of image content at the previous pixel location. Determining the previous local motion vector for the previous pixel location is done in an analogous manner as determining the current local motion vector for the current pixel location. The previous dynamic depth value is thus a motion-based depth value from the previous image.
  • the depth value is determined based on the current dynamic depth value and the previous dynamic depth value.
  • the depth value may be computed as a linear combination of the current dynamic depth value and the previous dynamic depth value.
  • an effect of the invention is therefore that the depth value is not only dependent on motion in the current image, but may also benefit from motion in the previous image.
  • the depth value may be improved by using a motion-based depth value from the previous image (the previous dynamic depth value), in addition to the motion-based depth value from the current image (the current dynamic depth value).
  • the depth unit is arranged for determining the depth value by computing the depth value as a combination of the current dynamic depth value and the previous dynamic depth value, the relative contributions of the current dynamic depth value and the previous dynamic depth value in the combination being defined by a current weight factor and a previous weight factor, respectively.
  • the current weight factor and the previous weight factor may be used to make the depth value more reliant on the current image or on the previous image. In such a way, the weight factors may effectively enable a tuning of the depth value towards (a) using motion from the current image or (b) using motion from the previous image.
  • the depth value may be computed as a linear combination of the current dynamic depth value and the previous dynamic depth value, thus as a sum of (a) the current dynamic depth value multiplied by the current weight factor and (b) the previous dynamic depth value multiplied by the previous weight factor.
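  • As an illustration, the weighted linear combination described above may be sketched as follows (a minimal sketch in Python; the function name and the example numbers are illustrative, not taken from the patent):

```python
def combine_depth(d_current: float, d_previous: float, w_current: float) -> float:
    """Linear combination of the current and previous dynamic depth values.

    The previous weight factor is taken here as the complement of the
    current weight factor, so that the two contributions sum to 1.0.
    """
    w_previous = 1.0 - w_current
    return w_current * d_current + w_previous * d_previous

# Example: rely strongly on the current image (current weight factor 0.9).
depth = combine_depth(d_current=5.0, d_previous=2.0, w_current=0.9)  # 4.7
```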
  • the depth unit is further arranged for defining the current weight factor by determining a local motion indicator indicating an amount of motion at the current pixel location, and determining at least one of the current weight factor and the previous weight factor based on the local motion indicator.
  • the determined amount of motion may indicate a large amount of motion at the current pixel location and, in response, the current weight factor may be determined to be 0.9 and the previous weight factor may be determined to be 0.1.
  • the depth value then depends more strongly on the current dynamic depth value than on the previous dynamic depth value, because the current weight factor is large (i.e. close to 1.0).
  • the depth unit is arranged for determining the local motion indicator by computing the local motion indicator based on one of: a length of the current local motion vector, an absolute value of a horizontal component of the current local motion vector, and an absolute value of a vertical component of the current local motion vector.
  • the local motion indicator is an indicator of the amount of motion present at the current pixel location.
  • the local motion indicator may be computed as the length of the current local motion vector or as the absolute value of one of its components.
  • the depth unit is arranged for computing the depth value by determining the current weight factor based on a difference between the amount of motion at the current pixel location and the amount of motion at the previous pixel location. For example, the depth unit may determine that the amount of motion at the current pixel location is smaller than the amount of motion at the previous pixel location. The image content at the current pixel location then effectively 'decelerates'. By consequently decreasing the current weight factor, the depth value relies less on motion at the current pixel location and consequently more on motion at the previous pixel location. The depth value thus relies more on depth from the previous image in this case, as in the sketch below.
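  • A minimal sketch of such a deceleration-based weighting rule, assuming an illustrative linear ramp between a minimum and a default weight (neither value is prescribed by the patent):

```python
def current_weight(motion_current: float, motion_previous: float,
                   w_default: float = 0.9, w_min: float = 0.1) -> float:
    """Decrease the current weight factor when the image content decelerates.

    If the amount of motion at the current pixel location is smaller than at
    the previous pixel location, the depth value should rely more on the
    previous image, so the current weight factor is scaled down.
    """
    if motion_current >= motion_previous:
        return w_default
    ratio = motion_current / max(motion_previous, 1e-6)  # degree of deceleration
    return w_min + (w_default - w_min) * ratio
```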
  • the current depth unit is arranged for determining the current dynamic depth value by computing the current dynamic depth value based on one of: a length of the current local motion vector, an absolute value of a horizontal component of the current local motion vector, and an absolute value of a vertical component of the current local motion vector.
  • the length of the current local motion vector is used to compute the current dynamic depth value.
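  • A minimal sketch of computing the current dynamic depth value from the current local motion vector (dX, dY); which of the three variants is used is a design choice:

```python
import math

def dynamic_depth(dx: float, dy: float, mode: str = "length") -> float:
    """Motion-based depth indicator derived from a local motion vector (dX, dY)."""
    if mode == "length":
        return math.hypot(dx, dy)  # vector length sqrt(dX^2 + dY^2)
    if mode == "horizontal":
        return abs(dx)             # absolute horizontal component
    if mode == "vertical":
        return abs(dy)             # absolute vertical component
    raise ValueError(f"unknown mode: {mode}")
```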
  • the current motion unit is arranged for determining the current local motion vector by: (a) determining a local motion vector indicating motion of the image content at the current pixel location, (b) determining a global motion vector indicating global motion of image content of at least a region of the current image, and (c) computing the current local motion vector as a relative motion vector based on the local motion vector relative to the global motion vector.
  • a current dynamic depth value based on the relative motion vector compensates for movement of the background in a video scene, because a static foreground in front of a moving background and a moving foreground in front of a static background will both correspond to a large relative motion of the foreground. In both cases, the current dynamic depth value will therefore be larger if the current pixel location is part of the foreground than if it is part of the background.
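  • A minimal sketch of computing the current local motion vector as a relative motion vector; the plain average used here for the global motion vector is one option, with trimmed means and medians as alternatives described further below:

```python
import numpy as np

def relative_motion_vector(local_vectors: np.ndarray, x: int, y: int) -> np.ndarray:
    """Local motion at (x, y) relative to the global motion of the image.

    local_vectors has shape (H, W, 2) and holds a (dX, dY) local motion
    vector per pixel. The global motion vector is computed here as the
    average of all local motion vectors.
    """
    global_vector = local_vectors.reshape(-1, 2).mean(axis=0)
    return local_vectors[y, x] - global_vector
```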
  • the depth unit is arranged for determining the depth value based on a static depth value being a non-motion-based depth value based on only the current image.
  • the static depth value may be determined even when motion is small or absent in the current image, in which case the current dynamic depth value may become unreliable.
  • the static depth value may be based on a so-called 'slant' being a depth map that defines depth as a vertical gradient, i.e. having large depth values (close to the viewer) at the bottom of the current image and having smaller depth values (farther away from the viewer) toward the top or middle of the current image.
  • the depth unit is further arranged for determining the depth value by (a) determining a combined dynamic depth value by combining the current dynamic depth value and the previous dynamic depth value, and (b) determining the depth value by combining the combined dynamic depth value with the static depth value into the depth value, the relative contributions of the combined dynamic depth value and the static depth value in the combining being dependent on an amount of global motion present in the current image. Based on the amount of global motion present in the current image, the depth value may rely more on the static depth value or more on the combined dynamic depth value.
  • the combined dynamic depth value may be determined by the depth unit of the system above, thus based on the current dynamic depth value and previous dynamic depth value.
  • the depth value may then be determined as a linear combination of the static depth value and the combined dynamic depth value.
  • the relative contributions of the static depth value and the combined dynamic depth value in said linear combination may be represented by respective weight factors. For example, if the amount of global motion is small, this may indicate that said combined dynamic depth value is unreliable. In such a case, it may be desirable to make the depth value more dependent on the static depth value. The latter may be achieved by means of a high relative contribution of the static depth value and a low relative contribution of the combined dynamic depth value.
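  • A minimal sketch of such a blend, assuming an illustrative saturation point for the amount of global motion (the patent does not prescribe specific weight curves):

```python
def blend_static_dynamic(d_dynamic: float, d_static: float,
                         global_motion: float, motion_scale: float = 4.0) -> float:
    """Blend the combined dynamic depth value with the static depth value.

    The relative contribution of the dynamic depth value grows with the
    amount of global motion; with little global motion the result falls
    back towards the static (non-motion-based) depth value.
    """
    w_dynamic = min(global_motion / motion_scale, 1.0)
    return w_dynamic * d_dynamic + (1.0 - w_dynamic) * d_static
```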
  • the previous pixel location unit is arranged for determining the previous pixel location in a non-motion-compensated manner.
  • the previous pixel location has the same (X,Y) coordinate in the 2D array of the previous image as the current pixel location has in the 2D array of the current image.
  • the previous pixel location is effectively determined by straightforwardly copying said (X,Y) coordinate from the current pixel location in the current image to the previous pixel location in the previous image.
  • the non-motion-compensated manner is straightforward, as no motion estimation is needed to determine the previous pixel location.
  • the system is further arranged for determining the depth value by using a predetermined non-linear function for limiting the depth value to a predetermined depth value range.
  • Large values of the current local motion vector may result in an excessive value of the current dynamic depth value that lies outside the predetermined depth value range.
  • the depth value may be used to convert the image pixel at the current pixel location to a three-dimensional format, in order to be displayed at a 3D-display with a limited output depth range.
  • limiting the depth value may be achieved by applying e.g. a hard- or soft-clipping function to the depth value.
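  • Minimal sketches of a hard- and a soft-clipping function for limiting the depth value; the tanh-based soft clip is an illustrative choice, as the patent does not name a specific function:

```python
import math

def hard_clip(depth: float, d_min: float, d_max: float) -> float:
    """Hard-clip the depth value to the display's output depth range."""
    return max(d_min, min(d_max, depth))

def soft_clip(depth: float, d_max: float) -> float:
    """Soft-clip: the output gradually tends to d_max for large inputs."""
    return d_max * math.tanh(depth / d_max)
```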
  • the previous pixel location unit is arranged to determine the previous pixel location in the previous image that corresponds to a later moment in time than the current image.
  • a depth map may be computed for each image in the video sequence, processing the images one-by-one, starting with the first image and ending with the last image of the video sequence (i.e. in a regular temporal order), or the other way around, thus starting with the last image and ending with the first image of the video sequence (i.e. in a reverse temporal order).
  • the regular temporal order implies that the video sequence is intended to be played starting with the first image (corresponding to an early time instance) and ending with the last image (corresponding to a later time instance).
  • a further aspect of the invention is a method for computing a depth value from a video sequence, the video sequence comprising a sequence of images, the depth value representing depth of image content at a current pixel location, the current pixel location being a pixel location in a current image in the video sequence, the video sequence comprising a sequence of images, each of the images comprising a two-dimensional array of image pixels, each of the image pixels having a respective pixel location in the two- dimensional array, the method comprising the steps of: determining a current local motion vector representing motion of image content at the current pixel location, determining a current dynamic depth value based on the current local motion vector, determining a previous pixel location being a pixel location in a previous image of the video sequence, the previous pixel location comprising image content corresponding to the image content at the current pixel location, determining a previous local motion vector representing motion of the image content at the previous pixel location, determining a previous dynamic depth value based on the previous local motion vector, and determining the depth value based
  • a further aspect of the invention is a three-dimensional display system comprising: a display unit comprising a display arranged for displaying a three-dimensional image; an input unit arranged for receiving a video sequence comprising a sequence of images, the video sequence comprising a sequence of images, each of the images comprising a two-dimensional array of image pixels, each of the image pixels having a respective pixel location in the two-dimensional array; a conversion unit arranged for converting an image of the video sequence to the three-dimensional image using a depth map comprising depth values of the image; and a processing unit comprising the system above, the processing unit being arranged for computing depth values of the depth map from the video sequence using the system above.
  • a further aspect of the invention is a computer program product comprising computer program code means adapted to perform all the steps of the method above when said computer program code is run on a computer.
  • FIG.1 illustrates a system for computing a depth value from a video sequence
  • FIG.2 illustrates the current motion unit being arranged to compute the current local motion vector as a relative local motion vector
  • FIG.3 illustrates the depth unit being arranged to compute the depth value
  • FIG.4 illustrates a method for computing the depth value from the video sequence
  • FIG.5 illustrates a 3D display system for showing a 3D image derived from a current image of the video sequence.
  • FIG.1 illustrates a system 100 for computing a depth value 141 from a video sequence 101.
  • the video sequence 101 comprises a sequence of images (video frames) which are two-dimensional, each image being a two-dimensional array of image pixels.
  • the depth value 141 corresponds to an image pixel, being an image pixel at a current pixel location in a current image of the video sequence.
  • the depth value 141 may be one of the depth values of a depth map, wherein the depth map holds all depth values corresponding to the respective image pixels of the current image.
  • the depth is defined as the third dimension in addition to the horizontal and vertical dimension of an image of the video sequence 101.
  • the depth dimension would be substantially perpendicular to the plane of the display.
  • image content (e.g. a foreground object) having large (high) depth values would be perceived as standing out of the display towards the viewer more than if said image content had a smaller (lower) depth value.
  • a large depth value corresponds to a smaller perceived distance to the viewer, whereas a smaller depth value corresponds to a larger perceived distance to the viewer.
  • a current motion unit 115 determines, from the video sequence 101, a current local motion vector 116 representing motion of image content at the current pixel location.
  • a current depth unit 110 determines a current dynamic depth value 111 from the current local motion vector 116, which serves as an indicator of depth at the current pixel location.
  • a previous pixel location unit 105 determines, from a previous image in the video sequence, a previous pixel location 106 having image content that matches image content at the current pixel location, thus in the current image.
  • a previous motion unit 125 determines, from the video sequence 101, a previous local motion vector 126 representing motion of the image content at the previous pixel location 106.
  • a previous depth unit 120 determines a previous dynamic depth value 121 from the previous local motion vector 126, which is used as an indicator of depth at the previous pixel location 106. Finally, a depth unit 140 determines the depth value 141 based on the current dynamic depth value 111 and the previous dynamic depth value 121.
  • 'image content at the current pixel location' refers to a small group of image pixels near the current pixel location.
  • the group of image pixels may be a small patch of 8x8 image pixels.
  • the image content at the current pixel location then refers to a portion of the current image portrayed by the small patch of 8x8 image pixels.
  • the 'current pixel location' in this document always refers to a pixel location in the current image
  • the 'previous pixel location' always refers to a pixel location in the previous image.
  • 'the current pixel location' thus implies 'the current pixel location in the current image'
  • 'the previous pixel location' thus implies 'the previous pixel location in the previous image'.
  • a current local motion vector 116 is thus determined for the current pixel location.
  • the current local motion vector 116 represents motion of the image content at the current pixel location.
  • the current local motion vector may have a horizontal and vertical component indicating a two-dimensional displacement of said image content.
  • the current local motion vector 116 (dX, dY) represents how said image content moves between (a) the current image and (b) an image directly succeeding the current image in the video sequence. Said image content moves by a number of pixels dX in the horizontal (X) direction and by a number of pixels dY in the vertical direction (Y).
  • the video sequence is a movie with a frame rate of 50 Hz, i.e. 50 images per second.
  • the current image may correspond to a moment in time t1.
  • Determining the current local motion vector 116 may be done by estimating local motion vectors for all image pixels of the current image, and then selecting, from the local motion vectors, the current local motion vector 116 corresponding to the current pixel location.
  • local motion vectors may be determined by searching for matching (similar) image content in both the current image and an adjacent image in the video sequence.
  • An example of such an algorithm is the so-called '3D Recursive Search' (see also 'True-motion estimation with 3-D recursive search block matching', IEEE Transactions on Circuits and Systems for Video Technology, Vol. 3, No. 5, Oct. 1993).
  • Motion may be determined on a pixel-basis, implying that a local motion vector is computed (estimated) for each image pixel in the current image.
  • Local motion vectors may initially also be determined on a block-basis, which implies that a single local motion vector is determined for each block of pixels in the current image, e.g. a block of 8x8 image pixels. In that case, the single local motion vector represents motion of every image pixel in the block of image pixels.
  • a refinement algorithm then may be applied to refine the block-specific local motion vector to a pixel- specific local motion vector.
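  • For illustration, a brute-force block-matching sketch is given below. It is a far simpler stand-in for 3D Recursive Search, shown only to make block-based motion estimation concrete; the block size and search range are illustrative:

```python
import numpy as np

def block_motion_vector(curr: np.ndarray, nxt: np.ndarray,
                        bx: int, by: int, block: int = 8,
                        search: int = 7) -> tuple:
    """Estimate (dX, dY) for one 8x8 block of the current image.

    Returns the displacement minimising the sum of absolute differences
    (SAD) between the block at (bx, by) in `curr` and a shifted block in
    the adjacent image `nxt` (both single-channel, e.g. luma).
    """
    ref = curr[by:by + block, bx:bx + block].astype(np.int32)
    best, best_sad = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or y + block > nxt.shape[0] or x + block > nxt.shape[1]:
                continue  # candidate block falls outside the image
            cand = nxt[y:y + block, x:x + block].astype(np.int32)
            sad = int(np.abs(ref - cand).sum())
            if best_sad is None or sad < best_sad:
                best, best_sad = (dx, dy), sad
    return best
```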
  • the current local motion vector 116 may be determined by selecting it from local motion vectors that are pre-computed and provided with the video sequence 101.
  • the video sequence 101 may be obtained by decoding an encoded video stream that also contains said pre-computed local motion vectors.
  • an MPEG-encoded video stream typically contains both the video sequence 101 and said local motion vectors. In such a case, motion estimation from the video sequence 101 is not needed, as the local motion vectors are readily available.
  • the current dynamic depth value 111 may instead be based on relative motion.
  • the current local motion vector 116 can be determined as a relative motion vector.
  • a relative local motion vector may be defined as a local motion vector relative to a global motion vector.
  • the global motion vector may represent the overall motion in the current image, and may be computed e.g. as an average of all local motion vectors in the current image.
  • the current local motion vector may thus be calculated as the relative motion vector by subtracting the global motion vector from the local motion vector at the current pixel location.
  • a local motion vector may also be referred to, in what follows, as an absolute local motion vector.
  • the previous motion unit 125 may be arranged to determine the previous local motion vector as a relative local motion vector.
  • the previous motion unit 125 is arranged in an analogous manner to the current motion unit 115, which is arranged to determine the current local motion vector as a relative local motion vector.
  • a benefit of computing the current local motion vector 116 as a relative local motion vector is that the current dynamic depth value 111 (which is based on the current local motion vector 116) is largely invariant to background movement in a scene of the video sequence.
  • a scene is defined here as multiple consecutive images of the video sequence that portray related image content (for example an actor) at respective consecutive time instances. Invariance to said background movement is explained based on the following two cases.
  • a scene comprising a fast moving foreground object before a static background is considered.
  • the foreground object moves in that scene, while the background is static.
  • using absolute local motion vectors as direct indicators of depth, depth values are assigned correctly to pixels: large depth values are assigned to pixels of the foreground object and small depth values are assigned to pixels of the background.
  • using relative local motion vectors as direct indicators of depth has the same result in this case, because the background is static, so that the relative local motion vectors are effectively absolute motion vectors.
  • a scene comprising a static foreground object before a fast moving background is considered.
  • the foreground object is static in that scene, while the background moves.
  • using absolute local motion vectors as direct indicators of depth, depth values are incorrectly assigned: large depth values are assigned to pixels of the background and small depth values are assigned to pixels of the foreground object.
  • using relative local motion vectors as direct indicators of depth has a different result in this case, because relative local motion vectors of the foreground object are large, whereas relative local motion vectors of the background are (by definition) small. Consequently, depth values are then assigned correctly to pixels: large depth values are assigned to pixels of the foreground object and small depth values are assigned to pixels of the background.
  • the current depth unit 110 is arranged to determine the current dynamic depth value 111 from the current local motion vector 116. This can be done in two steps. The first step is to determine an intermediate depth value. The second step is to map the intermediate depth value to the current dynamic depth value.
  • the intermediate depth value can be determined as the length of the current local motion vector 116. For example, if the length of the current local motion vector is 5 pixels, then the intermediate depth value is 5. Alternatively, the intermediate depth value can be determined as the absolute value of either the horizontal component or the vertical component of the current local motion vector 116.
  • the mapping function may include a suppression of large depth values. Such suppression may be accomplished by a function that has a low derivative for large values of x, i.e. for large values of the intermediate depth value.
  • the mapping function may be defined as y being proportional to the square root of x.
  • the mapping function may be defined as a soft-clipping function wherein y, representing the current dynamic depth, gradually tends to a predetermined maximum value for large values of x, representing the intermediate depth value.
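  • A minimal sketch of the second (mapping) step, with the square-root variant and a soft-clipping variant; the exponential soft clip is an illustrative choice of a function that gradually tends to a predetermined maximum:

```python
import math

def map_intermediate_depth(d_intermediate: float, d_max: float,
                           mode: str = "sqrt") -> float:
    """Map the intermediate depth value to the current dynamic depth value.

    'sqrt' suppresses large values (low derivative for large inputs);
    'softclip' gradually tends to the predetermined maximum d_max.
    """
    if mode == "sqrt":
        return math.sqrt(d_intermediate)
    if mode == "softclip":
        return d_max * (1.0 - math.exp(-d_intermediate / d_max))
    raise ValueError(f"unknown mode: {mode}")
```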
  • the conversion of the current image to a 3D image to be viewed on a 3D display may result in extreme depth values causing a discomfort for a viewer.
  • Such extreme depth values may also cause a problem in portraying the resulting 3D image on a 3D display that can only display a 3D image having limited depth output range.
  • An auto-stereoscopic 3D display is an example of such a 3D display having a limited depth output range.
  • the previous pixel location unit 105 is arranged to determine the previous pixel location 106.
  • a motion estimation algorithm may be used to locate the previous pixel location.
  • the previous pixel location 106 may be determined as a pixel location in the previous image where image content matches image content at the current pixel location.
  • the previous pixel location 106 is then determined in a motion-compensated manner.
  • the previous pixel location 106 can be determined as a copy of the current pixel location: the coordinates of the previous pixel location 106 within the previous image are the same as the coordinates of current pixel location within the current image.
  • the previous pixel location 106 is then determined in a non-motion-compensated manner.
  • An advantage of the non-motion-compensated manner may be that it is straightforward, because it requires no motion estimation.
  • image content at the previous pixel location may still match image content at the current pixel location when the image content moves only a little between the previous and current image.
  • the previous motion unit 125 is arranged to determine the previous local motion vector 126 at the previous pixel location 106, in an analogous manner as the current motion unit 115 determines the current local motion vector 116 at the current pixel location.
  • the previous depth unit 120 determines the previous dynamic depth value 121 based on the previous local motion vector 126, in an analogous manner as the current depth unit 110 determines the current dynamic depth value 111 based on the current local motion vector 116.
  • the depth unit 140 is arranged to compute the depth value 141 by combining the current dynamic depth value 111 and the previous dynamic depth value 121. Further below, where FIG.3 is discussed, the depth unit 140 will be explained in further detail.
  • FIG.2 illustrates the current motion unit 115 being arranged to compute the current local motion vector 116 as a relative local motion vector.
  • a sub-unit 230 is arranged to determine a local motion vector 231 at the current pixel location.
  • the local motion vector 231 may result directly from applying a motion estimation algorithm that determines the motion of image content at the current pixel location.
  • the current local motion vector 116 represents motion between the current image and an image adjacent to the current image in the video sequence 101.
  • a sub-unit 220 is arranged to determine a global motion vector 221, for example based on local motion vectors in the current image.
  • the sub-unit 220 may receive the local motion vectors from the sub-unit 230 or, alternatively, they may be provided with the video sequence.
  • a global motion vector 221 may be determined by computing the average of the local motion vectors of the current image.
  • the sub-unit 210 is arranged to compute the current local motion vector 116 as a relative motion vector, by subtracting the global motion vector 221 from the local motion vector 231.
  • the horizontal component (X) of the global motion vector 221 may be determined by computing a trimmed mean of the horizontal components of the respective local motion vectors.
  • the trimmed mean is an average of said horizontal components, wherein the largest 10% and smallest 10% of the horizontal components are excluded from that average.
  • the vertical component (Y) of the global motion vector 221 is determined by computing a trimmed mean of the vertical components of the respective local motion vectors.
  • the global motion vector is determined by computing the horizontal component of the global motion vector as a median of the horizontal components of the local motion vectors, and by computing the vertical component of the global motion vector as the median of the vertical components of the local motion vectors.
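  • Minimal sketches of the trimmed-mean and median variants for the global motion vector, computed per component from the per-pixel local motion vectors (shape (H, W, 2)):

```python
import numpy as np

def global_motion_trimmed_mean(local_vectors: np.ndarray, trim: float = 0.10) -> np.ndarray:
    """Per-component trimmed mean: the largest and smallest `trim` fractions
    of each component are excluded from the average."""
    flat = local_vectors.reshape(-1, 2)
    result = np.empty(2)
    for c in range(2):  # c = 0: horizontal (X), c = 1: vertical (Y)
        comp = np.sort(flat[:, c])
        k = int(len(comp) * trim)
        result[c] = comp[k:len(comp) - k].mean()
    return result

def global_motion_median(local_vectors: np.ndarray) -> np.ndarray:
    """Per-component median of the local motion vectors."""
    return np.median(local_vectors.reshape(-1, 2), axis=0)
```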
  • the global motion vector is determined by a so-called projection method, which does not require use of the local motion vectors.
  • the horizontal component of the global motion vector is computed as follows. A first profile is computed by vertically averaging all pixel lines of the current image. The first profile is thus a pixel line itself, comprising the vertically averaged pixel lines of the current image.
  • a second profile is computed by vertically averaging all pixel lines of the previous image. The horizontal component is then determined as the amount of horizontal pixels that the first profile needs to be shifted to best match the second profile. A best match may be determined as the shift for which the difference between the (shifted) first profile and the second profile is at a minimum.
  • the vertical component of the global motion vector is determined in an analogous way, thus based on a first and second profile obtained by horizontally averaging pixel columns of the current and previous image.
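  • A minimal sketch of the projection method for the horizontal component; the vertical component would be obtained analogously from column averages, and the search range is an illustrative choice:

```python
import numpy as np

def global_motion_horizontal(curr: np.ndarray, prev: np.ndarray,
                             max_shift: int = 16) -> int:
    """Horizontal global motion by matching vertically averaged profiles.

    Each profile is one pixel line obtained by averaging all pixel lines of
    an image; the returned shift is the one for which the overlapping parts
    of the shifted first profile and the second profile differ least.
    """
    p_curr = curr.mean(axis=0)  # first profile (from the current image)
    p_prev = prev.mean(axis=0)  # second profile (from the previous image)
    best_shift, best_err = 0, None
    for s in range(-max_shift, max_shift + 1):
        a = p_curr[max(0, s):len(p_curr) + min(0, s)]
        b = p_prev[max(0, -s):len(p_prev) + min(0, -s)]
        err = float(np.abs(a - b).mean())
        if best_err is None or err < best_err:
            best_shift, best_err = s, err
    return best_shift
```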
  • the global motion vector may be computed from a region in the current image, rather than from the entire image.
  • the region may consist of the current image excluding the center and the bottom of the current image. Since a foreground is expected to be at the center and bottom of the current image, the global motion vector is then computed from local motion vectors predominantly corresponding to the background.
  • FIG.3 illustrates the depth unit 140 being arranged to compute the depth value 141.
  • the depth unit 140 may be arranged to combine the current dynamic depth value 111 and the previous dynamic depth value 121 using respective weight factors.
  • the depth unit 140 includes a sub-unit 310 and a sub-unit 320.
  • Sub-unit 320 is arranged to combine the current dynamic depth value 111 and the previous dynamic depth value 121 into the depth value 141, using a current weight factor 311.
  • the units 110 and 120 provide the current dynamic depth value 111 and the previous dynamic depth value 121, respectively.
  • the current weight factor 311 is provided by the sub-unit 310 and determines to what extent the depth value 141 depends on the current dynamic depth value 111 and on the previous dynamic depth value 121. For example, the depth value 141 is determined as a weighted average:

    D = W1·D1 + W2·D2    (Eq. 1)

  • W1 and W2 represent the current weight factor and the previous weight factor, respectively, while D, D1, and D2 represent the depth value 141, the current dynamic depth value 111, and the previous dynamic depth value 121, respectively.
  • the weight factors W1 and W2 are typically numbers between 0.0 and 1.0.
  • the weight factors W1 and W2 may be related according to the following equation:

    W2 = 1 - W1

  • sub-unit 310 then only needs to provide W1 to sub-unit 320, as sub-unit 320 derives W2 from W1.
  • the depth value 141 is based on more than two images, i.e. more than the current image and the previous image.
  • the depth value 141 may be determined based on multiple depth values from respective multiple images.
  • the depth value 141 may be based on a first, second, third, and fourth dynamic depth value from a respective first, second, third and fourth image.
  • the terms 'first' and 'second' are used as substitutes for the terms 'current' and 'previous'.
  • Each of the first, second, third and fourth depth values may be based on a first, second, third and fourth local motion vector.
  • the depth value 141 may be computed as

    D = W1·D1 + W2·D2 + W3·D3 + W4·D4    (Eq. 2)

  • D3 and D4 represent the third and fourth dynamic depth value, respectively.
  • Eq. 2 describes a finite impulse response (FIR) filter that has an output D, input values D1, D2, D3, and D4, and respective filter coefficients W1, W2, W3, and W4.
  • computation of the depth value D 141 is implemented as an infinite impulse response (IIR) filter as follows:

    D  = W·D1 + (1 - W)·C2    (Eq. 3)
    C2 = W·D2 + (1 - W)·C3    (Eq. 4)
    C3 = W·D3 + (1 - W)·C4    (Eq. 5)

  • D represents a 'temporal update' of the cumulative depth value C2, and is calculated as a weighted average of D1 and the cumulative depth value C2.
  • the cumulative depth value C2 is computed as a 'temporal update' of the cumulative depth value C3, thus as a weighted average of D2 and the cumulative depth value C3.
  • C3 is computed as a 'temporal update' of a cumulative depth value C4 of a fourth image, etcetera.
  • the depth value 141 is thus computed using a temporal recursive filter, (a) because of the recursion described by Eqs. 3-5 and (b) because the depth value is derived from images in the video sequence 101 associated with different time instances.
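  • A minimal sketch of this temporal recursive filter, iterating Eqs. 3-5 from the oldest image forward; the example values are illustrative:

```python
def recursive_depth(dynamic_depths: list, w: float) -> float:
    """IIR-style temporal update of the depth value (Eqs. 3-5).

    dynamic_depths holds [D1, D2, D3, ...], i.e. the motion-based depth
    values of the current image and the preceding images, most recent
    first. Each step is a weighted average of a dynamic depth value and
    the cumulative depth value of the images before it.
    """
    cumulative = dynamic_depths[-1]          # start at the oldest image
    for d in reversed(dynamic_depths[:-1]):  # work forward in time
        cumulative = w * d + (1.0 - w) * cumulative
    return cumulative

# Example with W = 0.8: D = 0.8*5.0 + 0.2*(0.8*4.0 + 0.2*1.0) = 4.68
depth = recursive_depth([5.0, 4.0, 1.0], w=0.8)
```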
  • the effect of varying the current weight factor W 311 is as follows. If W is close to 1.0, then D is very similar to D1 and depends only weakly (or to a minor extent) on images preceding the current image. This means that when the video sequence 101 is shown in real time on a 3D-display, the depth value 141 is less dependent on depths from preceding images. If W is close to 0.0, then D is very similar to D2 and largely depends on images preceding the current image. In other words, the depth value 141 then includes only a relatively small contribution of the current dynamic depth value 111. When the video sequence 101 is shown in real time on a 3D-display, in such a case, the depth value 141 is strongly dependent on depths from preceding images.
  • When W is close to 1.0, the depth value D 141 thus depends weakly on depths from preceding images. Consequently, D depends strongly on depth from the current image. D is therefore based on 'up-to-date' image content, because D represents depth at the current pixel location, which is in the current image, while D is strongly based on depth from that same, current image. A benefit of W being close to 1.0 may thus be that the depth value D 141 is based on up-to-date image content.
  • D is then computed to only a minor extent from motion in preceding images of the video sequence, even when the current image itself contains little or no motion from which to compute meaningful depth.
  • a drawback of W being close to 1.0 may thus be that the depth value D 141 may not represent a meaningful value when the current image itself contains little or no motion.
  • When W is close to 0.0, the depth value D 141 thus depends strongly on depths from preceding images. Consequently, D depends weakly on depth from the current image. D is therefore not based on 'up-to-date' image content, because D represents depth at the current pixel location, which is in the current image, while D is only weakly based on depth from that same, current image. A drawback of W being close to 0.0 may thus be that the depth value D 141 is not based on up-to-date image content.
  • Conversely, when W is close to 0.0, D can be computed from motion in preceding images of the video sequence, even when the current image itself contains little or no motion from which to compute meaningful depth.
  • a benefit of W being close to 0.0 may thus be that the depth value D 141 may be computed from motion in the video sequence, even when the current image itself contains little or no motion.
  • D may be computed primarily from motion in the previous image when the current image contains little or no motion.
  • a predetermined threshold for the amount of motion at the current pixel location may be two, for example. Then, if the length sqrt(dX^2 + dY^2) of the current local motion vector 116 (dX, dY) is smaller than two, the image content at the current pixel location may be defined as containing 'little or no motion'.
  • D may be computed from both motion in the current image and motion in the previous image when the current image contains a moderate or large amount of motion.
  • W may then have an intermediate value that further depends on the amount of motion at the current pixel location: D depends more on D1 as the amount of motion at the current pixel location becomes larger.
  • the amount of global motion in the current image may be determined from the global motion vector of the current image, for example as the length of the global motion vector.
  • the amount of global motion in the previous image is determined in a similar manner as the amount of global motion in the current image.
  • the depth value 141 may be part of a depth map that contains depth values of all respective image pixels of the current image. Such a depth map depends on motion, and therefore may be referred to as a 'dynamic' depth map.
  • a so-called 'static' depth map may supplement the dynamic depth map.
  • the static depth map is based on depth cues from within the current image and thus does not depend on motion.
  • the static depth map may be a so-called 'slant' that describes the depth as a gradient of the vertical dimension of the current image, gradually changing from nearby (i.e. large depth values) at the bottom of the current image to far away (i.e. small depth values) at the middle or top of the current image.
  • the static depth map may be based on other non-motion dependent depth cues from the current image. For example, using a face detector, large depth values may be assigned to areas in the current image containing faces, so that the faces become part of the foreground of the current image. As another example, using a sky detector, low depth values may be assigned to areas of the current image detected as sky, so that said areas become part of the background in the current image.
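  • A minimal sketch of a 'slant' static depth map; running the gradient over the full image height is an illustrative choice (the far plateau may also start at the middle of the image, as described above):

```python
import numpy as np

def slant_depth_map(height: int, width: int,
                    d_near: float = 255.0, d_far: float = 0.0) -> np.ndarray:
    """Static depth map defined as a vertical gradient.

    Large depth values (nearby) at the bottom row, gradually decreasing to
    small depth values (far away) towards the top row.
    """
    column = np.linspace(d_far, d_near, height)  # top row -> bottom row
    return np.tile(column[:, None], (1, width))
```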
  • a static depth value (from a static depth map) may be added to the depth value 141.
  • the depth value 141 contains a meaningful depth value, even when not enough motion is present in the current image for a reliable computation of the dynamic depth value.
  • Adding the static depth value to the depth value 141 is particularly advantageous for the case (a) when stationary periods in the video sequence (i.e. with little or no motion) last too many consecutive images for the recursive temporal filter to allow reliable derivation of the depth from previous images, or (b) when the video sequence starts with a stationary scene having little motion so that no previous images are available to reliably compute a motion-based depth value.
  • Adding the static depth value to the depth value 141 may be rephrased as follows. First, a 'dynamic depth value' is defined as the depth value 141 before adding the static depth value. The dynamic depth value is thus based on only the current dynamic depth value 111 and the previous dynamic depth value 121. Next, the depth value 141 is determined by combining the dynamic depth value and the static depth value. For example, as described above, the depth value 141 is a result from adding the static depth value and the dynamic depth value.
  • Combining the static depth value and the dynamic depth value may depend on the amount of global motion in the current image. For example, when the amount of global motion is low, the depth value 141 relies more on the static depth value than on the dynamic depth value, and, conversely, when the amount of global motion is high, the depth value 141 relies less on the static depth value than on the dynamic depth value.
  • Such a combination of the static depth value and the dynamic depth value may be implemented as a weighted average, wherein the respective relative contributions (i.e. weight factors) of the static and dynamic depth value in the weighted average gradually vary with the amount of global motion.
  • Alternatively, the depth value 141 may simply equal the largest of the static depth value and the dynamic depth value, corresponding to the relative contributions being 0 or 1.
  • FIG.4 illustrates a method 400 for computing the depth value 141 from the video sequence 101.
  • the method 400 comprises the following steps.
  • Step 410 comprises determining the current local motion vector 116 representing motion of image content at the current pixel location.
  • Step 420 comprises determining the current dynamic depth value 111 based on the current local motion vector.
  • Step 430 comprises determining the previous pixel location 106 being a pixel location in the previous image of the video sequence; the previous pixel location comprises image content corresponding to the image content at the current pixel location.
  • Step 440 comprises determining the previous local motion vector 126 representing motion of the image content at the previous pixel location 106.
  • Step 450 comprises determining the previous dynamic depth value 121 based on the previous local motion vector 126.
  • Step 460 comprises determining the depth value 141 based on the current dynamic depth value 111 and the previous dynamic depth value 121.
  • Operations performed by the steps 410-460 of method 400 are consistent with operations performed by units 115, 110, 105, 125, 120, and 140 of system 100, respectively.
  • the method 400 described above may also be implemented as computer program code means.
  • the computer program code means may be adapted to perform the steps of the method 400 when said computer program code is run on a computer.
  • the computer program code may be provided via a data carrier, such as a DVD or solid-state disk, for example.
  • FIG.5 illustrates a 3D display system 500 for showing a 3D image 531.
  • the 3D image 531 is derived from a current image of the video sequence 101.
  • a display unit 540 comprises a 3D display arranged for showing the 3D image 531.
  • the display may be a stereoscopic display that requires a viewer to wear stereo glasses, or an auto-stereoscopic multiview display.
  • An input unit 510 may receive the video sequence 101, for example by reading the video sequence 101 from a storage medium or by receiving it from a network.
  • a processing unit 520 comprises the system 100 for computing a depth map 521 which comprises depth values of respective image pixels of the current image.
  • a conversion unit 530 is arranged for converting the current image to the 3D image 531 based on the depth map 521 and the current image.
  • the 3D display system may be a 3D-TV.
  • the processing unit 520 and the conversion unit 530 may be part of a video processing board that renders a 3D-video sequence in real-time for being displayed on the 3D display.
  • the processing unit 520 computes a depth map for each image in the video sequence, whereas the conversion unit 530 converts each image and the corresponding depth map into a 3D-image.
  • the 3D display system 500 may also be implemented as a smart phone having a 3D-display capability, or as studio equipment for offline conversion of the video sequence 101 into a 3D video sequence.
  • any reference signs placed between parentheses shall not be construed as limiting the claim.
  • Use of the verb “comprise” and its conjugations does not exclude the presence of elements or steps other than those stated in a claim.
  • the article “a” or “an” preceding an element does not exclude the presence of a plurality of such elements.
  • the invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

A system (100) is arranged for computing a depth value (141) from a video sequence (101). The depth value represents depth at a current pixel location in a current image in the video sequence. The video sequence comprises a sequence of images. The system determines a current local motion vector (116) representing motion at the current pixel location, and determines a current dynamic depth value (111) based on the current local motion vector. The system determines in a previous image, a previous pixel location (106) comprising image content matching image content at the current pixel location. The system determines a previous local motion vector (126) representing motion at the previous pixel location, and determines a previous dynamic depth value (121) based on the previous local motion vector. Finally, the depth value is based on the current dynamic depth value and the previous dynamic depth value.

Description

System, method for computing depth from video
FIELD OF THE INVENTION
The invention relates to computing depth values from a video sequence. The video sequence comprises video frames, each video frame being an image having two dimensions: a horizontal and a vertical dimension. Depth values represent a third dimension in addition to said horizontal and vertical dimension. The video sequence and corresponding depth values may be converted to a three-dimensional (3D) format for being displayed on a 3D display.
BACKGROUND OF THE INVENTION
EP 2629531 describes a method for computing a depth map from a video sequence by using motion-based depth cues. The method determines motion vectors for an image of the video sequence, wherein each of the motion vectors corresponds to an image pixel of the image. A depth value corresponding to an image pixel is then computed as being proportional to the length of the motion vector corresponding to that image pixel. The depth values combined for the entire image then form a depth map.
A drawback of said method is that meaningful depth values can only be obtained for an image of the video sequence when motion is present in said image. In contrast, when the image presents a static scene, a depth map cannot be computed.
SUMMARY OF THE INVENTION
It is an object of the invention to provide an improved system and method for computing a depth value based on motion in a video sequence.
An aspect of the invention is a system for computing a depth value from a video sequence, the video sequence comprising a sequence of images, each of the images comprising a two-dimensional array of image pixels, each of the image pixels having a respective pixel location in the two-dimensional array, the depth value representing depth of image content at a current pixel location, the current pixel location being a pixel location in a current image in the video sequence, the system comprising: a current motion unit arranged for determining a current local motion vector representing motion of image content at the current pixel location; a current depth unit arranged for determining a current dynamic depth value based on the current local motion vector; a previous pixel location unit arranged for determining a previous pixel location being a pixel location in a previous image of the video sequence, the previous pixel location comprising image content corresponding to the image content at the current pixel location; a previous motion unit arranged for determining a previous local motion vector representing motion of the image content at the previous pixel location; a previous depth unit arranged for determining a previous dynamic depth value based on the previous local motion vector; and a depth unit arranged for determining the depth value based on the current dynamic depth value and the previous dynamic depth value.
The video sequence comprises a sequence of images (video frames), each image comprising a two-dimensional array of image pixels. Output of the system is a depth value for a current image, being one of the images of the video sequence. The depth value represents depth of image content at the current pixel location, which is a pixel location in the current image. For example, the depth value may be one of the depth values of a depth map, wherein the depth map holds all depth values corresponding to the respective image pixels of the current image.
In the depth unit, the depth value is based on two dynamic depth values from two respective images, being (a) the current dynamic depth value from the current image, and (b) the previous dynamic depth value from the previous image. The current depth value represents depth of image content at said current pixel location, whereas the previous depth value represents depth of image content at the previous pixel location. The previous pixel location is a pixel location in the previous image, and image content at that previous pixel location corresponds to image content at the current pixel location.
In the current motion unit, the current local motion vector is determined. The current local motion vector represents motion of image content at the current pixel location. The current local motion vector may have a horizontal and vertical component indicating a two-dimensional displacement of the image content at the current pixel location. For example, the local motion vector may refer to a displacement of said image content between the current image and an image adjacent to the current image in the video sequence. Determining the current local motion vector may be done, for example, using a motion estimation algorithm that estimates motion of the image content at the current pixel location; alternatively, the current local motion vector may be selected from local motion vectors already provided with the video sequence. The current dynamic depth value is a motion-based depth value from the current image, and is based on the current local motion vector. The length of the current local motion vector may be used as an indicator of the current dynamic depth value.
In the previous pixel location unit, the previous pixel location is determined as a pixel location in the previous image where the image content matches the image content at the current pixel location (which is in the current image). The image content at the current pixel location and the image content at the previous pixel location thus portray the same scene element; the image content may portray a portion of a face of an actor, for example. Yet, the image content at the current and the previous pixel location corresponds to two different moments in time, namely to the moments in time of the current and the previous image, respectively. Determining the previous pixel location may be done, for example, by applying a motion estimation algorithm (or other image matching algorithm) to locate a pixel location where the image content matches the image content at the current pixel location, and to determine the previous pixel location as that located pixel location. (Such a process for locating the previous pixel location by means of matching image content is well known to a person skilled in the art of motion estimation in a video sequence.)
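For illustration, such a matching search may be sketched in Python as follows; the patch size, search window, and function name are illustrative assumptions, not prescribed by this description:

import numpy as np

def find_previous_pixel_location(current, previous, x, y, patch=8, search=16):
    # Patch of image content around the current pixel location; (x, y) is assumed
    # to lie at least `patch` pixels away from the right and bottom image borders.
    ref = current[y:y + patch, x:x + patch].astype(np.float32)
    h, w = previous.shape
    best_sad, best_xy = np.inf, (x, y)
    # Scan a small window in the previous image for the best-matching patch,
    # using the sum of absolute differences (SAD) as the matching criterion.
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            px, py = x + dx, y + dy
            if 0 <= px <= w - patch and 0 <= py <= h - patch:
                cand = previous[py:py + patch, px:px + patch].astype(np.float32)
                sad = float(np.abs(ref - cand).sum())
                if sad < best_sad:
                    best_sad, best_xy = sad, (px, py)
    return best_xy  # previous pixel location with best-matching image content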
It should be noted that according to above definitions, the 'previous pixel location' consistently refers to (a pixel location in) the previous image, and that the 'current pixel location' consistently refers to (a pixel location in) the current image.
In the previous motion unit, the previous local motion vector is determined.
The previous local motion vector represents motion of image content at the previous pixel location, analogous to the current local motion vector at the current pixel location. For example, the previous local motion vector may refer to a displacement of said image content between the previous image and an image adjacent to the previous image in the video sequence.
In the previous depth unit, the previous dynamic depth value is based on the previous local motion vector, which represents motion of image content at the previous pixel location. Determining the previous local motion vector for the previous pixel location is done in an analogous manner as determining the current local motion vector for the current pixel location. The previous dynamic depth value is thus a motion-based depth value from the previous image.
In the depth unit, the depth value is determined based on the current dynamic depth value and the previous dynamic depth value. For example, the depth value may be computed as a linear combination of the current dynamic depth value and the previous dynamic depth value.
An effect of the invention is therefore that the depth value is not only dependent on motion in the current image, but may also benefit from motion in the previous image. When the current image in general, and the current pixel location in particular, contains little motion, the depth value may be improved by using a motion-based depth value from the previous image in addition to the motion-based depth value from the current image.
Optionally, the depth unit is arranged for determining the depth value by computing the depth value as a combination of the current dynamic depth value and the previous dynamic depth value, the relative contributions of the current dynamic depth value and the previous dynamic depth value in the combination being defined by a current weight factor and a previous weight factor, respectively. The current weight factor and the previous weight factor may be used to make the depth value more reliant on the current image or on the previous image. In such a way, the weight factors may effectively enable a tuning of the depth value towards (a) using motion image from the current image or (b) using motion from the previous image. For example, the depth value may be computed as a linear combination of the current dynamic depth value and the previous dynamic depth value, thus as a sum of (a) the current dynamic depth value multiplied by the current weight factor and (b) the previous dynamic depth value multiplied by the previous weight factor.
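As an illustration of such a weighted combination, the following sketch shows one possible form; the function name and the example weight values are illustrative assumptions:

def combine_depth(curr_dyn, prev_dyn, w_curr, w_prev):
    # Linear combination of the two motion-based depth values; the weight
    # factors tune the depth value towards the current or the previous image.
    return w_curr * curr_dyn + w_prev * prev_dyn

# Relying mostly on the current image:
depth = combine_depth(curr_dyn=6.0, prev_dyn=2.0, w_curr=0.9, w_prev=0.1)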
Optionally, the depth unit is further arranged for defining the current weight factor by determining a local motion indicator indicating an amount of motion at the current pixel location, and determining at least one of the current weight factor and the previous weight factor based on the local motion indicator. For example, the determined amount of motion may indicate a large amount of motion at the current pixel location and, in response, the current weight factor may be determined to be 0.9 and the previous weight factor may be determined to be 0.1. The depth value then depends more strongly on the current dynamic depth value than on the previous dynamic depth value, because the current weight factor is large (i.e. close to 1.0).
Optionally, the depth unit is arranged for determining the local motion indicator by computing the local motion indicator based on one of: a length of the current local motion vector, an absolute value of a horizontal component of the current local motion vector, and an absolute value of a vertical component of the current local motion vector. The local motion indicator is an indicator of the amount of motion present at the current pixel location. The local motion indicator may be computed as the length of the current local motion vector or as the absolute value of one of its components.
Optionally, the depth unit is arranged for computing the depth value by determining the current weight factor based on a difference between the amount of motion at the current pixel location and the amount of motion at the previous pixel location. For example, the depth unit may determine that the amount of motion at the current pixel location is smaller than the amount of motion at the previous pixel location. The image content at the current pixel location then effectively 'decelerates'. By consequently decreasing the current weight factor, the depth value relies less on motion at the current pixel location and consequently more on motion at the previous pixel location. The depth value thus relies more on depth from the previous image in this case.
Optionally, the current depth unit is arranged for determining the current dynamic depth value by computing the current dynamic depth value based on one of: a length of the current local motion vector, an absolute value of a horizontal component of the current local motion vector, and an absolute value of a vertical component of the current local motion vector. For example, the length of the current local motion vector is used to compute the current dynamic depth value.
Optionally the current motion unit is arranged for determining the current local motion vector by: (a) determining a local motion vector indicating motion of the image content at the current pixel location, (b) determining a global motion vector indicating global motion of image content of at least a region of the current image, and (c) computing the current local motion vector as a relative motion vector based on the local motion vector relative to the global motion vector. A current dynamic depth value based on the relative motion vector compensates for movement of the background in a video scene, because a static foreground in front of a moving background and a moving foreground in front of a static background will both correspond to a large relative motion of the foreground. In both cases, the current dynamic depth value will therefore be larger if the current pixel location is part of the foreground than if it is part of the background.
Optionally, the depth unit is arranged for determining the depth value based on a static depth value being a non-motion-based depth value based on only the current image. The static depth value may be determined even when motion is small or absent in the current image and the current dynamic depth value may therefore become unreliable. For example, the static depth value may be based on a so-called 'slant' being a depth map that defines depth as a vertical gradient, i.e. having large depth values (close to the viewer) at the bottom of the current image and having smaller depth values (farther away from the viewer) toward the top or middle of the current image.
Optionally, the depth unit is further arranged for determining the depth value by (a) determining a combined dynamic depth value by combining the current dynamic depth value and the previous dynamic depth value, and (b) determining the depth value by combining the combined dynamic depth value with the static depth value into the depth value, relative contributions of the dynamic depth value and the static depth value in the combining being dependent on an amount of global motion present in the current image. Based on the amount of global motion present in the current image, the depth value may rely more on the static depth value or more on the combined dynamic depth value. First, the combined dynamic depth value may be determined by the depth unit of the system above, thus based on the current dynamic depth value and previous dynamic depth value. Second, the depth value may then be determined as a linear combination of the static depth value and the combined dynamic depth value. The relative contributions of the static depth value and the combined dynamic depth value in said linear combination may be represented by respective weight factors. For example, if the amount of global motion is small, this may indicate that said combined dynamic depth value is unreliable. In such a case, it may be desirable to make the depth value more dependent on the static depth value. The latter may be achieved by means of a high relative contribution of the static depth value and a low relative contribution of the combined dynamic depth value.
Optionally, the previous pixel unit is arranged for determining the previous pixel location in a non-motion-compensated manner, by determining the previous pixel location as having the same coordinate in the two-dimensional array of the previous image as the coordinate of the current pixel location in the two-dimensional array of the current image. According to the non-motion-compensated manner, the previous pixel location has the same (X,Y) coordinate in the 2D array of the previous image as the current pixel location has in the 2D array of the current image.
Accordingly, the previous pixel location is effectively determined by straightforwardly copying said (X,Y) coordinate from the current pixel location in the current image to the previous pixel location in the previous image. The non-motion-compensated manner is straightforward, as no motion estimation is needed to determine the previous pixel location.
Optionally, the system is further arranged for determining the depth value by using a predetermined non-linear function for limiting the depth value to a predetermined depth value range. Large values of the current local motion vector may result in an excessive value of the current dynamic depth value that lies outside the predetermined depth value range. For example, the depth value may be used to convert the image pixel at the current pixel location to a three-dimensional format, in order to be displayed on a 3D display with a limited output depth range. For example, limiting the depth value may be achieved by applying e.g. a hard- or soft-clipping function to the depth value.
Optionally, the previous pixel location unit is arranged to determine the previous pixel location in the previous image that corresponds to a later moment in time than the current image. For example, a depth map may be computed for each image in the video sequence, processing the images one-by-one, starting with the first image and ending with the last image of the video sequence (i.e. in a regular temporal order), or the other way around, thus starting with the last image and ending with the first image of the video sequence (i.e. in a reverse temporal order). In this context, the regular temporal order implies that the video sequence is intended to be played starting with the first image (corresponding to an early time instance) and ending with the last image (corresponding to a later time instance).
A further aspect of the invention is a method for computing a depth value from a video sequence, the video sequence comprising a sequence of images, each of the images comprising a two-dimensional array of image pixels, each of the image pixels having a respective pixel location in the two-dimensional array, the depth value representing depth of image content at a current pixel location, the current pixel location being a pixel location in a current image in the video sequence, the method comprising the steps of: determining a current local motion vector representing motion of image content at the current pixel location, determining a current dynamic depth value based on the current local motion vector, determining a previous pixel location being a pixel location in a previous image of the video sequence, the previous pixel location comprising image content corresponding to the image content at the current pixel location, determining a previous local motion vector representing motion of the image content at the previous pixel location, determining a previous dynamic depth value based on the previous local motion vector, and determining the depth value based on the current dynamic depth value and the previous dynamic depth value.
A further aspect of the invention is a three-dimensional display system comprising: a display unit comprising a display arranged for displaying a three-dimensional image; an input unit arranged for receiving a video sequence comprising a sequence of images, each of the images comprising a two-dimensional array of image pixels, each of the image pixels having a respective pixel location in the two-dimensional array; a conversion unit arranged for converting an image of the video sequence to the three-dimensional image using a depth map comprising depth values of the image; and a processing unit comprising the system above, the processing unit being arranged for computing depth values of the depth map from the video sequence using the system above.
A further aspect of the invention is a computer program product comprising computer program code means adapted to perform all the steps of the method above when said computer program code is run on a computer.

BRIEF DESCRIPTION OF THE DRAWINGS
These and other aspects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter.
In the drawings,
FIG.1 illustrates a system for computing a depth value from a video sequence,
FIG.2 illustrates the current motion unit being arranged to compute the current local motion vector as a relative local motion vector,
FIG.3 illustrates the depth unit being arranged to compute the depth value,
FIG.4 illustrates a method for computing the depth value from the video sequence, and
FIG.5 illustrates a 3D display system for showing a 3D image derived from a current image of the video sequence.
It should be noted that items that have the same reference numbers in different figures have the same structural features and the same functions. Where the function and/or structure of such an item has been explained, there is no necessity for repeated explanation thereof in the detailed description.
DETAILED DESCRIPTION OF THE INVENTION
FIG.1 illustrates a system 100 for computing a depth value 141 from a video sequence 101. The video sequence 101 comprises a sequence of images (video frames) which are two-dimensional, each image being a two-dimensional array of image pixels. The depth value 141 corresponds to an image pixel, being an image pixel at a current pixel location in a current image of the video sequence. For example, the depth value 141 may be one of the depth values of a depth map, whereas the depth map holds all depth values corresponding to the respective image pixels of the current image. The depth is defined as the third dimension in addition to the horizontal and vertical dimension of an image of the video sequence 101. If the image is converted to a 3D format and shown on a (3D) display, then the depth dimension would be substantially perpendicular to the plane of the display. For a viewer standing in front of the display, image content (e.g. a foreground object) having large (high) depth values would be perceived as standing out of the display more towards the viewer than if said image content had smaller (lower) depth values. In other words, a large depth value corresponds to a smaller perceived distance to the viewer, whereas a smaller depth value corresponds to a larger perceived distance to the viewer.
In summary, the system 100 works as follows, when in operation. A current motion unit 115 determines, from the video sequence 101, a current local motion vector 116 representing motion of image content at the current pixel location. A current depth unit 110 determines a current dynamic depth value 111 from the current local motion vector 116, which serves as an indicator of depth at the current pixel location. A previous pixel location unit 105 determines, from a previous image in the video sequence, a previous pixel location 106 having image content that matches image content at the current pixel location, thus in the current image. A previous motion unit 125 determines, from the video sequence 101, a previous local motion vector 126 representing motion of the image content at the previous pixel location 106. A previous depth unit 120 then determines a previous dynamic depth value 121 from the previous local motion vector 126, which is used as an indicator of depth at the previous pixel location 106. Finally, a depth unit 140 determines the depth value 141 based on the current dynamic depth value 111 and the previous dynamic depth value 121.
It should be noted that 'image content at the current pixel location' refers to a small group of image pixels near the current pixel location. For example, the group of image pixels may be a small patch of 8x8 image pixels. The image content at the current pixel location then refers to the portion of the current image portrayed by that small patch of 8x8 image pixels.
It should be noted that the 'current pixel location' in this document always refers to a pixel location in the current image, and the 'previous pixel location' always refers to a pixel location in the previous image. 'The current pixel location' thus implies 'the current pixel location in the current image', and 'the previous pixel location' thus implies 'the previous pixel location in the previous image'.
In the current motion unit 115, a current local motion vector 116 is thus determined for the current pixel location. The current local motion vector 116 represents motion of the image content at the current pixel location. The current local motion vector may have a horizontal and vertical component indicating a two-dimensional displacement of said image content. For example, the current local motion vector 116 (dX, dY) represents how said image content moves between (a) the current image and (b) an image directly succeeding the current image in the video sequence. Said image content moves by a number of pixels dX in the horizontal (X) direction and by a number of pixels dY in the vertical direction (Y).
For example, the video sequence is a movie with a frame rate of 50 Hz, i.e. 50 images per second. The current image may correspond to a moment in time tl . The image directly succeeding the current image then corresponds to a moment in time t2=tl+0.02 sec. A current local motion vector 116 may be (dX, dY) = (+4,-2), representing that said image content moves horizontally by 4 pixels to the right and moves vertically by 2 pixels downward, within the time period of 0.02 sec between tl and t2.
Determining the current local motion vector 116 may be done by estimating local motion vectors for all image pixels of the current image, and then selecting, from the local motion vectors, the current local motion vector 116 corresponding to the current pixel location. For example, local motion vectors may be determined by searching for matching (similar) image content in both the current image and an adjacent image in the video sequence. An example of such an algorithm is the so-called '3D Recursive Search' (see also 'True-motion estimation with 3-D recursive search block matching', IEEE Transactions on Circuits and Systems for Video Technology, Vol.3 No.5, Oct 1993). Motion may be determined on a pixel basis, implying that a local motion vector is computed (estimated) for each image pixel in the current image. Local motion vectors may initially also be determined on a block basis, which implies that a single local motion vector is determined for each block of pixels in the current image, e.g. a block of 8x8 image pixels. In that case, the single local motion vector represents motion of every image pixel in the block of image pixels. Optionally, a refinement algorithm may then be applied to refine the block-specific local motion vector to a pixel-specific local motion vector.
Alternatively, the current local motion vector 116 may be determined by selecting it from local motion vectors that are pre-computed and provided with the video sequence 101. For example, the video sequence 101 may be obtained by decoding an encoded video stream that also contains said pre-computed local motion vectors. For example, an MPEG-encoded video stream typically contains both the video sequence 101 and said local motion vectors. In such a case, motion estimation from the video sequence 101 is not needed, as the local motion vectors are readily available.
Rather than computing the current dynamic depth value 111 based on absolute motion (e.g. according to the method of EP 2629531), it may instead be based on relative motion. The current local motion vector 116 can be determined as a relative motion vector. A relative local motion vector may be defined as a local motion vector relative to a global motion vector. The global motion vector may represent the overall motion in the current image, and may be computed e.g. as an average of all local motion vectors in the current image. The current local motion vector may thus be calculated as the relative motion vector by subtracting the global motion vector from the local motion vector at the current pixel location. To emphasize the difference with a relative local motion vector, a local motion vector may also be referred to, in what follows, as an absolute local motion vector.
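A minimal sketch of this computation (the array shape and the choice of a plain average are illustrative assumptions):

import numpy as np

def relative_motion_vectors(local_mv):
    # local_mv: H x W x 2 array of (dX, dY) local motion vectors for one image.
    # Here the global motion vector is taken as the average of all local vectors.
    global_mv = local_mv.reshape(-1, 2).mean(axis=0)
    # Relative motion: local motion with the overall (e.g. camera) motion removed.
    return local_mv - global_mv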
The previous motion unit 125 may be arranged to determine the previous local motion vector as a relative local motion vector. The previous motion unit 125 is arranged in an analogous manner to the current motion unit 115, which is arranged to determine the current local motion vector as a relative local motion vector.
A benefit of computing the current local motion vector 116 as a relative local motion vector is that the current dynamic depth value 111 (which is based on the current local motion vector 116) is largely invariant to background movement in a scene of the video sequence. A scene is defined here as multiple consecutive images of the video sequence that portray related image content (for example an actor) at respective consecutive time instances. Invariance to said background movement is explained based on the following two cases.
In a first case, a scene comprising a fast moving foreground object before a static background is considered. The foreground object moves in that scene, while the background is static. Using (absolute) local motion vectors as direct indicators of depth, depth values are assigned correctly to pixels: large depth values are assigned to pixels of the foreground object and small depth values are assigned to pixels of the background. Using relative local motion vectors as direct indicators of depth has the same result in this case, because the background is static, so that the relative local motion vectors are effectively absolute local motion vectors.
In a second case, a scene comprising a static foreground object before a fast moving background is considered. The foreground object is static in that scene, while the background moves. Using (absolute) local motion vectors as direct indicators of depth, depth values are incorrectly assigned: large depth values are assigned to pixels of the background and small depth values are assigned to pixels of the foreground object. Using relative local motion vectors as direct indicators of depth has a different result in this case, because relative local motion vectors of the foreground object are large, whereas relative local motion vectors of the background are (by definition) small. Consequently, depth values are then assigned correctly to pixels: large depth values are assigned to pixels of the foreground object and small depth values are assigned to pixels of the background.
These two cases illustrate the invariance of the current dynamic depth value 111 to background movement, when the current dynamic depth value is based on the current local motion vector 116 being a relative local motion vector. Computing the current dynamic depth value 111 based on the relative local motion vector may therefore be preferred.
The current depth unit 110 is arranged to determine the current dynamic depth value 111 from the current local motion vector 116. This can be done in two steps. The first step is to determine an intermediate depth value. The second step is to map the intermediate depth value to the current dynamic depth value.
The intermediate depth value can be determined as the length of the current local motion vector 116. For example, if the length of the current local motion vector is 5 pixels, then the intermediate depth value is 5. Alternatively, the intermediate depth value can be determined as the absolute value of either the horizontal component or the vertical component of the current local motion vector 116.
To map the intermediate depth value to the current dynamic depth value 111, a mapping function of the linear type y=ax+b may be applied, having a gain a, an offset b, an input x for the intermediate depth value, and an output y for the current dynamic depth value 111. The mapping function maps the intermediate depth value to a depth range. For example, if the intermediate value typically lies in a range of [0..10] and needs to be mapped to a depth range of [-10..10], then the mapping function may be y=2x-10. Using this mapping function, the intermediate depth values 0, 5, and 10 are then mapped to a current dynamic depth value 111 of -10, 0, and 10, respectively.
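A minimal sketch of such a linear mapping (the default ranges are the example values above; the function name is illustrative):

def map_to_depth(x, in_min=0.0, in_max=10.0, out_min=-10.0, out_max=10.0):
    # Linear mapping y = a*x + b from the intermediate depth range onto the
    # target depth range; with these defaults it reduces to y = 2*x - 10.
    a = (out_max - out_min) / (in_max - in_min)
    b = out_min - a * in_min
    return a * x + b

# map_to_depth(0), map_to_depth(5), map_to_depth(10) -> -10.0, 0.0, 10.0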
In alternative embodiments, non-linear mapping functions y=f(x) may also be used to map the intermediate depth value to the current dynamic depth value 111. The mapping function may alternatively include a suppression of large depth values. Such suppression may be accomplished by a function that has a low derivative for large values of x, thus for large values of the intermediate depth value. For example, the mapping function may be defined as y being proportional to the square root of x. As another example, the mapping function may be defined as a soft-clipping function wherein y, representing the current dynamic depth value, gradually tends to a predetermined maximum value for large values of x, representing the intermediate depth value. Without such suppression, the conversion of the current image to a 3D image to be viewed on a 3D display may result in extreme depth values causing discomfort for a viewer. Such extreme depth values may also cause a problem in portraying the resulting 3D image on a 3D display that can only display a 3D image having a limited depth output range. An auto-stereoscopic 3D display is an example of such a 3D display having a limited depth output range.
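One possible soft-clipping function is sketched below, under the assumption that a hyperbolic tangent is an acceptable choice; the description only requires that y gradually tends to a predetermined maximum:

import math

def soft_clip_depth(x, d_max=10.0):
    # Roughly linear for small x, gradually tending to the maximum d_max for
    # large x, so that extreme motion-based depth values are suppressed.
    return d_max * math.tanh(x / d_max)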
The previous pixel location unit 105 is arranged to determine the previous pixel location 106. For example, a motion estimation algorithm may be used to locate the previous pixel location. The previous pixel location 106 may be determined as a pixel location in the previous image where image content matches image content at the current pixel location. The previous pixel location 106 is then determined in a motion-compensated manner. As another example, the previous pixel location 106 can be determined as a copy of the current pixel location: the coordinates of the previous pixel location 106 within the previous image are the same as the coordinates of the current pixel location within the current image. The previous pixel location 106 is then determined in a non-motion-compensated manner.
An advantage of the non-motion-compensated manner may be that it is straightforward, because it requires no motion estimation. In the non-motion-compensated manner, image content at the previous pixel location may still match image content at the current pixel location, provided that the image content moves only slightly between the previous and the current image.
The previous motion unit 125 is arranged to determine the previous local motion vector 126 at the previous pixel location 106, in an analogous manner as the current motion unit 115 determines the current local motion vector 116 at the current pixel location. The previous depth unit 120 determines the previous dynamic depth value 121 based on the previous local motion vector 126, in an analogous manner as the current depth unit 110 determines the current dynamic depth value 111 based on the current local motion vector 116.
The depth unit 140 is arranged to compute the depth value 141 by combining the current dynamic depth value 111 and the previous dynamic depth value 121. Further below, where FIG.3 is discussed, the depth unit 140 will be explained in further detail.
FIG.2 illustrates the current motion unit 115 being arranged to compute the current local motion vector 116 as a relative local motion vector. A sub-unit 230 is arranged to determine a local motion vector 231 at the current pixel location. The local motion vector 231 may result directly from applying a motion estimation algorithm that determines the motion of image content at the current pixel location. The local motion vector 231 represents motion between the current image and an image adjacent to the current image in the video sequence 101. A sub-unit 220 is arranged to determine a global motion vector 221, for example based on local motion vectors in the current image. The sub-unit 220 may receive the local motion vectors from the sub-unit 230 or, alternatively, they may be provided with the video sequence. The global motion vector 221 may be determined by computing the average of the local motion vectors of the current image. The sub-unit 210 is arranged to compute the current local motion vector 116 as a relative motion vector, by subtracting the global motion vector 221 from the local motion vector 231.
Alternatively, the horizontal component (X) of the global motion vector 221 may be determined by computing a trimmed mean of the horizontal components of the respective local motion vectors. The trimmed mean is an average of said horizontal components, wherein the largest 10% and smallest 10% of the horizontal components are excluded from that average. Similarly, the vertical component (Y) of the global motion vector 221 is determined by computing a trimmed mean of the vertical components of the respective local motion vectors.
Alternatively, the global motion vector is determined by computing the horizontal component of the global motion vector as a median of the horizontal components of the local motion vectors, and by computing the vertical component of the global motion vector as the median of the vertical components of the local motion vectors.
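The trimmed-mean and median alternatives may be sketched as follows; the 10% trim fraction follows the example above, and the names are illustrative:

import numpy as np

def global_motion_vector(local_mv, trim=0.1, use_median=False):
    # local_mv: N x 2 array of (dX, dY) local motion vectors of the current image.
    if use_median:
        # Component-wise median of the horizontal and vertical components.
        return np.median(local_mv, axis=0)
    # Component-wise trimmed mean: exclude the smallest and largest `trim`
    # fraction (here 10%) of each component before averaging.
    n = local_mv.shape[0]
    k = int(n * trim)
    gx = np.sort(local_mv[:, 0])[k:n - k].mean()
    gy = np.sort(local_mv[:, 1])[k:n - k].mean()
    return np.array([gx, gy])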
Alternatively, the global motion vector is determined by a so-called projection method, which does not require use of the local motion vectors. In a projection method, the horizontal component of the global motion vector is computed as follows. A first profile is computed by vertically averaging all pixel lines of the current image. The first profile is thus a pixel line itself, comprising the vertically averaged pixel lines of the current image. In a similar way, a second profile is computed by vertically averaging all pixel lines of the previous image. The horizontal component is then determined as the number of horizontal pixels over which the first profile needs to be shifted to best match the second profile. A best match may be determined as the shift for which the difference between the (shifted) first profile and the second profile is at a minimum. The vertical component of the global motion vector is determined in an analogous way, thus based on a first and a second profile obtained by horizontally averaging pixel columns of the current and the previous image. For further details on said projection method, one is referred to US patent application US20090153742. Alternatively, the global motion vector may be computed from a region in the current image, rather than from the entire image. For example, the region may consist of the current image excluding the center and the bottom of the current image. Since a foreground is expected to be at the center and bottom of the current image, the global motion vector is then computed from local motion vectors predominantly corresponding to the background.
Consequently, subtracting that global motion vector from the local motion vector results in a relative motion vector that represents local motion relative to background motion. This may be considered as a more accurate and appropriate way of computing the relative motion vector.
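A compact sketch of the projection method described above (the search range and the naming are illustrative assumptions):

import numpy as np

def projection_global_motion(current, previous, max_shift=16):
    # Returns the (horizontal, vertical) components of the global motion vector.
    def best_shift(p1, p2):
        # Find the shift of profile p1 that best matches profile p2,
        # comparing only the overlapping parts of the two profiles.
        best_d, best_s = np.inf, 0
        for s in range(-max_shift, max_shift + 1):
            a = p1[max(0, s):len(p1) + min(0, s)]
            b = p2[max(0, -s):len(p2) + min(0, -s)]
            d = np.abs(a - b).mean()
            if d < best_d:
                best_d, best_s = d, s
        return best_s
    gx = best_shift(current.mean(axis=0), previous.mean(axis=0))  # vertically averaged pixel lines
    gy = best_shift(current.mean(axis=1), previous.mean(axis=1))  # horizontally averaged pixel columns
    return gx, gy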
FIG.3 illustrates the depth unit 140 being arranged to compute the depth value 141. The depth unit 140 may be arranged to combine the current dynamic depth value 111 and the previous dynamic depth value 121 using respective weight factors. The depth unit 140 includes a sub-unit 310 and a sub-unit 320.
Sub-unit 320 is arranged to combine the current dynamic depth value 111 and the previous dynamic depth value 121 into the depth value 141, using a current weight factor 311. The units 110 and 120 provide the current dynamic depth value 111 and the previous dynamic depth value 121, respectively. The current weight factor 311 is provided by the sub-unit 310 and determines to what extent the depth value 141 depends on the current dynamic depth value 111 and on the previous dynamic depth value 121. For example, the depth value 141 is determined as a weighted average:
D = W1*D1 + W2*D2 [Eq. 1]
wherein W1 and W2 represent the current weight factor and the previous weight factor, respectively, while D, D1, and D2 represent the depth value 141, the current dynamic depth value 111, and the previous dynamic depth value 121, respectively. The weight factors W1 and W2 are typically numbers between 0.0 and 1.0. The weight factors W1 and W2 may be related according to the following equation:
W1 + W2 = 1
In such a case the sub-unit 310 only needs to provide W1 to sub-unit 320, as sub-unit 320 derives W2 from W1.
In an embodiment, the depth value 141 is based on more than two images, i.e. more than the current image and the previous image. The depth value 141 may be determined based on multiple depth values from respective multiple images. For example, the depth value 141 may be based on a first, second, third, and fourth dynamic depth value from a respective first, second, third, and fourth image. For the sake of clarity, the terms 'first' and 'second' are used as substitutes for the terms 'current' and 'previous'. Each of the first, second, third, and fourth depth values may be based on a first, second, third, and fourth local motion vector. Similarly to Eq. 1, the depth value 141 may be computed as
D = W1*D1 + W2*D2 + W3*D3 + W4*D4 [Eq. 2]
wherein:
- D3 and D4 represent the third and fourth dynamic depth value, respectively;
- W3 and W4 represent the third and fourth weight factor, respectively; and
- variables W1, D1, W2, and D2 are defined as in Eq. 1.
Eq. 2 describes a finite impulse response (FIR) filter that has an output D, input values D1, D2, D3, and D4, and respective filter coefficients W1, W2, W3, and W4.
In another embodiment, computation of the depth value D 141 is implemented as an infinite impulse response (IIR) filter as follows:
D = W*D1 + (1-W)*C2 [Eq. 3]
C2 = W*D2 + (1-W)*C3 [Eq. 4]
C3 = W*D3 + (1-W)*C4 [Eq. 5]
wherein:
- W is the current weight factor 311;
- D1, D2, D3, and D4 are as in Eq. 2; and
- C2, C3, and C4 represent cumulative depth values of the second, third, and fourth image, respectively.
Accordingly, D represents a 'temporal update' of the cumulative depth value C2, and is calculated as a weighted average of D1 and the cumulative depth value C2. In its turn, the cumulative depth value C2 is computed as a 'temporal update' of the cumulative depth value C3, thus as a weighted average of D2 and the cumulative depth value C3. In its turn, C3 is computed as a 'temporal update' of a cumulative depth value C4 of a fourth image, etcetera. The depth value 141 is thus computed using a temporal recursive filter, (a) because of the recursion described by Eqs. 3-5 and (b) because the depth value is derived from images in the video sequence 101 associated with different time instances. Said recursion requires an initial value at some preceding image. For example, said recursion may be initialized at the fourth image by C4 = D4.
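Such a temporal recursive (IIR) filter may be sketched in streaming form, one update per image; the class and method names are illustrative assumptions:

class TemporalDepthFilter:
    def __init__(self):
        self.c = None  # cumulative depth value of the previously processed image

    def update(self, d, w):
        # One temporal update per image, cf. Eq. 3: D = W*D1 + (1-W)*C2.
        # The recursion is initialized with the first dynamic depth value seen.
        self.c = d if self.c is None else w * d + (1.0 - w) * self.c
        return self.c  # the depth value D for the current image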
The effect of varying the current weight factor W 311 is as follows. If W is close to 1.0, then D is very similar to D1 and depends weakly (or to a minor extent) on images preceding the current image. This means that when the video sequence 101 is shown in real time on a 3D display, the depth value 141 is less dependent on depths from preceding images. If W is close to 0.0, then D is very similar to D2 and largely depends on images preceding the current image. In other words, the depth value 141 then includes only a relatively small contribution of the current dynamic depth value 111. When the video sequence 101 is shown in real time on a 3D display, in such a case, the depth value 141 is strongly dependent on depths from earlier images.
When W is close to 1.0, the depth value D 141 thus depends weakly on depths from preceding images. Consequently, D depends strongly on depth from the current image. D is therefore based on 'up-to-date' image content, because D represents depth at the current pixel location, which is in the current image, while D is strongly based on depth from that same, current image. A benefit of W being close to 1.0 may thus be that the depth value D 141 is based on up-to-date image content.
Another consequence of W being close to 1.0 is that D is computed only to a minor extent from motion in preceding images of the video sequence, even when the current image itself contains little or no motion from which to compute meaningful depth. A drawback of W being close to 1.0 may thus be that the depth value D 141 may not represent a meaningful value when the current image itself contains little or no motion.
When W is close to 0.0, the depth value D 141 thus depends strongly on depths from preceding images. Consequently, D depends weakly on depth from the current image. D is therefore not based on 'up-to-date' image content, because D represents depth at the current pixel location, which is in the current image, while D is only weakly based on depth from that same, current image. A drawback of W being close to 0.0 may thus be that the depth value D 141 is not based on up-to-date image content.
Another consequence of W being close to 0.0 is that D can be computed from motion in preceding images of the video sequence, even when the current image itself contains little or no motion to compute meaningful depth from. A benefit of W being close to 0.0 may thus be that the depth value D 141 may be computed from motion in the video sequence, even when the current image itself contains little or no motion.
By adapting W to motion in the current image and to motion in the previous image, said benefits for computing D may be combined in an advantageous way. In what follows, it will be explained how W may be determined by such an adaptation in a first, second and third embodiment.
In said first embodiment, D may be computed primarily from motion in the current image when the current image contains more motion than the previous image. Motion in the current image may then be considered as more reliable for computing D than motion in the previous image. This may work as follows. An amount of motion at the current pixel location is computed as the length of the current local motion vector. An amount of motion at the previous pixel location is computed as the length of the previous local motion vector. When the amount of motion at the current pixel location is larger than the amount of motion at the previous pixel location, then W may be determined as being close to 1.0, e.g. W=0.8 or W=1.0. Consequently, D then depends strongly on D1 and is thus based on up-to-date image content.
In said second embodiment, D may be computed primarily from motion in the previous image when:
- the current image contains less motion than the previous image, and
- the current image contains little or no motion.
This may work as follows. When the amount of motion at the current pixel location is:
- smaller than the amount of motion at the previous pixel location, and
- is also smaller than a predetermined amount of motion at the current pixel location,
W may be determined as being close to 0.0, e.g. W=0.2 or W=0.0.
Consequently, the effect is that D then depends strongly on D2 and is thus computed from motion in the previous image, while the amount of motion at the current pixel location itself is small.
For example, said predetermined amount of motion at the current pixel location may be two. Then, if the length sqrt(dX^2+dY^2) of the current local motion vector 116 (dX, dY) is smaller than two, the image content at the current pixel location may be defined as containing 'little or no motion'.
In said third embodiment, D may be computed from both motion in the current image and motion in the previous image, when:
- the current image contains less motion than the previous image and
- the current image contains a moderate or large amount of motion.
This may work as follows. When the amount of motion at the current pixel location is:
- smaller than the amount of motion at the previous pixel location, and
- is larger than the predetermined amount of motion at the current pixel location,
W may be determined as an intermediate value, e.g. W=0.5. Consequently, the effect is that D depends on both D1 and D2 to a similar extent.
As a refinement of said third embodiment, W may have an intermediate value that further depends on the amount of motion at the current pixel location. This may work as follows. For example, when the amount of motion at the current pixel location is:
- smaller than the amount of motion at the previous pixel location, and
- is smaller than or equal to 2 then W=0.0, or
- is larger than 2 then W=0.2, or
- is larger than 4 then W=0.4, or
- is larger than 6 then W=0.6, or
- is larger than 8 then W=0.8.
D therefore depends more on D1 as the amount of motion at the current pixel location is larger.
The following pseudo-code comprises a combination of the abovementioned first, second, and third embodiments:
if (currmot >= prevmot) {
    W = 1.0;                          /* first embodiment */
} else {                              /* second/third embodiment */
    if (currmot <= 2) { W = 0.0; }    /* second/third embodiment */
    if (currmot > 2)  { W = 0.2; }    /* third embodiment */
    if (currmot > 4)  { W = 0.4; }    /* third embodiment */
    if (currmot > 6)  { W = 0.6; }    /* third embodiment */
    if (currmot > 8)  { W = 0.8; }    /* third embodiment */
}
depth = W*currdepth + (1-W)*prevdepth; /* computing depth value 141 */

wherein:
- currmot represents the length of the current local motion vector 116,
- prevmot represents the length of the previous local motion vector 126,
- currdepth represents the current dynamic depth value D1 111,
- prevdepth represents the previous dynamic depth value D2 121,
- depth represents the depth value D 141,
- W represents the current weight factor 311.
In the previous embodiments, determining W is based on two features:
- the amount of motion at the current pixel location and
- the amount of motion at the previous pixel location.
Additional embodiments are obtained by replacing said two features by:
- the amount of global motion in the current image and
- the amount of global motion in the previous image, respectively. The amount of global motion in the current image may be determined as:
- a length of the global motion vector of the current image or
- an average length of local motion vectors of the current image.
The amount of global motion in the previous image is determined in a similar manner as the amount of global motion in the current image.
The depth value 141 may be part of a depth map that contains depth values of all respective image pixels of the current image. Such a depth map depends on motion, and therefore may be referred to as a 'dynamic' depth map. In addition, a so-called 'static' depth map may supplement the dynamic depth map. The static depth map is based on depth cues from within the current image and thus does not depend on motion. For example, the static depth map may be a so-called 'slant' that describes the depth as a gradient along the vertical dimension of the current image, gradually changing from nearby (i.e. large depth values) at the bottom of the current image to far away (i.e. small depth values) at the middle or top of the current image. The static depth map may also be based on other non-motion-dependent depth cues from the current image. For example, using a face detector, large depth values may be assigned to areas in the current image containing faces, so that the faces become part of the foreground of the current image. As another example, using a sky detector, low depth values may be assigned to areas of the current image detected as sky, so that said areas become part of the background of the current image.
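A 'slant' static depth map may be generated as in the following sketch; the depth range endpoints are illustrative values:

import numpy as np

def slant_static_depth(height, width, d_near=10.0, d_far=-10.0):
    # Vertical gradient: nearby (large depth values) at the bottom row,
    # far away (small depth values) at the top row.
    column = np.linspace(d_far, d_near, height)   # top row -> bottom row
    return np.tile(column[:, None], (1, width))   # same gradient in every column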
A static depth value (from a static depth map) may be added to the depth value 141. In this way, the depth value 141 contains a meaningful depth value, even when not enough motion is present in the current image for a reliable computation of the dynamic depth value. Adding the static depth value to the depth value 141 is particularly advantageous for the case (a) when stationary periods in the video sequence (i.e. with little or no motion) last too many consecutive images for the recursive temporal filter to allow reliable derivation of the depth from previous images, or (b) when the video sequence starts with a stationary scene having little motion, so that no previous images are available to reliably compute a motion-based depth value.
Adding the static depth value to the depth value 141 may be rephrased as follows. First, a 'dynamic depth value' is defined as the depth value 141 before adding the static depth value. The dynamic depth value is thus based on only the current dynamic depth value 111 and the previous dynamic depth value 121. Next, the depth value 141 is determined by combining the dynamic depth value and the static depth value. For example, as described above, the depth value 141 results from adding the static depth value and the dynamic depth value.
Combining the static depth value and the dynamic depth value may depend on the amount of global motion in the current image. For example, when the amount of global motion is low, the depth value 141 relies more on the static depth value than on the dynamic depth value, and, conversely, when the amount of global motion is high, the depth value 141 relies less on the static depth value than on the dynamic depth value. Such a combination of the static depth value and the dynamic depth value may be implemented as a weighted average, wherein the respective relative contributions (i.e. weight factors) of the static and the dynamic depth value in the weighted average gradually vary with the amount of global motion. In a variant of this embodiment, the depth value 141 simply equals the largest of the static depth value and the dynamic depth value, corresponding to the relative contributions being 0 or 1.
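Such a motion-dependent blend may be sketched as follows; the ramp thresholds m_lo and m_hi are illustrative assumptions, not values from this description:

def blend_static_dynamic(static_d, dynamic_d, global_motion, m_lo=1.0, m_hi=5.0):
    # The relative contribution of the dynamic depth value grows with the amount
    # of global motion: below m_lo the depth value relies fully on the static
    # depth value, above m_hi fully on the dynamic depth value, with a gradual
    # ramp in between.
    w_dyn = min(max((global_motion - m_lo) / (m_hi - m_lo), 0.0), 1.0)
    return w_dyn * dynamic_d + (1.0 - w_dyn) * static_d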
FIG.4 illustrates a method 400 for computing the depth value 141 from the video sequence 101. The method 400 comprises the following steps. Step 410 comprises determining the current local motion vector 116 representing motion of image content at the current pixel location. Step 420 comprises determining the current dynamic depth value 111 based on the current local motion vector. Step 430 comprises determining the previous pixel location 106 being a pixel location in the previous image of the video sequence; the previous pixel location comprises image content corresponding to the image content at the current pixel location. Step 440 comprises determining the previous local motion vector 126 representing motion of the image content at the previous pixel location 106. Step 450 comprises determining the previous dynamic depth value 121 based on the previous local motion vector 126. Step 460 comprises determining the depth value 141 based on the current dynamic depth value 111 and the previous dynamic depth value 121.
Operations performed by the steps 410-460 of method 400 are consistent with operations performed by units 115, 110, 105, 125, 120, and 140 of system 100, respectively.
The method 400 described above may also be implemented as computer program code means. The computer program code means may be adapted to perform the steps of the method 400 when said computer program code is run on a computer. The computer program code may be provided via a data carrier, such as a DVD or solid-state disk, for example.
FIG.5 illustrates a 3D display system 500 for showing a 3D image 531. The 3D image 531 is derived from a current image of the video sequence 101. A display unit 540 comprises a 3D display arranged for showing the 3D image 531. For example, the display may be a stereoscopic display that requires a viewer to wear stereo glasses, or an auto-stereoscopic multiview display. An input unit 510 may receive the video sequence 101, for example by reading the video sequence 101 from a storage medium or by receiving it from a network. A processing unit 520 comprises the system 100 for computing a depth map 521 which comprises depth values of respective image pixels of the current image. A conversion unit 530 is arranged for converting the current image to the 3D image 531 based on the depth map 521 and the current image.
For example, the 3D display system may be a 3D-TV. The processing unit 520 and the conversion unit 530 may be part of a video processing board that renders a 3D video sequence in real time for being displayed on the 3D display. The processing unit 520 computes a depth map for each image in the video sequence, whereas the conversion unit 530 converts each image and the corresponding depth map into a 3D image. As another example, the 3D display system 500 may be implemented as a smart phone having a 3D display capability, or as studio equipment for offline conversion of the video sequence 101 into a 3D video sequence.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. It will be appreciated by those skilled in the art that two or more of the above-mentioned embodiments, implementations, and/or aspects of the invention may be combined in any way deemed useful.
Modifications and variations of the method or the computer program product, which correspond to the described modifications and variations of the system, can be carried out by a person skilled in the art on the basis of the present description.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb "comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device (or system) claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. The invention is defined in the independent claims. Advantageous yet optional embodiments are defined in the dependent claims.

CLAIMS:
1. A system (100) for computing a depth value (141) from a video sequence (101), the video sequence comprising a sequence of images, each of the images comprising a two-dimensional array of image pixels, each of the image pixels having a respective pixel location in the two-dimensional array, the depth value representing depth of image content at a current pixel location, the current pixel location being a pixel location in a current image in the video sequence, the system (100) comprising:
a current motion unit (115) arranged for determining a current local motion vector (116) representing motion of image content at the current pixel location,
a current depth unit (110) arranged for determining a current dynamic depth value (111) based on the current local motion vector
a previous pixel location unit (105) arranged for determining a previous pixel location (106) being a pixel location in a previous image of the video sequence, the previous pixel location comprising image content corresponding to the image content at the current pixel location
- a previous motion unit (125) arranged for determining a previous local motion vector (126) representing motion of the image content at the previous pixel location
a previous depth unit (120) arranged for determining a previous dynamic depth value (121) based on the previous local motion vector, and
a depth unit (140) arranged for determining the depth value (141) based on the current dynamic depth value (111) and the previous dynamic depth value (121), characterized by the current motion unit (115) being arranged for determining the current local motion vector (116) by
determining a local motion vector indicating motion of image content at the current pixel location,
- determining a global motion vector representing global motion of image content in at least a region of the current image, wherein the global motion vector is an average of all local motion vectors in said region of the current image, and
computing the current local motion vector as a relative motion vector based on the local motion vector relative to the global motion vector.
2. A system (100) according to claim 1, wherein
the depth unit (140) is arranged for determining the depth value (141) by computing the depth value as a combination of the current dynamic depth value and the previous dynamic depth value, the relative contributions of the current dynamic depth value and the previous dynamic depth value in the combination being defined by a current weight factor (311) and a previous weight factor, respectively.
3. A system (100) of claim 2, wherein
the depth unit (140) is further arranged for defining the current weight factor (311) by
determining a local motion indicator indicating an amount of motion at the current pixel location, and
determining at least one of the current dynamic weight factor and the previous dynamic weight factor based on the local motion indicator.
4. A system (100) of claim 3, wherein the depth unit (140) is arranged for determining the local motion indicator by computing the local motion indicator based on one of
a length of the current local motion vector,
an absolute value of a horizontal component of the current local motion vector, and
an absolute value of a vertical component of the current local motion vector.
5. A system (100) of claim 2, wherein the depth unit (140) is arranged for computing the depth value (141) by determining the current weight factor based on a difference of an amount of motion at the current pixel location and an amount of motion at the previous pixel location.
6. A system (100) according to any of the previous claims, wherein the current depth unit (110) is arranged for determining the current dynamic depth value (111) by computing the current dynamic depth value based on one of:
a length of the current local motion vector,
an absolute value of a horizontal component of the current local motion vector, and
an absolute value of a vertical component of the current local motion vector.
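The three alternatives of claim 6 (vector length, absolute horizontal component, absolute vertical component) could be computed as below; the proportionality constant scale is an assumption, since the claim only states that the current dynamic depth value is based on one of these quantities.

```python
import numpy as np

def dynamic_depth(mv, mode="length", scale=1.0):
    """Map a motion-vector field to a dynamic depth value (claim 6 options)."""
    if mode == "length":
        m = np.linalg.norm(mv, axis=2)       # length of the motion vector
    elif mode == "horizontal":
        m = np.abs(mv[..., 0])               # |horizontal component|
    elif mode == "vertical":
        m = np.abs(mv[..., 1])               # |vertical component|
    else:
        raise ValueError("unknown mode: %s" % mode)
    return scale * m
```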
7. A system (100) according to any of the previous claims, wherein
the depth unit (140) is arranged for determining the depth value (141) based on a static depth value, the static depth value being a non-motion-based depth value based only on the current image.
8. A system (100) of claim 7, wherein the depth unit (140) is further arranged for determining the depth value (141) by
- determining a dynamic depth value based on the current dynamic depth value and the previous dynamic depth value, and
- combining the dynamic depth value with the static depth value into the depth value (141), the relative contributions of the dynamic depth value and the static depth value in the combining being dependent on the global motion vector in the current image.
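One plausible reading of claim 8 in code: dynamic and static depth are mixed with a weight derived from the length of the global motion vector. Both the direction of the dependence and the soft threshold m0 are assumptions of this sketch; the claim only requires that the relative contributions depend on the global motion vector.

```python
import numpy as np

def combine_with_static(d_dyn, d_static, global_mv, m0=2.0):
    """Mix dynamic and static depth; the mix depends on global motion (claim 8).

    d_dyn, d_static : (H, W) dynamic and static (single-image) depth maps.
    global_mv       : (2,) global motion vector of the current image.
    m0              : illustrative soft threshold on global-motion length.
    """
    g = np.linalg.norm(global_mv)
    w_dyn = g / (g + m0)   # assumed: more global motion -> trust the motion cue
    return w_dyn * d_dyn + (1.0 - w_dyn) * d_static
```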
9. A system (100) according to any of the previous claims, wherein
the previous pixel location unit (105) is arranged for determining the previous pixel location (106) in a non-motion-compensated manner, by determining the previous pixel location as having the same coordinate in the two-dimensional array of the previous image as the coordinate of the current pixel location in the two-dimensional array of the current image.
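The non-motion-compensated lookup of claim 9 is trivial to express; the previous pixel location simply reuses the current coordinate, e.g.:

```python
def previous_pixel_location(x, y):
    # Claim 9: no motion compensation -- the previous pixel location has the
    # same (x, y) coordinate in the previous image as in the current image.
    return (x, y)
```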
10. A system (100) according to any of the previous claims, wherein
the system is further arranged for determining the depth value (141) by using a predetermined non-linear function for limiting the depth value to a predetermined depth value range.
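For claim 10, any smooth saturating function can serve as the "predetermined non-linear function"; a tanh-based soft limiter is one common choice, sketched here under that assumption (the depth range [0, 255] is likewise only illustrative).

```python
import numpy as np

def limit_depth(d, d_min=0.0, d_max=255.0):
    """Softly limit depth values to [d_min, d_max] with a smooth non-linearity."""
    mid = 0.5 * (d_min + d_max)
    half = 0.5 * (d_max - d_min)
    # tanh maps (-inf, inf) onto (-1, 1), so the result stays within the range.
    return mid + half * np.tanh((d - mid) / half)
```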
11. A system (100) according to any of the previous claims, wherein the previous pixel location unit (105) is arranged to determine the previous pixel location (106) in the previous image, the previous image corresponding to a later moment in time than the current image.
12. A method (400) for computing a depth value (141) from a video sequence (101), the video sequence comprising a sequence of images, each of the images comprising a two-dimensional array of image pixels, each of the image pixels having a respective pixel location in the two-dimensional array, the depth value representing depth of image content at a current pixel location, the current pixel location being a pixel location in a current image in the video sequence, the method comprising the steps of:
- determining a current local motion vector (116) representing motion of image content at the current pixel location,
- determining a current dynamic depth value (111) based on the current local motion vector,
- determining a previous pixel location (106) being a pixel location in a previous image of the video sequence, the previous pixel location comprising image content corresponding to the image content at the current pixel location,
- determining a previous local motion vector (126) representing motion of the image content at the previous pixel location,
- determining a previous dynamic depth value (121) based on the previous local motion vector, and
- determining the depth value (141) based on the current dynamic depth value (111) and the previous dynamic depth value (121),
said method characterized in that the step of determining the current local motion vector (116) comprises:
- determining a local motion vector indicating motion of image content at the current pixel location,
- determining a global motion vector representing global motion of image content in at least a region of the current image, wherein the global motion vector is an average of all local motion vectors in said region of the current image, and
- computing the current local motion vector as a relative motion vector based on the local motion vector relative to the global motion vector.
13. A three-dimensional display system (500) comprising:
- a display unit (540) comprising a display arranged for displaying a three-dimensional image (531),
- an input unit (510) arranged for receiving a video sequence (101) comprising a sequence of images, each of the images comprising a two-dimensional array of image pixels, each of the image pixels having a respective pixel location in the two-dimensional array,
- a conversion unit (530) arranged for converting an image (512) of the video sequence to the three-dimensional image using a depth map (521) comprising depth values of said image (512), and
- a processing unit (520) comprising the system (100) of claim 1 for computing depth values of the depth map.
14. A computer program product comprising computer program code means adapted to perform all the steps of the method (400) according to claim 12 when said computer program code is run on a computer.
PCT/EP2015/057534 2014-04-17 2015-04-08 System, method for computing depth from video WO2015158570A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP14165051.5 2014-04-17
EP14165051 2014-04-17

Publications (1)

Publication Number Publication Date
WO2015158570A1 (en) 2015-10-22

Family

ID=50588551

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2015/057534 WO2015158570A1 (en) 2014-04-17 2015-04-08 System, method for computing depth from video

Country Status (1)

Country Link
WO (1) WO2015158570A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010087955A1 (en) * 2009-01-30 2010-08-05 Thomson Licensing Coding of depth maps
US20110096832A1 (en) * 2009-10-23 2011-04-28 Qualcomm Incorporated Depth map generation techniques for conversion of 2d video data to 3d video data
US20130050427A1 (en) * 2011-08-31 2013-02-28 Altek Corporation Method and apparatus for capturing three-dimensional image and apparatus for displaying three-dimensional image
US20130336577A1 (en) * 2011-09-30 2013-12-19 Cyberlink Corp. Two-Dimensional to Stereoscopic Conversion Systems and Methods

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BING HAN ET AL: "Motion-segmentation-based change detection", PROCEEDINGS OF SPIE, 27 April 2007 (2007-04-27), XP055195537, ISSN: 0277-786X *
MAHSA T. POURAZAD ET AL: "Generating the Depth Map from the Motion Information of H.264-Encoded 2D Video Sequence", EURASIP JOURNAL ON IMAGE AND VIDEO PROCESSING, vol. 2010, 1 January 2010 (2010-01-01), pages 1 - 13, XP055033766, ISSN: 1687-5176, DOI: 10.1155/2010/108584 *
REDERT A: "Visualization of arbitrary-shaped 3D scenes on depth-limited 3D displays", 3D DATA PROCESSING, VISUALIZATION AND TRANSMISSION, 2004. 3DPVT 2004. PROCEEDINGS. 2ND INTERNATIONAL SYMPOSIUM ON THESSALONIKI, GREECE 6-9 SEPT. 2004, PISCATAWAY, NJ, USA,IEEE, 6 September 2004 (2004-09-06), pages 938 - 942, XP010725305, ISBN: 978-0-7695-2223-4, DOI: 10.1109/TDPVT.2004.1335416 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107071379A (en) * 2015-11-02 2017-08-18 Display latency enhancement method and portable device
GB2552058A (en) * 2016-05-11 2018-01-10 Bosch Gmbh Robert Method and device for processing image data and driver assistance system for a vehicle
US10150485B2 (en) 2016-05-11 2018-12-11 Robert Bosch Gmbh Method and device for processing image data, and driver-assistance system for a vehicle
GB2552058B (en) * 2016-05-11 2022-10-05 Bosch Gmbh Robert Method and device for processing image data and driver assistance system for a vehicle
EP3418975A1 (en) 2017-06-23 2018-12-26 Koninklijke Philips N.V. Depth estimation for an image
CN113989717A (en) * 2021-10-29 2022-01-28 北京字节跳动网络技术有限公司 Video image processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
JP6158929B2 (en) Image processing apparatus, method, and computer program
US9661227B2 (en) Method, circuit and system for stabilizing digital image
CA2726208C (en) System and method for depth extraction of images with forward and backward depth prediction
JP5153940B2 (en) System and method for image depth extraction using motion compensation
US9171373B2 (en) System of image stereo matching
US20130215107A1 (en) Image processing apparatus, image processing method, and program
US20130070049A1 (en) System and method for converting two dimensional to three dimensional video
JP2015522198A (en) Depth map generation for images
WO2012096530A2 (en) Multi-view rendering apparatus and method using background pixel expansion and background-first patch matching
JP2000261828A (en) Stereoscopic video image generating method
US8803947B2 (en) Apparatus and method for generating extrapolated view
US20120320045A1 (en) Image Processing Method and Apparatus Thereof
US9661307B1 (en) Depth map generation using motion cues for conversion of monoscopic visual content to stereoscopic 3D
JP4892113B2 (en) Image processing method and apparatus
WO2015158570A1 (en) System, method for computing depth from video
KR101458986B1 (en) A Real-time Multi-view Image Synthesis Method By Using Kinect
JP7159198B2 (en) Apparatus and method for processing depth maps
WO2013173282A1 (en) Video disparity estimate space-time refinement method and codec
WO2008152607A1 (en) Method, apparatus, system and computer program product for depth-related information propagation
EP3418975A1 (en) Depth estimation for an image
WO2013080898A2 (en) Method for generating image for virtual view of scene
Wei et al. Iterative depth recovery for multi-view video synthesis from stereo videos
US20130286289A1 (en) Image processing apparatus, image display apparatus, and image processing method
Lin et al. Semi-automatic 2D-to-3D video conversion based on depth propagation from key-frames
Choi Hierarchical block-based disparity estimation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 15713783
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: pct application non-entry in european phase
Ref document number: 15713783
Country of ref document: EP
Kind code of ref document: A1