US20230334685A1 - Method and system for generating a depth map - Google Patents

Method and system for generating a depth map

Info

Publication number
US20230334685A1
Authority
US
United States
Prior art keywords
frame
depth
grid point
frames
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US18/301,032
Other languages
English (en)
Inventor
Komal Kainth
Joel David Gibson
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Blackmagic Design Pty Ltd
Original Assignee
Blackmagic Design Pty Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Blackmagic Design Pty Ltd filed Critical Blackmagic Design Pty Ltd
Priority to US18/301,032 priority Critical patent/US20230334685A1/en
Assigned to BLACKMAGIC DESIGN PTY LTD reassignment BLACKMAGIC DESIGN PTY LTD ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: KAINTH, Komal, GIBSON, Joel David
Publication of US20230334685A1 publication Critical patent/US20230334685A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4046Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/269Analysis of motion using gradient-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Definitions

  • the present disclosure relates to methods and systems for depth estimation in images, such as in frames of a video clip comprising a time sequence of frames.
  • Depth can mean the real or implied distance from a camera capturing an image or a virtual point of view in an artificially generated image, to an object (or point on an object).
  • FIG. 1 shows an image frame from a movie clip in image (a) and a corresponding depth map in image (b).
  • In image (b), the depth of each pixel is represented as a grey level, with lighter pixels having lower depth than darker pixels, such that white pixels are the nearest and black the farthest from the camera that captured the image.
  • Some techniques for depth estimation rely on binocular or stereo images to enable depth to be determined by triangulation, but stereo images are typically not available. Monocular depth estimation techniques also exist; these typically perform depth estimation on a single image (e.g., a photograph or a single frame of a movie). However, when single image depth estimation techniques are applied to each frame in a time sequence of frames comprising a movie clip, it is common for “flicker” to occur in the depth map. The flicker results from the depth estimate for an object or region (or points in an object or region) changing from one frame to the next. A small change in absolute depth may be acceptable, but erroneous relative changes can be more problematic. Most noticeable is when (without an appreciable scene change or camera movement) the relative depth of two objects changes between frames, so that one object moves in front of, or behind, another object that it was previously behind or in front of.
  • a method of generating a depth map corresponding to a frame of a sequence of frames in a video clip may comprise: generating a single image depth map for said frame; generating a corresponding scale value for each pixel of the single image depth map; applying the corresponding scale values to the single image depth map to generate a scaled single image depth map; and generating the depth map for said frame using the scaled single image depth map of said frame and scaled single image depth maps of a plurality of temporally related frames.
  • the corresponding scale value for each pixel of the single image depth map may be generated using a method comprising: for each grid point of a plurality of grid points which are arranged across the frame, generating an initial scale value using a depth value for the grid point and depth values for the same grid point from a plurality of temporally related frames, and generating a final scale value for said grid point on the basis of said grid point's initial scale value and an initial scale value of one or more neighboring grid points; and determining scale values for application to each pixel of the single image depth map from the final scale values of the grid points.
  • the step of generating an initial scale value using a depth value for the grid point and depth values for the same grid point from a plurality of temporally related frames can comprise determining a depth value for the grid point in said frame by determining an average depth value for a region including the grid point; and determining depth values corresponding to the same grid point for a plurality of temporally related frames comprises determining a correspondence between content of said frame and content of said temporally related frames such that a location corresponding to said grid point can be determined for each of the plurality of temporally related frames, and determining an average depth value for a region including said location in each temporally related frame to determine a depth value corresponding to said grid point for each temporally related frame.
  • the initial scale value for each grid point can be determined using a ratio of: a measure of central tendency of a group of depth values including at least the depth values for the same grid point from the plurality of temporally related frames, to the depth value for the grid point.
  • the measure of central tendency could be the median.
  • the group of depth values could include the depth value for the grid point.
  • the method can include defining a mask including pixels of said frame in which the single image depth map is determined to be either or both of: unreliable based on optical flow analysis of the plurality of frames; or to have a depth greater than a threshold depth.
  • determining a correspondence between the content of said frame and the content of said temporally related frames can include analyzing optical flow between temporally adjacent frames and generating a warped depth map of each of said plurality of temporally related frames in accordance with the optical flow, whereby said location corresponding to said grid point is aligned with said grid point, and determining the average depth value for the region around said location in each temporally related frame using the warped depth map.
  • determining a correspondence between the content of said frame and the content of said temporally related frames can include analyzing optical flow between temporally adjacent frames and tracking the location of said grid point in each of said temporally related frames using said optical flow and determining the average depth value for a region around said location in each temporally related frame.
  • pixels that are included in the mask are excluded from either or both of: determining a depth value for the grid point by determining an average depth value for a region including the grid point; and/or determining depth values corresponding to the same grid point for a plurality of temporally related frames.
  • the step of generating a final scale value for said grid point on the basis of said grid point's initial scale value and an initial scale value of one or more neighboring grid points comprises determining a relative contribution of each of said one or more neighboring grid points and said grid point's initial scale value.
  • the relative contribution for said one or more neighboring grid points can be determined in some embodiments using said mask.
  • generating a final scale value for said grid point on the basis of said grid point's initial scale value and an initial scale value of one or more neighboring grid points includes solving a series of linear equations representing an initial scale value of each of said grid points and the initial scale value for each of said grid point's neighboring grid points.
  • determining scale values for application to each pixel of said single image depth map from the final scale values of the grid points can comprise generating a scale value for each pixel between said grid points by interpolation. If there are pixels outside said grid points, these can have scale values determined by extrapolation.
  • the scale values for application to each pixel of said single image depth map from the final scale values of the grid points can be determined by assigning a scale value for each pixel based on a position relative to said grid points. For example, all pixels in an area around each grid point may take the scale value corresponding to the grid point.
  • Generating a single image depth map for each frame may use machine learning techniques. For example, it may comprise using a deep learning model to generate said single image depth map.
  • the deep learning model may be a convolutional neural network, or other suitable model.
  • the method can be repeated or continued to generate a depth map for at least one additional frame of the video clip.
  • the depth map can be generated at a lower resolution than the frame. For example, it may be generated at a fractional resolution, e.g., 1/2 or 1/4 resolution. In other embodiments, the depth map can be generated at the same resolution as the frame.
  • a computer system including a processor operating in accordance with execution instructions stored in non-transitory storage media, whereby execution of the instructions configures the computer system to perform an embodiment of a method described herein.
  • the computer system can be a non-linear editor for use in editing video and optionally audio media.
  • Non-transitory computer-readable storage media storing thereon execution instructions which when executed by a processor cause the processor to perform an embodiment of a method as described herein.
  • a computer software product containing execution instructions which when executed by a processor cause the processor to perform an embodiment of a method as described herein.
  • the computer software product can comprise a non-linear editing software product or video effects software product; for example, the Applicant's DaVinci Resolve or Fusion software could perform embodiments of a method as described herein.
  • FIG. 1 shows a frame of a movie clip (a) and a corresponding depth map (b) illustrating estimated depth in grey levels.
  • FIG. 2 is a flowchart showing an overview of one embodiment of a method for generating a depth map for a frame of a video clip.
  • FIG. 3 is a flowchart showing further details of an embodiment according to the overview of FIG. 2 .
  • FIG. 4 shows a series of frames of a video clip.
  • FIG. 5 shows a series of single image depth maps corresponding to the frames of FIG. 4 .
  • FIGS. 6 A to 6 C illustrate example arrangements of grid points in three embodiments.
  • FIG. 7 shows a frame (a) and the frame overlaid (b) with a grid of FIG. 6 B .
  • FIG. 8 illustrates corresponding regions associated with grid points in temporally related frames.
  • FIG. 9 illustrates a further embodiment of corresponding regions associated with grid points in temporally related frames using warping.
  • FIG. 10 illustrates a process for optical flow estimation.
  • FIG. 11 illustrates an optical flow estimation applied to a plurality of temporally related frames.
  • FIG. 12 illustrates a series of frames n−2 to n+2 and schematically represents how warped SIDMs may be created using a backward warp.
  • FIG. 13 illustrates a mask used in at least one embodiment.
  • FIG. 14 illustrates a model in the form of a circuit diagram that may be used to determine final scale values in some embodiments.
  • FIG. 15 schematically illustrates a process using spatio-temporal filtering of sSIDMs to generate the final depth map for frame n.
  • FIG. 16 is a schematic block diagram of a first embodiment of a computer system according to an embodiment disclosed herein.
  • FIG. 2 is a flowchart that schematically illustrates an overview of an embodiment of a method for generating a depth map corresponding to a frame (frame n) of a sequence of frames in a video clip.
  • the method 10 begins with a video clip 110 having a plurality of frames (frame n−x . . . frame n+y) and finally generates a depth map for frame n (DM n).
  • the method can be performed again to generate a depth map for any other frame (e.g., frame n+1, n−1, etc.). It will become apparent, however, that not all steps, actions, or sub-steps will need to be repeated in full, as data may be reused from one frame to the next.
  • baseline depth estimation is performed to generate a single image depth map (SIDM) for frame n, and for at least some frames temporally adjacent to frame n.
  • baseline depth estimation 12 can be performed on all frames of the clip or only the frames necessary to complete the method in respect of frame n.
  • step 14 involves application of a scalar field to the baseline depth estimation from step 12 .
  • the baseline SIDM values are multiplied by corresponding values in the scalar field.
  • the scalar field is calculated using SIDM values from a time series of frames (including frame n). This may help to address large area flickering in the SIDM from one frame to the next.
  • In step 16 , spatio-temporal filtering is performed, using scaled single image depth maps of a plurality of frames, to generate the depth map for frame n (DM n).
  • This step may take a weighted average of corresponding spatial regions of the scaled depth maps over the plurality of frames.
  • FIG. 3 is a flowchart that illustrates steps in a method of generating a depth map according to an embodiment of the method of FIG. 2 .
  • the method 100 begins with obtaining a video clip 110 .
  • the video clip could be obtained, for example, by reading from memory, receiving a video clip via transmission channel over a wired or wireless network, or directly capturing the video clip in a camera.
  • the video clip 110 comprises a plurality of frames.
  • the plurality of frames 110 include x frames before frame n (for which the depth map is to be created) and y frames after frame n (Frame n−x . . . Frame n . . . Frame n+y).
  • x and y are arbitrary numbers of frames and x and y may be equal or unequal.
  • FIG. 4 shows a series of images for part of a video clip.
  • the images in FIG. 4 show a bear walking, and comprise 7 frames in total. There are 3 frames before (Frames n−3 to n−1) and 3 frames after (Frames n+1 to n+3) the frame (frame n) for which a depth map will be created.
  • baseline depth estimation is performed by generating a single image depth map (SIDM) for each of a plurality of frames.
  • the plurality of frames processed in this step may be all frames in the video clip 110 or just those needed to process Frame n.
  • the single image depth map corresponding to Frame n is labelled SIDM n.
  • FIG. 5 shows single image depth maps corresponding to the frames of the clip from FIG. 4 .
  • the 7 single image depth maps are labelled (SIDM n−3 . . . SIDM n+3). The same naming convention is used for other frames and single image depth maps.
  • single image depth estimation can be performed using a convolutional neural network, such as MiDaS.
  • MiDaS is described more fully by Rene Ranftl, Katrin Lasinger, Konrad Schindler, and Vladlen Koltun in “Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer”, TPAMI, 2020, and can be accessed at the repository https://github.com/isl-org/MiDaS.
  • the original frame may be scaled to 384×n, where n depends on the aspect ratio of the input clip and represents the length of the short side of the image frame.
  • the output of MiDaS produces inverse depth, such that the output equals 1/Depth.
  • This is represented in FIG. 5 as a greyscale image for each SIDM, wherein the estimated depth of the image content is represented as a grey level, with lighter pixels having lower estimated depth than darker pixels, such that white pixels are those areas deemed nearest and black pixels those deemed farthest from the point of view from which the image is captured (the camera in an image of a real scene, or some chosen point in an artificially created image).
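  • As an illustration only, a baseline SIDM of this kind could be produced with the publicly available MiDaS model roughly as sketched below. The sketch assumes PyTorch, OpenCV and the torch.hub entry points published in the MiDaS repository; the choice of the "MiDaS_small" variant and the resize back to frame resolution are convenience assumptions, not requirements of the method described here.

```python
import cv2
import torch

# Load a MiDaS model and its matching input transform via torch.hub
# (entry points as published in the isl-org/MiDaS repository; weights are
# downloaded on first use).
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.small_transform  # matches the MiDaS_small model

def single_image_depth(frame_bgr):
    """Return an inverse-depth map (MiDaS output ~ 1/depth) for one frame."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    batch = transform(rgb)                      # resize and normalize to the model input
    with torch.no_grad():
        prediction = midas(batch)
        # Resize the prediction back to the original frame resolution.
        prediction = torch.nn.functional.interpolate(
            prediction.unsqueeze(1),
            size=rgb.shape[:2],
            mode="bicubic",
            align_corners=False,
        ).squeeze()
    return prediction.cpu().numpy()             # larger values = nearer (inverse depth)
```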
  • In step 130 , the single image depth map (SIDM n) is scaled to generate a scaled single image depth map (sSIDM n).
  • step 130 includes the following sub-processes:
  • scale values are determined for each pixel of the single image depth map from the final scale values of the grid points. This may include interpolating scale values between the grid points, and if necessary, extrapolating scale values outside them. In some embodiments, groups of pixels may share scale values to avoid the need to interpolate the scale values up to the full resolution of the frame.
  • initial scale values are generated at a plurality of grid points which are arranged across the frame.
  • the grid points may be arranged in a regular pattern or array across the frames, or placed in an irregular distribution around the frame, or placed at specific positions based on the image.
  • FIGS. 6 A to 6 C illustrate several ways in which grid points may be arranged with respect to a frame and its single image depth map (which typically have the same dimensions or aspect ratio).
  • FIGS. 6 A and 6 B illustrate examples where the grid points are arranged in a regular array with respect to the frame.
  • FIG. 6 A shows the single image depth map of frame n (SIDMn) overlaid with a grid of lines. Each intersection between the vertical lines ( 200 V) and horizontal lines ( 200 H) define grid points such as grid point 200 P.
  • FIG. 6 B illustrates a similar arrangement of grid points (e.g., 200 Q) to those of FIG. 6 A except that the grid points are arranged by vertical and horizontal grid lines that are offset with respect to those of FIG. 6 A . Accordingly FIG. 6 B has grid points positioned at the edge of the frame, whereas in FIG. 6 A its outermost grid points are spaced inwardly from the edge of the frame. Other grid shapes, or grid lines set at an angle to the horizontal or vertical are possible in some embodiments.
  • the grid points may be placed in a regular n x m array.
  • FIG. 6 C illustrates an example where grid points (e.g., 200 R) are placed randomly around the frame.
  • in the embodiment illustrated in FIG. 7 , the grid points are placed in an n×m array having a 25×14 layout.
  • FIG. 7 shows, in panel (a), a frame from a clip which shows a jogger running by a body of water.
  • Panel (b) shows the frame overlaid with a grid having 25 (vertical) lines across the frame and 14 (horizontal) lines spaced up the image.
  • This grid of lines defines 350 grid points arranged in a 25×14 array and located at the intersections of the lines. Note that this embodiment follows the example of FIG. 6 B and includes grid points on the edge of the frame.
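  • For illustration, a regular grid of point coordinates such as the 25×14 layout of FIG. 7(b), with points on the frame edges as in FIG. 6B, could be laid out as in the following sketch (the grid dimensions are simply the example values used above):

```python
import numpy as np

def make_grid_points(height, width, n_cols=25, n_rows=14):
    """Return an (n_rows, n_cols, 2) array of (y, x) grid point coordinates,
    including points on the frame edges (as in FIG. 6B)."""
    ys = np.linspace(0, height - 1, n_rows)
    xs = np.linspace(0, width - 1, n_cols)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([yy, xx], axis=-1)
```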
  • the step of generating an initial scale value for a given grid point uses a depth value for the grid point; and the depth values for the same grid point from a plurality of temporally related frames.
  • Determining a depth value for the grid point in said frame may involve determining an average depth value for a region including the grid point.
  • FIG. 8 illustrates how a region including a grid point may be defined in at least one embodiment.
  • FIG. 8 schematically illustrates a sequence of single image depth maps for frames n−i to n+i.
  • a grid point 200 Q is illustrated, along with a region around it 202 Q containing depth values.
  • the region 202 Q extends to the halfway point between the vertical and horizontal lines that intersect to define the grid point 200 Q and their neighboring horizontal and vertical lines.
  • the region is the same shape as the grid defining the grid points, but offset so that the grid point is in the center of its region.
  • the region may be square or rectangular depending on the spacing of the grid, or in some embodiments another geometry e.g., circular, if the region is defined by a radius around the grid point.
  • FIG. 8 also shows depth maps for temporally related frames n−i and n+i.
  • each frame will have a grid point corresponding to 200 Q and its corresponding region or area 202 Q.
  • These are illustrated in SIDM n−i and SIDM n+i as grid points 200 Q−i and 200 Q+i, which are surrounded by regions 202 Q−i and 202 Q+i.
  • the average value of the SIDM can be determined for the region and assigned to the grid point. This same process can be performed for the same grid point for a plurality of temporally related frames.
  • the temporally related frames can be a series of frames that come before or after frame n. In at least one embodiment, three frames before and after are used, but more or fewer can be used.
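  • A minimal sketch of assigning each grid point the average SIDM value over its surrounding rectangular region (extending halfway to the neighboring grid points, as described above) might look like the following; the treatment of region boundaries is one reasonable reading of the description, not a prescribed implementation:

```python
import numpy as np

def grid_point_depths(sidm, grid_points):
    """Average the SIDM over the rectangular region centred on each grid point."""
    n_rows, n_cols, _ = grid_points.shape
    h, w = sidm.shape
    ys = grid_points[:, 0, 0]          # row (y) coordinates of the grid points
    xs = grid_points[0, :, 1]          # column (x) coordinates of the grid points
    # Region edges fall halfway between neighboring grid points (and at the frame border).
    y_edges = np.concatenate(([0.0], (ys[:-1] + ys[1:]) / 2.0, [float(h)]))
    x_edges = np.concatenate(([0.0], (xs[:-1] + xs[1:]) / 2.0, [float(w)]))
    out = np.zeros((n_rows, n_cols))
    for i in range(n_rows):
        for j in range(n_cols):
            y0, y1 = int(y_edges[i]), int(np.ceil(y_edges[i + 1]))
            x0, x1 = int(x_edges[j]), int(np.ceil(x_edges[j + 1]))
            out[i, j] = sidm[y0:y1, x0:x1].mean()
    return out
```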
  • one or more embodiments may first determine a correspondence between the content of said frame and the content of said temporally related frames, and in some embodiments pixels or groups of pixels where the correspondence is weak may be treated differently or excluded from certain processing steps.
  • Checking the correspondence between the content of said frame and the content of said temporally related frames can include analyzing optical flow between temporally adjacent frames. This can be done using an AI tool such as a Convolutional Neural Network (CNN).
  • One suitable example of such a tool is SelFlow, as described by P. Liu, M. Lyu, I. King and J. Xu, “SelFlow: Self-Supervised Learning of Optical Flow,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 4566-4575. Such a tool can be used to determine optical flow between frame n and each of the temporally related frames.
  • As illustrated schematically in FIG. 10 , the optical flow estimator 301 takes three frames as inputs (Frame n−1, Frame n, Frame n+1) and produces two outputs (Flow n→n+1, Flow n→n−1), wherein Flow n→n+1 is a forward optical flow estimate from the “central” frame (Frame n in FIG. 10 ) to the later frame (Frame n+1) and Flow n→n−1 is a backward optical flow estimate from the central frame (Frame n) to the earlier frame (Frame n−1).
  • the process is performed using a plurality of temporally related frames with inputs being the central frame (Frame n) and equally spaced pairs of frames before and after the central frame to generate optical flow estimates between each frame and the central frame.
  • it is not essential that the frames be equally spaced, and other, unequal spacings can be used in some embodiments.
  • FIG. 11 illustrates the optical flow estimation process from FIG. 10 applied to the frames illustrated in FIG. 4 .
  • 6 optical flow estimates are made, each between the central frame (Frame n) and a frame either following it (Frames n+1, n+2, n+3) or preceding it (Frames n−1, n−2, n−3).
  • the optical flow estimator 301 is used 3 times on three groups of frames.
  • the optical flow estimator performs a first set of estimations using a first set of frames (Frame n−1, Frame n, Frame n+1) and outputs two optical flow estimations (Flow n→n+1, Flow n→n−1).
  • the optical flow estimator 301 also performs a second set of estimations using a second set of frames (Frame n−2, Frame n, Frame n+2) and outputs two optical flow estimations (Flow n→n+2, Flow n→n−2).
  • the optical flow estimator performs a third set of estimations using a third set of frames (Frame n−3, Frame n, Frame n+3) and outputs two optical flow estimations (Flow n→n+3, Flow n→n−3).
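  • As a structural sketch only, pairing the central frame with each of its neighbors could be organised as below. The callable `estimate_flow` is a hypothetical placeholder for whichever two-frame (or SelFlow-style multi-frame) estimator is used; the disclosure does not prescribe a particular API.

```python
def flows_to_central(frames, n, estimate_flow, max_offset=3):
    """Return {k: dense flow field from frame n to frame n+k} for k = +/-1..+/-max_offset.

    `estimate_flow(src, dst)` is a hypothetical callable returning an (H, W, 2)
    flow field mapping each pixel of `src` to a position in `dst`.
    """
    flows = {}
    for k in range(1, max_offset + 1):
        flows[+k] = estimate_flow(frames[n], frames[n + k])  # Flow n -> n+k
        flows[-k] = estimate_flow(frames[n], frames[n - k])  # Flow n -> n-k
    return flows
```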
  • FIG. 12 illustrates a series of frames n−2 to n+2 ( 1202 ).
  • SIDMs 1204 are created for each frame and optical flow is used to perform a backward warp to generate warped depth maps 1206 : “Warp SIDM n−1→n”, “Warp SIDM n+1→n”, “Warp SIDM n−2→n” and “Warp SIDM n+2→n”.
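  • One standard way to realise such a backward warp is OpenCV's remap: for every pixel of frame n, the flow from n to the neighboring frame indicates where to sample that neighbor's SIDM. The following is a sketch of that idea, not a statement of how the applicant implements the warp:

```python
import cv2
import numpy as np

def backward_warp(depth_src, flow_n_to_src):
    """Warp `depth_src` (the SIDM of a temporally related frame) onto frame n.

    `flow_n_to_src` is an (H, W, 2) array of (dx, dy) displacements from each
    pixel of frame n to its corresponding position in the source frame.
    """
    h, w = depth_src.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow_n_to_src[..., 0]).astype(np.float32)
    map_y = (grid_y + flow_n_to_src[..., 1]).astype(np.float32)
    return cv2.remap(depth_src.astype(np.float32), map_x, map_y,
                     interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_REPLICATE)
```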
  • a mask is created by checking the pixel-wise difference between the depth map at the current time step (Frame n) and each warped depth map. If the difference is more than a predetermined threshold (e.g., 3.0 in some embodiments), the area is masked and the pixels are effectively deemed ‘unreliable’.
  • Pixel locations that have a depth value higher than a predetermined threshold (e.g., 25.0) may also be masked. This is because the depth of distant objects (for example, sky) can be orders of magnitude larger than that of nearby objects (for example, the runner in the images of FIG. 12 ).
  • the inclusion of even very few distant pixels in subsequent calculations (e.g., taking an average depth around a grid point that could include both sky pixels and runner pixels) will unduly distort the average towards the distant pixels, even though the remainder of the region has foreground content.
  • FIG. 13 illustrates such an example mask generated from the frames of FIG. 12 .
  • the white regions (i.e., masked pixels) denote unreliable or distant areas, while black regions denote reliable and nearby pixels that are unmasked.
  • the mask includes the sky 1302 (because it is distant) and some area 1304 around the runner, because that area contains the greatest variability between the depth at the current timestep (Frame n) and the warped depth maps, as it represents the interface between the moving runner and the relatively stable background.
  • a mask excluding only distant pixels may also be used in some embodiments.
  • a mask may be a “single frame mask” that is generated from the SIDM of a current frame (Frame n) and that of a single temporally related frame. Such a mask will be useful in computing an initial scale value for grid points using the single temporally related frame.
  • a mask may be a “multiple frame mask” created by the combination of multiple single frame masks. This is performed by using an “OR” operation to combine multiple masks, so that any pixel masked in a single frame mask is masked in the multiple frame mask.
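  • A minimal sketch of a single frame mask and its OR-combination into a multiple frame mask, using the example thresholds quoted above (3.0 for the frame-to-frame difference and 25.0 for distant pixels; both values are described as examples only), could be:

```python
import numpy as np

def single_frame_mask(depth_n, warped_depth_k, diff_thresh=3.0, far_thresh=25.0):
    """True = masked (unreliable or distant) pixel.

    `depth_n` is the depth map at the current time step and `warped_depth_k`
    the depth map of a temporally related frame backward-warped onto frame n.
    """
    unreliable = np.abs(depth_n - warped_depth_k) > diff_thresh
    distant = depth_n > far_thresh
    return unreliable | distant

def multiple_frame_mask(depth_n, warped_depths, **thresholds):
    """OR-combination of the single frame masks over all warped neighbors."""
    mask = np.zeros(depth_n.shape, dtype=bool)
    for warped in warped_depths:
        mask |= single_frame_mask(depth_n, warped, **thresholds)
    return mask
```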
  • generating an initial scale value for a given grid point uses a depth value for the grid point and the depth values for the same grid point from the plurality of temporally related frames.
  • Determining a depth value for the grid point in the frame can involve determining an average depth value for the region including the grid point, but excluding pixels that are masked (for example, due to their being at a distance greater than the predetermined depth threshold).
  • the same process is performed on the warped depth maps of the temporally related frames: for each grid point, a depth value is computed as the average depth value for the region in the warped depth map, excluding pixels that are masked (e.g., due to their being at a distance greater than the predetermined depth threshold).
  • in this example, a 7×25×14 matrix of average depth values is computed (one 25×14 grid of values for each of the 7 frames).
  • An initial scale value for each grid point may then be calculated by comparing the depth value of the grid point in the present frame to the group of depth values of the corresponding grid point in the temporally related frames. This can involve determining a ratio of a measure of central tendency of the group of depth values to the depth value for the grid point.
  • initial scale value can be calculated as follows:
  • Initial Scale Value = (Median depth value of group) / (Depth value in frame n)
  • the group of depth values for the temporally related frames will typically include the depth value for the grid point; that is, in the present illustrative embodiment, the group of depth values will include 7 average values.
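  • Collecting the per-grid-point averages of the running example (the 7×25×14 matrix mentioned above) into an array indexed as (time, grid row, grid column), the initial scale values can be sketched as the ratio of the median over the group (frame n included) to the value in frame n; masked pixels are assumed to have already been excluded when the averages were formed:

```python
import numpy as np

def initial_scale_values(grid_depths, eps=1e-6):
    """`grid_depths` has shape (T, n_rows, n_cols): per-grid-point average depth
    values for frame n and its temporally related frames (T = 7 in the example),
    with frame n stored at index T // 2. Returns (n_rows, n_cols) initial scales."""
    centre = grid_depths.shape[0] // 2
    median_over_time = np.median(grid_depths, axis=0)       # median of the group
    return median_over_time / (grid_depths[centre] + eps)   # eps guards against /0
```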
  • the “grid” defining the grid point can be warped (e.g., using image analysis techniques such as optical flow) so that a corresponding grid point moves from frame to frame with the image content; similarly, a corresponding region that is a first shape in frame n may take a different shape or different orientation in a temporally related frame due to such warping.
  • FIG. 9 illustrates this schematically: again, SIDM n, SIDM n−i and SIDM n+i are illustrated with grid points 200 Q, 200 Q−i and 200 Q+i, which are surrounded by regions 202 Q, 202 Q−i and 202 Q+i.
  • the initial scale values have some temporal consistency from frame to frame because each successive frame's initial scale value will share some common frames in its determination. But an initial scale value has no regard for spatial consistency, as only spatially corresponding portions of the frames are used in its generation. This is addressed by generating final scale values for each grid point on the basis of said grid point's initial scale value and that of its neighboring grid points.
  • This process can involve determining a relative contribution of each of the neighboring grid points and said grid point's initial scale value.
  • the relative contribution for said one or more neighboring grid points can be determined in some embodiments using said mask.
  • the task of determining a set of values in such a scenario can be modelled as determining a voltage at each node in a network of resistors (or equivalently, as forces at nodes in a network of springs).
  • FIG. 14 illustrates an example of a model network of resistors representing this problem.
  • each initial scale value is represented as a battery 1401 having a voltage (b) equal to the initial scale value.
  • the final scale value corresponding to an initial scale value is the voltage at the node in the network closest to the battery representing the initial scale value.
  • These two values are tied together by an “elasticity” that represents how much influence the initial value has over the final value, and is modelled as a resistor 1402 between them.
  • the influence of each neighboring node on the final scale value (represented as node voltage (u)) is set by weights that are represented as resistors 1403 joining neighboring nodes.
  • the “diagonal” connections between nodes are only shown for the node labelled “c” and its neighboring nodes labelled “n”. All other “diagonal” connections are also weighted in the same manner but not shown. Node c will be used as an example below.
  • the voltages (u) can be calculated by solving a set of linear equations representing the model circuit as follows
  • A^T CAu = A^T Cb (EQ1)
  • A is an incidence matrix that defines the connection between nodes. As noted above all neighboring nodes are connected.
  • A^T is the transpose of matrix A.
  • u is a vector containing the voltages at each node.
  • b is a vector containing battery voltages that represent the initial scale values.
  • C is a conductance matrix. This is a matrix with values computed as follows:
  • First weights are computed for each node. Using the grey highlighted portion 1410 of the circuit in FIG. 14 , the weight of node “c” can be determined as follows:
  • the weight of each node is a weighted average over the pixels within its area of influence (illustrated here for node “c”).
  • a “multiple frame mask” can be used to exclude any pixel that is potentially problematic, e.g., due to variation or distance.
  • This area of influence includes all pixels within an area defined by the neighboring nodes—e.g., for node c it includes all pixels within the square defined by the 8 nodes labelled “n”.
  • if the area of influence contains a high number of masked pixels, the initial scale value for that grid point will be unreliable, and its neighbors should have increased influence over the final scale value at that point. This results in a higher conductance for the weights connecting the node to its neighbors. Conversely, a “reliable” initial value with very few masked pixels will have less conductance to its neighbors and should stay closer to its initial value.
  • the weights for each area around a node are calculated as follows:
  • w_c is the weighting for a given node c, and is a sum over all pixels in the area of influence of the node.
  • λ1 is a scalar value, e.g., 0.1
  • λ2 is a scalar value, e.g., 10.0; the relative values of λ1 and λ2 set the relative importance of masked and unmasked pixels.
  • N_p is the number of pixels in the area of influence.
  • b(p) is a bilinear coefficient of the pixel at location (p) and is derived using the distance between the pixel location (p) and the node location (c).
  • each node has 8 conductances (one for each “resistor” 1403 in FIG. 14 linking it to each of its neighboring nodes) and one conductance to its initial value.
  • the conductance to the node's initial scale value may be fixed, e.g., 1 in this example.
  • a conductance matrix C can be generated, and EQ1 solved to generate a u vector that represents the final scale values for the grid points for frame n. This only involves solving a set of linear equations, which is relatively straightforward and fast compared to the optimization approaches of some prior art.
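  • Because the weight formula itself does not survive in the text above, the sketch below assembles the node (Kirchhoff current law) equations of the circuit of FIG. 14 directly, which is algebraically equivalent to EQ1, and uses a stand-in weight: a bilinearly weighted mix of masked (λ2) and unmasked (λ1) pixels in a node's area of influence. Treat the weight function, and the idea of deriving a neighbor conductance from the endpoint node weights, as assumptions; only the overall structure (a fixed unit conductance from each node to its own initial value, eight neighbor conductances, one sparse linear solve for u) follows the description.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def node_weight(mask_patch, bilinear, lam1=0.1, lam2=10.0):
    """Stand-in weight for one node: bilinearly weighted mix of unmasked (lam1)
    and masked (lam2) pixels in its area of influence (an assumption; the exact
    formula is not reproduced in the text above)."""
    return np.sum(bilinear * np.where(mask_patch, lam2, lam1)) / mask_patch.size

def final_scale_values(initial, neighbour_conductance, battery_conductance=1.0):
    """Solve the resistor-network model of FIG. 14 for the node voltages u.

    initial               : (R, C) initial scale values (the battery voltages b).
    neighbour_conductance : callable (i, j, k, l) -> conductance between the
                            neighboring nodes (i, j) and (k, l), e.g. derived
                            from node weights such as node_weight() above.
    """
    R, C = initial.shape
    N = R * C
    idx = lambda i, j: i * C + j
    rows, cols, vals = [], [], []
    diag = np.full(N, battery_conductance)      # conductance to each node's own battery
    for i in range(R):
        for j in range(C):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue
                    k, l = i + di, j + dj
                    if 0 <= k < R and 0 <= l < C:
                        g = neighbour_conductance(i, j, k, l)
                        diag[idx(i, j)] += g
                        rows.append(idx(i, j))
                        cols.append(idx(k, l))
                        vals.append(-g)
    # Kirchhoff current law at each node:
    #   (g_b + sum_j g_ij) u_i - sum_j g_ij u_j = g_b b_i
    system = sp.coo_matrix((vals, (rows, cols)), shape=(N, N)).tocsr() + sp.diags(diag)
    rhs = battery_conductance * initial.ravel()
    return spla.spsolve(system, rhs).reshape(R, C)
```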
  • the final scale values of the grid points determine the scale value to be applied to each pixel of the SIDM.
  • this includes interpolating scale values between the grid points (e.g., using bilinear interpolation). If there are pixels in the SIDM that lie outside the outermost grid points, extrapolation from the final scale values at the grid points can be used to generate scale values for application to these pixels. However, it may not be strictly necessary to have individual scale values for all pixels in the SIDM of the frame.
  • interpolation or extrapolation may not increase the number of scale values to match the full resolution of the frame or SIDM.
  • the scale values for application to each pixel of said SIDM from the final scale values of the grid points can be determined by assigning a scale value for each pixel based on their position relative to said grid points. For example, all pixels in an area around each grid point may take the scale value corresponding to the grid point.
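  • Upsampling the final scale values to frame resolution and applying them multiplicatively to the SIDM can be sketched with a bilinear resize (the interpolation option described above; the per-cell assignment is the cheaper alternative also mentioned):

```python
import cv2
import numpy as np

def apply_scale_field(sidm, final_scales):
    """Scale the SIDM pixel-wise by a bilinearly upsampled field of final scale values."""
    h, w = sidm.shape
    scale_field = cv2.resize(final_scales.astype(np.float32), (w, h),
                             interpolation=cv2.INTER_LINEAR)  # dsize is (width, height)
    return sidm * scale_field
```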
  • those portions of steps 12 to 16 of FIG. 2 , or steps 120 to 170 of FIG. 3 , can be repeated to generate a scaled SIDM (sSIDM) for other frames of the video clip 110 .
  • In step 140 , a time sequence of scaled single image depth maps (sSIDMs) is processed to generate a depth map corresponding to Frame n (DMn).
  • the spatio-temporal filtering step can be performed using the process set out in equations (7) and (8) of “Robust Consistent Video Depth Estimation” by Kopf et al.
  • the number of scaled single image depth maps in the time sequence is selectable. In some embodiments, it may include between 1 and 5 frames before and after Frame n.
  • FIG. 15 shows the steps in an embodiment of sub-process 140 .
  • Step 140 begins at 141 with a time sequence of scaled single image depth maps generated in step 130 .
  • a group of 5 sSIDMs is used, namely sSIDM n, sSIDM n+1, sSIDM n−1, sSIDM n+2, sSIDM n−2.
  • the number of frames can be chosen based on the computational budget to achieve the extent of temporal smoothing desired.
  • each sSIDM is warped back to sSIDM n using the previously computed optical flow estimations. This results in a series of warped sSIDM frames.
  • the warped sSIDM frames can then be processed in step 143 , using equations (7) and (8) of “Robust Consistent Video Depth Estimation” by Kopf et al. to perform spatio-temporal filtering by generating a weighted average over the time series of warped sSIDMs in a neighborhood around each pixel.
  • a neighborhood of 3×3 pixels can be used.
  • the size of the filter neighborhood can be changed to modify the spatial filtering characteristics with commensurate changes in computation requirements.
  • the final output DMn is a smoothed depth map suitable for use in further processing of the video clip 110 .
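  • The following is a simplified stand-in for step 140, not the actual equations (7) and (8) of Kopf et al.: it applies a 3×3 spatial box filter to each warped, scaled SIDM and then takes a plain temporal mean. It only illustrates the structure of the filter; a faithful implementation would use the weighting defined in that paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def spatiotemporal_filter(warped_ssidms, spatial_size=3):
    """Simplified spatio-temporal smoothing of warped, scaled SIDMs.

    `warped_ssidms` is a list of (H, W) scaled SIDMs already warped onto frame n
    (including frame n's own sSIDM). A box filter provides the 3x3 spatial
    neighborhood; the plain temporal mean stands in for the weighted average
    of Kopf et al.
    """
    blurred = [uniform_filter(d, size=spatial_size) for d in warped_ssidms]
    return np.mean(blurred, axis=0)
```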
  • FIG. 16 provides a block diagram that illustrates one example of a computer system 1000 on which embodiments of the disclosure may be implemented.
  • Computer system 1000 includes a bus 1002 or other communication mechanisms for communicating information, and a hardware processor 1004 coupled with bus 1002 for processing information.
  • Hardware processor 1004 may be, for example, one or more general-purpose microprocessors, one or more graphics processing units, or other types of processing units, or combinations thereof.
  • Computer system 1000 also includes a main memory 1006 , such as a random access memory (RAM) or other dynamic storage device, coupled to bus 1002 for storing information and instructions to be executed by processor 1004 .
  • Main memory 1006 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 1004 .
  • Such instructions when stored in non-transitory storage media accessible to processor 1004 , render computer system 1000 into a special-purpose machine that is customized and configured to perform the operations specified in the instructions.
  • Computer system 1000 may further include a read only memory (ROM) 1008 or other static storage device coupled to bus 1002 for storing static information and instructions for processor 1004 .
  • a storage device 1010 such as a magnetic disk or optical disk, may be provided and coupled to bus 1002 for storing information and instructions including the video editing software application described above.
  • the computer system 1000 may be coupled via bus 1002 to a display 1012 (such as one or more LCD, LED, touch screen displays, or other display) for displaying information to a computer user.
  • An input device 1014 may be coupled to the bus 1002 for communicating information and command selections to processor 1004 .
  • Another type of user input device is cursor control 1016 , such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 1004 and for controlling cursor movement on display 1012 .
  • the techniques herein are performed by computer system 1000 in response to processor 1004 executing one or more sequences of one or more instructions contained in main memory 1006 .
  • Such instructions may be read into main memory 1006 from another storage medium, such as a remote database.
  • Execution of the sequences of instructions contained in main memory 1006 causes processor 1004 to perform the process steps described herein.
  • hard-wired circuitry may be used in place of or in combination with software instructions.
  • Non-volatile media includes, for example, optical or magnetic disks, such as storage device 1010 .
  • Volatile media includes dynamic memory, such as main memory 1006 .
  • storage media include, for example, a floppy disk, hard disk drive, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, or any other memory chip or cartridge.
  • Computer system 1000 may also include a communication interface 1018 coupled to bus 1002 .
  • Communication interface 1018 provides a two-way data communication coupling to a network link 1020 that is connected to communication network 1050 .
  • communication interface 1018 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, etc.
  • communication interface 1018 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN.
  • Wireless links may also be implemented.
  • communication interface 1018 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
  • a given flowchart step could potentially be performed in various ways and by various devices, systems or system modules.
  • a given flowchart step could be divided into multiple steps and/or multiple flowchart steps could be combined into a single step, unless the contrary is specifically noted as essential.
  • the order of the steps can be changed without departing from the scope of the present disclosure, unless the contrary is specifically noted as essential.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/301,032 US20230334685A1 (en) 2022-04-15 2023-04-14 Method and system for generating a depth map

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263331396P 2022-04-15 2022-04-15
US18/301,032 US20230334685A1 (en) 2022-04-15 2023-04-14 Method and system for generating a depth map

Publications (1)

Publication Number Publication Date
US20230334685A1 true US20230334685A1 (en) 2023-10-19

Family

ID=86007069

Family Applications (1)

Application Number Title Priority Date Filing Date
US18/301,032 Pending US20230334685A1 (en) 2022-04-15 2023-04-14 Method and system for generating a depth map

Country Status (4)

Country Link
US (1) US20230334685A1 (ja)
EP (1) EP4261780A1 (ja)
JP (1) JP2023157856A (ja)
CN (1) CN116912303A (ja)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9030469B2 (en) * 2009-11-18 2015-05-12 Industrial Technology Research Institute Method for generating depth maps from monocular images and systems using the same

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220180899A1 (en) * 2019-09-06 2022-06-09 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Matching method, terminal and readable storage medium
US11984140B2 (en) * 2019-09-06 2024-05-14 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Matching method, terminal and readable storage medium

Also Published As

Publication number Publication date
JP2023157856A (ja) 2023-10-26
CN116912303A (zh) 2023-10-20
EP4261780A1 (en) 2023-10-18

Similar Documents

Publication Publication Date Title
Liu et al. Surface-aware blind image deblurring
Dosovitskiy et al. Flownet: Learning optical flow with convolutional networks
EP3540637B1 (en) Neural network model training method, device and storage medium for image processing
Fischer et al. Flownet: Learning optical flow with convolutional networks
US11017586B2 (en) 3D motion effect from a 2D image
US5745668A (en) Example-based image analysis and synthesis using pixelwise correspondence
Yang et al. Color-guided depth recovery from RGB-D data using an adaptive autoregressive model
US6192156B1 (en) Feature tracking using a dense feature array
US8346013B2 (en) Image processing apparatus, image processing method, and program
US20230334685A1 (en) Method and system for generating a depth map
CN106485720A (zh) 图像处理方法和装置
Li et al. Depth-aware stereo video retargeting
Weerasekera et al. Just-in-time reconstruction: Inpainting sparse maps using single view depth predictors as priors
US10593050B2 (en) Apparatus and method for dividing of static scene based on statistics of images
Chen et al. Improved seam carving combining with 3D saliency for image retargeting
US20220414908A1 (en) Image processing method
KR100987412B1 (ko) 멀티프레임을 고려한 비디오 오브젝트 매팅 시스템 및 방법
Concha et al. An evaluation of robust cost functions for RGB direct mapping
US10121257B2 (en) Computer-implemented method and system for processing video with temporal consistency
EP2823467B1 (en) Filtering a displacement field between video frames
US11348336B2 (en) Systems and approaches for learning efficient representations for video understanding
Kim et al. Content-aware image and video resizing based on frequency domain analysis
Sim et al. Robust reweighted MAP motion estimation
CN114936633B (zh) 用于转置运算的数据处理单元及图像转置运算方法
US10235763B2 (en) Determining optical flow

Legal Events

Date Code Title Description
AS Assignment

Owner name: BLACKMAGIC DESIGN PTY LTD, AUSTRALIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:KAINTH, KOMAL;GIBSON, JOEL DAVID;SIGNING DATES FROM 20220812 TO 20220814;REEL/FRAME:063342/0945

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION