EP1514238A1 - Improved conversion and encoding techniques - Google Patents

Improved conversion and encoding techniques

Info

Publication number
EP1514238A1
Authority
EP
European Patent Office
Prior art keywords
image
depth
classifier
points
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP03724646A
Other languages
German (de)
French (fr)
Other versions
EP1514238A4 (en)
Inventor
Simon Richard Fox
Julien Charles Flack
Philip Victor Harman
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dynamic Digital Depth Research Pty Ltd
Original Assignee
Dynamic Digital Depth Research Pty Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dynamic Digital Depth Research Pty Ltd filed Critical Dynamic Digital Depth Research Pty Ltd
Priority to EP09175312.9A priority Critical patent/EP2146319B1/en
Publication of EP1514238A1 publication Critical patent/EP1514238A1/en
Publication of EP1514238A4 publication Critical patent/EP1514238A4/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery

Definitions

  • the present invention is directed towards an improved technique for deriving depth maps from 2D images, in particular the present invention relates to a method of recovering dense depth maps from a set of sparse 3D points associated with an image sequence.
  • Structure-from-Motion is a collection of methods for recovering 3D information of a scene that has been projected on to the planar 2D film back plane of a camera.
  • the structural information derived from a SfM algorithm typically takes the form of a set of projection matrices, one projection matrix per image frame, representing the relationship between a specific 2D point in the image plane and its corresponding 3D point.
  • SfM algorithms rely on tracking specific image features to determine such structural information concerning the scene. Generally speaking only a small percentage of an image can be accurately tracked - these points usually lie on edges and corners where sharp intensity discontinuities provide unambiguous tracking cues.
  • stereo or multi-ocular disparity analysis may be used to determine 3D points from 2D images.
  • 3D points can only be established for a small percentage of an image at locations where there is sufficient contrast to unambiguously determine correspondences with a second image.
  • Prior art for conversion of sparse 3D points to dense depth maps relies on either spatial interpolation of the sparse 3D data or hypothesise-and-test approaches such as the RANSAC algorithm. Both these approaches only use the sparse 3D point data available at each individual image frame. This leads to two major shortcomings - first, the number of sparse points available in any single image may not be sufficient to accurately derive a dense depth map and secondly, the consistency of the depth maps from one frame to the next may be poor.
  • the present invention discloses a method for deriving dense depth maps from sparse 3D data that addresses these shortcomings.
  • for phase one to be completed satisfactorily, it is necessary to know the depth for a sufficient number of pixels within a key frame, in order to solve an equation to generate the depth map for that key frame. For example, if a key frame has 350,000 pixels then ideally the depth for 17,500 pixels (or 5% of the total number of pixels) would be known so as to enable a function for the depth map to be generated. If the number of pixels for which the depth is known is not sufficient, the quality of the resulting depth map will not be adequate. If an accurate depth map cannot be generated for a key frame, then it is unlikely that phase two will be able to be completed successfully. There is therefore a need for a simplified process for the generation of depth maps.
  • the present invention provides in one aspect a method of creating at least one depth map for an image sequence including the steps of: receiving image data of a plurality of frames; receiving depth data of a plurality of points in at least one of said plurality of frames; utilising said image data and said depth data to determine an algorithm to ascertain depth characteristics as a function of image data; determining a depth map for at least one said frame utilising said algorithm.
  • the present invention provides a method of creating at least one depth map for a 2D image sequence including a plurality of frames, each said frame including a plurality of points, said method including the steps of: receiving depth data for a plurality of said points in at least one said frame; determining image data for said points having said depth data; inputting said depth data and said image data into a classifier to encode the relationship between said image data and said depth data; deriving a depth map for at least one said frame of 2D image utilising said encoded relationship.
  • the present invention provides a method of creating depth maps for an image sequence including the steps of: receiving depth data for a plurality of points in said image sequence; utilising said depth data to train a classifier to ascertain depth characteristics as a function of image characteristics and relative position; creating a depth map for each frame of said image sequence utilising said image characteristics.
  • the present invention provides a method of creating depth maps for an image sequence including the steps of: dividing said image sequence into a plurality of image shots; for each image shot receiving depth data for a plurality of points and utilising said depth data to train a respective classifier to create an initial algorithm to ascertain depth characteristics as a function of image characteristics and relative position; creating at least one combined algorithm by combining at least two initial algorithms from a plurality of said shots; creating a depth map for each frame of said image sequence using said at least one combined algorithm.
  • the present invention provides a method of creating at least one depth map for an image sequence including the steps of: dividing said image sequence into a plurality of image shots; receiving depth data for a plurality of points in at least one image shot; utilising said depth data and a classifier to ascertain depth characteristics as a function of image characteristics and relative position; and creating a depth map for at least one frame of said at least one image shot utilising said image characteristics.
  • the image data may include RGB values for each point or pixel, and may also include relative XY positions.
  • the present invention may also include the initial step of determining depth data for a number of points within the image. The depth data may be determined either manually, automatically, or a combination of manual and automatic means.
  • Figure 1 illustrates the working of the applicant's co-pending patent application.
  • Figure 2 demonstrates the operation of the present invention.
  • Figure 3 illustrates the use of a classifier in one embodiment of the present invention.
  • Figure 4 illustrates a preferred embodiment of the present invention through the use of adjacent classifiers.
  • a shot detection algorithm takes as input a sequence of image frames and outputs one or more shot boundaries.
  • the shot boundaries effectively partition the image sequence into one or more groups, such that the difference between successive images within any group is relatively small.
  • Automatic shot detection algorithms are commonly based on image differencing. For example, to determine whether a shot boundary should be inserted between two successive images of a sequence the total difference between each pixel of the two images is computed. If this difference is above a pre-defined threshold then a shot boundary is inserted.
  • a 3D point may be defined as a 2D point with an associated depth value that represents the distance of the point from the camera and may be generated by any one or any combination of the following processes:
  • Structure-from-motion algorithms: Such algorithms will initially identify a number of feature points in a first image and attempt to locate the same feature points in a subsequent image. High contrast regions such as the corners and edges of objects are generally the most reliable features identified for tracking. Once sufficient corresponding 2D points have been located it is possible to derive 3D points. For example, if 8 corresponding 2D points between two images are known then the Longuet-Higgins algorithm may be used to recover the epipolar geometry between the images. In combination with the camera calibration matrix the corresponded 2D points may be triangulated to find their recovered 3D positions.
  • Stereo or multi-ocular algorithms: If an image sequence is captured using two or more cameras in a fixed relative arrangement then correspondences between the images of each camera can be used to derive depth estimates by triangulation. As with SfM algorithms, corresponding points may only be reliably identified in high contrast regions, limiting the number of accurate 3D points that may be determined from such algorithms.
  • 3D points may be manually identified either by indicating correspondences between points in two or more images or by directly associating a depth value with a single 2D point.
  • the present invention uses a classifier to encode the relationship between 2D points (inputs) and 3D points (outputs).
  • a classifier can be considered an algorithm that encodes the relationship between a set of inputs and a set of outputs.
  • a classifier has an internal configuration that may be in a number of different states. In supervised classification the classifier adapts its internal state using examples of the relationship between inputs and outputs. This process can be referred to as training a classifier.
  • the classifier may be trained using the 3D points derived from the processes described above. Alternatively, a classifier may be selected which does not require training.
  • a 2D point at a location x, y in an image occurring at a time t within an image sequence is described as P = {x, y, t, I}, where I is the set of image characteristics of the point P.
  • a 3D point is defined as R = {x, y, z, t, I}, where z corresponds to the depth, or distance from the camera, of the point R.
  • 3D points that are generated by the techniques described above can be used to train a classifier.
  • the classifier encodes the relationships between a set of 2D points and their corresponding 3D points. As this relationship varies over time a classifier is trained over a defined temporal interval T. In the preferred embodiment this temporal interval coincides with the decomposition of the image sequence as identified by the shot detection. It should be noted that a single temporal interval T might contain one or more shots that are not adjacent in time. An image sequence may alternate between two or more different scenes, for example during a news interview when the camera alternately focuses on the interviewer and the interviewee.
  • the temporal interval T may contain all the images from one scene (say the interviewer).
  • Figure 3 shows an image sequence that has been decomposed using a shot detection algorithm so that the three frames in the center of the figure belong to a single shot 2.
  • Frame 1 is an image in the previous shot 14 and frame 5 is an image in the subsequent shot 15.
  • Each image frame in the shot has a number of 3D points 4 associated with it. It is not important whether the depth Z for each 3D point was derived manually or automatically, what is required is a series of points for which the depth is known. For simplicity these 3D points are represented by their projection on to the 2D image plane. All 3D points within the current shot are input to the classifier as training data regardless of which frame they are in.
  • a trained classifier 3 can represent the relationship or mapping between a 2D point P and a 3D point R over a temporal interval T: C{T}: P{x, y, t, I} → R{x, y, z, t, I}, if t falls within the interval T.
  • a classifier trained using 3D points derived over a specific set of image frames can now be used to recover a depth value for any other 2D point over the same temporal interval. Restricting a classifier to a specific temporal interval improves the accuracy with which it can recover 3D points from 2D points but may lead to results that are inconsistent over time. These inconsistencies generally manifest themselves at the temporal boundaries between two classifiers.
  • Before training, a classifier is generally initialised to some random state. Depending on the nature of the classifier this initial state can have a significant influence on the final state of the classifier after training.
  • the classifier C2 can be initialised to the same state as the classifier C1's initial state.
  • C2 may be initialised using a partially or fully trained state of the classifier C1. For example, assume a first classifier is initialised to a random state s1. During training the classifier might change its state iteratively, for example from s1 to s50. A second classifier following the first classifier may then be initialised to state s10, for example, instead of a random state.
  • the process of initialising a first classifier with some state of a second classifier is referred to as bootstrapping the first classifier.
  • Bootstrapping has the additional advantage of increasing the speed with which a classifier can be trained, as the starting state is generally closer to the final trained state.
  • two or more classifiers can be combined to determine a 3D point from a 2D point.
  • a classifier trained over a time interval from time t1 to time t2 is associated with a specific point in time, which is the midpoint between t1 and t2.
  • a combination means 18 can produce an improved 3D point R3 by combining R1 and R2.
  • the combination means 18 can simply average R1 and R2, but ideally uses a weighted combination of its inputs. In a preferred embodiment the weight is based on the temporal distance between the point P and the classifier's temporal midpoint.
  • Tm1 = 0.5 * (t1 + t2).
  • Tm2 = 0.5 * (t3 + t4).
  • the weighting is determined by classification error estimates as opposed to temporal proximity.
  • the present system differs significantly from the applicant's prior system, and indeed from any other method for the generation of depth maps for an image sequence.
  • the applicant's prior process required the selection of a number of key frames, and for the depth for a sufficient number of pixels within each key frame to be known. For each key frame, assuming that sufficient pixels had been assigned a depth, a depth map could then be created in phase one. The depth maps for each key frame were then used to develop subsequent depth maps for the remaining frames in phase two.
  • This process differs significantly from the present invention which is exemplified in figure 2.
  • Figure 2 shows that it is no longer necessary to select key frames from an image sequence.
  • the present invention takes depth and image data for a number of pixels across a number of frames to create the depth maps.
  • the data for the classifier could come from a single frame, particularly if the image sequence is of a still object, but even in this situation it differs from the applicant's prior application, in that the classifier is used to generate a depth map for each frame of the shot, rather than the prior system which generated two depth maps for two key frames and then used those key frame depth maps to generate subsequent depth maps. It will be appreciated that in each case the image data for each pixel is known.
  • considering RGB components, for each pixel the system knows the relative XY position and the RGB values. What is required is for a number of pixels across the shot sequence to have a depth assigned to them. This depth may be assigned manually, automatically, or by a combination of manual and automatic means. This information may then be passed to the classifier of the present invention to thereby create the depth map for each frame of the shot.
  • the system of the present invention may be further improved by utilising the classifiers in adjacent shots. That is, rather than rely on the depth map generated solely by a single classifier, reference is made to a depth as generated by an adjacent classifier. Again this is exemplified in figure 4.
  • the first shot 12 includes four frames the data of which is fed into the classifier 14.
  • the adjacent shot 13 includes six frames the data of which is fed into the second classifier 15.
  • the output from both classifiers is combined so as to determine the depth at that point.
  • the reliance on either classifier will be weighted in favour of a particular classifier depending on which frame is being considered. For example, in the example of figure 4, the weighting of classifier 15 will be much greater for the fifth frame than for the first frame. Similarly, the weighting of classifier 14 will be greater for frame five than for frame ten.
  • the weighting is designed to take into consideration the fact that the depth of objects within an image may change over time. It is also appreciated that the depth of an object will have some relevance to both the historical depth of the object, and also the future depth of the object. By weighting the various classifiers a smoother transition between shots may be achieved.
  • the applicant's prior system made significant advances from what was known at the time.
  • the present system makes further significant advances from the applicant's prior system. It is no longer necessary to perform a two phase process in order to determine depth maps for frames within any image sequence. Further, it is no longer necessary for a sufficient number of pixels within a single frame to be known so as to derive a depth map for phase one. Rather, whilst the present invention could rely on a single frame, it is capable of deriving information from a series of frames to thereby generate depth maps for each of those frames. Further, the present system will be able to perform even if no depth data is known for a particular frame, as shown for example by frame nine of figure 4. In this circumstance the classifier uses the known depth data in the remaining frames.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
  • Image Processing (AREA)

Abstract

A method of creating at least one depth map for an image sequence including the steps of receiving depth data for a plurality of points in the image sequence, utilising the depth data and a classifier to ascertain depth characteristics as a function of image characteristics and relative position and creating a depth map for at least one frame of the image sequence utilising the image characteristics.

Description

IMPROVED CONVERSION AND ENCODING TECHNIQUES

FIELD OF THE INVENTION
The present invention is directed towards an improved technique for deriving depth maps from 2D images. In particular, the present invention relates to a method of recovering dense depth maps from a set of sparse 3D points associated with an image sequence.

BACKGROUND OF THE INVENTION
Structure-from-Motion (SfM) is a collection of methods for recovering 3D information of a scene that has been projected on to the planar 2D film back plane of a camera. The structural information derived from a SfM algorithm typically takes the form of a set of projection matrices, one projection matrix per image frame, representing the relationship between a specific 2D point in the image plane and its corresponding 3D point. SfM algorithms rely on tracking specific image features to determine such structural information concerning the scene. Generally speaking only a small percentage of an image can be accurately tracked - these points usually lie on edges and corners where sharp intensity discontinuities provide unambiguous tracking cues.
Similarly, stereo or multi-ocular disparity analysis may be used to determine 3D points from 2D images. As with SfM analysis, 3D points can only be established for a small percentage of an image at locations where there is sufficient contrast to unambiguously determine correspondences with a second image.
In many applications including, but not limited to stereoscopic image rendering, robotic navigation and special effects animation, such sparse 3D points are insufficient. Such applications require a dense depth map in which each 2D point in an image is associated with a 3D point.
Prior art for conversion of sparse 3D points to dense depth maps relies on either spatial interpolation of the sparse 3D data or hypothesise-and-test approaches such as the RANSAC algorithm. Both these approaches only use the sparse 3D point data available at each individual image frame. This leads to two major shortcomings - first, the number of sparse points available in any single image may not be sufficient to accurately derive a dense depth map and secondly, the consistency of the depth maps from one frame to the next may be poor. The present invention discloses a method for deriving dense depth maps from sparse 3D data that addresses these shortcomings.
The applicants have disclosed in co-pending PCT application number PCT/AU01/00975, the contents of which are incorporated herein by reference, a method for generating depth maps from one or more images. This method involved a two-step process. In the first step sparse depth data associated with a single image was used to generate a depth map for the image. In the second phase depth maps for each image in an image sequence were generated using the results generated in phase one. Whilst this method works in ideal situations, there are many limitations to the process. In the applicant's prior application it was necessary to select a number of key frames in an image sequence. For each of these key frames it was necessary to know the depth data for a sufficient number of pixels within that key frame such that an equation to generate a corresponding depth map could be generated. That is, given the depth for a sufficient number of pixels within the key frame, a function could be derived such that the depth for every other pixel could be determined. Once these functions were generated for the key frames they could then be used in turn to generate functions for the remaining frames.
One of the limitations of the applicant's prior process is the necessity for two phases. It will be appreciated that if an error is introduced in the first phase for whatever reason, then this error is propagated throughout the second phase. In such a situation the resultant depth maps may not be satisfactory.
Of greater concern is that for phase one to be completed satisfactorily, it is necessary to know the depth for a sufficient number of pixels within a key frame, in order to solve an equation to generate the depth map for that key frame. For example, if a key frame has 350,000 pixels then ideally the depth for 17,500 pixels (or 5% of the total number of pixels) would be known so as to enable a function for the depth map to be generated. If the number of pixels for which the depth is known is not sufficient, the quality of the resulting depth map will not be adequate. If an accurate depth map cannot be generated for a key frame, then it is unlikely that phase two will be able to be completed successfully. There is therefore a need for a simplified process for the generation of depth maps.

OBJECT OF THE INVENTION
It is therefore an object of the present invention to provide an improved system for the generation of depth maps from a 2D image sequence which does not require a two phase process and is not dependent on the depth for a requisite number of pixels in a key frame to be known.
SUMMARY OF THE INVENTION
With the above object in mind the present invention provides in one aspect a method of creating at least one depth map for an image sequence including the steps of: receiving image data of a plurality of frames; receiving depth data of a plurality of points in at least one of said plurality of frames; utilising said image data and said depth data to determine an algorithm to ascertain depth characteristics as a function of image data; determining a depth map for at least one said frame utilising said algorithm.
In a further aspect the present invention provides a method of creating at least one depth map for a 2D image sequence including a plurality of frames, each said frame including a plurality of points, said method including the steps of: receiving depth data for a plurality of said points in at least one said frame; determining image data for said points having said depth data; inputting said depth data and said image data into a classifier to encode the relationship between said image data and said depth data; deriving a depth map for at least one said frame of 2D image utilising said encoded relationship.
In yet a further aspect the present invention provides a method of creating depth maps for an image sequence including the steps of: receiving depth data for a plurality of points in said image sequence; utilising said depth data to train a classifier to ascertain depth characteristics as a function of image characteristics and relative position; creating a depth map for each frame of said image sequence utilising said image characteristics.

In still a further aspect the present invention provides a method of creating depth maps for an image sequence including the steps of: dividing said image sequence into a plurality of image shots; for each image shot receiving depth data for a plurality of points and utilising said depth data to train a respective classifier to create an initial algorithm to ascertain depth characteristics as a function of image characteristics and relative position; creating at least one combined algorithm by combining at least two initial algorithms from a plurality of said shots; creating a depth map for each frame of said image sequence using said at least one combined algorithm.
In a further aspect the present invention provides a method of creating at least one depth map for an image sequence including the steps of: dividing said image sequence into a plurality of image shots; receiving depth data for a plurality of points in at least one image shot; utilising said depth data and a classifier to ascertain depth characteristics as a function of image characteristics and relative position; and creating a depth map for at least one frame of said at least one image shot utilising said image characteristics.

The image data may include RGB values for each point or pixel, and may also include relative XY positions. The present invention may also include the initial step of determining depth data for a number of points within the image. The depth data may be determined either manually, automatically, or by a combination of manual and automatic means. When dividing an image sequence into a series of shots, the preferred embodiment of the present invention will combine the outcome of the two classifiers on either side of each frame.

BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 illustrates the working of the applicant's co-pending patent application.
Figure 2 demonstrates the operation of the present invention.
Figure 3 illustrates the use of a classifier in one embodiment of the present invention.

Figure 4 illustrates a preferred embodiment of the present invention through the use of adjacent classifiers.

DETAILED DESCRIPTION OF THE INVENTION
A depth map represents the 3D profile of a scene from a specific viewpoint. When a dense depth map is associated with a 2D image it encodes the distance between the camera used to capture the image and the observed objects for each pixel in the image. An objective of the present invention is to recover a depth map given only a sparse set of 3D points, which may be derived from one of the following means.

To achieve this, an image sequence containing a number of image frames may be decomposed into a series of shots. A shot ideally contains one or more image frames in which the inter-frame differences are relatively small. The frequency and placement of shot boundaries in an image sequence may be dependent upon the motion in the sequence. Shot boundaries may be identified manually by an operator or automatically using a shot detection algorithm. A shot detection algorithm takes as input a sequence of image frames and outputs one or more shot boundaries. The shot boundaries effectively partition the image sequence into one or more groups, such that the difference between successive images within any group is relatively small.

Automatic shot detection algorithms are commonly based on image differencing. For example, to determine whether a shot boundary should be inserted between two successive images of a sequence the total difference between each pixel of the two images is computed. If this difference is above a pre-defined threshold then a shot boundary is inserted.
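A minimal sketch of such thresholded image differencing is given below; the array layout and the threshold value are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def detect_shot_boundaries(frames, threshold):
    """Insert a shot boundary wherever successive frames differ by more than a threshold.

    frames    : list of equally sized images, each an (H, W, 3) array
    threshold : pre-defined total-difference threshold (a tuning parameter)
    """
    boundaries = []
    for i in range(1, len(frames)):
        # Total difference between each pixel of the two successive images.
        diff = np.abs(frames[i].astype(np.float64) - frames[i - 1].astype(np.float64)).sum()
        if diff > threshold:
            boundaries.append(i)   # boundary lies between frame i-1 and frame i
    return boundaries

# The boundaries partition the sequence into groups ("shots") in which
# successive images differ only slightly:
# shots = np.split(np.arange(len(frames)), detect_shot_boundaries(frames, threshold=5e6))
```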
A 3D point may be defined as a 2D point with an associated depth value that represents the distance of the point from the camera and may be generated by any one or any combination of the following processes:
1. Structure-from-motion algorithms: Such algorithms will initially identify a number of feature points in a first image and attempt to locate the same feature points in a subsequent image. High contrast regions such as the corners and edges of objects are generally the most reliable features identified for tracking. Once sufficient corresponding 2D points have been located it is possible to derive 3D points. For example, if 8 corresponding 2D points between two images are known then the Longuet-Higgins algorithm may be used to recover the epipolar geometry between the images. In combination with the camera calibration matrix the corresponded 2D points may be triangulated to find their recovered 3D positions.
2. Stereo or multi-ocular algorithms: If an image sequence is captured using two or more cameras in a fixed relative arrangement then correspondences between the images of each camera can be used to derive depth estimates by triangulation. As with SfM algorithms, corresponding points may only be reliably identified in high contrast regions, limiting the number of accurate 3D points that may be determined from such algorithms.
3. Manual point correspondence or direct depth assignment: 3D points may be manually identified either by indicating correspondences between points in two or more images or by directly associating a depth value with a single 2D point.
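As a hedged illustration of process 1 above, the sketch below uses OpenCV to recover the epipolar geometry from already-tracked 2D correspondences and triangulate them into sparse 3D points; the helper name and the assumption of a known calibration matrix K are editorial choices, not drawn from the specification.

```python
import cv2
import numpy as np

def triangulate_sparse_points(pts1, pts2, K):
    """Recover sparse 3D points from corresponding 2D points in two images.

    pts1, pts2 : (N, 2) float32 arrays of corresponding feature locations, N >= 8
    K          : (3, 3) camera calibration (intrinsic) matrix, assumed known
    """
    # Epipolar geometry between the two views (essential matrix).
    E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    # Relative camera rotation and translation.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    # Projection matrices: first camera at the origin, second displaced by (R, t).
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    # Triangulate the correspondences; the result is 4 x N homogeneous coordinates.
    X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    X = (X_h[:3] / X_h[3]).T        # (N, 3) Euclidean points
    return X                        # X[:, 2] holds z, the depth of each point
```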
The present invention uses a classifier to encode the relationship between 2D points (inputs) and 3D points (outputs). A classifier can be considered an algorithm that encodes the relationship between a set of inputs and a set of outputs. A classifier has an internal configuration that may be in a number of different states. In supervised classification the classifier adapts its internal state using examples of the relationship between inputs and outputs. This process can be referred to as training a classifier. The classifier may be trained using the 3D points derived from the processes described above. Alternatively, a classifier may be selected which does not require training. For the purposes of clarification, we describe a 2D point at a location x, y in an image occurring at a time t within an image sequence as:
P = {x, y, t, I} where I is the set of image characteristics of the point P. In the preferred embodiment the image characteristics I consist of the red, green and blue colour components of the 2D point P. Any other image characteristics including, but not limited to, linear or non-linear combinations or higher order statistics of the red, green and blue components may also be used without affecting the nature of the invention. A 3D point is defined as:
R = {x, y, z, t, I} where z corresponds to the depth, or distance from the camera, of the point R. 3D points that are generated by the techniques described above can be used to train a classifier. The classifier encodes the relationships between a set of 2D points and their corresponding 3D points. As this relationship varies over time a classifier is trained over a defined temporal interval T. In the preferred embodiment this temporal interval coincides with the decomposition of the image sequence as identified by the shot detection. It should be noted that a single temporal interval T might contain one or more shots that are not adjacent in time. An image sequence may alternate between two or more different scenes, for example during a news interview when the camera alternately focuses on the interviewer and the interviewee. In such circumstances the temporal interval T may contain all the images from one scene (say the interviewer).

Figure 3 shows an image sequence that has been decomposed using a shot detection algorithm so that the three frames in the centre of the figure belong to a single shot 2. Frame 1 is an image in the previous shot 14 and frame 5 is an image in the subsequent shot 15. Each image frame in the shot has a number of 3D points 4 associated with it. It is not important whether the depth Z for each 3D point was derived manually or automatically; what is required is a series of points for which the depth is known. For simplicity these 3D points are represented by their projection on to the 2D image plane. All 3D points within the current shot are input to the classifier as training data regardless of which frame they are in.
A trained classifier 3 can represent the relationship or mapping between a 2D point P and a 3D point R over a temporal interval T:
C{T}: P{x, y, t, I} → R{x, y, z, t, I}, if t falls within the interval T
In other words, a classifier trained using 3D points derived over a specific set of image frames can now be used to recover a depth value for any other 2D point over the same temporal interval. Restricting a classifier to a specific temporal interval improves the accuracy with which it can recover 3D points from 2D points but may lead to results that are inconsistent over time. These inconsistencies generally manifest themselves at the temporal boundaries between two classifiers.
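As a minimal sketch of one way such a classifier could be realised (a scikit-learn neural-network regressor stands in here for the generic classifier; the feature layout, network size and function names are assumptions for illustration only), the sparse 3D points of one shot train a mapping from (x, y, t, R, G, B) to depth, which is then evaluated at every pixel of any frame within the shot's interval T to recover a dense depth map.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_shot_classifier(points_3d, random_state=0):
    """Train a per-shot classifier C{T} from the shot's sparse 3D points.

    points_3d : iterable of dicts {'x', 'y', 't', 'rgb', 'z'} collected over the
                whole shot, regardless of which frame each point lies in.
    """
    X = np.array([[p['x'], p['y'], p['t'], *p['rgb']] for p in points_3d], dtype=float)
    z = np.array([p['z'] for p in points_3d], dtype=float)
    clf = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                       random_state=random_state)   # random initial state
    clf.fit(X, z)
    return clf

def recover_depth_map(clf, frame_rgb, t):
    """Recover a depth value for every 2D point of a frame whose time t lies in the shot's interval."""
    h, w, _ = frame_rgb.shape
    ys, xs = np.mgrid[0:h, 0:w]
    features = np.column_stack([xs.ravel(), ys.ravel(),
                                np.full(h * w, t, dtype=float),
                                frame_rgb.reshape(-1, 3)])
    return clf.predict(features).reshape(h, w)
```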
For example, consider the situation in which an image sequence has been decomposed into two shots. A classifier is trained for all the 3D points in each shot. Assume also that the first shot occupies image frames from t1 to t2 (inclusive) and the second shot image frames from t3 to t4 (inclusive). The image frame at t2 will be classified using the first classifier and the image frame at t3 will be classified using the second classifier. In certain situations this can cause a noticeable discontinuity in the recovered depth maps. That is, the objects in the first shot may appear at a particular depth as a result of the classification, however those same objects may appear at a different depth in the second shot as a result of the second classifier. In this situation the images appear to jump around for the viewer.
In circumstances where this could be a problem, preferred embodiments of the present invention address the consistency issue in two ways.
Before training, a classifier is generally initialised to some random state. Depending on the nature of the classifier this initial state can have a significant influence on the final state of the classifier after training. To improve the consistency between two classifiers C1 and C2, where C2 occurs after C1, C2 can be initialised to the same state as the classifier C1's initial state. Alternatively, C2 may be initialised using a partially or fully trained state of the classifier C1. For example, assume a first classifier is initialised to a random state s1. During training the classifier might change its state iteratively, for example from s1 to s50. A second classifier following the first classifier may then be initialised to state s10, for example, instead of a random state. The process of initialising a first classifier with some state of a second classifier is referred to as bootstrapping the first classifier. Bootstrapping has the additional advantage of increasing the speed with which a classifier can be trained, as the starting state is generally closer to the final trained state.
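One way such bootstrapping could be realised, sketched under the same scikit-learn assumptions as above (warm_start and the snapshot iteration count are illustrative choices): train C1 for a few iterations, copy that intermediate state as C2's starting point, then let each classifier continue training on its own shot's 3D points.

```python
from copy import deepcopy
from sklearn.neural_network import MLPRegressor

def bootstrap_classifiers(X1, z1, X2, z2, snapshot_iters=10, final_iters=2000):
    """Initialise classifier C2 from a partially trained state of classifier C1."""
    c1 = MLPRegressor(hidden_layer_sizes=(64, 64), warm_start=True,
                      max_iter=snapshot_iters, random_state=0)
    c1.fit(X1, z1)                    # e.g. state s1 -> s10 after a few iterations
    c2 = deepcopy(c1)                 # C2 starts from C1's intermediate state, not a random one
    c1.set_params(max_iter=final_iters)
    c1.fit(X1, z1)                    # C1 continues training to its final state (e.g. s50)
    c2.set_params(max_iter=final_iters)
    c2.fit(X2, z2)                    # C2 trains on the second shot's 3D points
    return c1, c2
```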
In order to further improve the consistency of results two or more classifiers can be combined to determine a 3D point from a 2D point. As an example, consider that a classifier trained over a time interval from time t1 to time t2 is associated with a specific point in time, which is the midpoint between t1 and t2. For any point P = {x, y, t, I} for which we wish to determine a 3D point we determine the two closest classifiers whose temporal midpoints occur immediately before and after time t. Figure 4 shows an image sequence consisting of a number of image frames, which has been decomposed into two shots 12 and 13. The time t increases from left to right with each successive image frame. In order to determine a 3D point for a given 2D point 6 which is part of an image 7 within the shot 12 of the image sequence, the first classifier 14, with a temporal midpoint immediately before the time t of the 2D point 6, and the second classifier 15, with a temporal midpoint immediately after it, can be combined. Assuming that the first classifier 14 outputs a 3D point R1 and the second classifier 15 outputs a 3D point R2, given the 2D point P as input, a combination means 18 can produce an improved 3D point R3 by combining R1 and R2. The combination means 18 can simply average R1 and R2, but ideally uses a weighted combination of its inputs. In a preferred embodiment the weight is based on the temporal distance between the point P and the classifier's temporal midpoint. As a further illustration of this process consider that the temporal midpoint of the first classifier 14, which has been trained over a temporal interval between t1 and t2, is defined as Tm1 = 0.5 * (t1 + t2). Similarly, the temporal midpoint of the second classifier 15, which has been trained over a temporal interval between t3 and t4, is defined as Tm2 = 0.5 * (t3 + t4).

We may determine the relative contribution of the two classifiers by calculating respective weights w1 = (Tm2 - t) / (Tm2 - Tm1) for the first classifier 14 and w2 = (t - Tm1) / (Tm2 - Tm1) for the second classifier 15, so that each classifier contributes most strongly near its own temporal midpoint. The improved 3D point R3 may then be calculated as follows: R3 = w1 * R1 + w2 * R2
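A short sketch of this weighted combination follows (names are assumed; 'features' is the same (x, y, t, R, G, B) vector used during training):

```python
def combine_depth_estimates(c1, c2, t1, t2, t3, t4, features, t):
    """Combine two adjacent classifiers' depth estimates for a 2D point at time t."""
    tm1 = 0.5 * (t1 + t2)            # temporal midpoint of the first classifier
    tm2 = 0.5 * (t3 + t4)            # temporal midpoint of the second classifier
    w1 = (tm2 - t) / (tm2 - tm1)     # nearer the first midpoint -> first classifier dominates
    w2 = (t - tm1) / (tm2 - tm1)
    r1 = c1.predict([features])[0]   # R1, depth estimate from the first classifier
    r2 = c2.predict([features])[0]   # R2, depth estimate from the second classifier
    return w1 * r1 + w2 * r2         # R3 = w1 * R1 + w2 * R2
```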
In an alternative embodiment, the weighting is determined by classification error estimates as opposed to temporal proximity.
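One plausible form of such error-based weighting, offered purely as an assumption since the text does not specify it, is to weight each classifier inversely to its estimated error:

```python
def error_based_weights(errors):
    """Weights inversely proportional to each classifier's classification error estimate."""
    inverse = [1.0 / max(e, 1e-9) for e in errors]   # guard against a zero error estimate
    total = sum(inverse)
    return [v / total for v in inverse]

# Example: error estimates of 0.2 and 0.1 give weights of 1/3 and 2/3 respectively.
```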
It will be appreciated that the present system differs significantly from the applicant's prior system, and indeed from any other method for the generation of depth maps for an image sequence. As can be seen in figure 1, the applicant's prior process required the selection of a number of key frames, and for the depth for a sufficient number of pixels within each key frame to be known. For each key frame, assuming that sufficient pixels had been assigned a depth, a depth map could then be created in phase one. The depth maps for each key frame were then used to develop subsequent depth maps for the remaining frames in phase two. This process differs significantly from the present invention which is exemplified in figure 2.

Figure 2 shows that it is no longer necessary to select key frames from an image sequence. Further, it is no longer necessary to ensure that a particular frame, or key frame, has a depth assigned to a sufficient number of pixels or points. Rather, the present invention takes depth and image data for a number of pixels across a number of frames to create the depth maps. It will be appreciated that the data for the classifier could come from a single frame, particularly if the image sequence is of a still object, but even in this situation it differs from the applicant's prior application, in that the classifier is used to generate a depth map for each frame of the shot, rather than the prior system which generated two depth maps for two key frames and then used those key frame depth maps to generate subsequent depth maps. It will be appreciated that in each case the image data for each pixel is known. That is, if we consider RGB components, for each pixel the system knows the relative XY position and the RGB values. What is required is for a number of pixels across the shot sequence to have a depth assigned to them. This depth may be assigned manually, automatically, or by a combination of manual and automatic means. This information may then be passed to the classifier of the present invention to thereby create the depth map for each frame of the shot.
The system of the present invention may be further improved by utilising the classifiers in adjacent shots. That is, rather than rely on the depth map generated solely by a single classifier, reference is made to a depth as generated by an adjacent classifier. Again this is exemplified in figure 4. In this case the first shot 12 includes four frames, the data of which is fed into the classifier 14. Similarly, the adjacent shot 13 includes six frames, the data of which is fed into the second classifier 15. In order to determine the depth for any point in any of the frames one to ten, for example a point 16 in the fourth frame 17, the output from both classifiers is combined so as to determine the depth at that point. Ideally, the reliance on either classifier will be weighted in favour of a particular classifier depending on which frame is being considered. For example, in the example of figure 4, the weighting of classifier 15 will be much greater for the fifth frame than for the first frame. Similarly, the weighting of classifier 14 will be greater for frame five than for frame ten.
The weighting is designed to take into consideration the fact that the depth of objects within an image may change over time. It is also appreciated that the depth of an object will have some relevance to both the historical depth of the object, and also the future depth of the object. By weighting the various classifiers a smoother transition between shots may be achieved.
It will be appreciated that in a more complex system it could be possible to combine more than two classifiers so as to further improve the transitional smoothing. For example, three classifiers could be used including the classifier for a particular shot, and also the two adjacent classifiers on either side of the shot.
It was considered that the applicant's prior system made significant advances from what was known at the time. The present system makes further significant advances from the applicant's prior system. It is no longer necessary to perform a two phase process in order to determine depth maps for frames within any image sequence. Further, it is no longer necessary for a sufficient number of pixels within a single frame to be known so as to derive a depth map for phase one. Rather, whilst the present invention could rely on a single frame, it is capable of deriving information from a series of frames to thereby generate depth maps for each of those frames. Further, the present system will be able to perform even if no depth data is known for a particular frame, as shown for example by frame nine of figure 4. In this circumstance the classifier uses the known depth data in the remaining frames.

Whilst the method and apparatus of the present invention have been summarised and explained by illustrative application it will be appreciated by those skilled in the art that many widely varying embodiments and applications are within the teaching and scope of the present invention, and that the examples presented herein are by way of illustration only and should not be construed as limiting the scope of this invention.

Claims

THE CLAIMS DEFINING THE INVENTION ARE AS FOLLOWS:
1. A method of creating at least one depth map for an image sequence including the steps of: receiving depth data for a plurality of points in said image sequence; utilising said depth data and a classifier to ascertain depth characteristics as a function of image characteristics and relative position; creating a depth map for at least one frame of said image sequence utilising said image characteristics.
2. A method as claimed in claim 1, wherein said image characteristics include RGB values.
3. A method as claimed in claim 1 or claim 2, wherein said image characteristics include relative xy positions.
4. A method as claimed in any preceding claim, further including an initial step of ascertaining depth data for at least a predetermined number of points within said image sequence.
5. A method as claimed in any one of claims 1 to 3, including an initial step of bootstrapping said classifier.
6. A method of creating at least one depth map for an image sequence including the steps of: dividing said image sequence into a plurality of image shots; receiving depth data for a plurality of points in at least one of said image shots; utilising said depth data and a classifier to ascertain depth characteristics as a function of image characteristics and relative position; creating a depth map for at least one frame of said at least one image shot utilising said image characteristics.
7. A method as claimed in claim 6, wherein each shot is assigned a corresponding classifier.
8. A method as claimed in claim 7, wherein at least one corresponding classifier is initialised to a similar state as a previous classifier.
9. A method of creating at least one depth map for an image sequence including the steps of: receiving image data of a plurality of frames; receiving depth data of a plurality of points in at least one of said frames; utilising said image data and said depth data to determine an algorithm to ascertain depth characteristics as a function of image data; determining a depth map for said at least one of said frames utilising said algorithm.
10. A method as claimed in claim 9, wherein said image data includes RGB values.
11. A method as claimed in claim 9 or claim 10, wherein said image data includes relative xy positions.
12. A method as claimed in any one of claims 9 to 11, further including an initial step of ascertaining depth data for at least a predetermined number of points within said image sequence.
13. A method of creating at least one depth map for an image sequence including the steps of: dividing said image sequence into a plurality of image shots; for each image shot receiving depth data for a plurality of points and utilising said depth data and a respective classifier to create an initial algorithm to ascertain depth characteristics as a function of image characteristics and relative position; creating at least one combined algorithm by combining at least two initial algorithms from a plurality of said shots; creating a depth map for at least one frame of said image sequence using said at least one combined algorithm.
14. A method as claimed in claim 13, wherein said image characteristics include RGB values.
15. A method as claimed in claim 13 or claim 14, wherein said image characteristics include relative xy positions.
16. A method as claimed in any one of claims 13 to 15, further including an initial step of ascertaining depth data for at least a predetermined number of points within said image sequence.
17. A method as claimed in any one of claims 13 to 16, wherein said combined algorithm is determined using an average of said at least two initial algorithms.
18. A method as claimed in any one of claims 13 to 16, wherein said combined algorithm is determined using a weighted average of said at least two initial algorithms.
19. A method as claimed in any one of claims 13 to 18, including an initial step of initialising said classifier to a random state.
20. A method as claimed in any one of claims 13 to 18, wherein at least one said respective classifier is initialised to a similar state as a previous classifier.
21. A method substantially as hereinbefore described with reference to figures 2 to 4 of the accompanying drawings.
EP03724646A 2002-06-07 2003-06-02 Improved conversion and encoding techniques Withdrawn EP1514238A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP09175312.9A EP2146319B1 (en) 2002-06-07 2003-06-02 Filling of sparse depth maps of an image sequence

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
AUPS2836A AUPS283602A0 (en) 2002-06-07 2002-06-07 Improved conversion and encoding techniques
AUPS283602 2002-06-07
PCT/AU2003/000686 WO2003105089A1 (en) 2002-06-07 2003-06-02 Improved conversion and encoding techniques

Related Child Applications (1)

Application Number Title Priority Date Filing Date
EP09175312.9A Division EP2146319B1 (en) 2002-06-07 2003-06-02 Filling of sparse depth maps of an image sequence

Publications (2)

Publication Number Publication Date
EP1514238A1 true EP1514238A1 (en) 2005-03-16
EP1514238A4 EP1514238A4 (en) 2008-04-02

Family

ID=3836398

Family Applications (2)

Application Number Title Priority Date Filing Date
EP09175312.9A Expired - Lifetime EP2146319B1 (en) 2002-06-07 2003-06-02 Filling of sparse depth maps of an image sequence
EP03724646A Withdrawn EP1514238A4 (en) 2002-06-07 2003-06-02 Improved conversion and encoding techniques

Family Applications Before (1)

Application Number Title Priority Date Filing Date
EP09175312.9A Expired - Lifetime EP2146319B1 (en) 2002-06-07 2003-06-02 Filling of sparse depth maps of an image sequence

Country Status (6)

Country Link
EP (2) EP2146319B1 (en)
JP (1) JP4634142B2 (en)
CN (1) CN1669053B (en)
AU (1) AUPS283602A0 (en)
CA (1) CA2488738C (en)
WO (1) WO2003105089A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101720480B (en) * 2007-07-03 2012-07-18 皇家飞利浦电子股份有限公司 Computing a depth map
CN102257827B (en) * 2008-12-19 2014-10-01 皇家飞利浦电子股份有限公司 Creation of depth maps from images
CN101873506B (en) * 2009-04-21 2012-01-25 财团法人工业技术研究院 Image processing method for providing depth information and image processing system thereof
CN101867810B (en) * 2010-04-07 2011-12-14 宁波大学 Method for pre-processing deep video sequence
CN102196292B (en) * 2011-06-24 2013-03-06 清华大学 Human-computer-interaction-based video depth map sequence generation method and system
US9533539B2 (en) * 2011-10-20 2017-01-03 GM Global Technology Operations LLC Vehicle suspension system and method of using the same

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002013141A1 (en) * 2000-08-09 2002-02-14 Dynamic Digital Depth Research Pty Ltd Image conversion and encoding techniques

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07262382A (en) * 1994-03-17 1995-10-13 Fujitsu Ltd Object recognition and image interpolating device for image processing system
US6252974B1 (en) * 1995-03-22 2001-06-26 Idt International Digital Technologies Deutschland Gmbh Method and apparatus for depth modelling and providing depth information of moving objects
US6384859B1 (en) * 1995-03-29 2002-05-07 Sanyo Electric Co., Ltd. Methods for creating an image for a three-dimensional display, for calculating depth information and for image processing using the depth information
JP2910698B2 (en) * 1996-10-15 1999-06-23 日本電気株式会社 Three-dimensional structure estimation method and apparatus
JP3512992B2 (en) * 1997-01-07 2004-03-31 株式会社東芝 Image processing apparatus and image processing method
EP0928460B1 (en) * 1997-07-29 2003-01-29 Philips Electronics N.V. Method of reconstruction of tridimensional scenes and corresponding reconstruction device and decoding system
JP4112819B2 (en) * 2000-05-11 2008-07-02 株式会社東芝 Object area information generation apparatus and object area information description program
US20030206652A1 (en) * 2000-06-28 2003-11-06 David Nister Depth map creation through hypothesis blending in a bayesian framework
US6795808B1 (en) * 2000-10-30 2004-09-21 Koninklijke Philips Electronics N.V. User interface/entertainment device that simulates personal interaction and charges external database with relevant data

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002013141A1 (en) * 2000-08-09 2002-02-14 Dynamic Digital Depth Research Pty Ltd Image conversion and encoding techniques

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HARMAN P ET AL: "RAPID 2D TO 3D CONVERSION" PROCEEDINGS OF THE SPIE, SPIE, BELLINGHAM, VA, US, vol. 4660, 21 January 2002 (2002-01-21), pages 78-86, XP008021652 ISSN: 0277-786X *
QINGQING WEI: "Converting 2D to 3D: A Survey" RESEARCH ASSIGNMENT, [Online] December 2005 (2005-12), XP002469656 Information and Communication Theory Group (ICT) Faculty of Electrical Engineering, Mathematics and Computer ScienceDelft University of Technology, the Netherlands Retrieved from the Internet: URL:http://ict.ewi.tudelft.nl/pub/ben/Research%20Assignment%20Qing%20Qing%20Wei%20-%20A%20Survey%20of%202D%20to%203D%20Conversion%20Algorithms.pdf> [retrieved on 2008-02-18] *
See also references of WO03105089A1 *

Also Published As

Publication number Publication date
JP2005529417A (en) 2005-09-29
CN1669053B (en) 2013-11-06
JP4634142B2 (en) 2011-02-16
WO2003105089A1 (en) 2003-12-18
CN1669053A (en) 2005-09-14
EP1514238A4 (en) 2008-04-02
EP2146319A1 (en) 2010-01-20
EP2146319B1 (en) 2017-08-16
AUPS283602A0 (en) 2002-06-27
CA2488738C (en) 2016-10-25
CA2488738A1 (en) 2003-12-18

Similar Documents

Publication Publication Date Title
US7489812B2 (en) Conversion and encoding techniques
Harman et al. Rapid 2D-to-3D conversion
US6504569B1 (en) 2-D extended image generation from 3-D data extracted from a video sequence
CA2430591C (en) Techniques and systems for developing high-resolution imagery
EP2291825B1 (en) System and method for depth extraction of images with forward and backward depth prediction
EP3367334B1 (en) Depth estimation method and depth estimation apparatus of multi-view images
JP2004505393A (en) Image conversion and coding technology
CN110706269B (en) Binocular vision SLAM-based dynamic scene dense modeling method
JP5197683B2 (en) Depth signal generation apparatus and method
Moustakas et al. Stereoscopic video generation based on efficient layered structure and motion estimation from a monoscopic image sequence
Jain et al. Efficient stereo-to-multiview synthesis
KR20090062440A (en) Multi-view matching method and device using foreground/background separation
CN113888458A (en) Method and system for object detection
CN104331890B (en) A kind of global disparity method of estimation and system
US20150035828A1 (en) Method for processing a current image of an image sequence, and corresponding computer program and processing device
CA2488738C (en) Improved conversion and encoding techniques
CN115965961B (en) Local-global multi-mode fusion method, system, equipment and storage medium
CN104408710B (en) Global parallax estimation method and system
US11558625B2 (en) Method for generating residual image of multi-view video and apparatus using the same
CN112215879B (en) Depth extraction method of light field polar plane image
KR20220074044A (en) Stereo Matching Confidence Estimation Apparatus And Method Using Generative Adversarial Network
US8754891B2 (en) Method for image processing and an apparatus thereof
KR20050019124A (en) Improved conversion and encoding techniques
CN118229794A (en) Depth-based three-dimensional target detection method, device and system and vehicle
CN117333524A (en) Three-dimensional target detection method, device and equipment

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20050105

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL LT LV MK

RIN1 Information on inventor provided before grant (corrected)

Inventor name: HARMAN, PHILIP, VICTOR

Inventor name: FLACK, JULIEN, CHARLES

Inventor name: FOX, SIMON, RICHARD

DAX Request for extension of the european patent (deleted)
A4 Supplementary search report drawn up and despatched

Effective date: 20080229

17Q First examination report despatched

Effective date: 20090513

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Effective date: 20170530