GB2537142A - An arrangement for image segmentation - Google Patents

An arrangement for image segmentation

Info

Publication number
GB2537142A
GB2537142A (application number GB1506015.5A; related identifiers GB201506015A, GB 2537142 A)
Authority
GB
United Kingdom
Prior art keywords
image
foreground object
bounding box
depth
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1506015.5A
Other versions
GB201506015D0 (en)
Inventor
Wang Tinghuai
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to GB1506015.5A priority Critical patent/GB2537142A/en
Publication of GB201506015D0 publication Critical patent/GB201506015D0/en
Publication of GB2537142A publication Critical patent/GB2537142A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/162Segmentation; Edge detection involving graph-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20092Interactive image processing based on input by user
    • G06T2207/20104Interactive definition of region of interest [ROI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

At least one image and a depth map of a scene are initially provided (200, 202). A user command is detected on a display indicating a region of the image as a foreground object (204). A coarse segmentation is performed (206) between the foreground object and a background region of the image, and edge detection is performed (208) on the depth map for identifying boundaries of a depth plane of the foreground object. A plurality of mutually nested bounding boxes (fig. 8) is provided (210) around the foreground object, a boundary of the outermost box extending to the image background region. A fine segmentation between the foreground object and the background region of the image is then performed (212), wherein training examples of the foreground object are taken from the coarse segmentation and training examples of the background region are taken from within the boundary of the outermost bounding box. The coarse segmentation of the foreground object with its initial bounding box and an edge-detected depth map of the foreground object are used in providing the nested boxes around the foreground object.

Description

An arrangement for image segmentation
Field of the invention
The present invention relates to image processing, and more particularly to a process of image segmentation.
Background
Stereoscopy and the display of three-dimensional (3D) content have been a major research area for decades. Consequently, various stereoscopic displays of different sizes have been proposed and implemented. Large 3D displays have already gained popularity, and mobile 3D displays are expected to become popular soon.
Interactive image segmentation is becoming increasingly popular to facilitate spatially localized media manipulation. Therein, prior knowledge about the desired object and background can be easily defined with simple user interactions, such as marking of object boundaries, placing a bounding box around the foreground object, and/or loosely drawing scribbles on foreground/background regions. Regardless of the interaction modality, the goal of any interactive image segmentation is to minimize the amount of effort needed to cut out a desired object while accurately selecting the objects of interest.
Despite the significant advances delivered in recent years, some open issues prevent interactive image segmentation methods from being widely used by mobile device end-users. First, the user input needed to indicate both the foreground and background objects may be too troublesome on compact touch screens.
The need to switch between foreground and background scribbles also complicates the UI. Second, it may be cumbersome for end-users to perform fine-tuning to correct mis-segmentations, especially noisy boundaries and disjoint regions, which may severely affect the quality of the target applications.
Summary
Now there has been invented an improved method and technical equipment implementing the method, by which the above problems are at least alleviated. Various aspects of the invention include a method, an apparatus and a computer program, which are characterized by what is stated in the independent claims.
Various embodiments of the invention are disclosed in the dependent claims.
According to a first aspect, a method according to the invention is based on the idea of providing at least one image of a scene; providing a depth map associated with said at least one image; detecting a user command on a display showing said at least one image, the user command indicating a region of the at least one image as a foreground object; performing a coarse segmentation between the foreground object and a background region of the image; performing edge detection on the depth map for identifying boundaries of a depth plane of the foreground object; providing a plurality of mutually nested bounding boxes around the foreground object, a boundary of the outermost bounding box extending to the background region of the image; and performing a fine segmentation between the foreground object and the background region of the image, wherein training examples of the foreground object are taken from the coarse segmentation and training examples of the background region are taken from within the boundary of the outermost bounding box.
According to an embodiment, the training examples of the background region are taken from a region between a second innermost bounding box and the outermost bounding box.
According to an embodiment, the method further comprises: providing the coarse segmentation of the foreground object with an initial bounding box approximately around the boundaries of the foreground object.
According to an embodiment, the method further comprises: using the coarse segmentation of the foreground object with its initial bounding box and an edge detected depth map of the foreground object in said providing the plurality of mutually nested bounding boxes around the foreground object.
According to an embodiment, the method further comprises: extending the initial bounding box to an extended first bounding box along a vertical direction, until the first bounding box does not intersect with any edges of the foreground object.
According to an embodiment, any subsequent extended bounding box is determined on the basis of the extended first bounding box.
According to an embodiment, modeling and segmentation of the fine segmentation is carried out iteratively with fixed areas of all the bounding boxes.
According to an embodiment, the method further comprises: computing a histogram of depth values; identifying bins of the histogram in which the initial depth surface of the foreground object belongs; and determining the bins of the histogram in which presumed background depth surfaces belong.
According to an embodiment, color and depth features of the detected depth surfaces are applied in a Gaussian Mixture Model (GMM).
According to an embodiment, the method further comprises: converting an original color space of the image to CIELAB color space; and merging the CIELAB color space of the image with the depth map of the image.
According to a second aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to at least: provide at least one image of a scene; provide a depth map associated with said at least one image; detect a user command on a display showing said at least one image, the user command indicating a region of the at least one image as a foreground object; perform a coarse segmentation between the foreground object and a background region of the image; perform edge detection on the depth map for identifying boundaries of a depth plane of the foreground object; provide a plurality of mutually nested bounding boxes around the foreground object, a boundary of the outermost bounding box extending to the background region of the image; and perform a fine segmentation between the foreground object and the background region of the image, wherein training examples of the foreground object are taken from the coarse segmentation and training examples of the background region are taken from within the boundary of the outermost bounding box.
According to a third aspect, there is provided a computer readable storage medium stored with code thereon for use by an apparatus, which when executed by a processor, causes the apparatus to perform: providing at least one image of a scene; providing a depth map associated with said at least one image; detecting a user command on a display showing said at least one image, the user command indicating a region of the at least one image as a foreground object; performing a coarse segmentation between the foreground object and a background region of the image; performing edge detection on the depth map for identifying boundaries of a depth plane of the foreground object; providing a plurality of mutually nested bounding boxes around the foreground object, a boundary of the outermost bounding box extending to the background region of the image; and performing a fine segmentation between the foreground object and the background region of the image, wherein training examples of the foreground object are taken from the coarse segmentation and training examples of the background region are taken from within the boundary of the outermost bounding box.
These and other aspects of the invention and the embodiments related thereto will become apparent in view of the detailed disclosure of the embodiments further below.
List of drawings
In the following, various embodiments of the invention will be described in more detail with reference to the appended drawings, in which
Fig. 1 shows a simplified example of a camera system suitable to be used in the embodiments;
Fig. 2 shows a flow chart of an image segmentation method according to an embodiment of the invention;
Fig. 3 shows a schematic diagram of a pipeline of the segmentation process according to an embodiment;
Fig. 4 illustrates a histogram of depth values;
Fig. 5a illustrates an example of an image used as input in a segmentation process according to an embodiment;
Fig. 5b shows a disparity map corresponding to Figure 5a;
Fig. 5c shows initial depth surfaces of the selected object and the background of Figure 5a;
Fig. 6 shows an example of a coarse segmentation of the selected object in Figures 5a-5c;
Fig. 7 shows an example of depth-based contours derived from the depth map of Figure 5b;
Fig. 8 shows an example of three mutually nested bounding boxes drawn around the selected object of Figure 5c;
Fig. 9 shows the result of the fine segmentation process according to an embodiment carried out on Figure 5a;
Fig. 10 shows a simplified 2D model of a stereoscopic camera setup; and
Figs. 11a, 11b show an example of a TOF-based depth estimation system.
Description of embodiments
Figs. 1a and 1b show a system and devices suitable to be used in image segmentation according to an embodiment. In Fig. 1a, the different devices may be connected via a fixed network 210, such as the Internet or a local area network, or a mobile communication network 220, such as the Global System for Mobile communications (GSM) network, 3rd Generation (3G) network, 3.5th Generation (3.5G) network, 4th Generation (4G) network, Wireless Local Area Network (WLAN), Bluetooth®, or other contemporary and future networks. Different networks are connected to each other by means of a communication interface 280. The networks comprise network elements such as routers and switches to handle data, and communication interfaces such as the base stations 230 and 231 in order to provide access for the different devices to the network; the base stations 230, 231 are themselves connected to the mobile network 220 via a fixed connection 276 or a wireless connection 277.
There may be a number of servers connected to the network; in the example of Fig. 1a, servers 240, 241 and 242 are shown, each connected to the mobile network 220. Some of the above devices, for example the computers 240, 241, 242, may be arranged to make up a connection to the Internet with the communication elements residing in the fixed network 210.
There are also a number of end-user devices such as mobile phones and smart phones 251, Internet access devices, for example Internet tablet computers 250, personal computers 260 of various sizes and formats, televisions and other viewing devices 261, video decoders and players 262, as well as video cameras 263 and other encoders. These devices 250, 251, 260, 261, 262 and 263 can also be made of multiple parts. The various devices may be connected to the networks 210 and 220 via communication connections such as a fixed connection 270, 271, 272 and 280 to the internet, a wireless connection 273 to the internet 210, a fixed connection 275 to the mobile network 220, and a wireless connection 278, 279 and 282 to the mobile network 220. The connections 271-282 are implemented by means of communication interfaces at the respective ends of the communication connection.
Fig. 1b shows devices for the image segmentation according to an example embodiment. As shown in Fig. 1b, the server 240 contains memory 245, one or more processors 246, 247, and computer program code 248 residing in the memory 245. The different servers 241, 242, 290 may contain at least these elements for employing functionality relevant to each server.
Similarly, the end-user device 251 contains memory 252, at least one processor 253 and 256, and computer program code 254 residing in the memory 252 for implementing, for example, the image segmentation process. The end-user device may also have one or more cameras 255 and 259 for capturing image data, for example stereo video. The end-user device may also contain one, two or more microphones 257 and 258 for capturing sound.
The end-user devices may also comprise a screen for viewing single-view, stereoscopic (2-view), or multiview (more-than-2-view) images. The end-user devices may also be connected to video glasses 290, e.g. by means of a communication block 293 able to receive and/or transmit information. The glasses may contain separate eye elements 291 and 292 for the left and right eye. These eye elements may either show a picture for viewing, or they may comprise a shutter functionality, e.g. to block every other picture in an alternating manner to provide the two views of a three-dimensional picture to the eyes, or they may comprise orthogonal polarization filters (compared to each other), which, when combined with similar polarization realized on the screen, provide the separate views to the eyes. Other arrangements for video glasses may also be used to provide stereoscopic viewing capability. Stereoscopic or multiview screens may also be autostereoscopic, i.e. the screen may comprise or may be overlaid by an optics arrangement which results in a different view being perceived by each eye. Single-view, stereoscopic, and multiview screens may also be operationally connected to viewer tracking in such a manner that the displayed views depend on the viewer's position, distance, and/or direction of gaze relative to the screen.
In addition to applications of cutting out objects of interest, the various embodiments could be used in different applications, such as image editing or converting 2D images to 3D images.
It needs to be understood that various embodiments allow different parts to be carried out in different elements. For example, various processes of image segmentation may be carried out in one or more processing devices; for example, entirely in one user device like 250, 251 or 260, or in one server device 240, 241, 242 or 290, or across multiple user devices 250, 251, 260 or across multiple network devices 240, 241, 242, 290, or across both user devices 250, 251, 260 and network devices 240, 241, 242, 290. The elements of the image segmentation process may be implemented as a software component residing on one device or distributed across several devices, as mentioned above, for example so that the devices form a cloud.
Stereoscopic image content consists of pairs of offset images that are shown separately to the left and right eye of the viewer. These offset images may be captured with a specific stereoscopic camera setup assuming a particular stereo baseline distance between cameras.
Figure 10 shows a simplified 2D model of such stereoscopic camera setup. In Figure 10, C1 and C2 refer to cameras of the stereoscopic camera setup, more particularly to the center locations of the cameras, b is the distance between the centers of the two cameras (i.e. the stereo baseline), f is the focal length of cameras and X is an object in the real 3D scene that is being captured. The real world object X is projected to different locations in images captured by the cameras C1 and C2, these locations being x1 and x2 respectively. The horizontal distance between x1 and x2 in absolute coordinates of the image is called disparity. The images that are captured by the camera setup are called stereoscopic images, and the disparity presented in these images creates or enhances the illusion of depth. For enabling the images to be shown separately to the left and right eye of the viewer, typically specific 3D glasses are required to be used by the viewer. Adaptation of the disparity is a key feature for adjusting the stereoscopic image content to be comfortably viewable on various displays.
A depth estimation algorithm takes a stereoscopic view as an input and computes local disparities between the two offset images of the view. Each image is processed pixel by pixel in overlapping blocks, and for each block of pixels a horizontally localized search for a matching block in the offset image is performed. Once a pixel-wise disparity is computed, the corresponding depth value z is calculated by equation (1):

z = (f · b) / (d + Δd)    (1)

where f is the focal length of the camera and b is the baseline distance between cameras, as shown in Figure 10. Further, d refers to the disparity observed between the two cameras, and the camera offset Δd reflects a possible horizontal misplacement of the optical centers of the two cameras.
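As a minimal illustrative sketch (not part of the patent text), equation (1) can be applied to a whole disparity map in Python with NumPy; the function name, array layout and the example camera parameters below are assumptions made only for the illustration.

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m, camera_offset_px=0.0):
    """Convert a disparity map to a depth map via z = f*b / (d + delta_d).

    disparity        -- per-pixel disparity in pixels (2D array)
    focal_length_px  -- focal length f expressed in pixels
    baseline_m       -- stereo baseline b in metres
    camera_offset_px -- horizontal misalignment delta_d of the optical centres
    """
    d = disparity.astype(np.float64) + camera_offset_px
    depth = np.full_like(d, np.inf)      # zero or negative disparity -> treat as infinitely far
    valid = d > 0
    depth[valid] = (focal_length_px * baseline_m) / d[valid]
    return depth

# Hypothetical usage with assumed camera parameters:
# depth_m = disparity_to_depth(disparity_map, focal_length_px=1200.0, baseline_m=0.065)
```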
Alternatively, or in addition to the above-described stereo view depth estimation, the depth value may be obtained using the time-of-flight (TOF) principle. Figures 11a and 11b show an example of a TOF-based depth estimation system utilizing only one camera. The camera is provided with a light source, for example an infrared emitter, for illuminating the scene. Such an illuminator is arranged to produce an intensity-modulated electromagnetic emission at a frequency between 10 and 100 MHz, which typically requires LEDs or laser diodes to be used. Infrared light is typically used to make the illumination unobtrusive. The light reflected from objects in the scene is detected by an image sensor, which is modulated synchronously at the same frequency as the illuminator. The image sensor is provided with optics: a lens gathering the reflected light and an optical bandpass filter passing only light with the same wavelength as the illuminator, thus helping to suppress background light. The image sensor measures for each pixel the time the light has taken to travel from the illuminator to the object and back. The distance to the object is represented as a phase shift in the illumination modulation, which can be determined from the sampled data simultaneously for each pixel in the scene.
An aspect relates to a method for user-friendly image segmentation usable in data processing devices provided with depth information regarding the image scene, especially in dual camera touch screen devices, which method enables a foreground object to be effectively cut out from the background. In the method, a user of the touch screen device is prompted to select the foreground object by a single user command, such as a single finger tap, on the desired object through the touch screen.
The method according to the aspect is illustrated in Figure 2. In the method, at least one image of a scene is provided (200) in a data processing device. Further, a depth map associated with said at least one image of the scene is provided (202) in the data processing device.
As described above, the data processing device may be provided with only one camera for capturing the image of the scene. Such a data processing device may be provided with means, such as a TOF sensor or other kind of ranging sensor and computer program code stored in a memory, for determining the depth map associated with said at least one image of the scene. The data processing device may also be provided with two or more cameras for capturing the first image and the second image of the scene. Such a data processing device may be provided with means, such as computer program code stored in a memory, for determining the depth map on the basis of the first image and the second image of the scene. However, it is also possible that said at least one image, or the first image and the second image of the scene, and/or the depth map are transmitted to the data processing device, or the data processing device may retrieve the first image and the second image of the scene and/or the depth map from a data storage.
Further in the method, a user command is detected (204) on a display showing said at least one image of the scene or at least one of said first and second images, the user command indicating a region of the at least one image as a foreground object. Thus, at least one of said two images is selected as a basis for the object segmentation process. The user may indicate an object in the selected image by submitting a single user command, such as a tap by a finger or a stylus on a touch screen display, or a click by a mouse or other pointing means on the display area, if a conventional display without a touch input feature is used. In terms of convenience, a single finger tap may be considered the most user-friendly interaction for object segmentation. The data processing device may be arranged to determine, on the basis of the user command such as a finger tap of the user, coordinates of a point of the screen, wherein the determined x-y coordinates define a pixel or a group of pixels of the foreground object.
Once the foreground object has been selected by the user, a coarse segmentation (206) is performed between the foreground object and a background region of the image. Herein, the object of interest, i.e. the selected foreground object, can be coarsely segmented mainly on the basis of depth information. As there is only very limited prior knowledge about the object, i.e. the single user input coordinate, depth information may assist in locating the object spatially. Using mainly the depth information, and possibly some further information from the detected depth surfaces, such as color, the coarse segmentation can be kept computationally lightweight and straightforward. The coarse segmentation is intended to capture some characterizing parts of the foreground object, such as rough information about its appearance, location and depth. However, many details of the foreground object are typically still missing after the coarse segmentation.
Using the results of the coarse segmentation, edge detection is performed (208) on the depth map for identifying boundaries of a depth plane of the foreground object. The depth map does not typically identify the depth planes. For example, the background region of the image may usually contain areas with gradually changing depth values. By running edge detection on the depth map, the noisy and gradually changing individual depth values are ignored, while discontinuities occurring along the boundaries between neighbouring depth planes are identified. As a result of the edge detection, the coarsely segmented foreground object is provided with more detailed and accurate boundaries at its depth plane.
The edge detection may especially enhance the segmentation of the coarsely segmented foreground object in the vertical direction.
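The patent does not name a particular edge detector; as one possible, hedged illustration, depth-plane boundaries could be extracted with a standard detector such as Canny applied to a normalised depth map. The sketch below assumes OpenCV, and the threshold values are arbitrary example settings.

```python
import cv2
import numpy as np

def depth_edge_map(depth_map, low_thresh=30, high_thresh=90):
    """Detect depth-plane boundaries as intensity edges in the depth map.

    The depth map is normalised to 8-bit and passed through the Canny
    detector; gradually varying depth values produce weak gradients and are
    suppressed, while discontinuities between neighbouring depth planes
    survive as contours (cf. the edge map of Fig. 7).
    """
    d = depth_map.astype(np.float32)
    d8 = cv2.normalize(d, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.Canny(d8, low_thresh, high_thresh)
```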
Having obtained the estimated depth planes, a plurality of mutually nested bounding boxes is provided (210) around the foreground object, a boundary of the outermost bounding box extending to the background region of the image. Thus, at least two, preferably three, bounding boxes are provided around the foreground object, for example such that the innermost bounding box covers little more than the foreground object, the second innermost bounding box also covers the area into which the foreground object may extend, and the third innermost (i.e. the outermost) bounding box also covers area that most probably belongs to the background. If only two bounding boxes are used, the whole image area may be considered to represent the outermost bounding box.
Now, a fine segmentation is performed (212) between the foreground object and the background region of the image, wherein training examples of the foreground object are taken from the coarse segmentation and training examples of the background region are taken from within the boundary of the outermost bounding box. According to an embodiment, the training examples of the background region may be taken from the region between the second innermost bounding box and the outermost bounding box. Thus, the final segmentation is performed within the outermost bounding box, thereby neglecting the image area outside the outermost bounding box as presumably being background, which substantially enhances the efficiency and the speed of the segmentation process.
According to an embodiment, the above method may be divided into a plurality of sub-processes comprising at least pre-processing, coarse segmentation and fine segmentation. According to an embodiment, each of these sub-processes may comprise one or more operation steps.
The above method and various embodiments related thereto are now described more in detail by referring to the schematic diagram of Figure 3, which discloses a pipeline of the segmentation process according to an embodiment. At the core of the pipeline reside the preprocessing engine 300, coarse segmentation engine 302 (which may also be referred to as depth segmentation engine), and the fine segmentation engine 304 (which may also be referred to as color/depth segmentation engine).
The preprocessing engine 300 may take a pair of color images 306 of a common scene, for example from a dual camera device, as input and perform stereo matching 308 as the first step to estimate the depth information in the scene. The output of the stereo matching process is a disparity/depth map 310.
In this document, the terms "disparity" and "depth" are used interchangeably referring to the depth information.
The segmentation process operates on one of the color images with the corresponding depth map. Each point value in a digital representation of a 2D image is specified in a color space, the commonly used color spaces including YUV (a luminance and two chrominance color components), RGB (red, green and blue color components) and CMYK (cyan, magenta, yellow and key, i.e. black, color components). The original color space of the image is converted to CIELAB color space. CIELAB is the most complete color space specified by the Commission Internationale de l'Éclairage, CIE (International Commission on Illumination). It describes all the colors visible to the human eye and serves as a device-independent model to be used as a reference. LAB refers to a color-opponent space with dimension or channel 'L' for lightness and 'a' and 'b' for the color-opponent dimensions or channels, based on nonlinearly compressed coordinates.
The preprocessing engine 300 then merges the CIELAB color space image with the depth map which is treated as the fourth channel. The resulting 4-channel image 312 is denoted as CIELAB-D.
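A minimal sketch of this preprocessing step, assuming OpenCV for the color conversion, an 8-bit BGR input image and a depth map already aligned with the color image; the function and variable names are illustrative only.

```python
import cv2
import numpy as np

def build_cielab_d(bgr_image, depth_map):
    """Convert a colour image to CIELAB and stack the depth map as a 4th channel,
    giving the H x W x 4 "CIELAB-D" representation described above."""
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB).astype(np.float32)
    depth = depth_map.astype(np.float32)[..., np.newaxis]
    return np.concatenate([lab, depth], axis=2)
```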
The coarse segmentation engine 302 is initiated by obtaining coordinates from a user input on the display. As the depth values vary even on the same depth surface, the specific depth surface selected by the user input cannot be determined on the basis of the coordinates alone. In order to get a more coherent depth map, a mean-shift algorithm can be performed to extract the dominant depth mode 316 of the depth values. The depth mode selected by the user input coordinate is chosen as the initial object depth surface for the dominant depth mode extraction process.
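One hedged way to realise the dominant depth mode extraction is to run mean shift on sub-sampled depth values and pick the mode that contains the tapped coordinate. The use of scikit-learn, the bandwidth and the sub-sampling factor below are assumptions for the sketch, not part of the patent.

```python
import numpy as np
from sklearn.cluster import MeanShift

def dominant_depth_mode(depth_map, tap_xy, bandwidth=8.0, subsample=20):
    """Cluster depth values with mean shift and return the mode selected by the tap.

    tap_xy    -- (x, y) screen coordinate of the user's single tap
    bandwidth -- mean-shift kernel bandwidth in depth units (assumed value)
    subsample -- fit the clustering on every Nth pixel to keep it lightweight
    """
    samples = depth_map.reshape(-1, 1)[::subsample].astype(np.float64)
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(samples)
    centres = ms.cluster_centers_.ravel()

    # Assign every pixel to the nearest depth mode.
    modes = np.abs(depth_map[..., None] - centres).argmin(axis=-1)

    x, y = tap_xy
    return modes == modes[y, x]      # boolean mask of the tapped depth mode
```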
In the depth-based segmentation 318, the initial depth surfaces of the background, and possibly other foreground objects, are determined. According to an embodiment, a histogram of depth values is computed and the bins in which the initial object depth surface falls are identified. Then the bins in which the presumed background depth surfaces fall are determined.
The utilization of the histogram is illustrated in Figure 4. According to an embodiment, the histogram is split into two sub-histograms, denoted as 'front' and 'behind' with respect to the bin of the initial object depth surface. Typically in depth maps, the depth values are given within a predetermined range, such as 0-255, wherein the depth values closer to 0 typically refer to the background while depth values close to 255 refer to objects closer to the camera. The histogram may be normalized, and starting from the farthest bins of the normalized 'behind' (smaller depth values) sub-histogram, a cumulative distribution function (CDF) may be calculated. All the farthest bins summing up the CDF to a predefined threshold value (e.g. 50%) may be regarded as the background depth surfaces. Similarly, the depth surfaces of possible other foreground objects can also be determined, provided that the number of bins is larger than a predefined threshold H, which is to prevent the case where there is no other foreground object at all. Thus, starting from the farthest bins of the normalized 'front' (larger depth values) sub-histogram, all the farthest bins summing up the CDF to a predefined threshold value may be regarded as other foreground object depth surfaces.
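The histogram-based selection of presumed background depth surfaces could be sketched as follows; the 50% threshold is the example value from the description, while the bin count and the function layout are assumptions.

```python
import numpy as np

def background_depth_bins(depth_map, object_depth, n_bins=64, cdf_threshold=0.5):
    """Split the depth histogram at the object's bin and mark the farthest
    'behind' bins whose cumulative mass reaches the threshold as background.

    Depth values near 0 are assumed to be far away (background) and values
    near 255 close to the camera, as in the description above.
    """
    hist, edges = np.histogram(depth_map, bins=n_bins, range=(0, 255))
    obj_bin = int(np.clip(np.searchsorted(edges, object_depth, side="right") - 1, 0, n_bins))

    behind = hist[:obj_bin].astype(np.float64)      # bins farther away than the object
    bg_bins = np.zeros(n_bins, dtype=bool)
    if behind.sum() == 0:
        return bg_bins
    cdf = np.cumsum(behind / behind.sum())          # accumulated from the farthest bin
    bg_bins[:obj_bin] = cdf <= cdf_threshold        # e.g. 50 % of the 'behind' mass
    return bg_bins                                  # True for presumed background bins
```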
The above process steps may be illustrated by the example shown in Figures 5a-5c. Figure 5a shows one of the pair of color images 306 used as input images of the segmentation process. Figure 5b shows a disparity map corresponding to Figure 5a. In this example, it is presumed that the user has selected the object represented by the man standing rightwards and behind the woman in the image. The selection has been carried out e.g. by a single finger tap on the man's chest area in the image. Figure 5c shows the initial depth surfaces of the selected object and the background determined on the basis of histogram bins of the depth values. The gray area on the man's chest indicates the initial depth surface of the object of interest, and the white area indicates background, whereas black areas are still unknown. It is noted that even though the image comprises at least one other foreground object (the woman), its initial depth surface has not been determined, and therefore the other foreground object is shown as white.
Referring back to Figure 3, an initial object mask 320 for the selected object is determined. Herein, the feature distributions of both the initially detected object and the background are modelled. The feature distribution can be modelled using a number of estimation models, for example Gaussian kernel estimation or a Gaussian Mixture Model (GMM). Various features from the detected depth surfaces can be modelled, such as color, depth and texture.
According to an embodiment, color and depth features of the detected depth surfaces are applied in a Gaussian Mixture Model (GMM). For example, corresponding pixel values in CIELAB color space and depth values are taken from the initial object and background depth surfaces, and GMMs are used to represent the 4D (color+depth) features.
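A possible sketch of the 4D (color + depth) feature modeling, assuming scikit-learn's GaussianMixture as one concrete GMM implementation; the component count and names are illustrative only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_fg_bg_gmms(cielab_d, fg_mask, bg_mask, n_components=5):
    """Fit separate GMMs to the 4D (L, a, b, depth) features of the initial
    object and background depth surfaces."""
    features = cielab_d.reshape(-1, 4)
    gmm_fg = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm_bg = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm_fg.fit(features[fg_mask.ravel()])
    gmm_bg.fit(features[bg_mask.ravel()])
    return gmm_fg, gmm_bg

def per_pixel_log_likelihoods(cielab_d, gmm_fg, gmm_bg):
    """Per-pixel log-likelihood maps, later used as unary terms in the graph cut."""
    features = cielab_d.reshape(-1, 4)
    h, w = cielab_d.shape[:2]
    return (gmm_fg.score_samples(features).reshape(h, w),
            gmm_bg.score_samples(features).reshape(h, w))
```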
In various graph cut methods, a so-called Gibbs energy function E is defined so that its minimum should correspond to a good segmentation, the energy function being guided both by the observed foreground and background color GMMs and by the similarity of pixel-wise GMM components. The graph cut process may be carried out iteratively, each iteration step minimising the total energy E and finally converging at least to a local minimum of E. According to an embodiment, the coarse segmentation is formulated as a pixel-labelling problem of assigning each pixel a value 0 or 1, representing background or the object of interest respectively. The energy function to be minimized to achieve the optimal labelling takes a form similar to a regular graph cut energy:

E(x) = Σ_i D_i(x_i) + α Σ_i Σ_{j ∈ N_i} V_{i,j}(x_i, x_j)    (2)

where N_i is the set of pixels neighboring pixel i in the graph and α is a weight parameter. The unary term D_i(x_i) defines the cost of assigning label x_i (0 or 1) to pixel i, which is defined based on the per-pixel probability maps computed by modeling the feature distribution:

D_i(x_i) = −log( p(c_i | x_i) )    (3)

where p(c_i | x_i) represents the probability of observing pixel c_i (its 4D feature) given label x_i, based on the feature distribution. The pairwise term is defined as:

V_{i,j}(x_i, x_j) = [x_i ≠ x_j] · exp(−β ‖c_i − c_j‖²)    (4)

where [·] denotes the indicator function taking the value 1 (if true) or 0 (otherwise), ‖c_i − c_j‖² is the squared Euclidean distance between two adjacent pixels in the proposed 4D feature space, and β is set on the basis of ⟨‖c_i − c_j‖²⟩, with ⟨·⟩ denoting the expectation or average over adjacent pixel pairs.
A graph cut optimization is performed to obtain the coarse segmentation. The graph cut may be carried out according to any known graph cut method, such as the one disclosed in "Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images" by Boykov, Y., Jolly, M.-P., in Proc. ICCV, pp. 105-112, 2001.
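For illustration only, an energy of the form of equations (2)-(4) could be minimised with an off-the-shelf max-flow/min-cut solver; the sketch below assumes the PyMaxflow package, a 4-connected grid, the log-likelihood maps from the earlier GMM sketch, and an arbitrary example weight alpha. None of these choices is prescribed by the patent.

```python
import numpy as np
import maxflow  # PyMaxflow; any max-flow/min-cut solver could be substituted

def graph_cut_segmentation(cielab_d, fg_loglik, bg_loglik, alpha=10.0):
    """Binary labelling: GMM log-likelihoods as unary terms, contrast-sensitive
    pairwise terms exp(-beta * ||c_i - c_j||^2) over the 4D feature space."""
    h, w = fg_loglik.shape
    feats = cielab_d.astype(np.float64)

    # Squared feature differences to the right and downward neighbours.
    diff_r = np.sum((feats[:, 1:] - feats[:, :-1]) ** 2, axis=-1)
    diff_d = np.sum((feats[1:, :] - feats[:-1, :]) ** 2, axis=-1)
    beta = 1.0 / (np.mean(np.concatenate([diff_r.ravel(), diff_d.ravel()])) + 1e-9)
    w_r = alpha * np.exp(-beta * diff_r)
    w_d = alpha * np.exp(-beta * diff_d)

    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes((h, w))
    # Pairwise edges: pad the weight maps to full size; out-of-grid neighbours are skipped.
    g.add_grid_edges(nodes, weights=np.pad(w_r, ((0, 0), (0, 1)), mode="constant"),
                     structure=np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]]), symmetric=True)
    g.add_grid_edges(nodes, weights=np.pad(w_d, ((0, 1), (0, 0)), mode="constant"),
                     structure=np.array([[0, 0, 0], [0, 0, 0], [0, 1, 0]]), symmetric=True)

    # Unary terms -log p(c_i | label); shift both equally so capacities stay non-negative.
    d_fg, d_bg = -fg_loglik, -bg_loglik
    shift = min(d_fg.min(), d_bg.min())
    g.add_grid_tedges(nodes, d_fg - shift, d_bg - shift)

    g.maxflow()
    return g.get_grid_segments(nodes)   # True = sink side = foreground under this setup
```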
The coarse segmentation is intended to capture the essential parts of the target object, even though many details are still missing. Coarse segmentation gives more information about the appearance, location and depth of the target object.
Estimating the definite background is the key to refining the segmentation, which is normally achieved using a bounding box around the target object. Thus, the coarse segmentation of the selected object may then be provided with an initial bounding box 322 around the object. Figure 6 shows an example of a coarse segmentation of the selected object in Figures 5a -5c. The coarse segmentation of the selected object is provided with an initial bounding box approximately around the boundaries of the object.
As shown in Figure 6, using the bounding box of the coarse segmentation normally causes erroneous estimation, especially along the vertical direction.
The challenge lies in extending the initial bounding box properly to cover the object depth plane, which could be eased if the depth planes are already known. Detecting the depth plane based on smoothly varying depth values is still an open problem, especially due to the homogeneity of the depth estimates of the object and the adjacent ground plane.
To obtain means for addressing these challenges, depth-based contours are determined by applying edge detection on the depth map. In Figure 3, this process 314 is depicted as part of the preprocessing. However, for the implementation of the method it is irrelevant at which preceding stage of the pipeline the depth-based contours are determined.
A depth map, as such, does not give exact information about the depth planes. For example, the ground plane in Figure 5a contains gradually changing depth values. By running edge detection on the depth map, discontinuities occurring along the boundaries between neighboring depth planes are identified and the gradually changing depth values may be neglected. Figure 7 shows an example of depth-based contours (a.k.a. an edge map) derived from the depth map of Figure 5b.
The coarse segmentation of the selected object with its initial bounding box and the depth-based contour of the selected object are taken as input to a process of refining bounding boxes using contour information 324. Therein, the initial bounding box can be extended to encompass the whole object depth plane.
According to an embodiment, the extension of the initial bounding box to an extended first bounding box is carried out along the vertical directions (i.e. upwards and/or downwards), until the bounding box no longer intersects any edges.
According to an embodiment, further extended bounding boxes are determined on the basis of the extended first bounding box. Let us denote this extended first bounding box as BBX1; then the first bounding box BBX1 is extended at least to a second bounding box BBX2, and preferably the second bounding box BBX2 is further extended to a third bounding box BBX3, each extension being carried out e.g. by a certain number of pixels. The extension may be carried out in the vertical and/or horizontal direction.
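A hedged sketch of this bounding box refinement: the initial box is grown vertically until its top and bottom edges no longer cross any depth edges, and BBX2 and BBX3 are obtained by padding with a fixed margin. The box representation and the margin parameter are assumptions for the example.

```python
import numpy as np

def extend_bbox_vertically(bbox, edge_map):
    """Grow the initial bounding box upwards and downwards until its top and
    bottom rows no longer cross any depth edges (yielding BBX1).

    bbox     -- (top, bottom, left, right) in pixel coordinates, inclusive
    edge_map -- binary depth-edge image (non-zero on depth-plane boundaries)
    """
    top, bottom, left, right = bbox
    h = edge_map.shape[0]
    while top > 0 and edge_map[top, left:right + 1].any():
        top -= 1
    while bottom < h - 1 and edge_map[bottom, left:right + 1].any():
        bottom += 1
    return top, bottom, left, right

def nest_bboxes(bbx1, margin, image_shape):
    """Derive BBX2 and BBX3 by padding BBX1 with a fixed number of pixels,
    clamped to the image borders (the margin value is an assumption)."""
    def pad(box, m):
        t, b, l, r = box
        return (max(t - m, 0), min(b + m, image_shape[0] - 1),
                max(l - m, 0), min(r + m, image_shape[1] - 1))
    bbx2 = pad(bbx1, margin)
    bbx3 = pad(bbx2, margin)
    return bbx2, bbx3
```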
Figure 8 shows an example of three mutually nested bounding boxes drawn around the selected object of Figure 5c. The first bounding box BBX1 indicates the range of the known hard evidence of the target object; the area between the first bounding box BBX1 and the second bounding box BBX2 indicates the possible range into which the target object might outreach; the area between the second bounding box BBX2 and the third bounding box BBX3 indicates the definite background. This set of bounding boxes redefines the area where the modeling and segmentation are to be refined.
The modeling and segmentation process described in step 320 of Figure 3 is then repeated in the GMM (326) and graph cut (328) processes of the fine segmentation sub-process. The feature distribution is modeled using the CIELAB color and depth values.
When considering the example of Figure 8, the training examples of the target object are taken from the coarse segmentation (Figure 6), whilst the training examples of the background are taken from the area between BBX2 and BBX3. The segmentation is performed within BBX3, imposing the rest of the image (outside BBX3) as background, which substantially speeds up the processing.
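As a final illustrative sketch (mask layout and names assumed), the training regions and the region of interest for the fine segmentation could be derived from the coarse mask and the nested boxes as follows.

```python
import numpy as np

def fine_segmentation_training_masks(coarse_mask, bbx2, bbx3, image_shape):
    """Foreground examples come from the coarse segmentation; background
    examples from the ring between BBX2 and BBX3; pixels outside BBX3 are
    imposed as background and excluded from the optimisation."""
    def box_mask(box):
        m = np.zeros(image_shape[:2], dtype=bool)
        t, b, l, r = box
        m[t:b + 1, l:r + 1] = True
        return m

    fg_train = coarse_mask.astype(bool)
    bg_train = box_mask(bbx3) & ~box_mask(bbx2)
    roi = box_mask(bbx3)            # the fine graph cut runs only inside BBX3
    return fg_train, bg_train, roi
```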
The result of the fine segmentation process is shown in Figure 9.
According to an embodiment, the modeling and segmentation of the fine segmentation sub-process is carried out iteratively with fixed areas of all the bounding boxes, i.e. the segmentation from the previous iteration is used to improve the modeling of the target object in the current iteration. The object area inside BBX1 is used for modeling the object even though the refined segmentation may also reach the area between BBX1 and BBX2.
A skilled man appreciates that any of the embodiments described above may be implemented as a combination with one or more of the other embodiments, unless there is explicitly or implicitly stated that certain embodiments are only alternatives to each other.
The various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. For example, a terminal device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the terminal device to carry out the features of an embodiment. The various devices may be or may comprise encoders, decoders and transcoders, packetizers and depacketizers, and transmitters and receivers.
The various embodiments may provide advantages over the state of the art. The embodiments effectively utilize depth information available, for example, in a dual camera device to resolve the visual ambiguities posed in 2D imagery and to cut out the object of interest with one single finger tap. With a minimum amount of user interaction, an accurate and pleasant-looking object segmentation may be achieved. From the usability point of view, the overall process is intuitive for the user. The various embodiments provide an efficient approach to estimating the depth planes for parsing the noisy depth map from stereo matching. By estimating a set of bounding boxes, the modeling and segmentation process can be made significantly more efficient.
It is obvious that the present invention is not limited solely to the above-presented embodiments, but it can be modified within the scope of the appended claims.

Claims (22)

  1. A method comprising: providing at least one image of a scene; providing a depth map associated with said at least one image; detecting a user command on a display showing said at least one image, the user command indicating a region of the at least one image as a foreground object; performing a coarse segmentation between the foreground object and a background region of the image; performing edge detection on the depth map for identifying boundaries of a depth plane of the foreground object; providing a plurality of mutually nested bounding boxes around the foreground object, a boundary of the outermost bounding box extending to the background region of the image; and performing a fine segmentation between the foreground object and the background region of the image, wherein training examples of the foreground object are taken from the coarse segmentation and training examples of the background region are taken from within the boundary of the outermost bounding box.
  2. The method according to claim 1, wherein the training examples of the background region are taken from a region between a second innermost bounding box and the outermost bounding box.
  3. The method according to claim 1 or 2, the method further comprising: providing the coarse segmentation of the foreground object with an initial bounding box approximately around the boundaries of the foreground object.
  4. The method according to claim 3, the method further comprising: using the coarse segmentation of the foreground object with its initial bounding box and an edge detected depth map of the foreground object in said providing the plurality of mutually nested bounding boxes around the foreground object.
  5. The method according to claim 4, the method further comprising: extending the initial bounding box to an extended first bounding box along a vertical direction, until the first bounding box does not intersect with any edges of the foreground object.
  6. The method according to claim 5, wherein any subsequent extended bounding box is determined on the basis of the extended first bounding box.
  7. The method according to any preceding claim, wherein modeling and segmentation of the fine segmentation is carried out iteratively with fixed areas of all the bounding boxes.
  8. The method according to any preceding claim, the method further comprising: computing a histogram of depth values; identifying bins of the histogram in which the initial depth surface of the foreground object belongs; and determining the bins of the histogram in which presumed background depth surfaces belong.
  9. The method according to any preceding claim, wherein color and depth features of the detected depth surfaces are applied in a Gaussian Mixture Model (GMM).
  10. The method according to any preceding claim, the method further comprising: converting an original color space of the image to CIELAB color space; and merging the CIELAB color space of the image with the depth map of the image.
  11. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to at least: provide at least one image of a scene; provide a depth map associated with said at least one image; detect a user command on a display showing said at least one image, the user command indicating a region of the at least one image as a foreground object; perform a coarse segmentation between the foreground object and a background region of the image; perform edge detection on the depth map for identifying boundaries of a depth plane of the foreground object; provide a plurality of mutually nested bounding boxes around the foreground object, a boundary of the outermost bounding box extending to the background region of the image; and perform a fine segmentation between the foreground object and the background region of the image, wherein training examples of the foreground object are taken from the coarse segmentation and training examples of the background region are taken from within the boundary of the outermost bounding box.
  12. The apparatus according to claim 11, wherein the apparatus is configured to take the training examples of the background region from a region between a second innermost bounding box and the outermost bounding box.
  13. The apparatus according to claim 11 or 12, further comprising computer program code configured to cause the apparatus to at least: provide the coarse segmentation of the foreground object with an initial bounding box approximately around the boundaries of the foreground object.
  14. The apparatus according to claim 13, further comprising computer program code configured to cause the apparatus to at least: use the coarse segmentation of the foreground object with its initial bounding box and an edge detected depth map of the foreground object in said providing the plurality of mutually nested bounding boxes around the foreground object.
  15. The apparatus according to claim 14, further comprising computer program code configured to cause the apparatus to at least: extend the initial bounding box to an extended first bounding box along a vertical direction, until the first bounding box does not intersect with any edges of the foreground object.
  16. The apparatus according to claim 15, wherein the apparatus is configured to determine any subsequent extended bounding box on the basis of the extended first bounding box.
  17. The apparatus according to any of claims 11-16, wherein the apparatus is configured to carry out modeling and segmentation of the fine segmentation iteratively with fixed areas of all the bounding boxes.
  18. The apparatus according to any of claims 11-17, further comprising computer program code configured to cause the apparatus to at least: compute a histogram of depth values; identify bins of the histogram in which the initial depth surface of the foreground object belongs; and determine the bins of the histogram in which presumed background depth surfaces belong.
  19. The apparatus according to any of claims 11-18, wherein the apparatus is configured to apply color and depth features of the detected depth surfaces in a Gaussian Mixture Model (GMM).
  20. The apparatus according to any of claims 11-19, further comprising computer program code configured to cause the apparatus to at least: convert an original color space of the image to CIELAB color space; and merge the CIELAB color space of the image with the depth map of the image.
  21. The apparatus according to any of claims 11-20, wherein the apparatus comprises two or more cameras and computer program code configured to cause the apparatus to at least: provide a first image and a second image of a scene; and provide a depth map determined on the basis of the first image and the second image.
  22. A computer readable storage medium stored with code thereon for use by an apparatus, which when executed by a processor, causes the apparatus to perform: providing at least one image of a scene; providing a depth map associated with said at least one image; detecting a user command on a display showing said at least one image, the user command indicating a region of the at least one image as a foreground object; performing a coarse segmentation between the foreground object and a background region of the image; performing edge detection on the depth map for identifying boundaries of a depth plane of the foreground object; providing a plurality of mutually nested bounding boxes around the foreground object, a boundary of the outermost bounding box extending to the background region of the image; and performing a fine segmentation between the foreground object and the background region of the image, wherein training examples of the foreground object are taken from the coarse segmentation and training examples of the background region are taken from within the boundary of the outermost bounding box.
GB1506015.5A 2015-04-09 2015-04-09 An arrangement for image segmentation Withdrawn GB2537142A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1506015.5A GB2537142A (en) 2015-04-09 2015-04-09 An arrangement for image segmentation


Publications (2)

Publication Number Publication Date
GB201506015D0 GB201506015D0 (en) 2015-05-27
GB2537142A true GB2537142A (en) 2016-10-12

Family

ID=53333513

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1506015.5A Withdrawn GB2537142A (en) 2015-04-09 2015-04-09 An arrangement for image segmentation

Country Status (1)

Country Link
GB (1) GB2537142A (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113538467A (en) * 2021-08-09 2021-10-22 北京达佳互联信息技术有限公司 Image segmentation method and device and training method and device of image segmentation model


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014125842A1 (en) * 2013-02-18 2014-08-21 株式会社P2P Bank Image processing device, image processing method, image processing computer program, and information recording medium whereupon image processing computer program is stored

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107818585A (en) * 2017-09-27 2018-03-20 歌尔科技有限公司 Determination method and device, projecting apparatus, the optical projection system of user's finger positional information
CN107818585B (en) * 2017-09-27 2020-05-29 歌尔科技有限公司 Method and device for determining finger position information of user, projector and projection system
CN109461185A (en) * 2018-09-10 2019-03-12 西北工业大学 A kind of robot target automatic obstacle avoidance method suitable for complex scene

Also Published As

Publication number Publication date
GB201506015D0 (en) 2015-05-27

Similar Documents

Publication Publication Date Title
CN103250184B (en) Based on the estimation of Depth of global motion
CN105814875B (en) Selecting camera pairs for stereo imaging
JP6158929B2 (en) Image processing apparatus, method, and computer program
US8638329B2 (en) Auto-stereoscopic interpolation
JP5750505B2 (en) 3D image error improving method and apparatus
TWI524734B (en) Method and device for generating a depth map
Guttmann et al. Semi-automatic stereo extraction from video footage
EP3997662A1 (en) Depth-aware photo editing
US8977039B2 (en) Pulling keys from color segmented images
US9378583B2 (en) Apparatus and method for bidirectionally inpainting occlusion area based on predicted volume
EP2755187A2 (en) 3d-animation effect generation method and system
WO2008080156A1 (en) Complexity-adaptive 2d-to-3d video sequence conversion
JP2013527646A5 (en)
JP5755571B2 (en) Virtual viewpoint image generation device, virtual viewpoint image generation method, control program, recording medium, and stereoscopic display device
KR101458986B1 (en) A Real-time Multi-view Image Synthesis Method By Using Kinect
JP2004200973A (en) Apparatus and method of inputting simple stereoscopic image, program, and recording medium
WO2020130070A1 (en) Detecting device, information processing device, detecting method, and information processing program
GB2537142A (en) An arrangement for image segmentation
KR101125061B1 (en) A Method For Transforming 2D Video To 3D Video By Using LDI Method
Fan et al. Vivid-DIBR based 2D–3D image conversion system for 3D display
EP2932710B1 (en) Method and apparatus for segmentation of 3d image data
GB2585197A (en) Method and system for obtaining depth data
US9197874B1 (en) System and method for embedding stereo imagery
WO2011071978A1 (en) Three dimensional image rendering
Kuo et al. 2D-to-3D conversion for single-view image based on camera projection model and dark channel model

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)