CN116324878A - Segmentation for image effects

Segmentation for image effects

Info

Publication number
CN116324878A
Authority
CN
China
Prior art keywords
image
segmentation
segmentation map
map
foreground
Prior art date
Legal status
Pending
Application number
CN202180067005.4A
Other languages
Chinese (zh)
Inventor
C-C·蔡
S·马德哈范
S-C·庄
K-J·许
V·R·K·达亚娜
江晓云
Current Assignee
Qualcomm Inc
Original Assignee
Qualcomm Inc
Priority date
Filing date
Publication date
Application filed by Qualcomm Inc
Publication of CN116324878A

Classifications

    • G06T7/174: Segmentation; Edge detection involving the use of two or more images
    • G06T7/11: Region-based segmentation
    • G06T3/40: Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T5/70: Denoising; Smoothing
    • G06T7/194: Segmentation; Edge detection involving foreground-background segmentation
    • G06T2207/10016: Video; Image sequence
    • G06T2207/10024: Color image
    • G06T2207/10028: Range image; Depth image; 3D point clouds
    • G06T2207/10048: Infrared image
    • G06T2207/20076: Probabilistic image processing
    • G06T2207/20081: Training; Learning
    • G06T2207/20084: Artificial neural networks [ANN]
    • G06T2207/20092: Interactive image processing based on input by user
    • G06T2207/20132: Image cropping
    • G06T2207/20221: Image fusion; Image merging
    • H04N23/90: Arrangement of cameras or camera modules, e.g. multiple cameras in TV studios or sports stadiums

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Systems, methods, and computer-readable media for foreground image segmentation are provided. In some examples, a method may include: obtaining a first image of the object and a second image of the object, the first image having a first field of view (FOV), and the second image having a second FOV; determining a first segmentation map based on the first image, the first segmentation map identifying a first estimated foreground region in the first image; determining a second segmentation map based on the second image, the second segmentation map identifying a second estimated foreground region in the second image; generating a third segmentation map based on the first segmentation map and the second segmentation map; and generating a refined segmentation mask using the second segmentation map and the third segmentation map, the refined segmentation mask identifying at least a portion of the object as a foreground region of the first image and/or the second image.

Description

Segmentation for image effects
Technical Field
The present disclosure relates generally to image processing, and more particularly to segmentation for image effects.
Background
The increased versatility of digital camera products has allowed digital cameras to be integrated into a wide variety of devices and has expanded their use to different applications. For example, telephones, drones, automobiles, computers, televisions, and many other devices today are often equipped with camera devices. The camera device allows a user to capture images and/or video from any system equipped with the camera device. Images and/or video may be captured for entertainment use, professional photography, surveillance, and automation, among other applications. Furthermore, camera devices are increasingly equipped with specific functions for modifying images or creating artistic effects on images. For example, many camera devices are equipped with image processing capabilities for generating different effects on captured images.
Many image processing techniques rely on image segmentation algorithms that divide an image into segments that can be analyzed or processed to produce a particular image effect. Some example implementations of image segmentation include, but are not limited to, chroma key compositing, feature extraction, recognition tasks (e.g., object recognition, face recognition, etc.), image stylization, machine vision, medical imaging, and depth of field (or "bokeh") effects, among others. However, camera devices and image segmentation techniques often produce poor and inconsistent results and in many cases are only suitable for certain types of images.
Disclosure of Invention
Systems, methods, and computer-readable media for accurate image segmentation and image effects are disclosed herein. According to at least one example, a method for foreground prediction is provided. An example method may include: obtaining a first image of the object and a second image of the object, the first image having a first field of view (FOV), and the second image having a second FOV different from the first FOV; determining a first segmentation map based on the first image, the first segmentation map comprising foreground predictors associated with a first estimated foreground region in the first image; determining a second segmentation map based on the second image, the second segmentation map comprising foreground predictors associated with a second estimated foreground region in the second image; generating a third segmentation map based on the first segmentation map and the second segmentation map; and generating a refined segmentation mask using the second segmentation map and the third segmentation map, the refined segmentation mask identifying at least a portion of the object as a foreground region of the first image and/or the second image.
According to at least some examples, an apparatus for foreground prediction is provided. In one example, an example apparatus may include a memory and one or more processors configured to: obtain a first image of the object and a second image of the object, the first image having a first field of view (FOV), and the second image having a second FOV different from the first FOV; determine a first segmentation map based on the first image, the first segmentation map comprising foreground predictors associated with a first estimated foreground region in the first image; determine a second segmentation map based on the second image, the second segmentation map comprising foreground predictors associated with a second estimated foreground region in the second image; generate a third segmentation map based on the first segmentation map and the second segmentation map; and generate a refined segmentation mask using the second segmentation map and the third segmentation map, the refined segmentation mask identifying at least a portion of the object as a foreground region of the first image and/or the second image.
According to at least some examples, another example apparatus may include means for: obtaining a first image of the object and a second image of the object, the first image having a first field of view (FOV), and the second image having a second FOV different from the first FOV; determining a first segmentation map based on the first image, the first segmentation map comprising foreground predictors associated with a first estimated foreground region in the first image; determining a second segmentation map based on the second image, the second segmentation map comprising foreground predictors associated with a second estimated foreground region in the second image; generating a third segmentation map based on the first segmentation map and the second segmentation map; and generating a refined segmentation mask using the second segmentation map and the third segmentation map, the refined segmentation mask identifying at least a portion of the object as a foreground region of the first image and/or the second image.
According to at least one example, a non-transitory computer-readable medium for foreground prediction is provided. An example non-transitory computer-readable medium may store instructions that, when executed by one or more processors, cause the one or more processors to: obtain a first image of the object and a second image of the object, the first image having a first field of view (FOV), and the second image having a second FOV different from the first FOV; determine a first segmentation map based on the first image, the first segmentation map comprising foreground predictors associated with a first estimated foreground region in the first image; determine a second segmentation map based on the second image, the second segmentation map comprising foreground predictors associated with a second estimated foreground region in the second image; generate a third segmentation map based on the first segmentation map and the second segmentation map; and generate a refined segmentation mask using the second segmentation map and the third segmentation map, the refined segmentation mask identifying at least a portion of the object as a foreground region of the first image and/or the second image.
In some aspects, the above-described methods, apparatus, and non-transitory computer-readable media may include: an edited image is generated based on the refined segmentation mask. In some examples, the edited image may be based on the first image or the second image. In some cases, the edited image may include a visual effect, an augmented reality effect, an image processing effect, a blur effect, an image recognition effect, a segmentation effect, a computer graphics effect, a chromakeying effect, and/or an image stylization effect. In an illustrative example, the edited image may include a blur effect.
In some cases, generating the edited image may include: applying a blurring effect to one or more image regions located outside the foreground region based on the refined segmentation mask, wherein the blurring effect comprises a depth of field effect, wherein the one or more image regions are at least partially blurred and the foreground region is at least partially focused.
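As an illustration of how such a blur effect could be applied, the following is a minimal sketch (not taken from the patent) that assumes a refined segmentation mask with values in [0, 1] and uses a single Gaussian blur; the function name and parameter values are illustrative only:

```python
import cv2
import numpy as np

def apply_depth_of_field(image_bgr: np.ndarray, mask: np.ndarray,
                         blur_ksize: int = 21) -> np.ndarray:
    """Keep masked (foreground) pixels sharp and blur everything else.

    image_bgr: HxWx3 uint8 image.
    mask: HxW float array in [0, 1], where 1 indicates foreground.
    """
    blurred = cv2.GaussianBlur(image_bgr, (blur_ksize, blur_ksize), 0)
    alpha = mask.astype(np.float32)[..., None]          # HxWx1 for broadcasting
    out = alpha * image_bgr.astype(np.float32) + (1.0 - alpha) * blurred.astype(np.float32)
    return out.astype(np.uint8)
```

A soft (non-binary) mask yields a gradual transition between the in-focus foreground and the blurred background.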
In some examples, generating the refined segmentation mask may include: generating a fused segmentation map by fusing the second segmentation map and the third segmentation map; and generating a refined segmentation mask based on the fused segmentation map. In some cases, fusing the second and third segmentation maps may include averaging the foreground prediction values in the second segmentation map and the foreground prediction values in the third segmentation map. In some examples, the fused segmentation map may include the average foreground prediction values.
In some cases, generating the fused segmentation map may include fusing the second segmentation map and the third segmentation map, and multiplying a center prior map with the fused output of the second segmentation map and the third segmentation map.
In some cases, a first image may be received from a first camera and a second image may be received from a second camera. In some examples, the first camera may be responsive to visible light and the second camera may be responsive to infrared light. In some cases, the apparatus may include a first camera and a second camera. In other cases, the first image and the second image may be received from the same camera. In some examples, the apparatus may include the same camera.
In some aspects, the above-described methods, apparatus, and non-transitory computer-readable media may include: a fourth segmentation map is determined based on the third image, the fourth segmentation map including foreground values associated with a third estimated foreground region in the third image. In some cases, a third image may be received from a third camera. For example, a first image may be received from a first camera, a second image may be received from a second camera, and a third image may be received from a third camera.
In some examples, the apparatus described above may be configured to generate the segmentation mask in response to a user selection of an imaging mode (e.g., portrait mode, a green mask, etc.).
In some aspects, the above-described methods, apparatus, and non-transitory computer-readable media may include: a foreground probability is calculated for superpixels in the first image and/or the second image, and a center prior map is generated based on the foreground probabilities. In some cases, each foreground probability may be calculated based on the distance of the associated superpixel to the center of the image.
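A minimal sketch of one way such a center prior could be computed is shown below, assuming superpixel labels are already available and using a Gaussian falloff with respect to the normalized centroid distance (the falloff shape and the sigma value are assumptions, not details specified by the patent):

```python
import numpy as np

def center_prior_map(labels: np.ndarray, sigma: float = 0.3) -> np.ndarray:
    """Assign each superpixel a weight that decays with the distance of its
    centroid from the image center. Returns an HxW float map in (0, 1]."""
    h, w = labels.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    prior = np.zeros((h, w), dtype=np.float32)
    for sp in np.unique(labels):
        m = labels == sp
        dy = (ys[m].mean() - cy) / h                    # normalized vertical offset
        dx = (xs[m].mean() - cx) / w                    # normalized horizontal offset
        dist = np.sqrt(dx * dx + dy * dy)               # centroid distance to image center
        prior[m] = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    return prior
```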
In some examples, generating the third segmentation map may include: cropping the first segmentation map to the size of the second image; and upsampling the first segmentation map according to the resolution of the second image after cropping the first segmentation map.
In some cases, determining the first segmentation map may include: extracting superpixels from the first image; generating a set of image queries, each image query comprising the extracted superpixels, wherein each image query has a different boundary region of superpixels marked by one or more distinguishing pixel attributes; generating a set of segmentation probability maps based on the set of image queries, wherein the segmentation probability map for each image query is generated using one or more manifold ranking functions to estimate a foreground probability based at least in part on differences between the different boundary regions of the superpixels and one or more other regions of the superpixels associated with each image query; and generating the first segmentation map by multiplying the set of segmentation probability maps.
In some examples, determining the second segmentation map may include: extracting superpixels from the second image; generating an additional set of image queries, each additional image query in the additional set of image queries comprising extracted superpixels from the second image, wherein each additional image query has a different boundary region of superpixels marked by one or more distinguishing pixel attributes; generating an additional set of segmentation probability maps based on the additional set of image queries, wherein the additional segmentation probability map for each additional image query is generated using one or more manifold ranking functions to estimate a foreground probability based at least in part on additional differences between the different boundary regions of the superpixels and one or more other regions of the superpixels associated with the additional image query; and generating the second segmentation map by multiplying the additional set of segmentation probability maps.
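The following simplified sketch illustrates the general idea of the boundary-query procedure described above, assuming a superpixel affinity matrix W has already been built (e.g., from color similarity between neighboring superpixels); the closed-form ranking expression, the normalization, and the use of (1 - score) before multiplication are assumptions modeled on common graph-based manifold ranking formulations rather than details taken from the patent:

```python
import numpy as np

def manifold_rank(W: np.ndarray, query: np.ndarray, alpha: float = 0.99) -> np.ndarray:
    """Closed-form manifold ranking: solve (D - alpha * W) f = y for f."""
    D = np.diag(W.sum(axis=1))
    return np.linalg.solve(D - alpha * W, query.astype(np.float64))

def foreground_probabilities(W: np.ndarray, boundary_queries: list) -> np.ndarray:
    """Combine per-boundary ranking results by multiplication.

    boundary_queries: one boolean vector of length N per image boundary
    (e.g., top/bottom/left/right), marking the superpixels on that boundary.
    Returns a length-N foreground probability estimate per superpixel.
    """
    prob = np.ones(W.shape[0])
    for query in boundary_queries:
        f = manifold_rank(W, np.asarray(query, dtype=float))
        f = (f - f.min()) / (f.max() - f.min() + 1e-8)   # normalize scores to [0, 1]
        prob *= 1.0 - f                                  # high value => dissimilar from that boundary
    return prob
```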
In some examples, the first segmentation map and the second segmentation map may be generated using a deep neural network. In some examples, the first segmentation map and the second segmentation map may be generated using a trained neural network. In some cases, the trained neural network may include a deep neural network. In some cases, the first image and the second image are captured using different lenses having different FOVs. In some examples, the different lenses may include a telephoto lens, a wide-angle lens, an ultra-wide-angle lens, a standard lens, and/or a zoom lens.
In some cases, the first image and the second image may be captured using different image capture devices or cameras having different FOVs. In other cases, a single image capture device or camera supporting different FOVs may be used to capture the first and second images.
In some cases, the apparatus may be and/or may include a mobile phone, a smart wearable device, a portable computer, and/or a camera. In some cases, the apparatus may include an image sensor and/or a display. In some examples, each of the above-described apparatuses may include one or more cameras. For example, each apparatus may include different cameras with different FOVs. As another example, each apparatus may include a camera capable of capturing images with different FOVs.
This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used as an aid in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification, drawings and claims of this disclosure.
The above and other features and embodiments will become more apparent upon reference to the following description, claims and drawings.
Drawings
Illustrative examples of the present application are described in detail below with reference to the following drawings:
FIG. 1 is a block diagram illustrating an example image processing system according to some examples of the present disclosure;
FIGS. 2A and 2B illustrate example differences between foreground predictions made from images of targets captured with different fields of view according to some examples of the present disclosure;
FIGS. 3A and 3B are diagrams illustrating example flows of image segmentation using images with different fields of view according to some examples of the present disclosure;
FIG. 4 is a diagram illustrating an example machine learning model for foreground prediction according to some examples of the present disclosure;
FIG. 5 is a diagram illustrating an example foreground prediction process based on manifold ranking, according to some examples of the present disclosure;
FIG. 6 is a flowchart of an example of a process for image segmentation using images with different fields of view, according to some examples of the present disclosure;
FIG. 7 illustrates an example computing device architecture according to some examples of the present disclosure.
Detailed Description
Certain aspects and embodiments of the disclosure are provided below. As will be apparent to those skilled in the art, some of these aspects and embodiments may be applied independently, and some of these aspects and embodiments may be applied in combination. In the following description, for purposes of explanation, specific details are set forth in order to provide a thorough understanding of the embodiments of the present application. It may be evident, however, that the various embodiments may be practiced without these specific details. These drawings and descriptions are not intended to be limiting.
The following description merely provides example embodiments and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the ensuing description of the exemplary embodiments will provide those skilled in the art with an enabling description for implementing an exemplary embodiment. It being understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the application as set forth in the appended claims.
As noted previously, computing devices are increasingly equipped with capabilities for capturing images, performing various image processing tasks, and generating various image effects. Many of the image processing tasks and effects (e.g., chromakeying, depth of field or "bokeh" effects, recognition tasks (e.g., object, facial and biometric recognition), feature extraction, image stylization, automation, machine vision, computer graphics, medical imaging, etc.) rely on image segmentation to divide an image into multiple segments that can be analyzed or processed to perform a desired image processing task or to generate a desired image effect. For example, cameras are often equipped with portrait mode functionality that achieves a shallow depth of field ("bokeh") effect. The depth of field effect may bring a particular image region or object (e.g., a foreground object or region) into focus while blurring other regions or pixels (e.g., background regions or pixels) in the image. Depth of field effects may be created using image segmentation techniques to identify and modify different regions or objects in the image, e.g., background and foreground regions or objects. In some examples, depth of field effects may be created using depth information associated with an image.
However, image segmentation may often produce inaccurate, inconsistent, and/or poor results, which may negatively impact depth of field and other visual effects. In many cases, the image segmentation results may vary based on a number of factors, such as, but not limited to, the distance of the camera device from the target, the type of lens of the camera device, the size ratio of the image and the foreground target, the texture of the foreground target and/or other portions of the image, the color of the foreground target and/or other portions of the image, and so forth. For example, if the target appears too close or too far from the camera device, the image segmentation for portrait mode may fail. In many cases, as a foreground object (or a portion of a foreground object) moves farther and farther away from the center of an image, it may become increasingly difficult to detect the boundary or outline of the object (or a portion of the object) for segmenting and/or distinguishing the object (or a portion of the object) from other objects/regions (e.g., background objects/regions) in the image.
The focal length, field of view (FOV), and/or type of lens used to capture an image may also lead to variations in segmentation results and various segmentation problems. For example, an object in an image captured using a telephoto lens may appear closer, and the image may contain less background information that could otherwise obscure the contours of the foreground object. However, the boundaries of such images often cut off (e.g., do not capture/include) a portion or region of the foreground object, which results in less information about the object and may negatively impact segmentation accuracy.
On the other hand, an object in an image captured using a wide-angle lens may appear farther away, and the image may more easily capture the complete outline of the object. However, images captured using wide-angle lenses typically contain more information from the side areas of the image and/or areas away from the center of the image (e.g., image data corresponding to other objects and/or non-target areas), which may result in segmentation noise/interference and may make it increasingly difficult to distinguish the target (or a portion of the target) from other objects/areas in the background (e.g., in the side areas or areas away from the center of the image).
These and other problems may negatively impact image segmentation results, thereby negatively impacting the visual and image effects (e.g., depth of field effects) that rely on image segmentation. In many cases, in order to trigger the portrait mode and prevent image segmentation failure for the portrait mode, the camera device must be within a certain distance of the target. Typically, the user must move the camera device and the target farther from or closer to each other to ensure that the camera device and the target are within a certain distance to trigger the portrait mode and/or prevent image segmentation failure for the portrait mode. However, this approach is inconvenient, inflexible, and may result in poor segmentation results.
In some examples, the techniques described herein may utilize images with different fields of view (FOVs) and/or images captured using different types of lenses to improve and refine the segmentation results. For example, the techniques described herein may produce more accurate, flexible, and consistent segmentation results. The improved segmentation results may also improve the quality and performance of image processing tasks and visual effects (such as, but not limited to, depth of field effects, computer graphics effects, image recognition tasks, image stylization, feature extraction tasks, machine vision, etc.) that rely on image segmentation.
In some examples, the techniques herein may utilize images captured by different cameras on a multi-camera device. In some cases, different cameras may have different FOVs or focal lengths. Images captured by different cameras having different FOVs or focal lengths may be used as inputs to generate improved and refined segmentation results. For example, in some cases, the multi-camera device may use different cameras to capture a first image having a FOV (e.g., wide angle FOV, narrow angle FOV, zoom FOV, etc.) and a second image having a different FOV. The first image may be used to generate a first segmentation mask and the second image may be used to generate a second segmentation mask. The first segmentation mask and the second segmentation mask may be fused to generate improved segmentation results. The segmentation result may be used to render effects on the first image, the second image, and/or the future image.
In some examples, the techniques herein may utilize images captured by a single camera using different FOVs or focal lengths. Images captured using different FOVs or focal lengths may be used as inputs to generate improved and refined segmentation results. For example, in some cases, a single camera may capture a first image having a FOV (e.g., wide angle FOV, narrow angle FOV, zoom FOV, etc.) and a second image having a different FOV. The first image may be used to generate a first segmentation mask and the second image may be used to generate a second segmentation mask. The first segmentation mask and the second segmentation mask may be fused to generate improved segmentation results. The segmentation result may be used to render effects on the first image, the second image, and/or the future image.
In the following disclosure, systems, apparatuses (or devices), processes (or methods), and computer-readable media for image segmentation are provided. The present technology will be described in the following disclosure. The discussion begins with a description of example systems, techniques, and processes for image segmentation and foreground prediction, as shown in fig. 1-5. A description of an example process for image segmentation and foreground prediction, as shown in fig. 6, then follows. The discussion concludes with a description of an example computing device architecture that includes example hardware components suitable for performing image segmentation and generating images with depth of field and other effects, as shown in fig. 7. The present disclosure now turns to fig. 1.
Fig. 1 is a diagram illustrating an example image processing system 100 according to some examples. As described herein, the image processing system 100 may perform various image processing tasks and generate various image processing effects. For example, image processing system 100 may perform image segmentation and foreground prediction, generate a depth image, generate a chromakey effect, perform feature extraction tasks, perform image recognition tasks, implement machine vision, and/or perform any other image processing tasks.
In some illustrative examples, the image processing system 100 may perform foreground prediction and generate a depth image from one or more image capture devices (e.g., cameras, image sensors, etc.). For example, in some implementations, the image processing system 100 may use images with different FOVs to perform image segmentation and generate depth image effects. In some cases, capture devices with different types of image sensors and/or lenses (e.g., wide angle lens, tele lens (e.g., short tele, mid tele, etc.), standard lens, zoom lens, etc.) may be used to capture images with different FOVs. Although depth of field effects are used herein as example effects generated based on image segmentation results, the techniques described herein may be used for any image processing effect, e.g., chromakeying effects, feature extraction effects, image recognition effects, machine vision effects, image stylization effects, augmented reality effects, any combination thereof, and/or for any other image processing effect.
In the example shown in fig. 1, image processing system 100 includes image capture devices 102 and 104, storage 108, computing component 110, image processing engine 120, neural network 122, and rendering engine 124. The image processing system 100 may also potentially include one or more sensors 106, such as light detection and ranging (LIDAR) sensors, radar, accelerometers, gyroscopes, light sensors, inertial measurement units (IMUs), proximity sensors, and the like. In some cases, the image processing system 100 may include multiple image capture devices capable of capturing images having different FOVs. For example, in a dual camera or image sensor application, the image processing system 100 may include image capture devices with different types of lenses (e.g., wide angle, tele, standard, zoom, etc.) capable of capturing images with different FOVs (e.g., different perspectives, different depths of field, etc.).
The image processing system 100 may be part of a computing device or multiple computing devices. In some examples, the image processing system 100 may be part of an electronic device (or multiple electronic devices), such as a camera system (e.g., digital camera, IP camera, video camera, security camera, etc.), a telephone system (e.g., smart phone, cellular phone, conferencing system, etc.), a desktop computer, a laptop or notebook computer, a tablet computer, a set-top box, a television, a display device, a digital media player, a game console, a video streaming device, an unmanned aerial vehicle, a computer in an automobile, an IoT (internet of things) device, an intelligent wearable device, an extended reality (XR) device (e.g., head-mounted display, smart glasses, etc.), or any other suitable electronic device.
In some implementations, the image capture device 102, the image capture device 104, the other sensor 106, the storage 108, the computing component 110, the image processing engine 120, the neural network 122, and the rendering engine 124 may be part of the same computing device. For example, in some cases, image capture device 102, image capture device 104, another sensor 106, storage 108, computing component 110, image processing engine 120, neural network 122, and rendering engine 124 may be integrated into a smart phone, a laptop computer, a tablet computer, a smart wearable device, a gaming system, an XR device, and/or any other computing device. However, in some implementations, the image capture device 102, the image capture device 104, the other sensor 106, the storage 108, the computing component 110, the image processing engine 120, the neural network 122, and/or the rendering engine 124 may be part of two or more separate computing devices.
In some examples, image capture devices 102 and 104 may be any image and/or video capture device, such as a digital camera, video camera, smart phone camera, camera device on an electronic apparatus such as a television or computer, camera system, or the like. In some cases, the image capture devices 102 and 104 may be part of a camera or computing device (e.g., a digital camera, video camera, IP camera, smart phone, smart television, gaming system, etc.). In some examples, the image capture devices 102 and 104 may be part of a dual-camera assembly. Image capture devices 102 and 104 may capture image and/or video content (e.g., raw image and/or video data) that may then be processed by computing component 110, image processing engine 120, neural network 122, and/or rendering engine 124 as described herein.
In some cases, image capture devices 102 and 104 may include image sensors and/or lenses for capturing image and/or video data. Image capture devices 102 and 104 may capture image and/or video data having different FOVs (including different perspectives, different depths of field, different sizes, etc.). For example, in some cases, image capture devices 102 and 104 may include different image sensors having different FOVs. In other examples, image capture devices 102 and 104 may include different types of lenses having different FOVs, e.g., wide angle lenses, tele lenses (e.g., short tele, mid tele, etc.), standard lenses, zoom lenses, etc. In some examples, image capture device 102 may include one type of lens and image capture device 104 may include a different type of lens. In some cases, image capture devices 102 and 104 may be responsive to different types of light. For example, in some cases, image capture device 102 may be responsive to visible light and image capture device 104 may be responsive to infrared light.
In some examples, the image capture device 102 may have different FOVs. In such examples, the image capture device 102 may capture images having different FOVs. Similarly, in some examples, the image capture device 104 may have different FOVs and may capture images with different FOVs.
The other sensor 106 may be any sensor for detecting and measuring information such as distance, motion, position, depth, speed, etc. Non-limiting examples of sensors include LIDAR, gyroscopes, accelerometers, magnetometers, radar, and IMUs. In one illustrative example, the sensor 106 may be a LIDAR sensor configured to sense or measure distance and/or depth information that may be used in computing depth of field and other effects. In some cases, the image processing system 100 may include other sensors, such as machine vision sensors, smart scene sensors, voice recognition sensors, impact sensors, position sensors, tilt sensors, light sensors, and the like.
The storage 108 may be any storage device for storing data (e.g., image or video data). The storage 108 may store data from any of the components of the image processing system 100. For example, the storage 108 may store data or measurements from any of the image capture devices 102 and 104, the image sensor 106, the computing component 110 (e.g., processing parameters, output images, computing results, etc.), and/or any of the image processing engine 120, the neural network 122, and/or the rendering engine 124 (e.g., output images, processing results, parameters, etc.). In some examples, the storage 108 may include a buffer for storing data (e.g., image data) processed by the computing component 110.
In some implementations, the computing component 110 may include a Central Processing Unit (CPU) 112, a Graphics Processing Unit (GPU) 114, a Digital Signal Processor (DSP) 116, and/or an Image Signal Processor (ISP) 118. The computing component 110 may perform various operations such as image augmentation, object or image segmentation, computer vision, graphics rendering, XR (e.g., augmented reality, virtual reality, mixed reality, etc.), image/video processing, sensor processing, recognition (e.g., text recognition, object recognition, feature recognition, face recognition, tracking or pattern recognition, scene recognition, etc.), foreground prediction, machine learning, filtering, depth effect calculation or rendering, and any of the various operations described herein. In some examples, the computing component 110 may implement an image processing engine 120, a neural network 122, and a rendering engine 124. In other examples, the computing component 110 may also implement one or more other processing engines.
The operations of the image processing engine 120, the neural network 122, and the rendering engine 124 may be implemented by one or more of the computing components 110. In one illustrative example, the image processing engine 120 and the neural network 122 (and associated operations) may be implemented by the CPU 112, DSP 116, and/or ISP 118, and the rendering engine 124 (and associated operations) may be implemented by the GPU 114. In some cases, the computing component 110 may include other electronic circuitry or hardware, computer software, firmware, or any combination thereof to perform any of the various operations described herein.
In some cases, the computing component 110 may receive data (e.g., image data, video data, etc.) captured by the image capture device 102 and/or the image capture device 104 and process the data to generate an output image having a particular visual and/or image processing effect (e.g., a depth of field effect). For example, the computing component 110 may receive image data (e.g., one or more frames, etc.) captured by the image capture devices 102 and 104, perform foreground prediction and image segmentation, and generate an output image having a depth of field effect. The image (or frame) may be a red-green-blue (RGB) image having red, green, and blue components per pixel; a luminance, chrominance-red, chrominance-blue (YCbCr) image having a luminance component and two chrominance (color) components (chrominance-red and chrominance-blue) per pixel; or any other suitable type of color or monochrome picture.
The computing component 110 may implement the image processing engine 120 and the neural network 122 to perform various image processing operations and generate image effects, such as depth of field effects. For example, the computing component 110 may implement the image processing engine 120 and the neural network 122 to perform feature extraction, superpixel detection, foreground prediction, spatial mapping, saliency detection, segmentation, pixel classification, cropping, upsampling/downsampling, blurring, modeling, filtering, color correction, noise reduction, scaling, ranking, and/or other image processing tasks. The computing component 110 may process image data captured by the image capture devices 102 and/or 104; image data in the storage device 108; image data received from a remote source, such as a remote camera, server, or content provider; image data obtained from a combination of sources; etc.
In some examples, the computing component 110 may predict a foreground in image data captured by the image capture devices 102 and 104, generate segmentation maps (e.g., probability maps), generate a refined or updated segmentation map based on the plurality of segmentation maps, generate a segmentation mask, and output an image having an effect (e.g., a depth of field effect) generated based on the segmentation mask. In some cases, the computing component 110 may use spatial information (e.g., a center prior map), probability maps, disparity information (e.g., disparity maps), image queries, saliency maps, etc. to segment objects and/or regions in one or more images and generate an output image having an image effect (e.g., a depth of field effect). In other cases, the computing component 110 may also use other information, such as face detection information, sensor measurements (e.g., depth measurements), and the like.
In some examples, the computing component 110 may perform segmentation (e.g., foreground-background segmentation) with (or almost) pixel-level accuracy to generate an output image with a depth of field. The computing component 110 may perform segmentation (e.g., foreground-background segmentation) using images with different FOVs. For example, the computing component 110 may perform foreground-background segmentation using an image captured by the image capture device 102 having a first FOV and an image captured by the image capture device 104 having a second FOV. The foreground-background segmentation may also enable other image adjustment or image processing operations (or be used in conjunction therewith), such as, but not limited to, depth enhancement and object-aware auto-exposure, auto-white balancing, auto-focusing, tone mapping, and the like.
Although image processing system 100 is shown as including certain components, one of ordinary skill will appreciate that image processing system 100 may include more or fewer components than those shown in FIG. 1. For example, in some instances, image processing system 100 may also include one or more memory devices (e.g., RAM, ROM, cache, etc.), one or more network interfaces (e.g., wired and/or wireless communication interfaces, etc.), one or more display devices, and/or other hardware or processing devices not shown in fig. 1. An illustrative example of computing devices and hardware components that may be implemented using image processing system 100 is described below with reference to FIG. 7.
As noted previously, if the foreground object appears too close to the image capture device, the image segmentation for portrait mode may fail. For example, the image segmentation algorithm may not detect objects that appear to be too close to the image capture device. The image segmentation results may also differ for images with different FOVs. The type of lens used to capture the image may affect the image segmentation performance. For example, an object in an image captured using a tele lens may appear closer, and the image may contain less background information that could otherwise complicate foreground prediction (e.g., by adding noise). However, the boundaries of images captured using a tele lens often cut off (e.g., do not capture/include) a portion or region of the foreground object, which results in less information about the object and can negatively impact segmentation accuracy.
On the other hand, an object in an image captured using a wide-angle lens may appear farther away, and the image may more easily capture the complete outline of the object. However, images captured using wide-angle lenses typically contain more information from the side areas of the image and/or areas away from the center of the image (e.g., image data corresponding to other objects and/or non-target areas), which may result in segmentation noise/interference and may make it increasingly difficult to distinguish the target (or a portion of the target) from other objects/areas in the background (e.g., in the side areas or areas away from the center of the image).
Fig. 2A and 2B illustrate example differences between foreground predictions made from images of objects captured with different FOVs. Referring to fig. 2A, image 202 may have a first FOV and image 220 may have a second FOV different from the first FOV. Both images 202 and 220 capture an object 210, the object 210 representing a foreground shape to be detected during foreground prediction and/or image segmentation.
In this example, the first FOV of the image 202 has a larger magnification and a narrower viewing angle of the object 210 captured in the image 202, and the second FOV of the image 220 has a smaller magnification and a wider viewing angle of the object 210. In some examples, the image 202 having the first FOV may be an image of the object 210 captured by a lens having a narrower viewing angle and a larger magnification (e.g., a tele lens, a zoom lens, etc.), and the image 220 having the second FOV may be an image of the object 210 captured by a lens having a wider viewing angle and a smaller magnification (e.g., a wide angle lens, etc.).
The segmentation mask 204 shows the foreground prediction generated based on the image 202 having the first FOV and the segmentation mask 222 shows the foreground prediction generated based on the image 220 having the second FOV. The different FOVs of images 202 and 220 result in different ratios of the size of object 210 to the size of images 202 and 220. The ratio of the target size (e.g., the size of the target foreground shape) to the image size (e.g., the size of the image) can affect segmentation accuracy and performance. For example, the ratio of target size to image size may result in foreground prediction missing, excluding, and/or failing to detect portions of a foreground target, or detecting and including information and/or shapes (e.g., objects, people, animals, structures, etc.) that are not part of a foreground, e.g., background and/or non-target information and/or shapes.
As shown in the example of fig. 2A, the segmentation mask 204 generated from the image 202 having the first FOV misses a portion of the object 210 (e.g., the hand of the object 210) within an area 206 that is farther from the center of the image 202 and/or closer to the image boundary (e.g., closer to the edge/boundary of the image 202). While the image 202 has less background information, the larger ratio of object size to image size (which may correspond to the object 210 appearing closer and/or to the narrower FOV of the image 202) may cut/truncate a portion of the foreground information (e.g., information about the object) and affect segmentation accuracy.
A segmentation mask 222 generated from the image 220 having the second FOV captures the object 210 without missing foreground information associated with the object 210, including the portion of the object 210 within the region 206. However, in addition to capturing the object 210, the segmentation mask 222 captures objects (e.g., garbage cans) within the background region 224 that should not be identified and/or considered as part of the foreground (e.g., the object 210). While the smaller ratio of object size to image size resulting from the wider FOV of image 220 allows segmentation mask 222 to better capture the complete contour of the foreground (e.g., object 210), it also results in segmentation mask 222 capturing background and/or interference information that should not be included/considered as part of the foreground.
In some cases, different images may result in inconsistent segmentation even though the images have the same or similar FOV. For example, referring to fig. 2B, both the first image 240 and the second image 250 capture a target 260, the target 260 representing a foreground shape to be detected during foreground prediction and/or image segmentation. The first image 240 and the second image 250 have similar or identical FOVs. However, the segmentation masks 242 and 252 generated from the first image 240 and the second image 250 are not uniform.
For example, the segmentation mask 242 generated from the first image 240 captures the complete contour of the target 260. On the other hand, the segmentation mask 252 generated from the second image 250 fails to fully capture a portion 262 of the target 260, in contrast to the portion 262 being captured in the segmentation mask 242. In this example, the portion 262 of the target 260 is a cup that the user holds, and the foreground prediction should include the user and the cup that the user holds. While the segmentation mask 242 accurately captures the cup and the user, the segmentation mask 252 captures the user but fails to properly capture the cup. As shown, the segmentation masks 242 and 252 generated from the two images 240 and 250 are inconsistent, which results in inconsistent segmentation results and may affect image processing tasks that depend on the segmentation results.
In some examples, to increase segmentation consistency, quality, and performance, a segmentation mask may be generated from multiple images by fusing and/or utilizing segmentation information from the multiple images. In some cases, the multiple images may have different FOVs. Segmentation information from images with different FOVs may complement each other to avoid or limit segmentation inconsistencies and to avoid or limit segmentation problems caused by relative distance/depth, scale, view, etc. of the object. In some examples, the segmentation results may be generated using images captured with different types of shots. Segmentation information from images captured using different types of shots may be used together to refine the segmentation results and produce a more consistent and accurate segmentation.
Fig. 3A is a diagram illustrating an example flow 300 for image segmentation using images with different FOVs. Inputs to the example flow 300 include an image 302 and an image 304. In some cases, images 302 and 304 may be captured by image capture device 102 and image capture device 104, respectively. In some examples, image capture devices 102 and 104 may include different types of lenses and may capture images having different FOVs. For example, image capture device 102 may capture image 302 using a FOV and image capture device 104 may capture image 304 using a different FOV. In some cases, images 302 and 304 may be captured by the same image capture device (e.g., image capture device 102 or 104) using different FOVs. In fig. 3A, image 302 has a different FOV than image 304. For example, image 302 has a wider field of view than image 304.
Images 302 and 304 capture an object 306 representing a foreground shape. Image 302 has a wider field of view than image 304, and image 304 has a larger ratio of target size to image size. The target 306 appears closer in the image 304 than in the image 302. Given its wider FOV (and smaller ratio of object size to image size), image 302 captures a larger portion of the foreground (e.g., object 306), but also captures more background and/or additional information (e.g., information that does not correspond to object 306). For example, the target 306 in the image 302 is shown within a bounding box 308, the bounding box 308 representing the content (e.g., the scene) captured by the image 304 (e.g., the full image 304) from the image capture device 104. Some or all of the information from the area outside of bounding box 308 (e.g., outside of image 304) represents the background and/or additional information captured by image 302.
As can be seen in this example, image 302 captures more background and/or additional information (e.g., more information outside bounding box 308) than image 304. However, while image 304 captures less background and/or additional information than image 302, given its narrower FOV, image 304 also captures less foreground information (e.g., less of the object 306). For example, image 302 captures the complete outline of target 306, while image 304 cuts off or truncates a portion of target 306. As further described herein, the process 300 may utilize information from both images to generate more accurate and consistent segmentation results.
The image processing system 100 may use the image 302 and the image 304 to perform the foreground prediction 310. Based on the foreground prediction 310, the image processing system 100 may generate a probability map 312 for the image 302 and a probability map 314 for the image 304. Probability map 312 may contain probabilities of each pixel or super-pixel (e.g., group of pixels) in image 302 belonging to a target 306 to be identified as foreground and/or separated from background. Probability map 314 may contain the probability that each pixel or superpixel in image 304 belongs to a target 306 that is to be identified as foreground and/or separated from background.
In some cases, the superpixels may represent different segments or regions of an image (e.g., image 302 or image 304). In some examples, a superpixel may comprise a set of pixels in an image. For example, in some cases, a superpixel may comprise a set of homogeneous or nearly homogeneous pixels in an image. As another example, a superpixel may include a set of pixels having one or more characteristics such that when the superpixel is rendered, the superpixel may have one or more uniform or consistent characteristics, e.g., color, texture, brightness, semantics, etc. In some examples, the superpixels may provide a perceived grouping of pixels in the image.
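For example, superpixels could be extracted with an off-the-shelf algorithm such as SLIC; the patent does not mandate any particular algorithm, so the following is only an illustrative sketch using scikit-image:

```python
import numpy as np
from skimage.segmentation import slic
from skimage.util import img_as_float

def extract_superpixels(image_rgb: np.ndarray, n_segments: int = 300) -> np.ndarray:
    """Group pixels into roughly homogeneous superpixels.

    Returns an HxW integer label map; pixels sharing a label form one superpixel.
    """
    return slic(img_as_float(image_rgb), n_segments=n_segments,
                compactness=10, start_label=0)
```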
In some examples, the image processing system 100 may perform the foreground prediction 310 using a deep neural network (DNN), such as the neural network 122. In other examples, the image processing system 100 may perform the foreground prediction 310 using other techniques, such as graph-based manifold ranking. In some examples, using graph-based manifold ranking, the image processing system 100 may rank the relevance between each pixel or superpixel in the input image (e.g., image 302, image 304) and the foreground (and/or the region estimated as part of the foreground). In some cases, the manifold ranking may have a closed form to enable more efficient computation.
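For reference, the closed-form solution referred to above is commonly written as follows (a standard graph-based manifold ranking formulation; the patent itself does not spell out the expression), where W is the affinity matrix over pixels or superpixels, D is its degree matrix, y is the query indicator vector, and alpha is a weighting parameter:

```latex
f^{*} = (D - \alpha W)^{-1} y
```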
The probability map 312 generated from the image 302 by the foreground prediction 310 includes the entire target 306 and identifies it as part of the foreground, but also includes false positives from portions of the image 302 outside the contour of the target 306. The probability map 314 generated from the image 304 by the foreground prediction 310 contains less noise (e.g., fewer false positives) than the probability map 312, but does not include the entire target 306. Because probability map 312 and probability map 314 complement each other, image processing system 100 may use both probability map 312 and probability map 314 to generate refined probability maps and segmentation results, as described further herein.
The image processing system 100 may crop and upsample 316 the probability map 312 to infer a probability map 318 having the FOV of the image 304 from the probability map 312 having the FOV of the image 302. Cropping and upsampling may reduce the size of probability map 312 and increase its spatial resolution to the same (or a similar) size and resolution as probability map 314. In some examples, based on the cropping and upsampling, the inferred probability map 318 may have the same (or a similar) size and resolution as the probability map 314 generated for the image 304, allowing the image processing system 100 to fuse (or optimize the fusion of) the probability map 314 and the inferred probability map 318. Since the inferred probability map 318 contains foreground information predicted from image 302 and the probability map 314 contains foreground information predicted from image 304, the two maps may be used together to leverage the foreground information from both image 302 and image 304 and obtain refined foreground prediction results.
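For illustration, a minimal sketch of the crop-and-upsample step is shown below, assuming the crop rectangle corresponding to the narrower FOV (e.g., bounding box 308) and the target resolution are already known; the use of OpenCV/NumPy and the function names are assumptions, not the patent's implementation.

```python
import cv2
import numpy as np

def crop_and_upsample(prob_map_wide, crop_box, target_size):
    """Infer a narrow-FOV probability map from a wide-FOV probability map.

    prob_map_wide: 2D float array of per-pixel foreground probabilities.
    crop_box: (x, y, w, h) region of the wide image covered by the
              narrow-FOV image (e.g., bounding box 308).
    target_size: (width, height) of the narrow-FOV probability map.
    """
    x, y, w, h = crop_box
    cropped = prob_map_wide[y:y + h, x:x + w]            # crop to the narrow FOV
    upsampled = cv2.resize(cropped, target_size,          # match size/resolution
                           interpolation=cv2.INTER_LINEAR)
    return np.clip(upsampled, 0.0, 1.0)
```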
In some examples, when generating inferred probability map 318 (e.g., after cropping and upsampling), image processing system 100 may perform guided filtering (e.g., based on image 304 and/or probability map 314) for image enhancement and/or texture smoothing. In some examples, the image processing system 100 may also perform normalization to adjust the range of pixel intensity values.
The image processing system 100 may fuse 320 the probability map 314 and the inferred probability map 318 to generate a fused probability map 322. The fused probability map 322 may combine, include, reflect, and/or refine the foreground information from the probability map 314 generated from the image 304 and the inferred probability map 318 generated from the image 302. In some examples, to fuse 320 the probability map 314 and the inferred probability map 318, the image processing system 100 may perform element-wise addition based on values in the probability map 314 and the inferred probability map 318. In some cases, the image processing system 100 may average the values in the probability map 314 and the inferred probability map 318 to generate an average probability map.
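A minimal sketch of the fusion step under the element-wise averaging option mentioned above, assuming the two maps already have the same size (e.g., after the crop-and-upsample step); the function name is an assumption.

```python
def fuse_probability_maps(prob_map_narrow, inferred_map):
    # Element-wise addition followed by averaging of the two foreground
    # predictions; both inputs are 2D arrays of identical shape.
    return 0.5 * (prob_map_narrow + inferred_map)
```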
The image processing system 100 may multiply 324 the fused probability map 322 with a center prior map 326. The center prior map 326 may include spatial prior information. In some examples, the center prior map 326 may represent the probability or likelihood that the target 306 is located in a center region of the image rather than a boundary region of the image. Because foreground objects/shapes are more likely to be located around or near the center of the image, the center prior map 326 may include higher probabilities for the center of the image and/or the areas around/near the center of the image. For regions farther from the center of the image, the probability may decrease.
Given the higher probabilities at (and near/around) the center of the image, the center prior map 326 may help remove unwanted information from the side areas of the fused probability map 322 (e.g., areas on/near the perimeter of the fused probability map 322 and/or areas farther from the center of the fused probability map 322). In some examples, the multiplication of the fused probability map 322 and the center prior map 326 may result in pixels or superpixels in (and closer to) the center of the fused probability map 322 being weighted more than pixels or superpixels on the sides of the fused probability map 322 and/or away from the center of the fused probability map 322.
In some examples, the image processing system 100 may generate the center prior map 326 by segmenting an original image (e.g., image 304) associated with the probability map 314 into a superpixel-based representation. The image processing system 100 may then calculate the distance of each superpixel to the center of the segmented image.
In some cases, the center prior map 326 may be computed from the distance between the average x-y coordinate coord(v_i) of each superpixel v_i and the image center O, and the resulting center prior may be integrated into each superpixel v_i to enhance the fused probability map 322. The principle in this example is that foreground objects/shapes are more likely to be placed around the center of the image. For illustration, the center prior CP of an image may assign each superpixel v_i a value that decreases as the distance from coord(v_i) to the image center O increases.
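A minimal sketch of one way to realize such a center prior and apply it at the multiplication 324, assuming a Gaussian falloff scaled by the image diagonal; the falloff form, the scale parameter, and the helper names are illustrative assumptions rather than the formula used in the patent.

```python
import numpy as np

def center_prior_map(superpixel_labels, sigma_frac=0.3):
    """Build a per-pixel center prior from superpixel centroids.

    superpixel_labels: 2D int array assigning each pixel to a superpixel v_i.
    sigma_frac: assumed falloff scale as a fraction of the image diagonal.
    """
    h, w = superpixel_labels.shape
    center = np.array([h / 2.0, w / 2.0])
    sigma = sigma_frac * np.hypot(h, w)
    cp = np.zeros((h, w), dtype=np.float32)
    for label in np.unique(superpixel_labels):
        mask = superpixel_labels == label
        ys, xs = np.nonzero(mask)
        centroid = np.array([ys.mean(), xs.mean()])   # coord(v_i)
        dist = np.linalg.norm(centroid - center)       # distance to center O
        cp[mask] = np.exp(-(dist ** 2) / (2 * sigma ** 2))
    return cp

# Weighting the fused probabilities by the center prior (multiply 324):
# refined = fused_prob_map * center_prior_map(superpixel_labels)
```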
The output of the multiplication of the fused probability map 322 and the center prior map 326 may be a refined probability map 328. The image processing system 100 may generate a segmentation mask 330 based on the refined probability map 328. In some examples, the image processing system 100 may further refine the refined probability map 328 to generate the segmentation mask 330. In some implementations, the image processing system 100 may perform the further refinement based on a fully connected Conditional Random Field (CRF), also referred to as denseCRF. For example, the image processing system 100 may pass the refined probability map 328 through a denseCRF module for further refinement.
In some cases, the image processing system 100 may use the refined probability map 328 and the image 304 as inputs for generating the segmentation mask 330. The image processing system 100 may use the image 304 and the refined probability map 328 to classify each pixel in the refined probability map 328 as a background pixel or a foreground pixel. The white portion of the segmentation mask 330 may correspond to pixels classified as foreground pixels and the black portion of the segmentation mask 330 may correspond to pixels classified as background pixels.
In some examples, the segmentation mask 330 may be a binary map representing background and foreground regions of an image. For example, to generate the segmentation mask 330, the image processing system 100 may binarize the refined probability map 328 to include 0 and 1 values corresponding to background pixels and foreground pixels, respectively.
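A minimal sketch of the binarization described above; the fixed 0.5 threshold is an assumption (the text does not specify a threshold value).

```python
import numpy as np

def binarize_probability_map(prob_map, threshold=0.5):
    # Pixels with probability above the threshold are labeled foreground (1),
    # the rest background (0), yielding a binary segmentation mask.
    return (prob_map > threshold).astype(np.uint8)
```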
In some cases, the image processing system 100 may use the segmentation mask 330 for foreground separation for certain image processing tasks and/or visual effects. In this example, the image processing system 100 may use the segmentation mask 330 to generate an output image 334 with a depth of field effect. The image processing system 100 may use the segmentation mask 330 and the image 304 to blur and synthesize 332 the output image 334. For example, the image processing system 100 may apply the segmentation mask 330 to the image 304 to blur the areas of the image 304 outside of the foreground (e.g., the target 306) and synthesize the output image 334 with the depth of field effect. In some cases, the image processing system 100 may blur the regions of the image 304 using a Gaussian function (e.g., a Gaussian blur).
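A sketch of the blur-and-composite step using a Gaussian blur, assuming the mask and image are aligned and the mask is binary; the kernel size, the OpenCV usage, and the alpha-compositing choice are assumptions, not the patent's implementation.

```python
import cv2
import numpy as np

def depth_of_field_effect(image, mask, ksize=(31, 31), sigma=0):
    """Blur background regions and keep the masked foreground in focus.

    image: HxWx3 uint8 image (e.g., image 304).
    mask:  HxW binary mask (1 = foreground), e.g., segmentation mask 330.
    """
    blurred = cv2.GaussianBlur(image, ksize, sigma)
    alpha = mask.astype(np.float32)[..., None]           # HxWx1 matte
    # Composite: foreground from the original image, background from the blur.
    output = alpha * image.astype(np.float32) + (1.0 - alpha) * blurred
    return output.astype(np.uint8)
```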
In some examples, the image processing system 100 may use the segmentation mask 330 and the depth information to generate the output image 334. For example, the image processing system 100 may calculate the depth of pixels or super-pixels in the image 304 and blur regions in the image 304 using the segmentation mask 330 and the depth information. In some cases, the image processing system 100 may perform depth prediction based on a pair of images 302 and 304. In other cases, image processing system 100 may perform depth prediction based on measurements obtained from one or more depth sensors (e.g., other sensors 106) for image 302 and/or image 304.
Although the flow 300 in fig. 3A is described with respect to two input images having different FOVs, other examples may use more than two input images and/or FOVs. For example, in some cases, the input images for flow 300 may include one or more wide-angle images generated using one or more wide-angle lenses, one or more super-wide-angle images generated using one or more super-wide-angle lenses, one or more tele images generated using one or more tele lenses (e.g., short tele lenses, medium tele lenses, super-tele lenses, etc.), one or more images with one or more different and/or the same FOVs (e.g., relative to one or more wide-angle images and/or tele images) generated using one or more zoom lenses, any other type of images generated from any other type of lens, and/or any combination thereof. The input images 302 and 304 in fig. 3A are non-limiting examples provided for illustration purposes.
Fig. 3B is a diagram illustrating another example flow 350 of image segmentation using images with different FOVs. In this example, images 352 and 354 capture a target 356 representing a foreground shape. Images 352 and 354 also capture a side object 360 that is not part of the foreground shape. Image 352 has a wider FOV than image 354, and image 354 has a larger ratio of target size to image size. The target 356 appears closer in image 354 than in image 352. Given its wider FOV (and smaller ratio of target size to image size), image 352 captures a larger portion of the foreground (e.g., the target 356), but also captures more background and/or additional information (e.g., information that does not correspond to the target 356). For example, the target 356 in the image 352 is shown within a bounding box 358, the bounding box 358 representing the content (e.g., the scene) captured by the image 354 (e.g., the complete image 354). Some or all of the information from the area outside of bounding box 358 (e.g., outside of image 354) represents the background and/or additional information captured by image 352.
As can be seen in this example, image 352 captures more background and/or additional information (e.g., more information outside bounding box 358) than image 354. However, while image 354 captures less background and/or additional information than image 352, given its narrower FOV, image 354 also captures less foreground information (e.g., less of the target 356). For example, the image 352 captures the complete outline of the target 356, while the image 354 cuts off or truncates a portion of the target 356. As further described herein, the flow 350 may utilize information from both images to generate more accurate and consistent segmentation results.
The image processing system 100 may use the image 352 and the image 354 to perform the foreground prediction 362. Based on the foreground prediction 362, the image processing system 100 may generate a probability map 364 for the image 352 and a probability map 366 for the image 354. The probability map 364 may contain the probability that each pixel or superpixel in the image 352 belongs to a target 356 to be identified as foreground and/or separated from background. The probability map 366 may contain the probability that each pixel or super-pixel in the image 354 belongs to a target 356 that is to be identified as foreground and/or separated from background.
In this example, foreground prediction 362 enables salient object detection to determine regions of interest (e.g., target 356) in images 352 and 354. In some examples, the image processing system 100 may perform foreground prediction 362, including salient object detection, using a Deep Neural Network (DNN), such as the neural network 122. Salient object detection may use a visual attention mechanism to determine a region of interest (e.g., target 356) in an image. For example, salient object detection may attempt to locate the most prominent/noticeable and eye-catching areas (e.g., objects, shapes, etc.) in an image.
In some examples, probability map 364 and probability map 366 may each contain a probability that each pixel belongs to target 356 to be separated from the background and/or other regions of images 352 and 354. In some cases, the probability may be expressed as saliency or include saliency (e.g., based on salient object detection), which may represent a likelihood of where the user's attention may be within the image.
As shown in fig. 3B, the probability map 364 generated by the foreground prediction 362 from the image 352 includes the entire target 356 and identifies it as part of the foreground, but also includes false positives from portions of the image 352 outside the contour of the target 356. For example, in addition to including the entire target 356 as part of the foreground, the probability map 364 also includes the side objects 360 as part of the foreground. The probability map 366 generated by the foreground prediction 362 from the image 354 contains less noise (e.g., fewer false positives) than the probability map 364, but does not include the entire target 356. For example, the probability map 366 does not include the side object 360 as part of the foreground, but lacks details of the portion 368 from the target 356 (e.g., part of the target user's arm). The probability map 364 is better able to recover and/or save details of the target 356, including a portion 368 of the target 356, but as previously noted, the probability map 364 also includes and identifies the side objects 360.
Because the probability map 364 and the probability map 366 complement each other, the image processing system 100 can use both the probability map 364 and the probability map 366 to generate a refined probability map and segmentation results, as described further herein. The image processing system 100 may crop and upsample 316 the probability map 364 to generate an output map 370 representing a probability map with the FOV of the image 354. The output map 370 may be inferred from the probability map 364 having the FOV of the image 352. Cropping and upsampling may reduce the size of probability map 364 and increase its spatial resolution to the same (or a similar) size and resolution as probability map 366. In some examples, based on the cropping and upsampling, the output map 370 may have the same (or a similar) size and resolution as the probability map 366 generated for the image 354.
The image processing system 100 may fuse 320 the probability map 366 and the output map 370 to generate a fused probability map 372. The fused probability map 372 may combine, include, reflect, and/or refine the foreground information from the probability map 366 generated from the image 354 and the foreground information from the output map 370 generated from the image 352. In some examples, to fuse 320 the probability map 366 and the output map 370, the image processing system 100 may perform element-wise addition based on values in the probability map 366 and the output map 370. In some cases, the image processing system 100 may average the values in the probability map 366 and the output map 370 to generate an average probability map.
As shown in fig. 3B, the side object 360 has a dim appearance and/or intensity in the fused probability map 372. The dim appearance and/or intensity of the side object 360 reflects a lesser amount of detail recovered for the side object 360 and/or a reduced probability that the side object 360 is part of the foreground region of interest (e.g., the target 356). When the output map 370 and the probability map 366 are fused, the probabilities (and/or recovered details) associated with the side object 360 in the probability map 364 and the output map 370 may be reduced based on the foreground prediction information in the probability map 366, which assigns a lower or zero probability to the side object 360 and/or recovers less detail of the side object 360.
The fused probability map 372 also recovers more detail of the portion 368 of the target 356 than the probability map 366. This is because the output map 370 (derived from the probability map 364), which includes details of the portion 368 of the target 356, is fused with the probability map 366. By fusing the probability map 366 and the output map 370, details of the portion 368 of the target 356 are recovered from the output map 370, and details of the side object 360, which was accurately excluded in the probability map 366, are reduced in the fused probability map 372.
The image processing system 100 may refine the fused probability map 372 to generate a segmentation mask 374. As shown in fig. 3B, the side object 360 has an even darker appearance and/or intensity in the segmentation mask 374, and the portion 368 of the target 356 includes additional detail (e.g., relative to the probability map 366 and the fused probability map 372). Here, the side object 360 has been gradually dimmed from the probability map 364 to the fused probability map 372 and the segmentation mask 374, and the details of the portion 368 of the target 356 have been gradually recovered from the probability map 366 to the fused probability map 372 and the segmentation mask 374.
In some implementations, the image processing system 100 can refine the fused probability map 372 based on a fully connected Conditional Random Field (CRF), also referred to as denseCRF. For example, the image processing system 100 may pass the fused probability map 372 through a denseCRF module for further refinement. In some cases, the image processing system 100 may use the fused probability map 372 and the image 354 as inputs for generating the segmentation mask 374. The image processing system 100 may use the image 354 and the fused probability map 372 to classify each pixel in the fused probability map 372 as a background pixel or a foreground pixel. The white portion of the segmentation mask 374 may correspond to pixels classified as foreground pixels and the black portion of the segmentation mask 374 may correspond to pixels classified as background pixels.
In some cases, the image processing system 100 may use the segmentation mask 374 to perform foreground separation for certain image processing tasks and/or visual effects. In this example, the image processing system 100 may use the segmentation mask 374 to generate an output image 378 having a depth of field effect. The image processing system 100 may use the segmentation mask 374 and the image 354 to blur and synthesize the output image 378. For example, the image processing system 100 may apply the segmentation mask 374 to the image 354 to blur areas of the image 354 that are outside of the foreground (e.g., the target 356) and synthesize the output image 378 with a depth of field effect. In some cases, the image processing system 100 may blur the regions of the image 354 using a Gaussian function (e.g., a Gaussian blur).
In some examples, the image processing system 100 may use the segmentation mask 374 and the depth information to generate the output image 378. For example, the image processing system 100 may calculate the depth of pixels or super-pixels in the image 354 and blur regions in the image 354 using the segmentation mask 374 and the depth information. In some cases, image processing system 100 may perform depth prediction based on a pair of images 352 and 354. In other cases, image processing system 100 may perform depth prediction based on measurements obtained from one or more depth sensors (e.g., other sensors 106) for image 352 and/or image 354.
Although the flow 350 in fig. 3B is described with respect to two input images having different FOVs, other examples may use more than two input images and/or FOVs. For example, in some cases, the input images for flow 350 may include one or more wide-angle images generated using one or more wide-angle lenses, one or more super-wide-angle images generated using one or more super-wide-angle lenses, one or more tele images generated using one or more tele lenses (e.g., short tele lenses, medium tele lenses, super-tele lenses, etc.), one or more images with one or more different and/or the same FOVs (e.g., relative to one or more wide-angle images and/or tele images) generated using one or more zoom lenses, any other type of images generated from any other type of lens, and/or any combination thereof. The input images 352 and 354 in fig. 3B are non-limiting examples provided for illustration purposes.
FIG. 4 is a diagram illustrating an example machine learning model 400 for foreground prediction. In this example, the machine learning model 400 may include a Deep Neural Network (DNN) 122 that processes the input image 402 to detect and separate foreground regions (e.g., foreground shapes, objects, pixels, etc.) in the input image 402. DNN 122 may generate segmentation mask 408 that identifies the detected foreground regions.
In some examples, DNN 122 may include multiple layers of interconnected nodes. Each node may represent a piece of information. Information associated with a node may be shared between layers. Each layer may retain information as it is processed. In some cases, each node or interconnection between nodes may have a weight, which is a set of parameters derived from training of DNN 122. For example, the interconnections between nodes may represent a piece of information learned about interconnecting nodes. The interconnect may have numerical weights that may be adjusted (e.g., based on a training data set) to allow DNN 122 to adapt to inputs and to learn as more and more data is processed.
DNN 122 may be pre-trained to process features from data associated with input image 402. In examples where DNN 122 is used to detect foreground regions in an image, DNN 122 may be trained using training data that includes image data. DNN 122 may be further trained as more input data, such as image data, is received. In some cases, DNN 122 may be trained using supervised learning and/or reinforcement training. When DNN 122 is trained, DNN 122 may adjust the weights and/or bias of the nodes to optimize their performance.
In some cases, DNN 122 may use a training process such as backpropagation to adjust the weights of the nodes. Backpropagation may include a forward pass, a loss function, a backward pass, and a weight update. The forward pass, loss function, backward pass, and parameter update are performed in one training iteration. For each set of training data (e.g., image data), the process may be repeated for a certain number of iterations until the weights of the layers in DNN 122 are accurately tuned.
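For illustration, a minimal sketch of a single backpropagation training iteration (forward pass, loss, backward pass, and weight update), assuming a PyTorch model that outputs per-pixel foreground probabilities and a binary cross-entropy loss; none of these choices are specific to DNN 122.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, images, target_masks):
    # One training iteration: forward pass, loss, backward pass, weight update.
    optimizer.zero_grad()
    predictions = model(images)                                # forward pass
    loss = F.binary_cross_entropy(predictions, target_masks)   # loss function
    loss.backward()                                            # backward pass
    optimizer.step()                                           # weight update
    return loss.item()
```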
DNN 122 may include any suitable neural network. One example includes a Convolutional Neural Network (CNN), which includes an input layer and an output layer with a plurality of hidden layers between the input layer and the output layer. The hidden layers of a CNN include a series of convolutional, nonlinearity, pooling, fully connected, and normalization layers. DNN 122 may include any other deep network, such as an autoencoder, a Deep Belief Network (DBN), a Recurrent Neural Network (RNN), and so forth.
In the example machine learning model 400 shown in fig. 4, DNN 122 has an encoder 404 and decoder 406 structure. The encoder 404 and decoder 406 each have multiple layers. For example, the encoder 404 may include multiple convolutional layers and pooling layers. The convolutional layers may analyze the input data. Each node of a convolutional layer may be connected to a region (e.g., pixels) of the nodes of the input data (e.g., input image 402). A convolutional layer may be considered as one or more filters (each corresponding to a different activation map or feature map), where each convolutional iteration of a filter is a node or neuron of the convolutional layer. Each connection between a node and its receptive field (the region (e.g., pixels) of the input corresponding to that node) learns a weight and, in some cases, an overall bias, so that each node learns to analyze its particular local receptive field in the input image 402.
In some examples, at each convolution iteration, the value of the filter may be multiplied with a corresponding number of original pixel values of the image. The multiplications from each convolution iteration may be added to obtain a sum for that iteration or node. The process may continue at the next location in the input data (e.g., image 402) based on the receptive field of the next node in the convolutional layer. Processing the filter at each unique location of the input volume may produce a number representing the filter results for that location, resulting in a sum value being determined for each node of the convolution layer.
In some cases, the convolutional layer may perform batch normalization to normalize the input to the layer for each batch of data. Batch normalization may make DNN 122 faster and more stable by adjusting and scaling the activations. The convolutional layer may also apply a layer activation function, such as a rectified linear unit (ReLU) function. In some examples, a pooling layer may be applied after some convolutional layers. The pooling layer may be used to simplify the information in the output from the convolutional layer. For example, the pooling layer may take each activation map output from the convolutional layer and generate a condensed activation map (or feature map) using a pooling function. Max pooling is one example of a function performed by the pooling layer. The pooling layer may use other forms of pooling functions, such as average pooling or other suitable pooling functions.
A pooling function (e.g., a max-pooling filter) may be applied to the activation maps included in the convolutional layer. The pooling function (e.g., max pooling) may reduce, aggregate, or combine the outputs or feature representations in an input (e.g., image 402). Max pooling (and other pooling methods) provides the advantage that there are fewer pooled features, which can reduce the number of parameters required in later layers.
The decoder 406 may include a plurality of upsampling layers, convolution layers, and softmax layers. The upsampling layer may increase the dimension of the input data. Each upsampling layer may be followed by a plurality of convolution layers that perform convolution on the upsampled data from the upsampling layer. In some examples, the last layer of decoder 406 may be a softmax layer. The softmax layer may perform a softmax function to perform classification. In some examples, the softmax layer may classify pixels, superpixels, and/or regions of the input image 402 as foreground or background. In some examples, the softmax layer may determine a foreground prediction probability. The output of the decoder 406 may be a segmentation mask 408 that provides a foreground prediction for the input image 402.
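A minimal encoder-decoder sketch using the layer types described above (convolution, batch normalization, ReLU, max pooling, upsampling, and a softmax-based per-pixel classification); the depth, channel counts, and framework (PyTorch) are illustrative assumptions and do not represent the actual architecture of DNN 122.

```python
import torch
import torch.nn as nn

class TinySegmentationNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: convolution + batch norm + ReLU blocks with max pooling.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Decoder: upsampling followed by convolutions, then per-pixel scores.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(16, 2, 3, padding=1),   # 2 classes: background/foreground
        )

    def forward(self, x):
        features = self.encoder(x)
        logits = self.decoder(features)
        # Softmax over the class dimension gives per-pixel foreground probability.
        return torch.softmax(logits, dim=1)[:, 1:2, :, :]
```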
FIG. 5 is a diagram illustrating an example foreground prediction process 500 based on manifold ranking. In this example, the image processing system 100 may extract superpixels 504 from an input image 502. For example, the image processing system 100 may segment or partition the image 502 into a plurality of superpixels. In some implementations, the image processing system 100 may use a superpixel segmentation algorithm (e.g., a Simple Linear Iterative Clustering (SLIC) algorithm that may perform local clustering of pixels) to extract the superpixels in the image 502. In some examples, superpixel extraction may help preserve image structure while abstracting away unnecessary detail. The superpixels may also be used as the computational units for the ranking described further below.
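A sketch of the superpixel extraction step, using scikit-image's SLIC implementation as an assumed stand-in; the segment count and compactness values are illustrative.

```python
import numpy as np
from skimage.segmentation import slic

def extract_superpixels(image, n_segments=300, compactness=10.0):
    """Partition an RGB image into superpixels via local clustering (SLIC).

    Returns a 2D label map assigning each pixel to a superpixel.
    """
    return slic(image, n_segments=n_segments, compactness=compactness)

# Example: mean color per superpixel (a simple per-superpixel feature p_i).
# labels = extract_superpixels(img)
# mean_colors = [img[labels == k].mean(axis=0) for k in np.unique(labels)]
```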
The superpixels may represent different segments or regions of image 502. In some cases, the superpixels may include groups of pixels in image 502. For example, a superpixel may comprise a set of homogeneous or nearly homogeneous pixels in image 502. In other examples, a superpixel may include a set of pixels having one or more characteristics such that when the superpixel is rendered, the superpixel may have one or more uniform or consistent characteristics, e.g., color, texture, brightness, semantics, etc. In some cases, super-pixels may provide a perceived grouping of pixels in an image.
The homogenous or nearly homogenous pixels referred to above may include pixels that are uniform, homogeneous, or substantially similar in texture, color, brightness, semantics, and/or any other characteristic. In some implementations, while an object in an image may be divided into a plurality of superpixels, a particular superpixel may not be divided by the boundary of the object. In some implementations, some or all of the pixels in a superpixel may be spatially correlated (e.g., spatially contiguous). For example, a pixel in a super-pixel may include adjacent or neighboring pixels from an image.
The image processing system 100 may compute queries 506 based on the superpixels extracted from the image 502. The image processing system 100 may create queries 508, 510, and 512 for generating probability maps 518, 520, and 522. In some examples, queries 508, 510, and 512 may include images with the extracted superpixels. Each image may also include a contrast superpixel region 514. The contrast superpixel region 514 may include superpixels marked by the image processing system 100 as having one or more distinguishing attributes (e.g., color, texture, brightness, and/or any other characteristics). The distinguishing attributes may help distinguish the superpixels in the contrast superpixel region 514 from other superpixels in queries 508, 510, and 512.
In the illustrative example of fig. 5, queries 508, 510, and 512 may include a top seed (e.g., query 508), a left seed (e.g., query 510), and a right seed (e.g., query 512). In other examples, queries 508, 510, and 512 may include more or fewer queries and/or other types of seeds, e.g., bottom seeds. The top seed (e.g., query 508) may include a contrasting super pixel area 514 on a top edge (e.g., perimeter) of the top seed. The contrast superpixel region 514 may include superpixels marked by a distinguishing color that distinguishes superpixels in the contrast superpixel region 514 from other superpixels in the top seed.
The left seed (e.g., query 510) may include a contrast superpixel region 514 on the left edge (e.g., perimeter) of the left seed. The contrast superpixel region 514 may include superpixels marked by a distinguishing color that distinguishes superpixels in the contrast superpixel region 514 from other superpixels in the left seed.
The right seed (e.g., query 512) may include a contrast superpixel region 514 on the right edge (e.g., perimeter) of the right seed. The contrast superpixel region 514 may include superpixels marked by a distinguishing color that distinguishes superpixels in the contrast superpixel region 514 from other superpixels in the right hand seed.
In some cases, the image processing system 100 may utilize the contrast differences between the contrast superpixel regions 514 in the queries 508, 510, and 512 and the rest of the queries 508, 510, and 512 (e.g., the other superpixels in the queries 508, 510, and 512) to estimate the probability maps 518, 520, and 522. In some examples, the image processing system 100 may perform manifold ranking to estimate the probability maps 518, 520, and 522. When performing manifold ranking, the image processing system 100 may use the contrast differences between the contrast superpixel regions 514 and the other superpixels in the queries 508, 510, and 512 to estimate the probability maps 518, 520, and 522. The contrast differences may help calculate the ranking scores for the probability maps 518, 520, and 522.
In some examples, the manifold ranking may include a graph-based ranking algorithm for calculating the probability maps 518, 520, and 522. For example, in some cases, the image processing system 100 may construct a graph over the queries 508, 510, and 512 to generate the manifold ranking results (e.g., probability maps 518, 520, and 522). In some cases, the image processing system 100 may construct a graph G = (V, E), where V and E respectively represent the vertices and edges of the graph over the image (e.g., query 508, query 510, query 512). In the graph G, each vertex v_i ∈ V may be defined as a superpixel. If superpixels v_i and v_j are spatially connected in the image, then an edge e_ij ∈ E may be added and weighted based on the feature distance calculated for the superpixels v_i and v_j. In some examples, the edge weight may be determined from the feature distance or similarity calculated for superpixels v_i and v_j, for example as follows:

a_ij = exp( −d(p_i, p_j)/σ_c − γ·d(q_i, q_j)/σ_s )    equation (2)

where p_i and q_i respectively represent the average color and the semantic representation of the pixels in superpixel v_i, p_j and q_j respectively represent the average color and the semantic representation of the pixels in superpixel v_j, d(·,·) is a feature distance (e.g., an ℓ2 distance), and γ indicates the weighting applied to the semantic features.
In some examples, in equation (2), the value of the constant σ_c may be set to the average pairwise distance between all superpixels under the color features, and the value of σ_s may be set to the average pairwise distance between all superpixels under the semantic features. It should be noted that color and semantic features are non-limiting examples provided herein for purposes of explanation, and that other implementations may utilize more, fewer, or different types of features. For example, in some cases, equation (2) may be implemented using only one type of feature (e.g., color or semantic features), or a different combination of features such as color, semantic, and/or texture features.
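A sketch of building the affinity matrix over spatially connected superpixels using the color/semantic weighting described above; the adjacency input, the exact exponential form, and the default γ are assumptions for illustration.

```python
import numpy as np

def build_affinity(features_color, features_sem, adjacency, gamma=0.5):
    """Affinity a_ij over superpixel pairs connected in the image.

    features_color: (N, Dc) mean color per superpixel (p_i).
    features_sem:   (N, Ds) semantic representation per superpixel (q_i).
    adjacency:      (N, N) boolean matrix, True where v_i and v_j are
                    spatially connected in the image.
    """
    def pairwise_l2(feats):
        diff = feats[:, None, :] - feats[None, :, :]
        return np.linalg.norm(diff, axis=-1)

    dist_c = pairwise_l2(features_color)
    dist_s = pairwise_l2(features_sem)
    # Scale constants: average pairwise distance under each feature type.
    sigma_c = dist_c.mean()
    sigma_s = dist_s.mean()
    affinity = np.exp(-dist_c / sigma_c - gamma * dist_s / sigma_s)
    return np.where(adjacency, affinity, 0.0)   # keep edges between neighbors only
```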
In some cases, the image processing system 100 may construct, for the graph G, an affinity matrix A = [a_ij] of size N×N and a corresponding Laplacian matrix, where N represents the total number of superpixels in the image. Further, in some implementations, the image processing system 100 can infer labels for the nodes (e.g., superpixels) on the graph based on the manifold structure of the data (e.g., the image). In a given data set X = {x_1, ..., x_N}, where M represents the feature dimension of each data point, some data points may be labeled as queries, and the remaining data points may be ranked according to their relevance to the queries.
For example, let f: X → [0, 1] be a ranking function that assigns a ranking value f_i to each data point x_i, where 0 is the background data point value and 1 is the foreground data point value. Here, f can be regarded as a vector f = [f_1, ..., f_N]^T. In addition, let y = [y_1, ..., y_N]^T denote an indication vector, in which y_i = 1 if x_i is a query and y_i = 0 otherwise. Given the graph G, the degree matrix may be D = diag{d_11, ..., d_NN}, where d_ii = Σ_j a_ij, and μ is a weighting constant. In this example, the optimal ranking of the queries may be calculated by solving the following minimization or optimization problem:

f* = argmin_f (1/2) ( Σ_{i,j} a_ij · || f_i/√d_ii − f_j/√d_jj ||² + μ · Σ_i || f_i − y_i ||² )    equation (3)
Solving equation (3) above can help ensure that similar data points are assigned similar ranking values while keeping the ranking result close to the original indication vector y. In some examples, the minimum solution may be calculated by setting the derivative of equation (3) to zero. In some cases, the closed-form solution of the ranking function may be expressed as follows:

f* = (D − αA)^(−1) y    equation (4)

In some cases, an indication vector y may be formed, and the indication vector y may be used to calculate the ranking vector f* using equation (4).
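A sketch of the closed-form ranking of equation (4), assuming an affinity matrix such as the one built in the earlier sketch and an indication vector y marking the seed superpixels; the value of α and the normalization step are assumptions.

```python
import numpy as np

def manifold_ranking(affinity, query_indicator, alpha=0.99):
    """Closed-form ranking f* = (D - alpha * A)^(-1) y  (equation (4)).

    affinity:        (N, N) affinity matrix A over superpixels.
    query_indicator: (N,) vector y with 1 for query superpixels, else 0.
    """
    degree = np.diag(affinity.sum(axis=1))        # D with d_ii = sum_j a_ij
    ranking = np.linalg.solve(degree - alpha * affinity, query_indicator)
    # Normalize to [0, 1] so the scores can be used as foreground probabilities.
    ranking = (ranking - ranking.min()) / (ranking.max() - ranking.min() + 1e-12)
    return ranking
```

Depending on whether the seeded region is treated as foreground or background, the normalized ranking or its complement may be used as the per-superpixel foreground probability.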
The manifold ranking may utilize spatial information to infer the potential foreground (e.g., probability maps 518, 520, and 522) of the image 502. In some cases, the manifold ranking may also incorporate cues obtained from a user to help predict the foreground, thereby helping predict truly salient objects. For example, the manifold ranking may use information provided by the user about the image to be segmented, e.g., an indication of one or more shapes and/or objects in the image, an indication of one or more background regions, an indication of one or more foreground regions, and/or any other information about the image, image content, and/or image characteristics.
Having disclosed example systems, techniques, and concepts, the present disclosure now turns to an example process 600 shown in fig. 6. The steps outlined herein are examples and may be implemented in any combination thereof, including combinations that exclude, add or modify particular steps.
Process 600 may be implemented to perform various image processing tasks and/or effects. For example, in some cases, process 600 may be implemented to perform segmentation (e.g., foreground prediction) and produce effects based on the image segmentation, such as depth of field effects, chroma keying effects, image stylization effects, artistic effects, computational photography effects, and the like. In other examples, process 600 may be implemented to perform other image segmentation-based effects or processing tasks such as, but not limited to, feature extraction, image recognition (e.g., object or face recognition), machine vision, XR, automation, and/or any other image segmentation-based effects and/or processing tasks.
At block 602, the process 600 may include obtaining a first image (e.g., image 302 or 352) of a target (e.g., target 306 or 356) and a second image (e.g., image 304 or 354) of the target. The first image may have a first FOV and the second image may have a second FOV different from the first FOV.
In some examples, the first and second images may be captured using different image capture devices (e.g., image capture devices 102 and 104) having different FOVs or focal lengths. In some examples, different image sensors and/or lenses may be used to capture the first image and the second image. Different image sensors and/or lenses may produce images having different FOVs. In some cases, different lenses may be used to capture the first and second images, and the different lenses may include one or more tele lenses, wide-angle lenses, ultra-wide-angle lenses, standard lenses, and/or zoom lenses. In some cases, the same image capture device may be used to capture the first image and the second image. For example, the same image capture device may capture the first image and the second image using different FOVs.
At block 604, the process 600 may include determining a first segmentation map (e.g., the first probability map 312 or 364) based on the first image, the first segmentation map including foreground predictors associated with a first estimated foreground region in the first image.
In some examples, determining the first segmentation map may include: extracting superpixels from the first image; generating a set of image queries (e.g., queries 508, 510, 512), each image query including the extracted superpixels and a different boundary region of superpixels marked by one or more distinguishing pixel attributes (e.g., color, texture, brightness, etc.); generating a set of separate probability maps (e.g., probability maps 518, 520, 522) based on the set of image queries; and generating a first segmentation map by multiplying the set of segmentation probability maps.
In some examples, the segmentation probability map for each image query is generated using one or more manifold ranking functions to estimate a foreground probability based at least in part on differences between the different boundary regions of superpixels and one or more other regions of superpixels associated with the image query.
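A minimal sketch of combining the per-query probability maps (e.g., maps for the top, left, and right seeds) into a single segmentation map by element-wise multiplication, assuming the maps are already the same size; the function name is an assumption.

```python
import numpy as np

def combine_query_maps(query_maps):
    # Element-wise product of the per-query probability maps
    # (e.g., maps 518, 520, and 522) to form the segmentation map.
    combined = np.ones_like(query_maps[0])
    for prob_map in query_maps:
        combined = combined * prob_map
    return combined
```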
At block 606, the process 600 may include determining a second segmentation map (e.g., the second probability map 314 or 366) based on the second image, the second segmentation map including foreground predictors associated with a second estimated foreground region in the second image.
In some examples, determining the second segmentation map may include: extracting superpixels from the second image; generating additional sets of image queries (e.g., queries 508, 510, 512), each additional image query including a different boundary region of superpixels extracted from the second image and superpixels marked by one or more distinguishing pixel attributes (e.g., color, texture, brightness, etc.); generating an additional set of segmentation probability maps (e.g., probability maps 518, 520, 522) based on the additional set of image queries; and generating a second segmentation map by multiplying the additional set of segmentation probability maps.
In some examples, a segmentation probability map for each additional image query is generated using one or more manifold ranking functions to estimate a foreground probability based at least in part on differences between different boundary regions of the superpixels and one or more other regions of the superpixels associated with the additional image query.
In some examples, the first segmentation map and/or the second segmentation map may be generated using a deep neural network (e.g., DNN 122). In some cases, the first segmentation map and/or the second segmentation map may be generated using a trained neural network. In some cases, the trained neural network may be a deep neural network. In other examples, the first segmentation map and/or the second segmentation map may be generated using non-DNN techniques such as graph-based manifold ranking.
At block 608, the process 600 may include generating a third segmentation map (e.g., the inferred second probability map 318 or the output map 370) based on the first segmentation map and the second segmentation map. In some examples, the third segmentation map may include the first segmentation map modified according to a size of the second image and/or a resolution of the second image.
In some examples, generating the third segmentation map may include cropping the first segmentation map to a size of the second image. In some examples, generating the third segmentation map may further include upsampling the first segmentation map according to a resolution of the second image after cropping the first segmentation map.
At block 610, the process 600 may include generating a refined segmentation mask (e.g., the refined segmentation mask 330 or 374) using the second segmentation map and the third segmentation map, the refined segmentation mask identifying at least a portion of the object as a foreground region of the first image and/or the second image.
In some examples, generating the refined segmentation mask may include generating a fused segmentation map (e.g., fused probability map 322 or fused probability map 372) by fusing (e.g., combining, cascading, merging, etc.) the second segmentation map and the third segmentation map, and generating the refined segmentation mask based on the fused segmentation map. In some examples, a denseCRF model may be applied to the fused segmentation map to generate a refined segmentation mask. In some cases, fusing the second and third partition maps may include averaging the foreground predicted values in the second partition map and the foreground predicted values in the third partition map. In some examples, the fused segmentation map may include an average foreground prediction value.
In some cases, generating the fused segmentation map may include fusing the second segmentation map and the third segmentation map, and multiplying a center prior map (e.g., center prior map 326) with the fused output of the second segmentation map and the third segmentation map.
In some aspects, the process 600 may include calculating a foreground probability for superpixels in the first image and/or the second image, and generating a center prior map based on the foreground probabilities. In some examples, each foreground probability is calculated based on the distance of the associated superpixel to the center of the image.
In some aspects, the process 600 may include generating an edited image based on the refined segmentation mask. In some examples, the edited image may be based on the first image or the second image. In some cases, the edited image may include one or more effects such as, but not limited to, a visual effect, an augmented reality effect, an image processing effect, a blur effect, an image recognition effect, a segmentation effect, a computer graphics effect, a chroma-keying effect, an image stylization effect, and the like. In an illustrative example, the edited image may include a blur effect.
In some aspects, the process 600 may include generating an output image (e.g., the output image 334 or 378) that includes the first image or the second image modified to include the blurring effect. In some examples, the blurring effect may be based on a refined segmentation mask. In some cases, generating the output image may include applying a blurring effect to one or more image regions (e.g., one or more background regions) located outside the foreground region based on the refined segmentation mask. In some examples, the blurring effect may include a depth of field effect in which one or more image regions are at least partially blurred and a foreground region is at least partially focused.
In some cases, depth information and/or measurements from one or more sensors (e.g., one or more LIDARs, radars, light sensors, depth sensors, etc.) may be used to generate the depth of field effect. In some examples, the depth information and/or measurements may be derived using stereo vision, LIDAR techniques, and/or Phase Detection (PD).
In some aspects, the process 600 may include determining a fourth segmentation map based on the third image. In some examples, the fourth segmentation map may include foreground values associated with the estimated foreground region in the third image. The third image may be captured from the same or a different image capturing device as the first and/or second image. In some examples, process 600 may include generating the segmentation map in response to a user selection of an imaging mode (e.g., portrait mode, green mask, etc.).
Although the process 600 is described with respect to first and second images having different FOVs, other examples may use more than two images and/or FOVs. For example, in some cases, process 600 may use: one or more wide-angle images generated using one or more wide-angle lenses, one or more super-wide-angle images generated using one or more super-wide-angle lenses, one or more tele images generated using one or more tele lenses (e.g., short tele lenses, medium tele lenses, super-tele lenses, etc.), one or more images with one or more different and/or same FOVs (e.g., relative to one or more wide-angle images and/or tele images) generated using one or more standard lenses, one or more zoom images generated using one or more zoom lenses, any other type of image generated from any other type of lens, and/or any combination thereof.
In some examples, process 600 may be performed by one or more computing devices or apparatuses. In one illustrative example, process 600 may be performed by image processing system 100 shown in fig. 1 and/or one or more computing devices having computing device architecture 700 shown in fig. 7. In some cases, such computing devices or apparatus may include a processor, microprocessor, microcomputer, or other component of a device configured to perform the steps of process 600. In some examples, such a computing device or apparatus may include one or more sensors configured to capture image data. For example, the computing device may include a smart phone, a head mounted display, a mobile device, a camera, a tablet computer, or other suitable device. In some examples, such a computing device or apparatus may include a camera configured to capture one or more images or videos. In some cases, such a computing device may include a display for displaying images. In some examples, one or more sensors and/or cameras are separate from the computing device, in which case the computing device receives the sensed data. Such computing devices may also include a network interface configured to communicate data.
Components of the computing device may be implemented in circuitry. For example, the components may include and/or be implemented using electronic circuitry or other electronic hardware, which may include one or more programmable electronic circuits (e.g., microprocessors, graphics Processing Units (GPUs), digital Signal Processors (DSPs), central Processing Units (CPUs), and/or other suitable electronic circuits), and/or may include and/or be implemented using computer software, firmware, or any combination thereof, to perform the various operations described herein. The computing device may also include a display (as an example of an output device or in addition to an output device), a network interface configured to communicate and/or receive data, any combination thereof, and/or other components. The network interface may be configured to communicate and/or receive Internet Protocol (IP) based data or other types of data.
Process 600 is illustrated as a logic flow diagram whose operations represent a sequence of operations that may be implemented in hardware, computer instructions, or a combination thereof. In the context of computer instructions, operations represent computer-executable instructions stored on one or more computer-readable storage media that, when executed by one or more processors, perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures, etc. that perform particular functions or implement particular data types. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement a process.
Furthermore, the process 600 may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware, or a combination thereof. As noted above, the code may be stored on a computer-readable or machine-readable storage medium, for example, in the form of a computer program comprising a plurality of instructions executable by one or more processors. The computer-readable or machine-readable storage medium may be non-transitory.
FIG. 7 illustrates an example computing device architecture 700 of an example computing device that can implement the various techniques described herein. For example, computing device architecture 700 may implement at least some portions of image processing system 100 shown in FIG. 1. The components of computing device architecture 700 are shown in electrical communication with each other using connections 705 (e.g., buses). The example computing device architecture 700 includes a processing unit (CPU or processor) 710 and a computing device connection 705, the computing device connection 705 coupling various computing device components including a computing device memory 715, such as a read-only memory (ROM) 720 and a Random Access Memory (RAM) 725, to the processor 710.
The computing device architecture 700 may include a cache that is directly connected to the processor 710, immediately adjacent to the processor 710, or integrated as part of the processor 710. The computing device architecture 700 may copy data from the memory 715 and/or the storage device 730 to the cache 712 for quick access by the processor 710. In this way, the cache may provide a performance boost that avoids processor 710 delays while waiting for data. These and other modules may control the processor 710 or be configured to control the processor 710 to perform various actions. Other computing device memory 715 may also be available for use. The memory 715 may include a plurality of different types of memory having different performance characteristics.
Processor 710 may include any general purpose processor and hardware or software services, such as service 1 732, service 2 734, and service 3 736 stored in storage 730, configured to control processor 710 and special purpose processors in which software instructions are incorporated into the processor design. Processor 710 may be a self-contained system including multiple cores or processors, buses, memory controllers, caches, and the like. The multi-core processor may be symmetrical or asymmetrical.
To enable a user to interact with the computing device architecture 700, the input device 745 may represent any number of input mechanisms, such as a microphone for voice, a touch screen for gesture or graphical input, a keyboard, a mouse, motion input, voice, and so forth. The output device 735 may also be one or more of a number of output mechanisms known to those skilled in the art, such as a display, projector, television, speaker device. In some instances, the multi-modal computing device may enable a user to provide multiple types of inputs to communicate with the computing device architecture 700. Communication interface 740 may generally control and manage user inputs and computing device outputs. There is no limitation on the operation on any particular hardware arrangement, and therefore, the basic features herein may be easily replaced when developing an improved hardware or firmware arrangement.
Storage device 730 is a non-volatile memory and may be a hard disk or another type of computer-readable medium that can store data accessible by a computer, such as magnetic cassettes, flash memory cards, solid state memory devices, digital versatile disks, Random Access Memory (RAM) 725, Read Only Memory (ROM) 720, and hybrids thereof. Storage device 730 may include services 732, 734, 736 for controlling processor 710. Other hardware or software modules are also contemplated. Storage device 730 may be connected to the computing device connection 705. In one aspect, a hardware module that performs a particular function may include software components stored on a computer-readable medium in combination with the necessary hardware components, such as the processor 710, connection 705, output device 735, and so forth, to carry out the function.
The term "computer-readable medium" includes, but is not limited to, portable or non-portable storage devices, optical storage devices, and various other media capable of storing, containing, or carrying instruction(s) and/or data. Computer-readable media may include non-transitory media in which data may be stored and does not include carrier waves and/or transitory electronic signals propagating wirelessly or through a wired connection. Examples of non-transitory media may include, but are not limited to, magnetic disks or tapes, optical storage media such as Compact Discs (CDs) or Digital Versatile Discs (DVDs), flash memory, or memory devices. The computer readable medium may have code and/or machine executable instructions stored thereon, which may represent procedures, functions, subroutines, programs, routines, subroutines, modules, software packages, classes, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, etc.
In some embodiments, the computer readable storage devices, media, and memory may comprise a cable or wireless signal comprising a bit stream or the like. However, when referred to, non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals themselves.
Specific details are provided in the above description to provide a thorough understanding of the embodiments and examples provided herein. However, it will be understood by those of ordinary skill in the art that the embodiments may be practiced without these specific details. For clarity of explanation, in some examples, the present technology may be presented as comprising individual functional blocks including devices, device components, steps or routines in a method embodied in software, or a combination of hardware and software. Other components may be used in addition to those shown in the figures and/or described herein. For example, circuits, systems, networks, processes, and other components may be shown in block diagram form in order not to obscure the embodiments in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments.
Individual embodiments may be described above as a process or method, which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. Further, the order of the operations may be rearranged. The process terminates when its operation is completed, but may have additional steps not included in the figure. A process may correspond to a method, a function, a procedure, a subroutine, etc. When a process corresponds to a function, its termination may correspond to the function returning to the calling function or the main function.
The processes and methods according to the examples above may be implemented using computer-executable instructions that are stored in or otherwise available from a computer-readable medium. Such instructions may include, for example, instructions and data which cause or otherwise configure a general purpose computer, special purpose computer, or processing device to perform a certain function or group of functions. Portions of the computer resources used may be accessible over a network. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, firmware, source code, and the like. Examples of computer-readable media that may be used to store instructions, information used, and/or information created during methods according to the described examples include magnetic or optical disks, flash memory, USB devices provided with non-volatile memory, networked storage devices, and so on.
Devices implementing processes and methods according to these disclosures may include hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof, and may take any of a variety of form factors. When implemented in software, firmware, middleware, or microcode, the program code or code segments to perform the necessary tasks (e.g., a computer program product) may be stored in a computer-readable or machine-readable medium. A processor may perform the necessary tasks. Typical examples of form factors include laptop computers, smart phones, mobile phones, tablet devices or other small form factor personal computers, personal digital assistants, rackmount devices, standalone devices, and so on. The functionality described herein may also be embodied in peripherals or plug-in cards. By way of further example, such functionality may also be implemented on a circuit board among different chips or different processes executing in a single device.
Instructions, media for conveying such instructions, computing resources for executing them, and other structures for supporting such computing resources are example means for providing the functions described in this disclosure.
In the foregoing description, aspects of the present application have been described with reference to specific embodiments thereof, but those skilled in the art will recognize that the present application is not so limited. Thus, while illustrative embodiments of the application have been described in detail herein, it should be understood that the inventive concepts may be otherwise variously embodied and employed, and that the appended claims are intended to be construed to include such variations unless limited by the prior art. The various features and aspects of the above-described applications may be used singly or in combination. Moreover, embodiments may be used in any number of environments and applications beyond those described herein without departing from the broader spirit and scope of the specification. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. For purposes of illustration, the methods are described in a particular order. It should be understood that in alternative embodiments, the methods may be performed in an order different than that described.
It will be understood by those of ordinary skill that less than ("<") and greater than (">") symbols or terminology used herein may be replaced with less than or equal to ("≤") and greater than or equal to ("≥") symbols, respectively, without departing from the scope of the present description.
Where a component is described as "configured to" perform certain operations, such configuration can be accomplished, for example, by designing electronic circuits or other hardware to perform the operation, by programming programmable electronic circuits (e.g., a microprocessor or other suitable electronic circuitry) to perform the operation, or any combination thereof.
The phrase "coupled to" refers to any component being physically connected to another component, either directly or indirectly, and/or any component being in communication with another component, either directly or indirectly (e.g., being connected to another component through a wired or wireless connection and/or other suitable communication interface).
Claim language referring to "at least one of" a collection and/or "one or more" of a collection indicates that one member of the collection or multiple members of the collection (in any combination) satisfy the claim. For example, claim language referring to "at least one of A and B" or "at least one of A or B" means A, B, or A and B. In another example, claim language referring to "at least one of A, B, and C" or "at least one of A, B, or C" means A, B, C, or A and B, or A and C, or B and C, or A and B and C. The language "at least one of" a collection and/or "one or more" of a collection does not limit the collection to the items listed in the collection. For example, claim language referring to "at least one of A and B" or "at least one of A or B" may mean A, B, or A and B, and may additionally include items not listed in the set of A and B.
The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the examples disclosed herein may be implemented as electronic hardware, computer software, firmware, or combinations thereof. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The techniques described herein may also be implemented in electronic hardware, computer software, firmware, or any combination thereof. The techniques may be implemented in any of various devices (e.g., a general purpose computer, a wireless communication device handset, or an integrated circuit device having multiple uses including applications in wireless communication device handsets and other devices). Any features described as modules or components may be implemented together in an integrated logic device or separately as discrete but interoperable logic devices. If implemented in software, the techniques may be realized at least in part by a computer readable data storage medium comprising program code that includes instructions that, when executed, perform one or more of the methods, algorithms, and/or operations described above. The computer readable data storage medium may form part of a computer program product, which may include packaging material. The computer-readable medium may include memory or data storage medium such as Random Access Memory (RAM) (e.g., synchronous Dynamic Random Access Memory (SDRAM)), read-only memory (ROM), non-volatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), flash memory, magnetic or optical data storage medium, and the like. Additionally or alternatively, the techniques may be realized at least in part by a computer-readable communication medium that carries or communicates program code in the form of instructions or data structures and that can be accessed, read, and/or executed by a computer, such as a propagated signal or wave.
The program code may be executed by a processor, which may include one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Such a processor may be configured to perform any of the techniques described in this disclosure. A general purpose processor may be a microprocessor; but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Accordingly, the term "processor" as used herein may refer to any of the foregoing structures, any combination of the foregoing structures, or any other structure or device suitable for implementation of the techniques described herein.
Illustrative aspects of the present disclosure include:
aspect 1: an apparatus for processing one or more images, comprising: a memory; and one or more processors coupled to the memory, the one or more processors configured to: obtaining a first image of an object and a second image of the object, the first image having a first field of view (FOV) and the second image having a second FOV different from the first FOV; determining a first segmentation map based on the first image, the first segmentation map comprising foreground values associated with a first estimated foreground region in the first image; determining a second segmentation map based on the second image, the second segmentation map comprising foreground values associated with a second estimated foreground region in the second image; generating a third segmentation map based on the first segmentation map and the second segmentation map; and generating a refined segmentation mask using the second segmentation map and the third segmentation map, the refined segmentation mask identifying at least a portion of the object as a foreground region of at least one of the first image and the second image.
Aspect 2: the apparatus of aspect 1, wherein generating the refined segmentation mask comprises: generating a fused segmentation map by fusing the second segmentation map and the third segmentation map; and generating the refined segmentation mask based on the fused segmentation map.
Aspect 3: the apparatus of any one of aspects 1-2, wherein the one or more processors are configured to: an edited image is generated based on the refined segmentation mask.
Aspect 4: the apparatus of aspect 3, wherein the edited image is based on one of the first image or the second image.
Aspect 5: the apparatus of any of aspects 3 or 4, wherein the edited image includes at least one of: visual effects, augmented reality effects, image processing effects, blur effects, image recognition effects, segmentation effects, computer graphics effects, chromakeying effects, and image stylization effects.
Aspect 6: the apparatus of any of aspects 3 or 4, wherein the edited image includes a blur effect.
Aspect 7: the apparatus of any one of aspects 1 to 6, wherein generating the third segmentation map comprises: clipping the first segmentation map to the size of the second image; and upsampling the first segmentation map according to the resolution of the second image after cropping the first segmentation map.
Aspect 8: the apparatus of any of aspects 1-7, wherein the first segmentation map and the second segmentation map are generated using a trained neural network.
Aspect 9: the apparatus of aspect 8, wherein the trained neural network comprises a deep neural network.
Aspect 10: the apparatus of any one of aspects 1-9, wherein the first image is received from a first camera and the second image is received from a second camera.
Aspect 11: the apparatus of aspect 10, further comprising the first camera and the second camera.
Aspect 12: the apparatus of any of aspects 10 or 11, wherein the first camera is responsive to visible light and the second camera is responsive to infrared light.
Aspect 13: the apparatus of any one of aspects 1-12, wherein the first image and the second image are received from a first camera.
Aspect 14: the apparatus of any one of aspects 1-13, wherein the one or more processors are configured to: a fourth segmentation map is determined based on a third image, the fourth segmentation map including foreground values associated with a third estimated foreground region in the third image.
Aspect 15: the apparatus of aspect 14, wherein the first image is received from a first camera, the second image is received from a second camera, and the third image is received from a third camera.
Aspect 16: the apparatus of any one of aspects 1-15, wherein the one or more processors are configured to: the segmentation mask is generated in response to a user selection of the imaging mode.
Aspect 17: a method, comprising: obtaining a first image of an object and a second image of the object, the first image having a first field of view (FOV) and the second image having a second FOV different from the first FOV; determining a first segmentation map based on the first image, the first segmentation map comprising foreground predictors associated with a first estimated foreground region in the first image; determining a second segmentation map based on the second image, the second segmentation map comprising foreground predictors associated with a second estimated foreground region in the second image; generating a third segmentation map based on the first segmentation map and the second segmentation map; and generating a refined segmentation mask using the second segmentation map and the third segmentation map, the refined segmentation mask identifying at least a portion of the object as a foreground region of at least one of the first image and the second image.
Aspect 18: the method of aspect 17, wherein generating the refined segmentation mask comprises: generating a fused segmentation map by fusing the second segmentation map and the third segmentation map; and generating the refined segmentation mask based on the fused segmentation map.
Aspect 19: the method of any one of aspects 17 to 18, further comprising: an edited image is generated based on the refined segmentation mask.
Aspect 20: the method of claim 19, wherein the edited image is based on one of the first image or the second image.
Aspect 21: the method of any of claims 19 or 20, wherein the edited image includes at least one of: visual effects, augmented reality effects, image processing effects, blur effects, image recognition effects, segmentation effects, computer graphics effects, chromakeying effects, and image stylization effects.
Aspect 22: the method of any of aspects 19 or 20, wherein the edited image includes a blur effect.
Aspect 23: the method of any of aspects 17-22, wherein generating the third segmentation map comprises: clipping the first segmentation map to the size of the second image; and upsampling the first segmentation map according to the resolution of the second image after cropping the first segmentation map.
Aspect 24: the method of any of claims 17 to 23, wherein the first segmentation map and the second segmentation map are generated using a trained neural network.
Aspect 25: the method of any of claims 17 to 24, wherein the trained neural network comprises a deep neural network.
Aspect 26: the method of any of claims 17-25, wherein the first image is received from a first camera and the second image is received from a second camera.
Aspect 27: the method of aspect 26, wherein the first camera is responsive to visible light and the second camera is responsive to infrared light.
Aspect 28: the method of any one of aspects 17 to 27, further comprising: a fourth segmentation map is determined based on a third image, the fourth segmentation map including foreground values associated with a third estimated foreground region in the third image.
Aspect 29: the method of any of aspects 17-28, wherein at least one of the first, second, and third segmentation maps is generated in response to a user selection of an imaging mode.
Aspect 30: a non-transitory computer-readable storage medium having stored thereon instructions that, when executed by one or more processors, cause the one or more processors to perform operations according to any of aspects 1 to 29.
Aspect 31: an apparatus comprising one or more means for performing the operations of any one of aspects 1 to 29.

Claims (30)

1. An apparatus for processing one or more images, comprising:
a memory; and
one or more processors coupled to the memory, the one or more processors configured to:
obtain a first image of an object and a second image of the object, the first image having a first field of view (FOV) and the second image having a second FOV different from the first FOV;
determine a first segmentation map based on the first image, the first segmentation map comprising foreground values associated with a first estimated foreground region in the first image;
determine a second segmentation map based on the second image, the second segmentation map comprising foreground values associated with a second estimated foreground region in the second image;
generate a third segmentation map based on the first segmentation map and the second segmentation map; and
generate a refined segmentation mask using the second segmentation map and the third segmentation map, the refined segmentation mask identifying at least a portion of the object as a foreground region of at least one of the first image and the second image.
2. The apparatus of claim 1, wherein generating the refined segmentation mask comprises:
generating a fused segmentation map by fusing the second segmentation map and the third segmentation map; and
generating the refined segmentation mask based on the fused segmentation map.
3. The apparatus of claim 1, wherein the one or more processors are configured to:
generate an edited image based on the refined segmentation mask.
4. The apparatus of claim 3, wherein the edited image is based on one of the first image or the second image.
5. The apparatus of claim 3, wherein the edited image comprises at least one of: visual effects, augmented reality effects, image processing effects, blur effects, image recognition effects, segmentation effects, computer graphics effects, chromakeying effects, and image stylization effects.
6. The apparatus of claim 3, wherein the edited image includes a blur effect.
7. The apparatus of claim 1, wherein generating the third segmentation map comprises:
cropping the first segmentation map to the size of the second image; and
upsampling the first segmentation map according to the resolution of the second image after cropping the first segmentation map.
8. The apparatus of claim 1, wherein the first segmentation map and the second segmentation map are generated using a trained neural network.
9. The apparatus of claim 8, wherein the trained neural network comprises a deep neural network.
10. The apparatus of claim 1, wherein the first image is received from a first camera and the second image is received from a second camera.
11. The apparatus of claim 10, further comprising the first camera and the second camera.
12. The apparatus of claim 11, wherein the first camera is responsive to visible light and the second camera is responsive to infrared light.
13. The apparatus of claim 1, wherein the first image and the second image are received from a first camera.
14. The apparatus of claim 1, wherein the one or more processors are configured to:
determine a fourth segmentation map based on a third image, the fourth segmentation map including foreground values associated with a third estimated foreground region in the third image.
15. The apparatus of claim 14, wherein the first image is received from a first camera, the second image is received from a second camera, and the third image is received from a third camera.
16. The apparatus of claim 1, wherein the one or more processors are configured to generate the segmentation mask in response to a user selection of an imaging mode.
17. A method for processing one or more images, comprising:
obtaining a first image of an object and a second image of the object, the first image having a first field of view (FOV) and the second image having a second FOV different from the first FOV;
determining a first segmentation map based on the first image, the first segmentation map comprising foreground predictors associated with a first estimated foreground region in the first image;
determining a second segmentation map based on the second image, the second segmentation map comprising foreground predictors associated with a second estimated foreground region in the second image;
generating a third segmentation map based on the first segmentation map and the second segmentation map; and
generating a refined segmentation mask using the second segmentation map and the third segmentation map, the refined segmentation mask identifying at least a portion of the object as a foreground region of at least one of the first image and the second image.
18. The method of claim 17, wherein generating the refined segmentation mask comprises:
generating a fused segmentation map by fusing the second segmentation map and the third segmentation map; and
generating the refined segmentation mask based on the fused segmentation map.
19. The method of claim 17, further comprising:
generating an edited image based on the refined segmentation mask.
20. The method of claim 19, wherein the edited image is based on one of the first image or the second image.
21. The method of claim 19, wherein the edited image comprises at least one of: visual effects, augmented reality effects, image processing effects, blur effects, image recognition effects, segmentation effects, computer graphics effects, chromakeying effects, and image stylization effects.
22. The method of claim 19, wherein the edited image includes a blur effect.
23. The method of claim 17, wherein generating the third segmentation map comprises:
cropping the first segmentation map to the size of the second image; and
upsampling the first segmentation map according to the resolution of the second image after cropping the first segmentation map.
24. The method of claim 17, wherein the first segmentation map and the second segmentation map are generated using a trained neural network.
25. The method of claim 17, wherein the trained neural network comprises a deep neural network.
26. The method of claim 17, wherein the first image is received from a first camera and the second image is received from a second camera.
27. The method of claim 26, wherein the first camera is responsive to visible light and the second camera is responsive to infrared light.
28. The method of claim 17, further comprising:
determining a fourth segmentation map based on a third image, the fourth segmentation map including foreground values associated with a third estimated foreground region in the third image.
29. The method of claim 17, wherein at least one of the first, second, and third segmentation maps is generated in response to a user selection of an imaging mode.
30. A non-transitory computer-readable storage medium having instructions stored thereon that, when executed by one or more processors, cause the one or more processors to:
obtain a first image of an object and a second image of the object, the first image having a first field of view (FOV) and the second image having a second FOV different from the first FOV;
determine a first segmentation map based on the first image, the first segmentation map comprising foreground predictors associated with a first estimated foreground region in the first image;
determine a second segmentation map based on the second image, the second segmentation map comprising foreground predictors associated with a second estimated foreground region in the second image;
generate a third segmentation map based on the first segmentation map and the second segmentation map; and
generate a refined segmentation mask using the second segmentation map and the third segmentation map, the refined segmentation mask identifying at least a portion of the object as a foreground region of at least one of the first image and the second image.
CN202180067005.4A 2020-10-05 2021-09-07 Segmentation for image effects Pending CN116324878A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US17/063,541 US11276177B1 (en) 2020-10-05 2020-10-05 Segmentation for image effects
US17/063,541 2020-10-05
PCT/US2021/049278 WO2022076116A1 (en) 2020-10-05 2021-09-07 Segmentation for image effects

Publications (1)

Publication Number Publication Date
CN116324878A true CN116324878A (en) 2023-06-23

Family

ID=78078386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180067005.4A Pending CN116324878A (en) 2020-10-05 2021-09-07 Segmentation for image effects

Country Status (6)

Country Link
US (1) US11276177B1 (en)
EP (1) EP4226322A1 (en)
KR (1) KR20230084486A (en)
CN (1) CN116324878A (en)
BR (1) BR112023005338A2 (en)
WO (1) WO2022076116A1 (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11797601B2 (en) * 2015-08-26 2023-10-24 Incogna Inc. System and method for image processing for identifying trends
US11335004B2 (en) * 2020-08-07 2022-05-17 Adobe Inc. Generating refined segmentation masks based on uncertain pixels
US20220138957A1 (en) * 2020-10-30 2022-05-05 Shanghai United Imaging Healthcare Co., Ltd. Methods and systems for medical image segmentation
CN112508964B (en) * 2020-11-30 2024-02-20 北京百度网讯科技有限公司 Image segmentation method, device, electronic equipment and storage medium
US11676279B2 (en) 2020-12-18 2023-06-13 Adobe Inc. Utilizing a segmentation neural network to process initial object segmentations and object user indicators within a digital image to generate improved object segmentations
US11875510B2 (en) 2021-03-12 2024-01-16 Adobe Inc. Generating refined segmentations masks via meticulous object segmentation
US12020400B2 (en) 2021-10-23 2024-06-25 Adobe Inc. Upsampling and refining segmentation masks
US11790041B2 (en) * 2021-12-01 2023-10-17 Midea Group Co., Ltd. Method and system for reducing false positives in object detection neural networks caused by novel objects
CN114708464B (en) * 2022-06-01 2022-08-30 广东艺林绿化工程有限公司 Municipal sanitation cleaning garbage truck cleaning method based on road garbage classification
CN115063335A (en) * 2022-07-18 2022-09-16 北京字跳网络技术有限公司 Generation method, device and equipment of special effect graph and storage medium
WO2024072835A1 (en) * 2022-09-28 2024-04-04 Google Llc Removing distortion from real-time video using a masked frame
WO2024117433A1 (en) * 2022-11-30 2024-06-06 Samsung Electronics Co., Ltd. Method and electronic device for performing color correction
CN117237397B (en) * 2023-07-13 2024-05-28 天翼爱音乐文化科技有限公司 Portrait segmentation method, system, equipment and storage medium based on feature fusion

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015071526A1 (en) * 2013-11-18 2015-05-21 Nokia Technologies Oy Method and apparatus for enhanced digital imaging
US10298864B2 (en) * 2016-06-10 2019-05-21 Apple Inc. Mismatched foreign light detection and mitigation in the image fusion of a two-camera system
US10504274B2 (en) * 2018-01-05 2019-12-10 Microsoft Technology Licensing, Llc Fusing, texturing, and rendering views of dynamic three-dimensional models
DE112019000122T5 (en) * 2018-02-27 2020-06-25 Nvidia Corporation REAL-TIME DETECTION OF TRACKS AND LIMITATIONS BY AUTONOMOUS VEHICLES
US10839517B2 (en) * 2019-02-21 2020-11-17 Sony Corporation Multiple neural networks-based object segmentation in a sequence of color image frames
US10949960B2 (en) * 2019-06-20 2021-03-16 Intel Corporation Pose synthesis in unseen human poses

Also Published As

Publication number Publication date
US11276177B1 (en) 2022-03-15
KR20230084486A (en) 2023-06-13
US20220108454A1 (en) 2022-04-07
WO2022076116A1 (en) 2022-04-14
BR112023005338A2 (en) 2023-04-25
EP4226322A1 (en) 2023-08-16

Similar Documents

Publication Publication Date Title
US11276177B1 (en) Segmentation for image effects
US20210004962A1 (en) Generating effects on images using disparity guided salient object detection
US11756223B2 (en) Depth-aware photo editing
WO2020192483A1 (en) Image display method and device
US10609284B2 (en) Controlling generation of hyperlapse from wide-angled, panoramic videos
KR20230013243A (en) Maintain a fixed size for the target object in the frame
CN113286194A (en) Video processing method and device, electronic equipment and readable storage medium
US20220392202A1 (en) Imaging processing method and apparatus, electronic device, and storage medium
US11810256B2 (en) Image modification techniques
WO2023044233A1 (en) Region of interest capture for electronic devices
CN116134483A (en) Space-time recirculation network
CN113283319A (en) Method and device for evaluating face ambiguity, medium and electronic equipment
CN114390201A (en) Focusing method and device thereof
US20160140748A1 (en) Automated animation for presentation of images
CN116055895B (en) Image processing method and device, chip system and storage medium
WO2023097576A1 (en) Segmentation with monocular depth estimation
US20240281990A1 (en) Object count using monocular three-dimensional (3d) perception
CN118043859A (en) Efficient visual perception
CN116980758A (en) Video blurring method, electronic device, storage medium and computer program
CN117690129A (en) Image processing method, device, electronic equipment, chip and medium
CN115439489A (en) Image transmission method, image blurring processing method, image transmission device, image blurring processing equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination