US20220383525A1 - Method for depth estimation for a variable focus camera - Google Patents

Method for depth estimation for a variable focus camera

Info

Publication number
US20220383525A1
Authority
US
United States
Prior art keywords
image
dimensional
image features
focus
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/663,643
Other languages
English (en)
Inventor
Ceruso SABATO
Ricardo Oliva GARCIA
Jose Manuel Rodriguez RAMOS
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wooptix SL
Original Assignee
Wooptix SL
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wooptix SL
Assigned to WOOPTIX S.L. reassignment WOOPTIX S.L. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: GARCIA, RICARDO OLIVIA, RAMOS, JOSE MANUEL RODRIGUEZ, SABATO, CERUSO
Publication of US20220383525A1
Assigned to WOOPTIX S.L. reassignment WOOPTIX S.L. CHANGE OF ADDRESS OF ASSIGNEE Assignors: WOOPTIX S.L.
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/06 Topological mapping of higher dimensional structures onto lower dimensional surfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757 Matching configurations of points or features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T7/571 Depth or shape recovery from multiple images from focus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/192 Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194 References adjustable by an adaptive method, e.g. learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Definitions

  • the disclosure relates to a method, a computer system and a storage medium.
  • Existing techniques either require dedicated, intricate hardware, e.g. a stereo camera and/or intricate optical microlens arrays, and/or require intricate and computationally intensive processing of captured images, e.g. structure-from-motion techniques and/or depth-from-focus techniques.
  • said objectives are achieved by a computer-implemented method, a computer system and a computer storage medium.
  • a computer-implementable method for extracting depth information from a plurality of images taken by a camera at different focus positions may comprise one, some or all of the following steps.
  • processing further comprising aligning the image features stored from the previously captured images with the image features of the currently processed image
  • the expression of at least some of the processed images may refer to a subset of the predetermined number of captured images that have been processed, i.e. whose image features have been extracted and stored, or it may refer to a/the set comprising all of the processed predetermined number of captured images.
  • the expression of all processed images may refer to a/the set comprising all of the processed predetermined number of captured images or may refer to a/the set comprising all of the processed predetermined number of captured images and the currently processed image.
  • the expression of at least some of the processed images may refer to a natural number s, wherein s is less than or equal to k and wherein s is greater than or equal to 2. In some embodiments, s can be equal to k.
  • a scene can be understood as a scene in real physical three-dimensional space.
  • an image or image frame can be understood as a two-dimensional pixelated digital image or image frame having a width dimension and a height dimension.
  • a/the plurality of images or image frames may be images/image frames of a video-stream.
  • image features can inter alia be understood as characteristics or properties of objects or subjects in a/the captured image, such as, for example, shapes, contours, colors of objects or subjects in a/the captured image.
  • Image features may also refer to information describing whether an object or subject is in focus or is defocused, i.e. image features may comprise focus/defocus information.
  • the above-mentioned predetermined focus schedule may specify an order in which a/the camera captures images at specific focus positions at specific times.
  • the focus schedule can specify a predefined list of focus positions to be used by the camera and also can specify how this list is to be traversed when capturing a sequence of images of a scene with the camera.
  • Said predetermined focus schedule may comprise a plurality of focus positions that can comprise different focus positions and/or identical focus positions. Stated differently, a/the focus schedule may specify that the same focus position of a/the camera may be used at different points in time for capturing an image.
  • A/the camera may then traverse said exemplary focus schedule chronologically to capture images at defined focus positions.
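  • The chronological traversal of a focus schedule can be illustrated with a minimal sketch. The schedule values, the unit (metres) and the wrap-around behaviour at the end of the list are assumptions made for illustration only; the text merely states that the schedule specifies a list of focus positions and how it is traversed.

```python
from itertools import cycle, islice

# Hypothetical focus schedule: focus positions in metres, in capture order.
# The same focus position may appear more than once.
FOCUS_SCHEDULE = [0.1, 0.3, 0.6, 1.5, 0.6, 0.3]


def focus_positions(schedule, num_images):
    """Return the focus position used for each of num_images captures.

    The schedule is traversed in order; wrapping around at the end is an
    assumption of this sketch, not something the text prescribes.
    """
    return list(islice(cycle(schedule), num_images))


positions = focus_positions(FOCUS_SCHEDULE, 8)
```

  • Here the seventh and eighth captures reuse the first two entries of the schedule, which is one possible way of capturing a continuous video-stream with a finite schedule.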
  • a/the camera can be understood as an image capturing system that can capture pixelated two-dimensional digital images.
  • said exemplary camera can capture a stream of images, e.g. a video-stream of images.
  • a camera herein may be understood as being an optical camera.
  • An exemplary camera may be a common digital camera, e.g. a camera of a smartphone. Furthermore, a/the camera can be configured for capturing images at different focus positions, i.e. for traversing a/the predetermined focus schedule.
  • a/the camera may comprise an optical element, e.g. one or more lenses, for controlling where on an image detection plane, e.g. image sensor plane, captured light converges, i.e. the camera can be a variable focus camera.
  • the above identified exemplary method steps may be carried out iteratively or in a loop until all captured images or until a desired number of captured images have been processed.
  • the speed-up of the depth information extraction from images focused at different positions according to the herein described method steps allows the extraction of depth information from a stream of images in real time, i.e. without a delay noticeable for a user of the camera.
  • depth information can be extracted at the same time from images of the captured video-stream.
  • depth information can be extracted in less than 18 ms per image, thereby allowing, for example, application of the herein described method to video-streams with an image frame rate of 30 frames per second or higher.
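  • The relation between per-image processing time and frame rate is simple arithmetic, sketched below. The function names are hypothetical; only the 18 ms and 30 frames per second figures come from the text.

```python
def frame_budget_ms(frames_per_second):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / frames_per_second


def keeps_up(processing_ms_per_image, frames_per_second):
    """True if one image can be processed within one frame interval."""
    return processing_ms_per_image <= frame_budget_ms(frames_per_second)


budget_30fps = frame_budget_ms(30)      # roughly 33.3 ms per frame
real_time_30fps = keeps_up(18.0, 30)    # 18 ms fits within that budget
```

  • At 30 frames per second each frame leaves roughly 33.3 ms, so a processing time below 18 ms per image leaves headroom; at 60 frames per second the budget of roughly 16.7 ms would no longer be met by an 18 ms step.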
  • the camera may automatically capture a plurality of images of said scene with varying focus according to a/the predetermined focus schedule to extract depth information for the scene in order to optimize the focus settings for the image the user wants to capture and/or to generate an all-in-focus image of the scene the user wants to capture.
  • the improved performance of the herein described method for extracting depth information from a plurality of images is inter alia due to the fact that while processing a current image or current input image, the information of at least some or all of the previously captured images/past images is saved, re-used and taken into account, thereby avoiding the need for extra computations.
  • Since the herein described method for extracting depth information from a plurality of images does not require any intricate mathematical operations but rather relies, for example, on basic mathematical operations or computations like addition and multiplication that can be carried out in parallel, it is ideally suited to be carried out by a graphics processing unit (GPU), which can carry out such parallel basic computations faster than a generic central processing unit (CPU).
  • the herein exemplary described method for extracting depth information from a plurality of images provides a more robust and more accurate absolute depth information extraction from images of a scene, in particular in the case of a dynamic scene, i.e. a scene in which objects and/or subjects move during the capturing of a plurality of images.
  • the improved robustness and accuracy of the herein exemplary described method is inter alia due to the fact that the processing of captured images is performed only on the feature level not on the level of images or image frames as a whole.
  • the herein described method dispenses with the need of directly registering the images/image frames with respect to each other, as is required by common depth-of-focus techniques and which in particular causes problems when objects or subjects in the scene move between captured images and wherein said movement(s) can cause current techniques for image alignment to fail.
  • the herein described method allows carrying out alignments on the feature level, i.e. aligning past/previously captured features from past/previously captured images with image features from a currently processed image/currently processed input image, thereby providing an indirect implicit registration/alignment between captured images.
  • the present method is therefore able to better cope with movements in the scene and/or due to camera movements, e.g. due to shaking support of the camera, that may occur between captured images.
  • the herein described method steps and means may also be applied in the field of computer and robotic vision.
  • the herein described method can be used to improve visual object recognition of robots or cars, e.g. autonomous robots or cars, in particular for improving autonomous navigation capabilities.
  • the herein described method can be used to improve classification of images, e.g. improve the accuracy and performance of image or object detection algorithms, and/or to improve image/video analytic systems using the extracted depth information.
  • the extracted depth information, i.e. the generated two-dimensional depth map, can be used as input for displays, in particular, for example, as input for three-dimensional displays to generate three-dimensional images of a/the scene captured in a/the plurality of two-dimensional images.
  • the herein described method and the herein described machine learning algorithm comprising a convolutional neural network are independent of the type of camera used, i.e. they work with images and cameras of any type. No specific camera hardware or lenses are required.
  • the convolutional neural network can be trained with different cameras and focus configurations, thereby allowing a better generalization of the method to unknown scenes/unknown samples.
  • the image features can be extracted by layers of the convolutional neural network as three-dimensional feature tensors comprising a width dimension, W, a height dimension, H, and a channel dimension, C, wherein said channel dimension can describe the number of feature maps extracted from an image by one or more layers of the convolutional neural network and wherein the storing of extracted image features can comprise storing the extracted image features as a list of three-dimensional feature tensors.
  • An exemplary feature tensor, e.g. a feature tensor $F$ of a currently processed image, may therefore be an element of $\mathbb{R}^{C,H,W}$, i.e. $F \in \mathbb{R}^{C,H,W}$, with $C$, $W$ and $H$ referring to the above-mentioned dimensions.
  • a feature map extracted from an image can be understood as a two-dimensional representation with a width dimension, W and a height dimension, H, of a feature or aspect or property or characteristic of an image detected/extracted by one or more layers of the convolutional neural network.
  • a feature map can be understood as a two-dimensional representation of the locations where a specific feature or aspect or property or characteristic of an image is detected or extracted by one or more layers of the convolutional neural network.
  • the width and height dimensions of said feature map may be equal to or different from, e.g. smaller than, the width and height dimensions of the image from which the feature map was extracted.
  • Said exemplary feature maps can be understood as being/representing slices of said exemplary three-dimensional feature tensor(s).
  • feature of an image or image feature may therefore be understood as representation of an image or as representation of an aspect of an image in a different dimensional space, e.g. a higher-dimensional space, than the dimension space of the image from which the feature was extracted.
  • the herein described process of extracting, by a machine learning algorithm comprising a convolutional neural network, image features from an image can therefore be understood as transforming an image into image features.
  • Representing extracted image features as tensors facilitates the computational processing of the extracted image features as the feature tensors can be more easily processed by the convolutional neural network.
  • Extraction of image features by the convolutional neural network can be carried out by a sequence comprising convolutional layers, batch normalization(s) (BN), rectified linear activation functions (ReLu), resampling, e.g. up-sampling (Up), reshaping or pooling, concatenation (Concat) and skip operations.
  • a/the batch normalization may refer to a normalization operation using an estimated mean, E(x), a.k.a. running mean, and variance, Var (x), and a scale parameter and a shift parameter, wherein said scale parameter and said shift parameter may have been learned by the convolutional neural network during training.
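  • A minimal sketch of such a normalization at inference time is given below. The small constant `eps` added to the variance for numerical stability is an assumption of this sketch and is not mentioned in the text; the function name is hypothetical.

```python
import math


def batch_norm(values, running_mean, running_var, scale, shift, eps=1e-5):
    """Normalize values with an estimated mean and variance, then apply
    the learned scale and shift parameters.

    eps is an assumed stabilizer that avoids division by zero.
    """
    denom = math.sqrt(running_var + eps)
    return [(v - running_mean) / denom * scale + shift for v in values]


normalized = batch_norm([1.0, 3.0], running_mean=2.0, running_var=1.0,
                        scale=1.0, shift=0.0, eps=0.0)
```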
  • a/the rectified linear activation function(s) (ReLu) can be understood as referring to an activation function of the convolutional neural network that sets negative values to zero and leaves positive values, including zero, unchanged.
  • a rectified linear activation function (ReLu) can be expressed, for example, as: $\mathrm{ReLu}(x) = \max(0, x)$.
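  • The behaviour described above, i.e. negative values are set to zero and all other values are left unchanged, can be sketched as follows (the function name is chosen for illustration):

```python
def relu(values):
    """Apply the rectified linear activation element-wise."""
    return [max(0.0, v) for v in values]


activated = relu([-2.0, 0.0, 3.5])
```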
  • an up-sampling (Up, UP, Up-sample) operation may refer to an operation that increases the width and/or height dimension(s) of an image or image feature or image feature tensor or feature map, wherein the increase operation is based on/comprises interpolating or extrapolating values of the original image or original image feature tensor or original feature map to obtain up-sampled values.
  • a/the reshaping operation may refer to an operation that modifies the shape, i.e. the dimensions, of a tensor, e.g. the shape of an image or image feature or feature map or image feature tensor, while maintaining the total number of elements of the tensor.
  • a tensor of shape [10, 3, 9, 2] could be reshaped into [10, 3, 18], [30, 18], [30, 1, 18] or [540] as all these shapes contain the same total number of elements (540).
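  • The element-count condition for a reshape can be checked with a short sketch. The helper names are hypothetical, the nested-list layout in row-major order is an assumption of this illustration, and all dimensions are assumed to be positive.

```python
from math import prod


def can_reshape(old_shape, new_shape):
    """A reshape is valid only if both shapes hold the same number of elements."""
    return prod(old_shape) == prod(new_shape)


def reshape(flat, shape):
    """Arrange a flat list into nested lists of the given shape (row-major)."""
    if prod(shape) != len(flat):
        raise ValueError("shape does not match the number of elements")
    if len(shape) == 1:
        return list(flat)
    step = len(flat) // shape[0]
    return [reshape(flat[i * step:(i + 1) * step], shape[1:])
            for i in range(shape[0])]


valid = [can_reshape([10, 3, 9, 2], s)
         for s in ([10, 3, 18], [30, 18], [30, 1, 18], [540])]
small = reshape(list(range(6)), [2, 3])
```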
  • a/the pooling operation may be understood as an operation similar to an up-sampling operation, but one that down-samples the width and/or height dimension(s) of an image or image feature or image feature tensor or feature map.
  • a/the pooling operation may apply an operation, e.g. a maximum or average function, to a subset of pixels, e.g. pixels of an image or image feature or feature map or image feature tensor, wherein the subset corresponds to the size of a predetermined kernel/filter, with a predetermined stride to generate a/one pixel.
  • an input e.g. image or image feature or image feature tensor or feature map
  • kernel and stride sizes are exemplary only. Other kernel and stride sizes, for example determined empirically, may be chosen as well.
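  • A pooling operation over a single feature map can be sketched as follows. The kernel size of 2, the stride of 2 and the choice of the maximum function are assumptions picked for illustration; as noted above, other kernel and stride sizes may be chosen.

```python
def max_pool2d(feature_map, kernel=2, stride=2):
    """Down-sample a 2D feature map by taking the maximum of each window."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, height - kernel + 1, stride):
        row = []
        for left in range(0, width - kernel + 1, stride):
            window = [feature_map[top + dy][left + dx]
                      for dy in range(kernel) for dx in range(kernel)]
            row.append(max(window))  # one output pixel per window
        pooled.append(row)
    return pooled


feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
pooled = max_pool2d(feature_map)
```

  • With these assumed settings, a 4 x 4 feature map is reduced to 2 x 2, i.e. each dimension is halved.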
  • a/the concatenation operation may refer to an operation that merges two or more tensors, e.g. images or image features or image feature tensors or feature maps, along a specific dimension. This operation may require that all the to be merged/to be concatenated tensors have the same shape in all dimensions, except in the concatenation dimension.
  • a tensor $t_1$ of shape [10, 3, 9] and a tensor $t_2$ of shape [4, 3, 9] concatenated along the first dimension result in a tensor of shape [14, 3, 9].
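  • The shape rule for concatenation can be sketched as a small check on shapes alone (the helper name is hypothetical):

```python
def concat_shape(shapes, axis=0):
    """Return the shape that results from concatenating tensors of the
    given shapes along axis, or raise if the shapes are incompatible."""
    first = shapes[0]
    for shape in shapes[1:]:
        if len(shape) != len(first):
            raise ValueError("tensors must have the same number of dimensions")
        for dim, (a, b) in enumerate(zip(first, shape)):
            # All dimensions except the concatenation dimension must match.
            if dim != axis and a != b:
                raise ValueError("shapes differ outside the concatenation dimension")
    result = list(first)
    result[axis] = sum(shape[axis] for shape in shapes)
    return result


merged_shape = concat_shape([[10, 3, 9], [4, 3, 9]], axis=0)
```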
  • a/the skip operation may refer to an operation that connects non-consecutive layers or non-consecutive sequences of operations of the convolutional neural network using a specific operation, e.g. an addition operation.
  • an exemplary skip operation may be defined as the addition $x_3 + x_1$, i.e. skipping the layer 2.
  • a/the convolutional layer or convolution layer may refer to convolution operations on images or image features or image feature tensors or feature maps.
  • Said possible exemplary operations or layers of the convolutional neural network may be carried out along/over/in multiple dimensions.
  • the dimensionality of the operations may depend on where the operations are taking place within the convolutional neural network.
  • the extraction of image features by the convolutional neural network may involve operations or layers that in particular operate/act on the height, H, and width, W, dimensions of image or image feature or image feature tensor or feature map and the further processing of extracted image features, e.g. the aligning of image features, may involve operations or layers of the convolutional neural network that act on further dimensions, such as a focus position dimension.
  • the above-mentioned aligning of the image features stored from the previously captured images with the image features of the currently processed image can comprise applying a four-dimensional encoding to both the image features stored from the previously captured images and to the image features from the currently processed image, wherein the image features are represented as tensors.
  • said four-dimensional encoding can comprise embedding temporal, spatial and focus position information into the image features from the previously captured images and into the image features from the currently processed image.
  • a four-dimensional encoding E may be composed according to the following two equations:
  • $a$ being a correction constant, for instance, $a$ being greater than $C$
  • the number of channels or channel dimension size, $x$, $y$ are spatial pixel coordinates
  • $t$ is the time, i.e. the temporal position/the point in time/time stamp/time index of the captured image from which the image features were extracted, with $t \in [0, K-1]$, wherein $K$ denotes a/the number of previously captured images, e.g. a/the predetermined number of captured images, $d \in [0, N-1]$ is the focus plane position/focus position/focus position index of a given image to be encoded and $N$ is the total number of images, e.g.
  • derived images may be derived by interpolation or extrapolation of images captured according to the focus schedule, and $i \in [0, C/2]$ is an index used for dividing the number of channels into even and odd channels for the encoding(s).
  • Said exemplary encoding $E$, being composed of the exemplary encodings $E_{2i,x,y}$ and $E_{2i+1,x,y}$, is applied via addition to the image features/feature tensors of the currently processed image $F \in \mathbb{R}^{C,H,W}$ and to each of the image features/feature tensors from the previously captured images, i.e. to each of the image features/feature tensors from the past $K$ images $PF \in \mathbb{R}^{K,C,H,W}$, to obtain $EF \in \mathbb{R}^{C,H,W}$ and $EPF \in \mathbb{R}^{K,C,H,W}$ as follows: $EF_{c,x,y} = F_{c,x,y} + E_{c,x,y}$ and $EPF_{k,c,x,y} = PF_{k,c,x,y} + E^{k}_{c,x,y}$, wherein
  • $c \in [1, C]$ is a channel index
  • $E^{k}_{c,x,y}$ denotes the encodings of the image features/feature tensors of the past $K$ images/previously captured images, i.e. $k \in [1, K]$ denotes an index for the image features/feature tensors of the past $K$/previously captured/stored images.
  • the above described example describes an exemplary four-dimensional encoding that is non-linear and based on using trigonometric functions and wherein the four-dimensional encoding is applied via addition to the image features from the currently processed image and to each of the image features stored from the previously captured images.
  • other four-dimensional encodings may be used as well.
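  • The two encoding equations are not reproduced in this text. The following sketch therefore assumes one plausible trigonometric form, modelled on sinusoidal positional encodings: even channels sum sines and odd channels sum cosines of x, y, t and d, each divided by a**(2i/C). This form, the function names and the example values are assumptions for illustration and should not be read as the claimed encoding.

```python
import math


def encoding_4d(C, H, W, t, d, a):
    """Build an assumed four-dimensional encoding of shape [C][H][W].

    x, y: spatial pixel coordinates, t: time index, d: focus position index,
    a: correction constant (assumed here to act as the frequency base).
    C is assumed to be even.
    """
    E = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for i in range(C // 2):
        s = a ** (2 * i / C)
        for y in range(H):
            for x in range(W):
                E[2 * i][y][x] = sum(math.sin(p / s) for p in (x, y, t, d))
                E[2 * i + 1][y][x] = sum(math.cos(p / s) for p in (x, y, t, d))
    return E


def add_encoding(F, E):
    """Apply the encoding via element-wise addition: EF = F + E."""
    return [[[F[c][y][x] + E[c][y][x] for x in range(len(F[c][y]))]
             for y in range(len(F[c]))] for c in range(len(F))]


E = encoding_4d(C=4, H=2, W=3, t=0, d=0, a=10.0)
F = [[[1.0] * 3 for _ in range(2)] for _ in range(4)]
EF = add_encoding(F, E)
```

  • Because the encoding is added rather than concatenated, the encoded tensor keeps the shape of the original feature tensor.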
  • the following exemplary steps can be carried out.
  • a similarity operation for the encoded feature tensors can be carried out by the convolutional neural network based on the following exemplary similarity score(s):
  • EF′ is a two-dimensional matrix that has been obtained by reshaping EF with the convolutional neural network
  • i, j denote the matrix elements
  • EPF′ is a three-dimensional tensor that has been obtained by reshaping EPF with the convolutional neural network
  • k, i′, j denote the elements of the three-dimensional tensor with k being the index denoting image features tensors of the past K/previously captured/stored images.
  • index $i$ and $i'$, for example, have a range of $[0, (H \cdot W) - 1]$ and index $j$ may have a range of $[0, C-1]$, with $H$, $W$ being the height and width dimension of the feature tensors and with $C$ being the number of channels.
  • EPF also may contain the features extracted of the currently processed image, i.e. may contain the feature tensor of the currently processed image.
  • Since the feature tensor of the currently processed image must be completely similar to itself, the similarity score would not be affected when including the feature tensor of the currently processed image in EPF.
  • including the feature tensor of the currently processed image in EPF may be inter alia useful to check the validity and robustness of the convolutional neural network, in particular, for example, during training of the convolutional neural network.
  • $Sim \in \mathbb{R}^{K,HW,HW}$ can be understood as similarity scores between image features of a currently processed image and the image features for each of the $K$ past images.
  • Said similarity score can be translated to probabilities $Sim' \in \mathbb{R}^{K,HW,HW}$ according to
  • Said exemplary normalized similarity scores can then be multiplied with a reshaped encoded feature tensor of the past $K$ images $EPF^{v'} \in \mathbb{R}^{K,HW,C}$ to obtain $AF' \in \mathbb{R}^{K,HW,C}$: $AF'_{k,i,j} = \sum_{i'} Sim'_{k,i,i'} \, EPF^{v'}_{k,i',j}$
  • $AF'$ can then be reshaped to $AF \in \mathbb{R}^{K,C,H,W}$. Then, $AF$ can be grouped along the first dimension to group the features corresponding to the same focus position, thus obtaining $GAF \in \mathbb{R}^{N,M,C,H,W}$, with $M = K/N$, wherein
  • $K$ is the total number of past $K$/previously captured/stored images, which may include also the currently processed image, or the number of all focus positions of the past $K$/previously captured/stored images, which may also include the focus position of the currently processed image, and $N$ is the number of unique focus positions among the total number $K$ of focus positions.
  • the information can be merged, e.g. by a reduction sum operation: $EPF^{\alpha}_{n} = \sum_{m=0}^{M-1} GAF_{n,m}$, with
  • $EPF^{\alpha} \in \mathbb{R}^{N,C,H,W}$ being an example for the at least one multi-dimensional tensor representing the image features of all processed images aligned to the image features of the currently processed image, wherein $n$ is an index in the range $[0, N-1]$ and $m$ is an index in the range $[0, M-1]$, with $N$ being the number of unique focus positions and with $M$ as defined above.
  • $EPF^{\alpha} \in \mathbb{R}^{N,C,H,W}$ may, for example, represent only some of the previously processed image features, aligned to the image features of the currently processed image, i.e. the above-identified index ranges are exemplary only.
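  • The alignment steps described above (similarity scores, normalization to probabilities, weighting of past features, grouping by focus position and reduction sum) can be sketched on nested lists. The similarity and normalization equations are not reproduced in this text, so the sketch assumes a dot-product similarity and a softmax over the past locations; it also reuses the same past feature tensor for scoring and weighting, which is a simplification. All names are hypothetical.

```python
import math


def softmax(values):
    """Turn scores into probabilities that sum to one."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def align_past_features(ef, epf, focus_index, num_focus):
    """Align past features to the current image and merge them per focus position.

    ef: [HW][C] encoded features of the current image (reshaped).
    epf: [K][HW][C] encoded features of the K past images (reshaped).
    focus_index[k]: index of the focus position of past image k.
    Returns a [N][HW][C] nested list, N being the number of focus positions.
    """
    hw, channels = len(ef), len(ef[0])
    merged = [[[0.0] * channels for _ in range(hw)] for _ in range(num_focus)]
    for k, past in enumerate(epf):
        for i in range(hw):
            # Assumed similarity: dot product between location i of the
            # current image and every location i' of past image k.
            scores = [sum(ef[i][j] * past[ip][j] for j in range(channels))
                      for ip in range(hw)]
            weights = softmax(scores)  # assumed normalization over i'
            for j in range(channels):
                aligned = sum(weights[ip] * past[ip][j] for ip in range(hw))
                # Reduction sum over all past images sharing a focus position.
                merged[focus_index[k]][i][j] += aligned
    return merged


current = [[1.0, 0.0], [0.0, 1.0]]
past_a = [[2.0, 3.0], [2.0, 3.0]]
past_b = [[1.0, 1.0], [1.0, 1.0]]
aligned = align_past_features(current, [past_a, past_a, past_b], [0, 0, 1], 2)
```

  • Because no image is warped, the registration between images stays implicit: each location of the current image simply gathers the past features it is most similar to.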
  • the step of generating a two-dimensional depth map using the focus positions specified in the predetermined focus schedule and the at least one generated multi-dimensional tensor may further comprise generating, by the machine learning algorithm, at least one multi-dimensional focus probability map $fpm \in \mathbb{R}^{N,H,W}$ using the obtained at least one multi-dimensional tensor $EPF^{\alpha}$ and remapping said at least one multi-dimensional focus probability map to real physical distances using the focus positions specified in the predetermined focus schedule.
  • Said multi-dimensional focus probability map fpm can inter alia for example be obtained by the convolutional neural network via the following steps:
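  • a softmax operation by the convolutional neural network may, for example, be defined as $\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$, with $x$ being the input values and $i$, $j$ indexing its elements.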
  • Said obtained exemplary at least one multi-dimensional focus probability map $fpm \in \mathbb{R}^{N,H,W}$ is a three-dimensional tensor having a width dimension, W, a height dimension, H, and a focus position dimension, N, said focus position dimension describing the number of focus positions, e.g. different focus positions in the focus schedule or different focus positions from the focus schedule and from synthetic/derived focus positions for synthetic/derived images, said synthetic/derived images having been derived from captured images via interpolation or extrapolation.
  • the size of the width and height dimensions can be equal to the size of the width and height dimensions of an input image, wherein said input image is either an image of the predetermined number of captured images or the currently processed image or a synthetic image.
  • the remapping of the at least one multi-dimensional focus probability map fpm to real physical distances using the focus positions specified in the predetermined focus schedule may comprise computing the dot product between each pixel of the at least one multi-dimensional focus probability map and the known focus positions in the focus schedule, thereby obtaining a/the two-dimensional depth map with absolute depth information on the captured scene.
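  • As a minimal, non-limiting sketch of the remapping described above, the following Python snippet applies a softmax along the focus position dimension and then computes, per pixel, the dot product with the known focus positions. The function names, the nested-list layout [n][y][x] and the use of raw per-focus-position scores as input are assumptions made for illustration only.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def remap_to_depth(scores, focus_positions):
    """Turn per-focus-position scores of shape (N, H, W) into an (H, W) depth map.

    scores: nested lists indexed [n][y][x] (hypothetical raw network outputs).
    focus_positions: the N known focus distances from the focus schedule.
    """
    n_pos = len(focus_positions)
    height, width = len(scores[0]), len(scores[0][0])
    depth = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Focus probabilities of this pixel: non-negative and summing to 1.
            probs = softmax([scores[n][y][x] for n in range(n_pos)])
            # Dot product with the focus positions gives the expected distance.
            depth[y][x] = sum(p * f for p, f in zip(probs, focus_positions))
    return depth
```

  • In this sketch, a pixel whose probabilities are spread evenly over all focus positions is mapped to the mean focus distance, while a pixel with one dominant focus position is mapped close to that position's distance.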
  • the step of extracting image features of the predetermined number of captured images and extracting image features of the currently processed image may further comprise extracting, by the machine learning algorithm, image features at different scales, wherein said scales are defined as a fraction of the height of an input image and/or as a fraction of the width of an input image, wherein said input image is either an image of the predetermined number of captured images or the currently processed image.
  • the image features/feature tensors extracted from the predetermined number of captured images and the image features/feature tensors extracted from the currently processed image are stored in a computer-readable memory in a circular buffer, e.g. a circular buffer that can hold at least the image features from the predetermined number of captured images.
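  • A circular buffer of this kind could, for example, be sketched with Python's standard collections.deque, which discards the oldest entry once its maximum length is reached. The class name FeaturePool, its methods and the pairing of each feature tensor with a focus index are hypothetical and serve only as an illustration.

```python
from collections import deque


class FeaturePool:
    """Fixed-capacity store for the feature tensors of the most recent images."""

    def __init__(self, capacity):
        # deque(maxlen=...) drops the oldest entry once capacity is reached.
        self._buffer = deque(maxlen=capacity)

    def push(self, focus_index, features):
        """Store the features of a newly processed image with its focus index."""
        self._buffer.append((focus_index, features))

    def snapshot(self):
        """Return the stored entries, oldest first."""
        return list(self._buffer)

    def __len__(self):
        return len(self._buffer)
```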
  • the predetermined number of captured images can be at least equal to or greater than the number of different focus positions specified by the focus schedule.
  • the above and herein exemplary described convolutional neural network can be a trained convolutional neural network that has been trained on a training sample comprising a plurality of images focused at different focus positions for a plurality of different scenes from the real physical world, wherein the scenes are static or dynamic, and wherein the convolutional neural network parameters have been optimized by comparing estimated depth maps generated by the convolutional neural network with corresponding known ground truth depth maps, i.e. depth maps whose absolute values are known, using a loss function.
  • the loss function is a measure of how different the estimated/predicted depth maps are with respect to the expected known ground truth depth maps.
  • the training of the convolutional neural network is run until the loss function has reached a desired/specified minimum and the optimal model parameters of the convolutional neural network have been determined.
  • the minimization of the loss function may be achieved by optimization techniques such as using a gradient descent algorithm. However, also other optimization techniques, e.g. simulated annealing, genetic algorithms or Markov-chain-Monte-Carlo algorithms, may be applied to minimize the loss function and to determine the best model parameters of the convolutional neural network from the training, such as for example, best weights of convolutional layers, best scale or shift parameter values.
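  • The principle of minimizing a loss function by gradient descent can be illustrated with a deliberately small example that fits only a scale and a shift parameter. The mean squared error loss, the learning rate, the step count and the function name are assumptions chosen for this sketch; the text above does not prescribe a particular loss function.

```python
def fit_scale_shift(inputs, targets, learning_rate=0.05, steps=2000):
    """Fit targets ~ scale * inputs + shift by gradient descent on the MSE loss."""
    scale, shift = 1.0, 0.0
    n = len(inputs)
    for _ in range(steps):
        errors = [scale * x + shift - t for x, t in zip(inputs, targets)]
        # Gradients of mean(error^2) with respect to scale and shift.
        grad_scale = 2.0 * sum(e * x for e, x in zip(errors, inputs)) / n
        grad_shift = 2.0 * sum(errors) / n
        # Step against the gradient to reduce the loss.
        scale -= learning_rate * grad_scale
        shift -= learning_rate * grad_shift
    loss = sum((scale * x + shift - t) ** 2 for x, t in zip(inputs, targets)) / n
    return scale, shift, loss
```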
  • a computer system comprising: a computer memory, one or more processors, e.g. a central processing unit (CPU) and/or a graphics processing unit (GPU), wherein the computer memory can store instructions that direct the one or more processors to carry out a method or method steps as described herein for extracting depth information from a plurality of images taken by a camera at different focus positions.
  • said computing system can be a portable mobile device, e.g. a smartphone, comprising a camera that is configured for capturing images of a scene with different focus positions.
  • computer-executable instructions that when executed by a computer system can perform a method for extracting depth information from a plurality of images taken by a camera at different focus positions as described herein, can be stored on a computer-readable storage medium, e.g. a non-volatile computer storage medium.
  • the above-mentioned predetermined number of captured images may for example be smaller, equal to or greater than the number of focus positions in the predetermined focus schedule and/or may be equal to or greater than the number of different, i.e. unique, focus positions in the predetermined focus schedule.
  • the predetermined number of captured images may be a natural number multiple of the number of focus positions in the predetermined focus schedule.
  • FIG. 1 Exemplary schematic overview of the method and means for extracting depth information
  • FIG. 2 a Exemplary two-dimensional encoder of convolutional neural network
  • FIG. 2 b Exemplary two-dimensional convolution block
  • FIG. 2 c Exemplary two-dimensional residual convolution block
  • FIG. 2 d Exemplary two-dimensional multiscale feature aggregation block
  • FIG. 2 e Exemplary two-dimensional spatial pyramid pooling block
  • FIG. 3 a Exemplary three-dimensional decoder of convolutional neural network
  • FIG. 3 b Exemplary three-dimensional residual convolution block
  • FIG. 3 c Exemplary three-dimensional multiscale feature aggregation block
  • FIG. 3 d Exemplary three-dimensional spatial pyramid pooling block
  • FIG. 4 a Exemplary memory block
  • FIG. 4 b Exemplary feature alignment block
  • FIG. 4 c Exemplary feature alignment head
  • FIG. 5 Exemplary flow diagram of method for extracting depth information
  • FIG. 6 Exemplary schematic overview of training of machine learning algorithm.
  • FIG. 1 shows an exemplary general overview of the method and means for extracting depth information from images.
  • a stream of images 700 of a scene, wherein said image stream has been taken by a camera with variable focus by capturing images at different focus positions according to a focus schedule 710 , is inputted/fed to a machine learning model/machine learning algorithm 720 comprising a convolutional neural network.
  • the machine learning algorithm comprising a convolutional neural network outputs a focus probability map 730 of the scene, which can be remapped 740 to absolute distances using the known focus positions of the focus schedule 710 to obtain a two-dimensional depth map 750 of the scene.
  • FIG. 2 a shows an exemplary part of an exemplary possible convolutional neural network architecture that could be used for extracting image features from images 101 that have been captured by a camera at different focus positions and that outputs the exemplary extracted features or feature tensors 115 , 116 , 117 and 118 .
  • The exemplary part of an exemplary possible convolutional neural network architecture shown in FIG. 2 a can be understood as representing a two-dimensional (2D) encoder 100 that encodes features from an input image 101 into two-dimensional feature maps of width W and height H for every channel dimension C.
  • image features are extracted as three-dimensional feature tensors 115 , 116 , 117 , 118 comprising a width dimension, W, a height dimension, H, and a channel dimension, C, wherein said channel dimension describes the number of feature maps extracted from an image by the one or more layers or blocks 102 , 103 , 104 , 106 , 107 , 108 , 109 , 110 , 111 , 112 , 113 , 114 of the shown part of the convolutional neural network.
  • features from an input image 101 are extracted at four different scales, e.g. with different spatial sizes and/or different channel dimensions.
  • two-dimensional (2D) operations or layers or blocks, e.g. a 2D convolution block or a 2D residual convolution block or a 2D spatial pyramid pooling block or a 2D multiscale feature aggregation block, can be understood as acting/operating on the height and width dimensions of a feature tensor, e.g. the height and width dimensions of a feature map.
  • Said height and width dimensions may be equal in size or different in size from the size of the height and width dimensions of the input image 101 .
  • the exemplary extraction of the features at four different scales is achieved by a sequence comprising a two-dimensional convolution block 102 and four two-dimensional residual convolution blocks 103 , 104 , 105 and 106 .
  • Said exemplary two-dimensional residual convolution blocks 103 , 104 , 105 and 106 each comprise a sequence of two-dimensional convolutional layers (Conv), batch normalization (BN), rectified linear activation functions (ReLu), summation (Sum) and skip connections between the input and output of a given residual convolution block.
  • An exemplary configuration for a two-dimensional residual convolution block is provided in FIG. 2 c.
  • Said two-dimensional convolution block 102 may, for example, comprise sequences of two-dimensional convolutional layers (Conv), batch normalization (BN), rectified linear activation functions (ReLu) and a pooling layer (pool).
  • a two-dimensional spatial pyramid pooling block 107 is applied.
  • An exemplary configuration for such a two-dimensional spatial pyramid pooling block is provided in FIG. 2 e.
  • the output of the two-dimensional spatial pyramid pooling block 107 is then merged sequentially with the intermediate outputs from the first three two-dimensional residual convolution blocks 103 , 104 and 105 using the two-dimensional multiscale feature aggregation blocks 108 , 109 and 110 .
  • An exemplary configuration for a two-dimensional multiscale feature aggregation block is provided in FIG. 2 d.
  • a sequence 111 , 112 , 113 , 114 of two-dimensional convolutional layers (Conv) 111 a , 112 a , 113 a , 114 a , batch normalization (BN) 111 b , 112 b , 113 b , 114 b and rectified linear activation functions (ReLu) 111 c , 112 c , 113 c , 114 c can be applied to obtain the extracted features/feature tensors 115 , 116 , 117 , 118 for the exemplary four feature scales.
  • FIG. 2 b shows a possible exemplary configuration for the two-dimensional convolution block 102 of FIG. 2 a , comprising three sequences 119 , 120 , 121 , wherein each sequence comprises a two-dimensional convolutional layer (Conv), a batch normalization (BN) and a rectified linear activation function (ReLu) operation. After the last sequence 121 , a pooling layer (Pool) is applied to obtain the output of the convolution block 102 .
  • FIG. 2 c shows a possible exemplary configuration for a two-dimensional residual convolution block 103 , 104 , 105 , 106 of FIG. 2 a comprising two branches 128 , 129 .
  • Exemplary branch 128 comprises a first sequence 123 comprising a two-dimensional convolutional layer (Conv), a batch normalization (BN) and a rectified linear activation function (ReLu) operation and a second sequence 124 comprising a batch normalization (BN) and a rectified linear activation function (ReLu) operation.
  • Exemplary branch 129 only comprises a single sequence of a two-dimensional convolutional layer (Conv) and a batch normalization (BN) operation.
  • the output of said exemplary two branches is merged using a summation (Sum) operation 125 and the output of the two-dimensional residual convolution block is obtained after a final rectified linear activation function (ReLu) operation 126 .
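  • The merging of the two branches by summation followed by a final ReLu can be sketched as follows. The branches are passed in as arbitrary callables on flat feature lists; the function names and the toy branches used in the usage note are hypothetical stand-ins for the Conv/BN sequences of FIG. 2 c.

```python
def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


def residual_block(features, main_branch, skip_branch):
    """Merge the outputs of two branches by summation, then apply a final ReLU."""
    main_out = main_branch(features)
    skip_out = skip_branch(features)
    summed = [a + b for a, b in zip(main_out, skip_out)]
    return relu(summed)
```

  • For example, residual_block(x, f, g) with f and g being any two functions that map a feature list to a list of equal length returns the element-wise sum of both branch outputs with negative entries set to zero.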
  • FIG. 2 d shows a possible exemplary configuration for a two-dimensional multiscale feature aggregation block 108 , 109 , 110 of FIG. 2 a.
  • Said exemplary two-dimensional multiscale feature aggregation block can comprise an up-sampling operation (UP) 130 followed by a sequence 131 comprising a two-dimensional convolutional layer (Conv), a batch normalization (BN) and a rectified linear activation function (ReLu) operation, followed by a concatenation (Concat) operation 132 and a final sequence 133 comprising a two-dimensional convolutional layer (Conv), a batch normalization (BN) and a rectified linear activation function (ReLu) operation.
  • FIG. 2 e shows a possible exemplary configuration for the two-dimensional spatial pyramid pooling block 107 of FIG. 2 a .
  • the input to the exemplary two-dimensional spatial pyramid pooling block is directed to five branches 134 , 135 , 136 , 137 and 138 , wherein the four parallel branches 134 , 135 , 136 , 137 each comprise a sequence of a pooling layer (Pool), a convolutional layer (Conv) and an up-sampling operation (Up-sample). The output of said four parallel branches 134 , 135 , 136 , 137 is then merged, via a summation operation (Sum) 139 , with the fifth branch 138 , which corresponds to the input of the two-dimensional spatial pyramid pooling block, to generate the output of the two-dimensional spatial pyramid pooling block, i.e. branch 138 skips the operations of the four parallel branches 134 , 135 , 136 , 137 .
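  • The five-branch structure described above can be sketched for a single-channel feature map as follows. The use of average pooling, the bin sizes (1, 2, 4, 8), nearest-neighbour up-sampling and a single scalar weight standing in for each convolutional layer are all assumptions made to keep the example short; none of them is specified in the text.

```python
def average_pool(fmap, bins):
    """Average-pool an H x W map into a bins x bins grid (H, W divisible by bins)."""
    height, width = len(fmap), len(fmap[0])
    cell_h, cell_w = height // bins, width // bins
    pooled = []
    for by in range(bins):
        row = []
        for bx in range(bins):
            cell = [fmap[y][x]
                    for y in range(by * cell_h, (by + 1) * cell_h)
                    for x in range(bx * cell_w, (bx + 1) * cell_w)]
            row.append(sum(cell) / len(cell))
        pooled.append(row)
    return pooled


def upsample_nearest(grid, height, width):
    """Nearest-neighbour up-sampling of a square grid back to H x W."""
    bins = len(grid)
    return [[grid[y * bins // height][x * bins // width] for x in range(width)]
            for y in range(height)]


def spatial_pyramid_pooling(fmap, bin_sizes=(1, 2, 4, 8),
                            weights=(1.0, 1.0, 1.0, 1.0)):
    """Sum the input with four pooled, weighted and up-sampled copies of itself."""
    height, width = len(fmap), len(fmap[0])
    output = [row[:] for row in fmap]  # fifth branch: the unchanged input
    for bins, weight in zip(bin_sizes, weights):
        pooled = average_pool(fmap, bins)
        # Scalar multiplication is only a stand-in for a learned Conv layer.
        convolved = [[weight * v for v in row] for row in pooled]
        upsampled = upsample_nearest(convolved, height, width)
        for y in range(height):
            for x in range(width):
                output[y][x] += upsampled[y][x]
    return output
```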
  • FIG. 3 a shows an exemplary part of an exemplary possible convolutional neural network architecture that can follow the output(s) 115 , 116 , 117 , 118 of the exemplary encoder 100 shown in FIG. 2 a , i.e. the extracted features/feature tensors 115 , 116 , 117 , 118 become the input(s) for the exemplary three-dimensional decoder 200 shown in FIG. 3 a.
  • the exemplary decoder 200 outputs the final three-dimensional focus probability map 310 along with three other intermediate focus probability maps 280 , 290 , 300 , all of them with shape (N, H, W) with N for example being the number of different focus positions in the focus schedule and with H and W corresponding to height and width dimension sizes of the input image 101 from FIG. 2 a.
  • N may also include additional focus positions that were not specified in the focus schedule but that have been synthesized by the convolutional neural network. Such synthesized/generated focus positions may be used to obtain further additional focus probability maps and therefore to increase the obtainable depth resolution.
  • Each of the input features/feature tensors 201 , 202 , 203 , 204 passes first through a dedicated memory block 240 , 250 , 260 , 270 where the stored features of the past images/previously captured images and previously processed images are retrieved and aligned with the features of the currently processed image, e.g. input image 101 , resulting in a multi-dimensional tensor of shape (C,N,H,W) where C is the number of channels of the feature maps, N the number of different focus distances in the focus schedule, and H and W refer to the spatial resolution of the extracted features, i.e. the height and width dimensions of the feature maps.
  • Said multi-dimensional tensor represents for a given scale the image features extracted from the previously processed images aligned to the image features extracted for the currently processed image.
  • An example for a memory block is shown in FIG. 4 a.
  • three-dimensional (3D) operations or layers or blocks, e.g. a 3D residual convolution block or a 3D spatial pyramid pooling block or a 3D multiscale feature aggregation block, can be understood as acting/operating on the height and width dimensions of a feature tensor, e.g. the height and width dimensions of a feature map, as well as acting/operating on the focus position dimension.
  • Said height and width dimensions may be equal in size or different in size from the size of the height and width dimensions of the input image 101 .
  • one or more three-dimensional (3D) residual convolutional blocks 320 , 350 , 380 , 410 can be applied.
  • In FIG. 3 a only one three-dimensional (3D) residual convolutional block is shown for a given feature scale, but there can be more than one, e.g. five.
  • An example for a three-dimensional (3D) residual convolutional block is shown in FIG. 3 b.
  • the residual convolutional blocks 320 , 350 , 380 , 410 are each followed by a three-dimensional (3D) spatial pyramid pooling block 330 , 360 , 390 , 420 .
  • An example for a three-dimensional (3D) spatial pyramid pooling block is shown in FIG. 3 d.
  • the outputs of the pyramid pooling blocks 330 , 360 , 390 follow, by way of example, two branches:
  • One branch 430 , 440 , 450 wherein an up-sampling (UP) occurs to the size/original spatial resolution of the input image 101 , followed by a sequence of a convolutional layer (Conv), a batch normalization (BN) and a rectified linear activation function (ReLu), a further convolutional layer (Conv) and further batch normalization (BN) operation to reduce the number of channels to one and a final softmax operation to obtain an intermediate focus probability map 280 , 290 , 300 .
  • the other branch 431 , 441 , 451 comprises a three-dimensional (3D) multiscale aggregation block 340 , 370 , 400 , which merges the outputs of the three-dimensional spatial pyramid pooling blocks with the outputs of memory blocks 250 , 260 , 270 .
  • In particular, the output of memory block 250 is merged with the output of three-dimensional spatial pyramid pooling block 330 , the output of memory block 260 is merged with the output of three-dimensional spatial pyramid pooling block 360 , and the output of memory block 270 is merged with the output of three-dimensional spatial pyramid pooling block 390 .
  • An example for a three-dimensional (3D) multiscale aggregation block is shown in FIG. 3 c.
  • the final focus probability map 310 can be obtained by applying a last sequence 460 comprising a convolutional layer (Conv), a batch normalization (BN) and a rectified linear activation function (ReLu), a further convolutional layer (Conv) and further batch normalization (BN) operation and a final softmax operation.
  • FIG. 3 b shows an exemplary configuration for a/the three-dimensional residual convolution block(s) 320 , 350 , 380 , 410 that can be used in the exemplary three-dimensional decoder 200 of FIG. 3 a of an exemplary convolutional neural network architecture.
  • the three-dimensional residual convolution block can comprise two branches 501 , 502 .
  • Exemplary branch 501 comprises a first sequence 503 comprising a three-dimensional convolutional layer (Conv), a batch normalization (BN) and a rectified linear activation function (ReLu) operation and a second sequence 504 comprising a batch normalization (BN) and a rectified linear activation function (ReLu) operation.
  • Exemplary branch 502 only comprises a single sequence of a three-dimensional convolutional layer (Conv) and a batch normalization (BN) operation.
  • the output of said exemplary two branches is merged using a summation (Sum) operation 506 and the output of the three-dimensional residual convolution block is obtained after a final rectified linear activation function (ReLu) operation 507 .
  • FIG. 3 c shows a possible exemplary configuration for a/the three-dimensional multiscale feature aggregation block(s) 340 , 370 , 400 of FIG. 3 a.
  • Said exemplary three-dimensional multiscale feature aggregation block can comprise an up-sampling operation (UP) 508 followed by a sequence 509 comprising a three-dimensional convolutional layer (Conv), a batch normalization (BN) and a rectified linear activation function (ReLu) operation, followed by a concatenation (Concat) operation 510 and a final sequence 511 comprising a three-dimensional convolutional layer (Conv), a batch normalization (BN) and a rectified linear activation function (ReLu) operation.
  • synthetic focus positions can be generated inside a three-dimensional multiscale feature aggregation block.
  • synthetic focus positions may be generated using a three-dimensional up-sampling operation before the concatenation (Concat) operation 510 .
  • FIG. 3 d shows a possible exemplary configuration for a/the three-dimensional spatial pyramid pooling block(s) 330 , 360 , 390 , 420 of FIG. 3 a.
  • the input to the exemplary three-dimensional spatial pyramid pooling block is directed to five branches 512 , 513 , 514 , 515 and 516 , wherein the four parallel branches 512 , 513 , 514 , 515 each comprise a sequence of a pooling layer (Pool), a convolutional layer (Conv) and an up-sampling operation (Up-sample). The output of said four parallel branches 512 , 513 , 514 , 515 is then merged, via a summation operation (Sum) 517 , with the fifth branch 516 , which corresponds to the input of the three-dimensional spatial pyramid pooling block, to generate the output of the three-dimensional spatial pyramid pooling block, i.e. branch 516 skips the operations of the four parallel branches 512 , 513 , 514 , 515 .
  • FIG. 4 a shows a possible exemplary configuration for a/the memory block(s) 240 , 250 , 260 , 270 of the decoder 200 of FIG. 3 a.
  • It can comprise a memory denoted as storage pool 4010 , wherein image features/feature tensors that have been extracted from a predetermined number K of previously captured/previously processed images can be stored.
  • the past images features storage pool 4010 can for example store the features/feature tensors extracted from captured images by the 2D encoder shown in FIG. 2 a of each of the last K images, with K, for example, being a natural number multiple of N, the number of focus positions.
  • the image features 4000 of a/the currently processed image for a given scale which are a three-dimensional tensor of shape (C,H,W), with channel dimension C, height dimension H and width dimension W can also be stored in the storage pool 4010 .
  • the memory block can further comprise a feature alignment block 4020 that can take as input the features/feature tensors stored in the storage pool 4010 , e.g. features/feature tensors extracted from said K previously captured/previously processed images, together with the features/feature tensors extracted from the currently processed image and output a four-dimensional tensor 4030 of shape (C,N,H,W) representing the images features of each focus position/each focus plane aligned to the last, chronologically ordered, focus position, i.e. the focus position of the currently processed image.
  • C again refers to the channel dimension, N to the focus position dimension, H to the height dimension and W to the width dimension of the currently processed image/image feature/image feature tensor/feature map.
  • FIG. 4 b shows an exemplary overview of the configuration of the aforementioned exemplary feature alignment block 4020 .
  • the exemplary feature alignment block 4020 has two inputs, the three-dimensional image features/three-dimensional feature tensors 4040 from a/the currently processed image and a four-dimensional tensor 4050 representing the image features extracted from a predetermined number K of previously captured/previously processed images and that have been stored in a past images features storage pool, e.g. in past images features storage pool 4010 .
  • the exemplary feature alignment block 4020 further comprises at least one feature alignment head 4060 and a feature combination operator 4070 , e.g. a sum operator, to generate as output the multi-dimensional tensor representing the image features of all processed images aligned to the image features of the currently processed image, i.e. the four-dimensional tensor 4030 , 4080 of shape (C,N,H,W) representing the images features of each focus position/each focus plane aligned to the last, chronologically ordered, focus position, i.e. the focus position of the currently processed image.
  • the feature alignment head(s) 4060 divide(s) the above-mentioned inputs into patches of different resolutions, i.e. patches with different sizes in height h_p and width w_p compared to the inputted features, ranging, for example, from patches of size 1 × 1 (meaning that the inputted features remain without change) to H × W (meaning that the whole inputted feature tensor will be treated as one patch).
  • FIG. 4 c shows an exemplary configuration of an exemplary feature alignment head, such as feature alignment head 4060 from feature alignment block 4020 that can be used in the exemplary decoder 200 of the convolutional neural network architecture shown in FIG. 3 a.
  • the input of the current image features/feature tensors 4090 , i.e. the input of image features extracted from the currently processed image, is fed via branch 4091 to a (first) four-dimensional encoding block 4110 that embeds, as previously indicated and as detailed again further below, temporal, spatial and focus position information into the image features 4090 extracted from the currently processed image.
  • the input of the past image features 4100 , i.e. the image features extracted from the previously captured images, e.g. extracted from a predetermined number K of previously captured/previously processed images, is fed via branch 4101 to a separate (second) four-dimensional encoding block 4190 that embeds temporal, spatial and focus position information into the features extracted from the previously captured images.
  • a four-dimensional encoding E may be composed according to the following two equations:
  • with a being a correction constant, for instance a being greater than C, the number of channels or channel dimension size; x, y are spatial pixel coordinates; t is the time, i.e. the temporal position/the point in time/time stamp/time index of the captured image from which the image features were extracted, with t ∈ [0, K − 1], wherein K denotes a/the number of previously captured images, e.g. a/the predetermined number of captured images; d ∈ [0, N − 1] is the focus plane position/focus position/focus position index of a given image to be encoded; N is the total number of images or focus positions, e.g. including derived images, wherein said derived images may be derived by interpolation or extrapolation of images captured according to the focus schedule; and i ∈ [0, C/2] is an index used for dividing the number of channels into even and odd channels for the encoding(s).
  • Said exemplary encoding E, being composed of the exemplary encodings E_{2i,x,y} and E_{2i+1,x,y}, can also take into account a given patch width w_p and patch height h_p resolution, i.e.
  • Said exemplary encodings can be applied by addition to the image features/feature tensors 4090 of the currently processed image F ∈ ℝ^{C,H,W} and to each of the image features/feature tensors 4100 from the previously captured images, i.e. to each of the image features/feature tensors from the past K images PF ∈ ℝ^{K,C,H,W}, to obtain EF ∈ ℝ^{C,H,W} and EPF ∈ ℝ^{K,C,H,W} as follows.
  • the four-dimensional encoding block 4110 can obtain EF ∈ ℝ^{C,H,W} via
  • a sequence 4121 of a two-dimensional convolutional layer (Conv) with batch normalization (BN) is applied to EF to obtain EF^query along the output branch 4120 of the four-dimensional encoding block 4110 .
  • a sequence 4131 of a two-dimensional convolutional layer (Conv) with batch normalization (BN) is applied to EPF to obtain EPF^key along an output branch 4130 of the four-dimensional encoding block 4190 .
  • the outputs from said output branches 4120 and 4130 are fed as inputs into a patch-wise similarity block 4150 .
  • This block 4150 first reshapes the three-dimensional tensor EF^query ∈ ℝ^{C,H,W} into the two-dimensional matrix EF′
  • Sim_{k,i,i′} can be understood as describing how similar a/the patch i of a/the feature tensor of the currently processed image is to a/the patch i′ of a/the feature tensor of the k-th of the K past/previously captured images.
  • EF′ and EPF′ may have a shape of [(H*W)/(w_p*h_p), w_p*h_p*C], with w_p and h_p as the patch width and height respectively.
  • For a patch size of [1,1], the shape would be [H*W, C]. Consequently, index i and index i′ would have a range of [0, (H*W) − 1] and index j a range of [0, C − 1].
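  • The reshaping of a (C,H,W) feature tensor into a patch matrix of shape [(H*W)/(w_p*h_p), w_p*h_p*C] can be sketched as follows. The function name, the nested-list layout [c][y][x] and the channel-major ordering of the values inside each row are assumptions made for illustration; only the resulting shape is taken from the text above.

```python
def to_patches(features, patch_h, patch_w):
    """Flatten a (C, H, W) tensor into a matrix with one row per patch.

    The result has (H*W)/(patch_h*patch_w) rows and patch_h*patch_w*C columns.
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    if height % patch_h or width % patch_w:
        raise ValueError("patch size must divide the feature map size")
    rows = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            # Collect every channel value that falls inside this patch.
            rows.append([features[c][y][x]
                         for c in range(channels)
                         for y in range(top, top + patch_h)
                         for x in range(left, left + patch_w)])
    return rows
```

  • With a patch size of [1,1] every pixel becomes its own row of C values, and with a patch size of [H,W] the whole tensor becomes a single row.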
  • Said normalized similarity scores Sim′ are/represent the output 4151 of the patch-wise similarity block 4150 after processing the inputs received from the branch 4120 following the first four-dimensional (4D) encoding block 4110 that processes the image features extracted from the currently processed image and received from the (first, upper) branch 4130 following the second four-dimensional (4D) encoding block 4190 that processes the image features extracted and stored from previously captured images, e.g. the image features extracted and stored from a/the predetermined number of captured images, e.g. from past K images.
  • the above-described similarity scores are only exemplary, and other similarity functions could also be used to derive a similarity measure of the currently processed image features with previously processed and stored image features. For example, a cosine similarity or a similarity operation using matrix multiplication or any other function that is able to compare two samples could be applied.
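  • As one illustration of such an alternative, a cosine similarity between flattened patches could be computed as sketched below. The function names, the small eps guard against division by zero and the plain nested-list matrices are assumptions of this sketch.

```python
import math


def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps avoids a division by zero for all-zero patches.
    return dot / max(norm_a * norm_b, eps)


def patch_similarity(current_patches, past_patches):
    """Matrix whose entry [i][i2] compares current patch i with past patch i2."""
    return [[cosine_similarity(cur, past) for past in past_patches]
            for cur in current_patches]
```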
  • the other (second, lower) branch 4140 of the second four-dimensional (4D) encoding block 4190 comprises a first sequence 4141 comprising a two-dimensional convolutional layer (Conv) and batch normalization (BN) operation, which gives as output EPF^v ∈ ℝ^{K,C,H,W}, which is then reshaped, by a reshape operation/layer (Reshape) 4142 , to EPF^{v′}
  • Said branch 4140 further comprises a matrix multiplication operation/layer 4143 (Matmul) wherein the normalized similarity scores Sim′ from the patch-wise similarity block 4150 are multiplied with EPF^{v′} to obtain AF′
  • AF′ is then further reshaped to AF ∈ ℝ^{K,C,H,W} , with H and W corresponding to the height and width dimension size of the input image 101 , i.e. the currently processed image.
  • This reshaping may be part of the matrix multiplication operation/layer 4143 (Matmul) or may be performed in a further separate reshape operation/layer (not shown).
  • AF is grouped along the first dimension K, by block/operation/layer 4160 , to group the features corresponding to the same focus position, thus obtaining GAF ∈ ℝ^{N,M,C,H,W} , with M = K/N
  • EPF^α ∈ ℝ^{N,C,H,W} being an example for the at least one multi-dimensional tensor representing the image features of all processed images, i.e. the image features of all processed focus positions, aligned to the image features of the currently processed image.
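  • The grouping of the K aligned feature tensors by focus position and their reduction to one tensor per focus position can be sketched as follows. Two assumptions are made here and are not taken from the text: the images are ordered chronologically with image k captured at focus index k mod N (a cyclic focus schedule), and the M tensors of each group are combined by summation. Features are flattened to plain lists for brevity, and the function name is hypothetical.

```python
def group_and_reduce(aligned, num_focus_positions):
    """Group K aligned feature vectors by focus position and sum each group.

    aligned: list of K flat feature vectors, chronologically ordered.
    Assumes image k was captured at focus index k % N (cyclic schedule).
    Returns N vectors, one per focus position.
    """
    k_total = len(aligned)
    if k_total % num_focus_positions:
        raise ValueError("K must be a multiple of N")
    size = len(aligned[0])
    reduced = [[0.0] * size for _ in range(num_focus_positions)]
    for k, features in enumerate(aligned):
        n = k % num_focus_positions  # focus position index of image k
        for j, value in enumerate(features):
            reduced[n][j] += value   # summation over the M members of group n
    return reduced
```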
  • the herein exemplary described memory blocks and feature alignment heads can be understood as forming a data structure model of a retrieval system in which image features can be stored in a key-value pair structure that can be queried in order to align previously processed and stored image features to the image features of a currently processed image.
  • the value of said key-value pair structure can be understood as being the content of/being represented by the four-dimensional tensor EPF_v ∈ ℝ^{K,C,H,W} of the image features of the previously processed and stored images after applying the sequence 4141 comprising a two-dimensional convolutional layer (Conv) with batch normalization (BN) along the lower branch 4140, and the key can be understood as being the content of/being represented by the four-dimensional tensor EPF_key ∈ ℝ^{K,C,H,W} of the image features of the previously processed and stored images after applying the sequence 4131 comprising a two-dimensional convolutional layer (Conv) with batch normalization (BN) along the upper branch 4130 following the 4D positional encoding block 4190.
  • the query can be understood as being the key of the three-dimensional tensor EF_query ∈ ℝ^{C,H,W}, i.e. the content of/being represented by EF_query along the output branch 4120 of the four-dimensional encoding block 4110 that processed the image features from the currently processed image.
  • the four-dimensional tensor EPF_key ∈ ℝ^{K,C,H,W} represents a set of keys in a retrieval system that are mapped against a query EF_query ∈ ℝ^{C,H,W} to obtain a specific value or content or key from the set of keys that best matches the query.
  • The weights of the convolutional layers applied in branches 4130 and 4140 may differ. Said weights may, for example, have been learned/optimized during training of the convolutional network.
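  • The key-value retrieval analogy can be illustrated with a deliberately simplified sketch in which each stored feature map is scored as a whole against the query, rather than patch-wise as in the described alignment head. The function name `retrieve`, the unit-norm keys and the softmax weighting are assumptions chosen for illustration.

```python
import numpy as np

def retrieve(query, keys, values):
    """Score every key against the query and return matching values.

    query:  (C, H, W)     features of the currently processed image
    keys:   (K, C, H, W)  key features of the stored images
    values: (K, C, H, W)  value features of the stored images
    """
    K = keys.shape[0]
    scores = keys.reshape(K, -1) @ query.reshape(-1)   # (K,) dot products
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax over keys
    best = int(np.argmax(scores))                      # hard retrieval
    blended = np.tensordot(weights, values, axes=1)    # soft retrieval, (C, H, W)
    return best, weights, blended

# Toy store with K=4 entries; keys are scaled to unit norm so that a key
# always matches itself at least as well as any other key.
rng = np.random.default_rng(2)
keys = rng.normal(size=(4, 2, 3, 3))
keys /= np.linalg.norm(keys.reshape(4, -1), axis=1).reshape(4, 1, 1, 1)
values = rng.normal(size=(4, 2, 3, 3))
query = keys[1].copy()

best, weights, blended = retrieve(query, keys, values)
```

  • The hard retrieval returns the index of the best-matching key, while the soft retrieval blends all stored values with weights derived from the scores, which is closer in spirit to multiplying normalized similarity scores with the value tensor.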
  • FIG. 5 shows an exemplary flow chart for a method 800 for extracting depth information from a plurality of images taken by a camera at different focus positions, which can comprise one, some or all of the following steps.
  • Extracting 802, by a machine learning algorithm comprising a convolutional neural network, image features of a predetermined number of captured images and storing said extracted image features, said convolutional neural network, for example, comprising a configuration as exemplary described in FIGS. 2 a , 2 b , 2 c , 2 d , 2 e , 3 a , 3 b , 3 c , 3 d , 4 a , 4 b , 4 c , and said storing may comprise storing said features, for example, inside a memory block 240 , 250 , 260 , 270 , e.g. in feature storage pool 4010 .
  • Said processing comprising extracting by the machine learning algorithm image features from the currently processed image and storing the extracted image features.
  • Said processing further comprising aligning the image features stored from the previously captured images with the image features of the currently processed image, wherein, for example, said alignment is carried out by a feature alignment head of a memory block as exemplary described in FIGS. 3 a , 4 a , 4 b , 4 c.
  • Said processing further comprising generating at least one multi-dimensional tensor representing the image features of all processed images aligned to the image features of the currently processed image, as for example the tensor EPF ⁇ N,C,H,W as described above.
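  • The control flow of these steps can be sketched as a simple streaming loop. Both `extract_features` and `align` are hypothetical placeholders standing in for the convolutional encoder and the feature alignment head; they do not implement the described network, and the parameter name `warmup` for the predetermined number of images is likewise an assumption.

```python
import numpy as np

def extract_features(image, channels=4):
    """Placeholder encoder: returns a (C, H, W) array derived from the image."""
    return np.stack([image * (c + 1) for c in range(channels)])

def align(stored, current):
    """Placeholder alignment head: returns the stored features unchanged."""
    return stored

def process_stream(images, warmup):
    """Extract and store features, then emit one aligned tensor per image.

    The first `warmup` images are only encoded and stored. For every later
    image, all stored features are aligned to the current features and
    stacked together with them into a tensor of shape (N, C, H, W).
    """
    pool = []      # feature storage pool
    tensors = []
    for index, image in enumerate(images):
        current = extract_features(image)
        if index >= warmup:
            aligned = [align(features, current) for features in pool]
            tensors.append(np.stack(aligned + [current]))
        pool.append(current)  # store the features of the current image
    return tensors

# Toy run: five constant 3x3 "images", two of which are warm-up images.
images = [np.full((3, 3), float(i)) for i in range(5)]
tensors = process_stream(images, warmup=2)
```

  • In this sketch the first dimension of each emitted tensor grows with every processed image, reflecting that the tensor represents the features of all images processed so far.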
  • FIG. 6 shows a schematic example of a possible training protocol for machine learning algorithm 630 comprising a convolutional neural network with an architecture as exemplary described above.
  • a training sample comprising a plurality/a sequence 600 of captured images focused at different focus positions according to a focus schedule 620 for a plurality of different scenes from the real physical world can be processed according to the steps described previously to obtain a sequence 640 of focus probability maps, one for each image after a predetermined number of captured images have been processed.
  • the captured images may have been taken with the same camera or with different cameras.
  • the herein described method is independent of the type of camera, i.e. is not restricted to the use of a specific type of camera.
  • the scenes captured in the sequence 600 of images of the training sample can be static or dynamic, i.e. there can be movement between images, e.g. due to movement of objects or subjects in the scene and/or due to movement of the camera, e.g. vibrations due to the camera being held in the hand of a user or due to the camera changing its position.
  • the obtained focus probability maps are remapped 670 to real distances using the focus positions from the known focus schedule 620 .
  • the result is a sequence of predicted/estimated depth maps which are then, along with the sequence of ground truth depth maps 610 , i.e. known/expected depth maps, used as inputs to the loss function 660 .
  • the loss function 660 is a measure of how different the estimated/predicted depth maps are with respect to the expected known ground truth depth maps.
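  • One way to picture the remapping and the loss is the sketch below. It assumes that the focus probability maps are arranged as an (N, H, W) array holding, per pixel, a probability for each of the N focus positions, that the remapping is the probability-weighted average of the known focus distances, and that the loss is a mean squared error. These are illustrative assumptions; the text does not fix the exact remapping or the form of the loss function 660.

```python
import numpy as np

def remap_to_depth(prob_maps, focus_positions):
    """Turn per-pixel focus probabilities into distances.

    prob_maps:       (N, H, W) non-negative scores per focus position
    focus_positions: (N,)      known focus distances from the focus schedule
    """
    probs = prob_maps / prob_maps.sum(axis=0, keepdims=True)
    return np.tensordot(focus_positions, probs, axes=1)  # (H, W)

def depth_loss(predicted, ground_truth):
    """Mean squared error between predicted and ground truth depth maps."""
    return float(np.mean((predicted - ground_truth) ** 2))

# Toy focus schedule with three focus distances (arbitrary units).
focus_positions = np.array([0.5, 1.0, 2.0])

# Every pixel is fully confident in the second focus position.
one_hot = np.zeros((3, 2, 2))
one_hot[1] = 1.0
depth_sharp = remap_to_depth(one_hot, focus_positions)

# Every pixel is undecided between the three focus positions.
uniform = np.ones((3, 2, 2))
depth_uniform = remap_to_depth(uniform, focus_positions)

ground_truth = np.full((2, 2), 1.0)
loss_sharp = depth_loss(depth_sharp, ground_truth)
loss_uniform = depth_loss(depth_uniform, ground_truth)
```

  • A confident, correct probability map reproduces the ground truth distance and yields zero loss, while an undecided map yields the mean of the focus distances and a positive loss.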
  • the training of the machine learning algorithm 630 comprising a convolutional neural network is run until the loss function has reached a desired/specified minimum and the optimal model parameters of the convolutional neural network have been determined.
  • the minimization of the loss function may be achieved by optimization techniques such as a gradient descent algorithm.
  • Other optimization techniques, e.g. simulated annealing, genetic algorithms or Markov-chain-Monte-Carlo algorithms, may also be applied to minimize the loss function and to determine the best model parameters of the machine learning algorithm/convolutional neural network from the training.
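  • The idea of running training until the loss reaches a specified minimum can be illustrated with plain gradient descent on a toy model. The two-parameter linear model, the learning rate and the tolerance below are assumptions chosen to keep the example small; they stand in for the convolutional neural network and its training settings.

```python
import numpy as np

def train(x, y, lr=0.1, tol=1e-8, max_steps=10_000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error.

    Training stops as soon as the loss falls below `tol`, mirroring the
    idea of running until a desired minimum of the loss is reached.
    """
    w, b = 0.0, 0.0
    loss = float("inf")
    for _ in range(max_steps):
        error = w * x + b - y
        loss = float(np.mean(error ** 2))
        if loss < tol:
            break
        # Gradients of the mean squared error with respect to w and b.
        w -= lr * 2.0 * float(np.mean(error * x))
        b -= lr * 2.0 * float(np.mean(error))
    return w, b, loss

# Toy data generated from known parameters w = 2.0 and b = 0.5.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 0.5
w, b, final_loss = train(x, y)
```

  • The same stopping rule applies regardless of the optimizer; only the parameter update inside the loop would change for the alternative techniques mentioned above.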
  • visual cues can be used to better derive a semantically correct depth map.
  • the convolutional neural network can be trained to recognize that when an object occludes another object, the occluding object is closer to the camera than the occluded object.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
  • Automatic Focus Adjustment (AREA)
  • Measurement Of Optical Distance (AREA)
  • Traffic Control Systems (AREA)
  • Studio Devices (AREA)
US17/663,643 2021-05-20 2022-05-16 Method for depth estimation for a variable focus camera Pending US20220383525A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP21382458.4A EP4092572A1 (en) 2021-05-20 2021-05-20 Method for depth estimation for a variable focus camera
EP21382458.4 2021-05-20

Publications (1)

Publication Number Publication Date
US20220383525A1 (en) 2022-12-01

Family

ID=76197385

Family Applications (1)

Application Number Title Priority Date Filing Date
US17/663,643 Pending US20220383525A1 (en) 2021-05-20 2022-05-16 Method for depth estimation for a variable focus camera

Country Status (9)

Country Link
US (1) US20220383525A1 (zh)
EP (1) EP4092572A1 (zh)
JP (1) JP7449977B2 (zh)
KR (1) KR20220157329A (zh)
CN (1) CN115375532A (zh)
AU (1) AU2022203080B2 (zh)
CA (1) CA3157444A1 (zh)
CL (1) CL2022001304A1 (zh)
TW (1) TWI791405B (zh)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9530213B2 (en) * 2013-01-02 2016-12-27 California Institute Of Technology Single-sensor system for extracting depth information from image blur
TWI554103B (zh) * 2014-11-13 2016-10-11 聚晶半導體股份有限公司 影像擷取裝置及其數位變焦方法
TWI640199B (zh) * 2016-06-24 2018-11-01 聚晶半導體股份有限公司 影像擷取裝置及其攝影構圖的方法
US10755428B2 (en) * 2017-04-17 2020-08-25 The United States Of America, As Represented By The Secretary Of The Navy Apparatuses and methods for machine vision system including creation of a point cloud model and/or three dimensional model
CN109803090B (zh) * 2019-01-25 2021-09-28 睿魔智能科技(深圳)有限公司 无人拍摄自动变焦方法及系统、无人摄像机及存储介质
CN110400341B (zh) * 2019-07-03 2021-09-21 北京华捷艾米科技有限公司 一种3d结构光深度相机和移动终端
CN112102388B (zh) * 2020-09-18 2024-03-26 中国矿业大学 基于巡检机器人单目图像获取深度图像的方法及装置

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220103743A1 (en) * 2019-01-18 2022-03-31 Kandao Technology Co., Ltd. Picture focusing method, apparatus, terminal, and corresponding storage medium
US11683583B2 (en) * 2019-01-18 2023-06-20 Kandao Technology Co., Ltd. Picture focusing method, apparatus, terminal, and corresponding storage medium
US20220391638A1 (en) * 2021-06-08 2022-12-08 Fanuc Corporation Network modularization to learn high dimensional robot tasks
US20220388162A1 (en) * 2021-06-08 2022-12-08 Fanuc Corporation Grasp learning using modularized neural networks
US11809521B2 (en) * 2021-06-08 2023-11-07 Fanuc Corporation Network modularization to learn high dimensional robot tasks
US12017355B2 (en) * 2021-06-08 2024-06-25 Fanuc Corporation Grasp learning using modularized neural networks
US20230196750A1 (en) * 2021-12-20 2023-06-22 International Business Machines Corporation Unified framework for multigrid neural network architecture
US11983920B2 (en) * 2021-12-20 2024-05-14 International Business Machines Corporation Unified framework for multigrid neural network architecture
CN116386027A (zh) * 2023-04-03 2023-07-04 南方海洋科学与工程广东省实验室(珠海) 一种基于人工智能算法的海洋三维旋涡识别系统及方法

Also Published As

Publication number Publication date
AU2022203080A1 (en) 2022-12-08
AU2022203080B2 (en) 2024-02-22
CL2022001304A1 (es) 2023-01-13
KR20220157329A (ko) 2022-11-29
JP2022179397A (ja) 2022-12-02
EP4092572A1 (en) 2022-11-23
TW202247100A (zh) 2022-12-01
JP7449977B2 (ja) 2024-03-14
CA3157444A1 (en) 2022-11-20
CN115375532A (zh) 2022-11-22
TWI791405B (zh) 2023-02-01

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

AS Assignment

Owner name: WOOPTIX S.L., SPAIN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SABATO, CERUSO;GARCIA, RICARDO OLIVIA;RAMOS, JOSE MANUEL RODRIGUEZ;SIGNING DATES FROM 20220607 TO 20220712;REEL/FRAME:060709/0892

AS Assignment

Owner name: WOOPTIX S.L., SPAIN

Free format text: CHANGE OF ADDRESS OF ASSIGNEE;ASSIGNOR:WOOPTIX S.L.;REEL/FRAME:065834/0710

Effective date: 20231207