WO2020118093A1 - Neural network focusing for imaging systems - Google Patents

Neural network focusing for imaging systems

Info

Publication number: WO2020118093A1 (PCT/US2019/064755)
Authority: WO (WIPO/PCT)
Prior art keywords: image, network, neural, focusing method, based focusing
Application number: PCT/US2019/064755
Other languages: French (fr)
Inventors: David Jones Brady, Chengyu Wang
Original assignee: Duke University
Application filed by Duke University
Publication of WO2020118093A1

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N17/00 - Diagnosis, testing or measuring for television systems or their details
    • H04N17/002 - Diagnosis, testing or measuring for television systems or their details for television cameras
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/60 - Image enhancement or restoration using machine learning, e.g. neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/73 - Deblurring; Sharpening
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 - Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 - Control of cameras or camera modules
    • H04N23/617 - Upgrading or updating of programs or applications for camera control
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 - Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 - Control of cameras or camera modules
    • H04N23/67 - Focus control based on electronic image sensor signals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10141 - Special mode during image acquisition
    • G06T2207/10148 - Varying focus
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]


Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Studio Devices (AREA)

Abstract

Aspects of the present disclosure describe systems, methods and structures providing neural network focusing for imaging systems that advantageously may identify regions or targets of interest within a scene and selectively focus on such target(s); process subsampled or compressively sampled data directly thereby substantially reducing computational requirements as compared with prior-art contemporary systems, methods, and structures; and may directly evaluate image quality on image data - thereby eliminating need for focus specific hardware.

Description

NEURAL NETWORK FOCUSING FOR IMAGING SYSTEMS
TECHNICAL FIELD
[0001] This disclosure relates generally to systems, methods, structures, and techniques pertaining to neural network focusing for imaging systems resulting in enhanced image quality.
BACKGROUND
[0002] As is known in the imaging arts, focus - and the ability of an imaging system to quickly and accurately focus - is a primary determinant of image quality. Typically, focus is determined and subsequently controlled by optimizing image quality metrics including contrast and/or weighted Laplacian measures. Such metrics are described, for example, by Yao, Y., B. Abidi, N. Doggaz and M. Abidi in an article entitled "Evaluation of Sharpness Measures and Search Algorithms for the Auto-Focusing of High-Magnification Images," which appeared in Defense and Security Symposium, SPIE (2006). Strategies for optimizing such metrics are described by Jie, H., Z. Rongzhen and H. Zhiliang in an article entitled "Modified Fast Climbing Search Auto-Focus Algorithm with Adaptive Step Size Searching Technique for Digital Camera," which appeared in IEEE Transactions on Consumer Electronics 49(2): 257-262 (2003). Alternatively, and/or additionally, secondary image paths or phase detection pixels may be used to determine and adjust focus quality.
[0003] Unfortunately - and as is further known in the art - these strategies and techniques suffer from several shortcomings: the metrics are ad hoc and not provably optimal; they are not specific to scene content and may focus on regions with little content or interest; calculation of the metric(s) is sometimes computationally intensive/expensive; and, in the case of phase detection systems, specialized hardware is required.
SUMMARY
[0004] An advance in the art is made according to aspects of the present disclosure directed to neural network focusing for imaging systems that advantageously overcomes these - and other - focusing difficulties associated with contemporary imaging systems. As will be shown and described, a neural network according to aspects of the present disclosure is trained based on known best-focus scenes, thereby optimizing focus/imaging performance and resulting quality.
[0005] Advantageously - and in sharp contrast to the prior art - the neural network employed in imaging systems according to aspects of the present disclosure may identify regions or targets of interest within a scene and selectively focus on such target(s). Additionally, such neural network may process subsampled or compressively sampled data directly thereby substantially reducing computational requirements as compared with prior-art contemporary systems, methods, and structures. Finally, such neural network may directly evaluate image quality on image data - thereby eliminating need for focus specific hardware.
BRIEF DESCRIPTION OF THE DRAWING
[0006] A more complete understanding of the present disclosure may be realized by reference to the accompanying drawing in which:
[0007] FIG. 1 shows a schematic ray diagram of an illustrative optical arrangement according to aspects of the present disclosure;
[0008] FIG. 2 shows a schematic diagram of an illustrative six layer neural network according to aspects of the present disclosure;
[0009] FIG. 3 shows a pair of images captured at an optimal focus (Sharp Image - top) position and an off position (Blurred Image - bottom) according to aspects of the present disclosure;
[0010] FIG. 4(A), FIG. 4(B), FIG. 4(C), FIG. 4(D), FIG. 4(E), and FIG. 4(F) are a series of images illustrating model calibration using images captured with different focal positions and displacements from the focal position according to aspects of the present disclosure;
[0011] FIG. 5 shows a schematic diagram of an illustrative seven layer neural network for predicting distance according to aspects of the present disclosure;
[0012] FIG. 6 shows a schematic diagram of an illustrative discriminant six layer neural network according to aspects of the present disclosure;
[0013] FIG. 7(A), FIG. 7(B), and FIG. 7(C) is a series of images illustrating aspects of the present disclosure in which FIG. 7(A) is a defocused image captured at the beginning of the process, then the network predicts movement and generates two possible positions illustrated in FIG. 7(B) and FIG. 7(C) - by comparison, the image of FIG. 7(C) is better than that of FIG. 7(B) and the discriminant predicts FIG. 7(C) is a focused image;
[0014] FIG. 8(A), FIG. 8(B), and FIG. 8(C) is a series of images illustrating aspects of the present disclosure in which FIG. 8(A) is a defocused image captured at the beginning of the process, then the network predicts movement and generates two possible positions illustrated in FIG. 8(B) and FIG. 8(C) - by comparison, the image of FIG. 8(C) is better than that of FIG. 8(B) and the discriminant predicts FIG. 8(C) is a focused image;
[0015] FIG. 9(A), FIG. 9(B), and FIG. 9(C) is a series of images illustrating aspects of the present disclosure in which FIG. 9(A) is a defocused image captured at the beginning of the process, then the network predicts movement and generates two possible positions illustrated in FIG. 9(B) and FIG. 9(C) - by comparison, the image of FIG. 9(C) is better than that of FIG. 9(B) and the discriminant predicts FIG. 9(C) is a focused image;
[0016] FIG. 10(A), FIG. 10(B), FIG. 10(C), FIG. 10(D), and FIG. 10(E) is a series of images illustrating aspects of the present disclosure in which FIG. 10(A) is a defocused image captured at the beginning of the process, then the network predicts movement and generates two possible positions illustrated in FIG. 10(B) and FIG. 10(C) - by comparison, the image of FIG. 10(C) is better than that of FIG. 10(B), but the discriminant predicts FIG. 10(C) is still not a focused image, so a new movement is predicted based on FIG. 10(C) and two more images are captured - by comparison, image FIG. 10(E) is better than FIG. 10(D), and the discriminant predicts image FIG. 10(E) is a focused image.
[0017] FIG. 11 shows a schematic diagram of an illustrative seven layer neural network for compressively sensing according to aspects of the present disclosure;
[0018] FIG. 12(A), FIG. 12(B), FIG. 12(C) and FIG. 12(D) is a series of images illustrating aspects of the present disclosure in which FIG. 12(A) is a defocused image which is compressively sensed to a four channel tensor by a random matrix wherein each channel is shown in FIG. 12(B), the network leans from these images, any movement required and image FIG. 12(C) and FIG 12(D) is classified focused by the discriminator;
[0019] FIG. 13 shows a schematic diagram of an illustrative neural network for salience detection according to aspects of the present disclosure;
[0020] FIG. 14(A), FIG. 14(B), and FIG. 14(C) is a series of images illustrating aspects of the present disclosure in which FIG. 14(A) is an original image, FIG. 14(B) shows a generated defocused image and FIG. 14(C) shows the label;
[0021] FIG. 15(A), FIG. 15(B), and FIG. 15(C) is a series of images illustrating aspects of the present disclosure in which FIG. 15(A) shows the first defocused image captured by a camera, FIG. 15(B) shows the predicted saliency map, and FIG. 15(C) shows the final output image with bounding box showing the block that is selected as the target;
[0022] FIG. 16(A), FIG. 16(B), and FIG. 16(C) is a series of images illustrating aspects of the present disclosure in which FIG. 16(A) shows the first defocused image captured by a camera, FIG. 16(B) shows the predicted saliency map, and FIG. 16(C) shows the final output image with bounding box showing the block that is selected as the target; and
[0023] FIG. 17(A), FIG. 17(B), and FIG. 17(C) is a series of images illustrating aspects of the present disclosure in which FIG. 17(A) shows the first defocused image captured by a camera, FIG. 17(B) shows the predicted saliency map, and FIG. 17(C) shows the final output image with bounding box showing the block that is selected as the target.
[0024] The illustrative embodiments are described more fully by the Figures and detailed description. Embodiments according to this disclosure may, however, be embodied in various forms and are not limited to specific or illustrative embodiments described in the drawing and detailed description.
DESCRIPTION
[0025] The following merely illustrates the principles of the disclosure. It will thus be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the disclosure and are included within its spirit and scope.
[0026] Furthermore, all examples and conditional language recited herein are intended to be only for pedagogical purposes to aid the reader in understanding the principles of the disclosure and the concepts contributed by the inventor(s) to furthering the art and are to be construed as being without limitation to such specifically recited examples and conditions.
[0027] Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosure, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
[0028] Thus, for example, it will be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the disclosure.
[0029] Unless otherwise explicitly specified herein, the FIGs comprising the drawing are not drawn to scale.
[0030] Finally, it is noted that the use herein of any of the following "/", "and/or", and "at least one of", for example, in the cases of "A/B", "A and/or B" and "at least one of A and B", is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B). As a further example, in the cases of "A, B, and/or C" and "at least one of A, B, and C", such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C). This may be extended, as readily apparent by one of ordinary skill in this and related arts, for as many items listed.
[0031] By way of some additional background, we note that a primary task of camera focus control is to set a sensor plane at which light from an object converges. FIG. 1 shows a schematic ray diagram of an illustrative optical arrangement according to aspects of the present disclosure.
[0032] As shown in FIG. 1, the light from the object converges at Z_0 on the optical axis, while the sensor plane can move between Z_min and Z_max. As such, focus control requires finding Z_0 and moving the sensor plane accordingly. To achieve this, existing auto-focus algorithms set the sensor plane at different positions between Z_min and Z_max and determine its optimal position by comparing the captured image quality with respect to one or more of the metrics previously mentioned.
[0033] According to aspects of the present disclosure - and in further contrast to the prior art - instead of capturing a large number of images at different positions to search for an optimal one, only two gray-scale images are captured or converted from RGB images, denoted as I_Z1 and I_Z2, wherein the subscripts represent the position of the sensor plane. Note that as used herein, 'image' describes a gray-scale image obtained from a camera or other imaging apparatus/structure.
[0034] While Z_1 and Z_2 can be any position between Z_min and Z_max, the displacement between Z_1 and Z_2 is fixed and defined by the following relationship:

Z_1 - Z_2 = Z_d.
[0035] Then two image blocks, B_Z1 and B_Z2, containing the object of interest are cropped from the two images, I_Z1 and I_Z2, at the same position. The image blocks are then normalized to be between 0 and 1:

P = (B - B_min) / (B_max - B_min)
[0036] The neural-network-based focus position prediction algorithm according to aspects of the present disclosure - denoted as f_Zd - learns from the two normalized blocks the displacement between Z_1 and Z_0, according to:

f_Zd(P_Z1, P_Z2) -> Z_0 - Z_1
[0037] Consequently - and as will now be readily understood by those skilled in the art - the optimal focus position, Z_0, can be achieved by moving the sensor plane accordingly.
Neural Network Structure
[0038] FIG. 2 shows a schematic diagram of an illustrative six layer neural network according to aspects of the present disclosure. With reference to that figure we note that such neural network according to the present disclosure receives as input two normalized image blocks, P_Z1 and P_Z2, and predicts the displacement between Z_1 and Z_0. As viewed from left to right, the first two arrows represent convolutional layers and the subsequent four arrows represent fully-connected layers. Two image blocks of size 256 x 256 are first stacked, forming a 256 x 256 x 2 input tensor.
[0039] As may be further observed from the figure, the illustrative neural network includes six consecutive layers. The input tensor is first applied (fed) into a convolutional layer. Illustrative filter size, number of filters and stride for this layer are 8, 64 and 8 respectively, and the activation function for this layer is rectified linear unit (ReLu). The dimension of the output of this layer is 32 X 32 X 64.
[0040] The second illustrative layer is another convolutional layer having illustrative filter size, number of filters and stride being 4, 16 and 4 respectively. The activation function for this layer is ReLu, and the dimension of the output is 8 X 8 X 16. The four subsequent illustrative layers are fully-connected layers, and the dimensions of the four fully-connected layers are 1024, 512, 10 and 1 respectively. ReLu activation is applied to the first three fully-connected layers.
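For illustration only, a minimal PyTorch sketch of such a six-layer network is given below, following the illustrative layer parameters above (an 8/64/8 convolution, a 4/16/4 convolution, then fully-connected layers of dimension 1024, 512, 10 and 1). The class name and the choice of PyTorch are assumptions for illustration, not part of the disclosure.

```python
import torch
import torch.nn as nn

class FocusDisplacementNet(nn.Module):
    """Sketch of f_Zd: two stacked 256x256 normalized blocks in, predicted Z_0 - Z_1 out."""
    def __init__(self):
        super().__init__()
        # 256x256x2 -> 32x32x64 (filter size 8, 64 filters, stride 8, ReLU)
        self.conv1 = nn.Conv2d(2, 64, kernel_size=8, stride=8)
        # 32x32x64 -> 8x8x16 (filter size 4, 16 filters, stride 4, ReLU)
        self.conv2 = nn.Conv2d(64, 16, kernel_size=4, stride=4)
        # four fully-connected layers of dimension 1024, 512, 10 and 1
        self.fc1 = nn.Linear(8 * 8 * 16, 1024)
        self.fc2 = nn.Linear(1024, 512)
        self.fc3 = nn.Linear(512, 10)
        self.fc4 = nn.Linear(10, 1)
        self.relu = nn.ReLU()

    def forward(self, x):          # x: (batch, 2, 256, 256) stacked P_Z1, P_Z2
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = x.flatten(start_dim=1)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.relu(self.fc3(x))
        return self.fc4(x)         # predicted displacement Z_0 - Z_1 (no activation)
```

In use, the two normalized blocks P_Z1 and P_Z2 would be stacked along the channel dimension to form the (2, 256, 256) input described above.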
Defocus Model
[0041] At this point we note that focus control is an imaging-system-specific problem, so each imaging system (i.e., camera) requires a unique network trained on data for that specific imaging system. Advantageously, systems, methods, and structures according to aspects of the present disclosure may include one or more mechanism(s) to construct a defocus model for the specific imaging system, and the model can be used to generate training data from existing image datasets.
[0042] FIG. 3 shows a pair of images captured at an optimal focus (Sharp Image - top) position and an off position (Blurred Image - bottom) according to aspects of the present disclosure. The defocus process can be modeled as image blur followed by image scaling defined by:
I_Z = Gamma( imscale( Gamma^-1(I_Z0) * h, a ) ),

where h is the image blur filter, a is the image scaling factor, and Gamma^-1(x) is the inverse process of gamma correction defined by:

Gamma^-1(x) = x^γ.
[0043] This inverse process compensates for the non-linear operation, i.e. gamma correction, in the image signal processor (ISP), where γ is predefined for a specific camera. The present disclosure assumes h is a circular averaging filter, which is uniquely determined by the averaging radius, r. Both r and a are determined by the optimal focus position, Z_0, and the displacement from the optimal focus position, Z - Z_0:

r, a = g(Z_0, Z - Z_0)
[0044] To calibrate the model, systems, methods, and structures according to aspects of the present disclosure require images captured at different focal positions and displacements from the focal position, i.e. Z_0 and Z - Z_0. An object is placed at different distances, d_0, in front of the camera (see FIG. 1), and each d_0 corresponds to a unique Z_0 and I_Z0.
[0045] For each Z_0, by moving the sensor plane between Z_min and Z_max, images with different Z can be captured, and each pair of I_Z and I_Z0 is used to calibrate g(Z_0, Z - Z_0). With reference to FIG. 4(A), FIG. 4(B), FIG. 4(C), FIG. 4(D), FIG. 4(E), and FIG. 4(F), the overall illustrative process may be described as follows:
1. First, an image block from Gamma^-1(I_Z0) is cropped (see the bounding box in FIG. 4(A)), denoted as J_Z0 (see FIG. 4(B)). J_Z0 should lie within an object whose pixels all have the same distance d_0.
2. For each r, a blurred image is generated by convolving J_Z0 with h(r) (see FIG. 4(C)).
3. A block K_{Z0,h(r)} is cropped from the blurred image (see the bounding boxes in FIG. 4(C) and FIG. 4(D)). This step removes pixels that would have been blurred by pixels outside J_Z0.
4. For each a, a scaled image is generated from K_{Z0,h(r)} (FIG. 4(E)).
5. A score for each pair of r and a is calculated by computing the maximum normalized cross-correlation between the scaled K_{Z0,h(r)} and all blocks of the same dimension in Gamma^-1(I_Z) (see FIG. 4(F)).
6. The r and a corresponding to the maximum score are the calibrated model parameters.
[0046] After calibration, a defocused image can be generated by:

d_Z = Gamma( imscale( Gamma^-1(d_Z0) * h(r(Z_0, Z - Z_0)), a(Z_0, Z - Z_0) ) ),

where d_Z0 is a clear image from an existing image dataset which is assumed to be the image at Z_0, d_Z is the simulated defocused image with respect to Z_0 and Z, and Gamma(x) is defined by:

Gamma(x) = x^(1/γ).
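For illustration, a minimal Python sketch of this simulated-defocus step follows, assuming a grayscale image with values in [0, 1], a circular (pillbox) averaging filter h(r), and a known camera gamma value; the function names and the use of SciPy/OpenCV are illustrative assumptions, not part of the disclosure.

```python
import numpy as np
import cv2
from scipy.ndimage import convolve

def pillbox(radius):
    """Circular averaging filter h(r)."""
    size = 2 * int(np.ceil(radius)) + 1
    y, x = np.mgrid[:size, :size] - size // 2
    kernel = (x ** 2 + y ** 2 <= radius ** 2).astype(np.float64)
    return kernel / kernel.sum()

def simulate_defocus(d_z0, r, a, gamma):
    """d_Z = Gamma(imscale(Gamma^-1(d_Z0) * h(r), a)) for a grayscale image in [0, 1]."""
    linear = d_z0 ** gamma                                  # inverse gamma correction: x^gamma
    blurred = convolve(linear, pillbox(r), mode='nearest')  # image blur with h(r)
    h, w = blurred.shape
    scaled = cv2.resize(blurred.astype(np.float32),         # image scaling by factor a
                        (int(w * a), int(h * a)), interpolation=cv2.INTER_LINEAR)
    return np.clip(scaled, 0.0, 1.0) ** (1.0 / gamma)       # re-apply gamma: x^(1/gamma)
```

The same routine can also serve the calibration search above, where candidate (r, a) pairs are scored by normalized cross-correlation between the simulated block and the captured defocused image.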
Training Data
[0047] Once the model is calibrated, training data can be generated from existing image dataset(s). First, the number of training data, N, and Z_d are chosen. For each grayscale image d_i:
1. An optimal focus position, Z_{i,0}, and a sensor plane position, Z_{i,1}, are randomly generated:

Z_{i,0} ∈ [Z_min, Z_max], Z_{i,1} ∈ [Z_min, Z_max - Z_d]
2. Two defocused images, d_{i,Z_1} and d_{i,Z_2}, separated by the fixed displacement Z_d, are generated using the calibrated defocus model.
3. Two image blocks, B_{i,Z_1} and B_{i,Z_2}, containing the object of interest are cropped from the two images, d_{i,Z_1} and d_{i,Z_2}, at the same position. The image blocks are then normalized to be between 0 and 1 according to the following:

P = (B - B_min) / (B_max - B_min)
[0048] Each training datum comprises two input images, d_{i,Z_1} and d_{i,Z_2}, and the corresponding label Z_{i,0} - Z_{i,1}.
[0049] The neural network is trained on the simulated dataset by minimizing the mean square error between the predicted displacement, f_Zd(P_{i,Z_1}, P_{i,Z_2}), and the label, Z_{i,0} - Z_{i,1}, over the N training samples.
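A hedged sketch of this training step follows, assuming PyTorch tensors of pre-generated, stacked block pairs and displacement labels; the Adam optimizer, batch size, and learning rate are assumptions not specified by the disclosure. The model argument could be an instance of the FocusDisplacementNet sketch shown earlier.

```python
import torch
import torch.nn as nn

def train_fzd(model, blocks, labels, epochs=20, lr=1e-4, batch_size=32):
    """Train f_Zd by minimizing the mean square error of the predicted displacement.

    blocks: float tensor (N, 2, 256, 256) of stacked normalized block pairs P_Z1, P_Z2
    labels: float tensor (N, 1) holding the displacements Z_{i,0} - Z_{i,1}
    """
    dataset = torch.utils.data.TensorDataset(blocks, labels)
    loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for pair, displacement in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(pair), displacement)
            loss.backward()
            optimizer.step()
    return model
```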
Bayer Data
[0050] With a digital camera, the data directly captured from the scene is the raw data in Bayer format. For cameras with access to their raw data, systems, methods, and structures according to aspects of the present disclosure may advantageously be modified to operate on raw data. To calibrate the model, each raw frame, a one-channel intensity map, is first interpolated to generate an RGB image. Then the same calibration process is applied. Note that we assume that the defocus model is the same for all three channels.
[0051] To generate simulated training data, each raw frame is first interpolated. Then the defocus model is applied to each channel, and the defocused raw data is generated by mosaicking the defocused RGB image. The same network structure and training strategy can be applied to train the model.
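A minimal sketch of this interpolate/re-mosaic step follows, assuming an RGGB Bayer pattern and OpenCV demosaicing; the pattern, the function names, and the library choice are illustrative assumptions, not part of the disclosure.

```python
import numpy as np
import cv2

def demosaic_rggb(raw):
    """Interpolate a one-channel Bayer frame (uint8/uint16, RGGB assumed) to an RGB image."""
    return cv2.cvtColor(raw, cv2.COLOR_BayerRG2RGB)

def remosaic_rggb(rgb):
    """Re-mosaic an RGB image (e.g., after per-channel defocus) back to one-channel RGGB raw."""
    raw = np.empty(rgb.shape[:2], dtype=rgb.dtype)
    raw[0::2, 0::2] = rgb[0::2, 0::2, 0]   # R
    raw[0::2, 1::2] = rgb[0::2, 1::2, 1]   # G
    raw[1::2, 0::2] = rgb[1::2, 0::2, 1]   # G
    raw[1::2, 1::2] = rgb[1::2, 1::2, 2]   # B
    return raw
```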
Multi-Step Focus
[0052] Advantageously according to aspects of the present disclosure - and instead of analyzing multiple images to find the optimal focus position - one defocused image is sufficient to determine the absolute distance, abs(Z - Z_0), between the optimal focus position, Z_0, and the current sensor plane position, Z.
[0053] According to aspects of the present disclosure, a multi-step focus mechanism may be employed. Operationally, when a defocused image, I_Z, is captured, a neural network, denoted as f_distance, is applied to find the distance between the optimal focus position and the current position, Z_d = abs(Z - Z_0). Then, two images are captured by moving the sensor plane in different directions, i.e. I_{Z-Z_d} by moving the sensor plane to Z - Z_d and I_{Z+Z_d} by moving the sensor plane to Z + Z_d, and the two images are compared to find the right moving direction. By introducing a discriminant, denoted as f_discriminant, to judge if the image is focused, this solution can adjust the focus until a clear image is captured.
Network Details
f_distance
[0054] FIG. 5 shows a schematic diagram of an illustrative seven layer neural network for predicting distance Zd according to aspects of the present disclosure.
[0055] The input to the network is a 512 x 512 block from the defocused image.
The network comprises seven consecutive layers. The input image block is first fed into a convolutional layer. The illustrative filter size, number of filters and stride for this layer are 8, 64 and 8 respectively, and the activation function for this layer is rectified linear unit (ReLU). The illustrative dimension of the output of this layer is 64 x 64 x 4.
[0056] The second layer and the third layer are illustratively convolutional layers with filter size/number of filters/stride being 4/8/4 and 4/8/4 respectively. The activation function is ReLU, and the dimensions of the output are 16 x 16 x 8 and 4 x 4 x 8.
[0057] The four subsequent layers are illustratively fully-connected layers, and the dimensions of the four fully-connected layers are 1024, 512, 10 and 1 respectively. Leaky ReLU activation is applied to the first three fully-connected layers.
f_discriminant
[0058] FIG. 6 shows a schematic diagram of an illustrative discriminant six layer neural network according to aspects of the present disclosure. The input to the network is a 512 x 512 block from the image.
[0059] As may be observed, there are five consecutive layers in this illustrative network. The first two layers are illustratively convolutional layers, with filter size/number of filters/stride being 8/1/8 and 8/1/8 respectively. The activation function for these two layers is ReLu, and the dimensions of the illustrative output of these two layers are 64 X 64 X 1 and 8 X 8 X 1.
[0060] The third illustrative layer is a fully-connected layer, followed by the ReLU activation, and the dimension of this layer is 10. The fourth layer is a dropout layer with rate 0.5. The last layer is illustratively a fully-connected layer followed by a softmax activation which indicates the category of the input block.
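A minimal PyTorch sketch of such a discriminant follows, assuming a two-class (focused/defocused) softmax output; the class name and the two-class assumption are illustrative, not part of the disclosure.

```python
import torch
import torch.nn as nn

class FocusDiscriminant(nn.Module):
    """Sketch of f_discriminant: a 512x512 normalized block in, class probabilities out."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 1, kernel_size=8, stride=8)  # 512x512x1 -> 64x64x1, ReLU
        self.conv2 = nn.Conv2d(1, 1, kernel_size=8, stride=8)  # 64x64x1 -> 8x8x1, ReLU
        self.fc1 = nn.Linear(8 * 8, 10)                        # fully-connected, dimension 10
        self.drop = nn.Dropout(p=0.5)                          # dropout layer with rate 0.5
        self.fc2 = nn.Linear(10, 2)                            # final fully-connected layer
        self.relu = nn.ReLU()

    def forward(self, x):                          # x: (batch, 1, 512, 512)
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = x.flatten(start_dim=1)
        x = self.drop(self.relu(self.fc1(x)))
        return torch.softmax(self.fc2(x), dim=1)   # softmax over focused/defocused categories
```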
Algorithm Details:
[0061] With the foregoing in place, we may now describe an entire illustrative process according to aspects of the present disclosure as follows:
1. An image is captured at Z, denoted I_Z, and a 512 x 512 block is cropped and normalized, denoted J_Z.
2. Determine if the image is focused by f_discriminant. If YES, stop the algorithm and output Z; if NOT, continue the algorithm.
3. Predict the distance from the optimal focus position:
Z_d = f_distance(J_Z)
4. Compute the two possible positions, Z + Z_d and Z - Z_d, and determine which is/are feasible (between Z_min and Z_max).
5. If only one position is feasible, update Z and return to step 1.
6. If both positions are feasible, capture the images at Z + Z_d and Z - Z_d, denoted I_{Z+Z_d} and I_{Z-Z_d}. Predict the distance between the optimal focus position and the focus position of each of the two images:
Z_d1 = f_distance(J_{Z+Z_d}), Z_d2 = f_distance(J_{Z-Z_d})
7. If Z_d1 < Z_d2:
1. Update Z = Z + Z_d and Z_d = Z_d1.
2. Determine if the image I_{Z+Z_d} is focused by f_discriminant. If YES, stop the algorithm and output Z; if NOT, return to step 4.
If Z_d1 > Z_d2:
1. Update Z = Z - Z_d and Z_d = Z_d2.
2. Determine if the image I_{Z-Z_d} is focused by f_discriminant. If YES, stop the algorithm and output Z; if NOT, return to step 4.
8. If neither position is feasible, update Z with a small step change and return to step 1.
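For illustration, a compact and slightly simplified Python sketch of this control loop follows (after an unsuccessful comparison it re-enters the loop from step 1 rather than step 4). Here capture_at(), is_focused(), and predict_distance() are assumed stand-ins for the camera interface, f_discriminant, and f_distance respectively, and the centered 512 x 512 crop and the default step size are assumptions for illustration.

```python
def crop_and_normalize(image, size=512):
    """Crop a central size x size block and normalize it to [0, 1]."""
    h, w = image.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    block = image[top:top + size, left:left + size].astype(float)
    return (block - block.min()) / (block.max() - block.min() + 1e-12)

def multi_step_focus(capture_at, is_focused, predict_distance, z, z_min, z_max, z_step=1):
    """Simplified multi-step focus loop following steps 1-8 above."""
    while True:
        j_z = crop_and_normalize(capture_at(z))                              # step 1
        if is_focused(j_z):                                                  # step 2
            return z
        z_d = predict_distance(j_z)                                          # step 3
        candidates = [p for p in (z + z_d, z - z_d) if z_min <= p <= z_max]  # step 4
        if not candidates:                                                   # step 8
            z = z + z_step
            continue
        if len(candidates) == 1:                                             # step 5
            z = candidates[0]
            continue
        blocks = {p: crop_and_normalize(capture_at(p)) for p in candidates}  # step 6
        z = min(candidates, key=lambda p: predict_distance(blocks[p]))       # step 7
        if is_focused(blocks[z]):
            return z
```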
Examples
[0062] FIG. 7(A), FIG. 7(B), and FIG. 7(C) is a series of images illustrating aspects of the present disclosure in which FIG. 7(A) is a defocused image captured at the beginning of the process, then the network predicts movement and generates two possible positions illustrated in FIG. 7(B) and FIG. 7(C) - by comparison, the image of FIG. 7(C) is better than that of FIG. 7(B) and the discriminant predicts FIG. 7(C) is a focused image.
[0063] FIG. 8(A), FIG. 8(B), and FIG. 8(C) is a series of images illustrating aspects of the present disclosure in which FIG. 8(A) is a defocused image captured at the beginning of the process, then the network predicts movement and generates two possible positions illustrated in FIG. 8(B) and FIG. 8(C) - by comparison, the image of FIG. 8(C) is better than that of FIG. 8(B) and the discriminant predicts FIG. 8(C) is a focused image.
[0064] FIG. 9(A), FIG. 9(B), and FIG. 9(C) is a series of images illustrating aspects of the present disclosure in which FIG. 9(A) is a defocused image captured at the beginning of the process, then the network predicts movement and generates two possible positions illustrated in FIG. 9(B) and FIG. 9(C) - by comparison, the image of FIG. 9(C) is better than that of FIG. 9(B) and the discriminant predicts FIG. 9(C) is a focused image.
[0065] FIG. 10(A), FIG. 10(B), FIG. 10(C), FIG. 10(D), and FIG. 10(E) is a series of images illustrating aspects of the present disclosure in which FIG. 10(A) is a defocused image captured at the beginning of the process, then the network predicts movement and generates two possible positions illustrated in FIG. 10(B) and FIG. 10(C) - by comparison, the image of FIG. 10(C) is better than that of FIG. 10(B), but the discriminant predicts FIG. 10(C) is still not a focused image, so a new movement is predicted based on FIG. 10(C) and two more images are captured - by comparison, image FIG. 10(E) is better than FIG. 10(D), and the discriminant predicts image FIG. 10(E) is a focused image.
Compressively Sensed Image Data
[0066] Compressive sensing is a promising technique in imaging systems because it achieves sensor-level compression. While traditional methods require analyzing images to determine the focus position, the present disclosure determines the focus from compressed data. While the first convolutional layer in the network provides a sensing matrix for compressive sensing, and this sensing matrix is trained to yield the best focus results, a more general sensing matrix is tested, which proves that the present solution can be applied to compressed data.
[0067] FIG. 11 shows a schematic diagram of an illustrative seven layer neural network for compressive sensing according to aspects of the present disclosure. As may be observed from that figure, the basic network structure is substantially the same as that shown previously, wherein the first illustrative convolutional layer is randomly initialized and is not trainable. This is equivalent to generating a random sensing matrix for compressive sensing, and here the measurement rate is 0.0625.
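A minimal PyTorch sketch of such a fixed random sensing layer follows; the specific kernel size, stride, and filter count shown (which give (64 x 64 x 4) / (512 x 512) = 0.0625 measurements per pixel) are one illustrative reading of the figure, not a definitive specification.

```python
import torch.nn as nn

class CompressiveSensingFront(nn.Module):
    """Random, non-trainable first convolution acting as a compressive sensing matrix."""
    def __init__(self):
        super().__init__()
        # 512x512x1 -> 64x64x4: measurement rate (64*64*4)/(512*512) = 0.0625
        self.sense = nn.Conv2d(1, 4, kernel_size=8, stride=8, bias=False)
        nn.init.normal_(self.sense.weight)           # random sensing matrix
        self.sense.weight.requires_grad = False      # frozen: not trainable

    def forward(self, x):                            # x: (batch, 1, 512, 512)
        return self.sense(x)                         # compressed measurements
```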
[0068] FIG. 12(A), FIG. 12(B), FIG. 12(C) and FIG. 12(D) is a series of images illustrating aspects of the present disclosure in which FIG. 12(A) is a defocused image which is compressively sensed to a four-channel tensor by a random matrix, wherein each channel is shown in FIG. 12(B); the network learns from these compressed measurements any movement required, and the images of FIG. 12(C) and FIG. 12(D) are classified as focused by the discriminator. Note that the image of FIG. 12(B) is not downsampled from the image of FIG. 12(A).
Saliency-based Focus
[0069] In imaging systems, a focus module is controlled to "best image" (i.e., determine a best image of) a target object/target object plane. However, since the real world is 3-dimensional, an indication of the target is required for focus control. In current imaging systems, such as digital cameras or cellphone cameras, the location of the target is: 1) by default the center of the image, or 2) manually indicated by users. Some smart systems may adjust the focus module to best capture faces by first detecting the faces in the image, but the detection requires a clear image, which itself relies on a general focus method. Visual saliency detection has been explored for decades, and advantageously - according to aspects of the present disclosure - it may be combined with a camera focus control algorithm, providing a saliency-based focus control mechanism.
[0070] As we shall show and describe, the main workflow of saliency-based focus according to the present disclosure is like the workflow disclosed herein for Multi-Step Focus, wherein the location of the block is determined by a saliency detection network from the first defocused image I_Z.
[0071] FIG. 13 shows a schematic diagram of an illustrative neural network for saliency detection according to aspects of the present disclosure. Operationally, the image is resized to 300 x 400 before being fed into the network. The input image is first fed into three separate convolutional layers; the filter sizes for the three convolutional layers are 40, 10 and 3 respectively, and the number of filters/stride for each layer are 16/1.
[0072] The outputs of the three layers are then concatenated, forming a 300 X 400 X 48 tensor, denoted in the figure as CONV1. Following are two convolutional layers, with filter size/stride being 10/10 for both layers. The numbers of filters for the convolutional layers are 16 and 32 respectively, and the output dimensions of the two layers are 30 X 40 X 16 and 3 X 4 X 32, denoted in the figure as CONV2 and CONV3.
[0073] Next, CONV3 is reshaped to a 384 x 1 vector, followed by a fully-connected layer with dimension 384 X 1, which is again reshaped to a 3 X 4 X 32 tensor, denoted CONV4. CONV3 and CONV4 are then concatenated to form a 3 X 4 X 64 tensor, followed by a convolutional layer with filter size/number of filters/stride being 3/64/1.
[0074] A deconvolutional layer is then applied with filter size/number of filters/stride being 20/16/10, and the dimension of the output tensor, denoted CONV5, is 30 X 40 X 16. CONV2 and CONV5 are then concatenated to form a 30 X 40 X 32 tensor, followed by a convolutional layer with filter size/number of filters/stride being 5/64/1.
[0075] Then another deconvolutional layer is applied with filter size/number of filters/stride being 20/32/10, and the dimension of the output tensor, denoted CONV6, is 300 X 400 X 32.
[0076] Finally, CONV1, CONV6 and the input image are concatenated to form a 300 X 400 X 83 tensor, followed by four consecutive convolutional layers. The filter size/number of filters/stride for the four layers are 11/64/1, 7/32/1, 3/16/1 and 1/1/1, and the output of the last layer is the predicted saliency map. ReLU activation is applied to all convolutional, deconvolutional and fully-connected layers except for the last layer, which is activated by a Sigmoid function.
Data Generation
[0077] There exist several labeled datasets (i.e., MSRA10K*, MSRA-B**, PASCAL-S***) for saliency detection, but no publicly available dataset provides labeled defocused images for saliency detection. To prepare training data, the image defocus model calibrated on a specific camera is used, and the process according to aspects of the present disclosure is described as follows:
1. Obtain an image and its label from existing datasets.
2. Resize the image to the size of the camera image.
3. Regard the resized image as a focused image and generate a defocused image using the calibrated model.
4. Resize the defocused image and its corresponding label to 300 × 400. An example (image from MSRA-B**) is shown in FIG. 14(A), FIG. 14(B), and FIG. 14(C), in which FIG. 14(A) shows the original image, FIG. 14(B) shows the generated defocused image, and FIG. 14(C) shows the label. Each training sample consists of a defocused image I and a label image L; a code sketch of this procedure follows the list.
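As an illustration only, the data-generation steps above might be implemented as follows; the function defocus_model standing in for the calibrated image defocus model, the OpenCV-based resizing, and the choice of nearest-neighbor interpolation for the label are our own assumptions.

import cv2
import numpy as np

def make_training_pair(image, label, camera_size, defocus_model, net_size=(400, 300)):
    # Step 2: resize the dataset image to the camera's native resolution.
    #         camera_size and net_size are (width, height), as cv2.resize expects.
    focused = cv2.resize(image, camera_size)
    # Step 3: treat the resized image as focused and synthesize a defocused image
    #         with the calibrated defocus model (hypothetical callable here).
    defocused = defocus_model(focused)
    # Step 4: resize the defocused image and its label to the 300 x 400 network input size.
    I = cv2.resize(defocused, net_size)
    L = cv2.resize(label, net_size, interpolation=cv2.INTER_NEAREST)
    return I.astype(np.float32) / 255.0, L.astype(np.float32) / 255.0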
[0078] The network f_saliency is trained on the training dataset by minimizing the mean square error, i.e.,

\min_{\theta} \sum_{(I, L)} \left\| f_{\mathrm{saliency}}(I; \theta) - L \right\|_{2}^{2}

where \theta denotes the parameters of the network and the sum runs over the training pairs (I, L).
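For concreteness, one training step under this objective might look as follows in PyTorch, using the SaliencyNet sketch above; the optimizer, learning rate and data loader are assumptions, not details recited in the disclosure.

import torch

net = SaliencyNet()                                       # architecture sketch above
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)   # optimizer and rate are assumptions
loss_fn = torch.nn.MSELoss()                              # mean square error, per paragraph [0078]

for images, labels in loader:                             # (N, 3, 300, 400) and (N, 1, 300, 400) batches
    optimizer.zero_grad()
    loss = loss_fn(net(images), labels)                   # || f_saliency(I) - L ||^2, averaged
    loss.backward()
    optimizer.step()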
Block Selection
[0079] Once the saliency map is predicted by the network, the next step is to select the block on which to run the focus prediction network. To achieve this, the saliency map is first resized to the size of the original image, i.e., the image captured by the camera. Then the block is determined by finding the 512 × 512 block that achieves the highest average saliency intensity:

B^{*} = \arg\max_{B} \frac{1}{|B|} \sum_{I_{ij} \in B} S_{ij}

where B represents a 512 × 512 block from the image, I_ij is a pixel in the image, and S_ij is the value of the saliency map at that pixel. Once the block is determined, the focus is predicted from this block.
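A minimal sketch of this block-selection step, using an integral image so that every candidate 512 × 512 window is scored in constant time, is given below; the use of OpenCV and the function and variable names are our own.

import cv2
import numpy as np

def select_block(saliency_map, image_shape, block=512):
    # Resize the predicted saliency map to the resolution of the camera image.
    H, W = image_shape[:2]
    S = cv2.resize(saliency_map, (W, H))
    # Integral image: ii[i, j] is the sum of S[:i, :j], so each window sum costs four lookups.
    ii = cv2.integral(S)
    sums = (ii[block:, block:] - ii[:-block, block:]
            - ii[block:, :-block] + ii[:-block, :-block])
    # All windows have the same area, so the largest sum is also the largest average.
    r, c = np.unravel_index(np.argmax(sums), sums.shape)
    return r, c                                           # top-left corner of the selected block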
Examples
[0080] FIG. 15(A), FIG. 15(B), and FIG. 15(C) is a series of images illustrating aspects of the present disclosure in which FIG. 15(A) shows the first defocused image captured by a camera, FIG. 15(B) shows the predicted saliency map, and FIG. 15(C) shows the final output image with a bounding box showing the block that is selected as the target.
[0081] FIG. 16(A), FIG. 16(B), and FIG. 16(C) is a series of images illustrating aspects of the present disclosure in which FIG. 16(A) shows the first defocused image captured by a camera, FIG. 16(B) shows the predicted saliency map, and FIG. 16(C) shows the final output image with a bounding box showing the block that is selected as the target.
[0082] FIG. 17(A), FIG. 17(B), and FIG. 17(C) is a series of images illustrating aspects of the present disclosure in which FIG. 17(A) shows the first defocused image captured by a camera, FIG. 17(B) shows the predicted saliency map, and FIG. 17(C) shows the final output image with a bounding box showing the block that is selected as the target.
[0083] At this point, while we have presented this disclosure using some specific examples, those skilled in the art will recognize that our teachings are not so limited. Accordingly, this disclosure should be only limited by the scope of the claims attached hereto.

Claims

1. A neural-network-based focusing method for an imaging system comprising:
a) providing two gray-scale images, each individual one of the gray-scale images captured at a different position of a sensor plane of the imaging system;
b) cropping two image blocks, one from each of the two gray-scale images provided and cropped from the same relative position to their respective gray-scale images, each individual one of the cropped image blocks including an object of interest; and
c) predicting a focus position through the effect of the neural network operating on the two image blocks.
2. The neural-network-based focusing method according to claim 1 further comprising: d) moving the sensor plane and repeating steps a) - c).
3. The neural-network-based focusing method according to claim 2 further comprising: normalizing each individual one of the two cropped image blocks such that their values are between 0 and 1.
4. The neural-network-based focusing method according to claim 3 further comprising: determining, through the effect of the neural network, a displacement between the two provided gray-scale images from the two normalized blocks.
5. The neural-network-based focusing method according to claim 4 further comprising: training the neural network on training data for the imaging system, specifically.
6. The neural-network-based focusing method according to claim 5 further comprising: constructing a defocus model for the imaging system, specifically; and
generating the imaging system specific training data for training the neural network.
7. The neural-network-based focusing method according to claim 6 wherein the defocus model is generated from a defocus process, said defocus process modeled as image blur followed by image scaling.
8. The neural-network-based focusing method according to claim 7 wherein the defocus process is defined by
Figure imgf000023_0001
where h is the image blur filter, a is the image scaling factor and Gamma⁻¹(x) is the inverse process of gamma correction defined by:
Gamma⁻¹(x) = x^γ.
9. The neural-network-based focusing method according to claim 1 wherein the gray-scale images are derived from raw imaging system data in Bayer format, wherein each raw data comprises a one-channel intensity map and is interpolated to generate an RGB image.
10. The neural-network-based focusing method according to claim 1 wherein the gray-scale images are derived from moving the sensor plane in different directions and then determining a correct moving direction.
PCT/US2019/064755 2018-12-05 2019-12-05 Neural network focusing for imaging systems WO2020118093A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862775725P 2018-12-05 2018-12-05
US62/775,725 2018-12-05

Publications (1)

Publication Number Publication Date
WO2020118093A1 true WO2020118093A1 (en) 2020-06-11

Family

ID=70974805

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/064755 WO2020118093A1 (en) 2018-12-05 2019-12-05 Neural network focusing for imaging systems

Country Status (1)

Country Link
WO (1) WO2020118093A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4266695A1 (en) * 2022-04-20 2023-10-25 Canon Kabushiki Kaisha Learning apparatus for multi-focus imaging, method, program, and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110085028A1 (en) * 2009-10-14 2011-04-14 Ramin Samadani Methods and systems for object segmentation in digital images
US20150116526A1 (en) * 2013-10-31 2015-04-30 Ricoh Co., Ltd. Plenoptic Color Imaging System with Enhanced Resolution
US20170064204A1 (en) * 2015-08-26 2017-03-02 Duke University Systems and methods for burst image delurring

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HE J. ET AL.: "Modified fast climbing search auto-focus algorithm with adaptive step size searching technique for digital camera", IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, vol. 49, no. 2, 3 April 2003 (2003-04-03), pages 257 - 262, XP001171284, Retrieved from the Internet <URL:https://pdfs.semanticscholar.org/c4d7/dcdcd2f541663bb8343b8ddd48765bb018db.pdf> [retrieved on 20200129] *
WEI L. ET AL.: "Neural network control of focal position during time-lapse microscopy of cells", SCIENTIFIC REPORTS, vol. 8, no. 1, 13 December 2017 (2017-12-13), pages 7313, XP055717021, Retrieved from the Internet <URL:https://www.biorxiv.org/content/10.1101/233940v1.full.pdf> [retrieved on 20200129] *
YAO Y. ET AL.: "Evaluation of sharpness measures and search algorithms for the auto-focusing of high-magnification images", VISUAL INFORMATION PROCESSING XV, vol. 6246, 12 May 2006 (2006-05-12), pages 62460G, XP055717023, Retrieved from the Internet <URL:http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.478.82&rep=rep1&type=pdf> [retrieved on 20200129] *


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19892190

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19892190

Country of ref document: EP

Kind code of ref document: A1