WO2022124607A1 - Depth estimation method, device, electronic equipment, and computer-readable storage medium - Google Patents

Info

Publication number: WO2022124607A1
Authority: WIPO (PCT)
Prior art keywords: feature, sampling, image, feature map, position information
Application number: PCT/KR2021/016579
Other languages: English (en)
Inventors: Zikun LIU, Juan LEI, Jianxing Zhang, Wen Liu, Chunyang Li, Jian Yang, Wei Wen, Hyungju CHUN, Jongbum CHOI
Original assignee: Samsung Electronics Co., Ltd.
Priority claimed from: CN202011440325.3A (CN114596349A)
Application filed by: Samsung Electronics Co., Ltd.
Publication of: WO2022124607A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery

Description

  • the present application relates to a field of computer technology. Specifically, the present application relates to a depth estimation method, device, electronic equipment, and computer-readable storage medium.
  • Autofocus is a core function of many smart devices when shooting images and videos. Regardless of the distance of the object from the shooting device, users hope that the object of interest to be shot is clear, and depth estimation is the basis for achieving fast autofocus.
  • the depth range that can be estimated by the existing depth estimation method is small, and the existing depth estimation method cannot meet the automatic focusing requirements of smart devices for short-distance or long-distance targets. Therefore, it is necessary to improve the existing depth estimation method.
  • the present application provides a depth estimation method, device, electronic equipment, and computer-readable storage medium.
  • the related steps of depth estimation in this solution can be processed by an artificial intelligence module.
  • the solution maps the image to be processed to a preset plane to obtain position information of each pixel of the image to be processed on the preset imaging plane, and uses the position information of each pixel of the image to be processed on the preset imaging plane in the depth estimation process to eliminate the influence of camera parameters on the depth estimation range, such that the same network model can estimate the depth of images to be processed corresponding to different camera parameters. While ensuring a wide range of depth estimation, the solution saves computing resources and storage space.
  • Fig. 1 is a schematic diagram of an auto-focusing process performed by a smart device
  • Fig. 2 is a schematic diagram of the comparison of corresponding sizes of objects on different imaging planes
  • Fig. 3 is a schematic diagram of a network structure of a depth estimation network model in the prior art
  • Fig. 4 is a schematic flowchart of a depth estimation method provided by an embodiment of the present application.
  • Fig. 5 is a schematic diagram of coordinate transformation from a current imaging plane to a preset imaging plane in an embodiment of the present application
  • Fig. 6 is a schematic diagram of IPM acquiring first position information in an example of an embodiment of the present application.
  • Fig. 7 is a schematic diagram of a combination of a pre-defined network model and IPM for depth estimation in an example of an embodiment of the present application
  • Fig. 8 is a schematic diagram of comparison of several indexes of auto focus function and auto exposure function in the prior art
  • Fig. 9a is a schematic diagram of a human parsing network model for human parsing provided in an example of an embodiment of the present application.
  • Fig. 9b is a detailed schematic diagram of a network structure of the human parsing network model shown in Fig. 9a;
  • Fig.10a is a schematic diagram of performing depth estimation based on a pre-defined network model obtained by expansion of a human parsing network provided in an example of an embodiment of the present application;
  • Fig.10b is a detailed schematic diagram of the network structure of a pre-defined network model shown in Fig. 10a;
  • Fig. 11a is a schematic diagram of zero padding in a convolution kernel corresponding to a human parsing encoding module in an example of an embodiment of the present application;
  • Fig. 11b is a schematic diagram of zero padding in a convolution kernel corresponding to a depth estimation encoding module in an example of an embodiment of the present application;
  • Fig. 12 is a schematic diagram of angle offset and histogram statistics of the angle offset introduced by the motion of the hand holding the photographing device in an example of the embodiment of the present application;
  • Fig. 13 is a disparity diagram of two adjacent images in consecutive k images in an example of an embodiment of the present application
  • Fig. 14 is a schematic diagram of a comparison of depth images respectively obtained according to a single image and according to consecutive k images in an example of an embodiment of the present application;
  • Fig. 15 is a schematic diagram of MFbD acquiring first disparity information in an example of an embodiment of the present application.
  • Fig. 16 is a schematic diagram of a combination of a pre-defined network model, IPM and MFbD for depth estimation in an example of an embodiment of the present application;
  • Fig. 17 is a detailed schematic diagram of the combined network structure shown in Fig. 16;
  • Fig. 18 is a schematic diagram of generating an image of a target camera from one image in the NYUDv2 data set in an example of an embodiment of the present application;
  • Fig. 19 shows the depth value histogram statistics of the NYUDv2 and KITTI data sets;
  • Fig. 20 is a schematic flowchart of an image processing method provided by an embodiment of the present application.
  • Fig. 21 is a schematic diagram of performing depth estimation based on another pre-defined network model obtained by expansion of a human parsing network provided in an example of an embodiment of the present application;
  • Fig. 22 is a schematic flowchart of another depth estimation method provided by an embodiment of the present application.
  • Fig. 23 is a schematic diagram of performing depth estimation based on another pre-defined network model obtained by expansion of a human parsing network provided in an example of an embodiment of the present application;
  • Fig. 24 is a structural block diagram of a depth estimation device provided by an embodiment of the present application.
  • Fig. 25 is a structural block diagram of an image processing device provided by an embodiment of the present application.
  • Fig. 26 is a structural block diagram of another depth estimation device provided by an embodiment of the present application.
  • Fig. 27 is a schematic diagram of an electronic device provided by an embodiment of the present application.
  • the purpose of the present application is to solve at least one of the above technical defects.
  • the technical solutions provided by the embodiments of the present application are as follows:
  • an embodiment of the present application provides a depth estimation method, comprises: mapping an image to be processed to a preset plane, and acquiring first position information of pixels in the image to be processed on the preset plane; and performing depth estimation on the image to be processed based on the first position information.
  • the acquiring first position information of pixels in the image to be processed on the preset plane comprises: acquiring second position information of pixels in the image to be processed on the current imaging plane based on a second camera parameter corresponding to the current imaging plane; and acquiring the first position information based on the second position information, a first camera parameter corresponding to the preset plane, and the second camera parameter.
  • the camera parameter includes at least one of focal length of a camera, position of a principal point, and size of a sensor.
  • the acquiring the first position information based on the second position information, a first camera parameter corresponding to the preset plane, and the second camera parameter comprising: acquiring a mapping relationship between the first position information and the second position information based on the first camera parameter and the second camera parameter; and acquiring the first position information based on the second position information and the mapping relationship.
  • the performing depth estimation on the image to be processed based on the first position information comprises: through an encoding network, performing at least once feature down-sampling on the image to be processed to obtain a corresponding feature map; and through a first decoding network, performing at least once feature up-sampling based on the feature map and the first position information to obtain a depth estimation result.
  • the encoding network comprises several feature down-sampling units
  • the at least one feature down-sampling unit comprises a first feature down-sampling module and a second feature down-sampling module
  • the first feature down-sampling module performs feature down-sampling on the input image to be processed or the feature map to obtain a first feature map
  • the second feature down-sampling module performs feature down-sampling on the input image to be processed or the feature map to obtain a second feature map, and the fused result of the first feature map and the second feature map is output.
  • a convolution kernel used by the first feature down-sampling module includes a first convolution kernel of a first dimension and a second convolution kernel of a second dimension, wherein a value of the second convolution kernel is zero, wherein, the first dimension is determined based on the number of convolution kernels used by the first feature down-sampling module in the previous feature down-sampling unit, and the second dimension is determined based on the number of convolution kernels used by the second feature down-sampling module in the previous feature down-sampling unit.
  • the convolution kernels used by the first feature down-sampling module and the second feature down-sampling module are standard convolution or point-wise convolution.
  • the number of convolution kernels used by the first feature down-sampling module is equal to the number of convolution kernels used by the first feature down-sampling module in the previous feature down-sampling unit
  • the number of convolution kernels used by the second feature down-sampling module is equal to the number of convolution kernels used by the second feature down-sampling module in the previous feature down-sampling unit.
  • the convolution kernels used by the first feature down-sampling module and the second feature down-sampling module are point-wise convolution.
  • the corresponding processing result obtained by the second decoding network is a semantic parsing result of the image to be processed.
  • the performing at least once feature up-sampling based on the feature map and the first position information to obtain a depth estimation result comprises: performing feature fusion on the first position information and the feature map output by at least once feature down-sampling to obtain at least one first fused feature map; and performing feature up-sampling corresponding to the at least once feature down-sampling based on the at least one first fused feature map.
  • the performing feature up-sampling corresponding to the at least once feature down-sampling based on the at least one first fused feature map comprises: performing feature fusion on the first fused feature map and its corresponding input feature map corresponding to the feature up-sampling to obtain a corresponding second fused feature map; and performing feature up-sampling based on the second fused feature map, and outputting the obtained feature map.
  • the method further comprises: acquiring at least two consecutive images containing the image to be processed; and acquiring first disparity information corresponding to the image to be processed based on the at least two consecutive images; the performing at least once feature up-sampling based on the feature map and the first position information to obtain a depth estimation result, comprises: performing at least once feature up-sampling based on the feature map, the first position information, and the first disparity information to obtain a depth estimation result.
  • the acquiring first disparity information corresponding to the image to be processed based on the at least two consecutive images comprises: acquiring second disparity information between two adjacent images in the at least two consecutive images; and acquiring the first disparity information based on the second disparity information.
  • the acquiring the first disparity information based on the second disparity information comprises: acquiring corresponding average disparity information or accumulated disparity information based on the second disparity information, and using the average disparity information or the accumulated disparity information as the first disparity information.
  • the performing at least once feature up-sampling based on the feature map, the first position information, and the first disparity information to obtain a depth estimation result comprises: performing feature fusion on the first position information, the first disparity information, and the feature map output by at least once feature down-sampling to obtain at least one third fused feature map; and performing feature up-sampling corresponding to the at least once feature down-sampling based on the at least one third fused feature map.
  • the performing feature up-sampling corresponding to the at least once feature down-sampling based on the at least one third fused feature map comprises: performing feature fusion on the third fused feature map and its corresponding input feature map corresponding to the feature up-sampling to obtain a corresponding fourth fused feature map; and performing feature up-sampling based on the fourth fused feature map, and outputting the obtained feature map.
  • an embodiment of the present application provides an image processing method, including: through an encoding network, performing at least once feature down-sampling on the image to be processed to obtain a corresponding feature map, wherein the encoding network includes several feature down-sampling units, and the at least one feature down-sampling unit includes a first feature down-sampling module and a second feature down-sampling module, the first feature down-sampling module and the second feature down-sampling module respectively perform feature down-sampling on the input image to be processed or the feature map to obtain a first feature map and a second feature map, and output the fused result of the first feature map and the second feature map; through a first decoding network, performing at least once feature up-sampling based on the feature map output by the encoding network to obtain a first processing result; and through a second decoding network, performing at least once feature up-sampling based on the first feature map output by the at least one first feature down-sampling module in the encoding network to obtain a second processing result.
  • a convolution kernel used by the first feature down-sampling module includes a first convolution kernel of a first dimension and a second convolution kernel of a second dimension, wherein a value of the second convolution kernel is zero, wherein, the first dimension is determined based on the number of convolution kernels used by the first feature down-sampling module in the previous feature down-sampling unit, and the second dimension is determined based on the number of convolution kernels used by the second feature down-sampling module in the previous feature down-sampling unit.
  • the number of convolution kernels used by the first feature down-sampling module is equal to the number of convolution kernels used by the first feature down-sampling module in the previous feature down-sampling unit
  • the number of convolution kernels used by the second feature down-sampling module is equal to the number of convolution kernels used by the second feature down-sampling module in the previous feature down-sampling unit.
  • the first processing result is a depth estimation result of the image to be processed
  • the second processing result is a semantic parsing result of the image to be processed
  • an embodiment of the present application provides a depth estimation method, including: acquiring at least two consecutive images containing an image to be processed; acquiring first disparity information corresponding to the image to be processed based on the at least two consecutive images; and performing depth estimation on the image to be processed based on the first disparity information.
  • the acquiring first disparity information corresponding to the image to be processed based on the at least two consecutive images comprises: acquiring second disparity information between two adjacent images in the at least two consecutive images; and acquiring the first disparity information based on the second disparity information.
  • the acquiring the first disparity information based on the second disparity information comprises: acquiring corresponding average disparity information or accumulated disparity information based on the second disparity information, and using the average disparity information or the accumulated disparity information as the first disparity information.
  • the performing depth estimation based on the first disparity information comprises: through an encoding network, performing at least once feature down-sampling on the image to be processed to obtain a corresponding feature map; and through a first decoding network, performing at least once feature up-sampling based on the feature map and the first disparity information to obtain a depth estimation result.
  • the performing at least once feature up-sampling based on the feature map and the first disparity information comprises: performing feature fusion on the first disparity information and the feature map output by the at least once feature down-sampling to obtain at least one fifth fused feature map; and performing the feature up-sampling corresponding to the at least once feature down-sampling based on the at least one fifth fused feature map.
  • the performing the feature up-sampling corresponding to the at least once feature down-sampling based on the at least one fifth fused feature map comprises: performing feature fusion on the fifth fused feature map and its corresponding input feature map corresponding to the feature up-sampling to obtain a corresponding sixth fused feature map; and performing feature up-sampling based on the sixth fused feature map, and outputting the obtained feature map.
  • an embodiment of the present application provides a depth estimation device, comprises: a position information acquisition module, configured to map an image to be processed to a preset plane, and acquire first position information of pixels in the image to be processed on the preset plane; and a depth estimation module, configured to perform depth estimation on the image to be processed based on the first position information.
  • an embodiment of the present application provides an image processing device, comprises: an encoding module, configured to perform at least once feature down-sampling on the image to be processed to obtain a corresponding feature map through an encoding network, wherein the encoding network includes several feature down-sampling units, and the at least one feature down-sampling unit includes a first feature down-sampling module and a second feature down-sampling module, the first feature down-sampling module and the second feature down-sampling module respectively perform feature down-sampling on the input image to be processed or the feature map to obtain a first feature map and a second feature map, and output the fused feature map of the first feature map and the second feature map; a first decoding module, configured to perform at least once feature up-sampling based on the feature map output by the encoding network to obtain a first processing result through a first decoding network; and a second decoding module, configured to perform at least once feature up-sampling based on the first feature map output by the at least one first feature down-sampling module in the encoding network to obtain a second processing result through a second decoding network.
  • an embodiment of the present application provides a depth estimation device, comprises: a consecutive image acquisition module, configured to acquire at least two consecutive images containing an image to be processed; a disparity information acquisition module, configured to acquire first disparity information corresponding to the image to be processed based on the at least two consecutive images; and a depth estimation module, configured to perform depth estimation on the image to be processed based on the first disparity information.
  • an embodiment of the present application provides an electronic device, including a memory and a processor; computer programs are stored in the memory; the processor, configured to execute computer programs to implement the embodiment of the first aspect or any optional embodiment of the first aspect, the embodiment of the second aspect or any optional embodiment of the second aspect, the embodiment of the third aspect or the method provided in any optional embodiment of the third aspect.
  • an embodiment of the present application provides a computer-readable storage medium, and computer programs are stored on the computer-readable storage medium.
  • when the computer programs are executed by a processor, the method provided in the embodiment of the first aspect or any optional embodiment of the first aspect, the embodiment of the second aspect or any optional embodiment of the second aspect, or the embodiment of the third aspect or any optional embodiment of the third aspect is implemented.
  • the solution maps the image to be processed to a preset plane to obtain position information of each pixel of the image to be processed on the preset imaging plane, and uses the position information of each pixel of the image to be processed on the preset imaging plane in the depth estimation process to eliminate the influence of camera parameters on the depth estimation range, such that the same network model can estimate the depth of images to be processed corresponding to different camera parameters. While ensuring a wide range of depth estimation, the solution saves computing resources and storage space.
  • Fig. 1 shows the processing process of the smart device (or shooting device) for autofocusing and acquiring the image after autofocus, wherein, the left image is the input image before focusing; the middle image is the result image of the depth estimation corresponding to the input image (i.e., depth image); the right image is the image after autofocus processing according to the estimated depth image.
  • the depth value of each pixel in an input image refers to the distance from a Z-axis point corresponding to the object to an optical center (origin) of the camera in the camera coordinate system.
  • the principle of the existing depth estimation scheme is to estimate the depth by using the size of the object area in the image on the imaging plane according to the imaging model of the camera, that is, to estimate the Z-axis distance between the shooting object and the shooting device.
  • for the same object, the size of the image obtained on the imaging plane differs under different camera parameters or shooting distances.
  • the size of the object on the imaging plane depends on the distance of the object from the shooting device and the parameters of the shooting device (such as camera focal length, principal point position (also called principal point coordinates), sensor size, etc.).
  • the depth estimation network model includes an encoder and a decoder, and the public data sets NYUDv2 and KITTI are usually used to train and test the depth estimation network model; the camera parameters of the shooting equipment corresponding to the image samples in the same public data set are the same. Since the size corresponding to the same depth value on the imaging plane differs for different camera parameters, that is, the corresponding relationship between the depth value of an object and the size of the object on the imaging plane differs for different camera parameters, it is necessary to use the image samples in the same data set to train and test the depth estimation network.
  • the depth estimation network model trained by the above scheme can only be used to perform depth estimation on images with the same camera parameters corresponding to the training data set, that is, the trained depth estimation network model can only cover the depth range that the specific camera parameters corresponding to the specific dataset can cover.
  • NYUDv2 is an indoor-scene data set, and the corresponding depth range when shooting is 0.5 m to 10 m, so a depth estimation network model trained on NYUDv2 can only cover the depth range of 0.5 m to 10 m;
  • KITTI is an outdoor-scene data set, and the corresponding depth range when shooting is 1 m to 100 m;
  • a depth estimation network model trained on KITTI can only cover the depth range of 1 m to 100 m.
  • the depth range of each data set can only cover part of the "wide range". Therefore, models trained on different training sets can only estimate part of the depth range.
  • the embodiments of the present application provide a depth estimation method, which will be further described in the following.
  • Fig. 4 is a schematic flowchart of a depth estimation method provided by an embodiment of the present application. As shown in Fig. 4, the method may comprise:
  • Step S401: mapping an image to be processed to a preset plane, and acquiring first position information of pixels in the image to be processed on the preset plane;
  • Step S402: performing depth estimation on the image to be processed based on the first position information.
  • the first position information may be the coordinates of each pixel.
  • the preset plane may also be called a preset imaging plane
  • the size of the object in each image to be processed on the preset imaging plane can be acquired according to the first position information, and then the depth value of the object is estimated according to the size. Since different images to be processed are mapped to a same preset imaging plane, the calculation scale of the object size in each image to be processed is the same, such that a corresponding relationship between the object size and the object depth value in each image to be processed is consistent.
  • the image to be processed is mapped to the preset imaging plane and depth estimation is then performed using the size of the object in the image to be processed on the preset imaging plane; in this way, the influence of camera parameters on the depth estimation range of a single depth estimation network model is eliminated, that is, the same depth estimation network model can be used to realize depth estimation of images to be processed corresponding to different camera parameters.
  • the solution provided by the present application maps the image to be processed to a preset plane to obtain position information of each pixel of the image to be processed on the preset imaging plane, and uses the position information of each pixel of the image to be processed on the preset imaging plane in the depth estimation process to eliminate the influence of camera parameters on the depth estimation range, such that the same network model can estimate the depth of images to be processed corresponding to different camera parameters. While ensuring a wide range of depth estimation, the solution saves computing resources and storage space.
  • the acquiring first position information of pixels in the image to be processed on a preset imaging plane comprises:
  • the acquiring the first position information based on the second position information, a first camera parameter corresponding to the preset imaging plane, and the second camera parameter comprising:
  • the process of mapping the image to be processed from the current imaging plane to the preset imaging plane is a process of converting the coordinates of each pixel in the image to be processed, that is, converting the second position information of each pixel in the current imaging plane into the first position information in the preset imaging plane.
  • the first camera parameters corresponding to the preset imaging plane U include: focal length f_u, sensor size (w_u, h_u), and principal point coordinates (c_xu, c_yu);
  • the second camera parameters corresponding to the imaging plane P include: focal length f_p, sensor size (w_p, h_p), and principal point coordinates (c_xp, c_yp).
  • the above coordinate conversion process may include:
  • the coordinates of each pixel in the image to be processed on the current imaging plane are converted into coordinates on the preset imaging plane (that is, the first position information): the mapping relationship between the current imaging plane and the preset imaging plane is acquired according to the principle of similar triangles, and then the first position information is acquired according to the mapping relationship and the second position information, as expressed by the following formula:
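  • A minimal reconstruction of this similar-triangles mapping, under a standard pinhole-camera assumption (the symbols are illustrative: f_p and f_u denote the focal lengths, and (c_xp, c_yp), (c_xu, c_yu) the principal points, of the current and preset imaging planes; sensor size would additionally enter as a pixel-pitch scaling if the coordinates are expressed in pixels):

        x_u = (f_u / f_p) * (x_p - c_xp) + c_xu,        y_u = (f_u / f_p) * (y_p - c_yp) + c_yu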
  • an image mapping module (IPM) can be preset according to the above processing process for coordinate conversion of each pixel in the image to be processed; the input of the IPM is the image to be processed and the corresponding camera parameters (i.e., the second camera parameter), and the output is the first position information corresponding to the image to be processed.
  • Fig. 6 is a schematic diagram of the input and output of the IPM module.
  • the x-axis camera parameters and the y-axis camera parameters corresponding to the second camera parameters are input, and after they pass through the IPM, two coordinate matrices of size W*H are output (which can be recorded as W*H*2).
  • Each element in these two coordinate matrices is the x-axis coordinate and y-axis coordinate of the corresponding pixel in the preset imaging plane.
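  • As a rough illustration of such a module, the following sketch builds the W*H*2 coordinate matrices in Python; the function name, the parameter dictionaries, and the pinhole-style mapping are assumptions made for illustration rather than the patent's exact implementation.

        import numpy as np

        def image_plane_mapping(w, h, cam_p, cam_u):
            """Hypothetical IPM sketch: map every pixel of a W*H image from the
            current imaging plane P to the preset imaging plane U.

            cam_p / cam_u hold illustrative keys 'fx', 'fy', 'cx', 'cy' (focal
            lengths and principal point, in pixels).  Returns an array of shape
            (h, w, 2) holding each pixel's x and y coordinates on the preset
            plane, i.e. the W*H*2 output described above.
            """
            ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
            xu = cam_u['fx'] / cam_p['fx'] * (xs - cam_p['cx']) + cam_u['cx']
            yu = cam_u['fy'] / cam_p['fy'] * (ys - cam_p['cy']) + cam_u['cy']
            return np.stack([xu, yu], axis=-1)

        # Example: a 640*480 image taken with one camera, mapped to a preset plane.
        coords = image_plane_mapping(
            640, 480,
            cam_p={'fx': 520.0, 'fy': 520.0, 'cx': 320.0, 'cy': 240.0},
            cam_u={'fx': 600.0, 'fy': 600.0, 'cx': 320.0, 'cy': 240.0})
        print(coords.shape)  # (480, 640, 2)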
  • the performing depth estimation on the image to be processed based on the first position information comprises:
  • the pre-defined network model is used to, in combination with the first position information, perform depth estimation on the image to be processed, and output a corresponding depth estimation result.
  • the pre-defined network model can be combined with the IPM, and the pre-defined network model uses the first position information output by the IPM during the depth estimation process of the image to be processed to output the corresponding depth estimation result.
  • the performing depth estimation on the image to be processed based on the first position information includes:
  • the obtained depth result may be a corresponding depth image.
  • feature down-sampling is performed on the image to be processed multiple times; each feature down-sampling generates a corresponding feature map, and the feature map generated by each feature down-sampling can be understood as a feature map output by the encoding network.
  • feature up-sampling is performed multiple times based on the feature maps output by the encoding network to obtain the depth image corresponding to the image to be processed, and each feature up-sampling corresponds to one feature down-sampling in the encoding network. As shown in Fig. 7, the pre-defined network model is combined with the IPM: the first position information output by the IPM is fused with the feature map corresponding to the feature down-sampling in the encoding unit (i.e., the encoding network).
  • the performing at least once feature up-sampling based on the feature map and the first position information to obtain the depth estimation result includes:
  • the performing feature up-sampling corresponding to at least once feature down-sampling based on at least one first fused feature map includes:
  • the first position information and the feature map corresponding to at least once feature down-sampling are feature-fused to obtain at least one first fused feature map; then each first fused feature map is used for the corresponding feature up-sampling corresponding to the feature down-sampling.
  • feature fusion is performed on each first fused feature map and the input feature map of the corresponding feature up-sampling to obtain a second fused feature map, and then feature up-sampling is performed on the second fused feature map.
  • the feature map for feature fusion with the first position information can be selected according to actual needs.
  • the feature map output by one feature down-sampling can be selected, or the feature maps output by multiple feature down-sampling operations can be selected.
  • it is necessary to select the input feature map of a corresponding feature up-sampling and the corresponding first fused feature map for feature fusion; that is, it is necessary to apply the first fused feature map to the corresponding feature up-sampling.
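  • As a hedged sketch of this fusion step (the layer names, channel counts, and use of bilinear resizing are assumptions rather than the patent's exact design; the skip connection that forms the second fused feature map is omitted for brevity), the first position information can be resized to the feature map's resolution, concatenated along the channel dimension, and convolved before up-sampling:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class PositionFusedUpsample(nn.Module):
            """Fuse the W*H*2 first position information with a decoder feature
            map, then perform one feature up-sampling step (illustrative only)."""

            def __init__(self, in_channels, out_channels):
                super().__init__()
                # +2 input channels for the x/y position maps produced by the IPM.
                self.conv = nn.Conv2d(in_channels + 2, out_channels, 3, padding=1)

            def forward(self, feat, pos):
                # feat: N*C*h*w decoder input, pos: N*2*H*W position information.
                pos = F.interpolate(pos, size=feat.shape[2:], mode='bilinear',
                                    align_corners=False)
                fused = torch.cat([feat, pos], dim=1)   # "first fused feature map"
                fused = F.relu(self.conv(fused))
                # Feature up-sampling corresponding to one down-sampling step.
                return F.interpolate(fused, scale_factor=2, mode='bilinear',
                                     align_corners=False)

        up = PositionFusedUpsample(64, 32)
        out = up(torch.randn(1, 64, 30, 40), torch.randn(1, 2, 480, 640))
        print(out.shape)  # torch.Size([1, 32, 60, 80])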
  • auto exposure and auto focus are functions that need to be available at the same time.
  • these two functions are achieved by different network models, as shown in Fig. 8: the auto exposure function requires a network model that realizes human parsing and exposure parameter setting and takes about 25 milliseconds; the auto focus function requires a model that implements depth estimation and takes about 7 to 10 milliseconds. If the two network models are run separately, they consume considerable computing resources and time, which affects real-time performance.
  • the above encoding network can be shared by the human parsing task and the depth estimation task to a certain extent.
  • the other tasks can also share the encoding network with the depth estimation task.
  • other tasks can also be the semantic parsing task, which is more widely used; the embodiment of the present application only uses the human parsing task as an example to describe the solution in detail, but is not limited to this. It can also be understood that the solution provided by the embodiment of the present application can be applied not only in the preview mode, but also in the non-preview mode.
  • the encoding network includes several feature down-sampling units, and at least one feature down-sampling unit includes a first feature down-sampling module (such as a semantic parsing encoding module, and further, such as a human parsing encoding module) and a second feature down-sampling module (i.e., a depth estimation encoding module), wherein the first feature down-sampling module performs feature down-sampling on the input image to be processed or feature map to obtain a first feature map, the second feature down-sampling module performs feature down-sampling on the input image to be processed or feature map to obtain a second feature map, and the fused result of the first feature map and the second feature map is output.
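  • A minimal PyTorch sketch of such a feature down-sampling unit is given below (the convolution sizes, strides, and channel counts are assumptions): two parallel branches down-sample the shared input, and the unit returns the first feature map, the second feature map, and their fused (concatenated) result, which would feed the next unit.

        import torch
        import torch.nn as nn

        class DualBranchDownsampleUnit(nn.Module):
            """One feature down-sampling unit with a first (e.g. human parsing)
            branch and a second (depth estimation) branch (illustrative only)."""

            def __init__(self, in_channels, c_parse, c_depth):
                super().__init__()
                self.parse_branch = nn.Conv2d(in_channels, c_parse, 3, stride=2, padding=1)
                self.depth_branch = nn.Conv2d(in_channels, c_depth, 3, stride=2, padding=1)

            def forward(self, x):
                f1 = torch.relu(self.parse_branch(x))   # first feature map
                f2 = torch.relu(self.depth_branch(x))   # second feature map
                fused = torch.cat([f1, f2], dim=1)      # fused result for the next unit
                return f1, f2, fused

        unit = DualBranchDownsampleUnit(3, c_parse=16, c_depth=8)
        f1, f2, fused = unit(torch.randn(1, 3, 128, 128))
        print(f1.shape, f2.shape, fused.shape)  # 16, 8 and 24 channels at 64*64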
  • convolution kernel used by the first feature down-sampling module and the second feature down-sampling module are standard convolution or point-wise convolution
  • convolution kernel used by the first feature down-sampling module includes the first convolution kernel of the first dimension and the second convolution kernel of the second dimension, wherein the value of the second convolution kernel is zero
  • the first dimension is determined based on the number of convolution kernels used by the first feature down-sampling module in a previous feature down-sampling unit
  • the second dimension is determined based on the number of convolution kernels used by the second feature down-sampling module in the previous feature down-sampling unit.
  • the convolution kernel of standard convolution or point-wise convolution is generally a three-dimensional convolution kernel.
  • Each three-dimensional convolution kernel can be regarded as a superposition of multiple two-dimensional convolution kernels. Then the three-dimensional convolution kernel can be divided into two parts, that is, the first convolution kernel of the first dimension and the second convolution kernel of the second dimension.
  • the height of the first convolution kernel is equal to the number of convolution kernels of the first feature down-sampling module in the previous feature down-sampling unit
  • the height of the second convolution kernel is equal to the number of convolution kernels of the second feature down-sampling module in the previous feature down-sampling unit
  • the second convolution kernel is 0 (that is, all weights in the second convolution kernel are 0).
  • the first feature down-sampling module only extracts the feature information in the feature map output by the first feature down-sampling module in the previous feature down-sampling unit, that is, it can ensure that the human parsing module only extracts the human parsing feature information in the feature map output by the human parsing module in the previous feature down-sampling unit.
  • each two-dimensional convolution kernel superimposed in the three-dimensional convolution kernel can also be called a slice, that is, each three-dimensional convolution kernel is composed of multiple slices, wherein, the number of slices of the first convolution kernel is equal to the number of convolution kernels of the first feature down-sampling module in the previous feature down-sampling unit, and the number of slices of the second convolution kernel is equal to the number of the convolution kernels of the second feature down-sampling module in the previous feature down-sampling unit.
  • the height of the convolution kernel of the second feature down-sampling module is the same as the height of the convolution kernel of the first feature down-sampling module (the dimensions of the convolution kernel are the same).
  • the convolution kernel used by the first feature down-sampling module and the second feature down-sampling module is a depth-wise convolution kernel
  • the number of convolution kernels used by the first feature down-sampling module is the same as the number of convolution kernels used by the first feature down-sampling module in the previous feature down-sampling unit
  • the number of convolution kernels used by the second feature down-sampling module is the same as the number of convolution kernels used by the second feature down-sampling module in the previous feature down-sampling unit.
  • the convolution kernel for depth-wise convolution is generally a two-dimensional convolution kernel.
  • Each convolution kernel performs a convolution operation on the feature map of one channel output by the previous feature down-sampling unit, so the number of convolution kernels for depth-wise convolution is equal to the number of channels output by the previous feature down-sampling unit.
  • the number of convolution kernels used by the first feature down-sampling module is the same as the number of convolution kernels used by the first feature down-sampling module in the previous feature down-sampling unit
  • the number of convolution kernels used by the second feature down-sampling module is the same as the number of convolution kernels used by the second feature down-sampling module in the previous feature down-sampling unit.
  • the method may further include:
  • the corresponding processing result obtained by the second decoding network may be a semantic parsing result of the image to be processed.
  • the corresponding processing result obtained by the first decoding network may be the depth estimation result of the image to be processed.
  • semantic parsing is similar to semantic segmentation, and the basic task of the two can be considered as assigning a category to each pixel.
  • semantic parsing is usually more detailed than semantic segmentation.
  • semantic segmentation may divide an image into human body, blue sky, grass, etc., while semantic parsing may divide it into eyebrows, nose, mouth, etc.
  • the pre-defined network model of the present application can be understood as being acquired by expansion of the existing human parsing network model.
  • Fig. 9a is a schematic diagram of the network structure of an existing human parsing network model, which includes a human parsing encoding module and a human parsing decoding module.
  • the human parsing encoding module part of the human parsing network model is composed of multiple convolutional layers, including standard ordinary convolution (i.e., standard convolution) and channel-by-channel (Depth-wise) convolution.
  • After the input image passes through multiple convolutional layers, shallow feature maps, middle feature maps, and deep feature maps can be obtained.
  • the input image passes through C 1*k*k convolution kernels to obtain C shallow feature maps, then through D standard convolution kernels to obtain D middle feature maps, and then through D depth-wise convolution kernels (which can also be called channel-by-channel convolution kernels) to obtain D deep feature maps.
  • This is followed by the human parsing decoding module (that is, the second decoding network).
  • the human parsing decoding module is also composed of multiple convolutional layers.
  • the human parsing decoding module performs feature up-sampling on the output feature map of the human parsing encoding module, and the final output is the human parsing result corresponding to the image to be processed.
  • to obtain the pre-defined network model, a depth estimation encoding module (i.e., a second feature down-sampling module) is added alongside the human parsing encoding module (i.e., the first feature down-sampling module), and the depth estimation decoding unit corresponding to the depth estimation encoding unit (i.e., the first decoding network) together with the human parsing decoding unit constitute the decoding network in the pre-defined network model of the embodiment of the present application; the schematic diagram of the network structure of the pre-defined network model is shown in Fig. 10a.
  • the method of expanding the human parsing encoding module part is: in the human parsing network model, when going from the image to be processed to the shallow feature maps, the human parsing task requires C 1*k*k convolution kernels, and C' 1*k*k convolution kernels are additionally added for the depth estimation task.
  • This is followed by the human parsing decoding module part and the depth estimation decoding module part.
  • the human parsing decoding module performs feature up-sampling on the output feature map corresponding to the human parsing encoding module
  • the depth estimation decoding module performs feature up-sampling on the output feature corresponding to the depth estimation encoding module
  • the final output is the human parsing result and the depth estimation result (i.e., depth image) corresponding to the image to be processed.
  • the human parsing encoding module only extracts the features of the part of the feature map corresponding to the human parsing task in the input feature map, while the depth estimation encoding module simultaneously extracts the features of the part of the feature map corresponding to the human parsing task and the features of the part of the feature map corresponding to the depth estimation task in the input feature map.
  • the process of expanding the human parsing encoding module to obtain the depth estimation encoding module is as follows: first, determine the new convolution kernels corresponding to the depth estimation; then, since the human parsing does not need to use the features extracted by the depth estimation, and in order to avoid computationally expensive merge operations, zero padding is performed for the convolution kernels corresponding to the human parsing encoding module, and the dimension of the zero padding is equal to the dimension of the added convolution kernels; finally, since the depth estimation can use the features extracted by the human parsing, the superposition of the added convolution kernels and the convolution kernels before zero padding corresponding to the human parsing module forms the convolution kernels corresponding to the depth estimation encoding module.
  • D convolution kernels with dimensions C*k*k are used in the human parsing network model
  • D convolution kernels with dimensions (C+C')*k*k are used in the human parsing encoding module part of the pre-defined network model of the embodiment of the present application, as shown in Fig. 11a: the first C-dimensional part of each kernel (the convolution kernel part with a height of C) is the same as the convolution kernel in the human parsing network model, and the added C'-dimensional part (the convolution kernel part with a height of C') has a value of 0, so that computationally expensive merge operations are avoided without affecting the human parsing.
  • the dimension of the convolution kernel used in the depth estimation encoding module part of the pre-defined network model is also (C+C')*k*k.
  • the above is a method of expanding the convolution kernel (that is, expanding the encoding module), and different expansion methods can be used in different network branches.
  • for the human parsing encoding module, if the convolution kernels are those used by the depth-wise convolution layers, they do not need to be filled with zeros.
  • for the standard convolution kernels of the depth estimation encoding module, if the current layer does not need to use the features extracted by human parsing, the convolution kernels of the current layer need to be filled with zeros, as shown in Fig. 11b, where the convolution kernels of the dimension corresponding to C are all 0 and the convolution kernels of the dimension corresponding to C' are the depth estimation convolution kernels.
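  • The kernel expansion described above can be sketched as follows (shapes and initialization are illustrative): the human parsing kernels gain C' zero-valued input slices, while the depth estimation kernels span all C+C' input slices, with the C parsing slices zeroed only when a layer does not reuse parsing features.

        import torch

        def expand_parsing_weight(w_parse, c_extra):
            """w_parse: existing human parsing kernels of shape (D, C, k, k).
            Returns kernels of shape (D, C+C', k, k) whose added C' input slices
            are zero, so the parsing branch ignores depth features (cf. Fig. 11a)."""
            d, c, k, _ = w_parse.shape
            zeros = torch.zeros(d, c_extra, k, k, dtype=w_parse.dtype)
            return torch.cat([w_parse, zeros], dim=1)

        def make_depth_weight(d_depth, c, c_extra, k, reuse_parsing=True):
            """Depth estimation kernels of shape (D', C+C', k, k); if the layer
            does not reuse parsing features, the C parsing slices are zeroed
            (cf. Fig. 11b)."""
            w = torch.randn(d_depth, c + c_extra, k, k) * 0.01
            if not reuse_parsing:
                w[:, :c] = 0.0
            return w

        w_parse = torch.randn(32, 16, 3, 3)                           # D=32, C=16, k=3
        w_parse_expanded = expand_parsing_weight(w_parse, c_extra=8)  # C'=8
        w_depth = make_depth_weight(16, c=16, c_extra=8, k=3)
        print(w_parse_expanded.shape, w_depth.shape)  # (32, 24, 3, 3) (16, 24, 3, 3)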
  • Performance: there are many similarities between the features of the human parsing task and the features of the depth estimation task, and reusing the human parsing features in the depth estimation task can improve the effect of depth estimation. Compared with a network for a separate depth estimation task, the encoding part of the depth estimation task network in this scheme can obtain more semantic features: not only are there more features, but the features are also semantically richer, and more semantic features can improve the performance of depth estimation. For example, if the category to which each pixel in the image belongs is known, for example, a certain part is the human body, the depth values of the pixels in this part should be similar.
  • Model size: two models are required for separate human parsing and separate depth estimation, and these models occupy a relatively large storage space. For the multi-task network of the present application, under the premise of ensuring the performance of depth estimation, since the number of convolution kernels required for depth estimation is much smaller than that of a separate depth estimation network, the multi-task network occupies a smaller storage space.
  • the movement of the hand holding the camera will introduce a random angle offset.
  • figure (a) shows the horizontal and vertical offset caused by hand movement, and the distribution of the offset is almost symmetric;
  • figure (b) shows the histogram statistics of the angle offset, and most of the offsets are relatively small;
  • the hand movement is reflected on the shooting device as follows: a closer object in the image has a larger position offset than a distant object.
  • the offset caused by these movements reflects the difference in depth values between objects.
  • on the same image, an object whose depth value is closer to the shooting device has a larger disparity value, and an object whose depth value is farther from the shooting device has a smaller disparity value, as shown in Fig. 13: the upper part of the figure shows consecutive k images, and the lower part of the figure shows the disparity maps obtained between every two adjacent images and the depth map of the kth image.
  • the method may further include:
  • the performing depth estimation on the image to be processed based on the first position information to obtain the depth image corresponding to the image to be processed includes:
  • acquiring the first disparity information corresponding to the image to be processed includes:
  • the acquiring the first disparity information based on the second disparity information includes:
  • a disparity map (that is, the second disparity information) can be calculated for every two adjacent images, and the depth map is calculated based on the accumulation or average of the (k-1) absolute disparity maps (i.e., the first disparity information).
  • the disparity value in the horizontal direction between the second frame and the first frame is d_x(2,1), and the disparity value in the vertical direction is d_y(2,1);
  • the disparity value in the horizontal direction between the third frame and the second frame is d_x(3,2), and the disparity value in the vertical direction is d_y(3,2);
  • the disparity value in the horizontal direction between the kth frame and the (k-1)th frame is d_x(k,k-1), and the disparity value in the vertical direction is d_y(k,k-1).
  • a multi-frames based disparity (MFbD) module can be preset according to the above processing process to obtain the first disparity information of the image to be processed; the input of the MFbD is the at least two consecutive images (i.e., the consecutive k images) containing the image to be processed, and the output is the first disparity information corresponding to the image to be processed.
  • Fig. 15 is a schematic diagram of the input and output of the MFbD: the k consecutive images in the preview mode are input, and two disparity matrices of size W*H (which can be recorded as W*H*2) are output through the MFbD convolution kernel. Each element in the two disparity matrices is the horizontal or vertical disparity information of the corresponding pixel.
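  • A hedged sketch of such a module is given below; the patent text does not specify how the pairwise (second) disparity is obtained, so dense optical flow is used here purely as a stand-in, and the function and parameter names are illustrative.

        import numpy as np
        import cv2  # optical flow as one plausible way to obtain per-pixel disparity

        def multi_frames_based_disparity(frames, mode='accumulate'):
            """From k consecutive grayscale frames (each H*W), compute the pairwise
            disparity between adjacent frames and return the accumulated or averaged
            result as an H*W*2 array (horizontal and vertical components)."""
            total = np.zeros((*frames[0].shape, 2), np.float32)
            for prev, cur in zip(frames[:-1], frames[1:]):
                flow = cv2.calcOpticalFlowFarneback(prev, cur, None,
                                                    0.5, 3, 15, 3, 5, 1.2, 0)
                total += np.abs(flow)              # second disparity information
            if mode == 'average':
                total /= (len(frames) - 1)
            return total                           # first disparity information

        # Example with k = 4 frames of size 480*640 (random data for shape checking).
        frames = [np.random.randint(0, 256, (480, 640), np.uint8) for _ in range(4)]
        disparity = multi_frames_based_disparity(frames)
        print(disparity.shape)  # (480, 640, 2)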
  • the performing at least once feature up-sampling based on the feature map, the first position information and the first disparity information to obtain the depth estimation result includes:
  • performing feature up-sampling corresponding to at least once feature down-sampling based on at least one third fused feature map includes:
  • the MFbD can perform depth estimation on the image to be processed by itself, or perform depth estimation on the image to be processed together with the pre-defined network model described above.
  • the connection method of the MFbD and the pre-defined network model is the same as the connection method of the IPM and the pre-defined network model;
  • the usage of the first disparity information output by the MFbD in the depth estimation of the pre-defined network model is the same as the usage of the first position information, and will not be repeated here.
  • the technical solution provided by the present application can provide a wide range of depth map estimation (covering micro-depth) to improve the auto focus function of subsequent mobile phones.
  • the pre-defined network model in the solution provided by the present application can be adapted to different cameras: it can be used for the main camera, the telephoto lens, etc. in the rear camera group, and can also be applied to different cameras in the front camera group.
  • the solution provided in the present application can use one model to complete the functions of auto exposure and auto focus.
  • the embodiment of the present application proposes a multi-task wide-range depth estimation method based on deep learning.
  • the wide-range depth estimation has very broad application prospects in autofocus, AI (Artificial Intelligence) camera, AR (Augmented Reality)/VR (Virtual Reality) and other fields, and can be used in various smart phones, robots and AR/VR devices.
  • the embodiment of the present application obtains support for a wide range of depths by integrating the image plane coordinate transformation into the convolutional layers, realizes more robust depth estimation at long and short distances by fusing information from multiple images, and fuses depth estimation and other tasks into the same network in a multi-task manner to achieve efficient and wide-range depth estimation, which can meet the real-time needs in preview mode.
  • the following describes the training process of the pre-defined network model used in the above process.
  • This part includes the acquisition of training data, the design of the loss function, and the design of evaluation criteria, which are specifically described as follows.
  • z is the depth value of each pixel of the current image, t is the set translation depth, and R is the set rotation matrix.
  • Step (3): there will be many holes in the image produced in step (2); bilinear interpolation is used to fill the holes.
  • the left image is an image taken from the NYUDv2 data set;
  • the right image is the target image (i.e., the training data) obtained by transforming that image.
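  • The generation of a target-camera image from one RGB-D sample can be sketched as follows; the intrinsic matrix K, the full translation vector t, and the inpainting-based hole filling are assumptions made for illustration (the text above only names a translation depth, a rotation matrix R, and bilinear filling), and occlusion ordering is ignored for brevity.

        import numpy as np
        import cv2

        def synthesize_target_view(rgb, depth, K, R, t):
            """Forward-warp an RGB-D image (rgb: H*W*3 uint8, depth: H*W in metres)
            to a virtual camera with pose (R, t), then fill the holes."""
            h, w = depth.shape
            ys, xs = np.mgrid[0:h, 0:w]
            pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3*N
            pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)   # back-project to 3-D
            pts = R @ pts + t.reshape(3, 1)                       # apply the set pose
            proj = K @ pts                                        # re-project
            u = np.round(proj[0] / proj[2]).astype(int)
            v = np.round(proj[1] / proj[2]).astype(int)
            valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (proj[2] > 0)
            out = np.zeros_like(rgb)
            mask = np.zeros((h, w), np.uint8)
            out[v[valid], u[valid]] = rgb.reshape(-1, 3)[valid]
            mask[v[valid], u[valid]] = 1
            # Step (3): fill the holes left by forward warping (inpainting is used
            # here as a stand-in for the bilinear interpolation mentioned above).
            return cv2.inpaint(out, (1 - mask).astype(np.uint8), 3, cv2.INPAINT_TELEA)

        rgb = np.random.randint(0, 256, (480, 640, 3), np.uint8)
        depth = np.random.uniform(0.5, 10.0, (480, 640))
        K = np.array([[520.0, 0.0, 320.0], [0.0, 520.0, 240.0], [0.0, 0.0, 1.0]])
        warped = synthesize_target_view(rgb, depth, K, np.eye(3), np.array([0.0, 0.0, 0.1]))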
  • D(i) is the cost function of the ith depth value interval.
  • the cost function can be L1, L2, or another function that measures the error between the true value and the estimated depth value; the remaining terms denote the value of the acceptable error, the interval of the true value of the ith depth value, the average value of the acceptable error in this interval in the tth batch of data, and two hyper-parameters, and L represents the moving average error over all intervals.
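  • The exact weighting and moving-average scheme cannot be reproduced here; the sketch below only illustrates the general idea of a per-interval cost D(i) averaged over depth-value intervals so that sparsely populated depth ranges still contribute to the loss (the interval edges and the plain L1 cost are assumptions):

        import torch

        def interval_balanced_l1(pred, target, edges):
            """Average an L1 cost D(i) over the depth-value intervals that are
            present in the batch (illustrative only)."""
            costs = []
            for lo, hi in zip(edges[:-1], edges[1:]):
                mask = (target >= lo) & (target < hi)
                if mask.any():
                    costs.append(torch.abs(pred[mask] - target[mask]).mean())
            return torch.stack(costs).mean() if costs else pred.sum() * 0.0

        edges = [0.07, 0.5, 1.2, 1.6, 10.0, 100.0]   # illustrative interval edges
        loss = interval_balanced_l1(torch.rand(4, 1, 32, 32) * 10,
                                    torch.rand(4, 1, 32, 32) * 10, edges)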
  • Fig. 19 shows the histogram statistical results of the depth values of NYUDv2 and KITTI datasets.
  • Table 1 shows the acceptable error values for different depth ranges. For example, for level 1, the acceptable error range for the corresponding depth value range [0.07m, 0.08m) is 0.005m, and the acceptable error range for the depth value range [1.2m, 1.6m) is 0.2m.
  • if the difference between the estimated depth value d and the true depth value is less than AE, the pixel is considered to meet the condition.
  • the number of pixels that meet the condition is recorded as N_acc, and the number of all pixels on the test set is recorded as N_all; the acceptable-error evaluation criterion is then expressed as the ratio N_acc / N_all.
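  • A small sketch of this evaluation criterion (the function names and the fallback AE rule are illustrative; the two table entries follow the examples quoted above):

        import numpy as np

        def acceptable_error_rate(d_est, d_true, acceptable_error):
            """Fraction of pixels whose estimated depth differs from the true depth
            by less than the acceptable error AE for that pixel's depth range."""
            ae = np.vectorize(acceptable_error)(d_true)
            accepted = np.abs(d_est - d_true) < ae
            return accepted.sum() / d_true.size

        def ae_lookup(d):
            # Illustrative AE lookup loosely following the Table 1 examples above.
            if 0.07 <= d < 0.08:
                return 0.005
            if 1.2 <= d < 1.6:
                return 0.2
            return 0.1 * d          # assumed fallback, not taken from the patent

        d_true = np.random.uniform(0.07, 5.0, (480, 640))
        d_est = d_true + np.random.normal(0.0, 0.05, d_true.shape)
        print(acceptable_error_rate(d_est, d_true, ae_lookup))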
  • Fig. 20 is a schematic flowchart of an image processing method according to an embodiment of the present application. As shown in Fig. 20, the method may include:
  • Step S2001: through an encoding network, performing at least once feature down-sampling on the image to be processed to obtain a corresponding feature map;
  • the encoding network includes several feature down-sampling units
  • the at least one feature down-sampling unit includes a first feature down-sampling module and a second feature down-sampling module
  • the first feature down-sampling module and the second feature down-sampling module respectively perform feature down-sampling on the input image to be processed or the feature map to obtain a first feature map and the second feature map, and output the fused result of the first feature map and the second feature map
  • Step S2002: through a first decoding network, performing feature up-sampling at least once based on the feature map output by the encoding network to obtain a first processing result;
  • Step S2003: through a second decoding network, performing feature up-sampling at least once based on the first feature map output by the at least one first feature down-sampling module in the encoding network to obtain a second processing result.
  • the encoding network performs feature down-sampling on the image to be processed multiple times through the first feature down-sampling module and the second feature down-sampling module, and the feature maps output by the two feature down-sampling modules are decoded by the corresponding decoding networks to obtain the corresponding processing results. Two processing results are thus obtained within one model, and when the second down-sampling module in the encoding network performs down-sampling, it can reuse the features extracted by the first down-sampling module, making the processing result obtained after decoding more accurate.
  • the convolution kernel used by the first feature down-sampling module includes a first convolution kernel of a first dimension and a second convolution kernel of a second dimension, wherein the value of the second convolution kernel is zero, wherein,
  • the first dimension is determined based on the number of convolution kernels used by the first feature down-sampling module in the previous feature down-sampling unit
  • the second dimension is determined based on the number of convolution kernels used by the second feature down-sampling module in the previous feature down-sampling unit.
  • the number of convolution kernels used by the first feature down-sampling module is equal to the number of convolution kernels used by the first feature down-sampling module in the previous feature down-sampling unit
  • the number of convolution kernels used by the second feature down-sampling module is equal to the number of convolution kernels used by the second feature down-sampling module in the previous feature down-sampling unit.
  • the first processing result is the depth estimation result of the image to be processed
  • the second processing result is the semantic parsing result of the image to be processed
  • the depth estimation task and other related tasks in the embodiment of the present application share the encoding network, and the depth estimation feature down-sampling module in the encoding network can reuse the features extracted by the feature down-sampling module of the related task.
  • the depth estimation task and the human parsing task share an encoding network, and the encoding network includes a human parsing encoding module and a depth estimation module.
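  • The zero-valued second kernel described above can be pictured with the sketch below: within one down-sampling unit, the first (related-task) module ignores the channels produced by the second (depth) module, while the depth module sees the whole fused input and therefore reuses the related-task features. The channel sizes, the strided 3x3 convolutions, and the simple concatenation fusion are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DualDownsampleUnit(nn.Module):
    """One feature down-sampling unit with two modules whose outputs are fused (a sketch)."""

    def __init__(self, c1_in, c2_in, c1_out, c2_out):
        super().__init__()
        # first module: its kernel spans both channel groups of the fused input, but the
        # part reading the second module's channels is set to zero (and should be kept
        # zero during training, e.g. by re-zeroing after each optimizer step)
        self.conv1 = nn.Conv2d(c1_in + c2_in, c1_out, 3, stride=2, padding=1)
        with torch.no_grad():
            self.conv1.weight[:, c1_in:].zero_()
        # second module: full kernels, so it can reuse the first module's features
        self.conv2 = nn.Conv2d(c1_in + c2_in, c2_out, 3, stride=2, padding=1)

    def forward(self, fused_in):
        f1 = torch.relu(self.conv1(fused_in))   # first feature map
        f2 = torch.relu(self.conv2(fused_in))   # second feature map
        # fused result goes to the next unit; f1 also feeds the second decoding network
        return torch.cat([f1, f2], dim=1), f1
```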
  • Fig. 22 is a schematic flowchart of a depth estimation method provided by an embodiment of the present application. As shown in Fig. 22, the method may include:
  • Step S2201: acquiring at least two consecutive images containing an image to be processed
  • Step S2202: acquiring first disparity information corresponding to the image to be processed based on the at least two consecutive images.
  • Step S2203: performing depth estimation on the image to be processed based on the first disparity information.
  • the solution provided by the present application acquires the disparity information of the image to be processed from at least two consecutive images containing the image to be processed and performs depth estimation in combination with this disparity information; thus, the influence on depth estimation of the disparity introduced when collecting the images to be processed is eliminated, and the accuracy and robustness of the depth estimation are improved.
  • the acquiring the first disparity information corresponding to the image to be processed based on at least two consecutive images includes:
  • the acquiring the first disparity information based on the second disparity information includes:
  • the performing depth estimation on the image to be processed based on the first disparity information includes:
  • the performing at least once feature up-sampling based on the feature map and the first disparity information includes:
  • the performing feature up-sampling corresponding to at least once feature down-sampling based on at least one fifth fused feature map includes:
  • MFbD is combined with the pre-defined network model described above to perform depth estimation on the image to be processed.
  • the first disparity information output by MFbD is used in the pre-defined network model for depth estimation in the same way as the first position information, so the details are not repeated here.
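  • MFbD itself is not detailed in this excerpt, so the sketch below uses classical dense optical flow between two consecutive frames purely as a stand-in for obtaining a per-pixel disparity cue that could then be fed to the decoder in the same way as the first position information.

```python
import cv2
import numpy as np

def disparity_from_consecutive_frames(prev_bgr, curr_bgr):
    """A per-pixel disparity cue from two consecutive frames (illustrative stand-in for MFbD)."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # use the flow magnitude as a single-channel disparity map (H x W, float32)
    return np.linalg.norm(flow, axis=2).astype(np.float32)
```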
  • Fig. 24 is a structural block diagram of a depth estimation device provided by an embodiment of the present application.
  • the device 2400 may include: a position information acquisition module 2401 and a depth image acquisition module 2402, wherein,
  • the position information acquisition module 2401 is configured to map the image to be processed to a preset plane, and acquire first position information of pixels in the image to be processed on the preset plane;
  • the depth image acquisition module 2402 is configured to perform depth estimation on the image to be processed based on the first position information.
  • the solution provided by the present application maps the image to be processed to a preset plane to obtain position information of each pixel of the image to be processed on the preset imaging plane, and use the position information of each pixel in the image to be processed in the preset imaging plane in the depth estimation process to eliminate the influence of camera parameters on a depth estimation range, such that the same network model can estimate the depth of the image to be processed corresponding to different camera parameters. While ensuring a wide range of depth estimation, the solution saves computing resources and storage space.
  • the first position information acquisition module is specifically configured to:
  • the camera parameter includes at least one of focal length of a camera, position of a principal point, and size of a sensor.
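  • One way to realize such a mapping is sketched below: each pixel is back-projected with the actual camera intrinsics and re-projected onto a preset imaging plane with a fixed focal length, yielding a two-channel position map. The pinhole model and the particular choice of preset focal length are assumptions; the patent's exact mapping is not reproduced in this excerpt.

```python
import numpy as np

def first_position_information(h, w, fx, fy, cx, cy, f_preset=1.0):
    """Map every pixel of an H x W image onto a preset imaging plane (a sketch)."""
    u, v = np.meshgrid(np.arange(w, dtype=np.float64), np.arange(h, dtype=np.float64))
    # normalized camera coordinates using the actual focal length and principal point
    x_norm = (u - cx) / fx
    y_norm = (v - cy) / fy
    # re-project onto the preset plane with a fixed, camera-independent focal length
    pos = np.stack([x_norm * f_preset, y_norm * f_preset], axis=0)
    return pos  # 2 x H x W first position information
```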
  • the first position information acquisition module is further configured to:
  • the depth image acquisition module includes a feature down-sampling sub-module and a feature up-sampling sub-module, wherein:
  • the feature down-sampling sub-module is configured to perform feature down-sampling at least once on the image to be processed through an encoding network to obtain a corresponding feature map
  • the first feature up-sampling sub-module is configured to perform feature up-sampling at least once based on the feature map and the first position information through a first decoding network to obtain a depth estimation result.
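  • A single up-sampling step of the first decoding network could combine the feature map with the first position information as sketched below; the bilinear up-sampling, the concatenation along channels, and the layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionAwareUpBlock(nn.Module):
    """One feature up-sampling step that also consumes the 2-channel position map (a sketch)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 2, out_ch, 3, padding=1)  # +2 for the (x, y) position map

    def forward(self, feat, pos):
        feat = F.interpolate(feat, scale_factor=2, mode="bilinear", align_corners=False)
        pos = F.interpolate(pos, size=feat.shape[-2:], mode="bilinear", align_corners=False)
        return F.relu(self.conv(torch.cat([feat, pos], dim=1)))
```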
  • the encoding network includes several feature down-sampling units, and the at least one feature down-sampling unit includes a first feature down-sampling module and a second feature down-sampling module, wherein,
  • the first feature down-sampling module performs feature down-sampling on the input image to be processed or feature map to obtain a first feature map
  • the second feature down-sampling module performs feature down-sampling on the input image to be processed or the feature map to obtain a second feature map, and outputs the fused result of the first feature map and the second feature map.
  • the convolution kernel used by the first feature down-sampling module includes a first convolution kernel of a first dimension and a second convolution kernel of a second dimension, wherein the value of the second convolution kernel is zero, wherein,
  • the first dimension is determined based on the number of convolution kernels used by the first feature down-sampling module in the previous feature down-sampling unit
  • the second dimension is determined based on the number of convolution kernels used by the second feature down-sampling module in the previous feature down-sampling unit.
  • the convolution kernels used by the first feature down-sampling module and the second feature down-sampling module are standard convolution or point-wise convolution.
  • the number of convolution kernels used by the first feature down-sampling module is equal to the number of convolution kernels used by the first feature down-sampling module in the previous feature down-sampling unit
  • the number of convolution kernels used by the second feature down-sampling module is equal to the number of convolution kernels used by the second feature down-sampling module in the previous feature down-sampling unit.
  • the convolution kernels used by the first feature down-sampling module and the second feature down-sampling module are convolution kernels of depth-wise convolution.
  • the device may further include a second feature up-sampling sub-module, configured to:
  • the corresponding processing result obtained by the second decoding network is a semantic parsing result of the image to be processed.
  • the first feature up-sampling submodule is specifically configured to:
  • the first feature up-sampling submodule is further configured to:
  • the device may further include a disparity information acquisition module, configured to:
  • the depth estimation module is further configured to:
  • the disparity information acquisition module is specifically configured to:
  • the disparity information acquisition module is further configured to:
  • the first feature up-sampling submodule is further configured to:
  • the first feature up-sampling submodule is further configured to:
  • Fig. 25 is a structural block diagram of an image processing device provided by an embodiment of the present application.
  • the device 2500 may include: an encoding module 2501, a first decoding module 2502, and a second decoding module 2503, wherein:
  • the encoding module 2501 is configured to perform, through an encoding network, feature down-sampling at least once on the image to be processed to obtain a corresponding feature map, wherein the encoding network includes several feature down-sampling units, the at least one feature down-sampling unit includes a first feature down-sampling module and a second feature down-sampling module, and the first feature down-sampling module and the second feature down-sampling module respectively perform feature down-sampling on the input image to be processed or the feature map to obtain a first feature map and a second feature map and output the fused result of the first feature map and the second feature map;
  • the first decoding module 2502 is configured to perform feature up-sampling at least once based on the feature map output by the encoding network through a first decoding network to obtain a first processing result;
  • the second decoding module 2503 is configured to perform feature up-sampling at least once based on the first feature map output by the at least one first down-sampling module in the encoding network through a second decoding network to obtain a second processing result.
  • the solution provided by the present application maps the image to be processed to a preset plane to obtain position information of each pixel of the image to be processed on the preset imaging plane, and use the position information of each pixel in the image to be processed in the preset imaging plane in the depth estimation process to eliminate the influence of camera parameters on a depth estimation range, such that the same network model can estimate the depth of the image to be processed corresponding to different camera parameters. While ensuring a wide range of depth estimation, the solution saves computing resources and storage space.
  • the convolution kernel used by the first feature down-sampling module includes a first convolution kernel of a first dimension and a second convolution kernel of a second dimension, wherein the value of the second convolution kernel is zero, wherein,
  • the first dimension is determined based on the number of convolution kernels used by the first feature down-sampling module in the previous feature down-sampling unit
  • the second dimension is determined based on the number of convolution kernels used by the second feature down-sampling module in the previous feature down-sampling unit.
  • the number of convolution kernels used by the first feature down-sampling module is equal to the number of convolution kernels used by the first feature down-sampling module in the previous feature down-sampling unit
  • the number of convolution kernels used by the second feature down-sampling module is equal to the number of convolution kernels used by the second feature down-sampling module in the previous feature down-sampling unit.
  • the first processing result is a depth estimation result of the image to be processed
  • the second processing result is a semantic parsing result of the image to be processed
  • Fig. 26 is a structural block diagram of a depth estimation device provided by an embodiment of the present application.
  • the device 2600 may include: a consecutive image acquisition module 2601, a disparity information acquisition module 2602, and a depth estimation module 2603, wherein:
  • the consecutive image acquisition module 2601 is configured to acquire at least two consecutive images containing the image to be processed
  • the disparity information acquisition module 2602 is configured to acquire the first disparity information corresponding to the image to be processed based on at least two consecutive images;
  • the depth estimation module 2603 is configured to perform depth estimation on the image to be processed based on the first disparity information.
  • the first disparity information acquisition module is specifically configured to:
  • the first disparity information acquisition module is further configured to:
  • the depth estimation module includes: a feature down-sampling sub-module and a feature up-sampling sub-module, wherein:
  • the feature down-sampling sub-module is configured to perform feature down-sampling at least once on the image to be processed through the encoding network to obtain a corresponding feature map
  • the feature up-sampling sub-module is configured to perform feature up-sampling at least once based on the feature map and the first disparity information through the first decoding network to obtain a depth estimation result.
  • the feature up-sampling submodule is specifically configured to:
  • the feature up-sampling submodule is further configured to:
  • Fig. 27 shows a schematic structural diagram of an electronic device 1800 (for example, a terminal device or a server that executes the method shown in Fig. 4, Fig. 20, or Fig. 22) suitable for implementing embodiments of the present application.
  • the electronic devices in the embodiments of the present application may include, but are not limited to, mobile terminals (such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), vehicle-mounted terminals (for example, car navigation terminals)) and fixed terminals (such as digital TVs, desktop computers, etc.).
  • the electronic device shown in Fig. 27 is only an example, and should not bring any limitation to the functions and scope of use of the embodiments of the present application.
  • the electronic device includes: a memory and a processor, wherein the memory is configured to store programs for executing the methods described in the foregoing method embodiments; the processor is configured to execute the programs stored in the memory.
  • the processor here may be referred to as the processing device 1801 described below, and the memory may include at least one of a read-only memory (ROM) 1802, a random-access memory (RAM) 1803, and a storage device 1808, specifically shown as follows:
  • the electronic device 1800 may include a processing device (such as a central processing unit, a graphics processor, etc.) 1801, which can execute various appropriate actions and processing according to programs stored in a read-only memory (ROM) 1802 or programs loaded from a storage device 1808 into a random-access memory (RAM) 1803.
  • In the RAM 1803, various programs and data required for the operation of the electronic device 1800 are also stored.
  • the processing device 1801, ROM 1802, and RAM 1803 are connected to each other through a bus 1804.
  • An input/output (I/O) interface 1805 is also connected to the bus 1804.
  • the following devices can be connected to the I/O interface 1805: an input device 1806 such as a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, or gyroscope; an output device 1807 such as a liquid crystal display (LCD), speaker, or vibrator; a storage device 1808 such as a magnetic tape or a hard disk; and a communication device 1809.
  • the communication device 1809 may allow the electronic device 1800 to perform wireless or wired communication with other devices to exchange data.
  • Although Fig. 27 shows an electronic device having various devices, it should be understood that it is not required to implement or include all of the illustrated devices; more or fewer devices may alternatively be implemented or provided.
  • the process described above with reference to the flowchart can be implemented as computer software programs.
  • the embodiments of the present application include a computer program product, which includes computer programs carried on a non-transitory computer readable medium, and the computer programs include program codes for executing the method shown in the flowchart.
  • the computer programs may be downloaded and installed from the network through the communication device 1809, or installed from the storage device 1808, or installed from the ROM 1802.
  • when the computer programs are executed by the processing device 1801, the above functions defined in the method of the embodiment of the present application are executed.
  • the aforementioned computer readable medium in the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two.
  • the computer readable storage medium may be, for example, but not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer readable storage media may include, but are not limited to: electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • a computer readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer readable signal medium may include a data signal propagated in a baseband or as a part of a carrier wave, in which computer readable program code is carried. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the computer readable signal medium may also be any computer readable medium other than the computer readable storage medium.
  • the computer readable signal medium may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the computer readable medium can be transmitted by any suitable medium, including but not limited to: wire, optical cable, RF (radio frequency), etc., or any suitable combination of the above.
  • the client and server can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (for example, a communication network).
  • Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (for example, the Internet), and peer-to-peer networks (for example, ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
  • the above computer-readable medium may be contained in the above electronic device, or it may exist alone without being assembled into the electronic device.
  • the above computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, they cause the electronic device to perform:
  • mapping an image to be processed to a preset plane and acquiring first position information of pixels in the image to be processed on the preset plane; and performing depth estimation on the image to be processed based on the first position information.
  • through an encoding network, performing feature down-sampling at least once on the image to be processed to obtain a corresponding feature map
  • the encoding network includes several feature down-sampling units
  • the at least one feature down-sampling unit includes a first feature down-sampling module and a second feature down-sampling module
  • the first feature down-sampling module and the second feature down-sampling module respectively perform feature down-sampling on the input image to be processed or the feature map to obtain a first feature map and a second feature map, and fuse and output the first feature map and the second feature map
  • through a first decoding network, performing feature up-sampling at least once based on the feature map output by the encoding network to obtain a first processing result
  • through a second decoding network, performing feature up-sampling at least once based on the first feature map output by the at least one first down-sampling module in the encoding network to obtain a second processing result.
  • acquiring at least two consecutive images containing an image to be processed; acquiring first disparity information corresponding to the image to be processed based on the at least two consecutive images; and performing depth estimation on the image to be processed based on the first disparity information.
  • the computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof.
  • the above-mentioned programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, and also include conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code can be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • a remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or the connection can be made to an external computer (for example, through the Internet using an Internet service provider).
  • each block in the flowchart or block diagram may represent a module, program segment, or part of code, and the module, program segment, or part of code contains one or more executable instructions for realizing the specified logical function.
  • the functions marked in the block may also occur in a different order from the order marked in the drawings. For example, two blocks shown in succession can actually be executed substantially in parallel, or they can sometimes be executed in the reverse order, depending on the functions involved.
  • each block in the block diagram and/or flowchart, and a combination of blocks in the block diagram and/or flowchart can be implemented by a dedicated hardware-based system that performs the specified function or operation, or it can be realized by a combination of dedicated hardware and computer instructions.
  • the modules or units involved in the embodiments described in the present application can be implemented in software or hardware. In some cases, the name of a module or unit does not constitute a limitation on the unit itself.
  • the first position information acquisition module can also be described as "a module for acquiring first position information".
  • exemplary types of hardware logic components include: Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), Application Specific Standard Product (ASSP), System on Chip (SOC), Complex Programmable Logical device (CPLD) and the like.
  • a machine-readable medium may be a tangible medium, which may contain or store a program for use by the instruction execution system, apparatus, or device or in combination with the instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • the machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any suitable combination of the foregoing.
  • machine-readable storage media would include electrical connections based on one or more wires, portable computer disks, hard drives, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • the device provided in the embodiment of the present application may implement at least one of the multiple modules through an AI model.
  • the functions associated with AI may be performed by a non-volatile memory, a volatile memory, and a processor.
  • the processor may include one or more processors.
  • the one or more processors may be general-purpose processors, such as a central processing unit (CPU), an application processor (AP), etc., or a pure graphics processing unit, such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or AI dedicated processor, such as a neural processing unit (NPU).
  • the one or more processors control the processing of input data according to predefined operating rules or artificial intelligence (AI) models stored in the non-volatile memory and volatile memory.
  • the pre-defined operating rules or artificial intelligence models are provided through training or learning.
  • providing by learning refers to obtaining predefined operating rules or AI models with desired characteristics by applying learning algorithms to multiple learning data.
  • This learning may be performed in the device itself in which the AI according to the embodiment is executed, and/or may be realized by a separate server/system.
  • the AI model may contain multiple neural network layers. Each layer has multiple weight values, and the calculation of one layer is performed using the calculation result of the previous layer and the multiple weights of the current layer.
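  • As a minimal illustration of one layer's calculation, the sketch below applies the current layer's weights to the previous layer's result; the fully-connected form and the ReLU activation are generic assumptions, not the patent's specific network.

```python
import numpy as np

def layer_forward(prev_output, weights, bias):
    """Compute one layer's result from the previous layer's output and the layer's weights."""
    return np.maximum(weights @ prev_output + bias, 0.0)  # linear step followed by ReLU

prev = np.random.rand(4)            # calculation result of the previous layer
w = np.random.rand(8, 4) - 0.5      # weight values of the current layer
b = np.zeros(8)
out = layer_forward(prev, w, b)     # calculation result of the current layer
```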
  • Examples of neural networks include, but are not limited to, Convolutional Neural Networks (CNN), Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), Restricted Boltzmann Machines (RBM), Deep Belief Networks (DBN), Bidirectional Recurrent Deep Neural Networks (BRDNN), Generative Adversarial Networks (GAN), and Deep Q Networks.
  • a learning algorithm is a method of training a predetermined target device (for example, a robot) using a plurality of learning data to make, allow, or control the target device to make determination or prediction.
  • Examples of the learning algorithm include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

La présente invention concerne un procédé d'estimation de profondeur, un dispositif, un équipement électronique et un support de stockage lisible par ordinateur. Les étapes associées d'estimation de profondeur dans cette invention peuvent être traitées par un module d'intelligence artificielle. L'image à traiter est mappée sur un plan prédéfini pour obtenir des informations de position de chaque pixel de l'image à traiter sur le plan d'imagerie prédéfini, et les informations de position de chaque pixel de l'image à traiter sont utilisées dans le plan d'imagerie prédéfini dans le processus d'estimation de profondeur afin d'éliminer l'influence de paramètres de caméra sur une plage d'estimation de profondeur, de sorte que le même modèle de réseau peut estimer la profondeur de l'image à traiter correspondant à différents paramètres de caméra. Tout en assurant une large plage d'estimation de profondeur, l'invention permet d'économiser les ressources de calcul et l'espace de stockage.
PCT/KR2021/016579 2020-12-07 2021-11-12 Procédé d'estimation de profondeur, dispositif, équipement électronique et support de stockage lisible par ordinateur WO2022124607A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202011440325.3A CN114596349A (zh) 2020-12-07 2020-12-07 深度估计方法、装置、电子设备及计算机可读存储介质
CN202011440325.3 2020-12-07
KR10-2021-0153377 2021-11-09
KR1020210153377A KR20220080696A (ko) 2020-12-07 2021-11-09 깊이 추정 방법, 디바이스, 전자 장비 및 컴퓨터 판독가능 저장 매체

Publications (1)

Publication Number Publication Date
WO2022124607A1 true WO2022124607A1 (fr) 2022-06-16

Family

ID=81974669

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2021/016579 WO2022124607A1 (fr) 2020-12-07 2021-11-12 Procédé d'estimation de profondeur, dispositif, équipement électronique et support de stockage lisible par ordinateur

Country Status (1)

Country Link
WO (1) WO2022124607A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100360A (zh) * 2022-07-28 2022-09-23 中国电信股份有限公司 图像生成方法及装置、存储介质和电子设备
CN115375827A (zh) * 2022-07-21 2022-11-22 荣耀终端有限公司 光照估计方法及电子设备

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9124874B2 (en) * 2009-06-05 2015-09-01 Qualcomm Incorporated Encoding of three-dimensional conversion information with two-dimensional video sequence
KR102074929B1 (ko) * 2018-10-05 2020-02-07 동의대학교 산학협력단 깊이 영상을 통한 평면 검출 방법 및 장치 그리고 비일시적 컴퓨터 판독가능 기록매체
US20200098184A1 (en) * 2014-12-23 2020-03-26 Meta View, Inc. Apparatuses, methods and systems coupling visual accommodation and visual convergence to the same plane at any depth of an object of interest

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9124874B2 (en) * 2009-06-05 2015-09-01 Qualcomm Incorporated Encoding of three-dimensional conversion information with two-dimensional video sequence
US20200098184A1 (en) * 2014-12-23 2020-03-26 Meta View, Inc. Apparatuses, methods and systems coupling visual accommodation and visual convergence to the same plane at any depth of an object of interest
KR102074929B1 (ko) * 2018-10-05 2020-02-07 동의대학교 산학협력단 깊이 영상을 통한 평면 검출 방법 및 장치 그리고 비일시적 컴퓨터 판독가능 기록매체

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LU YAWEN; SARKIS MICHEL; LU GUOYU: "Multi-Task Learning for Single Image Depth Estimation and Segmentation Based on Unsupervised Network", 2020 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), IEEE, 31 May 2020 (2020-05-31), pages 10788 - 10794, XP033826063, DOI: 10.1109/ICRA40945.2020.9196723 *
LUKAS LIEBEL; MARCO KORNER: "MultiDepth: Single-Image Depth Estimation via Multi-Task Regression and Classification", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 25 July 2019 (2019-07-25), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081448933 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115375827A (zh) * 2022-07-21 2022-11-22 荣耀终端有限公司 光照估计方法及电子设备
CN115375827B (zh) * 2022-07-21 2023-09-15 荣耀终端有限公司 光照估计方法及电子设备
CN115100360A (zh) * 2022-07-28 2022-09-23 中国电信股份有限公司 图像生成方法及装置、存储介质和电子设备
CN115100360B (zh) * 2022-07-28 2023-12-01 中国电信股份有限公司 图像生成方法及装置、存储介质和电子设备

Similar Documents

Publication Publication Date Title
WO2020171550A1 (fr) Procédé et appareil de traitement d'image, dispositif électronique et support d'informations lisible par ordinateur
WO2017018612A1 (fr) Procédé et dispositif électronique pour stabiliser une vidéo
WO2017111302A1 (fr) Appareil et procédé de génération d'image à intervalles préréglés
WO2020050686A1 (fr) Dispositif et procédé de traitement d'images
WO2022124607A1 (fr) Procédé d'estimation de profondeur, dispositif, équipement électronique et support de stockage lisible par ordinateur
WO2017090837A1 (fr) Appareil de photographie numérique et son procédé de fonctionnement
WO2016209020A1 (fr) Appareil de traitement d'image et procédé de traitement d'image
WO2018044073A1 (fr) Procédé de diffusion en continu d'image et dispositif électronique pour prendre en charge celui-ci
WO2014014238A1 (fr) Système et procédé de fourniture d'une image
WO2017191978A1 (fr) Procédé, appareil et support d'enregistrement pour traiter une image
WO2019017641A1 (fr) Dispositif électronique, et procédé de compression d'image de dispositif électronique
WO2022010122A1 (fr) Procédé pour fournir une image et dispositif électronique acceptant celui-ci
WO2015126044A1 (fr) Procédé de traitement d'image et appareil électronique associé
WO2017090833A1 (fr) Dispositif de prise de vues, et procédé de commande associé
WO2022154387A1 (fr) Dispositif électronique et son procédé de fonctionnement
WO2022139262A1 (fr) Dispositif électronique pour l'édition vidéo par utilisation d'un objet d'intérêt, et son procédé de fonctionnement
EP3329665A1 (fr) Procédé d'imagerie d'objet mobile et dispositif d'imagerie
WO2020017936A1 (fr) Dispositif électronique et procédé de correction d'image sur la base d'un état de transmission d'image
EP4320472A1 (fr) Dispositif et procédé de mise au point automatique prédite sur un objet
WO2022031041A1 (fr) Réseau de données en périphérie permettant de fournir une image de caractères 3d à un terminal et son procédé de fonctionnement
WO2021060636A1 (fr) Dispositif de détection de mouvement et procédé
WO2023063679A1 (fr) Dispositif et procédé de mise au point automatique prédite sur un objet
WO2019107769A1 (fr) Dispositif électronique destiné à compresser sélectivement des données d'image conformément à une vitesse de lecture d'un capteur d'image, et son procédé de fonctionnement
WO2022225375A1 (fr) Procédé et dispositif de reconnaissance faciale basée sur des dnn multiples à l'aide de pipelines de traitement parallèle
EP3494706A1 (fr) Procédé de diffusion en continu d'image et dispositif électronique pour prendre en charge celui-ci

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21903654

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21903654

Country of ref document: EP

Kind code of ref document: A1