CN113628190B - Depth map denoising method and device, electronic equipment and medium - Google Patents

Depth map denoising method and device, electronic equipment and medium

Info

Publication number
CN113628190B
CN113628190B (application CN202110918788.4A)
Authority
CN
China
Prior art keywords
depth map
current
image
shooting scene
noisy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110918788.4A
Other languages
Chinese (zh)
Other versions
CN113628190A (en)
Inventor
王子豪
吴迪
陈永炜
邹龙坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cross Dimension Shenzhen Intelligent Digital Technology Co ltd
Original Assignee
Cross Dimension Shenzhen Intelligent Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cross Dimension Shenzhen Intelligent Digital Technology Co ltd filed Critical Cross Dimension Shenzhen Intelligent Digital Technology Co ltd
Priority to CN202110918788.4A priority Critical patent/CN113628190B/en
Publication of CN113628190A publication Critical patent/CN113628190A/en
Application granted granted Critical
Publication of CN113628190B publication Critical patent/CN113628190B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a depth map denoising method and device, an electronic device, and a medium. The method comprises the following steps: selecting a shooting scene as the current shooting scene and acquiring an image of the current shooting scene through a camera; constructing a current data pair corresponding to the current shooting scene based on the image, and repeating these operations until the data pairs corresponding to all shooting scenes have been constructed; constructing a data set from the data pairs corresponding to the shooting scenes, and training a depth map denoising model to be trained based on the data set and a loss function to obtain a trained depth map denoising model; and denoising the depth map to be processed using the trained depth map denoising model. By building a suitable network model through deep learning, the method and device can adaptively handle the various kinds of noise appearing in real depth maps while preserving the detail information of the target object as much as possible.

Description

Depth map denoising method and device, electronic equipment and medium
Technical Field
The embodiments of the application relate to the technical field of artificial intelligence, and in particular to a depth map denoising method and device, an electronic device, and a medium.
Background
A speckle depth camera typically comprises two components: a projector and a binocular camera. Because the stereo matching algorithm establishes correspondences by relying on features of the shooting scene, more complex scenes yield a higher matching success rate, while scenes with few features make it difficult for the matching algorithm to work properly. The projector is therefore used to project a speckle pattern that adds features to the scene, after which the binocular camera captures the scene.
Existing speckle three-dimensional cameras are affected by the light source of the shooting environment, the material of the shooting object, and the camera's stereo matching algorithm, so the finally output depth map suffers from many types of noise with large amplitude; the corresponding point cloud likewise contains many noise points and floating points, and this noise has an obvious influence on various three-dimensional tasks. It is therefore necessary to denoise the depth map.
However, existing depth map denoising methods can essentially only address specific types of noise in a targeted manner and lose certain detail information, so these denoising algorithms do not perform well on real data.
Disclosure of Invention
The application provides a depth map denoising method and device, an electronic device, and a medium which, by building a suitable network model through deep learning, can adaptively handle the various kinds of noise appearing in real depth maps while preserving the detail information of the target object as much as possible.
In a first aspect, an embodiment of the present application provides a depth map denoising method, where the method includes:
selecting a shooting scene as a current shooting scene, and acquiring an image of the current shooting scene through a camera;
constructing a current data pair corresponding to the current shooting scene based on the image; repeatedly executing the operations until the data pairs corresponding to all shooting scenes are constructed;
constructing a data set by using data pairs corresponding to all shooting scenes, and training a depth map denoising model to be trained based on the data set and a pre-designed loss function to obtain a trained depth map denoising model; wherein the dataset comprises: a depth map and a template map; the depth map comprises a noiseless depth map and a noisy depth map; the template map only comprises a noisy template map; the loss function comprises a two-dimensional depth map loss function and a three-dimensional point cloud loss function;
and denoising the depth map to be processed by using the trained depth map denoising model.
In a second aspect, an embodiment of the present application further provides a depth map denoising apparatus, where the apparatus includes: an acquisition module, a construction module, a training module and a denoising module; wherein,
the acquisition module is used for selecting a shooting scene as a current shooting scene and acquiring an image of the current shooting scene through a camera;
the construction module is used for constructing a current data pair corresponding to the current shooting scene based on the image; repeatedly executing the operations until the data pairs corresponding to all shooting scenes are constructed;
the training module is used for constructing a data set by using data pairs corresponding to all shooting scenes, and training a depth map denoising model to be trained based on the data set and a predesigned loss function to obtain a trained depth map denoising model; wherein the dataset comprises: a depth map and a template map; the depth map comprises a noiseless depth map and a noisy depth map; the template map only comprises a noisy template map; the loss function comprises a two-dimensional depth map loss function and a three-dimensional point cloud loss function;
and the denoising module is used for denoising the depth map to be processed by using the trained depth map denoising model.
In a third aspect, an embodiment of the present application provides an electronic device, including:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the depth map denoising method described in any embodiment of the present application.
In a fourth aspect, embodiments of the present application provide a storage medium having stored thereon a computer program that, when executed by a processor, implements a depth map denoising method as described in any embodiment of the present application.
The embodiment of the application provides a depth map denoising method, a depth map denoising device, electronic equipment and a depth map denoising medium, wherein a shooting scene is selected as a current shooting scene, and an image of the current shooting scene is acquired through a camera; then constructing a current data pair corresponding to the current shooting scene based on the image; repeatedly executing the operations until the data pairs corresponding to all shooting scenes are constructed; and then constructing a data set by using the data pairs corresponding to each shooting scene, and training a depth map denoising model to be trained based on the data set and a pre-designed loss function. That is, in the technical solution of the present application, a data pair corresponding to each shooting scene may be constructed, and a data set for training a depth map model may be further constructed, and the depth map denoising model to be trained may be trained using the data set, so that the depth map to be processed may be denoised using the trained depth map denoising model. In the prior art, only specific types of noise can be addressed in a targeted manner, and certain detailed information is lost. Therefore, compared with the prior art, the depth map denoising method, the device, the electronic equipment and the medium provided by the embodiment of the application can build a proper network model through deep learning, can adaptively solve various noises appearing in a real depth map, and can keep the detailed information of a target object as far as possible; in addition, the technical scheme of the embodiment of the application is simple and convenient to realize, convenient to popularize and wider in application range.
Drawings
Fig. 1 is a schematic flow chart of a depth map denoising method according to an embodiment of the present application;
fig. 2 is a second flow chart of a depth map denoising method according to an embodiment of the present application;
fig. 3 is a third flow chart of a depth map denoising method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a depth map denoising apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present application are shown in the drawings.
Example 1
Fig. 1 is a schematic flow chart of a depth map denoising method according to an embodiment of the present application, where the method may be performed by a depth map denoising apparatus or an electronic device, and the apparatus or the electronic device may be implemented by software and/or hardware, and the apparatus or the electronic device may be integrated in any intelligent device having a network communication function. As shown in fig. 1, the depth map denoising method may include the steps of:
S101, selecting a shooting scene as a current shooting scene, and acquiring an image of the current shooting scene through a camera.
In this step, the electronic device may first select a shooting scene as the current shooting scene, and acquire an image of the current shooting scene through a camera. Specifically, the electronic device may acquire an original left image and an original right image of the current shooting scene through the left camera and the right camera of the binocular camera, respectively; the original left image and the original right image may be denoted I_left and I_right, respectively.
S102, constructing a current data pair corresponding to a current shooting scene based on an image; and repeatedly executing the operations until the data pairs corresponding to the shooting scenes are constructed.
In this step, the electronic device may construct a current data pair corresponding to the current shooting scene based on the image, and repeat these operations until the data pairs corresponding to all shooting scenes have been constructed. Specifically, the electronic device may construct the current data pair corresponding to the current shooting scene based on the original left image and the original right image, and repeat these operations until the data pairs corresponding to all shooting scenes have been constructed. For example, the electronic device may obtain current noisy data corresponding to the current shooting scene based on the original left image and the original right image; obtain current noiseless data corresponding to the current shooting scene based on the original left image or the original right image; and then construct the current data pair using the current noisy data and the current noiseless data, where the current data pair can be expressed as {current noisy data, current noiseless data}. These operations are repeated until the data pairs corresponding to all shooting scenes have been constructed. For example, assume there are N shooting scenes: shooting scene 1, shooting scene 2, ..., shooting scene N, where N is a natural number greater than 1. Through the above steps, the data pair {noisy data 1, noiseless data 1} corresponding to shooting scene 1, the data pair {noisy data 2, noiseless data 2} corresponding to shooting scene 2, ..., and the data pair {noisy data N, noiseless data N} corresponding to shooting scene N can be constructed respectively.
S103, constructing a data set by using data pairs corresponding to all shooting scenes, and training a depth map denoising model to be trained based on the data set and a pre-designed loss function to obtain a trained depth map denoising model; wherein the data set comprises: a depth map and a template map; the depth map comprises a noiseless depth map and a noisy depth map; the template diagram only comprises a noisy template diagram; the loss functions include a two-dimensional depth map loss function and a three-dimensional point cloud loss function.
In this step, the electronic device may construct a data set using the data pairs corresponding to all shooting scenes, and train the depth map denoising model to be trained based on the data set and a pre-designed loss function; wherein the data set comprises a depth map and a template map; the depth map comprises a noiseless depth map and a noisy depth map; the template map only comprises a noisy template map; and the loss function comprises a two-dimensional depth map loss function and a three-dimensional point cloud loss function. Specifically, the electronic device may combine the data pairs corresponding to each shooting scene to construct a set containing all data pairs, which serves as the data set of the embodiments of the present application, and then train the depth map denoising model to be trained based on the data set. Specifically, when training the depth map denoising model to be trained based on the data set, the electronic device may select one data pair from the data set as the current data pair and, when the depth map denoising model to be trained does not yet meet a preset convergence condition, use the current data pair to train the model; the operation of extracting the current data pair is repeated until the depth map denoising model to be trained meets the convergence condition.
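As an illustrative, non-limiting sketch of this training loop (the names `DenoiseNet` is not used here; `model`, `sample`, the optimizer choice and the convergence tolerance are all assumptions introduced for clarity):

```python
import random
import torch

def train_denoiser(model, dataset, loss_fn, max_steps=100000, tol=1e-4, lr=1e-4):
    """Train the depth-map denoising model by repeatedly sampling data pairs
    from the data set until a simple convergence criterion is met."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    prev_loss = float("inf")
    for step in range(max_steps):
        pair = random.choice(dataset)            # current data pair: {noisy, noiseless}
        pred = model(pair["depth_raw"])          # denoised depth map
        loss = loss_fn(pred, pair)               # 2D + 3D loss (see below)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # crude convergence condition: change in loss below tolerance
        if abs(prev_loss - loss.item()) < tol:
            break
        prev_loss = loss.item()
    return model
```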
S104, denoising the depth map to be processed by using the trained depth map denoising model.
Specifically, the electronic device may input the depth map to be processed into a trained depth map denoising model, and the denoised depth map may be obtained through the model.
The data set used in this application mainly contains two items: the depth map, which comprises a noiseless depth map and a noisy depth map, and the template map, which only comprises a noisy template map. The depth map is a single-channel two-dimensional pixel matrix in which each pixel stores a distance value, with m or mm as common units; the value range may be 0 to positive infinity or a manually specified distance range (for example, 0 to 2 m in our setup). The template map is also a single-channel two-dimensional pixel matrix whose pixels correspond one-to-one with the pixels of the depth map; each pixel stores a gray value of only 0 or 255 (i.e. black or white), indicating whether the point is ignored (a point is ignored when its disparity value is less than 0, i.e. it has no meaningful depth value, or when its depth value exceeds the specified depth range). In addition, the camera intrinsic parameters and camera pose corresponding to each set of depth maps are required.
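A minimal sketch of what one sample of such a data set might hold (the field names and the dataclass layout are assumptions; shapes and units follow the description above):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DepthDataPair:
    depth_raw: np.ndarray    # noisy depth map, H x W, float, e.g. 0-2 m
    depth_clean: np.ndarray  # noiseless depth map, H x W, float
    mask_raw: np.ndarray     # noisy template map, H x W, uint8, values 0 or 255
    K: np.ndarray            # 3 x 3 camera intrinsic matrix
    R_w2c: np.ndarray        # 3 x 3 rotation, world -> camera
    T_w2c: np.ndarray        # 3 x 1 translation, world -> camera
```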
Camera intrinsic parameters: the intrinsic matrix K = [f_x, 0, u_0; 0, f_y, v_0; 0, 0, 1] provides the optical center position and focal length of the shooting camera. f_x and f_y denote the focal lengths of the camera in the x direction and y direction of the picture coordinate system, in pixels. Specifically, assuming the focal length of the camera is f millimeters, f_x means that, on a picture coordinate system located f millimeters from the optical center, f millimeters in the x direction corresponds to f_x pixels. u_0 and v_0 give the specific position of the camera's optical center in the x direction and y direction of the picture coordinate system, again in pixels.
Camera pose: camera pose in our application refers to the conversion of the world coordinate system to the camera coordinate system, i.e. the rotation vectorR w2c Translation vector T w2c (unlike the external reference mentioned in the binocular calibration, the translation vector and the rotation vector in the binocular calibration represent the conversion relationship between the two camera coordinate systems). Through the camera gesture, we can convert the three-dimensional point cloud shot by each camera into the same world coordinate system.
Design of the loss function: in order for the network to better realize the depth map denoising function, the loss function is composed of two parts, a two-dimensional depth map loss function L_depth and a three-dimensional point cloud loss function L_pointcloud. The two-dimensional depth map loss function L_depth mainly uses the depth map depth_predict output by the network after denoising, the originally provided noise-free depth map depth_clean, and the noisy depth template mask_raw, and computes the difference between the two depth maps as the loss value for back-propagation. The three-dimensional point cloud loss function L_pointcloud converts the network-denoised depth map and the original noiseless depth map into three-dimensional point clouds using the camera intrinsics, camera pose and noisy depth template provided by the data set, transforms them into the same coordinate system, and computes the difference between the two point clouds as the loss value for back-propagation.
Two-dimensional depth map loss function L_depth: the given mask_raw is normalized to [0, 1], i.e. mask_raw then contains only the two values 0 and 1. In the concrete operation, only the part where mask_raw equals 1 participates in the loss computation, namely:
L_depth = || mask_raw · (depth_predict - depth_clean) ||_2
Three-dimensional point cloud loss function L_pointcloud: using the provided templates, camera intrinsics and camera poses, the depth maps are converted into the corresponding point clouds and moved into a unified world coordinate system; the corresponding point clouds are denoted pointcloud_predict-c and pointcloud_clean-c. The Chamfer Distance (CD) is used as the three-dimensional point cloud loss function, namely:
L_pointcloud = CD(pointcloud_predict-c, pointcloud_clean-c)
Chamfer Distance: for two point clouds S1 and S2, the bidirectional Chamfer distance is defined as
CD(S1, S2) = (1/|S1|) Σ_{x∈S1} min_{y∈S2} ||x - y||² + (1/|S2|) Σ_{y∈S2} min_{x∈S1} ||x - y||²
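A hedged PyTorch sketch of the two loss terms above (the Chamfer distance is written with dense pairwise distances for clarity; the helper names, the point-cloud conversion callback and the weighting between the two terms are assumptions):

```python
import torch

def depth_loss(depth_predict, depth_clean, mask_raw):
    """L_depth = || mask_raw * (depth_predict - depth_clean) ||_2, with mask in {0, 1}."""
    return torch.norm(mask_raw * (depth_predict - depth_clean), p=2)

def chamfer_distance(p, q):
    """Bidirectional Chamfer distance between point clouds p (N x 3) and q (M x 3)."""
    d = torch.cdist(p, q)                          # N x M pairwise Euclidean distances
    return d.min(dim=1).values.pow(2).mean() + d.min(dim=0).values.pow(2).mean()

def total_loss(depth_predict, pair, to_pointcloud, w_2d=1.0, w_3d=1.0):
    """Combine the 2D depth-map loss and the 3D point-cloud loss."""
    l_depth = depth_loss(depth_predict, pair["depth_clean"], pair["mask_raw"] / 255.0)
    pc_pred = to_pointcloud(depth_predict, pair)    # converted with K, pose, mask_raw
    pc_clean = to_pointcloud(pair["depth_clean"], pair)
    l_pc = chamfer_distance(pc_pred, pc_clean)
    return w_2d * l_depth + w_3d * l_pc
```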
Regarding the design of the network architecture: for depth map denoising, a multi-size convolution kernel strategy is introduced into the network. The input features are processed by a network built from several convolution kernels of different sizes; since different kernel sizes correspond to different receptive fields, local features and global features can be integrated and exploited organically. The concrete network structure can be a pyramid structure or a U-shaped structure.
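A hedged sketch of the multi-size convolution kernel idea (the particular kernel sizes and channel counts are assumptions; the actual network may arrange such blocks in a pyramid or U-shaped structure):

```python
import torch
import torch.nn as nn

class MultiKernelBlock(nn.Module):
    """Process the same input with several kernel sizes (different receptive
    fields) and fuse the resulting local and global features."""
    def __init__(self, in_ch=32, out_ch=32, sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in sizes]
        )
        self.fuse = nn.Conv2d(out_ch * len(sizes), out_ch, 1)

    def forward(self, x):
        feats = [torch.relu(branch(x)) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))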
According to the depth map denoising method, a shooting scene is selected as a current shooting scene, and an image of the current shooting scene is obtained through a camera; then constructing a current data pair corresponding to the current shooting scene based on the image; repeatedly executing the operations until the data pairs corresponding to all shooting scenes are constructed; and then constructing a data set by using the data pairs corresponding to each shooting scene, and training a depth map denoising model to be trained based on the data set and a pre-designed loss function. That is, in the technical solution of the present application, a data pair corresponding to each shooting scene may be constructed, and a data set for training a depth map model may be further constructed, and the depth map denoising model to be trained may be trained using the data set, so that the depth map to be processed may be denoised using the trained depth map denoising model. In the prior art, only specific types of noise can be addressed in a targeted manner, and certain detailed information is lost. Therefore, compared with the prior art, the depth map denoising method provided by the embodiment of the application can build a proper network model through deep learning, can adaptively solve various noises appearing in a real depth map, and can keep the detailed information of a target object as much as possible; in addition, the technical scheme of the embodiment of the application is simple and convenient to realize, convenient to popularize and wider in application range.
Example two
Fig. 2 is a second flow chart of a depth map denoising method according to an embodiment of the present application. Further optimization and expansion based on the above technical solution can be combined with the above various alternative embodiments. As shown in fig. 2, the depth map denoising method may include the steps of:
s201, selecting a shooting scene as a current shooting scene, and acquiring an image of the current shooting scene through a camera.
S202, obtaining current noisy data corresponding to a current shooting scene based on the image; and obtaining current noiseless data corresponding to the current shooting scene based on the image.
In the step, the electronic equipment can obtain current noisy data corresponding to a current shooting scene based on the image; and obtaining current noiseless data corresponding to the current shooting scene based on the image. Specifically, the electronic device may obtain current noisy data corresponding to the current shooting scene based on the original left image and the original right image; and obtaining current noiseless data corresponding to the current shooting scene based on the original left image or the original right image. For example, when the electronic device acquires the current noisy data, the electronic device may first determine a left image point and a right image point of the current shooting scene in the original left image and the original right image respectively; then, carrying out image matching on the left image point and the right image point based on a predetermined object point of the current shooting scene to obtain a matching result of the left image point and the right image point; and obtaining the current noisy data based on the matching result of the left image point and the right image point.
In addition, when the electronic device acquires the current noiseless data, it may determine the image point of the current shooting scene in the image; then, through the virtual engine, calculate the true distance of the light ray emitted from the optical center of the camera, passing through the image point, to the predetermined object point of the current shooting scene, together with the included angle between the light ray and the image plane; and obtain the current noiseless data based on the true distance of the light ray and the included angle between the light ray and the image plane. For example, the electronic device may first determine a left image point of the current shooting scene in the original left image, or determine a right image point of the current shooting scene in the original right image; then, through the virtual engine, calculate the true distance of the left light ray emitted from the optical center of the left camera through the left image point to the predetermined object point of the current shooting scene and the included angle between the left light ray and the original left image plane, or calculate the true distance of the right light ray emitted from the optical center of the right camera through the right image point to the object point and the included angle between the right light ray and the original right image plane; and obtain the current noiseless data based on the true distance of the left ray and its included angle with the original left image plane, or based on the true distance of the right ray and its included angle with the original right image plane. Specifically, the electronic device may calculate the true depth of the original left image or the original right image using the formula Z_clean = L · sin θ. A noiseless depth map can thus be obtained; the noiseless depth map is then converted into a noiseless disparity map and a noiseless occlusion template by combining the camera intrinsics with the predetermined depth formula, and the noiseless point cloud is further obtained.
S203, constructing a current data pair by using the current noisy data and the current noiseless data; and repeatedly executing the operations until the data pairs corresponding to the shooting scenes are constructed.
S204, constructing a data set by using data pairs corresponding to all shooting scenes, and training a depth map denoising model to be trained based on the data set and a pre-designed loss function to obtain a trained depth map denoising model; wherein the data set comprises: a depth map and a template map; the depth map comprises a noiseless depth map and a noisy depth map; the template diagram only comprises a noisy template diagram; the loss functions include a two-dimensional depth map loss function and a three-dimensional point cloud loss function.
In a specific embodiment of the present application, the training process of the depth map denoising model may include three stages, namely: a noise estimation stage, a multi-size denoising stage and a feature fusion stage. For a given noisy depth map depth_raw, the final denoised depth map depth_denoise is obtained after passing through these three stages. It should be noted that, for the second stage, the output of the first stage is concatenated with the original input as the input of the second stage; similarly, for the third stage, the output of the second stage is concatenated with the input of the second stage as the input of the third stage. It should also be noted that the depth map denoising model in the embodiments of the present application may use other deep-learning-based network models, and the present application does not limit the structure of the depth map denoising model.
Specifically, during the noise estimation stage, the electronic device may select five plain convolutional layers, without pooling and batch normalization operations, as the feature extractor of the depth map denoising model, each followed by a ReLU activation function. Except for the final layer, the output dimension of every convolution layer is 32, and the convolution kernels are 3×3 in size. The output dimension of the last layer is typically 1 or 3. Embodiments of the present application may add an attention mechanism module before the last convolutional layer. The attention module produces an attention weight, expressed as μ = [μ_1, μ_2, ..., μ_C] ∈ R^(1×1×C), which is used to re-weight the input feature map U ∈ R^(H×W×C). During this adjustment, a global average pooling (GAP) layer first condenses the global information of U into v ∈ R^(1×1×C), which is then passed through two fully connected (FC) layers. The whole process is summarized as follows:
μ = Sigmoid(FC_2(ReLU(FC_1(GAP(U)))))
The final output of the attention module is U' = μ ⊗ U, where U' ∈ R^(H×W×C) and ⊗ denotes channel-wise (dimension-wise) multiplication.
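A hedged PyTorch sketch of this channel-attention step (GAP, two fully connected layers, Sigmoid, then channel-wise re-weighting; the reduction ratio is an assumption):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """mu = Sigmoid(FC2(ReLU(FC1(GAP(U)))));  U' = mu (x) U, channel-wise."""
    def __init__(self, channels=32, reduction=4):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                  # global average pooling
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, u):                                   # u: B x C x H x W
        b, c, _, _ = u.shape
        v = self.gap(u).view(b, c)                          # B x C
        mu = torch.sigmoid(self.fc2(torch.relu(self.fc1(v))))
        return u * mu.view(b, c, 1, 1)                      # dimension-wise multiplication
```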
Specifically, in the multi-size denoising stage, the electronic device may use a five-layer pyramid structure to extract features of different sizes. Through the pyramid structure, the input feature maps are downsampled to different sizes, so that the original information, local information and global information can be extracted simultaneously through receptive fields of different sizes. The pooling kernels are set to 1×1, 2×2, 4×4, 8×8 and 16×16, and each pooled feature is followed by a U-Net for further feature extraction; finally, the features of different sizes are upsampled to the same size by bilinear interpolation and then concatenated together.
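A hedged sketch of this multi-size (pyramid pooling) stage; to stay compact, the per-branch U-Net is abbreviated to a single convolution here, which is an assumption rather than the actual design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidDenoise(nn.Module):
    """Pool the feature map with kernels 1x1 ... 16x16, refine each branch,
    upsample back with bilinear interpolation and concatenate."""
    def __init__(self, channels=32, pool_sizes=(1, 2, 4, 8, 16)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in pool_sizes]  # stand-in for U-Net
        )

    def forward(self, x):
        h, w = x.shape[2:]
        outs = []
        for p, branch in zip(self.pool_sizes, self.branches):
            y = F.avg_pool2d(x, kernel_size=p) if p > 1 else x
            y = torch.relu(branch(y))
            outs.append(F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False))
        return torch.cat(outs, dim=1)
```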
Specifically, in the feature fusion stage, for the feature map U ∈ R^(H×W×C) input to the third stage, convolution kernels of size 3×3, 5×5 and 7×7 are applied to obtain three new feature maps U' ∈ R^(H×W×C), U'' ∈ R^(H×W×C) and U''' ∈ R^(H×W×C), which are added element by element to obtain Û, i.e. Û = U' + U'' + U'''. Then, using the same operation as in the attention mechanism, Û is compressed through one GAP layer and two FC layers, with the Sigmoid of the last layer removed. The three outputs of the second FC layer are α' ∈ R^(1×1×C), β' ∈ R^(1×1×C) and γ' ∈ R^(1×1×C), which are processed with a Softmax:
α_c = e^(α'_c) / (e^(α'_c) + e^(β'_c) + e^(γ'_c)), and β_c and γ_c are obtained analogously.
Note that α_c is the c-th channel element of α, and β_c and γ_c are defined in the same way. The final output is therefore V ∈ R^(H×W×C), V = [V_1, V_2, ..., V_C], where V_c = α_c · U' + β_c · U'' + γ_c · U'''. Finally, the fused feature map is reconstructed into a depth map by a 1×1 convolution layer.
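A hedged sketch of this feature fusion stage (three kernel sizes, per-channel Softmax selection weights, weighted sum, then 1×1 reconstruction; the reduction ratio and channel counts are assumptions):

```python
import torch
import torch.nn as nn

class FusionStage(nn.Module):
    """U', U'', U''' from 3x3 / 5x5 / 7x7 convs; per-channel softmax weights
    alpha, beta, gamma; V_c = alpha_c*U' + beta_c*U'' + gamma_c*U'''; 1x1 conv to depth."""
    def __init__(self, channels=32, reduction=4):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, k, padding=k // 2) for k in (3, 5, 7)]
        )
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels * 3)   # alpha', beta', gamma'
        self.out = nn.Conv2d(channels, 1, 1)                        # reconstruct the depth map

    def forward(self, u):
        b, c, _, _ = u.shape
        branches = [conv(u) for conv in self.convs]                 # U', U'', U'''
        s = torch.stack(branches, dim=1).sum(dim=1)                 # element-wise sum
        z = torch.relu(self.fc1(self.gap(s).view(b, c)))
        w = torch.softmax(self.fc2(z).view(b, 3, c), dim=1)         # softmax over the 3 branches
        fused = sum(w[:, i].view(b, c, 1, 1) * branches[i] for i in range(3))
        return self.out(fused)
```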
S205, denoising the depth map to be processed by using the trained depth map denoising model.
According to the depth map denoising method, a shooting scene is selected as a current shooting scene, and an image of the current shooting scene is obtained through a camera; then constructing a current data pair corresponding to the current shooting scene based on the image; repeatedly executing the operations until the data pairs corresponding to all shooting scenes are constructed; and then constructing a data set by using the data pairs corresponding to each shooting scene, and training a depth map denoising model to be trained based on the data set and a pre-designed loss function. That is, in the technical solution of the present application, a data pair corresponding to each shooting scene may be constructed, and a data set for training a depth map model may be further constructed, and the depth map denoising model to be trained may be trained using the data set, so that the depth map to be processed may be denoised using the trained depth map denoising model. In the prior art, only specific types of noise can be addressed in a targeted manner, and certain detailed information is lost. Therefore, compared with the prior art, the depth map denoising method provided by the embodiment of the application can build a proper network model through deep learning, can adaptively solve various noises appearing in a real depth map, and can keep the detailed information of a target object as much as possible; in addition, the technical scheme of the embodiment of the application is simple and convenient to realize, convenient to popularize and wider in application range.
Example III
Fig. 3 is a third flow chart of a depth map denoising method according to an embodiment of the present application. Further optimization and expansion based on the above technical solution can be combined with the above various alternative embodiments. As shown in fig. 3, the depth map denoising method may include the steps of:
s301, selecting a shooting scene as a current shooting scene, and acquiring an image of the current shooting scene through a camera.
In this step, the electronic device may select a shooting scene as the current shooting scene, and acquire an image of the current shooting scene through a camera. Specifically, the electronic device may acquire an original left image and an original right image of the current shooting scene through the left camera and the right camera of the binocular camera, respectively. Illustratively, the electronic device may build a virtual speckle depth camera consisting of a left camera camera_left and a right camera camera_right; the resolution and intrinsics K of the binocular camera are set according to the specific requirements, the pose relationship of the binocular camera is directly adjusted to the baseline-corrected state, the baseline length B of the binocular camera is set according to the specific requirements, and the binocular camera always points at the shooting scene.
S302, determining an image point of the current shooting scene in the image.
In this step, the electronic device may determine an image point of the current shooting scene in the image. Specifically, the electronic device may determine a left image point and a right image point of the current shooting scene in the original left image and the original right image, respectively. Illustratively, for camera related applications, there are three coordinate systems, respectively: world coordinate system, camera coordinate system and picture coordinate system; the world coordinate system refers to a three-dimensional coordinate system where all objects in the world are located, and the origin of coordinates of the world coordinate system can be any point in a three-dimensional space; the camera coordinate system refers to a three-dimensional coordinate system taking a camera optical center as a coordinate origin; the picture coordinate system is the two-dimensional coordinate system where the picture is located. In the world coordinate system, a three-dimensional point is taken as an object point, a plane formed by connecting the object point with optical centers of the left and right cameras is called a polar plane, an intersection line of the polar plane and picture planes of the left and right cameras is called a polar line, a connection line of the two optical centers is called a base line, an intersection point of the base line and the picture plane is called a pole, and an intersection point of the optical center and the object point connecting line at the picture plane is called an image point. The camera internal parameters obtained by camera calibration and external parameters between the cameras can enable the picture planes of the two cameras to be parallel to each other, the heights of image points corresponding to the left and right cameras of the same object point are kept consistent, the poles are at infinity, the transformation only comprises rotation, and the process is called baseline correction. Through baseline correction, in the subsequent stereo matching process, only the matching points of the picture planes of the left camera and the right camera are required to be searched on the same line of the picture.
And S303, carrying out image matching on the image points based on the predetermined object points of the current shooting scene to obtain a matching result of the image points.
In this step, the electronic device may perform image matching on the image points based on the predetermined object point of the current shooting scene to obtain a matching result of the image points. Specifically, the electronic device may perform image matching on the left image point and the right image point based on the predetermined object point of the current shooting scene, to obtain a matching result of the left image point and the right image point. In the stereo matching process, the images taken by the left and right cameras may be denoted I_left and I_right, the object point is called M, and the corresponding image points on the left and right image planes are called P_left and P_right. From baseline correction it is known that P_left and P_right already lie in the same row of the left and right images. By building an image matching algorithm, the correspondence between the object point M, P_left and P_right can be confirmed, and the corresponding disparity map can thus be obtained. The definition is as follows: the coordinate value of P_left in the x direction is called x_left, and the coordinate value of P_right in the x direction is called x_right; then, in the picture coordinate system of the left camera, the disparity d corresponding to this point is calculated as d = x_left - x_right.
In addition, stereo matching yields the corresponding disparity map, which can be converted into the corresponding depth map using the intrinsic and extrinsic parameters obtained from camera calibration. The specific calculation is as follows: for any point in the disparity map with corresponding disparity d, the depth Z of that point is Z = f · B / d, where f is the focal length in the intrinsics and B is the baseline length, computed as the modulus of the translation vector T in the inter-camera extrinsics. The corresponding depth map is thus obtained, from which the corresponding point cloud can be obtained; for the point cloud, each two-dimensional pixel point in the image is converted to a three-dimensional point in the camera coordinate system.
S304, obtaining current noisy data based on a matching result of the image points.
In this step, the electronic device may obtain the current noisy data based on the matching result of the image points. Specifically, the electronic device may obtain the current noisy data based on the matching result of the left image point and the right image point. In one embodiment, since the binocular camera is already in the baseline-corrected state, a binocular matching algorithm can be built directly on the obtained original images I_left and I_right; according to the matching result, the noisy disparity map disparity_raw of the left camera in the picture coordinate system is obtained using d = x_left - x_right, and disparity_raw is then converted into the noisy depth map depth_raw and the noisy point cloud pointcloud_raw using the camera intrinsics and the formula Z = f · B / d. In addition, for the noisy object occlusion template mask_raw, the parts of disparity_raw whose disparity value is less than or equal to 0 are set to pixel value 0 on the template, and the parts whose disparity value is greater than 0 are set to pixel value 255.
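A hedged NumPy sketch of turning the matched disparity map into the noisy depth map, point cloud and occlusion template as described above (function and variable names are assumptions; the conversion uses Z = f · B / d):

```python
import numpy as np

def build_noisy_data(disparity_raw, K, baseline_B):
    """Convert a noisy disparity map into depth_raw, pointcloud_raw and mask_raw."""
    f = K[0, 0]                                             # focal length in pixels
    valid = disparity_raw > 0
    depth_raw = np.zeros_like(disparity_raw, dtype=np.float64)
    depth_raw[valid] = f * baseline_B / disparity_raw[valid]    # Z = f * B / d
    mask_raw = np.where(valid, 255, 0).astype(np.uint8)         # occlusion template
    # back-project valid pixels to a point cloud in the left-camera frame
    v, u = np.nonzero(valid)
    z = depth_raw[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pointcloud_raw = np.stack([x, y, z], axis=1)
    return depth_raw, pointcloud_raw, mask_raw
```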
S305, calculating the real distance of light rays emitted from the optical center of the camera to reach the object point of the predetermined current shooting scene through the image point and the included angle between the light rays and the image plane through the virtual engine.
S306, obtaining current noiseless data based on the real distance of the light and the included angle between the light and the image plane.
In a specific embodiment of the present application, for any pixel point in the picture coordinate system of the left camera, the electronic device may use the virtual rendering engine to calculate the true distance L of the light ray emitted from the optical center, passing through that pixel point and finally hitting the scene, and then use the included angle θ between the light ray and the picture plane to calculate the true depth, i.e. Z_clean = L · sin θ. The noiseless depth map depth_clean can thus be obtained, and is then converted into the noiseless disparity map disparity_clean and the noiseless point cloud pointcloud_clean using the camera intrinsics and the formula Z = f · B / d.
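A hedged sketch of the noiseless side: the per-pixel ray length L and ray/image-plane angle θ queried from the virtual engine give Z_clean = L · sin θ, which is then converted back into a disparity map (names, the clipping to a maximum depth, and the array-based interface are assumptions):

```python
import numpy as np

def build_clean_data(ray_length_L, ray_angle_theta, K, baseline_B, max_depth=2.0):
    """Compute the noiseless depth map from virtual-engine ray queries and
    convert it into a noiseless disparity map, clipped to the specified depth range."""
    depth_clean = ray_length_L * np.sin(ray_angle_theta)        # Z_clean = L * sin(theta)
    depth_clean = np.where(depth_clean <= max_depth, depth_clean, 0.0)
    f = K[0, 0]
    disparity_clean = np.zeros_like(depth_clean)
    valid = depth_clean > 0
    disparity_clean[valid] = f * baseline_B / depth_clean[valid]    # d = f * B / Z
    return depth_clean, disparity_clean
```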
S307, constructing a current data pair by using the current noisy data and the current noiseless data; and repeatedly executing the operations until the data pairs corresponding to the shooting scenes are constructed.
S308, constructing a data set by using data pairs corresponding to all shooting scenes, and training a depth map denoising model to be trained based on the data set and a pre-designed loss function to obtain a trained depth map denoising model; wherein the data set comprises: a depth map and a template map; the depth map comprises a noiseless depth map and a noisy depth map; the template diagram only comprises a noisy template diagram; the loss functions include a two-dimensional depth map loss function and a three-dimensional point cloud loss function.
S309, denoising the depth map to be processed by using the trained depth map denoising model.
It should be noted that, at present, the mainstream speckle three-dimensional camera is easily influenced by the material of the shooting object, the illumination condition of the shooting environment and the stereo matching algorithm of the camera, so that the problems of abundant noise types and large noise amplitude of the output depth map exist, and the phenomena of large noise points and floating points of the corresponding three-dimensional point cloud can also appear, so that the three-dimensional visual task can be obviously influenced. Thus, denoising is a very necessary preprocessing operation in various three-dimensional visual tasks. At present, a great deal of work is dedicated to three-dimensional data denoising, including depth map denoising and point cloud denoising, but in the traditional denoising method, a method is generally adopted to only process a certain specific kind of noise, and corresponding parameter adjustment is required to be carried out according to specific situations in the application process. However, for real data, the noise types are rich, the noise amplitude is large, the whole noise problem is difficult to solve by a single traditional denoising method, the traditional denoising method is quite time-consuming to adjust parameters, and meanwhile, the denoising operation can lose a part of data details, which are all problems existing in the traditional three-dimensional data denoising method at present when the real data noise is solved. Recently, there has also been related work directed to dealing with three-dimensional data denoising problems using deep learning. Taking depth map denoising as an example, researchers want to use reasonable network frame design to enable a network to obtain the capability of simultaneously processing various types of noise and keeping data details by learning the relevant properties of the data set about the depth map noise. However, these methods for denoising depth maps based on deep learning do not have good effects at present, mainly because the construction mode of the data set is not reasonable enough and the network design is not proper enough.
For the deep learning based depth map denoising method, how to construct accurate noisy data and noiseless data is very important, and only if a noisy and noiseless data pair is correctly constructed, a network can learn the relevant property of the real noise of the depth map, so that the task of depth map denoising can be completed. At present, only the method for constructing the data set can construct noisy and noiseless data pairs really close to the real situation, which is a standard for ensuring that deep learning can be used for denoising the depth map. The depth map denoising model provided by the embodiment of the application is different from the network design of the existing depth map denoising method based on the deep learning, and more references the network design of image denoising. However, image denoising is very different from depth map denoising because the color of the point is stored in each pixel of a general image, the numerical distribution of the point is between 0 and 255, the distance is stored in each pixel of the general image, the numerical distribution of the depth map is generally between 0 and infinity, and the depth map has the characteristic properties that depth mutation easily occurs, the depth value is more accurate at the position closer to a camera, and the like. According to the depth map denoising model, an existing network design for image denoising based on deep learning is used as a reference, the depth map denoising model is transformed into a network structure suitable for depth map denoising, input and output of a network are changed into a depth map, and more modules suitable for depth map denoising are added.
According to the depth map denoising method, a shooting scene is selected as a current shooting scene, and an image of the current shooting scene is obtained through a camera; then constructing a current data pair corresponding to the current shooting scene based on the image; repeatedly executing the operations until the data pairs corresponding to all shooting scenes are constructed; and then constructing a data set by using the data pairs corresponding to each shooting scene, and training a depth map denoising model to be trained based on the data set and a pre-designed loss function. That is, in the technical solution of the present application, a data pair corresponding to each shooting scene may be constructed, and a data set for training a depth map model may be further constructed, and the depth map denoising model to be trained may be trained using the data set, so that the depth map to be processed may be denoised using the trained depth map denoising model. In the prior art, only specific types of noise can be addressed in a targeted manner, and certain detailed information is lost. Therefore, compared with the prior art, the depth map denoising method provided by the embodiment of the application can build a proper network model through deep learning, can adaptively solve various noises appearing in a real depth map, and can keep the detailed information of a target object as much as possible; in addition, the technical scheme of the embodiment of the application is simple and convenient to realize, convenient to popularize and wider in application range.
Example IV
Fig. 4 is a block diagram of a depth map denoising apparatus according to a fourth embodiment of the present application. As shown in fig. 4, the depth map denoising apparatus includes: an acquisition module 401, a construction module 402, a training module 403 and a denoising module 404; wherein,
the acquiring module 401 is configured to select a shooting scene as a current shooting scene, and acquire an image of the current shooting scene through a camera;
the construction module 402 is configured to construct a current data pair corresponding to the current shooting scene based on the image; repeatedly executing the operations until the data pairs corresponding to all shooting scenes are constructed;
the training module 403 is configured to construct a data set by using data pairs corresponding to each shooting scene, and train a depth map denoising model to be trained based on the data set and a pre-designed loss function, so as to obtain a trained depth map denoising model; wherein the dataset comprises: a depth map and a template map; the depth map comprises a noiseless depth map and a noisy depth map; the template map only comprises a noisy template map; the loss function comprises a two-dimensional depth map loss function and a three-dimensional point cloud loss function;
the denoising module 404 is configured to denoise a depth map to be processed using the trained depth map denoising model.
Further, the construction module 402 is specifically configured to obtain current noisy data corresponding to the current shooting scene based on the image; obtaining current noiseless data corresponding to the current shooting scene based on the image; and constructing the current data pair by using the current noisy data and the current noiseless data.
Further, the constructing module 402 is specifically configured to determine an image point of the current shooting scene in the image; performing image matching on the image point based on the predetermined object point of the current shooting scene to obtain a matching result of the image point; and obtaining the current noisy data based on the matching result of the image points.
Further, the construction module 402 is specifically configured to obtain a noisy disparity map corresponding to the current shooting scene based on a matching result of the image points; obtaining a noisy depth map and a noisy object shielding template corresponding to the current shooting scene based on the noisy disparity map corresponding to the current shooting scene; and taking the noisy depth map and the noisy object shielding template as the current noisy data.
Further, the constructing module 402 is specifically configured to determine an image point of the current shooting scene in the image; calculating the real distance of light rays emitted from the optical center of the camera to reach the predetermined object point of the current shooting scene through the image point and the included angle between the light rays and the image plane through a virtual engine; and obtaining the current noiseless data based on the real distance of the light and the included angle between the light and the image plane.
The depth map denoising device can execute the method provided by any embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. Technical details which are not described in detail in this embodiment can be referred to the depth map denoising method provided in any embodiment of the present application.
Example five
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Fig. 5 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present application. The electronic device 12 shown in fig. 5 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present application.
As shown in fig. 5, the electronic device 12 is in the form of a general purpose computing device. Components of the electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Electronic device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard disk drive"). Although not shown in fig. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the present application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods in the embodiments described herein.
The electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the electronic device 12, and/or any devices (e.g., network card, modem, etc.) that enable the electronic device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Also, the electronic device 12 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 20. As shown, the network adapter 20 communicates with other modules of the electronic device 12 over the bus 18. It should be appreciated that although not shown in fig. 5, other hardware and/or software modules may be used in connection with electronic device 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing the depth map denoising method provided in the embodiments of the present application.
Example six
A sixth embodiment of the present application provides a computer storage medium.
The computer readable storage medium of the embodiments herein may employ any combination of one or more computer readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including an object-oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only preferred embodiments of the present application and the technical principles applied. Those skilled in the art will appreciate that the present application is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made without departing from the scope of protection of the present application. Therefore, although the present application has been described in detail through the above embodiments, it is not limited to the above embodiments and may include other equivalent embodiments without departing from its concept; the scope of the present application is defined by the appended claims.

Claims (12)

1. A depth map denoising method, the method comprising:
selecting a shooting scene as a current shooting scene, and acquiring an image of the current shooting scene through a camera;
constructing a current data pair corresponding to the current shooting scene based on the image; repeatedly executing the operations until the data pairs corresponding to all shooting scenes are constructed;
constructing a data set by using data pairs corresponding to all shooting scenes, and training a depth map denoising model to be trained based on the data set and a pre-designed loss function to obtain a trained depth map denoising model; wherein the dataset comprises: a depth map and a template map; the depth map comprises a noiseless depth map and a noisy depth map; the template map only comprises a noisy template map; the loss function comprises a two-dimensional depth map loss function and a three-dimensional point cloud loss function;
the two-dimensional depth map loss function uses the denoised depth map output by the depth map denoising model, the noiseless depth map and the noisy template map, and calculates the difference between the two depth maps as a loss value for back-propagation;
the three-dimensional point cloud loss function converts the denoised depth map output by the depth map denoising model and the noiseless depth map into three-dimensional point clouds by utilizing the intrinsic parameters of the camera, the camera pose and the noisy template map provided by the data set, transforms the two point clouds into the same coordinate system, and calculates the difference between the two point clouds as a loss value for back-propagation;
and denoising the depth map to be processed by using the trained depth map denoising model.
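Claim 1 leaves the concrete form of the two loss terms open. Purely as an illustration, the following Python (PyTorch) sketch shows one way a combined two-dimensional depth map loss and three-dimensional point cloud loss could be written; the pinhole back-projection, the use of the noisy template map as a binary validity mask, the L1/L2 distance choices and the weight w3d are assumptions and are not taken from the patent.

import torch

def depth_to_pointcloud(depth, K, pose):
    """Back-project an (H, W) depth map into a world-frame point cloud (H*W, 3).

    Assumes a pinhole camera with intrinsic matrix K (3x3) and a 4x4
    camera-to-world pose matrix.
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3).float()
    rays = pix @ torch.inverse(K).T            # K^-1 [u, v, 1]^T per pixel
    pts_cam = rays * depth.reshape(-1, 1)      # points in the camera frame
    pts_h = torch.cat([pts_cam, torch.ones(H * W, 1)], dim=1)
    return (pts_h @ pose.T)[:, :3]             # points in the world frame

def denoising_loss(pred_depth, clean_depth, noisy_template, K, pose, w3d=1.0):
    """Two-dimensional depth map loss plus three-dimensional point cloud loss.

    noisy_template is assumed to be a binary mask of valid object pixels;
    both losses are restricted to those pixels.
    """
    mask = noisy_template.bool()
    loss_2d = torch.abs(pred_depth - clean_depth)[mask].mean()
    pc_pred = depth_to_pointcloud(pred_depth, K, pose)[mask.reshape(-1)]
    pc_clean = depth_to_pointcloud(clean_depth, K, pose)[mask.reshape(-1)]
    loss_3d = torch.norm(pc_pred - pc_clean, dim=1).mean()
    return loss_2d + w3d * loss_3d             # scalar loss for back-propagation

During training, this scalar would be back-propagated through pred_depth, i.e. through the output of the depth map denoising model.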
2. The method of claim 1, wherein the constructing a current data pair corresponding to the current shooting scene based on the image comprises:
obtaining current noisy data corresponding to the current shooting scene based on the image; obtaining current noiseless data corresponding to the current shooting scene based on the image;
and constructing the current data pair by using the current noisy data and the current noiseless data.
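Merely to illustrate the data organisation described in claims 1 and 2, one data pair for a single shooting scene could be held in a structure like the following; the field names and array types are assumptions:

from dataclasses import dataclass
import numpy as np

@dataclass
class ScenePair:
    """One training data pair for one shooting scene (field names assumed)."""
    noisy_depth: np.ndarray      # noisy depth map obtained from the captured image
    noisy_template: np.ndarray   # noisy object shielding template (binary mask)
    clean_depth: np.ndarray      # noiseless depth map of the same scene

A data set is then simply the collection of such pairs over all shooting scenes.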
3. The method according to claim 2, wherein the obtaining current noisy data corresponding to the current shooting scene based on the image includes:
determining an image point of the current shooting scene in the image;
performing image matching on the image point based on the predetermined object point of the current shooting scene to obtain a matching result of the image point;
and obtaining the current noisy data based on the matching result of the image points.
4. The method according to claim 3, wherein the obtaining the current noisy data based on the matching result of the image points comprises:
acquiring a noisy disparity map corresponding to the current shooting scene based on the matching result of the image points;
obtaining a noisy depth map and a noisy object shielding template corresponding to the current shooting scene based on the noisy disparity map corresponding to the current shooting scene; and taking the noisy depth map and the noisy object shielding template as the current noisy data.
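As a non-limiting illustration of the step above: for a rectified stereo pair, a noisy disparity map is commonly converted to a depth map as depth = focal_length * baseline / disparity, and pixels with no valid match can serve as the object shielding (occlusion) template. The formula and the mask criterion below are assumptions, since claim 4 does not prescribe them:

import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert a noisy disparity map (in pixels) into a noisy depth map (in
    metres) and a noisy object shielding template.

    Pixels whose disparity is missing or non-positive are treated as occluded;
    this criterion is an assumption, not something the claim specifies.
    """
    valid = disparity > eps
    depth = np.zeros_like(disparity, dtype=np.float64)
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth, valid.astype(np.uint8)       # (noisy depth map, template)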
5. The method according to claim 2, wherein the obtaining, based on the image, current noiseless data corresponding to the current shooting scene includes:
determining an image point of the current shooting scene in the image;
calculating, through a virtual engine, the real distance travelled by a light ray that is emitted from the optical center of the camera, passes through the image point and reaches the predetermined object point of the current shooting scene, and the included angle between the light ray and the image plane;
and obtaining the current noiseless data based on the real distance of the light ray and the included angle between the light ray and the image plane.
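Purely for illustration (claim 5 leaves the exact geometry open): if the virtual engine reports, for every pixel, the true length of the ray from the optical center through the image point to the object point and the included angle between that ray and the image plane, one plausible conversion to a noiseless depth value is depth = length * sin(angle), i.e. the component of the ray along the optical axis. A sketch under that assumption:

import numpy as np

def noiseless_depth_from_rays(ray_lengths, angles_to_image_plane):
    """ray_lengths: (H, W) true distances from the optical center through each
    image point to the corresponding object point, reported by a virtual engine.
    angles_to_image_plane: (H, W) included angles (radians) between each ray
    and the image plane.

    Assumed geometry: per-pixel noiseless depth = length * sin(angle); the
    patent does not state this formula explicitly.
    """
    return np.asarray(ray_lengths) * np.sin(np.asarray(angles_to_image_plane))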
6. A depth map denoising apparatus, comprising: an acquisition module, a construction module, a training module and a denoising module; wherein,
the acquisition module is used for selecting a shooting scene as a current shooting scene and acquiring an image of the current shooting scene through a camera;
the construction module is used for constructing a current data pair corresponding to the current shooting scene based on the image; repeatedly executing the operations until the data pairs corresponding to all shooting scenes are constructed;
the training module is used for constructing a data set by using data pairs corresponding to all shooting scenes, and training a depth map denoising model to be trained based on the data set and a pre-designed loss function to obtain a trained depth map denoising model; wherein the dataset comprises: a depth map and a template map; the depth map comprises a noiseless depth map and a noisy depth map; the template map only comprises a noisy template map; the loss function comprises a two-dimensional depth map loss function and a three-dimensional point cloud loss function;
the two-dimensional depth map loss function uses the denoised depth map output by the depth map denoising model, the noiseless depth map and the noisy template map, and calculates the difference between the two depth maps as a loss value for back-propagation;
the three-dimensional point cloud loss function converts the denoised depth map output by the depth map denoising model and the noiseless depth map into three-dimensional point clouds by utilizing the intrinsic parameters of the camera, the camera pose and the noisy template map provided by the data set, transforms the two point clouds into the same coordinate system, and calculates the difference between the two point clouds as a loss value for back-propagation;
and the denoising module is used for denoising the depth map to be processed by using the trained depth map denoising model.
7. The apparatus according to claim 6, wherein the construction module is specifically configured to obtain current noisy data corresponding to the current shooting scene based on the image; obtain current noiseless data corresponding to the current shooting scene based on the image; and construct the current data pair by using the current noisy data and the current noiseless data.
8. The apparatus according to claim 7, wherein the construction module is configured to determine an image point of the current shooting scene in the image; perform image matching on the image point based on the predetermined object point of the current shooting scene to obtain a matching result of the image point; and obtain the current noisy data based on the matching result of the image points.
9. The apparatus according to claim 8, wherein the construction module is specifically configured to obtain a noisy disparity map corresponding to the current shooting scene based on the matching result of the image points; obtain a noisy depth map and a noisy object shielding template corresponding to the current shooting scene based on the noisy disparity map corresponding to the current shooting scene; and take the noisy depth map and the noisy object shielding template as the current noisy data.
10. The apparatus according to claim 7, wherein the construction module is configured to determine an image point of the current shooting scene in the image; calculate, through a virtual engine, the real distance travelled by a light ray that is emitted from the optical center of the camera, passes through the image point and reaches the predetermined object point of the current shooting scene, and the included angle between the light ray and the image plane; and obtain the current noiseless data based on the real distance of the light ray and the included angle between the light ray and the image plane.
11. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the depth map denoising method of any one of claims 1 to 5.
12. A storage medium having stored thereon a computer program, which when executed by a processor implements the depth map denoising method of any one of claims 1 to 5.
CN202110918788.4A 2021-08-11 2021-08-11 Depth map denoising method and device, electronic equipment and medium Active CN113628190B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110918788.4A CN113628190B (en) 2021-08-11 2021-08-11 Depth map denoising method and device, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN113628190A (en) 2021-11-09
CN113628190B (en) 2024-03-15

Family

ID=78384436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110918788.4A Active CN113628190B (en) 2021-08-11 2021-08-11 Depth map denoising method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN113628190B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109087255A (en) * 2018-07-18 2018-12-25 中国人民解放军陆军工程大学 A kind of lightweight depth image denoising method based on losses by mixture
CN110458778A (en) * 2019-08-08 2019-11-15 深圳市灵明光子科技有限公司 A kind of depth image denoising method, device and storage medium
CN112862684A (en) * 2021-02-09 2021-05-28 左一帆 Data processing method for depth map super-resolution reconstruction and denoising neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3343502B1 (en) * 2016-12-28 2019-02-20 Dassault Systèmes Depth sensor noise

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Depth estimation method for light field images of occluded scenes; Zhang Xudong et al.; Control and Decision (Issue 12); pp. 13-21 *

Similar Documents

Publication Publication Date Title
CN110910486B (en) Indoor scene illumination estimation model, method and device, storage medium and rendering method
CN109003325B (en) Three-dimensional reconstruction method, medium, device and computing equipment
CN115082639B (en) Image generation method, device, electronic equipment and storage medium
JP7386812B2 (en) lighting estimation
WO2019238114A1 (en) Three-dimensional dynamic model reconstruction method, apparatus and device, and storage medium
CN109191554B (en) Super-resolution image reconstruction method, device, terminal and storage medium
US8803880B2 (en) Image-based lighting simulation for objects
CN114936979B (en) Model training method, image denoising method, device, equipment and storage medium
WO2021237743A1 (en) Video frame interpolation method and apparatus, and computer-readable storage medium
WO2023272531A1 (en) Image processing method and apparatus, device, and storage medium
US11475549B1 (en) High dynamic range image generation from tone mapped standard dynamic range images
CN112562056A (en) Control method, device, medium and equipment for virtual light in virtual studio
CN113129352A (en) Sparse light field reconstruction method and device
CN113140034A (en) Room layout-based panoramic new view generation method, device, equipment and medium
CN114998406A (en) Self-supervision multi-view depth estimation method and device
WO2021151380A1 (en) Method for rendering virtual object based on illumination estimation, method for training neural network, and related products
CN114596383A (en) Line special effect processing method and device, electronic equipment, storage medium and product
Baričević et al. User-perspective AR magic lens from gradient-based IBR and semi-dense stereo
CN113628190B (en) Depth map denoising method and device, electronic equipment and medium
CN113920282B (en) Image processing method and device, computer readable storage medium, and electronic device
CN112132743B (en) Video face changing method capable of self-adapting illumination
CN114758205A (en) Multi-view feature fusion method and system for 3D human body posture estimation
CN111161148A (en) Panoramic image generation method, device, equipment and storage medium
CN116310408B (en) Method and device for establishing data association between event camera and frame camera
CN117058380B (en) Multi-scale lightweight three-dimensional point cloud segmentation method and device based on self-attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230404

Address after: 518054 512, building 4, software industry base, No. 19, 17 and 18, Haitian 1st Road, Binhai community, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant after: Cross dimension (Shenzhen) Intelligent Digital Technology Co.,Ltd.

Address before: 528200 room 207, floor 2, building A8, Guangdong new light source industrial base, Luocun, Shishan town, Nanhai District, Foshan City, Guangdong Province

Applicant before: Yuewei (Foshan) Intelligent Technology Co.,Ltd.

GR01 Patent grant