CN115861401B - Binocular and point cloud fusion depth recovery method, device and medium - Google Patents


Info

Publication number
CN115861401B
CN115861401B
Authority
CN
China
Prior art keywords
point cloud
depth
image
binocular
sparse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310170221.2A
Other languages
Chinese (zh)
Other versions
CN115861401A (en)
Inventor
许振宇
李月华
朱世强
邢琰
姜甜甜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Control Engineering
Zhejiang Lab
Original Assignee
Beijing Institute of Control Engineering
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Control Engineering, Zhejiang Lab filed Critical Beijing Institute of Control Engineering
Priority to CN202310170221.2A priority Critical patent/CN115861401B/en
Publication of CN115861401A publication Critical patent/CN115861401A/en
Application granted granted Critical
Publication of CN115861401B publication Critical patent/CN115861401B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Image Processing (AREA)

Abstract

The invention discloses a binocular and point cloud fusion depth recovery method, device and medium. The method constructs a depth recovery neural network comprising a sparse expansion module, a multi-scale feature extraction and fusion module, a variable-weight Gaussian modulation module and a cascaded three-dimensional convolutional neural network module. On the basis of a binocular stereo matching network, sparse point clouds are introduced, the density of guide points is improved by a neighborhood expansion method, and Gaussian modulation and multi-scale feature extraction and fusion are comprehensively adopted, so that the accuracy and robustness of depth recovery are improved, providing an effective method for dense depth recovery in real applications.

Description

Binocular and point cloud fusion depth recovery method, device and medium
Technical Field
The invention relates to the field of computer vision, in particular to a binocular and point cloud fusion depth recovery method, a binocular and point cloud fusion depth recovery device and a binocular and point cloud fusion depth recovery medium.
Background
Depth recovery is a very important task in computer vision and is widely applied in fields such as robotics, autonomous driving and three-dimensional reconstruction.
Compared with the traditional binocular stereo matching depth recovery method, a depth recovery algorithm fusing binocular images and sparse point clouds introduces high-precision sparse point clouds derived from sensors such as lidar and TOF cameras as prior information, which plays a guiding role in depth recovery. Especially in scenes with weak texture features, heavy occlusion or large domain changes, the depth information provided by the sparse point cloud can effectively improve the accuracy and robustness of depth recovery.
Existing binocular and point cloud fusion depth recovery algorithms are mainly divided into two categories, point-cloud-guided cost aggregation and point cloud information fusion, and both directly use the original sparse point cloud for fusion or guidance. However, due to the sparsity of the input point cloud data, the point-cloud-guided cost aggregation methods carry limited actual guidance information, and the modulation guidance only acts on the depth range, so sufficient prior information cannot be provided in the image dimension. For the point cloud information fusion methods, direct fusion or feature fusion suffers from the discontinuity of the sparse data, so the extracted fusion features are weak.
Disclosure of Invention
The invention aims to provide a binocular and point cloud fusion depth recovery method, a binocular and point cloud fusion depth recovery device and a binocular and point cloud fusion depth recovery medium aiming at the defects of the prior art.
The aim of the invention is realized by the following technical scheme: the first aspect of the embodiment of the invention provides a binocular and point cloud fusion depth recovery method, which comprises the following steps:
(1) The method comprises the steps of constructing a depth recovery network, wherein the depth recovery network comprises a sparse expansion module, a multi-scale feature extraction and fusion module, a variable weight Gaussian modulation module and a cascade three-dimensional convolutional neural network module; the input of the depth recovery network is binocular image and sparse point cloud data, and the output of the depth recovery network is dense depth image;
(2) Training the depth recovery network constructed in the step (1), inputting binocular images and sparse point cloud data by using a binocular data set, projecting the sparse point cloud data to a left-eye camera coordinate system to generate a sparse depth image, comparing a depth truth image, carrying out data enhancement on the binocular images and the sparse depth image, calculating and outputting loss values of dense depth images, and iteratively updating network weights by using a counter-propagation network;
(3) And (3) inputting the binocular image to be tested and sparse point cloud data into the depth recovery network obtained by training in the step (2), and projecting the sparse point cloud data to a left-eye camera coordinate system to generate a sparse depth image by utilizing sensor calibration parameters so as to output a dense depth image.
Further, the sparse expansion module specifically includes: and taking the multi-channel information of the image as a guide, improving the density of sparse point cloud data by a neighborhood expansion method, and outputting a semi-dense depth map.
Further, constructing the sparse expansion module includes the sub-steps of:
(a1) Acquiring a sparse depth map according to the pose relation between the point cloud data and the left-eye camera image, and respectively extracting pixel coordinates of effective points in the sparse depth map, corresponding image multichannel values and image multichannel values of points in the neighborhood of the image multichannel values;
(a2) Calculating average image numerical deviation according to the image multi-channel numerical value corresponding to the pixel coordinates of the effective points and the image multi-channel numerical value of the adjacent points;
(a3) And expanding the sparse depth map into a semi-dense depth map according to the average image numerical deviation of the effective points and a set fixed threshold, and outputting the semi-dense depth map.
Further, the multi-scale feature extraction and fusion module specifically comprises: taking the semi-dense depth map output by the sparse expansion module and the binocular image as input, adopting a Unet encoder-decoder structure combined with a spatial pyramid pooling method to extract point cloud features, left-eye image features and right-eye image features, and further fusing the left-eye image features and the point cloud features in a cascading manner at the feature layer to obtain fusion features.
Further, constructing the multi-scale feature extraction and fusion module includes the sub-steps of:
(b1) Respectively carrying out multi-layer downsampling coding on the semi-dense depth map and the binocular image which are output by the sparse expansion module so as to obtain left-eye image characteristics, right-eye image characteristics and point cloud characteristics after downsampling coding of a plurality of scales;
(b2) Respectively carrying out spatial pyramid pooling treatment on the left eye image characteristic, the right eye image characteristic and the point cloud characteristic which are subjected to downsampling coding with the lowest resolution so as to obtain a pooling treatment result;
(b3) Respectively carrying out multi-layer up-sampling decoding on the results obtained after the pooling treatment of the left-eye image features, the right-eye image features and the point cloud features so as to obtain left-eye image features, right-eye image features and point cloud features obtained after up-sampling decoding of a plurality of scales;
(b4) And cascading the up-sampled and decoded left-eye image features and the point cloud features in feature dimensions to obtain fusion features of the left-eye image features and the point cloud features.
Further, the variable-weight gaussian modulation module specifically comprises: based on the data reliability of the semi-dense depth map, generating Gaussian modulation functions with different weights, and modulating the depth dimension at different pixel positions of the cost volume.
Further, constructing the variable weight gaussian modulation module comprises the sub-steps of:
(c1) Constructing a cost volume in a cascading mode according to the fusion characteristics and the right-eye image characteristics;
(c2) According to the reliability of the sparse point cloud, gaussian modulation functions with different weights are respectively constructed;
(c3) Modulating the cost roll according to the constructed Gaussian modulation function to obtain the modulated cost roll.
Further, constructing the cascaded three-dimensional convolutional neural network module includes the substeps of:
(d1) Carrying out cost volume fusion and cost volume aggregation on the low-resolution cost volumes through a three-dimensional convolutional neural network so as to obtain aggregated cost volumes;
(d2) Acquiring softmax values of all depth values on each pixel coordinate by adopting a softmax function so as to obtain a low-resolution depth map;
(d3) And up-sampling is carried out according to the low-resolution depth map so as to obtain a prediction result of the high-resolution depth map, and three cascading iteration processes are carried out so as to obtain a dense depth map under the complete resolution.
The second aspect of the embodiment of the invention provides a binocular and point cloud fusion depth recovery device, which comprises one or more processors and is used for realizing the binocular and point cloud fusion depth recovery method.
A third aspect of the embodiments of the present invention provides a computer readable storage medium having a program stored thereon, which when executed by a processor, is configured to implement the binocular and point cloud fusion depth restoration method described above.
The method has the beneficial effects that dense depth is recovered based on the fusion of point cloud and binocular data: sparse point cloud data and binocular images are taken as input, a semi-dense depth map is obtained through neighborhood expansion, feature extraction and feature fusion are performed on the depth map and the binocular images, a cost volume is constructed and modulated with variable-weight Gaussian modulation functions, and cost aggregation is carried out through a deep learning network, so that the recovery of dense depth information is achieved. On the basis of a binocular stereo matching depth recovery network, the invention introduces sparse point clouds, improves the density of guide points by a neighborhood expansion method, and on this basis adopts a Gaussian modulation guidance method and a multi-scale feature extraction and fusion method to improve the accuracy and robustness of depth recovery. The invention relies on sensor equipment capable of providing binocular image data and sparse point cloud data, is beneficial to improving accuracy and robustness, and is an effective method for recovering dense depth in real applications.
Drawings
FIG. 1 is a diagram of a network architecture as a whole;
FIG. 2 is a sparse expansion schematic;
FIG. 3 is a schematic diagram of variable weight Gaussian modulation;
FIG. 4 is a schematic diagram showing the effect of the present invention; wherein a in fig. 4 is an input left-eye image, b in fig. 4 is an input right-eye image, c in fig. 4 is a picture of input sparse point cloud re-projected under a left-eye coordinate system, and d in fig. 4 is a depth picture obtained by restoration;
fig. 5 is a schematic structural diagram of the binocular and point cloud fusion depth restoration device of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
The binocular and point cloud fusion depth recovery method of the invention, as shown in figure 1, comprises the following steps:
(1) And constructing a deep recovery network.
The overall network architecture design is based on the open-source deep learning framework PyTorch and is modified from the publicly available binocular stereo matching network architecture CF-NET to construct four parts: a sparse expansion module, a multi-scale feature extraction and fusion module, a variable-weight Gaussian modulation module and a cascaded three-dimensional convolutional neural network module. The input of the depth recovery network is the binocular image and sparse point cloud data, and the output of the depth recovery network is a dense depth map.
(1.1) constructing a sparse expansion module.
The whole processing flow of the module is shown in fig. 2, the multi-channel information of the image is used as a guide, the density of sparse point cloud data can be improved by a neighborhood expansion method, and a semi-dense depth map is output.
(a1) According to the pose relation between the point cloud data and the left-eye camera image, the input sparse point cloud data are projected to the camera coordinate system using an OpenCV reprojection function to obtain a sparse depth map $D$. Points with depth value $D(u,v) > 0$ are defined as valid points. For each valid point, the pixel coordinates $(u,v)$, the corresponding image multi-channel values $I_c(u,v)$ and the image multi-channel values $I_c(u+\alpha, v+\beta)$ of the points in its neighborhood are extracted, where $D(u,v)$ denotes the depth value at the coordinate position $(u,v)$ of the sparse depth map $D$ re-projected to the left-eye image, $W$ and $H$ denote the width and height of the image (in this embodiment $W = 960$, $H = 512$), $I_c(u,v)$ denotes the image multi-channel value of channel $c$ at pixel coordinates $(u,v)$, $C$ is the number of channels ($C = 3$ for the RGB image), $\alpha$ and $\beta$ denote the offsets on the abscissa and the ordinate of a point in the neighborhood, $\alpha, \beta \in [-r, r]$, and $r$ denotes the neighborhood distance (in this embodiment $r = 2$). It should be understood that $C$ may take other values as well, for example $C = 4$ for RGBA images.
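For illustration, the following Python sketch shows one way to re-project sparse point cloud data into the left camera to obtain the sparse depth map of step (a1), using OpenCV's projectPoints; the function name, the (R, t) calibration inputs and the rounding-to-nearest-pixel choice are assumptions of the sketch, not the disclosed implementation.

```python
import numpy as np
import cv2

def point_cloud_to_sparse_depth(points_xyz, K, R, t, W=960, H=512):
    """points_xyz: (N, 3) points in the sensor frame; K: 3x3 intrinsics; R, t: extrinsics."""
    rvec, _ = cv2.Rodrigues(R.astype(np.float64))            # rotation matrix -> Rodrigues vector
    uv, _ = cv2.projectPoints(points_xyz.astype(np.float64),
                              rvec, t.astype(np.float64),
                              K.astype(np.float64), np.zeros(5))
    uv = uv.reshape(-1, 2)
    cam_pts = (R @ points_xyz.T + t.reshape(3, 1)).T          # points in the left-camera frame
    z = cam_pts[:, 2]                                         # depth of each point
    D = np.zeros((H, W), dtype=np.float32)                    # 0 marks invalid pixels
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    keep = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)  # in front of camera and inside image
    D[v[keep], u[keep]] = z[keep]
    return D
```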
(a2) According to the image multi-channel values $I_c(u,v)$ corresponding to the pixel coordinates $(u,v)$ of the valid point and the image multi-channel values $I_c(u+\alpha, v+\beta)$ of the points in its neighborhood, the average image value deviation $E_{\alpha,\beta}(u,v)$ is calculated. The expression of the average image value deviation is:

$$E_{\alpha,\beta}(u,v) = \frac{1}{C}\sum_{c=1}^{C}\big|\,I_c(u,v) - I_c(u+\alpha, v+\beta)\,\big|,$$

where $C$ is the number of channels, $I_c(u,v)$ denotes the image multi-channel value of channel $c$ at pixel coordinates $(u,v)$, $I_c(u+\alpha, v+\beta)$ denotes the image value of channel $c$ at the neighborhood point, $\alpha$ and $\beta$ denote the offsets on the abscissa and the ordinate of the neighborhood point, $\alpha, \beta \in [-r, r]$, and $r$ denotes the neighborhood distance.
(a3) For each valid point with pixel coordinates $(u,v)$, the average image value deviation $E_{\alpha,\beta}(u,v)$ is compared with a fixed threshold $T$, which characterizes how easily a pixel is extended and can be adjusted according to the accuracy of the final depth recovery; in this embodiment the fixed threshold is set to $T = 8$. The sparse depth map $D$ can then be extended to a semi-dense depth map $D_{exp}$ by:

$$D_{exp}(u+\alpha, v+\beta) =
\begin{cases}
D(u,v), & E_{\alpha,\beta}(u,v) < T,\\
D(u+\alpha, v+\beta), & \text{otherwise},
\end{cases}$$

where $D(u,v)$ denotes the depth value at the coordinate position $(u,v)$ of the sparse depth map $D$ re-projected to the left-eye image, $D(u+\alpha, v+\beta)$ denotes the depth value at the coordinate position $(u+\alpha, v+\beta)$, $\alpha$ and $\beta$ denote the offsets on the abscissa and the ordinate of the neighborhood point, $\alpha, \beta \in [-r, r]$, and $r$ denotes the neighborhood distance. After the neighborhood expansion of all valid points is completed, the final semi-dense depth map is obtained and output.
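A minimal NumPy sketch of the neighborhood expansion in steps (a2)-(a3) is given below; it assumes an (H, W) depth array with 0 marking invalid pixels and an (H, W, C) left image, and the function name and default threshold are illustrative rather than taken from the disclosed implementation.

```python
import numpy as np

def sparse_expansion(D, img, r=2, threshold=8.0):
    """D: (H, W) sparse depth map, 0 = invalid; img: (H, W, C) left image."""
    H, W, C = img.shape
    D_exp = D.copy()
    img = img.astype(np.float32)
    vs, us = np.nonzero(D > 0)                    # pixel coordinates of valid points
    for v, u in zip(vs, us):
        for a in range(-r, r + 1):                # horizontal offset alpha
            for b in range(-r, r + 1):            # vertical offset beta
                vn, un = v + b, u + a
                if not (0 <= vn < H and 0 <= un < W):
                    continue
                # average image value deviation over the C channels
                e = np.abs(img[v, u] - img[vn, un]).mean()
                if e < threshold:
                    D_exp[vn, un] = D[v, u]       # propagate the valid depth to the neighbor
    return D_exp
```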
(1.2) constructing a multi-scale feature extraction and fusion module.
The module takes the semi-dense depth map output by the sparse expansion module and the binocular image as input, adopts a Unet encoder-decoder structure for each, combined with a spatial pyramid pooling method, to extract the point cloud features, left-eye image features and right-eye image features, and further fuses the left-eye image features and the point cloud features in a cascading manner at the feature layer, so as to obtain the fusion features.
(b1) Multi-layer down-sampling encoding is performed on the semi-dense depth map and the binocular image respectively, so that the down-sample-encoded left-eye image features $f_L^{\,i}$, right-eye image features $f_R^{\,i}$ and point cloud features $f_P^{\,i}$ at multiple scales can be obtained, where $i$ indexes the layers and $F_i$ denotes the feature dimension of the $i$-th layer after down-sampling encoding. In this embodiment, the semi-dense depth map and the binocular image are down-sampled and encoded by 5-level residual blocks, so that $i = 1, \ldots, 5$, with $W = 960$ and $H = 512$.
(b2) Spatial pyramid pooling is performed on the lowest-resolution down-sample-encoded left-eye image features, right-eye image features and point cloud features respectively, so as to obtain the pooled results, which are expressed as:

$$\hat f_L = \mathrm{SPP}\big(f_L^{\,N}\big), \qquad \hat f_R = \mathrm{SPP}\big(f_R^{\,N}\big), \qquad \hat f_P = \mathrm{SPP}\big(f_P^{\,N}\big),$$

where $\mathrm{SPP}(\cdot)$ denotes the pooling function, i.e. spatial pyramid pooling applied to the down-sample-encoded features, $N$ denotes the maximum number of down-sampling encoding layers, $\hat f_L$ denotes the left-eye image feature pooling result, $\hat f_R$ denotes the right-eye image feature pooling result, and $\hat f_P$ denotes the point cloud feature pooling result.

In this embodiment, a spatial pyramid pooling method similar to that of the public network HSMNet is adopted: 4-level average pooling is performed on the lowest-resolution ($N = 5$) down-sample-encoded left-eye image features, right-eye image features and point cloud features, with a fixed pooling size at each level, and the pooled results take the form given above.
(b3) Multi-layer up-sampling decoding is performed on the pooled left-eye image features, right-eye image features and point cloud features respectively, so that the up-sample-decoded left-eye image features $\tilde f_L^{\,i}$, right-eye image features $\tilde f_R^{\,i}$ and point cloud features $\tilde f_P^{\,i}$ at multiple scales are obtained, where $F_i$ denotes the feature dimension of the $i$-th up-sampling layer. Correspondingly, the results of up-sampling and decoding the left-eye image features, the right-eye image features and the point cloud features are expressed as:

$$\tilde f_X^{\,i} = U\!\big(\mathrm{Concat}\big(\tilde f_X^{\,i+1},\, f_X^{\,i}\big)\big), \qquad X \in \{L, R, P\}, \quad i = N-1, \ldots, 1, \qquad \tilde f_X^{\,N} = \hat f_X,$$

where $\mathrm{Concat}(\cdot)$ denotes the vector concatenation function, $U(\cdot)$ denotes the processing function of the up-sampling decoding module, and $N$ denotes the maximum number of up-sampling decoding layers.

In this embodiment, up-sampling decoding is performed by a 5-stage corresponding up-sampling decoding module, with $F_1 = 64$, $F_2 = 128$, $F_3 = 192$, $F_4 = 256$, $F_5 = 512$, and the up-sample-decoded results follow the expression above.
(b4) The up-sample-decoded left-eye image features and point cloud features are concatenated in the feature dimension to obtain the fusion features of the left-eye image features and the point cloud features:

$$f_{fuse}^{\,i} = \mathrm{Concat}\big(\tilde f_L^{\,i},\, \tilde f_P^{\,i}\big),$$

where $\mathrm{Concat}(\cdot)$ denotes the vector concatenation function, $\tilde f_L^{\,i}$ denotes the up-sample-decoded left-eye image features, $\tilde f_P^{\,i}$ denotes the up-sample-decoded point cloud features, and $i$ denotes the $i$-th feature layer.
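A hedged PyTorch sketch of the multi-scale feature extraction and fusion described in (b1)-(b4) follows: a 5-level encoder, spatial pyramid pooling on the lowest-resolution features, a decoder with skip connections, and channel-wise concatenation of the decoded left-image and point-cloud features. The channel widths follow the embodiment (64, 128, 192, 256, 512); the pooling sizes, module layout and all names are illustrative assumptions rather than the exact CF-NET-based implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, stride=2):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class SPP(nn.Module):
    """4-level average pooling; each branch is resized back and concatenated."""
    def __init__(self, c, sizes=(1, 2, 4, 8)):        # pooling sizes are an assumption
        super().__init__()
        self.sizes = sizes
        self.proj = nn.Conv2d(c * (len(sizes) + 1), c, 1)
    def forward(self, x):
        h, w = x.shape[-2:]
        branches = [x] + [F.interpolate(F.adaptive_avg_pool2d(x, s), (h, w),
                                        mode='bilinear', align_corners=False)
                          for s in self.sizes]
        return self.proj(torch.cat(branches, dim=1))

class UNetBranch(nn.Module):
    """Encoder + SPP + decoder with skip connections for one input stream."""
    def __init__(self, cin, widths=(64, 128, 192, 256, 512)):
        super().__init__()
        chans = [cin] + list(widths)
        self.enc = nn.ModuleList(conv_bn_relu(chans[i], chans[i + 1])
                                 for i in range(len(widths)))
        self.spp = SPP(widths[-1])
        self.dec = nn.ModuleList(
            nn.Conv2d(widths[i] + widths[i + 1], widths[i], 3, 1, 1)
            for i in range(len(widths) - 1))
    def forward(self, x):
        skips = []
        for e in self.enc:                             # (b1) multi-scale encoding
            x = e(x)
            skips.append(x)
        feats = [self.spp(x)]                          # (b2) pooling at the lowest resolution
        for i in reversed(range(len(self.dec))):       # (b3) up-sampling decoding
            up = F.interpolate(feats[0], size=skips[i].shape[-2:],
                               mode='bilinear', align_corners=False)
            feats.insert(0, self.dec[i](torch.cat([up, skips[i]], dim=1)))
        return feats                                   # fine-to-coarse decoded features

def fuse(left_feats, pc_feats):
    """(b4) concatenate left-image and point-cloud features per scale."""
    return [torch.cat([l, p], dim=1) for l, p in zip(left_feats, pc_feats)]
```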
(1.3) constructing a variable weight Gaussian modulation module.
Based on the data reliability of the semi-dense depth image, generating Gaussian modulation functions with different weights, and modulating the depth dimension at different pixel positions of the cost volume.
(c1) A cost volume $V \in \mathbb{R}^{D_{\max} \times C_V \times H \times W}$ is constructed in a cascading manner from the fusion features and the right-eye image features, where $D_{\max}$ denotes the maximum disparity search range (256 in this embodiment), $C_V$ denotes the feature dimension of the cost volume, $F_i$ denotes the feature dimension of the $i$-th up-sample-decoded layer, and $W = 960$, $H = 512$.
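As a sketch of how a concatenation-style cost volume can be assembled from the fusion features and the right-eye features (the exact layout of the patented cost volume may differ), the following PyTorch function shifts the right feature map by each candidate disparity and concatenates it with the fused left feature along the channel dimension.

```python
import torch

def build_concat_cost_volume(fused_left, feat_right, max_disp):
    """fused_left: (B, C1, H, W); feat_right: (B, C2, H, W) -> (B, C1+C2, D, H, W)."""
    B, C1, H, W = fused_left.shape
    C2 = feat_right.shape[1]
    volume = fused_left.new_zeros(B, C1 + C2, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :C1, d] = fused_left
            volume[:, C1:, d] = feat_right
        else:
            # at disparity d, pair left pixel x with right pixel x - d
            volume[:, :C1, d, :, d:] = fused_left[:, :, :, d:]
            volume[:, C1:, d, :, d:] = feat_right[:, :, :, :-d]
    return volume
```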
(c2) According to the reliability of the sparse point cloud, Gaussian modulation functions with different weights are constructed respectively, with the expressions:

$$G_1(d,u,v) = 1 - v_1(u,v) + v_1(u,v)\, k_1 \exp\!\Big(-\frac{\big(d - D(u,v)\big)^2}{2 c_1^{2}}\Big),$$

$$G_2(d,u,v) = 1 - v_2(u,v) + v_2(u,v)\, k_2 \exp\!\Big(-\frac{\big(d - D_{exp}(u,v)\big)^2}{2 c_2^{2}}\Big),$$

where $k_1$ and $c_1$ denote the weight and variance of the modulation function corresponding to the original sparse point cloud, $k_2$ and $c_2$ denote the weight and variance of the modulation function corresponding to the expanded point cloud (in this embodiment $k_1 = 10$, $c_1 = 1$, $k_2 = 2$, $c_2 = 8$), $D(u,v)$ denotes the depth value at the coordinate position $(u,v)$ of the sparse depth map $D$ re-projected to the left-eye image, $D_{exp}(u,v)$ denotes the depth value at $(u,v)$ of the semi-dense depth map $D_{exp}$, $v_1(u,v)$ and $v_2(u,v)$ are the validity flags of $D(u,v)$ and $D_{exp}(u,v)$ respectively, set to 1 when the corresponding point is valid (depth greater than 0) and to 0 otherwise, and $d$ denotes the coordinate in the depth dimension.
(c3) The cost volume is modulated with the constructed Gaussian modulation functions to obtain the modulated cost volume $\hat V$. Specifically, for every feature value $V(d, c, u, v)$ of the cost volume, the modulated feature value is expressed as:

$$\hat V(d, c, u, v) = G_1(d, u, v)\, G_2(d, u, v)\, V(d, c, u, v).$$
the overall flow diagram of the variable-weight gaussian modulation module is shown in fig. 3, and the corresponding sparse point cloud can be divided into an invalid point, an original point and a point obtained by neighborhood expansion.
In particular, at the point of invalidity
Figure SMS_97
、/>
Figure SMS_101
Therefore->
Figure SMS_104
、/>
Figure SMS_96
Therefore, the cost volume of the corresponding position of the invalid point remains unchanged; original +.>
Figure SMS_100
、/>
Figure SMS_103
Therefore, it is
Figure SMS_106
、/>
Figure SMS_95
Therefore, the original point corresponding to the position generation of the price volume uses a high weight and a low variance k 1 =10,c 1 A gaussian modulation function of =1; neighborhood extension derived point +.>
Figure SMS_99
Figure SMS_102
Therefore->
Figure SMS_105
、/>
Figure SMS_98
Therefore, the neighborhood expansion derived point reliability bias, using low weight high variance k 2 =2,c 2 A gaussian modulation function of 8.
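The following PyTorch sketch illustrates variable-weight Gaussian modulation of a cost volume consistent with the description above; the "1 − v + v·k·exp(…)" functional form, the tensor shapes, and the definition of v2 as covering only the expanded-but-not-original positions are reconstruction assumptions.

```python
import torch

def gaussian_modulation(cost, D_sparse, D_exp, k1=10.0, c1=1.0, k2=2.0, c2=8.0):
    """cost: (B, C, D, H, W); D_sparse, D_exp: (B, H, W) hints in disparity/depth-index units."""
    B, C, Dmax, H, W = cost.shape
    d = torch.arange(Dmax, device=cost.device, dtype=cost.dtype).view(1, Dmax, 1, 1)
    v1 = (D_sparse > 0).to(cost.dtype).unsqueeze(1)                    # original sparse points
    v2 = ((D_exp > 0) & (D_sparse == 0)).to(cost.dtype).unsqueeze(1)   # expanded-only points
    # high-weight, low-variance Gaussian around the original hints
    g1 = 1 - v1 + v1 * k1 * torch.exp(-(d - D_sparse.unsqueeze(1)) ** 2 / (2 * c1 ** 2))
    # low-weight, high-variance Gaussian around the expanded hints
    g2 = 1 - v2 + v2 * k2 * torch.exp(-(d - D_exp.unsqueeze(1)) ** 2 / (2 * c2 ** 2))
    return cost * (g1 * g2).unsqueeze(1)                               # broadcast over channels
```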
(1.4) constructing a cascaded three-dimensional convolutional neural network module.
(d1) Adopting the cascaded three-dimensional convolutional neural network method of the public network CF-NET, the low-resolution modulated cost volume $\hat V$ undergoes cost volume fusion and cost volume aggregation through an hourglass-type three-dimensional convolutional neural network, so that the aggregated cost volume $V_{agg}$ is obtained.

(d2) A softmax function is used to obtain the softmax values of all depth values at each pixel coordinate, so that a low-resolution depth map can be obtained.

(d3) Up-sampling is performed on the low-resolution depth map, so that a prediction result of the high-resolution depth map can be obtained. Based on the reliability of the prediction result, the range of the actually predicted depth is defined around this prediction result as the depth distribution range for the aggregation of the high-resolution cost volume. This distribution range is recursively fed into the cost aggregation process of the high-resolution cost volume, and cost aggregation is carried out through the hourglass-type three-dimensional convolutional neural network, so that the aggregated cost volume at the next higher resolution is obtained; the current depth layer index within this range determines the corresponding actual depth value. Likewise, the softmax function is used to obtain the softmax values of all depth values at each pixel coordinate, so that the depth map at this resolution is obtained. Through the above process, after 3 cascaded iterations a dense depth map at full resolution is finally obtained. The architecture of the cascaded three-dimensional convolutional neural network is shown as the cascaded three-dimensional convolutional neural network part in fig. 1.
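A minimal sketch of the softmax (soft-argmin) depth regression used in step (d2) is given below: the aggregated cost volume is converted into a per-pixel probability over candidate depth values, whose expectation gives the depth map. Tensor shapes and names are assumptions.

```python
import torch
import torch.nn.functional as F

def softmax_depth_regression(cost_agg, depth_values):
    """cost_agg: (B, D, H, W) aggregated cost; depth_values: (D,) or (B, D, H, W) candidates."""
    prob = F.softmax(cost_agg, dim=1)                 # softmax over the depth dimension
    if depth_values.dim() == 1:                       # uniform candidate depths per pixel
        depth_values = depth_values.view(1, -1, 1, 1)
    return torch.sum(prob * depth_values, dim=1)      # (B, H, W) expected depth
```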
(2) Training the depth recovery network constructed in the step (1), inputting binocular images and sparse point cloud data by using a binocular data set, projecting the sparse point cloud data to a left-eye camera coordinate system to generate a sparse depth image, comparing a depth truth image, carrying out data enhancement on the binocular images and the sparse depth image, calculating and outputting loss values of dense depth images, and iteratively updating network weights by using a counter-propagation network.
In this embodiment, the open-source SceneFlow binocular dataset may be selected as the task sample; the dataset contains 35,454 pairs of binocular images with depth ground truth for training and 7,349 pairs for testing. During training, 5% of the points are randomly sampled from the depth ground truth to obtain a sparse depth map, which simulates the sparse depth map produced by point cloud re-projection and is used as the sparse depth input of the network.
The binocular images are sequentially subjected to data enhancement by random occlusion, asymmetric color transformation and random cropping. Random occlusion is realized by randomly generating a rectangular region and replacing the image data at all coordinates of the corresponding region in the right image with the average image value. The asymmetric color transformation applies different brightness, contrast and gamma transformations to the left-eye and right-eye images; the corresponding processing functions, such as adjust_brightness under torchvision.transforms, can be called directly. Random cropping is realized by randomly generating a rectangular region of fixed size and cropping away the image information of the remaining regions. The sparse depth map is likewise subjected to random occlusion and random cropping, where the random occlusion positions are generated independently of those of the binocular images, while the random cropping region is kept consistent with the cropping position of the binocular images to ensure the correspondence between the binocular image information and the depth information.
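For illustration, a possible implementation of the described data enhancement is sketched below with torchvision's functional transforms (adjust_brightness, adjust_contrast, adjust_gamma); the occlusion sizes, color-jitter ranges and crop size are illustrative assumptions.

```python
import random
import torchvision.transforms.functional as TF

def augment(left, right, sparse_depth, crop_hw=(256, 512)):
    """left, right: (3, H, W) float tensors in [0, 1]; sparse_depth: (1, H, W)."""
    # asymmetric color transform applied only to the right image
    right = TF.adjust_brightness(right, random.uniform(0.8, 1.2))
    right = TF.adjust_contrast(right, random.uniform(0.8, 1.2))
    right = TF.adjust_gamma(right, random.uniform(0.8, 1.2))
    _, H, W = right.shape
    # random occlusion on the right image, filled with its mean value
    oh, ow = random.randint(30, 80), random.randint(50, 150)
    oy, ox = random.randint(0, H - oh), random.randint(0, W - ow)
    right[:, oy:oy + oh, ox:ox + ow] = right.mean()
    # independent random occlusion on the sparse depth map (position not tied to the image)
    dh, dw = random.randint(30, 80), random.randint(50, 150)
    dy, dx = random.randint(0, H - dh), random.randint(0, W - dw)
    sparse_depth = sparse_depth.clone()
    sparse_depth[:, dy:dy + dh, dx:dx + dw] = 0
    # consistent random crop for left/right images and the sparse depth map
    ch, cw = crop_hw
    cy, cx = random.randint(0, H - ch), random.randint(0, W - cw)
    crop = lambda t: t[:, cy:cy + ch, cx:cx + cw]
    return crop(left), crop(right), crop(sparse_depth)
```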
The binocular images and the sparse depth map after data enhancement are fed as input into the depth recovery network of step (1). An Adam optimizer is used for end-to-end network training, and an L1 loss function is used to evaluate the loss between the recovered depth map and the depth ground truth; iterative training is realized through the usual forward propagation and backward propagation processes of a neural network. An initial learning rate is set for training, 20 rounds of iteration are carried out in total, and from the 16th round to the 18th round the learning rate is reduced to half of its original value. The learning rate and the iteration parameters can be adjusted according to the actual depth recovery accuracy.
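A hedged sketch of the training loop follows: Adam optimization, an L1 loss on valid ground-truth pixels, 20 epochs, and the learning rate halved in the later epochs. The initial learning rate (1e-3 here) and the exact halving schedule are assumptions, since the embodiment's value is set separately; the model and dataloader interfaces are also illustrative.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs=20, base_lr=1e-3, device='cuda'):
    optim = torch.optim.Adam(model.parameters(), lr=base_lr)
    # assumed schedule: halve the learning rate at epochs 16 and 18
    sched = torch.optim.lr_scheduler.MultiStepLR(optim, milestones=[16, 18], gamma=0.5)
    model.to(device).train()
    for epoch in range(epochs):
        for left, right, sparse_depth, gt_depth in loader:
            left, right = left.to(device), right.to(device)
            sparse_depth, gt_depth = sparse_depth.to(device), gt_depth.to(device)
            pred = model(left, right, sparse_depth)          # dense depth prediction
            valid = gt_depth > 0                             # supervise valid pixels only
            loss = F.l1_loss(pred[valid], gt_depth[valid])
            optim.zero_grad()
            loss.backward()
            optim.step()
        sched.step()
```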
(3) In the task verification process, as shown in fig. 4, a binocular image to be tested (shown as a in fig. 4 and b in fig. 4) and sparse point cloud data are input into the depth recovery network obtained by training in the step (2), the sparse point cloud data are projected to a left-eye camera coordinate system to generate a sparse depth image (shown as c in fig. 4) by using sensor calibration parameters, and finally a dense depth image (shown as d in fig. 4) is output, so that the visualization process is completed.
Corresponding to the embodiment of the binocular and point cloud fusion depth recovery method, the invention also provides an embodiment of a binocular and point cloud fusion depth recovery device.
Referring to fig. 5, the binocular and point cloud fusion depth restoration device provided by the embodiment of the invention includes one or more processors, and is used for implementing the binocular and point cloud fusion depth restoration method in the above embodiment.
The embodiment of the binocular and point cloud fusion depth restoration device can be applied to any device with data processing capability, such as a computer. The device embodiment may be implemented by software, or by hardware or a combination of hardware and software. Taking software implementation as an example, the device in the logical sense is formed by the processor of the device with data processing capability reading the corresponding computer program instructions from a non-volatile memory into memory and running them. In terms of hardware, fig. 5 shows a hardware structure diagram of the device with data processing capability where the binocular and point cloud fusion depth restoration device of the present invention is located; in addition to the processor, memory, network interface and non-volatile memory shown in fig. 5, the device with data processing capability in the embodiment may further include other hardware according to its actual functions, which is not described here again.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the solution of the present invention. Those of ordinary skill in the art can understand and implement it without creative effort.
The embodiment of the invention also provides a computer readable storage medium, on which a program is stored, which when executed by a processor, implements the binocular and point cloud fusion depth recovery method in the above embodiment.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any device with data processing capability described in any of the foregoing embodiments. The computer readable storage medium may also be an external storage device of any device with data processing capability, for example a plug-in hard disk, a Smart Media Card (SMC), an SD card or a Flash memory card (Flash Card) provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any device with data processing capability. The computer readable storage medium is used to store the computer program and other programs and data required by the device with data processing capability, and may also be used to temporarily store data that has been output or is to be output.
The above embodiments are merely for illustrating the design concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement the same, the scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications according to the principles and design ideas of the present invention are within the scope of the present invention.

Claims (8)

1. The binocular and point cloud fusion depth recovery method is characterized by comprising the following steps of:
(1) The method comprises the steps of constructing a depth recovery network, wherein the depth recovery network comprises a sparse expansion module, a multi-scale feature extraction and fusion module, a variable weight Gaussian modulation module and a cascade three-dimensional convolutional neural network module; the input of the depth recovery network is binocular image and sparse point cloud data, and the output of the depth recovery network is dense depth image; the sparse expansion module specifically comprises: taking multi-channel information of the image as a guide, improving the density of sparse point cloud data by a neighborhood expansion method, and outputting a semi-dense depth map; the variable-weight Gaussian modulation module specifically comprises the following components: generating Gaussian modulation functions with different weights according to the data reliability of the semi-dense depth map, and modulating the depth dimensions at different pixel positions of the cost volume;
(2) Training the depth recovery network constructed in the step (1), inputting binocular images and sparse point cloud data by using a binocular data set, projecting the sparse point cloud data to a left-eye camera coordinate system to generate a sparse depth image, comparing a depth truth image, carrying out data enhancement on the binocular images and the sparse depth image, calculating and outputting loss values of dense depth images, and iteratively updating network weights by using a counter-propagation network;
(3) And (3) inputting the binocular image to be tested and sparse point cloud data into the depth recovery network obtained by training in the step (2), and projecting the sparse point cloud data to a left-eye camera coordinate system to generate a sparse depth image by utilizing sensor calibration parameters so as to output a dense depth image.
2. The binocular and point cloud fusion depth restoration method according to claim 1, wherein constructing the sparse expansion module comprises the sub-steps of:
(a1) Acquiring a sparse depth map according to the pose relation between the point cloud data and the left-eye camera image, and respectively extracting pixel coordinates of effective points in the sparse depth map, corresponding image multichannel values and image multichannel values of points in the neighborhood of the image multichannel values;
(a2) Calculating average image numerical deviation according to the image multi-channel numerical value corresponding to the pixel coordinates of the effective points and the image multi-channel numerical value of the adjacent points;
(a3) And expanding the sparse depth map into a semi-dense depth map according to the average image numerical deviation of the effective points and a set fixed threshold, and outputting the semi-dense depth map.
3. The binocular and point cloud fusion depth restoration method according to claim 1, wherein the multi-scale feature extraction and fusion module specifically comprises: taking the semi-dense depth map output by the sparse expansion module and the binocular image as input, adopting a Unet encoder-decoder structure combined with a spatial pyramid pooling method to extract point cloud features, left-eye image features and right-eye image features, and further fusing the left-eye image features and the point cloud features in a cascading manner at the feature layer to obtain fusion features.
4. A binocular and point cloud fusion depth restoration method according to claim 3, wherein constructing the multi-scale feature extraction and fusion module comprises the sub-steps of:
(b1) Respectively carrying out multi-layer downsampling coding on the semi-dense depth map and the binocular image which are output by the sparse expansion module so as to obtain left-eye image characteristics, right-eye image characteristics and point cloud characteristics after downsampling coding of a plurality of scales;
(b2) Respectively carrying out spatial pyramid pooling treatment on the left eye image characteristic, the right eye image characteristic and the point cloud characteristic which are subjected to downsampling coding with the lowest resolution so as to obtain a pooling treatment result;
(b3) Respectively carrying out multi-layer up-sampling decoding on the results obtained after the pooling treatment of the left-eye image features, the right-eye image features and the point cloud features so as to obtain left-eye image features, right-eye image features and point cloud features obtained after up-sampling decoding of a plurality of scales;
(b4) And cascading the up-sampled and decoded left-eye image features and the point cloud features in feature dimensions to obtain fusion features of the left-eye image features and the point cloud features.
5. The binocular and point cloud fusion depth restoration method according to claim 1, wherein constructing the variable weight gaussian modulation module comprises the sub-steps of:
(c1) Constructing a cost volume in a cascading mode according to the fusion characteristics and the right-eye image characteristics;
(c2) According to the reliability of the sparse point cloud, gaussian modulation functions with different weights are respectively constructed;
(c3) Modulating the cost roll according to the constructed Gaussian modulation function to obtain the modulated cost roll.
6. The binocular and point cloud fusion depth restoration method of claim 1, wherein constructing the cascaded three-dimensional convolutional neural network module comprises the sub-steps of:
(d1) Carrying out cost volume fusion and cost volume aggregation on the low-resolution cost volumes through a three-dimensional convolutional neural network so as to obtain aggregated cost volumes;
(d2) Acquiring softmax values of all depth values on each pixel coordinate by adopting a softmax function so as to obtain a low-resolution depth map;
(d3) And up-sampling is carried out according to the low-resolution depth map so as to obtain a prediction result of the high-resolution depth map, and three cascading iteration processes are carried out so as to obtain a dense depth map under the complete resolution.
7. A binocular and point cloud fusion depth restoration apparatus comprising one or more processors configured to implement the binocular and point cloud fusion depth restoration method of any one of claims 1-6.
8. A computer readable storage medium, having stored thereon a program which, when executed by a processor, is adapted to implement the binocular and point cloud fusion depth restoration method of any one of claims 1-6.
CN202310170221.2A 2023-02-27 2023-02-27 Binocular and point cloud fusion depth recovery method, device and medium Active CN115861401B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310170221.2A CN115861401B (en) 2023-02-27 2023-02-27 Binocular and point cloud fusion depth recovery method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310170221.2A CN115861401B (en) 2023-02-27 2023-02-27 Binocular and point cloud fusion depth recovery method, device and medium

Publications (2)

Publication Number Publication Date
CN115861401A CN115861401A (en) 2023-03-28
CN115861401B true CN115861401B (en) 2023-06-09

Family

ID=85659135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310170221.2A Active CN115861401B (en) 2023-02-27 2023-02-27 Binocular and point cloud fusion depth recovery method, device and medium

Country Status (1)

Country Link
CN (1) CN115861401B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112435325A (en) * 2020-09-29 2021-03-02 北京航空航天大学 VI-SLAM and depth estimation network-based unmanned aerial vehicle scene density reconstruction method

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104346608B (en) * 2013-07-26 2017-09-08 株式会社理光 Sparse depth figure denseization method and apparatus
CN109685842B (en) * 2018-12-14 2023-03-21 电子科技大学 Sparse depth densification method based on multi-scale network
US10984543B1 (en) * 2019-05-09 2021-04-20 Zoox, Inc. Image-based depth data and relative depth data
US10937178B1 (en) * 2019-05-09 2021-03-02 Zoox, Inc. Image-based depth data and bounding boxes
CN110738731B (en) * 2019-10-16 2023-09-22 光沦科技(深圳)有限公司 3D reconstruction method and system for binocular vision
CN111028285A (en) * 2019-12-03 2020-04-17 浙江大学 Depth estimation method based on binocular vision and laser radar fusion
CN111563923B (en) * 2020-07-15 2020-11-10 浙江大华技术股份有限公司 Method for obtaining dense depth map and related device
CN112102472B (en) * 2020-09-01 2022-04-29 北京航空航天大学 Sparse three-dimensional point cloud densification method
CN114004754B (en) * 2021-09-13 2022-07-26 北京航空航天大学 Scene depth completion system and method based on deep learning
CN114519772A (en) * 2022-01-25 2022-05-20 武汉图科智能科技有限公司 Three-dimensional reconstruction method and system based on sparse point cloud and cost aggregation
CN115512042A (en) * 2022-09-15 2022-12-23 网易(杭州)网络有限公司 Network training and scene reconstruction method, device, machine, system and equipment
CN115511759A (en) * 2022-09-23 2022-12-23 西北工业大学 Point cloud image depth completion method based on cascade feature interaction


Also Published As

Publication number Publication date
CN115861401A (en) 2023-03-28

Similar Documents

Publication Publication Date Title
Jaritz et al. Sparse and dense data with cnns: Depth completion and semantic segmentation
Tang et al. Learning guided convolutional network for depth completion
US20200117906A1 (en) Space-time memory network for locating target object in video content
CN114782691A (en) Robot target identification and motion detection method based on deep learning, storage medium and equipment
CN110349087B (en) RGB-D image high-quality grid generation method based on adaptive convolution
CN116310076A (en) Three-dimensional reconstruction method, device, equipment and storage medium based on nerve radiation field
CN113850900B (en) Method and system for recovering depth map based on image and geometric clues in three-dimensional reconstruction
CN116797768A (en) Method and device for reducing reality of panoramic image
CN117095132B (en) Three-dimensional reconstruction method and system based on implicit function
CN117576292A (en) Three-dimensional scene rendering method and device, electronic equipment and storage medium
Polasek et al. Vision UFormer: Long-range monocular absolute depth estimation
Lyu et al. Learning a room with the occ-sdf hybrid: Signed distance function mingled with occupancy aids scene representation
CN112686830A (en) Super-resolution method of single depth map based on image decomposition
CN115861401B (en) Binocular and point cloud fusion depth recovery method, device and medium
US20230104702A1 (en) Transformer-based shape models
Wu et al. Non‐uniform image blind deblurring by two‐stage fully convolution network
CN113066165B (en) Three-dimensional reconstruction method and device for multi-stage unsupervised learning and electronic equipment
US20230145498A1 (en) Image reprojection and multi-image inpainting based on geometric depth parameters
CN115423697A (en) Image restoration method, terminal and computer storage medium
JP2024521816A (en) Unrestricted image stabilization
Deng et al. Cformer: An underwater image enhancement hybrid network combining convolution and transformer
KR102648938B1 (en) Method and apparatus for 3D image reconstruction based on few-shot neural radiance fields using geometric consistency
Du et al. Dehazing Network: Asymmetric Unet Based on Physical Model
CN117274066B (en) Image synthesis model, method, device and storage medium
CN117934733B (en) Full-open vocabulary 3D scene graph generation method, device, equipment and medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant