CN115546279A - Two-stage real-time binocular depth estimation method and device based on grouping mixing - Google Patents

Two-stage real-time binocular depth estimation method and device based on grouping mixing

Info

Publication number
CN115546279A
CN115546279A (application CN202211275720.XA)
Authority
CN
China
Prior art keywords
resolution
map
stage
disparity map
cost
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211275720.XA
Other languages
Chinese (zh)
Inventor
杨红
梁必发
黄锦皓
刘成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN202211275720.XA priority Critical patent/CN115546279A/en
Publication of CN115546279A publication Critical patent/CN115546279A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/97Determining parameters from multiple pictures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20088Trinocular vision calculations; trifocal tensor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20228Disparity calculation for image-based rendering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of this specification provides a two-stage real-time binocular depth estimation method and device based on grouping mixing, wherein the method comprises the following steps: performing feature extraction on an original input image to obtain feature maps at 1/4 and 1/8 of the resolution of the original input image; constructing a grouping distance cost volume from the 1/8-resolution feature map to obtain an aggregated matching cost volume and, from it, a first-stage disparity map; enlarging the 1/8-resolution coarse disparity map into a 1/4-resolution disparity map, constructing a grouping correlation cost volume, and generating a refined 1/4-resolution disparity map, from which a second-stage disparity map is obtained; and, based on the first-stage and second-stage disparity maps, performing model optimization against the loss function with an Adam optimizer to obtain an optimized model, and performing inference acceleration on the network layers of the optimized model with the TensorRT optimizer.

Description

Two-stage real-time binocular depth estimation method and device based on grouping mixing
Technical Field
The present invention relates to the field of computer technology, and in particular to a two-stage real-time binocular depth estimation method and apparatus based on grouping mixing, an electronic device, and a storage medium.
Background
Binocular depth estimation algorithms are widely applied in robot navigation, augmented reality, smart cities, autonomous driving, and other fields. Accurate and fast binocular depth estimation is therefore of great significance for resource-limited embedded platforms. In recent years, with the continuous innovation of deep convolutional neural networks, the accuracy of binocular depth estimation algorithms based on deep convolutional networks has improved markedly. However, current high-accuracy binocular depth estimation algorithms generally suffer from high computational cost, high power consumption, and high latency, which makes them difficult to deploy in real time on resource-limited embedded platforms.
A binocular depth estimation algorithm mainly comprises the following steps: feature extraction, cost volume construction, cost aggregation, and disparity regression. The first three steps (feature extraction, cost volume construction, and cost aggregation) largely determine the accuracy and inference speed of the network. For feature extraction, existing methods mainly use a U-Net to extract features from the stereo input images. Specifically, this network is a symmetric feature-encoding and feature-decoding architecture that can output feature maps of several sizes simultaneously; however, some important feature information is lost during U-Net encoding. For cost volume construction, existing methods mainly compute matching costs using full-distance, full-correlation, or group-wise correlation cost volumes. Full distance generates a single-channel distance map for each disparity level, and full correlation generates a single-channel correlation map for each disparity level; because each produces only one single-channel map, much feature information is lost. Group-wise correlation divides the left and right features into several groups, computes a correlation map group by group to obtain multiple cost matching schemes, and finally combines these schemes into a group-wise correlation volume. Although group-wise correlation retains more feature information, it still struggles to provide a good similarity measure quickly. For cost aggregation, existing methods mainly adjust the matching cost volume by combining a stacked hourglass model with intermediate supervision: an encoder-decoder architecture that, guided by intermediate supervision, repeatedly processes the cost volume from fine to coarse and back from coarse to fine.
However, since the stacked hourglass model consists of many three-dimensional convolution layers, its computation is expensive, so it cannot satisfy real-time deployment on resource-limited embedded devices.
To reduce the computational complexity of the cost aggregation step, existing methods perform depth estimation with a coarse-to-fine progressive refinement strategy. Specifically, a matching cost volume is first built from a low-resolution feature map to obtain a coarse disparity map; the disparity map is then upsampled by bilinear interpolation, and the coarse result of the previous stage is corrected with small disparity offsets at the higher resolution. This markedly reduces the computational complexity of cost aggregation, but because the cost volume is built from a low-resolution feature map, a high-accuracy disparity estimate is difficult to obtain. To further improve the disparity estimate, some methods apply the coarse-to-fine strategy over many stages; however, as the number of stages grows, their computation time increases significantly. In summary, existing algorithms still struggle to generate a high-accuracy disparity map in real time on a resource-limited embedded platform, so the problem to be solved is how to obtain high-accuracy disparity estimates in real time on such platforms.
Disclosure of Invention
The invention aims to provide a two-stage real-time binocular depth estimation method and device based on grouping mixing, an electronic device, and a storage medium, so as to solve the above problems in the prior art.
The invention provides a two-stage real-time binocular depth estimation method based on grouping mixing, comprising the following steps:
performing feature extraction on the original input image with a slice-convolution-based feature extractor to obtain feature maps at 1/4 and 1/8 of the resolution of the original input image;
constructing a grouping distance cost volume from the 1/8-resolution feature map, regularizing it with a lightweight cost aggregation network to obtain an aggregated matching cost volume, generating a 1/8-resolution coarse disparity map from the aggregated cost volume by disparity regression, and upsampling it to full size by bilinear interpolation to obtain the first-stage disparity map;
enlarging the 1/8-resolution coarse disparity map into a 1/4-resolution disparity map, applying dynamic offsets according to the 1/4-resolution disparity map and the left-image features to construct a grouping correlation cost volume, passing the cost volume through a cost aggregation network and disparity regression to obtain a residual map, adding the residual map to the enlarged 1/4-resolution disparity map to generate a refined 1/4-resolution disparity map, and upsampling it to full size by bilinear interpolation to obtain the second-stage disparity map;
based on the first-stage and second-stage disparity maps, performing model optimization against the loss function with an Adam optimizer to obtain an optimized model, and performing inference acceleration on the network layers of the optimized model with the TensorRT optimizer.
The invention further provides a two-stage real-time binocular depth estimation device based on grouping mixing, comprising:
an extraction module, configured to perform feature extraction on the original input image with a slice-convolution-based feature extractor to obtain feature maps at 1/4 and 1/8 of the resolution of the original input image;
a first-stage disparity map module, configured to construct a grouping distance cost volume from the 1/8-resolution feature map, regularize it with a lightweight cost aggregation network to obtain an aggregated matching cost volume, generate a 1/8-resolution coarse disparity map from the aggregated cost volume by disparity regression, and upsample it to full size by bilinear interpolation to obtain the first-stage disparity map;
a second-stage disparity map module, configured to enlarge the 1/8-resolution coarse disparity map into a 1/4-resolution disparity map, apply dynamic offsets according to the 1/4-resolution disparity map and the left-image features to construct a grouping correlation cost volume, pass the cost volume through a cost aggregation network and disparity regression to obtain a residual map, add the residual map to the enlarged 1/4-resolution disparity map to generate a refined 1/4-resolution disparity map, and upsample it to full size by bilinear interpolation to obtain the second-stage disparity map;
and an optimization and inference module, configured to perform model optimization against the loss function with an Adam optimizer based on the first-stage and second-stage disparity maps to obtain an optimized model, and to perform inference acceleration on the network layers of the optimized model with the TensorRT optimizer.
An embodiment of the present invention further provides an electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the above two-stage real-time binocular depth estimation method based on grouping mixing.
An embodiment of the present invention further provides a computer-readable storage medium on which an information-transmission implementation program is stored, wherein the program, when executed by a processor, implements the steps of the above two-stage real-time binocular depth estimation method based on grouping mixing.
Compared with the prior art, the embodiments of the invention have at least the following beneficial effects:
(1) In the feature extraction step, an embodiment of the invention provides a feature extractor based on slice convolution, which performs downsampling with consecutive slice convolutions. Compared with the prior art, downsampling by slice convolution retains more feature information.
(2) In the cost volume construction step, the embodiment of the invention provides a distance-related grouping mixing two-stage method to construct cost volumes; the distance-related grouping cost volume generates a multi-channel distance map and correlation map for each disparity level. Compared with the prior art, this strategy fuses more feature channel information and provides a better similarity measure.
(3) In the cost aggregation step, the embodiment of the invention provides a lightweight three-dimensional cost aggregation network in which each stage uses only four 5 × 5 × 5 three-dimensional convolution layers to optimize the distance-related grouping cost. Compared with the prior art, the computational cost of this aggregation is low, and the larger receptive field of the 5 × 5 × 5 three-dimensional convolutions improves accuracy in texture-less regions.
(4) The embodiment of the invention performs disparity estimation in a coarse-to-fine two-stage manner. Compared with the prior art, this greatly reduces the computational complexity of the cost aggregation step and markedly improves the inference speed of the network.
(5) The embodiment of the invention uses the TensorRT optimizer to accelerate inference at the network-layer level. Compared with the prior art, the model can be deployed in real time on a resource-limited NVIDIA Jetson Nano device while maintaining high accuracy on the KITTI 2012 and KITTI 2015 datasets, which is of great significance in fields such as robot navigation, augmented reality, smart cities, and autonomous driving.
Drawings
To more clearly illustrate one or more embodiments of the present specification or the prior-art solutions, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is evident that the following drawings cover only some of the embodiments described in this specification, and that those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a two-stage real-time binocular depth estimation method based on grouping mixing according to an embodiment of the present invention;
FIG. 2 is a block diagram of the overall framework of the two-stage real-time binocular depth estimation method based on distance-related grouping mixing according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the slice-convolution-based feature extractor extracting 1/8- and 1/4-resolution features according to an embodiment of the present invention;
FIG. 4 is a diagram of the specific structure of the distance-related grouping cost volume with 32 groups according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of the cost aggregation network, each stage of which consists of four 5 × 5 × 5 three-dimensional convolution layers, according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a two-stage real-time binocular depth estimation apparatus based on grouping mixing according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention aims to overcome the defects and shortcomings of the prior art and provides a two-stage real-time binocular depth estimation method based on grouping mixing. The prior art easily loses effective feature information during feature extraction, cannot quickly provide a good similarity measure during cost volume construction, has difficulty regularizing the cost volume efficiently during cost aggregation, and struggles to generate a high-accuracy disparity map in real time on a resource-limited embedded platform. In contrast, through the proposed slice-convolution feature extractor, the distance-related grouping cost volume, the lightweight cost aggregation network, and the TensorRT-optimized model, the method can obtain high-accuracy disparity estimates in real time on a resource-limited embedded platform.
To enable those skilled in the art to better understand the technical solutions in one or more embodiments of this specification, these solutions are described below clearly and completely with reference to the drawings in one or more embodiments of this specification. The described embodiments are evidently only a part of the embodiments of this specification, not all of them; all other embodiments that a person skilled in the art can derive from them without inventive effort fall within the scope of protection of this document.
Method embodiment
According to an embodiment of the present invention, a two-stage real-time binocular depth estimation method based on grouping mixing is provided. FIG. 1 is a flowchart of this method. As shown in FIG. 1, the method specifically comprises the following steps:
101: performing feature extraction on the original input image with a slice-convolution-based feature extractor to obtain feature maps at 1/4 and 1/8 of the resolution of the original input image. Specifically:
With the slice-convolution-based feature extractor, the original input image is first downsampled level by level to 1/2, 1/4, and 1/8 resolution using slice convolutions; deep feature extraction is then performed on the 1/4- and 1/8-resolution features to obtain feature maps at 1/4 and 1/8 of the resolution of the original input image.
102: constructing a grouping distance cost volume from the 1/8-resolution feature map, regularizing it with a lightweight cost aggregation network to obtain an aggregated matching cost volume, generating a 1/8-resolution coarse disparity map from the aggregated cost volume by disparity regression, and upsampling it to full size by bilinear interpolation to obtain the first-stage disparity map. Constructing the grouping distance cost volume from the 1/8-resolution feature map specifically comprises:
According to Equation 1, the grouping distance cost volume is constructed from the 1/8-resolution feature map:

$$C_{gwd}(d, x, y, g) = \left\| f_l^{g}(x, y) - f_r^{g}(x - d, y) \right\|_{1} \tag{1}$$

where ||·||_1 denotes the L1 distance between two features, C_gwd is the grouping distance cost measuring the similarity of pixel points in the input pictures, d is the disparity, x and y are the pixel coordinates, g is the index of the feature group, and f_l and f_r are the left-image and right-image features, respectively.
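As an illustrative sketch (not taken from the patent), the grouping distance cost volume of Equation 1 can be computed in NumPy as follows; the feature shapes, the per-group averaging over channels, and the function name are assumptions made for the example:

```python
import numpy as np

def groupwise_distance_cost(f_l, f_r, num_groups, max_disp):
    """Group-wise L1-distance cost volume (sketch of Equation 1).

    f_l, f_r : left/right feature maps of shape (C, H, W), with C
    divisible by num_groups.
    Returns a cost volume of shape (num_groups, max_disp, H, W).
    """
    C, H, W = f_l.shape
    per_group = C // num_groups
    cost = np.zeros((num_groups, max_disp, H, W), dtype=f_l.dtype)
    for d in range(max_disp):
        for g in range(num_groups):
            ch = slice(g * per_group, (g + 1) * per_group)
            # Left pixel (x, y) is compared with right pixel (x - d, y);
            # columns with x < d have no valid match and stay zero.
            diff = f_l[ch, :, d:] - f_r[ch, :, :W - d if d else None]
            cost[g, d, :, d:] = np.abs(diff).mean(axis=0)
    return cost
```

Here the mean over each group's channels stands in for a per-group normalization factor, which the source does not specify.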
The lightweight cost aggregation network comprises four three-dimensional convolution layers with 5 × 5 × 5 kernels: the first 5 × 5 × 5 three-dimensional convolution raises the channel dimension of the cost volume, the second and third optimize the cost volume, and the fourth reduces the channel dimension of the cost volume.
Generating the 1/8-resolution coarse disparity map from the aggregated matching cost volume by disparity regression, and upsampling it to full size by bilinear interpolation to obtain the first-stage disparity map, specifically comprises:
Based on Equation 3, disparity regression is performed on the matching cost volume optimized by the cost aggregation network to obtain the 1/8-resolution coarse disparity map, which is upsampled to full size by bilinear interpolation to obtain the first-stage disparity map:

$$\hat{d} = \sum_{d=0}^{M-1} d \times \frac{\exp\left(-C_i(d)\right)}{\sum_{d'=0}^{M-1} \exp\left(-C_i(d')\right)} \tag{3}$$

where M is the maximum disparity (M = 24 in the first stage), d is the disparity, and C_i(d') and C_i(d) are the cost values at disparity levels d' and d for the current pixel i.
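The soft-argmin disparity regression of Equation 3 can be sketched in NumPy as follows; the array shapes and the function name are assumptions for the example:

```python
import numpy as np

def soft_argmin(cost):
    """Disparity regression (Equation 3): softmax over negated costs,
    then the expectation over disparity levels.

    cost : aggregated cost volume of shape (M, H, W).
    Returns a sub-pixel disparity map of shape (H, W).
    """
    M = cost.shape[0]
    neg = -cost
    # numerically stabilized softmax along the disparity axis
    e = np.exp(neg - neg.max(axis=0, keepdims=True))
    prob = e / e.sum(axis=0, keepdims=True)
    levels = np.arange(M).reshape(M, 1, 1)
    return (prob * levels).sum(axis=0)
```

Because the result is an expectation rather than a hard argmin, the operation is differentiable and yields sub-pixel disparities.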
103: enlarging the 1/8-resolution coarse disparity map into a 1/4-resolution disparity map; applying dynamic offsets according to the 1/4-resolution disparity map and the left-image features to construct a grouping correlation cost volume; passing the cost volume through a cost aggregation network and disparity regression to obtain a residual map; adding the residual map to the enlarged 1/4-resolution disparity map to generate a refined 1/4-resolution disparity map; and upsampling it to full size by bilinear interpolation to obtain the second-stage disparity map. Constructing the grouping correlation cost volume by applying dynamic offsets according to the 1/4-resolution disparity map and the left-image features specifically comprises:
Based on Equation 2, dynamic offsets are applied according to the 1/4-resolution disparity map and the left-image features to construct the grouping correlation cost volume:

$$C_{gwc}(d, x, y, g) = \left\langle f_l^{g}(x, y),\; f_r^{g}(x - d, y) \right\rangle \tag{2}$$

where <·,·> denotes the inner product between two features, C_gwc is the grouping correlation cost measuring the similarity of pixel points in the input pictures, d is the disparity, x and y are the pixel coordinates, g is the index of the feature group, and f_l and f_r are the left-image and right-image features, respectively.
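A minimal NumPy sketch of the group-wise correlation of Equation 2, evaluated at a single disparity hypothesis; the shapes, the per-group averaging, and the assumption that the right features have already been shifted or warped are choices made for the example:

```python
import numpy as np

def groupwise_correlation(f_l, f_r_shifted, num_groups):
    """Group-wise correlation (sketch of Equation 2): per-group inner
    product between left features and right features that have already
    been shifted (or warped) to the disparity hypothesis being scored.

    f_l, f_r_shifted : (C, H, W) with C divisible by num_groups.
    Returns one correlation map per group, shape (num_groups, H, W).
    """
    C, H, W = f_l.shape
    per_group = C // num_groups
    fl = f_l.reshape(num_groups, per_group, H, W)
    fr = f_r_shifted.reshape(num_groups, per_group, H, W)
    return (fl * fr).mean(axis=1)  # mean inner product over group channels
```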
104: based on the first-stage and second-stage disparity maps, performing model optimization against the loss function with an Adam optimizer to obtain an optimized model, and performing inference acceleration on the network layers of the optimized model with the TensorRT optimizer. Specifically:
The loss functions shown in Equations 4 to 6 are optimized with the Adam optimizer:

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \tag{4}$$

$$L_k = \frac{1}{N} \sum_{i=1}^{N} \mathrm{smooth}_{L_1}\!\left(d_i - d_i^{*}\right), \quad k \in \{1, 2\} \tag{5}$$

$$L = \lambda_1 L_1 + \lambda_2 L_2 \tag{6}$$

where x is the input variable, d_i is the estimated disparity and d_i^* is the ground-truth disparity at the same pixel, N is the number of candidate disparities, and L_1 and L_2 are the loss functions of the first and second stages, respectively. The two stages are trained jointly with different loss weights: the first-stage weight is λ_1 = 0.5 and the second-stage weight is λ_2 = 0.7.
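A sketch of a standard two-stage smooth-L1 training objective consistent with the weights λ_1 = 0.5 and λ_2 = 0.7 stated above; the exact form of the per-pixel penalty is an assumption, since the source renders the loss equations as images:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 penalty: quadratic near zero, linear in the tails."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def stage_loss(d_pred, d_gt):
    """Mean smooth-L1 error of one stage's disparity prediction."""
    return smooth_l1(d_pred - d_gt).mean()

def total_loss(d_stage1, d_stage2, d_gt, lam1=0.5, lam2=0.7):
    """Joint loss: weighted sum of the two stage losses."""
    return lam1 * stage_loss(d_stage1, d_gt) + lam2 * stage_loss(d_stage2, d_gt)
```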
The above technical solutions of the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
The embodiment of the invention discloses a two-stage real-time binocular depth estimation method based on grouping mixing, which comprises the following specific steps:
step a: extracting features from the input image with a slice-convolution-based feature extractor to obtain feature maps at 1/4 and 1/8 of the resolution of the original input image;
step b: constructing a grouping distance cost volume from the 1/8-resolution feature map extracted by the feature extractor;
step c: regularizing the cost volume with a lightweight cost aggregation network to obtain an aggregated matching cost volume;
step d: generating a 1/8-resolution coarse disparity map from the aggregated matching cost volume by disparity regression, and upsampling it to full size by bilinear interpolation to obtain the first-stage disparity map;
step e: enlarging the 1/8-resolution coarse disparity map from the first stage into a 1/4-resolution disparity map; applying dynamic offsets to this disparity map and the left-image features to construct a grouping correlation cost volume; passing the cost volume through a cost aggregation network and disparity regression to obtain a residual map; adding the residual map to the enlarged 1/4-resolution disparity map to generate a refined 1/4-resolution disparity map; and upsampling it to full size by bilinear interpolation to obtain the second-stage disparity map;
step f: optimizing the model against the loss function with an Adam optimizer;
step g: accelerating inference over the optimized model with TensorRT.
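Steps d and e both enlarge a disparity map to a higher resolution. One detail worth making explicit (left implicit in the source): because disparities are measured in pixels, resizing a disparity map also requires multiplying its values by the scale factor. A minimal sketch, using nearest-neighbour enlargement in place of the bilinear interpolation named in the text:

```python
import numpy as np

def upsample_disparity(disp, factor):
    """Enlarge a disparity map by an integer factor and rescale its values.

    Nearest-neighbour interpolation stands in for bilinear here; the
    essential point is the value scaling, since a shift of d pixels at
    low resolution corresponds to factor * d pixels at high resolution.
    """
    up = np.kron(disp, np.ones((factor, factor), dtype=disp.dtype))
    return up * factor
```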
Specifically, the slice-convolution feature extractor in step a is implemented as follows:
The feature extractor first processes the input through two 3 × 3 convolution layers with stride 1. The feature map is then successively reduced to 1/2, 1/4, and 1/8 of the resolution of the original input image by three consecutive slice convolutions and six ordinary convolutions, with the number of feature channels increasing at each step, as shown in Table 1. Each slice convolution consists of three parts: a 2 × 2 two-dimensional convolution with stride 2 performs the slicing; a Leaky ReLU improves the learning capacity of the network model; and Batch Normalization improves the convergence of the network model. The feature extractor thus downsamples the original input image level by level to 1/2, 1/4, and 1/8 resolution, and then performs deeper feature extraction on the 1/4- and 1/8-resolution features to obtain the 1/4- and 1/8-resolution feature maps, as shown in FIG. 3. The numbers of channels of the 1/4- and 1/8-resolution feature maps are 4c and 8c respectively, where c is a hyper-parameter of the feature extractor.
Table 1. Feature extractor based on slice convolution
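The "slicing" step of the slice convolution (the 2 × 2 convolution with stride 2) can be viewed as a space-to-depth rearrangement followed by a learned linear map. The rearrangement alone, sketched below in NumPy, shows why no pixel is discarded during downsampling, which is how the extractor retains more feature information than strided pooling; the function name and shapes are assumptions for the example:

```python
import numpy as np

def slice_downsample(x):
    """Space-to-depth view of the slicing step: each 2x2 spatial block is
    moved into the channel dimension, halving the resolution while
    keeping every input value.

    x : feature map of shape (C, H, W) with even H and W.
    Returns an array of shape (4C, H/2, W/2).
    """
    C, H, W = x.shape
    x = x.reshape(C, H // 2, 2, W // 2, 2)
    x = x.transpose(0, 2, 4, 1, 3)          # (C, 2, 2, H/2, W/2)
    return x.reshape(4 * C, H // 2, W // 2)
```

In the actual extractor, this rearrangement is realized by the learned 2 × 2 stride-2 convolution, followed by Leaky ReLU and Batch Normalization.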
Specifically, the distance-related grouping cost volume in step b is implemented as follows:
The distance-related grouping cost volume provided by the embodiment of the invention consists of two parts: a grouping distance cost volume and a grouping correlation cost volume. The basic idea of distance-related grouping is as follows. In the first stage, the distance cost at each disparity level is computed; in the second stage, the correlation cost at each disparity level is computed. In each stage, the left and right feature maps are first divided into several groups; a distance map (first stage) or correlation map (second stage) is then computed group by group, yielding multiple distance or correlation cost matching schemes; finally, these schemes are combined into a grouping distance cost volume or a grouping correlation cost volume, as shown in FIG. 4. The grouping distance cost C_gwd and the grouping correlation cost C_gwc are given by:
$$C_{gwd}(d, x, y, g) = \left\| f_l^{g}(x, y) - f_r^{g}(x - d, y) \right\|_{1} \tag{1}$$

$$C_{gwc}(d, x, y, g) = \left\langle f_l^{g}(x, y),\; f_r^{g}(x - d, y) \right\rangle \tag{2}$$

where ||·||_1 and <·,·> denote the L1 distance and the inner product between two features respectively, d is the disparity, x and y are the pixel coordinates, g is the index of the feature group, f_l denotes the left-image features, and f_r denotes the right-image features.
Specifically, the lightweight cost aggregation network in step c is implemented as follows:
To mitigate the effect of unstable factors in the input image, such as weak texture, occlusion, and ambiguous matching, on the distance-related grouping cost volume, the invention regularizes the cost volume with a lightweight cost aggregation network to obtain the aggregated matching cost volume, as shown in FIG. 5. The aggregation network consists of four three-dimensional convolution layers with 5 × 5 × 5 kernels. The first 5 × 5 × 5 three-dimensional convolution layer raises the channel dimension of the cost volume, which helps the network learn more contextual features. The second and third 5 × 5 × 5 three-dimensional convolution layers further optimize the cost volume. The fourth 5 × 5 × 5 three-dimensional convolution layer reduces the channel dimension to obtain the final refined cost volume. Each 5 × 5 × 5 three-dimensional convolution uses Batch Normalization and a ReLU activation function.
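The channel pattern of the four-layer aggregation network (raise the channel dimension, refine twice, project back down) can be sketched as follows. For brevity, 1 × 1 × 1 kernels with random weights stand in for the trained 5 × 5 × 5 convolutions, and Batch Normalization is omitted; only the shape flow matches the description above:

```python
import numpy as np

def conv3d_pointwise(x, w):
    """1x1x1 three-dimensional convolution: a per-voxel linear map over
    channels.  x: (C_in, D, H, W), w: (C_out, C_in)."""
    return np.einsum('oc,cdhw->odhw', w, x)

def aggregate(cost, hidden=8, seed=0):
    """Four-layer aggregation sketch over a cost volume of shape
    (groups, D, H, W); returns a single-channel volume (D, H, W)."""
    rng = np.random.default_rng(seed)
    g = cost.shape[0]
    relu = lambda t: np.maximum(t, 0.0)
    h = relu(conv3d_pointwise(cost, rng.standard_normal((hidden, g))))    # raise channel dim
    h = relu(conv3d_pointwise(h, rng.standard_normal((hidden, hidden))))  # refine
    h = relu(conv3d_pointwise(h, rng.standard_normal((hidden, hidden))))  # refine
    return conv3d_pointwise(h, rng.standard_normal((1, hidden)))[0]       # reduce channel dim
```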
Specifically, the method for implementing the disparity regression and the first-stage disparity map in step d is as follows:
after the aggregated matching cost value is obtained, the embodiment of the invention performs parallax regression on the cost value by utilizing soft argmin operation to obtain a roughly estimated parallax image with 1/8 resolution, and upsamples the roughly estimated parallax image into full size through bilinear interpolation to obtain a first-stage parallax image. The specific formula of the parallax regression is as follows:
d̂_i = Σ_{d=0}^{M−1} d · exp(−C_i(d)) / Σ_{d′=0}^{M−1} exp(−C_i(d′))    (Formula 3)

where d represents the disparity, M represents the maximum disparity (in the first stage M = 24), and C_i(d′) and C_i(d) represent the cost values at disparity levels d′ and d for the current pixel i. Specifically, the soft-argmin operation takes the negatives of the cost values C_i(d) and converts the cost volume into a probability volume via a softmax, then performs a weighted summation over all candidate disparity values; this recovers the disparity with the minimum matching cost while improving the convergence speed of the model.
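The soft-argmin operation described above can be sketched in a few lines of NumPy (an illustrative sketch; the function name and array layout are assumptions): negate the cost, apply a softmax over the disparity axis, and take the expectation over candidate disparities.

```python
import numpy as np

def soft_argmin(cost):
    """cost: (D, H, W) aggregated matching cost volume.
    Returns the (H, W) sub-pixel disparity map via soft-argmin."""
    logits = -cost                                       # negate: low cost -> high probability
    logits = logits - logits.max(axis=0, keepdims=True)  # subtract max for numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=0, keepdims=True)     # softmax over the disparity axis
    d = np.arange(cost.shape[0]).reshape(-1, 1, 1)       # candidate disparities 0..D-1
    return (probs * d).sum(axis=0)                       # expectation over disparities
```

Because the result is a weighted sum rather than a hard argmin, it is differentiable and yields sub-pixel disparity estimates.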
Specifically, the method for implementing residual prediction and the second-stage disparity map in step e is as follows:
in order to reduce the maximum disparity to be searched between the two images of the input stereo pair, this patent limits the residual disparity to the range −2 to 2 through residual prediction. Specifically, the roughly estimated 1/8-resolution disparity map of the first stage is first upsampled into a 1/4-resolution disparity map, which is used to warp the second-stage right-image features. Then, distance-correlation grouping cost volumes are constructed from the 1/4-resolution left-image features extracted by the feature extractor and the warped right-image features. Finally, a residual map is obtained through the cost aggregation network and disparity regression, and is added to the upsampled first-stage disparity map to correct the disparity, generating a finely estimated 1/4-resolution disparity map, which is upsampled to full size through bilinear interpolation to obtain the second-stage disparity map.
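The warping step above shifts each right-image feature column by the coarse disparity, so that only a small residual (bounded to [−2, 2]) remains to be predicted. A hedged NumPy sketch using nearest-neighbour sampling is shown below; a real implementation would interpolate bilinearly, and the function name is an assumption.

```python
import numpy as np

def warp_right_features(f_right, disparity):
    """f_right: (C, H, W) right-image features; disparity: (H, W) coarse map.
    Returns right features resampled so they align with the left image:
    warped(x) = f_right(x - d)."""
    c, h, w = f_right.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)              # target x coordinates
    src = np.clip(np.round(xs - disparity).astype(int), 0, w - 1)  # source columns
    rows = np.arange(h)[:, None].repeat(w, axis=1)            # row indices
    return f_right[:, rows, src]
```

After this alignment, the second-stage cost volume only needs to cover the small residual disparity range rather than the full range of the first stage.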
Specifically, the method for implementing the loss function in step f is as follows:
after obtaining the disparity maps, in order to improve the convergence speed of the model, the embodiment of the present invention optimizes the objective function by using an Adam optimizer with β1 = 0.9 and β2 = 0.999. The loss weight coefficients of the first stage and the second stage are different: the weight coefficient of the first stage is λ1 = 0.5 and the weight coefficient of the second stage is λ2 = 0.7. The loss function is given by the following formulas:

smooth_L1(x) = 0.5 x²,  if |x| < 1;  |x| − 0.5, otherwise    (Formula 4)

L^(k) = (1/N) Σ_{i=1}^{N} smooth_L1(d_i − d_i*),  k = 1, 2    (Formula 5)

L_sum = λ1 · L^(1) + λ2 · L^(2)    (Formula 6)

where x is the input variable of the loss function, d_i is the disparity estimate at pixel i, d_i* is the ground-truth disparity at the same pixel, N is the number of pixels with ground-truth disparity, L^(1) is the loss function of the first stage, L^(2) is the loss function of the second stage, and L_sum represents the final loss function of the model.
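The two-stage weighted loss with λ1 = 0.5 and λ2 = 0.7 can be sketched as follows. The smooth-L1 form matches the piecewise definition used above; the function names are assumptions, and the sketch averages over all pixels for simplicity (a real implementation would mask out pixels without ground truth).

```python
import numpy as np

def smooth_l1(x):
    """Piecewise smooth-L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x * x, ax - 0.5)

def stage_loss(d_pred, d_gt):
    """Mean smooth-L1 error of one stage's disparity map against ground truth."""
    return smooth_l1(d_pred - d_gt).mean()

def total_loss(d1, d2, d_gt, lam1=0.5, lam2=0.7):
    """Weighted sum of the first-stage and second-stage losses."""
    return lam1 * stage_loss(d1, d_gt) + lam2 * stage_loss(d2, d_gt)
```

The larger weight on the second stage pushes the optimizer to prioritize the refined disparity map while still supervising the coarse estimate.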
Specifically, the TensorRT reasoning acceleration implementation method of the step g is as follows:
the method is pre-trained for 40 epochs on the SceneFlow dataset, and the pre-trained weights are then fine-tuned for 800 epochs on the KITTI 2012 and KITTI 2015 datasets. Finally, inference of the fine-tuned model is accelerated with the TensorRT optimizer on an NVIDIA Jetson Nano embedded device, and the optimized model is deployed on the Jetson Nano device.
In summary, the embodiments of the present invention provide a distance-correlation grouping-mixing two-stage real-time binocular depth estimation method that performs disparity estimation efficiently and in real time. The feature extractor provided by the invention uses block convolution for downsampling, which retains more effective feature information and generates high-precision feature maps. The distance-correlation grouping provided by the invention constructs the cost volume and provides a good similarity measure. The lightweight cost aggregation network provided by the invention has a compact structure, adopting only four 5 × 5 × 5 three-dimensional convolution layers per stage, so the network performs cost aggregation at a fast inference speed. Furthermore, existing techniques are generally unable to generate high-precision disparity maps in real time on resource-constrained embedded devices. In contrast, the network is accelerated with the TensorRT optimizer and can be deployed in real time on resource-constrained NVIDIA Jetson Nano devices while maintaining high accuracy.
Apparatus embodiment one
According to an embodiment of the present invention, there is provided a two-stage real-time binocular depth estimation apparatus based on grouping mixing. Fig. 6 is a schematic diagram of the two-stage real-time binocular depth estimation apparatus based on grouping mixing according to the embodiment of the present invention. As shown in fig. 6, the apparatus specifically includes:
an extraction module 60, configured to perform feature extraction on the original input image by using a feature extractor based on the cut-block convolution, so as to obtain a feature map with a resolution of 1/4 and 1/8 of the original input image;
the first-stage disparity map module 62 is configured to construct a grouping distance cost value by using a feature map with a resolution of 1/8, perform regularization processing on the grouping distance cost value by using a lightweight cost aggregation network to obtain an aggregated matching cost value, generate a coarse-estimation disparity map with a resolution of 1/8 by performing disparity regression on the aggregated matching cost value, and perform upsampling to a full size by using bilinear interpolation to obtain a first-stage disparity map;
a second-stage disparity map module 64, configured to amplify the 1/8-resolution rough estimation disparity map into a 1/4-resolution disparity map, construct a grouping related cost value according to the 1/4-resolution disparity map and the left image features through dynamic offset, obtain a residual map by using the grouping related cost value through a cost aggregation network and disparity regression, add the residual map to the amplified 1/4-resolution disparity map, generate a 1/4-resolution precisely estimated disparity map, and perform bilinear interpolation up-sampling to full size to obtain a second-stage disparity map;
and the optimization reasoning module 66 is used for performing model optimization on the loss function by using an Adam optimizer based on the first-stage disparity map and the second-stage disparity map to obtain an optimization model, and performing reasoning acceleration on a network layer of the optimization model by using a TensorRT optimizer.
Compared with the prior art, the embodiment of the invention has at least the following beneficial effects:
(1) In the feature extraction step, an embodiment of the present invention provides a feature extractor based on block convolution, which uses consecutive block convolutions to perform downsampling. Compared with the prior art, downsampling with block convolution retains more feature information.
(2) In the cost volume construction step, the embodiment of the invention provides a distance-correlation grouping-mixing two-stage method to construct the cost volumes, in which the grouping cost volumes generate a multichannel distance map and a multichannel correlation map for each disparity level. Compared with the prior art, this strategy fuses more feature channel information and provides a better similarity measure.
(3) In the cost aggregation step, the embodiment of the invention provides a lightweight three-dimensional cost aggregation network in which each stage uses only four 5 × 5 × 5 three-dimensional convolution layers to optimize the distance-correlation grouping cost volume. Compared with the prior art, the computational cost of cost aggregation is lower, and the 5 × 5 × 5 three-dimensional convolution has a larger receptive field, which improves the accuracy on texture-less regions.
(4) The embodiment of the invention adopts a two-stage processing mode from coarse to fine to carry out parallax estimation. Compared with the prior art, the processing mode can greatly reduce the calculation complexity of the cost aggregation step and obviously improve the reasoning speed of the network.
(5) The embodiment of the invention uses the TensorRT optimizer to accelerate inference at the network layer. Compared with the prior art, the model can be deployed in real time on resource-constrained NVIDIA Jetson Nano devices while maintaining high accuracy on the KITTI 2012 and KITTI 2015 datasets, which is of great significance in fields such as robot navigation, augmented reality, smart cities and automatic driving.
The embodiment of the present invention is an apparatus embodiment corresponding to the above method embodiment, and specific operations of each module may be understood with reference to the description of the method embodiment, which is not described herein again.
Device embodiment II
An embodiment of the present invention provides an electronic device, as shown in fig. 7, including: a memory 70, a processor 72 and a computer program stored on the memory 70 and executable on the processor 72, the computer program, when executed by the processor 72, implementing the steps as described in the method embodiments.
Example III of the device
An embodiment of the present invention provides a computer-readable storage medium, on which a program for implementing information transmission is stored, and when the program is executed by a processor 72, the program implements the steps described in the method embodiment.
The computer-readable storage medium of this embodiment includes, but is not limited to: ROM, RAM, magnetic or optical disks, and the like.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A two-stage real-time binocular depth estimation method based on grouping mixing, characterized by comprising the following steps:
performing feature extraction on the original input image by using a feature extractor based on block convolution to obtain feature maps of 1/4 and 1/8 resolution relative to the original input image;
constructing grouping distance cost values by using a feature map with 1/8 resolution, performing regularization processing on the grouping distance cost values through a lightweight cost aggregation network to obtain aggregated matching cost values, generating a rough estimation disparity map with 1/8 resolution by using disparity regression on the aggregated matching cost values, and performing upsampling to full size through bilinear interpolation to obtain a first-stage disparity map;
amplifying the 1/8-resolution rough estimation disparity map into a 1/4-resolution disparity map, performing dynamic offset according to the 1/4-resolution disparity map and left map features to construct grouping related cost values, obtaining a residual map through cost aggregation network and disparity regression by the grouping related cost values, adding the residual map into the amplified 1/4-resolution disparity map to generate a 1/4-resolution precisely estimated disparity map, and performing up-sampling to full size through bilinear interpolation to obtain a second-stage disparity map;
based on the first-stage disparity map and the second-stage disparity map, model optimization is carried out on the loss function by using an Adam optimizer to obtain an optimization model, and reasoning acceleration is carried out on a network layer of the optimization model by using a TensorRT optimizer.
2. The method according to claim 1, wherein performing feature extraction on the original input image by using the feature extractor based on block convolution to obtain feature maps at 1/4 and 1/8 of the resolution of the original input image specifically comprises:
with the feature extractor based on block convolution, the original input image is first downsampled level by level to 1/2 resolution, 1/4 resolution and 1/8 resolution using block convolution, and deep feature extraction is then performed on the 1/4-resolution and 1/8-resolution features to obtain feature maps at 1/4 and 1/8 of the resolution of the original input image.
3. The method according to claim 1, wherein constructing the grouping distance cost value using the 1/8-resolution feature map specifically comprises:
according to equation 1, a 1/8 resolution feature map is used to construct the grouping distance cost metric:
C_gwd(d, x, y, g) = || f_l^g(x, y) − f_r^g(x − d, y) ||_1    (Formula 1)

wherein || · − · ||_1 denotes computing the L1 distance between two features, C_gwd represents the grouping distance cost value measuring the similarity of pixel points in the input pictures, d denotes the disparity, (x, y) denotes the pixel coordinate, g denotes the index of the feature group, and f_l and f_r respectively denote the left-image features and the right-image features.
4. The method according to claim 1, wherein constructing the grouping correlation cost value by performing dynamic offset according to the 1/4-resolution disparity map and the left-image features specifically comprises:
based on a formula 2, carrying out dynamic offset according to the 1/4 resolution disparity map and the left map characteristics to construct grouping related cost quantity:
C_gwc(d, x, y, g) = ⟨ f_l^g(x, y), f_r^g(x − d, y) ⟩    (Formula 2)

wherein ⟨ ·, · ⟩ denotes computing the inner product between two features, C_gwc represents the grouping correlation cost value measuring the similarity of pixel points in the input pictures, d denotes the disparity, (x, y) denotes the pixel coordinate, g denotes the index of the feature group, and f_l and f_r respectively denote the left-image features and the right-image features.
5. The method of claim 1, wherein the lightweight cost aggregation network comprises four three-dimensional convolution layers with convolution kernel size 5 × 5 × 5, wherein the first 5 × 5 × 5 three-dimensional convolution is used to increase the channel dimension of the cost volume, the second and third 5 × 5 × 5 three-dimensional convolutions are used to optimize the cost volume, and the fourth 5 × 5 × 5 three-dimensional convolution layer is used to reduce the channel dimension of the cost volume.
6. The method according to claim 1, wherein the step of generating a 1/8 resolution roughly-estimated disparity map by performing disparity regression on the aggregated matching cost values, and performing upsampling to full size by bilinear interpolation to obtain a first-stage disparity map specifically comprises:
based on formula 3, performing disparity regression by using the matching cost value after the cost aggregation network optimization to obtain a 1/8 resolution rough estimation disparity map, and performing up-sampling to full size through bilinear interpolation to obtain a first-stage disparity map:
d̂_i = Σ_{d=0}^{M−1} d · exp(−C_i(d)) / Σ_{d′=0}^{M−1} exp(−C_i(d′))    (Formula 3)

where M represents the maximum disparity (in the first stage M = 24), d represents the disparity, and C_i(d′) and C_i(d) represent the cost values at disparity levels d′ and d for the current pixel i.
7. The method of claim 1, wherein optimizing the model of the loss function using an Adam optimizer based on the first stage disparity map and the second stage disparity map specifically comprises:
optimizing the loss functions shown in Formulas 4 to 6 using an Adam optimizer, wherein:
smooth_L1(x) = 0.5 x²,  if |x| < 1;  |x| − 0.5, otherwise    (Formula 4)

L^(k) = (1/N) Σ_{i=1}^{N} smooth_L1(d_i − d_i*),  k = 1, 2    (Formula 5)

L_sum = λ1 · L^(1) + λ2 · L^(2)    (Formula 6)

where x is the input variable, d_i is the disparity estimate at pixel i, d_i* is the ground-truth disparity at the same pixel, N is the number of pixels with ground-truth disparity, and λ1 and λ2 are respectively the loss weight coefficients of the first-stage and second-stage joint training, the weight coefficient of the first stage being λ1 = 0.5 and that of the second stage being λ2 = 0.7.
8. A two-stage real-time binocular depth estimation apparatus based on grouping mixing, comprising:
the extraction module is used for extracting the features of the original input image by using a feature extractor based on the block convolution to obtain a feature map of 1/4 and 1/8 resolution relative to the original input image;
the first-stage disparity map module is used for constructing grouping distance cost values by using a feature map with 1/8 resolution, performing regularization processing on the grouping distance cost values through a lightweight cost aggregation network to obtain aggregated matching cost values, generating a rough estimation disparity map with 1/8 resolution by the aggregated matching cost values through disparity regression, and performing up-sampling to full size through bilinear interpolation to obtain a first-stage disparity map;
the second-stage disparity map module is used for amplifying the 1/8-resolution rough estimation disparity map into a 1/4-resolution disparity map, carrying out dynamic offset according to the 1/4-resolution disparity map and left map features to construct grouping related cost values, obtaining a residual map by the grouping related cost values through cost aggregation network and disparity regression, adding the residual map into the amplified 1/4-resolution disparity map to generate a 1/4-resolution precisely estimated disparity map, and carrying out bilinear interpolation up-sampling to obtain a full size to obtain a second-stage disparity map;
and the optimization reasoning module is used for performing model optimization on the loss function by using an Adam optimizer based on the first-stage disparity map and the second-stage disparity map to obtain an optimization model, and performing reasoning acceleration on a network layer of the optimization model by using a TensorRT optimizer.
9. An electronic device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the grouping-mixing based two-stage real-time binocular depth estimation method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a program for implementing information transfer, which, when executed by a processor, implements the steps of the grouping-mixing based two-stage real-time binocular depth estimation method according to any one of claims 1 to 7.
CN202211275720.XA 2022-10-18 2022-10-18 Two-stage real-time binocular depth estimation method and device based on grouping mixing Pending CN115546279A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211275720.XA CN115546279A (en) 2022-10-18 2022-10-18 Two-stage real-time binocular depth estimation method and device based on grouping mixing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211275720.XA CN115546279A (en) 2022-10-18 2022-10-18 Two-stage real-time binocular depth estimation method and device based on grouping mixing

Publications (1)

Publication Number Publication Date
CN115546279A true CN115546279A (en) 2022-12-30

Family

ID=84735585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211275720.XA Pending CN115546279A (en) 2022-10-18 2022-10-18 Two-stage real-time binocular depth estimation method and device based on grouping mixing

Country Status (1)

Country Link
CN (1) CN115546279A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117218263A (en) * 2023-09-12 2023-12-12 山东捷瑞信息技术产业研究院有限公司 Texture lightweight optimization method and system based on three-dimensional engine
CN117218263B (en) * 2023-09-12 2024-03-19 山东捷瑞信息技术产业研究院有限公司 Texture lightweight optimization method and system based on three-dimensional engine

Similar Documents

Publication Publication Date Title
CN109472819B (en) Binocular parallax estimation method based on cascade geometric context neural network
CN110232394B (en) Multi-scale image semantic segmentation method
CN111028146B (en) Image super-resolution method for generating countermeasure network based on double discriminators
CN109101975B (en) Image semantic segmentation method based on full convolution neural network
CN112016507B (en) Super-resolution-based vehicle detection method, device, equipment and storage medium
CN111915660B (en) Binocular disparity matching method and system based on shared features and attention up-sampling
CN112435282B (en) Real-time binocular stereo matching method based on self-adaptive candidate parallax prediction network
CN110533712A (en) A kind of binocular solid matching process based on convolutional neural networks
CN111507521A (en) Method and device for predicting power load of transformer area
CN109005398B (en) Stereo image parallax matching method based on convolutional neural network
CN111046917B (en) Object-based enhanced target detection method based on deep neural network
CN112270366B (en) Micro target detection method based on self-adaptive multi-feature fusion
CN115546279A (en) Two-stage real-time binocular depth estimation method and device based on grouping mixing
CN116469100A (en) Dual-band image semantic segmentation method based on Transformer
CN111553296B (en) Two-value neural network stereo vision matching method based on FPGA
CN112785499A (en) Super-resolution reconstruction model training method and computer equipment
CN113240683A (en) Attention mechanism-based lightweight semantic segmentation model construction method
CN114913327A (en) Lower limb skeleton CT image segmentation algorithm based on improved U-Net
CN112507849A (en) Dynamic-to-static scene conversion method for generating countermeasure network based on conditions
CN115100148A (en) Crop pest detection method based on light-weight convolutional neural network
CN116612288B (en) Multi-scale lightweight real-time semantic segmentation method and system
CN113538402A (en) Crowd counting method and system based on density estimation
CN114373104A (en) Three-dimensional point cloud semantic segmentation method and system based on dynamic aggregation
CN114299358A (en) Image quality evaluation method and device, electronic equipment and machine-readable storage medium
CN116258758A (en) Binocular depth estimation method and system based on attention mechanism and multistage cost body

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination