CN115546279A - Two-stage real-time binocular depth estimation method and device based on grouping mixing - Google Patents

Publication number
CN115546279A
Authority
CN
China
Prior art keywords
resolution
map
stage
disparity map
cost
Legal status
Pending
Application number
CN202211275720.XA
Other languages
Chinese (zh)
Inventor
杨红
梁必发
黄锦皓
刘成
Current Assignee
Guangzhou University
Original Assignee
Guangzhou University
Application filed by Guangzhou University
Priority to CN202211275720.XA
Publication of CN115546279A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007 Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/97 Determining parameters from multiple pictures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20088 Trinocular vision calculations; trifocal tensor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20228 Disparity calculation for image-based rendering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of this specification provide a two-stage real-time binocular depth estimation method and device based on grouping mixing. The method comprises: extracting features from the original input image to obtain feature maps at 1/4 and 1/8 of the resolution of the original input image; constructing a group-wise distance cost volume from the 1/8-resolution feature map, aggregating it to obtain an aggregated matching cost volume, and from that obtaining a first-stage disparity map; upsampling the 1/8-resolution coarse disparity map to a 1/4-resolution disparity map, constructing a group-wise correlation cost volume, generating a 1/4-resolution refined disparity map, and from that obtaining a second-stage disparity map; and, based on the first-stage and second-stage disparity maps, optimizing the model on the loss function with an Adam optimizer to obtain an optimized model, and accelerating inference of the network layers of the optimized model with a TensorRT optimizer.

Description

Two-stage real-time binocular depth estimation method and device based on grouping mixing
Technical Field
The present invention relates to the field of computer technologies, and in particular to a two-stage real-time binocular depth estimation method and apparatus based on grouping mixing, an electronic device, and a storage medium.
Background
Binocular depth estimation algorithms are widely used in robot navigation, augmented reality, smart cities, autonomous driving, and related fields. An accurate and fast binocular depth estimation algorithm is therefore important for resource-constrained embedded platforms. In recent years, with continued advances in deep convolutional neural networks, the accuracy of binocular depth estimation algorithms based on deep convolutional networks has improved markedly. However, current high-accuracy algorithms generally suffer from high computational cost, high power consumption, and high latency, which makes them difficult to deploy in real time on resource-constrained embedded platforms.
A binocular depth estimation algorithm mainly comprises four steps: feature extraction, cost volume construction, cost aggregation, and disparity regression. The first three steps largely determine the accuracy and inference speed of the network. For feature extraction, existing methods mainly use a U-Net network to extract features from the stereo input images. This network is a symmetric feature-encoding and feature-decoding architecture that can output feature maps of several sizes at once. However, some important feature information is lost during U-Net encoding. For cost volume construction, existing methods mainly use full-distance, full-correlation, and group-wise correlation cost volumes to compute the matching cost. Full distance generates a single-channel distance map for each disparity level, and full correlation generates a single-channel correlation map for each disparity level. Because each produces only one single-channel map per disparity level, much feature information is lost. Group-wise correlation divides the left and right features into several groups, computes a correlation map group by group to obtain multiple matching cost proposals, and finally combines these proposals into a group-wise correlation cost volume. Although group-wise correlation retains more feature information, it still struggles to provide a good similarity measure quickly. For cost aggregation, existing methods mainly combine a stacked hourglass model with intermediate supervision to regularize the matching cost volume. This is an encoder-decoder architecture that repeatedly processes the volume from fine to coarse and then from coarse to fine, combined with intermediate supervision.
However, because the stacked hourglass model consists of many three-dimensional convolution layers, it is computationally expensive and cannot meet the requirements of real-time deployment on resource-constrained embedded devices.
To reduce the computational complexity of cost aggregation, existing methods adopt a coarse-to-fine progressive refinement strategy for depth estimation. Such a method first builds a matching cost volume from a low-resolution feature map to obtain a coarse disparity map, then upsamples the disparity map by bilinear interpolation and corrects the coarse result of the previous stage with a small disparity offset at the higher resolution. This significantly reduces the complexity of cost aggregation, but because the cost volume is built from a low-resolution feature map, a high-accuracy disparity estimate is hard to obtain. To further improve the disparity estimate, some methods use a multi-stage coarse-to-fine strategy; however, computation time grows noticeably as the number of stages increases. In summary, existing algorithms still cannot generate high-accuracy disparity maps in real time on resource-constrained embedded platforms, and how to obtain high-accuracy disparity estimates in real time on such platforms is the problem to be solved.
Disclosure of Invention
The invention aims to provide a two-stage real-time binocular depth estimation method and device based on grouping mixing, an electronic device, and a storage medium that solve the above problems in the prior art.
The invention provides a two-stage real-time binocular depth estimation method based on grouping and mixing, which comprises the following steps of:
performing feature extraction on the original input image with a feature extractor based on block convolution to obtain feature maps at 1/4 and 1/8 of the resolution of the original input image;
constructing a group-wise distance cost volume from the 1/8-resolution feature map, regularizing the group-wise distance cost volume with a lightweight cost aggregation network to obtain an aggregated matching cost volume, generating a 1/8-resolution coarse disparity map from the aggregated matching cost volume by disparity regression, and upsampling it to full size by bilinear interpolation to obtain a first-stage disparity map;
upsampling the 1/8-resolution coarse disparity map to a 1/4-resolution disparity map, constructing a group-wise correlation cost volume by dynamic offset according to the 1/4-resolution disparity map and the left-image features, passing the group-wise correlation cost volume through a cost aggregation network and disparity regression to obtain a residual map, adding the residual map to the upsampled 1/4-resolution disparity map to generate a 1/4-resolution refined disparity map, and upsampling it to full size by bilinear interpolation to obtain a second-stage disparity map;
based on the first-stage disparity map and the second-stage disparity map, optimizing the model on the loss function with an Adam optimizer to obtain an optimized model, and accelerating inference of the network layers of the optimized model with a TensorRT optimizer.
The invention provides a two-stage real-time binocular depth estimation device based on grouping mixing, which comprises:
an extraction module, configured to perform feature extraction on the original input image with a feature extractor based on block convolution to obtain feature maps at 1/4 and 1/8 of the resolution of the original input image;
a first-stage disparity map module, configured to construct a group-wise distance cost volume from the 1/8-resolution feature map, regularize the group-wise distance cost volume with a lightweight cost aggregation network to obtain an aggregated matching cost volume, generate a 1/8-resolution coarse disparity map from the aggregated matching cost volume by disparity regression, and upsample it to full size by bilinear interpolation to obtain a first-stage disparity map;
a second-stage disparity map module, configured to upsample the 1/8-resolution coarse disparity map to a 1/4-resolution disparity map, construct a group-wise correlation cost volume by dynamic offset according to the 1/4-resolution disparity map and the left-image features, pass the group-wise correlation cost volume through a cost aggregation network and disparity regression to obtain a residual map, add the residual map to the upsampled 1/4-resolution disparity map to generate a 1/4-resolution refined disparity map, and upsample it to full size by bilinear interpolation to obtain a second-stage disparity map;
and an optimization and inference module, configured to optimize the model on the loss function with an Adam optimizer based on the first-stage disparity map and the second-stage disparity map to obtain an optimized model, and to accelerate inference of the network layers of the optimized model with a TensorRT optimizer.
An embodiment of the present invention further provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the above two-stage real-time binocular depth estimation method based on grouping mixing.
An embodiment of the invention also provides a computer-readable storage medium storing an information transmission implementation program that, when executed by a processor, implements the steps of the above two-stage real-time binocular depth estimation method based on grouping mixing.
Compared with the prior art, the embodiment of the invention at least has the following beneficial effects:
(1) For feature extraction, an embodiment of the present invention provides a feature extractor based on block convolution that downsamples with successive block convolutions. Compared with the prior art, downsampling with block convolutions retains more feature information.
(2) For cost volume construction, the embodiment provides a two-stage grouping-mixing method that combines distance and correlation: the distance-correlation group-wise cost volume generates a multi-channel distance map and a multi-channel correlation map for each disparity level. Compared with the prior art, this strategy fuses information from more feature channels and provides a better similarity measure.
(3) For cost aggregation, the embodiment provides a lightweight three-dimensional cost aggregation network in which each stage uses only four 5 × 5 × 5 three-dimensional convolution layers to refine the distance-correlation group-wise cost volume. Compared with the prior art, the computational cost of aggregation is low, and the 5 × 5 × 5 three-dimensional convolutions have a larger receptive field, which improves accuracy in textureless regions.
(4) The embodiment performs disparity estimation in a two-stage coarse-to-fine manner. Compared with the prior art, this greatly reduces the computational complexity of cost aggregation and significantly improves the inference speed of the network.
(5) The embodiment uses a TensorRT optimizer to accelerate inference of the network layers. Compared with the prior art, the method can run in real time on a resource-constrained NVIDIA Jetson Nano device on the KITTI 2012 and KITTI 2015 datasets while maintaining high accuracy, which is significant for robot navigation, augmented reality, smart cities, autonomous driving, and related fields.
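The saving claimed in item (4) can be illustrated by counting cost-volume cells per feature group. The sketch below uses two figures given later in this specification (a maximum disparity of 24 at 1/8 resolution, taken here as 24 levels, and a residual range of -2 to 2 at 1/4 resolution, taken as 5 levels). The 384 × 1248 input size and the 48-level single-stage baseline at 1/4 resolution are illustrative assumptions and do not come from the patent.

```python
def cost_volume_cells(height, width, scale, disparity_levels):
    """Number of (disparity, y, x) cells in a cost volume built at 1/scale resolution."""
    return disparity_levels * (height // scale) * (width // scale)

HEIGHT, WIDTH = 384, 1248  # assumed input size, divisible by 8

# Stage one: coarse search at 1/8 resolution.
stage1 = cost_volume_cells(HEIGHT, WIDTH, scale=8, disparity_levels=24)
# Stage two: residual search over -2..2 at 1/4 resolution.
stage2 = cost_volume_cells(HEIGHT, WIDTH, scale=4, disparity_levels=5)
# Hypothetical single-stage baseline searching 48 levels directly at 1/4 resolution.
single_stage = cost_volume_cells(HEIGHT, WIDTH, scale=4, disparity_levels=48)

two_stage = stage1 + stage2
print(two_stage, single_stage, round(single_stage / two_stage, 2))
```

Under these assumptions the two stages together touch roughly a quarter of the cells of the single-stage baseline; the actual speed-up also depends on the aggregation network and the hardware.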
Drawings
To illustrate one or more embodiments of this specification or the prior-art solutions more clearly, the drawings needed in their description are briefly introduced below. The drawings described below show only some of the embodiments of this specification; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a two-stage real-time binocular depth estimation method based on grouping mixing according to an embodiment of the present invention;
FIG. 2 is a block diagram of the overall framework of the two-stage real-time binocular depth estimation method based on distance-correlation grouping mixing according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the block convolution feature extractor extracting 1/8-resolution and 1/4-resolution features according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the structure of the distance-correlation group-wise cost volume with 32 groups according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of the cost aggregation network, which uses four 5 × 5 × 5 three-dimensional convolution layers in each stage, according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a two-stage real-time binocular depth estimation apparatus based on grouping mixing according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention aims to overcome the shortcomings of the prior art and provides a two-stage real-time binocular depth estimation method based on grouping mixing. In the prior art, effective feature information is easily lost during feature extraction, a good similarity measure cannot be provided quickly during cost volume construction, and the cost volume is difficult to regularize efficiently during cost aggregation, so a high-accuracy disparity map is difficult to generate in real time on a resource-constrained embedded platform. By contrast, with the proposed block convolution feature extractor, distance-correlation group-wise cost volume, lightweight cost aggregation network, and TensorRT-optimized model, the method obtains high-accuracy disparity estimates in real time on such a platform.
In order to make those skilled in the art better understand the technical solutions in one or more embodiments of the present specification, the technical solutions in one or more embodiments of the present specification will be clearly and completely described below with reference to the drawings in one or more embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all embodiments. All other embodiments that can be derived by a person skilled in the art from one or more of the embodiments described herein without making any inventive step shall fall within the scope of protection of this document.
Method embodiment
According to an embodiment of the present invention, a two-stage real-time binocular depth estimation method based on grouping mixing is provided. FIG. 1 is a flowchart of the method. As shown in FIG. 1, the method specifically includes:
Step 101: extract features from the original input image with a feature extractor based on block convolution to obtain feature maps at 1/4 and 1/8 of the resolution of the original input image. Specifically:
using the feature extractor based on block convolution, the original input image is first downsampled stage by stage with block convolutions to 1/2, 1/4, and 1/8 resolution, and deeper feature extraction is then applied to the 1/4-resolution and 1/8-resolution features to obtain the feature maps at 1/4 and 1/8 of the resolution of the original input image.
Step 102: construct a group-wise distance cost volume from the 1/8-resolution feature map, regularize it with a lightweight cost aggregation network to obtain an aggregated matching cost volume, generate a 1/8-resolution coarse disparity map from the aggregated matching cost volume by disparity regression, and upsample it to full size by bilinear interpolation to obtain the first-stage disparity map. Constructing the group-wise distance cost volume from the 1/8-resolution feature map specifically comprises:
constructing the group-wise distance cost volume from the 1/8-resolution feature map according to Equation 1:
[Equation 1: formula image not reproduced in this text]
where $\|\cdot - \cdot\|_1$ denotes the L1 distance between two features, $C_{gwd}$ is the group-wise distance cost volume measuring the similarity of pixels in the input images, $d$ is the disparity, $x$ and $y$ denote feature vectors, $g$ is the index of the feature group, and $f_l$ and $f_r$ are the left-image and right-image features, respectively.
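The formula image for Equation 1 is not reproduced in this text. One form consistent with the symbols defined above would be the following, assuming the per-group averaging used for group-wise correlation in GwcNet and assuming that $x$ and $y$ index the pixel position. The symbols $N_c$ (number of feature channels) and $N_g$ (number of groups) are introduced only for this sketch; the original may normalize differently or not at all.

```latex
C_{gwd}(d, x, y, g) = \frac{1}{N_c / N_g} \left\| f_l^{g}(x, y) - f_r^{g}(x - d, y) \right\|_1
```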
The lightweight cost aggregation network comprises four three-dimensional convolution layers with 5 × 5 × 5 kernels: the first increases the dimensionality of the cost volume, the second and third refine the cost volume, and the fourth reduces the dimensionality of the cost volume.
Generating the 1/8-resolution coarse disparity map from the aggregated matching cost volume by disparity regression, and upsampling it to full size by bilinear interpolation to obtain the first-stage disparity map, specifically comprises:
performing disparity regression according to Equation 3 on the matching cost volume refined by the cost aggregation network to obtain the 1/8-resolution coarse disparity map, and upsampling it to full size by bilinear interpolation to obtain the first-stage disparity map:
$$\hat{d}_i = \sum_{d=0}^{M} d \times \frac{\exp(-C_i(d))}{\sum_{d'=0}^{M} \exp(-C_i(d'))}$$
where $M$ is the maximum disparity ($M = 24$ in the first stage), $d$ is the disparity, and $C_i(d')$ and $C_i(d)$ are the cost values at the current pixel $i$ for disparity levels $d'$ and $d$, respectively.
Step 103: upsample the 1/8-resolution coarse disparity map to a 1/4-resolution disparity map, construct a group-wise correlation cost volume by dynamic offset according to the 1/4-resolution disparity map and the left-image features, pass the group-wise correlation cost volume through the cost aggregation network and disparity regression to obtain a residual map, add the residual map to the upsampled 1/4-resolution disparity map to generate a 1/4-resolution refined disparity map, and upsample it to full size by bilinear interpolation to obtain the second-stage disparity map. Constructing the group-wise correlation cost volume by dynamic offset according to the 1/4-resolution disparity map and the left-image features specifically comprises:
constructing the group-wise correlation cost volume by dynamic offset according to the 1/4-resolution disparity map and the left-image features, according to Equation 2:
[Equation 2: formula image not reproduced in this text]
where $\langle\cdot,\cdot\rangle$ denotes the inner product between two features, $C_{gwc}$ is the group-wise correlation cost volume measuring the similarity of pixels in the input images, $d$ is the disparity, $x$ and $y$ denote feature vectors, $g$ is the index of the feature group, and $f_l$ and $f_r$ are the left-image and right-image features, respectively.
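The formula image for Equation 2 is likewise missing. Group-wise correlation as published for GwcNet has the form below, and it is offered here only as a plausible reading; $N_c$ and $N_g$ (channel and group counts) are symbols assumed for this sketch, and $x$ and $y$ are read as pixel coordinates. In the second stage $f_r$ would be the warped right-image feature and $d$ a small offset around the upsampled disparity.

```latex
C_{gwc}(d, x, y, g) = \frac{1}{N_c / N_g} \left\langle f_l^{g}(x, y),\; f_r^{g}(x - d, y) \right\rangle
```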
Step 104: based on the first-stage disparity map and the second-stage disparity map, optimize the model on the loss function with an Adam optimizer to obtain an optimized model, and accelerate inference of the network layers of the optimized model with a TensorRT optimizer. Specifically:
the loss function given by Equations 4 to 6 is optimized with the Adam optimizer:
[Equations 4 to 6: formula images not reproduced in this text]
where $x$ is the input variable, $d$ is the estimated disparity, which is compared with the ground-truth disparity at the same pixel, and $N$ is the number of candidate disparities. The loss functions of the first stage and the second stage are trained jointly with different weights: the weight of the first stage is $\lambda_1 = 0.5$ and the weight of the second stage is $\lambda_2 = 0.7$.
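Because the images for Equations 4 to 6 are missing, the following is only a sketch of a two-stage loss consistent with the description. The weighted sum with $\lambda_1 = 0.5$ and $\lambda_2 = 0.7$ comes from the text. The smooth L1 form of the per-pixel penalty, the averaging over valid pixels, and the convention that a ground-truth value of 0 marks a missing label are assumptions; the function names are hypothetical.

```python
import numpy as np

def smooth_l1(x):
    """Assumed per-pixel penalty: quadratic for |x| < 1, linear otherwise."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def stage_loss(pred, target):
    """Mean penalty over pixels that have ground truth (target > 0 is an assumed convention)."""
    valid = target > 0
    return float(smooth_l1(pred[valid] - target[valid]).mean())

def total_loss(pred_stage1, pred_stage2, target, lambda1=0.5, lambda2=0.7):
    """Weighted sum of the stage losses; the weights are the values given in the text."""
    return lambda1 * stage_loss(pred_stage1, target) + lambda2 * stage_loss(pred_stage2, target)

target = np.array([[10.0, 20.0], [0.0, 5.0]])   # 0 marks a pixel without ground truth
stage1 = np.array([[12.0, 20.5], [3.0, 5.0]])   # coarse prediction
stage2 = np.array([[10.5, 20.0], [3.0, 5.0]])   # refined prediction
loss = total_loss(stage1, stage2, target)
print(loss)
```

Both stage outputs are compared with the same full-size ground truth, which matches the text's statement that each stage is upsampled to full size.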
The above technical solutions of the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
The embodiment of the invention discloses a two-stage real-time binocular depth estimation method based on grouping mixing, which comprises the following specific steps of:
step a: extract features from the input image with a feature extractor based on block convolution to obtain feature maps at 1/4 and 1/8 of the resolution of the original input image;
step b: construct a group-wise distance cost volume from the 1/8-resolution feature map extracted by the feature extractor;
step c: regularize the cost volume with a lightweight cost aggregation network to obtain an aggregated matching cost volume;
step d: generate a 1/8-resolution coarse disparity map from the aggregated matching cost volume by disparity regression, and upsample it to full size by bilinear interpolation to obtain the first-stage disparity map;
step e: upsample the first-stage 1/8-resolution coarse disparity map to 1/4 resolution, construct a group-wise correlation cost volume by dynamic offset using this disparity map and the left-image features, pass the cost volume through a cost aggregation network and disparity regression to obtain a residual map, add the residual map to the upsampled 1/4-resolution disparity map to generate a 1/4-resolution refined disparity map, and upsample it to full size by bilinear interpolation to obtain the second-stage disparity map;
step f: optimize the model on the loss function with an Adam optimizer;
step g: accelerate inference of the optimized model with TensorRT.
Specifically, the block convolution feature extractor in step a is implemented as follows:
the feature extractor first convolves the input with two 3 × 3 convolution layers with stride 1. The feature map is then reduced successively to 1/2, 1/4, and 1/8 of the resolution of the original input image by three successive block convolutions and six ordinary convolutions, with the number of feature channels increasing along the way, as shown in Table 1. A block convolution comprises three operations: first, a 2 × 2 two-dimensional convolution with stride 2 cuts the map into blocks; second, Leaky ReLU is applied to improve the learning capacity of the network model; third, Batch Normalization is applied to improve the convergence of the network model. Having downsampled the original input image stage by stage to 1/2, 1/4, and 1/8 resolution, the feature extractor applies deeper feature extraction to the 1/4-resolution and 1/8-resolution features to obtain a 1/4-resolution feature map and a 1/8-resolution feature map, as shown in FIG. 3. The feature maps at 1/4 and 1/8 resolution have 4c and 8c channels, respectively, where c is a hyper-parameter of the feature extractor.
Table 1. Feature extractor based on block convolution
[Table 1: table image not reproduced in this text]
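The three operations of a block convolution can be sketched in NumPy as follows. This is a minimal illustration rather than the patented network: the weights are random, the normalization uses single-image statistics with no learned scale or shift as a stand-in for Batch Normalization, the ordinary convolutions between stages are omitted, and the 2c channel width at 1/2 resolution is an assumption (the text only gives 4c and 8c). The function name `block_conv` is hypothetical.

```python
import numpy as np

def block_conv(x, weight, negative_slope=0.01, eps=1e-5):
    """2x2 stride-2 convolution, then Leaky ReLU, then per-channel normalization.

    x: (C_in, H, W) with even H and W; weight: (C_out, C_in, 2, 2).
    """
    c_in, h, w = x.shape
    # Cut the map into non-overlapping 2x2 blocks: (C_in, H/2, 2, W/2, 2).
    blocks = x.reshape(c_in, h // 2, 2, w // 2, 2)
    # Each output pixel is a weighted sum over one block and all input channels.
    y = np.einsum("ciajb,ocab->oij", blocks, weight)
    y = np.where(y > 0, y, negative_slope * y)       # Leaky ReLU
    mean = y.mean(axis=(1, 2), keepdims=True)        # stand-in for Batch Normalization
    var = y.var(axis=(1, 2), keepdims=True)
    return (y - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
c = 4                                                # hypothetical value of the hyper-parameter c
x = rng.standard_normal((c, 64, 128))                # stands in for the output of the two 3x3 layers
half = block_conv(x, rng.standard_normal((2 * c, c, 2, 2)))
quarter = block_conv(half, rng.standard_normal((4 * c, 2 * c, 2, 2)))
eighth = block_conv(quarter, rng.standard_normal((8 * c, 4 * c, 2, 2)))
print(half.shape, quarter.shape, eighth.shape)
```

Because the 2 × 2 kernel with stride 2 covers every input pixel exactly once, no input value is skipped during downsampling, which is one way to read the claim that block convolution retains more feature information.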
Specifically, the distance-correlation group-wise cost volume in step b is implemented as follows:
the distance-correlation group-wise cost volume proposed by the embodiment of the invention consists of two parts: a group-wise distance cost volume and a group-wise correlation cost volume. The basic idea is as follows. In the first stage, the distance cost at each disparity level is computed; in the second stage, the correlation cost at each disparity level is computed. In each stage, the left-image and right-image features are first divided into several groups, and a distance map (first stage) or correlation map (second stage) is then computed group by group, which yields multiple distance or correlation matching cost proposals. Finally, these proposals are combined into a group-wise distance cost volume or a group-wise correlation cost volume, as shown in FIG. 4. The group-wise distance cost volume $C_{gwd}$ and the group-wise correlation cost volume $C_{gwc}$ are given by the following formulas, respectively:
[Formula for $C_{gwd}$: image not reproduced in this text]
[Formula for $C_{gwc}$: image not reproduced in this text]
where $\|\cdot - \cdot\|_1$ and $\langle\cdot,\cdot\rangle$ denote the $L_1$ distance and the inner product between two features, respectively, $d$ is the disparity, $x$ and $y$ denote the two input feature vectors, $g$ is the index of the feature group, $f_l$ denotes the left-image features, and $f_r$ denotes the right-image features.
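The group-by-group computation described above can be sketched as follows. The split into groups, the L1 distance, and the inner product follow the text; averaging over the channels of each group, the zero fill where a left pixel has no right-image counterpart, and the function name `group_cost_volume` are assumptions of this sketch.

```python
import numpy as np

def group_cost_volume(feat_l, feat_r, max_disp, num_groups, mode):
    """Build a (num_groups, max_disp, H, W) cost volume from (C, H, W) features.

    mode="distance": mean absolute difference per group (L1 distance).
    mode="correlation": mean elementwise product per group (inner product).
    """
    c, h, w = feat_l.shape
    assert c % num_groups == 0
    gl = feat_l.reshape(num_groups, c // num_groups, h, w)
    gr = feat_r.reshape(num_groups, c // num_groups, h, w)
    volume = np.zeros((num_groups, max_disp, h, w))
    for d in range(max_disp):
        # Left pixel x is compared with right pixel x - d; columns x < d stay zero.
        left = gl[..., d:]
        right = gr[..., : w - d]
        if mode == "distance":
            cost = np.abs(left - right).mean(axis=1)
        else:
            cost = (left * right).mean(axis=1)
        volume[:, d, :, d:] = cost
    return volume

rng = np.random.default_rng(0)
feat_l = rng.standard_normal((32, 6, 20))
feat_r = np.roll(feat_l, -3, axis=-1)    # every left pixel x matches right pixel x - 3
vol = group_cost_volume(feat_l, feat_r, max_disp=8, num_groups=4, mode="distance")
best = vol[:, :, :, 8:].argmin(axis=1)   # skip the left border where some d > x
print(vol.shape, np.unique(best))
```

In this synthetic example every group reaches its minimum distance at a disparity of 3, the shift that was applied. Passing `mode="correlation"` produces the second-stage variant, where the best match is the maximum rather than the minimum.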
Specifically, the lightweight cost aggregation network in step c is implemented as follows:
the distance-correlation group-wise cost volume may be affected by unstable factors in the input images such as weak texture, occlusion, and ambiguous matches. To mitigate this, the invention regularizes the cost volume with a lightweight cost aggregation network to obtain the aggregated matching cost volume, as shown in FIG. 5. The aggregation network consists of four three-dimensional convolution layers with 5 × 5 × 5 kernels. The first 5 × 5 × 5 layer increases the dimensionality of the cost volume, which helps the network learn more contextual features. The second and third 5 × 5 × 5 layers further refine the cost volume. The fourth 5 × 5 × 5 layer reduces the dimensionality of the cost volume to produce the final refined cost volume. Each 5 × 5 × 5 three-dimensional convolution uses Batch Normalization and a ReLU activation function.
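The receptive-field benefit of the 5 × 5 × 5 kernels can be checked with a short calculation. The sketch assumes stride 1 and no dilation, neither of which is stated in the text, and the comparison with 3 × 3 × 3 kernels is illustrative only.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field along one axis of stride-1, undilated convolutions applied in sequence."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field

rf_5 = stacked_receptive_field([5, 5, 5, 5])   # four 5x5x5 layers, as in the text
rf_3 = stacked_receptive_field([3, 3, 3, 3])   # same depth with 3x3x3 kernels, for comparison
print(rf_5, rf_3)
```

Under these assumptions each output cell of the four-layer network sees 17 cells along every axis of the cost volume, against 9 for the same depth with 3 × 3 × 3 kernels.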
Specifically, the disparity regression and the first-stage disparity map in step d are implemented as follows:
After the aggregated matching cost volume is obtained, the embodiment of the invention performs disparity regression on it with the soft-argmin operation to obtain a coarsely estimated disparity map at 1/8 resolution, and upsamples this map to full size by bilinear interpolation to obtain the first-stage disparity map. The disparity regression is given by the following formula:
[Formula 3: soft-argmin disparity regression; formula image not reproduced]
where d denotes the disparity, M denotes the maximum disparity (M = 24 in the first stage), and C_i(d') and C_i(d) denote the cost values at pixel i for disparity levels d' and d, respectively. Specifically, the soft-argmin operation negates C_i(d') and C_i(d) to convert the cost volume into a probability volume. Taking the probability-weighted sum over all disparity values recovers the disparity with the minimum matching cost and also improves the convergence speed of the model.
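A minimal sketch of soft-argmin regression for a single pixel follows. It assumes the usual form of the operation, a softmax over the negated costs followed by an expectation over the candidate disparities, with candidates numbered from 0; the exact summation bounds of the patent's formula are not recoverable from this text.

```python
import math


def soft_argmin(costs):
    """Expected disparity under softmax(-cost) over candidates 0..len-1."""
    shift = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - shift)) for c in costs]
    norm = sum(weights)
    return sum(d * w / norm for d, w in enumerate(weights))


# Costs for candidate disparities 0..4 at one pixel; the minimum is at d = 3.
costs = [9.0, 7.0, 4.0, 0.5, 4.0]
estimate = soft_argmin(costs)
```

Unlike a hard argmin, the result is a continuous value close to the lowest-cost disparity, and the operation is differentiable, so it can be trained end to end.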
Specifically, the residual prediction and the second-stage disparity map in step e are implemented as follows:
To reduce the maximum disparity between pixels of the input stereo images, the invention predicts a residual, which limits the disparity range to between -2 and 2. Specifically, the coarsely estimated 1/8-resolution disparity map from the first stage is first upsampled to a 1/4-resolution disparity map, which is used to warp the right-image features of the second stage. Next, a distance-correlation grouping cost volume is constructed from the 1/4-resolution left-image features extracted by the feature extractor and the warped right-image features. Finally, a residual map is obtained through the cost aggregation network and disparity regression and is added to the upsampled first-stage disparity map to correct the disparity, generating a finely estimated disparity map at 1/4 resolution; this map is upsampled to full size by bilinear interpolation to obtain the second-stage disparity map.
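The refinement steps above can be sketched on a single image row. This is an illustration under assumptions rather than the patent's implementation: the helper names are hypothetical, nearest-neighbour repetition stands in for the interpolation, disparity values are doubled when the resolution doubles, and the residual values are made up (in the method they come from the second-stage regression).

```python
def upsample_disparity(row, factor=2):
    """Repeat each value `factor` times and scale it by `factor`.

    Scaling reflects that a disparity measured in pixels grows with image
    width; nearest-neighbour repetition stands in for the interpolation.
    """
    return [v * factor for v in row for _ in range(factor)]


def warp_right(right_row, disparity_row):
    """Sample the right feature row at x - d with linear interpolation."""
    last = len(right_row) - 1
    warped = []
    for x, d in enumerate(disparity_row):
        pos = min(max(x - d, 0.0), float(last))  # clamp to the row
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        warped.append((1.0 - frac) * right_row[lo] + frac * right_row[hi])
    return warped


coarse = [0.25, 0.75]                    # 1/8-resolution disparities
up = upsample_disparity(coarse)          # 1/4-resolution initial disparities
right = [10.0, 20.0, 30.0, 40.0]         # one right-image feature row
warped = warp_right(right, up)
residual = [0.5, -0.25, 0.0, 1.0]        # illustrative values within [-2, 2]
refined = [u + r for u, r in zip(up, residual)]
```

Because the warped right features are already roughly aligned with the left features, the second stage only has to search a few residual levels around zero instead of the full disparity range.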
Specifically, the loss function in step f is implemented as follows:
After the disparity maps are obtained, the embodiment of the present invention optimizes the objective function with an Adam optimizer, with β1 = 0.9 and β2 = 0.999, to improve the convergence speed of the model. The first and second stages use different loss weight coefficients: λ1 = 0.5 for the first stage and λ2 = 0.7 for the second stage. The loss function is given by the following formulas:
[Formulas 4 to 6: loss function; formula images not reproduced]
where x is the input variable of the loss function, d is the disparity estimate, N is the number of candidate disparities, and the three symbols rendered as images in the original denote, in order, the ground-truth disparity at the same pixel, the loss function of the first stage, and the loss function of the second stage. L_sum represents the final loss function of the model.
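The sketch below shows one plausible reading of the two-stage loss. It assumes a smooth L1 penalty, which is common in stereo networks but is not confirmed by this text because the formula images are missing, and it assumes that the final loss is the weighted sum of the two stage losses with the weights 0.5 and 0.7 given above. The function names and sample values are hypothetical.

```python
def smooth_l1(x):
    """Smooth L1 penalty: quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def stage_loss(estimate, ground_truth):
    """Mean smooth L1 error over the labelled pixels."""
    errors = [smooth_l1(d - g) for d, g in zip(estimate, ground_truth)]
    return sum(errors) / len(errors)


def total_loss(stage1, stage2, ground_truth, lam1=0.5, lam2=0.7):
    """Weighted sum of the two stage losses (weights from the text)."""
    return (lam1 * stage_loss(stage1, ground_truth)
            + lam2 * stage_loss(stage2, ground_truth))


gt = [10.0, 20.0, 30.0, 40.0]
coarse = [12.0, 20.5, 27.0, 40.0]   # first-stage estimates
fine = [10.5, 20.0, 29.5, 40.0]     # second-stage estimates
loss = total_loss(coarse, fine, gt)
```

Giving the second stage the larger weight puts more emphasis on the refined output, while the first-stage term still supervises the coarse estimate that the refinement depends on.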
Specifically, the TensorRT inference acceleration in step g is implemented as follows:
The model is pre-trained for 40 rounds on the SceneFlow data set, and the pre-trained result is then fine-tuned for 800 rounds on the KITTI 2012 and KITTI 2015 data sets. Finally, the TensorRT optimizer is used to accelerate inference of the fine-tuned model on the NVIDIA Jetson Nano embedded device, and the optimized model is deployed to the Jetson Nano.
In summary, the embodiments of the present invention provide a two-stage real-time binocular depth estimation method based on distance-correlation grouping mixing, which can perform disparity estimation efficiently and in real time. The feature extractor provided by the invention downsamples with block convolution, which retains more useful feature information and produces high-precision feature maps. The cost volume constructed by the distance-correlation grouping provided by the invention offers a good similarity measure. The lightweight cost aggregation network provided by the invention has a light structure: each stage uses only four 5 × 5 × 5 three-dimensional convolutional layers, so the network can perform cost aggregation at a fast inference speed. Furthermore, existing techniques are generally unable to generate high-precision disparity maps in real time on resource-constrained embedded devices. In contrast, the invention uses the TensorRT optimizer to accelerate inference of the network layers, so the model can be deployed in real time on the resource-constrained NVIDIA Jetson Nano while maintaining high accuracy.
Apparatus embodiment one
According to an embodiment of the present invention, a two-stage real-time binocular depth estimation apparatus based on grouping mixing is provided. Fig. 6 is a schematic diagram of this apparatus. As shown in fig. 6, the apparatus specifically includes:
an extraction module 60, configured to perform feature extraction on the original input image with a feature extractor based on block convolution, to obtain feature maps at 1/4 and 1/8 of the resolution of the original input image;
a first-stage disparity map module 62, configured to construct a grouping distance cost volume from the 1/8-resolution feature map, regularize the grouping distance cost volume through a lightweight cost aggregation network to obtain an aggregated matching cost volume, generate a coarsely estimated 1/8-resolution disparity map from the aggregated matching cost volume by disparity regression, and upsample it to full size by bilinear interpolation to obtain a first-stage disparity map;
a second-stage disparity map module 64, configured to upsample the coarsely estimated 1/8-resolution disparity map to a 1/4-resolution disparity map, perform dynamic offset according to the 1/4-resolution disparity map and the left-image features to construct a grouping correlation cost volume, obtain a residual map from the grouping correlation cost volume through the cost aggregation network and disparity regression, add the residual map to the upsampled 1/4-resolution disparity map to generate a finely estimated 1/4-resolution disparity map, and upsample it to full size by bilinear interpolation to obtain a second-stage disparity map;
and an optimization inference module 66, configured to optimize the model with respect to the loss function using an Adam optimizer, based on the first-stage disparity map and the second-stage disparity map, to obtain an optimized model, and to accelerate inference of the network layers of the optimized model using the TensorRT optimizer.
Compared with the prior art, the embodiment of the invention has at least the following beneficial effects:
(1) In the feature extraction step, the embodiment of the present invention provides a feature extractor based on block convolution, which downsamples with successive block convolutions. Compared with the prior art, downsampling with block convolution retains more feature information.
(2) In the cost volume construction step, the embodiment of the invention provides a two-stage distance-correlation grouping mixing method for constructing cost volumes; the distance-correlation grouping cost volume generates a multi-channel distance map and correlation map for each disparity level. Compared with the prior art, this strategy fuses information from more feature channels and provides a better similarity measure.
(3) In the cost aggregation step, the embodiment of the invention provides a lightweight three-dimensional cost aggregation network in which each stage uses only four 5 × 5 × 5 three-dimensional convolutional layers to optimize the distance-correlation grouping cost volume. Compared with the prior art, the computational cost of cost aggregation is lower, and the 5 × 5 × 5 three-dimensional convolution has a larger receptive field, which improves accuracy in textureless regions.
(4) The embodiment of the invention performs disparity estimation in a coarse-to-fine, two-stage manner. Compared with the prior art, this greatly reduces the computational complexity of the cost aggregation step and significantly improves the inference speed of the network.
(5) The embodiment of the invention uses the TensorRT optimizer to accelerate inference of the network layers. Compared with the prior art, the model trained on the KITTI 2012 and KITTI 2015 data sets can be deployed in real time on the resource-constrained NVIDIA Jetson Nano while maintaining high accuracy, which is of great significance in fields such as robot navigation, augmented reality, smart cities, and autonomous driving.
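The complexity claim in item (4) can be illustrated with simple arithmetic on cost volume sizes. The input size and the single-stage baseline (48 disparity levels at 1/4 resolution) are hypothetical choices made for this sketch; the 24 first-stage levels and the residual range of -2 to 2 come from the description, and the number of feature groups is left out because it is not stated.

```python
def volume_cells(height, width, scale, levels):
    """Number of (disparity, y, x) cells in a cost volume at 1/scale size."""
    return (height // scale) * (width // scale) * levels


H, W = 384, 1248                       # hypothetical input size
stage1 = volume_cells(H, W, 8, 24)     # 24 levels at 1/8 resolution
stage2 = volume_cells(H, W, 4, 5)      # residuals -2..2 at 1/4 resolution
two_stage = stage1 + stage2
single = volume_cells(H, W, 4, 48)     # hypothetical one-stage baseline
ratio = single / two_stage
```

Under these assumptions the two stages together aggregate roughly a quarter as many cost volume cells as the single-stage baseline, and the work of every three-dimensional convolution shrinks in the same proportion.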
This embodiment is the apparatus embodiment corresponding to the above method embodiment; the specific operations of each module can be understood with reference to the description of the method embodiment and are not described again here.
Apparatus embodiment two
An embodiment of the present invention provides an electronic device, as shown in fig. 7, including: a memory 70, a processor 72 and a computer program stored on the memory 70 and executable on the processor 72, the computer program, when executed by the processor 72, implementing the steps as described in the method embodiments.
Apparatus embodiment three
An embodiment of the present invention provides a computer-readable storage medium, on which a program for implementing information transmission is stored, and when the program is executed by a processor 72, the program implements the steps described in the method embodiment.
The computer-readable storage medium of this embodiment includes, but is not limited to: ROM, RAM, magnetic or optical disks, and the like.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A two-stage real-time binocular depth estimation method based on grouping mixing, characterized by comprising the following steps:
performing feature extraction on an original input image with a feature extractor based on block convolution to obtain feature maps at 1/4 and 1/8 of the resolution of the original input image;
constructing a grouping distance cost volume from the 1/8-resolution feature map, regularizing the grouping distance cost volume through a lightweight cost aggregation network to obtain an aggregated matching cost volume, generating a coarsely estimated 1/8-resolution disparity map from the aggregated matching cost volume by disparity regression, and upsampling it to full size by bilinear interpolation to obtain a first-stage disparity map;
upsampling the coarsely estimated 1/8-resolution disparity map to a 1/4-resolution disparity map, performing dynamic offset according to the 1/4-resolution disparity map and the left-image features to construct a grouping correlation cost volume, obtaining a residual map from the grouping correlation cost volume through the cost aggregation network and disparity regression, adding the residual map to the upsampled 1/4-resolution disparity map to generate a finely estimated 1/4-resolution disparity map, and upsampling it to full size by bilinear interpolation to obtain a second-stage disparity map;
based on the first-stage disparity map and the second-stage disparity map, optimizing the model with respect to the loss function using an Adam optimizer to obtain an optimized model, and accelerating inference of the network layers of the optimized model using a TensorRT optimizer.
2. The method according to claim 1, wherein performing feature extraction on the original input image with the feature extractor based on block convolution to obtain feature maps at 1/4 and 1/8 of the resolution of the original input image specifically comprises:
with the feature extractor based on block convolution, first downsampling the original input image to the 1/2, 1/4, and 1/8 resolution levels using block convolution, and then performing deep feature extraction on the 1/4-resolution and 1/8-resolution features to obtain the feature maps at 1/4 and 1/8 of the resolution of the original input image.
3. The method according to claim 1, wherein constructing the grouping distance cost volume from the 1/8-resolution feature map specifically comprises:
constructing the grouping distance cost volume from the 1/8-resolution feature map according to formula 1:
[Formula 1: grouping distance cost volume C_gwd; formula image not reproduced]
where ||· - ·||_1 denotes the L1 distance between two features, C_gwd denotes the grouping distance cost volume measuring the similarity of pixels in the input images, d denotes the disparity, x and y denote feature vectors, g denotes the index of the feature group, and f_l and f_r denote the left-image features and the right-image features, respectively.
4. The method according to claim 1, wherein performing dynamic offset according to the 1/4-resolution disparity map and the left-image features to construct the grouping correlation cost volume specifically comprises:
performing dynamic offset according to the 1/4-resolution disparity map and the left-image features to construct the grouping correlation cost volume according to formula 2:
[Formula 2: grouping correlation cost volume C_gwc; formula image not reproduced]
where <·,·> denotes the inner product between two features, C_gwc denotes the grouping correlation cost volume measuring the similarity of pixels in the input images, d denotes the disparity, x and y denote feature vectors, g denotes the index of the feature group, and f_l and f_r denote the left-image features and the right-image features, respectively.
5. The method of claim 1, wherein the lightweight cost aggregation network comprises: four three-dimensional convolutional layers with a kernel size of 5 × 5 × 5, wherein the first 5 × 5 × 5 three-dimensional convolutional layer is used to increase the dimension of the cost volume, the second and third 5 × 5 × 5 three-dimensional convolutional layers are used to optimize the cost volume, and the fourth 5 × 5 × 5 three-dimensional convolutional layer is used to reduce the dimension of the cost volume.
6. The method according to claim 1, wherein generating the coarsely estimated 1/8-resolution disparity map from the aggregated matching cost volume by disparity regression and upsampling it to full size by bilinear interpolation to obtain the first-stage disparity map specifically comprises:
performing disparity regression, according to formula 3, on the matching cost volume optimized by the cost aggregation network to obtain the coarsely estimated 1/8-resolution disparity map, and upsampling it to full size by bilinear interpolation to obtain the first-stage disparity map:
[Formula 3: soft-argmin disparity regression; formula image not reproduced]
where M denotes the maximum disparity (M = 24 in the first stage), d denotes the disparity, and C_i(d') and C_i(d) denote the cost values at pixel i for disparity levels d' and d, respectively.
7. The method of claim 1, wherein optimizing the model with respect to the loss function using an Adam optimizer, based on the first-stage disparity map and the second-stage disparity map, specifically comprises:
optimizing the loss function shown in formulas 4 to 6 using an Adam optimizer, wherein:
[Formulas 4 to 6: loss function; formula images not reproduced]
where x is an input variable, d is the disparity estimate, N is the number of candidate disparities, and the three symbols rendered as images in the original denote, in order, the ground-truth disparity at the same pixel, the loss function of the first stage, and the loss function of the second stage; in joint training, the first and second stages use different loss weight coefficients, with λ1 = 0.5 for the first stage and λ2 = 0.7 for the second stage.
8. A two-stage real-time binocular depth estimation apparatus based on grouping mixing, comprising:
an extraction module, configured to perform feature extraction on an original input image with a feature extractor based on block convolution to obtain feature maps at 1/4 and 1/8 of the resolution of the original input image;
a first-stage disparity map module, configured to construct a grouping distance cost volume from the 1/8-resolution feature map, regularize the grouping distance cost volume through a lightweight cost aggregation network to obtain an aggregated matching cost volume, generate a coarsely estimated 1/8-resolution disparity map from the aggregated matching cost volume by disparity regression, and upsample it to full size by bilinear interpolation to obtain a first-stage disparity map;
a second-stage disparity map module, configured to upsample the coarsely estimated 1/8-resolution disparity map to a 1/4-resolution disparity map, perform dynamic offset according to the 1/4-resolution disparity map and the left-image features to construct a grouping correlation cost volume, obtain a residual map from the grouping correlation cost volume through the cost aggregation network and disparity regression, add the residual map to the upsampled 1/4-resolution disparity map to generate a finely estimated 1/4-resolution disparity map, and upsample it to full size by bilinear interpolation to obtain a second-stage disparity map;
and an optimization inference module, configured to optimize the model with respect to the loss function using an Adam optimizer, based on the first-stage disparity map and the second-stage disparity map, to obtain an optimized model, and to accelerate inference of the network layers of the optimized model using a TensorRT optimizer.
9. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the two-stage real-time binocular depth estimation method based on grouping mixing according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a program for implementing information transmission is stored on the computer-readable storage medium, and the program, when executed by a processor, implements the steps of the two-stage real-time binocular depth estimation method based on grouping mixing according to any one of claims 1 to 7.
CN202211275720.XA 2022-10-18 2022-10-18 Two-stage real-time binocular depth estimation method and device based on grouping mixing Pending CN115546279A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211275720.XA CN115546279A (en) 2022-10-18 2022-10-18 Two-stage real-time binocular depth estimation method and device based on grouping mixing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211275720.XA CN115546279A (en) 2022-10-18 2022-10-18 Two-stage real-time binocular depth estimation method and device based on grouping mixing

Publications (1)

Publication Number Publication Date
CN115546279A true CN115546279A (en) 2022-12-30

Family

ID=84735585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211275720.XA Pending CN115546279A (en) 2022-10-18 2022-10-18 Two-stage real-time binocular depth estimation method and device based on grouping mixing

Country Status (1)

Country Link
CN (1) CN115546279A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117218263A (en) * 2023-09-12 2023-12-12 山东捷瑞信息技术产业研究院有限公司 Texture lightweight optimization method and system based on three-dimensional engine
CN117218263B (en) * 2023-09-12 2024-03-19 山东捷瑞信息技术产业研究院有限公司 Texture lightweight optimization method and system based on three-dimensional engine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination