CN115345917A - Multi-stage dense reconstruction method and device for low video memory occupation - Google Patents

Multi-stage dense reconstruction method and device for low video memory occupation

Info

Publication number
CN115345917A
CN115345917A
Authority
CN
China
Prior art keywords
depth map
scales
scale
feature
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210954308.4A
Other languages
Chinese (zh)
Inventor
庞大为
王江安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tudou Data Technology Group Co ltd
Original Assignee
Tudou Data Technology Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tudou Data Technology Group Co ltd filed Critical Tudou Data Technology Group Co ltd
Priority to CN202210954308.4A
Publication of CN115345917A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • G06T7/596Depth or shape recovery from multiple images from stereo images from three or more stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4046Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/40Analysis of texture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T7/85Stereo camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a multi-stage dense reconstruction method and device with low video memory occupation. The method comprises the following steps: acquiring a plurality of original images; generating feature maps of each original image at multiple scales through a preset feature pyramid; determining feature bodies of multiple scales of the multiple source images corresponding to each reference image; at the top-level scale, generating a cost body of the reference image and generating a depth map with a hierarchical recursive convolutional network; at each other scale, generating a residual cost body of the reference image by referring to the depth map of the previous scale, generating a residual depth map with the hierarchical recursive convolutional network, and adding the residual depth map to the up-sampled depth map of the previous scale to obtain the depth map of the current scale; and fusing the depth maps generated at the bottom-level scales to obtain a dense point cloud. The disclosed method can generate a high-precision depth map, effectively alleviates both the difficulty that existing deep-learning-based MVS algorithms have in extracting features from low-texture and weak-texture regions and their excessive video memory consumption, and realizes dense reconstruction of images.

Description

Multi-stage dense reconstruction method and device for low video memory occupation
Technical Field
The application relates to the technical field of remote sensing mapping geographic information, in particular to a multi-stage dense reconstruction method and device with low video memory occupation.
Background
Dense reconstruction algorithms aim to acquire a 3D dense point cloud model of a real scene from multiple images. Conventional methods estimate a depth map using hand-crafted similarity measures and photometric consistency, and then obtain a dense 3D point cloud.
Conventional methods work well in ideal Lambertian scenes, but deep-learning-based MVS algorithms suffer from excessive video memory consumption.
Disclosure of Invention
By providing a multi-stage dense reconstruction method and device with low video memory occupation, the embodiments of the application can generate a high-precision depth map while saving hardware resources in practical engineering applications, effectively address both the weakness of existing deep-learning-based MVS algorithms in extracting features from low-texture and weak-texture regions and their excessive video memory consumption, and realize dense reconstruction of images.
In a first aspect, an embodiment of the present application provides a multi-stage dense reconstruction method with low video memory occupancy, including: acquiring a plurality of original images, wherein the original images comprise a reference image and source images; generating feature maps of each original image at multiple scales through a preset feature pyramid; determining feature bodies of multiple scales of the multiple source images corresponding to each reference image according to the multi-scale feature maps of the original images; at the top-level scale, generating a cost body of the reference image and generating a depth map using a hierarchical recursive convolutional network; at each other scale, generating a residual cost body of the reference image by referring to the depth map of the previous scale, generating a residual depth map using the hierarchical recursive convolutional network, and adding the residual depth map to the up-sampled depth map of the previous scale to obtain the depth map of the current scale; and fusing the depth maps generated at the bottom-level scales to obtain a dense point cloud.
With reference to the first aspect, in a possible implementation manner, the hierarchical recursive convolutional network includes a plurality of parallel recursive modules in the vertical direction, each parallel recursive module being configured to transmit the recursive convolution result of the previous parallel recursive module to the next parallel recursive module; in the horizontal direction, the hierarchical recursive convolutional network has a planar U-Net structure.
With reference to the first aspect, in a possible implementation manner, the fusing of the depth maps generated at the multiple bottom-level scales includes: fusing the depth maps generated at the multiple bottom-level scales through dynamic consistency checking.
With reference to the first aspect, in a possible implementation manner, the method further includes: feature refinement is performed on the feature map using a multi-scale aggregation module.
With reference to the first aspect, in one possible implementation manner, the multi-scale aggregation module includes a first aggregation module and a second aggregation module; the first aggregation module comprises a hole convolution and bilinear interpolation unit, and the second aggregation module comprises a deformable convolution and bilinear interpolation unit; the feature refinement of the feature map by using a multi-scale aggregation module comprises the following steps: performing feature refinement on the feature map on the bottom-level scale of each original image by using a second aggregation module; and performing feature refinement on the feature maps on other scales of each original image by using a first aggregation module.
With reference to the first aspect, in a possible implementation manner, the method further includes: training the hierarchical recursive convolutional network, and analyzing the hierarchical recursive convolutional network by using a total loss function; the total loss function is defined as follows:
Loss = \sum_{k=1}^{N} \lambda_k \cdot L_k

where L_k denotes the loss function of the k-th scale and \lambda_k its corresponding loss weight.
In a second aspect, an embodiment of the present application provides a multi-stage dense reconstruction optimization apparatus with low video memory occupancy, including: the acquisition module is used for acquiring a plurality of original images; wherein the original image comprises a reference image and a source image; the characteristic map generation module is used for generating characteristic maps of multiple scales of each original image through a preset characteristic pyramid; the characteristic body determining module is used for determining characteristic bodies of a plurality of scales of a plurality of source images corresponding to each reference image according to the characteristic images of the original image at the plurality of scales; the depth map generation module is used for generating a cost body of the reference image on the top scale and generating a depth map by using a hierarchical recursive convolution network; on other scales, generating a residual cost body of the reference image by referring to the depth map of the previous scale, generating a residual depth map by using a hierarchical recursive convolution network, and performing up-sampling addition on the residual depth map and the depth map of the previous scale to obtain the depth map of the current scale; and the fusion module is used for fusing the depth maps generated on a plurality of bottom-layer scales to obtain dense point cloud.
With reference to the second aspect, in a possible implementation manner, the hierarchical recursive convolutional network includes a plurality of parallel recursive modules in a vertical direction, and each parallel recursive module is configured to transmit a recursive convolution result of a previous parallel recursive module to a next parallel recursive module; the hierarchical recursive convolutional network is of a plane U-Net structure in the horizontal direction.
With reference to the second aspect, in a possible implementation manner, the fusion module is specifically configured to: and fusing the depth maps generated on a plurality of bottom-layer scales through dynamic consistency test.
With reference to the second aspect, in a possible implementation manner, the apparatus further includes: and the multi-scale aggregation module is used for carrying out feature refinement on the feature map.
With reference to the second aspect, in a possible implementation manner, the multi-scale aggregation module includes a first aggregation module and a second aggregation module; the first aggregation module comprises a hole convolution and bilinear interpolation unit, and the second aggregation module comprises a deformable convolution and bilinear interpolation unit; the multi-scale aggregation module is specifically configured to: performing feature refinement on the feature map on the bottom scale of each original image by using a second aggregation module; and performing feature refinement on the feature maps on other scales of each original image by using a first aggregation module.
With reference to the second aspect, in a possible implementation manner, the apparatus further includes: the loss analysis module is used for training the hierarchical recursive convolutional network and analyzing the hierarchical recursive convolutional network by using a total loss function; the total loss function is defined as follows:
Loss = \sum_{k=1}^{N} \lambda_k \cdot L_k

where L_k denotes the loss function of the k-th scale and \lambda_k its corresponding loss weight.
In a third aspect, an embodiment of the present application provides a multi-stage dense reconstruction server with low video memory occupancy, including a memory and a processor; the memory is to store computer-executable instructions; the processor is configured to execute the computer-executable instructions to implement the method of the first aspect or any one of the possible implementations of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, where executable instructions are stored, and when the executable instructions are executed by a computer, the method described in the first aspect or any one of the possible implementation manners of the first aspect can be implemented.
One or more technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages:
the embodiment of the application provides a multi-stage dense reconstruction method with low video memory occupation. In this method, image features are extracted through the staged strategy of a preset feature pyramid, and a hierarchical recursive convolutional network is used to obtain depth maps at multiple scales; the depth map at each scale is generated from coarse to fine by referring to the depth map of the previous scale, and the depth maps generated at the bottom-level scales are fused. This effectively reduces GPU memory consumption, solves the problem of excessive video memory occupation, and realizes dense reconstruction of images.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments of the present invention or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of a multi-stage dense reconstruction method for low video memory occupation according to an embodiment of the present application;
FIG. 2 is a flowchart for refining a feature map according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a multi-stage dense reconstruction apparatus with low video memory occupation according to an embodiment of the present application;
fig. 4 is a schematic diagram of a multi-stage dense reconstruction optimization server with low video memory occupation according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a hierarchical recursive convolutional network provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of a U-LSTMCONV module provided in the embodiment of the present application;
fig. 7 is a schematic structural diagram of a feature pyramid provided in the embodiment of the present application;
fig. 8 is a schematic structural diagram of a first aggregation module provided in an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a second aggregation module according to an embodiment of the present disclosure;
fig. 10 is a schematic diagram of the overall steps provided in the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
First, related techniques or concepts related to the embodiments of the present application will be briefly described.
Deep Learning (DL) is a research direction in the field of Machine Learning (ML); it was introduced to bring machine learning closer to its original goal, Artificial Intelligence (AI). Deep learning learns the intrinsic laws and representation hierarchies of sample data, and the information obtained during learning is very helpful for interpreting data such as text, images and sounds. Its ultimate aim is to give machines human-like analysis and learning abilities, enabling them to recognize data such as text, images and sounds. Deep learning has achieved many results in search technology, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech, recommendation and personalization, and other related fields. It enables machines to imitate human activities such as seeing, hearing and thinking, solves many complex pattern-recognition problems, and has driven great progress in artificial-intelligence-related technologies.
A homography transformation is defined as a projection mapping from one plane to another. For example, the mapping of points on a two-dimensional plane onto a camera imager is an example of a planar homography transformation.
The Feature Pyramid Network (FPN) is a basic component of recognition systems for detecting objects at different scales; by extracting and fusing multi-scale feature information, it improves model accuracy.
The embodiment of the application provides a multi-stage dense reconstruction method with low video memory occupation. A staged feature-pyramid strategy is introduced, convolutions of different sizes are adopted to adaptively extract image features at each stage, and the staged strategy is used to perform coarse-to-fine depth map inference on the regularized cost body. The method mainly comprises: constructing a sparse cost body from the top-level feature map, estimating the depth of the whole scene with a hierarchical recursive convolutional network, then adaptively adjusting the depth range using the previously estimated depth map, and finally obtaining a residual depth map with the hierarchical recursive convolutional network.
The embodiment of the application provides a multi-stage dense reconstruction method with low video memory occupation, and as shown in fig. 1 and fig. 10, the method includes steps S101 to S106.
S101: a plurality of original images are acquired. Wherein the original image comprises a reference image and a source image.
S102: and generating a feature map of each original image in multiple scales by using the preset feature pyramid.
In the embodiment of the application, the preset feature pyramid can be regarded as an encoder-decoder structure through which each original image yields several feature maps at different scales. The top-level feature map contains high-level semantic features but lacks bottom-level details; the lower-level feature maps contain feature details but lack sufficient semantic information. The preset feature pyramid therefore extracts more comprehensive features across multiple scales and can describe accurate image features.
For example, the preset feature pyramid shown in fig. 7 has three layers, so each original image can generate feature maps at three scales through it; of course, the number of layers may also be set to other values, such as two or four.
In addition, the scale of each layer of the preset feature pyramid is set manually in advance. Illustratively, the feature size of each layer is twice that of the layer above it: in fig. 7, the top-level output is W/4 × H/4 × 32, the middle level is W/2 × H/2 × 16, and the bottom level is W × H × 8, where W and H are the dimensions of the original image. Of course, each layer may also be other multiples of the previous layer, such as three or four times.
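For concreteness, a three-level feature pyramid with the output sizes quoted above could be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions: the kernel sizes, strides and lateral 1×1 convolutions are illustrative choices, not the filed architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    """Three-level encoder-decoder feature pyramid.

    For a W x H input it emits feature maps of sizes
    W/4 x H/4 x 32 (top), W/2 x H/2 x 16 (middle), W x H x 8 (bottom).
    """
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 8, 3, stride=1, padding=1)    # W x H x 8
        self.enc2 = nn.Conv2d(8, 16, 3, stride=2, padding=1)   # W/2 x H/2 x 16
        self.enc3 = nn.Conv2d(16, 32, 3, stride=2, padding=1)  # W/4 x H/4 x 32
        # 1x1 lateral convs merge upsampled coarse features into finer levels
        self.lat2 = nn.Conv2d(32, 16, 1)
        self.lat1 = nn.Conv2d(16, 8, 1)

    def forward(self, x):
        f1 = F.relu(self.enc1(x))
        f2 = F.relu(self.enc2(f1))
        f3 = F.relu(self.enc3(f2))                      # top level
        u2 = F.interpolate(f3, scale_factor=2, mode="bilinear", align_corners=False)
        f2 = f2 + self.lat2(u2)                         # middle level
        u1 = F.interpolate(f2, scale_factor=2, mode="bilinear", align_corners=False)
        f1 = f1 + self.lat1(u1)                         # bottom level
        return f3, f2, f1
```

The top-down additions are what give the finer levels the semantic content they lack on their own, matching the motivation described above.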
S103: and determining a plurality of scales of feature bodies of a plurality of source images corresponding to each reference image according to the feature maps of the plurality of scales of the original image. Specifically, each reference image of each scale corresponds to a feature volume of the plurality of source images.
S104: on the top scale, a cost volume of the reference image is generated, and a depth map is generated using a hierarchical recursive convolutional network.
S105: and on other scales, generating a residual cost body of the reference image by referring to the depth map of the previous scale, generating a residual depth map by using a hierarchical recursive convolution network, and adding the residual depth map and the depth map of the previous scale by up-sampling to obtain the depth map of the current scale.
S106: and fusing the depth maps generated on a plurality of bottom-layer scales to obtain dense point cloud.
In the above steps, image features are extracted using the staged strategy of the preset feature pyramid, and the hierarchical recursive convolutional network produces depth maps at multiple scales; the depth map at each scale is generated from coarse to fine by referring to the depth map of the previous scale, and the depth maps generated at the bottom-level scales are fused. This effectively reduces GPU memory consumption, solves the problem of excessive video memory occupation, and realizes dense reconstruction of images. A schematic sketch of this coarse-to-fine loop follows.
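The memory saving comes from sampling only a thin residual depth range at the finer scales. The sketch below illustrates steps S104–S105; build_cost_volume, build_residual_cost_volume and the hrcn module are hypothetical placeholders standing in for the patent's components.

```python
import torch.nn.functional as F

def multi_stage_depth(features_per_scale, cameras, hrcn):
    """features_per_scale: per-view feature maps, listed coarsest first."""
    # S104: full cost body and depth map at the top-level scale.
    cost = build_cost_volume(features_per_scale[0], cameras)      # assumed helper
    depth = hrcn(cost)
    # S105: at each finer scale, estimate only a residual depth.
    for feats in features_per_scale[1:]:
        depth_up = F.interpolate(depth.unsqueeze(1), scale_factor=2,
                                 mode="bilinear", align_corners=False).squeeze(1)
        res_cost = build_residual_cost_volume(feats, cameras, depth_up)  # assumed
        residual = hrcn(res_cost)
        depth = depth_up + residual   # up-sampling addition from the claims
    return depth
```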
As shown in fig. 5, the hierarchical recursive convolutional network includes a plurality of parallel recursive modules in the vertical direction, each of which transmits the recursive convolution result of the previous parallel recursive module to the next one. The number of parallel recursive modules is set manually in advance; the network in fig. 5 has 5 parallel recursive modules in the vertical direction, although 4, 6 or other counts are also possible. The stacked modules shown in fig. 5 can absorb context information at multiple scales and process the cost body efficiently.
As shown in fig. 6, the hierarchical recursive convolutional network has a planar U-Net structure in the horizontal direction, which is the horizontal structure of the U-LSTMConv module. The planar U-Net structure comprises an LSTMConvCell unit 601, a Deconv unit 602 and a MaxPooling unit 603, positioned as shown in fig. 6. The LSTMConvCell not only has the temporal recurrence of an LSTM but can also characterize local features like a CNN (Convolutional Neural Network).
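A cell of the kind the U-LSTMConv module stacks can be sketched as below. This is the standard ConvLSTM formulation, assumed here for illustration rather than taken from the filing.

```python
import torch
import torch.nn as nn

class LSTMConvCell(nn.Module):
    """Standard ConvLSTM cell: LSTM gating with convolutional state updates."""
    def __init__(self, in_ch, hidden_ch, kernel=3):
        super().__init__()
        pad = kernel // 2
        # One convolution produces the input, forget, output and candidate gates.
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch, kernel, padding=pad)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)   # cell state carries depth-sweep memory
        h = o * torch.tanh(c)           # hidden state doubles as a CNN feature map
        return h, (h, c)
```

Because the gates are convolutions, the cell keeps the recurrence of an LSTM across depth hypotheses while still describing local spatial features, as stated above.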
Specifically, the layer-by-layer configuration of the hierarchical recursive convolutional network is given in a table that appears only as an image in the original filing.
the input of the hierarchical recursive convolutional network is an original image and a calibrated camera pose. The original image is used for feature extraction, and the camera pose information is used for homography transformation. Taking the preset pyramid shown in fig. 7 and the hierarchical recursive convolutional network provided in the embodiment of the present application as an example, the steps are described in detail as follows.
With reference to the depth map obtained at the top-level scale of the preset feature pyramid shown in fig. 7, the depth map at the next scale, i.e. the middle-level scale, is estimated. Depth planes are sampled within the residual depth range using the middle-level feature map to form a residual cost body of scale W/2 × H/2 × 16 × 32; this residual cost body is then passed through the hierarchical recursive convolutional network to generate a residual depth map, and finally the middle-scale depth map is generated. The bottom-level depth map is obtained in the same way by referring to the middle-level depth map: a residual cost body of scale W × H × 8 is constructed, passed through the hierarchical recursive convolutional network to generate a residual depth map, and finally the bottom-level depth map is generated.
Specifically, the homography used to construct the residual cost body is:

H_m^{k+1} = K \cdot \left( R + \frac{t \cdot n^T}{\hat{d}_m^k + \Delta d_m^{k+1}} \right) \cdot K^{-1}

where \hat{d}_m^k denotes the predicted depth of the k-th layer at the m-th pixel, \Delta d_m^{k+1} denotes the residual depth to be estimated at the m-th pixel of the (k+1)-th layer, K, R and t denote the camera intrinsic matrix and the relative rotation and translation between the views, I is the identity matrix, and n is the principal optical axis of the reference image.
Further, after the bottom-level depth map meeting the accuracy requirement is obtained, step S106 specifically includes: fusing the depth maps generated at multiple bottom-level scales through dynamic consistency checking.
At present, geometric-constraint-based depth map fusion measures the consistency of depth estimates across multiple views, and most approaches use parameters fixed in advance, such as the pixel reprojection error and the depth reprojection error. With fixed parameters, a sufficient number of mismatched pixels cannot be screened out across different scenes, so the result is unreliable when scenes vary. The method provided by the embodiment of the application fuses the depth maps generated at multiple bottom-level scales using dynamic consistency checking, and obtains a more accurate and complete dense point cloud through the algorithm's dynamic constraints and the consistency of neighboring views.
The dynamic matching consistency between different views, i.e. the dynamic consistency checking criterion, is defined as:

c_{ij}(p) = e^{-(\varepsilon_p + \lambda \cdot \varepsilon_d)}

where \varepsilon_p is the pixel reprojection error, \varepsilon_d is the depth reprojection error, and \lambda weighs the two reprojection errors. Fusing the matching consistency of all views yields the global dynamic multi-view geometric consistency, defined as:

C(p) = \sum_{j} c_{ij}(p)

Finally, outliers are filtered with a threshold \tau. Exemplarily, \lambda = 200 and \tau = 1.8.
The method provided by the embodiment of the application further comprises: performing feature refinement on the feature maps using a multi-scale aggregation module. The feature maps of all scales are sent to the multi-scale aggregation module for feature refinement, which makes their textures clearer and allows features of low-texture and weak-texture regions to be extracted better.
FIG. 2 exemplarily provides a specific implementation of feature refinement of a feature map using a multi-scale aggregation module, wherein the multi-scale aggregation module includes a first aggregation module and a second aggregation module; the first aggregation module comprises a hole convolution and bilinear interpolation unit, the second aggregation module comprises a deformable convolution and bilinear interpolation unit, and the multi-scale aggregation module is used for carrying out feature refinement on the feature map. Specifically, the method includes steps S201 to S202.
S201: and performing feature refinement on the feature map on the bottom scale of each original image by using a second aggregation module.
S202: and performing feature refinement on feature maps on other scales of each original image by using a first aggregation module.
Referring to fig. 8 and 9, fig. 8 is a schematic structural diagram of the first aggregation module provided in the embodiment of the present application, and fig. 9 is a schematic structural diagram of the second aggregation module. The first aggregation module comprises a hole convolution and bilinear interpolation unit, and the second aggregation module comprises a deformable convolution and bilinear interpolation unit.
And refining the features of the feature map on the bottom scale of each original image by using a second aggregation module, and refining the features of the feature map on other scales of each original image by using a first aggregation module.
The advantage of hole (dilated) convolution is that it enlarges the receptive field without the information loss caused by pooling, so the output of each convolution covers a larger range of information. Hole convolution is well suited to problems that need global image information, or long-sequence information in speech and text. The advantages of deformable convolution are better accuracy and support for sampling regions of arbitrary shape.
Illustratively, since the preset feature pyramid shown in fig. 7 has three layers, the hierarchical recursive convolutional network used with it is divided into three stages. The first and second stages each apply hole convolutions at 3 different scales and, through bilinear interpolation and concatenation, obtain feature maps of scales W/4 × H/4 × 32 and W/2 × H/2 × 16 respectively; similarly, the third stage applies deformable convolution and, through bilinear interpolation and concatenation, obtains a feature map of scale W × H × 8.
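A sketch of the first aggregation module under these assumptions: three parallel hole convolutions whose outputs are bilinearly resized to a common size and concatenated. The dilation rates, channel counts and the 1×1 merge convolution are illustrative, not taken from the filing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FirstAggregation(nn.Module):
    """Multi-scale context aggregation via parallel hole (dilated) convolutions."""
    def __init__(self, ch):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(ch, ch, 3, padding=d, dilation=d) for d in (1, 2, 4)
        ])
        self.merge = nn.Conv2d(3 * ch, ch, 1)  # fuse the concatenated branches

    def forward(self, x):
        h, w = x.shape[-2:]
        # Bilinear resize keeps all branches aligned before concatenation.
        outs = [F.interpolate(b(x), size=(h, w), mode="bilinear",
                              align_corners=False) for b in self.branches]
        return self.merge(torch.cat(outs, dim=1))
```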
The deformable convolution in the second aggregation module is defined as follows:
f(p) = \sum_{k=1}^{K} w_k \cdot f(p + p_k + \Delta p_k) \cdot \Delta m_k

where f(p) denotes the feature value at pixel p, w_k and p_k denote the convolution kernel weights and the fixed offsets defined in an ordinary convolution operation, and \Delta p_k and \Delta m_k are the learned offsets and modulation weights of the deformable convolution.
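torchvision ships a modulated deformable convolution matching this formulation; a minimal usage sketch follows, with hyper-parameters chosen for illustration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

in_ch, out_ch, k = 8, 8, 3
# The offsets (delta p_k) and modulation weights (delta m_k) are themselves
# predicted by ordinary convolutions from the input features.
offset_conv = nn.Conv2d(in_ch, 2 * k * k, 3, padding=1)  # (dx, dy) per kernel tap
mask_conv = nn.Conv2d(in_ch, k * k, 3, padding=1)        # delta m_k per kernel tap
deform = DeformConv2d(in_ch, out_ch, k, padding=1)

x = torch.randn(1, in_ch, 64, 64)
offset = offset_conv(x)
mask = torch.sigmoid(mask_conv(x))   # modulation weights in [0, 1]
y = deform(x, offset, mask=mask)     # computes sum_k w_k * f(p + p_k + dp_k) * dm_k
print(y.shape)                       # torch.Size([1, 8, 64, 64])
```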
The method provided by the embodiment of the application further comprises: training the hierarchical recursive convolutional network and analyzing it using a total loss function.
The hierarchical recursive convolutional network in the embodiment of the application is multi-stage: the cost bodies of the stages yield two intermediate depth maps and a final depth map, and before each depth map is obtained, the corresponding cost body must be regularized by the hierarchical recursive convolutional network into a probability body.
All stages need to be considered when calculating the total loss function, which is defined as follows:
Loss = \sum_{k=1}^{N} \lambda_k \cdot L_k

where L_k denotes the loss function of the k-th scale and \lambda_k its corresponding loss weight. When the preset feature pyramid shown in fig. 7 is used, the hierarchical recursive convolutional network sets N = 3. In general, the higher the resolution of the scale at which a depth map is generated, the larger the weight that is set.
The loss function of each scale is computed as:

L_k = \sum_{x \in x_{valid}} \sum_{i=1}^{D} -G(i, x) \cdot \log P(i, x)

where x_valid denotes the set of valid pixels, D the number of depth hypotheses, G(i, x) the one-hot encoding of the ground-truth depth map at the i-th depth of pixel x, and P(i, x) the corresponding element of the probability body.
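Putting the two formulas together, the per-scale cross-entropy terms are accumulated into the weighted total loss. A sketch with illustrative weights (coarsest scale first, larger weight at higher resolution, as noted above); the mean over valid pixels is an assumption — a plain sum would match the formula literally.

```python
import torch

def total_loss(prob_volumes, gt_onehots, valid_masks, loss_weights=(0.5, 1.0, 2.0)):
    """Sum over scales k of lambda_k * L_k, L_k being a masked cross-entropy.

    prob_volumes[k]: (B, D, H, W) probability body P(i, x) at scale k
    gt_onehots[k]:   (B, D, H, W) one-hot ground-truth depth G(i, x)
    valid_masks[k]:  (B, H, W) boolean mask of valid pixels x_valid
    """
    loss = 0.0
    for P, G, m, lam in zip(prob_volumes, gt_onehots, valid_masks, loss_weights):
        ce = -(G * torch.log(P.clamp_min(1e-8))).sum(dim=1)  # sum over depths i
        loss = loss + lam * ce[m].mean()                      # restrict to x_valid
    return loss
```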
Results from multiple experiments show that the method provided by the embodiment of the application markedly reduces GPU memory consumption: its running video memory footprint is only 24% of that of MVSNet.
The embodiment of the present application further provides a multi-stage dense reconstruction device with low video memory occupation, as shown in fig. 3, the device includes: an acquisition module 301, a feature map generation module 302, a feature body determination module 303, a depth map generation module 304, and a fusion module 305.
The obtaining module 301 is configured to obtain a plurality of original images; wherein the original image comprises a reference image and a source image.
The feature map generation module 302 is configured to generate feature maps of multiple scales for each original image through a preset feature pyramid.
The feature determining module 303 is configured to determine, according to the feature maps of multiple scales of the original image, a feature of multiple scales of multiple source images corresponding to each reference image.
The depth map generation module 304 is configured to generate a cost volume of the reference image in a top-level scale, and generate a depth map using a hierarchical recursive convolutional network; and on other scales, generating a residual cost body of the reference image by referring to the depth map of the previous scale, generating a residual depth map by using a hierarchical recursive convolution network, and performing up-sampling addition on the residual depth map and the depth map of the previous scale to obtain the depth map of the current scale.
The fusion module 305 is configured to fuse the depth maps generated at multiple bottom-level scales to obtain a dense point cloud, and is specifically configured to fuse the depth maps generated at multiple bottom-level scales through dynamic consistency checking.
The multi-stage dense reconstruction apparatus 300 with low video memory occupancy further includes a multi-scale aggregation module for performing feature refinement on the feature maps. The multi-scale aggregation module comprises a first aggregation module and a second aggregation module; the first aggregation module comprises a hole convolution and bilinear interpolation unit, and the second aggregation module comprises a deformable convolution and bilinear interpolation unit. The multi-scale aggregation module is specifically configured to: perform feature refinement on the bottom-scale feature map of each original image using the second aggregation module; and perform feature refinement on the feature maps at the other scales of each original image using the first aggregation module.
The multi-stage dense reconstruction apparatus 300 with low video memory occupancy further includes: the loss analysis module is used for training the hierarchical recursive convolutional network and analyzing the hierarchical recursive convolutional network by using a total loss function; the total loss function is defined as follows:
Loss = \sum_{k=1}^{N} \lambda_k \cdot L_k

where L_k denotes the loss function of the k-th scale and \lambda_k its corresponding loss weight.
The apparatuses or modules illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. For convenience of description, the above devices are described as being divided into various modules by functions, which are described separately. The functionality of the modules may be implemented in the same one or more software and/or hardware implementations of the present application. Of course, a module that implements a certain function may be implemented by a plurality of sub-modules or sub-units in combination.
The methods, apparatus or modules described herein may be implemented by computer-readable program code in a controller in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, Application Specific Integrated Circuits (ASICs), programmable logic controllers and embedded microcontrollers. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, besides implementing the controller purely as computer-readable program code, the same functionality can be achieved by logically programming the method steps so that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for performing various functions may be regarded as structures within the hardware component, or even as both software modules implementing the method and structures within the hardware component.
Some of the modules in the apparatus described herein may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiment of the present application further provides a depth map server for three-dimensional reconstruction, as shown in fig. 4, including a memory 401 and a processor 402; the memory 401 is used to store computer executable instructions; the processor 402 is configured to execute computer-executable instructions to implement the multi-stage dense reconstruction method with low video memory occupation provided by the embodiment of the present application.
The embodiment of the application also provides a computer-readable storage medium, wherein the computer-readable storage medium stores executable instructions, and when the computer executes the executable instructions, the multi-stage dense reconstruction method with low video memory occupation provided by the embodiment of the application can be realized.
The storage medium includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a cache, a Hard Disk Drive (HDD) or a memory card. The memory may be used to store computer program instructions.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary hardware. Based on such understanding, the technical solution of the present application, which essentially or contributes to the prior art, may be embodied in the form of a software product, and may also be embodied in the implementation process of data migration. The computer software product may be stored in a storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, mobile terminal, server, or network device, etc.) to perform the methods described in the various embodiments or portions of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same or similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. All or portions of the present application are operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, mobile communication terminals, multiprocessor systems, microprocessor-based systems, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the present application; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications or substitutions do not depart from the spirit and scope of the present disclosure.

Claims (9)

1. A multi-stage dense reconstruction method with low video memory occupation is characterized by comprising the following steps:
acquiring a plurality of original images; wherein the original image comprises a reference image and a source image;
generating a feature map of each original image in multiple scales through a preset feature pyramid;
determining feature bodies of multiple scales of multiple source images corresponding to each reference image according to the feature maps of multiple scales of the original image;
generating a cost body of the reference image on a top-level scale, and generating a depth map by using a hierarchical recursive convolutional network; on other scales, generating a residual cost body of the reference image by referring to the depth map of the previous scale, generating a residual depth map by using a hierarchical recursive convolution network, and performing up-sampling addition on the residual depth map and the depth map of the previous scale to obtain the depth map of the current scale;
and fusing the depth maps generated on the multiple bottom-layer scales to obtain dense point cloud.
2. The method according to claim 1, wherein the hierarchical recursive convolutional network comprises a plurality of parallel recursive modules in a vertical direction, and each parallel recursive module is configured to transmit a recursive convolution result of a previous parallel recursive module to a next parallel recursive module;
the hierarchical recursive convolutional network is of a plane U-Net structure in the horizontal direction.
3. The method of claim 1, wherein the fusing of the depth maps generated at the multiple bottom-level scales comprises:
and fusing the depth maps generated on the multiple underlying scales through dynamic consistency check.
4. The method of claim 1, further comprising: feature refinement is performed on the feature map using a multi-scale aggregation module.
5. The method of claim 4, wherein the multi-scale aggregation module comprises a first aggregation module and a second aggregation module; the first aggregation module comprises a hole convolution and bilinear interpolation unit, and the second aggregation module comprises a deformable convolution and bilinear interpolation unit;
the feature refinement of the feature map using a multi-scale aggregation module comprises:
performing feature refinement on the feature map on the bottom scale of each original image by using a second aggregation module;
and performing feature refinement on the feature maps on other scales of each original image by using a first aggregation module.
6. The method of claim 1, further comprising: training the hierarchical recursive convolutional network, and analyzing the hierarchical recursive convolutional network by using a total loss function;
the total loss function is defined as follows:
Loss = \sum_{k=1}^{N} \lambda_k \cdot L_k

wherein L_k represents the loss function of the k-th scale and \lambda_k represents its corresponding loss weight.
7. A multi-stage dense reconstruction optimization device with low video memory occupation is characterized by comprising the following components:
the acquisition module is used for acquiring a plurality of original images; wherein the original image comprises a reference image and a source image;
the characteristic map generation module is used for generating characteristic maps of multiple scales of each original image through a preset characteristic pyramid;
the characteristic body determining module is used for determining characteristic bodies of a plurality of scales of a plurality of source images corresponding to each reference image according to the characteristic images of the original image at the plurality of scales;
the depth map generation module is used for generating a cost body of the reference image on a top-level scale and generating a depth map by using a hierarchical recursive convolutional network; on other scales, generating a residual cost body of the reference image by referring to the depth map of the previous scale, generating a residual depth map by using a hierarchical recursive convolution network, and performing up-sampling addition on the residual depth map and the depth map of the previous scale to obtain the depth map of the current scale;
and the fusion module is used for fusing the depth maps generated on a plurality of bottom-layer scales to obtain dense point cloud.
8. A multi-stage dense reconstruction server with low video memory occupation is characterized by comprising a memory and a processor;
the memory is to store computer-executable instructions;
the processor is configured to execute the computer-executable instructions to implement the method of any of claims 1-7.
9. A computer-readable storage medium having stored thereon executable instructions that, when executed by a computer, are capable of implementing the method of any one of claims 1-7.
CN202210954308.4A 2022-08-10 2022-08-10 Multi-stage dense reconstruction method and device for low video memory occupation Pending CN115345917A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210954308.4A CN115345917A (en) 2022-08-10 2022-08-10 Multi-stage dense reconstruction method and device for low video memory occupation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210954308.4A CN115345917A (en) 2022-08-10 2022-08-10 Multi-stage dense reconstruction method and device for low video memory occupation

Publications (1)

Publication Number Publication Date
CN115345917A true CN115345917A (en) 2022-11-15

Family

ID=83952297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210954308.4A Pending CN115345917A (en) 2022-08-10 2022-08-10 Multi-stage dense reconstruction method and device for low video memory occupation

Country Status (1)

Country Link
CN (1) CN115345917A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116091712A (en) * 2023-04-12 2023-05-09 安徽大学 Multi-view three-dimensional reconstruction method and system for computing resource limited equipment

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116091712A (en) * 2023-04-12 2023-05-09 安徽大学 Multi-view three-dimensional reconstruction method and system for computing resource limited equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination