CN117726513A - Depth map super-resolution reconstruction method and system based on color image guidance

Info

Publication number: CN117726513A
Application number: CN202311574125.0A
Authority: CN (China)
Prior art keywords: feature map, depth, resolution
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 王诗言, 徐慧玲, 王译苹, 张驰, 谢博
Assignee: Chongqing University of Posts and Telecommunications
Priority and filing date: 2023-11-23
Publication date: 2024-03-19
Classification: Image Analysis

Abstract

The invention belongs to the field of computer vision and relates to a depth map super-resolution reconstruction method and system based on color image guidance. The method comprises the following steps: acquiring a low-resolution depth image and a high-resolution color image of the same scene; extracting features from the images of the two modalities with a global-local feature extraction network to obtain the feature information of the two images; and fusing the image features of the two modalities with a feature reconstruction network and reconstructing a high-resolution depth image.

Description

Depth map super-resolution reconstruction method and system based on color image guidance
Technical Field
The invention belongs to the technical field of computer vision, and relates to a depth map super-resolution reconstruction method and system based on color image guidance.
Background
Depth map super-resolution (DSR) aims to reconstruct a low-resolution (LR) depth map into a high-resolution (HR) depth map while recovering, as far as possible, the details missing from the LR input. A depth map encodes the distance between a depth camera and the objects in a scene and is widely used in scene understanding, autonomous driving, pose estimation and three-dimensional reconstruction. In the real world, acquiring the depth information of a scene relies on specialized equipment such as passive and active depth cameras. However, because of complex real-world imaging environments and the performance limits of the sensors themselves, high-resolution depth maps are difficult to acquire directly, so captured depth maps are usually of low resolution. Depth map super-resolution algorithms are therefore a current focus of research across computer vision tasks.
With the development of imaging technology, high-resolution color images of the same scene are easy to obtain, so many existing depth map super-resolution methods feed the LR depth map and the HR color image of the same scene into the network together and extract information from the color image to assist in reconstructing the HR depth map. Owing to the strong representational power of CNNs, deep-learning-based depth map super-resolution has achieved remarkable visual results in existing research. However, these methods typically carry large parameter counts and heavy computation, and because their feature extraction structures are not well designed, the feature extraction of the two modality images is incomplete. Since the color image plays an important guiding role in reconstruction, how to fully extract the features of the two modality images and better fuse them to reconstruct a high-resolution depth map is a key point of current research.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a depth map super-resolution reconstruction method and system based on color image guidance. From a global-local perspective, the method extracts representative features from the images of the two modalities with a small parameter count, so that the information in the color image can be better exploited to reconstruct a high-resolution depth image.
In order to achieve the above technical object, one aspect of the present invention provides a depth map super-resolution reconstruction method based on color image guidance, comprising: inputting a low-resolution depth image and a high-resolution color image of the same scene into a trained depth map super-resolution reconstruction model, and reconstructing a high-resolution depth image with the feature information of the high-resolution color image as a guide, wherein the depth map super-resolution reconstruction model comprises a feature extraction network and a feature reconstruction network; the feature extraction network includes: a first convolution layer, 4 serially connected feature extraction modules, and a channel attention ESA module; the feature reconstruction network includes: a first soft pooling layer, a second soft pooling layer, an MLP multi-layer perceptron, a sigmoid activation function, and two serially connected residual blocks;
the reconstruction process of the high-resolution depth image comprises the following steps:
s1: respectively inputting the low-resolution depth image and the high-resolution color image into the feature extraction network, and sequentially extracting image feature information through the first convolution layer, the 4 serially connected feature extraction modules and the channel attention ESA module to obtain a first LR depth feature map and a first HR color feature map;
s2: inputting the first LR depth feature map into a first soft pooling layer for pooling to obtain a second LR depth feature map, and inputting the first HR color feature map into a second soft pooling layer for pooling to obtain a second HR color feature map;
s3: performing feature addition on the second LR depth feature map and the second HR color feature map to obtain a first fusion feature map; inputting the first fusion feature map into the MLP multi-layer perceptron for further fusion to obtain a second fusion feature map; and inputting the second fusion feature map into a sigmoid activation function to obtain a shared weight matrix;
s4: performing a Kronecker product operation between the shared weight matrix and the first LR depth feature map and the first HR color feature map respectively to obtain a third LR depth feature map and a third HR color feature map;
s5: performing feature addition on the third LR depth feature map and the third HR color feature map to obtain a third fusion feature map; inputting the third fusion feature map into two serially connected residual blocks to output a fourth fusion feature map;
s6: performing bicubic upsampling on the low-resolution depth image and adding the result, by feature addition, to the fourth fusion feature map to obtain the high-resolution depth image (a minimal sketch of the whole pipeline follows).
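To make the data flow of S1-S6 concrete, the following is a minimal PyTorch sketch of the overall pipeline, not the patent's actual implementation: all class and parameter names are illustrative assumptions, and it assumes the LR depth map is bicubic-upsampled to the color image's resolution before feature extraction so that the two feature maps align in S3 (the patent does not spell this detail out).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthSRModel(nn.Module):
    """Hypothetical top-level wiring of the S1-S6 pipeline (names are assumptions)."""
    def __init__(self, extract_depth, extract_color, reconstruct, scale=4):
        super().__init__()
        self.extract_depth = extract_depth  # feature extraction network, depth branch
        self.extract_color = extract_color  # feature extraction network, color branch
        self.reconstruct = reconstruct      # feature reconstruction network (S2-S5)
        self.scale = scale

    def forward(self, lr_depth, hr_color):
        # Assumed alignment step: bicubic-upsample the LR depth to HR size.
        up = F.interpolate(lr_depth, scale_factor=self.scale,
                           mode='bicubic', align_corners=False)
        f_d = self.extract_depth(up)           # S1: first LR depth feature map
        f_c = self.extract_color(hr_color)     # S1: first HR color feature map
        residual = self.reconstruct(f_d, f_c)  # S2-S5: fused, refined depth residual
        return up + residual                   # S6: add the bicubic-upsampled depth
```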
Preferably, the feature map input to the i-th feature extraction module is denoted F_i, and the feature extraction process of the feature extraction module is as follows:
s11: normalizing the feature map F_i to obtain a feature map F̂_i, and splitting F̂_i along the channel dimension into 4 feature components, expressed as follows:

[F̂_i^1, F̂_i^2, F̂_i^3, F̂_i^4] = Split(F̂_i)

wherein Split represents the channel split operation;
s12: performing depthwise convolution on F̂_i^1 to obtain a feature X1;
s13: for k = 1, 2, 3, inputting F̂_i^(k+1) into a max pooling layer with a downsampling factor of 2^k and then performing depthwise convolution to obtain a feature X(k+1);
s14: inputting the features X1, X2, X3 and X4 into the feature integration module for feature fusion to obtain a feature map F_i^a, and passing F_i^a through a GELU activation function to obtain a feature map F_i^g;
S15: performing a Hadamard product operation on F_i^g and F̂_i to obtain a feature map F_i^m, and performing feature addition on F_i^m and F_i to obtain a feature map F_i^r;
S16: normalizing F_i^r to obtain a feature map F̃_i, and inputting F̃_i into the local convolution module to obtain a feature map F_i^l;
S17: performing feature addition on F_i^l and F_i^r to obtain the output feature map of the i-th feature extraction module.
Preferably, the inputting of the features X1, X2, X3 and X4 into the feature integration module for feature fusion comprises:
s141: performing a 2× up-sampling operation on the feature X4 and concatenating the result with the feature X3 to obtain a first intermediate feature;
s142: inputting the first intermediate feature into a second convolution layer for processing to obtain a second intermediate feature, performing a 2× up-sampling operation on the second intermediate feature and concatenating the result with the feature X2 to obtain a third intermediate feature;
s143: inputting the third intermediate feature into a third convolution layer for processing to obtain a fourth intermediate feature, performing a 2× up-sampling operation on the fourth intermediate feature and concatenating the result with the feature X1 to obtain a fifth intermediate feature;
s144: inputting the fifth intermediate feature into the fourth convolution layer for processing to obtain the feature map F_i^a.
Preferably, the first, second, third and fourth convolution layers are all 3×3 convolution layers.
Preferably, inputting the feature map F̃_i into the local convolution module for processing comprises:
inputting the feature map F̃_i into a fifth convolution layer and a sixth convolution layer of two different scales respectively to obtain a feature map F_i^5 and a feature map F_i^6; performing feature addition on F_i^5 and F_i^6 to obtain a feature map F_i^s; and inputting F_i^s into a seventh convolution layer for processing to obtain the feature map F_i^l.
Preferably, the sixth and seventh convolution layers are 1×1 convolution layers and the fifth convolution layer is a 3×3 convolution layer.
Preferably, the depthwise convolution is performed with a 3×3 depthwise convolution layer.
Another aspect of the present invention provides a depth map super-resolution reconstruction system based on color image guidance, the system applying the above depth map super-resolution reconstruction method based on color image guidance and comprising:
the data acquisition module is used for acquiring a low-resolution depth image and a high-resolution color image under the same scene;
the feature extraction module is used for extracting image feature information of the low-resolution depth image and the high-resolution color image;
and the feature reconstruction module is used for reconstructing a high-resolution depth image according to the image feature information of the low-resolution depth image and the high-resolution color image.
In a further aspect, the present invention provides a computer-readable storage medium storing a program which, when executed by a processor, implements the above depth map super-resolution reconstruction method based on color image guidance.
The invention has at least the following beneficial effects:
Compared with existing depth map super-resolution reconstruction methods, the invention applies channel splitting and downsampling to the normalized features before feature extraction and uses depthwise convolution during feature extraction, which reduces the parameter count. A multi-scale feature modulation structure and a local convolution module extract representative features of the two modality images from a global-local perspective. During feature fusion, a soft pooling layer combined with a multi-layer perceptron generates a shared channel attention weight that aggregates the information of the two modalities, making full use of the structural information in the color image to reconstruct the high-resolution depth image.
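As a back-of-envelope illustration of the parameter saving from depthwise convolution (the figures below are generic, not taken from the patent):

```python
# Parameter count of a 3x3 standard convolution vs. a 3x3 depthwise
# convolution at C = 64 channels (biases ignored for simplicity).
C, k = 64, 3
standard = k * k * C * C   # every output channel sees every input channel: 36864
depthwise = k * k * C      # one 3x3 filter per channel: 576
print(f"standard: {standard}, depthwise: {depthwise}, ratio: {standard // depthwise}x")
```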
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a network structure of a depth map super-resolution reconstruction model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a feature extraction network according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a feature reconstruction network according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a feature integration module according to an embodiment of the invention;
FIG. 6 is a schematic structural diagram of a local convolution module according to an embodiment of the present invention.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the disclosure below, which describes embodiments of the invention with reference to specific examples. The invention may also be practiced or carried out in other embodiments, and the details in this description may be modified or varied in various respects without departing from the spirit and scope of the invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the invention by way of example, and the following embodiments and the features in the embodiments may be combined with each other provided there is no conflict.
The drawings are for illustrative purposes only, are schematic rather than physical, and are not intended to limit the invention; for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be appreciated by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
The same or similar reference numbers in the drawings of the embodiments correspond to the same or similar components. In the description of the invention, terms such as "upper", "lower", "left", "right", "front" and "rear" indicate an orientation or positional relationship based on that shown in the drawings; they are used only for convenience and simplification of the description, do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and should not be construed as limiting the invention. The specific meaning of these terms may be understood by those of ordinary skill in the art according to the specific circumstances.
Referring to fig. 1, 2, 3 and 4, one aspect of the present invention provides a depth map super-resolution reconstruction method based on color image guidance, comprising: inputting a low-resolution depth image and a high-resolution color image of the same scene into a trained depth map super-resolution reconstruction model, and reconstructing a high-resolution depth image with the feature information of the high-resolution color image as a guide, wherein the depth map super-resolution reconstruction model comprises a feature extraction network and a feature reconstruction network; the feature extraction network includes: a first convolution layer, 4 serially connected feature extraction modules, and a channel attention ESA module; the feature reconstruction network includes: a first soft pooling layer, a second soft pooling layer, an MLP multi-layer perceptron, a sigmoid activation function, and two serially connected residual blocks (ResBlocks);
the reconstruction process of the high-resolution depth image comprises the following steps:
s1: respectively inputting the low-resolution depth image and the high-resolution color image into the feature extraction network, and sequentially extracting image feature information through the first convolution layer, the 4 serially connected feature extraction modules and the channel attention ESA module to obtain a first LR depth feature map and a first HR color feature map;
s2: inputting the first LR depth feature map into a first soft pooling layer for pooling to obtain a second LR depth feature map, and inputting the first HR color feature map into a second soft pooling layer for pooling to obtain a second HR color feature map;
s3: performing feature addition on the second LR depth feature map and the second HR color feature map to obtain a first fusion feature map; inputting the first fusion feature map into the MLP multi-layer perceptron for further fusion to obtain a second fusion feature map; and inputting the second fusion feature map into a sigmoid activation function to obtain a shared weight matrix;
s4: performing a Kronecker product operation between the shared weight matrix and the first LR depth feature map and the first HR color feature map respectively to obtain a third LR depth feature map and a third HR color feature map;
s5: performing feature addition on the third LR depth feature map and the third HR color feature map to obtain a third fusion feature map; inputting the third fusion feature map into two serially connected residual blocks to output a fourth fusion feature map;
s6: performing bicubic upsampling on the low-resolution depth image and adding the result, by feature addition, to the fourth fusion feature map to obtain the high-resolution depth image.
The feature reconstruction network consists of two soft pooling layers, a multi-layer perceptron and a sigmoid function. The features of the two modalities each pass through a soft pooling layer; the results of the two branches are added and fed into the multi-layer perceptron, and shared channel attention weights are generated by the sigmoid function. The features of each modality are modulated with these weights, the results are added and fed into two consecutive residual blocks, and the high-resolution depth map is obtained by reconstruction, as in the sketch below.
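A minimal PyTorch sketch of this reconstruction network follows. It assumes soft pooling is the softmax-weighted spatial pooling of SoftPool, realises the MLP as two 1×1 convolutions on the pooled descriptor, and reads the "Kronecker product" of S4 as the usual broadcast multiplication of a channel weight with a feature map; the channel width, reduction ratio and final 1-channel projection are illustrative assumptions.

```python
import torch
import torch.nn as nn

def soft_pool(x):
    # SoftPool over the spatial dims: softmax(x)-weighted average, (B,C,H,W)->(B,C,1,1).
    w = torch.softmax(x.flatten(2), dim=-1)
    return (w * x.flatten(2)).sum(-1)[..., None, None]

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class FeatureReconstructionNet(nn.Module):
    def __init__(self, ch=64, reduction=4):
        super().__init__()
        # MLP realised as two 1x1 convolutions on the (B,C,1,1) descriptor.
        self.mlp = nn.Sequential(
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1))
        self.res = nn.Sequential(ResBlock(ch), ResBlock(ch))
        self.out = nn.Conv2d(ch, 1, 3, padding=1)  # assumed projection to a depth map

    def forward(self, f_depth, f_color):
        # S2: soft pooling per modality; S3: add, MLP, sigmoid -> shared weights.
        w = torch.sigmoid(self.mlp(soft_pool(f_depth) + soft_pool(f_color)))
        # S4: modulate both feature maps with the shared channel weights
        # (the patent's "Kronecker product" read here as broadcast multiplication).
        f_d, f_c = w * f_depth, w * f_color
        # S5: add and refine with two residual blocks; S6 adds the bicubic skip outside.
        return self.out(self.res(f_d + f_c))
```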
In this embodiment, the low resolution image may be obtained from the existing public data set, or may be obtained by shooting with a camera, a video camera, a mobile phone, or other devices, which is not particularly limited in the present invention.
In this embodiment, when training the depth map super-resolution reconstruction model, 1449 image pairs from the NYU V2 dataset, 30 image pairs from the Middlebury dataset and 6 image pairs from the Lu dataset are collected as the model sample set. In the embodiment of the invention, the low-resolution depth maps are obtained by bicubic downsampling of the high-resolution depth images in the sample dataset (a minimal sketch follows). In addition, 449 pairs of the NYU V2 dataset are used for validation during training.
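For instance, the bicubic degradation can be sketched as follows (a generic snippet, not code from the patent):

```python
import torch.nn.functional as F

def make_lr_depth(hr_depth, scale):
    """Bicubic-downsample an HR depth tensor of shape (B, 1, H, W) by factor `scale`."""
    return F.interpolate(hr_depth, scale_factor=1.0 / scale,
                         mode='bicubic', align_corners=False)
```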
Referring to fig. 3, preferably, the feature map input to the i-th feature extraction module is denoted F_i, and the feature extraction process of the feature extraction module is as follows:
s11: normalizing the feature map F_i to obtain a feature map F̂_i, and splitting F̂_i along the channel dimension into 4 feature components, expressed as follows:

[F̂_i^1, F̂_i^2, F̂_i^3, F̂_i^4] = Split(F̂_i)

wherein Split represents the channel split operation;
s12: performing depthwise convolution on F̂_i^1 to obtain a feature X1;
s13: for k = 1, 2, 3, inputting F̂_i^(k+1) into a max pooling layer with a downsampling factor of 2^k and then performing depthwise convolution to obtain a feature X(k+1);
s14: inputting the features X1, X2, X3 and X4 into the feature integration module for feature fusion to obtain a feature map F_i^a, and passing F_i^a through a GELU activation function to obtain a feature map F_i^g;
S15: performing a Hadamard product operation on F_i^g and F̂_i to obtain a feature map F_i^m, and performing feature addition on F_i^m and F_i to obtain a feature map F_i^r;
S16: normalizing F_i^r to obtain a feature map F̃_i, and inputting F̃_i into the local convolution module to obtain a feature map F_i^l;
S17: performing feature addition on F_i^l and F_i^r to obtain the output feature map of the i-th feature extraction module (see the module sketch below).
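The following PyTorch sketch wires S11-S17 together. It assumes the normalization is a layer-style normalization (GroupNorm with one group here, since the patent does not specify), takes the feature integration module (sketched after S144 below) and the local convolution module (sketched after fig. 6 below) as injected sub-modules, and uses illustrative channel widths.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtractionModule(nn.Module):
    """Sketch of S11-S17; `integrate` and `local_conv` are the sub-module
    sketches given below. Input (B, ch, H, W) with H, W divisible by 8."""
    def __init__(self, integrate, local_conv, ch=64):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, ch)  # stands in for the S11 normalization
        self.norm2 = nn.GroupNorm(1, ch)  # stands in for the S16 normalization
        c4 = ch // 4
        self.dw = nn.ModuleList(          # 3x3 depthwise conv per branch (S12/S13)
            [nn.Conv2d(c4, c4, 3, padding=1, groups=c4) for _ in range(4)])
        self.integrate = integrate        # feature integration module (S14)
        self.local_conv = local_conv      # local convolution module (S16)
        self.act = nn.GELU()

    def forward(self, f):
        f_hat = self.norm1(f)
        parts = torch.chunk(f_hat, 4, dim=1)          # S11: split into 4 components
        xs = [self.dw[0](parts[0])]                   # S12: X1
        for k in (1, 2, 3):                           # S13: 2^k max pooling + depthwise
            xs.append(self.dw[k](F.max_pool2d(parts[k], kernel_size=2 ** k)))
        g = self.act(self.integrate(xs))              # S14: fuse X1..X4, then GELU
        f_r = g * f_hat + f                           # S15: Hadamard product + addition
        f_l = self.local_conv(self.norm2(f_r))        # S16: normalize, local convolution
        return f_l + f_r                              # S17: module output
```

With the FeatureIntegration and LocalConvBlock sketches below, `FeatureExtractionModule(FeatureIntegration(16), LocalConvBlock(64))` processes a (B, 64, H, W) tensor.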
The whole feature extraction network is constructed from both global and local perspectives, and long skip connections make full use of the low-frequency information in the image, so that the extracted feature information of the two modalities can further help the super-resolution reconstruction network generate a more accurate and sharper high-resolution depth map.
Referring to fig. 5, preferably, the inputting of the features X1, X2, X3 and X4 into the feature integration module for feature fusion comprises:
s141: performing a 2× up-sampling operation on the feature X4 and concatenating the result with the feature X3 to obtain a first intermediate feature;
s142: inputting the first intermediate feature into a second convolution layer for processing to obtain a second intermediate feature, performing a 2× up-sampling operation on the second intermediate feature and concatenating the result with the feature X2 to obtain a third intermediate feature;
s143: inputting the third intermediate feature into a third convolution layer for processing to obtain a fourth intermediate feature, performing a 2× up-sampling operation on the fourth intermediate feature and concatenating the result with the feature X1 to obtain a fifth intermediate feature;
s144: inputting the fifth intermediate feature into the fourth convolution layer for processing to obtain the feature map F_i^a (see the sketch below).
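A corresponding sketch of S141-S144 follows; the channel widths are assumptions, and nearest-neighbour up-sampling is assumed since the patent does not name the interpolation method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureIntegration(nn.Module):
    """Top-down fusion of the multi-scale features X1..X4 (S141-S144).
    Each fusion convolution is 3x3, matching the stated preference."""
    def __init__(self, c4=16):
        super().__init__()
        self.conv2 = nn.Conv2d(2 * c4, c4, 3, padding=1)      # second convolution layer
        self.conv3 = nn.Conv2d(2 * c4, c4, 3, padding=1)      # third convolution layer
        self.conv4 = nn.Conv2d(2 * c4, 4 * c4, 3, padding=1)  # fourth convolution layer

    def forward(self, xs):
        x1, x2, x3, x4 = xs
        up = lambda t: F.interpolate(t, scale_factor=2, mode='nearest')
        m = torch.cat([up(x4), x3], dim=1)           # S141: upsample X4 2x, concat X3
        m = torch.cat([up(self.conv2(m)), x2], dim=1)  # S142
        m = torch.cat([up(self.conv3(m)), x1], dim=1)  # S143
        return self.conv4(m)                           # S144: fused feature map
```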
Preferably, the first, second, third and fourth convolution layers are all 3×3 convolution layers.
Referring to fig. 6, preferably, inputting the feature map F̃_i into the local convolution module for processing comprises:
inputting the feature map F̃_i into a fifth convolution layer and a sixth convolution layer of two different scales respectively to obtain a feature map F_i^5 and a feature map F_i^6; performing feature addition on F_i^5 and F_i^6 to obtain a feature map F_i^s; and inputting F_i^s into a seventh convolution layer for processing to obtain the feature map F_i^l.
Preferably, the sixth and seventh convolution layers are 1×1 convolution layers and the fifth convolution layer is a 3×3 convolution layer.
Preferably, the depthwise convolution is performed with a 3×3 depthwise convolution layer.
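Putting the stated preferences together, the local convolution module can be sketched as follows (the channel width is an illustrative assumption):

```python
import torch.nn as nn

class LocalConvBlock(nn.Module):
    """Local convolution module: two parallel convolutions at different scales
    (3x3 and 1x1 per the stated preference), feature addition, then a 1x1 conv."""
    def __init__(self, ch=64):
        super().__init__()
        self.conv5 = nn.Conv2d(ch, ch, 3, padding=1)  # fifth convolution layer, 3x3
        self.conv6 = nn.Conv2d(ch, ch, 1)             # sixth convolution layer, 1x1
        self.conv7 = nn.Conv2d(ch, ch, 1)             # seventh convolution layer, 1x1

    def forward(self, x):
        return self.conv7(self.conv5(x) + self.conv6(x))
```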
In the embodiment of the invention, the 1000 high-resolution depth maps with indices 0 to 999 in the public NYU V2 dataset are randomly cropped to 256×256 and bicubic-downsampled with a scaling factor s to obtain low-resolution depth maps; these, together with the high-resolution color images paired with the high-resolution depth maps in the dataset, are used to train the network. In the invention s takes the values 4, 8 and 16, but this is not limiting. The 449 image pairs with indices 1000 to 1448 in the NYU V2 dataset, together with the Middlebury and Lu datasets, serve as test sets to evaluate the performance of the network; training is carried out under the chosen initial learning rate, optimizer, loss function and number of iterations, and the final reconstruction effect is obtained through test comparison. During training, parameters are updated with the Adam optimizer, where β1 is 0.9 and β2 is 0.99; the initial learning rate is set to 1×10^-4 and multiplied by 0.2 every 60 epochs; a total of 200 epochs are trained with the batch size set to 1 (see the sketch below).
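Under these hyper-parameters, the optimizer and schedule can be set up as in the following sketch (`model` is assumed to be the assembled network):

```python
import torch

def build_optimizer(model):
    """Adam/StepLR setup matching the stated hyper-parameters: beta1 = 0.9,
    beta2 = 0.99, initial lr 1e-4, multiplied by 0.2 every 60 epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.99))
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.2)
    return optimizer, scheduler
```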
In this embodiment, the loss function is constructed by minimizing the L1 distance between the predicted result and the ground truth, expressed as:

L = (1/|N|) Σ_{i∈N} ‖ D_SR^i − D_GT^i ‖_1

wherein D_SR is the reconstructed high-resolution depth image, D_GT is the ground-truth high-resolution depth label, N is the set of sampling points, and ‖·‖_1 denotes the L1 norm.
In summary, the depth map super-resolution reconstruction model of the invention provides a simple and efficient deep CNN model for the depth map super-resolution problem. The proposed feature extraction module extracts input information from a global perspective through a modulation mechanism based on multi-scale feature representation, while the proposed local convolution block encodes the spatial context of the features and supplements local information. Unlike prior depth map super-resolution reconstruction algorithms, the method extracts global and local context information simultaneously when extracting features from the two modality images, obtaining richer feature information with fewer parameters, so that the super-resolution reconstruction network can fuse the features better and reconstruct a higher-quality depth map.
Through the above iterative loop, a trained depth map super-resolution reconstruction model is obtained, and super-resolution reconstruction can be performed on a low-resolution depth map to be processed to obtain a high-quality depth image.
Another aspect of the present invention provides a depth map super-resolution reconstruction system based on color image guidance, the system applying the above depth map super-resolution reconstruction method based on color image guidance and comprising:
the data acquisition module is used for acquiring a low-resolution depth image and a high-resolution color image under the same scene;
the feature extraction module is used for extracting image feature information of the low-resolution depth image and the high-resolution color image;
and the feature reconstruction module is used for reconstructing a high-resolution depth image according to the image feature information of the low-resolution depth image and the high-resolution color image.
In a further aspect, the present invention provides a computer-readable storage medium storing a program which, when executed by a processor, implements the above depth map super-resolution reconstruction method based on color image guidance.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program for instructing relevant hardware, where the program may be stored in a non-volatile computer readable storage medium, and where the program, when executed, may include processes in the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims (9)

1. A depth map super-resolution reconstruction method based on color image guidance, comprising: inputting a low-resolution depth image and a high-resolution color image of the same scene into a trained depth map super-resolution reconstruction model, and reconstructing a high-resolution depth image with the feature information of the high-resolution color image as a guide, wherein the depth map super-resolution reconstruction model comprises a feature extraction network and a feature reconstruction network; the feature extraction network includes: a first convolution layer, 4 serially connected feature extraction modules, and a channel attention ESA module; the feature reconstruction network includes: a first soft pooling layer, a second soft pooling layer, an MLP multi-layer perceptron, a sigmoid activation function, and two serially connected residual blocks;
the reconstruction process of the high-resolution depth image comprises the following steps:
s1: respectively inputting the low-resolution depth image and the high-resolution color image into the feature extraction network, and sequentially extracting image feature information through the first convolution layer, the 4 serially connected feature extraction modules and the channel attention ESA module to obtain a first LR depth feature map and a first HR color feature map;
s2: inputting the first LR depth feature map into a first soft pooling layer for pooling to obtain a second LR depth feature map, and inputting the first HR color feature map into a second soft pooling layer for pooling to obtain a second HR color feature map;
s3: performing feature addition on the second LR depth feature map and the second HR color feature map to obtain a first fusion feature map; inputting the first fusion feature map into the MLP multi-layer perceptron for further fusion to obtain a second fusion feature map; and inputting the second fusion feature map into a sigmoid activation function to obtain a shared weight matrix;
s4: performing a Kronecker product operation between the shared weight matrix and the first LR depth feature map and the first HR color feature map respectively to obtain a third LR depth feature map and a third HR color feature map;
s5: performing feature addition on the third LR depth feature map and the third HR color feature map to obtain a third fusion feature map; inputting the third fusion feature map into two serially connected residual blocks to output a fourth fusion feature map;
s6: performing bicubic upsampling on the low-resolution depth image and adding the result, by feature addition, to the fourth fusion feature map to obtain the high-resolution depth image.
2. The color image guided depth map super-resolution reconstruction method according to claim 1, wherein the feature map input to the i-th feature extraction module is denoted F_i, and the feature extraction process of the feature extraction module is as follows:
s11: normalizing the feature map F_i to obtain a feature map F̂_i, and splitting F̂_i along the channel dimension into 4 feature components, expressed as follows:

[F̂_i^1, F̂_i^2, F̂_i^3, F̂_i^4] = Split(F̂_i)

wherein Split represents the channel split operation;
s12: performing depthwise convolution on F̂_i^1 to obtain a feature X1;
s13: for k = 1, 2, 3, inputting F̂_i^(k+1) into a max pooling layer with a downsampling factor of 2^k and then performing depthwise convolution to obtain a feature X(k+1);
s14: inputting the features X1, X2, X3 and X4 into the feature integration module for feature fusion to obtain a feature map F_i^a, and passing F_i^a through a GELU activation function to obtain a feature map F_i^g;
S15: performing a Hadamard product operation on F_i^g and F̂_i to obtain a feature map F_i^m, and performing feature addition on F_i^m and F_i to obtain a feature map F_i^r;
S16: normalizing F_i^r to obtain a feature map F̃_i, and inputting F̃_i into the local convolution module to obtain a feature map F_i^l;
S17: performing feature addition on F_i^l and F_i^r to obtain the output feature map of the i-th feature extraction module.
3. The method for reconstructing a depth map based on color image guidance according to claim 2, wherein the inputting of the features X1, X2, X3 and X4 into the feature integration module for feature fusion comprises:
s141: performing a 2× up-sampling operation on the feature X4 and concatenating the result with the feature X3 to obtain a first intermediate feature;
s142: inputting the first intermediate feature into a second convolution layer for processing to obtain a second intermediate feature, performing a 2× up-sampling operation on the second intermediate feature and concatenating the result with the feature X2 to obtain a third intermediate feature;
s143: inputting the third intermediate feature into a third convolution layer for processing to obtain a fourth intermediate feature, performing a 2× up-sampling operation on the fourth intermediate feature and concatenating the result with the feature X1 to obtain a fifth intermediate feature;
s144: inputting the fifth intermediate feature into the fourth convolution layer for processing to obtain the feature map F_i^a.
4. A depth map super-resolution reconstruction method based on color image guidance as defined in claim 3, wherein the first, second, third and fourth convolution layers are all 3×3 convolution layers.
5. The color image guided depth map super-resolution reconstruction method according to claim 2, wherein inputting the feature map F̃_i into the local convolution module for processing comprises:
inputting the feature map F̃_i into a fifth convolution layer and a sixth convolution layer of two different scales respectively to obtain a feature map F_i^5 and a feature map F_i^6; performing feature addition on F_i^5 and F_i^6 to obtain a feature map F_i^s; and inputting F_i^s into a seventh convolution layer for processing to obtain the feature map F_i^l.
6. The color image guided depth map super-resolution reconstruction method of claim 5, wherein said sixth and seventh convolution layers are 1×1 convolution layers and said fifth convolution layer is a 3×3 convolution layer.
7. The color image guided depth map super-resolution reconstruction method according to claim 2, wherein the depthwise convolution is performed with a 3×3 depthwise convolution layer.
8. A depth map super-resolution reconstruction system based on color image guidance, the system applying the depth map super-resolution reconstruction method based on color image guidance according to any one of claims 1 to 7 and comprising:
the data acquisition module is used for acquiring a low-resolution depth image and a high-resolution color image under the same scene;
the feature extraction module is used for extracting image feature information of the low-resolution depth image and the high-resolution color image;
and the feature reconstruction module is used for reconstructing a high-resolution depth image according to the image feature information of the low-resolution depth image and the high-resolution color image.
9. A computer-readable storage medium storing a program, wherein the program, when executed by a processor, implements the color image guided depth map super-resolution reconstruction method according to any one of claims 1 to 7.