CN114723915A - Dense point cloud generation method based on multi-view infrared - Google Patents

Dense point cloud generation method based on multi-view infrared

Info

Publication number
CN114723915A
Authority
CN
China
Prior art keywords: layer, convolution, resolution, residual, low
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210290383.5A
Other languages
Chinese (zh)
Inventor
高大化
李太行
朱浩男
马赛
李文鑫
张一诺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202210290383.5A
Publication of CN114723915A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 17/20 Finite element generation, e.g. wire-frame surface description, tessellation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a dense point cloud generation method based on multi-view infrared, which comprises the following steps: acquiring the internal and external parameters of the infrared cameras; capturing multi-view first low-resolution infrared images and a first high-resolution infrared image with multi-view low-resolution infrared cameras and a high-resolution infrared camera; constructing a data set; training a dense point cloud generation network on the data set until its loss function converges; and inputting the multi-view low-resolution infrared images to be processed into the trained dense point cloud generation network to obtain their disparity maps and a high-resolution infrared image, from which the dense point cloud is generated. The invention adopts a multi-view, purely visual scheme with strong anti-interference capability, avoiding the limitations of ranging schemes such as laser radar (mutual interference) and active infrared (unusable outdoors and in other strong-light environments), thereby improving the robustness of the constructed system.

Description

Dense point cloud generation method based on multi-view infrared
Technical Field
The invention belongs to the technical field of infrared image processing, and relates to a dense point cloud generation method based on multi-view infrared.
Background
Point cloud data describe the three-dimensional appearance of a scene by means of the data of all points in a given three-dimensional coordinate system; the data contain the three-dimensional coordinates X, Y, Z of each point together with intensity information. Point cloud data can be acquired in various ways, for example with laser radar, RGB-D cameras, or binocular vision. Because point cloud data contain a large amount of three-dimensional structural information, they are widely used in modeling, surveying and mapping, automatic driving, medical applications, and so on. According to the density of the points, point cloud data can be divided into sparse point clouds and dense point clouds; the high point density of dense point clouds improves the effectiveness of downstream tasks. In recent years artificial intelligence has been booming, deep learning algorithms are gradually replacing traditional algorithms in many areas, and they perform excellently on stereo matching and binocular image super-resolution reconstruction. Multi-view images contain the three-dimensional relationships of the scene and, compared with monocular images, provide a higher sampling density and therefore richer information, which can improve the quality of the generated point cloud. However, because the input multi-view images differ in parallax, color, resolution and so on, solving the alignment and registration between them is a challenge when using deep learning neural network algorithms. Registering the multi-view images helps to fully exploit the information they contain and improves the point cloud generation result.
In the patent document "Airborne laser radar point cloud generation method and system" (patent application No. 2021113368882, application publication No. CN113933861A), Chengdu navigation intelligent core science and technology limited company proposes an airborne laser radar point cloud generation method. The method comprises the following steps: 1. acquiring GNSS data and IMU data in the current calculation period as first data; 2. acquiring first pose information of the carrier in the current calculation period according to the first data and the optimized first pose information of the previous calculation period; 3. acquiring laser radar data in the current scanning period as second data, and matching the second data with the latest local map to obtain second pose information of the carrier in the current scanning period; 4. performing fusion filtering on the latest first pose information and the latest second pose information in the current filtering period to obtain third pose information and the pose information error corresponding to the current filtering period; 5. optimizing the first pose information of the current calculation period according to the latest pose information error to obtain the optimized first pose information of the current calculation period; 6. performing spatial transformation on the third pose information and the second data to obtain the point cloud data of the current filtering period, and generating a local map from the point cloud data of a plurality of filtering periods.
However, this method has disadvantages: because it uses a laser radar as the acquisition equipment for generating the point cloud, on the one hand the equipment is expensive and easily interfered with, and on the other hand, if additional point information such as infrared intensity is needed, the point cloud can only be used after registration, and the upper limit of the resolution is bounded by the highest resolution of the equipment.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a dense point cloud generating method based on multi-view infrared. The technical problem to be solved by the invention is realized by the following technical scheme:
the embodiment of the invention provides a dense point cloud generation method based on multi-view infrared, which comprises the following steps:
acquiring internal parameters and external parameters of a multi-view low-resolution infrared camera and a high-resolution infrared camera;
obtaining a multi-view first low-resolution infrared image and a first high-resolution infrared image by using a multi-view low-resolution infrared camera and a high-resolution infrared camera;
constructing a data set, wherein the data set comprises internal parameters and external parameters of a multi-view low-resolution infrared camera and a high-resolution infrared camera, a multi-view second low-resolution infrared image and a second high-resolution infrared image, the second low-resolution infrared image is obtained by intercepting the first low-resolution infrared image, and the second high-resolution infrared image is obtained by intercepting the first high-resolution infrared image;
inputting the data set into a dense point cloud generation network to be trained until the loss function converges, so as to obtain the trained dense point cloud generation network, wherein the dense point cloud generation network comprises a depth estimation sub-network, a multi-view infrared image registration module and a multi-view super-resolution sub-network;
and inputting the multi-view low-resolution infrared images to be processed into the trained dense point cloud generation network to obtain the disparity maps of the multi-view low-resolution infrared images and a high-resolution infrared image, and generating the dense point cloud according to the disparity maps and the high-resolution infrared image.
In one embodiment of the invention, acquiring internal parameters and external parameters of a multi-view low-resolution infrared camera and a high-resolution infrared camera comprises:
shooting an infrared calibration target by using a multi-view low-resolution infrared camera and a high-resolution infrared camera simultaneously to obtain a plurality of groups of calibration images;
and calibrating by using the calibration image to obtain internal parameters and external parameters of all cameras.
In an embodiment of the present invention, inputting the data set to a dense point cloud generating network to be trained until a loss function converges to obtain a trained dense point cloud generating network, includes:
inputting the data set into the depth estimation sub-network to obtain a disparity map of a multi-view second low-resolution infrared image, and converting the disparity map of the multi-view second low-resolution infrared image into a depth map;
registering the multi-view second low-resolution infrared image by a multi-view infrared image registration module through a projection method based on a projection matrix between the low-resolution infrared camera and the high-resolution infrared camera and the multi-view depth map of the second low-resolution infrared image to obtain a multi-view registered second low-resolution infrared image;
for each view, inputting the 2-channel data formed by concatenating the registered second low-resolution infrared image with its corresponding depth map into the multi-view super-resolution sub-network to obtain a high-resolution infrared image;
and optimizing the loss function by a gradient descent method, and iteratively updating the parameters of the dense point cloud generation network until the loss function is converged to obtain the trained dense point cloud generation network.
In one embodiment of the invention, the loss function is:
L = λ1·L_SR + λ2·L_REP

L_SR = Σ||I_sr - I_hr||_2^2

L_REP = Σ||I_rep - I_lr||_1

wherein L represents the loss function of the dense point cloud generation network, L_SR represents the super-resolution loss, L_REP represents the reprojection loss, ||·||_2^2 represents the mean square error operation, ||·||_1 represents the 1-norm operation, I_sr represents the super-resolution infrared image, I_hr represents the reprojected infrared image obtained by projecting the high-resolution infrared image in the data set onto the multi-view low-resolution infrared camera positions, I_rep represents the image obtained by reprojecting a low-resolution infrared image onto the position of another low-resolution infrared image, I_lr represents the low-resolution infrared images in the data set, and λ1 and λ2 represent the weights of the super-resolution loss and the reprojection loss.
In one embodiment of the present invention, the training process of the depth estimation sub-network is:
inputting a third low-resolution infrared image and a fourth low-resolution infrared image into the depth estimation sub-network to obtain a disparity map of the third low-resolution infrared image;
obtaining a projection diagram from the fourth low-resolution infrared image to the third low-resolution infrared image based on the fourth low-resolution infrared image, a depth diagram corresponding to the disparity map of the third low-resolution infrared image, and internal and external parameters;
and optimizing the loss function of the depth estimation sub-network by a gradient descent method, and iteratively updating the parameters of the depth estimation sub-network until the loss function of the depth estimation sub-network is converged to obtain a trained depth estimation sub-network.
In one embodiment of the present invention, the structure of the depth estimation sub-network sequentially comprises:
an input layer, a first convolution layer, a first residual layer, a second convolution layer, a second residual layer, a third convolution layer, a third residual layer, a fourth convolution layer, a fourth residual layer, a fifth convolution layer, a fifth residual layer, a sixth convolution layer, a sixth residual layer, a seventh convolution layer, a seventh residual layer, an eighth convolution layer, an eighth residual layer, a ninth convolution layer, a parallax cost aggregation layer, a first 3D convolution layer, a first 3D residual layer, a second 3D convolution layer, a third 3D convolution layer, a second 3D residual layer, a fourth 3D convolution layer, a fifth 3D convolution layer, a third 3D residual layer, a sixth 3D convolution layer, a seventh 3D convolution layer, a fourth 3D residual layer, an eighth 3D convolution layer, a ninth 3D convolution layer, a fifth 3D residual layer, a first 3D deconvolution layer, a second 3D deconvolution layer, a third 3D deconvolution layer, a fourth 3D deconvolution layer, a fifth 3D deconvolution layer, and a parallax regression layer;
the input of the Mth convolution layer is the addition of the Nth convolution layer and the Nth residual error layer respectively, N is one to eight, and M is N plus one; the input of the second 3D convolutional layer, the fourth 3D convolutional layer, the sixth 3D convolutional layer, and the eighth 3D convolutional layer is sequentially the output of the parallax cost aggregation layer, the output of the second 3D convolutional layer, the output of the fourth 3D convolutional layer, and the output of the sixth 3D convolutional layer, respectively; the inputs of the second 3D deconvolution layer, the third 3D deconvolution layer, the fourth 3D deconvolution layer, and the fifth 3D deconvolution layer are sequentially the sum of the output of the first 3D deconvolution layer and the output of the fourth 3D residual layer, the sum of the output of the second 3D deconvolution layer and the output of the third 3D residual layer, the sum of the output of the third 3D deconvolution layer and the output of the second 3D residual layer, and the sum of the output of the fourth 3D deconvolution layer and the output of the first 3D residual layer, respectively.
In one embodiment of the present invention, the convolution kernel size of the input layer is set to 5 × 5, the step size is set to 2, and the output feature map channel size is set to 32;
convolution kernels of the first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer, the seventh convolution layer, the eighth convolution layer and the ninth convolution layer are all set to be 3 x 3, step lengths are all set to be 1, and channel sizes of output feature maps are all set to be 32;
convolution kernel sizes of the first residual layer, the second residual layer, the third residual layer, the fourth residual layer, the fifth residual layer, the sixth residual layer, the seventh residual layer and the eighth residual layer are all set to be 3 x 3, step sizes are all set to be 1, and sizes of output feature map channels are all set to be 32;
the convolution kernel size of the first 3D convolution layer is set to be 3 multiplied by 3, the step length is set to be 1, and the output characteristic diagram channel size is set to be 32;
the sizes of convolution kernels of the second 3D convolution layer, the fourth 3D convolution layer and the sixth 3D convolution layer are all set to be 3 multiplied by 3, the step length is all set to be 2, and the sizes of output characteristic diagram channels are all set to be 64;
the sizes of convolution kernels of the third 3D convolution layer, the fifth 3D convolution layer and the seventh 3D convolution layer are all set to be 3 multiplied by 3, the step length is all set to be 1, and the sizes of output characteristic diagram channels are all set to be 64;
the sizes of convolution kernels of the eighth 3D convolution layer and the ninth 3D convolution layer are set to be 3 multiplied by 3, the step sizes are set to be 2, and the sizes of output characteristic diagram channels are set to be 128;
the convolution kernel size of the first 3D residual layer is set to be 3 multiplied by 3, the step length is set to be 1, and the size of an output feature map channel is set to be 32;
the sizes of convolution kernels of the second 3D residual layer, the third 3D residual layer and the fourth 3D residual layer are all set to be 3 multiplied by 3, the step length is all set to be 1, and the sizes of output characteristic diagram channels are all set to be 64;
the convolution kernel size of the fifth 3D residual layer is set to be 3 multiplied by 3, the step size is set to be 1, and the size of an output feature map channel is set to be 128;
the convolution kernel size of the first 3D deconvolution layer, the second 3D deconvolution layer and the third 3D deconvolution layer is set to be 3 x 3, the step length is set to be 2, and the output feature map channel size is set to be 64;
the convolution kernel size of the fourth 3D deconvolution layer is set to be 3 multiplied by 3, the step length is set to be 2, and the size of an output feature map channel is set to be 32;
the convolution kernel size of the fifth 3D deconvolution layer is set to 3 x 3, the step size is set to 1, and the output feature map channel size is set to 1.
In one embodiment of the present invention, the structure of the multi-view super-resolution subnetwork sequentially comprises:
a feature extraction layer, a tenth convolution layer, a ninth residual layer, an eleventh convolution layer, a tenth residual layer, a twelfth convolution layer, an eleventh residual layer, a thirteenth convolution layer, a twelfth residual layer, a fourteenth convolution layer, a thirteenth residual layer, a fifteenth convolution layer, a fourteenth residual layer, a sixteenth convolution layer, a fifteenth residual layer, a seventeenth convolution layer, a sixteenth residual layer, an eighteenth convolution layer, a seventeenth residual layer, a nineteenth convolution layer, an eighteenth residual layer, a twentieth convolution layer, a nineteenth residual layer, a twenty-first convolution layer, a twentieth residual layer, a twenty-second convolution layer, a twenty-first residual layer, a twenty-third convolution layer, a twenty-second residual layer, a twenty-fourth convolution layer, a twenty-third residual layer, a twenty-fifth convolution layer, a twenty-fourth residual layer, a pixel recombination layer, and an output layer;
wherein the feature extraction layer shares weights across views, and its input is the 2-channel data formed by concatenating each projection-registered low-resolution infrared image with its corresponding depth map; the input of the tenth convolution layer is the feature map obtained by concatenating, along the channel direction, the multi-path output feature maps produced by passing the multi-view low-resolution images through the weight-shared feature extraction layer; the input of each remaining convolution layer is the sum of the output of the previous convolution layer and the output of the previous residual layer; and the input of the pixel recombination layer is the sum of the output of the twenty-fourth residual layer and the output of the feature extraction layer.
In an embodiment of the present invention, the convolution kernel size of the feature extraction layer is set to 3 × 3, the stride is set to 1, and the output feature map channel size is k × 64, where k is the number of views of the multi-view low-resolution infrared cameras;
convolution kernel sizes of the tenth, eleventh, twelfth, thirteenth, fourteenth, fifteenth, sixteenth, seventeenth, eighteenth, nineteenth, twentieth, twenty-first, twenty-second, twenty-third, twenty-fourth, twenty-fifth convolution layers are all set to 3 × 3, step sizes are all set to 1, and output feature map channel sizes are all set to 64;
the convolution kernel sizes of the ninth, tenth, eleventh, twelfth, thirteenth, fourteenth, fifteenth, sixteenth, seventeenth, eighteenth, nineteenth, twentieth, twenty-first, twenty-second, and twenty-third residual layers are all set to 3 × 3, the step size is all set to 1, and the output feature map channel size is all set to 64;
the convolution kernel size of the twenty-fourth residual layer is set to be 3 x 3, the step length is set to be 1, and the size of an output feature map channel is set to be k x 64;
the amplification factor of the pixel recombination layer is set to s and the output feature map channel size is set to 64, where s is the super-resolution image reconstruction multiple, s = 2^n, and n is an integer greater than or equal to 1;
the convolution kernel size of the output layer is set to be 3 x 3, the step size is set to be 1, and the output characteristic diagram channel size is set to be 2.
In one embodiment of the invention, generating a dense point cloud according to a disparity map of a multi-view low-resolution infrared image and a high-resolution infrared image comprises:
correspondingly obtaining a multi-view low-resolution depth map according to the parallax map of the multi-view low-resolution infrared image;
respectively carrying out interpolation up-sampling processing on the multi-view low-resolution depth map to correspondingly obtain a multi-view high-resolution depth map;
calculating the average value of the multi-view high-resolution depth map, and taking the average value as a final high-resolution depth map;
obtaining three-dimensional coordinates of the dense point cloud according to the internal reference inverse matrix of the high-resolution infrared image, the final high-resolution depth map and the pixel coordinates of the high-resolution infrared image;
and giving the dense point cloud to the pixel value of the high-resolution infrared image according to the pixel coordinate correspondence so as to generate the dyed dense point cloud.
Compared with the prior art, the invention has the beneficial effects that:
Firstly, the invention adopts a multi-view, purely visual scheme with strong anti-interference capability, which avoids the limitations of ranging schemes such as laser radar (mutual interference) and active infrared (unusable outdoors and in other strong-light environments), thereby improving the robustness of the constructed system.
Secondly, the invention constructs a dense point cloud generation network whose multi-view super-resolution sub-network contains a super-resolution structure, so the network model can be effectively supervised by the high-resolution input. The invention thus improves the resolution of the generated point cloud, achieves performance beyond the upper limit of the input acquisition equipment, and improves both the visual experience and the quality of the input data for subsequent tasks.
Thirdly, the invention adds a reprojection loss to the total loss function, which allows the depth estimation to be supervised using the multi-view images themselves, making a multi-view purely visual scheme feasible.
Fourthly, the multi-view purely visual scheme uses the most common and inexpensive data acquisition equipment, which effectively reduces equipment and data acquisition costs and lowers the hardware requirements for building the system.
Other aspects and features of the present invention will become apparent from the following detailed description, which proceeds with reference to the accompanying drawings. It is to be understood, however, that the drawings are designed solely for purposes of illustration and not as a definition of the limits of the invention, for which reference should be made to the appended claims. It should be further understood that the drawings are not necessarily drawn to scale and that, unless otherwise indicated, they are merely intended to conceptually illustrate the structures and procedures described herein.
Drawings
Fig. 1 is a schematic flowchart of a method for generating a dense point cloud based on multi-view infrared according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a constructed dense point cloud generation network according to an embodiment of the present invention;
fig. 3a to fig. 3f are simulation group diagrams of the method for generating dense point cloud based on multi-view infrared according to the embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
Example one
Referring to fig. 1 and fig. 2, fig. 1 is a schematic flow diagram of a dense point cloud generating method based on multi-view infrared according to an embodiment of the present invention, and fig. 2 is a schematic diagram of a dense point cloud generating network constructed according to an embodiment of the present invention, where the present invention provides a dense point cloud generating method based on multi-view infrared, and the dense point cloud generating method includes:
step 1, obtaining internal parameters and external parameters of a multi-view low-resolution infrared camera and a high-resolution infrared camera.
Specifically, the internal parameters and external parameters of the multi-view low-resolution infrared cameras and of the high-resolution infrared camera need to be acquired. The internal parameters are attributes of the camera itself, given by the intrinsic matrix K; the external parameters describe the pose relationship between one infrared camera and another, i.e., a rotation matrix R and a translation matrix T that form a projection matrix.
In a specific embodiment, step 1 may specifically include:
step 1.1, shooting the infrared calibration target by using a multi-eye low-resolution infrared camera and a high-resolution infrared camera simultaneously to obtain a plurality of groups of calibration images.
Specifically, the multi-view low-resolution infrared cameras and the high-resolution infrared camera are fixed so that the field of view of the high-resolution infrared camera lies within the field of view of the low-resolution infrared cameras, and all infrared cameras photograph the infrared calibration target simultaneously to obtain one set of calibration images. The relative positions of the multi-view low-resolution infrared cameras and the high-resolution infrared camera can then be changed, together with the position of the infrared calibration target, and the shooting is repeated to obtain multiple sets of calibration images.
And step 1.2, calibrating the camera by using the calibration image to obtain the internal reference and the external reference of all the cameras.
Specifically, the calibration images are calibrated using the Zhang Zhengyou calibration method to obtain the internal parameters and external parameters of all the low-resolution infrared cameras and the high-resolution infrared camera.
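As an illustration only (not part of the patent), Zhang's calibration can be carried out with OpenCV. The sketch below assumes checkerboard-style infrared calibration targets; the function name, board size, square size and the image lists are hypothetical.

```python
import cv2
import numpy as np

def calibrate_camera(images, board_size=(9, 6), square=0.025):
    """Estimate intrinsics with Zhang's method from checkerboard views.
    `images` is a list of single-channel (infrared) calibration images."""
    objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2) * square
    obj_pts, img_pts = [], []
    for img in images:
        found, corners = cv2.findChessboardCorners(img, board_size)
        if found:
            obj_pts.append(objp)
            img_pts.append(corners)
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_pts, img_pts, images[0].shape[::-1], None, None)
    return K, dist, obj_pts, img_pts

# The extrinsics (R, T) between a low-resolution camera and the high-resolution
# camera could then be estimated from the shared target views, e.g.:
# rms, K1, d1, K2, d2, R, T, E, F_mat = cv2.stereoCalibrate(
#     obj_pts, img_pts_lr, img_pts_hr, K_lr, dist_lr, K_hr, dist_hr,
#     image_size, flags=cv2.CALIB_FIX_INTRINSIC)
```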
And 2, obtaining a multi-view first low-resolution infrared image and a first high-resolution infrared image by using a multi-view low-resolution infrared camera and a high-resolution infrared camera.
Specifically, all infrared cameras are used for shooting the same scene at the same time, for example, the scene is a natural scene, so that a multi-view low-resolution infrared camera shoots a multi-view original first low-resolution infrared image, and a high-resolution infrared camera shoots a first high-resolution infrared image.
And 3, constructing a data set, wherein the data set comprises internal parameters and external parameters of the multi-view low-resolution infrared camera and the high-resolution infrared camera, a multi-view second low-resolution infrared image and a second high-resolution infrared image, the second low-resolution infrared image is obtained by intercepting the first low-resolution infrared image, and the second high-resolution infrared image is obtained by intercepting the first high-resolution infrared image.
Preferably, a common-view region of all the first low-resolution infrared images and the first high-resolution infrared images is intercepted to obtain multi-view second low-resolution infrared images and second high-resolution infrared images with the same size.
In the present embodiment, the relative poses between the multi-view first low-resolution infrared images I_lri (i = 1, 2, ...) and the first high-resolution infrared image I_hr are fixed. Given the intrinsic matrix K_lr of a first low-resolution infrared image I_lr, the intrinsic matrix K_hr of the first high-resolution infrared image I_hr, and the projection matrix T_lh from the first low-resolution infrared image I_lr to the first high-resolution infrared image I_hr, if the image coordinates (u_l, v_l) and depth d_l of a spatial point in the first low-resolution infrared image I_lr are known, then its image coordinates (u_h, v_h) and depth d_h in the first high-resolution infrared image I_hr can be calculated by projection:

d_h·[u_h, v_h, 1]^T = K_hr·(R·d_l·K_lr^(-1)·[u_l, v_l, 1]^T + T)

wherein the intrinsic matrix

K = | f_x  0   u_0 |
    |  0   f_y  v_0 |
    |  0   0    1   |

with f_x and f_y denoting the focal lengths and u_0 and v_0 denoting the offsets of the center point, and the projection matrix

T_lh = | R  T |
       | 0  1 |

where T_lh represents the projection matrix between a low-resolution infrared camera and the high-resolution infrared camera, T_ll represents the projection matrix from one low-resolution infrared camera to another low-resolution infrared camera, and R and T represent, respectively, the rotation matrix and translation matrix (the translation matrix is also called the translation vector) between a low-resolution infrared camera and the high-resolution infrared camera or between two low-resolution infrared cameras.
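The projection relationship above can be written directly in code. The following is a minimal NumPy sketch with names of our own choosing, assuming K_lr and K_hr are 3 × 3 intrinsic matrices and R, T are the rotation matrix and translation vector from the low-resolution camera to the high-resolution camera.

```python
import numpy as np

def project_lr_to_hr(u_l, v_l, d_l, K_lr, K_hr, R, T):
    """Project a pixel (u_l, v_l) with depth d_l from a low-resolution view
    into the high-resolution view; returns (u_h, v_h, d_h)."""
    p_l = np.array([u_l, v_l, 1.0])
    X_cam_lr = d_l * np.linalg.inv(K_lr) @ p_l   # back-project to 3D in the LR camera frame
    X_cam_hr = R @ X_cam_lr + T.reshape(3)       # apply the extrinsics T_lh
    p_h = K_hr @ X_cam_hr                        # project into the HR image plane
    d_h = p_h[2]
    return p_h[0] / d_h, p_h[1] / d_h, d_h
```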
The constructed data set can be proportionally divided into a training sample set and a testing sample set.
And 4, inputting the data set into a dense point cloud generation network to be trained until the loss function converges to obtain the trained dense point cloud generation network, wherein the dense point cloud generation network comprises a depth estimation sub-network, a multi-view infrared image registration module and a multi-view super-resolution sub-network.
In this embodiment, the structure of the depth estimation sub-network sequentially includes:
an input layer, a first convolution layer, a first residual layer, a second convolution layer, a second residual layer, a third convolution layer, a third residual layer, a fourth convolution layer, a fourth residual layer, a fifth convolution layer, a fifth residual layer, a sixth convolution layer, a sixth residual layer, a seventh convolution layer, a seventh residual layer, an eighth convolution layer, an eighth residual layer, a ninth convolution layer, a parallax cost aggregation layer, a first 3D convolution layer, a first 3D residual layer, a second 3D convolution layer, a third 3D convolution layer, a second 3D residual layer, a fourth 3D convolution layer, a fifth 3D convolution layer, a third 3D residual layer, a sixth 3D convolution layer, a seventh 3D convolution layer, a fourth 3D residual layer, an eighth 3D convolution layer, a ninth 3D convolution layer, a fifth 3D residual layer, a first 3D deconvolution layer, a second 3D deconvolution layer, a third 3D deconvolution layer, a fourth 3D deconvolution layer, a fifth 3D deconvolution layer, and a parallax regression layer;
the input of the Mth convolution layer is the addition of the Nth convolution layer and the Nth residual error layer respectively, N is one to eight, and M is N plus one; the input of the second 3D convolutional layer, the fourth 3D convolutional layer, the sixth 3D convolutional layer and the eighth 3D convolutional layer is the output of the parallax-disparity cost aggregation layer, the output of the second 3D convolutional layer, the output of the fourth 3D convolutional layer and the output of the sixth 3D convolutional layer respectively in sequence; the inputs of the second 3D deconvolution layer, the third 3D deconvolution layer, the fourth 3D deconvolution layer, and the fifth 3D deconvolution layer are sequentially the sum of the output of the first 3D deconvolution layer and the output of the fourth 3D residual layer, the sum of the output of the second 3D deconvolution layer and the output of the third 3D residual layer, the sum of the output of the third 3D deconvolution layer and the output of the second 3D residual layer, and the sum of the output of the fourth 3D deconvolution layer and the output of the first 3D residual layer, respectively.
Further, the parameters of the layers of the depth estimation sub-network are set as follows:
the convolution kernel size of the input layer is set to be 5 multiplied by 5, the step length is set to be 2, and the channel size of the output characteristic diagram is set to be 32;
the sizes of convolution kernels of the first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer, the seventh convolution layer, the eighth convolution layer and the ninth convolution layer are all set to be 3 multiplied by 3, the step length is all set to be 1, and the size of an output characteristic diagram channel is all set to be 32;
the sizes of convolution kernels of the first residual error layer, the second residual error layer, the third residual error layer, the fourth residual error layer, the fifth residual error layer, the sixth residual error layer, the seventh residual error layer and the eighth residual error layer are all set to be 3 multiplied by 3, the step length is all set to be 1, and the size of an output characteristic diagram channel is all set to be 32;
the convolution kernel size of the first 3D convolution layer is set to be 3 multiplied by 3, the step size is set to be 1, and the output feature map channel size is set to be 32;
the sizes of convolution kernels of the second 3D convolution layer, the fourth 3D convolution layer and the sixth 3D convolution layer are all set to be 3 multiplied by 3, the step length is all set to be 2, and the sizes of output characteristic diagram channels are all set to be 64;
the sizes of convolution kernels of the third 3D convolution layer, the fifth 3D convolution layer and the seventh 3D convolution layer are all set to be 3 multiplied by 3, the step length is all set to be 1, and the sizes of output characteristic diagram channels are all set to be 64;
the sizes of convolution kernels of the eighth 3D convolution layer and the ninth 3D convolution layer are set to be 3 multiplied by 3, the step sizes are set to be 2, and the sizes of output characteristic graph channels are set to be 128;
the convolution kernel size of the first 3D residual layer is set to be 3 multiplied by 3, the step size is set to be 1, and the size of an output feature map channel is set to be 32;
convolution kernels of the second 3D residual layer, the third 3D residual layer and the fourth 3D residual layer are all set to be 3 multiplied by 3, step length is all set to be 1, and the size of an output characteristic image channel is all set to be 64;
the convolution kernel size of the fifth 3D residual layer is set to be 3 multiplied by 3, the step size is set to be 1, and the size of an output feature map channel is set to be 128;
the sizes of convolution kernels of the first 3D deconvolution layer, the second 3D deconvolution layer and the third 3D deconvolution layer are set to be 3 multiplied by 3, the step length is set to be 2, and the size of an output characteristic diagram channel is set to be 64;
the convolution kernel size of the fourth 3D deconvolution layer is set to be 3 multiplied by 3, the step size is set to be 2, and the output feature map channel size is set to be 32;
the convolution kernel size of the fifth 3D deconvolution layer is set to be 3 multiplied by 3, the step length is set to be 1, and the output characteristic diagram channel size is set to be 1;
the parallax cost aggregation layer comprises two steps: calculating the slope direction of the projection parallax and concatenating the parallax feature maps. According to the internal and external parameters of camera A and camera B (both are low-resolution infrared cameras), the line along which the coordinates (u_A, v_A) of image A project into image B at different depths can be written in parametric form as:

u_B = d^(-1)·cos θ + δ_u
v_B = d^(-1)·sin θ + δ_v

wherein the variable d^(-1) represents the inverse of the depth corresponding to the coordinates (u_A, v_A) of image A, θ and δ represent the angle and offset of the projection into image B at different depths, δ_u represents the offset of the projection into image B in the u direction, and δ_v represents the offset of the projection into image B in the v direction.

The input of the parallax cost aggregation layer is a feature map F_A of image A with size C × H × W and a feature map F_B of image B with size C × H × W; the output is a concatenated feature map of size 2C × M × H × W, where M represents a preset maximum disparity and the feature maps are concatenated along the disparity channel according to the number of disparity pixels. The concatenated feature map F satisfies:

F[1:C, disp, :, disp:W] = F_A[:, :, disp:W]

F[C+1:2C, disp, :, disp:W] = F_B[:, :, 1:W-disp]

where disp represents the disparity, and the remaining entries of the concatenated feature map are set to 0.
The parallax regression layer estimates the disparity value by applying a soft argmin operation along the disparity dimension, where the disparity regression is calculated as:

disparity = Σ_{d=0}^{D_max} d·σ(-c_d)

where d represents the disparity, ranging from 0 to the maximum disparity D_max, σ denotes the sigmoid function, and c_d represents the value of the feature map input to the disparity regression layer along the disparity channel dimension, with:

σ(t) = 1/(1 + e^(-t))

where t represents an input element. (A code sketch of the parallax cost aggregation and the disparity regression follows this parameter list.)
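For illustration only, the concatenation rule and the soft argmin above translate directly into tensor code. The sketch below is a simplified PyTorch version using 0-indexed tensors with a batch dimension and a horizontal-disparity assumption; the names `build_cost_volume` and `soft_argmin` are ours. It keeps the sigmoid weighting written above, while GC-Net-style implementations often normalize with a softmax over the disparity dimension instead.

```python
import torch

def build_cost_volume(feat_a, feat_b, max_disp):
    """feat_a, feat_b: (B, C, H, W) feature maps of images A and B.
    Returns a (B, 2C, max_disp, H, W) concatenation-style cost volume;
    positions with no valid correspondence stay zero."""
    B, C, H, W = feat_a.shape
    vol = feat_a.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            vol[:, :C, d] = feat_a
            vol[:, C:, d] = feat_b
        else:
            vol[:, :C, d, :, d:] = feat_a[:, :, :, d:]
            vol[:, C:, d, :, d:] = feat_b[:, :, :, :-d]
    return vol

def soft_argmin(cost, max_disp):
    """cost: (B, max_disp, H, W) input c_d of the disparity regression layer.
    Disparity estimate = sum over d of d * sigma(-c_d)."""
    weights = torch.sigmoid(-cost)                       # per the formula above
    disp_values = torch.arange(max_disp, dtype=cost.dtype,
                               device=cost.device).view(1, max_disp, 1, 1)
    return (weights * disp_values).sum(dim=1)            # (B, H, W)
```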
In addition, the present embodiment further provides a training process for a deep estimation subnetwork, where the training process includes:
s1, inputting the third low-resolution infrared image and the fourth low-resolution infrared image into a depth estimation sub-network to obtain a disparity map of the third low-resolution infrared image;
s2, obtaining a projection diagram from the fourth low-resolution infrared image to the third low-resolution infrared image based on the fourth low-resolution infrared image, the depth diagram corresponding to the disparity map of the third low-resolution infrared image, and the internal reference and the external reference;
s3, optimizing the loss function of the depth estimation sub-network through a gradient descent method, and iteratively updating the parameters of the depth estimation sub-network until the loss function of the depth estimation sub-network converges to obtain the trained depth estimation sub-network.
Specifically, the third low-resolution infrared image and the fourth low-resolution infrared image are captured by low-resolution infrared cameras placed at different positions; if, for example, the third low-resolution infrared image is the left-eye image, the fourth low-resolution infrared image is the right-eye image. The third and fourth low-resolution infrared images are input into the depth estimation sub-network, whose output is the disparity map of the third low-resolution infrared image, from which the depth map of the third low-resolution infrared image can be obtained. According to the binocular stereo matching principle, as will be understood by those skilled in the art, when the fourth low-resolution infrared image, the depth map of the third low-resolution infrared image, and the internal and external parameters of the infrared cameras are known, a projection image from the fourth low-resolution infrared image to the third low-resolution infrared image can be obtained. The loss function of the depth estimation sub-network is therefore optimized by a gradient descent method, and the parameters of the depth estimation sub-network are updated iteratively until the loss function converges, yielding the trained depth estimation sub-network; at that point the projection image from the fourth low-resolution infrared image to the third low-resolution infrared image is closest to the third low-resolution infrared image.
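As an illustration of how the fourth (right) image can be warped to the third (left) view to supervise the disparity, the sketch below performs inverse warping with bilinear sampling and an L1 photometric loss. It is a simplified example of our own, assuming rectified views and purely horizontal disparity.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(img_right, disp_left):
    """img_right: (B, 1, H, W); disp_left: (B, 1, H, W) disparity predicted
    for the left image. Returns the right image resampled into the left view."""
    B, _, H, W = img_right.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=img_right.device),
                            torch.arange(W, device=img_right.device),
                            indexing="ij")
    xs = xs.unsqueeze(0).float() - disp_left.squeeze(1)   # shift columns by the disparity
    ys = ys.unsqueeze(0).float().expand_as(xs)
    # normalize sampling coordinates to [-1, 1] for grid_sample
    grid = torch.stack((2.0 * xs / (W - 1) - 1.0,
                        2.0 * ys / (H - 1) - 1.0), dim=-1)
    return F.grid_sample(img_right, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

def photometric_loss(img_left, img_right, disp_left):
    """L1 reprojection loss used to supervise the estimated disparity."""
    return (warp_right_to_left(img_right, disp_left) - img_left).abs().mean()
```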
In this embodiment, the multi-view infrared image registration module is configured to register the low-resolution infrared image by a projection method according to a projection matrix between the low-resolution infrared camera and the high-resolution infrared camera and a depth of the low-resolution infrared image, so as to obtain a projected and registered low-resolution infrared image.
In this embodiment, the structure of the multi-view super-resolution subnetwork sequentially comprises:
a feature extraction layer, a tenth convolution layer, a ninth residual layer, an eleventh convolution layer, a tenth residual layer, a twelfth convolution layer, an eleventh residual layer, a thirteenth convolution layer, a twelfth residual layer, a fourteenth convolution layer, a thirteenth residual layer, a fifteenth convolution layer, a fourteenth residual layer, a sixteenth convolution layer, a fifteenth residual layer, a seventeenth convolution layer, a sixteenth residual layer, an eighteenth convolution layer, a seventeenth residual layer, a nineteenth convolution layer, an eighteenth residual layer, a twentieth convolution layer, a nineteenth residual layer, a twenty-first convolution layer, a twentieth residual layer, a twenty-second convolution layer, a twenty-first residual layer, a twenty-third convolution layer, a twenty-second residual layer, a twenty-fourth convolution layer, a twenty-third residual layer, a twenty-fifth convolution layer, a twenty-fourth residual layer, a pixel recombination layer, and an output layer;
wherein the feature extraction layer shares weights across views, and its input is the 2-channel data formed by concatenating each projection-registered low-resolution infrared image with its corresponding depth map; the input of the tenth convolution layer is the feature map obtained by concatenating, along the channel direction, the multi-path output feature maps produced by passing the multi-view low-resolution images through the weight-shared feature extraction layer; the input of each remaining convolution layer is the sum of the output of the previous convolution layer and the output of the previous residual layer; and the input of the pixel recombination layer is the sum of the output of the twenty-fourth residual layer and the output of the feature extraction layer.
Further, setting parameters of each layer of the multi-view super-resolution sub-network:
the convolution kernel size of the feature extraction layer is set to 3 × 3, the stride is set to 1, and the output feature map channel size is set to 64; since the feature extraction layer shares weights across views, the actual number of output channels is k × 64, where k is the number of views of the multi-view low-resolution infrared cameras; in this embodiment k is four, so the actual number of output channels is 256;
the convolution kernels of the tenth, eleventh, twelfth, thirteenth, fourteenth, fifteenth, sixteenth, seventeenth, eighteenth, nineteenth, twentieth, twenty-first, twenty-second, twenty-third, twenty-fourth, twenty-fifth convolution layers are all set to 3 x 3, the step sizes are all set to 1, and the output characteristic diagram channel sizes are all set to 64;
convolution kernels of a ninth residual layer, a tenth residual layer, an eleventh residual layer, a twelfth residual layer, a thirteenth residual layer, a fourteenth residual layer, a fifteenth residual layer, a sixteenth residual layer, a seventeenth residual layer, an eighteenth residual layer, a nineteenth residual layer, a twentieth residual layer, a twenty-first residual layer, a twenty-second residual layer and a twenty-third residual layer are all set to be 3 multiplied by 3, step length is set to be 1, and the size of an output feature map channel is set to be 64;
the convolution kernel size of the twenty-fourth residual layer is set to be 3 × 3, the step size is set to be 1, and the output feature map channel size is set to be k × 64, namely 256;
the magnification factor of the pixel recombination layer is set to s and the output feature map channel size is set to 64, where s is the super-resolution image reconstruction multiple, s = 2^n, and n is an integer greater than or equal to 1; in this embodiment n is 4 as an example;
the convolution kernel size of the output layer is set to 3 × 3, the step size is set to 1, and the output signature channel size is set to 2.
The input of the multi-view super-resolution sub-network is the multi-view low-resolution projection images, and its output is the high-resolution image.
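For illustration only, a compressed PyTorch sketch of this multi-view super-resolution sub-network is given below. The class and argument names are ours; the pixel recombination layer is realized with nn.PixelShuffle preceded by a channel-expanding convolution, which is one possible reading of the layer settings above rather than the patent's exact definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewSRNet(nn.Module):
    def __init__(self, n_views=4, scale=4, n_blocks=16):
        super().__init__()
        # weight-shared feature extraction on 2-channel (image + depth) input
        self.extract = nn.Conv2d(2, 64, 3, padding=1)
        self.fuse = nn.Conv2d(n_views * 64, 64, 3, padding=1)       # tenth convolution layer
        self.convs = nn.ModuleList(nn.Conv2d(64, 64, 3, padding=1) for _ in range(n_blocks))
        self.resid = nn.ModuleList(
            nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
                          nn.Conv2d(64, 64, 3, padding=1)) for _ in range(n_blocks))
        self.expand = nn.Conv2d(64, n_views * 64, 3, padding=1)     # k*64-channel output
        self.shuffle = nn.Sequential(
            nn.Conv2d(n_views * 64, 64 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))                                 # pixel recombination layer
        self.out = nn.Conv2d(64, 2, 3, padding=1)                   # output layer, 2 channels

    def forward(self, views):
        # views: list of n_views tensors of shape (B, 2, H, W)
        feats = [F.relu(self.extract(v)) for v in views]
        x = F.relu(self.fuse(torch.cat(feats, dim=1)))
        for conv, res in zip(self.convs, self.resid):
            c = F.relu(conv(x))
            x = c + res(c)                    # sum of conv output and residual output
        x = self.expand(x) + torch.cat(feats, dim=1)   # skip from the feature extraction layer
        return self.out(self.shuffle(x))
```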
In a specific embodiment, step 4 may specifically include:
and 4.1, inputting the multi-view second low-resolution infrared image in the data set into a depth estimation sub-network to obtain a disparity map of the multi-view second low-resolution infrared image, and converting the disparity map of the multi-view second low-resolution infrared image into a depth map.
And 4.2, registering the multi-view second low-resolution infrared image through a projection method in a multi-view infrared image registration module based on a projection matrix between the low-resolution infrared camera and the high-resolution infrared camera and a depth map of the multi-view second low-resolution infrared image to obtain a multi-view registered second low-resolution infrared image.
And 4.3, for each view, inputting the 2-channel data formed by concatenating the registered second low-resolution infrared image with its corresponding depth map into the multi-view super-resolution sub-network to obtain a high-resolution infrared image.
And 4.4, optimizing the loss function through a gradient descent method, and iteratively updating the parameters of the dense point cloud generation network until the loss function is converged to obtain the trained dense point cloud generation network.
In the present embodiment, the loss function of the dense point cloud generation network includes a super-resolution loss and a re-projection loss.
Specifically, the super-resolution loss is calculated as: and inputting the multi-view second low-resolution infrared image in the data set into a dense point cloud generation network to obtain a high-resolution infrared image, and calculating the loss between the generated high-resolution infrared image and the high-resolution infrared image in the data set, wherein the loss is the super-resolution loss.
The reprojection loss is calculated as follows: the multi-view low-resolution infrared images in the data set are input into the dense point cloud generation network to obtain a high-resolution infrared image; the high-resolution infrared image is reprojected onto the positions of the other cropped low-resolution infrared images according to the pose relationships between the multi-view infrared cameras; and the loss between the reprojected high-resolution image and the corresponding low-resolution infrared images in the data set is computed.
In this embodiment, a multi-view low-resolution infrared image is input to a dense point cloud generation network, the super-resolution loss and the reprojection loss are added to obtain a loss function of the dense point cloud generation network, the loss function is optimized by a gradient descent method, and network parameters of the dense point cloud generation network are iteratively updated until the loss function converges, so that a trained dense point cloud generation network model is obtained.
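For illustration, the gradient-descent optimization described above could look like the following minimal training-loop sketch. The function name, the data format yielded by `loader`, and the two loss callbacks are our own placeholders for the network and the loss terms defined by the formulas that follow.

```python
import torch

def train_dense_point_cloud_net(net, loader, sr_loss_fn, rep_loss_fn,
                                lambda1=1.0, lambda2=1.0, lr=1e-4, epochs=100):
    """Optimize L = lambda1 * L_SR + lambda2 * L_REP by gradient descent.
    `net`, `loader`, `sr_loss_fn` and `rep_loss_fn` are supplied by the caller;
    `loader` is assumed to yield (lr_views, hr_image, calib) tuples."""
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        for lr_views, hr_image, calib in loader:
            disp_maps, sr_image = net(lr_views, calib)        # forward pass of the network
            loss = (lambda1 * sr_loss_fn(sr_image, hr_image, calib)
                    + lambda2 * rep_loss_fn(lr_views, disp_maps, calib))
            optimizer.zero_grad()
            loss.backward()                                   # gradient of the total loss
            optimizer.step()                                  # iterative parameter update
    return net
```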
The loss function of the dense point cloud generation network is as follows:
L = λ1·L_SR + λ2·L_REP

L_SR = Σ||I_sr - I_hr||_2^2

L_REP = Σ||I_rep - I_lr||_1

wherein L represents the loss function of the dense point cloud generation network, L_SR represents the super-resolution loss, L_REP represents the reprojection loss, ||·||_2^2 represents the mean square error operation, ||·||_1 represents the 1-norm operation, I_sr represents the super-resolution infrared image generated from the first channel of the output of the trained dense point cloud generation network, I_rep represents the image obtained by reprojecting a low-resolution infrared image in the data set onto the position of another low-resolution infrared image, I_hr represents the reprojected infrared image obtained by projecting the high-resolution infrared image in the data set onto the multi-view low-resolution infrared camera positions, I_lr represents the cropped low-resolution infrared images in the data set, and λ1 and λ2 represent the weights of the super-resolution loss and the reprojection loss.
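The two loss terms above can be expressed compactly in PyTorch; the sketch below is illustrative only, with I_sr, I_hr, I_rep and I_lr taken to be tensors of matching shape as defined above. Note that F.mse_loss and F.l1_loss average over elements rather than sum; up to this normalization they match the formulas above.

```python
import torch
import torch.nn.functional as F

def super_resolution_loss(I_sr, I_hr):
    """L_SR: mean square error between the generated super-resolution image
    and the (reprojected) high-resolution image from the data set."""
    return F.mse_loss(I_sr, I_hr)

def reprojection_loss(I_rep, I_lr):
    """L_REP: 1-norm between a reprojected low-resolution image and the
    corresponding cropped low-resolution image from the data set."""
    return F.l1_loss(I_rep, I_lr)

def total_loss(I_sr, I_hr, I_rep, I_lr, lambda1=1.0, lambda2=1.0):
    """L = lambda1 * L_SR + lambda2 * L_REP."""
    return (lambda1 * super_resolution_loss(I_sr, I_hr)
            + lambda2 * reprojection_loss(I_rep, I_lr))
```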
And 5, inputting the multi-view low-resolution infrared images to be processed into the trained dense point cloud generation network to obtain the disparity maps of the multi-view low-resolution infrared images and a high-resolution infrared image, and generating the dense point cloud according to the disparity maps and the high-resolution infrared image.
In a specific embodiment, step 5 may specifically include:
and 5.1, correspondingly obtaining a multi-view low-resolution depth map according to the parallax map of the multi-view low-resolution infrared image.
Specifically, the multi-view low-resolution infrared image to be generated is input to the trained dense point cloud generating network, so that the disparity map of the multi-view low-resolution infrared image and the high-resolution infrared image can be correspondingly output, and therefore the disparity map of the low-resolution infrared image can be converted into a depth map, namely, the depth map is converted into a low-resolution depth map.
And 5.2, performing interpolation up-sampling processing on the multi-view low-resolution depth map respectively to correspondingly obtain the multi-view high-resolution depth map.
Step 5.3, calculating the average value of the multi-view high-resolution depth map, and taking the average value as the final high-resolution depth map;
and 5.4, obtaining three-dimensional coordinates of the dense point cloud according to the internal reference inverse matrix of the high-resolution infrared image, the final high-resolution depth map and the pixel coordinates of the high-resolution infrared image, wherein a calculation model is as follows:
[x_w, y_w, z_w]^T = d_sr·K_sr^(-1)·[u_sr, v_sr, 1]^T

wherein the internal parameters of the high-resolution infrared image (super-resolution infrared image) are equal to the internal parameters of the original low-resolution infrared image multiplied by the resolution magnification factor, (x_w, y_w, z_w) represents the three-dimensional coordinates of the dense point cloud, K_sr^(-1) represents the inverse of the intrinsic matrix of the high-resolution infrared image, d_sr represents the mean of the multi-view high-resolution depth maps, i.e., the final high-resolution depth map, and (u_sr, v_sr, 1) represents the homogeneous pixel coordinates of the high-resolution infrared image.
And 5.5, assigning the pixel values of the high-resolution infrared image to the dense point cloud according to the pixel coordinate correspondence, so as to generate the colored dense point cloud.
Specifically, the high-resolution colored dense point cloud is generated by coloring each point of the point cloud with the corresponding original pixel value of the high-resolution infrared image.
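The back-projection and coloring in steps 5.1 to 5.5 can be illustrated with the following NumPy/OpenCV sketch. The function name and the bilinear upsampling choice are ours, and K_sr is assumed to be the intrinsic matrix of the super-resolution image (the low-resolution intrinsics scaled by the magnification factor).

```python
import cv2
import numpy as np

def generate_colored_point_cloud(low_res_depths, hr_image, K_sr):
    """low_res_depths: list of (h, w) depth maps converted from the disparity maps;
    hr_image: (H, W) high-resolution infrared image; K_sr: 3x3 intrinsic matrix.
    Returns an (N, 4) array of [x, y, z, intensity] points."""
    H, W = hr_image.shape
    # steps 5.2-5.3: upsample each low-resolution depth map and average them
    ups = [cv2.resize(d, (W, H), interpolation=cv2.INTER_LINEAR) for d in low_res_depths]
    d_sr = np.mean(ups, axis=0)
    # step 5.4: back-project every pixel with the inverse intrinsic matrix
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([us.ravel(), vs.ravel(), np.ones(H * W)], axis=0)   # 3 x N homogeneous
    pts = (np.linalg.inv(K_sr) @ pix) * d_sr.ravel()                   # x_w, y_w, z_w
    # step 5.5: color each point with the corresponding high-resolution pixel value
    colors = hr_image.ravel().astype(np.float64)
    return np.concatenate([pts.T, colors[:, None]], axis=1)
```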
The effects of the present invention are further illustrated by the following simulation experiments.
1. The experimental conditions are as follows:
the hardware test platform of the simulation experiment of the invention is as follows: CPU: i7-9700K 3.60 GHz, 32 GB memory; GPU: TITAN Xp;
the software simulation platform of the invention is as follows: Windows 10 64-bit operating system, PyCharm development platform;
the software simulation language and deep learning framework used by the invention are as follows: Python and PyTorch.
The data used in the simulation experiment of the invention is a self-collected multi-view infrared image data set, captured with a trinocular low-resolution infrared camera and a monocular high-resolution infrared camera fixed in a straight-line arrangement. Each set of training data in the data set comprises three low-resolution infrared images of 288 × 384 pixels and one high-resolution infrared image of 1024 × 1281 pixels, all in png format.
2. Analysis of experimental content and results
The experiment processes the input trinocular low-resolution infrared images with the dense point cloud generation method provided by the invention and generates dense point clouds.
The effect of the present invention is further described with reference to the simulation diagrams of Fig. 3a to Fig. 3f.
Fig. 3a, 3c, and 3e are three sets of input trinocular low-resolution infrared images, respectively.
Fig. 3b, 3d, and 3f are different perspective images of the dense point cloud generated corresponding to the three sets of inputs, respectively.
As can be seen from Fig. 3a to Fig. 3f, the dense point cloud generation method provided by the present invention can generate a high-resolution dense point cloud from multi-view low-resolution infrared images, filling a gap among methods for infrared point cloud generation.
First, the invention adopts a multi-view purely visual scheme whose data acquisition equipment is common and inexpensive, so the equipment cost and data acquisition cost can be effectively reduced and the equipment requirements for building the system are lowered.
Second, the multi-view purely visual scheme has strong anti-interference capability and avoids the limitations of ranging schemes such as lidar (mutual interference) and active infrared (unusable outdoors and in other strong-light environments), thereby improving the robustness of the constructed system.
Third, the invention constructs a dense point cloud generation network whose multi-view super-resolution sub-network contains a super-resolution structure, so the network model can be effectively supervised by the high-resolution input. The invention improves the resolution of the generated point cloud, achieves performance beyond the resolution limit of the acquisition equipment, and improves both the visual experience and the quality of the input data for subsequent tasks.
Fourth, the invention adds a reprojection loss to the total loss function, which supervises the depth estimation using the multi-view images and thereby makes the multi-view purely visual scheme feasible.
In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art can combine the various embodiments or examples described in this specification.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. A dense point cloud generation method based on multi-view infrared is characterized by comprising the following steps:
acquiring internal parameters and external parameters of a multi-view low-resolution infrared camera and a high-resolution infrared camera;
obtaining a multi-view first low-resolution infrared image and a first high-resolution infrared image by using a multi-view low-resolution infrared camera and a high-resolution infrared camera;
constructing a data set, wherein the data set comprises internal parameters and external parameters of a multi-view low-resolution infrared camera and a high-resolution infrared camera, a multi-view second low-resolution infrared image and a second high-resolution infrared image, the second low-resolution infrared image is obtained by intercepting the first low-resolution infrared image, and the second high-resolution infrared image is obtained by intercepting the first high-resolution infrared image;
inputting the data set to a dense point cloud generation network to be trained until a loss function converges to obtain the trained dense point cloud generation network, wherein the dense point cloud generation network comprises a depth estimation sub-network, a multi-view infrared image registration module, and a multi-view super-resolution sub-network;
and inputting the multi-view low-resolution infrared images to be processed into the trained dense point cloud generation network to obtain disparity maps of the multi-view low-resolution infrared images and a high-resolution infrared image, and generating a dense point cloud from the disparity maps of the multi-view low-resolution infrared images and the high-resolution infrared image.
2. The multi-view infrared-based dense point cloud generation method of claim 1, wherein obtaining internal and external parameters of a multi-view low-resolution infrared camera and a high-resolution infrared camera comprises:
shooting an infrared calibration target by using a multi-view low-resolution infrared camera and a high-resolution infrared camera simultaneously to obtain a plurality of groups of calibration images;
and calibrating by using the calibration image to obtain internal parameters and external parameters of all cameras.
3. The multi-view infrared-based dense point cloud generation method of claim 1, wherein inputting the dataset to a dense point cloud generation network to be trained until a loss function converges to obtain a trained dense point cloud generation network comprises:
inputting the data set into the depth estimation sub-network to obtain a disparity map of a multi-view second low-resolution infrared image, and converting the disparity map of the multi-view second low-resolution infrared image into a depth map;
registering, by the multi-view infrared image registration module, the multi-view second low-resolution infrared images by a projection method based on the projection matrices between the low-resolution infrared cameras and the high-resolution infrared camera and on the depth maps of the multi-view second low-resolution infrared images, to obtain multi-view registered second low-resolution infrared images;
inputting, into the multi-view super-resolution sub-network, the 2-channel data formed by concatenating the registered second low-resolution infrared image of each view with its corresponding depth map, to obtain a high-resolution infrared image;
and optimizing the loss function by a gradient descent method, and iteratively updating the parameters of the dense point cloud generation network until the loss function is converged to obtain the trained dense point cloud generation network.
4. The method of generating a dense point cloud based on multi-view infrared according to claim 3, wherein the loss function is:
L = λ_1·L_SR + λ_2·L_REP
L_SR = ∑||I_sr - I_hr||_2^2
L_REP = ∑||I_rep - I_lr||_1
wherein L represents the loss function of the dense point cloud generation network, L_SR represents the super-resolution loss, L_REP represents the reprojection loss, ||·||_2^2 represents the mean square error operation, ||·||_1 represents the 1-norm operation, I_sr represents the super-resolution infrared image, I_hr represents the reprojected infrared image obtained by projecting the high-resolution infrared image in the data set onto the positions of the multi-view low-resolution infrared cameras, I_rep represents the image obtained by reprojecting a low-resolution infrared image onto the position of another low-resolution infrared image, I_lr represents the low-resolution infrared images in the data set, and λ_1 and λ_2 represent the weights of the super-resolution loss and the reprojection loss, respectively.
5. The method of claim 1, wherein the training process of the depth estimation sub-network is:
inputting a third low-resolution infrared image and a fourth low-resolution infrared image into the depth estimation sub-network to obtain a disparity map of the third low-resolution infrared image;
obtaining a projection image from the fourth low-resolution infrared image to the third low-resolution infrared image based on the fourth low-resolution infrared image, the depth map corresponding to the disparity map of the third low-resolution infrared image, and the internal and external parameters;
and optimizing the loss function of the depth estimation sub-network by a gradient descent method, and iteratively updating the parameters of the depth estimation sub-network until the loss function of the depth estimation sub-network is converged to obtain a trained depth estimation sub-network.
6. The method of generating a dense point cloud based on multi-view infrared according to claim 1, wherein the structure of the depth estimation sub-network comprises in order:
an input layer, a first convolution layer, a first residual layer, a second convolution layer, a second residual layer, a third convolution layer, a third residual layer, a fourth convolution layer, a fourth residual layer, a fifth convolution layer, a fifth residual layer, a sixth convolution layer, a sixth residual layer, a seventh convolution layer, a seventh residual layer, an eighth convolution layer, an eighth residual layer, a ninth convolution layer, a parallax cost aggregation layer, a first 3D convolution layer, a first 3D residual layer, a second 3D convolution layer, a third 3D convolution layer, a second 3D residual layer, a fourth 3D convolution layer, a fifth 3D convolution layer, a third 3D residual layer, a sixth 3D convolution layer, a seventh 3D convolution layer, a fourth 3D residual layer, an eighth 3D convolution layer, a ninth 3D convolution layer, a fifth 3D residual layer, a first 3D deconvolution layer, a second 3D deconvolution layer, a third 3D deconvolution layer, a fourth 3D deconvolution layer, a fifth 3D deconvolution layer, and a parallax regression layer;
the input of the Mth convolution layer is the sum of the output of the Nth convolution layer and the output of the Nth residual layer, where N ranges from one to eight and M equals N plus one; the inputs of the second 3D convolution layer, the fourth 3D convolution layer, the sixth 3D convolution layer, and the eighth 3D convolution layer are, respectively, the output of the parallax cost aggregation layer, the output of the second 3D convolution layer, the output of the fourth 3D convolution layer, and the output of the sixth 3D convolution layer; the inputs of the second 3D deconvolution layer, the third 3D deconvolution layer, the fourth 3D deconvolution layer, and the fifth 3D deconvolution layer are, respectively, the sum of the output of the first 3D deconvolution layer and the output of the fourth 3D residual layer, the sum of the output of the second 3D deconvolution layer and the output of the third 3D residual layer, the sum of the output of the third 3D deconvolution layer and the output of the second 3D residual layer, and the sum of the output of the fourth 3D deconvolution layer and the output of the first 3D residual layer.
7. The method of claim 6, wherein the convolution kernel size of the input layer is set to 5 × 5, the step size is set to 2, and the output feature map channel size is set to 32;
the convolution kernel sizes of the first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer, the seventh convolution layer, the eighth convolution layer, and the ninth convolution layer are all set to 3 × 3, the step sizes are all set to 1, and the output feature map channel sizes are all set to 32;
the convolution kernel sizes of the first residual layer, the second residual layer, the third residual layer, the fourth residual layer, the fifth residual layer, the sixth residual layer, the seventh residual layer, and the eighth residual layer are all set to 3 × 3, the step sizes are all set to 1, and the output feature map channel sizes are all set to 32;
the convolution kernel size of the first 3D convolution layer is set to 3 × 3, the step size is set to 1, and the output feature map channel size is set to 32;
the convolution kernel sizes of the second 3D convolution layer, the fourth 3D convolution layer, and the sixth 3D convolution layer are all set to 3 × 3, the step sizes are all set to 2, and the output feature map channel sizes are all set to 64;
the convolution kernel sizes of the third 3D convolution layer, the fifth 3D convolution layer, and the seventh 3D convolution layer are all set to 3 × 3, the step sizes are all set to 1, and the output feature map channel sizes are all set to 64;
the convolution kernel sizes of the eighth 3D convolution layer and the ninth 3D convolution layer are both set to 3 × 3, the step sizes are both set to 2, and the output feature map channel sizes are both set to 128;
the convolution kernel size of the first 3D residual layer is set to 3 × 3, the step size is set to 1, and the output feature map channel size is set to 32;
the convolution kernel sizes of the second 3D residual layer, the third 3D residual layer, and the fourth 3D residual layer are all set to 3 × 3, the step sizes are all set to 1, and the output feature map channel sizes are all set to 64;
the convolution kernel size of the fifth 3D residual layer is set to 3 × 3, the step size is set to 1, and the output feature map channel size is set to 128;
the convolution kernel sizes of the first 3D deconvolution layer, the second 3D deconvolution layer, and the third 3D deconvolution layer are all set to 3 × 3, the step sizes are all set to 2, and the output feature map channel sizes are all set to 64;
the convolution kernel size of the fourth 3D deconvolution layer is set to 3 × 3, the step size is set to 2, and the output feature map channel size is set to 32;
the convolution kernel size of the fifth 3D deconvolution layer is set to 3 × 3, the step size is set to 1, and the output feature map channel size is set to 1.
8. The method of generating a dense point cloud based on multi-view infrared according to claim 1, wherein the structure of the multi-view super resolution sub-network comprises in order:
a feature extraction layer, a tenth convolution layer, a ninth residual layer, an eleventh convolution layer, a tenth residual layer, a twelfth convolution layer, an eleventh residual layer, a thirteenth convolution layer, a twelfth residual layer, a fourteenth convolution layer, a thirteenth residual layer, a fifteenth convolution layer, a fourteenth residual layer, a sixteenth convolution layer, a fifteenth residual layer, a seventeenth convolution layer, a sixteenth residual layer, an eighteenth convolution layer, a seventeenth residual layer, a nineteenth convolution layer, an eighteenth residual layer, a twentieth convolution layer, a nineteenth residual layer, a twenty-first convolution layer, a twentieth residual layer, a twenty-second convolution layer, a twenty-first residual layer, a twenty-third convolution layer, a twenty-second residual layer, a twenty-fourth convolution layer, a twenty-third residual layer, a twenty-fifth convolution layer, a twenty-fourth residual layer, a pixel recombination layer, and an output layer;
the feature extraction layer shares weights, and its input is the 2-channel data formed by concatenating each projection-registered low-resolution infrared image with its corresponding depth map; the input of the tenth convolution layer is the feature map obtained by concatenating, in the channel direction, the multiple output feature maps of the multi-view low-resolution images passed through the weight-shared feature extraction layer; the input of each of the other convolution layers is the sum of the output of the preceding convolution layer and the output of the preceding residual layer; and the input of the pixel recombination layer is the sum of the output of the twenty-fourth residual layer and the output of the feature extraction layer.
9. The multi-view infrared-based dense point cloud generation method of claim 8, wherein the convolution kernel size of the feature extraction layer is set to 3 × 3, the step size is set to 1, and the output feature map channel size is set to k × 64, where k is the number of views of the multi-view low-resolution infrared cameras;
convolution kernel sizes of the tenth, eleventh, twelfth, thirteenth, fourteenth, fifteenth, sixteenth, seventeenth, eighteenth, nineteenth, twentieth, twenty-first, twenty-second, twenty-third, twenty-fourth, twenty-fifth convolution layers are all set to 3 × 3, step sizes are all set to 1, and output feature map channel sizes are all set to 64;
convolution kernels of the ninth residual layer, the tenth residual layer, the eleventh residual layer, the twelfth residual layer, the thirteenth residual layer, the fourteenth residual layer, the fifteenth residual layer, the sixteenth residual layer, the seventeenth residual layer, the eighteenth residual layer, the nineteenth residual layer, the twentieth residual layer, the twenty first residual layer, the twenty second residual layer and the twenty third residual layer are all set to be 3 × 3, step sizes are all set to be 1, and output feature map channel sizes are all set to be 64;
the convolution kernel size of the twenty-fourth residual layer is set to 3 × 3, the step size is set to 1, and the output feature map channel size is set to k × 64;
the magnification factor of the pixel recombination layer is set to s and the output feature map channel size is set to 64, where s is the super-resolution reconstruction factor, s = 2^n, and n is an integer greater than or equal to 1;
the convolution kernel size of the output layer is set to 3 × 3, the step size is set to 1, and the output feature map channel size is set to 2.
10. The method of claim 1, wherein generating the dense point cloud from the disparity map of the multi-view low-resolution infrared image and the high-resolution infrared image comprises:
correspondingly obtaining multi-view low-resolution depth maps from the disparity maps of the multi-view low-resolution infrared images;
respectively carrying out interpolation up-sampling processing on the multi-view low-resolution depth map to correspondingly obtain a multi-view high-resolution depth map;
calculating the average value of the multi-view high-resolution depth map, and taking the average value as a final high-resolution depth map;
obtaining three-dimensional coordinates of the dense point cloud from the inverse internal parameter matrix of the high-resolution infrared image, the final high-resolution depth map, and the pixel coordinates of the high-resolution infrared image;
and assigning the pixel values of the high-resolution infrared image to the dense point cloud according to the pixel coordinates, so as to generate a colored dense point cloud.
CN202210290383.5A 2022-03-23 2022-03-23 Dense point cloud generation method based on multi-view infrared Pending CN114723915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210290383.5A CN114723915A (en) 2022-03-23 2022-03-23 Dense point cloud generation method based on multi-view infrared

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210290383.5A CN114723915A (en) 2022-03-23 2022-03-23 Dense point cloud generation method based on multi-view infrared

Publications (1)

Publication Number Publication Date
CN114723915A true CN114723915A (en) 2022-07-08

Family

ID=82240754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210290383.5A Pending CN114723915A (en) 2022-03-23 2022-03-23 Dense point cloud generation method based on multi-view infrared

Country Status (1)

Country Link
CN (1) CN114723915A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115578265A (en) * 2022-12-06 2023-01-06 中汽智联技术有限公司 Point cloud enhancement method, system and storage medium
CN115578265B (en) * 2022-12-06 2023-04-07 中汽智联技术有限公司 Point cloud enhancement method, system and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination