CN111583345A - Method, device and equipment for acquiring camera parameters and storage medium - Google Patents

Method, device and equipment for acquiring camera parameters and storage medium Download PDF

Info

Publication number
CN111583345A
Authority
CN
China
Prior art keywords
model
camera
depthnet
motionnet
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010387692.5A
Other languages
Chinese (zh)
Other versions
CN111583345B (en
Inventor
王欣
贾锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University filed Critical Jilin University
Priority to CN202010387692.5A priority Critical patent/CN111583345B/en
Publication of CN111583345A publication Critical patent/CN111583345A/en
Application granted granted Critical
Publication of CN111583345B publication Critical patent/CN111583345B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C25/00Manufacturing, calibrating, cleaning, or repairing instruments or devices referred to in the other groups of this subclass
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Manufacturing & Machinery (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method, a device, equipment and a storage medium for acquiring camera parameters, comprising the following steps: collecting original continuous frame images shot by a monocular camera; constructing a DepthNet model and a MotionNet model, wherein the DepthNet model comprises a network for outputting a single-channel depth map, and the MotionNet model comprises a main network used for camera motion prediction and giving a pixel confidence mask and a branch network used for camera internal parameter prediction; preprocessing the original continuous frame images and inputting them into the constructed models, performing unsupervised training on the models through a joint loss function, and performing hyper-parameter tuning; and processing the image to be detected through the trained models, and outputting a depth map of each frame of image, the camera motion, the camera internal parameters and a pixel confidence mask containing scene motion information. Therefore, the camera does not need to be calibrated, and the camera internal parameters, the camera motion and the depth map of each frame can be obtained by directly inputting the video.

Description

Method, device and equipment for acquiring camera parameters and storage medium
Technical Field
The invention relates to the field of computer vision and photogrammetry, in particular to a method, a device, equipment and a storage medium for acquiring camera parameters.
Background
As one of the main tools of computer vision, cameras and the various algorithms built around them occupy an important position. Photogrammetry in particular studies the imaging principle of cameras and focuses on how to recover real-world information from the pictures they take. For many applications of computer vision and photogrammetry, such as industrial control, automatic driving and robot navigation, camera internal parameters, camera motion and depth of field are all of great value, and a large number of calculations related to photogrammetry and camera imaging properties take these three pieces of information as input.
The camera's internal reference (intrinsic parameters) contains information such as the focal length; the camera's self-motion, also called ego-motion, describes the position transformation of the camera itself; and the depth of field expresses the distance between each point in the camera's field of view and the optical center of the camera, usually represented by a depth map. The process of acquiring the internal and external parameters of the camera is generally called camera calibration, and the process of acquiring ego-motion is called visual odometry (VO).
Existing methods not based on deep learning typically acquire camera parameters, ego-motion and depth-of-field information with separate techniques. Acquiring the internal reference requires taking several (usually about 20) calibration-board images from different angles; when the camera must be adjusted frequently, calibration also has to be repeated frequently, and for application scenes where the camera device is not accessible, such calibration is simply unavailable. The methods for acquiring ego-motion and depth information have similar drawbacks: their normal operation relies on several prerequisite assumptions (the static-scene assumption, the scene-consistency assumption and the Lambertian assumption), and any condition that violates these assumptions affects the normal operation of the corresponding method. Techniques based on deep learning can, to different degrees, remove the dependence on such prerequisite assumptions and can acquire ego-motion and depth information simultaneously, improving convenience of use. However, the camera's internal reference still needs to be supplied as input, so the inconvenience caused by camera calibration cannot be completely eliminated.
Therefore, how to overcome the limitations of existing solutions, which require camera calibration and large amounts of supervised training data, is a technical problem to be urgently solved by those skilled in the art.
Disclosure of Invention
In view of this, the present invention provides a method, an apparatus, a device and a storage medium for acquiring camera parameters, which can output a depth map of each frame, a motion of a camera during shooting and an internal reference of the camera by using consecutive frames shot by a monocular camera as input without calibrating the camera.
The specific scheme is as follows:
a camera parameter obtaining method comprises the following steps:
collecting original continuous frame images shot by a monocular camera;
constructing a DepthNet model and a MotionNet model; the DepthNet model comprises a network for outputting a single-channel depth map; the MotionNet model comprises a main network used for camera motion prediction and giving a pixel confidence mask and a branch network used for camera internal parameter prediction;
respectively inputting the original continuous frame images into the constructed DepthNet model and the MotionNet model after preprocessing, and performing unsupervised training and hyper-parameter tuning on the DepthNet model and the MotionNet model through a joint loss function;
and processing the image to be detected through the trained DepthNet model and the MotionNet model, and outputting a depth map of each frame of the image to be detected, the motion of the camera, the internal reference of the camera and a pixel confidence mask containing scene motion information.
Preferably, in the above method for acquiring camera parameters provided in the embodiment of the present invention, the DepthNet model is composed of a first encoder and a first decoder;
preprocessing the original continuous frame images and inputting the preprocessed original continuous frame images into the DepthNet model for training, wherein the method specifically comprises the following steps:
acquiring a preprocessed three-channel image through the first encoder, and successively encoding the three-channel image into features of multiple granularities;
decoding using the first decoder in conjunction with features of different granularity;
and outputting a single-channel depth map with the same size as the input three-channel image through the first decoder.
Preferably, in the method for acquiring camera parameters provided in the embodiment of the present invention, the three-channel image is successively encoded into features of multiple granularities by the first encoder, and the decoding is performed by using the first decoder in combination with the features of different granularities, which specifically includes:
in the first encoder, a 2D convolution with a 7 × 7 convolution kernel is performed, and after batch normalization and a rectified linear unit, a first-level feature code is formed;
connecting a maximum pooling layer and two first residual modules to form a second-level feature code;
alternately connecting a second residual module and the first residual module to form a third-level feature code, a fourth-level feature code and a fifth-level feature code respectively;
inputting the first level feature encoding, the second level feature encoding, the third level feature encoding, the fourth level feature encoding, and the fifth level feature encoding to the first decoder;
in the first decoder, 2D transposed convolution and 2D convolution are alternately used, the five levels of feature codes are combined step by step, and a softplus activation function is adopted for output at the output layer (a minimal sketch of this encoder-decoder structure is given below).
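As an illustration of this encoder-decoder pattern, the following is a minimal PyTorch sketch. The channel widths, the plain strided convolutions standing in for the residual modules, and the class name DepthNetSketch are illustrative assumptions, not the exact network of this embodiment.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Level 1: 7x7 convolution -> batch normalization -> ReLU
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 7, stride=2, padding=3),
                                  nn.BatchNorm2d(32), nn.ReLU(inplace=True))
        # Levels 2-5: progressively coarser features (residual blocks replaced
        # by plain strided convolutions purely for brevity)
        self.enc2 = nn.Sequential(nn.MaxPool2d(3, 2, 1),
                                  nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True))
        self.enc3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.enc4 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.enc5 = nn.Sequential(nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        # Decoder: alternate 2D transposed convolution (upsampling) and 2D
        # convolution (merging the skip connection from the matching encoder level)
        self.up4, self.dec4 = nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.Conv2d(512, 256, 3, padding=1)
        self.up3, self.dec3 = nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.Conv2d(256, 128, 3, padding=1)
        self.up2, self.dec2 = nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.Conv2d(128, 64, 3, padding=1)
        self.up1, self.dec1 = nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.Conv2d(64, 32, 3, padding=1)
        self.up0, self.out = nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.Conv2d(16, 1, 3, padding=1)

    def forward(self, x):                        # x: [B, 3, H, W]
        f1 = self.enc1(x)                        # 1/2 resolution
        f2 = self.enc2(f1)                       # 1/4
        f3 = self.enc3(f2)                       # 1/8
        f4 = self.enc4(f3)                       # 1/16
        f5 = self.enc5(f4)                       # 1/32
        d = F.relu(self.dec4(torch.cat([self.up4(f5), f4], 1)))
        d = F.relu(self.dec3(torch.cat([self.up3(d), f3], 1)))
        d = F.relu(self.dec2(torch.cat([self.up2(d), f2], 1)))
        d = F.relu(self.dec1(torch.cat([self.up1(d), f1], 1)))
        return F.softplus(self.out(self.up0(d)))  # [B, 1, H, W] single-channel depth map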
Preferably, in the above method for acquiring camera parameters provided in the embodiment of the present invention, the backbone network is composed of a second encoder and a second decoder;
inputting the original continuous frame images into the MotionNet model for training after preprocessing, and specifically comprising the following steps:
acquiring two adjacent preprocessed frame images through the second encoder;
in the second encoder, 7 cascaded 3 × 3 2D convolutional layers are used, one 1 × 1 convolutional layer is connected to the bottleneck portion, the number of output channels is compressed to six, the first three channels output the translation of the camera, and the last three channels output the rotation of the camera;
in the second decoder, two parallel convolution paths are adopted with a short-cut connection, the convolution output and the output of bilinear interpolation are combined to form the output of a Refine module, and a pixel-level confidence mask is output and used for determining whether each pixel participates in calculation when the joint loss function is calculated; meanwhile, a penalty function is added to the pixel-level confidence mask to prevent training degradation;
and outputting the internal reference matrix of the camera through the branch network connected to the bottommost layer of the backbone network's encoder (a minimal sketch of the bottleneck outputs follows these steps).
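The following hedged PyTorch sketch illustrates these bottleneck outputs. The global spatial averaging, the assumed bottleneck width of 1024 channels and the class name MotionBottleneckSketch are illustrative assumptions; only the 1 × 1 compression to six channels (three for translation, three for rotation) follows the text, and the internal-parameter branch described later attaches to this same bottleneck feature.

import torch
import torch.nn as nn

class MotionBottleneckSketch(nn.Module):
    def __init__(self, in_channels=1024):
        super().__init__()
        # 1x1 convolution at the bottleneck compresses the channels to six
        self.squeeze = nn.Conv2d(in_channels, 6, kernel_size=1)

    def forward(self, bottleneck_feat):
        # Average over the spatial dimensions (an assumption), then split 6 -> 3 + 3
        motion = self.squeeze(bottleneck_feat).mean(dim=(2, 3))   # [B, 6]
        translation, rotation = motion[:, :3], motion[:, 3:]
        return translation, rotation

feat = torch.randn(2, 1024, 4, 13)             # stand-in bottleneck feature map
t, r = MotionBottleneckSketch()(feat)           # camera translation and rotation, each [2, 3]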
Preferably, in the method for acquiring camera parameters provided in the embodiment of the present invention, outputting the internal reference matrix of the camera specifically includes:
in the branch network, multiplying the network predicted value by the width and height of the image to obtain the actual focal length;
adding 0.5 to the network predicted value, and multiplying by the width and height of the image to obtain the pixel coordinate of the principal point;
and diagonalizing the focal lengths into a 2 × 2 diagonal matrix, appending the column vector formed by the principal point coordinates, and adding a row vector to form the 3 × 3 internal reference matrix.
Preferably, in the method for acquiring camera parameters provided in the embodiment of the present invention, the preprocessing the original continuous frame images includes:
adjusting the resolution of the original continuous frame images, and arranging and splicing them into a plurality of triple frame images;
when each triple frame image is input into the DepthNet model, outputting a depth map of each frame image;
and when each triple frame image is input into the MotionNet model, outputting, four times, the camera motion between every two adjacent frame images together with the internal reference and pixel confidence mask of the camera.
Preferably, in the method for acquiring camera parameters provided in the embodiment of the present invention, the joint loss function is calculated by using the following formula:
L_total = a·L_R + b·L_smooth + c·Λ  (1)
wherein L_total is the joint loss function, L_R is the reprojection error function, a is the weight of the reprojection error function, L_smooth is the depth smoothing loss, b is the weight of the depth smoothing loss, Λ is the regularization penalty function of the pixel confidence mask, and c is the weight of the penalty function.
The embodiment of the present invention further provides a device for acquiring camera parameters, including:
the image collection module is used for collecting original continuous frame images shot by the monocular camera;
the model building module is used for building a DepthNet model and a MotionNet model; the DepthNet model comprises a network for outputting a single-channel depth map; the MotionNet model comprises a main network used for camera motion prediction and giving a pixel confidence mask and a branch network used for camera internal parameter prediction;
the model training module is used for respectively inputting the original continuous frame images into the constructed DepthNet model and the MotionNet model after preprocessing, performing unsupervised training on the DepthNet model and the MotionNet model through a joint loss function, and performing hyper-parameter tuning;
and the model prediction module is used for processing the image to be detected through the trained DepthNet model and the MotionNet model, and outputting a depth map of each frame of the image to be detected, the motion of the camera, the internal parameters of the camera and a pixel confidence mask containing scene motion information.
The embodiment of the present invention further provides a device for acquiring camera parameters, which includes a processor and a memory, wherein the processor implements the method for acquiring camera parameters provided in the embodiment of the present invention when executing the computer program stored in the memory.
The embodiment of the present invention further provides a computer-readable storage medium for storing a computer program, where the computer program is executed by a processor to implement the method for acquiring the camera parameter provided in the embodiment of the present invention.
It can be seen from the foregoing technical solutions that the method, apparatus, device and storage medium for acquiring camera parameters provided by the present invention include: collecting original continuous frame images shot by a monocular camera; constructing a DepthNet model and a MotionNet model, wherein the DepthNet model comprises a network for outputting a single-channel depth map, and the MotionNet model comprises a main network used for camera motion prediction and giving a pixel confidence mask and a branch network used for camera internal parameter prediction; respectively inputting the original continuous frame images into the constructed DepthNet model and MotionNet model after preprocessing, and performing unsupervised training and hyper-parameter tuning on the DepthNet model and the MotionNet model through a joint loss function; and processing the image to be detected through the trained DepthNet model and MotionNet model, and outputting a depth map of each frame of the image to be detected, the motion of the camera, the internal parameters of the camera and a pixel confidence mask containing scene motion information.
The method does not require camera calibration and imposes no additional restrictions on the usage scene: by directly inputting any video shot by a monocular camera, the camera motion trajectory, the depth map of each frame and the camera internal parameters at shooting time can be acquired, and unsupervised learning with the joint loss function guarantees normal training even when the camera internal parameters are unknown. In addition, the method provides a less constrained front-end solution for computer vision applications that require camera internal reference, camera motion and depth maps, and therefore has good application value.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the related art, the drawings used in the description of the embodiments or the related art will be briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a flowchart of a method for acquiring camera parameters according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a first encoder in a DepthNet model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a first residual error module in a first encoder according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a second residual error module in the first encoder according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a first decoder in the DepthNet model according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a backbone network in a MotionNet model according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a Refine module in a backbone network according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a branch network in the MotionNet model according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of an arrangement of training data according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an apparatus for acquiring camera parameters according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a method for acquiring camera parameters, which comprises the following steps as shown in figure 1:
s101, collecting original continuous frame images shot by a monocular camera; it should be noted that, the collected original continuous frame images can be extracted from the KITTI data set;
s102, constructing a DepthNet model and a MotionNet model; the DepthNet model comprises a network for outputting a single-channel depth map; the MotionNet model comprises a main network used for camera motion prediction and giving a pixel confidence mask and a branch network used for camera internal parameter prediction; it should be noted that the function of the internal reference prediction enables the present invention to extract accurate camera motion and depth information from any video from unknown sources without performing camera calibration;
s103, respectively inputting the original continuous frame images into a constructed DepthNet model and a MotionNet model after preprocessing, carrying out unsupervised training on the DepthNet model and the MotionNet model through a joint loss function, and carrying out super-parameter tuning; it should be noted that the joint loss function is composed of a reprojection error, a depth smoothing loss and a regularization penalty function of a pixel confidence mask, and the joint loss function uses the relationship among continuous frames shot by a monocular camera as a source of a supervision signal, provides training power for a depth model, and is a key for realizing unsupervised learning; in addition, the model provided by the invention mainly comprises the learning rate, the loss function weight and the batch size, and the super parameters need to be adjusted and optimized in order to obtain the optimal combination;
and S104, processing the image to be detected through the trained DepthNet model and the MotionNet model, and outputting a depth map of each frame of image to be detected, the motion of the camera, the internal reference of the camera and a pixel confidence mask containing scene motion information.
In the method for acquiring camera parameters provided by the embodiment of the invention, the DepthNet model and the MotionNet model are unsupervised deep learning models: the camera does not need to be calibrated, no extra limitation is imposed on the usage scene, and by directly inputting any video shot by a monocular camera, the camera motion track, the depth map of each frame and the camera internal parameters at shooting time can be acquired; unsupervised learning with the joint loss function ensures normal training even when the camera internal parameters are unknown. In addition, the method provides a less constrained front-end solution for computer vision applications that require camera internal reference, camera motion and depth maps, and has good application value.
In practical application, the model can be implemented in PyTorch and trained on a deep learning workstation whose CPUs are two Intel Xeon E5-2678 v3, with 64 GB of main memory and four NVIDIA GeForce GTX 1080Ti graphics cards, each with 12 GB of video memory. The invention performs parallel optimization on this machine, in particular setting the epochs to a multiple of 4. In the data-reading stage, the two CPUs each load half of the data into their respective main memories; because each of the two CPUs is directly connected to two of the four graphics cards through PCI-E channels, having the two CPUs load data separately makes the fullest use of each PCI-E channel's bandwidth and helps reduce data-transfer time. After the data are transferred from main memory to video memory, the four graphics cards each start their own gradient computation; when all four have consumed their data, the program reaches a synchronization point, each card reports its gradient information to the CPU, the CPU aggregates the gradients and updates the model, and the next cycle begins. The final effect is that while the GPUs perform gradient computation, the CPUs read and prepare data, reducing GPU idle time as much as possible and improving overall running efficiency.
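A hedged PyTorch sketch of this overlap is shown below: DataLoader worker processes prepare the next batches on the CPU while the GPUs run the current step, and gradients from all replicas are aggregated before each model update. The stand-in model, dataset and batch size are placeholders, not the training setup of this embodiment.

import torch
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(10, 1)                        # stand-in for DepthNet/MotionNet
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)              # replicate the model across the GPUs
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
loader = DataLoader(dataset, batch_size=16, num_workers=2, pin_memory=True)

for x, y in loader:                                    # workers prefetch while the GPUs compute
    x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()                                    # gradients gathered from all replicas
    optimizer.step()                                   # single model update per cycle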
In specific implementation, in the method for acquiring camera parameters provided in the embodiment of the present invention, the DepthNet model may be composed of a first encoder and a first decoder; the input of the DepthNet model is a three-channel picture shot by a monocular camera, and a single-channel depth map with the same size as the input is output through the encoder-decoder structure, which amounts to predicting a depth value for every pixel of the input image. In addition, because the model is fairly complex, in order to ensure effective propagation of the gradient and allow the deep network to be trained well, the model adopts a large number of residual building blocks.
In step S103, the original continuous frame image is input into the DepthNet model after being preprocessed, and the training may specifically include: firstly, acquiring a preprocessed three-channel image through a first encoder, and successively encoding the three-channel image into characteristics of various granularities; then, a first decoder is used for decoding by combining the characteristics with different granularities; and finally, outputting the single-channel depth map with the same size as the input three-channel image size through the first decoder. As shown in fig. 2, the first encoder may output features of five granularities.
Further, in implementation, successively encoding the three-channel image into features of multiple granularities through the first encoder and decoding with the first decoder in combination with the features of different granularities may specifically include: in the first encoder, first performing a 2D convolution with a 7 × 7 kernel, followed by batch normalization and a rectified linear unit, to form the first-level feature code; then connecting a maximum pooling layer and two first residual modules (residual_block_A) to form the second-level feature code; finally, alternately connecting second residual modules (residual_block_B) and first residual modules to form the third-level, fourth-level and fifth-level feature codes respectively; next, inputting the first-level to fifth-level feature codes to the first decoder; in the first decoder, 2D transposed convolution and 2D convolution are used alternately, the five levels of feature codes are combined step by step, and a softplus activation function is applied at the output layer.
It should be noted that the first encoder contains two kinds of residual blocks, residual_block_A and residual_block_B. As shown in fig. 3, residual_block_A consists mainly of two 3 × 3 convolutional layers whose output channel counts equal the input channel count of the residual module, so residual_block_A does not change the number of channels of the tensor. The part formed by the two consecutive convolutional layers in a residual block is called the main branch, and the branch leading directly from the input to the output of the main branch is called the short-cut. The main branch of residual_block_B is similar to that of residual_block_A, but its short-cut contains some conditional judgment logic. As shown in fig. 4, when the number of input channels is not equal to the number of output channels, the short-cut applies a 1 × 1 convolution to the input tensor, which adjusts the input and output channel counts to be consistent; when the number of input channels equals the number of output channels, the stride is further examined. When the stride is 1, the short-cut is simply the input tensor; when the stride is not 1, the dimensions of the output tensor no longer equal those of the input tensor, and to compensate for this difference a maximum pooling layer is applied to the input tensor. As shown in fig. 2, out_channels and stride are supplied from outside the module, and the three "out" labels represent the outputs in the three cases respectively.
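A hedged PyTorch sketch of the two residual blocks follows; the batch-normalization and activation placement are assumptions, while the short-cut logic follows the text above.

import torch
import torch.nn as nn

class ResidualBlockA(nn.Module):
    # Two 3x3 convolutions; the channel count is preserved, the short-cut is the identity.
    def __init__(self, channels):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))

    def forward(self, x):
        return torch.relu(self.main(x) + x)

class ResidualBlockB(nn.Module):
    # Same main branch, but the short-cut depends on out_channels and stride.
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.BatchNorm2d(out_channels))
        if in_channels != out_channels:
            # Channel counts differ: a 1x1 convolution aligns them (and the stride).
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1, stride=stride)
        elif stride == 1:
            self.shortcut = nn.Identity()                  # plain identity short-cut
        else:
            # Channels match but the resolution changes: max pooling compensates.
            self.shortcut = nn.MaxPool2d(kernel_size=1, stride=stride)

    def forward(self, x):
        return torch.relu(self.main(x) + self.shortcut(x))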
As shown in fig. 5, the first decoder receives the five levels of feature codes from the first encoder as input, combines them step by step by alternating 2D transposed convolution and 2D convolution, and finally uses a softplus activation function in the output layer to output a single-channel depth map whose size equals the encoder input. The concat_and_pad in fig. 5 is a composite operation: it first concatenates the output of the 2D transposed convolution with the next-level encoder output in the third dimension, then performs padding and feeds the result into the subsequent 2D convolution. The final output depth map has size [B, h, w, 1], where B is the batch size, h and w are the height and width of the picture, and 1 indicates that the depth map has one channel.
In a specific implementation, in the method for acquiring camera parameters provided in the embodiment of the present invention, the backbone network in the MotionNet model may be formed by a second encoder and a second decoder.
In step S103, the original continuous frame images are preprocessed and input into the MotionNet model for training, which may specifically include the following. First, two adjacent preprocessed frame images are acquired through the second encoder. Then, as shown in fig. 6, in the second encoder, 7 cascaded 2D convolutional layers with 3 × 3 kernels are used, a 1 × 1 convolutional layer is connected at the bottleneck of the second encoder, and the number of output channels is compressed to six, the first three channels outputting the translation of the camera and the last three channels outputting the rotation of the camera. Next, in the second decoder, two parallel convolution paths with a short-cut connection similar to that of a residual module are adopted; the convolution output is combined with the output of a bilinear interpolation to form the output of a Refine module, and a pixel-level confidence mask is output, which determines whether each pixel participates in the computation of the joint loss function; pixels excluded because of scene motion (translation or rotation of objects), occlusion and other factors do not take part in the reprojection loss. Meanwhile, a penalty function is applied to the pixel-level confidence mask to prevent training degradation. Finally, the internal reference matrix of the camera is output through the branch network connected to the bottommost layer of the encoder of the backbone network.
It should be understood that, in the backbone network of MotionNet, the second decoder is composed of a Refine module. As shown in fig. 7, conv_input represents the input on the decoder side, and refine_input represents the output of the previous Refine stage. To resolve the difference in resolution, the invention uses bilinear interpolation to resize the output of the previous Refine stage.
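A hedged PyTorch sketch of such a Refine module follows; the channel counts, the exact depth of the two convolution paths and the class name RefineSketch are illustrative assumptions, while the bilinear resizing and the short-cut style combination follow the text.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RefineSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Two parallel convolution paths over the concatenated inputs
        self.path_a = nn.Sequential(nn.Conv2d(channels + 1, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.path_b = nn.Sequential(nn.Conv2d(channels + 1, channels, 3, padding=1), nn.ReLU(inplace=True),
                                    nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.merge = nn.Conv2d(2 * channels, 1, 1)

    def forward(self, conv_input, refine_input):
        # Bilinear interpolation resolves the resolution mismatch between stages
        up = F.interpolate(refine_input, size=conv_input.shape[2:], mode="bilinear", align_corners=False)
        x = torch.cat([conv_input, up], dim=1)
        out = self.merge(torch.cat([self.path_a(x), self.path_b(x)], dim=1))
        return out + up                                    # short-cut style combination

refined = RefineSketch(32)(torch.randn(1, 32, 16, 52), torch.randn(1, 1, 8, 26))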
In a specific implementation, in the method for acquiring camera parameters provided in the embodiment of the present invention, outputting the internal reference matrix of the camera specifically includes: in the branch network, multiplying the network predicted value by the width and height of the image to obtain the actual focal lengths; adding 0.5 to the network predicted value and multiplying by the width and height of the image to obtain the pixel coordinates of the principal point; and diagonalizing the focal lengths into a 2 × 2 diagonal matrix, appending the column vector formed by the principal point coordinates, and adding a row vector to form the 3 × 3 internal reference matrix.
Specifically, as shown in fig. 8, the "bottleneck" on the left side represents the bottommost feature output by the encoder of the backbone network, with size [B, 1, 1024], where B is the batch size. Two parallel 1 × 1 convolutions are used to predict, respectively, the focal lengths f_x, f_y and the principal point coordinates c_x, c_y of the internal reference matrix. For the principal point, the predicted values are expressed as ratios of the image width and height; since the principal point lies toward the center of the image, the invention adds 0.5 to the network prediction and then multiplies by the width and height to obtain the pixel coordinates of the principal point. Finally, the focal lengths are diagonalized into a 2 × 2 diagonal matrix, the column vector formed by the principal point coordinates is appended, and the row vector [0, 0, 1] is added, finally forming the 3 × 3 internal reference matrix.
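A minimal sketch of this matrix assembly in PyTorch follows; the raw predictions are assumed to already be squeezed to shape [B], and whether each focal ratio is multiplied by the width or the height is an assumption taken from the usual convention.

import torch

def build_intrinsics(fx_pred, fy_pred, cx_pred, cy_pred, width, height):
    fx = fx_pred * width                         # predicted focal ratios -> pixel focal lengths
    fy = fy_pred * height
    cx = (cx_pred + 0.5) * width                 # principal point biased toward the image center
    cy = (cy_pred + 0.5) * height
    K = torch.zeros(fx.shape[0], 3, 3)
    K[:, 0, 0], K[:, 1, 1] = fx, fy              # 2x2 diagonal of focal lengths
    K[:, 0, 2], K[:, 1, 2] = cx, cy              # appended column of principal point coordinates
    K[:, 2, 2] = 1.0                             # appended row [0, 0, 1]
    return K

K = build_intrinsics(torch.tensor([0.9]), torch.tensor([2.9]),
                     torch.tensor([0.0]), torch.tensor([0.0]), 416, 128)   # example call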
In specific implementation, in the method for acquiring camera parameters provided in the embodiment of the present invention, the preprocessing the original continuous frame image in step S103 may include: adjusting the resolution of the original continuous frame images, arranging and splicing the images to form a plurality of triple frame images; when each triple frame image is input into the DepthNet model, outputting a depth map of each frame image; when each triple frame image is input into the MotionNet model, the camera motion between every two adjacent frame images, the internal reference of the camera and the pixel confidence mask are output four times.
It should be understood that, with the model and training method adopted by the present invention, each training step of the DepthNet model requires one monocular color image (i.e., a three-channel image), and each training step of the MotionNet model requires two temporally continuous images (i.e., two adjacent frame images). In order to improve data-reading efficiency, the present invention preprocesses the original continuous frame images: as shown in fig. 9, every three consecutive images are spliced into one image, where (a) represents the original continuous frame images in the data set and (b) represents the triple frame images spliced together after preprocessing; after such processing, two pairs of adjacent images can be obtained from each triple frame image. To reduce the computational load, the original images are proportionally downscaled while being spliced; finally, the resolution of all images can be unified to 416 × 128, and the resolution of a single triple frame image can be 1248 × 128.
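A small sketch of reading back such a pre-stitched triple frame follows; the 1248 × 128 resolution comes from the text, while the NumPy array layout and the helper name split_triple_frame are assumptions for illustration.

import numpy as np

def split_triple_frame(triple):
    # triple: H x (3*W) x 3 array holding three horizontally stitched frames
    h, w3, _ = triple.shape
    w = w3 // 3
    frames = [triple[:, i * w:(i + 1) * w] for i in range(3)]
    pairs = [(frames[0], frames[1]), (frames[1], frames[2])]    # two adjacent pairs
    return frames, pairs

triple = np.zeros((128, 1248, 3), dtype=np.uint8)     # placeholder triple frame image
frames, pairs = split_triple_frame(triple)             # 3 frames of 128 x 416, 2 pairs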
Specifically, the training process is performed in units of triple frames. When a triple frame is read, a depth map of each frame is first generated with the DepthNet model; the MotionNet model then generates the camera motion, pixel confidence mask and camera internal parameters from frame 1 to frame 2, and likewise from frame 2 to frame 3, from frame 3 to frame 2 and from frame 2 to frame 1, yielding four predictions of the camera internal parameters, whose average is taken as the internal parameters associated with this triple. The joint loss function can then be applied once between every two adjacent frames, i.e. four times in total (1-2, 2-3, 3-2, 2-1), and the four loss values are accumulated as the loss value associated with this set of triple frames. During actual training, each data read yields a batch of triple frames, which are computed in parallel and then back-propagated to update the model.
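A hedged sketch of one such training step follows. The dictionary interface of the model outputs (keys "motion", "mask", "intrinsics"), the helper names and the averaging of the intrinsics as a plain mean are assumptions; only the order of the four passes and the accumulation of the four loss values follows the text.

def train_step(frames, depth_net, motion_net, joint_loss, optimizer):
    depths = [depth_net(f) for f in frames]                     # depth map for each of the 3 frames
    pairs = [(0, 1), (1, 2), (2, 1), (1, 0)]                    # the four directed frame pairs
    outputs = [motion_net(frames[s], frames[t]) for s, t in pairs]
    K = sum(o["intrinsics"] for o in outputs) / len(outputs)    # average the four intrinsics predictions
    loss = sum(joint_loss(frames[s], frames[t], depths[s], depths[t],
                          o["motion"], o["mask"], K)
               for (s, t), o in zip(pairs, outputs))            # accumulate the four loss values
    optimizer.zero_grad()
    loss.backward()                                             # one back-propagation per triple frame
    optimizer.step()
    return loss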
In a specific implementation, in the method for acquiring camera parameters provided in the embodiment of the present invention, the joint loss function may be calculated by using the following formula:
L_total = a·L_R + b·L_smooth + c·Λ  (1)
wherein L_total is the joint loss function, L_R is the reprojection error function, a is the weight of the reprojection error function, L_smooth is the depth smoothing loss (the L1 norm of the depth values: the more outliers and sharp points in the depth map, the greater the smoothing loss), b is the weight of the depth smoothing loss, Λ is the regularized penalty function of the pixel confidence mask, and c is the weight of the penalty function.
L_R in formula (1) is:
L_R = ∑_{i,j} M(i,j)·|φ(Î_s)(i,j) − I_s(i,j)|  (2)
wherein i, j are pixel coordinates, M(i,j) is the pixel confidence mask, representing the confidence of the pixel at (i, j), the function φ denotes bilinear interpolation, Î_s(i,j) is the reprojected view, and I_s(i,j) is the real view.
The reprojection method used in formula (1) is:
D_s(p_s)·p_s = K·R·K⁻¹·D_t(p_t)·p_t + K·t  (3)
where K is the camera internal reference matrix, R is the camera rotation matrix, t is the camera translation vector, p_s is the pixel coordinate after reprojection, p_t is the pixel coordinate before reprojection, D_s(p_s) is the depth value corresponding to pixel coordinate p_s after reprojection, and D_t(p_t) is the depth value corresponding to pixel coordinate p_t before reprojection.
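The following is a hedged PyTorch sketch of this reprojection for a set of homogeneous pixel coordinates; the tensor shapes and the function name reproject are assumptions for illustration.

import torch

def reproject(p_t, D_t, K, R, t):
    # p_t: [N, 3] homogeneous pixel coordinates, D_t: [N] depths, K, R: [3, 3], t: [3]
    cam_points = (torch.linalg.inv(K) @ p_t.T) * D_t            # back-project to 3D camera coordinates
    warped = K @ (R @ cam_points) + (K @ t).unsqueeze(1)         # rotate, translate, project with K
    D_s = warped[2]                                              # reprojected depth D_s(p_s)
    p_s = (warped / D_s).T                                       # normalized homogeneous pixel coordinates
    return p_s, D_s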
Λ in formula (1) is:
Λ = mean_{i,j} H(i,j)  (4)
meaning that all H(i,j) are averaged. H(i,j) is defined as:
H(i,j) = −∑_{i,j} M(i,j)·log(S(i,j))  (5)
meaning that the cross entropy is taken over S(i,j), where S(i,j) is derived from the pixel confidence mask M(i,j) in formula (6); in other words, the cross entropy of the pixel confidence mask is taken.
The penalty function Λ prevents the network from predicting all pixel confidence masks as "untrustworthy" (i.e., M(i,j) taking the value 0 everywhere), in which case the most dominant part L_R of the loss function would directly become 0; because deep learning tends to minimize the loss function, without the penalty function training easily falls into this "all untrusted" state, in which the loss function is small but has no real meaning.
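A hedged sketch of the complete joint loss in PyTorch follows. The absolute photometric difference, the exact form of the smoothness term and the use of a binary cross entropy pulling the mask toward 1 are assumptions standing in for formulas (2), (4)-(6); the weights a, b, c and the overall structure follow formula (1).

import torch
import torch.nn.functional as F

def joint_loss(warped, target, depth, mask, a=1.0, b=0.1, c=0.05):
    # Reprojection error: pixel-confidence-weighted photometric difference between views
    l_r = (mask * (warped - target).abs()).mean()
    # Depth smoothing loss: L1 norm of the depth gradients
    l_smooth = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs().mean() + \
               (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs().mean()
    # Regularization penalty: cross entropy pulling the mask toward 1, so the
    # "all pixels untrustworthy" state is no longer a trivial minimum
    penalty = F.binary_cross_entropy(mask.clamp(1e-6, 1 - 1e-6), torch.ones_like(mask))
    return a * l_r + b * l_smooth + c * penalty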
Based on the same inventive concept, embodiments of the present invention further provide a device for acquiring camera parameters, and since the principle of the device for acquiring camera parameters to solve the problem is similar to the method for acquiring camera parameters, the implementation of the device for acquiring camera parameters may refer to the implementation of the method for acquiring camera parameters, and repeated details are not repeated.
In specific implementation, the apparatus for acquiring camera parameters provided in the embodiment of the present invention, as shown in fig. 10, specifically includes:
the image collecting module 11 is used for collecting original continuous frame images shot by the monocular camera;
the model building module 12 is used for building a DepthNet model and a MotionNet model; the DepthNet model comprises a network for outputting a single-channel depth map; the MotionNet model comprises a main network used for camera motion prediction and giving a pixel confidence mask and a branch network used for camera internal parameter prediction;
the model training module 13 is configured to preprocess the original continuous frame images and input them respectively into the constructed DepthNet model and MotionNet model, perform unsupervised training on the DepthNet model and the MotionNet model through a joint loss function, and perform hyper-parameter tuning;
and the model prediction module 14 is configured to process the image to be measured through the trained DepthNet model and the MotionNet model, and output a depth map of each frame of the image to be measured, motion of the camera, internal parameters of the camera, and a pixel confidence mask including scene motion information.
In the device for acquiring camera parameters provided by the embodiment of the invention, through the interaction of the above four modules, the camera parameters can be acquired without calibrating the camera or using global information; important data including the camera internal parameters and depth maps can be obtained simply by recording video while the camera undergoes translational and rotational motion.
For more specific working processes of the modules, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
Correspondingly, the embodiment of the invention also discloses equipment for acquiring the camera parameters, which comprises a processor and a memory; the method for acquiring the camera parameters disclosed in the foregoing embodiments is implemented when the processor executes the computer program stored in the memory.
For more specific processes of the above method, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
Further, the present invention also discloses a computer readable storage medium for storing a computer program; the computer program, when executed by a processor, implements the method of acquiring camera parameters disclosed previously.
For more specific processes of the above method, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device, the equipment and the storage medium disclosed by the embodiment correspond to the method disclosed by the embodiment, so that the description is relatively simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
To sum up, the method, apparatus, device and storage medium for acquiring camera parameters provided in the embodiments of the present invention include: collecting original continuous frame images shot by a monocular camera; constructing a DepthNet model and a MotionNet model, wherein the DepthNet model comprises a network for outputting a single-channel depth map, and the MotionNet model comprises a main network used for camera motion prediction and giving a pixel confidence mask and a branch network used for camera internal parameter prediction; respectively inputting the original continuous frame images into the constructed DepthNet model and MotionNet model after preprocessing, and performing unsupervised training and hyper-parameter tuning on the DepthNet model and the MotionNet model through a joint loss function; and processing the image to be detected through the trained DepthNet model and MotionNet model, and outputting a depth map of each frame of the image to be detected, the motion of the camera, the internal parameters of the camera and a pixel confidence mask containing scene motion information. The method does not require camera calibration and imposes no additional restrictions on the usage scene: by directly inputting any video shot by a monocular camera, the camera motion trajectory, the depth map of each frame and the camera internal parameters at shooting time can be acquired, and unsupervised learning with the joint loss function guarantees normal training even when the camera internal parameters are unknown. In addition, the method provides a less constrained front-end solution for computer vision applications that require camera internal reference, camera motion and depth maps, and therefore has good application value.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The method, the apparatus, the device and the storage medium for acquiring camera parameters provided by the present invention are described in detail above, and a specific example is applied in the present disclosure to explain the principle and the implementation of the present invention, and the description of the above embodiment is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A method for acquiring camera parameters is characterized by comprising the following steps:
collecting original continuous frame images shot by a monocular camera;
constructing a DepthNet model and a MotionNet model; the DepthNet model comprises a network for outputting a single-channel depth map; the MotionNet model comprises a main network used for camera motion prediction and giving a pixel confidence mask and a branch network used for camera internal parameter prediction;
respectively inputting the original continuous frame images into the constructed DepthNet model and the MotionNet model after preprocessing, and performing unsupervised training and hyper-parameter tuning on the DepthNet model and the MotionNet model through a joint loss function;
and processing the image to be detected through the trained DepthNet model and the MotionNet model, and outputting a depth map of each frame of the image to be detected, the motion of the camera, the internal reference of the camera and a pixel confidence mask containing scene motion information.
2. The method for acquiring camera parameters according to claim 1, wherein the DepthNet model is composed of a first encoder and a first decoder;
preprocessing the original continuous frame images and inputting the preprocessed original continuous frame images into the DepthNet model for training, wherein the method specifically comprises the following steps:
acquiring a preprocessed three-channel image through the first encoder, and successively encoding the three-channel image into features of multiple granularities;
decoding using the first decoder in conjunction with features of different granularity;
and outputting a single-channel depth map with the same size as the input three-channel image through the first decoder.
3. The method according to claim 2, wherein the three-channel image is successively encoded into features of multiple granularities by the first encoder, and the decoding is performed by using the first decoder in combination with the features of different granularities, specifically including:
in the first encoder, a 2D convolution with a 7 × 7 convolution kernel is performed, and after batch normalization and a rectified linear unit, a first-level feature code is formed;
connecting a maximum pooling layer and two first residual modules to form a second-level feature code;
alternately connecting a second residual module and the first residual module to form a third-level feature code, a fourth-level feature code and a fifth-level feature code respectively;
inputting the first level feature encoding, the second level feature encoding, the third level feature encoding, the fourth level feature encoding, and the fifth level feature encoding to the first decoder;
in the first decoder, 2D transposed convolution and 2D convolution are alternately used, the five levels of feature codes are combined step by step, and a softplus activation function is adopted for output at the output layer.
4. The method for acquiring camera parameters according to claim 3, wherein the backbone network is composed of a second encoder and a second decoder;
inputting the original continuous frame images into the MotionNet model for training after preprocessing, and specifically comprising the following steps:
acquiring two adjacent preprocessed frame images through the second encoder;
in the second encoder, 7 cascaded 3 × 3 2D convolutional layers are used, one 1 × 1 convolutional layer is connected to the bottleneck portion, the number of output channels is compressed to six, the first three channels output the translation of the camera, and the last three channels output the rotation of the camera;
in the second decoder, two parallel convolution paths are adopted with a short-cut connection, the convolution output and the output of bilinear interpolation are combined to form the output of a Refine module, and a pixel-level confidence mask is output and used for determining whether each pixel participates in calculation when the joint loss function is calculated; meanwhile, a penalty function is added to the pixel-level confidence mask to prevent training degradation;
and outputting the internal reference matrix of the camera through the branch network connected to the lowest encoder of the backbone network.
5. The method for acquiring camera parameters according to claim 4, wherein outputting the internal reference matrix of the camera specifically includes:
in the branch network, multiplying the network predicted value by the width and height of the image to obtain the actual focal length;
adding 0.5 to the network predicted value, and multiplying by the width and height of the image to obtain the pixel coordinate of the principal point;
and diagonalizing the focal lengths into a 2 × 2 diagonal matrix, appending the column vector formed by the principal point coordinates, and adding a row vector to form the 3 × 3 internal reference matrix.
6. The method for acquiring camera parameters according to claim 1, wherein the preprocessing of the original continuous frame images comprises:
adjusting the resolution of the original continuous frame images, and arranging and splicing them into a plurality of triple frame images;
when each triple frame image is input into the DepthNet model, outputting a depth map of each frame image;
and when each triple frame image is input into the MotionNet model, outputting, four times, the camera motion between every two adjacent frame images together with the internal reference and pixel confidence mask of the camera.
7. The method of claim 1, wherein the joint loss function is calculated by using the following formula:
L_total = a·L_R + b·L_smooth + c·Λ  (1)
wherein L_total is the joint loss function, L_R is a reprojection error function, a is the weight of the reprojection error function, L_smooth is the depth smoothing loss, b is the weight of the depth smoothing loss, Λ is the regularization penalty function of the pixel confidence mask, and c is the weight of the penalty function.
8. An apparatus for acquiring camera parameters, comprising:
the image collection module is used for collecting original continuous frame images shot by the monocular camera;
the model building module is used for building a DepthNet model and a MotionNet model; the DepthNet model comprises a network for outputting a single-channel depth map; the MotionNet model comprises a main network used for camera motion prediction and giving a pixel confidence mask and a branch network used for camera internal parameter prediction;
the model training module is used for respectively inputting the original continuous frame images into the constructed DepthNet model and the MotionNet model after preprocessing, performing unsupervised training on the DepthNet model and the MotionNet model through a joint loss function, and performing hyper-parameter tuning;
and the model prediction module is used for processing the image to be detected through the trained DepthNet model and the MotionNet model, and outputting a depth map of each frame of the image to be detected, the motion of the camera, the internal parameters of the camera and a pixel confidence mask containing scene motion information.
9. An apparatus for acquiring camera parameters, comprising a processor and a memory, wherein the processor implements the method for acquiring camera parameters according to any one of claims 1 to 7 when executing a computer program stored in the memory.
10. A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the method of acquiring camera parameters according to any one of claims 1 to 7.
CN202010387692.5A 2020-05-09 2020-05-09 Method, device and equipment for acquiring camera parameters and storage medium Active CN111583345B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010387692.5A CN111583345B (en) 2020-05-09 2020-05-09 Method, device and equipment for acquiring camera parameters and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010387692.5A CN111583345B (en) 2020-05-09 2020-05-09 Method, device and equipment for acquiring camera parameters and storage medium

Publications (2)

Publication Number Publication Date
CN111583345A true CN111583345A (en) 2020-08-25
CN111583345B CN111583345B (en) 2022-09-27

Family

ID=72117146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010387692.5A Active CN111583345B (en) 2020-05-09 2020-05-09 Method, device and equipment for acquiring camera parameters and storage medium

Country Status (1)

Country Link
CN (1) CN111583345B (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180336704A1 (en) * 2016-02-03 2018-11-22 Sportlogiq Inc. Systems and Methods for Automated Camera Calibration
US20180231871A1 (en) * 2016-06-27 2018-08-16 Zhejiang Gongshang University Depth estimation method for monocular image based on multi-scale CNN and continuous CRF
CN108665496A (en) * 2018-03-21 2018-10-16 浙江大学 A kind of semanteme end to end based on deep learning is instant to be positioned and builds drawing method
CN109191491A (en) * 2018-08-03 2019-01-11 华中科技大学 The method for tracking target and system of the twin network of full convolution based on multilayer feature fusion
CN110009674A (en) * 2019-04-01 2019-07-12 厦门大学 Monocular image depth of field real-time computing technique based on unsupervised deep learning
CN110148179A (en) * 2019-04-19 2019-08-20 北京地平线机器人技术研发有限公司 A kind of training is used to estimate the neural net model method, device and medium of image parallactic figure
CN110503680A (en) * 2019-08-29 2019-11-26 大连海事大学 It is a kind of based on non-supervisory convolutional neural networks monocular scene depth estimation method
CN110738697A (en) * 2019-10-10 2020-01-31 福州大学 Monocular depth estimation method based on deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
吴炎桐: "Research on Camera Pose Estimation Algorithms Based on Unsupervised Learning", China Masters' Theses Full-text Database (Information Science and Technology) *
张春萍 et al.: "Review of Light Field Camera Imaging Models and Parameter Calibration Methods", Chinese Journal of Lasers *
肖进胜 et al.: "Haze Scene Image Translation Algorithm Based on Generative Adversarial Networks", Chinese Journal of Computers *
路昊 et al.: "Camera Pose Estimation Method for Dynamic Scenes Based on Deep Learning", High Technology Letters *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114531580A (en) * 2020-11-23 2022-05-24 北京四维图新科技股份有限公司 Image processing method and device
CN114531580B (en) * 2020-11-23 2023-11-21 北京四维图新科技股份有限公司 Image processing method and device
CN112606000A (en) * 2020-12-22 2021-04-06 上海有个机器人有限公司 Method for automatically calibrating robot sensor parameters, calibration room, equipment and computer medium

Also Published As

Publication number Publication date
CN111583345B (en) 2022-09-27

Similar Documents

Publication Publication Date Title
CN111047516B (en) Image processing method, image processing device, computer equipment and storage medium
CN109271933B (en) Method for estimating three-dimensional human body posture based on video stream
CN111402130B (en) Data processing method and data processing device
CN112308200B (en) Searching method and device for neural network
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
EP3872747B1 (en) Video super resolution method
CN111192226B (en) Image fusion denoising method, device and system
CN112819910A (en) Hyperspectral image reconstruction method based on double-ghost attention machine mechanism network
CN111583345B (en) Method, device and equipment for acquiring camera parameters and storage medium
CN111263161A (en) Video compression processing method and device, storage medium and electronic equipment
US20220414838A1 (en) Image dehazing method and system based on cyclegan
CN112862689A (en) Image super-resolution reconstruction method and system
CN113850231A (en) Infrared image conversion training method, device, equipment and storage medium
CN115115540A (en) Unsupervised low-light image enhancement method and unsupervised low-light image enhancement device based on illumination information guidance
CN114842400A (en) Video frame generation method and system based on residual block and feature pyramid
CN116485741A (en) No-reference image quality evaluation method, system, electronic equipment and storage medium
CN115546162A (en) Virtual reality image quality evaluation method and system
CN115565039A (en) Monocular input dynamic scene new view synthesis method based on self-attention mechanism
CN115661403A (en) Explicit radiation field processing method, device and storage medium
CN112541972A (en) Viewpoint image processing method and related equipment
CN111726621B (en) Video conversion method and device
CN115209150A (en) Video coding parameter acquisition method and device, network model and electronic equipment
CN115035173A (en) Monocular depth estimation method and system based on interframe correlation
Nottebaum et al. Efficient Feature Extraction for High-resolution Video Frame Interpolation
TWI472231B (en) Video pre-processing method and apparatus for motion estimation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant