CN111583345B - Method, device and equipment for acquiring camera parameters and storage medium - Google Patents
Method, device and equipment for acquiring camera parameters and storage medium
- Publication number
- CN111583345B CN202010387692.5A
- Authority
- CN
- China
- Prior art keywords
- model
- camera
- depthnet
- motionnet
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/80—Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C25/00—Manufacturing, calibrating, cleaning, or repairing instruments or devices referred to in the other groups of this subclass
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Manufacturing & Machinery (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Image Analysis (AREA)
Abstract
The application discloses a method, a device, equipment and a storage medium for acquiring camera parameters, comprising the following steps: collecting original continuous frame images shot by a monocular camera; constructing a DepthNet model and a MotionNet model, wherein the DepthNet model comprises a network for outputting a single-channel depth map, and the MotionNet model comprises a main network used for camera motion prediction and for giving a pixel confidence mask, and a branch network used for camera internal parameter prediction; preprocessing the original continuous frame images, inputting the preprocessed images into the constructed models, performing unsupervised training on the models through a joint loss function, and performing hyper-parameter tuning; and processing the image to be detected through the trained models, and outputting a depth map of each frame, the camera motion, the camera internal parameters and a pixel confidence mask containing scene motion information. Therefore, the camera does not need to be calibrated, and the camera internal parameters, the camera motion and the depth map of each frame can be obtained by directly inputting the video.
Description
Technical Field
The invention relates to the field of computer vision and photogrammetry, in particular to a method, a device, equipment and a storage medium for acquiring camera parameters.
Background
As one of the main tools of computer vision, the camera and the various algorithms built around it occupy an important position. Among these fields, photogrammetry mainly studies the imaging principle of cameras and focuses on how to obtain real-world information from the pictures a camera takes. For many applications of computer vision and photogrammetry, such as industrial control, automatic driving and robot navigation and path finding, the camera internal parameters, the camera motion and the depth of field all have important value, and a large number of calculation processes related to photogrammetry and the imaging properties of cameras take these three kinds of information as input.
The camera's internal parameters contain information such as the focal length of the camera; the camera's self-motion, also called ego-motion, contains the position transformation information of the camera itself; and the depth of field expresses the distance between each point in the camera's field of view and the optical center of the camera, and is usually represented by a depth map. The process of acquiring the internal and external parameters of the camera is generally called camera calibration, and the process of acquiring ego-motion is called visual odometry (VO).
Existing methods that are not based on deep learning typically use separate techniques to acquire the camera internal parameters, ego-motion, and depth-of-field information. Acquiring the internal parameters requires using the camera to take several (usually about 20) calibration-board images from different angles; when the camera needs to be adjusted frequently, the calibration also has to be performed frequently, and for application scenes where the camera device is not accessible, this calibration method is not available at all. The methods for acquiring ego-motion and depth information have similar drawbacks: their normal operation relies on several assumptions (the static-scene assumption, the scene-consistency assumption, and the Lambertian assumption), and any condition that violates these assumptions will affect the normal operation of the corresponding method. Technologies based on deep learning can, to different degrees, get rid of the dependence on these prerequisite assumptions and can acquire ego-motion and depth information simultaneously, which improves convenience of use. However, the camera internal parameters still need to be supplied as input, so the inconvenience caused by camera calibration cannot be completely eliminated.
Therefore, how to solve the problem of limitations that the existing solution needs to perform camera calibration and needs a large amount of supervised learning data is a technical problem to be solved urgently by those skilled in the art.
Disclosure of Invention
In view of this, the present invention provides a method, an apparatus, a device and a storage medium for acquiring camera parameters, which take consecutive frames shot by a monocular camera as input and, without calibrating the camera, output a depth map of each frame, the motion of the camera during shooting and the internal parameters of the camera.
The specific scheme is as follows:
a camera parameter obtaining method comprises the following steps:
collecting original continuous frame images shot by a monocular camera;
constructing a DepthNet model and a MotionNet model; the DepthNet model comprises a network for outputting a single-channel depth map; the MotionNet model comprises a main network used for camera motion prediction and giving a pixel confidence mask and a branch network used for camera internal parameter prediction;
respectively inputting the original continuous frame images into the constructed DepthNet model and the MotionNet model after preprocessing, and performing unsupervised training and hyper-parameter tuning on the DepthNet model and the MotionNet model through a joint loss function;
and processing the image to be detected through the trained DepthNet model and the MotionNet model, and outputting a depth map of each frame of the image to be detected, the motion of the camera, the internal reference of the camera and a pixel confidence mask containing scene motion information.
Preferably, in the method for acquiring camera parameters provided in the embodiment of the present invention, the DepthNet model is composed of a first encoder and a first decoder;
preprocessing the original continuous frame images and inputting the preprocessed original continuous frame images into the DepthNet model for training, wherein the method specifically comprises the following steps:
acquiring a preprocessed three-channel image through the first encoder, and successively encoding the three-channel image into features of multiple granularities;
decoding using the first decoder in conjunction with features of different granularity;
and outputting a single-channel depth map with the same size as the input three-channel image through the first decoder.
Preferably, in the method for acquiring camera parameters provided in the embodiment of the present invention, the three-channel image is successively encoded into features of multiple granularities by the first encoder, and the decoding is performed by using the first decoder in combination with the features of different granularities, which specifically includes:
in the first encoder, a 2D convolution with a convolution kernel size of 7 × 7 is performed, followed by batch normalization and a linear rectification unit, to form a first-stage feature code;
connecting a maximum pooling layer and two first residual modules to form a second-level feature code;
alternately connecting a second residual module and the first residual module to form a third-level feature code, a fourth-level feature code and a fifth-level feature code respectively;
inputting the first level feature encoding, the second level feature encoding, the third level feature encoding, the fourth level feature encoding, and the fifth level feature encoding to the first decoder;
in the first decoder, 2D transposed convolution and 2D convolution are used alternately, the five levels of feature codes are combined step by step, and a softplus activation function is adopted at the output layer.
Preferably, in the above method for acquiring camera parameters provided in the embodiment of the present invention, the backbone network is composed of a second encoder and a second decoder;
inputting the original continuous frame images into the MotionNet model for training after preprocessing, specifically comprising:
acquiring two adjacent preprocessed frame images through the second encoder;
in the second encoder, 7 cascaded 3 × 3 2D convolutional layers are used, one 1 × 1 convolutional layer is connected to the bottleneck portion, the number of output channels is compressed to six, the first three channels output the translation of the camera, and the last three channels output the rotation of the camera;
in the second decoder, two parallel convolution paths are adopted and short-cut connection is used, the convolution output and the output of bilinear interpolation are combined to form the output of a Refine module, a pixel-level confidence mask is output and used for determining whether each pixel participates in calculation or not in the process of calculating a joint loss function, and meanwhile a penalty function is added to the pixel-level confidence mask to prevent training degradation;
and outputting the internal reference matrix of the camera through the branch network connected to the lowest encoder of the backbone network.
Preferably, in the method for acquiring camera parameters provided in the embodiment of the present invention, outputting the internal reference matrix of the camera specifically includes:
in the branch network, multiplying the network predicted value by the width and height of the image to obtain the actual focal length;
adding 0.5 to the network predicted value, and multiplying by the width and height of the image to obtain the pixel coordinate of the principal point;
and (3) the focal length is diagonal to form a diagonal matrix of 2 multiplied by 2, column vectors formed by connecting principal point coordinates are connected, and row vectors are added to form a 3 multiplied by 3 internal reference matrix.
Preferably, in the method for acquiring camera parameters provided in the embodiment of the present invention, the preprocessing the original continuous frame images includes:
adjusting the resolution of the original continuous frame images, and arranging and splicing the original continuous frame images to splice a plurality of triple frame images;
when each triple frame image is input into the DepthNet model, outputting a depth map of each frame image;
and when each triple frame image is input into the MotionNet model, outputting camera motion, internal reference of the camera and a pixel confidence mask between every two adjacent frame images for four times.
Preferably, in the method for acquiring camera parameters provided in the embodiment of the present invention, the joint loss function is calculated by using the following formula:
L_total = a·L_R + b·L_S + c·Λ

wherein L_total is the joint loss function, L_R is the re-projection error function, a is the weight of the re-projection error function, L_S is the depth smoothing loss, b is the weight of the depth smoothing loss, Λ is the regularization penalty function of the pixel confidence mask, and c is the weight of the penalty function.
The embodiment of the present invention further provides a device for acquiring camera parameters, including:
the image collection module is used for collecting original continuous frame images shot by the monocular camera;
the model building module is used for building a DepthNet model and a MotionNet model; the DepthNet model comprises a network for outputting a single-channel depth map; the MotionNet model comprises a main network used for camera motion prediction and giving a pixel confidence mask and a branch network used for camera internal parameter prediction;
the model training module is used for respectively inputting the original continuous frame images into the constructed DepthNet model and the MotionNet model after preprocessing, performing unsupervised training on the DepthNet model and the MotionNet model through a joint loss function, and performing hyper-parameter tuning;
and the model prediction module is used for processing the image to be detected through the trained DepthNet model and the MotionNet model, and outputting a depth map of each frame of the image to be detected, the motion of the camera, the internal parameters of the camera and a pixel confidence mask containing scene motion information.
The embodiment of the present invention further provides a device for acquiring camera parameters, which includes a processor and a memory, wherein the processor implements the method for acquiring camera parameters provided in the embodiment of the present invention when executing the computer program stored in the memory.
The embodiment of the present invention further provides a computer-readable storage medium for storing a computer program, where the computer program is executed by a processor to implement the method for acquiring the camera parameter provided in the embodiment of the present invention.
It can be seen from the foregoing technical solutions that, the method, apparatus, device and storage medium for acquiring camera parameters provided by the present invention include: collecting original continuous frame images shot by a monocular camera; constructing a DepthNet model and a MotionNet model; the DepthNet model comprises a network for outputting a single-channel depth map; the MotionNet model comprises a main network used for camera motion prediction and giving a pixel confidence mask and a branch network used for camera internal parameter prediction; respectively inputting the original continuous frame images into a constructed DepthNet model and a MotionNet model after preprocessing, and performing unsupervised training and super-parameter tuning on the DepthNet model and the MotionNet model through a joint loss function; and processing the image to be detected through the trained DepthNet model and MotionNet model, and outputting a depth map of each frame of image to be detected, the motion of the camera, the internal parameters of the camera and a pixel confidence mask containing scene motion information.
The method does not need to calibrate the camera and places no additional limitation on the use scene: by directly inputting any video shot by a monocular camera, the motion trajectory of the camera, the depth map of each frame and the internal parameters of the camera during shooting can be acquired, and unsupervised learning with the joint loss function ensures normal training even when the camera internal parameters are unknown. In addition, a front-end solution with fewer constraints is provided for computer vision applications requiring camera internal parameters, camera motion and a depth map, which has good application value.
Drawings
In order to more clearly illustrate the embodiments of the present invention or technical solutions in related arts, the drawings used in the description of the embodiments or related arts will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a method for acquiring camera parameters according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a first encoder in a DepthNet model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a first residual module in a first encoder according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a second residual module in the first encoder according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a first decoder in the DepthNet model according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a backbone network in a MotionNet model according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a Refine module in a backbone network according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a branch network in the MotionNet model according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of an arrangement of training data according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an apparatus for acquiring camera parameters according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a method for acquiring camera parameters, which comprises the following steps as shown in figure 1:
s101, collecting original continuous frame images shot by a monocular camera; it should be noted that, the collected original continuous frame images can be extracted from the KITTI data set;
s102, constructing a DepthNet model and a MotionNet model; the DepthNet model comprises a network for outputting a single-channel depth map; the MotionNet model comprises a main network used for predicting the motion of the camera and giving a pixel confidence mask and a branch network used for predicting the internal parameters of the camera; it should be noted that the function of the internal reference prediction enables the present invention to extract accurate camera motion and depth information from any video from unknown sources without performing camera calibration;
s103, respectively inputting the original continuous frame images into a constructed DepthNet model and a MotionNet model after preprocessing, carrying out unsupervised training on the DepthNet model and the MotionNet model through a joint loss function, and carrying out super-parameter tuning; it should be noted that, the joint loss function is composed of a re-projection error, a depth smoothing loss and a regularization penalty function of a pixel confidence mask, and the joint loss function uses the relationship between continuous frames shot by a monocular camera as a source of a supervision signal, provides training power for a depth model, and is a key for realizing unsupervised learning; in addition, the model provided by the invention mainly comprises the learning rate, the loss function weight and the batch size, and the super parameters need to be adjusted and optimized in order to obtain the optimal combination;
and S104, processing the image to be detected through the trained DepthNet model and the MotionNet model, and outputting a depth map of each frame of image to be detected, the motion of the camera, the internal reference of the camera and a pixel confidence mask containing scene motion information.
In the method for acquiring the camera parameters provided by the embodiment of the invention, the DepthNet model and the MotionNet model are unsupervised deep learning models, a camera does not need to be calibrated, no extra limitation is imposed on a use scene, any video shot by a monocular camera is directly input, a camera motion track, a depth map of each frame and camera internal parameters during shooting can be acquired, and unsupervised learning is performed by using a joint loss function under the condition that the camera internal parameters are unknown, so that normal training can be ensured; in addition, a front-end solution with less constraint is provided for computer vision application requiring camera internal reference, camera motion and a depth map, and the method has good application value.
In practical application, the model can be implemented with PyTorch and trained on a deep learning workstation whose CPU side comprises two Intel Xeon E5-2678 v3 processors with 64 GB of main memory, together with four NVIDIA GeForce GTX 1080 Ti graphics cards, each with 12 GB of video memory. The invention performs parallel optimization on this machine and in particular sets the number of epochs to a multiple of 4. In the data reading stage, the two CPUs each load half of the data and store it in their corresponding main memory; because each CPU is directly connected to two of the machine's four graphics cards through PCI-E channels, having the two CPUs load data separately makes maximum use of the bandwidth of each PCI-E channel, which helps reduce the data transmission time. After the data is transferred from main memory to video memory, the four graphics cards each start their own gradient computation; when all four have consumed their data, the synchronization point of the program is reached, the respective gradient information is reported to the CPUs, the CPUs aggregate the gradients and update the model, and the next cycle begins. The final effect is that while the GPUs perform gradient computation, the CPUs read and prepare data, thereby reducing GPU idle time as much as possible and improving the overall operating efficiency.
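As an illustrative sketch only, the data-parallel scheme described above can be approximated with PyTorch's built-in utilities; the device indices, worker count and loader settings below are assumptions of this sketch, and the embodiment's custom per-CPU data loading is not reproduced here:

```python
import torch.nn as nn
from torch.utils.data import DataLoader

def wrap_for_multi_gpu(model, dataset, batch_size):
    # Replicate the model across the four graphics cards; DataParallel splits each
    # batch over the replicas and gathers the results for a single optimizer step.
    model = nn.DataParallel(model.cuda(), device_ids=[0, 1, 2, 3])
    # CPU worker processes keep preparing the next batches while the GPUs compute,
    # which reduces GPU idle time as described above.
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                        num_workers=8, pin_memory=True)
    return model, loader
```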
In specific implementation, in the method for acquiring camera parameters provided in the embodiment of the present invention, the DepthNet model may be formed by a first encoder and a first decoder; the input of the DepthNet model is a three-channel picture shot by a monocular camera, and a single-channel depth map with the same size as the input is output through an encoder-decoder structure. This corresponds to encoding every pixel of the input image. In addition, because the complexity of the model is large, in order to ensure effective transfer of the gradient and enable the deep network to receive good training, the model adopts a large number of Residual Building Blocks.
In step S103, the original continuous frame image is input into the DepthNet model after being preprocessed, and the training may specifically include: firstly, acquiring a preprocessed three-channel image through a first encoder, and successively encoding the three-channel image into characteristics of various granularities; then, a first decoder is used for decoding by combining the characteristics with different granularities; and finally, outputting the single-channel depth map with the same size as the input three-channel image size through the first decoder. As shown in fig. 2, the first encoder may output features of five granularities.
Further, in specific implementation, the three-channel image is successively encoded into features of multiple granularities through the first encoder in the above steps, and the decoding is performed by using the first decoder in conjunction with the features of different granularities, which specifically includes: in the first encoder, a 2D convolution with a convolution kernel size of 7 × 7 is first carried out, followed by batch normalization and a linear rectification unit, forming a first-stage feature code; then a maximum pooling layer and two first residual modules (residual_block_A) are connected to form a second-level feature code; finally, a second residual module (residual_block_B) and the first residual module are connected alternately to respectively form a third-level feature code, a fourth-level feature code and a fifth-level feature code; next, the first-level, second-level, third-level, fourth-level and fifth-level feature codes are input to the first decoder; in the first decoder, 2D transposed convolution and 2D convolution are used alternately, the five levels of feature codes are combined step by step, and the output layer adopts a softplus activation function for output.
It should be noted that the first encoder includes two residual blocks, residual_block_A and residual_block_B. As shown in fig. 3, residual_block_A is mainly composed of two 3 × 3 convolutional layers, and the number of output channels of the two convolutional layers is equal to the number of input channels of the residual module, so residual_block_A does not change the number of channels of the tensor. The portion consisting of the two consecutive convolutional layers in the residual block is called the main branch, and the branch path that leads directly from the input to the output of the main branch is called the short-cut. The main branch of residual_block_B is similar to that of residual_block_A, but its short-cut contains some conditional judgment logic; as shown in fig. 4, when the number of input channels is not equal to the number of output channels, the short-cut performs a 1 × 1 convolution on the input tensor, and this layer of convolution adjusts the numbers of input and output channels to be consistent; when the number of input channels is equal to the number of output channels, the stride is further examined. When the stride is 1, the short-cut is the input tensor itself; when the stride is not 1, the dimensions of the output tensor are not equal to those of the input tensor, and in order to compensate for the dimension difference, a maximum pooling layer is applied to the input tensor. As shown in fig. 2, out_channels and stride are supplied from outside the module, and the three "out" branches represent the outputs in the three cases, respectively.
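A minimal PyTorch sketch of the two residual modules described above is given below for illustration; the batch-normalization layers and ReLU activations are assumptions of this sketch, while the main-branch and short-cut logic follows the description:

```python
import torch.nn as nn

class ResidualBlockA(nn.Module):
    """residual_block_A: two 3x3 convolutions whose output channel count equals the
    input channel count, so the short-cut is the identity and the tensor shape is kept."""
    def __init__(self, channels):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.main(x) + x)

class ResidualBlockB(nn.Module):
    """residual_block_B: same main branch, but the short-cut adapts to the
    out_channels/stride arguments supplied from outside the module."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1),
            nn.BatchNorm2d(out_channels))
        if in_channels != out_channels:
            # channel counts differ: a 1x1 convolution aligns channels (and stride)
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1, stride=stride)
        elif stride != 1:
            # equal channels but stride != 1: max pooling compensates the size difference
            self.shortcut = nn.MaxPool2d(kernel_size=stride, stride=stride)
        else:
            # equal channels, stride 1: the short-cut is the input tensor itself
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.main(x) + self.shortcut(x))
```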
As shown in fig. 5, the first decoder receives the five levels of feature encoding of the first encoder as input, combines them by alternately using 2D transposed convolution and 2D convolution, and finally outputs a single-channel depth map whose size equals the encoder input by using a softplus activation function in the output layer. concat_and_pad in fig. 5 is a composite operation that first concatenates the output of the 2D transposed convolution with the next-stage encoder output along the third dimension, then performs a padding operation and inputs the result into the subsequent 2D convolution. The final output depth map has a size of [B, h, w, 1], where B represents the batch size, h and w represent the height and width of the picture, and 1 indicates that the depth map has a single channel.
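Building on the residual blocks sketched above, the overall encoder-decoder of the DepthNet model can be illustrated as follows; the channel widths and the use of a plain concatenation in place of the concat_and_pad operation (the feature sizes in this sketch already align) are assumptions made for illustration only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # ----- first encoder: five levels of progressively coarser features -----
        self.level1 = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=2, padding=3),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True))                  # 1/2 resolution
        self.level2 = nn.Sequential(
            nn.MaxPool2d(3, stride=2, padding=1),
            ResidualBlockA(32), ResidualBlockA(32))                     # 1/4
        self.level3 = nn.Sequential(
            ResidualBlockB(32, 64, stride=2), ResidualBlockA(64))       # 1/8
        self.level4 = nn.Sequential(
            ResidualBlockB(64, 128, stride=2), ResidualBlockA(128))     # 1/16
        self.level5 = nn.Sequential(
            ResidualBlockB(128, 256, stride=2), ResidualBlockA(256))    # 1/32
        # ----- first decoder: alternate 2D transposed convolution and 2D convolution -----
        self.up4 = nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1)
        self.dec4 = nn.Conv2d(128 + 128, 128, 3, padding=1)
        self.up3 = nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1)
        self.dec3 = nn.Conv2d(64 + 64, 64, 3, padding=1)
        self.up2 = nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1)
        self.dec2 = nn.Conv2d(32 + 32, 32, 3, padding=1)
        self.up1 = nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1)
        self.dec1 = nn.Conv2d(16 + 32, 16, 3, padding=1)
        self.up0 = nn.ConvTranspose2d(16, 16, 3, stride=2, padding=1, output_padding=1)
        self.out = nn.Conv2d(16, 1, 3, padding=1)                       # single-channel output

    def forward(self, x):                                 # x: [B, 3, h, w]
        f1 = self.level1(x)
        f2 = self.level2(f1)
        f3 = self.level3(f2)
        f4 = self.level4(f3)
        f5 = self.level5(f4)
        # merge the five levels of feature codes step by step
        d = torch.relu(self.dec4(torch.cat([self.up4(f5), f4], dim=1)))
        d = torch.relu(self.dec3(torch.cat([self.up3(d), f3], dim=1)))
        d = torch.relu(self.dec2(torch.cat([self.up2(d), f2], dim=1)))
        d = torch.relu(self.dec1(torch.cat([self.up1(d), f1], dim=1)))
        return F.softplus(self.out(self.up0(d)))          # depth map, [B, 1, h, w]
```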
In a specific implementation, in the method for acquiring camera parameters provided in the embodiment of the present invention, the backbone network in the MotionNet model may be formed by a second encoder and a second decoder.
In step S103, the original continuous frame images are input into the MotionNet model after being preprocessed, and the training may specifically include: firstly, acquiring two adjacent frames of preprocessed images through the second encoder; then, as shown in fig. 6, using in the second encoder 7 cascaded 2D convolutional layers with a convolution kernel size of 3 × 3, connecting one 1 × 1 convolutional layer to the bottleneck portion of the second encoder and compressing the number of output channels to six, where the first three channels output the translation of the camera and the last three channels output the rotation of the camera; then, in the second decoder, adopting two parallel convolution paths and a short-cut connection similar to that of a residual module, combining the convolution output with the output of bilinear interpolation to form the output of a Refine module, and outputting a pixel-level confidence mask used to determine whether each pixel participates in the calculation of the joint loss function, the excluded pixels being those that cannot take part in the reprojection loss because of scene translation, rotation, occlusion and other factors, while a penalty function is added to the pixel-level confidence mask to prevent training degradation; and finally, outputting the internal reference matrix of the camera through the branch network connected to the bottommost encoder of the backbone network.
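A sketch of the MotionNet backbone encoder under the above description is given below; the channel progression, the stacking of the two input frames on the channel axis and the global averaging of the six-channel bottleneck output are assumptions of this sketch:

```python
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Seven cascaded 3x3 convolutions followed by a 1x1 bottleneck convolution that
    compresses the output to six channels: the first three give the camera translation,
    the last three give the camera rotation."""
    def __init__(self):
        super().__init__()
        widths = [16, 32, 64, 128, 256, 512, 1024]       # assumed channel progression
        layers, in_ch = [], 6                            # two RGB frames stacked channel-wise
        for w in widths:
            layers += [nn.Conv2d(in_ch, w, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
            in_ch = w
        self.convs = nn.Sequential(*layers)
        self.bottleneck = nn.Conv2d(in_ch, 6, 1)

    def forward(self, frame_pair):                       # frame_pair: [B, 6, h, w]
        features = self.convs(frame_pair)                # bottleneck features for the branch network
        motion = self.bottleneck(features).mean(dim=(2, 3))   # spatial average -> [B, 6]
        translation, rotation = motion[:, :3], motion[:, 3:]
        return translation, rotation, features
```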
It should be understood that, in the backbone network of MotionNet, the second decoder is composed of Refine modules; as shown in fig. 7, conv_input represents the input from the decoder side, and refine_input represents the output of the previous Refine stage. To resolve the difference in resolution, the invention uses bilinear interpolation to adjust the output size of the previous Refine stage.
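One possible reading of the Refine module is sketched below: the previous stage's output (refine_input) is resized by bilinear interpolation to the size of the encoder-side feature (conv_input), the two are combined, passed through two parallel convolution paths, and added back to the interpolated input through a short-cut; the channel projection on the short-cut is an assumption needed to make the shapes match:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Refine(nn.Module):
    def __init__(self, refine_channels, conv_channels, out_channels):
        super().__init__()
        in_ch = refine_channels + conv_channels
        # two parallel convolution paths
        self.path1 = nn.Sequential(
            nn.Conv2d(in_ch, out_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1))
        self.path2 = nn.Conv2d(in_ch, out_channels, 3, padding=1)
        # short-cut: project the interpolated refine_input to the output channel count
        self.project = nn.Conv2d(refine_channels, out_channels, 1)

    def forward(self, refine_input, conv_input):
        # bilinear interpolation adjusts the previous Refine stage's output size
        resized = F.interpolate(refine_input, size=conv_input.shape[-2:],
                                mode="bilinear", align_corners=False)
        merged = torch.cat([resized, conv_input], dim=1)
        return self.path1(merged) + self.path2(merged) + self.project(resized)
```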
In a specific implementation, in the method for acquiring camera parameters provided in the embodiment of the present invention, outputting the internal reference matrix of the camera specifically includes: in the branch network, multiplying the network predicted value by the width and height of the image to obtain the actual focal lengths; adding 0.5 to the network predicted value and multiplying by the width and height of the image to obtain the pixel coordinates of the principal point; and arranging the focal lengths on the diagonal to form a 2 × 2 diagonal matrix, appending the column vector formed by the principal point coordinates, and adding a row vector to form a 3 × 3 internal reference matrix.
Specifically, as shown in fig. 8, "bottleneck" on the left represents the bottommost feature output by the encoder of the backbone network, with a size of [B, 1, 1024], where B represents the batch size. Two parallel 1 × 1 convolutions are used to predict, respectively, the focal lengths f_x, f_y and the principal point coordinates c_x, c_y of the internal reference matrix; to facilitate network learning, both convolution paths actually predict small numbers. For the focal lengths, what is predicted here is the ratio of the actual focal length to the width and height of the image, so the invention multiplies the network prediction by the width and height of the image. For the principal point coordinates, the ratio of the coordinate values to the width and height of the image is predicted; since the principal point tends to lie near the center of the image, the invention adds 0.5 to the network prediction and then multiplies the result by the width and height to obtain the pixel coordinates of the principal point. Finally, the focal lengths are arranged on the diagonal to form a 2 × 2 diagonal matrix, the column vector formed by the principal point coordinates is appended, and the row vector [0, 0, 1] is added, finally forming a 3 × 3 internal reference matrix.
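The construction of the internal reference matrix described above can be sketched as follows; the softplus used to keep the focal-length ratios positive and the spatial pooling of the bottleneck feature are assumptions of this sketch, while the multiplication by the image width and height, the +0.5 offset for the principal point and the assembly of the 3 × 3 matrix follow the description:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntrinsicsHead(nn.Module):
    def __init__(self, bottleneck_channels=1024):
        super().__init__()
        self.focal = nn.Conv2d(bottleneck_channels, 2, 1)    # predicts the ratios f_x/w, f_y/h
        self.offset = nn.Conv2d(bottleneck_channels, 2, 1)   # predicts principal-point offsets

    def forward(self, bottleneck, image_width, image_height):
        b = bottleneck.shape[0]
        size = torch.tensor([image_width, image_height],
                            dtype=bottleneck.dtype, device=bottleneck.device)
        pooled = bottleneck.mean(dim=(2, 3), keepdim=True)            # one vector per sample
        f = F.softplus(self.focal(pooled)).view(b, 2) * size          # actual focal lengths
        c = (self.offset(pooled).view(b, 2) + 0.5) * size             # principal-point pixel coords
        K = torch.zeros(b, 3, 3, dtype=bottleneck.dtype, device=bottleneck.device)
        K[:, 0, 0], K[:, 1, 1] = f[:, 0], f[:, 1]    # 2x2 diagonal focal block
        K[:, 0, 2], K[:, 1, 2] = c[:, 0], c[:, 1]    # principal-point column
        K[:, 2, 2] = 1.0                             # appended row [0, 0, 1]
        return K                                     # [B, 3, 3] internal reference matrix
```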
In specific implementation, in the method for acquiring camera parameters provided in the embodiment of the present invention, the preprocessing the original continuous frame image in step S103 may include: adjusting the resolution of the original continuous frame images, and arranging and splicing the original continuous frame images to splice a plurality of triple frame images; when each triple frame image is input into the DepthNet model, outputting a depth map of each frame image; when each triple frame image is input into the MotionNet model, the camera motion between every two adjacent frame images, the internal reference of the camera and the pixel confidence mask are output four times.
It should be understood that, with the model and training method adopted by the present invention, each training step of the DepthNet model requires one monocular color image (i.e., a three-channel image) and each training step of the MotionNet model requires two temporally continuous images (i.e., two adjacent frame images). To improve data reading efficiency, the present invention preprocesses the original continuous frame images: as shown in fig. 9, every three consecutive images are spliced into one image, where (a) represents the original continuous frames in the data set and (b) represents the triple frame images spliced together after preprocessing; after such processing, two pairs of adjacent images can be obtained from each triple frame image. To reduce the computational load, the original images are reduced in equal ratio while being spliced, so that the resolution of all single frames can be unified to 416 × 128 and the resolution of a single triple frame image is 1248 × 128.
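The splicing of triple frame images can be illustrated with the following sketch; the use of PIL and the file-handling details are assumptions, while the 416 × 128 per-frame resolution and the 1248 × 128 triple-frame resolution follow the description:

```python
from PIL import Image

FRAME_W, FRAME_H = 416, 128   # per-frame resolution after the equal-ratio reduction

def splice_triple(frame_paths, out_path):
    """Resize three consecutive frames and splice them side by side into one
    1248 x 128 triple frame image."""
    assert len(frame_paths) == 3
    canvas = Image.new("RGB", (3 * FRAME_W, FRAME_H))
    for i, path in enumerate(frame_paths):
        frame = Image.open(path).convert("RGB").resize((FRAME_W, FRAME_H), Image.BILINEAR)
        canvas.paste(frame, (i * FRAME_W, 0))
    canvas.save(out_path)

def split_triple(triple_tensor):
    """Recover the three 416 x 128 frames from a [B, 3, 128, 1248] triple-frame tensor."""
    return (triple_tensor[..., :FRAME_W],
            triple_tensor[..., FRAME_W:2 * FRAME_W],
            triple_tensor[..., 2 * FRAME_W:])
```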
Specifically, the training process is performed in units of triple frames. When a triple frame is read, a depth map of each frame is first generated with the DepthNet model; then the MotionNet model generates the camera motion, the pixel confidence mask and the camera internal parameters from the 1st frame to the 2nd frame, and likewise from the 2nd frame to the 3rd frame, from the 3rd frame to the 2nd frame and from the 2nd frame to the 1st frame, so that four predictions of the camera internal parameters are obtained, and their average is taken as the internal parameters associated with the triple. The joint loss function can then be applied once between every two adjacent frames, i.e., four times in total (1-2, 2-3, 3-2, 2-1), and the four loss values are accumulated as the loss associated with the set of triple frames. During actual training, a batch of triple frames is obtained each time data is read; these are processed in parallel, back-propagation is then performed, and the model is updated.
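For illustration, one training step on a batch of triple frames could be organized as follows; depth_net, motion_net (a wrapper assumed to return rotation, translation, confidence mask and internal reference for a frame pair), joint_loss and split_triple are the components sketched elsewhere in this description, and their exact interfaces are assumptions of this sketch:

```python
import torch

def triplet_training_step(depth_net, motion_net, triple, optimizer, joint_loss):
    f1, f2, f3 = split_triple(triple)                 # the three frames of the triple
    frames = (f1, f2, f3)
    depths = [depth_net(f) for f in frames]           # one depth map per frame

    pairs = [(0, 1), (1, 2), (2, 1), (1, 0)]          # frames 1-2, 2-3, 3-2, 2-1
    motions, masks, intrinsics = [], [], []
    for s, t in pairs:
        rotation, translation, mask, K = motion_net(frames[s], frames[t])
        motions.append((rotation, translation)); masks.append(mask); intrinsics.append(K)

    K_mean = torch.stack(intrinsics).mean(dim=0)      # average of the four intrinsics predictions

    # the joint loss is applied once per adjacent pair and accumulated for the triple
    loss = sum(joint_loss(frames[s], frames[t], depths[s], depths[t],
                          motions[i], masks[i], K_mean)
               for i, (s, t) in enumerate(pairs))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```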
In a specific implementation, in the method for acquiring camera parameters provided in the embodiment of the present invention, the joint loss function may be calculated by using the following formula:
L_total = a·L_R + b·L_S + c·Λ  (1)

wherein L_total is the joint loss function, L_R is the reprojection error function, a is the weight of the reprojection error function, L_S is the depth smoothing loss (also called the L_1 norm of the depth values; the more outliers and sharp points the depth map has, the greater the smoothing loss), b is the weight of the depth smoothing loss, Λ is the regularized penalty function of the pixel confidence mask, and c is the weight of the penalty function.
L_R in formula (1) is:

L_R = ∑_(i,j) M(i, j)·|φ(Î_s)(i, j) - I_s(i, j)|  (2)

wherein i, j are pixel coordinates, M(i, j) is the pixel confidence mask, representing the confidence of the pixel at (i, j), the φ function represents bilinear interpolation, Î_s is the re-projected view, and I_s(i, j) is the real view.
The reprojection used in formula (1) is:

D_s(p_s)·p_s = K·(R·D_t(p_t)·K⁻¹·p_t + t)  (3)

where K is the camera internal reference matrix, R is the camera rotation matrix, t is the camera translation vector, p_s is the pixel coordinate after re-projection, p_t is the pixel coordinate before re-projection, D_s(p_s) is the depth value corresponding to the pixel coordinate p_s after reprojection, and D_t(p_t) is the depth value corresponding to the pixel coordinate p_t before re-projection.
Λ in formula (1) is:

Λ = mean_(i,j) H(i, j)  (4)

meaning that all H(i, j) are averaged. H(i, j) is defined as:
H(i, j) = -∑_(i,j) M(i, j)·log(S(i, j))  (5)
meaning that cross entropy is taken on S(i, j), which is defined as:
The penalty function Λ prevents the network from predicting all pixel confidence masks as "untrusted" (i.e., M(i, j) taking the value 0 at every pixel location (i, j)). In that case the most dominant part of the loss function, L_R, would be exactly 0; because deep learning tends to minimize the loss function, training without a penalty function easily falls into such an "all untrusted" state, in which the loss function is small but has no practical significance. The penalty function Λ takes a larger value the more "untrustworthy" pixels there are in the confidence mask.
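A sketch of the joint loss combining the three terms above is given below; the reprojection/warping step producing the warped view (formula (3)) is assumed to have been carried out separately, the weights a, b, c are placeholders, and interpreting the mask regularization as a cross entropy against an all-ones target is an assumption of this sketch:

```python
import torch
import torch.nn.functional as F

def joint_loss_terms(warped, target, mask, depth, a=1.0, b=0.05, c=0.01):
    # masked photometric reprojection error L_R (formula (2))
    reproj = (mask * (warped - target).abs()).mean()
    # depth smoothing loss: L1 norm of the depth map's spatial gradients
    smooth = (depth[..., :, 1:] - depth[..., :, :-1]).abs().mean() + \
             (depth[..., 1:, :] - depth[..., :-1, :]).abs().mean()
    # regularization penalty: grows as more pixels are marked "untrusted" (mask -> 0)
    penalty = F.binary_cross_entropy(mask.clamp(1e-6, 1 - 1e-6), torch.ones_like(mask))
    return a * reproj + b * smooth + c * penalty
```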
Based on the same inventive concept, embodiments of the present invention further provide a device for acquiring camera parameters, and since the principle of the device for acquiring camera parameters to solve the problem is similar to the method for acquiring camera parameters, the implementation of the device for acquiring camera parameters may refer to the implementation of the method for acquiring camera parameters, and repeated details are not repeated.
In specific implementation, the apparatus for acquiring camera parameters provided in the embodiment of the present invention, as shown in fig. 10, specifically includes:
the image collecting module 11 is used for collecting original continuous frame images shot by the monocular camera;
the model building module 12 is used for building a DepthNet model and a MotionNet model; the DepthNet model comprises a network for outputting a single-channel depth map; the MotionNet model comprises a main network used for camera motion prediction and giving a pixel confidence mask and a branch network used for camera internal parameter prediction;
the model training module 13 is configured to input the original continuous frame images after being preprocessed into the DepthNet model and the MotionNet model that are constructed, perform unsupervised training on the DepthNet model and the MotionNet model through a joint loss function, and perform hyper-parameter tuning;
and the model prediction module 14 is configured to process the image to be measured through the trained DepthNet model and the trained MotionNet model, and output a depth map of each frame of the image to be measured, motion of the camera, internal parameters of the camera, and a pixel confidence mask including scene motion information.
In the device for acquiring camera parameters provided by the embodiment of the invention, through the interaction of the four modules, important data including the camera internal parameters and depth maps can be acquired simply by recording video while the camera undergoes translational and rotational motion, without calibrating the camera or using global information; the device can serve as a front-end algorithm for other computer vision applications, places few restrictions on the use scene, and is convenient to apply.
For more specific working processes of the modules, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
Correspondingly, the embodiment of the invention also discloses equipment for acquiring the camera parameters, which comprises a processor and a memory; the method for acquiring the camera parameters disclosed in the foregoing embodiments is implemented when the processor executes the computer program stored in the memory.
For more specific processes of the above method, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
Further, the present invention also discloses a computer readable storage medium for storing a computer program; the computer program when executed by a processor implements the method of acquiring camera parameters disclosed in the foregoing.
For more specific processes of the above method, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device, the equipment and the storage medium disclosed by the embodiment correspond to the method disclosed by the embodiment, so that the description is relatively simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
To sum up, a method, an apparatus, a device and a storage medium for acquiring camera parameters provided in the embodiments of the present invention include: collecting original continuous frame images shot by a monocular camera; constructing a DepthNet model and a MotionNet model; the DepthNet model comprises a network for outputting a single-channel depth map; the MotionNet model comprises a main network used for predicting the motion of the camera and giving a pixel confidence mask and a branch network used for predicting the internal parameters of the camera; respectively inputting the original continuous frame images after preprocessing into the constructed DepthNet model and MotionNet model, and performing unsupervised training and hyper-parameter tuning on the DepthNet model and the MotionNet model through a joint loss function; and processing the images to be detected through the trained DepthNet model and the trained MotionNet model, and outputting a depth map of each frame of image to be detected, the motion of the camera, the internal parameters of the camera and a pixel confidence mask containing scene motion information. The method does not need to calibrate the camera, has no additional limitation on the use scene, can acquire the motion trajectory of the camera, the depth map of each frame and the internal parameters of the camera during shooting by directly inputting any video shot by the monocular camera, and can ensure normal training by using the joint loss function to perform unsupervised learning under the condition that the internal parameters of the camera are unknown; in addition, a front-end solution with fewer constraints is provided for computer vision applications requiring camera internal parameters, camera motion and a depth map, and the method has good application value.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The method, the apparatus, the device and the storage medium for acquiring camera parameters provided by the present invention are described in detail above, and a specific example is applied in the present disclosure to explain the principle and the implementation of the present invention, and the description of the above embodiment is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
Claims (8)
1. A method for acquiring camera parameters is characterized by comprising the following steps:
collecting original continuous frame images shot by a monocular camera;
constructing a DepthNet model and a MotionNet model; the DepthNet model comprises a network for outputting a single-channel depth map; the MotionNet model comprises a main network used for camera motion prediction and giving a pixel confidence mask and a branch network used for camera internal parameter prediction;
respectively inputting the original continuous frame images into the constructed DepthNet model and the MotionNet model after preprocessing, and performing unsupervised training and hyper-parameter tuning on the DepthNet model and the MotionNet model through a joint loss function;
processing images to be detected through the trained DepthNet model and the trained MotionNet model, and outputting a depth map of each frame of the images to be detected, the motion of the camera, the internal reference of the camera and a pixel confidence mask containing scene motion information;
the DepthNet model is composed of a first encoder and a first decoder;
preprocessing the original continuous frame images and inputting the preprocessed original continuous frame images into the DepthNet model for training, wherein the method specifically comprises the following steps:
acquiring a preprocessed three-channel image through the first encoder, and successively encoding the three-channel image into features of multiple granularities;
decoding using the first decoder in conjunction with features of different granularity;
outputting, by the first decoder, a single-channel depth map of the same size as the input three-channel image;
the backbone network is composed of a second encoder and a second decoder;
inputting the original continuous frame images into the MotionNet model for training after preprocessing, specifically comprising:
acquiring two adjacent preprocessed frame images through the second encoder;
in the second encoder, 7 cascaded 3 × 3 2D convolutional layers are used, one 1 × 1 convolutional layer is connected to the bottleneck portion, the number of output channels is compressed to six, the first three channels output the translation of the camera, and the last three channels output the rotation of the camera;
in the second decoder, two parallel convolution paths are adopted and short-cut connection is used, the convolution output and the output of bilinear interpolation are combined to form the output of a Refine module, a pixel-level confidence mask is output and used for determining whether each pixel participates in calculation when a joint loss function is calculated, and meanwhile, a penalty function is added to the pixel confidence mask and used for preventing training degradation;
and outputting the internal reference matrix of the camera through the branch network connected to the lowest encoder of the backbone network.
2. The method according to claim 1, wherein the three-channel image is successively encoded into features of multiple granularities by the first encoder, and the first decoder is used to perform decoding in conjunction with the features of different granularities, specifically comprising:
in the first encoder, a 2D convolution with a convolution kernel size of 7 × 7 is carried out, followed by a batch normalization and linear rectification unit, to form a first-stage feature code;
connecting a maximum pooling layer and two first residual modules to form a second-level feature code;
alternately connecting a second residual module and the first residual module to form a third-level feature code, a fourth-level feature code and a fifth-level feature code respectively;
inputting the first level feature encoding, the second level feature encoding, the third level feature encoding, the fourth level feature encoding, and the fifth level feature encoding to the first decoder;
in the first decoder, 2D transposed convolution and 2D convolution are used alternately, the five levels of feature codes are combined step by step, and a softplus activation function is adopted at the output layer.
3. The method for acquiring camera parameters according to claim 1, wherein outputting the internal reference matrix of the camera specifically includes:
in the branch network, multiplying the network predicted value by the width and height of the image to obtain the actual focal length;
adding 0.5 to the network predicted value, and multiplying by the width and height of the image to obtain the pixel coordinate of the principal point;
and (3) the focal length is diagonal to form a diagonal matrix of 2 multiplied by 2, column vectors formed by connecting principal point coordinates are connected, and row vectors are added to form a 3 multiplied by 3 internal reference matrix.
4. The method for acquiring camera parameters according to claim 1, wherein the preprocessing of the original continuous frame images comprises:
adjusting the resolution of the original continuous frame images, and arranging and splicing the original continuous frame images to be spliced into a plurality of triple frame images;
when each triple frame image is input into the DepthNet model, outputting a depth map of each frame image;
and when each triple frame image is input into the MotionNet model, outputting four times of camera motion between every two adjacent frame images, and internal reference and pixel confidence mask of the camera.
5. The method of claim 1, wherein the joint loss function is calculated by using the following formula:

L_total = a·L_R + b·L_S + c·Λ

wherein L_total is the joint loss function, L_R is a reprojection error function, a is a weight of the reprojection error function, L_S is the depth smoothing loss, b is the weight of the depth smoothing loss, Λ is the regularization penalty function of the pixel confidence mask, and c is the weight of the penalty function.
6. An apparatus for acquiring camera parameters, comprising:
the image collection module is used for collecting original continuous frame images shot by the monocular camera;
the model building module is used for building a DepthNet model and a MotionNet model; the DepthNet model comprises a network for outputting a single-channel depth map; the MotionNet model comprises a main network used for camera motion prediction and giving a pixel confidence mask and a branch network used for camera internal parameter prediction;
the model training module is used for respectively inputting the original continuous frame images into the constructed DepthNet model and the MotionNet model after preprocessing, performing unsupervised training on the DepthNet model and the MotionNet model through a joint loss function, and performing hyper-parameter tuning;
the DepthNet model is composed of a first encoder and a first decoder;
preprocessing the original continuous frame images and inputting the preprocessed original continuous frame images into the DepthNet model for training, wherein the method specifically comprises the following steps:
acquiring a preprocessed three-channel image through the first encoder, and successively encoding the three-channel image into features of multiple granularities;
decoding using the first decoder in conjunction with features of different granularity;
outputting, by the first decoder, a single-channel depth map of the same size as the input three-channel image;
the backbone network is composed of a second encoder and a second decoder;
inputting the original continuous frame images into the MotionNet model for training after preprocessing, and specifically comprising the following steps:
acquiring two adjacent preprocessed frame images through the second encoder;
in the second encoder, 7 cascaded 3 × 3 2D convolutional layers are used, one 1 × 1 convolutional layer is connected to the bottleneck portion, the number of output channels is compressed to six, the first three channels output the translation of the camera, and the last three channels output the rotation of the camera;
in the second decoder, two parallel convolution paths are adopted and short-cut connection is used, the convolution output and the output of bilinear interpolation are combined to form the output of a Refine module, a pixel confidence mask is output and used for determining whether each pixel participates in calculation when a joint loss function is calculated, and meanwhile, a penalty function is added to the pixel confidence mask and used for preventing training degradation;
outputting an internal reference matrix of a camera through the branch network connected to a bottommost encoder of the backbone network;
and the model prediction module is used for processing the image to be measured through the trained DepthNet model and the MotionNet model and outputting a depth map of each frame of the image to be measured, the motion of the camera, the internal parameters of the camera and a pixel confidence mask containing scene motion information.
7. An apparatus for acquiring camera parameters, comprising a processor and a memory, wherein the processor implements the method for acquiring camera parameters according to any one of claims 1 to 5 when executing a computer program stored in the memory.
8. A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the method of acquiring camera parameters according to any one of claims 1 to 5.
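For orientation only, the model prediction module of claim 6 corresponds at inference time to running the trained networks on adjacent frames of an uncalibrated video. The sketch below assumes a DepthNet-style callable that maps a three-channel frame to a single-channel depth map of the same size, and a MotionNet-style callable with the interface sketched above; the function and argument names are hypothetical.

```python
# Hedged sketch of inference with trained DepthNet/MotionNet models (assumed interfaces).
import torch

@torch.no_grad()
def predict_frame_pair(depth_net, motion_net, frame_t, frame_t1):
    """frame_t, frame_t1: preprocessed adjacent frames, each of shape (B, 3, H, W)."""
    depth = depth_net(frame_t)        # (B, 1, H, W) depth map of the current frame
    # MotionNet consumes the two frames stacked along the channel dimension and returns
    # camera translation, rotation and the internal parameter matrix K (assumed API); the
    # pixel confidence mask produced by its decoder is not shown in this sketch.
    translation, rotation, k = motion_net(torch.cat([frame_t, frame_t1], dim=1))
    return depth, translation, rotation, k
```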
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010387692.5A CN111583345B (en) | 2020-05-09 | 2020-05-09 | Method, device and equipment for acquiring camera parameters and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111583345A CN111583345A (en) | 2020-08-25 |
CN111583345B true CN111583345B (en) | 2022-09-27 |
Family
ID=72117146
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010387692.5A Active CN111583345B (en) | 2020-05-09 | 2020-05-09 | Method, device and equipment for acquiring camera parameters and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111583345B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114531580B (en) * | 2020-11-23 | 2023-11-21 | 北京四维图新科技股份有限公司 | Image processing method and device |
CN112606000B (en) * | 2020-12-22 | 2022-11-18 | 上海有个机器人有限公司 | Method for automatically calibrating robot sensor parameters, calibration room, equipment and computer medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017132766A1 (en) * | 2016-02-03 | 2017-08-10 | Sportlogiq Inc. | Systems and methods for automated camera calibration |
CN106157307B (en) * | 2016-06-27 | 2018-09-11 | 浙江工商大学 | Monocular image depth estimation method based on multi-scale CNN and continuous CRF
- 2020
  - 2020-05-09 CN CN202010387692.5A patent/CN111583345B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108665496A (en) * | 2018-03-21 | 2018-10-16 | 浙江大学 | End-to-end semantic simultaneous localization and mapping (SLAM) method based on deep learning
CN109191491A (en) * | 2018-08-03 | 2019-01-11 | 华中科技大学 | Target tracking method and system based on a fully convolutional Siamese network with multi-layer feature fusion
CN110009674A (en) * | 2019-04-01 | 2019-07-12 | 厦门大学 | Real-time monocular image depth-of-field computation method based on unsupervised deep learning
CN110148179A (en) * | 2019-04-19 | 2019-08-20 | 北京地平线机器人技术研发有限公司 | Method, device and medium for training a neural network model for estimating image disparity maps
CN110503680A (en) * | 2019-08-29 | 2019-11-26 | 大连海事大学 | Monocular scene depth estimation method based on unsupervised convolutional neural networks
CN110738697A (en) * | 2019-10-10 | 2020-01-31 | 福州大学 | Monocular depth estimation method based on deep learning |
Non-Patent Citations (4)
Title |
---|
Review of light field camera imaging models and parameter calibration methods; Zhang Chunping et al.; Chinese Journal of Lasers; 2016-06-10 (No. 06); 270-281 *
Research on camera pose estimation algorithms based on unsupervised learning; Wu Yantong; China Masters' Theses Full-text Database (Information Science and Technology); 2019-08-15; I138-927 *
Camera pose estimation method for dynamic scenes based on deep learning; Lu Hao et al.; High Technology Letters; 2020-01-15 (No. 01); 41-47 *
Image translation algorithm for haze scenes based on generative adversarial networks; Xiao Jinsheng et al.; Chinese Journal of Computers; 2019-09-11 (No. 01); 165-176 *
Also Published As
Publication number | Publication date |
---|---|
CN111583345A (en) | 2020-08-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111047516B (en) | Image processing method, image processing device, computer equipment and storage medium | |
CN111402130B (en) | Data processing method and data processing device | |
CN109271933B (en) | Method for estimating three-dimensional human body posture based on video stream | |
CN112308200B (en) | Searching method and device for neural network | |
CN112862689B (en) | Image super-resolution reconstruction method and system | |
CN110751649B (en) | Video quality evaluation method and device, electronic equipment and storage medium | |
CN111263161B (en) | Video compression processing method and device, storage medium and electronic equipment | |
US20220414838A1 (en) | Image dehazing method and system based on cyclegan | |
CN111192226A (en) | Image fusion denoising method, device and system | |
CN111583345B (en) | Method, device and equipment for acquiring camera parameters and storage medium | |
CN115731505B (en) | Video salient region detection method and device, electronic equipment and storage medium | |
CN113850231A (en) | Infrared image conversion training method, device, equipment and storage medium | |
CN114842400A (en) | Video frame generation method and system based on residual block and feature pyramid | |
CN115546162A (en) | Virtual reality image quality evaluation method and system | |
CN115661403A (en) | Explicit radiation field processing method, device and storage medium | |
CN115115540A (en) | Unsupervised low-light image enhancement method and unsupervised low-light image enhancement device based on illumination information guidance | |
CN114885112B (en) | High-frame-rate video generation method and device based on data fusion | |
CN115565039A (en) | Monocular input dynamic scene new view synthesis method based on self-attention mechanism | |
CN117391995B (en) | Progressive face image restoration method, system, equipment and storage medium | |
CN112541972A (en) | Viewpoint image processing method and related equipment | |
CN112396674B (en) | Rapid event image filling method and system based on lightweight generation countermeasure network | |
CN111726621B (en) | Video conversion method and device | |
CN117576179A (en) | Mine image monocular depth estimation method with multi-scale detail characteristic enhancement | |
CN117314750A (en) | Image super-resolution reconstruction method based on residual error generation network | |
CN117274446A (en) | Scene video processing method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||