CN107481279B - Monocular video depth map calculation method - Google Patents


Info

Publication number
CN107481279B
CN107481279B (application number CN201710351600.6A)
Authority
CN
China
Prior art keywords
calculating
matching
adopted
rotation matrix
translation vector
Prior art date
Legal status
Active
Application number
CN201710351600.6A
Other languages
Chinese (zh)
Other versions
CN107481279A (en)
Inventor
曹治国
张润泽
肖阳
鲜可
杨佳琪
李然
赵富荣
李睿博
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201710351600.6A priority Critical patent/CN107481279B/en
Publication of CN107481279A publication Critical patent/CN107481279A/en
Application granted granted Critical
Publication of CN107481279B publication Critical patent/CN107481279B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monocular video depth map calculation method, characterized by comprising the steps of: decomposing the video to be restored into pictures by frame; extracting the feature points of each frame; matching the feature points to form feature point tracks; calculating a global rotation matrix and a global translation vector; optimizing the camera parameters; computing a dense optical flow for the selected frame; and calculating the depth values of the selected frame to obtain its depth map. The method adopts a physically based structure-from-motion (SFM) depth estimation approach and uses dense optical flow for matching. It requires no training samples, does not rely on optimization schemes such as segmentation or plane fitting, and has a small computational cost. At the same time, it addresses the prior-art problem that, in going from sparse to dense scene reconstruction, the depth values of all pixel points cannot be obtained, especially in texture-less regions, thereby improving computational efficiency while ensuring the accuracy of the depth map.

Description

Monocular video depth map calculation method
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a monocular video depth map calculation method.
Background
With the development of science and technology, 3D movies and virtual reality are enriching the lives of people. However, both 3D movies already popular worldwide and virtual reality currently being enjoyed face a serious problem, namely, the lack of 3D resources. In the prior art, the depth is mainly predicted through a monocular video, and then a binocular stereo video is obtained through viewpoint synthesis, which is also a main method for solving the problem of the lack of 3D resources at present.
In this technical approach, depth estimation from monocular video is an important component and is receiving increasing attention. The mainstream monocular depth prediction methods currently include depth estimation based on deep learning, depth estimation based on physically grounded structure from motion (SFM), and optical flow methods. Depth estimation based on deep learning depends heavily on a database and is generally supervised, i.e. a large number of training samples is needed to ensure an accurate result, and obtaining such samples is costly. Physically based structure-from-motion (SFM) methods are by now quite accurate in recovering the camera position and strike a good balance between efficiency and accuracy. Their difficulty, however, has been that in going from sparse to dense scene reconstruction the depth values of all pixel points cannot be obtained, especially in texture-less regions. In contrast, optical flow methods can reach precise sub-pixel accuracy.
Disclosure of Invention
In view of the above-mentioned deficiencies or needs in the art, the present invention provides a monocular video depth map calculation method. Because training samples for supervised learning are not easy to obtain, the method of the technical scheme of the invention calculates monocular video depth values with a physically based approach; at the same time, dense optical flow is introduced for matching, so that the depth map of a video frame can be calculated accurately.
To achieve the above object, according to one aspect of the present invention, there is provided a monocular video depth map calculating method comprising the steps of,
s1, decomposing the video to be restored into pictures by frames;
s2, extracting picture feature points of each frame;
s3 matching the feature points to form a feature point track;
s4, calculating a global rotation matrix, wherein the rotation matrix is the direction of the coordinate axis of the world coordinate system relative to the camera coordinate system;
s5, calculating a global translation vector which is a position coordinate of a world coordinate system coordinate axis relative to a camera coordinate system;
s6 optimizing camera parameters;
s7 calculating a dense optical flow for the selected frame;
s8 calculates the depth value of the selected frame and further obtains its depth map.
According to the method, structure from motion and the optical flow method are combined so that the camera position and the images are well matched, and the quality of the predicted depth map is greatly improved.
Specifically, for a video to be restored, it is decomposed into several pictures by frame, and for each frame of picture, a texture-based feature extraction operation is performed. After the feature points of each frame are obtained, the feature points of one feature in all frames are matched, and each feature can correspondingly form a specific feature point track, namely the track of the feature in the video playing process.
In order to express the positional relationship between the camera and the image more clearly, two coordinate systems are generally used, a world coordinate system and a camera coordinate system; through them, the position of the image in the world coordinate system, its position in the camera coordinate system, and the positional relationship between the image and the camera can all be determined. When calculating the depth map, the camera pose is also critical; it consists of two parts, a rotation matrix and a translation vector. The rotation matrix represents the direction of the coordinate axes of the world coordinate system relative to the camera coordinate system, and the translation vector represents the position coordinates of the world coordinate system's axes relative to the camera coordinate system. The rotation matrices and translation vectors obtained by solving are mainly used to optimize the overall camera parameters, in particular the reprojection error: the rotation matrices, translation vectors and spatial point coordinates are optimized jointly so as to minimize the reprojection error. On this basis, a dense optical flow method is further used to obtain the matching relationship between the selected frame and its adjacent frames, and the depth values are calculated.
As a preferred embodiment of the present invention, step S3 includes,
s31 matching the feature points of each frame;
s32, eliminating matching pairs with wrong matching;
s33, forming the feature point trajectories from the retained matches.
When matching the feature points, in order to ensure accuracy, the wrongly matched pairs need to be removed after the initial matching is completed. Many feature points are extracted from each frame; during matching, some of them cannot be matched at all, some are matched to merely similar feature points, and some are matched to feature points with no real association, so the wrongly matched points must be eliminated to ensure accuracy.
As a preferred embodiment of the present invention, step S4 includes,
s41, calculating relative rotation matrix and relative translation vector of any two frames;
s42, eliminating relative rotation matrixes and relative translation vectors which are calculated wrongly;
s43 calculates a global rotation matrix from the relative rotation matrix and the relative translation vector.
In a specific process of calculating the global rotation matrix, the method of the technical scheme of the invention preferably calculates the relative rotation matrix and the relative translation vector of any two frames, then eliminates the relative rotation matrix and the relative translation vector which are calculated wrongly, and finally calculates the global rotation matrix according to the relative rotation matrix and the relative translation vector. Specifically, the feature matching relationship of step S3 is used to calculate the essential matrix, and the relative rotation matrix and the relative translation vector of any two frames are calculated. However, due to the existence of noise, certain errors exist in the calculation of the relative rotation matrix and the relative translation vector, so that the calculation accuracy of the global rotation matrix is not high, and the elimination of the error data is helpful for improving the calculation accuracy of the global rotation matrix.
As a preferred embodiment of the present invention, step S5 includes,
s51, calculating local relative translation vectors by using the global rotation matrix and/or the three-view relation;
s52, establishing a linear programming model to solve the global translation vector.
The translation vector is strongly affected by the baseline and its scale is not uniform, so it is more difficult to solve for than the rotation matrix. The method of the technical scheme of the invention calculates the local relative translation vectors by selecting close, strongly connected three-view relationships, which overcomes the problem that a two-view relative translation vector is heavily influenced by the baseline. Meanwhile, in order to improve calculation efficiency, the method uses the information already computed for the global rotation matrix to calculate the three-view translations, which is more efficient than computing the trifocal tensor directly from the three-view point matches. Further, in solving for the global translation vector from the local translation vectors, a linear programming method is adopted.
As a preferred embodiment of the present invention, step S6 preferably performs joint optimization, i.e. bundle adjustment, on the translation vectors, rotation matrices and spatial point coordinates, so as to optimize the camera parameters.
Due to the presence of image noise, the projection formula cannot be satisfied exactly, so the previously calculated translation vectors, rotation matrices and spatial point coordinates need to be optimized jointly to obtain camera parameters that are as accurate as possible. This process is called bundle adjustment, the optimization target is called the reprojection error, and the final purpose is to optimize the camera parameters so that the video depth can be restored as accurately as possible.
As a preferred embodiment of the present invention, step S8 preferably calculates the depth value by using a triangulation method.
In the technical scheme of the invention, a least square method is adopted to solve a linear equation. Specifically, any two frames of images are selected, the corresponding matching positions of the pixel points in the first frame of image in the second frame of image are obtained according to the dense matching based on the optical flow, and the matching positions are expressed by coordinates. In each frame of image, the measured values corresponding to each pixel point are combined to obtain a linear system.
According to one aspect of the invention, a monocular video depth map computing system is provided, which is characterized by comprising a video decomposition module for decomposing the video to be restored into pictures by frames;
the characteristic extraction module is used for extracting the picture characteristic points of each frame;
the characteristic matching module is used for matching the characteristic points to form a characteristic point track;
the rotation matrix module is used for calculating a global rotation matrix, and the rotation matrix is the direction of the coordinate axis of the world coordinate system relative to the camera coordinate system;
the translation vector module is used for calculating a global translation vector, and the translation vector is a position coordinate of a world coordinate system coordinate axis relative to a camera coordinate system;
the parameter optimization module is used for optimizing camera parameters;
a dense optical flow module to compute a dense optical flow for the selected frame;
and the depth map module is used for calculating the depth value of the selected frame and further obtaining the depth map of the selected frame.
According to one aspect of the invention, there is provided a memory device having stored therein a plurality of instructions adapted to be loaded and executed by a processor:
s1, decomposing the video to be restored into pictures by frames;
s2, extracting picture feature points of each frame;
s3 matching the feature points to form a feature point track;
s4, calculating a global rotation matrix, wherein the rotation matrix is the direction of the coordinate axis of the world coordinate system relative to the camera coordinate system;
s5, calculating a global translation vector which is a position coordinate of a world coordinate system coordinate axis relative to a camera coordinate system;
s6 optimizing camera parameters;
s7 calculating a dense optical flow for the selected frame;
s8 calculates the depth value of the selected frame and further obtains its depth map.
According to one aspect of the invention, there is provided a terminal comprising a processor adapted to implement various instructions; and a storage device adapted to store a plurality of instructions, the instructions adapted to be loaded and executed by a processor to:
s1, decomposing the video to be restored into pictures by frames;
s2, extracting picture feature points of each frame;
s3 matching the feature points to form a feature point track;
s4, calculating a global rotation matrix, wherein the rotation matrix is the direction of the coordinate axis of the world coordinate system relative to the camera coordinate system;
s5, calculating a global translation vector which is a position coordinate of a world coordinate system coordinate axis relative to a camera coordinate system;
s6 optimizing camera parameters;
s7 calculating a dense optical flow for the selected frame;
s8 calculates the depth value of the selected frame and further obtains its depth map.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
1) The method adopts a physically based structure-from-motion (SFM) depth estimation approach and uses dense optical flow for matching, which solves the long-standing problem that, in going from sparse to dense scene reconstruction, the depth values of all pixel points cannot be obtained, especially in texture-less regions.
2) The method of the technical scheme of the invention innovatively combines global SFM with a dense optical flow method, improving efficiency while ensuring accurate results; specifically, the global SFM calculation ensures the efficiency of computing the depth map, and the dense optical flow method ensures the accuracy of the calculated depth map.
3) By adopting the global SFM method and the dense optical flow method, the method needs no training samples and does not rely on optimization schemes such as segmentation or plane fitting, so the amount of calculation is small, the cost is greatly reduced and the calculation time is shortened.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram of an essential matrix decomposition of an embodiment of the present inventive technique;
FIG. 3 is a schematic diagram of a local rotation matrix for selecting a calculation error according to an embodiment of the present invention;
fig. 4 shows a selected frame original and a depth map generated therefrom according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other. The present invention will be described in further detail with reference to specific embodiments.
The flow of the monocular video depth estimation method based on the SFM and the dense optical flow provided by the embodiment of the technical scheme of the invention is shown in figure 1, and comprises the following steps: decomposing a video to be restored into pictures according to frames; extracting picture characteristic points of each frame; matching the characteristic points to form a characteristic point track; calculating a global rotation matrix; calculating a global translation vector; optimizing camera parameters; computing a dense optical flow for the selected frame; the depth value of the selected frame is calculated, and further a depth map thereof is obtained. The monocular video depth estimation method based on the SFM and the dense optical flow provided by the invention is specifically described in the following by combining an example.
The specific steps of the monocular video depth estimation method based on the SFM and the dense optical flow provided by the example are as follows:
(1) and performing framing on the video to be restored to obtain continuous frame pictures. The video is decomposed into single pictures according to frames, and the purpose is to purposefully extract the feature points in each whole picture so as to analyze specific feature points at a later stage.
(2) A feature extraction operation is performed for each frame. The embodiment of the technical scheme of the invention adopts two feature extraction methods: the SIFT method (see David G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60, 2 (2004), pp. 91-110) and the AKAZE method (see Alcantarilla P. et al., "Fast Explicit Diffusion for Accelerated Features in Nonlinear Scale Spaces" [C]// British Machine Vision Conference, 2013: 13.1-13.11). In the embodiment of the technical solution of the present invention, the SIFT descriptor is preferably used for video with rich overall texture, and the AKAZE descriptor is preferably used for video whose overall texture is not rich.
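Purely as an illustration (not part of the claimed embodiment), the per-frame extraction step could be written with OpenCV, which ships both detectors; the rich_texture flag below merely stands in for the texture-richness judgement, which the embodiment does not formalize.

```python
import cv2

def extract_features(gray, rich_texture=True):
    # SIFT for frames with rich texture, AKAZE otherwise, as in the embodiment.
    detector = cv2.SIFT_create() if rich_texture else cv2.AKAZE_create()
    keypoints, descriptors = detector.detectAndCompute(gray, None)
    return keypoints, descriptors
```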
(3) The extracted feature points are matched and tracked to form feature point tracks. Different matching strategies are adopted for the different feature descriptors. For the SIFT descriptor, the embodiment of the technical scheme of the invention adopts a fast cascade hashing matching strategy in order to improve matching efficiency; for the AKAZE descriptor, because it is a binary string, the embodiment adopts the Hamming distance as the matching measure between feature points, which can be computed simply and quickly with a single XOR. In addition, to ensure matching accuracy, after the initial matching is completed the embodiment uses the K-VLD method (see "Virtual Line Descriptor and Semi-Local Graph Matching Method for Reliable Feature Point Correspondence", Zhe Liu and Renaud Marlet, BMVC 2012) to remove wrongly matched pairs.
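Stock OpenCV has no cascade-hashing or K-VLD implementation, so the sketch below substitutes a plain brute-force matcher with a ratio test; it only illustrates the choice of distance measure, L2 for SIFT versus Hamming for AKAZE's binary descriptors.

```python
import cv2

def match_descriptors(desc1, desc2, binary=False):
    # Hamming distance for AKAZE's binary strings, L2 for SIFT descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING if binary else cv2.NORM_L2)
    knn = matcher.knnMatch(desc1, desc2, k=2)
    # Lowe-style ratio test as a simple stand-in for the outlier-removal step.
    return [p[0] for p in knn
            if len(p) == 2 and p[0].distance < 0.8 * p[1].distance]
```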
(4) A global rotation matrix is calculated. In general, the camera pose consists of two parts: a rotation matrix and a translation vector. The rotation matrix represents the direction of the coordinate axis of the world coordinate system relative to the camera coordinate system, and the translation vector represents the position coordinate of the coordinate axis of the world coordinate system relative to the camera coordinate system. From these two coordinate systems, the relative positions of the analysis object to the world and the camera, and the relative positional relationship between the world and the camera can be determined.
In this step, the following steps are taken to ensure the accuracy of the results:
(4.1) The relative rotation matrix and relative translation vector of every pair of video frames are calculated. The method adopted by the invention preferably computes the essential matrix from the matching relations of step (3) with a RANSAC five-point method, and decomposes the essential matrix to obtain the two-view relative rotation matrix and relative translation vector. A schematic diagram of the relative rotation matrix and relative translation vector obtained by decomposing the essential matrix is shown in fig. 2. In particular, four combinations of rotation matrix and translation vector may be generated, but the only one that obeys the laws of physics is the case shown in (a), in which the calculated three-dimensional points lie in front of both cameras.
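A minimal sketch of step (4.1), assuming OpenCV's five-point RANSAC estimator; recoverPose performs the cheirality check that keeps the one physically valid (R, t) pair, i.e. the decomposition whose triangulated points lie in front of both cameras.

```python
import cv2

def relative_pose(pts1, pts2, K):
    # pts1, pts2: Nx2 float arrays of matched pixel coordinates; K: 3x3 intrinsics.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t, inliers
```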
(4.2) The embodiment of the technical scheme of the invention uses the cycle errors algorithm (see Enqvist O., Kahl F., Olsson C., "Non-sequential structure from motion" [C]// IEEE International Conference on Computer Vision Workshops, ICCV 2011 Workshops, Barcelona, Spain, 2011) to eliminate the relative rotation matrices that were calculated wrongly, as illustrated in fig. 3. Diagram (a) shows the input pictures; diagram (b) shows the pairwise structure, in which each node represents an image and a connecting line indicates that the two images have a geometric relationship; diagram (c) shows the statistical error calculated for each edge by the cycle errors algorithm; and diagram (d) shows the edges corresponding to erroneous relative rotation matrices screened out by Bayesian inference.
(4.3) When calculating the global rotation matrices, the relation satisfied by the global rotation matrices and the relative rotation matrix is

Rj = Rij·Ri

which is decomposed column by column into

Rij·Ri(k) − Rj(k) = 0, k = 1, 2, 3

In the above formulas, i and j are picture indices, Rij is the relative rotation matrix between the ith and jth frames, and k = 1, 2, 3 indexes the columns of the global rotation matrices, Ri(k) denoting the kth column of Ri. Solving this set of relations by linear least squares ensures the calculation accuracy.
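The column-wise relation above can be assembled into one linear least-squares problem. The NumPy sketch below is illustrative only: it fixes the first camera to the identity to remove the global rotation ambiguity, stacks the relations over all pairs, and projects each block of the solution back onto a rotation matrix.

```python
import numpy as np

def global_rotations(pairwise, n):
    """Illustrative sketch: solve R_ij @ R_i = R_j over all pairs in the
    least-squares sense, with camera 0 fixed to the identity to remove the
    global rotation ambiguity.  pairwise: dict {(i, j): R_ij}, indices 0..n-1.
    Unknowns are vec(R_i) (column-stacked) for i = 1 .. n-1."""
    vec = lambda M: M.flatten(order="F")
    m = 9 * (n - 1)
    A_list, b_list = [], []
    for (i, j), Rij in pairwise.items():
        A = np.zeros((9, m))
        b = np.zeros(9)
        if i == 0:
            b -= vec(Rij)                            # R_ij @ I is fully known
        else:
            A[:, 9 * (i - 1): 9 * i] = np.kron(np.eye(3), Rij)
        if j == 0:
            b += vec(np.eye(3))
        else:
            A[:, 9 * (j - 1): 9 * j] -= np.eye(9)
        A_list.append(A)
        b_list.append(b)
    x, *_ = np.linalg.lstsq(np.vstack(A_list), np.concatenate(b_list), rcond=None)
    Rs = [np.eye(3)]
    for i in range(1, n):
        M = x[9 * (i - 1): 9 * (i - 1) + 9].reshape(3, 3, order="F")
        U, _, Vt = np.linalg.svd(M)                  # project back onto SO(3)
        R = U @ Vt
        if np.linalg.det(R) < 0:                     # guard against reflections
            R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
        Rs.append(R)
    return Rs
```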
(5) The translation vector of each frame is calculated. The translation vector is strongly affected by the baseline and its scale is not uniform, so it is harder to solve for than the rotation matrix. The embodiment of the technical scheme of the invention selects close, strongly connected three-view relationships to calculate the local relative translation vectors, which overcomes the problem that a two-view relative translation vector is heavily influenced by the baseline. Meanwhile, to improve computational efficiency, the embodiment uses the information of the global rotation matrices to calculate the three-view translations, which is more efficient than computing the trifocal tensor directly from the three-view point matches. In solving for the global translation vectors from the local translation vectors, the embodiment uses a linear programming method, which requires no iteration and is more efficient. The specific calculation of the translation vectors is as follows:
(5.1) The local relative translation vectors are calculated using the global rotation matrices and the three-view relationship. The embodiment of the technical scheme of the invention calculates the three-view translation relationship by the method of minimizing the reprojection error. With the global rotation matrices known, the reprojection error is denoted ρ(ti, Xj), where j ∈ {1, 2, … m}, i ∈ {1, 2, 3}, ti is the translation vector of the ith view, the first row of the rotation matrix of the ith view also enters the formula, Xj is the coordinate of the jth three-dimensional point, and the corresponding image point of the jth three-dimensional point in the ith view is the measured quantity. With the global rotation matrix known, the values of ti and Xj can be uniquely determined from four groups of non-coplanar three-view matching points. The calculation takes the form of a mathematical optimization model solved over γ, ti and Xj. Its constraints require that, after a three-dimensional point is transformed into the camera coordinate system, the Z coordinate of the point must be greater than 1; the reason is that all image points are normalized, i.e. transformed as x = K⁻¹·x (K denotes the intrinsic matrix of the image), which turns the intrinsic parameters of every view into an identity matrix, so the focal length becomes 1 and a Z coordinate greater than 1 indicates that the point lies in front of the camera. A further constraint fixes the camera center of the first of the three views at the origin of the world coordinate system. Because ρ(ti, Xj) contains a quadratic term, the optimization model is nonlinear. The model has three groups of unknowns, γ, ti and Xj; an initial value must be assigned to γ, and in the embodiment of the technical solution of the present invention the initial value of γ is preferably 0.5.
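Purely for illustration, the following SciPy sketch captures the idea of step (5.1) in a simplified, unconstrained form: with the three rotations known, the translations and the three-dimensional points are refined by minimizing the reprojection error on normalized coordinates. The cheirality constraint (Z > 1) and the auxiliary variable γ of the embodiment's model are omitted, and all function and variable names are this sketch's own.

```python
import numpy as np
from scipy.optimize import least_squares

def three_view_translations(Rs, obs, X0):
    """Illustrative sketch of step (5.1): with the three global rotations Rs
    known, refine the translations t2, t3 (t1 fixed to zero) and the 3-D
    points X_j by minimising the reprojection error on normalised image
    coordinates (x = K^-1 * pixel).  obs[j][i] is the observation of track j
    in view i; X0 is an initial guess for the points, e.g. from a rough
    triangulation.  The Z > 1 constraints and the variable gamma are omitted."""
    m = len(obs)

    def residuals(p):
        ts = [np.zeros(3), p[0:3], p[3:6]]
        X = p[6:].reshape(m, 3)
        r = []
        for j in range(m):
            for i in range(3):
                pc = Rs[i] @ X[j] + ts[i]            # point in camera i
                r.extend(pc[:2] / pc[2] - np.asarray(obs[j][i])[:2])
        return r

    p0 = np.concatenate([np.zeros(6), np.asarray(X0, dtype=float).ravel()])
    sol = least_squares(residuals, p0)
    return np.zeros(3), sol.x[0:3], sol.x[3:6]
```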
(5.2) A linear programming model is established to solve for the global translation vectors. To calculate the global translation vectors from the relative translation vectors, the relationship between the two is first established:

λ·tij = Tj − Rij·Ti

Within a three-view relationship there is a uniform scale factor λτ, and the above relations over all view pairs are assembled into a linear programming model that minimizes their residuals. In this model, one constraint prevents λτ from shrinking towards zero: because the global translation vectors T are defined only up to an arbitrary scale, omitting this constraint could produce a large error in the calculation result. A further constraint specifies the camera coordinate system of the first view as the world coordinate system. In the actual calculation process, the model can be solved with a linear programming function library.
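One possible way to set up such a linear program is sketched below with scipy.optimize.linprog, using L1 residuals handled through slack variables; for simplicity it attaches one scale variable to each relative translation rather than one per three-view group, so it is only an illustrative variant of the embodiment's model.

```python
import numpy as np
from scipy.optimize import linprog

def global_translations(pairwise, n):
    """Illustrative sketch: recover global translations T_i from relative
    motions via the relation  lambda * t_ij = T_j - R_ij * T_i.
    pairwise: dict {(i, j): (R_ij, t_ij)}, camera indices 0..n-1; T_0 is fixed
    to the origin, every scale lambda >= 1, and the L1 residuals are minimised
    through slack variables, which makes the whole problem a linear program."""
    edges = list(pairwise.items())
    E = len(edges)
    nT, nL = 3 * (n - 1), E                    # translation and scale unknowns
    nvar = nT + nL + 3 * E                     # plus slack variables
    cost = np.concatenate([np.zeros(nT + nL), np.ones(3 * E)])
    A_ub, b_ub = [], []
    for e, ((i, j), (Rij, tij)) in enumerate(edges):
        for c in range(3):
            row = np.zeros(nvar)
            row[nT + e] = tij[c]                    # + lambda_e * t_ij[c]
            if i > 0:
                row[3 * (i - 1): 3 * i] += Rij[c]   # + (R_ij T_i)[c]
            if j > 0:
                row[3 * (j - 1) + c] -= 1.0         # - T_j[c]
            for sign in (1.0, -1.0):                # |residual| <= slack
                r = sign * row
                r[nT + nL + 3 * e + c] = -1.0
                A_ub.append(r)
                b_ub.append(0.0)
    bounds = [(None, None)] * nT + [(1.0, None)] * nL + [(0.0, None)] * (3 * E)
    res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    return [np.zeros(3)] + [res.x[3 * k: 3 * k + 3] for k in range(n - 1)]
```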
(6) According to the embodiment of the technical scheme, the translation vectors, the rotation matrices and the generated sparse spatial points are jointly refined by bundle adjustment, completing the optimization of the overall camera parameters.
Due to the presence of image noise, the projection formula

λ·xji = Pi·Xj

cannot be satisfied exactly, so the previously calculated translation vectors, rotation matrices and spatial point coordinates need to be optimized jointly. This process is called bundle adjustment, and the optimization target is called the reprojection error:

min Σi Σj d(xji, Pi·Xj)²

In the above formula, Pi denotes the projection matrix of the ith frame of pictures, Xj denotes the calculated coordinates of the jth spatial point, and the physical meaning of Pi·Xj is that the jth spatial point is projected to the image coordinates of the ith frame, i.e. a three-dimensional to two-dimensional projection is completed; d(xji, Pi·Xj) is the Euclidean distance between the reprojected two-dimensional coordinates and the originally extracted feature point coordinates. By optimizing this geometric distance, the translation vectors, rotation matrices and spatial point coordinates are jointly refined, and the required optimized camera parameters are obtained.
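A compact sketch of such a bundle adjustment with scipy.optimize.least_squares is given below, on normalized image coordinates and with rotations parameterized as rotation vectors; a production implementation would additionally fix the gauge and exploit the sparsity of the Jacobian, both omitted here for brevity.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def bundle_adjust(Rs, ts, X, cams, pts, obs):
    """Illustrative sketch of step (6): jointly refine rotations (as rotation
    vectors), translations and 3-D points by minimising the reprojection
    error on normalised coordinates.  Observation k is point pts[k] seen by
    camera cams[k] at the 2-D position obs[k]."""
    n, m = len(Rs), len(X)
    obs = np.asarray(obs, dtype=float)
    rvecs = np.array([Rotation.from_matrix(R).as_rotvec() for R in Rs])
    x0 = np.concatenate([rvecs.ravel(), np.ravel(ts), np.ravel(X)])

    def residuals(p):
        rv = p[:3 * n].reshape(n, 3)
        t = p[3 * n:6 * n].reshape(n, 3)
        Xp = p[6 * n:].reshape(m, 3)
        r = []
        for k, (i, j) in enumerate(zip(cams, pts)):
            pc = Rotation.from_rotvec(rv[i]).as_matrix() @ Xp[j] + t[i]
            r.extend(pc[:2] / pc[2] - obs[k])        # reprojection residual
        return r

    sol = least_squares(residuals, x0)
    p = sol.x
    new_Rs = [Rotation.from_rotvec(r).as_matrix() for r in p[:3 * n].reshape(n, 3)]
    return new_Rs, p[3 * n:6 * n].reshape(n, 3), p[6 * n:].reshape(m, 3)
```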
(7) A dense optical flow for the selected frame is calculated.
In the embodiment of the technical solution of the present invention, the Classic+NL method (see Sun D., Roth S., Black M. J., "Secrets of optical flow estimation and their principles" [C]// Computer Vision and Pattern Recognition, IEEE, 2010) is preferably adopted to compute the dense optical flow between the selected frame and its adjacent frames.
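Classic+NL is not bundled with OpenCV, so the sketch below substitutes the readily available Farneback dense flow purely to illustrate the interface: a per-pixel (dx, dy) displacement field between the selected frame and an adjacent frame.

```python
import cv2

def dense_flow(frame_a, frame_b):
    # Per-pixel (dx, dy) displacement field from frame_a to frame_b.
    g1 = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    # Positional arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
    return cv2.calcOpticalFlowFarneback(g1, g2, None, 0.5, 3, 15, 3, 5, 1.2, 0)
```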
(8) Triangulation calculates the depth values. Any two frames of images, the ith frame and the jth frame, are selected, and from the dense matching based on optical flow the corresponding matching position x + δ of each pixel x of the ith frame in the jth frame is obtained (x and x + δ are planar two-dimensional coordinate vectors). In each frame, the measurement corresponding to each pixel point is xi ~ Pi·X and xj ~ Pj·X, and these can be combined into a linear system of the form A·X = 0.
In the specific calculation, the homogeneous scale factor is first removed by taking the cross product of each image point with its projection, which gives three equations per view, of which two are linearly independent. Carrying out the cross-product operation on the two projection relations yields:

x(P3T·X) − (P1T·X) = 0
y(P3T·X) − (P2T·X) = 0
x(P2T·X) − y(P1T·X) = 0
x'(P'3T·X) − (P'1T·X) = 0
y'(P'3T·X) − (P'2T·X) = 0
x'(P'2T·X) − y'(P'1T·X) = 0

where (x, y) and (x', y') are the matched image points in the two frames, and PkT and P'kT denote the kth rows of the two projection matrices.
Discarding the redundant equations and stacking the remaining ones gives A·X = 0 with

A = [ x·P3T − P1T
      y·P3T − P2T
      x'·P'3T − P'1T
      y'·P'3T − P'2T ]
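A direct NumPy sketch of building this homogeneous system and solving it in the least-squares sense via SVD (the standard DLT triangulation); x1, x2 and the two 3x4 projection matrices are assumed to be expressed in matching coordinates.

```python
import numpy as np

def triangulate(x1, x2, P1, P2):
    # Build A from the cross-product equations above (dropping the redundant
    # third row of each view) and solve A X = 0 via SVD, the homogeneous
    # least-squares solution of the DLT.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]          # inhomogeneous 3-D point
```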
in the embodiment of the technical scheme of the invention, the equation AX is solved to be 0 by adopting a least square method. Then finally X is in PiThe reference camera coordinate system coordinates are:
Xi=Ri·X+t
at this time XiThe z coordinate of (a) is a depth value, and in the embodiment of the present invention, it is preferably stretched to a gray value of 0 to 255 according to its value, so as to facilitate display and observation, and the result is shown in fig. 4.
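For completeness, a small sketch of the grey-value stretching used for display, assuming a simple linear min-max normalization, which the embodiment does not spell out:

```python
import numpy as np

def depth_to_gray(depth):
    # Linear min-max stretch of the depth values to the 0-255 range for display.
    d = depth.astype(np.float64)
    d = (d - d.min()) / max(float(d.max() - d.min()), 1e-9)
    return (255.0 * d).astype(np.uint8)
```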
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A monocular video depth map calculation method is characterized by comprising the following steps,
s1, decomposing the video to be restored into pictures by frames;
s2, extracting picture feature points of each frame, wherein SIFT descriptors are adopted for video frames with abundant overall textures during extraction, and AKAZE descriptors are adopted for video frames with not abundant overall textures;
s3, matching the feature points of each frame to form a feature point track, wherein if an SIFT descriptor is adopted in the extraction in the step S2, a fast cascade Hash matching method is adopted in the matching, and if an AKAZE descriptor is adopted in the extraction in the step S2, a matching method based on a Hamming distance is adopted in the matching;
s4, calculating a global rotation matrix of each frame, wherein the global rotation matrix is the direction of the coordinate axis of the world coordinate system relative to the camera coordinate system;
s5, calculating a global translation vector of each frame, wherein the global translation vector is a position coordinate of a world coordinate system coordinate axis relative to a camera coordinate system;
s6, acquiring camera optimization parameters;
s7 calculating a dense optical flow for the selected frame;
s8, calculating the depth value of the selected frame according to the camera optimization parameters and the dense optical flow, and further obtaining a depth map of the selected frame;
the step S5 includes the steps of,
s51, calculating a three-view relation by using a method of minimizing a reprojection error, and calculating a local relative translation vector by using a global rotation matrix and the three-view relation;
s52, establishing a linear programming model to solve the global translation vector.
2. The method for calculating a monocular video depth map according to claim 1, wherein the step S3 includes,
s31 matching the feature points of each frame;
s32, eliminating matching pairs with wrong matching;
s33, forming the feature point trajectories from the retained matches.
3. The monocular video depth map calculating method of claim 1 or 2, wherein the step S4 includes,
s41, calculating relative rotation matrix and relative translation vector of any two frames;
s42, eliminating relative rotation matrixes with calculation errors;
s43 calculates a global rotation matrix from the relative rotation matrix and the relative translation vector.
4. The method for calculating a monocular video depth map according to claim 1 or 2, wherein the step S6 performs joint optimization, i.e. bundle adjustment, on the global translation vector, the global rotation matrix and the spatial point coordinates, so as to optimize the camera parameters.
5. The monocular video depth map calculating method of claim 1 or 2, wherein the step S8 calculates the depth value by using a triangulation method.
6. A monocular video depth map computing system comprising,
the video decomposition module is used for decomposing the video to be restored into pictures according to frames;
the characteristic extraction module is used for extracting the picture characteristic points of each frame, wherein SIFT descriptors are adopted for video frames with rich overall textures during extraction, and AKAZE descriptors are adopted for video frames with poor overall textures;
the characteristic matching module is used for matching characteristic points to form a characteristic point track, wherein if an SIFT descriptor is adopted in the extraction in the step S2, a fast cascade Hash matching method is adopted in the matching process, and if an AKAZE descriptor is adopted in the extraction in the step S2, a matching method based on a Hamming distance is adopted in the matching process;
the rotation matrix module is used for calculating a global rotation matrix, and the global rotation matrix is the direction of a coordinate axis of a world coordinate system relative to a camera coordinate system;
the translation vector module is used for calculating a three-view relation by using a method for minimizing a reprojection error, calculating a local relative translation vector by using a global rotation matrix and the three-view relation, establishing a linear programming model and solving a global translation vector, wherein the global translation vector is a position coordinate of a coordinate axis of a world coordinate system relative to a camera coordinate system;
the parameter optimization module is used for optimizing camera parameters;
a dense optical flow module to compute a dense optical flow for the selected frame;
and the depth map module is used for calculating the depth value of the selected frame and further obtaining the depth map of the selected frame.
7. A memory device having stored therein a plurality of instructions adapted to be loaded and executed by a processor:
s1, decomposing the video to be restored into pictures by frames;
s2, extracting picture feature points of each frame, wherein SIFT descriptors are adopted for video frames with abundant overall textures during extraction, and AKAZE descriptors are adopted for video frames with not abundant overall textures;
s3, matching the feature points to form a feature point track, wherein if an SIFT descriptor is adopted in the extraction in the step S2, a fast cascade Hash matching method is adopted in the matching, and if an AKAZE descriptor is adopted in the extraction in the step S2, a matching method based on Hamming distance is adopted in the matching;
s4, calculating a global rotation matrix, wherein the global rotation matrix is the direction of the coordinate axis of the world coordinate system relative to the camera coordinate system;
s5, calculating a global translation vector which is a position coordinate of a world coordinate system coordinate axis relative to a camera coordinate system;
s6 optimizing camera parameters;
s7 calculating a dense optical flow for the selected frame;
s8, calculating the depth value of the selected frame and further obtaining the depth map thereof;
the step S5 includes the steps of,
s51, calculating a three-view relation by using a method of minimizing a reprojection error, and calculating a local relative translation vector by using a global rotation matrix and the three-view relation;
s52, establishing a linear programming model to solve the global translation vector.
8. A terminal, comprising: a processor adapted to implement various instructions; and a storage device adapted to store a plurality of instructions, the instructions adapted to be loaded and executed by a processor to:
s1, decomposing the video to be restored into pictures by frames;
s2, extracting picture feature points of each frame, wherein SIFT descriptors are adopted for video frames with abundant overall textures during extraction, and AKAZE descriptors are adopted for video frames with not abundant overall textures;
s3, matching the feature points to form a feature point track, wherein if an SIFT descriptor is adopted in the extraction in the step S2, a fast cascade Hash matching method is adopted in the matching, and if an AKAZE descriptor is adopted in the extraction in the step S2, a matching method based on Hamming distance is adopted in the matching;
s4, calculating a global rotation matrix, wherein the global rotation matrix is the direction of the coordinate axis of the world coordinate system relative to the camera coordinate system;
s5, calculating a global translation vector which is a position coordinate of a world coordinate system coordinate axis relative to a camera coordinate system;
s6 optimizing camera parameters;
s7 calculating a dense optical flow for the selected frame;
s8, calculating the depth value of the selected frame and further obtaining the depth map thereof;
the step S5 includes the steps of,
s51, calculating a three-view relation by using a method of minimizing a reprojection error, and calculating a local relative translation vector by using a global rotation matrix and the three-view relation;
s52, establishing a linear programming model to solve the global translation vector.
CN201710351600.6A 2017-05-18 2017-05-18 Monocular video depth map calculation method Active CN107481279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710351600.6A CN107481279B (en) 2017-05-18 2017-05-18 Monocular video depth map calculation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710351600.6A CN107481279B (en) 2017-05-18 2017-05-18 Monocular video depth map calculation method

Publications (2)

Publication Number Publication Date
CN107481279A CN107481279A (en) 2017-12-15
CN107481279B true CN107481279B (en) 2020-07-07

Family

ID=60593543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710351600.6A Active CN107481279B (en) 2017-05-18 2017-05-18 Monocular video depth map calculation method

Country Status (1)

Country Link
CN (1) CN107481279B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154531B (en) * 2018-01-03 2021-10-08 深圳北航新兴产业技术研究院 Method and device for calculating area of body surface damage region
CN110047101A (en) * 2018-01-15 2019-07-23 北京三星通信技术研究有限公司 Gestures of object estimation method, the method for obtaining dense depth image, related device
CN110517304B (en) * 2019-07-26 2022-04-22 苏州浪潮智能科技有限公司 Method and device for generating depth map, electronic equipment and storage medium
CN110579169A (en) * 2019-07-30 2019-12-17 广州南方卫星导航仪器有限公司 Stereoscopic vision high-precision measurement method based on cloud computing and storage medium
CN110782490B (en) * 2019-09-24 2022-07-05 武汉大学 Video depth map estimation method and device with space-time consistency
CN113034562B (en) * 2019-12-09 2023-05-12 百度在线网络技术(北京)有限公司 Method and apparatus for optimizing depth information
CN111010558B (en) * 2019-12-17 2021-11-09 浙江农林大学 Stumpage depth map generation method based on short video image
CN111784659A (en) * 2020-06-29 2020-10-16 北京百度网讯科技有限公司 Image detection method and device, electronic equipment and storage medium
CN111982058A (en) * 2020-08-04 2020-11-24 北京中科慧眼科技有限公司 Distance measurement method, system and equipment based on binocular camera and readable storage medium
CN112672150A (en) * 2020-12-22 2021-04-16 福州大学 Video coding method based on video prediction
CN112668474B (en) * 2020-12-28 2023-10-13 北京字节跳动网络技术有限公司 Plane generation method and device, storage medium and electronic equipment
CN112967341B (en) * 2021-02-23 2023-04-25 湖北枫丹白露智慧标识科技有限公司 Indoor visual positioning method, system, equipment and storage medium based on live-action image
CN113284081B (en) * 2021-07-20 2021-10-22 杭州小影创新科技股份有限公司 Depth map super-resolution optimization method and device, processing equipment and storage medium
CN113793420B (en) * 2021-09-17 2024-05-24 联想(北京)有限公司 Depth information processing method and device, electronic equipment and storage medium
CN116228834B (en) * 2022-12-20 2023-11-03 阿波罗智联(北京)科技有限公司 Image depth acquisition method and device, electronic equipment and storage medium


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103247075B (en) * 2013-05-13 2015-08-19 北京工业大学 Based on the indoor environment three-dimensional rebuilding method of variation mechanism
CN105184863A (en) * 2015-07-23 2015-12-23 同济大学 Unmanned aerial vehicle aerial photography sequence image-based slope three-dimension reconstruction method
CN105551086B (en) * 2015-12-04 2018-01-02 华中科技大学 A kind of modeling of personalized foot and shoe-pad method for customizing based on computer vision

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013139067A1 (en) * 2012-03-22 2013-09-26 Hou Kejie Method and system for carrying out visual stereo perception enhancement on color digital image
CN102903096A (en) * 2012-07-04 2013-01-30 北京航空航天大学 Monocular video based object depth extraction method

Also Published As

Publication number Publication date
CN107481279A (en) 2017-12-15

Similar Documents

Publication Publication Date Title
CN107481279B (en) Monocular video depth map calculation method
Kim et al. Deep monocular depth estimation via integration of global and local predictions
CN110458939B (en) Indoor scene modeling method based on visual angle generation
Tokmakov et al. Learning motion patterns in videos
US10334168B2 (en) Threshold determination in a RANSAC algorithm
Ma et al. Stage-wise salient object detection in 360 omnidirectional image via object-level semantical saliency ranking
US9247139B2 (en) Method for video background subtraction using factorized matrix completion
CN110827312B (en) Learning method based on cooperative visual attention neural network
Sun et al. Learning local quality-aware structures of salient regions for stereoscopic images via deep neural networks
Yin et al. Or-nerf: Object removing from 3d scenes guided by multiview segmentation with neural radiance fields
CN110544202A (en) parallax image splicing method and system based on template matching and feature clustering
CN117456136A (en) Digital twin scene intelligent generation method based on multi-mode visual recognition
Pan et al. Multi-stage feature pyramid stereo network-based disparity estimation approach for two to three-dimensional video conversion
Habtegebrial et al. Fast view synthesis with deep stereo vision
Zhu et al. Occlusion-free scene recovery via neural radiance fields
Huang et al. ES-Net: An efficient stereo matching network
Wan et al. Boosting image-based localization via randomly geometric data augmentation
CN112084855A (en) Outlier elimination method for video stream based on improved RANSAC method
Long et al. Detail preserving residual feature pyramid modules for optical flow
Zhou et al. Lrfnet: an occlusion robust fusion network for semantic segmentation with light field
Chao et al. Occcasnet: occlusion-aware cascade cost volume for light field depth estimation
Halperin et al. Clear Skies Ahead: Towards Real‐Time Automatic Sky Replacement in Video
Chen et al. Accurate 3D motion tracking by combining image alignment and feature matching
Hausler et al. Reg-NF: Efficient registration of implicit surfaces within neural fields
Chen et al. MoCo‐Flow: Neural Motion Consensus Flow for Dynamic Humans in Stationary Monocular Cameras

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant