CN107481279B - Monocular video depth map calculation method - Google Patents


Info

Publication number
CN107481279B
CN107481279B (application number CN201710351600.6A)
Authority
CN
China
Prior art keywords
calculating
matching
adopted
rotation matrix
translation vector
Prior art date
Legal status
Active
Application number
CN201710351600.6A
Other languages
Chinese (zh)
Other versions
CN107481279A (en)
Inventor
曹治国
张润泽
肖阳
鲜可
杨佳琪
李然
赵富荣
李睿博
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201710351600.6A priority Critical patent/CN107481279B/en
Publication of CN107481279A publication Critical patent/CN107481279A/en
Application granted granted Critical
Publication of CN107481279B publication Critical patent/CN107481279B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monocular video depth map calculation method, characterized by comprising the steps of: decomposing the video to be restored into pictures by frame; extracting the feature points of each frame; matching the feature points to form feature point tracks; calculating a global rotation matrix and a global translation vector; optimizing the camera parameters; computing a dense optical flow for the selected frame; and calculating the depth values of the selected frame to obtain its depth map. The method adopts a physically based structure-from-motion (SFM) depth estimation approach and uses dense optical flow for matching. It requires no training samples, does not rely on optimization schemes such as segmentation or plane fitting, and has a small computational cost. At the same time, it addresses the prior-art problem that, in going from sparse to dense scene reconstruction, the depth values of all pixel points cannot be obtained, especially in texture-less regions, thereby improving computational efficiency while ensuring the accuracy of the depth map.

Description

Monocular video depth map calculation method
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a monocular video depth map calculation method.
Background
With the development of science and technology, 3D movies and virtual reality are enriching the lives of people. However, both 3D movies already popular worldwide and virtual reality currently being enjoyed face a serious problem, namely, the lack of 3D resources. In the prior art, the depth is mainly predicted through a monocular video, and then a binocular stereo video is obtained through viewpoint synthesis, which is also a main method for solving the problem of the lack of 3D resources at present.
In this technical approach, depth estimation from monocular video is an important component and is receiving increasing attention. The mainstream monocular depth prediction methods currently include depth estimation based on deep learning, depth estimation based on physically grounded structure from motion (SFM), and optical flow methods. Depth estimation based on deep learning depends heavily on a database and is generally supervised, i.e. a large number of training samples is needed to ensure an accurate result, and obtaining such samples is costly. Physically based structure-from-motion (SFM) methods are by now quite accurate in recovering the camera position and strike a good balance between efficiency and accuracy. Their difficulty, however, has been that in going from sparse to dense scene reconstruction the depth values of all pixel points cannot be obtained, especially in texture-less regions. In contrast, optical flow methods can reach precise sub-pixel accuracy.
Disclosure of Invention
In view of the above-mentioned deficiencies or needs in the art, the present invention provides a monocular video depth map calculation method. Because training samples for supervised learning are not easy to obtain, the method of the technical scheme of the invention calculates monocular video depth values with a physically based approach; at the same time, dense optical flow is introduced for matching, so that the depth map of a video frame can be calculated accurately.
To achieve the above object, according to one aspect of the present invention, there is provided a monocular video depth map calculating method comprising the steps of,
s1, decomposing the video to be restored into pictures by frames;
s2, extracting picture feature points of each frame;
s3 matching the feature points to form a feature point track;
s4, calculating a global rotation matrix, wherein the rotation matrix is the direction of the coordinate axis of the world coordinate system relative to the camera coordinate system;
s5, calculating a global translation vector which is a position coordinate of a world coordinate system coordinate axis relative to a camera coordinate system;
s6 optimizing camera parameters;
s7 calculating a dense optical flow for the selected frame;
s8 calculates the depth value of the selected frame and further obtains its depth map.
According to the method, structure from motion and the optical flow method are combined so that the camera position and the images are well matched, and the quality of the predicted depth map is greatly improved.
Specifically, for a video to be restored, it is decomposed into several pictures by frame, and for each frame of picture, a texture-based feature extraction operation is performed. After the feature points of each frame are obtained, the feature points of one feature in all frames are matched, and each feature can correspondingly form a specific feature point track, namely the track of the feature in the video playing process.
In order to express the positional relationship between the camera and the image more clearly, two coordinate systems are generally used, a world coordinate system and a camera coordinate system; through them, the position of the image in the world coordinate system, its position in the camera coordinate system, and the positional relationship between the image and the camera can all be determined. When calculating the depth map, the camera pose is also critical; it consists of two parts, a rotation matrix and a translation vector. The rotation matrix represents the direction of the coordinate axes of the world coordinate system relative to the camera coordinate system, and the translation vector represents the position coordinates of the world coordinate system's axes relative to the camera coordinate system. The rotation matrices and translation vectors obtained by solving are mainly used to optimize the overall camera parameters, in particular the reprojection error: the rotation matrices, translation vectors and spatial point coordinates are optimized jointly so as to minimize the reprojection error. On this basis, a dense optical flow method is further used to obtain the matching relationship between the selected frame and its adjacent frames, and the depth values are calculated.
As a preferred embodiment of the present invention, step S3 includes,
s31 matching the feature points of each frame;
s32, eliminating matching pairs with wrong matching;
s33, forming the feature point trajectories from the retained matches.
When matching the feature points, in order to ensure accuracy, the wrongly matched pairs need to be removed after the initial matching is completed. Many feature points are extracted from each frame; during matching, some of them cannot be matched at all, some are matched to merely similar feature points, and some are matched to feature points with no real association, so the wrongly matched points must be eliminated to ensure accuracy.
As a preferred embodiment of the present invention, step S4 includes,
s41, calculating relative rotation matrix and relative translation vector of any two frames;
s42, eliminating relative rotation matrixes and relative translation vectors which are calculated wrongly;
s43 calculates a global rotation matrix from the relative rotation matrix and the relative translation vector.
In a specific process of calculating the global rotation matrix, the method of the technical scheme of the invention preferably calculates the relative rotation matrix and the relative translation vector of any two frames, then eliminates the relative rotation matrix and the relative translation vector which are calculated wrongly, and finally calculates the global rotation matrix according to the relative rotation matrix and the relative translation vector. Specifically, the feature matching relationship of step S3 is used to calculate the essential matrix, and the relative rotation matrix and the relative translation vector of any two frames are calculated. However, due to the existence of noise, certain errors exist in the calculation of the relative rotation matrix and the relative translation vector, so that the calculation accuracy of the global rotation matrix is not high, and the elimination of the error data is helpful for improving the calculation accuracy of the global rotation matrix.
As a preferred embodiment of the present invention, step S5 includes,
s51, calculating local relative translation vectors by using the global rotation matrix and/or the three-view relation;
s52, establishing a linear programming model to solve the global translation vector.
The translation vector is strongly affected by the baseline and its scale is not uniform, so it is more difficult to solve for than the rotation matrix. The method of the technical scheme of the invention calculates the local relative translation vectors by selecting close, strongly connected three-view relationships, which overcomes the problem that a two-view relative translation vector is heavily influenced by the baseline. Meanwhile, in order to improve calculation efficiency, the method uses the information already computed for the global rotation matrix to calculate the three-view translations, which is more efficient than computing the trifocal tensor directly from the three-view point matches. Further, in solving for the global translation vector from the local translation vectors, a linear programming method is adopted.
As a preferred embodiment of the present invention, step S6 preferably performs joint optimization, i.e. bundle adjustment, on the translation vectors, rotation matrices and spatial point coordinates, so as to optimize the camera parameters.
Due to the presence of image noise, the projection formula cannot be satisfied exactly, so the previously calculated translation vectors, rotation matrices and spatial point coordinates need to be optimized jointly to obtain camera parameters that are as accurate as possible. This process is called bundle adjustment, the optimization target is called the reprojection error, and the final purpose is to optimize the camera parameters so that the video depth can be restored as accurately as possible.
As a preferred embodiment of the present invention, step S8 preferably calculates the depth value by using a triangulation method.
In the technical scheme of the invention, a least square method is adopted to solve a linear equation. Specifically, any two frames of images are selected, the corresponding matching positions of the pixel points in the first frame of image in the second frame of image are obtained according to the dense matching based on the optical flow, and the matching positions are expressed by coordinates. In each frame of image, the measured values corresponding to each pixel point are combined to obtain a linear system.
According to one aspect of the invention, a monocular video depth map computing system is provided, which is characterized by comprising a video decomposition module for decomposing the video to be restored into pictures by frames;
the characteristic extraction module is used for extracting the picture characteristic points of each frame;
the characteristic matching module is used for matching the characteristic points to form a characteristic point track;
the rotation matrix module is used for calculating a global rotation matrix, and the rotation matrix is the direction of the coordinate axis of the world coordinate system relative to the camera coordinate system;
the translation vector module is used for calculating a global translation vector, and the translation vector is a position coordinate of a world coordinate system coordinate axis relative to a camera coordinate system;
the parameter optimization module is used for optimizing camera parameters;
a dense optical flow module to compute a dense optical flow for the selected frame;
and the depth map module is used for calculating the depth value of the selected frame and further obtaining the depth map of the selected frame.
According to one aspect of the invention, there is provided a memory device having stored therein a plurality of instructions adapted to be loaded and executed by a processor:
s1, decomposing the video to be restored into pictures by frames;
s2, extracting picture feature points of each frame;
s3 matching the feature points to form a feature point track;
s4, calculating a global rotation matrix, wherein the rotation matrix is the direction of the coordinate axis of the world coordinate system relative to the camera coordinate system;
s5, calculating a global translation vector which is a position coordinate of a world coordinate system coordinate axis relative to a camera coordinate system;
s6 optimizing camera parameters;
s7 calculating a dense optical flow for the selected frame;
s8 calculates the depth value of the selected frame and further obtains its depth map.
According to one aspect of the invention, there is provided a terminal comprising a processor adapted to implement various instructions; and a storage device adapted to store a plurality of instructions, the instructions adapted to be loaded and executed by a processor to:
s1, decomposing the video to be restored into pictures by frames;
s2, extracting picture feature points of each frame;
s3 matching the feature points to form a feature point track;
s4, calculating a global rotation matrix, wherein the rotation matrix is the direction of the coordinate axis of the world coordinate system relative to the camera coordinate system;
s5, calculating a global translation vector which is a position coordinate of a world coordinate system coordinate axis relative to a camera coordinate system;
s6 optimizing camera parameters;
s7 calculating a dense optical flow for the selected frame;
s8 calculates the depth value of the selected frame and further obtains its depth map.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
1) The method adopts a physically based structure-from-motion (SFM) depth estimation approach and uses dense optical flow for matching, which solves the long-standing problem that, in going from sparse to dense scene reconstruction, the depth values of all pixel points cannot be obtained, especially in texture-less regions.
2) The method of the technical scheme of the invention innovatively combines global SFM with a dense optical flow method, improving efficiency while ensuring accurate results; specifically, the global SFM calculation ensures the efficiency of computing the depth map, and the dense optical flow method ensures the accuracy of the calculated depth map.
3) By adopting the global SFM method and the dense optical flow method, the method needs no training samples and does not rely on optimization schemes such as segmentation or plane fitting, so the amount of calculation is small, the cost is greatly reduced and the calculation time is shortened.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram of an essential matrix decomposition of an embodiment of the present inventive technique;
FIG. 3 is a schematic diagram of a local rotation matrix for selecting a calculation error according to an embodiment of the present invention;
fig. 4 shows a selected frame original and a depth map generated therefrom according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other. The present invention will be described in further detail with reference to specific embodiments.
The flow of the monocular video depth estimation method based on the SFM and the dense optical flow provided by the embodiment of the technical scheme of the invention is shown in figure 1, and comprises the following steps: decomposing a video to be restored into pictures according to frames; extracting picture characteristic points of each frame; matching the characteristic points to form a characteristic point track; calculating a global rotation matrix; calculating a global translation vector; optimizing camera parameters; computing a dense optical flow for the selected frame; the depth value of the selected frame is calculated, and further a depth map thereof is obtained. The monocular video depth estimation method based on the SFM and the dense optical flow provided by the invention is specifically described in the following by combining an example.
The specific steps of the monocular video depth estimation method based on the SFM and the dense optical flow provided by the example are as follows:
(1) and performing framing on the video to be restored to obtain continuous frame pictures. The video is decomposed into single pictures according to frames, and the purpose is to purposefully extract the feature points in each whole picture so as to analyze specific feature points at a later stage.
(2) A feature extraction operation is performed for each frame. The embodiment of the technical scheme of the invention adopts two feature extraction methods: the SIFT method (see David G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60, 2 (2004), pp. 91-110) and the AKAZE method (see Alcantarilla P. et al., "Fast Explicit Diffusion for Accelerated Features in Nonlinear Scale Spaces" [C]// British Machine Vision Conference, 2013: 13.1-13.11). In the embodiment of the technical solution of the present invention, the SIFT descriptor is preferably used for video with rich overall texture, and the AKAZE descriptor is preferably used for video whose overall texture is not rich.
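Purely as an illustration (not part of the claimed embodiment), the per-frame extraction step could be written with OpenCV, which ships both detectors; the rich_texture flag below merely stands in for the texture-richness judgement, which the embodiment does not formalize.

```python
import cv2

def extract_features(gray, rich_texture=True):
    # SIFT for frames with rich texture, AKAZE otherwise, as in the embodiment.
    detector = cv2.SIFT_create() if rich_texture else cv2.AKAZE_create()
    keypoints, descriptors = detector.detectAndCompute(gray, None)
    return keypoints, descriptors
```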
(3) The extracted feature points are matched and tracked to form feature point tracks. Different matching strategies are adopted for the different feature descriptors. For the SIFT descriptor, the embodiment of the technical scheme of the invention adopts a fast cascade hashing matching strategy in order to improve matching efficiency; for the AKAZE descriptor, because it is a binary string, the embodiment adopts the Hamming distance as the matching measure between feature points, which can be computed simply and quickly with a single XOR. In addition, to ensure matching accuracy, after the initial matching is completed the embodiment uses the K-VLD method (see "Virtual Line Descriptor and Semi-Local Graph Matching Method for Reliable Feature Point Correspondence", Zhe Liu and Renaud Marlet, BMVC 2012) to remove wrongly matched pairs.
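Stock OpenCV has no cascade-hashing or K-VLD implementation, so the sketch below substitutes a plain brute-force matcher with a ratio test; it only illustrates the choice of distance measure, L2 for SIFT versus Hamming for AKAZE's binary descriptors.

```python
import cv2

def match_descriptors(desc1, desc2, binary=False):
    # Hamming distance for AKAZE's binary strings, L2 for SIFT descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING if binary else cv2.NORM_L2)
    knn = matcher.knnMatch(desc1, desc2, k=2)
    # Lowe-style ratio test as a simple stand-in for the outlier-removal step.
    return [p[0] for p in knn
            if len(p) == 2 and p[0].distance < 0.8 * p[1].distance]
```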
(4) A global rotation matrix is calculated. In general, the camera pose consists of two parts: a rotation matrix and a translation vector. The rotation matrix represents the direction of the coordinate axis of the world coordinate system relative to the camera coordinate system, and the translation vector represents the position coordinate of the coordinate axis of the world coordinate system relative to the camera coordinate system. From these two coordinate systems, the relative positions of the analysis object to the world and the camera, and the relative positional relationship between the world and the camera can be determined.
In this step, the following steps are taken to ensure the accuracy of the results:
(4.1) The relative rotation matrix and relative translation vector of every pair of video frames are calculated. The method adopted by the invention preferably computes the essential matrix from the matching relations of step (3) with a RANSAC five-point method, and decomposes the essential matrix to obtain the two-view relative rotation matrix and relative translation vector. A schematic diagram of the relative rotation matrix and relative translation vector obtained by decomposing the essential matrix is shown in fig. 2. In particular, four combinations of rotation matrix and translation vector may be generated, but the only one that obeys the laws of physics is the case shown in (a), in which the calculated three-dimensional points lie in front of both cameras.
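A minimal sketch of step (4.1), assuming OpenCV's five-point RANSAC estimator; recoverPose performs the cheirality check that keeps the one physically valid (R, t) pair, i.e. the decomposition whose triangulated points lie in front of both cameras.

```python
import cv2

def relative_pose(pts1, pts2, K):
    # pts1, pts2: Nx2 float arrays of matched pixel coordinates; K: 3x3 intrinsics.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t, inliers
```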
(4.2) The embodiment of the technical scheme of the invention uses the cycle errors algorithm (see Enqvist O., Kahl F., Olsson C., "Non-sequential structure from motion" [C]// IEEE International Conference on Computer Vision Workshops, ICCV 2011 Workshops, Barcelona, Spain, 2011) to eliminate the relative rotation matrices that were calculated wrongly, as illustrated in fig. 3. Diagram (a) shows the input pictures; diagram (b) shows the pairwise structure, in which each node represents an image and a connecting line indicates that the two images have a geometric relationship; diagram (c) shows the statistical error calculated for each edge by the cycle errors algorithm; and diagram (d) shows the edges corresponding to erroneous relative rotation matrices screened out by Bayesian inference.
(4.3) When calculating the global rotation matrices, the relation satisfied by the global rotation matrices and the relative rotation matrix is

Rj = Rij·Ri

which is decomposed column by column into

Rij·Ri(k) − Rj(k) = 0, k = 1, 2, 3

In the above formulas, i and j are picture indices, Rij is the relative rotation matrix between the ith and jth frames, and k = 1, 2, 3 indexes the columns of the global rotation matrices, Ri(k) denoting the kth column of Ri. Solving this set of relations by linear least squares ensures the calculation accuracy.
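The column-wise relation above can be assembled into one linear least-squares problem. The NumPy sketch below is illustrative only: it fixes the first camera to the identity to remove the global rotation ambiguity, stacks the relations over all pairs, and projects each block of the solution back onto a rotation matrix.

```python
import numpy as np

def global_rotations(pairwise, n):
    """Illustrative sketch: solve R_ij @ R_i = R_j over all pairs in the
    least-squares sense, with camera 0 fixed to the identity to remove the
    global rotation ambiguity.  pairwise: dict {(i, j): R_ij}, indices 0..n-1.
    Unknowns are vec(R_i) (column-stacked) for i = 1 .. n-1."""
    vec = lambda M: M.flatten(order="F")
    m = 9 * (n - 1)
    A_list, b_list = [], []
    for (i, j), Rij in pairwise.items():
        A = np.zeros((9, m))
        b = np.zeros(9)
        if i == 0:
            b -= vec(Rij)                            # R_ij @ I is fully known
        else:
            A[:, 9 * (i - 1): 9 * i] = np.kron(np.eye(3), Rij)
        if j == 0:
            b += vec(np.eye(3))
        else:
            A[:, 9 * (j - 1): 9 * j] -= np.eye(9)
        A_list.append(A)
        b_list.append(b)
    x, *_ = np.linalg.lstsq(np.vstack(A_list), np.concatenate(b_list), rcond=None)
    Rs = [np.eye(3)]
    for i in range(1, n):
        M = x[9 * (i - 1): 9 * (i - 1) + 9].reshape(3, 3, order="F")
        U, _, Vt = np.linalg.svd(M)                  # project back onto SO(3)
        R = U @ Vt
        if np.linalg.det(R) < 0:                     # guard against reflections
            R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
        Rs.append(R)
    return Rs
```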
(5) The translation vector of each frame is calculated. The translation vector is strongly affected by the baseline and its scale is not uniform, so it is harder to solve for than the rotation matrix. The embodiment of the technical scheme of the invention selects close, strongly connected three-view relationships to calculate the local relative translation vectors, which overcomes the problem that a two-view relative translation vector is heavily influenced by the baseline. Meanwhile, to improve computational efficiency, the embodiment uses the information of the global rotation matrices to calculate the three-view translations, which is more efficient than computing the trifocal tensor directly from the three-view point matches. In solving for the global translation vectors from the local translation vectors, the embodiment uses a linear programming method, which requires no iteration and is more efficient. The specific calculation of the translation vectors is as follows:
(5.1) The local relative translation vectors are calculated using the global rotation matrices and the three-view relationship. The embodiment of the technical scheme of the invention calculates the three-view translation relationship by the method of minimizing the reprojection error. With the global rotation matrices known, the reprojection error is denoted ρ(ti, Xj), where j ∈ {1, 2, … m}, i ∈ {1, 2, 3}, ti is the translation vector of the ith view, the first row of the rotation matrix of the ith view also enters the formula, Xj is the coordinate of the jth three-dimensional point, and the corresponding image point of the jth three-dimensional point in the ith view is the measured quantity. With the global rotation matrix known, the values of ti and Xj can be uniquely determined from four groups of non-coplanar three-view matching points. The calculation takes the form of a mathematical optimization model solved over γ, ti and Xj. Its constraints require that, after a three-dimensional point is transformed into the camera coordinate system, the Z coordinate of the point must be greater than 1; the reason is that all image points are normalized, i.e. transformed as x = K⁻¹·x (K denotes the intrinsic matrix of the image), which turns the intrinsic parameters of every view into an identity matrix, so the focal length becomes 1 and a Z coordinate greater than 1 indicates that the point lies in front of the camera. A further constraint fixes the camera center of the first of the three views at the origin of the world coordinate system. Because ρ(ti, Xj) contains a quadratic term, the optimization model is nonlinear. The model has three groups of unknowns, γ, ti and Xj; an initial value must be assigned to γ, and in the embodiment of the technical solution of the present invention the initial value of γ is preferably 0.5.
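Purely for illustration, the following SciPy sketch captures the idea of step (5.1) in a simplified, unconstrained form: with the three rotations known, the translations and the three-dimensional points are refined by minimizing the reprojection error on normalized coordinates. The cheirality constraint (Z > 1) and the auxiliary variable γ of the embodiment's model are omitted, and all function and variable names are this sketch's own.

```python
import numpy as np
from scipy.optimize import least_squares

def three_view_translations(Rs, obs, X0):
    """Illustrative sketch of step (5.1): with the three global rotations Rs
    known, refine the translations t2, t3 (t1 fixed to zero) and the 3-D
    points X_j by minimising the reprojection error on normalised image
    coordinates (x = K^-1 * pixel).  obs[j][i] is the observation of track j
    in view i; X0 is an initial guess for the points, e.g. from a rough
    triangulation.  The Z > 1 constraints and the variable gamma are omitted."""
    m = len(obs)

    def residuals(p):
        ts = [np.zeros(3), p[0:3], p[3:6]]
        X = p[6:].reshape(m, 3)
        r = []
        for j in range(m):
            for i in range(3):
                pc = Rs[i] @ X[j] + ts[i]            # point in camera i
                r.extend(pc[:2] / pc[2] - np.asarray(obs[j][i])[:2])
        return r

    p0 = np.concatenate([np.zeros(6), np.asarray(X0, dtype=float).ravel()])
    sol = least_squares(residuals, p0)
    return np.zeros(3), sol.x[0:3], sol.x[3:6]
```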
(5.2) A linear programming model is established to solve for the global translation vectors. To calculate the global translation vectors from the relative translation vectors, the relationship between the two is first established:

λ·tij = Tj − Rij·Ti

Within a three-view relationship there is a uniform scale factor λτ, and the above relations over all view pairs are assembled into a linear programming model that minimizes their residuals. In this model, one constraint prevents λτ from shrinking towards zero: because the global translation vectors T are defined only up to an arbitrary scale, omitting this constraint could produce a large error in the calculation result. A further constraint specifies the camera coordinate system of the first view as the world coordinate system. In the actual calculation process, the model can be solved with a linear programming function library.
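One possible way to set up such a linear program is sketched below with scipy.optimize.linprog, using L1 residuals handled through slack variables; for simplicity it attaches one scale variable to each relative translation rather than one per three-view group, so it is only an illustrative variant of the embodiment's model.

```python
import numpy as np
from scipy.optimize import linprog

def global_translations(pairwise, n):
    """Illustrative sketch: recover global translations T_i from relative
    motions via the relation  lambda * t_ij = T_j - R_ij * T_i.
    pairwise: dict {(i, j): (R_ij, t_ij)}, camera indices 0..n-1; T_0 is fixed
    to the origin, every scale lambda >= 1, and the L1 residuals are minimised
    through slack variables, which makes the whole problem a linear program."""
    edges = list(pairwise.items())
    E = len(edges)
    nT, nL = 3 * (n - 1), E                    # translation and scale unknowns
    nvar = nT + nL + 3 * E                     # plus slack variables
    cost = np.concatenate([np.zeros(nT + nL), np.ones(3 * E)])
    A_ub, b_ub = [], []
    for e, ((i, j), (Rij, tij)) in enumerate(edges):
        for c in range(3):
            row = np.zeros(nvar)
            row[nT + e] = tij[c]                    # + lambda_e * t_ij[c]
            if i > 0:
                row[3 * (i - 1): 3 * i] += Rij[c]   # + (R_ij T_i)[c]
            if j > 0:
                row[3 * (j - 1) + c] -= 1.0         # - T_j[c]
            for sign in (1.0, -1.0):                # |residual| <= slack
                r = sign * row
                r[nT + nL + 3 * e + c] = -1.0
                A_ub.append(r)
                b_ub.append(0.0)
    bounds = [(None, None)] * nT + [(1.0, None)] * nL + [(0.0, None)] * (3 * E)
    res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    return [np.zeros(3)] + [res.x[3 * k: 3 * k + 3] for k in range(n - 1)]
```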
(6) According to the embodiment of the technical scheme, the translation vectors, the rotation matrices and the generated sparse spatial points are jointly refined by bundle adjustment, completing the optimization of the overall camera parameters.
Due to the presence of image noise, the projection formula

λ·xji = Pi·Xj

cannot be satisfied exactly, so the previously calculated translation vectors, rotation matrices and spatial point coordinates need to be optimized jointly. This process is called bundle adjustment, and the optimization target is called the reprojection error:

min Σi Σj d(xji, Pi·Xj)²

In the above formula, Pi denotes the projection matrix of the ith frame of pictures, Xj denotes the calculated coordinates of the jth spatial point, and the physical meaning of Pi·Xj is that the jth spatial point is projected to the image coordinates of the ith frame, i.e. a three-dimensional to two-dimensional projection is completed; d(xji, Pi·Xj) is the Euclidean distance between the reprojected two-dimensional coordinates and the originally extracted feature point coordinates. By optimizing this geometric distance, the translation vectors, rotation matrices and spatial point coordinates are jointly refined, and the required optimized camera parameters are obtained.
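A compact sketch of such a bundle adjustment with scipy.optimize.least_squares is given below, on normalized image coordinates and with rotations parameterized as rotation vectors; a production implementation would additionally fix the gauge and exploit the sparsity of the Jacobian, both omitted here for brevity.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def bundle_adjust(Rs, ts, X, cams, pts, obs):
    """Illustrative sketch of step (6): jointly refine rotations (as rotation
    vectors), translations and 3-D points by minimising the reprojection
    error on normalised coordinates.  Observation k is point pts[k] seen by
    camera cams[k] at the 2-D position obs[k]."""
    n, m = len(Rs), len(X)
    obs = np.asarray(obs, dtype=float)
    rvecs = np.array([Rotation.from_matrix(R).as_rotvec() for R in Rs])
    x0 = np.concatenate([rvecs.ravel(), np.ravel(ts), np.ravel(X)])

    def residuals(p):
        rv = p[:3 * n].reshape(n, 3)
        t = p[3 * n:6 * n].reshape(n, 3)
        Xp = p[6 * n:].reshape(m, 3)
        r = []
        for k, (i, j) in enumerate(zip(cams, pts)):
            pc = Rotation.from_rotvec(rv[i]).as_matrix() @ Xp[j] + t[i]
            r.extend(pc[:2] / pc[2] - obs[k])        # reprojection residual
        return r

    sol = least_squares(residuals, x0)
    p = sol.x
    new_Rs = [Rotation.from_rotvec(r).as_matrix() for r in p[:3 * n].reshape(n, 3)]
    return new_Rs, p[3 * n:6 * n].reshape(n, 3), p[6 * n:].reshape(m, 3)
```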
(7) A dense optical flow for the selected frame is calculated.
In the embodiment of the technical solution of the present invention, the Classic+NL method (see Sun D., Roth S., Black M. J., "Secrets of optical flow estimation and their principles" [C]// Computer Vision and Pattern Recognition, IEEE, 2010) is preferably adopted to compute the dense optical flow between the selected frame and its adjacent frames.
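Classic+NL is not bundled with OpenCV, so the sketch below substitutes the readily available Farneback dense flow purely to illustrate the interface: a per-pixel (dx, dy) displacement field between the selected frame and an adjacent frame.

```python
import cv2

def dense_flow(frame_a, frame_b):
    # Per-pixel (dx, dy) displacement field from frame_a to frame_b.
    g1 = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    # Positional arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
    return cv2.calcOpticalFlowFarneback(g1, g2, None, 0.5, 3, 15, 3, 5, 1.2, 0)
```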
(8) Triangulation calculates the depth values. Any two frames of images, the ith frame and the jth frame, are selected, and from the dense matching based on optical flow the corresponding matching position x + δ of each pixel x of the ith frame in the jth frame is obtained (x and x + δ are planar two-dimensional coordinate vectors). In each frame, the measurement corresponding to each pixel point is xi ~ Pi·X and xj ~ Pj·X, and these can be combined into a linear system of the form A·X = 0.
In the specific calculation, the homogeneous scale factor is first removed by taking the cross product of each image point with its projection, which gives three equations per view, of which two are linearly independent. Carrying out the cross-product operation on the two projection relations yields:

x(P3T·X) − (P1T·X) = 0
y(P3T·X) − (P2T·X) = 0
x(P2T·X) − y(P1T·X) = 0
x'(P'3T·X) − (P'1T·X) = 0
y'(P'3T·X) − (P'2T·X) = 0
x'(P'2T·X) − y'(P'1T·X) = 0

where (x, y) and (x', y') are the matched image points in the two frames, and PkT and P'kT denote the kth rows of the two projection matrices.
Discarding the redundant equations and stacking the remaining ones gives A·X = 0 with

A = [ x·P3T − P1T
      y·P3T − P2T
      x'·P'3T − P'1T
      y'·P'3T − P'2T ]
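A direct NumPy sketch of building this homogeneous system and solving it in the least-squares sense via SVD (the standard DLT triangulation); x1, x2 and the two 3x4 projection matrices are assumed to be expressed in matching coordinates.

```python
import numpy as np

def triangulate(x1, x2, P1, P2):
    # Build A from the cross-product equations above (dropping the redundant
    # third row of each view) and solve A X = 0 via SVD, the homogeneous
    # least-squares solution of the DLT.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]          # inhomogeneous 3-D point
```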
in the embodiment of the technical scheme of the invention, the equation AX is solved to be 0 by adopting a least square method. Then finally X is in PiThe reference camera coordinate system coordinates are:
Xi=Ri·X+t
at this time XiThe z coordinate of (a) is a depth value, and in the embodiment of the present invention, it is preferably stretched to a gray value of 0 to 255 according to its value, so as to facilitate display and observation, and the result is shown in fig. 4.
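For completeness, a small sketch of the grey-value stretching used for display, assuming a simple linear min-max normalization, which the embodiment does not spell out:

```python
import numpy as np

def depth_to_gray(depth):
    # Linear min-max stretch of the depth values to the 0-255 range for display.
    d = depth.astype(np.float64)
    d = (d - d.min()) / max(float(d.max() - d.min()), 1e-9)
    return (255.0 * d).astype(np.uint8)
```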
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A monocular video depth map calculation method is characterized by comprising the following steps,
s1, decomposing the video to be restored into pictures by frames;
s2, extracting picture feature points of each frame, wherein SIFT descriptors are adopted for video frames with abundant overall textures during extraction, and AKAZE descriptors are adopted for video frames with not abundant overall textures;
s3, matching the feature points of each frame to form a feature point track, wherein if an SIFT descriptor is adopted in the extraction in the step S2, a fast cascade Hash matching method is adopted in the matching, and if an AKAZE descriptor is adopted in the extraction in the step S2, a matching method based on a Hamming distance is adopted in the matching;
s4, calculating a global rotation matrix of each frame, wherein the global rotation matrix is the direction of the coordinate axis of the world coordinate system relative to the camera coordinate system;
s5, calculating a global translation vector of each frame, wherein the global translation vector is a position coordinate of a world coordinate system coordinate axis relative to a camera coordinate system;
s6, acquiring camera optimization parameters;
s7 calculating a dense optical flow for the selected frame;
s8, calculating the depth value of the selected frame according to the camera optimization parameters and the dense optical flow, and further obtaining a depth map of the selected frame;
the step S5 includes the steps of,
s51, calculating a three-view relation by using a method of minimizing a reprojection error, and calculating a local relative translation vector by using a global rotation matrix and the three-view relation;
s52, establishing a linear programming model to solve the global translation vector.
2. The method for calculating a monocular video depth map according to claim 1, wherein the step S3 includes,
s31 matching the feature points of each frame;
s32, eliminating matching pairs with wrong matching;
s33, forming the feature point trajectories from the retained matches.
3. The monocular video depth map calculating method of claim 1 or 2, wherein the step S4 includes,
s41, calculating relative rotation matrix and relative translation vector of any two frames;
s42, eliminating relative rotation matrixes with calculation errors;
s43 calculates a global rotation matrix from the relative rotation matrix and the relative translation vector.
4. The method for calculating a monocular video depth map according to claim 1 or 2, wherein the step S6 performs joint optimization, i.e. bundle adjustment, on the global translation vector, the global rotation matrix and the spatial point coordinates, so as to optimize the camera parameters.
5. The monocular video depth map calculating method of claim 1 or 2, wherein the step S8 calculates the depth value by using a triangulation method.
6. A monocular video depth map computing system comprising,
the video decomposition module is used for decomposing the video to be restored into pictures according to frames;
the characteristic extraction module is used for extracting the picture characteristic points of each frame, wherein SIFT descriptors are adopted for video frames with rich overall textures during extraction, and AKAZE descriptors are adopted for video frames with poor overall textures;
the characteristic matching module is used for matching characteristic points to form a characteristic point track, wherein if an SIFT descriptor is adopted in the extraction in the step S2, a fast cascade Hash matching method is adopted in the matching process, and if an AKAZE descriptor is adopted in the extraction in the step S2, a matching method based on a Hamming distance is adopted in the matching process;
the rotation matrix module is used for calculating a global rotation matrix, and the global rotation matrix is the direction of a coordinate axis of a world coordinate system relative to a camera coordinate system;
the translation vector module is used for calculating a three-view relation by using a method for minimizing a reprojection error, calculating a local relative translation vector by using a global rotation matrix and the three-view relation, establishing a linear programming model and solving a global translation vector, wherein the global translation vector is a position coordinate of a coordinate axis of a world coordinate system relative to a camera coordinate system;
the parameter optimization module is used for optimizing camera parameters;
a dense optical flow module to compute a dense optical flow for the selected frame;
and the depth map module is used for calculating the depth value of the selected frame and further obtaining the depth map of the selected frame.
7. A memory device having stored therein a plurality of instructions adapted to be loaded and executed by a processor:
s1, decomposing the video to be restored into pictures by frames;
s2, extracting picture feature points of each frame, wherein SIFT descriptors are adopted for video frames with abundant overall textures during extraction, and AKAZE descriptors are adopted for video frames with not abundant overall textures;
s3, matching the feature points to form a feature point track, wherein if an SIFT descriptor is adopted in the extraction in the step S2, a fast cascade Hash matching method is adopted in the matching, and if an AKAZE descriptor is adopted in the extraction in the step S2, a matching method based on Hamming distance is adopted in the matching;
s4, calculating a global rotation matrix, wherein the global rotation matrix is the direction of the coordinate axis of the world coordinate system relative to the camera coordinate system;
s5, calculating a global translation vector which is a position coordinate of a world coordinate system coordinate axis relative to a camera coordinate system;
s6 optimizing camera parameters;
s7 calculating a dense optical flow for the selected frame;
s8, calculating the depth value of the selected frame and further obtaining the depth map thereof;
the step S5 includes the steps of,
s51, calculating a three-view relation by using a method of minimizing a reprojection error, and calculating a local relative translation vector by using a global rotation matrix and the three-view relation;
s52, establishing a linear programming model to solve the global translation vector.
8. A terminal, comprising: a processor adapted to implement various instructions; and a storage device adapted to store a plurality of instructions, the instructions adapted to be loaded and executed by a processor to:
s1, decomposing the video to be restored into pictures by frames;
s2, extracting picture feature points of each frame, wherein SIFT descriptors are adopted for video frames with abundant overall textures during extraction, and AKAZE descriptors are adopted for video frames with not abundant overall textures;
s3, matching the feature points to form a feature point track, wherein if an SIFT descriptor is adopted in the extraction in the step S2, a fast cascade Hash matching method is adopted in the matching, and if an AKAZE descriptor is adopted in the extraction in the step S2, a matching method based on Hamming distance is adopted in the matching;
s4, calculating a global rotation matrix, wherein the global rotation matrix is the direction of the coordinate axis of the world coordinate system relative to the camera coordinate system;
s5, calculating a global translation vector which is a position coordinate of a world coordinate system coordinate axis relative to a camera coordinate system;
s6 optimizing camera parameters;
s7 calculating a dense optical flow for the selected frame;
s8, calculating the depth value of the selected frame and further obtaining the depth map thereof;
the step S5 includes the steps of,
s51, calculating a three-view relation by using a method of minimizing a reprojection error, and calculating a local relative translation vector by using a global rotation matrix and the three-view relation;
s52, establishing a linear programming model to solve the global translation vector.
CN201710351600.6A 2017-05-18 2017-05-18 Monocular video depth map calculation method Active CN107481279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710351600.6A CN107481279B (en) 2017-05-18 2017-05-18 Monocular video depth map calculation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710351600.6A CN107481279B (en) 2017-05-18 2017-05-18 Monocular video depth map calculation method

Publications (2)

Publication Number Publication Date
CN107481279A CN107481279A (en) 2017-12-15
CN107481279B true CN107481279B (en) 2020-07-07

Family

ID=60593543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710351600.6A Active CN107481279B (en) 2017-05-18 2017-05-18 Monocular video depth map calculation method

Country Status (1)

Country Link
CN (1) CN107481279B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154531B (en) * 2018-01-03 2021-10-08 深圳北航新兴产业技术研究院 Method and device for calculating area of body surface damage region
CN110047101A (en) * 2018-01-15 2019-07-23 北京三星通信技术研究有限公司 Gestures of object estimation method, the method for obtaining dense depth image, related device
CN110517304B (en) * 2019-07-26 2022-04-22 苏州浪潮智能科技有限公司 Method and device for generating depth map, electronic equipment and storage medium
CN110579169A (en) * 2019-07-30 2019-12-17 广州南方卫星导航仪器有限公司 Stereoscopic vision high-precision measurement method based on cloud computing and storage medium
CN110782490B (en) * 2019-09-24 2022-07-05 武汉大学 Video depth map estimation method and device with space-time consistency
CN113034562B (en) * 2019-12-09 2023-05-12 百度在线网络技术(北京)有限公司 Method and apparatus for optimizing depth information
CN111010558B (en) * 2019-12-17 2021-11-09 浙江农林大学 Stumpage depth map generation method based on short video image
CN111784659A (en) * 2020-06-29 2020-10-16 北京百度网讯科技有限公司 Image detection method and device, electronic equipment and storage medium
CN111982058A (en) * 2020-08-04 2020-11-24 北京中科慧眼科技有限公司 Distance measurement method, system and equipment based on binocular camera and readable storage medium
CN112672150A (en) * 2020-12-22 2021-04-16 福州大学 Video coding method based on video prediction
CN112668474B (en) * 2020-12-28 2023-10-13 北京字节跳动网络技术有限公司 Plane generation method and device, storage medium and electronic equipment
CN112967341B (en) * 2021-02-23 2023-04-25 湖北枫丹白露智慧标识科技有限公司 Indoor visual positioning method, system, equipment and storage medium based on live-action image
CN113284081B (en) * 2021-07-20 2021-10-22 杭州小影创新科技股份有限公司 Depth map super-resolution optimization method and device, processing equipment and storage medium
CN113793420B (en) * 2021-09-17 2024-05-24 联想(北京)有限公司 Depth information processing method and device, electronic equipment and storage medium
CN116228834B (en) * 2022-12-20 2023-11-03 阿波罗智联(北京)科技有限公司 Image depth acquisition method and device, electronic equipment and storage medium


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103247075B (en) * 2013-05-13 2015-08-19 北京工业大学 Based on the indoor environment three-dimensional rebuilding method of variation mechanism
CN105184863A (en) * 2015-07-23 2015-12-23 同济大学 Unmanned aerial vehicle aerial photography sequence image-based slope three-dimension reconstruction method
CN105551086B (en) * 2015-12-04 2018-01-02 华中科技大学 A kind of modeling of personalized foot and shoe-pad method for customizing based on computer vision

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013139067A1 (en) * 2012-03-22 2013-09-26 Hou Kejie Method and system for carrying out visual stereo perception enhancement on color digital image
CN102903096A (en) * 2012-07-04 2013-01-30 北京航空航天大学 Monocular video based object depth extraction method

Also Published As

Publication number Publication date
CN107481279A (en) 2017-12-15

Similar Documents

Publication Publication Date Title
CN107481279B (en) Monocular video depth map calculation method
Kim et al. Deep monocular depth estimation via integration of global and local predictions
CN110458939B (en) Indoor scene modeling method based on visual angle generation
Tokmakov et al. Learning motion patterns in videos
US10334168B2 (en) Threshold determination in a RANSAC algorithm
Ma et al. Stage-wise salient object detection in 360 omnidirectional image via object-level semantical saliency ranking
US9247139B2 (en) Method for video background subtraction using factorized matrix completion
CN110827312B (en) Learning method based on cooperative visual attention neural network
Sun et al. Learning local quality-aware structures of salient regions for stereoscopic images via deep neural networks
Yin et al. Or-nerf: Object removing from 3d scenes guided by multiview segmentation with neural radiance fields
CN110544202A (en) parallax image splicing method and system based on template matching and feature clustering
CN117456136A (en) Digital twin scene intelligent generation method based on multi-mode visual recognition
Pan et al. Multi-stage feature pyramid stereo network-based disparity estimation approach for two to three-dimensional video conversion
Habtegebrial et al. Fast view synthesis with deep stereo vision
Zhu et al. Occlusion-free scene recovery via neural radiance fields
Huang et al. ES-Net: An efficient stereo matching network
Wan et al. Boosting image-based localization via randomly geometric data augmentation
CN112084855A (en) Outlier elimination method for video stream based on improved RANSAC method
Long et al. Detail preserving residual feature pyramid modules for optical flow
Zhou et al. Lrfnet: an occlusion robust fusion network for semantic segmentation with light field
Chao et al. Occcasnet: occlusion-aware cascade cost volume for light field depth estimation
Halperin et al. Clear Skies Ahead: Towards Real‐Time Automatic Sky Replacement in Video
Chen et al. Accurate 3D motion tracking by combining image alignment and feature matching
Hausler et al. Reg-NF: Efficient registration of implicit surfaces within neural fields
Chen et al. MoCo‐Flow: Neural Motion Consensus Flow for Dynamic Humans in Stationary Monocular Cameras

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant