WO2021107254A1

WO2021107254A1 - Method and apparatus for estimating depth of monocular video image

Info

Publication number: WO2021107254A1
Application number: PCT/KR2019/018413
Authority: WO
Inventors: 함범섭; 엄찬호; 박현종
Original assignee: 연세대학교 산학협력단
Priority date: 2019-11-29
Filing date: 2019-12-24
Publication date: 2021-06-03
Also published as: KR102262832B1

Abstract

Disclosed are a method and apparatus for estimating the depth of a monocular video image. The disclosed apparatus comprises: a spatial feature encoder network module for generating a spatial feature map through a neural network operation on a current frame image; a temporal feature encoder network module for generating a temporal feature map through a neural network operation on an optical flow image of the current frame image and a previous frame image; a flow guide memory module for generating a depth feature map for the current frame image through a neural network operation using the spatial feature map and the temporal feature map; and a decoder network module for generating a depth map through a neural network operation on the depth feature map, wherein the flow guide memory module uses an RNN, corrects a previous state feature map used in the RNN through warping based on the temporal feature map, and performs the neural network operation using the corrected previous state feature map instead of the previous state feature map. The disclosed apparatus and method can estimate depth accurately by considering the correlation between frames in the monocular video image.

Description

Method and apparatus for estimating depth of monocular video image

The present invention relates to an apparatus and method for estimating depth, and more particularly, to an apparatus and method for estimating depth of a monocular video image.

Depth estimation is an essential technology in autonomous driving and driver assistance systems. During autonomous driving, real-time depth estimation is required to determine the terrain structure and the exact locations of surrounding vehicles and obstacles.

Depth is commonly estimated through stereo matching. Stereo matching estimates depth from left and right images captured by two cameras, by calculating the displacement between corresponding pixels of the left image and the right image.

However, stereo matching always requires two cameras, and accurate depth estimation is possible only when the two cameras are precisely aligned, which makes it difficult to use in practice.

Various methods have been proposed for estimating depth using a monocular camera. The relative size of objects, the degree of texture change, occluded areas, and similar cues provide depth information even in a monocular image, and using such information makes depth estimation possible without a stereo image.

Meanwhile, with recent advances in deep learning research, various methods for estimating the depth of a monocular image through neural network computation have been proposed.

However, conventional methods of estimating the depth of a monocular image using a neural network estimate depth independently for each frame. Although successive frames are highly correlated with each other, this inter-frame correlation has not been well reflected in monocular depth estimation, which has been one of the main causes of inaccurate depth estimation in video images.

The present invention proposes a depth estimation apparatus and method capable of accurately estimating depth in consideration of inter-frame correlation in a monocular video image.

In order to achieve the above object, according to an aspect of the present invention, there is provided an apparatus for estimating depth of a monocular video image, comprising: a spatial feature encoder network module for generating a spatial feature map through a neural network operation on a current frame image; a temporal feature encoder network module for generating a temporal feature map through neural network operation on the optical flow image of the current frame image and the previous frame image; a flow guide memory module for generating a depth feature map for the current frame image through a neural network operation using the spatial feature map and the temporal feature map; and a decoder network module for generating a depth map through neural network operation on the depth feature map, wherein the flow guide memory module uses an RNN, corrects a previous state feature map used in the RNN through warping based on the temporal feature map, and performs the neural network operation using the corrected previous state feature map instead of the previous state feature map.

The apparatus may further include an optical flow correction network module for generating a corrected temporal feature map through neural network operation on the current frame image, the optical flow image, and the previous frame image.

The flow guide memory module corrects the previous state feature map through warping based on the corrected temporal feature map instead of the temporal feature map.

The value of the corrected previous state feature map is adjusted by a mask feature map, and the mask feature map is a feature map reflecting the reliability of the temporal feature map or the corrected temporal feature map.

The reliability is calculated based on a difference between the image of the current frame and the image obtained by warping the previous frame image based on the temporal feature map or the corrected temporal feature map.

In the RNN of the flow guide memory module, the current state feature map ($h^t$), the corrected previous state feature map, the reset gate ($r^t$), the update gate ($z^t$), and the candidate state feature map are calculated as in the following equation.

In the above equation, σ denotes the sigmoid function, * denotes convolution, the gates are applied through element-wise multiplication, $x^t$ denotes the input feature map, which combines the spatial feature map and the temporal feature map, W denotes a preset weight, and b denotes a preset bias value. The corrected previous state feature map is obtained by warping the previous state feature map using the temporal feature map or the corrected temporal feature map, and $M^t$ is the mask feature map.

The mask feature map is set as follows.

In the above equation, p denotes a pixel, $I^t(p)$ is the current frame image, which is compared with the warped previous frame image, and ε is an arbitrarily set constant.

The spatial feature encoder network module and the temporal feature encoder network module generate the spatial feature map and the temporal feature map, respectively, using CNNs.

According to another aspect of the present invention, there is provided a method for estimating depth of a monocular video image, comprising: (a) generating a spatial feature map through a neural network operation on a current frame image; (b) generating a temporal feature map through neural network operation on the optical flow image of the current frame image and the previous frame image; (c) generating a depth feature map for the current frame image through a neural network operation using the spatial feature map and the temporal feature map; and (d) generating a depth map through neural network operation on the depth feature map, wherein step (c) uses an RNN, corrects a previous state feature map used in the RNN through warping based on the temporal feature map, and performs the neural network operation using the corrected previous state feature map instead of the previous state feature map.

According to the present invention, there is an advantage in that depth can be accurately estimated in consideration of inter-frame correlation in a monocular video image.

FIG. 1 is a diagram illustrating a neural network structure constituting an apparatus for estimating depth of a monocular video image according to a first embodiment of the present invention.

FIG. 2 is a diagram illustrating a neural network structure for depth estimation of a monocular video image according to a second embodiment of the present invention.

FIG. 3 is a diagram illustrating an operation structure of an optical flow correction network module according to an embodiment of the present invention.

FIG. 4 is a diagram illustrating an operation structure of a flow guide memory module according to an embodiment of the present invention.

FIG. 5 is a diagram conceptually illustrating warping in a flow guide memory module according to an embodiment of the present invention.

FIG. 6 is a flowchart illustrating an overall flow of a method for estimating depth of a monocular video image according to a second embodiment of the present invention.

Hereinafter, the present invention will be described with reference to the accompanying drawings. However, the present invention may be embodied in several different forms, and thus is not limited to the embodiments described herein.

In the drawings, parts irrelevant to the description are omitted in order to clearly explain the present invention, and similar reference numerals are attached to similar parts throughout the specification.

Throughout the specification, when a part is "connected" with another part, this includes not only the case of being "directly connected" but also the case of being "indirectly connected" with another member interposed therebetween.

Also, when a part "includes" a certain component, it means that other components may be further included, rather than excluding other components, unless otherwise stated.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

The depth estimation apparatus and method of the present invention estimate the depth of each pixel of a monocular video image from the monocular video image itself. Unlike a stereo image, a monocular video image carries no binocular disparity information, which makes accurate depth estimation difficult. However, stereo images cannot always be acquired, and acquiring accurately aligned stereo images is not easy either. Monocular depth estimation therefore remains in demand, especially in fields that require real-time depth information, such as autonomous driving.

A video image consists of a plurality of frames, and adjacent frames (e.g., frame t-1, frame t, and frame t+1) are highly correlated. Conventionally, however, depth estimation in a video image is performed independently for each frame, so the depth estimated in the previous frame (frame t-1) does not affect the depth estimation in the current frame (frame t). Because adjacent frames are correlated, the depths of the previous frame and the current frame are also correlated, but this correlation is not taken into account in conventional depth estimation of monocular video images. This also causes flickering in the depth maps that conventional methods generate for monocular video images.

The apparatus and method for estimating the depth of a monocular video image according to the present invention provide a configuration in which the correlation of depth between frames can be reflected. It should be noted, however, that because a video is a moving picture, objects may move between frames, and the viewpoint may change due to the movement of the camera. Considering only the correlation of depth without reflecting the movement of objects or the camera may instead lead to inaccurate depth estimation. The present invention therefore proposes a depth estimation method that reflects inter-frame correlation together with the motion of such objects or the camera.

Referring to FIG. 1, the apparatus for estimating the depth of a monocular video image according to the first embodiment of the present invention includes a spatial feature encoder network module 100, a temporal feature encoder network module 110, a flow guide memory module 120, and a decoder network module 130.

Two images are input to the monocular video image depth estimation apparatus of the present invention. One is the input image ($I^t$) of frame t, and the other is the optical flow image ($O^t$) of frame t. Here, the optical flow image of frame t refers to an image reflecting the result of the optical flow calculation between the previous frame (frame t-1) image and the current frame (frame t) image. The optical flow values of the input optical flow image may be calculated in various ways, and the essence of the present invention is not affected by how the optical flow image is generated. A well-known optical flow calculation algorithm may be used, or a separate neural network may be used for the optical flow calculation.

The spatial feature encoder network module 100 generates a spatial feature map for the current frame image through neural network operation. The spatial feature encoder network module 100 may use various known neural networks for generating a spatial feature map. As an example, the spatial feature encoder network module 100 may include a convolutional neural network (CNN) that generates a feature map by applying convolutional kernels to the current frame, but is not limited thereto.

The spatial feature encoder network module 100 may generate a final feature map while reducing the dimension of the input image. For example, a spatial feature map 1/4 the size of the input image may be output by the spatial feature encoder network module 100. The neural network weights of the spatial feature encoder network module 100 are set through learning, and the learning method will be described later.
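As a rough illustration of how an encoder can reach a feature map 1/4 the size of its input, the sketch below applies the standard convolution output-size formula twice with stride 2. The kernel size, padding, and the choice of two stride-2 layers are assumptions made for illustration only; the source does not specify the layer configuration.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int, dilation: int = 1) -> int:
    """Standard output-size formula for a convolution along one axis."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def encoder_out_size(size: int, num_stride2_layers: int = 2) -> int:
    """Size after a stack of 3x3, stride-2, padding-1 convolutions (assumed layout)."""
    for _ in range(num_stride2_layers):
        size = conv_out_size(size, kernel=3, stride=2, padding=1)
    return size


# Two stride-2 layers reduce each side to a quarter of its original length.
height, width = encoder_out_size(256), encoder_out_size(640)
```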

The temporal feature encoder network module 110 generates a temporal feature map for the optical flow image through neural network operation. The feature map output from the temporal feature encoder network module 110 is called a temporal feature map because optical flow is defined as the degree of movement of each pixel over time; the temporal feature map does not represent temporal information itself.

The temporal feature encoder network module 110 may also use a variety of known neural networks for generating a temporal feature map, and a convolutional neural network (CNN) that generates a feature map by applying convolutional kernels to the optical flow image may be used.

The temporal feature encoder network module 110 may also generate a final feature map while reducing the dimension of the optical flow image. For example, a temporal feature map 1/4 the size of the input optical flow image may be output by the temporal feature encoder network module 110. The neural network weights of the temporal feature encoder network module 110 are also set through learning.

The spatial feature map and the temporal feature map are complementary feature information. The spatial feature map includes features for the shapes of existing objects in the image and layouts of the background. In addition, the temporal feature map includes individual motion trajectory information of each pixel according to a change in a frame.

According to a preferred embodiment of the present invention, a dilated convolution, which expands the receptive field, is preferably used in the convolution operations applied in the spatial feature encoder network module 100 and the temporal feature encoder network module 110. When dilated convolution is used, the loss of spatial resolution and of scene detail can be minimized, so it can be more effective for a depth estimation neural network such as that of the present invention.

In the dilated convolution operation, the dilation rate and the size of the receptive field that depends on it may be adjusted as appropriate.
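The relationship between the dilation rate and the receptive field can be sketched in one dimension. The example below is illustrative only: the function names are hypothetical, and it uses the cross-correlation form (no kernel flip) that deep learning frameworks conventionally call convolution.

```python
def effective_kernel_size(kernel: int, dilation: int) -> int:
    """Span of input samples covered by one dilated kernel application."""
    return dilation * (kernel - 1) + 1


def dilated_conv1d(signal, weights, dilation=1):
    """Valid (no padding) 1-D dilated convolution over a list of numbers."""
    span = effective_kernel_size(len(weights), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        # Taps are spaced `dilation` samples apart, so the same three weights
        # see a wider context without any extra parameters.
        out.append(sum(w * signal[start + i * dilation] for i, w in enumerate(weights)))
    return out


dense = dilated_conv1d([1, 2, 3, 4, 5, 6], [1, 1, 1], dilation=1)
dilated = dilated_conv1d([1, 2, 3, 4, 5, 6], [1, 1, 1], dilation=2)
```

With a 3-tap kernel, a dilation rate of 2 widens the covered span from 3 to 5 samples while the number of weights stays the same.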

The flow guide memory module 120 receives the spatial feature map output from the spatial feature encoder network module 100 and the temporal feature map output from the temporal feature encoder network module 110, and generates a depth feature map through neural network operation. The flow guide memory module 120 receives the temporal feature maps and the spatial feature maps sequentially, frame by frame (t-1, t, t+1, and so on), and generates a depth feature map.

Preferably, a feature map obtained by concatenating a temporal feature map and a spatial feature map is input to the flow guide memory module 120 . For coupling between feature maps, it is preferable that the dimension of the temporal feature map and the dimension of the depth feature map are the same.
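A minimal sketch of this concatenation step, assuming a channels-first (C, H, W) layout and NumPy arrays; the channel counts and the function name are illustrative and do not come from the source.

```python
import numpy as np


def concat_features(spatial: np.ndarray, temporal: np.ndarray) -> np.ndarray:
    """Concatenate two (C, H, W) feature maps along the channel axis."""
    if spatial.shape[1:] != temporal.shape[1:]:
        # Channel-wise concatenation requires matching height and width.
        raise ValueError("feature maps must share the same spatial size")
    return np.concatenate([spatial, temporal], axis=0)


spatial_map = np.zeros((64, 8, 16))    # hypothetical spatial feature map
temporal_map = np.ones((32, 8, 16))    # hypothetical temporal feature map
x_t = concat_features(spatial_map, temporal_map)   # input to the memory module
```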

As described above, successive frames are not independent of each other but are correlated. In the present invention, a recurrent neural network (RNN) is used as the flow guide memory module 120 to generate a depth feature map in consideration of such correlation. Compared to a general CNN network, the RNN network can generate a depth feature map that more accurately reflects the correlation or dependency between frames.

RNN networks include various types of networks. Basic RNN networks include Long Short-term Memory (LSTM) and Gated Recurrent Unit (GRU). Also, recently, ConvLSTM and ConvGRU in which convolution is reflected in LSTM and GRU, respectively, are sometimes used.

According to a preferred embodiment of the present invention, ConvGRU may be used among RNN networks. The reason ConvGRU is advantageous among RNN networks is that ConvGRU does not cause much loss of spatial resolution and is advantageous in terms of memory usage. Of course, it will be apparent to those skilled in the art that other types of RNN networks may be used.

A detailed operation structure of the flow guide memory module 120 will be described later with reference to a separate drawing.

The depth feature map output from the flow guide memory module 120 is input to the decoder network module 130 . The decoder network module 130 generates a final depth map through neural network operation on the input depth feature map. The decoder network module 130 may perform decoding using CNN as an example, but is not limited thereto. The decoder network module 130 may generate the depth map while extending the dimension of the depth feature map like a general decoder network.

According to a preferred embodiment of the present invention, so that decoding can reflect the feature information produced during encoding, skip connections may be made from the feature maps generated for each layer in the spatial feature encoder network module 100 and the feature maps generated for each layer in the temporal feature encoder network module 110. A skip connection combines, during decoding, the feature maps generated in the encoding process and uses the result for decoding in the next layer. Since skip connections are widely used in the encoding and decoding of various neural networks, a detailed description thereof is omitted.

As a result, it can be said that the depth estimation apparatus according to the first embodiment of the present invention consists of four neural networks, and the final depth map is output through the decoder network module 130 .

The weights of the four neural networks constituting the depth estimation apparatus of the present invention may be learned by calculating the loss for the output depth map and backpropagating it. The loss is backpropagated in reverse order, from the decoder network module 130 to the flow guide memory module 120 and then to the spatial feature encoder network module 100 and the temporal feature encoder network module 110, and the weights are updated in the direction that minimizes the loss.

Various known methods may be used for the loss operation for learning, and the loss operation applied in the present invention will be described later.

Referring to FIG. 2, the apparatus for estimating the depth of a monocular video image according to the second embodiment of the present invention includes a spatial feature encoder network module 100, a temporal feature encoder network module 110, a flow guide memory module 120, a decoder network module 130, and an optical flow correction network module 200.

Compared to the first embodiment, the depth estimation apparatus according to the second embodiment of the present invention additionally includes the optical flow correction network module 200. Except for the optical flow correction network module 200, the operation of the other modules is the same as in the first embodiment.

The optical flow correction network module 200 generates a corrected temporal feature map through neural network operation. With the optical flow image alone, it is difficult to generate a temporal feature map that reflects accurate optical flow. To generate an accurate temporal feature map, the optical flow correction network module 200 receives the current frame image ($I^t$), the previous frame image ($I^{t-1}$), and the optical flow image ($O^t$), and generates the corrected temporal feature map through neural network operation.

The corrected temporal feature map is used for warping the previous state feature map in the flow guide memory module 120 . Warping of the previous state feature map will be described in detail with reference to a separate drawing.

FIG. 3 is a diagram illustrating an operation structure of an optical flow correction network module according to an embodiment of the present invention.

Referring to FIG. 3, the current frame image ($I^t$), the previous frame image ($I^{t-1}$), and the optical flow image ($O^t$) are input to the optical flow correction network module 200 in concatenated form.

A feature map is generated for each layer through convolutional encoding of the input. The feature map of the first layer has the same size as the concatenated input image. $R^t_1$, the output of the first layer, is obtained through convolutional encoding after the first layer feature map and the optical flow image $O^t$ are combined.

The feature map of the second layer is generated through convolutional encoding of the combination of $R^t_1$, the output of the first layer, and the feature map of the first layer. The feature map of the second layer may be down-sampled to 1/2 the size of the feature map of the first layer. $R^t_2$, the output of the second layer, is obtained through convolutional encoding of the combination of the second layer feature map and the 1/2 down-sampled optical flow image.

The feature map of the third layer is generated through convolutional encoding of the combination of $R^t_2$, the output of the second layer, and the feature map of the second layer. The third layer feature map may be down-sampled to 1/2 the size of the second layer feature map.

$R^t_3$, the output of the third layer, is obtained through convolutional encoding of the combination of the third layer feature map and the 1/4 down-sampled optical flow image. $R^t_3$ is the final corrected temporal feature map, and it may be defined as the corrected optical flow.

According to an embodiment of the present invention, learning for the optical flow correction network module 200 may be performed using two losses. The first loss is a photometric consistency loss and the second loss is a smoothness loss.

The photometric consistency loss is calculated from the similarity between the current frame image ($I^t$) and the image obtained by applying the optical flow output from each layer ($R^t_1$, $R^t_2$, $R^t_3$) to $I^{t-1}$ through warping. Here, W() represents the warping function.

Specifically, the photometric consistency loss may be calculated as in Equation 1 below.

In Equation 1 above, $N_i$ is the number of all pixels, p represents a pixel, SSIM is a function for calculating structural similarity, and β is a balance constant selected from the range 0 to 1.
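For orientation, a photometric consistency loss built from SSIM and a balance constant is often written in the following form in the literature. This is a sketch and may differ from Equation 1 of the source: the symbol $\hat{I}^{t}_{i}$, the sum over the three layers, the division of the SSIM term by 2, and the use of an absolute difference for the second term are all assumptions.

```latex
% Assumed notation: \hat{I}^{t}_{i} = W(I^{t-1}; R^{t}_{i}) is the previous frame
% warped with the flow output by layer i, and N_i is the pixel count at that scale.
L_{pc} = \sum_{i=1}^{3} \frac{1}{N_i} \sum_{p}
  \left[ \beta \, \frac{1 - \mathrm{SSIM}\!\left(I^{t}(p), \hat{I}^{t}_{i}(p)\right)}{2}
       + (1 - \beta) \left| I^{t}(p) - \hat{I}^{t}_{i}(p) \right| \right]
```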

The second loss, the smoothness loss, measures the degree of smoothness of the optical flow image and is commonly used for learning in the image field. As an example, the smoothness loss may be calculated as in Equation 2 below.

In the above equation, τ is an arbitrarily set constant.

Of course, the optical flow correction network module 200 may be trained in various ways other than with the photometric consistency loss and smoothness loss described above, and those skilled in the art will understand that a change in the learning method does not affect the spirit of the present invention.

In short, the difference between the first embodiment and the second embodiment lies in whether the previous state feature map is warped in the flow guide memory module 120 using the temporal feature map output from the temporal feature encoder network module 110 or using the corrected temporal feature map output from the optical flow correction network module 200.

The first embodiment warps the previous state feature map in the flow guide memory module 120 using the temporal feature map obtained from the temporal feature encoder network module 110, without acquiring a separate corrected temporal feature map. The second embodiment, in contrast, warps the previous state feature map in the flow guide memory module 120 using the corrected optical flow rather than the temporal feature map.

Note, however, that even in the second embodiment, the input to the flow guide memory module 120 is the temporal feature map output from the temporal feature encoder network module 110, not the corrected temporal feature map.

FIG. 4 is a diagram illustrating an operation structure of a flow guide memory module according to an embodiment of the present invention.

FIG. 4 shows an operational structure of the flow guide memory module 120 using ConvGRU as an example according to an embodiment of the present invention. However, as described above, various RNNs may be used in addition to ConvGRU.

Five values are used in ConvGRU: $h^t$ is the current state feature map, $h^{t-1}$ is the previous state feature map, $r^t$ is the reset gate, $z^t$ is the update gate, and the fifth value is the candidate state feature map.

In the conventional ConvGRU, the current state feature map, the previous state feature map, the reset gate , the update gate and the candidate state feature map are calculated as shown in Equation 3 below.

In Equation 3 above, σ denotes the sigmoid function, * denotes convolution, the gates are applied through element-wise multiplication, and $x^t$ denotes the input feature map. In the present invention, the feature map combining the spatial feature map and the temporal feature map serves as $x^t$.
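For reference, a ConvGRU is commonly written in the following form. This is a sketch of the standard formulation from the literature, not a reproduction of Equation 3: the symbols $\odot$ (element-wise multiplication) and $\tilde{h}^{t}$ (candidate state), the weight names $W_{\ast}$ and $U_{\ast}$, and the choice of which term the update gate weights are assumptions.

```latex
\begin{aligned}
z^{t} &= \sigma\!\left(W_z * x^{t} + U_z * h^{t-1} + b_z\right) \\
r^{t} &= \sigma\!\left(W_r * x^{t} + U_r * h^{t-1} + b_r\right) \\
\tilde{h}^{t} &= \tanh\!\left(W_h * x^{t} + U_h * \left(r^{t} \odot h^{t-1}\right) + b_h\right) \\
h^{t} &= \left(1 - z^{t}\right) \odot h^{t-1} + z^{t} \odot \tilde{h}^{t}
\end{aligned}
```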

As described above, when the conventional ConvGRU is used, it is difficult to reflect the motion that occurs in the time interval between frames. The previous frame and the current frame are correlated with each other, and so are their depths, but this correlation of depth is most accurate when the current frame and the previous frame are in the same state. Accordingly, when an object or the camera moves between the previous frame and the current frame, it is preferable to consider the correlation of depth while compensating for the movement.

In the present invention, the previous state feature map is corrected through warping using the temporal feature map or the corrected temporal feature map, and the resulting corrected state feature map is used instead of the previous state feature map.

In short, the flow guide memory module 120 of the present invention corrects the previous state feature map based on the temporal feature map or the corrected temporal feature map, and then applies the corrected state feature map to generate the current state feature map.

The current state feature map, the corrected state feature map, the reset gate, the update gate, and the candidate state feature map according to the present invention are calculated as in Equation 4 below.

In Equation 4, the warped feature map is obtained by warping the previous state feature map using the temporal feature map or the corrected temporal feature map. In addition, W is a preset weight, and b is a preset bias value.
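The update described above can be sketched as a single recurrent step in NumPy. This is an illustrative simplification, not the source's Equation 4: 1x1 convolutions stand in for the convolution kernels, the parameter names in `params` are hypothetical, and the gate convention follows the common ConvGRU form. The only change from a conventional step is that the previous state is replaced by the warped state multiplied by the mask.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def conv1x1(weight, feature):
    """1x1 convolution: weight (C_out, C_in) applied to feature (C_in, H, W)."""
    return np.einsum("oc,chw->ohw", weight, feature)


def flow_guided_gru_step(x, h_prev_warped, mask, params):
    """One GRU update that uses the masked, warped previous state.

    x:             (C_x, H, W) concatenated spatial and temporal features
    h_prev_warped: (C_h, H, W) previous state after flow-based warping
    mask:          (H, W) per-pixel reliability of the warping
    params:        dict of weights and biases (names are assumptions)
    """
    h_bar = mask * h_prev_warped  # corrected previous state
    z = sigmoid(conv1x1(params["Wz"], x) + conv1x1(params["Uz"], h_bar) + params["bz"])
    r = sigmoid(conv1x1(params["Wr"], x) + conv1x1(params["Ur"], h_bar) + params["br"])
    h_cand = np.tanh(conv1x1(params["Wh"], x) + conv1x1(params["Uh"], r * h_bar) + params["bh"])
    return (1.0 - z) * h_bar + z * h_cand  # current state feature map
```

Where the mask is close to zero, the warped memory is suppressed and the new state depends mostly on the current input.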

Meanwhile, the mask feature map $M^t$ is applied to the warped feature map in Equation 4 above. The mask feature map may be defined as a feature map indicating the reliability of warping for each pixel p. If the difference between the current frame image $I^t(p)$ and the warped previous frame image is large, the reliability of warping will be low, and if the difference is small, the reliability of warping will be high. Based on this fact, the mask feature map $M^t$ may be defined as in Equation 5 below.

In Equation 5 above, ε is an arbitrarily set constant, and the width of the exponential function is determined by ε.
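One plausible way to turn this description into code is an exponential of the per-pixel image difference. The exact form of Equation 5 is not recoverable from the text, so the absolute difference summed over color channels and the placement of ε as a multiplier inside the exponent are assumptions.

```python
import numpy as np


def warping_mask(current, warped_prev, eps=1.0):
    """Per-pixel warping reliability in (0, 1].

    current, warped_prev: (H, W, 3) images with values in [0, 1]
    eps: constant controlling how quickly reliability decays (assumed role)
    """
    diff = np.abs(current - warped_prev).sum(axis=-1)  # (H, W) photometric error
    return np.exp(-eps * diff)  # 1 where the images agree, near 0 where they differ
```

Pixels where the warped previous frame matches the current frame receive a value of 1, and the value falls toward 0 as the difference grows.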

Referring to FIG. 5, a case in which a vehicle moves between frame t-1 and frame t is illustrated. In the existing ConvGRU method, the previous state feature map $h^{t-1}$ is used as it is, so it is difficult to use a depth correlation that reflects the vehicle's motion for depth estimation.

To solve this problem, the present invention warps the previous state feature map $h^{t-1}$ using the optical flow (temporal feature map) and calculates the depth feature map using the resulting corrected state feature map.
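The warping operation itself can be sketched with NumPy indexing. This is a minimal illustration under stated assumptions: the flow is treated as a backward flow (for each pixel of frame t, the displacement to its source location in frame t-1), nearest-neighbor sampling replaces the bilinear sampling that practical implementations usually use, and out-of-range coordinates are clamped to the border.

```python
import numpy as np


def warp_with_flow(prev_state, flow):
    """Warp a (C, H, W) feature map with a (2, H, W) flow field.

    flow[0] holds horizontal (x) displacements and flow[1] vertical (y)
    displacements, both in pixels (assumed convention).
    """
    _, h, w = prev_state.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Source coordinates in the previous frame, rounded and clamped to the image.
    src_x = np.clip(np.rint(xs + flow[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[1]).astype(int), 0, h - 1)
    return prev_state[:, src_y, src_x]
```

With a zero flow the state is returned unchanged; with a uniform flow of -1 in x, each output pixel takes the value one pixel to its left in the previous state, so content appears shifted to the right.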

Meanwhile, to train the neural networks constituting the depth estimation apparatus according to the embodiment of the present invention shown in FIG. 1, a loss corresponding to the difference between the ground-truth depth map ($G^t(p)$) and the depth map output through the decoder network module may be used.

For more accurate learning, the loss for the difference from the ground truth ($L^D$) and a smoothness loss ($L^{DS}$) may be used together.

According to an embodiment of the present invention, the loss for the difference from the ground-truth depth map may be calculated as in Equation 6 below.

In Equation 6 above, $D^t(p)$ is the output depth map, and $G^t(p)$ is the ground-truth depth map.

In Equation 6 above, the first term represents the difference between the output depth map and the ground-truth depth map. However, it is very difficult to obtain the true depth of every pixel in a monocular video image sequence. The second term in Equation 6 is a term for alleviating this problem: the product of s(p) and s(q) is summed over pixel pairs p and q. α is a balance constant having a value from 0 to 1, and N is the number of all pixels.
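The two-term structure described here resembles the scale-invariant depth loss known from the literature, which can be sketched as follows. Treating s(p) as the log-depth difference and normalizing the terms by N and N squared are assumptions; the source's exact equation is not reproduced in the text.

```python
import numpy as np


def depth_difference_loss(pred, gt, alpha=0.5):
    """Two-term depth loss with a pairwise term weighted by alpha.

    pred, gt: arrays of strictly positive depths with the same shape
    alpha:    balance constant in [0, 1]
    """
    s = np.log(pred) - np.log(gt)  # assumed definition of s(p)
    n = s.size
    # The sum over all pixel pairs of s(p) * s(q) equals (sum of s) squared.
    pairwise = s.sum() ** 2
    return (s ** 2).sum() / n - alpha * pairwise / n ** 2
```

With alpha set to 1, a prediction that differs from the ground truth only by a global scale factor incurs zero loss, which is why the second term relaxes the need for exact absolute depths.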

Also, the smoothness loss is calculated to prevent discontinuity in depth, and may be calculated as in Equation 7 below.

In the example described above, the loss backpropagated for training the depth estimation apparatus shown in FIG. 1 is calculated as the sum of the loss for the difference from the ground truth ($L^D$) and the smoothness loss ($L^{DS}$).

Referring to FIG. 6 , first, a spatial feature map is generated by inputting the current frame image to the spatial feature encoder network module 100 (step 600).

In addition, an optical flow image obtained using the current frame image and the previous frame image is input to the temporal feature encoder network module 110 to generate a temporal feature map (step 602).

Meanwhile, a corrected temporal feature map is generated by inputting the current frame image, the previous frame image, and the optical flow image to the optical flow calibration network module 200 (step 604).

The spatial feature map generated in step 600 and the temporal feature map generated in step 602 are combined with each other and input to the flow guide memory module 120, and the flow guide memory module 120 generates a depth feature map through neural network operation (step 606). The flow guide memory module 120 uses an RNN. The flow guide memory module 120 generates a corrected state feature map by warping the previous state feature map, which is used for updating the current state feature map in the RNN, using the optical flow of the corrected temporal feature map generated in step 604, and uses the corrected state feature map to update the current state feature map.

The depth feature map generated in step 606 is input to the decoder network module 130, and the decoder network module 130 generates a depth map through neural network operation (step 608).

Meanwhile, the overall flow in FIG. 6 has been described using the second embodiment as an example. As explained above, the first embodiment differs only in that the temporal feature map output from the temporal feature encoder network module is used for warping the previous state feature map of the RNN.
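The data flow of steps 600 to 608 for one frame may be sketched as follows. This is an illustrative outline only: the `modules` dictionary and its keys are hypothetical names for the network modules 100, 110, 200, 120, and 130, each assumed to be a callable, and the current state feature map is assumed to serve as the depth feature map. When no correction module is supplied, the temporal feature map itself guides the warping, which corresponds to the first embodiment.

```python
import numpy as np

def estimate_depth_step(frame_t, frame_prev, flow_img, h_prev, modules):
    """Run one frame through the pipeline of FIG. 6 (hypothetical interfaces)."""
    spatial = modules["spatial_encoder"](frame_t)      # step 600
    temporal = modules["temporal_encoder"](flow_img)   # step 602
    correct = modules.get("flow_correction")
    if correct is not None:
        # Second embodiment: use the corrected temporal feature map (step 604).
        guide = correct(frame_t, frame_prev, flow_img)
    else:
        # First embodiment: warp with the temporal feature map itself.
        guide = temporal
    # Combine spatial and temporal feature maps along the channel axis.
    x_t = np.concatenate([spatial, temporal], axis=-1)
    h_t = modules["flow_guide_memory"](x_t, h_prev, guide)  # step 606
    depth = modules["decoder"](h_t)                         # step 608
    return depth, h_t
```

The returned `h_t` is passed back as `h_prev` when the next frame is processed, which is how information from earlier frames reaches the current depth estimate.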

The above description of the present invention is for illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can be easily modified into other specific forms without changing the technical spirit or essential features of the present invention.

Therefore, it should be understood that the embodiments described above are illustrative in all respects and not restrictive.

For example, each component described as a single unit may be implemented in a distributed form, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is indicated by the following claims, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as being included in the scope of the present invention.

Claims

An apparatus for estimating depth of a monocular video image, comprising:

a spatial feature encoder network module that generates a spatial feature map through a neural network operation on a current frame image;

a temporal feature encoder network module that generates a temporal feature map through a neural network operation on an optical flow image of the current frame image and a previous frame image;

a flow guide memory module that generates a depth feature map for the current frame image through a neural network operation using the spatial feature map and the temporal feature map; and

a decoder network module that generates a depth map through a neural network operation on the depth feature map,

wherein the flow guide memory module uses an RNN, corrects a previous state feature map used in the RNN through warping based on the temporal feature map, and performs the neural network operation using the corrected previous state feature map instead of the previous state feature map.
According to claim 1,

further comprising an optical flow correction network module that generates a corrected temporal feature map through a neural network operation on the current frame image, the optical flow image, and the previous frame image.
3. The apparatus of claim 2,

wherein the flow guide memory module corrects the previous state feature map through warping based on the corrected temporal feature map instead of the temporal feature map.
4. The apparatus of claim 3,

wherein the value of the corrected previous state feature map is adjusted by a mask feature map, and the mask feature map is a feature map reflecting the reliability of the temporal feature map or the corrected temporal feature map.
According to claim 1,

wherein the reliability is calculated based on a difference between the current frame image and the previous frame image warped based on the temporal feature map or the corrected temporal feature map.
According to claim 1,

wherein the RNN of the flow guide memory module sets a current state feature map (h_t), a corrected previous state feature map, a reset gate (r_t), an update gate, and a candidate state feature map as in the following equation.

In the above equation, σ denotes a sigmoid function, ⊙ denotes element-wise multiplication, * denotes convolution, x_t denotes an input feature map that combines the spatial feature map and the temporal feature map, W denotes a preset weight, b denotes a preset bias value, the corrected previous state feature map is a feature map obtained by warping the previous state feature map using the temporal feature map or the corrected temporal feature map, and M_t is a mask feature map.
According to claim 1,

wherein the mask feature map is set by the following equation.

In the above equation, p means a pixel, I t 3 (p) is the current frame image,
is the warped previous frame image, and ε is an arbitrarily set constant.
According to claim 1,

wherein the spatial feature encoder network module and the temporal feature encoder network module generate the spatial feature map and the temporal feature map, respectively, using a CNN.
A method for estimating depth of a monocular video image, comprising:

(a) generating a spatial feature map through a neural network operation on a current frame image;

(b) generating a temporal feature map through a neural network operation on an optical flow image of the current frame image and a previous frame image;

(c) generating a depth feature map for the current frame image through a neural network operation using the spatial feature map and the temporal feature map; and

(d) generating a depth map through a neural network operation on the depth feature map,

wherein step (c) uses an RNN, corrects a previous state feature map used in the RNN through warping based on the temporal feature map, and performs the neural network operation using the corrected previous state feature map instead of the previous state feature map.
10. The method of claim 9,

further comprising generating a corrected temporal feature map through a neural network operation on the current frame image, the optical flow image, and the previous frame image.
11. The method of claim 10,

wherein step (c) comprises correcting the previous state feature map through warping based on the corrected temporal feature map instead of the temporal feature map.
12. The method of claim 11,

wherein the value of the corrected previous state feature map is adjusted by a mask feature map, and the mask feature map is a feature map reflecting the reliability of the temporal feature map or the corrected temporal feature map.
10. The method of claim 9,

wherein the reliability is calculated based on a difference between the current frame image and the previous frame image warped based on the temporal feature map or the corrected temporal feature map.
10. The method of claim 9,

wherein the RNN sets a current state feature map (h_t), a corrected previous state feature map, a reset gate (r_t), an update gate, and a candidate state feature map as in the following equation.

In the above equation, σ denotes a sigmoid function, ⊙ denotes element-wise multiplication, * denotes convolution, x_t denotes an input feature map that combines the spatial feature map and the temporal feature map, W denotes a preset weight, b denotes a preset bias value, the corrected previous state feature map is a feature map obtained by warping the previous state feature map using the temporal feature map or the corrected temporal feature map, and M_t is a mask feature map.
According to claim 1,

wherein the mask feature map is set by the following equation.

In the above equation, p means a pixel, I t 3 (p) is the current frame image,
is the warped previous frame image, and ε is an arbitrarily set constant.