CN115953468A - Method, device and equipment for estimating depth and self-movement track and storage medium - Google Patents

Method, device and equipment for estimating depth and self-movement track and storage medium

Info

Publication number
CN115953468A
Authority
CN
China
Prior art keywords
view
training
depth
point cloud
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211578441.0A
Other languages
Chinese (zh)
Inventor
张戈
陈德聪
赵飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China filed Critical Agricultural Bank of China
Priority to CN202211578441.0A priority Critical patent/CN115953468A/en
Publication of CN115953468A publication Critical patent/CN115953468A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Abstract

The invention discloses a method, a device, equipment and a storage medium for estimating depth and a self-motion trajectory. The method comprises: acquiring a preset model, a source view and a target view, wherein the preset model comprises a depth estimation network, a motion estimation network and an implicit cue network, the implicit cue network is used for extracting static and dynamic features between the source view and the target view from the motion estimation network and mapping them to the depth estimation network via an identity mapping, and the source view and the target view are color images of two adjacent frames; inputting the source view and the target view into the preset model; estimating the self-motion trajectory of the camera based on the motion estimation network; and estimating the depth of the source view and/or the target view based on the depth estimation network and the static and dynamic features between the source view and the target view. The scheme provided by the invention can effectively alleviate artifacts of moving objects, improve the quality of monocular depth estimation, reduce pose transformation errors, and estimate the self-motion trajectory of the camera more accurately.

Description

Method, device and equipment for estimating depth and self-motion track and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for estimating a depth and a self-movement trajectory.
Background
The depth of a view and the self-motion trajectory of a camera are important for understanding the geometry of a scene from video or images, and are widely applied in fields such as robot visual navigation, autonomous driving and intelligent transportation.
Existing methods for estimating depth and self-motion trajectories are usually implemented with a depth estimation network and a motion estimation network. The depth estimation network aims to estimate a depth map of a target view, and the motion estimation network aims to estimate a pose transformation matrix of a source view relative to the target view. A reconstructed view is formed from the camera model, the camera parameters, the depth map of the target view and the pose transformation matrix of the source view relative to the target view; the structural similarity (SSIM), an L1 loss and a smoothness loss between the source view and the reconstructed view are then computed to optimize the depth estimation network and the motion estimation network jointly.
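For reference, a minimal PyTorch sketch of this kind of pixel-level reconstruction objective (a hedged illustration only: the weighted SSIM-plus-L1 form, the 3x3 SSIM window and the 0.85 weight follow common self-supervised practice and are assumptions, not taken from this patent):

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified single-scale SSIM dissimilarity over 3x3 neighbourhoods."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return ((1 - num / den) / 2).clamp(0, 1)

def photometric_loss(target, reconstructed, alpha=0.85):
    """Per-pixel weighted SSIM + L1 reprojection loss between a view and its reconstruction."""
    l1 = (target - reconstructed).abs().mean(1, keepdim=True)
    return alpha * ssim(target, reconstructed).mean(1, keepdim=True) + (1 - alpha) * l1

def smoothness_loss(disparity, image):
    """Edge-aware smoothness: penalise disparity gradients except where the image has edges."""
    dx = (disparity[:, :, :, :-1] - disparity[:, :, :, 1:]).abs()
    dy = (disparity[:, :, :-1, :] - disparity[:, :, 1:, :]).abs()
    ix = (image[:, :, :, :-1] - image[:, :, :, 1:]).abs().mean(1, keepdim=True)
    iy = (image[:, :, :-1, :] - image[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()
```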
However, existing methods predict depth from a single image, which provides only static information and does not fully exploit the rich dynamic and static features between video frames. Meanwhile, according to the camera imaging principle, many different scene points can be projected onto the same pixel although they have different depth values; a pixel-level view reconstruction loss alone cannot effectively enforce a consistent scene depth, and because the motion estimation network and the depth estimation network are trained jointly, inconsistent scene depth leads to inconsistent transformation vectors and therefore to drift in the frame-to-frame pose. In addition, existing methods only use the camera pose transformation between video frames and do not fully consider other implicit cues: if the source view and the target view are the same view, that is, the camera does not move, the motion estimation network cannot predict a camera pose transformation; and if a moving object is captured by a moving camera, the moving object violates the static-scene assumption of view reconstruction, the edges of the moving object are not effectively constrained, and a clear contour of the moving object cannot be predicted from a single view.
Disclosure of Invention
The invention provides a method, a device, equipment and a storage medium for estimating depth and a self-motion trajectory, which can effectively alleviate artifacts of moving objects, improve the quality of monocular depth estimation, reduce pose transformation errors, and estimate the self-motion trajectory of the camera more accurately.
According to an aspect of the present invention, there is provided a depth and self-motion trajectory estimation method, including:
acquiring a preset model, a source view and a target view, wherein the preset model comprises a depth estimation network, a motion estimation network and an implicit cue network, the implicit cue network is used for extracting static and dynamic features between the source view and the target view from the motion estimation network and mapping them to the depth estimation network via an identity mapping, and the source view and the target view are color images of two adjacent frames;
inputting a source view and a target view into a preset model;
estimating the self-motion track of the camera based on a motion estimation network; and estimating the depth of the source view and/or the target view based on the depth estimation network and the static and dynamic features between the source view and the target view.
Optionally, before obtaining the preset model, the method further includes:
acquiring a first training view and a second training view;
and training the preset model according to the first training view and the second training view.
Optionally, training the preset model according to the first training view and the second training view includes:
inputting the first training view and the second training view into a preset model to obtain a first reconstruction view and a second reconstruction view;
determining a reprojection loss and a smoothness loss according to the first training view, the second training view, the first reconstructed view and the second reconstructed view;
determining a three-dimensional reconstruction loss;
and training the preset model based on the back propagation and gradient descent principle according to the reprojection loss, the smoothness loss and the three-dimensional reconstruction loss.
Optionally, obtaining the first reconstructed view and the second reconstructed view includes:
respectively determining a first pose transformation matrix T_t->s of the first training view I_s relative to the second training view I_t, a second pose transformation matrix T_s->t of the second training view I_t relative to the first training view I_s, a depth map D_s of the first training view and a depth map D_t of the second training view;
determining, according to the camera parameters K and the depth map D_t of the second training view, the coordinates P_t of the first spatial point cloud of the second training view in the camera coordinate system;
determining, according to the coordinates P_t of the first spatial point cloud and the first pose transformation matrix T_t->s, the coordinates P_s' of the second spatial point cloud of the second training view in the coordinate system of the first training view, based on the Euclidean transformation principle between coordinate systems;
projecting and sampling the second spatial point cloud based on the camera model to obtain the first reconstructed view;
determining, according to the camera parameters K and the depth map D_s of the first training view, the coordinates P_s of the third spatial point cloud of the first training view in the camera coordinate system;
determining, according to the coordinates P_s of the third spatial point cloud and the second pose transformation matrix T_s->t, the coordinates P_t' of the fourth spatial point cloud of the first training view in the coordinate system of the second training view, based on the Euclidean transformation principle between coordinate systems;
and projecting and sampling the fourth spatial point cloud based on the camera model to obtain the second reconstructed view.
Optionally, determining a three-dimensional reconstruction loss comprises:
according to the coordinate P of the first space point cloud t Coordinate P of the second space point cloud s ', coordinate P of the third space point cloud s And coordinates P of a fourth spatial point cloud t ', determining three-dimensional reconstruction loss L = | P t ’-P t |+|P s ’-P s |。
Optionally, the first pose transformation matrix T_t->s = PoseNet(I_s, I_t);
the second pose transformation matrix T_s->t = PoseNet(I_t, I_s);
the depth map of the first training view D_s = DepthNet_Decoder(ICNet(PoseNet_Encoder(I_t, I_s)) + DepthNet_Encoder(I_s));
the depth map of the second training view D_t = DepthNet_Decoder(ICNet(PoseNet_Encoder(I_s, I_t)) + DepthNet_Encoder(I_t));
the coordinates of the first spatial point cloud P_t = K^-1 D_t I_t;
the coordinates of the second spatial point cloud P_s' = T_t->s P_t;
the coordinates of the third spatial point cloud P_s = K^-1 D_s I_s;
the coordinates of the fourth spatial point cloud P_t' = T_s->t P_s;
wherein ICNet denotes the implicit cue network, PoseNet denotes the motion estimation network, PoseNet_Encoder denotes the encoder of the motion estimation network, DepthNet_Encoder denotes the encoder of the depth estimation network, and DepthNet_Decoder denotes the decoder of the depth estimation network.
Optionally, the first training view and the second training view are color images of two adjacent frames; alternatively, the second training view is a color image generated from the first training view to simulate an adjacent frame of the first training view.
According to another aspect of the present invention, there is provided an apparatus for estimating depth and a self-motion trajectory, comprising an acquisition module and an estimation module, wherein:
the acquisition module is configured to acquire a preset model, a source view and a target view, wherein the preset model comprises a depth estimation network, a motion estimation network and an implicit cue network, the implicit cue network is used for extracting static and dynamic features between the source view and the target view from the motion estimation network and mapping them to the depth estimation network via an identity mapping, and the source view and the target view are color images of two adjacent frames;
the estimation module is configured to input the source view and the target view into the preset model, estimate the self-motion trajectory of the camera based on the motion estimation network, and estimate the depth of the source view and/or the target view based on the depth estimation network and the static and dynamic features between the source view and the target view.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the method for estimating depth and self-motion trajectory of any of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement the method for estimating depth and self-motion trajectory according to any one of the embodiments of the present invention when executed.
According to the technical solution of the invention, the preset model is designed to comprise a depth estimation network, a motion estimation network and an implicit cue network, where the implicit cue network extracts static and dynamic features between the source view and the target view from the motion estimation network and maps them to the depth estimation network via an identity mapping, so as to supplement the depth information of a single static view, strengthen the geometric constraints on static objects and compensate for the dynamic characteristics of moving objects; in this way, moving-object artifacts are effectively alleviated and the depth estimation quality of the source view and/or the target view is improved. Meanwhile, the three-dimensional reconstruction loss further constrains the view reconstruction from the perspective of the spatial point clouds, so that the camera transformation process has consistent depth and pose transformations, which effectively reduces pose transformation errors, reduces the accumulated drift of motion trajectories predicted over long videos, and allows the self-motion trajectory of the camera to be estimated more accurately.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a method for estimating depth and self-motion trajectory according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a depth and self-motion trajectory estimation method according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of an apparatus for estimating depth and self-motion trajectory according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of another apparatus for estimating depth and self-motion trajectory according to a third embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," "third," "fourth," "source," "target," and the like in the description and claims of the present invention and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
Fig. 1 is a flowchart of a method for estimating depth and a self-motion trajectory according to an embodiment of the present invention. This embodiment is applicable to estimating the self-motion trajectory of a camera and the depth of a view. The method may be performed by an apparatus for estimating depth and a self-motion trajectory, which may be implemented in the form of hardware and/or software and may be configured in an electronic device (e.g., a computer or a server). As shown in fig. 1, the method includes:
s110, acquiring a preset model, a source view and a target view, wherein the preset model comprises a depth estimation network, a motion estimation network and an implicit cue network, the implicit cue network is used for extracting static and dynamic characteristics between the source view and the target view from the motion estimation network and mapping the static and dynamic characteristics to the depth estimation network in an identical mode, and the source view and the target view are two color images of two frames at adjacent moments.
The preset model is a pre-trained model stored in the estimation device of the depth and the self-motion track, and can be used for estimating the self-motion track of the camera and the depth of the view. Estimating the self-motion track of the camera refers to estimating the motion of the camera relative to a fixed scene, and generally using a pose transformation vector or a pose transformation matrix for representation; estimating the depth of a view refers to estimating a depth map of the view, which is a single-channel two-dimensional image composed of the vertical distances from pixels in the scene to the camera imaging plane.
The preset model comprises a depth estimation network, a motion estimation network and an implicit cue network, wherein the implicit cue network is used for extracting static and dynamic features between the source view and the target view from the motion estimation network and mapping them to the depth estimation network via an identity mapping. In other words, the implicit cue network can obtain implicit cues of the video, that is, data features extracted from video frames by a specific convolutional neural network; for example, the monocular motion parallax cue reflects that nearby objects appear to move quickly while distant objects appear to move slowly.
The source view and the target view are two color images at adjacent time instants. For a monocular video training scene, the source view is typically a color image at time t-1 or time t +1, and the target view is typically a color image at time t.
It should be noted that, before step S110 is executed, the present invention may also train a preset model. Specifically, a first training view and a second training view may be obtained; and training the preset model according to the first training view and the second training view.
And S120, inputting the source view and the target view into a preset model.
S130, estimating the self-motion track of the camera based on a motion estimation network; and estimating the depth of the source view and/or the target view based on the depth estimation network and the static and dynamic features between the source view and the target view.
The implicit cue network can extract static and dynamic features between the source view and the target view from the motion estimation network and map them to the depth estimation network via an identity mapping, thereby supplementing the depth information of a single static view, strengthening the geometric constraints on static objects and compensating for the dynamic characteristics of moving objects; this effectively alleviates moving-object artifacts and improves the depth estimation quality of the source view and/or the target view.
Example two
Fig. 2 is a schematic flow chart of a depth and self-motion trajectory estimation method according to a second embodiment of the present invention, and this embodiment provides a detailed preset model training method based on the first embodiment. As shown in fig. 2, the method includes:
s201, acquiring a first training view and a second training view.
The first training view and the second training view are two color images at adjacent moments; alternatively, the second training view is a color image generated based on the first training view to simulate an adjacent frame of the first training view.
S202, inputting the first training view and the second training view into a preset model to obtain a first reconstruction view and a second reconstruction view.
Specifically, the method for "obtaining the first reconstructed view and the second reconstructed view" in step S202 may include the following 7 steps:
step 1: respectively determining a first training view I s With respect to the second training view I t First bit-position transformation matrix T t->s Second training view I t With respect to the first training view I s Second posture transformation matrix T s->t Depth map D of first training view s And a depth map D of a second training view t
The preset model comprises a depth estimation network DepthNet, a motion estimation network PoseNet and an implicit cue network ICNet.
The encoder of the depth estimation network DepthNet adopts ResNet-18 as its backbone. The decoder uses four convolutional up-sampling blocks to predict disparity maps at 1/8, 1/4, 1/2 and full resolution relative to the input image, and features from the encoding layers are added to the corresponding decoding-layer features through skip connections, achieving multi-scale feature fusion.
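A minimal PyTorch sketch of such an encoder-decoder depth network (assumptions: torchvision's ResNet-18 as the encoder, sigmoid disparity heads, and illustrative decoder widths; the exact number of up-sampling blocks and channel sizes are guesses, since the text does not fix them):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class DepthNet(nn.Module):
    """ResNet-18 encoder plus an up-sampling decoder with skip connections;
    disparity is predicted at the four finest decoder scales (1/8, 1/4, 1/2, full).
    Assumes input height and width are divisible by 32."""

    def __init__(self):
        super().__init__()
        enc = resnet18(weights=None)
        self.stem = nn.Sequential(enc.conv1, enc.bn1, enc.relu)       # 1/2,  64 channels
        self.layer1 = nn.Sequential(enc.maxpool, enc.layer1)          # 1/4,  64 channels
        self.layer2, self.layer3, self.layer4 = enc.layer2, enc.layer3, enc.layer4
        dec_ch = [256, 128, 64, 32, 16]                               # decoder widths, coarse to fine
        ins = [512 + 256, 256 + 128, 128 + 64, 64 + 64, 32]           # channels after cat(x, skip)
        self.up = nn.ModuleList([nn.Conv2d(i, o, 3, padding=1) for i, o in zip(ins, dec_ch)])
        self.disp = nn.ModuleList([nn.Conv2d(c, 1, 3, padding=1) for c in dec_ch[1:]])

    def encode(self, x):
        f0 = self.stem(x)
        f1 = self.layer1(f0)
        f2 = self.layer2(f1)
        f3 = self.layer3(f2)
        f4 = self.layer4(f3)
        return [f0, f1, f2, f3, f4]

    def decode(self, feats):
        skips = [feats[3], feats[2], feats[1], feats[0], None]
        x, disps = feats[4], []
        for i, skip in enumerate(skips):
            x = F.interpolate(x, scale_factor=2, mode="nearest")       # up-sample
            if skip is not None:
                x = torch.cat([x, skip], dim=1)                        # skip connection (feature fusion)
            x = F.relu(self.up[i](x))
            if i >= 1:                                                 # 1/8, 1/4, 1/2, full resolution
                disps.append(torch.sigmoid(self.disp[i - 1](x)))
        return disps

    def forward(self, x):
        return self.decode(self.encode(x))
```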
Given the first training view I_s and the second training view I_t, the motion estimation network PoseNet aims to obtain the pose of the first training view I_s relative to the second training view I_t and convert it into the first pose transformation matrix T_t->s, and to obtain the pose of the second training view I_t relative to the first training view I_s and convert it into the second pose transformation matrix T_s->t.
The encoder of the motion estimation network PoseNet is similar in structure to the DepthNet encoder, except that the channels of the first layer are expanded from 3 to 6 so that it can receive two color images as input; the decoding layer then reduces the 512-dimensional features to a 6-dimensional pose transformation vector. Because a pose transformation matrix cannot be regressed in a differentiable way directly, whereas a transformation vector is differentiable and can be converted into a transformation matrix, the output of PoseNet is a 6-dimensional pose transformation vector.
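A minimal sketch of such a pose network (assuming PyTorch, a ResNet-18 backbone whose first convolution is widened to 6 input channels, and an illustrative two-layer regression head; the small output scaling is a common initialisation trick, not something stated in the text):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class PoseNet(nn.Module):
    """Takes two RGB frames concatenated on the channel axis and regresses a
    6-dimensional pose vector (3 rotation, 3 translation) for the relative motion."""

    def __init__(self):
        super().__init__()
        enc = resnet18(weights=None)
        # widen the first layer from 3 to 6 channels to accept an image pair
        enc.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.encoder = nn.Sequential(enc.conv1, enc.bn1, enc.relu, enc.maxpool,
                                     enc.layer1, enc.layer2, enc.layer3, enc.layer4)
        # squeeze the 512-dimensional features down to a 6-dimensional pose vector
        self.head = nn.Sequential(nn.Conv2d(512, 256, 1), nn.ReLU(inplace=True),
                                  nn.Conv2d(256, 6, 1))

    def forward(self, img_a, img_b):
        feat = self.encoder(torch.cat([img_a, img_b], dim=1))   # B x 512 x h x w
        pose = self.head(feat).mean(dim=(2, 3))                 # B x 6 (axis-angle + translation)
        return 0.01 * pose                                      # keep initial predicted motions small
```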
Specifically, the first pose transformation matrix T_t->s = PoseNet(I_s, I_t); the second pose transformation matrix T_s->t = PoseNet(I_t, I_s).
The dynamic information that the motion estimation network PoseNet extracts from adjacent frames is complex and redundant. To obtain effective depth cues, the implicit cue network ICNet is used to further abstract the effective features and act on the decoder of the depth estimation network, thereby obtaining the depth map D_s of the first training view and the depth map D_t of the second training view.
The implicit cue network ICNet adopts three bottleneck layers, each consisting of 1×1, 3×3 and 1×1 convolutions, with the input feature size of each bottleneck layer kept consistent with its output. The similarity of the output features of two of the bottleneck layers in a high-dimensional space is computed with a Gaussian kernel function; this similarity is then multiplied pixel by pixel with the output features of the third bottleneck layer and connected to the depth estimation network.
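A minimal sketch of such an implicit cue network (assuming PyTorch; the channel width, the bottleneck width, the Gaussian kernel bandwidth and the choice of which two bottleneck outputs enter the similarity are illustrative guesses, since the text describes ICNet only at a high level):

```python
import torch
import torch.nn as nn

def bottleneck(channels, mid):
    """1x1 -> 3x3 -> 1x1 bottleneck whose output size matches its input."""
    return nn.Sequential(
        nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
        nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid, channels, 1))

class ICNet(nn.Module):
    """Abstracts implicit cues from pose-encoder features: a Gaussian-kernel
    similarity of two bottleneck outputs gates the third bottleneck's output."""

    def __init__(self, channels=512, mid=128, sigma=1.0):
        super().__init__()
        self.b1 = bottleneck(channels, mid)
        self.b2 = bottleneck(channels, mid)
        self.b3 = bottleneck(channels, mid)
        self.sigma = sigma

    def forward(self, pose_feat):
        f1, f2, f3 = self.b1(pose_feat), self.b2(pose_feat), self.b3(pose_feat)
        dist2 = ((f1 - f2) ** 2).sum(dim=1, keepdim=True)        # squared distance per pixel
        sim = torch.exp(-dist2 / (2 * self.sigma ** 2))          # Gaussian-kernel similarity
        return sim * f3                                          # pixel-wise gating of the cue features
```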
Specifically, the depth map of the first training view D_s = DepthNet_Decoder(ICNet(PoseNet_Encoder(I_t, I_s)) + DepthNet_Encoder(I_s)); the depth map of the second training view D_t = DepthNet_Decoder(ICNet(PoseNet_Encoder(I_s, I_t)) + DepthNet_Encoder(I_t)); where PoseNet_Encoder denotes the encoder of the motion estimation network, DepthNet_Encoder denotes the encoder of the depth estimation network, and DepthNet_Decoder denotes the decoder of the depth estimation network.
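Putting the components together, a sketch of how a depth map could be assembled along this pathway (assuming the DepthNet, PoseNet and ICNet classes sketched above; injecting the ICNet output into the deepest depth-encoder feature and the particular disparity-to-depth scaling shown are assumptions, not details fixed by the text):

```python
import torch

def predict_depth(depth_net, pose_encoder, icnet, view, other_view):
    """D = DepthNet_Decoder(ICNet(PoseNet_Encoder(other_view, view)) + DepthNet_Encoder(view))."""
    feats = depth_net.encode(view)                                   # multi-scale encoder features
    pose_feat = pose_encoder(torch.cat([other_view, view], dim=1))   # 512-channel pose features
    feats[-1] = feats[-1] + icnet(pose_feat)                         # identity-map the implicit cues in
    disparities = depth_net.decode(feats)                            # coarse-to-fine disparities
    return 1.0 / (disparities[-1] * 10.0 + 0.01)                     # illustrative disparity-to-depth scaling
```

With the sketches above, pose_encoder could simply be pose_net.encoder, so that the same encoder features serve both pose regression and the implicit cue pathway.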
Step 2: determine, according to the camera parameters K and the depth map D_t of the second training view, the coordinates P_t of the first spatial point cloud of the second training view in the camera coordinate system.
After the depth map D_t of the second training view is obtained, the coordinates P_t of the first spatial point cloud of the second training view in the camera coordinate system are determined from the camera parameters K and the depth map D_t, in combination with the camera imaging principle.
Specifically, the coordinates of the first spatial point cloud P_t = K^-1 D_t I_t. The camera coordinate system can be understood as the coordinate system O_t-Z_t-X_t-Y_t.
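A minimal sketch of this back-projection (assuming PyTorch and a 3x3 intrinsics matrix K; the homogeneous pixel grid plays the role of I_t in P_t = K^-1 D_t I_t):

```python
import torch

def backproject(depth, K):
    """Lift a depth map (B, 1, H, W) to a 3-D point cloud (B, 3, H*W) in the camera frame."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=depth.dtype),
                            torch.arange(w, dtype=depth.dtype), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones]).reshape(3, -1).to(depth.device)   # homogeneous pixel coordinates
    rays = K.inverse() @ pix                                            # 3 x (H*W) unit-depth rays
    return depth.reshape(b, 1, -1) * rays.unsqueeze(0)                  # P = K^-1 * D * pixels
```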
Step 3: determine, according to the coordinates P_t of the first spatial point cloud and the first pose transformation matrix T_t->s, the coordinates P_s' of the second spatial point cloud of the second training view in the coordinate system of the first training view, based on the Euclidean transformation principle between coordinate systems.
Specifically, the coordinates of the second spatial point cloud P_s' = T_t->s P_t. The coordinate system of the first training view, into which the second training view's point cloud is transformed, can be understood as the coordinate system O_s-Z_s-X_s-Y_s.
Step 4: based on the camera model, project and sample the second spatial point cloud to obtain the first reconstructed view.
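A minimal sketch covering the Euclidean transform of step 3 and the project-and-sample of step 4 (assuming PyTorch, the backproject() helper above, 4x4 pose matrices, and bilinear grid_sample as the sampling operation; which image is sampled to form each reconstruction is an assumption based on common practice, not stated explicitly):

```python
import torch
import torch.nn.functional as F

def transform_points(points, T):
    """Apply a batch of 4x4 Euclidean transforms T to a (B, 3, N) point cloud."""
    rot, trans = T[:, :3, :3], T[:, :3, 3:]
    return rot @ points + trans

def project_and_sample(src_image, points_cam, K):
    """Project camera-frame points with K and bilinearly sample src_image there,
    producing a reconstructed view aligned with the other frame's pixel grid."""
    b, _, h, w = src_image.shape
    pix = K.unsqueeze(0) @ points_cam                          # B x 3 x N
    pix = pix[:, :2] / pix[:, 2:].clamp(min=1e-6)              # perspective division
    grid = torch.stack([2 * pix[:, 0] / (w - 1) - 1,           # normalise to [-1, 1] for grid_sample
                        2 * pix[:, 1] / (h - 1) - 1], dim=-1).view(b, h, w, 2)
    return F.grid_sample(src_image, grid, padding_mode="border", align_corners=True)
```

Under these assumptions the first reconstructed view could be formed as project_and_sample(I_s, transform_points(backproject(D_t, K), T_t->s), K), i.e. sampling the first training view at the projected locations of the second view's point cloud.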
Step 5: determine, according to the camera parameters K and the depth map D_s of the first training view, the coordinates P_s of the third spatial point cloud of the first training view in the camera coordinate system.
In the same way, after the depth map D_s of the first training view is obtained, the coordinates P_s of the third spatial point cloud of the first training view in the camera coordinate system are determined from the camera parameters K and the depth map D_s, in combination with the camera imaging principle.
Specifically, the coordinates of the third spatial point cloud P_s = K^-1 D_s I_s. The camera coordinate system can be understood as the coordinate system O_t-Z_t-X_t-Y_t.
Step 6: determine, according to the coordinates P_s of the third spatial point cloud and the second pose transformation matrix T_s->t, the coordinates P_t' of the fourth spatial point cloud of the first training view in the coordinate system of the second training view, based on the Euclidean transformation principle between coordinate systems.
Specifically, the coordinates of the fourth spatial point cloud P_t' = T_s->t P_s. The coordinate system of the second training view, into which the first training view's point cloud is transformed, can be understood as the coordinate system O_s-Z_s-X_s-Y_s.
Step 7: based on the camera model, project and sample the fourth spatial point cloud to obtain the second reconstructed view.
S203, determining the re-projection loss and the smoothness loss according to the first training view, the second training view, the first reconstruction view and the second reconstruction view.
And S204, determining the three-dimensional reconstruction loss.
Specifically, the three-dimensional reconstruction loss L = |P_t' - P_t| + |P_s' - P_s| can be determined according to the coordinates P_t of the first spatial point cloud, the coordinates P_s' of the second spatial point cloud, the coordinates P_s of the third spatial point cloud and the coordinates P_t' of the fourth spatial point cloud.
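A minimal sketch of this loss under the point-cloud representation used above (each cloud a (B, 3, N) tensor; averaging over points is an assumption about how the absolute differences are aggregated):

```python
def reconstruction_3d_loss(P_t, P_t_prime, P_s, P_s_prime):
    """L = |P_t' - P_t| + |P_s' - P_s|, averaged over all points and coordinates."""
    return (P_t_prime - P_t).abs().mean() + (P_s_prime - P_s).abs().mean()
```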
And S205, training the preset model based on the back propagation and gradient descent principle according to the reprojection loss, the smoothness loss and the three-dimensional reconstruction loss.
In the training process, parameters of the depth estimation network, the motion estimation network and the implicit cue network are updated synchronously.
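A minimal sketch of one training iteration combining the three losses (assuming the networks and helpers sketched above; the loss weights are illustrative, and vec_to_matrix, which converts the 6-dimensional pose vector into a 4x4 transformation matrix, is a hypothetical helper not defined here):

```python
import torch

def train_step(depth_net, pose_net, icnet, optimizer, I_s, I_t, K,
               w_reproj=1.0, w_smooth=1e-3, w_3d=0.1):
    optimizer.zero_grad()

    # pose and depth predictions (depth via the implicit-cue pathway)
    T_t2s = vec_to_matrix(pose_net(I_s, I_t))            # hypothetical helper: 6-D vector -> 4x4 matrix
    T_s2t = vec_to_matrix(pose_net(I_t, I_s))
    D_s = predict_depth(depth_net, pose_net.encoder, icnet, I_s, I_t)
    D_t = predict_depth(depth_net, pose_net.encoder, icnet, I_t, I_s)

    # spatial point clouds and cross-view reconstructions
    P_t, P_s = backproject(D_t, K), backproject(D_s, K)
    P_s_prime = transform_points(P_t, T_t2s)
    P_t_prime = transform_points(P_s, T_s2t)
    recon_t = project_and_sample(I_s, P_s_prime, K)       # reconstruction aligned with I_t
    recon_s = project_and_sample(I_t, P_t_prime, K)       # reconstruction aligned with I_s

    loss = (w_reproj * (photometric_loss(I_t, recon_t).mean() + photometric_loss(I_s, recon_s).mean())
            + w_smooth * (smoothness_loss(1.0 / D_s, I_s) + smoothness_loss(1.0 / D_t, I_t))
            + w_3d * reconstruction_3d_loss(P_t, P_t_prime, P_s, P_s_prime))
    loss.backward()                                       # back-propagation
    optimizer.step()                                      # gradient descent updates all three networks
    return loss.item()
```

For example, the optimizer could be torch.optim.Adam(list(depth_net.parameters()) + list(pose_net.parameters()) + list(icnet.parameters()), lr=1e-4), so that a single step updates the depth estimation network, the motion estimation network and the implicit cue network synchronously.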
S206, acquiring a preset model, a source view and a target view.
And S207, inputting the source view and the target view into a preset model.
S208, estimating the self-motion track of the camera based on the motion estimation network; and estimating the depth of the source view and/or the target view based on the depth estimation network and the static and dynamic characteristics between the source view and the target view.
Therefore, the problem of artifacts of moving objects can be effectively solved, the estimation quality of monocular image depth is improved, pose transformation errors are reduced, and the self-motion track of the camera is estimated more accurately.
The embodiment of the invention provides a method for estimating depth and a self-motion trajectory, comprising: acquiring a preset model, a source view and a target view, wherein the preset model comprises a depth estimation network, a motion estimation network and an implicit cue network, the implicit cue network is used for extracting static and dynamic features between the source view and the target view from the motion estimation network and mapping them to the depth estimation network via an identity mapping, and the source view and the target view are color images of two adjacent frames; inputting the source view and the target view into the preset model; estimating the self-motion trajectory of the camera based on the motion estimation network; and estimating the depth of the source view and/or the target view based on the depth estimation network and the static and dynamic features between the source view and the target view. By designing the preset model in this way, the implicit cue network supplements the depth information of a single static view, strengthens the geometric constraints on static objects and compensates for the dynamic characteristics of moving objects, so that moving-object artifacts are effectively alleviated and the depth estimation quality of the source view and/or the target view is improved. Meanwhile, the three-dimensional reconstruction loss further constrains the view reconstruction from the perspective of the spatial point clouds, so that the camera transformation process has consistent depth and pose transformations; this effectively reduces pose transformation errors, reduces the accumulated drift of motion trajectories predicted over long videos, and allows the self-motion trajectory of the camera to be estimated more accurately.
EXAMPLE III
Fig. 3 is a schematic structural diagram of an apparatus for estimating depth and self-motion trajectory according to a third embodiment of the present invention. As shown in fig. 3, the apparatus includes: an acquisition module 301 and an estimation module 302.
An obtaining module 301, configured to obtain a preset model, a source view and a target view, wherein the preset model comprises a depth estimation network, a motion estimation network and an implicit cue network, the implicit cue network is configured to extract static and dynamic features between the source view and the target view from the motion estimation network and map them to the depth estimation network via an identity mapping, and the source view and the target view are color images of two adjacent frames;
an estimation module 302, configured to input the source view and the target view into a preset model; estimating the self-motion track of the camera based on a motion estimation network; and estimating the depth of the source view and/or the target view based on the depth estimation network and the static and dynamic features between the source view and the target view.
With reference to fig. 3, fig. 4 is a schematic structural diagram of another depth and self-motion trajectory estimation apparatus according to a third embodiment of the present invention. As shown in fig. 4, further includes: a training module 303.
The training module 303 is configured to obtain a first training view and a second training view before the obtaining module 301 obtains the preset model; and training the preset model according to the first training view and the second training view.
Optionally, the training module 303 is specifically configured to input the first training view and the second training view into a preset model to obtain a first reconstructed view and a second reconstructed view; determining a reprojection loss and a smoothness loss according to the first training view, the second training view, the first reconstructed view and the second reconstructed view; determining a three-dimensional reconstruction loss; and training the preset model based on the back propagation and gradient descent principle according to the reprojection loss, the smoothness loss and the three-dimensional reconstruction loss.
Optionally, the training module 303 is specifically configured to: respectively determine a first pose transformation matrix T_t->s of the first training view I_s relative to the second training view I_t, a second pose transformation matrix T_s->t of the second training view I_t relative to the first training view I_s, a depth map D_s of the first training view and a depth map D_t of the second training view; determine, according to the camera parameters K and the depth map D_t of the second training view, the coordinates P_t of the first spatial point cloud of the second training view in the camera coordinate system; determine, according to the coordinates P_t of the first spatial point cloud and the first pose transformation matrix T_t->s, the coordinates P_s' of the second spatial point cloud of the second training view in the coordinate system of the first training view, based on the Euclidean transformation principle between coordinate systems; project and sample the second spatial point cloud based on the camera model to obtain the first reconstructed view; determine, according to the camera parameters K and the depth map D_s of the first training view, the coordinates P_s of the third spatial point cloud of the first training view in the camera coordinate system; determine, according to the coordinates P_s of the third spatial point cloud and the second pose transformation matrix T_s->t, the coordinates P_t' of the fourth spatial point cloud of the first training view in the coordinate system of the second training view, based on the Euclidean transformation principle between coordinate systems; and project and sample the fourth spatial point cloud based on the camera model to obtain the second reconstructed view.
Optionally, the training module 303 is specifically configured to determine the three-dimensional reconstruction loss L = |P_t' - P_t| + |P_s' - P_s| according to the coordinates P_t of the first spatial point cloud, the coordinates P_s' of the second spatial point cloud, the coordinates P_s of the third spatial point cloud and the coordinates P_t' of the fourth spatial point cloud.
Optionally, the first pose transformation matrix T_t->s = PoseNet(I_s, I_t);
the second pose transformation matrix T_s->t = PoseNet(I_t, I_s);
the depth map of the first training view D_s = DepthNet_Decoder(ICNet(PoseNet_Encoder(I_t, I_s)) + DepthNet_Encoder(I_s));
the depth map of the second training view D_t = DepthNet_Decoder(ICNet(PoseNet_Encoder(I_s, I_t)) + DepthNet_Encoder(I_t));
the coordinates of the first spatial point cloud P_t = K^-1 D_t I_t;
the coordinates of the second spatial point cloud P_s' = T_t->s P_t;
the coordinates of the third spatial point cloud P_s = K^-1 D_s I_s;
the coordinates of the fourth spatial point cloud P_t' = T_s->t P_s;
wherein ICNet denotes the implicit cue network, PoseNet denotes the motion estimation network, PoseNet_Encoder denotes the encoder of the motion estimation network, DepthNet_Encoder denotes the encoder of the depth estimation network, and DepthNet_Decoder denotes the decoder of the depth estimation network.
Optionally, the first training view and the second training view are two color images of adjacent time points; alternatively, the second training view is a color image generated based on the first training view to simulate an adjacent frame of the first training view.
The device for estimating the depth and the self-movement track, provided by the embodiment of the invention, can execute the method for estimating the depth and the self-movement track, provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example four
FIG. 5 illustrates a block diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 5, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as depth and self-motion trajectory estimation methods.
In some embodiments, the methods of estimating depth and self-motion trajectories may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the depth and self-motion trajectory estimation method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured by any other suitable means (e.g. by means of firmware) to perform the depth and self-movement trajectory estimation method.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), system on a chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for estimating depth and self-motion trajectory, comprising:
acquiring a preset model, a source view and a target view, wherein the preset model comprises a depth estimation network, a motion estimation network and an implicit cue network, the implicit cue network is used for extracting static and dynamic features between the source view and the target view from the motion estimation network and mapping them to the depth estimation network via an identity mapping, and the source view and the target view are color images of two adjacent frames;
inputting the source view and the target view into the preset model;
estimating a self-motion trajectory of the camera based on the motion estimation network; and estimating the depth of the source view and/or the target view based on the depth estimation network and the static and dynamic features between the source view and the target view.
2. The method of claim 1, further comprising, prior to obtaining the pre-set model:
acquiring a first training view and a second training view;
and training the preset model according to the first training view and the second training view.
3. The method of claim 2, wherein the training the pre-set model according to the first training view and the second training view comprises:
inputting the first training view and the second training view into the preset model to obtain a first reconstruction view and a second reconstruction view;
determining a reprojection loss and a smoothness loss from the first training view, the second training view, the first reconstructed view, and the second reconstructed view;
determining a three-dimensional reconstruction loss;
and training the preset model based on the back propagation and gradient descent principle according to the reprojection loss, the smoothness loss and the three-dimensional reconstruction loss.
4. The method of claim 3, wherein obtaining the first reconstructed view and the second reconstructed view comprises:
respectively determining a first pose transformation matrix T_t->s of the first training view I_s relative to the second training view I_t, a second pose transformation matrix T_s->t of the second training view I_t relative to the first training view I_s, a depth map D_s of the first training view and a depth map D_t of the second training view;
determining, according to camera parameters K and the depth map D_t of the second training view, coordinates P_t of a first spatial point cloud of the second training view in the camera coordinate system;
determining, according to the coordinates P_t of the first spatial point cloud and the first pose transformation matrix T_t->s, coordinates P_s' of a second spatial point cloud of the second training view in the coordinate system of the first training view, based on the Euclidean transformation principle between coordinate systems;
projecting and sampling the second spatial point cloud based on a camera model to obtain the first reconstructed view;
determining, according to the camera parameters K and the depth map D_s of the first training view, coordinates P_s of a third spatial point cloud of the first training view in the camera coordinate system;
determining, according to the coordinates P_s of the third spatial point cloud and the second pose transformation matrix T_s->t, coordinates P_t' of a fourth spatial point cloud of the first training view in the coordinate system of the second training view, based on the Euclidean transformation principle between coordinate systems;
and projecting and sampling the fourth spatial point cloud based on the camera model to obtain the second reconstructed view.
5. The method of claim 4, wherein determining the three-dimensional reconstruction loss comprises:
determining the three-dimensional reconstruction loss L = |P_t' - P_t| + |P_s' - P_s| according to the coordinates P_t of the first spatial point cloud, the coordinates P_s' of the second spatial point cloud, the coordinates P_s of the third spatial point cloud and the coordinates P_t' of the fourth spatial point cloud.
6. The method of claim 4, wherein:
the first pose transformation matrix T_t->s = PoseNet(I_s, I_t);
the second pose transformation matrix T_s->t = PoseNet(I_t, I_s);
the depth map of the first training view D_s = DepthNet_Decoder(ICNet(PoseNet_Encoder(I_t, I_s)) + DepthNet_Encoder(I_s));
the depth map of the second training view D_t = DepthNet_Decoder(ICNet(PoseNet_Encoder(I_s, I_t)) + DepthNet_Encoder(I_t));
the coordinates of the first spatial point cloud P_t = K^-1 D_t I_t;
the coordinates of the second spatial point cloud P_s' = T_t->s P_t;
the coordinates of the third spatial point cloud P_s = K^-1 D_s I_s;
the coordinates of the fourth spatial point cloud P_t' = T_s->t P_s;
wherein ICNet denotes the implicit cue network, PoseNet denotes the motion estimation network, PoseNet_Encoder denotes the encoder of the motion estimation network, DepthNet_Encoder denotes the encoder of the depth estimation network, and DepthNet_Decoder denotes the decoder of the depth estimation network.
7. The method of claim 2, wherein the first training view and the second training view are two color images at adjacent time instances; alternatively, the second training view is a color image generated based on the first training view for simulating an adjacent frame of the first training view.
8. An apparatus for estimating depth and a self-motion trajectory, comprising: an acquisition module and an estimation module; wherein:
the acquisition module is configured to acquire a preset model, a source view and a target view, wherein the preset model comprises a depth estimation network, a motion estimation network and an implicit cue network, the implicit cue network is configured to extract static and dynamic features between the source view and the target view from the motion estimation network and map them to the depth estimation network via an identity mapping, and the source view and the target view are color images of two adjacent frames;
the estimation module is configured to input the source view and the target view into the preset model, estimate the self-motion trajectory of the camera based on the motion estimation network, and estimate the depth of the source view and/or the target view based on the depth estimation network and the static and dynamic features between the source view and the target view.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of estimating depth and self-motion trajectory of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions for causing a processor to perform the method of estimating depth and self-motion trajectory of any one of claims 1-7 when executed.
CN202211578441.0A 2022-12-09 2022-12-09 Method, device and equipment for estimating depth and self-movement track and storage medium Pending CN115953468A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211578441.0A CN115953468A (en) 2022-12-09 2022-12-09 Method, device and equipment for estimating depth and self-movement track and storage medium

Publications (1)

Publication Number Publication Date
CN115953468A true CN115953468A (en) 2023-04-11

Family

ID=87289947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211578441.0A Pending CN115953468A (en) 2022-12-09 2022-12-09 Method, device and equipment for estimating depth and self-movement track and storage medium

Country Status (1)

Country Link
CN (1) CN115953468A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116758131A (en) * 2023-08-21 2023-09-15 之江实验室 Monocular image depth estimation method and device and computer equipment
CN116758131B (en) * 2023-08-21 2023-11-28 之江实验室 Monocular image depth estimation method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination