CN111127522B - Depth optical flow prediction method, device, equipment and medium based on monocular camera

Publication number
CN111127522B
Authority
CN
China
Prior art keywords
optical flow
depth
feature map
information
image
Prior art date
Legal status
Active
Application number
CN201911394005.6A
Other languages
Chinese (zh)
Other versions
CN111127522A (en)
Inventor
Name not published at the inventor's request
Current Assignee
Hiscene Information Technology Co Ltd
Original Assignee
Hiscene Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hiscene Information Technology Co Ltd
Priority to CN201911394005.6A
Publication of CN111127522A
Application granted
Publication of CN111127522B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/20: Analysis of motion
    • G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the invention disclose a depth optical flow prediction method, device, and equipment based on a monocular camera, and a storage medium. The method comprises the following steps: acquiring a reference image and an adjacent image, and inputting the reference image and the adjacent image into a trained depth optical flow prediction model; and respectively predicting target depth information of the reference image and target optical flow information from the reference image to the adjacent image according to an output result of the depth optical flow prediction model. The depth optical flow prediction model comprises a depth prediction network, an optical flow prediction network, and a depth optical flow information interaction module connected to the depth prediction network and the optical flow prediction network, respectively. Through joint optimization, this technical scheme significantly improves both the accuracy and the real-time performance of depth prediction and optical flow prediction, achieving efficient and high-precision depth and optical flow prediction.

Description

Depth optical flow prediction method, device, equipment and medium based on monocular camera
Technical Field
The embodiments of the invention relate to the field of image processing, and in particular to a depth optical flow prediction method, device, equipment, and medium based on a monocular camera.
Background
In computer vision and robotics, depth prediction and optical flow prediction are two important tasks for understanding three-dimensional scene geometry and camera motion. Optical flow prediction has long been a classical problem in computer vision and underlies many other tasks: given a pair of associated images, it predicts, for each pixel in a reference image, its position in the adjacent image, so optical flow carries rich motion information. Correspondingly, depth information is a prerequisite for lifting a two-dimensional image into three-dimensional space, and depth prediction focuses on learning object structure information.
In recent years, because monocular cameras are lightweight and low cost, applications that use them for localization and map construction have become increasingly widespread, and many mature monocular SLAM systems have emerged. SLAM stands for simultaneous localization and mapping: a system equipped with a particular sensor, and without prior knowledge of the environment, builds a model of the environment during motion while estimating its own motion. When the sensor is primarily a camera, the system is also referred to as visual SLAM.
Current monocular SLAM systems fall into two categories: feature point methods and direct methods. A feature point method first detects sparse feature points in the current image, searches for feature correspondences between the image and a local map, then estimates the camera pose from these correspondences with a PnP algorithm, and solves the depth of the feature points by triangulation. However, a feature point method can only produce a sparse depth map, which is suitable for pose tracking but not for other tasks such as obstacle avoidance or augmented reality. To obtain a denser depth map, a direct method searches, along epipolar lines, for a match between each pixel of the current image and the corresponding key image, and then solves the depth of each matched point by triangulation. However, the traditional direct method operates directly on image intensities, which are strongly affected by environmental factors such as illumination; reliable matches can only be found in highly textured regions, and matching reliability in weakly textured regions is low, which reduces the accuracy of the system. In other words, depth prediction schemes based on traditional monocular SLAM either produce only sparse or semi-dense depth maps, so the depth prediction is incomplete, or produce a dense depth map whose accuracy in weakly textured regions is low.
Beyond these schemes, there are several other ways to perform depth prediction with a monocular camera. Single-view depth prediction observes the scene from a single viewpoint and tends to over-learn structural priors from the training data, so it may perform poorly in scenes it has never seen. Depth prediction based on multi-view stereo acquires images of the scene from several viewpoints, completes matching and depth prediction, and recovers the three-dimensional scene structure from two-dimensional images taken at different viewpoints. There are also schemes that combine traditional methods with deep learning, such as the CNN-SLAM system, which uses a depth map predicted by a neural network to initialize the depth of the SLAM system and then refines the depth values by bundle adjustment (BA) while the SLAM system is running.
Disclosure of Invention
The embodiments of the invention provide a depth optical flow prediction method, device, equipment, and medium based on a monocular camera, so as to jointly optimize depth prediction and optical flow prediction.
In a first aspect, an embodiment of the present invention provides a method for predicting depth optical flow based on a monocular camera, which may include:
Acquiring a reference image and an adjacent image, and inputting the reference image and the adjacent image into a trained depth optical flow prediction model;
respectively predicting target depth information of a reference image and target optical flow information from the reference image to an adjacent image according to an output result of the depth optical flow prediction model;
the depth optical flow prediction model comprises a depth prediction network, an optical flow prediction network and a depth optical flow information interaction module which is respectively connected with the depth prediction network and the optical flow prediction network.
Optionally, on this basis, the method may further include:
acquiring a history reference image and history depth information of the history reference image, and history adjacent images of the history reference image and history optical flow information from the history reference image to the history adjacent images, and taking the history reference image, the history adjacent images, the history depth information and the history optical flow information as a group of training samples;
and constructing an initial depth optical flow prediction model, and training the initial depth optical flow prediction model based on a plurality of training samples to generate the depth optical flow prediction model.
Optionally, the depth optical flow information interaction module comprises a depth information interaction sub-module connected with the depth prediction network and an optical flow information interaction sub-module connected with the optical flow prediction network;
After inputting the reference image and the neighboring image into the trained deep optical flow prediction model, the method may further include:
extracting original optical flow information of a second feature map in the optical flow prediction network through an optical flow information interaction sub-module, and generating intermediate depth information according to the original optical flow information and the pose of a reference image to an adjacent image;
extracting original depth information of a first feature map in a depth prediction network through a depth information interaction sub-module, and generating intermediate optical flow information according to the original depth information and the pose;
the intermediate depth information sent by the optical flow information interaction sub-module is received through the depth information interaction sub-module, and the original depth information and the intermediate depth information are fused to obtain a third feature map to be spliced with the first feature map;
and receiving the intermediate optical flow information sent by the depth information interaction sub-module through the optical flow information interaction sub-module, and fusing the original optical flow information and the intermediate optical flow information to obtain a fourth feature map to be spliced with the second feature map.
Optionally, fusing the original optical flow information and the intermediate optical flow information to obtain a fourth feature map to be spliced with the second feature map may include:
And respectively obtaining an original optical flow characteristic map of the original optical flow information and an intermediate optical flow characteristic map of the intermediate optical flow information, and fusing the original optical flow characteristic map and the intermediate optical flow characteristic map to obtain a fourth characteristic map to be spliced with the second characteristic map.
Optionally, obtaining the original optical flow feature map of the original optical flow information may include:
scaling the reference image and the adjacent image according to the scale information of the depth optical flow information interaction module to respectively obtain a reference scaled image and an adjacent scaled image;
projecting adjacent scaled images onto a reference scaled image according to original optical flow information, and fusing a projection result and the reference scaled image to obtain a residual optical flow characteristic map;
and fusing the residual optical flow feature map and the original optical flow information to obtain an original optical flow feature map of the original optical flow information.
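For illustration only, the following sketch shows one way the warp-and-fuse step above could be realized, assuming backward warping with bilinear sampling (PyTorch's grid_sample) and a caller-supplied convolution for the fusion; the function names, tensor layout, and the choice of bilinear sampling are assumptions and are not prescribed by the text.

```python
import torch
import torch.nn.functional as F

def warp_by_flow(neighbor_scaled, flow):
    """Backward-warp the adjacent scaled image onto the reference scaled image
    using the raw optical flow predicted at this scale.

    neighbor_scaled: (B, C, H, W) adjacent image resized to the module's scale
    flow:            (B, 2, H, W) raw optical flow in pixels, reference -> adjacent
    """
    b, _, h, w = neighbor_scaled.shape
    # Pixel grid of the reference image.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(flow.device)      # (2, H, W)
    target = grid.unsqueeze(0) + flow                                 # where each reference pixel lands
    # Normalize to [-1, 1] as required by grid_sample.
    target_x = 2.0 * target[:, 0] / (w - 1) - 1.0
    target_y = 2.0 * target[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((target_x, target_y), dim=-1)           # (B, H, W, 2)
    return F.grid_sample(neighbor_scaled, sample_grid, align_corners=True)

def residual_flow_feature(reference_scaled, neighbor_scaled, raw_flow, conv):
    """Fuse the warped projection with the reference scaled image to obtain a
    residual optical flow feature map, then fuse it with the raw flow."""
    warped = warp_by_flow(neighbor_scaled, raw_flow)
    residual = conv(torch.cat((warped, reference_scaled), dim=1))      # residual optical flow feature map
    return torch.cat((residual, raw_flow), dim=1)                      # original optical flow feature map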
Optionally, extracting, via the depth information interaction sub-module, original depth information of the first feature map in the depth prediction network may include:
convolving the intermediate depth information to obtain a fifth feature map, and fusing the fifth feature map and a sixth feature map in a depth prediction network to obtain a first feature map;
the original depth information of the first feature map is extracted via the depth information interaction sub-module.
Optionally, the optical flow prediction network comprises an association layer, and after inputting the reference image and the neighboring image into the trained deep optical flow prediction model, the method may further comprise:
and determining the matching relation of corresponding pixels in the seventh feature map and the eighth feature map based on a preset dot product operation by a correlation layer aiming at the seventh feature map extracted from the reference image and the eighth feature map extracted from the adjacent image, so as to obtain the correlation feature map.
Optionally, the optical flow prediction network further includes an epipolar layer, and an epipolar feature map output by the epipolar layer is fused with the associated feature map.
Alternatively, the epipolar layer may output the epipolar feature map by:
acquiring the epipolar line, on the eighth feature map, of a reference pixel point in the seventh feature map, and acquiring each adjacent pixel point in the adjacent area of the eighth feature map corresponding to the reference pixel point;
and calculating the distance between each adjacent pixel point and the polar line to obtain a polar line characteristic map.
Optionally, calculating the distance between each adjacent pixel point and the epipolar line to obtain the epipolar line feature map may include:
and calculating the distance between each adjacent pixel point and the polar line, and transforming the distance based on the preset Gaussian distribution to obtain the polar line characteristic map.
Alternatively, the number of adjacent images may be at least two;
extracting original optical flow information of a second feature map in the optical flow prediction network, and generating intermediate depth information according to the original optical flow information and the pose from the reference image to the adjacent image, wherein the intermediate depth information may include:
and respectively extracting original optical flow information of each second feature map in the optical flow prediction network, and establishing a linear equation set according to the original optical flow information and the pose from the reference image to each adjacent image to generate intermediate depth information.
Optionally, the overlapping ratio of the reference image and the adjacent image is within a preset overlapping range, and/or the baseline distance of the reference image and the adjacent image is within a preset distance range.
Optionally, the depth prediction network and/or the optical flow prediction network comprise: convolution layer and deconvolution layer.
Optionally, the number of depth optical flow information interaction modules is one or more, and when there are a plurality of depth optical flow information interaction modules, the scale information of each module differs from that of the others.
In a second aspect, an embodiment of the present invention further provides a depth optical flow prediction device based on a monocular camera, where the device may include:
the input module is used for acquiring a reference image and an adjacent image and inputting the reference image and the adjacent image into the trained depth optical flow prediction model;
The prediction module is used for respectively predicting the target depth information of the reference image and the target optical flow information from the reference image to the adjacent image according to the output result of the depth optical flow prediction model;
the depth optical flow prediction model comprises a depth prediction network, an optical flow prediction network and a depth optical flow information interaction module which is respectively connected with the depth prediction network and the optical flow prediction network.
In a third aspect, an embodiment of the present invention further provides an apparatus, which may include:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the monocular camera-based depth optical flow prediction method provided by any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the monocular camera-based depth optical flow prediction method provided by any embodiment of the present invention.
According to the above technical scheme, a reference image and an adjacent image are acquired and input into a trained depth optical flow prediction model; because the depth optical flow prediction model has a depth prediction network, an optical flow prediction network, and a depth optical flow information interaction module connected to each of them, dense target depth information of the reference image and target optical flow information from the reference image to the adjacent image can be predicted respectively. Through joint optimization, this scheme significantly improves both the accuracy and the real-time performance of depth prediction and optical flow prediction, achieving efficient and high-precision depth and optical flow prediction.
Drawings
FIG. 1 is a flow chart of a depth-to-optical-flow prediction method based on a monocular camera according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of a network structure of a depth-to-optical-flow prediction model in a depth-to-optical-flow prediction method based on a monocular camera according to a first embodiment of the present invention;
FIG. 3 is a schematic diagram of a seventh feature map and an eighth feature map in a monocular camera-based depth optical flow prediction method according to a first embodiment of the present invention;
FIG. 4 is a schematic diagram of epipolar geometry constraint in a monocular camera-based depth optical flow prediction method according to one embodiment of the present invention;
FIG. 5 is a flowchart of a method for predicting depth optical flow based on a monocular camera according to a second embodiment of the present invention;
FIG. 6 is a schematic diagram of a depth optical flow information interaction module in a depth optical flow prediction method based on a monocular camera according to a second embodiment of the present invention;
FIG. 7 is a flowchart of a method for predicting depth optical flow based on a monocular camera according to a third embodiment of the present invention;
FIG. 8 is a schematic diagram of multi-view depth prediction in a depth optical flow prediction method based on a monocular camera according to a third embodiment of the present invention;
FIG. 9 is a block diagram of a monocular camera-based depth-to-optical-flow prediction apparatus according to a fourth embodiment of the present invention;
Fig. 10 is a schematic structural diagram of an apparatus according to a fifth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Example 1
Fig. 1 is a flowchart of a depth optical flow prediction method based on a monocular camera according to a first embodiment of the present invention. The present embodiment is applicable to a case where depth information and optical flow information are predicted jointly based on a monocular camera, and is particularly applicable to a case where optical flow information and depth information are predicted jointly by combining multi-view stereoscopic vision and single-view structure information. The method can be performed by the monocular camera-based depth optical flow prediction device provided by the embodiment of the invention, the device can be realized by software and/or hardware, and the device can be integrated on various devices. Referring to fig. 1, the method of the embodiment of the present invention specifically includes the following steps:
s110, acquiring a reference image and an adjacent image, and inputting the reference image and the adjacent image into a trained depth optical flow prediction model, wherein the depth optical flow prediction model comprises a depth prediction network, an optical flow prediction network and a depth optical flow information interaction module respectively connected with the depth prediction network and the optical flow prediction network.
A monocular camera can be an ordinary RGB camera that captures consecutive RGB images, from which a reference image (I_ref) and an adjacent image (I_nei) are obtained; the reference image and the adjacent image are input into the trained depth optical flow prediction model. Meanwhile, the pose from the reference image to the adjacent image can be acquired and input into the depth optical flow prediction model together with the images. The pose can be acquired in several ways, including but not limited to: downloading a training/testing data set that contains the reference image, the adjacent image, and the pose from the reference image to the adjacent image; acquiring a training/testing data set and computing the pose from the reference image to each adjacent image with a SLAM system; or computing the pose from the reference image to each adjacent image by feature point matching. Alternatively, the depth optical flow prediction model may include a pose calculation module; after the reference image and the adjacent image are input into the model, the pose from the reference image to the adjacent image can be calculated by this module. The pose represents the position and orientation of the monocular camera at different moments, and it is through the pose that the depth information and the optical flow information are linked together in the depth optical flow information interaction module.
The depth optical flow prediction model may include a depth prediction network (Depth Net), an optical flow prediction network (Optical Flow Net), and a depth optical flow information interaction module (exchange block) connected to the depth prediction network and the optical flow prediction network, respectively. The depth prediction network takes the reference image as input and outputs the target depth information of the reference image, i.e., a dense depth map corresponding to the reference image; the optical flow prediction network takes the reference image and the adjacent image as input and outputs the target optical flow information, i.e., the optical flow from the reference image to the adjacent image. On this basis, joint prediction of the target depth information and the target optical flow information is realized through the depth optical flow information interaction module.
Alternatively, the depth prediction network and/or the optical flow prediction network may adopt an encoder-decoder structure, with cross-layer (skip) connections between the two so that earlier and later feature maps can be fused. Taking the depth prediction network as an example, it may include convolution layers, which produce convolution feature maps, and deconvolution layers, which produce deconvolution feature maps. By convolving layer by layer with different strides, the encoder obtains feature maps of different sizes; the lower the resolution, the more channels the feature map has. These feature maps are connected across layers to the decoder. In the decoder, a coarse-to-fine strategy can be adopted: a feature map is first upsampled through a deconvolution layer and then concatenated, via a cross-layer connection, with the encoder feature map of the same size; after the two feature maps are spliced, a convolution produces the original depth information at the current scale. This process can be repeated several times to obtain original depth information at several sizes. Each piece of original depth information can exchange information, through the depth optical flow information interaction module, with the original optical flow information of the corresponding scale learned in the optical flow prediction network. After several rounds of information interaction, the last feature map is deconvolved and convolved to obtain target depth information at the size of the reference image, i.e., a dense depth map. The optical flow prediction network performs similar operations. It should be noted that the number of depth optical flow information interaction modules may be one or more; when there are several, their scale information differs from one another, because the resolution of the deconvolution layers increases from low to high.
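A minimal PyTorch sketch of such an encoder-decoder depth branch is shown below; the number of layers, channel counts, and activation functions are illustrative assumptions, and only two decoder scales (where raw depth would be exchanged with the flow branch) are drawn.

```python
import torch
import torch.nn as nn

def conv(c_in, c_out, stride=1):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride, 1), nn.LeakyReLU(0.1, inplace=True))

def deconv(c_in, c_out):
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, 2, 1), nn.LeakyReLU(0.1, inplace=True))

class DepthNetSketch(nn.Module):
    """Coarse-to-fine encoder-decoder: strided convolutions shrink the feature
    maps (more channels at lower resolution), deconvolutions restore them, and
    cross-layer connections concatenate encoder features of the same size.
    A 1-channel raw depth map is predicted at each decoder scale."""
    def __init__(self):
        super().__init__()
        self.enc1 = conv(3, 32, 2)       # 1/2 resolution
        self.enc2 = conv(32, 64, 2)      # 1/4
        self.enc3 = conv(64, 128, 2)     # 1/8
        self.dec2 = deconv(128, 64)      # back to 1/4
        self.dec1 = deconv(64 + 64, 32)  # back to 1/2
        self.pred2 = nn.Conv2d(64 + 64, 1, 3, 1, 1)
        self.pred1 = nn.Conv2d(32 + 32, 1, 3, 1, 1)

    def forward(self, reference):
        e1 = self.enc1(reference)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        d2 = torch.cat((self.dec2(e3), e2), dim=1)   # cross-layer (skip) connection
        depth2 = self.pred2(d2)                      # raw depth at 1/4 scale (exchanged with the flow branch)
        d1 = torch.cat((self.dec1(d2), e1), dim=1)
        depth1 = self.pred1(d1)                      # raw depth at 1/2 scale
        return depth2, depth1
```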
Accordingly, fig. 2 is a schematic diagram of an optional network structure of the depth optical flow prediction model; to understand this structure more intuitively, the optical flow prediction network is described with fig. 2 as an example. Each rectangular box in fig. 2 is a feature map, and cross-layer connections, convolution layers, and deconvolution layers are represented by arrows of different colors. The optical flow prediction network takes a reference image and adjacent images as input; the number of reference images may be one, and the number of adjacent images may be one or more. The optical flow prediction network may also use an encoder-decoder structure similar to that of the depth prediction network. It first extracts feature maps of the two images (i.e., the seventh feature map 7 and the eighth feature map 8 in fig. 2) with three weight-shared convolution layers. Optionally, the two feature maps are then matched by a correlation layer to obtain an associated feature map, which improves the performance of the optical flow prediction network. Optionally, the seventh feature map may be convolved to obtain a ninth feature map, which adds information of the reference image to the optical flow prediction. Optionally, an epipolar feature map can be determined from the seventh and eighth feature maps; the epipolar constraint improves the accuracy of the optical flow prediction. Still optionally, the associated feature map, the ninth feature map, and the epipolar feature map may be spliced together. The subsequent network structure is similar to that of the depth prediction network: the original optical flow information at the three intermediate scales also exchanges information with the corresponding original depth information in the depth prediction network through the depth optical flow information interaction modules; the final feature map is then deconvolved to obtain a feature map at the size of the reference image, and the target optical flow information at the size of the reference image is then obtained through two further deconvolutions.
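The front end of the flow branch described above might look as follows; the channel counts are assumptions, and the correlation and epipolar layers (sketched separately further below) are passed in as callables, with a hypothetical call signature, rather than being fixed here.

```python
import torch
import torch.nn as nn

class FlowFrontEndSketch(nn.Module):
    """Front end of the flow branch: three weight-shared convolution layers
    extract the seventh/eighth feature maps from the reference and adjacent
    images; the correlation map, a convolved reference feature (the 'ninth'
    map), and the epipolar feature map are then spliced for the decoder."""
    def __init__(self, correlation_layer, epipolar_layer):
        super().__init__()
        self.shared = nn.Sequential(                      # weights shared by both images
            nn.Conv2d(3, 32, 3, 2, 1), nn.LeakyReLU(0.1),
            nn.Conv2d(32, 64, 3, 2, 1), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 128, 3, 2, 1), nn.LeakyReLU(0.1),
        )
        self.ref_conv = nn.Conv2d(128, 32, 3, 1, 1)       # produces the ninth feature map
        self.correlation = correlation_layer              # see the correlation sketch below
        self.epipolar = epipolar_layer                    # see the epipolar sketch below

    def forward(self, reference, adjacent, K, pose):
        f_ref = self.shared(reference)                    # seventh feature map
        f_nei = self.shared(adjacent)                     # eighth feature map
        corr = self.correlation(f_ref, f_nei)             # associated feature map
        ninth = self.ref_conv(f_ref)
        epi = self.epipolar(f_ref.shape, K, pose)         # epipolar feature map (hypothetical signature)
        return torch.cat((corr, ninth, epi), dim=1)       # spliced input to the decoder
```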
S120, respectively predicting target depth information of the reference image and target optical flow information from the reference image to the adjacent image according to an output result of the depth optical flow prediction model.
According to the output result of the depth optical flow prediction model, the target depth information of the reference image and the target optical flow information from the reference image to the adjacent image can be predicted respectively. The target depth information can restore the two-dimensional image to a three-dimensional structure, and the target optical flow information represents the motion speed and direction of each pixel from the reference image to each adjacent image. Specifically, the target depth information of the reference image can be predicted in real time by the depth prediction network, which, by virtue of deep learning, produces a dense depth map and, through feature learning, alleviates to some extent the low depth prediction accuracy that weakly textured regions tend to cause; and the target optical flow information from the reference image to each adjacent image can be predicted in real time by the optical flow prediction network.
That is, the above technical scheme combines multi-view stereo and single-view structural information to jointly predict optical flow and depth, and this joint prediction significantly improves the accuracy of both. Optical flow prediction and single-view depth prediction focus respectively on multi-view stereo and single-view structural information, and the pose links them together so that they complement each other. The scheme achieves efficient, high-precision optical flow and depth prediction, generalizes better to new scenes, and provides an effective solution for application scenarios with high real-time requirements, such as autonomous obstacle avoidance.
According to the above technical scheme, a reference image and an adjacent image are acquired and input into a trained depth optical flow prediction model; because the depth optical flow prediction model has a depth prediction network, an optical flow prediction network, and a depth optical flow information interaction module connected to each of them, dense target depth information of the reference image and target optical flow information from the reference image to the adjacent image can be predicted respectively. Through joint optimization, this scheme significantly improves both the accuracy and the real-time performance of depth prediction and optical flow prediction, achieving efficient and high-precision depth and optical flow prediction.
On this basis, optionally, in a first aspect, the overlap ratio between the reference image and the adjacent image may be within a preset overlap range, and/or the baseline distance between the reference image and the adjacent image may be within a preset distance range; the preset overlap range and preset distance range can be set according to the actual situation and are not specifically limited here. That is, the reference image and the adjacent image may have a certain overlap ratio, i.e., a certain common field of view, and/or a certain baseline distance, i.e., the line connecting the optical centers of the monocular camera at the two moments may have a certain length. In practice, the monocular camera should move, but the motion should be neither too large nor too small; for example, the distance between the optical centers of the two frames may be required to be greater than 5 cm, and the overlap ratio between the reference image and the adjacent image greater than 65%.
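A simple check of these two selection criteria could look like the sketch below; the way the overlap ratio is approximated (fraction of reference pixels whose flow target stays inside the adjacent image) is an assumption, since the text does not fix how overlap is measured.

```python
import numpy as np

def pair_is_usable(t_ref_to_nei, flow, image_size,
                   min_baseline_m=0.05, min_overlap=0.65):
    """Illustrative check of the selection criteria: the baseline between the
    optical centres should exceed roughly 5 cm and the view overlap roughly 65%.
    Overlap is approximated by the fraction of reference pixels whose flow
    target still falls inside the adjacent image.

    t_ref_to_nei: (3,) translation of the relative pose, reference -> adjacent
    flow:         (H, W, 2) optical flow from reference to adjacent image
    """
    h, w = image_size
    baseline = np.linalg.norm(t_ref_to_nei)                    # camera-centre distance
    ys, xs = np.mgrid[0:h, 0:w]
    tx, ty = xs + flow[..., 0], ys + flow[..., 1]              # flow targets in the adjacent image
    inside = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    return baseline > min_baseline_m and inside.mean() > min_overlap
```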
In the second aspect, for a plurality of frame images acquired from consecutive RGB images, an image acquired first may be regarded as a reference image, or an image acquired later may be regarded as a reference image. In general, if the number of adjacent images is one, an image acquired first may be used as a reference image; if the number of adjacent images is plural, the intermediate acquired image may be used as a reference image, so that both the front and rear image information can be considered. It should be noted that, regardless of the sequence of the reference image and the adjacent image, the pose obtained directly or obtained by calculation is usually the pose from the reference image to the adjacent image, and the obtained target optical flow information is usually the target optical flow information from the reference image to the adjacent image.
Alternatively, in a third aspect, the trained depth optical flow prediction model may be pre-trained as follows: acquire a history reference image and its history depth information, history adjacent images of the history reference image, and history optical flow information from the history reference image to the history adjacent images, and take the history reference image, the history adjacent images, the history depth information, and the history optical flow information as one group of training samples; then construct an initial depth optical flow prediction model and train it on a number of such training samples to generate the depth optical flow prediction model. Meanwhile, the history pose from the history reference image to the history adjacent images can be acquired and input into the initial depth optical flow prediction model together with the samples. Optionally, the overlap ratio between the history reference image and the history adjacent image is within a preset overlap range, and/or their baseline distance is within a preset distance range. During training, the model parameters of the initial depth optical flow prediction model are updated by error back-propagation, and training can be considered complete when the loss curve converges and/or the number of iterations reaches a set threshold. The history pose from the history reference image to the history adjacent images can be obtained in various ways: for example, the downloaded training samples may already include the history pose; or the history pose from each history reference image to each history adjacent image is computed by a SLAM system; or it is computed by feature point matching; and so on, without specific limitation here.
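A minimal training step consistent with this description is sketched below; the L1 losses, their equal weighting, and the model's (reference, adjacent, pose) interface are assumptions, since the text only specifies supervision with history depth and optical flow plus error back-propagation.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, sample):
    """One supervised step on a training sample
    (history reference image, history adjacent image, history pose,
     history depth ground truth, history optical flow ground truth)."""
    ref, nei, pose, depth_gt, flow_gt = sample
    pred_depth, pred_flow = model(ref, nei, pose)          # assumed model interface
    # Assumed loss: equally weighted L1 terms on both outputs.
    loss = F.l1_loss(pred_depth, depth_gt) + F.l1_loss(pred_flow, flow_gt)
    optimizer.zero_grad()
    loss.backward()                                        # error back-propagation through both branches
    optimizer.step()
    return loss.item()
```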
In an optional technical scheme, the optical flow prediction network may include a correlation layer, which improves the matching performance of the optical flow prediction network. Accordingly, after the reference image and the adjacent image are input into the trained depth optical flow prediction model, the monocular-camera-based depth optical flow prediction method may further include: determining, via the correlation layer and based on a preset dot product operation, the matching relation between corresponding pixels of the seventh feature map extracted from the reference image and the eighth feature map extracted from the adjacent image, so as to obtain the associated feature map.
The seventh feature map may be obtained by passing the reference image through one or more convolution layers (e.g., the seventh feature map 7 in fig. 2); similarly, the eighth feature map may be obtained by passing the adjacent image through one or more convolution layers (e.g., the eighth feature map 8 in fig. 2). The matching relation between corresponding pixels of the seventh and eighth feature maps can then be determined by the correlation layer based on a preset dot product operation, yielding the associated feature map. Optionally, determining this matching relation may include: acquiring a current pixel point in the seventh feature map and the target pixel point corresponding to it in the eighth feature map, and determining a search area centered on the target pixel point with a first preset value as radius; then calculating, based on the preset dot product operation, the association value between the current pixel point and each current search point in the search area, to obtain the matching relation between the current pixel point and each current search point. Note that the target pixel point corresponding to the current pixel point can be determined directly by position, without feature point matching: in general, an object changes little between the reference image and the adjacent image, so the current search point most strongly associated with the current pixel point can be found directly within the search area around the corresponding target pixel point.
For example, as shown in fig. 3, the left side of fig. 3 shows the seventh feature map 7 and the right side shows the eighth feature map 8; the target pixel point r2 corresponds to the current pixel point r1. The search area 81 around the target pixel point r2 is determined by the first preset value d (here d=1); its side length is D = 2d+1, and it contains the current search points r2 and q1-q8. Thus D² association values can be calculated for each current pixel point r1 in the seventh feature map 7. Furthermore, if the dimensions of the seventh feature map 7 and the eighth feature map 8 are both H×W×C, the dimensions of the associated feature map may be H×W×D², where the D² values respectively represent the probability that the current pixel point matches each current search point.
On the basis, optionally, calculating the association value of each current search point and the current pixel point in the search area based on the preset dot product operation respectively may specifically include: for a current search point in the search area, calculating an association value of the current search point and the current pixel point according to the following formula:
$$c(x_1, x_2) = \sum_{o \in [-k,k] \times [-k,k]} \left\langle F_{ref}(x_1 + o),\; F_{nei}(x_2 + o) \right\rangle$$
where c(x_1, x_2) is the association value, x_1 is a position in the seventh feature map F_ref extracted from the reference image, x_2 is a position in the eighth feature map F_nei extracted from the adjacent image, and k is a second preset value. If the dimensions of the seventh feature map 7 and the eighth feature map 8 are both H×W×C, then F_ref(x_1+o) denotes the C-dimensional vector of F_ref at x_1+o, and ⟨a, b⟩ denotes the dot product of vectors a and b. The larger c(x_1, x_2) is, the more similar the two patches are, i.e., the greater the probability that x_1 and x_2 match. For example, as shown in fig. 3, for the region 82 (patch) centered on the current search point q1 with radius k and the region 72 (patch) centered on the current pixel point r1 with radius k, c(r1, q1) of regions 72 and 82 can be calculated by the above formula; k is the radius of each patch, the size of each patch is (2k+1)×(2k+1), and in fig. 3, k=1.
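A direct (unoptimized) PyTorch sketch of this correlation is given below; it enumerates the (2d+1)² search offsets explicitly and uses average pooling to sum over the (2k+1)×(2k+1) patch, which is an implementation convenience rather than anything prescribed by the text.

```python
import torch
import torch.nn.functional as F

def correlation(f_ref, f_nei, d=1, k=1):
    """Associated feature map: for every position x1 of the seventh feature map,
    correlate the (2k+1)x(2k+1) patch around x1 with the patches around the
    (2d+1)^2 search points centred at the same location in the eighth feature
    map.  f_ref, f_nei: (B, C, H, W).  Returns (B, (2d+1)^2, H, W)."""
    b, c, h, w = f_ref.shape
    pad_nei = F.pad(f_nei, (d, d, d, d))                              # zero-pad so shifts stay in bounds
    outputs = []
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            shifted = pad_nei[:, :, d + dy:d + dy + h, d + dx:d + dx + w]
            prod = (f_ref * shifted).sum(dim=1, keepdim=True)          # <F_ref(x1), F_nei(x2)> per pixel
            patch_sum = F.avg_pool2d(prod, 2 * k + 1, stride=1,
                                     padding=k) * (2 * k + 1) ** 2     # sum over the patch offsets o
            outputs.append(patch_sum)
    return torch.cat(outputs, dim=1)
```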
On the basis of the above technical schemes, because the correlation layer searches for matches within a patch, a larger patch is usually needed to find the best matching point when the viewpoint changes greatly between the two frames, which increases the computational complexity and makes matching errors more likely. To solve this, an epipolar layer can be added; the additional epipolar constraint significantly improves matching accuracy, i.e., the optical flow prediction network can learn more accurate optical flow. Therefore, optionally, the optical flow prediction network may further include an epipolar layer, and the epipolar feature map output by the epipolar layer is fused with the associated feature map. For example, the epipolar feature map and the associated feature map may be directly spliced; or, if the seventh feature map has been convolved to obtain the ninth feature map, the associated feature map, the ninth feature map, and the epipolar feature map may all be spliced together, which adds image information of the reference image to the optical flow prediction. The width and height of the epipolar feature map and the associated feature map are identical to those of the seventh and eighth feature maps.
On this basis, optionally, the epipolar layer may output the epipolar feature map as follows: acquire the epipolar line, on the eighth feature map, of a reference pixel point in the seventh feature map, and acquire each adjacent pixel point in the adjacent area of the eighth feature map corresponding to the reference pixel point; then calculate the distance between each adjacent pixel point and the epipolar line to obtain the epipolar feature map. Epipolar geometry describes the geometric relationship between two views of the same scene; according to epipolar geometry, pixels matched across views should fall on the corresponding epipolar lines. For example, fig. 4 illustrates the epipolar geometry constraint: O_1 and O_2 are the optical centers of the monocular camera at two moments, P is a three-dimensional point in space, and p_1 and p_2 are the projections of P onto the two image frames (the reference image and the adjacent image). The three points O_1, O_2, and P define an epipolar plane, and the intersections of this plane with the two image planes are the epipolar lines. When the depth of P is uncertain, the matching point of p_1 on the adjacent image moves along the epipolar line.
Based on this, once the intrinsic parameters K of the monocular camera and the pose T_{ref,nei} from the reference image to the adjacent image are known, the epipolar line l_{epip} of a reference pixel point (in the seventh feature map) on the eighth feature map can be calculated. Taking fig. 4 as an example, O_1, O_2, and p_1 are known and P is unknown, but P must lie on the ray through O_1 and p_1; the epipolar plane defined by O_1, O_2, and P is therefore determined, and since the image plane is also determined, the intersection of the two planes, i.e., the epipolar line, is determined. Note that different reference pixel points of the seventh feature map correspond to different epipolar lines on the eighth feature map. If the reference pixel point is the current pixel point, then the adjacent pixel points in its corresponding adjacent area on the eighth feature map are the D² current search points; the perpendicular distance from each adjacent pixel point to the epipolar line then yields the epipolar feature map, which has the same width and height as the seventh feature map and can be spliced with the associated feature map before the subsequent convolution operations.
On this basis, the perpendicular distance from an adjacent pixel point to the epipolar line can represent the probability that the adjacent pixel point matches the reference pixel point: the smaller the distance, the more likely the adjacent pixel point lies on the epipolar line, and when the distance is 0 the adjacent pixel point is on the epipolar line and is the matching point of the reference pixel point. However, this distance is a pixel distance, and its values can differ considerably from pixel to pixel; to obtain a better constraint effect, the distances can be normalized into a fixed range with a common Gaussian distribution. For example, calculating the distance between each adjacent pixel point and the epipolar line to obtain the epipolar feature map may include: calculating the distance between each adjacent pixel point and the epipolar line, and transforming the distance based on a preset Gaussian distribution to obtain the epipolar feature map. Denote each adjacent pixel point as u_nei and assume that the probability that u_nei is the matching pixel follows a Gaussian distribution of the perpendicular distance d, for example proportional to exp(-d²/(2σ²)); these values can be arranged into a 1-dimensional vector, and because each reference pixel point corresponds to D² adjacent pixel points, each such vector has D² values.
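The epipolar feature of a single reference pixel could be computed as sketched below, using the fundamental matrix built from K and the pose; the Gaussian form exp(-d²/(2σ²)) and the value of σ are assumptions standing in for the "preset Gaussian distribution".

```python
import numpy as np

def skew(t):
    """Cross-product (skew-symmetric) matrix of a 3-vector t."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def epipolar_feature(u_ref, K, R, t, d=1, sigma=1.0):
    """Epipolar feature of one reference pixel: Gaussian-transformed perpendicular
    distances from the (2d+1)^2 neighbouring search points in the adjacent
    feature map to the epipolar line of u_ref."""
    F = np.linalg.inv(K).T @ skew(t) @ R @ np.linalg.inv(K)   # fundamental matrix, reference -> adjacent
    line = F @ np.array([u_ref[0], u_ref[1], 1.0])            # epipolar line l_epip = (a, b, c)
    a, b, c = line
    feats = []
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            u, v = u_ref[0] + dx, u_ref[1] + dy               # neighbouring search point
            dist = abs(a * u + b * v + c) / np.sqrt(a * a + b * b)
            feats.append(np.exp(-dist ** 2 / (2 * sigma ** 2)))   # assumed Gaussian transform
    return np.array(feats)                                    # D*D values, larger = closer to the line
```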
Example 2
Fig. 5 is a flowchart of a depth optical flow prediction method based on a monocular camera according to a second embodiment of the present invention. The present embodiment is optimized based on the above technical solutions. In this embodiment, optionally, the depth optical flow information interaction module includes a depth information interaction sub-module connected to the depth prediction network and an optical flow information interaction sub-module connected to the optical flow prediction network; accordingly, after inputting the reference image and the neighboring image into the trained deep optical flow prediction model, the method further comprises: extracting original optical flow information of a second feature map in the optical flow prediction network through an optical flow information interaction sub-module, and generating intermediate depth information according to the original optical flow information and the pose of a reference image to an adjacent image; extracting original depth information of a first feature map in a depth prediction network through a depth information interaction sub-module, and generating intermediate optical flow information according to the original depth information and the pose; the intermediate depth information sent by the optical flow information interaction sub-module is received through the depth information interaction sub-module, and the original depth information and the intermediate depth information are fused to obtain a third feature map to be spliced with the first feature map; and receiving the intermediate optical flow information sent by the depth information interaction sub-module through the optical flow information interaction sub-module, and fusing the original optical flow information and the intermediate optical flow information to obtain a fourth feature map to be spliced with the second feature map. Wherein, the explanation of the same or corresponding terms as the above embodiments is not repeated herein.
Referring to fig. 5, the method of this embodiment may specifically include the following steps:
s210, acquiring a reference image and an adjacent image, and inputting the reference image and the adjacent image into a trained depth optical flow prediction model, wherein the depth optical flow prediction model comprises a depth prediction network, an optical flow prediction network and a depth optical flow information interaction module connected with the depth prediction network and the optical flow prediction network respectively, and the depth optical flow information interaction module comprises a depth information interaction sub-module connected with the depth prediction network and an optical flow information interaction sub-module connected with the optical flow prediction network.
Based on equation (1), the depth information and the optical flow information are linked together through the pose T_{ref,nei}, where T_{ref,nei} is the pose from the reference image to the adjacent image:

$$I_{flow} = \pi\!\left(T_{ref,nei} \cdot \pi^{-1}(u_{ref}, I_{dep})\right) - u_{ref} \tag{1}$$

Equations (2) and (3) are the projection model and the back-projection model of the monocular camera, respectively:

$$\pi\!\left([X, Y, Z]^{T}\right) = \left[\, f_x \tfrac{X}{Z} + c_x,\; f_y \tfrac{Y}{Z} + c_y \,\right]^{T} \tag{2}$$

$$\pi^{-1}(u_{ref}, I_{dep}) = I_{dep} \cdot \left[\, \tfrac{u - c_x}{f_x},\; \tfrac{v - c_y}{f_y},\; 1 \,\right]^{T} \tag{3}$$

where f_x, f_y, c_x, c_y are the camera intrinsic parameters, u_ref (corresponding to [u v 1]) is the pixel coordinate of a pixel point on the reference image (referred to here as the first pixel point), I_dep (corresponding to Z) is the depth of the first pixel point, X, Y, Z are coordinates in the camera coordinate system, and I_flow is the optical flow from the first pixel point to its matching second pixel point on the adjacent image. The projection model maps a 3D point to a 2D pixel, and the back-projection model recovers the 3D point from a 2D pixel and its depth. From equation (3), π^{-1}(u_ref, I_dep) is the coordinate of the first pixel point in the first camera coordinate system, and T_{ref,nei} · π^{-1}(u_ref, I_dep) is its coordinate in the second camera coordinate system; the first camera coordinate system is the one in which the reference image was captured, and the second is the one in which the adjacent image was captured. From equation (2), π(T_{ref,nei} · π^{-1}(u_ref, I_dep)) is the pixel coordinate of the second pixel point; subtracting the pixel coordinate u_ref of the first pixel point yields the optical flow information I_flow.
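Equations (1)-(3) translate directly into the per-pixel sketch below (numpy, with the pose given as rotation R and translation t); a network would apply the same mapping densely over the raw depth map.

```python
import numpy as np

def project(P, K):
    """Equation (2): camera projection of a 3-D point P = (X, Y, Z)."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    X, Y, Z = P
    return np.array([fx * X / Z + cx, fy * Y / Z + cy])

def back_project(u_ref, depth, K):
    """Equation (3): 3-D point of a pixel u_ref with depth I_dep."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = u_ref
    return depth * np.array([(u - cx) / fx, (v - cy) / fy, 1.0])

def flow_from_depth(u_ref, depth, K, R, t):
    """Equation (1): optical flow of the first pixel point, obtained by
    back-projecting it, transforming it with the pose T_ref,nei = (R, t),
    re-projecting it into the adjacent view, and subtracting u_ref."""
    P_ref = back_project(u_ref, depth, K)      # first camera coordinate system
    P_nei = R @ P_ref + t                      # second camera coordinate system
    return project(P_nei, K) - np.array(u_ref)
```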
In order to perform joint optimization on the optical flow information and the depth information by utilizing the correlation, the depth optical flow information interaction module in the depth optical flow prediction model can comprise a depth information interaction sub-module connected with a depth prediction network and an optical flow information interaction sub-module connected with the optical flow prediction network, so that single-view structure priori and multi-view stereoscopic vision are fully utilized to improve the prediction precision of the optical flow prediction network and the depth prediction network.
S220, extracting original optical flow information of a second feature map in the optical flow prediction network through an optical flow information interaction sub-module, generating intermediate depth information according to the original optical flow information and the pose of a reference image to an adjacent image, extracting original depth information of a first feature map in the depth prediction network through a depth information interaction sub-module, and generating intermediate optical flow information according to the original depth information and the pose.
The original optical flow information of the second feature map in the optical flow prediction network can be extracted through the optical flow information interaction sub-module connected to the optical flow prediction network, and the intermediate depth information can be generated by triangulation via SVD decomposition. Similarly, the original depth information of the first feature map in the depth prediction network can be extracted through the depth information interaction sub-module connected to the depth prediction network, and the intermediate optical flow information can be generated according to equation (1). Note that the number of depth optical flow information interaction modules may be one or more; when there are several, their scale information differs from one another. The interaction module at each scale only processes the first and second feature maps of the corresponding scale, so the modules at different scales handle first and second feature maps, and hence original depth and optical flow information, of different sizes.
For example, fig. 6 is a schematic diagram of a depth optical flow information interaction module whose scale information is n. The first feature map and the second feature map at this scale each pass through one or more convolution layers to generate the original depth information and the original optical flow information, respectively. The intermediate optical flow information can then be calculated from the original depth information and the pose T_{ref,nei} according to equation (1), and the intermediate depth information can be determined from the original optical flow information and the pose T_{ref,nei} by triangulation via SVD decomposition.
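For one reference pixel, the triangulation of intermediate depth from the raw flow and the pose could be carried out with a standard DLT/SVD solve as sketched below; the dense, batched version used inside the module is not spelled out in the text, so this per-pixel form is illustrative.

```python
import numpy as np

def triangulate_depth(u_ref, flow, K, R, t):
    """Intermediate depth of one reference pixel, triangulated from its raw
    optical flow and the pose T_ref,nei via an SVD (DLT) solve."""
    u_nei = (u_ref[0] + flow[0], u_ref[1] + flow[1])       # matched pixel in the adjacent view
    P0 = K @ np.hstack((np.eye(3), np.zeros((3, 1))))      # reference camera projection matrix
    P1 = K @ np.hstack((R, t.reshape(3, 1)))               # adjacent camera projection matrix
    A = np.stack([
        u_ref[0] * P0[2] - P0[0],
        u_ref[1] * P0[2] - P0[1],
        u_nei[0] * P1[2] - P1[0],
        u_nei[1] * P1[2] - P1[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    X = X[:3] / X[3]                                       # homogeneous -> Euclidean
    return X[2]                                            # depth in the reference (first) camera frame
```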
And S230, receiving the intermediate depth information sent by the optical flow information interaction sub-module through the depth information interaction sub-module, and fusing the original depth information and the intermediate depth information to obtain a third feature map to be spliced with the first feature map.
The depth information interaction sub-module receives the intermediate depth information sent by the optical flow information interaction sub-module; the original depth information and the intermediate depth information can then be fused to obtain the third feature map to be spliced with the first feature map, and this third feature map serves as the output of the depth information interaction sub-module. The fusion of the original depth information and the intermediate depth information can be implemented in various ways: for example, the two can be directly spliced and the result convolved to obtain the third feature map; or each can be convolved separately and the convolution results spliced; or each can be convolved separately, the results spliced, and the spliced result convolved again; and so on. The third feature map thus merges the original depth information of the depth prediction network with the intermediate depth information converted, through the depth optical flow information interaction module, from the optical flow prediction network. Optionally, after the first feature map and the third feature map are spliced, the subsequent operations of the depth prediction network, such as a deconvolution, can be performed.
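One of the fusion variants listed above (convolve each input separately, splice, convolve again) might be realized as in the sketch below; the channel counts and the use of single 3×3 convolutions are assumptions.

```python
import torch
import torch.nn as nn

class DepthInteractionSketch(nn.Module):
    """Convolve the raw depth and the intermediate depth separately, splice the
    results, convolve once more to obtain the third feature map, and splice it
    onto the first feature map before the decoder continues."""
    def __init__(self, feat_channels=64):
        super().__init__()
        self.conv_raw = nn.Conv2d(1, 16, 3, 1, 1)
        self.conv_mid = nn.Conv2d(1, 16, 3, 1, 1)
        self.fuse = nn.Conv2d(32, feat_channels, 3, 1, 1)

    def forward(self, first_feature_map, raw_depth, intermediate_depth):
        third = self.fuse(torch.cat((self.conv_raw(raw_depth),
                                     self.conv_mid(intermediate_depth)), dim=1))
        return torch.cat((first_feature_map, third), dim=1)   # input to the next deconvolution
```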
S240, receiving the intermediate optical flow information sent by the depth information interaction sub-module through the optical flow information interaction sub-module, and fusing the original optical flow information and the intermediate optical flow information to obtain a fourth feature map to be spliced with the second feature map.
The generating process of the fourth feature map is similar to that of the third feature map, and the following additional fusion mode is available: residual optical flow features can be learned from the original optical flow information and/or the intermediate optical flow information, the original optical flow features and the intermediate optical flow features are respectively generated with the help of the residual optical flow features, and the fourth feature map to be spliced with the second feature map is obtained from the original optical flow features and the intermediate optical flow features by splicing, or by convolution after splicing. Further, optionally, after the second feature map and the fourth feature map are spliced, a subsequent operation of the optical flow prediction network, such as a deconvolution operation, may be performed.
S250, respectively predicting target depth information of the reference image and target optical flow information from the reference image to the adjacent image according to an output result of the depth optical flow prediction model.
According to the technical scheme, the depth optical flow information interaction module comprises a depth information interaction sub-module connected with the depth prediction network and an optical flow information interaction sub-module connected with the optical flow prediction network. Through the cooperation of the sub-modules and the networks, the original depth information extracted from the depth prediction network and the intermediate depth information converted from the optical flow prediction network by the depth optical flow information interaction module can be fused to generate a third feature map; after this third feature map, which carries information derived from the original optical flow, is spliced with the first feature map in the depth prediction network, the combined optimization of depth prediction and optical flow prediction can be realized. The generation process of the fourth feature map is similar, and after the fourth feature map, which carries information derived from the original depth, is spliced with the second feature map in the optical flow prediction network, the combined optimization of depth prediction and optical flow prediction can likewise be realized.
In an optional technical solution, fusing the original optical flow information and the intermediate optical flow information to obtain a fourth feature map to be spliced with the second feature map may specifically include: obtaining an original optical flow feature map of the original optical flow information and an intermediate optical flow feature map of the intermediate optical flow information respectively, and fusing the original optical flow feature map and the intermediate optical flow feature map to obtain the fourth feature map to be spliced with the second feature map. The original optical flow feature map can compensate for errors when learning the original optical flow information, and the intermediate optical flow feature map can compensate for errors when learning the intermediate optical flow information. The original optical flow feature map and the intermediate optical flow feature map can then be fused; for example, they can be spliced together and passed through one convolution layer to obtain a fourth feature map with higher precision after the optical flow information interaction, or the fourth feature map can be obtained by directly splicing the two feature maps. Various schemes can be adopted to realize the fusion of the original optical flow feature map and the intermediate optical flow feature map, which is not limited herein. The fourth feature map may be the output of the optical flow information interaction sub-module.
Illustratively, obtaining the original optical flow feature map of the original optical flow information may specifically include: scaling the reference image and the adjacent image according to the scale information of the depth optical flow information interaction module to obtain a reference scaled image and an adjacent scaled image, respectively; projecting the adjacent scaled image onto the reference scaled image according to the original optical flow information, and fusing the projection result with the reference scaled image to obtain a residual optical flow feature map; and fusing the residual optical flow feature map with the original optical flow information to obtain the original optical flow feature map of the original optical flow information. If the scale information is n, the reference image and the adjacent image may be downscaled by a factor of n, where n may be, for example, 8, 4, or 2. To understand the above steps more clearly, a specific implementation is described below taking FIG. 6 as an example. First, the reference scaled image and the adjacent scaled image are obtained. Based on the original optical flow information, the adjacent scaled image is projected (warped) onto the reference scaled image to obtain a first projection image; if the original optical flow information is sufficiently accurate, the values of the corresponding pixel points in the first projection image and the reference scaled image are essentially consistent. Further, the first projection image and the reference scaled image may be fused to obtain the residual optical flow feature map; for example, after the first projection image and the reference scaled image are spliced together, the residual optical flow feature map can be obtained through three convolution layers, so that errors in learning the original optical flow information can be compensated. Correspondingly, the original optical flow information can also be learned through three convolution layers, and the convolution result fused with the residual optical flow feature map to obtain the original optical flow feature map of the original optical flow information; for example, the convolution result of the original optical flow information and the residual optical flow feature map are spliced together, and the original optical flow feature map can be obtained after one further convolution layer. The number of convolution layers used in obtaining the original optical flow feature map is not limited here, and may be one layer or multiple layers.
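As an informal sketch of the projection (warp) step described above, the snippet below uses bilinear sampling to project the adjacent scaled image onto the reference scaled image according to a flow field. The use of torch.nn.functional.grid_sample and the tensor shapes are assumptions about one possible implementation, not the patented one.

import torch
import torch.nn.functional as F

def warp_to_reference(adjacent_scaled: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp the adjacent scaled image to the reference view using an optical flow field.

    adjacent_scaled: (B, C, H, W) image tensor
    flow:            (B, 2, H, W) flow from reference to adjacent, in pixels
    """
    b, _, h, w = flow.shape
    # base pixel grid of the reference image
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).expand(b, -1, -1, -1)
    # sampling locations in the adjacent image
    sample = grid + flow
    # normalize to [-1, 1] as required by grid_sample
    sample_x = 2.0 * sample[:, 0] / max(w - 1, 1) - 1.0
    sample_y = 2.0 * sample[:, 1] / max(h - 1, 1) - 1.0
    norm_grid = torch.stack((sample_x, sample_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(adjacent_scaled, norm_grid, align_corners=True)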
Similarly, for the intermediate optical flow information calculated based on the original depth information, similar steps can be used to obtain the intermediate optical flow feature map of the intermediate optical flow information. Taking FIG. 6 as an example, the adjacent scaled image is projected (warped) onto the reference scaled image according to the intermediate optical flow information to obtain a second projection image. Further, the second projection image and the reference scaled image can be fused to obtain a residual optical flow feature map; for example, after the second projection image and the reference scaled image are spliced together, the residual optical flow feature map can be obtained through three convolution layers, so that errors in learning the intermediate optical flow information can be compensated. Correspondingly, the intermediate optical flow information can also be learned through three convolution layers, and the convolution result fused with the residual optical flow feature map to obtain the intermediate optical flow feature map of the intermediate optical flow information; for example, the convolution result of the intermediate optical flow information and the residual optical flow feature map are spliced together, and the intermediate optical flow feature map can be obtained after one further convolution layer. The number of convolution layers used in obtaining the intermediate optical flow feature map is not limited here, and may be one layer or multiple layers.
On the basis of the above technical solution, optionally, extracting the original depth information of the first feature map in the depth prediction network via the depth information interaction sub-module may specifically include: convolving the intermediate depth information to obtain a fifth feature map, and fusing the fifth feature map with a sixth feature map in the depth prediction network to obtain the first feature map; and extracting the original depth information of the first feature map via the depth information interaction sub-module. In practical applications, the steps in the optical flow prediction network may be performed first; after the intermediate depth information is obtained from the original optical flow information, the intermediate depth information may be convolved to obtain the fifth feature map, and the fifth feature map may be fused with the sixth feature map in the depth prediction network, for example by splicing the two and performing a convolution and/or deconvolution operation on the splicing result to obtain the first feature map. In this way, the original depth information in the first feature map is depth information already fused with the original optical flow information, which further enhances the effect of the joint optimization of depth prediction and optical flow prediction.
Example III
Fig. 7 is a flowchart of a depth optical flow prediction method based on a monocular camera according to a third embodiment of the present invention. The present embodiment is optimized on the basis of the second embodiment. In this embodiment, optionally, the number of adjacent images is at least two, and extracting the original optical flow information of the second feature map in the optical flow prediction network and generating the intermediate depth information according to the original optical flow information and the pose from the reference image to the adjacent image may specifically include: respectively extracting the original optical flow information of each second feature map in the optical flow prediction network, and establishing a linear equation set according to each piece of original optical flow information and the pose from the reference image to each adjacent image to generate the intermediate depth information. Explanations of terms that are the same as or correspond to those in the above embodiments are not repeated herein.
Referring to fig. 7, the method of this embodiment may specifically include the following steps:
S310, acquiring a reference image and an adjacent image, and inputting the reference image and the adjacent image into a trained depth optical flow prediction model, wherein the depth optical flow prediction model comprises a depth prediction network, an optical flow prediction network and a depth optical flow information interaction module connected with the depth prediction network and the optical flow prediction network respectively, the depth optical flow information interaction module comprises a depth information interaction sub-module connected with the depth prediction network and an optical flow information interaction sub-module connected with the optical flow prediction network, and the number of the adjacent images is at least two.
For multi-view images {I_ref, I_nei,1, I_nei,2, …, I_nei,N}, in some embodiments the depth optical flow prediction network includes one depth prediction network and a plurality of optical flow prediction networks, where each optical flow prediction network takes as input one set consisting of the reference image and a corresponding adjacent image, so that the plurality of optical flow prediction networks respectively take as input the plurality of sets of the reference image and its corresponding adjacent images; the parameters of the plurality of optical flow prediction networks may be the same or different. In other embodiments, the depth optical flow prediction network includes one depth prediction network and one optical flow prediction network, where the single optical flow prediction network takes as input the multiple sets of the reference image and the corresponding adjacent images; in this case the optical flow prediction network is reused multiple times with unchanged parameters.
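A minimal sketch of the second variant (one optical flow network reused across all adjacent images with fixed parameters) might look as follows; depth_net and flow_net are placeholders for the trained sub-networks, and feeding the flow network a channel-wise concatenation of the image pair is an assumption made for illustration.

import torch

def predict_multi_view(depth_net, flow_net, ref_img, nei_imgs):
    """Run one depth network once and reuse one flow network per adjacent image.

    depth_net, flow_net: trained sub-networks (placeholders here)
    ref_img:  (B, 3, H, W) reference image
    nei_imgs: list of (B, 3, H, W) adjacent images
    """
    depth = depth_net(ref_img)  # target depth of the reference image
    flows = []
    for nei in nei_imgs:
        # same module object -> shared, unchanged parameters for every pair
        flows.append(flow_net(torch.cat([ref_img, nei], dim=1)))
    return depth, flows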
S320, extracting original optical flow information of each second feature map in the optical flow prediction network through an optical flow information interaction sub-module, establishing a linear equation set according to the original optical flow information and the pose of the reference image to each adjacent image to generate intermediate depth information, extracting original depth information of the first feature map in the depth prediction network through a depth information interaction sub-module, and generating intermediate optical flow information according to the original depth information and the pose of the reference image to each adjacent image.
In the depth optical flow information interaction module, the original optical flow information of each second feature map in the optical flow prediction network can be extracted through the optical flow information interaction sub-module; each second feature map may come from the same optical flow prediction network or from different optical flow prediction networks, which is not specifically limited herein. After the original optical flow information is extracted from the second feature maps, a linear equation set can be established according to each piece of original optical flow information and the pose from the reference image to the corresponding adjacent image; that is, all the matched pixel points in every piece of original optical flow information are combined, and least squares regression is used to obtain, in a best-fit manner, the intermediate depth information of all the matched points. Taking the case shown in FIG. 8, in which the second feature maps come from different optical flow prediction networks and the number of optical flow prediction networks is 3, the original optical flow information A, B, and C can be extracted by the optical flow information interaction sub-module respectively. According to these three pieces of original optical flow information and the pose A from the reference image corresponding to the original optical flow information A to the corresponding adjacent image, the pose B corresponding to the original optical flow information B, and the pose C corresponding to the original optical flow information C, a linear equation set can be established to generate intermediate depth information; that is, one piece of intermediate depth information is generated from the 3 pieces of original optical flow information and the 3 poses, i.e., the depth is predicted from multiple viewing angles.
Further, from one piece of original depth information and the poses T_ref,nei from the reference image to a plurality of adjacent images, the intermediate optical flow information corresponding to each adjacent image can be calculated using 3D-2D projection. That is, one piece of intermediate optical flow information can be generated from one piece of original depth information and the pose from the reference image to one corresponding adjacent image, and therefore a plurality of pieces of intermediate optical flow information can be generated from one piece of original depth information and the poses from the reference image to the plurality of corresponding adjacent images.
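As an informal numerical sketch of this 3D-2D projection, the function below back-projects each reference pixel to a 3D point using its depth, transforms it with the relative pose [R|t], re-projects it into the adjacent view, and takes the displacement as the intermediate optical flow. The pinhole model with intrinsics K and the NumPy-based formulation are assumptions for illustration.

import numpy as np

def flow_from_depth(depth, K, R, t):
    """Intermediate optical flow from depth and relative pose via 3D-2D projection.

    depth: (H, W) depth of the reference image
    K:     (3, 3) camera intrinsics (assumed pinhole model)
    R, t:  rotation (3, 3) and translation (3,) from the reference to the adjacent view
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # (3, H*W)
    # back-project to 3D points in the reference camera frame
    pts3d = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    # transform into the adjacent camera frame and re-project
    proj = K @ (R @ pts3d + t.reshape(3, 1))
    uv = proj[:2] / proj[2:3]
    flow = (uv - pix[:2]).T.reshape(h, w, 2)  # displacement per reference pixel
    return flow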
S330, receiving the intermediate depth information sent by the optical flow information interaction sub-module through the depth information interaction sub-module, and fusing the original depth information and the intermediate depth information to obtain a third feature map to be spliced with the first feature map.
S340, receiving each piece of intermediate optical flow information sent by the depth information interaction sub-module through the optical flow information interaction sub-module, and fusing each piece of original optical flow information and the corresponding intermediate optical flow information to obtain each fourth feature map to be spliced with the corresponding second feature map.
The number of pieces of intermediate optical flow information may be consistent with the number of optical flow prediction networks, or with the number of times one optical flow prediction network is reused, and the corresponding fourth feature map may be obtained after fusing each piece of original optical flow information with the corresponding intermediate optical flow information. Taking the case shown in FIG. 8, in which the number of pieces of intermediate optical flow information is consistent with the number of optical flow prediction networks: intermediate optical flow information A can be obtained from the original depth information and the pose A, intermediate optical flow information B from the original depth information and the pose B, and intermediate optical flow information C from the original depth information and the pose C. Further, the original optical flow information A and the intermediate optical flow information A are fused to obtain a fourth feature map 4A to be spliced with the second feature map 2A, where the original optical flow information A is extracted from the second feature map 2A; the original optical flow information B and the intermediate optical flow information B are fused to obtain a fourth feature map 4B to be spliced with the second feature map 2B, where the original optical flow information B is extracted from the second feature map 2B; and so on.
S350, respectively predicting target depth information of the reference image and target optical flow information of the reference image to each adjacent image according to the output result of the depth optical flow prediction model.
According to the technical scheme provided by the embodiment of the invention, the number of adjacent images is at least two. After the plurality of pieces of original optical flow information are extracted from the plurality of second feature maps of the optical flow prediction network through the optical flow information interaction sub-module, a linear equation set can be established according to each piece of original optical flow information and the pose from the corresponding reference image to each adjacent image so as to generate intermediate depth information, realizing the extension from one adjacent image to a plurality of adjacent images; this can save computation time and avoid retraining the network parameters.
On this basis, optionally, the implementation process of establishing a linear equation set to generate the intermediate depth information according to each piece of original optical flow information and the pose from the reference image to each adjacent image may be as follows. Since the original optical flow information encodes the movement of pixel coordinates, for a certain pixel point in the reference image, the pixel points matching it in the plurality of adjacent images can be calculated from the reference image and the plurality of pieces of original optical flow information, so as to obtain a plurality of matched pairs. For example, for a certain pixel point X in the reference image, the matched pixel points X_1, X_2, …, X_N on the plurality of adjacent images can be calculated from the reference image and the pieces of original optical flow information respectively corresponding to the adjacent images; a linear equation set can therefore be established, and the intermediate depth information corresponding to the pixel point X can be solved. That is, when estimating one piece of intermediate depth information from a plurality of pieces of original optical flow information, the intermediate depth information can be generated merely by building a linear equation set from the pieces of original optical flow information and the corresponding poses, without participation of the original depth information. The following describes the workflow of multi-view depth estimation in detail, taking 3 optical flow prediction networks, or 1 optical flow prediction network reused 3 times, as an example:
d·x = T·P      (4)
where d is the original depth information of a certain pixel point in the reference image, x represents the pixel normalized plane homogeneous coordinates [u, v, 1] of that pixel point, T represents the monocular camera pose (denoted [R|t]), in which R is the rotation matrix and t is the translation vector, and P represents the homogeneous coordinates [X, Y, Z, 1] of the corresponding 3D point in the world coordinate system.
Multiplying both sides of formula (4) by the skew-symmetric (cross-product) matrix x^ of x converts it to d·x^·x = x^·T·P; since x^·x = 0 by the matrix operation, this yields

x^·T·P = 0      (5)

Expanding and reducing this form according to matrix multiplication gives

v·(T_3·P) − T_2·P = 0
T_1·P − u·(T_3·P) = 0      (6)

where [u, v, 1] are the pixel normalized plane homogeneous coordinates of the pixel point, and T_1, T_2, T_3 are the three row vectors of the monocular camera pose. Equations (5) and (6) express the constraints among the pixel normalized plane homogeneous coordinates, the camera pose, and the 3D coordinates (depth information) for one frame of image.
When there are two frames of images (i.e., a reference image and an adjacent image), equation (6) can be extended to

v·(T_3·P) − T_2·P = 0
T_1·P − u·(T_3·P) = 0
v'·(T'_3·P) − T'_2·P = 0
T'_1·P − u'·(T'_3·P) = 0      (7)

where [u, v, 1] are the pixel normalized plane homogeneous coordinates of a pixel point on the reference image, [u', v', 1] are the pixel normalized plane homogeneous coordinates of the matching point on the adjacent image computed from the original optical flow information, and T_1, T_2, T_3 and T'_1, T'_2, T'_3 are the three row vectors of the monocular camera pose for the reference image and the adjacent image, respectively. Equation (7) can be solved for the intermediate depth information of the reference image from the optical flow and poses of the two frames of images.
When there are multiple frames of images (i.e., a reference image and multiple frames of adjacent images), equation (7) can be extended to

v·(T_3·P) − T_2·P = 0
T_1·P − u·(T_3·P) = 0
v'·(T'_3·P) − T'_2·P = 0
T'_1·P − u'·(T'_3·P) = 0
v''·(T''_3·P) − T''_2·P = 0
T''_1·P − u''·(T''_3·P) = 0      (8)

Here, two frames of adjacent images are taken as an example: [u, v, 1] are the pixel normalized plane homogeneous coordinates on the reference image, [u', v', 1] and [u'', v'', 1] are the pixel normalized plane homogeneous coordinates of the corresponding matching points on the two adjacent images, and T_1, T_2, T_3, T'_1, T'_2, T'_3, and T''_1, T''_2, T''_3 are the three row vectors of the monocular camera pose for the reference image and the two adjacent images, respectively. According to equation (8), the intermediate depth information can be solved from the original optical flow information from the reference image to the two adjacent images and the monocular camera pose of each frame. Similarly, as the number of adjacent images increases, equation (8) can continue to be extended in the same manner.
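As a rough numerical sketch of solving such a stacked linear system, the function below builds the constraint rows for one reference pixel from its matches in several views and recovers the 3D point, and hence the depth, by a least-squares SVD solve. The homogeneous parameterization and the final de-homogenization step are implementation assumptions.

import numpy as np

def triangulate_depth(pixels, poses):
    """Least-squares depth of one reference pixel from multi-view matches.

    pixels: list of normalized homogeneous pixel coords [u, v, 1], one per view
            (the reference view first, matches found via optical flow after it)
    poses:  list of 3x4 camera poses [R|t], one per view, in the same order
    """
    rows = []
    for (u, v, _), T in zip(pixels, poses):
        T1, T2, T3 = T[0], T[1], T[2]           # the three row vectors of the pose
        rows.append(v * T3 - T2)                # v*(T3 P) - T2 P = 0
        rows.append(T1 - u * T3)                # T1 P - u*(T3 P) = 0
    A = np.stack(rows)                          # (2 * num_views, 4)
    _, _, vt = np.linalg.svd(A)
    P = vt[-1]                                  # homogeneous 3D point, best fit
    P = P / P[3]                                # de-homogenize to [X, Y, Z, 1]
    # depth of the reference pixel = Z coordinate in the reference camera frame
    R_ref, t_ref = poses[0][:, :3], poses[0][:, 3]
    return float((R_ref @ P[:3] + t_ref)[2])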
Example IV
Fig. 9 is a block diagram of a monocular camera-based depth optical flow prediction apparatus according to a fourth embodiment of the present invention, which is configured to perform the monocular camera-based depth optical flow prediction method according to any of the above embodiments. The device belongs to the same inventive concept as the monocular camera-based depth optical flow prediction method in the above embodiments, and reference may be made to the above-described embodiment of the monocular camera-based depth optical flow prediction method for details that are not described in detail in the embodiment of the monocular camera-based depth optical flow prediction device. Referring to fig. 9, the apparatus may specifically include: an input module 410 and a prediction module 420.
The input module 410 is configured to acquire a reference image and an adjacent image, and input the reference image and the adjacent image into the trained depth optical flow prediction model;
a prediction module 420, configured to predict target depth information of the reference image and target optical flow information of the reference image to the neighboring image according to an output result of the depth optical flow prediction model;
the depth optical flow prediction model comprises a depth prediction network, an optical flow prediction network and a depth optical flow information interaction module which is respectively connected with the depth prediction network and the optical flow prediction network.
Optionally, on this basis, the apparatus may further include:
the training sample acquisition module is used for acquiring the historical reference image and the historical depth information of the historical reference image, and the historical adjacent image of the historical reference image and the historical optical flow information from the historical reference image to the historical adjacent image, and taking the historical reference image, the historical adjacent image, the historical depth information and the historical optical flow information as a group of training samples;
the depth optical flow prediction model generation module is used for constructing an initial depth optical flow prediction model, training the initial depth optical flow prediction model based on a plurality of training samples and generating the depth optical flow prediction model.
Optionally, the depth optical flow information interaction module comprises a depth information interaction sub-module connected with the depth prediction network and an optical flow information interaction sub-module connected with the optical flow prediction network;
on this basis, the device may further comprise:
the intermediate depth information generation module is used for extracting original optical flow information of a second feature map in the optical flow prediction network through the optical flow information interaction sub-module and generating intermediate depth information according to the original optical flow information and the pose from the reference image to the adjacent image;
The intermediate optical flow information generation module is used for extracting original depth information of a first feature map in the depth prediction network through the depth information interaction sub-module and generating intermediate optical flow information according to the original depth information and the pose;
the third feature map determining module is used for receiving the intermediate depth information sent by the optical flow information interaction sub-module through the depth information interaction sub-module, and fusing the original depth information and the intermediate depth information to obtain a third feature map to be spliced with the first feature map;
the fourth feature map determining module is used for receiving the intermediate optical flow information sent by the depth information interaction sub-module through the optical flow information interaction sub-module, and fusing the original optical flow information and the intermediate optical flow information to obtain a fourth feature map to be spliced with the second feature map.
Optionally, the fourth feature map determining module may specifically include:
the fourth feature map determining unit is configured to obtain an original optical flow feature map of the original optical flow information and an intermediate optical flow feature map of the intermediate optical flow information, and fuse the original optical flow feature map and the intermediate optical flow feature map to obtain a fourth feature map to be spliced with the second feature map.
Optionally, the fourth feature map determining unit may specifically be configured to:
scaling the reference image and the adjacent image according to the scale information of the depth optical flow information interaction module to respectively obtain a reference scaled image and an adjacent scaled image;
projecting adjacent scaled images onto a reference scaled image according to original optical flow information, and fusing a projection result and the reference scaled image to obtain a residual optical flow characteristic map;
and fusing the residual optical flow feature map and the original optical flow information to obtain an original optical flow feature map of the original optical flow information.
Optionally, the intermediate optical flow information generating module may be specifically configured to:
convolving the intermediate depth information to obtain a fifth feature map, and fusing the fifth feature map and a sixth feature map in a depth prediction network to obtain a first feature map;
the original depth information of the first feature map is extracted via the depth information interaction sub-module.
Optionally, the optical flow prediction network includes an association layer; on this basis, the apparatus may further include:
the associated feature map determining module is used for determining a matching relation of corresponding pixels in the seventh feature map and the eighth feature map based on a preset dot product operation through an associated layer aiming at the seventh feature map extracted from the reference image and the eighth feature map extracted from the adjacent image, and obtaining the associated feature map.
Optionally, the optical flow prediction network further includes an epipolar layer, and an epipolar feature map output by the epipolar layer is fused with the associated feature map.
Optionally, the epipolar layer may output the epipolar feature map by means of the following module:
the polar line characteristic map output module is used for obtaining polar lines of the reference pixel points on the eighth characteristic map in the seventh characteristic map, and each adjacent pixel point in the corresponding adjacent area of the reference pixel points on the eighth characteristic map, and calculating the distance between each adjacent pixel point and the polar line so as to obtain the polar line characteristic map.
Optionally, the epipolar feature map output module may specifically include:
and the distance calculation unit is used for calculating the distance between each adjacent pixel point and the polar line, and converting the distance based on the preset Gaussian distribution so as to obtain the polar line characteristic map.
Optionally, the number of adjacent images is at least two;
on the basis, the intermediate depth information generating module is specifically applicable to:
and respectively extracting original optical flow information of each second feature map in the optical flow prediction network, and establishing a linear equation set according to the original optical flow information and the pose from the reference image to each adjacent image to generate intermediate depth information.
Optionally, the overlapping ratio of the reference image and the adjacent image is within a preset overlapping range, and/or the baseline distance of the reference image and the adjacent image is within a preset distance range.
Optionally, the depth prediction network and/or the optical flow prediction network comprise: convolution layer and deconvolution layer.
Optionally, the number of the depth optical flow information interaction modules is one or more, and when the number of the depth optical flow information interaction modules is a plurality of the depth optical flow information interaction modules, the scale information of each depth optical flow information interaction module is different from each other.
According to the depth optical flow prediction device based on the monocular camera, the input module and the prediction module are matched with each other, so that a reference image and an adjacent image can be obtained, the reference image and the adjacent image are input into the trained depth optical flow prediction model, and the depth optical flow prediction model is provided with a depth prediction network, an optical flow prediction network and a depth optical flow information interaction module which is respectively connected with the depth prediction network and the optical flow prediction network, so that dense target depth information of the reference image and target optical flow information from the reference image to the adjacent image can be respectively predicted. The device can remarkably improve the prediction precision and the prediction instantaneity of the depth prediction and the optical flow prediction by combining the optimization mode, and achieves the effects of the depth prediction and the optical flow prediction with high efficiency and high precision.
The depth optical flow prediction device based on the monocular camera provided by the embodiment of the invention can execute the depth optical flow prediction method based on the monocular camera provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
It should be noted that, in the embodiment of the monocular camera-based depth optical flow prediction device, each unit and module included are only divided according to the functional logic, but are not limited to the above-mentioned division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.
Example V
Fig. 10 is a schematic structural diagram of an apparatus according to a fifth embodiment of the present invention, and as shown in fig. 10, the apparatus includes a memory 510, a processor 520, an input device 530, and an output device 540. The number of processors 520 in the device may be one or more, one processor 520 being taken as an example in fig. 10; the memory 510, processor 520, input means 530 and output means 540 in the device may be connected by a bus or other means, in fig. 10 by way of example by a bus 550.
The memory 510 is a computer readable storage medium, and may be used to store software programs, computer executable programs, and modules, such as program instructions/modules corresponding to the monocular camera-based depth optical flow prediction method in the embodiment of the present invention (e.g., the input module 410 and the prediction module 420 in the monocular camera-based depth optical flow prediction device). The processor 520 executes various functional applications of the device and data processing by running software programs, instructions and modules stored in the memory 510, i.e., implements the monocular camera-based depth optical flow prediction method described above.
The memory 510 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for functions; the storage data area may store data created according to the use of the device, etc. In addition, memory 510 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, memory 510 may further include memory located remotely from processor 520, which may be connected to the device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the device. The output device 540 may include a display device such as a display screen.
Example VI
A sixth embodiment of the present invention provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are for performing a monocular camera-based depth optical flow prediction method, the method comprising:
acquiring a reference image and an adjacent image, and inputting the reference image and the adjacent image into a trained depth optical flow prediction model;
respectively predicting target depth information of a reference image and target optical flow information from the reference image to an adjacent image according to an output result of the depth optical flow prediction model;
the depth optical flow prediction model comprises a depth prediction network, an optical flow prediction network and a depth optical flow information interaction module which is respectively connected with the depth prediction network and the optical flow prediction network.
Of course, the storage medium containing the computer executable instructions provided in the embodiments of the present invention is not limited to the above-described method operations, and may also perform the related operations in the monocular camera-based depth optical flow prediction method provided in any embodiment of the present invention.
From the above description of the embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software together with the necessary general-purpose hardware, and of course also by hardware alone, although in many cases the former is the preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied, essentially or in part, in the form of a software product, which may be stored in a computer-readable storage medium such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk, and which includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (15)

1. A monocular camera-based depth optical flow prediction method, comprising:
acquiring a reference image and an adjacent image, and inputting the reference image and the adjacent image into a trained depth optical flow prediction model;
respectively predicting target depth information of the reference image and target optical flow information from the reference image to the adjacent image according to an output result of the depth optical flow prediction model;
the depth optical flow prediction model comprises a depth prediction network, an optical flow prediction network and a depth optical flow information interaction module which is respectively connected with the depth prediction network and the optical flow prediction network;
the depth optical flow prediction model is obtained through the following steps:
acquiring a history reference image and history depth information of the history reference image, and history adjacent images of the history reference image and history optical flow information from the history reference image to the history adjacent images, and taking the history reference image, the history adjacent images, the history depth information and the history optical flow information as a group of training samples;
constructing an initial depth optical flow prediction model, training the initial depth optical flow prediction model based on a plurality of training samples, and generating the depth optical flow prediction model;
The depth optical flow information interaction module comprises a depth information interaction sub-module connected with the depth prediction network and an optical flow information interaction sub-module connected with the optical flow prediction network;
after said inputting the reference image and the neighboring image into the trained deep optical flow prediction model, further comprising:
extracting original optical flow information of a second feature map in the optical flow prediction network through the optical flow information interaction sub-module, and generating intermediate depth information according to the original optical flow information and the pose from the reference image to the adjacent image;
extracting original depth information of a first feature map in the depth prediction network through the depth information interaction sub-module, and generating intermediate optical flow information according to the original depth information and the pose;
receiving the intermediate depth information sent by the optical flow information interaction sub-module through the depth information interaction sub-module, and fusing the original depth information and the intermediate depth information to obtain a third feature map to be spliced with the first feature map;
and receiving the intermediate optical flow information sent by the depth information interaction sub-module through the optical flow information interaction sub-module, and fusing the original optical flow information and the intermediate optical flow information to obtain a fourth feature map to be spliced with the second feature map.
2. The method of claim 1, wherein the fusing the original optical flow information and the intermediate optical flow information to obtain a fourth feature map to be stitched with the second feature map comprises:
and respectively obtaining an original optical flow characteristic map of the original optical flow information and an intermediate optical flow characteristic map of the intermediate optical flow information, and fusing the original optical flow characteristic map and the intermediate optical flow characteristic map to obtain a fourth characteristic map to be spliced with the second characteristic map.
3. The method of claim 2, wherein the deriving the raw optical flow feature map of the raw optical flow information comprises:
scaling the reference image and the adjacent image according to the scale information of the depth optical flow information interaction module to respectively obtain a reference scaling image and an adjacent scaling image;
projecting the adjacent scaled images onto the reference scaled image according to the original optical flow information, and fusing a projection result and the reference scaled image to obtain a residual optical flow characteristic diagram;
and fusing the residual optical flow feature map and the original optical flow information to obtain an original optical flow feature map of the original optical flow information.
4. The method of claim 1, wherein extracting, via the depth information interaction sub-module, original depth information of a first feature map in the depth prediction network comprises:
convolving the intermediate depth information to obtain a fifth feature map, and fusing the fifth feature map and a sixth feature map in the depth prediction network to obtain a first feature map;
extracting original depth information of the first feature map via the depth information interaction sub-module.
5. The method of any of claims 1-4, wherein the optical flow prediction network includes an association layer, after the inputting the reference image and the neighboring image into a trained deep optical flow prediction model, further comprising:
and determining the matching relation of corresponding pixels in the seventh feature map and the eighth feature map based on a preset dot product operation by the association layer according to the seventh feature map extracted from the reference image and the eighth feature map extracted from the adjacent image, so as to obtain the association feature map.
6. The method of claim 5, wherein the optical flow prediction network further comprises a epipolar layer, the epipolar feature map output by the epipolar layer being fused with the associated feature map.
7. The method of claim 6, wherein the epipolar layer outputs the epipolar signature by:
acquiring an epipolar line, on the eighth feature map, of a reference pixel point in the seventh feature map, and each adjacent pixel point of the reference pixel point in a corresponding adjacent region on the eighth feature map;
and calculating the distance between each adjacent pixel point and the polar line to obtain the polar line characteristic map.
8. The method of claim 7, wherein said calculating distances of said each adjacent pixel to said epipolar line to obtain said epipolar line signature comprises:
and calculating the distance between each adjacent pixel point and the polar line, and transforming the distance based on a preset Gaussian distribution to obtain the polar line characteristic map.
9. The method according to any one of claims 1 to 4, wherein the number of adjacent images is at least two;
extracting original optical flow information of a second feature map in the optical flow prediction network, and generating intermediate depth information according to the original optical flow information and the pose from the reference image to the adjacent image, wherein the method comprises the following steps:
And respectively extracting original optical flow information of each second feature map in the optical flow prediction network, and establishing a linear equation set according to the original optical flow information and the pose from the reference image to each adjacent image so as to generate intermediate depth information.
10. The method according to claim 1, wherein the overlapping ratio of the reference image and the neighboring image is within a preset overlapping range and/or the baseline distance of the reference image and the neighboring image is within a preset distance range.
11. The method according to any one of claims 1 to 4, wherein the depth prediction network and/or the optical flow prediction network comprises: convolution layer and deconvolution layer.
12. The method according to any one of claims 1 to 4, wherein the number of the depth optical flow information interaction modules is one or more, and when the number of the depth optical flow information interaction modules is a plurality, scale information of each of the depth optical flow information interaction modules is different from each other.
13. A monocular camera-based depth optical flow prediction apparatus, comprising:
the input module is used for acquiring a reference image and an adjacent image and inputting the reference image and the adjacent image into the trained depth optical flow prediction model;
The prediction module is used for respectively predicting the target depth information of the reference image and the target optical flow information from the reference image to the adjacent image according to the output result of the depth optical flow prediction model;
the depth optical flow prediction model comprises a depth prediction network, an optical flow prediction network and a depth optical flow information interaction module which is respectively connected with the depth prediction network and the optical flow prediction network;
the training sample acquisition module is used for acquiring the historical reference image and the historical depth information of the historical reference image, and the historical adjacent image of the historical reference image and the historical optical flow information from the historical reference image to the historical adjacent image, and taking the historical reference image, the historical adjacent image, the historical depth information and the historical optical flow information as a group of training samples;
the depth optical flow prediction model generation module is used for constructing an initial depth optical flow prediction model, training the initial depth optical flow prediction model based on a plurality of training samples and generating a depth optical flow prediction model;
the depth optical flow information interaction module comprises a depth information interaction sub-module connected with the depth prediction network and an optical flow information interaction sub-module connected with the optical flow prediction network;
The intermediate depth information generation module is used for extracting original optical flow information of a second feature map in the optical flow prediction network through the optical flow information interaction sub-module and generating intermediate depth information according to the original optical flow information and the pose from the reference image to the adjacent image;
the intermediate optical flow information generation module is used for extracting original depth information of a first feature map in the depth prediction network through the depth information interaction sub-module and generating intermediate optical flow information according to the original depth information and the pose;
the third feature map determining module is used for receiving the intermediate depth information sent by the optical flow information interaction sub-module through the depth information interaction sub-module, and fusing the original depth information and the intermediate depth information to obtain a third feature map to be spliced with the first feature map;
the fourth feature map determining module is used for receiving the intermediate optical flow information sent by the depth information interaction sub-module through the optical flow information interaction sub-module, and fusing the original optical flow information and the intermediate optical flow information to obtain a fourth feature map to be spliced with the second feature map.
14. An apparatus, the apparatus comprising:
one or more processors;
A memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the monocular camera-based depth optical flow prediction method of any of claims 1-12.
15. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the monocular camera based depth optical flow prediction method according to any one of claims 1-12.
CN201911394005.6A 2019-12-30 2019-12-30 Depth optical flow prediction method, device, equipment and medium based on monocular camera Active CN111127522B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911394005.6A CN111127522B (en) 2019-12-30 2019-12-30 Depth optical flow prediction method, device, equipment and medium based on monocular camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911394005.6A CN111127522B (en) 2019-12-30 2019-12-30 Depth optical flow prediction method, device, equipment and medium based on monocular camera

Publications (2)

Publication Number Publication Date
CN111127522A CN111127522A (en) 2020-05-08
CN111127522B true CN111127522B (en) 2024-02-06

Family

ID=70504963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911394005.6A Active CN111127522B (en) 2019-12-30 2019-12-30 Depth optical flow prediction method, device, equipment and medium based on monocular camera

Country Status (1)

Country Link
CN (1) CN111127522B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709984B (en) * 2020-06-08 2024-02-06 亮风台(上海)信息科技有限公司 Pose depth prediction method, visual odometer device, pose depth prediction equipment and visual odometer medium
CN112085768B (en) * 2020-09-02 2023-12-26 北京灵汐科技有限公司 Optical flow information prediction method, optical flow information prediction device, electronic equipment and storage medium
CN112967228B (en) * 2021-02-02 2024-04-26 中国科学院上海微系统与信息技术研究所 Determination method and device of target optical flow information, electronic equipment and storage medium
CN113613003B (en) * 2021-08-30 2024-03-22 北京市商汤科技开发有限公司 Video compression and decompression methods and devices, electronic equipment and storage medium
CN116228834B (en) * 2022-12-20 2023-11-03 阿波罗智联(北京)科技有限公司 Image depth acquisition method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107945265A (en) * 2017-11-29 2018-04-20 华中科技大学 Real-time dense monocular SLAM method and systems based on on-line study depth prediction network
WO2019174377A1 (en) * 2018-03-14 2019-09-19 大连理工大学 Monocular camera-based three-dimensional scene dense reconstruction method
CN110490928A (en) * 2019-07-05 2019-11-22 天津大学 A kind of camera Attitude estimation method based on deep neural network
CN110555861A (en) * 2019-08-09 2019-12-10 北京字节跳动网络技术有限公司 optical flow calculation method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王正来; 黄敏; 朱启兵; 蒋胜. Optical flow detection method for moving targets based on a deep convolutional neural network. Opto-Electronic Engineering, 2018, (08), full text. *

Also Published As

Publication number Publication date
CN111127522A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
CN111127522B (en) Depth optical flow prediction method, device, equipment and medium based on monocular camera
CN111968129B (en) Instant positioning and map construction system and method with semantic perception
Zhan et al. Visual odometry revisited: What should be learnt?
WO2020182117A1 (en) Method, apparatus, and device for obtaining disparity map, control system, and storage medium
CN110689008A (en) Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction
CN108537837A (en) A kind of method and relevant apparatus of depth information determination
CN111144349B (en) Indoor visual relocation method and system
CN111998862B (en) BNN-based dense binocular SLAM method
Mascaro et al. Diffuser: Multi-view 2d-to-3d label diffusion for semantic scene segmentation
EP3953903A1 (en) Scale-aware monocular localization and mapping
CN104240229A (en) Self-adaptation polarline correcting method based on infrared binocular camera
CN113256699A (en) Image processing method, image processing device, computer equipment and storage medium
CN116128966A (en) Semantic positioning method based on environmental object
Fan et al. Large-scale dense mapping system based on visual-inertial odometry and densely connected U-Net
Lu et al. Single image 3d vehicle pose estimation for augmented reality
CN112270701A (en) Packet distance network-based parallax prediction method, system and storage medium
Kurz et al. Bundle adjustment for stereoscopic 3d
Zhu et al. Multimodal neural radiance field
Li et al. Monocular 3-D Object Detection Based on Depth-Guided Local Convolution for Smart Payment in D2D Systems
Choi et al. Stereo-augmented depth completion from a single rgb-lidar image
Chen et al. End-to-end multi-view structure-from-motion with hypercorrelation volume
Zhao et al. Distance transform pooling neural network for lidar depth completion
CN114608558A (en) SLAM method, system, device and storage medium based on feature matching network
Liu et al. Binocular depth estimation using convolutional neural network with Siamese branches
Lee et al. Globally consistent video depth and pose estimation with efficient test-time training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210923

Address after: Room 501 / 503-505, 570 shengxia Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201203

Applicant after: HISCENE INFORMATION TECHNOLOGY Co.,Ltd.

Applicant after: HUAZHONG University OF SCIENCE AND TECHNOLOGY

Address before: Room 501 / 503-505, 570 shengxia Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201203

Applicant before: HISCENE INFORMATION TECHNOLOGY Co.,Ltd.

TA01 Transfer of patent application right

Effective date of registration: 20211224

Address after: Room 501 / 503-505, 570 shengxia Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201203

Applicant after: HISCENE INFORMATION TECHNOLOGY Co.,Ltd.

Address before: Room 501 / 503-505, 570 shengxia Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201203

Applicant before: HISCENE INFORMATION TECHNOLOGY Co.,Ltd.

Applicant before: Huazhong University of Science and Technology

CB02 Change of applicant information

Address after: 201210 7th Floor, No. 1, Lane 5005, Shenjiang Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai

Applicant after: HISCENE INFORMATION TECHNOLOGY Co.,Ltd.

Address before: Room 501 / 503-505, 570 shengxia Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201203

Applicant before: HISCENE INFORMATION TECHNOLOGY Co.,Ltd.

GR01 Patent grant