CN118071807A - Monocular depth estimation method and apparatus, computer device, and storage medium


Info

Publication number: CN118071807A
Application number: CN202211482163.9A
Authority: CN (China)
Prior art keywords: depth, image, data, target, target image
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 高立钊, 李琛
Assignee (current and original): Tencent Technology Shenzhen Co Ltd

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to a monocular depth estimation method, apparatus, computer device, storage medium and computer program product. The method belongs to the field of artificial intelligence and can be applied to vehicle-mounted scenarios. It comprises the following steps: acquiring a target image and inertial measurement data of the image acquisition device that captures the target image; performing visual-inertial odometry solving based on the inertial measurement data to obtain a solving result, and determining sparse depth points in the target image according to the solving result; performing depth estimation processing on the target image to obtain relative depth data of the target image; and determining absolute depth data of the target image based on the sparse depth points and the relative depth data. By replacing the complex absolute depth estimation pipeline with relative depth estimation, which is easier to predict and computationally cheaper, and then correcting the relative depth with the sparse points determined from the inertial measurement data, the computational cost of depth estimation can be effectively reduced and its efficiency improved.

Description

Monocular depth estimation method and apparatus, computer device, and storage medium
Technical Field
The present application relates to the field of computer technology, and in particular, to a monocular depth estimation method, apparatus, computer device, storage medium, and computer program product.
Background
With the development of computer technology and computer vision, monocular depth estimation has emerged. Depth estimation refers to estimating the distance between each pixel in an image and the shooting source from RGB images under one or multiple viewing angles; monocular depth estimation is the case where a single (monocular) camera serves as the shooting source. The human eye can extract a large amount of depth information from the image information acquired by one eye, thanks to a large amount of prior knowledge. Monocular depth estimation likewise not only requires learning objective depth information from the two-dimensional image, but also requires extracting empirical information from the dataset that is sensitive to the camera and the scene.
Most current monocular depth estimation methods are based on converting two-dimensional RGB images into RGB-D images. However, these methods all need to predict depth in combination with the camera position and pose, so the computational cost is large and the efficiency of depth estimation is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a monocular depth estimation method, apparatus, computer device, computer readable storage medium, and computer program product that can improve the monocular depth estimation efficiency.
In a first aspect, the present application provides a monocular depth estimation method. The method comprises the following steps:
acquiring a target image and inertial measurement data of an image acquisition device that captures the target image;
performing visual-inertial odometry solving based on the inertial measurement data to obtain a solving result, and determining sparse depth points in the target image according to the solving result;
performing depth estimation processing on the target image to obtain relative depth data of the target image;
and determining absolute depth data of the target image based on the sparse depth points and the relative depth data.
In a second aspect, the application further provides a monocular depth estimation device. The device comprises:
a data acquisition module, configured to acquire a target image and inertial measurement data of an image acquisition device that captures the target image;
a sparse point identification module, configured to perform visual-inertial odometry solving based on the inertial measurement data to obtain a solving result, and determine sparse depth points in the target image according to the solving result;
a relative depth estimation module, configured to perform depth estimation processing on the target image to obtain relative depth data of the target image;
and an absolute depth estimation module, configured to determine absolute depth data of the target image based on the sparse depth points and the relative depth data.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring a target image and inertial measurement data of an image acquisition device that captures the target image;
performing visual-inertial odometry solving based on the inertial measurement data to obtain a solving result, and determining sparse depth points in the target image according to the solving result;
performing depth estimation processing on the target image to obtain relative depth data of the target image;
and determining absolute depth data of the target image based on the sparse depth points and the relative depth data.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring a target image and inertial measurement data of an image acquisition device that captures the target image;
performing visual-inertial odometry solving based on the inertial measurement data to obtain a solving result, and determining sparse depth points in the target image according to the solving result;
performing depth estimation processing on the target image to obtain relative depth data of the target image;
and determining absolute depth data of the target image based on the sparse depth points and the relative depth data.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
acquiring a target image and inertial measurement data of an image acquisition device that captures the target image;
performing visual-inertial odometry solving based on the inertial measurement data to obtain a solving result, and determining sparse depth points in the target image according to the solving result;
performing depth estimation processing on the target image to obtain relative depth data of the target image;
and determining absolute depth data of the target image based on the sparse depth points and the relative depth data.
With the monocular depth estimation method, apparatus, computer device, storage medium and computer program product described above, the target image and the inertial measurement data of the image acquisition device that captured it are acquired first; visual-inertial odometry solving is then performed based on the inertial measurement data, and sparse depth points in the target image, i.e. the absolute depth values of a few sparse points, are determined from the solving result; depth estimation processing is performed on the target image to obtain its relative depth data; finally, the relative depth data are corrected based on the absolute depth values of the sparse depth points, thereby determining the absolute depth data of the target image. The complex absolute depth estimation pipeline is thus replaced by relative depth estimation, which is easier to predict and computationally cheaper, and the relative depth is then corrected with the sparse points determined from the inertial measurement data, so that the computational cost of depth estimation is effectively reduced and its efficiency is improved.
Drawings
FIG. 1 is an application environment diagram of a monocular depth estimation method in one embodiment;
FIG. 2 is a flow chart of a monocular depth estimation method in one embodiment;
FIG. 3 is a schematic flow diagram of depth estimation via a deep neural network in one embodiment;
FIG. 4 is a schematic diagram of a model training step in one embodiment;
FIG. 5 is a schematic diagram of the architecture of a lightweight mobile convolutional neural network in one embodiment;
FIG. 6 is a schematic diagram of an image depth map in one embodiment;
FIG. 7 is a schematic overall flow chart of a monocular depth estimation method in one embodiment;
FIG. 8 is a flow chart of a monocular depth estimation method according to another embodiment;
FIG. 9 is a block diagram of a monocular depth estimation device in one embodiment;
Fig. 10 is an internal structural view of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The present application relates to artificial intelligence (AI). AI is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning. The present application relates in particular to computer vision (CV) and machine learning techniques in artificial intelligence.
Computer vision is the science of studying how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as recognition, tracking and measurement on targets, and further performs graphics processing so that the result is more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, simultaneous localization and mapping, and the like, as well as common biometric recognition techniques such as face recognition and fingerprint recognition. Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
The monocular depth estimation method provided by the embodiments of the present application can be applied in the application environment shown in FIG. 1, in which the image acquisition device 102 can communicate with the terminal 104. After the image acquisition device 102 captures an image, the image can be stored on the terminal 104. When the terminal 104 obtains the image and depth estimation needs to be performed on it, the image is taken as the target image; the inertial measurement data of the image acquisition device 102 is then acquired (if the image acquisition device 102 is integrated with the terminal 104, the inertial measurement data of the terminal 104 is acquired directly); visual-inertial odometry solving is performed based on the inertial measurement data to obtain a solving result, and sparse depth points in the target image are determined according to the solving result; depth estimation processing is performed on the target image to obtain relative depth data of the target image; and absolute depth data of the target image is determined based on the sparse depth points and the relative depth data. The image acquisition device 102 may be implemented as a monocular camera. The terminal 104 may be, but is not limited to, a desktop computer, a notebook computer, a smart phone, a tablet computer, an Internet of Things device or a portable wearable device; the Internet of Things device may be a smart speaker, a smart television, a smart air conditioner, a smart vehicle-mounted device, and the like, and the portable wearable device may be a smart watch, a smart bracelet, a head-mounted device, and the like.
In one embodiment, as shown in FIG. 2, a monocular depth estimation method is provided. The method can be applied to a terminal or a server; taking its application to the terminal 104 in FIG. 1 as an example, it includes the following steps:
Step 201, acquiring a target image and inertial measurement data of the image acquisition device that captures the target image.
The target image is the object of the monocular depth estimation method: by performing depth estimation on the target image, the distance between each pixel in the target image and the shooting source of the target image can be determined, yielding the depth estimation result. The image acquisition device of the target image is that shooting source; the image acquisition device 102 captures image information from the real world, for example by photographing, to obtain the target image, and then submits it to the terminal 104 for monocular depth estimation. Inertial measurement data refers to the data collected by the inertial measurement unit (IMU) in the image acquisition device, mainly including information such as the three-axis attitude angles (or angular rates) and the acceleration of the image acquisition device.
Specifically, after a user acquires an image through the image acquisition device, if depth estimation is to be performed on the acquired image so that the result can be applied to fields such as robot navigation, augmented reality (AR), three-dimensional reconstruction or automatic driving, the monocular depth estimation processing of the image can be realized by the monocular depth estimation method of the present application. First, the user submits the image acquired by the image acquisition device 102 to the terminal 104 as the target image. After obtaining the target image, the terminal 104 acquires the inertial measurement data recorded by the image acquisition device while capturing the target image, so that the pose data of the image acquisition device can be identified and visual-inertial odometry computation can be performed on it to assist monocular depth estimation. In a specific embodiment, the monocular depth estimation method of the present application is applied to the field of augmented reality, where depth information is the basis for scene understanding; in this case the image acquisition device 102 is integrated with a mobile intelligent terminal 104, and after the user captures an image with the camera, if augmented reality processing is to be performed based on the captured image, monocular depth estimation can be performed on the image to obtain its depth information. At this time, besides the captured image, the inertial measurement data of the mobile intelligent terminal 104 also needs to be acquired to assist the depth estimation.
Step 203, performing visual-inertial odometry solving based on the inertial measurement data to obtain a solving result, and determining sparse depth points in the target image according to the solving result.
Visual-inertial odometry (VIO), also called a visual-inertial system, is an algorithm for autonomous positioning based on sensing information collected jointly by a visual sensor and an inertial sensor; the two sensors complement each other well. By aligning the pose sequence estimated by the inertial sensor with the pose sequence estimated by the visual sensor, the true scale of the visual sensor's trajectory can be estimated; the inertial sensor can also predict well the pose of the image frame and the positions, in the next frame, of the feature points from the previous moment, which improves the matching speed of the feature tracking algorithm and the robustness of the algorithm to rapid rotation; finally, the gravity vector provided by the accelerometer in the inertial sensor can convert the estimated positions into the world coordinate system required for actual navigation. Sparse depth points are the feature points extracted from the image by visual-inertial odometry: after extracting the feature points of the image, visual-inertial odometry completes positioning based on tracking of those feature points, and the feature points determined in this process are the sparse depth points; while autonomous positioning is completed, the depth data corresponding to each feature point in the target image can also be determined.
Specifically, the scheme of the present application mainly uses visual-inertial odometry to assist depth estimation: the absolute depth data corresponding to a number of feature points contained in the image are first determined by visual-inertial odometry, and the complete depth of the target image can then be determined by combining them with the relative depth data of the target image. Therefore, when monocular depth estimation is performed, the visual-inertial odometry processing corresponding to the target image also needs to be completed. This process takes the image acquired by the visual sensor and the inertial measurement data of the image acquisition device as input data, then identifies the feature points in the image, and solves the positions of the feature points in the coordinate system of the image acquisition device through triangulation, thereby obtaining sparse depth points that contain depth information. In one embodiment, the scheme of the present application is applied to a mobile intelligent terminal to realize the depth estimation processing related to augmented reality; in this case the mobile intelligent terminal can use the image acquired by the camera on the terminal and the inertial measurement data generated by the terminal's gyroscope as the basic data of visual-inertial odometry to obtain the sparse depth points in the target image.
Step 205, performing depth estimation processing on the target image to obtain relative depth data of the target image.
Depth estimation includes absolute depth estimation and relative depth estimation. Absolute depth estimation estimates the distance between each pixel in an image and the shooting source from an RGB image under a single viewing angle. Relative depth estimation normalizes the scale in the RGB image before predicting depth, so the predicted depth is not the real distance between an object and the shooting source, but the ratio of that real distance to the maximum distance in the image. The specific depth estimation flow can be realized by a corresponding machine learning model. The relative depth data of the target image is thus the ratio of the true distance from each point in the target image to the shooting source to the maximum distance in the image.
Specifically, while visual-inertial odometry solving determines the sparse depth points, depth estimation processing can be performed on the target image to obtain its relative depth data. For this purpose a deep network model can be constructed to estimate the relative depth data: historical data are collected in advance to train the deep network model, and the relative depth estimation of the target image is then performed with the trained model to obtain the relative depth data. In a specific embodiment, the scheme of the present application is applied to a mobile intelligent terminal to realize the depth estimation processing related to augmented reality; since the computing power of a mobile intelligent terminal is limited, a lightweight deep network can be selected as the depth estimation model so as to guarantee the efficiency of the relative depth estimation.
Step 207, determining absolute depth data of the target image based on the sparse depth points and the relative depth data.
Specifically, after the sparse depth points in the target image and the relative depth data of the target image have been obtained, the absolute depth can be estimated based on both. When estimating the absolute depth, the relative depth values of the pixels at the positions corresponding to the sparse depth points can be looked up first, and the linear relation between these relative depth values and the absolute depths of the sparse depth points is then used to correct the relative depth data of every pixel, so as to obtain the absolute depth data corresponding to each pixel in the target image.
In the monocular depth estimation method above, the target image and the inertial measurement data of the image acquisition device that captured it are acquired first; visual-inertial odometry solving is then performed based on the inertial measurement data to obtain a solving result, and sparse depth points in the target image, i.e. the absolute depth values of a few sparse points, are determined from that result; depth estimation processing is performed on the target image to obtain its relative depth data; finally, the relative depth data are corrected based on the absolute depth values of the sparse depth points, thereby determining the absolute depth data of the target image. The complex absolute depth estimation pipeline is replaced by relative depth estimation, which is easier to predict and computationally cheaper, and the relative depth is then corrected with the sparse points determined from the inertial measurement data, so that the computational cost of depth estimation is effectively reduced and its efficiency is improved.
In one embodiment, step 205 comprises: performing resizing processing on the target image to obtain a target-size image; performing image feature extraction on the target-size image to obtain a target feature image; performing image feature fusion on the target feature image to obtain a target depth image; and determining relative depth data of the target image based on the target depth image.
Resizing refers to adjusting the size of the target image to a target size; because the size of the target image affects the effect of depth estimation, the image size needs to be unified before relative depth estimation is performed. Image feature extraction is the process of extracting image information; for example, convolution kernels may be used to convolve the target-size image, and downsampling is performed to obtain the processed image, i.e. the target feature image. Image feature fusion processes the target feature image to recover its original resolution, and can also be regarded as a decoding process. Since the pixel value of each pixel in the target depth image represents the relative depth of that pixel, the relative depth data of the target image can be determined from the target depth image.
Specifically, the depth estimation flow can be realized by combining the two processes of feature extraction and feature fusion. Experiments show that a larger input size increases the computation so that the depth map cannot be generated in real time, while a smaller input size degrades the generation quality. Therefore, before processing, the target image needs to be resized to obtain a unified target-size image. In one embodiment, since image feature extraction downsamples 4 times at a rate of 1/2, the input size must be a multiple of 16; (160, 160) may be chosen as the adjusted target image size. After resizing, image feature extraction such as convolution and downsampling can be performed on the target-size image to obtain the target feature image, and image feature fusion such as upsampling and skip connections can then be performed on the target feature image to obtain the target depth image, in which the pixel value of each pixel represents its relative depth value. In this embodiment, depth estimation of the resized target image is implemented through image feature extraction and image feature fusion, which effectively guarantees the accuracy and efficiency of the depth estimation process and therefore of monocular depth estimation as a whole.
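As an illustration of the resizing step, the following sketch assumes OpenCV is available and uses the 160x160 size mentioned above; the function name and interpolation choice are not taken from the patent and are only an example.

```python
import cv2
import numpy as np

def resize_to_multiple_of_16(image: np.ndarray, target_hw=(160, 160)) -> np.ndarray:
    """Resize an RGB image to a fixed size whose height and width are multiples of 16,
    as required by an encoder that downsamples 4 times at a rate of 1/2."""
    h, w = target_hw
    assert h % 16 == 0 and w % 16 == 0, "four 1/2 downsamplings need multiples of 16"
    return cv2.resize(image, (w, h), interpolation=cv2.INTER_LINEAR)

# usage: img = cv2.imread("frame.png"); net_input = resize_to_multiple_of_16(img)
```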
In one embodiment, performing image feature extraction on the target-size image to obtain the target feature image includes: performing convolution and downsampling on the target-size image through an encoder module to obtain the target feature image; and performing image feature fusion on the target feature image to obtain the target depth image includes: performing upsampling and skip-connection processing on the target feature image through a decoder module to obtain the target depth image.
The encoder module implements the encoding process, specifically the convolution and downsampling in feature extraction. The decoder module implements the decoding process, specifically the upsampling and skip-connection processing in feature fusion. Convolution is an integral transform that describes how one function is modified by another; in image processing, convolution is mainly used to suppress low-level features of an image while preserving and highlighting its high-level features. Downsampling (subsampling) is mainly used to shrink an image and reduce the number of sampling points of the matrix. Upsampling (interpolation) can be colloquially understood as enlarging an image, increasing the number of sampling points of the matrix. Skip connections are a way of realizing multi-scale feature fusion in deep neural networks and help alleviate gradient explosion and vanishing gradients during training.
Specifically, in the scheme of the present application, the image feature extraction and image feature fusion in the depth estimation process can be realized by an encoder module and a decoder module respectively. During image feature extraction, the encoder module performs convolution and downsampling on the target-size image to obtain the target feature image. During image feature fusion, the decoder module performs upsampling and skip-connection processing on the target feature image to obtain the target depth image. In a specific embodiment, as shown in FIG. 3, the depth estimation of the target image is realized by a deep neural network: the backbone of the depth estimation network can be built from a lightweight mobile convolutional neural network, the encoder and decoder are arranged in a U-Net-style structure and connected by skip connections for multi-scale fusion, and an attention mechanism is added to each fusion module to improve fusion accuracy. In this embodiment, using an encoder and a decoder for depth estimation effectively improves the accuracy of the depth estimation process and thus of the relative depth estimation in monocular depth estimation.
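The patent gives no source code; the following PyTorch sketch only illustrates the U-Net-style encoder/decoder arrangement described above (an encoder that downsamples four times at a rate of 1/2 and a decoder that upsamples and fuses encoder features through skip connections). Channel counts are arbitrary and the attention modules mentioned above are omitted for brevity, so this is a sketch of the structure rather than the claimed network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(c_in, c_out, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class TinyDepthNet(nn.Module):
    """Illustrative U-Net-style relative-depth network: the encoder downsamples 4 times
    at a rate of 1/2, the decoder upsamples and fuses encoder features via skip
    connections; the sigmoid keeps the predicted relative depth in (0, 1)."""
    def __init__(self):
        super().__init__()
        self.e1 = conv_bn_relu(3, 16, stride=2)    # 1/2
        self.e2 = conv_bn_relu(16, 32, stride=2)   # 1/4
        self.e3 = conv_bn_relu(32, 64, stride=2)   # 1/8
        self.e4 = conv_bn_relu(64, 96, stride=2)   # 1/16
        self.d3 = conv_bn_relu(96 + 64, 64)        # fuse with e3 skip
        self.d2 = conv_bn_relu(64 + 32, 32)        # fuse with e2 skip
        self.d1 = conv_bn_relu(32 + 16, 16)        # fuse with e1 skip
        self.out = nn.Conv2d(16, 1, 3, padding=1)

    def forward(self, x):
        s1 = self.e1(x)
        s2 = self.e2(s1)
        s3 = self.e3(s2)
        s4 = self.e4(s3)
        up = lambda t, ref: F.interpolate(t, size=ref.shape[-2:], mode="bilinear",
                                          align_corners=False)
        d = self.d3(torch.cat([up(s4, s3), s3], dim=1))
        d = self.d2(torch.cat([up(d, s2), s2], dim=1))
        d = self.d1(torch.cat([up(d, s1), s1], dim=1))
        d = F.interpolate(d, scale_factor=2, mode="bilinear", align_corners=False)
        return torch.sigmoid(self.out(d))          # relative depth map at input resolution

# usage: TinyDepthNet()(torch.randn(1, 3, 160, 160)).shape  # -> (1, 1, 160, 160)
```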
In one embodiment, the method further comprises: acquiring historical image data with depth annotations; constructing training sample data from the depth-annotated historical image data; and training an initial depth estimation model with the training sample data to obtain the depth estimation model. Step 205 then comprises: performing depth estimation processing on the target image through the depth estimation model to obtain relative depth data of the target image.
The historical image data are collected according to the application scenario of the monocular depth estimation processing; for example, when the depth estimation method is applied to the vehicle-mounted field, images captured by a vehicle-mounted camera can be used as historical image data. Depth annotation means labelling the depth of each pixel in the historical image data in advance, so that the loss-related data in model training can be determined. The training sample data can be split into a training set and a test set; the initial depth estimation model is an untrained deep neural network, the training set is used to train the deep neural network iteratively, and a usable depth estimation model is obtained after it passes verification on the validation set data.
Specifically, the depth estimation flow in the scheme of the present application can be realized by constructing a machine learning model and relying on machine learning methods. Therefore, before depth estimation is performed, training sample data for model training are constructed from historical data and then split into a training set, a test set, a validation set and so on. The initial depth estimation model is trained iteratively on the training set; in each training round the model is tested and verified on the test set and validation set, and gradient supervision is applied based on the loss, until a depth estimation model that passes validation and meets the requirements is obtained. In one embodiment, the backbone of the depth estimation model is built from a lightweight mobile deep convolutional neural network. The processing flow may be as shown in FIG. 4: the model training data are fed into the lightweight mobile deep convolutional neural network to obtain a predicted depth image, the model loss is determined by comparison with the annotation corresponding to the training data, i.e. the annotated depth image, and the network is then adjusted to complete the training of the model. The lightweight mobile deep convolutional neural network may be a MobileNet-v2 network, whose structure can be as shown in FIG. 5. It is based on the inverted residual structure: whereas an ordinary residual block has three convolutions with more channels at the two pointwise convolutions at the ends and fewer in the middle, the inverted residual block is the opposite, with more channels in the middle convolution (still a depthwise separable convolution) and fewer at the two ends. In the scheme provided by the present application, the initial depth estimation model is trained with depth-annotated historical image data, so that depth estimation processing can be effectively performed on the target image through the depth estimation model, which effectively guarantees the efficiency and accuracy of the depth estimation process.
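A minimal sketch of the inverted residual block described above (expand with a pointwise convolution, filter with a depthwise convolution, project back down, with a shortcut when the shapes match); the expansion factor and channel counts are illustrative assumptions rather than the patent's configuration.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNet-v2-style block: narrow -> wide (pointwise) -> depthwise -> narrow (pointwise)."""
    def __init__(self, c_in, c_out, stride=1, expand=6):
        super().__init__()
        hidden = c_in * expand
        self.use_residual = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1, bias=False),               # expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),                 # depthwise (separable) conv
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c_out, 1, bias=False),              # project
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y

# usage: InvertedResidual(32, 32)(torch.randn(1, 32, 40, 40)).shape  # -> (1, 32, 40, 40)
```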
In one embodiment, training the initial depth estimation model with the training sample data to obtain the depth estimation model includes: performing depth estimation processing on the training sample data through the initial depth estimation model to obtain a predicted depth image; normalizing the depth data in the predicted depth image to obtain normalized depth predictions, and performing mask processing and normalization on the depth annotations of the training sample data to obtain normalized depth ground truth, where the mask processing revises the data range of the ground-truth depth of the training sample data by setting the pixel values of pixels outside that range to zero; determining the model loss based on the normalized depth predictions and the normalized depth ground truth; and training the initial depth estimation model according to the model loss until a training stop condition is met, so as to obtain the depth estimation model.
The predicted depth image is a depth image containing deviations; because the prediction accuracy of the initial depth estimation model is low, the model loss needs to be determined by comparing the predicted depth image with the depth annotations of the training sample data. Normalization converts the feature values of a sample to the same scale and maps the data into the range [0, 1] or [-1, 1]; since the scheme of the present application only needs relative depth estimation, the depth data have to be normalized during model training. Mask processing refers to restricting the range of the ground-truth depth in the image and masking out pixels beyond that range to prevent their interference; the masking simply sets the pixel value (depth value) of points outside the depth range to 0, so that they do not participate in the back-propagation of the loss function. The model loss is the difference between the model prediction and the ground truth, here the difference between the normalized depth predictions and the normalized depth ground truth of the sample data. The training stop condition specifically means that the model passes validation on the validation set and its loss is smaller than a preset loss threshold.
Specifically, depth estimation training generally mixes multiple datasets. However, because of differences between acquisition devices (intrinsic parameters and performance), each dataset has a different depth range and accuracy. When training with absolute depth, these would have to be normalized to the same scale, which requires detailed knowledge of the acquisition equipment and acquisition mode of every dataset and entails considerable effort and difficulty. When training with relative depth, the loss function is designed so as to eliminate, for each sample, the errors caused by the non-uniform scale. The depth data in the predicted depth image are normalized to obtain the normalized depth predictions, and the depth annotations of the training sample data are masked and normalized to obtain the normalized depth ground truth; the model loss is determined from the normalized depth predictions and the normalized depth ground truth, and the initial depth estimation model is trained according to the model loss until the training stop condition is met, yielding the depth estimation model. In a specific embodiment, let the input sample be $S$, the output depth map be $I$ and the depth ground truth be $I^*$; both $I$ and $I^*$ first need to be normalized. For $I$,

$$I = \mathrm{sigmoid}(f(S)) \cdot \mathrm{maxdepth},$$

where $f(S)$ is the network prediction and $\mathrm{maxdepth}$ is the maximum depth value. Although relative depth is used, this still ensures that the predicted maximum depth does not exceed the specified range. The normalized depth map $\hat{I}$ is

$$\hat{I} = \frac{I - \mathrm{median}(I)}{\frac{1}{M}\sum_{i=1}^{M}\left|I_i - \mathrm{median}(I)\right|},$$

where $\mathrm{median}(I)$ is the median of the depth map $I$ and $M$ is the number of pixels.

Further, in order to use as much data as possible (including outdoor data) effectively, the data range needs to be revised. First, on the absolute scale, a range of valid depth ground-truth values is set, for example 0.05 m to 10 m, and pixels beyond this range are masked. The mask simply sets points outside the range to 0 so that they do not participate in the back-propagation of the loss function:

$$I^* = \mathrm{mask}(I^*),$$

and after normalization the ground truth $\hat{I}^*$ is

$$\hat{I}^* = \frac{I^* - \mathrm{median}(I^*)}{\frac{1}{M}\sum_{i=1}^{M}\left|I^*_i - \mathrm{median}(I^*)\right|}.$$

Because of the scene diversity of monocular depth estimation, when a scene beyond the limited range (too large or too small) is captured, the mean changes sensitively and makes the prediction inaccurate; the median is therefore chosen instead of the mean during normalization, which ensures that the depth normalization does not drift numerically because of a few outlier points. To further prevent drift caused by flying points in the image, the points with the largest 5% of residuals are also masked on the relative scale, so the final loss function is

$$L_{depth} = \frac{1}{2U_m}\sum_{i=1}^{U_m}\left|\hat{I}_i - \hat{I}^*_i\right|, \qquad U_m = R_m \cdot M,$$

where the residuals $\left|\hat{I}_i - \hat{I}^*_i\right|$ are sorted in increasing order and $R_m = 0.95$. Meanwhile, in order to make the depth map smoother, a regularization term can be added to supervise the gradients in the x and y directions. The corresponding optimization objective is

$$L = L_{depth} + \lambda \cdot \frac{1}{M}\sum_{i=1}^{M}\left(\left|\nabla_x\left(\hat{I}_i - \hat{I}^*_i\right)\right| + \left|\nabla_y\left(\hat{I}_i - \hat{I}^*_i\right)\right|\right).$$
In this embodiment, by normalizing the depth data and the depth annotations, the normalized depth predictions and normalized depth ground truth can be obtained accurately and the model loss determined, so that the training of the depth estimation model is completed and the effectiveness of the depth estimation process is ensured.
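Assuming the loss reconstruction above is accurate, a small NumPy sketch of the masking, median-based normalization and trimmed L1 loss could look as follows; the 0.05–10 m range and R_m = 0.95 come from the text, while the function names and the epsilon guard are illustrative.

```python
import numpy as np

def normalize(d, valid):
    """Median / mean-absolute-deviation normalization over valid pixels."""
    med = np.median(d[valid])
    scale = np.mean(np.abs(d[valid] - med)) + 1e-8
    return (d - med) / scale

def relative_depth_loss(pred, gt, min_d=0.05, max_d=10.0, r_m=0.95):
    """Masked, normalized, trimmed L1 loss between predicted and ground-truth depth maps."""
    valid = (gt > min_d) & (gt < max_d)          # absolute-scale mask: out-of-range pixels ignored
    p_hat = normalize(pred, valid)
    g_hat = normalize(gt, valid)
    res = np.abs(p_hat - g_hat)[valid]
    res = np.sort(res)[: int(r_m * res.size)]    # drop the 5% largest residuals (flying points)
    return res.mean() / 2.0

# usage with random maps of the same shape:
pred = np.random.rand(160, 160) * 10
gt = np.random.rand(160, 160) * 10
print(relative_depth_loss(pred, gt))
```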
In one embodiment, step 203 comprises: acquiring the previous frame image of the target image, and determining the feature points that exist in both the previous frame image and the target image as well as the coordinate data of these feature points in the previous frame image; determining pose transformation data of the image acquisition device based on the inertial measurement data; performing visual-inertial odometry solving based on the pose transformation data and the coordinate data in the previous frame image to obtain the coordinate data of the feature points in the coordinate system of the image acquisition device; and determining the absolute depth data of the feature points based on their coordinate data in the coordinate system of the image acquisition device, thereby obtaining the sparse depth points.
The previous frame image of the target image is another image captured by the same image acquisition device before the target image; the two images may contain the same landmarks, so feature points that exist in both can be identified, and the displacement of the image acquisition device between the two frames generates the corresponding inertial measurement data. Feature points are points corresponding to objects that appear in both the previous and the current frame; for example, a road sign, a traffic light or a vehicle appearing in both images can be identified as a feature point. The coordinate data of a feature point in the previous frame image are its coordinates in the image acquisition device's coordinate system for that frame. The pose transformation data are determined from the inertial measurement data: when the pose of the image acquisition device changes, the inertial measurement samples can be transformed between navigation coordinate systems starting from the initial pose, and information such as the carrier's pose and velocity can be computed and fed back, yielding the corresponding pose transformation data.
Specifically, the pose and motion of the image acquisition device can be estimated from the inertial measurement data; once the motion of the image acquisition device has been estimated, the spatial positions of the feature points are estimated from that motion, and this spatial-position computation can be realized by triangulation. In the scheme of the present application, a triangulation scheme can be adopted for the visual-inertial odometry solving that determines the sparse depth points. The previous frame image of the target image is acquired first, and the spatial coordinates of the feature points are identified based on the pose change between the two images, so that the absolute depth data of the feature points are determined and sparse depth points containing depth data are obtained. In the triangulation step of visual-inertial odometry, the feature points in the previous frame image are first projected into the coordinate system of the target image according to the pose transformation data, giving the corresponding normalized-plane coordinate equations of the feature points; the depth values of the feature points are then substituted into these equations as unknowns, yielding an observation equation system; the least-squares solution of this observation equation system gives the coordinate positions of the feature points and, at the same time, the corresponding absolute depth values of the feature points, i.e. the sparse depth points. In this embodiment, the feature points in the target image and their corresponding absolute depth data are determined by triangulation based on the inertial measurement data, yielding the sparse depth points, which effectively guarantees the efficiency of sparse depth point identification.
In one embodiment, performing visual-inertial odometry solving based on the pose transformation data and the coordinate data in the previous frame image to obtain the coordinate data of the feature points includes: constructing a pose transformation matrix based on the pose transformation data; projecting the feature points into the coordinate system of the target image according to the pose transformation matrix and the coordinate data in the previous frame image to obtain feature-point projection coordinates; constructing a system of linear equations based on the pose transformation matrix and the feature-point projection coordinates; and solving the system of linear equations with a singular value decomposition algorithm to obtain the coordinate data of the feature points in the coordinate system of the image acquisition device.
Specifically, the pose transformation matrix describes the relative motion of the image acquisition device from the previous frame image to the target image and can be obtained from the inertial measurement data between the two frames. After the pose transformation matrix is obtained, the feature points in the previous frame image can be projected, based on the relative motion of the image acquisition device, into the coordinate system of the image acquisition device corresponding to the target image, giving the predicted projection coordinates of the feature points. A system of linear equations is then constructed from the pose transformation matrix and the feature-point projection coordinates; each previous frame image contributes two equations, and when the same feature point has been observed in two or more frames whose image acquisition device poses are known, the coefficient matrix D of the system is, because of the various measurement noises, very likely to be of full rank, in which case the system only has the zero solution. The system can then be solved by a singular value decomposition algorithm to obtain its least-squares solution, giving the position of the feature point in the coordinate system of the image acquisition device and hence its corresponding depth information. In this embodiment, by constructing a system of linear equations and then solving it with a singular value decomposition algorithm, the coordinate data of each feature point in the coordinate system of the image acquisition device can be accurately determined, and the depth value corresponding to each sparse depth point is obtained.
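The following NumPy sketch illustrates the triangulation idea (stack two linear constraints per observing frame and take the least-squares solution from the SVD); the poses, normalized coordinates and the synthetic check are invented for illustration, and the exact equation construction in the patent may differ.

```python
import numpy as np

def triangulate(pose_mats, pixels_norm):
    """Triangulate one 3D point from >= 2 observations.
    pose_mats: list of 3x4 matrices [R | t] mapping world points into each camera frame.
    pixels_norm: list of (x, y) normalized image-plane coordinates in each frame."""
    rows = []
    for P, (x, y) in zip(pose_mats, pixels_norm):
        rows.append(x * P[2] - P[0])     # two linear constraints per observation
        rows.append(y * P[2] - P[1])
    D = np.asarray(rows)                 # observation matrix
    _, _, vt = np.linalg.svd(D)          # least-squares solution = last right singular vector
    X = vt[-1]
    return X[:3] / X[3]                  # homogeneous -> Euclidean; depth is the per-camera z

# synthetic check: a point seen by two cameras related by a small translation
X_true = np.array([0.3, -0.2, 4.0])
P0 = np.hstack([np.eye(3), np.zeros((3, 1))])
P1 = np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])   # camera moved 0.5 m along x
obs = []
for P in (P0, P1):
    p = P @ np.append(X_true, 1.0)
    obs.append((p[0] / p[2], p[1] / p[2]))
print(triangulate([P0, P1], obs))        # ~ [0.3, -0.2, 4.0]
```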
In one embodiment, step 207 comprises: projecting the relative depth data, through a linear transformation, into the visual-inertial coordinate system in which the sparse depth points lie; obtaining the intersection set of the sparse depth points and the relative depth data; and determining the absolute depth data of the target image based on the absolute depth corresponding to each point in the intersection set.
Visual-inertial coordinate systems generally include three kinds: the world coordinate system, the IMU coordinate system and the image acquisition device coordinate system; here the visual-inertial coordinate system can be either the world coordinate system or the image acquisition device coordinate system.
Specifically, when converting to absolute depth data, the relative depth data are first projected, through a linear transformation, into the visual-inertial coordinate system in which the sparse depth points lie; the pixels corresponding to the sparse depth points are then identified in the relative depth data to construct the intersection set. Because the sparse depth points are all feature points extracted from the target image, each of them necessarily has a corresponding pixel in the relative depth data, and the set formed by these pixels is the intersection set. The absolute depth of each pixel in the intersection set can be read directly from the depth data of the corresponding sparse depth point, i.e. it is known. Therefore, an equation system can be constructed from the absolute depths of the sparse depth points to determine the scale and offset between relative and absolute depth; the obtained scale and offset are then applied to all pixels and, combined with the relative depth data of the pixels, the absolute depth data of the target image can be determined. Specifically, to determine the absolute depth data of the target image from the absolute depths of the points in the intersection set, a linear equation system for the absolute depth can be constructed from the intersection set and the relative depth data, and solved using the absolute depth of each point in the intersection set to obtain the absolute depth data of the target image. The calculation is as follows: let $H$ be the set of valid intersection points between the identified sparse depth points and the relative depth data; the corresponding absolute depth is

$$D = \alpha \tilde{I} + \beta,$$

where $\tilde{I}$ is the image matrix of relative depth, $\alpha$ is the scale between relative depth and absolute depth, and $\beta$ is the offset. The following binary linear system over the points in $H$ is then solved for $\alpha$ and $\beta$, for example by Cramer's rule:

$$d_i = \alpha \tilde{I}_i + \beta, \quad i \in H,$$

where $d_i$ is the absolute depth of sparse depth point $i$ and $\tilde{I}_i$ is the relative depth at the corresponding pixel.
In this embodiment, the absolute depth data of the target image is determined by identifying the intersection points corresponding to the sparse depth points, so that the efficiency and accuracy of absolute depth calculation can be effectively ensured, and the effect of monocular depth estimation is ensured.
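Assuming the linear model D = α·Ĩ + β reconstructed above, the scale and offset can be fitted from the sparse points and applied to the whole relative depth map as in the NumPy sketch below; a least-squares solve is used here, which for two unknowns is equivalent to solving the 2x2 normal equations with Cramer's rule. All names and the synthetic data are illustrative.

```python
import numpy as np

def relative_to_absolute(rel_depth, sparse_uv, sparse_abs):
    """rel_depth: HxW relative depth map; sparse_uv: Nx2 pixel coords (row, col) of the
    sparse depth points; sparse_abs: N absolute depths from visual-inertial odometry."""
    rel_at_sparse = rel_depth[sparse_uv[:, 0], sparse_uv[:, 1]]
    A = np.stack([rel_at_sparse, np.ones_like(rel_at_sparse)], axis=1)
    (alpha, beta), *_ = np.linalg.lstsq(A, sparse_abs, rcond=None)   # fit scale and offset
    return alpha * rel_depth + beta                                  # dense absolute depth map

# synthetic usage: a fake relative map with known alpha = 4, beta = 0.5
rel = np.random.rand(160, 160)
uv = np.random.randint(0, 160, size=(30, 2))
abs_sparse = 4.0 * rel[uv[:, 0], uv[:, 1]] + 0.5
dense_abs = relative_to_absolute(rel, uv, abs_sparse)
print(dense_abs.min(), dense_abs.max())
```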
In one embodiment, the method further comprises: performing augmented reality processing on the target image based on the absolute depth data of the target image to obtain an augmented reality image; pushing the augmented reality image.
Augmented reality (AR) technology skillfully fuses virtual information with the real world. It makes extensive use of multimedia, three-dimensional modeling, real-time tracking and registration, intelligent interaction, sensing and other technical means to apply computer-generated virtual information such as text, images, three-dimensional models, music and video to the real world after simulation, so that the two kinds of information complement each other and the real world is thereby augmented. It not only presents the content of the real world effectively, but also displays virtual information content, with the two kinds of content complementing and overlaying each other. In visual augmented reality, the user can, with a head-mounted display, see the real world and computer graphics merged together. Augmented reality mainly involves technologies and means such as multimedia, three-dimensional modeling and scene fusion, and the information it provides is clearly different from the information humans can perceive directly.
The scheme of the present application is particularly applicable to the field of augmented reality. Besides the camera parameters and the sparse structure of the scene, augmented reality needs to recover the dense three-dimensional structure of the scene in order to better handle occlusion relations and synthetic shadows; dense depth estimation is therefore an important part of augmented reality. By performing depth estimation on the target image and combining the result with the sparse depth points, the scheme of the present application obtains a dense depth image in absolute depth, which benefits scene-recovery processing in augmented reality, yields augmented reality images with stronger expressiveness, and improves the augmented reality processing effect. In this embodiment, applying the absolute depth data of the target image to the field of augmented reality effectively guarantees the expressiveness of augmented reality processing and improves the augmented reality effect.
The application also provides an application scene, which applies the monocular depth estimation method.
Specifically, the application of the monocular depth estimation method in the application scene is as follows:
When a user wants to use the augmented reality function on a smartphone to enhance its photographing function, the scheme of the present application can be used to perform depth estimation on the captured picture, and the augmented reality processing is then assisted by the depth estimation result, which guarantees the effectiveness of the augmented reality. The present application estimates relative depth; the concept of a depth map is illustrated in FIG. 6: the very distant mountain peaks show only a very small pixel coordinate difference, the farther trees a smaller one, and the nearer guardrail a larger one, so the depth map reflects the distance between each pixel and the shooting source through its pixel value, and the distance of each object in the image can be read effectively from the depth map. The depth estimation flow of the present scheme can be seen in FIG. 7: first, the image data and inertial measurement data acquired by the image acquisition device are obtained; the image data go through relative depth generation to obtain a relative depth map of the image, the image data and the inertial measurement data together go through sparse depth measurement to obtain a number of sparse absolute depth points, and finally the sparse depth points and the relative depth map are combined to obtain a dense absolute depth map. The relative depth estimation can be realized by a neural network model: the target image is resized to obtain the target-size image; the target-size image is fed into the depth estimation model, which performs convolution and downsampling on it to obtain the target feature image, and then performs upsampling and skip-connection processing on the target feature image to obtain the target depth image. For the training of the depth estimation model, historical image data with depth annotations are acquired first; training sample data are constructed from the depth-annotated historical image data; and the initial depth estimation model is trained with the training sample data to obtain the depth estimation model.
The sparse absolute depth point estimation flow can be realized by visual inertial odometry: a previous frame image of the target image is obtained, and the feature points shared between the previous frame image and the target image, together with the coordinate data of these feature points in the previous frame image, are determined; pose transformation data of the image acquisition device is determined based on the inertial measurement data; a pose transformation matrix is constructed from the pose transformation data; the feature points are projected into the coordinate system of the target image according to the pose transformation matrix and the coordinate data in the previous frame image to obtain feature point projection coordinates; a linear equation system is constructed based on the pose transformation matrix and the feature point projection coordinates; the linear equation system is solved by a singular value decomposition algorithm to obtain the coordinate data of the feature points in the coordinate system of the image acquisition device; and the absolute depth data of the feature points is determined from these coordinates, yielding the sparse depth points. For the dense absolute depth estimation flow, the relative depth data is projected by a linear transformation into the visual-inertial coordinate system in which the sparse depth points lie; the intersection point set of the sparse depth points and the relative depth data is obtained; and the absolute depth data of the target image is determined based on the absolute depth and the relative depth corresponding to each point in the intersection point set. Finally, after the absolute depth data of the target image is obtained, augmented reality processing can be performed on the target image based on this data to obtain an augmented reality image, which is then displayed to the user, completing the augmented reality flow.
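The following NumPy sketch illustrates one way the sparse-point step could be realized as a standard two-view triangulation solved by singular value decomposition, consistent with the linear system described above; the exact construction of the system, the pose convention, and all numeric values are assumptions.

```python
import numpy as np

def triangulate_point(uv_prev, uv_curr, K, R, t):
    """Triangulate one feature point shared by the previous frame and the
    target image, given the camera intrinsics K and the pose change (R, t)
    between the two views.  A standard DLT linear system is built from the
    projection matrices and solved by SVD; the returned point lies in the
    camera coordinate system of the target image, so its Z component is the
    absolute depth of the sparse point."""
    P_prev = K @ np.hstack([np.eye(3), np.zeros((3, 1))])   # previous frame as reference
    P_curr = K @ np.hstack([R, t.reshape(3, 1)])            # target frame

    # Each view contributes two linear equations in the homogeneous point X.
    A = np.stack([
        uv_prev[0] * P_prev[2] - P_prev[0],
        uv_prev[1] * P_prev[2] - P_prev[1],
        uv_curr[0] * P_curr[2] - P_curr[0],
        uv_curr[1] * P_curr[2] - P_curr[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    X = X[:3] / X[3]            # point in the previous-frame camera system
    return R @ X + t            # same point in the target-image camera system

# Made-up intrinsics, IMU-derived pose change, and matched pixel coordinates.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([-0.1, 0.0, 0.0])  # extrinsic translation of the target view
point = triangulate_point(np.array([300.0, 240.0]), np.array([260.0, 240.0]), K, R, t)
print(point[2])  # absolute depth of the feature point, about 1.25 m here
```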
In a specific embodiment, the present application further provides a monocular depth estimation method, whose flowchart may refer to fig. 8, including the following steps. Step 801: acquiring a target image and inertial measurement data of the image acquisition device that acquires the target image. Step 803: acquiring a previous frame image of the target image, and determining the feature points shared between the previous frame image and the target image together with the coordinate data of the feature points in the previous frame image. Step 805: determining pose transformation data of the image acquisition device based on the inertial measurement data. Step 807: constructing a pose transformation matrix based on the pose transformation data. Step 809: projecting the feature points into the coordinate system of the target image according to the pose transformation matrix and the coordinate data of the previous frame image to obtain the feature point projection coordinates. Step 811: constructing a linear equation system based on the pose transformation matrix and the feature point projection coordinates. Step 813: solving the linear equation system by a singular value decomposition algorithm to obtain the coordinate data of the feature points in the coordinate system of the image acquisition device. Step 815: resizing the target image to obtain a target size image. Step 817: performing convolution processing and downsampling processing on the target size image through the encoder module to obtain a target image feature map. Step 819: performing upsampling processing and skip-connection processing on the target feature image through the decoder module to obtain a target depth image. Step 821: determining relative depth data of the target image based on the target depth image. Step 823: projecting the relative depth data by a linear transformation into the visual-inertial coordinate system in which the sparse depth points lie. Step 825: acquiring the intersection point set of the sparse depth points and the relative depth data. Step 827: determining the absolute depth data of the target image based on the absolute depth and the relative depth corresponding to each point in the intersection point set.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or multiple stages, which are not necessarily executed at the same moment but may be executed at different moments, and whose execution order is not necessarily sequential; they may be executed in turn or alternately with at least part of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a monocular depth estimation device for realizing the monocular depth estimation method. The implementation of the solution provided by the device is similar to that described in the above method, so for the specific limitations in the one or more monocular depth estimation device embodiments provided below, reference may be made to the limitations of the monocular depth estimation method described above, which are not repeated here.
In one embodiment, as shown in fig. 9, there is provided a monocular depth estimation apparatus including:
A data acquisition module 902, configured to acquire a target image and inertial measurement data of the image acquisition device that acquires the target image.
A sparse point identification module 904, configured to perform visual inertial odometry solving based on the inertial measurement data to obtain a solving result, and determine sparse depth points in the target image according to the solving result.
The relative depth estimation module 906 is configured to perform a depth estimation process on the target image, so as to obtain relative depth data of the target image.
An absolute depth estimation module 908 is configured to determine absolute depth data of the target image based on the sparse depth points and the relative depth data.
In one embodiment, the relative depth estimation module 906 is specifically configured to: perform size adjustment processing on the target image to obtain a target size image; perform image feature extraction processing on the target size image to obtain a target feature image; perform image feature fusion processing on the target feature image to obtain a target depth image; and determine relative depth data of the target image based on the target depth image.
In one embodiment, the relative depth estimation module 906 is further configured to: perform convolution processing and downsampling processing on the target size image through an encoder module to obtain the target image feature map; and perform upsampling processing and skip-connection processing on the target feature image through a decoder module to obtain the target depth image.
In one embodiment, the apparatus further comprises a model training module configured to: acquire historical image data with depth annotations; construct training sample data from the depth-annotated historical image data; and train an initial depth estimation model with the training sample data to obtain the depth estimation model. The relative depth estimation module 906 is specifically configured to: perform depth estimation processing on the target image through the depth estimation model to obtain the relative depth data of the target image.
In one embodiment, the model training module is specifically configured to: perform depth estimation processing on the training sample data through the initial depth estimation model to obtain a predicted depth image; normalize the depth data in the predicted depth image to obtain normalized depth prediction data, and perform mask processing and normalization processing on the depth annotations of the training sample data to obtain a normalized depth ground truth, where the mask processing restricts the ground-truth depth data of the training sample data to a valid data range and sets the values of pixels outside that range to zero; determine a model loss based on the normalized depth prediction data and the normalized depth ground truth; and train the initial depth estimation model according to the model loss until a training stop condition is met, so as to obtain the depth estimation model.
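A hedged sketch of such a training loss is shown below: the ground-truth depth is masked to a valid range, both prediction and ground truth are normalized, and an L1-style difference is taken over the valid pixels. The depth range and the L1 form are assumptions; the embodiment only specifies masking, normalization, and a model loss.

```python
import torch

def masked_normalized_depth_loss(pred_depth, gt_depth, d_min=0.1, d_max=80.0):
    """Pixels whose ground-truth depth falls outside [d_min, d_max] are
    masked out (set to zero and excluded); prediction and ground truth are
    normalized over the valid pixels; the model loss is the mean absolute
    difference on those pixels."""
    valid = (gt_depth > d_min) & (gt_depth < d_max)
    gt = torch.where(valid, gt_depth, torch.zeros_like(gt_depth))

    def normalize(d):
        vals = d[valid]
        return (d - vals.mean()) / (vals.std() + 1e-6)

    return (normalize(pred_depth) - normalize(gt)).abs()[valid].mean()

# Example: a random prediction against a random ground-truth map in which
# some pixels exceed d_max and are therefore ignored by the mask.
pred = torch.rand(1, 1, 192, 256) * 50
gt = torch.rand(1, 1, 192, 256) * 100
loss = masked_normalized_depth_loss(pred, gt)
```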
In one embodiment, the backbone network of the depth estimation model is built based on a lightweight mobile convolutional neural network.
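For illustration only, one possible lightweight mobile backbone is the MobileNetV2 feature extractor from torchvision; the embodiment does not name a specific network, so this choice is merely an example of reusing an off-the-shelf mobile convolutional network as the encoder.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class MobileDepthEncoder(nn.Module):
    """Hypothetical lightweight encoder: the MobileNetV2 feature layers are
    reused as the convolution/downsampling backbone of the depth model."""
    def __init__(self):
        super().__init__()
        self.backbone = mobilenet_v2().features  # stride-32 feature extractor

    def forward(self, x):
        return self.backbone(x)

encoder = MobileDepthEncoder()
features = encoder(torch.rand(1, 3, 192, 256))  # -> (1, 1280, 6, 8)
```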
In one embodiment, the sparse point identification module 904 is specifically configured to: acquire a previous frame image of the target image, and determine the feature points shared between the previous frame image and the target image and the coordinate data of the feature points in the previous frame image; determine pose transformation data of the image acquisition device based on the inertial measurement data; perform visual inertial odometry solving based on the pose transformation data and the coordinate data of the previous frame image to obtain the coordinate data of the feature points in the coordinate system of the image acquisition device; and determine absolute depth data of the feature points based on the coordinate data of the feature points in the coordinate system of the image acquisition device to obtain the sparse depth points.
In one embodiment, the sparse point identification module 904 is further configured to: construct a pose transformation matrix based on the pose transformation data; project the feature points into the coordinate system of the target image according to the pose transformation matrix and the coordinate data of the previous frame image to obtain feature point projection coordinates; construct a linear equation system based on the pose transformation matrix and the feature point projection coordinates; and solve the linear equation system through a singular value decomposition algorithm to obtain the coordinate data of the feature points in the coordinate system of the image acquisition device.
In one embodiment, the absolute depth estimation module 908 is specifically configured to: project the relative depth data, through a linear transformation, into the visual-inertial coordinate system in which the sparse depth points lie; acquire the intersection point set of the sparse depth points and the relative depth data; and determine absolute depth data of the target image based on the absolute depth and the relative depth corresponding to each point in the intersection point set.
In one embodiment, the absolute depth estimation module 908 is further configured to: construct a linear equation system of absolute depth based on the intersection point set and the relative depth data; and solve the linear equation system based on the absolute depth corresponding to each point in the intersection point set to obtain the absolute depth data of the target image.
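The sketch below shows one way the linear system over the intersection point set could be solved: a per-image scale and shift are fit by least squares between the relative depths and the sparse absolute depths and then applied to the whole relative depth map. The scale-and-shift model, the function name, and the data layout are assumptions.

```python
import numpy as np

def fit_absolute_depth(relative_depth, sparse_points):
    """Fit a per-image scale s and shift b between relative and absolute
    depth over the intersection point set, then apply them to the whole
    relative depth map.  sparse_points is a list of ((u, v), absolute_depth)
    pairs located on the relative depth map."""
    rel = np.array([relative_depth[v, u] for (u, v), _ in sparse_points])
    abs_d = np.array([d for _, d in sparse_points])
    # Over-determined linear system  [rel 1] @ [s, b]^T = abs_d, solved by least squares.
    A = np.stack([rel, np.ones_like(rel)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, abs_d, rcond=None)
    return s * relative_depth + b

# Example with a synthetic relative depth map and three sparse depth points.
relative_depth = np.fromfunction(lambda v, u: 0.2 + 0.001 * v, (192, 256))
sparse_points = [((10, 20), 1.5), ((100, 80), 2.0), ((200, 150), 2.8)]
absolute_depth = fit_absolute_depth(relative_depth, sparse_points)
```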
In one embodiment, the apparatus further comprises an augmented reality processing module configured to: perform augmented reality processing on the target image based on the absolute depth data of the target image to obtain an augmented reality image; and push the augmented reality image.
The individual modules in the monocular depth estimation device described above may be implemented in whole or in part by software, hardware, or a combination thereof. Each of the above modules may be embedded in hardware form in, or independent of, the processor of the computer device, or may be stored in software form in the memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and whose internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface, the display unit, and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements a monocular depth estimation method. The display unit of the computer device is used for forming a visual picture and may be a display screen, a projection device, or a virtual reality imaging device; the display screen may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen, a key, a trackball, or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, or mouse.
It will be appreciated by those skilled in the art that the structure shown in fig. 10 is merely a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a particular computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the above-described method embodiments.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
Those skilled in the art will appreciate that all or part of the methods in the above embodiments may be implemented by a computer program stored on a non-volatile computer-readable storage medium, which, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. The volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take various forms such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combinations of these technical features, they should be considered to be within the scope of this specification.
The foregoing examples illustrate only a few embodiments of the application, which are described in detail but are not to be construed as limiting the scope of the application. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the concept of the application, and these all fall within the protection scope of the application. Accordingly, the protection scope of the application shall be subject to the appended claims.

Claims (15)

1. A monocular depth estimation method, the method comprising:
Acquiring a target image and inertial measurement data of an image acquisition device that acquires the target image;
performing visual inertial odometry solving based on the inertial measurement data to obtain a solving result, and determining sparse depth points in the target image according to the solving result;
performing depth estimation processing on the target image to obtain relative depth data of the target image;
Absolute depth data of the target image is determined based on the sparse depth points and the relative depth data.
2. The method of claim 1, wherein performing depth estimation processing on the target image to obtain relative depth data of the target image comprises:
performing size adjustment processing on the target image to obtain a target size image;
Performing image feature extraction processing on the target size image to obtain a target feature image;
Performing image feature fusion processing on the target feature image to obtain a target depth image;
Relative depth data of the target image is determined based on the target depth image.
3. The method of claim 2, wherein performing image feature extraction processing on the target size image to obtain a target feature image comprises:
performing convolution processing and downsampling processing on the target size image through an encoder module to obtain a target image feature map;
the performing image feature fusion processing on the target feature image to obtain a target depth image comprises:
performing upsampling processing and skip-connection processing on the target feature image through a decoder module to obtain the target depth image.
4. The method according to claim 1, wherein the method further comprises:
Acquiring historical image data with depth identification;
constructing training sample data according to the historical image data with the depth identification;
training an initial depth estimation model through the training sample data to obtain a depth estimation model;
the step of performing depth estimation processing on the target image to obtain relative depth data of the target image includes:
and carrying out depth estimation processing on the target image through the depth estimation model to obtain relative depth data of the target image.
5. The method of claim 4, wherein training the initial depth estimation model with the training sample data to obtain a depth estimation model comprises:
performing depth estimation processing on the training sample data through the initial depth estimation model to obtain a predicted depth image;
normalizing the depth data in the predicted depth image to obtain normalized depth prediction data, and performing mask processing and normalization processing on the depth identification of the training sample data to obtain a normalized depth ground truth, wherein the mask processing restricts the ground-truth depth data of the training sample data to a valid data range, setting the values of pixels outside the range to zero;
determining a model loss based on the normalized depth prediction data and the normalized depth ground truth;
and training the initial depth estimation model according to the model loss until the training stopping condition is met, so as to obtain a depth estimation model.
6. The method of claim 4, wherein the backbone network of the depth estimation model is constructed based on a lightweight mobile convolutional neural network.
7. The method of claim 1, wherein the performing visual inertial odometry solving based on the inertial measurement data to obtain a solving result and determining sparse depth points in the target image according to the solving result comprises:
acquiring a previous frame image of the target image, and determining feature points shared between the previous frame image and the target image and coordinate data of the feature points in the previous frame image;
Determining pose transformation data of the image acquisition device based on the inertial measurement data;
performing visual inertial odometry solving based on the pose transformation data and the coordinate data of the previous frame image to obtain coordinate data of the feature points in the coordinate system of the image acquisition device;
and determining absolute depth data of the feature points based on the coordinate data of the feature points in the coordinate system of the image acquisition equipment to obtain sparse depth points.
8. The method according to claim 7, wherein the performing visual inertial odometry solving based on the pose transformation data and the coordinate data of the previous frame image to obtain the coordinate data of the feature points in the coordinate system of the image acquisition device comprises:
Constructing a pose transformation matrix based on the pose transformation data;
According to the pose transformation matrix and the coordinate data of the previous frame image, projecting the characteristic points to a coordinate system of the target image to obtain characteristic point projection coordinates;
constructing a linear equation set based on the pose transformation matrix and the feature point projection coordinates;
and solving the linear equation set through a singular value decomposition algorithm to obtain the coordinate data of the feature points in the coordinate system of the image acquisition device.
9. The method of any of claims 1 to 8, wherein the determining absolute depth data of the target image based on the sparse depth points and the relative depth data comprises:
Projecting the relative depth data to a visual inertial coordinate system where the sparse depth points are located through linear transformation;
acquiring an intersection point set of the sparse depth points and the relative depth data;
And determining absolute depth data of the target image based on the absolute depth and the relative depth corresponding to each point in the intersection point set.
10. The method of claim 9, wherein the determining absolute depth data of the target image based on the absolute depth and the relative depth corresponding to each point in the intersection point set comprises:
Constructing a linear equation set of absolute depth based on the intersection point set and the relative depth data;
and solving the linear equation set based on absolute depth corresponding to each point in the intersection point set to obtain absolute depth data of the target image.
11. The method according to claim 1, wherein the method further comprises:
performing augmented reality processing on the target image based on the absolute depth data of the target image to obtain an augmented reality image;
pushing the augmented reality image.
12. A monocular depth estimation apparatus, the apparatus comprising:
The data acquisition module is used for acquiring a target image and inertial measurement data of image acquisition equipment for acquiring the target image;
the sparse point identification module is configured to perform visual inertial odometry solving based on the inertial measurement data to obtain a solving result, and determine sparse depth points in the target image according to the solving result;
the relative depth estimation module is used for carrying out depth estimation processing on the target image to obtain relative depth data of the target image;
and the absolute depth estimation module is used for determining absolute depth data of the target image based on the sparse depth points and the relative depth data.
13. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 11 when the computer program is executed.
14. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 11.
15. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 11.
Priority Applications (1)

Application Number: CN202211482163.9A
Priority Date / Filing Date: 2022-11-24
Title: Monocular depth estimation method, monocular depth estimation device, monocular depth estimation computer device, and storage medium
Status: Pending

Publications (1)

Publication Number: CN118071807A
Publication Date: 2024-05-24

Family ID: 91097960

Country Status (1)

Country: CN
Publication: CN118071807A (en)

Legal Events

Date Code Title Description
PB01 Publication