WO2022165722A1 - Method, apparatus and device for monocular depth estimation - Google Patents

Method, apparatus and device for monocular depth estimation

Info

Publication number
WO2022165722A1
Authority
WO
WIPO (PCT)
Prior art keywords: map, camera, image, dsn, estimated
Prior art date
Application number
PCT/CN2021/075318
Other languages
English (en)
Chinese (zh)
Inventor
摩拉莱斯•斯皮诺扎•卡洛斯•埃曼纽尔
李正卿
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to PCT/CN2021/075318 priority Critical patent/WO2022165722A1/fr
Publication of WO2022165722A1 publication Critical patent/WO2022165722A1/fr

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery

Definitions

  • the embodiments of the present application relate to the field of computer vision, and in particular, to a method, apparatus, and device for monocular depth estimation.
  • Monocular depth estimation uses images captured by a single camera as input to estimate real-world depth images (depth maps). Each pixel in the depth map stores a depth value, and the depth value is the distance between the three-dimensional (3-dimension, 3D) coordinate point of the real world corresponding to the pixel and the viewpoint of the camera. Monocular depth estimation can be applied to many important application scenarios that require 3D environmental information. These application scenarios include but are not limited to augmented reality (AR), navigation (such as autonomous driving), scene reconstruction, scene recognition, object detection, etc.
  • the monocular camera used in Monocular Depth Estimation (MDE) is usually an RGB or Grayscale camera.
  • RGB or grayscale cameras can capture high-quality images when the lighting is good, the light ratio is small, and the camera/scene motion is stable.
  • taking the monocular camera being an RGB camera as an example, the depth map is obtained by acquiring two RGB frames from the monocular camera and performing a stereo matching calculation on the two frames.
  • however, the above method of estimating depth by stereo matching of two RGB frames has the following problems.
  • the monocular camera needs to be in a moving state while the target object in the real world is in a static state, and rich texture details and good lighting conditions are required in the environment.
  • the target objects in the scene are often dynamic, such as cars on the road. This makes the above monocular depth estimation method unsuitable for depth estimation of target objects in real life.
  • the present application provides a monocular depth estimation method, apparatus and device, which are suitable for depth estimation of target objects in general or common scenes in real life.
  • an embodiment of the present application provides a monocular depth estimation method.
  • the method may include: acquiring an image to be estimated and a first parameter corresponding to the image to be estimated, where the first parameter is a camera calibration parameter of the camera that captures the image to be estimated.
  • the to-be-estimated image is input into the first neural network model, and the first distance-scaled normal (DSN) map output by the first neural network model is obtained; the first DSN map is used to represent the orientation of the plane of the target object corresponding to the to-be-estimated image and the distance between this plane and the camera.
  • according to the image to be estimated and the first parameter, a first camera filter map is determined; the first camera filter map is used to represent the mapping relationship between the 3D points of the target object in space and a 2D plane, where the 2D plane is the imaging plane of the camera.
  • a first depth map corresponding to the image to be estimated is determined according to the first DSN map and the first camera filter map.
  • this implementation obtains the depth map based on the first DSN map and the first camera filter map, and the depth map can accurately reflect the distance of the target object, thereby improving the accuracy of monocular depth estimation. Different from the method of estimating depth through stereo matching of two RGB frames, this implementation can perform depth estimation from a single frame of the image to be estimated, without the scene limitation that requires the monocular camera to be in motion and the real-world target object to be stationary, and can therefore be applied to depth estimation of target objects in general or common scenes in real life.
  • the first neural network model is obtained by training using a training image and a second DSN map corresponding to the training image, and the second DSN map is determined according to the second depth map corresponding to the training image and the camera calibration parameters corresponding to the training image.
  • since the first neural network model is obtained by training using the training image and the second DSN map corresponding to the training image, the first neural network model has the ability to output the DSN map corresponding to an input image; the depth map corresponding to the input image can then be obtained from that DSN map and the camera filter map corresponding to the input image, achieving monocular depth estimation.
  • this training image is used as input to the initial neural network model.
  • the loss function includes at least one of a first loss function, a second loss function or a third loss function, and the loss function is used to adjust parameters of the initial neural network model to obtain the first neural network model through training.
  • the first loss function is used to represent the error between the second DSN map and the third DSN map, where the third DSN map is the DSN map corresponding to the training image that is output by the initial neural network model.
  • the second loss function is used to represent the error between the second depth map and the third depth map, where the third depth map is determined according to the third DSN map and the second camera filter map, and the second camera filter map is determined according to the training image and the camera calibration parameters corresponding to the training image. The third loss function is used to represent the degree of matching between the second depth map and the third depth map.
  • by evaluating one or more of the error between the second DSN map and the third DSN map, the error between the second depth map and the third depth map, or the degree of matching between the second depth map and the third depth map, the neural network model is adjusted so that the adjusted neural network model meets one or more accuracy requirements, thereby improving the accuracy of the monocular depth estimation method of the embodiment of the present application that uses the trained neural network model.
  • the training image may be an image captured by any camera, such as an RGB camera, a grayscale camera, a night vision camera, a thermal camera, a panoramic camera, an event camera, or an infrared camera.
  • the neural network model is trained by using images captured by any camera such as an RGB camera, a grayscale camera, a night vision camera, a thermal camera, a panoramic camera, an event camera, or an infrared camera, so that the Monocular depth estimation methods can support depth estimation for images captured by different types of cameras.
  • determining the first camera filter map according to the image to be estimated and the first parameter includes: determining the first camera filter map according to the position coordinates of the pixels of the image to be estimated and the first parameter, where the first camera filter map includes the camera filter map vector corresponding to each pixel, the camera filter map vector is used to represent the mapping relationship between the 3D point and the pixel, and the pixel is the point at which the 3D point is projected onto the 2D plane.
  • the first camera filter map is determined according to the position coordinates of the pixel points of the image to be estimated and the first parameter.
  • the camera filter map is related to the pixel points of the input image and the camera model, and is not affected by the 3D structure of the target object in the scene.
  • the corresponding camera filter maps are the same and can be calculated only once.
  • the camera filter map can be recalculated according to the camera calibration parameters of the new camera.
  • the depth map is obtained through the camera filter map and DSN map, which can improve the processing speed of monocular depth estimation.
  • the position coordinates of the pixel include abscissa and ordinate
  • the camera filter map vector corresponding to the pixel includes a first camera filter map component and a second camera filter map component
  • the first camera filter map component is determined according to the abscissa and the first parameter
  • the second camera filter map component is determined according to the ordinate and the first parameter, or according to the abscissa, the ordinate and the first parameter.
  • when the field of view of the camera that captures the image to be estimated is less than 180 degrees, the first parameter includes the center coordinates (c_x, c_y) and the focal length (f_x, f_y) of the camera.
  • the first camera filter map component is determined according to the abscissa and the first parameter
  • the second camera filter map component is determined according to the ordinate and the first parameter.
  • when the field of view of the camera that captures the image to be estimated is less than 180 degrees, the first parameter includes the center coordinates (c_x, c_y) and the focal length (f_x, f_y) of the camera, the first camera filter map component is F_u, and the second camera filter map component is F_v.
  • when the field of view of the camera that captures the to-be-estimated image is greater than 180 degrees, the first parameter includes the width pixel value W and the height pixel value H of the to-be-estimated image; the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the abscissa, the ordinate and the first parameter.
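  • As an illustration only: the exact expressions for F_u and F_v are not reproduced in this text, so the Python sketch below assumes the standard pinhole normalization (F_u = (u - c_x)/f_x, F_v = (v - c_y)/f_y, with a constant third component) for the narrow field-of-view case; the function name, array shapes and the third component are illustrative assumptions, not taken from the source.

```python
import numpy as np

def camera_filter_map(width, height, cx, cy, fx, fy):
    """Per-pixel camera filter map for a pinhole camera (assumed form).

    Each pixel (u, v) gets a vector that depends only on the pixel
    coordinates and the camera calibration parameters, not on the scene.
    """
    u = np.arange(width, dtype=np.float64)
    v = np.arange(height, dtype=np.float64)
    uu, vv = np.meshgrid(u, v)          # uu, vv have shape (H, W)
    F_u = (uu - cx) / fx                # first filter map component (assumed)
    F_v = (vv - cy) / fy                # second filter map component (assumed)
    F_w = np.ones_like(F_u)             # third component fixed to 1 (assumed)
    return np.stack([F_u, F_v, F_w], axis=-1)   # shape (H, W, 3)

# Computed once per camera; images from the same camera reuse the same map.
F = camera_filter_map(640, 480, cx=320.0, cy=240.0, fx=500.0, fy=500.0)
```

  • Because such a map depends only on pixel coordinates and camera calibration parameters, it can be computed once per camera and reused for every image captured by that camera.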
  • the first DSN map includes the first DSN vector corresponding to each pixel of the image to be estimated, and determining the first depth map corresponding to the image to be estimated according to the first DSN map and the first camera filter map includes: determining the depth value corresponding to the pixel point according to the first DSN vector corresponding to the pixel point and the camera filter map vector corresponding to the pixel point, where the first depth map includes the depth value corresponding to the pixel point.
  • the depth value corresponding to the pixel is determined according to the first DSN vector corresponding to the pixel and the camera filter mapping vector corresponding to the pixel, including:
  • the inverse depth value corresponding to the pixel point is determined.
  • the depth value corresponding to the pixel point is determined.
  • ρ is the inverse depth value corresponding to the pixel point,
  • N is the first DSN vector corresponding to the pixel point, and
  • F is the camera filter map vector corresponding to the pixel point.
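  • The formula linking ρ, N and F is not reproduced in this text; a minimal sketch of one plausible reading, assuming the inverse depth of a pixel is the dot product of its DSN vector N and its camera filter map vector F (so depth = 1/(N·F)), is shown below. The function name and the clipping threshold are illustrative.

```python
import numpy as np

def depth_from_dsn(dsn_map, filter_map, eps=1e-6):
    """Recover a depth map from a DSN map and a camera filter map.

    Assumes inverse depth rho = N . F per pixel; depth = 1 / rho.
    dsn_map, filter_map: arrays of shape (H, W, 3).
    """
    rho = np.sum(dsn_map * filter_map, axis=-1)   # per-pixel inverse depth (assumed relation)
    rho = np.clip(rho, eps, None)                 # guard against division by zero
    return 1.0 / rho                              # per-pixel depth values
```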
  • the method may further include: acquiring a training image, a second depth image corresponding to the training image, and camera calibration parameters corresponding to the training image.
  • the initial neural network model is trained by using the training image, the second depth image corresponding to the training image, and the camera calibration parameters corresponding to the training image, and the first neural network model is obtained.
  • using the training image, the second depth image corresponding to the training image, and the camera calibration parameters corresponding to the training image to train the initial neural network model and obtain the first neural network model includes: determining a second camera filter map according to the camera calibration parameters corresponding to the training image and the training image, where the second camera filter map includes the camera filter map vectors corresponding to the pixels of the training image.
  • a second DSN map is obtained according to the second camera filter map and the second depth image, and the second DSN map is used to represent the orientation of the plane where the 3D points in the scene corresponding to the pixels of the training image are located and the distance between that plane and the camera.
  • the training image is input to the initial neural network model, and the third DSN map output by the initial neural network model is obtained.
  • the parameters of the initial neural network model are adjusted to obtain the first neural network model.
  • the camera filter map vector corresponding to a pixel point of the training image is denoted F.
  • the second DSN map includes the DSN vector of the plane where the 3D point in the scene corresponding to the pixel point of the training image is located.
  • obtaining a second DSN map according to the second camera filter map and the second depth image includes computing, for each pixel point, a DSN vector N_i = (N_xi, N_yi, N_zi), where:
  • i denotes a pixel point of the training image with position coordinates (u, v);
  • ρ_i is the inverse depth value of the 3D point in the scene corresponding to the pixel point i;
  • the inverse depth value is the inverse of the depth value; and
  • the second DSN map includes the DSN vectors of the planes where the 3D points in the scene corresponding to the pixel points of the training image are located.
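  • The closed-form expression used to derive the second DSN map from the second depth image and the second camera filter map is not reproduced here. One way such a label could be computed, assuming every pixel on a plane satisfies N·F_i = ρ_i, is a local least-squares fit over a small neighbourhood, as sketched below; the window size and the solver are illustrative choices, not taken from the source.

```python
import numpy as np

def dsn_from_depth(depth_map, filter_map, win=3):
    """Estimate a per-pixel DSN vector from a depth map and a camera filter map.

    Assumption: pixels on the same plane satisfy N . F_i = rho_i, where
    rho_i = 1 / depth_i, so N can be fitted locally by least squares.
    """
    H, W, _ = filter_map.shape
    rho = 1.0 / np.clip(depth_map, 1e-6, None)    # inverse depth per pixel
    dsn = np.zeros((H, W, 3))
    r = win // 2
    for y in range(r, H - r):
        for x in range(r, W - r):
            Fs = filter_map[y - r:y + r + 1, x - r:x + r + 1].reshape(-1, 3)
            rs = rho[y - r:y + r + 1, x - r:x + r + 1].reshape(-1)
            # Solve Fs @ N ~= rs in the least-squares sense for the local plane.
            N, *_ = np.linalg.lstsq(Fs, rs, rcond=None)
            dsn[y, x] = N
    return dsn
```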
  • the first loss function accumulates a per-pixel error between the DSN vector in the second DSN map and the DSN vector N_i = (N_xi, N_yi, N_zi) corresponding to pixel i in the third DSN map, evaluated over the entire set of valid pixels.
  • the second loss function accumulates, under a norm, a per-pixel error between the inverse depth value corresponding to pixel i in the second depth map and the inverse depth value ρ_i corresponding to pixel i in the third depth map, evaluated over the entire set of valid pixels.
  • in the third loss function, I is the image data in the second depth map that matches the third depth map, and the loss is evaluated over the matching set of pixels.
  • the overall loss function is a weighted combination of the above loss functions, where the weights λ_DEP, λ_DSN and λ_INP are each greater than or equal to 0.
  • acquiring the training image and the second depth image corresponding to the training image includes at least one of the following:
  • Acquire multiple training images, which are image data obtained by shooting scenes with multiple calibrated and synchronized cameras; use 3D vision technology to process the multiple training images to obtain the second depth images corresponding to the multiple training images; or,
  • the data optimization includes at least one of hole filling optimization, sharpening occlusion edge optimization, or temporal consistency optimization.
  • the environmental conditions of the scene captured by the original image are changed to obtain the training images under different environmental conditions.
  • the second depth image corresponding to each training image is obtained by a depth sensor; or, the second depth image corresponding to each training image is obtained from the depth image output after a teacher monocular depth estimation network processes the input training image.
  • an embodiment of the present application provides a monocular depth estimation apparatus, which has a function of implementing the first aspect or any possible design of the first aspect.
  • the functions can be implemented by hardware, and can also be implemented by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions, for example, an acquisition unit or module, a DSN unit or module, a camera filter mapping unit or module, and a depth estimation unit or module.
  • embodiments of the present application provide an electronic device, which may include: one or more processors; one or more memories; wherein the one or more memories are used to store one or more programs ; the one or more processors are configured to run the one or more programs to implement the method according to the first aspect or any possible design of the first aspect.
  • embodiments of the present application provide a computer-readable storage medium, which is characterized by comprising a computer program that, when executed on a computer, causes the computer to execute the method according to the first aspect or any possible design of the first aspect.
  • an embodiment of the present application provides a chip, which is characterized in that it includes a processor and a memory, the memory is used for storing a computer program, and the processor is used for calling and running the computer program stored in the memory, to perform the method described in the first aspect or any possible design of the first aspect.
  • embodiments of the present application provide a computer program product, which, when the computer program product runs on a computer, causes the computer to execute the method described in the first aspect or any possible design of the first aspect.
  • the image to be estimated and the camera calibration parameters are obtained, the first DSN map corresponding to the image to be estimated is obtained through the first neural network model, the first camera filter map is determined according to the camera calibration parameters, and a depth map is then obtained based on the first DSN map and the first camera filter map. Different from directly using the neural network model to output the depth map, the embodiment of the present application obtains the depth map based on the first DSN map and the first camera filter map, and the depth map can accurately reflect the distance of the target object, thereby improving the accuracy of monocular depth estimation.
  • the embodiment of the present application can perform depth estimation from a single frame of the image to be estimated, without the limitation that the monocular camera must be in motion and the real-world target object must be static, and can therefore be applied to depth estimation of target objects in general or common real-life scenes.
  • FIG. 1 is a schematic diagram of a system architecture 100 provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a convolutional neural network (CNN) 200 provided by an embodiment of the present application;
  • FIG. 3 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a system architecture 400 provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of a monocular depth estimation method provided by an embodiment of the present application.
  • FIG. 6 is a flowchart of another monocular depth estimation method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a geometric correspondence between a 3D point and a perspective (pinhole) camera model in a scene provided by an embodiment of the present application;
  • FIG. 8 is a schematic diagram of a monocular depth estimation processing process provided by an embodiment of the present application.
  • FIG. 9 is a flowchart of a method for training a first neural network model provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a training process provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a training process of a first neural network model provided by an embodiment of the application.
  • FIG. 12 is a schematic diagram of a training process of a first neural network model provided by an embodiment of the application.
  • FIG. 13 is a schematic structural diagram of a monocular depth estimation apparatus provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
  • FIG. 15 is a schematic structural diagram of another monocular depth estimation apparatus provided by an embodiment of the present application.
  • "At least one (item)" refers to one or more, and "a plurality" refers to two or more.
  • "And/or" is used to describe the relationship between related objects, indicating that there can be three kinds of relationships; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
  • the character "/" generally indicates that the associated objects are in an "or" relationship.
  • "At least one of the following items" or similar expressions refer to any combination of these items, including any combination of a single item or plural items.
  • for example, at least one of a, b or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c can each be singular or plural.
  • the monocular depth estimation method of the embodiment of the present application does not require the scene limitation that the monocular camera is in a moving state and the target object in the real world is in a static state, and can be applied to Depth estimation of target objects in real-life general or common scenes.
  • the general or common scene described in the embodiments of the present application specifically refers to any scene without conditional restrictions, where the conditional restrictions may include but are not limited to lighting condition restrictions, camera type restrictions, target object type restrictions, or restrictions on the relative positional relationship between the camera and the target object in the scene, etc.
  • the lighting condition can be the lighting of the environment in which the scene is located.
  • Camera types can be RGB cameras, grayscale cameras, event cameras, night vision cameras, or thermal cameras, etc.
  • the target object type can be a person, an animal, or an object, etc.
  • the relative positional relationship between the camera and the target object in the scene can be close-range, distant, still or moving, and so on.
  • the monocular depth estimation method of the embodiment of the present application can be applied to application scenarios such as augmented reality (AR), navigation (e.g., automatic driving or assisted driving), scene reconstruction, scene understanding, or object detection, and the like.
  • the image to be estimated is input into the first neural network model, the first distance scaled normal (DSN) map output by the first neural network model is obtained, the first camera filter map is determined according to the image to be estimated and the first parameter, and then the first depth map corresponding to the image to be estimated is determined according to the first DSN map and the first camera filter map.
  • the first parameter is a camera calibration parameter of the camera that captures the image to be estimated.
  • the first DSN map is related to the 3D structure (eg, geometric structure) of the target object in the scene, and is not affected by the camera model (eg, including geometric projection model and camera calibration parameters, etc.).
  • the first camera filter map is related to the pixel points of the image to be estimated and the camera model, and is not affected by the 3D structure of the target object in the scene. Compared with the traditional depth estimation network, determining the depth map through the first DSN map and the first camera filter map can improve the accuracy and efficiency of monocular depth estimation. For its specific implementation, reference may be made to the explanations of the following embodiments.
  • Target Objects including but not limited to people, animals or objects.
  • the objects may be objects in the natural environment, such as grass, or trees, etc., or may be objects in the human environment, such as buildings, roads, or vehicles.
  • the surface of the target object usually has a regular planar structure. In some embodiments, even if the surface of the target object does not have a completely planar structure, the surface of the target object may be divided into a plurality of small planar regions.
  • 3D point of the target object in space: a point on a plane of the outer surface of the target object in space.
  • the 3D point of the target object in space can be any point on the outer surface of the vehicle, such as a point on the plane formed by the front windshield, or a point on the plane formed by the license plate.
  • DSN map: used to represent the orientation of the plane of the target object corresponding to the input image and the distance between the plane and the camera.
  • the camera refers to the camera that captures the input image
  • the plane of the target object refers to the plane of the outer surface of the target object in 3D space.
  • for example, if the target object is a cube and the camera shoots the cube from an angle such that the obtained input image presents 3 planes of the cube, then the plane of the target object refers to the 3 planes of the cube.
  • the DSN map includes the DSN vectors of the three planes of the cube, and the DSN vector of each plane can represent the orientation of the respective plane and the distance between the respective plane and the camera.
  • the DSN vector of each plane is related to the 3D structure of the target object in the scene and is not affected by the camera model.
  • the DSN vectors corresponding to the coplanar 3D points of the target object in the same plane are the same.
  • the DSN vector of each pixel is always the same as the DSN vector of the adjacent pixels belonging to the same plane.
  • each pixel in the DSN map stores a DSN vector.
  • the number of data items stored in each pixel in the DSN map is called the number of channels. Since each pixel here stores a DSN vector, the number of channels is equal to the number of components of the DSN vector; the data stored by one pixel is therefore 3-channel data, and one channel of one pixel in the DSN map is used to store one component of the DSN vector, that is, one dimension of the vector.
  • the monocular depth estimation method may use a neural network model to process an input image to obtain a DSN map.
  • the input image is the image to be estimated, and the first neural network model (also called the target neural network model) is used to process the to-be-estimated image to obtain the first DSN map.
  • the training process of the neural network model the input image is the training image, and the initial neural network model is used to process the training image to obtain the second DSN map.
  • This embodiment of the present application uses the first DSN map and the second DSN map to distinguish the DSN maps output by the neural network model in different processes.
  • Camera filter map: used to represent the mapping relationship between the 3D points of the target object in space and the 2D plane, where the 2D plane is the imaging plane of the camera.
  • the camera filter map is related to the pixel points of the input image and the camera model, and is not affected by the 3D structure of the target object in the scene.
  • the camera model may include a geometric projection model, camera calibration parameters, and the like, and the camera calibration parameters may include camera center coordinates, focal length, and the like.
  • the corresponding camera filter maps are the same and can be calculated only once.
  • the camera filter map can be recalculated according to the camera calibration parameters of the new camera.
  • each pixel in the camera filter map stores a camera filter map vector.
  • the camera 1 shoots the target object 1, and the input image 11 is obtained.
  • the camera 1 shoots the target object 2, and the input image 12 is obtained. Since the input image 11 and the input image 12 are both collected by the camera 1, the camera filter map corresponding to the input image 11 is the same as the camera filter map corresponding to the input image 12.
  • Depth map: used to represent the distance (depth) from the 3D points in space of the target object corresponding to the input image to the camera.
  • Each pixel in the depth map stores a depth value that estimates the distance between the real-world 3D point corresponding to the pixel and the camera's viewpoint at the time the input image was captured by the camera.
  • the real-world 3D point can be any target object in any scene, a 3D point in space.
  • the depth value of a pixel in the depth map can be determined by two parts, which include the DSN vector of the pixel and the camera filter map vector of the pixel.
  • the electronic device in this embodiment of the present application may be a mobile phone, a tablet computer (Pad), a computer with a wireless transceiver function, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a terminal device in industrial control, a terminal device in assisted driving or self-driving, a terminal device in remote medical surgery, a terminal device in a smart grid, a terminal device in transportation safety, a terminal device in a smart city, a terminal device in a smart home, a smart watch, a smart bracelet, smart glasses, other sports accessories or wearables, and so on.
  • a terminal device in a smart home may be a smart home appliance such as a smart TV and a smart speaker.
  • a neural network can be composed of neural units, and a neural unit can refer to an operation unit that takes x_s and an intercept 1 as input, and the output of the operation unit can be: h_{W,b}(x) = f(W^T x) = f(Σ_s W_s x_s + b), where:
  • W_s is the weight of x_s, and
  • b is the bias of the neural unit.
  • f is an activation function of the neural unit, which is used to perform nonlinear transformation on the features obtained in the neural network, and convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
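  • A minimal numeric illustration of the neural unit described above, with a sigmoid activation; the input, weight and bias values are arbitrary.

```python
import numpy as np

def neural_unit(x, w, b):
    """Single neural unit: weighted sum of inputs plus bias, passed through a sigmoid."""
    z = np.dot(w, x) + b                 # sum_s W_s * x_s + b
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation f

# Example with arbitrary values.
out = neural_unit(x=np.array([0.5, -1.2, 3.0]),
                  w=np.array([0.8, 0.1, -0.4]),
                  b=0.2)
```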
  • a neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
  • a deep neural network also known as a multi-layer neural network, can be understood as a neural network with multiple hidden layers.
  • the DNN is divided according to the positions of different layers.
  • the layers inside the DNN can be divided into three categories: input layer, hidden layers, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the middle layers are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • although the DNN looks complicated, the work of each layer is not complicated. In short, each layer computes the following linear relationship expression: y = α(W·x + b), where x is the input vector, y is the output vector, b is the offset vector, W is the weight matrix (also called coefficients), and α() is the activation function.
  • each layer simply performs this operation on an input vector x to obtain the output vector y. Due to the large number of DNN layers, the numbers of coefficient matrices W and offset vectors b are also large.
  • take the coefficient W as an example: suppose that in a three-layer DNN, the linear coefficient from the 4th neuron in the second layer to the 2nd neuron in the third layer is defined as W^3_24, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4.
  • in summary, the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as W^L_jk.
  • the input layer does not have a W parameter.
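  • The layer-by-layer operation described above (output vector = activation(W·input vector + offset vector) for each layer) can be sketched as follows; the layer sizes and the use of a sigmoid activation are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dnn_forward(x, weights, biases):
    """Forward pass of a fully connected DNN: y = f(W x + b), applied layer by layer."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)    # each layer: linear transform, offset, activation
    return a

# Three-layer example with arbitrary shapes (input 4 -> hidden 5 -> output 2).
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(2, 5))]
biases = [np.zeros(5), np.zeros(2)]
y = dnn_forward(rng.normal(size=4), weights, biases)
```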
  • more hidden layers allow the network to better capture the complexities of the real world.
  • a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vectors of many layers).
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of convolutional layers and subsampling layers, which can be viewed as a filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • a convolutional layer of a convolutional neural network a neuron can only be connected to some of its neighbors.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle. Neural units in the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as the way to extract features is independent of location.
  • the convolution kernel can be formalized in the form of a matrix of random size, and the convolution kernel can be learned to obtain reasonable weights during the training process of the convolutional neural network.
  • the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • the pixel value of the image can be a red-green-blue (RGB) color value, and the pixel value can be a long integer representing the color.
  • the pixel value is 256*Red+100*Green+76*Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. In each color component, the smaller the value, the lower the brightness, and the larger the value, the higher the brightness.
  • the pixel values can be grayscale values.
  • an embodiment of the present application provides a system architecture 100 .
  • the data collection device 160 is used to collect training data.
  • the training data in this embodiment of the present application may include the training image and the second DSN map corresponding to the training image, or include the training image and the second depth map corresponding to the training image, or include the training image and the second DSN map corresponding to the training image.
  • the training device 120 processes the training data through the training method of the first neural network model in the following embodiments of the present application, and compares the output image with the target image (for example, the second DSN map) until the difference between the image output by the training device 120 and the target image is less than a certain threshold, so that the training of the target model/rule 101 is completed.
  • the target model/rule in this embodiment of the present application is used to process the input image to be estimated and output a first DSN map, where the first DSN map is used to represent the orientation of the plane of the target object corresponding to the image to be estimated and the distance between the plane and the camera.
  • the target model/rule 101 can be used to implement the monocular depth estimation method provided by the embodiment of the present application, that is, the image to be processed, such as the image to be estimated, is input into the target model/rule 101 after relevant preprocessing to obtain the first depth map.
  • the target model/rule 101 in this embodiment of the present application may specifically be a neural network.
  • the training data maintained in the database 130 may not necessarily come from the collection of the data collection device 160, and may also be received from other devices.
  • the training device 120 may not necessarily train the target model/rule 101 completely based on the training data maintained by the database 130, and may also obtain training data from the cloud or other places for model training. The above description should not be construed as a limitation on the embodiments of the present application.
  • the target model/rule 101 trained by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 1, which may be a terminal such as a laptop, an augmented reality (AR)/virtual reality (VR) device, or an in-vehicle terminal, and may also be a server or the cloud.
  • the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices.
  • the user can input data to the I/O interface 112 through the client device 140, and the input data may include: an image to be estimated.
  • the preprocessing module 113 is configured to perform preprocessing according to the input data (eg, the image to be estimated) received by the I/O interface 112 .
  • the preprocessing module 113 may be used to perform image filtering and other processing on the input data.
  • the preprocessing module 113 and the preprocessing module 114 may also be absent, and the calculation module 111 may be directly used to process the input data.
  • when the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculations and other related processing, the execution device 110 can call the data, codes, etc. in the data storage system 150 for corresponding processing, and the data and instructions obtained by the corresponding processing may also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing result, such as the first depth map obtained as described above, to the client device 140, so as to be provided to the user.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete The above task, thus providing the user with the desired result.
  • the user can manually specify input data, which can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send the input data to the I/O interface 112 . If the user's authorization is required to request the client device 140 to automatically send the input data, the user can set the corresponding permission in the client device 140 .
  • the user can view the result output by the execution device 110 on the client device 140, and the specific presentation form can be a specific manner such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal to collect the input data of the input I/O interface 112 and the output result of the output I/O interface 112 as new sample data as shown in the figure, and store them in the database 130 .
  • alternatively, the I/O interface 112 may directly store the input data input into the I/O interface 112 and the output result of the I/O interface 112, as shown in the figure, as new sample data in the database 130.
  • FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 150 is an external memory relative to the execution device 110 , and in other cases, the data storage system 150 may also be placed in the execution device 110 .
  • the training device 120 and the execution device 110 are two devices, and in other cases, the training device 120 and the execution device 110 may be one device.
  • the training device 120 and the execution device 110 may be a server or a server cluster
  • the client device 140 may establish a connection with the server
  • the server may process the image to be estimated by using the monocular depth estimation method in the embodiment of the present application to obtain the first depth map, and provide the first depth map to the client device 140.
  • the execution device 110 and the client device 140 are two devices, and in other cases, the execution device 110 and the client device 140 may be one device.
  • the execution device 110 and the client device 140 may be a smart phone
  • the training device 120 may be a server or a server cluster
  • the server may process the training data through the training method of the first neural network model in the embodiment of the present application
  • a target model/rule is generated, and the target model/rule is provided to the smartphone, so that the smartphone can process the image to be estimated by the monocular depth estimation method of the embodiment of the present application to obtain a first depth map.
  • a target model/rule 101 is obtained by training according to the training device 120.
  • the target model/rule 101 may be the first neural network model in the present application.
  • the first neural network model in the present application may include a CNN or a deep convolutional neural network (DCNN), among others.
  • CNN is a very common neural network
  • a convolutional neural network is a deep neural network with a convolutional structure and a deep learning architecture, which learns at multiple levels of abstraction.
  • CNN is a feed-forward artificial neural network in which individual neurons can respond to images fed into it.
  • a convolutional neural network (CNN) 200 may include an input layer 210 , a convolutional/pooling layer 220 (where the pooling layer is optional), and a fully connected layer 230 .
  • the convolutional/pooling layer 220 may include layers 221-226. For example, in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or it can be used as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 221 may include many convolution operators.
  • the convolution operator is also called a kernel. Its role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator is essentially a weight matrix, which is usually predefined. During the convolution operation on the image, the weight matrix is usually slid over the input image along the horizontal direction one pixel after another (or two pixels after two pixels, depending on the value of the stride), so as to complete the work of extracting specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • each weight matrix is stacked to form the depth dimension of the convolutional image, where the dimension can be understood as determined by the "multiple" described above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to extract unwanted noise in the image.
  • the multiple weight matrices have the same size (row ⁇ column), and the size of the feature maps extracted from the multiple weight matrices with the same size is also the same, and then the multiple extracted feature maps with the same size are combined to form a convolution operation. output.
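  • As a small illustration of the convolution operation described above, the following sketch applies several same-sized kernels to a single-channel image and stacks the resulting feature maps along the depth dimension; the kernel values, sizes and stride of 1 are arbitrary.

```python
import numpy as np

def conv2d_valid(image, kernels):
    """Apply several same-sized kernels to a single-channel image (stride 1, no padding).

    image   : (H, W) array.
    kernels : (K, kh, kw) array; the K outputs are stacked along the last axis.
    """
    K, kh, kw = kernels.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1, K))
    for k in range(K):
        for y in range(out.shape[0]):
            for x in range(out.shape[1]):
                out[y, x, k] = np.sum(image[y:y + kh, x:x + kw] * kernels[k])
    return out   # feature maps stacked to form the depth dimension

# Two 3x3 kernels: an edge-detection-like kernel and an averaging kernel.
kernels = np.stack([np.array([[-1, 0, 1]] * 3, float),
                    np.full((3, 3), 1.0 / 9.0)])
features = conv2d_valid(np.random.default_rng(0).normal(size=(8, 8)), kernels)
```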
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained by training can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions .
  • the features extracted by the initial convolutional layer (e.g., 221) are relatively general, while the features extracted by the later convolutional layers (e.g., 226) become more and more complex, such as high-level semantic features.
  • features with higher semantics are more suitable for the problem to be solved.
  • a pooling layer may follow a convolutional layer: the network can consist of one convolutional layer followed by one pooling layer, or of multiple convolutional layers followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the pixel values in the image within a certain range to produce an average value as the result of average pooling.
  • the max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size.
  • the size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
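  • A small sketch of the average and max pooling operators described above, using a non-overlapping 2x2 window; the window size is illustrative.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping 2D pooling: each size x size block becomes one output pixel."""
    H, W = feature_map.shape
    fm = feature_map[:H - H % size, :W - W % size]      # crop to a multiple of the window
    blocks = fm.reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, mode="max"))    # 2x2 output, each value the max of a 2x2 block
print(pool2d(x, mode="avg"))    # 2x2 output, each value the average of a 2x2 block
```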
  • after being processed by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet able to output the required output information, because, as mentioned before, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. In order to generate the final output information (the required class information or other relevant information), the convolutional neural network 200 needs to use the fully connected layer 230 to generate one output or a set of outputs of the required number of classes. Therefore, the fully connected layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 2), and the parameters contained in the multiple hidden layers may be obtained by pre-training based on relevant training data of specific task types; for example, the task types can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • after the multiple hidden layers in the fully connected layer 230, the last layer of the entire convolutional neural network 200 is the output layer 240, which has a loss function similar to the categorical cross-entropy and is specifically used to calculate the prediction error.
  • once the forward propagation of the entire convolutional neural network 200 (the propagation from 210 to 240 in Figure 2) is completed, the back propagation (the propagation from 240 to 210 in Figure 2) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
  • the convolutional neural network 200 shown in FIG. 2 is only used as an example of a convolutional neural network.
  • the convolutional neural network can also exist in the form of other network models, including only a part of the network structure shown in FIG. 2; for example, the convolutional neural network adopted in this embodiment of the present application may only include an input layer 210, a convolutional layer/pooling layer 220 and an output layer 240.
  • FIG. 3 is a hardware structure of a chip provided by an embodiment of the application, and the chip includes a neural network processor 30 .
  • the chip can be set in the execution device 110 as shown in FIG. 1 to complete the calculation work of the calculation module 111 .
  • the chip can also be set in the training device 120 as shown in FIG. 1 to complete the training work of the training device 120 and output the target model/rule 101 .
  • the algorithms of each layer in the convolutional neural network shown in Figure 2 can be implemented in the chip shown in Figure 3.
  • Both the monocular depth estimation method and the training method of the first neural network model in the embodiment of the present application can be implemented in the chip as shown in FIG. 3 .
  • the neural network processor 30 may be a neural-network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or the like, all of which are suitable for large-scale applications.
  • the NPU is mounted on the main central processing unit (CPU) (host CPU) as a co-processor, and the main CPU assigns tasks.
  • the core part of the NPU is the operation circuit 303, and the controller 304 controls the operation circuit 303 to extract the data in the memory (weight memory or input memory) and perform operations.
  • TPU is Google's fully customized artificial intelligence accelerator application-specific integrated circuit for machine learning.
  • the arithmetic circuit 303 includes multiple processing units (process engines, PEs). In some implementations, arithmetic circuit 303 is a two-dimensional systolic array. The arithmetic circuit 303 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 303 is a general-purpose matrix processor.
  • the arithmetic circuit 303 fetches the weight data of the matrix B from the weight memory 302 and buffers it on each PE in the arithmetic circuit 303 .
  • the arithmetic circuit 303 fetches the input data of the matrix A from the input memory 301 , performs matrix operation according to the input data of the matrix A and the weight data of the matrix B, and stores the partial result or the final result of the matrix in the accumulator 308 .
  • the vector calculation unit 307 can further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector computing unit 307 can be used for network computation of non-convolutional/non-FC layers in the neural network, such as pooling, batch normalization, local response normalization, etc. .
  • the vector computation unit 307 can store the processed output vectors to the unified buffer 306 .
  • the vector calculation unit 307 may apply a nonlinear function to the output of the arithmetic circuit 303, such as a vector of accumulated values, to generate activation values.
  • vector computation unit 307 generates normalized values, merged values, or both.
  • vector computation unit 307 stores the processed vectors to unified memory 306 .
  • the vector processed by the vector computing unit 307 can be used as the activation input of the arithmetic circuit 303, for example, for use in subsequent layers in the neural network, as shown in FIG. 2, if the current processing layer is the hidden layer 1 (231), the vector processed by the vector calculation unit 307 can also be used for calculation in the hidden layer 2 (232).
  • Unified memory 306 is used to store input data and output data.
  • the weight data is directly stored in the weight memory 302 through a storage unit access controller (direct memory access controller, DMAC) 305.
  • Input data is also stored in unified memory 306 via the DMAC.
  • the bus interface unit (BIU) 310 is used for the interaction between the DMAC and the instruction fetch buffer 309; the bus interface unit 310 is also used for the instruction fetch memory 309 to obtain instructions from the external memory, and for the storage unit access controller 305 to acquire the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to store the input data in the external memory DDR into the unified memory 306 , or store the weight data into the weight memory 302 , or store the input data into the input memory 301 .
  • An instruction fetch buffer 309 connected to the controller 304 is used to store the instructions used by the controller 304.
  • the controller 304 is used for invoking the instructions cached in the memory 309 to control the working process of the operation accelerator.
  • the unified memory 306, the input memory 301, the weight memory 302 and the instruction fetch memory 309 are all on-chip memories, and the external memory is the memory outside the NPU; the external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or other readable and writable memory.
  • each layer in the convolutional neural network shown in FIG. 2 may be performed by the operation circuit 303 or the vector calculation unit 307 .
  • both the training method of the first neural network model and the monocular depth estimation method in the embodiment of the present application may be executed by the operation circuit 303 or the vector calculation unit 307 .
  • an embodiment of the present application provides a system architecture 400 .
  • the system architecture includes a local device 401, a local device 402, an execution device 410 and a data storage system 450, wherein the local device 401 and the local device 402 are connected with the execution device 410 through a communication network.
  • Execution device 410 may be implemented by one or more servers.
  • the execution device 410 may be used in conjunction with other computing devices, such as data storage, routers, load balancers and other devices.
  • the execution device 410 may be arranged on one physical site, or distributed across multiple physical sites.
  • the execution device 410 may use the data in the data storage system 450 or call the program code in the data storage system 450 to implement the training method and/or the monocular depth estimation method of the first neural network model in the embodiment of the present application.
  • a user may operate respective user devices (eg, local device 401 and local device 402 ) to interact with execution device 410 .
  • Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, gaming console, etc.
  • Each user's local device can interact with the execution device 410 through a communication network of any communication mechanism/communication standard.
  • the communication network can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
  • The local device 401 and the local device 402 obtain the first neural network model from the execution device 410, deploy the first neural network model on the local device 401 and the local device 402, and use the first neural network model to perform monocular depth estimation.
  • the first neural network model may be directly deployed on the execution device 410.
  • The execution device 410 acquires the images to be processed from the local device 401 and the local device 402, and uses the first neural network model to perform monocular depth estimation on the images to be processed.
  • The above execution device 410 may also be a cloud device, in which case the execution device 410 may be deployed in the cloud; or the above execution device 410 may also be a terminal device, in which case the execution device 410 may be deployed on the user terminal side. This is not limited in the embodiments of the present application.
  • FIG. 5 is a flowchart of a monocular depth estimation method according to an embodiment of the present application. As shown in FIG. 5 , the method in this embodiment may include:
  • Step 101 Acquire the image to be estimated and the first parameter corresponding to the image to be estimated.
  • the image to be estimated is an image obtained by photographing a target object in a 3D space by a camera.
  • the to-be-estimated image may be an image captured (also referred to as acquisition) by any camera such as an RGB camera, a grayscale camera, a night vision camera, a thermal camera, a panoramic camera, an event camera, or an infrared camera.
  • the image to be estimated may be a frame of image.
  • the first parameter is the camera calibration parameter of the camera that captures the image to be estimated.
  • Camera calibration parameters are parameters related to the camera's own characteristics.
  • the camera calibration parameters may include the center coordinates and focal length of the camera.
  • the camera calibration parameters may include width pixel values and height pixel values of the image to be estimated.
  • One way to acquire the image to be estimated is for the device to capture the image to be estimated with any of the above-mentioned cameras of the device itself.
  • Another way to obtain the image to be estimated may be to receive the image to be estimated sent by other devices, and the image to be estimated may be collected by cameras of other devices.
  • Step 102 Input the image to be estimated into the first neural network model, and obtain the first DSN map output by the first neural network model.
  • the first DSN map is used to represent the orientation of the plane of the target object corresponding to the image to be estimated and the distance between the plane and the camera.
  • the input image here is the image to be estimated.
  • the first DSN map is related to the geometry of the target object in 3D space, independent of the camera model, which can more accurately represent the three-dimensional world.
  • the first neural network model may be any neural network model, for example, a deep neural network (Deep Neural Network, DNN), a convolutional neural network (Convolutional Neural Networks, CNN) or a combination thereof, and the like.
  • the first neural network model is obtained by training using the training image and the second DSN map corresponding to the training image.
  • the second DSN graph is the ground truth in the process of training the neural network model.
  • the second DSN map is determined according to a second depth map corresponding to the training image and camera calibration parameters corresponding to the training image.
  • the second depth map is the ground truth in the process of training the neural network model, and the second DSN map can be determined by the camera calibration parameters corresponding to the second depth map and the training image.
  • The first neural network model is trained with the training image and the second DSN map corresponding to the training image, and learns the mapping from an input image to its DSN map, so that it can intelligently perceive the above-mentioned image to be estimated and output the DSN map corresponding to the image to be estimated.
  • Step 103 Determine a first camera filter map according to the image to be estimated and the first parameter.
  • the first camera filter map is used to represent the mapping relationship between the 3D points of the target object in space and the 2D plane, where the 2D plane is the imaging plane of the camera.
  • The input image here is the image to be estimated; that is, the first camera filter map is related to the pixels of the image to be estimated and the camera model, and is not affected by the 3D structure of the target object in the scene.
  • the first camera filter map is determined according to the position coordinates of the pixels of the image to be estimated and the first parameter.
  • the first camera filter map includes camera filter map vectors corresponding to pixels.
  • the pixel points in the first camera filter map store the camera filter map vector.
  • the camera filter vector is used to represent the mapping relationship between the 3D point and the pixel point, and the pixel point is the point where the 3D point is projected to the 2D plane (camera imaging plane).
  • the position coordinates of the pixel points may include an abscissa and an ordinate
  • the camera filter map vector corresponding to the pixel includes a first camera filter map component and a second camera filter map component
  • the first camera filter map component is determined according to the abscissa and the first parameter.
  • the second camera filter map component is determined according to the ordinate and the first parameter, or according to the abscissa, the ordinate and the first parameter.
  • the first parameter may include the center coordinates and focal length of the camera.
  • the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the ordinate and the first parameter.
  • In another possible implementation, the above-mentioned first parameter may include the width pixel value and the height pixel value of the to-be-estimated image; in this case, the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the abscissa, the ordinate and the first parameter.
  • Step 104 Determine a first depth map corresponding to the image to be estimated according to the first DSN map and the first camera filter map.
  • The first depth map may include a depth value corresponding to each pixel; in other words, each pixel in the first depth map stores a depth value, and the depth value represents the distance, at the time the image to be estimated was captured by the camera, between the real-world 3D point corresponding to the pixel and the camera.
  • the first depth map is a dense, edge-aware, metric-scale depth map.
  • Through the above steps, two parts can be obtained: the first DSN map and the first camera filter map; from these two parts, the first depth map corresponding to the image to be estimated can finally be obtained.
  • This embodiment of the present application uses the first neural network model to obtain the first DSN map, where the first DSN map can accurately represent the geometric structure of the target object in the 3D space.
  • the first camera filter map is determined according to the camera calibration parameters of the image to be estimated.
  • the first camera filter map is related to the pixels of the image to be estimated and the camera model, and is not affected by the 3D structure of the target object in the scene.
  • the depth map is then obtained based on these two parts.
  • The embodiment of the present application obtains the depth map based on the first DSN map and the first camera filter map, and the depth map can accurately reflect the distance of the target object and improve the accuracy of monocular depth estimation.
  • the first DSN map may include the first DSN vector corresponding to the pixel points of the image to be estimated, in other words, the pixels in the first DSN map store the DSN vector. According to the first DSN vector corresponding to the pixel point and the camera filter map vector corresponding to the pixel point, the depth value corresponding to the pixel point can be determined. That is, corresponding operations are performed on vectors at the same pixel position in the first DSN map and the first camera filter map, so as to obtain the depth value of the corresponding pixel position.
  • A specific implementation of determining the depth value corresponding to pixel i is given by the following Formula 1 and Formula 2.
  • ω_i is the inverse depth value of the 3D point of the target object in the scene corresponding to pixel i
  • N i is the first DSN vector corresponding to pixel i
  • F i is the camera filter mapping vector corresponding to pixel i.
  • Ni can be obtained from pixel i of the first DSN map
  • F i can be obtained from pixel i of the first camera filter map.
  • the depth value corresponding to pixel i can be obtained by taking the inverse of the inverse depth value. For example, it is determined according to the following formula 2.
  • Z_i is the depth value corresponding to pixel i.
  • the above pixel point i may be any pixel point in the image to be estimated.
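  • For readability, the relationship described above can be written as follows (a reconstruction based on the variable definitions given here, not a reproduction of the original Formula 1 and Formula 2; the symbol ω for the inverse depth is assumed):

$$\omega_i = N_i \cdot F_i, \qquad Z_i = \frac{1}{\omega_i},$$

where the dot denotes the scalar product of the first DSN vector N_i and the camera filter map vector F_i at pixel i, and Z_i is the resulting depth value.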
  • The depth value corresponding to pixel i determined by Formula 1 and Formula 2 can be applied to monocular depth estimation of images to be estimated collected by cameras of the perspective camera model, and can also be applied to monocular depth estimation of images to be estimated collected by cameras of non-perspective camera models; cameras of non-perspective camera models include but are not limited to panoramic cameras, 360-degree spherical cameras, catadioptric cameras, fisheye cameras, and the like.
  • In the embodiment of the present application, the image to be estimated and the camera calibration parameters are obtained, the first DSN map corresponding to the image to be estimated is obtained through the first neural network model, and the first camera filter map is determined according to the camera calibration parameters.
  • The depth map is then obtained based on the first DSN map and the first camera filter map; the depth map can accurately reflect the distance of the target object, thereby improving the accuracy of monocular depth estimation.
  • The embodiment of the present application can perform depth estimation from a single frame of the image to be estimated; there is no need for the monocular camera to be in motion or for the real-world target object to be static, so the method can be applied to depth estimation of target objects in general or common real-life scenes.
  • FIG. 6 is a flowchart of another monocular depth estimation method according to an embodiment of the present application. As shown in FIG. 6 , the method in this embodiment may include:
  • Step 201 Acquire the image to be estimated and the first parameter corresponding to the image to be estimated.
  • the first parameter corresponding to the image to be estimated may include the center coordinate and focal length of the camera, or may include the width pixel value and the height pixel value of the image to be estimated.
  • the width pixel value and the height pixel value of the image to be estimated may be calculated based on the center coordinates of the camera.
  • Step 202 Input the image to be estimated into the first neural network model, and obtain the first DSN map output by the first neural network model.
  • For the explanation of steps 201 to 202, reference may be made to the specific explanation of steps 101 to 102 of the embodiment shown in FIG. 5, which will not be repeated here.
  • Step 203 Determine whether the field of view of the camera that shoots the image to be estimated is less than 180°, if yes, go to Step 204, and if not, go to Step 205.
  • Step 204 Determine a first camera filter map according to the abscissa and ordinate of the pixel of the image to be estimated, and the center coordinate and focal length of the camera that captures the image to be estimated.
  • the first camera filter map includes camera filter map vectors corresponding to pixels.
  • Taking pixel i as an example, the camera filter map vector corresponding to pixel i in this step may be determined by the following Formulas 3 to 5.
  • F i is the camera filter map vector corresponding to the pixel i
  • F_u is the first camera filter map component of F_i
  • F v is the second camera filter map component of F i
  • i = (u, v), where u is the abscissa of pixel i and v is the ordinate of pixel i.
  • (c x , c y ) are the center coordinates of the camera that captures the image to be estimated
  • (f x , f y ) are the focal lengths of the camera that captures the image to be estimated.
  • the above-mentioned pixel point i may be any pixel point in the image to be estimated, so that the first camera filter map can be obtained.
  • the first camera filter map can be determined by using Formula 3 to Formula 5 in this step.
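  • As an illustration only, the following is a minimal sketch of how such a per-pixel camera filter map could be computed for a camera whose field of view is less than 180°, assuming the standard pinhole relation between pixel coordinates and viewing rays; the exact content of Formula 3 to Formula 5 is not reproduced here, and the function name and array layout are assumptions rather than the embodiment's implementation.

```python
import numpy as np

def perspective_camera_filter_map(width, height, cx, cy, fx, fy):
    """Sketch: camera filter map vectors F_i = (F_u, F_v, 1) for a pinhole
    camera, assuming F_u = (u - cx) / fx and F_v = (v - cy) / fy."""
    u = np.arange(width, dtype=np.float32)
    v = np.arange(height, dtype=np.float32)
    uu, vv = np.meshgrid(u, v)               # pixel coordinate grids, shape (H, W)
    f_u = (uu - cx) / fx                     # first camera filter map component
    f_v = (vv - cy) / fy                     # second camera filter map component
    return np.stack([f_u, f_v, np.ones_like(f_u)], axis=-1)   # shape (H, W, 3)
```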
  • Step 205 Determine the first camera filter map according to the abscissa and ordinate of the pixel points of the image to be estimated, and the width and height pixel values of the image to be estimated.
  • the pixel point i is taken as an example, and the specific implementation manner of determining the camera filter mapping vector corresponding to the pixel point i in this step may be determined by, for example, formula 3, formula 6 and formula 7.
  • W is the width pixel value of the image to be estimated
  • H is the height pixel value of the image to be estimated
  • the above-mentioned pixel point i may be any pixel point in the image to be estimated, so that the first camera filter map can be obtained.
  • Formula 3, Formula 6, and Formula 7 in this step may be used to determine the first camera filter map.
  • Step 206 Determine a first depth map corresponding to the image to be estimated according to the first DSN map and the first camera filter map.
  • step 206 may refer to the specific explanation of step 104 in the embodiment shown in FIG. 5 , which is not repeated here.
  • In the embodiment of the present application, the image to be estimated and the camera calibration parameters are obtained, the first DSN map corresponding to the image to be estimated is obtained through the first neural network model, and the first camera filter map is determined according to the camera calibration parameters.
  • The depth map is then obtained based on the first DSN map and the first camera filter map; the depth map can accurately reflect the distance of the target object, thereby improving the accuracy of monocular depth estimation.
  • The embodiment of the present application can perform depth estimation from a single frame of the image to be estimated; there is no need for the monocular camera to be in motion or for the real-world target object to be static, so the method can be applied to depth estimation of target objects in general or common real-life scenes.
  • the monocular depth estimation method of the embodiment of the present application can be applied to perform depth estimation on images to be estimated collected by cameras of different camera models, so as to realize generalized perception of images to be estimated from different cameras.
  • The following gives an exemplary explanation of why the depth value can be determined from the DSN vector of a pixel and the camera filter map vector of that pixel.
  • a 3D point of the target object in the scene is captured by a camera with position coordinates at (0, 0, 0) and stored in a 2D pixel at pixel i in the image plane.
  • P = (X, Y, Z) represents the 3D point, where (X, Y, Z) are the position coordinates of the 3D point in space (also called three-dimensional space coordinates); i = (u, v), where (u, v) are the position coordinates of pixel i.
  • the position coordinates of pixel i are measured from the upper left corner of the image plane.
  • the geometric correspondence between the position coordinates of the 3D points in the scene in space and the position coordinates of the pixel point i is given by the following formula 1:
  • the camera calibration parameters may include the center coordinates and focal length of the camera of the perspective (pinhole) camera model, (c x , c y ) are the center coordinates, and (f x , f y ) are the focal lengths.
  • The 3D point can also be modeled and represented by a unit surface normal, where (n_x, n_y, n_z) represents the unit surface normal of the 3D point of the target object in the scene.
  • the surface in the unit surface normal refers to the plane formed by a 3D point and its adjacent coplanar 3D points (which can be extended to an infinite plane).
  • the representation of this unit vector is independent of the camera model used.
  • the distance from the camera to the extended plane of the 3D point can be defined as h. Then the distance h can be calculated geometrically by the scalar product of the unit surface normal and the three-dimensional space coordinate:
  • By using the geometric relationship of the perspective camera model (Equation 8) to replace the three-dimensional space coordinates of the 3D point in Equation 9, one obtains:
  • The embodiment of the present application proposes a new 3D structure representation method, which decomposes the inverse depth of a 3D point in the scene (i.e., the inverse of the depth) into a DSN vector and a camera filter map vector, where N represents the DSN vector of the 3D point and F represents the camera filter map vector of pixel i. See Equation 11 to Equation 15 below.
  • ω represents the inverse depth of the 3D point.
  • the inverse depth can be determined according to the 3D point structure representation provided by the embodiment of the present application, that is, the inverse depth value can be determined by the DSN vector and the camera filter map vector, and is expressed as:
  • the DSN vector of the 3D point is:
  • the camera filter map vector of the corresponding pixel i is:
  • F_u is the first camera filter map component, and F_v is the second camera filter map component.
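  • One consistent reading of Equations 11 to 15 (reconstructed here for clarity from the pinhole relations and the definition of h above; the original equations are not reproduced) is obtained by substituting X = Z(u - c_x)/f_x and Y = Z(v - c_y)/f_y into h = n_x X + n_y Y + n_z Z and dividing by hZ:

$$\omega = \frac{1}{Z} = \underbrace{\left(\frac{n_x}{h},\; \frac{n_y}{h},\; \frac{n_z}{h}\right)}_{N} \cdot \underbrace{\left(\frac{u-c_x}{f_x},\; \frac{v-c_y}{f_y},\; 1\right)}_{F},$$

so the DSN vector N is the unit surface normal scaled by the inverse of the plane distance h, while F depends only on the pixel position and the camera calibration parameters.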
  • the camera model used may not be limited to the camera of the above-mentioned perspective camera model, but may also be a camera of a non-perspective camera model, for example, a panoramic camera, a 360-degree spherical camera , catadioptric camera, or fisheye camera, etc.
  • the 3D structure representation of the embodiments of the present application can perform geometric correspondence adaptation between the 3D structure of the scene and the non-perspective camera model.
  • the camera filter map vector (calculated according to Equation 15) needs to be updated for different types of cameras.
  • the camera may be a 360-degree panoramic camera.
  • the geometric correspondence between 3D point P and pixel i in the scene can be given by:
  • As in Equation 9, the inverse depth of a 3D point in the scene can be decomposed into a DSN vector and a camera filter map vector, where N represents the DSN vector and F represents the camera filter map vector.
  • ω represents the inverse depth of the 3D point.
  • the DSN vector corresponding to the 3D point can be calculated by Equation 14.
  • The above provides the theoretical basis for the monocular depth estimation method of the embodiment shown in FIG. 5 or FIG. 6. Accordingly, the monocular depth estimation method of the embodiment of the present application can output the first DSN vector corresponding to each pixel of the image to be estimated through the first neural network model, obtain the first camera filter map vector corresponding to each pixel according to Formula 3, and then obtain the depth value corresponding to each pixel according to Formula 1 and Formula 2, thereby realizing monocular depth estimation and improving the accuracy of depth estimation.
  • FIG. 8 is a schematic diagram of a monocular depth estimation processing process according to an embodiment of the present application.
  • the first neural network model is a convolutional neural network as an example for schematic illustration.
  • the monocular depth estimation method can include: inputting the image data L301 into the convolutional neural network L302, for example, the above-mentioned to-be-estimated image can be used as the image data L301.
  • the convolutional neural network L302 outputs the DSN map L303.
  • a channel of a pixel in a DSN map is used to store a component of the DSN vector.
  • the convolutional neural network may be based on the encoder-decoder architecture of ResNet-18.
  • the convolutional neural network can be trained by the training method shown in Figure 9 below.
  • Taking F_i = (F_ui, F_vi) as an example, F_i is the camera filter map vector corresponding to pixel i.
  • the camera filter map L304 only needs to be calculated once.
  • the camera filter map L304 can be used to filter the DSN map L303.
  • the inverse depth map L306 is obtained by calculation, and then the depth map is obtained based on the inverse depth map L306.
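  • To make the data flow of FIG. 8 concrete, the following is a minimal sketch (not the actual implementation) of the inference path, assuming the DSN map has shape (H, W, 3) and the camera filter map has been precomputed once (for example with the hypothetical helper sketched earlier); the network is represented by any callable that maps an image to a DSN map.

```python
import numpy as np

def estimate_depth(image, dsn_network, filter_map, eps=1e-6):
    """Sketch of the FIG. 8 pipeline: image -> DSN map (L303), combined with the
    precomputed camera filter map (L304) to give the inverse depth map (L306)
    and finally the depth map."""
    dsn_map = dsn_network(image)                           # (H, W, 3) DSN vectors
    inverse_depth = np.sum(dsn_map * filter_map, axis=-1)  # per-pixel dot product N_i . F_i
    return 1.0 / np.clip(inverse_depth, eps, None)         # depth = 1 / inverse depth
```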
  • FIG. 9 is a flowchart of a training method of a first neural network model according to an embodiment of the present application.
  • the first neural network model may also be called a monocular depth estimation model.
  • the method of this embodiment may include:
  • Step 301 Acquire a training image and a second DSN map corresponding to the training image.
  • the training image can be an image captured by any camera, such as an RGB camera, a grayscale camera, a night vision camera, a thermal camera, a panoramic camera, an event camera, or an infrared camera.
  • training images can be derived from the database shown in Figure 1.
  • The database may store multiple training images and the second DSN map corresponding to each training image, or store multiple training images and the second depth map corresponding to each training image, or store multiple training images together with both the corresponding second DSN map and second depth map.
  • the training data in the database can be obtained in the following ways.
  • the plurality of training images are images captured by a plurality of calibrated and synchronized cameras of the scene.
  • By applying 3D vision technology to these images, a second depth map corresponding to the multiple training images can be obtained.
  • the 3D vision technology may be a structure from motion (sfm) restoration technology, a multi-view 3D reconstruction technology, or a view synthesis technology.
  • different types of cameras include depth cameras and thermal cameras.
  • the depth cameras and thermal cameras are mounted in a frame and calibrated.
  • the depth map captured by the depth camera can be directly aligned with the image obtained by the thermal camera.
  • The depth map obtained by the depth camera can be used as the second depth map corresponding to the training image, and the image obtained by the thermal camera can be used as the training image.
  • an original image is obtained.
  • the original image can be an image captured by any of the above-mentioned types of cameras, and data optimization or data enhancement is performed on the original image to obtain a training image.
  • the data optimization includes at least one of hole filling optimization, sharpening occlusion edge optimization, or temporal consistency optimization.
  • One example is to optimize the original image through an image processing filter to obtain a training image.
  • the image processing filter may be bilateral filtering, or guided image filtering, or the like.
  • Another example is to optimize the original image by using the temporal information or consistency between frames of the video to obtain the training image. For example, add temporal constraints via optical flow methods.
  • Data enhancement may also use geometric information or semantic segmentation information, which can be calculated by convolutional neural networks, such as a segmentation convolutional neural network for surface normal vectors or additional surface categories, or convolutional neural networks for segmenting different categories such as people, vegetation, sky, cars, etc.
  • This data augmentation is used to change the environmental conditions of the scene captured by the original image to obtain training images under different environmental conditions.
  • the environmental conditions may include lighting conditions, weather conditions, visibility conditions, and the like.
  • the second depth map corresponding to the training image may be a depth map output by the teacher's monocular depth estimation network after processing the input training image.
  • the teacher monocular depth estimation network here can be a trained MDE convolutional neural network.
  • the second DSN map may be determined by using the training image, the camera calibration parameters corresponding to the training image, and the second depth map.
  • the second DSN map corresponding to the training image can be obtained by calculating in the following two ways.
  • the second DSN map includes the DSN vector of the plane where the 3D point in the scene corresponding to the pixel point of the training image is located.
  • a possible implementation is to calculate the unit surface normal corresponding to the pixel point i. (n xi , n yi , n zi ) is the unit surface normal corresponding to pixel i. In some embodiments, it can be calculated using the vector cross product of adjacent pixels of pixel i.
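  • The cross-product computation mentioned above could look like the following sketch, which back-projects pixels to 3D points using the second depth map and the camera calibration parameters; this is one common approach consistent with the description, not necessarily the exact method of the embodiment (border pixels affected by the wrap-around of np.roll are only approximate, and the sign convention may need flipping so that normals face the camera).

```python
import numpy as np

def unit_surface_normals(depth, cx, cy, fx, fy):
    """Sketch: per-pixel unit surface normals (n_x, n_y, n_z) from a depth map
    via the cross product of vectors to adjacent back-projected 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float32),
                       np.arange(h, dtype=np.float32))
    # back-project every pixel to a 3D point P = (X, Y, Z)
    pts = np.stack([(u - cx) / fx * depth, (v - cy) / fy * depth, depth], axis=-1)
    dx = np.roll(pts, -1, axis=1) - pts      # vector towards the right neighbour
    dy = np.roll(pts, -1, axis=0) - pts      # vector towards the lower neighbour
    n = np.cross(dx, dy)                     # plane normal from the cross product
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8)
```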
  • the camera calibration parameters may be camera calibration parameters of the camera that shoots the training image.
  • the distance from the plane where the 3D point corresponding to the pixel point i is located to the camera can be calculated by the following formula 19.
  • hi is the distance from the plane where the 3D point corresponding to pixel i is located to the camera
  • the camera calibration parameters include the center coordinates and focal length of the camera
  • (c x , c y ) are the center coordinates of the camera
  • (f x , f y ) is the focal length of the camera
  • (u, v) is the position coordinate of pixel i
  • Z is the depth value corresponding to pixel i.
  • the DSN vector of the plane where the 3D point corresponding to pixel i is located is calculated.
  • the DSN vector of the plane where the 3D point corresponding to the pixel point i is located can be calculated by the following formula 20.
  • N i is the DSN vector of the plane where the 3D point corresponding to the pixel point i is located.
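  • Under the pinhole assumptions above, a plausible reconstruction of Formula 19 and Formula 20 (the originals are not reproduced in this text) is:

$$h_i = Z\left(n_{x,i}\,\frac{u-c_x}{f_x} + n_{y,i}\,\frac{v-c_y}{f_y} + n_{z,i}\right), \qquad N_i = \left(\frac{n_{x,i}}{h_i},\; \frac{n_{y,i}}{h_i},\; \frac{n_{z,i}}{h_i}\right),$$

i.e., the plane distance follows from the depth, the unit surface normal and the calibration parameters, and the DSN vector is the unit surface normal divided by that distance.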
  • the DSN map can be calculated from the inverse depth map.
  • the calculation method can be as follows: by calculating the image gradient at the pixel point i in the inverse depth map, the DSN vector corresponding to the pixel point i is obtained, that is, the DSN vector of the plane where the 3D point corresponding to the pixel point i is located.
  • the DSN vector of the plane where the 3D point corresponding to the pixel point i is located is calculated according to the following formula 31.
  • N i is the DSN vector of the plane where the 3D point corresponding to the pixel i is located
  • N_i = (N_xi, N_yi, N_zi)
  • ω_i is the inverse depth value of the 3D point in the scene corresponding to pixel i
  • (c x , c y ) is the center coordinate of the camera
  • (f x , f y ) is the focal length of the camera
  • (u, v) is the position coordinate of the pixel i.
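  • One plausible reading of Formula 31 (offered only as a reconstruction, under the assumption that neighbouring pixels lie on the same plane so that ω(u, v) = N_i · F(u, v) holds locally) is:

$$N_{x,i} = f_x\,\frac{\partial \omega}{\partial u}, \qquad N_{y,i} = f_y\,\frac{\partial \omega}{\partial v}, \qquad N_{z,i} = \omega_i - (u-c_x)\,\frac{\partial \omega}{\partial u} - (v-c_y)\,\frac{\partial \omega}{\partial v},$$

where the partial derivatives are the image gradients of the inverse depth map at pixel i.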
  • Step 302 using the training image and the second DSN map to train the initial neural network model to obtain the first neural network model.
  • the training image may be input into the initial neural network model to obtain a third DSN map.
  • the second camera filter map is determined according to the camera calibration parameters corresponding to the training image and the training image.
  • A third depth map is obtained according to the second camera filter map and the third DSN map. According to the difference between the third DSN map and the second DSN map corresponding to the training image, or the difference between the third depth map and the second depth map, or the degree of matching between the third depth map and the second depth map, the parameters of the initial neural network model are adjusted; the above process is repeated until the training ends, and the above-mentioned first neural network model is obtained.
  • the parameters of the neural network model can be adjusted according to the loss function until the first neural network model that satisfies the training objective is obtained.
  • the loss function may include at least one of a first loss function, a second loss function, or a third loss function.
  • the first loss function is used to represent the error between the second DSN map and the third DSN map
  • the second loss function is used to represent the error between the second depth map and the third depth map.
  • the third loss function is used to represent the matching degree between the second depth map and the third depth map.
  • a training image L501 is acquired from the database L400.
  • the training image L501 is input into the convolutional neural network L502, and the convolutional neural network L502 outputs the third DSN map L503.
  • a second camera filter map L504 is obtained according to the camera calibration parameters corresponding to the training image L501.
  • the third DSN map L503 and the second camera filter map L504 are provided to a filter L505 which outputs an inverse depth map L506.
  • the above-mentioned third depth map may be obtained based on the inverse depth map L506.
  • the third DSN map L503, the inverse depth map L506, the real DSN map L508, and the real inverse depth map L509 are provided to the loss function L507 to determine the loss function value, and adjust the convolutional neural network L502 according to the loss function value.
  • the real DSN map L508 is the second DSN map corresponding to the training image.
  • the real inverse depth map L509 can be obtained from the second depth map corresponding to the training image.
  • the loss function (L507) is defined as follows.
  • λ_DEP, λ_DSN and λ_INP are hyperparameters used to weight the depth loss function, the DSN loss function and the patch loss function, respectively; each of λ_DEP, λ_DSN and λ_INP is greater than or equal to 0. For example, assigning zero to a hyperparameter cancels the effect of its corresponding loss function on network training.
  • The norm used in the loss functions can be the L1 norm, the L2 norm, etc.
  • the inverse depth value in the depth loss function may also be replaced by the depth value as appropriate.
  • The DSN loss function is the loss function used to compute the error between the true DSN vector (L508) and the estimated DSN vector (L503).
  • The calculation of the loss function is carried out over all valid pixels.
  • the loss function can be calculated through the above formula 22 to formula 25 to adjust the network parameters to obtain the first neural network model.
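  • Based on the hyperparameters described above, the total training loss plausibly takes the following weighted form (a reconstruction; the exact Formulas 22 to 25 are not reproduced here, and ‖·‖ denotes the chosen L1 or L2 norm summed over all valid pixels):

$$\mathcal{L} = \lambda_{DEP}\sum_{i}\left\lVert \hat{\omega}_i - \omega_i \right\rVert + \lambda_{DSN}\sum_{i}\left\lVert \hat{N}_i - N_i \right\rVert + \lambda_{INP}\,\mathcal{L}_{INP},$$

where \hat{ω}_i and \hat{N}_i are the estimated inverse depth and DSN vector, ω_i and N_i are the corresponding ground-truth values, and \mathcal{L}_{INP} denotes the patch loss term.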
  • the estimated DSN map is obtained by inputting the training image into the convolutional neural network.
  • the estimated inverse depth map is obtained from the estimated DSN map and the camera filter map corresponding to the training image.
  • the loss function value is calculated based on the estimated DSN map, the true DSN map, the estimated inverse depth map, and the true inverse depth map.
  • the parameters of the convolutional neural network are adjusted according to the value of the loss function, and the above steps are repeated to obtain the first neural network model by training.
  • The first neural network model can learn the mapping relationship between the input image and the DSN map, so that the first neural network model can output the DSN map corresponding to the input image; using this DSN map and the camera filter map corresponding to the input image, the depth map corresponding to the input image can be obtained, thereby achieving monocular depth estimation.
  • The DSN map is related to the 3D structure of the target object corresponding to the input image and is not affected by the camera model, so that in the model application stage, even if the camera model corresponding to the image to be estimated is different from the camera model corresponding to the training image, the DSN map output by the first neural network model for the image to be estimated can still accurately represent the 3D structure of the target object corresponding to the image to be estimated.
  • a depth map is obtained based on the DSN map corresponding to the image to be estimated and the camera filter map related to the camera model. The depth map can more accurately represent the distance of the target object in space, thereby improving the accuracy of monocular depth estimation.
  • The trained first neural network model can perceive an image captured by an arbitrary camera, i.e., it provides generalizable perception with respect to the camera, and a depth map is estimated based on the output of the first neural network model.
  • the first neural network model trained through the above steps can be configured in the electronic device, so that the electronic device can realize relatively accurate monocular depth estimation.
  • the first neural network model can also be configured in the server, so that the server can process the image to be estimated sent by the electronic device and return the DSN map, and then the electronic device can obtain the depth map based on the DSN map to achieve a relatively accurate monocular depth. estimate.
  • the first neural network model may be a software function module or a solidified hardware circuit, for example, the hardware circuit may be an arithmetic circuit, etc.
  • the specific form of the first neural network model is not specifically limited in this embodiment of the present application.
  • the following method may also be used to train the neural network model to obtain the first neural network model.
  • FIG. 11 is a schematic diagram of a training process of a first neural network model according to an embodiment of the present application.
  • The monocular depth estimation module L515 (i.e., including the part involved in the monocular depth estimation process of the embodiment shown in FIG. 8) estimates depth by learning from the monocular depth estimation teacher network L511.
  • the teacher network is used to synthesize paired input and output ground truth data through inverse rendering.
  • the monocular depth estimation module L515 is trained using the input and output ground truth data. That is, the neural network model in the monocular depth estimation module L515 is trained to obtain the first neural network model.
  • a trained monocular depth estimation teacher network L511 can be used as the MDE processor.
  • the teacher network L511 can estimate the depth map corresponding to the generated input image.
  • the input images and estimated depth maps of the teacher network can be used as ground truth data to augment the database L400.
  • By inputting a noise image L510, the teacher network L511 can estimate a synthetic depth map L512 corresponding to the noise. Then, by using the encoder of the teacher network L511 and the noise image L510, inverse rendering can be used to synthesize image data L513 as training input.
  • the inverse rendering synthetic image data L513 and the synthetic depth map L512 generated by the teacher network synthesis may be added to the database L400 as a set of paired ground truth data L514.
  • the monocular depth estimation module L515 is set as a student network, and uses the inverse rendering synthetic image data L513 generated by the teacher network as input, and the synthetic depth map L512 as ground truth data for training.
  • the monocular depth estimation module L515 processes the inversely rendered synthetic image data L513 and outputs an estimated depth map L516.
  • the estimated depth map L516 and the synthetic depth map L512 are provided to the training loss function L517.
  • the training loss function L517 can adopt the loss function in the embodiment shown in FIG. 9 above, and of course other forms of loss functions can also be used.
  • This is not specifically limited in the embodiments of the present application. The neural network model in the monocular depth estimation module L515 is adjusted according to the training loss function L517 to obtain the first neural network model.
  • The monocular depth estimation module L515 can be trained by learning from other MDE processors (for example, off-the-shelf MDE software or MDE networks that have already been trained), without directly accessing the raw data used to train these MDE processors/networks. This approach is achieved through a cutting-edge knowledge distillation algorithm and has higher training efficiency.
  • FIG. 12 is a schematic diagram of a training process of a first neural network model according to an embodiment of the present application.
  • the monocular depth estimation module L523 (the same module as L515 above) and the MDE processor L521 use noise as input at the same time.
  • a trained monocular depth estimation teacher network L521 can be used as an MDE processor for augmenting the database L400.
  • the teacher network L521 can estimate the depth map corresponding to the input image.
  • the monocular depth estimation module L523 is set as a student network.
  • a noisy image can be simultaneously input to the monocular depth estimation module L523 and the MDE processor L521.
  • the monocular depth estimation module L523 processes the noisy image and outputs the depth map L524 estimated by the student network.
  • the monocular depth estimation teacher network L521 processes the noisy image and outputs the depth map L522 estimated by the teacher network.
  • the depth map L524 estimated by the student network and the depth map L522 estimated by the teacher network are provided to the training loss function L525.
  • The monocular depth estimation module L523 is adjusted to reduce the loss error between the depth map L524 estimated by the student network and the depth map L522 estimated by the teacher network, thereby completing the training of the student network.
  • The monocular depth estimation module L523 can be trained by learning from other MDE processors (for example, off-the-shelf MDE software or MDE networks that have already been trained), without directly accessing the raw data used to train these MDE processors/networks.
  • This approach is achieved through a cutting-edge knowledge distillation algorithm. This method has higher training efficiency.
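  • As an illustration of the FIG. 12 setup, the following is a minimal, hypothetical training-loop sketch in which a student network is trained to match a frozen teacher MDE processor on randomly generated noise images; all names, the L1 loss and the optimizer choice are assumptions rather than the embodiment's implementation.

```python
import torch

def distill_student(student, teacher, steps=1000, lr=1e-4,
                    image_shape=(1, 3, 192, 640), device="cpu"):
    """Sketch: train the student (module L523) to match a frozen teacher (L521)
    on noise inputs, as in FIG. 12; the loss stands in for L525."""
    teacher.eval()
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(steps):
        noise = torch.rand(image_shape, device=device)      # noise image input
        with torch.no_grad():
            teacher_depth = teacher(noise)                   # depth map L522
        student_depth = student(noise)                       # depth map L524
        loss = torch.nn.functional.l1_loss(student_depth, teacher_depth)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return student
```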
  • the embodiments of the present application further provide a monocular depth estimation apparatus, which is used for performing the method steps in the above method embodiments.
  • the monocular depth estimation apparatus may include: an acquisition module 91 , a DSN module 92 , a camera filter mapping module 93 and a depth estimation module 94 .
  • the obtaining module 91 is configured to obtain the to-be-estimated image and the first parameter corresponding to the to-be-estimated image, where the first parameter is a camera calibration parameter of the camera that shoots the to-be-estimated image.
  • the distance scaled normal DSN module 92 is used to input the image to be estimated into the first neural network model, and obtain the first distance scaled normal DSN map output by the first neural network model, and the first DSN map is used for Indicates the orientation of the plane of the target object corresponding to the image to be estimated and the distance between the plane and the camera.
  • the camera filter mapping module 93 is configured to determine a first camera filter map according to the to-be-estimated image and the first parameter, where the first camera filter map is used to represent the 3D point of the target object in space The mapping relationship with the 2D plane, where the 2D plane is the imaging plane of the camera.
  • the depth estimation module 94 is configured to determine a first depth map corresponding to the image to be estimated according to the first DSN map and the first camera filter map.
  • The first neural network model is obtained by training using a training image and a second DSN map corresponding to the training image, and the second DSN map is determined according to a second depth map corresponding to the training image and the camera calibration parameters corresponding to the training image.
  • the training image is used as the input of the initial neural network model
  • the loss function includes at least one of a first loss function, a second loss function or a third loss function
  • the loss function is used to adjust the parameters of the initial neural network model, and the first neural network model is obtained by training.
  • the first loss function is used to represent the error between the second DSN map and the third DSN map
  • the third DSN map is the DSN map corresponding to the training image output by the initial neural network model.
  • the second loss function is used to represent the error between the second depth map and the third depth map
  • the third depth map is determined according to the third DSN map and the second camera filter map
  • the second camera filter map is determined according to the training image and the camera calibration parameters corresponding to the training image
  • the third loss function is used to represent the matching degree of the second depth map and the third depth map.
  • The camera filter mapping module 93 is configured to determine the first camera filter map according to the position coordinates of the pixels of the to-be-estimated image and the first parameter. The first camera filter map includes a camera filter map vector corresponding to each pixel; the camera filter map vector is used to represent the mapping relationship between the 3D point and the pixel, and the pixel is the point where the 3D point is projected onto the 2D plane.
  • the position coordinates of the pixel include abscissa and ordinate
  • the camera filter map vector corresponding to the pixel includes a first camera filter map component and a second camera filter map component; the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the ordinate and the first parameter, or according to the abscissa, the ordinate and the first parameter.
  • When the field of view of the camera that captures the image to be estimated is less than 180 degrees, the first parameter includes the center coordinates (c_x, c_y) and the focal length (f_x, f_y) of the camera; the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the ordinate and the first parameter.
  • When the field of view of the camera that captures the image to be estimated is not less than 180 degrees, the first parameter includes a width pixel value W and a height pixel value H of the to-be-estimated image; the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the abscissa, the ordinate and the first parameter.
  • the first DSN map includes a first DSN vector corresponding to a pixel of the image to be estimated
  • the depth estimation module 94 is configured to determine the depth value corresponding to the pixel according to the first DSN vector corresponding to the pixel and the camera filter map vector corresponding to the pixel.
  • the first depth map includes depth values corresponding to the pixels.
  • the monocular depth estimation apparatus provided in the embodiment of the present application can be used to execute the above-mentioned monocular depth estimation method, and the content and effect thereof may refer to the method section, which will not be repeated in this embodiment of the present application.
  • the electronic device may include: an image collector 1001 configured to acquire an image to be estimated and a first parameter corresponding to the image to be estimated; one or more processors 1002 ; a memory 1003 ;
  • the various devices described above may be connected by one or more communication buses 1005 .
  • The above-mentioned memory 1003 stores one or more computer programs 1004; the one or more processors 1002 are used to execute the one or more computer programs 1004, and the one or more computer programs 1004 include instructions that can be used to execute the various steps in the above method embodiments.
  • processors 1002 are used to execute one or more computer programs 1004 to perform the following actions:
  • the to-be-estimated image and a first parameter corresponding to the to-be-estimated image are acquired, where the first parameter is a camera calibration parameter of a camera that captures the to-be-estimated image.
  • The image to be estimated is input into the first neural network model to obtain the first DSN map output by the first neural network model; a first camera filter map is determined according to the image to be estimated and the first parameter, and the first camera filter map is used to represent the mapping relationship between the 3D points of the target object in space and the 2D plane.
  • the 2D plane is the imaging plane of the camera.
  • a first depth map corresponding to the image to be estimated is determined according to the first DSN map and the first camera filter map.
  • The first neural network model is obtained by training using a training image and a second DSN map corresponding to the training image, and the second DSN map is determined according to a second depth map corresponding to the training image and the camera calibration parameters corresponding to the training image.
  • the training image is used as the input of the initial neural network model
  • the loss function includes at least one of a first loss function, a second loss function or a third loss function
  • the loss function is used to adjust the parameters of the initial neural network model, and the first neural network model is obtained by training.
  • the first loss function is used to represent the error between the second DSN map and the third DSN map
  • the third DSN map is the DSN map corresponding to the training image output by the initial neural network model.
  • the second loss function is used to represent the error between the second depth map and the third depth map
  • the third depth map is determined according to the third DSN map and the second camera filter map
  • the second camera filter map is determined according to the training image and the camera calibration parameters corresponding to the training image
  • the third loss function is used to represent the matching degree of the second depth map and the third depth map.
  • The first camera filter map is determined according to the position coordinates of the pixels of the to-be-estimated image and the first parameter. The first camera filter map includes a camera filter map vector corresponding to each pixel; the camera filter map vector is used to represent the mapping relationship between the 3D point and the pixel, and the pixel is the point where the 3D point is projected onto the 2D plane.
  • the position coordinates of the pixel include abscissa and ordinate
  • the camera filter map vector corresponding to the pixel includes a first camera filter map component and a second camera filter map component; the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the ordinate and the first parameter, or according to the abscissa, the ordinate and the first parameter.
  • When the field of view of the camera that captures the image to be estimated is less than 180 degrees, the first parameter includes the center coordinates (c_x, c_y) and the focal length (f_x, f_y) of the camera; the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the ordinate and the first parameter.
  • When the field of view of the camera that captures the image to be estimated is not less than 180 degrees, the first parameter includes a width pixel value W and a height pixel value H of the to-be-estimated image; the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the abscissa, the ordinate and the first parameter.
  • The first DSN map includes a first DSN vector corresponding to each pixel of the image to be estimated; the depth value corresponding to a pixel is determined according to the first DSN vector corresponding to the pixel and the camera filter map vector corresponding to the pixel.
  • the first depth map includes depth values corresponding to the pixels.
  • the electronic device shown in FIG. 14 may also include other devices such as an audio module and a SIM card interface, which are not limited in this embodiment of the present application.
  • the embodiment of the present application further provides a monocular depth estimation apparatus.
  • The monocular depth estimation apparatus includes a processor 1101 and a transmission interface 1102, and the transmission interface 1102 is used to obtain the image to be estimated and the first parameter corresponding to the image to be estimated.
  • the transmission interface 1102 may include a sending interface and a receiving interface.
  • The transmission interface 1102 may be any type of interface according to any proprietary or standardized interface protocol, such as a high definition multimedia interface (HDMI), a Mobile Industry Processor Interface (MIPI), the Display Serial Interface (DSI) standardized by MIPI, the Embedded Display Port (eDP) standardized by the Video Electronics Standards Association (VESA), the Display Port (DP) or the V-By-One interface (a digital interface standard developed for image transmission), as well as various wired or wireless interfaces, optical interfaces, etc.
  • the processor 1101 is configured to call the program instructions stored in the memory to execute the monocular depth estimation method according to the above method embodiment.
  • the apparatus further includes a memory 1103 .
  • the processor 1101 may be a single-core processor or a multi-core processor group
  • the transmission interface 1102 is an interface for receiving or sending data
  • the data processed by the monocular depth estimation apparatus may include video data or image data.
  • the monocular depth estimation apparatus may be a processor chip.
  • Embodiments of the present application further provide a computer storage medium. The computer storage medium may include computer instructions which, when executed on the electronic device, cause the electronic device to perform the steps of the above method embodiments.
  • Embodiments of the present application further provide a computer program product which, when run on a computer, causes the computer to execute the steps of the foregoing method embodiments.
  • the processor mentioned in the above embodiments may be an integrated circuit chip, which has signal processing capability.
  • each step of the above method embodiments may be completed by a hardware integrated logic circuit in a processor or an instruction in the form of software.
  • The processor can be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the methods disclosed in the embodiments of the present application may be directly embodied as executed by a hardware coding processor, or executed by a combination of hardware and software modules in the coding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory, and the processor reads the information in the memory, and completes the steps of the above method in combination with its hardware.
  • the memory mentioned in the above embodiments may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM) or flash memory.
  • Volatile memory may be random access memory (RAM), which acts as an external cache. By way of example, many forms of RAM are available, such as dynamic random access memory (DRAM), synchronous DRAM (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM) and direct rambus RAM (DR RAM).
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • Multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • Each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • The technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product.
  • The computer software product is stored in a storage medium and includes several instructions used to cause a computer device (a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or another medium that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a monocular depth estimation method, apparatus, and device. The monocular depth estimation method may include: acquiring an image to be estimated and camera calibration parameters; obtaining a first DSN map corresponding to the image to be estimated by means of a first neural network model; determining a first camera filtering map according to the camera calibration parameters; and then obtaining a depth map on the basis of the first DSN map and the first camera filtering map. The monocular depth estimation method is applicable to depth estimation of a target object in a common or general scene in real life and offers good depth estimation accuracy.
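
To make the data flow described in the abstract easier to follow, the short Python sketch below mirrors its steps. The function and parameter names (estimate_depth, calib, dsn_model), the dictionary layout of the calibration parameters, and in particular the way the first DSN map and the first camera filtering map are combined into a depth map are illustrative assumptions added for readability; they are not the constructions defined in the claims or the description.

```python
import numpy as np

def estimate_depth(image, calib, dsn_model):
    """Illustrative monocular depth estimation pipeline (assumed interfaces).

    image     : H x W x C array from a single camera (the image to be estimated)
    calib     : camera calibration parameters, assumed here to be a dict
                with focal lengths "fx" and "fy" in pixels
    dsn_model : the first neural network model, assumed to be a callable
                returning an H x W DSN map for the input image
    """
    # Obtain the first DSN map corresponding to the image to be estimated
    # by means of the first neural network model.
    dsn_map = dsn_model(image)

    # Determine the first camera filtering map according to the camera
    # calibration parameters. A constant per-pixel scale built from the
    # focal lengths is used here purely as a placeholder.
    h, w = dsn_map.shape[:2]
    camera_filter_map = np.full((h, w), (calib["fx"] + calib["fy"]) / 2.0)

    # Obtain the depth map based on the first DSN map and the first camera
    # filtering map (placeholder combination: element-wise product).
    return dsn_map * camera_filter_map
```
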
PCT/CN2021/075318 2021-02-04 2021-02-04 Procédé, appareil et dispositif d'estimation de profondeur monoculaire WO2022165722A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/075318 WO2022165722A1 (fr) 2021-02-04 2021-02-04 Procédé, appareil et dispositif d'estimation de profondeur monoculaire

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/075318 WO2022165722A1 (fr) 2021-02-04 2021-02-04 Procédé, appareil et dispositif d'estimation de profondeur monoculaire

Publications (1)

Publication Number Publication Date
WO2022165722A1 (fr) 2022-08-11

Family

ID=82740792

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/075318 WO2022165722A1 (fr) 2021-02-04 2021-02-04 Procédé, appareil et dispositif d'estimation de profondeur monoculaire

Country Status (1)

Country Link
WO (1) WO2022165722A1 (fr)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108615244A (zh) * 2018-03-27 2018-10-02 中国地质大学(武汉) 一种基于cnn和深度滤波器的图像深度估计方法及系统
CN108765481A (zh) * 2018-05-25 2018-11-06 亮风台(上海)信息科技有限公司 一种单目视频的深度估计方法、装置、终端和存储介质
US20210004646A1 (en) * 2019-07-06 2021-01-07 Toyota Research Institute, Inc. Systems and methods for weakly supervised training of a model for monocular depth estimation
CN110503680A (zh) * 2019-08-29 2019-11-26 大连海事大学 一种基于非监督的卷积神经网络单目场景深度估计方法
CN110738697A (zh) * 2019-10-10 2020-01-31 福州大学 基于深度学习的单目深度估计方法

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115731365A (zh) * 2022-11-22 2023-03-03 广州极点三维信息科技有限公司 基于二维图像的网格模型重建方法、系统、装置及介质
CN115965758A (zh) * 2022-12-28 2023-04-14 无锡东如科技有限公司 一种图协同单目实例三维重建方法
CN115965758B (zh) * 2022-12-28 2023-07-28 无锡东如科技有限公司 一种图协同单目实例三维重建方法
CN116993679A (zh) * 2023-06-30 2023-11-03 芜湖合德传动科技有限公司 一种基于目标检测的伸缩机皮带磨损检测方法
CN116993679B (zh) * 2023-06-30 2024-04-30 芜湖合德传动科技有限公司 一种基于目标检测的伸缩机皮带磨损检测方法

Similar Documents

Publication Publication Date Title
WO2020177651A1 (fr) Procédé de segmentation d'image et dispositif de traitement d'image
WO2021043168A1 (fr) Procédé d'entraînement de réseau de ré-identification de personnes et procédé et appareil de ré-identification de personnes
WO2021043273A1 (fr) Procédé et appareil d'amélioration d'image
US11232286B2 (en) Method and apparatus for generating face rotation image
WO2022165722A1 (fr) Procédé, appareil et dispositif d'estimation de profondeur monoculaire
WO2021164731A1 (fr) Procédé d'amélioration d'image et appareil d'amélioration d'image
US20210398252A1 (en) Image denoising method and apparatus
CN110222717B (zh) 图像处理方法和装置
WO2022042049A1 (fr) Procédé de fusion d'images, procédé et appareil de formation d'un modèle de fusion d'images
WO2022001372A1 (fr) Procédé et appareil d'entraînement de réseau neuronal, et procédé et appareil de traitement d'image
WO2022179581A1 (fr) Procédé de traitement d'images et dispositif associé
WO2021063341A1 (fr) Procédé et appareil d'amélioration d'image
WO2022134971A1 (fr) Procédé de formation de modèle de réduction de bruit et appareil associé
WO2022100419A1 (fr) Procédé de traitement d'images et dispositif associé
US20220157046A1 (en) Image Classification Method And Apparatus
CN111797882A (zh) 图像分类方法及装置
CN113011562A (zh) 一种模型训练方法及装置
WO2023083030A1 (fr) Procédé de reconnaissance de posture et dispositif associé
CN110222718A (zh) 图像处理的方法及装置
WO2022052782A1 (fr) Procédé de traitement d'image et dispositif associé
CN113284055A (zh) 一种图像处理的方法以及装置
CN113066018A (zh) 一种图像增强方法及相关装置
CN115239581A (zh) 一种图像处理方法及相关装置
WO2022179606A1 (fr) Procédé de traitement d'image et appareil associé
CN112258565B (zh) 图像处理方法以及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21923747

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21923747

Country of ref document: EP

Kind code of ref document: A1